
Machine Learning From Scratch


This is my repository for learning Machine Learning from scratch. If you want to check out my repository on Deep Learning with TensorFlow, click here 👉: TensorFlow-Deep-Learning

Table of contents

| Number | Notebook | Description | Extras |
| ------ | -------- | ----------- | ------ |
| 00 | Basic ML Intuition | What is ML? Bias and variance | |
| 01 | Data Preprocess Template | Data preprocessing template | |
| 02 | Regression | Simple linear regression, multiple, polynomial, ... | |
| 03 | Classification | Logistic regression, KNN, SVM, ... | |
| 04 | Clustering | K-Means clustering, hierarchical clustering | |
| 05 | Association Rule Learning | Apriori, Eclat | |
| 06 | Reinforcement Learning | UCB, Thompson Sampling | |
| 07 | NLP | Introduction to NLP | |
| 08 | Dimensionality Reduction | PCA, Kernel PCA, LDA | |
| ## | Model Selection | Model selection: regression, classification | |
| ## | Case Study | Case study | |

Details

Basic Intuition

Math

Machine learning Fundamentals


Regression

| Number | Notebook | Extras |
| ------ | -------- | ------ |
| 01 | Simple Linear Regression | |
| 02 | Multiple Linear Regression | When to use multiple linear regression |
| 03 | Polynomial Regression | Polynomial Regression |
| 04 | Support Vector Regression | Introduction to SVR, kernels |
| 05 | Decision Tree Regression | Decision Tree Regression, Decision Tree ML |
| 06 | Random Forest Regression | Random Forest, Random Forest ML |

Regression: Pros and Cons

| Regression Model | Pros | Cons |
| ---------------- | ---- | ---- |
| Linear Regression | Works on any size of dataset, gives information about the relevance of features. | The linear regression assumptions. |
| Polynomial Regression | Works on any size of dataset, works very well on non-linear problems. | Need to choose the right polynomial degree for a good bias/variance trade-off. |
| SVR | Easily adaptable, works very well on non-linear problems, not biased by outliers. | Feature scaling is compulsory, not well documented, more difficult to understand. |
| Decision Tree Regression | Interpretability, no need for feature scaling, works on both linear and non-linear problems. | Poor results on very small datasets; overfitting can easily occur. |
| Random Forest Regression | Powerful and accurate, good performance on many problems, including non-linear ones. | Poor results on very small datasets; overfitting can easily occur. |
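
To make the trade-offs above concrete, here is a minimal sketch that cross-validates these regressors side by side with scikit-learn. The data is synthetic and the hyperparameters are illustrative, not the notebooks' exact code; note how SVR gets its compulsory feature scaling via a pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely for illustration
rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.1 * rng.standard_normal(200)

models = {
    "Linear Regression": LinearRegression(),
    # SVR needs feature scaling (see cons above), hence the pipeline
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```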

Classification

| Number | Notebook | Extras |
| ------ | -------- | ------ |
| 01 | Logistic Regression | StatQuest: Logistic Regression |
| 02 | K-Nearest Neighbours | StatQuest: KNN |
| 03 | Support Vector Machine | StatQuest: SVM |
| 04 | Kernel SVM | StatQuest: Polynomial Kernel, StatQuest: RBF Kernel |
| 05 | Naive Bayes | StatQuest: Naive Bayes, StatQuest: Gaussian Naive Bayes |
| 06 | Decision Tree | StatQuest: Decision Tree Regression |
| 07 | Random Forest Classification | StatQuest: Random Forest |

Classification: Pros and Cons

| Classification Model | Pros | Cons |
| -------------------- | ---- | ---- |
| Logistic Regression | Probabilistic approach, gives information about the statistical significance of features. | The logistic regression assumptions. |
| K-NN | Simple to understand, fast and efficient. | Need to choose the number of neighbours K. |
| SVM | Performant, not biased by outliers, not sensitive to overfitting. | Not appropriate for non-linear problems, not the best choice for a large number of features. |
| Kernel SVM | High performance on non-linear problems, not biased by outliers, not sensitive to overfitting. | Not the best choice for a large number of features, more complex. |
| Naive Bayes | Efficient, not biased by outliers, works on non-linear problems, probabilistic approach. | Based on the assumption that features have the same statistical relevance. |
| Decision Tree Classification | Interpretability, no need for feature scaling, works on both linear and non-linear problems. | Poor results on very small datasets; overfitting can easily occur. |
| Random Forest Classification | Powerful and accurate, good performance on many problems, including non-linear ones. | No interpretability, overfitting can easily occur, need to choose the number of trees. |
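
As with regression, the classifiers above can be compared in a few lines. This is a minimal sketch on synthetic data with illustrative settings; the distance- and margin-based models (K-NN, kernel SVM) are wrapped with scaling:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class dataset, purely for illustration
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # scale features for the distance/margin-based models
    "K-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Kernel SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: accuracy = {acc.mean():.3f}")
```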

Clustering

| Number | Notebook | Extras |
| ------ | -------- | ------ |
| 01 | K-Means | StatQuest: K-Means Clustering, WCSS and the elbow method |
| 02 | Hierarchical | StatQuest: Hierarchical Clustering, dendrogram method |

Clustering: Pros and Cons

| Clustering Model | Pros | Cons |
| ---------------- | ---- | ---- |
| K-Means | Simple to understand, easily adaptable, works well on small or large datasets, fast, efficient and performant. | Need to choose the number of clusters. |
| Hierarchical Clustering | The optimal number of clusters can be obtained by the model itself, practical visualization with the dendrogram. | Not appropriate for large datasets. |
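
Here is a minimal sketch of choosing K with the WCSS/elbow method mentioned above, plus a hierarchical fit, on made-up blob data:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Synthetic blobs with 4 true clusters, purely for illustration
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# K-Means: compute WCSS (inertia_) for k = 1..10 and look for the "elbow"
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 11)]
print([round(w) for w in wcss])  # the drop flattens around the true k

# Hierarchical clustering with Ward linkage; a dendrogram
# (e.g. via scipy.cluster.hierarchy) visualizes the merge order
labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)
print(labels[:10])
```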

Association Rule Learning

| Number | Notebook | Extras |
| ------ | -------- | ------ |
| 01 | Apriori | Apriori Algorithm |
| 02 | Eclat | |
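
scikit-learn does not implement Apriori or Eclat, so these notebooks rely on another implementation. Purely as an illustration, and assuming the third-party mlxtend package (not necessarily what the notebooks use), mining frequent itemsets looks like this:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.preprocessing import TransactionEncoder

# Toy market-basket transactions, made up for illustration
transactions = [
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
    ["milk", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets with support >= 0.5; rules can then be derived
# from these itemsets by thresholding confidence or lift
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
print(frequent)
```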

Reinforcement Learning

| Number | Notebook | Extras |
| ------ | -------- | ------ |
| 01 | Upper Confidence Bound | Confidence Bounds, UCB and the multi-armed bandit problem |
| 02 | Thompson Sampling | Thompson Sampling |

The Multi-Armed Bandit Problem
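
A minimal NumPy sketch of the UCB1 strategy on a Bernoulli multi-armed bandit; the arm reward probabilities are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.2, 0.5, 0.75])  # hypothetical reward rates per arm
n_arms, n_rounds = len(true_probs), 1000

counts = np.zeros(n_arms)  # times each arm was pulled
values = np.zeros(n_arms)  # running mean reward per arm

for t in range(1, n_rounds + 1):
    if t <= n_arms:
        arm = t - 1  # pull each arm once to initialize
    else:
        # exploit the mean reward, explore via the confidence bound
        ucb = values + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = rng.binomial(1, true_probs[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print("pulls per arm:", counts)  # most pulls should go to the best arm
```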


NLP

| Number | Notebook | Extras |
| ------ | -------- | ------ |
| 01 | Introduction to NLP | |
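
A typical intro-NLP setup is a bag-of-words representation feeding a Naive Bayes classifier. Here is a minimal sketch on a made-up toy corpus (not necessarily the notebook's dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: 1 = positive review, 0 = negative review
texts = [
    "great movie, loved it",
    "terrible plot, waste of time",
    "loved the acting, great cast",
    "boring and a waste of money",
]
labels = [1, 0, 1, 0]

# Bag-of-words features + Naive Bayes classifier in one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["loved the plot", "boring movie"]))
```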

Dimensionality Reduction

| Number | Notebook | Extras |
| ------ | -------- | ------ |
| 01 | Principal Component Analysis | setosa-PCA example, StatQuest-PCA, plotly-PCA visualization |
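
A minimal PCA sketch on scikit-learn's built-in iris dataset (an illustrative choice, not necessarily the notebook's data): standardize first, project to two components, then inspect how much variance those components retain:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features, then project the 4-D data down to 2 components
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X2 = pipe.fit_transform(X)

print(X2.shape)  # (150, 2)
# Fraction of total variance captured by each retained component
print(pipe.named_steps["pca"].explained_variance_ratio_)
```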

Model selection

| Number | Notebooks | Extras |
| ------ | --------- | ------ |
| 01 | Regression | |
| 02 | Classification | The accuracy paradox, AUC-ROC and CAP curves, precision, recall and F1 score |
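
As a sketch of model selection in practice: a grid search with 5-fold cross-validation, followed by a classification report giving the precision, recall, and F1 scores mentioned above. The dataset and grid values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Grid-search SVM hyperparameters with 5-fold CV (grid values illustrative)
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

print(grid.best_params_)
# Precision, recall, and F1 per class on the held-out test set
print(classification_report(y_test, grid.predict(X_test)))
```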

Case study

| Number | Notebooks | Extras |
| ------ | --------- | ------ |
| 01 | Logistic Regression | Breast Cancer classifier |
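
A compressed sketch of what such a case study can look like, using scikit-learn's built-in breast cancer dataset (the notebook's actual preprocessing and evaluation may differ):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Scale features, then fit a logistic regression classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print(confusion_matrix(y_test, clf.predict(X_test)))
```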

Extras

Datasets:

  1. ICU dataset
  2. Repo datasets

Blogs:

Acknowledgements:

  • Thanks to Kirill Eremenko and Hadelin de Ponteves for creating such an awesome online course about machine learning.
  • Thanks to Josh Starmer, aka StatQuest, for your brilliant videos about machine learning, which helped me a lot in understanding the math behind the ML algorithms.
  • Thanks to Mr. Vũ Hữu Tiệp for your brilliant blogs about machine learning, which helped me a lot from the days when I didn't even know what machine learning was.
  • Thanks to Mr. Phạm Đình Khánh for your blogs about machine learning and deep learning.
