# Machine Learning Project Guidelines - For Beginners 
#### Book by *Balasubramanian Chandran*

<br>
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover.png">

This is the Jupyter notebook version of the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

# Table of Contents

#### [00. Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb)
#### [01. Introduction to Machine Learning](01.00-mlpg-Introduction-to-Machine-Learning.ipynb)
- What is Machine Learning?
- Automation vs Machine Learning vs Statistical modeling
- Differences between Machine Learning and Statistical modeling
- Types of Machine Learning
- Differences between Supervised & Unsupervised Learning
- Difference between ML Algorithms and ML Models
- Is Machine Learning a complete black box?
- Challenges in the adoption of Machine Learning
- Applications of ML in day-to-day life (most common use cases)
- Why is Machine Learning getting so much attention recently?
- 7 must-know regression techniques
- How to select the right regression technique?

#### [02. Grouping of ML Algorithms](02.00-mlpg-Grouping-of-ML-Algorithms.ipynb)
#### [03. Pros and Cons of ML Algorithms](03.00-mlpg-Pros-and-Cons-of-ML-Algorithms.ipynb)
#### [04. Types of ML Algorithms](04.00-mlpg-Types-of-ML-Algorithms.ipynb)
#### [05. Machine Learning Project – Stages](05.00-mlpg-Machine-Learning-Project–Stages.ipynb)
#### [06. Machine Learning Project – Process-Definition](06.00-mlpg-Machine-Learning-Project–Process-Definition.ipynb)
- Machine Learning projects vs Data Science projects
- Role Definitions

#### [07. Stage-1: Business Understanding](07.00-mlpg-Stage-1-Business-Understanding.ipynb)
#### [08. Stage-2: Data Understanding](08.00-mlpg-Stage-2-Data-Understanding.ipynb)
- Exploratory Data Analysis (EDA)
  - Objectives of EDA
  - Prerequisites of EDA
  - Types of variables
  - Distribution of Variables
  - Summary of EDA Techniques
  - Text EDA: Understand the data with Descriptive Statistics
  - Visual EDA: Understand the data with Visualizations

#### [09. Stage-3: Research](09.00-mlpg-Stage-3-Research.ipynb)
- List of selected algorithms to build models
  - Regression algorithms
  - Classification algorithms
- List of model evaluation metrics
  - Regression model evaluation metrics
  - Classification model evaluation metrics
  - How to choose a Binary Classification model evaluation metric (for imbalanced datasets)

#### [10. Stage-4: Data Preprocessing](10.00-mlpg-Stage-4-Data-Preprocessing.ipynb)
- Data Preparation framework (for structured/tabular data)
- Data Preparation tasks
  - Data Cleaning
  - Feature Selection
  - Feature Engineering
  - Dimensionality Reduction
  - Split datasets for train-test
  - Data Transforms
  - Handling Imbalanced classes

#### [11. Stage-5: Model Development](11.00-mlpg-Stage-5-Model-Development.ipynb)
#### [12. Stage-6: Model Training](12.00-mlpg-Stage-6-Model-Training.ipynb)
#### [13. Stage-7: Model Refinement](13.00-mlpg-Stage-7-Model-Refinement.ipynb)
- Hyperparameters optimization
  - Differences between Model Parameters and Model Hyperparameters
  - Hyperparameters-Tuning for Classification Algorithms
  - Hyperparameters Optimization with Random Search and Grid Search

#### [14. Stage-8: Model Evaluation](14.00-mlpg-Stage-8-Model-Evaluation.ipynb)
#### [15. Stage-9: Final Model Selection](15.00-mlpg-Stage-9-Final-Model-Selection.ipynb)
#### [16. Stage-10: Model Validation](16.00-mlpg-Stage-10-Model-Validation.ipynb)
#### [17. Stage-11: Model Deployment](17.00-mlpg-Stage-11-Model-Deployment.ipynb)
#### 18. Other Considerations
- [Machine Learning](18.01-mlpg-Other-Considerations-Machine-Learning.ipynb)
  - No Free Lunch Theorem for Machine Learning
  - Ensemble Learning
  - How Do Ensembles Work
  - Machine Learning workflow
  - Types of Unsupervised Learning
  - Hard Clustering and Soft (or Fuzzy) Clustering
  - How to choose ML algorithms?
- [Modeling](18.02-mlpg-Other-Considerations-Modeling.ipynb)
  - Most commonly used model categories in ML
  - The drawbacks of a linear model
  - Cross-Validation
  - Differences between PCA and EFA
  - Dimensionality Reduction and PCA
- [Algorithms](18.03-mlpg-Other-Considerations-Algorithms.ipynb)
  - Multinomial Logistic Regression
  - Robust Regression
  - Linear Discriminant Analysis (LDA)
  - Nearest Radius Neighbors (NRN) algorithm
  - Gaussian Processes Classification (GPC) algorithm
  - Clustering algorithms (Unsupervised)
  - Working of an unsupervised learning algorithm
  - Limitations of k-Means clustering
- [Algorithm Comparisons](18.04-mlpg-Other-Considerations-Algorithm-Comparisons.ipynb)
  - Choosing between Random Forest vs SVM vs KNN
  - Is Random Forest a better model than a Decision Tree?
  - Boosting and Bagging
  - SVM vs Logistic Regression
  - Pros and Cons of NB, RF, GBDT, and NN classifiers
- [Metrics and Error Analysis](18.05-mlpg-Other-Considerations-Metrics-and-Error-Analysis.ipynb)
  - Learning Curves
  - Precision and Recall
  - Performing Error Analysis/Troubleshooting Prediction Errors
  - Performing error analysis for anomaly detection
  - How will you evaluate and select ML models in Supervised Learning?
- [Model Performance Improvement](18.06-mlpg-Other-Considerations-Model-Performance-Improvement.ipynb)
  - L1 and L2 regularization
  - Overfitting, Underfitting, and how to limit Overfitting
  - Improving a spam detection algorithm that uses NB
  - Can kernels be applied to algorithms other than SVM?
  - Important parameters in SVM
  - Key parameters of RFs, GBDTs, and NN classifiers
  - Data leakage: How to detect and prevent it?
  - Sensitivity Analysis of Dataset Size vs. Model Performance
- [Anomaly Detection](18.07-mlpg-Other-Considerations-Anomaly-Detection.ipynb)
  - What is Anomaly detection?
  - Anomaly Detection vs Supervised Learning
  - Time-series data anomaly detection
- [Recommender Systems](18.08-mlpg-Other-Considerations-RecommenderSystems.ipynb)
  - Recommender Systems: Types, Pros & Cons, and Evaluation Metrics
- [Neural Networks](18.09-mlpg-Other-Considerations-Neural-Networks.ipynb)
  - What are Neural networks?
  - Types of Neural Networks
- [Databases](18.10-mlpg-Other-Considerations-Databases.ipynb)
  - NoSQL database
  - Types of NoSQL data stores
  - When to use different NoSQL data stores?
  - Types of databases
- [Python Libraries](18.11-mlpg-Other-Considerations-Python-Libraries.ipynb)
  - PyCaret library
  - Python packages for ML & Data Science – A must-know
- [Testing](18.12-mlpg-Other-Considerations-Testing.ipynb)
  - A/B testing (also known as Split Testing)
  - Hypothesis Testing
- [Deep Learning](18.13-mlpg-Other-Considerations-Deep-Learning.ipynb)
  - What is Deep Learning (DL)?
  - Features/Characteristics of DL
  - Demonstrated values of DL
  - Pros and Cons of DL
- [Definitions](18.14-mlpg-Other-Considerations-Definitions.ipynb)
  - Optimization
  - Linear Regression with One Variable
  - Cost function (CF)
  - Gradient Descent (GD)
  - Matrix, Vector, and Vectorization
  - Linear Regression with Multiple Variables
  - Normal Equation (NE)
  - Backpropagation algorithm
- [Miscellaneous](18.15-mlpg-Other-Considerations-Miscellaneous.ipynb)
  - Pipeline
  - Kernel and Kernel trick
  - A Dimensionality Reduction problem description
  - Big data and their characteristics
  - Web scraping and its use-cases
  - Resilient Distributed Dataset (RDD)
  - How to choose a data layer for an application?

#### [19. Text Analytics – An Introduction](19.00-mlpg-Text-Analytics–An-Introduction.ipynb)
- What can be done with text data?
- Basic functions to handle texts in Python
- NLP and Basic NLTK tasks
- Components of NLP and the techniques used in NLP
- Differences between Stemming and Lemmatization
- Information Extraction (IE)
- How does IE solve business problems?
- Challenges and requirements for IE

#### [20. Social Network Analysis – An Introduction](20.00-mlpg-Social-Network-Analysis–An-Introduction.ipynb)
- Use-cases for Networks
- Types of networks

#### 21. Case Studies
- [Case Study 1: ML system design for email Spam detection](21.01-mlpg-CS1-ML-system-design-for-email-Spam-detection.ipynb)
- [Case Study 2: Develop and evaluate an Anomaly Detection system](21.02-mlpg-CS2-Develop-and-evaluate-an-Anomaly-Detection-system.ipynb)
- [Case Study 3: Normalize & sort dates using Regular Expressions](21.03-mlpg-CS3-Normalize-and-sort-dates-using-Regular-Expressions.ipynb)

#### [22. References](22.00-mlpg-References.ipynb)

# Acronyms
```
A/B Test - A way to compare two versions of the same app to figure out which one performs better
ACID     - Atomicity Consistency Isolation Durability
ADASYN   - Adaptive Synthetic Sampling
ANN      - Artificial Neural Networks
ANOVA    - Analysis Of Variance
API      - Application Programming Interface
AUC      - Area Under Curve
CART     - Classification And Regression Trees
CF       - Cost Function
CNN      - Convolutional Neural Networks
CSV      - Comma Separated Values
CV       - Cross-Validation
DL       - Deep Learning
DR       - Dimensionality Reduction
DS       - Data Science
DSPM     - Data Science Project Management
DT       - Decision Tree
EDA      - Exploratory Data Analysis
EFA      - Exploratory Factor Analysis
FPR      - False Positive Rate
GBDT     - Gradient Boosting Decision Tree
GBM      - Gradient Boosting Machine
GD       - Gradient Descent
G-Mean   - Geometric Mean
GPC      - Gaussian Processes Classification
GPU      - Graphics Processing Unit
HTML     - Hypertext Markup Language
HTTP     - Hypertext Transfer Protocol
IE       - Information Extraction
IQR      - Interquartile Range
JSON     - JavaScript Object Notation
KNN      - K-Nearest Neighbors
LASSO    - Least Absolute Shrinkage and Selection Operator
LDA      - Linear Discriminant Analysis
LOF      - Local Outlier Factor
MAE      - Mean Absolute Error
MANOVA   - Multivariate Analysis Of Variance
MAP      - Mean Average Precision, Minimum Advertised Price
MCC      - Matthews Correlation Coefficient
MI       - Mutual Information
ML       - Machine Learning
MLlib    - Machine Learning Library
MLP      - Multi-Layer Perceptrons
MLR      - Multivariate Linear Regression
MPI      - Message Passing Interface
MRR      - Mean Reciprocal Rank
MSE      - Mean Squared Error
NaN      - Not a Number
NB       - Naïve Bayes
NDCG     - Normalized Discounted Cumulative Gain
NE       - Normal Equation
NER      - Named Entity Recognition
NLG      - Natural Language Generation
NLP      - Natural Language Processing
NLTK     - Natural Language Toolkit
NLU      - Natural Language Understanding
NN       - Neural Network
OLS      - Ordinary Least Squares
PCA      - Principal Component Analysis
POS      - Part of Speech
PR       - Precision-Recall
PR-AUC   - Precision-Recall Area Under Curve
RANSAC   - Random Sample Consensus
RDBMS    - Relational Database Management System
RDD      - Resilient Distributed Dataset
RE       - Regular Expression
RF       - Random Forest
RMSE     - Root Mean Squared Error
RMSLE    - Root Mean Squared Logarithmic Error
RNN      - Recurrent Neural Networks
ROC      - Receiver Operating Characteristic
ROI      - Return On Investment
SD       - Standard Deviation
SGE      - Sun Grid Engine (later renamed Oracle Grid Engine)
SL       - Supervised Learning
SMOTE    - Synthetic Minority Oversampling Technique
SNN      - Simulated Neural Networks
SOCKS    - Socket Secure
SQL      - Structured Query Language
SSL      - Secure Sockets Layer
SVM      - Support Vector Machine
TLS      - Transport Layer Security
TPU      - Tensor Processing Unit
TSV      - Tab Separated Values
UAT      - User Acceptance Testing
UI       - User Interface
ULR      - Univariate Linear Regression
USL      - Unsupervised Learning
XGBoost  - Extreme Gradient Boosting
XML      - Extensible Markup Language
Z-Score  - Standard Score
```