# C1: BASICS OF MACHINE LEARNING & DATA PREPROCESSING

## What is Machine Learning?

- **Machine Learning (ML)**: A field of AI in which computers learn patterns from data and use them to make predictions or decisions, instead of following rules that were explicitly programmed.  
- **Goal**: Build models that generalize to new data and whose performance typically improves as more training data becomes available.

## Traditional Programming vs Machine Learning

| **Traditional Programming**           | **Machine Learning**                       |
| ------------------------------------- | ------------------------------------------ |
| Rules are explicitly coded by humans. | Model learns rules from data.              |
| Input + Rules → Output                | Input + Output → Model (rules)             |
| Used when logic is well-known.        | Used when logic is complex or unknown.     |

**Example**:  
- Traditional: Writing rules to detect spam emails manually.  
- ML: Feed thousands of labeled emails → model learns spam patterns automatically.
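
The contrast in the spam example can be sketched in a few lines. This is a deliberately tiny illustration, not a real spam filter: the function names, the banned-word list, and the four toy emails are all made up for the example, and the "model" is just a word-count score.

```python
from collections import Counter


def rule_based_is_spam(text):
    """Traditional programming: a human writes the rule by hand."""
    banned = {"free", "winner", "prize"}  # hypothetical hand-picked words
    return any(word in banned for word in text.lower().split())


def learn_word_scores(emails, labels):
    """ML-style: derive a per-word spam score from labeled examples."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, label in zip(emails, labels):
        target = spam_counts if label == "spam" else ham_counts
        target.update(text.lower().split())
    vocab = set(spam_counts) | set(ham_counts)
    # Positive score: the word appeared more often in spam than in ham.
    return {word: spam_counts[word] - ham_counts[word] for word in vocab}


def learned_is_spam(text, scores):
    """Classify by summing the learned scores of the words in the text."""
    return sum(scores.get(word, 0) for word in text.lower().split()) > 0


emails = [
    "claim your free prize now",
    "cheap loans approved today",
    "meeting moved to friday",
    "lunch on friday with the team",
]
labels = ["spam", "spam", "ham", "ham"]
scores = learn_word_scores(emails, labels)

new_email = "cheap loans for you"
print(rule_based_is_spam(new_email))       # False: nobody wrote a rule for "cheap"
print(learned_is_spam(new_email, scores))  # True: "cheap" and "loans" were learned from data
```

The hand-written rule only knows what its author thought of, while the learned scores pick up "cheap" and "loans" from the labeled examples.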

## Understanding an ML Problem

Before starting any ML project:

1. Define the Problem – What do we want to predict?  
2. Understand Data – What features are available? What’s the target variable?  
3. Identify ML Type – Regression, Classification, Clustering, etc.  
4. Decide Success Metric – e.g., accuracy for classification, RMSE (root mean squared error) for regression.  
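
The two metrics named in step 4 are easy to compute by hand. The sketch below uses plain Python and made-up labels and values, only to show what each number measures.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (classification)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values (regression)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Classification: 3 of 4 labels are predicted correctly.
print(accuracy(["spam", "ham", "spam", "ham"], ["spam", "ham", "ham", "ham"]))  # 0.75

# Regression: errors are 1, 0 and -2, so RMSE = sqrt((1 + 0 + 4) / 3), about 1.291.
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```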

## Steps in an ML Project

1. Data Collection  
2. Data Preprocessing (cleaning, encoding, scaling, etc.)  
3. Splitting into Train & Test sets  
4. Choosing a Model (Linear Regression, Decision Tree, etc.)  
5. Training the Model  
6. Evaluating Performance  
7. Hyperparameter Tuning  
8. Deploying the Model  
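
As a rough sketch, most of these steps fit in a few lines of scikit-learn. The dataset (Iris, bundled with the library), the decision tree, and the small `max_depth` grid are arbitrary choices for illustration. The split is done before any preprocessing is fitted so that the test set stays unseen.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Step 1. Data collection: a small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Step 3. Train/test split, done before any preprocessing is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Steps 2 and 4. Preprocessing and model choice, chained in one pipeline.
# (Trees do not need scaling; the scaler only shows where preprocessing goes.)
pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# Steps 5 and 7. Training, with a small cross-validated hyperparameter search.
search = GridSearchCV(pipeline, {"decisiontreeclassifier__max_depth": [2, 3, 4]}, cv=5)
search.fit(X_train, y_train)

# Step 6. Evaluation on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, test_accuracy)

# Step 8. Deployment (saving and serving the model) is outside this sketch.
```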

## Basic Terms in ML

- **Feature / Variable / Input (X)** – Information used for prediction.  
- **Target / Label / Output (y)** – What we want to predict.  
- **Model** – A mathematical function that maps inputs (X) to outputs (y).  
- **Training** – Fitting the model parameters to the training data.  
- **Testing** – Evaluating the model on unseen data.  
- **Overfitting** – Model memorizes training data but fails on new data.  
- **Underfitting** – Model is too simple; performs poorly on both train & test.  
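
Overfitting and underfitting can be seen by fitting polynomials of different degrees to noisy data. The sketch below assumes NumPy; the quadratic ground truth, the noise level, and the degrees 0, 2 and 9 are illustrative choices, and the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples from y = x^2 (an assumed ground truth for the demo)."""
    x = rng.uniform(-1, 1, n)
    y = x ** 2 + rng.normal(0, 0.1, n)
    return x, y


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given data."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(200)    # unseen data from the same process

train_mse, test_mse = {}, {}
for degree in (0, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(degree, train_mse[degree], test_mse[degree])
```

Training error can only go down as the degree grows, so a low training error alone says little. A degree-0 fit (a constant) is usually poor on both sets, which is underfitting. A degree-9 fit tends to show a gap between low training error and higher test error, which is the signature of overfitting.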

## Types of Machine Learning

1. **Supervised Learning** – Model learns with labeled data (X → y).  
   - **Regression**: Predict continuous values (e.g., house price).  
   - **Classification**: Predict categories (e.g., spam/ham).  

2. **Unsupervised Learning** – No labels; the model finds structure in the data on its own.  
   - **Clustering**: Group similar items (e.g., customer segmentation).  
   - **Dimensionality Reduction**: Reduce the number of features while keeping most of the information (e.g., PCA).  

3. **Reinforcement Learning** – An agent learns by interacting with an environment and receiving rewards or penalties for its actions (e.g., a game agent learning by trial & error).  
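
The practical difference between the first two types shows up in what is passed to `fit`. The sketch below assumes scikit-learn and uses a made-up one-feature dataset; reinforcement learning needs an environment loop and is not shown.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# Six samples with a single feature (toy values).
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]

# Supervised regression: labels are continuous values.
y_price = [10.0, 20.0, 30.0, 100.0, 110.0, 120.0]
regressor = LinearRegression().fit(X, y_price)
print(regressor.predict([[5.0]]))

# Supervised classification: labels are categories.
y_class = [0, 0, 0, 1, 1, 1]
classifier = LogisticRegression().fit(X, y_class)
print(classifier.predict([[1.5], [11.5]]))

# Unsupervised clustering: no labels are passed at all.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)  # cluster ids are arbitrary; only the grouping matters
```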

## What is Data Preprocessing?

- Raw data is often incomplete, noisy, or inconsistent.  
- Preprocessing cleans and transforms data so ML models can learn effectively.  

## General Preprocessing Steps

1. **Handling Missing Values**  
   - Types of Missing Data:  
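     1. **Standard missing values**: Represented as `NaN`, `NULL`/`None`, or an empty cell; most libraries recognize these automatically.  
     2. **Non-standard missing values**: Represented by placeholders such as `?`, `-`, or `Not available`; these must first be converted to a standard missing value.  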
   - Methods to handle missing values:  
     1. Remove rows/columns.  
     2. Impute values:  
        - Mean/median (for numeric).  
        - Mode (for categorical).  
        - Predict missing value using another model (advanced).  
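
A minimal pandas sketch of the options for missing values follows. The DataFrame and its column names are made up for the example, and model-based imputation is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" and "Not available" are non-standard markers.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": ["50000", "?", "62000", "Not available"],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Convert non-standard markers to NaN so pandas treats them as missing.
df = df.replace(["?", "-", "Not available"], np.nan)
df["income"] = pd.to_numeric(df["income"])

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: impute instead of dropping.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())         # numeric: median
imputed["income"] = imputed["income"].fillna(imputed["income"].mean())  # numeric: mean
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])     # categorical: mode

print(dropped)
print(imputed)
```

Dropping keeps only one of the four rows here, which shows why imputation is often preferred when data is scarce.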

2. **Handling Non-Numeric Data**  
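   - Most ML models require numeric input, so text categories must be converted to numbers.  
     - **One-Hot Encoding**: Creates one new 0/1 column per category; suited to categories with no natural order.  
     - **Label Encoding**: Assigns an arbitrary integer code (0, 1, 2, ...) to each category; the codes imply an order that may not exist.  
     - **Ordinal Encoding**: Assigns integer codes that follow the natural order of the categories (e.g., low < medium < high).  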
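
The three encodings can be compared side by side with pandas. The columns and category values below are invented for the example, and the ordinal mapping is written by hand to make the assumed order explicit.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],  # nominal: no natural order
    "size": ["low", "high", "medium", "low"],    # ordinal: low < medium < high
})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Label encoding: arbitrary integer codes (pandas assigns them in sorted order).
label_codes = df["color"].astype("category").cat.codes

# Ordinal encoding: integer codes that follow the stated order.
size_order = {"low": 0, "medium": 1, "high": 2}
ordinal_codes = df["size"].map(size_order)

print(one_hot)
print(label_codes.tolist())    # [2, 1, 0, 1]
print(ordinal_codes.tolist())  # [0, 2, 1, 0]
```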

3. **Normalization & Transformation**  
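   - **Normalization (min-max scaling)**: Rescale values to the range 0 to 1 → `x' = (x - min)/(max - min)`.  
   - **Standardization**: Transform to mean = 0, std = 1 → `z = (x - mean)/std`.  
   - **Why?** Models that rely on distances or gradient-based optimization (e.g., k-NN, SVM, regularized regression) are sensitive to feature scale; tree-based models are largely unaffected.  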
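
Both formulas translate directly into code. This plain-Python sketch uses made-up heights and the population standard deviation. Libraries such as scikit-learn provide equivalent scalers.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """x' = (x - min) / (max - min); assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """z = (x - mean) / std, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


heights_cm = [150, 160, 170, 180, 190]
print(min_max_scale(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(heights_cm))    # symmetric around 0, with standard deviation 1
```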

4. **Outlier Detection & Removal**  
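   - Outliers can distort models, especially those that minimize squared error, such as linear regression.  
     - **Boxplot Method**: Points drawn beyond the whiskers, which by convention extend 1.5×IQR from the box.  
     - **IQR Method**:  
       - IQR = Q3 - Q1.  
       - Outliers: < Q1 - 1.5×IQR or > Q3 + 1.5×IQR.  
     - **Z-Score Method**: Flag values with |z| > 3, i.e., more than 3 standard deviations from the mean.  
     - **Scatterplot**: Visual inspection.  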
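
The IQR and z-score rules can be sketched with the standard library alone. The readings are invented, and the quartiles use the `"inclusive"` method of `statistics.quantiles`. Other quartile conventions can give slightly different cut-offs.

```python
from statistics import mean, pstdev, quantiles


def iqr_outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < lower or x > upper]


def zscore_outliers(values, threshold=3.0):
    """Values whose |z| exceeds the threshold (population standard deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [x for x in values if abs((x - mu) / sigma) > threshold]


# Twenty toy readings with one obviously extreme value.
readings = [11, 12, 12, 13, 12, 11, 14, 13, 15, 11,
            12, 13, 14, 12, 11, 13, 12, 14, 13, 102]

print(iqr_outliers(readings))     # [102]
print(zscore_outliers(readings))  # [102]
```

One caveat: the z-score rule needs enough data. With ten or fewer points, no single value can reach |z| > 3, however extreme it is.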

5. **Feature Engineering (Intro)**  
   - Creating new features from existing ones. Example: BMI = weight (kg) / height (m)².  
   - Includes:  
     - Feature extraction (PCA, text embeddings).  
     - Feature transformation (log, square root).  
     - Combining features (ratios, interactions).  
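
A small sketch of two of these ideas, a combined feature (BMI) and a log transform, is shown below. The records and field names are hypothetical, and feature extraction such as PCA is not shown.

```python
import math

# Hypothetical records: weight in kilograms, height in meters.
people = [
    {"weight_kg": 70.0, "height_m": 1.75, "income": 40000.0},
    {"weight_kg": 90.0, "height_m": 1.80, "income": 250000.0},
]

for person in people:
    # Combining features: BMI = weight / height^2.
    person["bmi"] = person["weight_kg"] / person["height_m"] ** 2
    # Feature transformation: log(1 + x) compresses large, skewed values.
    person["log_income"] = math.log1p(person["income"])

for person in people:
    print(round(person["bmi"], 2), round(person["log_income"], 2))
```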

6. **Train-Test Split**  
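   - **Purpose**: Hold out data the model never sees during training, so that performance on unseen data can be estimated.  
   - Common splits: 70% Train / 30% Test or 80/20.  
   - Use `train_test_split()` from scikit-learn (`sklearn.model_selection`).  
   - Fit imputers, encoders, and scalers on the training set only, then apply them to the test set, so that no information leaks from the test data.  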
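
A minimal use of `train_test_split()` looks like this. The toy data, the 80/20 split, and the `random_state` value are illustrative. `stratify=y` is optional and keeps the class proportions the same in both sets.

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, and a balanced binary target.
X = [[i, i * 2] for i in range(10)]
y = [0, 1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 80% train / 20% test
    random_state=42,  # fixed seed so the split is reproducible
    stratify=y,       # preserve the class balance in both sets
)

print(len(X_train), len(X_test))  # 8 2
```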

## Applications of ML

ML is used in many domains, for example:  
- **Healthcare**: Disease prediction, image analysis.  
- **Finance**: Credit scoring, fraud detection.  
- **Retail**: Recommendation systems, inventory forecasting.  
- **Transportation**: Route optimization, self-driving cars.  
- **Manufacturing**: Predictive maintenance, defect detection.  
