# Introduction to Machine Learning & Data Preprocessing

## 🎯 Learning Objectives
By the end of this lesson, you will understand:
- Core machine learning concepts and applications
- Different types of machine learning
- Basic ML workflow with Scikit-Learn
- Model performance evaluation and confusion matrices
- Data preprocessing techniques
- Real-world ML project implementation

## 📚 Topics Covered
1. **Introduction to Machine Learning**
2. **What is Data Preprocessing?**
3. **Project: Adult Income Classification (Part 1)**

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("📚 All libraries imported successfully!")
print("🔧 Environment configured for machine learning!")

# 🤖 PART 1: Introduction to Machine Learning

## What is Machine Learning?

Machine Learning (ML) is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed for every task.

### Key Concepts:
- **Algorithm**: A set of rules or instructions for solving a problem
- **Model**: A trained algorithm that can make predictions on new data
- **Training**: The process of teaching an algorithm using historical data
- **Prediction**: Using a trained model to make decisions on new, unseen data

### Real-World Applications:
- 🎬 **Netflix**: Movie recommendations based on viewing history
- 🛒 **Amazon**: Product recommendations and fraud detection
- 🚗 **Tesla**: Self-driving cars using computer vision
- 🏥 **Healthcare**: Medical diagnosis and drug discovery
- 📧 **Gmail**: Spam email detection

## Types of Machine Learning

### 1. 🎯 Supervised Learning
- **Definition**: Learning with labeled examples (input-output pairs)
- **Goal**: Predict outcomes for new data
- **Examples**: 
  - Predicting house prices (regression)
  - Email spam detection (classification)
  - Medical diagnosis (classification)

### 2. 🔍 Unsupervised Learning
- **Definition**: Finding patterns in data without labels
- **Goal**: Discover hidden structure in data
- **Examples**:
  - Customer segmentation (clustering)
  - Market basket analysis (association rules)
  - Dimensionality reduction (PCA)
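
As a minimal sketch of the clustering case, the snippet below groups hypothetical customer records with `KMeans`. The two synthetic groups, the feature meanings, and the choice of two clusters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual spend, visits per month]
rng = np.random.default_rng(0)
low_spenders = rng.normal(loc=[20.0, 2.0], scale=1.0, size=(50, 2))
high_spenders = rng.normal(loc=[80.0, 10.0], scale=1.0, size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

# No labels are passed to fit: rows are grouped by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

print("Cluster sizes:", np.bincount(segments))
print("Cluster centers:\n", kmeans.cluster_centers_.round(1))
```

The cluster numbers (0 or 1) are arbitrary identifiers; only the grouping itself carries meaning.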

### 3. 🎮 Reinforcement Learning
- **Definition**: Learning through trial and error with rewards/penalties
- **Goal**: Learn optimal actions in an environment
- **Examples**:
  - Game playing (Chess, Go)
  - Robot navigation
  - Trading algorithms

### Focus of This Course: **Supervised Learning** 🎯

## Basic ML Workflow with Scikit-Learn

The typical machine learning workflow consists of these steps:

### 1. 📊 **Data Collection & Loading**
   - Gather relevant data for your problem
   - Load data into pandas DataFrame

### 2. 🔍 **Exploratory Data Analysis (EDA)**
   - Understand data structure and patterns
   - Identify missing values, outliers, distributions

### 3. 🛠️ **Data Preprocessing**
   - Clean and prepare data for modeling
   - Handle missing values, scale features, encode categories

### 4. 🎯 **Model Selection & Training**
   - Choose appropriate algorithm
   - Split data into train/test sets
   - Train model on training data

### 5. 📈 **Model Evaluation**
   - Test model performance on unseen data
   - Use metrics like accuracy, precision, recall

### 6. 🚀 **Model Deployment**
   - Put model into production
   - Monitor and maintain performance

In [None]:
# Simple Example: Basic ML Workflow
print("🔧 BASIC ML WORKFLOW EXAMPLE")
print("=" * 50)

# Step 1: Create sample data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)

print(f"📊 Dataset created: {X.shape[0]} samples, {X.shape[1]} features")
print(f"🎯 Target classes: {np.unique(y)}")

# Step 2: Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"✂️ Data split: {len(X_train)} training, {len(X_test)} testing samples")

# Step 3: Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
print("🤖 Model trained successfully!")

# Step 4: Make predictions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"📈 Model accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")

print("\n✅ Basic workflow complete!")

## Model Performance Evaluation

### Classification Metrics

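#### 1. 🎯 **Accuracy**
- **Definition**: Fraction of predictions that are correct
- **Formula**: (Correct Predictions) / (Total Predictions)
- **When to use**: Roughly balanced datasets; on imbalanced data, a model that always predicts the majority class can still score high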
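
To see why balance matters for accuracy, here is a small sketch with made-up labels (95 negatives, 5 positives) and a stand-in "model" that always predicts the majority class. Both the labels and the model are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
positives_found = int(((y_true == 1) & (y_majority == 1)).sum())

print(f"Accuracy: {acc:.2f}")                      # 0.95
print(f"Positives found: {positives_found} of 5")  # 0 of 5
```

The score looks strong even though not a single positive case is detected, which is why accuracy alone can mislead on imbalanced data.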

#### 2. 📊 **Confusion Matrix**
- **Definition**: Table showing actual vs predicted classifications
- **Components**:
  - **True Positives (TP)**: Correctly predicted positive cases
  - **True Negatives (TN)**: Correctly predicted negative cases  
  - **False Positives (FP)**: Incorrectly predicted as positive (Type I error)
  - **False Negatives (FN)**: Incorrectly predicted as negative (Type II error)

#### 3. 🔍 **Precision**
- **Definition**: Of all positive predictions, how many were correct?
- **Formula**: TP / (TP + FP)

#### 4. 📈 **Recall (Sensitivity)**
- **Definition**: Of all actual positives, how many did we find?
- **Formula**: TP / (TP + FN)
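
The formulas above can be checked by hand on a tiny example. The ten labels below are invented for illustration; the sketch computes precision and recall from the confusion-matrix counts and compares them with scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten cases (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / (3 + 2) = 0.60
recall = tp / (tp + fn)     # 3 / (3 + 1) = 0.75

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision: {precision:.2f} (sklearn: {precision_score(y_true, y_pred):.2f})")
print(f"Recall:    {recall:.2f} (sklearn: {recall_score(y_true, y_pred):.2f})")
```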

In [None]:
# Demonstration: Confusion Matrix
print("📊 CONFUSION MATRIX EXAMPLE")
print("=" * 40)

# Create and visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Class 0', 'Class 1'], 
            yticklabels=['Class 0', 'Class 1'])
plt.title('Confusion Matrix', fontsize=14, fontweight='bold')
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.tight_layout()
plt.show()

# Calculate metrics manually
tn, fp, fn, tp = cm.ravel()
print(f"\n📈 Detailed Metrics:")
print(f"True Negatives:  {tn}")
print(f"False Positives: {fp}")  
print(f"False Negatives: {fn}")
print(f"True Positives:  {tp}")

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0

print(f"\n🎯 Performance Metrics:")
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")

# 🛠️ PART 2: What is Data Preprocessing?

## Why Preprocessing Matters

Raw data is rarely ready for machine learning algorithms. Data preprocessing is crucial because:

### 🚫 **Common Data Problems:**
- **Missing values**: Gaps in data that need to be filled
- **Different scales**: Features with vastly different ranges (e.g., age vs income)
- **Categorical data**: Text labels that need numeric encoding
- **Outliers**: Extreme values that can skew results
- **Irrelevant features**: Noise that hurts model performance

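### ✅ **Benefits of Preprocessing:**
- **Improved accuracy**: Handling missing values, outliers, and noisy features removes errors the model would otherwise learn from
- **Faster training**: Dropping irrelevant columns and cleaning inputs reduces the work each training run has to do
- **Algorithm compatibility**: Most scikit-learn estimators expect numeric input without missing values
- **Better convergence**: Features on comparable scales help gradient-based optimizers converge reliably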

### 🎯 **The Golden Rule:**
> **"Garbage in, garbage out"** - Quality preprocessing is essential for quality results!

## Key Preprocessing Techniques

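### 1. 📏 **MinMaxScaler**
- **Purpose**: Scale each feature to a fixed range (0 to 1 by default)
- **Formula**: (value - min) / (max - min)
- **When to use**: When features have different scales and the model is scale-sensitive (e.g., distance-based or gradient-based methods)
- **Pros**: Preserves the relative spacing of values within each feature; bounded output
- **Cons**: Sensitive to outliers, since a single extreme value sets the min or max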
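
As a quick check of the formula, the sketch below scales a made-up age column by hand and with `MinMaxScaler`, then adds one extreme value to show the outlier sensitivity. The numbers are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages as a single-column feature matrix
ages = np.array([[20.0], [30.0], [40.0], [60.0]])

# Formula applied by hand: (value - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print("Manual:", manual.ravel())        # values 0, 0.25, 0.5, 1
print("Match: ", np.allclose(manual, scaled))

# Outlier sensitivity: one extreme value compresses all other values toward 0
with_outlier = np.array([[20.0], [30.0], [40.0], [60.0], [1000.0]])
squeezed = MinMaxScaler().fit_transform(with_outlier)
print("With outlier:", squeezed.ravel().round(3))
```

With the outlier present, the original four ages all land below 0.05, so most of the 0-1 range is spent on a single point.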

### 2. 🏷️ **OneHotEncoder**
- **Purpose**: Convert categorical variables to binary columns
- **Example**: ['Red', 'Blue', 'Green'] → [1,0,0], [0,1,0], [0,0,1]
- **When to use**: Nominal categories (no order)
- **Pros**: No artificial ordering imposed
- **Cons**: Can create many columns (curse of dimensionality)

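### 3. 🔢 **LabelEncoder**
- **Purpose**: Convert class labels to integers; scikit-learn intends it for the target `y`, not for input features
- **Example**: ['Small', 'Medium', 'Large'] → [2, 1, 0] (classes are sorted alphabetically: Large=0, Medium=1, Small=2)
- **When to use**: Encoding target labels; for ordinal features with a natural order, `OrdinalEncoder` with an explicit `categories` list preserves that order
- **Pros**: Compact representation
- **Cons**: Integer codes follow alphabetical order, which can imply a false ordering when used on features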
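
It is easy to miss that `LabelEncoder` assigns integers by sorted (alphabetical) class order rather than by any natural order. The sketch below, using a hypothetical size column, compares it with `OrdinalEncoder` given an explicit category order; treat it as an illustration rather than a recommendation for every dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature
sizes = pd.DataFrame({"size": ["Small", "Medium", "Large", "Medium"]})

# LabelEncoder sorts classes alphabetically: Large=0, Medium=1, Small=2
le = LabelEncoder()
label_codes = le.fit_transform(sizes["size"])
print(list(le.classes_), label_codes.tolist())

# OrdinalEncoder accepts an explicit order: Small=0, Medium=1, Large=2
oe = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = oe.fit_transform(sizes[["size"]]).ravel()
print(ordinal_codes.tolist())
```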

In [None]:
# Demonstration: Preprocessing Techniques
print("🛠️ PREPROCESSING DEMONSTRATIONS")
print("=" * 50)

# Create sample data with different types
np.random.seed(42)
sample_data = pd.DataFrame({
    'age': np.random.randint(18, 80, 100),
    'income': np.random.normal(50000, 20000, 100),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 100),
    'city': np.random.choice(['New York', 'London', 'Tokyo', 'Sydney'], 100)
})

print("📊 Original Sample Data:")
print(sample_data.head())
print(f"\nData types:\n{sample_data.dtypes}")
print(f"\nData shape: {sample_data.shape}")

# 1. MinMaxScaler Example
print("\n" + "="*30)
print("📏 MINMAXSCALER EXAMPLE")
print("="*30)

scaler = MinMaxScaler()
numerical_cols = ['age', 'income']
scaled_data = scaler.fit_transform(sample_data[numerical_cols])
scaled_df = pd.DataFrame(scaled_data, columns=numerical_cols)

print("Before scaling:")
print(sample_data[numerical_cols].describe())
print("\nAfter MinMax scaling (0-1 range):")
print(scaled_df.describe())

In [None]:
# 2. OneHotEncoder Example
print("\n" + "="*30)
print("🏷️ ONEHOTENCODER EXAMPLE")
print("="*30)

# OneHot encode city (nominal categorical)
# sparse_output needs scikit-learn >= 1.2; drop='first' avoids multicollinearity
ohe = OneHotEncoder(sparse_output=False, drop='first')
city_encoded = ohe.fit_transform(sample_data[['city']])
# Ask the encoder for the names of the columns it kept (the first category is dropped)
city_columns = ohe.get_feature_names_out().tolist()
city_df = pd.DataFrame(city_encoded, columns=city_columns)

print("Original city values (first 10):")
print(sample_data['city'].head(10).tolist())
print(f"\nUnique cities: {sample_data['city'].unique()}")
print(f"\nOneHot encoded columns: {city_columns}")
print("\nEncoded representation (first 10 rows):")
print(city_df.head(10))

# 3. LabelEncoder Example  
print("\n" + "="*30)
print("🔢 LABELENCODER EXAMPLE")
print("="*30)

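# Label encode education. Note: LabelEncoder assigns codes in alphabetical order
# (Bachelor=0, High School=1, Master=2, PhD=3), not by education level, and
# scikit-learn intends it for target labels; OrdinalEncoder suits ordered features.
le = LabelEncoder()
education_encoded = le.fit_transform(sample_data['education'])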

print("Original education values (first 10):")
print(sample_data['education'].head(10).tolist())
print(f"\nUnique education levels: {sample_data['education'].unique()}")
print(f"Label mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")
print(f"\nEncoded education values (first 10): {education_encoded[:10]}")

## Scikit-Learn Pipelines & Avoiding Data Leakage

### 🔧 **What are Pipelines?**
Pipelines are a way to chain preprocessing steps and models together, ensuring:
- **Reproducibility**: Same steps applied consistently
- **Code cleanliness**: Organized and readable workflow
- **Parameter tuning**: Easy to optimize entire pipeline
- **Deployment**: Simple to put into production
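
To illustrate the parameter-tuning point, here is a sketch that searches over the regularization strength of a classifier inside a pipeline. The step names (`scaler`, `classifier`), the synthetic data, and the candidate values of `C` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {"classifier__C": [0.1, 1.0, 10.0]}

# The scaler is refit inside each cross-validation fold, so no fold leaks into another
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best C:", search.best_params_["classifier__C"])
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```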

### ⚠️ **Data Leakage Prevention**
**Data leakage** occurs when information from the future or test set "leaks" into training:

#### **Common Leakage Sources:**
1. **Preprocessing before splitting**: Scaling using entire dataset statistics
2. **Target leakage**: Features that wouldn't be available at prediction time
3. **Temporal leakage**: Using future information to predict the past

#### **Prevention Strategy:**
- **✅ Fit preprocessing on training data only**
- **✅ Transform both training and test data with training statistics**
- **❌ Never fit preprocessing on test data**
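
The strategy above can be sketched with a single scaler. The feature here is a hypothetical column holding the values 0 to 99; the point is only to contrast a scaler fitted on all rows with one fitted on the training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature with values 0..99
X = np.arange(100, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: the scaler sees the test rows when it learns min and max
leaky_scaler = MinMaxScaler().fit(X)

# Correct: learn min and max from the training rows only
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

print("Leaky min/max:     ", leaky_scaler.data_min_, leaky_scaler.data_max_)
print("Train-only min/max:", scaler.data_min_, scaler.data_max_)
```

With the train-only scaler, test values can fall slightly outside the 0-1 range. That is expected, and it mirrors what happens when the model meets new data in production.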

In [None]:
# Pipeline Example: Proper Way to Avoid Data Leakage
print("🔧 PIPELINE EXAMPLE - AVOIDING DATA LEAKAGE")
print("=" * 60)

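# Create sample data
X_sample = sample_data[['age', 'income', 'education', 'city']].copy()
# Labels are random (seeded for reproducibility), so accuracy near chance level is expected;
# this cell demonstrates the pipeline mechanics, not predictive power
rng = np.random.default_rng(42)
y_sample = rng.integers(0, 2, size=len(X_sample))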

# Split data FIRST (before any preprocessing)
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.2, random_state=42)

print(f"📊 Data split: {len(X_train)} training, {len(X_test)} testing samples")

# Create preprocessing pipeline
# Define column types
numeric_features = ['age', 'income']
categorical_features = ['education', 'city']

# Create transformers
numeric_transformer = MinMaxScaler()
categorical_transformer = OneHotEncoder(drop='first', sparse_output=False)

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Create full pipeline with model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

print("🔧 Pipeline created with preprocessing + model")

# Fit pipeline (preprocessing + model together)
pipeline.fit(X_train, y_train)
print("✅ Pipeline fitted on training data only")

# Make predictions
y_pred_pipeline = pipeline.predict(X_test)
accuracy_pipeline = accuracy_score(y_test, y_pred_pipeline)

print(f"📈 Pipeline accuracy: {accuracy_pipeline:.3f}")
print("\n🎯 Benefits achieved:")
print("   ✅ No data leakage")
print("   ✅ Reproducible preprocessing")
print("   ✅ Easy deployment")
print("   ✅ Clean, organized code")