# Module 00: Introduction to Machine Learning

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 60 minutes  
**Prerequisites**: 
- Basic Python programming
- NumPy and Pandas fundamentals
- Basic statistics knowledge

## Learning Objectives

By the end of this notebook, you will be able to:

1. Define machine learning and explain its core concepts
2. Distinguish between supervised, unsupervised, and reinforcement learning
3. Understand the typical ML workflow and pipeline
4. Set up and use scikit-learn for basic ML tasks
5. Build your first ML model using scikit-learn
6. Recognize common ML terminology and concepts

## 1. Setup and Imports

First, let's import all the libraries we'll need for this introduction.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn basics
import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Set random seeds for reproducibility
np.random.seed(42)

# Display versions
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")

## 2. What is Machine Learning?

### Definition

**Machine Learning (ML)** is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. Instead of writing specific rules, we provide examples and let the algorithm discover patterns.

### Traditional Programming vs Machine Learning

**Traditional Programming:**
- Input: Data + Rules → Output: Answers
- Example: Calculate total price = quantity × unit_price

**Machine Learning:**
- Input: Data + Answers → Output: Rules (Model)
- Example: Given images and labels, learn to recognize cats vs dogs
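
To make the contrast concrete, here is a minimal sketch built on the pricing example. The unit price of 2.50 and the function name `total_price_rule` are assumptions made up for this illustration. We first write the rule by hand, then give a model only quantities and totals and let it recover the rule.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: we write the rule ourselves.
UNIT_PRICE = 2.50  # assumed value, for illustration only

def total_price_rule(quantity):
    return quantity * UNIT_PRICE

# Machine learning: we supply only data (quantities) and answers (totals).
quantities = np.array([[1], [2], [5], [10], [20]])  # shape (n_samples, 1)
totals = np.array([total_price_rule(q[0]) for q in quantities])

model = LinearRegression()
model.fit(quantities, totals)

# The "rule" the model discovered is stored in its learned coefficient.
learned_unit_price = model.coef_[0]
print(f"Hand-written rule: total = quantity * {UNIT_PRICE}")
print(f"Learned rule:      total = quantity * {learned_unit_price:.2f}")
```

For a rule this simple, writing it by hand is clearly the better choice. The point is only that the model recovers the relationship from examples alone.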

### When to Use Machine Learning?

ML is useful when:
1. **Problems are too complex** for traditional programming (speech recognition, image classification)
2. **Rules keep changing** (fraud detection, stock market prediction)
3. **Patterns exist in data** but are hard to describe explicitly
4. **You have sufficient data** to train a model

## 3. Types of Machine Learning

Machine learning algorithms can be categorized into three main types:

### 3.1 Supervised Learning

**Definition**: Learning from labeled data (input-output pairs)

**Two main types:**
- **Classification**: Predict discrete categories (spam vs not spam, cat vs dog)
- **Regression**: Predict continuous values (house prices, temperature)

**Common algorithms:**
- Linear Regression, Logistic Regression
- Decision Trees, Random Forests
- Support Vector Machines (SVM)
- Neural Networks
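
The difference between the two task types shows up in the kind of value a model predicts. The sketch below is one possible illustration, not part of the main walkthrough: it fits a classifier on the Iris data and a regressor on synthetic data from `make_regression`. The choice of datasets and models here is our own assumption.

```python
from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the target is a discrete category (iris species 0, 1, or 2).
X_cls, y_cls = load_iris(return_X_y=True)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_cls, y_cls)
class_prediction = classifier.predict(X_cls[:1])
print(f"Predicted class label: {class_prediction[0]}")

# Regression: the target is a continuous number (synthetic data here).
X_reg, y_reg = make_regression(
    n_samples=100, n_features=3, noise=5.0, random_state=42
)
regressor = LinearRegression()
regressor.fit(X_reg, y_reg)
value_prediction = regressor.predict(X_reg[:1])
print(f"Predicted continuous value: {value_prediction[0]:.2f}")
```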

### 3.2 Unsupervised Learning

**Definition**: Learning from unlabeled data (finding hidden patterns)

**Common tasks:**
- **Clustering**: Group similar items together (customer segmentation)
- **Dimensionality Reduction**: Reduce number of features while preserving information (PCA)
- **Anomaly Detection**: Find unusual patterns (fraud detection)

**Common algorithms:**
- K-Means, DBSCAN (clustering)
- PCA, t-SNE (dimensionality reduction)
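
As a rough sketch of two of these tasks, the code below runs K-Means and PCA on the Iris measurements while ignoring the species labels. Asking for 3 clusters and 2 components is our assumption for this example; in a real unsupervised problem you would have to choose those values yourself.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load features only; the species labels are deliberately ignored.
X, _ = load_iris(return_X_y=True)

# No train/test split here: we are exploring structure, not evaluating predictions.
X_scaled = StandardScaler().fit_transform(X)

# Clustering: group the flowers into 3 clusters (3 is an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X_scaled)

# Dimensionality reduction: compress 4 features into 2 components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(f"Cluster sizes: {[int((cluster_ids == c).sum()) for c in range(3)]}")
print(f"Reduced shape: {X_2d.shape}")
print(f"Variance kept by 2 components: {pca.explained_variance_ratio_.sum():.1%}")
```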

### 3.3 Reinforcement Learning

**Definition**: Learning through interaction with an environment (trial and error)

**Key concepts:**
- Agent takes actions in an environment
- Receives rewards or penalties
- Learns optimal behavior to maximize cumulative reward
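
These ideas can be sketched with a toy problem often called a multi-armed bandit. Everything below is an assumption made for illustration: the three hidden reward probabilities, the exploration rate `epsilon`, and the simple "epsilon-greedy" strategy. It uses only NumPy, not a reinforcement learning library.

```python
import numpy as np

rng = np.random.default_rng(42)

# Environment (assumed): three actions with hidden reward probabilities.
true_reward_probs = np.array([0.2, 0.5, 0.8])
n_actions = len(true_reward_probs)
n_steps = 2000

# Agent state: estimated value of each action and how often it was tried.
value_estimates = np.zeros(n_actions)
action_counts = np.zeros(n_actions, dtype=int)
epsilon = 0.1  # probability of exploring a random action
total_reward = 0.0

for step in range(n_steps):
    # Explore occasionally, otherwise exploit the best-known action.
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(value_estimates))

    # The environment returns a reward of 1 or 0.
    reward = float(rng.random() < true_reward_probs[action])
    total_reward += reward

    # Update the running average reward for the chosen action.
    action_counts[action] += 1
    value_estimates[action] += (reward - value_estimates[action]) / action_counts[action]

best_action = int(np.argmax(value_estimates))
print(f"Estimated action values: {np.round(value_estimates, 2)}")
print(f"Action the agent prefers: {best_action}")
print(f"Average reward per step: {total_reward / n_steps:.2f}")
```

The agent is never told which action is best. It finds out by trying actions and keeping a running average of the rewards it receives.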

**Applications:**
- Game playing (AlphaGo, chess)
- Robotics
- Autonomous driving

**Note**: This course focuses primarily on supervised and unsupervised learning.

## 4. The Machine Learning Workflow

A typical ML project follows these steps:

1. **Define the Problem**
   - What are you trying to predict?
   - Is it classification or regression?
   - How will you measure success?

2. **Collect and Explore Data**
   - Gather relevant data
   - Visualize and understand distributions
   - Check for missing values and outliers

3. **Prepare Data**
   - Split into training and test sets
   - Handle missing values
   - Encode categorical variables
   - Scale/normalize features (fit on the training set only)

4. **Choose and Train Model**
   - Select appropriate algorithm
   - Train on training data
   - Tune hyperparameters

5. **Evaluate Model**
   - Test on unseen data
   - Calculate performance metrics
   - Validate results

6. **Deploy and Monitor**
   - Put model into production
   - Monitor performance over time
   - Retrain as needed
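
Steps 3 to 5 of this workflow can be sketched in a few lines. The example below is a sketch under our own assumptions: it uses scikit-learn's built-in breast cancer dataset and a logistic regression model, neither of which is used elsewhere in this notebook. It chains scaling and training with `make_pipeline` so the scaler is fitted on the training data only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a binary classification problem on a built-in dataset.
X, y = load_breast_cancer(return_X_y=True)

# Step 3: hold out a test set before fitting any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3-4: the pipeline scales the features, then trains the model.
# Calling fit() learns the scaling statistics from the training data only.
workflow = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
workflow.fit(X_train, y_train)

# Step 5: evaluate on data the model has never seen.
test_accuracy = accuracy_score(y_test, workflow.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2%}")
```

Deployment and monitoring (step 6) depend heavily on where the model runs, so they are left out of this sketch.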

## 5. Introduction to Scikit-learn

**Scikit-learn** is one of the most widely used Python libraries for machine learning. It provides:

- **Simple and consistent API**: Most algorithms follow the same pattern
- **Wide variety of algorithms**: Classification, regression, clustering, etc.
- **Built-in datasets**: Perfect for learning and experimentation
- **Preprocessing tools**: Scaling, encoding, feature selection
- **Model evaluation**: Metrics, cross-validation, hyperparameter tuning

### The Scikit-learn API Pattern

Most scikit-learn models follow this consistent interface:

```python
# 1. Import the model class
from sklearn.some_module import SomeModel

# 2. Create a model instance (with hyperparameters)
model = SomeModel(param1=value1, param2=value2)

# 3. Train the model on training data
model.fit(X_train, y_train)

# 4. Make predictions
predictions = model.predict(X_test)

# 5. Evaluate the model
score = model.score(X_test, y_test)
```

This pattern makes it easy to swap between different algorithms!
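
Here is a small sketch of that swap in practice. The three models and their settings are chosen only for illustration; the point is that the loop body stays the same whichever model is used.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Three different algorithms, one shared interface.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same call for every model
    scores[name] = model.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{name}: {scores[name]:.2%}")
```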

## 6. Your First ML Model: Iris Classification

Let's build a complete ML model using the famous Iris dataset. This dataset contains measurements of 150 iris flowers from three species.

### 6.1 Load and Explore Data

In [None]:
# Load the Iris dataset (built into scikit-learn)
iris = datasets.load_iris()

# Convert to pandas DataFrame for easier exploration
iris_df = pd.DataFrame(
    data=iris.data,
    columns=iris.feature_names
)
iris_df['species'] = iris.target
iris_df['species_name'] = iris_df['species'].map(
    {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
)

# Display first few rows
print("First 5 rows of the Iris dataset:")
iris_df.head()

In [None]:
# Check dataset shape and info
print(f"Dataset shape: {iris_df.shape}")
print(f"\nFeatures: {iris.feature_names}")
print(f"Target classes: {iris.target_names}")
print("\nClass distribution:")
print(iris_df['species_name'].value_counts())

In [None]:
# Visualize the data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot of two features
for species in iris_df['species_name'].unique():
    subset = iris_df[iris_df['species_name'] == species]
    axes[0].scatter(
        subset['sepal length (cm)'],
        subset['sepal width (cm)'],
        label=species,
        alpha=0.7
    )
axes[0].set_xlabel('Sepal Length (cm)')
axes[0].set_ylabel('Sepal Width (cm)')
axes[0].set_title('Iris Species by Sepal Dimensions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Box plot of petal length
iris_df.boxplot(
    column='petal length (cm)',
    by='species_name',
    ax=axes[1]
)
axes[1].set_xlabel('Species')
axes[1].set_ylabel('Petal Length (cm)')
axes[1].set_title('Petal Length Distribution by Species')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

print("Notice how different species have distinct patterns in their measurements!")

### 6.2 Prepare Data for ML

We need to split our data into features (X) and target (y), then create training and test sets.

In [None]:
# Separate features (X) and target (y)
X = iris.data  # Features: 4 measurements
y = iris.target  # Target: species (0, 1, or 2)

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFirst sample features: {X[0]}")
print(f"First sample target: {y[0]} ({iris.target_names[y[0]]})")

In [None]:
# Split into training (80%) and test (20%) sets
# This is CRITICAL: we train on one set and test on another!
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,  # 20% for testing
    random_state=42,  # For reproducibility
    stratify=y  # Maintain class proportions
)

print(f"Training set size: {len(X_train)} samples")
print(f"Test set size: {len(X_test)} samples")
print("\nTraining set class distribution:")
print(pd.Series(y_train).value_counts().sort_index())
print("\nTest set class distribution:")
print(pd.Series(y_test).value_counts().sort_index())

### 6.3 Feature Scaling

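Many ML algorithms work better when features are on the same scale. This matters especially for distance-based methods such as KNN, where a feature with a larger numeric range would otherwise dominate the distance. We'll use **StandardScaler** to standardize our features so that each one has zero mean and unit variance on the training set.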
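
To see what `StandardScaler` computes, the sketch below compares it with a manual calculation on a tiny made-up array (the values in `X_demo` are invented for illustration). For each column, it subtracts the mean and divides by the standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales.
X_demo = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
    [4.0, 4000.0],
])

# Manual standardization: subtract each column's mean,
# then divide by its (population) standard deviation.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# StandardScaler performs the same computation.
scaled = StandardScaler().fit_transform(X_demo)

print("Column means after scaling:", scaled.mean(axis=0))
print("Column std devs after scaling:", scaled.std(axis=0))
print("Matches manual computation:", np.allclose(manual, scaled))
```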

In [None]:
# Create a scaler object
scaler = StandardScaler()

# IMPORTANT: Fit scaler on training data only!
# This prevents data leakage from test set
scaler.fit(X_train)

# Transform both training and test data using the same scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Before scaling (first training sample):")
print(X_train[0])
print("\nAfter scaling (first training sample):")
print(X_train_scaled[0])
print("\nNotice how the values are now on a comparable scale (each feature has mean 0 and unit variance across the training set)!")

### 6.4 Train a Model

We'll use **K-Nearest Neighbors (KNN)**, a simple but effective algorithm that classifies based on the k closest training examples.
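
Before using scikit-learn's version, it can help to see the core idea written out by hand. The function `knn_predict` below is a hypothetical helper written for this illustration, and the toy data is made up. It measures the distance to every training point, keeps the k closest, and takes a majority vote. Scikit-learn's implementation is more capable (for example, it can use faster neighbor searches), so treat this only as a sketch of the idea.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Predict one query point's label by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up): two well-separated groups of points.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([1.8, 1.2]))
pred_high = knn_predict(X_toy, y_toy, np.array([6.2, 6.8]))
print(f"Point near the first group  -> class {pred_low}")
print(f"Point near the second group -> class {pred_high}")
```

Because every prediction depends on distances, this also shows why feature scaling matters for KNN.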

In [None]:
# Create a KNN classifier with k=5
# This means: classify based on the 5 nearest neighbors
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train the model on scaled training data
knn_model.fit(X_train_scaled, y_train)

print("Model trained successfully!")
print(f"Model type: {type(knn_model).__name__}")
print(f"Number of neighbors: {knn_model.n_neighbors}")

### 6.5 Make Predictions

In [None]:
# Make predictions on test set
y_pred = knn_model.predict(X_test_scaled)

# Show first 10 predictions vs actual values
comparison_df = pd.DataFrame({
    # Use 'label' as the loop variable so it does not shadow the target array y
    'Actual': [iris.target_names[label] for label in y_test[:10]],
    'Predicted': [iris.target_names[label] for label in y_pred[:10]],
    'Correct': y_test[:10] == y_pred[:10]
})

print("First 10 predictions:")
print(comparison_df)
print(f"\nTotal correct predictions: {sum(y_test == y_pred)} out of {len(y_test)}")

### 6.6 Evaluate Model Performance

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2%}")
print(f"This means the model correctly classified {accuracy:.1%} of test samples!")

# Detailed classification report
print("\nDetailed Performance Report:")
print(classification_report(
    y_test,
    y_pred,
    target_names=iris.target_names
))

In [None]:
# Confusion matrix visualization
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=iris.target_names,
    yticklabels=iris.target_names
)
plt.title('Confusion Matrix: Iris Classification', fontsize=14)
plt.ylabel('Actual Species')
plt.xlabel('Predicted Species')
plt.tight_layout()
plt.show()

print("The diagonal shows correct predictions.")
print("Off-diagonal values show misclassifications.")

## 7. Key ML Terminology

Let's review important terms you'll encounter throughout this course:

### Data-Related Terms

- **Features (X)**: Input variables used for prediction (also called predictors, independent variables)
- **Target (y)**: Output variable we want to predict (also called label, dependent variable)
- **Sample**: A single data point (row in a dataset)
- **Training Set**: Data used to train the model
- **Test Set**: Data used to evaluate model performance (never seen during training!)
- **Validation Set**: Data used to tune hyperparameters (we'll cover this later)
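
One common way to obtain all three sets is to call `train_test_split` twice. The sketch below is only a preview under assumed proportions (60% train, 20% validation, 20% test); later modules may use other approaches such as cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split: set aside 20% as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: carve a validation set out of the remaining data.
# 25% of the remaining 80% equals 20% of the original dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")
```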

### Model-Related Terms

- **Model**: The learned pattern/function that maps features to target
- **Algorithm**: The method used to learn the model (e.g., KNN, Linear Regression)
- **Training/Fitting**: The process of learning from data
- **Prediction**: Using a trained model to make outputs for new inputs
- **Hyperparameters**: Settings you choose before training (e.g., k=5 in KNN)
- **Parameters**: Values learned during training (e.g., weights in linear regression)
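
The difference between hyperparameters and parameters is visible directly in code. In the sketch below, the toy data is made up and follows y = 3x + 2. The hyperparameter is passed to the constructor before training, while the parameters appear as attributes ending in an underscore after `fit()`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameter: chosen by us before training, passed to the constructor.
knn = KNeighborsClassifier(n_neighbors=5)
print(f"Hyperparameter n_neighbors: {knn.get_params()['n_neighbors']}")

# Parameters: learned from data during fit().
# Toy data (made up) generated from y = 3x + 2.
X_line = np.array([[0.0], [1.0], [2.0], [3.0]])
y_line = 3.0 * X_line.ravel() + 2.0

linreg = LinearRegression()
linreg.fit(X_line, y_line)
print(f"Learned weight (coef_): {linreg.coef_[0]:.2f}")
print(f"Learned intercept (intercept_): {linreg.intercept_:.2f}")
```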

### Performance-Related Terms

- **Accuracy**: Percentage of correct predictions
- **Overfitting**: Model performs great on training data but poorly on test data
- **Underfitting**: Model performs poorly on both training and test data
- **Generalization**: Model's ability to perform well on unseen data
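- **Bias-Variance Tradeoff**: Balance between a model that is too simple to capture the pattern (high bias, underfitting) and one that is too sensitive to the particular training data (high variance, overfitting); we'll dive deep into this later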
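
Overfitting and underfitting are easiest to recognize by comparing training and test scores side by side. The sketch below uses assumptions chosen for illustration: synthetic data from `make_classification` with some label noise, and decision trees of different depths. A very shallow tree is a simple model and an unrestricted tree is a complex one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so memorizing the training set hurts.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = {}
# max_depth=1 is a very simple model; max_depth=None lets the tree grow
# until it fits the training data as closely as it can.
for depth in (1, 4, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A large gap between training and test accuracy points to overfitting, while low scores on both point to underfitting.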

## 8. Practice Exercises

### Exercise 1: Try Different k Values

Train KNN models with k=3 and k=7. Which performs better? Why might that be?

In [None]:
# Your code here
# Hint: Create two models with different n_neighbors values
# Train both and compare their accuracy scores


### Exercise 2: Effect of Feature Scaling

Train a KNN model WITHOUT scaling the features (use X_train and X_test directly). Compare accuracy to the scaled version. What do you notice?

In [None]:
# Your code here
# Train KNN on unscaled data and compare accuracy


### Exercise 3: Predict New Samples

Create a new iris flower with these measurements:
- Sepal length: 5.0 cm
- Sepal width: 3.5 cm
- Petal length: 1.5 cm
- Petal width: 0.3 cm

Use your trained model to predict which species it belongs to. (Remember to scale it first!)

In [None]:
# Your code here
# Create a new sample, scale it, and make a prediction
# Hint: new_sample = np.array([[5.0, 3.5, 1.5, 0.3]])


### Exercise 4: Explore Wine Dataset

Scikit-learn has another built-in dataset: `datasets.load_wine()`. Load it, explore it, and build a KNN classifier. What accuracy do you achieve?

In [None]:
# Your code here
# Load wine dataset, split it, scale it, train KNN, evaluate


## 9. Summary

Congratulations on building your first machine learning model! Let's recap what we covered:

### Key Concepts

1. **Machine Learning** enables computers to learn from data without explicit programming
2. **Three types of ML**:
   - Supervised (labeled data): Classification and Regression
   - Unsupervised (unlabeled data): Clustering and Dimensionality Reduction
   - Reinforcement (learning through interaction)

3. **ML Workflow**:
   - Define problem → Collect data → Prepare data → Train model → Evaluate → Deploy

4. **Scikit-learn API**:
   - Import → Create → Fit → Predict → Score

5. **Critical Practices**:
   - Always split data into train/test sets
   - Fit preprocessing (scalers) on training data only
   - Evaluate on unseen test data

### What You Built

You created a complete ML pipeline:
- Loaded and explored data
- Split into train/test sets
- Scaled features
- Trained a KNN classifier
- Made predictions and evaluated performance

### Next Steps

In the next module, we'll dive deeper into:
- Understanding supervised vs unsupervised learning
- Different types of ML problems
- How to choose the right algorithm

### Additional Resources

- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Scikit-learn Tutorials](https://scikit-learn.org/stable/tutorial/index.html)
- [Machine Learning Glossary](https://developers.google.com/machine-learning/glossary)

## 10. Bonus: Quick Reference

### Common Scikit-learn Imports

```python
# Data splitting
from sklearn.model_selection import train_test_split

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Metrics
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.metrics import classification_report, confusion_matrix
```

### Basic ML Checklist

- [ ] Load and explore data
- [ ] Check for missing values
- [ ] Split into train/test sets (BEFORE fitting any preprocessing)
- [ ] Preprocess features (imputation, scaling, encoding), fitting on training data only
- [ ] Choose and train model
- [ ] Make predictions on test set
- [ ] Evaluate performance with appropriate metrics
- [ ] Iterate and improve