# Module 1: Introduction to Machine Learning

---

Welcome to the Machine Learning Courseware. This is the first module in a 10-part series that will take you from the fundamentals of Machine Learning to advanced topics like Neural Networks and NLP. By the end of this course, you will be able to build, evaluate, and deploy ML models with confidence.

**Prerequisites:** Basic Python, NumPy, and Pandas knowledge.

---

## Table of Contents

1. [What is Machine Learning?](#1.-What-is-Machine-Learning?)
2. [Types of Machine Learning](#2.-Types-of-Machine-Learning)
3. [The Machine Learning Workflow](#3.-The-Machine-Learning-Workflow)
4. [Key Terminology](#4.-Key-Terminology)
5. [Hands-On: Your First ML Model](#5.-Hands-On:-Your-First-ML-Model)
6. [Exercises](#6.-Exercises)
7. [Summary and Further Reading](#7.-Summary-and-Further-Reading)

---

## 1. What is Machine Learning?

### Definition

> *"Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."* — **Arthur Samuel, 1959**

> *"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."* — **Tom Mitchell, 1997**

In simpler terms, Machine Learning is about teaching computers to find patterns in data and make decisions or predictions based on those patterns.

### Traditional Programming vs Machine Learning

| Traditional Programming | Machine Learning |
|---|---|
| Input: **Data + Rules** | Input: **Data + Expected Outputs** |
| Output: **Answers** | Output: **Rules (Model)** |
| Human writes the logic | Machine learns the logic |
| Example: Spam filter with hand-coded rules | Example: Spam filter that learns from labeled emails |
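
The contrast in the table can be sketched on a toy spam example. Everything below is illustrative: the messages, labels, and keyword list are made up, the names `rule_based_is_spam`, `spam_vectorizer`, and `spam_model` are hypothetical, and `CountVectorizer` with `MultinomialNB` is just one convenient scikit-learn combination rather than a recommended setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only
toy_messages = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "meeting moved to monday",
    "lunch with the team today",
    "please review the report",
]
toy_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# Traditional programming: a human writes the rule
def rule_based_is_spam(text):
    spam_words = {"free", "prize", "win"}  # hand-picked keywords (assumption)
    return int(any(word in spam_words for word in text.split()))

# Machine learning: the rule is learned from data plus expected outputs
spam_vectorizer = CountVectorizer()                       # turns text into word counts
toy_features = spam_vectorizer.fit_transform(toy_messages)
spam_model = MultinomialNB()
spam_model.fit(toy_features, toy_labels)

new_message = "free prize inside"
rule_prediction = rule_based_is_spam(new_message)
learned_prediction = int(spam_model.predict(spam_vectorizer.transform([new_message]))[0])

print("Rule-based prediction:", rule_prediction)
print("Learned prediction:   ", learned_prediction)
```

In the rule-based version, changing the behavior means editing `spam_words` by hand; in the learned version, it means supplying more labeled messages and refitting.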

### Real-World Applications

Machine Learning is used across a wide range of industries:

- **Email**: Spam filtering, smart replies
- **Search engines**: Google ranking, autocomplete
- **E-commerce and streaming**: Product and content recommendations (Amazon, Netflix)
- **Healthcare**: Disease diagnosis, drug discovery
- **Autonomous vehicles**: Self-driving cars (Tesla, Waymo)
- **Virtual assistants**: Siri, Alexa, Google Assistant
- **Finance**: Fraud detection, credit scoring
- **Social media**: Face recognition, content moderation

---

## 2. Types of Machine Learning

Machine Learning can be broadly categorized into **three types**:

### 2.1 Supervised Learning

The model learns from **labeled data** — each input has a corresponding correct output.

- **Goal**: Learn a mapping from inputs to outputs
- **Types**:
  - **Regression**: Predict a continuous number (e.g., house price)
  - **Classification**: Predict a category or class (e.g., spam or not spam)
- **Common algorithms**: Linear Regression, Logistic Regression, Decision Trees, SVM, KNN
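
The difference between the two supervised task types can be sketched with two tiny models. The data below is made up for illustration (the house prices follow `price = 2 * size + 50` exactly), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous number.
# Made-up data: house size -> price, generated as price = 2 * size + 50
house_sizes = np.array([[50], [80], [100], [120], [150]])
house_prices = 2 * house_sizes.ravel() + 50
price_model = LinearRegression().fit(house_sizes, house_prices)
predicted_price = price_model.predict([[110]])[0]
print(f"Predicted price for size 110: {predicted_price:.1f}")

# Classification: predict a category.
# Made-up data: number of suspicious words -> spam (1) or not spam (0)
suspicious_word_counts = np.array([[0], [1], [2], [8], [9], [10]])
is_spam = np.array([0, 0, 0, 1, 1, 1])
spam_classifier = LogisticRegression().fit(suspicious_word_counts, is_spam)
predicted_classes = spam_classifier.predict([[1], [9]])
print(f"Predicted classes for 1 and 9 suspicious words: {predicted_classes}")
```

Both models are trained with the same `fit(inputs, outputs)` call; what differs is whether the output is a number on a continuous scale or one of a fixed set of classes.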

### 2.2 Unsupervised Learning

The model works with **unlabeled data** — it must discover hidden patterns on its own.

- **Goal**: Find structure in the data
- **Types**:
  - **Clustering**: Group similar data points (e.g., customer segments)
  - **Dimensionality Reduction**: Reduce the number of features while preserving as much information as possible (e.g., PCA)
  - **Association**: Find rules in transactions (e.g., market basket analysis)
- **Common algorithms**: K-Means, DBSCAN, PCA, t-SNE, Apriori
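
As a minimal sketch of clustering and dimensionality reduction, the block below runs K-Means and PCA on synthetic data. The dataset and all variable names are assumptions made for illustration; note that the generating labels returned by `make_blobs` are discarded, so neither algorithm sees them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data: 90 points in 4 dimensions, generated around 3 centers.
# The generating labels are thrown away on purpose.
blob_points, _ = make_blobs(n_samples=90, centers=3, n_features=4,
                            cluster_std=0.8, random_state=42)

# Clustering: K-Means assigns each point to one of 3 groups using only the features
kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans_model.fit_predict(blob_points)
print("Points per discovered cluster:", np.bincount(cluster_ids))

# Dimensionality reduction: PCA projects the 4 features onto 2 components
pca_model = PCA(n_components=2)
points_2d = pca_model.fit_transform(blob_points)
explained = pca_model.explained_variance_ratio_.sum()
print(f"Shape before: {blob_points.shape}, after: {points_2d.shape}")
print(f"Variance retained by 2 components: {explained:.2%}")
```

The number of clusters is itself something you must choose here; K-Means does not discover it on its own.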

### 2.3 Reinforcement Learning

The model learns by **interacting with an environment** and receiving rewards or penalties.

- **Goal**: Learn a strategy (policy) that maximizes cumulative reward
- **Key concepts**: Agent, Environment, State, Action, Reward
- **Applications**: Game playing (AlphaGo), robotics, recommendation systems
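
The key concepts above can be sketched with tabular Q-learning, one common reinforcement learning algorithm, on a tiny hypothetical environment: a corridor of 5 cells where the agent starts in cell 0 and is rewarded only on reaching cell 4. The environment, the `step` function, and the values of `alpha`, `gamma`, and `epsilon` are all assumptions chosen for illustration.

```python
import numpy as np

n_states, n_actions = 5, 2   # actions: 0 = move left, 1 = move right
goal_state = n_states - 1

def step(state, action):
    """Hypothetical environment: return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(goal_state, state + 1)
    reward = 1.0 if next_state == goal_state else 0.0
    return next_state, reward, next_state == goal_state

rng = np.random.default_rng(42)
q_table = np.zeros((n_states, n_actions))  # q_table[s, a]: estimated value of action a in state s
alpha, gamma, epsilon = 0.5, 0.9, 0.2      # learning rate, discount factor, exploration rate

for episode in range(200):
    state = 0
    for _ in range(100):  # cap the episode length
        # Epsilon-greedy: explore at random sometimes (and when the estimates are tied)
        if rng.random() < epsilon or q_table[state, 0] == q_table[state, 1]:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted best future value
        target = reward + gamma * np.max(q_table[next_state]) * (not done)
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state
        if done:
            break

# Greedy policy for each non-goal state (1 means "move right")
learned_policy = np.argmax(q_table[:goal_state], axis=1)
print("Learned policy:", learned_policy)
```

Here the agent is the loop that picks actions, the environment is `step`, the state is the cell index, and the policy is read off the learned table by taking the highest-valued action in each state.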

In [None]:
# Visualizing the three types of Machine Learning

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# --- Supervised Learning (Classification Example) ---
np.random.seed(42)
x_class0 = np.random.randn(30, 2) + np.array([2, 2])
x_class1 = np.random.randn(30, 2) + np.array([-2, -2])
axes[0].scatter(x_class0[:, 0], x_class0[:, 1], c='#2196F3', label='Class A',
                s=60, edgecolors='white', linewidth=0.5)
axes[0].scatter(x_class1[:, 0], x_class1[:, 1], c='#FF5722', label='Class B',
                s=60, edgecolors='white', linewidth=0.5)
x_line = np.linspace(-5, 5, 100)
axes[0].plot(x_line, -x_line, 'k--', alpha=0.5, linewidth=2)
axes[0].set_title('Supervised Learning\n(Classification)', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].set_xlim(-5, 5)
axes[0].set_ylim(-5, 5)

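# --- Unsupervised Learning (Clustering Example) ---
# Colors come from make_blobs' generating labels, standing in for the groups
# a clustering algorithm would discover without being given any labels.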
from sklearn.datasets import make_blobs
X_blobs, labels = make_blobs(n_samples=90, centers=3, cluster_std=0.8, random_state=42)
colors = ['#4CAF50', '#FF9800', '#9C27B0']
for i in range(3):
    mask = (labels == i)
    axes[1].scatter(X_blobs[mask, 0], X_blobs[mask, 1], c=colors[i],
                    s=60, edgecolors='white', linewidth=0.5)
axes[1].set_title('Unsupervised Learning\n(Clustering)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].text(0.05, 0.95, 'No labels provided —\nmodel discovers groups',
             transform=axes[1].transAxes, fontsize=11, verticalalignment='top',
             style='italic', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# --- Reinforcement Learning (Grid World Illustration) ---
grid = np.zeros((5, 5))
axes[2].imshow(grid, cmap='Pastel1', extent=[0, 5, 0, 5])
for i in range(6):
    axes[2].axhline(i, color='gray', linewidth=0.5)
    axes[2].axvline(i, color='gray', linewidth=0.5)
axes[2].text(0.5, 0.5, 'Agent', fontsize=11, ha='center', va='center',
             bbox=dict(boxstyle='round', facecolor='#2196F3', alpha=0.7), color='white')
axes[2].text(4.5, 4.5, 'Goal', fontsize=11, ha='center', va='center',
             bbox=dict(boxstyle='round', facecolor='#4CAF50', alpha=0.7), color='white')
axes[2].add_patch(mpatches.Rectangle((2, 1), 1, 1, color='#F44336', alpha=0.5))
axes[2].add_patch(mpatches.Rectangle((1, 3), 1, 1, color='#F44336', alpha=0.5))
axes[2].add_patch(mpatches.Rectangle((3, 2), 1, 1, color='#F44336', alpha=0.5))
axes[2].annotate('', xy=(1.5, 0.5), xytext=(0.5, 0.5),
                arrowprops=dict(arrowstyle='->', color='#2196F3', lw=2))
axes[2].annotate('', xy=(1.5, 1.5), xytext=(1.5, 0.5),
                arrowprops=dict(arrowstyle='->', color='#2196F3', lw=2))
axes[2].set_title('Reinforcement Learning\n(Agent navigates to goal)', fontsize=14, fontweight='bold')
axes[2].set_xticks([])
axes[2].set_yticks([])

plt.tight_layout()
plt.suptitle('Three Types of Machine Learning', fontsize=16, fontweight='bold', y=1.02)
plt.show()

---

## 3. The Machine Learning Workflow

Most ML projects follow a similar pipeline:

```
Define Problem --> Collect Data --> Prepare Data --> Choose Model --> Train Model --> Evaluate --> Deploy
```

| Step | Description | Tools / Libraries |
|------|-------------|-------------------|
| 1. Define Problem | What do you want to predict or discover? | Domain knowledge |
| 2. Collect Data | Gather relevant data | APIs, CSVs, web scraping, databases |
| 3. Prepare Data | Clean, transform, engineer features | Pandas, NumPy, Scikit-learn |
| 4. Choose Model | Pick an appropriate algorithm | Scikit-learn, XGBoost, TensorFlow |
| 5. Train Model | Fit the model to training data | `model.fit()` |
| 6. Evaluate Model | Measure performance on unseen data | Accuracy, F1, RMSE, ROC AUC |
| 7. Deploy / Iterate | Put in production or improve | Flask, Docker, cloud services |

In practice, steps 3 through 6 are often repeated multiple times as you refine your approach.

---

## 4. Key Terminology

Before we dive into code, let us define the essential terms you will encounter throughout this course:

| Term | Definition | Example |
|------|-----------|--------|
| **Feature** | An input variable used to make predictions | House size, number of rooms |
| **Label / Target** | The output variable we want to predict | House price |
| **Sample / Instance** | A single data point (one row) | One house record |
| **Dataset** | Collection of samples | Table of 1000 houses |
| **Training Set** | Data used to train the model | 80% of the dataset |
| **Test Set** | Data used to evaluate the model | 20% of the dataset |
| **Model** | The mathematical function learned from data | Linear equation, decision tree |
| **Training** | Process of learning patterns from data | Calling `model.fit(X, y)` |
| **Inference** | Using a trained model on new data | Calling `model.predict(X_new)` |
| **Overfitting** | Model memorizes training data, performs poorly on new data | High train accuracy, low test accuracy |
| **Underfitting** | Model is too simple to capture underlying patterns | Low accuracy on both train and test |
| **Hyperparameter** | A setting configured before training begins | Learning rate, tree depth |

---

## 5. Hands-On: Your First ML Model

Let us build a complete ML pipeline using the well-known **Iris dataset**. This dataset contains measurements of 150 iris flowers from 3 species.

We will follow the ML workflow step by step.

### Steps 1 and 2: Define the Problem and Collect Data

**Problem**: Given measurements of an iris flower (sepal length, sepal width, petal length, petal width), predict which species it belongs to.

**Data**: We will use `sklearn.datasets.load_iris()`, a classic dataset that comes built into scikit-learn.

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Set a clean visual style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

# Load the Iris dataset
iris = load_iris()

# Create a DataFrame for easier exploration
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(f"Dataset shape: {df.shape}")
print(f"\nFeature names: {iris.feature_names}")
print(f"Target classes: {list(iris.target_names)}")
print(f"\nSamples per class:")
print(df['species'].value_counts())
print("\n--- First 10 rows ---")
df.head(10)

### Step 3: Explore and Prepare the Data

Before training a model, we need to understand the structure of our data. Visualization is a critical part of this step.

In [None]:
# Basic statistics
print("=" * 60)
print("DESCRIPTIVE STATISTICS")
print("=" * 60)
df.describe().round(2)

In [None]:
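# Check for missing values
total_missing = int(df.isnull().sum().sum())
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {total_missing}")
# Report the outcome of the check instead of asserting it unconditionally
if total_missing == 0:
    print("\nNo missing values found in this dataset.")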

In [None]:
# Pairplot — visualize relationships between all features, colored by species
g = sns.pairplot(df, hue='species', height=2.5,
                 plot_kws={'alpha': 0.7, 's': 50, 'edgecolor': 'white', 'linewidth': 0.5},
                 diag_kws={'alpha': 0.6})
g.figure.suptitle('Iris Dataset — Feature Relationships by Species', y=1.02,
                  fontsize=16, fontweight='bold')
plt.show()

print("\nObservations:")
print("  - Setosa is clearly separable from the other two species.")
print("  - Versicolor and Virginica overlap in several feature spaces.")
print("  - Petal length and petal width appear to be the most discriminative features.")

In [None]:
# Feature distributions by species
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
colors = ['#2196F3', '#FF9800', '#4CAF50']

for idx, feature in enumerate(iris.feature_names):
    ax = axes[idx // 2, idx % 2]
    for i, species in enumerate(iris.target_names):
        data = df[df['species'] == species][feature]
        ax.hist(data, bins=15, alpha=0.6, label=species, color=colors[i], edgecolor='white')
    ax.set_title(f'{feature}', fontsize=13, fontweight='bold')
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.legend(fontsize=10)

plt.suptitle('Feature Distributions by Species', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
fig, ax = plt.subplots(figsize=(8, 6))
correlation = df.drop('species', axis=1).corr()
sns.heatmap(correlation, annot=True, cmap='RdYlBu_r', center=0, ax=ax,
            square=True, linewidths=1, fmt='.2f',
            cbar_kws={'label': 'Correlation Coefficient'})
ax.set_title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nObservations:")
print("  - Petal length and petal width are highly correlated (0.96).")
print("  - Sepal width is only weakly to moderately (and negatively) correlated with the other features.")

### Step 4: Split the Data

We split our data into **training** and **test** sets. The model learns from the training set and is evaluated on the held-out test set, which simulates unseen real-world data.

In [None]:
from sklearn.model_selection import train_test_split

# Features (X) and Target (y)
X = iris.data    # shape: (150, 4)
y = iris.target  # shape: (150,)

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% reserved for testing
    random_state=42,   # ensures reproducibility
    stratify=y         # maintains class distribution in both sets
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set:     {X_test.shape[0]} samples")
print(f"\nClass distribution in training set: {np.bincount(y_train)}")
print(f"Class distribution in test set:     {np.bincount(y_test)}")

### Step 5: Choose and Train a Model

We will use **K-Nearest Neighbors (KNN)** — one of the simplest and most intuitive ML algorithms.

**How KNN works:**
1. Given a new data point, find the **K closest points** in the training set (using distance).
2. Assign the **majority class** among those K neighbors to the new point.

It is a non-parametric, instance-based algorithm: it does not learn an explicit function. Instead, it stores the training data and compares each new point against it at prediction time.
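
The two steps above can be sketched from scratch in a few lines of NumPy. This is an illustrative sketch, not scikit-learn's implementation: the function `knn_predict`, the toy points, and the choice of Euclidean distance are assumptions, and the variable names are kept separate from the ones used in the rest of this notebook.

```python
from collections import Counter

import numpy as np

def knn_predict(train_points, train_classes, new_point, k=3):
    """Predict the class of new_point by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the new point to every training point
    distances = np.linalg.norm(train_points - new_point, axis=1)
    nearest_indices = np.argsort(distances)[:k]
    # Step 2: majority class among the k nearest neighbors
    votes = Counter(train_classes[nearest_indices].tolist())
    return votes.most_common(1)[0][0]

# Tiny made-up training set: two groups of three points each
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                       [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
toy_classes = np.array([0, 0, 0, 1, 1, 1])

near_first_group = knn_predict(toy_points, toy_classes, np.array([2.0, 2.0]), k=3)
near_second_group = knn_predict(toy_points, toy_classes, np.array([6.0, 7.0]), k=3)
print("Prediction near the first group: ", near_first_group)
print("Prediction near the second group:", near_second_group)
```

Notice that there is no training step at all: all the work happens at prediction time, which is why KNN is described as instance-based.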

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier with K=5 (consider 5 nearest neighbors)
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model on the training data
knn.fit(X_train, y_train)

print("Model trained successfully.")
print(f"\nModel: {knn}")
print(f"Number of training samples: {X_train.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")

### Step 6: Evaluate the Model

Now let us measure how well our model performs on the **test set** — data it has never seen during training.

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Overall accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.2%}")

# Detailed classification report
print("\n" + "=" * 60)
print("CLASSIFICATION REPORT")
print("=" * 60)
print(classification_report(y_test, y_pred, target_names=iris.target_names))

In [None]:
# Confusion matrix visualization
fig, ax = plt.subplots(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names,
            square=True, linewidths=1, ax=ax,
            annot_kws={'size': 16})
ax.set_xlabel('Predicted Label', fontsize=13)
ax.set_ylabel('True Label', fontsize=13)
ax.set_title('Confusion Matrix — KNN Classifier', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nHow to read the confusion matrix:")
print("  - Diagonal entries (top-left to bottom-right) represent correct predictions.")
print("  - Off-diagonal entries represent misclassifications.")
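
The per-class numbers in a classification report can be derived directly from a confusion matrix. The sketch below uses a hypothetical 3-class matrix whose counts are invented for illustration; it is not the output of the model above.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm_example = np.array([
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
])

correct_per_class = np.diag(cm_example)                   # diagonal = correct predictions
example_accuracy = correct_per_class.sum() / cm_example.sum()

# Precision: of everything predicted as class c, the fraction that truly is c (column sums)
example_precision = correct_per_class / cm_example.sum(axis=0)
# Recall: of everything that truly is class c, the fraction predicted as c (row sums)
example_recall = correct_per_class / cm_example.sum(axis=1)
# F1: harmonic mean of precision and recall
example_f1 = 2 * example_precision * example_recall / (example_precision + example_recall)

print(f"Accuracy:  {example_accuracy:.2f}")
print(f"Precision: {np.round(example_precision, 2)}")
print(f"Recall:    {np.round(example_recall, 2)}")
print(f"F1:        {np.round(example_f1, 2)}")
```

Reading along a row shows where samples of one true class ended up; reading down a column shows what a given prediction was made of.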

### Making Predictions on New Data

Let us simulate predicting the species of a flower whose measurements we have not seen before.

In [None]:
# Predict the species of a new flower
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])  # sepal_l, sepal_w, petal_l, petal_w

prediction = knn.predict(new_flower)
probabilities = knn.predict_proba(new_flower)

print("New Flower Measurements:")
print(f"   Sepal Length: {new_flower[0][0]} cm")
print(f"   Sepal Width:  {new_flower[0][1]} cm")
print(f"   Petal Length: {new_flower[0][2]} cm")
print(f"   Petal Width:  {new_flower[0][3]} cm")
print(f"\nPredicted Species: {iris.target_names[prediction[0]]}")
print(f"\nPrediction Probabilities:")
for name, prob in zip(iris.target_names, probabilities[0]):
    bar = '#' * int(prob * 30)
    print(f"   {name:>12s}: {prob:.2%} {bar}")

### Effect of K on Model Performance

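The choice of **K** (number of neighbors) is a hyperparameter that significantly affects the model. Let us examine how different values of K influence accuracy. Keep in mind that picking the K with the highest test accuracy, as the plot below does for illustration, makes that test score an optimistic estimate of performance on new data.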

In [None]:
# Test different values of K
k_values = range(1, 26)
train_accuracies = []
test_accuracies = []

for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    train_accuracies.append(knn_k.score(X_train, y_train))
    test_accuracies.append(knn_k.score(X_test, y_test))

# Plot training vs test accuracy for each K
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(k_values, train_accuracies, 'o-', label='Training Accuracy',
        color='#2196F3', linewidth=2, markersize=6)
ax.plot(k_values, test_accuracies, 's-', label='Test Accuracy',
        color='#FF5722', linewidth=2, markersize=6)
ax.fill_between(k_values, test_accuracies, alpha=0.1, color='#FF5722')

best_k = k_values[np.argmax(test_accuracies)]
ax.axvline(x=best_k, color='green', linestyle='--', alpha=0.7, label=f'Best K={best_k}')

ax.set_xlabel('K (Number of Neighbors)', fontsize=13)
ax.set_ylabel('Accuracy', fontsize=13)
ax.set_title('KNN: Effect of K on Model Accuracy', fontsize=14, fontweight='bold')
ax.legend(fontsize=12)
ax.set_xticks(k_values)
ax.set_ylim(0.85, 1.02)
plt.tight_layout()
plt.show()

print(f"\nBest K: {best_k} (Test Accuracy: {max(test_accuracies):.2%})")
print("\nKey Insight:")
print("  - Small K: the model is highly sensitive to noise (risk of overfitting).")
print("  - Large K: the model becomes overly general (risk of underfitting).")
print("  - The optimal K balances these two extremes.")
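
One way to choose K without consulting the test set is k-fold cross-validation on the training data. The sketch below is a hedged illustration rather than part of the walkthrough above: the helper `select_k_by_cross_validation` and the `cv_`-prefixed names are hypothetical, and 5 folds is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

def select_k_by_cross_validation(train_X, train_y, k_candidates, n_folds=5):
    """Return (best_k, mean_scores), scoring each K on the training data only."""
    mean_scores = []
    for k in k_candidates:
        fold_scores = cross_val_score(
            KNeighborsClassifier(n_neighbors=k), train_X, train_y, cv=n_folds
        )
        mean_scores.append(fold_scores.mean())
    best_k = k_candidates[int(np.argmax(mean_scores))]
    return best_k, mean_scores

# Same split settings as earlier in this module, loaded again so the sketch is self-contained
cv_X, cv_y = load_iris(return_X_y=True)
cv_X_train, cv_X_test, cv_y_train, cv_y_test = train_test_split(
    cv_X, cv_y, test_size=0.2, random_state=42, stratify=cv_y
)

cv_k_candidates = list(range(1, 26))
cv_best_k, cv_mean_scores = select_k_by_cross_validation(cv_X_train, cv_y_train, cv_k_candidates)

# The test set is used once, after K has been chosen
cv_final_model = KNeighborsClassifier(n_neighbors=cv_best_k).fit(cv_X_train, cv_y_train)
cv_test_accuracy = cv_final_model.score(cv_X_test, cv_y_test)
print(f"K chosen by cross-validation: {cv_best_k}")
print(f"Test accuracy with that K:    {cv_test_accuracy:.2%}")
```

Because the test set plays no part in selecting K, the final test accuracy is a fairer estimate of how the model would do on new flowers.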

---

## 6. Exercises

### Exercise 1: Classify the Scenario

For each scenario below, determine whether it is **Supervised**, **Unsupervised**, or **Reinforcement Learning**, and whether the task is **Classification**, **Regression**, **Clustering**, or something else (one scenario fits none of the three).

In [None]:
# Exercise 1: Replace the "???" with your answers

scenarios = {
    "Predicting house prices based on features": {
        "ml_type": "???",      # Supervised / Unsupervised / Reinforcement
        "task_type": "???",     # Regression / Classification / Clustering
    },
    "Grouping customers by purchasing behavior": {
        "ml_type": "???",
        "task_type": "???",
    },
    "Email spam detection": {
        "ml_type": "???",
        "task_type": "???",
    },
    "A robot learning to walk": {
        "ml_type": "???",
        "task_type": "???",
    },
    "Predicting tomorrow's temperature": {
        "ml_type": "???",
        "task_type": "???",
    },
}

for scenario, answers in scenarios.items():
    print(f"Scenario: {scenario}")
    print(f"   ML Type:   {answers['ml_type']}")
    print(f"   Task Type: {answers['task_type']}")
    print()

<details>
<summary><b>Click here to see the answers</b></summary>

| Scenario | ML Type | Task Type |
|---|---|---|
| Predicting house prices | Supervised | Regression |
| Grouping customers | Unsupervised | Clustering |
| Email spam detection | Supervised | Classification |
| A robot learning to walk | Reinforcement | Policy learning |
| Predicting temperature | Supervised | Regression |

</details>

### Exercise 2: Explore a New Dataset

Load the **Wine dataset** from scikit-learn, explore it, and build a KNN classifier. Follow the same steps we used for the Iris dataset above.

In [None]:
# Exercise 2: Complete the code below
from sklearn.datasets import load_wine

# Step 1: Load the dataset
wine = load_wine()

# TODO: Create a DataFrame from the wine dataset
# wine_df = pd.DataFrame(???)

# TODO: Print the shape, feature names, and target names

# TODO: Display the first 5 rows

# TODO: Check for missing values

# TODO: Split the data into train and test sets (80/20)

# TODO: Train a KNN classifier (try K=3)

# TODO: Print the test accuracy

# TODO: Try different K values and plot the results

print("Hint: Follow the exact same steps we used for the Iris dataset above.")

<details>
<summary><b>Click here for the solution</b></summary>

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

wine = load_wine()
wine_df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

print(f"Shape: {wine_df.shape}")
print(f"Features: {wine.feature_names}")
print(f"Classes: {wine.target_names}")
print(wine_df.head())
print(f"Missing values: {wine_df.isnull().sum().sum()}")

X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(f"Test Accuracy: {knn.score(X_test, y_test):.2%}")
```

</details>

### Exercise 3: Conceptual Questions

Answer the following in your own words (add a markdown cell below each question):

1. What is the difference between a feature and a label?
2. Why is it important to split data into training and test sets?
3. What happens if K=1 in KNN? What about K=n (total number of training samples)?
4. Can you think of a real-world problem in your domain where Machine Learning could be useful?

---

## 7. Summary and Further Reading

### What We Covered

- Machine Learning enables computers to learn patterns from data rather than following explicit rules.
- There are three main categories: Supervised, Unsupervised, and Reinforcement Learning.
- The ML workflow consists of: problem definition, data collection, preprocessing, model selection, training, evaluation, and deployment.
- We built our first classifier using KNN on the Iris dataset and evaluated it on a held-out test set with accuracy, a classification report, and a confusion matrix.
- The hyperparameter K controls the trade-off between overfitting and underfitting.

### Recommended Reading

- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course)
- Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*, O'Reilly, 3rd Edition
- Gareth James et al., *An Introduction to Statistical Learning (ISLR)* — freely available at https://www.statlearning.com

### Next Module

In **Module 2: Mathematical Foundations**, we will cover the essential mathematics behind ML algorithms — linear algebra, statistics, probability, and calculus — all through Python code and visualizations.

---