<a href="https://colab.research.google.com/github/Haseeb-zai30/Ai-notebooks/blob/main/day_5_intro_to_Machine_learning_%26_Evaluation_metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## INTRODUCTION TO MACHINE LEARNING


Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that allows computers to learn patterns from data and make predictions or decisions without being explicitly programmed for every scenario.

**Traditional programming**: You write the rules (if-else conditions) → the machine follows them.

**Machine Learning**: You provide data + expected results → the machine learns the rules by itself.

**Key Idea:**

Instead of hard-coding logic, we train a model on data so it can generalize to unseen situations.

**Example:**

**Spam Email Detection**

**Input:** Email text

**Output:** Spam (1) or Not Spam (0)

The ML model learns from past emails that have already been labeled as spam or not spam.
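
To make the contrast between hand-written rules and learned rules concrete, here is a minimal sketch. It assumes a tiny made-up set of labeled emails and a deliberately naive word-counting scheme, so it is an illustration rather than a realistic spam filter. The names `rule_based_is_spam`, `train_word_counts`, and `predict` are hypothetical.

```python
from collections import Counter

# Hypothetical labeled emails: (text, label) with 1 = spam, 0 = not spam
emails = [
    ("win free money now", 1),
    ("free prize claim now", 1),
    ("meeting agenda for monday", 0),
    ("lunch on monday with the team", 0),
]

def rule_based_is_spam(text):
    """Traditional programming: a single hand-written rule."""
    return 1 if "free" in text.split() else 0

def train_word_counts(labeled_emails):
    """Toy learning step: count how often each word appears in each class."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in labeled_emails:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Label an email by which class has seen its words more often."""
    words = text.split()
    spam_votes = sum(counts[1][w] for w in words)
    ham_votes = sum(counts[0][w] for w in words)
    return 1 if spam_votes > ham_votes else 0

counts = train_word_counts(emails)
new_email = "claim your prize money"
print("Rule-based:", rule_based_is_spam(new_email))
print("Learned:", predict(counts, new_email))
```

The hand-written rule misses this email because it never mentions "free", while the counts learned from the labeled examples still mark it as spam.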

### Why Machine Learning?

- Handles large-scale data better than manual programming.
- Learns hidden patterns that are not obvious to humans.
- Adapts to new, unseen data.
- Forms the base for modern AI applications:
  - Speech recognition (e.g., Siri, Google Assistant)
  - Image recognition (e.g., Face Unlock)
  - Recommendation systems (e.g., Netflix, YouTube)

## Mathematical Perspective

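A Machine Learning model tries to approximate an unknown function $f$ that maps inputs to outputs:

$$
y = f(x) + \epsilon
$$

Where:

- $x$ = input (features, e.g., house size, number of rooms)
- $y$ = output (target, e.g., house price)
- $\epsilon$ = error (the difference between the actual value $y$ and $f(x)$, e.g., noise that the features cannot explain)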
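
As an illustration of this view, the sketch below generates synthetic data from an assumed "true" function plus random noise, then recovers an approximation of that function with a closed-form least-squares line fit. The relationship `price = 50 + 0.1 * size` and the noise level are made up for the example.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def true_f(size):
    """Assumed 'true' relationship, chosen only for illustration."""
    return 50 + 0.1 * size

sizes = [random.uniform(500, 3000) for _ in range(200)]
# y = f(x) + epsilon, with epsilon drawn from a normal distribution
prices = [true_f(x) + random.gauss(0, 10) for x in sizes]

# Closed-form least-squares fit of a line y = a + b * x
n = len(sizes)
x_mean = sum(sizes) / n
y_mean = sum(prices) / n
b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(sizes, prices))
     / sum((x - x_mean) ** 2 for x in sizes))
a = y_mean - b * x_mean

print(f"Estimated intercept: {a:.2f}, estimated slope: {b:.4f}")
```

The fitted values should land close to, but not exactly on, 50 and 0.1, because the noise term prevents a perfect recovery of the underlying function.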

## Types of Machine Learning
### 1. Supervised Learning

**Meaning:** The model learns from labeled data (data with both input and correct output).

**Goal:** Predict outcomes for new, unseen data.

**Examples:**

**Regression** → Predicting house prices based on size and location.

**Classification** → Email spam filter (spam or not spam).
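
Both examples rely on the same idea: use labeled pairs to predict the label of a new input. The sketch below uses a 1-nearest-neighbor rule, picked here only because it is short, on small made-up datasets. The helper name `nearest_label` is hypothetical.

```python
# Hypothetical labeled data: house size in sq ft -> price in $1000 (regression target)
houses = [(800, 150), (1200, 210), (1500, 260), (2000, 330)]

# Hypothetical labeled data: number of links in an email -> 1 = spam, 0 = not spam
emails = [(0, 0), (1, 0), (7, 1), (9, 1)]

def nearest_label(labeled, x):
    """Return the label of the training example whose input is closest to x."""
    _, closest_label = min(labeled, key=lambda pair: abs(pair[0] - x))
    return closest_label

print("Predicted price for 1400 sq ft:", nearest_label(houses, 1400))
print("Predicted class for 8 links:", nearest_label(emails, 8))
```

The only difference between the two calls is the type of label: a number for regression and a category for classification.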

### 2. Unsupervised Learning

**Meaning:** The model learns from unlabeled data (only inputs, no outputs).

**Goal:** Find hidden patterns or groupings.

**Examples:**

**Clustering** → Customer segmentation (grouping customers by buying habits).

**Dimensionality Reduction** → Compressing image data while keeping important features.
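
A minimal sketch of clustering, assuming made-up monthly spending figures and a plain 1-D k-means loop with two hand-picked starting centers. No labels are supplied, so the groups emerge from the data alone. The function name `kmeans_1d` is hypothetical.

```python
# Hypothetical monthly spend per customer (in dollars)
spend = [12, 15, 14, 90, 95, 88, 13, 92]

def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on 1-D data: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Keep the old center if a cluster ends up empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(spend, centers=[0.0, 100.0])
print("Centers:", centers)
print("Clusters:", clusters)
```

The algorithm separates low spenders from high spenders without ever being told which customer belongs to which group.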

### 3. Semi-Supervised Learning

**Meaning:** Uses a small amount of labeled data + a large amount of unlabeled data.

**Goal:** Improve learning when labeling all data is too costly or difficult.

**Examples:**

**Medical diagnosis** → Few labeled scans (with doctor’s notes) + many unlabeled scans.

**Speech recognition** → A few labeled audio clips + lots of raw recordings.
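
One simple way to use unlabeled data is self-training: fit a model on the few labeled points, let it label the unlabeled ones, then refit on everything. The sketch below assumes a single made-up numeric feature and a nearest-class-mean model. It also accepts every pseudo-label, whereas practical versions usually keep only confident ones. The helper names are hypothetical.

```python
# Hypothetical 1-D feature with only two labeled examples: (feature, label)
labeled = [(1.0, 0), (9.0, 1)]
unlabeled = [1.5, 2.0, 2.8, 7.5, 8.2, 8.8]

def class_means(points):
    """Mean feature value for each class."""
    means = {}
    for label in {lbl for _, lbl in points}:
        vals = [x for x, lbl in points if lbl == label]
        means[label] = sum(vals) / len(vals)
    return means

def predict(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))

# Step 1: fit on the labeled data only
means = class_means(labeled)
# Step 2: pseudo-label the unlabeled data with the current model
pseudo_labeled = [(x, predict(means, x)) for x in unlabeled]
# Step 3: refit on labeled + pseudo-labeled data
means = class_means(labeled + pseudo_labeled)

print("Pseudo-labels:", pseudo_labeled)
print("Updated class means:", means)
```

After the refit, the class means reflect all eight points rather than just the two that were labeled by hand.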

### 4. Reinforcement Learning (RL)

**Meaning:** An agent learns by interacting with an environment and receiving rewards or penalties.

**Goal:** Maximize rewards over time.

**Examples:**

**Self-driving cars** → Learn to drive safely by trial and error.

**Games** → An AI playing chess or Atari games learns from winning and losing.
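
The reward-driven loop can be sketched with a two-armed bandit, a simplified RL setting with actions and rewards but no changing state. The reward probabilities, the epsilon-greedy strategy, and the step count below are assumptions chosen for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical environment: two actions whose reward probabilities are hidden from the agent
reward_prob = [0.2, 0.8]

def pull(action):
    """Return reward 1 with the action's probability, otherwise 0."""
    return 1 if random.random() < reward_prob[action] else 0

estimates = [0.0, 0.0]  # running average reward per action
counts = [0, 0]         # how many times each action was tried
epsilon = 0.1           # fraction of steps spent exploring

for step in range(2000):
    if random.random() < epsilon:
        action = random.randrange(2)                         # explore
    else:
        action = max(range(2), key=lambda a: estimates[a])   # exploit
    reward = pull(action)
    counts[action] += 1
    # Incremental update of the running average
    estimates[action] += (reward - estimates[action]) / counts[action]

print("Estimated rewards:", [round(e, 2) for e in estimates])
print("Times each action was chosen:", counts)
```

The agent is never told which action is better. It discovers this from the rewards it collects and ends up choosing the better action most of the time.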

## The Machine Learning Pipeline

### Collect Data
→ Gather data from sources (databases, sensors, websites, etc.).

**Example:**

Gather housing data (location, size, number of rooms, price history) from property websites or government records.

### Preprocess Data
→ Clean it (remove errors, handle missing values, normalize, encode categories).

**Example:**

Fill missing values (like missing "number of bathrooms"), remove duplicates, normalize prices, and convert categories (like “Yes/No” for parking) into numbers.
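
The cleaning steps in this example can be sketched in plain Python as follows. The records and field names are made up, and filling missing values with the mean is just one common choice among several.

```python
# Hypothetical raw housing records; None marks a missing value
raw = [
    {"size": 1000, "bathrooms": 1, "parking": "Yes", "price": 200},
    {"size": 1500, "bathrooms": None, "parking": "No", "price": 250},
    {"size": 1000, "bathrooms": 1, "parking": "Yes", "price": 200},  # duplicate
    {"size": 2000, "bathrooms": 3, "parking": "Yes", "price": 300},
]

# 1. Remove exact duplicates while keeping the original order
seen, rows = set(), []
for r in raw:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        rows.append(dict(r))

# 2. Fill missing bathrooms with the mean of the known values
known = [r["bathrooms"] for r in rows if r["bathrooms"] is not None]
fill_value = sum(known) / len(known)
for r in rows:
    if r["bathrooms"] is None:
        r["bathrooms"] = fill_value

# 3. Min-max normalize price to the range [0, 1]
lo = min(r["price"] for r in rows)
hi = max(r["price"] for r in rows)
for r in rows:
    r["price_scaled"] = (r["price"] - lo) / (hi - lo)

# 4. Encode the Yes/No parking category as 1/0
for r in rows:
    r["parking"] = 1 if r["parking"] == "Yes" else 0

for r in rows:
    print(r)
```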

### Feature Engineering
→ Pick or create useful features that improve predictions.

**Example:**

Create new features such as price per square foot, or combine “number of rooms + bathrooms” into “total rooms.”
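
A short sketch of these two derived features, assuming made-up records and field names. The helper `add_features` is hypothetical.

```python
# Hypothetical house records (price in dollars, size in square feet)
houses = [
    {"price": 200_000, "size_sqft": 1000, "rooms": 3, "bathrooms": 1},
    {"price": 300_000, "size_sqft": 1200, "rooms": 4, "bathrooms": 2},
]

def add_features(house):
    """Return a copy of the record with two derived features added."""
    enriched = dict(house)
    enriched["price_per_sqft"] = house["price"] / house["size_sqft"]
    enriched["total_rooms"] = house["rooms"] + house["bathrooms"]
    return enriched

engineered = [add_features(h) for h in houses]
for h in engineered:
    print(h["price_per_sqft"], h["total_rooms"])
```

Note that price per square foot is computed from the price itself, so it is useful for exploring the data but would leak the target if fed to a model that predicts price.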

### Modeling
→ Choose and train the right algorithm for the problem.

**Example:**

 Use Linear Regression to predict house prices based on features like size, location, and rooms.

### Test & Evaluate
→ Check model performance using metrics suited to the task (accuracy for classification; MSE, MAE, and R² for regression).

**Example:**

Split data into training (80%) and testing (20%).

Check how well the model predicts using metrics like:

MSE / MAE → how far predictions are from actual prices

R² score → how well the model explains the variation in prices
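
The split-then-evaluate step might look like the sketch below. It assumes a synthetic dataset (price roughly `50 + 0.1 * size` plus noise) and a closed-form least-squares line fit written by hand, so that the example stays self-contained.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical dataset: (size in sq ft, price in $1000)
data = [(s, 50 + 0.1 * s + random.gauss(0, 5)) for s in range(500, 3000, 50)]

# Shuffle, then keep 80% for training and hold out 20% for testing
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Fit y = a + b * x on the training set only
x_mean = sum(x for x, _ in train) / len(train)
y_mean = sum(y for _, y in train) / len(train)
b = (sum((x - x_mean) * (y - y_mean) for x, y in train)
     / sum((x - x_mean) ** 2 for x, _ in train))
a = y_mean - b * x_mean

# Evaluate on the held-out test set
preds = [a + b * x for x, _ in test]
actual = [y for _, y in test]
n = len(test)
ss_res = sum((y - p) ** 2 for y, p in zip(actual, preds))
test_mean = sum(actual) / n
ss_tot = sum((y - test_mean) ** 2 for y in actual)

mse = ss_res / n
mae = sum(abs(y - p) for y, p in zip(actual, preds)) / n
r2 = 1 - ss_res / ss_tot

print(f"train={len(train)} test={len(test)} MSE={mse:.2f} MAE={mae:.2f} R2={r2:.3f}")
```

Because the model never sees the test rows during fitting, these scores estimate how it would perform on new houses.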

### Deploy & Monitor
→ Put the model into real use, track its performance, and update it when needed.

**Example:**

Put the model in a real-estate app where users can enter house details and get an estimated price.

Keep monitoring: if prices in the market change, retrain the model with new data.
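
One possible monitoring check is sketched below. It assumes a policy of retraining when the recent error exceeds 1.5 times the error measured at launch. The threshold, the numbers, and the function names are all hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def needs_retraining(baseline_mae, recent_actual, recent_predicted, tolerance=1.5):
    """Flag retraining when recent MAE exceeds the baseline by the given factor.
    The 1.5x tolerance is an assumed policy, not a standard value."""
    recent_mae = mae(recent_actual, recent_predicted)
    return recent_mae > tolerance * baseline_mae, recent_mae

# Hypothetical numbers (prices in $1000): error measured when the model was deployed
baseline_mae = 10.0
# Recent sales after the market moved up, compared with what the model predicted
recent_actual = [260, 310, 365]
recent_predicted = [230, 280, 330]

retrain, recent_mae = needs_retraining(baseline_mae, recent_actual, recent_predicted)
print("Recent MAE:", round(recent_mae, 2), "-> retrain?", retrain)
```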

**This way, the pipeline turns raw housing data into a working price prediction system.**


## Evaluation Metrics
### 1. Mean Squared Error (MSE)
**Meaning:** Average of the squared differences between predicted and actual values.

**Why:** Squaring removes the sign of each error and makes large errors count more. The result is expressed in squared units of the target.

$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$


**Example:**

Actual house prices = [200, 250, 300] (in $1000)

Predicted prices = [210, 240, 310]

Errors = [-10, 10, -10]

Squared errors = [100, 100, 100]

MSE = (100+100+100)/3 = 100 (in squared units of $1000)

In [4]:
import numpy as np
from sklearn.metrics import mean_squared_error

In [5]:
# ---------------- Example Data ----------------
# Actual house prices (in $1000)
actual = np.array([200, 250, 300])

In [6]:

# Predicted house prices (in $1000)
predicted = np.array([210, 240, 310])


In [7]:
# ---------------- Calculate MSE ----------------
mse = mean_squared_error(actual, predicted)

In [8]:
print("Actual prices:", actual)
print("Predicted prices:", predicted)
print("MSE:", mse)

Actual prices: [200 250 300]
Predicted prices: [210 240 310]
MSE: 100.0


In [9]:
# ---------------- Manual Calculation ----------------
# Step 1: Find errors (actual - predicted)
errors = actual - predicted

In [10]:
# Step 2: Square each error
squared_errors = errors ** 2

In [11]:
# Step 3: Take the average
mse_manual = np.mean(squared_errors)

In [12]:
print("\nManual calculation:")
print("Errors:", errors)



Manual calculation:
Errors: [-10  10 -10]


In [13]:
print("Squared errors:", squared_errors)
print("MSE (manual):", mse_manual)


Squared errors: [100 100 100]
MSE (manual): 100.0


### 2. Mean Absolute Error (MAE)

**Meaning:** Average of the absolute differences between predicted and actual values.

**Why:** Easier to interpret, since the error is in the same units as the target.

**Formula:**
$$
MAE = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
$$

**Example:**

Errors = [-10, 10, -10]

Absolute errors = [10, 10, 10]

MAE = (10+10+10)/3 = 10 → meaning, on average, the model is off by $10,000

In [39]:
# Import necessary library
from sklearn.metrics import mean_absolute_error

In [40]:
# Actual values
y_true = [3, 5, 2, 7]

In [41]:
# Predicted values (model predictions)
y_pred = [2.5, 5.5, 2, 8]


In [42]:
# Using sklearn
mae_sklearn = mean_absolute_error(y_true, y_pred)
print("Sklearn MAE:", mae_sklearn)

Sklearn MAE: 0.5


In [44]:
# Manual calculation
absolute_errors = [abs(a - p) for a, p in zip(y_true, y_pred)]
mae_manual = sum(absolute_errors) / len(absolute_errors)

In [45]:
print("Manual MAE:", mae_manual)

Manual MAE: 0.5
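
To connect MAE back to the house-price example and to show how the two metrics react to a large error, the sketch below computes both by hand. The second set of predictions, with one value off by 100, is a made-up variation.

```python
# House-price example from the MSE section (in $1000)
actual = [200, 250, 300]
predicted = [210, 240, 310]

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

print("MAE:", mae(actual, predicted), "MSE:", mse(actual, predicted))

# Hypothetical variation: the last prediction is off by 100 instead of 10
predicted_outlier = [210, 240, 400]
print("MAE with outlier:", mae(actual, predicted_outlier))
print("MSE with outlier:", mse(actual, predicted_outlier))
```

The single bad prediction raises MAE from 10 to 40 but raises MSE from 100 to 3400, which is why MSE is preferred when large errors should be penalized heavily.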


### 3. R² Score (Coefficient of Determination)
**Definition:**
R² measures the proportion of the variance in the actual values that the model's predictions explain.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

**Where:**  
- $y_i$ = actual value  
- $\hat{y}_i$ = predicted value  
- $\bar{y}$ = mean of actual values  

**Interpretation:**  
- $R^2 = 1$ → Perfect prediction  
- $R^2 = 0$ → Model predicts as well as simply using the mean  
- $R^2 < 0$ → Model is worse than just using the mean
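
The three cases can be checked directly with a small hand-written `r2` function that follows the formula above. The "reversed" predictions are a made-up example of a model that does worse than the mean.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y_true = [3, 5, 2, 7]
mean = sum(y_true) / len(y_true)

# Case 1: predictions equal the actual values
print("Perfect predictions:", r2(y_true, y_true))
# Case 2: always predict the mean of the actual values
print("Always predict the mean:", r2(y_true, [mean] * len(y_true)))
# Case 3: predictions that are further from the truth than the mean is
print("Reversed predictions:", r2(y_true, [7, 2, 5, 3]))
```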


In [47]:
# Actual values
y_true = [3, 5, 2, 7]

In [48]:

# Predicted values
y_pred = [2.5, 5.5, 2, 7.5]

In [53]:
from sklearn.metrics import r2_score

r2_score_sklearn = r2_score(y_true, y_pred)
print("Sklearn R2 Score:", r2_score_sklearn)

Sklearn R2 Score: 0.9491525423728814


In [54]:
#-------------manual calculation-------------
# Step 1: Calculate mean of actual values
y_mean = sum(y_true) / len(y_true)


In [50]:
# Step 2: Calculate Total Sum of Squares (SS_tot)
ss_tot = sum((y - y_mean)**2 for y in y_true)

In [51]:
# Step 3: Calculate Residual Sum of Squares (SS_res)
ss_res = sum((y - y_hat)**2 for y, y_hat in zip(y_true, y_pred))

In [52]:
# Step 4: Calculate R2
r2_score_manual = 1 - (ss_res / ss_tot)
print("Manual R2 Score:", r2_score_manual)

Manual R2 Score: 0.9491525423728814
