<a href="https://colab.research.google.com/github/DAPLearning2025/materials/blob/main/Week14_ML_SupervisedLearning_Linear_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Tour in ML

1. Intro to Supervised Learning

**Features** **--->** Supervised Learning Model **--->** **Value to Predict**

**Machine Learning:** Computer algorithms that have the ability to learn without being explicitly programmed.

**Supervised Learning:** The branch of machine learning where the computer learns how to perform a function by looking at labeled training data.

**Training: Data** --> **Supervised Learning Model** --> **Correct Output**

**Testing: New Data** --> **Supervised Learning Model** --> **Predicted Value**

**Types of Supervised Learning**
*   Regression: Predicting continuous values (e.g. house prices)
*   Classification: Predicting categories (e.g. spam vs. not spam)

**What is Linear Regression?**

Linear regression is one of the simplest forms of supervised learning. It is used to model the relationship between a dependent variable and one or more independent variables.

Simple Linear Regression:

Equation: Y = b0 + b1*X + e, where b0 is the y-intercept, b1 is the coefficient (slope), and e is the error term.

In other words, with several features: Prediction = y-intercept + coefficient₁X₁ + coefficient₂X₂ + … + coefficientᵢXᵢ.

Goal: Find the best-fitting line, the one that minimizes the error between the predicted and actual values.

Visual Example:

Imagine a scatter plot with data points forming a linear trend. Linear regression finds the line that best fits this trend.
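
To make the idea of a best-fitting line concrete, here is a minimal sketch that computes the least-squares intercept and slope by hand with NumPy. The data is synthetic and the variable names (`intercept`, `slope`) are chosen only for this illustration.

```python
import numpy as np

# Synthetic data (made up for illustration): y = 2 + 3x plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)

# Closed-form least-squares estimates for simple linear regression
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The estimates should land close to the values used to generate the data, and they match what `np.polyfit(x, y, 1)` returns for the same points.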



**Libraries**

**NumPy:**
*   Very efficient array and linear algebra functions
*   Free and open source
*   Widely used in the industry
*   The foundation on which other ML libraries are built

**Scikit-learn:**
*   Machine Learning library for Python
*   Free and open source
*   Widely used in the industry
*   Implements many standard machine learning algorithms

**Pandas:**
*   Makes it easy to load and work with large data sets similar to a spreadsheet.
*   Free and open source
*   Widely used in the industry
*   Short for "panel data"

These libraries are designed to work well together.

NumPy --> provides the basic ability to load and work with a dataset.

Pandas --> provides the extra capabilities to make it easy to clean up and do calculations on the dataset.

Scikit-learn --> provides the actual machine learning algorithms we'll run on the data.
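
As a rough sketch of that division of labor, the snippet below builds a tiny table with pandas, exposes it as NumPy arrays, and fits a scikit-learn model on them. The numbers are invented for illustration and are not taken from the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy table (values invented for illustration)
toy = pd.DataFrame({
    "horsepower": [90, 110, 130, 150, 170],
    "mpg": [30.0, 26.0, 22.0, 18.0, 14.0],
})

# pandas selects the columns; NumPy holds the underlying arrays
X = toy[["horsepower"]].to_numpy()   # 2-D array, one column per feature
y = toy["mpg"].to_numpy()            # 1-D array of target values

# scikit-learn fits the model on those arrays
toy_model = LinearRegression().fit(X, y)
print(toy_model.coef_, toy_model.intercept_)
```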


**Demo** --> Linear regression as a machine learning problem. Mean Squared Error (MSE) as the loss function. Interpreting the results of a regression analysis. R^2 for evaluating regression models.

**General Process to build ML Model**

1.   Define the problem
2.   Ask: does an ML model need to be built to solve the problem?
3.   Do you have the data (relevant data)?
4.   Explore the dataset (use pandas or any powerful library)
5.   Prepare the data for the training and testing phases
6.   Train and evaluate your model
7.   Does your approach matter? Did you reach your goal?





# Regression Models

In [None]:
#Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

In [None]:
#read data from github path
github_url = r'https://raw.githubusercontent.com/DAPLearning2025/materials/refs/heads/main/Resources%20%26%20Data/auto-mpg.csv'
df=pd.read_csv(github_url, header=0)
df.head()

In [None]:
#Handle the dataset
#Finding missing values
df.isnull().sum()

**Missing Values**: Missing data can reduce the sample size, introduce bias, and distort the model's estimates.
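
Dropping rows is one way to handle missing values; filling them in (imputation) is another. The sketch below contrasts the two on a toy table with invented numbers. Using the median as the fill value is an assumption made for illustration, not a recommendation from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with one missing horsepower value (numbers are invented)
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 31.0, 22.0],
    "horsepower": [130.0, np.nan, 70.0, 95.0],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: keep all rows and fill the gap with the column median
imputed = toy.fillna({"horsepower": toy["horsepower"].median()})

print(len(toy), len(dropped))            # row counts before and after dropping
print(imputed["horsepower"].tolist())    # no missing values remain
```

Dropping is simple but shrinks the sample; imputing keeps every row at the cost of introducing an estimated value.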

In [None]:
#Horsepower will be used as a feature, so drop the rows with null values
df.dropna(inplace=True)
df.isnull().sum()

In [None]:
#Discovering columns and data type for each field
df.info()

# Find which features have a linear relationship with the target value (mpg)
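
Alongside plots, one numerical check for a linear relationship is the Pearson correlation between each feature and the target. The sketch below uses a small invented table; with the real data the same call would be made on `df`, assuming its columns are numeric.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    "mpg":          [30.0, 26.0, 22.0, 18.0, 14.0],
    "weight":       [2000, 2400, 2900, 3300, 3900],
    "acceleration": [15.0, 17.5, 14.0, 16.5, 15.5],
})

# Pearson correlation of every feature with the target
corr_with_mpg = toy.corr()["mpg"].drop("mpg")

# Rank features by the strength of the linear relationship
print(corr_with_mpg.abs().sort_values(ascending=False))
```

Values near +1 or -1 suggest a strong linear relationship; values near 0 suggest little or none.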

In [None]:
list(df.columns[1:])


In [None]:
all_features = list(df.columns[1:])

for feature in all_features:
  sns.relplot(data = df, y = feature, x = 'mpg', kind='line', height=5, aspect=1)

In [None]:
original_cols = df.columns
# Pairwise plots of all numeric columns, to inspect relationships and spot possible outliers
sns.pairplot(df)

# Build the model step by step: one feature, then three, four, and finally all features

In [None]:
#We will add features step by step and compare models on training score, testing score (R squared), and MSE
tracking_features = {
    "features":[],
    "training_score":[],
    "R_squared":[],
    "mse":[]
}


`R-squared (R²)`: Measures how well the regression model fits the observed data. On the training data of a linear regression with an intercept, R² ranges from 0 to 1: 0 means the model explains none of the variability in the dependent variable, and 1 means it predicts the dependent variable perfectly. On test data, R² can be negative when the model predicts worse than simply using the mean. The R² score is also called the “coefficient of determination” or “goodness of fit”.

`Mean Squared Error (MSE)`: The average squared difference between the predicted and actual values. A lower MSE indicates a better model fit. Its units are the square of the target's units (here, mpg squared), so it is not a percentage.
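
Both metrics can be computed by hand, which may help clarify what scikit-learn reports. The sketch below uses invented actual and predicted values and compares the manual results with `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented actual and predicted values, for illustration only
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_hat = np.array([22.0, 24.0, 29.0, 36.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, r2_manual)

# Predictions that are worse than always predicting the mean give a negative R²
bad_pred = np.array([35.0, 30.0, 25.0, 20.0])
print(r2_score(y_true, bad_pred))
```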

In [None]:
#Plot the data to see the relationship between mpg and horsepower
feature1='horsepower' # first feature
sns.scatterplot(data=df, x=feature1, y='mpg');

In [None]:
#Using linear regression with 1 feature. Split the data into training & testing samples
# The train_test_split function in scikit-learn shuffles the data by default
from sklearn.model_selection import train_test_split

X=df[['horsepower']]
y=df['mpg']

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
x_train.sample(5)

In [None]:
# Calling the model
from sklearn.linear_model import LinearRegression
linear_model=LinearRegression() #calling linear reg
linear_model.fit(x_train, y_train) # feeding the training data
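
After fitting, the learned y-intercept and coefficient are available on the model as `intercept_` and `coef_`. The following sketch uses a few invented horsepower/mpg pairs (not the course data) to show that a prediction is just the intercept plus the coefficient times the feature value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented points lying exactly on mpg = 40 - 0.1 * horsepower
hp = np.array([[50.0], [100.0], [150.0], [200.0]])
mpg = np.array([35.0, 30.0, 25.0, 20.0])

demo_model = LinearRegression().fit(hp, mpg)
intercept = demo_model.intercept_   # learned y-intercept
slope = demo_model.coef_[0]         # learned coefficient for the single feature

# Reproduce the model's prediction for horsepower = 120 by hand
manual = intercept + slope * 120.0
print(intercept, slope, manual, demo_model.predict([[120.0]])[0])
```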

In [None]:
#R-squared is a measure of how well our linear model captures the underlying variation in our training data
horsepower_training_score=linear_model.score(x_train, y_train) * 100 # track the training score
print(f"Training score (%) is: {horsepower_training_score:.3f}")

In [None]:
y_pred = linear_model.predict(x_test)

In [None]:
from sklearn.metrics import r2_score

testing_score = r2_score(y_test, y_pred) * 100 # tracking testing score
print(f"Testing score (%) is: {testing_score:.3f}")


In [None]:
# calculate mean squared error (in squared mpg units, not a percentage)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE) is: {mse:.3f}")

In [None]:
#Add the results to the tracking dictionary
tracking_features["features"].append(feature1)
tracking_features["training_score"].append(horsepower_training_score)
tracking_features["R_squared"].append(testing_score)
tracking_features["mse"].append(mse)
tracking_features

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(x_test, y_test, color='black')
plt.plot(x_test, y_pred, color='blue', linewidth=3)

plt.xlabel('Horsepower')
plt.ylabel('MPG')

plt.show()

In [None]:
df.columns

In [None]:
#Using a different feature
feature2='weight'
X=df[['weight']]
y=df['mpg']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

linear_model=LinearRegression()
linear_model.fit(x_train, y_train)
training_score=linear_model.score(x_train, y_train) * 100
print(f"Training score (%) is: {training_score:.3f}")

y_pred = linear_model.predict(x_test)
testing_score = r2_score(y_test, y_pred) * 100
print(f"Testing score (%) is: {testing_score:.3f}")

# MSE is in squared mpg units, not a percentage
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE) is: {mse:.3f}")

#Add the results to the tracking dictionary
tracking_features["features"].append(feature2)
tracking_features["training_score"].append(training_score)
tracking_features["R_squared"].append(testing_score)
tracking_features["mse"].append(mse)
tracking_features

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

plt.scatter(x_test, y_test, color='black')
plt.plot(x_test, y_pred, color='blue', linewidth=3)

plt.xlabel('weight')
plt.ylabel('MPG')

plt.show()

# Using Multiple Features

In [None]:
#using multiple features
#using 3 features at once
feature3="horsepower, weight, displacement"
X=df[['horsepower', 'weight','displacement']]
y=df['mpg']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

linear_model=LinearRegression()
linear_model.fit(x_train, y_train)
training_score=linear_model.score(x_train, y_train) * 100
print(f"Training score (%) is: {training_score:.3f}")

y_pred = linear_model.predict(x_test)
testing_score = r2_score(y_test, y_pred) * 100
print(f"Testing score (%) is: {testing_score:.3f}")

# MSE is in squared mpg units, not a percentage
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE) is: {mse:.3f}")

#Add the results to the tracking dictionary
tracking_features["features"].append(feature3)
tracking_features["training_score"].append(training_score)
tracking_features["R_squared"].append(testing_score)
tracking_features["mse"].append(mse)
tracking_features

In [None]:
#w = weights or coefficients of the model (model parameters)
predictors=x_train.columns
coef=pd.Series(linear_model.coef_,predictors).sort_values()
coef.plot(kind='bar', title='Model Coefficients');

In [None]:
print(coef)
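
When reading coefficients, keep in mind that their size depends on each feature's units, so a raw coefficient alone does not show which feature matters most. One common way to make them comparable is to standardize the features first. The sketch below illustrates this with synthetic data; using `StandardScaler` here is an assumption for illustration and is not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (generated, not real data)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "horsepower": rng.uniform(50, 200, 300),
    "weight": rng.uniform(1800, 4500, 300),
})
target = 45 - 0.02 * demo["horsepower"] - 0.005 * demo["weight"]

# Fit once on the raw features and once on standardized features
raw = LinearRegression().fit(demo, target)
scaled = LinearRegression().fit(StandardScaler().fit_transform(demo), target)

raw_coef = pd.Series(raw.coef_, demo.columns)
scaled_coef = pd.Series(scaled.coef_, demo.columns)
print(raw_coef)     # per-unit effect: horsepower looks larger
print(scaled_coef)  # per-standard-deviation effect: weight looks larger
```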

In [None]:
#Let's plot our findings
plt.figure(figsize=(20,10))

plt.plot(y_pred, label='Predicted')
plt.plot(y_test.values, label='Actual')
plt.ylabel('MPG')

plt.legend()
plt.show()

In [None]:
#adding more features to our model
feature4="horsepower, weight, displacement, acceleration"
X=df[['horsepower', 'weight','displacement','acceleration']]
y=df['mpg']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

linear_model=LinearRegression()
linear_model.fit(x_train, y_train)
training_score=linear_model.score(x_train, y_train) * 100
print(f"Training score (%) is: {training_score:.3f}")

y_pred = linear_model.predict(x_test)
testing_score = r2_score(y_test, y_pred) * 100
print(f"Testing score (%) is: {testing_score:.3f}")

# MSE is in squared mpg units, not a percentage
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE) is: {mse:.3f}")

#Add the results to the tracking dictionary
tracking_features["features"].append(feature4)
tracking_features["training_score"].append(training_score)
tracking_features["R_squared"].append(testing_score)
tracking_features["mse"].append(mse)
tracking_features

In [None]:
predictors=x_train.columns
coef=pd.Series(linear_model.coef_,predictors).sort_values()
coef.plot(kind='bar', title='Model Coefficients')

In [None]:
print(coef)

In [None]:
#Let's plot our findings
plt.figure(figsize=(20,10))

plt.plot(y_pred, label='Predicted')
plt.plot(y_test.values, label='Actual')
plt.ylabel('MPG')

plt.legend()
plt.show()

In [None]:
# Add all the features (assumes every remaining column is numeric)
featureAll="All relevant fields"
X=df.drop('mpg', axis=1)
y=df['mpg']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

linear_model=LinearRegression()
linear_model.fit(x_train, y_train)
training_score=linear_model.score(x_train, y_train) * 100
print(f"Training score (%) is: {training_score:.3f}")

y_pred = linear_model.predict(x_test)
testing_score = r2_score(y_test, y_pred) * 100
print(f"Testing score (%) is: {testing_score:.3f}")

# MSE is in squared mpg units, not a percentage
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE) is: {mse:.3f}")

#Add the results to the tracking dictionary
tracking_features["features"].append(featureAll)
tracking_features["training_score"].append(training_score)
tracking_features["R_squared"].append(testing_score)
tracking_features["mse"].append(mse)
tracking_features


In [None]:
# Refit on the same split and report R² as a fraction (not a percentage)
linear_model=LinearRegression()
linear_model.fit(x_train, y_train)
print("Training score: ", linear_model.score(x_train, y_train))
print("Testing score (model.score): ", linear_model.score(x_test, y_test))
predictors=x_train.columns
coef=pd.Series(linear_model.coef_,predictors).sort_values()

# r2_score on the predictions gives the same value as model.score on the test set
y_pred=linear_model.predict(x_test)
print("Testing score (r2_score): ", r2_score(y_test, y_pred))

coef.plot(kind='bar', title='Model Coefficients')

In [None]:
#Let's plot our findings
plt.figure(figsize=(20,10))

plt.plot(y_pred, label='Predicted')
plt.plot(y_test.values, label='Actual')
plt.ylabel('MPG')

plt.legend()
plt.show()

In [None]:
#compare the results stored in the dictionary
df_tracking=pd.DataFrame(tracking_features)
df_tracking
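
Because each cell above calls `train_test_split` without a fixed seed, every model is scored on a different random split, which makes the rows of the table harder to compare. One possible refinement is to wrap the repeated steps in a helper that reuses the same split. The sketch below is an illustration only: the helper name `evaluate_features`, the `seed` parameter, and the synthetic data are all hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_features(data, feature_cols, target="mpg", seed=42):
    """Fit a linear model on feature_cols and return train/test metrics."""
    X, y = data[feature_cols], data[target]
    # A fixed random_state gives the same train/test rows on every call
    x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(x_tr, y_tr)
    pred = model.predict(x_te)
    return {
        "features": ", ".join(feature_cols),
        "training_score": model.score(x_tr, y_tr) * 100,
        "R_squared": r2_score(y_te, pred) * 100,
        "mse": mean_squared_error(y_te, pred),
    }


# Synthetic stand-in for the auto-mpg table (all values are generated, not real)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "horsepower": rng.uniform(50, 200, 200),
    "weight": rng.uniform(1800, 4500, 200),
})
demo["mpg"] = 45 - 0.05 * demo["horsepower"] - 0.005 * demo["weight"] + rng.normal(0, 1, 200)

results = pd.DataFrame([
    evaluate_features(demo, ["horsepower"]),
    evaluate_features(demo, ["horsepower", "weight"]),
])
print(results)
```

With the real data, the same helper could be called as `evaluate_features(df, ['horsepower'])`, assuming the listed columns exist in `df`.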

In [None]:
df.columns

# Saving and exporting our model for future use

In [None]:
import pickle

# Name your model file
filename = 'mpg_model.pkl'

# Export and save the model; the with block closes the file automatically
with open(filename, 'wb') as f:
    pickle.dump(linear_model, f)

In [None]:
x_test

In [None]:
#How to load our model for future use
with open(filename, 'rb') as f:
    model = pickle.load(f)

#How to use our model to predict
model.predict(x_test)
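
A quick way to gain confidence in the save/load step is to check that the restored model gives the same predictions as the original. The sketch below does this with a tiny invented dataset and a temporary directory; the file name `demo_model.pkl` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny invented dataset, just to have a fitted model to save (y = 1 + 2x)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])
fitted = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo_model.pkl"   # hypothetical file name
    with open(path, "wb") as f:               # file is closed automatically
        pickle.dump(fitted, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly like the original
same = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print(same)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.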
