Getting started with Linear Regression: the basics.

In [2]:
import pandas as pd # For data handling
import numpy as np # For data manipulation
import matplotlib.pyplot as plt # For plotting
import seaborn as sns # For data visualization
from sklearn.model_selection import train_test_split # For splitting the dataset
from sklearn.linear_model import LinearRegression # For linear regression model
%matplotlib inline
# Display plots inline in Jupyter notebooks, i.e. directly below the code cells that produce them.

In [3]:
df = pd.read_csv('Advertising.csv')
df.head()

,Unnamed: 0,TV,Radio,Newspaper,Sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


In [4]:
# The 'Unnamed: 0' column is just the row index that was saved into the CSV file, so we can drop it.
df.drop(columns = 'Unnamed: 0', inplace=True)
df.head()

,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [None]:
# now we have to figure out the relation between the input features and the target.
# the input features are TV, Radio and Newspaper, and the output is Sales.
# in other words, we are linking the advertising spend on these media to sales.
# so X (the independent variables, or features) holds the inputs and y (the dependent variable, or target) holds the output.

# another key point: the target column is numerical, so this is a regression problem.
# if the target column were categorical, it would be a classification problem.

In [None]:
# so we are trying to establish a relation between X (input) and y (output). how do we do that? every machine learning algorithm has its own numerical way of establishing this relation. In linear regression, it is done by finding a line of best fit.
# linear regression fits a line to the data points so that the distances between the data points and the line are as small as possible. The vertical distance between an actual value and the value predicted by the line is called a residual.
# the line of best fit is the line that minimizes the sum of the squared residuals.

# simple linear regression (only one feature) is given by the equation: y = a + bx (the same as y = mx + c), where y is the predicted target, a is the y-intercept, b is the slope of the line and x is the input feature.

# multiple linear regression is given by the equation: y = a + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn, where y is the predicted target, a is the intercept, b1, b2, b3, ..., bn are the coefficients of the features and x1, x2, x3, ..., xn are the input features.

# here we have three input features, so the equation will be: Sales = a + b1*TV + b2*Radio + b3*Newspaper

# when we fit the linear regression model, i.e. train it on the training data, it relates X and y through the equation above. The model finds the values of a, b1, b2 and b3 that minimize the sum of the squared residuals.
# in simple words, the model finds the best fit for the data points (with three features this is technically a hyperplane rather than a line).

# as output we get b1, b2 and b3, the coefficients of the linear regression equation, and the intercept a. We can use them to make predictions on new data points.
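
To make "minimizing the sum of the squared residuals" concrete, here is a minimal sketch for the one-feature case, using the standard closed-form least-squares solution in plain Python. The data points are made up for illustration and are not taken from Advertising.csv; the helper name `sum_squared_residuals` is likewise just an assumption for this sketch.

```python
# Made-up data for illustration only (not from Advertising.csv).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 6.2, 7.9, 10.1]


def sum_squared_residuals(a, b):
    """Sum of squared differences between the actual y and the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Closed-form least-squares solution for a single feature:
#   b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   a = mean_y - b * mean_x
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x

best = sum_squared_residuals(a, b)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}, sum of squared residuals = {best:.4f}")

# Nudging the intercept or the slope away from the fitted values
# should only make the sum of squared residuals larger.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert sum_squared_residuals(a + da, b + db) > best
```

With several features the idea is the same, only the model searches over a, b1, b2, ... at once; scikit-learn's LinearRegression does that search for us.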

In [6]:
# getting the shape (number of rows and columns) of the initial data
df.shape

(200, 4)

In [None]:
# so the first step is to split the data into training and testing sets.
X = df.drop(columns='Sales') # input features
y = df['Sales'] # output feature 

# splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42) # 30% data for testing and 70% for training
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# ((140, 3), (60, 3), (140,), (60,)) means 140 rows and 3 columns of training input features, 60 rows and 3 columns of testing input features, 140 training target values and 60 testing target values. y_train and y_test are one-dimensional (pandas Series) because the target is a single column.

((140, 3), (60, 3), (140,), (60,))

In [9]:
# now after splitting the data, we can find the relation between x and y
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
# now once the model is trained, we can get the coefficients and intercept of the linear regression equation.
lr.coef_ # coefficients b1, b2 and b3, in the column order of X_train (TV, Radio, Newspaper)

array([0.04405928, 0.1992875 , 0.00688245])

In [11]:
lr.intercept_ # intercept a

2.7089490925159083

In [None]:
# now our machine learning model has learned the relation (pattern) between the input features and the output feature. we can use this model to make predictions on new data points.

X_test.head() # new data points for testing

,TV,Radio,Newspaper
95,163.3,31.6,52.9
15,195.4,47.7,52.9
30,292.9,28.3,43.2
158,11.7,36.9,45.2
128,220.3,49.0,3.2


In [15]:
# now we can start predicting the sales for these new data points using the linear regression model.
y_pred_test = lr.predict(X_test)
y_pred_test[:5] # predicted sales for the first 5 new data points

array([16.5653963 , 21.18822792, 21.55107058, 10.88923816, 22.20231988])
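
As a sanity check, the first prediction can be reproduced by hand from the fitted equation Sales = a + b1*TV + b2*Radio + b3*Newspaper. The sketch below copies the intercept, the coefficients (as printed, so slightly rounded) and the first row of X_test.head() from the notebook output. It assumes the coefficients are in the column order TV, Radio, Newspaper; the dictionary names are only for illustration.

```python
# Values copied from the notebook output (coefficients rounded as printed).
intercept = 2.7089490925159083
coefficients = {"TV": 0.04405928, "Radio": 0.1992875, "Newspaper": 0.00688245}

# First row of X_test.head() (index 95).
row = {"TV": 163.3, "Radio": 31.6, "Newspaper": 52.9}

# Sales = a + b1*TV + b2*Radio + b3*Newspaper
manual_prediction = intercept + sum(
    coefficients[name] * row[name] for name in coefficients
)
print(f"manual prediction: {manual_prediction:.4f}")

# How much each channel contributes to this particular prediction.
for name in coefficients:
    print(f"{name}: {coefficients[name] * row[name]:.3f}")
```

The result agrees with the first value returned by lr.predict(X_test) up to the rounding of the printed coefficients, which is all that predict does for a linear model.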

In [None]:
# once the predictions are made, we will have to evaluate the model to see how good it is.
# now depending on the ml task, the evaluation metrics will vary.
# since this is a regression problem, we can use metrics or evaluation functions like:

# Mean Absolute Error (MAE)
# Mean Squared Error (MSE) 
# Root Mean Squared Error (RMSE)
# R-squared (R2) score.

# these metrics will help us to understand how well our model is performing and how accurate the predictions are. this is nothing but measuring the error between the actual values and the predicted values.

In [None]:
# let's take an example
''' 
y_true = [85, 59, 65, 45, 30, 98] # actual values
y_pred = [89, 57, 66, 42, 28, 90] # predicted values
'''

# mean absolute error (MAE): the average of the absolute differences between the actual values and the predicted values. it tells us how far off the predictions are from the actual values on average. the lower the MAE, the better the model. so here it will give: [|85-89| + |59-57| + |65-66| + |45-42| + |30-28| + |98-90|] / 6 = (4 + 2 + 1 + 3 + 2 + 8) / 6 = 20 / 6 = 3.33

# mean squared error (MSE): the average of the squared differences between the actual values and the predicted values. it gives more weight to larger errors. the lower the MSE, the better the model. so here it will give: [(85-89)^2 + (59-57)^2 + (65-66)^2 + (45-42)^2 + (30-28)^2 + (98-90)^2] / 6 = (16 + 4 + 1 + 9 + 4 + 64) / 6 = 98 / 6 = 16.33

# root mean squared error (RMSE): the square root of the mean squared error. it tells us how far off the predictions typically are, in the same units as the target variable. the lower the RMSE, the better the model. so here it will give: sqrt(16.33) = 4.04

# r-squared score (R2 score): a statistical measure of the proportion of the variance in the target variable that is explained by the input features in the model (variance means the average of the squared differences from the mean).
# it is at most 1: 1 means the model explains all the variance in the target variable, 0 means the model does no better than always predicting the mean, and it can even be negative when the model does worse than that. the higher the R2 score, the better the model.
# so here it will give: 1 - (sum of squared residuals / total sum of squares) = 1 - ( ( (85-89)^2 + (59-57)^2 + (65-66)^2 + (45-42)^2 + (30-28)^2 + (98-90)^2 ) / ( (85-63.67)^2 + (59-63.67)^2 + (65-63.67)^2 + (45-63.67)^2 + (30-63.67)^2 + (98-63.67)^2 ) ) = 1 - (98 / 3139.33) = 0.97, meaning that 97% of the variance in the target variable is explained by the model.
# in simple terms: variance is the spread of the data points, and R2 tells us how much of that spread in the target can be explained using the given features.
# so if the predicted line is very close to the actual data points, the R2 score will be high. if the predicted line is far from the actual data points, the R2 score will be low.
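
The four definitions can be checked directly in code. This is a small plain-Python sketch that recomputes MAE, MSE, RMSE and R2 for the same y_true and y_pred lists, following the formulas as written rather than calling scikit-learn; the variable names ss_res and ss_tot are just labels chosen here for the two sums in the R2 formula.

```python
import math

y_true = [85, 59, 65, 45, 30, 98]  # actual values
y_pred = [89, 57, 66, 42, 28, 90]  # predicted values
n = len(y_true)

# Residual = actual value minus predicted value.
residuals = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(r) for r in residuals) / n   # mean absolute error
mse = sum(r ** 2 for r in residuals) / n   # mean squared error
rmse = math.sqrt(mse)                      # root mean squared error

mean_true = sum(y_true) / n
ss_res = sum(r ** 2 for r in residuals)             # sum of squared residuals
ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
r2 = 1 - ss_res / ss_tot                            # R-squared score

print(f"MAE  = {mae:.2f}")
print(f"MSE  = {mse:.2f}")
print(f"RMSE = {rmse:.2f}")
print(f"R2   = {r2:.2f}")
```

Running the arithmetic this way is a quick guard against slips in a hand calculation.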



In [16]:
# note: root_mean_squared_error is available in scikit-learn 1.4 and later
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error, r2_score
print(f"Mean Absolute Error: {mean_absolute_error(y_test, y_pred_test)}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred_test)}")
print(f"Root Mean Squared Error: {root_mean_squared_error(y_test, y_pred_test)}")
print(f"R-squared Score: {r2_score(y_test, y_pred_test)}")

Mean Absolute Error: 1.5116692224549082
Mean Squared Error: 3.796797236715219
Root Mean Squared Error: 1.9485372043446385
R-squared Score: 0.8609466508230368


In [None]:
# here the R2 score is 0.86, which means that 86% of the variance in the target variable (Sales) on the test set is explained by the input features (TV, Radio, Newspaper). This indicates that the model performs well and captures a large share of the variability in sales from the advertising spend on these media channels. note that this is not the same as predicting with 86% accuracy: R2 measures explained variance, while MAE and RMSE tell us how far off the predictions are (about 1.5 and 1.9, in the units of Sales).

In [None]:
# so once I'm satisfied with the model performance, I can save the model using joblib or pickle for future use, or deploy it and generate predictions on new data.
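
Saving and reloading with joblib could look roughly like the sketch below. To keep it self-contained it trains a small model on synthetic data (made up here, with arbitrary coefficients) instead of reusing the notebook's `lr`, and the file name is a hypothetical placeholder written to a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (made up); in the notebook you would save the fitted `lr`.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = 3.0 + X_demo @ np.array([0.05, 0.2, 0.01]) + rng.normal(0, 0.5, size=50)
model = LinearRegression().fit(X_demo, y_demo)

# Save the fitted model to disk, then load it back.
# The file name is a hypothetical placeholder.
model_path = os.path.join(tempfile.mkdtemp(), "linear_regression_demo.joblib")
joblib.dump(model, model_path)
loaded_model = joblib.load(model_path)

# The reloaded model should give the same predictions as the original.
new_points = np.array([[150.0, 30.0, 20.0]])
print(loaded_model.predict(new_points))
assert np.allclose(model.predict(new_points), loaded_model.predict(new_points))
```

Two practical cautions: joblib and pickle files should only be loaded from sources you trust, and it is safest to load a model with the same scikit-learn version that was used to save it.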