# Intermediate machine learning with sklearn

## 1. Simple linear regression

Linear regression predicts a numeric target value from one or more input features. Simple linear regression uses a single feature and fits a straight line through the data.

In [15]:
# import libraries
import pandas as pd 
import seaborn as sb
from sklearn.linear_model import LinearRegression 
# load the dataset 
df = sb.load_dataset('tips') 
# select the features column from df 
X = df[['total_bill']] 
# select the target column from df 
y = df['tip'] 
# create the linear regression model 
model = LinearRegression() 
# fit the model with the data 
model.fit(X, y) 
# Predicting tip based on total_bill: $35.48
predicted_tip = model.predict([[35.48]]) 
print(f"Predicted tip for a total bill of $35.48: ${predicted_tip[0]:.2f}")

Predicted tip for a total bill of $35.48: $4.65
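As a rough illustration of what `predict` does for a fitted line, the sketch below fits `LinearRegression` on a small made-up dataset (the numbers and the name `toy_model` are hypothetical, not the tips data) and reproduces a prediction by hand from the fitted `intercept_` and `coef_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (not the tips dataset): tip is exactly 1 + 0.1 * bill
bills = np.array([[10.0], [20.0], [30.0], [40.0]])
tips = 1.0 + 0.1 * bills.ravel()

toy_model = LinearRegression().fit(bills, tips)

# For one feature, a prediction is intercept + slope * x
new_bill = 35.0
by_hand = toy_model.intercept_ + toy_model.coef_[0] * new_bill
by_predict = toy_model.predict([[new_bill]])[0]

print(f"intercept={toy_model.intercept_:.2f}, slope={toy_model.coef_[0]:.2f}")
print(f"by hand={by_hand:.2f}, predict={by_predict:.2f}")
```

The two printed predictions should agree, since `predict` evaluates the same linear formula.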




In [16]:
# evaluating the fit: for a regressor, score() returns R² (the coefficient of determination), not a classification accuracy
r2 = model.score(X, y)
print(f"Model R² score: {r2:.2f}")

Model R² score: 0.46
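For a regressor, the value returned by `model.score` is the coefficient of determination R², the share of the target's variance that the model explains. A minimal sketch on made-up data (the arrays below are hypothetical) computes R² by hand and compares it with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with some scatter around a line
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

toy_model = LinearRegression().fit(x, y)
y_hat = toy_model.predict(x)

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"manual R²:   {r2_manual:.4f}")
print(f"model.score: {toy_model.score(x, y):.4f}")
```

Because the score here is computed on the same rows the model was fitted on, it describes the fit to the training data rather than performance on unseen data.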


## 2. Multiple linear regression

In [17]:
# select two feature columns from df
X = df[['total_bill', 'size']] 
# fit the model
model.fit(X, y) 
# Predicting tip based on total_bill: $35.27 and size: 4 
predicted_tip = model.predict([[35.27, 4]])
print(f"Predicted tip for a total bill of $35.27 and size 4: ${predicted_tip[0]:.2f}")

Predicted tip for a total bill of $35.27 and size 4: $4.71
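With more than one feature, the order of values passed to `predict` has to match the order of the training columns. One way to make that explicit, sketched here on made-up rows (the data and the name `toy_model` are hypothetical), is to pass a DataFrame with the same column names. Recent scikit-learn versions may also emit a feature-name warning when a model fitted on a DataFrame is given a plain list or array.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy rows using the same column names as the notebook
train = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 25.0],
    "size": [1, 2, 2, 4, 3],
    "tip": [1.5, 3.0, 4.0, 6.5, 4.0],
})
toy_model = LinearRegression().fit(train[["total_bill", "size"]], train["tip"])

# Passing a DataFrame keeps each value tied to its column name
new_row = pd.DataFrame({"total_bill": [35.27], "size": [4]})
pred_named = toy_model.predict(new_row)[0]

# Positional equivalent: the value order must match the training columns
pred_positional = toy_model.predict(new_row.to_numpy())[0]

print(f"named: {pred_named:.2f}, positional: {pred_positional:.2f}")
```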




In [18]:
# R² score of the model (coefficient of determination)
r2 = model.score(X, y) 
print(f"Model R² score: {r2:.2f}")

Model R² score: 0.47


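<b>Interpretations:</b> 

i. The multiple linear regression model has an R² score (the value returned by model.score) of 0.47, meaning it explains about 47% of the variance in tip on the data it was trained on.<br>
ii. Adding size as a second input feature raised the R² score only slightly, from 0.46 to 0.47. 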

Now we will add one more variable, smoker, as input.

smoker is a categorical variable, so we need to encode it as numbers before the machine learning algorithm can work with it. If we don't encode it, we will get an error.

In [19]:
# encoding the smoker variable through manual mapping
df['smoker'] = df['smoker'].map({'Yes': 1, 'No': 0}) 
# Displaying the first few rows of the dataframe to check the encoding
df.head()

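| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | 0 | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | 0 | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | 0 | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | 0 | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | 0 | Sun | Dinner | 4 |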


<b>Interpretation:</b> The values of the smoker column are successfully encoded: Yes is converted to 1 and No is converted to 0.
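To see why the encoding step matters, the following sketch (made-up rows; `toy` and `encoded_model` are hypothetical names) tries to fit on the raw strings, which is expected to fail, and then fits again after the same kind of manual mapping. It also checks for NaN, because `map` turns any value missing from the dictionary into NaN.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with a string-valued column
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "smoker": ["Yes", "No", "No", "Yes"],
    "tip": [1.5, 3.0, 4.5, 6.0],
})

# Fitting on the raw strings is expected to raise an error
try:
    LinearRegression().fit(toy[["total_bill", "smoker"]], toy["tip"])
    fit_failed = False
except (ValueError, TypeError) as err:
    fit_failed = True
    print("fit failed:", err)

# After mapping to 0/1 the same call succeeds
toy["smoker"] = toy["smoker"].map({"Yes": 1, "No": 0})
encoded_model = LinearRegression().fit(toy[["total_bill", "smoker"]], toy["tip"])

# Values not present in the mapping would have become NaN, so count them
n_missing = int(toy["smoker"].isna().sum())
print("unmapped values:", n_missing)
```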

In [20]:
# Prediction with the new encoded variable - smoker - added as input
X = df[['total_bill', 'size','smoker']] 
# fit the model 
model.fit(X, y) 
# Predicting tip based on total_bill: $ 35.27, size: 4, and smoker: Yes 
predicted_tip = model.predict([[35.27, 4, 1]])
print(f"Predicted tip for a total bill of $35.27, size 4, and smoker Yes: ${predicted_tip[0]:.2f}")
# Predicting tip based on total_bill: $ 35.27, size: 4, and smoker: No
predicted_tip = model.predict([[35.27, 4, 0]]) 
print(f"Predicted tip for a total bill of $35.27, size 4, and smoker No: ${predicted_tip[0]:.2f}")

Predicted tip for a total bill of $35.27, size 4, and smoker Yes: $4.66
Predicted tip for a total bill of $35.27, size 4, and smoker No: $4.74




In [21]:
# checking the R² score of the model (coefficient of determination)
r2 = model.score(X, y) 
print(f"Model R² score: {r2:.2f}")

Model R² score: 0.47


Now we will add one more variable, time, as input and encode it with sklearn's LabelEncoder before predicting the tip.

In [22]:
# import LabelEncoder from sklearn 
from sklearn.preprocessing import LabelEncoder 
# creating an instance of LabelEncoder
le = LabelEncoder() 
# encoding the time variable 
df['time'] = le.fit_transform(df['time']) 
# Displaying the first 5 rows of the dataframe to check the encoding
df.head()

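| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | 0 | Sun | 0 | 2 |
| 1 | 10.34 | 1.66 | Male | 0 | Sun | 0 | 3 |
| 2 | 21.01 | 3.5 | Male | 0 | Sun | 0 | 3 |
| 3 | 23.68 | 3.31 | Male | 0 | Sun | 0 | 2 |
| 4 | 24.59 | 3.61 | Female | 0 | Sun | 0 | 4 |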


<b>Interpretation:</b> The time variable column is successfully converted into numeric values through label encoder.

In [23]:
# checking the actual values of the time variable 
print("Actual values of the time variable:", list(le.classes_))

Actual values of the time variable: ['Dinner', 'Lunch']


<b>Interpretation:</b> The time variable has only two values, just like the smoker column. The encoder assigns codes in the order of le.classes_, so Dinner is 0 and Lunch is 1.
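The mapping between labels and codes can be checked on a small made-up sample (the list `times` and the name `le_demo` are hypothetical). `LabelEncoder` stores the sorted unique labels in `classes_`, and each label's code is its position in that array. As a side note, the scikit-learn documentation describes `LabelEncoder` as meant for target labels and lists `OrdinalEncoder` and `OneHotEncoder` for input features; for a two-valued column like this one the resulting 0/1 codes are the same idea.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the 'time' values
times = ["Dinner", "Lunch", "Dinner", "Lunch", "Lunch"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(times)

# classes_ is sorted; a label's code is its index in classes_
print(list(le_demo.classes_))
print(list(codes))

# inverse_transform maps codes back to the original labels
decoded = list(le_demo.inverse_transform([0, 1]))
print(decoded)
```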

In [24]:
# Prediction with the new encoded variable - time - added as input
X = df[['total_bill', 'size','smoker','time']] 
# fit the model 
model.fit(X, y) 
# Predicting tip based on total_bill: $35.27, size: 4, smoker: Yes (1), and time: Dinner (0)
predicted_tip = model.predict([[35.27, 4, 1, 0]]) 
print(f"Predicted tip for a total bill of $35.27, size 4, smoker yes and time Dinner: ${predicted_tip[0]:.2f}") 
# Predicting tip based on total_bill: $35.27, size: 4, smoker: Yes (1), and time: Lunch (1)
predicted_tip = model.predict([[35.27, 4, 1, 1]]) 
print(f"Predicted tip for a total bill of $35.27, size 4, smoker yes and time Lunch: ${predicted_tip[0]:.2f}")

Predicted tip for a total bill of $35.27, size 4, smoker yes and time Dinner: $4.66
Predicted tip for a total bill of $35.27, size 4, smoker yes and time Lunch: $4.66




<b>Interpretation:</b> The predicted tips for Dinner and Lunch are the same to two decimal places, which suggests that the time variable has little or no effect on the predicted tip in this model.
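In a linear model, two inputs that differ only in one 0/1 feature produce predictions that differ by exactly that feature's coefficient, so the effect of such a variable can also be read from `coef_`. The sketch below uses synthetic data (all names and numbers are hypothetical) in which the target does not depend on the binary flag:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: the target depends on 'bill' only, not on the binary flag
rng = np.random.default_rng(0)
bill = rng.uniform(5, 50, size=200)
flag = rng.integers(0, 2, size=200)  # plays the role of an encoded 'time'
tip = 1.0 + 0.1 * bill + rng.normal(0, 0.5, size=200)

X_demo = np.column_stack([bill, flag])
toy_model = LinearRegression().fit(X_demo, tip)

# Two inputs that differ only in the flag
pred_0 = toy_model.predict([[35.27, 0]])[0]
pred_1 = toy_model.predict([[35.27, 1]])[0]

# The gap between the two predictions equals the flag's coefficient
flag_coef = toy_model.coef_[1]
print(f"difference: {pred_1 - pred_0:.4f}, coefficient: {flag_coef:.4f}")
```

A coefficient close to zero relative to the scale of the target is what "little or no effect" looks like in the fitted model.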

## 3. Gradient descent technique

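Gradient descent finds suitable values for the y-intercept (B0, called b in the code) and the slope (B1, called m in the code) by repeatedly adjusting both to reduce the mean squared error of the line.

y = B0 + B1 * x 

where,<br>
y = predicted value<br>
B0 = y-intercept<br>
B1 = slope<br>
x = value of the input feature 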
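The update rule behind the function that follows can be written out explicitly. This is a sketch of the standard mean-squared-error derivation; here $\hat{y}_i = B_0 + B_1 x_i$ is the predicted value, $y_i$ the observed target, $n$ the number of rows, and $\alpha$ the learning rate (the symbols $\hat{y}_i$ and $\alpha$ are notation assumed for this sketch).

$$
J(B_0, B_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

$$
\frac{\partial J}{\partial B_1} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - \hat{y}_i \right),
\qquad
\frac{\partial J}{\partial B_0} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)
$$

$$
B_1 \leftarrow B_1 - \alpha \, \frac{\partial J}{\partial B_1},
\qquad
B_0 \leftarrow B_0 - \alpha \, \frac{\partial J}{\partial B_0}
$$

In the code these quantities appear as `cost`, `md`, `bd`, and `learning_rate`.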

In [25]:
import numpy as np

def gradient_descent(x, y, iterations: int = 1000, learning_rate: float = 0.0001):
    """Fit y = m*x + b by gradient descent on the mean squared error."""
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)

    if len(x) != len(y):
        raise ValueError('x and y must contain the same number of values')
    m_curr = b_curr = 0.0
    n = len(x)
    prev_cost = float('inf')
    tolerance = 1e-6
    for i in range(iterations):
        y_predicted = m_curr * x + b_curr
        # mean squared error for the current slope and intercept
        cost = (1/n) * sum((y - y_predicted) ** 2)

        # partial derivatives of the cost with respect to slope and intercept
        md = -(2/n) * sum(x * (y - y_predicted))
        bd = -(2/n) * sum(y - y_predicted)

        # step against the gradient
        m_curr -= learning_rate * md
        b_curr -= learning_rate * bd

        # stop early once the cost changes by less than the tolerance
        if abs(prev_cost - cost) < tolerance:
            break
        prev_cost = cost

    return m_curr, b_curr

In [26]:
# using the custom gradient descent function to find the slope and intercept
# note: x is tip and y is total_bill here, so this line predicts total_bill from tip
m,b = gradient_descent(df['tip'],df['total_bill'])
print(f"Slope: {m}, Intercept: {b}")

Slope: 5.184553005972767, Intercept: 1.6252179951331416


In [27]:
# predicting total_bill based on tip using the derived slope and intercept
predicted_total = m * 35.27 + b 
print(f"Predicted total_bill for a tip of $35.27: ${predicted_total:.2f}")

Predicted total_bill for a tip of $35.27: $184.48


In [28]:
# refit with x = total_bill and y = tip, then predict the tip for the total_bill predicted above
m,b = gradient_descent(df['total_bill'],df['tip']) 
print(f'Slope : {m} intercept: {b}') 
predicted_tip = m * predicted_total + b 
print(f'Predicted tip for a total bill of {predicted_total:.2f} : ${predicted_tip:.2f}')

Slope : 0.1422244062881796 intercept: 0.036155515684372974
Predicted tip for a total bill of 184.48 : $26.27
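Whether a given number of iterations and learning rate are enough for gradient descent to converge can be checked against a closed-form least-squares fit. The sketch below does this on synthetic data with a compact gradient descent helper (`fit_line_gd` and all data are hypothetical, written for this comparison) and NumPy's `np.polyfit` as the reference:

```python
import numpy as np

def fit_line_gd(x, y, iterations, learning_rate):
    """Plain gradient descent on mean squared error for y = m*x + b (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = b = 0.0
    n = len(x)
    for _ in range(iterations):
        residual = y - (m * x + b)
        grad_m = -(2 / n) * np.sum(x * residual)
        grad_b = -(2 / n) * np.sum(residual)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Hypothetical noisy data around y = 2x + 1
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 1, 50)
y_demo = 2 * x_demo + 1 + rng.normal(0, 0.1, size=50)

# Closed-form reference: for deg=1, polyfit returns [slope, intercept]
ref_slope, ref_intercept = np.polyfit(x_demo, y_demo, deg=1)

few = fit_line_gd(x_demo, y_demo, iterations=20, learning_rate=0.1)
many = fit_line_gd(x_demo, y_demo, iterations=5000, learning_rate=0.1)

print("reference:      ", ref_slope, ref_intercept)
print("20 iterations:  ", few)
print("5000 iterations:", many)
```

If the gradient descent result is still far from the reference, more iterations or a larger (but still stable) learning rate is usually needed. Running the same comparison on the tips columns would show how close the slope and intercept above are to the least-squares solution.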
