## LINEAR REGRESSION ALGORITHM


NAME: TIMILEYIN SAMUEL AKINTILO

STUDENT ID: C00302909

#### INTRODUCTION

This notebook showcases the implementation of the linear regression algorithm using the scikit-learn library. It was developed from scratch to demonstrate a practical and theoretical understanding of the underlying machine learning algorithm.

#### LOG OF CHANGES

This log records the computations carried out for this analysis and how they affected its results. The log is structured to follow the Cross Industry Standard Process for Data Mining (CRISP-DM) model, and the changes are logged under each of the six phases as follows:

**1. Business understanding**

The goal of this project is to build a linear regression model that predicts house prices from a set of attributes. The model is intended to support decision-making in real estate for housing professionals and house buyers.


**2. Data Understanding**

The dataset used for this analysis was obtained from Kaggle (https://www.kaggle.com/datasets/harishkumardatalab/housing-price-prediction). It includes 12 features and a target variable representing the price of the house. Below is the data dictionary:

Price: The price of the house.

Area: The total area of the house in square feet.

Bedrooms: The number of bedrooms in the house.

Bathrooms: The number of bathrooms in the house.

Stories: The number of stories in the house.

Mainroad: Whether the house is connected to the main road (Yes/No).

Guestroom: Whether the house has a guest room (Yes/No).

Basement: Whether the house has a basement (Yes/No).

Hot water heating: Whether the house has a hot water heating system (Yes/No).

Airconditioning: Whether the house has an air conditioning system (Yes/No).

Parking: The number of parking spaces available within the house.

Prefarea: Whether the house is located in a preferred area (Yes/No).

Furnishing status: The furnishing status of the house (Fully Furnished, Semi-Furnished, Unfurnished).

**3. Data Preparation**

Before modelling, the data was preprocessed to make it suitable for analysis. The categorical variables were converted to numeric variables using a label encoder.

**4. Modelling**

The following steps were implemented during the modelling phase:

**a) Standardizing the features**

**Change:** All the features were standardized to put them on the same scale.

**Result:** The performance of the model did not change after standardizing. Before standardization, the mean squared error was 1771751116594.035 and the R-squared score was 0.649. After standardization, the mean squared error was 1771751116594.040 and the R-squared score was 0.649.


**b) Hyperparameter tuning**

**Change:** The hyperparameters of the model were tuned in an attempt to improve the model's performance.

**Result:** The performance of the model did not change after hyperparameter tuning. The best parameters found were fit_intercept=True and n_jobs=1.

**5. Evaluation**

The performance of the model was evaluated using the mean squared error and the R-squared score.


**6. Deployment**

The best model was saved as a pickle file and deployed as a web application that runs locally. The web app is named Housing_app.

#### ANALYSIS

First things first, let's import the necessary libraries.



In [2]:
import pickle
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


Next, we will load the data set and take a look at it.

In [36]:
# load the housing price dataset
df = pd.read_csv('Housing.csv')

In [37]:
# check the first few rows of the dataframe
df.head()


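|   | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |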


In [5]:
# Examine the shape of the dataset
df.shape

(545, 13)

In [6]:
# Examine the columns in the dataframe
df.columns

Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')

In [7]:
# check for missing values
df.isnull().sum()

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64

The dataset has no null values.

Some of the columns are categorical, so we need to convert them to numerical values. We can use a label encoder for this.

In [8]:
# Select only categorical columns
categorical_columns = df.select_dtypes(include=['object', 'category'])
categorical_columns.head()

|   | mainroad | guestroom | basement | hotwaterheating | airconditioning | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|
| 0 | yes | no | no | no | yes | yes | furnished |
| 1 | yes | no | no | no | yes | no | furnished |
| 2 | yes | no | yes | no | no | yes | semi-furnished |
| 3 | yes | no | yes | no | yes | yes | furnished |
| 4 | yes | yes | yes | no | yes | no | furnished |


In [9]:
# Create a LabelEncoder instance
label_encoder = LabelEncoder()

# Iterate through each column in the DataFrame and apply label encoding
for column in categorical_columns.columns:
    df[column] = label_encoder.fit_transform(df[column])

In [10]:
df.head()

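|   | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | 0 |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 1 |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | 1 | 0 | 1 | 0 | 1 | 3 | 1 | 0 |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | 0 |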


In [11]:
# Display the class labels
class_labels = label_encoder.classes_
print(f'Class Labels: {class_labels}')

Class Labels: ['furnished' 'semi-furnished' 'unfurnished']


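Since the label encoder assigns codes in alphabetical order, for the furnishingstatus column furnished is encoded as 0, semi-furnished as 1, and unfurnished as 2. Because the same encoder object is refitted on each column in turn, `classes_` only shows the labels of the last column encoded. The same alphabetical rule applies to the other columns, so no is encoded as 0 and yes as 1.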
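The alphabetical mapping described above can be checked for every column by keeping one encoder per column instead of reusing a single instance. The following is a minimal sketch, not the notebook's own code: the `toy` frame contains made-up rows, and the `encoders` and `mappings` names are introduced here purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative frame (made-up rows, same column names as the housing data)
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished"],
})

# Keep one fitted encoder per column so every mapping stays available
encoders = {}
for column in toy.columns:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# Recover the label-to-code mapping for each column
mappings = {
    column: {label: int(code) for code, label in enumerate(enc.classes_)}
    for column, enc in encoders.items()
}
print(mappings)
```

Keeping the fitted encoders around also makes it possible to apply the same encoding to new rows later, for example inside the web app.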

Great! Now we can proceed to train a linear regression model using the dataset.

In [12]:
# Split the data into features and target
X = df.drop('price', axis=1)
y = df['price']

# take a copy of the features
features = X.copy()


In [13]:
X.columns

Index(['area', 'bedrooms', 'bathrooms', 'stories', 'mainroad', 'guestroom',
       'basement', 'hotwaterheating', 'airconditioning', 'parking', 'prefarea',
       'furnishingstatus'],
      dtype='object')

Now, we can split the dataset into a training set and a test set. We will use 80% of the data for training and 20% for testing.

In [14]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
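To make the 80/20 split concrete, the sketch below applies the same `train_test_split` call to placeholder arrays with 545 rows, the number of rows in the housing data. The arrays `X_demo` and `y_demo` are assumptions used only to show the resulting sizes and the effect of `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the housing data (545)
X_demo = np.arange(545 * 2).reshape(545, 2)
y_demo = np.arange(545)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Reusing the same seed reproduces the same partition
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` is what allows the unscaled and standardized models later in the notebook to be compared on the same test rows.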

In [15]:
# Create a linear regression model
lin = LinearRegression() 

In [16]:
# Fit the model
lin.fit(X_train, y_train)

In [17]:
# Make predictions on the test set
y_pred = lin.predict(X_test)

In [18]:
# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"The Mean Squared Error of the linear regression model is {mse:.3f}")


The Mean Squared Error of the linear regression model is 1771751116594.035


In [19]:
# using r2_score
r_squared = r2_score(y_test, y_pred)
print(f"The R-squared of the linear regression model is {r_squared:.3f}")  

The R-squared of the linear regression model is 0.649


The model has an R-squared value of 0.649, meaning it explains about 65% of the variance in the test-set prices.
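Because the mean squared error is expressed in squared price units, it is hard to read directly; its square root (the RMSE) for the value above is roughly 1.33 million in the price units of the dataset. The sketch below shows how both metrics are computed from the residuals and checks the result against scikit-learn. The `y_true` and `y_hat` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.0])

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)            # mean of squared residuals
rmse_manual = np.sqrt(mse_manual)               # same units as the target
ss_res = np.sum(residuals ** 2)                 # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
print(mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```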

Next we will standardize the features using the StandardScaler class from the scikit-learn library and compare the results.

In [20]:
# standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

Now, we can split the standardized dataset into a training set and a test set. We will use 80% of the data for training and 20% for testing.

In [21]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
# Create a linear regression model
lin1 =  LinearRegression()

In [23]:
# Fit the model
lin1.fit(X_train, y_train)

In [24]:
# Make predictions on the test set
y_pred = lin1.predict(X_test)

In [25]:
# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"The Mean Squared Error of the linear regression model is {mse:.3f}")


The Mean Squared Error of the linear regression model is 1771751116594.040


In [26]:
# using r2_score
r_squared = r2_score(y_test, y_pred)
print(f"The R-squared of the linear regression model is {r_squared:.3f}")

The R-squared of the linear regression model is 0.649


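Standardizing the features did not change the performance of the linear regression model. This is expected: when an intercept is fitted, ordinary least squares gives the same predictions after the features are shifted and rescaled, and only the coefficients change scale. The difference in the last decimal places of the mean squared error (1771751116594.035 versus 1771751116594.040) is floating-point rounding.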
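The identical scores can be reproduced on synthetic data. The sketch below is an illustration under assumed data, not a rerun of the housing analysis: it fits one ordinary least squares model on raw features and another on standardized features, then compares their test predictions. Here the scaler is fitted on the training split only, which is the usual way to keep test information out of preprocessing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (not the housing data)
rng = np.random.default_rng(0)
X_syn = np.column_stack([
    rng.uniform(1000, 10000, size=200),  # large-scale feature, similar to area
    rng.integers(1, 5, size=200),        # small-scale feature, similar to bedrooms
])
y_syn = 500 * X_syn[:, 0] + 300000 * X_syn[:, 1] + rng.normal(0, 1e5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Model on raw features
raw_pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# Model on standardized features (scaler fitted on the training split only)
demo_scaler = StandardScaler().fit(X_tr)
scaled_model = LinearRegression().fit(demo_scaler.transform(X_tr), y_tr)
scaled_pred = scaled_model.predict(demo_scaler.transform(X_te))

# The predictions agree up to floating-point rounding
print(np.allclose(raw_pred, scaled_pred))
```

Standardization still matters for models with a penalty on the coefficients or with distance-based computations; plain least squares is simply not one of them.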

#### Hyperparameter tuning

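Now we will attempt to tune the hyperparameters of the linear regression model using GridSearchCV. LinearRegression exposes few tunable settings, so the grid below covers `fit_intercept` and `n_jobs`.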

In [27]:
# Create a Linear regressor instance
lin2 = LinearRegression()


In [28]:
# Define the hyperparameter grid to search
param_grid = {
    'fit_intercept': [True, False],
    'n_jobs': [1, 5, 10]
}

In [29]:
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=lin2, param_grid=param_grid, cv=5)

In [30]:
# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)

In [31]:
# Get the best parameters and best estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

In [32]:
# check the best parameters
best_params

{'fit_intercept': True, 'n_jobs': 1}

In [33]:
# Use the best model to make predictions
y_pred = best_estimator.predict(X_test)

In [34]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)
print(f"The R-squared of the linear regression model is {r_squared:.3f}")
print(f'Mean Squared Error: {mse:.4f}')

The R-squared of the linear regression model is 0.649
Mean Squared Error: 1771751116594.0400


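The performance of the model remains unchanged. This is expected, because `fit_intercept=True` is already the default for LinearRegression, and `n_jobs` only sets the number of parallel jobs used for the computation; it does not affect the fitted coefficients.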
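As a hedged variation on the search above, the sketch below wraps the scaler and the regressor in a `Pipeline`, searches only over `fit_intercept`, and sets `scoring` explicitly so the cross-validation score is a mean squared error. The synthetic data and the step names `scale` and `model` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with a non-zero intercept (not the housing data)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(150, 3))
y_syn = 4.0 + X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=150)

# The scaler is refitted inside each cross-validation fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# Only fit_intercept changes the fitted model; n_jobs is left out on purpose
param_grid = {"model__fit_intercept": [True, False]}

search = GridSearchCV(
    pipe, param_grid=param_grid, cv=5, scoring="neg_mean_squared_error"
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(-search.best_score_)  # cross-validated mean squared error
```

Without an explicit `scoring` argument, GridSearchCV uses the estimator's own `score` method, which for regressors is R-squared.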

#### Saving the model as a pickle file

In [35]:
# Save the model trained on the unscaled features (lin) to a file using pickle
with open('lin.pkl', 'wb') as file:
    pickle.dump(lin, file)
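A saved model can be loaded back with `pickle.load` and used for prediction, which is what a locally running web app would need to do. The sketch below is self-contained and uses a stand-in model fitted on made-up data and a temporary file named `demo_model.pkl`; both are assumptions for illustration and are not part of Housing_app.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on made-up data following y = 2*x1 + 3*x2 + 1
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = 2 * X_toy[:, 0] + 3 * X_toy[:, 1] + 1
model = LinearRegression().fit(X_toy, y_toy)

# Save and reload, mirroring the 'lin.pkl' step with a temporary path
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as file:
    pickle.dump(model, file)

with open(path, "rb") as file:
    loaded = pickle.load(file)

# The reloaded model gives the same prediction as the original
new_row = np.array([[5.0, 6.0]])
print(np.allclose(loaded.predict(new_row), model.predict(new_row)))
```

For the housing model, the row passed to `predict` would need the same 12 features, in the same order and with the same encoding used during training. Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that created them.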

#### BIBLIOGRAPHY

https://www.kaggle.com/datasets/harishkumardatalab/housing-price-prediction