# Machine Learning module end project

This project involves building a predictive model for car prices using various regression algorithms. Below is a structured approach for solving this problem, which includes loading and preprocessing the dataset, implementing the models, evaluating them, conducting feature importance analysis, and performing hyperparameter tuning.

## 1. Loading and Preprocessing:

In [168]:
# to deactivate warnings
import warnings
import sys
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [169]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [170]:
# Importing the dataset
df = pd.read_csv(r"C:\Users\aksha\Downloads\CarPrice_Assignment.csv")
df.head()

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [171]:
df.shape

(205, 26)

In [172]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 1

In [173]:
df.columns

Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase',
       'carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype',
       'cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg',
       'price'],
      dtype='object')

In [174]:
# Finding the missing values
missing_values = df.isnull().sum()
print("Missing Values")
print(missing_values)

Missing Values
car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64


In [175]:
# checking the total of duplicates
df.duplicated().sum()

0

In [176]:
# Encoding categorical data
categorical_cols = df.select_dtypes(include=['object']).columns
categorical_cols

Index(['CarName', 'fueltype', 'aspiration', 'doornumber', 'carbody',
       'drivewheel', 'enginelocation', 'enginetype', 'cylindernumber',
       'fuelsystem'],
      dtype='object')

In [177]:
# One-hot Encoding
df = pd.get_dummies(df,columns=categorical_cols)

In [178]:
# Feature selection
x = df.drop('price', axis=1)
y = df['price']
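
The encoding above turns CarName into a separate column for almost every distinct model name, and the feature matrix still contains the row identifier car_ID. One possible alternative, shown only as a sketch on a tiny hand-made frame (the `toy` data and the `brand` column are assumptions, not part of the notebook), is to drop the identifier and reduce CarName to its first word before encoding:

```python
import pandas as pd

# Tiny hand-made frame reusing the dataset's column names (values are illustrative).
toy = pd.DataFrame({
    "car_ID": [1, 2, 3, 4],
    "CarName": ["alfa-romero giulia", "alfa-romero stelvio", "audi 100 ls", "audi 100ls"],
    "fueltype": ["gas", "gas", "gas", "diesel"],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Assumption: the first word of CarName is the manufacturer.
toy["brand"] = toy["CarName"].str.split().str[0].str.lower()

# Drop the identifier, the raw name and the target, then one-hot encode the text columns.
features = toy.drop(columns=["car_ID", "CarName", "price"])
encoded = pd.get_dummies(features, columns=["fueltype", "brand"], drop_first=True, dtype=float)

print(encoded.columns.tolist())
```

Whether this helps on the real data would need to be checked by re-running the models; the sketch only shows the mechanics.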

In [179]:
# Splitting the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)

In [180]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler


In [181]:
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

## 2. Model Implementation:

### (1) Linear Regression

    Linear Regression is a statistical method used to model the relationship between a target variable (dependent variable) and one or more predictors (independent variables).

    It assumes that there is a linear relationship between the predictors and the target. 

    The goal of Linear Regression is to fit a straight line (or a hyperplane when there are multiple features) to the data by minimizing the sum of squared residuals (the differences between the actual and predicted values).
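
For a single feature, the least-squares line has a closed form. The following is a minimal sketch with made-up numbers (the helper `fit_line` is hypothetical and is not what scikit-learn runs internally):

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative numbers (not taken from the dataset): y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Residuals are the differences between actual and fitted values.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
sse = sum(r ** 2 for r in residuals)
print(slope, intercept, sse)
```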

In [182]:
from sklearn.linear_model import LinearRegression

In [183]:
lr_model = LinearRegression()
lr_model.fit(x_train, y_train)

In [185]:
y_pred = lr_model.predict(x_test)

In [186]:
y_test

15     30760.000
9      17859.167
100     9549.000
132    11850.000
68     28248.000
95      7799.000
159     7788.000
162     9258.000
147    10198.000
182     7775.000
191    13295.000
164     8238.000
65     18280.000
175     9988.000
73     40960.000
152     6488.000
18      5151.000
82     12629.000
86      8189.000
143     9960.000
60      8495.000
101    13499.000
98      8249.000
30      6479.000
25      6692.000
16     41315.000
168     9639.000
195    13415.000
97      7999.000
194    12940.000
67     25552.000
120     6229.000
154     7898.000
202    21485.000
79      7689.000
69     28176.000
145    11259.000
55     10945.000
45      8916.500
84     14489.000
146     7463.000
Name: price, dtype: float64

In [187]:
y_pred

array([-1.74396701e+15,  6.79094818e+15, -2.71285019e+15,  9.75035952e+03,
        8.04082734e+15, -1.14014990e+15,  6.90584390e+03,  9.97271890e+03,
        7.10730646e+14, -1.09418604e+15,  6.31473837e+15,  6.12871890e+03,
        1.51408595e+04,  8.61135952e+03,  7.09821342e+15,  7.16771890e+03,
       -6.24432102e+15,  1.26427306e+04, -2.07010444e+15,  7.00845813e+14,
       -9.46259080e+14,  6.60924641e+15, -3.91589121e+15,  5.29328140e+03,
       -1.31605735e+15, -1.82229650e+15,  3.95261205e+15,  1.46996095e+04,
       -1.39952697e+15,  1.13133595e+04,  6.87839818e+15,  6.45721890e+03,
       -1.09418604e+15,  7.86850326e+15,  4.43173061e+03,  8.83096159e+15,
        7.10730646e+14,  6.36790640e+03, -1.90727316e+15,  1.68404806e+04,
        7.10730646e+14])
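
Predictions on the order of 1e15 are not plausible car prices and point to a numerically unstable fit. One likely contributor, stated here as an assumption rather than something verified on this dataset, is that one-hot encoding every level of every categorical column makes the columns linearly dependent, because each group of dummy columns sums to 1. The toy example below, with made-up categories and a hypothetical `one_hot` helper, shows the rank of a fully encoded matrix falling short of its column count, and dropping one level per category restoring full rank:

```python
import numpy as np

# Made-up categorical data: two columns with 2 and 3 levels.
fuel = ["gas", "diesel", "gas", "gas", "diesel", "gas"]
drive = ["fwd", "rwd", "4wd", "fwd", "fwd", "rwd"]

def one_hot(values, levels):
    """One 0/1 column per level, in the order given by `levels`."""
    return np.array([[1.0 if v == level else 0.0 for level in levels] for v in values])

intercept = np.ones((len(fuel), 1))

# Full encoding: every level gets its own column.
full = np.hstack([intercept, one_hot(fuel, ["diesel", "gas"]), one_hot(drive, ["4wd", "fwd", "rwd"])])

# Reduced encoding: the first level of each category is dropped.
reduced = np.hstack([intercept, one_hot(fuel, ["gas"]), one_hot(drive, ["fwd", "rwd"])])

print(full.shape[1], np.linalg.matrix_rank(full))        # number of columns vs. rank
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # number of columns vs. rank
```

In the notebook, the reduced encoding would correspond to passing drop_first=True to pd.get_dummies.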

### (2) Decision Tree Regressor


    A Decision Tree Regressor is a machine learning model used to predict continuous values by recursively splitting the dataset based on features that result in the most significant reduction in variance. 
 
    It creates a tree-like structure where each node represents a feature split, and the leaves contain the predicted value, typically the mean of the target variable in that subset. 

    While it is easy to interpret and can handle nonlinear relationships, Decision Trees are prone to overfitting, especially when the tree becomes too complex. 

    This can be mitigated by controlling the depth of the tree or using pruning techniques. 

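    The model is popular for its simplicity. Decision trees can in principle handle both numerical and categorical data, although scikit-learn's DecisionTreeRegressor expects numeric input, which is why the categorical columns were encoded above.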
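
The variance-reduction criterion can be illustrated on one feature. This is a minimal sketch with made-up values (the helpers `sse` and `best_split` are hypothetical, not scikit-learn functions): every midpoint between consecutive feature values is tried, and the threshold whose two children have the smallest total squared deviation from their means is kept.

```python
def sse(values):
    """Sum of squared deviations from the mean (variance times count)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) of the split that minimizes the children's SSE."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Illustrative values only: small engines are cheap, large engines are expensive.
engine_size = [90, 98, 109, 130, 152, 183]
price = [6000, 6500, 7000, 13500, 16500, 25000]
threshold, total = best_split(engine_size, price)
print(threshold, total)
```

A leaf created by this split would predict the mean price of the samples that fall into it.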

In [188]:
from sklearn.tree import DecisionTreeRegressor 

In [189]:
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(x_train, y_train)

In [191]:
y_pred = dt_model.predict(x_test)
y_pred

array([36880. , 17199. ,  8949. , 12170. , 35056. ,  5195. ,  7995. ,
        8358. ,  8949. ,  7995. , 12170. ,  8058. , 12170. , 11199. ,
       31400.5,  6338. ,  5399. , 12964. ,  6989. ,  8949. , 10245. ,
       14399. ,  7299. ,  5389. ,  7609. , 36880. ,  9989. , 16515. ,
        7349. , 15985. , 36880. ,  6229. ,  6989. , 19045. ,  7957. ,
       36880. , 11694. , 11845. ,  8916.5, 14869. ,  9233. ])

### (3) Random Forest Regressor

    A Random Forest Regressor is an ensemble machine learning model that combines multiple decision trees to improve prediction accuracy and reduce overfitting. 

    It works by creating a collection (forest) of decision trees, each trained on a random subset of the data using bootstrapping (sampling with replacement) and a random selection of features for each split. 

    The final prediction is the average of the predictions made by all the individual trees. 

    This technique helps to mitigate the instability of a single decision tree, as the diversity of trees leads to more robust and generalized predictions. 

    Random Forests are powerful for regression tasks due to their ability to capture complex relationships in the data while maintaining high accuracy and resilience to overfitting. 

    However, they can be computationally expensive, especially with large datasets or when a large number of trees is used.
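
The averaging step can be checked directly. The sketch below uses synthetic data in place of the car dataset (an assumption made so the example is self-contained) and compares the forest's prediction with the mean of its individual trees' predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data stands in for the car dataset (an assumption for self-containment).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

forest = RandomForestRegressor(n_estimators=20, random_state=42)
forest.fit(X, y)

# Predict with every individual tree, then average across trees.
per_tree = np.array([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X)))
```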

In [193]:
from sklearn.ensemble import RandomForestRegressor

In [194]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(x_train, y_train)

In [195]:
y_pred = rf_model.predict(x_test)
y_pred

array([35336.06 , 19114.06 ,  8979.34 , 13037.84 , 27217.89 ,  6452.07 ,
        8019.99 ,  8036.52 ,  9743.06 ,  8251.64 , 13446.72 ,  7947.76 ,
       13604.7  , 10859.68 , 39044.485,  6391.41 ,  5680.76 , 13937.54 ,
        8932.3  ,  9324.07 , 10055.7  , 15318.27 ,  7127.68 ,  5754.42 ,
        7260.415, 35576.29 ,  9667.51 , 16806.18 ,  7299.57 , 16567.87 ,
       27333.165,  6398.81 ,  8015.79 , 18494.99 ,  8020.5  , 27124.11 ,
       10160.24 , 12696.55 ,  7098.195, 14197.38 ,  8336.69 ])

### (4) Gradient Boosting Regressor

    A Gradient Boosting Regressor is an ensemble learning method that builds models sequentially, where each new model corrects the errors made by the previous ones. 

    It combines the predictions of several weak learners (typically decision trees) to form a strong predictive model. 

    The key idea is that each new tree is fitted to the residual errors (the differences between the true values and the current ensemble's predictions). For squared-error loss these residuals are the negative gradient of the loss, so adding trees one at a time amounts to gradient descent on the loss.

    This process helps the model to improve its performance iteratively. Gradient Boosting is known for its high accuracy and ability to handle complex data structures, making it effective for regression tasks.
    
    However, it can be prone to overfitting if not properly tuned, and it is computationally expensive due to the sequential nature of training.
    
    Hyperparameters such as the learning rate and the number of trees play a crucial role in controlling overfitting and model performance.
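
The sequential idea can be sketched by hand for squared-error loss. The loop below is a simplified illustration on synthetic data, not the implementation inside GradientBoostingRegressor; the data, tree depth and variable names are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature data (an assumption for self-containment).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 10 + rng.normal(0, 1, size=200)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_trees):
    residuals = y - prediction  # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # shrink each tree's contribution
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

A smaller learning rate shrinks each tree's contribution, so more trees are needed to reach the same training error.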

In [197]:
from sklearn.ensemble import GradientBoostingRegressor

In [198]:
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(x_train, y_train)

In [199]:
y_pred = gb_model.predict(x_test)
y_pred

array([35028.28752753, 19746.98865554,  8787.94349637, 13085.05603467,
       32013.85947178,  6713.92613419,  7640.66295644,  7394.47381238,
        9229.36798234,  8074.06383638, 14124.14257933,  7576.33651443,
       15903.05424327, 11030.55657354, 40715.00789297,  6319.68588493,
        5927.01455946, 14301.86966345,  8369.7593149 ,  8921.66290602,
       10114.07447862, 15997.89305528,  6901.61526122,  6393.29708365,
        7132.37053347, 35513.78090196, 10325.77877153, 17038.73331271,
        6901.61526122, 17059.0057871 , 32524.61209807,  6525.19826414,
        7109.54919438, 20190.32485444,  8132.94214469, 32598.47131224,
        9955.25626339, 12531.46408569,  6806.16231636, 14301.86966345,
        7849.35827572])

### (5) Support Vector Regressor

    A Support Vector Regressor (SVR) is a type of machine learning model that uses the principles of Support Vector Machines (SVM) for regression tasks.

    It works by finding a function (a hyperplane in the feature space induced by the kernel) that fits the data while tolerating deviations up to a margin (epsilon) around the predicted values.

    SVR tries to fit the model within this margin and only penalizes errors that fall outside of it.

    The fitted model is determined by the support vectors, the training points that lie on or outside the epsilon margin; points strictly inside the margin do not influence it.

    SVR is particularly effective for capturing complex, nonlinear relationships in data using kernel functions (like polynomial or radial basis function kernels) to transform the input space. 

    While SVR is powerful for high-dimensional data, it can be sensitive to the choice of parameters (such as the kernel type, C, and epsilon) and is computationally expensive for large datasets.
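
The role of epsilon can be made concrete with the epsilon-insensitive loss. A minimal sketch with made-up prices (the function name is hypothetical):

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon margin, linear in the excess outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Illustrative prices; epsilon is the tolerated deviation.
inside = epsilon_insensitive_loss(10000.0, 10050.0, epsilon=100.0)
outside = epsilon_insensitive_loss(10000.0, 10500.0, epsilon=100.0)
print(inside, outside)
```

scikit-learn's SVR uses epsilon=0.1 unless told otherwise, which is very small relative to prices in the thousands; whether a larger epsilon or a scaled target would help here is untested.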

In [201]:
from sklearn.svm import SVR

In [202]:
svr_model = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_model.fit(x_train, y_train)

## 3. Model Evaluation 


* Compare the performance of all the models based on R-squared, Mean Squared Error (MSE), and Mean Absolute Error (MAE).
    
* Identify the best performing model and justify why it is the best.
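
For reference, the three metrics can be computed by hand. This is a sketch with made-up numbers; the function names are hypothetical and the notebook itself uses the scikit-learn implementations:

```python
def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0]
good = [2.0, 5.0, 9.0]  # close to the true values
bad = [9.0, 5.0, 1.0]   # worse than always predicting the mean

print(mse(y_true, good), mae(y_true, good), r_squared(y_true, good))
print(r_squared(y_true, bad))
```

R-squared becomes negative whenever a model's squared error exceeds that of always predicting the mean of the true values.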


In [203]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [204]:
models = [lr_model, dt_model, rf_model, gb_model, svr_model]
model_names = ['Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Gradient Boosting Regressor', 'Support Vector Regressor']

In [205]:
# Evaluate every fitted model on the same held-out test set
for name, model in zip(model_names, models):
    y_pred = model.predict(x_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    print(f"{name}:")
    print(f"R_squared: {r2:.3f}")
    print(f"MSE: {mse:.3f}")
    print(f"MAE: {mae:.3f}")
    print()

Linear Regression:
R_squared: -164922956338909226729472.000
MSE: 13019681308753720797994640474112.000
MAE: 2259135499228966.000

Decision Tree Regressor:
R_squared: 0.849
MSE: 11931567.213
MAE: 2076.553

Random Forest Regressor:
R_squared: 0.955
MSE: 3565773.219
MAE: 1344.517

Gradient Boosting Regressor:
R_squared: 0.934
MSE: 5174458.968
MAE: 1619.152

Support Vector Regressor:
R_squared: -0.096
MSE: 86518019.015
MAE: 5629.947



* Based on the metrics above, the best performing model is the Random Forest Regressor.

Here's why:

    Highest R-squared value: Random Forest Regressor has an R-squared value of 0.955, the highest among all models. It explains about 95.5% of the variance in the test-set prices.

    Lowest MSE and MAE: Random Forest Regressor has the lowest MSE (3565773.219) and MAE (1344.517), so its predictions deviate least from the actual prices.

    Consistency: Random Forest Regressor ranks first on all three metrics, followed by the Gradient Boosting Regressor (R-squared 0.934) and the Decision Tree Regressor (0.849). Linear Regression and the Support Vector Regressor both have negative R-squared values, meaning they predict worse on this test set than simply using the mean price.

## 4. Feature Importance Analysis

* Identify the significant variables affecting car prices (feature selection)


In [206]:
feature_importances = rf_model.feature_importances_
feature_names = x.columns

for i, importance in enumerate(feature_importances):
    print(f"{feature_names[i]}: {importance:.3f}")

car_ID: 0.021
symboling: 0.000
wheelbase: 0.006
carlength: 0.006
carwidth: 0.012
carheight: 0.002
curbweight: 0.294
enginesize: 0.544
boreratio: 0.004
stroke: 0.004
compressionratio: 0.003
horsepower: 0.033
peakrpm: 0.004
citympg: 0.005
highwaympg: 0.042
CarName_Nissan versa: 0.000
CarName_alfa-romero Quadrifoglio: 0.000
CarName_alfa-romero giulia: 0.000
CarName_alfa-romero stelvio: 0.000
CarName_audi 100 ls: 0.000
CarName_audi 100ls: 0.000
CarName_audi 4000: 0.000
CarName_audi 5000: 0.000
CarName_audi 5000s (diesel): 0.000
CarName_audi fox: 0.000
CarName_bmw 320i: 0.000
CarName_bmw x1: 0.000
CarName_bmw x3: 0.000
CarName_bmw x4: 0.000
CarName_bmw x5: 0.000
CarName_bmw z4: 0.002
CarName_buick century: 0.000
CarName_buick century luxus (sw): 0.000
CarName_buick century special: 0.000
CarName_buick electra 225 custom: 0.000
CarName_buick opel isuzu deluxe: 0.000
CarName_buick regal sport coupe (turbo): 0.002
CarName_buick skyhawk: 0.000
CarName_buick skylark: 0.000
CarName_chevrolet impa

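Based on the feature importance values shown above (the printed list is truncated), these are the five largest importances:

* enginesize: 0.544
    
* curbweight: 0.294
    
* highwaympg: 0.042
    
* horsepower: 0.033
    
* car_ID: 0.021

Note that car_ID is a row identifier rather than a property of the car, so its importance should not be read as a real effect on price. The next genuine attribute in the list is carwidth (0.012).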
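
Sorting the importances programmatically avoids ranking them by eye. A minimal sketch using the numeric-column values printed above (in the notebook itself, the same sort could be applied to rf_model.feature_importances_ paired with x.columns):

```python
# Importances copied from the printed output above (numeric columns only).
importances = {
    "car_ID": 0.021, "symboling": 0.000, "wheelbase": 0.006, "carlength": 0.006,
    "carwidth": 0.012, "carheight": 0.002, "curbweight": 0.294, "enginesize": 0.544,
    "boreratio": 0.004, "stroke": 0.004, "compressionratio": 0.003, "horsepower": 0.033,
    "peakrpm": 0.004, "citympg": 0.005, "highwaympg": 0.042,
}

# Sort by importance, largest first, and show the top five.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked[:5]:
    print(f"{name}: {value:.3f}")
```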

## 5. Hyperparameter Tuning

 Perform hyperparameter tuning and check whether the performance of the model has increased.

In [209]:
from sklearn.model_selection import GridSearchCV

In [210]:
param_grid = {'n_estimators': [100, 200, 300],
             'max_depth': [None, 5, 10],
             'min_samples_split': [2, 5, 10]
             }

In [211]:
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
grid_search.fit(x_train, y_train)

In [212]:
print(f"Best Parameters:{grid_search.best_params_}")
print(f"Best Score:{grid_search.best_score_}")

Best Parameters:{'max_depth': None, 'min_samples_split': 2, 'n_estimators': 300}
Best Score:0.884234244827217


Before hyperparameter tuning, the Random Forest Regressor achieved the following scores on the test set:

R-squared: 0.955
MSE: 3565773.219
MAE: 1344.517

After hyperparameter tuning, the best score reported by the grid search was:

R-squared: 0.8842

These two numbers are not directly comparable. The grid search's best score is the mean R-squared over the 5 cross-validation folds of the training set, whereas 0.955 was measured on the held-out test set. The best parameters also differ from the original model only in n_estimators (300 instead of 100). The results above therefore do not show that the performance of the model has increased after hyperparameter tuning; confirming this either way requires evaluating the tuned model (grid_search.best_estimator_) on the same test set.
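
A like-for-like comparison scores both the baseline and the tuned model on the same held-out test set. The sketch below uses synthetic data and a deliberately small grid (both are assumptions so that the example is self-contained and quick to run); it shows the mechanics only and says nothing about the car dataset's actual result:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the car dataset (an assumption for self-containment).
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline model, scored on the held-out test set.
baseline = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_train, y_train)
baseline_test_r2 = r2_score(y_test, baseline.predict(X_test))

# Small illustrative grid; the notebook's grid is larger.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [20, 40], "max_depth": [None, 5]},
    cv=3,
)
search.fit(X_train, y_train)

# best_score_ is a cross-validation mean; the test score below is the comparable number.
tuned_test_r2 = r2_score(y_test, search.best_estimator_.predict(X_test))

print(f"CV score (training folds): {search.best_score_:.3f}")
print(f"Baseline test R-squared:   {baseline_test_r2:.3f}")
print(f"Tuned test R-squared:      {tuned_test_r2:.3f}")
```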