# Regression

In this notebook, we will demonstrate regression skills using the Automobile dataset.

## Introduction

The Automobile dataset contains information about various cars, including their specifications, prices, and other attributes. This dataset can be used for various data science tasks, such as regression, classification, and clustering. The dataset can be downloaded from [Automobile Dataset on Kaggle](https://www.kaggle.com/toramky/automobile-dataset).


### Dataset Description

| Feature Columns | Description |
| ----------- | ----------- |
| symboling | Assigned insurance risk rating. A value of +3 indicates the highest risk, while a value of -3 indicates the lowest risk. |
| normalized-losses | The relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc.). |
| make | The manufacturer of the car. |
| fuel-type | The type of fuel the car uses (diesel, gas). |
| aspiration | The type of aspiration for the engine (standard, turbo). |
| num-of-doors | The number of doors the car has (two, four). |
| body-style | The body style of the car (convertible, hardtop, hatchback, sedan, wagon). |
| drive-wheels | The type of drive the car has (4wd, fwd, rwd). |
| engine-location | The location of the engine in the car (front, rear). |
| wheel-base | The car's wheelbase in inches. |
| length | The car's length in inches. |
| width | The car's width in inches. |
| height | The car's height in inches. |
| curb-weight | The car's curb weight in pounds. |
| engine-type | The type of engine the car has (dohc, dohcv, l, ohc, ohcf, ohcv, rotor). |
| num-of-cylinders | The number of cylinders the engine has (eight, five, four, six, three, twelve, two). |
| engine-size | The engine size in cubic inches. |
| fuel-system | The car's fuel system (1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi). |
| bore | The engine's bore in inches. |
| stroke | The engine's stroke in inches. |
| compression-ratio | The engine's compression ratio. |
| horsepower | The car's horsepower. |
| peak-rpm | The car's peak RPM. |
| city-mpg | The car's city miles per gallon. |
| highway-mpg | The car's highway miles per gallon. |
| **Target Columns** | **Description** |
| price | The car's price. |

## Import Libraries and Load Data

In [29]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score,mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
pd.options.display.max_columns = None
pd.options.display.max_colwidth = 1000
# Initialize a list to store the results of each model
results = []

df = pd.read_csv('Auto_Data_Proc.csv')
df['price'] = df['price'].astype(float)
df.head()

Unnamed: 0.1,Unnamed: 0,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,price,city-L/100km,highway-L/100km
0,0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,0.811148,0.890278,0.816054,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,13495.0,11.190476,8.703704
1,1,alfa-romero,gas,std,two,convertible,rwd,front,88.6,0.811148,0.890278,0.816054,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,16500.0,11.190476,8.703704
2,2,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,0.822681,0.909722,0.876254,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,16500.0,12.368421,9.038462
3,3,audi,gas,std,four,sedan,fwd,front,99.8,0.84863,0.919444,0.908027,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,13950.0,9.791667,7.833333
4,4,audi,gas,std,four,sedan,4wd,front,99.4,0.84863,0.922222,0.908027,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,17450.0,13.055556,10.681818


## Prepare Data for Regression

In [12]:
# Use 'engine-size' as the predictor
X = df.drop('price',axis=1)
y = df['price']
X = pd.get_dummies(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [16]:
Scaler = StandardScaler()
Scaled_X_train = Scaler.fit_transform(X_train)
Scaled_X_test = Scaler.transform(X_test)

In [18]:
X_train = Scaled_X_train
X_test = Scaled_X_test

## Regression Models

## Models Considered

1. **Linear Regression**: Linear Regression is a simple model that assumes a linear relationship between the input features and the target variable. It's applicable when the relationship between input features and the target variable is approximately linear.

2. **Decision Tree Regressor**: Decision Tree Regressor is a tree-based model that recursively splits the input features into subsets to minimize the variance in the target variable. It's applicable when the dataset has complex relationships that can be modeled using hierarchical decisions.

3. **Random Forest Regressor**: Random Forest Regressor is an ensemble model that combines multiple decision tree regressors, each trained on a random subset of the input features. It's applicable when the dataset has complex relationships that can benefit from the averaging of multiple decision trees, reducing overfitting and improving the model's generalization.

4. **Support Vector Regressor**: Support Vector Regressor is a kernel-based model that transforms the input features into a higher-dimensional space and finds the best linear relationship in that space. It's applicable when the dataset has non-linear relationships that can be modeled more effectively in a higher-dimensional space.

5. **K-Nearest Neighbors Regressor**: K-Nearest Neighbors Regressor is a distance-based model that predicts the target variable based on the average of the target values of the k-nearest neighbors in the input feature space. It's applicable when the dataset has complex relationships that can be modeled by considering local similarities between data points.

6. **Polynomial Regression**: Polynomial Regression is an extension of Linear Regression that models the relationship between input features and the target variable as an nth-degree polynomial. It's applicable when the dataset has non-linear relationships that can be modeled using higher-degree polynomials.


## Linear Regression

In [30]:
from sklearn.linear_model import Ridge

# Create and train the Linear Regression model with GridSearchCV
ridge_params = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
ridge = Ridge()
ridge_grid = GridSearchCV(ridge, ridge_params, cv=5, scoring='neg_mean_squared_error')
ridge_grid.fit(X_train, y_train)

# Predict the test set
y_pred_ridge = ridge_grid.predict(X_test)

# Evaluate the model
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

results.append({'model': 'Linear Regression', 'mse': mae_ridge, 'r2': r2_ridge})
print(r2_ridge)

0.9583190156037937


## Decision Tree Regressor

In [31]:
# Create and train the Decision Tree model with GridSearchCV
tree_params = {'max_depth': range(1, 11), 'min_samples_split': range(2, 21)}
tree = DecisionTreeRegressor()
tree_grid = GridSearchCV(tree, tree_params, cv=5, scoring='neg_mean_squared_error')
tree_grid.fit(X_train, y_train)

# Predict the test set
y_pred_tree = tree_grid.predict(X_test)

# Evaluate the model
mae_tree = mean_absolute_error(y_test, y_pred_tree)
r2_tree = r2_score(y_test, y_pred_tree)

results.append({'model': 'Decision Tree', 'mse': mae_tree, 'r2': r2_tree})

In [32]:
# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the test set
y_pred = model.predict(X_test)

## Evaluate the Model

## Random Forest Regressor

In [33]:
# Create and train the Random Forest model with GridSearchCV
forest_params = {'n_estimators': range(10, 101, 10), 'max_depth': range(1, 11)}
forest = RandomForestRegressor()
forest_grid = GridSearchCV(forest, forest_params, cv=5, scoring='neg_mean_squared_error')
forest_grid.fit(X_train, y_train)

# Predict the test set
y_pred_forest = forest_grid.predict(X_test)

# Evaluate the model
mae_forest = mean_absolute_error(y_test, y_pred_forest)
r2_forest = r2_score(y_test, y_pred_forest)

results.append({'model': 'Random Forest', 'mse': mae_forest, 'r2': r2_forest})


## Support Vector Regressor

In [34]:
# Create and train the Support Vector Regressor model with GridSearchCV
svm_params = {'C': [0.1, 1, 10], 'epsilon': [0.01, 0.1, 1]}
svm = SVR(kernel='linear')
svm_grid = GridSearchCV(svm, svm_params, cv=5, scoring='neg_mean_squared_error')
svm_grid.fit(X_train, y_train)

# Predict the test set
y_pred_svm = svm_grid.predict(X_test)

# Evaluate the model
mae_svm = mean_absolute_error(y_test, y_pred_svm)
r2_svm = r2_score(y_test, y_pred_svm)

results.append({'model': 'Support Vector Regressor', 'mse': mae_svm, 'r2': r2_svm})


## K-Nearest Neighbors Regressor

In [35]:
# Create and train the K-Nearest Neighbors Regressor model with GridSearchCV
knn_params = {'n_neighbors': range(1, 31), 'weights': ['uniform', 'distance']}
knn = KNeighborsRegressor()
knn_grid = GridSearchCV(knn, knn_params, cv=5, scoring='neg_mean_squared_error')
knn_grid.fit(X_train, y_train)

# Predict the test set
y_pred_knn = knn_grid.predict(X_test)

# Evaluate the model
mae_knn = mean_absolute_error(y_test, y_pred_knn)
r2_knn = r2_score(y_test, y_pred_knn)
results.append({'model': 'Support Vector Regressor', 'mse': mae_knn, 'r2': r2_knn})

## Comparing Models

In [47]:

# Create the results DataFrame
results_df = pd.DataFrame(results, columns=['model', 'mse', 'r2'])

# Sort the results by R-squared score in descending order
results_df.sort_values(by='r2', ascending=False, inplace=True)

# Display the results DataFrame
results_df.reset_index(drop=True, inplace=True)
results_df["RE"] = round(results_df['mse']/df['price'].mean()*100,1)
results_df["mse"] = round(results_df["mse"],0)
results_df["r2"] = round(results_df["r2"],2)

In [48]:
results_df

Unnamed: 0,model,mse,r2,RE
0,Linear Regression,1184.0,0.96,8.9
1,Random Forest,1407.0,0.94,10.6
2,Decision Tree,1780.0,0.89,13.4
3,Support Vector Regressor,2307.0,0.75,17.4
4,Support Vector Regressor,2874.0,0.64,21.7


In this case, the Linear Regression model outperforms the other models significantly. It has a R$^2$ value of 0.95, and a relative error of 9% which is likely more than good enough for most use cases. 

## Conclusion

In this notebook, we demonstrated regression skills by predicting the price of automobiles using engine size as the predictor. We used a linear regression model and evaluated its performance using mean squared error and R-squared score.