<a href="https://colab.research.google.com/github/jovanadobreva/Labs-I2DS/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download and Read the Dataset

Run the code below to download the dataset.

In [1]:
!gdown 1boIax8d9Sat6OJzkiIjjpqmtSZKuRYrx

Downloading...
From: https://drive.google.com/uc?id=1boIax8d9Sat6OJzkiIjjpqmtSZKuRYrx
To: C:\Users\dsand\PycharmProjects\VNP\Lab3\Doma\ElectricCarData.csv

  0%|          | 0.00/8.20k [00:00<?, ?B/s]
100%|##########| 8.20k/8.20k [00:00<?, ?B/s]


### Import the required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_validate
import xgboost as xgb
from sklearn.metrics import accuracy_score, classification_report
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


### Read the dataset

CONTEXT:
This is a dataset of electric vehicles.

It contains the following columns:


*   Brand
*   Model
*   AccelSec - Time to accelerate from 0 to 100 km/h, in seconds
*   TopSpeed_KmH - Top speed in km/h
*   Range_Km - Range in km
*   Efficiency_WhKm - Energy consumption in Wh/km
*   FastCharge_KmH - Fast-charging speed, in km of range added per hour
*   RapidCharge - Yes / No
*   PowerTrain - Front, rear, or all wheel drive
*   PlugType
*   BodyStyle - Basic size or style
*   Segment - Market segment
*   Seats - Number of seats
*   PriceEuro - Price in Germany before tax incentives




TASK:
Predict the target 'PriceEuro' and compare the performance of the DecisionTreeRegressor and the XGBRegressor models.

In [2]:
data = pd.read_csv('ElectricCarData.csv')

In [3]:
data.head()

Unnamed: 0,Brand,Model,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,PowerTrain,PlugType,BodyStyle,Segment,Seats,PriceEuro
0,Tesla,Model 3 Long Range Dual Motor,4.6,233,450,161,940,Yes,AWD,Type 2 CCS,Sedan,D,5,55480
1,Volkswagen,ID.3 Pure,10.0,160,270,167,250,Yes,RWD,Type 2 CCS,Hatchback,C,5,30000
2,Polestar,2,4.7,210,400,181,620,Yes,AWD,Type 2 CCS,Liftback,D,5,56440
3,BMW,iX3,6.8,180,360,206,560,Yes,RWD,Type 2 CCS,SUV,D,5,68040
4,Honda,e,9.5,145,170,168,190,Yes,RWD,Type 2 CCS,Hatchback,B,4,32997


### Encode string variables

In [4]:
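def label_data(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Label-encode the given columns, keeping missing values as NaN."""
    data_copy = data.copy()

    for column in columns:
        # Fit a fresh encoder per column and remember which rows were missing.
        encoder = LabelEncoder()
        missing = data_copy[column].isna()
        data_copy[column] = encoder.fit_transform(data_copy[column].astype(str))

        # Restore missing values from the mask: classes are sorted, so the
        # 'nan' label is not guaranteed to be the largest code.
        if missing.any():
            data_copy[column] = data_copy[column].astype(float)
            data_copy.loc[missing, column] = np.nan
    return data_copy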
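
When missing values are cast to the string `'nan'` before encoding, the code they receive depends on how `LabelEncoder` orders its classes. The sketch below uses made-up body-style values (not taken from the dataset) to show that the classes are sorted, so `'nan'` does not always get the largest code and is safer to locate by name.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for illustration; 'nan' stands for a missing entry cast to str
values = np.array(['sedan', 'nan', 'coupe', 'sedan'])

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ is sorted, so the code for 'nan' is its position in that array
nan_code = int(np.where(encoder.classes_ == 'nan')[0][0])
max_code = int(codes.max())

print(list(encoder.classes_))
print(f"code for 'nan': {nan_code}, largest code: {max_code}")
```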

In [5]:
data_i = label_data(data, ['RapidCharge', 'PowerTrain', 'Brand', 'Model', 'PlugType', 'BodyStyle', 'Segment'])

In [6]:
data_i['FastCharge_KmH'] = pd.to_numeric(data_i['FastCharge_KmH'], errors='coerce')

In [7]:
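# The imputer only sees this one column, so there is no other feature to
# measure neighbor distances on and it falls back to the column mean.
imputer = KNNImputer(n_neighbors=3)
data_i['FastCharge_KmH'] = imputer.fit_transform(data_i[['FastCharge_KmH']]).ravel()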
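
Because the imputer is given only the `FastCharge_KmH` column, a row with a missing value has no observed feature from which to compute neighbor distances. The sketch below, on a small made-up array (an assumption, not the dataset), checks that in this single-column setting `KNNImputer` produces the same values as a plain mean imputation.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical single-column data with two missing entries
column = np.array([250.0, np.nan, 620.0, 560.0, np.nan, 190.0]).reshape(-1, 1)

knn_filled = KNNImputer(n_neighbors=3).fit_transform(column)
mean_filled = SimpleImputer(strategy='mean').fit_transform(column)

# Compare the two imputations element by element
same_result = bool(np.allclose(knn_filled, mean_filled))
print(same_result)
```

If neighbor-based imputation is the goal, one option is to pass several numeric columns to the imputer so that distances can be computed from the observed ones.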

## Split the dataset into training and test sets in an 80:20 ratio

In [9]:
# Note: X still contains the 'Unnamed: 0' row-index column read from the CSV.
X = data_i.drop('PriceEuro', axis=1)
Y = data_i['PriceEuro']




In [10]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
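
The split above is not seeded, so each run produces a different partition and different metric values. A minimal sketch on toy arrays (the data and the seed value 42 are arbitrary assumptions) shows how `random_state` makes an 80:20 split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed return the same partition
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
print(len(split_a[0]), len(split_a[1]))  # training rows, test rows
```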


## Initialize the DecisionTreeRegressor model and train it with the fit function

Add values for the parameters max_depth, min_samples_split, and max_features.

Fit the model using the fit function


In [12]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
# Candidate values tried for each hyperparameter
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2', None],
}
# 5-fold cross-validated search; scores are negated MSE, so higher is better
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, Y_train)
# With the default refit=True, best_estimator_ is already refit on all of X_train
best_model = grid_search.best_estimator_
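
After a grid search it can help to look at which combination won and how it scored in cross-validation. The sketch below runs a smaller grid on synthetic data from `make_regression` (an assumption standing in for the car dataset) and converts the negated MSE score back to an RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [3, 5, None], 'min_samples_split': [2, 5]},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE over the folds; flip the sign before the root
best_cv_rmse = float(np.sqrt(-search.best_score_))
print(search.best_params_)
print(f'Cross-validated RMSE of the best setting: {best_cv_rmse:.2f}')
```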


## Predict the outcomes for X test

In [24]:
# best_model was already refit on the training set by GridSearchCV (refit=True)
y_pred = best_model.predict(X_test)



## Assess the model performance using sklearn regression metrics

In [25]:
mae = mean_absolute_error(Y_test, y_pred)
mse = mean_squared_error(Y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(Y_test, y_pred)

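# Price span of the whole dataset, to put the errors in context
# (named price_range to avoid shadowing the built-in range)
price_range = data['PriceEuro'].max() - data['PriceEuro'].min()

print(f'Range: {price_range}')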
print(f'Mean Absolute Error (MAE): {mae}')
print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
print(f'R-squared (R²): {r2}')

Range: 194871
Mean Absolute Error (MAE): 17871.214285714286
Mean Squared Error (MSE): 1067060521.9034392
Root Mean Squared Error (RMSE): 32665.892332882002
R-squared (R²): 0.46095573141093715
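
The four metrics are easier to read with a small worked case next to them. The sketch below wraps them in a helper (`regression_report` is a hypothetical name, not part of sklearn), checks it on three toy points, and expresses the MAE printed above as a share of the price range, which comes to roughly 9%.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'MAE': float(mean_absolute_error(y_true, y_pred)),
        'MSE': float(mse),
        'RMSE': float(np.sqrt(mse)),
        'R2': float(r2_score(y_true, y_pred)),
    }


# Toy check: only the last prediction is off, by 1
toy = regression_report([1, 2, 3], [1, 2, 4])
print(toy)

# Values copied from the printed output above
mae_share = 17871.214285714286 / 194871
print(f'MAE as a share of the price range: {mae_share:.1%}')
```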


## Initialize the XGBRegressor model, and use the fit function

Add values for the parameters: n_estimators, max_depth, learning_rate, and set the objective to "reg:squarederror"

Fit the model using the fit function

In [35]:
# n_estimators, max_depth, and learning_rate are all set by the grid search below
model = xgb.XGBRegressor(objective='reg:squarederror')
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 10],
}
# Note: this search selects on MAE, while the decision tree search selected on MSE
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)

grid_search.fit(X_train, Y_train)


## Predict the outcomes for X test

In [37]:
# best_estimator_ is already refit on the training set by GridSearchCV (refit=True)
bestModel = grid_search.best_estimator_
y_pred = bestModel.predict(X_test)

## Assess the model performance using sklearn regression metrics

In [38]:
mae = mean_absolute_error(Y_test, y_pred)
mse = mean_squared_error(Y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(Y_test, y_pred)

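# Price span of the whole dataset, to put the errors in context
# (named price_range to avoid shadowing the built-in range)
price_range = data['PriceEuro'].max() - data['PriceEuro'].min()

print(f'Range: {price_range}')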
print(f'Mean Absolute Error (MAE): {mae}')
print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
print(f'R-squared (R²): {r2}')

Range: 194871
Mean Absolute Error (MAE): 14567.281901041666
Mean Squared Error (MSE): 915751048.5445086
Root Mean Squared Error (RMSE): 30261.378827550285
R-squared (R²): 0.537392258644104


## Compare the performance of both models on at least three regression metrics
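
On this particular train/test split, XGBRegressor scores better on all three metrics: its MAE and RMSE are lower and its R² is higher. The split is not seeded, so the size of the gap may change between runs. A minimal sketch that places the printed values side by side (the numbers are copied from the two evaluation outputs above; the table layout is only one possible choice):

```python
import pandas as pd

# Metric values copied from the two evaluation outputs in this notebook
results = pd.DataFrame(
    {
        'DecisionTreeRegressor': [17871.214285714286, 32665.892332882002, 0.46095573141093715],
        'XGBRegressor': [14567.281901041666, 30261.378827550285, 0.537392258644104],
    },
    index=['MAE', 'RMSE', 'R2'],
)

# MAE and RMSE: lower is better. R2: higher is better.
lower_is_better = pd.Series([True, True, False], index=results.index)
xgb_lower = results['XGBRegressor'] < results['DecisionTreeRegressor']
xgb_higher = results['XGBRegressor'] > results['DecisionTreeRegressor']
xgb_wins = xgb_lower.where(lower_is_better, xgb_higher)

results['XGB better'] = xgb_wins
print(results)
```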