<a href="https://colab.research.google.com/github/jovanadobreva/Labs-I2DS/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees and Gradient Boosting

## Setting up the Environment

For this laboratory exercise, you will need the conda package and environment manager. We will install a minimal distribution, [Miniconda](https://docs.conda.io/projects/miniconda/en/latest/). Choose the installer for your operating system, then download and install it.

Alternatively, install it from the command line:

### Windows
```shell
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o miniconda.exe
start /wait "" miniconda.exe /S
del miniconda.exe
```

### Linux
```shell
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
```

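### macOS

The command below downloads the Apple silicon (arm64) installer. On an Intel Mac, replace `arm64` with `x86_64` in the URL.
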
```shell
mkdir -p ~/miniconda3
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -o ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
```

On both Linux and macOS, initialize the newly installed Miniconda after the installation finishes. The following commands initialize it for the bash and zsh shells:

```shell
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
```


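Once you have installed Miniconda, run the following command to create an environment. Listing `python` ensures that the environment gets its own interpreter and pip, which the later `pip install` steps rely on:
```bash
conda create --name myenv python
```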

'myenv' is the name of the environment; you can choose any name you like.

When conda asks you to proceed, type y.

After successfully creating the environment, activate it with the following command:
```bash
conda activate myenv
```

For more detailed information you can read the [documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands).

Now, once the environment is activated, proceed to install the required libraries.

```bash
pip install numpy pandas scikit-learn xgboost matplotlib seaborn gdown
```
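
Before moving on, it can help to confirm that the libraries were installed into the active environment. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the `pip install` command above.

```python
from importlib import metadata

# Distribution names as passed to `pip install` above
REQUIRED = ["numpy", "pandas", "scikit-learn", "xgboost", "matplotlib", "seaborn", "gdown"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any package is reported as missing, check that the environment is activated before re-running the `pip install` command.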

Next, register the environment as a Jupyter kernel. The following commands install ipykernel and register the environment under the name 'myenv'.

```bash
pip install ipykernel
```
```bash
python -m ipykernel install --name=myenv
```


Next, download this starter notebook, start Jupyter Notebook, and open the notebook. In the Kernel menu, choose the name of the environment you created, as shown in the picture below.


![jupyter](https://drive.google.com/uc?export=view&id=1N-27jjlIgpTILi-_6lny7ng8sE52SAZx)


## Download and Read the Dataset

Run the code below to download the dataset.

In [None]:
!gdown 1boIax8d9Sat6OJzkiIjjpqmtSZKuRYrx

Downloading...
From: https://drive.google.com/uc?id=1boIax8d9Sat6OJzkiIjjpqmtSZKuRYrx
To: /content/ElectricCarData.csv
100% 8.20k/8.20k [00:00<00:00, 26.5MB/s]


### Import the required libraries

In [120]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


### Read the dataset

CONTEXT:
This is a dataset of electric vehicles.

It contains the following columns:

*   Brand
*   Model
*   AccelSec - Acceleration time from 0 to 100 km/h, in seconds
*   TopSpeed_KmH - Top speed in km/h
*   Range_Km - Range in km
*   Efficiency_WhKm - Energy consumption in Wh/km
*   FastCharge_KmH - Fast-charging speed, in km of range added per hour
*   RapidCharge - Whether rapid charging is supported (Yes / No)
*   PowerTrain - Front-, rear-, or all-wheel drive
*   PlugType - Charging plug type
*   BodyStyle - Basic size or style
*   Segment - Market segment
*   Seats - Number of seats
*   PriceEuro - Price in Germany before tax incentives, in euros

TASK:
Predict the target 'PriceEuro' and compare the performance of the DecisionTreeRegressor and XGBRegressor models.

In [121]:
data = pd.read_csv('ElectricCarData.csv')

In [122]:
data.head()

| | Brand | Model | AccelSec | TopSpeed_KmH | Range_Km | Efficiency_WhKm | FastCharge_KmH | RapidCharge | PowerTrain | PlugType | BodyStyle | Segment | Seats | PriceEuro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tesla | Model 3 Long Range Dual Motor | 4.6 | 233 | 450 | 161 | 940 | Yes | AWD | Type 2 CCS | Sedan | D | 5 | 55480 |
| 1 | Volkswagen | ID.3 Pure | 10.0 | 160 | 270 | 167 | 250 | Yes | RWD | Type 2 CCS | Hatchback | C | 5 | 30000 |
| 2 | Polestar | 2 | 4.7 | 210 | 400 | 181 | 620 | Yes | AWD | Type 2 CCS | Liftback | D | 5 | 56440 |
| 3 | BMW | iX3 | 6.8 | 180 | 360 | 206 | 560 | Yes | RWD | Type 2 CCS | SUV | D | 5 | 68040 |
| 4 | Honda | e | 9.5 | 145 | 170 | 168 | 190 | Yes | RWD | Type 2 CCS | Hatchback | B | 4 | 32997 |


### Encode string variables

In [124]:
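def label_data(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of `data` with each listed column label-encoded as integers."""
    data_copy = data.copy()

    for column in columns:
        # Encode the column's string values as integer codes (classes are sorted alphabetically)
        data_copy[column] = LabelEncoder().fit_transform(data_copy[column].astype(str))
    return data_copy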
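
LabelEncoder assigns integer codes in the sorted order of the class labels, so the codes identify categories but carry no other meaning. The function above also discards the fitted encoder, which makes it hard to map codes back to labels later. One option is to keep a fitted encoder per column. The sketch below uses a hypothetical helper, `label_data_with_encoders`, and a few values copied from the rows shown earlier; it is an illustration, not part of the required solution.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def label_data_with_encoders(data: pd.DataFrame, columns: list):
    """Hypothetical variant of label_data that also returns the fitted encoders."""
    data_copy = data.copy()
    encoders = {}
    for column in columns:
        encoder = LabelEncoder()
        data_copy[column] = encoder.fit_transform(data_copy[column].astype(str))
        encoders[column] = encoder  # keep it so codes can be decoded later
    return data_copy, encoders


# A few values taken from the rows shown earlier
sample = pd.DataFrame({
    "PowerTrain": ["AWD", "RWD", "AWD", "RWD", "RWD"],
    "BodyStyle": ["Sedan", "Hatchback", "Liftback", "SUV", "Hatchback"],
})

encoded, encoders = label_data_with_encoders(sample, ["PowerTrain", "BodyStyle"])
print(encoded)
print(list(encoders["BodyStyle"].classes_))  # index in this list == integer code
print(encoders["PowerTrain"].inverse_transform(encoded["PowerTrain"]))
```

Because the codes are arbitrary, a tree may split on them as if they were ordered; one-hot encoding is a common alternative when that is a concern.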

In [125]:
def drop_data(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of `data` without the listed columns."""
    return data.drop(columns=columns)

In [126]:
data = drop_data(data=data, columns=['Brand', 'Model'])

In [127]:
data.head()

| | AccelSec | TopSpeed_KmH | Range_Km | Efficiency_WhKm | FastCharge_KmH | RapidCharge | PowerTrain | PlugType | BodyStyle | Segment | Seats | PriceEuro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.6 | 233 | 450 | 161 | 940 | Yes | AWD | Type 2 CCS | Sedan | D | 5 | 55480 |
| 1 | 10.0 | 160 | 270 | 167 | 250 | Yes | RWD | Type 2 CCS | Hatchback | C | 5 | 30000 |
| 2 | 4.7 | 210 | 400 | 181 | 620 | Yes | AWD | Type 2 CCS | Liftback | D | 5 | 56440 |
| 3 | 6.8 | 180 | 360 | 206 | 560 | Yes | RWD | Type 2 CCS | SUV | D | 5 | 68040 |
| 4 | 9.5 | 145 | 170 | 168 | 190 | Yes | RWD | Type 2 CCS | Hatchback | B | 4 | 32997 |


In [128]:
data = label_data(data=data, columns=['RapidCharge','PowerTrain', 'PlugType','BodyStyle','Segment'])

In [129]:
data.sample(3)

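| | AccelSec | TopSpeed_KmH | Range_Km | Efficiency_WhKm | FastCharge_KmH | RapidCharge | PowerTrain | PlugType | BodyStyle | Segment | Seats | PriceEuro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 92 | 7.9 | 167 | 365 | 175 | 320 | 1 | 1 | 2 | 6 | 1 | 5 | 36837 |
| 65 | 4.0 | 250 | 425 | 197 | 890 | 1 | 0 | 2 | 7 | 5 | 4 | 109302 |
| 10 | 5.1 | 180 | 370 | 216 | 440 | 1 | 0 | 2 | 6 | 3 | 5 | 69484 |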


In [123]:
data.isna().sum()

Brand              0
Model              0
AccelSec           0
TopSpeed_KmH       0
Range_Km           0
Efficiency_WhKm    0
FastCharge_KmH     0
RapidCharge        0
PowerTrain         0
PlugType           0
BodyStyle          0
Segment            0
Seats              0
PriceEuro          0
dtype: int64

In [131]:
data.columns[data.apply(lambda col: col.astype(str).str.contains(r'[^0-9\.]', regex=True).any())]

Index(['FastCharge_KmH'], dtype='object')

In [130]:
data['FastCharge_KmH'].unique()


array(['940', '250', '620', '560', '190', '220', '420', '650', '540',
       '440', '230', '380', '210', '590', '780', '170', '260', '930',
       '850', '910', '490', '470', '270', '450', '350', '710', '240',
       '390', '570', '610', '340', '730', '920', '-', '550', '900', '520',
       '430', '890', '410', '770', '460', '360', '810', '480', '290',
       '330', '740', '510', '320', '500'], dtype=object)

In [132]:
data['FastCharge_KmH'] = data['FastCharge_KmH'].replace('-', np.nan)

In [134]:
data.isna().sum()

AccelSec           0
TopSpeed_KmH       0
Range_Km           0
Efficiency_WhKm    0
FastCharge_KmH     5
RapidCharge        0
PowerTrain         0
PlugType           0
BodyStyle          0
Segment            0
Seats              0
PriceEuro          0
dtype: int64

In [135]:
imputer = SimpleImputer(strategy='mean')

In [137]:
data['FastCharge_KmH'] = imputer.fit_transform(data[['FastCharge_KmH']])

In [139]:
data['FastCharge_KmH'].isnull().sum()

np.int64(0)
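
The cleaning and imputation steps above can also be sketched with pandas alone. This assumes that '-' is the only non-numeric marker in the column, as the unique() output above suggests; the toy values below are a small made-up subset, not the full column.

```python
import pandas as pd

# Toy column mimicking FastCharge_KmH: numeric strings plus the '-' placeholder
fast_charge = pd.Series(["940", "250", "-", "560"], name="FastCharge_KmH")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(fast_charge, errors="coerce")
print(numeric.isna().sum())  # number of missing values after conversion

# Mean imputation: the mean is computed over the non-missing values only
filled = numeric.fillna(numeric.mean())
print(filled.tolist())
```

Converting explicitly with `pd.to_numeric` makes the column's dtype numeric before any imputer sees it, which avoids relying on implicit string-to-float conversion.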

## Split the dataset into training and test sets in an 80:20 ratio

In [141]:
input_data = drop_data(data=data, columns=['PriceEuro'])
input_data.sample(3)

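| | AccelSec | TopSpeed_KmH | Range_Km | Efficiency_WhKm | FastCharge_KmH | RapidCharge | PowerTrain | PlugType | BodyStyle | Segment | Seats |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 55 | 8.3 | 145 | 170 | 168 | 190.0 | 1 | 2 | 2 | 1 | 1 | 4 |
| 94 | 9.0 | 150 | 250 | 168 | 330.0 | 1 | 1 | 2 | 0 | 1 | 4 |
| 63 | 4.8 | 200 | 365 | 232 | 340.0 | 1 | 0 | 2 | 6 | 4 | 5 |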


In [142]:
target_data = data['PriceEuro'].copy()

In [143]:
X_train, X_test, Y_train, Y_test = train_test_split(input_data, target_data, test_size=0.2)
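
Because train_test_split is called here without random_state, every run produces a different split, so the metrics reported further down will vary between runs. A minimal sketch of a repeatable split follows; the arrays are synthetic stand-ins for input_data and target_data, and the seed 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for input_data and target_data (10 rows, 2 features)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# random_state=42 is arbitrary; any fixed integer gives a repeatable split
first = train_test_split(X, y, test_size=0.2, random_state=42)
second = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = first
print(X_train.shape, X_test.shape)
print(all(np.array_equal(a, b) for a, b in zip(first, second)))
```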

In [145]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
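
Standardization is harmless here but should not be required for tree-based models: a decision tree splits on thresholds of individual features, so rescaling a feature is expected to leave the learned partition, and hence the predictions, unchanged. The sketch below checks this on synthetic data; the data, the feature scales, and the seeds are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Three synthetic features on very different scales
X = rng.normal(size=(80, 3)) * np.array([1.0, 50.0, 1000.0])
y = 3.0 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(size=80)  # synthetic target
X_tr, X_te, y_tr = X[:60], X[60:], y[:60]

scaler = StandardScaler().fit(X_tr)  # fit on the training part only

tree_raw = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
tree_scaled = DecisionTreeRegressor(random_state=42).fit(scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))
print(np.allclose(pred_raw, pred_scaled))  # True when both trees predict the same values
```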

## Initialize the DecisionTreeRegressor model and train it with the fit function

Add values for the parameters max_depth, min_samples_split, and max_features.

Fit the model using the fit function.


In [190]:
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor

In [191]:
modelDTR = DecisionTreeRegressor(random_state=42)
modelDTR.fit(X_train, Y_train)
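
The cell above relies on the default values, whereas the instructions ask for explicit values of max_depth, min_samples_split, and max_features. A sketch of what that could look like follows. The values are illustrative rather than tuned, and the data is synthetic so that the example runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for X_train / Y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 11))
y_demo = 1000.0 * X_demo[:, 0] + rng.normal(size=80)

model = DecisionTreeRegressor(
    max_depth=4,           # illustrative: cap the depth of the tree
    min_samples_split=5,   # illustrative: a node needs at least 5 samples to be split
    max_features=0.8,      # illustrative: consider a fraction of the features at each split
    random_state=42,
)
model.fit(X_demo, y_demo)
print(model.get_depth(), model.get_n_leaves())
```

Limiting depth and requiring more samples per split both restrict how closely the tree can fit the training data, which tends to matter on a dataset this small.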

## Predict the outcomes for X_test

In [192]:
y_pred = modelDTR.predict(X_test)

In [193]:
print("Model predictions:", y_pred)

Model predictions: [ 30000.  55480.  46900.  40000.  34459.  53500.  40000.  41906.  38987.
  60437.  60437.  34361.  50000.  79990. 125000. 125000.  45000.  34400.
  30000.  65620.  75351.]


## Assess the model performance using sklearn regression metrics

In [194]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print(f"Mean absolute error: {mean_absolute_error(Y_test, y_pred):,}")
print(f"Mean squared error: {mean_squared_error(Y_test, y_pred):,}")
print(f"R2 score: {r2_score(Y_test, y_pred):,}")


Mean absolute error: 15,499.047619047618
Mean squared error: 776,829,075.5238096
R2 score: 0.3954620510243436
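
To make the three metrics concrete, the sketch below computes them by hand with NumPy on a made-up three-sample example, following the standard definitions that the sklearn functions implement.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])  # toy targets
y_hat = np.array([2.0, 5.0, 9.0])   # toy predictions

errors = y_true - y_hat
mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(mae, mse, r2)
```

MAE weights all errors linearly, MSE penalizes large errors more heavily, and R² compares the model against always predicting the mean of the targets.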


## Initialize the XGBRegressor model, and use the fit function

Add values for the parameters n_estimators, max_depth, and learning_rate, and set the objective to "reg:squarederror".

In [195]:
modelXGB = XGBRegressor(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    objective='reg:squarederror',
    random_state=42
)



Fit the model using the fit function.

In [196]:
modelXGB.fit(X_train, Y_train)


## Predict the outcomes for X_test

In [197]:
Y_pred = modelXGB.predict(X_test)

In [198]:
print("Model predictions:", Y_pred)

Model predictions: [ 33969.78   52815.008  56885.316  37799.96   34285.17   62148.27
  38177.56   38927.152  38005.207  59762.766  53403.766  34577.97
  48221.613 160126.8   123801.3   102707.414  42649.004  28713.586
  40905.277 103963.27   86341.53 ]


## Assess the model performance using sklearn regression metrics

In [205]:
print(f"Mean absolute error: {mean_absolute_error(Y_test, Y_pred):,}")
print(f"Mean squared error: {mean_squared_error(Y_test, Y_pred):,}")
print(f"R2 score: {r2_score(Y_test, Y_pred):,}")

Mean absolute error: 12,632.77380952381
Mean squared error: 827,908,141.0342058
R2 score: 0.3557116389274597


## Compare the performance of both models on at least three regression metrics

In [206]:
print(f"Mean absolute error: XGB: {mean_absolute_error(Y_test, Y_pred):,} vs DTR: {mean_absolute_error(Y_test, y_pred):,}")
print(f"Mean squared error: XGB: {mean_squared_error(Y_test, Y_pred):,} vs DTR: {mean_squared_error(Y_test, y_pred):,}")
print(f"R² score: XGB: {r2_score(Y_test, Y_pred):,} vs DTR: {r2_score(Y_test, y_pred):,}")

Mean absolute error: XGB: 12,632.77380952381 vs DTR: 15,499.047619047618
Mean squared error: XGB: 827,908,141.0342058 vs DTR: 776,829,075.5238096
R² score: XGB: 0.3557116389274597 vs DTR: 0.3954620510243436
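
In this run, XGBoost has the lower MAE, while the decision tree has the lower MSE and the higher R². Because MSE squares each error, this pattern would be consistent with XGBoost making a few larger errors, although a test set of 21 samples is too small to support firm conclusions. MSE is expressed in squared euros, so taking its square root (RMSE) gives a figure in the same unit as MAE. The sketch below derives RMSE from the MSE values printed above.

```python
import math

# MSE values copied from the output above
mse = {"DTR": 776_829_075.5238096, "XGB": 827_908_141.0342058}

# RMSE is the square root of MSE, which brings the error back to euros
rmse = {name: math.sqrt(value) for name, value in mse.items()}
for name, value in rmse.items():
    print(f"{name} RMSE: {value:,.0f} EUR")
```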
