<a href="https://colab.research.google.com/github/Bonnnana/Introduction-to-Data-Science/blob/main/Lab/Lab3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Decision Trees and Gradient Boosting

## Setting up the Environment

For this laboratory exercise, you will need the conda package and environment manager. We will install [Miniconda](https://docs.conda.io/projects/miniconda/en/latest/), a minimal distribution. Choose the installer that matches your operating system, then download and install it.

Or use the following commands:

### Windows
```shell
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o miniconda.exe
start /wait "" miniconda.exe /S
del miniconda.exe
```

### Linux
```shell
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
```

### macOS

```shell
mkdir -p ~/miniconda3
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -o ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
```

On both Linux and macOS, initialize your newly installed Miniconda after the installation finishes. The following commands initialize it for the bash and zsh shells:

```shell
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
```


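Once you have installed Miniconda, run the following command to create an environment. Listing `python` makes sure the environment gets its own interpreter and `pip`; without it, conda creates an empty environment:
```bash
conda create --name myenv python
```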

`myenv` is the name of the environment; you can choose any name you like.

When conda asks whether to proceed, type `y`.

After successfully creating the environment, activate it with the following command:
```bash
conda activate myenv
```

For more details, see the [documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands).

Now, once the environment is activated, proceed to install the required libraries.

```bash
pip install numpy pandas scikit-learn xgboost matplotlib seaborn gdown
```
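To confirm that the installation worked, a quick check along the following lines can be run with the environment's Python. This is a minimal sketch and not part of the original lab: the helper name `check_packages` is hypothetical, and note that scikit-learn is imported under the module name `sklearn`.

```python
from importlib.util import find_spec

# Import names of the libraries installed above
# (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "pandas", "sklearn", "xgboost", "matplotlib", "seaborn", "gdown"]


def check_packages(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


# Report which libraries are visible to the current interpreter.
for name, found in check_packages(REQUIRED).items():
    print(f"{name}: {'OK' if found else 'MISSING'}")
```

Any library reported as `MISSING` can be installed again with `pip install` while the environment is active.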

Next, add the environment to Jupyter. The following commands install ipykernel and register the environment as a Jupyter kernel.

```bash
pip install ipykernel
```
```bash
python -m ipykernel install --user --name=myenv
```


Next, start Jupyter Notebook, then download and open this starter notebook. In the Kernel menu, choose the name of the environment you created, as shown in the picture below.


![jupyter](https://drive.google.com/uc?export=view&id=1N-27jjlIgpTILi-_6lny7ng8sE52SAZx)


## Download and Read the Dataset

Run the code below to download the dataset.

In [1]:
!gdown 1boIax8d9Sat6OJzkiIjjpqmtSZKuRYrx

Downloading...
From: https://drive.google.com/uc?id=1boIax8d9Sat6OJzkiIjjpqmtSZKuRYrx
To: /content/ElectricCarData.csv
100% 8.20k/8.20k [00:00<00:00, 19.1MB/s]


### Import the required libraries

In [32]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.impute import SimpleImputer
import numpy as np
from xgboost import XGBRegressor

### Read the dataset

CONTEXT:
This is a dataset of electric vehicles.

It contains the following columns:


*   Brand
*   Model
*   AccelSec - Acceleration time from 0 to 100 km/h, in seconds
*   TopSpeed_KmH - The top speed in km/h
*   Range_Km - Range in km
*   Efficiency_WhKm - Efficiency in Wh/km
*   FastCharge_KmH - Fast-charging speed, in km of range added per hour
*   RapidCharge - Yes / No
*   PowerTrain - Front, rear, or all wheel drive
*   PlugType
*   BodyStyle - Basic size or style
*   Segment - Market segment
*   Seats - Number of seats
*   PriceEuro - Price in Germany before tax incentives




TASK:
Predict the target 'PriceEuro' and compare the performance of the DecisionTreeRegressor and the XGBRegressor models.

In [3]:
data = pd.read_csv('/content/ElectricCarData.csv')

In [4]:
data.head()

Unnamed: 0,Brand,Model,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,PowerTrain,PlugType,BodyStyle,Segment,Seats,PriceEuro
0,Tesla,Model 3 Long Range Dual Motor,4.6,233,450,161,940,Yes,AWD,Type 2 CCS,Sedan,D,5,55480
1,Volkswagen,ID.3 Pure,10.0,160,270,167,250,Yes,RWD,Type 2 CCS,Hatchback,C,5,30000
2,Polestar,2,4.7,210,400,181,620,Yes,AWD,Type 2 CCS,Liftback,D,5,56440
3,BMW,iX3,6.8,180,360,206,560,Yes,RWD,Type 2 CCS,SUV,D,5,68040
4,Honda,e,9.5,145,170,168,190,Yes,RWD,Type 2 CCS,Hatchback,B,4,32997


### Encode string variables

In [5]:
data.isnull().sum()

Unnamed: 0,0
Brand,0
Model,0
AccelSec,0
TopSpeed_KmH,0
Range_Km,0
Efficiency_WhKm,0
FastCharge_KmH,0
RapidCharge,0
PowerTrain,0
PlugType,0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Brand            103 non-null    object 
 1   Model            103 non-null    object 
 2   AccelSec         103 non-null    float64
 3   TopSpeed_KmH     103 non-null    int64  
 4   Range_Km         103 non-null    int64  
 5   Efficiency_WhKm  103 non-null    int64  
 6   FastCharge_KmH   103 non-null    object 
 7   RapidCharge      103 non-null    object 
 8   PowerTrain       103 non-null    object 
 9   PlugType         103 non-null    object 
 10  BodyStyle        103 non-null    object 
 11  Segment          103 non-null    object 
 12  Seats            103 non-null    int64  
 13  PriceEuro        103 non-null    int64  
dtypes: float64(1), int64(5), object(8)
memory usage: 11.4+ KB


In [7]:
data.describe()

Unnamed: 0,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,Seats,PriceEuro
count,103.0,103.0,103.0,103.0,103.0,103.0
mean,7.396117,179.194175,338.786408,189.165049,4.883495,55811.563107
std,3.01743,43.57303,126.014444,29.566839,0.795834,34134.66528
min,2.1,123.0,95.0,104.0,2.0,20129.0
25%,5.1,150.0,250.0,168.0,5.0,34429.5
50%,7.3,160.0,340.0,180.0,5.0,45000.0
75%,9.0,200.0,400.0,203.0,5.0,65000.0
max,22.4,410.0,970.0,273.0,7.0,215000.0


In [8]:
list(data['FastCharge_KmH'])

['940',
 '250',
 '620',
 '560',
 '190',
 '620',
 '220',
 '420',
 '650',
 '540',
 '440',
 '230',
 '380',
 '650',
 '210',
 '590',
 '780',
 '170',
 '260',
 '260',
 '420',
 '930',
 '230',
 '850',
 '910',
 '560',
 '490',
 '470',
 '270',
 '380',
 '450',
 '350',
 '230',
 '710',
 '240',
 '390',
 '190',
 '570',
 '230',
 '440',
 '560',
 '210',
 '610',
 '170',
 '170',
 '340',
 '210',
 '730',
 '540',
 '350',
 '590',
 '920',
 '390',
 '560',
 '490',
 '190',
 '380',
 '-',
 '380',
 '550',
 '230',
 '900',
 '520',
 '340',
 '430',
 '890',
 '190',
 '710',
 '-',
 '410',
 '260',
 '540',
 '770',
 '460',
 '270',
 '230',
 '550',
 '-',
 '360',
 '810',
 '470',
 '480',
 '-',
 '380',
 '290',
 '330',
 '740',
 '470',
 '540',
 '440',
 '510',
 '-',
 '320',
 '500',
 '330',
 '470',
 '220',
 '420',
 '440',
 '540',
 '440',
 '450',
 '480']

In [9]:
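def simple_impute_data(data: pd.DataFrame, columns: list, strategy: str) -> pd.DataFrame:
  """Return a copy of data with missing values in columns filled using strategy."""
  imputer = SimpleImputer(strategy=strategy)
  data_copy = data.copy()

  for column in columns:
    # fit_transform returns a 2-D array of shape (n_rows, 1); flatten it back to 1-D
    data_copy[column] = imputer.fit_transform(data_copy[[column]]).ravel()
  return data_copy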

In [10]:
data['FastCharge_KmH'] = data['FastCharge_KmH'].replace('-', np.nan).astype(float)
data = simple_impute_data(data=data, columns=['FastCharge_KmH'], strategy='mean')
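The cleaning step above can be illustrated on a toy column. The sketch below is an assumption-labelled alternative, not the notebook's own code: it uses `pd.to_numeric(..., errors="coerce")` and `fillna` in place of `replace` plus `SimpleImputer`, and the four values are made up. For the `mean` strategy both routes fill the gap with the mean of the observed values.

```python
import pandas as pd

# Toy column that mimics FastCharge_KmH: numbers stored as strings,
# with '-' marking a missing value. The values are illustrative only.
toy = pd.DataFrame({"FastCharge_KmH": ["940", "250", "-", "560"]})

# errors="coerce" turns anything that is not a number into NaN.
toy["FastCharge_KmH"] = pd.to_numeric(toy["FastCharge_KmH"], errors="coerce")

# Mean imputation: NaN is replaced by the mean of the observed values,
# here (940 + 250 + 560) / 3.
column_mean = toy["FastCharge_KmH"].mean()
toy["FastCharge_KmH"] = toy["FastCharge_KmH"].fillna(column_mean)

print(toy)
```

One caveat of either approach is that the mean is computed over all rows, including those that later end up in the test set.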

In [11]:
def label_data(data: pd.DataFrame, columns: list) -> pd.DataFrame:
  """Return a copy of data with each listed column replaced by integer codes."""
  encoder = LabelEncoder()
  data_copy = data.copy()

  for column in columns:
    # The encoder is refit for every column, so codes are assigned per column
    data_copy[column] = encoder.fit_transform(data_copy[column].astype(str))
  return data_copy
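To see what the encoder does to a single column, here is a small sketch on the three `PowerTrain` values. `LabelEncoder` sorts the distinct strings and uses each string's position in that sorted list as its code, so the resulting integers carry an alphabetical order that has no meaning for a nominal column such as `Brand`.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["RWD", "FWD", "AWD", "RWD"])

# Classes are stored in sorted order; each code is an index into classes_.
print(encoder.classes_.tolist())  # ['AWD', 'FWD', 'RWD']
print(codes.tolist())             # [2, 1, 0, 2]

# inverse_transform maps the integer codes back to the original strings.
print(encoder.inverse_transform(codes).tolist())
```

Because the codes depend on the exact strings, values that differ only by whitespace (for example a trailing space) are treated as different categories.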

In [12]:
list(set(data['Brand']))

['Smart ',
 'Skoda ',
 'Volkswagen ',
 'CUPRA ',
 'SEAT ',
 'Lucid ',
 'Polestar ',
 'Fiat ',
 'Byton ',
 'Opel ',
 'BMW ',
 'Lexus ',
 'Aiways ',
 'Sono ',
 'Ford ',
 'Mazda ',
 'Mini ',
 'Volvo ',
 'Peugeot ',
 'MG ',
 'Kia ',
 'DS ',
 'Renault ',
 'Mercedes ',
 'Tesla ',
 'Hyundai ',
 'Honda ',
 'Citroen ',
 'Porsche ',
 'Jaguar ',
 'Nissan ',
 'Audi ',
 'Lightyear ']

In [13]:
list(set(data['Model']))

['i3 120 Ah',
 'Taycan Turbo S',
 'UX 300e',
 'Enyaq iV 80X',
 'Enyaq iV 50',
 'Kona Electric 39 kWh',
 'Cybertruck Tri Motor',
 'e-Golf ',
 'MX-30 ',
 'ID.4 ',
 'e-Niro 39 kWh',
 'iX3 ',
 'e-Soul 39 kWh',
 'e-tron Sportback 55 quattro',
 'U5 ',
 'e-C4 ',
 'Roadster ',
 'Model S Long Range',
 'Cybertruck Single Motor',
 'M-Byte 72 kWh 2WD',
 'e-tron 55 quattro',
 'ID.3 Pure',
 'e-tron Sportback 50 quattro',
 'e-Soul 64 kWh',
 'e-tron GT ',
 'Ariya 87kWh',
 'e-NV200 Evalia ',
 'Model 3 Long Range Dual Motor',
 'CITIGOe iV ',
 'Kangoo Maxi ZE 33',
 'Cybertruck Dual Motor',
 'Model X Performance',
 'EQ fortwo cabrio',
 'Taycan Cross Turismo ',
 'ID.3 Pro S',
 'Ariya e-4ORCE 87kWh',
 'Zoe ZE40 R110',
 'EQA ',
 'Ariya e-4ORCE 63kWh',
 'I-Pace ',
 'e-tron 50 quattro',
 'Model 3 Long Range Performance',
 'Taycan Turbo',
 'Air ',
 'Taycan 4S',
 'Zoe ZE50 R135',
 'Mustang Mach-E ER RWD',
 '500e Hatchback',
 'e-208 ',
 'EQV 300 Long',
 'M-Byte 95 kWh 4WD',
 'M-Byte 95 kWh 2WD',
 'ID.3 1st',
 'e-

In [14]:
list(set(data['RapidCharge']))

['No', 'Yes']

In [15]:
list(set(data['PowerTrain']))

['RWD', 'FWD', 'AWD']

In [16]:
list(set(data['PlugType']))

['Type 2', 'Type 2 CCS', 'Type 1 CHAdeMO', 'Type 2 CHAdeMO']

In [17]:
list(set(data['BodyStyle']))

['Liftback',
 'MPV',
 'Hatchback',
 'SUV',
 'Station',
 'Cabrio',
 'Sedan',
 'Pickup',
 'SPV']

In [18]:
list(set(data['Segment']))

['A', 'E', 'S', 'N', 'C', 'F', 'B', 'D']

In [19]:
data = label_data(data=data, columns=['Brand', 'Model', 'RapidCharge', 'PowerTrain', 'PlugType', 'BodyStyle', 'Segment'])

In [20]:
data.sample(5)

Unnamed: 0,Brand,Model,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,PowerTrain,PlugType,BodyStyle,Segment,Seats,PriceEuro
34,17,44,9.0,150,180,178,240.0,1,1,2,6,2,5,32646
41,10,37,9.9,155,255,154,210.0,1,1,2,6,1,5,33971
42,1,96,5.7,200,380,228,610.0,1,0,2,6,4,5,81639
74,29,64,9.0,140,225,156,270.0,1,1,2,1,2,5,25500
49,0,71,9.0,150,335,188,350.0,1,1,2,6,2,5,36057


## Split the dataset for training and testing in ratio 80:20

In [21]:
def drop_data(data: pd.DataFrame, columns: list) -> pd.DataFrame:
  """Return a copy of data without the listed columns."""
  # DataFrame.drop returns a new DataFrame, so the input is left unchanged
  return data.drop(columns=columns)

In [22]:
input_data = data.copy()
input_data = drop_data(data=input_data, columns=['PriceEuro'])
input_data.sample(3)

Unnamed: 0,Brand,Model,AccelSec,TopSpeed_KmH,Range_Km,Efficiency_WhKm,FastCharge_KmH,RapidCharge,PowerTrain,PlugType,BodyStyle,Segment,Seats
80,31,29,7.3,160,340,171,470.0,1,2,2,1,2,5
78,8,58,6.0,180,340,206,360.0,1,0,2,6,3,5
4,9,78,9.5,145,170,168,190.0,1,2,2,1,1,4


In [23]:
target_data = data.copy()['PriceEuro']
target_data.sample(3)

Unnamed: 0,PriceEuro
80,38987
67,55000
51,215000


In [57]:
X_train, X_test, Y_train, Y_test = train_test_split(input_data, target_data, test_size=0.2)
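The split above is shuffled randomly, so each run produces a different train and test set. The sketch below shows how the optional `random_state` argument makes the split repeatable. The arrays are stand-in data with the same number of rows as the lab dataset (103), and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the lab dataset (103).
X = np.arange(103 * 2).reshape(103, 2)
y = np.arange(103)

# random_state fixes the shuffle, so repeated calls give the same split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr_b, X_te_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)        # (82, 2) (21, 2)
print(np.array_equal(X_te, X_te_b))  # True
```

With 103 rows and `test_size=0.2`, the test set holds 21 rows and the training set 82, so every metric below is computed on only 21 cars.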

## Initialize the DecisionTreeRegressor model, and use the fit function for training the model.

Set values for the parameters `max_depth`, `min_samples_split`, and `max_features`, then train the model with the `fit` function.


In [58]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Brand            103 non-null    int64  
 1   Model            103 non-null    int64  
 2   AccelSec         103 non-null    float64
 3   TopSpeed_KmH     103 non-null    int64  
 4   Range_Km         103 non-null    int64  
 5   Efficiency_WhKm  103 non-null    int64  
 6   FastCharge_KmH   103 non-null    float64
 7   RapidCharge      103 non-null    int64  
 8   PowerTrain       103 non-null    int64  
 9   PlugType         103 non-null    int64  
 10  BodyStyle        103 non-null    int64  
 11  Segment          103 non-null    int64  
 12  Seats            103 non-null    int64  
 13  PriceEuro        103 non-null    int64  
dtypes: float64(2), int64(12)
memory usage: 11.4 KB


In [78]:
regressor = DecisionTreeRegressor(max_depth=5, min_samples_split=7, max_features=2)

# Train Decision Tree Regressor
regressor.fit(X_train, Y_train)

## Predict the outcomes for X test

In [79]:
y_pred = regressor.predict(X_test)

## Assess the model performance, by using sklearn metrics for regression

In [80]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print("Mean Absolute Error:", mean_absolute_error(Y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(Y_test, y_pred))
print("R2 Score:", r2_score(Y_test, y_pred))

Mean Absolute Error: 6136.64761904762
Mean Squared Error: 70445352.50698836
R2 Score: 0.8946079006929213
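To make the three metrics concrete, the following sketch computes them by hand with NumPy and checks the results against scikit-learn. The four prices are made up for illustration. It also computes the root mean squared error (RMSE), which is not used in the lab but is expressed in euros like the MAE; for the MSE reported above it works out to roughly 8,400 euros.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up prices in euros, for illustration only.
y_true = np.array([30000.0, 45000.0, 60000.0, 90000.0])
y_hat = np.array([32000.0, 44000.0, 55000.0, 98000.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # average absolute error, in euros
mse = np.mean(errors ** 2)      # average squared error, in euros squared
rmse = np.sqrt(mse)             # back in euros, comparable to the MAE

# R2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

Because the MSE squares each error, a single badly mispriced car weighs far more heavily in it than in the MAE.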


## Initialize the XGBRegressor model, and use the fit function

Set values for the parameters `n_estimators`, `max_depth`, and `learning_rate`, set the objective to `"reg:squarederror"`, then train the model with the `fit` function.

In [107]:
model = XGBRegressor(max_depth=3, n_estimators=500,  learning_rate=0.01, objective="reg:squarederror")
model.fit(X_train, Y_train)

## Predict the outcomes for X test

In [108]:
y_pred_2 = model.predict(X_test)

## Assess the model performance, by using sklearn metrics for regression

In [109]:
print("Mean Absolute Error:", mean_absolute_error(Y_test, y_pred_2))
print("Mean Squared Error:", mean_squared_error(Y_test, y_pred_2))
print("R2 Score:", r2_score(Y_test, y_pred_2))

Mean Absolute Error: 10983.082124255952
Mean Squared Error: 292088919.783329
R2 Score: 0.5630108118057251


## Compare the performance of both models on at least three regression metrics

In [110]:
print("Mean Absolute Error DecisionTree:", mean_absolute_error(Y_test, y_pred))
print("Mean Absolute Error XGBoost:", mean_absolute_error(Y_test, y_pred_2))


Mean Absolute Error DecisionTree: 6136.64761904762
Mean Absolute Error XGBoost: 10983.082124255952


In [111]:
print("Mean Squared Error DecisionTree:", mean_squared_error(Y_test, y_pred))
print("Mean Squared Error XGBoost:", mean_squared_error(Y_test, y_pred_2))


Mean Squared Error DecisionTree: 70445352.50698836
Mean Squared Error XGBoost: 292088919.783329


In [112]:
print("R2 Score DecisionTree:", r2_score(Y_test, y_pred))
print("R2 Score XGBoost:", r2_score(Y_test, y_pred_2))

R2 Score DecisionTree: 0.8946079006929213
R2 Score XGBoost: 0.5630108118057251


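*On this split, the Decision Tree outperforms XGBoost on all three metrics (MAE, MSE, and R2). With only 103 rows and a test set of 21 cars, a boosted ensemble such as XGBoost has little data to learn from and is sensitive to its hyperparameters, so a simpler model like a Decision Tree can come out ahead. Because the split is random and no `random_state` is set, these numbers will change from run to run.*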
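A single 80:20 split of such a small dataset gives a noisy comparison, so one way to check the conclusion is k-fold cross-validation. The sketch below is an illustration under stated assumptions, not the lab's method: it uses synthetic data from `make_regression` with the same shape as the lab data (103 rows, 13 features), and scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor` so that it runs without xgboost. Since `XGBRegressor` follows the scikit-learn estimator interface, it can be placed in the `models` dictionary in the same way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with the same shape as the lab data (103 rows, 13 features).
X, y = make_regression(n_samples=103, n_features=13, noise=10.0, random_state=0)

# The same folds are used for both models, so the comparison is paired.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "DecisionTree": DecisionTreeRegressor(max_depth=5, min_samples_split=7, random_state=0),
    # scikit-learn's gradient boosting stands in for XGBRegressor here.
    "GradientBoosting": GradientBoostingRegressor(
        n_estimators=500, max_depth=3, learning_rate=0.01, random_state=0
    ),
}

results = {}
for name, estimator in models.items():
    # One R2 score per fold; the mean and spread summarize the model.
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="r2")
    results[name] = scores
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

The scores on synthetic data say nothing about the electric-car dataset itself; the point is only the procedure, which averages over five test folds instead of relying on one.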