# DATA PREPROCESSING OF THE TRAINING DATASET

We will analyze and preprocess the data to make it better suited to our future model and, hopefully, obtain better classification results.

### INDEX

1. [section 1](#section1)
***

## 1. Imports

In [48]:
import pandas as pd
import numpy as np
import os
from IPython.display import clear_output
import matplotlib.pyplot as plt
import seaborn as sns

# import local scripts (sys.path entries must be directories, not files)
import sys
sys.path.insert(1, '.')
import scripts



## 2. Dataset loading

In [49]:
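data_dir = '../apau-smog-prediction/'
# Load each CSV into a variable named after its file (train.csv -> train)
for filename in os.listdir(data_dir):
    name = os.path.splitext(filename)[0]
    globals()[name] = pd.read_csv(os.path.join(data_dir, filename))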

Now we have loaded the three datasets present in `../apau-smog-prediction`. These are:

1. sample_submission : example of a file to submit to the Kaggle platform.
2. test_nolabel : unlabeled dataset to predict. Our objective will be to find the best label predictions possible.
3. train : training data
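As an alternative to creating variables dynamically, the same loading step can be sketched with a dictionary keyed by file name. This is only a sketch: `load_csv_dir` is a hypothetical helper, not part of the original notebook, and it assumes the directory contains the CSV files listed above.

```python
from pathlib import Path

import pandas as pd


def load_csv_dir(data_dir):
    """Read every *.csv file in data_dir into a dict keyed by file stem.

    Hypothetical helper, not part of the original notebook.
    """
    csv_paths = sorted(Path(data_dir).glob("*.csv"))
    return {path.stem: pd.read_csv(path) for path in csv_paths}


# Usage, assuming the directory layout described above:
# datasets = load_csv_dir("../apau-smog-prediction/")
# train = datasets["train"]
```

Keeping the frames in a dictionary makes it explicit which names exist, at the cost of writing `datasets["train"]` instead of `train`.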

The focus of this preprocessing will be the training dataset, as that is what will be used to train our classification models.

***

## 3. Basic data visualization.

We start with some basic visualization of our data. First, we need to identify the data type of each column:

#### 3.1 Categorical and numerical data.

In [50]:
train.info()
train.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 586 entries, 0 to 585
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                586 non-null    object 
 1   Model Year                        586 non-null    int64  
 2   Make                              586 non-null    object 
 3   Model                             586 non-null    object 
 4   Vehicle Class                     586 non-null    object 
 5   Engine Size (L)                   586 non-null    float64
 6   Cylinders                         586 non-null    int64  
 7   Transmission                      586 non-null    object 
 8   Fuel Type                         586 non-null    object 
 9   Fuel Consumption City (L/100 km)  586 non-null    float64
 10  Hwy (L/100 km)                    586 non-null    float64
 11  Comb (L/100 km)                   586 non-null    float64
 12  Comb (mp

| | id | Model Year | Make | Model | Vehicle Class | Engine Size (L) | Cylinders | Transmission | Fuel Type | Fuel Consumption City (L/100 km) | Hwy (L/100 km) | Comb (L/100 km) | Comb (mpg) | CO2 Emissions (g/km) | Smog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ab44e9bec15 | 2022 | Mercedes-Benz | A 250 4MATIC Hatch | Station wagon: Small | 2.0 | 4 | AM7 | Z | 10.0 | 7.0 | 8.7 | 32 | 202 | 2 |
| 1 | 45926762371 | 2022 | Mazda | Mazda3 5-Door | Mid-size | 2.0 | 4 | AS6 | X | 8.6 | 6.7 | 7.7 | 37 | 181 | 4 |
| 2 | e9be56e153f | 2022 | Porsche | Panamera 4 ST | Full-size | 2.9 | 6 | AM8 | Z | 12.8 | 10.2 | 11.7 | 24 | 274 | 2 |
| 3 | 077092760df | 2022 | Mazda | CX-3 4WD | Compact | 2.0 | 4 | AS6 | X | 8.6 | 7.4 | 8.1 | 35 | 189 | 1 |
| 4 | c1c2579b795 | 2022 | Aston Martin | DBS V12 | Minicompact | 5.2 | 12 | A8 | Z | 16.4 | 10.7 | 13.8 | 20 | 324 | 1 |


Categorical data:
 1. `'id'`
 2. `'Model Year'` (year is not a magnitude)
 3. `'Make'`
 4. `'Model'`
 5. `'Vehicle Class'`
 6. `'Transmission'`
 7. `'Fuel Type'`

Numerical data:
 1. `'Engine Size (L)'`
 2. `'Cylinders'`
 3. `'Fuel Consumption City (L/100 km)'`
 4. `'Hwy (L/100 km)'` --> highway consumption
 5. `'Comb (L/100 km)'` --> combined city and highway consumption
 6. `'Comb (mpg)'`
 7. `'CO2 Emissions (g/km)'`
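The split above can be sketched in code. This is a minimal sketch, assuming the column names reported by `train.info()`; the two-row frame is only a stand-in built from values in the `train.head()` output, and `Smog` is set aside as the target. `'Model Year'` has an integer dtype, so it has to be moved to the categorical group by hand.

```python
import pandas as pd

# Stand-in for `train`: two rows copied from the head() output above.
sample = pd.DataFrame({
    "id": ["ab44e9bec15", "45926762371"],
    "Model Year": [2022, 2022],
    "Make": ["Mercedes-Benz", "Mazda"],
    "Engine Size (L)": [2.0, 2.0],
    "Cylinders": [4, 4],
    "Comb (L/100 km)": [8.7, 7.7],
    "Smog": [2, 4],
})

target = "Smog"
# Integer-typed columns that we still want to treat as categories.
forced_categorical = ["Model Year"]

features = sample.drop(columns=target)
numerical_cols = [
    col for col in features.select_dtypes(include="number").columns
    if col not in forced_categorical
]
categorical_cols = [col for col in features.columns if col not in numerical_cols]

print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
```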

With the column types settled, let's look at the distribution of values in each column. We will plot histograms for the numerical data and bar plots for the categorical data.

In [52]:
scripts.display_histograms(train)

AttributeError: module 'scripts' has no attribute 'display_histograms'
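The call above fails because the local `scripts` module does not define `display_histograms`. A minimal sketch of what such a helper might look like follows. The name `plot_distributions`, its parameters, and the grid layout are assumptions, not the author's implementation. It draws a histogram for each numeric column and a bar plot of the most frequent values for every other column.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd


def plot_distributions(df, categorical=(), ncols=3, top_n=20):
    """Histogram per numeric column, bar plot of value counts otherwise.

    Hypothetical helper: columns listed in `categorical` are always drawn
    as bar plots, even if their dtype is numeric (e.g. 'Model Year').
    Only the `top_n` most frequent values are shown in each bar plot.
    """
    nrows = math.ceil(len(df.columns) / ncols)
    fig, axes = plt.subplots(
        nrows, ncols, figsize=(5 * ncols, 3.5 * nrows), squeeze=False
    )
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, df.columns):
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        if is_numeric and col not in categorical:
            ax.hist(df[col].dropna(), bins=20)
        else:
            counts = df[col].value_counts().head(top_n)
            ax.bar(counts.index.astype(str).tolist(), counts.to_numpy())
            ax.tick_params(axis="x", labelrotation=90)
        ax.set_title(col)
    # Hide any unused axes in the grid.
    for ax in flat_axes[len(df.columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Usage, assuming `train` is loaded as above:
# plot_distributions(train.drop(columns=["id", "Model"]), categorical=["Model Year"])
```

Columns with almost one distinct value per row, such as `'id'` or `'Model'`, are better left out of the call, since their bar plots carry little information.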

## 4. Basic column selection

In [None]:
train.head()

| | id | Model Year | Make | Model | Vehicle Class | Engine Size (L) | Cylinders | Transmission | Fuel Type | Fuel Consumption City (L/100 km) | Hwy (L/100 km) | Comb (L/100 km) | Comb (mpg) | CO2 Emissions (g/km) | Smog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ab44e9bec15 | 2022 | Mercedes-Benz | A 250 4MATIC Hatch | Station wagon: Small | 2.0 | 4 | AM7 | Z | 10.0 | 7.0 | 8.7 | 32 | 202 | 2 |
| 1 | 45926762371 | 2022 | Mazda | Mazda3 5-Door | Mid-size | 2.0 | 4 | AS6 | X | 8.6 | 6.7 | 7.7 | 37 | 181 | 4 |
| 2 | e9be56e153f | 2022 | Porsche | Panamera 4 ST | Full-size | 2.9 | 6 | AM8 | Z | 12.8 | 10.2 | 11.7 | 24 | 274 | 2 |
| 3 | 077092760df | 2022 | Mazda | CX-3 4WD | Compact | 2.0 | 4 | AS6 | X | 8.6 | 7.4 | 8.1 | 35 | 189 | 1 |
| 4 | c1c2579b795 | 2022 | Aston Martin | DBS V12 | Minicompact | 5.2 | 12 | A8 | Z | 16.4 | 10.7 | 13.8 | 20 | 324 | 1 |


Our dataset contains several columns that are not useful for prediction, such as columns that repeat the same information in different units. We will drop those columns from the dataset.

First, we drop the `'Comb (mpg)'` column, as it carries the same information as `'Comb (L/100 km)'` in different units.

In [None]:
train.drop(columns='Comb (mpg)', inplace=True)
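To see why the two columns are redundant, the unit conversion can be checked on the rows shown by `train.head()`. The sketch below assumes that `Comb (mpg)` is expressed in miles per imperial gallon; the dataset does not state this, but the head rows are consistent with it. Under that assumption, mpg = 282.481 / (L/100 km). The five value pairs are copied from the output above.

```python
# Unit constants (exact definitions).
LITRES_PER_IMPERIAL_GALLON = 4.54609
KM_PER_MILE = 1.609344

# mpg = (km per litre) * (litres per gallon) / (km per mile)
#     = (100 / x) * 4.54609 / 1.609344, with x in L/100 km
FACTOR = 100 * LITRES_PER_IMPERIAL_GALLON / KM_PER_MILE  # about 282.481


def l_per_100km_to_mpg(litres_per_100km):
    """Convert fuel consumption in L/100 km to miles per imperial gallon."""
    return FACTOR / litres_per_100km


# (Comb (L/100 km), Comb (mpg)) pairs from the first five training rows.
head_rows = [(8.7, 32), (7.7, 37), (11.7, 24), (8.1, 35), (13.8, 20)]

for litres, mpg in head_rows:
    print(litres, mpg, round(l_per_100km_to_mpg(litres)))
```

Since one column is a deterministic function of the other (up to rounding), keeping both would add no information for the model.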


Also, note that we have city and highway consumption columns, plus a column that combines the two. To keep the model simple, we will drop the first two.
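As a quick sanity check that the combined column summarizes the other two, the sketch below compares it with a weighted average on the rows from `train.head()`. The 55% city / 45% highway weighting is an assumption (it is a weighting commonly used for combined fuel-consumption ratings), not something stated in the dataset, so the comparison only looks at how close the estimate is to the reported value.

```python
CITY_WEIGHT, HWY_WEIGHT = 0.55, 0.45  # assumed weighting, not given by the dataset

# (city, highway, combined) in L/100 km, from the first five training rows.
head_rows = [
    (10.0, 7.0, 8.7),
    (8.6, 6.7, 7.7),
    (12.8, 10.2, 11.7),
    (8.6, 7.4, 8.1),
    (16.4, 10.7, 13.8),
]

for city, hwy, comb in head_rows:
    estimate = CITY_WEIGHT * city + HWY_WEIGHT * hwy
    diff = abs(estimate - comb)
    print(f"estimate={estimate:.2f}  reported={comb}  diff={diff:.2f}")
```

On these five rows the estimate stays within 0.1 L/100 km of the reported combined value, which supports treating the city and highway columns as largely redundant once the combined column is kept.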

Finally, we will also drop `'id'`: it is just an identifier, unrelated to the `Smog` column, and therefore unnecessary as training data.

In [None]:
train.drop(columns=['id','Fuel Consumption City (L/100 km)', 'Hwy (L/100 km)'], inplace=True)
train.head()

| | id | Model Year | Make | Model | Vehicle Class | Engine Size (L) | Cylinders | Transmission | Fuel Type | Comb (L/100 km) | CO2 Emissions (g/km) | Smog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ab44e9bec15 | 2022 | Mercedes-Benz | A 250 4MATIC Hatch | Station wagon: Small | 2.0 | 4 | AM7 | Z | 8.7 | 202 | 2 |
| 1 | 45926762371 | 2022 | Mazda | Mazda3 5-Door | Mid-size | 2.0 | 4 | AS6 | X | 7.7 | 181 | 4 |
| 2 | e9be56e153f | 2022 | Porsche | Panamera 4 ST | Full-size | 2.9 | 6 | AM8 | Z | 11.7 | 274 | 2 |
| 3 | 077092760df | 2022 | Mazda | CX-3 4WD | Compact | 2.0 | 4 | AS6 | X | 8.1 | 189 | 1 |
| 4 | c1c2579b795 | 2022 | Aston Martin | DBS V12 | Minicompact | 5.2 | 12 | A8 | Z | 13.8 | 324 | 1 |
