# Processing and cleaning the data

In this notebook we will try to identify null values and outliers in the data. We will also rename some columns and values where needed.

The input of this notebook is `raw_vehicles.csv` and the output is `processed_vehicles.csv`.

### Importing libraries and loading the data

In [1]:
import pandas as pd

dataset_path = './datasets/raw_vehicles.csv'
df = pd.read_csv(dataset_path)

### How many observations and features do we have?

In [2]:
observations, features = df.shape

print('Observations:', observations)
print('Features:', features)

Observations: 1067
Features: 13


### What are the value types of the features?

In [3]:
df.dtypes

MODELYEAR                     int64
MAKE                         object
MODEL                        object
VEHICLECLASS                 object
ENGINESIZE                  float64
CYLINDERS                     int64
TRANSMISSION                 object
FUELTYPE                     object
FUELCONSUMPTION_CITY        float64
FUELCONSUMPTION_HWY         float64
FUELCONSUMPTION_COMB        float64
FUELCONSUMPTION_COMB_MPG      int64
CO2EMISSIONS                  int64
dtype: object

The majority of the features are numerical (discrete and continuous) and some of them are categorical. The feature we will try to predict is `CO2EMISSIONS`, which we treat as a continuous variable even though it is stored as `int64`.

In [4]:
(
    df
    .dtypes
    .value_counts()
)

object     5
int64      4
float64    4
Name: count, dtype: int64
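To see which columns fall into each group, the split between numerical and categorical features can be made explicit with `select_dtypes`. This is a minimal sketch on a toy frame: the `sample` data and the variable names are assumptions for illustration, and in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the vehicles data (values are made up for illustration).
sample = pd.DataFrame(
    {
        'MAKE': ['ACURA', 'BMW', 'FORD'],
        'ENGINESIZE': [2.0, 3.0, 3.5],
        'CYLINDERS': [4, 6, 6],
    }
)

# Columns whose dtype is numeric (int64, float64, ...)
numerical_columns = sample.select_dtypes(include='number').columns.tolist()

# Every remaining column (here: text stored as strings/objects)
categorical_columns = sample.select_dtypes(exclude='number').columns.tolist()

print('Numerical:', numerical_columns)
print('Categorical:', categorical_columns)
```

Keeping the two lists around can be handy later, since numerical and categorical features usually need different checks.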

### Do we have null values?

Luckily, we don't have any null values in our dataset.

In [5]:
# check if any column has null values
(
    df
    .isnull()
    .any()
)

MODELYEAR                   False
MAKE                        False
MODEL                       False
VEHICLECLASS                False
ENGINESIZE                  False
CYLINDERS                   False
TRANSMISSION                False
FUELTYPE                    False
FUELCONSUMPTION_CITY        False
FUELCONSUMPTION_HWY         False
FUELCONSUMPTION_COMB        False
FUELCONSUMPTION_COMB_MPG    False
CO2EMISSIONS                False
dtype: bool

In [6]:
# check the number of null values in each column
(
    df
    .isnull()
    .sum()
)

MODELYEAR                   0
MAKE                        0
MODEL                       0
VEHICLECLASS                0
ENGINESIZE                  0
CYLINDERS                   0
TRANSMISSION                0
FUELTYPE                    0
FUELCONSUMPTION_CITY        0
FUELCONSUMPTION_HWY         0
FUELCONSUMPTION_COMB        0
FUELCONSUMPTION_COMB_MPG    0
CO2EMISSIONS                0
dtype: int64
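### Do we have outliers?

The introduction also mentions outliers. One common way to flag them is the interquartile range (IQR) rule, sketched below. The helper `count_iqr_outliers`, the multiplier `k=1.5`, and the toy data are assumptions for illustration, not necessarily the rule used for this dataset.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Boolean frame: True where a value lies below the lower or above the upper fence
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy data (made up): one engine size is far from the rest.
sample = pd.DataFrame(
    {
        'ENGINESIZE': [2.0, 2.4, 2.5, 3.0, 3.5, 12.0],
        'CYLINDERS': [4, 4, 4, 6, 6, 6],
    }
)

outlier_counts = count_iqr_outliers(sample)
print(outlier_counts)
```

In the notebook, `count_iqr_outliers(df)` would give one count per numeric feature. A flagged value is only a candidate outlier and still deserves a manual look before it is removed.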

### Renaming columns

Some of the columns have confusing names, so we will rename them with more descriptive names.

In [7]:
old_names = df.columns

In [8]:
# a single mapping renames every column in one call
df.rename(
    columns={
        # more descriptive names
        'TRANSMISSION': 'transmission_type',
        'MODELYEAR': 'release_year',
        'MAKE': 'manufacturer',
        'FUELCONSUMPTION_COMB': 'fuel_consumption_combinated_in_kpl',
        'FUELCONSUMPTION_CITY': 'fuel_consumption_on_city',
        'FUELCONSUMPTION_HWY': 'fuel_consumption_on_highway',
        'FUELCONSUMPTION_COMB_MPG': 'fuel_consumption_combinated_in_mpg',
        # snake case formatting
        'ENGINESIZE': 'engine_size',
        'MODEL': 'model',
        'CO2EMISSIONS': 'co2_emissions',
        'FUELTYPE': 'fuel_type',
        'CYLINDERS': 'cylinders',
        'VEHICLECLASS': 'vehicle_class',
    },
    inplace=True,
)
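By default, `rename` silently ignores keys that do not match any column, so a typo in an old name would go unnoticed. A small sketch of how `errors='raise'` can guard against that follows. The toy frame and the misspelled key `MODEL_YEAR` are made up for illustration.

```python
import pandas as pd

# Toy frame with two of the original column names (values are made up).
sample = pd.DataFrame({'MAKE': ['ACURA'], 'MODELYEAR': [2014]})

# Correct mapping: every key exists, so the rename succeeds.
renamed = sample.rename(
    columns={'MAKE': 'manufacturer', 'MODELYEAR': 'release_year'},
    errors='raise',
)
print(renamed.columns.tolist())

# Misspelled key: with errors='raise' pandas raises a KeyError instead of
# silently leaving the column unchanged.
try:
    sample.rename(columns={'MODEL_YEAR': 'release_year'}, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True

print('Typo detected:', typo_detected)
```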

In [9]:
new_names = df.columns

df_names = pd.DataFrame(
    {
        'old_name': old_names,
        'new_name': new_names
    }
)

df_names

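| | old_name | new_name |
|---|---|---|
| 0 | MODELYEAR | release_year |
| 1 | MAKE | manufacturer |
| 2 | MODEL | model |
| 3 | VEHICLECLASS | vehicle_class |
| 4 | ENGINESIZE | engine_size |
| 5 | CYLINDERS | cylinders |
| 6 | TRANSMISSION | transmission_type |
| 7 | FUELTYPE | fuel_type |
| 8 | FUELCONSUMPTION_CITY | fuel_consumption_on_city |
| 9 | FUELCONSUMPTION_HWY | fuel_consumption_on_highway |
| 10 | FUELCONSUMPTION_COMB | fuel_consumption_combinated_in_kpl |
| 11 | FUELCONSUMPTION_COMB_MPG | fuel_consumption_combinated_in_mpg |
| 12 | CO2EMISSIONS | co2_emissions |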


At this point, the dataset is ready to be exported to a new `.csv` file for the analysis.

In [10]:
# export to csv
df.to_csv('./datasets/processed_vehicles.csv', index=False)
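As an optional sanity check, the exported file can be read back and compared with the frame that was written. The sketch below uses a toy frame and a temporary directory so that it runs on its own. Both are assumptions for illustration. In the notebook the comparison would be between `df` and the file at `./datasets/processed_vehicles.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the processed data (values are made up).
sample = pd.DataFrame(
    {
        'manufacturer': ['ACURA', 'BMW'],
        'engine_size': [2.0, 3.0],
        'cylinders': [4, 6],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Temporary path used only for this sketch
    output_path = Path(tmp_dir) / 'processed_vehicles.csv'
    sample.to_csv(output_path, index=False)

    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(output_path)

# Compare the reloaded frame with the one that was written
same_shape = reloaded.shape == sample.shape
same_columns = reloaded.columns.tolist() == sample.columns.tolist()
print('Same shape:', same_shape)
print('Same columns:', same_columns)
```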