# One Hot Encoding

One-hot encoding is a machine learning technique that converts categorical data into columns of numerical data. It lets us make better use of the information in categorical variables, both to make predictions and to understand patterns.

How it works: the original (integer-encoded) variable is removed and a new binary variable is added for each unique value, so each binary variable represents one category of the original variable.

E.g. a Colour column might give:

    "red"   [1,0,0]
    "green" [0,1,0]
    "blue"  [0,0,1]

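As a minimal sketch of this mapping, the toy `Colour` column below (illustrative data, not part of the CO2 dataset) is encoded with `pandas.get_dummies`. pandas orders the new columns alphabetically, so the column order is blue, green, red rather than the red, green, blue order used in the example.

```python
import pandas as pd

# Toy data for illustration only; not taken from the CO2 dataset
colours = pd.DataFrame({"Colour": ["red", "green", "blue", "green"]})

# One binary column per unique category, stored as integers (0 or 1)
encoded = pd.get_dummies(colours["Colour"], dtype=int)

print(encoded)
```

Each row contains exactly one 1, in the column of the category that row belongs to.
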
Subsequently, with the new columns created from one hot encoding, we plan to carry out linear regression with all the data available. 
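
The planned workflow might look roughly like the sketch below. It runs on synthetic stand-in data (the values and the `y` target are made up for the example), so it only illustrates the sequence of steps: encode, split, fit, evaluate. It is not the model fitted on the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the values and the "y" target are made up
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Engine Size(L)": rng.uniform(1.0, 5.0, size=200),
    "Fuel Type": rng.choice(["Z", "D", "X"], size=200),
})
toy["y"] = 50 * toy["Engine Size(L)"] + rng.normal(0, 5, size=200)

# Encode the categorical column, then separate predictors and target
X = pd.get_dummies(toy.drop(columns="y"), columns=["Fuel Type"],
                   drop_first=True, dtype=int)
y = toy["y"]

# Hold out 20% of the rows to evaluate the fitted model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print("Test MSE:", mse)
```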

An advantage of one-hot encoding is that it works when the categories have no ordinal relationship, such as Fuel Type (Z, D, X, E, N): the model can still learn a relationship between each category and the target.
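
To illustrate why this matters, the sketch below contrasts an integer encoding, which imposes an arbitrary order on the fuel codes, with a one-hot encoding, which does not. The integer codes are simply the alphabetical positions pandas assigns, used here as a hypothetical example of a naive encoding.

```python
import pandas as pd

fuel = pd.Series(["Z", "D", "X", "E", "N"], name="Fuel Type")

# Integer codes follow alphabetical order (D=0, E=1, N=2, X=3, Z=4);
# a linear model would read Z (4) as "more" than D (0), which is meaningless
integer_codes = fuel.astype("category").cat.codes

# One-hot encoding: every fuel type gets its own column, with no implied order
one_hot = pd.get_dummies(fuel, prefix="Fuel Type", dtype=int)

print(integer_codes.tolist())
print(one_hot.columns.tolist())
```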



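Our categorical variables:
    Vehicle Class, Engine Size, Cylinders, Transmission and Fuel Type are the candidates for encoding, since each takes a limited set of repeated values. The encoding step in this notebook is applied to Vehicle Class, Transmission and Fuel Type; Engine Size and Cylinders are already numeric and are kept as they are.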
    

### Import Essential Libraries

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization

In [1]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer


## Import the cleaned dataset (outliers removed) and check the data

In [2]:
df = pd.read_csv('CO2 Emissions_Canada_cleaned_removed_outlier.csv')
df.drop(df.columns[0], axis=1, inplace=True)

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])


Number of rows: 5965
Number of columns: 13


In [3]:
df.head()

Unnamed: 0,Make,Model,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption City (L/100 km),Fuel Consumption Hwy (L/100 km),Fuel Consumption Comb (L/100 km),Fuel Consumption Comb (mpg),CO2 Emissions(g/km),Number of Gears
0,ACURA,ILX,COMPACT,2.0,4,AS,Z,9.9,6.7,8.5,33,196,5
1,ACURA,ILX,COMPACT,2.4,4,M,Z,11.2,7.7,9.6,29,221,6
2,ACURA,ILX HYBRID,COMPACT,1.5,4,AV,Z,6.0,5.8,5.9,48,136,7
3,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS,Z,12.7,9.1,11.1,25,255,6
4,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS,Z,12.1,8.7,10.6,27,244,6


In [4]:
# Collect the column names of df
column_names = df.columns

# Print each column name
for column in column_names:
    print(column)


Make
Model
Vehicle Class
Engine Size(L)
Cylinders
Transmission
Fuel Type
Fuel Consumption City (L/100 km)
Fuel Consumption Hwy (L/100 km)
Fuel Consumption Comb (L/100 km)
Fuel Consumption Comb (mpg)
CO2 Emissions(g/km)
Number of Gears


## One-hot encoding 
We view all the categorical data except 'Make' and 'Model', which are excluded as explained in our EDA.

We want to convert all the categorical data to numerical data.
Relevant categorical data:
Vehicle Class, Transmission, Fuel Type


### Predictive Model


We need all the data to be numerical.
We can use the `get_dummies` function from pandas.
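
The encoding cell later in this notebook calls `pd.get_dummies` with `drop_first=True`. The sketch below, on toy data, shows what that option does: the first category of each variable is omitted and becomes the baseline. This is a common choice for linear regression with an intercept, because the full set of dummy columns always sums to 1 and is therefore perfectly collinear with the intercept.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"Transmission": ["AS", "M", "AV", "AS"]})

# Full encoding: one column per category (AS, AV, M)
full = pd.get_dummies(toy, columns=["Transmission"], dtype=int)

# drop_first=True omits the first category (AS), which becomes the baseline:
# a row of all zeros then means "AS"
reduced = pd.get_dummies(toy, columns=["Transmission"],
                         drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```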

Optional check: count the number of times a given value (here 'N') appears in the Fuel Type column.

In [4]:
n_values_count = df['Fuel Type'].value_counts()

# Display the count of 'N' values
print("Number of 'N' values in the 'Fuel Type' column:", n_values_count.get('N', 0))

Number of 'N' values in the 'Fuel Type' column: 1


Next, we make a copy of the categorical columns.

In [5]:
# Assuming 'df' is your original DataFrame
# List of categorical column names
categorical_columns = ['Vehicle Class', 'Transmission', 'Fuel Type']

# Create a new DataFrame containing only the specified categorical columns from the original DataFrame
categorical_df = df[categorical_columns].copy()

# Now, 'categorical_df' is a new DataFrame containing only the categorical data


In [6]:
print(categorical_df.head(10))


  Vehicle Class Transmission Fuel Type
0       COMPACT           AS         Z
1       COMPACT            M         Z
2       COMPACT           AV         Z
3   SUV - SMALL           AS         Z
4   SUV - SMALL           AS         Z
5      MID-SIZE           AS         Z
6      MID-SIZE           AS         Z
7      MID-SIZE           AS         Z
8      MID-SIZE            M         Z
9       COMPACT           AS         Z


## One-hot encoding 
View all the unique values in each categorical column.

In [7]:
# display unique values of categorical columns before encoding
print("Unique values before encoding: ")
for column in categorical_columns:
    print(column, ":", df[column].unique())
    # Display the number of unique values in this column
    print("Number of unique values:", df[column].nunique())


Unique values before encoding: 
Vehicle Class : ['COMPACT' 'SUV - SMALL' 'MID-SIZE' 'TWO-SEATER' 'MINICOMPACT'
 'SUBCOMPACT' 'FULL-SIZE' 'STATION WAGON - SMALL' 'SUV - STANDARD'
 'VAN - CARGO' 'VAN - PASSENGER' 'PICKUP TRUCK - STANDARD' 'MINIVAN'
 'SPECIAL PURPOSE VEHICLE' 'STATION WAGON - MID-SIZE'
 'PICKUP TRUCK - SMALL']
Number of unique values: 16
Transmission : ['AS' 'M' 'AV' 'AM' 'A']
Number of unique values: 5
Fuel Type : ['Z' 'D' 'X' 'E' 'N']
Number of unique values: 5


Total number of unique values across the three columns: 16 + 5 + 5 = 26

### Applying one hot encoding to variables such as Vehicle Class, Transmission, Fuel Type. 


### Combining the encoded categorical variables with the numerical data

We focus on the specifications of the car by including:

Numerical: Engine Size(L), Fuel Consumption Comb (L/100 km), Number of Gears


Categorical:
            Vehicle Class (FULL-SIZE, MID-SIZE, MINICOMPACT, MINIVAN, TRUCK-SMALL..)
            Cylinders (10/12)
            Transmission (AM, AS, AV, M)
            Fuel Type (E, N, X, Z)

In [19]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Load and display the initial DataFrame
df = pd.read_csv('CO2 Emissions_Canada_cleaned_removed_outlier.csv')
df.drop(df.columns[0], axis=1, inplace=True)

print("Number of rows before dropping the necessary:", df.shape[0])

# Drop the columns not used here: 'Model', 'Make' and the three L/100 km fuel consumption columns (city, highway, combined).
# 'CO2 Emissions(g/km)' is kept in df. errors='ignore' skips any listed column that is not present.
df.drop(['Model', 'Fuel Consumption Comb (L/100 km)', 'Make', 'Fuel Consumption City (L/100 km)', 'Fuel Consumption Hwy (L/100 km)'], axis=1, inplace=True, errors='ignore')

print("Number of rows after dropping necessary':", df.shape[0])

# List of remaining categorical column names to encode
categorical_columns = ['Vehicle Class', 'Transmission', 'Fuel Type']

# Perform one-hot encoding for the categorical data
for column in categorical_columns:
    dummies = pd.get_dummies(df[column], prefix=column, drop_first=True)
    df.drop(column, axis=1, inplace=True)  # Drop the original column from df
    df = df.join(dummies)  # Join the dummy columns to df

# Impute missing values in the dataset
imputer = SimpleImputer(strategy='median')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)



# The city and highway fuel consumption columns were already dropped above,
# so no further drop is needed. Display the first rows of the imputed DataFrame.
df_imputed.head()


Number of rows before dropping the necessary: 5965
Number of rows after dropping necessary': 5965


Unnamed: 0,Engine Size(L),Cylinders,Fuel Consumption Comb (mpg),CO2 Emissions(g/km),Number of Gears,Vehicle Class_FULL-SIZE,Vehicle Class_MID-SIZE,Vehicle Class_MINICOMPACT,Vehicle Class_MINIVAN,Vehicle Class_PICKUP TRUCK - SMALL,...,Vehicle Class_VAN - CARGO,Vehicle Class_VAN - PASSENGER,Transmission_AM,Transmission_AS,Transmission_AV,Transmission_M,Fuel Type_E,Fuel Type_N,Fuel Type_X,Fuel Type_Z
0,2.0,4.0,33.0,196.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,2.4,4.0,29.0,221.0,6.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,1.5,4.0,48.0,136.0,7.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,3.5,6.0,25.0,255.0,6.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,3.5,6.0,27.0,244.0,6.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


In [20]:
print("Number of rows:", df_imputed.shape[0])
print("Number of columns:", df_imputed.shape[1])


Number of rows: 5965
Number of columns: 28


In [21]:
# Assuming df_imputed is your DataFrame
column_names = df_imputed.columns

# Print each column name
for column in column_names:
    print(column)


Engine Size(L)
Cylinders
Fuel Consumption Comb (mpg)
CO2 Emissions(g/km)
Number of Gears
Vehicle Class_FULL-SIZE
Vehicle Class_MID-SIZE
Vehicle Class_MINICOMPACT
Vehicle Class_MINIVAN
Vehicle Class_PICKUP TRUCK - SMALL
Vehicle Class_PICKUP TRUCK - STANDARD
Vehicle Class_SPECIAL PURPOSE VEHICLE
Vehicle Class_STATION WAGON - MID-SIZE
Vehicle Class_STATION WAGON - SMALL
Vehicle Class_SUBCOMPACT
Vehicle Class_SUV - SMALL
Vehicle Class_SUV - STANDARD
Vehicle Class_TWO-SEATER
Vehicle Class_VAN - CARGO
Vehicle Class_VAN - PASSENGER
Transmission_AM
Transmission_AS
Transmission_AV
Transmission_M
Fuel Type_E
Fuel Type_N
Fuel Type_X
Fuel Type_Z


The printed shape shows that the number of rows did not change (5965), so no rows were lost when dropping columns. The number of columns changed from 13 to 28: five of the original columns were dropped, and each of the three categorical columns was replaced by one binary column per category, minus the first category because `drop_first=True` was used.
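
The column count can be checked by hand from the unique-value counts printed earlier in this notebook. The short calculation below is only a sanity check of that arithmetic; the variable names are chosen for the example.

```python
# Unique-value counts reported earlier in this notebook
unique_counts = {"Vehicle Class": 16, "Transmission": 5, "Fuel Type": 5}

# Non-encoded columns left after the drop step: Engine Size(L), Cylinders,
# Fuel Consumption Comb (mpg), CO2 Emissions(g/km), Number of Gears
numeric_columns = 5

# drop_first=True removes one dummy column per categorical variable
dummy_columns = sum(n - 1 for n in unique_counts.values())

total_columns = numeric_columns + dummy_columns
print(total_columns)  # 28
```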

### Creating a new CSV file after one-hot encoding

In [12]:
df_imputed.to_csv('CO2 Emissions_Canada_cleaned_removed_outliers_after_encoding.csv')
print("Number of rows:", df_imputed.shape[0])
print("Number of columns:", df_imputed.shape[1])

Number of rows: 5965
Number of columns: 28
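
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which is presumably why the loading cells in this notebook drop the first column after `read_csv`. A minimal sketch of the alternative, assuming the index carries no information, is to pass `index=False`. The example uses an in-memory buffer and toy values rather than the real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Engine Size(L)": [2.0, 2.4], "Cylinders": [4.0, 4.0]})

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)
print(reloaded_with_index.columns.tolist())

# index=False writes only the data columns, so nothing needs dropping later
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
print(reloaded_without_index.columns.tolist())
```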


In [13]:
# display the number of unique values in each column
print("Unique values per column:\n", df_imputed.nunique())

Unique values per column:
 Engine Size(L)                             49
Cylinders                                   7
Fuel Consumption Comb (L/100 km)          165
Fuel Consumption Comb (mpg)                49
Number of Gears                             7
Vehicle Class_FULL-SIZE                     2
Vehicle Class_MID-SIZE                      2
Vehicle Class_MINICOMPACT                   2
Vehicle Class_MINIVAN                       2
Vehicle Class_PICKUP TRUCK - SMALL          2
Vehicle Class_PICKUP TRUCK - STANDARD       2
Vehicle Class_SPECIAL PURPOSE VEHICLE       2
Vehicle Class_STATION WAGON - MID-SIZE      2
Vehicle Class_STATION WAGON - SMALL         2
Vehicle Class_SUBCOMPACT                    2
Vehicle Class_SUV - SMALL                   2
Vehicle Class_SUV - STANDARD                2
Vehicle Class_TWO-SEATER                    2
Vehicle Class_VAN - CARGO                   2
Vehicle Class_VAN - PASSENGER               2
Transmission_AM                             2
Transmi

KeyError: "['CO2 Emissions(g/km)'] not found in axis"