This notebook contains scripts for the data cleaning stage of the machine learning process.

Steps to train a model - These are generic steps. The specific steps used to train a model for the dataset of this problem are described separately, after the generic model training process.

Step 1. Exploratory data analysis: 

a. to identify the target data type (discrete or continuous). This determines the type of ML task, that is, whether it is a classification problem or a regression problem.

b. to identify the data distribution and determine whether it is balanced or imbalanced. This informs the choice of cross-validation technique and whether to undersample, oversample, or use a technique such as SMOTE (Synthetic Minority Oversampling Technique). The data distribution can also guide the choice of evaluation technique.

c. to identify the necessary data cleansing and wrangling. Examples are the required data transformations, such as converting categorical data to numerical data, and checking for the presence of missing values.
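
The three checks in Step 1 can be sketched with pandas as shown here. This is a minimal illustration on a small made-up DataFrame: the helper `summarize_for_eda`, its `max_classes` threshold, and the column names `engine_size`, `fuel_type`, and `target` are hypothetical and are not part of this notebook's dataset. The numeric-with-many-distinct-values rule is only a heuristic.

```python
import pandas as pd

def summarize_for_eda(df, target_col, max_classes=10):
    """Hypothetical helper: collect quick facts that guide the later steps."""
    target = df[target_col]

    # a. Heuristic: a numeric target with many distinct values suggests regression.
    numeric_target = pd.api.types.is_numeric_dtype(target)
    if numeric_target and target.nunique() > max_classes:
        task = "regression"
    else:
        task = "classification"

    # b. Class proportions only make sense for a classification target.
    balance = target.value_counts(normalize=True) if task == "classification" else None

    # c. Columns with missing values, and non-numeric columns that need encoding.
    missing = {col: int(n) for col, n in df.isnull().sum().items() if n > 0}
    categorical = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

    return {"task": task, "balance": balance, "missing": missing, "categorical": categorical}

# Made-up data for illustration only.
toy = pd.DataFrame({
    "engine_size": [2.0, 2.4, None, 3.5, 1.6, 2.0, 3.0, 5.0, 1.8, 2.5, 4.0, 3.6],
    "fuel_type": ["X", "Z", "X", "Z", "X", "X", "Z", "Z", "X", None, "Z", "X"],
    "target": [219.0, 205.0, 230.0, 304.0, 274.0, 190.0, 260.0, 350.0, 200.0, 240.0, 320.0, 295.0],
})

summary = summarize_for_eda(toy, "target")
print(summary["task"], summary["missing"], summary["categorical"])
```

A heavily skewed `balance` for a classification target is the kind of signal that would motivate stratified cross-validation or resampling.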

Step 2: Data Cleaning and Wrangling

Messy, noisy data is cleaned: the missing values are treated and the categorical data is transformed to numerical data. In essence, based on the findings from the exploratory data analysis (EDA), appropriate steps are taken to ensure that the data is in good shape and fit for model training.

The two steps above are important preparation before model training.

Step 3: Model Training

For the baseline model (a preliminary model trained without any feature engineering or hyperparameter tuning):

a. An appropriate cross-validation technique is picked (to check how well the model generalizes and to detect overfitting).

b. Appropriate machine learning algorithm(s)/model(s) is/are picked.

c. Appropriate evaluation techniques are picked.

d. The data is scaled/normalized so that features measured on larger numeric scales do not dominate the others.

e. The model is trained with the cleaned and normalized data.

f. Evaluate the model.
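
Steps a to f can be sketched with scikit-learn as follows. This is an illustrative baseline under stated assumptions, not the model used for this dataset: the data is synthetic, and the choices of `LinearRegression`, 5-fold `KFold`, and mean absolute error are placeholders for whatever the EDA suggests. Placing the scaler inside the pipeline means it is fitted on the training folds only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for a cleaned dataset (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])  # very different scales
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.01 * X[:, 2] + rng.normal(scale=0.1, size=200)

# a. Cross-validation scheme.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# b. and d. Model with scaling; the scaler is refitted on each training split.
baseline = make_pipeline(StandardScaler(), LinearRegression())

# c., e. and f. Train and evaluate on each fold with mean absolute error.
scores = cross_val_score(baseline, X, y, cv=cv, scoring="neg_mean_absolute_error")
mae_per_fold = -scores  # scikit-learn reports the negated error
print("MAE per fold:", mae_per_fold.round(3))
```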

Step 4: Model Improvement

Improving a model concentrates on feature selection/engineering and hyperparameter tuning.

For feature selection/engineering:

a. Remove features with low variance, and remove one feature from each pair of highly correlated features.

b. Tree-based models provide a feature importance ranking. Features can be selected at discretion based on this ranking.

c. Use a dimensionality reduction technique such as Principal Component Analysis (PCA) to derive a smaller set of principal components when there are too many features.

d. Domain knowledge can also be leveraged to determine the important features to be retained for model training.
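
Option a can be sketched with pandas and NumPy alone. The helpers `drop_low_variance` and `drop_highly_correlated`, their thresholds, and the toy columns `a` to `d` are hypothetical, chosen only to make the behaviour visible.

```python
import numpy as np
import pandas as pd

def drop_low_variance(df, threshold=1e-8):
    """Keep only the columns whose variance exceeds the threshold."""
    variances = df.var()
    return df.loc[:, variances > threshold]

def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of each pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Made-up features: b is almost a copy of a, c is constant, d is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.01, size=100),
    "c": np.ones(100),
    "d": rng.normal(size=100),
})

selected = drop_highly_correlated(drop_low_variance(toy))
print(list(selected.columns))
```

Which member of a correlated pair to keep is a judgment call; this sketch simply keeps the one that appears first.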

For hyperparameter tuning:

a. Grid search
b. Random search
c. Evolutionary optimization
d. Bayesian optimization
e. Gradient-based optimization

Then, evaluate the model performances, compare, and select the best performing one for final model training.
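
As an illustration of the first option, grid search, here is a minimal sketch using scikit-learn's `GridSearchCV`. The synthetic data, the `Ridge` model, and the candidate `alpha` values are assumptions made for the example; they are not the settings used for this dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=150)

pipeline = Pipeline([("scale", StandardScaler()), ("model", Ridge())])

# Every value in the grid is tried with 5-fold cross-validation.
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)

print("Best alpha:", search.best_params_["model__alpha"])
print("Best cross-validated MAE:", -search.best_score_)
```

scikit-learn's `RandomizedSearchCV` follows the same pattern for random search, sampling a fixed number of candidates instead of trying them all.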

Specific steps to train this model:

1. Data inspection - 

The data is manually inspected for anything that needs manual rearrangement to make reading the file more straightforward. It was observed that the headings span two rows and that one feature (FUEL CONSUMPTION) is multi-variate, so to speak. The headings are modified to fit in a single row. Also, the three variables under FUEL CONSUMPTION are made independent features, each in its own column.
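
The rearrangement described above was done by hand. For comparison, a similar flattening could be sketched in pandas as follows. The raw layout used here is made up to resemble a two-row heading and is an assumption, as are the helper `flatten_heading` and its naming rule; the real file may be laid out differently.

```python
import io
import pandas as pd

# Made-up raw file with a two-row heading (assumed layout, for illustration only).
raw = io.StringIO(
    "MODEL,MAKE,FUEL CONSUMPTION,,\n"
    "YEAR,,CITY (L/100 km),HWY (L/100 km),COMB (L/100 km)\n"
    "2010,ACURA,10.9,7.8,9.5\n"
)
rows = pd.read_csv(raw, header=None, dtype=str)

# Carry the group heading across the blank cells that follow it.
top = rows.iloc[0].ffill()
bottom = rows.iloc[1]

def flatten_heading(top_label, bottom_label):
    """Hypothetical rule for merging the two heading rows into one name."""
    if pd.isna(bottom_label):
        return top_label                      # e.g. MAKE
    if top_label == "FUEL CONSUMPTION":
        return bottom_label                   # each sub-variable becomes its own feature
    return f"{top_label} ({bottom_label})"    # e.g. MODEL (YEAR)

names = [flatten_heading(t, b) for t, b in zip(top, bottom)]

data = rows.iloc[2:].reset_index(drop=True)
data.columns = names
print(list(data.columns))
```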

The presence of categorical data is also noticed and would be considered during the data cleaning and wrangling.

Lastly, identifying the target variable, "CO2 EMISSIONS", shows that the ML problem is a regression task: predicting a value on a continuous scale.

2. Exploring for preprocessing - 

This is to identify missing values, categorical variables that need to be mapped to numerical values, and ordinal data to be handled.

In [14]:
filepath = "/content/drive/MyDrive/MY2010-2014 Fuel Consumption Ratings 5-cycle (Dataset) - MY2010-2014 Fuel Consumption Ratings 5-cycle (Dataset).csv" # The path to the data file

In [None]:
import chardet

# Reading the dataset initially failed because of its encoding.
# This cell detects the encoding used for the file.
with open(filepath, 'rb') as file:
  print(chardet.detect(file.read()))

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
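
The reported confidence of 0.73 is moderate, so a quick cross-check can be useful. The sketch below uses only the standard library to find the first candidate encoding that decodes a byte string without error. The helper `first_working_encoding`, the candidate list, and the sample bytes are hypothetical. Note that ISO-8859-1 maps every possible byte to a character, so it never raises an error; a successful decode with it does not by itself prove that the guess is right.

```python
def first_working_encoding(raw_bytes, candidates=("utf-8", "cp1252", "ISO-8859-1")):
    """Return the first candidate encoding that decodes raw_bytes without error."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None

# 0xCB is a letter in ISO-8859-1 and cp1252, but here it is invalid as UTF-8.
sample = b"CITRO\xcbN"
print(first_working_encoding(sample))
print(first_working_encoding(b"ACURA"))
```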


In [15]:
# Import libraries needed for our modeling
import pandas as pd

In [16]:
def csv_to_df(path):
  """
  Take the file path of the CSV file and return its contents as a DataFrame.
  """
  # The columns to be read. This is important when the data sheet contains irrelevant columns.
  data_columns = ["MODEL (YEAR)", "MAKE", "MODEL", "VEHICLE CLASS", "ENGINE SIZE (L)", "CYLINDERS", "TRANSMISSION", "FUEL TYPE", "CITY (L/100 km)", "HWY (L/100 km)", "COMB (L/100 km)", "COMB (mpg)", "CO2 EMISSIONS"]
  # The encoding was identified earlier with chardet.
  df = pd.read_csv(path, low_memory=False, usecols=data_columns, encoding='ISO-8859-1')
  return df

In [17]:
fuel_conspt_rating = csv_to_df(filepath) # Call the function to load the dataset into a dataframe

In [18]:
fuel_conspt_rating.head(5) # Show the first five rows of the dataset

Unnamed: 0,MODEL (YEAR),MAKE,MODEL,VEHICLE CLASS,ENGINE SIZE (L),CYLINDERS,TRANSMISSION,FUEL TYPE,CITY (L/100 km),HWY (L/100 km),COMB (L/100 km),COMB (mpg),CO2 EMISSIONS
0,2010,ACURA,CSX,COMPACT,2.0,4.0,AS5,X,10.9,7.8,9.5,30.0,219.0
1,2010,ACURA,CSX,COMPACT,2.0,4.0,M5,X,10.0,7.6,8.9,32.0,205.0
2,2010,ACURA,CSX,COMPACT,2.0,4.0,M6,Z,11.6,8.1,10.0,28.0,230.0
3,2010,ACURA,MDX AWD,SUV,3.7,6.0,AS6,Z,14.8,11.3,13.2,21.0,304.0
4,2010,ACURA,RDX AWD TURBO,SUV,2.3,4.0,AS5,Z,13.2,10.3,11.9,24.0,274.0


In [19]:
fuel_conspt_rating.shape # The shape of the dataframe shows that it has 5384 rows and 13 columns

(5384, 13)

In [20]:
fuel_conspt_rating.isnull().sum() # Check for missing values 

MODEL (YEAR)       16
MAKE                6
MODEL              25
VEHICLE CLASS      25
ENGINE SIZE (L)    25
CYLINDERS          25
TRANSMISSION       25
FUEL TYPE          25
CITY (L/100 km)    25
HWY (L/100 km)     25
COMB (L/100 km)    25
COMB (mpg)         25
CO2 EMISSIONS      25
dtype: int64
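
The per-column counts do not show whether the gaps fall in the same rows. One way to check is sketched below on a small made-up frame; the values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "MODEL (YEAR)": [2010, 2011, np.nan, 2012],
    "MAKE": ["ACURA", "BMW", np.nan, "FORD"],
    "CO2 EMISSIONS": [219.0, np.nan, np.nan, 304.0],
})

# Rows with at least one missing value, and rows that are missing everything.
rows_with_gaps = toy[toy.isnull().any(axis=1)]
fully_empty = toy[toy.isnull().all(axis=1)]

print(rows_with_gaps)
print(len(fully_empty), "fully empty row(s)")
```

Rows that are empty in every column are usually safer to drop than to fill, while rows with only a few gaps are candidates for filling.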

In [21]:
def treat_na(df):
  """
  Treat the missing values: drop the rows with a missing model year, then fill
  the numerical columns with 0.0 and the categorical columns with "None".
  """

  categorical_cols = ['MAKE', 'MODEL', 'VEHICLE CLASS', 'TRANSMISSION', 'FUEL TYPE'] # These are columns with categorical data
  numerical_cols = ['ENGINE SIZE (L)', 'CYLINDERS', 'CITY (L/100 km)', 'HWY (L/100 km)', 'COMB (L/100 km)', 'COMB (mpg)', 'CO2 EMISSIONS'] # These are numerical data

  # Drop the rows with a missing model year. The explicit copy means the fills
  # below work on an independent DataFrame rather than on a slice of df.
  treated_df = df.dropna(subset=['MODEL (YEAR)']).copy()

  # Assign the filled columns back instead of calling fillna(inplace=True) on a column selection
  treated_df[numerical_cols] = treated_df[numerical_cols].fillna(0.0)
  treated_df[categorical_cols] = treated_df[categorical_cols].fillna("None")

  return treated_df

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [22]:
fuel_conspt_rating_treated = treat_na(fuel_conspt_rating) # fuel_conspt_rating_treated here is the dataframe without missing values



In [None]:
fuel_conspt_rating_treated.isnull().sum() # Confirm whether there are still missing values

MODEL (YEAR)       0
MAKE               0
MODEL              0
VEHICLE CLASS      0
ENGINE SIZE (L)    0
CYLINDERS          0
TRANSMISSION       0
FUEL TYPE          0
CITY (L/100 km)    0
HWY (L/100 km)     0
COMB (L/100 km)    0
COMB (mpg)         0
CO2 EMISSIONS      0
dtype: int64

In [24]:
def class_map_column(df, col_name):
  """
  Map each unique value of the categorical column col_name to an integer code,
  starting at 1 in order of first appearance, and store the codes in the column.
  Missing values are assumed to have been filled already.
  """
  # pd.factorize assigns the codes 0, 1, 2, ... in order of first appearance,
  # which replaces the slow row-by-row loop over df.iterrows()
  codes, _ = pd.factorize(df[col_name])
  df[col_name] = codes + 1

  return df

In [25]:
def class_map_all_columns (df):
  """
  The function for mapping all the categorical columns
  """
  categorical_cols = ['MAKE', 'MODEL', 'VEHICLE CLASS', 'TRANSMISSION', 'FUEL TYPE'] 

  for col_name in categorical_cols:
    df = class_map_column(df, col_name)

  return df
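
The mapping functions above overwrite each column and discard the value-to-code mapping, so the codes cannot be translated back later. A hypothetical variant that also returns the mapping is sketched below; `encode_with_mapping` and the toy column are illustrative and are not part of the notebook.

```python
import pandas as pd

def encode_with_mapping(df, col_name):
    """Hypothetical variant: return the encoded frame and the value -> code mapping."""
    # Codes follow the order of first appearance; shift them so they start at 1.
    codes, uniques = pd.factorize(df[col_name])
    encoded = df.copy()
    encoded[col_name] = codes + 1
    mapping = {value: code for code, value in enumerate(uniques, start=1)}
    return encoded, mapping

# Made-up values for illustration only.
toy = pd.DataFrame({"FUEL TYPE": ["X", "X", "Z", "X", "D"]})
encoded, mapping = encode_with_mapping(toy, "FUEL TYPE")

print(mapping)
print(encoded["FUEL TYPE"].tolist())
```

Integer codes impose an arbitrary order on categories that have none, which some models may pick up on; that trade-off is worth keeping in mind at the model training stage.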


In [26]:
fuel_conspt_rating_treated = class_map_all_columns (fuel_conspt_rating_treated) # fuel_conspt_rating_treated here is the dataframe with the categorical columns mapped



In [29]:
# It is observed at this stage that some MODEL (YEAR) entries are strings. The rows that contain them need to be removed.
data_columns = ["MODEL (YEAR)", "MAKE", "MODEL", "VEHICLE CLASS", "ENGINE SIZE (L)", "CYLINDERS", "TRANSMISSION", "FUEL TYPE", "CITY (L/100 km)", "HWY (L/100 km)", "COMB (L/100 km)", "COMB (mpg)", "CO2 EMISSIONS"]

# Convert every column to numeric; values that cannot be parsed become NaN
fuel_conspt_rating_final = (
  fuel_conspt_rating_treated.drop(data_columns, axis=1)
  .join(fuel_conspt_rating_treated[data_columns].apply(pd.to_numeric, errors='coerce'))
)

# Keep only the rows in which every column holds a valid number
fuel_conspt_rating_final = fuel_conspt_rating_final[fuel_conspt_rating_final[data_columns].notnull().all(axis=1)]
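
The effect of `errors='coerce'` followed by the not-null filter can be seen on a small made-up frame. The text row below is invented for the example and is not taken from the dataset.

```python
import pandas as pd

# Made-up rows: the third MODEL (YEAR) entry is text rather than a year.
toy = pd.DataFrame({
    "MODEL (YEAR)": ["2010", "2011", "Note: text row", "2012"],
    "CO2 EMISSIONS": [219.0, 205.0, 0.0, 230.0],
})

# Values that cannot be parsed as numbers become NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Keep only the rows in which every column is a valid number.
clean = numeric[numeric.notnull().all(axis=1)]
print(clean)
```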

In [30]:
def save_to_csv(df, savefilepath):
  """
  Save the DataFrame to a CSV file without writing the index column.
  """
  df.to_csv(savefilepath, encoding='utf-8', index=False)

In [31]:
savefilepath = "/content/drive/MyDrive/cleaned-MY2010-2014 Fuel Consumption Ratings 5-cycle (Dataset).csv"
save_to_csv (fuel_conspt_rating_final, savefilepath)

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive
