## COMP2006: Graded Lab 4

In this lab, you will gain some experience in dealing with missing data and further practice converting non-numeric features in a dataset to numeric.

**Target**: to predict `Comb Unadj FE - Conventional Fuel`

**Data set**: make sure you use the data assigned to your group!

| Groups | Data set |
| :-: | :-: |
| 1 | veh1_missing.csv |
| 2 | veh2_missing.csv |
| 3 | veh3_missing.csv |
| 4 | veh4_missing.csv |
| 5 | veh5_missing.csv |
| 6 | veh6_missing.csv |
| 7 | veh7_missing.csv |
| 8 | veh8_missing.csv |
| 10 | veh10_missing.csv |

**Important Notes:**
- Use [Chapter 7](https://mlbook.explained.ai/bulldozer-intro.html) of the textbook as a **guide**:
     - you only need to use **random forest** models;
- Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
- Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective

> **Don't make assumptions!**


### Part 0

### Group Number 7
 - Manuel Bishop Noriega - ID 4362207
 - Robert E. Matney III - ID: 4364229

     

### Part 1 - Create and evaluate an initial model

In this part you should: 
 - use Section 7.3 of the textbook as a guide, except:
     - use all of the data; and
     - use 150 decision trees in your random forest models
 - read in the data
 - isolate all numeric features from original data set
 - fill in any missing values with 0
 - create and evaluate a baseline model 

#### Code (10 marks)

### Importing libraries and setting up some useful functions

In [36]:
import numpy as np  # used explicitly by evaluate() for np.median
import pandas as pd
# from rfpimp import *
from rfpimp_MC import *
from sklearn.ensemble import RandomForestRegressor
import warnings  # suppress some warnings about plots
warnings.filterwarnings('ignore')

# ------------------------------- HELPER FUNCTIONS ----------------------------------
# To avoid repeating code, we create an evaluation function called evaluate().
# It takes the features and the target as parameters, uses them to create and fit
# a random forest regressor (150 trees, as Part 1 requires), and calculates its OOB performance.
# It returns a tuple with the last fitted rf model and the mean OOB score over 10 runs.
def evaluate(X, y):
    oob_scores = []
    for _ in range(10):
        rf = RandomForestRegressor(n_estimators=150, n_jobs=-1, oob_score=True)
        rf.fit(X, y)
        oob_scores.append(rf.oob_score_)
    oob = sum(oob_scores) / len(oob_scores)
    print(f'Mean OOB score: {oob}')
    # the node count and median tree height describe the last fitted forest only
    print(f'{rfnnodes(rf):,d} tree nodes and {np.median(rfmaxdepths(rf))} median tree height')

    return rf, oob

# showimp() plots the feature importances; it takes 4 parameters:
# rf: a fitted random forest regressor
# X, y: the features and the target
# features: the features (or feature groups) to score; added so that we can run
#           different tests while converting non-numeric features
def showimp(rf, X, y, features):
    I = importances(rf, X, y, features=features)
    plot_importances(I, color='blue')

In [37]:
# PART ONE
#-------------------------- reading the data ----------------------
df_raw = pd.read_csv("veh7_missing.csv")
df = df_raw.copy()  # keep a safe copy of our original data set


### Before doing anything, let's take a look at our data set

In [38]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1196 entries, 0 to 1195
Data columns (total 12 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------                                                     --------------  -----  
 0   Eng Displ                                                  831 non-null    float64
 1   # Cyl                                                      1196 non-null   int64  
 2   Comb Unadj FE - Conventional Fuel                          1196 non-null   object 
 3   # Gears                                                    1196 non-null   int64  
 4   Max Ethanol % - Gasoline                                   1164 non-null   float64
 5   Intake Valves Per Cyl                                      1196 non-null   int64  
 6   Exhaust Valves Per Cyl                                     1196 non-null   int64  
 7   Stop/Start System (Engine Management System)  Description  1196 non-null   object 
 8   Lockup T

| | Eng Displ | # Cyl | Comb Unadj FE - Conventional Fuel | # Gears | Max Ethanol % - Gasoline | Intake Valves Per Cyl | Exhaust Valves Per Cyl | Stop/Start System (Engine Management System) Description | Lockup Torque Converter | Calc Approach Desc | Cyl Deact? | Trans Creeper Gear |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 4.4 | 8 | 20.538 | 7 | 10.0 | 2 | 2 | Yes | NaN | not filled in | ^^ | N |
| 1 | 2.0 | 4 | 34.5365 | 6 | 15.0 | 2 | 2 | No | NaN | not filled in | ^^ | N |
| 2 | 3.4 | 6 | 31.5932 | 7 | 10.0 | 2 | 2 | none | N | not filled in | N | N |
| 3 | 4.4 | 8 | 22.0246 | 6 | 10.0 | 2 | 2 | none | N | Derived 5-cycle label | N | N |
| 4 | 4.3 | 6 | 25.772 | 6 | 85.0 | 1 | 1 | No | NaN | not filled in | ^^ | N |


***
Our data set has some missing values. Because we want to build a baseline model quickly, for now we use only the numeric features, fill in their missing values with 0, and go straight to building and evaluating the baseline model.

**NOTE:**
 After doing the above, fitting the model failed with `ValueError: could not convert string to float: 'Mod'`. The target feature is listed as `object` dtype in `df.info()` even though its values look like floats. On closer inspection, some target values are missing and hidden behind the string `'Mod'`. Because the problem affects only around 1.1% of the rows, we decided to drop those rows.
***

In [39]:
df_num=df.select_dtypes(include=['number']) 
df_num=df_num.fillna(0)
df_num.isna().any()

Eng Displ                   False
# Cyl                       False
# Gears                     False
Max Ethanol % - Gasoline    False
Intake Valves Per Cyl       False
Exhaust Valves Per Cyl      False
dtype: bool

In [40]:
# now we pull out target and call evaluate() to create and test baseline model
y=df['Comb Unadj FE - Conventional Fuel']
evaluate(df_num,y)

ValueError: could not convert string to float: 'Mod'
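The fix described in the note above (dropping the rows whose target is not numeric) could be sketched as follows. This is a minimal illustration on a toy frame, not the code used for the submission; the toy values are made up, and only the column name and the `'Mod'` placeholder come from the lab data.

```python
import pandas as pd

# Toy stand-in for the lab data (assumption); the real frame comes from veh7_missing.csv.
df = pd.DataFrame({
    "# Cyl": [8, 4, 6, 8],
    "Comb Unadj FE - Conventional Fuel": ["20.538", "Mod", "31.5932", "22.0246"],
})
target = "Comb Unadj FE - Conventional Fuel"

# Anything that cannot be parsed as a number (e.g. 'Mod') becomes NaN.
df[target] = pd.to_numeric(df[target], errors="coerce")

# Drop the rows whose target is now missing and rebuild a clean index.
df = df.dropna(subset=[target]).reset_index(drop=True)

print(df[target].dtype)  # float64
print(len(df))           # 3
```

The numeric features have to be taken from the same filtered frame so that `X` and `y` keep the same rows.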

Three features have explicit missing values (NaN) in our original data set:
- Eng Displ
- Max Ethanol % - Gasoline
- Lockup Torque Converter

Missing values can also hide behind placeholder values that carry no information, so let's take a closer look at our data set.
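One way to hunt for such hidden placeholders is to list the distinct values of every non-numeric column next to the explicit NaN counts. The sketch below uses a small toy frame (an assumption, with values modelled on the `df.head()` output above) rather than the full data set.

```python
import pandas as pd

# Toy stand-in for a few columns of the lab data (assumption).
df = pd.DataFrame({
    "Eng Displ": [4.4, 2.0, None, 4.3],
    "Lockup Torque Converter": [None, "N", "N", "Y"],
    "Cyl Deact?": ["^^", "N", "N", "^^"],
    "Calc Approach Desc": ["not filled in", "not filled in",
                           "Derived 5-cycle label", "not filled in"],
})

# Explicit missing values: NaN count per column.
print(df.isna().sum())

# Hidden missing values: inspect every distinct value of each non-numeric column.
distinct = {}
for col in df.select_dtypes(exclude="number").columns:
    distinct[col] = sorted(df[col].dropna().unique())
    print(f"{col}: {distinct[col]}")
```

Tokens such as `^^` or `not filled in` stand out in this listing, whereas `isna()` alone does not report them.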

In [15]:
print(f'This will tell us which features have EVIDENT missing values\n\n{df.isna().any()}' )
# df.info()
# df=df.fillna(0)
#print(df.isna().any())
print(df.nunique(axis=0))
print(df.info())
print(df.describe())
# df['Comb Unadj FE - Conventional Fuel'].unique()

# def sniff(df):
#     with pd.option_context("display.max_colwidth",20):
#         info=pd.DataFrame()
#         info['sample']=df.iloc[120]
#         info['data type']=df.dtypes
#         info['percent missing']=df.isnull().sum()*100/len(df)
#         return info.sort_values('data type')


# print(sniff(df))
# sniff(df)

# from pandas.api.types import is_string_dtype, is_object_dtype
# # if is_numeric_dtype((df['Comb Unadj FE - Conventional Fuel'])):
# temp=pd.to_numeric(df['Comb Unadj FE - Conventional Fuel'],errors='coerce',downcast='float')
# # df['Comb Unadj FE - Conventional Fuel'].isnull().any()
# # df['Comb Unadj FE - Conventional Fuel'].describe()
# print(temp.isna().any())
# print(df['Comb Unadj FE - Conventional Fuel'].isna().any())
# # df['Stop/Start System (Engine Management System)  Description'].astype(float)


Eng Displ                                                     44
# Cyl                                                          8
Comb Unadj FE - Conventional Fuel                            899
# Gears                                                        7
Max Ethanol % - Gasoline                                       3
Intake Valves Per Cyl                                          2
Exhaust Valves Per Cyl                                         2
Stop/Start System (Engine Management System)  Description      3
Lockup Torque Converter                                        3
Calc Approach Desc                                             4
Cyl Deact?                                                     3
Trans Creeper Gear                                             2
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1196 entries, 0 to 1195
Data columns (total 12 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------ 

### Part 2 - Normalize missing values

In this part you should: 
 - use Section 7.4 of the textbook as a guide
 - convert **all** representations of missing data to a **single** representation
 
#### Code (15 marks)
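As a hedged sketch of the idea, every placeholder token can be mapped to `np.nan` so that missing data has a single representation. The `placeholders` list below is an assumption based on the values visible in the `df.head()` output; whether each token really means "missing" in a given column has to be checked against the full data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for three text columns of the lab data (assumption).
df = pd.DataFrame({
    "Stop/Start": ["Yes", "No", "none", "none"],
    "Calc Approach Desc": ["not filled in", "Derived 5-cycle label",
                           "not filled in", "not filled in"],
    "Cyl Deact?": ["^^", "N", "N", "^^"],
})

# Hypothetical list of tokens treated as "missing"; verify each one before using it.
placeholders = ["none", "not filled in", "^^"]

# Replace every placeholder with NaN, the single representation of missing data.
df = df.replace(placeholders, np.nan)

print(df.isna().sum())
```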

#### Question (5 marks)

Note here all the different ways missing data was represented in the data.   

**Enter your answer here:**

### Part 3 - Categorical features

In this part you should: 
 - use Section 7.5.1 as a guide
 - only use ordinal encoding 
 - convert **all** non-numeric features to numeric 
 - handle any missing values
 
#### Code (25 marks)
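A minimal sketch of ordinal encoding with pandas categoricals, in the spirit of Section 7.5.1, might look like this. The toy column is an assumption. Missing values get category code -1, so adding 1 maps them to 0 and the real categories to 1..k.

```python
import pandas as pd

# Toy stand-in for one non-numeric feature with a missing value (assumption).
df = pd.DataFrame({"Trans Creeper Gear": ["N", "Y", None, "N"]})
col = "Trans Creeper Gear"

# Convert the strings to an ordered categorical (categories sorted alphabetically).
df[col] = df[col].astype("category").cat.as_ordered()

# cat.codes is -1 for missing values; shift by 1 so that missing becomes 0.
df[col] = df[col].cat.codes + 1

print(df[col].tolist())  # [1, 2, 0, 1]
```

The same two steps can be applied in a loop over every non-numeric column once the missing values have been normalized.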

### Part 4 - Numeric features

In this part you should: 
 - use Section 7.5.2 as a guide
 - handle any missing values
 
#### Code (30 marks)
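For numeric features, one common approach (sketched here under the assumption that median imputation is acceptable for the column) is to record which values were missing in a separate indicator column and then fill the gaps with the median. The helper name `fix_missing_num` and the toy data are illustrative.

```python
import pandas as pd

def fix_missing_num(df, colname):
    """Add a boolean '<colname>_na' indicator, then fill NaN with the column median."""
    df[colname + "_na"] = df[colname].isna()
    df[colname] = df[colname].fillna(df[colname].median())

# Toy stand-in for a numeric feature with one missing value (assumption).
df = pd.DataFrame({"Eng Displ": [4.4, 2.0, None, 3.4]})
fix_missing_num(df, "Eng Displ")

print(df["Eng Displ"].tolist())     # [4.4, 2.0, 3.4, 3.4]
print(df["Eng Displ_na"].tolist())  # [False, False, True, False]
```

Keeping the indicator column lets the random forest still use the fact that a value was missing, which filling with 0 alone would hide.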

### Part 5 - Create and evaluate a final model

In this part you should:
 - create and evaluate a model using all the features after processing them in Parts 2, 3, and 4 above 

#### Code (10 marks)

#### Questions (5 marks)

Provide answers to the following:
 1. calculate the percent difference between the results of Part 1 and Part 5 (make sure you are using the correct formula for percent difference) 
 2. based on the percent difference, state whether or not the results of Part 5 are an improvement over the results of Part 1
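As a hedged illustration, the sketch below assumes the symmetric definition of percent difference, $\frac{|a-b|}{(|a|+|b|)/2}\times 100$; if the course uses a different definition, the formula has to be adjusted. The two OOB scores are hypothetical placeholders, not results from this lab.

```python
def percent_difference(a, b):
    """Symmetric percent difference: |a - b| relative to the mean of |a| and |b|."""
    return abs(a - b) / ((abs(a) + abs(b)) / 2) * 100

# Hypothetical scores for illustration only; substitute the measured mean OOB scores.
baseline_oob = 0.80
final_oob = 0.90

print(round(percent_difference(baseline_oob, final_oob), 2))  # 11.76
```

Because this formula is symmetric, it does not indicate direction; whether Part 5 is an improvement follows from comparing the two scores directly.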

**Enter your answers here:**