<a href="https://colab.research.google.com/github/MrBCPT/Prediction-of-Product-Sales/blob/main/Code_Along_Challenge_Regression_Metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Regression Metrics**
- **Umuzi**
- **Course 2 - Moderate Machine Learning**
- **Week 2 - Lecture 1**

## **Project Description**

### **Task**
- The objective of this project is to predict the 'mpg' of a car.


### **Data Dictionary:**

**Attribute** | **Description**  
--- | ---
model | model of the car
price | price car last sold for
transmission | transmission type: Automatic or Manual
mileage | current mileage of the car
fuelType | fuel type the car runs on
tax | tax paid on car at last sale
mpg | miles per gallon of car (target)
engineSize | size of engine in litres

### **Import Libraries**

In [None]:
## Pandas
import pandas as pd
## Numpy
import numpy as np

## Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

## Models
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

## Regression Metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

## Set global scikit-learn configuration
from sklearn import set_config
## Display estimators as a diagram
set_config(display='diagram')  # 'text' or 'diagram'

### **Functions**

In [None]:
## Create a function to take the true and predicted values
## and print MAE, MSE, RMSE, and R2 metrics for a model
def eval_regression(y_true, y_pred, name='model'):
  """Take true targets and predictions from a regression model and print
  MAE, MSE, RMSE, and R2 scores.
  Set 'name' to the model name plus 'train' or 'test' as appropriate."""
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  rmse = np.sqrt(mse)
  r2 = r2_score(y_true, y_pred)

  print(f'{name} Scores')
  print(f'MAE: {mae:,.4f} \nMSE: {mse:,.4f} \nRMSE: {rmse:,.4f} \nR2: {r2:.4f}\n')
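
To make the four metrics concrete, the sketch below computes each one by hand with NumPy and checks the results against the scikit-learn functions that `eval_regression` calls. The `y_true` and `y_pred` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (assumption: invented for illustration only)
y_true = np.array([30.0, 40.0, 50.0, 60.0])
y_pred = np.array([32.0, 38.0, 53.0, 57.0])

errors = y_true - y_pred                          # residuals: [-2, 2, -3, 3]
mae = np.mean(np.abs(errors))                     # mean absolute error
mse = np.mean(errors ** 2)                        # mean squared error
rmse = np.sqrt(mse)                               # root mean squared error, in target units
ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # share of variance explained

# The manual results should agree with scikit-learn
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

MAE and RMSE are expressed in the units of the target (here, mpg), which usually makes them easier to explain to a client than MSE.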

## **1. Load and inspect the data**

### **Load the Data**

In [None]:
## Load Data
path = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vS2dIT3WEj2j4nSpai7K0wSCwFc_hQBYQR6Xf10VtnyI64EItM9SWxN1UFU_XhrkWdUp6ayrUOoJSgY/pub?output=csv'
df = pd.read_csv(path)

### **Inspect the Data**

In [None]:
## Display the first (5) rows of the dataframe
df.head()

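| | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | SLK | 2005 | 5200 | Automatic | 63000 | Petrol | 325 | 32.1 | 1.8 |
| 1 | S Class | 2017 | 34948 | Automatic | 27000 | Hybrid | 20 | 61.4 | 2.1 |
| 2 | SL CLASS | 2016 | 49948 | Automatic | 6200 | Petrol | 555 | 28.0 | 5.5 |
| 3 | G Class | 2016 | 61948 | Automatic | 16000 | Petrol | 325 | 30.4 | 4.0 |
| 4 | G Class | 2016 | 73948 | Automatic | 4000 | Petrol | 325 | 30.1 | 4.0 |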


- The data appears to have loaded correctly.

In [None]:
## Display the number of rows and columns for the dataframe
print(f'There are {df.shape[0]} rows, and {df.shape[1]} columns.')

There are 13119 rows, and 9 columns.


In [None]:
## Display the column names, count of non-null values, and their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13119 entries, 0 to 13118
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         13119 non-null  object 
 1   year          13119 non-null  int64  
 2   price         13119 non-null  int64  
 3   transmission  13119 non-null  object 
 4   mileage       13119 non-null  int64  
 5   fuelType      13119 non-null  object 
 6   tax           13119 non-null  int64  
 7   mpg           13119 non-null  float64
 8   engineSize    13119 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 922.6+ KB


- There are no missing values, so no imputation is needed.
  - We will not need to use SimpleImputer in our preprocessing steps.
- All datatypes look correct.

- **Never impute with statistics computed on the full dataset (for example, `fillna()` with a column mean) before `train_test_split`, as this leaks information from the test set into training.**
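
If imputation were needed, one leakage-free approach is to fit the imputer on the training split only and reuse its learned statistic on the test split. The sketch below illustrates this with a hypothetical toy column containing missing values (the real dataset has none); the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy column with missing values (the real dataset has none)
toy = pd.DataFrame(
    {'mileage': [1000.0, np.nan, 3000.0, 4000.0, np.nan, 6000.0, 7000.0, 8000.0]}
)
train, test = train_test_split(toy, random_state=42)

# Learn the fill value from the training rows only
imputer = SimpleImputer(strategy='mean')
train_filled = imputer.fit_transform(train)

# Reuse the training mean on the test rows; no test information is used to fit
test_filled = imputer.transform(test)

print('Fill value learned from train:', imputer.statistics_[0])
```

Placing the imputer inside a pipeline achieves the same thing automatically, because the pipeline is only ever fit on the training data.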

In [None]:
## Display the descriptive statistics for the numeric columns
df.describe(include="number") # or 'object'

| | year | price | mileage | tax | mpg | engineSize |
| --- | --- | --- | --- | --- | --- | --- |
| count | 13119.0 | 13119.0 | 13119.0 | 13119.0 | 13119.0 | 13119.0 |
| mean | 2017.296288 | 24698.59692 | 21949.559037 | 129.972178 | 55.155843 | 2.07153 |
| std | 2.224709 | 11842.675542 | 21176.512267 | 65.260286 | 15.220082 | 0.572426 |
| min | 1970.0 | 650.0 | 1.0 | 0.0 | 1.1 | 0.0 |
| 25% | 2016.0 | 17450.0 | 6097.5 | 125.0 | 45.6 | 1.8 |
| 50% | 2018.0 | 22480.0 | 15189.0 | 145.0 | 56.5 | 2.0 |
| 75% | 2019.0 | 28980.0 | 31779.5 | 145.0 | 64.2 | 2.1 |
| max | 2020.0 | 159999.0 | 259000.0 | 580.0 | 217.3 | 6.2 |


In [None]:
## Display the descriptive statistics for the non-numeric columns
df.describe(include="object") # or 'number'

| | model | transmission | fuelType |
| --- | --- | --- | --- |
| count | 13119 | 13119 | 13119 |
| unique | 27 | 4 | 4 |
| top | C Class | Semi-Auto | Diesel |
| freq | 3747 | 6848 | 9187 |


## **2. Clean the Data**

### **Remove Unnecessary Columns**

- There are no columns to be dropped.

### **Remove Unnecessary Rows**

#### **Duplicates**

In [None]:
## Display the number of duplicate rows in the dataset
print(f'There are {df.duplicated().sum()} duplicate rows.')

There are 259 duplicate rows.


In [None]:
## Drop duplicate rows
df = df.drop_duplicates()

In [None]:
## Confirm duplicate rows have been dropped
print(f'There are {df.duplicated().sum()} duplicate rows.')

There are 0 duplicate rows.


#### **Categorical Columns**

In [None]:
## Print the unique values for the column
print('Unique models:\n', df['model'].unique())
print('\n')
## Print the unique values for the column
print('Unique transmissions:\n', df['transmission'].unique())
print('\n')
## Print the unique values for the column
print('Unique fuel types:\n', df['fuelType'].unique())
print('\n')

Unique models:
 ['SLK' 'S Class' 'SL CLASS' 'G Class' 'GLE Class' 'GLA Class' 'A Class'
 'B Class' 'GLC Class' 'C Class' 'E Class' 'GL Class' 'CLS Class'
 'CLC Class' 'CLA Class' 'V Class' 'M Class' 'CL Class' 'GLS Class'
 'GLB Class' 'X-CLASS' '180' 'CLK' 'R Class' '230' '220' '200']


Unique transmissions:
 ['Automatic' 'Manual' 'Semi-Auto' 'Other']


Unique fuel types:
 ['Petrol' 'Hybrid' 'Diesel' 'Other']




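- No unusual values noted. 'SL CLASS' and 'X-CLASS' are capitalized differently from the other models, but they are distinct categories rather than duplicates.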

## **3. Split the Data**



In [None]:
## Define features (X) and target (y)
target = 'mpg'
X = df.drop(columns = [target])
y = df[target]


In [None]:
## Split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

## **4. Model the Data - Challenge**

### **Baseline Model**
- Instantiate the baseline model using the 'mean' strategy.
- Create a model pipeline using the preprocessor and model.
- Fit the model pipeline on X_train and y_train. (Never fit on X_test.)
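
The notebook does not define the preprocessor, so the following is only one possible sketch of the three steps above. It assumes a `ColumnTransformer` that scales numeric columns and one-hot encodes the rest, and it runs on a small synthetic DataFrame (`toy`, with random values) so that it is self-contained; in the notebook, the existing `X_train`, `X_test`, `y_train`, and `y_test` would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the car data (assumption: random values, same column names)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'transmission': rng.choice(['Automatic', 'Manual', 'Semi-Auto'], size=n),
    'fuelType': rng.choice(['Petrol', 'Diesel', 'Hybrid'], size=n),
    'mileage': rng.integers(1, 100_000, size=n),
    'engineSize': rng.choice([1.5, 2.0, 3.0], size=n),
})
toy['mpg'] = 80 - 12 * toy['engineSize'] + rng.normal(0, 3, size=n)

X = toy.drop(columns=['mpg'])
y = toy['mpg']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# One possible preprocessor: scale numeric columns, one-hot encode the others
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), make_column_selector(dtype_include='number')),
    ('cat', OneHotEncoder(handle_unknown='ignore'),
     make_column_selector(dtype_exclude='number')),
])

# Baseline: always predicts the mean of the training target
baseline_pipe = make_pipeline(preprocessor, DummyRegressor(strategy='mean'))
baseline_pipe.fit(X_train, y_train)

train_preds = baseline_pipe.predict(X_train)
test_preds = baseline_pipe.predict(X_test)
print(f'Baseline train R2: {r2_score(y_train, train_preds):.4f}')
```

Because the baseline ignores the features, any real model should beat its MAE and RMSE and score an R2 above roughly zero to be worth considering.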


## **Decision Tree Model**

In [None]:
## Import the model
from sklearn.tree import DecisionTreeRegressor

In [None]:
## Create an instance of the model


## Create a model pipeline


## Fit the model



### Decision Tree Metrics

In [None]:
## Make predictions using the model

## Display model performance metrics using a function
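
As a rough sketch of how these cells might be filled in, the example below fits a default decision tree in a pipeline and compares train and test R2. It uses synthetic numeric data from `make_regression`, and `StandardScaler` is only a placeholder for the notebook's preprocessor step; both are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic numeric data (assumption): stands in for the car features
X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Create an instance of the model (default settings: unlimited depth)
dec_tree = DecisionTreeRegressor(random_state=42)

# Create a model pipeline; StandardScaler is a placeholder for the preprocessor
dec_tree_pipe = make_pipeline(StandardScaler(), dec_tree)

# Fit on the training data only
dec_tree_pipe.fit(X_train, y_train)

# Compare performance on data the tree has and has not seen
train_r2 = r2_score(y_train, dec_tree_pipe.predict(X_train))
test_r2 = r2_score(y_test, dec_tree_pipe.predict(X_test))
print(f'Train R2: {train_r2:.4f}  Test R2: {test_r2:.4f}')
```

An unconstrained tree can memorize the training set, so a large gap between train and test scores is a sign of overfitting; limiting `max_depth` is one common way to reduce it.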



## **Bagged Tree Model**

In [None]:
## Import the model
from sklearn.ensemble import BaggingRegressor

In [None]:
# Create an instance of the model


# Create a model pipeline


# Fit the model



### Bagged Tree Metrics

In [None]:
## Make predictions using the model

## Display model performance metrics using a function
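
A bagged tree model could be set up in the same way. The sketch below relies on `BaggingRegressor`'s default base estimator (a decision tree) and again uses synthetic data and a placeholder `StandardScaler` step; these, and the choice of `n_estimators=10`, are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (assumption): stands in for the car features
X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 10 trees is trained on a bootstrap sample of the training rows
bagreg = BaggingRegressor(n_estimators=10, random_state=42)

# StandardScaler is a placeholder for the notebook's preprocessor
bagreg_pipe = make_pipeline(StandardScaler(), bagreg)
bagreg_pipe.fit(X_train, y_train)

# Predictions are the average of the individual trees' predictions
train_r2 = r2_score(y_train, bagreg_pipe.predict(X_train))
test_r2 = r2_score(y_test, bagreg_pipe.predict(X_test))
print(f'Train R2: {train_r2:.4f}  Test R2: {test_r2:.4f}')
```

Averaging many trees tends to reduce the variance of a single deep tree, which is why bagging often narrows the train/test gap.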




## **Random Forest Model**

In [None]:
## Import the model
from sklearn.ensemble import RandomForestRegressor

In [None]:
## Create an instance of the model

## Create a model pipeline


## Fit the model



### Random Forest Metrics

In [None]:
## Make predictions using the model

## Display model performance metrics using a function
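
For the random forest, a comparable sketch is shown below. It also prints the fitted model's `feature_importances_`, which can help explain the model to a client. The synthetic data, the placeholder `StandardScaler` step, and `n_estimators=100` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (assumption): stands in for the car features
X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random forest: bagged trees that also sample features at each split
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# StandardScaler is a placeholder for the notebook's preprocessor
rf_pipe = make_pipeline(StandardScaler(), rf)
rf_pipe.fit(X_train, y_train)

test_r2 = r2_score(y_test, rf_pipe.predict(X_test))
print(f'Test R2: {test_r2:.4f}')

# Impurity-based importances, one value per input feature
importances = rf_pipe[-1].feature_importances_
print(np.round(importances, 3))
```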




## **K-Nearest Neighbors**

In [None]:
## Import the model
from sklearn.neighbors import KNeighborsRegressor

In [None]:
## Create an instance of the model


## Create a model pipeline


## Fit the model



### K-Nearest Neighbors Metrics

In [None]:
## Make predictions using the model

## Display model performance metrics using a function
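
KNN is distance-based, so features on very different scales (such as mileage and engine size) can distort which neighbors are chosen. The sketch below illustrates this on synthetic data where a small-scale feature drives the target and a large-scale feature is pure noise; the data and the choice of `n_neighbors=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): one informative feature, one large-scale noise feature
rng = np.random.default_rng(42)
n = 400
informative = rng.normal(0, 1, size=n)         # small scale, drives the target
large_scale = rng.normal(0, 10_000, size=n)    # large scale, unrelated to the target
X = np.column_stack([informative, large_scale])
y = 10 * informative + rng.normal(0, 1, size=n)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Without scaling, distances are dominated by the large-scale column
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# With scaling inside a pipeline, both columns contribute comparably
knn_pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_pipe.fit(X_train, y_train)

raw_r2 = r2_score(y_test, knn_raw.predict(X_test))
scaled_r2 = r2_score(y_test, knn_pipe.predict(X_test))
print(f'Unscaled test R2: {raw_r2:.4f}  Scaled test R2: {scaled_r2:.4f}')
```

This is one reason to keep a scaler in the preprocessing step for KNN, even though tree-based models do not need it.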




# **Challenge 2:**

## **Recommendations**

You have now tried several different models on your dataset and need to determine which one to implement.

- Overall, which model would you recommend your client deploy?
- Justify your recommendation by referencing metrics.  Which metrics are most important for your client to know about?
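
One way to support the recommendation is to collect the test metrics for every model in a single table. The sketch below does this on synthetic data with default-ish hyperparameters; the data, the placeholder `StandardScaler` step, and the `models` dictionary are assumptions for illustration, and the resulting ranking says nothing about how the models rank on the car data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic numeric data (assumption): stands in for the car features
X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hypothetical set of candidate models to compare
models = {
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Bagged Trees': BaggingRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'KNN': KNeighborsRegressor(),
}

rows = []
for name, model in models.items():
    # StandardScaler is a placeholder for the notebook's preprocessor
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    rows.append({
        'model': name,
        'MAE': mean_absolute_error(y_test, preds),
        'RMSE': np.sqrt(mean_squared_error(y_test, preds)),
        'R2': r2_score(y_test, preds),
    })

# One row per model, best test R2 first
results = pd.DataFrame(rows).set_index('model').sort_values('R2', ascending=False)
print(results.round(4))
```

Reporting MAE or RMSE alongside R2 lets the client see the typical error in mpg as well as the share of variance explained.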

**Model recommended:**
*


**Why this model?**
*
