# **Milestone 2**

## **Model Building**

1. The target we want to predict is "Price". We will model its log-transformed version, 'price_log'.
2. Before we proceed to the model, we have to encode the categorical features. We will drop categorical features such as 'Name'.
3. We'll split the data into train and test sets so that we can evaluate the model built on the train data.
4. Build Regression models using train data.
5. Evaluate the model performance.
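
The scoring code later in this notebook back-transforms predictions with `np.exp`, which suggests that 'price_log' is the natural log of 'Price'. A minimal sketch of that round trip, assuming a natural-log transform and using invented prices:

```python
import numpy as np

# Hypothetical prices, chosen only for illustration.
price = np.array([1.5, 4.25, 12.0, 45.0])

# Forward transform: the model is trained on the log scale.
price_log = np.log(price)

# Back-transform: recover the original scale before reporting errors.
price_back = np.exp(price_log)

print(np.allclose(price, price_back))
```

Reporting errors on the back-transformed values keeps RMSE in the same units as 'Price'.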

In [60]:
#Import required libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

#to ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Remove the limit on the number of displayed columns and raise the displayed-row limit to 200. It helps to see the entire dataframe while printing it
pd.set_option("display.max_columns", None)
# pd.set_option('display.max_rows', None)
pd.set_option("display.max_rows", 200)

In [61]:
cars_data = pd.read_csv("..\\..\\Public_Datasets\\used_cars_milestone_1.csv")

In [62]:
# Inspect the data: 'Price' (the dependent variable) has missing values, which are dropped in the next cell
cars_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7252 entries, 0 to 7251
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             7252 non-null   int64  
 1   Name                   7252 non-null   object 
 2   Location               7252 non-null   object 
 3   Year                   7252 non-null   int64  
 4   Kilometers_Driven      7252 non-null   int64  
 5   Fuel_Type              7252 non-null   object 
 6   Transmission           7252 non-null   object 
 7   Owner_Type             7252 non-null   object 
 8   Mileage                7252 non-null   float64
 9   Engine                 7252 non-null   float64
 10  Power                  7252 non-null   float64
 11  Seats                  7252 non-null   float64
 12  New_price              7252 non-null   float64
 13  Price                  6018 non-null   float64
 14  kilometers_driven_log  7252 non-null   float64
 15  pric

In [63]:
cars_data.dropna(inplace=True)

### **Split Data**

<li>Step 1: Split the data into X and y.
<li>Step 2: Encode the categorical variables in X using pd.get_dummies.
<li>Step 3: Split the data into train and test using train_test_split.
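
The three steps above can be sketched on a toy frame. The column names mirror the dataset, but the values and the `toy`, `X_toy`, and `y_toy` names are invented for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; the values are made up.
toy = pd.DataFrame({
    "Year": [2012, 2015, 2018, 2010, 2016, 2019],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG", "Diesel", "Petrol"],
    "price_log": [1.1, 1.9, 2.4, 0.7, 2.0, 2.6],
})

# Step 1: separate the target from the features.
y_toy = toy["price_log"]
X_toy = toy.drop(columns=["price_log"])

# Step 2: one-hot encode; drop_first=True drops one level per categorical column.
X_toy = pd.get_dummies(X_toy, drop_first=True)

# Step 3: hold out a test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=1)

print(X_toy.columns.tolist())
print(X_tr.shape, X_te.shape)
```

Dropping the first level avoids a redundant dummy column: the dropped level is implied when all remaining dummies for that variable are zero.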

<b>Think about it:</b> Why should we drop 'Name', 'Price', 'price_log', and 'Kilometers_Driven' from X before splitting?

In [64]:
# Step-1
y = cars_data[["price_log", "Price"]]
X = cars_data.drop(['Unnamed: 0', 'Model', 'Name', 'Price', 'price_log', 'Kilometers_Driven'],axis=1)


In [65]:
# Step-2 Use pd.get_dummies(drop_first=True)
X = pd.get_dummies(X, drop_first=True)

In [66]:
# Step-3 Splitting data into training and test set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)

(4212, 55) (1806, 55)


In [67]:
X_train.columns

Index(['Year', 'Mileage', 'Engine', 'Power', 'Seats', 'New_price',
       'kilometers_driven_log', 'Location_Bangalore', 'Location_Chennai',
       'Location_Coimbatore', 'Location_Delhi', 'Location_Hyderabad',
       'Location_Jaipur', 'Location_Kochi', 'Location_Kolkata',
       'Location_Mumbai', 'Location_Pune', 'Fuel_Type_Diesel',
       'Fuel_Type_Electric', 'Fuel_Type_LPG', 'Fuel_Type_Petrol',
       'Transmission_Manual', 'Owner_Type_Fourth & Above', 'Owner_Type_Second',
       'Owner_Type_Third', 'Brand_Audi', 'Brand_BMW', 'Brand_Bentley',
       'Brand_Chevrolet', 'Brand_Datsun', 'Brand_Fiat', 'Brand_Force',
       'Brand_Ford', 'Brand_Honda', 'Brand_Hyundai', 'Brand_ISUZU',
       'Brand_Isuzu', 'Brand_Jaguar', 'Brand_Jeep', 'Brand_Lamborghini',
       'Brand_Land', 'Brand_Mahindra', 'Brand_Maruti', 'Brand_Mercedes-Benz',
       'Brand_Mini', 'Brand_Mitsubishi', 'Brand_Nissan', 'Brand_Porsche',
       'Brand_Renault', 'Brand_Skoda', 'Brand_Smart', 'Brand_Tata',
       'Brand

In [75]:
# Let us write a function for calculating r2_score and RMSE on train and test data.
# This function takes as input a fitted model that predicts 'price_log'.
# It relies on the global X_train, X_test, y_train, and y_test defined above.
def get_model_score(model, flag=True):
    '''
    model : fitted regressor that predicts price_log from X
    flag  : if True, print the scores (default True)

    Returns [train_r2, test_r2, train_rmse, test_rmse] on the original Price scale.
    '''
    # defining an empty list to store train and test results
    score_list = []

    # Predictions are on the log scale; np.exp converts them back to Price
    pred_train = model.predict(X_train)
    pred_train_ = np.exp(pred_train)
    pred_test = model.predict(X_test)
    pred_test_ = np.exp(pred_test)

    train_r2 = metrics.r2_score(y_train['Price'], pred_train_)
    test_r2 = metrics.r2_score(y_test['Price'], pred_test_)
    # np.sqrt of the MSE gives the RMSE and works across scikit-learn versions
    # (the squared=False argument is deprecated in recent releases)
    train_rmse = np.sqrt(metrics.mean_squared_error(y_train['Price'], pred_train_))
    test_rmse = np.sqrt(metrics.mean_squared_error(y_test['Price'], pred_test_))

    # Adding all scores to the list
    score_list.extend((train_r2, test_r2, train_rmse, test_rmse))

    # Print the scores only if flag is True (the default)
    if flag:
        print("R-square on training set : ", train_r2)
        print("R-square on test set : ", test_r2)
        print("RMSE on training set : ", train_rmse)
        print("RMSE on test set : ", test_rmse)

    # returning the list with train and test scores
    return score_list
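
`get_model_score` reads the train and test sets from the global scope. As a hedged alternative, the same idea can be written with the data passed in explicitly. The helper name `score_on_price_scale` and the synthetic data below are assumptions for illustration, not part of the notebook:

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

def score_on_price_scale(model, X, price):
    """Return (R^2, RMSE) after back-transforming log-scale predictions."""
    pred_price = np.exp(model.predict(X))
    r2 = metrics.r2_score(price, pred_price)
    rmse = np.sqrt(metrics.mean_squared_error(price, pred_price))
    return r2, rmse

# Synthetic data: log(price) is linear in one feature plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 3, size=(200, 1))
log_price = 0.5 + 0.8 * X_demo[:, 0] + rng.normal(0, 0.05, size=200)
price_demo = np.exp(log_price)

# Train on the log scale, score on the price scale.
demo_model = LinearRegression().fit(X_demo, log_price)
r2_demo, rmse_demo = score_on_price_scale(demo_model, X_demo, price_demo)
print(r2_demo, rmse_demo)
```

Passing the data as arguments makes it possible to score the same model on any split without redefining globals.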

<hr>

For regression problems, some commonly used algorithms are:<br>

**1) Linear Regression** <br>
**2) Ridge / Lasso Regression** <br>
**3) Decision Trees** <br>
**4) Random Forest** <br>

### **Fitting a linear model**

Linear Regression can be implemented using: <br>

**1) Sklearn:** https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html <br>
**2) Statsmodels:** https://www.statsmodels.org/stable/regression.html

In [69]:
# import Linear Regression from sklearn
from sklearn.linear_model import LinearRegression

In [70]:
# Create a linear regression model
lr = LinearRegression()       

In [73]:
# Fit linear regression model
lr.fit(X_train, y_train['price_log']) 

LinearRegression()

In [76]:
# Get score of the model.
LR_score = get_model_score(lr)

R-square on training set :  0.8613929869989658
R-square on test set :  0.8607541049366532
RMSE on training set :  4.159526260224498
RMSE on test set :  4.1588249220448565


#### **Observations from results: _____**

#### **Important variables of Linear Regression**

Building a model using statsmodels

In [None]:
# Import statsmodels (the code below expects the alias sm)

# Add constant to train data
x_train = __________

# Add constant to test data
x_test = ___________

def build_ols_model(train):
    # Create the model
    olsmodel = sm.OLS(y_train["price_log"], train)
    return olsmodel.fit()


# Fit linear model on new dataset
olsmodel1 = build_ols_model(x_train)
print(olsmodel1.summary())

In [None]:
# Retrieve coefficient values and p-values and store them in a dataframe
olsmod = pd.DataFrame(olsmodel1.params, columns=['coef'])
olsmod['pval']=olsmodel1.pvalues

In [None]:
# Sort descending by p-value, then keep only the significant rows (pval <= 0.05)
olsmod = olsmod.sort_values(by="pval", ascending=False)
pval_filter = olsmod['pval']<=0.05
olsmod[pval_filter]

In [None]:
# Collect the variables that are significant overall
pval_filter = olsmod['pval']<=0.05
imp_vars = olsmod[pval_filter].index.tolist()

# Map the significant (possibly one-hot encoded) columns back to the original, un-encoded variables
sig_var = []
for col in imp_vars:
    # The text before the first underscore identifies the original column (e.g. 'Brand_Audi' -> 'Brand')
    first_part = col.split('_')[0]
    for c in cars_data.columns:
        if first_part in c and c not in sig_var:
            sig_var.append(c)


start = '\033[1m'   # ANSI code: bold on
end = '\033[0m'     # ANSI code: reset formatting
print(start+'Most significant overall variables of LINEAR REGRESSION are '+end,':\n',sig_var)
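
The cell above matches on the text before the first underscore and uses a substring test, which can also pick up unrelated columns that happen to contain that text. A stricter variant, sketched here with hypothetical column lists, matches the full original column name as a prefix:

```python
# Original (un-encoded) column names and significant encoded columns; both lists are hypothetical.
original_cols = ["Year", "Fuel_Type", "Owner_Type", "Brand"]
significant = ["Fuel_Type_Diesel", "Brand_Audi", "Brand_BMW", "Year"]

parents = []
for col in significant:
    for orig in original_cols:
        # A dummy column is named '<original>_<level>'; a numeric column keeps its own name.
        if col == orig or col.startswith(orig + "_"):
            if orig not in parents:
                parents.append(orig)

print(parents)
```

This assumes no original column name is itself a prefix of another followed by an underscore; if that can happen, the encoded names need a less ambiguous separator.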

<b>Build Ridge / Lasso Regression similar to Linear Regression:</b><br>

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
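
Ridge and Lasso add a penalty on the coefficient sizes; Ridge shrinks coefficients toward zero, while Lasso can set some of them exactly to zero. A minimal sketch on synthetic data, with `alpha` values picked arbitrarily for illustration rather than tuned:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of five features affect the target.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# alpha controls the penalty strength; these values are illustrative only.
ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)

print(ridge_demo.coef_.round(2))
print(lasso_demo.coef_.round(2))
```

On the car data the models would be fit on `X_train` and `y_train['price_log']`, in the same way as the linear regression above.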

In [None]:
# import Ridge/ Lasso Regression from sklearn

In [None]:
# Create a Ridge regression model

In [None]:
# Fit Ridge regression model.

In [None]:
# Get score of the model.

In [None]:
# Observations

### **Decision Tree** 

https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html
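
A decision tree with no depth limit keeps splitting until it fits the training data almost perfectly, so comparing train and test scores is a quick overfitting check. A sketch on synthetic data (all names and values are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear target with noise.
rng = np.random.default_rng(3)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# No max_depth is set, so the tree is free to memorize the training set.
tree_demo = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

train_r2 = tree_demo.score(X_tr, y_tr)
test_r2 = tree_demo.score(X_te, y_te)
print(train_r2, test_r2)
```

A large gap between the two scores is the usual motivation for the hyperparameter tuning section later in this notebook.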

In [None]:
# import Decision tree for Regression from sklearn

In [None]:
# Create a decision tree regression model
dtree = _____(random_state=1)

In [None]:
# Fit decision tree regression model.
dtree.fit(_______,_______)

In [None]:
# Get score of the model.
Dtree_model = get_model_score(_____)

#### **Observations from results: _____**

Print the importance of each feature in the tree. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance, although for a regression tree the criterion is an error measure such as squared error rather than Gini impurity.


In [None]:
print(pd.DataFrame(dtree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
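
Because the importances are normalized, they sum to 1 and can be read as shares of the total criterion reduction. A small sketch on synthetic data, where the feature names `strong`, `weak`, and `noise` are invented to make the expected ranking obvious:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 'strong' dominates the target, 'noise' is unrelated to it.
rng = np.random.default_rng(4)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["strong", "weak", "noise"])
y_demo = 5.0 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=300)

tree_demo = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X_demo, y_demo)

# One importance value per feature, in the order of the training columns.
importances = pd.Series(tree_demo.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

With one-hot encoded data, the importance of a categorical variable is spread across its dummy columns, so it can help to sum the dummies per original variable before comparing.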

#### **Observations and insights: _____**

### **Random Forest**

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
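
A random forest averages many trees, each grown on a bootstrap sample, which usually reduces the variance of a single deep tree. A minimal sketch on synthetic data, with `n_estimators=100` chosen only for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic nonlinear target with noise.
rng = np.random.default_rng(5)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# random_state fixes the bootstrap samples so the run is reproducible.
forest_demo = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_tr, y_tr)

forest_test_r2 = forest_demo.score(X_te, y_te)
print(forest_demo.score(X_tr, y_tr), forest_test_r2)
```

The fitted forest exposes `feature_importances_` in the same way as a single tree.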

In [None]:
# import Randomforest for Regression from sklearn

In [None]:
# Create a Randomforest regression model 

In [None]:
# Fit Randomforest regression model.

In [None]:
# Get score of the model.

#### **Observations and insights: _____**

**Feature Importance**

In [None]:
# Print important features similar to decision trees

#### **Observations and insights: _____**

### **Hyperparameter Tuning: Decision Tree**

In [None]:
# Choose the type of regressor.
dtree_tuned = __________(random_state=1)

# Grid of parameters to choose from.
# Check the documentation for all the parameters that the model takes and experiment with them.
parameters = {________________}

# Type of scoring used to compare parameter combinations
scorer = _________

# Run the grid search
grid_obj = GridSearchCV(_____________)
grid_obj = grid_obj.fit(______________)

# Set the model to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
dtree_tuned.fit(____,____)
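
The steps in the cell above can be sketched end to end on synthetic data. The parameter grid, the `"r2"` scorer, and `cv=5` are illustrative assumptions, not recommended settings for the car data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear target with noise.
rng = np.random.default_rng(6)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)

# Hypothetical grid; every combination is evaluated with cross-validation.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

grid_demo = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid,
    scoring="r2",
    cv=5,
)
grid_demo.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already refit on all data passed to fit().
best_tree = grid_demo.best_estimator_
print(grid_demo.best_params_)
```

Since the search refits the best estimator by default, a further call to `fit` on the same training data repeats that step without changing the result.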

In [None]:
# Get score of the dtree_tuned

#### **Observations and insights: _____**

**Feature Importance**

In [None]:
# Print important features of tuned decision tree similar to decision trees

#### **Observations and insights: _____**

### **Hyperparameter Tuning: Random Forest**

In [None]:
# Choose the type of regressor.

# Define the parameters for the grid to choose from
# Check the documentation for all the parameters that the model takes and experiment with them

# Type of scoring used to compare parameter combinations

# Run the grid search

# Set the model to the best combination of parameters

# Fit the best algorithm to the data. 

In [None]:
# Get score of the model.

#### **Observations and insights: _____**

**Feature Importance**

In [None]:
# Print important features of the tuned random forest, similar to decision trees

#### **Observations and insights: ______**



In [None]:
# defining the list of models you have trained
models = [lr,dtree, __________________]

# defining empty lists to add train and test results
r2_train = []
r2_test = []
rmse_train= []
rmse_test= []

# looping through all the models to get the rmse and r2 scores
for model in models:
    # get train/test r2 and rmse without printing them
    j = get_model_score(model,False)
    r2_train.append(j[0])
    r2_test.append(j[1])
    rmse_train.append(j[2])
    rmse_test.append(j[3])

In [None]:
comparison_frame = pd.DataFrame({'Model':['Linear Regression','Decision Tree',___________,___________], 
                                          'Train_r2': r2_train,'Test_r2': r2_test,
                                          'Train_RMSE':rmse_train,'Test_RMSE':rmse_test}) 
comparison_frame

#### **Observations: _____**

**Note:** You can also try some other algorithms such as kNN and compare the model performance with the existing ones
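
As one possible starting point for kNN, here is a sketch on synthetic data. kNN is distance-based, so the features are standardized first; `n_neighbors=5` and the pipeline layout are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic nonlinear target with noise.
rng = np.random.default_rng(7)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# The scaler is fit on the training data only, then applied to the test data by the pipeline.
knn_demo = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_demo.fit(X_tr, y_tr)

knn_test_r2 = knn_demo.score(X_te, y_te)
print(knn_test_r2)
```

A pipeline fitted on `X_train` and `y_train['price_log']` exposes `predict`, so it could be passed to `get_model_score` like the other models.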

### **Insights**

#### **Refined insights**:
- What are the most meaningful insights from the data relevant to the problem?

#### **Comparison of various techniques and their relative performance**:
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

#### **Proposal for the final solution design**:
- What model do you propose to be adopted? Why is this the best solution to adopt?

#### **Key recommendations for implementation**:
- What are some key recommendations to implement the solutions? What should the implementation roadmap look like? What further analysis needs to be done or what other associated problems need to be solved?