<table align="center" width=100%>
    <tr>
        <td width="15%">
            <img src="GL-2.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=8px>
                    <b> Take-Home <br>(Day 1)
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

### Import the required libraries

In [1]:
# type your code here
# import 'Pandas' 
import pandas as pd 

# import 'Numpy' 
import numpy as np

# 'Statsmodels' is used to build and analyze various statistical models
import statsmodels
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.tools.eval_measures import rmse

# import 'train_test_split' from 'Scikit-learn' (sklearn) to split the data into train and test sets
from sklearn.model_selection import train_test_split

# to set the digits after decimal place 
pd.options.display.float_format = '{:.5f}'.format

# suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

#### Read the data

Load the csv file and set the first column as index

In [2]:
# type your code here
df_car = pd.read_csv("car_data.csv", index_col = 0)

# display the first two rows of the data
df_car.head(2)

Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0


Our objective is to predict the selling price of the cars in the data.

**The data definition is as follows:** <br><br>
**Car_Name:** name of the car <br>

**Year:** year in which the car was bought <br>

**Present_Price:** current ex-showroom price of the car (in lakhs)<br>

**Kms_Driven:** distance completed by the car in km <br>

**Fuel_Type:** fuel type of the car <br>

**Seller_Type:** defines whether the seller is a dealer or an individual<br>

**Transmission:** defines whether the car is manual or automatic <br>

**Owner:** defines the number of owners the car has previously had <br>

**Selling_Price:** price the owner wants to sell the car at (in lakhs) (response variable)

### Let's begin with some hands-on practice exercises

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>1. Build a full model and interpret the beta coefficients </b>
                </font>
            </div>
        </td>
    </tr>
</table>

        Hint: A full model is a model which includes all the features 

In [3]:
# type your code here
df_car_num = df_car.select_dtypes(include=np.number).drop(["Selling_Price"],axis=1)
df_car_cat = df_car.select_dtypes(include="object")
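# dtype=float keeps the dummies numeric, which sm.OLS requires on recent pandas versions
dummy_variables = pd.get_dummies(df_car_cat, drop_first=True, dtype=float)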

X = pd.concat([df_car_num, dummy_variables],axis=1)

y = df_car["Selling_Price"]
LM_model_full = sm.OLS(y, sm.add_constant(X)).fit()
print(LM_model_full.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.883
Model:                            OLS   Adj. R-squared:                  0.879
Method:                 Least Squares   F-statistic:                     274.3
Date:                Sun, 11 Sep 2022   Prob (F-statistic):          5.71e-131
Time:                        22:20:43   Log-Likelihood:                -593.62
No. Observations:                 301   AIC:                             1205.
Df Residuals:                     292   BIC:                             1239.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                   -789
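
As a minimal sketch of what `drop_first=True` does in the encoding step above, the toy frame below is one-hot encoded the same way. The rows are hypothetical and are not taken from `car_data.csv`. The first level of each column in sorted order gets no dummy column and acts as the baseline, so every dummy coefficient in the full model is read relative to that baseline.

```python
import pandas as pd

# hypothetical rows for illustration only; not taken from car_data.csv
toy = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

# drop_first=True removes the first level of each column (sorted order),
# which then serves as the reference category in the regression
toy_dummies = pd.get_dummies(toy, drop_first=True, dtype=float)

print(list(toy_dummies.columns))
print(toy_dummies)
```

In this sketch a CNG car with automatic transmission would have zeros in all three dummy columns, so its expected price is carried by the intercept and the numeric features alone.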

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>2. Is there multicollinearity present? If yes, which variables are involved in multicollinearity?    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [4]:
# type your code here
vif = pd.DataFrame()
df_numeric = df_car.select_dtypes(include=[np.number])
vif["Features"] = df_numeric.columns
vif["VIF"] = [variance_inflation_factor(df_numeric.values, i) for i in range(df_numeric.shape[1])]

vif

,Features,VIF
0,Year,2.78056
1,Selling_Price,9.35503
2,Present_Price,9.33909
3,Kms_Driven,2.21597
4,Owner,1.07427
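
The cell above passes every numeric column to `variance_inflation_factor`, including the response `Selling_Price`, and does not add a constant column. As a hedged sketch of the quantity itself, the NumPy code below computes VIF_j = 1 / (1 - R_j²) by regressing each predictor on the remaining predictors plus an intercept. The data are synthetic and `vif_by_hand` is a hypothetical helper, not part of statsmodels. A common rule of thumb treats values above roughly 5 to 10 as a sign of multicollinearity.

```python
import numpy as np

def vif_by_hand(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of a 2-D predictor array."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # regress column j on an intercept plus the remaining predictors
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# synthetic predictors: b is almost a copy of a, c is unrelated to both
rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two nearly identical columns get large VIFs while the unrelated column stays close to 1, which is the pattern to look for among the predictors of the car data.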


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>3. What is the impact of present price of the car and seller type on the selling price?
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [5]:
# type your code here
X = df_car[["Present_Price"]]
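# dtype=float keeps the dummy numeric, which sm.OLS requires on recent pandas versions
dummy_variable = pd.get_dummies(df_car["Seller_Type"], prefix="Seller", drop_first=True, dtype=float)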
X = pd.concat([X, dummy_variable],axis=1)
y = df_car["Selling_Price"]
MLR_full = sm.OLS(y, sm.add_constant(X)).fit()
print(MLR_full.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.786
Model:                            OLS   Adj. R-squared:                  0.785
Method:                 Least Squares   F-statistic:                     548.4
Date:                Sun, 11 Sep 2022   Prob (F-statistic):          1.34e-100
Time:                        22:22:00   Log-Likelihood:                -683.71
No. Observations:                 301   AIC:                             1373.
Df Residuals:                     298   BIC:                             1385.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 1.5423      0.26

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>4. Consider all the numeric features in the data. Do all of them significantly contribute to explaining the variation in the selling price?
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [6]:
# type your code here
X = df_car.select_dtypes(include=[np.number]).drop(["Selling_Price"],axis=1)

y = df_car["Selling_Price"]
LM_model_num = sm.OLS(y, sm.add_constant(X)).fit()

print(LM_model_num.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.852
Model:                            OLS   Adj. R-squared:                  0.850
Method:                 Least Squares   F-statistic:                     426.6
Date:                Sun, 11 Sep 2022   Prob (F-statistic):          1.66e-121
Time:                        22:23:36   Log-Likelihood:                -628.25
No. Observations:                 301   AIC:                             1267.
Df Residuals:                     296   BIC:                             1285.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          -937.7642     94.392     -9.935
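
Each `t` value in the summary is the coefficient divided by its standard error, and a feature contributes significantly at the 5% level when its `P>|t|` is below 0.05 (roughly |t| > 1.97 with 296 residual degrees of freedom). As a small worked check, the sketch below reproduces the `t` value of the `const` row from the two numbers printed next to it.

```python
# values read from the 'const' row of the summary above
coef = -937.7642
std_err = 94.392

# t statistic for the null hypothesis that the coefficient is zero
t_stat = coef / std_err
print(round(t_stat, 3))   # -9.935
```

The same division applies to every numeric feature; the ones whose p-value stays above 0.05 do not significantly help explain the variation in the selling price.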

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>5. In the model obtained in question 4, consider the interaction effect of the present price of the car and the year in which it was purchased. Compare the resulting model with the model obtained in the previous question and give your interpretation.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [7]:
# type your code here
X = df_car.select_dtypes(include=[np.number]).drop(["Selling_Price"],axis=1)
X['Price*Year'] = df_car['Present_Price']*df_car['Year'] 
y = df_car["Selling_Price"]

LM_model_interaction = sm.OLS(y, sm.add_constant(X)).fit()

print(LM_model_interaction.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.963
Model:                            OLS   Adj. R-squared:                  0.963
Method:                 Least Squares   F-statistic:                     1546.
Date:                Sun, 11 Sep 2022   Prob (F-statistic):          3.05e-209
Time:                        22:24:07   Log-Likelihood:                -418.79
No. Observations:                 301   AIC:                             849.6
Df Residuals:                     295   BIC:                             871.8
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const           101.2676     58.597      1.728
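
Because the question 4 model is nested in the question 5 model, one hedged way to compare them is a partial F-test built from the two R-squared values. The sketch below uses the rounded figures printed in the two summaries, so the result is only approximate.

```python
# figures read from the two summaries (rounded to three decimals there)
r2_reduced = 0.852     # question 4: numeric features only
r2_full = 0.963        # question 5: numeric features plus Price*Year
df_resid_full = 295    # residual degrees of freedom of the larger model
extra_terms = 1        # one regressor was added

# partial F statistic for the added interaction term
f_partial = ((r2_full - r2_reduced) / extra_terms) / ((1 - r2_full) / df_resid_full)
print(round(f_partial, 1))
```

An F value in the hundreds is far above the 5% critical value of about 3.9 for (1, 295) degrees of freedom. This agrees with the rise in adjusted R-squared from 0.850 to 0.963 and the drop in AIC from 1267 to 849.6, so the interaction term improves the model substantially.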

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>6. What is the impact of fuel type of cars on the selling price? 
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [8]:
# type your code here
X = df_car["Fuel_Type"]
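# dtype=float keeps the dummies numeric, which sm.OLS requires on recent pandas versions
X = pd.get_dummies(X, drop_first=True, dtype=float)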
y = df_car["Selling_Price"]
LM_model = sm.OLS(y, sm.add_constant(X)).fit()

print(LM_model.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.305
Model:                            OLS   Adj. R-squared:                  0.300
Method:                 Least Squares   F-statistic:                     65.41
Date:                Sun, 11 Sep 2022   Prob (F-statistic):           2.80e-24
Time:                        22:24:38   Log-Likelihood:                -861.21
No. Observations:                 301   AIC:                             1728.
Df Residuals:                     298   BIC:                             1740.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.1000      3.006      1.031      0.3
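
When the only regressors are dummies for one categorical feature, the intercept equals the mean response of the dropped baseline level and each dummy coefficient equals the difference between that level's mean and the baseline mean. The NumPy sketch below illustrates this with made-up prices for three hypothetical fuel types; none of the values come from the car data.

```python
import numpy as np

# hypothetical selling prices per fuel type; not the real data
prices = {
    "CNG": np.array([2.0, 3.0]),
    "Diesel": np.array([9.5, 10.5, 11.0]),
    "Petrol": np.array([3.0, 4.0, 5.0, 4.0]),
}

y = np.concatenate(list(prices.values()))
n_cng, n_diesel, n_petrol = (len(v) for v in prices.values())

# design matrix: intercept, Diesel dummy, Petrol dummy (CNG is the baseline)
diesel = np.r_[np.zeros(n_cng), np.ones(n_diesel), np.zeros(n_petrol)]
petrol = np.r_[np.zeros(n_cng), np.zeros(n_diesel), np.ones(n_petrol)]
X = np.column_stack([np.ones(len(y)), diesel, petrol])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print("intercept      :", beta[0].round(4), "| baseline mean  :", prices["CNG"].mean())
print("Diesel effect  :", beta[1].round(4), "| mean difference:",
      (prices["Diesel"].mean() - prices["CNG"].mean()).round(4))
print("Petrol effect  :", beta[2].round(4), "| mean difference:",
      (prices["Petrol"].mean() - prices["CNG"].mean()).round(4))
```

Read this way, the `const` of 3.1000 in the summary above is the average selling price of the baseline fuel type, and the two fuel-type coefficients measure how far the other fuel types sit from it on average.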

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>7. Does the model significantly explain variation in the target variable? Justify your answer with an analysis of variance (ANOVA)
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

            Regress the selling price over the transmission.
            
            Selling_Price ~ Transmission

In [9]:
# type your code here
# regress the selling price on the transmission type, as stated above
X = df_car["Transmission"]
X = pd.get_dummies(X, drop_first=True, dtype=float)
y = df_car["Selling_Price"]

LM_model = sm.OLS(y, sm.add_constant(X)).fit()
print(LM_model.summary())
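
One possible way to produce the analysis of variance table for `Selling_Price ~ Transmission` is the formula interface with `anova_lm`, both of which are imported at the top of the notebook. The sketch below runs on a synthetic frame named `toy` with made-up prices, since the real `car_data.csv` is not reproduced here; with the real data, `data=df_car` would take its place.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# synthetic stand-in for the car data: 30 manual and 30 automatic cars
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Transmission": ["Manual"] * 30 + ["Automatic"] * 30})
toy["Selling_Price"] = (
    np.where(toy["Transmission"] == "Automatic", 9.0, 4.0)
    + rng.normal(0, 1.5, size=len(toy))
)

# fit the one-factor model and build its ANOVA table
model = ols("Selling_Price ~ Transmission", data=toy).fit()
anova_table = anova_lm(model)
print(anova_table)

# the model explains a significant share of the variation when PR(>F) < 0.05
p_value = anova_table.loc["Transmission", "PR(>F)"]
print("significant at 5%:", p_value < 0.05)
```

With a single factor, the F statistic in the ANOVA table is the same as the overall F-statistic of the regression summary, so either output can be used to justify the answer.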

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>8. Regress the selling price over the present price. Compare the 99% and 95% confidence interval of present price of a car
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [10]:
# type your code here
X = df_car["Present_Price"]
y = df_car["Selling_Price"]
LM_model = sm.OLS(y, sm.add_constant(X)).fit() 
print("The 99% CI is: \n", LM_model.conf_int(0.01)[1:])

print("\n\n")

print("The 95% CI is: \n", LM_model.conf_int(0.05)[1:])

The 99% CI is: 
                     0       1
Present_Price 0.47481 0.55889



The 95% CI is: 
                     0       1
Present_Price 0.48494 0.54876
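
Both intervals have the form estimate ± t × standard error, so they share the same centre and differ only in the t multiplier. As a quick check on the printed numbers, the sketch below recovers the centre and half-width of each interval. The ratio of the half-widths should match the ratio of the two t multipliers, about 2.59 to 1.97, assuming 299 residual degrees of freedom (301 observations and two estimated parameters).

```python
# intervals for the Present_Price coefficient, copied from the output above
ci_99 = (0.47481, 0.55889)
ci_95 = (0.48494, 0.54876)

def centre_and_half_width(ci):
    """Return the midpoint and half-width of a (low, high) interval."""
    low, high = ci
    return (low + high) / 2, (high - low) / 2

centre_99, half_99 = centre_and_half_width(ci_99)
centre_95, half_95 = centre_and_half_width(ci_95)

print("centres     :", round(centre_99, 5), round(centre_95, 5))
print("half-widths :", round(half_99, 5), round(half_95, 5))
print("width ratio :", round(half_99 / half_95, 3))
```

The 99% interval is wider because a higher confidence level requires a larger multiplier; both intervals exclude zero, so the present price is a significant predictor at either level.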


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                        <b>9. Verify the statement: The sum of the residuals in any regression model that contains an intercept β<sub>0</sub> is always zero
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

        To verify the result, we will fit a regression model of 'Selling_Price' on 'Present_Price' 

In [11]:
# type your code here
X = df_car["Present_Price"]
y = df_car["Selling_Price"]
LM_model = sm.OLS(y, sm.add_constant(X)).fit()
resid_sum = LM_model.resid.sum()

round(resid_sum, 10)

-0.0
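
The result follows from the normal equations Xᵀ(y − Xβ̂) = 0: the row that belongs to the constant column states that the residuals sum to zero. As an illustration, the NumPy sketch below fits made-up data with and without an intercept column; only the fit with the intercept is guaranteed a zero residual sum.

```python
import numpy as np

# made-up values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.1, 8.9, 11.2, 12.8, 15.1])

# least squares with an intercept column
X_with = np.column_stack([np.ones_like(x), x])
beta_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)
resid_with = y - X_with @ beta_with

# least squares through the origin (no intercept column)
X_without = x.reshape(-1, 1)
beta_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)
resid_without = y - X_without @ beta_without

print("sum of residuals with intercept   :", round(resid_with.sum(), 10))
print("sum of residuals without intercept:", round(resid_without.sum(), 4))
```

The tiny non-zero value that can appear in the first case, like the `-0.0` above, is floating-point rounding rather than a violation of the statement.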

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>10. Consider two models as specified below. Compare the performance of the models
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

                First model:
        
        Selling_Price ~ Year + Present_Price + Kms_Driven + Owner + Fuel_Type + Seller_Type + Transmission
        
        
                Second model:
        
        Selling_Price ~ Year + Present_Price + Kms_Driven + Owner 

In [12]:
# type your code here
df_car_num = df_car.select_dtypes(include=np.number).drop(["Selling_Price"],axis=1)
df_car_cat = df_car.select_dtypes(include="object")
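# dtype=float keeps the dummies numeric, which sm.OLS requires on recent pandas versions
dummy_variables = pd.get_dummies(df_car_cat, drop_first=True, dtype=float)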

X = pd.concat([df_car_num, dummy_variables],axis=1)

X.insert(loc = 0, column = 'intercept',value = np.ones(X.shape[0]))


y = df_car["Selling_Price"]

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,test_size = 0.3)


print('X_train', X_train.shape)
print('y_train', y_train.shape)

print('X_test', X_test.shape)
print('y_test', y_test.shape)

X_train (210, 9)
y_train (210,)
X_test (91, 9)
y_test (91,)


In [14]:
MLR_full = sm.OLS(y_train, X_train).fit()
print(MLR_full.summary())

                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.884
Model:                            OLS   Adj. R-squared:                  0.879
Method:                 Least Squares   F-statistic:                     191.2
Date:                Sun, 11 Sep 2022   Prob (F-statistic):           1.35e-89
Time:                        22:28:04   Log-Likelihood:                -423.34
No. Observations:                 210   AIC:                             864.7
Df Residuals:                     201   BIC:                             894.8
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
intercept               -844

In [15]:
y_pred = MLR_full.predict(X_test)

In [16]:
cols = ['Model', 'R-Squared', 'Adj. R-Squared',  'RMSE']

result_tabulation = pd.DataFrame(columns = cols)

linreg_full_model = pd.Series({'Model': "Linreg full model",
                           'R-Squared': MLR_full.rsquared,
                      'Adj. R-Squared': MLR_full.rsquared_adj ,
                                'RMSE': rmse(y_test, y_pred)
                   })


# DataFrame.append was removed in pandas 2.0; concatenate the new row instead
result_tabulation = pd.concat([result_tabulation, linreg_full_model.to_frame().T], ignore_index = True)
result_tabulation

,Model,R-Squared,Adj. R-Squared,RMSE
0,Linreg full model,0.88385,0.87923,1.66717
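
The RMSE column is the square root of the mean squared difference between the actual and the predicted test prices, expressed in the same unit as the target (lakhs). A minimal sketch of that calculation follows; `rmse_by_hand` is a hypothetical helper and the prices are made up.

```python
import math

def rmse_by_hand(actual, predicted):
    """Square root of the mean squared difference between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up prices in lakhs, for illustration only
actual = [3.0, 5.0, 7.5, 1.0]
predicted = [2.5, 5.5, 7.0, 2.0]

toy_rmse = rmse_by_hand(actual, predicted)
print(round(toy_rmse, 4))
```

Because the errors are squared before averaging, a single large miss (the last car in this toy example) weighs more heavily than several small ones.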


In [17]:
X_train = X_train.drop(['Fuel_Type_Diesel', 'Fuel_Type_Petrol', 'Seller_Type_Individual','Transmission_Manual'],axis=1)
MLR_num = sm.OLS(y_train, X_train).fit()
print(MLR_num.summary())


                            OLS Regression Results                            
Dep. Variable:          Selling_Price   R-squared:                       0.853
Model:                            OLS   Adj. R-squared:                  0.850
Method:                 Least Squares   F-statistic:                     298.1
Date:                Sun, 11 Sep 2022   Prob (F-statistic):           3.23e-84
Time:                        22:29:41   Log-Likelihood:                -447.86
No. Observations:                 210   AIC:                             905.7
Df Residuals:                     205   BIC:                             922.5
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
intercept      -996.2108    116.383     -8.560

In [18]:
X_test = X_test.drop(['Fuel_Type_Diesel', 'Fuel_Type_Petrol', 'Seller_Type_Individual','Transmission_Manual'],axis=1)


y_pred = MLR_num.predict(X_test)

In [19]:
linreg_num_model = pd.Series({'Model': "Linreg numeric model",
                          'R-Squared': MLR_num.rsquared,
                     'Adj. R-Squared': MLR_num.rsquared_adj ,
                               'RMSE': rmse(y_test, y_pred)
                   })

# DataFrame.append was removed in pandas 2.0; concatenate the new row instead
result_tabulation = pd.concat([result_tabulation, linreg_num_model.to_frame().T], ignore_index = True)

result_tabulation

,Model,R-Squared,Adj. R-Squared,RMSE
0,Linreg full model,0.88385,0.87923,1.66717
1,Linreg numeric model,0.85329,0.85043,1.87571
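
Adjusted R-squared penalises R-squared for the number of predictors: adj. R² = 1 − (1 − R²)(n − 1)/(n − p − 1). As a worked check, the sketch below plugs in the R-squared values from the table, the 210 training rows reported by the split, and 8 or 4 predictors (the `Df Model` values in the two summaries). `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors regressors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# R-squared values from the table above; 210 training rows per the split output
adj_full = adjusted_r2(0.88385, n_obs=210, n_predictors=8)
adj_num = adjusted_r2(0.85329, n_obs=210, n_predictors=4)

print(round(adj_full, 5), round(adj_num, 5))   # 0.87923 0.85043
```

Both values match the table. The full model keeps the higher adjusted R-squared even after the penalty for its four extra dummy variables, and it also has the lower test RMSE (1.66717 against 1.87571), so on this split it performs better than the numeric-only model.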
