
# Using the Cars93 Dataset for an ANOVA

First, we load the Cars93 dataset from R's MASS package using statsmodels. Then we take a quick look at the dataset to determine next steps.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

cars = sm.datasets.get_rdataset("Cars93", "MASS")
carsdframe = pd.DataFrame(cars.data)
carsdframe



| | Manufacturer | Model | Type | Min.Price | Price | Max.Price | MPG.city | MPG.highway | AirBags | DriveTrain | ... | Passengers | Length | Wheelbase | Width | Turn.circle | Rear.seat.room | Luggage.room | Weight | Origin | Make |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Acura | Integra | Small | 12.9 | 15.9 | 18.8 | 25 | 31 | None | Front | ... | 5 | 177 | 102 | 68 | 37 | 26.5 | 11.0 | 2705 | non-USA | Acura Integra |
| 1 | Acura | Legend | Midsize | 29.2 | 33.9 | 38.7 | 18 | 25 | Driver & Passenger | Front | ... | 5 | 195 | 115 | 71 | 38 | 30.0 | 15.0 | 3560 | non-USA | Acura Legend |
| 2 | Audi | 90 | Compact | 25.9 | 29.1 | 32.3 | 20 | 26 | Driver only | Front | ... | 5 | 180 | 102 | 67 | 37 | 28.0 | 14.0 | 3375 | non-USA | Audi 90 |
| 3 | Audi | 100 | Midsize | 30.8 | 37.7 | 44.6 | 19 | 26 | Driver & Passenger | Front | ... | 6 | 193 | 106 | 70 | 37 | 31.0 | 17.0 | 3405 | non-USA | Audi 100 |
| 4 | BMW | 535i | Midsize | 23.7 | 30.0 | 36.2 | 22 | 30 | Driver only | Rear | ... | 4 | 186 | 109 | 69 | 39 | 27.0 | 13.0 | 3640 | non-USA | BMW 535i |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 88 | Volkswagen | Eurovan | Van | 16.6 | 19.7 | 22.7 | 17 | 21 | None | Front | ... | 7 | 187 | 115 | 72 | 38 | 34.0 | NaN | 3960 | non-USA | Volkswagen Eurovan |
| 89 | Volkswagen | Passat | Compact | 17.6 | 20.0 | 22.4 | 21 | 30 | None | Front | ... | 5 | 180 | 103 | 67 | 35 | 31.5 | 14.0 | 2985 | non-USA | Volkswagen Passat |
| 90 | Volkswagen | Corrado | Sporty | 22.9 | 23.3 | 23.7 | 18 | 25 | None | Front | ... | 4 | 159 | 97 | 66 | 36 | 26.0 | 15.0 | 2810 | non-USA | Volkswagen Corrado |
| 91 | Volvo | 240 | Compact | 21.8 | 22.7 | 23.5 | 21 | 28 | Driver only | Rear | ... | 5 | 190 | 104 | 67 | 37 | 29.5 | 14.0 | 2985 | non-USA | Volvo 240 |


After looking at the dataset, we can see that several columns are categorical. In our ANOVA model, we will see how car type affects the price of a car. Note that `Price` is recorded in thousands of US dollars, so multiply it by 1,000 to get the actual price. Now let's see how many levels the `Type` column has.

In [2]:
carsdframe['Type'].unique()

array(['Small', 'Midsize', 'Compact', 'Large', 'Sporty', 'Van'],
      dtype=object)

Let's extract our X and y from the data frame. We also need to dummy-encode the categorical column so that we can use it in our OLS. Pandas has a simple `get_dummies` function for this.

In [3]:
Xtest = carsdframe[['Type']]
ytest = carsdframe[['Price']]


pd.get_dummies(data=Xtest, drop_first=True), ytest

(    Type_Large  Type_Midsize  Type_Small  Type_Sporty  Type_Van
 0            0             0           1            0         0
 1            0             1           0            0         0
 2            0             0           0            0         0
 3            0             1           0            0         0
 4            0             1           0            0         0
 ..         ...           ...         ...          ...       ...
 88           0             0           0            0         1
 89           0             0           0            0         0
 90           0             0           0            1         0
 91           0             0           0            0         0
 92           0             1           0            0         0
 
 [93 rows x 5 columns],     Price
 0    15.9
 1    33.9
 2    29.1
 3    37.7
 4    30.0
 ..    ...
 88   19.7
 89   20.0
 90   23.3
 91   22.7
 92   26.7
 
 [93 rows x 1 columns])
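
As a small illustration of what `drop_first=True` does, here is a sketch on a made-up three-row frame. The `toy` data is hypothetical and not taken from Cars93; `dtype=int` is passed only so the indicator columns print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data: three cars, one per body type
toy = pd.DataFrame({"Type": ["Small", "Compact", "Van"]})

# Full encoding: one indicator column per level
full = pd.get_dummies(toy, dtype=int)

# drop_first=True removes the first level in sorted order ("Compact"),
# which then acts as the reference category
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))     # all three levels
print(list(reduced.columns))  # the "Type_Compact" column is gone
print(reduced)
```

In `reduced`, the Compact row is all zeros, which is how a reference level is represented in the design matrix.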

Our dimensions are now fit for an OLS model. Keep in mind that the Compact level does not appear in the dummy matrix above, so it is our reference category. I will come back to this point once we perform our OLS. We will now use the `LinearRegression` object from scikit-learn.

In [4]:
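# dtype=int keeps the dummies as 0/1 integers; newer pandas versions default to booleans
Xtest = pd.get_dummies(data=Xtest, drop_first=True, dtype=int)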
CarsAnova = LinearRegression()
CarsAnova.fit(Xtest, ytest)
print(CarsAnova.intercept_)
print(CarsAnova.coef_)

[18.2125]
[[ 6.0875      9.00568182 -8.04583333  1.18035714  0.8875    ]]
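
One way to read an intercept and coefficients like the ones printed above: when a reference level is dropped, the intercept is the mean of the reference group and each dummy coefficient is the difference between that group's mean and the reference mean. The sketch below checks this on made-up numbers (the prices and the two groups are hypothetical, not Cars93 values).

```python
import numpy as np

# Hypothetical prices (in $1,000s) for two groups; group A is the reference level
prices = np.array([10.0, 12.0, 14.0, 20.0, 22.0, 24.0])
is_b = np.array([0, 0, 0, 1, 1, 1])  # dummy column: 1 if the car is in group B

# Design matrix: intercept column plus the single dummy
X = np.column_stack((np.ones(len(prices)), is_b))

# Least-squares fit
coef, *_ = np.linalg.lstsq(X, prices, rcond=None)
intercept, beta_b = coef

mean_a = prices[is_b == 0].mean()
mean_b = prices[is_b == 1].mean()

print(intercept, mean_a)        # intercept vs. reference-group mean
print(beta_b, mean_b - mean_a)  # coefficient vs. difference in group means
```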


The average price of a Compact car is about 18.2K, because the intercept is the mean of the reference category. We can also see that Midsize has the largest coefficient, so being a Midsize car adds the most value relative to Compact. In the following section, we will attempt to perform an OLS manually using simple matrix operations. scikit-learn's `LinearRegression` automatically fits an intercept. To do it manually, we have to add an intercept column to our Type matrix.

In [5]:
# A manual OLS needs an explicit intercept column. There are two easy ways to add one:
# the first (commented out) produces a NumPy matrix; the second keeps the pandas DataFrame.

#intcpt = np.ones(len(ytest))
#Xtest_mod = np.column_stack((intcpt, Xtest))

Xtest.insert(0, 'Intercept', 1)

# Convert the DataFrames to NumPy arrays so that we can use NumPy matrix operations
Xtest2 = Xtest.to_numpy()
ytest2 = ytest.to_numpy()
XTX = Xtest2.transpose() @ Xtest2
# ytest is still a DataFrame here, so XTY and betas come back as DataFrames
XTY = Xtest2.transpose() @ ytest
betas = np.linalg.inv(XTX) @ XTY
betas



| | Price |
|---|---|
| 0 | 18.2125 |
| 1 | 6.0875 |
| 2 | 9.005682 |
| 3 | -8.045833 |
| 4 | 1.180357 |
| 5 | 0.8875 |
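
The matrix steps in the cell above are the normal equations: invert X'X and multiply by X'y. As a hedged sketch on synthetic data (the design matrix, `true_beta`, and the noise level are all made up for illustration), the same coefficients can also be obtained with `np.linalg.solve` or `np.linalg.lstsq`, which avoid forming the inverse explicitly and are usually the safer choice numerically.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic design matrix: intercept column plus two made-up predictors
n = 50
X = np.column_stack((np.ones(n), rng.normal(size=(n, 2))))
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Normal equations with an explicit inverse, mirroring the manual approach
beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Same linear system solved without forming the inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works directly on X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```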


As you can see, we got exactly the same values as when we performed OLS with scikit-learn's `LinearRegression`. Now I will calculate the predictions (yhat) and the errors.

In [6]:
yhat = Xtest2 @ betas

#creating numpy version of predictions to do more calculations later
yhat2 = yhat.to_numpy()

In [7]:
carsanova_error = ytest2 - yhat2
carsanova_error[0:10]

array([[  5.73333333],
       [  6.68181818],
       [ 10.8875    ],
       [ 10.48181818],
       [  2.78181818],
       [-11.51818182],
       [ -3.5       ],
       [ -0.6       ],
       [ -0.91818182],
       [ 10.4       ]])

In this section, we will calculate R^2, a common metric that helps us determine how well our model has performed, along with SSE, MSE, and RMSE.

In [8]:
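# R^2 as the share of the variance in price that the model explains
R_sq = (np.var(ytest2) - np.var(carsanova_error)) / np.var(ytest2)
SSE = np.sum(np.square(carsanova_error))
# Note: n - 2 is the residual degrees of freedom for a one-predictor model.
# This model fits 6 parameters (intercept + 5 dummies), so n - 6 would be the unbiased choice.
MSE = 1/(len(Xtest2) - 2) * SSE
RMSE = np.sqrt(MSE)
print('RSQUARED: {} \nRMSE: {} \nSSE: {} \nMSE: {}'.format(R_sq, RMSE, SSE, MSE))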

RSQUARED: 0.3985818528346552 
RMSE: 7.532045954448933 
SSE: 5162.586179653679 
MSE: 56.73171625993054
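
The cell above divides SSE by n - 2. As a rough sketch of how the choice of denominator matters, the snippet below recomputes MSE from the printed SSE using the residual degrees of freedom n - p. The parameter count `p = 6` (intercept plus five Type dummies) is my reading of the model, not something the notebook states.

```python
import math

# SSE copied from the printed output; n is the number of rows in the data frame
sse = 5162.586179653679
n = 93
p = 6  # assumed parameter count: intercept + 5 Type dummies

mse_n2 = sse / (n - 2)   # the denominator used in the notebook cell
mse_dof = sse / (n - p)  # residual degrees of freedom for this model
rmse_dof = math.sqrt(mse_dof)

print(round(mse_n2, 2), round(mse_dof, 2), round(rmse_dof, 2))
```

The two estimates differ only modestly here because p is small relative to n, but the gap grows as more dummy columns are added.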


With just the `Type` attribute, the model explains about 40% of the variance in price. I will now attempt an OLS with two categorical attributes. Let's see how price is affected by car type and airbags. As we can see below, the airbag column has 3 levels, so with `drop_first=True` it contributes 2 dummy columns (N x 2) to the design matrix.

In [9]:
carsdframe['AirBags'].value_counts()

Driver only           43
None                  34
Driver & Passenger    16
Name: AirBags, dtype: int64

In [10]:
xtest3 = pd.get_dummies(data= carsdframe[['AirBags', 'Type']], drop_first=True)
#xtest3.insert(0, 'Intercept', [1 for i in range(len(xtest3))])
xtest3

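| | AirBags_Driver only | AirBags_None | Type_Large | Type_Midsize | Type_Small | Type_Sporty | Type_Van |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 88 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 89 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 90 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 91 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |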


In [11]:
ytest3 = carsdframe['Price']
Carsonova2 = LinearRegression()
Carsonova2.fit(xtest3, ytest3)

# Predict first so that every metric uses this model's predictions
yhat3 = Carsonova2.predict(xtest3)
SSE2 = np.sum(np.square(ytest3 - yhat3))
MSE2 = mean_squared_error(y_true=ytest3, y_pred=yhat3)
RMSE2 = np.sqrt(MSE2)
R_sq2 = r2_score(ytest3, yhat3)

# mean_squared_error divides SSE by n; a degrees-of-freedom-adjusted version
# would be SSE2 / (len(xtest3) - 8), counting the intercept and 7 dummies


print(xtest3.columns)
print('Y intercept:  {} \nBETAS: {}\n\n\n'.format(Carsonova2.intercept_, Carsonova2.coef_))
print('ERROR METRICS \n\nSSE: {:.4f} \nMSE: {:.4f}\nRMSE: {:.4f}\nRSQ: {:.4f}'.format(SSE2, MSE2, RMSE2, R_sq2))



Index(['AirBags_Driver only', 'AirBags_None', 'Type_Large', 'Type_Midsize',
       'Type_Small', 'Type_Sporty', 'Type_Van'],
      dtype='object')
Y intercept:  24.283278239050148 
BETAS: [ -5.15216212 -10.15259856   3.29537038   7.35691165  -5.15459312
   0.22922837   3.30250817]



ERROR METRICS 

SSE: 4356.4098 
MSE: 46.8431
RMSE: 6.8442
RSQ: 0.4925


Notice that our R-squared rose from about 0.40 to 0.49, roughly 9 percentage points, when we added one more attribute.
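
Because R-squared cannot decrease when predictors are added to an OLS model, a common companion check is adjusted R-squared, which penalizes the extra parameters. The sketch below applies the standard formula to the two R-squared values printed earlier. The helper `adjusted_r2` and the parameter counts (intercept plus dummy columns) are my additions, not part of the original analysis.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p fitted parameters (including the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

n = 93  # rows in the data frame

# R^2 values copied from the printed outputs; assumed parameter counts are
# 1 + 5 (Type only) and 1 + 7 (Type + AirBags)
adj_type_only = adjusted_r2(0.3985818528346552, n, p=6)
adj_type_airbags = adjusted_r2(0.4924977895569449, n, p=8)

print(round(adj_type_only, 3), round(adj_type_airbags, 3))
```

If the adjusted value still rises after adding AirBags, that suggests the improvement is not just an artifact of having more columns.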

# ANCOVA
In this section, we will have at least one continuous attribute and at least one categorical attribute, so our OLS will be an analysis of covariance (ANCOVA). Let's look at horsepower and airbags.

In [18]:
xtest5 = pd.get_dummies(carsdframe[['AirBags', 'Horsepower']], drop_first=True)
print(xtest5.columns)

ytest5 = carsdframe['Price']

ANCOVA_model = LinearRegression(fit_intercept=True)
ANCOVA_model.fit(xtest5, ytest5)
inctp4 = ANCOVA_model.intercept_
betas4 = ANCOVA_model.coef_

print('Intercept: {} \ncoefficients: {}'.format(inctp4, betas4))

Index(['Horsepower', 'AirBags_Driver only', 'AirBags_None'], dtype='object')
Intercept: 5.590568890909495 
coefficients: [ 0.12375243 -3.15178032 -6.62672993]


In [29]:
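# Predict the price of a 240-horsepower car with no airbags.
# Building the row as a DataFrame with the fitted column names avoids the feature-name warning.
x_star = pd.DataFrame([[240, 0, 1]], columns=ANCOVA_model.feature_names_in_)
ANCOVA_model.predict(x_star)

array([28.66442146])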
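
To see where a prediction like this comes from, the sketch below rebuilds it by hand from the printed intercept and coefficients. The function `predict_price` is a hypothetical helper, and the coefficients are the rounded values from the output, so the result matches only to a few decimals.

```python
# Coefficients copied (rounded) from the printed model output
intercept = 5.590568890909495
b_horsepower = 0.12375243
b_driver_only = -3.15178032
b_none = -6.62672993

def predict_price(horsepower, driver_only, none):
    """Linear predictor: intercept + slope * horsepower + airbag-level offsets."""
    return (intercept + b_horsepower * horsepower
            + b_driver_only * driver_only + b_none * none)

# 240 hp with no airbags, the same inputs as x_star
print(predict_price(240, 0, 1))

# Same horsepower, different airbag levels: only the offset changes
print(predict_price(240, 0, 0) - predict_price(240, 0, 1))
```

This is the ANCOVA structure in miniature: every airbag level shares the same horsepower slope, and the dummy coefficients shift the line up or down.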

In [13]:
ytest5 = carsdframe['Price']
# Add an explicit intercept column, as in the manual OLS earlier
xtest5.insert(0, 'Intercept', 1)
# xtest5mod = xtest5.to_numpy()
# ytest5mod = ytest5.to_numpy()

In [14]:
print(xtest5)

    Intercept  Horsepower  AirBags_Driver only  AirBags_None
0           1         140                    0             1
1           1         200                    0             0
2           1         172                    1             0
3           1         172                    0             0
4           1         208                    1             0
..        ...         ...                  ...           ...
88          1         109                    0             1
89          1         134                    0             1
90          1         178                    0             1
91          1         114                    1             0
92          1         168                    0             0

[93 rows x 4 columns]
