# Multivariate Regression

Let's grab a small data set of Blue Book car values:

In [1]:
import pandas as pd

df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')


In [2]:
df.head()

Unnamed: 0,Price,Mileage,Make,Model,Trim,Type,Cylinder,Liter,Doors,Cruise,Sound,Leather
0,17314.103129,8221,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,1
1,17542.036083,9135,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
2,16218.847862,13196,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,1,0
3,16336.91314,16342,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,0
4,16339.170324,19832,Buick,Century,Sedan 4D,Sedan,6,3.1,4,1,0,1


We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note how we are avoiding the make and model; regressions don't work well with categorical values, unless you can convert them into numbers in some way that makes sense.
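One common way to bring a column like Make into a regression is one-hot encoding. The sketch below is only an illustration: the four rows are made up rather than read from cars.xls, and whether the extra columns would actually help this price model is not something tested here.

```python
import pandas as pd

# Tiny illustrative frame; these rows are made up, not read from cars.xls.
cars = pd.DataFrame({
    'Make': ['Buick', 'Cadillac', 'Buick', 'Saab'],
    'Mileage': [8221, 12000, 19832, 15000],
})

# One 0/1 indicator column per make, so no artificial ordering is imposed.
encoded = pd.get_dummies(cars, columns=['Make'], dtype=int)
print(encoded)
```

Each make gets its own column, so the model never has to pretend that one make is "greater than" another.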

Let's standardize our feature data (zero mean, unit variance) so everything is on the same scale and we can easily compare the coefficients we end up with.
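For intuition, here is roughly what StandardScaler does to a single column, sketched by hand in NumPy: subtract the mean, then divide by the population standard deviation. The five mileages are the ones shown by df.head() above; because the real scaler is fit on every row, these z-scores will not match the values the notebook prints.

```python
import numpy as np

# Mileage values from the five rows shown by df.head().
mileage = np.array([8221, 9135, 13196, 16342, 19832], dtype=float)

# Standardize: subtract the mean, divide by the population standard deviation
# (np.std uses ddof=0 by default).
z = (mileage - mileage.mean()) / mileage.std()

print(z)
print(z.mean())  # very close to 0
print(z.std())   # very close to 1
```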

In [7]:
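import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

# Copy the feature columns so the scaling below doesn't write into a slice of df
X = df[['Mileage', 'Cylinder', 'Liter']].copy()
y = df['Price']

# Standardize each feature; .values replaces as_matrix(), which newer pandas no longer has
X[['Mileage', 'Cylinder', 'Liter']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Liter']].values)

print(X)

# Note: no intercept column is added to X here
est = sm.OLS(y, X).fit()

est.summary()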

      Mileage  Cylinder     Liter
0   -1.417485  0.527410  0.056736
1   -1.305902  0.527410  0.056736
2   -0.810128  0.527410  0.056736
3   -0.426058  0.527410  0.056736
4    0.000008  0.527410  0.056736
5    0.293493  0.527410  0.056736
6    0.335001  0.527410  0.056736
7    0.382369  0.527410  0.056736
8    0.511409  0.527410  0.056736
9    0.914768  0.527410  0.056736
10  -1.171368  0.527410  0.509277
11  -0.581834  0.527410  0.509277
12  -0.390532  0.527410  0.509277
13  -0.003899  0.527410  0.509277
14   0.430591  0.527410  0.509277
15   0.480156  0.527410  0.509277
16   0.509822  0.527410  0.509277
17   0.757160  0.527410  0.509277
18   1.594886  0.527410  0.509277
19   1.810849  0.527410  0.509277
20  -1.326046  0.527410  0.509277
21  -1.129860  0.527410  0.509277
22  -0.667658  0.527410  0.509277
23  -0.405792  0.527410  0.509277
24  -0.112796  0.527410  0.509277
25  -0.044552  0.527410  0.509277
26   0.190700  0.527410  0.509277
27   0.337442  0.527410  0.509277
28   0.566102 



Dep. Variable:,Price,R-squared:,0.06
Model:,OLS,Adj. R-squared:,0.057
Method:,Least Squares,F-statistic:,17.16
Date:,"Thu, 20 Dec 2018",Prob (F-statistic):,8.29e-11
Time:,13:24:42,Log-Likelihood:,-9208.5
No. Observations:,804,AIC:,18420.0
Df Residuals:,801,BIC:,18440.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Mileage,-1264.9827,806.301,-1.569,0.117,-2847.696,317.730
Cylinder,3949.1399,2807.814,1.406,0.160,-1562.403,9460.683
Liter,1707.3093,2807.083,0.608,0.543,-3802.798,7217.417

Omnibus:,214.158,Durbin-Watson:,0.009
Prob(Omnibus):,0.0,Jarque-Bera (JB):,444.825
Skew:,1.499,Prob(JB):,2.56e-97
Kurtosis:,5.071,Cond. No.,6.83


In [8]:
y.groupby(df.Liter).mean()

Liter
1.6    10752.833305
1.8    15881.386094
2.0    29968.972727
2.2    13441.277078
2.3    29288.283553
2.5    24960.948265
2.8    30455.144774
3.0    16550.926211
3.1    15989.528107
3.4    16238.093335
3.5    17788.263153
3.6    26150.134403
3.8    20158.316888
4.6    39535.972594
5.7    37076.585744
6.0    39155.712375
Name: Price, dtype: float64

Surprisingly, a bigger engine does not always mean a higher price! The 2.0-liter cars average close to $30,000, while the 3.0- to 3.5-liter cars average only about $16,000 to $18,000. So it's not surprising that Liter is a weak predictor in the fit above (its p-value is 0.543). This is a small data set, however (804 rows), so we can't read too much into it.
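A quick way to check whether mean price rises steadily with a feature is to test the grouped means for monotonicity. The sketch below copies four of the Liter means printed above into a Series instead of re-reading the data, so it is a spot check rather than a full analysis.

```python
import pandas as pd

# Four of the mean prices from the groupby output above, keyed by Liter.
mean_price = pd.Series(
    {1.6: 10752.833305, 2.0: 29968.972727, 3.0: 16550.926211, 6.0: 39155.712375},
    name='Price',
)

# True only if every step up in engine size also steps up in mean price.
print(mean_price.sort_index().is_monotonic_increasing)  # False
```

The drop from the 2.0-liter group to the 3.0-liter group is enough to break the trend.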

## Activity

Add the Doors column to the features, mess around with some fake input data, and see if you can create a measurable influence of number of doors on price. Have some fun with it: why stop at 4 doors?
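As a possible starting point, here is a minimal sketch on purely synthetic data. Everything in it is invented for illustration (the door counts, the planted effect of 1500 per door, the noise level), and it uses NumPy's least squares rather than statsmodels to stay self-contained. It only shows what a measurable door effect would look like in a fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Entirely synthetic cars: door counts and prices are invented for illustration.
doors = rng.choice([2, 4, 6], size=200).astype(float)
price = 15000 + 1500 * doors + rng.normal(0, 1000, size=200)

# Least-squares fit of price = intercept + slope * doors.
A = np.column_stack([np.ones_like(doors), doors])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)

print(coef)  # [intercept, slope]; the slope should land near the planted 1500
```

Try shrinking the planted effect or raising the noise and watch the recovered slope get less reliable.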