# Module 6 Exercises - Correlation and Models

### Exercise 1:

From the datasets folder, load the "tamiami.csv" file as a dataframe. Rename the columns (in order) to the following:

- location
- sales
- employees
- restaurants
- foodcarts
- price

Then compute a correlation table for that dataframe. Which features (columns) are correlated? Which features aren't?

In [3]:
import pandas as pd
import numpy as np

In [5]:
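Location = "datasets/tamiami.csv"
df = pd.read_csv(Location)

#map the original CSV headers to short names
#note: 'Sales' is capitalized here, unlike the lowercase 'sales' the exercise asks for;
#the regression formulas in Exercise 2 refer to this column as 'Sales'
df.rename(columns={'Cart Location': 'location',
                   'Hot Dog Sales': 'Sales',
                   'Employees in Nearby Office Buildings': 'employees',
                   'Num of Nearby Restaurants': 'restaurants',
                   'Num of Other Food Carts Nearby': 'foodcarts',
                   'Price': 'price'}, inplace=True)
df.head()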

,location,Sales,employees,restaurants,foodcarts,price
0,1,100,1600,8,12,4.16
1,2,80,1200,6,13,4.63
2,3,450,2800,19,6,0.5
3,4,580,4300,19,2,0.47
4,5,100,1400,6,13,4.24
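
Because the exercise asks for the columns to be renamed in order, assigning a list of names by position is an alternative to the dict-based `rename`. The sketch below is only an illustration: it rebuilds the five rows shown above as a stand-in for the full CSV (the original headers are taken from the `rename` call), and it uses the lowercase `sales` that the exercise requests. The later formulas in this notebook use `Sales`, so they would need the same spelling.

```python
import pandas as pd

# Stand-in for pd.read_csv("datasets/tamiami.csv"): the five rows shown by df.head().
raw = pd.DataFrame({
    "Cart Location": [1, 2, 3, 4, 5],
    "Hot Dog Sales": [100, 80, 450, 580, 100],
    "Employees in Nearby Office Buildings": [1600, 1200, 2800, 4300, 1400],
    "Num of Nearby Restaurants": [8, 6, 19, 19, 6],
    "Num of Other Food Carts Nearby": [12, 13, 6, 2, 13],
    "Price": [4.16, 4.63, 0.5, 0.47, 4.24],
})

new_names = ["location", "sales", "employees", "restaurants", "foodcarts", "price"]

# Positional renaming only makes sense if the list length matches the column count.
assert len(new_names) == raw.shape[1]

renamed = raw.copy()
renamed.columns = new_names  # names are applied left to right
print(renamed.head())
```

Positional assignment is shorter, but it silently mislabels data if the column order in the file ever changes; the dict-based `rename` is safer in that respect.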


In [6]:
df.corr()

,location,Sales,employees,restaurants,foodcarts,price
location,1.0,0.042705,-0.068923,0.049701,0.077219,-0.138444
Sales,0.042705,1.0,0.943238,0.913674,-0.919762,-0.966378
employees,-0.068923,0.943238,1.0,0.856976,-0.874692,-0.88154
restaurants,0.049701,0.913674,0.856976,1.0,-0.761793,-0.933951
foodcarts,0.077219,-0.919762,-0.874692,-0.761793,1.0,0.860154
price,-0.138444,-0.966378,-0.88154,-0.933951,0.860154,1.0


In [None]:
#according to the correlation table, the following pairs show strong positive correlations:
    #hot dog sales and nearby employees (0.94)
    #hot dog sales and number of nearby restaurants (0.91)
    #nearby employees and number of nearby restaurants (0.86)
    #number of nearby foodcarts and price (0.86)

#the following pairs show strong negative correlations:
    #hot dog sales and price (-0.97)
    #hot dog sales and number of nearby foodcarts (-0.92)
    #nearby employees and number of nearby foodcarts (-0.87)
    #nearby employees and price (-0.88)
    #number of nearby restaurants and price (-0.93)
    #number of nearby restaurants and number of nearby foodcarts (-0.76)

#not correlated:
    #location is essentially uncorrelated with every other column (all |r| below 0.14)

#what the correlations suggest (these are associations, not proof of cause):
    #sales are higher where there are more nearby employees and more nearby restaurants
    #prices are lower where there are more restaurants, and higher where there are more competing foodcarts
    #sales are higher where prices are lower
    #sales are lower where there are more competing foodcarts
    #locations with more nearby employees tend to have more restaurants and fewer foodcarts
    #locations with more nearby employees tend to have lower prices
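
Reading pairs off a correlation table by eye is error-prone, so a short filter can help. The sketch below re-enters the matrix printed above (in the notebook you would use `df.corr()` directly) and splits the pairs at an assumed cutoff of |r| = 0.7. That cutoff is an arbitrary choice for illustration, not a rule from the exercise.

```python
import numpy as np
import pandas as pd

cols = ["location", "Sales", "employees", "restaurants", "foodcarts", "price"]
# Values copied from the df.corr() output above.
values = [
    [1.0, 0.042705, -0.068923, 0.049701, 0.077219, -0.138444],
    [0.042705, 1.0, 0.943238, 0.913674, -0.919762, -0.966378],
    [-0.068923, 0.943238, 1.0, 0.856976, -0.874692, -0.88154],
    [0.049701, 0.913674, 0.856976, 1.0, -0.761793, -0.933951],
    [0.077219, -0.919762, -0.874692, -0.761793, 1.0, 0.860154],
    [-0.138444, -0.966378, -0.88154, -0.933951, 0.860154, 1.0],
]
corr = pd.DataFrame(values, index=cols, columns=cols)

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

threshold = 0.7  # assumed cutoff for "strong"; adjust to taste
strong = pairs[pairs.abs() >= threshold].sort_values()
weak = pairs[pairs.abs() < threshold]

print("strongly correlated pairs:")
print(strong)
print("weakly correlated pairs:")
print(weak)
```

With this cutoff, every weak pair involves `location`, and every pair among the other five columns counts as strong.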


### Exercise 2:

Using the dataframe from the previous exercise, choose features (columns) to create a linear regression formula to predict sales. Try it with and without the y-intercept. How does it make a difference? Does adding or removing features in your model formula make a difference in the output?

In [9]:
#use the statsmodels formula API to fit and summarize linear regression models
import statsmodels.formula.api as smf #builds a model from an R-style formula string and a dataframe

In [11]:
#OLS is Ordinary Least Squares, the most common type of linear regression
#fit() estimates the coefficients that minimize the sum of squared differences between observed and predicted Sales
result = smf.ols('Sales ~ location + employees + restaurants + foodcarts + price', data=df).fit()
result.summary()


Dep. Variable:,Sales,R-squared:,0.981
Model:,OLS,Adj. R-squared:,0.977
Method:,Least Squares,F-statistic:,245.2
Date:,"Thu, 11 Apr 2019",Prob (F-statistic):,8.86e-20
Time:,11:05:54,Log-Likelihood:,-137.9
No. Observations:,30,AIC:,287.8
Df Residuals:,24,BIC:,296.2
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,300.4511,70.074,4.288,0.000,155.826,445.076
location,0.3962,0.671,0.590,0.561,-0.989,1.782
employees,0.0574,0.015,3.794,0.001,0.026,0.089
restaurants,3.6295,3.147,1.153,0.260,-2.866,10.125
foodcarts,-9.7950,2.891,-3.388,0.002,-15.762,-3.828
price,-44.1035,12.980,-3.398,0.002,-70.894,-17.313

Omnibus:,0.214,Durbin-Watson:,2.503
Prob(Omnibus):,0.899,Jarque-Bera (JB):,0.418
Skew:,0.02,Prob(JB):,0.811
Kurtosis:,2.423,Cond. No.,31800.0


In [12]:
#try regression model without y-intercept (the '- 1' in the formula removes it)
result = smf.ols('Sales ~ location + employees + restaurants + foodcarts + price - 1', data=df).fit()
result.summary()

Dep. Variable:,Sales,R-squared:,0.99
Model:,OLS,Adj. R-squared:,0.987
Method:,Least Squares,F-statistic:,475.0
Date:,"Thu, 11 Apr 2019",Prob (F-statistic):,6.3e-24
Time:,11:12:41,Log-Likelihood:,-146.43
No. Observations:,30,AIC:,302.9
Df Residuals:,25,BIC:,309.9
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
location,1.7322,0.774,2.238,0.034,0.138,3.327
employees,0.0928,0.017,5.606,0.000,0.059,0.127
restaurants,11.9293,3.231,3.692,0.001,5.274,18.584
foodcarts,-9.0329,3.757,-2.404,0.024,-16.771,-1.295
price,-2.4492,11.209,-0.218,0.829,-25.535,20.637

Omnibus:,0.612,Durbin-Watson:,2.101
Prob(Omnibus):,0.736,Jarque-Bera (JB):,0.657
Skew:,-0.071,Prob(JB):,0.72
Kurtosis:,2.289,Cond. No.,4130.0
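
A caveat when comparing the two summaries: when the constant is omitted, statsmodels computes R-squared from the uncentered total sum of squares (variation around zero) rather than the centered one (variation around the mean), so the two R-squared values are not on the same scale. The sketch below illustrates this on synthetic data, not the tamiami data; the variable names and the data-generating line are assumptions made for the demonstration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a baseline far from zero, so the true intercept matters.
rng = np.random.default_rng(0)
x = rng.uniform(1000, 4000, size=30)
y = 300 + 0.05 * x + rng.normal(0, 20, size=30)
demo = pd.DataFrame({"x": x, "y": y})

with_const = smf.ols("y ~ x", data=demo).fit()
no_const = smf.ols("y ~ x - 1", data=demo).fit()

# Recompute R^2 from the sums of squares exposed on the results objects.
r2_with = 1 - with_const.ssr / with_const.centered_tss
r2_no_uncentered = 1 - no_const.ssr / no_const.uncentered_tss  # what summary() reports
r2_no_centered = 1 - no_const.ssr / no_const.centered_tss      # same scale as r2_with

print("with intercept, reported R^2:      ", with_const.rsquared)
print("no intercept, reported R^2:        ", no_const.rsquared)
print("no intercept, centered R^2:        ", r2_no_centered)
print("residual sum of squares (with, no):", with_const.ssr, no_const.ssr)
```

On a common scale the no-intercept fit can never beat the fit with an intercept, because dropping the constant only restricts the model. Log-likelihood, AIC, and BIC are better guides than R-squared for this comparison.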


In [22]:
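#by removing the y-intercept, the reported R^2 is higher (0.990 vs 0.981), but the two values are not comparable:
#without an intercept, statsmodels reports the uncentered R^2 (variation measured around 0 instead of around mean Sales)
#log-likelihood (-146.43 vs -137.9) and AIC (302.9 vs 287.8) both indicate the no-intercept model fits worse

#price is no longer significant in the no-intercept model (p = 0.829), so try removing it
#(price is still strongly correlated with Sales, r = -0.97, and was significant with the intercept, p = 0.002)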

result = smf.ols('Sales ~ location + employees + restaurants + foodcarts -1', data=df).fit()
result.summary()

Dep. Variable:,Sales,R-squared:,0.99
Model:,OLS,Adj. R-squared:,0.988
Method:,Least Squares,F-statistic:,616.3
Date:,"Thu, 11 Apr 2019",Prob (F-statistic):,2.42e-25
Time:,11:17:52,Log-Likelihood:,-146.46
No. Observations:,30,AIC:,300.9
Df Residuals:,26,BIC:,306.5
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
location,1.7765,0.733,2.422,0.023,0.269,3.284
employees,0.0908,0.014,6.656,0.000,0.063,0.119
restaurants,12.3681,2.485,4.978,0.000,7.261,17.476
foodcarts,-9.8218,1.020,-9.630,0.000,-11.918,-7.725

Omnibus:,0.855,Durbin-Watson:,2.105
Prob(Omnibus):,0.652,Jarque-Bera (JB):,0.774
Skew:,-0.092,Prob(JB):,0.679
Kurtosis:,2.235,Cond. No.,876.0
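
Since R-squared is hard to compare across these fits, AIC is a handier yardstick: it is computed as 2k - 2·(log-likelihood), where k is the number of estimated coefficients, and lower is better. The sketch below recomputes AIC from the log-likelihoods printed in the three summaries so far. The model labels are made up for readability, and k is taken as "Df Model" plus one when an intercept is present.

```python
# Log-likelihood and coefficient count (k) as reported in the summaries above.
models = {
    "all features, with intercept": {"llf": -137.90, "k": 6, "reported_aic": 287.8},
    "all features, no intercept": {"llf": -146.43, "k": 5, "reported_aic": 302.9},
    "no price, no intercept": {"llf": -146.46, "k": 4, "reported_aic": 300.9},
}


def aic(llf, k):
    """Akaike information criterion: 2k - 2 * log-likelihood (lower is better)."""
    return 2 * k - 2 * llf


for name, m in models.items():
    m["aic"] = aic(m["llf"], m["k"])
    print(f"{name:30s} AIC = {m['aic']:.1f} (reported {m['reported_aic']})")

best = min(models, key=lambda name: models[name]["aic"])
print("lowest AIC:", best)
```

By this measure the full model with the intercept is preferred, even though its reported R-squared is the lowest of the three.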


In [24]:
#drop location as well and restore the y-intercept
result = smf.ols('Sales ~ employees + restaurants + foodcarts', data=df).fit()
result.summary()

Dep. Variable:,Sales,R-squared:,0.966
Model:,OLS,Adj. R-squared:,0.962
Method:,Least Squares,F-statistic:,245.9
Date:,"Thu, 11 Apr 2019",Prob (F-statistic):,3.41e-19
Time:,11:19:55,Log-Likelihood:,-146.49
No. Observations:,30,AIC:,301.0
Df Residuals:,26,BIC:,306.6
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,141.3851,58.740,2.407,0.023,20.643,262.128
employees,0.0575,0.019,3.011,0.006,0.018,0.097
restaurants,13.1304,2.436,5.391,0.000,8.124,18.137
foodcarts,-15.3045,2.983,-5.130,0.000,-21.436,-9.173

Omnibus:,1.028,Durbin-Watson:,1.842
Prob(Omnibus):,0.598,Jarque-Bera (JB):,0.869
Skew:,0.137,Prob(JB):,0.648
Kurtosis:,2.213,Cond. No.,20700.0
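
To see what "a formula to predict sales" means in practice, the coefficients of this last model can be plugged into a plain function. This is a minimal sketch using the rounded values printed in the summary above; `predict_sales` is a hypothetical helper, not part of statsmodels. With the fitted `result` object in hand, `result.predict(new_df)` would do the same job at full precision.

```python
# Coefficients copied (rounded as printed) from the summary of
# 'Sales ~ employees + restaurants + foodcarts'.
COEF = {
    "Intercept": 141.3851,
    "employees": 0.0575,
    "restaurants": 13.1304,
    "foodcarts": -15.3045,
}


def predict_sales(employees, restaurants, foodcarts):
    """Evaluate the fitted regression equation for one cart location."""
    return (COEF["Intercept"]
            + COEF["employees"] * employees
            + COEF["restaurants"] * restaurants
            + COEF["foodcarts"] * foodcarts)


# Rows 0 and 3 of the df.head() output in Exercise 1:
# (employees, restaurants, foodcarts, actual Sales)
rows = [(1600, 8, 12, 100), (4300, 19, 2, 580)]
for employees, restaurants, foodcarts, actual in rows:
    predicted = predict_sales(employees, restaurants, foodcarts)
    print(f"predicted {predicted:.1f}, actual {actual}")
```

With the rounded coefficients this gives about 154.8 for the first row (actual 100) and about 607.5 for the fourth (actual 580), which shows that a high R-squared still leaves sizeable errors for individual locations.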
