# Linear Regression: Data Analytics Demo

We will explore measures of corruption across countries using a data set that records the number of parking violations by diplomats visiting NYC. Before 2002 diplomats were granted immunity from parking violations; that changed after November 2002. We can use this data to begin exploring measures of corruption using linear regression methods and hypothesis testing.

In [2]:
# Import the libraries we need
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

In [3]:
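# Change the working directory and import the data
import os
# Raw string so the backslashes in the Windows path are not read as escape sequences
os.chdir(r'C:\Users\feder\Desktop\Stata Documents\dta_files')
df = pd.read_stata('parking-data.dta', convert_missing=False)
print(df.head(5))
# List every column name in the data set
var_names = list(df.columns.values)
print(var_names)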

  wbcode               country prepost  viol_rush  viol_hydrant  \
0    AGO                Angola     pre   4.051054    154.142624   
1    AGO                Angola     pos   0.000000      2.616488   
2    ALB               Albania     pre   2.430633    110.188683   
3    ALB               Albania     pos   0.000000      0.000000   
4    ARE  United Arab Emirates     pre   0.000000      0.000000   

   viol_afterhours           due  violations         fines  mission  \
0        32.003330  83109.406250  744.381226  40293.812500        1   
1         1.635305   2045.766113   15.371863   1208.490112        1   
2        78.995560  29185.822266  256.634308  13970.061523        1   
3         0.981183    760.416687    5.560036    609.968628        1   
4         0.000000      0.000000    0.000000      0.000000        1   

     ...      r_southamerica  r_oceana  r_asia  dislike_usa  dislike_others  \
0    ...                   0         0       0     2.028961        2.124807   
1    ...    

In [4]:
# Describe three of the 67 variables that we have stored
print(df['violations'].describe())
print(df['prepost'].describe())
print(df['viol_pc'].describe())

count     298.000000
mean      100.879173
std       302.233124
min         0.000000
25%         0.654122
50%         5.723566
75%        51.914753
max      3392.960693
Name: violations, dtype: float64
count     302
unique      2
top       pos
freq      151
Name: prepost, dtype: object
count    298.000000
mean       9.862918
std       25.237558
min        0.000000
25%        0.077223
50%        0.605063
75%        7.803244
max      249.364914
Name: viol_pc, dtype: float64


In [5]:
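# Create dummy variables for the regression analysis
dummies = pd.get_dummies(df['prepost'], dtype=int)
df['pre'] = dummies['pre']
df['post'] = dummies['pos']  # 'pos' is the label for the post-enforcement period
# Fill the missing viol_pc values with the sample mean (about 9.8629)
df['viol_pc'] = df['viol_pc'].fillna(df['viol_pc'].mean())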

We will now regress viol_pc on the dummy variable that signals the start of the enforcement period. Our estimated equation is as follows:

viol_pc = B0 + B1*(post)

A value of 1 for post means that we are in the enforcement period, and a value of 0 means we are in the pre-enforcement period.
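
Because post only takes the values 0 and 1, the equation above reduces to two group means. A short worked derivation, assuming the error term averages to zero within each period (an assumption added here, not stated in the original):

```latex
\begin{aligned}
E[\text{viol\_pc} \mid \text{post} = 0] &= B_0 \\
E[\text{viol\_pc} \mid \text{post} = 1] &= B_0 + B_1 \\
B_1 &= E[\text{viol\_pc} \mid \text{post} = 1] - E[\text{viol\_pc} \mid \text{post} = 0]
\end{aligned}
```

So B0 is the average in the pre-enforcement period, and B1 is the change in that average once enforcement begins.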

In [6]:
import statsmodels.api as sm
# Add a constant column so the model estimates an intercept (B0)
X = sm.add_constant(df['post'])
model = sm.OLS(df['viol_pc'], X)
regression = model.fit()
print(regression.params)
print(regression.summary())

const    19.181368
post    -18.636904
dtype: float64
                            OLS Regression Results                            
Dep. Variable:                viol_pc   R-squared:                       0.139
Model:                            OLS   Adj. R-squared:                  0.136
Method:                 Least Squares   F-statistic:                     48.28
Date:                Wed, 20 Feb 2019   Prob (F-statistic):           2.30e-11
Time:                        17:21:29   Log-Likelihood:                -1378.4
No. Observations:                 302   AIC:                             2761.
Df Residuals:                     300   BIC:                             2768.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------

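From the regression above we can read off the following:
1. The constant means that in the pre-enforcement period (post = 0) there were on average about 19.18 violations per diplomat per year for any given country.

2. The post coefficient tells us that after November 2002 violations fell by about 18.64 per diplomat per year on average, leaving roughly 0.54 (19.18 - 18.64) in the enforcement period. There are no other regressors in this model, so nothing else is being held constant.

3. Both our p-values are below 0.05. The one that answers the policy question is the p-value on post: with a single regressor it equals Prob (F-statistic) = 2.30e-11, so at the 5% significance level we reject the hypothesis of no change and conclude that there is a statistically significant decrease in violations after November 2002.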
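
The reading of the constant and the post coefficient above can be checked numerically. The sketch below uses synthetic data (made-up numbers, not the parking data set) and plain NumPy least squares rather than statsmodels. It is only meant to illustrate that, with a single 0/1 regressor, the intercept matches the pre-period mean and the slope matches the difference in means.

```python
import numpy as np

# Synthetic data (hypothetical values, NOT the parking data set)
rng = np.random.default_rng(0)
post = np.repeat([0, 1], 50)  # 50 "pre" rows followed by 50 "post" rows
y = 20.0 - 18.0 * post + rng.normal(0, 5, size=post.size)

# OLS of y on a constant and the post dummy
X = np.column_stack([np.ones(post.size), post])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# Group means computed directly
pre_mean = y[post == 0].mean()
post_mean = y[post == 1].mean()

print(f"intercept = {b0:.4f}, pre-period mean = {pre_mean:.4f}")
print(f"slope     = {b1:.4f}, difference in means = {post_mean - pre_mean:.4f}")
```

The two printed pairs should agree up to floating-point error, which is why the constant can be read as the pre-enforcement average and the post coefficient as the change in that average.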

## Conclusion

Using a simple bivariate regression, we estimate that violations per diplomat went down overall after enforcement began, and the analysis above states what each coefficient means. In this case the policy did reduce the number of violations committed.

We can improve on this model by adding more variables to reduce omitted variable bias. We also discussed scenarios in which we might want to know this information: a country that wants to implement a similar policy can use this data to see whether the reduction was statistically significant.
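
The omitted variable bias mentioned above can be illustrated with a small simulation. This is a minimal sketch on synthetic data: the names `x`, `z`, and the helper `ols` are hypothetical and do not correspond to columns or functions in the notebook. It assumes a regressor `x` that is correlated with a second variable `z` that also affects the outcome.

```python
import numpy as np

# Synthetic data only: x and z are hypothetical, not columns of the parking data set
rng = np.random.default_rng(42)
n = 2000
z = rng.normal(size=n)                            # variable that might be omitted
x = 0.8 * z + rng.normal(size=n)                  # regressor of interest, correlated with z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)  # true effect of x is 2.0

def ols(columns, y):
    """Return OLS coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

short_slope = ols([x], y)[1]    # z left out of the model
long_slope = ols([x, z], y)[1]  # z included as a control

print(f"slope on x without z: {short_slope:.3f}")
print(f"slope on x with z:    {long_slope:.3f}")
```

With `z` left out, the slope on `x` absorbs part of the effect of `z` and overstates the true value of 2.0. Including `z` brings the estimate back close to 2.0. Whether any particular control matters for the parking data is an empirical question this sketch does not answer.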