# Project 9 - Working with OLS

Having built statistics functions, we are now ready to build a function for regression analysis. We will start by building the an regression. We will use linear algebra to estimate parameters that minimize the sum of the squared errors. This is an ordinary least squares regression.

An OLS regression with one exogenous variable takes the form.

y = alpha + (beta1)(x1) + mu

Beta0 = alpha + mu

We merge the error term, which represents bias in the data, with alpha to yield the constant, Beta0. This is necessary since OLS assumes an unbiased estimator where:

Sum of ei = 0

Each estimate of a point created from a particular observation takes the form.

yi = Beta0 + (Beta1)(x1,i) + ei


This can be generalized to include k exogenous variables:


( )


Ideally, we want to form a prediction where, on average, the right-hand side of the equation yields the correct value on the left-hand side. When we perform an OLS regression, we form a predictor that minimizes the sum of the distance between each predicted value and the observed value drawn from the data. For example, if the prediction for a particular value of y is 8, and the actual value is 10, the error of the prediction is -2 and the squared error is 4.

To find the function that minimizes the sum squared errors, we will use matrix algebra, also known as linear algebra. For those unfamiliar, the next section uses the numpy library to perform matrix operations. For clarity, we will review the linear algebra functions that we will use with simple examples.

### Linear Algebra for OLS

### Inverting a Matrix

## Linear Algebra in numpy

In [1]:
import numpy as np
x1 = np.array([1,2,1])
x2 = np.array([4,1,5])
x3 = np.array([6,8,6])
print(x1,x2,x3, sep = "\n")

[1 2 1]
[4 1 5]
[6 8 6]


In [2]:
x1 = np.matrix(x1)
x2 = np.matrix(x2)
x3 = np.matrix(x3)
print(x1,x2,x3, sep = "\n")

[[1 2 1]]
[[4 1 5]]
[[6 8 6]]


In [3]:
X = np.concatenate((x1, x2, x3))
X

matrix([[1, 2, 1],
        [4, 1, 5],
        [6, 8, 6]])

In [4]:
X_inverse = X.getI()
X_inverse

matrix([[-8.5000000e+00, -1.0000000e+00,  2.2500000e+00],
        [ 1.5000000e+00, -7.6861594e-17, -2.5000000e-01],
        [ 6.5000000e+00,  1.0000000e+00, -1.7500000e+00]])

In [5]:
np.round(X_inverse, 2)

array([[-8.5 , -1.  ,  2.25],
       [ 1.5 , -0.  , -0.25],
       [ 6.5 ,  1.  , -1.75]])

In [6]:
X_transpose = X.getT()
X_transpose

matrix([[1, 4, 6],
        [2, 1, 8],
        [1, 5, 6]])

## Regression Function

Now that we have learned the necessary operations, we can understand the operations of the regression function. If you would like to build your own regression module, reconstruct the scripts form Chapter 7. In this lesson, we will use the statsmodels OLS method to reconstruct and compare statistics from an OLS regression.

Recall that we estimate the vector of beta parameters for each variable with the equation:

Beta = (X'X)^-1 (X'Y)

Each estimated Beta value is multiplied by each observation of the relevant exogenous variable estimate the effect of the value on the endogenous, Y value.

We will run a regression In order to estimate the parameters, we will need to import data, define the dependent variable and independent variables, and transform these into matrix objects.

Let's use the data from chapter 6 with the addition real GDP per capita. This combined set of data is saved in the repository as a file created in chapter 8.

In [7]:
import pandas as pd

ngdp = pd.read_excel("https://www.rug.nl/ggdc/historicaldevelopment/maddison/data/mpd2020.xlsx",
                    index_col = [0, 2],
                    parse_dates = True,
                    sheet_name = "Full data")
ngdp

Unnamed: 0_level_0,Unnamed: 1_level_0,country,gdppc,pop
countrycode,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AFG,1820,Afghanistan,,3280.00000
AFG,1870,Afghanistan,,4207.00000
AFG,1913,Afghanistan,,5730.00000
AFG,1950,Afghanistan,1156.0000,8150.00000
AFG,1951,Afghanistan,1170.0000,8284.00000
...,...,...,...,...
ZWE,2014,Zimbabwe,1594.0000,13313.99205
ZWE,2015,Zimbabwe,1560.0000,13479.13812
ZWE,2016,Zimbabwe,1534.0000,13664.79457
ZWE,2017,Zimbabwe,1582.3662,13870.26413


In [8]:

filename = "efotw-2022-master-index-data-for-researchers-iso.xlsx"
data = pd.read_excel(filename, 
                     index_col = [2,0], 
                     header = [0],
                     sheet_name = "EFW Panel Data 2022 Report")
rename = {"Panel Data Summary Index": "Summary",
         "Area 1":"Size of Government",
         "Area 2":"Legal System and Property Rights",
         "Area 3":"Sound Money",
         "Area 4":"Freedom to Trade Internationally",
         "Area 5":"Regulation"}
data = data.dropna(how="all", axis = 1).rename(columns = rename)
data

Unnamed: 0_level_0,Unnamed: 1_level_0,ISO_Code_2,World Bank Region,"World Bank Current Income Classification, 1990-present (L=Low income, LM=Lower middle income, UM=Upper middle income, H=High income)",Countries,Summary,Size of Government,Legal System and Property Rights,Sound Money,Freedom to Trade Internationally,Regulation,Standard Deviation of the 5 EFW Areas
ISO_Code_3,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
ALB,2020,AL,Europe & Central Asia,UM,Albania,7.640000,7.817077,5.260351,9.788269,8.222499,7.112958,1.652742
DZA,2020,DZ,Middle East & North Africa,LM,Algeria,5.120000,4.409943,4.131760,7.630287,3.639507,5.778953,1.613103
AGO,2020,AO,Sub-Saharan Africa,LM,Angola,5.910000,8.133385,3.705161,6.087996,5.373190,6.227545,1.598854
ARG,2020,AR,Latin America & the Caribbean,UM,Argentina,4.870000,6.483768,4.796454,4.516018,3.086907,5.490538,1.254924
ARM,2020,AM,Europe & Central Asia,UM,Armenia,7.840000,7.975292,6.236215,9.553009,7.692708,7.756333,1.178292
...,...,...,...,...,...,...,...,...,...,...,...,...
VEN,1970,VE,Latin America & the Caribbean,,"Venezuela, RB",7.242943,8.349529,5.003088,9.621851,7.895993,5.209592,2.028426
VNM,1970,VN,East Asia & Pacific,,Vietnam,,,,,,,
YEM,1970,YE,Middle East & North Africa,,"Yemen, Rep.",,,,,,,
ZMB,1970,ZM,Sub-Saharan Africa,,Zambia,4.498763,5.374545,4.472812,5.137395,,5.307952,0.412514


In [9]:
rename = {"Panel Data Summary Index": "Summary",
         "Area 1": "Size of Government",
         "Area 2": "Legal System and Property Rights",
         "Area 3": ""}

In [10]:
data["RGDP Per Capita"] = ngdp["gdppc"]

In [11]:
del data["Standard Deviation of the 5 EFW Areas"]

In [12]:
#THIS IS WHERE THE UPDATED DATETIME FORMAT GOES
data.reset_index(inplace = True)
data["Year"] = data["Year"].astype(str).astype("datetime64[ns]").sort_index()
data = data.set_index(["ISO_Code_3", "Year"]).sort_index()
data

Unnamed: 0_level_0,Unnamed: 1_level_0,ISO_Code_2,World Bank Region,"World Bank Current Income Classification, 1990-present (L=Low income, LM=Lower middle income, UM=Upper middle income, H=High income)",Countries,Summary,Size of Government,Legal System and Property Rights,Sound Money,Freedom to Trade Internationally,Regulation,RGDP Per Capita
ISO_Code_3,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
AGO,1970-01-01,AO,Sub-Saharan Africa,,Angola,,,,,,,2818.0000
AGO,1975-01-01,AO,Sub-Saharan Africa,,Angola,,,,,,,1710.0000
AGO,1980-01-01,AO,Sub-Saharan Africa,,Angola,,,,,,,1532.0000
AGO,1985-01-01,AO,Sub-Saharan Africa,,Angola,,,,,,,1242.0000
AGO,1990-01-01,AO,Sub-Saharan Africa,LM,Angola,,,,,,,1384.0000
...,...,...,...,...,...,...,...,...,...,...,...,...
ZWE,2016-01-01,ZW,Sub-Saharan Africa,L,Zimbabwe,6.121996,5.332597,4.056407,8.086016,6.404937,6.520805,1534.0000
ZWE,2017-01-01,ZW,Sub-Saharan Africa,L,Zimbabwe,5.599886,4.699843,4.071445,7.983888,4.503965,6.399757,1582.3662
ZWE,2018-01-01,ZW,Sub-Saharan Africa,LM,Zimbabwe,5.876298,5.170946,4.041897,7.312324,6.396649,6.303135,1611.4052
ZWE,2019-01-01,ZW,Sub-Saharan Africa,LM,Zimbabwe,4.719465,5.628359,4.026568,1.413372,6.397045,6.132583,


In [13]:
data.to_excel("EFWAndRGDP.xls")

  data.to_excel("EFWAndRGDP.xls")


In [14]:
data = data[data.keys()[3:]]

In [15]:
data.sort_index(inplace = True)
# Transform year to datetime format
# Look for update on how to do that

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().sort_index(


In [16]:
data

Unnamed: 0_level_0,Unnamed: 1_level_0,Countries,Summary,Size of Government,Legal System and Property Rights,Sound Money,Freedom to Trade Internationally,Regulation,RGDP Per Capita
ISO_Code_3,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
AGO,1970-01-01,Angola,,,,,,,2818.0000
AGO,1975-01-01,Angola,,,,,,,1710.0000
AGO,1980-01-01,Angola,,,,,,,1532.0000
AGO,1985-01-01,Angola,,,,,,,1242.0000
AGO,1990-01-01,Angola,,,,,,,1384.0000
...,...,...,...,...,...,...,...,...,...
ZWE,2016-01-01,Zimbabwe,6.121996,5.332597,4.056407,8.086016,6.404937,6.520805,1534.0000
ZWE,2017-01-01,Zimbabwe,5.599886,4.699843,4.071445,7.983888,4.503965,6.399757,1582.3662
ZWE,2018-01-01,Zimbabwe,5.876298,5.170946,4.041897,7.312324,6.396649,6.303135,1611.4052
ZWE,2019-01-01,Zimbabwe,4.719465,5.628359,4.026568,1.413372,6.397045,6.132583,


In [17]:
reg_vars = list(data.keys())

In [18]:
y_var = [reg_vars[-1]]
x_vars = reg_vars[2:-1]


In [19]:
reg_data = data[reg_vars]

In [20]:
import statsmodels.api as sm

In [44]:
reg_vars = list(data.keys())
del reg_vars[:2]
reg_vars

['Size of Government',
 'Legal System and Property Rights',
 'Sound Money',
 'Freedom to Trade Internationally',
 'Regulation',
 'RGDP Per Capita']

In [45]:
y_var = [reg_vars[-1]]
x_vars = reg_vars[:-1]
reg_data = data[reg_vars]
reg_data.corr().round(2)

Unnamed: 0,Size of Government,Legal System and Property Rights,Sound Money,Freedom to Trade Internationally,Regulation,RGDP Per Capita
Size of Government,1.0,-0.1,0.16,0.15,0.2,-0.16
Legal System and Property Rights,-0.1,1.0,0.52,0.63,0.64,0.66
Sound Money,0.16,0.52,1.0,0.68,0.6,0.46
Freedom to Trade Internationally,0.15,0.63,0.68,1.0,0.64,0.51
Regulation,0.2,0.64,0.6,0.64,1.0,0.53
RGDP Per Capita,-0.16,0.66,0.46,0.51,0.53,1.0


In [46]:
y = reg_data.dropna()[y_var]
X = reg_data.dropna()[x_vars]
X["Constant"] = 1
results = sm.OLS(y, X).fit()

In [54]:
results.summary()

0,1,2,3
Dep. Variable:,RGDP Per Capita,R-squared:,0.486
Model:,OLS,Adj. R-squared:,0.485
Method:,Least Squares,F-statistic:,593.5
Date:,"Fri, 21 Apr 2023",Prob (F-statistic):,0.0
Time:,16:46:10,Log-Likelihood:,-34081.0
No. Observations:,3145,AIC:,68170.0
Df Residuals:,3139,BIC:,68210.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Size of Government,-2752.2138,202.274,-13.606,0.000,-3148.817,-2355.611
Legal System and Property Rights,3966.0733,196.152,20.219,0.000,3581.474,4350.672
Sound Money,902.3584,177.099,5.095,0.000,555.117,1249.599
Freedom to Trade Internationally,1279.8725,211.796,6.043,0.000,864.601,1695.144
Regulation,2141.0305,281.044,7.618,0.000,1589.982,2692.079
Constant,-1.66e+04,1627.397,-10.197,0.000,-1.98e+04,-1.34e+04

0,1,2,3
Omnibus:,2952.722,Durbin-Watson:,0.174
Prob(Omnibus):,0.0,Jarque-Bera (JB):,189244.77
Skew:,4.324,Prob(JB):,0.0
Kurtosis:,40.005,Cond. No.,113.0


In [56]:
predictor = results.predict(reg_data)
reg_data[y_var[0] + " Predictor"] = predictor
reg_data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  reg_data[y_var[0] + " Predictor"] = predictor


Unnamed: 0_level_0,Unnamed: 1_level_0,Size of Government,Legal System and Property Rights,Sound Money,Freedom to Trade Internationally,Regulation,RGDP Per Capita,RGDP Per Capita Predictor
ISO_Code_3,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AGO,1970-01-01,,,,,,2818.0000,
AGO,1975-01-01,,,,,,1710.0000,
AGO,1980-01-01,,,,,,1532.0000,
AGO,1985-01-01,,,,,,1242.0000,
AGO,1990-01-01,,,,,,1384.0000,
...,...,...,...,...,...,...,...,...
ZWE,2016-01-01,5.332597,4.056407,8.086016,6.404937,6.520805,1534.0000,-2.542625e+07
ZWE,2017-01-01,4.699843,4.071445,7.983888,4.503965,6.399757,1582.3662,-2.622988e+07
ZWE,2018-01-01,5.170946,4.041897,7.312324,6.396649,6.303135,1611.4052,-2.671160e+07
ZWE,2019-01-01,5.628359,4.026568,1.413372,6.397045,6.132583,,


In [57]:
y_hat = reg_data[y_var[0] + " Predictor"]
y_mean = reg_data[y_var[0]].mean()
y = reg_data[y_var[0]]

In [27]:
reg_data["Residuals"] = (y.sub(y_hat))
reg_data["Squared Explained"] = y_hat.sub(y_mean).pow(2)
reg_data["Squared Residuals"] = y.sub(y_hat).pow(2)
reg_data["Squared Totals"] = y.sub(y_mean).pow(2)
reg_data

Unnamed: 0_level_0,Unnamed: 1_level_0,Summary,Size of Government,Legal System and Property Rights,Sound Money,Freedom to Trade Internationally,RGDP Per Capita,RGDP Per Capita Predictor,Residuals,Squared Explained,Squared Residuals,Squared Totals
ISO_Code_3,Year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
AGO,1970-01-01,,,,,,2818.0000,,,,,1.375566e+08
AGO,1975-01-01,,,,,,1710.0000,,,,,1.647745e+08
AGO,1980-01-01,,,,,,1532.0000,,,,,1.693760e+08
AGO,1985-01-01,,,,,,1242.0000,,,,,1.770085e+08
AGO,1990-01-01,,,,,,1384.0000,,,,,1.732502e+08
...,...,...,...,...,...,...,...,...,...,...,...,...
ZWE,2016-01-01,6.121996,5.332597,4.056407,8.086016,6.404937,1534.0000,-2.442081e+07,2.442234e+07,5.970865e+14,5.964507e+14,1.693239e+08
ZWE,2017-01-01,5.599886,4.699843,4.071445,7.983888,4.503965,1582.3662,-2.519259e+07,2.519417e+07,6.353996e+14,6.347462e+14,1.680675e+08
ZWE,2018-01-01,5.876298,5.170946,4.041897,7.312324,6.396649,1611.4052,-2.565548e+07,2.565710e+07,6.589505e+14,6.582866e+14,1.673155e+08
ZWE,2019-01-01,4.719465,5.628359,4.026568,1.413372,6.397045,,,,,,


In [28]:
SSR = reg_data["Squared Explained"].sum()
SSE = reg_data["Squared Residuals"].sum()
SST = reg_data["Squared Totals"].sum()
SSR, SSE, SST

(4.448090858569706e+20, 4.448412576531591e+20, 1066057730482.8079)

##  Calculate Estimator Variance

With the sum of squared errors calculated, the next step is to calculate the estimator variance and use this to construct the covariance matrix. The covariance matrix is used to derive the standard errors and related statistics for each estimated coefficient.

We estimate the variance of the error term of the estimator for the dependent variable.

 

number of observations

number of independent variables

An increase in the number of exogenous variables tends ot increase the fit of a model. By dividing the 
 by degrees of freedom, 
 , improvements in fit that result from increases in the number of variables are offset in part by a reduction in degrees of freedom.

Finally, we calculate the covariance matrix, 
:



In [29]:
n = results.nobs
k = len(results.params)
estimator_variance = SSE / (n - k)
n, k, estimator_variance

(3161.0, 6, 1.4099564426407578e+17)

In [30]:
cov_matrix = results.cov_params()
cov_matrix

Unnamed: 0,Summary,Size of Government,Legal System and Property Rights,Sound Money,Freedom to Trade Internationally,Constant
Summary,1649068.0,-378891.187765,-427915.270669,-381112.886725,-381852.828909,-603648.1
Size of Government,-378891.2,126185.143221,110511.806814,84181.554424,77296.330367,-86901.45
Legal System and Property Rights,-427915.3,110511.806814,143026.647648,93741.262897,80975.648002,70122.19
Sound Money,-381112.9,84181.554424,93741.262897,117328.592332,71115.189138,77308.57
Freedom to Trade Internationally,-381852.8,77296.330367,80975.648002,71115.189138,128976.446018,163138.3
Constant,-603648.1,-86901.447257,70122.194933,77308.568406,163138.25768,2539853.0


In [31]:
# calculate covariance matrix by hand
XtXInv = np.matrix(matmul(X.T, X)).getI()
# multiply by estimator variance
ev_mul_XTXInv = estimator_variance * XTXInv
# transform to pandas dataframe
pd.DataFrame(ev_mul_XTXInv,
             columns = X.keys(), index = X.keys()) 

NameError: name 'matmul' is not defined

In [None]:
results.params

In [None]:
print("beta", "\t\t\tSE")
for x_var in X.keys():
    beta_x = results.params[x_var]
    StdErrX = cov_matrix.loc[x_var][x_var]**(.5)
    #print(beta_x, StdErrX, sep = "\t")
    #print("t:", beta_x / StdErrX)
    parameters[x_var] = {}
    parameters[x_var]["Beta"] = beta_x
    parameters[x_var]["SE"] = StdErrX
    parameters[x_var]["t-stats"] = beta_x / StdErrX
parameters = pd.DataFrame(paramaters).T
parameters

## Calculate R^2

The variance term will be used to help us calculate other values. First we estimate the square root of the mean squared error. Since the mean squared error is the variance of the estimator, this means we simply take the square root the variance term


The square-root of the MSE provides a more readily interpretable estimate of the estimator variance, showing the average distance of predicted values from actual values, corrected for the number of independent variables.

We also estimate the R2 value. This value indicates the explanator power of the regression

 

This compares the average squared distance between the predicted values and the average value against the average squared distance between observed values and average values. Ordinary least squares regression minimizes the squared distance between the predicted value and the average value. If values are perfectly predicted, then the SSR would equal the SST. Usually, the SSR is less than the SST. It will never be greater than the SST.

In [None]:
r2 = SSR / SST
r2

## Adjusted R-Squared

Although the 
 is a useful measure to understand the quality of the explanation provided by the selected exogenous variables. Recall that:

 

Notice that as the degrees of freedom decrease, the numerator necessarily decreases as well. One should not depend solely on the adjusted 
 to consider the strength of a regression's results, but it is often useful to help gauge whether or not a marginal addition of a variable improves explanatory power of a regression.

 


In [None]:
r2_adjusted = 1 - (SSE / (n - k)) / (SST / (n - 1))
r2_adjusted

In [None]:
results.summary()

## Common Problems with OLS

Although our regression generates a large t-statitic, our errors are not normally distributed. This is due in part to our use of untransformed time-series data. To make the data normally distributed, we could log the data or calculate either the annual difference or percent change. Logging the data will maintain levels. Since this data suffers from a trend, we will calculate the annual difference of index values and the annual percent change of real GDP per capita values after we review the distribution of residuals.

### Check the distribution of residuals

In [None]:
import matplotlib.pyplot as plt
plt.rcParams.update({"font.size":26})
fig, ax = plt.subplots(figsize = (12,8))
reg_data[["Residuals"]].plot.hist(bins = 100, ax = ax)
plt.xticks(rotation = 90)

### Thinking through unit-root and cointegration problems

In [None]:
reg_data.loc["USA"][x_vars + y_var]
fig, ax = plt.subplots(figsize = (24, 12))
plot_df.diff(5).dropna().plot.line(ax = ax, secondary_y = y_var, legend = True)

In [None]:
np.log(data[y_var]).diff(5).plot.hist(bins = 10)

### WARNING: having more recent data biases estimates toward present inferences from present data

In [None]:
## Regressions with Logged Differences
years_diff = 5
reg_data = data
# take the log of real gdp then difference within group
reg_data["RGDP Per Capita"] = np.log(data["RGDP Per Capita"]).groupby(
    "ISO_Code_3").diff(years_diff)
reg_data = reg_data.replace([np.inf, -np.inf], np.NaN)
reg_data.loc["USA"]

In [None]:
r_df = reg_data.dropna(axis = 0, how = "any")
y = r_df[y_var]
x = r_df[x_vars]
X["Constant"] = 1
results = sm.OLS(y, X).fit()
r_df["Predictor"] = results.predict()
r_df["Residuals"] = results.resid
results.summary

In [None]:
fig, ax = plt.subplots(figsize = (12, 8))
r_df[["Residuals"]].plot.hist(bins = 100, ax = ax)
ax.axvline(r_df["Residuals"].mean(), ls = "- -", linewidth = 5, color = "k")

In [None]:
results_dict = {"Beta": results.params,
               "t-stats": results.tvalues,
               "p-values": results.pvalues,
               "SE": results.bse}
results_df = pd.DataFrame(results_dict).round(3)
results_df.to_csv("y = RGDP, x = EFW, LogDiffResults.csv")
results_df

In [None]:
fig, ax = plt.subplots(figsize = (20, 12))
r_df.plot.scatter(x = y_var[0],
                 y = "Predictor",
                 s = 50,
                 alpha = .7,
                 ax = ax)

In [None]:
all_vars = y_var + x_vars

for var in all_vars:
    fig, ax = plt.subplots(figsize = 20, 12)
    r_df.plot.scatter(x = var,
                     y = "Residuals",
                     s = 50,
                     alpha = .5,
                     ax = ax)
    ax.axhline(r_df["Residuals"].mean(), ls = "- -", linewidth = 5, color = "k")

In [None]:
for country in countries:
    cumulative_data = r_df[[y_var[0], "Predictor"]] + 1
cumulative_data

In [None]:
for countries in countries:
    try:
        plot_data = r_df.loc[country]
        fig, ax = plt.subplots(figsize = (20, 10))
        plot_data[[y_var[0], "Predictor"]].add(1).cumprod().plot.line(ax = ax, legend = True)
    except:
        print(country + "does not appear to be in index")

In [None]:
r_df = reg_data.copy()
r_df["RGDP Per Capita Lag"] = reg_data["RGDP Per Capita"].groupby("ISO_Code_3").shift(years_diff)
r_df = r_df.dropna(axis = 0, how = "any")
x_vars.append("RGDP Per Capita Lag")
y = [y_var]
X = [x_vars]
X["Constant"] = 1
results = sm.OLS(y, x).fit()
r_df["Predictor"] = results.predict()
results.summary()

In [None]:
r_df["Residuals"] = results.resid
fig, ax = plt.subplots(figsize = (12,8))

r_df[["Residuals"]].plot.hist(bins = 100, ax = ax)

In [None]:
fig, ax = plt.subplots(figsize = (14,10))
r_df.plot.scatter(x = y_var[0],
                 y = "Predictor", 
                  s = 30, ax = ax)
plt.xticks(rotation=90)
plt.show()
plt.close()
# cycle through all variables included in regression
# concantonate y_var and x_vars 
for var in y_var + x_vars:
    fig, ax = plt.subplots(figsize = (14,10))
    r_df.plot.scatter(x = y_var[0],
                     y = "Residuals", 
                      s = 30, ax = ax)
ax.axhline(0, ls = "--", color = "k")
plt.xticks(rotation=90)
plt.show()
plt.close() 

In [None]:
del r_df["Predictor"]
del r_df["Residual"]
del r_df["RGDP Per Capita Lag"]
# delete RGDP per capita lag for x vars 
x_vars = r_df.keys()[2:7]

In [None]:
x_vars = list(r_df.keys()[2:7])
y_var = [r_df.keys()[7]]
x_vars, y_var
r_df = r_df[y_var + x_vars].groupby("ISO_Code_3").diff(years_diff).dropna()

In [None]:
r_df = r_df.dropna(axis = 0, how = "any")

In [None]:
y = [y_var]
X = [x_vars]
# X["Constant"] = 1
results = sm.OLS(y, x).fit()
r_df["Predictor"] = results.predict()
results.summary()

In [None]:
r_df["Residuals"] = results.resid
fig, ax = plt.subplots(figsize = (12,8))

r_df[["Residuals"]].plot.hist(bins = 100, ax = ax)

In [None]:
def plot_residuals (df, y_var, x_vars):
    fig, ax = plt.subplots(figsize = (14,10))
    r_df.plot.scatter(x = y_var[0],
                     y = "Predictor", 
                      s = 30, ax = ax)
    plt.xticks(rotation=90)
    plt.show()
    plt.close()

    for var in y_var + x_vars:
        fig, ax = plt.subplots(figsize = (14,10))
        r_df.plot.scatter(x = y_var[0],
                         y = "Residuals", 
                          s = 30, ax = ax)
    ax.axhline(0, ls = "--", color = "k")
    plt.xticks(rotation=90)
    plt.show()
    plt.close() 
plot_residuals(r_df, y_var, x_vars)