# Regression with statsmodels

This notebook uses this [article](https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9) as a template and tries to reproduce the results from the `regression_research.ipynb` notebook.

In [2]:
import statsmodels.api as sm
import numpy as np 
import pandas as pd 
import sklearn.metrics as skm

In [3]:
data = "https://raw.githubusercontent.com/Blackman9t/Machine_Learning/master/Original_2000_2014_Fuel_Consumption_Ratings.csv"
missing_data = ["n/a","na","--","?","non","Non","None"]

df = pd.read_csv(data, na_values=missing_data)

df.rename(columns={'FUEL_CONSUMPTION_CITY(L/100km)': 'FUEL_CONS_CITY',
                   'ENGINE_SIZE(L)': 'ENGINE_SIZE',
                   'HWY_(L/100km)': 'HWY_L100km',
                   'COMB_(L/100km)': 'COMB_L100km',
                   'COMB_(mpg)': 'COMB_MPG',
                   'CO2_EMISSIONS(g/km)': 'CO2_EMISSIONS'},
          inplace=True)

df.head()

Unnamed: 0,MODEL_YEAR,MAKE,MODEL,VEHICLE_CLASS,ENGINE_SIZE,CYLINDERS,TRANSMISSION,FUEL_TYPE,FUEL_CONS_CITY,HWY_L100km,COMB_L100km,COMB_MPG,CO2_EMISSIONS
0,2000,ACURA,1.6EL,COMPACT,1.6,4,A4,X,9.2,6.7,8.1,35,186
1,2000,ACURA,1.6EL,COMPACT,1.6,4,M5,X,8.5,6.5,7.6,37,175
2,2000,ACURA,3.2TL,MID-SIZE,3.2,6,AS5,Z,12.2,7.4,10.0,28,230
3,2000,ACURA,3.5RL,MID-SIZE,3.5,6,A4,Z,13.4,9.2,11.5,25,264
4,2000,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.0,7.0,8.6,33,198
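The effect of the `na_values` list used in the load step can be checked on a tiny in-memory CSV. This is a minimal sketch: the three rows in `csv_text` are made up for illustration and are not taken from the dataset, and `toy` is a hypothetical name.

```python
import io

import pandas as pd

# Made-up rows for illustration only; "?", "--" and "non" stand in for missing values.
csv_text = "MAKE,ENGINE_SIZE,CYLINDERS\nACURA,1.6,4\nBMW,?,6\nFORD,--,non\n"

# Same placeholder list as in the load step above.
missing_data = ["n/a", "na", "--", "?", "non", "Non", "None"]

toy = pd.read_csv(io.StringIO(csv_text), na_values=missing_data)

# Placeholders are parsed as NaN, so the numeric columns stay numeric.
print(toy.isna().sum())
print(toy.dtypes)
```

Without `na_values`, a column containing `"?"` would be read as text, which would break the regression later on.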


In [4]:
# Define the target variable (the value to predict).
target = pd.DataFrame(df.CO2_EMISSIONS)
target.head()

Unnamed: 0,CO2_EMISSIONS
0,186
1,175
2,230
3,264
4,198


The rest of the original notebook was lost. The results it produced matched those of the step-by-step notebook, so that notebook can serve as the baseline for comparison.

In [5]:
# Drop the variables that have no correlation with the target or are collinear.
X = df[['ENGINE_SIZE','CYLINDERS','COMB_MPG']]
X = sm.add_constant(X)
y = target["CO2_EMISSIONS"]

model = sm.OLS(y, X).fit()
predictions = model.predict(X)  # in-sample predictions from the fitted model

# Print out the statistics
model.summary()

Dep. Variable:,CO2_EMISSIONS,R-squared:,0.859
Model:,OLS,Adj. R-squared:,0.859
Method:,Least Squares,F-statistic:,29080.0
Date:,"Thu, 10 Dec 2020",Prob (F-statistic):,0.0
Time:,16:17:04,Log-Likelihood:,-64830.0
No. Observations:,14343,AIC:,129700.0
Df Residuals:,14339,BIC:,129700.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,322.6685,1.828,176.557,0.000,319.086,326.251
ENGINE_SIZE,7.9042,0.363,21.790,0.000,7.193,8.615
CYLINDERS,5.9362,0.247,24.074,0.000,5.453,6.420
COMB_MPG,-5.0140,0.039,-128.876,0.000,-5.090,-4.938

Omnibus:,1915.873,Durbin-Watson:,1.15
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9929.254
Skew:,0.542,Prob(JB):,0.0
Kurtosis:,6.929,Cond. No.,287.0
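To make the coefficient table concrete, the fitted equation can be evaluated by hand for a single vehicle. The sketch below copies the coefficients as printed in the summary (so they are rounded) and applies them to the first row of `df.head()`; `predict_co2` is a hypothetical helper, not part of statsmodels.

```python
# Coefficients copied from the summary table above (rounded as printed).
COEFFICIENTS = {
    "const": 322.6685,
    "ENGINE_SIZE": 7.9042,
    "CYLINDERS": 5.9362,
    "COMB_MPG": -5.0140,
}


def predict_co2(engine_size, cylinders, comb_mpg):
    """Evaluate the fitted linear equation for one vehicle (result in g/km)."""
    return (COEFFICIENTS["const"]
            + COEFFICIENTS["ENGINE_SIZE"] * engine_size
            + COEFFICIENTS["CYLINDERS"] * cylinders
            + COEFFICIENTS["COMB_MPG"] * comb_mpg)


# First row of df.head(): ENGINE_SIZE=1.6, CYLINDERS=4, COMB_MPG=35, observed CO2=186.
predicted = predict_co2(1.6, 4, 35)
residual = 186 - predicted  # observed minus predicted

print(round(predicted, 2), round(residual, 2))
```

For this row the equation gives roughly 183.6 g/km against an observed 186 g/km. The negative `COMB_MPG` coefficient means that each additional mile per gallon lowers the predicted emissions by about 5 g/km, holding engine size and cylinder count fixed.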


In [6]:
mae = skm.mean_absolute_error(target, predictions)
mae

14.94697825421709

In [7]:
mse = skm.mean_squared_error(target, predictions)
mse

493.75715761878746
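The MSE is in squared units, so it is hard to compare directly with the MAE. A common follow-up is the root mean squared error (RMSE), which is back in g/km. The sketch below shows the two definitions in plain Python on made-up toy values and then takes the square root of the MSE reported above; `mae_by_hand` and `mse_by_hand` are hypothetical names used only here.

```python
import math


def mae_by_hand(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_by_hand(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values for illustration only; the predictions are not model output.
y_true = [186, 175, 230]
y_pred = [184, 178, 226]

toy_mae = mae_by_hand(y_true, y_pred)  # (2 + 3 + 4) / 3
toy_mse = mse_by_hand(y_true, y_pred)  # (4 + 9 + 16) / 3

# RMSE derived from the MSE value reported above.
reported_mse = 493.75715761878746
rmse = math.sqrt(reported_mse)

print(toy_mae, toy_mse, round(rmse, 2))
```

The RMSE comes out at about 22.2 g/km, noticeably larger than the MAE of about 14.9 g/km. Because squaring weights large errors more heavily, that gap suggests a number of vehicles with comparatively large residuals.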