# Regression Analysis

This notebook covers the basics of regression analysis, with a special focus on applying it in Python, the world's most popular language (citation needed). It gives a brief, introductory treatment of the following topics:

- **Linear Regression**
   - **Univariate Linear Regression**
   - **Multivariate Linear Regression**

## **Univariate Linear Regression**

**Key points:** 

 - Only a single predictor variable is used ([MIT OCW](http://www.mit.edu/~6.s085/notes/lecture3.pdf)).
 - Fits an equation of the form y = $\beta$<sub>0</sub> + $\beta$<sub>1</sub>x, where y is the response variable and x is the predictor variable. $\beta$<sub>1</sub> is the slope and $\beta$<sub>0</sub> is the intercept. This is the familiar y = mx + c (or, for some, y = mx + b).
 - The model assumes that the relationship between x and y follows a straight line.
 - The model also assumes no categorical data, i.e. all values are continuous.
 
 To extend the equation, an error (noise) term is added, giving what is known as a probabilistic model for linearly related data ([MIT OCW](http://www.mit.edu/~6.s085/notes/lecture3.pdf)):
 y = $\beta$<sub>0</sub> + $\beta$<sub>1</sub>x then becomes y<sub>i</sub> = $\beta$<sub>0</sub> + $\beta$<sub>1</sub>x<sub>i</sub> + $\epsilon$<sub>i</sub>, where $\epsilon$<sub>i</sub> is the (Gaussian) error/noise term for the i-th observation.
 
 **The aim** 
 
 Estimate $\beta$<sub>0</sub> and $\beta$<sub>1</sub> with the fitted model $\hat{y}$ = $\hat{\beta}$<sub>0</sub> + $\hat{\beta}$<sub>1</sub>x, choosing the estimates that minimise the error, i.e. the differences between predicted and actual values ([Eric](https://risk-engineering.org/static/PDF/slides-linear-regression.pdf)).
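
As a minimal sketch of what "minimising the error" means in practice: for a single predictor, the least-squares estimates have a closed form. The slope estimate is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept estimate is the mean of y minus the slope estimate times the mean of x. The snippet below applies this to small synthetic data. The function name `fit_simple_ols`, the "true" coefficients and the noise level are illustrative assumptions, not part of the original analysis.

```python
import random


def fit_simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) minimising the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Synthetic data drawn from y_i = beta0 + beta1 * x_i + eps_i (assumed values)
random.seed(0)
true_beta0, true_beta1 = 2.0, 0.5
x = [i / 10 for i in range(200)]
y = [true_beta0 + true_beta1 * xi + random.gauss(0, 0.1) for xi in x]

beta0_hat, beta1_hat = fit_simple_ols(x, y)
print(f"intercept estimate: {beta0_hat:.3f}, slope estimate: {beta1_hat:.3f}")
```

With this little noise, the estimates should land close to the assumed coefficients; with noisier data they would scatter more widely around them.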

**Applied Linear Regression: A Python primer**

To perform basic linear regression, we will use `seaborn`, `pandas`, `matplotlib` and `statsmodels`/`scipy`.

The data set (carbon nanotubes) used in this example can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00448/) or downloaded from [sweetpy](https://github.com/Nelson-Gon/sweetpy).

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
import statsmodels
import scipy

**Reading the data and exploratory data analysis**

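We shall use the `pandas` function `read_csv` to read our data. The file uses `;` as the field separator and `,` as the decimal mark, hence the `sep` and `decimal` arguments.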

In [11]:
carbon_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00448/carbon_nanotubes.csv",
                       sep= ";", decimal = ",")
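
To illustrate what the `sep` and `decimal` arguments do, here is a small sketch that parses an in-memory string instead of the remote file. The two-row sample text and the column names `n` and `u` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample in the same style: ';' separates fields, ',' is the decimal mark
sample_text = "n;u\n2;0,679005\n2;0,717298\n"

sample_df = pd.read_csv(io.StringIO(sample_text), sep=";", decimal=",")

# With decimal=",", the 'u' column is parsed as floats rather than strings
print(sample_df.dtypes)
print(sample_df)
```

Without `decimal=","`, values such as `0,679005` would be read as text, and numeric summaries of those columns would not work as intended.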


In [12]:
carbon_df.head(6)

Unnamed: 0,Chiral indice n,Chiral indice m,Initial atomic coordinate u,Initial atomic coordinate v,Initial atomic coordinate w,Calculated atomic coordinates u',Calculated atomic coordinates v',Calculated atomic coordinates w'
0,2,1,0.679005,0.701318,0.017033,0.721039,0.730232,0.017014
1,2,1,0.717298,0.642129,0.231319,0.738414,0.65675,0.232369
2,2,1,0.489336,0.303751,0.088462,0.477676,0.263221,0.088712
3,2,1,0.413957,0.632996,0.040843,0.408823,0.657897,0.039796
4,2,1,0.334292,0.543401,0.15989,0.303349,0.558807,0.157373
5,2,1,0.510664,0.696249,0.255128,0.496977,0.725608,0.25597


**What do the columns mean?**

The data above was provided by [Aci & Avci, 2016](https://doi.org/10.1007/s00339-016-0153-1). The attributes are described as follows:

- Chiral indice n: n parameter of the selected chiral vector.
- Chiral indice m: m parameter of the selected chiral vector.
- Initial atomic coordinate u: Randomly generated u parameter of the initial atomic coordinates of all carbon atoms.
- Initial atomic coordinate v: Randomly generated v parameter of the initial atomic coordinates of all carbon atoms.
- Initial atomic coordinate w: Randomly generated w parameter of the initial atomic coordinates of all carbon atoms.
- Calculated atomic coordinate u': Calculated u' parameter of the atomic coordinates of all carbon atoms.
- Calculated atomic coordinate v': Calculated v' parameter of the atomic coordinates of all carbon atoms.
- Calculated atomic coordinate w': Calculated w' parameter of the atomic coordinates of all carbon atoms.


It should be noted that the authors used artificial neural networks (ANNs). The basic model is, according to the authors, C<sub>h</sub> = na<sub>1</sub> + ma<sub>2</sub> $\equiv$ (n,m), with n and m being integer chiral indices.

**Exploratory Data Analysis**

1. **Basic statistics about the data**

In [14]:
carbon_df.describe()

Unnamed: 0,Chiral indice n,Chiral indice m,Initial atomic coordinate u,Initial atomic coordinate v,Initial atomic coordinate w,Calculated atomic coordinates u',Calculated atomic coordinates v',Calculated atomic coordinates w'
count,10721.0,10721.0,10721.0,10721.0,10721.0,10721.0,10721.0,10721.0
mean,8.225725,3.337189,0.500064,0.500072,0.499637,0.500064,0.500072,0.499834
std,2.138919,1.683881,0.286524,0.286495,0.288503,0.290935,0.291012,0.289095
min,2.0,1.0,0.045149,0.045149,6.1e-05,0.038504,0.03893,0.0
25%,7.0,2.0,0.218041,0.217594,0.249483,0.213364,0.212922,0.249242
50%,8.0,3.0,0.500181,0.500297,0.500057,0.500538,0.50002,0.499755
75%,10.0,5.0,0.781959,0.782709,0.749191,0.786588,0.787161,0.749463
max,12.0,6.0,0.954851,0.954851,0.999411,0.961496,0.96107,1.0


**Finding Missingness**

In [22]:
# Missingness: does each column contain any missing values?
carbon_df.isnull().any()

Chiral indice n                     False
Chiral indice m                     False
Initial atomic coordinate u         False
Initial atomic coordinate v         False
Initial atomic coordinate w         False
Calculated atomic coordinates u'    False
Calculated atomic coordinates v'    False
Calculated atomic coordinates w'    False
dtype: bool
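
For comparison, the sketch below shows how a missing value would show up in such a check, and how to count missing values per column. The toy frame and its column names are hypothetical and unrelated to the nanotube data.

```python
import pandas as pd

# Hypothetical frame with one missing value, for illustration only
toy_df = pd.DataFrame({"n": [2, 2, 3], "u": [0.68, float("nan"), 0.49]})

has_missing = toy_df.isnull().any()     # True/False per column
missing_counts = toy_df.isnull().sum()  # number of missing values per column

print(has_missing)
print(missing_counts)
```

Counting with `sum()` is often more informative than a plain True/False flag, because it shows how many rows would be affected by dropping or imputing values.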

The missingness check on `carbon_df` shows that our data has no missing values, which is great. Since we want to carry out univariate linear regression, we now need to choose a response and a single predictor, keeping in mind our equation above. In other words, which features can be used to predict each other? For instance, we can fit a univariate linear model that aims to predict the initial atomic coordinate v from chiral index m alone.

In [43]:
import statsmodels.api as sm


In [54]:
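# The model
from IPython.display import display
x = carbon_df["Chiral indice m"]  # predictor
y = carbon_df["Initial atomic coordinate v"]  # response
# Note: add_constant is commented out, so no intercept is fitted (regression through the origin)
#x = sm.add_constant(x)
model = sm.OLS(y,x).fit()

# Fitted values for the observed predictor
predictions = model.predict(x)
model.summary()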

0,1,2,3
Dep. Variable:,Initial atomic coordinate v,R-squared (uncentered):,0.6
Model:,OLS,Adj. R-squared (uncentered):,0.6
Method:,Least Squares,F-statistic:,16090.0
Date:,"Sat, 17 Aug 2019",Prob (F-statistic):,0.0
Time:,15:15:16,Log-Likelihood:,-4390.1
No. Observations:,10721,AIC:,8782.0
Df Residuals:,10720,BIC:,8790.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Chiral indice m,0.1194,0.001,126.851,0.000,0.118,0.121

0,1,2,3
Omnibus:,834.833,Durbin-Watson:,0.135
Prob(Omnibus):,0.0,Jarque-Bera (JB):,294.088
Skew:,-0.119,Prob(JB):,1.3800000000000002e-64
Kurtosis:,2.224,Cond. No.,1.0


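From the above results, fitting a linear model to predict the initial atomic coordinate v from chiral index m alone gives a statistically significant model (p-value < 0.05) with a moderate adjusted R-squared of 0.6. Note, however, that the model was fitted without an intercept (the `add_constant` line is commented out), so `statsmodels` reports the uncentered R-squared. This measures variation around zero rather than around the mean of y, and so is not directly comparable to the usual R-squared of a model with an intercept.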
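
To see how much a missing intercept can matter, the following sketch uses synthetic data in which y is unrelated to x. It compares the uncentered R-squared of a through-the-origin fit with the centered R-squared of a fit that includes an intercept. The data and variable names are illustrative assumptions. The formulas are the standard ones: for the no-intercept fit, the slope is Σxy / Σx² and the uncentered R-squared is 1 − SSR / Σy²; with an intercept, the R-squared is 1 − SSR / Σ(y − ȳ)².

```python
import random

random.seed(1)
# Hypothetical data: x is an integer in 1..6, y is uniform noise in [0, 1), independent of x
x = [random.randint(1, 6) for _ in range(5000)]
y = [random.random() for _ in range(5000)]
n = len(x)

# Fit through the origin: y ≈ b * x
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
ssr_origin = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
r2_uncentered = 1 - ssr_origin / sum(yi * yi for yi in y)

# Fit with an intercept: y ≈ b0 + b1 * x
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
ssr_intercept = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
r2_centered = 1 - ssr_intercept / sum((yi - y_bar) ** 2 for yi in y)

print(f"uncentered R-squared (no intercept): {r2_uncentered:.3f}")
print(f"centered R-squared (with intercept): {r2_centered:.3f}")
```

In this synthetic setting the uncentered value comes out far larger than the centered one, even though x carries no information about y. This does not show that the fit above is uninformative, only that an uncentered R-squared should not be read like an ordinary one. Refitting with `sm.add_constant(x)` would give a directly interpretable value.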