![](DaThabor_Logo.png)

----------

# **SIMPLE DATA SCIENCE**

<br>

May, 2020

<br>

## Simple Linear Regression

<br>

----------

<br>

> <span style="font-family: Verdana; font-size:18;color:darkblue;"> **If I wouldn't question my data, the data might question me -- DaThabor, May 2020** </span>

<br>

<br>

<img src="Theory-icon.png" alt="Drawing" style="width: 150px;"/>

<br>


<br>

### <span style="color:green">**THEORY**</span>

<br>

<span style="color:green">Simple linear regression is a statistical method for summarizing and studying the relationship between two continuous (quantitative) variables:</span>

<span style="color:green"> One variable, denoted **x**, is regarded as the predictor, explanatory, or independent variable. </span>
<span style="color:green"> The other variable, denoted **y**, is regarded as the response, outcome, or dependent variable. </span>

<span style="color:green"> Because the other terms are used less frequently today, we'll use the terms `independent variable` and `dependent variable` throughout. </span>

<span style="color:green"> The other terms are mentioned only to make you aware of them should you encounter them. Simple linear regression gets its adjective "simple," because it concerns the study of only one independent variable. In contrast, multiple linear regression, which we study later, gets its adjective "multiple," because it concerns the study of two or more independent variables.</span>
    
<br>
    
<img src="ML_icon.jpg" alt="" style="width: 200px;"/>

<br>
    
    
<br>
    
<img src="data.png" alt="" style="width: 180px;"/>
    

### <span style="color:blue">**MATHEMATICS**</span>

<br>

<img src="Maths-icon.png" alt="" style="width: 180px;"/>

<br>

<span style="color:blue">The least squares method</span>

In [8]:
%%HTML
<video width="640" height="480" controls>
  <source src="Theory_StraightLine.mp4" type="video/mp4">
</video>
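
As a short sketch in standard textbook notation (not taken from the video): given $n$ observations $(x_i, y_i)$, the least squares method chooses the intercept $\beta_{0}$ and slope $\beta_{1}$ that minimize the sum of squared vertical distances between the observations and the line.

$$S(\beta_{0}, \beta_{1}) = \sum_{i=1}^{n} \left(y_i - \beta_{0} - \beta_{1} x_i\right)^2$$

Setting the partial derivatives of $S$ with respect to $\beta_{0}$ and $\beta_{1}$ to zero gives the estimates, where $\bar{x}$ and $\bar{y}$ are the sample means:

$$\hat{\beta}_{1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\,\bar{x}$$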

### Use cases

We will work through two use cases for simple linear regression, applying the same statistical methods to different data sets.

<br>

The use cases are:

- Predict the `Sales` amount, based on any of the independent variables `TV`, `Radio` or `Newspaper`, from the **advertisement** data set. The variables `TV`, `Radio` and `Newspaper` stand for the budget spent on advertisement in any of those categories.
- Predict the Housing price, based on any correlated independent variable in the **Boston housing** data set.

## Step 1: Set up the environment

The first step in any analysis or modeling task is to make sure all required packages are loaded into the environment. Run the following cell to import them.

In [None]:
%run -i "_RunPackages.py"

## Step 2: Load data sets

Next, we need to load the data into our environment. Because we work with more than one use case, we load the data sets for all of them here.

In [None]:
%run -i "_RunLinearRegression_datasets.py"

## Step 3: Load the data dictionary

In [None]:
%run -i "_RunDataDictionary_LinearRegression.py"

Load the data, add the target variable `MEDV` and look at the first 5 observations.

In [None]:
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)

In [None]:
boston_df['MEDV'] = boston.target
boston_df.head()

Next, we look at summary statistics for every feature (column) in the data.

In [None]:
boston_df.describe()

Then we look at the shape of the data (number of rows and columns).

In [None]:
boston_df.shape

To get an idea of the correlation between variables, we plot a correlation heatmap and check which variables are strongly correlated with our target variable `MEDV`.

In [None]:
plt.figure(figsize=(16,6))
sns.heatmap(boston_df.corr(), 
            cmap="YlGnBu", 
            annot = True, 
            linewidths=.5)
plt.show()
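
As a complement to the heatmap, the correlations with the target can also be ranked numerically. The following is a minimal sketch on a hypothetical toy data set (the column names `feature_a`, `feature_b`, `feature_c` and `target` are made up for illustration and are not part of the notebook's data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real data set.
rng = np.random.default_rng(42)
n = 200
feature_a = rng.normal(size=n)
feature_b = rng.normal(size=n)
feature_c = rng.normal(size=n)
toy_df = pd.DataFrame({
    "feature_a": feature_a,
    "feature_b": feature_b,
    "feature_c": feature_c,
    # The target depends positively on feature_a, negatively on feature_b,
    # and not at all on feature_c.
    "target": 3.0 * feature_a - 2.0 * feature_b + rng.normal(size=n),
})

# Pearson correlation of every feature with the target column.
corr_with_target = toy_df.corr()["target"].drop("target")

# Sort by absolute value so strong negative correlations rank high too.
ranked = corr_with_target.reindex(
    corr_with_target.abs().sort_values(ascending=False).index
)
print(ranked)
```

With the notebook's data, the same idea would apply to `boston_df` with `MEDV` as the target column.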

We can see that `RM` has a strong positive correlation with `MEDV`, while `LSTAT` has a strong negative one. Let's have a look at the scatter plots for both.

In [None]:
plt.figure(figsize=(20, 5))

features = ['LSTAT', 'RM']
target = boston_df['MEDV']

for i, col in enumerate(features):
    plt.subplot(1, len(features) , i+1)
    x = boston_df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')

Next, we want to find the equation of the straight line in each plot.

The equation is:

$y = \beta_{0} + \beta_{1}x + \epsilon$

<br>

where:

- $\beta_{0}$ is the y-intercept;
- $\beta_{1}$ is the slope of the line;
- $\epsilon$ is a random error term.
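
To make the roles of the three terms concrete, here is a minimal sketch that simulates data from the equation and then recovers the intercept and slope with the standard closed-form least squares formulas. The "true" parameter values and the noise level are assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical "true" parameters, chosen only for illustration.
beta_0, beta_1 = 2.0, 0.5

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
epsilon = rng.normal(loc=0.0, scale=1.0, size=x.size)  # random error term
y = beta_0 + beta_1 * x + epsilon

# Closed-form least squares estimates of the slope and intercept.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print("Estimated intercept: {:.3f}".format(intercept))
print("Estimated slope: {:.3f}".format(slope))
```

Because of the random error, the estimates should land close to, but not exactly on, the values used to generate the data.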

In [None]:
X = boston_df['RM'].values.reshape(-1,1)
y = boston_df['MEDV'].values.reshape(-1,1)

reg = LinearRegression()
reg.fit(X, y)
print("The linear model is: Y = {:.5} + {:.5}X + \u03B5".format(reg.intercept_[0], 
                                                                reg.coef_[0][0]))

In [None]:
predictions = reg.predict(X)
plt.figure(figsize=(16, 8))
plt.scatter(
    boston_df['RM'],
    boston_df['MEDV'],
    c='black'
)
plt.plot(
    boston_df['RM'],
    predictions,
    c='blue',
    linewidth=2
)
plt.xlabel("Average number of rooms per dwelling")
plt.ylabel("Price")
plt.show()

In [None]:
X_lstat = boston_df['LSTAT'].values.reshape(-1,1)
y_lstat = boston_df['MEDV'].values.reshape(-1,1)

reg = LinearRegression()
reg.fit(X_lstat, y_lstat)
print("The linear model is: Y = {:.5} + {:.5}X + \u03B5".format(reg.intercept_[0], 
                                                                reg.coef_[0][0]))

In [None]:
predictions = reg.predict(X_lstat)
plt.figure(figsize=(16, 8))
plt.scatter(
    boston_df['LSTAT'],
    boston_df['MEDV'],
    c='black'
)
plt.plot(
    boston_df['LSTAT'],
    predictions,
    c='blue',
    linewidth=2
)
plt.xlabel("Percentage of lower status of the population")
plt.ylabel("Price")
plt.show()

Then we fit an Ordinary Least Squares (OLS) model for the linear regression, which also prints a detailed statistical summary. [More info on OLS](https://statisticsbyjim.com/regression/ols-linear-regression-assumptions/)

In [None]:
X = boston_df['RM']
y = boston_df['MEDV']
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

In [None]:
X_lstat = boston_df['LSTAT']
y_lstat = boston_df['MEDV']
X2 = sm.add_constant(X_lstat)
est = sm.OLS(y_lstat, X2)
est2 = est.fit()
print(est2.summary())
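
The R-squared value reported in such a summary is the share of the variation in the dependent variable that the model explains. A minimal sketch of the calculation, using a hypothetical helper `r_squared` and made-up numbers rather than the notebook's data:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Made-up observations, for illustration only.
y_obs = np.array([3.0, 5.0, 7.0, 9.0])
perfect = y_obs.copy()                         # predictions that match exactly
mean_only = np.full_like(y_obs, y_obs.mean())  # always predict the mean

print(r_squared(y_obs, perfect))    # 1.0
print(r_squared(y_obs, mean_only))  # 0.0
```

A model that only ever predicts the mean scores 0, and a perfect fit scores 1, which is why values well below 1 suggest room for improvement.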

From the OLS summaries, the R-squared values are not great, which might mean either that the model is not set up correctly or that a single predictor does not explain `MEDV` well. Let's try a linear regression model that uses both `LSTAT` and `RM` as predictors.

In [None]:
X = pd.DataFrame(np.c_[boston_df['LSTAT'], boston_df['RM']], columns = ['LSTAT','RM'])
Y = boston_df['MEDV']

Split the data into a train and test data set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

Run the Linear Regression model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score  # r2_score is used in the evaluation cell

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)

Evaluate the model

In [None]:
# model evaluation for training set
y_train_predict = lin_model.predict(X_train)
rmse = (np.sqrt(mean_squared_error(Y_train, y_train_predict)))
r2 = r2_score(Y_train, y_train_predict)

print("The model performance for training set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))
print("\n")

# model evaluation for testing set
y_test_predict = lin_model.predict(X_test)
rmse = (np.sqrt(mean_squared_error(Y_test, y_test_predict)))
r2 = r2_score(Y_test, y_test_predict)

print("The model performance for testing set")
print("--------------------------------------")
print('RMSE is {}'.format(rmse))
print('R2 score is {}'.format(r2))

We can now see that with this linear regression model the R-squared value is around 63% for the training set and 66% for the test set. These values are not great, but good enough for now.
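
For reference, the RMSE reported in the evaluation is the square root of the average squared prediction error, so it is expressed in the same units as the target. A minimal sketch with a hypothetical helper `rmse` and made-up numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up observed and predicted values, for illustration only.
y_obs = np.array([20.0, 25.0, 30.0])
y_hat = np.array([22.0, 25.0, 26.0])

# Errors are -2, 0 and 4, so the mean squared error is 20 / 3.
print("RMSE is {:.3f}".format(rmse(y_obs, y_hat)))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.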