# <center>Week 2: Linear Regression from scratch</center>

-----------

<strong>Objective</strong>: To build a <strong>univariate linear regression model</strong> that can predict profit given population size.

Suppose you are the CEO of a restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for <strong>profits</strong> and <strong>populations</strong> from the cities.

You would like to use this data to help you select which city to expand to next.



---------------

# 1.  Load data

To load and plot the data, we use three Python libraries: NumPy, pandas and Matplotlib. They are imported in the code block below.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The file <strong>ex1data1.txt</strong> contains the dataset for our linear regression problem. The <strong>first column</strong> is the <strong>population</strong> of a city and the <strong>second column</strong> is the <strong>profit</strong> of a food truck in that city. A negative value for profit indicates a loss.

In [None]:
# Load dataset
column_names = ["Population","Profit"]  
data = pd.read_csv('data/ex1data1.txt', names = column_names)

- Data shape

In [None]:
# data.shape is (number of rows, number of columns)
print('data shape: {}, row size: {}, column size: {}' \
      .format(data.shape, data.shape[0], data.shape[1]))

-  Display 5 random samples

In [None]:
data.sample(5)

- First 5 samples

In [None]:
data.head(5)

- Last 5 samples

In [None]:
data.tail(5)

----------

# 2. Exploratory Data Analysis

In [None]:
#Data Summary
data.describe()

In [None]:
# Correlation 
corr = data.corr()
corr.style.background_gradient(cmap='Spectral')

In [None]:
# Covariance
cov = data.cov()
cov.style.background_gradient(cmap='Spectral')

-----

## 2.1. Visualization 

It is good practice to visualize your data before building a model. The aim of data visualization is to give you insight into the problem. We will use the <strong>pandas</strong> plotting interface, which is built on <strong>matplotlib</strong>, for the visuals.

- Boxplot

In [None]:
data.Population.plot(kind='box', figsize=(8,8), color='brown')

- Histogram

In [None]:
data.Population.plot(kind='hist', figsize=(12,8))

In [None]:
data.Profit.plot(kind='hist', figsize=(12,8), color='red')

- Scatterplot

In [None]:
# scatterplot
data.plot(kind = 'scatter', x = 'Population', y = 'Profit',
          s= 40, color = 'blue', figsize=(12,8))

# labels
plt.xlabel('Population of City in 10,000s', fontsize = 14)
plt.ylabel('Profit in $10,000s', fontsize =14)

-----

# 3. Problem Formulation

From the exploratory data analysis we can observe that there is a positive correlation between the two variables: profit tends to rise with population. This suggests that a linear model should be able to make reasonable predictions.

We have 97 training examples, with one independent variable `x` in the first column and one dependent variable `y` in the second column.

**Recall:**
- Our untrained model is given by: ![title](img/model.gif)
    
- Which can also be written as ![title](img/model2.gif)

where:
- `x` is the input values
- `y` is the ground truth or actual values
- `theta` is the <strong>weight or learnable</strong> parameters



In [None]:
# input values - xs and 1s
nrows = data.shape[0]
ncols = data.shape[1]

x = data.loc[:, 'Population'].values #converts to Numpy array
x = x.reshape(nrows, 1)  # Alternatively x.reshape(-1,1)

In [None]:
x.shape

Because the bias input `x_0` is always `1`, we want to create a `97 x 2` matrix that contains the input values in the first column and ones in the second column.

In [None]:
one_stack = np.ones((nrows,1))
x_stack = np.hstack((x, one_stack))

In [None]:
x_stack[:3]

In [None]:
# output variable
y = data.loc[:, 'Profit'].values # converts to Numpy array
y = y.reshape(nrows,1)

In [None]:
y.shape

# 4. Model Building

Recall from class that training a linear regression model proceeds as follows.

We want to find the value of `theta` that gives a good estimate of a city's profit when the city's population is supplied.

To do this, 
- We start with an initial value of `theta` (zeros in the code below) to generate a hypothesis
![title](img/model3.gif)

- Then we repeatedly correct the values of `theta` until the deviation of the hypothesis/prediction `h` from the ground truth `y` is greatly reduced

**Note:** 
- Matrix multiplication was utilized for the calculation. 
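
As a minimal sketch of what the matrix multiplication does, the snippet below uses made-up toy numbers (not values from `ex1data1.txt`) and hypothetical names (`x_demo`, `theta_demo`). It assumes the same column layout as `x_stack`: inputs in the first column, ones in the second.

```python
import numpy as np

# Toy inputs (hypothetical values, not taken from ex1data1.txt)
x_demo = np.array([[1.0], [2.0], [3.0]])

# Same layout as x_stack: inputs in column 0, ones in column 1
x_demo_stack = np.hstack((x_demo, np.ones((3, 1))))

# Hypothetical parameters: theta_demo[0] is the slope, theta_demo[1] the intercept
theta_demo = np.array([[2.0], [0.5]])

# One matrix product computes the hypothesis for every row at once
h_matrix = x_demo_stack @ theta_demo

# The same values computed element by element
h_manual = theta_demo[0, 0] * x_demo + theta_demo[1, 0]

print(h_matrix.ravel())                 # [2.5 4.5 6.5]
print(np.allclose(h_matrix, h_manual))  # True
```

Each row of the product is `slope * x + intercept`, which is why the column of ones is needed.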

In [None]:
def train(x,y):
    print_every = 50 
    iteration = 2500
    
    # Zero initialization of parameters, as a 2 x 1 NumPy column vector
    theta = np.zeros((2, 1))
    
    # Save the cost (squared error) at every iteration so that we can see
    # how the deviation of the hypothesis from the ground truth reduces.
    # Note: cost() and update_weight() read y from the notebook's global scope.
    cost_function = np.zeros(iteration)
    
    for i in range(0, iteration):
        # Step 1: make a prediction using the current weights (theta)
        # @ is the matrix multiplication operator
        h = x @ theta
        
        # Step 2: take a step to correct the weights (theta) so that the next prediction will be better
        theta = update_weight(h, theta, x)
        
        # Step 3: measure the deviation or error
        cost_function[i] = cost(x, theta)
        
        # Display the result every print_every iterations
        if i % print_every  == 0:
            print("Iteration: {}, Cost function: {} ".format(i, cost_function[i]))
    
    return theta, cost_function

But how do we measure the error mentioned above?

Remember the error formula (cost function)? 
![title](img/model44.gif)

where:
- `m` is the number of training examples
- `x` is the input data
- `h` is the hypothesis (the prediction)
- `y` is the ground truth

The equation measures the squared error between the ground truth and the prediction.
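
To make the formula concrete, here is a small sketch on toy data. It assumes the cost is the usual halved mean squared error, `1 / (2m) * sum((h - y)^2)`. The names `cost_demo`, `x_toy` and `y_toy` are hypothetical and the numbers are made up.

```python
import numpy as np

def cost_demo(x, y, theta):
    """Halved mean squared error between predictions x @ theta and targets y."""
    m = x.shape[0]
    h = x @ theta
    return (1 / (2 * m)) * np.sum(np.square(h - y))

# Toy data lying exactly on the line y = 2x (inputs in column 0, ones in column 1)
x_toy = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y_toy = np.array([[2.0], [4.0], [6.0]])

perfect_theta = np.array([[2.0], [0.0]])  # slope 2, intercept 0
zero_theta = np.zeros((2, 1))             # predicts 0 for every input

cost_perfect = cost_demo(x_toy, y_toy, perfect_theta)
cost_zero = cost_demo(x_toy, y_toy, zero_theta)

print(cost_perfect)  # 0.0
print(cost_zero)     # (4 + 16 + 36) / 6, about 9.33
```

A perfect fit gives a cost of zero, and the cost grows as the predictions move away from the ground truth.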



In [None]:
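def cost(x, theta):
    m = nrows  # number of training examples
    h = x @ theta  # predictions for every row of x
    # Halved mean of the squared errors: 1 / (2m) * sum((h - y)^2)
    return (1 / (2 * m)) * np.sum(np.square(h - y))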

Now that we can measure the error, how do we update the weights (theta) so that the model predicts better?

As we discussed in class, the gradient descent algorithm will be used for this purpose.

The general formula for the gradient descent update is given below:
![title](img/model6.gif)
which can be differentiated to give:
![title](img/model5.gif)
where
- alpha is the <strong>learning rate</strong>

i.e. we repeatedly update the weights (theta) by taking a step of size alpha against the gradient of the error, until the cost is sufficiently minimized.
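
The effect of the update can be sketched on toy data. The snippet assumes the vectorized gradient `(1 / m) * X.T @ (h - y)` of the halved mean squared error. The step size `alpha_demo = 0.1` and all names are hypothetical choices for illustration.

```python
import numpy as np

# Toy data on the line y = 2x (inputs in column 0, ones in column 1)
x_toy = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y_toy = np.array([[2.0], [4.0], [6.0]])
m_toy = x_toy.shape[0]
alpha_demo = 0.1  # hypothetical learning rate, chosen for this toy data

def cost_demo(theta):
    # Halved mean squared error on the toy data
    return (1 / (2 * m_toy)) * np.sum(np.square(x_toy @ theta - y_toy))

theta_demo = np.zeros((2, 1))
costs = [cost_demo(theta_demo)]

for _ in range(3):
    # Gradient of the cost with respect to theta
    gradient = (1 / m_toy) * (x_toy.T @ (x_toy @ theta_demo - y_toy))
    # Step against the gradient, scaled by the learning rate
    theta_demo = theta_demo - alpha_demo * gradient
    costs.append(cost_demo(theta_demo))

print(costs)  # each value is smaller than the one before it
```

If the learning rate is too large the cost can grow instead of shrink, so it is worth checking the first few cost values.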

In [None]:
def update_weight(h, theta, x):
    m = nrows
    alpha = 0.001
    theta = theta - alpha * (1 / m * ( x.T @ (h - y)))
    return theta

- Now, let's pass in our data and train.

In [None]:
theta, cost_values = train(x_stack,y)

In [None]:
# Now, let's see our cost_values
cost_values

In [None]:
# Now, let's see our trained weights
print('Our learned value of theta: ',theta.ravel())

In [None]:
theta.shape

In [None]:
theta.ravel().shape # Alternatively theta.squeeze().shape

In [None]:
plt.plot(cost_values)
plt.ylabel('Cost J')
plt.xlabel('Iterations')
plt.show()

Notice that after about the `1500th iteration`, the cost barely changes anymore. You could stop iterating at that point and save processing cost.
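
One way to stop automatically is to end the loop once the cost improves by less than a small tolerance. The sketch below is one possible approach, not the method used in class. The function `train_with_tolerance`, its parameters and the toy data are all hypothetical.

```python
import numpy as np

def train_with_tolerance(x, y, alpha=0.01, max_iteration=2500, tolerance=1e-6):
    """Gradient descent that stops early once the cost stops improving."""
    m = x.shape[0]
    theta = np.zeros((x.shape[1], 1))
    costs = []
    for i in range(max_iteration):
        h = x @ theta
        theta = theta - alpha * (1 / m) * (x.T @ (h - y))
        costs.append((1 / (2 * m)) * np.sum(np.square(x @ theta - y)))
        # Stop when the improvement between two iterations is negligible
        if i > 0 and abs(costs[-2] - costs[-1]) < tolerance:
            break
    return theta, np.array(costs)

# Toy data on the line y = 2x (inputs in column 0, ones in column 1)
x_toy = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y_toy = np.array([[2.0], [4.0], [6.0]])

theta_demo, costs_demo = train_with_tolerance(x_toy, y_toy, alpha=0.1)
print("Stopped after {} iterations".format(len(costs_demo)))
print("theta:", theta_demo.ravel())
```

The tolerance trades accuracy for speed: a smaller value runs longer but ends closer to the minimum.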

In [None]:
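x_plot = np.arange(5,23)
y_plot = theta[0]*x_plot + theta[1]  # slope * x + intercept
# Plot the training data and the line fitted by gradient descent
plt.scatter(x[:,0], y, s=30, c='r', marker='o', linewidths=1)
plt.plot(x_plot,y_plot, label='Linear regression (Gradient descent)')
plt.legend()  # show the label of the fitted line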

---------------

# 5. Prediction

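Here, we plug the learned weights into the model that we defined earlier.
![Title](img/model.gif)

In [None]:
def predict():
    # Population is entered in units of 10,000 (e.g. 7 means 70,000 people)
    input_ = float(input('Enter the population size (in 10,000s): '))
    # theta[0] is the slope and theta[1] the intercept, matching the columns of x_stack
    profit = (theta[0] * input_ + theta[1]).item()
    # Profit is also expressed in units of $10,000, so convert both to plain numbers
    print("For a population of {:,.0f}, the estimated profit is ${:,.2f}".format(input_ * 10000, profit * 10000))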

In [None]:
predict()
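
For scripted use, a non-interactive variant may be handier. The sketch below assumes the same parameter order as above (slope first, intercept second). The function `predict_profit` is hypothetical, and `theta_example` holds made-up weights, not the values learned in this notebook.

```python
import numpy as np

def predict_profit(population_10k, theta):
    """Return the estimated profit in dollars for a population given in units of 10,000."""
    # theta[0] is the slope and theta[1] the intercept (same order as the x_stack columns)
    profit_10k = theta[0, 0] * population_10k + theta[1, 0]
    # Profit is modelled in units of $10,000, so convert to dollars
    return profit_10k * 10000

# Hypothetical weights for illustration only, not the values learned above
theta_example = np.array([[1.2], [-4.0]])

estimate = predict_profit(7.0, theta_example)
print("Estimated profit: ${:,.2f}".format(estimate))
```

Passing `theta` as an argument keeps the function independent of notebook state, which makes it easier to test.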

----------------------

# Assignment


1. Use `scikit-learn` to develop a linear regression model on the same dataset used in this practice and compare the results.
2. Apply data normalization and compare your solution with the above result.
3. Rewrite the train function so that `iteration`, `print_every` and `alpha` are passed as arguments.


Assignment is due for submission on `26/09/2019`.

The submission link will be posted on the `SLACK CHANNEL/MAIL`.

------------------

## Credit

This exercise is adapted from [Andrew Ng Machine Learning Course](https://www.coursera.org/learn/machine-learning).