# Problem Statement
How might we predict a country’s **protein supply per capita** through the analysis of social, environmental, and economic factors such as **GDP, temperature change, consumer prices, import and export quantity and production quantity**?

# Overall 7-Step Methodology for the ML Project
Reference: [Google's 7 steps of ML](https://towardsdatascience.com/the-googles-7-steps-of-machine-learning-in-practice-a-tensorflow-example-for-structured-data-96ccbb707d77)

Step 1: [Gathering Data](#step1)    
Step 2: [Data Preparation](#step2)  
Step 3: [Choosing a model](#step3)  
Step 4: [Training](#step4)  
Step 5: [Evaluation](#step5)  
Step 6: [Parameter Tuning](#step6)  
Step 7: [Prediction](#step7)  

Step 0: [Model Improvements](#modelimp)


<a id='step1'></a>
# Step 1: Gathering Data
  
Our data is sourced from FAOSTAT \<insert link>  
  
Before selecting our predictors, we first built a persona for each statistic. This is done below in [Step 2: Data Preparation](#step2), where graphs map out potentially interesting characteristics of each country in relation to each predictor.  

We have summarized our findings in the table below:  
\<Insert table of persona in markdown>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<a id='step2'></a>
# Step 2: Data Preparation
1. Aggregate the data by country (e.g. averaging temperature changes, summing production/consumption over the year and averaging by population)
   * Ensures all data can be compared fairly
   * Saved as a new csv file with new columns, Data_aggregated.csv

2. Visualize the data and the relationship of protein supply against each X with scatter plots, using seaborn and matplotlib
   * To check whether any obvious relationships can be seen
   * Persona analysis by observing the distribution of data for each X

3. Preprocessing: drop null rows and missing data
   * Saved as a new csv file, Data_dropped.csv

4. Normalize the data with Z-normalization
   * Saved as a new csv file, Data_norm.csv
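
The aggregation, dropping and normalization steps above can be sketched on a toy table. This is only an illustration under stated assumptions: the column names (`country`, `temp_change`, `gdp`) and the values are hypothetical, and averaging is used as the aggregation for every column, which may not match the aggregation chosen for each real statistic.

```python
import pandas as pd

# Hypothetical raw records: one row per country and year (not the real dataset).
raw = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "temp_change": [0.5, 0.7, 1.0, 1.2, 0.2, None],
    "gdp": [100.0, 120.0, 300.0, 340.0, None, None],
})

# Step 1: aggregate by country (here: the mean over the available years).
aggregated = raw.groupby("country").mean()

# Step 3: drop countries that still have a missing value after aggregation.
dropped = aggregated.dropna()

# Step 4: z-normalize each column (subtract the column mean, divide by the column std).
normalized = (dropped - dropped.mean(axis=0)) / dropped.std(axis=0)

print(normalized)
```

Country C has no GDP value at all, so it is removed in the dropping step; the remaining columns end up with mean 0 and standard deviation 1.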

In [None]:
def normalize_z(dfin):
    # z-normalize each column: subtract the column mean, divide by the column standard deviation
    mean = dfin.mean(axis=0)
    std = dfin.std(axis=0)
    dfout = (dfin - mean) / std
    return dfout

def get_features_targets(df, feature_names, target_names):
    df_feature = df[feature_names]
    df_target = df[target_names]
    return df_feature, df_target

def prepare_feature(df_feature):
    # this is to convert table of x independent variables to X matrix, hence first column is all ones as x_0=1
    np_feature = df_feature.to_numpy()
    np_m = np_feature.shape[0]
    if np_feature.ndim == 1: 
        np_feature = np_feature.reshape(np_m, 1)
    big_X = np.concatenate((np.ones((np_m, 1), dtype=float), np_feature), axis=1)
    return big_X

def prepare_target(df_target):
    # this is for the y dependent variable to be predicted
    np_target = df_target.to_numpy()
    return np_target

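def predict(df_feature, beta):
    # normalizes with the mean and std of the data passed in, then applies the linear model
    # (calc_linear is defined in Step 3)
    norm_feature = normalize_z(df_feature)
    np_feature = prepare_feature(norm_feature)
    y_hat = calc_linear(np_feature, beta)
    return y_hat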

def split_data(df_feature, df_target, random_state=None, test_size=0.5):
    # df.sample and df.drop can also do this 
    np.random.seed(random_state)
    
    df_feature_rows = df_feature.shape[0]
    df_target_rows = df_target.shape[0]
    
    df_feature_split = int(test_size * df_feature_rows)
    df_target_split = int(test_size * df_target_rows)
    
    df_feature_rando = np.random.choice(df_feature_rows, size=df_feature_rows, replace=False)  
    df_feature_test = df_feature.iloc[df_feature_rando[:df_feature_split]]   # split the randomized index and use to get values in the idx rows
    df_feature_train = df_feature.iloc[df_feature_rando[df_feature_split:]]
    
    df_target_test = df_target.iloc[df_feature_rando[:df_target_split]]
    df_target_train = df_target.iloc[df_feature_rando[df_target_split:]]
    
    return df_feature_train, df_feature_test, df_target_train, df_target_test
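
The comment in `split_data` notes that `df.sample` and `df.drop` can do the same job. A minimal sketch of that alternative follows; the name `split_data_sample` and the toy data are hypothetical, and it assumes both dataframes share the same unique index. Note that `sample(frac=...)` rounds the test size, whereas `split_data` truncates it with `int()`, so the two can differ by one row.

```python
import pandas as pd

def split_data_sample(df_feature, df_target, random_state=None, test_size=0.5):
    # Draw the test rows at random, then keep every remaining row for training.
    df_feature_test = df_feature.sample(frac=test_size, random_state=random_state)
    df_feature_train = df_feature.drop(df_feature_test.index)
    # Reuse the same index labels so that features and targets stay aligned.
    df_target_test = df_target.loc[df_feature_test.index]
    df_target_train = df_target.drop(df_feature_test.index)
    return df_feature_train, df_feature_test, df_target_train, df_target_test

# Toy data for illustration only.
features = pd.DataFrame({"x1": range(10), "x2": range(10, 20)})
targets = pd.DataFrame({"y": range(100, 110)})

X_train, X_test, y_train, y_test = split_data_sample(
    features, targets, random_state=100, test_size=0.3
)
print(len(X_train), len(X_test))
```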

<a id='step3'></a>
# Step 3: Choosing a model

Our chosen model is multiple linear regression.   

In our model, we hypothesize that the protein supplied to people in a country is represented by the function $\hat{y}$ as follows:

$$\hat{y}(x) =  \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots + \hat{\beta}_n x_n$$  

#### <center> OR </center>  

$$\mathbf{\hat{y}} = \mathbf{X} \times \mathbf{\hat{b}}$$

The cost function of our model is the following. It is the sum of squared errors divided by $2m$, i.e. $\frac{1}{2}$MSE:
$$J(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_n) = \frac{1}{2m}\sum^m_{i=1}\left(\hat{y}(x^i)-y^i\right)^2$$

Taking the gradient of our cost $J$ with respect to $\hat{\beta}$ (written $\mathbf{\hat{b}}$ in matrix form), we can derive our gradient descent update equation as the following, where $\alpha$ is the learning rate:
$$ \mathbf{\hat{b}} = \mathbf{\hat{b}} - \alpha\frac{1}{m} \mathbf{X}^T \times (\mathbf{X}\times \mathbf{\hat{b}} - \mathbf{y}) $$

With this, we get a column vector of $\hat{\beta}$ for our multiple linear regression model which best fits our data.  
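
The update can be traced back to the cost in a few lines. The following is a sketch of the standard least-squares derivation in the matrix notation used above; the symbol $\nabla J$ for the gradient is introduced here only for illustration. Writing the cost in matrix form and differentiating with respect to $\mathbf{\hat{b}}$ gives:

$$ J(\mathbf{\hat{b}}) = \frac{1}{2m} (\mathbf{X}\mathbf{\hat{b}} - \mathbf{y})^T (\mathbf{X}\mathbf{\hat{b}} - \mathbf{y}) $$

$$ \nabla J(\mathbf{\hat{b}}) = \frac{1}{m} \mathbf{X}^T (\mathbf{X}\mathbf{\hat{b}} - \mathbf{y}) $$

Each gradient descent step then moves $\mathbf{\hat{b}}$ a distance proportional to $\alpha$ against the gradient:

$$ \mathbf{\hat{b}} \leftarrow \mathbf{\hat{b}} - \alpha \nabla J(\mathbf{\hat{b}}) $$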
  
To check if we have sufficiently minimized our cost, we can plot the cost $J$ at each iteration against the iteration number; the curve should flatten out as gradient descent converges.

In [None]:
def calc_linear(X, beta):
    return np.matmul(X, beta)

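def compute_cost(X, y, beta):
    # J = (1/2m) * sum of squared errors; y and beta are expected to be column vectors
    m = X.shape[0]
    error = calc_linear(X, beta) - y
    error_sq = np.matmul(error.T, error)   # 1x1 matrix holding the sum of squared errors
    J = (1/(2*m))*error_sq
    J = J[0][0]                            # unwrap the 1x1 matrix into a scalar
    return J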

def gradient_descent(X, y, beta, alpha, num_iters):
    m = X.shape[0] 
    J_storage = np.zeros((num_iters, 1))
    
    for i in range(num_iters):
        beta = beta - (alpha / m) * np.matmul(X.T, (calc_linear(X, beta) - y))
        J_storage[i] = compute_cost(X, y, beta)
    return beta, J_storage
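
As a quick sanity check of the update rule, the loop below runs it on synthetic data. This is a sketch under stated assumptions: the data are made up so that $y = 2 + 3x$ holds exactly, and the learning rate and iteration count are arbitrary choices, not tuned values from this project. The update is written inline so that the snippet runs on its own.

```python
import numpy as np

# Synthetic data (assumption): y = 2 + 3*x exactly, with x centred on 0.
x = np.array([-1.5, -0.5, 0.0, 0.5, 1.5]).reshape(-1, 1)
X = np.concatenate((np.ones((5, 1)), x), axis=1)   # first column of ones for the intercept
y = 2 + 3 * x

m = X.shape[0]
beta = np.zeros((2, 1))
alpha = 0.1        # arbitrary learning rate for this toy problem
costs = []

for _ in range(500):
    error = X @ beta - y
    beta = beta - (alpha / m) * (X.T @ error)       # gradient descent update
    residual = X @ beta - y
    costs.append((residual ** 2).sum() / (2 * m))   # cost J after this update

print(beta.ravel())   # should approach the true coefficients 2 and 3
print(costs[0], costs[-1])
```

Plotting `costs` against the iteration number gives the convergence curve described above; a cost that stops decreasing indicates that more iterations will not help.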

<a id='step4'></a>
# Step 4: Training

<a id='step5'></a>
# Step 5: Evaluation

In terms of metrics, there are 3 that work for linear regression (excluding their upgraded variations that work for multiple linear regression):
- R squared (single linear regression only) -> Adjusted R squared (both single and multiple linear regression)
- Mean Squared Error (MSE) -> Root Mean Squared Error (RMSE) (i.e. just corrected for the squaring of the error)
- Mean Absolute Error (MAE)

MSE, RMSE and MAE measure the error of the model against the actual data. All 3 metrics range from 0 to $\infty$, and are negatively oriented (i.e. the lower, the better).
  
## R-Squared
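R squared measures the fit of the model: it is the proportion of the variation in the dependent variable that is explained by the independent variables in the regression model. On its own, R squared is only reliable for single linear regression, because it increases or remains the same whenever a predictor is added, even an irrelevant one, making the model seem like a better fit than it actually is.

Instead, we use **Adjusted R-Squared**, which adjusts for the degrees of freedom of the data and penalizes irrelevant predictors. 
$$ R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{RSS}/(n-p)}{\mathrm{TSS}/(n-1)} $$

Here RSS is the residual sum of squares, TSS is the total sum of squares, $n$ is the number of observations and $p$ is the number of fitted coefficients, including the intercept.

Adjusted R-Squared is positively oriented (i.e. higher is a better fit). It is at most 1 and can fall below 0 for a model that fits very poorly.  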
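
The formula translates directly into code. The function below is a minimal sketch, not part of the original notebook: the name `adjusted_r2_score` is hypothetical, and it assumes that `p` counts every fitted coefficient including the intercept, which is the reading under which $n - p$ is the residual degrees of freedom. The toy values are made up for illustration.

```python
import numpy as np

def adjusted_r2_score(y, ypred, p):
    # p: number of fitted coefficients, counting the intercept (assumption).
    n = y.shape[0]
    rss = np.sum((y - ypred) ** 2)        # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
    return 1 - (rss / (n - p)) / (tss / (n - 1))

# Toy values for illustration only.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ypred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

plain_r2 = 1 - np.sum((y - ypred) ** 2) / np.sum((y - y.mean()) ** 2)
adj_r2 = adjusted_r2_score(y, ypred, p=2)   # intercept plus one predictor
print(plain_r2, adj_r2)
```

The adjusted value is slightly lower than the plain R squared, and the gap widens as `p` grows relative to `n`.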

## Mean Squared Error
MSE and RMSE rank models in the same way, since RMSE is just the square root of MSE, but RMSE is expressed in the same units as the target variable. 

Mean Squared Error = Mean of squared error = $ \frac{1}{n}\sum_{i=1}^{n}(\hat{y}(x^i)-y^i)^2 $  
Root Mean Squared Error = $ \sqrt{MSE} $
  
MSE and RMSE are sensitive to outliers. This is because they give higher weight to larger errors (residuals), so they are better used when large errors are particularly undesirable. 
* Because large errors are over-represented in RMSE, minimizing RMSE pushes the model to avoid them.

This sensitivity is due to the squaring of the errors: squaring widens the ratio between a large error $b$ and a small error $a$, which we can see mathematically from how:
$$ \left(\frac{b}{a}\right)^2 > \frac{b}{a} \quad \text{for } b > a > 0$$


## Mean Absolute Error
MAE = Mean of absolute error = $\frac{1}{n}\sum_{i=1}^{n}|\hat{y}(x^i)-y^i|$  
MAE treats every error equally and just finds their average.  

However, absolute values are less convenient to work with mathematically: the absolute value is a piecewise function, so it is not differentiable at 0 and has to be handled case by case, both algebraically and in its graph.

$$ \lvert x \rvert = \left\{
        \begin{array}{ll}
            -x & \quad x < 0 \\
            x & \quad x \geq 0
        \end{array}
    \right. $$
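
To make the contrast between the squared and absolute metrics concrete, the sketch below scores two made-up sets of predictions that have the same total absolute error. The helper `mean_absolute_error` is not defined elsewhere in this notebook and is added here only for illustration; the RMSE helper is restated so that the snippet runs on its own.

```python
import numpy as np

def mean_absolute_error(target, pred):
    return np.mean(np.abs(target - pred))

def root_mean_squared_error(target, pred):
    return np.sqrt(np.mean((target - pred) ** 2))

target = np.zeros(4)
even_pred = np.array([2.0, 2.0, 2.0, 2.0])    # four equal errors of 2
spiky_pred = np.array([0.0, 0.0, 0.0, 8.0])   # a single large error of 8

mae_even = mean_absolute_error(target, even_pred)
mae_spiky = mean_absolute_error(target, spiky_pred)
rmse_even = root_mean_squared_error(target, even_pred)
rmse_spiky = root_mean_squared_error(target, spiky_pred)

print(mae_even, mae_spiky)     # identical: MAE only sees the average error size
print(rmse_even, rmse_spiky)   # RMSE is larger when the error is concentrated in one point
```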

## Choice of Metric: RMSE
Since large errors, such as those caused by outliers in the data, are particularly harmful to our model of protein supply, and since squared errors are easier to handle mathematically than absolute values, we choose RMSE.


References: 
* [Medium: RMSE vs MAE - Which metric is better](https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d#:~:text=RMSE%20has%20the%20benefit%20of,then%20MAE%20is%20more%20appropriate.)
* [Medium: Comparing robustness of MAE, MSE and RMSE](https://towardsdatascience.com/comparing-robustness-of-mae-mse-and-rmse-6d69da870828)

In [None]:
def r2_score(y, ypred):
    y_mean = y.mean()
    ss_res = np.sum((y - ypred)**2)
    ss_tot = np.sum((y - y_mean)**2)
    return 1 - ss_res/ss_tot



def mean_squared_error(target, pred):
    n = target.shape[0]
    rss = np.sum((target - pred)**2)
    mse = rss / n
    return mse

def root_mean_squared_error(target, pred):
    # np.sqrt is used because a bare sqrt is never imported in this notebook
    return np.sqrt(mean_squared_error(target, pred))



<a id='step6'></a>
# Step 6: Parameter Tuning

<a id='step7'></a>
# Step 7: Prediction

<a id='modelimp'></a>
# Model Improvements
## Improvement 1: Removal of outliers
From our data, we can see that certain countries' protein supplies are far from the median.  
Examples include: .... countries ....   
  
We can arbitrarily define outliers as protein supply values less than (Q1 - 1.5 IQR) or more than (Q3 + 1.5 IQR). The functions to find these are defined below.
  
This has two implications that allow us to get a better model:  
1. Our chosen evaluation metric, RMSE, is sensitive to extreme values. By removing outliers, we can lower our RMSE.
2. Our cost function J is equal to $\frac{1}{2}$ MSE, which, like RMSE, is sensitive to outliers.

## Improvement 2: Removal of irrelevant / low-correlation predictors  
  
Based on our graphs, we will be removing ________ as our predictor(s).  
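
The graphs can be complemented with a numeric check. The sketch below is an assumption rather than the method used in this project: it ranks predictors by their Pearson correlation with the target and flags the weak ones. The column names, the data and the 0.3 cut-off are all hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "gdp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temp_change": [0.3, -0.2, 0.0, -0.3, 0.2],
    "protein_supply": [10.0, 12.1, 13.9, 16.2, 18.0],
})

# Pearson correlation of every predictor with the target column.
correlations = df.corr()["protein_supply"].drop("protein_supply")

threshold = 0.3   # arbitrary cut-off, chosen only for this example
weak_predictors = correlations[correlations.abs() < threshold].index.tolist()

print(correlations)
print(weak_predictors)
```

A low linear correlation does not rule out a non-linear relationship, so the scatter plots remain the primary evidence for dropping a predictor.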


In [None]:
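def quartile_1(df, target):
    # first quartile (25th percentile) of the target column; target is assumed to be a single column name
    return df[target].quantile(0.25)

def quartile_3(df, target):
    # third quartile (75th percentile) of the target column
    return df[target].quantile(0.75)

def iqr(df, target):
    # interquartile range; df is passed in explicitly so that the helpers above can be called
    return quartile_3(df, target) - quartile_1(df, target)

def remove_outliers(df, target):
    """ 
    Returns a new dataframe with the outlier rows removed.  
    """
    spread = iqr(df, target)
    lower = quartile_1(df, target) - 1.5 * spread
    upper = quartile_3(df, target) + 1.5 * spread
    # keep only the rows whose target value lies within [Q1 - 1.5 IQR, Q3 + 1.5 IQR]
    df_new = df[(df[target] >= lower) & (df[target] <= upper)]
    return df_new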

