# Problem Statement
How might we predict a country’s **protein supply per capita** through the analysis of social, environmental, and economic factors such as **GDP, temperature change, consumer prices, import and export quantity and production quantity**?

# Overall 7-Step Methodology for the ML Project
Reference: [Google's 7 steps of ML](https://towardsdatascience.com/the-googles-7-steps-of-machine-learning-in-practice-a-tensorflow-example-for-structured-data-96ccbb707d77)

Step 1: [Gathering Data](#step1)    
Step 2: [Data Preparation](#step2)  
Step 3: [Choosing a model](#step3)  
Step 4: [Training](#step4)  
Step 5: [Evaluation](#step5)  
Step 6: [Parameter Tuning](#step6)  
Step 7: [Prediction](#step7)  

Step 0: [Model Improvements](#modelimp)


<a id='step1'></a>
# Step 1: Gathering Data
  
Our data is sourced from FAOSTAT \<insert link>  
  
Before selecting our predictor values, we first designed a persona for each of our statistics. This is done below in [Step 2: Data Preparation](#step2), where we use graphs to map out potentially interesting characteristics of each country in relation to each predictor.  

We have summarized our findings in the table below:  
\<Insert table of persona in markdown>

<a id='step2'></a>
# Step 2: Data Preparation

<a id='step3'></a>
# Step 3: Choosing a model

Our chosen model is multiple linear regression.   

In our model, we hypothesize that the protein supplied to people in a country is represented by the function $\hat{y}$ as follows:

$$\hat{y}(x) =  \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \ldots + \hat{\beta}_n x_n$$  

#### <center> OR </center>  

$$\mathbf{\hat{y}} = \mathbf{X} \times \mathbf{\hat{b}}$$

The cost function of our model is half the mean squared error, i.e. $\frac{1}{2}$MSE, where $m$ is the number of training examples:
$$J(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_n) = \frac{1}{2m}\sum^m_{i=1}\left(\hat{y}(x^i)-y^i\right)^2$$

Differentiating the cost $J$ with respect to $\hat{\beta}$ gives the gradient, from which we obtain the gradient descent update equation, where $\alpha$ is the learning rate:
$$ \mathbf{\hat{b}} = \mathbf{\hat{b}} - \alpha\frac{1}{m} \mathbf{X}^T \times (\mathbf{X}\times \mathbf{\hat{b}} - \mathbf{y}) $$
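
For reference, the intermediate step behind this update can be sketched as follows. The matrix form of $J$ below is our own restatement of the cost function above, assuming $\mathbf{X}$ contains a leading column of ones for $\hat{\beta}_0$:

$$J(\mathbf{\hat{b}}) = \frac{1}{2m} (\mathbf{X}\times \mathbf{\hat{b}} - \mathbf{y})^T \times (\mathbf{X}\times \mathbf{\hat{b}} - \mathbf{y})$$

$$\frac{\partial J}{\partial \mathbf{\hat{b}}} = \frac{1}{m} \mathbf{X}^T \times (\mathbf{X}\times \mathbf{\hat{b}} - \mathbf{y})$$

Each gradient descent step then moves $\mathbf{\hat{b}}$ a distance proportional to $\alpha$ in the direction opposite to this gradient.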

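Repeating this update until the cost stops decreasing yields the column vector $\mathbf{\hat{b}}$ of coefficients $\hat{\beta}_0, \ldots, \hat{\beta}_n$ for the multiple linear regression model that best fits our data.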

In [None]:
def calculate_cost():
    return

def calc_linear():
    return

def gradient_descent():
    return
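
The stubs above could be filled in along the following lines. This is only a minimal sketch of the cost function and update equation from this step: the argument names (`X`, `y`, `beta`, `alpha`, `num_iters`), the NumPy implementation, and the toy data are all our assumptions, and `X` is assumed to already include a leading column of ones for $\hat{\beta}_0$.

```python
import numpy as np

def calc_linear(X, beta):
    """Predicted values y_hat = X @ beta (X is assumed to include a column of ones)."""
    return X @ beta

def calculate_cost(X, y, beta):
    """Cost J = (1 / 2m) * sum of squared errors."""
    m = X.shape[0]
    error = calc_linear(X, beta) - y
    return float((error.T @ error).item() / (2 * m))

def gradient_descent(X, y, beta, alpha, num_iters):
    """Repeat beta := beta - alpha * (1/m) * X^T (X beta - y), recording the cost."""
    m = X.shape[0]
    cost_history = []
    for _ in range(num_iters):
        gradient = X.T @ (calc_linear(X, beta) - y) / m
        beta = beta - alpha * gradient
        cost_history.append(calculate_cost(X, y, beta))
    return beta, cost_history

# Made-up toy data generated from y = 1 + 2x, used only to check the sketch
x = np.array([[0.0], [1.0], [2.0], [3.0]])
X = np.hstack([np.ones((4, 1)), x])   # prepend the intercept column
y = 1.0 + 2.0 * x
beta, cost_history = gradient_descent(X, y, np.zeros((2, 1)), alpha=0.1, num_iters=5000)
```

On this toy data the recovered coefficients should approach an intercept of 1 and a slope of 2, and `cost_history` should decrease over the iterations. With real predictors on very different scales (for example GDP versus temperature change), the features would likely need to be normalized first for a single learning rate to work well.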

<a id='step4'></a>
# Step 4: Training

<a id='step5'></a>
# Step 5: Evaluation

<a id='step6'></a>
# Step 6: Parameter Tuning

<a id='step7'></a>
# Step 7: Prediction

<a id='modelimp'></a>
# Model Improvements
## Improvement 1: Removal of outliers
From our data, we can see that certain countries' protein supply values lie far from the median.  
Examples include: .... countries ....   
  
We can define outliers, using the common 1.5 × IQR rule of thumb, as protein supply values less than (Q1 - 1.5 IQR) or greater than (Q3 + 1.5 IQR), where IQR = Q3 - Q1. The functions to find these are defined below.
  
Removing these outliers should give us a better model for two reasons:  
1. Our chosen metric, RMSE, squares each error, so it is especially sensitive to extreme values. By removing outliers, we can lower our RMSE.
2. Our cost function J is equal to $\frac{1}{2}$MSE, which, like RMSE, is sensitive to outliers, so extreme values would otherwise pull the fitted coefficients toward them.
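
To illustrate how strongly a single extreme value affects RMSE, here is a small sketch. The numbers are made up purely for illustration and are not taken from our dataset, and the `rmse` helper is a hypothetical name. Note that it only re-evaluates fixed predictions and does not refit a model.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up values: four small errors and one extreme one (the last point)
y_true = np.array([10.0, 12.0, 11.0, 13.0, 40.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 12.0])

rmse_all = rmse(y_true, y_pred)                      # includes the extreme point
rmse_trimmed = rmse(y_true[:-1], y_pred[:-1])        # extreme point excluded
mae_all = float(np.mean(np.abs(y_true - y_pred)))    # mean absolute error, for comparison
```

Because the errors are squared before averaging, the one extreme point dominates `rmse_all`, which ends up well above both `rmse_trimmed` and the mean absolute error on the same data.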

## Improvement 2: Removal of irrelevant / low correlation predictors  
  
Based on our graphs, we will be removing ________ as our predictor(s).  
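
Alongside the graphs, one possible way to screen predictors numerically is to compute each predictor's Pearson correlation with the target and flag those below a cutoff. The sketch below assumes a pandas DataFrame. The function name, the column names, the toy values, and the 0.5 threshold are all hypothetical placeholders, not choices made in this project.

```python
import pandas as pd

def low_correlation_predictors(df, target, threshold):
    """Return predictor columns whose |Pearson r| with `target` is below `threshold`."""
    correlations = df.corr()[target].drop(target)
    return correlations[correlations.abs() < threshold].index.tolist()

# Made-up toy table; column names are placeholders, not our real predictors
toy = pd.DataFrame({
    "predictor_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],     # rises with the target
    "predictor_b": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],  # mostly unrelated to the target
    "protein_supply": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
})
to_drop = low_correlation_predictors(toy, "protein_supply", threshold=0.5)
reduced = toy.drop(columns=to_drop)
```

Pearson correlation only captures linear association, so a predictor flagged this way may still be worth checking against its graph before it is removed.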


In [None]:
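# Assumes `df` is a pandas DataFrame and `target` is the name of a numeric column.
def quartile_1(df, target):
    """Returns the first quartile (25th percentile) of the target column."""
    return df[target].quantile(0.25)

def quartile_3(df, target):
    """Returns the third quartile (75th percentile) of the target column."""
    return df[target].quantile(0.75)

def iqr(df, target):
    """Returns the interquartile range (Q3 - Q1) of the target column."""
    return quartile_3(df, target) - quartile_1(df, target)

def remove_outliers(df, target):
    """ 
    Returns a new dataframe with the outlier rows removed.  
    """
    # Keep only rows inside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]
    lower = quartile_1(df, target) - 1.5 * iqr(df, target)
    upper = quartile_3(df, target) + 1.5 * iqr(df, target)
    df_new = df[(df[target] >= lower) & (df[target] <= upper)].copy()
    return df_new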

