## Overview
This week I want to show you a complete machine learning project end-to-end. We will continue to use only Linear and Polynomial Regression models, but we will fit them into the full process.
- We will talk about how to construct a machine learning project.
- We will talk about some feature engineering steps and processes that we can use.
  - Data preprocessing and transformation.
  - How we can select features.
  - How to use categorical variables in our models, which clearly expect numbers.
  - We will talk a bit more about the scikit-learn library and the pipelines you can build with it.
- We will talk about how to evaluate the model, how to select the best model, how to fine-tune it.
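
As a rough illustration of the categorical-variable point above, here is a minimal sketch that one-hot encodes a text column before it reaches a linear model. The `houses_df` data and its column names are made up for this example; only the scikit-learn classes are real.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric feature, one categorical feature, one target.
houses_df = pd.DataFrame({
    "area": [50.0, 80.0, 120.0, 65.0, 150.0, 95.0],
    "district": ["north", "south", "south", "east", "north", "east"],
    "price": [110.0, 150.0, 230.0, 120.0, 310.0, 170.0],
})
features = houses_df[["area", "district"]]

# Scale the numeric column and turn each district into its own 0/1 column.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["area"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["district"]),
])

# 1 scaled numeric column + 3 one-hot columns (east, north, south).
encoded = preprocess.fit_transform(features)
print(encoded.shape)

# The same preprocessing chained in front of a regression model.
model = make_pipeline(preprocess, LinearRegression())
model.fit(features, houses_df["price"])
predictions = model.predict(features)
```

The model never sees the strings; it only sees the numeric columns produced by the encoder.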

For an End-to-End Machine Learning Project, we will go through the following steps:
- Look at the big picture.
- Get the data.
- Discover and visualize the data to gain insights.
- Prepare the data for Machine Learning algorithms.
- Select a model and train it.
- Fine-tune your model.
- Present your solution.
- Launch, monitor, and maintain your system.


## Look at the Big Picture


## Splitting the Data
When we talked about the Linear Regression model, we combined our visual observations with the cost function calculations to get a good understanding of how the model works. That let us choose the right degree for our model, one that reduces the cost function value without overfitting.
But the question we should ask here is: how can we do this in a more systematic way? How can we do it without relying on visual observations, since we won't be able to visualize anything beyond 3 dimensions?

For that, we need to split our labeled data into two parts:
- A training set.
- A test set.

We will perform all of our model selection and fine-tuning on the training set. And we will use the test set to evaluate our final model.
When a model performs very poorly on the training set, then we're probably underfitting the data. When a model performs very well on the training data, but poorly on the test data, then we're probably overfitting the data. When the model performs relatively okay on both the training and test sets, then we're probably just right.
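
The diagnosis described above can be sketched by comparing training and test errors for a few polynomial degrees. This is only an illustration on synthetic data: the quadratic ground truth, the noise level, and the degrees tried are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a quadratic relationship plus noise (assumed for illustration).
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=1.0, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

errors = {}
for degree in [1, 2, 10]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

A high error on both sets points toward underfitting, while a low training error paired with a much higher test error points toward overfitting.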

Remember that you will never fit the data perfectly (100%). That would make no sense unless all the data we received already followed a perfect formula.

# Machine Learning Process
1. Define the problem
2. Collect the data
3. Prepare the data
4. Evaluate the algorithms
5. Improve the results
6. Present the results

- Start with the types of ML problems and the types of ML algorithms.
- Frame the problem and look at the big picture:
  - Are we trying to get an estimate for the price,
  - or just a category (cheap, medium, expensive)? In that case an exact numeric estimate is not important, and this could become a classification problem.
- Split the data into training and test sets.
- Computers don't generate truly random numbers, so we need to set the seed to make the split reproducible.
  - https://www.statisticshowto.com/random-seed-definition/#:~:text=Generator%20in%20Excel.-,What%20is%20a%20Random%20Seed%3F,Henkemans%20%26%20Lee%2C%202001).
- Feature Scaling
  - https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35
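
To make the note about the seed concrete, here is a small sketch using `train_test_split` with its `random_state` parameter. The array and the seed values 42 and 7 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows with two columns each.
data = np.arange(20).reshape(10, 2)

# Same seed twice: the pseudo-random shuffle is identical, so the split is too.
train_a, test_a = train_test_split(data, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=42)

# A different seed generally produces a different shuffle.
train_c, test_c = train_test_split(data, test_size=0.2, random_state=7)

print(np.array_equal(test_a, test_b))
print(test_a)
print(test_c)
```

Fixing the seed keeps the test set the same between runs, so results from different experiments stay comparable.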

- scikit-learn has:
  - estimators
  - transformers
  - predictors
  - They can be chained together using pipelines
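
The three roles listed above can be sketched with the diabetes data bundled with scikit-learn. This is a minimal example, not the only way to organize the steps; it shows the manual sequence of calls next to the equivalent pipeline.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Estimator: fit() learns parameters from the data (here, per-column mean and scale).
scaler = StandardScaler()
scaler.fit(X)

# Transformer: transform() applies what was learned to produce a new dataset.
X_scaled = scaler.transform(X)

# Predictor: fit() learns the model, predict() produces outputs for given inputs.
regressor = LinearRegression()
regressor.fit(X_scaled, y)
manual_predictions = regressor.predict(X_scaled)

# Pipeline: the same two steps chained, exposed as a single estimator.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X, y)
pipeline_predictions = pipeline.predict(X)

print(np.allclose(manual_predictions, pipeline_predictions))
```

The pipeline version is less error-prone because every step is fitted and applied in the same order each time.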


In [1]:
import pandas as pd

diabetes_df = pd.read_table('./data/diabetes.txt')
diabetes_df.head()

Unnamed: 0,AGE,SEX,BMI,BP,S1,S2,S3,S4,S5,S6,Y
0,59,2,32.1,101.0,157,93.2,38.0,4.0,4.8598,87,151
1,48,1,21.6,87.0,183,103.2,70.0,3.0,3.8918,69,75
2,72,2,30.5,93.0,156,93.6,41.0,4.0,4.6728,85,141
3,24,1,25.3,84.0,198,131.4,40.0,5.0,4.8903,89,206
4,50,1,23.0,101.0,192,125.4,52.0,4.0,4.2905,80,135


In [57]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer

model = make_pipeline(
  # Normalizer(),
  MinMaxScaler(),
  # StandardScaler(with_std=False, with_mean=True),
  RobustScaler(with_centering=True, with_scaling=True),
)

X = diabetes_df.drop('Y', axis=1)
y = diabetes_df['Y']

transformed_data_df = pd.DataFrame(model.fit_transform(X), columns=X.columns)
transformed_data_with_y_df = pd.concat([transformed_data_df, y], axis=1)

display(transformed_data_with_y_df['BMI'].apply(lambda x: x**2).sum())
display(transformed_data_with_y_df.head())

# transformed_data = pd.DataFrame(model.fit_transform(diabetes_df),columns = diabetes_df.columns)
# transformed_data.head()

238.71943640027757

Unnamed: 0,AGE,SEX,BMI,BP,S1,S2,S3,S4,S5,S6,Y
0,0.433735,1.0,1.053498,0.380952,-0.637363,-0.514954,-0.571429,0.0,0.332755,-0.271186,151
1,-0.096386,0.0,-0.674897,-0.285714,-0.065934,-0.254876,1.257143,-0.5,-1.010756,-1.491525,75
2,1.060241,1.0,0.790123,0.0,-0.659341,-0.504551,-0.4,0.0,0.073213,-0.40678,141
3,-1.253012,0.0,-0.065844,-0.428571,0.263736,0.478544,-0.457143,0.5,0.375087,-0.135593,206
4,0.0,0.0,-0.444444,0.380952,0.131868,0.322497,0.228571,0.0,-0.457391,-0.745763,135



In [19]:
from sklearn import datasets

diabetes = datasets.load_diabetes(as_frame=True)

diabetes.frame.head()


Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0


In [40]:
# sum of the squares of the bmi column
diabetes.frame['bmi'].apply(lambda x: x**2).sum()

0.9999999999999998
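
One scaling that yields a sum of squares of exactly 1 is to mean-center a column and divide it by its standard deviation times the square root of the number of samples. I believe this is what the scikit-learn copy of the dataset does, but treat that as an assumption. The sketch below uses made-up values in `raw_bmi`, not the real column.

```python
import numpy as np

# Hypothetical raw measurements; 442 matches the number of rows in the dataset.
rng = np.random.default_rng(0)
raw_bmi = rng.normal(loc=26.0, scale=4.0, size=442)

# Mean-center, then divide by (population standard deviation * sqrt(n_samples)).
centered = raw_bmi - raw_bmi.mean()
scaled_bmi = centered / (raw_bmi.std() * np.sqrt(len(raw_bmi)))

# sum(centered**2) equals n * std**2, so the scaled squares sum to 1
# up to floating-point rounding.
sum_of_squares = np.sum(scaled_bmi ** 2)
print(sum_of_squares)
```

This differs from `StandardScaler`, which gives every column unit variance rather than unit length.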

In [39]:
print(diabetes.DESCR)


.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

In [None]:
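from zlib import crc32
import numpy as np

def test_set_check(identifier, test_ratio):
    # Hash the identifier to an unsigned 32-bit integer. The instance goes to the
    # test set when the hash falls in the lowest `test_ratio` share of that range.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

# Inspect two hash values and the bounds of the 32-bit range
print(crc32(np.int64('1')) & 0xffffffff)
print(crc32(np.int64(4294967296)) & 0xffffffff)
print(0xffffffff)  # largest unsigned 32-bit value
print(2**32)       # number of distinct 32-bit values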

2844319735
3718166540
4294967295
4294967296


# Feature Scaling

## Notes
- Feature scaling is important for gradient descent.
- Explain how the algorithm works.
- Data splitting.
- https://medium.com/@thaddeussegura/simple-linear-regression-in-200-words-eb0835324af5
- https://medium.com/@thaddeussegura/multiple-linear-regression-in-200-words-data-8bdbcef34436
- https://medium.com/@thaddeussegura/polynomial-regression-in-200-words-2b1f4f8b5c5a
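
To illustrate the scaling note above, here is a small sketch comparing `StandardScaler` and `MinMaxScaler` on two features with very different ranges. The "age" and "income" columns are synthetic and exist only for this example. Both scalers are fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic features on very different scales (assumed for illustration).
rng = np.random.default_rng(1)
X = np.column_stack([
    rng.normal(50, 15, size=200),        # something like age in years
    rng.normal(60000, 20000, size=200),  # something like yearly income
])
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Learn the scaling parameters from the training rows only.
standard = StandardScaler().fit(X_train)
minmax = MinMaxScaler().fit(X_train)

X_train_std = standard.transform(X_train)  # each column: mean 0, std 1
X_train_mm = minmax.transform(X_train)     # each column: min 0, max 1

# Test rows reuse the training parameters, so they may fall slightly outside.
X_test_std = standard.transform(X_test)
X_test_mm = minmax.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
```

With both columns on comparable scales, a gradient descent step no longer has to cope with one feature whose values are thousands of times larger than the other's.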

Finally, we'll end the module with a complete end-to-end example of a machine learning project, including all of its cleaning and preprocessing steps.

In [None]:
# Visualize the same data using different polynomial degrees
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Assumes income_df (columns Education, Seniority, Income) was loaded in an earlier cell
fig = plt.figure(figsize=(20,10))  # set figure size
ax = fig.add_subplot(111, projection='3d')
ax.scatter(income_df['Education'],income_df['Seniority'],income_df['Income'],c='red', marker='o', alpha=0.5)

# Expand the two features into all polynomial terms up to degree 2, then fit a linear model on them
data_polynomialed = PolynomialFeatures(degree=2).fit_transform(income_df[['Education', 'Seniority']])
data_polynomialed.shape # 30 x 6 
polynomial_income_model = LinearRegression()
polynomial_income_model.fit(data_polynomialed, income_df['Income'])
# Draft: fit one pipeline per degree and plot its surface
# (surfaceX, x_surf and y_surf are not defined in this cell)
# for degree in [1,2,3,4]:
    # model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # model.fit(income_df[['Education', 'Seniority']], income_df['Income'])
    # predictedIncomeForSurface=model.predict(surfaceX)
    # ax.plot_surface(x_surf, y_surf, predictedIncomeForSurface.reshape(x_surf.shape), alpha=0.3)