![logo](https://user-images.githubusercontent.com/59526258/124226124-27125b80-db3b-11eb-8ba1-488d88018ebb.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful,
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing

import matplotlib.pyplot as plt
%matplotlib inline

## View data

In [None]:
df = pd.read_csv('https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt', sep="\t")

In [None]:
df.head()

scikit-learn's `load_diabetes()` dataset comes from the same source as this example. The difference is that `load_diabetes()` returns standardized data by default, while this example works from the raw values.
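As a quick check of what "standardized" means here, the sketch below inspects the bundled `load_diabetes()` features. It assumes the default (scaled) output, which the dataset description says is mean-centered and scaled so that each column's sum of squares is 1. The variable names are illustrative only.

```python
import numpy as np
from sklearn import datasets

# Load the bundled, already-standardized feature matrix (no download needed)
X_std = datasets.load_diabetes().data

# Each column should be centered at 0 ...
col_means = X_std.mean(axis=0)
# ... and scaled so that its squared values sum to 1
col_sum_sq = (X_std ** 2).sum(axis=0)

print(X_std.shape)
print(np.round(col_means, 6))
print(np.round(col_sum_sq, 6))
```

The raw file read with `pd.read_csv` above keeps the original units, which is why the scaling is done by hand later in this notebook.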

#### Dataset features
- age: age in years
- sex
- bmi: body mass index
- bp: average blood pressure
- s1: tc, total serum cholesterol
- s2: ldl, low-density lipoproteins
- s3: hdl, high-density lipoproteins
- s4: tch, total cholesterol / HDL
- s5: ltg, possibly log of serum triglycerides level
- s6: glu, blood sugar level

#### Target (Y column)
As you can see from the full description below, the dataset aims to predict the quantitative measure of disease progression.

In [None]:
# View full description
for line in datasets.load_diabetes()['DESCR'].split('\n'):
    print(line)

## Prepare data

Group the `AGE` column into age ranges.

In [None]:
age = df['AGE']
print(min(age), max(age))

Let's set the lower limit to 10 and the upper limit to 80, so that the bins cover the minimum and maximum values in the data.

The labels are set according to the age group definition
[Source](https://help.healthycities.org/hc/en-us/articles/219556208-How-are-the-different-age-groups-defined-)

In [None]:
# Set the bins
bins = [10, 12, 17, 65, 80]
age_labels = ["children", "teens", "adults", "elderly"]

# perform range encoding
age = pd.cut(age, bins=bins, labels=age_labels, include_lowest=True)
age = pd.DataFrame(age)

In [None]:
age.head()
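To see how `pd.cut` treats values that fall exactly on a bin edge, here is a small sketch on made-up ages (the `sample_ages` values are hypothetical, not taken from the dataset). It reuses the same bins and labels as above.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
sample_ages = pd.Series([10, 12, 13, 17, 18, 65, 66, 80])

bins = [10, 12, 17, 65, 80]
age_labels = ["children", "teens", "adults", "elderly"]

# Intervals are right-inclusive: (10, 12], (12, 17], (17, 65], (65, 80].
# include_lowest=True also keeps the left-most edge (10) in the first bin.
groups = pd.cut(sample_ages, bins=bins, labels=age_labels, include_lowest=True)

print(pd.DataFrame({"age": sample_ages, "group": groups}))
```

Any value outside the outer edges (below 10 or above 80) would become `NaN`, so the bin limits should always cover the observed range.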

### One-hot encoding

After grouping the age data, perform one-hot encoding.

In [None]:
age = pd.get_dummies(age)

# use rename() to change the new column names
gender = pd.get_dummies(df["SEX"]).rename(columns={1:"gender1", 2:"gender2"})

gender.head()
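The effect of one-hot encoding is easier to see on a tiny example. The sketch below uses a hypothetical four-row series coded 1/2, like the `SEX` column, and casts the result to integers so the printed values are 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical column coded 1/2, like the SEX column in this dataset
sex = pd.Series([1, 2, 2, 1], name="SEX")

# One indicator column per distinct value; rename the numeric column names
one_hot = pd.get_dummies(sex).rename(columns={1: "gender1", 2: "gender2"})

# Cast to int so the output is 0/1 regardless of pandas version
one_hot = one_hot.astype(int)
print(one_hot)
```

Each row has exactly one indicator set, which is what makes the encoding "one-hot".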

### Normalize data

Use sklearn's `MinMaxScaler` to normalize the data.

In [None]:
scaler = preprocessing.MinMaxScaler()
df_to_scale = df.drop(["AGE", "SEX"],axis=1)
df_scaled = scaler.fit_transform(df_to_scale)

column_names = ["BMI","BP","S1","S2","S3","S4","S5","S6","Y"]
df_scaled = pd.DataFrame(df_scaled, columns=column_names)

In [None]:
df_scaled.head()
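Min-max scaling maps each column to the range [0, 1] using `(x - min) / (max - min)`. The sketch below checks this on a small hypothetical array by comparing `MinMaxScaler` against the formula computed by hand.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical two-column data on very different scales
values = np.array([[18.0, 100.0],
                   [24.0, 150.0],
                   [30.0, 300.0]])

scaled = preprocessing.MinMaxScaler().fit_transform(values)

# The same result by hand: (x - min) / (max - min), column by column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

print(scaled)
```

Because each column is scaled independently, features measured in different units end up on a comparable scale.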

### Combine all data

In [None]:
df = pd.concat([age, gender, df_scaled], axis=1)

In [None]:
df.head()

## Summary of Data Preparation
- Change continuous variable to ordinal variable

We convert age into age groups and give them labels

- One-hot encoding

Both age and gender are converted to one-hot encoding. 

- Normalize data

The remaining columns are scaled with sklearn's `MinMaxScaler`, which maps each column to the range [0, 1] using its minimum and maximum values.

Now the data is ready for training.

In [None]:
# Split data and target
data = df.drop(["Y"], axis=1)
target = df["Y"]

In [None]:
data = data.astype(dtype=np.float32)
target = target.astype(dtype=np.float32)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=123)
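As a small illustration of what `train_test_split` does with `test_size=0.3` and a fixed `random_state`, the sketch below splits a hypothetical 10-row array. The toy arrays and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)
# Repeating the call with the same seed reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.3, random_state=123)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` keeps the split reproducible, so train and test errors can be compared across runs.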

## Train model

In [None]:
model = LinearRegression()

Perform training by fitting the model on the training data.
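Training a `LinearRegression` model is a single call to `fit`. The sketch below shows this on hypothetical noise-free data generated from a known linear rule, so the learned coefficients can be checked; in this notebook the corresponding call would presumably be `model.fit(X_train, y_train)`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 2*x1 - 3*x2 + 1
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(50, 2))
y_demo = 2 * X_demo[:, 0] - 3 * X_demo[:, 1] + 1

model = LinearRegression()
model.fit(X_demo, y_demo)  # ordinary least squares fit

# The fitted coefficients and intercept recover the generating rule
print(model.coef_, model.intercept_)
```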

## Evaluate

### Train Error
Evaluate the model using the training data.

### Test Error
Evaluate the model using the test data.
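One possible way to compute both errors is with mean squared error from `sklearn.metrics`. The sketch below is self-contained and makes two assumptions that differ from the notebook: it uses the bundled `load_diabetes()` features instead of the prepared frame, and it rescales the target to [0, 1] by hand to mirror the min-max scaling of `Y` above.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: use the bundled dataset instead of the notebook's prepared frame
X, y = datasets.load_diabetes(return_X_y=True)
# Rescale the target to [0, 1], mirroring the min-max scaling of Y above
y = (y - y.min()) / (y.max() - y.min())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

model = LinearRegression().fit(X_train, y_train)

# Train error: predictions on data the model has already seen
train_mse = metrics.mean_squared_error(y_train, model.predict(X_train))
# Test error: predictions on held-out data
test_mse = metrics.mean_squared_error(y_test, model.predict(X_test))

print(f"train MSE: {train_mse:.4f}")
print(f"test MSE:  {test_mse:.4f}")
```

Comparing the two numbers side by side is what the conditions described in the next section rely on.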

# Model evaluation
Evaluation is done to monitor the performance of the trained model. The performance shows how well the model has learned from the data.

There are three conditions to monitor:

### Underfitting
In this condition, the model has not learned enough from the training data. A simple example is using a straight line to model non-linear data: the line cannot describe the data.

In the case of regression, underfitting is when both the training and testing errors are large.
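The straight-line example can be sketched with hypothetical data generated from `y = x^2`. A plain linear model leaves both errors large, while adding a squared feature (via `PolynomialFeatures`, which is not used elsewhere in this notebook) removes the error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data: y = x^2, no noise
x = np.linspace(-3, 3, 60).reshape(-1, 1)
y = x[:, 0] ** 2

# Deterministic split: even rows for training, odd rows for testing
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

# A straight line cannot follow the curve: both errors stay large
line = LinearRegression().fit(x_train, y_train)
line_train = mean_squared_error(y_train, line.predict(x_train))
line_test = mean_squared_error(y_test, line.predict(x_test))

# Adding an x^2 feature gives the model enough capacity
poly = PolynomialFeatures(degree=2)
curve = LinearRegression().fit(poly.fit_transform(x_train), y_train)
curve_test = mean_squared_error(y_test, curve.predict(poly.transform(x_test)))

print(f"line:  train MSE {line_train:.3f}, test MSE {line_test:.3f}")
print(f"curve: test MSE {curve_test:.3e}")
```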

### Overfitting
In this condition, the model has fit the training data too closely, including its noise as well as the underlying pattern. In other words, the model "memorises" the data. This is undesirable because noise is not what we want the model to learn. The model will perform well on training data but poorly on unseen data.

In the case of regression, overfitting is when the training error is very low and the test error is large.
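To make the pattern concrete, the sketch below swaps in a different model from the one used in this notebook: an unrestricted `DecisionTreeRegressor`, which can memorise noisy training points. The data is hypothetical (a linear trend plus Gaussian noise), and a plain linear regression is fitted alongside for comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: linear trend y = 2x plus Gaussian noise (std 0.5)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.5, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until it reproduces the training targets
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
tree_train = mean_squared_error(y_train, tree.predict(X_train))
tree_test = mean_squared_error(y_test, tree.predict(X_test))

# A linear model matches the true trend and ignores the noise
lin = LinearRegression().fit(X_train, y_train)
lin_test = mean_squared_error(y_test, lin.predict(X_test))

print(f"tree:   train MSE {tree_train:.4f}, test MSE {tree_test:.4f}")
print(f"linear: test MSE {lin_test:.4f}")
```

The tree's training error is close to zero while its test error is clearly larger, which is the signature of overfitting described above.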

### Generalize well
This is the desired condition for a model. It shows that the model can learn from the training data and make good predictions on the testing data.

__In this example, both training and testing errors are low, which shows that the model generalizes well.__