# Modeling exercise

## General Instructions

* Submission date: 25.4.2022
* Submission Method: Link to your solution notebook in [this sheet](https://docs.google.com/spreadsheets/d/1fTmjiVxzw_rM1hdh16enwUTtxzlHSJIiw41dJS2LKp0/edit?usp=sharing).

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import sys; sys.path.append('../Modles and Modeling/src')
import numpy as np
import plotly_express as px

In [4]:
import pandas as pd
import ipywidgets as widgets

In [5]:
from datasets import make_circles_dataframe, make_moons_dataframe

ModuleNotFoundError: No module named 'datasets'

## Fitting and Overfitting

The goal of the following exercise is to:
* Observe overfitting due to insufficient data
* Observe overfitting due to an overly complex model
* Identify the overfitting point by looking at the train vs. test error dynamics
* Observe how noise levels affect the number of data samples and the model capacity needed

To do so, you'll code an experiment in the first part, and analyze the experiment results in the second part.

### Building an experiment

Code:

1. Create data of size N with noise level of magnitude NL from dataset DS_NAME.
1. Split it into training and validation data (no need for a test set); use an 80%-20% split.
1. Use logistic regression and choose one complex model of your choice: [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [SVM with RBF kernel](https://scikit-learn.org/stable/modules/svm.html) with different `gamma` values, or [Random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with different values of `min_samples_split`.
1. Train on the training set for different hyperparameter values. Compute:
   1. Classification accuracy on the training set (TRE)
   1. Classification accuracy on the validation set (TESTE)
   1. The difference between the two above (E_DIFF)
1. Save DS_NAME, N, NL, CLF_NAME, K, TRE, TESTE, E_DIFF and the regularization/hyperparameter value (K, gamma or min_samples_split for the complex model, and the regularization value for the logistic regression classifier).

Repeat for:
* DS_NAME in Moons, Circles
* N (number of samples) in [5, 10, 50, 100, 1000, 10000]
* NL (noise level) in [0, 0.1, 0.2, 0.3, 0.4, 0.5]
* For the complex model: 10 values of the hyperparameter of the complex model you've chosen.
* For the linear model: 7 values of ridge (l2) regularization - [0.001, 0.01, 0.1, 1, 10, 100, 1000]

**NOTE:** The result DataFrame *size* should be 510 for each model. For TWO models its size is 1,020 (for THREE, 1,530).
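
One iteration of the experiment described above could be sketched as follows. This is a minimal illustration rather than the required solution: it uses scikit-learn's `make_moons` directly as a stand-in for the course's `make_moons_dataframe` helper, and the function name `run_once` and the fixed seed are assumptions made for the example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def run_once(n, nl, gamma, seed=0):
    """Fit one RBF SVM and return (TRE, TESTE, E_DIFF) as accuracies."""
    # Stand-in for make_moons_dataframe(n_samples=n, noise_level=nl).
    X, y = make_moons(n_samples=n, noise=nl, random_state=seed)
    # 80%-20% split; stratify keeps both labels in each part.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    tre = clf.score(X_tr, y_tr)      # accuracy on the training set
    teste = clf.score(X_val, y_val)  # accuracy on the validation set
    return tre, teste, tre - teste


row = run_once(n=1000, nl=0.2, gamma=1.0)
print(dict(zip(["TRE", "TESTE", "E_DIFF"], row)))
```

Wrapping this call in loops over DS_NAME, N, NL and the hyperparameter values, and appending each returned tuple to a list, yields the rows of the result DataFrame.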

### Analysing the experiment results

1. For SVM only, for datasets of size 10k and for each dataset, what are the best model params? How stable are they?
1. For SVM only, for datasets of size 10k and for each dataset, what is the most stable model and what are its params? How good is it in comparison to other models? Explain using bias and variance terminology.
1. Does regularization help for linear models? Consider different dataset sizes.
1. For a given noise level of your choice, how do the train error, test error and their difference change with increasing data sizes? (Answer for SVM and LR separately.)
1. For a given noise level of your choice, how do the train error, test error and their difference change with increasing model complexity? (Answer for SVM and LR separately.)
1. Does the noise level (NL) affect the number of datapoints needed to reach optimal test results?

Bonus:

* For SVM: Select one dataset with a 0.2 noise level. Identify the optimal model params, and visualize the decision boundary learned.
  * Hint: Use a grid. See classification models notebook 

## Tips and Hints

For building the experiment:

* Start with one dataframe holding all the data for both datasets with different noise levels. Use the `make_<dataset_name>_dataframe()` functions below, and add two columns, dataset_name and noise_level, before appending the new dataset to the rest of the datasets. Use `df = pd.DataFrame()` to start with an empty dataframe and, using a loop, add data to it using `df = df.append(<the needed df here>)`. Verify that you have 10k samples for each dataset type and noise level by a proper `.value_counts()`. You can modify the 
* When you need N samples of data with a specific noise level, use `query()` and `head(n)` to get the needed dataset.
* Use sklearn's `train_test_split()` method to split the data, with the `test_size` and `random_state` parameters set correctly to ensure you are always splitting the data the same way for a given fold `k`. Read [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) if needed.
* Alternatively, you can skip creating your own data splitter and instead use `model_selection.cross_validate()` from sklearn. You'll need to ask for the train errors as well as the test errors, see [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html).
* Use prints in the proper locations to track the progress of the experiment.
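
The first two tips can be sketched as below. Note that `DataFrame.append` was removed in pandas 2.0, so on recent versions the frames have to be collected in a list and combined with `pd.concat`. The helper `make_frame` and the `MAKERS` mapping are hypothetical stand-ins for the course's `make_<dataset_name>_dataframe()` functions, and the smaller sizes are chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_circles, make_moons

# Hypothetical stand-ins for the course helpers make_<dataset_name>_dataframe().
MAKERS = {"moons": make_moons, "circles": make_circles}


def make_frame(name, n_samples, noise_level, seed=0):
    """Build one dataset as a DataFrame with its name and noise level attached."""
    X, y = MAKERS[name](n_samples=n_samples, noise=noise_level, random_state=seed)
    frame = pd.DataFrame({"x": X[:, 0], "y": X[:, 1], "label": y})
    frame["dataset_name"] = name
    frame["noise_level"] = noise_level
    return frame


# Collect all frames first, then concatenate once.
frames = [make_frame(name, 1000, nl) for name in MAKERS for nl in [0, 0.1, 0.2]]
datasets = pd.concat(frames, ignore_index=True)

# Every (dataset, noise level) pair should hold the same number of rows.
print(datasets.groupby("dataset_name")["noise_level"].value_counts())

# Pull the first 50 rows for one configuration.
subset = datasets.query("dataset_name == 'moons' and noise_level == 0.1").head(50)
print(subset.shape)
```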

**If you get stuck and need a reference, scroll to the end of the notebook to see more hints!**

## Moons dataset

In [5]:
from sklearn.datasets import make_moons
import numpy as np

In [6]:
moons_df = make_moons_dataframe(n_samples=1000, noise_level=0.1)
moons_df.head()

Unnamed: 0,x,y,label
0,1.975602,0.458461,B
1,0.610984,1.014617,A
2,-0.642946,0.940368,A
3,0.485465,-0.403386,B
4,0.718501,-0.536166,B


In [7]:
@widgets.interact
def plot_noisy_moons(noise_level = widgets.FloatSlider(value=0, min=0, max=0.5, step=0.05)):
    moons_df = make_moons_dataframe(n_samples=1000, noise_level=noise_level)
    return px.scatter(moons_df, x='x', y='y', color = 'label')


## Circles Dataset

In [8]:
circles_df = make_circles_dataframe(n_samples=500, noise_level=0)
circles_df.head()

Unnamed: 0,x,y,label
0,-0.509939,0.616411,B
1,0.719524,-0.349693,B
2,0.799747,0.020104,B
3,-0.974527,0.224271,A
4,0.675333,-0.737513,A


In [9]:
@widgets.interact
def plot_noisy_circles(noise_level = widgets.FloatSlider(value=0, min=0, max=0.5, step=0.05)):
    df = make_circles_dataframe(1000, noise_level)
    return px.scatter(df, x='x', y='y', color = 'label')


## SOLUTION

### 1. Creating MOONs and CIRCLEs Datasets

In [10]:
n_samples = [10, 50, 100, 1000, 10000] # N = 5 is skipped: a 5-row sample sometimes contains a single label only, so the experiment starts at 10 observations.
noise_level = [0, 0.1, 0.2, 0.3, 0.4, 0.5]
ds_name = ['Moon', 'Circle']
df = pd.DataFrame()

In [11]:
# Map each dataset name to its generator so both datasets share one loop.
makers = {'Moon': make_moons_dataframe, 'Circle': make_circles_dataframe}
frames = []
for typush in ds_name:
    for nl in noise_level:
        temp_df = makers[typush](n_samples = 10000, noise_level = nl)
        temp_df['noise_level'] = nl
        temp_df['ds_name'] = typush
        frames.append(temp_df)

# DataFrame.append was removed in pandas 2.0; concatenate all frames once instead.
df = pd.concat(frames)

df.groupby('ds_name').noise_level.value_counts()

ds_name  noise_level
Circle   0.0            10000
         0.1            10000
         0.2            10000
         0.3            10000
         0.4            10000
         0.5            10000
Moon     0.0            10000
         0.1            10000
         0.2            10000
         0.3            10000
         0.4            10000
         0.5            10000
Name: noise_level, dtype: int64

In [331]:
df.head()

Unnamed: 0,x,y,label,noise_level,ds_name
0,-0.704324,0.709879,A,0.0,Moon
1,1.741221,-0.171261,B,0.0,Moon
2,1.516955,-0.356013,B,0.0,Moon
3,-0.828072,0.560622,A,0.0,Moon
4,1.056844,-0.498383,B,0.0,Moon


### 2. Running Models

Before we begin the hyperparameter search (HPS), we should choose the most appropriate **evaluation metric**. Our data is balanced and there is no "favorite" label whose identification we prefer to improve. Therefore we will use **ACCURACY** as our evaluation metric.

**MODELS:**
1. Logistic Regression (**logit**), a linear model, serves as the *low-capacity* model in this notebook.
1. Support Vector Machine (**SVM**), a more complex model, serves as the *high-capacity* model in this notebook.

**NOTES**:
* When the sample size (n_samples) is extremely small, it is quite possible that the sample contains observations from one label only. In that case, the function `cross_validate()` doesn't work. Therefore we decided not to use samples of size 5 and to run the models with 10, 50, 100, 1000, and 10000 observations.
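
The single-label problem can be illustrated on a toy label vector. The sketch below is not part of the experiment: it uses made-up labels, and `StratifiedKFold` with fewer folds is shown only as one possible alternative to the plain `KFold` used in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical tiny sample: 8 points of label 0 followed by 2 points of label 1.
y = np.array([0] * 8 + [1] * 2)
X = np.arange(20, dtype=float).reshape(10, 2)


def single_class_folds(splitter):
    """Count training folds that contain only one label."""
    return sum(len(np.unique(y[train])) < 2 for train, _ in splitter.split(X, y))


# Plain K-fold can put every minority point into one validation fold.
print(single_class_folds(KFold(n_splits=5)))
# Stratified splitting spreads the labels across folds.
print(single_class_folds(StratifiedKFold(n_splits=2)))
```

A classifier such as `SVC` cannot be fitted on a training fold with a single label, which is why such folds make the cross-validation fail or return missing scores.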

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.model_selection import KFold, cross_validate

In [13]:
clfs = ['svm', 'logit']
cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
gammas = [0.001 ,0.005, 0.01 ,0.1, 0.5, 1 ,5 , 10, 100, 1000]
k_fold = KFold(n_splits = 5)
result = []

In [168]:
count_svm = 0
count_logit = 0

for ds in ds_name:
    for clf in clfs:
        for n in n_samples:
            for nl in noise_level:
                temp_df = df.query('ds_name == @ds and noise_level == @nl').head(n)
                
                if clf == 'svm':
                    for gamma in gammas:
                        svm_clf = svm.SVC(kernel = 'rbf', gamma = gamma)
                        score = cross_validate(svm_clf, temp_df[['x', 'y']], temp_df['label'], cv = k_fold, return_train_score = True)
                        TRE_mean, TRE_std = score['train_score'].mean(), score['train_score'].std()
                        TESTE_mean, TESTE_std = score['test_score'].mean(), score['test_score'].std()
                        E_DIFF = TRE_mean - TESTE_mean
                        result.append((ds, n, nl, clf, 5, TRE_mean, TRE_std, TESTE_mean, TESTE_std, E_DIFF, gamma))
                        count_svm = count_svm + 1
                        print(f'{count_svm}) {clf}, gamma = {gamma}, {ds} dataset, {nl} noise level, {n} observations')
                
                elif clf == 'logit':
                    for c in cs:
                        logit = LogisticRegression(penalty = 'l2', C = c)
                        score = cross_validate(logit, temp_df[['x', 'y']], temp_df['label'], cv = k_fold, return_train_score = True)
                        TRE_mean, TRE_std = score['train_score'].mean(), score['train_score'].std()
                        TESTE_mean, TESTE_std = score['test_score'].mean(), score['test_score'].std()
                        E_DIFF = TRE_mean - TESTE_mean
                        result.append((ds, n, nl, clf, 5, TRE_mean, TRE_std, TESTE_mean, TESTE_std, E_DIFF, c))
                        count_logit = count_logit + 1
                        print(f'{count_logit}) {clf}, C = {c}, {ds} dataset, {nl} noise level, {n} observations')

                '''
                # Note: with 10 observations and 5-fold cross-validation each training fold has only 8 rows, so KNN cannot use more neighbors than that; K should therefore stop at 7.
                elif clf == 'knn':
                    for k in range(1,20,2): # See Note above.
                        # print(f'Running {clf} model with K Nearest Neighbors: {k}, for {ds} dataset with {nl} noise level, substeted for {n} observations')
                        knn = KNeighborsClassifier(n_neighbors = k)
                        score = cross_validate(knn, temp_df[['x', 'y']], temp_df['label'], cv = k_fold, return_train_score = True)
                        TRE = score['train_score'].mean()
                        TESTE = score['test_score'].mean()
                        E_DIFF = TRE - TESTE
                        result.append((ds, n, nl, clf, 5, TRE, TESTE, E_DIFF, k))
                '''
                

1) svm, gamma = 0.001, Moon dataset, 0 noise level, 10 observations
2) svm, gamma = 0.005, Moon dataset, 0 noise level, 10 observations
3) svm, gamma = 0.01, Moon dataset, 0 noise level, 10 observations
4) svm, gamma = 0.1, Moon dataset, 0 noise level, 10 observations
5) svm, gamma = 0.5, Moon dataset, 0 noise level, 10 observations
6) svm, gamma = 1, Moon dataset, 0 noise level, 10 observations
7) svm, gamma = 5, Moon dataset, 0 noise level, 10 observations
8) svm, gamma = 10, Moon dataset, 0 noise level, 10 observations
9) svm, gamma = 100, Moon dataset, 0 noise level, 10 observations
10) svm, gamma = 1000, Moon dataset, 0 noise level, 10 observations
11) svm, gamma = 0.001, Moon dataset, 0.1 noise level, 10 observations
12) svm, gamma = 0.005, Moon dataset, 0.1 noise level, 10 observations
13) svm, gamma = 0.01, Moon dataset, 0.1 noise level, 10 observations
14) svm, gamma = 0.1, Moon dataset, 0.1 noise level, 10 observations
15) svm, gamma = 0.5, Moon dataset, 0.1 noise level, 10 o

In [309]:
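result_df = pd.DataFrame(result, columns = ['ds', 'n', 'nl', 'clf', 'KFolds', 'TRE_mean', 'TRE_std', 'TESTE_mean', 'TESTE_std', 'E_DIFF', 'Regularization'])
# Recompute E_DIFF as test minus train accuracy, so a negative value means the model scores lower on the validation folds.
result_df['E_DIFF'] = result_df.TESTE_mean - result_df.TRE_mean
# Keep only the last 1,020 rows (600 SVM + 420 logit); `result` presumably also holds rows from earlier runs of the loop.
result_df = result_df.tail(1020)
result_df.shape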

(1020, 11)

In [311]:
result_df.to_csv('Result DataFrame - Modeling Exercise 20_04_2022.csv', index = False)

### 3. Analysing

In [6]:
result_df_copy = pd.read_csv('Data/Result DataFrame - Modeling Exercise 20_04_2022.csv')
result_df_copy.head()

Unnamed: 0,ds,n,nl,clf,KFolds,TRE_mean,TRE_std,TESTE_mean,TESTE_std,E_DIFF,Regularization
0,Moon,10,0.0,svm,5,0.675,0.1,0.5,0.0,-0.175,0.001
1,Moon,10,0.0,svm,5,0.675,0.1,0.5,0.0,-0.175,0.005
2,Moon,10,0.0,svm,5,0.675,0.1,0.5,0.0,-0.175,0.01
3,Moon,10,0.0,svm,5,0.75,0.158114,0.5,0.0,-0.25,0.1
4,Moon,10,0.0,svm,5,0.85,0.122474,0.8,0.244949,-0.05,0.5


#### 3.1) For SVM only, For dataset of size 10k and for each dataset, What are the best model params? How stable is it?

In [315]:
result_df_copy.query('clf == "svm" and n == 10000').groupby(['ds', 'nl']).TRE_mean.max()

ds      nl 
Circle  0.0    1.000000
        0.1    0.871325
        0.2    0.799450
        0.3    0.811025
        0.4    0.833550
        0.5    0.841975
Moon    0.0    1.000000
        0.1    0.999950
        0.2    0.982925
        0.3    0.942175
        0.4    0.917350
        0.5    0.906525
Name: TRE_mean, dtype: float64

In [316]:
@widgets.interact
def show_fig31(noise_level = widgets.SelectionSlider(options = [0, 0.1, 0.2, 0.3, 0.4, 0.5])):
    df = result_df_copy.query('clf == "svm" and n == 10000').reset_index()
    # Copy the filtered rows so the new columns are not written to a slice of another frame.
    df = df.query('nl == @noise_level').copy()
    df['Dataset Type'] = df.ds
    df['Gamma'] = df.Regularization.astype(str)
    df['Accuracy'] = df.TESTE_mean * 100
    fig31 = px.bar(df,
                       x = 'Gamma',
                       y = 'Accuracy',
                       color = 'Dataset Type',
                       barmode='group',
                       text_auto = '.2s',
                       range_y = [0,100],
                       title = f'Accuracy Test Score by Gamma Values for SVM Model for Sample Size of 10,000 with {noise_level} Noise Level.')
    fig31.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
    return fig31


**Answer 3.1:** For all noise levels and for both datasets (Circle and Moon), the hyperparameter value that provides the best results (max TESTE) is **gamma = 1,000**.
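
Cell `In [315]` reports the maximum of `TRE_mean`, which is the training accuracy, while the answer refers to the test accuracy. One way to read the best gamma and its stability directly from the table is to rank by `TESTE_mean` and look at `TESTE_std` for the winning row. The sketch below shows the pattern on a toy frame with the same column names; the numbers are hypothetical and are not results of this experiment.

```python
import pandas as pd

# Hypothetical numbers in the same layout as result_df_copy; not real results.
toy = pd.DataFrame({
    "ds": ["Moon"] * 3 + ["Circle"] * 3,
    "nl": [0.2] * 6,
    "Regularization": [0.1, 1, 100] * 2,
    "TRE_mean": [0.90, 0.97, 1.00, 0.70, 0.84, 1.00],
    "TESTE_mean": [0.89, 0.96, 0.80, 0.69, 0.83, 0.60],
    "TESTE_std": [0.01, 0.01, 0.05, 0.02, 0.01, 0.06],
})

# Row label of the highest mean test accuracy within each (dataset, noise) group.
best_rows = toy.groupby(["ds", "nl"])["TESTE_mean"].idxmax()
best = toy.loc[best_rows, ["ds", "nl", "Regularization", "TESTE_mean", "TESTE_std"]]
print(best)
```

On the real table the same two lines can be applied to `result_df_copy.query('clf == "svm" and n == 10000')`.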

#### 3.2) For SVM only, for datasets of size 10k and for each dataset, what is the most stable model and what are its params? How good is it in comparison to other models? Explain using bias and variance terminology.



In [345]:
@widgets.interact
def show_fig32(ds = widgets.RadioButtons(options =result_df_copy.ds.unique())):
    df = result_df_copy.query('ds == @ds and clf == "svm" and n == 10000')
    df['Accuracy'] = df.TESTE_mean.apply(lambda s: s * 100)
    df['Test_Standard_Deviation'] = df.TESTE_std.apply(lambda s: s * 100)
    df['Regularization'] = df.Regularization.apply(lambda s: str(s))
    fig32  = px.line(df, x = 'Regularization', y = ['Test_Standard_Deviation'], color = 'nl', width = 1000, height = 600)
    return fig32.show()


**ANSWER 3.2:** For the SVM model, with a data size of 10,000 and for both datasets, the variance of the test accuracy (measured by its standard deviation across folds) decreases as the `Regularization` value (gamma) increases, for most noise levels.

#### 3.3) Does regularization help for linear models? Consider different dataset sizes.

In [318]:
@widgets.interact
def show_fig33(ds = widgets.RadioButtons(options =result_df_copy.ds.unique())):
    df = result_df_copy.query('clf == "logit" and ds == @ds')
    df['Regularization'], df['Accuracy Test Score'] = df.Regularization.apply(lambda s: str(s)), df.TESTE_mean.apply(lambda s: s * 100)
    return px.bar(df,
                  x = 'Regularization',
                  y = 'Accuracy Test Score',
                  facet_col = 'n',
                  facet_row= 'nl',
                  width = 1250,
                  height = 1250,
                  range_y = [0,100])


**Answer 3.3:** Generally, for both datasets (Moons and Circles), regularization does NOT improve the accuracy at any noise level. The reason might be that logistic regression, as a linear model, is a low-capacity model in the first place. Therefore regularization, which makes the model even less complex, doesn't help improve the results.
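
When reading the plot, keep in mind that scikit-learn's `C` is the *inverse* of the regularization strength, so small values on the `Regularization` axis mean strong regularization for the logit model. The sketch below illustrates this on stand-in moons data (the sample size and noise are arbitrary choices for the example) by comparing the size of the fitted weights.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

# Stand-in data; sample size and noise are illustrative choices.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

norms = {}
for c in [0.001, 1, 1000]:
    # The default penalty is L2; a smaller C means a stronger penalty.
    clf = LogisticRegression(C=c).fit(X, y)
    norms[c] = float(np.linalg.norm(clf.coef_))

# The weight norm grows as C grows, i.e. as regularization gets weaker.
print(norms)
```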

#### 3.4) For a given noise level of your choice, how do the train error, test error and their difference change with increasing data sizes? (Answer for SVM and LR separately.)

In [319]:
@widgets.interact
def show_fig34(ds = widgets.RadioButtons(options =result_df_copy.ds.unique()), clf = widgets.RadioButtons(options =result_df_copy.clf.unique())):
    df = result_df_copy.query('ds == @ds and clf == @clf and nl == 0 and Regularization == 0.1')
    df['n'] = df.n.apply(lambda s: str(s))
    df['Accuracy Test Score'] = df.TESTE_mean.apply(lambda s: s * 100)
    df['Accuracy Train Score'] = df.TRE_mean.apply(lambda s: s * 100)
    df['Difference'] = df.E_DIFF.apply(lambda s: s * 100)
    fig34  = px.scatter(df, x = 'n', y = ['Accuracy Test Score', 'Accuracy Train Score', 'Difference'], range_y = [-50,150],
                         title = f'The Connection Between Test and Train Accuracy and Their Difference to the Sample Size for {clf} Model, Noise Level of 0, Regularization (C and Gamma) of 0.1')
    return fig34.show()
                  


**Answer 3.4:** For noise level 0.0, for both classifiers (logit and SVM) and for both datasets (Moons and Circles), with regularization C = 0.1 and gamma = 0.1, the difference between the test score and the train score shrinks as the data size increases. This means that the overfitting observed for smaller datasets disappears as the number of samples increases.
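
The same train/test dynamic can be reproduced with scikit-learn's `learning_curve`, which refits the model on growing subsets of the training folds. This is only a sketch on stand-in data: the noise level, the training sizes and `gamma=0.1` are illustrative choices, not the exact setup of the experiment above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Stand-in data; noise and gamma are illustrative choices.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="rbf", gamma=0.1),
    X, y,
    train_sizes=[20, 50, 100, 500],  # absolute numbers of training samples
    cv=5,
)

# Average over the 5 folds and print train, test and their difference per size.
for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={int(n):4d}  train={tr:.3f}  test={te:.3f}  diff={tr - te:.3f}")
```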

#### 3.5) For a given noise level of your choice, how do the train error, test error and their difference change with increasing model complexity? (Answer for SVM and LR separately.)

In [326]:
@widgets.interact
def show_fig35():
    df = result_df_copy.query('nl == 0.5 and n == 10000')
    df['Regularization'] = df.Regularization.apply(lambda s: str(s))
    df['Accuracy Test Score'] = df.TESTE_mean.apply(lambda s: s * 100)
    df['Accuracy Train Score'] = df.TRE_mean.apply(lambda s: s * 100)
    df['Difference'] = df.E_DIFF.apply(lambda s: s * 100)
    fig35 = px.scatter(df,
                       x = 'Regularization',
                       y = ['Accuracy Test Score', 'Accuracy Train Score', 'Difference'],
                       facet_row = 'clf',
                       facet_col = 'ds',
                       range_y = [-50,150],
                       height = 1000,
                       width = 1250)
    return fig35.show()


**Answer 3.5:** For noise level 0.5 and data size 10,000, logistic regression's train and test scores stay close (a difference of around 0) consistently as the regularization value increases. A possible reason is that LR is a low-complexity model by default. For the SVM model, the difference decreases as the `Regularization` value (gamma) increases.
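
For the RBF SVM, a larger `gamma` means a narrower kernel and therefore a more flexible model, so sweeping `gamma` is a sweep over model complexity. A compact way to inspect train and test accuracy along that sweep is scikit-learn's `validation_curve`. The sketch below uses stand-in data with an illustrative noise level and a shortened gamma list, so its numbers are not comparable to the table above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Stand-in data; the noise level and gamma list are illustrative choices.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
gammas = [0.001, 0.1, 1, 10, 1000]

train_scores, test_scores = validation_curve(
    SVC(kernel="rbf"), X, y,
    param_name="gamma", param_range=gammas, cv=5,
)

# One row per gamma: mean train accuracy, mean test accuracy, and their gap.
for g, tr, te in zip(gammas, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"gamma={g:<7}  train={tr:.3f}  test={te:.3f}  diff={tr - te:.3f}")
```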

#### 3.6) Does the noise level (NL) affect the number of datapoints needed to reach optimal test results?

In [347]:
@widgets.interact
def fig36(ds = widgets.RadioButtons(options = result_df_copy.ds.unique()),
          clf = widgets.RadioButtons(options = result_df_copy.clf.unique()),
          reg = widgets.RadioButtons(options = result_df_copy.Regularization.unique())):
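    df = result_df_copy.query('ds == @ds and clf == @clf and Regularization == @reg').copy()
    df['nl'] = df.nl.astype(str)
    # Possible completion of this cell: test accuracy against sample size (log scale), one line per noise level.
    return px.line(df, x = 'n', y = 'TESTE_mean', color = 'nl', log_x = True, markers = True)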

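
One way to turn this question into a number is to find, for each noise level, the smallest sample size whose test accuracy is within a small tolerance of the best accuracy reached at that noise level. The sketch below shows the idea on a toy frame; the accuracies are hypothetical and the tolerance `TOL` is an assumed threshold, so the output says nothing about the actual experiment.

```python
import pandas as pd

# Hypothetical test accuracies per noise level and sample size; not real results.
toy = pd.DataFrame({
    "nl": [0.0] * 4 + [0.4] * 4,
    "n": [10, 100, 1000, 10000] * 2,
    "TESTE_mean": [0.90, 0.97, 1.00, 1.00, 0.60, 0.75, 0.82, 0.84],
})

TOL = 0.01  # assumed tolerance for "close enough to the best score"


def samples_needed(group):
    """Smallest n whose test accuracy is within TOL of the group's best."""
    best = group["TESTE_mean"].max()
    return group.loc[group["TESTE_mean"] >= best - TOL, "n"].min()


needed = {nl: samples_needed(g) for nl, g in toy.groupby("nl")}
print(needed)
```

Applied to `result_df_copy` for a fixed dataset, classifier and hyperparameter value, a value of `n` that grows with `nl` would indicate that noisier data needs more samples.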



Bonus:

* For SVM: Select one dataset with a 0.2 noise level. Identify the optimal model params, and visualize the decision boundary learned.
  * Hint: Use a grid. See classification models notebook 
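
The grid hint can be sketched as follows: predict a label for every point of a regular grid that covers the data, then draw the predictions as a background under the scatter plot. The example uses scikit-learn's `make_moons` as a stand-in for the course helper, and `gamma=1.0` is a placeholder rather than the optimal value identified by the experiment.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Stand-in data with a 0.2 noise level; gamma is a placeholder value.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Build a regular grid that covers the data with a small margin.
xs = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200)
ys = np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200)
xx, yy = np.meshgrid(xs, ys)

# Predict a label for every grid point, then reshape back to the grid.
grid = np.column_stack([xx.ravel(), yy.ravel()])
zz = clf.predict(grid).reshape(xx.shape)

# zz can now be drawn as a filled contour or heatmap under the scatter plot,
# for example with plotly's go.Contour(x=xs, y=ys, z=zz).
print(zz.shape)
```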

## Appendix

### More hints!

If you build the datasets dataframe correctly, you'll have **one** dataframe that has dataset_name and noise_level columns, as well as the regular x, y, label columns. To ensure you've appended everything correctly, group by the proper columns and look at the size:

In [106]:
# Use proper groupby statement to ensure the datasets dataframe contains data as expected. You should see the following result:

Your 

Your experiment code should look something like this:

In [19]:
datasets_type = ['circles', 'moons']
k_folds = 10
n_samples = [10, 50, 100, 1000, 10000]
noise_levels = [0, 0.1, 0.2, 0.3, 0.4, 0.5]
clf_types = ['log_reg', 'svm']
hp_range = <'Your hyper parameters ranges here'>
regularization_values = <'Your regularization values here'>
results = []
for ds_type in datasets_type:
    print(f'Working on {ds_type}')
    for nl in noise_levels:
        for n in n_samples:
            ds = datasets.query(<'your query here'>).head(n)
            print(f'Starting {k_folds}-fold cross validation for {ds_type} datasets with {n} samples and noise level {nl}. Going to train {clf_types} classifiers.')
            for k in range(k_folds):
                X, Y = <'Your code here'>
                x_train,x_test,y_train,y_test= <'Your code here'>
                for clf_type in clf_types:
                    if clf_type == 'log_reg':
                        for regularization_value in regularization_values:
                            train_acc, test_acc = <'Your code here'>
                            results.append(<'Your code here'>)
                    if clf_type == 'svm':
                        for gamma in hp_range:
                            train_acc, test_acc = <'Your code here'>
                            results.append(<'Your code here'>)

SyntaxError: invalid syntax (3386946450.py, line 6)

### Question 1 - Manual Classification

The purpose of this exercise is to exemplify the need for a fitting algorithm. We will do so by trying to find only 2 model parameters by ourselves.

In [None]:
slope, intercept = 2.5, 6

In [None]:
x_1, x_2 = 0.2, 0.6
on_line = [[x, x*slope + intercept,'on_line'] for x in np.linspace(-1,2,100)]

above_line = [[x_1, x_1*slope + intercept + 2, 'Above'], 
              [x_2, x_2*slope + intercept + 2, 'Above']] 

below_line = [[x_1, x_1*slope + intercept - 2, 'Below'], 
              [x_2, x_2*slope + intercept - 2, 'Below']] 

In [None]:
columns = ['x','y','label']
data = pd.DataFrame(on_line + above_line + below_line, columns = columns)

In [None]:
px.scatter(data, x='x', y='y', color = 'label')
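
To judge a manual guess of the two parameters, the four labeled points can be scored against a candidate line. The sketch below is one possible way to do this and is not part of the original exercise; the function names `classify` and `accuracy` are made up for the example.

```python
def classify(x, y, slope, intercept):
    """Label a point by which side of the line y = slope * x + intercept it is on."""
    return "Above" if y > slope * x + intercept else "Below"


def accuracy(points, slope, intercept):
    """Fraction of (x, y, label) points whose label matches the line's verdict."""
    hits = sum(classify(x, y, slope, intercept) == label for x, y, label in points)
    return hits / len(points)


# The four labeled points from the cells above, built from slope 2.5 and intercept 6.
points = [
    (0.2, 0.2 * 2.5 + 6 + 2, "Above"),
    (0.6, 0.6 * 2.5 + 6 + 2, "Above"),
    (0.2, 0.2 * 2.5 + 6 - 2, "Below"),
    (0.6, 0.6 * 2.5 + 6 - 2, "Below"),
]

print(accuracy(points, slope=2.5, intercept=6))  # the generating line
print(accuracy(points, slope=0.0, intercept=0))  # a poor manual guess
```

Trying several guesses by hand and comparing their scores is exactly the search that a fitting algorithm automates.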