## Prepare Python environment


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
random_state = 5 # use this to control randomness across runs e.g., dataset partitioning

## Preparing the Diabetes Dataset (2 points)

We will use the Pima Indians Diabetes dataset from the UCI Machine Learning Repository. Details about this dataset can be found [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database). The objective of this project is to predict whether or not a female patient has diabetes based on the diagnostic measurements in the dataset.

The dataset consists of several medical predictor variables (features) and one target variable indicating whether or not the person has diabetes. Predictor variables include the number of pregnancies the patient has had, glucose level, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age.

### Loading the dataset

In [None]:
# These are the names of the columns in the dataset. They include all features of the data and the label.
col_names = ['pregnancies', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# Download and load the dataset
import os
if not os.path.exists('diabetes.csv'): 
    !wget https://raw.githubusercontent.com/JHA-Lab/ece364_2023/master/dataset/diabetes.csv 
diabetes_data = pd.read_csv('diabetes.csv', header=1, names=col_names)

FEATURE_NAMES=diabetes_data.drop('label',axis=1).columns

# Display the first five instances in the dataset
diabetes_data.head(5)

In [None]:
# Check the type of data in each column
diabetes_data.info()

#### Use the `describe` function to display some statistics of the data. See [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) for details about this function.

In [None]:
# Display some stats
diabetes_data.describe()

1. Count tells us the number of non-missing values in a feature.

2. Mean tells us the mean value of that feature.

3. Std tells us the standard deviation of that feature.

4. Min tells us the minimum value of that feature.

5. 25%, 50%, and 75% are the 25th, 50th (median), and 75th percentiles, i.e., the quartiles, of each feature.

6. Max tells us the maximum value of that feature.
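
As a minimal sketch of how to read these statistics, the example below calls `describe` on a small made-up DataFrame (`toy` and its values are hypothetical, not taken from the diabetes data). One `bmi` entry is deliberately missing so that the `count` row differs between columns.

```python
import pandas as pd

# Hypothetical toy data (not the diabetes dataset); one 'bmi' value is missing
toy = pd.DataFrame({
    'glucose': [85, 90, 120, 150],
    'bmi': [22.0, None, 30.0, 26.0],
})

stats = toy.describe()
print(stats)

# count ignores missing values, so it differs between the two columns
print(stats.loc['count'])
# the 50% row is the median of each column
print(stats.loc['50%'])
```

Note that `count` only excludes values that pandas treats as missing (e.g., `NaN`); a placeholder value such as 0 is still counted as present.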

#### Visualize the distribution of diabetic vs non-diabetic patients

In [None]:
fig, ax = plt.subplots(1, 1)
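# sort_index() orders the counts by label value (0, 1) so they line up with the labels list
ax.pie(diabetes_data.label.value_counts().sort_index(), autopct='%1.1f%%', labels=['No diabetes', 'Diabetes'], colors=['green', 'red'])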
plt.axis('equal')
plt.ylabel('');

### Extract target and descriptive features (0.5 points)

In [None]:
# Store all the features from the data in X
X = # TODO
# Store all the labels in y
y = # TODO

In [None]:
# Convert data to numpy array
X = # TODO
y = # TODO

### Create training and validation datasets (0.5 points)


Split the data into training and validation sets using `train_test_split`. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details. To get consistent results across runs, set `random_state` to the value defined earlier. We use 80% of the data for training and 20% for validation.
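
The sketch below illustrates an 80/20 split on made-up arrays. `X_toy`, `y_toy`, and the output names are hypothetical, and the seed is written as a literal 5 only to keep the snippet self-contained. It is meant to show the shape of the call, not the solution for the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays: 10 samples, 2 features, binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the samples; a fixed random_state makes the split repeatable
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=5
)

print(X_tr.shape, X_va.shape)
```

`train_test_split` also accepts an optional `stratify` argument that preserves class proportions in both subsets; the assignment does not ask for it.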

In [None]:
X_train, X_val, y_train, y_val = # TODO

### Preprocess the dataset (1 point)

#### Preprocess the data by standardizing each feature to have zero mean and unit standard deviation. This can be done using the `StandardScaler` class. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for more details.
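
A minimal sketch of the fit-then-transform pattern on made-up numbers follows (`X_tr`, `X_va`, and `toy_scaler` are hypothetical names). The scaler learns the mean and standard deviation from the training data only, and the same statistics are then reused for the validation data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 training samples, 1 validation sample, 2 features
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_va = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()

# fit_transform estimates per-feature mean/std on the training data and applies them
X_tr_std = toy_scaler.fit_transform(X_tr)

# transform reuses the training statistics; it does not re-estimate them
X_va_std = toy_scaler.transform(X_va)

print(X_tr_std.mean(axis=0), X_tr_std.std(axis=0))
print(X_va_std)
```

Because the validation data is scaled with the training statistics, its features will generally not have exactly zero mean and unit variance, which is expected.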


In [None]:
# Define the scaler for scaling the data
scaler = # TODO

# Normalize the training data
X_train = # TODO

# Use the scaler defined above to standardize the validation data by applying the same transformation to the validation data.
X_val = # TODO

## Training error-based models (18 points)


#### We will use the `sklearn` library to train a Multinomial Logistic Regression classifier and Support Vector Machines. 


### Exercise 1:  Learning a Multinomial Logistic Regression classifier (4 points)

#### Use `sklearn`'s `SGDClassifier` to train a multinomial logistic regression classifier (i.e., using a one-versus-rest scheme) with Stochastic Gradient Descent. Review ch.7 and see [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier) for more details. 

#### Set `random_state` to the value defined above, and increase `n_iter_no_change` to 1000 and `max_iter` to 10000 to facilitate convergence.

#### Report the model's accuracy over the training and validation sets.
 

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score 

In [None]:
# TODO

#### Explain any performance difference observed between the training and validation datasets.

Insert answer

### Exercise 2: Learning a Support Vector Machine (SVM) (14 points)

#### Use `sklearn`'s `SVC` class to train an SVM (i.e., using a [one-versus-one scheme](https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-one)). Review ch.7 and see [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) for more details. 
 

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

#### Exercise 2a: Warm up (2 points)

#### Train an SVM with a linear kernel. Set `random_state` to the value defined above. Keep all other parameters at their defaults.

#### Report the model's accuracy over the training and validation sets.

In [None]:
# TODO

#### Exercise 2b: Evaluate a polynomial kernel function (4 points)

#### Try fitting an SVM with a polynomial kernel function and vary the degree among {1, 2, 3, 4}. Note that degree=1 yields a linear kernel. 

#### For each fitted classifier, report its accuracy over the training and validation sets. 

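#### As before, set `random_state` to the value defined above. Set the regularization parameter `C=100`. In `SVC`, the strength of the regularization is inversely proportional to `C`, so a large `C` penalizes margin violations heavily. When the data is not linearly separable, this encourages the model to fit the training data closely. Keep all other parameters at their default values.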
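
For reference, the polynomial kernel as documented for scikit-learn's `SVC` can be written as follows, where $d$ is `degree`, $r$ is `coef0` (default 0), and $\gamma$ is `gamma` (default `'scale'`):

$$
K(\mathbf{x}, \mathbf{x}') = \left(\gamma \, \mathbf{x}^\top \mathbf{x}' + r\right)^{d}
$$

With $d = 1$ and $r = 0$ this reduces to $K(\mathbf{x}, \mathbf{x}') = \gamma \, \mathbf{x}^\top \mathbf{x}'$, a rescaled linear kernel, which is why `degree=1` behaves like a linear kernel.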

In [None]:
# TODO

#### Explain the effect of increasing the degree of the polynomial.

Insert answer

#### Exercise 2c: Evaluate the radial basis kernel function (6 points)

#### Try fitting an SVM with a radial basis kernel function and vary the kernel coefficient $\gamma$ (which acts as an inverse length scale) among {0.01, 0.1, 1, 10, 100}.

#### For each fitted classifier, report its accuracy over the training and validation sets. 

#### As before, set `random_state` to the value defined above. Set the regularization parameter `C=100`. In `SVC`, the strength of the regularization is inversely proportional to `C`, so a large `C` penalizes margin violations heavily. When the data is not linearly separable, this encourages the model to fit the training data closely (read more [here](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html)). Keep all other parameters at their default values.
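
To build intuition for $\gamma$ before fitting any model, the sketch below evaluates the RBF kernel $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2)$ between two made-up points at unit distance. The helper `rbf` and the points `x` and `z` are hypothetical; the result is checked against scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma):
    """RBF kernel value exp(-gamma * ||x - z||^2) for two 1-D vectors."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Two hypothetical points whose squared distance is 1
x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 0.0]])

gammas = [0.01, 0.1, 1, 10, 100]
values = []
for gamma in gammas:
    k = rbf(x[0], z[0], gamma)
    # Cross-check the hand-written formula against scikit-learn
    assert np.isclose(k, rbf_kernel(x, z, gamma=gamma)[0, 0])
    values.append(k)
    print(f"gamma={gamma}: K(x, z) = {k:.3e}")
```

As $\gamma$ grows, the kernel value between distinct points decays toward zero, so each training point influences only a small neighborhood around itself.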

In [None]:
# TODO

#### Comment on the effect of increasing/reducing the kernel coefficient $\gamma$. Also, compare the performance of the classifiers trained with the RBF kernel function against those trained with the polynomial and linear kernel functions (i.e., Ex. 2b).

Insert answer

#### Exercise 2d: Briefly state the main difference between the logistic regression classifier and the SVM. (2 points)

Insert answer