<img src='https://weclouddata.com/wp-content/uploads/2016/11/logo.png' width='30%'>
-------------

<h3 align='center'> Applied Machine Learning Course - Assignment Week 2 </h3>
<h1 align='center'> Iris Dataset Classification </h1>

<br>
<center align="left"> Developed by:</center>
<center align="left"> WeCloudData Academy </center>


<h2>Background</h2>

In this assignment, we will practice some advanced Logistic Regression techniques on the famous [iris dataset](https://archive.ics.uci.edu/ml/datasets/iris).

This is perhaps the best known database to be found in the machine learning literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. 

Predicted attribute: **class of iris plant**. 

This is an exceedingly simple domain.

<h2>Data Description</h2>

<b>Features:</b>
1. sepal length in cm 
2. sepal width in cm 
3. petal length in cm 
4. petal width in cm 


<b>Target Value:</b>

- class: 
  1. Iris Setosa 
  2. Iris Versicolour 
  3. Iris Virginica

## $\Omega$ 1: Explore the Training Data

In [None]:
# Import necessary libraries (numpy, sklearn, matplotlib, seaborn)

from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set(style='ticks', palette='Set2')
%matplotlib inline

- <b> Step 1</b> 

  Load the iris dataset from sklearn into a variable called `data`.

  - Hint: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris
  - We used a similar technique to load the wine dataset in the in-class lab.
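As a rough sketch of what the loader returns (the names `X` and `y` are our own choice here, not something the assignment requires), `load_iris` gives a dictionary-like object whose fields hold the feature matrix, the labels, and their names:

```python
from sklearn.datasets import load_iris

data = load_iris()

X = data.data      # feature matrix, one row per flower
y = data.target    # integer class labels
print(X.shape, y.shape)
print(data.feature_names)
print(data.target_names)
```

Inspecting the shapes and name lists this way is also one possible starting point for the exploration in Step 2.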

In [None]:
# Step 1


- <b> Step 2</b> 

  Explore the dataset

  - Figure out how many features there are and how many classes there are in the dataset

In [None]:
# Step 2


- <b> Step 3</b> 

  Explore class distribution

  - Find out how many samples there are for each target class.
  - Visualize 2D feature distribution on this dataset (hint: we used similar techniques in the in-class lab)
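One way to count samples per class, sketched on a small made-up label vector (`y_toy` is hypothetical and only stands in for the real target array), is `np.unique` with `return_counts=True`:

```python
import numpy as np

# Toy label vector standing in for the real target array
y_toy = np.array([0, 0, 1, 2, 2, 2, 1, 0])

# np.unique returns the sorted distinct labels and how often each occurs
classes, counts = np.unique(y_toy, return_counts=True)
class_counts = {int(c): int(n) for c, n in zip(classes, counts)}
print(class_counts)
```

The same call applied to the iris target array should show whether the classes are balanced.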

In [None]:
# Step 3

def visualize_2d(feature_indices, all_feature_names, target_names, X, y):
    # TODO: implement the similar visualization routine as we saw in the lab
    pass

In [None]:
# TODO: pick two features and visualize the data distribution w.r.t. the chosen two features


In [None]:
# TODO: compute the distribution of different classes


## $\Omega$ 2: Prepare Training and Testing Data

- <b>Step 1</b>

  - Use the `train_test_split` function in scikit-learn to split `X` and `y` into 80% training data and 20% testing data
  
  - Hint: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
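A minimal sketch of an 80/20 split on synthetic stand-in data follows. The arrays `X_demo` and `y_demo` are made up for illustration, and passing `random_state` and `stratify` is an optional choice on our part, not something the assignment asks for:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 150 samples, 4 features, 3 balanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = np.repeat([0, 1, 2], 50)

# test_size=0.2 keeps 20% of the rows for testing;
# stratify preserves the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible between runs, which helps when comparing models later.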

In [1]:
# Step 1



- <b>Step 2</b>

  - Perform feature standardization on `X_train` by using sklearn's `StandardScaler`, and use the same standardizer to standardize `X_test`.
  - Hint: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
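The key point in this step is that the scaler learns its mean and standard deviation from the training data only and then reuses them on the test data. A small sketch on synthetic arrays (`X_tr_demo` and `X_te_demo` are hypothetical stand-ins):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=5.0, scale=2.0, size=(120, 4))
X_te_demo = rng.normal(loc=5.0, scale=2.0, size=(30, 4))

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr_demo)  # learn mean/std from training data only
X_te_std = scaler.transform(X_te_demo)      # reuse the training statistics

print(X_tr_std.mean(axis=0).round(3), X_tr_std.std(axis=0).round(3))
```

After the transform, each training feature has mean close to 0 and standard deviation close to 1; the test features will generally be near, but not exactly at, those values.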

In [2]:
# Step 2



- <b>Step 3</b>

  - Repeat the data visualization you did in `Step 3: Explore class distribution` in the first section above on the training and test data (`X_train` and `X_test`) after the standardization

In [None]:
# Step 3



## $\Omega$ 3: Multi-class Logistic Regression

We have practiced binary classification using Logistic Regression. However, Logistic Regression can also perform multi-class classification.

- <b>Step 1</b>
  - Create a Logistic Regression model `lr` and fit it on `X_train` and `y_train`. Do you think any special setup is needed to make this Logistic Regression classifier capable of multi-class classification?
  - Hint: Look at the function doc at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


- <b>Step 2</b>
  - Import `classification_report` and `accuracy_score` from `sklearn.metrics` to evaluate our classifier
  - Use `lr.predict` on `X_test` to get the predicted values `y_predict` and call `classification_report(y_test, y_predict)` to generate the classification report.
  - Call `accuracy_score` to get the prediction accuracy.

- <b>Step 3</b>
  - What's your model's performance (using the same set of metrics as in Step 2) on the `training` data? This is to understand whether our model is underfitting or not, i.e., if the training performance is not almost perfect, it is an indicator that our model is not expressive enough to represent the variance in the training data.
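The three steps above can be sketched end to end as follows. This is only one possible arrangement: the variable names (`clf`, `X_a`, `X_b`, and so on) are our own, `max_iter=1000` is an arbitrary choice, and the sketch relies on the default `LogisticRegression` settings accepting a three-class target, which you should confirm in the documentation for your installed scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_a, X_b, y_a, y_b = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0, stratify=y_all
)

# Standardize with statistics learned from the training part only
scaler = StandardScaler().fit(X_a)
X_a, X_b = scaler.transform(X_a), scaler.transform(X_b)

clf = LogisticRegression(max_iter=1000)  # default settings; no multi-class flag passed
clf.fit(X_a, y_a)

# Report the same metrics on the training and the test part
for name, X_part, y_part in [("train", X_a, y_a), ("test", X_b, y_b)]:
    y_hat = clf.predict(X_part)
    print(name, "accuracy:", accuracy_score(y_part, y_hat))
    print(classification_report(y_part, y_hat))
```

Comparing the two reports side by side is what Step 3 is after: a training score far from perfect hints at underfitting, while a large gap between training and test scores hints at overfitting.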

## $\Omega$ 4: Model generality

Logistic Regression in scikit-learn has regularization built in. It comes with many different variants, depending on the particular optimization solver and so on. But roughly speaking, there is a parameter $C$ which controls the strength of the regularization.

According to scikit-learn's documentation, this parameter $C$ is:

``` Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.```

Roughly speaking, the cost function has one of the following forms, where $L(\hat{y}^{(i)}, y^{(i)})$ is the loss on the $i$-th sample (the log loss for Logistic Regression):

<center>

$J_\theta=\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{1}{C} \times ||{\theta}||_1$ (using l1 norm), or
</center>

<center>
$J_\theta=\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{1}{C} \times ||{\theta}||_2^2$ (using l2 norm)
</center>

In this exercise, we will experiment with using different $C$ values, and find the optimal $C$ value on a validation set. Then we use the model built with this optimal $C$ value to perform prediction on the test set.
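To get a feel for what $C$ does before tuning it, the sketch below fits the same model with three arbitrary $C$ values on the full standardized iris data and prints the l2 norm of the learned coefficients. The chosen values and the name `coef_norms` are ours, and `max_iter=5000` is simply a generous iteration budget:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_all)

coef_norms = {}
for c in [0.01, 1, 100]:
    # Smaller C means stronger regularization, which pulls coefficients toward zero
    model = LogisticRegression(C=c, max_iter=5000).fit(X_scaled, y_all)
    coef_norms[c] = float(np.linalg.norm(model.coef_))
    print(f"C={c}: ||theta||_2 = {coef_norms[c]:.3f}")
```

The coefficient norm should grow as $C$ grows, which matches the statement that smaller values specify stronger regularization.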

- <b> Step 1: </b>
  - Further split the current training set to be 80% actual training set (`X_t` and `y_t`) and 20% validation set (`X_val` and `y_val`). 

In [3]:
# Step 1




- <b> Step 2: </b>
  - For each $C$ value in this set: $[0.001, 0.01, 0.1, 1, 5, 10, 50, 100]$, fit a LogisticRegression model `lr` with this $C$ value on `X_t` and `y_t`, and use `lr` to predict `X_val`. Keep track of the accuracy score on the training set (`X_t` and `y_t`) and validation set (`X_val` and `y_val`).
  - Report the $C$ value which achieves the best validation accuracy score.


In [None]:
# Step 2
candidate_c = [0.001, 0.01, 0.1, 1, 5, 10, 50, 100]
train_accs = []
val_accs = []

# TODO: compute the training and validation accuracy scores over different $C$ values
    
    
    
    
# sort val_accs to find the best val accuracy and the corresponding c value which achieves the best val accuracy
best_c_idx, best_val_acc = sorted(enumerate(val_accs), key=lambda x : x[1], reverse=True)[0]
best_c = candidate_c[best_c_idx]

print(f'Best validation accuracy {best_val_acc * 100.0:.1f}% is achieved using C={best_c:.3f}')

- <b> Step 3</b>
  - Plot a curve of training accuracies vs. the different $C$ values and a similar curve of validation accuracies vs. the different $C$ values.

- <b>Step 4</b>

  Use the optimal value of $C$ you found in Step 2 to train a new Logistic Regression classifier `lr` on the combination of `X_t` and `X_val` (i.e., the original `X_train`) and test your new model on the test data `X_test` and `y_test`. Report the performance now and compare it with what you obtained in the [Multi-class Logistic Regression]($\Omega$-3:-Multi-class-Logistic-Regression-) section.

In [None]:
# Step 4





## $\Omega$ 5 (Advanced): Implement gradient descent on Logistic Regression

- Under the hood, scikit-learn already uses regularization (via the $C$ value) and a gradient-based optimization method to fit LogisticRegression.

- Therefore, you have been enjoying the simple life of fitting a LogisticRegression classifier by calling sklearn directly. 

- In this step, you can challenge yourself to implement the most vanilla version of training a logistic regression by gradient descent, i.e., **no regularization**.

- However, this dataset is a **multi-class** dataset, and it requires special care to perform multi-class classification. Therefore, let's first transform this dataset into a binary classification dataset by merging the last two classes. The code for this part is provided to you.

- For more info on multi-class logistic regression, see https://cedar.buffalo.edu/~srihari/CSE574/Chap4/4.3.4-MultiLogistic.pdf

In [None]:
for y in [y_train, y_test]:
    class_2_idx = np.where(y==2) # the indices of the samples where the original class is 2
    y[class_2_idx] = 1 # assign these samples a new class label: 1, i.e., they are merged into the class where labels=1

# confirm there are only two classes in `y_train` and `y_test` now:
print(f'unique values in y_train: {np.unique(y_train)}')
print(f'unique values in y_test: {np.unique(y_test)}')

- <b>Step 1: </b>

  Initialize all parameters randomly

<center>
$\hat{y}=\sigma({\theta^T} x)=\cfrac{1}{1+e^{-\theta^Tx}}$
</center>
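As a quick illustration of the logistic function $\sigma(z)=1/(1+e^{-z})$, the sketch below evaluates it on a tiny made-up input. `X_toy` and `theta_toy` are hypothetical values chosen only so the numbers are easy to follow:

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy inputs: 3 samples, 2 features, and an arbitrary theta
X_toy = np.array([[0.0, 0.0], [1.0, 2.0], [-1.0, -2.0]])
theta_toy = np.array([0.5, 0.25])

# One predicted probability per sample
y_hat_toy = sigmoid(X_toy @ theta_toy)
print(y_hat_toy)
```

Note that a zero input maps to a probability of 0.5, and that inputs of opposite sign map to probabilities that sum to 1.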

In [None]:
# Step 1
def initialize_theta(dim):
    import numpy.random
    # TODO: randomly initialize coefficient to be a dim-sized 1D vector
    
    

- <b>Step 2</b>

  - Calculate the gradient of the loss $J$ of the logistic regression model $\hat{y}=\sigma(\theta^Tx)$ with respect to each $\theta_i$.
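For reference, one standard choice of loss is the average binary cross-entropy. This is an assumption on our part, since the assignment does not fix the exact form. Writing $h^{(i)}=\sigma(\theta^Tx^{(i)})$ for the prediction on the $i$-th of $m$ samples:

<center>
$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h^{(i)}+(1-y^{(i)})\log(1-h^{(i)})\right]$
</center>

Using $\sigma'(z)=\sigma(z)(1-\sigma(z))$ and the chain rule, the partial derivatives under this choice simplify to:

<center>
$\frac{\partial{J}}{\partial{\theta_j}}=\frac{1}{m}\sum_{i=1}^{m}(h^{(i)}-y^{(i)})\,x_j^{(i)}$, for $j=0,1,\ldots,p$
</center>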

In [None]:
# Step 2: implement this function
import numpy as np

def prediction(X, current_theta):
    # TODO: compute the current estimation of the output, H, given the current_theta and X
    # X.shape: m*(p+1)
    # current_theta.shape: (p+1)*1
    
    
    
def loss(X, y, current_theta):
    H = prediction(X, current_theta) # H.shape: m*1
    # TODO: compute loss function J, given H, and y
    
    
    
def loss_gradient(X, y, current_theta):
    # TODO: implement the loss gradient
    # compute the current estimation of the output, H, given the current_theta and X
    # compute loss function J, given H, and y
    # compute gradient of this loss function against current_theta
    
    


- <b>Step 3</b>

  - Verify your gradient computation
  
  According to the definition of gradient: 
  <center>
  $\frac{\partial{J}}{\partial{\theta}}=(\frac{\partial{J}}{\partial{\theta_0}},\frac{\partial{J}}{\partial{\theta_1}},\ldots,\frac{\partial{J}}{\partial{\theta_p}})$
  </center>
  
  That is, the gradient is just the generalization of derivatives to the multivariate situation.
  
  For derivatives, we know the (central-difference) definition is:
  
  <center>
  $\frac{dJ(\theta)}{d\theta}=\lim_{\delta\rightarrow 0}\frac{J(\theta+\delta)-J(\theta-\delta)}{2\delta}$,
  </center>
  
  i.e., it can be computed by taking a really small step of $\delta$ on either side of the original value of $\theta$, computing the difference between the two new values $J(\theta+\delta)$ and $J(\theta-\delta)$, and then dividing by the total distance $2\delta$ between the two points.
  
  Gradients can be computed in a similar fashion, one coordinate at a time:
  
  <center>
  $\frac{\partial{J}}{\partial{\theta}}=(\frac{\partial{J}}{\partial{\theta_0}},\frac{\partial{J}}{\partial{\theta_1}},\ldots,\frac{\partial{J}}{\partial{\theta_p}})$
  <br>
  
  $\frac{\partial{J(\theta)}}{\partial{\theta_j}}=\lim_{\delta\rightarrow 0}\frac{J(\theta_j+\delta)-J(\theta_j-\delta)}{2\delta}$, for $j=0,1,\ldots,p$
  </center>
  
  Here $J(\theta_j\pm\delta)$ means that only the $j$-th parameter is shifted while all the others stay fixed. Therefore, we can verify the correctness of your computed gradients by checking whether each one is approximately equal to $\frac{J(\theta_j+\delta)-J(\theta_j-\delta)}{2\delta}$ when $\delta$ is a very small number.
  
  Here, we provide the verification code for you, but please study the code carefully to understand the idea behind it. If, at training time later, your computed gradients (the `loss_gradient` function defined earlier) do not pass this verification code, it means your implementation of the `loss_gradient` function is very likely wrong.
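To see the idea in isolation, here is a small sketch that applies the same central-difference check to a toy loss whose gradient is known exactly. `toy_loss` and `toy_gradient` are hypothetical stand-ins and are unrelated to the `loss` and `loss_gradient` functions you implement:

```python
import numpy as np

def toy_loss(theta):
    # J(theta) = sum(theta^2), whose exact gradient is 2 * theta
    return float(np.sum(theta ** 2))

def toy_gradient(theta):
    return 2.0 * theta

theta0 = np.array([1.0, -2.0, 0.5])
delta = 1e-5

approx = np.zeros_like(theta0)
for j in range(len(theta0)):
    step = np.zeros_like(theta0)
    step[j] = delta  # perturb only the j-th coordinate
    approx[j] = (toy_loss(theta0 + step) - toy_loss(theta0 - step)) / (2 * delta)

print(approx, toy_gradient(theta0))
```

The two printed vectors should agree to many decimal places; a visible mismatch would point to an error in the analytic gradient.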

In [None]:
from copy import deepcopy


def check_loss_gradient(X, y, current_theta):
    gradients = loss_gradient(X, y, current_theta)
    # gradients: a 1-D vector with p+1 values, i.e., gradients[j] is the j-th partial derivative
    
    # delta must be small, but not so small that floating-point round-off dominates the difference
    delta = 1e-5
    
    approx_gradients = []
    for j in range(len(current_theta)):
        forward_theta = deepcopy(current_theta) # we cannot use `forward_theta=current_theta`
        # as this would not copy anything: both names would refer to the same array, so modifying
        # `forward_theta` would modify the original current_theta as well
        
        forward_theta[j] = forward_theta[j] + delta # we only add delta to the j-th parameter
        backward_theta = deepcopy(current_theta)
        backward_theta[j] -= delta # we only subtract delta from the j-th parameter
        
        forward_loss = loss(X, y, forward_theta)
        backward_loss = loss(X, y, backward_theta)
        approx_gradient_j = (forward_loss - backward_loss) / (2*delta)
        
        approx_gradients.append(approx_gradient_j)
    
    # flatten so that a (p+1)*1 column vector compares element-wise with the 1-D list
    return np.allclose(np.ravel(gradients), approx_gradients)

In [None]:
# for example, you should be seeing `True` returned by calling:
check_loss_gradient(X=X_train, y=y_train, current_theta=initialize_theta(dim=X_train.shape[1]))

- <b>Step 4</b>

  - Define a learning rate $\alpha$, which is the step size used to update each $\theta_i$: $\theta_i=\theta_i - \alpha \times loss\_gradient_i$. Repeatedly update all $\theta_i$'s until the loss converges.
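The update rule can be tried out on a toy problem before applying it to logistic regression. The sketch below minimizes a simple quadratic; `quad_loss`, `quad_gradient`, `target`, and the chosen `alpha` and `tolerance` are all hypothetical and only illustrate the loop structure:

```python
import numpy as np

target = np.array([3.0, -1.0])  # minimizer of the toy loss

def quad_loss(theta):
    return float(np.sum((theta - target) ** 2))

def quad_gradient(theta):
    return 2.0 * (theta - target)

alpha = 0.1        # learning rate (step size)
tolerance = 1e-8   # stop when the loss changes by less than this
theta = np.zeros(2)
history = [quad_loss(theta)]

while len(history) < 2 or abs(history[-1] - history[-2]) > tolerance:
    theta = theta - alpha * quad_gradient(theta)  # theta_i = theta_i - alpha * gradient_i
    history.append(quad_loss(theta))

print(len(history) - 1, theta)
```

With a learning rate that is too large the loss would oscillate or diverge instead of shrinking, so it is worth experimenting with a few values of `alpha`.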

In [None]:
# Step 4:

def update_theta(gradient, current_theta, step_size):
    # TODO: implement theta update logic
     
        


In [None]:
# We put the skeleton code here for you

# initialization:
# convert pandas dataframe into numpy arrays
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)

precision = 0.001
step_size = 0.1 # use your own step_size
current_theta = initialize_theta(dim=X_train.shape[1]) # you need to determine the input value to `dim`
current_loss = loss(X_train, y_train, current_theta)
losses = [current_loss]

In [None]:
# main gradient descent loop
while len(losses) < 2 or abs(losses[-1] - losses[-2]) > precision: # or some other convergence condition
    gradient = loss_gradient(X_train, y_train, current_theta)
    if not check_loss_gradient(X_train, y_train, current_theta):
        print('Check your gradient implementation!')
        break
    
    current_theta = update_theta(gradient, current_theta, step_size)
    
    # compute current loss
    current_loss = loss(X_train, y_train, current_theta)
    losses.append(current_loss)
    
# once converged, current_theta holds the coefficients of the logistic regression model
print(f'converged after {len(losses) - 1} iterations')

In [None]:
# plot loss against number of iterations
# you should be seeing the cost value gradually decrease; if not, your implementation is wrong

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 6))
plt.plot(range(len(losses)), losses)  
plt.show()

- <b>Step 5</b>

  - Calculate the classification evaluation metrics (precision, recall, f1, and accuracy) when applying your model on the test data
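As a reminder of how these metrics are defined for a binary problem, the sketch below computes them by hand from true/false positive counts on a made-up pair of label vectors (`y_true_demo` and `y_pred_demo` are hypothetical; class 1 is treated as the positive class):

```python
import numpy as np

y_true_demo = np.array([0, 1, 1, 1, 0, 1])
y_pred_demo = np.array([0, 1, 0, 1, 1, 1])

tp = int(np.sum((y_pred_demo == 1) & (y_true_demo == 1)))  # true positives
fp = int(np.sum((y_pred_demo == 1) & (y_true_demo == 0)))  # false positives
fn = int(np.sum((y_pred_demo == 0) & (y_true_demo == 1)))  # false negatives

precision_demo = tp / (tp + fp)
recall_demo = tp / (tp + fn)
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)
accuracy_demo = float(np.mean(y_pred_demo == y_true_demo))
print(precision_demo, recall_demo, f1_demo, accuracy_demo)
```

For your own model you would first turn the predicted probabilities into labels, for example by thresholding at 0.5, and can then cross-check the hand-computed numbers against `classification_report`.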

In [None]:
# Step 5



