# Exercise 1

In the tutorial you saw how to compute LDA for a two-class problem. In this exercise we will work on a multi-class problem, using the famous Iris dataset from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Iris).

The Iris dataset contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset:
1. Iris-setosa (n=50)
2. Iris-versicolor (n=50)
3. Iris-virginica (n=50)

The four features of the Iris dataset:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm

<img src="iris_petal_sepal.png">



In [None]:
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set();
import pandas as pd
from sklearn.model_selection import train_test_split
from numpy import pi

### Importing the dataset

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)

dataset.tail()

### Data preprocessing

Once the dataset is loaded into a pandas DataFrame, the first step is to separate it into features and their corresponding labels; the second is to split the result into training and test sets. The following code separates the features from the labels:

In [None]:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

The above script assigns the first four columns of the dataset (the feature set) to the variable `X`, and the values in the fifth column (the labels) to the variable `y`.
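
The labels in `y` are strings such as `Iris-setosa`, whereas a comparison like `y_train == 0` expects integer codes. Below is a minimal sketch of one way to map the strings to integers with `np.unique`. The array `y_demo` is a hypothetical stand-in for the real label column, not data from the file.

```python
import numpy as np

# Hypothetical stand-in for the label column of the Iris file
y_demo = np.array(['Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa'])

# np.unique returns the sorted class names; with return_inverse=True it also
# returns, for every sample, the index of its class in that sorted array
class_names, y_encoded = np.unique(y_demo, return_inverse=True)

print(class_names)  # ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
print(y_encoded)    # [0 2 1 0]
```

The same call applied to the real `y` would yield the codes 0, 1 and 2 for the three species, in alphabetical order.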

The following code divides data into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#### Feature Scaling

We will now perform feature scaling as part of data preprocessing. For this task, we will use scikit-learn's `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
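
To make the `fit_transform` / `transform` distinction concrete, here is a small NumPy-only sketch of the arithmetic behind standard scaling. The arrays `X_tr` and `X_te` are hypothetical toy data. The point is that the mean and standard deviation are estimated on the training rows only and then reused for the test rows.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features each
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_te = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics are estimated on the training rows only
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std  # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0))  # approximately [0, 0]
print(X_tr_scaled.std(axis=0))   # approximately [1, 1]
print(X_te_scaled[0])            # [0, 0]: 2.5 and 25.0 are exactly the training means
```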

## Write your code below

Write your code below to apply LDA to the Iris dataset and compute the overall accuracy of the classifier.

In [None]:
### WRITE YOUR CODE HERE ####
# numpy, pandas, matplotlib, seaborn and train_test_split are already imported above
import import_ipynb  # lets notebooks be imported as modules

from utils import get_accuracy, get_prediction, plot_decision_boundary, generate_gifs
from utils import plot_2D_input_datapoints, signum, multi_class_signum, normalize

In [None]:
# Normalizing the first two features of X_train and appending a column of ones,
# which absorbs the bias b of the hyperplane into the weight vector
X_normalized_train = normalize(X_train[:, :2])

b_ones = np.ones((len(X_normalized_train), 1))
X_normalized_train = np.hstack((X_normalized_train, b_ones))

In [None]:
# Calculating the sample covariance matrix of an input matrix (rows are samples)
def calc_cov_matrix(X_input):
  n_samples = np.shape(X_input)[0]
  # Centre each feature once, then form the scatter matrix and divide by n - 1
  X_centered = X_input - X_input.mean(axis=0)
  cov_matrix = X_centered.T.dot(X_centered) / (n_samples - 1)

  return cov_matrix
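
As a sanity check, the hand-written covariance can be compared against `np.cov`. The sketch below redefines the function so it runs on its own. The random matrix `X_demo` is hypothetical test data.

```python
import numpy as np

def calc_cov_matrix(X_input):
  # Same formula as above: centred data, scatter matrix divided by n_samples - 1
  n_samples = np.shape(X_input)[0]
  X_centered = X_input - X_input.mean(axis=0)
  return X_centered.T.dot(X_centered) / (n_samples - 1)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))  # hypothetical data: 20 samples, 3 features

# rowvar=False tells np.cov that the columns (not the rows) are the variables
matches = np.allclose(calc_cov_matrix(X_demo), np.cov(X_demo, rowvar=False))
print(matches)  # True
```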

In [None]:
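def train(X_train, y_train):

  """Train method for two-class LDA.

  Parameters
  -----------
  X_train: ndarray of shape (num_examples, num_features)
   Input dataset which LDA will use to obtain optimal weights during training

  y_train: ndarray of shape (num_examples,) or (num_examples, 1)
   Class labels, encoded as the integers 0 and 1

  Returns
  -----------
  weights: ndarray of shape (num_features, 1)
   Eigenvector of pinv(SW).dot(SB) with the largest eigenvalue
  """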
  
  # Collecting all class 0 and class 1 into separate variables
  class_X0 = X_train[np.argwhere(y_train == 0)[:, 0]]
  class_X1 = X_train[np.argwhere(y_train == 1)[:, 0]]

  # Getting number of examples in each class
  num_class_X0_samples = np.shape(class_X0)[0]
  num_class_X1_samples = np.shape(class_X1)[0]

  # Computing class mean for each label and calculating the difference between them.
  class_X0_mean = class_X0.mean(0)
  class_X1_mean = class_X1.mean(0)
  class_mean_diff = class_X1_mean - class_X0_mean
  class_mean_diff = class_mean_diff.reshape((-1, 1))
  SB = np.dot(class_mean_diff, class_mean_diff.T)

  # Within-class scatter: sum of the two class covariance matrices
  cov_mat_class_X0 = calc_cov_matrix(class_X0)
  cov_mat_class_X1 = calc_cov_matrix(class_X1)
  #SW = num_class_X0_samples * cov_mat_class_X0 + num_class_X1_samples * cov_mat_class_X1
  SW = cov_mat_class_X0 + cov_mat_class_X1

  print(SB)
  print(SW)

  eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(SW).dot(SB))

  # Getting the eigenvector with the largest eigenvalue.
  idx = eigvals.argsort()[::-1]
  eigvals = eigvals[idx][:1]
  weights = np.atleast_1d(eigvecs[:, idx])[:, :1]

  return weights
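
The function above handles two classes labelled 0 and 1. For a multi-class problem, a common generalisation sums the within-class scatter over all classes and builds the between-class scatter from each class mean's offset from the overall mean. The sketch below illustrates this under stated assumptions: `lda_directions`, the toy blobs and the nearest-class-mean classification rule are all hypothetical choices made for illustration, not part of the exercise's `utils`.

```python
import numpy as np

def lda_directions(X, y, n_components):
  """Return the top n_components discriminant directions as columns."""
  overall_mean = X.mean(axis=0)
  n_features = X.shape[1]
  SW = np.zeros((n_features, n_features))
  SB = np.zeros((n_features, n_features))
  for label in np.unique(y):
    X_c = X[y == label]
    mean_c = X_c.mean(axis=0)
    centered = X_c - mean_c
    SW += centered.T.dot(centered)            # within-class scatter
    diff = (mean_c - overall_mean).reshape(-1, 1)
    SB += X_c.shape[0] * diff.dot(diff.T)     # between-class scatter
  eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(SW).dot(SB))
  order = np.argsort(eigvals.real)[::-1]      # largest eigenvalue first
  return eigvecs[:, order[:n_components]].real

# Hypothetical toy data: three well-separated 2-D blobs, 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])
y_demo = np.repeat([0, 1, 2], 30)

W = lda_directions(X_demo, y_demo, n_components=2)
Z = X_demo.dot(W)  # data projected onto the discriminant directions

# Assumed decision rule: assign each point to the nearest projected class mean
means = np.array([Z[y_demo == k].mean(axis=0) for k in range(3)])
dists = np.linalg.norm(Z[:, None, :] - means[None, :, :], axis=2)
y_pred = dists.argmin(axis=1)
accuracy = np.mean(y_pred == y_demo)
print("Accuracy: ", accuracy * 100)
```

With C classes, at most C - 1 eigenvalues are non-zero, so for the three Iris species at most two discriminant directions carry information.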

In [None]:
trained_weights = train(X_normalized_train, y_train)

In [None]:
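# Predict on test set
num_test_samples = np.shape(y_test)[0]

# Build the test features the same way as the training features:
# first two features, normalized, plus the column of ones for the bias
X_normalized_test = normalize(X_test[:, :2])
X_normalized_test = np.hstack((X_normalized_test, np.ones((len(X_normalized_test), 1))))

y_test_predicted = np.dot(X_normalized_test, trained_weights)
y_test_predicted[y_test_predicted >= 0] = 1
y_test_predicted[y_test_predicted < 0] = 0

y_test_predicted = y_test_predicted.reshape((-1, 1))
y_test = y_test.reshape((-1, 1))

# Getting misclassified points on test set
miscls_test_points = np.unique(np.argwhere(y_test_predicted != y_test)[:, 0])
accuracy = 1-(len(miscls_test_points)/num_test_samples)
print("Accuracy: ", accuracy*100)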

In [None]:
plot_decision_boundary(X_test, y_test, trained_weights, dataset_type='test', class_label_01_form='on', model_type='LinearDA')
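
A hand-written solution can be cross-checked against scikit-learn's `LinearDiscriminantAnalysis`. The sketch below is only a reference point, under two assumptions: it loads the copy of Iris bundled with scikit-learn (`load_iris`, integer labels 0 to 2) rather than the UCI file, and it reuses the same split and scaling settings as above.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled copy of Iris; labels are already the integers 0, 1, 2
X_iris, y_iris = load_iris(return_X_y=True)

# Same split and scaling settings as in the preprocessing cells
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# Fit multi-class LDA on all four features and score it on the held-out rows
lda = LinearDiscriminantAnalysis()
lda.fit(X_tr, y_tr)
reference_accuracy = accuracy_score(y_te, lda.predict(X_te))
print("Reference accuracy: ", reference_accuracy * 100)
```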