# Exercise 1

In the tutorial you saw how to compute LDA for a two-class problem. In this exercise we will work on a multi-class problem, using the famous Iris dataset from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Iris).

The Iris dataset contains measurements for 150 iris flowers from three different species.

The three classes in the Iris dataset:
1. Iris-setosa (n=50)
2. Iris-versicolor (n=50)
3. Iris-virginica (n=50)

The four features of the Iris dataset:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm

<img src="iris_petal_sepal.png">



In [85]:
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set();
import pandas as pd
from sklearn.model_selection import train_test_split
from numpy import pi


### Importing the dataset

In [15]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)

dataset.tail()

| | sepal-length | sepal-width | petal-length | petal-width | Class |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |


### Data preprocessing

Once the dataset is loaded into a pandas DataFrame, the first step is to split it into features and their corresponding labels, and then to split the result into training and test sets. The following code separates the features from the labels:

In [16]:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

The script above assigns the first four columns of the dataset (the feature set) to `X`, and the values in the fifth column (the labels) to `y`.

The following code divides the data into training and test sets, holding out 20% of the rows for testing:

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
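
To make the split less of a black box, here is a minimal sketch of the same idea in plain NumPy: shuffle the row indices, then hold out a fraction of them. The helper `split_train_test` and the demo arrays are hypothetical names for illustration only. This is not scikit-learn's implementation, and it will not reproduce the exact rows that `train_test_split(..., random_state=0)` selects.

```python
import numpy as np

def split_train_test(X, y, test_size=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Assumption: round to the nearest whole number of test rows
    n_test = int(round(test_size * n_samples))
    order = rng.permutation(n_samples)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical stand-in data with the same number of rows as Iris
X_demo = np.arange(300, dtype=float).reshape(150, 2)
y_demo = np.repeat(np.array(["a", "b", "c"]), 50)

X_tr, X_te, y_tr, y_te = split_train_test(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (120, 2) (30, 2)
```

With 150 rows and `test_size=0.2`, 30 rows end up in the test set, which matches the 30 test predictions seen later in this notebook.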

#### Feature Scaling

Feature scaling is also part of data preprocessing. For this task we will use scikit-learn's `StandardScaler`, fitting it on the training set only and then applying the same transformation to the test set.

In [18]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
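
The fit-on-train, transform-on-test pattern above can be sketched in plain NumPy as follows. The helper names `fit_standardizer` and `apply_standardizer` and the random demo data are assumptions made for illustration. The sketch uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

def fit_standardizer(X_fit):
    """Learn per-feature mean and standard deviation from the training data."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)   # population std (ddof=0)
    std[std == 0] = 1.0       # avoid division by zero for constant features
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any dataset with statistics learned from the training data."""
    return (X - mean) / std

# Hypothetical data standing in for X_train / X_test
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=5.0, scale=2.0, size=(120, 4))
test_demo = rng.normal(loc=5.0, scale=2.0, size=(30, 4))

mean, std = fit_standardizer(train_demo)
train_scaled = apply_standardizer(train_demo, mean, std)
test_scaled = apply_standardizer(test_demo, mean, std)

print(train_scaled.mean(axis=0).round(6))  # approximately 0 for every feature
print(train_scaled.std(axis=0).round(6))   # approximately 1 for every feature
```

The test set is scaled with the training mean and standard deviation, so its own mean and variance will generally be close to, but not exactly, 0 and 1.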

## Write your code below

Write your code below to run LDA on the Iris dataset and compute the overall accuracy of the classifier.

In [86]:
### WRITE YOUR CODE HERE ####
# Calculating the (unbiased) sample covariance of an input matrix
def calc_cov_matrix(X_input):
  n_samples = np.shape(X_input)[0]
  # Center the data once, then form the scatter and divide by (n - 1)
  X_centered = X_input - X_input.mean(axis=0)
  cov_matrix = X_centered.T.dot(X_centered) / (n_samples - 1)

  return cov_matrix


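def train(X_train, y_train):

  """Train method for binary (two-class) LDA.

  Parameters
  -----------
  X_train: ndarray of shape (num_examples, num_features)
   Input dataset which LDA will use to obtain optimal weights during training

  y_train: ndarray of shape (num_examples,)
   Binary class labels (0 or 1), one per row of X_train

  Returns
  -----------
  weights: ndarray of shape (num_features, 1)
   Eigenvector of pinv(SW).dot(SB) with the largest eigenvalue
  """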
  
  # Collecting all class 0 and class 1 into separate variables
  class_X0 = X_train[np.argwhere(y_train == 0)[:, 0]]
  class_X1 = X_train[np.argwhere(y_train == 1)[:, 0]]

  # Getting number of examples in each class
  num_class_X0_samples = np.shape(class_X0)[0]
  num_class_X1_samples = np.shape(class_X1)[0]

  # Computing class mean for each label and calculating the difference between them.
  class_X0_mean = class_X0.mean(0)
  class_X1_mean = class_X1.mean(0)
  class_mean_diff = class_X1_mean - class_X0_mean
  class_mean_diff = class_mean_diff.reshape((-1, 1))
  SB = np.dot(class_mean_diff, class_mean_diff.T)

  # Calculating covariance matrix
  cov_mat_class_X0 = calc_cov_matrix(class_X0)
  cov_mat_class_X1 = calc_cov_matrix(class_X1)
  #SW = num_class_X0_samples * cov_mat_class_X0 + num_class_X1_samples * cov_mat_class_X1
  SW = cov_mat_class_X0 + cov_mat_class_X1

  eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(SW).dot(SB))

  # Getting the eigenvectors with the maximum eigenvalue.
  idx = eigvals.argsort()[::-1]
  eigvals = eigvals[idx][:1]
  weights = np.atleast_1d(eigvecs[:, idx])[:, :1]

  return weights
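
The `train` function above handles two classes at a time. For comparison, here is a hedged sketch of one common multi-class formulation of LDA: build the within-class scatter `S_W` and the between-class scatter `S_B` over all classes, project onto the leading eigenvectors of `pinv(S_W) @ S_B`, and assign each sample to the nearest projected class mean. The function names, the nearest-mean decision rule, and the synthetic demo data are assumptions made for illustration. This is not the approach used in the cells that follow.

```python
import numpy as np

def fit_multiclass_lda(X, y, n_components=2):
    """Fit a multi-class LDA projection and the projected class means."""
    classes = np.unique(y)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_W = np.zeros((n_features, n_features))  # within-class scatter
    S_B = np.zeros((n_features, n_features))  # between-class scatter
    class_means = []
    for label in classes:
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        class_means.append(mean_c)
        centered = X_c - mean_c
        S_W += centered.T @ centered
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    # eig may return complex values with negligible imaginary parts
    order = np.argsort(eigvals.real)[::-1][:n_components]
    W = eigvecs[:, order].real
    projected_means = np.array(class_means) @ W
    return classes, W, projected_means

def predict_multiclass_lda(X, classes, W, projected_means):
    """Assign each sample to the class with the nearest projected mean."""
    Z = X @ W
    distances = np.linalg.norm(Z[:, None, :] - projected_means[None, :, :], axis=2)
    return classes[np.argmin(distances, axis=1)]

# Synthetic, well-separated three-class data (not the Iris measurements)
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [4, 4, 0, 0], [0, 4, 4, 0]], dtype=float)
X_demo = np.vstack([rng.normal(c, 1.0, size=(50, 4)) for c in centers])
y_demo = np.repeat(np.array(["class-a", "class-b", "class-c"]), 50)

classes, W, projected_means = fit_multiclass_lda(X_demo, y_demo)
y_hat = predict_multiclass_lda(X_demo, classes, W, projected_means)
demo_accuracy = float(np.mean(y_hat == y_demo))
print(W.shape, demo_accuracy)
```

With three classes, `S_B` has rank at most two, so at most two eigenvalues are non-zero. That is why `n_components=2` is the default here.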


In [54]:
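# Train one binary (one-vs-rest) LDA projection per class
trained_weights = []
n = np.unique(y)

for i in n:
    if i in y_train:
        # Relabel: 1 for the current class, 0 for every other class
        y_vec = np.where(y_train == i, 1, 0)
        trained_weights.append(train(X_train, y_vec))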
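
The loop above turns the three-class problem into three binary ones. As a small illustration of that relabelling step on its own, the sketch below builds one 0/1 vector per class from a short, made-up label array. The `one_vs_rest` dictionary is a hypothetical name.

```python
import numpy as np

# Made-up labels for illustration
labels = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"])

one_vs_rest = {}
for cls in np.unique(labels):
    # 1 where the sample belongs to the current class, 0 otherwise
    one_vs_rest[cls] = np.where(labels == cls, 1, 0)

for cls, binary in one_vs_rest.items():
    print(cls, binary)
# Iris-setosa [1 0 0 1]
# Iris-versicolor [0 1 0 0]
# Iris-virginica [0 0 1 0]
```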

In [59]:
trained_weights[0], trained_weights[1] , trained_weights[2]

(array([[-0.16705006],
        [-0.15448432],
        [ 0.97066801],
        [ 0.07766913]]),
 array([[-0.15159054],
        [-0.20359827],
        [ 0.75831223],
        [-0.60044201]]),
 array([[ 0.06511834],
        [-0.16857946],
        [-0.2169602 ],
        [-0.95930644]]))

In [74]:
m = len(trained_weights)   

In [62]:
y_test[0]

'Iris-virginica'

In [83]:
num_test_samples = np.shape(y_test)[0]
y_test_predicted = []
for i in range(m):
    y_test_predicted.append(np.dot(X_test, trained_weights[i]))
#     y_test_predicted[y_test_predicted >= 0] = 1
#     y_test_predicted[y_test_predicted < 0] = 0



In [103]:
y_test_predicted[0].reshape((-1,1)).shape

(30, 1)

In [133]:
# One column of projection scores per one-vs-rest model
df = pd.concat([pd.DataFrame(scores) for scores in y_test_predicted], axis=1)

In [134]:
df.columns = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

In [135]:
df[:1]

| | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| 0 | 0.925654 | -0.225588 | -1.514096 |


In [136]:
# idxmax returns the column label (class name) holding the largest score in each row
df[:1].idxmax(axis=1)


0    Iris-setosa
dtype: object
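
To show how a row of scores maps to a class name, the sketch below repeats the lookup with NumPy on the first row of scores shown above. Selecting the largest and the smallest score gives different labels. Because an eigenvector is only defined up to sign (if `v` is an eigenvector, so is `-v`), the sign of each projection score is not fixed by the eigendecomposition alone, so which of the two rules is appropriate should be checked against the data rather than assumed.

```python
import numpy as np

class_names = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
# First row of projection scores, copied from the table above
scores = np.array([[0.925654, -0.225588, -1.514096]])

label_of_max = class_names[np.argmax(scores, axis=1)]
label_of_min = class_names[np.argmin(scores, axis=1)]

print(label_of_max)  # ['Iris-setosa']
print(label_of_min)  # ['Iris-virginica']
```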

In [147]:
# idxmin returns the column label (class name) holding the smallest score in each row
y_predictions = df.idxmin(axis=1)


In [148]:
# Getting misclassified points on the test set
miscls_test_points = np.unique(np.argwhere(y_predictions != y_test)[:, 0])
accuracy = 1-(len(miscls_test_points)/num_test_samples)
print("Accuracy: ", accuracy*100)

Accuracy:  56.666666666666664
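
A single accuracy number hides which classes are being confused. As a minimal sketch, the helpers below compute the accuracy as the fraction of matching labels and tabulate true versus predicted classes. The function names and the four toy labels are hypothetical and are not results from this notebook.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def confusion_counts(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {label: k for k, label in enumerate(classes)}
    table = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        table[index[t], index[p]] += 1
    return table

classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# Toy labels for illustration only
y_true_demo = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginica"]
y_pred_demo = ["Iris-setosa", "Iris-virginica", "Iris-virginica", "Iris-virginica"]

demo_acc = accuracy(y_true_demo, y_pred_demo)
demo_table = confusion_counts(y_true_demo, y_pred_demo, classes)
print(demo_acc)  # 0.75
print(demo_table)
```

Applied to `y_test` and `y_predictions`, a table like this would show whether the errors are concentrated in one pair of classes.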


In [150]:
#pd.concat([pd.DataFrame(y_predictions), pd.DataFrame(y_test)] , axis = 1 )