# Problem Statement:

You work as a data scientist at a flower research company. The company has a prelabeled sample of the
iris dataset with the features 'sepal-length', 'sepal-width', 'petal-length' and 'petal-width', plus the
label 'Class'. They plan to extend this dataset and train a RandomForestClassifier on it. They expect
the dataset to grow to millions of rows and are worried that millions of rows with 4 features will
be too large to train their classifier on. They wish to reduce the number of
features (dimensions) without a sharp decrease in the accuracy of the classifier.

You have been asked to:
1. Read the sample dataset given to you.
2. Use PCA to determine the number of most important principal components.
3. Reduce the number of features using PCA.
4. Train and test a RandomForestClassifier to check whether reducing the number of
dimensions causes the model to perform poorly.
5. Find the number of components that still produces good-quality results, i.e. does
not cause a sharp decrease in prediction accuracy.
6. Do this for every possible number of principal components and find the smallest number of
components that the dataset can be reduced to while keeping good prediction accuracy.

In [1]:
#Importing the necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

In [2]:
#Importing the dataset
column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv('iris.csv', names=column_names)

In [3]:
dataset.head()

   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [4]:
#Separating the features (X) from the class labels (Y)
X = dataset.drop(columns='Class')
Y = dataset['Class']

In [5]:
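#Splitting the data into train (80%) and test (20%) sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
#Fitting PCA with all components on the training set only, then applying it to the test set
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

#Fraction of the total variance explained by each principal component
explained_variance = pca.explained_variance_ratio_
explained_variance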

array([0.92597835, 0.0536899 , 0.01568407, 0.00464768])
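
A common way to read these ratios is to accumulate them and look for the first component count that crosses a chosen threshold. The sketch below does this with the values printed above; the 95% threshold and the variable names are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Explained variance ratios copied from the output above
explained_variance = np.array([0.92597835, 0.0536899, 0.01568407, 0.00464768])

# Running total of the variance explained by the first k components
cumulative = np.cumsum(explained_variance)

# Assumed threshold: keep enough components to explain at least 95% of the variance
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(n_components)
```

With these values the first component alone explains about 92.6% of the variance and the first two about 98.0%, so a 95% threshold would keep two components. Explained variance is only a proxy for usefulness to the classifier, which is why the accuracy check in the next cell is still needed.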

In [6]:
#Repeating the experiment for every possible number of principal components
for n in range(1, len(X.columns) + 1):
    #Re-splitting with the same random_state so every run uses the same train/test rows
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
    
    #Keeping only the first n principal components
    pca = PCA(n_components=n)
    X_train = pca.fit_transform(X_train)
    X_test = pca.transform(X_test)
    
    #Training the classifier on the reduced features and predicting on the test set
    classifier = RandomForestClassifier(max_depth=2, random_state=0)
    classifier.fit(X_train, Y_train)
    Y_pred = classifier.predict(X_test)
    
    #Reporting the confusion matrix and accuracy for this number of components
    cm = confusion_matrix(Y_test, Y_pred)
    print("Confusion Matrix with {0} Principal Components: ".format(n))
    print(cm)
    print("Accuracy Score with {0} Principal Components: ".format(n), end="")
    print(accuracy_score(Y_test, Y_pred), end="\n\n")

Confusion Matrix with 1 Principal Components: 
[[11  0  0]
 [ 0 12  1]
 [ 0  2  4]]
Accuracy Score with 1 Principal Components: 0.9

Confusion Matrix with 2 Principal Components: 
[[11  0  0]
 [ 0  8  5]
 [ 0  1  5]]
Accuracy Score with 2 Principal Components: 0.8

Confusion Matrix with 3 Principal Components: 
[[11  0  0]
 [ 0  5  8]
 [ 0  1  5]]
Accuracy Score with 3 Principal Components: 0.7

Confusion Matrix with 4 Principal Components: 
[[11  0  0]
 [ 0 11  2]
 [ 0  1  5]]
Accuracy Score with 4 Principal Components: 0.9
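
Each accuracy score above is the sum of the confusion matrix diagonal divided by the total number of test samples. As a minimal sketch, the matrices printed above can be used to recompute the scores and pick the smallest component count that comes close to the best score; the dictionary layout and the `tolerance` value are assumptions made for illustration.

```python
import numpy as np

# Confusion matrices copied from the output above, keyed by number of components
confusion_matrices = {
    1: np.array([[11, 0, 0], [0, 12, 1], [0, 2, 4]]),
    2: np.array([[11, 0, 0], [0, 8, 5], [0, 1, 5]]),
    3: np.array([[11, 0, 0], [0, 5, 8], [0, 1, 5]]),
    4: np.array([[11, 0, 0], [0, 11, 2], [0, 1, 5]]),
}

# Accuracy = correctly classified samples (diagonal) / all test samples
accuracies = {n: np.trace(cm) / cm.sum() for n, cm in confusion_matrices.items()}

# Assumed rule: accept any component count within `tolerance` of the best accuracy
tolerance = 0.01
best = max(accuracies.values())
smallest_n = min(n for n, acc in accuracies.items() if acc >= best - tolerance)

for n, acc in accuracies.items():
    print(n, round(acc, 2))
print("Smallest acceptable number of components:", smallest_n)
```

A larger `tolerance` would trade some accuracy for fewer features; what counts as a "sharp decrease" is a judgment the company would have to make.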



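From the above results it can be seen that the model with 1 principal component reaches the same accuracy (0.9) as the model with all 4 principal components, although the two confusion matrices differ slightly. On this split, 2 and 3 components scored lower (0.8 and 0.7). The smallest number of components that produces good accuracy is therefore 1, with an accuracy of 0.9.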
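
These scores come from a single split with only 30 test samples, so one misclassified flower changes the accuracy by about 0.03. A hedged sketch of a more stable check is shown below: it uses 5-fold cross-validation and standardizes the features before PCA. It loads scikit-learn's built-in copy of the iris data instead of iris.csv, and the scaling step and the fold count are assumptions that the notebook above does not use, so the numbers will differ from those reported.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X, y = load_iris(return_X_y=True)

mean_scores = {}
for n in range(1, X.shape[1] + 1):
    # Scaling and PCA are refit inside each fold, so no test data leaks into them
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n),
        RandomForestClassifier(max_depth=2, random_state=0),
    )
    # Mean accuracy over 5 stratified folds for this number of components
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[n] = scores.mean()
    print(n, round(mean_scores[n], 3))
```

Averaging over folds makes it easier to tell whether a dip such as the one seen at 2 and 3 components reflects the data or just the particular split.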

# END OF ASSIGNMENT