### Student Details

Student name:

Student ID:

# Description

In this task, we will use the NSL-KDD dataset to do multi-class classification. The dataset is quite large, and training can take a long time on the whole thing, so we will use just 20% of it. If you completed the network security task in the previous activity (i.e., if you are in the BSc in CyberSecurity), you will have seen this already. For those that have not seen this dataset before, you will also be dealing with many more dimensions than you have up to now, but you will see that the machine learning techniques we have employed so far scale nicely to many dimensions.

The aim of the NSL-KDD dataset is to enable training a machine learning algorithm to identify different types of cyber attacks based on network traffic features. The different attacks can be: denial-of-service (dos), remote-to-user (r2l), probing attack (probe), and user-to-root (u2r). I hope this means something to the CyberSecurity cohort. For the rest of us, don't worry, we can just see it as a generic classification task.

The data is already split into training and testing. It also contains a mix of different types of features: categorical, binary, and numerical. However, in this task, we are going to investigate only the numerical features. So, in the code just below, I have stripped out all of the non-numerical features, and provide you with `train_X`, `train_Y`, `test_X`, and `test_Y` (pandas objects, which scikit-learn accepts directly).

Our aim will be to use the available data to train an algorithm to predict the type of attack that is occurring. We will then see if we can get similar performance by using fewer features. Undoubtedly, unless there is a feature that has no influence *at all* on the output, we will see *some* degradation in performance. However, as discussed in the material, there are significant gains to be made by using fewer features.

In [1]:
####################
# CODE PROVIDED

# This code is a little bit complicated, and I don't want you to get bogged down in reading from csv files.
# This code reads from the CSV files, and creates the training and test sets for both binary and multi-class

# Read the data
import pandas as pd
test_df = pd.read_csv('KDDTest_CE4317.csv', header=0)
train_df = pd.read_csv('KDDTrain_CE4317.csv', header=0)

# Differentiating between nominal, binary, and numeric features
# Note, we only need to do this for the train data, as the train and test have the same feature names (of course)
col_names = train_df.columns.values    

nominal_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 19, 20]
numeric_idx = sorted(set(range(40)).difference(nominal_idx).difference(binary_idx))  # sorted, so the column order is deterministic

numeric_cols = col_names[numeric_idx].tolist()   # The columns that have numerical features

train_Y = train_df['attack_category']
test_Y = test_df['attack_category']

# In this case, we are only going to use the numeric columns for our predictions
train_X = train_df[numeric_cols]   
test_X = test_df[numeric_cols]
# print(train_X.columns.values)
# num_missing = (train_X[[1,2,3,4,5]] == 0).sum()
# print(num_missing)

print("shape of train_X", train_X.shape)
print("shape of test_X", test_X.shape)


Let's have a quick look at what some of the samples look like

In [None]:
# Let's look at the data
train_X

And let's look quickly at how many samples in each attack category we have

In [None]:
import matplotlib.pyplot as plt

train_attack_cats = train_df['attack_category'].value_counts()
test_attack_cats = test_df['attack_category'].value_counts()
train_attack_cats.plot(kind='barh', figsize=(10,5), fontsize=15)
plt.xlabel("Number of samples", fontsize=20)
plt.ylabel("Attack category", fontsize=20)

# Task 1: Feature Selection

### Part 1: Support Vector Classification

Here, we will use Support Vector Classification to predict the type of network attack that is occurring, given a set of features. We will use a simple linear SVM Classification with the default parameters, as we're not investigating the properties of SVM, but rather the properties of the data.

#### Task:
1. Apply the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to the training and test data. Remember, train on the `*_train` data, but apply to both the `*_train` and the `*_test` data
1. Train a linear Support Vector Classification, using [`sklearn`'s `svm.SVC` class](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). All parameters can be left at default except `kernel='linear'`
1. Predict the category of the network attack
1. Print the [confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) and the [accuracy score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
1. Discuss the confusion matrix briefly

You should see that the linear SVM Classification isn't too bad: you should get an accuracy of around 0.73.
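As a rough illustration of the "fit on train, apply to both" rule from step 1, the sketch below chains a `StandardScaler` and a linear `SVC` in a scikit-learn pipeline. The data is synthetic (generated with `make_classification`), and the names `X_tr`, `X_te`, `y_tr`, `y_te` and `model` are placeholders for this example only; it is not the expected solution for the NSL-KDD data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data, used only as a stand-in for the real dataset
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# fit() learns the scaling statistics from the training data only;
# predict() re-uses those statistics on the test data
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)
acc = accuracy_score(y_te, model.predict(X_te))

# The scaler inside the pipeline was fitted on the training split only
scaler_mean = model.named_steps["standardscaler"].mean_
print("scaler fitted on train only:", np.allclose(scaler_mean, X_tr.mean(axis=0)))
print("accuracy on the synthetic test split:", acc)
```

A pipeline is only one way to enforce this rule; fitting the scaler by hand and calling `transform` on both sets is equivalent.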


<span style="color:red">Insert your code below</span>.

## Method for calculating Standard Scaler for X_train and X_test

In [None]:
from sklearn.preprocessing import StandardScaler

####################################
# Your code here
def scaleData(X_train, X_test) :
    # fit the scaler on the training data only, then scale the training data
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    # apply the already-fitted standard scaler to the test data
    X_test_scaled = scaler.transform(X_test)
    
    return X_train_scaled, X_test_scaled

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

####################
# YOUR CODE

# using standard scaler to fit the train and test data by calling the method
X_train_scaled, X_test_scaled = scaleData(train_X, test_X)

# using svm.SVC with linear kernel
clf = SVC(kernel='linear')

# train the model and predict the y_pred
clf.fit(X_train_scaled, train_Y)
y_pred = clf.predict(X_test_scaled)

print("The accuracy score with svm.SVC is", accuracy_score(test_Y, y_pred)) # calculating accuracy
confusion_matrix(test_Y, y_pred) # display the confusion matrix

# Confusion Matrix

<span style="color:red">Insert your text answers below</span>.  

The confusion matrix shows how the model is performing and where it went wrong. Each entry counts how often a true class (row) was predicted as a given class (column), so the off-diagonal entries are the samples the model "confused" with another class. The sum of all the entries equals the number of test samples (22544). The sum of the diagonal entries is the number of correct predictions, so (sum of diagonal)/(total count of test samples) gives the accuracy: (9085+5521+1864+2+5)/22544 = 0.7308818310858765. All the other entries are misclassifications.

The confusion matrix plotted below is shaded according to the color bar: cells go from white to dark red as the number of observations increases, as the dark cell with 9085 for benign shows. The predicted values are on the X axis and the true values on the Y axis. For the benign class, the model predicted benign correctly 9085 times, and incorrectly predicted dos 472 times, probe 150 times and r2l 4 times; these misclassifications reduce the accuracy.

<u>Note</u>: Here I have altered the color map to white and red, where darker red corresponds to a larger number of observations.
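The relationship between the diagonal and the accuracy can be checked on a small example. The sketch below uses made-up toy labels (not the NSL-KDD results) and also derives the per-class recall from the rows of the matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, made up for illustration only
y_true = np.array(["benign", "benign", "benign", "dos", "dos", "probe"])
y_hat = np.array(["benign", "benign", "dos", "dos", "dos", "benign"])
labels = ["benign", "dos", "probe"]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_hat, labels=labels)

# Accuracy: correct predictions (the diagonal) over all samples
accuracy_from_cm = np.trace(cm) / cm.sum()

# Recall per class: diagonal entry divided by the row total
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(cm)
print("accuracy from matrix:", accuracy_from_cm)
print("accuracy_score:      ", accuracy_score(y_true, y_hat))
print("recall per class:    ", dict(zip(labels, recall_per_class)))
```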

In [None]:
(9085+5521+1864+2+5)/22544 # sum of the diagonal values divided by the total count of test samples

### Define a method plotConfusionMatrix to plot confusion matrix which is used later as well

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

def plotConfusionMatrix(test_Y, y_pred, labels) :
    
    # Displaying the confusion matrix
    cm = confusion_matrix(test_Y, y_pred, labels=labels)
    matrix = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)  # use the labels passed in, not the global clf
    matrix.plot(cmap=plt.cm.Reds) # changing the color from the default to red in order to get a better visualization
    plt.title("Confusion Matrix")
    plt.show()

In [None]:
plotConfusionMatrix(test_Y, y_pred, clf.classes_)  # plot confusion Matrix using the method defined above

### Part 2: Feature Variance

Feature variance is a rather simple way of predicting if a given feature will have influence on the outcome of a trained model. The principle is that, if a feature has low variance, it cannot have much influence on the model prediction. As an extreme, if we have a variance of 0 in a feature across all samples, i.e. we have the same value for this feature in all samples, then this feature cannot be used to distinguish samples and is useless as a predictor.

However, the converse is not true. A high variance in a feature does not necessarily mean that it is a good predictor. You could imagine a feature that just contains noise with high amplitude. It might have high variance, but is meaningless. Or you can have a feature that has high variance but no influence on the outcome. For example, would hair length in cm have any influence on a baseball player's salary? That said, variance can be a useful measure of the strength of a predictor.

The variance of a set of features is given by:

$$
S_{i}^{2} = \frac{\sum_{j=1}^n\left(\textbf{X}_{i,j} - \overline{\textbf{X}}_i \right)^2}{n - 1}
$$

where $\textbf{X}_{i,j}$ is the $j$'th sample of the $i$'th feature, $\overline{\textbf{X}}_i$ is the mean of all the samples of the $i$'th feature, and $n$ is the total number of samples.
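As a small worked example of this formula, the sketch below computes the variance of each column of a hypothetical toy matrix by hand and compares it with `np.var`. One detail worth noting: `np.var` divides by $n$ by default, so `ddof=1` is needed to reproduce the $n - 1$ denominator used above.

```python
import numpy as np

# Hypothetical toy matrix: 5 samples (rows) x 3 features (columns)
X = np.array([
    [1.0, 10.0, 3.0],
    [2.0, 10.0, 1.0],
    [3.0, 10.0, 4.0],
    [4.0, 10.0, 1.0],
    [5.0, 10.0, 5.0],
])

n = X.shape[0]
feature_means = X.mean(axis=0)

# Direct translation of the formula: squared deviations summed over the samples, divided by n - 1
sample_var = ((X - feature_means) ** 2).sum(axis=0) / (n - 1)

# np.var divides by n unless ddof=1 is passed
np_sample_var = np.var(X, axis=0, ddof=1)
np_population_var = np.var(X, axis=0)

print("by hand (n - 1):", sample_var)
print("np.var, ddof=1: ", np_sample_var)
print("np.var, default:", np_population_var)
```

The second column is constant, so its variance is 0 and it could not help to distinguish samples. For ranking features by variance, the choice between $n$ and $n - 1$ makes no difference, as every feature is scaled by the same factor.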

Variance thresholding doesn't examine the relationship between the feature $\textbf{X}$ and the output $\textbf{y}$. This has the disadvantage that you can't test if the feature actually has an influence on the output. However, even though in this case we use it in a supervised learning context, it does mean that we can use variance thresholding for unsupervised learning.

#### Notes:
1. In Part 1, we used the `StandardScaler` to scale the features. In general, this is good practice, and in the next Task where we look at PCAs, really is even required. The `StandardScaler` will make it so every feature has a variance of 1 (unless the features started out with a variance of 0 to begin with) and a mean of 0. Therefore, features scaled with `StandardScaler` are useless for thresholding on variance, as there is no practical way to distinguish them.
2. However, we should not do `VarianceThreshold`ing on just the raw data. Have a look at the values in the training dataset. Some columns will have typically small values. It is the nature of that data, and even though they may have a large influence on the type of attack, they will have a lower variance compared to some of the other columns.
3. So we must scale, but not using the `StandardScaler`. Here it is more appropriate to use the [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), which scales all the data to the range 0 to 1 by default (though you can set any range).
4. Note that `MinMaxScaler` can also be used to scale data for machine learning algorithms. It is just that, in this case, we want to use the `StandardScaler` for the classifier. There is, in fact, a [whole suite of other scalers provided by scikit-learn](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html). Each has its own benefits. In fact, some would say we should use `MinMaxScaler` as the default scaler, and only use `StandardScaler` if we know the distribution of the data is normal.
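The difference between the two scalers described in these notes can be seen on a small synthetic example. In the sketch below, the two columns are hypothetical: one is spread evenly over its range, the other is almost always 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: column 0 is spread over its whole range,
# column 1 is 0 except for a handful of large values
spread = rng.uniform(0, 1000, size=500)
rare = np.zeros(500)
rare[:5] = 50.0
X = np.column_stack([spread, rare])

var_standard = StandardScaler().fit_transform(X).var(axis=0)
var_minmax = MinMaxScaler().fit_transform(X).var(axis=0)

print("variances after StandardScaler:", var_standard)  # both are 1, so they cannot be told apart
print("variances after MinMaxScaler:  ", var_minmax)    # the rarely-changing column has a much smaller variance
```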

#### Task:
1. Fit an instance of the `MinMaxScaler` with the original `X_train` data. You will get a set of data in the range 0 to 1.
2. It is not very intuitive what threshold of variance we should use. So it's better to plot the variances of each feature, and then decide if some of the variances are small enough to discard
3. Use the `np.var` function to calculate the variances of the features (`axis=0`). Plot the variances, and pick a value that might remove 3 or 4 of the features.
4. Fit the output of the `MinMaxScaler` using an object of [`sklearn.feature_selection`'s `VarianceThreshold` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) using this threshold
5. `transform` the data that was scaled with the `StandardScaler`. This is an important step. Notice that we use the output of the `MinMaxScaler` to select the features, but the data we want to use is selected from the `StandardScaler`.
6. Repeat the steps of Part 1: Train an SVC with the selected features and print the accuracy.
7. How does the accuracy compare to the SVC with no features removed (from Part 1)?
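Steps 4 and 5 can be sketched as follows: the `VarianceThreshold` object is fitted on the `MinMaxScaler` output, and the resulting column selection is then applied to the `StandardScaler` output. The data and the threshold of 0.01 are made up for this illustration, and `make_data` is a hypothetical helper, not part of the provided code.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(1)

def make_data(n):
    """Hypothetical stand-in data: 3 varying columns plus 1 nearly constant column."""
    varying = rng.normal(size=(n, 3))
    nearly_constant = np.zeros((n, 1))
    nearly_constant[0, 0] = 1.0  # a single non-zero value
    return np.hstack([varying, nearly_constant])

X_train, X_test = make_data(200), make_data(50)

# Both scalers are fitted on the training data only
minmax = MinMaxScaler().fit(X_train)
standard = StandardScaler().fit(X_train)

# Decide which columns to keep from the MinMax-scaled training data ...
selector = VarianceThreshold(threshold=0.01)
selector.fit(minmax.transform(X_train))

# ... but take the selected columns from the Standard-scaled data
X_train_sel = selector.transform(standard.transform(X_train))
X_test_sel = selector.transform(standard.transform(X_test))

print("columns kept:", selector.get_support())
print("shapes after selection:", X_train_sel.shape, X_test_sel.shape)
```

This works because `transform` only applies the column mask learned in `fit`; it does not recompute any variances.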

<span style="color:red">Insert your code below</span>.

## Using bar graph for better comprehensibility

In [None]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import numpy as np

####################################
# Your code here
# using min max scaler for applying variance thresholding
min_max_scaler = MinMaxScaler()
min_max_scaled_train_X = min_max_scaler.fit_transform(train_X)
# calculate the variances of all the features in order to remove those with low variance
variances = np.var(min_max_scaled_train_X, axis=0)

plt.bar(range(len(variances)), variances)
plt.xlabel('Features')
plt.ylabel('Variance')
plt.title('Feature Variances')
plt.show()


## Applying thresholding post MinMaxScaling and then StandardScaling the data before classification

In [None]:
# choosing a threshold value that removes a few of the features
threshold = 0.00015 # removes 5 low-variance features; the accuracy is unchanged to 2 decimal places
sel = VarianceThreshold(threshold)

# scale the test data with the scaler fitted on the training data, so that both sets get the same transformation
min_max_scaled_test_X = min_max_scaler.transform(test_X)

# fitting and transforming the data based on the threshold chosen on train data and transforming the test data
thresholded_train_X = sel.fit_transform(min_max_scaled_train_X)
thresholded_test_X = sel.transform(min_max_scaled_test_X)

print(f"The number of features to consider by the classifier after thresholding are {thresholded_test_X.shape[1]}")

# Applying Standard Scaler on the data before calling the classifier
standard_scaled_train_X, standard_scaled_test_X = scaleData(thresholded_train_X, thresholded_test_X)

clf = SVC(kernel='linear')
clf.fit(standard_scaled_train_X, train_Y)
y_pred = clf.predict(standard_scaled_test_X)

print("The accuracy score with svm.SVC is", accuracy_score(test_Y, y_pred)) # calculating accuracy
plotConfusionMatrix(test_Y, y_pred, clf.classes_)  




# Confusion Matrix after removing features

In [None]:
plotConfusionMatrix(test_Y, y_pred, clf.classes_)

<span style="color:red">Insert your text answers below</span>.

How does the accuracy compare to the SVC with no features removed (from Part 1)?  
With a threshold of 0.00015, 5 features are removed and 26 remain. The accuracy is 0.7302608232789212, compared with 0.7308818310858765 with no features removed, a drop of about 0.0006 (0.06 percentage points). This is not a significant difference, so we can say that removing the 5 features with the lowest variance does little harm to the model.



### Part 3: Univariate Feature Selection - `f_classif`

Univariate feature selection works by performing statistical tests on each of the features (i.e. on each column in our dataset). There are a [few options provided by `scikit-learn`](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection). We will use the [`SelectKBest` functionality](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html), which allows us to pick "the top" `K` features per the metric we select. To pick the top features, we will use the [`f_classif` function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html), as we are doing a classification. `f_classif` uses the ANOVA F-value to determine features to select. More info on ANOVA is available [here](https://datascience.stackexchange.com/questions/74465/how-to-understand-anova-f-for-feature-selection-in-python-sklearn-selectkbest-w).

The `SelectKBest` functionality coupled with `f_classif`, will use this score to pick the `K` top features.
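To make the scoring concrete, the sketch below runs `f_classif` and `SelectKBest` on a hypothetical 3-class dataset in which only the first column depends on the class label. The data is synthetic and serves only to show how the scores are used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(42)

# Hypothetical 3-class problem: only column 0 depends on the class label
y = np.repeat([0, 1, 2], 100)
X = np.column_stack([
    y * 3.0 + rng.normal(size=300),  # class means 0, 3 and 6
    rng.normal(size=300),            # pure noise
    rng.normal(size=300),            # pure noise
])

# One ANOVA F-value (and p-value) per feature
f_scores, p_values = f_classif(X, y)

# Keep the single feature with the highest F-value
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = selector.get_support(indices=True)

print("F-scores:", f_scores)
print("selected column indices:", selected)
```

A large F-value means the class means of that feature differ a lot relative to the spread within each class, which is why the first column is picked here.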

#### Task:
1. Loop over the total count of features (i.e. for variable `k` from 1 to 31)
2. Use the [`SelectKBest` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) with [`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html) to pick the top `k` features from our dataset
3. Train a new SVM classification with features transformed with the `SelectKBest` object you just created (note: both train and test data have to be transformed)
4. Use the `accuracy_score` function to get the accuracy at each iteration
5. Repeat steps 2 to 4 for each value of `k`.
6. Then plot the accuracy versus number of features in a single plot
7. Given this data, discuss the number of features you might use in a final solution? (Use markdown - no wrong answer here)
8. How does the "best" accuracy value compare with the SVM before removing any features?

This will take a few minutes to run, go get a coffee!

<span style="color:red">Insert your code below</span>.

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Your code here

# scale once, outside the loop: the scaled data does not depend on k
standard_scaled_train_X, standard_scaled_test_X = scaleData(train_X, test_X)

accuracy_scores = []
for k in range(1, train_X.shape[1] + 1):
    # keep the k features with the highest ANOVA F-values (fitted on the training data only)
    select_K_best = SelectKBest(score_func=f_classif, k=k)
    X_train_K_best = select_K_best.fit_transform(standard_scaled_train_X, train_Y)
    X_test_K_best = select_K_best.transform(standard_scaled_test_X)

    # Using SVM for classification with a linear kernel
    clf = SVC(kernel='linear')

    clf.fit(X_train_K_best, train_Y)
    y_pred = clf.predict(X_test_K_best)

    accuracy_score_obtained = accuracy_score(test_Y, y_pred)
    accuracy_scores.append(accuracy_score_obtained)
    print(f"The accuracy score with svm.SVC for k = {k} is {accuracy_score_obtained}")


In [None]:
import matplotlib.pyplot as plt

####################################
# Your code here
plt.plot(range(1, len(accuracy_scores) + 1), accuracy_scores, marker='o')  # one accuracy value per k
plt.xlabel('Number of Features')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. Number of Features')
plt.show()


<span style="color:red">Insert your question answers below</span>.

<span style="color:green">The above graph shows the accuracy versus the number of features. It illustrates that the accuracy does not vary much once 17 features are included. This lets us reduce the number of features passed on to the classifier, which makes the model cheaper to train and run</span>.

7. Given this data, discuss the number of features you might use in a final solution? (Use markdown - no wrong answer here)  
    With 25 features the accuracy is around 73%, which is almost the same as the accuracy with all 31 features. So 25 is a reasonable choice: we can remove 6 features, making the classifier more efficient.  

    If we need an even more efficient version of the model, we might reduce the feature count further to 18, a little over half of the original 31 features, as it would still produce an accuracy of 72.9%.  

8. How does the "best" accuracy value compare with the SVM before removing any features?  
    The best accuracy obtained is 0.7308818310858765, which is identical to the value obtained in Task 1 before removing any of the features from the dataset. This is expected, as 0.7308818310858765 is the accuracy we get when all 31 features are taken into consideration.





# Task 2: Dimensionality Reduction via PCA

PCA is one of the most commonly used unsupervised transforms, and one of the most common means to manipulate data for machine learning. You touched on PCA in E-tivity 2, where we investigated linear algebra. Here we will use it to reduce the number of features needed for a machine learning algorithm.

In the last task, we removed features. In the first part, we just used some statistics on the features themselves, independently of the other features and of the output. Then, we looked at the correlation between features and the output.

What PCA does is look at correlations *between features*. If we have high correlation between two or more features, PCA will find vectors in the feature space that best describe all features. It doesn't remove features, rather it creates a new feature space, and projects all samples to this feature space. The basis of the new feature space is a linear combination of the original features. Maybe a bit crudely, you can think of it as combining features.

Let's look at an example. Here are the first few rows of the dataset.

In [None]:
train_df.head(8)

Ok, so in a small handful of features, we can spot that (perhaps) `num_root` and `num_compromised` are correlated? Let's plot a few of them that might be correlated.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, figsize=(20, 5))
fig.subplots_adjust(wspace=0.3)

ax[0].scatter(train_df['num_compromised'], train_df['num_root']);
ax[0].set(xlabel='num_compromised', ylabel='num_root')

ax[1].scatter(train_df['srv_serror_rate'], train_df['serror_rate']);
ax[1].set(xlabel='srv_serror_rate', ylabel='serror_rate')

ax[2].scatter(train_df['srv_rerror_rate'], train_df['rerror_rate']);
ax[2].set(xlabel='srv_rerror_rate', ylabel='rerror_rate')

plt.show()

Note that in the last two plots, although there are outliers (values at 1.0), most of the data lies along the diagonal. The drawing just doesn't show this well, but the features are highly correlated.

Yes, we can see that there is some correlation between the features we selected here. We can probably assume that there is a causal relationship between them - CyberSecurity specialists wish to comment?

So there is certainly some redundancy here. And likely there are more hidden correlations that we don't know about!
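Rather than judging correlation by eye from scatter plots, it can be quantified with the Pearson correlation coefficient. The sketch below does this on a hypothetical DataFrame with made-up columns `a`, `b` and `c`, where `b` is built to depend on `a` and `c` is unrelated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical columns: 'b' is roughly twice 'a' plus a little noise, 'c' is unrelated
a = rng.normal(size=1000)
df = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=1000),
    "c": rng.normal(size=1000),
})

# Pairwise Pearson correlation between all columns (the default method of DataFrame.corr)
corr = df.corr()
print(corr.round(2))
```

Values close to 1 or -1 indicate strongly correlated (redundant) features, and values close to 0 indicate little linear relationship. The same call should work on a numeric DataFrame such as `train_X`.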

# Task

On to this week's task. We will perform PCA on the data before training a linear SVM, and explore some more of its properties and how it affects the machine learning algorithm.

1. Run [`PCA`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) on the standard scaled data. Initially, set the desired variance to keep to 95% (`n_components=0.95`, all other parameters set to default)
2. Train a Support Vector Classification on the PCA reduced data. As with Task 1, use a linear SVM and keep all other parameters as default
3. Try 99% and 90%
4. Play around with desired variance to see if you can reduce the number of features while maintaining an accuracy close to the original dataset above

Discuss the following points, and compare with the previous task in this e-tivity:

1. How many new features are there after the PCA?
2. Discuss the "goodness" of the model, compared with the one without scaling (from Task 1), by comparing the accuracy
3. How about if we set the variance to 99%? And how about 99.9%?
4. Can you get better accuracy with fewer features using PCA compared to dropping the features from Task 1?

Note that the parameter `n_components` of `PCA` can take either a real value between 0 and 1, in which case it will pick the number of components that maintains that level of variance in the samples, or it can take an integer value, in which case it will keep that number of components.
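The two ways of setting `n_components` can be compared on a small synthetic example. In the sketch below, the 6 features are generated from 2 hidden factors (an assumption made purely for illustration), so most of the variance should be captured by a few components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 6 features generated from 2 hidden factors plus small noise
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing + rng.normal(scale=0.05, size=(500, 6))
X_scaled = StandardScaler().fit_transform(X)

# A float between 0 and 1: keep as many components as needed to retain that fraction of variance
pca_fraction = PCA(n_components=0.95).fit(X_scaled)

# An integer: keep exactly that many components
pca_count = PCA(n_components=2).fit(X_scaled)

# Cumulative explained variance of all components, useful for choosing a threshold
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

print("components kept for 95% variance:", pca_fraction.n_components_)
print("variance retained:", pca_fraction.explained_variance_ratio_.sum())
print("components kept with n_components=2:", pca_count.n_components_)
print("cumulative explained variance:", cumulative.round(3))
```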

<span style="color:red">Insert your code below</span>.

# Defining a method to apply PCA, train the SVC, print the accuracy and plot the confusion matrix, to be used later with different values of variance

In [None]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

####################################
# Your code here

def applyPCA(variance) :
    # Applying PCA to the standard scaled data from Part 1, keeping the given fraction of the variance
    pca = PCA(n_components=variance)

    # fit on the training data only, then project both the training and the test data
    X_train_scaled_pca = pca.fit_transform(X_train_scaled)
    X_test_scaled_pca = pca.transform(X_test_scaled)
    print(f"Number of features in standard scaled test data after PCA with variance {variance*100}%: ", X_test_scaled_pca.shape[1])

    # using svm.SVC with linear kernel
    clf = SVC(kernel='linear')

    clf.fit(X_train_scaled_pca, train_Y)
    y_pred = clf.predict(X_test_scaled_pca)
    print(f"The accuracy score with svm.SVC after PCA with variance of {variance*100}% is", accuracy_score(test_Y, y_pred), "\n") # calculating accuracy
    plotConfusionMatrix(test_Y, y_pred, clf.classes_)

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

####################################
# Your code here
print("The accuracy score with svm.SVC is", accuracy_score(test_Y, y_pred)) # calculating accuracy

# Desired Variance 95%

In [None]:
variance_95 = 0.95
applyPCA(variance_95) # apply PCA with variance 95%


# Desired Variance 99%

In [None]:
variance_99 = 0.99
applyPCA(variance_99) # apply PCA with variance 99%

# Desired Variance 90%

In [None]:
variance_90 = 0.90
applyPCA(variance_90) # apply PCA with variance 90%

# Desired Variance 99.9%

In [None]:
variance_999 = 0.999
applyPCA(variance_999) # apply PCA with variance 99.9%

At a desired variance of 99.9%, the additional components that are kept make the model slightly less accurate than at 99%.

# Desired Variance 10%

In [None]:
variance_10 = 0.10
applyPCA(variance_10) # apply PCA with variance 10%

# Desired Variance 20%

In [None]:
variance_20 = 0.20
applyPCA(variance_20) # apply PCA with variance 20%

# Desired Variance 30%

In [None]:
variance_30 = 0.30
applyPCA(variance_30) # apply PCA with variance 30%

<span style="color:red">Insert your text answers below</span>.

1. How many new features are there after the PCA?

    After applying PCA, we see a reduction in the number of features, as intended. With a high desired variance such as 99.9%, only 3 features are removed from the dataset. As the desired variance is lowered, the number of features drops further; how quickly depends on how the variance is distributed in the dataset. Here we see a considerable reduction (nearly half of the features) when the desired variance is decreased to 90%, and the count continues to decrease as the desired variance is reduced.  <u>Note</u>: only 2 features are passed to the classifier when the desired variance is 30%.


2. Discuss the "goodness" of the model, compared with the one without scaling (from Task 1), by comparing the accuracy  
    Note that the data in Task 1 was scaled, so the comparison here is with the model trained on all the features without PCA. The PCA models remain suitable for classification, as their accuracy differs only slightly from that baseline. With a desired variance of 99.9% we observe an accuracy of 0.7298172462739532 (72.98%), which is close to the 0.7308818310858765 (73.09%) obtained without PCA or any removal of features. Even with a desired variance of 90% we observe an accuracy of 72.63%, only about 0.46 percentage points lower than with all the features. The advantage is that only 16 features are used, roughly half of the original 31, which should make a considerable difference to the computational cost. Hence I would say that the model is considerably "good" compared with the one without removal of features.

3. How about if we set the variance to 99%? And how about 99.9%?  
    When the desired variance is set to 99%, the number of features drops to 23 and the accuracy drops to 0.7301721078779276. With a desired variance of 99.9%, 28 features are kept, but interestingly the accuracy of the model drops further, to 0.7298172462739532. This may be because the 5 additional features kept at 99.9% contain outlier data that throws the model off balance. It is a key point in building a good model: which features we take, and how many of them are critical for the model to perform well.

4. Can you get better accuracy with fewer features using PCA compared to dropping the features from Task 1?  
    As we can see from the result of the code below, the accuracy obtained by dropping 12 features with variance thresholding is 72.85%. Using PCA with only 19 features (i.e. also 12 fewer than the original), we see an accuracy of 72.78% (from above, where the desired variance is 95%). Hence, numerically, PCA does not yield a better result for this dataset, although the difference in accuracy is negligible. PCA would still be useful where dimensionality reduction is key. However, this should only be done in domains and with features where we know it would not cause any harm.

## In order to answer 4

In [None]:
threshold = 0.0085 # threshold of 0.0085 chosen to remove 12 features, for comparison with PCA at 95% desired variance (19 features)
sel = VarianceThreshold(threshold)
# fit the MinMaxScaler on the training data and apply the same scaling to the test data
min_max_scaler = MinMaxScaler()
min_max_scaled_train_X = min_max_scaler.fit_transform(train_X)
min_max_scaled_test_X = min_max_scaler.transform(test_X)

thresholded_train_X = sel.fit_transform(min_max_scaled_train_X)
thresholded_test_X = sel.transform(min_max_scaled_test_X)

print(f"The number of features to consider by the classifier after thresholding are {thresholded_test_X.shape[1]}")

standard_scaled_train_X, standard_scaled_test_X = scaleData(thresholded_train_X, thresholded_test_X)

clf = SVC(kernel='linear')
clf.fit(standard_scaled_train_X, train_Y)
y_pred = clf.predict(standard_scaled_test_X)

print("The accuracy score with svm.SVC is", accuracy_score(test_Y, y_pred)) # calculating accuracy




# Task 3 : Exploration

## Feature Selection with Recursive Feature Elimination(RFE)

RFE is a feature selection technique used to locate the key features in a dataset, and it is commonly used with SVMs. RFE works by estimating the importance of each feature in the training data and ranking the features by importance. The least important feature (the one with the highest rank value) is then removed, a model is built on the remaining features, and these steps are repeated in a loop until the desired number of features is reached. Because the model is refitted at every step, RFE takes interactions between features into account when ranking them, which is an advantage of RFE.   

RFE is a wrapper-type feature selection algorithm: a separate machine learning algorithm sits at its core and is used to obtain the best features. It often uses filter-based feature selection internally. The ranking is determined either by the machine learning algorithm used at the core (e.g. decision trees) or by statistical methods.

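An example of RFE:

    from sklearn.feature_selection import RFE
    from sklearn.tree import DecisionTreeClassifier

    # define the method
    rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=3)
    # fit the model
    rfe.fit(X, y)
    # transform the data (only X is transformed; y is left unchanged)
    X_selected = rfe.transform(X)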

Here the `estimator` argument is the machine learning model we choose to evaluate the importance of features, and `n_features_to_select` is the number of features we want to keep in the end. RFE is often used with k-fold cross-validation to prevent data leakage.

Besides the obvious benefits of reducing the training time and improving the model performance, RFE helps the model to focus on the most significant predictors and to derive a relationship between the features. It also aids in systematically testing the impact of each feature on the model. It can handle interactions between features and hence is suitable for complex datasets. It can also handle correlation between several features, even though it is not the preferred choice there, partly because of its computational complexity.
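A runnable version of the elimination loop described above could look like the sketch below. It uses synthetic data in which only the first column carries information about the label, and a `DecisionTreeClassifier` as the core estimator, as in the example above; the data and the parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical binary problem: only column 0 depends on the label
y = np.repeat([0, 1], 100)
X = np.column_stack([
    y * 4.0 + rng.normal(size=200),  # informative feature
    rng.normal(size=200),            # noise
    rng.normal(size=200),            # noise
    rng.normal(size=200),            # noise
])

# Remove one feature per iteration (step=1) until a single feature is left
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=1, step=1)
rfe.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for kept features,
# and larger values mean the feature was eliminated earlier
print("kept features:", rfe.support_)
print("ranking:", rfe.ranking_)

X_selected = rfe.transform(X)
print("shape after selection:", X_selected.shape)
```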

### Things to keep in mind while working with RFE
1. The number of features should be chosen so that it keeps a balance between predictive power and complexity.  
2. Set the number of cross-validation folds, which can help to reduce overfitting and thereby improve the generalization of the model.

However, RFE is computationally expensive for large datasets.

## Dimensionality Reduction using Isomap

Isomap is used when the data is non-linear (the correlation between features is non-linear). If we project such data onto a linear plane we might lose some of the critical information. This can happen when the data has a geometric structure, and Isomap addresses it by preserving geodesic distances. Isomap uses geodesic distances along with K nearest neighbours to create a similarity matrix for eigenvalue decomposition, so it uses local information to create a global similarity matrix. The algorithm uses Euclidean distance to build the nearest-neighbour graph and then approximates the geodesic distance between two points by the shortest path between them along that graph. Hence it carries both the global and the local structure of the dataset into a lower dimension.


Isomap is a part of manifold learning. A manifold can be thought of as a surface of any shape, i.e. it doesn't need to fit on a plane. While PCA projects data onto a lower-dimensional linear surface, Isomap can work with data lying on a surface of any shape, which PCA cannot do.
The data points here can be seen as samples from a lower-dimensional manifold that is embedded in a higher-dimensional space. Other algorithms that can do this include Locally Linear Embedding and Laplacian Eigenmaps.
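As a minimal sketch of how Isomap might be applied with scikit-learn, the example below embeds a synthetic "swiss roll" (a 2-D sheet rolled up in 3-D) into two dimensions, with PCA applied to the same data for comparison. The dataset and the values `n_neighbors=10` and `n_samples=500` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# Synthetic 3-D data lying on a rolled-up 2-D sheet
X, position = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

# Isomap: nearest-neighbour graph -> shortest-path (geodesic) distances -> 2-D embedding
isomap = Isomap(n_neighbors=10, n_components=2)
X_isomap = isomap.fit_transform(X)

# PCA: linear projection onto the 2 directions of highest variance
X_pca = PCA(n_components=2).fit_transform(X)

print("original shape:", X.shape)
print("Isomap embedding shape:", X_isomap.shape)
print("PCA projection shape:", X_pca.shape)
```

Plotting `X_isomap` and `X_pca` colored by `position` is a common way to compare how well each method unrolls the sheet.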

Page 494, Applied Predictive Modeling, 2013.  
https://machinelearningmastery.com/rfe-feature-selection-in-python/  
https://prateekvjoshi.com/2014/06/21/what-is-manifold-learning/    
https://blog.paperspace.com/dimension-reduction-with-isomap/  