# Lab 6

scikit-learn provides a wide variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (3-classes)

* Regression Datasets: 
    * Boston house prices 
    * Diabetes
    * Linnerrud (multiple regression)
    * California Housing

* Image: 
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition
    * Forest covertypes

* NLP:
    * News group
    * Reuters Corpus Volume I 

* Other:
    * Kddcup 99- Intrusion Detection

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers. 
    * Separate the data into train, validation, and test. 
    * Use accuracy as the metric for assessing performance. 
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1? 

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use the feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve using fewer features?

### Disclaimer: Lab completed with the partial assistance of Google searches and ChatGPT

## Importing Packages

In [1]:
from sklearn.datasets import fetch_kddcup99
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## Quick look at the data

In [16]:
# A single call is enough: as_frame=True returns the data as pandas objects
df = fetch_kddcup99(as_frame=True)

In [3]:
df.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [6]:
print(df['DESCR'])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [17]:
data, target = df['data'], df['target']

In [4]:
data

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494016,0,b'tcp',b'http',b'SF',310,1881,0,0,0,0,...,86,255,1.0,0.0,0.01,0.05,0.0,0.01,0.0,0.0
494017,0,b'tcp',b'http',b'SF',282,2286,0,0,0,0,...,6,255,1.0,0.0,0.17,0.05,0.0,0.01,0.0,0.0
494018,0,b'tcp',b'http',b'SF',203,1200,0,0,0,0,...,16,255,1.0,0.0,0.06,0.05,0.06,0.01,0.0,0.0
494019,0,b'tcp',b'http',b'SF',291,1200,0,0,0,0,...,26,255,1.0,0.0,0.04,0.05,0.04,0.01,0.0,0.0


In [18]:
label_encoder = LabelEncoder()

# Work on an explicit copy so the column assignments below do not
# trigger pandas' SettingWithCopyWarning
data = data.copy()

# Integer-encode every object-typed (categorical) column
for column in data.columns:
    if data[column].dtype == 'object':
        data[column] = label_encoder.fit_transform(data[column])

# Encode the class labels last, so label_encoder stays fitted on the target
target = label_encoder.fit_transform(target)


In [6]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)
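Exercise 1 asks for separate train, validation, and test sets, while the cell above produces only train and test. A minimal sketch of a three-way split follows, assuming a 60/20/20 ratio; `split_train_val_test` is a hypothetical helper (not part of scikit-learn), and the arrays are synthetic stand-ins for the encoded KDD data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_train_val_test(X, y, val_size=0.2, test_size=0.2, random_state=42):
    """Hypothetical helper: split X, y into train, validation, and test sets."""
    n = len(X)
    # Use absolute counts so the split sizes do not depend on float rounding
    n_test = int(round(test_size * n))
    n_val = int(round(val_size * n))
    # First hold out the test set
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=random_state)
    # Then hold out the validation set from what remains
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Synthetic stand-in for the encoded features and labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2
X_tr, X_va, X_te, y_tr, y_va, y_te = split_train_val_test(X_demo, y_demo)
print(len(X_tr), len(X_va), len(X_te))
```

With this layout, hyperparameters would be compared on the validation set and the test set would be used only once, for the final comparison between classifiers.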

# 1.)

## Stochastic Gradient Descent

In [7]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

In [8]:
def fit_and_opt_SGDClassifier(sgd_clf, X_train, X_test, y_train, y_test):
    # Fit the model to the training data
    sgd_clf.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = sgd_clf.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy}")

#### No parameter optimization

In [80]:
%%time
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3)
fit_and_opt_SGDClassifier(sgd_clf, X_train, X_test, y_train, y_test)

Accuracy: 0.9886442993775618
CPU times: user 1min 20s, sys: 92.5 ms, total: 1min 20s
Wall time: 1min 20s


#### Optimizing `max_iter` 
Lowered to 100

In [81]:
%%time
sgd_clf = SGDClassifier(max_iter=100, tol=1e-3)
fit_and_opt_SGDClassifier(sgd_clf, X_train, X_test, y_train, y_test)

Accuracy: 0.9917615505288194
CPU times: user 1min 13s, sys: 134 ms, total: 1min 13s
Wall time: 1min 13s




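#### Optimizing `learning_rate` 
Set explicitly to 'optimal'. This is already the default for `SGDClassifier`, so the difference from the baseline run reflects run-to-run randomness rather than a change in the schedule.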

In [82]:
%%time
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, learning_rate='optimal')
fit_and_opt_SGDClassifier(sgd_clf, X_train, X_test, y_train, y_test)

Accuracy: 0.9903648600779312
CPU times: user 1min 10s, sys: 166 ms, total: 1min 11s
Wall time: 1min 10s
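SGD-based linear models are sensitive to feature scale, and the table above shows byte counts in the thousands next to rates between 0 and 1. The sketch below shows one way to standardize features inside a pipeline before `SGDClassifier`. It runs on synthetic data from `make_classification`, with one column deliberately inflated, so the accuracy it prints says nothing about KDD Cup 99.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; inflate one column to mimic a byte-count feature
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1e4

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fit on the training fold only, then applied to the test fold
scaled_sgd = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
)
scaled_sgd.fit(X_tr, y_tr)

acc = accuracy_score(y_te, scaled_sgd.predict(X_te))
print(f"Accuracy with scaling: {acc:.3f}")
```

Passing `random_state` also makes repeated runs comparable, which matters when two configurations differ by fractions of a percent.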


## Support Vector Machine 

In [9]:
from sklearn.svm import SVC

In [10]:
def fit_and_opt_SVMClassifier(svm_model, X_train, X_test, y_train, y_test):
    # Fit the model to the training data
    svm_model.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = svm_model.predict(X_test)
    
    # Evaluate the model
    print("Accuracy:", accuracy_score(y_test, y_pred))

#### No parameter optimization

In [86]:
%%time
svm_model = SVC(kernel='linear', decision_function_shape='ovo')
fit_and_opt_SVMClassifier(svm_model, X_train, X_test, y_train, y_test)

Accuracy: 0.999443348008704
CPU times: user 10min 30s, sys: 428 ms, total: 10min 31s
Wall time: 10min 31s


#### Optimizing `kernel`
Changing to Sigmoid

In [88]:
%%time
svm_model = SVC(kernel='sigmoid', decision_function_shape='ovo')
fit_and_opt_SVMClassifier(svm_model, X_train, X_test, y_train, y_test)

Accuracy: 0.9812964930924548
CPU times: user 46min 38s, sys: 1.89 s, total: 46min 40s
Wall time: 46min 40s


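#### Optimizing `degree`
Changing to 5. Note that `degree` only affects the 'poly' kernel and is ignored by the linear kernel used here, which is why the accuracy is identical to the baseline.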

In [89]:
%%time
svm_model = SVC(kernel='linear', decision_function_shape='ovo', degree=5)
fit_and_opt_SVMClassifier(svm_model, X_train, X_test, y_train, y_test)

Accuracy: 0.999443348008704
CPU times: user 10min 52s, sys: 250 ms, total: 10min 53s
Wall time: 10min 53s


## Decision Tree

In [11]:
from sklearn.tree import DecisionTreeClassifier

In [12]:
def fit_and_opt_DTClassifier(dt_model, X_train, X_test, y_train, y_test):
    # Fit the model to the training data
    dt_model.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = dt_model.predict(X_test)
    
    # Evaluate the model
    print("Accuracy:", accuracy_score(y_test, y_pred))

#### No parameter optimization

In [93]:
%%time
dt_model = DecisionTreeClassifier()
fit_and_opt_DTClassifier(dt_model, X_train, X_test, y_train, y_test)

Accuracy: 0.9995040736804818
CPU times: user 2.14 s, sys: 8 µs, total: 2.14 s
Wall time: 2.14 s


#### Optimizing `max_depth`
Set to 3

In [94]:
%%time
dt_model = DecisionTreeClassifier(max_depth=3)
fit_and_opt_DTClassifier(dt_model, X_train, X_test, y_train, y_test)

Accuracy: 0.985446080663934
CPU times: user 534 ms, sys: 20 ms, total: 554 ms
Wall time: 551 ms


#### Optimizing `max_features`
Set to sqrt(n_features)

In [96]:
%%time
max_d = int(np.sqrt(len(data.columns)))

dt_model = DecisionTreeClassifier(max_features=max_d)
fit_and_opt_DTClassifier(dt_model, X_train, X_test, y_train, y_test)

Accuracy: 0.9993218966651485
CPU times: user 300 ms, sys: 20 ms, total: 320 ms
Wall time: 316 ms
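The runs above change one hyperparameter at a time by hand. A possible alternative is a joint search over two hyperparameters with `GridSearchCV`, sketched below for the decision tree. The grid values are illustrative assumptions, and the data is synthetic rather than the KDD set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded KDD features and labels
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Search max_depth and max_features jointly (illustrative grid values)
param_grid = {
    "max_depth": [3, 5, None],
    "max_features": ["sqrt", None],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="accuracy",  # the metric required by the exercise
    cv=3,                # cross-validation folds play the role of a validation set
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
test_acc = search.score(X_te, y_te)
print(f"Test accuracy of best configuration: {test_acc:.3f}")
```

The same pattern applies to `SGDClassifier` and `SVC` by swapping the estimator and the grid, although the training times reported above suggest searching on a subsample first.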


# 2.)

Building an ensemble of 25 decision trees with `RandomForestClassifier`, since the decision tree reported the highest accuracy of the three classifiers.

In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

In [15]:
# Create Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=25, random_state=42)

# Train the model
rf_classifier.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.9996457669146298


Identifying the top and bottom 10% data points in terms of uncertainty.

In [16]:
# Class probabilities from the trained random forest for each test sample
probs = rf_classifier.predict_proba(X_test)

# Uncertainty is measured by the margin between the two highest class probabilities:
# a small margin means the ensemble is uncertain, a large margin means it is confident
top_two = np.sort(probs, axis=1)[:, -2:]  # Two highest probabilities per sample
margins = top_two[:, 1] - top_two[:, 0]  # Calculate margins
sorted_indices = np.argsort(margins)  # Ascending margin: most uncertain first

# Largest margins (most confident) and smallest margins (most uncertain)
top_10_percent_idx = sorted_indices[-int(0.1 * len(margins)):]
bottom_10_percent_idx = sorted_indices[:int(0.1 * len(margins))]

In [17]:
print(top_10_percent_idx)
print(bottom_10_percent_idx)

[29776 29788 29786 ... 33014 33085 98804]
[48282 48281 68619 ... 69273 69272 69271]
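The margin used above looks only at the two most likely classes. As a hedged alternative, predictive entropy uses the whole probability vector. The sketch below defines a hypothetical helper `predictive_entropy` and applies it to a small hand-written probability matrix rather than to the output of `rf_classifier.predict_proba`.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy (in nats) of each row of class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0) for classes with zero probability
    return -np.sum(p * np.log(p), axis=1)

# Hand-written probabilities standing in for predict_proba output
probs_demo = np.array([
    [1.0, 0.0, 0.0],   # fully confident
    [0.5, 0.5, 0.0],   # torn between two classes
    [0.4, 0.3, 0.3],   # spread over all classes
    [0.9, 0.1, 0.0],   # mostly confident
])

entropy = predictive_entropy(probs_demo)
order = np.argsort(entropy)  # ascending: most confident first
print(order)
```

Higher entropy means more uncertainty, so the ordering runs in the opposite direction to the margin: the last indices in `order` are the most uncertain samples.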


# 3.)

In [11]:
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.tree import DecisionTreeClassifier

#### Top 10 Features

In [12]:
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(data, target)

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(data.columns)

# Concatenate dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Feature', 'Score']  # Naming the dataframe columns
top_10_KBest = featureScores.nlargest(10, 'Score')
print(top_10_KBest)  # Print 10 best features

top_10_KBest = list(top_10_KBest['Feature'])

                        Feature         Score
5                     dst_bytes  8.283615e+08
4                     src_bytes  1.450596e+08
23                    srv_count  9.291380e+07
22                        count  6.147368e+07
37         dst_host_serror_rate  3.070995e+07
24                  serror_rate  2.823845e+07
0                      duration  2.546861e+07
32           dst_host_srv_count  2.532183e+07
38     dst_host_srv_serror_rate  2.206405e+07
35  dst_host_same_src_port_rate  1.763869e+07


In [13]:
# Initialize the base classifier
model = DecisionTreeClassifier()
feature_names = data.columns

# Initialize RFE with the model and specify the number of features
rfe = RFE(estimator=model, n_features_to_select=10)

# Fit RFE
fit = rfe.fit(data, target)

# Get the ranking of features and selected features
ranking = fit.ranking_
top_10_RFE = [feature_names[i] for i in range(len(feature_names)) if ranking[i] == 1]

# Display selected feature names
print("Top 10 Selected Features:\n", top_10_RFE)

Top 10 Selected Features:
 ['service', 'src_bytes', 'wrong_fragment', 'num_compromised', 'srv_count', 'same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_srv_serror_rate']
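The two selectors agree only partly. The short check below copies the feature names from the two printed outputs above and lists the features both methods selected; no other data is assumed.

```python
# Feature names copied from the SelectKBest (chi2) output above
kbest_features = [
    "dst_bytes", "src_bytes", "srv_count", "count", "dst_host_serror_rate",
    "serror_rate", "duration", "dst_host_srv_count",
    "dst_host_srv_serror_rate", "dst_host_same_src_port_rate",
]

# Feature names copied from the RFE output above
rfe_features = [
    "service", "src_bytes", "wrong_fragment", "num_compromised", "srv_count",
    "same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_srv_serror_rate",
]

# Features chosen by both methods
shared = sorted(set(kbest_features) & set(rfe_features))
print(len(shared), shared)
```

Four of the ten features overlap, so the two reduced datasets used in the re-runs below differ in six columns each.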


#### Re-running models with top 10 features (KBest)

In [14]:
truncated_KBest_data, target = data[top_10_KBest], target
X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest = train_test_split(truncated_KBest_data, target, test_size=0.2, random_state=42)

##### Stochastic Gradient Descent

In [22]:
%%time
# No parameter optimization
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3)
fit_and_opt_SGDClassifier(sgd_clf, X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest)

Accuracy: 0.9832700774252315
CPU times: user 59.4 s, sys: 231 ms, total: 59.6 s
Wall time: 59.3 s


In [23]:
%%time
# Optimizing `max_iter` 
sgd_clf = SGDClassifier(max_iter=100, tol=1e-3)
fit_and_opt_SGDClassifier(sgd_clf, X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest)

Accuracy: 0.9841607206113051
CPU times: user 48.7 s, sys: 512 ms, total: 49.2 s
Wall time: 48.5 s




In [24]:
%%time
# Optimizing `learning_rate` 
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, learning_rate='optimal')
fit_and_opt_SGDClassifier(sgd_clf, X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest)

Accuracy: 0.9781488791053085
CPU times: user 54.6 s, sys: 481 ms, total: 55.1 s
Wall time: 54.4 s


##### Support Vector Machine

In [None]:
%%time
# No parameter optimization
svm_model = SVC(kernel='linear', decision_function_shape='ovo')
fit_and_opt_SVMClassifier(svm_model, X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest)

In [None]:
%%time
# Optimizing `kernel`
svm_model = SVC(kernel='sigmoid', decision_function_shape='ovo')
fit_and_opt_SVMClassifier(svm_model, X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest)

In [None]:
%%time
# Optimizing `degree`
svm_model = SVC(kernel='linear', decision_function_shape='ovo', degree=5)
fit_and_opt_SVMClassifier(svm_model, X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest)

##### Decision Tree

In [25]:
%%time
# No parameter optimization
dt_model = DecisionTreeClassifier()
fit_and_opt_DTClassifier(dt_model, X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest)

Accuracy: 0.9989980264156673
CPU times: user 732 ms, sys: 38 µs, total: 732 ms
Wall time: 727 ms


In [26]:
%%time
# Optimizing `max_depth`
dt_model = DecisionTreeClassifier(max_depth=3)
fit_and_opt_DTClassifier(dt_model, X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest)

Accuracy: 0.981579879560751
CPU times: user 261 ms, sys: 16 µs, total: 261 ms
Wall time: 257 ms


In [27]:
%%time
# Optimizing `max_features`
max_d = int(np.sqrt(len(truncated_KBest_data.columns)))

dt_model = DecisionTreeClassifier(max_features=max_d)
fit_and_opt_DTClassifier(dt_model, X_train_KBest, X_test_KBest, y_train_KBest, y_test_KBest)

Accuracy: 0.9988765750721117
CPU times: user 150 ms, sys: 0 ns, total: 150 ms
Wall time: 144 ms


#### Re-running models with top 10 features (RFE)

In [15]:
truncated_RFE_data, target = data[top_10_RFE], target
X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE = train_test_split(truncated_RFE_data, target, test_size=0.2, random_state=42)

##### Stochastic Gradient Descent

In [29]:
%%time
# No parameter optimization
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3)
fit_and_opt_SGDClassifier(sgd_clf, X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE)

Accuracy: 0.9865492637012296
CPU times: user 45.2 s, sys: 89.1 ms, total: 45.3 s
Wall time: 45.1 s


In [30]:
%%time
# Optimizing `max_iter` 
sgd_clf = SGDClassifier(max_iter=100, tol=1e-3)
fit_and_opt_SGDClassifier(sgd_clf, X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE)

Accuracy: 0.9897373614695613
CPU times: user 32.3 s, sys: 508 ms, total: 32.8 s
Wall time: 32.2 s




In [31]:
%%time
# Optimizing `learning_rate` 
sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, learning_rate='optimal')
fit_and_opt_SGDClassifier(sgd_clf, X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE)

Accuracy: 0.9920044532159303
CPU times: user 50.8 s, sys: 518 ms, total: 51.3 s
Wall time: 50.6 s


##### Support Vector Machine

In [None]:
%%time
# No parameter optimization
svm_model = SVC(kernel='linear', decision_function_shape='ovo')
fit_and_opt_SVMClassifier(svm_model, X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE)

In [None]:
%%time
# Optimizing `kernel`
svm_model = SVC(kernel='sigmoid', decision_function_shape='ovo')
fit_and_opt_SVMClassifier(svm_model, X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE)

In [None]:
%%time
# Optimizing `degree`
svm_model = SVC(kernel='linear', decision_function_shape='ovo', degree=5)
fit_and_opt_SVMClassifier(svm_model, X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE)

##### Decision Tree

In [32]:
%%time
# No parameter optimization
dt_model = DecisionTreeClassifier()
fit_and_opt_DTClassifier(dt_model, X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE)

Accuracy: 0.9994534689540003
CPU times: user 1.13 s, sys: 421 ms, total: 1.55 s
Wall time: 1.06 s


In [33]:
%%time
# Optimizing `max_depth`
dt_model = DecisionTreeClassifier(max_depth=3)
fit_and_opt_DTClassifier(dt_model, X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE)

Accuracy: 0.985446080663934
CPU times: user 625 ms, sys: 149 µs, total: 625 ms
Wall time: 620 ms


In [34]:
%%time
# Optimizing `max_features`
max_d = int(np.sqrt(len(truncated_RFE_data.columns)))

dt_model = DecisionTreeClassifier(max_features=max_d)
fit_and_opt_DTClassifier(dt_model, X_train_RFE, X_test_RFE, y_train_RFE, y_test_RFE)

Accuracy: 0.9993826223369263
CPU times: user 392 ms, sys: 376 µs, total: 392 ms
Wall time: 388 ms


# 4.)

In [25]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.decomposition import PCA

In [22]:
pca = PCA(n_components=0.95)  # Retain 95% of variance
X_train_KBest_reduced = pca.fit_transform(X_train_KBest)
X_train_RFE_reduced = pca.fit_transform(X_train_RFE)

#### Clustering models with top 10 features (KBest)

In [38]:
%%time
kmeans = KMeans(n_clusters=3)
kmeans_labels = kmeans.fit_predict(X_train_KBest)

kmeans_ari = adjusted_rand_score(y_train_KBest, kmeans_labels)
kmeans_nmi = normalized_mutual_info_score(y_train_KBest, kmeans_labels)

print(f"K-Means ARI: {kmeans_ari}, NMI: {kmeans_nmi}")

K-Means ARI: 0.7738611167423556, NMI: 0.6915610010698197
CPU times: user 10.3 s, sys: 3.49 s, total: 13.8 s
Wall time: 2.01 s


In [None]:
# ARI (about 0.77) and NMI (about 0.69) show moderate agreement with the true labels, but with only 3 clusters K-Means cannot separate every class from question 1, so no clear per-class clusters were found

In [26]:
%%time
agglo = AgglomerativeClustering(n_clusters=3, linkage='complete')
agglo_labels = agglo.fit_predict(X_train_KBest_reduced)

agglo_ari = adjusted_rand_score(y_train_KBest, agglo_labels)
agglo_nmi = normalized_mutual_info_score(y_train_KBest, agglo_labels)

print(f"Agglomerative ARI: {agglo_ari}, NMI: {agglo_nmi}")

MemoryError: Unable to allocate 582. GiB for an array with shape (78097645720,) and data type float64
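The size in the error can be reproduced by hand: agglomerative clustering needs one distance per pair of samples, and the array shape corresponds to n(n-1)/2 pairs for n = 395,216 training rows (the value implied by the shape in the message). The sketch below checks that arithmetic and then shows one possible workaround, clustering a random subsample. The subsample size of 1,000 and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def condensed_distance_size(n_samples, bytes_per_value=8):
    """Number of pairwise distances and their float64 size in GiB."""
    n_pairs = n_samples * (n_samples - 1) // 2
    return n_pairs, n_pairs * bytes_per_value / 2**30

n_pairs, gib = condensed_distance_size(395216)
print(n_pairs, round(gib))  # matches the shape and size in the MemoryError

# Workaround sketch: cluster a random subsample so the pairwise matrix fits in memory
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 4))  # synthetic stand-in for X_train_KBest_reduced
sample_idx = rng.choice(len(X_demo), size=1000, replace=False)

agglo = AgglomerativeClustering(n_clusters=3, linkage='complete')
sample_labels = agglo.fit_predict(X_demo[sample_idx])
print(np.bincount(sample_labels))
```

On the real data, the ARI and NMI would then be computed against `y_train_KBest[sample_idx]` so that labels and cluster assignments stay aligned.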

In [None]:
%%time
dbscan = DBSCAN(eps=0.5)  # eps is the maximum distance between two samples for one to be considered as in the neighborhood of the other
dbscan_labels = dbscan.fit_predict(X_train_KBest_reduced)

if np.unique(dbscan_labels).size > 1:
    dbscan_ari = adjusted_rand_score(y_train_KBest, dbscan_labels)
    dbscan_nmi = normalized_mutual_info_score(y_train_KBest, dbscan_labels)
else:
    dbscan_ari, dbscan_nmi = 0, 0
    
print(f"DBSCAN ARI: {dbscan_ari}, NMI: {dbscan_nmi}")

#### Clustering models with top 10 features (RFE)

In [17]:
kmeans = KMeans(n_clusters=3)
kmeans_labels = kmeans.fit_predict(X_train_RFE)

kmeans_ari = adjusted_rand_score(y_train_RFE, kmeans_labels)
kmeans_nmi = normalized_mutual_info_score(y_train_RFE, kmeans_labels)

print(f"K-Means ARI: {kmeans_ari}, NMI: {kmeans_nmi}")

K-Means ARI: 0.7728261186781732, NMI: 0.7087606844063691


In [None]:
# ARI (about 0.77) and NMI (about 0.71) show moderate agreement with the true labels, but with only 3 clusters K-Means cannot separate every class from question 1, so no clear per-class clusters were found

In [18]:
agglo = AgglomerativeClustering(n_clusters=3)
agglo_labels = agglo.fit_predict(X_train_RFE)

agglo_ari = adjusted_rand_score(y_train_RFE, agglo_labels)
agglo_nmi = normalized_mutual_info_score(y_train_RFE, agglo_labels)

print(f"Agglomerative ARI: {agglo_ari}, NMI: {agglo_nmi}")

MemoryError: Unable to allocate 582. GiB for an array with shape (78097645720,) and data type float64

In [None]:
dbscan = DBSCAN(eps=0.5)  # eps is the maximum distance between two samples for one to be considered as in the neighborhood of the other
dbscan_labels = dbscan.fit_predict(X_train_RFE)

if np.unique(dbscan_labels).size > 1:
    dbscan_ari = adjusted_rand_score(y_train_RFE, dbscan_labels)
    dbscan_nmi = normalized_mutual_info_score(y_train_RFE, dbscan_labels)
else:
    dbscan_ari, dbscan_nmi = 0, 0
    
print(f"DBSCAN ARI: {dbscan_ari}, NMI: {dbscan_nmi}")

# 5.)
To the best of my attempts, no reliable clusters were found in question 4, so no clusters could be identified within the top or bottom 10% of the data.

# 6.)

In [19]:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import classification_report, roc_auc_score

In [20]:
SA_data, SA_target = data, target
X_train, X_test, y_train, y_test = train_test_split(SA_data, SA_target, test_size=0.2, random_state=42)
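The cell above reuses the full dataset under the name `SA_data`. The "SA" subset described in the dataset documentation can be requested through the `subset` parameter of `fetch_kddcup99`. The sketch below wraps that call in a hypothetical helper `load_sa` (not executed here, since it downloads data) and shows a second hypothetical helper, `to_anomaly_labels`, on a tiny hand-written label series. It assumes normal traffic is labeled `b"normal."`.

```python
import pandas as pd
from sklearn.datasets import fetch_kddcup99

def to_anomaly_labels(target, normal_label=b"normal."):
    """Return 1 for attacks and 0 for normal traffic (assumed label: b'normal.')."""
    return (pd.Series(target) != normal_label).astype(int).to_numpy()

def load_sa():
    """Fetch the SA subset; downloads the data on first use, so it is not called here."""
    sa = fetch_kddcup99(subset="SA", as_frame=True)
    return sa["data"], to_anomaly_labels(sa["target"])

# Hand-written labels standing in for the real target column
demo_target = pd.Series([b"normal.", b"smurf.", b"normal.", b"neptune."])
print(to_anomaly_labels(demo_target))
```

The categorical columns returned by `load_sa` would still need the same encoding step applied to `data` earlier in the notebook.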

In [21]:
# Anomaly Detection algorithms
iso_forest = IsolationForest()
oc_svm = OneClassSVM()
lof = LocalOutlierFactor()

In [None]:
iso_forest_pred = iso_forest.fit_predict(X_train)
oc_svm_pred = oc_svm.fit_predict(X_train)
lof_pred = lof.fit_predict(X_train)

In [None]:
# Convert -1 to 1 for anomaly and 1 to 0 for normal for consistency with y labels
iso_forest_pred = [1 if x == -1 else 0 for x in iso_forest_pred]
oc_svm_pred = [1 if x == -1 else 0 for x in oc_svm_pred]
lof_pred = [1 if x == -1 else 0 for x in lof_pred]

# The detectors were fit on X_train, so evaluate against y_train.
# y_train holds multi-class codes; collapse them to 1 = attack, 0 = normal
# (assumes the normal class is stored as b'normal.')
normal_code = list(label_encoder.classes_).index(b'normal.')
y_true = (y_train != normal_code).astype(int)

print("Isolation Forest Performance:")
print(classification_report(y_true, iso_forest_pred))
print("ROC AUC Score:", roc_auc_score(y_true, iso_forest_pred))

print("\nOne-Class SVM Performance:")
print(classification_report(y_true, oc_svm_pred))
print("ROC AUC Score:", roc_auc_score(y_true, oc_svm_pred))

print("\nLocal Outlier Factor Performance:")
print(classification_report(y_true, lof_pred))
print("ROC AUC Score:", roc_auc_score(y_true, lof_pred))
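ROC AUC is more informative when computed from continuous anomaly scores than from hard 0/1 predictions, because it then reflects the full ranking of samples. A minimal sketch with `IsolationForest.score_samples` follows. The data is synthetic (a Gaussian bulk plus a few far-away points), so the printed value says nothing about the KDD results.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic data: 300 normal points near the origin, 15 outliers far from it
normal_points = rng.normal(0, 1, size=(300, 2))
outliers = rng.uniform(5, 9, size=(15, 2)) * rng.choice([-1, 1], size=(15, 2))
X_demo = np.vstack([normal_points, outliers])
y_demo = np.r_[np.zeros(300, dtype=int), np.ones(15, dtype=int)]  # 1 = anomaly

iso = IsolationForest(random_state=0).fit(X_demo)

# score_samples is lower for more abnormal points, so negate it:
# higher values then mean "more anomalous", matching y_demo's convention
anomaly_score = -iso.score_samples(X_demo)

auc = roc_auc_score(y_demo, anomaly_score)
print(f"ROC AUC from continuous scores: {auc:.3f}")
```

`OneClassSVM` also exposes `score_samples`, and `LocalOutlierFactor` stores comparable values in `negative_outlier_factor_` after fitting.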

# 7.)

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import confusion_matrix

In [None]:
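# Randomly select 250 data points (fixed seed for reproducibility)
rng = np.random.default_rng(42)
indices = rng.choice(len(data), size=250, replace=False)
X_subset = data.iloc[indices].to_numpy()
# Map labels to the detectors' convention: 1 = normal, -1 = anomaly
# (assumes the normal class is stored as b'normal.')
normal_code = list(label_encoder.classes_).index(b'normal.')
y_subset = np.where(np.array(target)[indices] == normal_code, 1, -1)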

In [None]:
iso_forest = IsolationForest()
oc_svm = OneClassSVM()
lof = LocalOutlierFactor()

In [None]:
loo = LeaveOneOut()
iso_forest_scores = []
oc_svm_scores = []
lof_scores = []

for train_index, test_index in loo.split(X_subset):
    X_train, X_test = X_subset[train_index], X_subset[test_index]
    y_train, y_test = y_subset[train_index], y_subset[test_index]

    # Fit and predict with Isolation Forest
    iso_forest.fit(X_train)
    iso_forest_pred = iso_forest.predict(X_test)
    iso_forest_scores.append(confusion_matrix(y_test, iso_forest_pred, labels=[-1, 1]))

    # Fit and predict with One-Class SVM
    oc_svm.fit(X_train)
    oc_svm_pred = oc_svm.predict(X_test)
    oc_svm_scores.append(confusion_matrix(y_test, oc_svm_pred, labels=[-1, 1]))

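    # Fit and predict with Local Outlier Factor; novelty=True is required
    # to call predict() on samples that were not part of the fit
    lof = LocalOutlierFactor(novelty=True)
    lof.fit(X_train)
    lof_pred = lof.predict(X_test)
    lof_scores.append(confusion_matrix(y_test, lof_pred, labels=[-1, 1]))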


In [None]:
# Aggregate and summarize the scores
iso_forest_performance = np.sum(iso_forest_scores, axis=0)
oc_svm_performance = np.sum(oc_svm_scores, axis=0)
lof_performance = np.sum(lof_scores, axis=0)

print("Isolation Forest Performance:\n", iso_forest_performance)
print("\nOne-Class SVM Performance:\n", oc_svm_performance)
print("\nLocal Outlier Factor Performance:\n", lof_performance)

# 8.)

In [None]:
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.tree import DecisionTreeClassifier

In [None]:
bestfeatures = SelectKBest(score_func=chi2, k=5)
fit = bestfeatures.fit(SA_data, SA_target)

dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(SA_data.columns)

# Concatenate dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Feature', 'Score']  # Naming the dataframe columns
top_5_KBest = featureScores.nlargest(5, 'Score')
print(top_5_KBest)  # Print 5 best features

top_5_KBest = list(top_5_KBest['Feature'])