# Lab 6

Scikit-learn provides a wide variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer Wisconsin (2-classes)
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (3-classes)

* Regression Datasets:
    * Boston house prices (removed in scikit-learn 1.2)
    * Diabetes
    * Linnerud (multi-output regression)
    * California Housing

* Image:
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition

* NLP:
    * The 20 newsgroups text
    * Reuters Corpus Volume I

* Other:
    * Forest covertypes
    * Kddcup 99 (intrusion detection)
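
As a quick illustration of how these datasets are accessed, the sketch below loads three of the small bundled classification datasets and prints their sizes. It only assumes that the `load_*` helpers of `sklearn.datasets` are available; the `loaders` and `summary` names are made up for this example.

```python
from sklearn.datasets import load_iris, load_wine, load_digits

# Bundled "toy" datasets ship with scikit-learn and need no download
loaders = {"iris": load_iris, "wine": load_wine, "digits": load_digits}

summary = {}
for name, loader in loaders.items():
    bunch = loader()
    n_samples, n_features = bunch.data.shape
    n_classes = len(bunch.target_names)
    summary[name] = (n_samples, n_features, n_classes)
    print(f"{name}: {n_samples} samples, {n_features} features, {n_classes} classes")
```

Larger datasets such as Kddcup 99 use `fetch_*` functions instead, which download the data on first use.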

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare the classification performance of 3 different classifiers.
    * Separate the data into train, validation, and test sets.
    * Use accuracy as the metric for assessing performance.
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1?

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 data points and redo question 6, using leave-one-out as the evaluation method.

8. Use a feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve when using fewer features?

## Quick look at the data

In [2]:
from sklearn.datasets import fetch_kddcup99
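# percent10=True is the default and loads the 10% subset; pass percent10=False for the full dataset
D = fetch_kddcup99()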

In [3]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [4]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [5]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [6]:
len(np.unique(D["target"]))

23
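
Beyond the number of classes, the class frequencies are worth a look, since the dataset description above says most records are attacks. The following is a minimal sketch using a small made-up label array (`target` is a hypothetical stand-in for `D["target"]`, with byte-string labels in the same style):

```python
import numpy as np

# Hypothetical stand-in for D["target"]: byte-string labels in the same style
target = np.array(
    [b'smurf.', b'normal.', b'smurf.', b'neptune.', b'smurf.', b'normal.'],
    dtype=object,
)

# Count how often each label occurs, then sort from most to least frequent
labels, counts = np.unique(target, return_counts=True)
order = np.argsort(counts)[::-1]

for label, count in zip(labels[order], counts[order]):
    print(f"{label.decode('utf-8'):<10} {count:>3}  ({count / counts.sum():.1%})")
```

Running the same lines on `D["target"]` would show which attack types dominate and which have only a handful of records.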

In [7]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

### Exercise 1

In [8]:
import pandas as pd

# Convert features and target to DataFrames
feature_df = pd.DataFrame(D.data, columns=D["feature_names"])
target_df = pd.Series(D.target).rename('target')

# Concatenate features and target into a single DataFrame
data_df = pd.concat([feature_df, target_df], axis=1)

In [9]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

# D, np and data_df are already defined in the cells above

# The three symbolic columns are one-hot encoded; every other column is standardized
categorical_features = ['protocol_type', 'service', 'flag']
numerical_features = [f for f in D["feature_names"] if f not in categorical_features]

categorical_transformer = OneHotEncoder()
numerical_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Note: fitting on the full dataset before splitting lets validation/test
# statistics leak into the scaling; fit on the training split only for a stricter evaluation
X = preprocessor.fit_transform(data_df.iloc[:, :-1])
y = data_df['target'].values


In [10]:
# Split the dataset
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
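
The two-step split above yields roughly 70% train, 15% validation and 15% test. The sketch below checks those proportions on synthetic data and adds `stratify`, which keeps the class mix similar in every part. The `X_demo`/`y_demo` arrays are made up for illustration. Stratification needs at least two samples per class in each step, which the rarest Kddcup classes may not satisfy, so treat this as an option rather than a drop-in change.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class data (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.7, 0.2, 0.1])

# 70% train, then split the remaining 30% in half: 15% validation, 15% test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(len(y_tr), len(y_va), len(y_te))
for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    # Class proportions should be close to each other across the three parts
    print(name, np.bincount(part, minlength=3) / len(part))
```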


In [11]:
y_train = np.array([label.decode('utf-8') for label in y_train])

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)

In [12]:
print(y_train.shape)

(345814,)
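
One caveat with fitting the `LabelEncoder` on the training labels only: `transform` raises an error for any label that appears only in the validation or test split. A minimal sketch of one way around this, assuming it is acceptable to fit the encoder on the complete label set before splitting (the tiny `y_all` array is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical byte-string labels; b'spy.' stands in for a very rare class
y_all = np.array([b'normal.', b'smurf.', b'spy.', b'normal.', b'smurf.'], dtype=object)
y_all_str = np.array([label.decode('utf-8') for label in y_all])

# Fit on every label so that all splits share one consistent integer mapping
encoder = LabelEncoder().fit(y_all_str)

train_part = y_all_str[[0, 1, 3]]   # contains no 'spy.' sample
val_part = y_all_str[[2, 4]]        # contains the rare class

y_tr_enc = encoder.transform(train_part)
y_va_enc = encoder.transform(val_part)  # works because 'spy.' was seen during fit

print(encoder.classes_)
print(encoder.inverse_transform(y_va_enc))
```

Encoding labels leaks no feature information, so fitting the encoder on all labels does not bias the evaluation.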


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define classifiers and their hyperparameter grids
classifiers = {
    'RandomForest': RandomForestClassifier(),
    'SVM': SVC(),
    'LogisticRegression': LogisticRegression()
}

param_grids = {
    'RandomForest': {'n_estimators': [10, 50, 100], 'max_depth': [5, 10, None]},
    'SVM': {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
    'LogisticRegression': {'C': [0.1, 1, 10], 'max_iter': [100, 200]}
}

# Perform grid search for each classifier
optimal_classifiers = {}
# Proceed with training and hyperparameter tuning
for clf_name in classifiers:
    grid_search = GridSearchCV(classifiers[clf_name], param_grids[clf_name], cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    optimal_classifiers[clf_name] = grid_search.best_estimator_


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

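The warning above means the logistic regression solver stopped before converging. Since the grid only tries `max_iter` values of 100 and 200, a larger iteration budget is one plausible remedy. The sketch below shows the general pattern on synthetic data: scaling inside a pipeline plus a higher `max_iter`. The data and the value 1000 are assumptions for illustration, not tuned settings for Kddcup.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with one deliberately badly scaled feature
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1e4

# Scaling happens inside the pipeline, so it is re-fitted on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed
    warnings.simplefilter("error", ConvergenceWarning)
    model.fit(X_demo, y_demo)

n_iter = model[-1].n_iter_[0]
print(f"Solver iterations used: {n_iter}")
```
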
In [13]:
y_val = np.array([label.decode('utf-8') for label in y_val])

In [14]:
y_val_encoded = label_encoder.transform(y_val)

In [21]:
from sklearn.metrics import accuracy_score

# y_val_encoded must use the same integer encoding as y_pred before computing the accuracy
for clf_name, clf in optimal_classifiers.items():
    y_pred = clf.predict(X_val)
    accuracy = accuracy_score(y_val_encoded, y_pred)
    print(f"Accuracy of {clf_name}: {accuracy}")

Accuracy of RandomForest: 0.9997570948544593
Accuracy of SVM: 0.9994467160573796
Accuracy of LogisticRegression: 0.9992173056421467
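
All three accuracies are very close to 1, so it helps to know what a trivial model would score on imbalanced data. The sketch below uses a made-up label distribution, not the Kddcup one, and compares plain accuracy with balanced accuracy for a majority-class baseline. `DummyClassifier` and `balanced_accuracy_score` are standard scikit-learn tools; everything else is hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up imbalanced labels: one dominant class and two rare ones
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2], size=2000, p=[0.9, 0.08, 0.02])
X_demo = np.zeros((2000, 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

acc = accuracy_score(y_demo, y_base)
bal = balanced_accuracy_score(y_demo, y_base)
print(f"Baseline accuracy: {acc:.3f}, balanced accuracy: {bal:.3f}")
```

The exercise asks for accuracy, so this is only a sanity check: a classifier is informative to the extent that it beats such a baseline.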


In [23]:
best_accuracy = 0
best_clf_name = None

for clf_name, clf in optimal_classifiers.items():
    y_val_pred = clf.predict(X_val)
    accuracy = accuracy_score(y_val_encoded, y_val_pred)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_clf_name = clf_name

print(f"Best classifier on validation set: {best_clf_name} with accuracy: {best_accuracy}")

best_clf = optimal_classifiers[best_clf_name]


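# Encode the test labels exactly like the training and validation labels
y_test_encoded = label_encoder.transform(
    np.array([label.decode('utf-8') for label in y_test]))

y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test_encoded, y_test_pred)
print(f"Test Accuracy of {best_clf_name}: {test_accuracy}")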


Best classifier on validation set: RandomForest with accuracy: 0.9997570948544593
Test Accuracy of RandomForest: 0.9997031199395444


In [24]:
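# Decode the byte-string labels first, since the encoder was fitted on decoded strings
y_test_encoded = label_encoder.transform(
    np.array([label.decode('utf-8') for label in y_test]))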

test_accuracy = accuracy_score(y_test_encoded, y_test_pred)
print(f"Test Accuracy of {best_clf_name}: {test_accuracy}")

Test Accuracy of RandomForest: 0.9997031199395444


### Exercise 2

In [16]:
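# X_train is already the output of the first preprocessor (scaled numerical
# columns first, then the one-hot columns), so these positions index the
# transformed matrix, not the raw protocol_type/service/flag columns
categorical_indices = [1, 2, 3]
numerical_indices = [0, 4, 5]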

# Update the preprocessor to ignore unknown categories
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_indices),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_indices)
    ])


preprocessor.fit(X_train)

# Transform both training and test data again
X_train_encoded = preprocessor.transform(X_train)
X_test_encoded = preprocessor.transform(X_test)

In [15]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

In [30]:
from sklearn.ensemble import RandomForestClassifier

ensemble_clf = RandomForestClassifier(n_estimators=25, random_state=42)

ensemble_clf.fit(X_train_encoded, y_train_encoded)

ensemble_predictions = ensemble_clf.predict_proba(X_test_encoded)

In [31]:
# Calculate the maximum predicted probability (the model's confidence) for each sample
max_probabilities = np.max(ensemble_predictions, axis=1)

# Thresholds at the 90th and 10th percentiles of the confidence
top_10_percent_threshold = np.percentile(max_probabilities, 90)
bottom_10_percent_threshold = np.percentile(max_probabilities, 10)

# High confidence means low uncertainty, and vice versa.
# With many tied probabilities (e.g. exactly 1.0), a threshold can select more than 10% of the samples.
top_10_percent_indices = np.where(max_probabilities >= top_10_percent_threshold)[0]
bottom_10_percent_indices = np.where(max_probabilities <= bottom_10_percent_threshold)[0]

# Output the results
print(f"Indices of the most confident (least uncertain) data: {top_10_percent_indices}")
print(f"Indices of the least confident (most uncertain) data: {bottom_10_percent_indices}")

Indices of the most confident (least uncertain) data: [    0     1     2 ... 74100 74101 74102]
Indices of the least confident (most uncertain) data: [    4    14    15 ... 74098 74099 74103]
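
An alternative way to pick the two 10% groups is to turn the class probabilities into an explicit uncertainty score and sort by it. The sketch below takes uncertainty as one minus the highest class probability, which is one common choice among several (entropy is another). The `proba` array is a random stand-in for the output of `predict_proba`.

```python
import numpy as np

# Hypothetical stand-in for predict_proba output: 1000 samples, 3 classes
rng = np.random.default_rng(0)
proba = rng.dirichlet(alpha=[0.2, 0.2, 0.2], size=1000)

# Uncertainty score: 0 means the model puts all probability on one class
uncertainty = 1.0 - proba.max(axis=1)

# Sort once (ascending) and slice both tails, so each group has exactly 10% of the samples
order = np.argsort(uncertainty)
n_tail = len(order) // 10
least_uncertain_idx = order[:n_tail]
most_uncertain_idx = order[-n_tail:]

print(f"Least uncertain: max score {uncertainty[least_uncertain_idx].max():.4f}")
print(f"Most uncertain:  min score {uncertainty[most_uncertain_idx].min():.4f}")
```

Sorting and slicing always returns exactly 10% of the samples per group, even when many predictions share the same probability.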


### Exercise 3

Method 1

In [32]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Instantiate a classifier to use with SelectFromModel
rf_classifier = RandomForestClassifier(n_estimators=50, random_state=42)

# Fit the classifier to get feature importances
rf_classifier.fit(X_train_encoded, y_train_encoded)

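# max_features caps the selection at 10 features; the default importance
# threshold still applies, so fewer than 10 may be kept
selector = SelectFromModel(rf_classifier, max_features=10, prefit=True)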


X_train_selected = selector.transform(X_train_encoded)
X_test_selected = selector.transform(X_test_encoded)


rf_classifier.fit(X_train_selected, y_train_encoded)
y_pred_selected = rf_classifier.predict(X_test_selected)


accuracy_selected = accuracy_score(y_test_encoded, y_pred_selected)

Method 2

In [33]:
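# Refit on all encoded features; like Method 1, this ranks features by
# random-forest importances, so both methods may select very similar subsets
rf_classifier.fit(X_train_encoded, y_train_encoded)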

importances = rf_classifier.feature_importances_
indices = np.argsort(importances)[::-1]

# Select the top 10 most important features
top_indices = indices[:10]
X_train_top_features = X_train_encoded[:, top_indices]
X_test_top_features = X_test_encoded[:, top_indices]

# Train a classifier on the top features
rf_classifier.fit(X_train_top_features, y_train_encoded)
y_pred_top_features = rf_classifier.predict(X_test_top_features)

# Evaluate performance
accuracy_top_features = accuracy_score(y_test_encoded, y_pred_top_features)


In [34]:
print(f'Accuracy with SelectFromModel features: {accuracy_selected}')
print(f'Accuracy with top model features: {accuracy_top_features}')

Accuracy with SelectFromModel features: 0.9782332937493253
Accuracy with top model features: 0.9782332937493253
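
The two accuracies are identical, which is consistent with both methods relying on random-forest importances. For a selection method that does not depend on a fitted model, a univariate filter is one option. The sketch below uses `SelectKBest` with the ANOVA F-score (`f_classif`) on synthetic data. The choice of scoring function is an assumption, and others such as `mutual_info_classif` may suit the one-hot columns better.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 30 features, only 5 of which carry class information
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0, random_state=0)

# Score every feature independently against the labels and keep the 10 best
selector = SelectKBest(score_func=f_classif, k=10)
X_sel = selector.fit_transform(X_demo, y_demo)

selected = selector.get_support(indices=True)
print(f"Selected feature indices: {selected}")
print(f"Reduced shape: {X_sel.shape}")
```

On the lab data, the selector would be fitted on the training split only and then applied to the test split with `selector.transform`.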


### Exercise 4

In [17]:
number_of_classes = len(np.unique(y_train))

In [18]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score
import numpy as np

In [42]:
from sklearn.model_selection import train_test_split

X_sample, _ = train_test_split(X_train_encoded, train_size=0.1, random_state=42)

# K-Means Clustering
kmeans = KMeans(n_clusters=number_of_classes, random_state=42)
kmeans_labels_sample = kmeans.fit_predict(X_sample)

# Calculate silhouette score on the sample
silhouette_kmeans_sample = silhouette_score(X_sample, kmeans_labels_sample)

print(f'Silhouette Score for K-Means (on sample): {silhouette_kmeans_sample}')



Silhouette Score for K-Means (on sample): 0.8196713888611975
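
The silhouette score only measures how compact and well separated the clusters are. To answer whether clusters line up with the classes from question 1, the cluster labels can be compared with the true labels after clustering. The sketch below shows this on synthetic blobs, using `adjusted_rand_score` and a contingency table. The demo data and variable names are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with 3 known groups (the "classes")
X_demo, y_demo = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# Cluster without using the labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_demo)

# Adjusted Rand index: 1.0 for a perfect match, about 0 for random assignments
ari = adjusted_rand_score(y_demo, labels)

# Rows are true classes, columns are clusters; a class "has a cluster"
# when most of its row falls into a single column
table = pd.crosstab(y_demo, labels, rownames=["class"], colnames=["cluster"])

print(f"Adjusted Rand index: {ari:.3f}")
print(table)
```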


In [None]:
# Convert the sparse matrix to a dense matrix
X_sample_dense = X_sample.toarray()

# Apply Agglomerative Clustering on the dense sample
agg_clustering = AgglomerativeClustering(n_clusters=number_of_classes)
agg_labels = agg_clustering.fit_predict(X_sample_dense)

# Calculate the silhouette score on the dense sample
silhouette_agg = silhouette_score(X_sample_dense, agg_labels)

print(f'Silhouette Score for Agglomerative Clustering (on sample): {silhouette_agg}')


In [None]:
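# DBSCAN Clustering
# Run on the same 10% sample as the other algorithms: silhouette_score needs all
# pairwise distances, which is impractical on the full training set
dbscan = DBSCAN(eps=0.5, min_samples=5)  # eps and min_samples may need tuning
dbscan_labels = dbscan.fit_predict(X_sample)
# DBSCAN chooses the number of clusters itself and labels noise points as -1;
# the silhouette score is only defined when there are at least 2 distinct labels
if len(np.unique(dbscan_labels)) > 1:
    silhouette_dbscan = silhouette_score(X_sample, dbscan_labels)
else:
    silhouette_dbscan = None  # DBSCAN found fewer than 2 distinct labels
print(f'Silhouette Score for DBSCAN (on sample): {silhouette_dbscan}')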
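
Because DBSCAN marks outliers with the label `-1`, the number of distinct labels is not the number of clusters. The sketch below separates the two on synthetic data, with one far-away point added on purpose. The data and variable names are hypothetical, and `eps=0.5` is just the value used above, not a tuned setting.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three tight synthetic blobs plus one isolated point that cannot join any cluster
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.3, random_state=42)
X_demo = np.vstack([X_demo, [[50.0, 50.0]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

# -1 is DBSCAN's noise label and must not be counted as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_fraction = np.mean(labels == -1)

print(f"Clusters found: {n_clusters}")
print(f"Noise fraction: {noise_fraction:.3f}")
```

A high noise fraction or a single giant cluster usually indicates that `eps` and `min_samples` need adjusting for the data at hand.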
