# Lab 6

Scikit-learn provides a wide variety of algorithms for common machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Feature Selection
* Anomaly Detection

It also provides some datasets that you can use to test these algorithms:

* Classification Datasets:
    * Breast cancer wisconsin
    * Iris plants (3-classes)
    * Optical recognition of handwritten digits (10-classes)
    * Wine (3-classes)

* Regression Datasets:
    * Boston house prices (removed in scikit-learn 1.2)
    * Diabetes
    * Linnerrud (multiple regression)
    * California Housing

* Image:
    * The Olivetti faces
    * The Labeled Faces in the Wild face recognition

* Tabular (multi-class classification):
    * Forest covertypes

* NLP:
    * The 20 newsgroups text dataset
    * Reuters Corpus Volume I

* Other:
    * Kddcup 99- Intrusion Detection

## Exercises

1. Use the full [Kddcup](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset to compare classification performance of 3 different classifiers.
    * Separate the data into train, validation, and test.
    * Use accuracy as the metric for assessing performance.
    * For each classifier, identify the hyperparameters. Perform optimization over at least 2 hyperparameters.   
    * Compare the performance of the optimal configuration of the classifiers.

2. Pick the best algorithm in question 1. Create an ensemble of at least 25 models, and use them for the classification task. Identify the top and bottom 10% of the data in terms of uncertainty of the decision.

3. Use 2 different feature selection algorithms to identify the 10 most important features for the task in question 1. Retrain the classifiers from question 1 with just this subset of features and compare performance.

4. Use the same data, removing the labels, and compare performance of 3 different clustering algorithms. Can you find clusters for each of the classes in question 1?

5. Can you identify any clusters within the top/bottom 10% identified in question 2? What are their characteristics?

6. Use the "SA" dataset to compare the performance of 3 different anomaly detection algorithms.

7. Create a subsample of 250 datapoints, redo question 6, using Leave-one-out as the method of evaluation.

8. Use a feature selection algorithm to identify the 5 most important features for the task in question 6, for each algorithm. Does the anomaly detection improve when using fewer features?

## Quick look at the data

In [1]:
from sklearn.datasets import fetch_kddcup99
D=fetch_kddcup99()

In [2]:
dir(D)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [3]:
print(D["DESCR"])

.. _kddcup99_dataset:

Kddcup 99 dataset
-----------------

The KDD Cup '99 dataset was created by processing the tcpdump portions
of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
abnormal data which is unrealistic in real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:

* qualitatively different from normal data
* in large minority among the observations.

We thus transform the KDD Data set into two different data sets: SA and SF.

* SA is obtained by simply selecting all

In [5]:
import numpy as np
np.unique(D["target"])

array([b'back.', b'buffer_overflow.', b'ftp_write.', b'guess_passwd.',
       b'imap.', b'ipsweep.', b'land.', b'loadmodule.', b'multihop.',
       b'neptune.', b'nmap.', b'normal.', b'perl.', b'phf.', b'pod.',
       b'portsweep.', b'rootkit.', b'satan.', b'smurf.', b'spy.',
       b'teardrop.', b'warezclient.', b'warezmaster.'], dtype=object)

In [6]:
len(np.unique(D["target"]))

23
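With 23 classes, and a description that calls the data heavily skewed toward attacks, it can help to look at class frequencies before settling on a metric. The following is a minimal sketch; the small `target` array is a made-up stand-in for `D["target"]`, and the class shares it prints are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for D["target"]: byte-string labels, as fetch_kddcup99 returns them
target = np.array([b"normal."] * 2 + [b"smurf."] * 6 + [b"neptune."] * 2, dtype=object)

# Count how often each class occurs
classes, counts = np.unique(target, return_counts=True)

# Decode the byte strings and sort by frequency, most common class first
class_share = {c.decode(): n / len(target) for c, n in zip(classes, counts)}
class_share = dict(sorted(class_share.items(), key=lambda kv: kv[1], reverse=True))

for name, share in class_share.items():
    print(f"{name:<10} {share:.1%}")
```

When one class dominates, plain accuracy can look excellent even for a model that ignores the rare classes, so a per-class breakdown is worth keeping next to the accuracy figure.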

In [7]:
D["feature_names"]

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate']

In [33]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2

In [9]:
import gzip
import shutil

gz_file_path = '/content/kddcup.data.gz'

extracted_file_path = '/content/kddcup.data'

# Unzip the file
with gzip.open(gz_file_path, 'rb') as gz_file:
    with open(extracted_file_path, 'wb') as extracted_file:
        shutil.copyfileobj(gz_file, extracted_file)

print("File has been unzipped to:", extracted_file_path)


File has been unzipped to: /content/kddcup.data


In [11]:
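import pandas as pd

# Read only the first 50,000 rows to keep the runtime manageable.
# Note: this prefix of the file contains only 4 of the 23 classes
# (see the unique labels printed under Question 1).
n_rows = 50000
data = pd.read_csv('/content/kddcup.data', header=None, nrows=n_rows)
data.head()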

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41
0,0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,162,4528,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1,1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,236,1228,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2,2,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,233,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3,3,1.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,4,4,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,normal.


In [10]:
#df = data.sample(n=5000, random_state=42)

In [12]:
print(data.shape)

(50000, 42)


In [13]:
pd.set_option('display.max_columns', None)  # Display all columns
print(data.head())

   0    1     2   3    4      5   6   7   8   9   10  11  12  13  14  15  16  \
0   0  tcp  http  SF  215  45076   0   0   0   0   0   1   0   0   0   0   0   
1   0  tcp  http  SF  162   4528   0   0   0   0   0   1   0   0   0   0   0   
2   0  tcp  http  SF  236   1228   0   0   0   0   0   1   0   0   0   0   0   
3   0  tcp  http  SF  233   2032   0   0   0   0   0   1   0   0   0   0   0   
4   0  tcp  http  SF  239    486   0   0   0   0   0   1   0   0   0   0   0   

   17  18  19  20  21  22  23   24   25   26   27   28   29   30  31  32   33  \
0   0   0   0   0   0   1   1  0.0  0.0  0.0  0.0  1.0  0.0  0.0   0   0  0.0   
1   0   0   0   0   0   2   2  0.0  0.0  0.0  0.0  1.0  0.0  0.0   1   1  1.0   
2   0   0   0   0   0   1   1  0.0  0.0  0.0  0.0  1.0  0.0  0.0   2   2  1.0   
3   0   0   0   0   0   2   2  0.0  0.0  0.0  0.0  1.0  0.0  0.0   3   3  1.0   
4   0   0   0   0   0   3   3  0.0  0.0  0.0  0.0  1.0  0.0  0.0   4   4  1.0   

    34    35   36   37   38   39

### Question 1

In [14]:
print(data.iloc[:, -1].unique())  # Check unique values of the last column


['normal.' 'buffer_overflow.' 'loadmodule.' 'perl.']
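Only four labels show up because `nrows` reads a prefix of the file. One way to get a subsample that spans the whole file without loading all of it is to pass a callable to `skiprows`. The sketch below is only an illustration under assumptions: `sample_csv_rows` is a hypothetical helper, and the tiny in-memory CSV stands in for `kddcup.data`.

```python
import io
import numpy as np
import pandas as pd

def sample_csv_rows(source, keep_fraction, seed=0):
    """Read roughly `keep_fraction` of the rows of a header-less CSV, chosen at random."""
    rng = np.random.default_rng(seed)
    # skiprows accepts a callable: it receives the row index and returns True to skip that row
    return pd.read_csv(source, header=None, skiprows=lambda i: rng.random() > keep_fraction)

# Hypothetical miniature file: 500 'normal.' rows followed by 500 'smurf.' rows
rows = [f"{i},normal." for i in range(500)] + [f"{i},smurf." for i in range(500, 1000)]
csv_text = "\n".join(rows)

prefix = pd.read_csv(io.StringIO(csv_text), header=None, nrows=100)   # first rows only
sample = sample_csv_rows(io.StringIO(csv_text), keep_fraction=0.1)    # spread over the file

print("prefix labels:", prefix[1].unique())
print("sample labels:", sample[1].unique())
```

The prefix sees a single class, while the random sample sees both. On the real file the same idea would trade a longer read for broader class coverage.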


In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [16]:
# Encode columns 1, 2, 3 with One-Hot Encoding
data = pd.get_dummies(data, columns=[1, 2, 3], drop_first=True)

# Encode the target column (column 41) with Label Encoding
label_encoder = LabelEncoder()
data[41] = label_encoder.fit_transform(data[41])

# Save label mappings for reference
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Label Mappings:", label_mapping)

Label Mappings: {'buffer_overflow.': 0, 'loadmodule.': 1, 'normal.': 2, 'perl.': 3}


In [17]:
X = data.drop(columns=[41])  # Features
y = data[41]                # Target

# Split into 70% train, 15% validation, and 15% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Train size:", X_train.shape)
print("Validation size:", X_val.shape)
print("Test size:", X_test.shape)

Train size: (35000, 57)
Validation size: (7500, 57)
Test size: (7500, 57)
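Because the classes are very unevenly represented, a plain random split can leave a rare class out of the validation or test set entirely. A possible variation is to pass `stratify` to `train_test_split`. The sketch below uses synthetic data (`X_demo`, `y_demo` are made up) and assumes every class has enough members; classes with a single example would need to be merged or dropped first, since stratification cannot split them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 10% of them in the minority class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 900 + [1] * 100)

# 70% train, then split the remaining 30% evenly into validation and test,
# keeping the class proportions the same in every part
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: {len(part)} rows, minority share {part.mean():.3f}")
```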


In [18]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [19]:
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

X_train shape: (35000, 57), y_train shape: (35000,)
X_val shape: (7500, 57), y_val shape: (7500,)
X_test shape: (7500, 57), y_test shape: (7500,)


In [20]:
# Ensure column names are consistent
X_train.columns = X_train.columns.astype(str)
X_val.columns = X_val.columns.astype(str)
X_test.columns = X_test.columns.astype(str)

# Train the DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=10, min_samples_split=2)
clf.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print(f"DecisionTreeClassifier trained. Validation Accuracy: {val_accuracy}")


DecisionTreeClassifier trained. Validation Accuracy: 0.9998666666666667


In [21]:
# Support Vector Machine (SVM)
clf_svm = SVC(probability=True)
clf_svm.fit(X_train, y_train)
svm_val_accuracy = accuracy_score(y_val, clf_svm.predict(X_val))
print(f"SVM trained. Validation Accuracy: {svm_val_accuracy}")


SVM trained. Validation Accuracy: 0.9998666666666667


In [22]:
# RandomForestClassifier
clf_rf = RandomForestClassifier(n_estimators=100, max_depth=10)
clf_rf.fit(X_train, y_train)
rf_val_accuracy = accuracy_score(y_val, clf_rf.predict(X_val))
print(f"RandomForestClassifier trained. Validation Accuracy: {rf_val_accuracy}")

RandomForestClassifier trained. Validation Accuracy: 0.9998666666666667


In [24]:
# Define hyperparameter grids with reduced parameters
param_grids = {
    "DecisionTree": {"max_depth": [10, 20], "min_samples_split": [2, 5]},
    "SVM": {"C": [0.1, 1], "kernel": ["linear"]},
    "RandomForest": {"n_estimators": [50, 100], "max_depth": [10, 20]}
}

best_models = {}

# Hyperparameter tuning for each classifier
for model_name, params in param_grids.items():
    if model_name == "DecisionTree":
        model = DecisionTreeClassifier()
    elif model_name == "SVM":
        model = SVC(probability=True)
    elif model_name == "RandomForest":
        model = RandomForestClassifier()

    grid = GridSearchCV(model, params, cv=2, scoring="accuracy", n_jobs=-1)  # Parallel and 2-fold cross-validation
    grid.fit(X_train, y_train)

    # Store best model
    best_models[model_name] = grid.best_estimator_
    print(f"{model_name} best params: {grid.best_params_}")

# Evaluate best models on test set
for model_name, model in best_models.items():
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{model_name} Test Accuracy: {test_accuracy}")




DecisionTree best params: {'max_depth': 10, 'min_samples_split': 2}




SVM best params: {'C': 0.1, 'kernel': 'linear'}




RandomForest best params: {'max_depth': 10, 'n_estimators': 50}
DecisionTree Test Accuracy: 0.9998666666666667
SVM Test Accuracy: 0.9997333333333334
RandomForest Test Accuracy: 0.9998666666666667
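The search above uses 2-fold cross-validation on the training set, so the validation split created earlier is not involved in choosing hyperparameters. If the held-out validation set should drive the selection instead, one option is `PredefinedSplit`. This is a sketch on synthetic data, not the notebook's method; `X_demo`, `y_demo`, and the grid values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train and validation splits
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# -1 marks rows that are always used for training; 0 marks the single validation fold
test_fold = np.concatenate([np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)])
split = PredefinedSplit(test_fold)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 10], "min_samples_split": [2, 5]},
    cv=split,
    scoring="accuracy",
    refit=False,  # only rank the candidates; no refit on train + validation
)
grid.fit(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]))

print("best params:", grid.best_params_)
print("validation accuracy:", grid.best_score_)
```

With this setup each candidate is trained on the training rows and scored once on the validation rows, which keeps the test set untouched until the final comparison.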


In [28]:
from sklearn.ensemble import BaggingClassifier

# Use the best model (RandomForest) from earlier
base_model = best_models["RandomForest"]

# Initialize BaggingClassifier with the RandomForest model
ensemble = BaggingClassifier(estimator=base_model, n_estimators=25, random_state=42)
ensemble.fit(X_train, y_train)

# Evaluate on test set
ensemble_preds = ensemble.predict(X_test)
ensemble_accuracy = accuracy_score(y_test, ensemble_preds)
print(f"Ensemble Test Accuracy: {ensemble_accuracy}")

# Identify the top and bottom 10% of predictions by uncertainty
probs = ensemble.predict_proba(X_test)  # Class probabilities averaged over the ensemble
confidence = probs.max(axis=1)  # Confidence = highest class probability; uncertainty = 1 - confidence
sorted_indices = confidence.argsort()  # Ascending confidence, i.e. most uncertain first
n_tail = len(sorted_indices) // 10

# Top 10% = most uncertain predictions; bottom 10% = most confident predictions
# (indices are row positions within X_test)
top_10_percent_indices = sorted_indices[:n_tail]
bottom_10_percent_indices = sorted_indices[-n_tail:]

# Optionally, print or inspect the top and bottom predictions
print(f"Top 10% Uncertainty Indices: {top_10_percent_indices}")
print(f"Bottom 10% Uncertainty Indices: {bottom_10_percent_indices}")


Ensemble Test Accuracy: 0.9998666666666667
Top 10% Uncertainty Indices: [4614 2288 1419 7428 1101 5592 1724 2140 5918 3684  749 5644  431 6821
 7166 1946 5488 6152 2319 7199 7193 5022 5000 4999 4998 5030 4997 5031
 4996 4995 4994 4993 4992 4991 4990 5032 4989 4988 4987 5033 5034 5029
 5001 5003 5023 5021 5020 5019 5018 5017 5024 5016 5015 5014 5013 5002
 5012 5026 5027 5011 5010 5009 5007 5028 5006 5005 5004 5025 5008    0
 4985 4956 4955 4954 4953 4952 4951 4950 4949 4948 4947 4946 4945 4944
 4943 4942 4941 4940 4939 4938 4937 4936 4935 4934 4933 4932 4957 4986
 4958 4960 4984 4983 5035 4982 4981 4980 4979 4978 4977 4976 4975 4974
 4973 4972 4971 4970 4969 4968 4967 4966 4965 4964 4963 4962 4961 4959
 5036 5042 5038 5122 5121 5120 5119 5118 5117 5116 5115 5114 5113 5112
 5111 5110 5109 5108 5107 5106 5105 5104 5103 5102 5101 5100 5099 5098
 5123 5097 5124 5126 5151 5150 5149 5148 5147 5146 5145 5144 5143 5142
 5141 5140 5139 5138 5137 5136 5135 5134 5133 5132 5131 5130 5129 5128
 5127
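The maximum class probability is one way to rank predictions by uncertainty. Another common choice is the entropy of the predicted distribution, which also reflects how the remaining probability is spread over the other classes. The sketch below is an assumption-laden illustration: `predictive_entropy` is a hypothetical helper and `probs_demo` is a hand-written stand-in for the output of `predict_proba`.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each row of class probabilities (higher = more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

# Hypothetical predict_proba output for 10 samples and 3 classes
probs_demo = np.array([
    [1.00, 0.000, 0.000],
    [0.34, 0.330, 0.330],
    [0.90, 0.050, 0.050],
    [0.80, 0.100, 0.100],
    [0.70, 0.200, 0.100],
    [0.60, 0.300, 0.100],
    [0.50, 0.300, 0.200],
    [0.95, 0.030, 0.020],
    [0.85, 0.100, 0.050],
    [0.99, 0.005, 0.005],
])

entropy = predictive_entropy(probs_demo)
order = np.argsort(-entropy)          # most uncertain first
n_tail = len(order) // 10             # size of each 10% tail

most_uncertain = order[:n_tail]
most_confident = order[-n_tail:]
print("most uncertain rows:", most_uncertain)
print("most confident rows:", most_confident)
```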

In [29]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE

# Mutual Information
selector_mi = SelectKBest(mutual_info_classif, k=10)
X_train_mi = selector_mi.fit_transform(X_train, y_train)
X_val_mi = selector_mi.transform(X_val)
X_test_mi = selector_mi.transform(X_test)

# Recursive Feature Elimination (RFE)
selector_rfe = RFE(RandomForestClassifier(), n_features_to_select=10, step=1)
selector_rfe.fit(X_train, y_train)

X_train_rfe = selector_rfe.transform(X_train)
X_val_rfe = selector_rfe.transform(X_val)
X_test_rfe = selector_rfe.transform(X_test)

# Retrain a fresh copy of each tuned model on each feature subset,
# so the fitted estimators in best_models are left untouched
from sklearn.base import clone

for model_name, model in best_models.items():
    model_mi = clone(model).fit(X_train_mi, y_train)
    mi_accuracy = accuracy_score(y_test, model_mi.predict(X_test_mi))
    print(f"{model_name} with Mutual Information Test Accuracy: {mi_accuracy}")

    model_rfe = clone(model).fit(X_train_rfe, y_train)
    rfe_accuracy = accuracy_score(y_test, model_rfe.predict(X_test_rfe))
    print(f"{model_name} with RFE Test Accuracy: {rfe_accuracy}")


DecisionTree with Mutual Information Test Accuracy: 0.9998666666666667
DecisionTree with RFE Test Accuracy: 0.9998666666666667
SVM with Mutual Information Test Accuracy: 0.9998666666666667
SVM with RFE Test Accuracy: 0.9998666666666667
RandomForest with Mutual Information Test Accuracy: 0.9998666666666667
RandomForest with RFE Test Accuracy: 0.9998666666666667
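The accuracies alone do not show which features each selector kept. Both `SelectKBest` and `RFE` expose `get_support()`, which returns a boolean mask over the input columns. The sketch below shows the idea on synthetic data; it swaps in `f_classif` as the scoring function so the result is deterministic, and the frame `X_demo` with its `noise_a`/`noise_b` columns is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in: two columns depend on the label, two are pure noise
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
X_demo = pd.DataFrame({
    "src_bytes": y_demo * 5.0 + rng.normal(size=300),  # informative by construction
    "count": y_demo * 3.0 + rng.normal(size=300),      # informative by construction
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})

selector = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)

# get_support() is a boolean mask aligned with the input columns
selected = list(X_demo.columns[selector.get_support()])
print("selected features:", selected)
```

Applying the same mask lookup to the two selectors in the cell above would make it possible to compare how much their top-10 lists overlap.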


In [30]:
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

clustering_algorithms = {
    "KMeans": KMeans(n_clusters=10, random_state=42),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=10),
    "Agglomerative": AgglomerativeClustering(n_clusters=10)
}

# The feature matrix carries no labels; y_test is used only to score the clusters afterwards
X_no_labels = X_test.copy()

# Apply clustering
for name, algo in clustering_algorithms.items():
    clusters = algo.fit_predict(X_no_labels)
    score = adjusted_rand_score(y_test, clusters)
    print(f"{name} ARI Score: {score}")


KMeans ARI Score: 9.356903802840253e-05
DBSCAN ARI Score: 0.0
Agglomerative ARI Score: 2.1782863763023507e-07
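ARI values this close to zero mean the clusters carry almost no information about the labels. One thing worth checking is feature scale: distance-based methods such as KMeans and DBSCAN are dominated by columns with large ranges, such as byte counts. The sketch below uses synthetic blobs to illustrate the effect of standardizing first, and a cross-tabulation to see which cluster maps to which class. All data here are made up; it does not claim that scaling alone fixes the result on the KDD data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic classes
X_demo, y_demo = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [-6, 6]],
                            cluster_std=0.8, random_state=0)
X_demo[:, 1] *= 1000.0  # give one feature a much larger scale, like a byte count

def kmeans_labels(features):
    """Cluster into 3 groups with a fixed seed so the comparison is repeatable."""
    return KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

clusters_raw = kmeans_labels(X_demo)
clusters_scaled = kmeans_labels(StandardScaler().fit_transform(X_demo))

ari_raw = adjusted_rand_score(y_demo, clusters_raw)
ari_scaled = adjusted_rand_score(y_demo, clusters_scaled)
print(f"ARI without scaling: {ari_raw:.3f}")
print(f"ARI with scaling:    {ari_scaled:.3f}")

# Rows are true classes, columns are cluster ids
print(pd.crosstab(y_demo, clusters_scaled, rownames=["class"], colnames=["cluster"]))
```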


In [31]:
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

anomaly_detectors = {
    "IsolationForest": IsolationForest(random_state=42),
    "OneClassSVM": OneClassSVM(),
    "LocalOutlierFactor": LocalOutlierFactor(novelty=True)
}


for name, detector in anomaly_detectors.items():
    detector.fit(X_train)
    predictions = detector.predict(X_test)
    print(f"{name} predictions: {predictions[:10]}")


IsolationForest predictions: [1 1 1 1 1 1 1 1 1 1]
OneClassSVM predictions: [ 1  1  1  1 -1 -1  1 -1  1 -1]




LocalOutlierFactor predictions: [ 1  1  1  1  1 -1  1  1  1  1]
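Printing the first ten predictions does not yet compare the detectors. A comparison needs ground truth (normal versus attack) and a metric, and for imbalanced anomaly data ROC AUC on the continuous anomaly score is a common choice. The exercise's "SA" subset can be loaded with `fetch_kddcup99(subset="SA")`; the sketch below avoids the download and uses synthetic data instead, so `X_fit`, `X_eval`, and `is_attack` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_fit = rng.normal(size=(300, 4))                        # training data: normal points only
X_eval = np.vstack([rng.normal(size=(190, 4)),           # held-out normal points
                    rng.normal(loc=6.0, size=(10, 4))])  # held-out anomalies, shifted away
is_attack = np.array([0] * 190 + [1] * 10)               # 1 = anomaly

detectors = {
    "IsolationForest": IsolationForest(random_state=42),
    "OneClassSVM": OneClassSVM(),
    "LocalOutlierFactor": LocalOutlierFactor(novelty=True),
}

auc = {}
for name, detector in detectors.items():
    detector.fit(X_fit)
    # decision_function is larger for inliers, so negate it to get an anomaly score
    anomaly_score = -detector.decision_function(X_eval)
    auc[name] = roc_auc_score(is_attack, anomaly_score)
    print(f"{name}: ROC AUC = {auc[name]:.3f}")
```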


In [35]:
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

In [36]:
X_array = X.to_numpy() if hasattr(X, "to_numpy") else X
y_array = y.to_numpy() if hasattr(y, "to_numpy") else y

# 250-point sample
np.random.seed(42)
sample_indices = np.random.choice(len(X_array), size=250, replace=False)
X_sample = X_array[sample_indices]
y_sample = y_array[sample_indices]


anomaly_detectors = {
    "IsolationForest": IsolationForest(random_state=42),
    "OneClassSVM": OneClassSVM(),
    "LocalOutlierFactor": LocalOutlierFactor(novelty=True)
}


loo = LeaveOneOut()
results = {}

for name, model in anomaly_detectors.items():
    correct = 0
    for train_idx, test_idx in loo.split(X_sample):
        # Use local names so the earlier X_train/X_test splits are not overwritten
        X_fit, X_holdout = X_sample[train_idx], X_sample[test_idx]

        # All three detectors share the same fit/predict interface
        model.fit(X_fit)
        pred = model.predict(X_holdout)

        # NOTE: pred is +1 (inlier) or -1 (outlier), whereas y_sample holds
        # label-encoded class ids (0-3), so this equality check rarely succeeds
        correct += int(pred[0] == y_sample[test_idx][0])

    accuracy = correct / len(X_sample)
    results[name] = accuracy


print("Anomaly Detection with LOOCV Results:")
for name, acc in results.items():
    print(f"{name}: {acc:.4f}")


Anomaly Detection with LOOCV Results:
IsolationForest: 0.0000
OneClassSVM: 0.0000
LocalOutlierFactor: 0.0000
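A score of exactly 0.0000 for every detector points at the comparison rather than the models: `predict` returns +1 or -1, while the labels are encoded class ids, with `normal.` mapped to 2 in the label mapping printed earlier. A possible correction is to map the labels to the same +1/-1 convention first. The sketch below shows this on a small synthetic sample with one detector; `to_inlier_labels` is a hypothetical helper, and the data are made up.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import LocalOutlierFactor

def to_inlier_labels(y_encoded, normal_id):
    """Map encoded class ids to the detector convention: +1 for normal, -1 for any attack."""
    return np.where(np.asarray(y_encoded) == normal_id, 1, -1)

# Synthetic sample: 45 normal points near the origin, 5 anomalies on a circle of radius 8
rng = np.random.default_rng(0)
angles = np.linspace(0, 2 * np.pi, 5, endpoint=False)
X_demo = np.vstack([rng.normal(size=(45, 2)),
                    8.0 * np.column_stack([np.cos(angles), np.sin(angles)])])
y_encoded = np.array([2] * 45 + [0] * 5)  # 2 plays the role of 'normal.', 0 of an attack class
y_true = to_inlier_labels(y_encoded, normal_id=2)

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X_demo):
    # Fit on all points but one, then classify the held-out point
    model = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(X_demo[train_idx])
    pred = model.predict(X_demo[test_idx])
    correct += int(pred[0] == y_true[test_idx][0])

loo_accuracy = correct / len(X_demo)
print(f"LOO accuracy with matching label convention: {loo_accuracy:.4f}")
```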


In [39]:
from sklearn.feature_selection import mutual_info_classif

# Calculate scores
mi_scores = mutual_info_classif(X, y)

top_5_features = np.argsort(mi_scores)[-5:]

X_reduced = X.iloc[:, top_5_features]

print(f"Top 5 Features: {top_5_features}")


Top 5 Features: [38 29  8 56 30]


In [49]:
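# Take the same 250 sampled rows as y_sample (not the first 250 rows), as a NumPy array
X_reduced = X.iloc[sample_indices, top_5_features].to_numpy(dtype=float)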

In [50]:
print(f"Length of X_reduced: {len(X_reduced)}")
print(f"Length of y_sample: {len(y_sample)}")


Length of X_reduced: 250
Length of y_sample: 250


In [51]:
assert len(X_reduced) == len(y_sample), "Lengths must match after feature reduction"

# Leave-one-out evaluation on the reduced feature set
X_reduced_array = np.asarray(X_reduced, dtype=float)  # Positional row indexing needs an array
reduced_results = {}
loo = LeaveOneOut()

for name, model in anomaly_detectors.items():
    correct = 0
    for train_idx, test_idx in loo.split(X_reduced_array):
        X_fit, X_holdout = X_reduced_array[train_idx], X_reduced_array[test_idx]
        y_holdout = y_sample[test_idx]

        # All three detectors share the same fit/predict interface
        model.fit(X_fit)
        pred = model.predict(X_holdout)

        # NOTE: pred is +1 (inlier) or -1 (outlier), while y_holdout is a
        # label-encoded class id (0-3), so this check rarely succeeds
        correct += int(pred[0] == y_holdout[0])

    accuracy = correct / len(X_reduced_array)
    reduced_results[name] = accuracy

# Display results
print("Reduced Feature Anomaly Detection Results:")
for name, acc in reduced_results.items():
    print(f"{name}: {acc:.4f}")


Reduced Feature Anomaly Detection Results:
IsolationForest: 0.0000
OneClassSVM: 0.0000
LocalOutlierFactor: 0.0000


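The anomaly detection models scored 0.0000 accuracy both with all features and with the top 5 features. This is an artifact of the evaluation rather than evidence about the models: the detectors predict +1 (inlier) or -1 (outlier), while `y_sample` holds label-encoded class ids (0 to 3, with `normal.` mapped to 2), so the equality check almost never succeeds. To get a meaningful comparison, map the labels to +1 for `normal.` and -1 for every attack class before comparing, and prefer a metric suited to imbalanced data, such as ROC AUC on the anomaly scores, over plain accuracy. Note also that these cells run on a sample of the first 50,000 rows of the full file rather than on the "SA" subset the exercise asks for.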