### What You're Aiming For

- In this checkpoint, we work with the 'Microsoft Malware' dataset, provided by Kaggle as part of the Microsoft Malware Prediction competition. The checkpoint covers the main steps of both supervised and unsupervised machine learning.

- Dataset description: This dataset is a simplified version of the original, reduced for learning purposes. It describes a set of machines running Microsoft Windows. The goal of this exercise is to predict the probability that a Windows machine gets infected by various families of malware, based on properties of that machine.

### Feature/Variable - Explanation
- Wdft_IsGamer = Indicates whether the device is a gamer device or not, based on its hardware combination.
- Census_IsVirtualDevice = Identifies whether the device is a virtual machine (as flagged by a machine learning model).
- Census_OSEdition = Edition of the current OS.
- Census_HasOpticalDiskDrive = True indicates that the machine has an optical disk drive (CD/DVD).
- Firewall = True (1) for Windows 8.1 and above if Windows Firewall is enabled, as reported by the service.
- SMode = True when the device is known to be in 'S Mode' (Windows 10 S mode), where only Microsoft Store apps can be installed.
- IsProtected = A calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active but is not receiving the latest updates.
- OsPlatformSubRelease = The OS platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2).
- CountryIdentifier = ID of the country the machine is located in.

#### Instructions

- Part 1: supervised learning

    - Import your data and perform a basic data exploration
    - Display general information about the dataset
    - Create a pandas profiling report to gain insights into the dataset
    - Handle missing and corrupted values
    - Remove duplicates, if they exist
    - Handle outliers, if they exist
    - Encode categorical features
    - Prepare your dataset for the modelling phase
    - Apply a decision tree and plot its ROC curve
    - Try to improve your model performance by changing the model hyperparameters

- Part 2: unsupervised learning

    - Drop the target variable
    - Apply K-means clustering and plot the clusters
    - Find the optimal K parameter
    - Interpret the results

### 1. Supervised Learning Using Decision Tree

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
#import linear algebra and data manipulation libraries
import numpy as np
import pandas as pd

#import standard visualization
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split #split
from sklearn.metrics import accuracy_score #metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# import the label Encoder library 
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

from sklearn.preprocessing import StandardScaler

#tools for hyperparameters search
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [None]:
data = pd.read_csv("Microsoft_malware_dataset_min.csv")
data

In [None]:
data.describe().T

In [None]:
data.info()

In [None]:
data.isnull().sum()
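
The instructions ask for a profiling report, which the cells above approximate with `describe`, `info`, and `isnull`. As a minimal sketch, assuming only pandas is available, the helper below gathers the dtype, missing share, and cardinality of each column into one table. `quick_profile` is a hypothetical name, not a library function, and the toy frame merely stands in for `data`. Dedicated tools such as `ydata-profiling` produce much richer reports.

```python
import numpy as np
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each column: dtype, percentage of missing values, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isnull().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })


# Toy stand-in for the malware dataset (values are illustrative only)
toy = pd.DataFrame({
    "Wdft_IsGamer": [0.0, 1.0, np.nan, 0.0],
    "Census_OSEdition": ["Core", "Professional", "Core", None],
    "HasDetections": [0, 1, 1, 0],
})

profile = quick_profile(toy)
print(profile)
```

On the real dataset, `quick_profile(data)` would show at a glance which columns need imputation and which are candidates for encoding.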

In [None]:
data["Wdft_IsGamer"].value_counts()

In [None]:
data["Census_IsVirtualDevice"].value_counts()

In [None]:
data["SMode"].value_counts()

In [None]:
data["Firewall"].value_counts()

In [None]:
numerical_features = data.select_dtypes(include='number').columns
numerical_features

In [None]:
plt.figure(figsize=(15, 7.5))
correlation_matrix = data[numerical_features].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

In [None]:
data = data.dropna()

In [None]:
data = data.drop("Firewall", axis = 1)
#data = data.drop("Census_HasOpticalDiskDrive", axis = 1)
#data = data.drop("SMode", axis = 1)
data = data.drop("CountryIdentifier", axis = 1)

In [None]:
# Imputation: fill any remaining gaps with the column mode.
# Rows with missing values were dropped above, so this acts only as a safeguard.
for col in ['Wdft_IsGamer', 'Census_IsVirtualDevice', 'IsProtected']:
    data[col] = data[col].fillna(data[col].mode()[0])

In [None]:
data.shape

In [None]:
data = data.drop_duplicates()

In [None]:
categorical_features = data.select_dtypes(include='object').columns
categorical_features

In [None]:
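# Recompute the numeric columns, since some were dropped from `data` above
numerical_features = data.select_dtypes(include='number').columns

# One box plot per remaining numeric column
plt.figure(figsize=(25, 25))
for i, col in enumerate(numerical_features):
    plt.subplot(8, 2, i + 1)
    sns.boxplot(x=data[col])
    plt.title(col, fontsize=30)
    plt.xlabel(' ')
plt.tight_layout()
plt.show()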

In [None]:
data.info()

In [None]:
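# Removing outliers
from scipy.stats import zscore

# Numeric feature columns still present in `data` (the target is excluded)
numeric_cols = data.select_dtypes(include='number').columns.drop('HasDetections', errors='ignore')

# Calculate Z-scores for all numerical columns
z_scores = data[numeric_cols].apply(zscore)

# Set the Z-score threshold for detecting outliers (commonly 3)
threshold = 3

# Keep rows whose |Z-score| is below the threshold in every numeric column.
# Constant columns give NaN Z-scores and are treated as "no outlier".
# Note: the minority value of a rare 0/1 flag can also exceed the threshold.
mask = (z_scores.abs().fillna(0) < threshold).all(axis=1)
data_no_outliers = data[mask].copy()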

# Print the shape of the DataFrame before and after removing outliers
print("Original shape:", data.shape)
print("Shape after removing outliers:", data_no_outliers.shape)
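
Z-score filtering assumes roughly continuous values. In a rare 0/1 flag, the minority value can sit more than three standard deviations from the mean and be discarded as an outlier. A minimal sketch, assuming you prefer to filter only columns with more than two distinct values, is given below. The helper `drop_zscore_outliers` and the toy columns are hypothetical. The Z-score is computed by hand with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd


def drop_zscore_outliers(df, columns, threshold=3.0):
    """Return the rows whose |z| is below the threshold in every listed column."""
    z = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
    keep = (z.abs() < threshold).all(axis=1)
    return df[keep].copy()


rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "measurement": np.append(rng.normal(0, 1, 199), 50.0),  # one extreme value
    "rare_flag": [1] * 2 + [0] * 198,                        # rare binary flag
})

# Treat only columns with more than two distinct values as continuous
continuous = [c for c in toy.columns if toy[c].nunique() > 2]
cleaned = drop_zscore_outliers(toy, continuous)

print("Filtered columns:", continuous)
print("Shape before:", toy.shape, "after:", cleaned.shape)
```

With this choice, the extreme `measurement` row is removed while the rows carrying the rare flag are kept.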

In [None]:
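# Initialize Label Encoder
label_encoder = LabelEncoder()

# Work on an explicit copy so the assignments below do not write to a slice of `data`
data_no_outliers = data_no_outliers.copy()

# Apply label encoding to categorical columns (the encoder is refitted for each column)
for col in categorical_features:
    data_no_outliers[col] = label_encoder.fit_transform(data_no_outliers[col])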

# Check the encoded dataset
data_no_outliers.head()
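
Because a single `LabelEncoder` is refitted for every column, only the mapping of the last column survives the loop. If the original category names are needed later (for example, to label plots), one option is to keep one encoder per column. The sketch below assumes a toy frame with illustrative values, and the `encoders` dictionary is a hypothetical helper structure.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the two categorical columns (values are illustrative only)
toy = pd.DataFrame({
    "Census_OSEdition": ["Core", "Professional", "Core"],
    "OsPlatformSubRelease": ["th1", "th2", "th1"],
})

encoders = {}          # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Recover the original labels from the integer codes
decoded = encoders["Census_OSEdition"].inverse_transform(encoded["Census_OSEdition"])
print(encoded)
print(list(decoded))
```

`LabelEncoder` assigns codes in sorted order of the class labels, so the integers carry no meaningful ranking for the model.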

In [None]:
data = data_no_outliers
data

In [None]:
X = data.drop(columns=["HasDetections"])
y = data["HasDetections"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.impute import SimpleImputer

# Impute on the feature splits: fit on the training set only, then reuse on the test set
X_train, X_test = X_train.copy(), X_test.copy()

# Label-encoded categorical columns use the most frequent value
cat_cols = ['Census_OSEdition', 'OsPlatformSubRelease']
imputer_cat = SimpleImputer(strategy='most_frequent')
X_train[cat_cols] = imputer_cat.fit_transform(X_train[cat_cols])
X_test[cat_cols] = imputer_cat.transform(X_test[cat_cols])

# Remaining numerical columns use the median
num_cols = [c for c in X_train.columns if c not in cat_cols]
imputer_num = SimpleImputer(strategy='median')
X_train[num_cols] = imputer_num.fit_transform(X_train[num_cols])
X_test[num_cols] = imputer_num.transform(X_test[num_cols])

In [None]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data
X_test_scaled = scaler.transform(X_test)

In [None]:
# Train a decision tree on the scaled features from the previous cell
tree_clf = DecisionTreeClassifier(random_state=42)

tree_clf.fit(X_train_scaled, y_train)
y_pred_tree = tree_clf.predict(X_test_scaled)
accuracy_tree = accuracy_score(y_test, y_pred_tree)
conf_matrix_tree = confusion_matrix(y_test, y_pred_tree)
class_report_tree = classification_report(y_test, y_pred_tree)

print("Decision Tree Classifier:")
print(f"Accuracy: {accuracy_tree:.4f}")
print("Confusion Matrix:")
print(conf_matrix_tree)
print("Classification Report:")
print(class_report_tree)
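
Beyond accuracy, it can help to see which features the tree actually relies on. The sketch below uses synthetic data from `make_classification` as a stand-in for the malware features (an assumption, as are the `feature_i` names) and reads the fitted tree's `feature_importances_` attribute.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 5 features, of which 2 are informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

demo_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(
    demo_tree.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

With the notebook's own model, the equivalent would be `pd.Series(tree_clf.feature_importances_, index=X.columns)`.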

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

# Predict probabilities on the scaled test set (the tree was trained on scaled features)
y_probs = tree_clf.predict_proba(X_test_scaled)[:, 1]

# Calculate ROC AUC
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
roc_auc = roc_auc_score(y_test, y_probs)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, label=f'Decision Tree (AUC = {roc_auc:.2f})')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
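
The instructions call for tuning the tree's hyperparameters. One way to approach it, sketched here on synthetic data from `make_classification` (an assumption, standing in for `X_train_scaled` and `y_train`), is a grid search with stratified cross-validation scored by ROC AUC. The grid values are illustrative starting points, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared malware features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Illustrative grid: limiting depth and leaf size counters overfitting
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X_tr, y_tr)

# Evaluate the refitted best model on the held-out split
test_auc = roc_auc_score(y_te, search.best_estimator_.predict_proba(X_te)[:, 1])
print("Best parameters:", search.best_params_)
print(f"Cross-validated AUC: {search.best_score_:.3f}, test AUC: {test_auc:.3f}")
```

For larger grids, `RandomizedSearchCV` samples a fixed number of combinations instead of trying them all, which keeps the runtime bounded.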

### 2. Unsupervised Learning Using K-means Clustering

In [None]:
# Drop the target variable 'HasDetections'
X_unsupervised = data.drop('HasDetections', axis=1)

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

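# Standardize the features so that no single column dominates the distances
X_unsup_scaled = StandardScaler().fit_transform(X_unsupervised)

# Reduce dimensions using PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_unsup_scaled)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # Start with an arbitrary k=3
kmeans.fit(X_pca)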

# Plot the clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_, cmap='viridis', marker='o')
plt.title('K-Means Clusters (k=3)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

In [None]:
# Use the elbow method to find the optimal number of clusters.
# The search runs on X_pca, the same representation used for the cluster plots.
inertia = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_pca)
    inertia.append(kmeans.inertia_)

# Plot inertia vs. number of clusters
plt.figure()
plt.plot(K_range, inertia, 'bx-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method For Optimal K')
plt.show()
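
The elbow can be hard to read off by eye. A complementary check, sketched here on synthetic blobs from `make_blobs` (an assumption, standing in for the PCA-reduced data), is the silhouette score, which is higher when clusters are compact and well separated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated blobs in two dimensions
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# The silhouette score needs at least two clusters, so start at k=2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette:", best_k)
```

On large datasets the score is expensive to compute. `silhouette_score` accepts a `sample_size` argument to evaluate it on a random subset.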

In [None]:
# Re-apply K-means with the K read off the elbow plot (K=2 here; adjust to match your plot)
kmeans_optimal = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans_optimal.fit(X_pca)

# Plot the optimized clusters
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_optimal.labels_, cmap='viridis', marker='o')
plt.title('K-Means Clusters (Optimal K)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
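
To interpret the clusters, one simple approach is to compare their sizes, their average feature values, and how the held-out target distributes across them. The sketch below uses synthetic data from `make_blobs`. The column names `f1` to `f3` and the `target` column (standing in for `HasDetections`) are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in: two groups described by three features
X_demo, y_demo = make_blobs(n_samples=200, centers=2, n_features=3, random_state=42)
features = ["f1", "f2", "f3"]
demo = pd.DataFrame(X_demo, columns=features)

# Cluster on the features only, then attach the labels
demo["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(demo[features])

# How many points fall in each cluster, and what an average member looks like
cluster_sizes = demo["cluster"].value_counts().sort_index()
cluster_profile = demo.groupby("cluster")[features].mean()

# Bring the target back only for interpretation, never for fitting
demo["target"] = y_demo
target_rate = demo.groupby("cluster")["target"].mean()

print(cluster_sizes)
print(cluster_profile)
print(target_rate)
```

If the target rate differs strongly between clusters, the unsupervised grouping captures structure related to infections. Similar rates suggest the clusters reflect other machine properties.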