### Model Preparation for Flask Deployment

**Overview:**

**1. Data Preprocessing:**
   - **Outlier Detection**: Flags outliers with both the Z-Score and IQR methods and removes only the rows flagged by both.
   - **Feature Engineering**: Creates new features such as `total_bytes`, `byte_ratio`, and combined error rates while dropping irrelevant columns.
   - **Categorical Encoding**: Encodes categorical features with `CatBoostEncoder` and the target (`attack`) with label encoding.

**2. Data Filtering:**
   - **Class Filtering**: Removes classes with fewer samples than a specified threshold to ensure sufficient representation.

**3. Data Splitting and Resampling:**
   - **Train-Test Split**: Divides the data into training and test sets.
   - **SMOTE**: Applies Synthetic Minority Over-sampling Technique (SMOTE) to balance the classes in the training set.

**4. Model Training and Tuning:**
   - **Bagging Classifier**: Utilizes Bagging with a Decision Tree base estimator.
   - **Hyperparameter Tuning**: Performs GridSearchCV to find the best hyperparameters for the Bagging Classifier.
   - **Model Evaluation**: Evaluates the best model based on cross-validation and test accuracy.

**5. Results:**
   - **Best Parameters and Accuracy**: Outputs the optimal parameters and accuracy of the Bagging Classifier.

This prepared model can now be deployed via a Flask API, enabling real-time predictions and integration into applications.
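To make the outlier rule in step 1 concrete: a row is dropped only when the Z-score test and the IQR test both flag it. The sketch below applies the same thresholds (3.5 and 1.5) to a single made-up column using only the standard library. The values are illustrative and not taken from `NAD.csv`.

```python
import statistics

# Made-up sample: mostly 9-13, one mild outlier (16) and one extreme value (500).
values = [10, 11, 12, 10, 11, 12, 10, 11, 12, 11,
          10, 12, 11, 10, 12, 11, 13, 9, 16, 500]

# Z-score test, using the population standard deviation
# (the same convention as scipy.stats.zscore with its default ddof=0).
mean = statistics.fmean(values)
std = statistics.pstdev(values)
z_flags = {i for i, v in enumerate(values) if abs(v - mean) / std > 3.5}

# IQR test; "inclusive" quantiles interpolate linearly between data points.
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
iqr_flags = {i for i, v in enumerate(values)
             if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr}

# Only positions flagged by both tests are removed.
to_drop = z_flags & iqr_flags
print("Z-score flags:", sorted(z_flags))
print("IQR flags:", sorted(iqr_flags))
print("Dropped:", sorted(to_drop))
```

Here the mild outlier (16) is flagged by the IQR test alone and is therefore kept; only the extreme value (500) is dropped. Requiring agreement between the two tests makes the filter more conservative than either test on its own.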

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
import category_encoders as ce
import pickle

# Loading the data
df = pd.read_csv('NAD.csv')

# Selecting only numeric columns
numeric_df = df.select_dtypes(include=[np.number])

# Step 1: Z-Score Method
# Calculating Z-Scores for each column
z_scores = stats.zscore(numeric_df)
z_scores_df = pd.DataFrame(z_scores, columns=numeric_df.columns)
z_threshold = 3.5
# Identifying outliers based on Z-scores
z_outliers = (z_scores_df.abs() > z_threshold)
z_outlier_indices = z_outliers[z_outliers.any(axis=1)].index

# Step 2: IQR Method
# Calculating Q1 (25th percentile) and Q3 (75th percentile)
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
iqr_threshold = 1.5 
iqr_outliers = ((numeric_df < (Q1 - iqr_threshold * IQR)) | (numeric_df > (Q3 + iqr_threshold * IQR)))
iqr_outlier_indices = iqr_outliers[iqr_outliers.any(axis=1)].index

# Finding common outliers
common_outlier_indices = z_outlier_indices.intersection(iqr_outlier_indices)

df_cleaned = df.drop(index=common_outlier_indices, errors='ignore')
df_cleaned.reset_index(drop=True, inplace=True)

# Continuing with the cleaned data
df = df_cleaned

# Byte-based features (+1 in the denominator avoids division by zero)
df['total_bytes'] = df['srcbytes'] + df['dstbytes']
df['byte_ratio'] = df['srcbytes'] / (df['dstbytes'] + 1)
df.drop(columns=['srcbytes', 'dstbytes'], inplace=True)
# Dropping columns treated as irrelevant
df.drop(columns=['wrongfragment', 'urgent'], inplace=True)
# Averaged error rates
df['combined_serror_rerror_rate'] = (df['serrorrate'] + df['rerrorrate']) / 2
df['combined_srv_serror_rerror_rate'] = (df['srvserrorrate'] + df['srvrerrorrate']) / 2
# Service-rate ratios (1e-6 in the denominator avoids division by zero)
df['ratio_samesrvrate_diffsrvrate'] = df['samesrvrate'] / (df['diffsrvrate'] + 1e-6)
df['service_host_distribution_ratio'] = df['samesrvrate'] / (df['srvdiffhostrate'] + 1e-6)
# Destination-host error rates: these two are sums (plus 1e-6), not averages
df['combined_dsthostserrorrate_dsthostrerrorrate'] = df['dsthostserrorrate'] + (df['dsthostrerrorrate'] + 1e-6)
df['combined_dsthostsrvserrorrate_dsthostsrvrerrorrate'] = df['dsthostsrvserrorrate'] + (df['dsthostsrvrerrorrate'] + 1e-6)
# Dropping the source columns that the engineered features replace
df.drop(columns=['serrorrate', 'rerrorrate', 'srvserrorrate', 'srvrerrorrate',
                 'samesrvrate', 'diffsrvrate', 'srvdiffhostrate', 'dsthostserrorrate',
                 'dsthostrerrorrate', 'dsthostsrvserrorrate', 'dsthostsrvrerrorrate'], inplace=True)

label_encoder = LabelEncoder()
df['attack_encoded'] = label_encoder.fit_transform(df['attack'])
df.drop('attack', axis=1, inplace=True)

categorical_columns = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Initializing CatBoostEncoder
catboost_encoder = ce.CatBoostEncoder(cols=categorical_columns[:3])

# Fit and transform the categorical features
df_encoded_features = catboost_encoder.fit_transform(df[categorical_columns[:3]], df['attack_encoded'])
df[categorical_columns[:3]] = df_encoded_features

# Finding classes with fewer samples than the threshold
threshold = 6
class_counts = df['attack_encoded'].value_counts()
classes_to_remove = class_counts[class_counts < threshold].index

# Dropping rows with these classes
df_filtered = df[~df['attack_encoded'].isin(classes_to_remove)]

# Separating features and target
X = df_filtered.drop('attack_encoded', axis=1)
y = df_filtered['attack_encoded']

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

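# Applying SMOTE to the training set only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
X, y = X_resampled, y_resampled
# Note: the grid search below is fit on X_train / y_train, so the results printed
# in this notebook come from the original (non-resampled) training set.
# Pass X_resampled / y_resampled to fit() to train on the balanced data instead.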

# StratifiedKFold
stratified_kfold = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)  
# Initializing the Decision Tree and Bagging Classifier
decision_tree = DecisionTreeClassifier(random_state=42)
bagging_clf = BaggingClassifier(estimator=decision_tree, random_state=42) 
# Hyperparameter tuning for Bagging
param_grid_bagging = {
    'estimator__max_depth': [None, 1, 2, 4, 6, 8, 10], 
    'n_estimators': [10, 100]
}

grid_search_bagging = GridSearchCV(bagging_clf, param_grid_bagging, cv=stratified_kfold, scoring='accuracy')
grid_search_bagging.fit(X_train, y_train)

# Best parameters and cross-validation accuracy for Bagging
print("Best parameters for Bagging:", grid_search_bagging.best_params_)
print("Best cross-validation accuracy for Bagging:", grid_search_bagging.best_score_)

# Testing the best model on the test set
best_bagging_model = grid_search_bagging.best_estimator_
y_pred_bagging = best_bagging_model.predict(X_test)
print("Test accuracy for Bagging:", accuracy_score(y_test, y_pred_bagging))


Best parameters for Bagging: {'estimator__max_depth': None, 'n_estimators': 100}
Best cross-validation accuracy for Bagging: 0.9992033934879475
Test accuracy for Bagging: 0.9995124775282611
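
One thing to watch in the split above: `train_test_split` is called without `stratify`, so a class that only just clears the 6-sample threshold can end up mostly or entirely in one split. A possible variation, sketched here on made-up labels (not the NAD data), passes `stratify=y` so each class keeps roughly the same share in both splits.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 / 14 / 6 samples for classes 0 / 1 / 2.
y = [0] * 80 + [1] * 14 + [2] * 6
X = [[i] for i in range(len(y))]  # placeholder single-feature rows

# stratify=y keeps the class proportions approximately equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("train:", dict(sorted(train_counts.items())))
print("test: ", dict(sorted(test_counts.items())))
```

Whether this changes the reported accuracy on the real data is not something this sketch can show; it only illustrates that every class appears in both splits.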


### Saving the Trained Model and Encoders

In [None]:
import pickle
# Saving the trained model
with open('network_anomaly_detection_model.pkl', 'wb') as file:
    pickle.dump(best_bagging_model, file)
# Saving the label encoder
with open('label_encoder.pkl', 'wb') as file:
    pickle.dump(label_encoder, file)
# Saving the CatBoost encoder
with open('catboost_encoder.pkl', 'wb') as file:
    pickle.dump(catboost_encoder, file)

In this section, we save the trained model and preprocessing components for future use:

- **Model Saving**: The trained Bagging classifier model (`best_bagging_model`) is saved using the `pickle` library. This allows us to easily load and deploy the model without retraining it.

- **Label Encoder Saving**: The `LabelEncoder`, which was used to encode the target variable, is also saved. This ensures that we can consistently decode the target labels in future predictions.

- **CatBoost Encoder Saving**: The `CatBoostEncoder`, used for encoding categorical features, is saved to ensure that the same encoding is applied during future data preprocessing.
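
Loading is the mirror image of saving. The helper below is a sketch, assuming the three files sit in one directory under the names used above; `load_artifacts` and `ARTIFACT_FILES` are hypothetical names, not part of the notebook.

```python
import pickle
from pathlib import Path

# File names as written by the saving cell above.
ARTIFACT_FILES = {
    "model": "network_anomaly_detection_model.pkl",
    "label_encoder": "label_encoder.pkl",
    "catboost_encoder": "catboost_encoder.pkl",
}

def load_artifacts(directory="."):
    """Return a dict with the unpickled model and encoders (hypothetical helper)."""
    artifacts = {}
    for name, filename in ARTIFACT_FILES.items():
        with open(Path(directory) / filename, "rb") as file:
            artifacts[name] = pickle.load(file)
    return artifacts
```

Unpickling can execute code embedded in the file, so only load artifacts you created yourself, and keep the library versions in the loading environment aligned with those used for training.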

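Together, these files let the model be deployed without retraining, provided the serving environment has compatible versions of scikit-learn and `category_encoders` installed. The feature-engineering steps are not stored in these files and must be reapplied to incoming data before prediction.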
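
For orientation, a minimal Flask endpoint around these artifacts might look like the sketch below. It is not taken from the notebook: `create_app`, the `/predict` route, and the `preprocess` callable are hypothetical, and `preprocess` stands for whatever code reapplies the feature engineering and CatBoost encoding to an incoming JSON record.

```python
from flask import Flask, jsonify, request

def create_app(model, label_encoder, preprocess):
    """Build a Flask app serving predictions (illustrative sketch).

    model         -- fitted classifier exposing predict()
    label_encoder -- fitted encoder exposing inverse_transform()
    preprocess    -- hypothetical callable: JSON dict -> feature rows for the model
    """
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)
        if payload is None:
            return jsonify({"error": "expected a JSON body"}), 400
        features = preprocess(payload)
        encoded = model.predict(features)
        # Map the integer class back to the original attack label.
        label = label_encoder.inverse_transform(encoded)[0]
        return jsonify({"prediction": str(label)})

    return app
```

Passing the model and encoders in as arguments, rather than unpickling them inside the route, keeps the app easy to exercise with stand-in objects. In a real deployment, `model` and `label_encoder` would be the objects unpickled from the files saved above.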
