# Predictive Analysis Notebook

This Jupyter notebook demonstrates a predictive analysis workflow that applies previously trained and saved models to new data. It covers the following tasks:

1. **Loading the Dataset**: Read the input features from a CSV file into a pandas DataFrame.
2. **Loading Saved Models**: Load the saved encoder, imputer, column lists, and classifier with `joblib`.
3. **Pre-processing the Input Data**: One-hot encode the categorical columns, impute missing values, and select the model's input features.
4. **Making Predictions**: Use the loaded classifier to predict a label for each row.
5. **Analyzing and Saving Predictions**: Count the rows predicted as label 1 and save them to `attacks.csv`.

**Prerequisites:**
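- Python environment with the required libraries installed: `pandas`, `joblib`, and `scikit-learn` (imported in the pre-processing step). Loading `xgboost_model.joblib` will likely also require `xgboost`, assuming the file holds an XGBoost model as its name suggests.
- The saved models and pre-processing parameters are available at the paths listed in `model_paths`.
- The input data for predictions is available at the CSV path passed to `pd.read_csv`.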

**Note:**
- Run the cells in order: each cell performs one step of the analysis, and later cells depend on variables defined in earlier ones.
- Explanations, code comments, and print statements guide you through the workflow.

Let's proceed with the predictive analysis!


## Loading the Dataset


In [3]:
import pandas as pd

# Load the dataset for making predictions
new_data = pd.read_csv(r'./featrues_df.csv')
print("Loaded data for making predictions.")
new_data


Loaded data for making predictions.


Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports
0,203297560709042,84.882379,tcp,-,-,215.0,0.0,13723.0,0.0,1.0,...,43.0,43.0,43.0,43.0,0.0,0.0,0.0,43.0,0.0,0.0
1,203297560709042,84.882379,tcp,-,-,215.0,0.0,13723.0,0.0,1.0,...,43.0,43.0,43.0,43.0,0.0,0.0,0.0,43.0,0.0,0.0
2,1804203672181747,61.621388,tcp,-,-,10.0,0.0,2394.0,0.0,1.0,...,10.0,10.0,10.0,10.0,0.0,0.0,0.0,10.0,0.0,0.0
3,598413082629892,97.672258,tcp,-,-,18.0,0.0,3991.0,0.0,0.5,...,18.0,18.0,18.0,18.0,0.0,0.0,0.0,18.0,0.0,0.0
4,1718789661686758,96.449313,tcp,-,-,12.0,0.0,2831.0,0.0,0.5,...,12.0,12.0,12.0,12.0,0.0,0.0,0.0,12.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,2176121286556545,0.000000,udp,-,-,1.0,0.0,74.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
240,476063750084706,0.000000,udp,-,-,1.0,0.0,74.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
241,1975335942319872,0.000000,udp,-,-,1.0,0.0,74.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
242,1776270667273491,0.000000,udp,-,-,1.0,0.0,74.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
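
Before moving on, it can help to confirm that the loaded file contains the columns that later cells rely on. The following is a minimal sketch and not part of the original notebook: `find_missing_columns` is a hypothetical helper, and the small DataFrame stands in for `new_data`.

```python
import pandas as pd


def find_missing_columns(df, required):
    """Return the names in `required` that are not columns of `df`."""
    return [col for col in required if col not in df.columns]


# Small stand-in for `new_data`; the real frame comes from pd.read_csv.
sample = pd.DataFrame({
    "id": [1, 2],
    "dur": [84.88, 61.62],
    "proto": ["tcp", "udp"],
})

# `rate` is deliberately absent from the stand-in frame.
missing = find_missing_columns(sample, ["id", "dur", "proto", "rate"])
print(f"Missing columns: {missing}")
```

In the notebook, the same helper could be called with `new_data` and the raw column names the pre-processing step expects.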


## Loading Saved Models

In [4]:
import joblib

# Load saved models and pre-processing parameters
model_paths = {
        'encoder': './encoder.joblib',
        'imputer': './imputer.joblib',
        'categorical_cols': './categorical_cols.joblib',
        'all_features': './all_features.joblib',
        'classifier': './xgboost_model.joblib'
}
selected_features=['dur', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss', 'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin', 'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm', 'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm', 'ct_src_ltm', 'ct_srv_dst', 'proto_tcp', 'state_INT']

# Load models and parameters
loaded_models = {}
for model_name, model_path in model_paths.items():
    try:
        loaded_models[model_name] = joblib.load(model_path)
        print(f"{model_name} loaded successfully.")
    except Exception as e:
        print(f"Failed to load {model_name}: {e}")
        loaded_models = None
        break

if loaded_models:
    print("All models and parameters loaded successfully!")


encoder loaded successfully.
imputer loaded successfully.
categorical_cols loaded successfully.
all_features loaded successfully.
classifier loaded successfully.
All models and parameters loaded successfully!
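
The loading loop above stops at the first failure, so a second missing file would only surface on the next run. As a possible complement, the sketch below checks all paths up front and reports every missing file at once. `find_missing_files` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory rather than real models.

```python
from pathlib import Path
import tempfile


def find_missing_files(paths):
    """Return {name: path} for every entry whose file does not exist."""
    return {name: p for name, p in paths.items() if not Path(p).is_file()}


# Demonstration in a temporary directory: one file present, one absent.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "encoder.joblib"
    present.write_bytes(b"")  # empty placeholder, not a real model
    demo_paths = {
        "encoder": str(present),
        "classifier": str(Path(tmp) / "xgboost_model.joblib"),
    }
    missing_files = find_missing_files(demo_paths)

print(sorted(missing_files))
```

Calling `find_missing_files(model_paths)` before the loop would give the full list of absent artifacts without attempting to load anything.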


## Pre-processing the Input Data

In [5]:
from sklearn.preprocessing import OneHotEncoder

# Pre-process the input data
def preprocess_data(new_data, models):
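    """
    Preprocess the input data with the saved encoder and imputer.

    Steps: drop the 'id' column, one-hot encode the object-dtype columns,
    impute missing values, then keep only the columns named in the
    module-level `selected_features` list.

    Parameters:
    - new_data (DataFrame): The data to be preprocessed.
    - models (dict): Loaded artifacts; the 'encoder' and 'imputer' entries are used.

    Returns:
    - DataFrame: Preprocessed data restricted to `selected_features`.
    """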
    X_new = new_data.drop(['id'], axis=1)
    categorical_cols = X_new.select_dtypes(include=['object']).columns.tolist()
    
    # One-hot encoding
    encoded_cat_new = models['encoder'].transform(X_new[categorical_cols])
    encoded_cat_df_new = pd.DataFrame(encoded_cat_new, columns=models['encoder'].get_feature_names_out(categorical_cols))
    
    # Combine encoded data
    X_encoded_new = pd.concat([X_new.drop(categorical_cols, axis=1).reset_index(drop=True), encoded_cat_df_new.reset_index(drop=True)], axis=1)
    
    # Missing Value Imputation
    X_imputed_new = models['imputer'].transform(X_encoded_new)
    
    # Feature selection
    X_selected_new = pd.DataFrame(X_imputed_new, columns=X_encoded_new.columns.tolist())[selected_features]
    
    return X_selected_new

# Pre-process the input data
X_new = preprocess_data(new_data, loaded_models)
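
scikit-learn transformers generally expect the same features, in the same order, as the data they were fitted on. The notebook loads `all_features` but does not use it. One possible use, assuming that list holds the training-time column order (the notebook does not confirm this), is to align the encoded frame before imputation. The pandas-only sketch below uses made-up column names to show the idea.

```python
import pandas as pd

# Hypothetical training-time column order (stand-in for `all_features`).
training_columns = ["dur", "spkts", "proto_tcp", "proto_udp"]

# A frame whose columns are out of order, with one missing and one extra.
encoded = pd.DataFrame({
    "proto_tcp": [1.0, 1.0],
    "dur": [84.88, 61.62],
    "spkts": [215.0, 10.0],
    "extra": [0.0, 0.0],
})

# reindex reorders columns, drops unknown ones, and fills absent ones with 0.
aligned = encoded.reindex(columns=training_columns, fill_value=0.0)
print(aligned.columns.tolist())
```

Filling an absent column with 0 is only sensible for one-hot indicator columns; a missing numeric feature is usually better treated as an error.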




## Making Predictions

In [6]:
# Make predictions using the loaded classifier

def make_predictions(X, classifier):
    """
    Use the provided classifier to make predictions on the data.

    Parameters:
    - X (DataFrame): The data on which predictions should be made.
    - classifier: Pre-trained classifier.

    Returns:
    - array: Array of predictions.
    """
    return classifier.predict(X)

predictions = make_predictions(X_new, loaded_models['classifier'])
new_data['predicted_label'] = predictions
new_data

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,predicted_label
0,203297560709042,84.882379,tcp,-,-,215.0,0.0,13723.0,0.0,1.0,...,43.0,43.0,43.0,0.0,0.0,0.0,43.0,0.0,0.0,1
1,203297560709042,84.882379,tcp,-,-,215.0,0.0,13723.0,0.0,1.0,...,43.0,43.0,43.0,0.0,0.0,0.0,43.0,0.0,0.0,1
2,1804203672181747,61.621388,tcp,-,-,10.0,0.0,2394.0,0.0,1.0,...,10.0,10.0,10.0,0.0,0.0,0.0,10.0,0.0,0.0,0
3,598413082629892,97.672258,tcp,-,-,18.0,0.0,3991.0,0.0,0.5,...,18.0,18.0,18.0,0.0,0.0,0.0,18.0,0.0,0.0,0
4,1718789661686758,96.449313,tcp,-,-,12.0,0.0,2831.0,0.0,0.5,...,12.0,12.0,12.0,0.0,0.0,0.0,12.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,2176121286556545,0.000000,udp,-,-,1.0,0.0,74.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
240,476063750084706,0.000000,udp,-,-,1.0,0.0,74.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
241,1975335942319872,0.000000,udp,-,-,1.0,0.0,74.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
242,1776270667273491,0.000000,udp,-,-,1.0,0.0,74.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
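
The cell above writes `predicted_label` directly into `new_data`. If the pre-processing cell is run again afterwards, that extra numeric column would be passed on to the imputer, which was not fitted on it. One way to avoid this, sketched here with a stand-in frame and hard-coded predictions, is to leave the inputs untouched and build a separate results frame with `DataFrame.assign`.

```python
import pandas as pd

inputs = pd.DataFrame({"id": [101, 102, 103], "dur": [84.9, 61.6, 0.0]})
predictions = [1, 0, 0]  # stand-in for classifier.predict(...)

# assign returns a new DataFrame; `inputs` itself is not modified.
results = inputs.assign(predicted_label=predictions)
print(results)
```

With this pattern, the earlier cells can be re-run safely because `inputs` never gains the prediction column.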


## Analyzing and Saving Predictions

In [7]:
# Count and display the number of predictions where label is 1
count_predicted_label_1 = len(new_data[new_data["predicted_label"] == 1])
print(f"Number of rows with predicted_label equal to 1: {count_predicted_label_1}")


Number of rows with predicted_label equal to 1: 6
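
The count above covers only label 1. To see the full distribution of predicted labels in one step, `Series.value_counts` can be used. The sketch below runs on a short stand-in series, not the notebook's data.

```python
import pandas as pd

labels = pd.Series([1, 1, 0, 0, 0], name="predicted_label")

counts = labels.value_counts()                # rows per predicted label
shares = labels.value_counts(normalize=True)  # same, as fractions of all rows
print(counts.to_dict())
print(shares.to_dict())
```

In the notebook, the equivalent call would be `new_data["predicted_label"].value_counts()`.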


In [8]:
# Select the rows whose predicted label is 1
attacks = new_data[new_data["predicted_label"] == 1]
attacks


Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,predicted_label
0,203297560709042,84.882379,tcp,-,-,215.0,0.0,13723.0,0.0,1.0,...,43.0,43.0,43.0,0.0,0.0,0.0,43.0,0.0,0.0,1
1,203297560709042,84.882379,tcp,-,-,215.0,0.0,13723.0,0.0,1.0,...,43.0,43.0,43.0,0.0,0.0,0.0,43.0,0.0,0.0,1
7,1227368093728265,10.713269,tcp,-,-,15.0,0.0,2093.0,0.0,2.0,...,15.0,15.0,15.0,0.0,0.0,0.0,15.0,0.0,0.0,1
12,998244587245707,10.766534,tcp,-,-,13.0,0.0,1755.0,0.0,2.0,...,13.0,13.0,13.0,0.0,0.0,0.0,13.0,0.0,0.0,1
13,535470451044991,10.774939,tcp,-,-,13.0,0.0,1749.0,0.0,2.0,...,13.0,13.0,13.0,0.0,0.0,0.0,13.0,0.0,0.0,1
42,203297560709042,84.882379,tcp,-,-,215.0,0.0,13723.0,0.0,1.0,...,43.0,43.0,43.0,0.0,0.0,0.0,43.0,0.0,0.0,1
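
In the output above, the id 203297560709042 appears three times (rows 0, 1, and 42) with identical visible values. Whether these rows are true duplicates depends on the elided columns, so the following is only a sketch of how repeated ids could be counted and collapsed. The `flagged` frame is a stand-in for `attacks` that reuses the ids shown above.

```python
import pandas as pd

# Stand-in for `attacks`, using the ids from the output above.
flagged = pd.DataFrame({
    "id": [203297560709042, 203297560709042, 1227368093728265,
           998244587245707, 535470451044991, 203297560709042],
    "predicted_label": [1, 1, 1, 1, 1, 1],
})

unique_ids = flagged["id"].nunique()
# Keep the first row for each id and drop the later repeats.
deduplicated = flagged.drop_duplicates(subset="id", keep="first")
print(f"{len(flagged)} flagged rows, {unique_ids} distinct ids")
```

Whether to deduplicate at all depends on what a repeated id means in the source data, which the notebook does not state.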


In [9]:
# Save predictions with label 1 to CSV
attacks.to_csv('attacks.csv', index=False)
print("Saved attacks to 'attacks.csv'.")

Saved attacks to 'attacks.csv'.
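
The `index=False` argument matters here: without it, `to_csv` writes the row index as an unnamed first column, which `read_csv` then reports as `Unnamed: 0`. The sketch below shows the difference using in-memory buffers and a small stand-in frame instead of files on disk.

```python
import io

import pandas as pd

frame = pd.DataFrame({"id": [7, 12], "predicted_label": [1, 1]}, index=[7, 12])

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Note that `index=False` does not remove an `Unnamed: 0` column that is already part of the frame's data, as is the case for the frame loaded at the start of this notebook.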
