# Preprocessing the UNSW-NB15 Dataset

This section preprocesses the UNSW-NB15 dataset so that it is ready for model training. The steps are as follows:

## Steps

1. **Loading the Dataset**
   - Load the UNSW-NB15 dataset from a CSV file into a pandas DataFrame for further processing.
   - The code reads a cleaned CSV file ('Cleaned_full_data.csv') into a pandas DataFrame.
   - It verifies the shape of the DataFrame to ensure the data is loaded correctly.

2. **Encoding Categorical Features**
   - Convert categorical features into numerical values using label encoding (used in this notebook) or, optionally, one-hot encoding. This is necessary for machine learning algorithms that require numerical input.

3. **Scaling Numerical Features**
   - Standardize the numerical features by scaling them to have zero mean and unit variance. This keeps features with large value ranges from dominating distance-based and gradient-based models.

4. **Splitting the Dataset into Training and Testing Sets**
   - Split the dataset into training and testing sets to evaluate the performance of machine learning models. By default, 80% of the data is used for training and 20% for testing (`test_size=0.2`).

5. **Main Preprocessing Function**
   - The `preprocess_unsw_nb15` function orchestrates the entire preprocessing pipeline.
   - It identifies categorical and numerical columns, applies encoding and scaling, and splits the data.
   - It returns the processed training and testing sets, encoders, and scaler.

6. **Execution and Output**
   - The `if __name__ == "__main__":` block executes the preprocessing steps.
   - It prints the shapes of the training and testing sets, confirming the split.
   - It prints the head of the processed training and test data to show the results of the preprocessing steps.
   - The processed data is saved to pickle files for later use in model training.



In [None]:
# Core data manipulation and analysis libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical operations and arrays

import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

# Enable inline plotting in Jupyter notebooks (IPython magic; only valid inside a notebook)
%matplotlib inline


#### Load the Cleaned Dataset.


In [None]:
    
# Read the cleaned dataset (a single CSV file) into a DataFrame
df = pd.read_csv('C:/Users/raman/OneDrive/Important/1UnisaSTUDY/Courses/Capstone_Project_1/Github/Code Working/Data Cleaning and EDA/Cleaned_full_data.csv', header=0)

df.head()


In [None]:
# Check that the full dataset was loaded: returns (rows, columns)
df.shape
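
The shape only confirms the row and column counts. As a minimal sketch of a few further sanity checks, the example below uses a small toy frame in place of the real data. The helper `summarize_load` and the column names `proto`, `dur`, and `label` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the loaded DataFrame; column names are illustrative only
toy_df = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp", None],
    "dur": [0.12, 0.05, 1.30, 0.40],
    "label": [0, 1, 0, 1],
})

def summarize_load(frame):
    """Hypothetical helper: collect basic facts that help confirm a file was read as expected."""
    return {
        "shape": frame.shape,
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

summary = summarize_load(toy_df)
print(summary)
```

On the real dataset, the same call would be `summarize_load(df)`.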

**Encoding Categorical Features**
   - Convert categorical features into numerical values using label encoding (used in this notebook) or, optionally, one-hot encoding. This is necessary for machine learning algorithms that require numerical input.

In [None]:
# Function to encode categorical features
# Label Encoding: Converts categories to integer codes (0, 1, 2, ...). Suits ordinal data; on nominal data it can imply a false ordering
# One-Hot Encoding: Creates one binary column per category. Better for nominal data with no inherent order
def encode_categorical_features(df, categorical_features, encoding_type='LabelEncoder'):
    """
    Encode the categorical columns of df.

    Args:
        df: Input dataframe (label encoding overwrites its columns in place)
        categorical_features: List of categorical column names
        encoding_type: 'LabelEncoder' for LabelEncoder or 'onehot' for OneHotEncoder

    Returns:
        The encoded dataframe and the fitted encoder(s)
    """
    if encoding_type == 'LabelEncoder':
        label_encoders = {}
        for column in categorical_features:
            label_encoders[column] = LabelEncoder()
            df[column] = label_encoders[column].fit_transform(df[column].astype(str))
        return df, label_encoders

    elif encoding_type == 'onehot':
        # scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
        onehot = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
        encoded_array = onehot.fit_transform(df[categorical_features])

        # Create new column names for one-hot encoded features
        new_columns = []
        for i, feature in enumerate(categorical_features):
            categories = onehot.categories_[i]
            new_columns.extend([f"{feature}_{cat}" for cat in categories])

        # Create new dataframe with encoded features
        encoded_df = pd.DataFrame(encoded_array, columns=new_columns, index=df.index)

        # Drop original categorical columns and concat encoded ones
        df = df.drop(columns=categorical_features)
        df = pd.concat([df, encoded_df], axis=1)
        return df, onehot

    else:
        # Fail loudly instead of silently returning None for an unrecognised option
        raise ValueError(f"Unknown encoding_type: {encoding_type!r}")
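
To make the difference between the two encodings concrete, here is a small sketch on a toy column. The column name `proto` and its values are illustrative assumptions. The one-hot step uses the encoder's default sparse output and converts it with `.toarray()`, which avoids the version-dependent `sparse`/`sparse_output` argument.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy categorical column; the name and values are illustrative only
toy = pd.DataFrame({"proto": ["tcp", "udp", "icmp", "tcp"]})

# Label encoding: one integer per category, assigned in sorted category order
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(toy["proto"])
print(list(label_encoder.classes_))  # categories in the order used for the codes
print(label_codes.tolist())

# One-hot encoding: one binary column per category, with no implied ordering
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
one_hot_matrix = onehot_encoder.fit_transform(toy[["proto"]]).toarray()
print(list(onehot_encoder.categories_[0]))
print(one_hot_matrix)

# A fitted label encoder can map the codes back to the original strings
restored = label_encoder.inverse_transform(label_codes)
print(restored.tolist())
```

With label encoding, a model may read `udp` (code 2) as "greater than" `icmp` (code 0). Tree-based models are usually insensitive to this, while linear and distance-based models can be misled by it.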

**Scaling Numerical Features**
   - Standardize the numerical features by scaling them to have zero mean and unit variance. This keeps features with large value ranges from dominating distance-based and gradient-based models.

In [None]:
# Function to scale numerical features
def scale_numerical_features(df, numerical_features):
    scaler = StandardScaler()
    df[numerical_features] = scaler.fit_transform(df[numerical_features])
    return df, scaler
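
`StandardScaler` replaces each value x with z = (x - mean) / std, using the column's mean and population standard deviation. The sketch below checks this on toy data by comparing the scaler's output with the formula computed by hand. The column names `dur` and `sbytes` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numerical columns on very different scales; names are illustrative only
toy = pd.DataFrame({
    "dur": [1.0, 2.0, 3.0, 4.0],
    "sbytes": [100.0, 300.0, 500.0, 700.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Same transformation by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled)
print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns contain identical values because they differ only by a linear rescaling, which is exactly what standardization removes.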

**Splitting the Dataset into Training and Testing Sets**
   - Use `train_test_split` from the `sklearn.model_selection` module to split the dataset into training and testing sets.

In [None]:
# Function to split the dataset into training and testing sets
def split_dataset(df, target_column, test_size=0.2, random_state=42):
    X = df.drop(columns=[target_column])
    y = df[target_column]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test
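
`split_dataset` draws a purely random split. When the classes are imbalanced, one option is to pass `stratify=y` so that both sets keep the same class proportions. This is a possible variation, not what the function above does; the toy data and column names below are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 15 rows of class 0 and 5 rows of class 1
toy = pd.DataFrame({
    "dur": range(20),
    "label": [0] * 15 + [1] * 5,
})
X = toy.drop(columns=["label"])
y = toy["label"]

# stratify=y keeps the 75/25 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```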

In [None]:

# Main function to preprocess the dataset
def preprocess_unsw_nb15(df, target_column, categorical_features, numerical_features, encoding_type='LabelEncoder'):
    df, encoders = encode_categorical_features(df, categorical_features, encoding_type)
    df, scaler = scale_numerical_features(df, numerical_features)
    X_train, X_test, y_train, y_test = split_dataset(df, target_column)
    return X_train, X_test, y_train, y_test, encoders, scaler

if __name__ == "__main__":
    # Assumes df was already loaded by the cell above
    target_column = 'label'  # Update with the correct target column

    # Identify categorical and numerical columns, excluding the target
    # so that the labels themselves are neither encoded nor scaled
    feature_df = df.drop(columns=[target_column])
    categorical_features = feature_df.select_dtypes(include=['object']).columns.tolist()
    numerical_features = feature_df.select_dtypes(include=['number']).columns.tolist()


    # Preprocessing
    encoding_type = 'LabelEncoder'  # Change to 'onehot' for One-Hot Encoding
    X_train, X_test, y_train, y_test, encoders, scaler = preprocess_unsw_nb15(df, target_column, categorical_features, numerical_features, encoding_type)
    print("Preprocessing complete.")
    print(f"Training set size: {X_train.shape}")
    print(f"Test set size: {X_test.shape}")
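
`preprocess_unsw_nb15` fits the scaler on the full dataset before splitting, so statistics from the test rows influence the transformation. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; it is an assumption about how the pipeline could be reordered, not the notebook's method, and the column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values and column names are illustrative only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "dur": rng.exponential(1.0, 100),
    "sbytes": rng.integers(0, 10_000, 100).astype(float),
    "label": rng.integers(0, 2, 100),
})
numeric_cols = ["dur", "sbytes"]

# 1) Split first, so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    toy[numeric_cols], toy["label"], test_size=0.2, random_state=42
)

# 2) Fit the scaler on the training rows only
scaler = StandardScaler().fit(X_train)

# 3) Apply the same fitted transformation to both sets
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), columns=numeric_cols, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=numeric_cols, index=X_test.index
)

print(X_train_scaled.mean().round(6).to_dict())  # approximately zero by construction
print(X_test_scaled.mean().round(3).to_dict())   # close to zero, but not exactly
```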

In [None]:
# Print head values
print("\nTraining set head:")
print(X_train.head())
print(y_train.head())

print("\nTest set head:")
print(X_test.head())
print(y_test.head())

In [None]:
# Save to CSV
#X_train.to_csv('X_train.csv', index=False)
#X_test.to_csv('X_test.csv', index=False)
#y_train.to_csv('y_train.csv', index=False)
#y_test.to_csv('y_test.csv', index=False)
#print("CSV files saved.")

In [None]:
# Save to pickle
output_folder = 'C:/Users/raman/OneDrive/Important/1UnisaSTUDY/Courses/Capstone_Project_1/Github/Code Working/Pickle'  # Change this to your desired folder

# Ensure the output folder exists
import os
os.makedirs(output_folder, exist_ok=True)

with open(os.path.join(output_folder, 'X_train.pkl'), 'wb') as f:
    pickle.dump(X_train, f)
with open(os.path.join(output_folder, 'X_test.pkl'), 'wb') as f:
    pickle.dump(X_test, f)
with open(os.path.join(output_folder, 'y_train.pkl'), 'wb') as f:
    pickle.dump(y_train, f)
with open(os.path.join(output_folder, 'y_test.pkl'), 'wb') as f:
    pickle.dump(y_test, f)
print("Pickle files saved.")
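
For the later model-training step, the files can be read back with `pickle.load`. The sketch below shows a save-and-load round trip in a temporary folder with a toy frame standing in for `X_train`; the folder and data are assumptions for illustration. The fitted `encoders` and `scaler` are not saved by the cell above; they could be pickled the same way if new data must be transformed consistently later.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for X_train; values are illustrative only
toy_X_train = pd.DataFrame({"dur": [0.1, 0.2], "sbytes": [10.0, 20.0]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "X_train.pkl")

    # Save, mirroring the cell above
    with open(path, "wb") as f:
        pickle.dump(toy_X_train, f)

    # Load it back, as a training notebook would
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored frame has the same values, dtypes, and index as the original
round_trip_ok = restored.equals(toy_X_train)
print(round_trip_ok)
```

Pickle files can execute arbitrary code when loaded, so only load files from a source you trust.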
