# XGBoost Introduction

XGBoost is a tree-based method, so features do not need to be scaled or normalised. Tree splits depend only on the ordering of feature values relative to a threshold, not on absolute distances:
- Monotonic transformations (e.g. log, min-max) don't change how trees partition the feature space.
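
The invariance claim can be checked on a toy feature. The sketch below is not XGBoost itself: it uses a single-threshold split scored by misclassification count as a simplified stand-in for XGBoost's gain-based split search, and the values and the helper name `best_split_partition` are made up for illustration. It shows that the chosen partition of samples is identical for the raw, log-transformed, and min-max-scaled feature.

```python
import math

def best_split_partition(values, labels):
    """Return the set of sample indices sent left by the best single-threshold split.

    A split is scored by how many points are misclassified when each side
    predicts its majority label (a simplified stand-in for a gain criterion).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_errors, best_left = None, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        # A threshold cannot separate two equal feature values
        if values[left[-1]] == values[right[0]]:
            continue
        errors = 0
        for side in (left, right):
            ones = sum(labels[i] for i in side)
            errors += min(ones, len(side) - ones)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left

# Toy data for illustration only (not taken from the dataset)
feature = [44, 280, 420, 44, 9000, 15000, 120000]
label = [0, 0, 0, 0, 1, 1, 1]

lo, hi = min(feature), max(feature)
raw_partition = best_split_partition(feature, label)
log_partition = best_split_partition([math.log1p(v) for v in feature], label)
scaled_partition = best_split_partition([(v - lo) / (hi - lo) for v in feature], label)

print(raw_partition == log_partition == scaled_partition)  # True
```

Only the threshold value changes between the three versions; the samples on each side of the split stay the same.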


### TO DO: 

Because XGBoost has many knobs, and how you set them can dramatically affect performance, a **hyperparameter search** is needed. The following hyperparameters are worth looking into:

| Param             | Trade-off It Affects                     |
|-------------------|------------------------------------------|
| `max_depth`       | Bias vs. variance                        |
| `eta`             | Learning speed vs. convergence           |
| `subsample`       | Overfitting vs. underfitting             |
| `colsample_bytree`| Feature selection & decorrelation        |
| `lambda`, `alpha` | L2 / L1 regularization                   |
| `min_child_weight`| Controls when splits are allowed         |
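
One simple way to explore these parameters is random search. The sketch below only generates candidate parameter dictionaries. The ranges in `SEARCH_SPACE` and the helper `sample_candidates` are assumptions for illustration, not tuned recommendations, and each candidate would still need to be trained and scored on held-out data.

```python
import random

# Hypothetical search ranges; adjust them to the data and compute budget.
SEARCH_SPACE = {
    "max_depth":        lambda rng: rng.randint(3, 10),
    "eta":              lambda rng: 10 ** rng.uniform(-2, -0.5),  # log-uniform, about 0.01 to 0.32
    "subsample":        lambda rng: rng.uniform(0.5, 1.0),
    "colsample_bytree": lambda rng: rng.uniform(0.5, 1.0),
    "lambda":           lambda rng: 10 ** rng.uniform(-2, 1),     # L2 regularization
    "alpha":            lambda rng: 10 ** rng.uniform(-3, 0),     # L1 regularization
    "min_child_weight": lambda rng: rng.randint(1, 10),
}

def sample_candidates(n_candidates, seed=0):
    """Draw n_candidates independent parameter dictionaries from SEARCH_SPACE."""
    rng = random.Random(seed)
    return [
        {name: draw(rng) for name, draw in SEARCH_SPACE.items()}
        for _ in range(n_candidates)
    ]

candidates = sample_candidates(20)
print(candidates[0])
```

Each candidate can be merged over the base parameters, for example `{**XGB_PARAMS, **candidate}`, before training.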


- Do your held-out chunks show stable performance (e.g. ROC-AUC, F1) across different data slices? Have you checked for overfitting (train vs. test score gaps), and looked at learning curves or chunk-by-chunk variation?
- Are all label-encoders and imputers permanently fitted and saved, so you won't get unexpected unseen_label errors at inference time?
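
The first check can be sketched as a small summary over per-chunk scores. The helper `summarize_chunks`, its thresholds, and the scores below are all made up for illustration. Note that the training loop in this notebook records only validation scores, so training scores would have to be collected as well.

```python
from statistics import mean, pstdev

def summarize_chunks(train_scores, val_scores, max_gap=0.05, max_std=0.02):
    """Summarize chunk-by-chunk stability and train/validation gaps.

    max_gap and max_std are arbitrary illustrative thresholds.
    """
    gaps = [t - v for t, v in zip(train_scores, val_scores)]
    val_std = pstdev(val_scores)
    return {
        "val_mean": mean(val_scores),
        "val_std": val_std,
        "worst_gap": max(gaps),
        # 1-based chunk ids whose train score exceeds the validation score by more than max_gap
        "overfit_chunks": [i + 1 for i, g in enumerate(gaps) if g > max_gap],
        "unstable": val_std > max_std,
    }

# Made-up AUC values for illustration only
train_auc = [0.99, 0.99, 0.98, 0.99]
val_auc = [0.97, 0.98, 0.90, 0.97]

report = summarize_chunks(train_auc, val_auc)
print(report["overfit_chunks"], report["unstable"])  # [3] True
```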



### Data path 

In [1]:
import numpy as np 
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/nfuqnidsv2-network-intrusion-detection-dataset/NF-UQ-NIDS-v2.csv


### chunked df_examination

In [1]:
import pandas as pd

# Load the first chunk to inspect structure
chunk_size = 100_000
df_chunk = pd.read_csv('/kaggle/input/nfuqnidsv2-network-intrusion-detection-dataset/NF-UQ-NIDS-v2.csv', chunksize=chunk_size)
df_sample = next(df_chunk)
df_sample.head()


Unnamed: 0,IPV4_SRC_ADDR,L4_SRC_PORT,IPV4_DST_ADDR,L4_DST_PORT,PROTOCOL,L7_PROTO,IN_BYTES,IN_PKTS,OUT_BYTES,OUT_PKTS,...,TCP_WIN_MAX_OUT,ICMP_TYPE,ICMP_IPV4_TYPE,DNS_QUERY_ID,DNS_QUERY_TYPE,DNS_TTL_ANSWER,FTP_COMMAND_RET_CODE,Label,Attack,Dataset
0,192.168.100.148,65389,192.168.100.7,80,6,7.0,420,3,0,0,...,0,35840,140,0,0,0,0.0,1,DoS,NF-BoT-IoT-v2
1,192.168.100.148,11154,192.168.100.5,80,6,7.0,280,2,40,1,...,0,0,0,0,0,0,0.0,1,DoS,NF-BoT-IoT-v2
2,192.168.1.31,42062,192.168.1.79,1041,6,0.0,44,1,40,1,...,0,0,0,0,0,0,0.0,0,Benign,NF-ToN-IoT-v2
3,192.168.1.34,46849,192.168.1.79,9110,6,0.0,44,1,40,1,...,0,0,0,0,0,0,0.0,0,Benign,NF-ToN-IoT-v2
4,192.168.1.30,50360,192.168.1.152,1084,6,0.0,44,1,40,1,...,0,0,0,0,0,0,0.0,0,Benign,NF-ToN-IoT-v2


In [2]:
print(df_sample.columns.tolist())


['IPV4_SRC_ADDR', 'L4_SRC_PORT', 'IPV4_DST_ADDR', 'L4_DST_PORT', 'PROTOCOL', 'L7_PROTO', 'IN_BYTES', 'IN_PKTS', 'OUT_BYTES', 'OUT_PKTS', 'TCP_FLAGS', 'CLIENT_TCP_FLAGS', 'SERVER_TCP_FLAGS', 'FLOW_DURATION_MILLISECONDS', 'DURATION_IN', 'DURATION_OUT', 'MIN_TTL', 'MAX_TTL', 'LONGEST_FLOW_PKT', 'SHORTEST_FLOW_PKT', 'MIN_IP_PKT_LEN', 'MAX_IP_PKT_LEN', 'SRC_TO_DST_SECOND_BYTES', 'DST_TO_SRC_SECOND_BYTES', 'RETRANSMITTED_IN_BYTES', 'RETRANSMITTED_IN_PKTS', 'RETRANSMITTED_OUT_BYTES', 'RETRANSMITTED_OUT_PKTS', 'SRC_TO_DST_AVG_THROUGHPUT', 'DST_TO_SRC_AVG_THROUGHPUT', 'NUM_PKTS_UP_TO_128_BYTES', 'NUM_PKTS_128_TO_256_BYTES', 'NUM_PKTS_256_TO_512_BYTES', 'NUM_PKTS_512_TO_1024_BYTES', 'NUM_PKTS_1024_TO_1514_BYTES', 'TCP_WIN_MAX_IN', 'TCP_WIN_MAX_OUT', 'ICMP_TYPE', 'ICMP_IPV4_TYPE', 'DNS_QUERY_ID', 'DNS_QUERY_TYPE', 'DNS_TTL_ANSWER', 'FTP_COMMAND_RET_CODE', 'Label', 'Attack', 'Dataset']


### Imports 

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
import joblib
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

### Label Encoder 


`LabelEncoderExt` is a custom wrapper around scikit-learn's `LabelEncoder` that handles unseen categories during transformation by mapping them to a reserved `'Unknown'` label. It appends `'Unknown'` to the label set during fitting, so the encoder can always process new or unexpected inputs without raising an error. This is especially useful in streaming, chunked, or real-time inference settings where unseen categorical values (e.g., new IPs) may appear. The output stays integer-coded, which keeps it usable as XGBoost input while making the pipeline robust to data shifts.


In [None]:
# 2) Robust LabelEncoder that handles unseen labels
class LabelEncoderExt:
    def __init__(self):
        self.le = LabelEncoder()
    def fit(self, data):
        # include an explicit 'Unknown' category
        self.le = self.le.fit(list(data) + ['Unknown'])
        self.classes_ = self.le.classes_
        # keep a set copy of the classes for fast membership tests in transform()
        self._known = set(self.classes_)
        return self
    def transform(self, data):
        # map any unseen item -> 'Unknown' before encoding
        arr = [x if x in self._known else 'Unknown' for x in data]
        return self.le.transform(arr)
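
To make the behaviour concrete, here is a dependency-free sketch of the same idea using a plain dictionary instead of scikit-learn. The helpers `fit_classes` and `encode` are hypothetical names for this illustration, and the IP `10.0.0.99` is an invented unseen value. Classes are sorted to mirror how `LabelEncoder` orders `classes_`.

```python
def fit_classes(values):
    """Map each sorted unique value, plus a reserved 'Unknown' slot, to an integer code."""
    classes = sorted(set(values) | {"Unknown"})
    return {value: code for code, value in enumerate(classes)}

def encode(values, mapping):
    """Encode values, sending anything not seen during fitting to the 'Unknown' code."""
    unknown_code = mapping["Unknown"]
    return [mapping.get(v, unknown_code) for v in values]

first_chunk = ["192.168.1.31", "192.168.1.34", "192.168.1.31"]
later_chunk = ["192.168.1.34", "10.0.0.99"]  # the second IP was never seen during fitting

mapping = fit_classes(first_chunk)
print(encode(first_chunk, mapping))  # [0, 1, 0]
print(encode(later_chunk, mapping))  # [1, 2]
```

Every unseen value collapses to the same code, so the model cannot distinguish between different new IPs.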



### Initializing Encoders globally 

We now set up global placeholders for the IP address encoders, along with flags that track whether they have been fitted yet. The encoders for `IPV4_SRC_ADDR` (source IP) and `IPV4_DST_ADDR` (destination IP) are `le_src` and `le_dst` respectively. Both use `LabelEncoderExt` to handle unseen IPs safely in later chunks or at inference time.

In [None]:
# 4) Encoder placeholders & flags
le_src, le_dst = LabelEncoderExt(), LabelEncoderExt()
fitted_src = fitted_dst = False


### Paths and Hyper Parameters

In [None]:

# 3) Paths & Hyperparams
DATA_PATH   = '/kaggle/input/nfuqnidsv2-network-intrusion-detection-dataset/NF-UQ-NIDS-v2.csv'
OUTPUT_DIR  = '/kaggle/working'
CHUNK_SIZE  = 200_000


In [None]:

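# Assumption: no GPU by default; set to True when a CUDA-capable GPU is available
use_gpu = False

XGB_PARAMS = {
    "objective":    "binary:logistic",
    "eval_metric":  "auc",
    # always use the fast histogram method...
    "tree_method":  "hist",
    # the "device" parameter requires XGBoost 2.0 or newer
    "device":       "cuda" if use_gpu else "cpu",
    "max_depth":    6,
    "eta":          0.1,
}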


In [None]:
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import joblib

booster       = None
chunk_id      = 0
val_aucs      = []
val_accs      = []
val_f1s       = []

for chunk in pd.read_csv(DATA_PATH, chunksize=CHUNK_SIZE, low_memory=False):
    chunk_id += 1
    print(f"\n▶ Chunk {chunk_id}: {chunk.shape}")

    # ── a) Drop, sanitize, encode, impute, clip ──
    chunk.drop(['Attack', 'Dataset'], axis=1, inplace=True, errors='ignore')
    chunk.replace([np.inf, -np.inf], np.nan, inplace=True)
    chunk.fillna(0, inplace=True)

    if not fitted_src:
        le_src.fit(chunk['IPV4_SRC_ADDR']); fitted_src = True
    if not fitted_dst:
        le_dst.fit(chunk['IPV4_DST_ADDR']); fitted_dst = True

    chunk['IPV4_SRC_ADDR'] = le_src.transform(chunk['IPV4_SRC_ADDR'])
    chunk['IPV4_DST_ADDR'] = le_dst.transform(chunk['IPV4_DST_ADDR'])

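    # NOTE: a fresh encoder is fitted per chunk here, so any remaining object
    # column gets codes that are not consistent from one chunk to the next.
    for col in chunk.select_dtypes(include='object'):
        le = LabelEncoderExt().fit(chunk[col])
        chunk[col] = le.transform(chunk[col])

    if 'Label' not in chunk.columns:
        print("No 'Label' – skipping")
        continue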

    y = chunk['Label'].astype(int)
    X = chunk.drop(columns=['Label'])
    # inf/NaN values were already replaced with 0 above, so the next two lines
    # act only as a safeguard and do not change the data
    X.replace([np.inf, -np.inf], np.nan, inplace=True)
    X = pd.DataFrame(SimpleImputer(strategy='mean').fit_transform(X), columns=X.columns)
    X = X.clip(-1e6, 1e6)

    # ── b) TRAIN/VAL SPLIT ──
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval   = xgb.DMatrix(X_val, label=y_val)

    # ── c) INCREMENTAL TRAIN (+ EVAL) ──
    evals = [(dval, 'validation')]
    # xgb_model=None on the first chunk starts a new model; later chunks add 50 more rounds to it
    booster = xgb.train(XGB_PARAMS, dtrain, num_boost_round=50, xgb_model=booster, evals=evals, verbose_eval=False)

    # ── d) METRICS ──
    preds = booster.predict(dval)
    y_pred_bin = (preds > 0.5).astype(int)

    auc = roc_auc_score(y_val, preds)
    acc = accuracy_score(y_val, y_pred_bin)
    f1  = f1_score(y_val, y_pred_bin)
    cm  = confusion_matrix(y_val, y_pred_bin)

    val_aucs.append(auc)
    val_accs.append(acc)
    val_f1s.append(f1)

    print(f"Chunk {chunk_id} - AUC: {auc:.4f}, Accuracy: {acc:.4f}, F1: {f1:.4f}")
    print("Confusion Matrix:\n", cm)




### Final Report

In [None]:
# ── e) FINAL REPORT ──
print(f"\n📊 Mean AUC:      {np.mean(val_aucs):.4f}")
print(f"📊 Mean Accuracy: {np.mean(val_accs):.4f}")
print(f"📊 Mean F1 Score: {np.mean(val_f1s):.4f}")
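
The means above weight every chunk equally, even though the last chunk of a file is usually smaller than `CHUNK_SIZE`. A size-weighted mean is one alternative. The sketch below assumes a hypothetical list of per-chunk validation sizes, which the loop above does not collect, and uses made-up numbers. For accuracy, the size-weighted mean equals the accuracy over all validation rows pooled together; for AUC and F1 it is only an approximation of the pooled value.

```python
def weighted_mean(scores, sizes):
    """Mean of per-chunk scores, weighted by the number of validation rows per chunk."""
    total = sum(sizes)
    return sum(score * n for score, n in zip(scores, sizes)) / total

# Made-up values for illustration only
chunk_scores = [0.98, 0.97, 0.80]
chunk_val_sizes = [40_000, 40_000, 2_000]  # hypothetical validation rows per chunk

plain = sum(chunk_scores) / len(chunk_scores)
weighted = weighted_mean(chunk_scores, chunk_val_sizes)
print(f"plain mean: {plain:.4f}, size-weighted mean: {weighted:.4f}")
```

Here the small final chunk pulls the plain mean down far more than its share of the data justifies.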



### Save Model

In [None]:
# ── f) SAVE MODEL + ENCODERS ──
MODEL_PATH     = f"{OUTPUT_DIR}/xgb_nids_model.json"
ENCODERS_PATH  = f"{OUTPUT_DIR}/ip_encoders.pkl"
booster.save_model(MODEL_PATH)
joblib.dump({'src': le_src, 'dst': le_dst}, ENCODERS_PATH)
print(f"\nModel saved to: {MODEL_PATH}")
print(f"Encoders saved to: {ENCODERS_PATH}")

# ── g) SAVE METRICS CSV ──
metrics_df = pd.DataFrame({
    'Chunk': list(range(1, chunk_id + 1)),
    'AUC': val_aucs,
    'Accuracy': val_accs,
    'F1': val_f1s
})
csv_path = f"{OUTPUT_DIR}/chunk_validation_metrics.csv"
metrics_df.to_csv(csv_path, index=False)
print(f"📄 Validation metrics saved to: {csv_path}")



### Feature Importance Plot

In [None]:
# ── h) PLOT FEATURE IMPORTANCE ──
fig_path = f"{OUTPUT_DIR}/feature_importance.png"
xgb.plot_importance(booster, max_num_features=15, importance_type='gain')
plt.title("Top 15 Features by Gain")
plt.tight_layout()
plt.savefig(fig_path, dpi=300)
plt.show()
print(f"📈 Feature importance saved to: {fig_path}")

In [None]:
import matplotlib.pyplot as plt
xgb.plot_importance(booster, max_num_features=15, importance_type='gain')
plt.tight_layout()
plt.show()


In [None]:
import joblib

# Save model (no need for get_booster)
booster.save_model("xgb_nids_model.json")

# Save encoders using joblib
joblib.dump({'le_src': le_src, 'le_dst': le_dst}, "preprocessor.pkl")
