# 🎓 Understanding XGBoost for Phishing Email Classification 📧🌳🚀

### What is XGBoost? 🤔
**XGBoost** (eXtreme Gradient Boosting) is a **supervised learning algorithm** based on **boosting**. It supports both **classification** and **regression** tasks and is known for its training speed and predictive performance. XGBoost builds an ensemble of **decision trees** **sequentially**: each new tree is fitted to the errors left by the trees before it, so the ensemble's predictions improve step by step.

---

### How XGBoost Works 🛠️

1. **Boosting Technique**:
   - XGBoost uses a method called **boosting**, where trees are added **sequentially**, and each new tree tries to correct the errors of the previous ones.
   - Unlike **Random Forest**, whose trees are trained independently of one another (and can therefore be built in parallel), XGBoost builds its trees one after another, because each tree depends on the predictions of the trees before it.

2. **Gradient Descent Optimization**:
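   - XGBoost minimizes a **loss function** by **gradient descent** in function space: at each round it computes the gradient (and second derivative) of the loss with respect to the current predictions and fits the next tree to those **gradient statistics**, rather than updating a fixed set of weights.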
   
3. **Tree Pruning**:
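   - XGBoost grows each tree up to `max_depth` and then prunes back splits whose gain does not exceed the `gamma` threshold. Together with the **regularization term** in its objective, which penalizes the number of leaves and the size of the leaf weights, this keeps trees from becoming overly complex and helps generalization.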

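4. **Additive Tree Contributions**:
   - Unlike AdaBoost, XGBoost does not weight whole trees by how well they perform. Every tree's output is scaled by the same learning rate and added to the running prediction. Samples that the current ensemble predicts poorly have larger gradients, so they have more influence on how the next tree is built.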

5. **Learning Rate**:
   - A **learning rate** (shrinkage) scales the contribution of each tree. A lower learning rate makes each step smaller, so the model needs more trees to reach the same training fit, but it often generalizes better.
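
The steps above can be illustrated with a deliberately simplified sketch of gradient boosting for binary classification. It is not XGBoost's actual implementation: it uses only first-order gradients, no regularization, and scikit-learn's `DecisionTreeRegressor` as the base learner. The helper names (`fit_boosted_trees`, `predict_proba`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    """Fit trees one after another, each on the negative gradient of the log loss."""
    raw_score = np.zeros(len(y))  # start from a raw score (log-odds) of 0
    trees = []
    for _ in range(n_trees):
        # Negative gradient of the log loss w.r.t. the raw score: y - p
        residual = y - sigmoid(raw_score)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        # Shrink the tree's contribution by the learning rate before adding it
        raw_score += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees


def predict_proba(trees, X, learning_rate=0.1):
    """Sum the shrunken tree outputs and map the raw score to a probability."""
    raw_score = sum(learning_rate * tree.predict(X) for tree in trees)
    return sigmoid(raw_score)


# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1.0).astype(float)

trees = fit_boosted_trees(X, y)
train_acc = np.mean((predict_proba(trees, X) > 0.5) == y)
print(f"training accuracy: {train_acc:.3f}")
```

XGBoost follows the same additive pattern but also uses second derivatives of the loss, a regularized split criterion, and pruning.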

---

### Advantages of XGBoost for Phishing Email Classification 📧✨

- **High Accuracy**: XGBoost often achieves strong accuracy because boosting lets the ensemble capture complex, non-linear patterns in the features.
- **Regularization**: It includes **L1 and L2 regularization** to prevent overfitting, which makes it more robust, even with noisy data.
- **Handles Missing Data**: XGBoost can automatically handle missing data by learning the best direction to split on for missing values.

---

### Potential Limitations:
- **Complexity**: While XGBoost provides high accuracy, it can be more complex to understand and tune compared to simpler models like logistic regression or decision trees.
- **Sensitive to Hyperparameters**: XGBoost has many hyperparameters to tune (learning rate, tree depth, regularization, etc.), and improper tuning can affect model performance.

### Implementation 🔍
1. **Loading the required libraries** 📚

In [2]:
from xgboost import XGBClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report

2. **Loading and splitting the Data** 📥

In [3]:
# Load the saved TF-IDF features and labels
x_data = np.load('../feature_x.npy')
y_data = np.load('../y_tf.npy')

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, train_size=0.8, random_state=0)

3. **Model Initialization** 🤖

**`XGBClassifier()`** is initialized here with the library's **default parameters**. The classifier uses **gradient boosting** to combine the predictions of many decision trees. The most important parameters are described below; note that some default values depend on the installed XGBoost version.

- **`objective="binary:logistic"`**: The default objective for binary classification is the **logistic loss**. The model outputs the probability that a given sample belongs to the positive class (label 1).
  
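- **`learning_rate=0.1`**: This parameter controls the **step size** during the boosting process. A lower learning rate makes the model learn more slowly and usually requires more trees, while a higher rate fits faster but is more prone to overfitting. The value 0.1 was the default in older releases of the scikit-learn wrapper; recent releases leave it unset and fall back to the core library default of 0.3.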

- **`n_estimators=100`**: The number of **trees** the model will build. Each tree attempts to correct the errors of the previous trees, making the overall model stronger.

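- **`max_depth=3`**: The maximum depth of each individual decision tree. Shallower trees are less likely to overfit but may underfit, while deeper trees might overfit the training data. The value 3 was the default in older releases of the scikit-learn wrapper; recent releases fall back to the core library default of 6.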

- **`subsample=1.0`**: This parameter controls the percentage of the training data that is used to grow each tree. A value less than 1.0 can help prevent overfitting by introducing randomness into the training process.

- **`colsample_bytree=1.0`**: The fraction of features (columns) to be used by each tree. Reducing this can help with generalization by training each tree with different subsets of features.

- **`gamma=0`**: The **minimum loss reduction** required to split a leaf node further. Higher values make the algorithm more conservative, producing fewer splits and reducing the risk of overfitting.

- **`random_state=None`**: This controls the randomness for reproducibility. By default, the model does not use a fixed random seed, but setting this parameter will ensure consistent results across different runs.


In [4]:
xgb = XGBClassifier()

4. **Training the Model** 🏋️‍♂️

In [None]:
xgb.fit(x_train, y_train)

5. **Making Predictions** 🔮

In [None]:
prediction = xgb.predict(x_test)

6. **Evaluating the Model** 🧮

In [None]:
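# Report accuracy, F1 for the positive class (assuming binary 0/1 labels), and the per-class breakdown
print(f"accuracy from XGB: {accuracy_score(y_test, prediction) * 100:.5f} %")
print(f"f1 score from XGB: {f1_score(y_test, prediction) * 100:.5f} %")
print("classification report:\n", classification_report(y_test, prediction))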