# 🎓 Understanding Logistic Regression for Phishing Email Classification 📧🔐

### What is Logistic Regression? 🤔
**Logistic Regression** is a **supervised learning algorithm** used for **binary classification** tasks. It estimates the **probability** that a given input belongs to a specific class (e.g., phishing or non-phishing). 🧠 Logistic Regression applies a **sigmoid function** to a weighted sum of the input features, producing a value between 0 and 1 that is interpreted as the probability of the positive class. 🚀

In the context of **phishing email classification**, Logistic Regression helps predict whether an email is **phishing** or **safe** based on various features, such as the presence of certain words or patterns in the email.

---

### How Logistic Regression Works 🛠️

1. **Linear Combination of Features**:
   - Logistic Regression begins by calculating a **linear combination** of the input features using a set of weights and a bias (intercept) term.
   - This is similar to **linear regression** but with the goal of estimating the **probability** of the output belonging to one of two classes.

2. **Sigmoid Activation**:
   - The linear combination is then passed through the **sigmoid function**:
   
   $$
   \sigma(z) = \frac{1}{1 + e^{-z}}
   $$
   
   - Here, **z** is the linear combination of features (the weighted sum plus the bias). The sigmoid function squashes it into the range between 0 and 1, and the result is read as a probability score.

3. **Probability Threshold**:
   - Logistic Regression uses a **threshold** (typically 0.5) to decide the class:
     - If the probability is **≥ 0.5**, the model predicts **phishing**.
     - If the probability is **< 0.5**, the model predicts **safe**.
   
4. **Optimization**:
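   - The model learns its weights by minimizing the **log loss** (or **cross-entropy loss**) on the training data with an iterative optimizer such as **Gradient Descent**. A lower log loss means the predicted probabilities sit closer to the true labels. (Scikit-learn's default solver, LBFGS, is a quasi-Newton method rather than plain gradient descent, but it minimizes the same kind of loss.)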
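
The four steps above can be sketched from scratch in NumPy. This is a minimal illustration on made-up toy data, not the notebook's actual training procedure; the function names, learning rate, and step count are all assumptions chosen for readability.

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Steps 1 and 2: linear combination of features, then sigmoid."""
    return sigmoid(X @ w + b)

def log_loss(y, p, eps=1e-12):
    """Mean cross-entropy between labels y and predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit(X, y, lr=0.5, n_steps=500):
    """Step 4: plain batch gradient descent on the log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        error = predict_proba(X, w, b) - y   # gradient of the loss w.r.t. z
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b

# Toy data (made up): two features, label 1 when their sum is positive.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy.sum(axis=1) > 0).astype(float)

w, b = fit(X_toy, y_toy)
probs = predict_proba(X_toy, w, b)
preds = (probs >= 0.5).astype(int)           # Step 3: threshold at 0.5

initial_loss = log_loss(y_toy, np.full(len(y_toy), 0.5))  # loss of an untrained model
final_loss = log_loss(y_toy, probs)
print(f"log loss: {initial_loss:.3f} -> {final_loss:.3f}")
print(f"training accuracy: {(preds == y_toy).mean():.3f}")
```

The update rule uses the fact that the gradient of the log loss with respect to `z` is simply `probability - label`, which is why the loop body stays so short.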

---

### Advantages of Logistic Regression for Phishing Email Classification 📧✨

- **Interpretable**: Logistic Regression provides probability scores, and each learned coefficient shows how strongly a feature pushes an email toward phishing or safe.
- **Low Computational Cost**: It is less computationally intensive than more complex algorithms, making it suitable for large datasets.
- **Effective with Linearly Separable Data**: Works well when there is a clear linear separation between phishing and legitimate emails.
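
To make the interpretability point concrete, here is a small sketch that ranks features by the size of their learned coefficients. The data is synthetic and the feature names (`urgent`, `meeting`, `invoice`) are hypothetical placeholders, not columns from the notebook's TF-IDF matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["urgent", "meeting", "invoice"]  # hypothetical names

# Made-up data: only the first column actually drives the label.
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)
coefs = model.coef_[0]           # one coefficient per feature
odds_ratios = np.exp(coefs)      # multiplicative change in odds per unit increase

# Rank features by absolute coefficient size.
ranked = sorted(zip(feature_names, coefs), key=lambda t: abs(t[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:>8}: coef={coef:+.3f}")
```

A positive coefficient raises the predicted probability of class 1 as the feature grows, and a negative one lowers it. Comparing magnitudes this way is only meaningful when the features are on similar scales.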

---

### Potential Limitations:
- **Cannot Handle Complex Relationships**: Logistic Regression assumes a **linear relationship** between features and the log-odds of the outcome, which may not capture complex patterns in the data.
- **Sensitive to Outliers**: Outliers can heavily influence the weights, leading to skewed predictions if the data isn’t properly cleaned.
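
The linearity limitation can be illustrated with a synthetic XOR-style dataset, where the label depends on the product of two features rather than on either one alone. This sketch is purely illustrative (the data and the helper `holdout_accuracy` are made up); it shows that adding a hand-crafted interaction feature lets the same model separate the classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up data: label is 1 when both features share the same sign.
X_xor = rng.uniform(-1, 1, size=(1000, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

# Same data plus one interaction column x1 * x2.
X_inter = np.column_stack([X_xor, X_xor[:, 0] * X_xor[:, 1]])

def holdout_accuracy(X, y):
    """Fit on 80% of the rows and score on the remaining 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, random_state=0
    )
    return LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

acc_raw = holdout_accuracy(X_xor, y_xor)
acc_inter = holdout_accuracy(X_inter, y_xor)
print(f"raw features:        {acc_raw:.3f}")
print(f"with x1*x2 feature:  {acc_inter:.3f}")
```

On the raw features no straight line separates the classes, so accuracy stays near chance. Feature engineering of this kind is one common way to work around the limitation without switching models.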

Overall, Logistic Regression is a powerful yet simple tool for **binary classification** tasks like phishing email detection, especially when the relationship between features and classes is relatively straightforward. 🔍


### Implementation 🔍
1. **Loading the required libraries** 📚

In [6]:
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report

2. **Loading and splitting the Data** 📥

In [7]:
# Load the saved TF-IDF features and labels
x_data = np.load('../feature_x.npy')
y_data = np.load('../y_tf.npy')

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, train_size=0.8, random_state=0)
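
When the classes are imbalanced, it can help to keep the same class ratio in both splits. The sketch below shows the optional `stratify` argument of `train_test_split` on stand-in arrays; the array names and the 60/40 balance are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays (made up) with a 60/40 class balance.
rng = np.random.default_rng(0)
X_fake = rng.normal(size=(1000, 5))
y_fake = np.array([1] * 600 + [0] * 400)

# stratify=y_fake keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_fake, y_fake, train_size=0.8, random_state=0, stratify=y_fake
)

print("shapes:", X_tr.shape, X_te.shape)
print("positive share (train, test):", y_tr.mean(), y_te.mean())
```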

3. **Model Initialization** 🤖

**`LogisticRegression()`** is initialized with its **default parameters** in Scikit-learn:

- **`penalty="l2"`**: The default penalty parameter is **L2 regularization**. This adds a penalty term to the loss function to prevent overfitting by discouraging large coefficient values. It is the same squared-coefficient penalty that **Ridge Regression** applies to linear regression.

- **`solver="lbfgs"`**: By default, the model uses the **Limited-memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS)** solver. It is a memory-efficient quasi-Newton method that serves as a robust general-purpose choice and supports **multiclass classification**.

- **`max_iter=100`**: The maximum number of iterations the solver can take to converge. If the algorithm doesn’t converge within 100 iterations, it stops and emits a `ConvergenceWarning`; the fitted model is still returned, but its coefficients may not be fully optimized.

- **`C=1.0`**: This parameter is the **inverse of the regularization strength**. Smaller values of `C` imply **stronger regularization**. A `C` value of 1.0 balances fitting the training data against keeping the coefficients small.

- **`fit_intercept=True`**: The model calculates and adds an **intercept term** to the decision boundary equation by default. This intercept helps in shifting the decision boundary, making it more flexible.

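- **`random_state=None`**: The model does not use a fixed random seed by default. The seed is only used by the `sag`, `saga`, and `liblinear` solvers, which shuffle the data, so it has no effect with the default `lbfgs` solver. With those solvers, specifying a `random_state` makes results reproducible, which is useful for debugging.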
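
The effect of `C` described above can be checked empirically. This is a small sketch on synthetic data from `make_classification` (not the email features); it fits the model at three values of `C` and compares the size of the learned coefficient vector. `max_iter` is raised to 1000 here only to leave the solver plenty of room to converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the notebook's TF-IDF matrix.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_syn, y_syn)
    norms[C] = float(np.linalg.norm(clf.coef_))
    # n_iter_ reports how many solver iterations were actually used.
    print(f"C={C:>6}: ||coef|| = {norms[C]:.3f}, iterations = {clf.n_iter_[0]}")
```

Smaller `C` shrinks the coefficients toward zero, while larger `C` lets them grow to fit the training data more closely.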


In [10]:
logistic_reg = LogisticRegression()

4. **Training the Model** 🏋️‍♂️

In [11]:
logistic_reg.fit(x_train, y_train)

5. **Making Predictions** 🔮

In [12]:
pred_logistic_reg = logistic_reg.predict(x_test)
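
`predict` applies the default 0.5 cutoff. If false alarms are costly, the probabilities from `predict_proba` can be thresholded manually instead. The sketch below uses synthetic data and assumes label 1 marks the positive (phishing) class; the 0.8 cutoff is an arbitrary example, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the notebook's TF-IDF matrix.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)

proba_pos = clf.predict_proba(X_syn)[:, 1]        # probability of class 1
default_pred = clf.predict(X_syn)                 # implicit 0.5 cutoff
strict_pred = (proba_pos >= 0.8).astype(int)      # stricter, hand-picked cutoff

print("flagged at 0.5:", int(default_pred.sum()))
print("flagged at 0.8:", int(strict_pred.sum()))
```

Raising the threshold trades recall for precision: fewer emails are flagged, but those that are flagged carry higher predicted probabilities.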

6. **Evaluating the Model** 🧮

In [14]:
print(f"accuracy from logistic regression:{accuracy_score(y_test,pred_logistic_reg)*100:.5f} %")
print(f"f1 score from logistic regression: {f1_score(y_test, pred_logistic_reg)*100:.5f} %")
print("classification report : \n",classification_report(y_test,pred_logistic_reg))

accuracy from logistic regression:97.97605 %
f1 score from logistic regression: 98.36368 %
classification report : 
               precision    recall  f1-score   support

           0       0.98      0.96      0.97      1351
           1       0.98      0.99      0.98      2157

    accuracy                           0.98      3508
   macro avg       0.98      0.98      0.98      3508
weighted avg       0.98      0.98      0.98      3508
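
To see where the precision, recall, and F1 numbers in a classification report come from, they can be derived by hand from a confusion matrix. The labels below are a tiny made-up example, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Tiny made-up example: 1 = phishing, 0 = safe.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_hat = np.array([0, 1, 1, 1, 0, 0, 1, 1])

# For binary labels, ravel() yields (tn, fp, fn, tp) in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the emails flagged, how many were phishing
recall = tp / (tp + fn)      # of the phishing emails, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a phishing setting, false negatives (missed phishing emails) are often the costlier error, so recall on the phishing class deserves as much attention as overall accuracy.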

