## Saadallah Itani.

I will use a KNN (k-nearest neighbors) model on the phishing email dataset to classify whether an email is safe or phishing.

In [2]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score



### **Preprocessing the data.**

First, we download the dataset from Kaggle.
<br> Then we preprocess the data by dropping the rows with missing email text, so that they do not affect the training of our model. <br> Finally, we lowercase all the text in the content column to keep the data consistent.

In [3]:
os.environ['KAGGLE_CONFIG_DIR'] = "/content/"
!kaggle datasets download -d subhajournal/phishingemails -q
!unzip -o -q phishingemails.zip
!rm phishingemails.zip;

df = pd.read_csv('Phishing_Email.csv', header=0, names=['number', 'content', 'label'])

# Preprocessing the data:
# Handle missing values by dropping the rows where the email text is missing
df = df.dropna(subset=['content'])

# Normalize the text by lowercasing the content of the emails. String
# comparison is case-sensitive ("A" and "a" are different strings), so
# lowercasing makes words that differ only in case count as the same token.
df['content'] = df['content'].str.lower()

df.head()  # what the dataset looks like




|   | number | content | label |
|---|---|---|---|
| 0 | 0 | re : 6 . 1100 , disc : uniformitarianism , re ... | Safe Email |
| 1 | 1 | the other side of * galicismos * * galicismo *... | Safe Email |
| 2 | 2 | re : equistar deal tickets are you still avail... | Safe Email |
| 3 | 3 | \nhello i am your hot lil horny toy.\n i am... | Phishing Email |
| 4 | 4 | software at incredibly low prices ( 86 % lower... | Phishing Email |
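
The two preprocessing steps can be checked on a tiny example. The sketch below uses a made-up three-row DataFrame (not the real dataset) and assumes the same `content` and `label` column names as above:

```python
import pandas as pd

# Toy stand-in for the real dataset; these rows are invented for illustration.
toy = pd.DataFrame({
    "content": ["Hello WORLD", None, "Click HERE to claim your prize"],
    "label": ["Safe Email", "Safe Email", "Phishing Email"],
})

# Drop rows whose email text is missing, working on an explicit copy,
# then lowercase the text that remains.
clean = toy.dropna(subset=["content"]).copy()
clean["content"] = clean["content"].str.lower()

print(len(toy), "->", len(clean))   # number of rows before and after dropping
print(clean["content"].tolist())
```

The row with the missing text disappears, and the remaining text no longer depends on how the sender capitalized it.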


### **Vectorization.**

Next, we convert the text into numerical data using TF-IDF, because machine learning models work with numerical data. This process is called vectorization. It is especially important for KNN, which classifies a data point based on its nearest neighbors in the feature space: vectorization determines the features used to calculate the distances between points. For example, with TF-IDF vectorization, words that are more important for distinguishing documents (i.e., have higher TF-IDF scores) have a greater influence on the distance calculations and, consequently, on the classification outcome.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorization converts text into numerical data. TF-IDF (Term Frequency-Inverse
# Document Frequency) gives a weight to each term in a document, highlighting
# terms that are distinctive for that document and down-weighting terms that
# are common across all documents. This is crucial for text classification.
vectorizer = TfidfVectorizer(max_features=1500, ngram_range=(1, 2))  # Keep the 1500 most frequent unigrams and bigrams
X = vectorizer.fit_transform(df['content'])
y = df['label']

## Splitting the dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
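
To illustrate how TF-IDF features drive the distance calculation, here is a minimal sketch on three made-up messages (not taken from the dataset). It uses `cosine_distances` from `sklearn.metrics.pairwise` and a default `TfidfVectorizer`, which is an assumption for the example rather than the configuration used above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Made-up example messages, chosen only for illustration.
docs = [
    "verify your account password now",
    "urgent verify your password",
    "meeting notes attached for review",
]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(docs)   # sparse matrix, one row per document

dist = cosine_distances(X_toy)    # pairwise cosine distances, shape (3, 3)
print(dist.round(2))
```

The first two messages share several terms, so their distance is well below 1. The third message shares no vocabulary with the others, so its cosine distance to each of them is 1, the maximum for non-negative TF-IDF vectors. A KNN classifier would therefore treat the first two messages as neighbors.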



### **Identifying the Best Parameters.**

The main goal of the grid search is to identify the combination of parameters that yields the best performance for our KNN model. This saves us from searching by hand for the combination that results in the highest accuracy. GridSearchCV also uses cross-validation: with cv=5 it splits the training data into 5 folds and, for each parameter combination, trains on 4 folds and validates on the remaining one, rotating until every fold has served as the validation set.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors': [3,4,5,6],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'cosine']
}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters
print(grid_search.best_params_)

## Note: this may take a long time and may require significant computational resources
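
To get a feel for why the search is expensive, the number of model fits can be counted before running it. This is a small sketch using `ParameterGrid` from scikit-learn with the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {
    "n_neighbors": [3, 4, 5, 6],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "cosine"],
}

n_candidates = len(ParameterGrid(param_grid))   # 4 * 2 * 3 combinations
cv_folds = 5
n_fits = n_candidates * cv_folds                # one fit per candidate per fold

print(n_candidates, "candidates,", n_fits, "fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the whole training set once the search is done.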

### **Training and accuracy.**

Finally, we train our model with the best parameters found by the grid search and check its accuracy on the test set.

In [13]:
knn = KNeighborsClassifier(n_neighbors=6, weights='distance', metric='cosine')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

                precision    recall  f1-score   support

Phishing Email       0.92      0.93      0.92      1518
    Safe Email       0.95      0.94      0.95      2209

      accuracy                           0.94      3727
     macro avg       0.93      0.94      0.93      3727
  weighted avg       0.94      0.94      0.94      3727

Accuracy: 0.9369466058492085


We achieved an accuracy of 93.7%, which is very good. One reason it is not higher is that k-nearest neighbors (KNN) can be less accurate than other algorithms on text classification tasks. Several factors contribute to this, including the high dimensionality of text data, sensitivity to the choice of parameters, and sensitivity to noisy features.
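
The final model uses `weights='distance'` with `metric='cosine'`. As a rough illustration of what that combination means, here is a minimal NumPy sketch of a distance-weighted vote. The helper names (`cosine_distance`, `weighted_knn_predict`), the small epsilon, and the 2-D feature vectors are all hypothetical and chosen for the example; this is a simplification, not scikit-learn's actual implementation:

```python
import numpy as np
from collections import defaultdict

def cosine_distance(a, b):
    """1 - cosine similarity between two dense, non-zero vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def weighted_knn_predict(X_train, y_train, x, k):
    """Predict the label of x by a distance-weighted vote among its k nearest neighbors."""
    dists = np.array([cosine_distance(row, x) for row in X_train])
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # Closer neighbors get larger weights (1 / distance).
        # The epsilon is an assumption of this sketch to avoid division by zero.
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-12)
    return max(votes, key=votes.get)

# Hypothetical 2-D feature vectors, chosen only for illustration.
X_demo = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
y_demo = ["Phishing Email", "Phishing Email", "Safe Email", "Safe Email", "Safe Email"]

query = np.array([0.8, 0.2])
print(weighted_knn_predict(X_demo, y_demo, query, k=5))
```

With `k=5`, a uniform vote would pick "Safe Email" (three neighbors against two). The distance-weighted vote picks "Phishing Email" instead, because the two phishing points are far closer to the query and their weights dominate.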