# Phishing Website Detection Notebook

This notebook demonstrates data loading, preprocessing, feature extraction, model training, and saving of models used in the Flask app.

What this notebook contains
- Load dataset from `dataset/phishing_site_urls.csv`.
- Tokenize and stem URL text, build a bag-of-words using `CountVectorizer`.
- Train simple classifiers (Logistic Regression and Multinomial Naive Bayes).
- Evaluate models and save serialized artifacts: `vectorizer.pkl`, `Phishing.pkl`, `phishing_mnb.pkl`.
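
The steps above can be sketched end to end on a tiny sample. This is a minimal illustration, not the notebook's actual training run: the URLs and labels below are made up, and the `preprocess` helper is a hypothetical name introduced here to bundle tokenizing, stemming, and joining.

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tokenizer = RegexpTokenizer(r"[A-Za-z]+")
stemmer = SnowballStemmer("english")


def preprocess(url):
    """Keep alphabetic runs, stem each token, and rejoin with spaces."""
    return " ".join(stemmer.stem(tok) for tok in tokenizer.tokenize(url))


# Made-up URLs and labels, for illustration only (not taken from the dataset).
urls = [
    "secure-login.example.test/verify/account/update.php",
    "example.test/signin/confirm/password/login.php",
    "docs.example.org/tutorial/introduction.html",
    "news.example.org/articles/science/index.html",
]
labels = ["bad", "bad", "good", "good"]

# Bag-of-words features over the stemmed text.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform([preprocess(u) for u in urls])

model = LogisticRegression()
model.fit(features, labels)

# New URLs go through the same preprocessing before vectorizing.
new_features = vectorizer.transform([preprocess("example.test/login/verify.php")])
prediction = model.predict(new_features)
print(prediction)
```

Keeping the preprocessing in one function makes it harder for training and prediction to drift apart.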

Prerequisites
- Python 3.8+ with an activated virtual environment.
- Install dependencies:

```bash
pip install -r requirements.txt
pip install nltk wordcloud seaborn
```

Notes
- The notebook uses NLTK's `RegexpTokenizer` and `SnowballStemmer`, which work without any extra NLTK data. If you switch to a tokenizer such as `word_tokenize`, download its data first (for example, `nltk.download('punkt')`).
- After the training cells run, the notebook saves the pickled models to its working directory. The Flask app expects `vectorizer.pkl` and `Phishing.pkl` (or a renamed `phishing_model.pkl`) in the project root, so copy them there if the two locations differ.

Usage
1. Open this notebook in Jupyter or VS Code.
2. Run cells sequentially from top to bottom.
3. Verify the saved `.pkl` files exist before running the Flask app.
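
Step 3 can be automated with a short check. This is a minimal sketch: the file names come from this notebook, while the `missing_artifacts` helper and its default directory are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from this notebook; the directory to search is an assumption.
EXPECTED_ARTIFACTS = ("vectorizer.pkl", "Phishing.pkl")


def missing_artifacts(directory=".", names=EXPECTED_ARTIFACTS):
    """Return the expected files that are absent or empty in `directory`."""
    root = Path(directory)
    missing = []
    for name in names:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


absent = missing_artifacts()
print("All artifacts present." if not absent else f"Missing: {', '.join(absent)}")
```

An empty file is treated as missing because an interrupted save can leave a zero-byte pickle behind.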

Warnings
- Unpickling executes code embedded in the file, so a malicious pickle can run arbitrary code. Never load pickles from untrusted sources.
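
One way to reduce this risk is to record a SHA-256 digest when a pickle is saved and to refuse to load a file whose digest differs. The sketch below is not something this notebook does; `sha256_of` and `load_trusted_pickle` are hypothetical helpers, and the check only helps if the recorded digest is stored somewhere an attacker cannot modify.

```python
import hashlib
import pickle


def sha256_of(path):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_trusted_pickle(path, expected_sha256):
    """Unpickle `path` only if its digest matches the recorded one."""
    if sha256_of(path) != expected_sha256:
        raise ValueError(f"Digest mismatch for {path}: refusing to unpickle")
    with open(path, "rb") as f:
        return pickle.load(f)
```

A typical use would be to call `sha256_of('Phishing.pkl')` right after saving, keep the result alongside the deployment configuration, and pass it to `load_trusted_pickle` when the app starts.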

---


In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv('dataset/phishing_site_urls.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()


In [None]:
df.isnull().sum()

In [None]:
df.Label.value_counts()

In [None]:
from nltk.tokenize import RegexpTokenizer

In [None]:
tokenizer=RegexpTokenizer(r'[A-Za-z]+')

In [None]:
df.URL[0]

In [None]:
tokenizer.tokenize(df.URL[0])

In [None]:
df['text_tokenized'] = df.URL.map(lambda t: tokenizer.tokenize(t))

In [None]:
df.head()

In [None]:
from nltk.stem.snowball import SnowballStemmer

In [None]:
stemmer=SnowballStemmer('english')

In [None]:
df['text_stemmed'] = df['text_tokenized'].map(lambda l: [stemmer.stem(word) for word in l])

In [None]:
df.head()

In [None]:
df['text']= df['text_stemmed'].map(lambda l: ' '.join(l))

In [None]:
df.head()

In [None]:
good_sites=df[df.Label=='good']
bad_sites=df[df.Label=='bad']

In [None]:
good_sites.head()


In [None]:
all_text=' '.join(good_sites['text'].tolist())

In [None]:
from wordcloud import WordCloud

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
all_text = ' '.join(bad_sites['text'].tolist())

In [None]:
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(all_text)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
df.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv=CountVectorizer()

In [None]:
features=cv.fit_transform(df.text)

In [None]:
features[:5].toarray()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(features, df.Label)

### Model training


In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
l_model = LogisticRegression()

In [None]:
l_model.fit(x_train,y_train)

In [None]:
l_model.score(x_test,y_test)

In [None]:
l_model.score(x_train, y_train)

In [None]:
from sklearn.metrics import classification_report

In [None]:
print('\nCLASSIFICATION REPORT\n')
# classification_report expects (y_true, y_pred), in that order.
print(classification_report(y_test, l_model.predict(x_test), target_names=['Bad','Good']))

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# confusion_matrix(y_true, y_pred): rows are actual labels, columns are predictions.
con_mat = pd.DataFrame(confusion_matrix(y_test, l_model.predict(x_test)),
            columns = ['Predicted:Bad', 'Predicted:Good'],
            index = ['Actual:Bad','Actual:Good'])

In [None]:
import seaborn as sns

In [None]:
plt.figure(figsize= (6,4))
sns.heatmap(con_mat, annot=True, fmt='d', cmap="YlGnBu")
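
The orientation of the matrix depends on argument order: scikit-learn's `confusion_matrix` takes `(y_true, y_pred)` and puts true labels on the rows and predicted labels on the columns. The small example below uses made-up labels to show that swapping the arguments transposes the result, which would mislabel the `Actual`/`Predicted` axes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only.
y_true = ["bad", "bad", "bad", "good", "good"]
y_pred = ["bad", "bad", "good", "good", "good"]

# Rows = actual labels, columns = predicted labels.
matrix = confusion_matrix(y_true, y_pred, labels=["bad", "good"])
print(matrix.tolist())   # [[2, 1], [0, 2]]

# Swapping the arguments yields the transpose.
swapped = confusion_matrix(y_pred, y_true, labels=["bad", "good"])
print(swapped.tolist())  # [[2, 0], [1, 2]]
```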

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
mnb=MultinomialNB()

In [None]:
mnb.fit(x_train,y_train)

In [None]:
mnb.score(x_test, y_test)

### save model


In [None]:
import pickle

In [None]:
# Use a context manager so the file is flushed and closed.
with open('Phishing.pkl', 'wb') as f:
    pickle.dump(l_model, f)

In [None]:
with open('phishing_mnb.pkl', 'wb') as f:
    pickle.dump(mnb, f)

In [None]:
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(cv, f)
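
After saving, it can be worth confirming that the artifacts reload and predict the same way as the in-memory objects. The sketch below is self-contained and uses toy stand-ins for the notebook's `cv` and `mnb`, trained on made-up text and written to a temporary directory rather than the real output files.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the notebook's `cv` and `mnb`; the texts are made up.
texts = ["login verifi account", "doc tutori index",
         "secur login updat", "news articl scienc"]
labels = ["bad", "good", "bad", "good"]
cv = CountVectorizer()
mnb = MultinomialNB().fit(cv.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    # Save both objects, then load them back from disk.
    for name, obj in (("vectorizer.pkl", cv), ("phishing_mnb.pkl", mnb)):
        with open(Path(tmp) / name, "wb") as f:
            pickle.dump(obj, f)
    with open(Path(tmp) / "vectorizer.pkl", "rb") as f:
        cv_loaded = pickle.load(f)
    with open(Path(tmp) / "phishing_mnb.pkl", "rb") as f:
        mnb_loaded = pickle.load(f)

# The reloaded pair should reproduce the original predictions.
sample = ["login account updat"]
original = mnb.predict(cv.transform(sample))
reloaded = mnb_loaded.predict(cv_loaded.transform(sample))
round_trip_ok = bool((original == reloaded).all())
print(round_trip_ok)
```

The vectorizer and the model must be saved and loaded as a pair, since the model's feature columns are defined by the vectorizer's vocabulary.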

### Testing

In [None]:
URL=["yeniik.com./wp-admin/js/login.alibaba.com/login.jsp.php"]

In [None]:
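# Apply the same tokenizing and stemming used for training before vectorizing.
predictURL=cv.transform([' '.join(stemmer.stem(w) for w in tokenizer.tokenize(u)) for u in URL])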

In [None]:
l_model.predict(predictURL)