# Phishing Site Detector

Dataset from https://www.kaggle.com/taruntiwarihp/phishing-site-urls

The task is to build a model that predicts whether a given URL belongs to a phishing site.

Place the dataset file (`phishing_site_urls.csv`) in the same folder as this notebook so that the relative path in the first cell resolves.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


data = pd.read_csv("phishing_site_urls.csv")
data.head()

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.shape

In [None]:
import seaborn as sns
sns.countplot(x="Label",data=data)

* We need to strip all special characters from each URL and split it into a list of words (tokens). NLTK's `RegexpTokenizer` does this.
* We then reduce each token to its English word stem, so that related forms of a word map to the same token.
* After these steps, each URL is represented as a Python list of words, which is awkward to convert into numbers directly.
* So we join each list back into a single space-separated string that looks like a sentence (it may not make sense to a human, but the vectorizer can work with it).
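
The steps above can be sketched on a single URL. This is a minimal illustration, not the notebook's own pipeline: the URL is made up, `re.findall` stands in for NLTK's `RegexpTokenizer` with the same `[A-Za-z]+` pattern, and stemming is left out so that only the standard library is needed.

```python
import re

# Hypothetical URL, used only for illustration.
url = "http://example.com/login-update/account123.php"

# Keep runs of letters only; digits and punctuation act as separators.
tokens = re.findall(r"[A-Za-z]+", url)

# Join the token list back into one space-separated "sentence".
sentence = " ".join(tokens)

print(tokens)
print(sentence)
```

The digits in `account123` are dropped entirely, which is a consequence of the letters-only pattern.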

In [None]:
# Remove special characters with a regex and split each URL into tokens (a list of words)
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[A-Za-z]+')
data['tokens'] = data.URL.map(lambda t: tokenizer.tokenize(t))

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
data['english_words'] = data['tokens'].map(lambda l: [stemmer.stem(word) for word in l])

Sidenote: a lambda is an anonymous function limited to a single expression.
Instead of defining a named function with the `def` keyword, we can write a short lambda inline, which is handy for `map` calls like the ones above.
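
As a small illustration of the sidenote, the two functions below behave the same; one is written with `def` and the other as a lambda. The names `token_count`, `token_count_lambda`, and `rows` are hypothetical and do not appear elsewhere in this notebook.

```python
def token_count(tokens):
    """Return the number of tokens in a list."""
    return len(tokens)

# The same function written as a one-line lambda.
token_count_lambda = lambda tokens: len(tokens)

rows = [["http", "example", "com"], ["login", "php"]]

# map applies the function to every element, as pandas' .map does per row.
counts_def = list(map(token_count, rows))
counts_lambda = list(map(token_count_lambda, rows))

print(counts_def)
print(counts_lambda)
```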

In [None]:
data.head()

In [None]:
# Join each list of stemmed words into a single space-separated string.
data['sentences'] = data['english_words'].map(lambda l: ' '.join(l))

### Machine learning algorithm

We convert the `sentences` column into numeric vectors (the format our model can work with) and split the data into training and test sets using `train_test_split`.

* `CountVectorizer` transforms a collection of texts into vectors of word counts: it builds a vocabulary from all the texts, then records how often each vocabulary word occurs in each text.
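
To make the counting concrete, here is a hand-rolled sketch of the idea using only the standard library. It is not how `CountVectorizer` is implemented, and it skips details such as lowercasing and the sparse output format; the two example sentences are made up.

```python
from collections import Counter

# Hypothetical "sentences" of the kind produced by the preprocessing steps.
sentences = ["paypal login secure login", "google search"]

# Vocabulary: every distinct word across all sentences, in sorted order.
vocabulary = sorted({word for s in sentences for word in s.split()})

# One row per sentence, one column per vocabulary word, holding counts.
vectors = []
for s in sentences:
    counts = Counter(s.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in vectors:
    print(row)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows, which is why a sparse representation is the usual choice for real data.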

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

feature = CountVectorizer().fit_transform(data.sentences)

In [None]:
# Split the data into training (85%) and test (15%) sets
X_train, X_test, y_train, y_test = train_test_split(feature, data.Label, test_size=0.15)

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LogisticRegression
np.random.seed(45)
model = LogisticRegression()
model.fit(X_train,y_train)
accuracy = model.score(X_test, y_test)
print(f'The model scores {accuracy*100} %')

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.ensemble import RandomForestClassifier
np.random.seed(45)
randomforestmodel = RandomForestClassifier()
randomforestmodel.fit(X_train,y_train)
accuracy = randomforestmodel.score(X_test, y_test)
print(f'The random forest model scores {accuracy*100} %')

In [None]:
from sklearn.tree import DecisionTreeClassifier
np.random.seed(45)
decisionmodel = DecisionTreeClassifier()
decisionmodel.fit(X_train,y_train)
accuracy = decisionmodel.score(X_test, y_test)
print(f'The decision tree model scores {accuracy*100} %')

In [None]:
from sklearn.svm import SVC
np.random.seed(45)
svcmodel = SVC()
svcmodel.fit(X_train,y_train)
accuracy = svcmodel.score(X_test, y_test)
print(f'The SVC model scores {accuracy*100} %')

In [None]:
from sklearn.naive_bayes import MultinomialNB
np.random.seed(45)
multimodel = MultinomialNB()
multimodel.fit(X_train,y_train)
accuracy = multimodel.score(X_test, y_test)
print(f'The multinomial model scores {accuracy*100} %')

In [None]:
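from sklearn.naive_bayes import GaussianNB
np.random.seed(45)
gaussmodel = GaussianNB()
# GaussianNB needs dense arrays, but CountVectorizer returns a sparse matrix,
# so convert first. Caution: densifying the full dataset may not fit in memory;
# fitting on a smaller sample of rows is one way around that.
gaussmodel.fit(X_train.toarray(), y_train)
accuracy = gaussmodel.score(X_test.toarray(), y_test)
print(f'The gauss model scores {accuracy*100} %')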

In [None]:
from sklearn.neighbors import KNeighborsClassifier
np.random.seed(45)
knmodel = KNeighborsClassifier()
knmodel.fit(X_train,y_train)
accuracy = knmodel.score(X_test, y_test)
print(f'The knn model scores {accuracy*100} %')

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
np.random.seed(45)
gbmodel = GradientBoostingClassifier()
gbmodel.fit(X_train,y_train)
accuracy = gbmodel.score(X_test, y_test)
print(f'The gradient boosting model scores {accuracy*100} %')

In [None]:
x_new  = 'https://dlscordapp.codes/billing/promotions/rJSuZk5ySk6Sf6qnk4v9bHEG/'


In [None]:
from sklearn.pipeline import make_pipeline
api_pipeline = make_pipeline(CountVectorizer(tokenizer = RegexpTokenizer(r'[A-Za-z]+').tokenize,stop_words='english'), LogisticRegression())

In [None]:
trainX, testX, trainY, testY = train_test_split(data.URL, data.Label)

In [None]:
api_pipeline.fit(trainX,trainY)

In [None]:
api_pipeline.score(testX,testY)

In [None]:
import pickle
# Save the fitted pipeline; the with-block closes the file when done.
with open('phishing.pkl', 'wb') as f:
    pickle.dump(api_pipeline, f)

In [None]:
# Load the saved pipeline and classify a URL we expect to be legitimate.
with open('phishing.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.predict(['https://www.linkedin.com'])
print(result)

In [None]:
result = loaded_model.predict(['https://dlscordapp.codes/billing/promotions/rJSuZk5ySk6Sf6qnk4v9bHEG/'])
print(result)

In [None]:
type(result)