# <font color="#00adb5">👉👨‍💻 Catching Illegal Phishing ☠</font>
* Phishing is a cybercrime in which a target or targets are contacted by email, telephone or text message by someone posing as a legitimate institution to lure individuals into providing sensitive data such as personally identifiable information, banking and credit card details, and passwords.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from IPython.display import Image
import os
!ls ../input/

In [None]:
import pandas as pd

In [None]:
phishing_data = pd.read_csv('/kaggle/input/phising-urls/phishing_site_urls.csv')

In [None]:
phishing_data.head()

In [None]:
phishing_data.tail()

### <font color="#f21170">📊 Regarding Dataset</font>
* The dataset contains 549,346 unique entries.
* The `Label` column is the prediction target and has 2 categories:
1. Good - the URL does not contain malicious content and the site is not a phishing site.
2. Bad - the URL contains malicious content and the site is a phishing site.
* There are no missing values in the dataset.

In [None]:
# To get the information about 'phishing_site_urls.csv':
phishing_data.info()

In [None]:
# Check whether the dataset has any missing values:
phishing_data.isnull().sum()

In [None]:
# Now we create a DataFrame of class counts:
lbl_counts = pd.DataFrame(phishing_data.Label.value_counts())

In [None]:
# Now let's visualize the target column using seaborn,
# a high-level interface for drawing attractive and informative statistical graphics:
import seaborn as sns

In [None]:
sns.set_style('darkgrid')
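# Recent seaborn versions require x and y as keyword arguments.
# iloc[:, 0] selects the counts column whatever pandas names it.
sns.barplot(x=lbl_counts.index, y=lbl_counts.iloc[:, 0])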

### <font color="#ff8882"> Preprocessing</font>
* Now that we have the data, we have to vectorize the URLs. I'm using CountVectorizer with a tokenizer to gather words, since some words in URLs are more important than others, such as:
    1. -> Virus
    2. -> .exe
    3. -> .dat
* Let's convert the URLs into vector form.

### <font color="#1597bb">Regexp Tokenizer</font>
* A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.
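
The two modes named above, matching the tokens or matching the separators, can be sketched with Python's standard `re` module as a stand-in for NLTK's `RegexpTokenizer`. The sample URL is made up for illustration and does not come from the dataset.

```python
import re

# Hypothetical URL, used only for illustration.
url = "secure-login.example.com/update/account.php?id=42"

# Mode 1: the pattern describes the tokens (runs of letters).
tokens_by_match = re.findall(r"[A-Za-z]+", url)

# Mode 2: the pattern describes the separators (anything that is not a letter).
# Splitting can leave empty strings at the edges, so they are filtered out.
tokens_by_gap = [t for t in re.split(r"[^A-Za-z]+", url) if t]

print(tokens_by_match)
print(tokens_by_gap)
```

Both approaches should yield the same list here; which one is more convenient depends on whether the tokens or the separators are easier to describe.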

In [None]:
# We use Regexp tokenizers to split words from text:
from nltk.tokenize import RegexpTokenizer

In [None]:
# This expression matches only runs of alphabetic characters:
tokenizer = RegexpTokenizer(r'[A-Za-z]+')

In [None]:
# The raw URL contains many digits, symbols, dots, etc. that are not useful
# for our features, so we strip them and keep only the alphabetic strings.
print(phishing_data.URL[0]) # 0 is the first row

In [None]:
# This pulls every alphabetic string out of the URL:
clean_text = tokenizer.tokenize(phishing_data.URL[0]) 
print(clean_text)

### <font color="#ff1a75">⏰ Time module</font>
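
The cells that follow time each step with `time.time()` and the same few lines of bookkeeping. As an optional alternative, that pattern could be wrapped in a small context manager. `timed` is a hypothetical helper, not part of this notebook, and it uses `time.perf_counter()`, which is generally better suited to measuring elapsed intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the enclosed block took (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"Time required for {label}: {elapsed:.2f} sec")

# Usage on a stand-in workload
with timed("summing a range"):
    total = sum(range(1_000_000))
```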

In [None]:
# Now we tokenize all the URLs.
# We import time to measure the execution time:
import time
start = time.time()
phishing_data['text_tokenized'] = phishing_data.URL.map(lambda text: tokenizer.tokenize(text))
end = time.time()
time_req = end - start
formatted_time = "{:.2f}".format(time_req)
print(f"Time required to tokenize text is: \n{formatted_time} sec")

In [None]:
# Now let's check some sample results of converting URLs to tokenized text:
phishing_data.sample(7)

### <font color="#f875aa">Snowball Stemmer NLTK</font>
* Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval.

In [None]:
# Now we use Snowball stemmer to get the root words out of tokenized text:
from nltk.stem.snowball import SnowballStemmer

In [None]:
# I am using English for stemming; other languages are available:
sbs = SnowballStemmer("english")

# Measure the execution time of stemming the tokenized text:
start = time.time()
phishing_data['text_stemmed'] = phishing_data['text_tokenized'].map(lambda text: [sbs.stem(word) for word in text])
end = time.time()
time_req = end - start
formatted_time = "{:.2f}".format(time_req)
print(f"⏳ Time required for stemming all the tokenized text is: \n{formatted_time} sec")

In [None]:
# Now let's see the sample stemmed text:
phishing_data.sample(7)

In [None]:
# So, now we join the stemmed words together as a sentence:
start = time.time()
phishing_data['text_to_sent'] = phishing_data['text_stemmed'].map(lambda text: ' '.join(text))
end = time.time()
time_req = end - start
formatted_time = "{:.2f}".format(time_req)
print(f"Time required for joining text to sentence is: \n{formatted_time} sec")

In [None]:
# let's see some sample results of joined text to sentences:
phishing_data.sample(10)

### <font color="#845ec2">Visualization Part</font>
##### <font color="#3aa6c5">It is very important to know your data and visualize it to understand it better.</font>
1. Let's visualize some important keywords using WordCloud.

In [None]:
# First we slice the classes as:
phishing_sites = phishing_data[phishing_data.Label == 'bad']
not_phishing_sites = phishing_data[phishing_data.Label == 'good']

In [None]:
# Using head() function to see the top 5 URLs of phishing sites:
phishing_sites.head()

In [None]:
# Using head() function to see the top 5 URLs of Not phishing sites:
not_phishing_sites.head()

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from wordcloud import WordCloud
from wordcloud import STOPWORDS
from PIL import Image

In [None]:
# Now let's create a function to visualize the important words from URLs:
from wordcloud import ImageColorGenerator  # needed for the image_color branch

def my_wordcloud(text, mask=None, max_words=500, max_font_size=70, figure_size=(8.0, 10.0),
                title=None, title_size=70, image_color=False):
    
    # Extend the default stopwords with tokens that appear in almost every URL
    stopwords = set(STOPWORDS)
    my_stopwords = {'com', 'http'}
    stopwords = stopwords.union(my_stopwords)
    
    wordcloud = WordCloud(background_color='#fff', 
                         stopwords = stopwords,
                         max_words = max_words,
                         max_font_size = max_font_size,  # previously accepted but never used
                         random_state = 42,                            
                         mask = mask)
    
    wordcloud.generate(text)
    
    plt.figure(figsize=figure_size)
    
    if image_color:
        # Recolor the cloud using the colors of the mask image
        image_colors = ImageColorGenerator(mask)
        plt.imshow(wordcloud.recolor(color_func=image_colors),
                  interpolation='bilinear')
        
        plt.title(title,
                 fontdict={'size': title_size,
                          'verticalalignment': 'bottom'})
        
    else:
        plt.imshow(wordcloud)
        plt.title(title,
                 fontdict={'size': title_size,
                          'color': '#ff3333',
                          'verticalalignment': 'bottom'})

    plt.axis('off')
    plt.tight_layout()
    
# Directory containing the mask images
d = '../input/images/'
    

In [None]:
my_data = not_phishing_sites.text_to_sent
my_data.reset_index(drop=True, inplace=True)

In [None]:
# Join every row into one string; str(my_data) would only return the truncated Series preview
not_phishing_common_text = ' '.join(my_data)
common_mask = np.array(Image.open(d+'idea.png'))
my_wordcloud(not_phishing_common_text,
               common_mask,
               max_words=400, 
               max_font_size=50, 
               title = 'The most common words used in non-phishing URLs:',
               title_size=20)

In [None]:
# for phishing sites common words are:
my_data = phishing_sites.text_to_sent
my_data.reset_index(drop=True, inplace=True)

In [None]:
# Let's create a wordcloud for phishing sites.
# Join every row into one string; str(my_data) would only return the truncated Series preview
phishing_common_words = ' '.join(my_data)
common_mask = np.array(Image.open(d+'target.png'))
my_wordcloud(phishing_common_words, 
             common_mask,
             max_words=500, 
             max_font_size=20, 
             title='The most common words used in phishing URLs:', 
             title_size=20)

### <font color="#ff3366">Creating Model</font>
* CountVectorizer is used to transform a corpus of text into a vector of term / token counts.
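
To make "a vector of term / token counts" concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation; the two sample strings and the helper name `count_vectors` are made up for illustration.

```python
from collections import Counter

def count_vectors(docs):
    """Return (vocabulary, rows); each row counts the vocabulary terms in one doc."""
    tokenized = [doc.split() for doc in docs]
    # A sorted vocabulary gives every term a fixed column position.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

# Made-up stemmed "sentences", similar in shape to the text_to_sent column
docs = ["login secur bank login", "python doc tutori"]
vocab, rows = count_vectors(docs)
print(vocab)
for row in rows:
    print(row)
```

Each row has one column per vocabulary term, so most entries are zero once the vocabulary grows. That is why the real vectorizer returns a sparse matrix.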

In [None]:
# CountVectorizer builds a sparse matrix of token counts:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Now we create a CV object:
CV = CountVectorizer()

In [None]:
help(CountVectorizer())

In [None]:
# Transform all the text that we tokenized and stemmed:
feature = CV.fit_transform(phishing_data.text_to_sent)

In [None]:
# convert sparse matrix into array to print transformed features:
feature[:5].toarray()

In [None]:
# Used to split the data into training and test sets:
from sklearn.model_selection import train_test_split

# gives whole report about metrics (e.g, recall,precision,f1_score,c_m):
from sklearn.metrics import classification_report

# gives info about actual and predicted:
from sklearn.metrics import confusion_matrix

In [None]:
# Splitting the data:
train_X, test_X, train_Y, test_Y = train_test_split(feature, phishing_data.Label)

### <font color="#e6e600">Logistic Regression</font>
* Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
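
As a rough sketch of what "predicts P(Y=1) as a function of X" means, the model passes a weighted sum of the features through the sigmoid function. The weights, bias, and feature values below are invented for illustration and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """P(Y=1 | X) for one sample under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Made-up weights for three token-count features
weights = [1.5, -2.0, 0.5]
bias = -0.25
features = [2, 0, 1]  # token counts for one hypothetical URL

p = predict_proba(weights, bias, features)
label = 1 if p >= 0.5 else 0  # threshold the probability at 0.5
print(f"P(Y=1) = {p:.3f}, predicted label = {label}")
```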

In [None]:
# Algorithm used to predict whether a site is a phishing site or not:
from sklearn.linear_model import LogisticRegression

In [None]:
# create an object for Logistic Regression()
lr = LogisticRegression()

In [None]:
lr.fit(train_X, train_Y)

In [None]:
# Compute the accuracy on the test set:
lr.score(test_X, test_Y)

### <font color="#00ffaa">LR Score</font>
* Logistic Regression gives us 96% accuracy. Now we will store the scores in a dict to see which model performs best.

In [None]:
Score_ml = {}
Score_ml['Logistic Regression'] = np.round(lr.score(test_X, test_Y), 2)

In [None]:
print('Training Accuracy: ',lr.score(train_X, train_Y))
print('Testing Accuracy: ',lr.score(test_X, test_Y))
# Create the confusion matrix. sklearn expects (y_true, y_pred),
# so rows are actual labels and columns are predicted labels:
conf_mat = pd.DataFrame(confusion_matrix(test_Y, lr.predict(test_X)),
                       columns = ['Predicted: Phishing', 'Predicted: Not Phishing'],
                       index = ['Actual: Phishing', 'Actual: Not Phishing'])

print('\nClassification Report: \n')
print(classification_report(test_Y, lr.predict(test_X),
                           target_names = ['Bad', 'Good']))

print('\nconfusion Matrix: \n')
plt.figure(figsize = (6, 4))
sns.heatmap(conf_mat, annot = True, fmt='d', cmap="RdYlBu")
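
The figures in a classification report can be derived from the four cells of a binary confusion matrix. Here is a minimal sketch, treating "phishing" as the positive class; the counts are invented for illustration and are not this notebook's results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the URLs flagged, how many were phishing
    recall = tp / (tp + fn)      # of the phishing URLs, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Invented counts: 90 phishing URLs caught, 10 missed,
# 5 legitimate URLs flagged by mistake, 895 correctly passed.
accuracy, precision, recall, f1 = binary_metrics(tp=90, fp=5, fn=10, tn=895)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

When the classes are imbalanced, accuracy alone can look high while recall on the phishing class is noticeably lower, so it helps to read all four numbers together.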

### <font color="#00bfff">MultinomialNB</font>
* Multinomial Naive Bayes is commonly applied to NLP problems. Naive Bayes classifiers are a family of probabilistic algorithms based on applying Bayes' theorem with the “naive” assumption of conditional independence between every pair of features.
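
The idea can be sketched in a few lines of pure Python: estimate a prior per class, estimate a smoothed probability per token and class, then add the log-probabilities of the tokens in a new sample. This is an illustrative sketch with Laplace smoothing (`alpha=1.0`), not scikit-learn's implementation; the tiny training set and the function names are made up.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Fit log-priors and Laplace-smoothed log-likelihoods from tokenized docs."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((token_counts[c][t] + alpha) / total)
                       for t in vocab}
    return log_prior, log_like

def predict_mnb(model, doc):
    """Return the class with the highest posterior log-score."""
    log_prior, log_like = model
    scores = {}
    for c in log_prior:
        # Tokens that never appeared in training are skipped.
        scores[c] = log_prior[c] + sum(log_like[c][t] for t in doc
                                       if t in log_like[c])
    return max(scores, key=scores.get)

# Made-up tokenized URLs and labels
docs = [["login", "verifi", "account"], ["account", "updat", "login"],
        ["python", "doc"], ["video", "watch", "python"]]
labels = ["bad", "bad", "good", "good"]

model = train_mnb(docs, labels)
print(predict_mnb(model, ["login", "account"]))
print(predict_mnb(model, ["python", "video"]))
```

Summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow when a sample contains many tokens.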

In [None]:
# NLP algorithm used to predict whether a site is a phishing site or not:
from sklearn.naive_bayes import MultinomialNB

In [None]:
# create mnb object
mnb = MultinomialNB()

In [None]:
mnb.fit(train_X, train_Y)

In [None]:
mnb.score(test_X, test_Y)

### <font color="#ff5500">MultinomialNB Score</font>
* MultinomialNB gives us 95% accuracy. Now we will store the scores in a dict to see which model performs best.

In [None]:
Score_ml['MultinomialNB'] = np.round(mnb.score(test_X, test_Y), 2)

In [None]:
print('Training Accuracy: ',mnb.score(train_X, train_Y))
print('Testing Accuracy: ',mnb.score(test_X, test_Y))

# sklearn expects (y_true, y_pred): rows are actual labels, columns are predicted labels
conf_mat = pd.DataFrame(confusion_matrix(test_Y, mnb.predict(test_X)),
                       columns = ['Predicted: Phishing', 'Predicted: Not Phishing'],
                       index = ['Actual: Phishing', 'Actual: Not Phishing'])

print('\nClassification Report\n')
print(classification_report(test_Y, mnb.predict(test_X),
                           target_names = ['Bad', 'Good']))

print('\nConfusion Matrix\n')
plt.figure(figsize = (6,4))
sns.heatmap(conf_mat, annot = True, fmt='d', cmap='PuBuGn_r')

In [None]:
results = pd.DataFrame.from_dict(Score_ml, 
                                 orient = 'index', 
                                 columns = ['Accuracy'])

print(f"Actual Score of Logistic Regression: \n{lr.score(test_X, test_Y)}\n")
print(f"Actual Score of MultinomialNB: \n{mnb.score(test_X, test_Y)}\n")
print(f"Final Rounded Score: \n{results}")

sns.set_style('darkgrid')
# Recent seaborn versions require x and y as keyword arguments
sns.barplot(x=results.index, y=results.Accuracy)

### <font color="#ff5500">Best fit Model</font>
* The results above show that Logistic Regression is the better of the two models, with a test accuracy of 96%.
* So now we build an sklearn pipeline using Logistic Regression.

In [None]:
# Used for combining all preprocessor techniques and algorithms:
from sklearn.pipeline import make_pipeline

In [None]:
# Making a pipeline:
pipeline_ls = make_pipeline(CountVectorizer(tokenizer = RegexpTokenizer(r'[A-Za-z]+').tokenize, stop_words='english'), LogisticRegression())

In [None]:
train_X, test_X, train_Y, test_Y = train_test_split(phishing_data.URL, phishing_data.Label)

In [None]:
pipeline_ls.fit(train_X, train_Y)

In [None]:
pipeline_ls.score(test_X, test_Y)

In [None]:
print("Training Accuracy: ",pipeline_ls.score(train_X, train_Y))
print("Testing Accuracy: ",pipeline_ls.score(test_X, test_Y))

# sklearn expects (y_true, y_pred): rows are actual labels, columns are predicted labels
conf_mat = pd.DataFrame(confusion_matrix(test_Y, pipeline_ls.predict(test_X)), 
                       columns = ["Predicted: Phishing", "Predicted: Not Phishing"],
                       index = ["Actual: Phishing", "Actual: Not Phishing"])

print("\nClassification Report \n")
print(classification_report(test_Y, pipeline_ls.predict(test_X),
                            target_names = ['Bad', 'Good']))

print("\nConfusion Matrix \n")
plt.figure(figsize = (6,4))
sns.heatmap(conf_mat, annot = True, fmt = 'd', cmap="Blues")


In [None]:
import pickle
# Save the fitted pipeline; the with-block closes the file even if dump fails
with open('phishing.pkl', 'wb') as f:
    pickle.dump(pipeline_ls, f)

In [None]:
# Reload the pipeline and confirm it scores the same on the test set
with open('phishing.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(test_X,test_Y)
print(result)

# <font color="#009999">🙌Conclusion</font>
* So now we get an accuracy of 96%. That's a very high value for a machine detecting phishing sites. Want to test some links to see if the model gives good predictions? Sure. Let's do it.

##### <font color="#ff0000">❌ Bad Links</font> 
* __These are some Phishing links!__
   1. www.yeniik.com.tr/wp-admin/js/login.alibaba.com/login.jsp.php
   2. www.fazan-pacir.rs/temp/libraries/ipad
   3. www.tubemoviez.exe
   4. www.svision-online.de/mgfi/administrator/components/com_babackup/classes/fx29id1.txt
   

##### <font color="#77ff33">✔ Good Links</font>
* __These are some non-phishing links!__
   1. www.youtube.com/
   2. www.python.org/
   3. www.google.com/
   4. www.kaggle.com/
   

In [None]:
predict_bad = ['www.yeniik.com.tr/wp-admin/js/login.alibaba.com/login.jsp.php',
               'www.fazan-pacir.rs/temp/libraries/ipad',
               'www.tubemoviez.exe/',
               'www.svision-online.de/mgfi/administrator/components/com_babackup/classes/fx29id1.txt']

predict_good = ['www.youtube.com/',
                'www.python.org/',
                'www.google.com/',
                'www.kaggle.com/']

# The pipeline vectorizes raw URLs itself, so no separate transform step is needed
with open('phishing.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

result_1 = loaded_model.predict(predict_bad)
result_2 = loaded_model.predict(predict_good)

In [None]:
print(f"{result_1} \n {'-'*26} \n{result_2}")

# `Project Terminated Here... 😎`