# Project 3 

## Requirements

In [None]:
%%capture
!pip install --upgrade pip
!pip install pycaret
!pip install spacy
!python -m spacy download en

## Overall Process

Here are the general steps to classify a large number of webpages using pycaret:

1. Collect and preprocess the data: Collect the webpages to classify and extract the relevant information from them. This may involve cleaning the text, removing stop words, and transforming the data into a format that pycaret can use.
1. Load the data into a pandas DataFrame: Load the preprocessed data into a pandas DataFrame.
1. Split the data into training and testing sets: Split the data with the train_test_split() function from sklearn.model_selection, or let setup() hold out a test set through its train_size parameter.
1. Set up the pycaret environment: Initialize the pycaret environment by passing the DataFrame to the setup() function, which preprocesses the data and prepares it for modeling.
1. Train and compare multiple models: Call compare_models() to train several classification models, score them with cross-validation, and return the one with the best performance metric.
1. Tune the selected model: Use the tune_model() function to search the selected model's hyperparameters and improve its performance.
1. Evaluate the model: Inspect the tuned model with evaluate_model(), which displays diagnostic plots, and score it on the held-out test set by calling predict_model() without new data.
1. Use the model to classify new data: Once the final model is trained and evaluated, pass new webpages to predict_model() to classify them.
1. Save the model: Save the trained model to a file using the save_model() function so that it can be reused later.
1. Deploy the model: Deploy the trained model in a production environment and use it to classify webpages as needed.
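
As a rough illustration, steps 4 to 9 of the list above could be chained as in the following sketch. The function name `classify_pages`, its parameters, the `session_id` value, and the default file name are hypothetical choices made for this example; only the `pycaret.classification` functions come from the library. The import sits inside the function so that the definition loads even where pycaret is not installed.

```python
def classify_pages(train_df, new_df, target="classification",
                   model_path="webpage_classifier"):
    """Hypothetical helper sketching steps 4 to 9 with pycaret.classification."""
    # Imported lazily so this definition does not require pycaret at load time.
    from pycaret.classification import (
        setup, compare_models, tune_model,
        predict_model, finalize_model, save_model,
    )

    # Step 4: initialize the environment; setup() also holds out a test split.
    setup(data=train_df, target=target, train_size=0.85, session_id=42)

    # Step 5: cross-validate the available models and keep the best one.
    best = compare_models()

    # Step 6: search the best model's hyperparameters.
    tuned = tune_model(best)

    # Step 7: with no data argument, predict_model() scores the hold-out split.
    predict_model(tuned)

    # Refit on all labeled rows, then save (step 9) and predict (step 8).
    final = finalize_model(tuned)
    save_model(final, model_path)
    return predict_model(final, data=new_df)
```

Calling `classify_pages(train_df, new_df)` would return `new_df` with the predictions appended by pycaret. In older pycaret releases `setup()` may pause to ask for confirmation of the inferred column types.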

## Preprocessing

The preprocessing steps for webpages vary with the requirements of the project, but some common steps include:

1. Retrieving the raw HTML content of each webpage using a web scraping framework such as Scrapy (BeautifulSoup parses HTML that has already been downloaded, so it needs a separate HTTP client).
1. Cleaning the HTML content by removing HTML tags, script and style elements together with their contents, and other unwanted content using an HTML parsing library or regular expressions.
1. Tokenizing the cleaned text into words or phrases using a natural language processing library such as NLTK or spaCy.
1. Normalizing the tokens by converting them to lowercase, removing punctuation, and removing stop words (common words that do not add meaning to the text).
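
The cleaning step can be sketched with nothing but the Python standard library. The example below is a minimal stand-in for BeautifulSoup, not the method used to produce `data.csv`; the class name `TextExtractor`, the helper `html_to_text`, and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or style element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample page for illustration.
sample = (
    "<html><head><title>Debtwire Login</title>"
    "<style>p{color:red}</style></head>"
    "<body><script>var x = 1;</script><p>Remember me</p></body></html>"
)
print(html_to_text(sample))
```

A dedicated parser copes better with malformed markup than this sketch, so it is meant only to show which parts of a page survive the cleaning step.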

### Preprocessing Text Data

To preprocess raw text from webpages for classification with pycaret, you can follow these steps:

1. Strip the HTML tags from the text using a library such as BeautifulSoup.
1. Tokenize the text into individual words using nltk.
1. Remove stop words and punctuation marks from the tokens using the nltk library.
1. Apply stemming or lemmatization to reduce each word to its root form.
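
To make the tokenizing and stop word steps concrete, here is a small standard-library sketch. The `STOP_WORDS` set is a tiny illustrative list rather than the nltk stop word corpus, and the function name `normalize` is hypothetical. Stemming and lemmatization are left to nltk or spaCy and are not shown.

```python
import re

# Tiny illustrative stop word list; nltk ships a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for"}


def normalize(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    # The regular expression discards punctuation while tokenizing.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


tokens = normalize("Join our Rewards Program for more opportunities to win!")
print(tokens)
```

The simple pattern splits contractions such as "don't" into two tokens, which is one reason to prefer a real tokenizer on production data.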

## Conversion

1. Convert the preprocessed text into a numerical representation using a technique such as bag of words, TF-IDF, or word embeddings.
1. Assemble the result into a document-term matrix or another feature representation that can be used as input to a machine learning algorithm.
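
The sketch below shows what a bag-of-words document-term matrix and a TF-IDF weighting look like on two toy documents. It uses the plain textbook formula tf × log(N / df); libraries such as scikit-learn apply smoothed variants, so their numbers will differ. The function names and the toy token lists are assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, document-term matrix) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocab, matrix


def tf_idf(matrix):
    """Weight raw counts by term frequency times log(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row) or 1  # avoid dividing by zero for empty documents
        weighted.append([
            (row[j] / total) * math.log(n_docs / df[j]) if df[j] else 0.0
            for j in range(n_terms)
        ])
    return weighted


# Toy documents, already tokenized and lowercased.
docs = [
    ["debtwire", "login", "debtwire", "login"],
    ["revenuestripe", "powerinbox", "learn", "revenuestripe"],
]
vocab, dtm = bag_of_words(docs)
weights = tf_idf(dtm)
print(vocab)
print(dtm)
```

Each row of `dtm` is one document and each column one vocabulary term, which is the shape a classifier expects as input.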

In [3]:
#@title Data Source
datafile = "./sample_data/data.csv" #@param {type:"string"}


In [11]:
import pandas as pd

# Read the CSV containing the extracted data from the webpages
df = pd.read_csv(datafile)

# Show a sample of the data
df.head()

Unnamed: 0,title_raw,has_form,has_login_form,has_js,js_include_b64,text_clean,classification
0,debtwire login,True,True,True,False,Debtwire Login Debtwire Login DebtwireCreated ...,benign
1,what is revenuestripe ? - powerinbox,False,False,True,False,RevenueStripe PowerInbox Learn RevenueStripe,benign
2,newmark knight frank ( @ newmarkkf ) | twitter,True,True,True,False,Newmark Knight Frank Newmarkkf Twitter 've det...,benign
3,join our rewards program for more opportunitie...,False,False,True,False,Join Rewards Program Opportunities Win Home Re...,benign
4,debtwire login,True,True,True,False,Debtwire Login Debtwire Login DebtwireCreated ...,benign


## Embedding

In [12]:
from pycaret.nlp import *

# Initialize the pycaret NLP environment; here `target` names the text column to model
clsf = setup(data=df, target='text_clean')

Description,Value
session_id,8210
Documents,23
Vocab Size,939
Custom Stopwords,False


INFO:logs:setup() succesfully completed......................................


### Latent Dirichlet Allocation (LDA) technique

In [14]:
# Fit the LDA topic model, then attach each document's topic weights to the data
m_lda = create_model(model='lda', multi_core=True)
d_lda = assign_model(m_lda)
d_lda.head()

INFO:logs:(23, 13)
INFO:logs:assign_model() succesfully completed......................................


Unnamed: 0,title_raw,has_form,has_login_form,has_js,js_include_b64,text_clean,classification,Topic_0,Topic_1,Topic_2,Topic_3,Dominant_Topic,Perc_Dominant_Topic
0,debtwire login,True,True,True,False,debtwirecreate cap lock remember forget passwo...,benign,0.971727,0.009361,0.009451,0.009461,Topic 0,0.97
1,what is revenuestripe ? - powerinbox,False,False,True,False,revenuestripe powerinbox learn revenuestripe,benign,0.050043,0.05163,0.847615,0.050711,Topic 2,0.85
2,newmark knight frank ( @ newmarkkf ) | twitter,True,True,True,False,detect disabled browser would proceed skip con...,benign,0.000317,0.000318,0.000317,0.999048,Topic 3,1.0
3,join our rewards program for more opportunitie...,False,False,True,False,program opportunity win home resort guestroom ...,benign,0.000305,0.999087,0.000304,0.000303,Topic 1,1.0
4,debtwire login,True,True,True,False,debtwirecreate cap lock remember forget passwo...,benign,0.971728,0.009361,0.009451,0.00946,Topic 0,0.97
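
The `Dominant_Topic` and `Perc_Dominant_Topic` columns in the output above look like a summary of the four `Topic_*` weights: the topic with the largest weight, and that weight as a share of the row total, rounded to two decimals. This is an inference from the displayed values, not a statement about pycaret's internals. The NMF rows later in this notebook, whose weights do not sum to one, fit the same rule. A small sketch, with the hypothetical helper name `dominant_topic`:

```python
def dominant_topic(weights):
    """Return (label, share) for the largest topic weight in one document row."""
    total = sum(weights)
    best = max(range(len(weights)), key=lambda i: weights[i])
    # Share of the row total held by the largest weight (assumed rule).
    share = round(weights[best] / total, 2) if total else 0.0
    return f"Topic {best}", share


# Topic weights copied from rows 0 and 1 of the LDA output above.
row_0 = [0.971727, 0.009361, 0.009451, 0.009461]
row_1 = [0.050043, 0.05163, 0.847615, 0.050711]
print(dominant_topic(row_0))
print(dominant_topic(row_1))
```

Because both columns can be derived from the topic weights, they add no new information for a classifier.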


In [None]:
evaluate_model(m_lda)

In [19]:
# Keep only the boolean page features, the label, and the topic weights
d_lda.drop(columns=['title_raw', 'text_clean', 'Dominant_Topic', 'Perc_Dominant_Topic'], inplace=True)
d_lda.head()

Unnamed: 0,has_form,has_login_form,has_js,js_include_b64,classification,Topic_0,Topic_1,Topic_2,Topic_3
0,True,True,True,False,benign,0.971727,0.009361,0.009451,0.009461
1,False,False,True,False,benign,0.050043,0.05163,0.847615,0.050711
2,True,True,True,False,benign,0.000317,0.000318,0.000317,0.999048
3,False,False,True,False,benign,0.000305,0.999087,0.000304,0.000303
4,True,True,True,False,benign,0.971728,0.009361,0.009451,0.00946


In [23]:
!pip install numpy==1.20

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy==1.20
  Downloading numpy-1.20.0-cp38-cp38-manylinux2010_x86_64.whl (15.4 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.20.0 which is incompatible.
xarray-einstats 0.5.1 requires scipy>=1.6, but you have scipy 1.5.4 which is incompatible.
cmdstanpy 1.1.0 requires numpy>=1.21, but you have numpy 1.20.0 which is incompatible.
Successfully instal

In [24]:
from pycaret.classification import *

# Set up classification on the page features plus the LDA topic weights
c_lda = setup(data=d_lda,
              target='classification',
              normalize=True,
              train_size=0.85)

Unnamed: 0,Description,Value
0,session_id,2611
1,Target,classification
2,Target Type,Binary
3,Label Encoded,"benign: 0, malicious: 1"
4,Original Data,"(23, 9)"
5,Missing Values,False
6,Numeric Features,4
7,Categorical Features,4
8,Ordinal Features,False
9,High Cardinality Features,False


INFO:logs:create_model_container: 0
INFO:logs:master_model_container: 0
INFO:logs:display_container: 1
INFO:logs:Pipeline(memory=None,
         steps=[('dtypes',
                 DataTypes_Auto_infer(categorical_features=[],
                                      display_types=True, features_todrop=[],
                                      id_columns=[],
                                      ml_usecase='classification',
                                      numerical_features=[],
                                      target='classification',
                                      time_features=[])),
                ('imputer',
                 Simple_Imputer(categorical_strategy='not_available',
                                fill_value_categorical=None,
                                fill_value_numerical=None,
                                numer...
                ('P_transform', 'passthrough'), ('binn', 'passthrough'),
                ('rem_outliers', 'passthrough'), ('cluster_all'

In [25]:
compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.9,0.7,0.7,0.65,0.6667,,0.6,0.026
svm,SVM - Linear Kernel,0.8,0.0,0.6,0.6,0.6,,0.5,0.017
ada,Ada Boost Classifier,0.8,0.6,0.6,0.5,0.5333,,0.4,0.156
gbc,Gradient Boosting Classifier,0.8,0.6,0.7,0.6,0.6333,,0.5,0.102
dt,Decision Tree Classifier,0.75,0.6,0.6,0.5,0.5333,,0.4,0.014
ridge,Ridge Classifier,0.75,0.0,0.5,0.5,0.5,,0.4,0.013
et,Extra Trees Classifier,0.75,0.6,0.6,0.55,0.5667,,0.4,0.307
lr,Logistic Regression,0.7,0.5,0.5,0.5,0.5,,0.3,0.571
knn,K Neighbors Classifier,0.7,0.45,0.4,0.4,0.4,,0.3,0.018
nb,Naive Bayes,0.7,0.6,0.5,0.5,0.5,,0.3,0.015


INFO:logs:create_model_container: 15
INFO:logs:master_model_container: 15
INFO:logs:display_container: 2
INFO:logs:LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)
INFO:logs:compare_models() succesfully completed......................................


LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

In [26]:
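# 'lda' here is Linear Discriminant Analysis, the top model from compare_models(), not the LDA topic model
c_lda_lda = tune_model(create_model('lda'))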

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,1.0,0.0,0.0,0.0,0.0,,0.0
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,0.5,0.0,1.0,0.5,0.6667,0.0,0.0
5,0.5,1.0,0.0,0.0,0.0,0.0,0.0
6,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,1.0,0.0,0.0,0.0,0.0,,0.0
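
The per-fold scores above only take the values 0, 0.5, and 1, which is what very small validation folds produce. The sketch below estimates the fold sizes from numbers shown in this notebook: 23 rows, `train_size=0.85`, and ten folds (0 to 9). It assumes the test portion is rounded up, as scikit-learn's `train_test_split` does, and that folds are as equal in size as possible. pycaret's stratified folds may distribute rows slightly differently. The helper name `fold_sizes` is hypothetical.

```python
import math


def fold_sizes(n_rows, train_size, n_folds):
    """Estimate the training row count and the size of each validation fold."""
    # Assumed rounding rule: the held-out portion is rounded up.
    n_test = math.ceil(n_rows * (1 - train_size))
    n_train = n_rows - n_test
    # Spread the training rows over the folds as evenly as possible.
    base, extra = divmod(n_train, n_folds)
    return n_train, [base + 1] * extra + [base] * (n_folds - extra)


n_train, sizes = fold_sizes(n_rows=23, train_size=0.85, n_folds=10)
print(n_train, sizes)
```

With one or two pages per fold, a single misclassification moves a fold's accuracy by 0.5 or 1.0, and metrics such as Kappa can be undefined when a fold contains only one class, so the scores in this table say little about how the model would generalize.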


INFO:logs:create_model_container: 17
INFO:logs:master_model_container: 17
INFO:logs:display_container: 4
INFO:logs:LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=0.001,
                           solver='lsqr', store_covariance=False, tol=0.0001)
INFO:logs:tune_model() succesfully completed......................................


### Non-Negative Matrix Factorization (NMF)

In [16]:
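# Fit the NMF topic model and attach each document's topic weights to the data.
# Run this before `from pycaret.classification import *` (see the execution counts),
# which shadows the pycaret.nlp functions of the same name.
m_nmf = create_model(model='nmf', multi_core=True)
d_nmf = assign_model(m_nmf)
d_nmf.head()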

INFO:logs:(23, 13)
INFO:logs:assign_model() succesfully completed......................................


Unnamed: 0,title_raw,has_form,has_login_form,has_js,js_include_b64,text_clean,classification,Topic_0,Topic_1,Topic_2,Topic_3,Dominant_Topic,Perc_Dominant_Topic
0,debtwire login,True,True,True,False,debtwirecreate cap lock remember forget passwo...,benign,0.001158,0.00012,0.0,0.000202,Topic 0,0.78
1,what is revenuestripe ? - powerinbox,False,False,True,False,revenuestripe powerinbox learn revenuestripe,benign,4e-05,6.7e-05,0.0,0.000158,Topic 3,0.6
2,newmark knight frank ( @ newmarkkf ) | twitter,True,True,True,False,detect disabled browser would proceed skip con...,benign,0.0043,0.000896,0.0,0.004,Topic 0,0.47
3,join our rewards program for more opportunitie...,False,False,True,False,program opportunity win home resort guestroom ...,benign,0.002717,6.8e-05,0.0,0.000123,Topic 0,0.93
4,debtwire login,True,True,True,False,debtwirecreate cap lock remember forget passwo...,benign,0.001158,0.00012,0.0,0.000202,Topic 0,0.78


In [None]:
evaluate_model(m_nmf)

In [20]:
# Keep only the boolean page features, the label, and the topic weights
d_nmf.drop(columns=['title_raw', 'text_clean', 'Dominant_Topic', 'Perc_Dominant_Topic'], inplace=True)
d_nmf.head()

Unnamed: 0,has_form,has_login_form,has_js,js_include_b64,classification,Topic_0,Topic_1,Topic_2,Topic_3
0,True,True,True,False,benign,0.001158,0.00012,0.0,0.000202
1,False,False,True,False,benign,4e-05,6.7e-05,0.0,0.000158
2,True,True,True,False,benign,0.0043,0.000896,0.0,0.004
3,False,False,True,False,benign,0.002717,6.8e-05,0.0,0.000123
4,True,True,True,False,benign,0.001158,0.00012,0.0,0.000202


## Model Generation