# Natural Language Processing (NLP)

## Table of Contents

* [0. Problem Statement](#c0)
* [1. Importing libraries](#c1)
* [2. Data Collection](#c2)
* [3. Exploration and Data Cleaning](#c3)
  * [3.1 Drop Null Values](#c3-1)
  * [3.2 Drop Duplicate Information](#c3-2)
* [4. Preprocessing of Text (URLs)](#c4)

## 0. Problem Statement <a id='c0'></a>

The objective of this exercise is to build an NLP model that classifies a webpage as spam or not spam based solely on its URL.

## 1. Importing libraries <a id='c1'></a>

In [14]:
# Standard library
import json
import os
import pickle
import warnings
from pickle import dump

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import regex as re  # third-party `regex` package, largely compatible with the stdlib `re` API
import seaborn as sns
from nltk import download
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Silence all warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

def warn(*args, **kwargs):
    """No-op replacement so libraries that call warnings.warn stay silent."""
    pass

warnings.warn = warn
pd.set_option('display.max_columns', None)  # show every column when printing DataFrames

## 2. Data Collection <a id='c2'></a>

In [15]:
URL = 'https://breathecode.herokuapp.com/asset/internal-link?id=932&path=url_spam.csv'

def get_data(url: str) -> pd.DataFrame:
    """Download the CSV at `url` and return it as a DataFrame."""
    return pd.read_csv(url, sep=',')

# Load the dataset once and preview the first rows
total_data = get_data(URL)
print(total_data.head())

                                                 url  is_spam
0  https://briefingday.us8.list-manage.com/unsubs...     True
1                             https://www.hvper.com/     True
2                 https://briefingday.com/m/v4n3i4f3     True
3   https://briefingday.com/n/20200618/m#commentform    False
4                        https://briefingday.com/fan     True


## 3. Exploration and Data Cleaning <a id='c3'></a>

#### 3.1 Drop Null Values <a id='c3-1'></a>

The check below finds no null values in either column, so no rows need to be dropped.

In [16]:
total_data.isna().sum()

url        0
is_spam    0
dtype: int64
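
For datasets where the null check does find missing values, the drop step can be made conditional. The following is a minimal sketch on toy data, not part of the original notebook: the helper name `drop_null_rows` and the `example.com` URLs are hypothetical and only mirror the `url` / `is_spam` columns shown above.

```python
import pandas as pd

def drop_null_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without rows that contain nulls, reporting how many were removed.

    Hypothetical helper for illustration; it is not defined in the notebook.
    """
    null_rows = int(df.isna().any(axis=1).sum())
    if null_rows == 0:
        print("No null values found; nothing to drop.")
        return df
    print(f"Dropping {null_rows} rows with null values.")
    return df.dropna().reset_index(drop=True)

# Toy data (made up for this example, not taken from the real dataset)
toy = pd.DataFrame({
    "url": ["https://example.com/a", None, "https://example.com/b"],
    "is_spam": [True, False, None],
})

cleaned = drop_null_rows(toy)
print(cleaned)
```

Applied to `total_data`, a helper like this would return the dataframe unchanged, since every column reports zero nulls.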

#### 3.2 Drop Duplicate Information <a id='c3-2'></a>

In [18]:
rows_before = total_data.shape[0]
duplicated_rows_before = total_data.duplicated().sum()
print(f'Before: The dataframe has {rows_before} rows, of which {duplicated_rows_before} are duplicated.')

total_data = total_data.drop_duplicates()

rows_after = total_data.shape[0]
duplicated_rows_after = total_data.duplicated().sum()
print(f'After: Now the dataframe has {rows_after} rows and {duplicated_rows_after} duplicated ones.')

Before: The dataframe has 2369 rows, of which 0 are duplicated.
After: Now the dataframe has 2369 rows and 0 duplicated ones.
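
Note that `duplicated()` with no arguments only flags rows that match on every column. As an optional extra check, and purely as a sketch on made-up data (the toy URLs and variable names below are assumptions, not part of the notebook), the same URL appearing with different labels can be detected by restricting the comparison to the `url` column:

```python
import pandas as pd

# Toy data (hypothetical): one exact duplicate and one URL with conflicting labels
toy = pd.DataFrame({
    "url": [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/b",
    ],
    "is_spam": [True, True, True, False],
})

# Rows identical on every column (what the cell above counts)
full_dupes = int(toy.duplicated().sum())

# Rows that repeat a URL already seen, regardless of label
url_dupes = int(toy.duplicated(subset=["url"]).sum())

# URLs that appear with more than one distinct label
labels_per_url = toy.groupby("url")["is_spam"].nunique()
conflicting_urls = labels_per_url[labels_per_url > 1].index.tolist()

print(f"Exact duplicates: {full_dupes}")
print(f"Repeated URLs: {url_dupes}")
print(f"URLs with conflicting labels: {conflicting_urls}")
```

Whether conflicting rows should be dropped or relabelled is a modelling decision; this sketch only surfaces them.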


## 4. Preprocessing of Text (URLs) <a id='c4'></a>