<figure style="margin-left: 20px; margin-right: 20px;">
  <img src="../figures/logo-esi-sba.png" width="256" height="256" align="right" alt="Logo">
</figure>

# Email spam classification using semi-supervised learning techniques

*Directed by* 
- Fellah Abdnour (ab.fellah@esi-sba.dz) 
- Benyamina Yacine Lazreg (yl.benyamina@esi-sba.dz) 
- Mokadem Adel Abdelkader (aa.mokadem@esi-sba.dz) 
- Benounene Abdelrahmane (a.benounene@esi-sba.dz) 

# Notebook 1: Data Cleaning

This first notebook focuses on data cleaning. It addresses data quality issues such as missing values, outliers, and inconsistencies.

## Outline

- [Necessary packages](#necessary_packages)
- [Data loading](#data_loading)
- [Getting familiar with the data](#getting_familiar_with_the_data)
- [Handling missing values](#handeling_missing_values)
- [Removing irrelevant columns](#remove_irrelevant_columns)
- [Columns renaming](#columns_renaming)
- [Handling duplicates](#handeling_duplicates)
- [Feature extraction](#feature_extraction)
- [Text cleaning and preprocessing](#text_cleaning_and_preprocessing)
- [Outliers handling](#outliers_handeling)
- [Removing data representation inconsistencies](#removing_data_representation_inconsistenties)
- [Save the results to the disk](#save_the_results_to_the_disk)

<div id="necessary_packages" >
    <h3>Necessary packages</h3>
</div>

In [1]:
import numpy as np
import pandas as pd
import nltk
import os
import string
import re
from tqdm.notebook import tqdm
from IPython.display import HTML, display
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline

In [2]:
tqdm.pandas()

<div id="data_loading" >
    <h3>Data loading</h3>
</div>

In [3]:
path = os.path.join("..","data","spam.csv")
df = pd.read_csv(path, encoding="iso-8859-1")

<div id="getting_familiar_with_the_data" >
    <h3>Getting familiar with the data</h3>
</div>

In [4]:
df.sample(5)

| | Unnamed: 0 | label | text |
|---|---|---|---|
| 326 | 2305 | -1 | Subject: enron actuals for dec . 27 , 2000\r\n... |
| 4109 | 684 | ham | Subject: entex - may\r\ni would assume 40000 f... |
| 4131 | 1195 | -1 | Subject: new paycheck information !\r\nintrodu... |
| 2143 | 4034 | -1 | Subject: italian rolex in throw away prices . ... |
| 3177 | 40 | -1 | Subject: holiday on - call data\r\npipeline co... |


In [5]:
df.dtypes

Unnamed: 0     int64
label         object
text          object
dtype: object

In [6]:
df.shape

(5171, 3)

<div id="handeling_missing_values" >
    <h3>Handling missing values</h3>
</div>

In [7]:
df.isna().sum()

Unnamed: 0    0
label         0
text          0
dtype: int64
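
`isna()` only reports true missing values, so empty or whitespace-only strings would pass this check unnoticed. The following is a minimal sketch of a complementary check. It uses a small made-up frame (`toy`, an assumption for illustration) rather than the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    "label": ["ham", "spam", "-1"],
    "text": ["Subject: hello", "   ", ""],
})

# isna() finds nothing here, because empty strings are not NaN.
missing = toy.isna().sum()

# Count rows whose text is empty once surrounding whitespace is stripped.
blank_count = int((toy["text"].str.strip() == "").sum())
```

On the real frame, the same expression applied to `df["text"]` would show whether any email body is effectively empty.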

<div id="remove_irrelevant_columns">
    <h3>Remove irrelevant columns</h3>
</div>

In [10]:
df.drop(columns=["Unnamed: 0"], inplace=True)

In [11]:
df.columns

Index(['label', 'text'], dtype='object')

<div id="columns_renaming" >
    <h3>Columns renaming</h3>
</div>

In [12]:
df.rename(columns={"label": "class", "text": "content"}, inplace=True)

In [13]:
df.columns

Index(['class', 'content'], dtype='object')

<div id="handeling_duplicates" >
    <h3>Handling duplicates</h3>
</div>

In [14]:
df["content"].duplicated().sum()

178

In [15]:
df.drop_duplicates(subset="content", inplace=True)
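
`drop_duplicates` keeps the first occurrence of each repeated text by default. Because the data mixes labeled rows with unlabeled ones (`-1`), that first occurrence may be the unlabeled copy. The sketch below shows one possible way to prefer labeled copies. It is not what the notebook does, and the toy frame and the `is_unlabeled` helper column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame: each text appears once labeled and once unlabeled.
toy = pd.DataFrame({
    "class": ["-1", "spam", "ham", "-1"],
    "content": ["buy now", "buy now", "meeting at 10", "meeting at 10"],
})

# Sort labeled rows ahead of unlabeled ones; a stable sort preserves
# the original order among rows with the same flag.
toy["is_unlabeled"] = toy["class"] == "-1"
deduped = (
    toy.sort_values("is_unlabeled", kind="stable")
       .drop_duplicates(subset="content", keep="first")
       .drop(columns="is_unlabeled")
       .sort_index()
)
```

Whether this matters depends on how often a text appears both with and without a label, which the notebook does not report.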

<div id="feature_extraction" >
    <h3>Feature extraction</h3>
</div>

- In this section, we save some information that text cleaning would otherwise discard, since it may provide useful features.

#### urls

In [16]:
url_regex = re.compile(r"(?P<url>https?://[^\s]+)")

In [17]:
def get_urls_count(txt):
    return len(re.findall(url_regex, txt))

In [18]:
df["urls_count"] = df["content"].apply(get_urls_count)
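
To make the behavior of the URL pattern concrete, here is a small sketch on a made-up sentence (`sample` is an assumption, not taken from the dataset). The pattern is case-sensitive and the count is computed on the raw text, so a scheme written in uppercase is not counted.

```python
import re

# Same pattern as in the notebook, written as a raw string.
url_regex = re.compile(r"(?P<url>https?://[^\s]+)")

sample = "Visit http://example.com/offer or https://example.org now, not HTTP://EXAMPLE.NET"

# With a single group in the pattern, findall returns the group contents.
urls = url_regex.findall(sample)
url_count = len(urls)
```

If uppercase schemes should count as well, compiling the pattern with `re.IGNORECASE` would be one option.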

#### digits

In [19]:
digits_regex = re.compile("[0-9]+")

In [20]:
def get_digits_count(txt):
    return len(re.findall(digits_regex, txt))

In [21]:
df["digits_count"] = df["content"].apply(get_digits_count)
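
Note that `[0-9]+` matches runs of consecutive digits, so `digits_count` is the number of numeric tokens rather than the number of individual digit characters. The sketch below contrasts the two on a short string. `single_digit_regex` is a hypothetical alternative and is not part of the notebook.

```python
import re

digits_regex = re.compile("[0-9]+")        # runs of digits, as in the notebook
single_digit_regex = re.compile("[0-9]")   # hypothetical: one match per digit

sample = "enron actuals for dec . 27 , 2000"

run_count = len(digits_regex.findall(sample))          # "27" and "2000"
digit_count = len(single_digit_regex.findall(sample))  # 2, 7, 2, 0, 0, 0
```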

#### currency symbols

In [22]:
curr_symbols = re.compile(r'[€$£¥]')

In [23]:
def contains_curr_symbols(txt):
    return len(re.findall(curr_symbols, txt)) > 0

In [24]:
df["contains_currency_symbols"] = df["content"].apply(contains_curr_symbols)
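
Since only a yes/no answer is needed here, the same check can be sketched with `search`, which stops at the first match instead of collecting all of them. This is an equivalent formulation for illustration, and the sample strings are made up.

```python
import re

curr_symbols = re.compile(r"[€$£¥]")

def contains_curr_symbols(txt):
    # search() returns a match object for the first hit, or None if there is none.
    return curr_symbols.search(txt) is not None

with_symbol = contains_curr_symbols("only $ 9 . 99 today")
without_symbol = contains_curr_symbols("no price mentioned")
```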

#### Text length

In [25]:
df["length"] = df["content"].apply(len)

<div id="text_cleaning_and_preprocessing" >
    <h3>Text cleaning and preprocessing</h3>
</div>

In [26]:
stop_words = set(nltk.corpus.stopwords.words("english"))

In [27]:
# Every printable character that is neither an ASCII letter nor a space
special_chars = set(string.printable) - set(string.ascii_letters) - set(" ")
escaped_chars = [re.escape(c) for c in special_chars]
regex = re.compile(f"({'|'.join(escaped_chars)})")
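
As a rough illustration of what this pattern matches, the sketch below applies it with `sub` to a made-up string. Sorting the characters is an addition made here so that the compiled pattern is identical from run to run, and the sample text is an assumption.

```python
import re
import string

special_chars = set(string.printable) - set(string.ascii_letters) - set(" ")
# Sorted so the alternation is built in a reproducible order.
escaped_chars = [re.escape(c) for c in sorted(special_chars)]
regex = re.compile(f"({'|'.join(escaped_chars)})")

sample = "win $100 now!!!\r\ncall 555-0100"

# Replace every digit, punctuation mark, and control character with a space.
cleaned = regex.sub(" ", sample)
words = cleaned.split()
```

Only ASCII letters and spaces survive the substitution, so `words` contains the alphabetic tokens of the sample.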

In [28]:
stemmer = nltk.stem.porter.PorterStemmer()

In [29]:
def transform(text):

    # lowercasing
    text = text.lower()

    # remove urls
    text = re.sub(url_regex," ",text)
    
    # tokenization
    text = nltk.word_tokenize(text, language='english')
        
    # stop words removal
    text = [word for word in text if word not in stop_words]
    
    # noise removal
    text = [word for word in text if word.isalpha()]
    
    # stemming
    text = [stemmer.stem(word) for word in text]
    
    return ' '.join(text)
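
To trace the order of the steps without the NLTK resources, here is a dependency-free approximation. It is a stand-in and not the function above: whitespace splitting replaces `nltk.word_tokenize`, `toy_stop_words` is a tiny hypothetical list, and stemming is left out.

```python
import re

url_regex = re.compile(r"(?P<url>https?://[^\s]+)")

# Assumption: a tiny stop-word list standing in for NLTK's English list.
toy_stop_words = {"the", "at", "for", "is", "a"}

def toy_transform(text):
    text = text.lower()                                       # lowercasing
    text = url_regex.sub(" ", text)                           # URL removal
    tokens = text.split()                                     # crude tokenization
    tokens = [t for t in tokens if t not in toy_stop_words]   # stop-word removal
    tokens = [t for t in tokens if t.isalpha()]               # noise removal
    return " ".join(tokens)                                   # stemming omitted

result = toy_transform("The OFFER is at http://example.com for 40000 dollars !")
```

Lowercasing first lets the lowercase URL pattern also catch schemes written in uppercase, and the `isalpha` filter drops numbers and punctuation tokens.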

In [30]:
df["content"] = df["content"].progress_apply(transform)

  0%|          | 0/4993 [00:00<?, ?it/s]

<div id="outliers_handeling" >
    <h3>Outliers handling</h3>
</div>

In [31]:
model = Pipeline(steps=[
    ("feature extraction", CountVectorizer()),
    ("estimator", IsolationForest())
])

In [32]:
model.fit(df["content"])

In [33]:
predictions = model.predict(df["content"])

In [34]:
# IsolationForest marks outliers with -1 and inliers with 1
n = int(np.sum(predictions == -1))
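
The cells in this section count the flagged documents. The sketch below shows, on a made-up corpus, how the predictions could also be turned into a mask for filtering. The corpus, the step name `vectorizer`, and the use of `random_state` are assumptions added for illustration; the notebook itself uses the default `IsolationForest()`.

```python
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for df["content"].
toy_corpus = [
    "meet schedul tomorrow",
    "meet schedul monday",
    "meet agenda tomorrow",
    "cheap pill free offer win prize click",
]

toy_model = Pipeline(steps=[
    ("vectorizer", CountVectorizer()),
    # random_state is set here so that repeated runs give the same result.
    ("estimator", IsolationForest(random_state=0)),
])

toy_predictions = toy_model.fit(toy_corpus).predict(toy_corpus)

# predict() returns 1 for inliers and -1 for outliers.
outlier_mask = toy_predictions == -1
n_outliers = int(outlier_mask.sum())

# Documents that would remain if the flagged ones were dropped.
kept = [doc for doc, flagged in zip(toy_corpus, outlier_mask) if not flagged]
```

Without a fixed `random_state`, the number of outliers reported can vary slightly between runs.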

In [35]:
display(HTML(f"""
    <h5>The number of outliers is : {n}</h5>
"""))

<div id="removing_data_representation_inconsistenties" >
    <h3>Removing data representation inconsistencies</h3>
</div>

In [36]:
df["class"] = df["class"].map({
    "-1": -1,
    "spam": 1,
    "ham": 0
})

In [37]:
df["class"] = df["class"].astype(np.int8)
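
`Series.map` with a dictionary turns any value absent from the dictionary into NaN, and casting NaN to an integer dtype raises an error. A minimal sketch of a check that could run between the two cells, using a made-up series with one deliberately stray label:

```python
import numpy as np
import pandas as pd

label_map = {"-1": -1, "spam": 1, "ham": 0}

# Hypothetical labels; "Spam" is a deliberate stray value.
toy_labels = pd.Series(["ham", "spam", "-1", "Spam"])

mapped = toy_labels.map(label_map)

# Values missing from the mapping show up as NaN after map().
unmapped = toy_labels[mapped.isna()].unique().tolist()

# Cast only the rows that mapped cleanly.
clean = mapped.dropna().astype(np.int8)
```

An empty `unmapped` list on the real column would confirm that the three keys cover every label present in the data.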

<div id="save_the_results_to_the_disk" >
    <h3>Save the results to the disk</h3>
</div>

In [38]:
df.to_csv(os.path.join("..","data","clean_df.csv"), index=False)
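
One thing to keep in mind when reloading this file: if cleaning leaves an email with no tokens, its `content` is an empty string, and `read_csv` turns empty fields into NaN by default. The sketch below reproduces this with an in-memory buffer. The toy frame is an assumption, and whether the real data contains such rows is not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical cleaned frame with one empty document.
toy = pd.DataFrame({"class": [1, 0], "content": ["offer dollar", ""]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behavior: the empty field comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
preserved_read = pd.read_csv(buffer, keep_default_na=False)
```

Calling `fillna("")` on the `content` column after loading would be another way to get the same result.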