# Hybrid SE Detection Using Deep Learning Models

### Fetch the data

In [1]:
!wget https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip

--2024-10-30 12:16:35--  https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘sms+spam+collection.zip’

sms+spam+collection     [    <=>             ] 198.65K   211KB/s    in 0.9s    

2024-10-30 12:16:37 (211 KB/s) - ‘sms+spam+collection.zip’ saved [203415]



In [2]:
# Unzip the dataset
from zipfile import ZipFile

# Use a name other than `zip` so the built-in function is not shadowed
with ZipFile("sms+spam+collection.zip", 'r') as archive:
    archive.extractall()
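
Before extracting everything, it can help to check what an archive contains and pull out only the file that is needed. The sketch below is illustrative: it builds a tiny stand-in archive in memory (the contents and the `readme` member are made up, not the real dataset), lists its members with `namelist()`, and extracts a single member into a temporary directory.

```python
import io
import tempfile
from pathlib import Path
from zipfile import ZipFile

# Build a small stand-in archive in memory (hypothetical contents).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("SMSSpamCollection", "ham\tOk lar...\n")
    archive.writestr("readme", "placeholder notes\n")

with ZipFile(buffer, "r") as archive:
    # List the member names without extracting anything.
    members = archive.namelist()
    print(members)

    with tempfile.TemporaryDirectory() as tmp_dir:
        # Extract only the data file into a scratch directory.
        extracted_path = Path(archive.extract("SMSSpamCollection", path=tmp_dir))
        first_line = extracted_path.read_text(encoding="utf-8").splitlines()[0]

print(first_line)
```

With the downloaded file, the same calls would be made on `ZipFile("sms+spam+collection.zip")` instead of the in-memory buffer.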

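The dataset is a tab-separated text file with one message per line: a label (`ham` or `spam`) followed by the message text. We can read it line by line, collect the labels and texts in a dictionary, and then convert the dictionary into a DataFrame.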

In [6]:
data_dict = {'label':[], 'text': []}

# Specify the encoding explicitly so the result does not depend on the platform default
with open("SMSSpamCollection", 'r', encoding="utf-8") as file:
    for line in file:
        # Split once on the first run of whitespace (the tab after the label)
        parts = line.split(maxsplit=1)

        # Check that we have exactly two parts: label and text
        if len(parts) == 2:
            label = parts[0].strip()
            text = parts[1].strip().strip('"')  # Remove surrounding quotes if any

            # Append to dictionary lists
            data_dict["label"].append(label)
            data_dict["text"].append(text)
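
As an alternative to the manual loop, the same file layout can probably be read in one call with `pandas.read_csv`. This is a minimal sketch, not the notebook's method: `load_sms` is a hypothetical helper, and the two sample lines are invented for illustration. It assumes the label and text are separated by a single tab and passes `quoting=csv.QUOTE_NONE` so that stray double quotes inside messages are kept as ordinary characters. Unlike the loop above, it does not strip surrounding quotes, so row contents may differ slightly.

```python
import csv
import io

import pandas as pd


def load_sms(source):
    """Read tab-separated 'label<TAB>text' lines into a DataFrame."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,                 # the file has no header row
        names=["label", "text"],
        quoting=csv.QUOTE_NONE,      # treat double quotes as plain characters
    )


# Invented sample lines in the same layout as the dataset.
sample = io.StringIO('ham\tSee you at 5\nspam\t"Claim your prize now\n')
sample_df = load_sms(sample)
print(sample_df)
```

For the real file, `load_sms("SMSSpamCollection")` would be the corresponding call.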


In [7]:
# view a sample of the entries

for i in range(10):
    print(f"Label: {data_dict['label'][i]}, Text: {data_dict['text'][i]}")

Label: ham, Text: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
Label: ham, Text: Ok lar... Joking wif u oni...
Label: spam, Text: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
Label: ham, Text: U dun say so early hor... U c already then say...
Label: ham, Text: Nah I don't think he goes to usf, he lives around here though
Label: spam, Text: FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
Label: ham, Text: Even my brother is not like to speak with me. They treat me like aids patent.
Label: ham, Text: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
Label: spam, Text: WINNER!! As a valued network customer 

In [8]:
# convert the dictionary to a dataframe
import pandas as pd

sms_df = pd.DataFrame(data_dict)
sms_df.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [9]:
sms_df.describe()

|        | label | text |
|--------|-------|------|
| count  | 5574 | 5574 |
| unique | 2 | 5160 |
| top    | ham | Sorry, I'll call later |
| freq   | 4827 | 30 |
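
The summary above already hints at class imbalance: `ham` is the most frequent label, with 4827 of 5574 rows. The short calculation below works out the implied split from those two reported numbers only (this is before duplicates are removed); the variable names are illustrative.

```python
# Figures taken from the describe() output above
total_rows = 5574
ham_rows = 4827

# With only two labels, the remainder must be spam
spam_rows = total_rows - ham_rows
ham_share = ham_rows / total_rows
spam_share = spam_rows / total_rows

print(f"ham: {ham_rows} ({ham_share:.1%}), spam: {spam_rows} ({spam_share:.1%})")
```

On the DataFrame itself, `sms_df["label"].value_counts(normalize=True)` should give the same proportions.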


In [10]:
sms_df.isnull().sum()

label    0
text     0
dtype: int64

In [None]:
sms_df.duplicated().sum()

np.int64(414)

In [13]:
# Drop duplicate rows, keeping the first occurrence of each
sms_df = sms_df.drop_duplicates(keep="first")

In [14]:
sms_df.duplicated().sum()

np.int64(0)

In [15]:
sms_df.shape

(5160, 2)
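
The numbers are consistent: 5574 rows minus 414 duplicates leaves 5160, which also equals the number of unique texts reported by `describe()`. That suggests no message appears under both labels. The sketch below shows, on a small made-up DataFrame (`toy_df` is not the real data), how both checks could be written. It also resets the index, an optional step, since `drop_duplicates` leaves gaps in the original index.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame(
    {
        "label": ["ham", "ham", "spam", "ham", "spam"],
        "text": ["see you soon", "see you soon", "win a prize", "call me", "win a prize"],
    }
)

# Count rows that exactly repeat an earlier row (label and text both equal)
n_duplicates = int(toy_df.duplicated().sum())
deduped_df = toy_df.drop_duplicates(keep="first").reset_index(drop=True)

# Sanity check: rows before minus duplicates equals rows after
assert len(toy_df) - n_duplicates == len(deduped_df)

# Texts carrying more than one label would survive row-level deduplication
labels_per_text = deduped_df.groupby("text")["label"].nunique()
conflicting_texts = labels_per_text[labels_per_text > 1].index.tolist()

print(n_duplicates, len(deduped_df), conflicting_texts)
```

An empty `conflicting_texts` list means every remaining text has a single label.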