<a href="https://colab.research.google.com/github/Applied-Machine-Learning-2022/final-project-group6-morganstate/blob/main/Spam_%26_Ham.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Prepping the `DataFrame`

When opening the colab it will often not include our spam dataset file.  
Add Spam.csv to the folder first for now, while I upload it to kaggle and create the import commands in the meantime.

The Spam Dataset originally uses the wrong encoding for the excel sheet.
Instructions to fix the data set are [here](https://medium.com/code-kings/python3-fix-unicodedecodeerror-utf-8-codec-can-t-decode-byte-in-position-be6c2e2235ee)
  
  The steps include:  

  1. Right click span and click open with -> Notepad
  1. Click file -> save as -> and keep the name the same but there will be an encoding option that is set to ANSI. Change it to UTF-8
  1. Click Save. It will ask if you want to overwrite spam, click yes.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# uploaded the dataframe using pandas read_csv
df = pd.read_csv('spam.csv')

# look at what columns we have
df.columns.values


array(['label', 'message', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'],
      dtype=object)

In [2]:
df.head(5)

Unnamed: 0,label,message,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   label       5572 non-null   object
 1   message     5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [4]:
df

Unnamed: 0,label,message,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [5]:
# check the descriptions of dataframe
df.describe()

Unnamed: 0,label,message,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2


In [6]:
# missing values in these unnamed columns we dont need so lets drop them from our dataframe
df.isna().any()

label         False
message       False
Unnamed: 2     True
Unnamed: 3     True
Unnamed: 4     True
dtype: bool

In [7]:
# dropped these columns since we do not need them for detecting spam
df = df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'], axis=1)
df

Unnamed: 0,label,message,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [8]:
# changing the label column values to 0 and 1 for spam or ham
column = 'label'
for k, v in {'ham': 0, 'spam': 1}.items():
  df.loc[df[column] == k, column] = v

df

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


In [9]:
# lets split our data now into X and y
X = df['message'].values
y = df['label'].values

y = y.astype('int')

In [10]:
from sklearn.model_selection import train_test_split

# here we then split our data again by 30% and 70%
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42
)

In [11]:
import re
import random

# next we are going to clean the data by removing punctations, URLs and numbers and we are going to change everyting to lowercase

def clean_email(msg):
  # turns the text into lowercase
  msg = msg.lower()
  # removes special characters
  msg = re.sub(r'[^0-9a-zA-Z]', ' ', msg)
  # this removes the whitespace
  msg = ''.join(word + (' ' * random.randint(1, 10)) for word in msg.split(' '))
  return msg

df['message'] = df['message'].apply(clean_email)
df.head(25)

Unnamed: 0,label,message
0,0,go until jurong point crazy ...
1,0,ok lar joking ...
2,1,free entry in 2 a wkly ...
3,0,u dun say so early hor ...
4,0,nah i don t think he ...
5,1,freemsg hey there darling ...
6,0,even my brother is not li...
7,0,as per your request ...
8,1,winner as a valued...
9,1,had your mobile 11 months ...


In [12]:
# check the size of the data we are training and testing
X_train.shape, X_test.shape

((3900,), (1672,))

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# in order to train our model we must convert the text into a matrix of token counts
# we can do so using the CountVectorizer function that sklearn provides

cv = CountVectorizer()
X_train, X_test = cv.fit_transform(X_train), cv.transform(X_test)

In [14]:
from sklearn.svm import SVC

# added a kernal to the SVM
svm = SVC(kernel='rbf', random_state=0)
svm.fit(X_train, y_train)

SVC(random_state=0)

In [15]:
from sklearn.svm import SVC

# added a kernal to the SVM
svm = SVC(kernel='rbf', random_state=0)
svm.fit(X_train, y_train)

SVC(random_state=0)

In [16]:
# check the size of the data we are training and testing
X_train.shape, X_test.shape

((3900, 7206), (1672, 7206))

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import cross_val_score

f1 = make_scorer(f1_score, average='micro')
estimator = LogisticRegression()
cross = cross_val_score(estimator, X_train, y_train, scoring=f1)


logistic = LogisticRegression(solver='liblinear', penalty='l1')
logistic.fit(X_train, y_train)
pred = logistic.predict(X_test)
print(f'{accuracy_score(y_test, pred)}')
print(f'{cross.mean()}')

0.9778708133971292
0.9805128205128206


In [18]:
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import cross_val_score

# finding the accuracy and f1-score of our model
estimator = SVC()
f1 = make_scorer(f1_score, average='micro')
cross = cross_val_score(estimator, X_train, y_train, scoring=f1)
# cross.mean()

print('Accuracy: ', svm.score(X_test,y_test))
print('F1-Score: ', cross.mean())

Accuracy:  0.9796650717703349
F1-Score:  0.9746153846153847


[Assistance 1](https://towardsdatascience.com/spam-detection-with-logistic-regression-23e3709e522)  
[Assistance 2](https://pythonprogramminglanguage.com/logistic-regression-spam-filter/)  
[Assistance 3](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)  
Attatched above are basically what we can referance for the logistical and other regressive types. Vectorization and data preprocessing seems important will address later.