<a href="https://colab.research.google.com/github/giocarro/Data_Science_Gio/blob/main/Tareas/SMS_Spam_Detection_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**SPAM analysis**

## Dataset Information

The concept of "spam" is diverse: advertisements for products or websites, make-money-fast schemes, chain letters, pornography, and more.

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

## Attributes

- SMS Messages
- Label (spam/ham)

## Import modules

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import plotly.express as px
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix


In [2]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Loading the dataset

In [3]:
df = pd.read_csv('spam.csv', encoding='latin-1')
df

Unnamed: 0,class,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [4]:
# get necessary columns for processing
df = df[['text', 'class']]
df

Unnamed: 0,text,class
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham
...,...,...
5567,This is the 2nd time we have tried 2 contact u...,spam
5568,Will Ì_ b going to esplanade fr home?,ham
5569,"Pity, * was in mood for that. So...any other s...",ham
5570,The guy did some bitching but I acted like i'd...,ham


## Preprocessing the dataset

In [5]:
# check for null values
df.isnull().sum()

text     0
class    0
dtype: int64

In [6]:
stops = set(stopwords.words('english'))
print(stops)

{"you've", 'which', 'so', "mightn't", 'yourselves', "hasn't", 'in', 'before', 'being', 'does', 've', 'over', 'further', 'weren', 'nor', "couldn't", 'were', 'than', 'out', 'hers', "you're", 'o', 'while', 'same', 'not', 'between', 'isn', "wasn't", 'needn', 'about', 'own', "haven't", 'is', "didn't", 'had', 'it', 'if', 'mightn', 't', "don't", 'other', 'most', 'did', 'll', 'all', "weren't", 'a', 'both', 'again', 'yours', 'his', 'd', 'itself', 'against', 'our', 'that', 'such', 'am', 'me', "you'll", 'whom', 'few', 'until', 'down', "aren't", 'm', "she's", 'more', 'above', "shouldn't", 'him', 'doing', 'as', 'and', 'be', "it's", 'having', 'y', 'my', 'up', 'when', 'themselves', 'under', 'herself', 'ain', 'once', 's', 'some', 'yourself', 'now', 'who', 'why', 'was', 'aren', 'at', 'shouldn', "shan't", 'doesn', 'very', 'those', 'ourselves', 'what', 'couldn', 'or', 'any', "doesn't", 'have', 'mustn', 'then', 'their', 'no', 'for', "needn't", 'into', 'will', 'ma', 'we', 'wasn', 'i', 'can', 'they', "that'

In [7]:
def clean_text(text):
    """Lowercase a message, strip non-alphanumeric characters and drop stopwords."""
    # convert to lowercase
    text = text.lower()
    # replace every non-alphanumeric character with a space
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # remove stopwords (uses the `stops` set built in the previous cell)
    text = " ".join(word for word in text.split() if word not in stops)
    return text
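
To make the effect of these cleaning steps concrete, here is a minimal sketch that applies the same sequence to two short messages. It assumes a tiny hand-written stopword set (`demo_stops`, a hypothetical stand-in for the NLTK list) so it runs without downloading any corpus, and `clean_text_demo` is an illustrative name, not part of the notebook.

```python
import re

# Hypothetical mini stopword set standing in for NLTK's English list,
# so the example runs without downloading any corpus.
demo_stops = {"i", "in", "a", "to", "don", "t"}

def clean_text_demo(text, stops=demo_stops):
    """Apply the same steps as clean_text, with the stopword set passed in."""
    text = text.lower()                        # lowercase
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)  # non-alphanumerics -> space
    text = re.sub(r'\s+', ' ', text)           # collapse whitespace
    return " ".join(word for word in text.split() if word not in stops)

spam_cleaned = clean_text_demo("Free entry in 2 a wkly comp to WIN!!")
ham_cleaned = clean_text_demo("Nah I don't think")

print(spam_cleaned)  # free entry 2 wkly comp win
print(ham_cleaned)   # nah think
```

Note that replacing the apostrophe with a space splits "don't" into "don" and "t". Both fragments appear in the NLTK stopword list printed above, which is why contractions vanish entirely from the cleaned text.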

In [8]:
# clean the messages; take an explicit copy first so the assignment
# does not trigger pandas' SettingWithCopyWarning
df = df.copy()
df['clean_text'] = df['text'].apply(clean_text)
df.head()


Unnamed: 0,text,class,clean_text
0,"Go until jurong point, crazy.. Available only ...",ham,go jurong point crazy available bugis n great ...
1,Ok lar... Joking wif u oni...,ham,ok lar joking wif u oni
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,U dun say so early hor... U c already then say...,ham,u dun say early hor u c already say
4,"Nah I don't think he goes to usf, he lives aro...",ham,nah think goes usf lives around though


## Input Split

In [9]:
class_counts = df['class'].value_counts()
class_counts

ham     4825
spam     747
Name: class, dtype: int64
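
These counts show a clear class imbalance. As a rough sketch, the shares can be worked out directly from the two numbers above. The dictionary below simply copies the counts from the output, and `class_counts_demo` and `majority_baseline` are illustrative names rather than part of the notebook.

```python
# Counts copied from the value_counts() output above.
class_counts_demo = {"ham": 4825, "spam": 747}

total = sum(class_counts_demo.values())
shares = {label: count / total for label, count in class_counts_demo.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
# ham: 86.6%
# spam: 13.4%

# A classifier that always predicts the majority class reaches this accuracy,
# so accuracy alone is a weak yardstick on this dataset.
majority_baseline = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

With roughly 87% of the messages being ham, precision, recall and F1 on the spam class are more informative than raw accuracy.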

In [10]:
class_counts.index

Index(['ham', 'spam'], dtype='object')

In [11]:
# bar chart of the number of messages per class
fig = px.bar(
    x=class_counts.index,
    y=class_counts.values,
    labels={'x': 'Class', 'y': 'Count'},
    title='Class distribution',
    template='plotly_white',
    text=class_counts.values,
)
fig.show()

In [15]:
# split the frame by label: class_1 holds ham, class_2 holds spam
class_1 = df[df['class'] == 'ham']
class_2 = df[df['class'] == 'spam']