*Created by:*

*Isabel Maniega*

# Natural Language Processing (NLP)

https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

The goal of this exercise:
* Computers work with numbers, not letters
* so we need NLP to transform words into numbers
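
To make the idea concrete, here is a minimal plain-Python sketch of a bag-of-words encoding, the same general idea that `CountVectorizer` applies later in this notebook. The two messages and the helper names (`tokenize`, `to_counts`) are made up for illustration, and the whitespace tokenizer is a simplification, not scikit-learn's implementation.

```python
from collections import Counter

# Two made-up messages, used only for illustration.
messages = ["free entry win free prize", "ok see you later"]

def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()

# Vocabulary: every distinct word, sorted, mapped to a column index.
all_words = sorted({w for m in messages for w in tokenize(m)})
vocabulary = {word: i for i, word in enumerate(all_words)}

def to_counts(text):
    # One count per vocabulary word, in column order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [to_counts(m) for m in messages]
print(vocabulary)
print(vectors)
```

Each message becomes a row of integers with one column per known word, which is the kind of input a classifier can work with.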

****

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from sklearn.naive_bayes import MultinomialNB

## Load the .csv file

In [4]:
# https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

In [5]:
df = pd.read_csv("spam.csv", 
                 sep=",", encoding='ISO-8859-1')
df.head(15)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
5,spam,FreeMsg Hey there darling it's been 3 week's n...,,,
6,ham,Even my brother is not like to speak with me. ...,,,
7,ham,As per your request 'Melle Melle (Oru Minnamin...,,,
8,spam,WINNER!! As a valued network customer you have...,,,
9,spam,Had your mobile 11 months or more? U R entitle...,,,


In [6]:
df = df.iloc[:, 0:2]
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Rename the columns

In [7]:
df.columns= ["Status", "Message"]
df.head()

Unnamed: 0,Status,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [8]:
df.shape

(5572, 2)

In [9]:
len(df)

5572

## Check for missing data

In [10]:
df.Message.isnull().sum()

np.int64(0)

In [11]:
df.describe()

Unnamed: 0,Status,Message
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


## How many "spam" messages are in our data?

**Approach 1**

In [12]:
df.head()

Unnamed: 0,Status,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [13]:
df.Status.value_counts()

Status
ham     4825
spam     747
Name: count, dtype: int64

**Approach 2**

In [14]:
df.iloc[:,0].value_counts()

Status
ham     4825
spam     747
Name: count, dtype: int64

**Approach 3**

In [15]:
df_spam = df[df.Status == "spam"]
len(df_spam)

747

**Approach 4**

In [16]:
data = df[df.iloc[:,0] == "spam"]
len(data)

747
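
The counts above show that the classes are imbalanced, which matters when reading accuracy figures later on. The short sketch below works out the spam share and the accuracy of a trivial model that labels every message as ham. The numbers are copied from the `value_counts()` output; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
n_ham, n_spam = 4825, 747
total = n_ham + n_spam

spam_share = n_spam / total
# Accuracy of a trivial model that labels every message as "ham".
baseline_accuracy = n_ham / total

print(f"spam share: {spam_share:.1%}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.1%}")
```

Any classifier trained on this data should be judged against that baseline of roughly 87%, not against 50%.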

## spam == 1 (True); ham == 0 (False)

**Method 1**

In [17]:
df["Status"] = df["Status"].map({"ham": 0, "spam": 1})
df.head()

Unnamed: 0,Status,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [18]:
df.shape

(5572, 2)

In [19]:
X = df.Message

In [20]:
y = df.Status

## Train, Test split

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)
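
As a quick sanity check on the split, the sketch below works out how many rows end up in each part. It assumes the test size is rounded up and the training set receives the remainder, which agrees with the shapes reported further down (4457 and 1115).

```python
import math

n_samples = 5572   # rows in df, from df.shape above
test_size = 0.2

# Assumption: the test set is rounded up and the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Because spam is the minority class here, passing `stratify=y` to `train_test_split` is a common option for keeping the ham/spam ratio the same in both parts; the notebook does not use it.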

## Method 1: CountVectorizer

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

In [23]:
cv = CountVectorizer()

In [24]:
X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)
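
The cell above calls `fit_transform` on the training messages but only `transform` on the test messages. The plain-Python sketch below illustrates why: the vocabulary is learned from the training data alone, and words that appear only in the test data are dropped. The messages and the helpers `fit_vocabulary` and `transform` are hypothetical, and the tokenizer is deliberately naive.

```python
from collections import Counter

# Made-up messages, for illustration only.
train_messages = ["win a free prize", "see you at home"]
test_messages = ["free pizza at home"]

def fit_vocabulary(messages):
    # "fit": learn the column index of every word seen in training.
    words = sorted({w for m in messages for w in m.lower().split()})
    return {word: i for i, word in enumerate(words)}

def transform(messages, vocabulary):
    # "transform": count known words; words outside the vocabulary are dropped.
    rows = []
    for m in messages:
        counts = Counter(w for w in m.lower().split() if w in vocabulary)
        rows.append([counts[word] for word in vocabulary])
    return rows

vocabulary = fit_vocabulary(train_messages)
X_train_demo = transform(train_messages, vocabulary)
X_test_demo = transform(test_messages, vocabulary)

print(len(vocabulary), len(X_test_demo[0]))  # same number of columns
print(X_test_demo)
```

Both matrices share the same columns, so a model trained on one can score the other. The unseen word "pizza" simply contributes nothing.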

In [25]:
y_train = y_train.astype("int")
y_test = y_test.astype("int")

In [26]:
y_train = np.array(y_train)
y_test = np.array(y_test)

In [27]:
X_train

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 58826 stored elements and shape (4457, 7612)>

In [28]:
X_test

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 13975 stored elements and shape (1115, 7612)>

In [29]:
y_train

array([0, 0, 0, ..., 0, 0, 0], shape=(4457,))

In [30]:
y_test

array([0, 0, 0, ..., 0, 0, 0], shape=(1115,))

## A bit of machine learning

In [31]:
clf = MultinomialNB()

In [32]:
clf.fit(X_train, y_train)

0,1,2
,alpha,1.0
,force_alpha,True
,fit_prior,True
,class_prior,
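
To show roughly what `MultinomialNB` estimates when `alpha=1.0`, here is a small plain-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing. The count vectors, the three-word vocabulary and the function names are invented for illustration; this is a teaching sketch, not scikit-learn's code.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word in class c, plus alpha (Laplace smoothing).
        word_totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        denom = sum(word_totals)
        log_likelihood[c] = [math.log(t / denom) for t in word_totals]
    return log_prior, log_likelihood

def predict(x, log_prior, log_likelihood):
    # Score each class: log prior + sum of count * log P(word | class).
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Made-up count vectors over the vocabulary ["free", "home", "win"].
X_demo = [[2, 0, 1], [1, 0, 2], [0, 2, 0], [0, 1, 0]]
y_demo = [1, 1, 0, 0]  # 1 = spam, 0 = ham

log_prior, log_likelihood = fit_multinomial_nb(X_demo, y_demo)
print(predict([1, 0, 1], log_prior, log_likelihood))
print(predict([0, 1, 0], log_prior, log_likelihood))
```

The smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.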


In [33]:
y_pred = clf.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0], shape=(1115,))

In [34]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
print(acc * 100)

98.7443946188341


In [35]:
clf.score(X_test, y_test)

0.9874439461883409

In [36]:
aciertos = 0

for i in range(len(y_pred)):
    if y_pred[i] == y_test[i]:
        aciertos += 1
aciertos

1101

In [37]:
(aciertos/len(y_pred))*100

98.7443946188341

## Compute the confusion matrix

In [38]:
len(y_train)

4457

**False Positives**

In [39]:
FP = 0

for i in np.arange(len(y_test)):
    if y_test[i] == 0 and y_pred[i] == 1:
        FP += 1
FP

2

**False Negatives**

In [40]:
FN = 0

for i in np.arange(len(y_test)):
    if y_test[i] == 1 and y_pred[i] == 0:
        FN += 1
FN

12

**True Positives**

In [41]:
TP = 0

for i in np.arange(len(y_test)):
    if y_test[i] == 1 and y_pred[i] == 1:
        TP += 1
TP

154

**True Negatives**

In [42]:
TN = 0

for i in np.arange(len(y_test)):
    if y_test[i] == 0 and y_pred[i] == 0:
        TN += 1
TN

947

In [43]:
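# Rows are the true class, columns the predicted class.
# Named cm_manual so it does not clash with sklearn's confusion_matrix function.
cm_manual = np.array([[TN, FP],
                      [FN, TP]])
cm_manual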

array([[947,   2],
       [ 12, 154]])

In [44]:
((TN + TP) / (TN+TP+FP+FN)) *100

98.7443946188341

**The scikit-learn way**

In [45]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm

array([[947,   2],
       [ 12, 154]])
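
Accuracy alone hides how the errors are split between the two classes. As a small worked example, the sketch below derives precision, recall and F1 for the spam class from the four counts in the matrix above. The values are copied from that output.

```python
# Counts taken from the confusion matrix above: [[TN, FP], [FN, TP]].
TN, FP, FN, TP = 947, 2, 12, 154

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)   # of the messages flagged as spam, how many were spam
recall = TP / (TP + FN)      # of the real spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Recall comes out lower than precision here: the model rarely flags ham as spam, but it lets 12 of the 166 spam messages through.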

## Now with TfidfVectorizer

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

In [47]:
X_train

1114    No no:)this is kallis home ground.amla home to...
3589    I am in escape theatre now. . Going to watch K...
3095    We walked from my moms. Right on stagwood pass...
1012       I dunno they close oredi not... ÌÏ v ma fan...
3320                               Yo im right by yo work
                              ...                        
4931                Match started.india  &lt;#&gt;  for 2
3264    44 7732584351, Do you want a New Nokia 3510i c...
1653    I was at bugis juz now wat... But now i'm walk...
2607    :-) yeah! Lol. Luckily i didn't have a starrin...
2732    How dare you stupid. I wont tell anything to y...
Name: Message, Length: 4457, dtype: object

In [48]:
X_test

4456    Aight should I just plan to come up later toni...
690                                    Was the farm open?
944     I sent my scores to sophas and i had to do sec...
3768    Was gr8 to see that message. So when r u leavi...
1189    In that case I guess I'll see you at campus lodge
                              ...                        
2906                                               ALRITE
1270    Sorry chikku, my cell got some problem thts y ...
3944    I will be gentle princess! We will make sweet ...
2124    Beautiful Truth against Gravity.. Read careful...
253     Ups which is 3days also, and the shipping comp...
Name: Message, Length: 1115, dtype: object

In [49]:
y_train

1114    0
3589    0
3095    0
1012    0
3320    0
       ..
4931    0
3264    1
1653    0
2607    0
2732    0
Name: Status, Length: 4457, dtype: int64

In [50]:
y_test

4456    0
690     0
944     0
3768    0
1189    0
       ..
2906    0
1270    0
3944    0
2124    0
253     0
Name: Status, Length: 1115, dtype: int64

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [52]:
tv = TfidfVectorizer(stop_words = "english")
tv

0,1,2
,input,'content'
,encoding,'utf-8'
,decode_error,'strict'
,strip_accents,
,lowercase,True
,preprocessor,
,tokenizer,
,analyzer,'word'
,stop_words,'english'
,token_pattern,'(?u)\\b\\w\\w+\\b'


In [53]:
X_train = tv.fit_transform(X_train)
X_test = tv.transform(X_test)
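
For intuition about what the TF-IDF weights look like, here is a plain-Python sketch. It uses the smoothed idf formula that scikit-learn documents for its default `smooth_idf=True`, namely `ln((1 + n) / (1 + df)) + 1`, followed by scaling each row to unit Euclidean length. The tokenized messages are invented, and stop-word removal (used in the cell above) is not modeled.

```python
import math

# Made-up tokenized messages, for illustration only.
docs = [["free", "prize", "free"], ["see", "you", "home"], ["free", "home"]]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: in how many messages each word appears.
df_counts = {w: sum(w in d for d in docs) for w in vocab}
# Smoothed idf: rarer words get a larger weight.
idf = {w: math.log((1 + n_docs) / (1 + df_counts[w])) + 1 for w in vocab}

def tfidf(doc):
    # Raw term count times idf, then scale the row to unit Euclidean length.
    weights = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in weights))
    return [v / norm for v in weights]

row = tfidf(docs[0])
print(dict(zip(vocab, (round(v, 3) for v in row))))
```

Unlike the integer counts from `CountVectorizer`, every row now holds fractional weights, and a word that appears in many messages counts for less than a rare one.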

In [54]:
y_train = y_train.astype("int")
y_test = y_test.astype("int")

In [55]:
y_train = np.array(y_train)
y_test = np.array(y_test)

## Create the classifier

In [56]:
clf = MultinomialNB()

In [57]:
clf.fit(X_train, y_train)

0,1,2
,alpha,1.0
,force_alpha,True
,fit_prior,True
,class_prior,


In [58]:
y_pred = clf.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0], shape=(1115,))

In [59]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
print(acc * 100)

96.59192825112108


In [60]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm

array([[949,   0],
       [ 38, 128]])
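
To compare the two runs side by side, the sketch below recomputes accuracy, precision and recall for the spam class from the two confusion matrices printed in this notebook. The helper `spam_metrics` is a hypothetical name introduced for this example.

```python
def spam_metrics(cm):
    # cm is laid out as [[TN, FP], [FN, TP]], as in the outputs above.
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Values copied from the two confusion matrices printed in this notebook.
results = {
    "CountVectorizer": spam_metrics([[947, 2], [12, 154]]),
    "TfidfVectorizer": spam_metrics([[949, 0], [38, 128]]),
}

for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

On this split the TF-IDF run produces no false positives but misses 38 of the 166 spam messages, so its recall is clearly lower. Note that the two runs differ in more than the weighting scheme, since only the TF-IDF run sets `stop_words="english"`, so the gap cannot be attributed to TF-IDF alone.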
