## Logistic Regression: Spam Mail Prediction

We will use a logistic regression model to predict whether an email is spam or not, and then try the trained model on new data.

In [63]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text data into numeric features
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#### Loading the data

In [65]:
datos = pd.read_csv("C:/Users/pauri/OneDrive/Escritorio/Python projects/Trained Logistic Regresion_Spam Mail Prediction/mail_data.csv")
datos

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [66]:
datos.isnull().sum()

Category    0
Message     0
dtype: int64

There are no missing (NA) values.

#### Encoding the categorical variable

In [69]:
# Encode spam as 0 and ham as 1
datos.loc[datos["Category"] == "spam", "Category"] = 0
datos.loc[datos["Category"] == "ham", "Category"] = 1

In [70]:
datos

Unnamed: 0,Category,Message
0,1,"Go until jurong point, crazy.. Available only ..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,0,This is the 2nd time we have tried 2 contact u...
5568,1,Will ü b going to esplanade fr home?
5569,1,"Pity, * was in mood for that. So...any other s..."
5570,1,The guy did some bitching but I acted like i'd...
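
The `.loc` assignments in cell 69 write integers into a column that typically keeps its `object` dtype, which is presumably why the labels are cast with `astype("int")` further down. As a minimal sketch of an alternative, `Series.map` does the encoding in one step and returns an integer column. The small frame `toy` and the dictionary name `etiquetas` are made up for illustration and stand in for the real `mail_data.csv`.

```python
import pandas as pd

# Toy frame standing in for mail_data.csv (invented rows, not the real data)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at lunch", "win a free prize now", "ok thanks"],
})

# Map each label to its integer code: spam -> 0, ham -> 1
etiquetas = {"spam": 0, "ham": 1}
toy["Category"] = toy["Category"].map(etiquetas)

print(toy["Category"].tolist())  # [1, 0, 1]
print(toy["Category"].dtype)     # an integer dtype, so no later cast is needed
```

A label that is missing from the dictionary would become `NaN` under `map`, so this variant also surfaces unexpected category values.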


In [71]:
# Split the data into message text (X) and labels (y)

X = datos["Message"]
y = datos["Category"]

In [72]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
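
In the preview above most messages are ham, so a purely random split can give the test set a slightly different spam ratio from the training set. As a hedged sketch, passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The toy lists `X_toy` and `y_toy` are invented for illustration.

```python
from sklearn.model_selection import train_test_split

# Invented toy data: 16 ham messages (label 1) and 4 spam messages (label 0)
X_toy = [f"ham message {i}" for i in range(16)] + [f"spam message {i}" for i in range(4)]
y_toy = [1] * 16 + [0] * 4

# stratify=y_toy keeps the 80/20 ham/spam ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=3, stratify=y_toy
)

print(len(y_tr), y_tr.count(0))  # 15 training labels, 3 of them spam
print(len(y_te), y_te.count(0))  # 5 test labels, 1 of them spam
```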

In [73]:
# Convert the text in X to numeric features with TfidfVectorizer

# min_df=1: keep every term that appears in at least one message
# stop_words="english": drop common English words (the, or, ...)
# lowercase=True: convert all text to lowercase before tokenizing
extraccion_caracteristicas = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

X_train_caracteristicas = extraccion_caracteristicas.fit_transform(X_train)
X_test_caracteristicas = extraccion_caracteristicas.transform(X_test)

# Convert the labels in y to integers

y_train = y_train.astype("int")
y_test = y_test.astype("int")
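
The vectorizer is fitted on the training messages only and then reused on the test messages. The following sketch shows what that means in practice, using a few invented messages (`train_msgs`, `test_msgs`) rather than the real dataset: both matrices share the same columns, and a word that never appeared in training contributes nothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy messages, not taken from mail_data.csv
train_msgs = ["Free prize waiting", "lunch tomorrow maybe", "free entry prize"]
test_msgs = ["FREE lunch", "zzzunseen"]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)

# fit_transform learns the vocabulary and IDF weights from the training text only
X_tr = vec.fit_transform(train_msgs)
# transform reuses that vocabulary; words never seen in training are ignored
X_te = vec.transform(test_msgs)

print(sorted(vec.vocabulary_))  # terms learned from the training messages
print(X_tr.shape, X_te.shape)   # same number of columns in both matrices
print(X_te[1].nnz)              # 0: the unseen word produces an all-zero row
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.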

#### Training the model

In [76]:
modelo = LogisticRegression()
modelo.fit(X_train_caracteristicas,y_train)
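
The vectorizer and the classifier are always used together, so they can be bundled into a single object. This is a sketch of that idea with scikit-learn's `make_pipeline`, not the notebook's own approach. The toy `mensajes` and `etiquetas` are invented, with labels following the notebook's convention (spam = 0, ham = 1).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data; labels follow the notebook's convention (spam = 0, ham = 1)
mensajes = [
    "win a free prize now",
    "free entry to win cash",
    "are we still meeting for lunch",
    "ok see you at home later",
]
etiquetas = [0, 0, 1, 1]

# The pipeline vectorizes the raw text and then fits the classifier
pipeline = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
pipeline.fit(mensajes, etiquetas)

# Raw strings go straight in; no separate transform call is needed
print(pipeline.predict(["free cash prize"]))
```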

#### Evaluating the model

In [78]:
prediccion = modelo.predict(X_test_caracteristicas)
prediccion

array([0, 1, 1, ..., 1, 1, 1])

In [80]:
accuracy = accuracy_score(y_test, prediccion)
accuracy

0.9659192825112107
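
Accuracy alone can hide weak spam detection when ham dominates the data. The sketch below uses invented labels (`y_true`, `y_pred`), not the notebook's results, to show how a confusion matrix together with precision and recall for the spam class (label 0) gives a fuller picture.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Invented labels (spam = 0, ham = 1): 8 ham and 2 spam messages
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# Hypothetical predictions that miss one of the two spam messages
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(accuracy_score(y_true, y_pred))                # 0.9
# Rows are true classes, columns are predicted classes, in the order [spam, ham]
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
print(precision_score(y_true, y_pred, pos_label=0))  # 1.0
print(recall_score(y_true, y_pred, pos_label=0))     # 0.5
```

In this toy case accuracy is 90% even though half of the spam gets through. Running the same calls on `y_test` and `prediccion` would show whether the model behaves similarly.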

#### Prediction system

In [100]:
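# Message to classify; put the email text inside the list
input_mail = [""]

# Reuse the fitted vectorizer so the features match those seen in training
input_data_caracteristicas = extraccion_caracteristicas.transform(input_mail)

prediccion = modelo.predict(input_data_caracteristicas)

# predict returns an array with one label per input message
if prediccion[0] == 1:
    print("Not spam")
else:
    print("SPAM!")

Not spam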
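
The cell above classifies an empty string. Its TF-IDF vector is all zeros, so the result reflects only the model's intercept rather than any message content. As a rough sketch, the prediction step can be wrapped in a small helper that also reports the estimated spam probability through `predict_proba`. The function `clasificar_mensaje` and the toy training data are hypothetical and only illustrate the pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def clasificar_mensaje(texto, vectorizador, modelo):
    """Return (label, spam probability) for one message (hypothetical helper)."""
    caracteristicas = vectorizador.transform([texto])
    # Columns of predict_proba follow modelo.classes_; spam is class 0 here
    indice_spam = list(modelo.classes_).index(0)
    prob_spam = float(modelo.predict_proba(caracteristicas)[0, indice_spam])
    etiqueta = "Not spam" if modelo.predict(caracteristicas)[0] == 1 else "SPAM!"
    return etiqueta, prob_spam


# Invented toy training data (spam = 0, ham = 1), standing in for the real model
mensajes = [
    "win a free prize now",
    "free entry to win cash",
    "are we still meeting for lunch",
    "ok see you at home later",
]
etiquetas = [0, 0, 1, 1]

vec = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
clf = LogisticRegression().fit(vec.fit_transform(mensajes), etiquetas)

etiqueta, prob = clasificar_mensaje("free prize inside", vec, clf)
print(etiqueta, round(prob, 3))
```

With the notebook's own objects, the call would be `clasificar_mensaje(texto, extraccion_caracteristicas, modelo)`.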
