# NLP
### "Sistemas de Inteligencia Artificial para la Toma de Decisiones"
### Unit 3, Activity 1
Alexis Guzman


In [1]:
# Read the attached CSV file with the labeled messages (spam/ham)
import pandas as pd

df = pd.read_csv("./Data/spam.csv", encoding="latin-1")
# Drop the unused extra columns and give the remaining two clear names
df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)
df.columns = ["label", "text"]
print(f'Total spam: {df[df["label"] == "spam"].shape[0]}')
print(f'Total ham: {df[df["label"] == "ham"].shape[0]}')
print(f'Total messages: {df.shape[0]}')
print(f'Total nulls in label: {df["label"].isnull().sum()}')
print(f'Total nulls in text: {df["text"].isnull().sum()}')

Total spam: 747
Total ham: 4825
Total messages: 5572
Total nulls in label: 0
Total nulls in text: 0


In [2]:
df.head(10)

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


### Data cleaning
The text-cleaning process consists of the following steps:

- All words are converted to lowercase.
- Punctuation characters (periods, semicolons, brackets, etc.) are removed.
- Stopwords, i.e. low-information words such as pronouns, prepositions, and definite and indefinite articles, are removed.
- Each word is reduced to its base form (lemma), for example flies → fly.

In [3]:
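from re import split
from string import punctuation

import nltk
from nltk.stem import WordNetLemmatizer

# Assumes the NLTK "stopwords" and "wordnet" corpora are already downloaded
lemmatizer = WordNetLemmatizer()
stopwords = nltk.corpus.stopwords.words('english')

def text_cleaning(text):
    # Lowercase the text and drop every punctuation character
    text = "".join([char.lower() for char in text if char not in punctuation])
    # Split into tokens on runs of non-word characters
    text = split(r'\W+', text)
    # Remove English stopwords
    text = [word for word in text if word not in stopwords]
    # Reduce each remaining token to its lemma
    text = [lemmatizer.lemmatize(word) for word in text]
    return text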


['nursing', 'diagnosis', 'may', 'part', 'nursing', 'process', 'clinical', 'judgment', 'individual', 'family', 'community', 'experiencesresponses', 'actual', 'potential', 'health', 'problemslife', 'process', 'nursing', 'diagnosis', 'developed', 'based', 'data', 'obtained', 'nursing', 'assessment']
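
The token list above contains fused words such as `experiencesresponses` and `problemslife`. This most likely happens because `text_cleaning` deletes punctuation instead of replacing it with a space, so a string like `experiences/responses` collapses into a single token. The sketch below compares the two behaviours using only the standard library. The input sentence is a made-up illustration, and `space_punctuation` is a hypothetical alternative, not part of the notebook.

```python
import re
from string import punctuation

def strip_punctuation(text):
    # Same approach as text_cleaning: punctuation characters are deleted.
    return "".join(ch.lower() for ch in text if ch not in punctuation)

def space_punctuation(text):
    # Hypothetical alternative: map every punctuation character to a space.
    table = str.maketrans(punctuation, " " * len(punctuation))
    return text.lower().translate(table)

# Made-up sample input, chosen to contain "/" between words.
sample = "experiences/responses to health problems/life processes"

deleted = re.split(r"\W+", strip_punctuation(sample))
spaced = re.split(r"\W+", space_punctuation(sample))

print(deleted)
print(spaced)
```

Replacing punctuation with a space keeps the words apart, at the cost of also splitting contractions such as `don't` into `don` and `t`.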


In [4]:
df["text_clean"] = df["text"].apply(text_cleaning)
df.head(10)


| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | [go, jurong, point, crazy, available, bugis, n... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, u, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, 2, wkly, comp, win, fa, cup, fin... |
| 3 | ham | U dun say so early hor... U c already then say... | [u, dun, say, early, hor, u, c, already, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | [nah, dont, think, go, usf, life, around, though] |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | [freemsg, hey, darling, 3, week, word, back, i... |
| 6 | ham | Even my brother is not like to speak with me. ... | [even, brother, like, speak, treat, like, aid,... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | [per, request, melle, melle, oru, minnaminungi... |
| 8 | spam | WINNER!! As a valued network customer you have... | [winner, valued, network, customer, selected, ... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | [mobile, 11, month, u, r, entitled, update, la... |


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(analyzer=text_cleaning)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])

print(tfidf_matrix.shape)


(5572, 8866)


### TF-IDF Matrix

TF-IDF (Term Frequency-Inverse Document Frequency) is a technique used in natural language processing (NLP) and information retrieval to assess how important a word or term is to a document within a corpus. It is commonly used in tasks such as information retrieval, text classification, sentiment analysis, and automatic summarization.
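
To make the weighting concrete, here is a minimal standard-library sketch of the textbook formula, `tf * log(N / df)`, where `tf` is the term's relative frequency in the document, `N` is the number of documents, and `df` is the number of documents that contain the term. The function `tf_idf` and the toy corpus are illustrative assumptions. scikit-learn's `TfidfVectorizer` by default applies a smoothed IDF and L2-normalizes each row, so its values will differ from these.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (made up for illustration), already tokenized.
toy_corpus = [
    ["free", "entry", "win", "cup"],
    ["ok", "lar", "joking"],
    ["free", "mobile", "update"],
]
toy_weights = tf_idf(toy_corpus)
print(toy_weights[0])
```

In the first document, `free` appears in two of the three documents and therefore gets a lower weight than `win`, which appears in only one. A term present in every document would get a weight of zero under this unsmoothed formula.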

In [6]:
tfidf_df = pd.DataFrame(tfidf_matrix.toarray())

tfidf_df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8856,8857,8858,8859,8860,8861,8862,8863,8864,8865
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Random Forest
A random forest is an ensemble of decision trees, each trained independently on a different subset of the data and features. The trees' outputs are then combined to produce more accurate and robust decisions.
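
The two ingredients described above, resampling the training data per tree and combining the trees' outputs, can be sketched with the standard library alone. The helpers `bootstrap_indices` and `majority_vote` and the per-tree predictions are hypothetical and only illustrate the idea. scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by a hard vote.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, seed=None):
    # Draw n_samples row indices with replacement: one bootstrap sample
    # that a single tree would be trained on.
    rng = random.Random(seed)
    return rng.choices(range(n_samples), k=n_samples)

def majority_vote(tree_predictions):
    # tree_predictions holds one list of labels per tree; zip(*...) regroups
    # them per sample so the most common label can be picked for each one.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*tree_predictions)]

# Hypothetical predictions from three trees for four messages.
tree_a = ["spam", "ham", "ham", "spam"]
tree_b = ["spam", "ham", "spam", "ham"]
tree_c = ["ham", "ham", "spam", "spam"]

combined = majority_vote([tree_a, tree_b, tree_c])
print(combined)
```

Using an odd number of trees avoids ties in this two-class vote. Each individual tree disagrees with the combined result on at most one message, which is the sense in which the ensemble is more robust than any single tree.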

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
# Split the data into training and test sets (80/20)
x_train, x_test, y_train, y_test = train_test_split(tfidf_df, df["label"], test_size=0.2)
rf = RandomForestClassifier()
# Train the model
rf_model = rf.fit(x_train, y_train)
# Get the trained model's predictions on the test set
y_predict = rf_model.predict(x_test)
# Evaluate the model, treating "spam" as the positive class
precision = precision_score(y_test, y_predict, pos_label="spam")
recall = recall_score(y_test, y_predict, pos_label="spam")
print(f'Precision: {round(precision, 3)} Recall: {round(recall, 3)}')


Precision: 1.0 Recall: 0.847


A precision of 1.0 means the model produced no false positives on the test set: every message it flagged as spam was in fact spam.

A recall of 0.847 means the model correctly identified 84.7% of all spam messages in the test set, so about 15.3% of the spam was misclassified as ham. Because `train_test_split` is called without a fixed `random_state`, these figures will vary from run to run.
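
For reference, both metrics reduce to simple counts: precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP, and FN are the true positives, false positives, and false negatives for the positive class. The sketch below computes them by hand on a small made-up label set. The function `precision_recall` and the toy labels are assumptions for illustration and are not taken from the notebook's data.

```python
def precision_recall(y_true, y_pred, pos_label="spam"):
    # Count true positives, false positives and false negatives
    # with respect to the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    # Guard against empty denominators by returning 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels: one spam message is missed, no ham message is flagged.
toy_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
toy_pred = ["spam", "spam", "ham", "ham", "ham", "ham"]

precision, recall = precision_recall(toy_true, toy_pred)
print(precision, recall)
```

The toy case mirrors the pattern in the results above: with no false positives the precision is exactly 1.0, while each missed spam message lowers the recall.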