# Classifying words (names by gender)

In [None]:
import nltk, random
nltk.download('names')
from nltk.corpus import names 

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


**Basic feature extraction function**

In [None]:
# define the relevant features: here, only the last letter of the word
def atributos(palabra):
    return {'ultima_letra': palabra[-1]}

tagset = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

In [None]:
tagset[:10]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male')]

In [None]:
random.shuffle(tagset)
tagset[:10]

[('Tony', 'male'),
 ('Natala', 'female'),
 ('Dorian', 'male'),
 ('Miriam', 'female'),
 ('Thomasine', 'female'),
 ('Archibald', 'male'),
 ('Wilfrid', 'male'),
 ('Alma', 'female'),
 ('Cat', 'male'),
 ('Morry', 'male')]

In [None]:
fset = [(atributos(n), g) for (n, g) in tagset]
train, test = fset[500:], fset[:500]

**Naive Bayes classification model**

In [None]:
# train the Naive Bayes model
classifier = nltk.NaiveBayesClassifier.train(train)

**Checking some predictions**

In [None]:
classifier.classify(atributos('amanda'))

'female'

In [None]:
classifier.classify(atributos('peter'))

'male'

**Model performance**

In [None]:
print(nltk.classify.accuracy(classifier, test))

0.786


In [None]:
print(nltk.classify.accuracy(classifier, train))

0.7612842557764643


**Better features**

In [None]:
def mas_atributos(nombre):
    atrib = {}
    atrib["primera_letra"] = nombre[0].lower()
    atrib["ultima_letra"] = nombre[-1].lower()
    for letra in 'abcdefghijklmnopqrstuvwxyz':
        atrib["count({})".format(letra)] = nombre.lower().count(letra)
        atrib["has({})".format(letra)] = (letra in nombre.lower())
    return atrib

In [None]:
mas_atributos('jhon')

{'primera_letra': 'j',
 'ultima_letra': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

In [None]:
fset = [(mas_atributos(n), g) for (n, g) in tagset]
train, test = fset[500:], fset[:500]
classifier2 = nltk.NaiveBayesClassifier.train(train)

In [None]:
print(nltk.classify.accuracy(classifier2, test))

0.784
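The richer feature set scores 0.784 on the test set, essentially the same as the 0.786 obtained with the last letter alone. One way to understand why is to list the names the model gets wrong. The sketch below is only an illustration: `listar_errores` and the `*_demo` stand-ins are hypothetical names, not part of the notebook or of NLTK.

```python
def listar_errores(clasificar, extraer, ejemplos):
    """Return (true_label, predicted_label, name) for every misclassified name.

    clasificar: callable mapping a feature dict to a label
    extraer:    callable mapping a name to a feature dict
    ejemplos:   iterable of (name, label) pairs
    """
    errores = []
    for nombre, etiqueta in ejemplos:
        prediccion = clasificar(extraer(nombre))
        if prediccion != etiqueta:
            errores.append((etiqueta, prediccion, nombre))
    return sorted(errores)


# Toy stand-ins so the sketch runs on its own (this is not the trained model).
def extraer_demo(nombre):
    return {"ultima_letra": nombre[-1].lower()}


def clasificar_demo(atrib):
    return "female" if atrib["ultima_letra"] in "aei" else "male"


ejemplos_demo = [("Alma", "female"), ("Tony", "male"), ("Miriam", "female")]
print(listar_errores(clasificar_demo, extraer_demo, ejemplos_demo))
# [('female', 'male', 'Miriam')]
```

With the notebook's objects, the call would look like `listar_errores(classifier2.classify, mas_atributos, tagset[:500])`, since `fset` preserves the order of `tagset`. Inspecting test-set errors to design features leaks information into the model, so a separate development split is the safer place to do this.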


### Practice exercise

**Goal:** Build a classifier for Spanish names using the following dataset:
https://github.com/jvalhondo/spanish-names-surnames

1. **Data preparation**: use `git clone` to bring the dataset into your Colab directory, then format the data and its features so that they have the same structure as the previous example with the English `names` dataset.

* **Think it through**: do the features used for English names apply equally well to Spanish names?
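As a starting point for that question, here is a minimal sketch of a feature extractor aimed at Spanish names. It assumes names may arrive with accents or mixed case, and it treats a vowel ending as potentially informative; both are assumptions to test, not findings. `normalizar` and `atributos_es` are hypothetical helper names.

```python
import unicodedata


def normalizar(nombre):
    """Lowercase a name and strip combining marks (e.g. 'JOSÉ' -> 'jose').

    Note that this also maps 'ñ' to 'n', which may or may not be desirable.
    """
    descompuesto = unicodedata.normalize("NFD", nombre.strip().lower())
    return "".join(c for c in descompuesto if unicodedata.category(c) != "Mn")


def atributos_es(nombre):
    """Suffix-based features for a single (non-empty) name."""
    limpio = normalizar(nombre)
    return {
        "ultima_letra": limpio[-1],
        "ultimas_2": limpio[-2:],
        "termina_en_vocal": limpio[-1] in "aeiou",
    }


print(atributos_es("JOSÉ"))
# {'ultima_letra': 'e', 'ultimas_2': 'se', 'termina_en_vocal': True}
```

Whether these features help is an empirical question; comparing accuracy with and without each one is the only way to know.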

In [None]:
# write your code here
!git clone https://github.com/jvalhondo/spanish-names-surnames
import numpy as np
import pandas as pd

fatal: destination path 'spanish-names-surnames' already exists and is not an empty directory.


In [None]:
male_names = pd.read_csv("./spanish-names-surnames/male_names.csv")
female_names = pd.read_csv("./spanish-names-surnames/female_names.csv")

male_names["sex"] = "male"
female_names["sex"] = "female"

names = pd.concat([male_names,female_names]).sample(frac=1)
names.head()

Unnamed: 0,name,frequency,mean_age,sex
10889,ROSANE,51,40.6,female
23906,ERIC ANDRES,20,19.9,male
14763,AINHOA BELEN,35,8.8,female
8586,GEORGIANA MARIA,67,19.3,female
6924,PILAR ANTONIA,88,58.9,female


In [None]:
names["name"] = names["name"].apply(lambda x: str(x).split(" ")[0])
names.drop(["frequency","mean_age"],axis=1,inplace=True)
names.head()

Unnamed: 0,name,sex
10889,ROSANE,female
23906,ERIC,male
14763,AINHOA,female
8586,GEORGIANA,female
6924,PILAR,female


In [None]:
names["last character"] = names["name"].apply(lambda x: x[-1])
names.head()

Unnamed: 0,name,sex,last character
10889,ROSANE,female,E
23906,ERIC,male,C
14763,AINHOA,female,A
8586,GEORGIANA,female,A
6924,PILAR,female,R


2. **Training and model performance**: using NLTK's Naive Bayes classifier, train a simple model on the same feature as before (the last letter of the name), try a few predictions, and compute the model's accuracy.

In [None]:
def dataframe2dict(data):
  dictionary = data.to_dict('records')

  dataset = []
  for ind in dictionary:
    y = ind["sex"]
    ind.pop('sex', None)
    X = ind
    dataset.append((X,y))

  return dataset

In [None]:
# write your code here
data = dataframe2dict(names)

k = round(len(data) * 0.8)
train, test = data[:k], data[k:] 
classifier = nltk.NaiveBayesClassifier.train(train)
print(nltk.classify.accuracy(classifier, test))

0.9034252128090798
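One thing worth checking before trusting this number: `dataframe2dict` turns every remaining column into a feature, so each feature dict holds both `name` and `last character`. Because compound names were reduced to their first word, the same first name probably appears in many rows, and a model that saw `name` during training can simply recall it at test time. The sketch below shows one possible way to restrict the features and drop repeated names; `construir_dataset` and `sin_duplicados` are hypothetical helpers, and the demo records are made up for illustration.

```python
def construir_dataset(registros, columnas_atributos, columna_clase="sex"):
    """Build (features, label) pairs using only the chosen feature columns."""
    return [
        ({c: r[c] for c in columnas_atributos}, r[columna_clase])
        for r in registros
    ]


def sin_duplicados(registros, clave="name", columna_clase="sex"):
    """Keep the first record seen for each (name, label) pair."""
    vistos, unicos = set(), []
    for r in registros:
        k = (r[clave], r[columna_clase])
        if k not in vistos:
            vistos.add(k)
            unicos.append(r)
    return unicos


# Made-up records in the shape produced by names.to_dict('records').
registros_demo = [
    {"name": "ROSANE", "sex": "female", "last character": "E"},
    {"name": "ERIC", "sex": "male", "last character": "C"},
    {"name": "ERIC", "sex": "male", "last character": "C"},
]
datos_demo = construir_dataset(sin_duplicados(registros_demo), ["last character"])
print(datos_demo)
# [({'last character': 'E'}, 'female'), ({'last character': 'C'}, 'male')]
```

With the notebook's DataFrame, `registros` would be `names.to_dict('records')`. If the accuracy drops noticeably after this change, the earlier figure was likely inflated by memorized names.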


3. **Better features:** Define a function such as `atributos2()` that extracts better features for training an improved version of the classifier. Train a second model and check how much its performance improves. Can you think of better ways to define features for this particular task?

In [None]:
# write your code here
names.drop(["last character"],axis=1,inplace=True)
for n in range(1,4):
  names["last {}".format(n)] = names["name"].apply(lambda x: x[-n:])

names.head()

Unnamed: 0,name,sex,last 1,last 2,last 3
10889,ROSANE,female,E,NE,ANE
23906,ERIC,male,C,IC,RIC
14763,AINHOA,female,A,OA,HOA
8586,GEORGIANA,female,A,NA,ANA
6924,PILAR,female,R,AR,LAR


In [None]:
names.drop(["last 1","last 2"],axis=1,inplace=True) # better with only last 3 characters

data = dataframe2dict(names)

k = round(len(data) * 0.8)
train, test = data[:k], data[k:] 
classifier = nltk.NaiveBayesClassifier.train(train)
print(nltk.classify.accuracy(classifier, test))

0.9205512768544791


# Classifying documents (spam vs. non-spam email)

In [None]:
!git clone https://github.com/pachocamacho1990/datasets

fatal: destination path 'datasets' already exists and is not an empty directory.


In [None]:
import pandas as pd
import numpy as np
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
df = pd.read_csv('datasets/email/csv/spam-apache.csv', names = ['clase','contenido'])
df['tokens'] = df['contenido'].apply(lambda x: word_tokenize(x))
df.head()

Unnamed: 0,clase,contenido,tokens
0,-1,"<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Tr...","[<, !, DOCTYPE, HTML, PUBLIC, ``, -//W3C//DTD,..."
1,1,> Russell Turpin:\n> > That depends on how the...,"[>, Russell, Turpin, :, >, >, That, depends, o..."
2,-1,Help wanted. We are a 14 year old fortune 500...,"[Help, wanted, ., We, are, a, 14, year, old, f..."
3,-1,Request A Free No Obligation Consultation!\nAc...,"[Request, A, Free, No, Obligation, Consultatio..."
4,1,Is there a way to look for a particular file o...,"[Is, there, a, way, to, look, for, a, particul..."


In [None]:
df['tokens'].values[0]

['<',
 '!',
 'DOCTYPE',
 'HTML',
 'PUBLIC',
 '``',
 '-//W3C//DTD',
 'HTML',
 '4.0',
 'Transitional//EN',
 "''",
 '>',
 '<',
 'HTML',
 '>',
 '<',
 'HEAD',
 '>',
 '<',
 'META',
 'http-equiv=Content-Type',
 'content=',
 "''",
 'text/html',
 ';',
 'charset=iso-8859-1',
 "''",
 '>',
 '<',
 'META',
 'content=',
 "''",
 'MSHTML',
 '6.00.2600.0',
 "''",
 'name=GENERATOR',
 '>',
 '<',
 'STYLE',
 '>',
 '<',
 '/STYLE',
 '>',
 '<',
 '/HEAD',
 '>',
 '<',
 'BODY',
 'bgColor=',
 '#',
 'ffffff',
 '>',
 '<',
 'DIV',
 '>',
 '<',
 'FONT',
 'face=Arial',
 'size=2',
 '>',
 '<',
 'FONT',
 'face=',
 "''",
 'Times',
 'New',
 'Roman',
 "''",
 'size=3',
 '>',
 'Dear',
 'Friend',
 ',',
 '<',
 'BR',
 '>',
 '<',
 'BR',
 '>',
 'A',
 'recent',
 'survey',
 'by',
 'Nielsen/Netratings',
 'says',
 'that',
 '``',
 'The',
 'Internet',
 '<',
 'BR',
 '>',
 'population',
 'is',
 'rapidly',
 'approaching',
 'a',
 "'Half",
 'a',
 'Billion',
 "'",
 'people',
 '!',
 '``',
 '<',
 'BR',
 '>',
 '<',
 'BR',
 '>',
 'SO',
 'WHAT',
 'D

In [None]:
all_words = nltk.FreqDist([w for tokenlist in df['tokens'].values for w in tokenlist])
top_words = all_words.most_common(200)

def document_features(document):
    document_words = set(document)
    features = {}
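    # Caution: most_common() returns (word, count) pairs, so `word` here is a
    # tuple and can never match a token; `for word, _ in top_words` would test the words.
    for word in top_words: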
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [None]:
document_features(df['tokens'].values[0])

{"contains((',', 2173))": False,
 "contains(('.', 2168))": False,
 "contains(('the', 1967))": False,
 "contains(('>', 1787))": False,
 "contains(('--', 1611))": False,
 "contains(('to', 1435))": False,
 "contains((':', 1220))": False,
 "contains(('*', 1149))": False,
 "contains(('and', 1064))": False,
 "contains(('of', 958))": False,
 "contains(('a', 879))": False,
 "contains(('you', 744))": False,
 "contains(('in', 742))": False,
 "contains(('I', 741))": False,
 "contains(('<', 718))": False,
 "contains(('!', 698))": False,
 "contains(('%', 677))": False,
 "contains(('for', 609))": False,
 "contains(('is', 578))": False,
 "contains(('#', 521))": False,
 "contains(('BR', 494))": False,
 "contains(('that', 479))": False,
 "contains((')', 463))": False,
 "contains(('it', 458))": False,
 'contains(("\'\'", 434))': False,
 "contains(('$', 413))": False,
 "contains(('this', 384))": False,
 "contains(('(', 380))": False,
 "contains(('on', 378))": False,
 "contains(('http', 362))": False,
 "c

In [None]:
fset = [(document_features(texto), clase) for texto, clase in zip(df['tokens'].values, df['clase'].values)]
random.shuffle(fset)
train, test = fset[:200], fset[200:]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train)

In [None]:
print(nltk.classify.accuracy(classifier, test))

0.48


In [None]:
classifier.show_most_informative_features(5)

Most Informative Features
     contains(("'", 95)) = False              -1 : 1      =      1.0 : 1.0
   contains(("''", 434)) = False              -1 : 1      =      1.0 : 1.0
    contains(("'m", 51)) = False              -1 : 1      =      1.0 : 1.0
   contains(("'re", 41)) = False              -1 : 1      =      1.0 : 1.0
   contains(("'s", 263)) = False              -1 : 1      =      1.0 : 1.0
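The feature names in the output above, such as `contains((',', 2173))`, and the fact that every value is `False` suggest why accuracy sits near chance and every likelihood ratio is 1.0 : 1.0: `most_common()` returns `(word, count)` pairs, and `document_features` tests the whole pair against the token set. A minimal corrected sketch follows. It uses `collections.Counter` in place of `nltk.FreqDist` (which subclasses it) so that it runs without NLTK; `construir_extractor` is a hypothetical helper name.

```python
from collections import Counter


def construir_extractor(listas_de_tokens, n=200):
    """Return a feature function based on the n most frequent tokens."""
    conteo = Counter(w for tokens in listas_de_tokens for w in tokens)
    # Unpack each (word, count) pair and keep only the word.
    palabras = [w for w, _ in conteo.most_common(n)]

    def document_features(document):
        document_words = set(document)
        return {
            "contains({})".format(w): (w in document_words) for w in palabras
        }

    return document_features


docs_demo = [["free", "money", "now"], ["free", "lunch"], ["meeting", "now"]]
document_features_ok = construir_extractor(docs_demo, n=2)
print(document_features_ok(["free", "now"]))
# {'contains(free)': True, 'contains(now)': True}
```

In the notebook, the equivalent change is to iterate `for word, _ in top_words` inside `document_features`; the accuracy and informative features would then need to be recomputed.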


In [None]:
df[df['clase']==-1]['contenido']

0      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...
2      Help wanted.  We are a 14 year old fortune 500...
3      Request A Free No Obligation Consultation!\nAc...
10     >\n>“µ×è¹µÑÇ ¡ÑºâÅ¡¸ØÃ¡Ô¨º¹ÍÔ¹àµÍÃìà¹çµ” \n>àµ...
                             ...                        
243    ##############################################...
244    Wanna see sexually curious teens playing with ...
246    REQUEST FOR URGENT BUSINESS ASSISTANCE\n------...
248    Email marketing works!  There's no way around ...
249    Email marketing works!  There's no way around ...
Name: contenido, Length: 125, dtype: object

## Practice exercise


How could you build a better document classifier?

0. **Larger dataset:** The dataset we used was very small; consider using the corpus files located in `datasets/email/plaintext/`.

1. **Cleaning:** as you may have noticed, we did not clean the email text at all. Consider using regular expressions, filters by part of speech, and so on.

---

Based on these two points, build a larger dataset with more polished tokenization.
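As one possible shape for the cleaning step, the sketch below strips HTML tags with a regular expression, lowercases the text, and keeps only alphabetic tokens. The patterns and the helper name `limpiar_y_tokenizar` are assumptions chosen for illustration; a regex is a rough tool for HTML and will not handle every email.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")               # anything that looks like an HTML tag
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # words, optionally with one apostrophe


def limpiar_y_tokenizar(texto, stopwords=frozenset()):
    """Remove tags, lowercase, and return alphabetic tokens not in stopwords."""
    sin_html = TAG_RE.sub(" ", texto)
    tokens = TOKEN_RE.findall(sin_html.lower())
    return [t for t in tokens if t not in stopwords]


print(limpiar_y_tokenizar("<BR>Dear Friend, it's FREE!", stopwords={"it's"}))
# ['dear', 'friend', 'free']
```

Lowercasing before the stopword filter matters: NLTK's English stopword list is lowercase, so a case-sensitive comparison lets tokens such as `The` through.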

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
# write your code here:
# Unzip the archive
import zipfile
with zipfile.ZipFile('/content/datasets/email/plaintext/corpus1.zip') as corpus_zip:
    corpus_zip.extractall('/content/datasets/email/plaintext')

# List the files inside corpus1 (ham and spam)
from os import listdir

path_ham = "/content/datasets/email/plaintext/corpus1/ham/"
filepaths_ham = [path_ham+f for f in listdir(path_ham) if f.endswith('.txt')]

path_spam = "/content/datasets/email/plaintext/corpus1/spam/"
filepaths_spam = [path_spam+f for f in listdir(path_spam) if f.endswith('.txt')]


In [195]:
# Read a file and tokenize its contents
def abrir(texto):
  with open(texto, 'r', errors='ignore') as f2:
    data = f2.read()
    data = word_tokenize(data)
  return data

# Tokenized ham messages
list_ham = list(map(abrir, filepaths_ham))
# Tokenized spam messages
list_spam = list(map(abrir, filepaths_spam))


In [196]:
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwd = stopwords.words('english')

words_ham = nltk.FreqDist([w for tokenlist in list_ham for w in tokenlist if w not in stopwd])
words_spam = nltk.FreqDist([w for tokenlist in list_spam for w in tokenlist if w not in stopwd])

top_words_ham = words_ham.most_common(250)
top_words_spam = words_spam.most_common(250)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [197]:
total_words = len(words_spam)
key_words = [c for c, _ in top_words_spam]

def feature_generator(document):

  total_words = len(document)
  freq = nltk.FreqDist([w for w in document if w not in stopwd])
  features = dict()

  for word in key_words:
    
    if word in freq:
      features[word] = freq[word]/total_words
    else:
      features[word] = 0
  
  return features
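A caveat about these features: NLTK's `NaiveBayesClassifier` treats every feature value as a discrete category, so a relative frequency such as 0.0137 is just another label that may never recur in another document. Bucketing the values into a few coarse levels is one way around this. The sketch below is an illustration only; the cut points are arbitrary assumptions, and `discretizar` and `discretizar_atributos` are hypothetical helper names.

```python
def discretizar(frecuencia, cortes=(0.01, 0.05)):
    """Map a relative frequency to a coarse label.

    The cut points are illustrative assumptions, not tuned values.
    """
    if frecuencia == 0:
        return "none"
    if frecuencia < cortes[0]:
        return "low"
    if frecuencia < cortes[1]:
        return "mid"
    return "high"


def discretizar_atributos(features):
    """Apply the bucketing to every value of a feature dict."""
    return {palabra: discretizar(valor) for palabra, valor in features.items()}


print(discretizar_atributos({"free": 0.0, "click": 0.004, "money": 0.02, "http": 0.2}))
# {'free': 'none', 'click': 'low', 'money': 'mid', 'http': 'high'}
```

In the notebook this could be applied as `discretizar_atributos(feature_generator(texto))` before building `fset`.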


In [None]:
"""
# Select the most common words
all_words = nltk.FreqDist([w for tokenlist in list_ham+list_spam for w in tokenlist])
top_words = all_words.most_common(250)

# Add bigrams
bigram_text = nltk.Text([w for token in list_ham+list_spam for w in token])
bigrams = list(nltk.bigrams(bigram_text))
top_bigrams = (nltk.FreqDist(bigrams)).most_common(250)


def document_features(document):
    document_words = set(document)
    bigram = set(list(nltk.bigrams(nltk.Text([token for token in document]))))
    features = {}
    for word, j in top_words:
        features['contains({})'.format(word)] = (word in document_words)

    for bigrams, i in top_bigrams:
        features['contains_bigram({})'.format(bigrams)] = (bigrams in bigram)
  
    return features
"""



In [198]:
# Build labeled feature sets for ham and spam, then combine them
import random
fset_ham = [(feature_generator(texto), "no spam") for texto in list_ham]
fset_spam = [(feature_generator(texto), "spam") for texto in list_spam]
fset = fset_spam + fset_ham[:5000]
random.shuffle(fset)

In [199]:
# Split the feature sets into train and test
k = round(len(fset) * 0.8)
fset_train, fset_test = fset[:k], fset[k:]

# Train the classifier
classifier = nltk.NaiveBayesClassifier.train(fset_train)

# Evaluate on the test split
print(nltk.classify.accuracy(classifier, fset_test))

0.6924564796905223
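An accuracy of about 0.69 is hard to judge without knowing the class balance, since `fset` combines all spam messages with up to 5000 ham messages. Comparing against a classifier that always predicts the most frequent label gives a reference point. A minimal sketch, with `baseline_mayoritario` as a hypothetical helper name and made-up demo data:

```python
from collections import Counter


def baseline_mayoritario(ejemplos):
    """Return the most frequent label and the accuracy of always predicting it."""
    conteo = Counter(etiqueta for _, etiqueta in ejemplos)
    etiqueta, n = conteo.most_common(1)[0]
    return etiqueta, n / len(ejemplos)


# Made-up labeled examples; the feature dicts are irrelevant for the baseline.
ejemplos_demo = [({}, "no spam")] * 3 + [({}, "spam")]
print(baseline_mayoritario(ejemplos_demo))
# ('no spam', 0.75)
```

Calling `baseline_mayoritario(fset_test)` in the notebook shows whether the trained model actually beats this trivial strategy.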


2. **Validating the previous model:**

---

Once you have the larger, cleaner dataset, repeat the training with the same kind of features as in the previous example. Does the accuracy of the resulting model improve?

In [None]:
# write your code here:


3. **Build better features**: Sometimes what matters is not only the most frequent words but also their context, and context cannot be captured by looking at tokens individually. What if we consider bigrams, trigrams, and so on? Could word sequences work as better features for the model? To find out, extract n-grams from the corpus and obtain their frequencies with `FreqDist()`. Develop your own way of doing it and train a model with these new features; don't forget to share your results in the comments section.
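One possible starting point for this exercise is sketched below. It builds n-grams with plain `zip` and counts them with `collections.Counter` so that it runs without NLTK; `nltk.ngrams` and `nltk.FreqDist` could be swapped in. The helper names (`ngramas`, `top_ngramas`, `atributos_ngramas`) and the demo documents are hypothetical.

```python
from collections import Counter


def ngramas(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngramas(documentos, n=2, k=250):
    """Return the k most frequent n-grams across all documents."""
    conteo = Counter(g for tokens in documentos for g in ngramas(tokens, n))
    return [g for g, _ in conteo.most_common(k)]


def atributos_ngramas(tokens, vocabulario, n=2):
    """Boolean feature per vocabulary n-gram: does it occur in this document?"""
    presentes = set(ngramas(tokens, n))
    return {
        "contains_ngram({})".format(" ".join(g)): (g in presentes)
        for g in vocabulario
    }


docs_demo = [
    ["click", "here", "now"],
    ["click", "here", "please"],
    ["see", "you", "soon"],
]
vocab_demo = top_ngramas(docs_demo, n=2, k=1)
print(vocab_demo)
# [('click', 'here')]
print(atributos_ngramas(["please", "click", "here"], vocab_demo))
# {'contains_ngram(click here)': True}
```

The resulting dicts can be merged with the single-word features before training, which makes it easy to compare accuracy with and without the n-gram features.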

In [None]:
# write your code here:
