<a href="https://colab.research.google.com/github/Aleskies/Clasificacion-multilabel/blob/master/desafio_clasificacion_texto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [73]:
import pandas as pd
import numpy as np
import os


import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier


#for text cleaning
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#for visualization
import matplotlib.pyplot as plt

In [4]:
os.chdir('/content/drive/MyDrive/desafio-entrevistas-trabajo/brain-food-rn-clasicador-multiclase/')

In [5]:
# Base directory of the project in Google Drive
dir_base = "/content/drive/MyDrive/desafio-entrevistas-trabajo/brain-food-rn-clasicador-multiclase/"

df_train = pd.read_csv(os.path.join(dir_base, "data/text_classification_train.csv"))
df_test = pd.read_csv(os.path.join(dir_base, "data/text_classification_test.csv"))
df_emotions = pd.read_csv(os.path.join(dir_base, "data/emotions.txt"), header=None)

In [6]:
df_train.shape

(43410, 3)

In [7]:
df_train.head()

Unnamed: 0,text,emotion,id
0,My favourite food is anything I didn't have to...,27,eebbqej
1,"Now if he does off himself, everyone will thin...",27,ed00q6i
2,WHY THE FUCK IS BAYLESS ISOING,2,eezlygj
3,To make her feel threatened,14,ed7ypvh
4,Dirty Southern Wankers,3,ed0bdzj


Looking at the target variable `emotion`, the emotions are encoded as numeric values.

As shown below, an observation is not always assigned a single value: some observations carry a combination of emotions.

In [None]:
df_train['emotion'].unique()

In [9]:
df_train['emotion'].describe()

count     43410
unique      711
top          27
freq      12823
Name: emotion, dtype: object

There are 711 unique values, far more than the 28 emotions listed in the `emotions.txt` file.
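The gap between the two figures comes from comma-separated combinations being counted as distinct strings. A minimal sketch of that effect, using a made-up sample (`sample_emotions` is hypothetical, not taken from the dataset):

```python
# Hypothetical sample of raw `emotion` values (not taken from the dataset)
sample_emotions = ["27", "2", "5,15", "15", "5,15", "9,12,14"]

# Each distinct string counts as one "unique" value, combinations included
unique_strings = set(sample_emotions)

# Splitting on commas recovers the individual emotion codes
unique_codes = {code for value in sample_emotions for code in value.split(",")}

print(len(unique_strings))  # number of distinct raw strings
print(len(unique_codes))    # number of distinct individual codes
```

On the full DataFrame, something like `df_train['emotion'].str.split(',').explode().nunique()` should give the number of individual codes.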

In [None]:
df_emotions

Unnamed: 0,0
0,admiration
1,amusement
2,anger
3,annoyance
4,approval
5,caring
6,confusion
7,curiosity
8,desire
9,disappointment


Since an observation may have more than one emotion assigned, this is a multi-label classification problem.

Next, a new column is created that maps each numeric value to its corresponding emotion.

In [16]:
# Define a dictionary that maps each number to its corresponding label
etiquetas = {
    '0': 'admiration',
    '1': 'amusement',
    '2': 'anger',
    '3': 'annoyance',
    '4': 'approval',
    '5': 'caring',
    '6': 'confusion',
    '7': 'curiosity',
    '8': 'desire',
    '9': 'disappointment',
    '10': 'disapproval',
    '11': 'disgust',
    '12': 'embarrassment',
    '13': 'excitement',
    '14': 'fear',
    '15': 'gratitude',
    '16': 'grief',
    '17': 'joy',
    '18': 'love',
    '19': 'nervousness',
    '20': 'optimism',
    '21': 'pride',
    '22': 'realization',
    '23': 'relief',
    '24': 'remorse',
    '25': 'sadness',
    '26': 'surprise',
    '27': 'neutral'
}


# Create a new column holding the list of labels that corresponds to each number
df_train['etiquetas'] = df_train['emotion'].apply(lambda x: [etiquetas[num] for num in x.split(',')])
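The dictionary above is typed by hand. As an alternative sketch, the same mapping could be derived from the emotions file, assuming that the 0-based line position of each name is its numeric code (this matches the `df_emotions` listing, but it is an assumption about the file). The `io.StringIO` object and the name `etiquetas_from_file` are stand-ins for illustration:

```python
import io

# Stand-in for the first lines of emotions.txt (one emotion name per line)
emotions_file = io.StringIO("admiration\namusement\nanger\n")

# Assumption: the 0-based line position is the numeric code of the emotion
names = [line.strip() for line in emotions_file if line.strip()]
etiquetas_from_file = {str(code): name for code, name in enumerate(names)}

# Map a raw comma-separated value to its list of names
decoded = [etiquetas_from_file[code] for code in "0,2".split(",")]
print(decoded)
```

Deriving the mapping from the file avoids transcription errors if the list of emotions changes.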

In [17]:
df_train.head()

Unnamed: 0,text,emotion,id,etiquetas
0,My favourite food is anything I didn't have to...,27,eebbqej,[neutral]
1,"Now if he does off himself, everyone will thin...",27,ed00q6i,[neutral]
2,WHY THE FUCK IS BAYLESS ISOING,2,eezlygj,[anger]
3,To make her feel threatened,14,ed7ypvh,[fear]
4,Dirty Southern Wankers,3,ed0bdzj,[annoyance]


In [18]:
df_train[df_train['emotion']=='5,15']

Unnamed: 0,text,emotion,id,etiquetas
4906,"Thank you man whoever you are, this has really...","5,15",ed3dzj5,"[caring, gratitude]"
5885,thanks! That helps,"5,15",edkopnt,"[caring, gratitude]"
7651,"Thanks for your kind words, I really appreciat...","5,15",ee0x49j,"[caring, gratitude]"
14172,hugs back! thank you for writing and being her...,"5,15",ef2y327,"[caring, gratitude]"
22211,UPDATE i got a set of spare keys finally and r...,"5,15",eeb4z9l,"[caring, gratitude]"
23105,Haha thank you for all of this! I’m sorry you ...,"5,15",ede4d2y,"[caring, gratitude]"
29017,You have my good wishes and words of encourage...,"5,15",eedm2qs,"[caring, gratitude]"
33826,I’m seeing a doctor tomorrow. Thanks for caring,"5,15",ee4mb11,"[caring, gratitude]"
35323,I feel for him..thank [NAME] he has a mother w...,"5,15",eekis5u,"[caring, gratitude]"
35923,Were lucky we were able to get this much money...,"5,15",eeqe6v6,"[caring, gratitude]"


In [21]:
# convert the text to lowercase
df_train['text'] = df_train['text'].str.lower()
df_train.head()

Unnamed: 0,text,emotion,id,etiquetas
0,my favourite food is anything i didn't have to...,27,eebbqej,[neutral]
1,"now if he does off himself, everyone will thin...",27,ed00q6i,[neutral]
2,why the fuck is bayless isoing,2,eezlygj,[anger]
3,to make her feel threatened,14,ed7ypvh,[fear]
4,dirty southern wankers,3,ed0bdzj,[annoyance]


In [54]:
# Create a new column with the number of labels for each text
df_train['num_etiquetas'] = df_train['etiquetas'].apply(len)
df_train.head()

Unnamed: 0,text,emotion,id,etiquetas,num_etiquetas
0,my favourite food is anything i didn't have to...,27,eebbqej,[neutral],1
1,"now if he does off himself, everyone will thin...",27,ed00q6i,[neutral],1
2,why the fuck is bayless isoing,2,eezlygj,[anger],1
3,to make her feel threatened,14,ed7ypvh,[fear],1
4,dirty southern wankers,3,ed0bdzj,[annoyance],1


In [55]:
df_train['num_etiquetas'].describe()

count    43410.000000
mean         1.177217
std          0.417699
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          5.000000
Name: num_etiquetas, dtype: float64

In [56]:
df_train.groupby(['num_etiquetas'])['text'].count()

num_etiquetas
1    36308
2     6541
3      532
4       28
5        1
Name: text, dtype: int64

One observation has 5 of the 28 available emotions assigned. It is shown below:

In [58]:
df_train[df_train['num_etiquetas'] == 5]

Unnamed: 0,text,emotion,id,etiquetas,num_etiquetas
7873,yeah i probably would've started crying on the...,"9,12,14,19,25",ee6lqiq,"[disappointment, embarrassment, fear, nervousn...",5


In [59]:
df_train[df_train['num_etiquetas'] == 5]['text'].to_list()

['yeah i probably would\'ve started crying on the spot. loud, sudden and especially shrill noises are extremely *"cringey"* and uncomfortable and stressful']

Next, the label columns are created, one per emotion, encoded as 1 or 0 depending on whether the emotion is present or not.

In [60]:
from sklearn.preprocessing import MultiLabelBinarizer

# Create a MultiLabelBinarizer instance and use fit_transform to convert the labels to a binary format
mlb = MultiLabelBinarizer()
etiquetas_binarias = mlb.fit_transform(df_train['etiquetas'])

# Build a DataFrame from the binary labels and name its columns
df_etiquetas_binarias = pd.DataFrame(etiquetas_binarias, columns=mlb.classes_)

# Concatenate the binary-label DataFrame with the original DataFrame
df = pd.concat([df_train, df_etiquetas_binarias], axis=1)
df.head()

Unnamed: 0,text,emotion,id,etiquetas,num_etiquetas,admiration,amusement,anger,annoyance,approval,...,love,nervousness,neutral,optimism,pride,realization,relief,remorse,sadness,surprise
0,my favourite food is anything i didn't have to...,27,eebbqej,[neutral],1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,"now if he does off himself, everyone will thin...",27,ed00q6i,[neutral],1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,why the fuck is bayless isoing,2,eezlygj,[anger],1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,to make her feel threatened,14,ed7ypvh,[fear],1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,dirty southern wankers,3,ed0bdzj,[annoyance],1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
labels = mlb.classes_
labels

array(['admiration', 'amusement', 'anger', 'annoyance', 'approval',
       'caring', 'confusion', 'curiosity', 'desire', 'disappointment',
       'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear',
       'gratitude', 'grief', 'joy', 'love', 'nervousness', 'neutral',
       'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness',
       'surprise'], dtype=object)
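Note that `mlb.classes_` is sorted alphabetically, so `neutral` (code 27) sits between `nervousness` and `optimism`, and a column's position no longer matches its numeric code. A pure-Python sketch of the binarization, with hypothetical label lists (this only illustrates the idea, it is not the scikit-learn implementation):

```python
# Hypothetical label lists, in the same shape as the `etiquetas` column
samples = [["neutral"], ["caring", "gratitude"], ["anger"]]

# Classes are the sorted set of all labels seen (alphabetical order)
classes = sorted({label for sample in samples for label in sample})

# One row per sample, one 0/1 column per class
binary_rows = [[int(c in sample) for c in classes] for sample in samples]

print(classes)
for row in binary_rows:
    print(row)
```

Because of this ordering, predictions should be decoded with `mlb.inverse_transform` or `mlb.classes_` rather than with the numeric codes.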

The total number of occurrences of each emotion across all observations is distributed as follows:

In [61]:
df.iloc[:,5:].sum().sort_values(ascending = False)

neutral           14219
admiration         4130
approval           2939
gratitude          2662
annoyance          2470
amusement          2328
curiosity          2191
love               2086
disapproval        2022
optimism           1581
anger              1567
joy                1452
confusion          1368
sadness            1326
disappointment     1269
realization        1110
caring             1087
surprise           1060
excitement          853
disgust             793
desire              641
fear                596
remorse             545
embarrassment       303
nervousness         164
relief              153
pride               111
grief                77
dtype: int64

As a percentage of all label assignments (not of observations), this corresponds to:

In [62]:
df.iloc[:,5:].sum().sort_values(ascending = False) / df.iloc[:,5:].sum().sum() * 100

neutral           27.824198
admiration         8.081717
approval           5.751130
gratitude          5.209088
annoyance          4.833376
amusement          4.555506
curiosity          4.287420
love               4.081952
disapproval        3.956715
optimism           3.093752
anger              3.066356
joy                2.841320
confusion          2.676947
sadness            2.594760
disappointment     2.483220
realization        2.172084
caring             2.127077
surprise           2.074242
excitement         1.669178
disgust            1.551768
desire             1.254329
fear               1.166272
remorse            1.066474
embarrassment      0.592920
nervousness        0.320920
relief             0.299395
pride              0.217208
grief              0.150676
dtype: float64

Some emotions have a very low share, even below 1%, while the most frequent label is `neutral`.
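The percentages above divide each count by the total number of label assignments. Because some observations carry several labels, this differs from the share of observations that carry a given label. A small sketch of the two calculations, with made-up counts (`label_counts` and `n_observations` are hypothetical):

```python
# Hypothetical label counts and number of observations (not the real figures)
label_counts = {"neutral": 6, "admiration": 3, "grief": 1}
n_observations = 8  # fewer than the 10 assignments: some rows carry two labels

total_assignments = sum(label_counts.values())

# Share of all label assignments (what the cell above computes)
share_of_assignments = {k: 100 * v / total_assignments for k, v in label_counts.items()}

# Share of observations carrying the label (prevalence)
prevalence = {k: 100 * v / n_observations for k, v in label_counts.items()}

print(share_of_assignments)
print(prevalence)
```

With the real figures, `neutral` appears in 14219 of 43410 observations, about 32.8% of observations, against 27.8% of label assignments.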

## Text processing

In [65]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [66]:
%%time
def clean_text(txt):
    """
    Clean the input text through the following steps:
    1- Replace contractions
    2- Remove punctuation and digits
    3- Split into words
    4- Remove stopwords
    5- Remove any leftover non-alphabetic tokens
    """
    contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", 
                        "'cause": "because", "could've": "could have", 
                        "couldn't": "could not", "didn't": "did not",  
                        "doesn't": "does not", "don't": "do not", 
                        "hadn't": "had not", "hasn't": "has not", 
                        "haven't": "have not", "he'd": "he would",
                        "he'll": "he will", "he's": "he is", "how'd": "how did", 
                        "how'd'y": "how do you", "how'll": "how will", 
                        "how's": "how is",  "I'd": "I would", 
                        "I'd've": "I would have", "I'll": "I will", 
                        "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                        "i'd": "i would", "i'd've": "i would have", 
                        "i'll": "i will",  "i'll've": "i will have","i'm": "i am", 
                        "i've": "i have", "isn't": "is not", "it'd": "it would", 
                        "it'd've": "it would have", "it'll": "it will", 
                        "it'll've": "it will have","it's": "it is", 
                        "let's": "let us", "ma'am": "madam", "mayn't": "may not",
                        "might've": "might have","mightn't": "might not",
                        "mightn't've": "might not have", "must've": "must have", 
                        "mustn't": "must not", "mustn't've": "must not have", 
                        "needn't": "need not", "needn't've": "need not have",
                        "o'clock": "of the clock", "oughtn't": "ought not", 
                        "oughtn't've": "ought not have", "shan't": "shall not", 
                        "sha'n't": "shall not", "shan't've": "shall not have", 
                        "she'd": "she would", "she'd've": "she would have", 
                        "she'll": "she will", "she'll've": "she will have", 
                        "she's": "she is", "should've": "should have", 
                        "shouldn't": "should not", "shouldn't've": "should not have",
                        "so've": "so have","so's": "so as", "this's": "this is",
                        "that'd": "that would", "that'd've": "that would have", 
                        "that's": "that is", "there'd": "there would", 
                        "there'd've": "there would have", "there's": "there is", 
                        "here's": "here is","they'd": "they would", 
                        "they'd've": "they would have", "they'll": "they will",
                        "they'll've": "they will have", "they're": "they are",
                        "they've": "they have", "to've": "to have", 
                        "wasn't": "was not", "we'd": "we would", 
                        "we'd've": "we would have", "we'll": "we will",
                        "we'll've": "we will have", "we're": "we are", 
                        "we've": "we have", "weren't": "were not", 
                        "what'll": "what will", "what'll've": "what will have", 
                        "what're": "what are",  "what's": "what is", 
                        "what've": "what have", "when's": "when is", 
                        "when've": "when have", "where'd": "where did", 
                        "where's": "where is", "where've": "where have",  
                        "who'll": "who will", "who'll've": "who will have", 
                         "who's": "who is", "who've": "who have", 
                        "why's": "why is", "why've": "why have", 
                        "will've": "will have", "won't": "will not", 
                        "won't've": "will not have", "would've": "would have", 
                        "wouldn't": "would not", "wouldn't've": "would not have", 
                        "y'all": "you all", "y'all'd": "you all would",
                        "y'all'd've": "you all would have","y'all're": "you all are",
                        "y'all've": "you all have","you'd": "you would", 
                        "you'd've": "you would have", "you'll": "you will", 
                        "you'll've": "you will have", "you're": "you are", 
                        "you've": "you have"}

    # Compile a single pattern; longer contractions come first so that,
    # for example, "i'd've" is matched before its prefix "i'd"
    keys = sorted(contraction_dict, key=len, reverse=True)
    contraction_re = re.compile('(%s)' % '|'.join(re.escape(k) for k in keys))

    def replace_contractions(text):
        def replace(match):
            return contraction_dict[match.group(0)]
        return contraction_re.sub(replace, text)

    # Replace contractions
    txt = replace_contractions(txt)
    
    # Remove punctuation and digits
    txt = "".join([char for char in txt if char not in string.punctuation])
    txt = re.sub('[0-9]+', '', txt)
    
    # Split into words
    words = word_tokenize(txt)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    
    # Remove any leftover non-alphabetic tokens
    words = [word for word in words if word.isalpha()]
    
    cleaned_text = ' '.join(words)
    return cleaned_text

CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 13.8 µs
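Step 1 relies on a regular expression built from the dictionary keys. When one key is a prefix of another, the order of the alternatives matters, since the regex engine takes the first alternative that matches. A minimal sketch with a three-entry subset of the dictionary (the names `contractions`, `pattern` and `expand` are illustrative only):

```python
import re

# Small illustrative subset of the contraction dictionary
contractions = {"i'd": "i would", "i'd've": "i would have", "don't": "do not"}

# Longest keys first, so "i'd've" is tried before its prefix "i'd"
keys = sorted(contractions, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))

def expand(text):
    # Replace every matched contraction with its expansion
    return pattern.sub(lambda m: contractions[m.group(0)], text)

print(expand("i'd've gone but i don't drive"))
```

Note that the keys use the straight apostrophe (`'`), so text written with typographic apostrophes (`’`) would not be expanded by this pattern.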


In [67]:
df['textos_limpios'] = df['text'].apply(clean_text)
df['textos_limpios']

0                             favourite food anything cook
1        everyone think hes laugh screwing people inste...
2                                      fuck bayless isoing
3                                     make feel threatened
4                                   dirty southern wankers
                               ...                        
43405    added mate well got bow love hunting aspect ga...
43406              always thought funny reference anything
43407    talking anything bad happened name fault good ...
43408                            like baptism sexy results
43409                                           enjoy ride
Name: textos_limpios, Length: 43410, dtype: object

In [68]:
X = df['textos_limpios']
X

0                             favourite food anything cook
1        everyone think hes laugh screwing people inste...
2                                      fuck bayless isoing
3                                     make feel threatened
4                                   dirty southern wankers
                               ...                        
43405    added mate well got bow love hunting aspect ga...
43406              always thought funny reference anything
43407    talking anything bad happened name fault good ...
43408                            like baptism sexy results
43409                                           enjoy ride
Name: textos_limpios, Length: 43410, dtype: object

In [69]:
# Select only the binary label columns (positional slicing would also pick up 'textos_limpios')
y = df[mlb.classes_]
y

Unnamed: 0,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,desire,disappointment,...,nervousness,neutral,optimism,pride,realization,relief,remorse,sadness,surprise
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43405,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0
43406,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0
43407,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0
43408,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0


The data is split into training and test sets.

In [72]:
# Hold out 10% of the data as a test set to estimate generalization
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=23)
print('X_train: ', X_train.shape )
print('X_test: ' , X_test.shape )
print('y_train: ', y_train.shape)
print('y_test: ' , y_test.shape)

X_train:  (39069,)
X_test:  (4341,)
y_train:  (39069, 28)
y_test:  (4341, 28)


# Model

In [None]:
from sklearn.svm import SVC

# Train a multi-label SVM with a linear kernel (one binary classifier per emotion)
# on TF-IDF features; the labels were already binarized with `mlb` above
classifier = Pipeline([("vectorizer", TfidfVectorizer(max_features=25000)),
                       ("classifier", OneVsRestClassifier(SVC(kernel='linear')))])
classifier.fit(X_train, y_train[labels])

# Predict on new data (illustrative sentences, cleaned like the training text)
X_new = pd.Series(["thank you so much, this really helped",
                   "why would anyone do that"]).apply(clean_text)
y_new = classifier.predict(X_new)

# Convert the binary predictions back to emotion names
y_new_labels = mlb.inverse_transform(y_new)

In [None]:
from sklearn.metrics import f1_score, hamming_loss

# Pipeline: TF-IDF features + one random forest per emotion
clf = Pipeline([("vectorizer", TfidfVectorizer(max_features=25000)),
                ("classifier", OneVsRestClassifier(RandomForestClassifier(n_estimators=100))),
               ])
clf.fit(X_train, y_train[labels])
y_pred = clf.predict(X_test)
print("Samples-averaged F1:", f1_score(y_test[labels], y_pred, average='samples'))
print("Hamming loss:", hamming_loss(y_test[labels], y_pred))

In [None]:
from sklearn.metrics import f1_score, hamming_loss

# Pipeline: TF-IDF features + one XGBoost classifier per emotion
clf = Pipeline([("vectorizer",
                 TfidfVectorizer(max_features=25000)),
                ("classifier", OneVsRestClassifier(xgb.XGBClassifier()))])
clf.fit(X_train, y_train[labels])
y_pred = clf.predict(X_test)
print("Samples-averaged F1:", f1_score(y_test[labels], y_pred, average='samples'))
print("Hamming loss:", hamming_loss(y_test[labels], y_pred))
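The two metrics printed above can be read as follows: Hamming loss is the fraction of individual label slots predicted incorrectly, and the samples-averaged F1 computes an F1 score per observation and then averages over observations. A pure-Python sketch on hypothetical indicator rows (an illustration of the definitions, not the scikit-learn code; `hamming` and `samples_f1` are made-up helper names):

```python
# Hypothetical true and predicted indicator rows (3 samples, 4 labels)
y_true = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
y_hat = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]

def hamming(y_true, y_hat):
    # Fraction of individual label slots predicted incorrectly
    wrong = sum(t != p for rt, rp in zip(y_true, y_hat) for t, p in zip(rt, rp))
    return wrong / (len(y_true) * len(y_true[0]))

def samples_f1(y_true, y_hat):
    # F1 computed per sample, then averaged over samples
    scores = []
    for rt, rp in zip(y_true, y_hat):
        tp = sum(t and p for t, p in zip(rt, rp))
        denom = sum(rt) + sum(rp)
        # A sample with no true and no predicted labels scores 0 here
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

print(hamming(y_true, y_hat))
print(samples_f1(y_true, y_hat))
```

With 28 labels and mostly one label per observation, a low Hamming loss is easy to reach by predicting few labels, so it is worth reading it together with the F1 score.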


How can the solution be complemented? Think about proposals for the client.

How could the task be simplified?

What limitations, risks and biases could the models have when this kind of solution is deployed?

What else should be considered when implementing a project like this?


In [None]:
# save the trained pipeline
import pickle
with open('trading_model.p', 'wb') as f:
    pickle.dump(clf, f)