In this notebook, we train hate speech prediction models on several datasets so that we can predict the use of hate speech on Reddit.

In [1]:
import pickle

import pandas as pd
import spacy

from preprocessing_utils import preprocess_corpus

TEXT_SAVE_FILE = 'docs/reddit_data_hate_speech.csv'
TEXT_SAVE_FILE_POS_HATE_SPEECH = 'docs/reddit_data_hate_speech_pos.csv'
TEXT_SAVE_FILE_NEG_HATE_SPEECH = 'docs/reddit_data_hate_speech_neg.csv'
TEXT_FILE_READ = 'docs/reddit_data_lda.csv'

nlp = spacy.load("es_core_news_lg")

In [2]:
# load the saved vectorizer and a trained model

with open('hateval_vectorizer.pkl', 'rb') as f:
    cv_hateval = pickle.load(f)
    
with open('hateval_nb_model.pkl', 'rb') as f:
    nb_hateval = pickle.load(f)
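
The two pickle files are assumed to come from an earlier training step that is not shown in this notebook. A minimal sketch of how such artifacts could be produced, assuming a scikit-learn `CountVectorizer` and `MultinomialNB` classifier (the `cv_`/`nb_` prefixes suggest this, but the source does not confirm it). The toy texts, labels and file names below are hypothetical:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (1 = hate speech, 0 = not hate speech).
train_texts = ["te odio", "buen dia", "odio a todos", "que lindo dia"]
train_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = MultinomialNB().fit(X_train, train_labels)

# Persist both objects: the vectorizer must be the one fitted at training
# time so that new text is mapped to the same feature columns.
out_dir = tempfile.mkdtemp()
vec_path = os.path.join(out_dir, "example_vectorizer.pkl")  # hypothetical name
model_path = os.path.join(out_dir, "example_nb_model.pkl")  # hypothetical name
with open(vec_path, "wb") as f:
    pickle.dump(vectorizer, f)
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Reload and check that the round trip preserves the predictions.
with open(vec_path, "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

new_texts = ["odio", "lindo dia"]
original_pred = model.predict(vectorizer.transform(new_texts))
reloaded_pred = loaded_model.predict(loaded_vectorizer.transform(new_texts))
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.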


# Testing the models on Reddit with Hateval

In [3]:
df = pd.read_csv(TEXT_FILE_READ)

In [4]:
reddit_corpus = preprocess_corpus(df['body'].astype('str'))
reddit_adapted = cv_hateval.transform(reddit_corpus)

In [5]:
reddit_predictions = nb_hateval.predict(reddit_adapted)
reddit_hs_proba = nb_hateval.predict_proba(reddit_adapted)[:,1]
print(reddit_hs_proba)

[0.88380823 0.36464291 0.46504022 ... 0.36455733 0.26675617 0.08903386]
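
`predict_proba` returns one column per class, ordered as in the classifier's `classes_` attribute, so `[:, 1]` is the hate speech probability only if the positive label is the second class. A small sketch on hypothetical toy data (assuming a scikit-learn Naive Bayes model, not the Hateval model itself) that looks the column up explicitly:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count features and labels (1 = hate speech).
X = np.array([[2, 0], [0, 3], [1, 0], [0, 1]])
y = np.array([1, 0, 1, 0])

clf = MultinomialNB().fit(X, y)
proba = clf.predict_proba(X)  # shape: (n_samples, n_classes)

# Find the column that corresponds to the positive label instead of
# hard-coding index 1.
positive_col = list(clf.classes_).index(1)
hs_proba = proba[:, positive_col]
```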


In [6]:
target_predict_proba = 0.8
hate_mask = reddit_hs_proba>=target_predict_proba
non_hate_mask = reddit_hs_proba < target_predict_proba
print(len(hate_mask))

19394
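
Note that `len(hate_mask)` is the total number of comments scored, not the number flagged as hate speech. A short sketch, using hypothetical probabilities in place of `reddit_hs_proba`, of counting the flagged comments and building the complementary mask by negation:

```python
import numpy as np

# Hypothetical scores standing in for reddit_hs_proba.
example_proba = np.array([0.88, 0.36, 0.47, 0.80, 0.09])
threshold = 0.8

flagged = example_proba >= threshold  # boolean mask, one entry per comment
not_flagged = ~flagged                # complement of the mask

n_total = len(flagged)                # number of comments scored
n_flagged = int(flagged.sum())        # number at or above the threshold
```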


In [7]:
# Label each comment 'si' (hate speech) or 'no' according to the threshold.
# Assigning the whole column at once avoids chained indexing
# (df['hate_speech'][index] = ...), which triggers SettingWithCopyWarning.
df['hate_speech'] = ['si' if proba >= target_predict_proba else 'no'
                     for proba in reddit_hs_proba]


In [8]:
df.to_csv(TEXT_SAVE_FILE)

In [9]:
df[hate_mask].to_csv(TEXT_SAVE_FILE_POS_HATE_SPEECH)

In [10]:
df[non_hate_mask].to_csv(TEXT_SAVE_FILE_NEG_HATE_SPEECH)

NameError: name 'non_hate_mask' is not defined

# Improvements to make

* Optimize hyperparameters.
* Build an ensemble of classifiers.
* **TODO**
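
The first two items could be sketched as follows, assuming scikit-learn and a bag-of-words pipeline. The toy data, the parameter grid and the choice of estimators are illustrative assumptions, not the setup used to train the Hateval model:

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (1 = hate speech, 0 = not hate speech).
texts = ["te odio", "buen dia", "odio a todos", "que lindo dia",
         "los odio", "lindo gesto", "odio eso", "buen trabajo"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# 1) Hyperparameter search over the Naive Bayes smoothing parameter.
pipeline = Pipeline([("cv", CountVectorizer()), ("nb", MultinomialNB())])
search = GridSearchCV(pipeline, {"nb__alpha": [0.1, 0.5, 1.0]}, cv=2)
search.fit(texts, labels)
best_alpha = search.best_params_["nb__alpha"]

# 2) Soft-voting ensemble that averages the predicted class probabilities.
ensemble = Pipeline([
    ("cv", CountVectorizer()),
    ("vote", VotingClassifier(
        estimators=[("nb", MultinomialNB(alpha=best_alpha)),
                    ("lr", LogisticRegression(max_iter=1000))],
        voting="soft")),
])
ensemble.fit(texts, labels)
ensemble_proba = ensemble.predict_proba(["odio", "lindo dia"])
```

Soft voting is used here so that the ensemble still exposes `predict_proba`, which the thresholding step in this notebook relies on.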

END