### Natural language processing

We have a dataset of text comments that we want to classify as toxic or non-toxic. You can download it from [Kaggle](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

We will train a logistic regression classifier to learn when a text is toxic, using two ways of turning the text into features:


1) Bag of words (raw token counts)

2) Tf-idf


For both, we first tokenize and normalize the text and then vectorize it.


We will use `sklearn` for this.

## 1. Reading the data

In [1]:
import pandas as pd

comments_df = pd.read_csv("data/jigsaw-toxic-comment-classification-challenge/train.csv")
comments_df.head(2)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0


### 1.1 Splitting the data into training and test sets

In [2]:
# Show full column contents; None replaces the deprecated -1 in recent pandas versions
pd.set_option('display.max_colwidth', None)

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(comments_df[['comment_text']], comments_df['toxic'], random_state=10)
X_train.head(2)

Unnamed: 0,comment_text
34852,"This is a straw man argument, Mr Merkey. Nobody is arguing that federally unverified tribes, clans or groups should be identified as having any kind of federal recognition. What's being said here, something that you seem unwilling to accept as even being a valid view, is that those groups claim to have cherokee lineage (in many cases, having verifiable documented evidence to that effect), and as such, they should be figured on the cherokee page, not simply expunged from wikipedia because they don't fall within the very narrow rules defined by the very government that has historically persecuted them. If nothing else, they should be included to highlight the controversy that their claim to being cherokee engenders. Legally, they may not be recognised cherokees, but anthropologically they are. Removing non federally recognised groups on the basis of some imagined legal threat would seem to be ridiculous; this is intended, after all, to be an encyclopedia, and thus, encyclopedic. What it's not intended to be is a vanity publishing service, and to remove all reference to non-recognised tribes and the controversy surrounding them would, in my view at least, turn this article into just that. It certainly wouldn't, as you imply, improve Wikipedia."
17133,"ARC Gritt, the fucking cunt of all cunts, ruined me by saying I vandalised 2009 Formula One season. What a cunt."
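
The split above is purely random. If toxic comments are a minority class, a stratified split keeps the class ratio the same in both parts. The following is a minimal sketch on synthetic data, assuming the `stratify` parameter of `train_test_split`; the toy DataFrame and the `toy_*` names are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 40 comments, 8 of them labelled toxic.
toy_df = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(40)],
    "toxic": [1] * 8 + [0] * 32,
})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_df[["comment_text"]],
    toy_df["toxic"],
    test_size=0.25,
    stratify=toy_df["toxic"],  # preserve the 20% toxic ratio in both splits
    random_state=10,
)

# Share of toxic comments in each split
print(toy_y_train.mean(), toy_y_test.mean())
```

Whether stratification matters for the real dataset depends on how imbalanced the `toxic` column is, which is worth checking with `comments_df['toxic'].mean()`.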


## 2. Text normalization or preprocessing

In [4]:
import re

import nltk
from nltk.stem import SnowballStemmer

# Raw strings avoid invalid-escape warnings for sequences such as "\[" and "\?"
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
GOOD_SYMBOLS = r"€\?"
GOOD_SYMBOLS_RE = re.compile('([' + GOOD_SYMBOLS + '])')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z ' + GOOD_SYMBOLS + ']')
ADD_SPACES_SYMBOLS_RE = re.compile(r"([\?])")
STEMMER = SnowballStemmer('english')

class TextPreprocessor:

    def transform_text(self, text):
        # Good symbols are replaced by themselves, so this step is currently a no-op
        text = re.sub(GOOD_SYMBOLS_RE, r"\1", text)
        text = text.lower()
        text = re.sub(REPLACE_BY_SPACE_RE, " ", text)  # separators become spaces
        text = re.sub(BAD_SYMBOLS_RE, "", text)  # drop every other symbol
        text = re.sub(ADD_SPACES_SYMBOLS_RE, r" \1 ", text)  # make "?" its own token
        # Stem each word; the result must be assigned back to `text` to take effect
        text = " ".join([STEMMER.stem(word) for word in text.split()])
        return text

    def transform(self, series):
        return series.apply(self.transform_text)

In [5]:
preprocessor = TextPreprocessor()
X_train_preprocessed = preprocessor.transform(X_train['comment_text'])
X_test_preprocessed = preprocessor.transform(X_test['comment_text'])
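
To see what the normalization does to a single string, here is a minimal sketch that repeats the regular expressions from the cell above and leaves out the stemming step, so it runs without `nltk`. The function name `normalize` and the sample sentence are illustrative assumptions.

```python
import re

# Same patterns as in the preprocessing cell, written as raw strings.
REPLACE_BY_SPACE_RE = re.compile(r"[/(){}\[\]\|@,;]")
BAD_SYMBOLS_RE = re.compile(r"[^0-9a-z €\?]")
ADD_SPACES_SYMBOLS_RE = re.compile(r"([\?])")

def normalize(text):
    """Lowercase, strip unwanted symbols and isolate '?' (no stemming)."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(" ", text)      # separators become spaces
    text = BAD_SYMBOLS_RE.sub("", text)            # drop everything else, e.g. "!"
    text = ADD_SPACES_SYMBOLS_RE.sub(r" \1 ", text)  # "?" becomes its own token
    return " ".join(text.split())                  # collapse repeated whitespace

example = normalize("Why?! Costs 5€ (roughly), see @admin")
print(example)  # why ? costs 5€ roughly see admin
```

In the full `TextPreprocessor`, each of these tokens would additionally be passed through the Snowball stemmer.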

In [6]:
print(X_train["comment_text"][:2])
print(X_train_preprocessed[:2])

34852    This is a straw man argument, Mr Merkey.  Nobody is arguing that federally unverified tribes, clans or groups should be identified as having any kind of federal recognition. What's being said here, something that you seem unwilling to accept as even being a valid view, is that those groups claim to have cherokee lineage (in many cases, having verifiable documented evidence to that effect), and as such, they should be figured on the cherokee page, not simply expunged from wikipedia because they don't fall within the very narrow rules defined by the very government that has historically persecuted them. If nothing else, they should be included to highlight the controversy that their claim to being cherokee engenders.  Legally, they may not be recognised cherokees, but anthropologically they are. Removing non federally recognised groups on the basis of some imagined legal threat would seem to be ridiculous; this is intended, after all, to be an encyclopedia, and thus, encyclopedi

## 3. Bag of words

Now we count how many times each token appears in each comment.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(X_train_preprocessed)
X_train_vectorized = vectorizer.transform(X_train_preprocessed)

In [8]:
X_test_vectorized = vectorizer.transform(X_test_preprocessed)
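
The resulting matrices have one row per comment and one column per vocabulary term. A toy example makes the layout easier to see; the two sentences and the `toy_*` names below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["you are nice", "you are you"]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(docs)  # sparse matrix: documents x terms

# Order the vocabulary by column index so it lines up with the matrix columns.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)             # ['are', 'nice', 'you']
print(counts.toarray())  # [[1 1 1]
                         #  [1 0 2]]
```

Note that the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more alphanumeric characters, so a standalone `?` produced by the preprocessing step would be dropped here.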

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')
model.fit(X_train_vectorized, y_train)

In [None]:
y_test_hat = model.predict(X_test_vectorized)

In [None]:
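from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,\
    average_precision_score, roc_auc_score

def scores(y, predicted):
    """Return the main binary classification metrics for hard 0/1 predictions."""
    return {
        'accuracy': accuracy_score(y, predicted),
        'precision': precision_score(y, predicted),
        'recall': recall_score(y, predicted),
        'f1-score': f1_score(y, predicted),
        # ROC AUC needs continuous scores (e.g. model.predict_proba(X)[:, 1]),
        # which this function does not receive, so it stays disabled.
        #"roc_auc": roc_auc_score(y, predicted_score),
        # Computed here from hard labels; it is more informative with scores.
        'average-Precision': average_precision_score(y, predicted)}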

In [None]:
scores(y_test, y_test_hat)
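
Reporting precision, recall and F1 alongside accuracy matters when one class is rare, which the use of `class_weight='balanced'` suggests is the case here. The sketch below uses made-up labels to show how a classifier that never predicts "toxic" can still reach high accuracy; the arrays are assumptions for illustration only.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 2 toxic comments out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A useless classifier that labels every comment as non-toxic.
y_all_negative = [0] * 10

toy_accuracy = accuracy_score(y_true, y_all_negative)
toy_recall = recall_score(y_true, y_all_negative)

print(toy_accuracy)  # 0.8
print(toy_recall)    # 0.0
```

Accuracy looks respectable, yet recall shows that none of the toxic comments were found.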

In [None]:
class CompleteModel:
    
    def __init__(self, preprocessor, vectorizer, model, colname="comment_text"):
        self.colname = colname
        self.preprocessor = preprocessor
        self.vectorizer = vectorizer
        self.model = model
           
    def fit(self, X, y):
        print("preprocessor...")
        X_fe = pd.DataFrame({self.colname: self.preprocessor.transform(X[self.colname])})
        print("vectorizer...")
        self.vectorizer.fit(X_fe[self.colname])
        print("model...")
        # Vectorize the preprocessed text (not the raw X) so that training
        # sees the same features that predict() produces.
        X_fe = self.vectorizer.transform(X_fe[self.colname])
        self.model.fit(X_fe, y)
        return self
        
    def predict(self, X):
        X_fe = pd.DataFrame({self.colname: self.preprocessor.transform(X[self.colname])})        
        X_fe = self.vectorizer.transform(X_fe[self.colname])
        return self.model.predict(X_fe)

In [None]:
complete_model = CompleteModel(preprocessor, vectorizer, model)

In [None]:
complete_model.fit(X_train, y_train)

In [None]:
y_test_hat = complete_model.predict(X_test)
scores(y_test, y_test_hat)
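
The `CompleteModel` class chains preprocessing, vectorization and classification by hand. As a possible alternative, not used in the original notebook, scikit-learn's `Pipeline` offers the same chaining. The sketch below runs on toy data; the `simple_normalize` function is a hypothetical stand-in for `TextPreprocessor`, passed through the `preprocessor` argument of `CountVectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def simple_normalize(text):
    """Hypothetical stand-in for the notebook's text preprocessing."""
    return text.lower()

pipeline = Pipeline([
    # The vectorizer calls simple_normalize on every document before tokenizing.
    ("vectorize", CountVectorizer(preprocessor=simple_normalize)),
    ("classify", LogisticRegression(class_weight="balanced")),
])

# Tiny made-up training set: 0 = non-toxic, 1 = toxic.
texts = ["You are great", "Thanks for the help", "You are an idiot", "Shut up idiot"]
labels = [0, 0, 1, 1]

pipeline.fit(texts, labels)
predictions = pipeline.predict(["What an idiot"])
print(predictions)
```

A pipeline fits the vectorizer and the classifier in one `fit` call and applies the same steps in `predict`, which avoids the kind of train/predict mismatch that is easy to introduce in hand-written wrappers.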

## 4. Tf-idf

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=4, max_df=0.9, ngram_range=(1, 2), token_pattern=r'(\S+)')
complete_tfidf_model = CompleteModel(preprocessor, tfidf_vectorizer, model)

In [None]:
complete_tfidf_model.fit(X_train, y_train)

In [None]:
y_test_hat = complete_tfidf_model.predict(X_test)
scores(y_test, y_test_hat)
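
Tf-idf differs from plain counts in that terms appearing in many documents are down-weighted. A minimal sketch with three made-up sentences shows the inverse document frequency that `TfidfVectorizer` learns under its default settings; the sentences and the `toy_tfidf` and `idf` names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["you are nice", "you are rude", "you idiot"]

toy_tfidf = TfidfVectorizer()
toy_tfidf.fit(docs)

# Map each term to its learned inverse document frequency.
idf = {term: toy_tfidf.idf_[index] for term, index in toy_tfidf.vocabulary_.items()}

# "you" occurs in every document, "are" in two, "idiot" in only one,
# so the idf weight grows in that order.
for term in ("you", "are", "idiot"):
    print(term, round(idf[term], 3))
```

In the notebook's configuration, `min_df=4` additionally drops terms seen in fewer than four comments, `max_df=0.9` drops terms present in more than 90% of them, and `ngram_range=(1, 2)` adds word pairs as features.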