# Preprocessing

This *notebook* applies preprocessing to each Reddit comment. The result is saved to a file that mirrors the source file; the only difference is that each comment is made up of processed *strings*.

The following preprocessing steps are applied:
1. Removal of *stop words*
2. Lemmatization using spaCy
3. Removal of the least frequent words
4. Conversion of the lemmas to lowercase
5. Removal of non-alphanumeric words
6. Only words whose *part-of-speech* is a proper noun, a noun, or a pronoun are kept. [See *Universal POS tags*](https://universaldependencies.org/docs/u/pos/)
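
As a rough illustration of steps 1, 3, 4, and 5 above, the following standard-library sketch lowercases tokens, drops stop words, keeps only alphanumeric tokens, and removes words below a minimum corpus frequency. The names `STOP_WORDS`, `clean_tokens`, and `drop_rare_words`, as well as the `min_count` value, are assumptions made for this example; the notebook itself relies on spaCy and on the helpers in `preprocessing_utils`, whose exact behavior may differ.

```python
from collections import Counter

# Hypothetical, tiny stop-word list; the notebook uses spaCy's Spanish list instead.
STOP_WORDS = {"el", "la", "de", "que", "y", "en", "es"}

def clean_tokens(text):
    """Lowercase, split on whitespace, and keep alphanumeric non-stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t.isalnum() and t not in STOP_WORDS]

def drop_rare_words(docs, min_count=2):
    """Remove words that appear fewer than `min_count` times across all documents."""
    counts = Counter(token for doc in docs for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]

comments = [
    "El gato duerme en la casa",
    "La casa es grande !!",
    "Que gato tan raro",
]
docs = [clean_tokens(c) for c in comments]
filtered = drop_rare_words(docs, min_count=2)
print(filtered)
```

Part-of-speech filtering and lemmatization (steps 2 and 6) need a language model and are therefore left out of this sketch.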

### Source

- [Twitter Topic Modeling](https://towardsdatascience.com/twitter-topic-modeling-e0e3315b12e2)


In [1]:
import pandas as pd
import nltk, spacy, gensim
from spacy.tokenizer import Tokenizer
import pickle

from preprocessing_utils import (
    give_emoji_free_text, url_free_text, email_free_text,
    quotes_free_text, get_lemmas, tokenize,
)

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

%matplotlib inline

TEXT_FILE_READ = 'docs/reddit_data.csv'  # Text to be processed
TEXT_SAVE_FILE = 'docs/preprocessing_reddit_data.csv'
FILENAME_PICKLE = "docs/tmpreddit.pickle"

In [2]:
# Disable Named Entity Recognition and the dependency parser for speed
nlp = spacy.load('es_core_news_lg', disable=['ner', 'parser'])
tmpreddit = pd.read_csv(TEXT_FILE_READ)

In [3]:
# Remove e-mail addresses from the comment body
tmpreddit['emails_free'] = tmpreddit['body'].apply(email_free_text)

# Remove quotes
tmpreddit['quotes_free'] = tmpreddit['emails_free'].apply(quotes_free_text)

# Remove emojis
tmpreddit['emoji_free'] = tmpreddit['quotes_free'].apply(give_emoji_free_text)

# Remove URLs
tmpreddit['url_free'] = tmpreddit['emoji_free'].apply(url_free_text)

#print(tmpreddit[:1])
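
The cleaning helpers used in the cell above come from `preprocessing_utils`, which is not shown here. As a minimal sketch of what e-mail and URL removal could look like, the following uses only the `re` module. The patterns and the names `strip_emails`, `strip_urls`, and `normalize_spaces` are hypothetical and are not the actual helper implementations.

```python
import re

# Hypothetical patterns; the real helpers in preprocessing_utils may differ.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def strip_emails(text):
    """Delete anything that looks like an e-mail address."""
    return EMAIL_PATTERN.sub("", text)

def strip_urls(text):
    """Delete anything that looks like an http(s) or www URL."""
    return URL_PATTERN.sub("", text)

def normalize_spaces(text):
    """Collapse the runs of whitespace left behind by the removals."""
    return " ".join(text.split())

raw = "Write to ana@example.com or see https://example.com/post now"
# Same order as the notebook: e-mails first, URLs last.
clean = normalize_spaces(strip_urls(strip_emails(raw)))
print(clean)
```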

In [4]:
# Tokenizer
tokenizer = Tokenizer(nlp.vocab)

tokens = []

for doc in tokenizer.pipe(tmpreddit['url_free'], batch_size=500):
    doc_tokens = []
    for token in doc:
        if token.text.lower() not in nlp.Defaults.stop_words:
            doc_tokens.append(token.text.lower())
    tokens.append(doc_tokens)

# Makes tokens column
tmpreddit['tokens'] = tokens


In [5]:
# Build the bigram and trigram model
bigram = gensim.models.Phrases(tmpreddit['tokens'], min_count=10, threshold=100)
trigram = gensim.models.Phrases(bigram[tmpreddit['tokens']], threshold=100)

# Phraser is a lighter, faster wrapper around a trained Phrases model
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
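
To give an intuition for `min_count` and `threshold`, the sketch below scores adjacent word pairs with the formula that gensim documents for its default scorer, `(count(a, b) - min_count) * N / (count(a) * count(b))`. Pairs scoring above `threshold` are merged into a single token. This is a standard-library approximation, not gensim's code: `bigram_scores` is a hypothetical name, and `N` is taken here as the number of distinct words, which is a simplifying assumption.

```python
from collections import Counter

def bigram_scores(docs, min_count=1):
    """Score adjacent pairs: (count(a,b) - min_count) * N / (count(a) * count(b))."""
    unigrams = Counter(t for doc in docs for t in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # N: number of distinct words (assumption)
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

docs = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "is", "far"],
    ["love", "is", "big"],
]
scores = bigram_scores(docs, min_count=1)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

With the values used in the notebook (`min_count=10`, `threshold=100`), only pairs that co-occur far more often than their individual frequencies would suggest are joined.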


In [6]:
# See trigram example
print(trigram_mod[bigram_mod[tmpreddit['tokens'][3]]])


In [7]:
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]


In [8]:
# Form Bigrams
data_words_bigrams = make_bigrams(tmpreddit['tokens'])

In [None]:
# Make tokens a string again
tmpreddit['tokens_back_to_text'] = [' '.join(map(str, l)) for l in data_words_bigrams]

tmpreddit['lemmas'] = tmpreddit['tokens_back_to_text'].apply(get_lemmas)


In [None]:
# Make lemmas a string again
tmpreddit['lemmas_back_to_text'] = [' '.join(map(str, l)) for l in tmpreddit['lemmas']]



In [None]:
tmpreddit = tmpreddit.drop_duplicates(subset=['lemmas_back_to_text'], keep='first', inplace=False)

In [None]:
# Apply tokenizer
tmpreddit['lemma_tokens'] = tmpreddit['lemmas_back_to_text'].apply(tokenize)

In [None]:
# Note: `reddit` is an alias of `tmpreddit`, not a copy, so the columns
# dropped below are removed from `tmpreddit` as well.
reddit = tmpreddit
reddit['body_preprocessing'] = tmpreddit['lemmas_back_to_text']

# Drop the intermediate columns
reddit.drop(
    columns=['emails_free', 'quotes_free', 'emoji_free', 'url_free',
             'tokens', 'tokens_back_to_text', 'lemmas', 'lemmas_back_to_text'],
    inplace=True,
)


In [None]:
reddit.to_csv(TEXT_SAVE_FILE, index=False)

# The context manager closes the file even if pickling fails
with open(FILENAME_PICKLE, 'wb') as fileObj:
    pickle.dump(tmpreddit, fileObj)
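
A later notebook would presumably read the pickle back. The sketch below shows the write and read round trip with the standard `pickle` module. It uses a plain dictionary as a stand-in for the DataFrame and a temporary path, both of which are assumptions for the sake of a self-contained example.

```python
import os
import pickle
import tempfile

# Stand-in for the processed data; the notebook pickles a pandas DataFrame instead.
processed = {"body_preprocessing": ["gato casa", "casa"]}

# Hypothetical temporary path used only for this example.
path = os.path.join(tempfile.mkdtemp(), "tmpreddit.pickle")

# Write the object to disk in binary mode.
with open(path, "wb") as file_obj:
    pickle.dump(processed, file_obj)

# Read it back; only unpickle files from a trusted source.
with open(path, "rb") as file_obj:
    restored = pickle.load(file_obj)

print(restored == processed)
```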