# Preprocessing and Topic Distribution with Gensim

This notebook develops the preprocessing and topic mining pipeline, built with the gensim library, that we apply to our corpus of newspaper texts.

## Preprocessing: Initial Normalization

First, we define the functions that prepare the data before we look for the hidden topics in it.

In [1]:
import nltk
from bs4 import BeautifulSoup
import re

In [2]:
def tokenize_by_words(text):
    words = text.split()
    alphabetic_words = list()
    
    for word in words:
        token = []
        for character in word:
            # Keep only lowercase Spanish letters (the text is lowercased beforehand)
            if re.match(r'[a-záéíóúñü]', character):
                token.append(character)
        token = ''.join(token)
        if token != '':
            alphabetic_words.append(token)
    
    return alphabetic_words

def tokenize_by_sents(text):
    tokens = nltk.data.load("tokenizers/punkt/spanish.pickle") 
    sents = tokens.tokenize(text)
    alphabetic_sents = list()
    
    for sent in sents:
        sent_token = tokenize_by_words(sent)
        alphabetic_sents.append(sent_token)
    
    return alphabetic_sents
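
As a quick illustration of the word-level filter, the sketch below expresses the same idea with `re.sub`: each whitespace-separated token keeps only lowercase Spanish letters, and tokens that end up empty are dropped. The helper name `keep_spanish_letters` and the sample sentence are hypothetical, and the sketch is not meant to replace the functions above.

```python
import re

# Hypothetical helper: matches every character that is NOT a lowercase Spanish letter
NON_LETTERS = re.compile(r'[^a-záéíóúñü]')

def keep_spanish_letters(text):
    # Strip non-letters from each whitespace-separated token, then drop empty tokens
    tokens = (NON_LETTERS.sub('', word) for word in text.split())
    return [token for token in tokens if token]

sample_tokens = keep_spanish_letters('martes 02 de abril de 1996, según el señor')
print(sample_tokens)
```

Tokens made only of digits or punctuation, such as `02` and `1996,`, disappear entirely, which matches the absence of dates and numbers in the tokenized output further down.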

In [3]:
def remove_stop_words_from_sents(sents, path = './stopwords_es.txt'):
    # Load the stop words into a set for fast membership tests
    with open(path, encoding = 'utf-8') as f:
        stop_words = {w.strip() for w in f}
        
    # Keep only the words of each sentence that are not stop words
    clean_sents = list()
    for sent in sents:
        clean_sent = [word for word in sent if word not in stop_words]
        clean_sents.append(clean_sent)
    
    return clean_sents
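
The filtering step can be tried without the stop-word file. The sketch below assumes a tiny hand-written stop-word set and two invented sentences; the real list comes from `stopwords_es.txt`, whose contents are not shown in this notebook.

```python
# Hypothetical stop-word set; the notebook reads its list from stopwords_es.txt
stop_words = {'de', 'el', 'la', 'y', 'a'}

def drop_stop_words(sents, stop_words):
    # Rebuild each sentence without the words found in the stop-word set
    return [[word for word in sent if word not in stop_words] for sent in sents]

sents = [['el', 'nombre', 'de', 'luis'], ['la', 'magia', 'y', 'la', 'fantasía']]
filtered = drop_stop_words(sents, stop_words)
print(filtered)
```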

In [4]:
def lemmatize_from_sents(text, path = './generate.txt'):
    
    lemmas = dict()
    with open(path, encoding = 'latin1') as file:
        lines = file.readlines()
        for line in lines:
            line = line.strip()
            if line != '':
                words = line.split()
                token = words[0].strip()
                token = token.replace('#', '')
                lemma = words[-1].strip()
                tag = words[-2].strip()
                tag = tag[0].lower()
                lemmas[(token, tag)] = (lemma, tag)
    
    lemmatized_text = list()
    for sent in text:
        lemmatized_sent = list()
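        for word in sent:
            # lemmas is keyed by (token, tag) tuples, so a plain string never
            # matches here and the word is appended unchanged
            if word in lemmas.keys():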
                lemmatized_sent.append(lemmas[word])
            else:
                lemmatized_sent.append(word)
        
        lemmatized_text.append(lemmatized_sent)

    return lemmatized_text
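
The sentences passed to `lemmatize_from_sents` contain plain strings, while the dictionary is keyed by `(token, tag)` pairs. One possible alternative, sketched below, keys the table by token alone and ignores tag ambiguity. The sample lines are made up and only follow the layout the loader assumes (token first, tag second to last, lemma last); `build_lemma_table` and `lemmatize` are hypothetical names.

```python
# Made-up lines following the layout the loader assumes:
# first field = token (possibly containing '#'), second-to-last = tag, last = lemma
sample_lines = [
    'monstruos#a AQ0FS0 monstruoso',
    'dij#o VMIS3S0 decir',
    '',
]

def build_lemma_table(lines):
    table = dict()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        token = fields[0].replace('#', '')
        # setdefault keeps the first lemma seen for a token (tag ambiguity is ignored)
        table.setdefault(token, fields[-1])
    return table

def lemmatize(sents, table):
    # Replace each word by its lemma when known, otherwise keep the word as is
    return [[table.get(word, word) for word in sent] for sent in sents]

table = build_lemma_table(sample_lines)
lemmatized = lemmatize([['monstruosa', 'diferencia'], ['dijo', 'amigo']], table)
print(lemmatized)
```

Keying by token alone loses the distinction between, say, a noun and a verb with the same surface form, so this is a trade-off rather than a drop-in fix.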

## Articles and Titles Extraction

The functions below extract the articles and titles from each newspaper file.

In [5]:
def is_title(text):
    lower = 0
    upper = 0
    for c in text:
        if c.isalpha():
            if c.isupper():
                upper += 1
            else:
                lower += 1
    if lower > upper:
        return True
    return False


def get_titles(path_origin = './../EXCELSIOR_100_files/', path_destiny = './topic_mining/titles/', no_page = 0):
    corpus = nltk.corpus.PlaintextCorpusReader(path_origin, '.*')
    file_list = corpus.fileids()
    titles = list()
    with open(path_origin + file_list[no_page], encoding = 'utf-8') as rfile:
        text = rfile.read()
    soup = BeautifulSoup(text, 'lxml')
    sents = soup.find_all('h3')
    for sent in sents:
        txt = sent.get_text()
        if is_title(txt):
            if txt != ' ':
                titles.append(txt.lower())
    with open(path_destiny + "titles_" + file_list[no_page][:11] + '.txt', 'w', encoding = 'utf-8') as wfile:
        # Write one title per line; writelines() alone adds no separators
        wfile.write('\n'.join(titles))
    return titles

def get_articles(path_origin = './../EXCELSIOR_100_files/', no_page = 0):
    corpus = nltk.corpus.PlaintextCorpusReader(path_origin, '.*')
    file_list = corpus.fileids()
    with open(path_origin + file_list[no_page], encoding = 'utf-8') as rfile:
        text = rfile.read()
    arts = text.split("\n\n\n")[1:]
    new_arts = arts[1::2]
    clean_arts = list()
    for art in new_arts:
        clean_arts.append(clean_articles(art))
    return clean_arts

def clean_articles(txt):
    # Remove HTML tags
    soup = BeautifulSoup(txt, 'lxml')
    clean_text = soup.get_text()
    # Convert the text to lowercase
    clean_text = clean_text.lower()
    # Return the cleaned text
    return clean_text
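
The slicing in `get_articles` can be read as follows, assuming each file starts with a header chunk and that the odd-indexed chunks after it hold the article bodies. That layout is inferred from the code only, and the sample string is invented.

```python
# Invented sample with the assumed layout: a header, then alternating chunks
raw_text = 'header\n\n\ntitle 1\n\n\narticle 1\n\n\ntitle 2\n\n\narticle 2'

chunks = raw_text.split('\n\n\n')[1:]   # drop the header chunk
article_chunks = chunks[1::2]            # keep every second chunk, starting at index 1
print(article_chunks)
```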

In [6]:
titles = get_titles('./../EXCELSIOR_100_files/', './topic_mining/titles/', 0)

In [7]:
articles = get_articles('./../EXCELSIOR_100_files/', 0)

### Normalization of Articles

In [8]:
articles

['\n\nmartes 02 de abril de 1996\n\nmonstruosa diferencia\n\ncolosistas y colosismo\n\nluis gutierrez y gonzalez\n\na luis gutiérrez sotomayor y a federico arreola, colosistas cabales, según me dijo su amigo luis donaldo.\nciertamente, el nombre y las circunstancias de luis donaldo colosio han llenado insistentemente los volúmenes y los espacios de los medios de comunicación. pero su renovada actualidad ha padecido un frenético vaivén de ficciones judiciales y políticas que integran y disgregan metafísicas más metafísicas aún que las que luis donaldo desprende —o se envuelve con ellas— al otro lado del espejo.\ndos años eternos de insolvencias, dale que dale a la fantasía, a la magia de dónde quedó la bolita, le han traído al pueblo hastío y cansancio. en la inminencia se percibe el váyanse al diablo del quórum nacional, que antes se veía decidido a instalar sécula seculórum sus demandas de justicia.\npero en el segundo aniversario del asesinato, el colosismo con su astroso luto protag

In [9]:
tokenized_articles = list()
for article in articles:
    tokenized_articles.append(tokenize_by_sents(article))

# Named so that it does not shadow the clean_articles() function defined above
stopword_free_articles = list()
for tk_art in tokenized_articles:
    stopword_free_articles.append(remove_stop_words_from_sents(tk_art))
    
lemmatized_articles = list()
for cl_art in stopword_free_articles:
    lemmatized_articles.append(lemmatize_from_sents(text = cl_art))

In [10]:
lemmatized_articles

[[['martes',
   'abril',
   'monstruosa',
   'diferencia',
   'colosistas',
   'colosismo',
   'luis',
   'gutierrez',
   'gonzalez',
   'luis',
   'gutiérrez',
   'sotomayor',
   'federico',
   'arreola',
   'colosistas',
   'cabales',
   'según',
   'dijo',
   'amigo',
   'luis',
   'donaldo'],
  ['ciertamente',
   'nombre',
   'circunstancias',
   'luis',
   'donaldo',
   'colosio',
   'llenado',
   'insistentemente',
   'volúmenes',
   'espacios',
   'medios',
   'comunicación'],
  ['renovada',
   'actualidad',
   'padecido',
   'frenético',
   'vaivén',
   'ficciones',
   'judiciales',
   'políticas',
   'integran',
   'disgregan',
   'metafísicas',
   'metafísicas',
   'aún',
   'luis',
   'donaldo',
   'desprende',
   'envuelve',
   'lado',
   'espejo'],
  ['dos',
   'años',
   'eternos',
   'insolvencias',
   'dale',
   'dale',
   'fantasía',
   'magia',
   'dónde',
   'quedó',
   'bolita',
   'traído',
   'pueblo',
   'hastío',
   'cansancio'],
  ['inminencia',
   'percibe',
 

## Topics Distribution with Gensim

Once the preprocessing is complete, we can apply the functions that gensim provides.

First, we try with all the articles of the selected text file.

In [11]:
from gensim import corpora
import gensim

In [12]:
# Flatten each article from a list of sentences into a single list of tokens
lemmatized_articles_docs = list()
for item1 in lemmatized_articles:
    docs = list()
    for item2 in item1:
        docs.extend(item2)
    lemmatized_articles_docs.append(docs)
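
The same flattening can be written with `itertools.chain.from_iterable`, shown here as a minimal sketch on two hypothetical articles rather than on the real corpus.

```python
from itertools import chain

# Two hypothetical articles, each a list of tokenized sentences
nested_articles = [
    [['martes', 'abril'], ['luis', 'donaldo']],
    [['scalfaro'], ['aquí', 'allá']],
]

# Concatenate the sentences of each article into one token list per article
flat_docs = [list(chain.from_iterable(article)) for article in nested_articles]
print(flat_docs)
```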

In [13]:
lemmatized_articles_docs

[['martes',
  'abril',
  'monstruosa',
  'diferencia',
  'colosistas',
  'colosismo',
  'luis',
  'gutierrez',
  'gonzalez',
  'luis',
  'gutiérrez',
  'sotomayor',
  'federico',
  'arreola',
  'colosistas',
  'cabales',
  'según',
  'dijo',
  'amigo',
  'luis',
  'donaldo',
  'ciertamente',
  'nombre',
  'circunstancias',
  'luis',
  'donaldo',
  'colosio',
  'llenado',
  'insistentemente',
  'volúmenes',
  'espacios',
  'medios',
  'comunicación',
  'renovada',
  'actualidad',
  'padecido',
  'frenético',
  'vaivén',
  'ficciones',
  'judiciales',
  'políticas',
  'integran',
  'disgregan',
  'metafísicas',
  'metafísicas',
  'aún',
  'luis',
  'donaldo',
  'desprende',
  'envuelve',
  'lado',
  'espejo',
  'dos',
  'años',
  'eternos',
  'insolvencias',
  'dale',
  'dale',
  'fantasía',
  'magia',
  'dónde',
  'quedó',
  'bolita',
  'traído',
  'pueblo',
  'hastío',
  'cansancio',
  'inminencia',
  'percibe',
  'váyanse',
  'diablo',
  'quórum',
  'nacional',
  'veía',
  'decidido',

In [14]:
dictionary_all = corpora.Dictionary(lemmatized_articles_docs)

In [15]:
doc_term_matrix_all = [dictionary_all.doc2bow(doc) for doc in lemmatized_articles_docs]
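
To make the bag-of-words representation concrete, the sketch below builds it in plain Python for two tiny hypothetical documents. It only illustrates the idea of `(token id, count)` pairs; it is not gensim's implementation, and the ids it assigns (first-seen order) may differ from those in `dictionary_all`.

```python
from collections import Counter

# Two tiny hypothetical documents
docs = [['luis', 'donaldo', 'luis'], ['donaldo', 'colosio']]

# Assign an integer id to each distinct token, in first-seen order
token2id = dict()
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count how often each token id occurs and return sorted (id, count) pairs
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(bow_corpus)
```

Word order is discarded in this representation: only the frequency of each token in each document reaches the topic model.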

In [16]:
Lda = gensim.models.ldamodel.LdaModel

In [17]:
ldamodel_all = Lda(doc_term_matrix_all, num_topics = 4, id2word = dictionary_all, passes = 500)

In [18]:
print(ldamodel_all.print_topics(num_topics = 4, num_words = 5))

[(0, '0.005*"excelsior" + 0.004*"internet" + 0.004*"millones" + 0.004*"años" + 0.004*"empresas"'), (1, '0.004*"millones" + 0.004*"si" + 0.004*"méxico" + 0.003*"abril" + 0.003*"mil"'), (2, '0.005*"trabajadores" + 0.004*"gobierno" + 0.004*"señor" + 0.003*"día" + 0.003*"méxico"'), (3, '0.006*"gobierno" + 0.004*"millones" + 0.004*"pesos" + 0.003*"sólo" + 0.003*"mil"')]
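
Each entry in this output pairs a topic id with a string of `weight*"word"` terms, where the weight is the word's probability within that topic. A small parser such as the hypothetical `parse_topic` below can turn one of those strings into `(word, weight)` pairs for further inspection; the sample string is copied from the output above.

```python
import re

# One topic string copied from the output above
topic = '0.005*"excelsior" + 0.004*"internet" + 0.004*"millones" + 0.004*"años" + 0.004*"empresas"'

def parse_topic(topic_string):
    # Capture each weight and the quoted word that follows the '*'
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

terms = parse_topic(topic)
print(terms)
```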


Next, we repeat the process using only the second article of the selected text file, so the corpus contains a single document.

In [19]:
lemmatized_article_2 = [lemmatized_articles_docs[1]]

In [20]:
lemmatized_article_2

[['martes',
  'abril',
  'scalfaro',
  'aquí',
  'allá',
  'juan',
  'maria',
  'alponte',
  'julio',
  'villa',
  'gagia',
  'benito',
  'mussolini',
  'recibía',
  'adolf',
  'hitler',
  'viajara',
  'tren',
  'italia',
  'someter',
  'dictador',
  'italiano',
  'larga',
  'furiosa',
  'requisitoria',
  'comportamiento',
  'manifiestamente',
  'malo',
  'dice',
  'tropas',
  'italianas',
  'alemanes',
  'sienten',
  'heridos',
  'julio',
  'sicilia',
  'desembarcado',
  'angloamericanos',
  'base',
  'militar',
  'augusta',
  'rendido',
  'hitler',
  'exasperado',
  'quiere',
  'aceptar',
  'italianos',
  'quieren',
  'combatir',
  'mediodía',
  'día',
  'entrevista',
  'cortada',
  'secretario',
  'duce',
  'portaba',
  'telegrama',
  'voz',
  'emocionada',
  'mussolini',
  'traduce',
  'alemán',
  'momento',
  'enemigo',
  'bombardea',
  'roma',
  'muertos',
  'heridos',
  'julio',
  'palacio',
  'venecia',
  'produjo',
  'partir',
  'cinco',
  'tarde',
  'aquel',
  'sábado',
  'so

In [21]:
dictionary_art_2 = corpora.Dictionary(lemmatized_article_2)

In [22]:
doc_term_matrix_art_2 = [dictionary_art_2.doc2bow(doc) for doc in lemmatized_article_2]

In [23]:
ldamodel_art_2 = Lda(doc_term_matrix_art_2, num_topics = 4, id2word = dictionary_art_2, passes = 500)

In [24]:
print(ldamodel_art_2.print_topics(num_topics = 4, num_words = 5))

[(0, '0.019*"mussolini" + 0.016*"italia" + 0.013*"duce" + 0.010*"roma" + 0.010*"momento"'), (1, '0.004*"monarquía" + 0.004*"mismo" + 0.004*"nación" + 0.004*"mundo" + 0.004*"muertos"'), (2, '0.004*"monarquía" + 0.004*"mismo" + 0.004*"nación" + 0.004*"mundo" + 0.004*"muertos"'), (3, '0.004*"monarquía" + 0.004*"mismo" + 0.004*"nación" + 0.004*"mundo" + 0.004*"muertos"')]
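
Topics 1 to 3 in this output are identical at the displayed precision, a likely sign that a single-document corpus gives the model too little information to separate four topics. A check like the sketch below can flag such repeats; the topic strings are copied from the output above, and the grouping logic is only an illustration.

```python
from collections import defaultdict

# Topic strings copied from the output above
repeated = '0.004*"monarquía" + 0.004*"mismo" + 0.004*"nación" + 0.004*"mundo" + 0.004*"muertos"'
topics = [
    (0, '0.019*"mussolini" + 0.016*"italia" + 0.013*"duce" + 0.010*"roma" + 0.010*"momento"'),
    (1, repeated),
    (2, repeated),
    (3, repeated),
]

# Group topic ids by their term string and keep the groups with more than one id
groups = defaultdict(list)
for topic_id, terms in topics:
    groups[terms].append(topic_id)

duplicate_groups = [ids for ids in groups.values() if len(ids) > 1]
print(duplicate_groups)
```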
