<a href="https://colab.research.google.com/github/AlejandroBeltranA/OCVED-ML/blob/master/OCVED_Applied_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classifying remaining articles

This is the 4th of 4 scripts used in ocved.mx

This script uses the LR model trained in the first script to classify the universe of articles collected from EMIS. A total of 188,492 are classified using the model from OCVED_GSR_Trained.v2.4



In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Install tqdm
%cd /content/drive/
!ls
!pip install tqdm


/content/drive
'My Drive'


In [4]:
# Packages used
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook as tqdm

from nltk.tokenize import word_tokenize
from nltk import pos_tag
#from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm, linear_model
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, precision_score, recall_score, f1_score

We download the Spacy lemmatizer again to reduce words to their lemma for normalization. 

In [5]:
%%capture
!pip install es-lemmatizer
!pip install -U spacy
!sudo python -m spacy download es_core_news_sm

import re
import nltk
nltk.download('stopwords')

We load in the universe of articles collected from EMIS using the scripts in EMIS_scrape repository. 

These articles are downloaded from subnational news sources, regional newspapers, and other sources not specified as national newspapers. There's a lot of noise in these articles. I leave the training articles in the universe since the model should perform well on the articles it was trained on. 

This csv contains 158,496 articles. The majority of these are noise!

In [6]:
 emis = pd.read_csv('My Drive/Data/OCVED/Classifier/universe/EMIS_Universe.csv')
 emis

Unnamed: 0.1,Unnamed: 0,date,file_id,text
0,0,2014 09 04,20140904__455079294.txt,\nQué leer\n\nKaren López\n20140904.-AMOR Y OT...
1,1,2014 08 13,20140813__452145796.txt,\nRoban carteras\n\n\n20140813.-Aprovechando l...
2,2,2014 07 18,20140718__449892055.txt,\n Poli...
3,3,2014 07 21,20140721__450060451.txt,\nMás cargos encima\n\nMURAL / STAFF\n20140721...
4,4,2014 07 17,20140717__449726199.txt,\nExcélsior | 2014-07-17 | 10:00\n\n\n\n\n\n\n...
...,...,...,...,...
158492,158492,2015 10 12,20151012__504376451.txt,"\n El maquillaje, si se elige adecuadamente, p..."
158493,158493,2015 02 08,20150208__469314532.txt,\n POR ...
158494,158494,2015 09 28,20150928__502939710.txt,\n Aún no se saben los motivos que llevaron a ...
158495,158495,2015 03 12,20150312__473098625.txt,\nMatan a precandidata del PRD en Guerrero\n\n...


As detailed in script 1, a seperate process collected articles from national newspapers by having RA's manually download these articles. The manual download process took 5 months, students would read each article and determine if it was relevant to the PI's research. In contrast, the scraping and generating traning data took a total of 3 months, with the added advantage that the model can be used for future data collected. 

This process generated 29,995 articles. 

In [7]:
nat = pd.read_csv('My Drive/Data/OCVED/National/txt_docs/National_OCVED.csv')
nat

Unnamed: 0,date,file_id,label,text
0,5 de Enero de 2000 \nTranslation powered by Go...,20000105001_NAC.txt,Accept,"\n es \n \n \n Las Margaritas, Chis., 5 Ene (N..."
1,4 de Enero de 2000 \nTranslation powered by Go...,20000105002_NAC.txt,Accept,"\n es \n \n \n México, 4 Ene (NTX).- La Policí..."
2,6 de Enero de 2000 \nTranslation powered by Go...,20000106001_NAC.txt,Accept,"\n es \n \n \n México, 6 Ene (NTX).- Elementos..."
3,6 de Enero de 2000 \nTranslation powered by Go...,20000106002_NAC.txt,Accept,"\n es \n \n \n Monterrey, NL., 6 Ene (NTX).- L..."
4,6 de Enero de 2000 \nTranslation powered by Go...,20000106003_NAC.txt,Accept,"\n es \n \n \n México, 6 Ene (NTX).- Elementos..."
...,...,...,...,...
29990,,2018112401_NAT.txt,Accept,"\n \n \n \n \n \n November 24, 2018 \n | \n Pu..."
29991,,2018112402_NAT.txt,Accept,"\n \n \n \n \n \n November 24, 2018 (15:07) \n..."
29992,,2018112501_NAT.txt,Accept,"\n \n \n \n \n \n November 25, 2018 \n | \n Pu..."
29993,,2018112801_NAT.txt,Accept,"\n \n \n \n \n \n November 28, 2018 \n | \n Pu..."


We combine these two datasets, making a full universe of articles on DTO's in Mexico. All articles used in the training steps are also included given the model should perform well classifying these. 

In [8]:
data = []
data.append(emis)
data.append(nat)
df = pd.concat(data, axis=0, ignore_index=True, sort=True).sort_values('file_id', ascending= True)

In [9]:
df

Unnamed: 0.1,Unnamed: 0,date,file_id,label,text
158497,,5 de Enero de 2000 \nTranslation powered by Go...,20000105001_NAC.txt,Accept,"\n es \n \n \n Las Margaritas, Chis., 5 Ene (N..."
158498,,4 de Enero de 2000 \nTranslation powered by Go...,20000105002_NAC.txt,Accept,"\n es \n \n \n México, 4 Ene (NTX).- La Policí..."
158499,,6 de Enero de 2000 \nTranslation powered by Go...,20000106001_NAC.txt,Accept,"\n es \n \n \n México, 6 Ene (NTX).- Elementos..."
158500,,6 de Enero de 2000 \nTranslation powered by Go...,20000106002_NAC.txt,Accept,"\n es \n \n \n Monterrey, NL., 6 Ene (NTX).- L..."
158501,,6 de Enero de 2000 \nTranslation powered by Go...,20000106003_NAC.txt,Accept,"\n es \n \n \n México, 6 Ene (NTX).- Elementos..."
...,...,...,...,...,...
28601,28601.0,2018 12 31,20181231__639443971.txt,,\nLos hechos ocurrieron esta mañana en la colo...
27641,27641.0,2018 12 31,20181231__639443972.txt,,\nHay dos personas lesionadas. am. Un ataque ...
39623,39623.0,2018 12 31,20181231__639453461.txt,,\n El horario para dejar los juguetes es de 9...
30079,30079.0,2018 12 31,20181231__639501374.txt,,\n Un h...


In [10]:
np.random.seed(1000)

Code for removing accents.

In [11]:
import unicodedata
import string
# BEGIN SHAVE_MARKS_LATIN
def shave_marks_latin(txt):
    """Remove all diacritic marks from Latin base characters"""
    norm_txt = unicodedata.normalize('NFD', txt)  # <1>
    latin_base = False
    keepers = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:   # <2>
            continue  # ignore diacritic on Latin base char
        keepers.append(c)                             # <3>
        # if it isn't combining char, it's a new base char
        if not unicodedata.combining(c):              # <4>
            latin_base = c in string.ascii_letters
    shaved = ''.join(keepers)
    return unicodedata.normalize('NFC', shaved)   # <5>
# END SHAVE_MARKS_LATIN
def shave_marks(txt):
    """Remove all diacritic marks"""
    norm_txt = unicodedata.normalize('NFD', txt)  # <1>
    shaved = ''.join(c for c in norm_txt
                     if not unicodedata.combining(c))  # <2>
    return unicodedata.normalize('NFC', shaved)  # <3>
# END SHAVE_MARKS

Let's load the tokenizer and lemmatizer in. 

In [12]:
from es_lemmatizer import lemmatize
import es_core_news_sm

nlp = es_core_news_sm.load()
nlp.add_pipe(lemmatize, after="tagger")

Stopwords removed to reduce noise and reduce the number of not useful features.

In [13]:
from nltk.corpus import stopwords

##Creating a list of stop words and adding custom stopwords
stop_words = set(stopwords.words("spanish"))
##Creating a list of custom stopwords
new_words = ["daily", "newspaper", "reforma", "publication", "universal", "iv", "one", "two", "august" , "excelsior", "online",
             "november", "july", "september", "june", "october", "december", "print", "edition", "news", "milenio", "january", "international",
             "march", "april", "july", "february", "may", "october", "el occidental", "comments", "powered", "display", "space", 
             "javascript", "trackpageview", "enablelinktracking", "location", "protocol", "weboperations", "settrackerurl", "left", 
             "setsiteid", "createelement", "getelementsbytagname", "parentnode", "insertbefore", "writeposttexto", "everykey", "passwords"
             "writecolumnaderechanotas", "anteriorsiguente", "anteriorsiguiente", "writefooter", "align", "googletag", "writeaddthis", "writefooteroem", 
             "diario delicias", "diario tampico", "the associated press", "redaccion" , "national", "diario yucatan", "mural", "periodico", 
             "new", "previously", "shown" , "a", "para", "tener" , "haber", "ser" , "mexico city", "states", "city", "and", "elsolde", "recomendamos", 
            "diario chihuahua" , "diario juarez" , "el norte", "voz frontera" , "regional" , "de"  , "el sol" , "el" , "sudcaliforniano" , "washington",
            "union morelos", "milenio" , "notimex", "el financiero" , "financiero" , "forum magazine" , "economista" , "gmail" , "financial", "el" , "de",
             "la", "del", "de+el" , "a+el" , "shortcode" , "caption", "cfcfcf", "float", "item", "width", "follow", "aaannnnn", "gmannnnn", 
             "dslnnnnn", "jtjnnnnn", "lcgnnnnn", "jgcnnnnn", "vhannnnn",  "mtc", "eleconomista", "monitoreoif", "infosel", "gallery", 
             "heaven", "div", "push" , "translate", "google"]
stop_words = stop_words.union(new_words)
stop_words = shave_marks(repr(stop_words))

In [14]:
dataset = df

Process for cleaning out the text and generating the corpus. 

In [15]:
corpus = []
for i in dataset.itertuples():
#for i in tqdm(range(1, 2000)):
    text = shave_marks_latin(i.text)
    #Remove punctuations
    text = re.sub('[^a-zA-Z]', ' ', text)
    #Convert to lowercase
    #text = shave_marks_latin(text)
    #text = text.lower()
    #remove tags
    text=re.sub("&lt;/?.*?&gt;"," &lt;&gt; ",text)
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    text = re.sub(' +', ' ', text)
    #Lemmatisation
    doc = nlp(text)
    text = [token.lemma_ for token in doc if token.lemma_ not in stop_words] 
    text = " ".join(text)
    text = shave_marks(text)
    file_id = i.file_id
    original = i.text
    corpus.append({ 'text': text, 'file_id': file_id , "original": original})
print ("done")

done


In [16]:
data = pd.DataFrame(corpus)
data

Unnamed: 0,text,file_id,original
0,margaritas chis ntx elementos ejercito exicano...,20000105001_NAC.txt,"\n es \n \n \n Las Margaritas, Chis., 5 Ene (N..."
1,ntx policia federal preventiva pfp informo ult...,20000105002_NAC.txt,"\n es \n \n \n México, 4 Ene (NTX).- La Policí..."
2,ntx elementos policia judicial federal pjf ase...,20000106001_NAC.txt,"\n es \n \n \n México, 6 Ene (NTX).- Elementos..."
3,monterrey ntx policia ministerial reporto homb...,20000106002_NAC.txt,"\n es \n \n \n Monterrey, NL., 6 Ene (NTX).- L..."
4,ntx elementos policia judicial federal pjf ase...,20000106003_NAC.txt,"\n es \n \n \n México, 6 Ene (NTX).- Elementos..."
...,...,...,...
188487,hecho ocurrir manana colonia leon guzman salam...,20181231__639443971.txt,\nLos hechos ocurrieron esta mañana en la colo...
188488,persona lesionadas ataque domicilio colonia ba...,20181231__639443972.txt,\nHay dos personas lesionadas. am. Un ataque ...
188489,horario dejar juguete manana tarde instalacion...,20181231__639453461.txt,\n El horario para dejar los juguetes es de 9...
188490,hombre asesinar casa villagran hombre perdio v...,20181231__639501374.txt,\n Un h...


In [None]:
gen = data['text']

In [None]:
gen

Let's look at how frequent some words are in the universe using Tokenizer. 

In [None]:
from keras.preprocessing.text import Tokenizer
#Using TensorFlow backend. xtrain_count, train_y, xvalid_count
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(gen)

X_gen = tokenizer.texts_to_sequences(gen)
#X_test = tokenizer.texts_to_sequences(valid_x)
#
vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

print(gen.iloc[3])
print(X_gen[3])



In [None]:
for word in ['sexual', 'cartel', 'sinaloa', 'violencia']:
  print('{}: {}'.format(word, tokenizer.word_index[word]))

In [None]:
from keras.preprocessing.sequence import pad_sequences

maxlen = 100

X_gen = pad_sequences(X_gen, padding='post') #, maxlen=maxlen
#X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

print(X_gen[2, :])

# Application

Now we load in the encoder, model, and vectorizer from script 1 so we can implement it in the application pipeline. 

In [None]:
import pickle

pkl_file = open('/content/drive/My Drive/Data/OCVED/Classifier/algorithm/OCVED_encoder_v2.pkl', 'rb')
encoder = pickle.load(pkl_file) 
pkl_file.close()

We used the LR model because it produced the best F1 score of all models. See Osorio & Beltran (2020) for more information on why. 

In [None]:
from sklearn.externals import joblib
# save the model to disk
filename = '/content/drive/My Drive/Data/OCVED/Classifier/algorithm/logistic_model_v2.sav'
  
# load the model from disk
logit_model = joblib.load(filename)

It's important we use the same trained tfidf from the first script in this process. Otherwise the length and words used will be different across vectors!

In [None]:
# create a count vectorizer object 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pickle

#count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
#count_vect.fit(data['text'])

# transform the training and validation data using count vectorizer object
#xtrain_count =  count_vect.transform(gen)
#xtrain_count

#pickle.dump(xtrain_count, open("/content/drive/My Drive/Data/Bogota/categorized_articles/tfidf.pickle", "wb"))

pkl_file = open("/content/drive/My Drive/Data/OCVED/Classifier/algorithm/Tfidf_vect_3.pickle", 'rb')
tfidf = pickle.load(pkl_file) 
pkl_file.close()

In [None]:
tfidf

In [None]:
gen_2 = tfidf.transform(gen)
gen_2

Now we finally ask the logit model to generate predictions for each article. It reviews the numeric contents and makes a predictions. Anything that has above a .5 probability of being DTO related is classified as such. 

In [None]:
# make a prediction
y_label = logit_model.predict(gen_2)
# show the inputs and predicted outputs
print("X=%s, Predicted=%s" % (gen_2[0], y_label[0]))

In [None]:
# make a prediction
y_prob = logit_model.predict_proba(gen_2)[:,1]
# show the inputs and predicted outputs
y_prob

Now we want to save the output, first in a csv. 

In [None]:
data['y_label'] = y_label

data['y_prob'] = y_prob

In [None]:
data.to_csv('My Drive/Data/OCVED/Classifier/predictions_v3/logit_OCVED_pred_v3.csv')

Here I save them as .txt files for use in Eventus ID. 

In [None]:
data = data[data.y_label == 1 ]
data

In [None]:
for i in tqdm(dataset.itertuples()):
    text = shave_marks_latin(i.text)
    #Remove punctuations
    #text = re.sub('[^a-zA-Z]', ' ', text)
    #Convert to lowercase
    #remove tags
    #text=re.sub("&lt;/?.*?&gt;"," &lt;&gt; ",text)
    # remove special characters and digits
    #text=re.sub("(\\d|\\W)+"," ",text)
    #text = re.sub(' +', ' ', text)
    file_id = i.file_id
    original = i.original
    dirty = 'My Drive/Data/OCVED/Classifier/predictions_v3/dirty/'
    clean = 'My Drive/Data/OCVED/Classifier/predictions_v3/clean/'
    dirty_file = dirty + file_id
    clean_file = clean + file_id
    with open(dirty_file, 'w') as f:
      f.write(original)
    with open(clean_file, 'w') as c:
      c.write(text)


In [None]:
print ("script has completed")

This takes a long time so I have it print out the time it finished. 

In [None]:
!rm /etc/localtime
!ln -s /usr/share/zoneinfo/America/Phoenix /etc/localtime
!date