<a href="https://colab.research.google.com/github/AlejandroBeltranA/OCVED-ML/blob/master/OCVED_CNN_v2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## OCVED Neural Network

This is the second of four scripts used in Osorio & Beltran (2020).

Here I build a series of convolutional neural networks (CNNs) to classify the training data. The CNNs perform well, but not as well as the logistic regression; CNNs typically need much more data to make accurate predictions.



The first step is to mount our Google Drive.

In [None]:
# Mount Google Drive so the data files can be read from it
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Here I load the preprocessed text that was cleaned in a previous step.

In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/My Drive/Data/OCVED/Classifier/universe/preprocessed_text.csv')
df['source'] = 'ocved'
df

Unnamed: 0.1,Unnamed: 0,index,text,label,file_id,label_1,category,source
0,0,0,margaritas chis ntx elementos ejercito exicano...,Accept,20000105001_NAC.txt,1,1,ocved
1,1,1,ntx policia federal preventiva pfp informo ult...,Accept,20000105002_NAC.txt,1,1,ocved
2,2,2,ntx elementos policia judicial federal pjf ase...,Accept,20000106001_NAC.txt,1,1,ocved
3,3,3,monterrey ntx policia ministerial reporto homb...,Accept,20000106002_NAC.txt,1,1,ocved
4,4,4,ntx elementos policia judicial federal pjf ase...,Accept,20000106003_NAC.txt,1,1,ocved
...,...,...,...,...,...,...,...,...
60816,60816,60832,carril lateral autopista queretaro quedar cuer...,Reject,20181231__639379415.txt,0,0,ocved
60817,60817,60833,fuga agua romper pavimento calle lluvia coloni...,Reject,20181231__639382582.txt,0,0,ocved
60818,60818,60834,nuevo casas grandes hombre vida localizar hace...,Reject,20181231__639386332.txt,0,0,ocved
60819,60819,60835,nuevo casas grandes ciudad total cuatro person...,Reject,20181231__639386333.txt,0,0,ocved


Here I pull the GSR classifications.

In [None]:
gsd = pd.read_csv('/content/drive/My Drive/Data/OCVED/Classifier/predictions_v2/correct_classification.csv')
gsd

Unnamed: 0,file_id,correct
0,20000105001_NAC.txt,1
1,20000105002_NAC.txt,1
2,20000106001_NAC.txt,1
3,20000106002_NAC.txt,1
4,20000106003_NAC.txt,1
...,...,...
60801,20181231__639379415.txt,0
60802,20181231__639382582.txt,0
60803,20181231__639386332.txt,0
60804,20181231__639386333.txt,0


In [None]:
df = pd.merge(df, gsd, on = "file_id")
df = df[["file_id", "text", "source", "correct"]]
df = df.rename(columns = {"correct":"label"})
df

Unnamed: 0,file_id,text,source,label
0,20000105001_NAC.txt,margaritas chis ntx elementos ejercito exicano...,ocved,1
1,20000105002_NAC.txt,ntx policia federal preventiva pfp informo ult...,ocved,1
2,20000106001_NAC.txt,ntx elementos policia judicial federal pjf ase...,ocved,1
3,20000106002_NAC.txt,monterrey ntx policia ministerial reporto homb...,ocved,1
4,20000106003_NAC.txt,ntx elementos policia judicial federal pjf ase...,ocved,1
...,...,...,...,...
60801,20181231__639379415.txt,carril lateral autopista queretaro quedar cuer...,ocved,0
60802,20181231__639382582.txt,fuga agua romper pavimento calle lluvia coloni...,ocved,0
60803,20181231__639386332.txt,nuevo casas grandes hombre vida localizar hace...,ocved,0
60804,20181231__639386333.txt,nuevo casas grandes ciudad total cuatro person...,ocved,0


Before getting into the NN, I test the data on a simple logistic regression. 

The command below creates the training and testing data. 

In [None]:
from sklearn.model_selection import train_test_split
sentences_train, sentences_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.25, random_state=1000)

I'm using `CountVectorizer` (raw term counts) here rather than TF-IDF.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)

X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(sentences_test)
X_train

<45604x85168 sparse matrix of type '<class 'numpy.int64'>'
	with 3940041 stored elements in Compressed Sparse Row format>
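To make the difference between raw counts and TF-IDF concrete, here is a minimal sketch in plain Python. The toy corpus is hypothetical, and the weighting shown is the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 without length normalization, so it illustrates the idea rather than reproducing any library's exact output.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the OCVED data)
docs = [
    "policia federal asegurar droga",
    "policia municipal detener hombre",
    "fuga agua calle",
]

# Raw term counts per document: the kind of value a count vectorizer stores
counts = [Counter(doc.split()) for doc in docs]

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for c in counts for term in c)
n_docs = len(docs)

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

# TF-IDF weights (before any length normalization) for the first document
tfidf_doc0 = {term: tf * smoothed_idf(term) for term, tf in counts[0].items()}

print(dict(counts[0]))
print(tfidf_doc0)
```

With counts, "policia" and "droga" both get a value of 1 in the first document. With TF-IDF, "policia" is down-weighted because it appears in two of the three documents.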

Let's run the logistic regression and see how well it performs. This is our baseline for comparison.

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(solver='liblinear')
classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)

print("Accuracy:", score)

Accuracy: 0.9358637021444547


It does pretty well. Now let's make things more complicated.

In [None]:
!pip install keras



I'm using a simple tokenizer and then padding or truncating each article to a fixed length of 300 tokens (`maxlen` below). If a few hundred words aren't enough to tell whether an article is relevant, the model is unlikely to do better with more. The CNN also takes a long time to run, so capping the length keeps the model manageable.

In [None]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(sentences_train)

X_train = tokenizer.texts_to_sequences(sentences_train)
X_test = tokenizer.texts_to_sequences(sentences_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

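# Note: sentences_train is a pandas Series, so [2] selects index label 2,
# while X_train[2] is the third list element; they are different articles.
# Use sentences_train.iloc[2] to print the text that matches X_train[2].
print(sentences_train[2])
print(X_train[2])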

ntx elementos policia judicial federal pjf asegurar tonelada kilogramo mariguana operativo revision realizado nayarit informo hoy procuraduria general republica pgr dependencia agrego comunicado ejercer supervision vigilancia carretera numero tramo tepic capomal revisar vehiculo marcar kinworth tipo tractor color verde servicio publico federal placa circulacion detallo vehiculo modelo conducir javier valadez sepulveda procedente tonala jalisco dirigir ciudad tijuana baja embargo detener revision integrante pjf delegacion estatal explico dentro automotor encontrar caja madera paquete confeccionados cinta adhesivo color beige interior contenian mariguana peso bruto total kilo gramos pgr dar conocer droga vehiculo detenido poner disposicion agente ministerio publico federacion iniciar averiguacion correspondiente delito salud ntx ago mer mmm
[527, 524, 315, 1986, 12, 35, 1267, 1310, 315, 12, 35, 1267, 1310, 257, 1, 1509, 12, 63, 230, 878, 39, 121, 688, 31, 161, 1744, 10, 1065, 1310, 144, 
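The following is a rough, plain-Python sketch of what a frequency-ranked tokenizer with a `num_words` cap does. The helper names `fit_word_index` and `texts_to_sequences` are illustrative stand-ins, not the Keras implementation, and the two toy texts are made up.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    freq = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(freq.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words with indices, dropping words whose index >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

texts = ["policia federal policia", "policia federal droga"]  # toy data
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=3)

print(word_index)   # every word seen gets an index
print(sequences)    # only indices below num_words survive
```

Note that the word index covers every word seen during fitting, while the sequences only contain indices below `num_words`. A `vocab_size` computed as `len(word_index) + 1` is therefore likely larger than the range of indices that actually reach the embedding layer.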

In [None]:
from keras.preprocessing.sequence import pad_sequences

maxlen = 300

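# padding='post' appends zeros to short articles. Truncation still uses the
# default truncating='pre', which drops tokens from the start of long articles;
# pass truncating='post' to keep the first `maxlen` tokens instead.
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)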

print(X_train[0, :])

[   8  212  428 1155   88    8    2 1075  126   25  297   14    1   19
   28 1361  160  138  316  197  386  552   99  428 1089  969  372   22
    8    2   67  118   20  184 1003  241  223  101 1757   28  109 1341
   29  322   26  390  316  248   67  351  223  994 1194 2410   49  656
  307  390   58  265   29    6 1409  431  292 3859  321  667    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
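As a minimal sketch of the padding and truncation step, the hypothetical helper `pad_post` below mimics post-padding with a configurable truncation side. It is a plain-Python illustration, not the Keras function itself.

```python
def pad_post(seq, maxlen, truncating="pre", value=0):
    """Pad with `value` at the end; truncate from the front ('pre') or the back ('post')."""
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    return list(seq) + [value] * (maxlen - len(seq))

short_seq = [8, 212, 428]
long_seq = [1, 2, 3, 4, 5, 6]

print(pad_post(short_seq, maxlen=5))                    # [8, 212, 428, 0, 0]
print(pad_post(long_seq, maxlen=4))                     # [3, 4, 5, 6]
print(pad_post(long_seq, maxlen=4, truncating="post"))  # [1, 2, 3, 4]
```

Short articles are filled with trailing zeros, as in the output above. For long articles, the truncation side decides whether the opening or the closing tokens are kept.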

Here I define the model and internals of the CNN.

In [None]:
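from keras.models import Sequential
from keras import layers

# Embedding -> 1D convolution -> global max pooling -> dense -> sigmoid output
def create_model(num_filters, kernel_size, vocab_size, embedding_dim, maxlen):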
    model = Sequential()
    model.add(layers.Embedding(vocab_size, embedding_dim, input_length=maxlen))
    model.add(layers.Conv1D(num_filters, kernel_size, activation='relu'))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(10, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
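To get a feel for the size of this architecture, the sketch below computes the convolution output length and the trainable-parameter counts by hand. It assumes the default `'valid'` padding and stride 1 for the convolution. The helper `cnn_summary` and the argument values in the example call are illustrative, not the settings selected by the search.

```python
def cnn_summary(num_filters, kernel_size, vocab_size, embedding_dim, maxlen):
    """Conv output length and per-layer parameter counts for the model above."""
    conv_len = maxlen - kernel_size + 1  # 'valid' padding, stride 1
    params = {
        "embedding": vocab_size * embedding_dim,
        "conv1d": (kernel_size * embedding_dim + 1) * num_filters,  # +1 for bias
        "dense_10": (num_filters + 1) * 10,  # input is the pooled filter vector
        "dense_1": 10 + 1,
    }
    return conv_len, params

# Hypothetical settings, chosen only for illustration
conv_len, params = cnn_summary(num_filters=64, kernel_size=5,
                               vocab_size=5000, embedding_dim=50, maxlen=300)
print(conv_len)               # 296
print(params)
print(sum(params.values()))   # 266725
```

In this example the embedding layer accounts for most of the parameters, so the `vocab_size` passed to the model has a much larger effect on model size than the number of filters or the kernel size.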

In [None]:
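# Example grid; it is redefined inside the search loop below with the
# actual vocab_size and maxlen, so these values are not the ones used.
param_grid = dict(num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  vocab_size=[5000],
                  embedding_dim=[50],
                  maxlen=[500])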

Now to the actual model. I use a randomized search, which fits several versions of the CNN with different numbers of filters and kernel sizes to see which one does the best job. CNNs require a lot of fine-tuning, but since this is not a very large sample I rely on the random search to test multiple parameters. It fits 5 different CNN configurations, each with 4-fold cross-validation.
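The sampling idea can be sketched with the standard library alone. Only the two hyperparameters that vary are listed, and the fixed seed is an assumption made so the illustration is repeatable. When every grid entry is a list, scikit-learn's randomized search draws candidates without replacement, which `random.sample` mirrors here.

```python
import itertools
import random

# The two hyperparameters that actually vary in the search
param_grid = {"num_filters": [32, 64, 128], "kernel_size": [3, 5, 7]}

# All 3 x 3 = 9 possible combinations
all_combos = [dict(zip(param_grid, values))
              for values in itertools.product(*param_grid.values())]

rng = random.Random(0)                 # fixed seed, for illustration only
sampled = rng.sample(all_combos, k=5)  # 5 distinct candidates out of 9

n_folds = 4
total_fits = len(sampled) * n_folds    # each candidate is fit once per fold

for candidate in sampled:
    print(candidate)
print("total fits:", total_fits)
```

Five candidates with four folds each gives 20 fits, which matches the "totalling 20 fits" line in the search log.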

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import RandomizedSearchCV
from keras.models import Sequential
from keras import layers
# Main settings
epochs = 20
embedding_dim = 50
maxlen = 300
output_file = '/content/drive/My Drive/Data/OCVED/Classifier/universe/output_6.txt'
saved_scores = pd.DataFrame()
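# Run the search once per source (this data has a single source, 'ocved')
for source, frame in df.groupby('source'):
    print('Running grid search for data set :', source)
    # Use the rows for this source rather than the full DataFrame
    sentences = frame['text'].values
    y = frame['label'].values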

    # Train-test split
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, y, test_size=0.25, random_state=1000)

    # Tokenize words
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(sentences_train)
    X_train = tokenizer.texts_to_sequences(sentences_train)
    X_test = tokenizer.texts_to_sequences(sentences_test)

    # Adding 1 because of reserved 0 index
    vocab_size = len(tokenizer.word_index) + 1

    # Pad sequences with zeros
    X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
    X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
    print ("preprocessing done")
    # Parameter grid for grid search
    param_grid = dict(num_filters=[32, 64, 128],
                      kernel_size=[3, 5, 7],
                      vocab_size=[vocab_size],
                      embedding_dim=[embedding_dim],
                      maxlen=[maxlen])
    model = KerasClassifier(build_fn=create_model,
                            epochs=epochs, batch_size=10,
                            verbose=False)
    grid = RandomizedSearchCV(estimator=model, param_distributions=param_grid,
                              cv=4, verbose=1, n_iter=5)
    print ("model ready")
    grid_result = grid.fit(X_train, y_train)
    saved_scores = pd.DataFrame(grid_result.cv_results_)  # per-candidate CV scores

    # Evaluate testing set
    test_accuracy = grid.score(X_test, y_test)
    print ("CNN Done")



Running grid search for data set : ocved
preprocessing done
model ready
Fitting 4 folds for each of 5 candidates, totalling 20 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


In [None]:
# Save and evaluate results (uses variables left over from the search loop)
with open(output_file, 'a') as f:
    s = ('Running {} data set\nBest Accuracy : '
         '{:.4f}\n{}\nTest Accuracy : {:.4f}\n\n')
    output_string = s.format(
        source,
        grid_result.best_score_,
        grid_result.best_params_,
        test_accuracy)
    print(output_string)
    f.write(output_string)

In [None]:
test_accuracy

After it has finished running, I print the parameters and results for each model and store them in a CSV.

In [None]:
print("done")
# Collect the per-candidate cross-validation results in a DataFrame
save = pd.DataFrame(grid_result.cv_results_)
save


It looks like all the CNNs averaged around 0.92 accuracy, slightly below the logistic regression baseline of about 0.936.

---



In [None]:
save.to_csv("/content/drive/My Drive/Data/OCVED/Classifier/results/scores/CNN_scores_v2.1.csv")