# ELUVIO data science challenge

### Assuming the file is large, the csv is firstly downloaded into cloud drive and then processed through a database file.
### The notebook below was ran on Google Colab and it documented the rough process of model prototyping. Please noted that due to the limitations on time and resources many details such as EDA and visualizations were not recorded.



In [1]:
# install fastText if needed
!git clone https://github.com/facebookresearch/fastText.git
!pip install fastText/

import fasttext
import fasttext.util

# download fasttext English embeddings (could take a while)
fasttext.util.download_model('en')


Cloning into 'fastText'...
remote: Enumerating objects: 22, done.[K
remote: Counting objects: 100% (22/22), done.[K
remote: Compressing objects: 100% (11/11), done.[K
remote: Total 3679 (delta 2), reused 17 (delta 1), pack-reused 3657[K
Receiving objects: 100% (3679/3679), 8.10 MiB | 5.98 MiB/s, done.
Resolving deltas: 100% (2313/2313), done.
Processing ./fastText
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... [?25l[?25hdone
  Created wheel for fasttext: filename=fasttext-0.9.1-cp36-cp36m-linux_x86_64.whl size=2859405 sha256=82010b82b03dd8f5a7006e9b99bf940b1a90592562132b65d246c16dbe304139
  Stored in directory: /tmp/pip-ephem-wheel-cache-c4_d6uop/wheels/a1/9f/52/696ce6c5c46325e840c76614ee5051458c0df10306987e7443
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.1
Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz

In [2]:
# load English embeddings (high memory usage)
eng = fasttext.load_model('cc.en.300.bin')
# reduce embeddings dimension
fasttext.util.reduce_model(eng, 100)



<fasttext.FastText._FastText at 0x7f09974fedd8>

In [126]:
# new dimension
eng.get_dimension()

100

In [0]:
# import useful libraries

# DS tools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# database tools
import sqlite3

# text processing
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
import string

# google drive tools
import pydrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [128]:
# stopwords
nltk.download(['stopwords', 'punkt'])
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
# authenticate and create PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# address of the file, replace the address below with the one on YOUR google drive 
link = 'https://drive.google.com/open?id=15X00ZWBjla7qGOIW33j8865QdF89IyAk'
_, id = link.split('=')

In [0]:
# download csv to cloud drive (files can be directly read from google drive as well, it was firstly downloaded to cloud for speedy process)
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('eluvio.csv')

In [0]:
# create database on the could
conn = sqlite3.connect('eluvio.db')
cur = conn.cursor()

In [0]:
# import csv into database file
df = pd.read_csv('eluvio.csv', chunksize=512)
for i in df:
  i.to_sql('eluvio', con=conn, if_exists='append')

In [134]:
# dataset size
pd.read_sql_query('Select Count(*) from Eluvio', con=conn)

Unnamed: 0,Count(*)
0,509236


In [135]:
# basic database info
cur.execute('PRAGMA table_info(''eluvio'')').fetchall()

[(0, 'index', 'INTEGER', 0, None, 0),
 (1, 'time_created', 'INTEGER', 0, None, 0),
 (2, 'date_created', 'TEXT', 0, None, 0),
 (3, 'up_votes', 'INTEGER', 0, None, 0),
 (4, 'down_votes', 'INTEGER', 0, None, 0),
 (5, 'title', 'TEXT', 0, None, 0),
 (6, 'over_18', 'INTEGER', 0, None, 0),
 (7, 'author', 'TEXT', 0, None, 0),
 (8, 'category', 'TEXT', 0, None, 0)]

In [136]:
# some explorations
pd.read_sql_query('Select * From eluvio Limit 10', con=conn)


Unnamed: 0,index,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,0,polar,worldnews
1,1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,0,polar,worldnews
2,2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,0,polar,worldnews
3,3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,0,fadi420,worldnews
4,4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,0,mhermans,worldnews
5,5,1201287889,2008-01-25,15,0,Hay presto! Farmer unveils the illegal mock-...,0,Armagedonovich,worldnews
6,6,1201289438,2008-01-25,5,0,"Strikes, Protests and Gridlock at the Poland-U...",0,Clythos,worldnews
7,7,1201536662,2008-01-28,0,0,The U.N. Mismanagement Program,0,Moldavite,worldnews
8,8,1201558396,2008-01-28,4,0,Nicolas Sarkozy threatens to sue Ryanair,0,Moldavite,worldnews
9,9,1201635869,2008-01-29,3,0,US plans for missile shields in Polish town me...,0,JoeyRamone63,worldnews


### One possible application is predicting title popularity as time-series using RNN to determine which newly published title should be displayed on front page, yet the causal relationship of title popularity with time is not easy to establish. Hence in the intrest of time a more straight-forward application of determine which title should be flagged as over_18 is presented below
### Please note that since the two categories are extremely imbalanced (only 320 observations in "over_18") the majority class were randomly sampled with 320 observations to join the minority and form a balanced experimentation data set.


In [0]:
# positive and negative titles by age limit, negative cases randomly down-sampled to match the number of positive cases
over_18 = pd.read_sql_query('Select * From eluvio Where over_18 = 1', con=conn)
not_over_18 = pd.read_sql_query('Select * from eluvio Where over_18=0 Order By Random() Limit 320', con=conn)

df = pd.concat([over_18, not_over_18])

### Based on the given file, whether a file would be flagged as over_18 is mainly determined by its title which is not very much time dependent. Therefore only title is used for the modelling process. Other features such as author and category can also be included in the future when a robust in-house word embedding system is established. 




In [0]:
df = df[['title', 'over_18']]
df.reset_index(inplace=True)

In [185]:
df.head()

Unnamed: 0,index,title,over_18
0,0,Pics from the Tibetan protests - more graphic ...,1
1,1,"MI5 linked to Max Mosley’s Nazi-style, sadomas...",1
2,2,Tabloid Horrifies Germany: Poland s Yellow Pre...,1
3,3,Love Parade Dortmund: Techno Festival Breaks R...,1
4,4,IDF kills young Palestinian boy. Potentially N...,1


### for each title, all punctuations, stop-words and symbols/numbers are removed, and embedding vectors for all other words left are summed up into a single 100-dimensional vector to represent a single title in the design matrix


In [0]:
# prepare trainable data
def data_prep(data_frame, embeddings):
  # word embeddings matrix construction
  X = np.zeros((data_frame.shape[0], embeddings.get_dimension()))
  Y = data_frame['over_18']
  count = 0
  for title in data_frame['title']:
    # tokenize, remove stop words, punctuation etc.
    words = [(i.lower()) for i in word_tokenize(title) if i.isalpha() and i not in stopwords]
    
    # get word vectors and take the sum, which will be used as the representation of each title
    sum = np.zeros(embeddings.get_dimension())
    for w in words:
      sum+=embeddings[w]
    
    X[count, :] = sum.reshape(1, embeddings.get_dimension())
    count+=1

  return sklearn.model_selection.train_test_split(X, Y, test_size=0.2, random_state=13, shuffle=True)


In [0]:
# trainable data
trainX, testX, trainY, testY = data_prep(df, eng)

In [188]:
# dimensions
trainX.shape, testX.shape, trainY.shape, testY.shape

((512, 100), (128, 100), (512,), (128,))

In [0]:
# sklearn tools
from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

### Since the down-sampling process gives us a balanced dataset, the naive benchmark will be 50% and let's see if we can do better.

In [190]:
# first a vanilla SVM to get a basic sense
clf = svm.SVC()
clf.fit(trainX, trainY)
predY = clf.predict(testX)
confusion_matrix(testY, predY)

array([[55,  3],
       [20, 50]])

### we can see that the FN is quite high. Let's try to tune the model a bit

In [0]:
from sklearn.model_selection import GridSearchCV 
parameters = {'C':[1, 5, 10, 25, 50, 100], 'gamma':[0.0001, 0.001, 0.005, 0.01, 0.1, 1]}
clf = svm.SVC()

In [192]:
recall_scorer = sklearn.metrics.make_scorer(sklearn.metrics.recall_score)
clf = GridSearchCV(clf, parameters, scoring=recall_scorer)
clf.fit(trainX, trainY)

GridSearchCV(cv=None, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1, 5, 10, 25, 50, 100],
                         'gamma': [0.0001, 0.001, 0.005, 0.01, 0.1, 1]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=make_scorer(recall_score), verbose=0)

In [193]:
confusion_matrix(testY, clf.best_estimator_.predict(testX))

array([[53,  5],
       [17, 53]])

In [194]:
clf.best_score_

0.756

### we can see a slight improvement on recall, due to the interest of time  no other models, nor repeated sampling majority class to test model stability, nor further tuning will be done and the process recorded so far illustrates the rough overall process of model prototyping. Many of the details such as EDA and visualizations are not recorded in this report again due to the limitation on time, yet they were crucial steps which led to the decisions on problem formulation, modelling and so on.