# Learning NLP Tutorial Series 
## Tutorial 1 : More Sentiment Analysis 

Topics include: 
* Exploring a dataset ([Kaggle's ArXiv](https://www.kaggle.com/Cornell-University/arxiv)) 
* Explainability methods: SHAP, LIME
* Generalizing binary classification to N classes

(Authors: Luca Bottero, Simone Azeglio, Alessio Borriero)

---
---

## **Overview**

* [Preprocessing](#section1)
    * [Feature Engineering: feature extraction with count vectorizer and term frequency-inverse document frequency (tf-idf)](#section1.1)

* [Classification](#section2)
    * [Train a classifier](#section2.1)  
    * [Prediction over test set](#section2.2)  
    * [Evaluation](#section2.3)  

* [Explainability Methods](#section2)
    * [SHAP](#section2.1)
    * [LIME](#section2.2)

* [References & Additional Material](#section4)

---
---

<a id="section1"></a>
# **Preprocessing**

To download the dataset chosen for this tutorial from Kaggle, run the cells below. The complete procedure is described [here](https://www.kaggle.com/general/74235).

Before running the first cell, go to your Kaggle account page, scroll to the API section and click "Create New API Token". This downloads a `kaggle.json` file to your machine.

In [None]:
! pip install -q kaggle

In [None]:
from google.colab import files

In [None]:
files.upload()

Saving kaggle.json to kaggle (1).json


{'kaggle.json': b'{"username":"alessioborriero","key":"<redacted>"}'}

In [None]:
! mkdir ~/.kaggle

In [None]:
! cp kaggle.json ~/.kaggle/

In [None]:
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle datasets list

ref                                                         title                                              size  lastUpdated          downloadCount  
----------------------------------------------------------  ------------------------------------------------  -----  -------------------  -------------  
gpreda/reddit-vaccine-myths                                 Reddit Vaccine Myths                              221KB  2021-03-26 07:27:32           1058  
crowww/a-large-scale-fish-dataset                           A Large Scale Fish Dataset                          3GB  2021-02-17 16:10:44            813  
dhruvildave/wikibooks-dataset                               Wikibooks Dataset                                   1GB  2021-02-18 10:08:27            722  
imsparsh/musicnet-dataset                                   MusicNet Dataset                                   22GB  2021-02-18 14:12:19            315  
alsgroup/end-als                                            End ALS Kaggle C

In [None]:
! kaggle datasets download -d Cornell-University/arxiv

Downloading arxiv.zip to /content
 99% 897M/906M [00:13<00:00, 68.2MB/s]
100% 906M/906M [00:13<00:00, 70.1MB/s]


In [None]:
! mkdir train

In [None]:
! unzip arxiv.zip -d train

Archive:  arxiv.zip
  inflating: train/arxiv-metadata-oai-snapshot.json  


The arXiv dataset we will use is stored as JSON, a text format for storing and exchanging data that is based on JavaScript object notation.
The file holds one JSON object per line, one for each paper.

In [None]:
import pandas as pd
import numpy as np
import json   # module for working with JSON data
import nltk   # NLP toolkit
from nltk.corpus import stopwords
nltk.download('stopwords')
import re     # regular expression operations
import string # string operations
import collections
import gensim  
from gensim import parsing        # helpers for efficient text preprocessing


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


The function in the next cell builds a generator, i.e. an iterable that you can iterate over only once.
A generator does not store all of its values in memory; it produces them one at a time.
The function `get_metadata()` therefore lets you open the file and process it paper by paper.

In [None]:
path_to_data = 'train/arxiv-metadata-oai-snapshot.json'

In [None]:
def get_metadata():
    with open(path_to_data) as f:
        for line in f:
            yield line  # yield works like return, but makes the function a generator that hands back one line at a time

In [None]:
metadata = get_metadata()

for paper in metadata:
    first_paper = json.loads(paper) # json.loads() returns a dictionary
    break
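
The lazy, one-pass behaviour described above can be sketched on a tiny in-memory file. This is only an illustration: `fake_file` and `iter_lines` are hypothetical stand-ins for the real snapshot file and for `get_metadata()`, and the two records are made up.

```python
import io
import json

# Hypothetical two-record stand-in for the real snapshot (one JSON object per line).
fake_file = io.StringIO(
    '{"id": "0001", "title": "First paper"}\n'
    '{"id": "0002", "title": "Second paper"}\n'
)

def iter_lines(handle):
    """Yield one raw line at a time without loading the whole file."""
    for line in handle:
        yield line

records = iter_lines(fake_file)

first = json.loads(next(records))                    # consumes the first line only
remaining = [json.loads(line) for line in records]   # continues from the second line

print(first["title"])    # First paper
print(len(remaining))    # 1
```

Note that the generator remembers its position: once a line has been consumed, a later loop over the same generator will not see it again.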

In [None]:
for key in first_paper:
    print(key)

id
submitter
authors
title
comments
journal-ref
doi
report-no
categories
license
abstract
versions
update_date
authors_parsed


We are interested only in the `categories`, `authors`, `title` and `abstract` keys of each paper, so let's save this information in a DataFrame:

In [None]:
# empty lists that will be filled with the information of each paper

categories = []
authors = []
title = []
abstract = []

In [None]:
total_items = 0

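# `metadata` has already yielded its first line in the cell above,
# so this loop continues from the second paper in the file
for papers in metadata: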
    paper = json.loads(papers)
    
    categories.append(paper['categories'])
    authors.append(paper['authors'])
    title.append(paper['title'])
    abstract.append(paper['abstract'])
    
    total_items += 1

In [None]:
print(total_items)

1796910


In [None]:
#In this cell we create a dictionary with the information stored before
d = {
    'Categories': categories,
    'Authors': authors,
    'Title': title,
    'Abstract': abstract,
}

In [None]:
df = pd.DataFrame(d)

In [None]:
df.head()

Unnamed: 0,Categories,Authors,Title,Abstract
0,math.CO cs.CG,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-..."
1,physics.gen-ph,Hongjun Pan,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is descri...
2,math.CO,David Callan,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle...
3,math.CA math.FA,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,In this paper we show how to compute the $\L...
4,cond-mat.mes-hall,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,We study the two-particle wave function of p...


To use this data for classification we first have to preprocess it, and for that we use the [Gensim library](https://radimrehurek.com/gensim/corpora/textcorpus.html).
The following code is inspired by this notebook found on [Kaggle](https://www.kaggle.com/anurag3753/prediction-naive-bayes-preprocessing-with-gensim):

In [None]:
def transformText(text):
    
    stops = set(stopwords.words("english"))
    
    # Convert text to lower
    text = text.lower()
    # Removing non ASCII chars    
    text = re.sub(r'[^\x00-\x7f]',r' ',text)
    
    # Strip multiple whitespaces
    text = gensim.corpora.textcorpus.strip_multiple_whitespaces(text)
    
    # Removing all the stopwords
    filtered_words = [word for word in text.split() if word not in stops]
    
    # Removing all the tokens shorter than 3 characters
    filtered_words = gensim.corpora.textcorpus.remove_short(filtered_words, minsize=3)
    
    # Preprocessed text after stop words removal
    text = " ".join(filtered_words)
    
    # Remove the punctuation
    text = gensim.parsing.preprocessing.strip_punctuation2(text)
    
    # Strip all the numerics
    text = gensim.parsing.preprocessing.strip_numeric(text)
    
    # Strip multiple whitespaces
    text = gensim.corpora.textcorpus.strip_multiple_whitespaces(text)
    
    # Stemming
    return gensim.parsing.preprocessing.stem_text(text)
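
To see what each cleaning step does, here is a simplified sketch that uses only the standard library. It is not the Gensim pipeline above: `simple_clean` and the tiny `STOPS` set are hypothetical, and stemming is left out, so words keep their full form.

```python
import re
import string

# Tiny illustrative stop-word list; the tutorial uses NLTK's full English list.
STOPS = {"the", "of", "a", "and", "on", "in"}

def simple_clean(text):
    text = text.lower()
    text = re.sub(r"[^\x00-\x7f]", " ", text)              # replace non-ASCII characters
    words = [w for w in text.split() if w not in STOPS]    # remove stop words
    words = [w for w in words if len(w) >= 3]              # drop tokens shorter than 3 characters
    text = " ".join(words)
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)  # punctuation -> space
    text = re.sub(r"[0-9]+", "", text)                     # strip digits
    return " ".join(text.split())                          # collapse repeated whitespace

cleaned = simple_clean("The evolution of the Earth-Moon system: 2 models")
print(cleaned)   # evolution earth moon system models
```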

In [None]:
df['Title'] = df['Title'].map(transformText)  # this line can take quite a long time to run

In [None]:
df.head()

Unnamed: 0,Categories,Authors,Title,Abstract
0,math.CO cs.CG,Ileana Streinu and Louis Theran,sparsiti certifi graph decomposit,"We describe a new algorithm, the $(k,\ell)$-..."
1,physics.gen-ph,Hongjun Pan,evolut earth moon system base dark matter fiel...,The evolution of Earth-Moon system is descri...
2,math.CO,David Callan,determin stirl cycl number count unlabel acycl...,We show that a determinant of Stirling cycle...
3,math.CA math.FA,Wael Abu-Shammala and Alberto Torchinsky,dyadic lambda alpha lambda alpha,In this paper we show how to compute the $\L...
4,cond-mat.mes-hall,Y. H. Pong and C. K. Law,boson charact atom cooper pair across reson,We study the two-particle wave function of p...


First we have to choose only two categories in order to perform binary classification. The complete arXiv category taxonomy is available at https://arxiv.org/category_taxonomy. For our purpose we choose the two most frequent values of the `Categories` column. Let's find them.

In [None]:
categories = df.Categories

In [None]:
cat_freq_dic = collections.Counter(categories) # collections.Counter gives us a dictionary counting how many
                                               # times each category appears in the dataset

In [None]:
# most_common(2) returns the two (category, count) pairs with the highest counts,
# in descending order of count
(max1key, max1), (max2key, max2) = cat_freq_dic.most_common(2)

print(max1key, max1)
print(max2key, max2)

astro-ph 86914
hep-ph 73549


In [None]:
traindf = df[(df['Categories']==max2key) | (df['Categories']==max1key)]
traindf.head()

Unnamed: 0,Categories,Authors,Title,Abstract
7,astro-ph,"Paul Harvey, Bruno Merin, Tracy L. Huard, Luis...",spitzer cd survei larg nearbi insterstellar cl...,We discuss the results from the combined IRA...
14,hep-ph,"Chao-Hsi Chang, Tong Li, Xue-Qian Li and Yu-Mi...",lifetim doubli charm baryon,"In this work, we evaluate the lifetimes of t..."
15,astro-ph,"Nceba Mhlahlo, David H. Buckley, Vikram S. Dhi...",spectroscop observ intermedi polar hydra quies...,Results from spectroscopic observations of t...
21,astro-ph,"M. A. Loukitcheva, S. K. Solanki and S. White",alma ideal probe solar chromospher,"The very nature of the solar chromosphere, i..."
27,hep-ph,"Zhan Shu, Xiao-Lin Chen and Wei-Zhen Deng",understand flavor symmetri break nucleon flavo...,"In $\XQM$, a quark can emit Goldstone bosons..."


Now that only two categories have been selected, we have to convert the category names into integers so that they can be processed by a classification algorithm.

In [None]:
category_to_id = {    # simple dictionary that maps each category name to an integer label
    max1key: 0,
    max2key: 1
}

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
traindf = traindf.assign(Categories=traindf['Categories'].map(category_to_id))


In [None]:
traindf.head()

Unnamed: 0,Categories,Authors,Title,Abstract
7,0,"Paul Harvey, Bruno Merin, Tracy L. Huard, Luis...",spitzer cd survei larg nearbi insterstellar cl...,We discuss the results from the combined IRA...
14,1,"Chao-Hsi Chang, Tong Li, Xue-Qian Li and Yu-Mi...",lifetim doubli charm baryon,"In this work, we evaluate the lifetimes of t..."
15,0,"Nceba Mhlahlo, David H. Buckley, Vikram S. Dhi...",spectroscop observ intermedi polar hydra quies...,Results from spectroscopic observations of t...
21,0,"M. A. Loukitcheva, S. K. Solanki and S. White",alma ideal probe solar chromospher,"The very nature of the solar chromosphere, i..."
27,1,"Zhan Shu, Xiao-Lin Chen and Wei-Zhen Deng",understand flavor symmetri break nucleon flavo...,"In $\XQM$, a quark can emit Goldstone bosons..."


Once the data has been properly preprocessed, we have to split the dataset into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(traindf['Title'], traindf['Categories'], 
                                                    test_size=0.33, random_state=42)

In [None]:
X_train

338205                     save fourth gener higg radion mix
1395428                                   infrar color dwarf
1414457                            growth hii region reioniz
1431266    spitzer space telescop extra galact first look...
1427474    lyman alpha radiat collaps protogalaxi ii obse...
                                 ...                        
1460445                      flight determin plate scale eit
1443002               orion ob associ ii orion eridanu bubbl
1599370    numer evalu master integr loop gener massiv se...
1618623          effect shadow doubl pomeron exchang process
1585877                         issu flat direct baryogenesi
Name: Title, Length: 107510, dtype: object
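
The length of `X_train` shown above can be checked by hand. The sketch below assumes that the test share is rounded up to a whole number of samples and that the training set receives the rest; the two counts are the category frequencies printed earlier.

```python
import math

n_samples = 86914 + 73549     # astro-ph + hep-ph titles, from the counts printed earlier
test_size = 0.33

n_test = math.ceil(test_size * n_samples)   # assumed rounding rule: test share rounded up
n_train = n_samples - n_test

print(n_samples, n_test, n_train)   # 160463 52953 107510
```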

<a id='section1.1'></a>
# Feature Engineering: feature extraction with count vectorizer and term frequency-inverse document frequency (tf-idf)

Now we have to create the features that will feed our classification model. To do that we use two tools: CountVectorizer and TfidfTransformer.

[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) builds a vocabulary of the words found in all the documents we provide and then represents each of these documents (the titles) as a row of a matrix. Every row is a title and every column is a word.
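
The row-per-title, column-per-word layout can be sketched by hand on three toy titles. This is a simplified illustration written with the standard library only, not scikit-learn's implementation; the titles are made up and tokenization is a plain whitespace split.

```python
from collections import Counter

titles = ["dark matter halo", "dark energy survei", "solar flare"]

# Vocabulary: every distinct word, sorted alphabetically, mapped to a column index.
all_words = sorted({word for title in titles for word in title.split()})
vocabulary = {word: col for col, word in enumerate(all_words)}

# One row per title, one column per vocabulary word; entries are raw counts.
matrix = []
for title in titles:
    counts = Counter(title.split())
    matrix.append([counts.get(word, 0) for word in vocabulary])

print(list(vocabulary))
for row in matrix:
    print(row)
```

Even in this toy case most entries are zero, because each title uses only a few of the vocabulary words.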

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
features_name = vectorizer.get_feature_names()  # on scikit-learn >= 1.2 use get_feature_names_out() instead
features_name[:10]

['aa', 'aaa', 'aal', 'aamq', 'aao', 'aaomega', 'aat', 'aavso', 'ab', 'aband']

Because the vocabulary is large while each title contains only a few words, the matrix produced by CountVectorizer is sparse: most of its values are zero.

In [None]:
len(features_name)

13199

After the data manipulation above, we use the [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) to reweight the raw counts. Tf-idf stands for Term Frequency-Inverse Document Frequency, and it measures the relative importance of a particular word. It is the product of two statistics, and various ways of determining the exact value of each exist. Here the term frequency is the "raw frequency" of a term in a document, i.e. the number of times the term occurs in a document (a title). The inverse document frequency measures whether the term is common or rare across all documents: it is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Tf-idf is the product of these two quantities. Note that scikit-learn's `TfidfTransformer` by default uses a smoothed variant of the idf and normalizes each row to unit length, so its values differ slightly from this textbook definition.
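
The textbook definition given above can be worked through on three toy documents. This sketch follows that definition literally (raw count times the logarithm of the document ratio) and is not what `TfidfTransformer` computes by default; the documents and the helper `tfidf` are made up for illustration.

```python
import math
from collections import Counter

docs = [["dark", "matter"], ["dark", "energy"], ["solar", "flare"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(term, doc):
    tf = doc.count(term)                       # raw count of the term in this document
    idf = math.log(n_docs / doc_freq[term])    # log(total documents / documents containing the term)
    return tf * idf

common = tfidf("dark", docs[0])    # "dark" occurs in 2 of the 3 documents
rare = tfidf("matter", docs[0])    # "matter" occurs in 1 of the 3 documents
print(round(common, 3), round(rare, 3))   # 0.405 1.099
```

The rarer word receives the larger weight, which is the behaviour the idf factor is meant to produce.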

In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

<a id='section2'></a>
# **Classification**

<a id='section2.1'></a>
# Train a classifier

We choose Logistic Regression as our classification model. Instead of implementing the model by hand, we use scikit-learn's [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
from sklearn.linear_model import LogisticRegression
regression = LogisticRegression()
regression.fit(X_train_tfidf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

<a id='section2.2'></a>
# Prediction over test set

Pay attention to the method used to vectorize the test set. Here we call `CountVectorizer.transform()` instead of `CountVectorizer.fit_transform()` in order to keep the vocabulary built on the training set.

In [None]:
X_test_counts = vectorizer.transform(X_test)
features_name = vectorizer.get_feature_names()  # on scikit-learn >= 1.2 use get_feature_names_out() instead
features_name[:10]

['aa', 'aaa', 'aal', 'aamq', 'aao', 'aaomega', 'aat', 'aavso', 'ab', 'aband']

In [None]:
# transform(), not fit_transform(): the idf weights learned on the training set must be reused
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

In [None]:
prediction = regression.predict(X_test_tfidf)
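
The difference between fitting on the training set and only transforming the test set can be sketched without scikit-learn. In this simplified illustration the vocabulary is learned from the training titles alone, and the hypothetical helper `transform` counts only words that are already in it; the titles are made up.

```python
train_titles = ["dark matter halo", "solar flare"]
test_titles = ["dark solar neutrino"]   # "neutrino" never appears in the training titles

# "fit": learn the vocabulary from the training titles only.
vocabulary = sorted({word for title in train_titles for word in title.split()})

def transform(title):
    """Count only words already in the fitted vocabulary; unseen words are dropped."""
    words = title.split()
    return [words.count(term) for term in vocabulary]

test_rows = [transform(title) for title in test_titles]
print(vocabulary)   # ['dark', 'flare', 'halo', 'matter', 'solar']
print(test_rows)    # [[1, 0, 0, 0, 1]]
```

The test row has the same columns as the training matrix, which is what a classifier trained on that matrix expects.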

<a id='section2.3'></a>
# Evaluation

In [None]:
np.mean(prediction == y_test)

0.9773006250826204
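
The value above is the accuracy, i.e. the fraction of test titles whose predicted label equals the true one. As a sketch on made-up labels, the same quantity and a 2x2 table of confusion counts can be computed by hand; `y_true` and `y_pred` here are toy lists, not the tutorial's data.

```python
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion counts, indexed as confusion[true_label][predicted_label].
confusion = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    confusion[t][p] += 1

print(round(accuracy, 3))   # 0.667
print(confusion)            # [[2, 1], [1, 2]]
```

The off-diagonal counts show which class is mistaken for the other, something a single accuracy number hides.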


<a id='section2'></a>
# **Explainability Methods**
Explainability .... 

<a id='section2.1'></a>
## **SHAP**
SHAP ipsum lorem 

<a id='section2.2'></a>
## **LIME**
Lime ipsum lorem 