
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MachineLearningJournalClub/LearningNLP/blob/main/LearningNLP_Tutorial1.ipynb)

# Learning NLP Tutorial Series 
## Tutorial 1 : More Sentiment Analysis 

Topics include: 
* Exploring a dataset (Disaster Tweets, ArXiv) 
* Explainability methods : SHAP, LIME 
* Sentiment Analysis generalization to N classes 

(Authors: Luca Bottero, Simone Azeglio)

---
---

## **Overview**

* [Preprocessing](#section1)
    * [Feature Engineering](#section1.1)
    

* [Explainability Methods](#section2)
    * [SHAP](#section2.1)
    * [LIME](#section2.2)

* [References & Additional Material](#section4)

---
---

<a id='section1'></a>
# **Preprocessing**
In this first part ... 

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The arXiv dataset we will use is stored in JSON (JavaScript Object Notation), a text format for storing and exchanging data.
The file holds one JSON object per line, so it can be read paper by paper.
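
As a small illustration of the format, the sketch below parses a single JSON line with the standard `json` module. The record is a made-up stand-in: the field names match the keys used later in this notebook, but the values are invented and do not come from the dataset.

```python
import json

# Hypothetical one-line record; the values are invented for illustration
line = '{"authors": "A. Author", "title": "A toy title", "categories": "hep-ph", "abstract": "A toy abstract."}'

record = json.loads(line)      # parse the JSON text into a Python dict
print(type(record).__name__)   # the parsed object is a plain dictionary
print(record["title"])         # fields are accessed by key
```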

In [2]:
import pandas as pd
import numpy as np
import json   #importing this module we can work with JSON data
import nltk   #NLP toolkit
from nltk.corpus import stopwords
import re     # library for regular expression operations
import string # for string operations
import collections
import gensim  
from gensim import parsing        # Help in preprocessing the data, very efficiently


The function in the next cell returns a generator, i.e. a kind of iterable that you can iterate over only once.
A generator doesn't store all its values in memory; it produces them one at a time.
So, with the function get_metadata() you can open the file and process it paper by paper.

In [17]:
def get_metadata():
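    # Drive is mounted at /content/drive in the first cell, so the path must start there
    with open('/content/drive/MyDrive/ColabNotebooks/arxiv-metadata-oai-snapshot.json') as f:
        for line in f:
            yield line  # like return, but yield turns the function into a generator that hands back one line at a time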
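
The single-pass behaviour of a generator can be checked on a toy example. In this sketch `read_lines` and the in-memory `fake_file` are hypothetical stand-ins for `get_metadata()` and the real file, used only so the snippet runs anywhere.

```python
import io

def read_lines(handle):
    # Yield one line at a time instead of building a full list in memory
    for line in handle:
        yield line.rstrip("\n")

# io.StringIO stands in for the real file (hypothetical two-line content)
fake_file = io.StringIO('{"id": 1}\n{"id": 2}\n')

gen = read_lines(fake_file)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left: a generator can be iterated only once

print(first_pass)
print(second_pass)
```

To read the data again, call the generator function again to get a fresh generator.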

In [18]:
metadata = get_metadata()

for paper in metadata:
    first_paper = json.loads(paper) #json.loads() returns a dictionary
    break

FileNotFoundError: ignored

In [11]:
for key in first_paper:
    print(key)

NameError: ignored

We're interested only in the keys categories, authors, title and abstract of each paper, so let's save this information in a DataFrame:

In [None]:
#empty lists that will be filled with the information from each paper

categories = []
authors = []
title = []
abstract = []

In [None]:
total_items = 0

for papers in get_metadata():  #fresh generator, so the paper already read above is not skipped
    paper = json.loads(papers)
    
    categories.append(paper['categories'])
    authors.append(paper['authors'])
    title.append(paper['title'])
    abstract.append(paper['abstract'])
    
    total_items += 1

In [None]:
print(total_items)

In [None]:
#In this cell we create a dictionary with the information stored before
d = {
    'Categories': categories,
    'Authors': authors,
    'Title': title,
    'Abstract': abstract,
}

In [None]:
df = pd.DataFrame(d)

In [None]:
df.head()

In order to use this data for classification we have to preprocess it, so we exploit the Gensim library (reference at the following link https://radimrehurek.com/gensim/corpora/textcorpus.html).
The following code is inspired by this notebook found on Kaggle: https://www.kaggle.com/anurag3753/prediction-naive-bayes-preprocessing-with-gensim

In [None]:
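# Build the stop-word set once rather than on every call
# (assumes nltk.download('stopwords') has already been run)
stops = set(stopwords.words("english"))

def transformText(text):
    
    # Convert text to lowercase
    text = text.lower()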
    # Removing non ASCII chars    
    text = re.sub(r'[^\x00-\x7f]',r' ',text)
    
    # Strip multiple whitespaces
    text = gensim.corpora.textcorpus.strip_multiple_whitespaces(text)
    
    # Removing all the stopwords
    filtered_words = [word for word in text.split() if word not in stops]
    
    # Removing all tokens shorter than 3 characters (plain Python, independent of the gensim version)
    filtered_words = [word for word in filtered_words if len(word) >= 3]
    
    # Preprocessed text after stop words removal
    text = " ".join(filtered_words)
    
    # Remove the punctuation
    text = gensim.parsing.preprocessing.strip_punctuation2(text)
    
    # Strip all the numerics
    text = gensim.parsing.preprocessing.strip_numeric(text)
    
    # Strip multiple whitespaces
    text = gensim.corpora.textcorpus.strip_multiple_whitespaces(text)
    
    # Stemming
    return gensim.parsing.preprocessing.stem_text(text)
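
To see what this kind of cleaning does to a single title, here is a simplified sketch that uses only the standard library. It is not the function above: `simple_clean` and `TOY_STOPS` are hypothetical names, the stop-word list is a tiny hand-picked one, the steps run in a slightly different order, and stemming is left out.

```python
import string

# Tiny hand-picked stop-word list, an assumption for this sketch only
TOY_STOPS = {"the", "of", "a", "in", "and", "for", "on"}

# Translation table that turns every punctuation character and digit into a space
_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation + string.digits})

def simple_clean(text):
    text = text.lower()
    # Replace non-ASCII characters with spaces
    text = "".join(ch if ord(ch) < 128 else " " for ch in text)
    # Strip punctuation and digits
    text = text.translate(_TO_SPACE)
    # Drop stop words and tokens shorter than 3 characters; split() also collapses whitespace
    tokens = [t for t in text.split() if t not in TOY_STOPS and len(t) >= 3]
    return " ".join(tokens)

cleaned = simple_clean("The Study of 2 Quantum-Dots in a Field!")
print(cleaned)
```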

In [None]:
df['Title'] = df['Title'].map(transformText)

In [None]:
df.head()

First we have to choose only two categories in order to perform our binary classification. At the page https://arxiv.org/category_taxonomy you can find the complete arXiv category taxonomy. For our purpose we choose the two most frequent categories. Let's find them.

In [None]:
categories = df.Categories

In [None]:
cat_freq_dic = collections.Counter(categories) #collections.Counter gives us a dictionary with a count of how many 
                                               #times a category appears in the dataset

In [None]:
# most_common(2) returns the two most frequent categories with their counts,
# sorted from the most to the least frequent
(max1key, max1), (max2key, max2) = cat_freq_dic.most_common(2)

print(max1key, max1)
print(max2key, max2)

In [None]:
traindf = df[(df['Categories']==max2key) | (df['Categories']==max1key)].copy()  #copy, so later column assignments don't act on a view of df
traindf.head()

Now that only two categories have been selected, we have to convert the category names into digits so that they can be processed by a classification algorithm.

In [None]:
category_to_id = {    #a simple dictionary that maps each category to a digit
    max1key: 0,
    max2key: 1
}

def get_category_id(category):
    return category_to_id[category]

traindf['Categories'] = traindf['Categories'].map(get_category_id)

In [None]:
traindf.head()

Once we have properly preprocessed our data, we have to split the dataset into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(traindf['Title'], traindf['Categories'], 
                                                    test_size=0.33, random_state=42)

In [None]:
X_train

# **Feature extraction with count vectorizer and term frequency-inverse document frequency (tfidf)**

Now we have to create the features that will feed our classification model. In order to do that we exploit two methods: CountVectorizer and TfidfTransformer. 

CountVectorizer (reference at the following link https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) builds a vocabulary of the words found in all the documents we provide to it, and then represents each of these documents (the titles) in matrix form. Every row is a title and every column is a word.
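
The idea of a count matrix can be sketched in a few lines of plain Python. This is only an illustration on three invented titles; scikit-learn's tokenization and storage differ in detail, so the sketch is not a reimplementation of CountVectorizer.

```python
from collections import Counter

# Three invented titles, already lowercased
titles = ["quantum field theory", "field theory of gravity", "quantum gravity"]

# Vocabulary: every distinct word, sorted so that the column order is reproducible
vocabulary = sorted({word for title in titles for word in title.split()})

# One row per title, one column per vocabulary word
count_matrix = []
for title in titles:
    counts = Counter(title.split())
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Even in this tiny case several entries are zero, because each title contains only a few of the vocabulary words.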

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
features_name = vectorizer.get_feature_names_out()  #replaces get_feature_names(), which was removed in scikit-learn 1.2
features_name[:10]

Because of the high number of words in the vocabulary, the matrix that results from applying CountVectorizer to our data is a sparse matrix, with most of its values equal to zero. 

In [None]:
len(features_name)

In [None]:
print(X_train_counts[:5].toarray())  #densify only the first rows; the full dense matrix may not fit in memory

After the data manipulation above, we use the TfidfTransformer (reference at the following link https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer) to reweight the raw counts of each word in our dataset. Tf-idf is the acronym for Term Frequency-Inverse Document Frequency. With this approach we evaluate the relative importance of a particular word. Tf-idf is the product of two statistics, term frequency and inverse document frequency, and various ways of determining the exact values of both exist.

Here the term frequency is the "raw frequency" of a term in a document, i.e. the number of times the term occurs in the document (a title). The inverse document frequency is a measure of whether the term is common or rare across all documents. In its basic form it is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. The tf-idf value is the product of these two quantities.
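
The basic definition above can be worked through by hand on a toy corpus. The sketch below implements exactly that textbook formula, with invented documents and hypothetical helper names. Its numbers will not match TfidfTransformer, whose defaults use a smoothed variant, idf = ln((1 + n) / (1 + df)) + 1, and then normalize each row to unit length.

```python
import math

# Three invented, already tokenized documents
docs = [["quantum", "field", "theory"],
        ["quantum", "gravity"],
        ["quantum", "quantum", "chaos"]]

n_docs = len(docs)

def tf(term, doc):
    # Raw count of the term in one document
    return doc.count(term)

def idf(term):
    # Number of documents that contain the term at least once
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

common = tfidf("quantum", docs[2])   # appears in every document, so idf = log(1) = 0
rare = tfidf("chaos", docs[2])       # appears in one document only, so idf = log(3)
print(common, rare)
```

A word that occurs in every document gets weight zero however often it is repeated, while a word that is specific to one document is weighted up.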

In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# **Training a classifier**

We choose Logistic Regression as our classification model. Instead of implementing this model by hand, we exploit the sklearn class LogisticRegression (reference at the following link https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
from sklearn.linear_model import LogisticRegression
regression = LogisticRegression()
regression.fit(X_train_tfidf, y_train)
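
For intuition about what the fitted model computes, a binary logistic regression turns a weighted sum of the features into a probability through the sigmoid function. The sketch below shows only this prediction step; the weights are picked by hand for illustration and are not learned from the data, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    # Squash any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # Probability of class 1 for one sample: sigmoid of the weighted sum
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for a 3-feature model, chosen by hand
weights, bias = [2.0, -1.0, 0.5], -0.25
sample = [0.5, 0.0, 0.5]

p = predict_proba(weights, bias, sample)
label = 1 if p >= 0.5 else 0   # threshold the probability at 0.5
print(round(p, 3), label)
```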

# **Prediction with test set**

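Pay attention to the methods used to vectorize the test set. Here we use CountVectorizer.transform() instead of CountVectorizer.fit_transform() in order to keep the vocabulary built on the training set. The same holds for the tf-idf step: the transformer must only transform the test counts, not be fitted on them again.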
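
The difference between fitting and transforming can be illustrated with a toy vectorizer. `ToyCountVectorizer` is a hypothetical, heavily simplified stand-in written for this sketch; it is not scikit-learn's implementation.

```python
class ToyCountVectorizer:
    """Hypothetical, simplified stand-in; not scikit-learn's implementation."""

    def fit(self, docs):
        # Learn the vocabulary (word -> column index) from the given documents
        words = sorted({w for d in docs for w in d.split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count words using the vocabulary learned in fit()
        rows = []
        for d in docs:
            row = [0] * len(self.vocabulary_)
            for w in d.split():
                if w in self.vocabulary_:   # words unseen during fit are ignored
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

train = ["quantum field theory", "quantum gravity"]
test = ["string theory of gravity"]

vec = ToyCountVectorizer().fit(train)
train_rows = vec.transform(train)
test_rows = vec.transform(test)   # same columns as the training matrix

print(sorted(vec.vocabulary_))
print(test_rows)
```

Fitting again on the test set would build a different vocabulary, so the columns would no longer line up with the features the classifier was trained on.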

In [None]:
X_test_counts = vectorizer.transform(X_test)
features_name = vectorizer.get_feature_names_out()  #replaces get_feature_names(), which was removed in scikit-learn 1.2
features_name[:10]

In [None]:
X_test_tfidf = tfidf_transformer.transform(X_test_counts)  #transform only: reuse the idf weights learned on the training set

In [None]:
prediction = regression.predict(X_test_tfidf)

# **Evaluation**

In [None]:
np.mean(prediction == y_test)
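
The cell above computes the accuracy, i.e. the fraction of test titles that were classified correctly. As a possible complement, the sketch below counts the four outcomes of a binary classifier on invented labels and derives accuracy, precision and recall from them. The label lists and the helper name are assumptions made for illustration, not results from this notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    # Count true positives, false positives, false negatives and true negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

# Invented labels and predictions
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of correct predictions
precision = tp / (tp + fp)           # how many predicted positives are right
recall = tp / (tp + fn)              # how many actual positives are found
print(accuracy, precision, recall)
```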


<a id='section2'></a>
# **Explainability Methods**
Explainability .... 

<a id='section2.1'></a>
## **SHAP**
SHAP ipsum lorem 

<a id='section2.2'></a>
## **LIME**
Lime ipsum lorem 