


<font size='10' color = 'E3A440'>**Megadata and Advanced Techniques Demystified**</font>
=======
<font color = 'E3A440'>*New Analysis Methods and their Implications for Megadata Management in SSH (part 1)*</font>
=============


This workshop is part of the training [Megadata and Advanced Techniques Demystified](https://www.4point0.ca/en/2022/08/22/formation-megadonnees-demystifiees//) (session 6).

Humanities and social sciences are often confronted with the analysis of unstructured data, such as text. After preparing the data, several analysis techniques from machine learning can be used. During this workshop, participants will be introduced to the preprocessing of textual data and to supervised and unsupervised methods for analysis purposes with Python.

Note: This workshop continues with a 2nd session on **November 10, 2022**. The two sessions cannot be considered exhaustive of the field.

Structure of the workshop:
1. Part 1: Presentation of [Section 1](#Section_1) in plenary mode (40 minutes)
2. Part 2: Individual work on [Section 2](#Section_2) (10 minutes)
3. Part 3: Team work on [Section 2](#Section_2) (30 minutes)
4. Part 4: Conclusion in plenary mode (10 minutes)

### Authors:
- Bruno Agard <bruno.agard@polymtl.ca>
- Davide Pulizzotto <davide.pulizzotto@polymtl.ca>

Département de Mathématiques et de génie industriel

École Polytechnique de Montréal

# <font color = 'E3A440'>0. Environment preparation </font>

In [None]:
# Downloading of data from the GitHub project
!rm -rf Data_techniques_demystified_webinars/
!git clone https://github.com/4point0-ChairInnovation-Polymtl/Data_techniques_demystified_webinars

Cloning into 'Data_techniques_demystified_webinars'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 30 (delta 8), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (30/30), done.


In [None]:
# Import modules
import os
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

###  <font color = 'E3A440'>**0.1 The basics of Python to carry out this workshop**</font>

We present two basic concepts that will help people who have never written a line of code to orient themselves during this workshop.

<font color = 'E3A440'>Variables</font>

[Variables](https://www.w3schools.com/python/python_variables.asp) are objects that contain information in very different formats. To create a variable, we use the following syntax: `variable_name = content_of_the_variable`. Some examples follow:

In [None]:
# Variable with characters
variable_avec_caracteres = 'contenu de la variable avec caractères'

In [None]:
print(variable_avec_caracteres)

contenu de la variable avec caractères


In [None]:
# Variable with numbers
variable_avec_nombre = 5
print(variable_avec_nombre)

5


In [None]:
# Variable with list of numbers
variable_avec_liste_de_nombres = [5, 6, 7, 8, 9]
print(variable_avec_liste_de_nombres)

[5, 6, 7, 8, 9]


In [None]:
# Variable with tabular data
variable_avec_donnes_tabulaires = pd.DataFrame([[1,2,3],[4,5,6]], columns = ['Col_1','Col_2','Col_3'], index = ['Row_1','Row_2'])

In [None]:
print(variable_avec_donnes_tabulaires)

       Col_1  Col_2  Col_3
Row_1      1      2      3
Row_2      4      5      6


<font color = 'E3A440'>Functions</font>

[Functions](https://www.w3schools.com/python/python_functions.asp) allow you to perform calculations based on one or more variables, data, etc. Functions generally have the following syntax: `function_result = function(argument_1 = argument_1_content, argument_2 = argument_2_content, etc.)`. Here are some examples:

In [None]:
resultat_somme = sum(variable_avec_liste_de_nombres)
print(resultat_somme)

35


In [None]:
pd.DataFrame( data = [[1,2,3],[4,5,6]], columns = ['Col_1','Col_2','Col_3'], index = ['Row_1','Row_2'] )

       Col_1  Col_2  Col_3
Row_1      1      2      3
Row_2      4      5      6


<a name='Section_1'></a>
# <font color = 'E3A440'>1. *Preparation of textual data*</font>

Text data analysis involves the transformation of text into a mathematical object that can be used by algorithms and statistical models. This step is important because it makes it possible to **structure** unstructured data, such as text.


###  <font color = 'E3A440'>**1.1 Basic steps of data preparation**</font>

Let's take the following sentence to illustrate the steps that will allow us to transform it into structured information.

In [None]:
sentence = """At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good."""

For now, `sentence` is just a string. You can count the number of characters that make up this variable.

In [None]:
len(sentence)

78

Knowing the number of characters in a text may not be enough to analyze its content :-).

The following blocks of code introduce different analysis tools.

#### <font color = 'E3A440'>*a. Tokenisation*</font>

First of all, it can be useful to cut the initial string of characters into elementary linguistic units endowed with meaning, generally called "words".

The `nltk` module provides a function, `word_tokenize()`, that performs this operation.

In [None]:
# The function word_tokenize() takes a sentence as its main argument.
words = nltk.word_tokenize(sentence)
print(words)
len(words)

['At', 'eight', "o'clock", ',', 'on', 'Thursday', 'morning', ',', 'the', 'great', 'Arthur', 'did', "n't", 'feel', 'VERY', 'good', '.']


17
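
To see why a dedicated tokenizer is useful, the sketch below compares naive whitespace splitting with a small regular expression. It uses only the standard library. The pattern is an illustrative assumption, not the algorithm behind `nltk.word_tokenize()`, which additionally splits contractions such as `didn't` into `did` and `n't`.

```python
import re

sentence = """At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good."""

# Naive approach: split on whitespace; punctuation stays glued to the words.
naive_tokens = sentence.split()

# Illustrative pattern (an assumption, not NLTK's rules):
# a word with an optional apostrophe part, or a single punctuation mark.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

print(naive_tokens)
print(regex_tokens)
```

With whitespace splitting, `o'clock,` keeps its trailing comma, so the same word could appear under several graphic forms depending on the punctuation that follows it.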

#### <font color = 'E3A440'>*b. Morphosyntactic analysis*</font>

After having identified all the words, it is possible to analyze their morphosyntactic role, for analysis and/or filtering purposes.

In [None]:
# The function pos_tag() takes a list of words as its main argument.
words_pos = nltk.pos_tag(words, tagset='universal')
print(words_pos)
len(words_pos)

[('At', 'ADP'), ('eight', 'NUM'), ("o'clock", 'NOUN'), (',', '.'), ('on', 'ADP'), ('Thursday', 'NOUN'), ('morning', 'NOUN'), (',', '.'), ('the', 'DET'), ('great', 'ADJ'), ('Arthur', 'NOUN'), ('did', 'VERB'), ("n't", 'ADV'), ('feel', 'VERB'), ('VERY', 'ADV'), ('good', 'ADJ'), ('.', '.')]


17

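Here is a list of common POS tags, following the Universal Dependencies scheme. Note that the `universal` tagset used by NLTK above is a smaller, 12-tag variant: for example, it marks punctuation as `.` (as in the output above) and particles as `PRT`.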

| **POS** | **DESCRIPTION**           | **EXAMPLES**                                      |
| ------- | ------------------------- | ------------------------------------------------- |
| ADJ     | adjective                 | big, old, green, incomprehensible, first      |
| ADP     | adposition                | in, to, during                                |
| ADV     | adverb                    | very, tomorrow, down, where, there            |
| AUX     | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ    | conjunction               | and, or, but                                  |
| CCONJ   | coordinating conjunction  | and, or, but                                  |
| DET     | determiner                | a, an, the                                    |
| INTJ    | interjection              | psst, ouch, bravo, hello                      |
| NOUN    | noun                      | girl, cat, tree, air, beauty                  |
| NUM     | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART    | particle                  | ’s, not                                      |
| PRON    | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN   | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT   | punctuation               | ., (, ), ?                                    |
| SCONJ   | subordinating conjunction | if, while, that                               |
| SYM     | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :)               |
| VERB    | verb                      | run, runs, running, eat, ate, eating          |
| X       | other                     | sfpksdpsxmsa                                  |
| SPACE   | space                     |                                                   |


#### <font color = 'E3A440'>*c. Removing the punctuation*</font>

Another operation consists in removing the punctuation. This type of filtering removes graphic signs that contribute little to the semantics of the sentence.
In some contexts, such as stylometry, this step can be performed with more sophisticated techniques.

In [None]:
# The following line of code loops over each "word" and keeps those that contain only alphanumeric characters.
words_pos1 = [(w, pos) for w, pos in words_pos if w.isalnum()]
print(words_pos1)
len(words_pos1)

[('At', 'ADP'), ('eight', 'NUM'), ('on', 'ADP'), ('Thursday', 'NOUN'), ('morning', 'NOUN'), ('the', 'DET'), ('great', 'ADJ'), ('Arthur', 'NOUN'), ('did', 'VERB'), ('feel', 'VERB'), ('VERY', 'ADV'), ('good', 'ADJ')]


12

In [None]:
# It is also possible to use the result of the morphosyntactic analysis to remove punctuation.
words_pos2 = [(w, pos) for w, pos in words_pos if pos != '.']
print(words_pos2)
len(words_pos2)

[('At', 'ADP'), ('eight', 'NUM'), ("o'clock", 'NOUN'), ('on', 'ADP'), ('Thursday', 'NOUN'), ('morning', 'NOUN'), ('the', 'DET'), ('great', 'ADJ'), ('Arthur', 'NOUN'), ('did', 'VERB'), ("n't", 'ADV'), ('feel', 'VERB'), ('VERY', 'ADV'), ('good', 'ADJ')]


14

Notice the difference: the "words" `o'clock` and `n't` are absent from the first list, but present in the second.
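
The difference comes from how `str.isalnum()` treats apostrophes. A quick check, using only built-in string methods (the token list is a small hand-picked sample):

```python
# str.isalnum() is True only if every character is a letter or a digit.
tokens = ["o'clock", "n't", "Thursday", ",", "2022"]

for token in tokens:
    print(repr(token), token.isalnum())

# Tokens containing an apostrophe fail the test, so the first filter drops them
# together with the punctuation marks.
dropped = [t for t in tokens if not t.isalnum()]
print(dropped)
```

Filtering on the POS tag instead removes only the tokens tagged as punctuation, which is why `o'clock` and `n't` survive in the second list.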

In [None]:
words_pos = words_pos2

#### <font color = 'E3A440'>*d. Convert each character to lowercase*</font>

This step constitutes a first operation of standardization of the words and their reduction to a single graphic form. This kind of step makes it possible to group each occurrence of a word in a single form.

In [None]:
# The following line of code iterates over each token and converts it to lowercase.
words_pos = [(w.lower(), pos) for w, pos in words_pos]
print(words_pos)

[('at', 'ADP'), ('eight', 'NUM'), ("o'clock", 'NOUN'), ('on', 'ADP'), ('thursday', 'NOUN'), ('morning', 'NOUN'), ('the', 'DET'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('did', 'VERB'), ("n't", 'ADV'), ('feel', 'VERB'), ('very', 'ADV'), ('good', 'ADJ')]


#### <font color = 'E3A440'>*e. Removing stopwords*</font>

Another filtering operation consists in eliminating function words, called **stopwords**. This word list contains sentence connectors, such as "and", "but", "however", and words with low semantic value, such as modal verbs.
As with other filtering operations, the challenge is to clean up the vocabulary as much as possible and to reduce all occurrences of a word to a single graphical form.

In [None]:
# Here, we import a stopword list for English
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
# The following line of code loops over each word of the sentence and keeps those that are not stopwords.
words_pos = [(w, pos) for w, pos in words_pos if w not in stopwords.words("english")]
print(words_pos)
len(words_pos)

[('eight', 'NUM'), ("o'clock", 'NOUN'), ('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ("n't", 'ADV'), ('feel', 'VERB'), ('good', 'ADJ')]


9
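
In the cell above, `stopwords.words("english")` is evaluated again for every word. On a larger corpus it may help to build a set once and reuse it. The sketch below illustrates the pattern with a tiny hand-written stopword set, an assumption made to keep the example self-contained; in the notebook the set would come from `set(stopwords.words("english"))`.

```python
# Tiny stand-in stopword set (hypothetical; NLTK's English list is much longer).
stopword_set = {"at", "on", "the", "did", "very"}

words_pos_example = [("at", "ADP"), ("eight", "NUM"), ("the", "DET"),
                     ("great", "ADJ"), ("arthur", "NOUN"), ("did", "VERB"),
                     ("very", "ADV"), ("good", "ADJ")]

# Membership tests on a set take constant time on average,
# whereas a list is scanned element by element.
filtered = [(w, pos) for w, pos in words_pos_example if w not in stopword_set]
print(filtered)
```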

#### <font color = 'E3A440'>*f. Stemming or lemmatization*</font> 

With the same objective, we remove the morphological suffixes from words, which further reduces the occurrences of a word to a single graphic form.

There are two fundamental methods: stemming and lemmatization.
Stemming reduces each occurrence to a stem, which is inferred by means of various techniques; lemmatization reduces each occurrence to its lemma (its dictionary form).
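
To make the contrast concrete, here is a deliberately simplified sketch. The suffix rule is far cruder than the Porter or Lancaster algorithms, and the small lemma dictionary is hypothetical; both are assumptions used only to illustrate the two ideas.

```python
# Toy stemmer: strip a few common suffixes by rule, without any dictionary.
def toy_stem(word, suffixes=("ing", "ed", "s")):
    for suffix in suffixes:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: look the word up in a small hand-made dictionary (hypothetical).
toy_lemmas = {"felt": "feel", "better": "good", "mornings": "morning"}

def toy_lemmatize(word):
    return toy_lemmas.get(word, word)

for w in ["morning", "felt", "mornings"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based stem of `morning` is `morn`, which is not a dictionary word, while the irregular form `felt` can only be mapped to `feel` through a lookup.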

In [None]:
# Stemming: Porter algorithm
from nltk.stem.porter import PorterStemmer
stemmed_pos = [(PorterStemmer().stem(w), pos) for w, pos in words_pos]
print(stemmed_pos)

[('eight', 'NUM'), ("o'clock", 'NOUN'), ('thursday', 'NOUN'), ('morn', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ("n't", 'ADV'), ('feel', 'VERB'), ('good', 'ADJ')]


In [None]:
# Stemming: Lancaster algorithm
from nltk.stem import LancasterStemmer
stemmed_pos = [(LancasterStemmer().stem(w), pos) for w, pos in words_pos]
print(stemmed_pos)

[('eight', 'NUM'), ("o'clock", 'NOUN'), ('thursday', 'NOUN'), ('morn', 'NOUN'), ('gre', 'ADJ'), ('arth', 'NOUN'), ("n't", 'ADV'), ('feel', 'VERB'), ('good', 'ADJ')]


In [None]:
# Lemmatization: using the WordNet thesaurus
from nltk.stem.wordnet import WordNetLemmatizer
lemmed_pos = [(WordNetLemmatizer().lemmatize(w), pos) for w, pos in words_pos]
print(lemmed_pos)

[('eight', 'NUM'), ("o'clock", 'NOUN'), ('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ("n't", 'ADV'), ('feel', 'VERB'), ('good', 'ADJ')]


#### <font color = 'E3A440'>*g. Filtering by morphosyntactic role*</font>

The filtering of lexical units can be extended to the elimination of units that do not belong to a list of predefined morphosyntactic roles. For example, we can remove all words that are not *nouns* or *adjectives*.

In [None]:
# Keep nouns and adjectives only.
lemmed_pos = [(w, pos) for w, pos in words_pos if pos in ['NOUN','ADJ']]
print(lemmed_pos)

[("o'clock", 'NOUN'), ('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('good', 'ADJ')]


## <font color = 'E3A440'>**1.2 Processing of a corpus**</font>

The pre-processing of a corpus of texts may require the implementation of several steps. The first and most important is the division of the corpus.

### <font color = 'E3A440'>*1. Text splitting*</font>

Depending on the purpose of the analysis, the text can be split into fragments of different sizes: a document, a paragraph, a concordance, a group of sentences, a single sentence, etc.



In [None]:
text = """At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good.
          The following morning, at nine, Arthur felt better.
          A dog run in the street."""
len(text)

175

In the next block of code, we do a sentence split.

In [None]:
sentences = nltk.sent_tokenize(text)
print(sentences)
len(sentences)

["At eight o'clock, on Thursday morning, the great Arthur didn't feel VERY good.", 'The following morning, at nine, Arthur felt better.', 'A dog run in the street.']


3

### <font color = 'E3A440'>*2. Annotation and cleaning*</font>

The previous morphosyntactic annotation and filtering operations will be applied to each fragment of the corpus that has been created.

#### <font color = 'E3A440'>*a. Creating a function*</font>

In the following code, a function is created to encompass all the operations needed for annotation and cleaning.

In [None]:
# To run this function properly, the required modules (e.g. nltk) must be imported first
def CleaningText(text_as_string, language = 'english', reduce = '', list_pos_to_keep = [], Stopwords_to_add = []):
    from nltk.corpus import stopwords

    words = nltk.word_tokenize(text_as_string)
    words_pos = nltk.pos_tag(words, tagset='universal')
    words_pos = [(w, pos) for w, pos in words_pos if w.isalnum()]
    words_pos = [(w.lower(), pos) for w, pos in words_pos]
    
    if reduce == 'stem': 
        from nltk.stem.porter import PorterStemmer
        reduced_words_pos = [(PorterStemmer().stem(w), pos) for w, pos in words_pos]
        
    elif reduce == 'lemma':
        from nltk.stem.wordnet import WordNetLemmatizer
        reduced_words_pos = [(WordNetLemmatizer().lemmatize(w), pos) for w, pos in words_pos]
    else:
        import warnings
        reduced_words_pos = words_pos
        warnings.warn("No reduction was applied to the words. Use the \"reduce\" argument to choose between 'stem' and 'lemma'.")
    if list_pos_to_keep:
        reduced_words_pos = [(w, pos) for w, pos in reduced_words_pos if pos in list_pos_to_keep]
    else:
        import warnings
        warnings.warn("No POS filtering was applied. Use \"list_pos_to_keep\" to provide a list of POS tags to keep.")
    
    list_stopwords = stopwords.words(language) + Stopwords_to_add
    reduced_words_pos = [(w, pos) for w, pos in reduced_words_pos if w not in list_stopwords and len(w) > 1 ]
    return reduced_words_pos



#### <font color = 'E3A440'>*b. Application of cleaning*</font>

Now we can apply this function to each text fragment.

In [None]:
cleaned_sentences = [CleaningText(sent, reduce = 'lemma', list_pos_to_keep = ['NOUN','ADJ','VERB']) for sent in sentences]
print(cleaned_sentences)

[[('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'), ('arthur', 'NOUN'), ('feel', 'VERB'), ('good', 'ADJ')], [('following', 'ADJ'), ('morning', 'NOUN'), ('arthur', 'NOUN'), ('felt', 'VERB')], [('dog', 'NOUN'), ('run', 'NOUN'), ('street', 'NOUN')]]


#### <font color = 'E3A440'>*c. Frequency of words*</font>

What is the frequency of the words in our corpus? To answer, we create a list of words by removing the morphosyntactic annotation.

In [None]:
freqs_in_text = nltk.FreqDist([w for sent in cleaned_sentences for w, pos in sent ])
freqs_in_text

FreqDist({'morning': 2, 'arthur': 2, 'thursday': 1, 'great': 1, 'feel': 1, 'good': 1, 'following': 1, 'felt': 1, 'dog': 1, 'run': 1, ...})
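
The same counts can be obtained with `collections.Counter` from the standard library, which was imported at the beginning of the notebook. In this sketch the cleaned sentences are copied from the printed output above so that the block runs on its own.

```python
from collections import Counter

# Cleaned sentences as produced above (copied from the printed output).
cleaned_sentences = [
    [('thursday', 'NOUN'), ('morning', 'NOUN'), ('great', 'ADJ'),
     ('arthur', 'NOUN'), ('feel', 'VERB'), ('good', 'ADJ')],
    [('following', 'ADJ'), ('morning', 'NOUN'), ('arthur', 'NOUN'), ('felt', 'VERB')],
    [('dog', 'NOUN'), ('run', 'NOUN'), ('street', 'NOUN')],
]

# Flatten the corpus and drop the POS annotation.
all_words = [w for sent in cleaned_sentences for w, pos in sent]
word_counts = Counter(all_words)

# The two most frequent words with their counts.
print(word_counts.most_common(2))
```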

### <font color = 'E3A440'>*3. Vectorization*</font>

Typically, to use text in a data analysis or machine learning context, text must be transformed into an appropriate mathematical object.
The simplest and most widespread model is the "bag-of-words" model, in which each text (or text fragment) is represented as a vector of the lexical units that characterize it. This model belongs to the family of vector semantics models and has the following form:


$$X = \begin{bmatrix} 
x_{1,1} & x_{1,2} & \ldots & x_{1,w} \\
\vdots & \vdots & \ddots & \vdots \\ 
x_{n,1} & x_{n,2} & \ldots & x_{n,w} \\
\end{bmatrix}
$$ 

In this matrix, the value $x_{i,j}$ represents the "weight" of word $j$ in text fragment $i$. This weight can be computed in several ways:

- $x_{i,j}$ can represent the presence of word $j$ in text fragment $i$,
- $x_{i,j}$ can measure the number of occurrences of word $j$ in text fragment $i$,
- $x_{i,j}$ can represent the **value** of word $j$ in text fragment $i$, using a metric such as tf-idf:
 $$\text{tf-idf}_{i,j}=\text{tf}_{i,j} \cdot \log\left(\frac{n}{n_j}\right)$$
 - $\text{tf}_{i,j}$ is the frequency of word $j$ in text fragment $i$,
 - $n$ is the total number of text fragments,
 - $n_j$ is the number of text fragments containing word $j$.
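
As a worked example of the tf-idf formula, the sketch below applies it by hand to three cleaned fragments. It assumes raw counts for the term frequency and the natural logarithm; the helper name `tf_idf` is hypothetical, and the function is meant only for words that occur somewhere in the corpus.

```python
import math

docs = [
    ['thursday', 'morning', 'great', 'arthur', 'feel', 'good'],
    ['following', 'morning', 'arthur', 'felt'],
    ['dog', 'run', 'street'],
]

n = len(docs)  # total number of text fragments

def tf_idf(word, doc):
    tf = doc.count(word)                          # raw count in this fragment
    n_containing = sum(word in d for d in docs)   # fragments containing the word
    return tf * math.log(n / n_containing)

print(round(tf_idf('arthur', docs[0]), 4))  # 'arthur' appears in 2 of 3 fragments
print(round(tf_idf('dog', docs[2]), 4))     # 'dog' appears in 1 of 3 fragments
```

A word found in fewer fragments receives a higher weight, and a word found in every fragment would receive a weight of zero.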


In [None]:
# Object initialization
from nltk.corpus import stopwords

def identity_tokenizer(text):
    return text

# Transform the words into frequencies
vectorized = CountVectorizer(lowercase = False, # Do not lowercase: the input is a list of already lowercased tokens, not a string
                             min_df = 1, # Ignore terms whose document frequency is strictly lower than this threshold
                             max_df = 10, # Ignore terms whose document frequency is strictly higher than this threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the words in this list
                             ngram_range = (1, 1), # Lower and upper bounds of n for the n-grams to extract; (1, 1) keeps single words only
                             tokenizer=identity_tokenizer) # Skip string tokenization: each document is already a list of tokens

The vectorizer is applied to a list of word lists (not to a list of word-POS tuples).

In [None]:
# Create a list of list of words:
[[w for w, pos in sent] for sent in cleaned_sentences]

[['thursday', 'morning', 'great', 'arthur', 'feel', 'good'],
 ['following', 'morning', 'arthur', 'felt'],
 ['dog', 'run', 'street']]

In [None]:
# Application of the vectorizer
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_sentences])
print(pd.DataFrame(freq_term_DTM.todense(), columns =  [k for k, v in sorted(vectorized.vocabulary_.items(), key=lambda item: item[1])] ))

   arthur  dog  feel  felt  following  good  great  morning  run  street  \
0       1    0     1     0          0     1      1        1    0       0   
1       1    0     0     1          1     0      0        1    0       0   
2       0    1     0     0          0     0      0        0    1       1   

   thursday  
0         1  
1         0  
2         0  
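
To show what the vectorizer computes, the following sketch rebuilds the same document-term matrix with plain Python. It is an illustration under simple assumptions (alphabetically sorted vocabulary, raw counts), not the scikit-learn implementation.

```python
docs = [
    ['thursday', 'morning', 'great', 'arthur', 'feel', 'good'],
    ['following', 'morning', 'arthur', 'felt'],
    ['dog', 'run', 'street'],
]

# Vocabulary: sorted unique words, each mapped to a column index.
vocabulary = {word: idx for idx, word in enumerate(sorted({w for d in docs for w in d}))}

# One row per fragment, one column per vocabulary word.
matrix = [[0] * len(vocabulary) for _ in docs]
for row, doc in enumerate(docs):
    for word in doc:
        matrix[row][vocabulary[word]] += 1

print(list(vocabulary))
for row in matrix:
    print(row)
```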


Next, we assign the result of the tf-idf weighting to the variable named `tfidf_DTM`.

In [None]:
# Calculate the tfidf matrix
tfidf = TfidfTransformer(norm='l1')
tfidf_DTM = tfidf.fit_transform(freq_term_DTM)
print(pd.DataFrame(tfidf_DTM.todense(), columns =  [k for k, v in sorted(vectorized.vocabulary_.items(), key=lambda item: item[1])] ))

     arthur       dog      feel      felt  following      good     great  \
0  0.137750  0.000000  0.181125  0.000000   0.000000  0.181125  0.181125   
1  0.215994  0.000000  0.000000  0.284006   0.284006  0.000000  0.000000   
2  0.000000  0.333333  0.000000  0.000000   0.000000  0.000000  0.000000   

    morning       run    street  thursday  
0  0.137750  0.000000  0.000000  0.181125  
1  0.215994  0.000000  0.000000  0.000000  
2  0.000000  0.333333  0.333333  0.000000  
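
The values above differ from the plain tf-idf formula because `TfidfTransformer` smooths the idf by default, computing $\ln\left(\frac{1+n}{1+\text{df}}\right)+1$, and because each row is rescaled here so that its absolute values sum to 1 (`norm='l1'`). The sketch below reproduces this computation with the standard library only; the helper names are hypothetical.

```python
import math

docs = [
    ['thursday', 'morning', 'great', 'arthur', 'feel', 'good'],
    ['following', 'morning', 'arthur', 'felt'],
    ['dog', 'run', 'street'],
]
n = len(docs)

def smoothed_idf(word):
    df = sum(word in d for d in docs)            # document frequency of the word
    return math.log((1 + n) / (1 + df)) + 1      # smoothed idf (scikit-learn default)

def tfidf_row(doc):
    weights = {w: doc.count(w) * smoothed_idf(w) for w in set(doc)}
    total = sum(abs(v) for v in weights.values())   # L1 norm of the row
    return {w: v / total for w, v in weights.items()}

row0 = tfidf_row(docs[0])
print(round(row0['arthur'], 6), round(row0['feel'], 6))
```

The printed weights can be compared with the `arthur` and `feel` columns of the first row above.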


<a name="Section_2"></a>
# <font color = 'E3A440'> 2. *Exercise : Sentiment Analysis on Twitter* </font>

The exercise proposed in this section is based on a simple processing chain for **sentiment analysis** on Twitter data and **analysis of lexical specificities**.

The corpus used was collected in 2020 by *trackmyhashtag.com* and contains 150,000 tweets for the 50 most followed profiles on Twitter. The data is in tabular format in a CSV file. For pedagogical reasons, this exercise uses a random sample of 5,000 tweets.

First, the textual data of the 5,000 tweets will be analyzed with the sentiment analysis tools of the `nltk` module. Then the text will be preprocessed and some lexical analysis will be performed.

During the exercise, the participant will be invited to fill in the missing parts of the code, which are indicated with `...` (three dots).

## <font color = 'E3A440'> 2.1 Presentation of the exercise </font>

### <font color = 'E3A440'> a. Import data </font>

The file with the data is archived in a `.zip` and contains more than 150,000 tweets. For educational reasons, we only import 5,000 random tweets. 

In [None]:
ROOT_DIR='Donnees_demystifiees_seance_6/'
DATA_DIR=os.path.join(ROOT_DIR, 'Data')
import zipfile
from datetime import datetime

# Unzip the archive, then load the pickled dataset
with zipfile.ZipFile(os.path.join(DATA_DIR,'4POINT0_Top_50_tweet_profiles.zip'), 'r') as zip_ref:
    zip_ref.extractall(DATA_DIR)

df = pd.read_pickle(os.path.join(DATA_DIR,'Top_50_tweet_profiles.pkl')).sample(5000, random_state = 5641).reset_index()

Here are the variables available in the dataset.

In [None]:
df.columns

Index(['index', 'Tweet Id', 'Tweet URL', 'Tweet Posted Time', 'Tweet Content',
       'Tweet Type', 'Client', 'Retweets received', 'Likes received',
       'User Id', 'Name', 'Username', 'Verified or Non-Verified',
       'Profile URL', 'Protected or Not Protected', 'Profile Account'],
      dtype='object')

Here is an observation (one row of the data table):

In [None]:
df.iloc[0]

index                                                                     71560
Tweet Id                                                     656538552327630848
Tweet URL                     https://twitter.com/billboard/status/656538552...
Tweet Posted Time                                           2015-10-20 18:32:43
Tweet Content                 .@JustinBieber, @Skrillex and @Bloodpop's #Sor...
Tweet Type                                                              Retweet
Client                                                       Twitter Web Client
Retweets received                                                         20260
Likes received                                                            22109
User Id                                                                 9695312
Name                                                                  billboard
Username                                                              billboard
Verified or Non-Verified                

### <font color = 'E3A440'> b. Run Sentiment Analysis </font>

The `SentimentIntensityAnalyzer` object is used to perform sentiment analysis. The object must be initialized and then the `polarity_scores()` function can be applied to a string.

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Here are three examples of sentiment analysis. The `polarity_scores()` function returns four values:

 1. `neg`: indicates the degree, on a scale from 0 to 1, of negative sentiment of the text.
 2. `neu`: indicates the degree, on a scale from 0 to 1, of neutral sentiment of the text.
 3. `pos`: indicates the degree, on a scale from 0 to 1, of positive sentiment of the text.
 4. `compound`: contains a composite value of the previous three metrics, with a range from -1 to 1.
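
As a rough intuition for the range of `compound`, here is a minimal sketch, assuming the normalization used in the VADER reference implementation, where a summed valence score $x$ is squashed with $x / \sqrt{x^2 + \alpha}$ and $\alpha = 15$. The function name and the input values are hypothetical, and the block does not call NLTK.

```python
import math

def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded valence sum into the interval (-1, 1) (illustrative)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Hypothetical raw valence sums, from strongly negative to strongly positive.
for raw in [-10, -2, 0, 2, 10]:
    print(raw, round(normalize_compound(raw), 4))
```

Under this assumption, a text with no sentiment-bearing words yields a score of 0, and the score approaches -1 or 1 only as the summed valence grows large.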



In [None]:
sia.polarity_scores("Wow, Montreal Canadiens is the greatest hockey team in the world!")

{'neg': 0.0, 'neu': 0.52, 'pos': 0.48, 'compound': 0.8516}

In [None]:
sia.polarity_scores("Ottawa is not bad city!")

{'neg': 0.0, 'neu': 0.56, 'pos': 0.44, 'compound': 0.484}

In [None]:
sia.polarity_scores("No, you cannot put pineapple on a pizza! This is disgusting!")

{'neg': 0.436, 'neu': 0.564, 'pos': 0.0, 'compound': -0.7339}

These are the tweets to which we will apply the sentiment analysis:

In [None]:
df['Tweet Content']

0       .@JustinBieber, @Skrillex and @Bloodpop's #Sor...
1       👏 ¡@SergioRamos y @hazardeden10 se encuentran ...
2       Here's how the market may predict the next pre...
3       “Children are magical on road trips. They have...
4       .@MelissaMcCarthy told me about the moment she...
                              ...                        
4995    So saddened to hear of the tragic theatre shoo...
4996    always takes the road less traveled... @ New O...
4997    #HustleHart #MoveWithHart https://t.co/GkQHkhKhR3
4998    This is the letter the US Attorney General sen...
4999    The Week on Instagram | 276\nhttps://t.co/9kIt...
Name: Tweet Content, Length: 5000, dtype: object

In the next block of code, we run sentiment analysis on the `Tweet Content` column and add the results to the data table (the object named `df`).

In [None]:
# Running Sentiment Analysis on the Corpus
datasent = df.apply(lambda x: sia.polarity_scores(x['Tweet Content']), 1)
df = df.join(pd.DataFrame(list(datasent)))

The result of the analysis is saved in 4 variables. Here is an example:

In [None]:
df.iloc[0]

index                                                                     71560
Tweet Id                                                     656538552327630848
Tweet URL                     https://twitter.com/billboard/status/656538552...
Tweet Posted Time                                           2015-10-20 18:32:43
Tweet Content                 .@JustinBieber, @Skrillex and @Bloodpop's #Sor...
Tweet Type                                                              Retweet
Client                                                       Twitter Web Client
Retweets received                                                         20260
Likes received                                                            22109
User Id                                                                 9695312
Name                                                                  billboard
Username                                                              billboard
Verified or Non-Verified                

To keep the analysis simple, we will only use the `compound` metric, which is automatically calculated by the `polarity_scores()` function.

In [None]:
df['compound'].describe()

count    5000.000000
mean        0.199087
std         0.417216
min        -0.972600
25%         0.000000
50%         0.000000
75%         0.557400
max         0.980200
Name: compound, dtype: float64

To use the `compound` metric in a **lexical specificity analysis** context, the tweets must be grouped into the following categories:
 1. `negative`: tweets containing negative sentiment (`compound` from -1 to -0.1)
 2. `neu`: tweets that are more neutral (`compound` above -0.1 and up to 0.5)
 3. `positive`: tweets containing positive sentiment (`compound` above 0.5)

In [None]:
# 1 Determine the cut points for the compound metric
bins = [-1, -0.1, 0.5, 1]
# 2 Determine the names of the categories. NOTE: there must be one fewer category name than cut points.
names = ['negative', 'neu', 'positive']
# 3 Execute the slicing with the pandas 'cut' function.
df['compound_category']  = pd.cut(df['compound'], bins, labels=names, include_lowest =True)

Here is the distribution of tweets by category:

In [None]:
Counter(df['compound_category'])

Counter({'negative': 630, 'neu': 2969, 'positive': 1401})
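
To make the bin edges concrete: by default `pd.cut` builds intervals that are closed on the right, and `include_lowest=True` also includes the lowest edge in the first interval. The sketch below mimics that rule with the standard library for values between -1 and 1; the function name `categorize` is hypothetical.

```python
from bisect import bisect_left

bins = [-1, -0.1, 0.5, 1]
names = ['negative', 'neu', 'positive']

def categorize(value):
    # Intervals are closed on the right: [-1, -0.1], (-0.1, 0.5], (0.5, 1].
    # max(..., 0) keeps the lowest edge (-1) in the first interval.
    # Only valid for values inside [-1, 1], which is the range of `compound`.
    idx = max(bisect_left(bins, value) - 1, 0)
    return names[idx]

for v in [-0.9726, -0.1, 0.0, 0.5, 0.5574, 0.9802]:
    print(v, categorize(v))
```

Note that a tweet with `compound` exactly equal to 0.5 falls in `neu`, and one exactly equal to -0.1 falls in `negative`.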

### <font color = 'E3A440'> c. Annotation, cleaning and vectorization of tweets </font>

We use the previously written function to clean the lexical units of tweets. For this first test, we keep only the adjectives.

This operation will take a few seconds.

In [None]:
cleaned_tweets = [CleaningText(sent, reduce = 'lemma', list_pos_to_keep = ['ADJ'], Stopwords_to_add=['http']) for sent in list(df['Tweet Content'])]

In the vectorization step we retain the words that appear in at least 5 documents (`min_df = 5`).

In [None]:
# Object initialization
def identity_tokenizer(text):
    return text
# Transform the words into frequencies
vectorized = CountVectorizer(lowercase = False, # Do not lowercase: the input is a list of already lowercased tokens, not a string
                             min_df = 5, # Ignore terms whose document frequency is strictly lower than this threshold
                             max_df = 4500, # Ignore terms whose document frequency is strictly higher than this threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the words in this list
                             ngram_range = (1, 1), # Lower and upper bounds of n for the n-grams to extract; (1, 1) keeps single words only
                             tokenizer=identity_tokenizer) # Skip string tokenization: each document is already a list of tokens

In [None]:
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_tweets])
freq_term_DTM

<5000x172 sparse matrix of type '<class 'numpy.int64'>'
	with 2661 stored elements in Compressed Sparse Row format>
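The effect of `min_df` can be checked on a tiny example. The sketch below is a simplified assumption of the setup above: it uses three made-up, already tokenized documents (`toy_docs`) and passes `token_pattern=None` only to silence the warning that scikit-learn emits when a custom tokenizer is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer

def identity_tokenizer(tokens):
    # The documents are already lists of tokens
    return tokens

# Three made-up documents: 'good' appears in 3 of them, 'new' in 2, 'old' and 'big' in 1
toy_docs = [['good', 'new'], ['good', 'old'], ['good', 'new', 'big']]

toy_vectorizer = CountVectorizer(lowercase=False,
                                 tokenizer=identity_tokenizer,
                                 token_pattern=None,
                                 min_df=2)  # keep terms present in at least 2 documents
toy_dtm = toy_vectorizer.fit_transform(toy_docs)

toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary, toy_dtm.shape)
```

Only `good` and `new` survive the filter, so the matrix has 3 rows (documents) and 2 columns (terms). Raising `min_df` shrinks the vocabulary, lowering it enlarges the vocabulary.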

### <font color = 'E3A440'> d. Analysis of lexical specificities </font>

The analysis of lexical specificities makes it possible to highlight the lexical units which are specific to a particular group of data. In our case, it is possible to identify the words that are more strongly associated with positive or negative feelings.

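To do this, we use a metric that is widely used in lexicometry, the log-likelihood ratio. The metric is based on this [article](https://aclanthology.org/J93-1003.pdf). The function below scores each word by the base-2 logarithm of the ratio between its normalized frequency in the target group and its normalized frequency in the rest of the corpus. Other methods can be used, such as mutual information, chi2 or tf-idf weighting.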

In [None]:
def GetLexicalSpecificities(freq_term_DTM, logical_vector):
    # This code takes inspiration from this Python module: https://pypi.org/project/corpus-toolkit/
    # and its main script: https://github.com/kristopherkyle/corpus_toolkit/blob/master/corpus_toolkit/corpus_tools.py
    # which is based on this paper: https://aclanthology.org/J93-1003/
    # NOTE: the vocabulary is read from the global `vectorized` object fitted before calling this function.
    import math
    # Word frequencies in the target group (rows where logical_vector is True)
    df_freq_target = pd.DataFrame(np.asarray(freq_term_DTM[logical_vector].sum(0).T).reshape(-1))
    df_freq_target.index = [word for (word, idx) in sorted(vectorized.vocabulary_.items(), key=lambda x: x[1])]
    df_freq_target.columns = ['freq1']
    # Word frequencies in the rest of the corpus (rows where logical_vector is False)
    df_freq_target['freq2'] = np.asarray(freq_term_DTM[~(logical_vector)].sum(0).T).reshape(-1)
    df_freq_target['tot'] = df_freq_target['freq1'] + df_freq_target['freq2']

    # Replace zero frequencies with a tiny value to avoid a division by zero or log2(0)
    df_freq_target['freq1'] = df_freq_target['freq1'].apply(lambda x: 0.0000001 if x == 0 else x).astype(float)
    df_freq_target['freq2'] = df_freq_target['freq2'].apply(lambda x: 0.0000001 if x == 0 else x).astype(float)
    # Normalize the frequencies per million words
    df_freq_target['freq1_norm'] = df_freq_target['freq1']/df_freq_target['freq1'].sum() * 1000000
    df_freq_target['freq2_norm'] = df_freq_target['freq2']/df_freq_target['freq2'].sum() * 1000000
    # Base-2 logarithm of the ratio between the normalized frequencies
    df_freq_target['fraction'] = df_freq_target['freq1_norm'] / df_freq_target['freq2_norm']
    df_freq_target['Log-likelihood Ratio'] = df_freq_target['fraction'].apply(math.log2)
    frequency_threshold = 10 # Insert your frequency threshold as integer
    # Keep the words above the frequency threshold and return the 50 highest scores (or fewer if fewer remain)
    return df_freq_target[df_freq_target['tot'] > frequency_threshold]['Log-likelihood Ratio'].sort_values(ascending=False).head(50)
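To make the score easier to read, here is a minimal worked sketch of the same computation for a single word. The helper `toy_log_ratio` and all the counts are hypothetical. As a simplification, the group totals are passed in directly instead of being summed from a matrix.

```python
import math

def toy_log_ratio(freq_target, freq_reference, total_target, total_reference):
    # Same zero replacement as in GetLexicalSpecificities
    freq_target = 0.0000001 if freq_target == 0 else freq_target
    freq_reference = 0.0000001 if freq_reference == 0 else freq_reference
    # Frequencies per million words in each group
    norm_target = freq_target / total_target * 1000000
    norm_reference = freq_reference / total_reference * 1000000
    # Base-2 logarithm of the ratio between the two normalized frequencies
    return math.log2(norm_target / norm_reference)

balanced = toy_log_ratio(10, 10, 1000, 1000)  # same relative frequency in both groups
overused = toy_log_ratio(40, 10, 1000, 1000)  # four times more frequent in the target group
absent = toy_log_ratio(40, 0, 1000, 1000)     # never observed in the reference group
print(balanced, overused, absent)
```

A score of 0 means the word is equally frequent in both groups, and each additional point doubles the relative frequency in the target group. A word that never occurs in the reference group receives a very large score because of the zero replacement, so such scores should be read as "absent from the reference" rather than compared numerically with the others.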

To perform the specificity analysis, it is necessary to create a logical vector (with binary values) which indicates with `True` the class for which we want to analyze the lexical specificity and with `False` the rest of the corpus.

In [None]:
logical_vector = df['compound_category'] == 'positive'
logical_vector

0       False
1       False
2       False
3       False
4       False
        ...  
4995    False
4996    False
4997    False
4998    False
4999    False
Name: compound_category, Length: 5000, dtype: bool

In [None]:
sum(logical_vector)

630
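The role of the logical vector can be illustrated on a small made-up example. In the sketch below, `toy_labels` and `toy_dtm` are illustrative assumptions: the mask selects the rows of the target group, and its negation (`~`) selects the rest of the corpus.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Four made-up documents with their category and a 2-term frequency matrix
toy_labels = pd.Series(['positive', 'neu', 'negative', 'positive'])
toy_dtm = csr_matrix([[1, 0], [0, 2], [0, 1], [3, 0]])

toy_mask = (toy_labels == 'positive').to_numpy()

# Term frequencies summed over the target rows and over the remaining rows
target_freq = np.asarray(toy_dtm[toy_mask].sum(0)).reshape(-1)
reference_freq = np.asarray(toy_dtm[~toy_mask].sum(0)).reshape(-1)
print(toy_mask.sum(), target_freq, reference_freq)
```

Summing the mask counts the `True` values, which gives the number of documents in the target group.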

Run the function with the frequency matrix (`freq_term_DTM`) and the logical vector we created above.

In [None]:
GetLexicalSpecificities(freq_term_DTM, logical_vector)

free          27.839934
beautiful     27.772819
amazing        5.374933
happy          4.477503
great          3.618858
perfect        3.448933
best           3.086363
proud          2.586437
special        2.496239
much           1.311430
good           1.246304
whole          1.127005
incredible     0.957080
important      0.934360
nice           0.863971
excited        0.827445
better         0.711968
powerful       0.711968
favorite       0.629506
single         0.319650
huge           0.296930
hard           0.296930
funny          0.127005
big            0.127005
exclusive     -0.024998
young         -0.065640
able          -0.136029
sure          -0.153103
many          -0.168451
ready         -0.194923
top           -0.220918
old           -0.235565
high          -0.288032
last          -0.407331
long          -0.525071
available     -0.661491
american      -0.661491
black         -0.680350
new           -0.757518
next          -0.811594
true          -0.872995
le            -0

In [None]:
del logical_vector, freq_term_DTM

## <font color = 'E3A440'> 2.2 Exercise </font>

During the exercise, participants are invited to fill in the missing parts of the code, which are indicated with `...` (three dots).

Several manipulations and different results will be required. Each sub-exercise follows this processing chain:

1. Annotation and cleaning of tweets: the participant will have to adjust some parameters of the function to choose a specific filtering.
2. Vectorization: the participant will have to adjust some parameters of the function to choose a specific filtering.
3. Creation of a logical vector to define the target group and the reference group.
4. Application of the `GetLexicalSpecificities()` function to obtain the 50 most specific words for the target group.




### <font color = 'E3A440'> a. Study the impact of morphosyntactic filtering on lexical specificities </font>

In section 2.1, only adjectives were studied. Now study nouns, adjectives and verbs together, and then any other combinations that interest you.
Here is the list of existing POS tags:

| **POS** | **DESCRIPTION**           | **EXAMPLES**                                      |
| ------- | ------------------------- | ------------------------------------------------- |
| ADJ     | adjective                 | big, old, green, incomprehensible, first      |
| ADP     | adposition                | in, to, during                                |
| ADV     | adverb                    | very, tomorrow, down, where, there            |
| AUX     | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ    | conjunction               | and, or, but                                  |
| CCONJ   | coordinating conjunction  | and, or, but                                  |
| DET     | determiner                | a, an, the                                    |
| INTJ    | interjection              | psst, ouch, bravo, hello                      |
| NOUN    | noun                      | girl, cat, tree, air, beauty                  |
| NUM     | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART    | particle                  | ’s, not                                      |
| PRON    | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN   | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT   | punctuation               | ., (, ), ?                                    |
| SCONJ   | subordinating conjunction | if, while, that                               |
| SYM     | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :)               |
| VERB    | verb                      | run, runs, running, eat, ate, eating          |
| X       | other                     | sfpksdpsxmsa                                  |
| SPACE   | space                     |                                                   |


Insert the correct value for the `list_pos_to_keep` argument in order to keep nouns, adjectives and verbs, or any other POS tag combination that interests you.
Look for the three dots `...` and fill them in! You can take inspiration from the code presented earlier in this workshop. Copy-paste is permitted!

In [None]:
# 1. Annotation and cleaning: ADD nouns, adjectives and verbs (or your own combination) as POS tags to keep
cleaned_tweets = [CleaningText(sent, reduce = 'lemma', list_pos_to_keep = [...], Stopwords_to_add=['http']) for sent in list(df['Tweet Content'])]

Change the `min_df` parameter so that your <font color='E3A440'>**Document-Term matrix**</font>, which is saved in the object `freq_term_DTM`, does not exceed **750 words**.

Note that in this function the `ngram_range` parameter is configured to have unigrams and bigrams (its value is: `(1,2)`).

In [None]:
# 2. Vectorisation
def identity_tokenizer(text):
    return text
    
## 2.1 initialise with parameters : 
vectorized = CountVectorizer(lowercase = False, # Keep the case of the tokens as-is (no conversion to lowercase before tokenizing)
                             min_df = ..., # Ignore terms that have a document frequency strictly lower than the given threshold 
                             max_df = 4500, # Ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the list of words provided
                             ngram_range = (1, 2), # Get the lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted
                             tokenizer=identity_tokenizer) # Override the string tokenization step while preserving the preprocessing and n-grams generation steps

#
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_tweets])

freq_term_DTM
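As a hedged illustration of what `ngram_range = (1, 2)` produces, the sketch below vectorizes a single made-up document (`toy_docs`). With already tokenized input, scikit-learn builds bigrams by joining neighbouring tokens with a space. The `token_pattern=None` argument is only there to silence the warning about the unused default pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer

def identity_tokenizer(tokens):
    # The documents are already lists of tokens
    return tokens

toy_docs = [['very', 'good', 'day']]

toy_vectorizer = CountVectorizer(lowercase=False,
                                 tokenizer=identity_tokenizer,
                                 token_pattern=None,
                                 ngram_range=(1, 2))  # unigrams and bigrams
toy_dtm = toy_vectorizer.fit_transform(toy_docs)

toy_terms = sorted(toy_vectorizer.vocabulary_)
n_terms = toy_dtm.shape[1]  # number of columns = number of terms in the matrix
print(toy_terms, n_terms)
```

The second value of the matrix `shape` is the number of terms, which is a convenient way to check whether a vocabulary stays under a given size.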

Using the work already done in point **b.** of section **2.1**, choose the category for which you want to study the lexical specificity, e.g. `negative` or `positive`.

In [None]:
logical_vector = df['compound_category'] == ...

Using the function defined in point **d.** of section **2.1**, add the two required arguments of the function, i.e. the <font color='E3A440'>**Document-Term matrix**</font> and the **logical vector** created in the previous code block.

In [None]:
# This function needs 2 arguments: the 1st is the matrix, the 2nd is the logical vector
GetLexicalSpecificities(..., ...)

### <font color = 'E3A440'> b. Investigate new categories based on the number of Retweets </font>

On Twitter, it is possible to retweet an existing tweet. The number of retweets can be considered an indicator of the interest a tweet has obtained.

Answer the following question: what are the lexical specificities of the tweets that have been retweeted the most?

To answer, you must adapt a few lines of code by applying the steps learned throughout this workshop.

Here is the distribution of the `Retweets received` column.

In [None]:
df['Retweets received'].describe()

Based on the percentiles displayed in the distribution of the `Retweets received` column (result of the previous chunk of code), add the missing cut values to the `bins` list in the next chunk of code.
Divide the number of Retweets into four categories:
1. `low`, grouping tweets that have received a low interest
2. `medium`, grouping tweets that have received a medium interest
3. `high`, grouping tweets that have received a high interest
4. `very_high`, grouping tweets that have received a very high interest

In [None]:
bins = [-np.inf, 161, ..., ..., 449711]
names = ['low', 'medium', 'high', 'very_high']
df['Retweets_received_category']  = pd.cut(df['Retweets received'], bins, labels=names, include_lowest =True)
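As a hedged sketch of the general idea, percentiles can also be computed directly with `quantile` and reused as cut values. The `toy_counts` below are made-up numbers, not values from the corpus, so the resulting cut values do not apply to the exercise.

```python
import numpy as np
import pandas as pd

# Made-up retweet counts
toy_counts = pd.Series([0, 5, 10, 20, 40, 80, 160, 1000])

# 25th, 50th and 75th percentiles, as reported by describe()
q25, q50, q75 = toy_counts.quantile([0.25, 0.5, 0.75])

toy_bins = [-np.inf, q25, q50, q75, toy_counts.max()]
toy_names = ['low', 'medium', 'high', 'very_high']
toy_categories = pd.cut(toy_counts, toy_bins, labels=toy_names)

toy_sizes = toy_categories.value_counts().to_dict()
print(toy_bins, toy_sizes)
```

Cutting at the quartiles gives four groups of equal size here (two tweets each). Other cut values can be chosen if a different balance between the groups is preferred.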

Choose the **target category** for which you want to analyze the lexical specificities. The value must be one of the four values contained in the `Retweets_received_category` column generated in the previous chunk of code.

In [None]:
logical_vector = df['Retweets_received_category'] == ...

Run the specificity analysis.

In [None]:
GetLexicalSpecificities(freq_term_DTM, logical_vector)

### <font color = 'E3A440'> c. Study the lexical specificities of different Twitter profiles </font>

In the next exercise, select two or three Twitter profiles of your choice and compare the lexical specificities by studying several POS tag combinations. 

What are the main lexical differences between the profiles you have chosen?

Here is the complete list of profiles present in the corpus, as recorded in the `Profile Account` column, with the number of tweets per profile.

In [None]:
Counter(df['Profile Account'])
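When several profiles are compared, it can help to build one logical vector per profile. The sketch below is only an illustration: the profile names in `toy_df` are hypothetical placeholders, and picking the two most frequent profiles with `most_common` is one possible choice among others.

```python
from collections import Counter
import pandas as pd

# Hypothetical profile names, not taken from the corpus
toy_df = pd.DataFrame({'Profile Account': ['@a', '@b', '@a', '@c', '@a', '@b']})

# Number of tweets per profile, then the two profiles with the most tweets
profile_counts = Counter(toy_df['Profile Account'])
top_profiles = [name for name, count in profile_counts.most_common(2)]

# One logical vector per selected profile (True = tweet written by that profile)
toy_masks = {name: toy_df['Profile Account'] == name for name in top_profiles}
toy_sizes = {name: int(mask.sum()) for name, mask in toy_masks.items()}
print(top_profiles, toy_sizes)
```

Each mask can then be passed in turn to the specificity function, so that every profile is compared against the rest of the corpus.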

In [None]:
# 1. Annotation and cleaning: ADD the POS tags that you want to keep
cleaned_tweets = [CleaningText(sent, reduce = 'lemma', list_pos_to_keep = [...], Stopwords_to_add=['http']) for sent in list(df['Tweet Content'])]

# 2. Vectorisation
def identity_tokenizer(text):
    return text
## 2.1 initialise with parameters : 
vectorized = CountVectorizer(lowercase = False, # Keep the case of the tokens as-is (no conversion to lowercase before tokenizing)
                             min_df = 10, # Ignore terms that have a document frequency strictly lower than the given threshold 
                             max_df = 4500, # Ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)
                             stop_words = stopwords.words('english'), # Remove the list of words provided
                             ngram_range = (1, 1), # Get the lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted
                             tokenizer=identity_tokenizer) # Override the string tokenization step while preserving the preprocessing and n-grams generation steps

#
freq_term_DTM = vectorized.fit_transform([[w for w, pos in sent] for sent in cleaned_tweets])

freq_term_DTM

In [None]:
logical_vector = df['Profile Account'] == ...
GetLexicalSpecificities(freq_term_DTM, logical_vector)

## <font color = 'E3A440'> 2.3 PERSONAL NOTES: </font>

-----

-----