# Lab session 1 
# An introduction to textual data

## Lecture takeaways 

- The Why of NLP
- What is NLP? The four challenges of NLP
- NLP in three pipelines

cf. https://nlp-ensae.github.io/files/NLP-ENSAE-lecture-1.pdf

## Lab session Prerequisites

- Python 
- Pandas 

For those not familiar with pandas, see https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

## Lab session in a nutshell

- Grasping a dataset 
- Basic tokenization (word segmentation) of a dataset (compute the vocabulary and check Zipf's law)
- Regex 
- Hands-on with some processing tools (POS, NER, ...)

## Resources

- NLTK : https://www.nltk.org/api/nltk.tokenize.html 
- PANDAS : https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
- SPACY : https://spacy.io/usage/spacy-101 


## Database

We will use the following database:
https://d1p17r2m4rzlbo.cloudfront.net/wp-content/uploads/2017/01/PLOS_narrativity.csv.zip

This database was used in a scientific article about the role of narrativity in the citation frequency of climate change research articles: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0167983


## Tasks

### 1. Basic preprocessing
#### 1.1 Open the database. Generate simple statistics about the abstracts. How many unique articles are there? What is the mean length of abstracts in characters? 
#### 1.2 Generate simple statistics about the annotators' data for each article. Do the annotations seem consistent? 

### 2. Word-level preprocessing
#### 2.1 Split the abstracts into lists of words. How many different words are there in the vocabulary?
#### 2.2 Split the abstracts into lists of words using three different tokenizers from nltk. How does the number of words differ? What do you think has changed?
#### 2.3 Check if Zipf's law applies. 

### 3. Domain specificity and regex
#### 3.1 Use regex to retrieve numbers (ints, floats, %, years, ...) in abstracts.
#### 3.2 What percentage of the characters in a given abstract belong to numbers (as defined above)?
#### 3.3 Is there any relationship between the percentage of numbers in an abstract and its number of citations?

### 4. Classic NLP pipeline
#### 4.0 Re-tokenize using spacy
#### 4.1 Lemmatize using spacy
#### 4.2 POS tagging using spacy, plot the trees
#### 4.3 NER using spacy, count the entities of each type in a given abstract, and compare the counts to the number of citations.

### 5. Topic Modelling
#### 5.1 Use Gensim's LDA to compute a topic model. 
#### 5.2 Use PyLDAvis to visualise the topic model. What are the different topic clusters?
#### 5.3 Use a tf-idf representation for each abstract, and use your favorite clustering algorithm.

In [1]:
# Downloading the database
!wget https://d1p17r2m4rzlbo.cloudfront.net/wp-content/uploads/2017/01/PLOS_narrativity.csv.zip

--2025-01-27 19:16:10--  https://d1p17r2m4rzlbo.cloudfront.net/wp-content/uploads/2017/01/PLOS_narrativity.csv.zip
Resolving d1p17r2m4rzlbo.cloudfront.net (d1p17r2m4rzlbo.cloudfront.net)... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘d1p17r2m4rzlbo.cloudfront.net’


In [2]:
!unzip PLOS_narrativity.csv.zip

unzip:  cannot find or open PLOS_narrativity.csv.zip, PLOS_narrativity.csv.zip.zip or PLOS_narrativity.csv.zip.ZIP.


# 1. Basic preprocessing






## 1.1 Open the database. Generate simple statistics about the abstracts. How many unique articles are there? What is the mean length of abstracts in characters?

In [3]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
df = pd.read_csv('PLOS_narrativity.csv', index_col=0)
print("Shape:  {0}".format(df.shape))

FileNotFoundError: [Errno 2] No such file or directory: 'PLOS_narrativity.csv'

In [0]:
# e.g. number of distinct articles in the database

802


In [0]:
# Mean length of abstracts in characters

1496.1795511221944

In [0]:
# Distribution of abstract lengths in characters
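
A minimal sketch of the statistics asked for in 1.1, run on a small hypothetical DataFrame (`toy`) rather than the real file. It assumes that `pmid` identifies an article and that `ab` holds the abstract text; both are readings of the column names, not something the dataset documentation confirms here.

```python
import pandas as pd

# Hypothetical toy data: one row per (article, annotator) pair, as in the lab file.
toy = pd.DataFrame({
    "pmid": [1, 1, 2, 2, 3],
    "ab": ["Short abstract.", "Short abstract.",
           "A somewhat longer abstract.", "A somewhat longer abstract.",
           "Mid-length text."],
})

# Keep one row per article so repeated annotations do not bias the statistics.
articles = toy.drop_duplicates(subset="pmid")

n_articles = articles["pmid"].nunique()
lengths = articles["ab"].str.len()   # abstract length in characters
mean_length = lengths.mean()

print(n_articles, mean_length)
print(lengths.describe())            # count, mean, std, quartiles
```

With the real data, replacing `toy` by `df` should be enough; `lengths.hist()` would then show the distribution graphically.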

## 1.2 Generate simple statistics about the annotators' data for each article. Do the annotations seem consistent? 


In [0]:
# First, the number of annotators per article
# --> X annotators/article
df.columns

Index(['X_unit_id', 'X_created_at', 'X_id', 'X_started_at', 'X_tainted',
       'X_channel', 'X_trust', 'X_worker_id', 'X_country', 'X_region',
       'X_city', 'X_ip', 'appeal_to_reader', 'conjunctions', 'connectivity',
       'narrative_perspective', 'sensory_language', 'setting', 'ab',
       'appeal_to_reader_gold', 'conjunctions_gold', 'connectivity_gold',
       'narrative_perspective_gold', 'pmid', 'py', 'sensory_language_gold',
       'setting_gold', 'so', 'tc', 'af', 'au', 'bp', 'di', 'ep', 'is', 'pd',
       'pt', 'sn', 'ti', 'ut', 'vl', 'z9', 'cin_mas', 'firstauthor',
       'numberauthors', 'pid_mas', 'title'],
      dtype='object')
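
One way to count annotators per article is sketched below on hypothetical toy data. It assumes that `X_worker_id` identifies an annotator and that each row is one judgment; neither is confirmed by the column listing alone.

```python
import pandas as pd

# Hypothetical toy annotations: one row per judgment.
toy = pd.DataFrame({
    "pmid":        [1, 1, 1, 2, 2],
    "X_worker_id": [10, 11, 12, 10, 13],
})

# Distinct annotators per article, then a summary across articles.
annotators_per_article = toy.groupby("pmid")["X_worker_id"].nunique()
print(annotators_per_article.describe())
```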

In [0]:
# Checking consistency between annotators: first convert appeal_to_reader,
# narrative_perspective and setting to booleans ("yes" -> True, anything else -> False),
# then compute the standard deviation of each column per article.
for col in ["appeal_to_reader", "narrative_perspective", "setting"]:
    df[col] = df[col] == "yes"

In [0]:
eval_cols = ["appeal_to_reader", "conjunctions", "connectivity", "narrative_perspective", "sensory_language", "setting"]
df.groupby(df.pmid)[eval_cols].std()

pmid,appeal_to_reader,conjunctions,connectivity,narrative_perspective,sensory_language,setting
18726051,0.487950,1.976047,1.000000,0.487950,1.397276,0.000000
18783869,0.534522,1.573592,1.976047,0.377964,1.718249,0.534522
18810525,0.487950,1.345185,1.799471,0.487950,1.463850,0.000000
18810526,0.487950,2.214670,0.975900,0.377964,1.214986,0.000000
18811616,0.534522,1.069045,1.380131,0.377964,1.069045,0.487950
...,...,...,...,...,...,...
22216227,0.487950,1.133893,1.718249,0.534522,2.449490,0.377964
22216263,0.487950,0.951190,2.340126,0.487950,0.975900,0.487950
22216307,0.534522,1.133893,1.799471,0.487950,1.380131,0.377964
22216315,0.534522,1.214986,1.253566,0.487950,1.397276,0.000000


In [0]:
len(df.pmid.unique())

802

# 2. Word-level preprocessing


## 2.1 Split the abstracts into lists of words. How many different words are there in the vocabulary?



In [0]:
from itertools import chain

# Split each unique abstract on single spaces (separator = " ")
word_lists = df.ab.drop_duplicates().apply(lambda x: x.split(' '))

# Flatten the per-abstract lists into one list of words
arr = list(chain.from_iterable(word_lists))
# len(set(arr))  # vocabulary size

## 2.2 Split the abstracts into lists of words using three different tokenizers from nltk. How does the number of words differ? What do you think has changed?



In [0]:
# https://www.nltk.org/api/nltk.tokenize.html 
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import ToktokTokenizer
from nltk.tokenize import TweetTokenizer
# e.g. tokenizers = [TreebankWordTokenizer(), ToktokTokenizer(), TweetTokenizer()]
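
The comparison can be sketched as follows. The sample sentence is hypothetical and stands in for an abstract; with the lab data, the same loop would run over the abstracts in `df.ab`.

```python
from nltk.tokenize import TreebankWordTokenizer, ToktokTokenizer, TweetTokenizer

# Hypothetical sample sentence standing in for an abstract.
text = "Sea-level rise (SLR) isn't uniform: it reached 3.2 mm/yr in 1993-2010."

tokenizers = {
    "treebank": TreebankWordTokenizer(),
    "toktok": ToktokTokenizer(),
    "tweet": TweetTokenizer(),
}

# Tokenize the same text with each tokenizer and compare the results.
tokens = {name: tok.tokenize(text) for name, tok in tokenizers.items()}
for name, toks in tokens.items():
    print(f"{name:9s} {len(toks):3d} tokens  {toks}")
```

Printing the token lists side by side makes it easier to see where the tokenizers disagree, typically on contractions, hyphens, and punctuation attached to numbers.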

## 2.3 Check if Zipf's law applies.

In [0]:
from collections import Counter
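
A rough sketch of one way to check Zipf's law, using only the standard library. The toy corpus is hypothetical and far too small to show the law convincingly; with the lab data, `tokens` would be the full word list from section 2.

```python
import math
from collections import Counter

# Hypothetical toy corpus; with the lab data, use the token list built in section 2.
tokens = ("the cat sat on the mat and the dog sat on the rug "
          "while the cat and the dog slept").split()

# Word frequencies sorted from most to least frequent (rank 1 = most frequent).
freqs = [count for _, count in Counter(tokens).most_common()]
ranks = range(1, len(freqs) + 1)

# Zipf's law predicts frequency ~ C / rank**s, i.e. a straight line of slope -s
# in log-log space. Estimate the slope by ordinary least squares.
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))

print(f"Estimated log-log slope: {slope:.2f}")
```

A slope close to -1 on a large corpus is usually read as agreement with Zipf's law; plotting `ranks` against `freqs` on log-log axes gives the same information visually.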

# 3. Domain specificity and regex


## 3.1 Use regex to retrieve numbers (ints, floats, %, years, ...) in abstracts.


*Regex cheatsheet*: see the documentation of Python's re module https://docs.python.org/3/library/re.html

*Other resources*:

- A good website to write and test regular expressions : 
https://regex101.com/
- A good game to learn regex : https://alf.nu/RegexGolf 


In [0]:
import re
# Regular expression that matches any sequence of numbers:
nb =  ''
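
One possible pattern, offered as a sketch rather than the expected answer. The name `NUMBER` and the sample sentence are hypothetical. The pattern covers integers, decimals, comma-grouped thousands, and an optional trailing percent sign; it deliberately ignores signs, exponents, and spelled-out numbers.

```python
import re

# Digits, optionally followed by groups of "." or "," plus more digits,
# optionally followed by a percent sign.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")

# Hypothetical sample sentence.
text = "Emissions rose by 2.5% between 1990 and 2010, reaching 36,000 Mt in 2013."

matches = NUMBER.findall(text)
print(matches)  # ['2.5%', '1990', '2010', '36,000', '2013']
```

Because the separator must be followed by a digit, the comma after `2010` and the final full stop are not absorbed into the matches.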

## 3.2 What percentage of the characters in a given abstract belong to numbers (as defined above)?
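
A minimal sketch of the computation, assuming a number is whatever a chosen regex matches. The pattern `NUMBER` and the helper `percent_numeric_chars` are hypothetical names introduced for illustration.

```python
import re

# Assumed definition of a "number": ints, decimals, grouped thousands, optional %.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")

def percent_numeric_chars(text):
    """Percentage of characters in `text` that belong to a number match."""
    if not text:
        return 0.0
    matched = sum(len(m.group()) for m in NUMBER.finditer(text))
    return 100.0 * matched / len(text)

print(percent_numeric_chars("Up 10% in 2010"))  # 50.0
```

On the lab data this could be applied column-wise, for instance with `df.ab.apply(percent_numeric_chars)`.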


## 3.3 Is there any relationship between the percentage of numbers in an abstract and its number of citations?
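
A first look at the relationship could use correlation coefficients, as sketched below on hypothetical values. The column name `pct_numbers` is invented for the example, and `tc` is assumed to hold the times-cited count.

```python
import pandas as pd

# Hypothetical values: percentage of numeric characters and citation counts
# (the `tc` column is assumed here to hold the times-cited count).
toy = pd.DataFrame({
    "pct_numbers": [0.0, 1.5, 3.0, 4.2, 7.8],
    "tc":          [12,  30,  8,   45,  20],
})

# Pearson measures linear association; ranking first gives Spearman's rho,
# which is less sensitive to the heavy right tail typical of citation counts.
pearson = toy["pct_numbers"].corr(toy["tc"])
spearman = toy["pct_numbers"].rank().corr(toy["tc"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

A coefficient near zero on the real data would suggest no monotonic relationship; a scatter plot is a useful complement, since correlation alone can hide non-linear patterns.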