# *Natural Language Processing Assignment*

##### <font color="teal">**1. Basic preprocessing**</font>






###### <font color="teal">**1.1 Open the database. Generate simple statistics about the abstracts. How many unique articles are there? What is the mean length of abstracts in characters?**</font>

In [1]:
import pandas as pd

data = pd.read_csv("PLOS_narrativity.csv")

unique_articles = data['pmid'].nunique()

mean_abstract_length = data['ab'].dropna().apply(len).mean()

print(f"Unique articles: {unique_articles}")
print(f"Mean length of abstracts: {mean_abstract_length:.2f} characters")


Unique articles: 802
Mean length of abstracts: 1496.18 characters


##### <font color="teal">**2. Word-level preprocessing**</font>


###### <font color="teal">**2.1 Split the abstracts into list of words. How many different words are there in the vocabulary?**</font>



In [2]:
all_abstracts = ' '.join(data['ab'].dropna())

words = set(all_abstracts.lower().split(sep=' '))

unique_word_count = len(words)

print(f"Unique words in abstracts: {unique_word_count}")


Unique words in abstracts: 19383
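
One caveat about the count above: `split(sep=' ')` splits on single spaces only, so consecutive spaces produce empty strings, and words separated by a newline or tab stay glued together. Calling `split()` with no argument splits on any run of whitespace instead. Here is a minimal sketch of the difference; the sample string is made up for illustration and is not taken from the dataset.

```python
# Hypothetical sample text (an assumption, not from PLOS_narrativity.csv)
sample = "Gene  expression was measured.\nResults were\tsignificant."

# Splitting on a single space: keeps '' for the double space and
# leaves newline/tab-separated words joined
space_tokens = sample.lower().split(sep=' ')

# Splitting on any whitespace run: no empty strings, no glued words
whitespace_tokens = sample.lower().split()

print(space_tokens)
print(whitespace_tokens)
```

If the abstracts contain newlines or repeated spaces, the two approaches will give different vocabulary sizes.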


###### <font color="teal">**2.2 Split the abstracts into list of words using three different tokenizers from nltk. What is the difference in terms of number of words? What do you think has changed?**</font>



In [3]:
import nltk
from nltk.tokenize import TreebankWordTokenizer, ToktokTokenizer, TweetTokenizer

In [4]:
all_abstracts = ' '.join(data['ab'].dropna())

treebank_tokenizer = TreebankWordTokenizer()
toktok_tokenizer = ToktokTokenizer()
tweet_tokenizer = TweetTokenizer()

treebank_tokens = treebank_tokenizer.tokenize(all_abstracts)
toktok_tokens = toktok_tokenizer.tokenize(all_abstracts)
tweet_tokens = tweet_tokenizer.tokenize(all_abstracts)

treebank_unique_words = set(treebank_tokens)
toktok_unique_words = set(toktok_tokens)
tweet_unique_words = set(tweet_tokens)

print(f"Treebank Tokenizer: {len(treebank_unique_words)} unique words")
print(f"Toktok Tokenizer: {len(toktok_unique_words)} unique words")
print(f"Tweet Tokenizer: {len(tweet_unique_words)} unique words")


Treebank Tokenizer: 16576 unique words
Toktok Tokenizer: 16594 unique words
Tweet Tokenizer: 14432 unique words


Splitting on single spaces gave 19383 unique words. 
The nltk tokenizers gave 16576 (Treebank), 16594 (Toktok), and 14432 (Tweet). Note that the space-split vocabulary was lowercased while the nltk vocabularies were not, so the counts are not strictly comparable.

TreebankWordTokenizer:
Style: Based on the Penn Treebank tokenization.
Characteristics: Separates most punctuation from the adjacent word; splits contractions (e.g., "can't" -> "ca n't").

ToktokTokenizer:
Style: A general-purpose tokenizer.
Characteristics: Splits most punctuation from words but, judging by the random sample below, leaves many periods attached to the preceding word (e.g., "alterations.").

TweetTokenizer:
Style: Designed for social media text.
Characteristics: Keeps hashtags, mentions, and emoticons intact; better for informal language.

The difference in the number of words depends on how each tokenizer handles punctuation, contractions, and special symbols. The more consistently punctuation is split from words, the fewer variants of the same word (e.g., "effect" and "effect.") end up in the vocabulary.
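
To illustrate how punctuation handling alone changes the vocabulary, the sketch below compares a whitespace split with a simple regex tokenizer that emits each punctuation mark as its own token. The regex is a stand-in for illustration only, not the rule set of any nltk tokenizer, and the sample sentence is made up.

```python
import re

# Hypothetical sample sentence (an assumption, not from the dataset)
sample = "We measured the effect. The effect, though small, was robust; no effect was found elsewhere."

# Whitespace split: punctuation stays attached to the word
whitespace_vocab = set(sample.split())

# Simple regex tokenizer: runs of word characters, or single punctuation marks
regex_vocab = set(re.findall(r"\w+|[^\w\s]", sample))

# Compare how many vocabulary entries the word "effect" produces
effect_whitespace = sorted(t for t in whitespace_vocab if "effect" in t)
effect_regex = sorted(t for t in regex_vocab if "effect" in t)

print(effect_whitespace)
print(effect_regex)
```

With the whitespace split, "effect" appears under three different spellings. With punctuation separated it collapses to a single entry, at the cost of adding the punctuation marks themselves to the vocabulary.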
 


In [5]:
# Print a random sample of 20 tokens from one tokenizer's vocabulary
# (uncomment a different print line to inspect another tokenizer)
import random

words_list = list(words)
treebank_unique_words_list = list(treebank_unique_words)
toktok_unique_words_list = list(toktok_unique_words)
tweet_unique_words_list = list(tweet_unique_words)

for i in range(20):
    # print(words_list[random.randint(0, len(words_list) - 1)])
    # print(treebank_unique_words_list[random.randint(0, len(treebank_unique_words_list) - 1)])
    print(toktok_unique_words_list[random.randint(0, len(toktok_unique_words_list) - 1)])
    # print(tweet_unique_words_list[random.randint(0, len(tweet_unique_words_list) - 1)])

alterations.
chlorophyll
affect.
diminished
indexing
Future
garter
effect.
linearly
1,372
compute
complemented
800-170
noted
theoretical
camps
line
perspectives.
List
unpredictable


##### <font color="teal">**3. Domain specificity and regex**</font>


###### <font color="teal">**3.1 Use regex to retrieve numbers (ints, floats, %, years, ...) in abstracts.**</font>


*Regex cheatsheet*: see Python's re module documentation https://docs.python.org/3/library/re.html  

*Other resources*:

- A good website to write and test regular expressions: 
https://regex101.com/
- A good game to learn regex: https://alf.nu/RegexGolf 


In [6]:
import re

pattern = r'\b\d+\.\d+|\b\d+%|\b\d{4}|\b\d+\b'
matches = re.findall(pattern, all_abstracts)
print(f"Found {len(matches)} numbers in the abstracts.")

Found 29883 numbers in the abstracts.
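
The pattern above has a few edge cases worth knowing about. Because the alternatives are tried left to right, "12.5%" is captured as "12.5" without the percent sign, `\b\d{4}` truncates longer integers to their first four digits, and "1,372" is counted as two numbers. The sketch below compares it with one possible alternative. The alternative pattern and the sample sentence are assumptions for illustration, and using it would change the count reported above.

```python
import re

# Hypothetical sample sentence (an assumption, not from the dataset)
sample = "In 2019, 12.5% of 1,372 samples (n = 12345) scored 0.8."

# Pattern used in the cell above
original_pattern = r'\b\d+\.\d+|\b\d+%|\b\d{4}|\b\d+\b'

# One possible alternative: digits, optional thousands groups,
# optional decimal part, optional trailing percent sign
alternative_pattern = r'\b\d+(?:,\d{3}(?!\d))*(?:\.\d+)?%?'

original_matches = re.findall(original_pattern, sample)
alternative_matches = re.findall(alternative_pattern, sample)

print(original_matches)
print(alternative_matches)
```

The alternative keeps years as plain integers, which is usually fine for counting. Telling years apart from other four-digit numbers would need extra context that a regex alone does not have.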


In [12]:
# Dump the first 10 non-missing abstracts to a text file for manual regex testing
with open("test.txt", "w", encoding="utf-8") as f:
    for abstract in data['ab'].dropna().head(10):
        f.write(abstract)
        f.write("\n\n")

###### <font color="teal">**3.2 How many percent of characters are numbers (as defined above) in a given abstract?**</font>


In [10]:
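characters = data['ab'].dropna().apply(len)

# Note: this is the number of matches per 100 characters over all abstracts,
# not the share of characters that belong to a number (that would require
# summing the lengths of the matches instead of counting them).
matches_per_100_chars = len(matches) / characters.sum() * 100
print(f"Number matches per 100 characters: {matches_per_100_chars:.2f}")

Number matches per 100 characters: 0.36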
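
The question asks about a given abstract, so a per-abstract version may be closer to what is intended. This minimal sketch sums the lengths of the matched numbers and divides by the length of the abstract. The helper name `numeric_char_percentage` and the sample sentence are assumptions for illustration; the pattern is the one from section 3.1.

```python
import re

# Same pattern as in section 3.1
pattern = r'\b\d+\.\d+|\b\d+%|\b\d{4}|\b\d+\b'

def numeric_char_percentage(abstract: str) -> float:
    """Return the percentage of characters in `abstract` that fall inside a number match."""
    if not abstract:
        return 0.0
    # Sum the lengths of all matched numbers, not the number of matches
    numeric_chars = sum(len(m) for m in re.findall(pattern, abstract))
    return numeric_chars / len(abstract) * 100

# Hypothetical sample abstract (an assumption, not from the dataset)
sample = "In 2019, 25% of 40 patients improved."
print(f"{numeric_char_percentage(sample):.2f}% of characters are part of a number")
```

Applied to the dataset, this could be used as `data['ab'].dropna().apply(numeric_char_percentage)` to get one percentage per abstract.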


##### <font color="teal">**4. Classic NLP pipeline**</font>


###### <font color="teal">**4.0 Re-tokenize using spacy**</font>

It is useful to take a look at spacy's [tokenizer documentation](https://spacy.io/usage/spacy-101#annotations-token)

###### <font color="teal">**4.1 Lemmatize using spacy**</font>

###### <font color="teal">**4.2 POS tagging using spacy, plot the trees**</font>

###### <font color="teal">**4.3 NER using spacy, give the amount of each entity type for a given abstract, visualize entities**</font>


##### <font color="teal">**5. Topic Modelling**</font>


###### <font color="teal">**5.1 Use Gensim's LDA to compute a topic model.**</font> 


###### <font color="teal">**5.2 Use PyLDAvis to visualise the topic model. What are the different topic clusters?**</font>


###### <font color="teal">**5.3 Use a tf-idf representation for each abstract, and use your favorite clustering algorithm.**</font>
