# inforet 1

## getting the data

Today, we will work with the UN General Debate dataset. The corpus consists of 7,507 speeches held at the annual sessions of the United Nations General Assembly from 1970 to 2016. It was created in 2017 by Mikhaylov, Baturo, and Dasandi at Harvard “for understanding and measuring state preferences in world politics.” Each of the almost 200 countries in the United Nations has the opportunity to present its views on global topics such as international conflicts, terrorism, or climate change at the annual General Debate.
Work on this data is proposed in the book accompanying this repository:

- https://github.com/blueprints-for-text-analytics-python/blueprints-text
- The original data file is here, but it's easier to use the version on my server (used in the download cell below):
  - https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/data/un-general-debates/un-general-debates-blueprint.csv.gz



## downloading some toy data

In [None]:
# check if the file un-general-debates-blueprint.csv is present
# if not, download it from the web and unzip it
import os

file_name = 'un-general-debates-blueprint.csv'
gz_file = file_name + '.gz'
url = 'https://gerdes.fr/saclay/inforet/' + gz_file

if os.path.exists(file_name):
    print('File already present')
else:
    print('Downloading the file...')
    os.system(f'curl -o {gz_file} {url}')
    os.system(f'gunzip {gz_file}')


If you have a problem with the above code,
you can also simply download the gzipped file, unzip it, and put it manually next to your notebook:

https://gerdes.fr/saclay/informationRetrieval/un-general-debates-blueprint.csv.gz

or try using wget:
```
!wget https://gerdes.fr/saclay/informationRetrieval/un-general-debates-blueprint.csv.gz
import gzip, shutil
with gzip.open('un-general-debates-blueprint.csv.gz', 'rb') as f_in:
    with open('un-general-debates-blueprint.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
```
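
As a further fallback, pandas can read a gzip-compressed CSV directly, so the unzip step can be skipped altogether. The sketch below demonstrates this on a tiny made-up file; `toy.csv.gz` and its contents are placeholders, not the real corpus.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the real corpus (hypothetical values)
toy = pd.DataFrame({"year": [1970, 1971], "text": ["first speech", "second speech"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv.gz")
    toy.to_csv(path, index=False)   # compression is inferred from the .gz extension
    toy_back = pd.read_csv(path)    # the same inference applies when reading

print(toy_back)
```

With the real file, `pd.read_csv('un-general-debates-blueprint.csv.gz')` should work the same way.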
      

In [None]:
# this turns on the autotimer, so that every cell shows timing information below its output
try:
    %load_ext autotime
except:
    !pip install ipython-autotime
    %load_ext autotime
# to stop using autotime, run the following command
# %unload_ext autotime

In [None]:
#!pip install wordcloud seaborn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from collections import Counter
from tqdm.notebook import tqdm
from wordcloud import WordCloud
import re

In [None]:
df = pd.read_csv("un-general-debates-blueprint.csv")
df.sample(22) #, random_state=53)

## Let's get to know the data (and Pandas):

In [None]:
df.columns, df.dtypes

In [None]:
df.describe().T

#### 🚧 todo: 
- explain 
	- why only two rows?
	- the strange row above and the values you find. Look into the data!

answers: 


In [None]:
df.info(memory_usage='deep')
# check the total memory usage compared to the original file size

## Adding length columns, describing the dataframe

In [None]:
df['nb_chars'] = df['text'].str.len()
df.describe().T

#### 🚧 todo: estimate the number of words

- what's the average word size in English? (Remember HoNLP, that class before the vacation?)
- what are the mean, min, and max of the estimated number of words?
- suppose that a page in 11pt font holds 600 words on average; what are these values in number of pages?
- suppose that on average, an English speaker pronounces 150 words per minute, what are the values for the duration of the speeches?


In [None]:
print(f'longest, shortest, and average speech in words: {df["nb_chars"]...
print(f'longest, shortest, and average speech in pages: {df["nb_chars"]...
print(f'longest, shortest, and average speech in minutes: {df["nb_chars"]...




#### 🚧 todo: add a word-count column (`nb_words`)

In [None]:
# # 🚧 todo: explain why this fails
# df['nb_words'] = df['text'].str.split().len()
# # 🚧 todo: explain why this fails
# df['nb_words'] = df['text'].str.split().apply(len)

# 🚧 todo: find a way of getting this column


df.describe().T

In [None]:
16424.0/2611.0

#### 🚧 todo: check the results

- how was our estimate of word length compared to reality?
- if your minimum number of words is now 0 or 1, explain by checking the file.
- does the simple tokenization by splitting give, on average, longer or shorter words than a more linguistically motivated tokenization?

In [None]:
# answer: 

In [None]:
df[['country', 'country_name', 'speaker', 'position']].describe().T

#### 🚧 TODO: 
- why does the describe() function work differently now?

## NaN ≠ NA
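NaN stands for Not a Number. It is the result of an undefined floating-point operation such as 0/0.

NA is generally interpreted as a missing value. In R it has various forms (NA_integer_, NA_real_, etc.); pandas mostly uses NaN to mark missing values, and `isna()` detects them.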

https://stats.stackexchange.com/questions/5686/what-is-the-difference-between-nan-and-na
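
To see how this plays out in Python, here is a small sketch on a made-up column (the values are placeholders). It shows that NaN is unequal even to itself, which is why the check is done with `isna()` and not with `==`, and that pandas treats both `None` and NaN as missing.

```python
import numpy as np
import pandas as pd

nan = float("inf") - float("inf")    # an undefined operation yields NaN
print(nan == nan)                    # NaN compares unequal to everything, itself included

# Toy column with two kinds of "missing" markers
s = pd.Series(["president", None, np.nan])
mask = s.isna()                      # isna() flags both None and NaN
print(mask.tolist())

filled = s.fillna("unknown")         # replace every missing value
print(filled.tolist())
```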

In [None]:
df.isna().sum()

In [None]:
df[df['position'].isna()]

In [None]:
df.fillna({'speaker': 'unknown', 'position': 'unknown'}, inplace=True)
df[df['position'].isna()]

# categorical values vs numerical values

In [None]:
df[df['speaker'].str.contains('Bush')]['speaker'].value_counts()

In [None]:
df['nb_words'].plot(kind='box', vert=False)


In [None]:
df['nb_words'].plot(kind='hist', bins=30) # , figsize=(8,2)

### Kernel density estimation

https://en.wikipedia.org/wiki/Kernel_density_estimation
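
To get a feeling for what a kernel density estimate computes, here is a minimal sketch with NumPy only. It places one Gaussian bump on every sample and averages them. The data are made up, and the fixed `bandwidth=200.0` is an assumption for illustration; seaborn chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
toy_lengths = rng.normal(loc=2600, scale=800, size=500)   # made-up "speech lengths"
grid = np.linspace(-2000, 7000, 901)
density = gaussian_kde_1d(toy_lengths, grid, bandwidth=200.0)

# A density should integrate to roughly 1 over a wide enough grid
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 3))
```

A smaller bandwidth gives a wigglier curve, a larger one a smoother curve; the histogram bins play a similar role.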

If you get the warning "FutureWarning: `distplot` is a deprecated function", try updating scipy: `pip3 install --upgrade scipy`

If the warning persists, you can silence it with the following cell:

In [None]:
# only if you got warnings!!!
# import warnings
# warnings.filterwarnings("ignore")

In [None]:
#plt.figure(figsize=(8, 2))
sns.histplot(df['nb_words'], bins=30, kde=True)


# Seaborn docs?
https://seaborn.pydata.org/index.html  
https://seaborn.pydata.org/generated/seaborn.histplot.html (replaces the deprecated `distplot`)

## from where?

catplot shows the relationship between a numerical and one or more categorical variables.
https://seaborn.pydata.org/generated/seaborn.catplot.html

In [None]:
sns.catplot(data=df, x="country", y="nb_words")

In [None]:
# how to build a selection:
df['country'].isin(['USA', 'FRA', 'GBR', 'CHN', 'RUS'])

In [None]:
# using the selection
where = df['country'].isin(['USA', 'FRA', 'GBR', 'CHN', 'RUS'])
sns.catplot(data=df[where], x="country", y="nb_words", kind='box')
sns.catplot(data=df[where], x="country", y="nb_words", kind='violin')

## significant differences?

Student test? Anova ?

As a rule of thumb, if the boxes (marking the quartiles) don't overlap each other and the sample size is at least 10, then the two groups being compared should have different medians at the 5% level: https://stats.stackexchange.com/questions/262495/reading-box-and-whisker-plots-possible-to-glean-significant-differences-between
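
A related visual check is the notch of a notched box plot. The sketch below computes the interval median ± 1.57 · IQR / √n, which, to my knowledge, is the formula matplotlib uses for notches when no bootstrap is requested. The two groups are made-up numbers, and `notch_interval` is a helper written for this illustration, not a library function.

```python
import numpy as np

def notch_interval(values):
    """Approximate 95% interval around the median: median ± 1.57 * IQR / sqrt(n)."""
    values = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(len(values))
    return med - half_width, med + half_width

rng = np.random.default_rng(1)
group_a = rng.normal(2000, 600, size=40)   # made-up word counts for one country
group_b = rng.normal(4000, 600, size=40)   # made-up word counts for another

lo_a, hi_a = notch_interval(group_a)
lo_b, hi_b = notch_interval(group_b)
notches_overlap = not (hi_a < lo_b or hi_b < lo_a)
print((lo_a, hi_a), (lo_b, hi_b), notches_overlap)
```

If the notches of two groups do not overlap, this is usually read as evidence that their medians differ; a proper test is still needed for a firm conclusion.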

In [None]:
sns.catplot(data=df[where], x="country", y="nb_words", kind='box', notch= True)

## time?

size() returns the number of rows per group  
Why number of countries?
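
Here is a small sketch of what `groupby` does, on a made-up frame with the same column names as ours (all values are placeholders).

```python
import pandas as pd

toy = pd.DataFrame({
    "year":     [1970, 1970, 1971, 1971, 1971],
    "country":  ["FRA", "USA", "FRA", "USA", "CHN"],
    "nb_words": [3000, 2500, 2800, 2600, 3100],
})

rows_per_year = toy.groupby("year").size()               # counts the rows in each group
mean_per_year = toy.groupby("year")["nb_words"].mean()   # one aggregated value per group
print(rows_per_year)
print(mean_per_year)
```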

In [None]:
df.groupby('year').size().plot(title="Number of Countries")

when more people want to speak, ...?

In [None]:
df.groupby('year').agg({'nb_words': 'mean'}).plot(title="Avg. Speech Length", ylim=(0,5000))

In [None]:
where = df['country'].isin(['USA', 'FRA', 'GBR', 'CHN', 'RUS', 'FRG', 'DEU'])
sns.catplot(data=df[where], x="country", y="nb_words", kind='box', notch= True)

## 🚧 todo: When speaking English, do Germans use longer words?

- Compare to British natives, US natives, and French speakers. 
- Is the result significant?
- How do you explain this?

In [None]:
# 🚧 todo:
df['avg_wordsize'] = ...

In [None]:
# 🚧 todo:
where = df['country'].isin...

#### 🚧 todo:
answer: 



# Let's Zipf it!
## skim through this section if you have followed Hands-on NLP!
but execute the code so that we have the freq_df and start again at word clouds
### Let's first flatten the list

In [None]:
all_words = [word for speech in df['text'].dropna() for word in re.findall(r'\b\w+\b', speech.lower())]
len(all_words)

In [None]:
text = "Wait... what?! What? WHAT?! You're telling me that 99.9% of statistics—including this one—are made up?! Made up, I say! Completely, absolutely, 100% made up!"
counter = Counter(re.findall(r'\b\w+\b', text.lower()))
counter

### What are the most common words of English?

In [None]:
counter = Counter(all_words)
counter.most_common(22)

for even bigger databases, it might be advisable to do the computation iteratively:

In [None]:
counter = Counter()
df['text'].dropna().apply(lambda text: counter.update(re.findall(r'\b\w+\b', text.lower())))
counter.most_common(22)

In [None]:
freq_df = pd.DataFrame.from_dict(counter, orient='index', columns=['freq'])
freq_df.sort_values('freq',  inplace=True, ascending=False)
freq_df

In [None]:
freq_df.head(22).plot(kind='bar')


In [None]:
freq_df.head(2222).plot()

In [None]:
freq_df.head(2222).plot(loglog=True)

further reading:  
https://en.wikipedia.org/wiki/Zipf's_law  
https://stats.stackexchange.com/questions/6780/how-to-calculate-zipfs-law-coefficient-from-a-set-of-top-frequencies
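
A straight line in the log-log plot suggests a power law. As a rough sketch, the exponent can be estimated with a linear fit on the logarithms. The frequencies below are idealised, made-up values that follow Zipf's law exactly, so the fit recovers an exponent of 1.

```python
import numpy as np

ranks = np.arange(1, 1001)
toy_freqs = 50_000 / ranks          # idealised Zipf frequencies (made up), exponent 1

# Fit log(freq) = intercept + slope * log(rank)
slope, intercept = np.polyfit(np.log(ranks), np.log(toy_freqs), deg=1)
zipf_exponent = -slope
print(round(zipf_exponent, 3))
```

On the real data, `toy_freqs` could be replaced by `freq_df['freq'].values` (already sorted in descending order). A least-squares fit on log-log values is a crude estimator; the discussion linked above covers better ones.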

# Word cloud

http://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html#wordcloud.WordCloud

In [None]:
text = df.query("year==2015 and country=='USA'")['text'].values[0]
wc = WordCloud(max_words=100)
wc.generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

In [None]:
plt.subplots(1, 2, figsize=(20, 4))

text = df.query("country=='USA'")['text'].values[0]
wc = WordCloud(max_words=100)
wc.generate(text)
plt.subplot(1, 2, 1)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

text = df.query("country=='RUS'")['text'].values[0]
wc = WordCloud(max_words=100)
wc.generate(text)

plt.subplot(1, 2, 2)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

plt.tight_layout()

In [None]:
wc = WordCloud(max_words=100, stopwords=freq_df.head(50).index)
wc.generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

The `generate_from_frequencies` method builds the cloud directly from a Counter; the stopword filter is not applied in this case:

In [None]:
wc.generate_from_frequencies(counter)
plt.title('from counter')
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")

# Index

We want to build an inverted index:
- make a df such that for every type, we have a 1 if the document contains the type, 0 if not.
- for every type, give a list of document ids
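
The two views can be sketched on a tiny made-up corpus (the three documents below are placeholders). The tokenization uses the same regex as in the rest of this notebook.

```python
from collections import defaultdict
import re

toy_docs = {
    0: "Peace and security for all nations",
    1: "Climate change threatens peace",
    2: "Security of nations and climate",
}

# type -> sorted list of ids of the documents that contain it
postings = defaultdict(set)
for doc_id, text in toy_docs.items():
    for token in set(re.findall(r'\b\w+\b', text.lower())):
        postings[token].add(doc_id)
inverted_index = {term: sorted(ids) for term, ids in sorted(postings.items())}

# Equivalent 0/1 view: one row per type, one column per document
incidence = {term: [int(d in ids) for d in toy_docs]
             for term, ids in inverted_index.items()}

print(inverted_index["peace"])
print(incidence["climate"])
```

The list view only stores the documents that actually contain a type, whereas the 0/1 view stores one cell for every type and document pair.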

# 🚧 todo:
- how many types do we have?
- how many documents do we have?

In [None]:
print(...,'types')
print(...,'documents')


In [None]:
list(freq_df.index[66:77])

In [None]:
df[33:36]

In [None]:
A = np.zeros((11, 3))
A.nbytes

we will first try the naïve way, to find out that this easily gets too slow:

In [None]:
for i,t in enumerate(freq_df.index[66:77]):
    for j, text in enumerate(df['text'][33:36]):
        # Tokenize text using regex
        tokens = set(re.findall(r'\b\w+\b', text.lower()))  # Use set for faster lookup
        if t in tokens:
            A[i, j] = 1  # Mark presence of token in text
A

In [None]:
A.nbytes

In [None]:
A = np.zeros((100, 7507)) # understand this: 100 most frequent words, 7507 speeches
for i,t in tqdm(enumerate(freq_df.index[:100])):
    for j, text in enumerate(df['text'][33:100]): # play with the range to see how slow your machine is
        tokens = set(re.findall(r'\b\w+\b', text.lower())) 
        if t in tokens:
            A[i, j] = 1
# can you do that loop more efficiently? This is not an obligatory task.
A
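
One possible way to speed up the loop, shown as a sketch on made-up data: the loop above tokenizes every text again for every word, whereas here each text is tokenized only once and a dictionary gives the row of each term. The names `toy_texts`, `toy_vocab`, `row_of`, and `B` are placeholders for this illustration.

```python
import re
import numpy as np

toy_texts = ["The peace of nations", "Nations and the people", "Security"]
toy_vocab = ["the", "nations", "peace"]

row_of = {term: i for i, term in enumerate(toy_vocab)}   # term -> row number
B = np.zeros((len(toy_vocab), len(toy_texts)))

for j, text in enumerate(toy_texts):
    tokens = set(re.findall(r'\b\w+\b', text.lower()))   # tokenized once per document
    for term in row_of.keys() & tokens:                  # only terms that are in the vocabulary
        B[row_of[term], j] = 1
print(B)
```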

In [None]:
A.nbytes

### 🚧 todo:

What would be the size of the complete table?


In [None]:
# 🚧 todo:
A = 

### 🚧 todo:

How long will it take to fill the complete table?


In [None]:
# 🚧 todo:
# i take 9 seconds per 100, should be about linear
...,'seconds',...,'minutes', ...,'hours'


### redoing the same thing with CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

df[33:36].text

In [None]:
vectorizer = CountVectorizer(vocabulary=freq_df.index[66:77], binary=True, min_df=1, lowercase=False)
# understand the options: 
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
X = vectorizer.fit_transform(df[33:36].text)
print(vectorizer.get_feature_names_out())
print(X.toarray())


In [None]:
# make it pretty:
d = {c:X.toarray()[i] for i,c in enumerate(df[33:36].index)}
df_cv = pd.DataFrame.from_dict(d,  orient='index',columns=freq_df.index[66:77])
df_cv

## trying the complete set of documents with the complete vocabulary

In [None]:
vectorizer = CountVectorizer(vocabulary=freq_df.index, binary=True, min_df=1, lowercase=False)
X = vectorizer.fit_transform(df['text'].dropna())
print(len(vectorizer.get_feature_names_out()))
print(vectorizer.get_feature_names_out()[:11])
print(X.toarray())  # caution: toarray() builds the full dense matrix, which can need a lot of memory

- wow! comparably fast!
#### 🚧 todo:
- can you get the vector of "the"? is there a speech that doesn't use it?


answer: 

#### 🚧 todo: some visualizations of the vectorization

- make 2D scatterplots of the vectorization using PCA and t-SNE.
- use the years as hue
- explain why this looks so different
- hard: choose a cluster that looks mainly stemming from earlier texts, another stemming from recent texts, and find a few examples of terms that makes them different.

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca = PCA(n_components=2)
...

In [None]:
tsne = TSNE(n_components=2, perplexity=30, random_state=42)  # n_iter was renamed max_iter in recent scikit-learn; the default is 1000 iterations
...

# analyze two clusters of the PCA plot: top right and bottom left:

In [None]:
# Define clusters based on PCA components
cluster_1 = pca_df[pca_df["PC1"] > 0.8].index
cluster_2 = pca_df[(pca_df["PC1"] < -0.2) & (pca_df["PC2"] > 0.8)].index

# Extract corresponding texts
...
diff_df = diff_df.sort_values(by=[ "Cluster 2 (PC1 < -0.2, PC2 > 0.8)", "Cluster 1 (PC1 > 0.8)"], ascending=False)
# Show distinctive words
diff_df.head(10)
# of both clusters (since we have a binary vectorizer, we only get present and absent words)


# another big vocabulary:
- we could grab a pageview file here https://dumps.wikimedia.org/other/pageviews/2022/2022-01/ and produce a list of potential terms from it
- it's easier to use Wikidata, and we concentrate on people:

Here is code that grabs the data and produces a file of person names. This API is unstable, so I propose downloading the result directly from my website; see the code below.

In [None]:
# you can skip this cell if you are only interested in the result, see next cell
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

def fetch_wikidata_humans(limit=10000, offset=0):
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    query = f"""
    SELECT ?human ?humanLabel WHERE {{
      ?human wdt:P31 wd:Q5.  # Humans (Q5)
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
    }}
    LIMIT {limit}
    OFFSET {offset}
    """
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return results

# Fetch the first batch
humans_data = []
offset = 0
batch_size = 5000  # Fetch this batch size at a time

while True:
    print(f"Fetching humans from offset {offset}")
    results = fetch_wikidata_humans(limit=batch_size, offset=offset)
    if "results" in results and "bindings" in results["results"]:
        batch = results["results"]["bindings"]
        if not batch:
            break  # Stop if no more results

        for result in batch:
            humans_data.append({
                "Wikidata ID": result["human"]["value"].split("/")[-1],
                "Name": result["humanLabel"]["value"]
            })
        offset += batch_size  # Move to the next batch
    else:
        break  # Stop if no valid response

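# Convert to a DataFrame; use a separate name so that the speeches in `df` are not overwritten
import pandas as pd
names_df = pd.DataFrame(humans_data)

# write the Name column to a file, one name per line (no header line)
names_df['Name'].to_csv('wikidata_names.txt', index=False, header=False)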

In [None]:
file_name = 'wikidata_names.txt'
zip_file = file_name + '.zip'
url = 'https://gerdes.fr/saclay/inforet/' + zip_file

if os.path.exists(file_name):
    print('File already present')
else:
    print('Downloading the file...')
    os.system(f'curl -o {zip_file} {url}')
    os.system(f'unzip {zip_file}')

In [None]:
# read the file back into a simple list, one item per line
with open('wikidata_names.txt') as f:
    names = set(f.read().splitlines())
names

In [None]:
vectorizer = CountVectorizer(vocabulary=names, binary=True, min_df=1, lowercase=False, ngram_range=(1,4))
X = vectorizer.fit_transform(df.text.dropna())
print(len(vectorizer.get_feature_names_out()))
print(vectorizer.get_feature_names_out()[:11])
X

#### 🚧 todo: 
- find the most frequently cited names
- analyze who cites
- analyze the length of the cited names in tokens


In [None]:
...

In [None]:
...

In [None]:
...

- check this: https://en.wikipedia.org/wiki/Kofi_Annan

# Homework

Complete all the 🚧 todo items

and
## find the most frequently encountered person entity
- in number of speeches
- in number of occurrences




### Before submitting, check:
- I have not imported any other modules
- I have put explanations between the lines of code (either inline or in separate cells)
- My notebook runs all the way through when I hit
  1. the ↻ button and then
  2. the ⏩︎ button (remove or comment out cells that are too slow and not needed).
  