<h1><center>Final Tutorial Notebook</center></h1>
<h3><center>Emily Gong, Robert Morrison</center></h3>

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f6/Flag_of_the_United_Nations_%281945-1947%29.svg/2000px-Flag_of_the_United_Nations_%281945-1947%29.svg.png" width="600">

# Outline
- ### Introduction
    - Overview
    - Tools
- ### Data Collection
    - Loading Data
    - Cleaning Data **(@Robbie please make sure these categories match what you did or edit it to do so)**
    - Scraping Data  

- ### Exploratory Data Analysis
    - Explore Words
    - Explore Entities
    - Explore Topics
    - Explore Votes 
- ### Analysis & ML
    - t-SNE
- ### Conclusions and Further Actions
    - Challenges
    - Conclusions
    - Additional Resources

# Introduction

## Overview
**Background:**  
The United Nations is an international organization of independent states that promotes peace, security, and international cooperation. It currently has 193 member states, which meet every year in regular session on the Tuesday of the third week in September. Each year, the UN chooses a theme to focus on. This year’s theme is “Making the United Nations relevant to all people: global leadership and shared responsibilities for peaceful, equitable and sustainable societies”. In this notebook we take up the challenge of making the UN relevant by exploring the transcripts of the general debates and the votes cast by each country. 

**Purpose:**  
The original intention was to look at the transcripts from the debate and predict the country the speaker was from. However, after analyzing the dataset there were challenges that redirected us to find a different purpose. These challenges are mentioned later. After reevaluating, we have decided to ... {UPDATE TEXT HERE}


**Datasets:**
- The UN General Debates dataset includes the general debates at the UN from 1970 - 2016. Each record includes the session ID, year, country (ISO 3166 Alpha-3 country code), and transcript. 
    - https://www.kaggle.com/unitednations/un-general-debates?fbclid=IwAR3Onf0CDZsw-dCwNlmj85hfvq12KavEpkyIkKN-FGguAuIyzpaqwvREhC0

- The UN General Assembly Votes dataset includes three separate CSV files: resolutions, states, and votes. It covers 1946-2015 and contains each country's votes across those years, including votes on topics such as human rights, nuclear development, and a few other categories. 
    - https://www.kaggle.com/unitednations/general-assembly


**Personal Investment:**  
On a personal level, we are interested in this data because many NLP tools are available, and we want to use them to better understand how language can be processed computationally. (Note: Emily is a Computer Science and Linguistics major.) 


## Tools
**Processing**  
- NLTK
    - suite of libraries for working with human language
    - use cases:
        - tokenization: split a corpus into tokens (words)
        - part of speech tagging: label each token with its part of speech
        - chunking: segment and label multi-token sequences
    - widely used for teaching and research
- spaCy
    - advanced library for NLP
    - use cases:
        - entity detection
            - model of English
    - widely used for production usage

**Visualization** 
- gensim
    - building the topic models (LDA) that are visualized
- pyLDAvis
    - visualizing topic models
- bokeh
    - Visualizing LDA results 

**Imports**


In [1]:
import numpy as np
import pandas as pd
from collections import OrderedDict


# regular expressions
import re

# scraping 
import requests
import bs4

# processing
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import spacy
from spacy import displacy #display word entities
import en_core_web_sm #language model
from collections import Counter
nlp = en_core_web_sm.load() 

# Below are libraries for LDA using gensim, which provides less control
from nltk.corpus import stopwords #stop words to be filtered out
from gensim import models, corpora  #Used for LDA topic modeling
from sklearn.metrics.pairwise import euclidean_distances 
import pyLDAvis.gensim #python library for interactive topic model visualization

pyLDAvis.enable_notebook()

# evaluating the model
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

#Below are libraries for LDA using sklearn
# (TfidfVectorizer and CountVectorizer are already imported above)

# Loading Data
## Raw Data

In [21]:
debates = pd.read_csv("un-general-debates.csv")
debates.head()

Unnamed: 0,session,year,country,text
0,44,1989,MDV,﻿It is indeed a pleasure for me and the member...
1,44,1989,FIN,"﻿\nMay I begin by congratulating you. Sir, on ..."
2,44,1989,NER,"﻿\nMr. President, it is a particular pleasure ..."
3,44,1989,URY,﻿\nDuring the debate at the fortieth session o...
4,44,1989,ZWE,﻿I should like at the outset to express my del...


In [22]:
print(len(debates))

7507


In [23]:
countries = debates["country"].unique()
print(countries)
print(len(countries))

['MDV' 'FIN' 'NER' 'URY' 'ZWE' 'PHL' 'SDN' 'RUS' 'CHN' 'ESP' 'SUR' 'ARG'
 'SLV' 'MYS' 'NPL' 'PRT' 'COL' 'BLR' 'MAR' 'LCA' 'EGY' 'MEX' 'BEL' 'BRN'
 'RWA' 'CAN' 'ALB' 'GRC' 'KNA' 'GUY' 'LBR' 'ATG' 'MOZ' 'JPN' 'YDYE' 'GAB'
 'BGD' 'SWE' 'TUR' 'TCD' 'SYR' 'CMR' 'JAM' 'LUX' 'ITA' 'AGO' 'CRI' 'CSK'
 'BFA' 'MNG' 'BHR' 'HTI' 'OMN' 'CIV' 'TGO' 'CYP' 'MUS' 'MMR' 'ARE' 'GTM'
 'GRD' 'LBY' 'LKA' 'TZA' 'SGP' 'NOR' 'LAO' 'ISL' 'AFG' 'CHL' 'DMA' 'UKR'
 'KEN' 'BLZ' 'FRA' 'MLI' 'VCT' 'VEN' 'MLT' 'GHA' 'GIN' 'GBR' 'ISR' 'YUG'
 'BRB' 'IRQ' 'HUN' 'AUT' 'POL' 'GNB' 'BWA' 'MRT' 'SWZ' 'DNK' 'DOM' 'MDG'
 'NIC' 'BDI' 'CUB' 'IRN' 'PAK' 'SEN' 'BGR' 'YEM' 'STP' 'NLD' 'VUT' 'BOL'
 'PNG' 'SLB' 'DEU' 'ROU' 'KHM' 'TUN' 'BRA' 'IND' 'IDN' 'AUS' 'COD' 'HND'
 'GNQ' 'FJI' 'IRL' 'DZA' 'USA' 'LSO' 'GMB' 'PER' 'DDR' 'THA' 'JOR' 'COG'
 'NGA' 'ECU' 'SAU' 'QAT' 'SYC' 'ETH' 'TTO' 'PRY' 'VNM' 'NZL' 'PAN' 'MWI'
 'DJI' 'BEN' 'SOM' 'ZMB' 'CPV' 'BHS' 'KWT' 'UGA' 'COM' 'ZAF' 'LBN' 'SLE'
 'KOR' 'BIH' 'TON' 'EU' 'HRV' 'NRU' 'TUV' 'NAM' 'S

In [24]:
years = debates["year"].unique()
print(years)
print(len(years))
# 46 distinct years, so at most 46 speeches per country

[1989 1970 2013 1985 2008 1991 1986 2002 1975 1996 2012 1997 1978 1988
 2010 1984 1995 2009 1971 1976 1983 1979 1999 2005 1987 1982 1998 2003
 2004 1980 2014 2011 1974 2015 1993 1977 1981 2000 1992 1990 1973 1994
 1972 2006 2007 2001]
46


## Scraping

In [25]:
# Getting a decoding table and access to each country's Wikipedia page
url = "https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3"
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, "html.parser")  # name the parser explicitly to avoid bs4's warning

In [26]:
found = soup.find("div", class_="plainlist").find("ul")

In [27]:
lookup = {} # dict {country code : [Country Name, wikipedia link]}

for child in found.children:
    if isinstance(child, bs4.Tag):
        c_code = child.find("span").text
        link = child.find("a").get("href")
        country = child.find("a").text
        
        lookup[c_code] = [country, link]

lookup["CSK"] = ["Czechoslovakia", "/wiki/Czechoslovakia"]
lookup["YDYE"] = ["South Yemen", "/wiki/South_Yemen"]
lookup["YUG"] = ["Yugoslavia", "/wiki/Yugoslavia"]
lookup["DDR"] = ["East Germany", "/wiki/East_Germany"]
lookup["EU"] = ["European Union", "/wiki/European_Union"]

In [28]:
debates["name"] = [lookup[code][0] for code in debates["country"]]
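The list comprehension above raises a `KeyError` for any code that is missing from the scraped table, which is why the five historical codes are added by hand. As a minimal sketch of a more forgiving lookup, the example below uses a small hypothetical table (`lookup_demo`) and a hypothetical helper (`decode`) rather than the scraped data, and reports which codes still need a manual entry.

```python
# Hypothetical miniature of the scraped lookup table: code -> [name, wiki link]
lookup_demo = {
    "FIN": ["Finland", "/wiki/Finland"],
    "MDV": ["Maldives", "/wiki/Maldives"],
}

def decode(code, table):
    """Return the country name for a code, or the code itself if it is unknown."""
    entry = table.get(code)
    return entry[0] if entry is not None else code

codes = ["FIN", "MDV", "YDYE"]
names = [decode(c, lookup_demo) for c in codes]

# Codes that are absent from the table and still need a manual entry
missing = sorted({c for c in codes if c not in lookup_demo})

print(names)
print(missing)
```

Listing the missing codes first makes it easier to decide which historical states to add before building the `name` column.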

# EDA

In [29]:
# Retrieving the first transcript
sample_text = debates.loc[0]["text"]
sample_text = sample_text.replace(u'\ufeff', '')
print(sample_text[0:100])

It is indeed a pleasure for me and the members of my delegation to extend to Ambassador Garba our si


### Tokenizing debates

In [30]:
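def clean_debates(debates):
    """Tokenize each transcript, dropping stop words and tokens that fail the letter/hyphen filter."""
    docs = []
    stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
    token_pattern = re.compile(r'[a-zA-Z\-][a-zA-Z\-]{2,}')  # must start with 3+ letters or hyphens
    for transcript in debates['text']: 
        if isinstance(transcript, str):
            transcript = transcript.replace('\ufeff', '')  # strip byte-order marks
            tokens = nltk.word_tokenize(transcript.lower())
            docs.append([t for t in tokens if t not in stop_words and token_pattern.match(t)])
    return docs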

In [31]:
transcripts = clean_debates(debates)
print(transcripts[0])
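To make the effect of the token filter concrete, here is a small standalone sketch. It is a simplification under stated assumptions: it splits on whitespace instead of using `nltk.word_tokenize`, and `STOP_WORDS` is a tiny hand-written list standing in for NLTK's English stop words. The regular expression is the same one used in `clean_debates`.

```python
import re

# Assumption: a tiny hand-written stop-word list standing in for NLTK's English list
STOP_WORDS = {"the", "and", "for", "our", "is", "to", "of", "a"}

# Same pattern as clean_debates: the token must start with 3+ letters or hyphens
TOKEN_PATTERN = re.compile(r"[a-zA-Z\-][a-zA-Z\-]{2,}")

def simple_clean(text):
    """Lowercase, split on whitespace, and keep tokens that pass the filter."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS and TOKEN_PATTERN.match(t)]

cleaned = simple_clean("It is a pleasure to address the 44th session of the Assembly")
print(cleaned)
```

Short words such as "it" and tokens that start with a digit such as "44th" are dropped. Because `re.match` only anchors at the start of the token, a token that begins with three letters and ends with digits would still pass.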



In [32]:
# dictionary contains unique words and frequency
dictionary = corpora.Dictionary(transcripts)
print(len(dictionary))

73057


### Vectorize data

In [None]:
# Convert document into the bag-of-words (BoW) format = list of (token_id, token_count) tuples.
corpus = [dictionary.doc2bow(transcript) for transcript in transcripts]

count_vect = CountVectorizer(max_df=.90, min_df=2, max_features=1000)
tfidf_vect = TfidfVectorizer(max_df=.90, min_df=2, max_features=1000)

counts = count_vect.fit_transform(debates["text"][:100])
count_total = np.array(counts.sum(axis=0))[0]
count_vocab = count_vect.vocabulary_
# In scikit-learn 1.0 and later, use get_feature_names_out() instead
count_feats = np.array(count_vect.get_feature_names())

tfidfs = tfidf_vect.fit_transform(debates["text"][:100])
tfidf_vocab = tfidf_vect.vocabulary_
tfidf_feats = np.array(tfidf_vect.get_feature_names())

print('Number of unique tokens: %d' % len(dictionary))
print('Number of debates: %d' % len(corpus))
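The cell above relies on library calls for both representations. As a rough from-scratch sketch of what they compute, the example below builds a bag of words and a smoothed, L2-normalised TF-IDF vector for a tiny hypothetical corpus. The helper names (`doc2bow`, `tfidf`) and the toy documents are illustrative only; the IDF formula is the smoothed variant documented for scikit-learn, and this is not the libraries' actual implementation.

```python
import math
from collections import Counter

# Hypothetical tokenized documents
docs = [
    ["peace", "security", "peace"],
    ["development", "security"],
    ["peace", "development", "climate"],
]

# Map each unique token to an integer id (sorted for reproducibility)
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def doc2bow(doc):
    """Bag of words: sorted list of (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc)
    return sorted(counts.items())

def tfidf(doc, all_docs):
    """Smoothed TF-IDF weights for one document, scaled to unit L2 norm."""
    n = len(all_docs)
    raw = {}
    for tok, tf in Counter(doc).items():
        df = sum(1 for d in all_docs if tok in d)   # document frequency
        idf = math.log((1 + n) / (1 + df)) + 1      # smoothed inverse document frequency
        raw[tok] = tf * idf
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items()}

bow = doc2bow(docs[0])
weights = tfidf(docs[0], docs)
print(bow)
print(weights)
```

In the first document "peace" occurs twice and "security" once; both appear in two of the three documents, so they share the same IDF and "peace" ends up with exactly twice the weight of "security".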

### POS Tagging Debates

In [None]:
# pos = [nltk.pos_tag(transcript) for transcript in transcripts[:1]]
# print(pos[0][:20])

## Exploring Entities

In [43]:
nlp = spacy.load('en_core_web_sm')
# spaCy expects a single string, so analyze one transcript rather than the list of token lists
doc = nlp(sample_text)
print(len(doc.ents))

print([(X.text, X.label_) for X in doc.ents[:20]])

labels = [x.label_ for x in doc.ents]
print(Counter(labels))

items = [x.text for x in doc.ents]
print(Counter(items).most_common(3))

sentences = [x for x in doc.sents]
print(sentences[21])

displacy.render(nlp(str(sentences[21])), jupyter=True, style='ent')

## Exploring Topics

### Building LDA Model

In [None]:
num_topics = 2

# Suggestion from https://www.kaggle.com/ykhorramz/lda-and-t-sne-interactive-visualization
eval_every = None  # Don't evaluate model perplexity; it takes too much time.

lda_model = models.LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary, eval_every=eval_every)

In [None]:
# print("LDA Model:")
 
# for idx in range(num_topics):
#     # Print the 10 most representative words for each topic
#     print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
# print("=" * 20)

In [None]:
def most_similar(x, Z, top_n=5):
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    most_similar = sorted(pairs, key=lambda item: item[1])[:top_n]  # ascending: closest first
    return most_similar
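A quick way to sanity-check a nearest-neighbour helper like this is to run it on a handful of hand-made vectors. The sketch below is a standard-library variant (the name `most_similar_plain` and the toy topic distributions are hypothetical) that uses `math.dist` instead of scikit-learn, so the expected ordering is easy to verify by eye.

```python
import math

def most_similar_plain(x, Z, top_n=5):
    """Return (row_index, distance) pairs for the top_n rows of Z closest to x."""
    dists = [(i, math.dist(x, row)) for i, row in enumerate(Z)]
    return sorted(dists, key=lambda item: item[1])[:top_n]

# Hypothetical two-topic distributions for four documents
Z = [[0.9, 0.1], [0.2, 0.8], [0.85, 0.15], [0.5, 0.5]]

nearest = most_similar_plain([0.9, 0.1], Z, top_n=2)
print(nearest)
```

The query is identical to row 0, so that row comes first with distance zero, followed by row 2, which differs only slightly.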


In [None]:
pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

In the left panel, labeled Intertopic Distance Map, circles represent the topics, and the distance between circles reflects how similar the topics are: similar topics appear closer together and dissimilar topics farther apart. The relative size of a topic's circle corresponds to the relative frequency of the topic in the corpus. An individual topic can be selected for closer scrutiny by clicking on its circle or by entering its number in the "selected topic" box in the upper left.


The right panel contains a bar chart of the top 30 terms. When no topic is selected in the plot on the left, the bar chart shows the 30 most "salient" terms in the corpus. A term's saliency measures both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between topics. Selecting a topic on the left changes the bar chart to show the most "relevant" terms for that topic. Relevance is defined as in footnote 2 and is tuned by the parameter λ: a smaller λ gives more weight to a term's distinctiveness, while a larger λ gives more weight to the probability of the term occurring in the topic.

Therefore, to get a better sense of the terms that distinguish each topic, we'll use λ = 0.
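To illustrate how λ changes the ranking, here is a minimal sketch of the relevance score as defined by the LDAvis authors: λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). The two terms and their probabilities are made up for illustration and do not come from the fitted model.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical probabilities: term -> (p(w | topic), p(w) in the whole corpus)
terms = {
    "nations": (0.050, 0.045),      # frequent everywhere
    "disarmament": (0.020, 0.004),  # rarer, but concentrated in this topic
}

def rank(lam):
    """Order the terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

for lam in (1.0, 0.0):
    print(lam, rank(lam))
```

With λ = 1 the ranking follows raw within-topic probability, so the generic term comes first. With λ = 0 the ranking follows the ratio p(w | t) / p(w), so the term concentrated in the topic comes first.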


In [None]:
lda_corpus = lda_model[corpus] 

def explore_topic(lda_model, topic_number, topn, output=True):
    """
    accept a ldamodel, a topic number and top n vocabs of interest
    prints a formatted list of the top n terms
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))
    
    return terms

topic_summaries = []
print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')
for i in range(num_topics):
    print('Topic '+str(i)+' |---------------------\n')
    tmp = explore_topic(lda_model, topic_number=i, topn=10, output=True)
    topic_summaries += [tmp[:5]]
    print()

In [None]:
top_labels = {'Disarmament???'}
print(len(debates['text']))

In [None]:

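# Assumption: `tokens` was undefined in this cell, so nltk.word_tokenize is used as the tokenizer
tvectorizer = TfidfVectorizer(input='content', analyzer='word', lowercase=True, stop_words='english',
                              tokenizer=nltk.word_tokenize, ngram_range=(1, 3), min_df=40, max_df=0.20,
                              norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=True)

tfidf = tvectorizer.fit_transform(debates['text'])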

In [None]:
# Split each tokenized transcript into two halves (doc2bow needs token lists, not raw strings)
docs1 = [doc[:len(doc) // 2] for doc in transcripts]
docs2 = [doc[len(doc) // 2:] for doc in transcripts]

corpus1 = [dictionary.doc2bow(doc) for doc in docs1]
corpus2 = [dictionary.doc2bow(doc) for doc in docs2]

# Using the corpus LDA model transformation
lda_corpus1 = lda_model[corpus1]
lda_corpus2 = lda_model[corpus2]

In [None]:
def get_doc_topic_dist(model, corpus, kwords=False):
    """Return a dense topic-probability vector for each bag-of-words document in corpus."""
    top_dist = []
    keys = []
    for d in corpus:
        tmp = {i: 0 for i in range(num_topics)}
        tmp.update(dict(model[d]))
        vals = np.array([tmp[i] for i in range(num_topics)])
        top_dist.append(vals)
        if kwords:
            keys.append(vals.argmax())  # index of the dominant topic
    return np.array(top_dist), keys

# Pass the bag-of-words halves; the function applies the model itself
top_dist1, _ = get_doc_topic_dist(lda_model, corpus1)
top_dist2, _ = get_doc_topic_dist(lda_model, corpus2)


print("Intra similarity: cosine similarity for corresponding parts of a doc (higher is better):")
print(np.mean([cosine_similarity(c1.reshape(1, -1), c2.reshape(1, -1))[0][0] for c1, c2 in zip(top_dist1, top_dist2)]))

random_pairs = np.random.randint(0, len(top_dist1), size=(400, 2))

print("Inter similarity: cosine similarity between random parts (lower is better):")
print(np.mean([cosine_similarity(top_dist1[i[0]].reshape(1, -1), top_dist2[i[1]].reshape(1, -1))[0][0] for i in random_pairs]))
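The idea behind this check is that the two halves of the same speech should have similar topic distributions, while halves of unrelated speeches should not. The sketch below works through that comparison on a few hypothetical two-topic distributions, using a hand-written `cosine` helper instead of scikit-learn so every step is visible. The numbers are invented and are not results from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical topic distributions for the two halves of three documents
first_halves = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
second_halves = [[0.8, 0.2], [0.1, 0.9], [0.7, 0.3]]

# Intra: matching halves of the same document
intra_scores = [cosine(a, b) for a, b in zip(first_halves, second_halves)]
intra = sum(intra_scores) / len(intra_scores)

# Inter: halves taken from different documents
inter_scores = [
    cosine(a, b)
    for i, a in enumerate(first_halves)
    for j, b in enumerate(second_halves)
    if i != j
]
inter = sum(inter_scores) / len(inter_scores)

print(intra, inter)
```

With only two topics the distributions have little room to differ, so the gap between the two averages can be small; a larger number of topics usually makes the comparison more informative.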

## Exploring Votes

# Analysis & ML

## t-SNE

# Conclusion and Further Action

## Challenges

**Data:**  
After loading our data, we realized there were some issues as we were cleaning it.
- Countries have not stayed consistent throughout history.
    - They have broken apart, merged, and so on.
- UN representatives also change over time.
    - The rhetoric of the transcripts for each country will not be consistent.
- The formal language at the UN constrains much of the variation that occurs in ordinary colloquial speech.
- These transcripts are translated.
    - There is evidence that meaning can be lost in translation.
    
From this, we concluded that it was not clear that the transcripts could provide insight into which country was speaking. 
