# Establishing Similarity of Countries

The goal of this project is to find the ten most similar and the ten least similar countries in a given year, based on the transcripts of the UN General Debates from 1970 to 2016. Only African countries are considered, as the focus is on the African continent.

The similarity function created here can be reused in other contexts by changing its inputs.

## Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import operator
import re
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline
!conda install -c conda-forge folium=0.5.0 --yes
import folium
country_geo = 'world-countries.json'
 

 
 

 


- Peek at the data

In [2]:

data = pd.read_csv('un-general-debates.csv')
data.head()

Unnamed: 0,session,year,country,text
0,44,1989,MDV,﻿It is indeed a pleasure for me and the member...
1,44,1989,FIN,"﻿\nMay I begin by congratulating you. Sir, on ..."
2,44,1989,NER,"﻿\nMr. President, it is a particular pleasure ..."
3,44,1989,URY,﻿\nDuring the debate at the fortieth session o...
4,44,1989,ZWE,﻿I should like at the outset to express my del...


## Linking and merging to enable a more elaborate analysis
- Source of the ISO code dataset:
  https://unstats.un.org/unsd/methodology/m49/overview/
- The debates dataset identifies countries by their 3-letter ISO alpha-3 codes
- To convert the ISO codes into country names, a left join was performed against the ISO code dataset
- The ISO code dataset also includes the region (continent), which enables filtering for African countries
- The duplicate column "ISO-alpha3 Code" was removed after the join
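
The left join described above can be sketched on two miniature tables. The rows below are made up for illustration (including the deliberately unknown code `XXX`) and are not taken from the real files; the point is only that a left join keeps every debate row and leaves `NaN` where no ISO code matches, which is worth checking before filtering by region.

```python
import pandas as pd

# Toy stand-ins for the two datasets (illustrative rows only).
debates = pd.DataFrame({
    'year': [1989, 1989, 1989],
    'country': ['NER', 'FIN', 'XXX'],  # 'XXX' is a deliberately unknown code
})
codes = pd.DataFrame({
    'Region Name': ['Africa', 'Europe'],
    'Country or Area': ['Niger', 'Finland'],
    'ISO-alpha3 Code': ['NER', 'FIN'],
})

# A left join keeps every debate row, even when no ISO code matches.
merged = pd.merge(debates, codes, how='left',
                  left_on='country', right_on='ISO-alpha3 Code')
merged = merged.drop('ISO-alpha3 Code', axis=1)
merged = merged.rename(columns={'Country or Area': 'country_name'})

# Unmatched codes surface as NaN in the joined columns.
unmatched = merged[merged['country_name'].isnull()]['country'].tolist()
print(merged)
print('codes without a match:', unmatched)
```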

In [3]:
data = pd.read_csv('un-general-debates.csv').drop('session', axis=1)
country_names = pd.read_excel ('UNSD — Methodology.xlsx')
data = pd.merge(data, country_names[['Region Name','Country or Area','ISO-alpha3 Code']],
             how='left', left_on='country', right_on='ISO-alpha3 Code')
data.drop('ISO-alpha3 Code',axis=1, inplace=True)
data.rename(columns = {'Country or Area': 'country_name'}, inplace = True)
data.head()



Unnamed: 0,year,country,text,Region Name,country_name
0,1989,MDV,﻿It is indeed a pleasure for me and the member...,Asia,Maldives
1,1989,FIN,"﻿\nMay I begin by congratulating you. Sir, on ...",Europe,Finland
2,1989,NER,"﻿\nMr. President, it is a particular pleasure ...",Africa,Niger
3,1989,URY,﻿\nDuring the debate at the fortieth session o...,Americas,Uruguay
4,1989,ZWE,﻿I should like at the outset to express my del...,Africa,Zimbabwe


- Limit the countries of interest to the African continent

In [4]:
data = data.loc[data['Region Name'] == 'Africa'].copy()  # copy to avoid SettingWithCopyWarning on later assignments
data.nunique()

year              46
country           54
text            2159
Region Name        1
country_name      54
dtype: int64

## Data cleaning and binning
1. Remove tags, digits and other non-letter characters

In [5]:
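def clean(s):
    # Remove any tags (non-greedy, so each <...> is matched separately):
    cleaned = re.sub(r"(?s)<.*?>", " ", s)
    # Remove escaped unicode sequences such as \u00e9; this must run before
    # the next step, which would otherwise strip the backslash and digits:
    cleaned = re.sub(r"\\u.{4}", " ", cleaned)
    # Keep only letters and basic punctuation (this also drops digits):
    cleaned = re.sub(r"[^A-Za-z(),*!?'`]", " ", cleaned)
    return cleaned.strip()

# clean text
data['text'] = data['text'].apply(clean)
# remove rows with a null value in the year column
data = data[data['year'].notnull()]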
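
To see what the character whitelist does in practice, here is a minimal sketch on a made-up sentence. The sample text and the final whitespace-collapsing step are assumptions of this example, not part of the notebook. Note that stripping digits leaves fragments such as `th` from `44th` behind.

```python
import re

# Made-up sample sentence (not from the dataset).
sample = "In 1989, the 44th session cost $5 million!"

# Replace every character outside the whitelist with a space.
kept = re.sub(r"[^A-Za-z(),*!?'`]", " ", sample)

# Collapse the runs of spaces left behind (an extra step for readability).
collapsed = re.sub(r"\s+", " ", kept).strip()
print(collapsed)  # In , the th session cost million!
```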

2. Group data by country and into 5-year periods

In [6]:
data['year'] = (data['year'] / 5).astype(int)*5
data = data.groupby(['country', 'year', 'country_name'])['text'].apply(list)
data = data.apply(lambda x: ' '.join(x))  # join with a space so words at speech boundaries do not fuse
data = data.reset_index(drop=False)

data[:20]

Unnamed: 0,country,year,country_name,text
0,AGO,1975,Angola,On analysing the agenda of the thirty third se...
1,AGO,1980,Angola,"A few days ago, we had the legitimate satisfac..."
2,AGO,1985,Angola,"Mr President, today I have the honour of addr..."
3,AGO,1990,Angola,Allow me first to congratulate Mr Shihabi on ...
4,AGO,1995,Angola,Allow me at the outset to congratulate Mr Ism...
5,AGO,2000,Angola,"Allow me, on behalf of my Government and in my..."
6,AGO,2005,Angola,I am particularly honoured to address the Gen...
7,AGO,2010,Angola,On behalf of the President of the Republic of...
8,AGO,2015,Angola,"At the outset, on behalf of the President of A..."
9,BDI,1970,Burundi,"Mr President, this great Assembly made a very..."
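
The binning step can be illustrated on a few made-up rows. In this sketch each year is mapped to the start of its 5-year period by integer division, and all speeches of a country within one period are merged into a single text. The toy rows are hypothetical, and joining with a single space is a choice of this example.

```python
import pandas as pd

# Each year maps to the first year of its 5-year bin.
years = pd.Series([1970, 1974, 1975, 1989, 2016])
bins = (years // 5) * 5
print(bins.tolist())  # [1970, 1970, 1975, 1985, 2015]

# Hypothetical rows: two Angolan speeches fall into the same bin.
toy = pd.DataFrame({
    'country': ['AGO', 'AGO', 'BDI'],
    'year': [1976, 1978, 1971],
    'text': ['first speech', 'second speech', 'only speech'],
})
toy['year'] = toy['year'] // 5 * 5

# One row per (country, period), with the speeches concatenated.
grouped = toy.groupby(['country', 'year'])['text'].apply(' '.join).reset_index()
print(grouped)
```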


3. Create 5000 TF-IDF features, using 1- to 3-grams

In [7]:

num_features = 5000
tfidf = TfidfVectorizer(max_features = num_features, strip_accents='unicode',
                        lowercase=True, stop_words='english', ngram_range=(1,3))
print('Fitting Data...')
tfidf.fit(data['text'].values.astype('U'))

print('Starting Transform...')
text_tfidf = tfidf.transform(data['text'])

print('Label and Incorporate TF-IDF...')
data_array = pd.DataFrame(text_tfidf.toarray())
# get_feature_names() was removed in scikit-learn 1.2; versions >= 1.0 provide get_feature_names_out()
feature_names = ['TF_' + name for name in tfidf.get_feature_names_out()]

data_array.columns = feature_names
data = pd.concat([data, data_array], axis=1)

data[:10]

Fitting Data...
Starting Transform...
Label and Incorporate TF-IDF...


Unnamed: 0,country,year,country_name,text,TF_ababa,TF_abandon,TF_abide,TF_ability,TF_abject,TF_abject poverty,...,TF_yugoslavia,TF_zaire,TF_zambia,TF_zimbabwe,TF_zimbabwe namibia,TF_zionism,TF_zionist,TF_zone,TF_zone peace,TF_zones
0,AGO,1975,Angola,On analysing the agenda of the thirty third se...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.012245,0.01764,0.073397,0.01245,0.03735,0.0,0.013603,0.019485,0.0
1,AGO,1980,Angola,"A few days ago, we had the legitimate satisfac...",0.006225,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.006494,0.010808,0.0,0.01375,0.004201,0.005008,0.007173,0.0
2,AGO,1985,Angola,"Mr President, today I have the honour of addr...",0.0,0.0,0.0,0.002424,0.0,0.0,...,0.0,0.010986,0.0,0.003292,0.0,0.0,0.005119,0.009153,0.00874,0.0
3,AGO,1990,Angola,Allow me first to congratulate Mr Shihabi on ...,0.009656,0.010041,0.005084,0.0,0.0,0.0,...,0.027618,0.0,0.0,0.004191,0.007109,0.0,0.0,0.0,0.0,0.004573
4,AGO,1995,Angola,Allow me at the outset to congratulate Mr Ism...,0.0,0.009256,0.0,0.002844,0.0,0.0,...,0.005091,0.01289,0.0,0.0,0.0,0.0,0.0,0.01074,0.005128,0.0
5,AGO,2000,Angola,"Allow me, on behalf of my Government and in my...",0.0,0.007752,0.015699,0.009529,0.0,0.0,...,0.0,0.0,0.007776,0.019413,0.0,0.0,0.0,0.005997,0.0,0.0
6,AGO,2005,Angola,I am particularly honoured to address the Gen...,0.0,0.0,0.0,0.005599,0.0,0.0,...,0.0,0.0,0.0,0.015209,0.0,0.0,0.0,0.014094,0.020188,0.0
7,AGO,2010,Angola,On behalf of the President of the Republic of...,0.007028,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,AGO,2015,Angola,"At the outset, on behalf of the President of A...",0.0,0.0,0.0,0.019756,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,BDI,1970,Burundi,"Mr President, this great Assembly made a very...",0.003431,0.0,0.003613,0.002193,0.003743,0.0,...,0.0,0.009938,0.014316,0.020848,0.0,0.0,0.027784,0.00552,0.0,0.00325
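
The effect of `ngram_range=(1,3)` is easier to see on a tiny corpus. The three documents below are invented for illustration. The sketch shows that the vocabulary mixes single words, word pairs and word triples, and that stop words are removed before the n-grams are built, which is why a feature such as `peace security` can appear even though the source text reads "peace and security".

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini-corpus (not from the dataset).
docs = [
    "peace and security in africa",
    "economic development in africa",
    "peace and economic development",
]

vec = TfidfVectorizer(stop_words='english', ngram_range=(1, 3))
matrix = vec.fit_transform(docs)

# vocabulary_ maps each n-gram to its column index.
vocab = sorted(vec.vocabulary_)
print(len(vocab), 'features')
print(vocab)

# By default each document vector is L2-normalised.
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))
print(row_norms)
```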


4. Obtain a list of TF-IDF features and assess their applicability

In [8]:
features = data.columns.tolist()
for i in ['year', 'country', 'country_name', 'text']:
    features.remove(i)
features

['TF_ababa',
 'TF_abandon',
 'TF_abide',
 'TF_ability',
 'TF_abject',
 'TF_abject poverty',
 'TF_able',
 'TF_abroad',
 'TF_absence',
 'TF_absolute',
 'TF_absolutely',
 'TF_abuja',
 'TF_abuse',
 'TF_accelerate',
 'TF_accelerated',
 'TF_accept',
 'TF_acceptable',
 'TF_acceptance',
 'TF_accepted',
 'TF_access',
 'TF_accession',
 'TF_accession independence',
 'TF_accommodation',
 'TF_accompanied',
 'TF_accomplished',
 'TF_accord',
 'TF_accordance',
 'TF_accorded',
 'TF_according',
 'TF_accordingly',
 'TF_accords',
 'TF_account',
 'TF_accountability',
 'TF_achieve',
 'TF_achieve peace',
 'TF_achieved',
 'TF_achievement',
 'TF_achievements',
 'TF_achieving',
 'TF_acknowledge',
 'TF_acknowledged',
 'TF_acquire',
 'TF_acquired',
 'TF_acquisition',
 'TF_act',
 'TF_acting',
 'TF_action',
 'TF_action adopted',
 'TF_action african',
 'TF_action african economic',
 'TF_action international',
 'TF_action taken',
 'TF_actions',
 'TF_active',
 'TF_active solidarity',
 'TF_actively',
 'TF_activities',


In [9]:
df = data.copy()
df['total'] = df[features].sum(axis=1).abs()
df = df.sort_values(by='total', ascending=True).reset_index(drop=True)
df[:30]



Unnamed: 0,country,year,country_name,text,TF_ababa,TF_abandon,TF_abide,TF_ability,TF_abject,TF_abject poverty,...,TF_zaire,TF_zambia,TF_zimbabwe,TF_zimbabwe namibia,TF_zionism,TF_zionist,TF_zone,TF_zone peace,TF_zones,total
0,MUS,2015,Mauritius,"Twelve years ago, I bade farewell to the Assem...",0.01093,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.053116
1,BFA,2015,Burkina Faso,It is my honour to address the Assembly as Pre...,0.025381,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.922399
2,RWA,2015,Rwanda,The adoption of the Sustainable Development Go...,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.115164
3,UGA,2015,Uganda,"I congratulate you, Sir, on your election as P...",0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13.136202
4,GNQ,2015,Equatorial Guinea,It is a pleasure for me to take the floor befo...,0.0,0.0,0.0,0.0,0.026729,0.029625,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13.16177
5,MWI,2005,Malawi,I would like to take this opportunity to cong...,0.0,0.0,0.0,0.006453,0.005507,0.006104,...,0.0,0.005266,0.004383,0.0,0.0,0.0,0.0,0.0,0.0,13.584389
6,ERI,2015,Eritrea,"It is my pleasure, at the outset, to warmly co...",0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13.679593
7,SSD,2015,South Sudan,"On behalf of my President, His Excellency Mr ...",0.035326,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.010373
8,SYC,2015,Seychelles,"We the peoples of the United Nations, determin...",0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.089964
9,BFA,2010,Burkina Faso,"Sociopolitical crises, armed conflicts, the de...",0.0,0.0,0.0,0.002565,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.003229,0.0,0.0,14.170195


## The Similarity Function

In [10]:
def similarity(year, count=10, same_year=True):
    '''Finds the countries most and least similar to a reference country.

    The reference is the first row of `data` for the given 5-year period,
    i.e. the alphabetically first country code (AGO, Angola, for 1975).
    Distance is the squared Euclidean distance between TF-IDF vectors.
    '''
    df = data.copy()
    primary = data[data['year'] == year]
    # Squared difference between every row and the reference row, per feature
    reference = primary[features].iloc[0].astype(float)
    df[features] = (data[features] - reference) ** 2

    # Sum over features: a smaller total means a more similar text
    df['total'] = df[features].sum(axis=1)
    df = df.sort_values(by='total', ascending=True).reset_index(drop=True)
    if same_year:
        df = df[df['year'] == year]
        # Row 0 is the reference itself (distance 0), so skip it
        most_similar = df[1:1 + count].country_name
        least_similar = df[-count:].country_name
    else:
        # Compare across all periods and label each result with its period
        most_similar = df[1:1 + count].country_name + ' (' + df[1:1 + count].year.astype(str) + ')'
        least_similar = df[-count:].country_name + ' (' + df[-count:].year.astype(str) + ')'
    print(str(year))
    print('most similar:')
    print(most_similar.values)
    print()
    print(str(year))
    print('least similar:')
    print(least_similar.values)
    print()
    print()
    return ''
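
The ranking logic of the function can be checked by hand on a toy example. The sketch below uses three hypothetical countries (`A`, `B`, `C`) with two invented TF-IDF features and ranks them by squared Euclidean distance to the reference `A`; the reference itself always comes first with distance 0, which is why the function skips the first row.

```python
import pandas as pd

# Hypothetical TF-IDF vectors for three countries (values are invented).
vectors = pd.DataFrame(
    {'TF_peace': [0.9, 0.8, 0.1], 'TF_trade': [0.1, 0.3, 0.9]},
    index=['A', 'B', 'C'],
)

# Squared Euclidean distance of every row to the reference row 'A'.
reference = vectors.loc['A']
distance = ((vectors - reference) ** 2).sum(axis=1).sort_values()
print(distance)

# Drop the reference (distance 0) to get the ranking, most similar first.
ranking = distance.index[1:].tolist()
print(ranking)  # ['B', 'C']
```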



In [11]:
similarity(1975)

1975
most similar:
['Mozambique' 'Guinea-Bissau' 'Cabo Verde' 'Sao Tome and Principe' 'Benin'
 'Burundi' 'Guinea' 'Congo' 'Somalia' 'United Republic of Tanzania']

1975
least similar:
['Gabon' 'Gambia' 'Seychelles' 'Central African Republic' 'Lesotho'
 'Comoros' 'Burkina Faso' 'Malawi' 'Equatorial Guinea' 'Eswatini']




''

In [13]:
similarity(1980)

1980
most similar:
['Mozambique' 'Guinea-Bissau' 'Zimbabwe' 'Guinea' 'Congo' 'Burundi'
 'Sao Tome and Principe' 'Zambia' 'Ethiopia' 'United Republic of Tanzania']

1980
least similar:
['Lesotho' 'Gambia' 'Chad' 'Burkina Faso' 'Côte d’Ivoire' 'Malawi'
 'Central African Republic' 'Eswatini' 'Equatorial Guinea' 'South Africa']




''

In [14]:
similarity(1985)

1985
most similar:
['Mozambique' 'United Republic of Tanzania' 'Congo' 'Zambia' 'Djibouti'
 'Cabo Verde' 'Zimbabwe' 'Togo' 'Botswana' 'Nigeria']

1985
least similar:
['Gabon' 'Algeria' 'Central African Republic' 'Comoros' 'Burkina Faso'
 'Malawi' 'Chad' 'Equatorial Guinea' 'Eswatini' 'Seychelles']




''

In [15]:
similarity(1990)

1990
most similar:
['Mozambique' 'Zimbabwe' 'Guinea' 'United Republic of Tanzania' 'Namibia'
 'Senegal' 'Congo' 'Djibouti' 'Ghana' 'Côte d’Ivoire']

1990
least similar:
['Eritrea' 'Seychelles' 'Central African Republic' 'Liberia' 'Niger'
 'South Africa' 'Sao Tome and Principe' 'Eswatini' 'Malawi'
 'Equatorial Guinea']




''

In [16]:
similarity(1995)

1995
most similar:
['Mozambique' 'Zimbabwe' 'United Republic of Tanzania' 'Namibia'
 'Botswana' 'Zambia' 'Congo' 'Senegal' 'Djibouti' 'Ghana']

1995
least similar:
['Sierra Leone' 'Central African Republic' 'Sao Tome and Principe'
 'Burundi' 'Seychelles' 'Malawi' 'Equatorial Guinea' 'Rwanda' 'Eritrea'
 'Eswatini']




''

In [17]:
similarity(2000)

2000
most similar:
['Congo' 'Botswana' 'Namibia' 'Algeria' 'Lesotho' 'Mozambique' 'Gabon'
 'Zimbabwe' 'Guinea' 'Cabo Verde']

2000
least similar:
['Malawi' 'Madagascar' 'Eritrea' 'Burkina Faso' 'Central African Republic'
 'Sierra Leone' 'Eswatini' 'Côte d’Ivoire' 'Sao Tome and Principe'
 'Equatorial Guinea']




''

In [18]:
similarity(2005)

2005
most similar:
['Gambia' 'Congo' 'Guinea' 'Lesotho' 'Côte d’Ivoire' 'Nigeria' 'Botswana'
 'Algeria' 'Gabon' 'Mozambique']

2005
least similar:
['Eswatini' 'Comoros' 'Tunisia' 'Somalia' 'Seychelles' 'Burkina Faso'
 'Ethiopia' 'Eritrea' 'Djibouti' 'Malawi']




''

In [19]:
similarity(2010)

2010
most similar:
['Mozambique' 'Guinea' 'Congo' 'Namibia' 'Lesotho' 'Mauritania' 'Benin'
 'Democratic Republic of the Congo' 'South Africa' 'Gambia']

2010
least similar:
['Seychelles' 'Sao Tome and Principe' 'Madagascar' 'Malawi'
 'Equatorial Guinea' 'South Sudan' 'Somalia' 'Tunisia' 'Burkina Faso'
 'Ghana']




''

In [20]:
similarity(2015)

2015
most similar:
['Niger' 'Gabon' 'Lesotho' 'South Africa' 'Eswatini' 'Zimbabwe' 'Guinea'
 'Côte d’Ivoire' 'Djibouti' 'Zambia']

2015
least similar:
['Seychelles' 'Rwanda' 'Ghana' 'Libya' 'Uganda' 'South Sudan' 'Somalia'
 'Egypt' 'Burkina Faso' 'Mauritius']




''