# NLP and Network Analysis

## This script contains the following:

### 1. Importing libraries and data

### 2. Creating NER object

### 3. Splitting sentence entities from the NER object

### 4. Filtering entities

### 5. Creating relationships dataframe

### 6. Exporting data

### 1. Importing libraries and data

In [41]:
# Import libraries

import pandas as pd
import numpy as np
import spacy
from spacy import displacy
import networkx as nx
import os
import matplotlib.pyplot as plt
import scipy
import re

In [42]:
# Download Spacy English module

!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)

In [43]:
# Load Spacy English module

NER = spacy.load('en_core_web_sm')

In [44]:
# Import article data

# Read as UTF-8 so that characters such as en dashes are decoded correctly
with open('20th_century_Wiki_Lena_Cole.txt', 'r', encoding='utf-8', errors='ignore') as file:
    data = file.read().replace('\n', '')

In [45]:
# Import countries data

import csv

csv_file_path = 'countries_list_20th_century.csv'
with open(csv_file_path, 'r') as file:
    csv_reader = csv.reader(file)
    data_list = []
    for row in csv_reader:
        data_list.append(row)
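
The header row `country_name` is read in as the first element of `data_list` and has to be dropped later on. As a minimal sketch of one alternative, assuming the file has a single `country_name` column (as the printed output suggests), `csv.DictReader` consumes the header itself. The in-memory `sample_csv` text and the helper `read_country_names` are illustrative stand-ins, not part of the original notebook.

```python
import csv
import io

# Illustrative stand-in for countries_list_20th_century.csv (assumed layout:
# one header line followed by one country per line).
sample_csv = 'country_name\nAfghanistan\nAlbania\n"Bahamas, The"\n'

def read_country_names(handle):
    """Return the values of the country_name column, without the header."""
    reader = csv.DictReader(handle)
    return [row['country_name'] for row in reader]

country_names = read_country_names(io.StringIO(sample_csv))
print(country_names)
```

With the real file, the same function could be called on the handle returned by `open(csv_file_path, 'r', newline='')`.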

In [46]:
for row in data_list:
    print(row)

['country_name']
['Afghanistan']
['Albania']
['Algeria']
['Andorra']
['Angola']
['Antigua and Barbuda']
['Argentina']
['Armenia']
['Australia']
['Austria']
['Azerbaijan']
['Bahamas, The']
['Bahrain']
['Bangladesh']
['Barbados']
['Belarus']
['Belgium']
['Belize']
['Benin']
['Bhutan']
['Bolivia']
['Bosnia and Herzegovina']
['Botswana']
['Brazil']
['Brunei']
['Bulgaria']
['Burkina Faso']
['Burundi']
['Cambodia']
['Cameroon']
['Canada']
['Cape Verde']
['Central African Republic']
['Chad']
['Chile']
["China, People's Republic of"]
['Colombia']
['Comoros']
['Congo, Democratic Republic of the']
['Congo, Republic of the']
['Costa Rica']
['Croatia']
['Cuba']
['Cyprus']
['Czech Republic']
['Denmark']
['Djibouti']
['Dominica']
['Dominican Republic']
['East Timor']
['Ecuador']
['Egypt']
['El Salvador']
['Equatorial Guinea']
['Eritrea']
['Estonia']
['Eswatini']
['Ethiopia']
['Fiji']
['Finland']
['France']
['Gabon']
['Gambia, The']
['Georgia']
['Germany']
['Ghana']
['Greece']
['Grenada']
['Guatemala

### 2. Creating NER object

In [47]:
# Create NER object

article = NER(data)

In [48]:
displacy.render(article, style = 'ent', jupyter = True)

In [49]:
displacy.render(article, options = {'ents': ['GPE']}, style = 'ent', jupyter = True)

I have approached this exercise in a slightly different order from the directions given. To evaluate whether the text needed wrangling, I first checked the country names used in the article against my list. This confirmed what I already believed: several countries have been created or have changed over the last century, such as Yugoslavia and Czechoslovakia. Some empires and unions that existed during the 20th century no longer exist in 2024, such as the Soviet Union and the British, Austro-Hungarian and Ottoman Empires. Also, as I noted in a previous task, the article refers to Britain rather than the United Kingdom. Britain is the geographical term for the island that is home to England, Scotland and Wales, while the United Kingdom describes the political unit containing those countries plus Northern Ireland.

The simplest way to address this is to add these entities to the countries list. I would not want to replace any countries with the empire or union names, as the changes happen over the course of the article, meaning what was correct at the beginning would be incorrect by the end. As both terms are used in the article, I will add Britain to the countries list in addition to the United Kingdom.

In [50]:
# Add country/empire/union names to the list.
# Each name is appended in its own cell, as otherwise each name was placed in a different column.

data_list.append(['British Empire'])

In [51]:
data_list.append(['Britain'])

In [52]:
data_list.append(['Ottoman Empire'])

In [53]:
data_list.append(['Austria-Hungary'])

In [54]:
data_list.append(['Yugoslavia'])

In [55]:
data_list.append(['Czechoslovakia'])

In [56]:
data_list.append(['Soviet Union'])

In [57]:
data_list.append(['Persia'])
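
The comment in the first of these cells notes that adding the names together placed each name in a different column. That is what happens when a flat list of names is appended as a single row. A minimal sketch of a one-step alternative, using hypothetical names `rows` (a stand-in for `data_list`) and `extra_names`: wrapping each name in its own list keeps the one-column shape.

```python
# Stand-in for data_list: each row is a one-element list, as csv.reader returns.
rows = [['country_name'], ['Afghanistan'], ['Albania']]

# Historical names discussed above.
extra_names = ['British Empire', 'Britain', 'Ottoman Empire', 'Austria-Hungary',
               'Yugoslavia', 'Czechoslovakia', 'Soviet Union', 'Persia']

# Appending the flat list would add ONE row with eight columns;
# wrapping each name in its own list adds eight one-column rows instead.
rows.extend([name] for name in extra_names)

print(len(rows))
print(rows[-1])
```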

In [58]:
data_list

[['country_name'],
 ['Afghanistan'],
 ['Albania'],
 ['Algeria'],
 ['Andorra'],
 ['Angola'],
 ['Antigua and Barbuda'],
 ['Argentina'],
 ['Armenia'],
 ['Australia'],
 ['Austria'],
 ['Azerbaijan'],
 ['Bahamas, The'],
 ['Bahrain'],
 ['Bangladesh'],
 ['Barbados'],
 ['Belarus'],
 ['Belgium'],
 ['Belize'],
 ['Benin'],
 ['Bhutan'],
 ['Bolivia'],
 ['Bosnia and Herzegovina'],
 ['Botswana'],
 ['Brazil'],
 ['Brunei'],
 ['Bulgaria'],
 ['Burkina Faso'],
 ['Burundi'],
 ['Cambodia'],
 ['Cameroon'],
 ['Canada'],
 ['Cape Verde'],
 ['Central African Republic'],
 ['Chad'],
 ['Chile'],
 ["China, People's Republic of"],
 ['Colombia'],
 ['Comoros'],
 ['Congo, Democratic Republic of the'],
 ['Congo, Republic of the'],
 ['Costa Rica'],
 ['Croatia'],
 ['Cuba'],
 ['Cyprus'],
 ['Czech Republic'],
 ['Denmark'],
 ['Djibouti'],
 ['Dominica'],
 ['Dominican Republic'],
 ['East Timor'],
 ['Ecuador'],
 ['Egypt'],
 ['El Salvador'],
 ['Equatorial Guinea'],
 ['Eritrea'],
 ['Estonia'],
 ['Eswatini'],
 ['Ethiopia'],
 ['Fiji'

### 3. Splitting sentence entities from the NER object

In [59]:
df_sentences = []

# Loop through sentences to get entity list for each sentence
for sent in article.sents:
    entity_list = [ent.text for ent in sent.ents]
    df_sentences.append({'sentence': sent, 'entities': entity_list})

df_sentences = pd.DataFrame(df_sentences)

In [60]:
df_sentences.head()

| | sentence | entities |
|---|---|---|
| 0 | (Key, events, of, the, 20th, century, -, Wikip... | [the 20th century - WikipediaJump, Contribute,... |
| 1 | (depression1.2.2The, rise, of, dictatorship1.3... | [World War II, Pacific1.3.7.1Background1.3.8Ja... |
| 2 | (begins1.4The, post, -, war, world1.4.1The, en... | [Cold War, 1947–1991)1.4.3War, the Cold War1... |
| 3 | (What, links, hereRelated, changesUpload, file... | [pageGet, URLDownload, Download, Wikipedia, en... |
| 4 | (The, World, Wars, sparked, tension, between, ... | [the Cold War, the Space Race] |


### 4. Filtering entities

In [61]:
# Import countries list as Dataframe

df_countries = pd.DataFrame(data_list)

In [62]:
df_countries.head()

| | 0 |
|---|---|
| 0 | country_name |
| 1 | Afghanistan |
| 2 | Albania |
| 3 | Algeria |
| 4 | Andorra |


In [63]:
df_countries.rename(columns = {0:'Country'}, inplace = True)

In [64]:
df_countries.head()

| | Country |
|---|---|
| 0 | country_name |
| 1 | Afghanistan |
| 2 | Albania |
| 3 | Algeria |
| 4 | Andorra |


In [65]:
# Remove 'country_name' value

# drop() returns a new dataframe, so the result must be assigned back
df_countries = df_countries.drop([0])
df_countries

| | Country |
|---|---|
| 1 | Afghanistan |
| 2 | Albania |
| 3 | Algeria |
| 4 | Andorra |
| 5 | Angola |
| ... | ... |
| 213 | Austria-Hungary |
| 214 | Yugoslavia |
| 215 | Czechoslovakia |
| 216 | Soviet Union |


In [66]:
# Function to filter out entities not in the list of countries

def filter_entity(ent_list, df_countries):
    # Build the lookup once per call rather than once per entity
    countries = set(df_countries['Country'])
    return [ent for ent in ent_list if ent in countries]

In [67]:
# Check

filter_entity(['Germany', 'Circle', 'Orange'], df_countries)

['Germany']

In [68]:
# Pass sentence entities into the filter, so that only entities of interest are returned

df_sentences['country_entities'] = df_sentences['entities'].apply(lambda x: filter_entity(x, df_countries))

In [69]:
# Filter out sentences that do not have any country entities

df_sentences_filtered = df_sentences[df_sentences['country_entities'].map(len) > 0]

df_sentences_filtered.head()

| | sentence | entities | country_entities |
|---|---|---|---|
| 13 | (Interwoven, alliances, ,, an, increasing, arm... | [Europe, Allies, The Triple Entente, British E... | [British Empire, France, Russia] |
| 14 | (Germany, ,, Austria, -, Hungary, ,, Bulgaria,... | [Germany, Austria, Hungary, Bulgaria, the Otto... | [Germany, Austria, Hungary, Bulgaria, Russia] |
| 15 | (The, Bolsheviks, negotiated, the, Treaty, of,... | [the Treaty of Brest-Litovsk, Germany, Russia] | [Germany, Russia] |
| 16 | (In, the, treaty, ,, Bolshevik, Russia, ceded,... | [Bolshevik Russia, Baltic, Germany, Kars Oblas... | [Germany] |
| 17 | (It, also, recognized, the, independence, of, ... | [Germany, Allied, American] | [Germany] |
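
The filter compares strings exactly, so an entity that carries a leading article will not match its entry in the countries list; the truncated `the Otto...` entity in row 14 above appears to be such a case, since it is absent from `country_entities`. The following is a minimal sketch of one possible normalisation step, not the approach used in this notebook. The helpers `normalise_entity` and `filter_entities_normalised` and the sample values are hypothetical.

```python
def normalise_entity(text):
    """Strip surrounding whitespace and a leading 'the ' (case-insensitive)."""
    cleaned = text.strip()
    if cleaned.lower().startswith('the '):
        cleaned = cleaned[4:]
    return cleaned

def filter_entities_normalised(ent_list, country_names):
    """Keep entities whose normalised text is in country_names (a set)."""
    normalised = (normalise_entity(ent) for ent in ent_list)
    return [ent for ent in normalised if ent in country_names]

# Hypothetical sample data for illustration only.
countries = {'Germany', 'Ottoman Empire', 'Soviet Union'}
sample = ['Germany', 'the Ottoman Empire', 'The Soviet Union', 'Circle']

matched = filter_entities_normalised(sample, countries)
print(matched)
```

Whether this is desirable depends on the analysis: it would increase the number of matches and therefore change the relationship counts produced in the next section.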


### 5. Creating relationships dataframe

In [70]:
# Define relationships

relationships = []

for i in range(df_sentences_filtered.index[-1]):
    end_i = min(i+5, df_sentences_filtered.index[-1])
    country_list = sum((df_sentences_filtered.loc[i: end_i].country_entities), [])

    # Remove duplicated countries that are adjacent to each other
    country_unique = [country_list[j] for j in range(len(country_list))
                      if (j == 0) or country_list[j] != country_list[j-1]]

    if len(country_unique) > 1:
        for idx, a in enumerate(country_unique[:-1]):
            b = country_unique[idx + 1]
            relationships.append({'source': a, 'target': b})
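
To make the windowing logic easier to follow, here is a minimal pure-Python sketch of the same idea on toy data. The dictionary `sentence_countries` is a hypothetical stand-in for `df_sentences_filtered`, with gaps in the indices where sentences without countries were dropped, and `window_pairs` is an illustrative helper name.

```python
# Toy stand-in for df_sentences_filtered: sentence index -> country entities.
sentence_countries = {
    0: ['France', 'Russia'],
    1: ['Russia', 'Germany'],
    7: ['Germany', 'Japan'],
}

def window_pairs(sentence_countries, window_size=5):
    """Link consecutive, distinct countries inside each sentence window."""
    pairs = []
    last_index = max(sentence_countries)
    for start in range(last_index):
        end = min(start + window_size, last_index)
        # Collect countries from sentences start..end inclusive, as .loc does.
        window = [country
                  for idx in range(start, end + 1)
                  for country in sentence_countries.get(idx, [])]
        # Collapse runs of the same country.
        unique = [c for j, c in enumerate(window) if j == 0 or c != window[j - 1]]
        # Pair each country with its successor.
        pairs.extend(zip(unique[:-1], unique[1:]))
    return pairs

pairs = window_pairs(sentence_countries)
print(len(pairs))
print(pairs.count(('Germany', 'Japan')))
```

Because the windows overlap, a sentence is visited by several windows and its pairs are recorded once per window. In the toy data the pair from sentence 7 is recorded five times, which is the same effect that produces runs of identical rows in the relationships dataframe.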

In [71]:
df_relationship = pd.DataFrame(relationships)

In [72]:
df_relationship

| | source | target |
|---|---|---|
| 0 | British Empire | France |
| 1 | France | Russia |
| 2 | British Empire | France |
| 3 | France | Russia |
| 4 | Russia | Germany |
| ... | ... | ... |
| 779 | India | Singapore |
| 780 | India | Singapore |
| 781 | India | Singapore |
| 782 | India | Singapore |


In [73]:
# Sort each pair alphabetically so that a->b and b->a are treated as the same relationship

df_relationships = pd.DataFrame(np.sort(df_relationship.values, axis = 1), columns = df_relationship.columns)

df_relationships.head()

| | source | target |
|---|---|---|
| 0 | British Empire | France |
| 1 | France | Russia |
| 2 | British Empire | France |
| 3 | France | Russia |
| 4 | Germany | Russia |


In [74]:
# Summarise interactions

df_relationships['value'] = 1
df_relationships = df_relationships.groupby(['source', 'target'], sort=False, as_index=False).sum()

In [75]:
df_relationships.head()

| | source | target | value |
|---|---|---|---|
| 0 | British Empire | France | 6 |
| 1 | France | Russia | 12 |
| 2 | Germany | Russia | 26 |
| 3 | Austria | Germany | 17 |
| 4 | Austria | Hungary | 6 |
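
The sort-then-sum steps above treat each relationship as undirected and count how often it occurs. As a minimal sketch of the same idea without pandas, assuming toy pairs rather than the article data, `collections.Counter` can count sorted tuples directly. The names `pairs` and `edge_weights` are illustrative.

```python
from collections import Counter

# Directed pairs as produced by the windowing step (toy data).
pairs = [('Russia', 'Germany'), ('Germany', 'Russia'), ('France', 'Russia')]

# Sorting each pair makes ('Russia', 'Germany') and ('Germany', 'Russia')
# the same key, mirroring np.sort(..., axis=1) on the dataframe.
edge_weights = Counter(tuple(sorted(pair)) for pair in pairs)

for (source, target), value in edge_weights.items():
    print(source, target, value)
```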


### 6. Exporting data

In [76]:
# Export dataframe to csv file

df_relationships.to_csv('countries_relationships.csv')