# Importing Libraries and Data

In [1]:
import numpy as np
import pandas as pd
import scipy
import spacy
from spacy import displacy
import re

In [2]:
# Loading spaCy's small English model
NER = spacy.load('en_core_web_sm')

## Loading 20th Century Article

In [3]:
with open('20th_century_article.txt', 'r', errors='ignore') as file:
    data = file.read().replace('\n', '')
data[:2000]

'The th century changed the world in unprecedented ways. The World Wars sparked tension between countries and led to the creation of atomic bombs , the Cold War led to the Space Race and the creation of space-based rockets, and the World Wide Web was created. These advancements have played a significant role in citizens\' lives and shaped the st century into what it is today. The new beginning of the th century marked significant changes. The s saw the decade herald a series of inventions, including the automobile , airplane and radio broadcasting . saw the completion of the Panama Canal . The Scramble for Africa continued in the s and resulted in wars and genocide across the continent. The atrocities in the Congo Free State shocked the civilized world. From to , the First World War, and its aftermath, caused major changes in the power balance of the world, destroying or transforming some of the most powerful empires. The First World War (or simply WWI), termed "The Great War" by conte

## Cleaning the 20th Century Article Data

In [4]:
# Removing reference section of the article along with special characters.
trimmed_data = re.sub((re.escape('Timeline of the th century')+r'.*'), '', data)
cleaned_data = re.sub(r'\s*\[ \]|  +', '', trimmed_data).strip()
cleaned_data



The 20th century article contained a few special characters that needed to be removed, along with the reference section, which did not contribute relevant information about the events analyzed.
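The two substitutions above can be tried on a short string. This is a minimal sketch: the `sample` text is made up for illustration, and only the two regular expressions come from the cell above. Note that `.*` does not cross newlines by default, so trimming everything after the marker relies on the newlines having been removed when the file was loaded.

```python
import re

# Hypothetical one-line sample; the real article is much longer
sample = 'The war ended [ ] in peace. Timeline of the th century See also'

# Drop the marker and everything after it
marker = 'Timeline of the th century'
trimmed = re.sub(re.escape(marker) + r'.*', '', sample)

# Remove empty bracket pairs and runs of two or more spaces
cleaned = re.sub(r'\s*\[ \]|  +', '', trimmed).strip()
print(cleaned)
```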

In [5]:
# Saving cleaned article as text file
with open('20th_century_CleanArticle.txt', 'w', encoding='utf-8') as f:
    f.write(str(cleaned_data))

## Loading Country List

In [6]:
with open('20th_century_countries.txt', 'r', encoding='utf8', errors='ignore') as file:
    countryData = file.read()
countryData[:500]

"Abkhazia – Republic of Abkhazia\nAfghanistan – Islamic Emirate of Afghanistan\nAlbania – Republic of Albania\nAlgeria – People's Democratic Republic of Algeria\nAndorra – Principality of Andorra\nAngola – Republic of Angola\nAntigua and Barbuda\nArgentina – Argentine Republic [ i ]\nArmenia – Republic of Armenia\nAustralia – Commonwealth of Australia\nAustria – Republic of Austria\nAzerbaijan – Republic of Azerbaijan [ k ]\nBahamas, The – Commonwealth of The Bahamas [ 12 ]\nBahrain – Kingdom of Bahrain\nBangl"

## Cleaning Countries Data

In [7]:
# Removing long names and text within square brackets
cleaned_countries = re.sub(r' – [^\n]*|,[\s][^\n]*| \[.[^\n]+', '', countryData)
# cleaned_countries = '\n'.join(line.split(' – ')[0] for line in countryData.split('\n'))
cleaned_countries

'Abkhazia\nAfghanistan\nAlbania\nAlgeria\nAndorra\nAngola\nAntigua and Barbuda\nArgentina\nArmenia\nAustralia\nAustria\nAzerbaijan\nBahamas\nBahrain\nBangladesh\nBarbados\nBelarus\nBelgium\nBelize\nBenin\nBhutan\nBolivia\nBosnia and Herzegovina\nBotswana\nBrazil\nBrunei\nBulgaria\nBurkina Faso\nBurundi\nCambodia\nCameroon\nCanada\nCape Verde\nCentral African Republic\nChad\nChile\nChina\nColombia\nComoros\nCongo\nCongo\nCook Islands\nCosta Rica\nCroatia\nCuba\nCyprus\nCzech Republic\nDenmark\nDjibouti\nDominica\nDominican Republic\nEcuador\nEgypt\nEl Salvador\nEquatorial Guinea\nEritrea\nEstonia\nEswatini\nEthiopia\nFiji\nFinland\nFrance\nGabon\nGambia\nGeorgia\nGermany\nGhana\nGreece\nGrenada\nGuatemala\nGuinea\nGuinea-Bissau\nGuyana\nHaiti\nHonduras\nHungary\nIceland\nIndia\nIndonesia\nIran\nIraq\nIreland\nIsrael\nItaly\nIvory Coast\nJamaica\nJapan\nJordan\nKazakhstan\nKenya\nKiribati\nKosovo\nKuwait\nKyrgyzstan\nLaos\nLatvia\nLebanon\nLesotho\nLiberia\nLibya\nLiechtenstein\nLithuani

The scraped country data included long-form names after the dash or comma of most entries, as well as text within square brackets; all of these needed to be removed.
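To see what each alternative in the pattern removes, the same expression can be applied to three lines taken from the raw output above. This is only a small illustration; the variable names `sample` and `names` are not part of the notebook.

```python
import re

# Three entries copied from the raw country list shown earlier
sample = ('Argentina – Argentine Republic [ i ]\n'
          'Bahamas, The – Commonwealth of The Bahamas [ 12 ]\n'
          'Antigua and Barbuda')

# ' – ...'  removes the long name after the dash
# ', ...'   removes whatever follows a comma (e.g. ', The')
# ' [...'   removes bracketed footnote markers
cleaned = re.sub(r' – [^\n]*|,[\s][^\n]*| \[.[^\n]+', '', sample)
names = cleaned.split('\n')
print(names)
```

Because the comma alternative cuts everything after the first comma, any two entries that share the text before the comma end up with the same short name.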

In [8]:
# Saving cleaned country list as text file
with open('20th_century_CleanedCountries.txt', 'w', encoding='utf-8') as f:
    f.write(str(cleaned_countries))

In [9]:
# Converting to a dataframe
df_countries = pd.DataFrame(cleaned_countries.strip().split('\n'), columns=['Country'])
df_countries

Unnamed: 0,Country
0,Abkhazia
1,Afghanistan
2,Albania
3,Algeria
4,Andorra
...,...
200,Venezuela
201,Vietnam
202,Yemen
203,Zambia


## Reviewing Country Summary Statistics

In [10]:
df_countries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  205 non-null    object
dtypes: object(1)
memory usage: 1.7+ KB


In [11]:
df_countries.describe()

Unnamed: 0,Country
count,205
unique,204
top,Congo
freq,2


There appear to be duplicated records in the countries data frame: it has 205 rows but only 204 unique values, with Congo appearing twice.

In [12]:
# Identifying duplicated rows
df_countries['Country'].value_counts(dropna=False, ascending=False)

Country
Congo          2
Abkhazia       1
Albania        1
Afghanistan    1
Andorra        1
              ..
Venezuela      1
Vietnam        1
Yemen          1
Zambia         1
Zimbabwe       1
Name: count, Length: 204, dtype: int64

In [13]:
df_countries.loc[df_countries['Country'] == 'Congo', 'Country']

39    Congo
40    Congo
Name: Country, dtype: object

In [14]:
# Removing duplicated row 39
df_noDup_countries = df_countries.drop_duplicates(subset=['Country'], keep='last')
df_noDup_countries.loc[df_noDup_countries['Country'] == 'Congo', 'Country']

40    Congo
Name: Country, dtype: object
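The effect of `keep='last'` versus the default `keep='first'` can be checked on a tiny frame. This is a sketch with made-up rows; `sample`, `kept_last` and `kept_first` are illustrative names only.

```python
import pandas as pd

# Hypothetical four-row frame with one duplicated name
sample = pd.DataFrame({'Country': ['Chad', 'Congo', 'Congo', 'Cuba']})

# keep='last' drops the earlier duplicate and keeps the later one
kept_last = sample.drop_duplicates(subset=['Country'], keep='last')

# keep='first' (the default) does the opposite
kept_first = sample.drop_duplicates(subset=['Country'], keep='first')

print(kept_last.index.tolist(), kept_first.index.tolist())
```

Either choice leaves one Congo row; the original index labels are preserved, so the remaining label shows which copy was kept.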

In [15]:
# Finding special characters
chars_to_find = "|".join(map(re.escape, ["-", "_", "&", "!", "#", "$", "%", "~?",
                         "~*", "^", "@", "<", ">", "+", "(", ")", "[", "]", "{", "}"]))
# Identifying columns with special characters and generating a count of rows with these values
special_char_rows = df_noDup_countries.astype(str).apply(lambda x: x.str.contains(chars_to_find, regex=True, na=False))
special_char_rows.value_counts(dropna=False)

Country
False      202
True         2
Name: count, dtype: int64

In [16]:
# Identifying rows with special characters
df_noDup_countries[df_noDup_countries['Country'].str.contains(r'[^A-Za-z\s]', regex=True)]

Unnamed: 0,Country
71,Guinea-Bissau
177,São Tomé and Príncipe
182,Timor-Leste


The special characters are part of the three country names shown in the data frame above. These rows will be kept as they are, and the special characters removed at the end if necessary.
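The two checks above return different counts (2 versus 3) because they test different things: the first looks for a fixed list of symbols, while the second flags anything that is not an ASCII letter or whitespace, which includes accented letters. A minimal sketch using only `re` (the shortened symbol list is an assumption made for brevity):

```python
import re

names = ['Guinea-Bissau', 'São Tomé and Príncipe', 'Timor-Leste', 'Chad']

# Explicit symbol list, escaped and joined as in the notebook (shortened here)
explicit = re.compile('|'.join(map(re.escape, ['-', '_', '&'])))

# Anything outside A-Z, a-z and whitespace
broad = re.compile(r'[^A-Za-z\s]')

explicit_hits = [n for n in names if explicit.search(n)]
broad_hits = [n for n in names if broad.search(n)]
print(explicit_hits)
print(broad_hits)
```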

In [17]:
# Searching for columns with mixed value types (e.g. numbers mixed with text)
for col in df_noDup_countries.columns.tolist():
    weird = (df_noDup_countries[[col]].map(type) != df_noDup_countries[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_noDup_countries[weird]) > 0:
        print(col)

In [18]:
# Searching for values with leading or trailing spaces in all columns
for col in df_noDup_countries.select_dtypes(include='object').columns:
    has_spaces = df_noDup_countries[col].astype(str).apply(lambda x: x != x.strip())
    valuesWithSpaces = df_noDup_countries.loc[has_spaces, col].unique()
    if len(valuesWithSpaces) > 0:
        print(f'\n{col}:')
        print(valuesWithSpaces[:5])

The above confirms that the countries data frame does not have any rows with mixed types, leading spaces, or trailing spaces.
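Both checks print nothing when the data is clean, so it can help to run them once on data that is known to be dirty. The sketch below uses a vectorized form of the same two tests on a made-up column; it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Hypothetical column with one leading and one trailing space
sample = pd.DataFrame({'Country': ['Chad', ' Cuba', 'Fiji ']})

# True where stripping whitespace changes the value
has_spaces = sample['Country'] != sample['Country'].str.strip()
flagged = sample.loc[has_spaces, 'Country'].tolist()

# True if the column holds more than one Python type
mixed_types = sample['Country'].map(type).nunique() > 1

print(flagged, mixed_types)
```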

# Analyzing Entities Identified by Spacy

In [19]:
# Visualizing entities identified by spaCy
article = NER(cleaned_data)
displacy.render(article[:800], style='ent', jupyter=True)

spaCy does a good job of identifying geopolitical entities (GPE), but it identifies more than just countries. It would be best to build an entity list based on the previously scraped country list to define country relationships.

# Creating an Entity List

In [20]:
# Generating entity list for each sentence
df_sentences = []

for sent in article.sents:
    entity_list = [ent.text for ent in sent.ents]
    df_sentences.append({'Sentence': sent, 'Entities': entity_list})

df_sentences = pd.DataFrame(df_sentences)
df_sentences.tail(50)

Unnamed: 0,Sentence,Entities
389,"(Computer, games, were, first, developed, by, ...","[first, Pong and Space Invaders]"
390,"(Once, the, home, computer, market, was, estab...",[]
391,"(In, order, to, take, advantage, of, advancing...",[]
392,"(Like, arcade, systems, ,, these, machines, ha...",[]
393,"(Computer, networks, appeared, in, two, main, ...",[two]
394,"(Initially, ,, computers, depended, on, the, t...",[the Bulletin Board]
395,"(However, ,, a, DARPA, project, to, create, bo...",[DARPA]
396,"(The, core, of, this, network, was, the, robus...",[TCP]
397,"(Thanks, to, efforts, from, Al, Gore, ,, the, ...",[Al Gore]
398,"(The, main, impetus, for, this, was, electroni...","[the File Transfer Protocol, FTP]"


In [21]:
# Used ChatGPT to get a dictionary list of country adjectives with their corresponding country names.
demonym_to_country = {

    # --- A ---
    "Afghan": "Afghanistan",
    "Albanian": "Albania",
    "Algerian": "Algeria",
    "American": "United States",
    "Andorran": "Andorra",
    "Angolan": "Angola",
    "Argentine": "Argentina",
    "Argentinian": "Argentina",
    "Armenian": "Armenia",
    "Australian": "Australia",
    "Austrian": "Austria",
    "Azerbaijani": "Azerbaijan",

    # --- B ---
    "Bahamian": "Bahamas",
    "Bahraini": "Bahrain",
    "Bangladeshi": "Bangladesh",
    "Barbadian": "Barbados",
    "Belarusian": "Belarus",
    "Belgian": "Belgium",
    "Belizean": "Belize",
    "Beninese": "Benin",
    "Bhutanese": "Bhutan",
    "Bolivian": "Bolivia",
    "Bosnian": "Bosnia and Herzegovina",
    "Botswanan": "Botswana",
    "Brazilian": "Brazil",
    "British": "United Kingdom",
    "Bruneian": "Brunei",
    "Bulgarian": "Bulgaria",
    "Burkinabe": "Burkina Faso",
    "Burmese": "Myanmar",
    "Burundian": "Burundi",

    # --- C ---
    "Cambodian": "Cambodia",
    "Cameroonian": "Cameroon",
    "Canadian": "Canada",
    "Cape Verdean": "Cape Verde",
    "Central African": "Central African Republic",
    "Chadian": "Chad",
    "Chilean": "Chile",
    "Chinese": "China",
    "Colombian": "Colombia",
    "Comorian": "Comoros",
    "Congolese": "Democratic Republic of the Congo",
    "Costa Rican": "Costa Rica",
    "Croatian": "Croatia",
    "Cuban": "Cuba",
    "Cypriot": "Cyprus",
    "Czech": "Czech Republic",

    # --- D ---
    "Danish": "Denmark",
    "Djiboutian": "Djibouti",
    "Dominican": "Dominican Republic",

    # --- E ---
    "Ecuadorean": "Ecuador",
    "Egyptian": "Egypt",
    "Emirati": "United Arab Emirates",
    "English": "United Kingdom",
    "Equatorial Guinean": "Equatorial Guinea",
    "Eritrean": "Eritrea",
    "Estonian": "Estonia",
    "Ethiopian": "Ethiopia",

    # --- F ---
    "Fijian": "Fiji",
    "Finnish": "Finland",
    "French": "France",

    # --- G ---
    "Gabonese": "Gabon",
    "Gambian": "Gambia",
    "Georgian": "Georgia",
    "German": "Germany",
    "Ghanaian": "Ghana",
    "Greek": "Greece",
    "Grenadian": "Grenada",
    "Guatemalan": "Guatemala",
    "Guinean": "Guinea",
    "Guinea-Bissauan": "Guinea-Bissau",
    "Guyanese": "Guyana",

    # --- H ---
    "Haitian": "Haiti",
    "Honduran": "Honduras",
    "Hungarian": "Hungary",

    # --- I ---
    "Icelandic": "Iceland",
    "Indian": "India",
    "Indonesian": "Indonesia",
    "Iranian": "Iran",
    "Iraqi": "Iraq",
    "Irish": "Ireland",
    "Israeli": "Israel",
    "Italian": "Italy",
    "Ivorian": "Cote d'Ivoire",

    # --- J ---
    "Jamaican": "Jamaica",
    "Japanese": "Japan",
    "Jordanian": "Jordan",

    # --- K ---
    "Kazakh": "Kazakhstan",
    "Kenyan": "Kenya",
    "Kiribati": "Kiribati",
    "Korean": "South Korea",
    "Kosovar": "Kosovo",
    "Kuwaiti": "Kuwait",
    "Kyrgyz": "Kyrgyzstan",

    # --- L ---
    "Laotian": "Laos",
    "Latvian": "Latvia",
    "Lebanese": "Lebanon",
    "Liberian": "Liberia",
    "Libyan": "Libya",
    "Liechtensteiner": "Liechtenstein",
    "Lithuanian": "Lithuania",
    "Luxembourgish": "Luxembourg",

    # --- M ---
    "Macedonian": "North Macedonia",
    "Malagasy": "Madagascar",
    "Malawian": "Malawi",
    "Malaysian": "Malaysia",
    "Maldivian": "Maldives",
    "Malian": "Mali",
    "Maltese": "Malta",
    "Marshallese": "Marshall Islands",
    "Mauritanian": "Mauritania",
    "Mauritian": "Mauritius",
    "Mexican": "Mexico",
    "Micronesian": "Micronesia",
    "Moldovan": "Moldova",
    "Monacan": "Monaco",
    "Mongolian": "Mongolia",
    "Montenegrin": "Montenegro",
    "Moroccan": "Morocco",
    "Mozambican": "Mozambique",

    # --- N ---
    "Namibian": "Namibia",
    "Nauruan": "Nauru",
    "Nepalese": "Nepal",
    "Dutch": "Netherlands",
    "New Zealander": "New Zealand",
    "Nicaraguan": "Nicaragua",
    "Nigerian": "Nigeria",
    "Nigerien": "Niger",
    "North Korean": "North Korea",
    "Norwegian": "Norway",

    # --- O ---
    "Omani": "Oman",

    # --- P ---
    "Pakistani": "Pakistan",
    "Palauan": "Palau",
    "Palestinian": "Palestine",
    "Panamanian": "Panama",
    "Papuan": "Papua New Guinea",
    "Paraguayan": "Paraguay",
    "Peruvian": "Peru",
    "Philippine": "Philippines",
    "Polish": "Poland",
    "Portuguese": "Portugal",

    # --- Q ---
    "Qatari": "Qatar",

    # --- R ---
    "Romanian": "Romania",
    "Russian": "Russia",
    "Rwandan": "Rwanda",

    # --- S ---
    "Saint Lucian": "Saint Lucia",
    "Salvadoran": "El Salvador",
    "Samoan": "Samoa",
    "San Marinese": "San Marino",
    "Sao Tomean": "Sao Tome and Principe",
    "Saudi": "Saudi Arabia",
    "Scottish": "United Kingdom",
    "Senegalese": "Senegal",
    "Serbian": "Serbia",
    "Seychellois": "Seychelles",
    "Sierra Leonean": "Sierra Leone",
    "Singaporean": "Singapore",
    "Slovak": "Slovakia",
    "Slovenian": "Slovenia",
    "Somali": "Somalia",
    "South African": "South Africa",
    "Spanish": "Spain",
    "Sri Lankan": "Sri Lanka",
    "Sudanese": "Sudan",
    "Surinamese": "Suriname",
    "Swazi": "Eswatini",
    "Swedish": "Sweden",
    "Swiss": "Switzerland",
    "Syrian": "Syria",

    # --- T ---
    "Taiwanese": "Taiwan",
    "Tajik": "Tajikistan",
    "Tanzanian": "Tanzania",
    "Thai": "Thailand",
    "Togolese": "Togo",
    "Tongan": "Tonga",
    "Trinidadian": "Trinidad and Tobago",
    "Tunisian": "Tunisia",
    "Turkish": "Turkey",
    "Turkmen": "Turkmenistan",
    "Tuvaluan": "Tuvalu",

    # --- U ---
    "Ugandan": "Uganda",
    "Ukrainian": "Ukraine",
    "Uruguayan": "Uruguay",
    "Uzbek": "Uzbekistan",

    # --- V ---
    "Vanuatuan": "Vanuatu",
    "Venezuelan": "Venezuela",
    "Vietnamese": "Vietnam",

    # --- Y ---
    "Yemeni": "Yemen",

    # --- Z ---
    "Zambian": "Zambia",
    "Zimbabwean": "Zimbabwe",
}

In [23]:
# Lower-casing the keys so the lookup is case-insensitive; unmatched entities are kept unchanged
demonym_lower = {k.lower(): v for k, v in demonym_to_country.items()}
df_sentences['Country_Entities'] = df_sentences['Entities'].apply(lambda x: [demonym_lower.get(
    item.strip().lower(), item.strip()) for item in x] if isinstance(x, list) else x)
df_sentences.tail(50)

Unnamed: 0,Sentence,Entities,Country_Entities
389,"(Computer, games, were, first, developed, by, ...","[first, Pong and Space Invaders]","[first, Pong and Space Invaders]"
390,"(Once, the, home, computer, market, was, estab...",[],[]
391,"(In, order, to, take, advantage, of, advancing...",[],[]
392,"(Like, arcade, systems, ,, these, machines, ha...",[],[]
393,"(Computer, networks, appeared, in, two, main, ...",[two],[two]
394,"(Initially, ,, computers, depended, on, the, t...",[the Bulletin Board],[the Bulletin Board]
395,"(However, ,, a, DARPA, project, to, create, bo...",[DARPA],[DARPA]
396,"(The, core, of, this, network, was, the, robus...",[TCP],[TCP]
397,"(Thanks, to, efforts, from, Al, Gore, ,, the, ...",[Al Gore],[Al Gore]
398,"(The, main, impetus, for, this, was, electroni...","[the File Transfer Protocol, FTP]","[the File Transfer Protocol, FTP]"
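The last 50 rows shown above happen to contain no demonyms, so the mapping has no visible effect there. The lookup itself can be illustrated on a short list; this sketch uses a two-entry subset of the dictionary, and `to_country` is a hypothetical helper name.

```python
# Hypothetical two-entry subset of the demonym dictionary
demonym_sample = {'Japanese': 'Japan', 'German': 'Germany'}
lookup = {k.lower(): v for k, v in demonym_sample.items()}


def to_country(entities):
    """Map demonyms to country names; leave everything else as-is."""
    return [lookup.get(e.strip().lower(), e.strip()) for e in entities]


result = to_country(['Japanese', ' german ', 'Tokyo'])
print(result)
```

The second argument to `dict.get` is what keeps non-demonym entities such as `Tokyo` in the list instead of replacing them with `None`.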


# Reviewing Summary Statistics 

In [24]:
df_sentences.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Sentence          439 non-null    object
 1   Entities          439 non-null    object
 2   Country_Entities  439 non-null    object
dtypes: object(3)
memory usage: 10.4+ KB


In [25]:
df_sentences.describe()

Unnamed: 0,Sentence,Entities,Country_Entities
count,439,439,439
unique,439,355,353
top,"(The, th, century, changed, the, world, in, un...",[],[]
freq,1,67,67


The results above for the Entities column indicate that there are duplicate values and empty rows: the most frequent value is the empty list, which appears 67 times.

# Filtering Entities Based on Country List Scraped

In [26]:
# Function to keep only the entities that appear in the country list
def entity_filter(entities, countries):
    country_set = set(countries)
    return [ent for ent in entities if ent in country_set]


# Testing Function
entity_filter(['France', 'Sweden', 'walk'], df_countries['Country'])

['France', 'Sweden']

In [27]:
# Applying function to entities in df_sentences
df_sentences['Article_Entities'] = df_sentences['Country_Entities'].apply(
    lambda x: entity_filter(x, df_countries['Country']))
df_sentences['Article_Entities'].head(100)

0                   []
1                   []
2                   []
3                   []
4                   []
            ...       
95                  []
96            [France]
97    [France, France]
98     [Italy, France]
99            [Greece]
Name: Article_Entities, Length: 100, dtype: object

In [28]:
# Removing rows with empty values
df_sent_filtered = df_sentences[df_sentences['Article_Entities'].map(len) > 0]
df_sent_filtered.tail(50)

Unnamed: 0,Sentence,Entities,Country_Entities,Article_Entities
237,"(On, August, the, "", sacred, decision, "", was,...","[August, Japanese, Potsdam, August, Hirohito, ...","[August, Japan, Potsdam, August, Hirohito, the...",[Japan]
238,"(The, formal, Japanese, Instrument, of, Surren...","[Japanese, September, USS Missouri, Tokyo]","[Japan, September, USS Missouri, Tokyo]",[Japan]
245,"(After, the, conquest, of, Poland, ,, the, Thi...","[Poland, Jews, Jews]","[Poland, Jews, Jews]",[Poland]
251,"(The, Nazis, created, a, system, of, extermina...","[Nazis, Poland, Jews, the Soviet Union]","[Nazis, Poland, Jews, the Soviet Union]",[Poland]
259,"(In, many, places, ,, Jews, had, to, walk, pas...","[Jews, German]","[Jews, Germany]",[Germany]
265,"(However, ,, Germany, had, surrendered, in, Ma...","[Germany, May, German]","[Germany, May, Germany]","[Germany, Germany]"
267,"(These, were, dropped, on, the, Japanese, citi...","[Japanese, Hiroshima, Nagasaki, August]","[Japan, Hiroshima, Nagasaki, August]",[Japan]
268,"(This, ,, in, combination, with, the, Soviet, ...","[Soviet, Japanese, Japanese]","[Soviet, Japan, Japan]","[Japan, Japan]"
275,"(This, new, weapon, was, alone, over, times, a...",[Japan],[Japan],[Japan]
280,"(Eventually, ,, nine, nations, would, overtly,...","[nine, today, the United States, the Soviet Un...","[nine, today, the United States, the Soviet Un...","[Russia, France, China, India, Pakistan, Israe..."
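The filter compares strings exactly, so in row 280 an entity such as `the United States` appears not to have been kept even though United States is a node elsewhere in the analysis. One possible refinement, shown here only as a sketch and not as part of the notebook, is to strip a leading "the" before the membership test. The names `normalize` and `entity_filter_normalized` are hypothetical.

```python
import re


def normalize(entity):
    """Strip surrounding whitespace and a leading 'the' (case-insensitive)."""
    return re.sub(r'^the\s+', '', entity.strip(), flags=re.IGNORECASE)


def entity_filter_normalized(entities, countries):
    """Keep entities whose normalized form is in the country list."""
    country_set = set(countries)
    return [n for n in map(normalize, entities) if n in country_set]


result = entity_filter_normalized(
    ['the United States', 'Japan', 'today'],
    ['United States', 'Japan'],
)
print(result)
```

Applying this would change the downstream counts, so the outputs shown in the rest of the notebook correspond to the exact-match filter.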


# Defining Country Relationships

In [29]:
# Creating the relationship list by sliding a window over index labels i to i+5 of the filtered sentences
relationships = []

for i in range(df_sent_filtered.index[-1]):
    end_i = min(i+5, df_sent_filtered.index[-1])
    entity_list = sum((df_sent_filtered.loc[i: end_i].Article_Entities), [])

    # Removing duplicated entities that are next to each other
    uniqueEntity = [entity_list[k] for k in range(len(entity_list)) if (k == 0) or entity_list[k] != entity_list[k-1]]

    if len(uniqueEntity) > 1:
        for idx, a in enumerate(uniqueEntity[:-1]):
            b = uniqueEntity[idx + 1]
            relationships.append({'source': a, 'target': b})
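The core of the loop is two steps: collapse consecutive repeats, then pair each remaining entity with the one that follows it. A minimal standalone sketch on a made-up three-sentence window (`consecutive_pairs` and `window` are illustrative names):

```python
def consecutive_pairs(entities):
    """Drop adjacent repeats, then pair each entity with its successor."""
    deduped = [e for k, e in enumerate(entities)
               if k == 0 or e != entities[k - 1]]
    return list(zip(deduped, deduped[1:]))


# Hypothetical window: the country entities of three neighbouring sentences
window = [['France'], ['France', 'Austria'], ['Austria', 'Hungary']]

# Flatten the window into one list, as sum(..., []) does in the notebook
flat = [e for sentence in window for e in sentence]
pairs = consecutive_pairs(flat)
print(pairs)
```

Because the window advances one index label at a time, the same pair is emitted again by each overlapping window, which is why repeated rows appear in the relationship list.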

In [30]:
# Converting identified relationships to a data frame
relationship_df = pd.DataFrame(relationships)
relationship_df

Unnamed: 0,source,target
0,France,Austria
1,Austria,Hungary
2,France,Austria
3,Austria,Hungary
4,Hungary,Russia
...,...,...
1054,South Africa,Rwanda
1055,South Africa,Rwanda
1056,South Africa,Rwanda
1057,Rwanda,North Korea


In [31]:
relationship_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1059 entries, 0 to 1058
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   source  1059 non-null   object
 1   target  1059 non-null   object
dtypes: object(2)
memory usage: 16.7+ KB


In [32]:
relationship_df.describe()

Unnamed: 0,source,target
count,1059,1059
unique,57,57
top,Japan,Germany
freq,128,138


The info and describe results indicate that there are duplicate values: the most frequent source, Japan, appears 128 times, and the most frequent target, Germany, appears 138 times. This does not affect the relationship analysis.

In [33]:
# Sorting each pair alphabetically so that a-b and b-a count as the same relationship
df_relationships = pd.DataFrame(np.sort(relationship_df.values, axis=1), columns=relationship_df.columns)
df_relationships

Unnamed: 0,source,target
0,Austria,France
1,Austria,Hungary
2,Austria,France
3,Austria,Hungary
4,Hungary,Russia
...,...,...
1054,Rwanda,South Africa
1055,Rwanda,South Africa
1056,Rwanda,South Africa
1057,North Korea,Rwanda


In [34]:
# Summarizing country relationships
df_relationships['Interactions'] = 1
df_relationships = df_relationships.groupby(['source', 'target'], sort=False, as_index=False).sum()
df_relationships.head(50)

Unnamed: 0,source,target,Interactions
0,Austria,France,6
1,Austria,Hungary,6
2,Hungary,Russia,5
3,Germany,Russia,21
4,Germany,Ukraine,16
5,Germany,United States,17
6,United Kingdom,United States,3
7,Germany,Italy,29
8,Austria,Germany,10
9,Germany,United Kingdom,27


Each pair above comes from two countries that appear consecutively, once adjacent duplicates are removed, within the window of sentences iterated at a time. Because each pair was sorted alphabetically, the source and target labels no longer reflect the order of appearance. The number of interactions is the frequency of each pair.
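The sort-then-count step can be reproduced with the standard library alone, which makes the logic easy to verify by hand. This is an equivalent sketch on three made-up pairs, using `collections.Counter` instead of the `np.sort` and `groupby` calls in the notebook.

```python
from collections import Counter

# Hypothetical directed pairs, including one reversed duplicate
pairs = [('France', 'Austria'), ('Austria', 'France'), ('Austria', 'Hungary')]

# Sorting each pair makes (a, b) and (b, a) identical before counting
interactions = Counter(tuple(sorted(pair)) for pair in pairs)
print(interactions)
```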

In [35]:
# Finding special characters
chars_to_find = "|".join(map(re.escape, ["-", "_", "&", "!", "#", "$", "%", "~?",
                         "~*", "^", "@", "<", ">", "+", "(", ")", "[", "]", "{", "}"]))
# find/identify columns with special characters and count values
special_char_rows = df_relationships.astype(str).apply(lambda x: x.str.contains(chars_to_find, regex=True, na=False))
special_char_rows.value_counts(dropna=False)

source  target  Interactions
False   False   False           113
True    False   False             1
Name: count, dtype: int64

In [36]:
df_relationships[df_relationships['target'].str.contains(r'[^A-Za-z\s]', regex=True)]

Unnamed: 0,source,target,Interactions


The only row containing special characters is in the source column, which is why the check on the target column above returns no rows. It is Guinea-Bissau, whose hyphen is part of the compound country name. This should not affect our analysis.

In [37]:
# Searching for columns with mixed value types (e.g. numbers mixed with text)
for col in df_relationships.columns.tolist():
    weird = (df_relationships[[col]].map(type) != df_relationships[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_relationships[weird]) > 0:
        print(col)

In [38]:
# Searching for values with leading or trailing spaces in all columns
for col in df_relationships.select_dtypes(include='object').columns:
    has_spaces = df_relationships[col].astype(str).apply(lambda x: x != x.strip())
    valuesWithSpaces = df_relationships.loc[has_spaces, col].unique()
    if len(valuesWithSpaces) > 0:
        print(f'\n{col}:')
        print(valuesWithSpaces[:5])

The above confirms that the relationships data frame does not have any rows with mixed types, leading spaces, or trailing spaces.

# Exporting Dataframes as Pickle and CSV Files

In [39]:
df_noDup_countries.to_pickle(
    r'D:\Data_Analysis\13-11-2025_Network_Visualization\03.Scripts\20th-century\country_list.pkl')
df_relationships.to_pickle(
    r'D:\Data_Analysis\13-11-2025_Network_Visualization\03.Scripts\20th-century\country_relationships.pkl')
df_noDup_countries.to_csv(
    r'D:\Data_Analysis\13-11-2025_Network_Visualization\03.Scripts\20th-century\country_list.csv')
df_relationships.to_csv(
    r'D:\Data_Analysis\13-11-2025_Network_Visualization\03.Scripts\20th-century\country_relationships.csv')
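The absolute Windows paths above only work on the original machine. A minimal sketch of a more portable export, assuming a hypothetical `exports` folder and a one-row stand-in for the relationships frame, builds the paths with `pathlib` and reads the pickle back to confirm the round trip. A temporary directory is used here only so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row stand-in for df_relationships
sample = pd.DataFrame({'source': ['Austria'], 'target': ['France'],
                       'Interactions': [6]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'exports'  # assumed folder name
    out_dir.mkdir(parents=True, exist_ok=True)

    pkl_path = out_dir / 'country_relationships.pkl'
    csv_path = out_dir / 'country_relationships.csv'

    sample.to_pickle(pkl_path)
    # index=False avoids writing the index as an unnamed first column
    sample.to_csv(csv_path, index=False)

    restored = pd.read_pickle(pkl_path)
    csv_header = csv_path.read_text().splitlines()[0]

print(restored.equals(sample), csv_header)
```

Without `index=False`, `to_csv` writes the index as a column with an empty header, which pandas labels `Unnamed: 0` when the file is read back.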