**Start a new notebook in JupyterLab and import the libraries you’ll need.**

In [51]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords,wordnet
from nltk.stem import WordNetLemmatizer
from collections import Counter
import pandas as pd
import nltk
import re
import string
import spacy

# Download the stopwords and tokenizer (punkt) data
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shahj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shahj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Load the twentieth-century text file.**

In [26]:
# Open the file and read its contents; the with block closes the file automatically
with open('key_events_20th_century.txt', 'r', encoding="utf-8") as file:
    text = file.read()

**Evaluate whether the text needs wrangling: does it contain any special characters?**

In [27]:
def check_special_characters(text):
    special_characters = set()
    for char in text:
        if char not in string.ascii_letters and char not in string.digits and char not in string.whitespace:
            special_characters.add(char)
    return special_characters

special_chars = check_special_characters(text)
print("Special characters used:", special_chars)


Special characters used: {'県', ',', 'ã', '’', '?', ')', ';', '(', '縄', '.', ']', '-', '/', '–', 'ö', '^', 'í', '[', ':', 'é', '—', '|', '&', '沖', '／', '!', "'", '"', '°', '®'}


**Inferences**

The text contains punctuation, typographic symbols (such as ’, –, —, ® and °), accented Latin letters (ã, ö, í, é) and a few CJK characters (沖, 縄, 県). It therefore needs wrangling before analysis.
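The later cleaning step keeps only the characters a-z, A-Z and 0-9, so accented letters are deleted rather than converted (é simply disappears from a word). The following is a minimal sketch of one alternative, assuming that folding accents to their base ASCII letters is acceptable for this analysis. The helper name fold_accents is hypothetical and not part of the notebook.

```python
import unicodedata

def fold_accents(text):
    """Replace accented letters with their unaccented base letters."""
    # NFKD splits a character such as 'é' into 'e' plus a combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep every other character unchanged
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_accents("São Tomé and Príncipe"))
```

If used, this would run on the raw text before tokenizing, so that names with accents survive the special-character removal in a readable form.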

In [28]:
# Performing wrangling

# Step 1: Lowercasing
text = text.lower()

In [29]:
# Remove numbers using regular expression
text = re.sub(r'\d+', '', text)

#print(text)

In [30]:
# Step 2: Tokenize the text
tokens = word_tokenize(text)

#print(tokens)

In [31]:
# Step 3: Remove punctuation
tokens = [word for word in tokens if word not in string.punctuation]

# Step 4: Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word.lower() not in stop_words]

# Step 5: Remove special characters using regular expressions,
# then drop any tokens that are left empty
tokens = [re.sub(r'[^a-zA-Z0-9]', '', word) for word in tokens]
tokens = [word for word in tokens if word]

#print(tokens)

In [32]:
# Step 6: Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Step 7: Lemmatize each word
lemmatized_words = [lemmatizer.lemmatize(word, wordnet.VERB) for word in tokens]

#print(lemmatized_words)

In [33]:
# Open the countries file and read one country name per line
with open('countries_list.txt', 'r') as file:
    countries = file.read().split("\n")

In [34]:
# Function to check if a country name exists in the text
def check_country_mentions(text, country_list):
    mentions = []
    for country in country_list:
        if country.lower() in text.lower():  # Convert both to lowercase for case-insensitive comparison
            mentions.append(country)
    return mentions

# Call the function to get the list of mentioned countries
mentioned_countries = check_country_mentions(text, countries)

# Print the mentioned countries
print("Countries mentioned in the text:", mentioned_countries)

Countries mentioned in the text: ['Albania', 'Algeria', 'Angola', 'Australia', 'Austria', 'Bangladesh', 'Belarus', 'Belgium', 'Bulgaria', 'Cambodia', 'Canada', 'Cape Verde', 'Cuba', 'Denmark', 'Egypt', 'Estonia', 'Finland', 'France', 'Germany', 'Ghana', 'Greece', 'Guinea', 'Guinea', 'Bissau', 'Hungary', 'India', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Japan', 'Kenya', 'Laos', 'Latvia', 'Lebanon', 'Libya', 'Lithuania', 'Luxembourg', 'Mexico', 'Moldova', 'Mongolia', 'Morocco', 'Mozambique', 'Netherlands', 'Niger', 'Nigeria', 'Norway', 'Oman', 'Pakistan', 'Papua New Guinea', 'Philippines', 'Poland', 'Romania', 'Russia', 'São Tomé and Príncipe', 'Serbia', 'Seychelles', 'Singapore', 'Slovakia', 'Solomon Islands', 'South Africa', 'Spain', 'Sudan', 'Sweden', 'Thailand', 'Ukraine', 'United Kingdom', 'United States', 'Vietnam']


**Inferences**

Yes, the country names in countries_list match the spellings used in the text scraped from the 20th Century Wikipedia page, and the comparison turned up nothing that needed correcting. Both the country names and the text were lowercased before comparison, which rules out mismatches caused by case. One caveat: the check is a substring test, so a short name can be found inside a longer one (for example, 'Niger' inside 'Nigeria'), and a few entries in the list above may be false positives. With that in mind, the country names are consistent enough to proceed with the analysis.
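Because check_country_mentions uses the in operator, it reports any country whose name appears as a substring of the text. The sketch below shows one possible stricter variant that matches whole words only. The function name find_country_mentions and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def find_country_mentions(text, country_list):
    """Return the countries that occur in text as whole words or phrases."""
    mentions = []
    for country in country_list:
        # \b anchors the match at word boundaries, so 'Niger' does not
        # match inside 'Nigeria'; re.escape protects names with punctuation
        pattern = r"\b" + re.escape(country) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            mentions.append(country)
    return mentions

# Illustrative sample sentence, not taken from the source text
sample = "Nigeria and Romania signed the agreement."
print(find_country_mentions(sample, ["Niger", "Nigeria", "Oman", "Romania"]))
```

Word boundaries do not help when one name is a complete word inside another, such as 'Guinea' in 'Papua New Guinea'; handling that case would need extra logic, for example matching longer names first.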

**Use the text file to create a NER object.**

In [35]:
# Open the file in write mode
with open('Common_Countries.txt', 'w') as file:
    # Write each country on its own line
    file.write("\n".join(mentioned_countries))

In [43]:
# Tag the words with Part-of-Speech (POS) tags
tagged_words = nltk.pos_tag(lemmatized_words)

# Use NER to extract named entities
ner_object = nltk.ne_chunk(tagged_words)

# Print the NER object
#print(ner_object)

**Split the sentence entities from the NER object.**

In [47]:
# Reload the original text, since the wrangling steps above lowercased it
with open('key_events_20th_century.txt', 'r', encoding="utf-8") as file:
    text = file.read()

# Load the English language model
nlp = spacy.load('en_core_web_sm')


# Process the text with spaCy
doc = nlp(text)

# Function to check if the entity is in the countries list
def is_country(entity):
    return entity.text in countries  # countries is the list read from countries_list.txt

# Split sentence entities
sentence_entities = []
for sentence in doc.sents:
    sentence_entities.append([entity.text for entity in sentence.ents if is_country(entity)])

print(sentence_entities)

[[], [], [], [], [], [], ['France', 'Italy', 'Russia'], ['Germany', 'Austria', 'Hungary', 'Bulgaria'], ['Russia'], ['Germany', 'Russia'], ['Germany'], ['Germany'], [], [], [], [], [], [], [], ['Germany'], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], ['Germany', 'Italy'], ['Germany', 'Germany'], ['Germany', 'Germany'], ['Austria', 'Austria', 'Germany'], [], [], [], [], ['Spain'], [], [], [], [], ['France', 'Poland'], ['Poland'], ['France', 'Germany', 'Poland', 'Poland'], [], [], [], [], ['Poland', 'Germany'], ['Estonia', 'Latvia', 'Lithuania', 'Finland'], ['Germany', 'Poland', 'Belgium', 'Netherlands', 'Luxembourg'], ['Belgium'], ['Denmark', 'Norway'], ['Norway'], ['Norway', 'Denmark', 'Sweden', 'Germany'], ['France'], [], ['France'], [], [], ['France'], [], ['Italy'], [], ['Germany'], [], [], [], [], [], [], [], [], [], ['Greece', 'Albania', 'Greece'], [], [], [], ['Ukraine', 'Belarus'], [], [], ['Libya', 'Egypt'], ['Libya'], [], ['Egypt'], ['Iraq'], [], [], 

**Filter the entities so that you end up only with the ones from your countries list.**

In [49]:
# Filter the entities to keep only those from the countries list
filtered_entities = [entity.text for entity in doc.ents if is_country(entity)]

print(filtered_entities)



['France', 'Italy', 'Russia', 'Germany', 'Austria', 'Hungary', 'Bulgaria', 'Russia', 'Germany', 'Russia', 'Germany', 'Germany', 'Germany', 'Germany', 'Italy', 'Germany', 'Germany', 'Germany', 'Germany', 'Austria', 'Austria', 'Germany', 'Spain', 'France', 'Poland', 'Poland', 'France', 'Germany', 'Poland', 'Poland', 'Poland', 'Germany', 'Estonia', 'Latvia', 'Lithuania', 'Finland', 'Germany', 'Poland', 'Belgium', 'Netherlands', 'Luxembourg', 'Belgium', 'Denmark', 'Norway', 'Norway', 'Norway', 'Denmark', 'Sweden', 'Germany', 'France', 'France', 'France', 'Italy', 'Germany', 'Greece', 'Albania', 'Greece', 'Ukraine', 'Belarus', 'Libya', 'Egypt', 'Libya', 'Egypt', 'Iraq', 'Japan', 'Germany', 'Japan', 'Russia', 'Germany', 'Italy', 'Germany', 'Morocco', 'Algeria', 'Italy', 'Italy', 'Italy', 'Italy', 'France', 'Germany', 'France', 'France', 'Germany', 'Poland', 'Germany', 'Germany', 'Germany', 'Japan', 'Japan', 'Japan', 'Japan', 'Japan', 'Japan', 'Germany', 'Japan', 'Japan', 'Japan', 'Thailand',
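Counter is imported at the top of the notebook but not used so far. As a hedged sketch, it could summarize how often each country appears in filtered_entities. The sample list below is only the first ten entries of the output above, so the counts are illustrative rather than totals for the whole text.

```python
from collections import Counter

# First ten entries of filtered_entities, copied from the output above
sample_entities = ['France', 'Italy', 'Russia', 'Germany', 'Austria',
                   'Hungary', 'Bulgaria', 'Russia', 'Germany', 'Russia']

# Count how many times each country is mentioned
country_counts = Counter(sample_entities)

# List the two most frequently mentioned countries in the sample
print(country_counts.most_common(2))
```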

**Create the relationships dataframe.**

In [52]:
# Create relationships by pairing each country mention with every mention that follows it in the text
relationships = []
for i in range(len(filtered_entities) - 1):
    for j in range(i + 1, len(filtered_entities)):
        relationships.append((filtered_entities[i], filtered_entities[j]))

# Convert the relationships list to a DataFrame and save it as a CSV file
relationships_df = pd.DataFrame(relationships, columns=['Country1', 'Country2'])
relationships_df.to_csv('Relationship.csv', index=False)



In [53]:
relationships_df

| | Country1 | Country2 |
|---|---|---|
| 0 | France | Italy |
| 1 | France | Russia |
| 2 | France | Germany |
| 3 | France | Austria |
| 4 | France | Hungary |
| ... | ... | ... |
| 20701 | Vietnam | India |
| 20702 | Vietnam | Singapore |
| 20703 | Lebanon | India |
| 20704 | Lebanon | Singapore |
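The loop above links every mention to every later mention anywhere in the text, which is why the table has over twenty thousand rows. An alternative, sketched here as an assumption rather than as the notebook's method, is to link only countries that appear in the same sentence, using the per-sentence lists in sentence_entities. The helper name sentence_pairs is hypothetical, and the sample lists are a few sentences copied from the earlier output.

```python
from collections import Counter
from itertools import combinations

def sentence_pairs(sentence_entities):
    """Count unordered country pairs that share a sentence."""
    pair_counts = Counter()
    for entities in sentence_entities:
        # Deduplicate and sort so ('Germany', 'Poland') and
        # ('Poland', 'Germany') are counted as the same pair
        unique = sorted(set(entities))
        for pair in combinations(unique, 2):
            pair_counts[pair] += 1
    return pair_counts

# A few per-sentence lists copied from the sentence_entities output
sample = [['France', 'Italy', 'Russia'], [], ['Germany', 'Russia'],
          ['Poland', 'Germany'], ['France', 'Germany', 'Poland', 'Poland']]

pair_counts = sentence_pairs(sample)
print(pair_counts.most_common(3))
```

The resulting counts could be turned into a DataFrame with one row per pair and a weight column, which tends to be a more compact input for a network graph.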
