<a href="https://colab.research.google.com/github/Sagaust/DH-Computational-Methodologies/blob/main/Word_Frequency_Trend.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Step 1: Mount Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Step 2: Set the path to the folder containing the text files
folder_path = '/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Amos Tutuola'  # Change this to the folder that holds your own text files

# Step 3: List all text files in the folder (sorted so the merge order is the same on every run)
import os
text_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.txt'))

# Step 4: Merge the content of all text files into a single string
merged_content = ""
for file_name in text_files:
    # encoding='utf-8' is an assumption about the corpus files; adjust it if they use another encoding
    with open(os.path.join(folder_path, file_name), 'r', encoding='utf-8') as file:
        merged_content += file.read() + "\n\n"  # Two newlines separate the content of different files

# Step 5: Write the merged content to a new .txt file
output_file_path = '/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Tutuola_merged_file.txt'
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    output_file.write(merged_content)

print(f"Merged content written to {output_file_path}")
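
Since the same merge is repeated later for a second author, it may be convenient to wrap it in a function. The following is a minimal sketch, not part of the original notebook: `merge_text_files` is a hypothetical helper name, and the UTF-8 default is an assumption about the corpus files.

```python
import os

def merge_text_files(folder_path, output_file_path, encoding="utf-8"):
    """Concatenate every .txt file in folder_path into output_file_path.

    Hypothetical helper; the encoding default is an assumption.
    Returns the merged file names in the order they were written.
    """
    # Sort the names so the merge order is the same on every run
    names = sorted(f for f in os.listdir(folder_path) if f.endswith(".txt"))
    parts = []
    for name in names:
        with open(os.path.join(folder_path, name), "r", encoding=encoding) as fh:
            parts.append(fh.read())
    # Two newlines separate the content of different files
    with open(output_file_path, "w", encoding=encoding) as out:
        out.write("\n\n".join(parts))
    return names
```

With such a helper, each author's corpus could be merged with a single call that passes the author folder and the desired output path.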

To visualize a word frequency trend using Python, you'd typically follow these steps:

    Tokenization: Break down the text into individual words.
    Frequency Calculation: Count the occurrences of each word.
    Visualization: Plot the frequencies of the top-N words.
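
The three steps can be seen in miniature on a short string. This is only an illustrative sketch: the sample sentence is made up, and the "plot" is a plain-text bar so that the snippet needs nothing beyond the standard library.

```python
import re
from collections import Counter

# Made-up sample sentence, used only for illustration
sample = "The cat saw the dog, and the dog didn't see the cat."

# 1. Tokenization: lowercase, then keep runs of letters, digits and underscores
tokens = re.findall(r"\w+", sample.lower())
# Note that \w+ splits on apostrophes and hyphens: "didn't" becomes "didn" and "t"

# 2. Frequency calculation
counts = Counter(tokens)

# 3. Visualization, here as one plain-text bar per word
for word, count in counts.most_common(3):
    print(f"{word:<5} {'#' * count}")
```

The apostrophe behaviour is worth keeping in mind when reading the plots: fragments such as `t` or `s` in the results are usually the tails of contractions and possessives.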

The next cell applies these steps to the merged Tutuola file and plots its 10 most frequent words.

In [None]:
import matplotlib.pyplot as plt
from collections import Counter
import re

# Load the text from the file
with open('/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Tutuola_merged_file.txt', 'r') as file:
    text = file.read()

# Tokenization: Convert text to lowercase and split into words
words = re.findall(r'\w+', text.lower())

# Frequency Calculation
word_counts = Counter(words)

# Visualization: Plotting the frequencies of the top 10 words
common_words = word_counts.most_common(10)
words = [word[0] for word in common_words]
counts = [word[1] for word in common_words]

plt.figure(figsize=(10, 5))
plt.bar(words, counts, color='skyblue')
plt.title('Top 10 Words in Tutuola_merged_file.txt')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()


A single bar chart shows only part of the picture. To get more out of the word frequencies in a corpus, we can look at them in several ways:

    Basic Word Frequency Plot
    Word Frequency Trend over the Course of the Text
    Word Cloud
    Bigram Frequency Plot
    Trigram Frequency Plot
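
One way to approach the "trend over the course of the text" item is to cut the token stream into equal slices and count selected words in each slice. The sketch below works under that assumption; `frequency_by_segment` is a hypothetical helper and the demo string is made up.

```python
import re
from collections import Counter

def frequency_by_segment(text, target_words, n_segments=10):
    """Count each target word in n_segments consecutive, near-equal slices of the tokens."""
    tokens = re.findall(r"\w+", text.lower())
    trend = {word: [] for word in target_words}
    for i in range(n_segments):
        # Integer arithmetic gives exactly n_segments slices that cover every token once
        start = i * len(tokens) // n_segments
        end = (i + 1) * len(tokens) // n_segments
        segment_counts = Counter(tokens[start:end])
        for word in target_words:
            trend[word].append(segment_counts[word])
    return trend

# Made-up demo text: "bush" dominates the first half, "town" the second
demo = "bush ghost bush river " * 2 + "town river town ghost " * 2
trend = frequency_by_segment(demo, ["bush", "town"], n_segments=2)
print(trend)
```

In the notebook, each list in the returned dictionary could be passed to `plt.plot` with the segment number on the x-axis, which gives one line per word. Because the merged file concatenates several books, the segments follow the order in which the files were merged.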

In [None]:
# Basic Word Frequency Plot

import matplotlib.pyplot as plt
from collections import Counter
import re

# Load the text from the file
with open('/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Tutuola_merged_file.txt', 'r') as file:
    text = file.read()

# Tokenization: Convert text to lowercase and split into words
words = re.findall(r'\w+', text.lower())
# Frequency Calculation
word_counts = Counter(words)

# Visualization
common_words = word_counts.most_common(20)
words = [word[0] for word in common_words]
counts = [word[1] for word in common_words]

plt.figure(figsize=(12, 6))
plt.bar(words, counts, color='lightseagreen')
plt.title('Top 20 Words in Tutuola_merged_file.txt')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
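
In most English corpora the top of such a list is dominated by function words ("the", "and", "of"), which say little about content. A common remedy is to filter out stopwords before counting. The sketch below assumes a tiny hand-written stopword set purely for illustration; `STOPWORDS` and `top_content_words` are hypothetical names, and a real analysis would use a much longer list.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; real lists are much longer)
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "i", "it", "was", "that", "he", "she"}

def top_content_words(text, n=20, stopwords=STOPWORDS):
    """Return the n most frequent words that are not in the stopword set."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(token for token in tokens if token not in stopwords)
    return counts.most_common(n)

# Made-up sample sentence
result = top_content_words("The king of the town and the king of the forest", n=2)
print(result)
```

spaCy, which this notebook installs further down, also marks stopwords on each token through `token.is_stop`, so the same filter could be applied there instead of maintaining a list by hand.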


In [None]:
# Least Frequent Words Plot

import matplotlib.pyplot as plt
from collections import Counter
import re

# Load the text from the file
with open('/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Tutuola_merged_file.txt', 'r') as file:
    text = file.read()

# Tokenization: Convert text to lowercase and split into words
words = re.findall(r'\w+', text.lower())
# Frequency Calculation
word_counts = Counter(words)

# Get the least common words: most_common() sorts by descending count, so [-20:] takes the last 20 entries.
least_common_words = word_counts.most_common()[-20:]
words = [word[0] for word in least_common_words]
counts = [word[1] for word in least_common_words]

plt.figure(figsize=(12, 6))
plt.bar(words, counts, color='lightblue')
plt.title('20 Least Frequent Words in Tutuola_merged_file.txt')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
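
A caveat on reading this plot: in a corpus of any size, many words are likely to occur exactly once, so the tail of `most_common()` is only a small slice of a much larger group of equally rare words. Counting those once-only words (hapax legomena) can be more informative than plotting a handful of them. A minimal sketch, with `hapax_legomena` as a hypothetical helper name and a made-up sample string:

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return the alphabetically sorted words that occur exactly once in text."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sorted(word for word, count in counts.items() if count == 1)

# Made-up sample string
sample = "one two two three three three four"
hapaxes = hapax_legomena(sample)
print(len(hapaxes), hapaxes)
```

Dividing the number of hapaxes by the vocabulary size gives a rough indication of how much of the vocabulary is used only once.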


## Most Frequent Nouns Only

To focus on the most frequent nouns, each word first needs a part-of-speech (POS) tag, which requires a Natural Language Processing (NLP) tool. The spaCy library is a popular choice for such tasks.

Here's how you can do it:

    Tokenize the text using spaCy.
    Extract nouns and their frequencies.
    Plot the most frequent nouns.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm


In [None]:
import spacy
import matplotlib.pyplot as plt
from collections import Counter

# Load the small English pipeline (tokenizer, POS tagger, parser, NER; the "sm" model ships without word vectors)
nlp = spacy.load("en_core_web_sm")

# Raise max_length (default: 1,000,000 characters) so the whole merged file can be processed in one call
nlp.max_length = 2500000  # Must be at least len(text); very long inputs need a lot of memory

# Process the text
with open('/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Tutuola_merged_file.txt', 'r') as file:
    text = file.read()
doc = nlp(text)

# Extract nouns, lowercased so that "House" and "house" are counted together (as in the word-level cells)
nouns = [token.text.lower() for token in doc if token.pos_ == "NOUN"]

# Frequency Calculation
noun_counts = Counter(nouns)

# Visualization: Plotting the frequencies of the top 20 nouns
common_nouns = noun_counts.most_common(20)
nouns = [noun[0] for noun in common_nouns]
counts = [noun[1] for noun in common_nouns]

plt.figure(figsize=(12, 6))
plt.bar(nouns, counts, color='lightgreen')
plt.title('Top 20 Nouns in Tutuola_merged_file.txt')
plt.xlabel('Nouns')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
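
Raising `nlp.max_length` works, but processing a very long text in one call can use a lot of memory. An alternative is to split the text into smaller pieces and tag them one at a time. The sketch below shows only the splitting step and relies on the blank lines in the merged file; `split_into_chunks` is a hypothetical helper and the 100,000-character default is an arbitrary assumption.

```python
def split_into_chunks(text, max_chars=100_000):
    """Split text at blank lines into chunks of at most about max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks, current, current_len = [], [], 0
    for paragraph in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and current_len + len(paragraph) > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += len(paragraph) + 2  # +2 for the blank-line separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# In the notebook, the chunks could then be tagged with spaCy's nlp.pipe, e.g.:
# nouns = [tok.text.lower() for doc in nlp.pipe(split_into_chunks(text))
#          for tok in doc if tok.pos_ == "NOUN"]
```

Because POS tagging works within sentences, splitting at paragraph boundaries should not change the noun counts in any meaningful way, though that is worth checking on a small sample.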


### Bigram Frequency Plot

A bigram is a pair of consecutive words. Counting bigrams shows which words tend to appear together, which adds context that single-word counts lack.

In [None]:
import matplotlib.pyplot as plt
from collections import Counter
import re

# Load the text from the file
with open('/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Tutuola_merged_file.txt', 'r') as file:
    text = file.read()

# Tokenization: Convert text to lowercase and split into words
words = re.findall(r'\w+', text.lower())

bigrams = [(words[i], words[i+1]) for i in range(len(words)-1)]
bigram_counts = Counter(bigrams)

common_bigrams = bigram_counts.most_common(20)
bigram_pairs = [" ".join(bigram[0]) for bigram in common_bigrams]
counts = [bigram[1] for bigram in common_bigrams]

plt.figure(figsize=(12, 6))
plt.bar(bigram_pairs, counts, color='lightsalmon')
plt.title('Top 20 Bigrams in Tutuola_merged_file.txt')
plt.xlabel('Bigrams')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()


### Trigram Frequency Plot

A trigram is a run of three consecutive words; the plot shows the 20 most frequent ones.

In [None]:
import matplotlib.pyplot as plt
from collections import Counter
import re

# Load the text from the file
with open('/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Tutuola_merged_file.txt', 'r') as file:
    text = file.read()

# Tokenization: Convert text to lowercase and split into words
words = re.findall(r'\w+', text.lower())

trigrams = [(words[i], words[i+1], words[i+2]) for i in range(len(words)-2)]
trigram_counts = Counter(trigrams)

common_trigrams = trigram_counts.most_common(20)
trigram_triplets = [" ".join(trigram[0]) for trigram in common_trigrams]
counts = [trigram[1] for trigram in common_trigrams]

plt.figure(figsize=(12, 6))
plt.bar(trigram_triplets, counts, color='lightcoral')
plt.title('Top 20 Trigrams in Tutuola_merged_file.txt')
plt.xlabel('Trigrams')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
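
The bigram and trigram cells differ only in the window size, so both can be expressed with one function. A minimal sketch, assuming the same `\w+` tokenization as the rest of the notebook; `ngram_counts` is a hypothetical helper name and the sample string is only a demo.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive tokens in text."""
    tokens = re.findall(r"\w+", text.lower())
    # zip over n staggered views of the token list yields each n-token window once
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams)

sample = "to be or not to be"
bigram_counts = ngram_counts(sample, 2)
print(bigram_counts.most_common(1))
```

As in the cells above, the windows run across sentence and file boundaries, so a few counted n-grams will join the end of one sentence to the start of the next.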


## Word Frequency Grid
To generate a word frequency grid for the corpus, we can put the word frequencies into a table (a pandas DataFrame). The pandas library provides an easy way to create and display this grid.

In [None]:
import pandas as pd
import re
from collections import Counter

# Load the text from the file
with open('/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Tutuola_merged_file.txt', 'r') as file:
    text = file.read()

# Tokenization: Convert text to lowercase and split into words
words = re.findall(r'\w+', text.lower())

# Frequency Calculation
word_counts = Counter(words)

# Convert to DataFrame for visualization
df = pd.DataFrame(word_counts.most_common(), columns=["Word", "Frequency"])

# Adjust display settings
pd.set_option('display.max_rows', None)  # Display all rows
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

# Display the dataframe
print(df)
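
Raw counts depend on corpus size, which matters here because the notebook handles two authors whose merged files are unlikely to be the same length. A relative frequency (occurrences per 10,000 tokens) makes the two grids easier to compare. The sketch below uses only the standard library; `relative_frequencies` is a hypothetical helper and the per-10,000 scale is an arbitrary choice.

```python
import re
from collections import Counter

def relative_frequencies(text, per=10_000):
    """Return (word, count, occurrences per `per` tokens) rows, most frequent first."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    total = sum(counts.values())
    return [
        (word, count, round(count * per / total, 2))
        for word, count in counts.most_common()
    ]

# Made-up sample string
rows = relative_frequencies("a a a b")
print(rows)
```

In the DataFrame above, the same column could be added with `df["Frequency"] / df["Frequency"].sum() * 10000`.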


### Generate a CSV file from the word frequency data
    Use the to_csv method of the pandas DataFrame to save the word frequencies to a CSV file.
    Provide a path on Google Drive where the CSV file should be written.
    Echo the saved path at the end of the cell so the file can be located in Drive and downloaded from there.

In [None]:
import pandas as pd
import re
from collections import Counter

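# Load the text from the file
# Note: Okri_merged_file.txt is created by the merge cell at the end of this notebook, so run that cell first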
with open('/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Okri_merged_file.txt', 'r') as file:
    text = file.read()

# Tokenization: Convert text to lowercase and split into words
words = re.findall(r'\w+', text.lower())

# Frequency Calculation
word_counts = Counter(words)

# Convert to DataFrame
df = pd.DataFrame(word_counts.most_common(), columns=["Word", "Frequency"])

# Save to CSV
output_path = "/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Okri_word_frequencies.csv"
df.to_csv(output_path, index=False)

output_path


In [None]:
# Step 2: Set the path to the folder containing the text files
folder_path = '/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Ben Okri'  # Change this to the folder that holds your own text files

# Step 3: List all text files in the folder (sorted so the merge order is the same on every run)
import os
text_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.txt'))

# Step 4: Merge the content of all text files into a single string
merged_content = ""
for file_name in text_files:
    # encoding='utf-8' is an assumption about the corpus files; adjust it if they use another encoding
    with open(os.path.join(folder_path, file_name), 'r', encoding='utf-8') as file:
        merged_content += file.read() + "\n\n"  # Two newlines separate the content of different files

# Step 5: Write the merged content to a new .txt file
output_file_path = '/content/drive/MyDrive/Colab Notebooks/Corpus of African Literature/Okri_merged_file.txt'
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    output_file.write(merged_content)

print(f"Merged content written to {output_file_path}")
