# Create O.Henry Text Corpus

First, I will install the 'pandas' and 'spacy' libraries, which are essential for the operations we intend to perform in this script.

In [23]:
# Install pandas and spacy
!pip install pandas spacy



In [24]:
# Download English language model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
      --------------------------------------- 0.3/12.8 MB 6.1 MB/s eta 0:00:03
     -- ------------------------------------- 0.7/12.8 MB 7.4 MB/s eta 0:00:02
     -- ------------------------------------- 0.9/12.8 MB 7.5 MB/s eta 0:00:02
     --- ------------------------------------ 1.2/12.8 MB 6.3 MB/s eta 0:00:02
     ---- ----------------------------------- 1.5/12.8 MB 6.3 MB/s eta 0:00:02
     ----- ---------------------------------- 1.7/12.8 MB 6.5 MB/s eta 0:00:02
     ------ --------------------------------- 2.0/12.8 MB 6.4 MB/s eta 0:00:02
     ------- -------------------------------- 2.3/12.8 MB 6.4 MB/s eta 0:00:02
     -------- ------------------------------- 2.7/12.8 MB 6.5 MB/s eta 0:00:02
     --------- --------------------------

I will read all text files from a specified folder.

In [25]:
import os
import pandas as pd

# Set the folder path where text files are stored
folder_path = './text'

# Initialize an empty list to store individual dataframes
dataframes = []

# Loop through each file in the folder
for filename in os.listdir(folder_path):
    # Check if the file is a text file
    if filename.endswith('.txt'):
        # Create the full path to the file
        file_path = os.path.join(folder_path, filename)
        # Open the file
        with open(file_path, 'r', encoding='utf-8') as file:
            # Read the content of the file
            text = file.read()

To ensure I only capture the essential text, I will remove any unnecessary copyright information. Furthermore, to avoid any copyright infringement issues, we have carefully reviewed and adhered to Project Gutenberg's guidelines.

In [26]:
# Remove copyright information and introductory content from text files
start_text = text.find("*** START OF THIS PROJECT GUTENBERG EBOOK")
end_text = text.find("*** END OF THIS PROJECT GUTENBERG EBOOK")
text = text[start_text:end_text] if start_text > -1 and end_text > -1 else text

In [27]:
# Clean up line breaks and carriage returns in text
text = text.replace('\n', ' ').replace('\r', '')

# Tokenization

From this section, I will begin the process of data annotation using the spaCy library. spaCy is a powerful and efficient tool for natural language processing, which allows us to perform tasks such as tokenization, lemmatization, part-of-speech tagging, and named entity recognition. I'll apply these techniques to our text data to extract meaningful linguistic features.

In [28]:
import spacy

In [29]:
# Load English language model
nlp = spacy.load('en_core_web_sm')

In [30]:
# Process the text with Spacy
doc = nlp(text)

In [31]:
# Generate a list of tokenization
tokens = [token.text for token in doc]
print("Tokens:", tokens)



# Lemmatization

In [32]:
# Generate a list of lemmatization
lemmas = [token.lemma_ for token in doc]
print("Lemmas:", lemmas)



# POS-tagging

In [33]:
# Generate a list of words with their POS
pos_tags = [(token.text, token.pos_) for token in doc]
print("POS tags:", pos_tags)



# Create a csv file

In [34]:
# Create a DataFrame for the current document
df = pd.DataFrame({
    'Filename': [filename],
    'Document': [text],
    'Tokens': [tokens],
    'Lemmas': [lemmas],
    'POS': [pos_tags]
})

# Append the DataFrame to the list
dataframes.append(df)

In [35]:
# Concatenate all dataframes into one
corpus_df = pd.concat(dataframes, ignore_index=True)

# Save the DataFrame to a CSV file
corpus_df.to_csv("corpus.csv", index=False)

# Print the first few rows of the DataFrame
print(corpus_df.head())

         Filename                                           Document  \
0  Whirligigs.txt  ﻿The Project Gutenberg eBook of Whirligigs    ...   

                                              Tokens  \
0  [﻿The, Project, Gutenberg, eBook, of, Whirligi...   

                                              Lemmas  \
0  [﻿the, Project, Gutenberg, eBook, of, Whirligi...   

                                                 POS  
0  [(﻿The, NOUN), (Project, PROPN), (Gutenberg, P...  
