# Introduction

Use this code to clean, section, and disaggregate texts and corpora. 

**Why Perform Text Sectioning?** 

Dividing texts into sections (for example, chapters or chunks of N length) is valuable as a precursor to topic modeling and other forms of computational analysis which perform more accurately when applied to groups of segmented documents from longer texts. 

**Why Disaggregate Texts?** 

Disaggregating the words in a text (in this case, by alphabetizing them) creates data sets that can be shared freely even when the original texts cannot be shared due to copyright restrictions. 

*Input/Output Specifications:* 

This code requires plain-text (.txt) files as input, either those from this repository's sample_data folder or those from a local machine. It returns CSV files with disaggregated text grouped by chapter or by chunk of N words.

# Upload and Add Text Files To Pandas DataFrame
In this section, text files are loaded into a Pandas DataFrame. Pandas is a fast and relatively easy way to work with large datasets. Though DataFrames are typically associated with numeric data, Pandas also offers many functions for [working with textual data](https://www.tutorialspoint.com/python_pandas/python_pandas_working_with_text_data.htm).
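As a rough illustration of the loading step, the sketch below reads every .txt file in a folder into a list of Title/Text records. It uses only the standard library so it can run on its own; the function name `read_text_files` and the demo file names are hypothetical and not part of this notebook.

```python
import tempfile
from pathlib import Path


def read_text_files(folder):
    """Return one {'Title', 'Text'} record per .txt file in folder."""
    records = []
    for txt_path in sorted(Path(folder).glob("*.txt")):
        # utf-8-sig strips a leading byte-order mark if one is present
        text = txt_path.read_text(encoding="utf-8-sig")
        # .stem is the file name without its .txt extension
        records.append({"Title": txt_path.stem, "Text": text})
    return records


# Demo on a throwaway folder holding tiny, made-up files
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "book_a.txt").write_text("First text.", encoding="utf-8")
    Path(demo_dir, "book_b.txt").write_text("\ufeffSecond text.", encoding="utf-8")
    Path(demo_dir, "notes.md").write_text("ignored", encoding="utf-8")
    records = read_text_files(demo_dir)

print(records)
```

Passing the list to `pd.DataFrame(records)` should give the same Title/Text layout as the `books` DataFrame built in the cells that follow.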

In [None]:
#Import os and glob
import glob
import os

#Import pandas
import pandas as pd

In [None]:
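#Get current working directory 
path = os.getcwd()
print(path)

#Change working directory (replace with the folder that holds your txt files)
os.chdir("/home/dssadmin/Desktop/SF_Analysis/Data/Rd3_Texts")

#os.chdir() returns None, so store the new working directory explicitly
path = os.getcwd()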

In [None]:
#Append all txt files to a pandas dataframe
filenames = []
data = []
files = [f for f in os.listdir(path) if os.path.isfile(f)]
for f in files:
    if f.endswith('.txt'):
        with open(f, 'rb') as myfile:
            filenames.append(myfile.name)
            data.append(myfile.read())
d = {'Title':filenames, 'Text': data}
books = pd.DataFrame(d)
books

In [None]:
#Remove .txt extension from titles (escaped dot, anchored to end, no trailing space left behind)
books['Title'] = books['Title'].str.replace(r'\.txt$', '', regex=True) 
books.head()

In [None]:
#Decode bytes to text; utf-8-sig also drops the byte-order mark (b'\xef\xbb\xbf')
books['Text'] = books['Text'].apply(lambda x: x.decode('utf-8-sig'))

#Replace newlines and other runs of whitespace with a single space
books['Text'] = books['Text'].str.replace(r'\s+', ' ', regex=True) 

#Remove punctuation and replace with no space (except periods and hyphens)
books['Text'] = books['Text'].str.replace(r'[^\w\-\.\s]+', '', regex = True)

#Remove periods and replace with space (to prevent incorrect compounds)
books['Text'] = books['Text'].str.replace(r'[^\w\-\s]+', ' ', regex = True)
books.head()

# Clean Texts and Set Parameters for Sectioning 
Several basic cleaning processes are implemented: removing unwanted characters from titles, removing newline characters from texts, and removing punctuation. Parameters are also set for part(s) of text to be included in sectioning. In the SciFi Corpus project, "START OF BOOK" and "END OF BOOK" tags were added to delineate the body of each text. Code in this section removes any text outside the starting and ending parameters--e.g., title page, copyright page, other paratext. 
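The tag-based trimming can be sketched on a single string as follows. This is a minimal illustration rather than the notebook's own method: the helper name `trim_paratext` is hypothetical, and the sketch assumes that a text with no tags should be left unchanged.

```python
def trim_paratext(text, start_tag="START OF BOOK", end_tag="END OF BOOK"):
    """Keep only the text between the first start tag and the next end tag."""
    before, found_start, after = text.partition(start_tag)
    # If the start tag is missing, keep the text from its beginning
    body = after if found_start else text
    # If the end tag is missing, partition returns the whole string
    body, _, _ = body.partition(end_tag)
    return body.strip()


sample = "Title page START OF BOOK It was a dark night END OF BOOK Copyright notice"
print(trim_paratext(sample))
print(trim_paratext("No tags here"))
```

With pandas, the same helper could be applied per row with `books['Text'].apply(trim_paratext)`.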

In [None]:
#Remove paratext (before and after START OF BOOK and END OF BOOK tags)
#If texts you are working with do not have these tags, ignore this cell

#Split book on start of book tag, keep text only after start of book tag
start = books["Text"].str.split("START OF BOOK", expand = True)
books['Text'] = start[1]

#Split book on end of book tag, keep text only before end of book tag
end = books["Text"].str.split("END OF BOOK", expand = True)
books['Text'] = end[0]
books

In [None]:
#Check that text is cleaned and sectioned
books.iloc[0]['Text']

In [None]:
#Define new dataframe (copy, so later columns are not also added to books)
books_cleaned = books.copy()

# Section Texts By Chapter Headings
When working with texts with clearly delineated chapters, using chapter headings is a relatively easy way to section texts into segments of (relatively) the same size. After checking the chapter counts for each text to confirm whether sectioning by chapter is a useful procedure, this code iterates through the texts and splits them each time it encounters a new "chapter" heading. From here, the text from each chapter is appended to a new dataframe and denoted by book and chapter number. 
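A minimal sketch of the chapter split on one text is shown below, assuming the heading marker is the literal string "CHAPTER" and that punctuation has already been removed. The function name `split_chapters` and the sample text are made up for illustration. As in the pandas version, segment 0 is whatever precedes the first heading.

```python
def split_chapters(title, text, heading="CHAPTER"):
    """Split text on each heading and label the pieces by book and chapter number."""
    rows = []
    for number, part in enumerate(text.split(heading)):
        part = part.strip()
        if part:  # skip empty segments, e.g. when the text starts with a heading
            rows.append((f"{title}_Chapter_{number}", part.lower()))
    return rows


sample = "Preface words CHAPTER I The ship left CHAPTER II The ship returned"
print(sample.count("CHAPTER"))  # number of chapter headings found
for label, chapter_text in split_chapters("Demo", sample):
    print(label, "|", chapter_text)
```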

In [None]:
#Count number of chapters in each text
chapter_counts = books_cleaned['Text'].str.count('CHAPTER')

#Append chapter counts to dataframe
books_cleaned["Chapters"] = chapter_counts
books_cleaned

In [None]:
#Make new cell each time new chapter starts 
new = books_cleaned["Text"].str.split("CHAPTER", expand = True).set_index(books_cleaned['Title'])
new

In [None]:
#Flatten dataframe so each chapter is on own row, designated by book and chapter 
chapters_df = new.stack().reset_index()
chapters_df.columns = ["Book", "Chapter", "Text"]
chapters_df

In [None]:
#Tidying the DF
#Combine book and chapter labels into one column
chapters_df['Book + Chapter'] = chapters_df['Book'].astype(str) + '_Chapter_' + chapters_df['Chapter'].astype(str)

#Remove individual book and chapter columns (drop returns a new dataframe, so assign it)
chapters_df = chapters_df.drop(columns=['Book', 'Chapter'])

#Lowercase all words
chapters_df['Text'] = chapters_df['Text'].str.lower()

#Reindex so book + chapter is first column 
column_names = "Book + Chapter", "Text"
chapters_df = chapters_df.reindex(columns=column_names)
chapters_df

# Section Chapters by Chunks of N Length
Though chapter headings are useful for splitting texts into semi-equal segments, disparities in chapter length may occur, especially in large corpora. To further segment texts, the text of each chapter can be divided into chunks of N words. 
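The chunking idea can be sketched on a single chapter as follows. To stay self-contained, this example tokenizes with `str.split()` instead of NLTK, so its token boundaries may differ from those produced by `nltk.word_tokenize`; the chapter label and sentence are made up.

```python
def chunk_tokens(tokens, chunk_size):
    """Yield consecutive slices of at most chunk_size tokens."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


chapter_label = "Demo_Chapter_1"  # hypothetical label
tokens = "the ship left the harbor at dawn and sailed north".split()

# Map each chunk label to the chunk's text; the last chunk may be shorter
chunks = {
    f"{chapter_label} Chunk {i}": " ".join(chunk)
    for i, chunk in enumerate(chunk_tokens(tokens, 4))
}
for label, chunk_text in chunks.items():
    print(label, "|", chunk_text)
```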

In [None]:
#Create new df to work with chunks (copy, so chapters_df keeps only its own columns)
new_chapters_df = chapters_df.copy()

#Get number of words in each chapter (helps to determine chunk length)
ch_words = new_chapters_df["Text"].apply(lambda x: len(str(x).split(' ')))

#Append word counts to dataframe
new_chapters_df["Word Count"] = ch_words
new_chapters_df

In [None]:
#Tokenize Text
import nltk
#Download tokenizer data; recent NLTK releases may also need nltk.download('punkt_tab')
nltk.download('punkt')
new_chapters_df['Tokens'] = new_chapters_df.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1)
new_chapters_df

In [None]:
#Define chunking function
def split(list_a, chunk_size):
  for i in range(0, len(list_a), chunk_size):
    yield list_a[i:i + chunk_size]

#Set desired size of chunks
chunk_size = 1000

#Create new list for chunked sentences
chunked_ch = []

#Perform chunking function on each row of tokens
s = new_chapters_df['Tokens']
for content in s:
  chunks = list(split(content, chunk_size))
  #Add to new list
  chunked_ch.append(chunks)


In [None]:
#Create dictionary to associate chunks with titles
keys = new_chapters_df['Book + Chapter']
values = chunked_ch

#Pair labels and chunk lists by position (does not depend on the dataframe index)
res = dict(zip(keys, values))

In [None]:
#Add chunks to new dataframe
chunked_ch_df = pd.DataFrame.from_dict(res, orient='index')
chunked_ch_df.head()

In [None]:
#Reset dataframe index and rename columns
chunked_ch_df = chunked_ch_df.stack().reset_index()
chunked_ch_df.columns = ["Title","Chunk","Text"]
chunked_ch_df

In [None]:
#Tidying the DF
#Combine book and chunk labels into one column
chunked_ch_df['Book + Chunk'] = chunked_ch_df['Title'].astype(str) + ' Chunk ' + chunked_ch_df['Chunk'].astype(str)

#Remove individual book and chunk columns (drop returns a new dataframe, so assign it)
chunked_ch_df = chunked_ch_df.drop(columns=['Title', 'Chunk'])

#Detokenize text (join each chunk's tokens back into a single string)
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()
chunked_ch_df['Text'] = chunked_ch_df['Text'].apply(detokenizer.detokenize)

#Lowercase all words
chunked_ch_df['Text'] = chunked_ch_df['Text'].str.lower()

#Reindex so book + chunk is first column 
column_names = "Book + Chunk", "Text"
chunked_ch_df = chunked_ch_df.reindex(columns=column_names)

#Print cleaned df
chunked_ch_df

# Section Texts By Chunks of N Length
When working with texts without discernible chapter headings, or with chapter headings too infrequent to split texts into meaningful segments, texts can instead be sectioned into chunks of N words, where N is set below as `chunk_size`. After checking the word counts for each text to determine what chunk size would be appropriate, this code iterates through the texts and starts a new chunk every N tokens. From here, the text from each chunk is appended to a new dataframe and labeled by book and chunk number.
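One way to use the word counts when choosing N is to estimate, for a candidate chunk size, how many chunks each text would yield and how short the final chunk would be. The sketch below is only an illustration: the helper name `chunk_plan` and the word counts are made up.

```python
import math


def chunk_plan(word_count, chunk_size):
    """Return (number of chunks, size of the final chunk) for one text."""
    if word_count == 0:
        return 0, 0
    n_chunks = math.ceil(word_count / chunk_size)
    # Every chunk is full except possibly the last one
    last_chunk = word_count - (n_chunks - 1) * chunk_size
    return n_chunks, last_chunk


word_counts = {"book_a": 52300, "book_b": 1950}  # hypothetical word counts
for title, count in word_counts.items():
    for size in (500, 1000):
        n_chunks, last_chunk = chunk_plan(count, size)
        print(f"{title}: N={size} gives {n_chunks} chunks, last chunk has {last_chunk} words")
```

A very short final chunk relative to N may be worth noting before running downstream analysis, since it will carry far fewer words than the other chunks.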

In [None]:
#Get number of words in each book (helps to determine chunk length)
words = books_cleaned["Text"].apply(lambda x: len(str(x).split(' ')))

#Append word counts to dataframe
books_cleaned["Word Count"] = words
books_cleaned

In [None]:
#Tokenize Text
import nltk
#Download tokenizer data; recent NLTK releases may also need nltk.download('punkt_tab')
nltk.download('punkt')
books_cleaned['Tokens'] = books_cleaned.apply(lambda row: nltk.word_tokenize(row['Text']), axis=1)
books_cleaned

In [None]:
#Define chunking function
def split(list_a, chunk_size):
  for i in range(0, len(list_a), chunk_size):
    yield list_a[i:i + chunk_size]

#Set desired size of chunks
chunk_size = 1000

#Create new list for chunked sentences
chunked_sentences = []

#Perform chunking function on each row of tokens
s = books_cleaned['Tokens']
for content in s:
  chunks = list(split(content, chunk_size))
  #Check that text is being chunked correctly
  print(chunks[0])
  #Add to new list
  chunked_sentences.append(chunks)


In [None]:
#Create dictionary to associate chunks with titles
keys = books_cleaned['Title']
values = chunked_sentences

#Pair titles and chunk lists by position (does not depend on the dataframe index)
res = dict(zip(keys, values))

In [None]:
#Add chunks to new dataframe
chunked_df = pd.DataFrame.from_dict(res, orient='index')
chunked_df.head()

In [None]:
#Reset dataframe index and rename columns
chunked_df = chunked_df.stack().reset_index()
chunked_df.columns = ["Title","Chunk","Text"]
chunked_df

In [None]:
#Tidying the DF
#Combine book and chunk labels into one column
chunked_df['Book + Chunk'] = chunked_df['Title'].astype(str) + ' Chunk ' + chunked_df['Chunk'].astype(str)

#Remove individual book and chunk columns (drop returns a new dataframe, so assign it)
chunked_df = chunked_df.drop(columns=['Title', 'Chunk'])

#Detokenize text (join each chunk's tokens back into a single string)
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()
chunked_df['Text'] = chunked_df['Text'].apply(detokenizer.detokenize)

#Lowercase all words
chunked_df['Text'] = chunked_df['Text'].str.lower()

#Reindex so book + chunk is first column 
column_names = "Book + Chunk", "Text"
chunked_df = chunked_df.reindex(columns=column_names)

#Print cleaned df
chunked_df

# Disaggregate Texts and Download CSV Output
Working with texts split by chapter or chunk (or both), the final step of this process is to disaggregate the data. Disaggregation, or the breakdown of data into smaller (disordered) parts, is accomplished through the alphabetization of the words in each chapter/chunk. 

The resulting "bag of words" data can then be downloaded as csvs and used for further analysis, such as through the Topic Modeling pipeline in the Extracted Features repository: https://github.com/SF-Nexus/Extracted-Features/blob/main/Topic%20Modeling%20with%20SciFi%20Corpus.ipynb 
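As a small illustration of the alphabetization step, the sketch below disaggregates one made-up chunk and checks that word frequencies are preserved while word order is lost. The function name `disaggregate` is hypothetical. Lowercasing first is an assumption that matches the chapter and chunk dataframes; because `sorted` orders strings by code point, capitalized words would otherwise sort ahead of lowercase ones.

```python
from collections import Counter


def disaggregate(text):
    """Lowercase the text, split on whitespace, and return the words in sorted order."""
    return " ".join(sorted(text.lower().split()))


chunk = "The ship left the harbor at dawn"
bag_of_words = disaggregate(chunk)
print(bag_of_words)

# Word frequencies survive the shuffle; only the original order is discarded
same_counts = Counter(chunk.lower().split()) == Counter(bag_of_words.split())
print(same_counts)
```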

## Full Texts

In [None]:
#Working with data from FULL texts (not sectioned)
#Alphabetize words in each text string
books_bow = books.copy()
books_bow['Text'] = books_bow['Text'].apply(lambda x: ' '.join(sorted(x.split())))
books_bow

In [None]:
#Download CSV with full texts (aggregated)
books.to_csv('full_texts_agg.csv', index=False)

In [None]:
#Download CSV with full texts (disaggregated)
books_bow.to_csv('full_texts_bow.csv', index=False)

## Texts Sectioned by Chapter

In [None]:
#Working with data from texts sectioned by CHAPTER
#Alphabetize words in each chapter string (on a copy, so the ordered text is kept)
chapters_bow = chapters_df.copy()
chapters_bow['Text'] = chapters_bow['Text'].apply(lambda x: ' '.join(sorted(x.split())))
chapters_bow

In [None]:
#Download CSV with chapters (aggregated)
chapters_df.to_csv('chapters_agg_output.csv', encoding = 'utf-8-sig') 

In [None]:
#Download CSV with chapters (disaggregated)
chapters_bow.to_csv('chapters_bow_output.csv', encoding = 'utf-8-sig') 

## Texts Sectioned by Chapter + Chunk

In [None]:
#Working with data from texts sectioned by CHAPTER, then by CHUNK of N length
#Alphabetize words in each chunk string
chunked_ch_df['Text'] = chunked_ch_df['Text'].apply(lambda x: ' '.join(sorted(x.split())))
chunked_ch_df

In [None]:
#Download disaggregated chunks to csv
chunked_ch_df.to_csv('chapter_chunks_bow_output.csv', encoding = 'utf-8-sig') 

## Texts Sectioned by Chunk

In [None]:
#Working with data from texts sectioned by CHUNK of N length
#Alphabetize words in each chunk string
chunked_df['Text'] = chunked_df['Text'].apply(lambda x: ' '.join(sorted(x.split())))
chunked_df

In [None]:
#Download disaggregated chunks to csv
chunked_df.to_csv('chunks_bow_output.csv', encoding = 'utf-8-sig') 