# Introduction

Use this code to clean, section, and disaggregate texts and corpora. 

**Why Perform Text Sectioning?** 

Dividing texts into sections (for example, chapters or chunks of N words) is a useful precursor to topic modeling and other forms of computational analysis, which tend to perform better on groups of segmented documents than on long, unsegmented texts.

**Why Disaggregate Texts?** 

Disaggregating the words in a text (in this case, by alphabetizing them) creates data sets that can be shared freely even when the original texts cannot be, due to copyright restrictions.

*Input/Output Specifications:* 

This code requires plain-text (.txt) files as input, either those from this repository's sample_data folder or those from a local machine. It returns CSV files with disaggregated text grouped by chapter or by chunk of N words.

# Upload and Add Text Files To Pandas DataFrame
In this section, text files are added into a Pandas DataFrame. Pandas is a fast and relatively easy way to work with large datasets. Though data frames are typically associated with numbers, Pandas also offers many functionalities for [working with textual data](https://www.tutorialspoint.com/python_pandas/python_pandas_working_with_text_data.htm).
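As a minimal sketch of the loading step, the snippet below reads every `.txt` file in a folder into a list of title/text records. The helper name `read_txt_folder`, the sample file names, and the throwaway folder are illustrative assumptions rather than part of this notebook. The `utf-8-sig` codec is used so that a leading byte order mark is dropped while reading.

```python
import tempfile
from pathlib import Path


def read_txt_folder(folder):
    """Return one {'Title', 'Text'} record per .txt file in `folder` (hypothetical helper)."""
    records = []
    for txt_path in sorted(Path(folder).glob("*.txt")):
        # utf-8-sig strips a leading byte order mark (\xef\xbb\xbf) if present
        text = txt_path.read_text(encoding="utf-8-sig")
        # .stem is the file name without its .txt extension
        records.append({"Title": txt_path.stem, "Text": text})
    return records


# Demonstration on a throwaway folder holding two tiny made-up files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1999_DOE_SAMPLEBOOK.txt").write_bytes(b"\xef\xbb\xbfSAMPLE BOOK\r\nFirst line")
    Path(tmp, "notes.md").write_text("not a txt file")
    sample_records = read_txt_folder(tmp)

print(sample_records)
```

Passing `sample_records` to `pd.DataFrame` would give the same `Title`/`Text` column layout as the `books` dataframe built in the cells that follow.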

In [1]:
#Import os and glob
import glob
import os

#Import pandas
import pandas as pd

In [2]:
#Get current working directory 
path = os.getcwd()
print(path)

#Change working directory (os.chdir returns None, so look the new path up afterwards)
os.chdir("/home/dssadmin/Desktop/SF_Analysis/Data/Rd3_Texts")
path = os.getcwd()

/home/dssadmin/Desktop/SF_Analysis/Jupyter_Notebooks


In [3]:
#Read every txt file in the working directory and collect its name and contents
filenames = []
data = []
for f in glob.glob('*.txt'):
    #Read as bytes here; the text is decoded in a later cell
    with open(f, 'rb') as myfile:
        filenames.append(myfile.name)
        data.append(myfile.read())
d = {'Title':filenames, 'Text': data}
books = pd.DataFrame(d)
books

Unnamed: 0,Title,Text
0,1973_PESEK_THEEARTHISNEAR.txt,"b""\xef\xbb\xbfTHE EARTH IS NEAR\r\nThey had no..."
1,1970_JAKES_MASKOFCHAOS.txt,b'\xef\xbb\xbfMASK OF CHAOS\r\nJOHN JAKES\r\nA...
2,1971_KAMIN_THEHERODMEN.txt,b'\xef\xbb\xbfTHE\r\nHEROD MEN\r\nNICK KAMIN\r...
3,1971_GLASBY_PROJECTJOVE.txt,b'\xef\xbb\xbfJOHN GLASBY\r\nPROJECT JOVE\r\nA...
4,1980_SHIRLEY_CITYCOMEAWALKIN.txt,b'\xef\xbb\xbfCITY COME A-WALKIN\'\r\nJOHN SHI...
...,...,...
137,1962_MCLAUGHLIN_DOMEWORLD.txt,b'\xef\xbb\xbfDOME WORLD\r\nOnly one thing cou...
138,2000_DEBRANT_VIRALINTELLIGENCE.txt,b'\xef\xbb\xbfViral Intelligence\r\nDon DeBran...
139,1956_GOLDING-PEAKE-WYNDHAM_SOMETIMENEVER.txt,"b'\xef\xbb\xbfSOMETIME, NEVER\r\n\r\nWIT...WON..."
140,1962_DICKSON_NOROOMFORMAN.txt,b'\xef\xbb\xbfNO ROOM FOR MAN\r\nGORDON R. DIC...


In [4]:
#Remove .txt from titles (escape the dot, anchor at the end, and replace with an empty string)
books['Title'] = books['Title'].str.replace(r'\.txt$', '', regex=True)
books.head()

Unnamed: 0,Title,Text
0,1973_PESEK_THEEARTHISNEAR,"b""\xef\xbb\xbfTHE EARTH IS NEAR\r\nThey had no..."
1,1970_JAKES_MASKOFCHAOS,b'\xef\xbb\xbfMASK OF CHAOS\r\nJOHN JAKES\r\nA...
2,1971_KAMIN_THEHERODMEN,b'\xef\xbb\xbfTHE\r\nHEROD MEN\r\nNICK KAMIN\r...
3,1971_GLASBY_PROJECTJOVE,b'\xef\xbb\xbfJOHN GLASBY\r\nPROJECT JOVE\r\nA...
4,1980_SHIRLEY_CITYCOMEAWALKIN,b'\xef\xbb\xbfCITY COME A-WALKIN\'\r\nJOHN SHI...


In [5]:
#Decode bytes to text; the utf-8-sig codec also drops the byte order mark (b'\xef\xbb\xbf')
books['Text'] = books['Text'].apply(lambda x: x.decode('utf-8-sig'))

#Replace newlines and other runs of whitespace with a single space
books['Text'] = books['Text'].str.replace(r'\s+', ' ', regex=True)

#Remove punctuation and replace with no space (except periods and hyphens)
books['Text'] = books['Text'].str.replace(r'[^\w\-\.\s]+', '', regex=True)

#Remove periods and replace with space (to prevent incorrect compounds)
books['Text'] = books['Text'].str.replace(r'[^\w\-\s]+', ' ', regex=True)
books.head()

Unnamed: 0,Title,Text
0,1973_PESEK_THEEARTHISNEAR,THE EARTH IS NEAR They had not counted on the ...
1,1970_JAKES_MASKOFCHAOS,MASK OF CHAOS JOHN JAKES AN ACE BOOK Ace Publi...
2,1971_KAMIN_THEHERODMEN,THE HEROD MEN NICK KAMIN Planned death vs unw...
3,1971_GLASBY_PROJECTJOVE,JOHN GLASBY PROJECT JOVE ACE BOOKS A Division ...
4,1980_SHIRLEY_CITYCOMEAWALKIN,CITY COME A-WALKIN JOHN SHIRLEY Stu Cole was m...


# Clean Texts and Set Parameters for Sectioning 
Several basic cleaning processes are implemented: removing unwanted characters from titles, removing newline characters from texts, and removing punctuation. Parameters are also set for part(s) of text to be included in sectioning. In the SciFi Corpus project, "START OF BOOK" and "END OF BOOK" tags were added to delineate the body of each text. Code in this section removes any text outside the starting and ending parameters--e.g., title page, copyright page, other paratext. 
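The cleaning steps described above can be sketched on a single string, which makes their effect easier to inspect than on a whole dataframe. The function name `clean_text`, its parameters, and the sample string are assumptions made for illustration. The final `.strip()` is an addition that the notebook cells do not perform, and the tag check is a guard for texts that lack the tags.

```python
import re


def clean_text(raw, start_tag="START OF BOOK", end_tag="END OF BOOK"):
    """Apply the cleaning steps to one string (hypothetical helper)."""
    # Collapse newlines, tabs, and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", raw)
    # Drop punctuation other than periods and hyphens
    text = re.sub(r"[^\w\-.\s]+", "", text)
    # Turn periods into spaces so neighbouring words are not fused together
    text = re.sub(r"\.+", " ", text)
    # Keep only the body between the two tags, when both are present
    if start_tag in text and end_tag in text:
        text = text.split(start_tag, 1)[1].split(end_tag, 1)[0]
    return text.strip()


sample = 'A TITLE PAGE\r\nSTART OF BOOK\r\nCHAPTER 1\r\n"Well-known," she said.It rained.\r\nEND OF BOOK\r\nBack matter'
print(clean_text(sample))
# prints: CHAPTER 1 Well-known she said It rained
```

Replacing periods with spaces, rather than deleting them, is what keeps `said.It` from collapsing into a single token.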

In [6]:
#Remove paratext (before and after START OF BOOK and END OF BOOK tags)
#If texts you are working with do not have these tags, ignore this cell

#Split book on start of book tag, keep text only after start of book tag
start = books["Text"].str.split("START OF BOOK", expand = True)
books['Text'] = start[1]

#Split book on end of book tag, keep text only before end of book tag
end = books["Text"].str.split("END OF BOOK", expand = True)
books['Text'] = end[0]
books

Unnamed: 0,Title,Text
0,1973_PESEK_THEEARTHISNEAR,PART The Long Voyage CHAPTER Gone are the day...
1,1970_JAKES_MASKOFCHAOS,CHAPTER Part I THE STRANGERS Shawnee Sachem o...
2,1971_KAMIN_THEHERODMEN,CHAPTER 1 He stepped onto the morning balcony...
3,1971_GLASBY_PROJECTJOVE,CHAPTER 1 Norbert Donner had never felt the s...
4,1980_SHIRLEY_CITYCOMEAWALKIN,CHAPTER PROLOGUE A young woman in a recording...
...,...,...
137,1962_MCLAUGHLIN_DOMEWORLD,PART 1 MAN ON THE BOTTOM CHAPTER 1 Danial Mas...
138,2000_DEBRANT_VIRALINTELLIGENCE,CHAPTER 1 We Get Going Near as I can figure o...
139,1956_GOLDING-PEAKE-WYNDHAM_SOMETIMENEVER,PART The Past CHAPTER ENVOY EXTRAORDINARY by ...
140,1962_DICKSON_NOROOMFORMAN,BOOK 1 ISOLATE And now through double glass J...


In [7]:
#Check that text is cleaned and sectioned
books.iloc[0]['Text']



In [9]:
#Define new dataframe (copy so that columns added later do not alter books)
books_cleaned = books.copy()

# Section Texts By Chapter Headings
When working with texts with clearly delineated chapters, using chapter headings is a relatively easy way to section texts into segments of (relatively) the same size. After checking the chapter counts for each text to confirm whether sectioning by chapter is a useful procedure, this code iterates through the texts and splits them each time it encounters a new "chapter" heading. From here, the text from each chapter is appended to a new dataframe and denoted by book and chapter number. 
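A toy version of this split-then-flatten procedure is sketched below on two made-up texts. The dataframe contents and the names `toy`, `wide_df`, and `long_df` are assumptions for illustration only. The explicit `dropna()` is included so that the padding added for shorter books is discarded regardless of how the installed pandas version handles missing values in `stack()`.

```python
import pandas as pd

# Two made-up texts standing in for the cleaned corpus
toy = pd.DataFrame({
    "Title": ["BOOK_A", "BOOK_B"],
    "Text": ["CHAPTER 1 alpha beta CHAPTER 2 gamma", "Preface CHAPTER 1 delta"],
})

# One column per segment; shorter books are padded with missing values
wide_df = toy["Text"].str.split("CHAPTER", expand=True)
wide_df.index = toy["Title"]

# Long format: one row per (book, segment); dropna() discards the padding
long_df = wide_df.stack().dropna().reset_index()
long_df.columns = ["Book", "Chapter", "Text"]
long_df["Text"] = long_df["Text"].str.strip()

print(long_df)
```

Segment 0 holds whatever precedes the first heading, which is why it is an empty string for `BOOK_A` and contains `Preface` for `BOOK_B`.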

In [11]:
#Count number of chapters in each text
chapter_counts = books_cleaned['Text'].str.count('CHAPTER')

#Append chapter counts to dataframe
books_cleaned["Chapters"] = chapter_counts
books_cleaned

Unnamed: 0,Title,Text,Chapters
0,1973_PESEK_THEEARTHISNEAR,PART The Long Voyage CHAPTER Gone are the day...,30
1,1970_JAKES_MASKOFCHAOS,CHAPTER Part I THE STRANGERS Shawnee Sachem o...,3
2,1971_KAMIN_THEHERODMEN,CHAPTER 1 He stepped onto the morning balcony...,14
3,1971_GLASBY_PROJECTJOVE,CHAPTER 1 Norbert Donner had never felt the s...,9
4,1980_SHIRLEY_CITYCOMEAWALKIN,CHAPTER PROLOGUE A young woman in a recording...,12
...,...,...,...
137,1962_MCLAUGHLIN_DOMEWORLD,PART 1 MAN ON THE BOTTOM CHAPTER 1 Danial Mas...,24
138,2000_DEBRANT_VIRALINTELLIGENCE,CHAPTER 1 We Get Going Near as I can figure o...,16
139,1956_GOLDING-PEAKE-WYNDHAM_SOMETIMENEVER,PART The Past CHAPTER ENVOY EXTRAORDINARY by ...,3
140,1962_DICKSON_NOROOMFORMAN,BOOK 1 ISOLATE And now through double glass J...,22


In [12]:
#Make new cell each time new chapter starts 
new = books_cleaned["Text"].str.split("CHAPTER", expand = True).set_index(books_cleaned['Title'])
new

Title,0,1,2,3,4,5,6,7,8,9,...,66,67,68,69,70,71,72,73,74,75
1973_PESEK_THEEARTHISNEAR,PART The Long Voyage,Gone are the days when a nameless continent l...,1 Our long voyage began with fire the age-old...,2 The motionless sun was shining brightly It...,3 It would be ridiculous to say the least to ...,4 Certain changes in the blood count only sli...,5 Early on in our voyage we had often imagine...,6 Time passed by desperately slowly and the t...,7 Our visits to the observatory made us feel ...,8 We had another meteorite alarm no doubt to ...,...,,,,,,,,,,
1970_JAKES_MASKOFCHAOS,,Part I THE STRANGERS Shawnee Sachem on the or...,Part II THE GAME Four days later civil servan...,Part III THE EXORCISM Executive Fochet took t...,,,,,,,...,,,,,,,,,,
1971_KAMIN_THEHERODMEN,,1 He stepped onto the morning balcony and let...,2 The cold evening was approaching by the tim...,3 What do you think she asked as the driver a...,4 ArchCommodore Gudtsler was in uncommonly go...,5 For the third consecutive day the morning w...,6 We must leave this world o corruption and a...,7 Feels good out today Matter said And I alw...,8 Sergeant Kulcheski saw them coming up the f...,9 The black and purple uniform abraded nerve ...,...,,,,,,,,,,
1971_GLASBY_PROJECTJOVE,,1 Norbert Donner had never felt the same sinc...,2 Senator Clinton Durant had seen the vast lo...,3 There was a memory in Durant of something w...,4 Donner was still alone in the dome when the...,5 Red and green lights flickered along the cu...,6 Carefully Donner increased the amplificatio...,7 The power went off without warning Althoug...,8 Jill hung onto his arm her head back starin...,9 Donner looked about him desperately He cou...,...,,,,,,,,,,
1980_SHIRLEY_CITYCOMEAWALKIN,,PROLOGUE A young woman in a recording studio ...,1 WUN It was Saturday night ten oclock which ...,2 TEW Cole stared at the notice in disbelief ...,3 THU-EEE There was a dead man bleeding on th...,4 FOH-UR Cole tightened his grip on the metal...,5 FIE-EV Quickly He had borrowed Bills car ...,6 UH-SIXZZ In the morning as Catz slept Cole ...,7 SEV-UHN Cole sat in a dark place at the top...,8 A-A-ATE The penthouse suite stank cluttered...,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1962_MCLAUGHLIN_DOMEWORLD,PART 1 MAN ON THE BOTTOM,1 Danial Mason was weary to his very bones I...,2 After Powell walked out Mason swung his leg...,3 Joe Kramer was several levels down from Mas...,4 He squeezed himself into a corridor booth ...,5 He did not sleep that night He sat in his ...,6 He waited a long time for Krumbein to call ...,7 Again it was quiet in the room Oppressivel...,8 Deep in the sea the dome dung like a monstr...,9 He spent several minutes looking around the...,...,,,,,,,,,,
2000_DEBRANT_VIRALINTELLIGENCE,,1 We Get Going Near as I can figure out this ...,2 Darks parkade The first thing you gotta kno...,3 joey the eull Id been so wrapped up listeni...,4 Dream Pillars Sentrys story What did you ca...,5 under The city There are some parts of a st...,6 cleaner The city is hungry Kegan is hungry...,7 Things Get a Tad Rough Sentry and I stared ...,8 The story That wouldnt Die1 Perhaps It h...,9 skinny eets Lucky You mean the hotel has be...,...,,,,,,,,,,
1956_GOLDING-PEAKE-WYNDHAM_SOMETIMENEVER,PART The Past,ENVOY EXTRAORDINARY by William Golding THE TE...,CONSIDER HER WAYS by John Wyndham There was n...,BOY IN DARKNESS by Mervyn Peake The ceremonie...,,,,,,,...,,,,,,,,,,
1962_DICKSON_NOROOMFORMAN,BOOK 1 ISOLATE And now through double glass J...,1 The mine generally speaking was automatic ...,2 The skip slid Paul down some six hundred st...,3 The clerk working the afternoon division da...,4 How do you feel It was a womans voice Pa...,5 As Paul entered through the automatically-o...,6 Shuttling through the many-leveled maze of ...,7 You didnt answer the door said Warren stopp...,8 The mans dead thought Paul He took a deep ...,9 They fell like a stone while Pauls hand res...,...,,,,,,,,,,


In [13]:
#Flatten dataframe so each chapter is on own row, designated by book and chapter 
chapters_df = new.stack().reset_index()
chapters_df.columns = ["Book", "Chapter", "Text"]
chapters_df

Unnamed: 0,Book,Chapter,Text
0,1973_PESEK_THEEARTHISNEAR,0,PART The Long Voyage
1,1973_PESEK_THEEARTHISNEAR,1,Gone are the days when a nameless continent l...
2,1973_PESEK_THEEARTHISNEAR,2,1 Our long voyage began with fire the age-old...
3,1973_PESEK_THEEARTHISNEAR,3,2 The motionless sun was shining brightly It...
4,1973_PESEK_THEEARTHISNEAR,4,3 It would be ridiculous to say the least to ...
...,...,...,...
2561,1949_STEWART_EARTHABIDES,19,10 By the time he had finished the long walk ...
2562,1949_STEWART_EARTHABIDES,20,11 Z Day after day still the sun set in the c...
2563,1949_STEWART_EARTHABIDES,21,1 Jlerhaps it was that same day or perhaps it...
2564,1949_STEWART_EARTHABIDES,22,2 He awoke so early one morning that the room...


In [14]:
#Tidying the DF
#Combine book and chapter labels into one column
chapters_df['Book + Chapter'] = chapters_df['Book'].astype(str) + '_Chapter_' + chapters_df['Chapter'].astype(str)

#Remove individual book and chapter columns (assign the result; drop does not modify in place)
chapters_df = chapters_df.drop(columns=['Book', 'Chapter'])

#Lowercase all words
chapters_df['Text'] = chapters_df['Text'].str.lower()

#Reindex so book + chapter is first column 
column_names = ["Book + Chapter", "Text"]
chapters_df = chapters_df.reindex(columns=column_names)
chapters_df

Unnamed: 0,Book + Chapter,Text
0,1973_PESEK_THEEARTHISNEAR _Chapter_0,part the long voyage
1,1973_PESEK_THEEARTHISNEAR _Chapter_1,gone are the days when a nameless continent l...
2,1973_PESEK_THEEARTHISNEAR _Chapter_2,1 our long voyage began with fire the age-old...
3,1973_PESEK_THEEARTHISNEAR _Chapter_3,2 the motionless sun was shining brightly it...
4,1973_PESEK_THEEARTHISNEAR _Chapter_4,3 it would be ridiculous to say the least to ...
...,...,...
2561,1949_STEWART_EARTHABIDES _Chapter_19,10 by the time he had finished the long walk ...
2562,1949_STEWART_EARTHABIDES _Chapter_20,11 z day after day still the sun set in the c...
2563,1949_STEWART_EARTHABIDES _Chapter_21,1 jlerhaps it was that same day or perhaps it...
2564,1949_STEWART_EARTHABIDES _Chapter_22,2 he awoke so early one morning that the room...


# Section Chapters by Chunks of N Length
Though chapter headings are useful for splitting texts into semi-equal segments, disparities in chapter length may occur, especially in large corpora. To segment texts further, each chapter can be divided into chunks of N words.
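The effect of chunking chapters of unequal length can be seen on a small example. This is a sketch under stated assumptions: the chapter labels, the sample sentences, and the name `chunk_tokens` are made up, and a plain whitespace split stands in for the tokenizer used in the cells that follow.

```python
def chunk_tokens(tokens, chunk_size):
    """Yield consecutive slices of `tokens`, each at most `chunk_size` items long."""
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


# Made-up chapters of unequal length, keyed by a book + chapter label
chapters = {
    "BOOK_A_Chapter_1": "the ship left the harbour before dawn and no one spoke".split(),
    "BOOK_A_Chapter_2": "silence".split(),
}

# One (label, text) row per chunk, numbered from 0 within each chapter
rows = [
    (f"{label} Chunk {number}", " ".join(chunk))
    for label, tokens in chapters.items()
    for number, chunk in enumerate(chunk_tokens(tokens, chunk_size=4))
]

for row in rows:
    print(row)
```

The last chunk of each chapter is usually shorter than N, so very short chapters still produce one small chunk of their own.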

In [None]:
#Create new df to work with chunks (copy so chapters_df is left unchanged)
new_chapters_df = chapters_df.copy()

#Get number of words in each chapter (helps to determine chunk length)
#split() with no argument ignores repeated spaces, so empty strings are not counted
ch_words = new_chapters_df["Text"].apply(lambda x: len(str(x).split()))

#Append word counts to dataframe
new_chapters_df["Word Count"] = ch_words
new_chapters_df

In [None]:
#Tokenize Text
import nltk
#Newer NLTK releases may ask for the 'punkt_tab' resource instead of 'punkt'
nltk.download('punkt')
new_chapters_df['Tokens'] = new_chapters_df['Text'].apply(nltk.word_tokenize)
new_chapters_df

In [None]:
#Define chunking function
def split(list_a, chunk_size):
  for i in range(0, len(list_a), chunk_size):
    yield list_a[i:i + chunk_size]

#Set desired size of chunks
chunk_size = 1000

#Create new list for chunked sentences
chunked_ch = []

#Perform chunking function on each row of tokens
s = new_chapters_df['Tokens']
for content in s:
  chunks = list(split(content, chunk_size))
  #Add to new list
  chunked_ch.append(chunks)


In [None]:
#Create dictionary to associate chunks with titles
keys = new_chapters_df['Book + Chapter']
values = chunked_ch

res = dict(zip(keys, values))

In [None]:
#Add chunks to new dataframe
chunked_ch_df = pd.DataFrame.from_dict(res, orient='index')
chunked_ch_df.head()

In [None]:
#Reset dataframe index and rename columns
chunked_ch_df = chunked_ch_df.stack().reset_index()
chunked_ch_df.columns = ["Title","Chunk","Text"]
chunked_ch_df

In [None]:
#Tidying the DF
#Combine book and chunk labels into one column
chunked_ch_df['Book + Chunk'] = chunked_ch_df['Title'].astype(str) + ' Chunk ' + chunked_ch_df['Chunk'].astype(str)

#Remove individual book and chunk columns (assign the result; drop does not modify in place)
chunked_ch_df = chunked_ch_df.drop(columns=['Title', 'Chunk'])

#Detokenize text (join each list of tokens back into one string)
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()
chunked_ch_df['Text'] = chunked_ch_df['Text'].apply(detokenizer.detokenize)

#Lowercase all words
chunked_ch_df['Text'] = chunked_ch_df['Text'].str.lower()

#Reindex so book + chunk is first column 
column_names = ["Book + Chunk", "Text"]
chunked_ch_df = chunked_ch_df.reindex(columns=column_names)

#Print cleaned df
chunked_ch_df

# Section Texts By Chunks of N Length
When working with texts that lack discernible chapter headings, or whose chapter headings are too infrequent to split the texts into meaningful segments, texts can instead be sectioned into chunks of N words, where N is a variable that can be set below. After checking the word counts for each text to determine an appropriate chunk size, this code iterates through the texts and starts a new chunk every N words. The text from each chunk is then appended to a new dataframe and labeled by book and chunk number.
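One way to compare candidate values of N before chunking is to work out how many chunks each text would produce and how long its final chunk would be. The sketch below does that arithmetic. The word counts are invented for illustration and are not taken from the corpus, and `chunk_summary` is a hypothetical helper.

```python
import math

# Hypothetical word counts for three texts (not taken from the corpus)
word_counts = {"BOOK_A": 52000, "BOOK_B": 88500, "BOOK_C": 1200}


def chunk_summary(n_words, chunk_size):
    """Return (number of chunks, length of the final chunk) for one text."""
    n_chunks = math.ceil(n_words / chunk_size)
    last_len = n_words - (n_chunks - 1) * chunk_size if n_chunks else 0
    return n_chunks, last_len


for candidate in (500, 1000, 5000):
    summary = {title: chunk_summary(n, candidate) for title, n in word_counts.items()}
    print(candidate, summary)
```

A value of N that leaves many texts with a very short final chunk, or with only a single chunk, may be a sign that a different chunk size would suit the corpus better.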

In [None]:
#Get number of words in each book (helps to determine chunk length)
#split() with no argument ignores repeated spaces, so empty strings are not counted
words = books_cleaned["Text"].apply(lambda x: len(str(x).split()))

#Append word counts to dataframe
books_cleaned["Word Count"] = words
books_cleaned

In [None]:
#Tokenize Text
import nltk
#Newer NLTK releases may ask for the 'punkt_tab' resource instead of 'punkt'
nltk.download('punkt')
books_cleaned['Tokens'] = books_cleaned['Text'].apply(nltk.word_tokenize)
books_cleaned

In [None]:
#Define chunking function
def split(list_a, chunk_size):
  for i in range(0, len(list_a), chunk_size):
    yield list_a[i:i + chunk_size]

#Set desired size of chunks
chunk_size = 1000

#Create new list for chunked sentences
chunked_sentences = []

#Perform chunking function on each row of tokens
s = books_cleaned['Tokens']
for content in s:
  chunks = list(split(content, chunk_size))
  #Check that text is being chunked correctly
  print(chunks[0])
  #Add to new list
  chunked_sentences.append(chunks)


In [None]:
#Create dictionary to associate chunks with titles
keys = books_cleaned['Title']
values = chunked_sentences

res = dict(zip(keys, values))

In [None]:
#Add chunks to new dataframe
chunked_df = pd.DataFrame.from_dict(res, orient='index')
chunked_df.head()

In [None]:
#Reset dataframe index and rename columns
chunked_df = chunked_df.stack().reset_index()
chunked_df.columns = ["Title","Chunk","Text"]
chunked_df

In [None]:
#Tidying the DF
#Combine book and chunk labels into one column
chunked_df['Book + Chunk'] = chunked_df['Title'].astype(str) + ' Chunk ' + chunked_df['Chunk'].astype(str)

#Remove individual book and chunk columns (assign the result; drop does not modify in place)
chunked_df = chunked_df.drop(columns=['Title', 'Chunk'])

#Detokenize text (join each list of tokens back into one string)
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()
chunked_df['Text'] = chunked_df['Text'].apply(detokenizer.detokenize)

#Lowercase all words
chunked_df['Text'] = chunked_df['Text'].str.lower()

#Reindex so book + chunk is first column 
column_names = ["Book + Chunk", "Text"]
chunked_df = chunked_df.reindex(columns=column_names)

#Print cleaned df
chunked_df

# Disaggregate Texts and Download CSV Output
Working with texts split by chapter or chunk (or both), the final step of this process is to disaggregate the data. Disaggregation, or the breakdown of data into smaller (disordered) parts, is accomplished through the alphabetization of the words in each chapter/chunk. 

The resulting "bag of words" data can then be downloaded as CSV files and used for further analysis, such as through the Topic Modeling pipeline in the Extracted Features repository: https://github.com/SF-Nexus/Extracted-Features/blob/main/Topic%20Modeling%20with%20SciFi%20Corpus.ipynb 
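The alphabetization step itself is small enough to show on a single sentence. The helper name `disaggregate` and the sample sentence are illustrative assumptions.

```python
def disaggregate(text):
    """Return the words of `text` in sorted order, joined by single spaces."""
    return " ".join(sorted(text.split()))


original = "the ship left the harbour before dawn"
bag = disaggregate(original)
print(bag)
# prints: before dawn harbour left ship the the

# Word frequencies survive the reordering; word order does not
assert sorted(original.split()) == bag.split()
```

Python's `sorted` compares code points, so capitalized words sort before lowercase ones unless the text has been lowercased first.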

## Full Texts

In [None]:
#Working with FULL texts (not sectioned)
#Alphabetize words in each text string
#Keep only the title and text columns so no ordered token lists end up in the output
books_bow = books[['Title', 'Text']].copy()
books_bow['Text'] = books_bow['Text'].apply(lambda x: ' '.join(sorted(x.split())))
books_bow

In [None]:
#Download CSV with full texts (aggregated)
books.to_csv('full_texts_agg.csv', index=False)

In [None]:
#Download CSV with full texts (disaggregated)
books_bow.to_csv('full_texts_bow.csv', index=False)

## Texts Sectioned by Chapter

In [None]:
#Working with data from texts sectioned by CHAPTER
#Alphabetize words in each chapter string (on a copy, so chapters_df keeps the original word order)
chapters_bow = chapters_df[['Book + Chapter', 'Text']].copy()
chapters_bow['Text'] = chapters_bow['Text'].apply(lambda x: ' '.join(sorted(x.split())))
chapters_bow

In [15]:
#Download CSV with chapters (aggregated)
chapters_df.to_csv('chapters_agg_output.csv', encoding = 'utf-8-sig') 


In [None]:
#Download CSV with chapters (disaggregated)
chapters_bow.to_csv('chapters_bow_output.csv', encoding = 'utf-8-sig') 

## Texts Sectioned by Chapter + Chunk

In [None]:
#Working with data from texts sectioned by CHAPTER and then by CHUNK of N length
#Alphabetize words in each chunk string
chunked_ch_df['Text'] = chunked_ch_df['Text'].apply(lambda x: ' '.join(sorted(x.split())))
chunked_ch_df

In [None]:
#Download disaggregated chunks to csv
chunked_ch_df.to_csv('chapter_chunks_bow_output.csv', encoding = 'utf-8-sig') 

## Texts Sectioned by Chunk

In [None]:
#Working with data from texts sectioned by CHUNK of N length
#Alphabetize words in each chunk string
chunked_df['Text'] = chunked_df['Text'].apply(lambda x: ' '.join(sorted(x.split())))
chunked_df

In [None]:
#Download disaggregated chunks to csv
chunked_df.to_csv('chunks_bow_output.csv', encoding = 'utf-8-sig') 