# Word Frequency and Count of a Set of Documents

In this notebook we compute the word frequency and count for a set of documents and store the results in a pandas DataFrame. The documents are some novels by Charles Dickens taken from Project Gutenberg.

## Libraries and Resources used

-  Python 3
-  pandas
-  nltk

## Note:

To install the necessary resources and libraries, refer to their respective home pages for the installation steps for your operating system.

The novels have also been cleaned beforehand. This involves removing the additional notes added by Project Gutenberg (including trademarks, notes about the book, and branding) from the start and end of each novel.

Written in February 2017

In [7]:
# Import Required Libraries
import nltk
import os
import pandas as pd
import re

# Reading the Text

The first step is to collect the name and content of each novel. This is done by going through all the text files in the "Novels" folder. Note that the names and contents are appended in the same order, so the entry at a given index in one list corresponds to the entry at the same index in the other.

In [3]:
# Set path to the folder with the novels
path = "./Novels"

# Save all the titles of the texts
textName = []

# Save all the content of the texts
textContent = []

# Go to the directory with all the text files
for filename in os.listdir(path):
    
    # Add the file name without its extension (in this case ".txt")
    textName.append(os.path.splitext(filename)[0])

    # Open each file and read all of its content
    with open(os.path.join(path, filename), "r") as file:
        fileContent = file.read()

    # Add the content of the file
    textContent.append(fileContent)
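
The loop above depends on `os.listdir`, which returns entries in arbitrary order, and on the platform's default text encoding. As a minimal sketch of one alternative, the following reads the files in sorted order with an explicit encoding. The helper name `read_novels`, the UTF-8 encoding, and the demo file names are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path
import tempfile


def read_novels(folder):
    """Return (names, contents) for every .txt file in folder, sorted by file name."""
    names, contents = [], []
    for file_path in sorted(Path(folder).glob("*.txt")):
        # .stem is the file name without its extension
        names.append(file_path.stem)
        # Assumption: the files are UTF-8 encoded
        contents.append(file_path.read_text(encoding="utf-8"))
    return names, contents


# Demonstration on a throwaway folder with two hypothetical files
with tempfile.TemporaryDirectory() as demo_folder:
    Path(demo_folder, "B Novel.txt").write_text("second text", encoding="utf-8")
    Path(demo_folder, "A Novel.txt").write_text("first text", encoding="utf-8")
    demo_names, demo_contents = read_novels(demo_folder)

print(demo_names)
```

Because both lists are filled inside the same loop iteration, the name and content at a given index always belong to the same file.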

## Checkpoint

This is a quick check to ensure we have as many titles as texts.

In [4]:
# Check that the number of texts being analyzed equals the number of titles we recorded
if len(textName) == len(textContent):
    print("The amount of text titles matches the amount of text content")
else:
    print("Amount of content and titles do not match")

The amount of text titles matches the amount of text content


## Defining Cleaning Function

We are going to declare a helper function that tokenizes the text and removes punctuation.

### Note:

We are not going to remove stopwords or alter the text in any way beyond removing punctuation and numbers and making all words lowercase.

In [16]:
# Define tokenize function
def tokenize_Text(text):
    tokens = nltk.word_tokenize(text.lower())
    # Keep only tokens that start with a letter (drops standalone punctuation and numbers)
    tokens = [word for word in tokens if word[0].isalpha()]
    return tokens
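
To make the filtering step concrete, here is a small sketch that applies the same `word[0].isalpha()` test to a hand-written token list. The helper name `keep_word_tokens` and the sample tokens are hypothetical, and the list is written by hand rather than produced by `nltk.word_tokenize`.

```python
def keep_word_tokens(tokens):
    """Keep only the tokens whose first character is a letter."""
    return [token for token in tokens if token[0].isalpha()]


# Hypothetical tokens, written by hand for illustration
sample_tokens = ["it", "was", "the", "best", "of", "times", ",",
                 "1859", "yourself.", "'s", "a-coming"]

kept = keep_word_tokens(sample_tokens)
print(kept)
```

Only the first character is checked, so a token that starts with a letter but still carries punctuation (such as `yourself.` or `a-coming`) is kept. This is consistent with column names like `yourself.` that appear in the final table.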

## Cleaning the Text

Now that we have all the text and the helper function declared, we can tokenize the text.

In [17]:
# List to hold the results
text_tokenized = []

# Iterate through all the text
for novel in textContent:
    # Clean the text
    text_tokenized.append(tokenize_Text(novel))


# Getting Word Frequency

Now that all the words are tokenized and cleaned, we will compute the word frequencies that will go into a pandas DataFrame.

In [39]:
# Create a list to store all the results
wordFrequencyList = []

# For each novel, get the word frequency
for novel in text_tokenized:
    wordFrequencyList.append(nltk.FreqDist(novel))
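
`nltk.FreqDist` is a subclass of `collections.Counter`, so the counting step can be sketched with the standard library alone. The sample tokens below are made up for illustration. The relative frequency is computed by hand here, although `FreqDist` also offers a `freq` method for the same purpose.

```python
from collections import Counter

# Hypothetical tokens standing in for one tokenized novel
sample_tokens = ["the", "cat", "and", "the", "hat"]

# Raw count of each word
counts = Counter(sample_tokens)

# Relative frequency: each count divided by the total number of tokens
total = sum(counts.values())
relative = {word: count / total for word, count in counts.items()}

print(counts.most_common(1))
print(relative["the"])
```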

## Creating the pandas DataFrame

Now that we have the word frequencies for every text, we will insert them into a pandas DataFrame.

In [54]:
# Initialize a new pandas DataFrame for each of the novels
df_1 = pd.DataFrame(wordFrequencyList[0], index=[textName[0]])
df_2 = pd.DataFrame(wordFrequencyList[1], index=[textName[1]])
df_3 = pd.DataFrame(wordFrequencyList[2], index=[textName[2]])

# Combine all the DataFrames together
wordFrequencydf = pd.concat([df_1,df_2,df_3], axis=0)

# Words that do not occur in a novel are NaN at this point;
# they are replaced with 0 in a later cell
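
The cell above is hard-coded for three novels. As a hedged sketch of how the same table could be built for any number of texts, the following passes a list of dictionaries to `pd.DataFrame` in one step. The helper name `build_frequency_table` and the demo data are assumptions, and `Counter` stands in for `nltk.FreqDist`.

```python
from collections import Counter

import pandas as pd


def build_frequency_table(names, frequency_list):
    """Build one row per text and one column per word, with 0 for absent words."""
    # Each frequency mapping becomes one row; words missing from a text become NaN
    table = pd.DataFrame([dict(freq) for freq in frequency_list], index=names)
    # Replace NaN with 0 and store the counts as integers
    return table.fillna(0).astype(int)


# Hypothetical data standing in for textName and wordFrequencyList
demo_names = ["Novel A", "Novel B"]
demo_frequencies = [Counter(["the", "cat", "the"]), Counter(["the", "dog"])]

demo_table = build_frequency_table(demo_names, demo_frequencies)
print(demo_table)
```

Converting to integers after `fillna` avoids the mix of integer and float columns that appears when only some columns contain missing values.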

In [55]:
wordFrequencydf

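| | a | a-a-a-business | a-a-matter | a-bed | a-blushing | a-breakfasting | a-buzz | a-coming | a-doin | a-doing | ... | yourself. | yourselves | youth | youthful | youthfulness. | youths | z | zeal | zealous | zenith |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A Tale of Two Cities | 2944 | 1.0 | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | ... | 2.0 | 3.0 | 9 | 3.0 | 1.0 | 1.0 | NaN | NaN | 2.0 | NaN |
| Oliver Twist | 3702 | NaN | NaN | 1.0 | 1.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | ... | NaN | 3.0 | 8 | 6.0 | NaN | 1.0 | 1.0 | NaN | 2.0 | 1.0 |
| A Christmas Carol | 700 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 1 | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |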


In [56]:
wordFrequencydf.fillna(0, inplace=True)

In [57]:
wordFrequencydf

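| | a | a-a-a-business | a-a-matter | a-bed | a-blushing | a-breakfasting | a-buzz | a-coming | a-doin | a-doing | ... | yourself. | yourselves | youth | youthful | youthfulness. | youths | z | zeal | zealous | zenith |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A Tale of Two Cities | 2944 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 2.0 | 3.0 | 9 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| Oliver Twist | 3702 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 3.0 | 8 | 6.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 |
| A Christmas Carol | 700 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |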
