# 1. Number of pages in the document
 


A plain-text file has no page breaks, so the page count has to be estimated. We read the file, tokenize it into words with the Natural Language Toolkit (nltk), and divide the word count by an assumed 300 words per page:


In [39]:
import nltk
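# Uncomment on the first run to fetch the NLTK data (the tokenizers used below need the Punkt models)
# nltk.download('all')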

In [44]:
from nltk import word_tokenize

# Read the text file into a string
with open("sample2.txt", "r") as file:
    text = file.read()

# Tokenize the text into words
words = word_tokenize(text)

# Count the tokens (word_tokenize also returns punctuation marks as separate tokens)
word_count = len(words)

# Divide the token count by 300 to estimate the number of pages
page_count = word_count / 300

print("Number of pages:", page_count)


Number of pages: 3.96
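
A fractional page count such as 3.96 is usually reported as whole pages. The following is a minimal sketch of that idea, not the method used above: the helper name `estimate_pages` is hypothetical, it uses the standard `re` module instead of `word_tokenize` so that punctuation is not counted, and it keeps the same 300-words-per-page assumption.

```python
import math
import re

def estimate_pages(text, words_per_page=300):
    """Estimate the page count, rounding up to whole pages."""
    # Keep only word-like tokens so punctuation does not inflate the count
    words = re.findall(r"[\w']+", text)
    return math.ceil(len(words) / words_per_page)

sample = "word " * 301          # 301 words, just over one page
print("Number of pages:", estimate_pages(sample))  # Number of pages: 2
```

Rounding up means a document that spills even one word onto a new page counts that page.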




# 2. Number of paragraphs



We can count paragraphs by splitting the text on the separator the file uses between them. The code below assumes that every paragraph ends with a single newline character, so any blank lines in the file are counted as (empty) paragraphs too:

In [45]:
# Split the text into paragraphs using newline as the separator
paragraphs = text.split("\n")

# Count the number of paragraphs
paragraph_count = len(paragraphs)

print("Number of paragraphs:", paragraph_count)


Number of paragraphs: 11
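
Splitting on `"\n"` also returns an empty string for every blank line and for a trailing newline. If those should not count as paragraphs, one possible variation is sketched below; the helper name `split_paragraphs` and the sample text are illustrative assumptions.

```python
def split_paragraphs(text):
    """Return the non-empty lines of text, treating each one as a paragraph."""
    return [line.strip() for line in text.split("\n") if line.strip()]

sample = "First paragraph.\n\nSecond paragraph.\nThird paragraph.\n"
print(len(sample.split("\n")))        # 5, including two empty strings
print(len(split_paragraphs(sample)))  # 3
```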


# 3. Number of sentences

To count the sentences, we pass the text read earlier to nltk's sent_tokenize function, which splits it into a list of sentences.

In [46]:
from nltk import sent_tokenize

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Count the number of sentences
sentence_count = len(sentences)

print("Number of sentences:", sentence_count)

Number of sentences: 66
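
`sent_tokenize` raises a `LookupError` when the Punkt data has not been downloaded. The sketch below shows one way to guard against that. The helper name `count_sentences` and the regex fallback are assumptions for illustration; the fallback is far cruder than Punkt, since it treats every `.`, `!` or `?` followed by whitespace as a sentence end.

```python
import re

def count_sentences(text):
    """Count sentences with NLTK when available, else with a naive regex."""
    try:
        from nltk import sent_tokenize
        return len(sent_tokenize(text))
    except (ImportError, LookupError):
        # Fallback assumption: a sentence ends at ., ! or ? followed by whitespace
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return len([part for part in parts if part])

print("Number of sentences:", count_sentences("Hello world. How are you? Fine!"))
```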


# 4. Type and number of words

To count the words in the document and the distinct word types among them, we tokenize the text into words and then use the *FreqDist* class to count how often each word occurs.

In [59]:
from nltk import FreqDist
from nltk.tokenize import regexp_tokenize

# Tokenize on runs of word characters and apostrophes, which drops punctuation
formatted_words = regexp_tokenize(text, r"[\w']+")

# Count the frequency of each word using the FreqDist class
fdist_all_document = FreqDist(formatted_words)

# Print the number of unique types of words and the total number of words
print("Number of unique types of words:", len(fdist_all_document))
print("Number of words:", sum(fdist_all_document.values()))


Number of unique types of words: 173
Number of words: 1029
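
The count above is case-sensitive, so a capitalized and a lowercase spelling of the same word are two separate types. A minimal sketch of a case-insensitive count follows. It uses `collections.Counter` (the class that *FreqDist* extends) so it runs without nltk; the helper name `type_token_counts` and the sample sentence are illustrative assumptions.

```python
import re
from collections import Counter

def type_token_counts(text):
    """Return (number of distinct word types, total number of words), ignoring case."""
    tokens = [token.lower() for token in re.findall(r"[\w']+", text)]
    counts = Counter(tokens)
    return len(counts), sum(counts.values())

types, total = type_token_counts("You can click. Then you click Video or video.")
print("Number of unique types of words:", types)  # 6
print("Number of words:", total)                   # 9
```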


# 5. Rank of the words based on their frequency as well as their location (in each paragraph, page, etc.)

To rank words by frequency and record where they occur, we tokenize each paragraph into words and use the *FreqDist* class to count the words in it. For every word we accumulate its total frequency and the numbers of the paragraphs that contain it. The same approach works for pages or any other unit of location.

In [57]:
# Initialize a dictionary to store the word frequency and location for each word
word_info = {}

# Loop through each paragraph
for i, paragraph in enumerate(paragraphs):
    # Tokenize the paragraph into words
    words_in_paragraph = regexp_tokenize(paragraph, r"[\w']+")
    
    # Count the frequency of each word in the paragraph using the FreqDist class
    fdist = FreqDist(words_in_paragraph)
    
    # Loop through each word and store its frequency and location in the dictionary
    for word, freq in fdist.items():
        if word in word_info:
            word_info[word]["frequency"] += freq
            word_info[word]["location"].append(i+1)
        else:
            word_info[word] = {"frequency": freq, "location": [i+1]}

# Sort the dictionary based on frequency and store the result in a list
word_info_sorted = sorted(word_info.items(), key=lambda x: x[1]["frequency"], reverse=True)

# Print the first 10 ranked words and their frequency and location
for i, (word, info) in enumerate(word_info_sorted[:10]):
    print(f"Rank {i+1}: {word} (Frequency: {info['frequency']}, Location: {info['location']})")

Rank 1: you (Frequency: 60, Location: [1, 3, 4, 6, 7, 8])
Rank 2: the (Frequency: 53, Location: [1, 3, 4, 6, 7, 8, 10])
Rank 3: and (Frequency: 39, Location: [1, 3, 4, 6, 7, 8, 9, 10])
Rank 4: a (Frequency: 37, Location: [1, 3, 4, 6, 7, 8, 10])
Rank 5: to (Frequency: 32, Location: [1, 3, 4, 6, 7, 8])
Rank 6: your (Frequency: 28, Location: [1, 3, 4, 6, 7, 8])
Rank 7: document (Frequency: 22, Location: [1, 3, 4, 6, 7, 8, 9, 10])
Rank 8: click (Frequency: 20, Location: [1, 3, 4, 6, 7, 8])
Rank 9: new (Frequency: 20, Location: [1, 3, 4, 6, 7, 8])
Rank 10: in (Frequency: 17, Location: [1, 3, 4, 6, 7, 8, 10])
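
The locations above are paragraph numbers. Page locations can be estimated in a similar way. The sketch below is an assumption-laden illustration: the helper name `word_pages` is hypothetical, and a word's page is derived from its position in the text using the same 300-words-per-page estimate as in section 1.

```python
import re
from collections import defaultdict

def word_pages(text, words_per_page=300):
    """Map each word to the sorted list of estimated pages it appears on."""
    pages = defaultdict(set)
    for index, word in enumerate(re.findall(r"[\w']+", text)):
        # Words 0..299 fall on page 1, words 300..599 on page 2, and so on
        pages[word].add(index // words_per_page + 1)
    return {word: sorted(numbers) for word, numbers in pages.items()}

# A tiny page size makes the behaviour visible on a short string
print(word_pages("a b a c b a", words_per_page=2))
# {'a': [1, 2, 3], 'b': [1, 3], 'c': [2]}
```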


# 6. Total frequency and rank of words from the total document


To get the total frequency and rank of each word in the document, we reuse the *FreqDist* built over the whole text in section 4, sort its entries by frequency, and number them from 1.
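
Numbering a sorted list gives words with equal frequency different, consecutive ranks. If tied words should share a rank instead, a minimal sketch could look like the following. The helper name `rank_with_ties` is hypothetical, and `collections.Counter` stands in for *FreqDist* so the example runs without nltk.

```python
from collections import Counter

def rank_with_ties(counts):
    """Return (rank, word, frequency) tuples; equal frequencies share a rank."""
    ranked = []
    rank = 0
    previous_freq = None
    for position, (word, freq) in enumerate(counts.most_common(), start=1):
        if freq != previous_freq:
            # Standard competition ranking: 1, 2, 2, 4, ...
            rank = position
            previous_freq = freq
        ranked.append((rank, word, freq))
    return ranked

counts = Counter({"you": 60, "click": 20, "new": 20, "in": 17})
for rank, word, freq in rank_with_ties(counts):
    print(f"Rank {rank}: {word} (Frequency: {freq})")
```

Here the two words with frequency 20 both receive rank 2, and the next word receives rank 4.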

In [60]:

# Sort the frequency distribution by frequency, most frequent first
fdist_sorted = sorted(fdist_all_document.items(), key=lambda x: x[1], reverse=True)

# Assign each word its rank (its 1-based position in the sorted list)
ranks = {word: rank + 1 for rank, (word, _) in enumerate(fdist_sorted)}

# Print the ranked words and their frequency
for word, freq in fdist_sorted:
    print(f"Rank {ranks[word]}: {word} (Frequency: {freq})")


Rank 1: you (Frequency: 60)
Rank 2: the (Frequency: 53)
Rank 3: and (Frequency: 39)
Rank 4: a (Frequency: 37)
Rank 5: to (Frequency: 32)
Rank 6: your (Frequency: 28)
Rank 7: document (Frequency: 22)
Rank 8: click (Frequency: 20)
Rank 9: new (Frequency: 20)
Rank 10: in (Frequency: 17)
Rank 11: When (Frequency: 16)
Rank 12: can (Frequency: 16)
Rank 13: want (Frequency: 16)
Rank 14: for (Frequency: 13)
Rank 15: add (Frequency: 12)
Rank 16: that (Frequency: 12)
Rank 17: Word (Frequency: 12)
Rank 18: change (Frequency: 12)
Rank 19: where (Frequency: 12)
Rank 20: on (Frequency: 12)
Rank 21: headings (Frequency: 10)
Rank 22: text (Frequency: 9)
Rank 23: Video (Frequency: 8)
Rank 24: provides (Frequency: 8)
Rank 25: way (Frequency: 8)
Rank 26: help (Frequency: 8)
Rank 27: video (Frequency: 8)
Rank 28: You (Frequency: 8)
Rank 29: also (Frequency: 8)
Rank 30: fits (Frequency: 8)
Rank 31: To (Frequency: 8)
Rank 32: header (Frequency: 8)
Rank 33: cover (Frequency: 8)
Rank 34: page (Frequency: 8)
R