## Extracting Rich Vocabulary Words from a PDF

### Step 1: Extract Text from the PDF
I have used the `PyMuPDF` library (`fitz` module) to extract text from the PDF.

### Step 2: Tokenize the Text
I have used the `nltk` library to tokenize the text into words.

### Step 3: Filter Common Words
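I have used `nltk.corpus` to filter out common English words (stopwords). The code also drops tokens that are not purely alphabetic or that do not appear in NLTK's `words` corpus.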
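
As a rough illustration of this filtering step, the sketch below applies the same three checks used by `tokenize_and_filter` (alphabetic, present in a word list, not a stopword) to a handful of tokens. The small `STOP_WORDS` and `VOCABULARY` sets and the `keep_content_words` helper are hypothetical stand-ins for the NLTK corpora, chosen so the example runs without any downloads.

```python
# Toy stand-ins for stopwords.words('english') and nltk.corpus.words.words().
# Both sets are illustrative assumptions, not the real NLTK data.
STOP_WORDS = {"the", "a", "of", "was", "and"}
VOCABULARY = {"the", "glimpse", "of", "darkness", "was", "serene", "and"}


def keep_content_words(tokens):
    """Lowercase each token and keep it only if it passes all three checks."""
    kept = []
    for token in tokens:
        word = token.lower()
        if token.isalpha() and word in VOCABULARY and word not in STOP_WORDS:
            kept.append(word)
    return kept


tokens = ["The", "glimpse", "of", "darkness", ",", "was", "serene", "and", "xqzt", "42"]
print(keep_content_words(tokens))
```

Punctuation and numbers fail the alphabetic check, the made-up token `xqzt` fails the word-list check, and the function words fail the stopword check, so only the content words remain.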

### Step 4: Select Uncommon Words
I have counted the frequency of each remaining word; the least frequent words are treated as the rich vocabulary words.

### Step 5: Select 100 Words
I have sorted the words by ascending frequency and selected the first 100, that is, the 100 least frequent words.
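
One consequence of sorting by ascending frequency is easier to see on a small input: words that occur equally often keep the order in which they first appeared, because Python's `sorted` is stable and a counter remembers insertion order. The sketch below uses `collections.Counter` from the standard library in place of `nltk.FreqDist` (which is a `Counter` subclass); `least_frequent` is a hypothetical helper name used only for this illustration.

```python
from collections import Counter


def least_frequent(words, top_n):
    """Return up to top_n words, rarest first; ties keep first-seen order."""
    counts = Counter(words)
    # sorted() is stable, so words with equal counts stay in insertion order.
    ranked = sorted(counts.items(), key=lambda item: item[1])
    return [word for word, _ in ranked[:top_n]]


sample = ["grief", "duty", "grief", "serene", "duty", "grief", "dreary"]
print(least_frequent(sample, 3))
```

When many words occur exactly once, as is common in prose, the selected words are therefore the earliest single-occurrence words in the text rather than a finer ranking by rarity.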

### Python Code

First, install the required libraries:

```sh
pip install PyMuPDF nltk
pip install langid
```

```python
import fitz  # PyMuPDF
import nltk
from nltk.corpus import stopwords

# Fetch the NLTK resources once instead of on every call.
# Note: recent NLTK releases may ask for 'punkt_tab' as well; download it
# too if word_tokenize raises a LookupError.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('words')

# Step 1: Extract Text from the PDF
def extract_text_from_pdf(pdf_path):
    # The context manager closes the document once all pages are read.
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)

# Step 2: Tokenize and Filter Words
def tokenize_and_filter(text):
    words = nltk.word_tokenize(text)
    english_words = set(nltk.corpus.words.words())  # Set of English words
    stop_words = set(stopwords.words('english'))

    # Keep alphabetic tokens that are known English words and not stopwords.
    filtered_words = []
    for word in words:
        lowered = word.lower()
        if word.isalpha() and lowered in english_words and lowered not in stop_words:
            filtered_words.append(lowered)
    return filtered_words

# Step 3: Select Uncommon Words
def select_uncommon_words(words, top_n=100):
    word_freq = nltk.FreqDist(words)
    # Sort by frequency (ascending); ties keep their first-seen order.
    sorted_words = sorted(word_freq.items(), key=lambda item: item[1])
    # Select the top N least frequent words.
    return [word for word, freq in sorted_words[:top_n]]

# Main function to perform the analysis
def analyze_pdf_for_rich_vocabulary(pdf_path):
    text = extract_text_from_pdf(pdf_path)
    filtered_words = tokenize_and_filter(text)
    return select_uncommon_words(filtered_words)

# Example usage (raw string so the backslashes are not read as escapes)
pdf_path = r'D:\Rich_Vocab_filter\data.pdf'
top_100_words = analyze_pdf_for_rich_vocabulary(pdf_path)
print(len(top_100_words))
print(top_100_words)
```


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\haswa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\haswa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\haswa\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


100
['cutting', 'fortunately', 'glimpse', 'passenger', 'angel', 'trickling', 'wished', 'bit', 'hoped', 'follow', 'mine', 'crash', 'smashing', 'uncertainty', 'bidding', 'spectrum', 'medieval', 'canvas', 'garment', 'blend', 'unseen', 'darkness', 'onward', 'immaculate', 'fluster', 'bone', 'funny', 'oh', 'honey', 'muster', 'win', 'rainbow', 'awfully', 'dull', 'sam', 'tech', 'employee', 'uneventful', 'flung', 'floor', 'couch', 'lethargy', 'pizza', 'kindle', 'doorbell', 'strangely', 'gut', 'lab', 'principal', 'suspicion', 'unease', 'kind', 'pillar', 'grieving', 'apple', 'cyanide', 'horrific', 'image', 'tearful', 'serene', 'revenge', 'harassment', 'duty', 'grief', 'badge', 'scales', 'early', 'dreary', 'usual', 'robot', 'prepared', 'nuclear', 'inactive', 'browser', 'damn', 'panicked', 'start', 'indeed', 'unlucky', 'glad', 'showman', 'grade', 'trend', 'private', 'tuition', 'sports', 'stream', 'prestigious', 'significant', 'progress', 'vision', 'panting', 'messy', 'abandoned', 'tell', 'truth', '