nltk: Natural Language Toolkit, a powerful library for working with human language data.
stopwords: A list of common words (e.g., "the", "is", "and") that are often removed in text analysis because they don't carry much information.
WordNetLemmatizer: Used for lemmatization, which reduces words to their base dictionary form (lemma), for example "cars" to "car".
sent_tokenize and word_tokenize: Functions for tokenizing text into sentences and words, respectively.
Counter: A class for counting the occurrences of elements in a collection (useful for frequency analysis).
translate: A library for translating text.
pyttsx3: A library for text-to-speech conversion.
fitz: PyMuPDF, a library for working with PDF files.
os: Provides a way of interacting with the operating system, used for file operations.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
from translate import Translator as TranslateLib
import pyttsx3
import fitz
import os

This cell downloads the punkt, stopwords, and wordnet resources from NLTK.

In [None]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

This function counts the words in a text by splitting it on whitespace.

In [None]:
def count_words(input_text):
    words = input_text.split()
    return len(words)
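
As a quick check of how the count behaves, the sketch below repeats the function and runs it on a few sample strings (the strings are made up for illustration). Calling split() with no arguments splits on any run of whitespace, so punctuation attached to a word is counted as part of that word.

```python
def count_words(input_text):
    # split() with no arguments splits on runs of whitespace
    words = input_text.split()
    return len(words)

print(count_words("Hello, world!"))           # 2
print(count_words("  spaced   out\ttext\n"))  # 3
print(count_words(""))                        # 0
```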

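This function reads a PDF and returns the extracted text together with the total number of pages in the document. It accepts a starting page (zero-based) and an ending page (exclusive); if the ending page is omitted or exceeds the page count, reading continues to the last page. The code after the function prints the total page count and the number of words extracted; printing the text itself is left commented out.

In [None]:
def read_pdf(file_path, start_page=0, end_page=None):
    # Read text from a range of pages of a PDF file using PyMuPDF.
    # start_page is zero-based; end_page is exclusive.
    doc = fitz.open(file_path)
    total_pages = doc.page_count
    
    # Clamp the end page to the length of the document
    if end_page is None or end_page > total_pages:
        end_page = total_pages
    
    # Collect the text of each page, then join once at the end
    page_texts = []
    for page_num in range(start_page, end_page):
        page = doc[page_num]
        page_texts.append(page.get_text())
    
    # Close the PDF document
    doc.close()
    
    # Return the text and the real page count of the document
    return "".join(page_texts), total_pages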


pdf_file_path = 'Docs/Research.pdf'
start_page = 0  # Specify the starting page
end_page = 5    # Specify the ending page (set to None if you want to process all pages)

text, total_pages = read_pdf(pdf_file_path, start_page, end_page)
print(f"Total number of pages in the PDF: {total_pages}")

text_words = count_words(text)
print(f"Number of words in the input: {text_words}")
# print(text)
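
The page-range handling can be checked without a PDF. The helper below is hypothetical (it is not part of the notebook or of PyMuPDF); it mirrors only the clamping logic of read_pdf, assuming zero-based page numbers and an exclusive end page.

```python
def resolve_page_range(page_count, start_page=0, end_page=None):
    """Hypothetical helper: list the page indices read_pdf would visit."""
    # Same clamping rule as read_pdf: a missing or too-large end page
    # falls back to the number of pages in the document
    if end_page is None or end_page > page_count:
        end_page = page_count
    return list(range(start_page, end_page))

print(resolve_page_range(10, 0, 5))  # [0, 1, 2, 3, 4]
print(resolve_page_range(3, 0, 5))   # [0, 1, 2]
print(resolve_page_range(3))         # [0, 1, 2]
```

With start_page = 0 and end_page = 5, as in the cell above, the first five pages are read.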

This optional cell takes text typed in directly, instead of text read from a PDF, and counts its words. It is commented out by default.

In [None]:
# # Example usage:
# text = """


# """

# Text_Count = count_words(text)
# print(f"Number of words in the input: {Text_Count}")

Here, pyttsx3 converts the input text to speech and saves it as an audio file in the Audios folder.

In [None]:
engine = pyttsx3.init()

# # Set properties (optional)
# engine.setProperty('rate', 150)  # Speed of speech

# Queue the text for conversion and set the output file (WAV here)
# input_pdf_audio_file = 'input_pdf_audio.mp3'
input_pdf_audio_file = 'Audios/1.Main_Text.wav'
engine.save_to_file(text, input_pdf_audio_file)

# Process the queued command; blocks until all queued commands are done
engine.runAndWait()

# Open the audio file in the default player (the start command is Windows only)
# os.system(f"start {input_pdf_audio_file}")

Create Lemmatizer and Stop Words Set:
An instance of the WordNetLemmatizer is created, which will be used to lemmatize words in the text.
The set of English stop words is obtained using stopwords.words('english').

Tokenize Text:
The input text is tokenized into sentences using sent_tokenize.
Each sentence is further tokenized into words using word_tokenize.

Lemmatize Words and Remove Stop Words:
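Stop words are filtered out and the remaining tokens are lemmatized. For each sentence, the function goes through each token:
A token is dropped if its lowercase form is in the set of stop words.
Each remaining token is lemmatized with the lemmatizer.lemmatize method, which treats it as a noun by default.
Punctuation tokens produced by word_tokenize are not stop words, so they are kept.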

Return Lemmatized Sentences:
The function returns a list of lemmatized sentences, where each sentence is represented as a list of lemmatized words.

This function essentially takes a piece of text, tokenizes it into sentences and words, lemmatizes each word, and removes common English stop words, returning the processed text as a list of lemmatized sentences.

In [None]:
def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    # Tokenize the text into sentences and words
    sentences = sent_tokenize(text)
    tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

    # Lemmatize each word and remove stop words
    lemmatized_sentences = [
        [lemmatizer.lemmatize(word) for word in words if word.lower() not in stop_words]
        for words in tokenized_sentences
    ]

    return lemmatized_sentences
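
The shape of the result can be illustrated without downloading any NLTK data. The sketch below is a stand-in, not the notebook's function: the stop-word set, the lemma table, and the regular-expression tokenizer are toy assumptions that replace stopwords.words('english'), WordNetLemmatizer, and the NLTK tokenizers. It keeps the same filter-then-lemmatize structure and shows that punctuation tokens survive.

```python
import re

# Toy stand-ins (assumptions for illustration only)
STOP_WORDS = {"the", "is", "are", "on", "a"}
LEMMAS = {"cats": "cat", "mats": "mat"}

def toy_lemmatize_text(text):
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        # Words and single punctuation marks become separate tokens
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        # Drop stop words first, then map each remaining token to its lemma
        result.append([LEMMAS.get(t, t) for t in tokens
                       if t.lower() not in STOP_WORDS])
    return result

print(toy_lemmatize_text("The cats are on the mats. A dog is here!"))
# [['cat', 'mat', '.'], ['dog', 'here', '!']]
```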

Lemmatize Text:
The input text is lemmatized using the lemmatize_text function, generating a list of lemmatized sentences.

Flatten Lemmatized Words:
The list of lemmatized sentences is flattened into a single list of all lemmatized words.

Calculate Word Frequency:
The frequency of each lemmatized word in the text is calculated using the Counter class, creating a dictionary (word_freq) mapping words to their frequencies.

Calculate Sentence Importance:
For each sentence, an importance score is calculated by summing up the frequencies of its constituent words.

Calculate Summary Length:
The number of sentences in the summary is computed as a percentage of the total number of sentences in the lemmatized text and truncated to a whole number with int().

Select Sentences for Summary:
Sentences are selected for the summary based on their importance scores. The sorted function is used to sort the sentence indices in descending order of importance, and the first num_sentences indices are kept.

Join Summary Sentences:
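The selected sentences are joined to form the final summary. The words of each sentence are joined with spaces, and the sentences are then joined with spaces into a single string. Because the selected sentences are the lemmatized, stop-word-filtered versions, the summary consists of lemmatized words rather than the original wording.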

Return Summary:
The generated summary is returned by the function.

In [None]:
def variable_length_summary(text, summary_length_percentage=20):
    lemmatized_sentences = lemmatize_text(text)

    # Flatten the list of lemmatized words
    all_words = [word for sentence in lemmatized_sentences for word in sentence]

    # Calculate word frequency
    word_freq = Counter(all_words)

    # Calculate sentence importance based on word frequency
    sentence_importance = [sum(word_freq[word] for word in sentence) for sentence in lemmatized_sentences]

    # Calculate the total length of the summary in sentences
    num_sentences = int(len(lemmatized_sentences) * summary_length_percentage / 100)

    # Select sentences based on importance scores for summary
    summary_sentences = [lemmatized_sentences[i] for i in sorted(range(len(sentence_importance)), key=lambda k: sentence_importance[k], reverse=True)[:num_sentences]]

    # Join the summary sentences to form the final summary
    summary = ' '.join([' '.join(sentence) for sentence in summary_sentences])

    return summary
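
The scoring and selection steps can be traced by hand on a small input. The sketch below uses a hypothetical helper, top_sentence_indices, and made-up, already tokenized sentences; it follows the same frequency-sum scoring as the function above but returns the indices and scores instead of joined text.

```python
from collections import Counter

def top_sentence_indices(tokenized_sentences, percentage):
    """Hypothetical helper: indices of the highest-scoring sentences, best first."""
    # Frequency of every word across all sentences
    word_freq = Counter(w for s in tokenized_sentences for w in s)
    # A sentence's score is the sum of the frequencies of its words
    scores = [sum(word_freq[w] for w in s) for s in tokenized_sentences]
    # int() truncates, so small inputs can yield zero sentences
    num_sentences = int(len(tokenized_sentences) * percentage / 100)
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return ranked[:num_sentences], scores

sentences = [
    ["cat", "sat", "mat"],
    ["dog", "barked"],
    ["cat", "chased", "dog", "mat"],
    ["bird", "sang"],
]

picked, scores = top_sentence_indices(sentences, 50)
print(scores)          # [5, 3, 7, 2]
print(picked)          # [2, 0]
print(sorted(picked))  # [0, 2], the same sentences in document order

empty, _ = top_sentence_indices(sentences, 20)
print(empty)           # [] because int(4 * 20 / 100) is 0
```

Two properties are visible here: the scores are not normalized by sentence length, so longer sentences tend to rank higher, and the selected sentences come out in order of score rather than in their original order.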

Set Summary Length Percentage:
The variable summary_length_percentage is set to 50. This variable represents the desired length of the summary as a percentage of the total number of sentences in the original text.

Generate Summary:
The variable_length_summary function is called with the input text and the specified summary_length_percentage. The result is stored in the variable summary.

Word Count Calculation:
The count_words function is used to calculate the number of words in the generated summary. The result is stored in the variable word_count.

Display Word Count:
The result of the word count calculation is displayed using the print function. It prints a formatted string indicating the number of words in the summary.

Display Summary:
The generated summary is displayed using the print function.

In [None]:
# Set the summary length percentage (between 1 and 100)
summary_length_percentage = 50

# Get the summary
summary = variable_length_summary(text, summary_length_percentage)

# Word Count
word_count = count_words(summary)
# Display result
print(f"Number of words in the Summary: {word_count}")

# Display the summary
print("\nSummary:\n", summary)

Specify Summary Audio File:
The variable summary_audio_file is set to the path and filename where the summary audio file will be saved. In this case, it's set to 'Audios/2.Summary.mp3'.

Save Summary to Audio File:
The engine.save_to_file function is used to save the generated summary to the specified audio file (summary_audio_file). This function converts the text into speech and saves it as an audio file.

Run Text-to-Speech Engine:
The engine.runAndWait() function is called to execute the text-to-speech engine. This command initiates the generation of the audio file based on the provided summary.

Play Audio File:
The commented-out os.system call can be enabled to open the generated audio file in the default system player. The start command it uses is specific to Windows.

In [None]:
summary_audio_file = 'Audios/2.Summary.mp3'
engine.save_to_file(summary, summary_audio_file)
engine.runAndWait()
# os.system(f"start {summary_audio_file}")

Instantiate Translator:
The function starts by creating an instance of the Translator class from the translate library, imported here as TranslateLib. This instance is named translator, and it is configured to translate text to the specified target language.

Chunking Text:
The input text is broken down into chunks of at most chunk_size characters (500 by default). This is done with a list comprehension in which each chunk is a substring of the original text; the range function steps through the text in increments of chunk_size. Because the split is by character position, a chunk boundary can fall in the middle of a word.

Translate Each Chunk:
The function then enters a loop to iterate over each chunk of text. For each chunk, it uses the translator.translate method to obtain the translation of that chunk. The translation is stored in the translation variable.

Concatenate Translations:
The translated chunks are concatenated together to form the final translated text. The variable translated_text is used to accumulate the translated content from each chunk.

Return Translated Text:
The function returns the complete translated text, which is the concatenation of translations for all the chunks.

In [None]:
def translate_text_with_chunking(text, target_language='en', chunk_size=500):
    translator = TranslateLib(to_lang=target_language)

    # Break down the text into chunks
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    # Translate each chunk and concatenate the results
    translated_text = ""
    for chunk in chunks:
        translation = translator.translate(chunk)
        translated_text += translation

    return translated_text
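
The effect of character-based chunking is easiest to see with a small chunk size. The sketch below uses a made-up sentence and chunk_size = 12; it first reproduces the slicing used in the function, then shows one possible alternative, textwrap.wrap from the standard library, which breaks at whitespace. The alternative is a suggestion, not what the notebook does.

```python
import textwrap

text = "Natural language processing is fun"
chunk_size = 12

# Fixed-width character chunks, as in translate_text_with_chunking
char_chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
print(char_chunks)
# ['Natural lang', 'uage process', 'ing is fun']

# Alternative: break at whitespace so that no word is split
word_chunks = textwrap.wrap(text, width=chunk_size)
print(word_chunks)
# ['Natural', 'language', 'processing', 'is fun']
```

Note that textwrap.wrap drops the whitespace at each break and, by default, normalizes other whitespace, so translated chunks produced this way would need to be joined with a space rather than concatenated directly.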

Target Language:
The list of target language codes is set here.

In [None]:
# Set the target languages for translation
target_languages = ['hi']  # 'hi' is Hindi; use 'gu' for Gujarati

Loop through Target Languages:
The code uses a for loop to iterate over each target language specified in the target_languages list.

Translate Summary for Each Language:
Inside the loop, the translate_text_with_chunking function is called with the summary as input and the current lang as the target language. The result is stored in the variable translated_summary.

Display Translated Summary:
Prints the translated summary.

In [None]:
# Translate the summary to each target language with chunking
for lang in target_languages:
    translated_summary = translate_text_with_chunking(summary, target_language=lang)

    # Display the translated summary
    print(f"\nTranslated Summary in {lang}:\n", translated_summary)


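The translated summary is converted to an audio file. This cell runs after the loop, so it uses the last values of lang and translated_summary; with more than one target language, only the last translation is saved.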

In [None]:
# Save the translated summary as audio
translated_audio_file = f'Audios/Translated_Summary_To_{lang}.mp3'
engine.save_to_file(translated_summary, translated_audio_file)
engine.runAndWait()
# os.system(f"start {translated_audio_file}")
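
If several target languages are used, one output file per language is needed. The sketch below only builds the file names, following the notebook's naming pattern; the dictionary translated_summaries mentioned in the comment is hypothetical and would have to be filled inside the translation loop.

```python
target_languages = ['hi', 'gu']

# One output path per language, following the notebook's naming pattern
audio_files = {
    lang: f'Audios/Translated_Summary_To_{lang}.mp3'
    for lang in target_languages
}

for lang, path in audio_files.items():
    print(lang, '->', path)
    # With pyttsx3 available, each translation could be queued here:
    # engine.save_to_file(translated_summaries[lang], path)  # hypothetical dict

# engine.runAndWait()  # process all queued files at once
```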