<b> 1. Tokenize sentences </b>

In [96]:
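import nltk
from nltk.tokenize import sent_tokenize

# One-time setup if the NLTK data is missing: the tokenizers need the "punkt" data
# ("punkt_tab" on recent NLTK releases) and Step 2 needs the "stopwords" corpus.
# nltk.download("punkt"); nltk.download("stopwords")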

# Text to be Summarized
text = """Larsen & Toubro Chairman SN Subrahmanyan has stirred a storm with his recent proposal advocating for a 90-hour workweek, including Sundays, 
to maintain competitiveness. The controversial suggestion comes against the backdrop of growing concerns about the dangers of overwork,
highlighted by the tragic death of a 26-year-old EY employee, Anna Sebastian Perayil, last year. 
She reportedly succumbed to the pressures of extensive working hours, just four months into her job.
"""
# Split the given text into sentences
def tokenize_sentences(text):
    return sent_tokenize(text)
sentences = tokenize_sentences(text)
print("** Step 1: Tokenized Sentences **")
for i, sentence in enumerate(sentences, 1):
    print(f"\n Sentence {i}: {sentence}")


** Step 1: Tokenized Sentences **

 Sentence 1: Larsen & Toubro Chairman SN Subrahmanyan has stirred a storm with his recent proposal advocating for a 90-hour workweek, including Sundays, 
to maintain competitiveness.

 Sentence 2: The controversial suggestion comes against the backdrop of growing concerns about the dangers of overwork,
highlighted by the tragic death of a 26-year-old EY employee, Anna Sebastian Perayil, last year.

 Sentence 3: She reportedly succumbed to the pressures of extensive working hours, just four months into her job.


<b> 2. Create Frequency Matrix of Words in Each Sentence. </b>

In [99]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict

stop_words = set(stopwords.words("english"))

# Create the frequency matrix: raw count of each alphanumeric, non-stop-word term per sentence.
# The normalized term frequency (count of term t / total terms in the sentence) is computed in Step 3.
def create_frequency_matrix(sentences):
    frequency_matrix = {}
    for sentence in sentences:
        words = word_tokenize(sentence.lower())
        frequency_matrix[sentence] = {}
        for word in words:
            if word.isalnum() and word not in stop_words:
                frequency_matrix[sentence][word] = frequency_matrix[sentence].get(word, 0) + 1
    return frequency_matrix

frequency_matrix = create_frequency_matrix(sentences)
print("\n **Step 2: Frequency Matrix **")
for sentence, freqs in frequency_matrix.items():
    print(f"\n Sentence: \"{sentence}\"")
    print(f">> Frequencies: {freqs}")


 **Step 2: Frequency Matrix **

 Sentence: "Larsen & Toubro Chairman SN Subrahmanyan has stirred a storm with his recent proposal advocating for a 90-hour workweek, including Sundays, 
to maintain competitiveness."
>> Frequencies: {'larsen': 1, 'toubro': 1, 'chairman': 1, 'sn': 1, 'subrahmanyan': 1, 'stirred': 1, 'storm': 1, 'recent': 1, 'proposal': 1, 'advocating': 1, 'workweek': 1, 'including': 1, 'sundays': 1, 'maintain': 1, 'competitiveness': 1}

 Sentence: "The controversial suggestion comes against the backdrop of growing concerns about the dangers of overwork,
highlighted by the tragic death of a 26-year-old EY employee, Anna Sebastian Perayil, last year."
>> Frequencies: {'controversial': 1, 'suggestion': 1, 'comes': 1, 'backdrop': 1, 'growing': 1, 'concerns': 1, 'dangers': 1, 'overwork': 1, 'highlighted': 1, 'tragic': 1, 'death': 1, 'ey': 1, 'employee': 1, 'anna': 1, 'sebastian': 1, 'perayil': 1, 'last': 1, 'year': 1}

 Sentence: "She reportedly succumbed to the pressures of 

<b> 3. Calculate Term Frequency and Generate Matrix. </b>

In [102]:
# Calculate Term Frequency Tf and generate the matrix
def calculate_term_frequency(frequency_matrix):
    tf_matrix = {}
    for sentence, freqs in frequency_matrix.items():
        tf_matrix[sentence] = {}
        total_words = sum(freqs.values())
        for word, count in freqs.items():
            tf_matrix[sentence][word] = count / total_words
    return tf_matrix

tf_matrix = calculate_term_frequency(frequency_matrix)
print("\n** Step 3: Term Frequency (TF) Matrix **")
for sentence, tf_scores in tf_matrix.items():
    print(f"\n Sentence: \"{sentence}\"")
    print(f">> TF Scores: {tf_scores}\n")



** Step 3: Term Frequency (TF) Matrix **

 Sentence: "Larsen & Toubro Chairman SN Subrahmanyan has stirred a storm with his recent proposal advocating for a 90-hour workweek, including Sundays, 
to maintain competitiveness."
>> TF Scores: {'larsen': 0.06666666666666667, 'toubro': 0.06666666666666667, 'chairman': 0.06666666666666667, 'sn': 0.06666666666666667, 'subrahmanyan': 0.06666666666666667, 'stirred': 0.06666666666666667, 'storm': 0.06666666666666667, 'recent': 0.06666666666666667, 'proposal': 0.06666666666666667, 'advocating': 0.06666666666666667, 'workweek': 0.06666666666666667, 'including': 0.06666666666666667, 'sundays': 0.06666666666666667, 'maintain': 0.06666666666666667, 'competitiveness': 0.06666666666666667}


 Sentence: "The controversial suggestion comes against the backdrop of growing concerns about the dangers of overwork,
highlighted by the tragic death of a 26-year-old EY employee, Anna Sebastian Perayil, last year."
>> TF Scores: {'controversial': 0.05555555555555

<b> 4. Create a Table of Documents per Word. </b>

In [105]:
# Build the documents-per-word table: the number of sentences that contain each word
def doc_word_table(frequency_matrix):
    word_document_counts = defaultdict(int)
    for sentence, freqs in frequency_matrix.items():
        for word in freqs.keys():
            word_document_counts[word] += 1
    return word_document_counts

word_document_counts = doc_word_table(frequency_matrix)
print("\n**Step 4: Document Per Word Table **\n")
for word, doc_count in word_document_counts.items():
    print(f"{word}: {doc_count}")


**Step 4: Document Per Word Table **

larsen: 1
toubro: 1
chairman: 1
sn: 1
subrahmanyan: 1
stirred: 1
storm: 1
recent: 1
proposal: 1
advocating: 1
workweek: 1
including: 1
sundays: 1
maintain: 1
competitiveness: 1
controversial: 1
suggestion: 1
comes: 1
backdrop: 1
growing: 1
concerns: 1
dangers: 1
overwork: 1
highlighted: 1
tragic: 1
death: 1
ey: 1
employee: 1
anna: 1
sebastian: 1
perayil: 1
last: 1
year: 1
reportedly: 1
succumbed: 1
pressures: 1
extensive: 1
working: 1
hours: 1
four: 1
months: 1
job: 1
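
Every count in the table above is 1 because no content word is shared between the three sentences. To see the table do something less trivial, here is a minimal sketch on a made-up corpus; `toy_frequency_matrix` and `document_counts` are hypothetical names introduced for illustration, and `Counter` is used as an alternative to the `defaultdict` loop above.

```python
from collections import Counter

# Hypothetical toy data in the same shape as frequency_matrix:
# sentence -> {word: raw count in that sentence}
toy_frequency_matrix = {
    "s1": {"overwork": 1, "hours": 2},
    "s2": {"overwork": 1, "death": 1},
    "s3": {"hours": 1, "overwork": 1, "job": 1},
}

def document_counts(frequency_matrix):
    """Count how many sentences contain each word.

    A word repeated inside one sentence still counts once for that sentence,
    because only the dictionary keys are tallied.
    """
    counts = Counter()
    for freqs in frequency_matrix.values():
        counts.update(freqs.keys())
    return counts

toy_counts = document_counts(toy_frequency_matrix)
print(dict(toy_counts))
```

Here "hours" appears twice in `s1` but contributes only 1 to its document count, which is the distinction between Step 2 (term counts) and Step 4 (document counts).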


<b> 5. Calculate IDF and generate matrix. </b>

In [110]:
import math

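# Calculate Inverse Document Frequency: log10(N / (1 + df)),
# where N is the number of sentences and df is the number of sentences containing the word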
def calculate_idf(sentences, word_document_counts):
    total_documents = len(sentences)
    idf_scores = {}
    for word, count in word_document_counts.items():
        idf_scores[word] = math.log10(total_documents / (1 + count))
    return idf_scores

idf_scores = calculate_idf(sentences, word_document_counts)
print("\n### Step 5: Inverse Document Frequency (IDF) Scores ###\n")
for word, idf_score in idf_scores.items():
    print(f"{word}: {idf_score}")


### Step 5: Inverse Document Frequency (IDF) Scores ###

larsen: 0.17609125905568124
toubro: 0.17609125905568124
chairman: 0.17609125905568124
sn: 0.17609125905568124
subrahmanyan: 0.17609125905568124
stirred: 0.17609125905568124
storm: 0.17609125905568124
recent: 0.17609125905568124
proposal: 0.17609125905568124
advocating: 0.17609125905568124
workweek: 0.17609125905568124
including: 0.17609125905568124
sundays: 0.17609125905568124
maintain: 0.17609125905568124
competitiveness: 0.17609125905568124
controversial: 0.17609125905568124
suggestion: 0.17609125905568124
comes: 0.17609125905568124
backdrop: 0.17609125905568124
growing: 0.17609125905568124
concerns: 0.17609125905568124
dangers: 0.17609125905568124
overwork: 0.17609125905568124
highlighted: 0.17609125905568124
tragic: 0.17609125905568124
death: 0.17609125905568124
ey: 0.17609125905568124
employee: 0.17609125905568124
anna: 0.17609125905568124
sebastian: 0.17609125905568124
perayil: 0.17609125905568124
last: 0.17609125905568124
year: 0.17609125905568124
reportedly: 0.17609125905568124
succumbed: 0.17609125905568124
pressures: 0.17609125905568124
extensive: 0.17609125905568124
working: 0.17609125905568124
hours: 0.17609125905568124
four: 0.17609125905568124
months: 0.17609125905568124
job: 0.17609125905568124
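
Every word in this example occurs in exactly one of the three sentences, so the IDF formula is only ever evaluated at one point. The sketch below evaluates the same expression, `log10(N / (1 + count))`, for a word found in one, two, or all three sentences; `smoothed_idf` and `idf_examples` are hypothetical names used only for this illustration.

```python
import math

def smoothed_idf(total_documents, document_count):
    """IDF in the form used in Step 5: log10(N / (1 + document_count))."""
    return math.log10(total_documents / (1 + document_count))

# N = 3 sentences, as in the example text above.
idf_examples = {df: smoothed_idf(3, df) for df in (1, 2, 3)}
for df, value in idf_examples.items():
    print(f"word in {df} of 3 sentences -> IDF = {value:.4f}")
```

With this form of the formula, a word present in two of three sentences gets an IDF of exactly 0, and a word present in all three gets a negative IDF. On such a small corpus, common words can therefore lower a sentence's score rather than merely add little to it.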

<b> 6. Calculate TF-IDF and Generate Matrix. </b>

In [113]:
# Calculate TF-IDF Matrix
def calculate_tfidf(tf_matrix, idf_scores):
    tfidf_matrix = {}
    for sentence, tf_scores in tf_matrix.items():
        tfidf_matrix[sentence] = {}
        for word, tf_score in tf_scores.items():
            tfidf_matrix[sentence][word] = tf_score * idf_scores.get(word, 0)
    return tfidf_matrix

tfidf_matrix = calculate_tfidf(tf_matrix, idf_scores)
print("\n** Step 6: TF-IDF Matrix **\n")
for sentence, tfidf_scores in tfidf_matrix.items():
    print(f"Sentence: \"{sentence}\"")
    print(f">> TF-IDF Scores: {tfidf_scores}\n")


** Step 6: TF-IDF Matrix **

Sentence: "Larsen & Toubro Chairman SN Subrahmanyan has stirred a storm with his recent proposal advocating for a 90-hour workweek, including Sundays, 
to maintain competitiveness."
>> TF-IDF Scores: {'larsen': 0.01173941727037875, 'toubro': 0.01173941727037875, 'chairman': 0.01173941727037875, 'sn': 0.01173941727037875, 'subrahmanyan': 0.01173941727037875, 'stirred': 0.01173941727037875, 'storm': 0.01173941727037875, 'recent': 0.01173941727037875, 'proposal': 0.01173941727037875, 'advocating': 0.01173941727037875, 'workweek': 0.01173941727037875, 'including': 0.01173941727037875, 'sundays': 0.01173941727037875, 'maintain': 0.01173941727037875, 'competitiveness': 0.01173941727037875}

Sentence: "The controversial suggestion comes against the backdrop of growing concerns about the dangers of overwork,
highlighted by the tragic death of a 26-year-old EY employee, Anna Sebastian Perayil, last year."
>> TF-IDF Scores: {'controversial': 0.009782847725315624, 's

<b> 7. Score the Sentences. </b>

In [118]:
# Score each sentence as the sum of the TF-IDF values of its words
def score_sentences(tfidf_matrix):
    sentence_scores = {}
    for sentence, tfidf_scores in tfidf_matrix.items():
        sentence_scores[sentence] = sum(tfidf_scores.values())
    return sentence_scores

sentence_scores = score_sentences(tfidf_matrix)
print("\n** Step 7: Sentence Scores **")
for sentence, score in sentence_scores.items():
    print(f"\n Sentence: \"{sentence}\" - Score: {score}")


** Step 7: Sentence Scores **

 Sentence: "Larsen & Toubro Chairman SN Subrahmanyan has stirred a storm with his recent proposal advocating for a 90-hour workweek, including Sundays, 
to maintain competitiveness." - Score: 0.17609125905568124

 Sentence: "The controversial suggestion comes against the backdrop of growing concerns about the dangers of overwork,
highlighted by the tragic death of a 26-year-old EY employee, Anna Sebastian Perayil, last year." - Score: 0.17609125905568124

 Sentence: "She reportedly succumbed to the pressures of extensive working hours, just four months into her job." - Score: 0.17609125905568124
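
The three scores are identical, and this follows from the data rather than from a fault in the scoring step. Each word occurs in exactly one of the N = 3 sentences, so every word has the same IDF, and the TF values inside any one sentence sum to 1. As a short derivation, using tf and idf as defined in Steps 3 and 5:

```latex
\begin{aligned}
\mathrm{score}(s)
  &= \sum_{w \in s} \mathrm{tf}(w, s)\,\mathrm{idf}(w) \\
  &= \log_{10}\!\frac{3}{1 + 1} \sum_{w \in s} \mathrm{tf}(w, s) \\
  &= \log_{10}\!\frac{3}{2} \cdot 1 \approx 0.1761
\end{aligned}
```

Scores only start to differ between sentences once some words are shared across sentences, because shared words receive a smaller IDF.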


<b> 8. Find the threshold. </b>

In [131]:
# Calculate Threshold
def calculate_threshold(sentence_scores):
    threshold_factor = 0.9
    return sum(sentence_scores.values()) / len(sentence_scores) * threshold_factor

threshold = calculate_threshold(sentence_scores)
print("\n** Step 8: Threshold Value **")
print(f"Threshold (0.9 x Average Score): {threshold}")


** Step 8: Threshold Value **
Threshold (0.9 x Average Score): 0.1584821331501131


<b> 9. Generate the Summary. </b>

In [134]:
# Generate Summary
def generate_summary(sentences, sentence_scores, threshold):
    return [sentence for sentence in sentences if sentence_scores[sentence] > threshold]

summary = generate_summary(sentences, sentence_scores, threshold)
print("\n** Step 9: Generated Summary **\n")
print(" ".join(summary))



** Step 9: Generated Summary **

Larsen & Toubro Chairman SN Subrahmanyan has stirred a storm with his recent proposal advocating for a 90-hour workweek, including Sundays, 
to maintain competitiveness. The controversial suggestion comes against the backdrop of growing concerns about the dangers of overwork,
highlighted by the tragic death of a 26-year-old EY employee, Anna Sebastian Perayil, last year. She reportedly succumbed to the pressures of extensive working hours, just four months into her job.
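
Because every sentence in this example scores above the threshold, the generated summary reproduces the full text. The sketch below runs the same nine steps on a made-up text whose sentences share vocabulary, so that the threshold actually filters something out. It is an illustration under stated assumptions, not the notebook's pipeline: `split_sentences`, `sentence_word_counts`, `summarize`, and `TOY_STOP_WORDS` are hypothetical, and a regular expression stands in for NLTK's tokenizers so the block runs without downloaded data.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stop-word set instead of nltk.corpus.stopwords.
TOY_STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_word_counts(sentence):
    """Raw counts of lower-cased alphanumeric tokens that are not stop words."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return Counter(w for w in words if w not in TOY_STOP_WORDS)

def summarize(text, threshold_factor=0.9):
    """Return (summary sentences, per-sentence scores, threshold)."""
    sentences = split_sentences(text)
    counts = [sentence_word_counts(s) for s in sentences]
    # Number of sentences containing each word (Step 4).
    doc_counts = Counter(w for c in counts for w in c)
    scores = []
    for c in counts:
        total = sum(c.values())
        # Sum of TF * IDF over the sentence's words (Steps 3, 5, 6, 7).
        scores.append(sum(
            (n / total) * math.log10(len(sentences) / (1 + doc_counts[w]))
            for w, n in c.items()
        ))
    # Threshold is a fraction of the average score (Step 8).
    threshold = threshold_factor * sum(scores) / len(scores)
    summary = [s for s, sc in zip(sentences, scores) if sc > threshold]
    return summary, scores, threshold

toy_text = ("Overwork harms health. Overwork harms productivity. "
            "Long commutes drain energy.")
toy_summary, toy_scores, toy_threshold = summarize(toy_text)
print(toy_summary)
```

In this toy text, "overwork" and "harms" each appear in two of the three sentences and so get an IDF of 0. The first two sentences score below the threshold, and only the third, whose words are all unique to it, is kept.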
