<b> 1. Tokenize sentences </b>

In [7]:
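import nltk
from nltk.tokenize import sent_tokenize

# One-time setup: sent_tokenize needs the Punkt tokenizer data, e.g.
# nltk.download("punkt")  (newer NLTK releases may ask for "punkt_tab" instead)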

# Text to be Summarized
text = """Workday is a steaming pile of burnt garbage. I've never come across software so obtusely non-intuitive for both candidates and recruiters alike. Completely bug ridden, date and hierarchically driven dependencies that are monolithic and inflexible, sends an email notification and text every time you do anything in it, shitty interface and wtf would anyone go "oh yes, I'd love to create a new, separate online account every time I apply at a company that uses workday!"
I used green screen systems in the late 80's early 90's that were more intuitive. It's so bad, I'm embarrassed to use it and am probably going to change jobs because of it. I'm reading a litany of bugs across every module my company has implemented in the support channel from "I entered a candidate into TA manually and they've disappeared" to Project staff saying "where have all my projects and resourcing projections gone, they were visible in test tenant!"
"""
# Split the text into sentences
def tokenize_sentences(text):
    return sent_tokenize(text)
sentences = tokenize_sentences(text)
print("** Step 1: Tokenized Sentences **")
for i, sentence in enumerate(sentences, 1):
    print(f"\n Sentence {i}: {sentence}")


** Step 1: Tokenized Sentences **

 Sentence 1: Workday is a steaming pile of burnt garbage.

 Sentence 2: I've never come across software so obtusely non-intuitive for both candidates and recruiters alike.

 Sentence 3: Completely bug ridden, date and hierarchically driven dependencies that are monolithic and inflexible, sends an email notification and text every time you do anything in it, shitty interface and wtf would anyone go "oh yes, I'd love to create a new, separate online account every time I apply at a company that uses workday!"

 Sentence 4: I used green screen systems in the late 80's early 90's that were more intuitive.

 Sentence 5: It's so bad, I'm embarrassed to use it and am probably going to change jobs because of it.

 Sentence 6: I'm reading a litany of bugs across every module my company has implemented in the support channel from "I entered a candidate into TA manually and they've disappeared" to Project staff saying "where have all my projects and resourcing pr

<b> 2. Create Frequency Matrix of Words in Each Sentence. </b>

In [9]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict

stop_words = set(stopwords.words("english"))

# Create the frequency matrix: for each sentence, the raw count of every alphanumeric, non-stopword token.
# Dividing by the sentence's total count (term frequency) happens in Step 3.
def create_frequency_matrix(sentences):
    frequency_matrix = {}
    for sentence in sentences:
        words = word_tokenize(sentence.lower())
        frequency_matrix[sentence] = {}
        for word in words:
            if word.isalnum() and word not in stop_words:
                frequency_matrix[sentence][word] = frequency_matrix[sentence].get(word, 0) + 1
    return frequency_matrix

frequency_matrix = create_frequency_matrix(sentences)
print("\n **Step 2: Frequency Matrix **")
for sentence, freqs in frequency_matrix.items():
    print(f"\n Sentence: \"{sentence}\"")
    print(f">> Frequencies: {freqs}")


 **Step 2: Frequency Matrix **

 Sentence: "Workday is a steaming pile of burnt garbage."
>> Frequencies: {'workday': 1, 'steaming': 1, 'pile': 1, 'burnt': 1, 'garbage': 1}

 Sentence: "I've never come across software so obtusely non-intuitive for both candidates and recruiters alike."
>> Frequencies: {'never': 1, 'come': 1, 'across': 1, 'software': 1, 'obtusely': 1, 'candidates': 1, 'recruiters': 1, 'alike': 1}

 Sentence: "Completely bug ridden, date and hierarchically driven dependencies that are monolithic and inflexible, sends an email notification and text every time you do anything in it, shitty interface and wtf would anyone go "oh yes, I'd love to create a new, separate online account every time I apply at a company that uses workday!""
>> Frequencies: {'completely': 1, 'bug': 1, 'ridden': 1, 'date': 1, 'hierarchically': 1, 'driven': 1, 'dependencies': 1, 'monolithic': 1, 'inflexible': 1, 'sends': 1, 'email': 1, 'notification': 1, 'text': 1, 'every': 2, 'time': 2, 'anything':
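
The frequencies for sentence 2 contain no entry for "non-intuitive". That follows from the `word.isalnum()` check, which rejects any token containing a hyphen, apostrophe, or punctuation. The sketch below illustrates the filter on its own. The token list is a hand-written approximation of what a word tokenizer might return, and `demo_stop_words` is a tiny made-up stopword set, so neither is actual NLTK output.

```python
# Hypothetical token list, roughly what a word tokenizer might return for
# part of sentence 2 after lowercasing (an assumption, not NLTK output).
tokens = ["i", "'ve", "never", "come", "across", "software", "so",
          "obtusely", "non-intuitive", "for", "both", "candidates", "."]

# Tiny illustrative stopword set; the notebook uses NLTK's English list.
demo_stop_words = {"i", "so", "for", "both"}

def count_words(tokens, stop_words):
    """Count tokens that are purely alphanumeric and not stopwords."""
    counts = {}
    for token in tokens:
        if token.isalnum() and token not in stop_words:
            counts[token] = counts.get(token, 0) + 1
    return counts

counts = count_words(tokens, demo_stop_words)
dropped = [t for t in tokens if not t.isalnum()]
print(counts)
print("Dropped by isalnum():", dropped)
```

If hyphenated terms matter for the summary, one option is to relax the check, for example by stripping hyphens before calling `isalnum()`.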

<b> 3. Calculate Term Frequency and Generate Matrix. </b>

In [11]:
# Calculate term frequency (TF): each word's count divided by the total number of counted words in its sentence
def calculate_term_frequency(frequency_matrix):
    tf_matrix = {}
    for sentence, freqs in frequency_matrix.items():
        tf_matrix[sentence] = {}
        total_words = sum(freqs.values())
        for word, count in freqs.items():
            tf_matrix[sentence][word] = count / total_words
    return tf_matrix

tf_matrix = calculate_term_frequency(frequency_matrix)
print("\n** Step 3: Term Frequency (TF) Matrix **")
for sentence, tf_scores in tf_matrix.items():
    print(f"\n Sentence: \"{sentence}\"")
    print(f">> TF Scores: {tf_scores}\n")



** Step 3: Term Frequency (TF) Matrix **

 Sentence: "Workday is a steaming pile of burnt garbage."
>> TF Scores: {'workday': 0.2, 'steaming': 0.2, 'pile': 0.2, 'burnt': 0.2, 'garbage': 0.2}


 Sentence: "I've never come across software so obtusely non-intuitive for both candidates and recruiters alike."
>> TF Scores: {'never': 0.125, 'come': 0.125, 'across': 0.125, 'software': 0.125, 'obtusely': 0.125, 'candidates': 0.125, 'recruiters': 0.125, 'alike': 0.125}


 Sentence: "Completely bug ridden, date and hierarchically driven dependencies that are monolithic and inflexible, sends an email notification and text every time you do anything in it, shitty interface and wtf would anyone go "oh yes, I'd love to create a new, separate online account every time I apply at a company that uses workday!""
>> TF Scores: {'completely': 0.027777777777777776, 'bug': 0.027777777777777776, 'ridden': 0.027777777777777776, 'date': 0.027777777777777776, 'hierarchically': 0.027777777777777776, 'driven': 0

<b> 4. Create a table of documents per word. </b>

In [13]:
# Build the documents-per-word table: the number of sentences that contain each word
def doc_word_table(frequency_matrix):
    word_document_counts = defaultdict(int)
    for sentence, freqs in frequency_matrix.items():
        for word in freqs.keys():
            word_document_counts[word] += 1
    return word_document_counts

word_document_counts = doc_word_table(frequency_matrix)
print("\n**Step 4: Document Per Word Table **\n")
for word, doc_count in word_document_counts.items():
    print(f"{word}: {doc_count}")


**Step 4: Document Per Word Table **

workday: 2
steaming: 1
pile: 1
burnt: 1
garbage: 1
never: 1
come: 1
across: 2
software: 1
obtusely: 1
candidates: 1
recruiters: 1
alike: 1
completely: 1
bug: 1
ridden: 1
date: 1
hierarchically: 1
driven: 1
dependencies: 1
monolithic: 1
inflexible: 1
sends: 1
email: 1
notification: 1
text: 1
every: 2
time: 1
anything: 1
shitty: 1
interface: 1
wtf: 1
would: 1
anyone: 1
go: 1
oh: 1
yes: 1
love: 1
create: 1
new: 1
separate: 1
online: 1
account: 1
apply: 1
company: 2
uses: 1
used: 1
green: 1
screen: 1
systems: 1
late: 1
80: 1
early: 1
90: 1
intuitive: 1
bad: 1
embarrassed: 1
use: 1
probably: 1
going: 1
change: 1
jobs: 1
reading: 1
litany: 1
bugs: 1
module: 1
implemented: 1
support: 1
channel: 1
entered: 1
candidate: 1
ta: 1
manually: 1
disappeared: 1
project: 1
staff: 1
saying: 1
projects: 1
resourcing: 1
projections: 1
gone: 1
visible: 1
test: 1
tenant: 1


<b> 5. Calculate IDF and generate matrix. </b>

In [15]:
import math

# Calculate inverse document frequency (IDF), treating each sentence as a document.
# The +1 in the denominator is a smoothing term.
def calculate_idf(sentences, word_document_counts):
    total_documents = len(sentences)
    idf_scores = {}
    for word, count in word_document_counts.items():
        idf_scores[word] = math.log10(total_documents / (1 + count))
    return idf_scores

idf_scores = calculate_idf(sentences, word_document_counts)
print("\n### Step 5: Inverse Document Frequency (IDF) Scores ###\n")
for word, idf_score in idf_scores.items():
    print(f"{word}: {idf_score}")


### Step 5: Inverse Document Frequency (IDF) Scores ###

workday: 0.3010299956639812
steaming: 0.47712125471966244
pile: 0.47712125471966244
burnt: 0.47712125471966244
garbage: 0.47712125471966244
never: 0.47712125471966244
come: 0.47712125471966244
across: 0.3010299956639812
software: 0.47712125471966244
obtusely: 0.47712125471966244
candidates: 0.47712125471966244
recruiters: 0.47712125471966244
alike: 0.47712125471966244
completely: 0.47712125471966244
bug: 0.47712125471966244
ridden: 0.47712125471966244
date: 0.47712125471966244
hierarchically: 0.47712125471966244
driven: 0.47712125471966244
dependencies: 0.47712125471966244
monolithic: 0.47712125471966244
inflexible: 0.47712125471966244
sends: 0.47712125471966244
email: 0.47712125471966244
notification: 0.47712125471966244
text: 0.47712125471966244
every: 0.3010299956639812
time: 0.47712125471966244
anything: 0.47712125471966244
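
Only two distinct IDF values appear in the output because the text has six sentences and every word occurs in either one or two of them. The sketch below evaluates the same formula as `calculate_idf` for each possible document count with six sentences. The helper name `smoothed_idf` is introduced here for illustration only.

```python
import math

def smoothed_idf(total_documents, document_count):
    """IDF as written in Step 5: log10(N / (1 + n_t))."""
    return math.log10(total_documents / (1 + document_count))

N = 6  # number of sentences in the example text
for n_t in range(1, N + 1):
    print(f"word in {n_t} of {N} sentences -> IDF {smoothed_idf(N, n_t):.4f}")
```

With this smoothing, a word that appears in five of the six sentences gets an IDF of zero, and a word that appears in all six gets a negative IDF, which would subtract from a sentence's score in Step 7.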

<b> 6. Calculate TF-IDF and Generate Matrix. </b>

In [17]:
# Calculate TF-IDF Matrix
def calculate_tfidf(tf_matrix, idf_scores):
    tfidf_matrix = {}
    for sentence, tf_scores in tf_matrix.items():
        tfidf_matrix[sentence] = {}
        for word, tf_score in tf_scores.items():
            tfidf_matrix[sentence][word] = tf_score * idf_scores.get(word, 0)
    return tfidf_matrix

tfidf_matrix = calculate_tfidf(tf_matrix, idf_scores)
print("\n** Step 6: TF-IDF Matrix **\n")
for sentence, tfidf_scores in tfidf_matrix.items():
    print(f"Sentence: \"{sentence}\"")
    print(f">> TF-IDF Scores: {tfidf_scores}\n")


** Step 6: TF-IDF Matrix **

Sentence: "Workday is a steaming pile of burnt garbage."
>> TF-IDF Scores: {'workday': 0.06020599913279624, 'steaming': 0.09542425094393249, 'pile': 0.09542425094393249, 'burnt': 0.09542425094393249, 'garbage': 0.09542425094393249}

Sentence: "I've never come across software so obtusely non-intuitive for both candidates and recruiters alike."
>> TF-IDF Scores: {'never': 0.059640156839957804, 'come': 0.059640156839957804, 'across': 0.03762874945799765, 'software': 0.059640156839957804, 'obtusely': 0.059640156839957804, 'candidates': 0.059640156839957804, 'recruiters': 0.059640156839957804, 'alike': 0.059640156839957804}

Sentence: "Completely bug ridden, date and hierarchically driven dependencies that are monolithic and inflexible, sends an email notification and text every time you do anything in it, shitty interface and wtf would anyone go "oh yes, I'd love to create a new, separate online account every time I apply at a company that uses workday!""
>> T
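
For reference, the quantities computed in Steps 3, 5 and 6 can be written compactly as follows. The notation is introduced here and is not part of the notebook: $f_{t,s}$ is the count of word $t$ in sentence $s$ after stopword and `isalnum()` filtering, $N$ is the number of sentences, and $n_t$ is the number of sentences containing $t$.

```latex
\mathrm{tf}(t, s) = \frac{f_{t,s}}{\sum_{t'} f_{t',s}}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{1 + n_t}, \qquad
\mathrm{tfidf}(t, s) = \mathrm{tf}(t, s)\,\mathrm{idf}(t)
```

As a check against the printed output, "workday" in sentence 1 has $\mathrm{tf} = 1/5 = 0.2$ and $\mathrm{idf} = \log_{10}(6/3) \approx 0.3010$, giving $\mathrm{tfidf} \approx 0.0602$.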

<b> 7. Score the sentences. </b>

In [19]:
# Score each sentence by summing the TF-IDF values of its words
def score_sentences(tfidf_matrix):
    sentence_scores = {}
    for sentence, tfidf_scores in tfidf_matrix.items():
        sentence_scores[sentence] = sum(tfidf_scores.values())
    return sentence_scores

sentence_scores = score_sentences(tfidf_matrix)
print("\n** Step 7: Sentence Scores **")
for sentence, score in sentence_scores.items():
    print(f"\n Sentence: \"{sentence}\" - Score: {score}")


** Step 7: Sentence Scores **

 Sentence: "Workday is a steaming pile of burnt garbage." - Score: 0.4419030029085262

 Sentence: "I've never come across software so obtusely non-intuitive for both candidates and recruiters alike." - Score: 0.45510984733770227

 Sentence: "Completely bug ridden, date and hierarchically driven dependencies that are monolithic and inflexible, sends an email notification and text every time you do anything in it, shitty interface and wtf would anyone go "oh yes, I'd love to create a new, separate online account every time I apply at a company that uses workday!"" - Score: 0.45755555926903113

 Sentence: "I used green screen systems in the late 80's early 90's that were more intuitive." - Score: 0.4771212547196624

 Sentence: "It's so bad, I'm embarrassed to use it and am probably going to change jobs because of it." - Score: 0.4771212547196624

 Sentence: "I'm reading a litany of bugs across every module my company has implemented in the support channel f
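
Sentences 4 and 5 receive the same score, 0.4771212547196624, even though they differ in length. Because the TF values within a sentence sum to 1, the summed TF-IDF score is a weighted average of the IDF values of its words, so any sentence whose words each occur in no other sentence scores log10(6/2) regardless of how many words it has. The sketch below checks this with made-up word counts; `sentence_score` is a hypothetical helper that follows the formulas from Steps 3, 5 and 7.

```python
import math

def sentence_score(word_counts, doc_counts, total_documents):
    """Sum of TF * IDF over a sentence, following Steps 3, 5 and 7."""
    total_words = sum(word_counts.values())
    score = 0.0
    for word, count in word_counts.items():
        tf = count / total_words
        idf = math.log10(total_documents / (1 + doc_counts[word]))
        score += tf * idf
    return score

# Hypothetical sentences whose words each occur in only one of six sentences.
short_sentence = {"alpha": 1, "beta": 1}
long_sentence = {f"w{i}": 1 for i in range(9)}
doc_counts = {word: 1 for word in list(short_sentence) + list(long_sentence)}

print(sentence_score(short_sentence, doc_counts, 6))
print(sentence_score(long_sentence, doc_counts, 6))
print(math.log10(3))
```

A consequence is that this scoring favors sentences made of words that are rare in the text, and a sentence is penalized only for words it shares with other sentences.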

<b> 8. Find the threshold. </b>

In [50]:
# Calculate the threshold: the mean sentence score scaled by threshold_factor
def calculate_threshold(sentence_scores, threshold_factor=1):
    return sum(sentence_scores.values()) / len(sentence_scores) * threshold_factor

threshold = calculate_threshold(sentence_scores)
print("\n** Step 8: Threshold Value **")
print(f"Threshold (Average Score): {threshold}")


** Step 8: Threshold Value **
Threshold (Average Score): 0.4608002037645942


<b> 9. Generate the Summary. </b>

In [53]:
# Generate the summary: keep sentences that score above the threshold, in their original order
def generate_summary(sentences, sentence_scores, threshold):
    return [sentence for sentence in sentences if sentence_scores[sentence] > threshold]

summary = generate_summary(sentences, sentence_scores, threshold)
print("\n** Step 9: Generated Summary **\n")
print(" ".join(summary))



** Step 9: Generated Summary **

I used green screen systems in the late 80's early 90's that were more intuitive. It's so bad, I'm embarrassed to use it and am probably going to change jobs because of it.
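
The length of the summary depends on the `threshold_factor` value in Step 8 (set to 1 there). The sketch below combines the threshold and selection steps into one hypothetical helper, `summarize`, and runs it on made-up sentences and scores to show how the factor changes what is kept. The scores are illustrative and are not taken from the notebook's output.

```python
def summarize(sentences, scores, threshold_factor=1.0):
    """Keep sentences scoring above threshold_factor * mean score, in order."""
    threshold = threshold_factor * sum(scores.values()) / len(scores)
    return [s for s in sentences if scores[s] > threshold]

# Hypothetical sentences and scores, for illustration only.
demo_sentences = ["A.", "B.", "C.", "D."]
demo_scores = {"A.": 0.40, "B.": 0.50, "C.": 0.45, "D.": 0.55}

for factor in (0.9, 1.0, 1.1):
    print(factor, summarize(demo_sentences, demo_scores, factor))
```

Note that both this helper and the notebook's dictionaries use the sentence text as the key, so two identical sentences in the input would share a single entry.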
