<a href="https://colab.research.google.com/github/JeanMusenga/ASSORT-Automatic-Summarization-of-Stack-Overflow-Posts/blob/main/Boosting_StopwordRemoval_Lemmatization_InTextRank_V01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import nltk
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
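
The imports above include `TfidfVectorizer` and `cosine_similarity`, which later cells use to measure how alike two sentences are. As a rough illustration of what cosine similarity computes, the sketch below compares two sentences using only the standard library. It is a simplification: it uses raw word counts rather than TF-IDF weights, and the function name `cosine_from_counts` and the sample sentences are hypothetical, not part of the notebook.

```python
import math
from collections import Counter


def cosine_from_counts(sentence_a, sentence_b):
    """Cosine similarity of two sentences using raw word counts (not TF-IDF)."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    # Only words present in both sentences contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty sentence is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical sample sentences: the first pair shares most words, the second shares none.
print(cosine_from_counts("inject the user repository", "inject the student repository"))
print(cosine_from_counts("monolith architecture", "student fees"))
```

Sentences that share more vocabulary score closer to 1, and sentences with no words in common score 0. TF-IDF weighting additionally down-weights words that appear in many sentences.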

In [None]:
# Download the NLTK resources this notebook needs, skipping any already installed.
# 'punkt_tab' is included because newer NLTK releases load the sentence tokenizer from it.
for path, package in [
    ('tokenizers/punkt', 'punkt'),
    ('tokenizers/punkt_tab', 'punkt_tab'),
    ('corpora/wordnet', 'wordnet'),
    ('corpora/stopwords', 'stopwords'),
]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(package)

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
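
The next cell ranks sentences by running PageRank over a similarity graph whose edges are scaled up for sentences that contain keywords. To make the effect of that scaling concrete, here is a minimal pure-Python sketch under stated assumptions: the helpers `pagerank` and `boost` and the 3x3 similarity values are hypothetical, the diagonal is set to zero for simplicity (the notebook's cosine matrix has ones on the diagonal, which become self-loops), and the power iteration is a simplified stand-in for `nx.pagerank`, not its actual implementation.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted adjacency matrix (list of lists)."""
    n = len(weights)
    # Total edge weight leaving each node, used to normalize transitions.
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Rank flowing into node j from each node i, in proportion to edge weight.
            # Nodes with no outgoing weight are skipped (a simplification).
            incoming = sum(
                scores[i] * weights[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def boost(weights, node, factor):
    """Return a copy with every edge between `node` and another node scaled by `factor`."""
    boosted = [row[:] for row in weights]
    for k in range(len(weights)):
        if k != node:
            boosted[node][k] *= factor
            boosted[k][node] *= factor
    return boosted


# Hypothetical similarities between three sentences; all equal, so ranks start equal.
similarity = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]

plain = pagerank(similarity)
boosted = pagerank(boost(similarity, node=0, factor=3.0))
print([round(s, 3) for s in plain])
print([round(s, 3) for s in boosted])
```

With equal similarities every sentence receives the same score. After boosting the edges of sentence 0, its neighbours send it a larger share of their rank, so it rises above the other two while the scores still sum to 1.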


In [None]:
# Initialize lemmatizer and stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Text of the Stack Overflow question
text = """I need help with the architecture pattern I should use in a NestJS project. So I am using a command/query approach for developing my RestAPIs.
Right now it's a Monolith and not a microservice architecture, but I am developing it in a way that tomorrow it will be easy to switch to Microservice.
So consider a scenario, where I have 2 APIs one is createStudent and the other is createUser. In my application, I have 2 separate folders under src users and students
where users will handle all stuff related to users and students will cater to student fees, attendance etc. Each of them has its own entities and repository files.
Also, creating a student involves a step of creating a user as well. So basically let's say for a student to create, a user will be created, its contacts and address
details will be saved, institute details will be saved, document details will be saved etc. Considering this, creating users, contacts, and addresses are part of
the user folder, and repository files and entity files related to these are stored in the user folder. Create student, assign institute to the student, insert documents,
are part of the student folder and repository, and entity files related to these are stored in the student folder. Right now what I am doing is, in createStudent handler,
I have injected repositories for user, userAddresses and userContacts and using them in the handler to get/create/update records related to user, address or contacts.
Though I have a separate handler for createUser as well, where I also need to do the same eventually, it will have nothing to do with the student. I am still able to do
stuff I need to do, I am just thinking, that tomorrow If I switch to a microservices approach where the user and student will be different microservices with different
databases, I will not be able to inject the repository and somehow need to call the Rest API for user or student to achieve this. Am I doing it the right way or Is there
a way where I can call one handler from another handler in NestJS so that I can segregate the logic in their specific handlers? The second thought is, if users and students
are so closely linked to each other and the exchange of data is happening, should those be segregated into different microservices or not?"""

# Architectural keywords to focus on
keywords = ["architecture pattern", "architecture concern", "monolith", "microservice", "NestJS", "REST API", "system requirement", "repository"]

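# Helper functions
# Lowercase and lemmatize each whitespace-separated word.
# Note: nothing in this cell calls this helper, so the ranking runs on unlemmatized text.
def normalize_sentence(sentence):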
    words = sentence.split()
    return ' '.join([lemmatizer.lemmatize(word.lower()) for word in words])

# Preprocess: Split the text into sentences
sentences = nltk.sent_tokenize(text)

# Create a version of sentences with stopwords removed for weighting purposes
def remove_stopwords(sentence):
    return ' '.join([word for word in sentence.split() if word.lower() not in stop_words])

sentences_no_stopwords = [remove_stopwords(sentence) for sentence in sentences]

# Build a TF-IDF vector representation of sentences without stopwords for weighting
tfidf_matrix = TfidfVectorizer().fit_transform(sentences_no_stopwords)
vectors = tfidf_matrix.toarray()

# Compute cosine similarity matrix based on sentences without stopwords
similarity_matrix = cosine_similarity(vectors)

# Boost sentences containing keywords by increasing similarity scores
def boost_similarity_for_keywords(sentences, similarity_matrix, keywords, boost_factor=1.5):
    # Work on a copy so the caller's unboosted matrix is left intact
    boosted = similarity_matrix.copy()
    for i, sentence in enumerate(sentences):
        # Count how many keywords appear in the sentence (case-insensitive substring match)
        keyword_count = sum(keyword.lower() in sentence.lower() for keyword in keywords)
        if keyword_count > 0:
            # Scale row i and column i in proportion to the number of keywords;
            # the diagonal entry [i, i] is scaled by both statements
            boosted[i, :] *= boost_factor * keyword_count
            boosted[:, i] *= boost_factor * keyword_count
    return boosted

# Apply boosting to the similarity matrix
boosted_similarity_matrix = boost_similarity_for_keywords(sentences, similarity_matrix, keywords)

# Build the similarity graph (nodes are sentences, edges are similarity scores)
nx_graph = nx.from_numpy_array(boosted_similarity_matrix)

# PageRank algorithm with the boosted similarity matrix
scores = nx.pagerank(nx_graph)

# Rank sentences by score (we use original sentences here with stopwords)
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)

# Keep only ranked sentences that contain at least one keyword, preserving rank order
def filter_top_sentences(ranked_sentences, keywords, top_n=5):
    keyword_hits = [(score, sentence) for score, sentence in ranked_sentences
                    if any(keyword.lower() in sentence.lower() for keyword in keywords)]
    return keyword_hits[:top_n]

# Get the top 5 keyword-bearing sentences, with their scores
top_sentences = filter_top_sentences(ranked_sentences, keywords)

# Build the final extractive summary as (sentence, score) pairs
def get_summary(ranked_sentences, top_n=5):
    return [(sentence, score) for score, sentence in ranked_sentences[:top_n]]

# Display the summary with sentence scores, using the keyword-filtered ranking
summary = get_summary(top_sentences)
for i, (sentence, score) in enumerate(summary, 1):
    print(f"{i}. {sentence} (score: {score})")


1. I am still able to do
stuff I need to do, I am just thinking, that tomorrow If I switch to a microservices approach where the user and student will be different microservices with different
databases, I will not be able to inject the repository and somehow need to call the Rest API for user or student to achieve this. (score: 0.24370031296901695)
2. Right now it's a Monolith and not a microservice architecture, but I am developing it in a way that tomorrow it will be easy to switch to Microservice. (score: 0.10526122466617574)
3. I need help with the architecture pattern I should use in a NestJS project. (score: 0.0935469027596098)
4. Considering this, creating users, contacts, and addresses are part of
the user folder, and repository files and entity files related to these are stored in the user folder. (score: 0.07257372572430101)
5. Create student, assign institute to the student, insert documents,
are part of the student folder and repository, and entity files related to these a