## Latent Semantic Analysis

In [1]:
import os

# Directory holding the article files
corpus_path = "news-corpus"
article_paths = [os.path.join(corpus_path, p) for p in os.listdir(corpus_path)]

# Read each article as bytes and decode it as UTF-8, dropping undecodable bytes
doc_complete = []
for path in article_paths:
    with open(path, 'rb') as f:
        doc_complete.append(f.read().decode(errors='ignore'))


In [2]:
#doc_complete[0]

In [3]:
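import re

# Remove every character except word characters, whitespace, and periods.
# Periods are kept so sentences can still be split later.
doc_complete = [re.sub(r'[^\w\s.]', '', doc) for doc in doc_complete]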
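
The effect of this cleaning step is easiest to see on a single sentence. The sketch below applies the same pattern to a made-up sample string; `strip_punctuation` and `sample` are hypothetical names used only for illustration.

```python
import re

def strip_punctuation(text):
    # Keep word characters (\w), whitespace (\s) and periods; drop everything else.
    return re.sub(r'[^\w\s.]', '', text)

sample = "India, the world's largest producer (and second-largest exporter) of millet."
print(strip_punctuation(sample))
# prints: India the worlds largest producer and secondlargest exporter of millet.
```

Apostrophes and hyphens are deleted rather than replaced by a space, which is why forms such as "worlds" and "secondlargest" appear in the printed summaries.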

In [4]:
#doc_complete[0]

In [5]:
# Run once to download the NLTK resources used below:
# import nltk
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

In [6]:
generated_summaries = []

In [7]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LsiModel




# Drop the second file in the directory listing (index 1) before building the corpus
if len(doc_complete) >= 2:
    doc_complete.pop(1)

# Define the stopword set and lemmatizer.
# Named stop_words so the imported `stopwords` corpus reader is not shadowed.
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_document(document):
    """Return the document as one flat list of lowercase, lemmatized tokens."""
    tokens_out = []
    for sentence in sent_tokenize(document):
        # Tokenize the lowercased sentence
        tokens = word_tokenize(sentence.lower())

        # Keep alphabetic tokens that are not stopwords
        tokens = [token for token in tokens if token.isalpha() and token not in stop_words]

        # Lemmatize the remaining tokens
        tokens_out.extend(lemmatizer.lemmatize(token) for token in tokens)

    return tokens_out

# Preprocess every document into a flat token list
preprocessed_documents = [preprocess_document(doc) for doc in doc_complete]

# Map each term to an integer id and convert each document to a bag-of-words vector
dictionary = Dictionary(preprocessed_documents)
corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]

# Weight the bag-of-words vectors with TF-IDF
tfidf = TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# Fit LSA (gensim's LsiModel) on the TF-IDF corpus.
# num_topics is an upper bound: the rank of the term-document matrix
# cannot exceed the number of documents.
lsa_model = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=20)

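# Generate summaries
for i, doc in enumerate(preprocessed_documents):
    # Project the document's raw bag-of-words vector into topic space
    # (the model itself was fitted on TF-IDF vectors)
    bow = dictionary.doc2bow(doc)
    doc_topics = lsa_model[bow]
    doc_topics_sorted = sorted(doc_topics, key=lambda x: x[1], reverse=True)

    # Each entry is a (topic_id, weight) pair. The ids of the three highest-weighted
    # topics are used as sentence positions in the original document; ids beyond the
    # last sentence are skipped.
    sentences = sent_tokenize(doc_complete[i])
    top_sentences = [sentences[j] for j, _ in doc_topics_sorted[:3] if j < len(sentences)]

    # Print the summary and store it for evaluation
    print("Summary for Document", i+1)
    print("\n")
    print("\n".join(top_sentences))
    generated_summaries.append("".join(top_sentences))
    print("==========================")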

Summary for Document 1


At a time when the two extremes of malnourishment and obesity plague large portions of the world India has taken it upon herself to educate the masses about these smallseeded grasses that are highly beneficial to human health.Millet and grain cereals despite being rich sources of protein and antioxidants with high nutritional value have never been considered fashionable foods however India has done remarkably well when it has come to meeting the caloric needs and demands of her people.India the worlds largest producer and the worlds secondlargest exporter of millet are hoping to change the humble millets reputation worldwide.Unlike a large part of the rest of the world almost every Indian household is acquainted with the taste and the benefits of millet.Millets have been a staple of the Indian diet especially in rural India for years and remain prevalent even today.
We grow several types of Shri Anna Millets such as Shri Anna Jowar Shri Anna Ragi Shri Anna Bajr
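
Ranking sentences with LSA is usually done on a term-by-sentence matrix of a single document, so that each column of the decomposition corresponds to one sentence. The following is a minimal sketch of that idea, not the method used in this notebook: it assumes NumPy's SVD in place of gensim's `LsiModel` and a naive regex splitter in place of `sent_tokenize`, and the names `split_sentences` and `lsa_rank_sentences` are hypothetical.

```python
import re
import numpy as np

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    # A stand-in for nltk's sent_tokenize that needs no extra data.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def lsa_rank_sentences(text, n_sentences=3, n_topics=2):
    sentences = split_sentences(text)
    tokenized = [re.findall(r'[a-z]+', s.lower()) for s in sentences]
    vocab = sorted({t for toks in tokenized for t in toks})
    if not vocab:
        return []
    index = {t: i for i, t in enumerate(vocab)}

    # Term-by-sentence count matrix: rows are terms, columns are sentences.
    A = np.zeros((len(vocab), len(sentences)))
    for j, toks in enumerate(tokenized):
        for t in toks:
            A[index[t], j] += 1.0

    # Thin SVD: column j of vt describes sentence j in topic space.
    _, s, vt = np.linalg.svd(A, full_matrices=False)
    k = min(n_topics, len(s))

    # Score each sentence by the length of its topic vector,
    # weighted by the singular values of the top k topics.
    scores = np.sqrt(((s[:k, None] * vt[:k, :]) ** 2).sum(axis=0))

    # Take the highest-scoring sentences, then restore document order.
    top = sorted(np.argsort(-scores)[:n_sentences])
    return [sentences[j] for j in top]

text = ("Millet is a cereal grain. Millet grows in dry regions. "
        "The weather was pleasant today. Farmers export millet grain worldwide.")
print(lsa_rank_sentences(text, n_sentences=2))
```

Squaring the topic weights makes the score independent of the arbitrary sign of each singular vector, so the ranking does not change between runs.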

In [76]:
# generated_summaries

In [8]:
# The full cleaned articles serve as the references for evaluation
Reference_summaries = list(doc_complete)
Generated_summaries = list(generated_summaries)

In [24]:
#Reference_summaries[4]

In [25]:
#Generated_summaries[4]

In [80]:
from nltk.translate.bleu_score import sentence_bleu

# sentence_bleu expects token sequences. Both arguments here are plain strings,
# so the n-grams are computed over characters rather than words.
bleu_scores = [
    sentence_bleu([ref], gen)
    for ref, gen in zip(Reference_summaries, Generated_summaries)
]

# Print the BLEU score for each summary
for i, score in enumerate(bleu_scores, start=1):
    print("BLEU score for Summary", i, ":", score)



BLEU score for Summary 1 : 0.08135100400101479
BLEU score for Summary 2 : 0.002747182871391865
BLEU score for Summary 3 : 0.0017575743437884257
BLEU score for Summary 4 : 0.017289883619560634
BLEU score for Summary 5 : 0.6672570576716137
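
`sentence_bleu` is normally called with lists of tokens. A minimal sketch of a word-level variant is shown below, assuming whitespace tokenization and NLTK's `method1` smoothing; `token_bleu` is a hypothetical helper, and its scores would not match the character-level numbers printed above.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def token_bleu(reference, candidate):
    # Whitespace tokenization keeps the sketch free of extra NLTK data downloads.
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # method1 smoothing avoids a zero score when some n-gram order has no match.
    smoother = SmoothingFunction().method1
    return sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smoother)

reference = "millets have been a staple of the indian diet for years"
print(token_bleu(reference, reference))
print(token_bleu(reference, "millets are a staple of the diet"))
```

An identical candidate scores 1, while the shorter candidate scores lower both because some n-grams are missing and because BLEU penalizes candidates shorter than the reference.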


- Summary 1 scores 0.0814, a low overlap with its reference. The reference is the full cleaned article rather than a human-written summary (`Reference_summaries` is a copy of `doc_complete`), so a short extract is penalized for its length.

- Summary 2 scores 0.00275, close to zero. The generated text covers only a small fraction of its reference.

- Summary 3 scores 0.00176, the lowest of the five and comparable to Summary 2.

- Summary 4 scores 0.0173, higher than Summaries 2 and 3 but still well below Summary 1.

- Summary 5 stands out at 0.6673, which means the generated text overlaps with most of its reference.

__In summary, Summary 5 has by far the highest BLEU score, while the other four summaries score below 0.1.__ Two caveats apply. Because `sentence_bleu` receives raw strings, overlap is measured on character n-grams rather than words, and because each reference is the whole article, the scores likely reflect how much of the article a summary covers more than how good the summary is. BLEU measures n-gram overlap only, so these numbers do not establish that any summary is coherent or accurate.
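
BLEU multiplies its n-gram precision by a brevity penalty, which is 1 when the candidate is longer than the reference and exp(1 - r/c) otherwise, where c and r are the candidate and reference lengths. The sketch below works through that arithmetic. `brevity_penalty` and `implied_length_ratio` are hypothetical helpers, and the second one assumes precision is exactly 1, which is only an approximation for extracts copied from the reference.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    # 1 when the candidate is longer than the reference, exp(1 - r/c) otherwise.
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

def implied_length_ratio(score):
    # Reference/candidate length ratio that would yield `score`
    # if n-gram precision were exactly 1 (an assumption, not a measurement).
    return 1.0 - math.log(score)

# Illustrative lengths only: a candidate 3.5 times shorter than its reference.
print(brevity_penalty(100, 350))

# Length ratios implied by two of the scores reported above, under the assumption.
for score in (0.0814, 0.6673):
    print(score, round(implied_length_ratio(score), 2))
```

Under that assumption, a low score mainly indicates that the extract is several times shorter than the article it is compared against.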