<a href="https://colab.research.google.com/github/Shri-Aiswarya/NLP/blob/main/NLP_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Exercise 1. Text Generation

• Install markovify

• Import pandas and markovify

• Load the file ‘abcnews-date-text.csv’ as ‘inp’

• Display the top three rows

• Create a model with markovify as ‘text_model’ to generate text

• Print ten randomly generated sentences using the built model.

In [3]:
#Installing markovify
!pip install markovify

#Importing all the required libraries
import pandas as pd
import markovify

#Loading the csv file as inp
inp = pd.read_csv('abcnews-date-text.csv')

#Displaying the three top rows
print(inp.head(3))

#Combining all the text into a single string
text_data = '. '.join(inp['headline_text'].dropna().astype(str))

#Creating a Markov model using markovify.Text
text_model = markovify.Text(text_data)

#Printing ten randomly generated sentences
#(make_sentence() returns None when it cannot build a sufficiently novel sentence)
for i in range(10):
    print(text_model.make_sentence())


   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
140000 without power for mars training lab. snowboarding injuries increase. south east nepal kills 2. bomb attacks injure 4 in hospital after shooting. pope urges conservatives to be broken. same sex marriage and indigenous battlecry. opposition criticises local government reform in 20 act drivers affected by credit card. medical marijuana law. nsw parliament full of praise for locals storm preparedness. probe continues into cut phone cable. proud apologetic after glassing incident. man to front court over girlfriends death. telstra continues battle for free deal extended as qld government plans to stop before. rescues under way on port road costs will be profitable again. explosion hits bus shelter. schumacher shrugs off 
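
The run-on line above probably arises because the headlines are all lowercase: markovify.Text appears to split sentences only where a period is followed by something other than a lowercase letter, so most of the '. ' joins are not treated as sentence boundaries. markovify also provides NewlineText for corpora with one sentence per line, which may suit this dataset better. To show what treating each headline as its own sentence means for the chain itself, here is a minimal pure-Python sketch of a word-level Markov chain. It is not markovify's implementation; build_chain, generate, and the third headline are made up for illustration.

```python
import random
from collections import defaultdict

# Sentinel tokens marking the start and end of a sentence (hypothetical names).
BEGIN, END = "__BEGIN__", "__END__"

def build_chain(sentences, state_size=2):
    """Map each tuple of `state_size` words to the list of words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        padded = [BEGIN] * state_size + words + [END]
        for i in range(len(padded) - state_size):
            state = tuple(padded[i:i + state_size])
            chain[state].append(padded[i + state_size])
    return chain

def generate(chain, state_size=2, rng=random, max_words=30):
    """Walk the chain from the BEGIN state until END (or max_words) is reached."""
    state = (BEGIN,) * state_size
    out = []
    while len(out) < max_words:
        next_word = rng.choice(chain[state])
        if next_word == END:
            break
        out.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(out)

# The first two headlines come from the rows printed above; the third is invented
# so that the chain has a branch to choose from.
headlines = [
    "act fire witnesses must be aware of defamation",
    "a g calls for infrastructure protection summit",
    "act fire crews called to infrastructure summit",
]
chain = build_chain(headlines)
rng = random.Random(0)  # fixed seed so repeated runs give the same sentences
for _ in range(3):
    print(generate(chain, rng=rng))
```

Because every headline is padded with its own BEGIN and END tokens, a generated sentence always stops at a headline boundary instead of running through several headlines.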

Exercise 2. Text Summarization

• Use sumy to summarize the ‘alice.txt’ file

• Download the ‘punkt’ and 'tokenizers/punkt/PY3/english.pickle' NLTK resources.

In [5]:
#Installing sumy
!pip install sumy nltk

#Importing the required libraries
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
import nltk

#Downloading punkt
nltk.download('punkt')

#Loading the text from the file
with open('alice.txt', 'r', encoding='utf-8') as file:
    text = file.read()

#Creating a parser
parser = PlaintextParser.from_string(text, Tokenizer("english"))

#Initializing the LSA summarizer
summarizer = LsaSummarizer()

#Summarizing the text
summary = summarizer(parser.document, 5)  # Summarize to 5 sentences

#Printing the summary
for sentence in summary:
    print(sentence)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The Mouse did not answer, so Alice went on eagerly:  `There is such a nice little dog near our house I should like to show you!
The poor little thing sobbed again (or grunted, it was impossible to say which), and they went on for some while in silence.
He sent them word I had not gone (We know it to be true): If she should push the matter on, What would become of you?
Don't let him know she liked them best, For this must ever be A secret, kept from all the rest, Between yourself and me.'
`If there's no meaning in it,' said the King, `that saves a world of trouble, you know, as we needn't try to find any.
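
The five lines above are sentences copied verbatim from the text, because this kind of summarizer is extractive: it scores the existing sentences and keeps the top few rather than writing new ones. The sketch below illustrates that select-and-keep step with a much simpler scorer than LSA, namely average word frequency. It is not what LsaSummarizer does internally; summarize, STOP_WORDS, and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real summarizers use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was", "is"}

def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words_per_sentence = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    freq = Counter(w for words in words_per_sentence for w in words)
    # Score = average document-wide frequency of the sentence's content words.
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in words_per_sentence
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:n_sentences])]

sample = "Cats chase mice. Dogs chase cats. Birds sing."
for sentence in summarize(sample, n_sentences=2):
    print(sentence)
```

LSA replaces the frequency score with one derived from a singular value decomposition of the term-sentence matrix, but the final step of ranking sentences and keeping the best ones has the same shape.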


Exercise 3. Topic Modeling

• Determine the top 20 topics with Non-Negative Matrix Factorization (NMF), imported via ‘from sklearn.decomposition import NMF’

• Vectorize the words after cleaning up the text

• Use ‘print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))’ to list the topics

In [6]:
#Installing and importing the required libraries
!pip install nltk scikit-learn pandas
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import re

#Loading the text data
with open('alice.txt', 'r', encoding='utf-8') as file:
    text = file.read()

#Cleaning the text
def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'\W+', ' ', text)  # Remove punctuation
    text = text.lower()                # Convert to lowercase
    return text

cleaned_text = clean_text(text)
documents = cleaned_text.splitlines()  # Split into individual lines

#Vectorizing the text
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

#Setting the number of topics
n_topics = 20

#Fitting the NMF model
nmf_model = NMF(n_components=n_topics, random_state=42)
W = nmf_model.fit_transform(tfidf_matrix)
H = nmf_model.components_

#Getting the feature names
idx_to_word = vectorizer.get_feature_names_out()

#Printing the topics
for i, topic in enumerate(H):
    print("Topic {}: {}".format(i + 1, ", ".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))
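
One thing to check in the cell above: clean_text replaces every run of whitespace, newlines included, with a single space, so by the time splitlines() runs there are no line breaks left and documents likely holds one long string. With a single document, NMF has little to factorize, which may be why only the first topic below reads coherently. A minimal sketch of splitting first and cleaning each line afterwards, using a short sample string rather than alice.txt (clean_line is a hypothetical helper):

```python
import re

def clean_line(line):
    """Lowercase one line and replace runs of non-word characters with a space."""
    return re.sub(r"\W+", " ", line).lower().strip()

sample = "Alice was beginning\nto get very tired.\n\nDown the Rabbit-Hole!"

# Whole-text cleaning first: the newlines are gone before splitlines() runs.
collapsed = re.sub(r"\s+", " ", sample)
print(len(collapsed.splitlines()))  # 1

# Split first, then clean each line and drop the empty ones.
documents = [clean_line(line) for line in sample.splitlines()]
documents = [doc for doc in documents if doc]
print(len(documents))  # 3
print(documents)
```

Passing a list like this to TfidfVectorizer would give NMF one row per line, so the resulting topics would differ from the ones printed in this notebook. Single lines are short documents; splitting on blank lines to get paragraphs is another option worth trying.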


Topic 1: king, time, thought, queen, went, like, know, little, alice, said
Topic 2: instantly, told, soup, butter, curiosity, likely, slowly, came, speaking, set
Topic 3: hands, writing, explain, rest, loud, matter, elbow, long, consider, won
Topic 4: grave, breath, funny, drink, history, officers, meaning, confusion, change, mock
Topic 5: sang, gryphon, afraid, hadn, humbly, accident, notion, wondering, possibly, distance
Topic 6: rabbit, times, crumbs, corner, said, queer, subject, march, arches, sounds
Topic 7: goose, remarking, picked, time, fellow, come, sage, little, sun, alice
Topic 8: peeped, life, beds, quiet, weak, piece, filled, low, middle, think
Topic 9: dogs, yer, executioner, hadn, word, players, reach, hands, beg, long
Topic 10: hadn, arches, thing, shaped, slipped, saw, backs, puzzled, fury, sad
Topic 11: swim, mabel, natural, minute, won, pack, quite, ann, yesterday, keeping
Topic 12: called, temper, love, hope, twinkling, says, courage, trouble, clock, poison
Topic 1