**Exercise 1. Text Generation**


• Install markovify


• Import pandas and markovify

• Load the file ‘abcnews-date-text.csv’ as ‘inp’

• Look at the three top rows

• Create a model with markovify as ‘text_model’ to generate text

• Print ten randomly generated sentences using the built model.

**Exercise 2. Text Summarization**

• Use sumy to summarize the ‘alice.txt’ file

• Download the NLTK ‘punkt’ tokenizer data, which includes
'tokenizers/punkt/PY3/english.pickle'.

**Exercise 3. Topic Modeling**

• Determine the top 20 topics with Non-Negative Matrix
Factorization (NMF), via ‘from sklearn.decomposition import NMF’

• Vectorize the words after cleaning up the text

• Use ‘print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))’ to list the topics

In [None]:
!pip install markovify


Collecting markovify
  Downloading markovify-0.9.4.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting unidecode (from markovify)
  Downloading Unidecode-1.3.7-py3-none-any.whl (235 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 235.5/235.5 kB 7.6 MB/s eta 0:00:00
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Created wheel for markovify: filename=markovify-0.9.4-py3-none-any.whl size=18607 sha256=75d88ca8b88110697bb6214ef453771d41a1fb448e1cfdb205256933d7411b5b
  Stored in directory: /root/.cache/pip/wheels/ca/8c/c5/41413e24c484f883a100c63ca7b3b0362b7c6f6eb6d7c9cc7f
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.9.4 unidecode-1.3.7


In [None]:
import pandas as pd
import markovify


In [None]:
inp = pd.read_csv('abcnews-date-text.csv')



In [None]:
print(inp.head(3))


   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit


In [None]:
inp['headline_text']

0          aba decides against community broadcasting lic...
1             act fire witnesses must be aware of defamation
2             a g calls for infrastructure protection summit
3                   air nz staff in aust strike for pay rise
4              air nz strike to affect australian travellers
                                 ...                        
1186013    vision of flames approaching corryong in victoria
1186014    wa police and government backflip on drug amne...
1186015    we have fears for their safety: victorian premier
1186016                                when do the 20s start
1186017    yarraville shooting woman dead man critically ...
Name: headline_text, Length: 1186018, dtype: object

In [None]:
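# Join the headlines with '. ' and add a final full stop to delimit the last one
text = '. '.join(inp['headline_text'])
text += '.'

# Note: markovify's default sentence splitter generally does not break before a
# lowercase word, so a generated "sentence" may chain several headlines.
# markovify.NewlineText on '\n'.join(...) would treat each headline as a sentence.
text_model = markovify.Text(text)

# Generating ten sentences randomly from the text.
# make_sentence() returns None when no acceptable sentence is found, so fewer
# than ten lines may be printed.
for _ in range(10):
    sent = text_model.make_sentence()
    if sent:
        print(sent)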





10 national parks laws pass lower house. australians could die in seattle cbd kills 4 iraqi soldiers killed in iraq accident. flying foxes to brisbane semis. vic company wins new yorker journalist dies from cancer. eleven whales that survived a 68 year fast baffles doctors. mauresmo powers france to compensate investors over a. macquarie bank buys up horse flu vaccine. stoner completes meteoric rise. stoner crowned new motogp champion. stoner takes pole for motogp dream. strong winds predicted across weekend. number of structures burning in north korea with freed journalists. clinton mission may reset north korea ferry visit. jess pays tribute to comedian andy zaltzman. the changes to the test. youth coalition calls for presidents cup golf could bring 1 billion says act should have been self inflicted. female scorer to make up melting in summer. new teaching kit launched in central west uses for lemons during winter olympics kicks off recruitment drive. health system failing pregnant w
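
To make the idea behind the generated text more concrete, here is a minimal sketch of a word-level Markov chain using only the standard library. It is not markovify's implementation: markovify conditions on the previous two words by default and rejects output that overlaps too much with the source, while this sketch uses a single previous word. The toy `headlines` list and the helper names `build_chain` and `generate` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    starts = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        starts.append(words[0])
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
        chain[words[-1]].append(None)  # None marks the end of a sentence
    return chain, starts


def generate(chain, starts, rng, max_words=20):
    """Walk the chain from a random start word until an end marker or max_words."""
    word = rng.choice(starts)
    output = [word]
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word is None:
            break
        output.append(word)
    return " ".join(output)


# Hypothetical toy corpus; in the exercise this would be inp['headline_text'].
headlines = [
    "police probe fire in city",
    "fire crews battle blaze in city",
    "police call for calm",
]
chain, starts = build_chain(headlines)
rng = random.Random(0)  # fixed seed so repeated runs give the same text
for _ in range(3):
    print(generate(chain, starts, rng))
```

Every adjacent word pair in the output was seen somewhere in the corpus, which is why the generated headlines look locally plausible even when the whole line makes no sense.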

**EXERCISE 2**

In [None]:
!pip install sumy




In [None]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
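# Not needed: 'tokenizers/punkt/PY3/english.pickle' is a file inside the
# 'punkt' package downloaded above, not a separate download identifier.
#nltk.download('tokenizers/punkt/PY3/english.pickle')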


In [None]:
import nltk
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer


In [None]:
# Load the 'alice.txt' file ('utf-8-sig' drops a leading byte-order mark if present)
with open("/content/alice.txt", 'r', encoding='utf-8-sig') as text_file:
    text2 = text_file.read()

parser = PlaintextParser.from_string(text2, Tokenizer('english'))
summarizer = LexRankSummarizer()

# Summarizing the text.
# Note: sentences_count is the number of sentences to keep. Asking for every
# sentence returns the whole document in its original order, so the loop below
# prints the first five sentences of the text. Pass a small value such as
# sentences_count=5 to get the five top-ranked sentences instead.
summary = summarizer(parser.document, sentences_count=len(parser.document.sentences))

# Prints first 5 sentences from the summary
for i, sentence in enumerate(summary):
    if i < 5:
        print(sentence)


Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do:  once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, `Oh dear!
Oh dear!
I shall be late!'
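
The output above is the opening of the book because the summarizer was asked to keep every sentence. To illustrate how the number of kept sentences shapes an extractive summary, here is a minimal standard-library sketch. It swaps in a much simpler technique than LexRank (average word frequency per sentence) and a naive regex sentence splitter instead of NLTK's punkt; the function name `summarize` and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def summarize(text, sentences_count=2):
    """Keep the highest-scoring sentences, returned in document order."""
    # Naive split on ., ! or ? followed by whitespace (sumy uses punkt instead).
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    freq = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        # Average corpus frequency of the words in this sentence.
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, keep the best ones, restore original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentences_count])
    return [sentences[i] for i in keep]


sample = (
    "The cat sat on the mat. The cat chased the cat next door. "
    "Rain fell all day. Nobody saw the dog."
)
for sentence in summarize(sample, sentences_count=2):
    print(sentence)
```

With `sentences_count` at least as large as the number of sentences, the function returns the whole text unchanged, which mirrors what happened in the cell above.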


**EXERCISE 3**

In [None]:
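# Text cleaning (I used the alice.txt file)
import re

# Keep only letters and whitespace. This also strips apostrophes, so
# contractions such as "I'd" and "you'd" become "id" and "youd".
text2 = re.sub(r'[^a-zA-Z\s]', '', text2)
text2 = text2.lower()
# Collapse runs of whitespace into single spaces
words = text2.split()
cleaned_text = ' '.join(words)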

In [None]:
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


In [None]:
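# Creating a TF-IDF vectorizer
# (an integer max_df is an absolute document count, so it has no effect on one document)
vectorizerTF = TfidfVectorizer(max_df=99, min_df=1, stop_words='english')

# Vectorize the cleaned text.
# Note: the whole book is passed as a single document, so the matrix has one row.
# With one row there are no cross-document patterns for NMF to separate, so most
# of the resulting topics are unlikely to be meaningful.
matrix = vectorizerTF.fit_transform([cleaned_text])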
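
Because the cell above passes one string, the TF-IDF matrix has a single row. One possible way to give NMF several documents to compare is to cut the cleaned text into fixed-size chunks of words first. The sketch below shows only that chunking step; the helper name `split_into_chunks`, the chunk size, and the sample string are assumptions made for illustration.

```python
def split_into_chunks(text, words_per_chunk=200):
    """Split whitespace-separated text into consecutive chunks of words."""
    words = text.split()
    return [
        " ".join(words[start:start + words_per_chunk])
        for start in range(0, len(words), words_per_chunk)
    ]


# Hypothetical stand-in for cleaned_text from the cleaning cell.
cleaned_sample = "alice was beginning to get very tired of sitting by her sister on the bank"
documents = split_into_chunks(cleaned_sample, words_per_chunk=5)
print(len(documents), "documents")
print(documents[0])
```

The resulting list could be passed as `vectorizerTF.fit_transform(documents)` in place of `[cleaned_text]`, giving one row per chunk. The `max_df` and `min_df` settings then start to matter, since they filter terms by how many chunks contain them.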



In [None]:
# Fit NMF with 20 components (topics); random_state makes the run repeatable
nmf = NMF(n_components=20, random_state=1)
nmf.fit(matrix)


In [None]:
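features = vectorizerTF.get_feature_names_out()
for i, topic in enumerate(nmf.components_):
    # argsort() is ascending, so each topic's ten highest-weighted words
    # are listed from weakest to strongest
    print("Topic {}: {}".format(i + 1, ", ".join([features[idx] for idx in topic.argsort()[-10:]])))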



Topic 1: did, queen, time, thought, went, like, know, little, alice, said
Topic 2: asleep, treacle, mouth, open, lizard, hot, end, rest, sighed, exactly
Topic 3: lobsters, whiskers, sneezes, pale, week, queer, dormouse, hands, happened, wise
Topic 4: glad, hanging, lessons, leaning, times, thought, violently, begin, looked, stupid
Topic 5: thing, wonder, rosetree, disappeared, unless, deeply, turtle, id, rest, experiment
Topic 6: kindly, extremely, usual, sands, righthand, xi, scream, pretty, crumbs, youd
Topic 7: dispute, regular, ive, grin, wandered, wasting, wonderful, teacups, collected, called
Topic 8: standing, pressed, happen, send, drew, tidy, jury, uncorked, doors, recognised
Topic 9: signed, impatiently, havent, sensation, soup, snatch, curious, lory, wash, glad
Topic 10: thirteen, roared, ridges, years, worry, laughed, leave, youare, alices, flamingo
Topic 11: queer, march, looking, im, things, turtle, way, like, come, said
Topic 12: raising, fix, laugh, act, anxiously, shy,