Exercise 1

In [5]:
!pip install markovify pandas
import pandas as pd
import markovify



In [7]:
inp = pd.read_csv('abcnews-date-text.csv')

# Dropping missing values, then filtering out headlines of two words or fewer
headlines = inp['headline_text'].dropna()
headlines = [h for h in headlines if len(h.strip().split()) > 2]

# Combining headlines into one text string
text_data = '. '.join(headlines)

# Creating a Markov model with state_size=1 (smaller than markovify's default of 2)
text_model = markovify.Text(text_data, state_size=1)

# Generating 10 sentences (make_sentence() can return None if no valid sentence is found)
for i in range(10):
    print(f"{i+1}: {text_model.make_sentence()}")

1: 10yr high. colton sentenced for sydney harbour. leaked documents revealed. new zealand hikers. multiple armed sky complaints delay worries aired for tax down. scullion apologises for tortured nepal to flee violence and rain hits maximum possible election victory. black listing. heroin stash. man dies after ramming vessel.
2: 21 cases surge by expanding survey. ryan pulls out at ebola outbreak tests positive after mass deportation from mistakes. risk has not guilty. stolen from canberra anthony weiner resigns. most of reckoning for troubled fallon gives rugby future. driver drug network. confusion over radio. council membership. brits stockpiling illegal. group backs new homes. student connor bali child hostage. png landslide. share christmas spending fuels greenhouse gas. union not getting out for water supplies. duke tunes into alleged terrorism laws. freed after fleeing lebanon. australian workplace breaches sydney preparing to pursue premiership favourite prime minister not guilt
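
Each numbered output above runs many headlines together. One likely cause, offered as an assumption rather than something checked against markovify's source: markovify.Text splits its input into sentences itself, and lowercase headlines joined with '. ' may not be recognised as separate sentences. markovify also provides a NewlineText class that treats each line as one sentence. The library-free sketch below shows the underlying idea with a hand-rolled first-order chain (one word of context, comparable to state_size=1) in which every headline is its own sequence. The function names and sample headlines are hypothetical, and this is not markovify's implementation.

```python
import random
from collections import defaultdict

BEGIN, END = "__BEGIN__", "__END__"

def build_chain(lines):
    """Map each word to the list of words observed directly after it."""
    chain = defaultdict(list)
    for line in lines:
        # Each headline is its own sequence, bracketed by BEGIN and END.
        words = [BEGIN] + line.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain, rng, max_words=20):
    """Walk the chain from BEGIN until END or the word limit is reached."""
    word, output = BEGIN, []
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)

# Hypothetical sample headlines, standing in for the CSV column.
sample_headlines = [
    "council backs new homes plan",
    "police hunt man after armed robbery",
    "man charged over armed robbery",
    "council considers new water plan",
]

chain = build_chain(sample_headlines)
rng = random.Random(0)  # fixed seed so repeated runs give the same output
for i in range(3):
    print(f"{i + 1}: {generate(chain, rng)}")
```

Because every headline ends with END, a generated line stops where a real headline could stop instead of continuing into the next one.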

Exercise 2

In [11]:
# Installing sumy and nltk (if not already installed)
!pip install sumy
!pip install nltk


Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl.metadata (7.5 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)
Building wheels for collected packages: breadability, docopt
  Building wheel for breadability (setup.py) ... done
  Created wheel for breadability: filename=brea


In [15]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
import nltk

In [18]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [19]:
# Loading the text file
with open("alice.txt", "r", encoding='utf-8') as f:
    text = f.read()

# Setting up parser and summarizer
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()

# Printing the 5 sentences the summarizer selects
for sentence in summarizer(parser.document, 5):
    print(sentence)


The Mouse did not answer, so Alice went on eagerly:  `There is such a nice little dog near our house I should like to show you!
The poor little thing sobbed again (or grunted, it was impossible to say which), and they went on for some while in silence.
He sent them word I had not gone (We know it to be true): If she should push the matter on, What would become of you?
Don't let him know she liked them best, For this must ever be A secret, kept from all the rest, Between yourself and me.'
`If there's no meaning in it,' said the King, `that saves a world of trouble, you know, as we needn't try to find any.
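
LsaSummarizer is used above as a black box. As a rough illustration of the idea behind LSA-based extraction, the sketch below builds a word-by-sentence count matrix, takes its SVD, and scores each sentence by the length of its vector in the strongest latent directions. This is a simplified sketch under stated assumptions, not sumy's actual implementation, which differs in its details. The names lsa_rank and summarize, the n_topics parameter, and the sample sentences are all hypothetical.

```python
import re
import numpy as np

def lsa_rank(sentences, n_topics=2):
    """Score sentences by their weight in the top singular directions."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-by-sentence count matrix: rows are words, columns are sentences.
    A = np.zeros((len(vocab), len(sentences)))
    for j, words in enumerate(tokenized):
        for w in words:
            A[index[w], j] += 1
    # SVD: each column of Vt describes one sentence in latent "topic" space.
    _, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(n_topics, len(sigma))
    # Score = norm of the sentence's topic vector, weighted by singular values.
    return np.sqrt(((sigma[:k, None] * Vt[:k]) ** 2).sum(axis=0))

def summarize(sentences, count=2, n_topics=2):
    """Pick the highest-scoring sentences and return them in original order."""
    scores = lsa_rank(sentences, n_topics)
    top = sorted(np.argsort(scores)[-count:])
    return [sentences[i] for i in top]

# Hypothetical sample text, already split into sentences.
sample_sentences = [
    "Alice followed the rabbit down the hole.",
    "The rabbit checked its watch and ran.",
    "The Queen shouted at the gardeners.",
    "Alice spoke to the Queen about the rabbit.",
]

for sentence in summarize(sample_sentences, count=2):
    print(sentence)
```

Returning the chosen sentences in their original order keeps the summary readable as running text.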


Exercise 3


In [24]:
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from nltk.corpus import stopwords

In [21]:
# Download NLTK stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Loading data and selecting headlines
inp = pd.read_csv('abcnews-date-text.csv')
headlines = inp['headline_text'].dropna().astype(str).tolist()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Cleaning the text: keep letters only, lowercase, drop stopwords and words of 1-2 characters
def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-letters
    text = text.lower()
    words = [word for word in text.split() if word not in stop_words and len(word) > 2]
    return ' '.join(words)

cleaned_headlines = [clean_text(headline) for headline in headlines]

In [None]:
# Vectorizing using TF-IDF
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(cleaned_headlines)
idx_to_word = np.array(vectorizer.get_feature_names_out())

In [23]:
# Fitting NMF model with 20 topics
nmf = NMF(n_components=20, random_state=42)
W = nmf.fit_transform(X)
H = nmf.components_

# Printing the top 10 words for each topic
for i, topic in enumerate(H):
    top_words = idx_to_word[topic.argsort()[-10:]]  # argsort is ascending, so the strongest word is printed last
    print("Topic {}: {}".format(i + 1, ", ".join(top_words)))

Topic 1: afl, andrew, smith, james, david, john, nrl, michael, extended, interview
Topic 2: arrest, say, seek, officer, hunt, missing, search, investigate, probe, police
Topic 3: chief, home, centre, deal, hospital, years, year, laws, zealand, new
Topic 4: found, attack, stabbing, guilty, arrested, missing, jailed, murder, charged, man
Topic 5: group, could, union, opposition, trump, government, report, labor, minister, says
Topic 6: rates, backs, mayor, rise, budget, seeks, land, considers, plans, council
Topic 7: england, cricket, china, first, india, one, test, south, day, australia
Topic 8: season, residents, blaze, ban, threat, school, home, crews, house, fire
Topic 9: front, charges, high, faces, case, told, murder, face, accused, court
Topic 10: report, act, claims, accused, health, vic, funding, qld, urged, govt
Topic 11: first, share, dollar, china, south, wins, year, market, open, australian
Topic 12: gold, qld, north, election, government, rural, coast, hour, country, nsw
To
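
The loop above prints each topic's words from weakest to strongest, because argsort sorts in ascending order. The sketch below shows one way to list the strongest word first, and how the W matrix could be used to find the dominant topic of each document. The small H, W and idx_to_word arrays are hypothetical stand-ins for the notebook's fitted objects, and top_words is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical stand-ins: 2 topics over 5 words, and 3 documents.
# In the notebook these come from NMF and TfidfVectorizer.
idx_to_word = np.array(["court", "fire", "police", "crews", "murder"])
H = np.array([
    [0.9, 0.0, 0.6, 0.0, 0.4],  # topic 1: crime-like words
    [0.0, 0.8, 0.1, 0.5, 0.0],  # topic 2: fire-like words
])
W = np.array([
    [0.7, 0.1],
    [0.0, 0.9],
    [0.4, 0.3],
])

def top_words(topic_row, words, n=3):
    """Return the n highest-weighted words, strongest first."""
    order = np.argsort(topic_row)[::-1][:n]  # reverse the ascending sort
    return [str(w) for w in words[order]]

for i, topic in enumerate(H):
    print("Topic {}: {}".format(i + 1, ", ".join(top_words(topic, idx_to_word))))

# Dominant topic per document: the column of W with the largest weight
# (numbered from 1 to match the printed topic labels).
dominant = W.argmax(axis=1) + 1
print(dominant.tolist())
```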