## Exercise 1. Text Generation

• Install markovify

• Import pandas and markovify

• Load the file ‘abcnews-date-text.csv’ as ‘inp’

• Look at the top three rows

• Create a model with markovify as ‘text_model’ to generate text

• Print ten randomly generated sentences using the built model.

## Exercise 2. Text Summarization

• Use sumy to summarize the ‘alice.txt’ file

• Download the ‘punkt’ and 'tokenizers/punkt/PY3/english.pickle' NLTK resources.

## Exercise 3. Topic Modeling

• Determine the top 20 topics with Non-Negative Matrix Factorization (NMF),
imported via ‘from sklearn.decomposition import NMF’

• Vectorize the words after cleaning up the text

• Use ‘print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))’ to list the topics

In [None]:
!pip install markovify

Collecting markovify
  Downloading markovify-0.9.4.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting unidecode (from markovify)
  Downloading Unidecode-1.3.7-py3-none-any.whl (235 kB)
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Created wheel for markovify: filename=markovify-0.9.4-py3-none-any.whl size=18606 sha256=9bbd918eb3aba54c5357476e292389de3142b2243478b6d2ceeec63a1383aca8
  Stored in directory: /root/.cache/pip/wheels/ca/8c/c5/41413e24c484f883a100c63ca7b3b0362b7c6f6eb6d7c9cc7f
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.9.4 unidecode-1.3.7


In [None]:
# Importing the necessary libraries
import pandas as pd
import markovify

In [None]:
# Loading the Data
inp = pd.read_csv('abcnews-date-text.csv')
print(inp.head(3))

   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit


In [None]:
# Printing the first five rows and the dtype of the headline column
print(inp.head())
print(inp['headline_text'].dtype)

   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers
object


In [None]:
# Preprocess the text data by converting it to a list of sentences
text_data = inp['headline_text'].astype(str).tolist()
text_data

['aba decides against community broadcasting licence',
 'act fire witnesses must be aware of defamation',
 'a g calls for infrastructure protection summit',
 'air nz staff in aust strike for pay rise',
 'air nz strike to affect australian travellers',
 'ambitious olsson wins triple jump',
 'antic delighted with record breaking barca',
 'aussie qualifier stosur wastes four memphis match',
 'aust addresses un security council over iraq',
 'australia is locked into war timetable opp',
 'australia to contribute 10 million in aid to iraq',
 'barca take record as robson celebrates birthday in',
 'bathhouse plans move ahead',
 'big hopes for launceston cycling championship',
 'big plan to boost paroo water supplies',
 'blizzard buries united states in bills',
 'brigadier dismisses reports troops harassed in',
 'british combat troops arriving daily in kuwait',
 'bryant leads lakers to double overtime win',
 'bushfire victims urged to see centrelink',
 'businesses should prepare for terrorist a

In [None]:
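# Create a text model using markovify with state size 1.
# NewlineText expects one string with one sentence per line, so the whole
# list of headlines is joined with newlines (indexing [1] would pass a single headline).
text_model = markovify.NewlineText("\n".join(text_data), state_size=1)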


In [None]:
text_model

<markovify.text.NewlineText at 0x7bd6aeed7880>
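
To make the idea behind a state size of 1 concrete, here is a minimal pure-Python sketch of a first-order Markov chain that prints ten generated sentences. It is an illustration under stated assumptions, not markovify's implementation: the three headlines, the `BEGIN`/`END` marker strings, and the helper names `build_chain` and `generate` are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus standing in for the headline column.
headlines = [
    "air nz staff in aust strike for pay rise",
    "air nz strike to affect australian travellers",
    "big plan to boost paroo water supplies",
]

# Hypothetical sentinel tokens marking sentence boundaries.
BEGIN, END = "<BEGIN>", "<END>"


def build_chain(lines):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for line in lines:
        words = [BEGIN] + line.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_words=20):
    """Walk the chain from BEGIN until END (or max_words) is reached."""
    word, out = BEGIN, []
    while len(out) < max_words:
        # With state size 1 the next word depends only on the current word.
        word = rng.choice(chain[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)


chain = build_chain(headlines)
rng = random.Random(0)  # fixed seed so repeated runs give the same sentences
sentences = [generate(chain, rng) for _ in range(10)]
for sentence in sentences:
    print(sentence)
```

With the `text_model` object in this notebook, the analogous call is `text_model.make_sentence()`, which can return `None` when no acceptable sentence is found, so a loop over ten calls should be prepared to handle that case.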

## Exercise 2

• Use sumy to summarize the ‘alice.txt’ file

• Download the ‘punkt’ and 'tokenizers/punkt/PY3/english.pickle' NLTK resources.

In [None]:
# Installing sumy
!pip install sumy

Collecting sumy
  Downloading sumy-0.11.0-py2.py3-none-any.whl (97 kB)
Collecting docopt<0.7,>=0.6.1 (from sumy)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting breadability>=0.1.20 (from sumy)
  Downloading breadability-0.1.20.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting pycountry>=18.2.23 (from sumy)
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
  Installing build dependencies ... don

In [None]:
!conda install -c conda-forge sumy

/bin/bash: line 1: conda: command not found


In [None]:
# Importing the required libraries
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer


In [None]:
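# Importing nltk and downloading the required resources.
import nltk
nltk.download('punkt')
# 'tokenizers/punkt/PY3/english.pickle' is a file path inside the punkt package,
# not a package ID, so this call fails and returns False (see the output below).
nltk.download('tokenizers/punkt/PY3/english.pickle')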

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Error loading tokenizers/punkt/PY3/english.pickle: Package
[nltk_data]     'tokenizers/punkt/PY3/english.pickle' not found in
[nltk_data]     index


False

In [None]:
# Loading the Data
filename = 'alice.txt'

In [None]:

parser = PlaintextParser.from_file(filename, Tokenizer("english"))
summarizer = LexRankSummarizer()

In [None]:

summary = summarizer(parser.document, sentences_count=5)

In [None]:
# Printing the summary sentences
for sentence in summary:
    print(sentence)

said Alice.
said Alice.
`I'm a--I'm a--'
said Alice.
said Alice.
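
The summary above is dominated by short, repeated sentences. One plausible reason is that graph-based rankers reward sentences that resemble many others. The sketch below shows that idea in pure Python: sentences are nodes, cosine similarity of their word counts gives edge weights, and a PageRank-style power iteration scores them. It is a simplified, LexRank-style illustration only, not sumy's implementation; the four sentences and the helper names `cosine`, `rank_sentences`, and `summarize` are hypothetical.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over a similarity graph."""
    bags = [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to how similar it is to sentence i.
            incoming = sum(sim[j][i] / row_sums[j] * scores[j]
                           for j in range(n) if row_sums[j])
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarize(sentences, count):
    """Return the `count` top-scoring sentences in document order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:count]
    return [sentences[i] for i in sorted(top)]


# Hypothetical toy document, already split into sentences.
sentences = [
    "Alice followed the rabbit down the hole.",
    "The rabbit looked at its watch.",
    "Alice fell down the hole for a long time.",
    "The garden door was very small.",
]
summary = summarize(sentences, 2)
for sentence in summary:
    print(sentence)
```

In a ranking like this, many near-identical short sentences reinforce each other, so removing exact duplicates or very short sentences before summarizing is one option worth trying.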


## Exercise 3

• Determine the top 20 topics with Non-Negative Matrix Factorization (NMF), imported via ‘from sklearn.decomposition import NMF’

• Vectorize the words after cleaning up the text

• Use ‘print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))’ to list the topics

In [None]:
# Importing the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# Importing NMF (TfidfVectorizer is already imported above)
from sklearn.decomposition import NMF

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(inp['headline_text'])
tfidf

<1186018x61923 sparse matrix of type '<class 'numpy.float64'>'
	with 6067135 stored elements in Compressed Sparse Row format>
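
The `min_df=2` and `max_df=0.95` settings prune the vocabulary by document frequency. The sketch below reproduces that filtering step in pure Python on a hypothetical four-headline corpus, then computes the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which scikit-learn uses by default. Stop-word removal is left out, and the helper names are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the headline column.
docs = [
    "police investigate fire",
    "police charge man",
    "fire crews battle blaze",
    "man charged over fire",
]


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set(): count each term once per document
    return df


def filter_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms found in at least `min_df` documents and in at most
    a `max_df` fraction of all documents."""
    n = len(docs)
    df = document_frequency(docs)
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df)


vocab = filter_vocabulary(docs)
print(vocab)  # ['fire', 'man', 'police']

# Smoothed IDF: rarer terms get a larger weight.
n = len(docs)
df = document_frequency(docs)
idf = {term: math.log((1 + n) / (1 + df[term])) + 1 for term in vocab}
print(idf)
```

Terms that occur in only one headline are dropped by `min_df=2`, which is why the matrix above has far fewer columns than the corpus has distinct words.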

In [None]:
num_topics = 20
nmf = NMF(n_components=num_topics, random_state=1)
# alpha is omitted: recent scikit-learn releases no longer accept it
# (alpha_W and alpha_H replace it)
nmf.fit(tfidf)

In [None]:
# get_feature_names_out() returns the terms ordered by column index
feature_names = tfidf_vectorizer.get_feature_names_out()

In [None]:
# Topics list
# get_feature_names_out() returns the terms ordered by column index;
# vocabulary_ maps term -> index and cannot be looked up by column number.
feature_names = tfidf_vectorizer.get_feature_names_out()

for idx, topic in enumerate(nmf.components_):
    indices = topic.argsort()[:-11:-1]  # indices of the 10 largest weights, descending
    names = [feature_names[i] for i in indices]
    print("Topic {}: {}".format(idx + 1, ", ".join(names)))
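
`TfidfVectorizer.vocabulary_` maps each term to its column index, whereas listing a topic needs the reverse lookup, from column index to term. Testing whether an integer index is a key of that dictionary is always false, so a name list built that way stays empty. The sketch below shows the inversion on a hypothetical toy vocabulary; the weights stand in for one row of `nmf.components_`, and `top_terms` is an illustrative helper, not a library function.

```python
# Hypothetical toy vocabulary in the same shape as `vocabulary_`: term -> column index.
vocabulary = {"police": 3, "fire": 1, "court": 0, "man": 2}

# An integer column index is never a key of a term -> index dictionary.
print(1 in vocabulary)  # False

# Invert the mapping into a list where position i holds the term for column i.
idx_to_word = [term for term, _ in sorted(vocabulary.items(), key=lambda kv: kv[1])]

# Hypothetical topic weights, one per column.
topic = [0.1, 0.9, 0.0, 0.5]


def top_terms(weights, idx_to_word, k):
    """Return the k terms with the largest weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [idx_to_word[i] for i in order[:k]]


print(top_terms(topic, idx_to_word, 2))  # ['fire', 'police']
```

Wrapping such a list in a NumPy array allows the fancy indexing used in the exercise's `idx_to_word[topic.argsort()[-10:]]` expression.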
