# 01-News-Modelling

![](https://images.unsplash.com/photo-1495020689067-958852a7765e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Roman Kraft](https://unsplash.com/photos/_Zua2hyvTBk)

This exercise is about modelling the main topics of a dataset of news headlines.

Begin by importing the needed libraries:

In [18]:
# TODO: import needed libraries
import pandas as pd
import numpy as np

Load the data from the file `random_headlines.csv`:

In [19]:
# TODO: load the dataset
data = pd.read_csv('random_headlines.csv')

It is always a good idea to perform some exploratory data analysis (EDA) on a dataset:

In [20]:
# TODO: Perform a short EDA
data

|  | publish_date | headline_text |
|---|---|---|
| 0 | 20120305 | ute driver hurt in intersection crash |
| 1 | 20081128 | 6yo dies in cycling accident |
| 2 | 20090325 | bumper olive harvest expected |
| 3 | 20100201 | replica replaces northernmost sign |
| 4 | 20080225 | woods targets perfect season |
| ... | ... | ... |
| 19995 | 20030301 | judge attacks walkinshaw over running of arrows |
| 19996 | 20070908 | polish govt collapses elections to be held next |
| 19997 | 20150529 | the drum friday may 29 |
| 19998 | 20071006 | winterbottom on bathurst provisional pole |


In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   publish_date   20000 non-null  int64 
 1   headline_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


In [22]:
data.describe(include='all')

|  | publish_date | headline_text |
|---|---|---|
| count | 20000.0 | 20000 |
| unique |  | 19802 |
| top |  | weather in 90 seconds |
| freq |  | 19 |
| mean | 20095580.0 |  |
| std | 38754.03 |  |
| min | 20030220.0 |  |
| 25% | 20060820.0 |  |
| 50% | 20100220.0 |  |
| 75% | 20130420.0 |  |
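
Note that `publish_date` is stored as an integer in `YYYYMMDD` form, so the mean and standard deviation reported by `describe()` are not meaningful as dates. A minimal sketch of a slightly richer EDA follows, written with the standard library only and run on the five rows displayed above rather than on the full file; the variable names are illustrative and not part of the original notebook.

```python
from collections import Counter
from datetime import datetime

# The first five rows shown in the DataFrame display above.
sample_rows = [
    (20120305, "ute driver hurt in intersection crash"),
    (20081128, "6yo dies in cycling accident"),
    (20090325, "bumper olive harvest expected"),
    (20100201, "replica replaces northernmost sign"),
    (20080225, "woods targets perfect season"),
]

# Parse the integer-encoded dates into real date objects.
parsed_dates = [datetime.strptime(str(d), "%Y%m%d").date() for d, _ in sample_rows]

# How many headlines per year?
headlines_per_year = Counter(d.year for d in parsed_dates)

# How long are the headlines, in whitespace-separated words?
words_per_headline = [len(text.split()) for _, text in sample_rows]

# Which headlines occur more than once?
headline_counts = Counter(text for _, text in sample_rows)
duplicate_headlines = [text for text, n in headline_counts.items() if n > 1]

print(dict(headlines_per_year))
print(words_per_headline)
print(duplicate_headlines)
```

On the full DataFrame, the same date parsing can be done with `pd.to_datetime(data.publish_date.astype(str), format='%Y%m%d')`.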


Now perform all the needed preprocessing on those headlines: case lowering, tokenization, punctuation removal, stopwords removal, stemming/lemmatization.

In [23]:
from nltk.corpus import wordnet

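def get_wordnet_pos(pos_tag):
    """Convert the Penn Treebank tags returned by nltk.pos_tag to WordNet POS codes.

    Returns a 2-D array whose rows are (token, wordnet_pos).
    """
    output = np.asarray(pos_tag)
    for i in range(len(pos_tag)):
        if pos_tag[i][1].startswith('J'):    # adjectives
            output[i][1] = wordnet.ADJ
        elif pos_tag[i][1].startswith('V'):  # verbs
            output[i][1] = wordnet.VERB
        elif pos_tag[i][1].startswith('R'):  # adverbs
            output[i][1] = wordnet.ADV
        else:                                # everything else is treated as a noun
            output[i][1] = wordnet.NOUN
    return output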
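
The mapping above can also be expressed per tag, without NumPy. The sketch below is an alternative formulation, not the notebook's own helper: `penn_to_wordnet` is a hypothetical name, and the one-letter codes are written out literally (they are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.ADV` and `wordnet.NOUN`) so that the snippet runs without NLTK installed.

```python
# One-letter WordNet POS codes, keyed by the first letter of the Penn Treebank tag.
# 'a' = adjective, 'v' = verb, 'r' = adverb; 'n' (noun) is the fallback.
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag (e.g. 'VBD') to a WordNet POS code."""
    return PREFIX_TO_WORDNET.get(tag[:1], "n")

# Tagged tokens in the (token, tag) shape that nltk.pos_tag returns.
tagged = [("bumper", "NN"), ("harvest", "NN"), ("expected", "VBN"), ("quickly", "RB")]
converted = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(converted)
```

Because the helper works on a single tag, it can be used inside a list comprehension over the output of `nltk.pos_tag`.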

In [28]:
# TODO: Preprocess the input data
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

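# Requires the NLTK tokenizer, stopword, WordNet and POS-tagger resources
# (fetch them once with nltk.download).
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
lemmatizer = WordNetLemmatizer()              # created once, reused for every headline

def preproc(text):
    """Lowercase, tokenize, drop non-alphabetic tokens and stopwords, then lemmatize."""
    text = text.lower()
    tokens = word_tokenize(text)
    # Keep purely alphabetic tokens: removes punctuation and tokens containing digits
    tokens = [t for t in tokens if t.isalpha()]
    tokens = [t for t in tokens if t not in stop_words]
    # Lemmatize each token using its part of speech
    pos_tags = nltk.pos_tag(tokens)
    wordnet_tags = get_wordnet_pos(pos_tags)
    tokens = [lemmatizer.lemmatize(tokens[i], wordnet_tags[i,1]) for i in range(len(tokens))]
    return tokens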

In [29]:
headlines = data.headline_text

In [30]:
headlines_preproc = headlines.apply(preproc)

In [31]:
headlines_preproc

0                 [ute, driver, hurt, intersection, crash]
1                                    [dy, cycle, accident]
2                         [bumper, olive, harvest, expect]
3                  [replica, replaces, northernmost, sign]
4                          [wood, target, perfect, season]
                               ...                        
19995              [judge, attack, walkinshaw, run, arrow]
19996       [polish, govt, collapse, election, hold, next]
19997                                  [drum, friday, may]
19998          [winterbottom, bathurst, provisional, pole]
19999    [pull, pork, pawpaw, salad, local, success, st...
Name: headline_text, Length: 20000, dtype: object

Now use Gensim to compute a bag-of-words (BOW) representation of the headlines.

In [32]:
# TODO: Compute the BOW using Gensim
import gensim

In [33]:
from gensim.corpora import Dictionary

In [34]:
id2word = Dictionary(headlines_preproc)

In [43]:
for key, value in id2word.items():
    print(key, value)

0 crash
1 driver
2 hurt
3 intersection
4 ute
5 accident
6 cycle
7 dy
8 bumper
9 expect
10 harvest
11 olive
12 northernmost
13 replaces
14 replica
15 sign
16 perfect
17 season
18 target
19 wood
20 adelaide
21 dramatic
22 draw
23 leckie
24 salvage
25 future
26 gauge
27 group
28 rail
29 service
30 ahead
31 anti
32 go
33 hunt
34 rally
35 still
36 aid
37 congo
38 dr
39 first
40 receive
41 refugee
42 agreement
43 muslim
44 rebel
45 thailand
46 application
47 centre
48 dubbo
49 lodge
50 shopping
51 west
52 action
53 industrial
54 nsw
55 say
56 teacher
57 unavoidable
58 anzac
59 disrupt
60 protester
61 urge
62 ag
63 agree
64 export
65 scrap
66 subsidy
67 wto
68 approval
69 fire
70 housing
71 process
72 game
73 lakers
74 scorch
75 stoudemire
76 three
77 bird
78 flag
79 fly
80 frigate
81 nauru
82 continue
83 mansell
84 trial
85 assault
86 charge
87 dismiss
88 matai
89 bronson
90 buy
91 deal
92 jacob
93 orica
94 strike
95 dump
96 liberal
97 party
98 senate
99 ticket
100 watson
101 accuse
...

1418 despite
1419 injection
1420 japanese
1421 stock
1422 crew
1423 victoria
1424 extra
1425 maternity
1426 whitsunday
1427 fan
1428 halatau
1429 manly
1430 bag
1431 plastic
1432 measles
1433 miscarriage
1434 trigger
1435 dirk
1436 hartog
1437 tag
1438 turtle
1439 boston
1440 fbi
1441 suspect
1442 advice
1443 liveability
1444 lord
1445 perth
1446 arm
1447 bouncer
1448 conflict
1449 military
1450 russia
1451 unlikely
1452 agribusiness
1453 investment
1454 scrutiny
1455 bombing
1456 condemn
1457 moscow
1458 suicide
1459 manus
1460 supreme
1461 finish
1462 genome
1463 human
1464 map
1465 klinger
1466 michael
1467 competitor
1468 horse
1469 scg
1470 broncos
1471 might
1472 much
1473 rabbitohs
1474 effort
1475 triple
1476 coorong
1477 fish
1478 slowly
1479 would
1480 bridge
1481 complex
1482 yambuna
1483 limo
1484 bushfires
1485 suspicious
1486 couple
1487 dinner
1488 gatecrashes
1489 ammonium
1490 inspector
1491 nitrate
1492 chair
1493 harshly
1494 sack
1495 student
1496 tap
1497 unruly
...

2771 plant
2772 catchment
2773 storage
2774 blow
2775 emphatic
2776 sprinkler
2777 patrick
2778 stifle
2779 metre
2780 shore
2781 sink
2782 firebrand
2783 iman
2784 mosque
2785 announces
2786 cuba
2787 obama
2788 promotion
2789 receipt
2790 elder
2791 livermore
2792 ross
2793 defence
2794 manufacturer
2795 mla
2796 qch
2797 gallery
2798 portrait
2799 sculpture
2800 cahill
2801 everton
2802 training
2803 recov
2804 ekka
2805 among
2806 proceeding
2807 titan
2808 enforce
2809 typhoon
2810 criticise
2811 drc
2812 oxfam
2813 forecaster
2814 aec
2815 remote
2816 rethink
2817 divide
2818 ntch
2819 sewerage
2820 campbell
2821 coolum
2822 toolleen
2823 cobar
2824 ourt
2825 hargrave
2826 jump
2827 lawrence
2828 detain
2829 knowledge
2830 gerard
2831 oak
2832 whateleys
2833 pork
2834 trickle
2835 tuesday
2836 april
2837 wednesday
2838 dance
2839 ensemble
2840 youth
2841 ombudsman
2842 workplace
2843 ballarat
2844 assassinate
2845 yemen
2846 banishes
2847 brown
2848 third
2849 corner
2850 preview

4295 error
4296 disquiet
4297 quells
4298 quinn
4299 copy
4300 crocodile
4301 douglas
4302 leg
4303 single
4304 wimbledon
4305 brekkie
4306 relationship
4307 withdraw
4308 cover
4309 gown
4310 wear
4311 idiot
4312 interchange
4313 venture
4314 drizzly
4315 visitor
4316 harrigan
4317 pickett
4318 amends
4319 switzerland
4320 factor
4321 religion
4322 kurd
4323 wolfowitz
4324 entertain
4325 buffalo
4326 java
4327 optus
4328 dale
4329 detective
4330 cancel
4331 florentine
4332 ioc
4333 bleijie
4334 taxpayer
4335 do
4336 advise
4337 amalgamate
4338 representation
4339 surprising
4340 phallus
4341 huxley
4342 oceania
4343 boom
4344 kindy
4345 hussey
4346 intense
4347 reel
4348 belmont
4349 pga
4350 canola
4351 gm
4352 kilmore
4353 inflation
4354 innocent
4355 bogongs
4356 ac
4357 berlusconi
4358 milan
4359 presidency
4360 reclaim
4361 giteau
4362 grateful
4363 bulldog
4364 stealing
4365 anchor
4366 famous
4367 mao
4368 durante
4369 eager
4370 salmon
4371 bet
4372 tote
4373 majority
4374 del

5834 representative
5835 menace
5836 redback
5837 tokyo
5838 towards
5839 qu
5840 escapee
5841 maldon
5842 scrum
5843 andersen
5844 fulfil
5845 aide
5846 sadr
5847 titans
5848 eruption
5849 inevitable
5850 armed
5851 allah
5852 god
5853 neck
5854 plumber
5855 actew
5856 nowruz
5857 ten
5858 crumble
5859 pettersson
5860 pentagon
5861 southon
5862 ki
5863 exicse
5864 indexation
5865 suing
5866 triggers
5867 demolition
5868 practical
5869 antony
5870 compass
5871 defaulter
5872 vicroads
5873 stan
5874 finance
5875 toro
5876 garden
5877 mcveigh
5878 steel
5879 drag
5880 attracts
5881 tablelands
5882 cfa
5883 elton
5884 marries
5885 medicine
5886 quart
5887 samantha
5888 stosur
5889 strasbourg
5890 newcomer
5891 blackall
5892 woodbridge
5893 belief
5894 indecently
5895 tait
5896 jeopardy
5897 parrot
5898 asx
5899 jitter
5900 plummeting
5901 anniversary
5902 dandenong
5903 preliminary
5904 contender
5905 trawl
5906 braille
5907 chapter
5908 zvonareva
5909 retiree
5910 summon
5911 guideline
...

7349 roughen
7350 paroo
7351 savaii
7352 upolu
7353 duo
7354 pokie
7355 winegrowers
7356 afford
7357 kasey
7358 poddy
7359 horticultural
7360 can
7361 misconduct
7362 embarrass
7363 inter
7364 nefa
7365 thrilling
7366 ginepri
7367 gonzalez
7368 evil
7369 ideology
7370 fullerton
7371 morning
7372 yoyo
7373 trujillo
7374 scaremongering
7375 vukovics
7376 wipe
7377 backflip
7378 folau
7379 incense
7380 greenbrier
7381 overton
7382 holt
7383 biodiversity
7384 koschitzke
7385 skip
7386 astronomer
7387 horizon
7388 pluto
7389 corangamite
7390 rag
7391 parvovirus
7392 involvement
7393 apply
7394 shortcoming
7395 gilgandra
7396 concert
7397 springsteen
7398 gigantic
7399 brick
7400 require
7401 path
7402 henbury
7403 landing
7404 dwyers
7405 speculate
7406 bottle
7407 palliative
7408 scone
7409 customary
7410 crayfish
7411 jailing
7412 poacher
7413 alarm
7414 modify
7415 kolan
7416 lockdown
7417 deniliquin
7418 maritime
7419 known
7420 induct
7421 breakdown
7422 completion
7423 grampians
...

8851 finishing
8852 beetle
8853 dung
8854 bronze
8855 montenegro
8856 serbia
8857 via
8858 polluter
8859 armida
8860 historical
8861 halving
8862 expectation
8863 exasperate
8864 brent
8865 edinburgh
8866 hibs
8867 starter
8868 anc
8869 airs
8870 mislead
8871 jarman
8872 ahmad
8873 poach
8874 recruitment
8875 guerrouj
8876 pittman
8877 sizzle
8878 sickies
8879 danby
8880 hopman
8881 overload
8882 nagas
8883 consortium
8884 qr
8885 mayorship
8886 pisasale
8887 cossack
8888 pilbaras
8889 halliburton
8890 overcharge
8891 accused
8892 heidi
8893 glen
8894 woolley
8895 garb
8896 living
8897 coleslaw
8898 banker
8899 redhage
8900 sixer
8901 honoured
8902 truly
8903 pardon
8904 arthur
8905 policing
8906 skinny
8907 solstice
8908 timbertown
8909 picker
8910 aside
8911 botulism
8912 meteorite
8913 busselton
8914 iraqis
8915 uae
8916 fierce
8917 secretly
8918 mature
8919 glasser
8920 suspended
8921 invitational
8922 milky
8923 kanaar
8924 accountable
8925 whether
8926 regs
8927 statistic
...

10365 convert
10366 herston
10367 printing
10368 molybdenum
10369 copenhagen
10370 contempt
10371 nuttall
10372 cutback
10373 piracy
10374 chinalco
10375 tariff
10376 exhume
10377 driller
10378 wow
10379 timebomb
10380 guaranteed
10381 mri
10382 aisle
10383 lecture
10384 puking
10385 uncharted
10386 detonate
10387 drape
10388 decrease
10389 herring
10390 symptom
10391 h
10392 statue
10393 cloke
10394 magpies
10395 goodbye
10396 roundup
10397 sleepy
10398 tracked
10399 meander
10400 nlc
10401 provision
10402 xmas
10403 didaks
10404 firies
10405 musharrafs
10406 manipulation
10407 callum
10408 seized
10409 brooky
10410 francisco
10411 hose
10412 hype
10413 manjimup
10414 feud
10415 magnate
10416 ton
10417 youhana
10418 doncaster
10419 excellerator
10420 cert
10421 revenge
10422 truss
10423 cereal
10424 bradken
10425 foundry
10426 banger
10427 zims
10428 zinc
10429 brunswick
10430 reassess
10431 ghana
10432 shearer
10433 antibiotic
10434 moor
10435 shoal
10436 spigelman
10437 medicos
...

11928 learnt
11929 warves
11930 worksafe
11931 forthcoming
11932 examination
11933 reconnect
11934 displaced
11935 blockage
11936 dwindling
11937 temora
11938 noisy
11939 ryegrass
11940 lorne
11941 wedensday
11942 albion
11943 mysteriously
11944 irishman
11945 differentiates
11946 brumbys
11947 glazer
11948 seekamp
11949 hoddy
11950 lahoud
11951 vacuum
11952 caledonia
11953 present
11954 reservist
11955 lewas
11956 mccartneys
11957 creator
11958 playpen
11959 vandal
11960 rapid
11961 panetta
11962 pear
11963 queiroz
11964 applies
11965 mehrtens
11966 defraud
11967 vigilantes
11968 asa
11969 holah
11970 dethrone
11971 instructor
11972 parafield
11973 shower
11974 shelf
11975 thread
11976 galactic
11977 spaceshiptwo
11978 suitcase
11979 barn
11980 caravel
11981 beamer
11982 diane
11983 bradshaw
11984 koolendong
11985 marines
11986 saved
11987 rangeland
11988 snatcher
11989 hayward
11990 arena
11991 bikeways
11992 pointer
11993 unlawful
11994 vandenberg
11995 setanta
11996 usual
...

13350 aljazeera
13351 postpones
13352 nutritional
13353 gina
13354 suns
13355 assaulting
13356 jamal
13357 flog
13358 francene
13359 mint
13360 chaplin
13361 bland
13362 spongebob
13363 squarepants
13364 reiterate
13365 shanley
13366 humanity
13367 ipod
13368 argy
13369 bargy
13370 taller
13371 onion
13372 tristar
13373 escas
13374 sharing
13375 ranns
13376 caravanners
13377 originality
13378 yesterday
13379 mooroopna
13380 spc
13381 serf
13382 verkerk
13383 punishes
13384 sicilian
13385 accessory
13386 needing
13387 enlist
13388 flour
13389 tremain
13390 assuncao
13391 betis
13392 whitlands
13393 beaudeserts
13394 revives
13395 reviewer
13396 jamberoo
13397 legislator
13398 bafta
13399 winslet
13400 quickup
13401 karrathas
13402 doorstop
13403 baboon
13404 politkovskayas
13405 wx
13406 cube
13407 rubiks
13408 feil
13409 meligeni
13410 warsaw
13411 bataclan
13412 tearful
13413 bleak
13414 domination
13415 format
13416 ausnetra
13417 version
13418 spooner
13419 conrad
13420 crighton
...

In [44]:
bow = [id2word.doc2bow(headline) for headline in headlines_preproc]

In [45]:
bow

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
 [(5, 1), (6, 1), (7, 1)],
 [(8, 1), (9, 1), (10, 1), (11, 1)],
 [(12, 1), (13, 1), (14, 1), (15, 1)],
 [(16, 1), (17, 1), (18, 1), (19, 1)],
 [(20, 1), (21, 1), (22, 1), (23, 1), (24, 1)],
 [(25, 1), (26, 1), (27, 1), (28, 1), (29, 1)],
 [(30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1)],
 [(36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1)],
 [(15, 1), (42, 1), (43, 1), (44, 1), (45, 1)],
 [(46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1)],
 [(52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1)],
 [(29, 1), (58, 1), (59, 1), (60, 1), (61, 1)],
 [(62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1)],
 [(68, 1), (69, 1), (70, 1), (71, 1)],
 [(72, 1), (73, 1), (74, 1), (75, 1), (76, 1)],
 [(77, 1), (78, 1), (79, 1), (80, 1), (81, 1)],
 [(82, 1), (83, 1), (84, 1)],
 [(85, 1), (86, 1), (87, 1), (88, 1)],
 [(89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1)],
 [(95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1)],
 ...
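
To make the `(token_id, count)` pairs above less opaque, here is a dependency-free sketch of what a dictionary plus `doc2bow` does conceptually. It is not Gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, and the alphabetical id assignment within each document is an assumption chosen because it reproduces the ids printed above.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to every distinct token, document by document."""
    token2id = {}
    for doc in documents:
        # New tokens of a document receive ids in alphabetical order,
        # which reproduces the ids seen in the dictionary output above.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# The first two preprocessed headlines from the notebook.
docs = [
    ["ute", "driver", "hurt", "intersection", "crash"],
    ["dy", "cycle", "accident"],
]
token2id = build_vocabulary(docs)
toy_bow = [to_bow(doc, token2id) for doc in docs]
print(toy_bow)
```

Every count is 1 here because headlines are short and rarely repeat a word; a document such as `["crash", "crash", "ute"]` would yield a count of 2 for the id of `crash`.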

Next, compute the TF-IDF weights using Gensim:

In [46]:
# TODO: Compute TF-IDF
from gensim.models import TfidfModel

model = TfidfModel(bow)

In [47]:
tf_idf = model[bow]

In [49]:
print(len(tf_idf))

20000
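
For intuition about what the TF-IDF model computes, here is a minimal standard-library sketch. It assumes the weighting scheme that, to my understanding, matches Gensim's defaults (raw term frequency, a base-2 logarithmic inverse document frequency, and L2 normalisation of each document); the function name `tfidf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def tfidf(bow_corpus):
    """Weight each (term_id, count) pair by tf * log2(N / df), then L2-normalise."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term_id for doc in bow_corpus for term_id, _ in doc)
    weighted_corpus = []
    for doc in bow_corpus:
        vec = [(t, count * math.log2(n_docs / doc_freq[t])) for t, count in doc]
        # A term present in every document has idf 0 and carries no information.
        vec = [(t, w) for t, w in vec if w > 0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted_corpus.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted_corpus

# Toy BOW corpus: term 0 appears in two documents, the other terms in one each.
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(3, 1)]]
toy_tfidf = tfidf(toy_corpus)
for doc in toy_tfidf:
    print([(t, round(w, 3)) for t, w in doc])
```

In the first toy document the rarer term 1 ends up with a larger weight than term 0, which is the effect TF-IDF is designed to produce.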


Finally, compute the **LSA** (also called LSI) using Gensim, with a number of topics of your choice.

In [73]:
# TODO: Compute LSA
from gensim.models import LsiModel

lsa = LsiModel(corpus=tf_idf, id2word=id2word, num_topics=5)
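
As a reminder of what this model computes, LSA amounts to a truncated singular value decomposition of the term-document matrix. The notation below is the standard one and is not taken from the notebook: $X$ is the TF-IDF matrix with $m$ terms and $n$ documents, and $k$ is the number of topics (5 in the call above).

$$
X \approx U_k \Sigma_k V_k^{\top},
\qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1 \ge \dots \ge \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}
$$

Each column of $U_k$ is one topic, and its entries are the word weights that `print_topics` displays. Singular vectors are only defined up to sign, so weights can be negative, and flipping every sign within a topic leaves the decomposition unchanged.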

In [74]:
from pprint import pprint

For each topic, show the most heavily weighted words.

In [75]:
# TODO: Print the 3 or 4 most frequent words of each topic
pprint(lsa.print_topics())

[(0,
  '-0.470*"man" + -0.390*"police" + -0.331*"charge" + -0.148*"court" + '
  '-0.138*"murder" + -0.134*"face" + -0.121*"miss" + -0.117*"woman" + '
  '-0.113*"crash" + -0.106*"new"'),
 (1,
  '-0.496*"second" + -0.417*"abc" + -0.395*"news" + -0.346*"weather" + '
  '-0.266*"business" + -0.213*"sport" + 0.190*"man" + 0.127*"charge" + '
  '-0.110*"rural" + -0.100*"national"'),
 (2,
  '0.408*"man" + 0.300*"charge" + 0.223*"second" + -0.216*"plan" + '
  '-0.198*"council" + -0.198*"govt" + -0.182*"new" + 0.158*"abc" + '
  '0.156*"weather" + -0.142*"police"'),
 (3,
  '-0.766*"police" + 0.210*"charge" + 0.208*"man" + 0.152*"council" + '
  '-0.134*"probe" + -0.131*"search" + -0.127*"investigate" + 0.121*"plan" + '
  '-0.118*"miss" + 0.114*"court"'),
 (4,
  '-0.474*"abc" + 0.471*"news" + 0.389*"rural" + 0.309*"national" + '
  '-0.297*"weather" + -0.185*"sport" + 0.165*"second" + -0.137*"new" + '
  '0.122*"qld" + -0.095*"council"')]
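
The TODO asks for only three or four words per topic, while the output above lists ten. As a sketch, the strings returned by `print_topics` can be reduced with a small parser; `top_words` is a hypothetical helper, and ranking by absolute weight is a design choice made here because LSA weights can be negative.

```python
import re

def top_words(topic_string, n=4):
    """Extract the n words with the largest absolute weight from a topic string."""
    # Each term looks like  -0.470*"man"  (weight, asterisk, quoted word).
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_string)
    ranked = sorted(pairs, key=lambda p: abs(float(p[0])), reverse=True)
    return [word for _, word in ranked[:n]]

# Topic 0 from the LSA output above.
topic_0 = ('-0.470*"man" + -0.390*"police" + -0.331*"charge" + -0.148*"court" + '
           '-0.138*"murder" + -0.134*"face" + -0.121*"miss" + -0.117*"woman" + '
           '-0.113*"crash" + -0.106*"new"')
print(top_words(topic_0))
```

With Gensim itself, the same effect is obtained more directly by passing `num_words=4` to `print_topics`.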


What do you think about those results?

Now let's try LDA instead of LSA, again using Gensim.

In [104]:
# TODO: Compute LDA
from gensim.models.ldamodel import LdaModel

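# Fit LDA with 5 topics on the TF-IDF corpus (LDA is more commonly fit on raw BOW counts)
lda = LdaModel(corpus=tf_idf, num_topics=5, id2word=id2word, random_state=0, passes=10)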

In [105]:
# TODO: print the most frequent words of each topic
pprint(lda.print_topics())

[(0,
  '0.004*"police" + 0.004*"kill" + 0.003*"review" + 0.003*"govt" + '
  '0.003*"call" + 0.003*"health" + 0.003*"road" + 0.003*"council" + '
  '0.003*"minister" + 0.003*"safety"'),
 (1,
  '0.005*"rural" + 0.004*"country" + 0.003*"hour" + 0.003*"qld" + '
  '0.003*"national" + 0.003*"study" + 0.003*"nsw" + 0.003*"club" + '
  '0.003*"stand" + 0.002*"hunt"'),
 (2,
  '0.005*"rate" + 0.004*"day" + 0.003*"new" + 0.003*"west" + 0.003*"closer" + '
  '0.003*"export" + 0.003*"gas" + 0.003*"sale" + 0.003*"ahead" + 0.002*"drop"'),
 (3,
  '0.008*"charge" + 0.007*"man" + 0.006*"second" + 0.005*"miss" + '
  '0.005*"police" + 0.005*"murder" + 0.004*"weather" + 0.004*"abc" + '
  '0.004*"business" + 0.003*"crash"'),
 (4,
  '0.004*"police" + 0.003*"brisbane" + 0.003*"force" + 0.003*"drug" + '
  '0.003*"plan" + 0.003*"worker" + 0.003*"inquiry" + 0.002*"new" + 0.002*"say" '
  '+ 0.002*"sex"')]


How do the results look with LDA?

Let's make some visualization of the LDA results using pyLDAvis.

In [106]:
# TODO: show visualization results of the LDA
import pyLDAvis
# Note: newer pyLDAvis releases name this submodule pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()

In [107]:
vis = pyLDAvis.gensim.prepare(lda, bow, id2word)

In [108]:
vis

Depending on your results, you can try to fine-tune the algorithm (number of topics, other hyperparameters...) and compare your results with those of others.
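
One way to make such comparisons less subjective is a topic coherence score, computed for each candidate number of topics. The sketch below is a dependency-free version of the UMass coherence measure, averaged over word pairs; `umass_coherence`, the smoothing constant `eps=1.0` and the toy documents are assumptions for illustration, not part of the original exercise.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents, eps=1.0):
    """Average UMass coherence of a topic's top words (ordered by decreasing weight).

    For every pair (w_j, w_i) with w_j ranked before w_i, the score is
    log((D(w_i, w_j) + eps) / D(w_j)), where D counts documents containing the words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for j, i in combinations(range(len(top_words)), 2):  # guarantees j < i
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the reference documents
        scores.append(math.log((doc_freq(w_i, w_j) + eps) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy tokenised documents in the style of the preprocessed headlines.
toy_docs = [
    ["man", "charge", "police"],
    ["man", "charge", "court"],
    ["weather", "abc", "news"],
    ["abc", "news", "sport"],
]
coherent = umass_coherence(["man", "charge"], toy_docs)
mixed = umass_coherence(["man", "weather"], toy_docs)
print(coherent > mixed)
```

Higher values indicate that a topic's top words tend to appear in the same documents. Gensim provides `gensim.models.CoherenceModel` for computing such scores on a trained model, which is the more practical route on the full corpus.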