<a href="https://colab.research.google.com/github/SWE3T/TopicModeling/blob/main/SpaceNews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install tomotopy

Collecting tomotopy
  Downloading tomotopy-0.12.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.1/17.1 MB 60.4 MB/s eta 0:00:00
Installing collected packages: tomotopy
Successfully installed tomotopy-0.12.4


### Final project (Space news dataset)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

import tomotopy as tp
import pandas as pd
import numpy as np
import string
import spacy
import sys

import gensim
from gensim.models.phrases import Phrases, Phraser

import nltk
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('words')
lemmatizer = WordNetLemmatizer()
english_words = set(words.words())
exclude = set(string.punctuation)
stop_words = set(stopwords.words('english'))

spacy.cli.download("en_core_web_lg")
nlp = spacy.load('en_core_web_lg')

# spacy.cli.download("en_core_web_sm")
# nlp = spacy.load('en_core_web_sm')

pd.options.mode.chained_assignment = None

Mounted at /content/drive


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [28]:
def clean(doc):
    stop_free = " ".join([w for w in doc.lower().split() if w not in stop_words])
    english_only = " ".join([w for w in stop_free.split() if w in english_words])
    punc_free = "".join([ch for ch in english_only if ch not in exclude])
    return punc_free

def lemmatize(document):
    doc = nlp(document)
    lemmas = [token.lemma_ for token in doc]
    return ' '.join(lemmas)

In [29]:
fpath='/content/drive/MyDrive/Códigos/Modelagem de tópicos/Final Work/dataset /'

data = pd.read_csv(fpath+'spacenews-december-2022.csv')

In [30]:
data

Unnamed: 0,title,url,content,author,date,postexcerpt
0,Orion splashes down to end Artemis 1,https://spacenews.com/orion-splashes-down-to-e...,Updated at 5:45 p.m. Eastern after post-splash...,Jeff Foust,"December 11, 2022",Fifty years to the day after the last Apollo m...
1,Polaris Dawn crewed mission could suffer addit...,https://spacenews.com/polaris-dawn-crewed-miss...,LAS VEGAS — A billionaire-backed private astro...,Jeff Foust,"October 25, 2022",A billionaire-backed private astronaut mission...
2,DART on track for asteroid collision,https://spacenews.com/dart-on-track-for-astero...,WASHINGTON — A NASA spacecraft is on course to...,Jeff Foust,"September 25, 2022",A NASA spacecraft is on course to deliberately...
3,U.S. Space Command calls for investment in tec...,https://spacenews.com/u-s-space-command-calls-...,"WASHINGTON — Lt. Gen. John Shaw, deputy comman...",Sandra Erwin,"August 31, 2022",U.S. Space Command's Lt. Gen. John Shaw said '...
4,SpaceX requests permission for direct-to-smart...,https://spacenews.com/spacex-requests-permissi...,"TAMPA, Fla. — SpaceX could provide “full and c...",Jason Rainbow,"December 8, 2022",SpaceX could provide “full and continuous” dir...
...,...,...,...,...,...,...
18349,Kendall lays out Pentagon thinking on future s...,https://spacenews.com/frank-kendall-at-wsbr/,"\nFrank Kendall, the Pentagon’s top acquisitio...",SpaceNews Staff,"February 25, 2016","Frank Kendall, the Pentagon’s top acquisition ..."
18350,A larger share of NOAA’s declining space budge...,https://spacenews.com/a-larger-share-of-noaas-...,Updated Feb. 10 at 10:18 p.m. Eastern The U.S....,Debra Werner,"February 10, 2016",The U.S. National Oceanic and Atmospheric Admi...
18351,Think Tank Turns Its Attention To Mars As 2016...,https://spacenews.com/think-tank-turns-its-att...,WASHINGTON — As NASA develops a long-term stra...,Jeff Foust,"June 11, 2015",As NASA develops a long-term strategy to suppo...
18352,House Bill Leaves Last Three JPSS Satellites i...,https://spacenews.com/no-money-for-noaa-weathe...,WASHINGTON — A spending bill the House passed ...,Dan Leone,"June 4, 2015",A spending bill the House passed June 3 would ...


The most relevant feature for topic modeling is the **content** column, so let's focus on it.

First, we can compute the word count of each document, along with the mean, minimum, and maximum values:

In [31]:
data['wordcounter'] = data['content'].apply(lambda x: len(str(x).split()))
data.describe()

Unnamed: 0,wordcounter
count,18354.0
mean,624.80146
std,366.239719
min,0.0
25%,388.0
50%,593.0
75%,781.0
max,12555.0


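Let's check whether any content is missing. The cell below first replaces whitespace runs, literal `\n` sequences, and tokens of 16 or more characters with a single space, so that content reduced to a lone space can be counted as missing.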

In [32]:
data['content'] = data['content'].replace(r'\b\w{16,}\b|\s+|\\n', ' ', regex=True)
data.replace(' ', np.nan, inplace=True)
print(data['content'].isnull().sum())


169
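To make the effect of that pattern easier to see, here is a minimal sketch that applies it to a few made-up strings. It uses Python's `re` module as a stand-in for the pandas `replace(..., regex=True)` call, and the sample strings are assumptions for illustration, not rows from the dataset.

```python
import re

# Same pattern as in the cell above
pattern = r'\b\w{16,}\b|\s+|\\n'

# Made-up sample strings (not taken from the dataset)
samples = [
    "Launch\n\n  window   opens",      # whitespace runs collapse to one space
    r"First line\nSecond line",        # literal backslash + n is replaced
    "see internationalization today",  # a token of 16+ word characters is removed
    "\n",                              # whitespace-only content becomes ' '
]

cleaned = [re.sub(pattern, ' ', s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

The last sample shows why the follow-up `data.replace(' ', np.nan, inplace=True)` catches articles whose content was only whitespace.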


In [33]:
missing_news = data[data['content'].isnull()]

Let's look at a sample of the articles whose content is missing:

In [34]:
missing_news.sample(10)

Unnamed: 0,title,url,content,author,date,postexcerpt,wordcounter
11282,FAA Comstac Meeting Tweet-by-Tweet |Day 1,https://spacenews.com/happening-now-faa-comsta...,,Brian Berger,"February 4, 2015",The FAA Office of Commercial Space Transportat...,1
14999,ULA Gets $1.5 Billion Air Force Contract for N...,https://spacenews.com/ula-gets-15-billion-air-...,,Dan Leone,"January 11, 2012",WASHINGTON — The U.S. Air Force has awarded Un...,1
16426,"Dordain: Even with Government Cuts, ESA Progra...",https://spacenews.com/dordain%e2%80%82even-gov...,,Peter B. de Selding,"June 14, 2010",BERLIN — European Space Agency (ESA) Director-...,1
14004,Many ESA Programs Approved in Naples Still Wan...,https://spacenews.com/many-esa-programs-approv...,,Peter B. de Selding,"December 6, 2012",ESA ministerial conference raised as many issu...,1
17907,Q&A with U.S. Air Force Secretary Deborah Lee ...,https://spacenews.com/qa-with-u-s-air-force-se...,,Mike Gruss,"October 27, 2015",Deborah James is the Pentagon’s first principa...,1
16570,White House Seeks to Consolidate Export Licensing,https://spacenews.com/white-house-seeks-consol...,,Amy Klamper,"April 20, 2010",WASHINGTON — U.S. Secretary of Defense Robert ...,1
14604,"Orbital, NASA Hit New Snags on Landsat Develop...",https://spacenews.com/orbital-nasa-hit-new-sna...,,Dan Leone,"May 18, 2012","GREENBELT, Md. — Orbital Sciences Corp. contin...",1
13870,ESA Activates Galileo Search-and-rescue Payload,https://spacenews.com/esa-activates-galileo-se...,,Peter B. de Selding,"January 25, 2013",The SAR payload aboard one of Europe’s in-orbi...,1
16228,Lawmakers Curb Spending on Defense Weather Sat...,https://spacenews.com/lawmakers-curb-spending-...,,Turner Brinton,"September 17, 2010",WASHINGTON — U.S. Senate appropriators on Sept...,1
4663,Coronavirus special coverage,https://spacenews.com/coronavirus-special-cove...,,SpaceNews Staff,"March 19, 2020",SpaceNews special coverage of the novel coron...,1


Rows with no content available won't be very useful for us.

Several approaches can be used to work around this problem:

- Use the post excerpt to replace the content;
- Use the title as a replacement;
- Delete the rows with missing values.
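As a rough illustration of the three options, the sketch below applies each of them to a tiny made-up frame. The column names match the dataset, but the rows and the variable names (`toy`, `from_excerpt`, `from_title`, `dropped`) are assumptions for this example only.

```python
import numpy as np
import pandas as pd

# Toy frame with the dataset's column names; the rows are invented
toy = pd.DataFrame({
    "title": ["Launch delayed", "Budget approved"],
    "content": ["The launch slipped by a week.", np.nan],
    "postexcerpt": ["A week-long slip.", "Lawmakers approved the budget."],
})

# Option 1: fall back to the post excerpt where content is missing
from_excerpt = toy["content"].fillna(toy["postexcerpt"])

# Option 2: fall back to the title instead
from_title = toy["content"].fillna(toy["title"])

# Option 3: drop the rows that have no content
dropped = toy.dropna(subset=["content"])

print(from_excerpt.tolist())
print(from_title.tolist())
print(len(dropped))
```

This notebook takes the third option in the next cell.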


In [35]:
data = data.dropna(subset=['content'])
print(data['content'].isnull().sum())

0


In [36]:
# Look up the row with the lowest word count instead of hard-coding its index
smallest = data['wordcounter'].idxmin()
print("This is the smallest document in the collection:")
data.loc[[smallest]]

This is the smallest document in the collection:


Unnamed: 0,title,url,content,author,date,postexcerpt,wordcounter
10363,The Week Ahead for Oct. 12,https://spacenews.com/the-week-ahead-for-oct-12/,Wednesday: Wednesday-Thursday: Friday: Friday-...,SpaceNews Staff,"October 12, 2015",The 66th International Astronautical Congress ...,4


In [37]:
# Look up the row with the highest word count instead of hard-coding its index
biggest = data['wordcounter'].idxmax()
print("This is the biggest document in the collection:")
data.loc[[biggest]]

This is the biggest document in the collection:


Unnamed: 0,title,url,content,author,date,postexcerpt,wordcounter
4649,The latest COVID-19 news and event updates for...,https://spacenews.com/coronavirus-space-impacts/,Follow our reporters on Twitter for updates. S...,SpaceNews Staff,"March 22, 2020",A time line of the coronavirus pandemic's impa...,12555


In [38]:
data.at[4649, 'content']



The dataset contains a lot of unnecessary data that won't be used for topic modeling.

Fields such as **url**, **author**, and **date** will be removed.

The dataset also includes a **postexcerpt** column, which appears to be a summary of the text in the content column. Although it could be used to make a biased guess at the topic of the text, I believe it's best not to use it for now and to rely only on the content of the article.


In [39]:
data = data.drop(['url', 'date', 'author', 'postexcerpt', 'wordcounter'], axis=1)

In [40]:
data

Unnamed: 0,title,content
0,Orion splashes down to end Artemis 1,Updated at 5:45 p.m. Eastern after post-splash...
1,Polaris Dawn crewed mission could suffer addit...,LAS VEGAS — A billionaire-backed private astro...
2,DART on track for asteroid collision,WASHINGTON — A NASA spacecraft is on course to...
3,U.S. Space Command calls for investment in tec...,"WASHINGTON — Lt. Gen. John Shaw, deputy comman..."
4,SpaceX requests permission for direct-to-smart...,"TAMPA, Fla. — SpaceX could provide “full and c..."
...,...,...
18349,Kendall lays out Pentagon thinking on future s...,"Frank Kendall, the Pentagon’s top acquisition..."
18350,A larger share of NOAA’s declining space budge...,Updated Feb. 10 at 10:18 p.m. Eastern The U.S....
18351,Think Tank Turns Its Attention To Mars As 2016...,WASHINGTON — As NASA develops a long-term stra...
18352,House Bill Leaves Last Three JPSS Satellites i...,WASHINGTON — A spending bill the House passed ...


Now only two columns remain: **title** and the **content** itself.

We have also made sure that every row in the content column contains something.

The title will be concatenated with the content to make it easier to use.

In [41]:
data['content'] = data['title'] + ' ' + data['content']
data = data.drop(['title'], axis=1)

Some details about the dataset:


In [42]:
print("Number of lines:", data.shape[0])
data

Number of lines: 18185


Unnamed: 0,content
0,Orion splashes down to end Artemis 1 Updated a...
1,Polaris Dawn crewed mission could suffer addit...
2,DART on track for asteroid collision WASHINGTO...
3,U.S. Space Command calls for investment in tec...
4,SpaceX requests permission for direct-to-smart...
...,...
18349,Kendall lays out Pentagon thinking on future s...
18350,A larger share of NOAA’s declining space budge...
18351,Think Tank Turns Its Attention To Mars As 2016...
18352,House Bill Leaves Last Three JPSS Satellites i...


In [43]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18185 entries, 0 to 18353
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  18185 non-null  object
dtypes: object(1)
memory usage: 800.2+ KB


### Now, for the data preprocessing part:


In [44]:
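# For testing purposes, keep only a subset of the dataset (the first 575 rows).
# .copy() makes the subset independent, so adding columns later does not touch a view.
data = data[:575].copy()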
data

Unnamed: 0,content
0,Orion splashes down to end Artemis 1 Updated a...
1,Polaris Dawn crewed mission could suffer addit...
2,DART on track for asteroid collision WASHINGTO...
3,U.S. Space Command calls for investment in tec...
4,SpaceX requests permission for direct-to-smart...
...,...
570,NASA preparing for late September Artemis 1 la...
571,U.S. Army a key customer of BlackSky’s next-ge...
572,Phase Four adopts iodine for next-gen Max-V en...
573,Army looking at new ways to use space technolo...


Now, this part of the code removes **stop words**, non-**English** words, and **punctuation** from the documents.

In [45]:
data['tokens'] = data['content'].apply(lambda x: clean(x))
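To see what `clean` does to a single sentence, here is a small self-contained sketch. The tiny `stop_words` and `english_words` sets are stand-ins for the NLTK lists loaded earlier, and the sentence is made up; both are assumptions for illustration.

```python
import string

# Tiny stand-ins for the NLTK stop-word list and word list (assumptions)
stop_words = {"the", "is", "on", "a", "to"}
english_words = {"rocket", "launch", "pad", "ready"}
exclude = set(string.punctuation)

def clean(doc):
    # Same three steps, in the same order, as the notebook's clean()
    stop_free = " ".join(w for w in doc.lower().split() if w not in stop_words)
    english_only = " ".join(w for w in stop_free.split() if w in english_words)
    return "".join(ch for ch in english_only if ch not in exclude)

result = clean("The rocket is on the launch pad, ready to launch.")
print(result)
```

Because the dictionary filter runs before punctuation is stripped, tokens with attached punctuation such as `pad,` and `launch.` are not found in the word list and are dropped entirely.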

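And here we generate the bigrams: word pairs that frequently occur together are joined with an underscore (for example, `space_force`).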

In [46]:
data['tokens'] = data['tokens'].apply(word_tokenize)

sentences = data['tokens'].tolist()

bigram_model = Phrases(sentences, min_count=4)
bigram_phraser = Phraser(bigram_model)

def apply_bigrams(tokens):
    return ' '.join(list(bigram_phraser[tokens]))

data['tokens'] = data['tokens'].apply(apply_bigrams)
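The idea behind the underscore-joined tokens can be sketched in plain Python. This is a deliberately simplified, count-only rule and not gensim's actual `Phrases` scoring; the function name `join_frequent_bigrams` and the toy documents are hypothetical.

```python
from collections import Counter

def join_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times across docs."""
    pair_counts = Counter(
        pair for tokens in docs for pair in zip(tokens, tokens[1:])
    )
    joined_docs = []
    for tokens in docs:
        out, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                out.append("_".join(pair))  # merge the pair into one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        joined_docs.append(out)
    return joined_docs

# Made-up tokenized documents
docs = [
    ["space", "force", "awards", "contract"],
    ["space", "force", "plans", "launch"],
]
joined = join_frequent_bigrams(docs)
print(joined)
```

In this toy corpus only the pair `space force` occurs twice, so it is the only one merged.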

In [47]:
def filter_words_with_underscore(text):
    words = text.split()
    return set([word for word in words if '_' in word])

data['bigrams'] = data['tokens'].apply(lambda x: filter_words_with_underscore(x))
data

Unnamed: 0,content,tokens,bigrams
0,Orion splashes down to end Artemis 1 Updated a...,end eastern fifty day last moon mission touche...,"{major_step, took_place, take_place, low_earth..."
1,Polaris Dawn crewed mission could suffer addit...,dawn mission could suffer additional las priva...,"{crew_dragon, launch_complex, fourth_quarter, ..."
2,DART on track for asteroid collision WASHINGTO...,dart track asteroid collision course deliberat...,"{planetary_defense, work_done, orbit_around, c..."
3,U.S. Space Command calls for investment in tec...,space_command investment deep_space deputy com...,"{space_command, space_station, space_traffic, ..."
4,SpaceX requests permission for direct-to-smart...,permission service could provide across much g...,"{would_enable, federal_commission, current_gen..."
...,...,...,...
570,NASA preparing for late September Artemis 1 la...,late launch_attempt laying groundwork another ...,"{want_go, launch_complex, space_force, three_d..."
571,U.S. Army a key customer of BlackSky’s next-ge...,army key customer satellite earth_observation ...,"{ground_station, defense_innovation, take_adva..."
572,Phase Four adopts iodine for next-gen Max-V en...,phase_four iodine engine mountain propulsion p...,"{last_year, supply_chain, air_force, national_..."
573,Army looking at new ways to use space technolo...,army looking new ways use space technology unc...,"{national_defense, missile_defense, vice_presi..."


The original text is kept in the **content** column, and a new **bigrams** column keeps track of the bigrams that were generated.

Now we apply the lemmatizer to reduce each word to its base form (for example, *looking* becomes *look*).

In [48]:
data['tokens'] = data['tokens'].apply(lemmatize)

The bigrams column won't be used for training the model, only for display purposes; the tokens already have the bigrams merged into them.

In [49]:
data['tokens'] = data['tokens'].apply(word_tokenize)
data

Unnamed: 0,content,tokens,bigrams
0,Orion splashes down to end Artemis 1 Updated a...,"[end, eastern, fifty, day, last, moon, mission...","{major_step, took_place, take_place, low_earth..."
1,Polaris Dawn crewed mission could suffer addit...,"[dawn, mission, could, suffer, additional, las...","{crew_dragon, launch_complex, fourth_quarter, ..."
2,DART on track for asteroid collision WASHINGTO...,"[dart, track, asteroid, collision, course, del...","{planetary_defense, work_done, orbit_around, c..."
3,U.S. Space Command calls for investment in tec...,"[space_command, investment, deep_space, deputy...","{space_command, space_station, space_traffic, ..."
4,SpaceX requests permission for direct-to-smart...,"[permission, service, could, provide, across, ...","{would_enable, federal_commission, current_gen..."
...,...,...,...
570,NASA preparing for late September Artemis 1 la...,"[late, launch_attempt, lay, groundwork, anothe...","{want_go, launch_complex, space_force, three_d..."
571,U.S. Army a key customer of BlackSky’s next-ge...,"[army, key, customer, satellite, earth_observa...","{ground_station, defense_innovation, take_adva..."
572,Phase Four adopts iodine for next-gen Max-V en...,"[phase_four, iodine, engine, mountain, propuls...","{last_year, supply_chain, air_force, national_..."
573,Army looking at new ways to use space technolo...,"[army, look, new, way, use, space, technology,...","{national_defense, missile_defense, vice_presi..."


### Our data is preprocessed; it is time to train our model to extract the topics

In [50]:
def printTopics(model, p=None):
	# Print the 10 most probable words of every topic.
	# p=None: regular topics; p=1: topics at timepoint 0; any other value: sub-topics.
	for k in range(model.k):
		print('Topic #{}'.format(k))
		if p is None:
			for word, prob in model.get_topic_words(topic_id=k,top_n=10):
				print(' ', word, prob, sep=' ')
		elif p==1:
			for word, prob in model.get_topic_words(topic_id=k,top_n=10,timepoint=0):
				print(' ', word, prob, sep=' ')
		else:
			for word, prob in model.get_topic_words(sub_topic_id=k,top_n=10):
				print(' ', word, prob, sep=' ')


def printCoherence(model):
	for preset in ('u_mass', 'c_uci', 'c_npmi', 'c_v'):
		coh = tp.coherence.Coherence(model, coherence=preset)
		average_coherence = coh.get_score()
		coherence_per_topic = [coh.get_score(topic_id=k) for k in range(model.k)]
		print('Coherence : {}'.format(preset))
		print('Average:', average_coherence, '\nper topic:', coherence_per_topic)
		print()

def runModel(model, document):
	# Add every tokenized document to the model, printing its position as progress
	for i,d in enumerate(document):
		print(i,end='')
		model.add_doc(d)
	print()
	model.burn_in = 100
	# Initialize the model without running any training iterations
	model.train(0)
	print('Total amount of documents:', len(model.docs), ', vocabulary size:', len(model.used_vocabs), ', number of words:', model.num_words)
	print('Removed top words:', model.removed_top_words)
	# print('Training...', file=sys.stderr, flush=True)

	# Train in steps of 5 iterations, reporting the log-likelihood per word every 50 iterations
	for i in range(0, 1000, 5):
		model.train(5)
		if i % 50 == 0:
			print("Iteration: {}\t likelihood: {}".format(i, model.ll_per_word))

	model.summary()
	# print('Saving...', file=sys.stderr, flush=True)
	# model.save('test.lda.bin', True)


model = tp.DMRModel(tw=tp.TermWeight.IDF, min_cf=15, min_df=3, rm_top=10, k=7, seed=777)
runModel(model, data['tokens'])

0123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369

  model.train(0)
  model.train(5)


Iteration: 50	 likelihood: -7.2422004959865705
Iteration: 100	 likelihood: -7.167797459139243
Iteration: 150	 likelihood: -7.115734660657668
Iteration: 200	 likelihood: -7.08174281523597
Iteration: 250	 likelihood: -7.0634123054357
Iteration: 300	 likelihood: -7.047569033905289
Iteration: 350	 likelihood: -7.033340185429867
Iteration: 400	 likelihood: -7.027040049540785
Iteration: 450	 likelihood: -7.020792384921085
Iteration: 500	 likelihood: -7.014177367186179
Iteration: 550	 likelihood: -7.0037661927355215
Iteration: 600	 likelihood: -7.001731452327819
Iteration: 650	 likelihood: -6.991817081944601
Iteration: 700	 likelihood: -6.992368608034555
Iteration: 750	 likelihood: -6.990073065624575
Iteration: 800	 likelihood: -6.98833015998578
Iteration: 850	 likelihood: -6.982831198103976
Iteration: 900	 likelihood: -6.983267574354592
Iteration: 950	 likelihood: -6.991534821076799
<Basic Info>
| DMRModel (current version: 0.12.4)
| 575 docs, 72097 words
| Total Vocabs: 6074, Used Vocabs: 1

In [51]:
printTopics(model)

Topic #0
  million 0.009140142239630222
  market 0.00827164389193058
  constellation 0.008262869901955128
  technology 0.007820377126336098
  demand 0.007564379833638668
  optical 0.007298086769878864
  business 0.007227974012494087
  build 0.007096909452229738
  expand 0.0070809368044137955
  remote 0.006656029261648655
Topic #1
  rocket 0.015591410920023918
  test 0.012989256531000137
  vehicle 0.011806213296949863
  second 0.011137254536151886
  firefly 0.011059250682592392
  flight 0.010160736739635468
  weather 0.010086357593536377
  engine 0.009997059591114521
  first 0.009712721221148968
  launch_vehicle 0.008816404268145561
Topic #2
  go 0.00857292115688324
  get 0.008136256597936153
  cost 0.006602638401091099
  take 0.006527237594127655
  think 0.0065130931325256824
  could 0.006352598778903484
  come 0.006235099397599697
  look 0.006114513613283634
  need 0.006036325823515654
  work 0.005590606946498156
Topic #3
  spectrum 0.02265489473938942
  network 0.021229835227131844
 

In [52]:
printCoherence(model)

Coherence : u_mass
Average: -2.1256655850840573 
per topic: [-1.5476814652951618, -1.4249505403779235, -0.820521674583644, -1.6439128218204746, -2.752615729864161, -5.371588617971757, -1.3183882456752802]

Coherence : c_uci
Average: -1.4197659981426634 
per topic: [-0.37319027375889857, -0.4331183787471379, -0.24842358509685344, -1.1100408437105476, -3.740018104082404, -4.703540961863597, 0.669970160260795]

Coherence : c_npmi
Average: 0.0017006212850521896 
per topic: [-0.003285706559536563, 0.030493685537101238, 0.013022332761604768, 0.060814965255128445, -0.07897944762947792, -0.09936011126072741, 0.08919863089127278]

Coherence : c_v
Average: 0.658609027415514 
per topic: [0.547073221206665, 0.700865876674652, 0.44183064326643945, 0.794278347492218, 0.6557369709014893, 0.7182523727416992, 0.7522257596254349]
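One way to read these numbers: the reported averages appear to be plain means of the per-topic scores. The sketch below checks that for `c_v` and picks out the weakest and strongest topics. The scores are copied from the output above; the variable names are assumptions for this example.

```python
from statistics import mean

# Per-topic c_v scores copied from the output above
c_v_per_topic = [
    0.547073221206665, 0.700865876674652, 0.44183064326643945,
    0.794278347492218, 0.6557369709014893, 0.7182523727416992,
    0.7522257596254349,
]

average = mean(c_v_per_topic)

# Index of the lowest- and highest-scoring topic
weakest = min(range(len(c_v_per_topic)), key=c_v_per_topic.__getitem__)
strongest = max(range(len(c_v_per_topic)), key=c_v_per_topic.__getitem__)

print("mean c_v:", average)
print("weakest topic:", weakest, "strongest topic:", strongest)
```

Under this measure the weakest topic is Topic #2, whose top words listed earlier are mostly generic verbs (`go`, `get`, `take`, `think`), which is consistent with its low score.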

