<a href="https://colab.research.google.com/github/SWE3T/TopicModeling/blob/main/SpaceNews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [141]:
!pip install tomotopy



### Final project (Space news dataset)

In [142]:
from google.colab import drive
drive.mount('/content/drive')

import tomotopy as tp
import pandas as pd
import numpy as np
import string
import spacy
import sys

import gensim
from gensim.models.phrases import Phrases, Phraser

import nltk
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('words')
lemmatizer = WordNetLemmatizer()
english_words = set(words.words())
exclude = set(string.punctuation)
stop_words = set(stopwords.words('english'))

spacy.cli.download("en_core_web_lg")
nlp = spacy.load('en_core_web_lg')

# spacy.cli.download("en_core_web_sm")
# nlp = spacy.load('en_core_web_sm')

pd.options.mode.chained_assignment = None

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [143]:
def clean(doc):
    stop_free = " ".join([w for w in doc.lower().split() if w not in stop_words])
    english_only = " ".join([w for w in stop_free.split() if w in english_words])
    punc_free = "".join([ch for ch in english_only if ch not in exclude])
    return punc_free

def lemmatize(document):
    doc = nlp(document)
    lemmas = [token.lemma_ for token in doc]
    return ' '.join(lemmas)

In [144]:
fpath='/content/drive/MyDrive/Códigos/Modelagem de tópicos/Final Work/dataset /'

data = pd.read_csv(fpath+'spacenews-december-2022.csv')

In [145]:
data

Unnamed: 0,title,url,content,author,date,postexcerpt
0,Orion splashes down to end Artemis 1,https://spacenews.com/orion-splashes-down-to-e...,Updated at 5:45 p.m. Eastern after post-splash...,Jeff Foust,"December 11, 2022",Fifty years to the day after the last Apollo m...
1,Polaris Dawn crewed mission could suffer addit...,https://spacenews.com/polaris-dawn-crewed-miss...,LAS VEGAS — A billionaire-backed private astro...,Jeff Foust,"October 25, 2022",A billionaire-backed private astronaut mission...
2,DART on track for asteroid collision,https://spacenews.com/dart-on-track-for-astero...,WASHINGTON — A NASA spacecraft is on course to...,Jeff Foust,"September 25, 2022",A NASA spacecraft is on course to deliberately...
3,U.S. Space Command calls for investment in tec...,https://spacenews.com/u-s-space-command-calls-...,"WASHINGTON — Lt. Gen. John Shaw, deputy comman...",Sandra Erwin,"August 31, 2022",U.S. Space Command's Lt. Gen. John Shaw said '...
4,SpaceX requests permission for direct-to-smart...,https://spacenews.com/spacex-requests-permissi...,"TAMPA, Fla. — SpaceX could provide “full and c...",Jason Rainbow,"December 8, 2022",SpaceX could provide “full and continuous” dir...
...,...,...,...,...,...,...
18349,Kendall lays out Pentagon thinking on future s...,https://spacenews.com/frank-kendall-at-wsbr/,"\nFrank Kendall, the Pentagon’s top acquisitio...",SpaceNews Staff,"February 25, 2016","Frank Kendall, the Pentagon’s top acquisition ..."
18350,A larger share of NOAA’s declining space budge...,https://spacenews.com/a-larger-share-of-noaas-...,Updated Feb. 10 at 10:18 p.m. Eastern The U.S....,Debra Werner,"February 10, 2016",The U.S. National Oceanic and Atmospheric Admi...
18351,Think Tank Turns Its Attention To Mars As 2016...,https://spacenews.com/think-tank-turns-its-att...,WASHINGTON — As NASA develops a long-term stra...,Jeff Foust,"June 11, 2015",As NASA develops a long-term strategy to suppo...
18352,House Bill Leaves Last Three JPSS Satellites i...,https://spacenews.com/no-money-for-noaa-weathe...,WASHINGTON — A spending bill the House passed ...,Dan Leone,"June 4, 2015",A spending bill the House passed June 3 would ...


The most useful column for topic modeling is **content**, so the analysis focuses on it.

First, we compute the number of words in each document and look at the mean, minimum, and maximum:

In [146]:
data['wordcounter'] = data['content'].apply(lambda x: len(str(x).split()))
data.describe()

Unnamed: 0,wordcounter
count,18354.0
mean,624.80146
std,366.239719
min,0.0
25%,388.0
50%,593.0
75%,781.0
max,12555.0


Next, we normalize the whitespace in the content and check how many documents have no content at all.

In [147]:
data['content'] = data['content'].replace(r'\b\w{16,}\b|\s+|\\n', ' ', regex=True)
data.replace(' ', np.nan, inplace=True)
print(data['content'].isnull().sum())


169
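
To make the regular expression in the cell above easier to follow, the sketch below applies the same pattern with Python's `re` module to two made-up strings. The helper name `normalize` and the sample strings are illustrative and are not part of the notebook.

```python
import re

# Same pattern as in the cell above: words of 16 or more characters,
# runs of whitespace, and literal "\n" sequences (a backslash followed by n).
PATTERN = re.compile(r'\b\w{16,}\b|\s+|\\n')

def normalize(text):
    """Replace every match of PATTERN with a single space."""
    return PATTERN.sub(' ', text)

print(repr(normalize("a  b\nc")))   # 'a b c'
print(repr(normalize(" \n\t ")))    # ' '
```

A whitespace-only article collapses to a single space, which is exactly the value that `data.replace(' ', np.nan)` then turns into a missing value.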


In [148]:
missing_news = data[data['content'].isnull()]

Let's check on some news where the content is missing:

In [149]:
missing_news.sample(10)

Unnamed: 0,title,url,content,author,date,postexcerpt,wordcounter
15118,Equity Question Only Snag for Loral’s Plan To ...,https://spacenews.com/equity-question-only-sna...,,Peter B. de Selding,"November 11, 2011",PARIS — Loral Space and Communications expects...,1
18290,SpaceX’s Musk Lands on Forbes Billionaire List...,https://spacenews.com/spacexs-musk-lands-forbe...,,Forbes,"October 1, 2011","Elon Musk, founder and chief executive of Spac...",1
17306,The Bottom Line | Launchers’ Siren Call,https://spacenews.com/the-bottom-line-launcher...,,Peter B. de Selding,"May 27, 2016",The head of one of the companies that would be...,1
13863,"Big Support Contracts, Small Science Missions ...",https://spacenews.com/33374big-support-contrac...,,Dan Leone,"January 28, 2013",After awarding big-money human spaceflight con...,1
16226,Astrium Unit To Design Lunar Lander for ESA,https://spacenews.com/astrium-unit-design-luna...,,SpaceNews Staff,"September 17, 2010",Astrium Space Transportation will complete mis...,1
16865,U.S. Air Force Will Pay To Place SBSS Satellit...,https://spacenews.com/us-air-force-will-pay-pl...,,Turner Brinton,"January 4, 2010",WASHINGTON — The U.S. Air Force will contract ...,1
17907,Q&A with U.S. Air Force Secretary Deborah Lee ...,https://spacenews.com/qa-with-u-s-air-force-se...,,Mike Gruss,"October 27, 2015",Deborah James is the Pentagon’s first principa...,1
10734,Special Coverage of GEOINT 2015 Symposium,https://spacenews.com/news-from-the-u-s-geospa...,,SpaceNews Staff,"June 22, 2015",The Geoint 2015 Symposiumfeatured the usual he...,1
15149,Lockheed Nabs U.S. Army Aerostat Service Contract,https://spacenews.com/lockheed-nabs-us-army-ae...,,SpaceNews Staff,"November 2, 2011",The U.S. Army awarded Lockheed Martin a $383 m...,1
17853,The growing imperative of commercialization,https://spacenews.com/the-growing-imperative-o...,,Intelsat General,"September 20, 2016",Globalization has made the world a smaller pla...,1
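
The `wordcounter` value of 1 in these rows is most likely a side effect of how the count was computed: the column was created before the missing values were handled, and `str()` of a missing value is the string `'nan'`, which counts as one word. A minimal illustration, using a plain `float('nan')` as a stand-in for the pandas missing value (the helper name `count_words` is hypothetical):

```python
def count_words(value):
    # Same expression as the lambda used to build `wordcounter`.
    return len(str(value).split())

print(count_words(float('nan')))            # 1, because str(nan) == 'nan'
print(count_words("   "))                   # 0, whitespace-only content
print(count_words("Orion splashes down"))   # 3
```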


Rows with no content available are not very useful for us.

Several approaches could be used to work around this problem:

- Use the post excerpt to replace the content;
- Use the title as a replacement;
- Delete the rows with missing values.
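
The first two options amount to a fallback chain: use the content if present, otherwise the excerpt, otherwise the title. The sketch below shows that idea on plain Python values; the function name `pick_text` is hypothetical, and the notebook itself takes the third option and drops the rows.

```python
def pick_text(content, excerpt, title):
    """Return the first candidate that is a non-blank string, else None."""
    for candidate in (content, excerpt, title):
        # Missing values read by pandas are floats (NaN), not strings,
        # so the isinstance check skips them.
        if isinstance(candidate, str) and candidate.strip():
            return candidate
    return None

print(pick_text(float('nan'), "PARIS: Loral expects...", "Equity Question"))
```

Under these assumptions it could be applied row by row with `data.apply(lambda row: pick_text(row['content'], row['postexcerpt'], row['title']), axis=1)`. Keep in mind that excerpts and titles are much shorter than full articles, which would affect any length-sensitive step later on.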


In [150]:
data = data.dropna(subset=['content'])
print(data['content'].isnull().sum())

0


In [151]:
# Look up the index label of the shortest document instead of hard-coding it
smallest = data['wordcounter'].idxmin()
print("This is the smallest document in the collection:")
data.loc[[smallest]]

This is the smallest document in the collection:


Unnamed: 0,title,url,content,author,date,postexcerpt,wordcounter
10363,The Week Ahead for Oct. 12,https://spacenews.com/the-week-ahead-for-oct-12/,Wednesday: Wednesday-Thursday: Friday: Friday-...,SpaceNews Staff,"October 12, 2015",The 66th International Astronautical Congress ...,4


In [152]:
# Look up the index label of the longest document instead of hard-coding it
biggest = data['wordcounter'].idxmax()
print("This is the biggest document in the collection:")
data.loc[[biggest]]

This is the biggest document in the collection:


Unnamed: 0,title,url,content,author,date,postexcerpt,wordcounter
4649,The latest COVID-19 news and event updates for...,https://spacenews.com/coronavirus-space-impacts/,Follow our reporters on Twitter for updates. S...,SpaceNews Staff,"March 22, 2020",A time line of the coronavirus pandemic's impa...,12555


In [153]:
data.at[4649, 'content']



The dataset contains several fields that are not needed for topic modeling.

Fields such as **url**, **author**, and **date** will be removed.

The dataset also includes a **postexcerpt** column, which appears to be a summary of the text in the content column. Although it could be used to make a rough guess at the topic of the text, I believe it is best not to use it for now and to rely only on the content of the article.


In [154]:
data = data.drop(['url', 'date', 'author', 'postexcerpt', 'wordcounter'], axis=1)

In [155]:
data

Unnamed: 0,title,content
0,Orion splashes down to end Artemis 1,Updated at 5:45 p.m. Eastern after post-splash...
1,Polaris Dawn crewed mission could suffer addit...,LAS VEGAS — A billionaire-backed private astro...
2,DART on track for asteroid collision,WASHINGTON — A NASA spacecraft is on course to...
3,U.S. Space Command calls for investment in tec...,"WASHINGTON — Lt. Gen. John Shaw, deputy comman..."
4,SpaceX requests permission for direct-to-smart...,"TAMPA, Fla. — SpaceX could provide “full and c..."
...,...,...
18349,Kendall lays out Pentagon thinking on future s...,"Frank Kendall, the Pentagon’s top acquisition..."
18350,A larger share of NOAA’s declining space budge...,Updated Feb. 10 at 10:18 p.m. Eastern The U.S....
18351,Think Tank Turns Its Attention To Mars As 2016...,WASHINGTON — As NASA develops a long-term stra...
18352,House Bill Leaves Last Three JPSS Satellites i...,WASHINGTON — A spending bill the House passed ...


Now only two columns remain, **title** and the **content** itself,

and we have made sure that every row has some content.

The title will be concatenated with the content to make it easier to use.

In [156]:
data['content'] = data['title'] + ' ' + data['content']
data = data.drop(['title'], axis=1)

Some details about the dataset:


In [157]:
print("Number of lines:", data.shape[0])
data

Number of lines: 18185


Unnamed: 0,content
0,Orion splashes down to end Artemis 1 Updated a...
1,Polaris Dawn crewed mission could suffer addit...
2,DART on track for asteroid collision WASHINGTO...
3,U.S. Space Command calls for investment in tec...
4,SpaceX requests permission for direct-to-smart...
...,...
18349,Kendall lays out Pentagon thinking on future s...
18350,A larger share of NOAA’s declining space budge...
18351,Think Tank Turns Its Attention To Mars As 2016...
18352,House Bill Leaves Last Three JPSS Satellites i...


In [158]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18185 entries, 0 to 18353
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  18185 non-null  object
dtypes: object(1)
memory usage: 800.2+ KB


### Now, for the data preprocessing part:


In [159]:
# For testing purposes, keep only the first 275 rows of the dataset;
# .copy() makes the subset independent, so adding columns later is safe.
data = data[:275].copy()
data

Unnamed: 0,content
0,Orion splashes down to end Artemis 1 Updated a...
1,Polaris Dawn crewed mission could suffer addit...
2,DART on track for asteroid collision WASHINGTO...
3,U.S. Space Command calls for investment in tec...
4,SpaceX requests permission for direct-to-smart...
...,...
270,NASA astronaut ready for Soyuz flight to ISS W...
271,NASA selects Falcon Heavy to launch Roman Spac...
272,NDAA compromise bill wants more focus on satel...
273,HawkEye 360 to launch satellites on Rocket Lab...


This part of the code removes **punctuation**, **stop words**, and non-**English** words from the documents.

In [160]:
data['tokens'] = data['content'].apply(lambda x: clean(x))
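
One caveat with `clean` as written: punctuation is removed after the dictionary filter, so a token such as `mission,` fails the `english_words` lookup and is dropped rather than kept as `mission`. The sketch below shows a variant that strips punctuation first. It uses a tiny hand-made vocabulary and stop-word list (`toy_vocab` and `toy_stop_words`, both hypothetical) so that it runs without the NLTK downloads. It is an alternative to consider, not the function that produced the outputs in this notebook.

```python
import string

toy_vocab = {"orion", "mission", "moon", "splash"}  # hypothetical stand-in for english_words
toy_stop_words = {"the", "to", "a", "of"}           # hypothetical stand-in for stop_words
strip_punct = str.maketrans("", "", string.punctuation)

def clean_punct_first(doc, vocab=toy_vocab, stops=toy_stop_words):
    # 1) lowercase and strip ASCII punctuation from every token
    tokens = [w.translate(strip_punct) for w in doc.lower().split()]
    # 2) drop empty strings and stop words, 3) keep dictionary words only
    return " ".join(w for w in tokens if w and w not in stops and w in vocab)

print(clean_punct_first("The Orion mission, to the Moon."))  # orion mission moon
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes and dashes in the articles would still need separate handling.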

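Next, we tokenize the text and generate bigrams: word pairs that frequently occur together are joined with an underscore (for example, `space_station`).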

In [161]:
data['tokens'] = data['tokens'].apply(word_tokenize)

sentences = data['tokens'].tolist()

bigram_model = Phrases(sentences, min_count=4)
bigram_phraser = Phraser(bigram_model)

def apply_bigrams(tokens):
    return ' '.join(list(bigram_phraser[tokens]))

data['tokens'] = data['tokens'].apply(apply_bigrams)
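
To give an intuition for how `Phrases` decides which pairs to join, the sketch below reimplements gensim's default scoring rule in plain Python: `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`, with a pair accepted when the score exceeds a threshold (gensim's default threshold is 10.0). This is a simplified stand-in, not gensim's code: the function name `score_bigrams` is hypothetical, and `vocab_size` here counts distinct single words only.

```python
from collections import Counter

def score_bigrams(sentences, min_count=4, threshold=10.0):
    """Return {(a, b): score} for adjacent pairs whose score exceeds threshold."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: distinct single words only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Toy corpus; the threshold is lowered because the vocabulary is tiny.
toy = [["falcon", "heavy", "launch"]] * 6 + [["launch", "delay"]] * 2
print(score_bigrams(toy, min_count=4, threshold=0.2))
```

On this toy corpus only `("falcon", "heavy")` passes: `heavy launch` is penalized because `launch` also appears elsewhere, and `launch delay` occurs fewer than `min_count` times, which makes its score negative.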

In [162]:
def filter_words_with_underscore(text):
    # `tokens` avoids shadowing the `words` corpus imported from nltk
    tokens = text.split()
    return {token for token in tokens if '_' in token}

data['bigrams'] = data['tokens'].apply(lambda x: filter_words_with_underscore(x))
data

Unnamed: 0,content,tokens,bigrams
0,Orion splashes down to end Artemis 1 Updated a...,end eastern fifty day last moon mission touche...,"{distant_retrograde, heat_shield, associate_ad..."
1,Polaris Dawn crewed mission could suffer addit...,dawn mission could suffer additional las priva...,"{fourth_quarter, first_flight, falcon_heavy, c..."
2,DART on track for asteroid collision WASHINGTO...,dart track asteroid collision course deliberat...,"{near_earth, planetary_defense, dress_rehearsa..."
3,U.S. Space Command calls for investment in tec...,space_command investment deep space deputy com...,"{commander_space, space_station, machine_learn..."
4,SpaceX requests permission for direct-to-smart...,permission service could provide across much g...,"{per_second, federal_commission, use_spectrum,..."
...,...,...,...
270,NASA astronaut ready for Soyuz flight to ISS W...,astronaut ready flight astronaut flying intern...,"{space_station, would_allow, took_place}"
271,NASA selects Falcon Heavy to launch Roman Spac...,falcon_heavy launch space_telescope selected l...,"{top_priority, government_accountability, make..."
272,NDAA compromise bill wants more focus on satel...,compromise bill focus satellite responsive lau...,"{house_armed, national_guard, national_securit..."
273,HawkEye 360 to launch satellites on Rocket Lab...,launch rocket first mission soil provider radi...,"{national_reconnaissance, commercially_availab..."


I've created a new column to keep track of the bigrams that were generated, alongside the original text.

Now we apply the lemmatizer, which reduces each word to its base (dictionary) form.

In [163]:
data['tokens'] = data['tokens'].apply(lemmatize)

In [164]:
data['tokens'] = data['tokens'].apply(word_tokenize)
data

Unnamed: 0,content,tokens,bigrams
0,Orion splashes down to end Artemis 1 Updated a...,"[end, eastern, fifty, day, last, moon, mission...","{distant_retrograde, heat_shield, associate_ad..."
1,Polaris Dawn crewed mission could suffer addit...,"[dawn, mission, could, suffer, additional, las...","{fourth_quarter, first_flight, falcon_heavy, c..."
2,DART on track for asteroid collision WASHINGTO...,"[dart, track, asteroid, collision, course, del...","{near_earth, planetary_defense, dress_rehearsa..."
3,U.S. Space Command calls for investment in tec...,"[space_command, investment, deep, space, deput...","{commander_space, space_station, machine_learn..."
4,SpaceX requests permission for direct-to-smart...,"[permission, service, could, provide, across, ...","{per_second, federal_commission, use_spectrum,..."
...,...,...,...
270,NASA astronaut ready for Soyuz flight to ISS W...,"[astronaut, ready, flight, astronaut, fly, int...","{space_station, would_allow, took_place}"
271,NASA selects Falcon Heavy to launch Roman Spac...,"[falcon_heavy, launch, space_telescope, select...","{top_priority, government_accountability, make..."
272,NDAA compromise bill wants more focus on satel...,"[compromise, bill, focus, satellite, responsiv...","{house_armed, national_guard, national_securit..."
273,HawkEye 360 to launch satellites on Rocket Lab...,"[launch, rocket, first, mission, soil, provide...","{national_reconnaissance, commercially_availab..."


### With the data preprocessed, it is time to train the model and extract the topics

In [169]:
def printTopics(model, p=None):
	# p=None: plain topic models
	# p=1: models with timepoints (e.g. tp.DTModel); prints timepoint 0
	# any other value: models with sub-topics (e.g. tp.PAModel)
	for k in range(model.k):
		print('Topic #{}'.format(k))
		if p is None:
			for word, prob in model.get_topic_words(topic_id=k,top_n=10):
				print(' ', word, prob, sep=' ')
		elif p==1:
			for word, prob in model.get_topic_words(topic_id=k,top_n=10,timepoint=0):
				print(' ', word, prob, sep=' ')
		else:
			for word, prob in model.get_topic_words(sub_topic_id=k,top_n=10):
				print(' ', word, prob, sep=' ')


def printCoherence(model):
	for preset in ('u_mass', 'c_uci', 'c_npmi', 'c_v'):
		coh = tp.coherence.Coherence(model, coherence=preset)
		average_coherence = coh.get_score()
		coherence_per_topic = [coh.get_score(topic_id=k) for k in range(model.k)]
		print('==== Coherence : {} ===='.format(preset))
		print('Average:', average_coherence, '\nPer Topic:', coherence_per_topic)
		print()

def runModel(model, document):
	# Add every tokenized document to the model, printing a running counter
	for i,d in enumerate(document):
		print(i,end='')
		model.add_doc(d)
		print('\r\r\r\r\r\r\r\r\r\r',end='')
	print()
	model.burn_in = 100
	# train(0) initializes the model so the vocabulary statistics become available
	model.train(0)
	print('Num document:', len(model.docs), ', Vocab size:', len(model.used_vocabs), ', Num words:', model.num_words)
	print('Removed top words:', model.removed_top_words)
	print('Training...', file=sys.stderr, flush=True)

	# 1000 iterations in steps of 5, reporting the log-likelihood every 50
	for i in range(0, 1000, 5):
		model.train(5)
		if i % 50 == 0:
			print("Iteration: {}\tLog-likelihood: {}".format(i, model.ll_per_word))

	model.summary()
	print('Saving...', file=sys.stderr, flush=True)
	# The second argument (full=True) also stores the training documents
	model.save('test.lda.bin', True)


print('************************** Running the model ***************************')
model = tp.DMRModel(tw=tp.TermWeight.IDF, min_cf=15, min_df=3, rm_top=10, k=7, seed=777) #,corpus=cp
runModel(model, data['tokens'])

************************** Running the model ***************************
012345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777

  model.train(0)
Training...


Num document: 275 , Vocab size: 695 , Num words: 29769
Removed top words: ['space', 'say', 'launch', 'satellite', 'commercial', 'also', 'company', 'new', 'would', 'mission']
Iteration: 0	Log-likelihood: -7.278748127916662
Iteration: 50	Log-likelihood: -6.685687276042972
Iteration: 100	Log-likelihood: -6.63070755757772
Iteration: 150	Log-likelihood: -6.55543886939229
Iteration: 200	Log-likelihood: -6.528613183674047
Iteration: 250	Log-likelihood: -6.51172242548123
Iteration: 300	Log-likelihood: -6.487740920697614
Iteration: 350	Log-likelihood: -6.487471281797593
Iteration: 400	Log-likelihood: -6.489317846540648
Iteration: 450	Log-likelihood: -6.457995363532986
Iteration: 500	Log-likelihood: -6.45086856504775
Iteration: 550	Log-likelihood: -6.447235239292348
Iteration: 600	Log-likelihood: -6.450234761046867
Iteration: 650	Log-likelihood: -6.450902423016353
Iteration: 700	Log-likelihood: -6.45449370109891
Iteration: 750	Log-likelihood: -6.440560319844404
Iteration: 800	Log-likelihood: -6.

Saving...


<Basic Info>
| DMRModel (current version: 0.12.4)
| 275 docs, 29769 words
| Total Vocabs: 4444, Used Vocabs: 695
| Entropy of words: 6.26266
| Entropy of term-weighted words: 6.46106
| Removed Vocabs: space say launch satellite commercial also company new would mission
|
<Training Info>
| Iterations: 1000, Burn-in steps: 100
| Optimization Interval: 10
| Log-likelihood per word: -6.44180
|
<Initial Parameters>
| tw: TermWeight.IDF
| min_cf: 15 (minimum collection frequency of words)
| min_df: 3 (minimum document frequency of words)
| rm_top: 10 (the number of top words to be removed)
| k: 7 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (an initial value of exponential of mean of normal distribution for `lambdas`, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic - word)
| sigma: 1.0 (standard deviation of normal distribution for `lambdas`)
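
The gap between "Total Vocabs: 4444" and "Used Vocabs: 695" in the summary comes from the `min_cf`, `min_df`, and `rm_top` settings. The sketch below illustrates that kind of vocabulary pruning in plain Python. It is not tomotopy's implementation: the function name `prune_vocab` is hypothetical, and the ranking by collection frequency and the alphabetical tie-breaking are assumptions made for the example.

```python
from collections import Counter

def prune_vocab(docs, min_cf=0, min_df=0, rm_top=0):
    """Return (used_vocab, removed_top) after frequency-based pruning."""
    cf = Counter(w for d in docs for w in d)       # collection frequency
    df = Counter(w for d in docs for w in set(d))  # document frequency
    kept = [w for w in cf if cf[w] >= min_cf and df[w] >= min_df]
    kept.sort(key=lambda w: (-cf[w], w))           # most frequent first (assumed order)
    removed_top = kept[:rm_top]
    return set(kept[rm_top:]), removed_top

toy_docs = [
    ["space", "launch", "rocket"],
    ["space", "launch", "moon"],
    ["space", "rocket", "rare"],
]
used, removed = prune_vocab(toy_docs, min_cf=2, min_df=2, rm_top=1)
print(sorted(used), removed)  # ['launch', 'rocket'] ['space']
```

In the toy corpus, `moon` and `rare` fall below the frequency thresholds, and `space` is removed as the single most frequent word, mirroring how very common words such as `space` and `launch` were removed from the real model.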


In [170]:
printTopics(model)

Topic #0
  space_force 0.027024520561099052
  dod 0.01770210824906826
  report 0.017060456797480583
  national 0.016873003914952278
  billion 0.016615282744169235
  government 0.016434606164693832
  defense 0.016390910372138023
  national_security 0.014378569088876247
  iridium 0.013687397353351116
  committee 0.01340295560657978
Topic #1
  rocket 0.03878680616617203
  test 0.028541699051856995
  long_march 0.024965494871139526
  china 0.020284520462155342
  engine 0.01863938197493553
  vehicle 0.018459312617778778
  dawn 0.01737215928733349
  rocket_lab 0.017322294414043427
  launch_vehicle 0.0158975962549448
  eastern 0.015336071141064167
Topic #2
  lunar 0.06205111742019653
  spire 0.04135410115122795
  space_station 0.040183138102293015
  solar 0.03807360678911209
  science 0.030686354264616966
  astronaut 0.028095237910747528
  cargo 0.028069546446204185
  rover 0.028062835335731506
  ministerial 0.026913993060588837
  axiom 0.026708785444498062
Topic #3
  asteroid 0.0604531243443

In [167]:
printCoherence(model)

==== Coherence : u_mass ====
Average: -3.1733040213873585 
Per Topic: [-1.6634621647389953, -9.09541588744845, -4.588003374213069, -2.151832932562305, -1.816671272758877, -1.2143436255178204, -1.6833988924719927]

==== Coherence : c_uci ====
Average: -3.648597479824055 
Per Topic: [-2.764592304888015, -7.788357704872855, -5.773093289678366, -1.4622934186701906, -4.832290956942592, -0.8473757656718609, -2.072178918044507]

==== Coherence : c_npmi ====
Average: -0.0852273249091701 
Per Topic: [-0.07864312266827682, -0.2334968131764477, -0.15043666283213925, 0.04369712540625807, -0.13237983373568882, 0.022664539204169616, -0.06799650656206589]

==== Coherence : c_v ====
Average: 0.6228650680610112 
Per Topic: [0.535642558336258, 0.6688781023025513, 0.6550076544284821, 0.7618832349777221, 0.6776309728622436, 0.5783081442117691, 0.48270480930805204]
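
As an aid for reading the `c_npmi` values: NPMI for a word pair lies between -1 and 1, is 0 when the two words occur independently, and is negative when they co-occur less often than chance. The sketch below computes it from document-level co-occurrence on a made-up corpus. tomotopy's `Coherence` class uses its own probability estimation and aggregates over each topic's top words, so this only illustrates the quantity; it does not reproduce the numbers above. The function name `npmi` and the toy documents are hypothetical.

```python
import math

def npmi(word_a, word_b, docs):
    """NPMI of two words, using the fraction of documents containing each."""
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur: lower bound by convention
    if p_ab == 1:
        return 1.0   # both words in every document: limit of perfect co-occurrence
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

toy_docs = [{"lunar", "rover"}, {"lunar", "rover"}, {"budget"}, {"budget", "lunar"}]
print(npmi("lunar", "rover", toy_docs))   # positive: the pair co-occurs more than chance
print(npmi("lunar", "budget", toy_docs))  # negative: the pair co-occurs less than chance
print(npmi("rover", "budget", toy_docs))  # -1.0: the pair never co-occurs
```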

