# NLP ASTRONOMY PROJECT

**Authors**
- Grzegorz Kaminski
- Maciej Lewiński
- Shourya Bhandari
- Toma Vučinić
- Jakub Wysocki

This project investigates three research questions about the r/Astronomy subreddit:
1. RQ1: What features and functionalities do users value most in stargazing apps and astronomy software?
2. RQ2: What hardware specifications are discussed and desired most frequently by the community?
3. RQ3: What locations are most frequently recommended or discussed for photographing astronomical objects or events?

In [1]:
# General import statements
import json
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

True

# DATASET INSPECTION

In [2]:
from HelperFunctions import load_ndjson

comments_file = "astronomy_comments.ndjson"
submissions_file = "astronomy_submissions.ndjson"

# Pandas dataframe storing comments and submissions
comments_df = load_ndjson(comments_file)
submissions_df = load_ndjson(submissions_file)

print(f"Successfully loaded Reddit comments. Our dataset includes {comments_df.shape[0]} Reddit comments \n")
print(f"Successfully loaded Reddit submissions. Our dataset includes {submissions_df.shape[0]} Reddit submissions\n")

Successfully loaded Reddit comments. Our dataset includes 64271 Reddit comments 

Successfully loaded Reddit submissions. Our dataset includes 8058 Reddit submissions
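The `load_ndjson` helper lives in `HelperFunctions` and is not shown in this notebook. A minimal sketch of how such a loader could work is given below, assuming each line of the file holds one JSON object; the body is an illustration, not the project's actual implementation.

```python
import json

import pandas as pd


def load_ndjson(path):
    """Read a newline-delimited JSON file into a DataFrame (assumed behaviour)."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return pd.DataFrame(records)
```

pandas can also read this format directly with `pd.read_json(path, lines=True)`.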



In [3]:
print("Comments Keys:\n")
print(comments_df.columns.tolist())
print("Comments Sample:\n")
comments_df.head()

Comments Keys:

['all_awardings', 'archived', 'associated_award', 'author', 'author_created_utc', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'body', 'can_gild', 'collapsed', 'collapsed_because_crowd_control', 'collapsed_reason', 'collapsed_reason_code', 'comment_type', 'controversiality', 'created_utc', 'distinguished', 'edited', 'gilded', 'gildings', 'id', 'is_submitter', 'link_id', 'locked', 'name', 'no_follow', 'parent_id', 'permalink', 'retrieved_on', 'score', 'score_hidden', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_name_prefixed', 'subreddit_type', 'top_awarded_type', 'total_awards_received', 'treatment_tags', 'unrepliable_reason', 'author_cakeday', 'editable']
Comments Sample:



Unnamed: 0,all_awardings,archived,associated_award,author,author_created_utc,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_type,top_awarded_type,total_awards_received,treatment_tags,unrepliable_reason,author_cakeday,editable
0,[],False,,rara0o,1627269000.0,,,[],,,...,Astronomy,t5_2qhor,r/Astronomy,public,,0,[],,,
1,[],False,,The_Dead_See,1364700000.0,,,[],,,...,Astronomy,t5_2qhor,r/Astronomy,public,,0,[],,,
2,[],False,,CeaselessStumblingOn,1610957000.0,,,[],,,...,Astronomy,t5_2qhor,r/Astronomy,public,,0,[],,,
3,[],False,,[deleted],,,,,,,...,Astronomy,t5_2qhor,r/Astronomy,public,,0,[],,,
4,[],False,,Ok-Golf-58,1601784000.0,,,[],,,...,Astronomy,t5_2qhor,r/Astronomy,public,,0,[],,,


In [4]:
print("Submissions Keys:\n")
print(submissions_df.columns.tolist())
print("Submissions Sample:\n")
submissions_df.head()

Submissions Keys:

['all_awardings', 'allow_live_comments', 'archived', 'author', 'author_created_utc', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'awarders', 'banned_by', 'can_gild', 'can_mod_post', 'category', 'content_categories', 'contest_mode', 'created_utc', 'discussion_type', 'distinguished', 'domain', 'edited', 'gilded', 'gildings', 'hidden', 'hide_score', 'id', 'is_created_from_ads_ui', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media', 'media_embed', 'media_only', 'name', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'quarantine', 'removed_by', 'removed_by_category', 'retrieved_on', 'ret

Unnamed: 0,all_awardings,allow_live_comments,archived,author,author_created_utc,author_flair_background_color,author_flair_css_class,author_flair_template_id,author_flair_text,author_flair_text_color,...,author_fullname,author_patreon_flair,author_premium,post_hint,preview,gallery_data,is_gallery,media_metadata,author_cakeday,call_to_action
0,[],False,False,[deleted],,,,,,dark,...,,,,,,,,,,
1,[],False,False,[deleted],,,,,,dark,...,,,,,,,,,,
2,[],False,False,Soliastro,1578060000.0,,,,,,...,t2_5d0qj1f3,False,False,,,,,,,
3,[],False,False,Walden_Schmidt,1648920000.0,,,,,,...,t2_lg91j5ak,False,False,,,,,,,
4,[],False,False,Mammoth_Lifeguard_57,1653326000.0,,,,,,...,t2_njc2v380,False,False,,,,,,,


From the inspection we can see that:
1. The dataset contains 64271 comments
2. The dataset contains 8058 submissions
3. The relevant keys in the comments include {'id', 'author', 'link_id', 'parent_id', 'created_utc', 'body', 'score'}
4. The relevant keys in the submissions include {'title', 'selftext', 'created_utc', 'url', 'score', 'num_comments'}

# PREPROCESSING

Given the raw comments and submissions, the following preprocessing steps are applied:
1. Only the relevant keys for the comments and submissions are retained
2. Any rows with '[deleted]' or '[removed]' in text fields are removed
3. Timestamps are converted to datetime format to allow analysis of temporal behaviour
4. Preparation and loading: the cleaned dataframes are loaded into the `comments` and `submissions` structures used in the remaining steps
5. Regular expression analysis: URL domains found in texts that mention software-related terms are stored for RQ1
6. Text preprocessing: expanding contractions, case folding, removing links, removing punctuation, removing extra whitespace, tokenization, lemmatization

## Retaining Relevant Keys

In [5]:
from Preprocessing import retain_relevant_keys

comments_df, submissions_df = retain_relevant_keys(comments_df, submissions_df)

Sanity Check
Relevant keys for Comments:  Index(['id', 'author', 'link_id', 'parent_id', 'created_utc', 'body', 'score'], dtype='object')
Relevant keys for Submissions:  Index(['title', 'selftext', 'created_utc', 'url', 'score', 'num_comments'], dtype='object')
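For reference, column selection of this kind can be sketched as follows. The key lists are taken from the sanity check output above; the function body is an assumption, since `Preprocessing.retain_relevant_keys` is not shown here.

```python
import pandas as pd

# Key lists copied from the sanity check output above.
COMMENT_KEYS = ['id', 'author', 'link_id', 'parent_id', 'created_utc', 'body', 'score']
SUBMISSION_KEYS = ['title', 'selftext', 'created_utc', 'url', 'score', 'num_comments']


def retain_relevant_keys(comments_df, submissions_df):
    """Keep only the columns needed later (assumed behaviour of the helper)."""
    # .copy() avoids modifying the original frames in later steps.
    return comments_df[COMMENT_KEYS].copy(), submissions_df[SUBMISSION_KEYS].copy()
```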


## Removing Deleted Text Fields

In [6]:
from Preprocessing import remove_deleted_text_fields

comments_df, submissions_df = remove_deleted_text_fields(comments_df, submissions_df)

Sanity Check
Number of deleted comments after removal: 0
Number of deleted submissions after removal: 0
Number of comments after removal: 58807
Number of submissions after removal: 683


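Note: Removal reduces the number of submissions from 8058 to 683, a drop of about 91.5%. By comparison, the comments drop by only about 8.5% (from 64271 to 58807).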
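The filter itself is implemented in `Preprocessing` and not shown. The sketch below illustrates one way to drop marker rows, under the assumption that only `body` (comments) and `selftext` (submissions) are checked for the exact strings '[deleted]' and '[removed]'. The helper name `drop_deleted` is hypothetical, and whether the real function also drops empty or missing text is not known from this notebook.

```python
import pandas as pd

DELETED_MARKERS = {"[deleted]", "[removed]"}


def drop_deleted(df, text_columns):
    """Drop rows whose text column equals a deletion marker (hypothetical helper)."""
    mask = pd.Series(False, index=df.index)
    for column in text_columns:
        # isin() is False for missing values, so NaN rows are kept here.
        mask |= df[column].isin(DELETED_MARKERS)
    return df[~mask].reset_index(drop=True)


def remove_deleted_text_fields(comments_df, submissions_df):
    """Assumed behaviour: filter comments on 'body' and submissions on 'selftext'."""
    return drop_deleted(comments_df, ["body"]), drop_deleted(submissions_df, ["selftext"])
```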

## Convert UTC timestamps to datetime format

In [7]:
from Preprocessing import convert_timestamps_to_datetime

comments_df, submissions_df = convert_timestamps_to_datetime(comments_df, submissions_df)
comments_df['created_utc'] = comments_df['created_utc'].astype(str)
submissions_df['created_utc'] = submissions_df['created_utc'].astype(str)

Note: The datetime values are converted to strings so that they can be serialized to JSON.
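A minimal sketch of the conversion is shown below, assuming `created_utc` holds numeric Unix epoch seconds; the actual `convert_timestamps_to_datetime` is not shown and may differ. The string cast afterwards is needed because `json.dumps` raises a `TypeError` on pandas `Timestamp` objects.

```python
import pandas as pd


def convert_timestamps_to_datetime(comments_df, submissions_df):
    """Convert epoch seconds in 'created_utc' to datetimes (assumed behaviour)."""
    comments_df = comments_df.copy()
    submissions_df = submissions_df.copy()
    for df in (comments_df, submissions_df):
        df["created_utc"] = pd.to_datetime(df["created_utc"], unit="s")
    return comments_df, submissions_df
```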

## Preparation and Loading Dataset

In [8]:
from HelperFunctions import load_reddit_data

comments, submissions = load_reddit_data(comments_df, submissions_df)
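The later cells index `comments['body'][i]` and assign empty lists to new keys such as `comments['sentences']`, which suggests that `load_reddit_data` returns plain dictionaries of lists rather than DataFrames. That is an inference, not something this notebook confirms; under that assumption the helper could be as small as:

```python
import pandas as pd


def load_reddit_data(comments_df, submissions_df):
    """Turn each DataFrame into a dict mapping column name to list (assumed behaviour)."""
    comments = comments_df.to_dict(orient="list")
    submissions = submissions_df.to_dict(orient="list")
    return comments, submissions
```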

## Regular Expression Analysis: Store URLs from texts that mention software-related terms for RQ1

In [9]:
from HelperFunctions import search_pattern

software_related_terms_regex = r'\b[Aa]pps?\b|\b[Aa]pplications?\b|\b[Ss]oftware\b|\b[Pp]rograms?\b|\b[Pp]latforms?\b|\b[Tt]ools?\b'

comments_mentioning_software = search_pattern(software_related_terms_regex, comments['body'])
submissions_mentioning_software = search_pattern(software_related_terms_regex, submissions['selftext'])
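`search_pattern` is defined in `HelperFunctions`. A plausible sketch, assuming it returns the texts in which the pattern occurs at least once, is:

```python
import re


def search_pattern(pattern, texts):
    """Return the texts that contain at least one match (assumed behaviour)."""
    compiled = re.compile(pattern)
    # Non-string entries (e.g. missing values) are skipped.
    return [text for text in texts if isinstance(text, str) and compiled.search(text)]
```

Note that the pattern above only covers a lower-case or capitalised first letter, so fully upper-case forms such as "APP" would not match; compiling a lower-case pattern with `re.IGNORECASE` would be one way to cover them.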

In [10]:
from HelperFunctions import extract_urls

urls_in_comments_mentioning_software = extract_urls(comments_mentioning_software)
urls_in_submissions_mentioning_software = extract_urls(submissions_mentioning_software)

print(f"Overall we find {len(urls_in_comments_mentioning_software)} URLs in the comments mentioning software")
print("Some examples below...")
print(urls_in_comments_mentioning_software[:2])

print(f"Overall we find {len(urls_in_submissions_mentioning_software)} URLs in the submissions mentioning software")
print("Some examples below...")
print(urls_in_submissions_mentioning_software[:2])

Overall we find 639 URLs in the comments mentioning software
Some examples below...
['https://play.google.com/store/apps/details?id=com.waddensky.nightshift&amp;referrer=utm_source%3Dreddit%26utm_medium%3Dforum%26utm_campaign%3Dpost).', 'http://heavens-above.com/)']
Overall we find 32 URLs in the submissions mentioning software
Some examples below...
['https://youtu.be/2xZgCVG_Bzk', 'https://youtu.be/fakankfe8co']
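The first sample URL in the output above ends in `).` and contains `&amp;`, which suggests that Markdown punctuation and HTML entities are carried along with the extracted URLs. The sketch below shows one possible cleaning step; `clean_url` is a hypothetical helper, and this `extract_urls` is an illustration rather than the version in `HelperFunctions`.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")


def clean_url(url):
    """Decode HTML entities and strip trailing punctuation (hypothetical helper)."""
    url = html.unescape(url)
    while url:
        if url[-1] in ".,;:!?]'\"":
            url = url[:-1]
        elif url[-1] == ")" and url.count(")") > url.count("("):
            # Drop only unbalanced closing parentheses, e.g. from Markdown links.
            url = url[:-1]
        else:
            break
    return url


def extract_urls(texts):
    """Collect cleaned URLs from a list of texts (illustrative version)."""
    urls = []
    for text in texts:
        for match in URL_RE.findall(text):
            cleaned = clean_url(match)
            if cleaned:
                urls.append(cleaned)
    return urls
```

Parentheses are stripped only when unbalanced so that URLs which legitimately end in `)` stay intact.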


In [11]:
from HelperFunctions import extract_domain
from collections import Counter

domains_in_comments_mentioning_software = [extract_domain(x) for x in urls_in_comments_mentioning_software]
print("Most popular domains in comments mentioning software")
print(Counter(domains_in_comments_mentioning_software).most_common(20))

domains_in_submissions_mentioning_software = [extract_domain(x) for x in urls_in_submissions_mentioning_software]
print("Most popular domains in submissions mentioning software")
print(Counter(domains_in_submissions_mentioning_software).most_common(20))

Most popular domains in comments mentioning software
[('instagram.com', 43), ('wikipedia.org', 38), ('reddit.com', 36), ('imgur.com', 30), ('youtube.com', 29), ('nasa.gov', 27), ('stsci.edu', 24), ('google.com', 22), ('github.com', 22), ('youtu.be', 19), ('stellarium.org', 18), ('', 16), ('flickr.com', 12), ('lis.tn', 12), ('goo.gl', 10), ('amazon.com', 10), ('prabhuastrophotography.com', 10), ('astrobin.com', 9), ('skyandtelescope.org', 9), ('astrob.in', 8)]
Most popular domains in submissions mentioning software
[('youtu.be', 4), ('github.com', 4), ('redd.it', 4), ('nasa.gov', 3), ('imgur.com', 2), ('reddit.com', 1), ('pinimg.com', 1), ('shopify.com', 1), ('instructables.com', 1), ('office.com', 1), ('nsf.gov', 1), ('', 1), ('noaa.gov', 1), ('discoverhawaiiscience.com', 1), ('coursera.org', 1), ('ucsd.edu', 1), ('wikipedia.org', 1), ('xkcd.com', 1), ('mapthematics.com', 1), ('github.io', 1)]
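Both counts above include an empty string `''` as a domain, which is what a URL without a parseable host would produce. The following sketch shows how such cases can arise and how they could be filtered before counting. It assumes the domain is taken as the last two labels of the hostname, which matches entries such as `wikipedia.org` and `github.io` above but is too naive for suffixes like `co.uk`; `count_domains` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse


def extract_domain(url):
    """Return the last two labels of the hostname, or '' if there is none."""
    try:
        hostname = urlparse(url).hostname or ""
    except ValueError:  # raised for malformed URLs such as an unclosed '['
        return ""
    return ".".join(hostname.split(".")[-2:])


def count_domains(urls):
    """Count domains while skipping URLs with no usable host (hypothetical helper)."""
    return Counter(domain for domain in map(extract_domain, urls) if domain)
```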


## Text Preprocessing

### Obtaining sentences and words in the reddit comments

The sentences for each comment are stored in comments['sentences'] (type: list)

The words for each comment are stored in comments['words'] (type: list)

All the sentences present in the reddit comments are stored in *all_comments_sentences*

All the words present in the reddit comments are stored in *all_comments_words*

In [12]:
from Preprocessing import preprocess_text

comments['sentences'] = []
comments['words'] = []
all_comments_sentences = []
all_comments_words = []

for body in comments['body']:
    sentences, words = preprocess_text(body)
    comments['sentences'].append(sentences)
    comments['words'].append(words)
    all_comments_sentences.extend(sentences)
    all_comments_words.extend(words)
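`preprocess_text` is imported from `Preprocessing` and not shown. To make the listed cleaning steps concrete, the sketch below implements a simplified stand-in using only the standard library: a regex sentence split and whitespace tokenization replace the NLTK tokenizers, lemmatization is omitted, and the contraction table is a small illustrative subset. Case folding is applied first here so that the table can stay lower-case.

```python
import re
import string

# Small illustrative subset; specific forms come before the generic suffixes.
CONTRACTIONS = {
    "can't": "cannot", "won't": "will not", "n't": " not",
    "'re": " are", "'ve": " have", "'ll": " will", "'m": " am",
}
URL_RE = re.compile(r"https?://\S+|www\.\S+")
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_text(text):
    """Return (sentences, words) for one text; simplified stand-in without NLTK."""
    text = text.lower()                      # case folding
    text = URL_RE.sub(" ", text)             # remove links
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)     # expand contractions
    sentences, words = [], []
    for raw in SENTENCE_END_RE.split(text):  # naive sentence split
        cleaned = raw.translate(PUNCT_TABLE)             # remove punctuation
        cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse whitespace
        if cleaned:
            sentences.append(cleaned)
            words.extend(cleaned.split())    # whitespace tokenization
    return sentences, words
```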

### Obtaining sentences and words in the reddit submissions

The sentences for each submission are stored in submissions['sentences'] (type: list)

The words for each submission are stored in submissions['words'] (type: list)

All the sentences present in the reddit submissions are stored in *all_submissions_sentences*

All the words present in the reddit submissions are stored in *all_submissions_words*

In [13]:
from Preprocessing import preprocess_text

submissions['sentences'] = []
submissions['words'] = []
all_submissions_sentences = []
all_submissions_words = []

for selftext in submissions['selftext']:
    sentences, words = preprocess_text(selftext)
    submissions['sentences'].append(sentences)
    submissions['words'].append(words)
    all_submissions_sentences.extend(sentences)
    all_submissions_words.extend(words)

### Concatenating comments and submissions  

In [14]:
all_sentences = all_comments_sentences + all_submissions_sentences
all_words = all_comments_words + all_submissions_words
print(f"The total number of sentences and words in the entire reddit dataset is {len(all_sentences)} and {len(all_words)} respectively")
print(f"The total number of sentences and words in the reddit comments is {len(all_comments_sentences)} and {len(all_comments_words)} respectively")
print(f"The total number of sentences and words in the reddit submissions is {len(all_submissions_sentences)} and {len(all_submissions_words)} respectively")

The total number of sentences and words in the entire reddit dataset is 127327 and 840347 respectively
The total number of sentences and words in the reddit comments is 123835 and 808076 respectively
The total number of sentences and words in the reddit submissions is 3492 and 32271 respectively


# Saving preprocessed data to NDJSON and text files

In [15]:
from HelperFunctions import save_comments_to_json, save_submissions_to_json

save_comments_to_json(comments, "preprocessed_comments_json_format.ndjson")
save_submissions_to_json(submissions, "preprocessed_submissions_json_format.ndjson")

In [16]:
from HelperFunctions import write_to_text_file

write_to_text_file(all_comments_sentences, "preprocessed_comments_sentences")
write_to_text_file(all_comments_words, "preprocessed_comments_words")
write_to_text_file(all_submissions_sentences, "preprocessed_submissions_sentences")
write_to_text_file(all_submissions_words, "preprocessed_submissions_words")
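The saving helpers are defined in `HelperFunctions`. As a rough sketch of what such writers might do, the code below writes one item per line to a text file and one JSON object per line to an NDJSON file. It assumes the data is a dictionary of equal-length lists; `save_columns_to_ndjson` is a hypothetical name, not the project's `save_comments_to_json`.

```python
import json


def write_to_text_file(items, path):
    """Write one item per line (assumed behaviour of the helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for item in items:
            handle.write(f"{item}\n")


def save_columns_to_ndjson(columns, path):
    """Write a dict of equal-length lists as one JSON object per row (hypothetical)."""
    keys = list(columns)
    with open(path, "w", encoding="utf-8") as handle:
        for row in zip(*(columns[key] for key in keys)):
            record = dict(zip(keys, row))
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```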