# Webscraping r/Aliens and r/ConspiracyTheories


Reddit is a social news aggregation, web content rating, and discussion website where members submit content to the site such as links, text posts, images, and videos, which are then voted up or down by other members. Posts are organized by subject into user-created boards called "communities" or "subreddits", which cover topics such as news, politics, religion, science, movies, video games, music, books, sports, fitness, cooking, pets, and image-sharing. 


In this project, I have identified two subreddits, *r/aliens* and *r/conspiracytheories*:

1) https://www.reddit.com/r/aliens/

2) https://www.reddit.com/r/conspiracytheories/


The idea is to explore the similarities and differences between these two subreddits, and how to weigh the credibility of aliens' existence against the conspiracy theories that revolve around aliens.

Humans have long searched for extraterrestrial life. There have been multiple reports of strange sightings, and programs and facilities have even been set up to look into these UFO sightings. For example: Area 51 in Nevada and Project Blue Book.

Some examples of conspiracy theories regarding aliens are: 

1) Various governments and politicians globally, in particular the Government of the United States, are suppressing evidence that unidentified flying objects are controlled by a non-human intelligence or built using alien technology.


In [10]:
# Imports
import numpy as np
import pandas as pd
import requests

# Collect Data - 10,000 posts per subreddit

Pushshift now limits each request to 100 posts.
I will collect an initial 100 posts, then repeat the request until I have 10,000 posts.

At 100 posts per request, I will have to make 100 requests per subreddit to obtain 10,000 posts.

In [586]:
base_url = 'https://api.pushshift.io/reddit/search/submission'

In [634]:
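# create base_df
def get_base_df(base_url, subreddit):
    """Request up to 100 submissions for a subreddit and return them as a DataFrame."""
    # set params
    params = {
        'subreddit': subreddit,
        'size': 100
    }

    res = requests.get(base_url, params=params)

    # Raise instead of returning an error string, so a failed request
    # is never mistaken for a DataFrame further down
    if res.status_code != 200:
        raise RuntimeError(f'Error: {res.status_code}')

    posts = res.json()['data']

    return pd.DataFrame(posts)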


# update params 
def update_params(base_df, subreddit): 
    
    params = {
    'subreddit':subreddit,
    'size': 100,
    'before':base_df.iloc[-1]['created_utc']
    }
    return params 


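# pull posts
def pull_posts(base_url, params):
    """Request one page of submissions and return the raw list of posts."""
    res = requests.get(base_url, params=params)

    # Raise instead of returning an error string, which would otherwise
    # be passed on to pd.DataFrame and fail with a confusing message
    if res.status_code != 200:
        raise RuntimeError(f'Error: {res.status_code}')

    posts = res.json()['data']

    return posts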

#convert new posts to df 

def posts_to_df(posts):
    return pd.DataFrame(posts)

# add to base_df 
def update_base_df(base_df, posts):
    frame = [base_df, posts]
    base_df = pd.concat(frame)
    return base_df

# create function to update base_df with the next 100 posts
def total_df(base_df, subreddit, base_url):

    new_params = update_params(base_df, subreddit)

    new_posts = pull_posts(base_url, new_params)

    new_df = posts_to_df(new_posts)

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    base_df = pd.concat([base_df, new_df])

    return base_df
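
The helpers above chain together into one paging loop: each request asks for posts older than the oldest `created_utc` already collected. The sketch below illustrates that loop in isolation. `collect_posts` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the Pushshift request so the example runs offline. It assumes each page comes back newest first, and it is not the code used to build the datasets in this notebook.

```python
import pandas as pd


def collect_posts(fetch_page, subreddit, target, page_size=100):
    """Page backwards in time until `target` posts are collected or no posts remain.

    fetch_page(params) must return a list of dicts, newest first,
    each containing a 'created_utc' key.
    """
    frames = []
    collected = 0
    before = None  # no cursor on the first request
    while collected < target:
        params = {'subreddit': subreddit, 'size': page_size}
        if before is not None:
            params['before'] = before
        posts = fetch_page(params)
        if not posts:  # nothing older left to fetch
            break
        frames.append(pd.DataFrame(posts))
        collected += len(posts)
        before = posts[-1]['created_utc']  # oldest post seen so far
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def fake_fetch(params):
    """Offline stand-in for the API: 250 made-up posts with descending timestamps."""
    all_times = range(1000, 750, -1)  # 1000, 999, ..., 751
    before = params.get('before', float('inf'))
    older = [t for t in all_times if t < before]
    return [{'id': f'p{t}', 'created_utc': t} for t in older[:params['size']]]


demo = collect_posts(fake_fetch, 'aliens', target=300)
print(demo.shape)  # (250, 2): the fake source runs out before the target is reached
```

Stopping when a page comes back empty means the loop ends cleanly if a subreddit has fewer posts than the target.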

# 1) Create base dataframe for r/conspiracytheories posts
## Pulling the first 100 posts

In [635]:
#Set up base df 

base_df_ct = get_base_df('https://api.pushshift.io/reddit/search/submission', 'conspiracytheories')

In [645]:
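# Execution count 645 is later than the collection loops below, so this output
# reflects a re-run after all pages were appended, not the initial 100-post pull.
base_df_ct.shape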

(9999, 85)

In [637]:
base_df_ct.head()

|  | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | media_embed | secure_media | secure_media_embed | author_flair_template_id | author_flair_text_color | gallery_data | is_gallery | media_metadata | crosspost_parent | crosspost_parent_list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | Aintsosimple |  | [] |  | text | t2_12wjkbi | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | [] | False | Numerous_Cut_5410 |  | [] |  | text | t2_gro4ujih | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | [] | False | Light-based |  | [] |  | text | t2_iwqr47qo | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | [] | False | sbspixie |  | [] |  | text | t2_52xmjx3q | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | [] | False | Numerous_Cut_5410 |  | [] |  | text | t2_gro4ujih | False | False | ... |  |  |  |  |  |  |  |  |  |  |


In [638]:
# Look at columns: subreddit, selftext (description), title
base_df_ct[['subreddit', 'selftext', 'title', 'created_utc']].head()

|  | subreddit | selftext | title | created_utc |
|---|---|---|---|---|
| 0 | conspiracytheories | There have been several things in world histor... | Did someone invent time travel? | 1646202980 |
| 1 | conspiracytheories | [removed] | Did you know that moderna is a Ukrainian compa... | 1646200837 |
| 2 | conspiracytheories | Theory: They can't get their more advanced cra... | UAP/Non-human intelligence is interfering with... | 1646200764 |
| 3 | conspiracytheories | What if the Mandela effect is actually the gov... | Mandela effect? | 1646200667 |
| 4 | conspiracytheories | [removed] | Beige listed still owes me $1,000 | 1646199782 |
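
The `created_utc` values are Unix timestamps in seconds, and they are what the `before` parameter is set from when paging. A small sketch for reading them as dates, using the first two timestamps from the output above; `sample` is an illustrative frame, not the collected data.

```python
import pandas as pd

# Two timestamps copied from the head() output above
sample = pd.DataFrame({'created_utc': [1646202980, 1646200837]})

# unit='s' interprets the integers as seconds since the Unix epoch
sample['created_at'] = pd.to_datetime(sample['created_utc'], unit='s', utc=True)

print(sample['created_at'].iloc[0])  # 2022-03-02 06:36:20+00:00
```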


## Pulling the next 9900 posts

In [639]:
# number of times I will have to rerun the query, pulling 100 posts each time
9900/100

99.0

Because the server often drops the connection when the loop runs 99 times in a row (a common issue with the Pushshift API), I have decided to run the loop 49 times first and 50 times afterwards to collect all posts.
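
An alternative to splitting the loop by hand is to retry a failed request after a pause. The sketch below is one possible approach, not what this notebook does: `with_retries` and `flaky` are hypothetical names, and it assumes the wrapped call raises an exception when the request fails. `flaky` simulates a server that drops twice so the example runs offline.

```python
import time


def with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as err:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f'Attempt {attempt} failed ({err}); retrying in {delay:.0f}s')
            sleep(delay)


# Offline demo: a function that fails twice, then succeeds.
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('server dropped')
    return 'ok'


# Pass a no-op sleep so the demo does not actually wait
result = with_retries(flaky, sleep=lambda seconds: None)
print(result, calls['n'])  # ok 3
```

In real use the broad `except Exception` would likely be narrowed to the request library's own error types, so that programming mistakes are not retried.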

In [640]:
for i in range(49):
    base_df_ct = total_df(base_df_ct, 'conspiracytheories', 'https://api.pushshift.io/reddit/search/submission')
    
    if i in [9, 19, 29, 39]:
        print(base_df_ct.shape)

base_df_ct.shape

(1100, 84)
(2100, 84)
(3100, 85)
(4099, 85)


(4999, 85)

In [641]:
for i in range(50):
    base_df_ct = total_df(base_df_ct, 'conspiracytheories', 'https://api.pushshift.io/reddit/search/submission')
    
    if i in [10, 20, 30, 40]:
        print(base_df_ct.shape)

base_df_ct.shape

(6099, 85)
(7099, 85)
(8099, 85)
(9099, 85)


(9999, 85)

In [642]:
base_df_ct.shape

(9999, 85)
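
Because the collection is built from many appended pages, it may be worth confirming that no post appears twice before saving. A minimal sketch, assuming each submission carries a unique `id` field; `toy` is made-up data and the helper name `drop_duplicate_posts` is hypothetical.

```python
import pandas as pd


def drop_duplicate_posts(df, key='id'):
    """Return df without repeated posts, plus the number of rows removed."""
    deduped = df.drop_duplicates(subset=key).reset_index(drop=True)
    return deduped, len(df) - len(deduped)


# Made-up example with one repeated post
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'b2', 'c3'],
    'title': ['first', 'second', 'second', 'third'],
})

clean, removed = drop_duplicate_posts(toy)
print(clean.shape, removed)  # (3, 2) 1
```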

In [643]:
base_df_ct.to_csv('./data/ct_submissions.csv', index=False)

# 2) Create base dataframe for r/aliens posts
## Pulling the first 100 posts

In [692]:
#Set up base df 

base_df_aliens = get_base_df('https://api.pushshift.io/reddit/search/submission', 'aliens')

In [693]:
base_df_aliens.shape

(100, 76)

In [694]:
base_df_aliens.head()

|  | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | secure_media | secure_media_embed | thumbnail_height | thumbnail_width | url_overridden_by_dest | removed_by_category | author_flair_template_id | author_flair_text_color | is_gallery | media_metadata |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | knowledgeCaterpillar |  | [] |  | text | t2_ahp8jfc6 | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | [] | False | opism_ex |  | [] |  | text | t2_etd6haf9 | False | False | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | [] | False | Dan_Vasilache |  | [] |  | text | t2_be6uhfsv | False | False | ... | `{'oembed': {'author_name': 'Theory All Inclusi...` | `{'content': '<iframe width="267" height="20...` | 105.0 | 140.0 | https://youtu.be/BVG5IDeV8yA |  |  |  |  |  |
| 3 | [] | False | Wookiesarepeopletoo_ |  | [] |  | text | t2_k7cdau9f | False | False | ... |  |  |  |  |  | moderator |  |  |  |  |
| 4 | [] | False | Theespacebaby |  | [] |  | text | t2_dkvjlgue | False | False | ... |  |  |  |  |  |  |  |  |  |  |


In [695]:
# Look at columns: subreddit, selftext (description), title
base_df_aliens[['subreddit', 'selftext', 'title', 'created_utc']].head()

|  | subreddit | selftext | title | created_utc |
|---|---|---|---|---|
| 0 | aliens | I just woke up from a crazy alien dream! It st... | Aliens shot me with a subatomic particle beam ... | 1646214399 |
| 1 | aliens | I live in the south so we'll usually hear guns... | maybe an alien ? | 1646213158 |
| 2 | aliens |  | HUMAN-ALIEN hybrid lives in India! #Shorts | 1646211873 |
| 3 | aliens | [removed] | UFO sighting | 1646207848 |
| 4 | aliens | What are signs that someone’s been abducted? A... | Signs you’ve been abducted | 1646205635 |


## Pulling the next 9900 posts

In [696]:
for i in range(29):
    base_df_aliens = total_df(base_df_aliens, 'aliens', 'https://api.pushshift.io/reddit/search/submission')
    
    if i in [9, 19]:  # i never reaches 29 inside range(29)
        print(base_df_aliens.shape)

base_df_aliens.shape

(1100, 81)
(2099, 81)


(2999, 84)

In [697]:
for i in range(35):
    base_df_aliens = total_df(base_df_aliens, 'aliens', 'https://api.pushshift.io/reddit/search/submission')
    
    if i in [15, 30]:
        print(base_df_aliens.shape)

base_df_aliens.shape

(4599, 84)
(6099, 85)


(6499, 85)

In [698]:
for i in range(35):
    base_df_aliens = total_df(base_df_aliens, 'aliens', 'https://api.pushshift.io/reddit/search/submission')
    
    if i in [15, 30]:
        print(base_df_aliens.shape)

base_df_aliens.shape

(8097, 86)
(9597, 86)


(9996, 86)

In [699]:
base_df_aliens.to_csv('./data/aliens_submissions.csv', index=False)

# Next steps
In the next notebook, we will clean and preprocess the collected data.

I also tried using PRAW to get more posts. It was more stable than Pushshift's Reddit API, but it limited me to at most 1,000 posts per subreddit.

In [8]:
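import os

import praw

# Credentials are read from the environment rather than hard-coded in the notebook.
# The variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholders.
user_agent = 'Scraper 1.0 Awong'
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent=user_agent,
)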


In [11]:
posts = []

aliens_subreddit = reddit.subreddit('aliens')

for post in aliens_subreddit.new(limit=10000):
    posts.append([post.subreddit, post.title, post.selftext, post.id, post.score, post.distinguished,
                  post.created_utc, post.num_comments, post.upvote_ratio])
posts = pd.DataFrame(posts,columns=['subreddit', 'title', 'selftext', 'id', 'score', 'distinguished','createdutc',
                                    'num_comments', 'upvoteratio'])


In [12]:
posts.shape

(951, 9)

In [13]:
posts.head()

|  | subreddit | title | selftext | id | score | distinguished | createdutc | num_comments | upvoteratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | aliens | RH negative Blood - thoughts from Rainbow and ... |  | t5fdzv | 1 |  | 1646270000.0 | 5 | 0.57 |
| 1 | aliens | Possible alien found on shore in Australia? |  | t5eite | 0 |  | 1646267000.0 | 7 | 0.45 |
| 2 | aliens | What do psychedelics have to do with aliens? | While sitting in the hot sun and grazing, many... | t5eckp | 4 |  | 1646267000.0 | 16 | 0.71 |
| 3 | aliens | Any contactees? Phone a friend? :) | Hi all! Any ET contactees on here? I'm looking... | t5chy0 | 0 |  | 1646262000.0 | 0 | 0.5 |
| 4 | aliens | So how do you guys think a conversation would ... | In this hypothetical. If both men were to sit ... | t5b3il | 0 |  | 1646258000.0 | 18 | 0.38 |


In [14]:
posts.to_csv('./data/aliens_1000.csv')

In [15]:
posts = []

ct_subreddit = reddit.subreddit('conspiracytheories')

for post in ct_subreddit.new(limit=10000):
    posts.append([post.subreddit, post.title, post.selftext, post.id, post.score, post.distinguished,
                  post.created_utc, post.num_comments, post.upvote_ratio])
posts = pd.DataFrame(posts,columns=['subreddit', 'title', 'selftext', 'id', 'score', 'distinguished','createdutc',
                                    'num_comments', 'upvoteratio'])


In [16]:
posts.shape

(891, 9)

In [17]:
posts.head()

|  | subreddit | title | selftext | id | score | distinguished | createdutc | num_comments | upvoteratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | conspiracytheories | Is Putin Using Cannon Fodder Before The Real I... |  | t5h3e7 | 5 |  | 1646275000.0 | 3 | 1.0 |
| 1 | conspiracytheories | Is the Starlink Sattelites just the beginning ... | They have 1900 sattelites up there and more co... | t5ghie | 0 |  | 1646273000.0 | 4 | 0.33 |
| 2 | conspiracytheories | The Great Russian Market Crash is just an insi... | I mean, think about it. Putin isn't an idiot, ... | t5g6s6 | 1 |  | 1646272000.0 | 4 | 0.53 |
| 3 | conspiracytheories | Prediction for what will happen in the world w... | \n\nI believe though the Ukrainians will put ... | t5fvgo | 0 |  | 1646272000.0 | 6 | 0.33 |
| 4 | conspiracytheories | China Asked Russia to Delay Ukraine War Until ... |  | t5e8vx | 1 |  | 1646267000.0 | 0 | 0.67 |


In [18]:
posts.to_csv('./data/ct_1000.csv')
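
The two PRAW cells above differ only in the subreddit name, so the row-building step could be factored into one helper. The sketch below is a possible refactoring, not part of the original run: `submissions_to_df` and `FIELDS` are hypothetical names, the columns use the PRAW attribute names (`created_utc`, `upvote_ratio`) rather than the shortened ones above, and `SimpleNamespace` objects with made-up values stand in for PRAW submissions so the example runs offline.

```python
from types import SimpleNamespace

import pandas as pd

# Submission attributes to keep, one column each
FIELDS = ['subreddit', 'title', 'selftext', 'id', 'score', 'distinguished',
          'created_utc', 'num_comments', 'upvote_ratio']


def submissions_to_df(submissions, fields=FIELDS):
    """Build a DataFrame with one row per submission and one column per attribute."""
    rows = [[getattr(post, name) for name in fields] for post in submissions]
    return pd.DataFrame(rows, columns=fields)


# Offline stand-ins for PRAW submissions (made-up values)
fake_posts = [
    SimpleNamespace(subreddit='aliens', title='first post', selftext='', id='x1',
                    score=1, distinguished=None, created_utc=1646270000.0,
                    num_comments=5, upvote_ratio=0.57),
    SimpleNamespace(subreddit='aliens', title='second post', selftext='text', id='x2',
                    score=0, distinguished=None, created_utc=1646267000.0,
                    num_comments=7, upvote_ratio=0.45),
]

df = submissions_to_df(fake_posts)
print(df.shape)  # (2, 9)
```

With a configured `reddit` client, the same helper could be called once per subreddit, for example `submissions_to_df(reddit.subreddit('aliens').new(limit=1000))`.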