# Project 3: NLP Classification of Two Topics (Marvel Cinematic Universe vs DC Extended Universe)

# 01: Data Collection

## Problem Statement

We, Super Heroes Media Incorporated, are a marketing and branding agency that works with both Disney and Warner Bros. on their superhero movie franchises: the Marvel Cinematic Universe (MCU) for Disney and the DC Extended Universe (DCEU) for Warner Bros. Because we represent two clients in the same field, it is important to maintain a distinct brand for each of the MCU and the DCEU. By a distinct brand, we mean that there is no confusion when superhero characters are discussed or advertised (e.g. when Captain America is mentioned, he should be correctly associated with the MCU; similarly, Superman should be associated with the DCEU). A mistake in one of our advertising campaigns could result in an expensive lawsuit from either Disney or Warner Bros.

To keep each brand clear and prevent mixing characters or overlapping information between the MCU and the DCEU, we will build a classification model using posts from Reddit. In particular, we will use the following two subreddits:
* r/marvelstudios for MCU
* r/DC_Cinematic for DCEU

After the model is trained, we intend to pass all advertising material through it before a campaign begins, to verify that content prepared for one client does not stray into its competitor's domain.

## Import Libraries

In [1]:
# Import the necessary libraries
import pandas as pd
import requests

In [2]:
# Set all columns and rows to be displayed
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Data Collection

We'll be using the Pushshift API to collect the posts from Reddit. As we're targeting 1,000 posts per topic, we'll be collecting 2,000 posts in total. As Pushshift returns a maximum of 500 posts per request, each topic will require two requests.

In [3]:
# Number of requests per topic?
1_000/500

2.0
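The division above works out evenly, but a target that is not a multiple of the page size needs rounding up. A minimal sketch of that calculation follows; the helper name `requests_needed` is hypothetical and not part of the original notebook.

```python
import math

def requests_needed(target_posts, page_size=500):
    """Return how many paged requests are needed to reach target_posts."""
    if target_posts <= 0:
        return 0
    # Round up so a partial final page still counts as one request
    return math.ceil(target_posts / page_size)

print(requests_needed(1_000))  # 2
print(requests_needed(1_001))  # 3
```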

In [4]:
# Define the base url for submission requests from pushshift
url = 'https://api.pushshift.io/reddit/search/submission/'

### Function definition to request for data

Here, we define the functions we will reuse to scrape posts from both subreddits.

In [5]:
# Function to request a batch of posts, as
# the pushshift API can only return 500 posts at a time

def request_posts(params, url = url):
    res = requests.get(url, params=params)
    if res.status_code != 200:
        # Report the failure and return an empty list so that callers always receive a list
        print(f"An error occurred. The status code received was {res.status_code}")
        return []

    data = res.json()
    posts = data['data']

    return posts
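The response handling can also be separated from the network call, which makes it easy to check offline. The sketch below assumes only what the function above relies on: a status code and a JSON payload with a `data` key. The name `parse_posts` is hypothetical.

```python
def parse_posts(status_code, payload):
    """Return the list of posts from a Pushshift-style payload, or [] on failure."""
    if status_code != 200:
        return []
    if not isinstance(payload, dict):
        return []
    # A missing 'data' key is treated the same as an empty result
    return list(payload.get('data', []))

# Offline examples using hand-written payloads
print(parse_posts(200, {'data': [{'id': 'abc'}]}))  # [{'id': 'abc'}]
print(parse_posts(500, {}))                         # []
```

Returning an empty list on failure keeps the downstream `pd.DataFrame(posts)` call working, at the cost of needing a separate check for whether any posts arrived.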

In [6]:
# Function to update the reference point for requests made after the initial one
# Otherwise, pushshift will just keep returning the latest 500 posts

def get_updated_params(initial_df, subreddit):
    params = {
        'subreddit' : subreddit,
        'size' : 500,
        'before' : initial_df.loc[(initial_df.shape[0] - 1), 'created_utc']
    }
    
    return params
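The function above takes the timestamp from the last row, which is the oldest post only if the results arrive newest first. A sketch that does not depend on row order is shown below, working on the raw list of post dictionaries rather than the dataframe. The names `next_before` and `build_params` are hypothetical.

```python
def next_before(posts):
    """Return the smallest created_utc among posts, or None if there is none."""
    timestamps = [post['created_utc'] for post in posts if 'created_utc' in post]
    return min(timestamps) if timestamps else None

def build_params(subreddit, posts, size=500):
    """Build request parameters, adding 'before' only when a cursor exists."""
    params = {'subreddit': subreddit, 'size': size}
    before = next_before(posts)
    if before is not None:
        params['before'] = before
    return params

# Timestamps taken from the first rows of the Marvel preview, deliberately shuffled
sample = [{'created_utc': 1677977799}, {'created_utc': 1677980039}, {'created_utc': 1677979927}]
print(build_params('marvelstudios', sample))
# {'subreddit': 'marvelstudios', 'size': 500, 'before': 1677977799}
```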

In [7]:
# Function to update the initial dataframe (that contains the first 500 posts) 
# with the subsequent data request (next 500 posts) 

def second_request_posts(initial_df, subreddit):
    params = get_updated_params(initial_df, subreddit)
    posts = request_posts(params)
    temp_df = pd.DataFrame(posts)
    df_posts = pd.concat([initial_df, temp_df], axis=0, ignore_index=True, 
                         sort=True)
    
    return df_posts
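The two-step pattern above (initial request, then one follow-up) can be generalised to a loop that keeps paging until a target is reached. This is a sketch under stated assumptions, not the notebook's method: `collect_posts` and `fake_fetch` are hypothetical names, and `fake_fetch` is an offline stand-in for `request_posts` that assumes `before` means "strictly older than".

```python
def collect_posts(fetch, subreddit, target=1_000, size=500):
    """Page backwards through a subreddit until `target` posts are collected."""
    collected = []
    seen_ids = set()
    before = None  # no cursor on the first request, so the newest posts come back

    while len(collected) < target:
        params = {'subreddit': subreddit, 'size': size}
        if before is not None:
            params['before'] = before

        batch = fetch(params)
        added = 0
        for post in batch:
            if post['id'] not in seen_ids:  # skip posts already collected
                seen_ids.add(post['id'])
                collected.append(post)
                added += 1
        if added == 0:
            break  # nothing new came back, so stop instead of looping forever

        before = min(post['created_utc'] for post in batch)

    return collected[:target]


def fake_fetch(params):
    """Offline stand-in: serves posts with timestamps 1999 down to 0, newest first."""
    newest = params.get('before', 2_000) - 1
    oldest = max(newest - params['size'], -1)
    return [{'id': f'p{t}', 'created_utc': t} for t in range(newest, oldest, -1)]


posts = collect_posts(fake_fetch, 'marvelstudios', target=1_000, size=500)
print(len(posts))  # 1000
```

In the notebook, `fetch` could be `request_posts` itself, with the returned list passed to `pd.DataFrame` once collection is finished.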

### Collect posts from the Marvel subreddit

In [8]:
# Define the parameters for pushshift to request data from the Marvel subreddit

params_marvel = {
    'subreddit' : 'marvelstudios',
    'size' : 500,
}

In [9]:
# Request the initial 500 posts 

posts_marvel = request_posts(params_marvel)

In [10]:
# Store the posts in a dataframe

df_marvel = pd.DataFrame(posts_marvel)

In [11]:
# Check the shape of the dataframe

df_marvel.shape

(499, 95)

In [12]:
# Collect next 500 posts

df_marvel = second_request_posts(df_marvel, 'marvelstudios')

In [13]:
# Check the shape of the dataframe

df_marvel.shape

(999, 95)

In [14]:
# Take a quick look at the data

df_marvel.head()

Unnamed: 0,all_awardings,allow_live_comments,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_gild,category,content_categories,contest_mode,created_utc,discussion_type,distinguished,domain,edited,edited_on,gallery_data,gilded,gildings,hidden,hide_score,id,is_created_from_ads_ui,is_crosspostable,is_gallery,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_metadata,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,quarantine,removed_by,removed_by_category,retrieved_utc,score,secure_media,secure_media_embed,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,top_awarded_type,total_awards_received,treatment_tags,updated_utc,upvote_ratio,url,url_overridden_by_dest,utc_datetime_str,view_count,whitelist_status,wls
0,[],False,False,NotGodsThrowaway,,,,[],,,,text,t2_qz0qszo4,False,False,[],True,,,False,1677980039,,,i.redd.it,False,,,0,{},False,True,11ijqix,False,True,,False,False,True,True,False,False,#0087d2,Discuss,"[{'e': 'text', 't': 'Discussion'}]",d556a05a-43fb-11e8-82a6-0e6ba3dc5484,Discussion,light,richtext,False,,{},,False,True,0,0,False,all_ads,/r/marvelstudios/comments/11ijqix/i_hope_layla...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,1677980056,1,,{},,True,False,False,marvelstudios,t5_2uii8,r/marvelstudios,3066270,public,confidence,https://b.thumbs.redditmedia.com/gzEL6you86w9J...,59.0,140.0,I hope Layla comes back. Her introduction as t...,,0,[],1677980057,1.0,https://i.redd.it/lrrz7hw7qtla1.png,https://i.redd.it/lrrz7hw7qtla1.png,2023-03-05 01:33:59,,all_ads,6
1,[],False,False,Louis_DCVN,,,,[],,,,text,t2_6db17i0l,False,False,[],True,,,False,1677979927,,,variety.com,False,,,0,{},False,True,11ijoz3,False,True,,False,False,False,True,False,False,#bb18d7,Article,"[{'e': 'text', 't': 'Article'}]",0ee036fc-28fb-11ea-bbf1-0ee82508daf1,Article,light,richtext,False,,{},,False,True,0,0,False,all_ads,/r/marvelstudios/comments/11ijoz3/ryan_reynold...,False,link,{'images': [{'source': {'url': 'https://extern...,6,False,,,1677979941,1,,{},,True,False,False,marvelstudios,t5_2uii8,r/marvelstudios,3066271,public,confidence,https://b.thumbs.redditmedia.com/xUpusRYnpvt7l...,78.0,140.0,Ryan Reynolds Casts Doubt on 'Free Guy' Sequel...,,0,[],1677979942,1.0,https://variety.com/2023/film/global/ryan-reyn...,https://variety.com/2023/film/global/ryan-reyn...,2023-03-05 01:32:07,,all_ads,6
2,[],False,False,anthonystrader18,,,,[],,,,text,t2_1or2xht2,False,False,[],True,,,False,1677977799,,,i.redd.it,False,,,0,{},False,True,11iiw55,False,True,,False,False,True,True,False,False,#0087d2,Discuss,"[{'e': 'text', 't': 'Discussion'}]",d556a05a-43fb-11e8-82a6-0e6ba3dc5484,Discussion,light,richtext,False,,{},,False,True,0,0,False,all_ads,/r/marvelstudios/comments/11iiw55/what_does_ev...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,1677977810,1,,{},,True,False,False,marvelstudios,t5_2uii8,r/marvelstudios,3066263,public,confidence,https://b.thumbs.redditmedia.com/TX1ruc0zUy7dv...,140.0,140.0,What Does Everyone think of Dr Strange Multive...,,0,[],1677977810,1.0,https://i.redd.it/se7so83i1vla1.jpg,https://i.redd.it/se7so83i1vla1.jpg,2023-03-05 00:56:39,,all_ads,6
3,[],False,False,FictionFantom,,,stan,"[{'e': 'text', 't': 'Stan Lee'}]",6c04cc00-ca07-11e5-87cf-0e57f4d38c67,Stan Lee,dark,richtext,t2_gfl9i,False,False,[],True,,,False,1677977311,,,i.redd.it,False,,,0,{},False,True,11iipjp,False,True,,False,False,True,True,False,False,#00c99b,Humour,"[{'e': 'text', 't': 'Humour'}]",47cf6780-28fb-11ea-a558-0ef7e51a1d1f,Humour,light,richtext,False,,{},,False,True,0,0,False,all_ads,/r/marvelstudios/comments/11iipjp/cast_this_mo...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,1677977330,1,,{},,True,False,False,marvelstudios,t5_2uii8,r/marvelstudios,3066256,public,confidence,https://b.thumbs.redditmedia.com/lB9TqTh2EvBVd...,140.0,140.0,Cast this movie with Earth’s Mightiest Muppets,,0,[],1677977331,1.0,https://i.redd.it/gcz7niq10vla1.jpg,https://i.redd.it/gcz7niq10vla1.jpg,2023-03-05 00:48:31,,all_ads,6
4,[],False,False,FictionFantom,,,stan,"[{'e': 'text', 't': 'Stan Lee'}]",6c04cc00-ca07-11e5-87cf-0e57f4d38c67,Stan Lee,dark,richtext,t2_gfl9i,False,False,[],True,,,False,1677976899,,,i.redd.it,False,,,0,{},False,True,11iijzo,False,True,,False,False,True,True,False,False,#00c99b,Humour,"[{'e': 'text', 't': 'Humour'}]",47cf6780-28fb-11ea-a558-0ef7e51a1d1f,Humour,light,richtext,False,,{},,False,True,0,0,False,all_ads,/r/marvelstudios/comments/11iijzo/okay_but_hea...,False,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,1677976911,1,,{},,True,False,False,marvelstudios,t5_2uii8,r/marvelstudios,3066257,public,confidence,https://b.thumbs.redditmedia.com/a7b82SMR3vbzb...,140.0,140.0,Okay but hear me out…,,0,[],1677976911,1.0,https://i.redd.it/eaj61potyula1.jpg,https://i.redd.it/eaj61potyula1.jpg,2023-03-05 00:41:39,,all_ads,6
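The `created_utc` values in the preview above are Unix epoch seconds, which is also what the `before` parameter receives. A small sketch for turning one into a readable UTC string is given below; the helper name `utc_string` is hypothetical.

```python
from datetime import datetime, timezone

def utc_string(epoch_seconds):
    """Format Unix epoch seconds as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

# First row of the preview: created_utc 1677980039
print(utc_string(1677980039))  # 2023-03-05 01:33:59
```

The result matches the `utc_datetime_str` column of the same row.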


### Collect posts from the DCEU subreddit

In [15]:
# Define the parameters for pushshift to request data from the DCEU subreddit

params_dceu = {
    'subreddit' : 'DC_Cinematic',
    'size' : 500,
}

In [16]:
# Request the initial 500 posts 

posts_dceu = request_posts(params_dceu)

In [17]:
# Store the posts in a dataframe

df_dceu = pd.DataFrame(posts_dceu)

In [18]:
# Check the shape of the dataframe

df_dceu.shape

(500, 98)

In [19]:
# Collect next 500 posts

df_dceu = second_request_posts(df_dceu, 'DC_Cinematic')

In [20]:
# Check the shape of the dataframe

df_dceu.shape

(1000, 98)

In [21]:
# Take a quick look at the data

df_dceu.head()

Unnamed: 0,all_awardings,allow_live_comments,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_gild,category,content_categories,contest_mode,created_utc,crosspost_parent,crosspost_parent_list,discussion_type,distinguished,domain,edited,edited_on,gallery_data,gilded,gildings,hidden,hide_score,id,is_created_from_ads_ui,is_crosspostable,is_gallery,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_metadata,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,poll_data,post_hint,preview,pwls,quarantine,removed_by,removed_by_category,retrieved_utc,score,secure_media,secure_media_embed,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,top_awarded_type,total_awards_received,treatment_tags,updated_utc,upvote_ratio,url,url_overridden_by_dest,utc_datetime_str,view_count,whitelist_status,wls
0,[],False,False,Louis_DCVN,,,,[],,,,text,t2_6db17i0l,False,False,[],True,,,False,1677979634,,,,,variety.com,False,,,0,{},False,True,11ijl1a,False,True,,False,False,False,True,False,False,#ea0027,news,[],ff1bbabe-5d95-11e5-8d28-0ead67bd777b,NEWS,light,text,False,,{},,False,True,0,0,False,all_ads,/r/DC_Cinematic/comments/11ijl1a/green_lantern...,False,,link,{'images': [{'source': {'url': 'https://extern...,6,False,,,1677979649,1,,{},,True,False,False,DC_Cinematic,t5_2ykm6,r/DC_Cinematic,376363,public,,https://b.thumbs.redditmedia.com/xUpusRYnpvt7l...,78.0,140.0,"Green Lantern 2011, which also starred Lively,...",,0,[],1677979650,1.0,https://variety.com/2023/film/global/ryan-reyn...,https://variety.com/2023/film/global/ryan-reyn...,2023-03-05 01:27:14,,all_ads,6
1,[],False,False,AldebaranTauro,,,tacbatman,[],,,dark,text,t2_7k9iqm7,False,False,[],True,,,False,1677976528,,,,,i.redd.it,False,,,0,{},False,True,11iievr,False,True,,False,False,True,True,False,False,#ff66ac,merchandise,[],bf96d4b6-5edd-11e5-9e15-129e4297eb59,MERCHANDISE,light,text,False,,{},,False,True,0,0,False,all_ads,/r/DC_Cinematic/comments/11iievr/new_look_at_f...,False,,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,1677976550,1,,{},,True,False,False,DC_Cinematic,t5_2ykm6,r/DC_Cinematic,376365,public,,https://a.thumbs.redditmedia.com/Y-eywtB0nhB_v...,140.0,140.0,New look at Funko Pops of young Barry in impro...,,0,[],1677976551,1.0,https://i.redd.it/sd0kj0e6gtla1.jpg,https://i.redd.it/sd0kj0e6gtla1.jpg,2023-03-05 00:35:28,,all_ads,6
2,[],False,False,Illustrious-Sign3015,,,,[],,,,text,t2_91yyd27r,False,False,[],True,,,False,1677975350,,,,,i.redd.it,False,,,0,{},False,True,11ihy29,False,True,,False,False,True,True,False,False,#ff4500,humor,[],f58866c8-5d95-11e5-bb0f-0e4ee382cd79,HUMOR,light,text,False,,{},,False,True,0,0,False,all_ads,/r/DC_Cinematic/comments/11ihy29/if_the_joker_...,False,,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,1677975364,1,,{},,True,False,False,DC_Cinematic,t5_2ykm6,r/DC_Cinematic,376366,public,,https://b.thumbs.redditmedia.com/Q7rMw3p_pk2i4...,78.0,140.0,If the Joker got his hands on the Ecto 1 from ...,,0,[],1677975364,1.0,https://i.redd.it/md8gzcv7uula1.jpg,https://i.redd.it/md8gzcv7uula1.jpg,2023-03-05 00:15:50,,all_ads,6
3,[],False,False,Sha_Shock,,,,[],,,,text,t2_6gcus97s,False,False,[],True,,,False,1677975157,t3_11ihun4,"[{'approved_at_utc': None, 'subreddit': 'dccom...",,,i.redd.it,False,,,0,{},False,True,11ihve8,False,True,,False,False,True,True,False,False,#ff4500,humor,[],f58866c8-5d95-11e5-bb0f-0e4ee382cd79,HUMOR,light,text,False,,{},,False,True,0,0,False,all_ads,/r/DC_Cinematic/comments/11ihve8/rantiwork_use...,False,,image,{'images': [{'source': {'url': 'https://previe...,6,False,,,1677975170,1,,{},,True,False,False,DC_Cinematic,t5_2ykm6,r/DC_Cinematic,376366,public,,https://b.thumbs.redditmedia.com/OiPzSw0E2J-OT...,96.0,140.0,r/antiwork users after that one guy killed the...,,0,[],1677975170,1.0,https://i.redd.it/t69qykivbtla1.png,https://i.redd.it/t69qykivbtla1.png,2023-03-05 00:12:37,,all_ads,6
4,[],False,False,BatmanNewsChris,,,batman2,[],,Batman,dark,text,t2_yzgfz,False,False,[],True,,,False,1677975012,,,,,self.DC_Cinematic,False,,,0,{},False,True,11ihtc2,False,True,,False,False,False,True,True,False,#ea0027,news,[],ff1bbabe-5d95-11e5-8d28-0ead67bd777b,NEWS,light,text,False,,{},,False,False,0,0,False,all_ads,/r/DC_Cinematic/comments/11ihtc2/shazam_fury_o...,False,,,,6,False,,,1677975031,1,,{},Warner Bros. released a bunch of Shazam! Fury ...,False,False,False,DC_Cinematic,t5_2ykm6,r/DC_Cinematic,376366,public,,self,,,'Shazam! Fury of the Gods' full credits are in...,,0,[],1677975031,1.0,https://www.reddit.com/r/DC_Cinematic/comments...,,2023-03-05 00:10:12,,all_ads,6


### Export the Data

In [22]:
# Export the marvel data into a csv file

df_marvel.to_csv('../data/marvel.csv', index=False)

In [23]:
# Export the DC data into a csv file

df_dceu.to_csv('../data/dceu.csv', index=False)

### Next step

With this, we have collected the data required for the analysis. We'll proceed to clean the data in the next notebook.