<h1>How to Create a Reddit Post That Will Get the Most Engagement From Reddit Users - Scrape Data</h1>

In this notebook, I use the Pushshift API, a third-party archive of Reddit data accessed here through the psaw wrapper, to scrape posts from the "Immigration" subreddit.

# Imports

In [1]:
import praw
from psaw import PushshiftAPI
import pandas as pd
from datetime import datetime, timezone

# Web Scraping for Data

In [2]:
api = PushshiftAPI()

In [3]:
api_request_generator = api.search_submissions(subreddit='Immigration')

In [4]:
df = pd.DataFrame([submission.d_ for submission in api_request_generator])
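The list comprehension above relies on each psaw result exposing its fields as a dictionary in the `d_` attribute. The following is a minimal sketch of that step, using stand-in objects with made-up values instead of live API results so that it runs offline. The helper `submissions_to_frame` and its `keep` parameter are hypothetical names introduced for illustration; they are not part of psaw.

```python
from types import SimpleNamespace

import pandas as pd


def submissions_to_frame(submissions, keep=None):
    """Build a DataFrame from objects that expose a `d_` dict (hypothetical helper)."""
    rows = [submission.d_ for submission in submissions]
    frame = pd.DataFrame(rows)
    if keep is not None:
        # reindex keeps the requested column order and fills absent fields with NaN
        frame = frame.reindex(columns=keep)
    return frame


# Stand-in records (made-up titles and counts) that mimic the shape of psaw submissions
fake_submissions = [
    SimpleNamespace(d_={"title": "Example post A", "num_comments": 3, "created_utc": 1662143055}),
    SimpleNamespace(d_={"title": "Example post B", "num_comments": 0, "created_utc": 1662142139}),
]

sample_df = submissions_to_frame(
    fake_submissions, keep=["title", "num_comments", "created_utc"]
)
print(sample_df)
```

Selecting columns at this stage is optional; it mainly keeps the frame small when only a few of the many returned fields are needed.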



In [5]:
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers',
       'subreddit_type', 'suggested_sort', 'thumbnail', 'title',


In [6]:
df['created_utc'].head(3)

0    1662143055
1    1662142139
2    1662139650
Name: created_utc, dtype: int64

In [7]:
# create a date column from the Unix timestamp (seconds since epoch)
df['date_posted'] = pd.to_datetime(df['created_utc'], utc=True, unit='s')
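To make the conversion concrete, here is a small sketch that applies the same `pd.to_datetime` call to the three `created_utc` values printed earlier. The variable names are illustrative only.

```python
import pandas as pd

# The three created_utc values shown in the output above (Unix epoch seconds)
created_utc = pd.Series([1662143055, 1662142139, 1662139650])

# unit="s" interprets the integers as seconds; utc=True makes the result timezone-aware
date_posted = pd.to_datetime(created_utc, unit="s", utc=True)
print(date_posted)
```

The first value converts to 2022-09-02 18:24:15+00:00, which matches the `date_posted` shown for row 0 in the tables in this notebook.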

In [8]:
# elapsed time since posting, measured at the moment this cell runs
df['time_on_reddit'] = datetime.now(timezone.utc) - df['date_posted']
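Because `datetime.now(timezone.utc)` changes on every run, `time_on_reddit` is a snapshot tied to when the notebook was executed. One possible way to make the column reproducible is to subtract from a fixed reference time instead. The sketch below assumes a `snapshot_time` inferred from the displayed output (each `date_posted` plus its `time_on_reddit`); both that variable and the `hours_on_reddit` feature are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# date_posted for the first three rows, rebuilt from the created_utc values above
date_posted = pd.to_datetime(
    pd.Series([1662143055, 1662142139, 1662139650]), unit="s", utc=True
)

# Assumption: a fixed snapshot time, inferred from the displayed output
snapshot_time = pd.Timestamp("2022-09-02 21:12:56.822990", tz="UTC")

# Timedelta between the snapshot and each post's creation time
time_on_reddit = snapshot_time - date_posted

# A plain numeric version (hours) is often easier to feed into a model
hours_on_reddit = time_on_reddit.dt.total_seconds() / 3600
print(time_on_reddit)
print(hours_on_reddit.round(2))
```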

In [13]:
new_df = df[['subreddit','title', 'selftext', 'num_comments','time_on_reddit', 'date_posted']]
new_df.head(3)

|  | subreddit | title | selftext | num_comments | time_on_reddit | date_posted |
|---|---|---|---|---|---|---|
| 0 | immigration | Trying to schedule a US visa interview in a co... | This is a weird one.\n\nI had my last visa int... | 0 | 0 days 02:48:41.822990 | 2022-09-02 18:24:15+00:00 |
| 1 | immigration | Visa interview date L-1B Blanket | Hi, I have two questions I’m hoping to get som... | 0 | 0 days 03:03:57.822990 | 2022-09-02 18:08:59+00:00 |
| 2 | immigration | USA 485: possible violation of auth work restr... | Waiting for my F-2 EAD some years ago I did a ... | 0 | 0 days 03:45:26.822990 | 2022-09-02 17:27:30+00:00 |


In [14]:
new_df.dtypes

subreddit                      object
title                          object
selftext                       object
num_comments                    int64
time_on_reddit        timedelta64[ns]
date_posted       datetime64[ns, UTC]
dtype: object

In [15]:
df.num_comments.describe()

count    78250.000000
mean         4.087412
std          9.561958
min          0.000000
25%          0.000000
50%          1.000000
75%          5.000000
max       1316.000000
Name: num_comments, dtype: float64

In [16]:
# add a binary popular column: 0 = not popular, 1 = popular
# the threshold is the median number of comments
median_num_comments = new_df['num_comments'].median()
median_num_comments

1.0

In [17]:
new_df.num_comments.isnull().sum()

0

In [18]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that is raised when a column is set directly on a slice of df
new_df = new_df.assign(
    popular=(new_df['num_comments'] > median_num_comments).astype(int)
)


In [19]:
new_df.head()

|  | subreddit | title | selftext | num_comments | time_on_reddit | date_posted | popular |
|---|---|---|---|---|---|---|---|
| 0 | immigration | Trying to schedule a US visa interview in a co... | This is a weird one.\n\nI had my last visa int... | 0 | 0 days 02:48:41.822990 | 2022-09-02 18:24:15+00:00 | 0 |
| 1 | immigration | Visa interview date L-1B Blanket | Hi, I have two questions I’m hoping to get som... | 0 | 0 days 03:03:57.822990 | 2022-09-02 18:08:59+00:00 | 0 |
| 2 | immigration | USA 485: possible violation of auth work restr... | Waiting for my F-2 EAD some years ago I did a ... | 0 | 0 days 03:45:26.822990 | 2022-09-02 17:27:30+00:00 | 0 |
| 3 | immigration | TN Question | Hi! \n\nCan I (Canadian with TN status) bring ... | 0 | 0 days 04:18:28.822990 | 2022-09-02 16:54:28+00:00 | 0 |
| 4 | immigration | What paperwork is usually needed at preclearance? | I’m travelling from Dublin to jfk to visit my ... | 0 | 0 days 05:20:22.822990 | 2022-09-02 15:52:34+00:00 | 0 |
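Labelling with a strict "greater than the median" rule does not necessarily split the data in half: posts whose comment count equals the median all fall into the "not popular" class. The sketch below illustrates this on made-up comment counts (not the scraped data), so the resulting shares say nothing about the real dataset.

```python
import pandas as pd

# Made-up comment counts with several ties at the median
toy = pd.DataFrame({"num_comments": [0, 0, 1, 1, 1, 2, 5, 9]})

median_num_comments = toy["num_comments"].median()

# Vectorized labelling: True/False converted to 1/0
toy["popular"] = (toy["num_comments"] > median_num_comments).astype(int)

# Share of each class, indexed by label (0 = not popular, 1 = popular)
class_share = toy["popular"].value_counts(normalize=True).sort_index()
print(median_num_comments)
print(class_share)
```

Running the same `value_counts(normalize=True)` check on the real `popular` column would show how balanced the two classes are before any classifier is trained.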


In [20]:
#export data to CSV
new_df.to_csv("reddit_immigration_posts")
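By default, `to_csv` also writes the row index, which reappears as an `Unnamed: 0` column when the file is read back, and datetime and timedelta columns come back as plain strings. The following sketch shows one way to round-trip such columns. It uses a made-up one-row frame and an in-memory buffer rather than a file on disk; the names `sample`, `buffer`, and `restored` are illustrative.

```python
import io

import pandas as pd

# Made-up one-row frame with the same kinds of columns as the export
sample = pd.DataFrame(
    {
        "title": ["Example post"],
        "date_posted": pd.to_datetime([1662143055], unit="s", utc=True),
    }
)
sample["time_on_reddit"] = (
    pd.Timestamp("2022-09-02 21:12:56", tz="UTC") - sample["date_posted"]
)

# index=False leaves the row index out of the CSV
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# After reading, convert the string columns back to their original types
restored = pd.read_csv(buffer)
restored["date_posted"] = pd.to_datetime(restored["date_posted"], utc=True)
restored["time_on_reddit"] = pd.to_timedelta(restored["time_on_reddit"])
print(restored.dtypes)
```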

I've successfully used the Pushshift API to gather the data above. The exported dataset contains the subreddit name (in this project, only immigration), the title of the post, the text of the post, the number of comments, the length of time the post had been on Reddit when the notebook was run, and the date of the post. Other fields mentioned for this project, such as score, number of crossposts, and upvote ratio, are not among the columns selected in new_df above, so they would need to be added to that selection to appear in the export. I have also created a binary variable named popular that will be used for classification later. A post is considered popular (denoted as 1) if it has more than the median number of comments, which is 1; otherwise it is not popular (denoted as 0).