<a href="https://colab.research.google.com/github/simonsanvil/FinalProjectMLA/blob/master/notebooks/reddit_data_acquisition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Final Project: Data Set Creation**

## Using the Reddit API to scrape posts from Reddit for our NLP tasks

### Universidad Carlos III de Madrid - Machine Learning Applications

---

**Authors:**
- Enrique Botía Barberá 
- David Méndez Encinas 
- Andrés Ruiz Calvo 
- Simón E. Sánchez Viloria

In this notebook we use the Python Reddit API Wrapper, [PRAW](https://praw.readthedocs.io/en/latest/index.html), to build a dataset of posts from several Reddit communities for our NLP tasks.

**Load environment variables:**

Scraping data from Reddit through its API requires a `CLIENT_ID` and a `CLIENT_SECRET`. Both can be obtained by creating a Reddit account and [registering an application](https://www.reddit.com/prefs/apps/). To run this notebook, the environment variables `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` must be set to those values.

In [None]:
#load environmental variables from .env file
import os
!pip install python-dotenv'>=0.5.1'
from dotenv import load_dotenv, find_dotenv
# path to the .env file; use find_dotenv() instead to search parent directories automatically
dotenv_path = "redditenv.env"
# load up the entries as environment variables
if os.path.isfile(dotenv_path):
  load_dotenv(dotenv_path)
  print("environmental variables found and loaded")
else:
  print('.env not found')

environmental variables found and loaded
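
Once the `.env` file has been loaded, a quick check like the one below can confirm that both credentials are actually present before any API call is made. This is only a sketch: the helper name `missing_reddit_credentials` and the constant `REQUIRED_VARS` are hypothetical and not part of the original notebook.

```python
import os

# Names of the variables this notebook expects (taken from the text above).
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")

def missing_reddit_credentials(environ=None):
    """Return the names of required variables that are unset or empty.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to try out without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]

missing = missing_reddit_credentials()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Reddit credentials found")
```

Failing early here gives a clearer message than the `KeyError` that `os.environ['REDDIT_CLIENT_ID']` would raise later on.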


-------------------------

In [None]:
## Formatting
%matplotlib inline
%config InlineBackend.figure_format = 'retina' 

#For fancy table Display
%load_ext google.colab.data_table

! pip install fastprogress #progress bar
from fastprogress import master_bar, progress_bar
from termcolor import colored #colored prints

#To wrap long text lines
from IPython.display import HTML, display
def set_css():
  display(HTML('<style> pre { white-space: pre-wrap;} </style>'))
get_ipython().events.register('pre_run_cell', set_css)



----------
## Data Extraction:

In [None]:
import os
import pandas as pd
!pip install praw
import praw

Collecting praw
  Downloading https://files.pythonhosted.org/packages/48/a8/a2e2d0750ee17c7e3d81e4695a0338ad0b3f231853b8c3fa339ff2d25c7c/praw-7.2.0-py3-none-any.whl (159kB)
Collecting prawcore<3,>=2
  Downloading https://files.pythonhosted.org/packages/7d/df/4a9106bea0d26689c4b309da20c926a01440ddaf60c09a5ae22684ebd35f/prawcore-2.0.0-py3-none-any.whl
Collecting websocket-client>=0.54.0
  Downloading https://files.pythonhosted.org/packages/f7/0c/d52a2a63512a613817846d430d16a8fbe5ea56dd889e89c68facf6b91cb6/websocket_client-0.59.0-py2.py3-none-any.whl (67kB)
Collecting update-checker>=0.18
  Downloading https://files.pythonhosted.org/packages/0c/ba/8dd7fa5f0b1c6a8ac62f8f57f7e794160c1f86f31c6d0fb00f582372a3e4/update_checker-0.18.0-py3-none-any.whl
Installing collected packages: prawcore, websocket-client, update-checker, praw
Successfully installed praw

In [None]:
reddit = praw.Reddit(
    client_id = os.environ['REDDIT_CLIENT_ID'],
    client_secret = os.environ['REDDIT_CLIENT_SECRET'],
    user_agent = 'Data extraction',
    check_for_async=False,
    )

In [None]:
def get_submissions_df(reddit,subreddit_name,sub_sort='top',tags=None,time_filter='all',limit=1000):
    '''
    Scrape submissions/posts from the given subreddit and build a dataset from the posts' information.

    @reddit: An instance of the praw.Reddit class
    @subreddit_name<str>: Name of the subreddit to fetch the posts from.
    @sub_sort<str>: How to sort the subreddit before fetching the results. One of "top", "hot", "new" or "controversial". Default is "top", which sorts by score in descending order.
    @tags<list>: Flair tags used to filter the fetched submissions. Only posts with one of these tags are returned. Default is None (any post).
    @time_filter<str>: One of "all", "year", "month", "week", "day", "hour", used to restrict the fetched submissions by time. Ignored for the "new" and "hot" sorts. Default is "all".
    @limit<int>: Number of submissions to attempt to fetch. Warning: Reddit's API caps this at 1000 for regular accounts; anything beyond that won't be returned.
                 Default is 1000
    '''
    
    subr = reddit.subreddit(subreddit_name)
    if sub_sort=='new':
      sort_fun = subr.new
    elif sub_sort=='hot':
      sort_fun = subr.hot
    elif sub_sort=='controversial':
      sort_fun = subr.controversial
    else:
      sort_fun = subr.top

    mods = get_moderators(reddit,subreddit_name) #get the moderators of the forum, since we don't want their posts (they are usually meta or automated)

    if tags is None:
      is_valid_submission = lambda submission : submission.author not in mods and submission.is_self
    else:
      tags = [tags] if not isinstance(tags,(list,tuple,set,dict)) else tags
      is_valid_submission = lambda submission : submission.author not in mods and submission.is_self and submission.link_flair_text in tags
    sort_args = dict(limit=limit,time_filter=time_filter) if sub_sort not in ['new','hot'] else dict(limit=limit)

    submissions = [submission for submission in sort_fun(**sort_args) if is_valid_submission(submission)]

    if len(submissions)==0:
      return pd.DataFrame(None,columns = ['title','text','score','subreddit','url'])

    posts_info = [
        {'title':submission.title, 
         'text':submission.selftext,
         'score':submission.score,
         'subreddit':subreddit_name, #store the name as a plain string rather than the praw Subreddit object
         'url':submission.url
        } for submission in submissions
    ]
    
    return pd.DataFrame(posts_info)

def get_moderators(reddit,subreddit_name):
  '''
  get the moderators of the given subreddit
  '''
  mods = [moderator for moderator in reddit.subreddit(subreddit_name).moderator()]
  return mods
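
The filtering rule inside `get_submissions_df` (keep only self-posts that were not written by a moderator, optionally restricted to certain flair tags) can be tried out without network access. The sketch below mirrors that rule on hypothetical stand-in objects built with `SimpleNamespace`; the helper `make_filter`, the user names and the flair values are all invented for illustration, and real PRAW submissions expose `author` as a `Redditor` object rather than a string.

```python
from types import SimpleNamespace

def make_filter(mods, tags=None):
    """Build a predicate: self-post, not by a moderator, optional flair match."""
    if tags is not None and not isinstance(tags, (list, tuple, set)):
        tags = [tags]  # allow a single tag to be passed as a bare string

    def is_valid(submission):
        if submission.author in mods or not submission.is_self:
            return False
        return tags is None or submission.link_flair_text in tags

    return is_valid

# Hypothetical stand-ins for praw submissions (only the attributes used above).
posts = [
    SimpleNamespace(author="alice", is_self=True, link_flair_text="Discussion"),
    SimpleNamespace(author="mod_bot", is_self=True, link_flair_text="Meta"),
    SimpleNamespace(author="bob", is_self=False, link_flair_text="Discussion"),
    SimpleNamespace(author="carol", is_self=True, link_flair_text="Career"),
]

keep_any = make_filter(mods={"mod_bot"})
keep_discussion = make_filter(mods={"mod_bot"}, tags="Discussion")

print([p.author for p in posts if keep_any(p)])         # ['alice', 'carol']
print([p.author for p in posts if keep_discussion(p)])  # ['alice']
```

Link posts (`is_self` false) are dropped because they carry no body text to analyse.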

In [None]:
df = get_submissions_df(reddit,'datascience',limit=10)
df

| | title | text | score | subreddit | url |
|---|---|---|---|---|---|
| 0 | Shout Out to All the Mediocre Data Scientists ... | I've been lurking on this sub for a while now ... | 2932 | datascience | https://www.reddit.com/r/datascience/comments/... |
| 1 | I created a four-page Data Science Cheatsheet ... | Hey guys, I’ve been doing a lot of preparation... | 2069 | datascience | https://www.reddit.com/r/datascience/comments/... |


We'll create the full dataset by collecting submissions from several subreddits. For each one we try to gather at least 1500 distinct posts by querying with different sort types and time filters, since a single query returns at most 1000 submissions.

In [None]:
import itertools

subreddits = [
    'medicine',
    'books',
    'datascience',
    'truegaming',
    'politicaldiscussion',
    'debatereligion',
    'investing',
    'relationships',
    'casualconversation',
    'legaladvice',
]

sort_types = ['top','controversial']
time_filters = ['all','month','week','day','hour']
all_combinations = list(itertools.product([sort_types[0]], time_filters)) + [('hot','all'),('new','all')] + list(itertools.product([sort_types[1]], time_filters))

min_rows_per_sub = 1500
final_df = None
for subname in progress_bar(subreddits): #for each subreddit
  if isinstance(subname,(list,tuple,dict)):
    tags = subname[1]
    subname = subname[0]
  else:
    tags = None
  sub_df = pd.DataFrame(None,columns = ['title','text','score','subreddit','url'])
  for sort_type,time_filter in progress_bar(all_combinations):  #for each sort and time_filter combination obtain the corresponding dataset
    print('\r',f'r/{subname}: Posts obtained: ',len(sub_df),end='')
    df = get_submissions_df(reddit,subname,limit=1000,sub_sort=sort_type,tags=tags,time_filter=time_filter)
    df = df[df.text.str.split(" ").str.len() >= 25] #remove posts with fewer than 25 space-separated words
    sub_df = pd.concat([sub_df,df],axis=0)
    sub_df.drop_duplicates(inplace=True)
    if len(sub_df) >= min_rows_per_sub: #if the size of this subreddit's dataframe surpasses the minimum posts per subreddit continue to the next
      break
  print('\r',f"r/{subname}: total number of posts obtained: {len(sub_df)}")
  final_df = pd.concat([final_df,sub_df],axis=0)

 r/medicine: total number of posts obtained: 1920
 r/books: total number of posts obtained: 1651
 r/datascience: total number of posts obtained: 1561
 r/truegaming: total number of posts obtained: 1834
 r/politicaldiscussion: total number of posts obtained: 1883
 r/debatereligion: total number of posts obtained: 1792
 r/investing: total number of posts obtained: 1574
 r/relationships: total number of posts obtained: 1788
 r/casualconversation: total number of posts obtained: 1952
 r/legaladvice: total number of posts obtained: 1951

In [None]:
final_df.subreddit.value_counts()

casualconversation     1952
legaladvice            1951
medicine               1920
politicaldiscussion    1883
truegaming             1834
debatereligion         1792
relationships          1788
books                  1651
investing              1574
datascience            1561
Name: subreddit, dtype: int64
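
The two cleaning steps applied to the posts, the minimum-length filter from the collection loop and the removal of duplicates on `title`, `text` and `subreddit`, can be illustrated on a toy frame. This is a minimal sketch with invented data; the variable names `posts`, `long_enough` and `deduped` are hypothetical.

```python
import pandas as pd

long_text = " ".join(["word"] * 30)  # 30 space-separated tokens

# Toy data: rows 0 and 2 are identical, row 1 is too short.
posts = pd.DataFrame({
    "title": ["a", "b", "a"],
    "text": [long_text, "too short", long_text],
    "subreddit": ["x", "x", "x"],
})

# Keep posts whose text splits into at least 25 space-separated tokens.
long_enough = posts[posts["text"].str.split(" ").str.len() >= 25]

# Drop rows that repeat the same title, text and subreddit.
deduped = long_enough.drop_duplicates(subset=["title", "text", "subreddit"])

print(len(posts), len(long_enough), len(deduped))  # 3 2 1
```

Duplicates are expected in the real data because the same post can appear under several sort types and time filters.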

In [None]:
final_df['score'] = final_df['score'].astype(int) 
final_df.drop_duplicates(subset=['title','text','subreddit'],inplace=True) #remove duplicate posts
print("Size of the dataset:",len(final_df))
final_df.head()

Size of the dataset: 17906


| | title | text | score | subreddit | url |
|---|---|---|---|---|---|
| 0 | 3 Days of Inpatient Care in New York | Day 1.\n\n3 COVID cases in a census of 14 (one... | 4838 | medicine | https://www.reddit.com/r/medicine/comments/fp2... |
| 1 | Help! A doctor in my hometown was kidnapped by... | Well, here goes nothing, I hope this gets at l... | 4643 | medicine | https://www.reddit.com/r/medicine/comments/kb3... |
| 2 | There is no emergency in a pandemic | I was asked to repost this with the news of 13... | 4240 | medicine | https://www.reddit.com/r/medicine/comments/flk... |
| 3 | Testimony of a surgeon working in Bergamo, in ... | «In one of the non-stop e-mails that I receive... | 4102 | medicine | https://www.reddit.com/r/medicine/comments/ff8... |
| 4 | He is small for an eight-year-old boy, made ev... | He is small for an eight-year-old boy, made ev... | 3994 | medicine | https://www.reddit.com/r/medicine/comments/4g8... |


In [None]:
final_df.to_excel("data/reddit_posts_df_2.xlsx",index=False) #save the dataset