## Scrape Reddit

Now, imagine that I am working for a gaming news platform.  
My platform is rolling out a new service, pushing interesting contents from a bank of mixed up posts to user, based on ther user's preference.  
Before I push the post to the Nintendo/ PlayStation fans, I will first need to identify where these posts come from.  
Somehow the contents the game news platform generates is very similar to reddit, so I decided to gather some reddit posts and make a classifier which can identify is the post is from `r/nintendo` or `r/playstation`.  

---

For this notebook, my goal is:  
Collect posts from two subreddits. `r/nintendo` and `r/playstation`, using `praw`, [The Python Reddit API Wrapper](https://praw.readthedocs.io/en/stable/#praw-the-python-reddit-api-wrapper).   

In [1]:
import praw
import random
import pandas as pd
from os.path import isfile
from time import sleep

Write a class to scrape reddit.  

In [2]:
class ScrapeAndMakeCSV(object):
    '''This class is to scrape reddit and make a CSV at the end of it
    Args:
        sub: name of subreddit to scrape
        sort_by: how to sort subreddit, default = hot
        num: number of post to gather
        save_path: path to save the csv
    '''
    def __init__(self, sub, sort_by='hot', num=1000, save_path = '../data/'):
        self.sub = sub
        self.sort_by = sort_by
        self.num = num
        self.save_path = save_path
        self.bot = praw.Reddit('Bot1') # Initialise bot using praw.ini
        print(f'Initilised API call for\n scrapping: {self.sub},\n sorted by: {self.sort_by},\n for: {self.num} posts.')
        
    def set_sort(self):
        '''Set how to sort subreddit posts. Default = hot
        '''
        reddit = self.bot
        if self.sort_by == 'hot':
            posts =  reddit.subreddit(self.sub).hot(limit=self.num)
        if self.sort_by == 'new':
            posts =  reddit.subreddit(self.sub).new(limit=self.num)
        elif self.sort_by == 'top':
            posts =  reddit.subreddit(self.sub).top(limit=self.num)
        else:
            self.sort_by = 'hot'
            print('Sort method was not recognized, defaulting to hot.')
            posts =  reddit.subreddit(self.sub).hot(limit=self.num)
        print(f'Calling to: {self.sub}, with posts sorted by: {self.sort_by}, gathering {self.num} posts.')
        return posts
    
    def get_help(self, item):
        '''Run the help function in praw
        Args:
            item: what to get help on
        '''
        r = self.bot
        if item == 'bot':
            help(reddit)
        elif item == 'posts':
            helper = r.subreddit(self.sub).hot(limit=1)
            help(helper)
        elif item == 'post':
            helper = r.subreddit(self.sub).hot(limit=1)
            for post in helper:
                help(post)
    
    def make_df(self):
        '''Makes a DataFrame with posts gathered
        '''
        # Make a dataframe dict
        new_posts_dict = {
            'id': [],
            'title': [],
            'post_content': [], 
            'user': []
        }
        
        csv = f'{self.save_path}{self.sub}_posts.csv'
        
        # Set csv_loaded to True if csv exists since you can't 
        # evaluate the truth value of a DataFrame.
        df, csv_loaded = (pd.read_csv(csv), True) if isfile(csv) else ('', False)
        
        print(f'csv_loaded = {csv_loaded}')
        
        # Get post from 3 possible sorts methods
        posts = self.set_sort()
        
        for post in posts:
            # Check if post.id is in df and set to True if df is empty.
            # This way new posts are still added to dictionary when df = '' 
            unique_id = post.id not in tuple(df.id) if csv_loaded else True
            
            # Save any unique posts to sub_dict.
            if unique_id:
                new_posts_dict['id'].append(post.id)
                new_posts_dict['title'].append(post.title)
                new_posts_dict['post_content'].append(post.selftext)
                new_posts_dict['user'].append(post.author)
            # Sleep for short while
            #sleep(random.randint(0,2)) # This is probably not needed, remove if you want code to run fast
            sleep(0.1)
            
        # Make new dataframe
        new_df = pd.DataFrame(new_posts_dict)
        # Add new_df to df if df exists then save it to a csv.
        if 'DataFrame' in str(type(df)):
            pd.concat([df, new_df], axis=False, sort=False).to_csv(csv, index=False)
            print(f'{len(new_df)} new posts collected and added to {csv}')
        else:
            new_df.to_csv(csv, index=False)
            print(f'{len(new_df)} posts collected and saved to {csv}')

In [3]:
posts = ScrapeAndMakeCSV('playstation','hot', num=1500)

Initilised API call for
 scrapping: playstation,
 sorted by: hot,
 for: 1500 posts.


Version 7.6.1 of praw is outdated. Version 7.7.0 was released 20 hours ago.


In [4]:
posts.make_df()

csv_loaded = False
Sort method was not recognized, defaulting to hot.
Calling to: playstation, with posts sorted by: hot, gathering 1500 posts.
927 posts collected and saved to ../data/playstation_posts.csv


In [15]:
posts = ScrapeAndMakeCSV('playstation','top', num=1500)

Initilised API call for
 scrapping: playstation,
 sorted by: top,
 for: 1500 posts.


In [16]:
posts.make_df()

csv_loaded = True
Calling to: playstation, with posts sorted by: top, gathering 1500 posts.
982 new posts collected and added to ../data/playstation_posts.csv


In [9]:
posts = ScrapeAndMakeCSV('nintendo','hot', num=1500)

Initilised API call for
 scrapping: nintendo,
 sorted by: hot,
 for: 1500 posts.


In [10]:
posts.make_df()

csv_loaded = False
Sort method was not recognized, defaulting to hot.
Calling to: nintendo, with posts sorted by: hot, gathering 1500 posts.
863 posts collected and saved to ../data/nintendo_posts.csv


In [13]:
posts = ScrapeAndMakeCSV('nintendo','top', num=1500)

Initilised API call for
 scrapping: nintendo,
 sorted by: top,
 for: 1500 posts.


In [14]:
posts.make_df()

csv_loaded = True
Calling to: nintendo, with posts sorted by: top, gathering 1500 posts.
981 new posts collected and added to ../data/nintendo_posts.csv


These data will then be saved into 2 different csv files under the `data` folder.  
They are:  
[nintendo_posts.csv](./data/nintendo_posts.csv)  
[playstation_posts.csv](./data/playstation_posts.csv)

### Data dictionary
|no|feature|description|
|-|-|-|
|1|id|the unique id to each reddit post, used to make only 1 post is scrapped|
|2|title|title of the post scrapped|
|3|post_content|if the post is word post, then post content exit|

In [43]:
ps_df = pd.read_csv('../data/playstation_posts.csv')
ps_df.shape

(1910, 3)

In [44]:
ns_df = pd.read_csv('../data/nintendo_posts.csv')
ns_df.shape

(1916, 3)

With this, I have collected about 1900 posts each from the reddit forums. 
Now, I will be able to start my EDA and prototype model.

---
If there are any adjustments needed to the bot, I can always `get_help()`

In [11]:
posts.get_help('post')

Help on Submission in module praw.models.reddit.submission object:

class Submission(praw.models.listing.mixins.submission.SubmissionListingMixin, praw.models.reddit.mixins.UserContentMixin, praw.models.reddit.mixins.fullname.FullnameMixin, praw.models.reddit.base.RedditBase)
 |  Submission(reddit: 'praw.Reddit', id: Optional[str] = None, url: Optional[str] = None, _data: Optional[Dict[str, Any]] = None)
 |  
 |  A class for submissions to Reddit.
 |  
 |  .. include:: ../../typical_attributes.rst
 |  
 |  Attribute                  Description
 |  ``author``                 Provides an instance of :class:`.Redditor`.
 |  ``author_flair_text``      The text content of the author's flair, or ``None`` if
 |                             not flaired.
 |  ``clicked``                Whether or not the submission has been clicked by the
 |                             client.
 |  ``comments``               Provides an instance of :class:`.CommentForest`.
 |  ``created_utc``            Time the 