# Reddit



## Prerequisites

To scrape Reddit, we will use the PRAW library. PRAW is a wrapper around Reddit's API, so you will first need to create a Reddit account and register an application. From the application's settings, copy the client ID and client secret. A tutorial on how to do so can be found [here](https://www.youtube.com/watch?v=0mGpBxuYmpU).

## Getting Started

First, install the PRAW library in your environment using ```pip install praw```.

In [None]:
pip install praw

Next, import the PRAW library into our code. We also ```import io``` here so that we can later write the information we collect to a ```.csv``` file.

In [None]:
import praw
import io

## Collecting Information

Now we'll get to the interesting stuff. First, we create an instance of the PRAW wrapper. Here is where you will need to paste the ```client_id``` and ```client_secret``` you copied earlier. You will also need to include the ```username``` and ```password``` of the Reddit account that owns the application.
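
Pasting secrets directly into a notebook makes them easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the four values from environment variables and returns them as keyword arguments for `praw.Reddit`. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical choices made for this illustration; they are not part of PRAW.

```python
import os

# Hypothetical environment variable names; any consistent naming scheme works.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "username": "REDDIT_USERNAME",
    "password": "REDDIT_PASSWORD",
}


def load_reddit_credentials(environ=os.environ):
    """Return keyword arguments for praw.Reddit, read from a mapping of variables."""
    missing = [var for var in REQUIRED_VARS.values() if not environ.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: environ[var] for key, var in REQUIRED_VARS.items()}


# Possible usage once the variables are set (not run here):
# reddit = praw.Reddit(user_agent="<user_agent>", **load_reddit_credentials())
```

Passing the mapping in as an argument keeps the helper easy to try out with a plain dictionary before any real secrets are involved.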


With this instance, we can grab a subreddit through the ```reddit.subreddit``` function and then iterate through the submissions/posts in the subreddit. In this example, we iterate through the hottest submissions in r/worldnews, but we could also have fetched the newest submissions using ```subreddit.new()```, or the top submissions using ```subreddit.top()```. A full list of subreddit functionality can be found [here](https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html#).


Within each post, we can get items like the ```title```, ```url```, and ```author```. More interestingly, we can also traverse the comments and obtain information about each comment in a similar fashion.

In the code below, you will notice a ```limit``` value set for both ```subreddit.hot()``` and ```post.comments.replace_more()```. For ```subreddit.hot()```, ```limit``` caps the number of submissions returned. For ```replace_more()```, it caps how many "load more comments" placeholders are expanded, and each expansion costs an additional API request. Requesting every post and comment will likely be too much information to handle and can cause the underlying Reddit API to time out. Typically, limits up to 2000 will be fine, but it may take some experimentation to find a suitable value. An alternative is to set the limit to ```None``` and run the script until the API sends a ```TooManyRequests``` error.

In [None]:
# user_agent should be a short descriptive string that identifies your script
reddit = praw.Reddit(
    user_agent="<user_agent>",
    client_id="<client_id>",
    client_secret="<client_secret>",
    username="<username>",
    password="<password>",
    check_for_async=False,
)

subreddit = reddit.subreddit("worldnews")
print(subreddit.title)
for post in subreddit.hot(limit=50):
    print(f"Post: {post.title}")
    # print(post.url)
    print("---------")
    # Expand up to 100 "load more comments" placeholders before flattening the tree
    post.comments.replace_more(limit=100)
    for comment in post.comments.list():
        print(f"\t{comment.author}:'{comment.body}'\n")
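
If a long run does hit a ```TooManyRequests``` error, one common way to cope is to wait and try again with a growing delay. The helper below is a minimal sketch of that idea and is not something PRAW provides: `call_with_backoff` and its parameters are hypothetical names, and the exception class is passed in as an argument so the snippet does not depend on PRAW being installed.

```python
import time


def call_with_backoff(fetch, retry_on, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a retry_on exception, wait and retry with a doubled delay."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# With PRAW installed, a call might look like this (assumption, not run here):
# from prawcore.exceptions import TooManyRequests
# posts = call_with_backoff(lambda: list(subreddit.hot(limit=50)), TooManyRequests)
```

The delays here (1, 2, 4 seconds) are arbitrary starting points; suitable values depend on how aggressively the script is querying the API.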

### Redditors

You can also use PRAW to get information about specific Redditors. As shown above, this can be done through the ```.author``` attribute of a post or comment, but if we are interested in a specific Redditor, we can use the ```.redditor(<username>)``` method to get a Redditor object to collect data from. As alluded to earlier, note how ```limit``` can be set to ```None``` if desired.

In [None]:
redditor = reddit.redditor("helix2d")
for comment in redditor.comments.new(limit=None):
    print(comment.body)
    print("----")

## Putting it in a ```.csv```

Now that we can retrieve information from Reddit, we can store that data for further processing using Python's ```io``` library. Because a comment can contain almost any character, we first preprocess each entry to replace characters that would break the ```.csv``` format (newlines, commas, and quotes) and then write it to the file.

In [None]:
subreddit = reddit.subreddit("worldnews")
csv_file_path = "data.csv"
# The context manager closes the file even if a request fails partway through
with open(csv_file_path, mode="w", encoding="utf-8") as file:
    file.write("Username,Comment\n")
    for post in subreddit.hot(limit=5):
        post.comments.replace_more(limit=10)
        for comment in post.comments.list():
            # Replace characters that would break the one-row-per-comment layout
            clean_body = comment.body.replace("\n", "<newline>")
            clean_body = clean_body.replace(",", "<comma>")
            clean_body = clean_body.replace('"', "<quote>")
            clean_body = clean_body.replace("'", "<squote>")
            # Build the row in an in-memory buffer, then write it to the file
            sbuilder = io.StringIO()
            sbuilder.write(f"{comment.author},'{clean_body}'\n")
            file.write(sbuilder.getvalue())
            sbuilder.close()
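
As an alternative to replacing characters by hand, Python's standard `csv` module quotes commas, quotation marks, and newlines automatically, so the original comment text survives a round trip. The sketch below uses made-up sample data in place of real `(comment.author, comment.body)` pairs, and the helper name `write_comments_csv` is hypothetical.

```python
import csv
import io


def write_comments_csv(rows, out):
    """Write (username, comment) pairs to the file-like object `out` as CSV."""
    writer = csv.writer(out)
    writer.writerow(["Username", "Comment"])
    for username, body in rows:
        writer.writerow([username, body])


# Made-up sample standing in for (comment.author, comment.body) pairs.
sample = [("alice", 'She said "hi", then left.\nSecond line.')]

buffer = io.StringIO()
write_comments_csv(sample, buffer)

# Reading the text back recovers the comment unchanged, punctuation and all.
rows_back = list(csv.reader(io.StringIO(buffer.getvalue())))
```

To write to disk instead of an in-memory buffer, the same function can be given a file opened with `open("data.csv", "w", newline="", encoding="utf-8")`; the `newline=""` argument is what the `csv` documentation recommends for files passed to a writer.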

## Further Learning

Documentation for the PRAW library and its extended features can be found [here](https://praw.readthedocs.io/en/stable/index.html).