# Mapping controversies tutorial x: search reddit submissions and posts for keywords



## Step 1: Installing the right libraries 
Libraries for Jupyter can be understood as pre-written pieces of code. This means that instead of writing many lines of code to, for example, connect to Reddit, you can do it with a single command.  
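
To make the idea concrete, checking whether a library is available can be sketched as a small helper. This is a minimal sketch, not part of the original notebook: the function name `is_installed` is hypothetical, and it only checks availability with the standard library's `importlib.util.find_spec`; it does not install anything.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# Check the three libraries this tutorial relies on.
for name in ["praw", "pandas", "psaw"]:
    status = "available" if is_installed(name) else "missing (install it with pip)"
    print(name + ": " + status)
```

The installation cell in this step goes one step further and runs `pip install` for any library that is missing.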

In order to run the installation, click on the cell below and press "Run" in the menu. 


In [None]:
# In this cell Jupyter checks whether you have the right libraries installed to carry out the harvest of data from Reddit

try: # First, Jupyter tries to import a library
    import praw
    print("praw library has been imported")
except ImportError: # If the import fails, it tries to install the library
    print("praw library not found. Installing...")
    !pip install praw
    try: # ... and tries to import it again
        import praw
    except ImportError: # If the import still fails, it prints an error message
        print("Something went wrong in the installation of the praw library. Please check your internet connection and consult the output from the installation")

try:
    import pandas
    print("pandas library has been imported")
except ImportError:
    print("pandas library not found. Installing...")
    !pip install pandas

    try:
        import pandas
    except ImportError:
        print("Something went wrong in the installation of the pandas library. Please check your internet connection and consult the output from the installation")

try:
    import psaw
    print("psaw library has been imported")
except ImportError:
    print("psaw library not found. Installing...")
    !pip install psaw

    try:
        import psaw
    except ImportError:
        print("Something went wrong in the installation of the psaw library. Please check your internet connection and consult the output from the installation")
        

## Step 2: Generate Reddit app

The first step is to create an app for Reddit, which gives you access to the Reddit API. You can do so by following the first step of [this tutorial](http://www.storybench.org/how-to-scrape-reddit-with-python/). 

### When you have the _14-character personal use script_ and the _27-character secret key_, run the cell below and enter the information.

<img src="https://res.cloudinary.com/dra3btd6p/image/upload/v1550562228/Mapping%20controversies%202019/reddit_app.jpg" title="Reddit app page" style="width: 900px;" /> 
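
A quick sanity check before connecting can catch copy-paste mistakes. The sketch below assumes the key lengths given above (14 and 27 characters); Reddit may issue keys of other lengths, so treat a warning as a hint rather than an error. The function name `check_credentials` is hypothetical and not part of the original notebook.

```python
def check_credentials(personal_use_script, secret_key):
    """Return a list of warnings about credentials that look mistyped."""
    warnings = []
    if len(personal_use_script.strip()) != 14:
        warnings.append("personal use script is not 14 characters long")
    if len(secret_key.strip()) != 27:
        warnings.append("secret key is not 27 characters long")
    return warnings

# Made-up placeholder values, not real credentials.
print(check_credentials("a" * 14, "b" * 27))
print(check_credentials("too-short", "b" * 27))
```

An empty list means both values have the expected length.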


In [None]:
##### RUN THIS CELL FIRST!

import getpass # getpass hides what you type, so secrets are not displayed in the notebook

secret=getpass.getpass("Enter the 27 character secret key from the app page: ")

print("Enter the 14 character personal use script key from the app page: ")
pus=input()

print("Enter your app-name: ")
app_name=input()

print("Enter your reddit user name: ")
user_name=input()

pw=getpass.getpass("Enter your reddit password: ")

## Step 3: Harvest the data from Reddit

This step searches submissions for a keyword and harvests the full comment thread of every matching submission. You can limit the search to a single subreddit or search all of Reddit.

The script outputs two csv files: one with all submissions and comments, including upvotes, downvotes and likes, and one that lists the users taking part in each submission thread. 
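
The layout of the two files can be illustrated with made-up data. The sketch below writes to in-memory buffers instead of files; the column names are taken from the script in this step, while the sample rows, ids and the `example` subreddit are invented for illustration.

```python
import csv
import io

thread_header = ["Type", "id", "author", "body", "created", "up_votes", "down_votes",
                 "likes", "depth", "parent_id", "url", "reports", "subreddit", "submission"]
users_header = ["submission_id", "submission_url", "users", "subreddit"]

# Invented sample rows; columns that are not given are left empty.
rows = [
    {"Type": "Submission", "id": "abc123", "author": "alice", "submission": "abc123"},
    {"Type": "comment", "id": "def456", "author": "bob",
     "parent_id": "abc123", "submission": "abc123"},
]

# First file: one row per submission or comment.
thread_file = io.StringIO()
writer = csv.DictWriter(thread_file, fieldnames=thread_header, restval="")
writer.writeheader()
writer.writerows(rows)

# Second file: one row per submission, with all user names joined by semicolons.
users = ";".join(row["author"] for row in rows)
users_file = io.StringIO()
users_writer = csv.writer(users_file)
users_writer.writerow(users_header)
users_writer.writerow(["abc123", "https://www.reddit.com/r/example/comments/abc123",
                       users, "example"])

print(thread_file.getvalue())
print(users_file.getvalue())
```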

## BE AWARE! This script may take several hours or even days if the limit is set too high!

In [4]:
import pandas as pd
import datetime as dt
import praw
import psaw
import csv

reddit = praw.Reddit(client_id=pus,
                     client_secret=secret,
                     user_agent=app_name,
                     username=user_name,
                     password=pw)
api=psaw.PushshiftAPI(reddit)

print("There is a limit of max 2,000 submissions. It cannot be increased")
print("Do you want to lower it (y/n)?")
limit_c=input()
max_response_cache=2000
if limit_c.lower()=="y":
    print("Enter the new limit (below 2,000)")
    max_response_cache=int(input())
if max_response_cache>2000:
    max_response_cache=2000

print("Do you want to limit your search to a specific subreddit (y/n)?")
sub_filter=input()
sub_reddit=""
if sub_filter.lower()=="y":
    print("Enter the name of the subreddit: ")
    sub_reddit=input()

print("Set the start date: (Format: yyyy-mm-dd) ")
start_date=input()
# Convert the date to a Unix timestamp (seconds since 1970) for the search
start_epoch=int(dt.datetime.strptime(start_date.strip(), "%Y-%m-%d").timestamp())


print("What would you like to call the file?")
input_filename=input()

print("Input the keyword you would like to search for: ")
keyword=input()
csv_headers=["Type","id","author","body","created","up_votes","down_votes","likes","depth", "parent_id", "url", "reports", "subreddit", "submission"]
blacklisted_users=[]
print("Enter user names you want to blacklist. If you want to blacklist multiple users, use comma separation")
raw_blacklist=input()

# Split on commas, ignoring empty entries and surrounding spaces
for each in raw_blacklist.split(","):
    if each.strip():
        blacklisted_users.append(each.strip().lower())
# Add the .csv extension only if the file name does not already end with it
if not input_filename.endswith(".csv"):
    input_filename=input_filename+".csv"
filename="keyword_search_submissions_"+input_filename
filename_2="keyword_search_aggregate_users_submissions_"+input_filename

with open(filename,"w", newline='',encoding='utf-8') as f:
    wr = csv.writer(f, delimiter=",")
    wr.writerow(csv_headers)
    
if sub_filter.lower()=="y":
    gen = api.search_submissions(q=keyword, subreddit=sub_reddit, after=start_epoch)
else:
    gen = api.search_submissions(q=keyword, after=start_epoch)

cache = []
com_count=0
sub_count=0

csv_headers_2=["submission_id", "submission_url", "users", "subreddit"]
with open(filename_2,"w", newline='',encoding='utf-8') as q:
    wr2 = csv.writer(q, delimiter=",")
    wr2.writerow(csv_headers_2)

for c in gen:
    sub=c
    sub_users=[]
    sub_author=sub.author
    sub_downs=sub.downs
    sub_ups=sub.ups
    sub_likes=sub.likes
    sub_title=sub.title
    sub_created=sub.created_utc # use the UTC timestamp, as is done for the comments
    sub_created=dt.datetime.utcfromtimestamp(sub_created).strftime('%Y-%m-%d %H:%M:%S')
    subreddit_name=str(sub.subreddit)
    sub_id=sub.id
    sub_reports=sub.num_reports
    sub_text=sub.selftext
    sub_users.append(str(sub_author))
    sub_url="https://www.reddit.com/r/"+subreddit_name+"/comments/"+sub_id
    sub.comments.replace_more(limit=None)
    sub_list=sub.comments.list()
    sub_count=sub_count+1
    csv_list=["Submission",sub_id,sub_author, sub_text,sub_created,sub_ups,sub_downs, sub_likes, "N/A", subreddit_name, sub_url, sub_reports, subreddit_name, sub_id]
    with open(filename,"a", newline='',encoding='utf-8') as f:
        wr = csv.writer(f, delimiter=",")
        wr.writerow(csv_list)
    for comment in sub_list:
        comment_depth=comment.depth
        comment_parent_id=comment.parent_id
        comment_parent_id=comment_parent_id.split("_")[1]
        comment_reports=comment.num_reports
        comment_author=str(comment.author)
        if comment_author.lower() in blacklisted_users:
            continue
        comment_body=comment.body
        comment_created=comment.created_utc
        comment_created=dt.datetime.utcfromtimestamp(comment_created).strftime('%Y-%m-%d %H:%M:%S')
        comment_downs=comment.downs
        comment_ups=comment.ups
        comment_likes=comment.likes
        comment_id=comment.id
        sub_users.append(str(comment_author))

        # Build the link from the permalink Reddit provides, so the raw title (with spaces) is not placed in the URL
        comment_url="https://www.reddit.com"+comment.permalink
        csv_list=["comment",comment_id, comment_author, comment_body,comment_created, comment_ups,comment_downs,comment_likes,comment_depth,comment_parent_id,comment_url,comment_reports, subreddit_name, sub_id]
        with open(filename,"a", newline='',encoding='utf-8') as f:
            wr = csv.writer(f, delimiter=",")
            wr.writerow(csv_list)
        com_count=com_count+1

    # Join all non-empty user names with semicolons
    user_entry=";".join(user for user in sub_users if user)
    csv_list_2=[str(sub_id), str(sub_url), user_entry,subreddit_name]
            
    with open(filename_2,"a", newline='',encoding='utf-8') as q:
        wr2 = csv.writer(q, delimiter=",")
        wr2.writerow(csv_list_2)
    if sub_count >= max_response_cache:
        break
    if sub_count % 50 == 0:
        print("Data harvested from "+str(sub_count)+" submissions out of maximum "+str(max_response_cache)+". Continuing harvest...")


print("The script is done! We have harvested "+str(sub_count)+" submissions with "+ str(com_count)+" comments in total.")


There is a limit of max 2,000 submissions. It cannot be increased
Do you want to lower it (y/n)?
y
Enter the new limit (below 2,000)
300
Do you want to limit your search to a specific subreddit (y/n)?
n
Set the start date: (Format: yyyy-mm-dd) 
2017-01-01
What would you like to call the file?

Input the keyword you would like to search for: 
home sharing
Enter user names you want to blacklist. If you want to blacklist multiple users, use comma separation

Data harvested from 50 submissions out of maximum 300. Continuing harvest...
Data harvested from 100 submissions out of maximum 300. Continuing harvest...
Data harvested from 150 submissions out of maximum 300. Continuing harvest...
Data harvested from 200 submissions out of maximum 300. Continuing harvest...
Data harvested from 250 submissions out of maximum 300. Continuing harvest...
The script is done! We have harvested 300 submissions with 2484 comments in total.
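
Once the harvest has finished, the first csv file can be summarised with a few lines of Python. The sketch below is one possible way to inspect the data, not part of the original script. It reads a tiny made-up sample with a shortened column layout, and `sample_csv` stands in for the file written by the script.

```python
import csv
import io
from collections import Counter

# Invented sample data using a few of the columns from the harvested file.
sample_csv = io.StringIO(
    "Type,id,author,body\n"
    "Submission,abc123,alice,First post\n"
    "comment,def456,bob,A reply\n"
    "comment,ghi789,alice,Another reply\n"
)

type_counts = Counter()    # rows per type (Submission or comment)
author_counts = Counter()  # rows per author

for row in csv.DictReader(sample_csv):
    type_counts[row["Type"]] += 1
    author_counts[row["author"]] += 1

print(dict(type_counts))
print(author_counts.most_common(1))
```

To run this on real data, replace `sample_csv` with the opened output file, for example `open(filename, newline='', encoding='utf-8')`.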
