# Using Pushshift to Get Around PRAW Request Limits
When you use the official Reddit API through PRAW, you run into a request limit. For some applications this isn't a big deal: the 1000 most recent comments from a user are often quite useful. But when you really want a ton of data, you should use Pushshift, an unofficial copy of Reddit that doesn't have these limits and also supports aggregations. Learn more about it at https://github.com/pushshift/api  
  
In this tutorial I'll show a pretty common use case: using Pushshift to get the "index" of post IDs and then using PRAW to fetch the data for those posts. This way you can search for all posts or comments matching a query farther back than Reddit would let you when sorting by new, which stops at the 1000 most recent posts. 
  
Side note for those familiar with Pushshift access in Python: I prefer [PSAW](https://github.com/dmarx/psaw) over [PMAW](https://github.com/mattpodolak/pmaw) because it returns [PRAW](https://praw.readthedocs.io/en/stable/) objects, and I am irked by how PMAW sets up its own logger. PMAW's speed improvements aren't as significant as I would have thought, it doesn't seem to support aggregations, and PRAW objects are easier to work with in my existing code base.

In [1]:
from secret_services import reddit,psaw_pushshift
import utils
import tqdm

Version 7.4.0 of praw is outdated. Version 7.5.0 was released Sunday November 14, 2021.
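
The `secret_services` module imported in the first cell is private to the author and is not shown. As a rough sketch of what such a module might contain, the function below builds a PRAW client and wraps it with PSAW; passing the PRAW instance to `PushshiftAPI` is what makes PSAW return PRAW objects. The function name `build_clients` and the environment variable names are assumptions for illustration, not the author's actual setup.

```python
import os


def build_clients():
    """Return (reddit, psaw_pushshift), reading credentials from the environment."""
    # Hypothetical variable names; any secret store would work here.
    client_id = os.environ["REDDIT_CLIENT_ID"]
    client_secret = os.environ["REDDIT_CLIENT_SECRET"]
    user_agent = os.environ.get("REDDIT_USER_AGENT", "pushshift-tutorial")

    # Imported lazily so a missing credential fails before either import runs.
    import praw
    from psaw import PushshiftAPI

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    # Wrapping the PRAW instance makes PSAW hand back PRAW objects.
    psaw_pushshift = PushshiftAPI(reddit)
    return reddit, psaw_pushshift
```

With something like this in place, `reddit, psaw_pushshift = build_clients()` would give the two names the rest of the notebook relies on.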


In [2]:
import pandas as pd

#pmaw example - returns json
subreddit_name="environment"
word_to_check="companies"

In [5]:
# PSAW example - returns PRAW objects
import pandas as pd

subreddit_name = "environment"
word_to_check = "companies"
comments = psaw_pushshift.search_comments(
    q=word_to_check, subreddit=subreddit_name, limit=1001, before=1629990795
)

post_with_comments = []
for comment in comments:
    # Case-insensitive check (word_to_check itself must be lowercase)
    if word_to_check in comment.body.lower():
        post_with_comments.append(
            {
                "comment_id": comment.id,
                "comment_text": comment.body,
                "score": comment.score,
                "post": comment.submission.id,
            }
        )
    else:
        # Edited, deleted, or removed comments no longer contain the word
        print(comment.body)
df = pd.DataFrame(post_with_comments)
df




[deleted]
[deleted]
[deleted]
[removed]
[deleted]
[removed]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[removed]
[deleted]
[deleted]
[deleted]
[removed]
[deleted]
[deleted]
[deleted]
[deleted]
[deleted]
[removed]
[deleted]
[removed]
[removed]
[deleted]


|  | comment_id | comment_text | score | post |
| --- | --- | --- | --- | --- |
| 0 | hafl3n0 | I don't artificially inseminate..... my cows f... | -1 | p9wdlq |
| 1 | haf6krp | Why the fuck cut down the last of the old grow... | 16 | pbm2jq |
| 2 | haemz66 | I was beginning to believe that the vegan move... | 1 | pbbq9k |
| 3 | hac5sym | Its just a move by oil companies to keep doing... | 8 | pbhpam |
| 4 | habovkv | There's going to be a fight, but it is looking... | 2 | pbh9tg |
| ... | ... | ... | ... | ... |
| 960 | h0bdgsn | Part of it might be due to legal steps that ha... | 3 | nq7290 |
| 961 | h09i2m4 | 🤷🏻‍♂️ \n\nVast majority of republicans now agr... | 2 | nq53zt |
| 962 | h08pa7z | That is absolutely true, but check this out: ... | 1 | nq0j33 |
| 963 | h08h8n7 | I mean we are in a capitalist society. It’s an... | 1 | np15wg |
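
The `before=1629990795` argument passed to `search_comments` is a Unix timestamp in seconds, which is hard to read at a glance. Here is a small standard-library sketch for converting between dates and that format; `to_epoch` and `from_epoch` are hypothetical helper names, not part of PSAW.

```python
from datetime import datetime, timezone


def to_epoch(dt):
    """Convert a datetime to whole epoch seconds, treating naive values as UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def from_epoch(seconds):
    """Convert epoch seconds back to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


cutoff = from_epoch(1629990795)
print(cutoff.isoformat())  # 2021-08-26T15:13:15+00:00
```

Going the other way, `to_epoch(datetime(2021, 8, 26, 15, 13, 15))` gives back `1629990795`, so a cutoff can be written as a date and converted just before the search.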


In [7]:
reddit.submission(df['post'].iloc[5])

Submission(id='pam1cx')

In [8]:
full_discussion_rows = []
# Only the first 5 posts: utils.traverse_post is very slow on the 6th post ID (pam1cx)
# https://www.reddit.com/r/environment/comments/pam1cx/the_colorado_river_that_supplies_water_to_40/
for post_id in tqdm.tqdm(df["post"].iloc[:5]):
    comments = utils.traverse_post(reddit.submission(post_id))
    for comment, level in comments:
        full_discussion_rows.append(
            {
                "comment_id": comment.id,
                "comment_text": comment.body,
                "score": comment.score,
                "post": post_id,
                "level": level,
            }
        )
full_discussion_df = pd.DataFrame(full_discussion_rows)
full_discussion_df

100%|██████████| 5/5 [00:10<00:00,  2.02s/it]


|  | comment_id | comment_text | score | post | level |
| --- | --- | --- | --- | --- | --- |
| 0 | ha15f6s | Yes! Tax plastics, tax meat production, etc. I... | 239 | p9wdlq | 1 |
| 1 | ha1du14 | Plastics is much harder to accurately tax than... | 50 | p9wdlq | 2 |
| 2 | ha22vw7 | Just start banning single use plastics anytime... | 17 | p9wdlq | 3 |
| 3 | ha2bpqi | We do it with redemption values, those are an ... | 12 | p9wdlq | 3 |
| 4 | ha2f04u | Oh, definitely, it's just that hose aspulls ca... | 11 | p9wdlq | 4 |
| ... | ... | ... | ... | ... | ... |
| 346 | habovkv | There's going to be a fight, but it is looking... | 2 | pbh9tg | 1 |
| 347 | habqpnj | Alaska Election Info\n\n[Register to Vote](htt... | 2 | pbh9tg | 1 |
| 348 | habqqui | Would you look at that, all of the words in yo... | 1 | pbh9tg | 2 |
| 349 | hac33vt | How many bots are there on Reddit anyway? | 1 | pbh9tg | 3 |
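
Several matching comments can belong to the same post, in which case the `post` column repeats that ID and the traversal loop would fetch the same post more than once. One way to guard against that is to de-duplicate the IDs while keeping their order. This is a minimal sketch; `unique_in_order` is a hypothetical helper name and the ID list is made up from IDs seen in the tables.

```python
def unique_in_order(ids):
    """Drop repeated IDs while keeping the first occurrence of each."""
    # dict keys preserve insertion order, so this keeps the original ordering.
    return list(dict.fromkeys(ids))


# Illustrative input: two posts appear twice.
post_ids = ["p9wdlq", "pbm2jq", "p9wdlq", "pbbq9k", "pbm2jq"]
print(unique_in_order(post_ids))  # ['p9wdlq', 'pbm2jq', 'pbbq9k']
```

When the IDs are already in a pandas Series, `df["post"].drop_duplicates()` does the same job.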


In [16]:
full_discussion_df.to_csv("all_comments_from_found_posts_with_pushshift.csv",index=False)

# DONE!
We've done it: we've extracted a corpus of comments from posts matching a subreddit and query filter. Use it as you wish. For larger workflows or big posts you may want to add more functionality to `utils.traverse_post`; it's super slow on that Lake Mead post: https://www.reddit.com/r/environment/comments/pam1cx/the_colorado_river_that_supplies_water_to_40/
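
The author's `utils.traverse_post` is not shown, so the following is only a guess at its shape: a depth-first walk that yields `(comment, level)` pairs with top-level comments at level 1, matching the `level` column in the output. The `more_limit` parameter is an assumption added here; it is passed to PRAW's `replace_more`, where `0` drops the "load more comments" stubs without extra requests and `None` expands all of them, which is the slow part on very large threads.

```python
from types import SimpleNamespace


def traverse_post(submission, more_limit=0):
    """Yield (comment, level) pairs depth-first; top-level comments are level 1."""
    forest = submission.comments
    # On a real PRAW submission, deal with the "load more comments" stubs first.
    if hasattr(forest, "replace_more"):
        forest.replace_more(limit=more_limit)
    # An explicit stack avoids recursion limits on deeply nested threads.
    stack = [(c, 1) for c in reversed(list(forest))]
    while stack:
        comment, level = stack.pop()
        yield comment, level
        stack.extend((r, level + 1) for r in reversed(list(comment.replies)))


# Stand-in objects so the sketch runs without network access.
def fake(cid, *replies):
    return SimpleNamespace(id=cid, replies=list(replies))


post = SimpleNamespace(comments=[fake("a", fake("b", fake("c")), fake("d")), fake("e")])
print([(c.id, lvl) for c, lvl in traverse_post(post)])
# [('a', 1), ('b', 2), ('c', 3), ('d', 2), ('e', 1)]
```

Keeping `more_limit` small trades completeness for speed, which may be an acceptable compromise for posts with thousands of comments.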