# Search Comments
In this notebook, I will show you how to use the method `search_comments` from `PMAW` to retrieve comments from the Reddit Pushshift API. To view more details about the Search Comments endpoint you can view the Pushshift [documentation](https://github.com/pushshift/api#searching-comments).

In [1]:
import pandas as pd
from pmaw import PushshiftAPI

In [2]:
# instantiate
api = PushshiftAPI()

## Data Preparation

In [30]:
# import test data into a dataframe
posts_df = pd.read_csv(f'./test_data.csv', delimiter=';', header=0)
posts_df.head(5)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,author_cakeday,distinguished,suggested_sort,crosspost_parent,crosspost_parent_list,category,top_awarded_type,poll_data,steward_reports,comment_ids
0,[],False,nf_hades,,[],,text,t2_hriq1b,False,False,...,,,,,,,,,,"gjacwx5,gjad2l6,gjadatw,gjadc7w,gjadcwh,gjadgd..."
1,[],False,MyLittleDeku,,[],,text,t2_7dj62vj2,False,False,...,,,,,,,,,,gjacn1r
2,[],False,lilirucaarde12,,[],,text,t2_6i04uaxw,False,False,...,,,,,,,,,,"gjac5fb,gjacdy5,gjaco45,gjasj4f,gjbxfeg"
3,[],False,[deleted],,,,,,,,...,,,,,,,,,,gjac9d6
4,[],False,sirdimpleton,,[],,text,t2_bznmn4i,False,False,...,,,,,,,,,,"gjaocmg,gjb2jsj,gjbisrw,gjbjbk8"


In [31]:
len(posts_df)

2500

The data in `posts_df`, contains 2500 submissions and their respective metadata extracted from a subreddit submission search, the comment_ids were added post-search with additional requests.

In [32]:
posts_df.loc[:, 'comment_ids'].isna().sum()

271

In [33]:
# extract comment_ids
comment_ids_str = list(posts_df.loc[posts_df['comment_ids'].notna(), 'comment_ids'])
comment_ids_str

['gjacwx5,gjad2l6,gjadatw,gjadc7w,gjadcwh,gjadgd7,gjadlbc,gjadnoc,gjadog1,gjadphb,gjadtz3,gjaduck,gjadxa0,gjaeb3p,gjaeb5o,gjaeg5d,gjaegdn,gjaemkt,gjaenva,gjaerpm,gjaex2y,gjaf5nv,gjaim0d,gjapx5s,gjaqruo,gjarqic',
 'gjacn1r',
 'gjac5fb,gjacdy5,gjaco45,gjasj4f,gjbxfeg',
 'gjac9d6',
 'gjaocmg,gjb2jsj,gjbisrw,gjbjbk8',
 'gjaciiq,gjacll6,gjacnpu,gjad0li,gjad2rq,gjahtqa,gjahz69,gjaimh4,gjaip2p,gjaixlq,gjaj41x,gjak02t,gjaxv70,gjay8xy,gjb8kbs,gjbwke7,gjc0n8u,gjc0ran',
 'gja9pzd,gja9q0l,gjabeoy,gjacf5k,gjad8x6,gjaes65,gjagszp,gjcntcq',
 'gja8u8b',
 'gja7tcu',
 'gja8327,gja85g1,gja89wk,gja8ad3,gja8apg,gja8fbs,gja8h6x,gja8hbn,gja8lee,gja8mm0,gja8mpu,gja8s6g,gja8uti,gja8whd,gja8wtq,gja91nq,gja928d,gja9rpi,gjabj7f',
 'gja7n4o',
 'gja7juf',
 'gja9tf1,gjadsu9,gjbl3js,gjbs0pw,gjbs2a7,gjbs62p,gjbs8de',
 'gja77dy,gja7l5e,gja82gc,gja857n,gja97f6,gja9ghh,gjabw2v,gjacgjq,gjacmp2,gjacrpb,gjad97r,gjadnju,gjadrt6,gjadvy1,gjaeqh6,gjaf8uv,gjag2ev,gjag67b,gjagffm,gjagoi0,gjagy9r,gjah0br,gjah5sp,gjah9g7,gjahz5j,gj

In [34]:
# convert strings to lists
comment_ids = []
for c_str in comment_ids_str:
    # exclude ending , since all entries include one
    comment_ids.extend(c_str[:-1].split(","))
num_comments = len(comment_ids)
print(f'Ready to retrieve {num_comments} comments')

Ready to retrieve 43219 comments


In [35]:
comment_ids[:10]

['gjacwx5',
 'gjad2l6',
 'gjadatw',
 'gjadc7w',
 'gjadcwh',
 'gjadgd7',
 'gjadlbc',
 'gjadnoc',
 'gjadog1',
 'gjadphb']

## Search Comments by ID

### Using a Single Comment ID

In [36]:
comment = api.search_comments(ids=comment_ids[0])
comment

Total Success Rate: 100.00% -- Total Reqs: 1 -- Num Retries: 0


[{'all_awardings': [],
  'approved_at_utc': None,
  'associated_award': None,
  'author': 'AVrandomusic',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_747ea0dh',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'banned_at_utc': None,
  'body': "Who's complaining? I'm a thigh stand!",
  'can_mod_post': False,
  'collapsed': False,
  'collapsed_because_crowd_control': None,
  'collapsed_reason': None,
  'comment_type': None,
  'created_utc': 1610668310,
  'distinguished': None,
  'edited': False,
  'gildings': {},
  'id': 'gjacwx5',
  'is_submitter': False,
  'link_id': 't3_kxi2w8',
  'locked': False,
  'no_follow': False,
  'parent_id': 't3_kxi2w8',
  'permalink': '/r/anime/comments/kxi2w8/stop_complaining_about_the_thighs_in/gjacwx5/',
  'ret

### Using Multiple Comment IDs

In [37]:
%%time
comments_arr = api.search_comments(ids=comment_ids)

Total Success Rate: 72.73% -- Total Reqs: 44 -- Num Retries: 0
Total Success Rate: 75.00% -- Total Reqs: 56 -- Num Retries: 1
Total Success Rate: 75.86% -- Total Reqs: 58 -- Num Retries: 2
Wall time: 1min 2s


In [38]:
print(f'{len(comments_arr)} comments returned by Pushshift')

40990 comments returned by Pushshift


In [39]:
comments_arr[:3]

[{'all_awardings': [],
  'approved_at_utc': None,
  'associated_award': None,
  'author': 'AutoModerator',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_6l4z3',
  'author_patreon_flair': False,
  'author_premium': True,
  'awarders': [],
  'banned_at_utc': None,
  'body': '# Source Material Corner\n\nReply to this comment for any source-related discussion, future spoilers (including future characters, events and general hype about future content), comparison of the anime adaptation to the original, or just general talk about the source material. **You are still required to tag all spoilers.** Discussions about the source outside of this comment tree will be removed, and replying with spoilers outside of the source corner will lead to bans.\n\nThe spoiler syntax is:  \n`[Spo

### Save Comments to CSV

In [40]:
# convert comments to dataframe
comments_df = pd.DataFrame(comments_arr)

In [41]:
comments_df.head(3)

Unnamed: 0,all_awardings,approved_at_utc,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,top_awarded_type,total_awards_received,treatment_tags,author_cakeday
0,[],,,AutoModerator,,,[],,,,...,1610718439,1,False,True,anime,t5_2qh22,,0,[],
1,[],,,AutoModerator,,,[],,,,...,1610719160,1,False,False,anime,t5_2qh22,,0,[],
2,[],,,lostintheabyss_,,,[],,,,...,1610719193,0,True,False,anime,t5_2qh22,,0,[],


In [43]:
# replace usage of ; in comment bodies
import re
for index, row in comments_df.iterrows():
    row['body'] = re.sub(r';+', '.', row['body'])
    
comments_df.to_csv('./test_comments.csv', sep=';', header=True, index=False, columns=list(comments_df.axes[1]))