# Scraping and Gathering

This section demonstrates how data was scraped from the two subreddits of focus: r/Relationships and r/AbusiveRelationships. Scraping was done with the [Pushshift API](https://github.com/pushshift/api).

In [134]:
import requests
import pandas as pd
import numpy as np 

### Scraping: r/Relationships

I first tried to manage the query parameters with a dictionary, but I could not find a format that satisfied both the API's link syntax and Python conventions. I therefore built the links used for scraping by hand.
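A dictionary-based approach might still be workable. The sketch below assumes (the text above does not say) that the trouble came from percent-encoding: by default `urllib.parse.urlencode` turns `>` into `%3E` and `,` into `%2C`. Passing `safe=">,"` leaves both characters untouched, so the result has the same form as a hand-written link. `build_url` and its arguments are hypothetical names used only for illustration.

```python
from urllib.parse import urlencode

BASE = "https://api.pushshift.io/reddit/search/submission/"


def build_url(subreddit, min_comments, **window):
    """Build a Pushshift submission-search URL (hypothetical helper)."""
    params = {
        "subreddit": subreddit,
        "size": 500,
        "num_comments": f">{min_comments}",
        "fields": "title,subreddit,selftext",
    }
    # Optional date filters such as after="30d" or before="61d".
    params.update(window)
    # safe=">," stops urlencode from escaping '>' and ','.
    return BASE + "?" + urlencode(params, safe=">,")


recent_mo_url = build_url("relationships", 100, after="30d")
print(recent_mo_url)
```

Keyword arguments keep their order, so `build_url("relationships", 100, after="60d", before="31d")` ends in `&after=60d&before=31d`.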

##### r/Relationships parameters
| parameter name | condition | 
|:---|:---|
| subreddit | relationships | 
| size | 500 | 
| num_comments | >100 | 
| fields | title, selftext, subreddit | 

The goal was to collect posts that not only appeared in the community but also resonated with it, so I targeted posts with more than 100 comments. I arrived at that threshold after looking at the comment counts of the top posts of the day and of the week.

I scraped three separate time windows, using the following conditions:

| time window | parameter name | condition | 
|:--|:--|:---|
| past month | after | 30d | 
| previous month | after | 60d | 
| | before | 31d | 
| more than two months ago | before | 61d | 

##### r/Relationships API links
_Base link:_ 
 - https://api.pushshift.io/reddit/search/submission/?subreddit=relationships&size=500&num_comments=>100&fields=title,subreddit,selftext

_Link add-ons for date intervals:_
 - In the past month: 
   - &after=30d
 - In the previous month: 
   - &after=60d&before=31d
 - Before the previous two months: 
   - &before=61d
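The base link and the three add-ons above can also be joined programmatically rather than typed out three times. This is only a sketch of that idea; the dictionary labels (`past_month`, `previous_month`, `older`) are hypothetical names, not anything the API defines.

```python
BASE_LINK = (
    "https://api.pushshift.io/reddit/search/submission/"
    "?subreddit=relationships&size=500&num_comments=>100"
    "&fields=title,subreddit,selftext"
)

# Date-interval add-ons from the list above, keyed by hypothetical labels.
WINDOW_SUFFIXES = {
    "past_month": "&after=30d",
    "previous_month": "&after=60d&before=31d",
    "older": "&before=61d",
}

# One full link per time window.
urls = {label: BASE_LINK + suffix for label, suffix in WINDOW_SUFFIXES.items()}

for label, url in urls.items():
    print(label, url)
```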

In [92]:
recent_mo_url_r = "https://api.pushshift.io/reddit/search/submission/?subreddit=relationships&size=500&num_comments=>100&fields=title,subreddit,selftext&after=30d"

recent_2mo_url_r = "https://api.pushshift.io/reddit/search/submission/?subreddit=relationships&size=500&num_comments=>100&fields=title,subreddit,selftext&after=60d&before=31d"

before_2mo_url_r = "https://api.pushshift.io/reddit/search/submission/?subreddit=relationships&size=500&num_comments=>100&fields=title,subreddit,selftext&before=61d"

In [93]:
response1 = requests.get(recent_mo_url_r)

response2 = requests.get(recent_2mo_url_r)

response3 = requests.get(before_2mo_url_r)

In [94]:
response1.status_code

200

In [95]:
response2.status_code

200

In [96]:
response3.status_code

200

In [97]:
r_data1 = response1.json()

r_data2 = response2.json()

r_data3 = response3.json()
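The status check and the JSON parsing can be folded into one step so that a failed request never reaches the DataFrame stage. The helper below is a minimal sketch, not part of the original workflow: `posts_from_response` is a hypothetical name, and `FakeResponse` is a stand-in object so the example runs without network access. It relies only on the `status_code` attribute and `json()` method used above, and on the `"data"` key that the responses are indexed with when the DataFrames are built.

```python
def posts_from_response(response):
    """Return the list of posts in a Pushshift-style response (hypothetical helper)."""
    # Fail loudly instead of trying to parse an error page.
    if response.status_code != 200:
        raise RuntimeError(f"request failed with status {response.status_code}")
    payload = response.json()
    # Posts sit under the "data" key of the parsed JSON.
    return payload["data"]


class FakeResponse:
    """Stand-in for a requests response so the sketch runs offline."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


ok = FakeResponse(200, {"data": [{"title": "example post"}]})
posts = posts_from_response(ok)
print(posts)
```

With a real response object, `posts_from_response(response1)` would take the place of the separate status check and `.json()` call.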

In [98]:
r_df1 = pd.DataFrame(r_data1["data"])

r_df2 = pd.DataFrame(r_data2["data"])

r_df3 = pd.DataFrame(r_data3["data"])

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
r_df = pd.concat([r_df1, r_df2, r_df3])

In [99]:
r_df.shape

(1108, 3)
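Stacking the three frames keeps each frame's own row labels, so the combined index repeats 0, 1, 2, and so on. The sketch below uses two small made-up frames (not the scraped data) to show the effect, and how `ignore_index=True` in `pd.concat` produces a fresh, unique index instead.

```python
import pandas as pd

# Made-up stand-ins for two scraped batches.
batch_a = pd.DataFrame({"title": ["post a", "post b"], "subreddit": ["relationships"] * 2})
batch_b = pd.DataFrame({"title": ["post c", "post d"], "subreddit": ["relationships"] * 2})

# Default behaviour: row labels are carried over, so 0 and 1 each appear twice.
stacked = pd.concat([batch_a, batch_b])

# ignore_index=True relabels the rows 0..n-1.
combined = pd.concat([batch_a, batch_b], ignore_index=True)

print(stacked.index.tolist())
print(combined.index.tolist())
```

A unique index matters mainly when rows are later selected by label or when the index is written out alongside the data.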

In [100]:
r_df.head()

| | selftext | subreddit | title |
|:---|:---|:---|:---|
| 0 | So, backstory, my friend was visiting NY from... | relationships | I (f29) have an issue that just popped up invo... |
| 1 | Original: https://www.reddit.com/r/relationshi... | relationships | [UPDATE] I'm [24/f] pregnant and the father/my... |
| 2 | A bit of background - I am of the “naturally t... | relationships | My sister (24f) has been humiliating me (19f) ... |
| 3 | throwaway for obvious reasons.\n\nI graduated ... | relationships | I[25 F] got a job offer across the country and... |
| 4 | Throwaway for several reasons. \n\n\nI starte... | relationships | My (24M) gf (26F) is on the autism spectrum. I... |


### Scraping: r/AbusiveRelationships

##### r/AbusiveRelationships parameters
| parameter name | condition | 
|:---|:---|
| subreddit | abusiverelationships | 
| size | 500 | 
| num_comments | >3 | 
| fields | title, selftext, subreddit | 

r/Relationships is considerably more popular than r/AbusiveRelationships: it receives more posts per day and far more engagement. After looking at the top posts of the day and of the week in r/AbusiveRelationships, I settled on more than 3 comments as a reasonable minimum for posts that drew some community engagement.

I scraped the same three time windows, using the following conditions:

| time window | parameter name | condition | 
|:--|:--|:---|
| past month | after | 30d | 
| previous month | after | 60d | 
| | before | 31d | 
| more than two months ago | before | 61d | 

##### r/AbusiveRelationships API links
_Base link:_ 
 - https://api.pushshift.io/reddit/search/submission/?subreddit=abusiverelationships&size=500&num_comments=>3&fields=title,subreddit,selftext

_Link add-ons for date intervals:_
 - In the past month: 
   - &after=30d
 - In the previous month: 
   - &after=60d&before=31d
 - Before the previous two months: 
   - &before=61d

In [103]:
recent_mo_url_ar = "https://api.pushshift.io/reddit/search/submission/?subreddit=abusiverelationships&size=500&num_comments=>3&fields=title,subreddit,selftext&after=30d"

recent_2mo_url_ar = "https://api.pushshift.io/reddit/search/submission/?subreddit=abusiverelationships&size=500&num_comments=>3&fields=title,subreddit,selftext&after=60d&before=31d"

before_2mo_url_ar = "https://api.pushshift.io/reddit/search/submission/?subreddit=abusiverelationships&size=500&num_comments=>3&fields=title,subreddit,selftext&before=61d"

In [104]:
response1 = requests.get(recent_mo_url_ar)

response2 = requests.get(recent_2mo_url_ar)

response3 = requests.get(before_2mo_url_ar)

In [105]:
response1.status_code

200

In [106]:
response2.status_code

200

In [107]:
response3.status_code

200

In [108]:
ar_data1 = response1.json()

ar_data2 = response2.json()

ar_data3 = response3.json()

In [109]:
ar_df1 = pd.DataFrame(ar_data1["data"])

ar_df2 = pd.DataFrame(ar_data2["data"])

ar_df3 = pd.DataFrame(ar_data3["data"])

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
ar_df = pd.concat([ar_df1, ar_df2, ar_df3])

In [111]:
ar_df.shape

# 964 posts vs 1108 posts for r/Relationships... not bad! 

(964, 3)

In [112]:
ar_df.head()

| | selftext | subreddit | title |
|:---|:---|:---|:---|
| 0 | | abusiverelationships | You don't owe them anything ♥️💜💙🖤💛 |
| 1 | He isn’t sorry for what he did and blames me,I... | abusiverelationships | Threatened me before Christmas,I came back and... |
| 2 | Locked in the bathroom can’t come out because ... | abusiverelationships | This to shall pass |
| 3 | | abusiverelationships | He still wants to destroy me. I hate that it s... |
| 4 | We are low on money, so we put out an ad to ge... | abusiverelationships | Manipulation not working anymore |


### Combination

In [124]:
# create one big dataset

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([r_df, ar_df])

In [125]:
# give more descriptive name for subreddit column, change to binary values 
# 1 = r/AbusiveRelationships, 0 = r/Relationships

df.rename(columns = {"subreddit": "abusive_relationship"}, inplace = True)

df["abusive_relationship"] = np.where((df["abusive_relationship"] == "abusiverelationships"), 1, 0)

df["abusive_relationship"].value_counts(normalize = True)

0    0.534749
1    0.465251
Name: abusive_relationship, dtype: float64
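The proportions above follow directly from the two row counts reported earlier (1108 posts from r/Relationships, 964 from r/AbusiveRelationships). As a quick sanity check, the same shares can be worked out by hand; the variable names here are only illustrative.

```python
n_relationships = 1108  # rows in r_df, label 0
n_abusive = 964         # rows in ar_df, label 1
total = n_relationships + n_abusive

share_relationships = n_relationships / total
share_abusive = n_abusive / total

print(f"0    {share_relationships:.6f}")  # 0    0.534749
print(f"1    {share_abusive:.6f}")        # 1    0.465251
```

The two classes are close to balanced, with r/Relationships making up slightly more than half of the combined rows.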

In [133]:
# export raw, combined df

df.to_csv("./datasets/combined_raw.csv")