# Introduction
## Overview of notebook
This notebook gathers data from Reddit through the Pushshift API. The two subreddits we gather data from are:
1) [Ask Men](https://www.reddit.com/r/AskMen)
2) [Ask Women](https://www.reddit.com/r/AskWomen)

There were two iterations of data collection. The second iteration was done because the EDA process in the [data cleaning notebook](./Code/02_data_cleaning.ipynb) surfaced significant issues, such as almost 70% of the threads in the data set having been removed by a moderator.
## Stats

Below is the number of posts gathered from each subreddit in each pull.

Records count:

|Subreddit|Iteration 1|Iteration 2|Iteration 3|Iteration 4|
|-----|-----|------|-------|-----|
|Ask Men|999|998|998|1000|
|Ask Women|996|999|998|998|
_________________________________________________________________________________
To avoid overloading the server, we keep track of when data was pulled and wait at least a day between pulls.

Total records collected by date:


|Date(YYYY-MM-DD)|Records|
|-----|-----|
|2023-04-24|1995|
|2023-04-30|1997|
|2023-05-04|1996|
|2023-05-04|1998|
___________________________________________________________
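
The one-day waiting rule described above could be checked in code before each pull. The sketch below is only an illustration: `can_pull_again` is a hypothetical helper, not part of the original notebook, and the dates are taken from the table above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper: decide whether enough time has passed since the last pull.
def can_pull_again(last_pull, now=None, min_wait=timedelta(days=1)):
    """Return True when at least `min_wait` has elapsed since `last_pull`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_pull >= min_wait

# Dates from the collection log: first pull, then the next pull six days later.
first_pull = datetime(2023, 4, 24, tzinfo=timezone.utc)
second_pull = datetime(2023, 4, 30, tzinfo=timezone.utc)
print(can_pull_again(first_pull, now=second_pull))
```

Passing `now` explicitly keeps the check easy to test; in normal use it can be left out so the current UTC time is used.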
## Data sourcing
For the first iteration I created separate dataframes for roughly the first 2,000 posts, in descending order by date, from each subreddit on April 24, 2023. The initial request pulled the data without any filter parameters aside from identifying which subreddit to pull from. After the first request I used the UTC timestamp of the oldest post from each subreddit to pull older data. This way I can be more confident of pulling the maximum amount of data without waiting for new posts to come in.

For the second iteration, after conducting some EDA in the [data cleaning notebook](./Code/02_data_cleaning.ipynb), we found that a significant portion of our posts had been removed for violating community guidelines. As a result we returned here to pull more data from each subreddit. This time we created a function with the following parameters:

API filters:

|Parameter|Function|
|-------|------|
|subreddit|Determines which subreddit to get data from|
|size|The number of posts to get from the API|
|until|Epoch time cutoff; only posts created before this time are pulled|
|min_comments|Only get posts with at least this many comments. This was used in place of mod_removed because most of the removed posts had 2 or fewer comments|
______________________________________________
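
As a rough illustration of how the filters in the table above fit together, the parameters can be assembled into one dictionary before being passed to `requests.get`. `build_params` is a hypothetical helper introduced here for illustration; only the parameter names come from the table.

```python
# Hypothetical helper: build the query parameters described in the table above.
def build_params(subreddit, size=1000, until=None, min_comments=None):
    """Return a params dict, leaving out filters that were not requested."""
    params = {'subreddit': subreddit, 'size': size}
    if until is not None:
        params['until'] = int(until)          # epoch seconds cutoff
    if min_comments is not None:
        params['min_comments'] = int(min_comments)
    return params

example_params = build_params('askwomen', size=1000, until=1682113209, min_comments=5)
print(example_params)
```

Leaving optional filters out of the dictionary, rather than sending empty values, mirrors the first request in this notebook, which only specified the subreddit and size.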


# Data Collection

## Imports

In [1]:
import requests
import pandas as pd

### Women

#### Initial dataframe

Due to the instability of the API, I wanted to create an initial dataframe and manually append new data. I was afraid that using a loop would cause issues with API timeouts or get me blocked.
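
One possible way to soften the timeout concern, not used in this notebook, is to wrap each request in a small retry helper that waits longer after every failure. The sketch below is an assumption-laden illustration: `fetch_with_retry` is a hypothetical name, and `fetch` stands in for any zero-argument callable such as `lambda: requests.get(url, params=women_params, timeout=30)`.

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `max_attempts` is reached.

    Waits base_delay, then 2 * base_delay, and so on between failed attempts.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as err:  # in real use, catch requests.RequestException
            last_error = err
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise last_error

# Offline demonstration with a stand-in that fails twice before succeeding.
attempts = {'count': 0}

def flaky_fetch():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RuntimeError('simulated timeout')
    return 'ok'

waits = []
result = fetch_with_retry(flaky_fetch, sleep=waits.append)
print(result, waits)
```

The `sleep` argument is injectable only so the demonstration runs instantly; the default is the real `time.sleep`.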

In [3]:
#params to get the 1000 newest posts from the askwomen subreddit
women_params={
    'subreddit': 'askwomen',
    'size' : 1000,
}

In [4]:
req_women = requests.get(url, women_params)

In [5]:
#check to see if we established a connection
print(f' women response code: {req_women}')

 women response code: <Response [200]>


In [6]:
#checked keys to see what data we need
req_women.json().keys()

dict_keys(['data', 'error', 'metadata'])
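
Since the payload carries an `error` key alongside `data`, it may be worth checking it before building a dataframe. The sketch below assumes that `error` is empty or `None` on a successful request, which is not confirmed anywhere in this notebook; `payload_to_frame` is a hypothetical helper.

```python
import pandas as pd

def payload_to_frame(payload):
    """Turn a decoded JSON payload into a DataFrame, failing loudly on errors."""
    # Assumption: a truthy 'error' value means the request did not succeed.
    if payload.get('error'):
        raise ValueError(f"API reported an error: {payload['error']}")
    records = payload.get('data') or []
    return pd.DataFrame(records)

# Toy payload with the same three keys as the real response.
sample_payload = {
    'data': [{'title': 'example post', 'created_utc': 1682113209}],
    'error': None,
    'metadata': {},
}
sample_df = payload_to_frame(sample_payload)
print(len(sample_df))
```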

In [43]:
#manually create a dataframe, then concatenate new dataframes onto it because of Pushshift API issues; this is the original dataframe
women_df=pd.DataFrame(req_women.json()['data'])

In [47]:
#check to see how many records were pulled
len(women_df)

996

In [44]:
#get the oldest post; utc_datetime_str is easier to read than created_utc
women_df.sort_values('created_utc', ascending=True).head(2)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,link_flair_template_id
995,AskWomen,[removed],t2_9pxvqwsvf,0,How I 22 f and moms man daughter 22f,[],r/AskWomen,False,6,,...,1682113209,0,,False,1682113223,1682113223,2023-04-21 21:40:09,,,
994,AskWomen,,t2_5g35dfr5,0,What do you think about a Muslim guy as a bf?,[],r/AskWomen,False,6,,...,1682113436,0,,False,1682113448,1682113448,2023-04-21 21:43:56,,,
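
The oldest timestamp can also be read off directly instead of sorting and inspecting rows by eye. This is a small sketch on toy data that reuses the two `created_utc` values shown above; the column name comes from the notebook, everything else is illustrative.

```python
import pandas as pd

# Toy frame holding the two oldest created_utc values from the output above.
toy_women = pd.DataFrame({
    'created_utc': [1682113436, 1682113209],
    'title': ['newer post', 'older post'],
})

oldest_utc = int(toy_women['created_utc'].min())       # epoch seconds of the oldest post
oldest_readable = pd.to_datetime(oldest_utc, unit='s')  # same moment as a timestamp
print(oldest_utc, oldest_readable)
```

The converted timestamp should agree with the `utc_datetime_str` column for the same row, which is a quick sanity check before using the value as a cutoff.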


#### New data
Here I simply replicated the previous code block but added an `until` parameter, which pulls records older than the specified time, and then appended the results to the original dataframe. Ideally, this would be a function that takes in the subreddit name, size, and epoch time and returns a dataframe, so that a loop could append the dataframes together.
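
The function-plus-loop idea mentioned above might look roughly like the sketch below. It is not the approach taken in this notebook: `collect_older_posts` and `fake_fetch` are hypothetical names, and `fake_fetch` replaces the real API call so the example runs offline.

```python
import pandas as pd

def collect_older_posts(fetch_page, start_utc, pages=2):
    """Repeatedly call `fetch_page(until)` and stitch the pages together.

    `fetch_page` must return a DataFrame with a 'created_utc' column holding
    posts older than `until`.
    """
    frames, until = [], start_utc
    for _ in range(pages):
        page = fetch_page(until)
        if page.empty:
            break  # nothing older is available
        frames.append(page)
        until = int(page['created_utc'].min())  # next page starts at the oldest post seen
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Stand-in for the API call: returns two posts just older than the cutoff.
def fake_fetch(until):
    return pd.DataFrame({'created_utc': [until - 1, until - 2]})

posts = collect_older_posts(fake_fetch, start_utc=100, pages=3)
print(posts['created_utc'].tolist())
```

In real use, `fetch_page` would wrap `requests.get` with the `until` parameter, ideally with a pause between pages to stay gentle on the server.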

In [52]:
#create new parameters for the dataframe we're appending
new_women_params={
    'subreddit': 'askwomen',
    'size' : 1000,
    'until' : 1682113209 #oldest post from the women dataframe, 'women_df'
}

In [53]:
new_women_req = requests.get(url, new_women_params)

In [54]:
new_women = pd.DataFrame(new_women_req.json()['data'])

In [55]:
#comparing the size of the two dataframes
print(len(new_women), len(women_df))

999 996


In [56]:
new_women.head(3)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,link_flair_template_id,author_cakeday
0,AskWomen,[removed],t2_7eati,0,"Women who bartend, when is it okay to ask for ...",[],r/AskWomen,False,6,,...,0,,False,1682112804,1682112805,2023-04-21 21:33:10,,,,
1,AskWomen,,t2_vkwn0tcr,0,What job/career would you have if your mental ...,[],r/AskWomen,False,6,,...,0,,False,1682112727,1682112728,2023-04-21 21:31:53,,,,
2,AskWomen,[removed],t2_7f7gnzky,0,My partner of five years broke up with me unex...,[],r/AskWomen,False,6,,...,0,,False,1682112224,1682112224,2023-04-21 21:23:33,,,,


In [57]:
#combine the original women dataframe with the new dataframe
women_df = pd.concat([women_df, new_women])
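
A plain `pd.concat` keeps each frame's own row labels, so the combined frame ends up with repeated index values. The sketch below shows one optional variation on toy data: resetting the index and dropping repeated posts. The `id` column is an assumption about the payload and is not shown in the truncated outputs in this notebook.

```python
import pandas as pd

# Toy pulls that overlap on one post ('b'), as could happen at the cutoff boundary.
first_pull = pd.DataFrame({'id': ['a', 'b'], 'created_utc': [30, 20]})
second_pull = pd.DataFrame({'id': ['b', 'c'], 'created_utc': [20, 10]})

stacked = (
    pd.concat([first_pull, second_pull], ignore_index=True)  # fresh 0..n-1 index
      .drop_duplicates(subset='id')                          # keep the first copy of each post
      .reset_index(drop=True)
)
print(stacked)
```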

In [87]:
#see total rows, missing values, and an overview of the dataframe
women_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1995 entries, 0 to 998
Data columns (total 90 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   subreddit                      1995 non-null   object 
 1   selftext                       1995 non-null   object 
 2   author_fullname                1968 non-null   object 
 3   gilded                         1995 non-null   int64  
 4   title                          1995 non-null   object 
 5   link_flair_richtext            1995 non-null   object 
 6   subreddit_name_prefixed        1995 non-null   object 
 7   hidden                         1995 non-null   bool   
 8   pwls                           1995 non-null   int64  
 9   link_flair_css_class           4 non-null      object 
 10  thumbnail_height               0 non-null      object 
 11  top_awarded_type               0 non-null      object 
 12  hide_score                     1995 non-null   bo

In [60]:
women_df.to_csv('../Data/ask_women.csv', index=False)

### Men

#### Initial dataframe
Due to the instability of the API, I wanted to create an initial dataframe and manually append new data. I was afraid that using a loop would cause issues with API timeouts or get me blocked.

In [61]:
#params to get the 1000 newest posts from the askmen subreddit
men_params={
    'subreddit': 'askmen',
    'size' : 1000
}

In [62]:
req_men = requests.get(url, men_params)

In [63]:
print(f' men response code: {req_men}')

 men response code: <Response [200]>


In [64]:
#manually create a dataframe, then concatenate new dataframes onto it because of Pushshift API issues; this is the original dataframe
men_df=pd.DataFrame(req_men.json()['data'])

In [65]:
#check to see how many records were pulled
len(men_df)

999

In [68]:
men_df.sort_values('created_utc', ascending=True).head(3)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,author_cakeday
998,AskMen,,t2_969ax2xq8,0,Thoughts on Bollywood movies and why?,[],r/AskMen,False,6,,...,1682195470,0,,False,1682195488,1682195488,2023-04-22 20:31:10,,,
997,AskMen,,t2_mjo8y3q1,0,Do some guys not like kissing during sex ?,[],r/AskMen,False,6,,...,1682195591,0,,False,1682195604,1682195605,2023-04-22 20:33:11,,,
996,AskMen,,t2_8ywwtdo2p,0,"my girlfriend &amp; I just broke up, suspectin...",[],r/AskMen,False,6,,...,1682195624,0,,False,1682195636,1682195636,2023-04-22 20:33:44,,,


#### New data
Here I simply replicated the previous code block but added an `until` parameter, which pulls records older than the specified time, and then appended the results to the original dataframe. Ideally, this would be a function that takes in the subreddit name, size, and epoch time and returns a dataframe, so that a loop could append the dataframes together.

In [69]:
new_men_params={
    'subreddit': 'askmen',
    'size' : 1000,
    'until' : 1682195470
}

In [70]:
new_men_req = requests.get(url, new_men_params)

In [71]:
new_men = pd.DataFrame(new_men_req.json()['data'])

In [72]:
print(len(new_men), len(men_df))

998 999


In [73]:
#check the first few rows of men
new_men.head(3)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,author_cakeday
0,AskMen,[removed],t2_wa2icro,0,Do men care about very muscular women?,[],r/AskMen,False,6,,...,1682195369,0,,False,1682195385,1682195386,2023-04-22 20:29:29,,,
1,AskMen,[removed],t2_7jcvcqwk,0,My(23M) gf’s(26F) dad was diagnosed with cance...,[],r/AskMen,False,6,,...,1682194947,0,,False,1682194962,1682194963,2023-04-22 20:22:27,,,
2,AskMen,[removed],t2_wa2icro,0,Do men care about women having muscles?,[],r/AskMen,False,6,,...,1682194901,0,,False,1682194920,1682194921,2023-04-22 20:21:41,,,


In [74]:
men_df = pd.concat([men_df,new_men])

In [86]:
#see total rows, missing values, and an overview of the dataframe
men_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1997 entries, 0 to 997
Data columns (total 89 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   subreddit                      1997 non-null   object 
 1   selftext                       1997 non-null   object 
 2   author_fullname                1946 non-null   object 
 3   gilded                         1997 non-null   int64  
 4   title                          1997 non-null   object 
 5   link_flair_richtext            1997 non-null   object 
 6   subreddit_name_prefixed        1997 non-null   object 
 7   hidden                         1997 non-null   bool   
 8   pwls                           1997 non-null   int64  
 9   link_flair_css_class           0 non-null      object 
 10  thumbnail_height               0 non-null      object 
 11  top_awarded_type               0 non-null      object 
 12  hide_score                     1997 non-null   bo

In [77]:
men_df.to_csv('../Data/ask_men.csv', index=False)

### Combine data

In [88]:
sub_reddit_data = pd.concat([women_df, men_df])

In [89]:
sub_reddit_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3992 entries, 0 to 997
Data columns (total 90 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   subreddit                      3992 non-null   object 
 1   selftext                       3992 non-null   object 
 2   author_fullname                3914 non-null   object 
 3   gilded                         3992 non-null   int64  
 4   title                          3992 non-null   object 
 5   link_flair_richtext            3992 non-null   object 
 6   subreddit_name_prefixed        3992 non-null   object 
 7   hidden                         3992 non-null   bool   
 8   pwls                           3992 non-null   int64  
 9   link_flair_css_class           4 non-null      object 
 10  thumbnail_height               0 non-null      object 
 11  top_awarded_type               0 non-null      object 
 12  hide_score                     3992 non-null   bo

In [None]:
sub_reddit_data['subreddit'].value_counts()

In [95]:
sub_reddit_data.to_csv('../Data/sub_reddit_data.csv', index=False)

## First iteration: EDA process results

After our first iteration in the [data cleaning notebook](./Code/02_data_cleaning.ipynb), where we explored, analyzed, and cleaned the data, we found that most of our posts had been removed, so we returned here to add posts that had not been removed. About 70% of posts were removed by a moderator and often had under 3 comments, which we assumed was due to violating community guidelines. Because of that proportion of removed posts, we returned to this notebook to pull more data to offset the imbalance, but kept the removed posts for potential future use.

So from there we moved onto the second iteration.

_______________________________________________________________________________________________
# Second iteration: Pulling data from the API

After our first iteration of data cleaning in the [data cleaning notebook](./Code/02_data_cleaning.ipynb), we identified a need for additional data and decided to create functions to easily pull data as needed. From here on we will use those functions.

In [3]:
# api url
url =  'https://api.pushshift.io/reddit/search/submission'

These functions will be used to pull the data we need. For ease of use, get_data_before will be used more often because it is more likely to return close to the full number of posts requested. get_data_after can be used after a long period of time has elapsed, assuming the subreddit is not very active.

In [79]:
#function to get data older than what we already have
def get_data_before(subreddit, size, ini_sub):
    '''
    Pull up to `size` posts from the given subreddit.
    The until filter restricts the pull to posts created before the oldest
    post in ini_sub, the dataframe of data collected so far.
    min_comments is set to 5 because having several comments is a good
    indicator that a post was not removed.
    '''
    #oldest created_utc in the existing data
    utc = ini_sub['created_utc'].min()
    print(f'utc doe is {utc}')
    params ={
        'subreddit' : subreddit,
        'size': size,
        'until': utc,
        'min_comments' : 5
    }
    data_req = requests.get(url, params=params)
    data_df = pd.DataFrame(data_req.json()['data'])
    print(f' {subreddit} response code: {data_req}')
    return data_df

In [96]:
#function to get data newer than what we already have
def get_data_after(subreddit, size, ini_sub):
    '''
    Pull up to `size` posts from the given subreddit.
    The since filter restricts the pull to posts created after the newest
    post in ini_sub, the dataframe of data collected so far.
    min_comments is set to 5 because having several comments is a good
    indicator that a post was not removed.
    '''
    #newest created_utc in the existing data
    utc = ini_sub['created_utc'].max()
    print(f'utc doe is {utc}')
    params ={
        'subreddit' : subreddit,
        'size': size,
        'since': utc,
        'min_comments' : 5
    }
    data_req = requests.get(url, params=params)
    data_df = pd.DataFrame(data_req.json()['data'])
    print(f' {subreddit} response code: {data_req}')
    return data_df

### Women

In [97]:
ini_women = pd.read_csv('../Data/ask_women.csv') #initial ask women data

In [98]:
new_askwomen= get_data_before('askwomen', 1000, ini_women)

utc doe is 1681854044
 askwomen response code: <Response [200]>


It seems like there are fewer removed posts, but the removed-by-moderator filter didn't work, so it's probably deprecated. The minimum-comments parameter appears to be working as intended, which is useful because most of the removed posts had 1 comment, so this was a small workaround. The next step is to see if we're pulling the correct data by comparing the timestamps of the new dataframe against the old dataframe.

In [92]:
ini_women[['created_utc', 'title']].sort_values(by = 'created_utc', ascending = False).head(3)

Unnamed: 0,created_utc,title
0,1682375066,How about true friendship between men and wome...
1,1682374690,Does this mean she’s not interested? What shou...
2,1682374630,What’s something that your brain cannot compre...


In [95]:
new_askwomen[['created_utc', 'title']].sort_values(by = 'created_utc', ascending = False).head(3)

Unnamed: 0,created_utc,title
0,1679324774,What Would You Do if You Found Weed in Your 14...
1,1677685126,What are your views on buying luxury handbags?...
2,1677684203,How did you get over someone you never thought...
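
The comparison above can also be written as an explicit check rather than read off the tables. This is a minimal sketch on toy data built from timestamps printed in this section; `is_older_than` is a hypothetical helper, and it uses `<=` because whether the cutoff itself is included by the API is not established here.

```python
import pandas as pd

def is_older_than(new_df, cutoff_utc, col='created_utc'):
    """True when no post in `new_df` is newer than `cutoff_utc`."""
    return bool(new_df[col].max() <= cutoff_utc)

# Cutoff printed by get_data_before, and the newest timestamps from the new pull.
cutoff = 1681854044
toy_new = pd.DataFrame({'created_utc': [1679324774, 1677685126, 1677684203]})
print(is_older_than(toy_new, cutoff))
```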


In [105]:
len(new_askwomen), len(ini_women)

(1000, 1995)

In [102]:
#combine the initial women csv data with the newly pulled data
combined_women = pd.concat([ini_women, new_askwomen])

In [103]:
len(combined_women)

2995

In [117]:
combined_women.to_csv('../Data/ask_women.csv', index=False)

### Men

In [106]:
#initial ask men data
ini_men =pd.read_csv('../Data/ask_men.csv')

In [107]:
#get the utc of the oldest post
ini_men.sort_values('created_utc', ascending=True).head(2)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,author_cakeday
1996,AskMen,What are the things that would make you lose i...,t2_75n4i5f5,0,"Men of reddit, what is something that would ma...",[],r/AskMen,False,6,,...,1681999288,0,,False,1681999300,1681999301,2023-04-20 14:01:28,,,
1995,AskMen,,t2_qzhumucd,0,What’s the most attractive thing about a woman??,[],r/AskMen,False,6,,...,1682000135,0,,False,1682000148,1682000149,2023-04-20 14:15:35,,,


In [108]:
new_askmen= get_data_before('askmen', 1000, ini_men)

utc doe is 1681999288
 askmen response code: <Response [200]>


In [109]:
new_askmen['removed_by_category'].value_counts()

moderator    261
deleted      211
reddit         6
Name: removed_by_category, dtype: int64

In [110]:
new_askmen['num_comments'].value_counts(ascending=False).head(10)

8     45
9     37
11    36
5     35
6     34
13    30
12    30
23    28
14    27
7     26
Name: num_comments, dtype: int64

Similar to the second iteration of the AskWomen API call, we noticed that the removed-by-moderator parameter was deprecated, but the minimum number of comments seems to be working.
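
To put a number on how many posts in a pull were removed, the share of rows with any `removed_by_category` value can be computed directly. The sketch below uses toy data and assumes, as the `value_counts` output above suggests, that the column is empty for posts that were not removed.

```python
import pandas as pd

# Toy pull: two removed posts and two intact ones.
toy_pull = pd.DataFrame({
    'removed_by_category': ['moderator', None, 'deleted', None],
    'num_comments': [5, 12, 6, 40],
})

removed_share = toy_pull['removed_by_category'].notna().mean()  # fraction flagged as removed
kept_posts = toy_pull[toy_pull['removed_by_category'].isna()]   # posts with no removal flag
print(removed_share, len(kept_posts))
```

Applied to the real pull, the same expression would give the fraction of the 1000 posts that carry one of the three removal categories listed above.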

In [111]:
#combine the initial men csv data with the newly pulled data
combined_men = pd.concat([ini_men, new_askmen])

In [112]:
len(combined_men)

2997

In [116]:
combined_men.to_csv('../Data/ask_men.csv', index=False)

### Combined data frames

Since we used the original askwomen and askmen subreddit data, we don't need to load in the previous combined data. Otherwise we would have a lot of duplicate data.

In [113]:
combined_subreddits = pd.concat([combined_men, combined_women])

In [114]:
print(f' askmen: {len(combined_men)} \n askwomen: {len(combined_women)} \n Total combined threads {len(combined_subreddits)}')

 askmen: 2997 
 askwomen: 2995 
 Total combined threads 5992


In [66]:
combined_subreddits.to_csv('../Data/sub_reddits_data.csv', index=False)

### Getting additional data

In [128]:
ini_subrredits = pd.read_csv('../Data/sub_reddits_data.csv')

In [118]:
askmen_3 = pd.read_csv('../Data/ask_men.csv')

In [122]:
askwomen_3 = pd.read_csv('../Data/ask_women.csv')

In [119]:
askmen_3 = get_data_before('askmen', 1000, askmen_3)

utc doe is 1669857196
 askmen response code: <Response [200]>


In [123]:
askwomen_3 = get_data_before('askwomen', 1000, askwomen_3)

utc doe is 1666608446
 askwomen response code: <Response [200]>


In [138]:
len(askmen_3), len(askwomen_3)

(1000, 998)

In [126]:
combined_3 = pd.concat([askmen_3, askwomen_3])

In [133]:
len(combined_3), len(ini_subrredits)

(1998, 5992)

In [144]:
combined_subreddits = pd.concat([combined_3, ini_subrredits])

In [145]:
len(combined_subreddits)

7990

In [146]:
combined_subreddits.to_csv('../Data/sub_reddits_data.csv', index=False)

## Second iteration: EDA process results

From here we moved back to the [data cleaning notebook](./Code/02_data_cleaning.ipynb), cleaned the second iteration of our dataset, and then compared those results to the first iteration.