# Introduction
This notebook gathers data from Reddit through the Pushshift API. The two subreddits we will gather data from are:
1) [Ask Men](https://www.reddit.com/r/AskMen): 1997 records
2) [Ask Women](https://www.reddit.com/r/AskWomen): 1995 records

For this portion I created a separate dataframe for each subreddit, holding roughly the 2000 most recent posts going back from April 24, 2023. I pulled the entire dataset from the API in case I need any other information in the future. Before modeling I will subset the information I need, such as:
1) title
2) utc_datetime_str
3) subreddit
4) hidden

These files will be stored as separate CSVs as a backup, but a file called sub_reddit_data will be created combining the two.
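
The subsetting step described above could look like the following minimal sketch. The helper name `subset_for_modeling` and the toy rows are hypothetical; only the four column names come from this notebook.

```python
import pandas as pd

# Columns listed above as needed for modeling.
KEEP_COLUMNS = ["title", "utc_datetime_str", "subreddit", "hidden"]


def subset_for_modeling(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the modeling columns, with a fresh 0..n-1 index."""
    missing = [col for col in KEEP_COLUMNS if col not in df.columns]
    if missing:
        raise KeyError(f"missing expected columns: {missing}")
    return df.loc[:, KEEP_COLUMNS].reset_index(drop=True)


# Toy rows standing in for the API data (hypothetical values).
toy = pd.DataFrame(
    {
        "title": ["first post", "second post"],
        "utc_datetime_str": ["2023-04-21 21:40:09", "2023-04-21 21:43:56"],
        "subreddit": ["AskWomen", "AskMen"],
        "hidden": [False, False],
        "gilded": [0, 0],
    },
    index=[5, 9],
)
subset = subset_for_modeling(toy)
print(list(subset.columns))
```

Resetting the index is a design choice for this sketch: dataframes built with `pd.concat` keep the row labels of their parts, so labels can repeat unless they are renumbered.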

## Imports

In [1]:
import requests
import pandas as pd

In [2]:
# api url
url =  'https://api.pushshift.io/reddit/search/submission'

## Pulling data from the API

### Women

#### Initial dataframe

Because the API was unstable, I created an initial dataframe and manually appended new data to it. I avoided a loop because I was worried about API timeouts or getting blocked.
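
One common way to make a loop safer against timeouts is to retry a failed call with a growing pause between attempts. The sketch below is not what this notebook does; `call_with_retries` and its parameters are hypothetical names, and it assumes the wrapped call raises an exception when it fails.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying with exponential backoff if it raises."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Pause base_delay, then 2x, then 4x, ... before retrying.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With `requests`, the wrapped callable would need to call `raise_for_status()` on the response, because `requests.get` on its own does not raise for HTTP error codes such as 429 or 500.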

In [3]:
#params to get the 1000 newest posts from the askwomen subreddit
women_params={
    'subreddit': 'askwomen',
    'size' : 1000,
}

In [4]:
req_women = requests.get(url, women_params)

In [5]:
#check to see if we established a connection
print(f' women response code: {req_women}')

 women response code: <Response [200]>


In [6]:
#checked keys to see what data we need
req_women.json().keys()

dict_keys(['data', 'error', 'metadata'])

In [43]:
#manually creating a dataframe, then concatenating new dataframes onto it due to Pushshift API issues; this is the original dataframe
women_df=pd.DataFrame(req_women.json()['data'])

In [47]:
#check to see how many records were pulled
len(women_df)

996

In [44]:
#get the oldest posts; utc_datetime_str is easier to read than created_utc
women_df.sort_values('created_utc', ascending=True).head(2)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,link_flair_template_id
995,AskWomen,[removed],t2_9pxvqwsvf,0,How I 22 f and moms man daughter 22f,[],r/AskWomen,False,6,,...,1682113209,0,,False,1682113223,1682113223,2023-04-21 21:40:09,,,
994,AskWomen,,t2_5g35dfr5,0,What do you think about a Muslim guy as a bf?,[],r/AskWomen,False,6,,...,1682113436,0,,False,1682113448,1682113448,2023-04-21 21:43:56,,,
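
The `created_utc` values are Unix epoch seconds, and `utc_datetime_str` is the same instant in readable form. The small standard-library sketch below shows the relationship for the oldest post in the table above; the helper name `epoch_to_utc_str` is hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc_str(epoch_seconds: int) -> str:
    """Format Unix epoch seconds as a 'YYYY-MM-DD HH:MM:SS' UTC string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# created_utc of the oldest post in the table above
print(epoch_to_utc_str(1682113209))  # 2023-04-21 21:40:09
```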


#### New data
Here I replicated the previous code block but added an 'until' parameter, which pulls records older than the specified epoch time, and then appended the results to the original dataframe. Ideally, this would be a function that takes the subreddit name, size, and epoch time and returns a dataframe, so that a loop could append the dataframes together.
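
The function-plus-loop idea mentioned above might be sketched as follows. This is an illustration under assumptions, not code that was run for this notebook: `fetch_page`, `collect_posts`, and the `fetcher` parameter are hypothetical names, and the sketch returns a plain list of post dictionaries so the paging logic can be tried without network access.

```python
from typing import Callable, Dict, List, Optional

Post = Dict[str, object]
# A page fetcher takes (subreddit, size, until) and returns a list of posts.
PageFetcher = Callable[[str, int, Optional[int]], List[Post]]


def fetch_page(subreddit: str, size: int, until: Optional[int]) -> List[Post]:
    """Fetch one page from the Pushshift endpoint used in this notebook."""
    import requests  # imported here so the paging logic runs without it

    params = {"subreddit": subreddit, "size": size}
    if until is not None:
        params["until"] = until
    response = requests.get(
        "https://api.pushshift.io/reddit/search/submission",
        params=params,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"]


def collect_posts(
    subreddit: str,
    pages: int,
    size: int = 1000,
    fetcher: PageFetcher = fetch_page,
) -> List[Post]:
    """Walk backwards in time, one page per request, and merge the pages."""
    posts: List[Post] = []
    until: Optional[int] = None
    for _ in range(pages):
        page = fetcher(subreddit, size, until)
        if not page:
            break  # nothing older is available
        posts.extend(page)
        # Next request uses the oldest created_utc seen so far as 'until'.
        until = min(int(post["created_utc"]) for post in page)
    return posts
```

Under these assumptions, `pd.DataFrame(collect_posts('askwomen', pages=2))` would mirror the two manual requests made in this section.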

In [52]:
#creating new parameters for the dataframe we're appending
new_women_params={
    'subreddit': 'askwomen',
    'size' : 1000,
    'until' : 1682113209 #created_utc of the oldest post in women_df
}

In [53]:
new_women_req = requests.get(url, new_women_params)

In [54]:
new_women = pd.DataFrame(new_women_req.json()['data'])

In [55]:
#comparing the size of the two dataframes
print(len(new_women), len(women_df))

999 996


In [56]:
new_women.head(3)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,link_flair_template_id,author_cakeday
0,AskWomen,[removed],t2_7eati,0,"Women who bartend, when is it okay to ask for ...",[],r/AskWomen,False,6,,...,0,,False,1682112804,1682112805,2023-04-21 21:33:10,,,,
1,AskWomen,,t2_vkwn0tcr,0,What job/career would you have if your mental ...,[],r/AskWomen,False,6,,...,0,,False,1682112727,1682112728,2023-04-21 21:31:53,,,,
2,AskWomen,[removed],t2_7f7gnzky,0,My partner of five years broke up with me unex...,[],r/AskWomen,False,6,,...,0,,False,1682112224,1682112224,2023-04-21 21:23:33,,,,


In [57]:
#combine the original women dataframe with the new dataframe
women_df = pd.concat([women_df, new_women])

In [87]:
#see total rows, missing values, and an overview of the dataframe
women_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1995 entries, 0 to 998
Data columns (total 90 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   subreddit                      1995 non-null   object 
 1   selftext                       1995 non-null   object 
 2   author_fullname                1968 non-null   object 
 3   gilded                         1995 non-null   int64  
 4   title                          1995 non-null   object 
 5   link_flair_richtext            1995 non-null   object 
 6   subreddit_name_prefixed        1995 non-null   object 
 7   hidden                         1995 non-null   bool   
 8   pwls                           1995 non-null   int64  
 9   link_flair_css_class           4 non-null      object 
 10  thumbnail_height               0 non-null      object 
 11  top_awarded_type               0 non-null      object 
 12  hide_score                     1995 non-null   bo

In [60]:
women_df.to_csv('../Data/women.csv', index=False)

### Men

#### Initial dataframe
As with the women's data, the API was unstable, so I created an initial dataframe and manually appended new data to it. I avoided a loop because I was worried about API timeouts or getting blocked.

In [61]:
#params to get the 1000 newest posts from the askmen subreddit
men_params={
    'subreddit': 'askmen',
    'size' : 1000
}

In [62]:
req_men = requests.get(url, men_params)

In [63]:
print(f' men response code: {req_men}')

 men response code: <Response [200]>


In [64]:
#manually creating a dataframe, then concatenating new dataframes onto it due to Pushshift API issues; this is the original dataframe
men_df=pd.DataFrame(req_men.json()['data'])

In [65]:
#check to see how many records were pulled
len(men_df)

999

In [68]:
men_df.sort_values('created_utc', ascending=True).head(3)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,author_cakeday
998,AskMen,,t2_969ax2xq8,0,Thoughts on Bollywood movies and why?,[],r/AskMen,False,6,,...,1682195470,0,,False,1682195488,1682195488,2023-04-22 20:31:10,,,
997,AskMen,,t2_mjo8y3q1,0,Do some guys not like kissing during sex ?,[],r/AskMen,False,6,,...,1682195591,0,,False,1682195604,1682195605,2023-04-22 20:33:11,,,
996,AskMen,,t2_8ywwtdo2p,0,"my girlfriend &amp; I just broke up, suspectin...",[],r/AskMen,False,6,,...,1682195624,0,,False,1682195636,1682195636,2023-04-22 20:33:44,,,


#### New data
Here I replicated the previous code block but added an 'until' parameter, which pulls records older than the specified epoch time, and then appended the results to the original dataframe. Ideally, this would be a function that takes the subreddit name, size, and epoch time and returns a dataframe, so that a loop could append the dataframes together.

In [69]:
new_men_params={
    'subreddit': 'askmen',
    'size' : 1000,
    'until' : 1682195470
}

In [70]:
new_men_req = requests.get(url, new_men_params)

In [71]:
new_men = pd.DataFrame(new_men_req.json()['data'])

In [72]:
print(len(new_men), len(men_df))

998 999


In [73]:
#check the first few rows of new_men
new_men.head(3)

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,created_utc,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,post_hint,preview,author_cakeday
0,AskMen,[removed],t2_wa2icro,0,Do men care about very muscular women?,[],r/AskMen,False,6,,...,1682195369,0,,False,1682195385,1682195386,2023-04-22 20:29:29,,,
1,AskMen,[removed],t2_7jcvcqwk,0,My(23M) gf’s(26F) dad was diagnosed with cance...,[],r/AskMen,False,6,,...,1682194947,0,,False,1682194962,1682194963,2023-04-22 20:22:27,,,
2,AskMen,[removed],t2_wa2icro,0,Do men care about women having muscles?,[],r/AskMen,False,6,,...,1682194901,0,,False,1682194920,1682194921,2023-04-22 20:21:41,,,


In [74]:
men_df = pd.concat([men_df,new_men])

In [86]:
#see total rows, missing values, and an overview of the dataframe
men_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1997 entries, 0 to 997
Data columns (total 89 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   subreddit                      1997 non-null   object 
 1   selftext                       1997 non-null   object 
 2   author_fullname                1946 non-null   object 
 3   gilded                         1997 non-null   int64  
 4   title                          1997 non-null   object 
 5   link_flair_richtext            1997 non-null   object 
 6   subreddit_name_prefixed        1997 non-null   object 
 7   hidden                         1997 non-null   bool   
 8   pwls                           1997 non-null   int64  
 9   link_flair_css_class           0 non-null      object 
 10  thumbnail_height               0 non-null      object 
 11  top_awarded_type               0 non-null      object 
 12  hide_score                     1997 non-null   bo

In [77]:
men_df.to_csv('../Data/ask_men.csv', index=False)

## Combine data

In [88]:
sub_reddit_data = pd.concat([women_df, men_df])

In [89]:
sub_reddit_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3992 entries, 0 to 997
Data columns (total 90 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   subreddit                      3992 non-null   object 
 1   selftext                       3992 non-null   object 
 2   author_fullname                3914 non-null   object 
 3   gilded                         3992 non-null   int64  
 4   title                          3992 non-null   object 
 5   link_flair_richtext            3992 non-null   object 
 6   subreddit_name_prefixed        3992 non-null   object 
 7   hidden                         3992 non-null   bool   
 8   pwls                           3992 non-null   int64  
 9   link_flair_css_class           4 non-null      object 
 10  thumbnail_height               0 non-null      object 
 11  top_awarded_type               0 non-null      object 
 12  hide_score                     3992 non-null   bo

In [1]:
#count records per subreddit; run this after the concat cell above, since a fresh kernel raises NameError for sub_reddit_data
sub_reddit_data['subreddit'].value_counts()

In [95]:
sub_reddit_data.to_csv('../Data/sub_reddit_data.csv', index=False)