# Introduction
This notebook gathers data from Reddit through the Pushshift API. The two subreddits we will gather data from are:
1) [Ask Men](https://www.reddit.com/r/AskMen)
2) [Ask Women](https://www.reddit.com/r/AskWomen)

For this portion I created a separate list for each subreddit, holding up to the 1000 newest posts, descending from April 19, 2023. For each post I kept track of the:
1) Title
2) Time stamp
3) Subreddit

These lists are saved as separate files as a backup, and a file called sub_reddit_data is created by combining the two.

## Imports

In [1]:
import requests
import pandas as pd

In [201]:
# api url
url = 'https://api.pushshift.io/reddit/search/submission'

## Pulling data from api

### Women

In [202]:
#params to get the 1000 newest posts from the askwomen subreddit
women_params={
    'subreddit': 'askwomen',
    'size' : 1000
}

In [205]:
req_women = requests.get(url, params=women_params)

In [206]:
#check to see if we established a connection
print(f' women response code: {req_women}')

 women response code: <Response [200]>
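
Printing the response object shows the status code, but the check can also be done in code. The sketch below is one possible way to do it, not part of the original notebook: `is_success` is a hypothetical helper name, and the two responses are built by hand so the example runs without calling the API.

```python
import requests


def is_success(response):
    """Return True when the response carries a non-error HTTP status code."""
    try:
        # raise_for_status() raises requests.HTTPError for 4xx and 5xx codes
        response.raise_for_status()
    except requests.HTTPError:
        return False
    return True


# Hand-built responses (assumption: stand-ins for real API replies)
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(is_success(ok_response), is_success(missing_response))
```

In the notebook, the same helper could be called on `req_women` or `req_men` right after the request is made.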


In [207]:
#checked keys to see what data we need
req_women.json().keys()

dict_keys(['data', 'error', 'metadata'])

In [208]:
'''
See the keys of the nested dictionaries and lists.
Each post is a dictionary, stored in a list, stored in a dictionary.
'''
req_women.json()['data'][0].keys()

dict_keys(['subreddit', 'selftext', 'author_fullname', 'gilded', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'thumbnail_height', 'top_awarded_type', 'hide_score', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'total_awards_received', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'score', 'is_created_from_ads_ui', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'link_flair_type', 'wls', 'removed_by_category', 'author_flair_type', 'domain', 'allow_live_comments', 'suggested_sort', 'view_count', 'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_18', 'all_awardings', 'awarders', 'media_only', 'can_gild', 'spoiler', 'locked', 'author_f
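
To make the nesting concrete, here is a minimal sketch that walks the same three levels. The payload is hand-written for illustration (an assumption, not real API output); only the key names `data`, `error`, `metadata`, `title`, and `utc_datetime_str` come from the notebook.

```python
# Hypothetical payload that mimics the shape described above:
# a dictionary -> a list under 'data' -> one dictionary per post.
sample_payload = {
    'data': [
        {'title': 'Example title one', 'utc_datetime_str': '2023-04-19 22:44:12'},
        {'title': 'Example title two', 'utc_datetime_str': '2023-04-19 22:42:37'},
    ],
    'error': None,
    'metadata': {},
}

posts = sample_payload['data']   # the list of posts
first_post = posts[0]            # a single post dictionary

print(type(posts).__name__, type(first_post).__name__)
print(first_post['title'])
```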

In [183]:
#parse the JSON once, then collect the title and timestamp of every post
women_posts = req_women.json()['data']
women_list = []
women_time = []
for post in women_posts:
    women_list.append(post['title'])
    women_time.append(post['utc_datetime_str'])

In [184]:
#zipping list so I can create a dataframe to save as a csv
women_combined_list = list(zip(women_list, women_time))

In [185]:
#converted into a dataframe
women_df = pd.DataFrame(women_combined_list, columns=['title','time_stamp'])

In [186]:
#label the subreddit
women_df['subreddit'] = 'ask_women'
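
The extract, zip, and label steps above are repeated for each subreddit, so they could be wrapped in one function. The sketch below is an alternative under stated assumptions, not the notebook's method: `posts_to_frame` is a hypothetical helper, and `sample_posts` is made-up data with the same two keys the notebook reads.

```python
import pandas as pd


def posts_to_frame(posts, label):
    """Build a title/time_stamp/subreddit frame from a list of post dicts."""
    rows = [
        {'title': post['title'], 'time_stamp': post['utc_datetime_str']}
        for post in posts
    ]
    frame = pd.DataFrame(rows, columns=['title', 'time_stamp'])
    frame['subreddit'] = label   # same label for every row
    return frame


# Made-up posts (assumption), shaped like entries of the 'data' list
sample_posts = [
    {'title': 'Example title one', 'utc_datetime_str': '2023-04-19 22:44:12'},
    {'title': 'Example title two', 'utc_datetime_str': '2023-04-19 22:42:37'},
]

sample_df = posts_to_frame(sample_posts, 'ask_women')
print(sample_df)
```

With a real response, the call would look like `posts_to_frame(req_women.json()['data'], 'ask_women')`.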

In [187]:
#check the number of entries
len(women_df)

999

In [188]:
#see the first 5 rows to see if we got the titles.
women_df.head()

Unnamed: 0,title,time_stamp,subreddit
0,Women who lowered their standards significantl...,2023-04-19 22:44:12,ask_women
1,"Women who ignored red flags on the first date,...",2023-04-19 22:42:37,ask_women
2,What would you do if a man doesn’t respond to ...,2023-04-19 22:41:14,ask_women
3,ladies of Reddit that do meditation in a neigh...,2023-04-19 22:14:51,ask_women
4,"Hello, Shrinking technology was invented and g...",2023-04-19 21:54:55,ask_women


In [189]:
women_df.to_csv('./Data/ask_women.csv', index=False)

### Men

In [209]:
#params to get the 1000 newest posts from the askmen subreddit
men_params={
    'subreddit': 'askmen',
    'size' : 1000
}

In [210]:
req_men = requests.get(url, params=men_params)

In [211]:
print(f' men response code: {req_men}')

 men response code: <Response [200]>


In [174]:
#parse the JSON once, then collect the title and timestamp of every post
men_posts = req_men.json()['data']
men_list = []
men_time = []
for post in men_posts:
    men_list.append(post['title'])
    men_time.append(post['utc_datetime_str'])

In [175]:
#combine the list
men_combined_list = list(zip(men_list, men_time))

In [176]:
#convert to a data frame
men_df = pd.DataFrame(men_combined_list, columns=['title','time_stamp'])

In [177]:
#label the subreddit
men_df['subreddit'] = 'ask_men'

In [190]:
len(men_df)

998

In [179]:
men_df.head()

Unnamed: 0,title,time_stamp,subreddit
0,Replay but not respond,2023-04-19 23:10:46,ask_men
1,"Would you rather be sexually assaulted, or fal...",2023-04-19 23:07:17,ask_men
2,Deleted Instagram to avoid seeing women,2023-04-19 23:03:58,ask_men
3,If women get periods why don't men get commas?,2023-04-19 23:01:48,ask_men
4,What characters can you identify by their mask...,2023-04-19 22:59:52,ask_men


In [191]:
men_df.to_csv('./Data/ask_men.csv', index=False)

## Combine data

In [192]:
sub_reddit_data = pd.concat([women_df, men_df])

In [193]:
sub_reddit_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1997 entries, 0 to 997
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   title       1997 non-null   object
 1   time_stamp  1997 non-null   object
 2   subreddit   1997 non-null   object
dtypes: object(3)
memory usage: 62.4+ KB
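
The summary above reports 1997 entries with an index running from 0 to 997, which means each frame kept its own row labels and the combined index contains duplicates. If unique row labels are wanted, `pd.concat` accepts `ignore_index=True`. The sketch below shows the difference on two tiny made-up frames (`left` and `right` are hypothetical stand-ins for `women_df` and `men_df`).

```python
import pandas as pd

# Tiny stand-in frames (assumption: same column layout, made-up titles)
left = pd.DataFrame({'title': ['a', 'b'], 'subreddit': 'ask_women'})
right = pd.DataFrame({'title': ['c'], 'subreddit': 'ask_men'})

kept = pd.concat([left, right])                            # original labels kept
renumbered = pd.concat([left, right], ignore_index=True)   # fresh 0..n-1 labels

print(kept.index.tolist())        # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```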


In [194]:
sub_reddit_data['subreddit'].value_counts()

ask_women    999
ask_men      998
Name: subreddit, dtype: int64

In [195]:
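#index=False matches the two backup files and keeps the duplicated row labels out of the csv
sub_reddit_data.to_csv('./Data/sub_reddit_data.csv', index=False)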