# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web APIs & NLP

## Part 1
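Fetching comments from the top posts of the following subreddits on Reddit. The calls below use `subreddit.top()` without a `time_filter`, so they return all-time top posts (PRAW's default) rather than only the past month: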

1. [JapanTravel](https://www.reddit.com/r/JapanTravel/)
2. [koreatravel](https://www.reddit.com/r/koreatravel/)

PRAW (Python Reddit API Wrapper) is a good fit for collecting comments from Reddit. It is a widely used wrapper for Reddit's official API that simplifies authentication, applies rate limiting automatically, and exposes posts and comments as Python objects that are easy to parse. It is also well documented, stable, and backed by an active community. Because it works through the API and respects Reddit's guidelines, it is a convenient, efficient, and ethical way to extract comments.

In [1]:
# Import necessary libraries and set up parameters.
import pandas as pd
import praw
import pickle

### Japan

#### JapanTravel (Comments)
Using the `praw` library, the script authenticates with the Reddit API and collects comments from the "JapanTravel" subreddit. It fetches the top 100 posts and extracts the ID, creation time, body, and score of each comment under those posts.

- `Comment ID`: Extracting the comment's ID rather than the author's ID makes it easier to identify duplicate comments.
- `Timestamp`: Extracting the time each comment was created can help determine whether there are any seasonal patterns.
- `Comment Body`: Extracting comments rather than posts provides much more data, which makes it easier to identify trending keywords.
- `Upvotes`: Extracting each comment's score can help differentiate positive comments from negative ones, depending on the context.
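
As a minimal sketch of how the ID and timestamp fields could be used, the snippet below removes duplicate rows by comment ID and counts comments per calendar month. It uses only the standard library on a few hand-made rows. The IDs, times, and scores mirror the first rows of the JapanTravel preview, the duplicated row is added purely for illustration, and the helper names `_epoch`, `dedupe_by_id`, and `comments_per_month` are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def _epoch(year, month, day, hour, minute, second):
    """Return epoch seconds (UTC), the format of PRAW's `created_utc`."""
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone.utc).timestamp()


# Illustrative rows: (comment_id, created_utc, score).
rows = [
    ("dwkf6l4", _epoch(2018, 3, 31, 4, 41, 48), 1081),
    ("dwk6wsz", _epoch(2018, 3, 31, 1, 40, 6), 673),
    ("dwkf6l4", _epoch(2018, 3, 31, 4, 41, 48), 1081),  # deliberate duplicate
    ("dwkii8n", _epoch(2018, 3, 31, 6, 19, 18), 2361),
]


def dedupe_by_id(rows):
    """Keep the first row seen for each comment ID."""
    seen = {}
    for row in rows:
        seen.setdefault(row[0], row)
    return list(seen.values())


def comments_per_month(rows):
    """Count comments per calendar month (1-12), using UTC timestamps."""
    months = (datetime.fromtimestamp(ts, tz=timezone.utc).month
              for _, ts, _ in rows)
    return Counter(months)


unique_rows = dedupe_by_id(rows)
month_counts = comments_per_month(unique_rows)
print(len(rows), "->", len(unique_rows))
print(dict(month_counts))
```

With the `jp_top` DataFrame, the pandas counterparts would be `jp_top.drop_duplicates(subset='Comment ID')` and `jp_top['Timestamp'].dt.month.value_counts()`.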

In [2]:
# Replace these with your API credentials
reddit_client_id = ''
reddit_client_secret = ''
reddit_user_agent = ''

# Authenticate with the Reddit API
reddit = praw.Reddit(client_id=reddit_client_id,
                     client_secret=reddit_client_secret,
                     user_agent=reddit_user_agent)

# Subreddit to scrape comments from
subreddit_name = 'JapanTravel'

# Choose the number of top posts to fetch
limit = 100

# Fetch top posts from the subreddit
subreddit = reddit.subreddit(subreddit_name)
top_posts = subreddit.top(limit=limit)

# Lists to store the attributes of comments
comment_ids = []
comment_timestamps = []
comment_bodies = []
comment_upvotes = []

# Process the fetched top posts and extract comments from them
for post in top_posts:
    # limit=0 removes "MoreComments" placeholders without fetching the comments behind them
    post.comments.replace_more(limit=0)
    for comment in post.comments.list():
        comment_ids.append(comment.id)
        comment_timestamps.append(comment.created_utc)
        comment_bodies.append(comment.body)
        comment_upvotes.append(comment.score)
        
# Create a DataFrame from the lists
jp_top = pd.DataFrame({
    'Comment ID': comment_ids,
    'Timestamp': comment_timestamps,
    'Comment Body': comment_bodies,
    'Upvotes': comment_upvotes
})

# Convert epoch seconds (UTC) to pandas datetimes
jp_top['Timestamp'] = pd.to_datetime(jp_top['Timestamp'], unit='s')

print(jp_top.shape)
jp_top.head()

(14461, 4)


| | Comment ID | Timestamp | Comment Body | Upvotes |
|---|---|---|---|---|
| 0 | dwkf6l4 | 2018-03-31 04:41:48 | This is probably the most uneventful part of t... | 1081 |
| 1 | dwk6wsz | 2018-03-31 01:40:06 | What hotel. I’m in Osaka with nothing to do to... | 673 |
| 2 | dwkii8n | 2018-03-31 06:19:18 | Ze Documents have been delivered!!! | 2361 |
| 3 | dwk9186 | 2018-03-31 02:24:31 | Alright redditors, the adventure begins. I’m o... | 9226 |
| 4 | dwkjj5i | 2018-03-31 06:55:34 | On the way for real now! | 377 |


#### Export Data

In [3]:
# Save the DataFrame to a pickle file
jp_top.to_pickle('data/jp_top.pkl')

### Korea

#### koreatravel (Comments)
Reapplying the same technique used for Japan, but with a limit of 450 posts so that the number of rows is similar to that of `jp_top`.

- `Comment ID`: Extracting the comment's ID rather than the author's ID makes it easier to identify duplicate comments.
- `Timestamp`: Extracting the time each comment was created can help determine whether there are any seasonal patterns.
- `Comment Body`: Extracting comments rather than posts provides much more data, which makes it easier to identify trending keywords.
- `Upvotes`: Extracting each comment's score can help differentiate positive comments from negative ones, depending on the context.
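
Since the same loop is applied to both subreddits, it could be factored into one function. The sketch below is one possible shape, not the notebook's actual code: `collect_comments` is a hypothetical helper, and `FakeCommentForest` and `fake_post` are stand-ins that imitate just enough of PRAW's objects to run offline. The stand-in IDs, timestamps, bodies, and scores are made up for illustration.

```python
from types import SimpleNamespace


def collect_comments(posts):
    """Flatten the comments under each post into a list of row dicts."""
    rows = []
    for post in posts:
        # With PRAW, limit=0 drops "MoreComments" placeholders.
        post.comments.replace_more(limit=0)
        for comment in post.comments.list():
            rows.append({
                'Comment ID': comment.id,
                'Timestamp': comment.created_utc,
                'Comment Body': comment.body,
                'Upvotes': comment.score,
            })
    return rows


class FakeCommentForest:
    """Hypothetical stand-in for the `post.comments` object (offline use only)."""

    def __init__(self, comments):
        self._comments = comments

    def replace_more(self, limit=32):
        return []  # nothing to expand in the stand-in

    def list(self):
        return list(self._comments)


def fake_post(*comments):
    """Hypothetical stand-in for a post with a `comments` attribute."""
    return SimpleNamespace(comments=FakeCommentForest(comments))


fake_posts = [
    fake_post(SimpleNamespace(id='a1', created_utc=1522471308.0,
                              body='first', score=3)),
    fake_post(
        SimpleNamespace(id='b1', created_utc=1522471400.0,
                        body='second', score=1),
        SimpleNamespace(id='b2', created_utc=1522471500.0,
                        body='third', score=0),
    ),
]

rows = collect_comments(fake_posts)
print(len(rows))
print(rows[0]['Comment ID'])
```

With a real PRAW client, the same function could be called as `pd.DataFrame(collect_comments(reddit.subreddit('koreatravel').top(limit=450)))`, which would avoid repeating the loop for each subreddit.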

In [4]:
# Replace these with your API credentials
reddit_client_id = ''
reddit_client_secret = ''
reddit_user_agent = ''

# Authenticate with the Reddit API
reddit = praw.Reddit(client_id=reddit_client_id,
                     client_secret=reddit_client_secret,
                     user_agent=reddit_user_agent)

# Subreddit to scrape comments from
subreddit_name = 'koreatravel'

# Choose the number of top posts to fetch
limit = 450

# Fetch top posts from the subreddit
subreddit = reddit.subreddit(subreddit_name)
top_posts = subreddit.top(limit=limit)

# Lists to store the attributes of comments
comment_ids = []
comment_timestamps = []
comment_bodies = []
comment_upvotes = []

# Process the fetched top posts and extract comments from them
for post in top_posts:
    # limit=0 removes "MoreComments" placeholders without fetching the comments behind them
    post.comments.replace_more(limit=0)
    for comment in post.comments.list():
        comment_ids.append(comment.id)
        comment_timestamps.append(comment.created_utc)
        comment_bodies.append(comment.body)
        comment_upvotes.append(comment.score)
        
# Create a DataFrame from the lists
kr_top = pd.DataFrame({
    'Comment ID': comment_ids,
    'Timestamp': comment_timestamps,
    'Comment Body': comment_bodies,
    'Upvotes': comment_upvotes
})

# Convert epoch seconds (UTC) to pandas datetimes
kr_top['Timestamp'] = pd.to_datetime(kr_top['Timestamp'], unit='s')

print(kr_top.shape)
kr_top.head()

(13124, 4)


| | Comment ID | Timestamp | Comment Body | Upvotes |
|---|---|---|---|---|
| 0 | ir411me | 2022-10-05 04:20:31 | Pretty solid tips.\n\nA comment/question\n\n- ... | 18 |
| 1 | ir4ns6m | 2022-10-05 09:18:32 | Really useful tips, thank you! I was debating ... | 8 |
| 2 | ir3savu | 2022-10-05 02:59:37 | Amazing advice! I’ll be in Busan then Seoul in... | 6 |
| 3 | ir3okpj | 2022-10-05 02:28:30 | Straight to the point. Love it. Thanks for pos... | 5 |
| 4 | ir7v4hg | 2022-10-05 23:44:59 | [deleted] | 3 |
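
The choice of 450 posts for koreatravel can be checked with a little arithmetic. The sketch below uses the two printed row counts and the two requested limits. It assumes every requested post was actually returned (a Reddit listing can return fewer than the limit), so the per-post averages are rough estimates only.

```python
# Row counts printed by the two cells, and the post limits requested.
jp_rows, jp_limit = 14461, 100
kr_rows, kr_limit = 13124, 450

# Average comments collected per requested post
# (assumes all requested posts were returned).
jp_per_post = jp_rows / jp_limit
kr_per_post = kr_rows / kr_limit

# How close the Korea row count is to the Japan row count.
row_ratio = kr_rows / jp_rows

print(f"JapanTravel: {jp_per_post:.1f} comments per post")
print(f"koreatravel: {kr_per_post:.1f} comments per post")
print(f"kr_top has {row_ratio:.0%} as many rows as jp_top")
```

Under that assumption, JapanTravel's top posts attract roughly five times as many comments each, which is why a much higher post limit is needed for koreatravel to reach a comparable number of rows.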


#### Export Data

In [5]:
# Save the DataFrame to a pickle file
kr_top.to_pickle('data/kr_top.pkl')