## Collect and Update Data from Reddit


<img src="https://styles.redditmedia.com/t5_30hhs/styles/communityIcon_buhifseg1hm81.png" width=300></img>

The process has four steps:


- Run the collection
- Load the current data
- Merge the existing data with the newly collected data
- Save new version

The collection is scheduled to run daily.

For this to work, the Reddit application credentials must be available to the notebook. On Kaggle we store them as user secrets; in a local environment they are read from environment variables.


# Load packages

## Install praw

In [1]:
!pip install praw

Collecting praw
  Downloading praw-7.6.0-py3-none-any.whl (188 kB)
     |████████████████████████████████| 188 kB 903 kB/s            
Collecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Installing collected packages: prawcore, praw
Successfully installed praw-7.6.0 prawcore-2.3.0


## Packages used

In [2]:
import os
import praw
import pandas as pd
import datetime as dt
from tqdm import tqdm
import time

## Environment settings for Reddit secrets

Here is a simple tutorial about using secrets with Kaggle: [Feature Launch: User Secrets](https://www.kaggle.com/product-feedback/114053)

In [3]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
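
The import above only succeeds inside Kaggle. The following is a minimal sketch of a lookup that falls back to environment variables elsewhere. It is not part of the original notebook: the helper name `get_credential` is hypothetical, and the sketch assumes that `kaggle_secrets` is simply not importable outside Kaggle.

```python
import os

try:
    from kaggle_secrets import UserSecretsClient  # available only inside Kaggle
except ImportError:
    UserSecretsClient = None  # running locally: use environment variables


def get_credential(name):
    """Return a credential from Kaggle secrets, or from the environment otherwise."""
    if UserSecretsClient is not None:
        return UserSecretsClient().get_secret(name)
    try:
        return os.environ[name]
    except KeyError:
        # Fail early with a message that names the missing variable
        raise RuntimeError(
            f"Missing credential: set {name} as a Kaggle secret or environment variable"
        ) from None
```

With a helper like this, the two branches of a connection function could collapse into one set of lookups, for example `get_credential("REDDIT_APP_NAME")`.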

# Utility functions

In [4]:
def get_date(created):
    return dt.datetime.fromtimestamp(created)


def reddit_connection(environment="Kaggle"):
    
    if environment == "Kaggle":
        personal_use_script = user_secrets.get_secret("REDDIT_PERSONAL_USE_SCRIPT_14_CHARS")
        client_secret = user_secrets.get_secret("REDDIT_SECRET_KEY_27_CHARS")
        user_agent = user_secrets.get_secret("REDDIT_APP_NAME")
        username = user_secrets.get_secret("REDDIT_USER_NAME")
        password = user_secrets.get_secret("REDDIT_LOGIN_PASSWORD")
         
    else: #local (Linux/Windows) environment
        personal_use_script = os.environ["REDDIT_PERSONAL_USE_SCRIPT_14_CHARS"]
        client_secret = os.environ["REDDIT_SECRET_KEY_27_CHARS"]
        user_agent = os.environ["REDDIT_APP_NAME"]
        username = os.environ["REDDIT_USER_NAME"]
        password = os.environ["REDDIT_LOGIN_PASSWORD"]

    # Authenticate with the credentials retrieved above
    reddit = praw.Reddit(client_id=personal_use_script,
                         client_secret=client_secret,
                         user_agent=user_agent,
                         username=username,
                         password=password)
    return reddit
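
A note on `get_date`: `dt.datetime.fromtimestamp` called without a `tz` argument converts to the local time zone of the machine, so the same post can receive different timestamps on Kaggle and on a local machine. Below is a minimal sketch of a variant that does not depend on the machine's time zone. The name `get_date_utc` is hypothetical and not part of the original notebook.

```python
import datetime as dt


def get_date_utc(created):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)


print(get_date_utc(0))  # 1970-01-01 00:00:00+00:00
```

Switching an existing dataset to this variant would change the values in the `timestamp` column, so it is best decided before the first collection run.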


# Build the dataset (daily update)

In [5]:
def build_dataset(reddit, search_words='UkrainianConflict', items_limit=4000):
    
    # Collect reddit posts
    subreddit = reddit.subreddit(search_words)
    new_subreddit = subreddit.new(limit=items_limit)
    topics_dict = { "title":[],
                "score":[],
                "id":[], "url":[],
                "comms_num": [],
                "created": [],
                "body":[]}
    
    print("retrieve new reddit posts ...")
    for submission in tqdm(new_subreddit):
        topics_dict["title"].append(submission.title)
        topics_dict["score"].append(submission.score)
        topics_dict["id"].append(submission.id)
        topics_dict["url"].append(submission.url)
        topics_dict["comms_num"].append(submission.num_comments)
        topics_dict["created"].append(submission.created)
        topics_dict["body"].append(submission.selftext)

    # Collect recent comments, using the same limit as for posts
    for comment in tqdm(subreddit.comments(limit=items_limit)):
        topics_dict["title"].append("Comment")
        topics_dict["score"].append(comment.score)
        topics_dict["id"].append(comment.id)
        topics_dict["url"].append("")
        topics_dict["comms_num"].append(0)
        topics_dict["created"].append(comment.created)
        topics_dict["body"].append(comment.body)

    topics_df = pd.DataFrame(topics_dict)
    print(f"new reddit posts retrieved: {len(topics_df)}")
    topics_df['timestamp'] = topics_df['created'].apply(get_date)

    return topics_df
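
To see the shape of the resulting table without Reddit credentials, here is a small offline sketch. The `SimpleNamespace` objects are hypothetical stand-ins for PRAW submissions and comments, with made-up values, and the sketch builds one dict per row rather than the dict of lists used in `build_dataset`; the resulting columns are the same.

```python
import datetime as dt
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-ins for PRAW objects (values are invented for illustration)
submissions = [SimpleNamespace(title="Post A", score=10, id="p1",
                               url="https://example.com/a", num_comments=2,
                               created=1_650_000_000.0, selftext="")]
comments = [SimpleNamespace(score=3, id="c1", created=1_650_000_100.0,
                            body="A reply")]

rows = []
for s in submissions:
    rows.append({"title": s.title, "score": s.score, "id": s.id, "url": s.url,
                 "comms_num": s.num_comments, "created": s.created,
                 "body": s.selftext})
for c in comments:
    # Comments reuse the post schema: fixed title, empty URL, zero comment count
    rows.append({"title": "Comment", "score": c.score, "id": c.id, "url": "",
                 "comms_num": 0, "created": c.created, "body": c.body})

toy_df = pd.DataFrame(rows)
toy_df["timestamp"] = toy_df["created"].apply(dt.datetime.fromtimestamp)
print(toy_df.shape)  # (2, 8)
```

Posts and comments share a single table, which is why comment rows carry the placeholder title `"Comment"`.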
   

# Update and save dataset

We perform the following actions:  
* Load old dataset  
* Merge the two datasets  
* Save the merged data

In [6]:
def update_and_save_dataset(topics_df):   
    file_path = "../input/russian-invasion-of-ukraine/russian_invasion_of_ukraine.csv"
    out_file_path = "russian_invasion_of_ukraine.csv"
    if os.path.exists(file_path):
        topics_old_df = pd.read_csv(file_path)
        print(f"past reddit posts: {topics_old_df.shape}")
        topics_all_df = pd.concat([topics_old_df, topics_df], axis=0)
        print(f"new reddit posts: {topics_df.shape[0]} past posts: {topics_old_df.shape[0]} all posts: {topics_all_df.shape[0]}")
        topics_new_df = topics_all_df.drop_duplicates(subset = ["id"], keep='last', inplace=False)
        print(f"all reddit posts: {topics_new_df.shape}")
        topics_new_df.to_csv(out_file_path, index=False)
    else:
        print(f"reddit posts: {topics_df.shape}")
        topics_df.to_csv(out_file_path, index=False)
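
The effect of `keep='last'` in the merge is easiest to see on toy data. In this sketch (the frames and values are invented for illustration), the id `"b"` appears in both datasets, and the row from the newer collection wins, so an updated score replaces the stored one.

```python
import pandas as pd

old_df = pd.DataFrame({"id": ["a", "b"], "score": [1, 5]})
new_df = pd.DataFrame({"id": ["b", "c"], "score": [7, 2]})

# Old rows first, new rows last, so keep="last" prefers the fresh copy
merged = pd.concat([old_df, new_df], axis=0)
deduped = merged.drop_duplicates(subset=["id"], keep="last").reset_index(drop=True)

print(deduped["id"].tolist())     # ['a', 'b', 'c']
print(deduped["score"].tolist())  # [1, 7, 2]
```

The order of the frames passed to `pd.concat` matters here: reversing it would keep the stale rows instead.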

# Run it all

We perform the following actions:  
* Initialize connection  
* Build the dataset  
* Update and save the dataset


In [7]:
reddit = reddit_connection()
topics_data_df = build_dataset(reddit)
update_and_save_dataset(topics_data_df)

retrieve new reddit posts ...


990it [00:09, 99.51it/s] 
992it [00:06, 151.93it/s]


new reddit posts retrieved: 1982




past reddit posts: (252356, 8)
new reddit posts: 1982 past posts: 252356 all posts: 254338
all reddit posts: (253494, 8)
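
The three log lines above can be cross-checked with simple arithmetic. Assuming neither dataset contains duplicate ids on its own, the difference between the concatenated and deduplicated row counts is the number of newly collected rows that were already stored. The variable names below are illustrative only.

```python
# Row counts copied from the log output above
past_rows, new_rows, rows_after_dedup = 252_356, 1_982, 253_494

rows_before_dedup = past_rows + new_rows                 # size after pd.concat
duplicates_removed = rows_before_dedup - rows_after_dedup
net_new_rows = rows_after_dedup - past_rows

print(rows_before_dedup, duplicates_removed, net_new_rows)  # 254338 844 1138
```

So this run added 1138 rows to the dataset, while 844 of the collected rows only refreshed entries that already existed.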
