# How COVID Influenced People's Lives

COVID has had a large negative impact on society. Nearly nine-in-ten U.S. adults say their life has changed at least a little as a result of the COVID-19 outbreak, including 44% who say their life has changed in a major way ([Most Americans Say Coronavirus Outbreak Has Impacted Their Lives](https://www.pewsocialtrends.org/2020/03/30/most-americans-say-coronavirus-outbreak-has-impacted-their-lives/)). This study quantifies how COVID has influenced many aspects of people's everyday lives by visualizing how keywords changed during the past year in subreddits about daily-life topics (in case you are not familiar with Reddit: [a subreddit is a web forum on a particular topic where you can post links or create a self post and discuss](https://www.reddit.com/r/help/comments/37shum/what_is_a_subreddit/)).

### Structure
The analysis tracks the activity in each subreddit (number of submissions, comments and active redditors) and how its keywords change over time.

### Data source
The data is scraped from **Reddit**, one of the biggest online communities. **Reddit** has more than [430m users](https://au.oberlo.com/blog/reddit-statistics) and [168b views per year](https://mediakix.com/blog/reddit-statistics-users-demographics/).

[Another good source of statistics](https://websitebuilder.org/blog/reddit-statistics/).

### Subreddits selected
**Selection criteria**
- Focus on daily-life topics, as listed separately below.
- Among all subreddits on a similar topic, I select the one or two with the most members to maximize the size and variety of the data.

**List of subreddits (click the name to view the subreddit)**

- **Food**   
[r/cooking](https://www.reddit.com/r/Cooking/): 2.5m, [r/recipes](https://www.reddit.com/r/recipes/): 2.4m, [r/MealPrepSunday](https://www.reddit.com/r/MealPrepSunday/): 2m, [r/budgetfood](https://www.reddit.com/r/budgetfood/): 302k

- **Family life, childcare and relationships**  
[r/Parenting](https://www.reddit.com/r/Parenting/): 3.2m, [r/relationships](https://www.reddit.com/r/relationships/): 3m, [r/teenagers](https://www.reddit.com/r/teenagers/): 2.4m, [r/childfree](https://www.reddit.com/r/childfree/): 1.4m, [r/weddingplanning](https://www.reddit.com/r/weddingplanning/): 144k, [r/family](https://www.reddit.com/r/family/): 138k, [r/education](https://www.reddit.com/r/education/new/): 124k, [r/christmas](https://www.reddit.com/r/christmas/): 100k, [r/Gifts](https://www.reddit.com/r/Gifts/): 56.2k, [r/GiftIdeas](https://www.reddit.com/r/GiftIdeas/): 45.4k

- **Hobbies**  
[r/gaming](https://www.reddit.com/r/gaming/): 29.3m, [r/books](https://www.reddit.com/r/books/): 19m, [r/camping](https://www.reddit.com/r/camping/): 1.9m, [r/Fitness](https://www.reddit.com/r/Fitness/): 8m, [r/travel](https://www.reddit.com/r/travel/): 5.7m

- **Financial status**
[r/personalfinance](https://www.reddit.com/r/personalfinance/): 14.3m, [r/Economics](https://www.reddit.com/r/Economics/): 1.2m, [r/povertyfinance](https://www.reddit.com/r/povertyfinance/): 588k, [r/CreditCards](https://www.reddit.com/r/CreditCards/): 262k, [r/realestateinvesting](https://www.reddit.com/r/realestateinvesting/): 330k, [r/RealEstate](https://www.reddit.com/r/RealEstate/): 236k

- **Career**
[r/Entrepreneur](https://www.reddit.com/r/Entrepreneur/): 889k, [r/business](https://www.reddit.com/r/business/): 602k, [r/smallbusiness](https://www.reddit.com/r/smallbusiness/): 484k, [r/jobs](https://www.reddit.com/r/jobs/): 428k, [r/WorkOnline](https://www.reddit.com/r/WorkOnline/): 333k, [r/careerguidance](https://www.reddit.com/r/careerguidance/): 250k, [r/GetEmployed](https://www.reddit.com/r/GetEmployed/): 102k

- **Mental Health**
[r/GetMotivated](https://www.reddit.com/r/GetMotivated/): 17.0m, [r/CasualConversation](https://www.reddit.com/r/CasualConversation/): 1.4m, [r/psychology](https://www.reddit.com/r/psychology/): 809k, [r/getdisciplined](https://www.reddit.com/r/getdisciplined/): 729k, [r/depression](https://www.reddit.com/r/depression/): 725k, [r/Anxiety](https://www.reddit.com/r/Anxiety/): 429k, [r/mentalhealth](https://www.reddit.com/r/mentalhealth/): 220k, [r/Lonely](https://www.reddit.com/r/lonely/): 194k

- **Others**
[r/Futurology](https://www.reddit.com/r/Futurology/): 15.1m, [r/Coronavirus](https://www.reddit.com/r/Coronavirus/): 2.4m, [r/legaladvice](https://www.reddit.com/r/legaladvice/): 1.4m

### Major tool
- **Pushshift** ([Manual](https://reddit-api.readthedocs.io/en/latest/), [Github](https://github.com/pushshift/api), [r/Pushshift](https://www.reddit.com/r/pushshift/))  

Pushshift is used to scrape data from Reddit. Compared to the popular tool [PRAW](https://praw.readthedocs.io/en/latest/), Pushshift provides a more powerful way to search submissions and comments, especially for date-based queries. In addition, Pushshift returns results faster when the search involves a large amount of data.

On the other hand, search results from Pushshift lag by a couple of days, meaning that the latest Reddit data available through Pushshift is a couple of days old. Since this analysis is not real-time, that delay is acceptable, so I used Pushshift to collect the data.

- **[WordCloud](https://amueller.github.io/word_cloud/index.html)**  

WordCloud is used to visualize the keywords of a specific (group of) subreddit(s).

## Initialization

In [None]:
save_path_prefix = 'covid_influence/files/'
query_keywords = ['']
# query_keywords = ['covid|corona|quarantine|pandemic'] # Use '+' or '|' to connect multiple keywords; leave as [''] without searching for specific keywords
query_subreddits = ['Gifts,GiftIdeas', 'travel', 'personalfinance', 'creditcards', 'realestate', 'smallbusiness', 'jobs,careerguidance,GetEmployed', 
                   'GetMotivated', 'CasualConversation', 'depression', 'anxiety', 'mentalhealth', 'Lonely', 'books', 'teenagers', 'parenting', 'fitness', 
                   'AskAnAmerican', 'gaming', 'relationships', 'china', 'india', 'unitedkingdom', 'australia'] # Use ',' to connect multiple subreddits, leave as [''] if search across all subreddits
query_date_ranges = []
for m in range(1, 13):
    query_date_ranges.append(['2020-'+str(m).zfill(2)+'-01', '2020-'+str(m).zfill(2)+'-05'])
    query_date_ranges.append(['2020-'+str(m).zfill(2)+'-06', '2020-'+str(m).zfill(2)+'-10'])
    query_date_ranges.append(['2020-'+str(m).zfill(2)+'-11', '2020-'+str(m).zfill(2)+'-15'])
    query_date_ranges.append(['2020-'+str(m).zfill(2)+'-16', '2020-'+str(m).zfill(2)+'-20'])
    query_date_ranges.append(['2020-'+str(m).zfill(2)+'-21', '2020-'+str(m).zfill(2)+'-25'])
    query_date_ranges.append(['2020-'+str(m).zfill(2)+'-26', '2020-'+str(m).zfill(2)+'-28'])

# Parameters to plot the wordclouds
name_months = ['Jan.', 'Feb.', 'Mar.', 'Apr.', 'May.', 'June', 'July', 'Aug.', 'Sept.', 'Oct.', 'Nov.', 'Dec.']
plt_cfg = dict()
plt_cfg['path_save'] = 'covid_influence/plots/'
plt_cfg['size'] = (50, 25)
plt_cfg['xSub'] = 3
plt_cfg['ySub'] = 4
plt_cfg['title'] = [month for month in name_months]

# Parameters to fetch submissions and comments
cfg_subm = dict()
cfg_subm['field'] = 'title,selftext,num_comments,author'
cfg_subm['rm_dupe'] = 'title'
cfg_subm['sort'] = 'num_comments'
cfg_subm['sort_type'] = 'desc'
cfg_subm['query_type'] = 'submission'

cfg_cmt = dict()
cfg_cmt['field'] = 'body,author,score'
cfg_cmt['rm_dupe'] = 'body'
cfg_cmt['sort'] = 'score'
cfg_cmt['sort_type'] = 'desc'
cfg_cmt['query_type'] = 'comment'

# Add additional stopwords
add_stopwords = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'day', 'week', 'month', 'year', 'thing', 'app', 'new', 'old', 
                 'hundred', 'thousand']
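
The date ranges built above stop at the 28th of each month, so the 29th to 31st are never requested. A minimal sketch of one way to cover every calendar day, assuming five-day windows are still wanted; `month_windows` and `query_date_ranges_full` are hypothetical names, not part of the original notebook:

```python
import calendar
from datetime import date

def month_windows(year, month, step=5):
    """Split one month into consecutive [start, end] date strings of at most `step` days."""
    last_day = calendar.monthrange(year, month)[1]  # number of days in this month
    windows = []
    start = 1
    while start <= last_day:
        end = min(start + step - 1, last_day)
        # The final window absorbs whatever days remain (e.g. the 26th to the 31st)
        if last_day - end < step - 1 and end >= 25:
            end = last_day
        windows.append([date(year, month, start).isoformat(),
                        date(year, month, end).isoformat()])
        start = end + 1
    return windows

# Same layout as query_date_ranges, but reaching the true end of each month
query_date_ranges_full = [w for m in range(1, 13) for w in month_windows(2020, m)]
```

With the default `step=5`, every month still yields six windows, so the per-month grouping used later would not need to change.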

In [None]:
import funcs_pushshift
import os
import sys
import pandas as pd
import numpy as np
import pickle
import nltk
# nltk.download('wordnet')
# nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

## Getting submissions and comments from reddit using [Pushshift](https://reddit-api.readthedocs.io/en/latest/#comments-search)

The original code of the function *GetPushshiftData* is from [dylankilkenny/PushShift.py](https://gist.github.com/dylankilkenny/3dbf6123527260165f8c5c3bc3ee331b) and [Rare Loot](https://rareloot.medium.com/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563).

### 1. Assemble queries

In [None]:
save_path = []
fname = []
query_main = []
query_subm = []
query_cmt = []
suffix_subm = '&filter='+cfg_subm['field']+'&sort_type='+cfg_subm['sort']+'&sort=desc'+'&size=100'
suffix_cmt = '&filter='+cfg_cmt['field']+'&sort_type='+cfg_cmt['sort']+'&sort=desc'+'&size=100'
for keyword in query_keywords:
    query_temp = 'q='+keyword
    for subreddit in query_subreddits:
        query_main.append(keyword+'_'+subreddit)
        for date_range in query_date_ranges:
            query_subm.append(query_temp+'&subreddit='+subreddit+'&after='+ date_range[0]+'&before='+date_range[1]+suffix_subm)
            query_cmt.append(query_temp+'&subreddit='+subreddit+'&after='+ date_range[0]+'&before='+date_range[1]+suffix_cmt)
            save_path.append(save_path_prefix+keyword+'_'+subreddit+'/')
            fname.append(date_range[0]+'_'+date_range[1])
    del query_temp
del suffix_subm, suffix_cmt, keyword, subreddit
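
The query strings above are assembled by hand with string concatenation. As a rough sketch, the same string can be produced with the standard library's `urlencode`, which also escapes any special characters in a keyword. `build_query` is a hypothetical helper; how `funcs_pushshift` attaches the string to the Pushshift endpoint is not shown in this notebook, so this only covers the query part:

```python
from urllib.parse import urlencode

def build_query(keyword, subreddit, after, before, fields, sort_field, size=100):
    """Build the Pushshift query string from its parts (hypothetical helper)."""
    params = {
        'q': keyword,
        'subreddit': subreddit,
        'after': after,
        'before': before,
        'filter': fields,
        'sort_type': sort_field,  # field to sort on, e.g. 'num_comments' or 'score'
        'sort': 'desc',
        'size': size,
    }
    # Keep ',' and '|' unescaped so multi-subreddit and multi-keyword values stay readable
    return urlencode(params, safe=',|')

example_query = build_query('', 'Gifts,GiftIdeas', '2020-01-01', '2020-01-05',
                            'title,selftext,num_comments,author', 'num_comments')
```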

In [None]:
# Testing code
# subms_all = []
# cmts_all = []
# for idx in range(len(fname)):
#     try:
#         subms_all.append(pd.read_csv(save_path[idx]+cfg_subm['query_type']+'_'+fname[idx]+'.csv'))
#     except:
#         subms_all.append([])
#     try:
#         cmts_all.append(pd.read_csv(save_path[idx]+cfg_cmt['query_type']+'_'+fname[idx]+'.csv'))
#     except:
#         cmts_all.append([])

### 2. Collect the submissions and comments from the Pushshift server

In [None]:
subms_all = []
cmts_all = []
for idx in range(0, len(fname)):
    print('> Processing query: '+str(idx+1)+' / '+str(len(query_subm)) + '. Save to: '+save_path[idx]+fname[idx]+'/')
    cfg_subm['path_save'] = save_path[idx]
    cfg_subm['save_suffix'] = fname[idx]
    df_subm = funcs_pushshift.fetch_data(query_subm[idx], cfg_subm)
    subms_all.append(df_subm)

    cfg_cmt['path_save'] = save_path[idx]
    cfg_cmt['save_suffix'] = fname[idx]
    df_cmt = funcs_pushshift.fetch_data(query_cmt[idx], cfg_cmt)
    cmts_all.append(df_cmt)
    del df_subm, df_cmt
del query_subm, query_cmt

### 3. Assemble text of each month and calculate the number of active redditors

In [None]:
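def mid_dividor(n):
    # Return the two middle divisors of n (their product is n); used as the subplot grid shape
    tmp = [i for i in range(1, n + 1) if n % i == 0]
    mid = len(tmp) // 2
    if len(tmp) % 2 == 0:
        return [tmp[mid - 1], tmp[mid]]
    else:
        # Odd number of divisors means n is a perfect square: use its square root twice.
        # Integer division avoids round(), which rounds halves to the nearest even number.
        return [tmp[mid], tmp[mid]]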
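
The helper above picks the pair of divisors closest to the square root so that the subplot grid is as close to square as possible. The same idea can be sketched more directly by searching downward from the integer square root; `grid_shape` is a hypothetical name used only for this illustration and assumes `n >= 1`:

```python
import math

def grid_shape(n):
    """Return [rows, cols] with rows * cols == n and rows as close to sqrt(n) as possible."""
    for rows in range(math.isqrt(n), 0, -1):
        if n % rows == 0:
            return [rows, n // rows]

shape_24 = grid_shape(24)  # 24 queries, as in query_subreddits above
shape_9 = grid_shape(9)    # perfect square
shape_7 = grid_shape(7)    # prime number falls back to a single row
```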

In [None]:
# Parameters for plotting
if not os.path.exists(plt_cfg['path_save']):
    os.makedirs(plt_cfg['path_save'])
sub_idx = mid_dividor(len(query_main))
fig, axs = plt.subplots(sub_idx[0], sub_idx[1], squeeze=False)  # squeeze=False: axs is always a 2D array

# Get column names
subm_cols = cfg_subm['field'].split(',')
cmt_cols = cfg_cmt['field'].split(',')
# Combine the data from the same month for each query and calculate the number of active redditors
num_files_month = int(len(fname) / (len(name_months) * len(query_keywords) * len(query_subreddits)))
subms = dict()
cmts = dict()
idx_data = 0
for cnt_query in range(len(query_main)):
    subms[query_main[cnt_query]] = []
    cmts[query_main[cnt_query]] = []
    num_redditor = []
    num_subm = []
    num_cmt = []
    for idx in range(0, len(name_months)):
        subm_temp = []
        cmt_temp = []
        for ii in range(0, num_files_month):
            if len(subms_all[idx_data]) != 0:
                subm_temp.append(subms_all[idx_data])
            if len(cmts_all[idx_data]) != 0:
                cmt_temp.append(cmts_all[idx_data])
            idx_data += 1
        if subm_temp == []:
            subms[query_main[cnt_query]].append(pd.DataFrame(columns=subm_cols))
        else:
            subms[query_main[cnt_query]].append(pd.concat(subm_temp).reset_index()[subm_cols])
        if cmt_temp == []:
            cmts[query_main[cnt_query]].append(pd.DataFrame(columns=cmt_cols))
        else:
            cmts[query_main[cnt_query]].append(pd.concat(cmt_temp).reset_index()[cmt_cols])

        # Calculate the number of redditors
        authors = pd.concat([subms[query_main[cnt_query]][idx]['author'], cmts[query_main[cnt_query]][idx]['author']])
        temp = authors.shape[0]
        authors = authors.drop_duplicates().reset_index()
        num_redditor.append(authors.shape[0])
        if len(subms[query_main[cnt_query]][idx]) == 0:
            num_subm.append(0)
        else:
            num_subm.append(len(subms[query_main[cnt_query]][idx]))
        num_cmt.append(len(cmts[query_main[cnt_query]][idx]))
        del subm_temp, cmt_temp, temp, authors

    if axs.ndim == 1:
        ax = axs[cnt_query]
    elif axs.ndim == 2:
        sub = np.unravel_index(cnt_query, (sub_idx[0], sub_idx[1]))
        ax = axs[sub[0], sub[1]]
        del sub   
    
    num_post = pd.DataFrame({'No. submissions': num_subm, 'No. comments': num_cmt, 'No. redditors':num_redditor}, index=name_months)
    ax.plot(num_post['No. submissions'], color='#3d405b', label='No. submissions', linewidth=5)
    ax.plot(num_post['No. comments'], color='#81b29a', label='No. comments', linewidth=5)
    ax.plot(num_post['No. redditors'], color='#e07a5f', label='No. redditors', linewidth=5)
    ax.axhline(y=num_files_month*100, color='r', linestyle='--', alpha=0.3)
    ax.legend(loc='upper left', frameon=False, fontsize=15)
    ax.set_xticklabels(name_months)
    ax.set_xlabel('Months')
    ax.set_ylabel('Number')
    ax.set_title(query_main[cnt_query], fontsize=25)
    ax.spines['right'].set_visible(0)
    ax.spines['top'].set_visible(0)
    del num_redditor, num_subm, num_cmt, num_post, ax
    
fig.set_size_inches(20*sub_idx[1], 6*sub_idx[0])  # width scales with columns, height with rows
if len(query_main) > 1:
    plt.savefig(plt_cfg['path_save']+'stats_'+query_main[0]+'_etc.jpg', bbox_inches='tight')
elif len(query_main) == 1:
    plt.savefig(plt_cfg['path_save']+'stats_'+query_main[0]+'.jpg', bbox_inches='tight')
del subm_cols, cmt_cols, num_files_month, idx_data, cnt_query, fig, axs, sub_idx

## Process the text for word cloud
Part of the code is adapted from the one originally produced by Zolzaya Luvsandorj ([Medium](https://towardsdatascience.com/introduction-to-nlp-part-1-preprocessing-text-in-python-8f007d44ca96)).  

The following steps are performed in order:
1. Concatenate all text from submission title, content and comments.
2. Tokenize
3. Normalize
4. Remove stopwords
5. Remove words that contain numbers or underscores, or that consist of fewer than three characters.
6. Count how often each remaining word occurs.

In [None]:
def process_text(text, additional_stopwords=[]):
    # Tokenise words while ignoring punctuation
    tokeniser = RegexpTokenizer(r'\w+')
    tokens = tokeniser.tokenize(text)
    
    # Lowercase and lemmatise 
    lemmatiser = WordNetLemmatizer()
    lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]
    
    # Remove stopwords; build the set once so each membership test is a fast lookup
    stop_set = set(stopwords.words('english')) | set(additional_stopwords)
    keywords = [lemma for lemma in lemmas if lemma not in stop_set]
    
    # Remove words that contain digits or underscores, or that are shorter than three characters
    keywords = [word for word in keywords if not (any(char.isdigit() for char in word) or ('_' in word) or (len(word) < 3))]

    return keywords

In [None]:
word_freq = dict()
for query in query_main:
    word_freq[query] = []
    for month in range(0, len(subms[query])):
        # Concatenate text
        subm = subms[query][month]
        cmt = cmts[query][month]

        txt = ''
        for n in range(0, subm.shape[0]):
            if type(subm['title'][n]) == str:
                txt += subm['title'][n] + ' '
            if type(subm['selftext'][n]) == str:      
                txt += subm['selftext'][n] + ' '

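        for n in range(0, cmt.shape[0]):
            if type(cmt['body'][n]) == str:
                # Trailing space keeps the last word of one comment from merging with the first word of the next
                txt += cmt['body'][n] + ' '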
        
        # Preprocess words
        keywords = process_text(txt, add_stopwords)
        
        # Produce text frequency
        word_freq[query].append({word: keywords.count(word) for word in set(keywords)})
        del subm, cmt, keywords, txt
    # Use a context manager so the file is closed after writing
    with open(save_path_prefix+query+'/word_frequency.p', 'wb') as f:
        pickle.dump(word_freq[query], f)
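
The frequency dictionary above calls `keywords.count(word)` once per unique word, which rescans the whole list each time and can be slow on a month of text. As a sketch of an alternative, `collections.Counter` from the standard library produces the same mapping in a single pass; the toy `keywords` list below is made up for illustration:

```python
from collections import Counter

# Toy stand-in for the output of process_text
keywords = ['mask', 'work', 'home', 'work', 'mask', 'work']

# Approach used in the cell above: one full scan of the list per unique word
freq_loop = {word: keywords.count(word) for word in set(keywords)}

# Single pass over the list; dict() gives a plain dictionary, as expected by generate_from_frequencies
freq_counter = dict(Counter(keywords))
```

Both dictionaries hold the same counts, so either one can be passed on to the word cloud step.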

## Produce Word Clouds

In [None]:
wordclouds = dict()
for query in query_main:
    fig, axs = plt.subplots(plt_cfg['xSub'], plt_cfg['ySub'])
    wordclouds[query] = []
    cnt = 0
    for freq in word_freq[query]:
        if len(freq) == 0:
            wordclouds[query].append([])
        else:
            wordclouds[query].append(WordCloud(width = 3000, height = 2000, random_state=1, background_color='black', 
                                               colormap='Set2').generate_from_frequencies(frequencies=freq))        
            sub = np.unravel_index(cnt, (plt_cfg['xSub'], plt_cfg['ySub']))
            ax = axs[sub[0], sub[1]]
            ax.imshow(wordclouds[query][cnt])
            ax.set_title(plt_cfg['title'][cnt], fontsize=20)
            ax.spines['right'].set_visible(0)
            ax.spines['top'].set_visible(0)
            ax.axis('off')
            del ax, sub
        cnt += 1
    fig.set_size_inches(plt_cfg['size'][0], plt_cfg['size'][1])
    plt.savefig(plt_cfg['path_save']+query+'_1.jpg', bbox_inches='tight')
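    # Save only this query's word clouds in its own folder, closing the file afterwards
    with open(save_path_prefix+query+'/wordclouds.p', 'wb') as f:
        pickle.dump(wordclouds[query], f)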
del fig, axs