# Reddit submissions and comments
This notebook shows how to read the Reddit dataset and how to compare the content of the files with the corresponding Reddit webpages. We will learn how to select the relevant information from the dataset and perform some simple analyses.

## How to read and handle a JSON file
We first need to open the JSON file containing all the Reddit submissions related to a certain stock; here we focus on FIZZ. We start by getting the data from the repository.

In [None]:
from sociophysicsDataHandler import SociophysicsDataHandler

student_config = True

file_target = 'asdz/platform2.2/20200428/ASDZ_Perron2.2_2020042815_trajectorie.parquet' 

if student_config:
    dh = SociophysicsDataHandler()
else:
    webdav_basepath = '/Crowdflow (Projectfolder)/ProRail_USE_LL_data'
    dh = SociophysicsDataHandler(basepath=webdav_basepath)

# the same fetch call is issued for both configurations
dh.fetch_prorail_data_from_path(file_target)

print('The available files are the following:')
dh.list_files("econophysics/reddit/")

In [None]:
import pandas as pd
stock = 'FIZZ'
filename = 'submissions_wallstreetbets_' + stock + '_start20200901_end20210706.json' # insert here your path 
dh.fetch_econophysics_data_from_path("econophysics/reddit/" + filename)
df = dh.df
# print one of the entries (here the row at position 170, i.e. the 171st row):
print(df.iloc[170])
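
To make the difference between positional access (`iloc`) and label access (`loc`) concrete, here is a minimal sketch on invented data. It assumes that the JSON file maps each submission id to a record of fields; this layout, the ids, and the values below are assumptions made for illustration and are not taken from the dataset.

```python
import json
import pandas as pd

# Hypothetical JSON content: each key is a submission id, each value a record.
# The ids, fields, and values are invented for illustration only.
json_text = '''
{
  "abc123": {"title": "First toy post", "score": 10, "num_comments": 3},
  "def456": {"title": "Second toy post", "score": 2, "num_comments": 0}
}
'''

records = json.loads(json_text)
# orient='index' turns the outer keys (submission ids) into the row index
toy_df = pd.DataFrame.from_dict(records, orient='index')

print(toy_df.loc['abc123'])  # access by label (submission id)
print(toy_df.iloc[1])        # access by position (second row)
```

With this layout, `loc` takes a submission id while `iloc` takes an integer position, which is how the two are used in this notebook.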

## Comparison with the corresponding Reddit webpage
Let's consider, for example, the submission identified by the code 'o65l9k'. We can understand what the fields contained in the JSON files mean by looking at the corresponding Reddit webpage.

In [None]:
subm = 'o65l9k'
# an expression in the middle of a cell is not displayed, so print it explicitly
print(df.loc[subm])
print('')
print('The web link of submission', subm, 'is:', df.loc[subm, 'full_link'])

In [None]:
# the comments associated with the submission are stored in a separate tar.gz archive:
dh.fetch_econophysics_data_from_path("econophysics/reddit/comments_" + stock + ".tar.gz")
dh.reddit_comments.get_comment_matching_id(subm)
df_comments = dh.reddit_comments.df[0]
df_comments.head()

#dh.reddit_comments.get_file_names() # to get the list files in the tar archive

In [None]:
num_comments = len(df_comments)
print('Number of comments:', num_comments)
print('For example, a comment is:')
df_comments.loc[1]

In [None]:
df_comments.loc[0]

In [None]:
# we can check the hierarchy of comments and replies by looking at the 'parent_id' field:
df_comments[['id','parent_id','body']]
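
The `parent_id` values can be turned into an explicit reply depth. The sketch below works on invented comments and assumes that `parent_id` follows Reddit's "fullname" convention, where the prefix `t3_` denotes a submission and `t1_` denotes a comment; whether every row of the dataset follows this convention should be checked on the real data. The helper `depth` is hypothetical and not part of any library.

```python
import pandas as pd

# Toy comments table (hypothetical ids and bodies, for illustration only).
toy_comments = pd.DataFrame({
    'id': ['c1', 'c2', 'c3'],
    'parent_id': ['t3_sub01', 't1_c1', 't1_c2'],
    'body': ['top-level comment', 'reply to c1', 'reply to c2'],
})

# 't3_' marks a submission as the parent, 't1_' marks another comment.
toy_comments['is_top_level'] = toy_comments['parent_id'].str.startswith('t3_')

# Map each comment id to its parent's fullname for quick lookup.
parents = dict(zip(toy_comments['id'], toy_comments['parent_id']))

def depth(comment_id, parents):
    """Number of comments between this comment and the submission.

    Raises KeyError if a parent comment is missing from the table.
    """
    d = 0
    while parents[comment_id].startswith('t1_'):
        # strip the 't1_' prefix to get the bare id of the parent comment
        comment_id = parents[comment_id].split('_', 1)[1]
        d += 1
    return d

toy_comments['depth'] = [depth(c, parents) for c in toy_comments['id']]
print(toy_comments[['id', 'parent_id', 'is_top_level', 'depth']])
```

A depth of 0 means a direct comment on the submission; larger values are replies nested deeper in the thread.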

## Select the relevant information
Not all the fields are equally useful for our analysis; here we select only some of them (this list is not exhaustive) in the case of the submissions file. A similar selection can be made for the comments files.

In [None]:
list_fields = ['author_fullname','created_utc','num_comments','permalink','score','title','upvote_ratio']
# take a copy so that later column assignments do not trigger a SettingWithCopyWarning
df = df[list_fields].copy()
print('The shape of the dataframe is now ', df.shape)
print(df.head())

## Convert the times into readable format
One of the crucial features of the Reddit database is the time at which the submissions and comments were made. In the JSON files, times are saved as integer values: Unix timestamps, i.e. seconds elapsed since 1970-01-01 00:00:00 UTC. Here we transform these values into datetime values; the times are reported in UTC (Coordinated Universal Time), not in local time.

In [None]:
df['created_utc'] = pd.to_datetime(df['created_utc'], origin='unix', unit='s') 
# created_utc is the time when the submission was created by its author
print(df[['author_fullname','created_utc']].head())
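
The conversion can be checked on a couple of known timestamps. The following is a minimal sketch on invented values; the `Europe/Amsterdam` time zone is only an example choice and is not implied by the dataset.

```python
import pandas as pd

# Two toy Unix timestamps in seconds (not taken from the dataset).
toy = pd.DataFrame({'created_utc': [0, 1600000000]})

# Naive datetimes: the values are in UTC but carry no time zone information.
naive = pd.to_datetime(toy['created_utc'], unit='s')

# Time-zone-aware datetimes, explicitly marked as UTC.
aware = pd.to_datetime(toy['created_utc'], unit='s', utc=True)

# Conversion to a local time zone (example choice) for display.
local = aware.dt.tz_convert('Europe/Amsterdam')

print(naive)
print(local)
```

Keeping the column in UTC is convenient for analysis; converting to a local time zone is mainly useful when comparing with what the Reddit webpage displays in a browser.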

## Make a simple analysis (1)
Here we begin the analysis of the submissions by checking the distribution of the number of comments associated with each post.

In [None]:
import matplotlib.pyplot as pl
num_bins = 40 # number of bins for the histogram
pl.hist(df['num_comments'].values, bins=num_bins)
pl.xlabel('Number of comments', fontsize=14)
pl.ylabel('Frequency', fontsize=14)
pl.title('Comments distribution, FIZZ', fontsize=14)
pl.show()

print('The maximum number of comments associated with a submission is', df['num_comments'].max())
# idxmax returns the index label of the first row that attains the maximum
id_max = df['num_comments'].idxmax()
print('The most commented post about FIZZ is identified by', id_max)
print(f'Its title is "{df.loc[id_max, "title"]}"')

## Make a simple analysis (2)
We can also sort our dataframe based on some field. For example, here we rank the submissions by their score and check the correlation between the number of comments and the submission score.

In [None]:
# sort in descending order, from the highest score to the lowest
df_sorted = df.sort_values(['score'], ascending=False) 
print(df_sorted.head())

# compute the correlation (Pearson by default) between two columns of the dataframe:
correlation = df['score'].corr(df['num_comments'])
print('The correlation between score and number of comments is', correlation)
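
The Pearson coefficient only measures linear association. As a complementary check, a rank-based (Spearman) coefficient can be obtained as the Pearson correlation of the ranks. The sketch below uses invented numbers in which the two columns are perfectly monotonic but not linear; it is an illustration, not a result from the FIZZ data.

```python
import pandas as pd

# Toy data: num_comments grows as the square of score (monotonic, nonlinear).
toy = pd.DataFrame({
    'score': [1, 2, 3, 4, 5],
    'num_comments': [1, 4, 9, 16, 25],
})

# Pearson correlation (the default of Series.corr).
pearson = toy['score'].corr(toy['num_comments'])

# Spearman correlation computed as the Pearson correlation of the ranks.
spearman = toy['score'].rank().corr(toy['num_comments'].rank())

print('Pearson: ', pearson)
print('Spearman:', spearman)
```

On skewed quantities such as scores and comment counts, a few very popular posts can dominate the Pearson value, so comparing the two coefficients is a quick sanity check.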