# Fields selection  

#### Notebook objectives:  
- **Review the descriptive stats for columns in 2017 dataset and select the fields relevant to analyses**
  - Note that the Reddit data for the years after 2017 contains more fields. For this project, we're only interested in the submission text and the basic information about it that is available throughout the datasets, so we select the relevant fields that have been available since 2017.  
- **Output a list of fields selected for analyses**  

#### Steps:  
1. [Load data](#Load-data)  
2. [Review fields and assign potential analysis roles](#Review-fields-and-assign-potential-analysis-roles)  
3. [Final fields selection](#Final-fields-selection)  

In [1]:
import pickle
import pandas as pd
import numpy as np
from IPython.display import display, HTML, Markdown, clear_output
import ipywidgets as widgets
import time 
import chime

import plotly.graph_objects as go

In [2]:
chime.theme('zelda')

In [3]:
DATA_PATH = 'data/'
OUTPUT_PATH = 'output/'

## Load data

Load the 2017 data.

In [4]:
with open(DATA_PATH + 'reddit_askdocs_submissions_2017.pkl', 'rb') as f:
    d_2017 = pickle.load(f)

In [5]:
len(d_2017)

62438

In [6]:
df_2017 = pd.DataFrame(d_2017)

In [10]:
del d_2017

In [11]:
df_2017.info()

<class 'pandas.core.frame.DataFrame'>
Index: 716072 entries, 0 to 15884
Columns: 106 entries, author to url_overridden_by_dest
dtypes: bool(6), float64(19), int64(4), object(77)
memory usage: 555.9+ MB


In [12]:
df_2017.head()

Unnamed: 0,author,author_flair_css_class,author_flair_text,brand_safe,can_mod_post,contest_mode,created_utc,domain,full_link,id,...,removed_by_category,updated_utc,steward_reports,og_description,og_title,removed_by,media_metadata,is_created_from_ads_ui,author_is_blocked,url_overridden_by_dest
0,[deleted],,,True,False,False,1514764452,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbw...,7nbwtn,...,,,,,,,,,,
1,XenonCSGO,default,This user has not yet been verified.,True,False,False,1514764122,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbv...,7nbvsv,...,,,,,,,,,,
2,[deleted],,,True,False,False,1514764055,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbv...,7nbvln,...,,,,,,,,,,
3,DavisTheMagicSheep,default,This user has not yet been verified.,True,False,False,1514763799,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbu...,7nburb,...,,,,,,,,,,
4,Dontgetscooped,default,This user has not yet been verified.,True,False,False,1514763188,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbs...,7nbsw2,...,,,,,,,,,,


## Review fields and assign potential analysis roles

Some of the fields in our 2017 dataset might not be useful for our analyses.  
Let's hand pick a starter set of fields that might be useful. We'll review and refine this list in the next steps.

In [None]:
# Hand-pick fields: show each column name and ask whether to keep it
selected_fields = []

for c in df_2017.columns:
    clear_output(wait=True)
    display(HTML(f'<h2>Field: {c}</h2>'))
    time.sleep(3)  # short pause before the prompt appears

    # Keep asking until the answer is 'y' or 'n'
    answer = None
    while answer not in ('y', 'n'):
        answer = input('Use this field? [y/n]').strip().lower()
    if answer == 'y':
        selected_fields.append(c)

clear_output(wait=True)

Here's the list of fields we've hand picked:

In [51]:
selected_fields

['author',
 'author_flair_text',
 'created_utc',
 'domain',
 'full_link',
 'id',
 'is_reddit_media_domain',
 'is_video',
 'locked',
 'num_comments',
 'num_crossposts',
 'over_18',
 'permalink',
 'pinned',
 'score',
 'selftext',
 'spoiler',
 'stickied',
 'title',
 'url',
 'banned_by',
 'edited',
 'crosspost_parent',
 'crosspost_parent_list',
 'distinguished',
 'author_created_utc',
 'author_fullname']

The fields dropped by hand in the step above were either uninformative (e.g. all nulls or all same value), data used for UI (e.g. css class, thumbnails, image size, etc.) or redundant (e.g. created timestamp in query local time).
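
The "uninformative" criterion above (all nulls or a single repeated value) can also be checked programmatically before reviewing fields by hand. The following is a minimal sketch, not the procedure used in this notebook: the helper name `find_uninformative_columns` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def find_uninformative_columns(df):
    """Return columns that are entirely null or hold a single distinct value."""
    flagged = {}
    for col in df.columns:
        # Cast to str so unhashable values (lists, dicts) can be counted too
        non_null = df[col].dropna().astype(str)
        if non_null.empty:
            flagged[col] = 'all null'
        elif non_null.nunique() == 1:
            flagged[col] = 'single value'
    return flagged


# Toy frame standing in for df_2017 (the values are made up)
toy = pd.DataFrame({
    'id': ['a1', 'b2', 'c3'],
    'brand_safe': [True, True, True],
    'og_title': [None, None, None],
})
flagged = find_uninformative_columns(toy)
print(flagged)
```

Note that a column holding one value plus some nulls is also flagged as `'single value'`, so the result is best treated as a shortlist to review rather than a list to drop automatically.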

In [7]:
# Save the list of hand-picked fields
fname = 'reddit_submissions_selected_fields_list.pkl'
with open(fname, 'wb') as f:
    pickle.dump(selected_fields, f)

In [8]:
with open(fname, 'rb') as f:
    selected_fields = pickle.load(f)

In [123]:
# selected_fields

Having explored individual fields and entries a bit, let's run through the starter set of selected fields again. For each field we'll look at sample values, summary stats and value counts to review and refine the selection, and record some helpful notes: what the field is for on Reddit, how it will likely be used in analysis, the type of data, and any other remarks.

In [84]:
field_roles_dict = {
    'reddit': {
        'a': 'author info',
        'p': 'post details',
        'aa': 'author actions on the post (other than commenting)',
        'ga': 'general subreddit users reactions to the post',
        'ma': 'mod reactions to either post content or comments activity on it'
    },
    'analysis': {
        'i': 'id',
        'r': 'reference',
        'a': 'analysis',
        'f': 'filtering',
        't': 'transform',
        'd': 'drop'
    },
    'type': {
        't': 'timestamp',
        's': 'short text',
        'l': 'long text',
        'u': 'url',
        'c': 'categorical',
        'b': 'binary flag',
        'n': 'numeric'
    }
}
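
Because the short codes in this dictionary are later used as lookup keys (e.g. `field_roles_dict['type'][x]`), a mistyped answer would raise a `KeyError` when the long labels are added. One way to guard against that is to re-prompt until a valid code is entered. This is a hedged sketch: `ask_for_code` and its `ask` parameter are hypothetical names, and the scripted replies only simulate keyboard input.

```python
def ask_for_code(prompt, options, ask=input):
    """Keep prompting until the reply is one of the keys in `options`."""
    options_str = '\n'.join(f'{code} = {label}' for code, label in options.items())
    while True:
        reply = ask(f'{prompt}\n\n{options_str}\n\n').strip()
        if reply in options:
            return reply
        print(f'"{reply}" is not a valid code, please try again.')


# Demo with scripted replies instead of live keyboard input
demo_options = {'t': 'timestamp', 's': 'short text', 'n': 'numeric'}
scripted = iter(['x', ' n '])
chosen = ask_for_code('Assign field type:', demo_options, ask=lambda _: next(scripted))
print(chosen)
```

In the notebook, a call such as `ask_for_code('Assign field type:', field_roles_dict['type'])` could stand in for a bare `input(...)` call; the default `ask=input` keeps it interactive.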

In [110]:
# Display sample stats, univariate stats and value counts for each field and collect user input
# on whether to select the field as potentially useful for analysis.

selected_fields_dict = {
    'field_name': [],
    'type': [],
    'reddit_role': [],
    'analysis_role': [],
    'notes': []
}

for c in selected_fields:
    selected_fields_dict['field_name'].append(c)
    
    clear_output(wait=True)
    
    display(HTML(f'<h2>Field: {c}</h2>'))
    display(HTML(f'<h3>Sample values:</h3>'))
    display(df_2017[~df_2017[c].isna()][c].head(10))
    display(HTML(f'<hr>'))
    
    display(HTML(f'<h3>Stats:</h3>'))
    display(df_2017[c].describe())
    display(HTML(f'<hr>'))
    
    display(HTML(f'<h3>Value counts:</h3>'))
    vc1 = df_2017[c].value_counts(dropna=False)
    vc2 = df_2017[c].value_counts(dropna=False, normalize=True)
    display(pd.concat([vc1, vc2], axis=1))
    display(HTML(f'<hr>'))
    
    time.sleep(5)
    
    # Ask for the reddit role, the analysis role and the field type in turn.
    # Each answer should be one of the short codes defined in field_roles_dict.
    prompts = [
        ('reddit', 'reddit_role', 'Assign reddit role:'),
        ('analysis', 'analysis_role', 'Assign analysis role:'),
        ('type', 'type', 'Assign field type:'),
    ]
    for role_type, dict_key, prompt in prompts:
        options_str = '\n'.join(
            f"{code} = {label}" for code, label in field_roles_dict[role_type].items()
        )
        answer = input(f'\n{prompt}\n\n{options_str}\n\n')
        selected_fields_dict[dict_key].append(answer)

    # Ask for any notes
    notes = input('Any notes on the field?:')
    selected_fields_dict['notes'].append(notes)

    # Wait for Enter before moving on to the next field
    input('Continue? [hit enter]')

clear_output(wait=True)

print('=== DONE! ====')

=== DONE! ====
Selected fields:
['author', 'author_flair_text', 'created_utc', 'domain', 'full_link', 'id', 'is_reddit_media_domain', 'is_video', 'locked', 'num_comments', 'num_crossposts', 'over_18', 'permalink', 'pinned', 'score', 'selftext', 'spoiler', 'stickied', 'title', 'url', 'banned_by', 'edited', 'crosspost_parent', 'crosspost_parent_list', 'distinguished', 'author_created_utc', 'author_fullname']


In [120]:
selected_fields_df = pd.DataFrame(selected_fields_dict)

In [121]:
selected_fields_df['type_long'] = selected_fields_df['type'].apply(lambda x: field_roles_dict['type'][x])
selected_fields_df['reddit_role_long'] = selected_fields_df['reddit_role']\
    .apply(lambda x: field_roles_dict['reddit'][x])
selected_fields_df['analysis_role_long'] = selected_fields_df['analysis_role']\
    .apply(lambda x: field_roles_dict['analysis'][x])
selected_fields_df

Unnamed: 0,field_name,type,reddit_role,analysis_role,notes,type_long,reddit_role_long,analysis_role_long
0,author,s,a,i,Note the [deleted] and [removed] entries.,short text,author info,id
1,author_flair_text,c,a,a,31% NaN values.,categorical,author info,analysis
2,created_utc,t,p,a,Dups present.,timestamp,post details,analysis
3,domain,s,p,a,Domain where the post originated from.,short text,post details,analysis
4,full_link,u,p,r,A link to the post on Reddit.,url,post details,reference
5,id,s,p,i,Post id,short text,post details,id
6,is_reddit_media_domain,b,p,d,All values are either empty or false.,binary flag,post details,drop
7,is_video,b,p,d,All values are either empty or false.,binary flag,post details,drop
8,locked,b,ma,a,"Only 6 values are true, everything else false.",binary flag,mod reactions to either post content or commen...,analysis
9,num_comments,n,ga,a,,numeric,general subreddit users reactions to the post,analysis


In [136]:
# Drop the fields marked for it
selected_fields_df = selected_fields_df[selected_fields_df['analysis_role'] != 'd']
selected_fields_df

Unnamed: 0,field_name,type,reddit_role,analysis_role,notes,type_long,reddit_role_long,analysis_role_long
0,author,s,a,i,Note the [deleted] and [removed] entries.,short text,author info,id
1,author_flair_text,c,a,a,31% NaN values.,categorical,author info,analysis
2,created_utc,t,p,a,Dups present.,timestamp,post details,analysis
3,domain,s,p,a,Domain where the post originated from.,short text,post details,analysis
4,full_link,u,p,r,A link to the post on Reddit.,url,post details,reference
5,id,s,p,i,Post id,short text,post details,id
8,locked,b,ma,a,"Only 6 values are true, everything else false.",binary flag,mod reactions to either post content or commen...,analysis
9,num_comments,n,ga,a,,numeric,general subreddit users reactions to the post,analysis
10,num_crossposts,n,aa,a,Both NaN and zeros present. Few values >0.,numeric,author actions on the post (other than comment...,analysis
11,over_18,b,p,a,98% false. Looks like a NSFW-type label on the...,binary flag,post details,analysis


In [None]:
# Generate an easier to read markdown version of the selected fields table

# print(
# selected_fields_df[['field_name', 'reddit_role_long', 'analysis_role_long', 'type_long', 'notes']]
# .to_markdown()
# )

|    | field_name            | reddit_role_long                                                | analysis_role_long   | type_long   | notes                                                                                                                                                                                                                   |
|---:|:----------------------|:----------------------------------------------------------------|:---------------------|:------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  0 | author                | author info                                                     | id                   | short text  | Note the [deleted] and [removed] entries.                                                                                                                                                                               |
|  1 | author_flair_text     | author info                                                     | analysis             | categorical | 31% NaN values.                                                                                                                                                                                                         |
|  2 | created_utc           | post details                                                    | analysis             | timestamp   | Dups present.                                                                                                                                                                                                           |
|  3 | domain                | post details                                                    | analysis             | short text  | Domain where the post originated from.                                                                                                                                                                                  |
|  4 | full_link             | post details                                                    | reference            | url         | A link to the post on Reddit.                                                                                                                                                                                           |
|  5 | id                    | post details                                                    | id                   | short text  | Post id                                                                                                                                                                                                                 |
|  8 | locked                | mod reactions to either post content or comments activity on it | analysis             | binary flag | Only 6 values are true, everything else false.                                                                                                                                                                          |
|  9 | num_comments          | general subreddit users reactions to the post                   | analysis             | numeric     |                                                                                                                                                                                                                         |
| 10 | num_crossposts        | author actions on the post (other than commenting)              | analysis             | numeric     | Both NaN and zeros present. Few values >0.                                                                                                                                                                              |
| 11 | over_18               | post details                                                    | analysis             | binary flag | 98% false. Looks like a NSFW-type label on the post content.                                                                                                                                                            |
| 13 | pinned                | author actions on the post (other than commenting)              | analysis             | binary flag | Users can pin up to 4 posts to their profile.                                                                                                                                                                           |
| 14 | score                 | general subreddit users reactions to the post                   | analysis             | numeric     | The score is based on up and down votes.                                                                                                                                                                                |
| 15 | selftext              | post details                                                    | analysis             | long text   | Can have [deleted] as values.                                                                                                                                                                                           |
| 16 | spoiler               | author actions on the post (other than commenting)              | analysis             | binary flag | Spoiler tags are used to mark spoiler content, and they can blur the preview or thumbnails. Both mods and post authors can add a spoiler tag on a post. There were 30 true values in the sample, so decided to keep it. |
| 17 | stickied              | author actions on the post (other than commenting)              | filtering            | binary flag | Mods can pin up to 2 of their own posts to the top of the subreddit. This tag used to be called announcements.                                                                                                          |
| 18 | title                 | post details                                                    | analysis             | long text   | Title of the post, can be very long.                                                                                                                                                                                    |
| 19 | url                   | post details                                                    | reference            | url         | Url to the original post if crossposted or from other source.                                                                                                                                                           |
| 20 | banned_by             | mod reactions to either post content or comments activity on it | filtering            | categorical | The only values are NaN and moderators.                                                                                                                                                                                 |
| 21 | edited                | author actions on the post (other than commenting)              | analysis             | timestamp   | 86% NaNs.                                                                                                                                                                                                               |
| 22 | crosspost_parent      | post details                                                    | transform            | short text  | Cross-post parent post id.                                                                                                                                                                                              |
| 23 | crosspost_parent_list | post details                                                    | transform            | long text   | This ultimately contains the body text of a crossposted post. Just have to pull it out of the list of dicts.                                                                                                            |
| 24 | distinguished         | author actions on the post (other than commenting)              | filtering            | categorical | Mods can tag posts as distinguished, usually used for subreddit management. Use this field for filtering out these posts.                                                                                               |
| 26 | author_fullname       | author info                                                     | id                   | short text  | Unclear what this is, and lots of NaNs, but decided to keep for now.                                                                                                                                                    |
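
The note on `crosspost_parent_list` mentions pulling the body text out of a list of dicts. A minimal sketch of that transform is shown below. It assumes that each dict carries the parent post's body under a `selftext` key and that non-crossposts hold NaN in this column; neither assumption is verified in this notebook. The helper name `crosspost_selftext` and the toy rows are made up.

```python
import pandas as pd


def crosspost_selftext(value):
    """Return the parent post's selftext from a crosspost_parent_list value, else None."""
    # Anything that is not a non-empty list (e.g. NaN) is treated as "no crosspost"
    if not isinstance(value, list) or not value:
        return None
    first = value[0]
    if isinstance(first, dict):
        # Assumed key name; .get() returns None if it is absent
        return first.get('selftext')
    return None


# Toy rows standing in for df_2017 (values are made up)
toy = pd.DataFrame({
    'id': ['a1', 'b2'],
    'crosspost_parent_list': [
        float('nan'),
        [{'id': 'zz9', 'selftext': 'Original body text'}],
    ],
})
toy['crosspost_selftext'] = toy['crosspost_parent_list'].apply(crosspost_selftext)
print(toy[['id', 'crosspost_selftext']])
```

Only the first dict in each list is read here, which is a simplification in case a list holds more than one parent.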

## Final fields selection

In [142]:
fname = 'reddit_submissions_selected_fields.csv'
selected_fields_df.to_csv(fname, index=False)
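
The saved CSV can then be used to subset other submissions data to the same fields. The sketch below is illustrative only: a small in-memory CSV stands in for `reddit_submissions_selected_fields.csv`, and `df_other` is a hypothetical frame with made-up columns, not one of the project's datasets.

```python
import io

import pandas as pd

# In the notebook this would be pd.read_csv('reddit_submissions_selected_fields.csv');
# a small in-memory CSV stands in for it here.
csv_text = "field_name,analysis_role\nauthor,i\nid,i\nselftext,a\nedited,a\n"
fields_df = pd.read_csv(io.StringIO(csv_text))

# Toy frame standing in for another set of submissions (hypothetical columns)
df_other = pd.DataFrame({
    'author': ['u1'],
    'id': ['x1'],
    'selftext': ['text'],
    'thumbnail': ['self'],
})

wanted = fields_df['field_name'].tolist()
# Keep only the selected fields that actually exist in this frame
present = [c for c in wanted if c in df_other.columns]
missing = [c for c in wanted if c not in df_other.columns]
df_subset = df_other[present]

print('kept:', present)
print('not found:', missing)
```

Checking for missing columns first avoids the `KeyError` that `df_other[wanted]` would raise if a selected field were absent from a given dataset.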