# Submissions data fields selection  

We'll explore the raw data structure of r/AskDocs submissions, using a small sample of the submissions posted in 2017, and select the fields relevant to this analysis.

In [131]:
import pickle
import pandas as pd
from IPython.display import display, HTML, Markdown, clear_output
import ipywidgets as widgets
import time 

In [2]:
with open("reddit_askdocs_submissions_2017.pkl", "rb") as f:
    d_2017 = pickle.load(f)

In [3]:
len(d_2017)

62438

In [4]:
df_2017 = pd.DataFrame(d_2017)

In [5]:
df_2017.head()

Unnamed: 0,author,author_flair_css_class,author_flair_text,brand_safe,can_mod_post,contest_mode,created_utc,domain,full_link,id,...,approved_at_utc,banned_at_utc,view_count,gilded,media_embed,secure_media_embed,author_created_utc,author_fullname,media,secure_media
0,[deleted],,,True,False,False,1514764452,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbw...,7nbwtn,...,,,,,,,,,,
1,XenonCSGO,default,This user has not yet been verified.,True,False,False,1514764122,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbv...,7nbvsv,...,,,,,,,,,,
2,[deleted],,,True,False,False,1514764055,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbv...,7nbvln,...,,,,,,,,,,
3,DavisTheMagicSheep,default,This user has not yet been verified.,True,False,False,1514763799,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbu...,7nburb,...,,,,,,,,,,
4,Dontgetscooped,default,This user has not yet been verified.,True,False,False,1514763188,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbs...,7nbsw2,...,,,,,,,,,,


In [6]:
df_2017.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62438 entries, 0 to 62437
Data columns (total 56 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   author                   62438 non-null  object 
 1   author_flair_css_class   43394 non-null  object 
 2   author_flair_text        43394 non-null  object 
 3   brand_safe               62438 non-null  bool   
 4   can_mod_post             35267 non-null  object 
 5   contest_mode             62438 non-null  bool   
 6   created_utc              62438 non-null  int64  
 7   domain                   62438 non-null  object 
 8   full_link                62438 non-null  object 
 9   id                       62438 non-null  object 
 10  is_crosspostable         20127 non-null  object 
 11  is_reddit_media_domain   14224 non-null  object 
 12  is_self                  62438 non-null  bool   
 13  is_video                 40593 non-null  object 
 14  locked                

There are 56 columns in total, but many of them are unlikely to be useful for this analysis. Let's hand-pick a starter set of fields that look useful, then review and refine the list in the next steps.

In [None]:
# Display univariate stats and value counts for each field and collect user input
# on whether to select the field as potentially useful for analysis.

selected_fields = []

for c in df_2017.columns:
    clear_output(wait=True)
    
    display(HTML(f'<h2>Field: {c}</h2>'))
    display(df_2017[c].describe())
    display(HTML(f'<hr>'))
    
    vc1 = df_2017[c].value_counts(dropna=False)
    vc2 = df_2017[c].value_counts(dropna=False, normalize=True)
    display(pd.concat([vc1, vc2], axis=1))
    display(HTML(f'<hr>'))
    
    time.sleep(5)
    
    # Keep asking until the answer is exactly 'y' or 'n'
    forward = None
    while forward not in ['y', 'n']:
        forward = input('Use this field? [y/n]')
        if forward == 'y':
            selected_fields.append(c)

clear_output(wait=True)

print('=== DONE! ====')
print('Selected fields:')
print(selected_fields)

Here's the list of fields we've hand picked:

In [51]:
selected_fields

['author',
 'author_flair_text',
 'created_utc',
 'domain',
 'full_link',
 'id',
 'is_reddit_media_domain',
 'is_video',
 'locked',
 'num_comments',
 'num_crossposts',
 'over_18',
 'permalink',
 'pinned',
 'score',
 'selftext',
 'spoiler',
 'stickied',
 'title',
 'url',
 'banned_by',
 'edited',
 'crosspost_parent',
 'crosspost_parent_list',
 'distinguished',
 'author_created_utc',
 'author_fullname']


The fields dropped were either uninformative (e.g. all nulls or all the same value), used only for the UI (e.g. CSS class, thumbnails, image size), or redundant (e.g. the created timestamp in query-local time).
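
The "uninformative" criterion can also be checked programmatically before the manual review. The following is a minimal sketch, not the procedure used in this notebook: the helper name `uninformative_columns` and the toy DataFrame are hypothetical and only illustrate the idea of flagging columns that are entirely null or hold a single distinct non-null value.

```python
import pandas as pd

def uninformative_columns(df):
    """Return columns that are entirely null or hold one distinct non-null value."""
    flagged = []
    for col in df.columns:
        non_null = df[col].dropna()
        # Cast to str so unhashable values (lists, dicts) can be counted too.
        if non_null.astype(str).nunique() <= 1:
            flagged.append(col)
    return flagged

# Hypothetical toy data, shaped loosely like the submissions sample.
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "view_count": [None, None, None],   # all nulls
    "brand_safe": [True, True, True],   # single value
    "score": [1, 5, 1],
})
print(uninformative_columns(toy))
```

A flag from a check like this is only a hint: a column that is constant in the 2017 sample may still vary in other years, so the manual review remains the deciding step.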

In [7]:
fname = 'reddit_submissions_selected_fields_list.pkl'

# Save the list of hand-picked fields

# with open(fname, 'wb') as f:
#     pickle.dump(selected_fields, f)

In [8]:
with open(fname, 'rb') as f:
    selected_fields = pickle.load(f)

In [123]:
# selected_fields

In [26]:
# sample entry

df_2017[selected_fields + ['is_self']].loc[46]#['crosspost_parent_list'][0]['selftext']

# to drop: permalink, is_reddit_media_domain, is_video

author                                                   PleaseCanYouTellMe
author_flair_text                      This user has not yet been verified.
created_utc                                                      1514742763
domain                                                         self.medical
full_link                 https://www.reddit.com/r/AskDocs/comments/7n9v...
id                                                                   7n9v01
is_reddit_media_domain                                                False
is_video                                                              False
locked                                                                False
num_comments                                                              0
num_crossposts                                                          0.0
over_18                                                               False
permalink                 /r/AskDocs/comments/7n9v01/chronic_sore_throat...
pinned      

In [70]:
df_2017['spoiler'].value_counts(dropna=False)

False    62408
True        30
Name: spoiler, dtype: int64

In [108]:
c = 'author'
display(df_2017[~df_2017[c].isna()][c].head(10))

0             [deleted]
1             XenonCSGO
2             [deleted]
3    DavisTheMagicSheep
4        Dontgetscooped
5             Examiner7
6            AveryFenix
7          JohanWentMad
8          HelloImLucas
9                LeftAl
Name: author, dtype: object

Having explored individual fields and entries a bit, let's run through the starter set of selected fields again, looking at sample values, stats and value counts, to review and refine the selection. Along the way we'll annotate each field with its role on reddit, its likely role in the analysis, its data type and any additional notes.

In [84]:
field_roles_dict = {
    'reddit': {
        'a': 'author info',
        'p': 'post details',
        'aa': 'author actions on the post (other than commenting)',
        'ga': 'general subreddit users reactions to the post',
        'ma': 'mod reactions to either post content or comments activity on it'
    },
    'analysis': {
        'i': 'id',
        'r': 'reference',
        'a': 'analysis',
        'f': 'filtering',
        't': 'transform',
        'd': 'drop'
    },
    'type': {
        't': 'timestamp',
        's': 'short text',
        'l': 'long text',
        'u': 'url',
        'c': 'categorical',
        'b': 'binary flag',
        'n': 'numeric'
    }
}

In [110]:
# Display sample values, univariate stats and value counts for each selected field and collect user input
# on its reddit role, analysis role, data type and any notes.

selected_fields_dict = {
    'field_name': [],
    'type': [],
    'reddit_role': [],
    'analysis_role': [],
    'notes': []
}

for c in selected_fields:
    selected_fields_dict['field_name'].append(c)
    
    clear_output(wait=True)
    
    display(HTML(f'<h2>Field: {c}</h2>'))
    display(HTML(f'<h3>Sample values:</h3>'))
    display(df_2017[~df_2017[c].isna()][c].head(10))
    display(HTML(f'<hr>'))
    
    display(HTML(f'<h3>Stats:</h3>'))
    display(df_2017[c].describe())
    display(HTML(f'<hr>'))
    
    display(HTML(f'<h3>Value counts:</h3>'))
    vc1 = df_2017[c].value_counts(dropna=False)
    vc2 = df_2017[c].value_counts(dropna=False, normalize=True)
    display(pd.concat([vc1, vc2], axis=1))
    display(HTML(f'<hr>'))
    
    time.sleep(5)
    
    forward = False
    while forward == False:
        # ask reddit role
        role_type = 'reddit'
        q1_options_str = '\n'.join(
            [f"{x} = {field_roles_dict[role_type][x]}" for x in field_roles_dict[role_type]]
        )

        q1r = input(f'''
Assign {role_type} role:

{q1_options_str}

        ''')
        selected_fields_dict[f'{role_type}_role'].append(q1r)
        
        # ask analysis role
        role_type = 'analysis'
        q2_options_str = '\n'.join(
            [f"{x} = {field_roles_dict[role_type][x]}" for x in field_roles_dict[role_type]]
        )

        q2r = input(f'''
Assign {role_type} role:

{q2_options_str}

        ''')
        selected_fields_dict[f'{role_type}_role'].append(q2r)
        

        # ask field type
        role_type = 'type'
        q3_options_str = '\n'.join(
            [f"{x} = {field_roles_dict[role_type][x]}" for x in field_roles_dict[role_type]]
        )

        q3r = input(f'''
Assign field {role_type}:

{q3_options_str}

        ''')
        selected_fields_dict[f'{role_type}'].append(q3r)
        
        
        # ask for any notes
        q4r = input('Any notes on the field?:')
        selected_fields_dict['notes'].append(q4r)
        
        
        forward = input('Continue? [hit enter]')

clear_output(wait=True)

print('=== DONE! ====')

=== DONE! ====
Selected fields:
['author', 'author_flair_text', 'created_utc', 'domain', 'full_link', 'id', 'is_reddit_media_domain', 'is_video', 'locked', 'num_comments', 'num_crossposts', 'over_18', 'permalink', 'pinned', 'score', 'selftext', 'spoiler', 'stickied', 'title', 'url', 'banned_by', 'edited', 'crosspost_parent', 'crosspost_parent_list', 'distinguished', 'author_created_utc', 'author_fullname']


In [120]:
selected_fields_df = pd.DataFrame(selected_fields_dict)

In [121]:
selected_fields_df['type_long'] = selected_fields_df['type'].apply(lambda x: field_roles_dict['type'][x])
selected_fields_df['reddit_role_long'] = selected_fields_df['reddit_role']\
    .apply(lambda x: field_roles_dict['reddit'][x])
selected_fields_df['analysis_role_long'] = selected_fields_df['analysis_role']\
    .apply(lambda x: field_roles_dict['analysis'][x])
selected_fields_df

Unnamed: 0,field_name,type,reddit_role,analysis_role,notes,type_long,reddit_role_long,analysis_role_long
0,author,s,a,i,Note the [deleted] and [removed] entries.,short text,author info,id
1,author_flair_text,c,a,a,31% NaN values.,categorical,author info,analysis
2,created_utc,t,p,a,Dups present.,timestamp,post details,analysis
3,domain,s,p,a,Domain where the post originated from.,short text,post details,analysis
4,full_link,u,p,r,A link to the post on Reddit.,url,post details,reference
5,id,s,p,i,Post id,short text,post details,id
6,is_reddit_media_domain,b,p,d,All values are either empty or false.,binary flag,post details,drop
7,is_video,b,p,d,All values are either empty or false.,binary flag,post details,drop
8,locked,b,ma,a,"Only 6 values are true, everything else false.",binary flag,mod reactions to either post content or commen...,analysis
9,num_comments,n,ga,a,,numeric,general subreddit users reactions to the post,analysis
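
The `.apply` lookups above raise a `KeyError` if a code typed at one of the prompts is not a key in `field_roles_dict`, because the prompts accept any string. A minimal sketch of a validating prompt follows; the helper `ask_code` and its `read` parameter are hypothetical additions for illustration (the `read` argument only exists so the example can run with scripted replies instead of keyboard input).

```python
def ask_code(prompt, options, read=input):
    """Prompt until the reply is one of the keys in `options`, then return it."""
    menu = '\n'.join(f'{code} = {label}' for code, label in options.items())
    while True:
        reply = read(f'{prompt}\n\n{menu}\n\n').strip()
        if reply in options:
            return reply
        print(f'Unrecognised code {reply!r}; expected one of {list(options)}')

# Scripted replies stand in for keyboard input: a typo first, then a valid code.
scripted = iter(['x', 't'])
type_options = {'t': 'timestamp', 's': 'short text', 'n': 'numeric'}
chosen = ask_code('Assign field type:', type_options, read=lambda _: next(scripted))
print(chosen)
```

In the annotation loop, a call such as `ask_code('Assign field type:', field_roles_dict['type'])` could stand in for the corresponding bare `input(...)` call.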


In [136]:
# Drop the fields marked for it
selected_fields_df = selected_fields_df[selected_fields_df['analysis_role'] != 'd']
selected_fields_df

Unnamed: 0,field_name,type,reddit_role,analysis_role,notes,type_long,reddit_role_long,analysis_role_long
0,author,s,a,i,Note the [deleted] and [removed] entries.,short text,author info,id
1,author_flair_text,c,a,a,31% NaN values.,categorical,author info,analysis
2,created_utc,t,p,a,Dups present.,timestamp,post details,analysis
3,domain,s,p,a,Domain where the post originated from.,short text,post details,analysis
4,full_link,u,p,r,A link to the post on Reddit.,url,post details,reference
5,id,s,p,i,Post id,short text,post details,id
8,locked,b,ma,a,"Only 6 values are true, everything else false.",binary flag,mod reactions to either post content or commen...,analysis
9,num_comments,n,ga,a,,numeric,general subreddit users reactions to the post,analysis
10,num_crossposts,n,aa,a,Both NaN and zeros present. Few values >0.,numeric,author actions on the post (other than comment...,analysis
11,over_18,b,p,a,98% false. Looks like a NSFW-type label on the...,binary flag,post details,analysis


In [141]:
# Generate a markdown version of the selected fields table

# print(
# selected_fields_df[['field_name', 'reddit_role_long', 'analysis_role_long', 'type_long', 'notes']]
# .to_markdown()
# )

|    | field_name            | reddit_role_long                                                | analysis_role_long   | type_long   | notes                                                                                                                                                                                                                   |
|---:|:----------------------|:----------------------------------------------------------------|:---------------------|:------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|  0 | author                | author info                                                     | id                   | short text  | Note the [deleted] and [removed] entries.                                                                                                                                                                               |
|  1 | author_flair_text     | author info                                                     | analysis             | categorical | 31% NaN values.                                                                                                                                                                                                         |
|  2 | created_utc           | post details                                                    | analysis             | timestamp   | Dups present.                                                                                                                                                                                                           |
|  3 | domain                | post details                                                    | analysis             | short text  | Domain where the post originated from.                                                                                                                                                                                  |
|  4 | full_link             | post details                                                    | reference            | url         | A link to the post on Reddit.                                                                                                                                                                                           |
|  5 | id                    | post details                                                    | id                   | short text  | Post id                                                                                                                                                                                                                 |
|  8 | locked                | mod reactions to either post content or comments activity on it | analysis             | binary flag | Only 6 values are true, everything else false.                                                                                                                                                                          |
|  9 | num_comments          | general subreddit users reactions to the post                   | analysis             | numeric     |                                                                                                                                                                                                                         |
| 10 | num_crossposts        | author actions on the post (other than commenting)              | analysis             | numeric     | Both NaN and zeros present. Few values >0.                                                                                                                                                                              |
| 11 | over_18               | post details                                                    | analysis             | binary flag | 98% false. Looks like a NSFW-type label on the post content.                                                                                                                                                            |
| 13 | pinned                | author actions on the post (other than commenting)              | analysis             | binary flag | Users can pin up to 4 posts to their profile.                                                                                                                                                                           |
| 14 | score                 | general subreddit users reactions to the post                   | analysis             | numeric     | The score is based on up and down votes.                                                                                                                                                                                |
| 15 | selftext              | post details                                                    | analysis             | long text   | Can have [deleted] as values.                                                                                                                                                                                           |
| 16 | spoiler               | author actions on the post (other than commenting)              | analysis             | binary flag | Spoiler tags are used to mark spoiler content, and they can blur the preview or thumbnails. Both mods and post authors can add a spoiler tag on a post. There were 30 true values in the sample, so decided to keep it. |
| 17 | stickied              | author actions on the post (other than commenting)              | filtering            | binary flag | Mods can pin up to 2 of their own posts to the top of the subreddit. This tag used to be called announcements.                                                                                                          |
| 18 | title                 | post details                                                    | analysis             | long text   | Title of the post, can be very long.                                                                                                                                                                                    |
| 19 | url                   | post details                                                    | reference            | url         | Url to the original post if crossposted or from other source.                                                                                                                                                           |
| 20 | banned_by             | mod reactions to either post content or comments activity on it | filtering            | categorical | The only values are NaN and moderators.                                                                                                                                                                                 |
| 21 | edited                | author actions on the post (other than commenting)              | analysis             | timestamp   | 86% NaNs.                                                                                                                                                                                                               |
| 22 | crosspost_parent      | post details                                                    | transform            | short text  | Cross-post parent post id.                                                                                                                                                                                              |
| 23 | crosspost_parent_list | post details                                                    | transform            | long text   | This ultimately contains the body text of a crossposted post. Just have to pull it out of the list of dicts.                                                                                                            |
| 24 | distinguished         | author actions on the post (other than commenting)              | filtering            | categorical | Mods can tag posts as distinguished, usually used for subreddit management. Use this field for filtering out these posts.                                                                                               |
| 26 | author_fullname       | author info                                                     | id                   | short text  | Unclear what this is, and lots of NaNs, but decided to keep for now.                                                                                                                                                    |

In [142]:
fname = 'reddit_submissions_selected_fields.csv'
selected_fields_df.to_csv(fname, index=False)

#### Exploring timestamp fields format
The timestamp fields are stored as Unix epoch time, in seconds:

In [144]:
pd.to_datetime(df_2017['created_utc'], unit='s').head()

0   2017-12-31 23:54:12
1   2017-12-31 23:48:42
2   2017-12-31 23:47:35
3   2017-12-31 23:43:19
4   2017-12-31 23:33:08
Name: created_utc, dtype: datetime64[ns]

In [146]:
pd.to_datetime(df_2017['created_utc'], unit='s').describe(datetime_is_numeric=True)

count                            62438
mean     2017-06-26 12:58:28.898411264
min                2017-01-01 00:05:06
25%                2017-03-28 21:41:05
50%         2017-06-23 01:45:26.500000
75%         2017-09-23 12:12:30.500000
max                2017-12-31 23:54:12
Name: created_utc, dtype: object

In [78]:
pd.to_datetime(df_2017[~df_2017['edited'].isna()]['edited'], unit='s').head()

24   2017-12-31 21:08:36
25   2017-12-31 21:11:46
38   2017-12-31 19:02:42
48   2017-12-31 17:44:09
52   2018-01-01 04:59:59
Name: edited, dtype: datetime64[ns]
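
For a single value, the same conversion can be done with the standard library. This is a small sketch; the helper name `epoch_to_utc` is hypothetical, and the input is the `created_utc` value of the first row in the sample. Unlike the naive timestamps shown above, the result carries an explicit UTC offset.

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

created = epoch_to_utc(1514764452)  # created_utc of row 0 in the sample
print(created.isoformat())          # 2017-12-31T23:54:12+00:00
```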

#### Exploring how post body text is stored for self-posts vs link-posts

In [27]:
df_2017.groupby(['is_self'])['crosspost_parent'].describe()

Unnamed: 0_level_0,count,unique,top,freq
is_self,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
False,20,20,t3_7n9uyb,1.0
True,0,0,,


In [28]:
df_2017['is_self'].value_counts(dropna=False)

True     62417
False       21
Name: is_self, dtype: int64

In [39]:
df_2017['crosspost_parent'].isna().value_counts(dropna=False)

True     62418
False       20
Name: crosspost_parent, dtype: int64

After poking around the data, it looks like a post is a crosspost whenever `crosspost_parent` is not NaN. In that case the `selftext` field does not contain the post body (it is empty, or '[deleted]' if the post was deleted), but the body can be recovered from the `crosspost_parent_list` field.

In [148]:
df_2017[df_2017['crosspost_parent'].notna()]['selftext'].value_counts(dropna=False)

             17
[deleted]     3
Name: selftext, dtype: int64
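
A rough sketch of how the body text could be resolved for both kinds of post is shown below. It assumes that each element of `crosspost_parent_list` is a dict with a `selftext` key, as the commented-out expression in the sample-entry cell suggests. The helper names `is_missing` and `post_body` and the example records are hypothetical.

```python
import math

def is_missing(value):
    """True for None or float NaN (how pandas marks missing values)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def post_body(record):
    """Return the parent's selftext for crossposts, else the post's own selftext."""
    parents = record.get('crosspost_parent_list')
    is_crosspost = not is_missing(record.get('crosspost_parent'))
    if is_crosspost and isinstance(parents, list) and parents:
        # Assumption: the first list element describes the original post.
        return parents[0].get('selftext')
    return record.get('selftext')

# Hypothetical records with only the fields the helper reads.
self_post = {'selftext': 'My question...', 'crosspost_parent': float('nan'),
             'crosspost_parent_list': float('nan')}
crosspost = {'selftext': '', 'crosspost_parent': 't3_example',
             'crosspost_parent_list': [{'selftext': 'Original body'}]}
print(post_body(self_post))
print(post_body(crosspost))
```

Applied row-wise, e.g. `df_2017.apply(lambda row: post_body(row.to_dict()), axis=1)`, this would yield one body column for both post types, although that usage has not been run against the full sample here.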

Link posts can be crossposts or other links:

In [40]:
df_2017[df_2017['crosspost_parent'].isna()]['is_self'].value_counts(dropna=False)

True     62417
False        1
Name: is_self, dtype: int64

In [149]:
df_2017[df_2017['is_self'] == False][selected_fields].head()

Unnamed: 0,author,author_flair_text,created_utc,domain,full_link,id,is_reddit_media_domain,is_video,locked,num_comments,...,stickied,title,url,banned_by,edited,crosspost_parent,crosspost_parent_list,distinguished,author_created_utc,author_fullname
46,PleaseCanYouTellMe,This user has not yet been verified.,1514742763,self.medical,https://www.reddit.com/r/AskDocs/comments/7n9v...,7n9v01,False,False,False,0,...,False,"Chronic sore throat, pain and clogged feeling ...",https://www.reddit.com/r/medical/comments/7n9u...,,,t3_7n9uyb,"[{'approved_at_utc': None, 'approved_by': None...",,,
247,throwmcfly,This user has not yet been verified.,1514648880,self.DiagnoseMe,https://www.reddit.com/r/AskDocs/comments/7n26...,7n26eg,False,False,False,4,...,False,Severe headache at the point of orgasm (M/20),https://www.reddit.com/r/DiagnoseMe/comments/7...,,,t3_7n24f7,"[{'approved_at_utc': None, 'approved_by': None...",,,
553,confusedal0t,This user has not yet been verified.,1514500361,self.STD,https://www.reddit.com/r/AskDocs/comments/7mpz...,7mpzsa,False,False,False,2,...,False,Does anyone know if this could be herpes? Anyb...,https://www.reddit.com/r/STD/comments/7mhq3z/d...,,,t3_7mhq3z,"[{'approved_at_utc': None, 'approved_by': None...",,,
1221,PleaseCanYouTellMe,This user has not yet been verified.,1514155267,self.medical,https://www.reddit.com/r/AskDocs/comments/7lxz...,7lxzk6,False,False,False,4,...,False,Mom has one sided headache after an argument?,https://www.reddit.com/r/medical/comments/7lxy...,,,t3_7lxy1e,"[{'approved_at_utc': None, 'approved_by': None...",,,
2227,coolredditusername12,This user has not yet been verified.,1513633934,self.Dermatology,https://www.reddit.com/r/AskDocs/comments/7koh...,7kohvh,False,False,False,0,...,False,Could you help me identify this thing on my back?,https://www.reddit.com/r/Dermatology/comments/...,,,t3_7kofhb,"[{'approved_at_utc': None, 'approved_by': None...",,,


#### Exploring url fields

In [36]:
list(df_2017[df_2017['full_link'] != df_2017['url']][['full_link', 'url']].loc[7297])

['https://www.reddit.com/r/AskDocs/comments/7dj2zn/if_i_take_an_antidepressant_and_i_am_not/',
 'https://www.reddit.com/r/Drugs/comments/7dj18d/if_i_take_an_antidepressant_and_i_am_not/']

In [58]:
list(df_2017.loc[17][['full_link','url']])

['https://www.reddit.com/r/AskDocs/comments/7nb5nw/redness_but_not_jockitch/',
 'https://www.reddit.com/r/AskDocs/comments/7nb5nw/redness_but_not_jockitch/']

In [154]:
len(df_2017[df_2017['full_link'] == df_2017['url']][['full_link', 'url']])

62417

In [156]:
len(df_2017[df_2017['full_link'] != df_2017['url']][['full_link', 'url']])

21

In [155]:
df_2017[df_2017['full_link'] != df_2017['url']][['full_link', 'url']].head()

Unnamed: 0,full_link,url
46,https://www.reddit.com/r/AskDocs/comments/7n9v...,https://www.reddit.com/r/medical/comments/7n9u...
247,https://www.reddit.com/r/AskDocs/comments/7n26...,https://www.reddit.com/r/DiagnoseMe/comments/7...
553,https://www.reddit.com/r/AskDocs/comments/7mpz...,https://www.reddit.com/r/STD/comments/7mhq3z/d...
1221,https://www.reddit.com/r/AskDocs/comments/7lxz...,https://www.reddit.com/r/medical/comments/7lxy...
2227,https://www.reddit.com/r/AskDocs/comments/7koh...,https://www.reddit.com/r/Dermatology/comments/...


It looks like the `full_link` field is the URL of the post itself.  
The `url` field equals `full_link` for self-posts, and is either a crosspost link or an external link for link posts.
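
These observations can be turned into a simple labelling rule. The sketch below is an assumption-laden illustration rather than a verified classifier: it treats `url == full_link` as the marker of a self-post (the counts above match `is_self`, but this was not checked row by row), and the function name `post_kind` and the example URLs are hypothetical.

```python
import math

def post_kind(full_link, url, crosspost_parent=None):
    """Label a post as 'self-post', 'crosspost' or 'external link'."""
    if url == full_link:
        return 'self-post'
    parent_missing = crosspost_parent is None or (
        isinstance(crosspost_parent, float) and math.isnan(crosspost_parent)
    )
    return 'external link' if parent_missing else 'crosspost'

# Hypothetical URLs, for illustration only.
own = 'https://www.reddit.com/r/AskDocs/comments/xxxxxx/example/'
other = 'https://www.reddit.com/r/medical/comments/yyyyyy/example/'
print(post_kind(own, own))
print(post_kind(own, other, crosspost_parent='t3_yyyyyy'))
print(post_kind(own, 'https://example.com/article', crosspost_parent=float('nan')))
```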

#### Exploring the 'banned_by' field

In [42]:
df_2017['banned_by'].value_counts(dropna=False)

NaN           59411
moderators     3027
Name: banned_by, dtype: int64

#### Exploring the 'selftext' field  
This field contains the body of the post for self-posts.

In [44]:
df_2017['selftext'].isna().value_counts(dropna=False)

False    61979
True       459
Name: selftext, dtype: int64

In [51]:
df_2017['selftext'].str.len().value_counts(dropna=False)

9.0       20522
NaN         459
0.0         254
457.0        59
468.0        52
          ...  
6012.0        1
3677.0        1
3729.0        1
4862.0        1
4263.0        1
Name: selftext, Length: 3916, dtype: int64

In [151]:
df_2017['selftext'].value_counts(dropna=False).head()

[deleted]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         11833
[removed]                                                                                                                                                                                                                                                                                       

In [46]:
df_2017[df_2017['selftext'].isna()]['banned_by'].value_counts(dropna=False)

moderators    448
NaN            11
Name: banned_by, dtype: int64

Some posts have only a title and no body because the user put the whole question in the title and left the post body blank.

In [152]:
df_2017[(df_2017['selftext'].isna()) & (~df_2017['banned_by'].isna())][selected_fields].head()

Unnamed: 0,author,author_flair_text,created_utc,domain,full_link,id,is_reddit_media_domain,is_video,locked,num_comments,...,stickied,title,url,banned_by,edited,crosspost_parent,crosspost_parent_list,distinguished,author_created_utc,author_fullname
17,[deleted],,1514756005,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nb5...,7nb5nw,False,False,False,1,...,False,Redness but not JockItch?,https://www.reddit.com/r/AskDocs/comments/7nb5...,moderators,,,,,,
26,[deleted],,1514752781,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nau...,7nauj8,False,False,False,1,...,False,Epinephrine and Hydroxyzine ELI5,https://www.reddit.com/r/AskDocs/comments/7nau...,moderators,,,,,,
39,[deleted],,1514745599,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7na5...,7na56p,False,False,False,1,...,False,What is the maximum healthy range for an Lp-PL...,https://www.reddit.com/r/AskDocs/comments/7na5...,moderators,,,,,,
41,[deleted],,1514745379,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7na4...,7na4ei,False,False,False,1,...,False,(Nsfw) possible STD,https://www.reddit.com/r/AskDocs/comments/7na4...,moderators,,,,,,
45,[deleted],,1514744716,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7na2...,7na221,False,False,False,1,...,False,Clubbed nails update? Do I have it?,https://www.reddit.com/r/AskDocs/comments/7na2...,moderators,,,,,,
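
The cells above show that `selftext` can be NaN, an empty string, one of the placeholders '[deleted]' and '[removed]', or actual body text. A minimal sketch of a helper that makes these cases explicit is given below; the function name `selftext_status` and its labels are hypothetical, and stripping whitespace before comparing is an extra assumption.

```python
import math

PLACEHOLDERS = {'[deleted]': 'deleted', '[removed]': 'removed'}

def selftext_status(value):
    """Classify a selftext value as missing, empty, deleted, removed or text."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 'missing'
    text = str(value).strip()
    if not text:
        return 'empty'
    return PLACEHOLDERS.get(text, 'text')

examples = [float('nan'), '', '[deleted]', '[removed]', 'I have had a sore throat for weeks.']
print([selftext_status(v) for v in examples])
```

Something like `df_2017['selftext'].apply(selftext_status).value_counts()` would then summarise how many posts fall into each case, and the labels could later be used to filter out posts without usable body text.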
