# Exploratory Data Analysis  
  

#### Notebook objectives:  
- Explore the prepped data
- Brainstorm analysis questions   

#### Steps:  
1. [Load the data](#Load-the-data)  
2. [Explore data fields](#Explore-data-fields)  
   - **[author](#author)**  
   - <s>[author_flair_text](#author_flair_text)</s>
      - was used for filtering out irrelevant bot/mod posts here, but will not be used in further analyses.  
   - <s>[domain](#domain)</s>  
   - <s>[locked](#locked)</s>  
   - **[num_comments](#num_comments)**  
   - <s>[num_crossposts](#num_crossposts)</s>  
   - <s>[over_18](#over_18)</s>  
   - <s>[score](#score)</s>
   - <s>[selftext](#selftext)</s>
   - <s>[title](#title)</s>  
      - was used for filtering out some test posts here, but will not be used in further analyses.  
   - <s>[crosspost_subreddits](#crosspost_subreddits)</s>
   - **[full_post_text](#full_post_text)**
   - **[created_utc_ns_dt](#created_utc_ns_dt)**
   - <s>[edited_utc_ns_dt](#edited_utc_ns_dt)</s>
   - **[age](#age)**
   - **[sex](#sex)**  
3. [Analysis ideas](#Analysis-ideas)
   - fields selected for further analyses:
     - author  
     - num_comments  
     - full_post_text
     - created_utc_ns_dt
     - age
     - sex  
   - [Potential analysis questions to explore](#Potential-analysis-questions-to-explore:)  

In [105]:
%run notebook_setup.ipynb

DATA_PATH='data/'
OUTPUT_PATH='output/'


num_freq_plot(df, field, color='')


## Load the data

In [2]:
df = pd.read_csv(
    DATA_PATH + 'reddit_askdocs_submissions_2017_to_20220121_analysis_ds.zip',
    low_memory=False
)

## Explore data fields  

In [3]:
df.head()

Unnamed: 0,author,author_flair_text,domain,full_link,id,locked,num_comments,num_crossposts,over_18,score,selftext,title,url,crosspost_subreddits,full_post_text,created_utc_ns_dt,edited_utc_ns_dt,age,sex
0,[deleted],,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbw...,7nbwtn,False,0,0.0,False,2,,Appendicitis removed 1 month ago but feel a pa...,https://www.reddit.com/r/AskDocs/comments/7nbw...,,Appendicitis removed 1 month ago but feel a pa...,1514764452000000000,,,
1,[deleted],,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbv...,7nbvln,False,1,0.0,False,1,,My grandma has neck/back pain and little to no...,https://www.reddit.com/r/AskDocs/comments/7nbv...,,My grandma has neck/back pain and little to no...,1514764055000000000,,,AFAB
2,DavisTheMagicSheep,This user has not yet been verified.,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbu...,7nburb,False,2,0.0,False,1,"I've had a cold for the last couple days now, ...",My ears feel like there is pressure inside of ...,https://www.reddit.com/r/AskDocs/comments/7nbu...,,My ears feel like there is pressure inside of ...,1514763799000000000,,,
3,Dontgetscooped,This user has not yet been verified.,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbs...,7nbsw2,False,1,0.0,False,1,(first about me : 32 white male 5 foot 5 225lb...,IBS maybe?,https://www.reddit.com/r/AskDocs/comments/7nbs...,,IBS maybe? | (first about me : 32 white male 5...,1514763188000000000,,,AMAB
4,AveryFenix,This user has not yet been verified.,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbo...,7nbolv,False,7,0.0,False,7,I've had these marks on my stomach ever since ...,Mole or scar? Should I be worried about melanoma?,https://www.reddit.com/r/AskDocs/comments/7nbo...,,Mole or scar? Should I be worried about melano...,1514761839000000000,,21.0,


In [4]:
df.columns

Index(['author', 'author_flair_text', 'domain', 'full_link', 'id', 'locked',
       'num_comments', 'num_crossposts', 'over_18', 'score', 'selftext',
       'title', 'url', 'crosspost_subreddits', 'full_post_text',
       'created_utc_ns_dt', 'edited_utc_ns_dt', 'age', 'sex'],
      dtype='object')

In [22]:
len(df.columns)

19

### author

In [5]:
df['author'].value_counts(dropna=False)

author
[deleted]           17647
SensitiveBorder2      289
Help_Me_Reddit01      224
throwlega             190
dizson                152
                    ...  
sammydog567             1
cynyk_                  1
hippotus                1
sarcasm_itsagift        1
nominsinjun             1
Name: count, Length: 366427, dtype: int64

In [27]:
df['author'].isna().value_counts(dropna=False)

author
False    683202
Name: count, dtype: int64

In [26]:
df['author'].value_counts(dropna=False, normalize=True)

author
[deleted]           0.025830
SensitiveBorder2    0.000423
Help_Me_Reddit01    0.000328
throwlega           0.000278
dizson              0.000222
                      ...   
sammydog567         0.000001
cynyk_              0.000001
hippotus            0.000001
sarcasm_itsagift    0.000001
nominsinjun         0.000001
Name: proportion, Length: 366427, dtype: float64

In [28]:
df[df['author'] == '[deleted]']['full_post_text'].str[:60].value_counts(dropna=False)

full_post_text
Scratched a bug bite on my scalp too hard and now there's a     7
Please help |                                                   4
Back pain |                                                     4
Trouble breathing. |                                            4
What is this? |                                                 4
                                                               ..
Blood tests for both kinds of Herpes |                          1
Numbness in right ear, related to a nerve or muscle in neck     1
No one can figure out what's wrong with me. 25 and a myriad     1
Should I worry about sugar consumption right before a physic    1
Does my calf pain warrant an ER visit? |                        1
Name: count, Length: 17411, dtype: int64

In [29]:
df[df['author'] == '[removed]']['full_post_text'].str[:60].value_counts(dropna=False)

Series([], Name: count, dtype: int64)

In [33]:
author_vc = df['author'].value_counts().reset_index()

In [42]:
author_vc.head(10)

Unnamed: 0,author,count
0,[deleted],17647
1,SensitiveBorder2,289
2,Help_Me_Reddit01,224
3,throwlega,190
4,dizson,152
5,Docquest117,147
6,ukjungle,139
7,johndoejohndoes,130
8,__throwawaypt__,115
9,Ak51915,104


In [47]:
author_posts_count_vc = author_vc['count'].value_counts(normalize=True).sort_index()\
    .reset_index()
author_posts_count_vc.rename(
    columns={'count': 'author_posts_count'},
    inplace=True
)
author_posts_count_vc.head(10)

Unnamed: 0,author_posts_count,proportion
0,1,0.627768
1,2,0.225076
2,3,0.068707
3,4,0.030694
4,5,0.01543
5,6,0.009142
6,7,0.005742
7,8,0.003862
8,9,0.002885
9,10,0.001949


In [52]:
author_posts_count_vc.tail(5)

Unnamed: 0,author_posts_count,proportion
86,152,3e-06
87,190,3e-06
88,224,3e-06
89,289,3e-06
90,17647,3e-06


In [87]:
author_posts_count_vc2 = author_posts_count_vc[
    author_posts_count_vc['author_posts_count'] < 1000
].copy()

traces = []
trace = go.Scatter(
    x=author_posts_count_vc2['author_posts_count'],
    y=author_posts_count_vc2['proportion'],
    mode='markers'
)
traces.append(trace)
fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='proportion', rangemode='tozero')
fig.update_xaxes(title='author posts count', rangemode='tozero')
fig.update_layout(
    title=f'<b>Author posts count</b> frequency plot, excluding [deleted]',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

In [75]:
df['author'].value_counts(dropna=False, normalize=True).head()

author
[deleted]           0.025830
SensitiveBorder2    0.000423
Help_Me_Reddit01    0.000328
throwlega           0.000278
dizson              0.000222
Name: proportion, dtype: float64

In [61]:
len(author_vc[author_vc['count'] == 1])

230031

In [62]:
len(author_vc[author_vc['count'] == 1])/len(author_vc)

0.6277676044614616

In [70]:
len(author_vc[author_vc['count'] == 2])/len(author_vc)

0.22507620890381985

In [71]:
len(author_vc[
    (author_vc['count'] > 2)
    & (author_vc['count'] <= 10)
    ])/len(author_vc)

0.13840956043086344

In [72]:
len(author_vc[
    (author_vc['count'] > 10)
    & (author_vc['count'] <= 100)
    ])/len(author_vc)

0.008713877525400694

In [82]:
n = len(author_vc[
    (author_vc['count'] > 100)
    & (author_vc['count'] <= 500)
    ])/len(author_vc)
n
print(f'{n:.5f}')

0.00003


#### Key takeaways about the `author` field from the explorations above:  
- 2.6% of posts have a deleted author name (`[deleted]`).
  - For these posts, we do not know who the original author is. Many Reddit users use throw-away accounts to post sensitive or embarrassing questions, so it is not unusual to see deleted author names.  
- 63% of authors posted once
- 23% of authors posted twice
- 14% of authors posted 3-10 times
- 0.9% of authors posted 11-100 times
- 0.003% of authors posted over 100 times
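
The bucket shares listed above were computed one cell at a time. As a rough sketch, the same breakdown could be produced in one step with `pd.cut`. The function name `post_count_bucket_shares` and the toy data are illustrative assumptions, not part of the notebook. Note that `[deleted]` would land in the `>100` bucket here unless it is dropped first, whereas the cell above capped that bucket at 500.

```python
import pandas as pd


def post_count_bucket_shares(authors: pd.Series) -> pd.Series:
    """Share of distinct authors in each posts-per-author bucket."""
    posts_per_author = authors.value_counts()
    # Right-closed bins: (0, 1], (1, 2], (2, 10], (10, 100], (100, inf)
    buckets = pd.cut(
        posts_per_author,
        bins=[0, 1, 2, 10, 100, float("inf")],
        labels=["1", "2", "3-10", "11-100", ">100"],
    )
    # sort_index() orders the result by bucket rather than by frequency
    return buckets.value_counts(normalize=True).sort_index()


# Toy stand-in for df['author']: a and d post once, b twice, c three times
toy_authors = pd.Series(["a", "b", "b", "c", "c", "c", "d"])
shares = post_count_bucket_shares(toy_authors)
print(shares)
```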

#### Analysis ideas for this field:  
- categorizing authors by post frequency and anonymity (deleted-name posters, one-time posters, few-times posters, hyper posters) could be an interesting model feature  
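
A minimal sketch of how such a feature might be derived per post. The category names and the thresholds (2-10 posts for "few-times", more than 10 for "hyper") are assumptions chosen for illustration and would need tuning against the real distribution.

```python
import pandas as pd


def author_category(authors: pd.Series) -> pd.Series:
    """Label each post by how often its author posted (hypothetical thresholds)."""
    # Number of posts by the author of each row
    posts_by_author = authors.map(authors.value_counts())
    category = pd.cut(
        posts_by_author,
        bins=[0, 1, 10, float("inf")],
        labels=["one-time", "few-times", "hyper"],
    ).astype(str)
    # Deleted author names are not a single person, so they get their own label
    return category.where(authors != "[deleted]", "deleted")


toy_posts = pd.Series(["[deleted]", "[deleted]", "x", "y", "y"])
categories = author_category(toy_posts)
print(categories.tolist())
```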

### author_flair_text

In [298]:
df[df['author_flair_text'] == 'Nurse'].iloc[0]['url']

'https://www.reddit.com/r/AskDocs/comments/db3in6/small_pustules_on_toddler_neck/'

In [279]:
author_flairs_vc = df['author_flair_text'].value_counts(dropna=False).reset_index()
author_flairs_vc

Unnamed: 0,author_flair_text,count
0,This user has not yet been verified.,320863
1,,235653
2,Layperson/not verified as healthcare professional,90950
3,Layperson/not verified as healthcare professio...,34933
4,Medical Student,120
...,...,...
120,Medical Pathology Scientist,1
121,Neuroscientist,1
122,MD - Clinical Pharmacologist,1
123,Certified Counselor,1


In [86]:
df['author_flair_text'].value_counts(dropna=False, normalize=True).head(20)

author_flair_text
This user has not yet been verified.                  0.469646
NaN                                                   0.344924
Layperson/not verified as healthcare professional     0.133123
Layperson/not verified as healthcare professional.    0.051131
Medical Student                                       0.000176
Registered Nurse                                      0.000176
🤖                                                     0.000081
Physician                                             0.000059
EMT                                                   0.000044
Nursing Student                                       0.000042
EMT-B                                                 0.000040
Medical Assistant                                     0.000034
Founder                                               0.000025
Social Worker/LCSW                                    0.000023
Paramedic                                             0.000019
Pharmacist                           

In [92]:
author_flairs_list = list(df['author_flair_text'].unique())
print(len(author_flairs_list))
print(author_flairs_list)

125
[nan, 'This user has not yet been verified.', 'Psychologist', 'Medical Student', 'Nursing Student', 'Nursing Graduate, RPN', 'Registered Nurse', 'Biomedical Student', 'B.S., Medical Lab Sciences', 'EMT', 'Physician', 'Moderator', 'Pharmacist', 'Pharm.D. Student', 'Physician Assistant', 'Internal Medicine Resident', 'Paramedic student', 'Epidemiologist', 'Student Radiographer', 'Surgeon | Moderator', 'EMT-B', 'Founder', 'PhD Biobehavioral Health', 'Sports Massage Therapist', 'Midwife', 'MD - Clinical Pharmacologist', 'BSN-RN', 'Speech Pathologist', 'Test Flair - Physician', 'Resident Physician, Med/Peds', 'Speech Language Pathologist', '🤖', 'Physician, IM/Peds | Moderator', 'Paramedic', 'RN', 'Medical student', 'Bachelor of Biomedicine', 'Medical Technologist - Microbiology', 'Physical Therapist ', 'Pharmacy Student', 'Speech-Language Pathologist', 'EMT, BSN Student', 'PhD, Pharmacology Researcher', 'Medical Assistant', 'Physician | Moderator', 'BSN Student', 'Radiologic Technologis

The values in this field could greatly benefit from standardization, so let's do that.

In [290]:
flairs_dict = defaultdict(list)

for f in author_flairs_list:
    if isinstance(f, str):
        # Bot
        if re.search(r'(\bautomod|🤖)', f, flags=re.IGNORECASE):
            flairs_dict['bot'].append(f)
        # Moderator
        if re.search(r'\bmoderator', f, flags=re.IGNORECASE):
            flairs_dict['moderator'].append(f)
            
        # Unverified/layperson
        if re.search(r'\b(layperson|not.* verified)\b', f, flags=re.IGNORECASE):
            flairs_dict['unverified'].append(f)

        # medical occupation
        if re.search(r'\bstudent\b', f, flags=re.IGNORECASE):
            flairs_dict['student'].append(f)
        if re.search(r'\bphysician\b(?! assistant)', f, flags=re.IGNORECASE):
            flairs_dict['physician'].append(f)

        # medical specialty
        if re.search(r'\b(nurs(e|ing)|RN|LPN)\b', f, flags=re.IGNORECASE):
            flairs_dict['nurse'].append(f)
        if re.search(r'(\b((bio)?Behavior|Counsell?or|Psych|Mental Health)|^Therapist$)', f, flags=re.IGNORECASE):
            flairs_dict['psychology/psychiatry'].append(f)
        if re.search(r'(\bEMT\b|\bEmergency.*(Tech|medicine)|Paramedic|\bEM\b)', f, flags=re.IGNORECASE):
            flairs_dict['EMT/paramedics/emergency medicine'].append(f)
        if re.search(r'\b(social work|LCSW|MSW)', f, flags=re.IGNORECASE):
            flairs_dict['social worker'].append(f)
        if re.search(r'\b(Speech.*Pathologist|SLP)\b', f, flags=re.IGNORECASE):
            flairs_dict['speech language pathology'].append(f)
        if re.search(r'(laboratory|\blabs?\b|medical (patholog|tech))', f, flags=re.IGNORECASE):
            flairs_dict['laboratory/pathology'].append(f)
        if re.search(r'\b(physician|medical) assistant\b', f, flags=re.IGNORECASE):
            flairs_dict['physician assistant'].append(f)
        if re.search(r'\b(radio|CA?T scan|sonograph)', f, flags=re.IGNORECASE):
            flairs_dict['medical imaging'].append(f)
        if re.search(r'\b((physical|Occupational|massage) therapist|trainer)', f, flags=re.IGNORECASE):
            flairs_dict['physical/occupational/sports therapist/trainer'].append(f)
        if re.search(r'\bpharm', f, flags=re.IGNORECASE):
            flairs_dict['pharmacology/pharmacy'].append(f)
        if re.search(r'\b(internal medicine|IM)\b', f, flags=re.IGNORECASE):
            flairs_dict['internal medicine'].append(f)
        if re.search(r'\bsurge', f, flags=re.IGNORECASE):
            flairs_dict['surgery'].append(f)
        if re.search(r'\bmidwife', f, flags=re.IGNORECASE):
            flairs_dict['midwife'].append(f)
        if re.search(r'\bneuro', f, flags=re.IGNORECASE):
            flairs_dict['neuroscience'].append(f)
        if re.search(r'(\bpediat|\bpeds\b)', f, flags=re.IGNORECASE):
            flairs_dict['pedicatrics'].append(f)
        if re.search(r'\bdent', f, flags=re.IGNORECASE):
            flairs_dict['dentistry'].append(f)
        if re.search(r'dialysis', f, flags=re.IGNORECASE):
            flairs_dict['dialysis technician'].append(f)
        if re.search(r'epidemiolog', f, flags=re.IGNORECASE):
            flairs_dict['epidemiology'].append(f)
        if re.search(r'Ophthalm', f, flags=re.IGNORECASE):
            flairs_dict['ophthalmology'].append(f)
        if re.search(r'Otolaryngologist|\bENT\b', f, flags=re.IGNORECASE):
            flairs_dict['ENT'].append(f)
        if re.search(r'cardiol', f, flags=re.IGNORECASE):
            flairs_dict['cardiology'].append(f)
        if re.search(r'\bprosthet', f, flags=re.IGNORECASE):
            flairs_dict['prosthetics'].append(f)
        if re.search(r'dieti|nutriti', f, flags=re.IGNORECASE):
            flairs_dict['nutrition'].append(f)
        if re.search(r'respiratory therap', f, flags=re.IGNORECASE):
            flairs_dict['respiratory therapy'].append(f)

    else:
        flairs_dict['unverified'].append(f)

for f in flairs_dict:
    print(f'"{f}" flairs count: {len(flairs_dict[f])}')

"unverified" flairs count: 4
"psychology/psychiatry" flairs count: 16
"student" flairs count: 12
"nurse" flairs count: 19
"laboratory/pathology" flairs count: 9
"EMT/paramedics/emergency medicine" flairs count: 10
"physician" flairs count: 17
"moderator" flairs count: 8
"pharmacology/pharmacy" flairs count: 9
"physician assistant" flairs count: 4
"internal medicine" flairs count: 4
"epidemiology" flairs count: 1
"medical imaging" flairs count: 9
"surgery" flairs count: 2
"physical/occupational/sports therapist/trainer" flairs count: 6
"midwife" flairs count: 3
"speech language pathology" flairs count: 5
"pedicatrics" flairs count: 3
"bot" flairs count: 2
"cardiology" flairs count: 1
"social worker" flairs count: 3
"respiratory therapy" flairs count: 1
"prosthetics" flairs count: 1
"dentistry" flairs count: 1
"ENT" flairs count: 1
"dialysis technician" flairs count: 1
"ophthalmology" flairs count: 1
"neuroscience" flairs count: 3
"nutrition" flairs count: 1


In [291]:
len(flairs_dict)

29

In [293]:
for f in flairs_dict:
    print(f'#################################\n"{f}" flairs\n#################################\n')
    
    nobs = author_flairs_vc[author_flairs_vc['author_flair_text'].isin(flairs_dict[f])]\
        ['count'].sum()
    print(f'Posts count: {nobs:,} ({nobs/len(df):.4f} of dataset)\n')

    print('--- Flairs list:  ----')
    for i in flairs_dict[f]:
        print(i)
    print('-----------------------\n\n')

#################################
"unverified" flairs
#################################

Posts count: 682,399 (0.9988 of dataset)

--- Flairs list:  ----
nan
This user has not yet been verified.
Layperson/not verified as healthcare professional.
Layperson/not verified as healthcare professional
-----------------------


#################################
"psychology/psychiatry" flairs
#################################

Posts count: 38 (0.0001 of dataset)

--- Flairs list:  ----
Psychologist
PhD Biobehavioral Health
Psychiatric Nurse Practitioner
Clinical Counselor
Behavioral Technician
Therapist
Licensed Professional Counselor
Clinical Psychologist
Mental Health Professional
Physician - Psychiatrist
Counselor
Certified Counselor
Counsellor
Licensed Alcohol and Drug Counselor
Mental Health Technician
Behavioral Health Counselor
-----------------------


#################################
"student" flairs
#################################

Posts count: 188 (0.0003 of dataset)

--- Flair

In [274]:
all_flairs = {x for v in flairs_dict.values() for x in v}
#all_flairs

In [275]:
set(author_flairs_list).difference(all_flairs)

{'Bachelor of Biomedicine', 'Founder'}

In [301]:
len(df[df['author_flair_text'].isin(flairs_dict['bot'])])

60

In [303]:
df[df['author_flair_text'].isin(flairs_dict['moderator'])]

Unnamed: 0,author,author_flair_text,domain,full_link,id,locked,num_comments,num_crossposts,over_18,score,selftext,title,url,crosspost_subreddits,full_post_text,created_utc_ns_dt,edited_utc_ns_dt,age,sex
39821,darthbat,Moderator,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/650y...,650yo9,False,2,,False,2,"Age: 57\n\nSex: Female\n\nHeight: 5' 1""\n\nWei...",Friend has a (stone size)tumor (doesn't know i...,https://www.reddit.com/r/AskDocs/comments/650y...,,Friend has a (stone size)tumor (doesn't know i...,1492030659000000000,,57.0,
90038,muscleups,Surgeon | Moderator,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/90jx...,90jxk1,False,66,0.0,False,14,Please comment below what changes you feel wil...,[META] Survey: What features will make /r/askd...,https://www.reddit.com/r/AskDocs/comments/90jx...,,[META] Survey: What features will make /r/askd...,1532121773000000000,,,
90337,muscleups,Surgeon | Moderator,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/902c...,902cgz,False,0,0.0,True,1,26m 230 lbs question test,Hello stool black,https://www.reddit.com/r/AskDocs/comments/902c...,,Hello stool black | 26m 230 lbs question test,1531971466000000000,,26.0,
90338,muscleups,Surgeon | Moderator,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/9029...,902934,False,0,0.0,True,1,26m 230 lbs question test,26m 230 lbs question test balls?,https://www.reddit.com/r/AskDocs/comments/9029...,,26m 230 lbs question test balls? | 26m 230 lbs...,1531970668000000000,,26.0,
90339,muscleups,Surgeon | Moderator,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/9028...,9028zk,False,0,0.0,False,1,26m 230 lbs question test,Automod test- please ignore . Penis?,https://www.reddit.com/r/AskDocs/comments/9028...,,Automod test- please ignore . Penis? | 26m 230...,1531970645000000000,,26.0,
90340,muscleups,Surgeon | Moderator,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/9028...,9028o7,False,0,0.0,False,1,"26m 230 lbs question test M F . city, yes no",automod test- ignore,https://www.reddit.com/r/AskDocs/comments/9028...,,automod test- ignore | 26m 230 lbs question te...,1531970564000000000,,26.0,
90342,muscleups,Surgeon | Moderator,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/9027...,90273y,False,1,0.0,True,0,test test test,this is a test - please ignore,https://www.reddit.com/r/AskDocs/comments/9027...,,this is a test - please ignore | test test test,1531970193000000000,,,
134821,KingNebby,"Physician, IM/Peds | Moderator",reddit.com,https://www.reddit.com/r/AskDocs/comments/adsw...,adswk7,False,9,0.0,False,1,,That does for the lurkers here as well!!! :),https://www.reddit.com/r/AskDocs/?st=JQNL9Y48&...,medicine,That does for the lurkers here as well!!! :) |,1546942598000000000,,,
181114,PokeTheVeil,Physician | Moderator,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/bxoe...,bxoe50,False,17,0.0,False,3,I've been quietly doing it for posts I come ac...,Flair updates for new Reddit,https://www.reddit.com/r/AskDocs/comments/bxoe...,,Flair updates for new Reddit | I've been quiet...,1559868753000000000,,,
197527,PokeTheVeil,Physician | Moderator,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/cuhv...,cuhvjv,False,93,0.0,False,771,We will shortly be rolling out an updated set ...,New rule: Only verified users may respond to i...,https://www.reddit.com/r/AskDocs/comments/cuhv...,,New rule: Only verified users may respond to i...,1566585058000000000,,,


#### Key takeaways about the `author_flair_text` field from the explorations above:  
- 34% missing values  
- 99.88% of records can be considered 'unverified' users (as opposed to verified medical professionals) based on either a missing value or unverified/layperson label in the `author_flair_text` field
  - this means that the vast majority of posts in this subreddit are made by layperson users seeking medical advice, as expected  
- the `author_flair_text` field can be used to identify posts made by bots and moderators that aren't useful for this project and hence need to be filtered out  
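
As a rough illustration of how the standardized categories could be attached back to individual flairs, the sketch below inverts a category-to-flairs mapping and recomputes the unverified share. The abbreviated `flairs_dict` and the toy `flair_text` series are hypothetical stand-ins for the notebook's objects.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical, abbreviated version of the notebook's flairs_dict
flairs_dict = {
    "unverified": ["This user has not yet been verified."],
    "moderator": ["Surgeon | Moderator"],
    "surgery": ["Surgeon | Moderator"],
}

# Invert the mapping: one flair can belong to several categories
flair_to_categories = defaultdict(list)
for category, flairs in flairs_dict.items():
    for flair in flairs:
        flair_to_categories[flair].append(category)

# Toy stand-in for df['author_flair_text']
flair_text = pd.Series(
    [None, "This user has not yet been verified.", "Surgeon | Moderator"]
)

# A post counts as unverified if the flair is missing or carries an unverified label
is_unverified = flair_text.isna() | flair_text.isin(flairs_dict["unverified"])
unverified_share = is_unverified.mean()
print(f"{unverified_share:.2%}")
```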

#### Analysis ideas for this field:  
- This field can be used to further filter the analysis dataset by removing irrelevant bot and moderator posts    

#### Let's filter out bot and moderator posts  

In [307]:
df = df[~(
    (df['author_flair_text'].isin(flairs_dict['bot']))
    | (df['author_flair_text'].isin(flairs_dict['moderator']))
)]

In [308]:
len(df)

683122

### domain

In [312]:
df['domain'].value_counts(dropna=False, normalize=True)

domain
self.AskDocs             0.999720
self.medical             0.000035
reddit.com               0.000019
self.Advice              0.000013
self.DiagnoseMe          0.000012
                           ...   
self.Psychic             0.000001
self.Medical_Students    0.000001
self.neurology           0.000001
self.Coldsores           0.000001
self.pillreminderapp     0.000001
Name: proportion, Length: 100, dtype: float64

Nearly all posts (99.97%) have the domain `self.AskDocs`, so this field doesn't contain useful info for this project.  
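
One way to spot fields like this quickly is to check how much of each column is taken up by its single most frequent value. The helper below is a sketch under that assumption; `top_value_share` is a hypothetical name and the toy frame only mimics a few of the dataset's columns.

```python
import pandas as pd


def top_value_share(frame: pd.DataFrame) -> pd.Series:
    """Share of rows taken by the most frequent value (NaN included) per column."""
    return frame.apply(
        # value_counts sorts descending, so the first entry is the top share
        lambda col: col.value_counts(dropna=False, normalize=True).iloc[0]
    ).sort_values(ascending=False)


toy = pd.DataFrame({
    "domain": ["self.AskDocs"] * 9 + ["self.medical"],
    "locked": [False] * 10,
    "num_comments": [0, 1, 2, 2, 2, 3, 4, 2, 1, 5],
})
top_shares = top_value_share(toy)
print(top_shares)
```

Columns with a share close to 1 carry almost no variation and are candidates for dropping.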

### locked

In [313]:
df['locked'].value_counts(dropna=False, normalize=True)

locked
False    0.999818
True     0.000182
Name: proportion, dtype: float64

Only 0.02% of posts are locked, so this field doesn't contain useful info for this project.  

### num_comments

In [319]:
num_comments_vc = df['num_comments'].value_counts(dropna=False, normalize=True).sort_index()\
    .reset_index()
num_comments_vc

Unnamed: 0,num_comments,proportion
0,0,0.053260
1,1,0.170454
2,2,0.418689
3,3,0.105562
4,4,0.078643
...,...,...
272,494,0.000001
273,562,0.000001
274,686,0.000001
275,802,0.000001


In [326]:
cutoff = 40

In [327]:
num_comments_vc[num_comments_vc['num_comments'] >= cutoff]['proportion'].sum()

0.002778420252897725

In [322]:
num_comments_vc2 = num_comments_vc[
    num_comments_vc['num_comments'] < cutoff
].copy()

traces = []
trace = go.Scatter(
    x=num_comments_vc2['num_comments'],
    y=num_comments_vc2['proportion'],
    mode='markers'
)
traces.append(trace)
fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='proportion', rangemode='tozero')
fig.update_xaxes(title='num_comments', rangemode='tozero')
fig.update_layout(
    title=f'<b>num_comments</b> frequency plot, < {cutoff}',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

In [331]:
for i,r in df[df['num_comments'] >= cutoff].head().iterrows():
    print(r['url'])

https://www.reddit.com/r/AskDocs/comments/7hu6ys/severe_transient_shortness_of_breath_no_asthma_or/
https://www.reddit.com/r/AskDocs/comments/7gbdja/rapid_weight_loss_others_have_noticed_first_was/
https://www.reddit.com/r/AskDocs/comments/7d27ub/is_this_a_tapeworm_and_what_do_i_do/
https://www.reddit.com/r/AskDocs/comments/7cwnd3/bizarre_case_of_hiv/
https://www.reddit.com/r/AskDocs/comments/7c9nh2/do_doctor_give_up_on_people_with_uncommon_diseases/


#### Key takeaways about the `num_comments` field from the explorations above:  
- the most common value is 2 comments (42% of posts)  
- 0.3% of posts have 40 or more comments  

#### Analysis ideas for this field:  
- this field is interesting for analyses as a measure of engagement with posts  
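
A minimal sketch of how `num_comments` might be turned into engagement features, given its long right tail. The log transform and the bucket labels are assumptions for illustration; only the cutoff of 40 comes from the exploration above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['num_comments']
num_comments = pd.Series([0, 1, 2, 2, 7, 45, 802])

# log1p compresses the long tail and keeps 0 comments at 0
log_comments = np.log1p(num_comments)

# Right-closed bins: 0, 1-2, 3-39, and 40 or more comments
engagement = pd.cut(
    num_comments,
    bins=[-1, 0, 2, 39, np.inf],
    labels=["none", "low", "medium", "high"],  # hypothetical labels
)
print(engagement.tolist())
```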

### num_crossposts

In [332]:
num_crossposts_vc = df['num_crossposts'].value_counts(dropna=False, normalize=True).sort_index()\
    .reset_index()
num_crossposts_vc

Unnamed: 0,num_crossposts,proportion
0,0.0,0.944017
1,1.0,0.000433
2,2.0,4.8e-05
3,3.0,1e-05
4,4.0,1e-06
5,,0.055489


94.4% of posts have no crossposts and another 5.5% have a missing value, so this field doesn't contain useful info for this project.

### over_18

In [333]:
df['over_18'].value_counts(dropna=False, normalize=True)

over_18
False    0.93902
True     0.06098
Name: proportion, dtype: float64

This field doesn't seem to contain useful info for this project.

### score

In [334]:
score_vc = df['score'].value_counts(dropna=False, normalize=True).sort_index()\
    .reset_index()
score_vc

Unnamed: 0,score,proportion
0,0,0.022135
1,1,0.900782
2,2,0.044424
3,3,0.015043
4,4,0.004245
...,...,...
333,715,0.000001
334,745,0.000001
335,833,0.000001
336,1107,0.000001


In [337]:
cutoff = 40

In [340]:
score_vc[score_vc['score'] >= cutoff]['proportion'].sum()

0.0010071407449913777

In [338]:
score_vc2 = score_vc[
    score_vc['score'] < cutoff
].copy()

traces = []
trace = go.Scatter(
    x=score_vc2['score'],
    y=score_vc2['proportion'],
    mode='markers'
)
traces.append(trace)
fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='proportion', rangemode='tozero')
fig.update_xaxes(title='score', rangemode='tozero')
fig.update_layout(
    title=f'<b>score</b> frequency plot, < {cutoff}',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

#### Key takeaways about the `score` field from the explorations above:  
- most posts (90%) have a score of 1  
- 0.1% of posts have a score of 40 or more
- it looks like `score` and up/downvoting are not as actively used in this subreddit as in others, as expected  

#### Analysis ideas for this field:  
- This field doesn't seem to contain useful info for this project.  

### selftext

In [344]:
df['selftext'].str[:60].value_counts(dropna=False, normalize=True)

selftext
NaN                                                                         0.279209
Please be as detailed as possible in your submissions. Pleas                0.000130
Hi, female, 24 years old, 5’7, 115 pounds? (51kg)\n\nI’ve been              0.000088
Age\n\nSex\n\nHeight\n\nWeight\n\nRace\n\nDuration of complaint\n\nLocat    0.000076
\[removed\]                                                                 0.000072
                                                                              ...   
19\nFemale\n6'2\n170 pounds\nLeft Arm\n1 Hour\nFibromyalgia\nTetral         0.000001
22f, Canada \n\nI’ve been experiencing a few sensations the la              0.000001
My right eye is -2 and the left is -0.75 , should i always w                0.000001
29 - Female - Caucasian - 5'5" - 127lbs\n\nCurrent meds: Lexap              0.000001
weight\n\n20f, 5'2, 149lbs, white\n\nfor the past few weeks i've            0.000001
Name: proportion, Length: 477300, dtype: float64

In [348]:
# assign the result back instead of using inplace=True on a column selection
df['selftext'] = df['selftext'].fillna('')

In [349]:
df['selftext'].str[:60].value_counts(dropna=False, normalize=True)

selftext
                                                                            0.279209
Please be as detailed as possible in your submissions. Pleas                0.000130
Hi, female, 24 years old, 5’7, 115 pounds? (51kg)\n\nI’ve been              0.000088
Age\n\nSex\n\nHeight\n\nWeight\n\nRace\n\nDuration of complaint\n\nLocat    0.000076
\[removed\]                                                                 0.000072
                                                                              ...   
19\nFemale\n6'2\n170 pounds\nLeft Arm\n1 Hour\nFibromyalgia\nTetral         0.000001
22f, Canada \n\nI’ve been experiencing a few sensations the la              0.000001
My right eye is -2 and the left is -0.75 , should i always w                0.000001
29 - Female - Caucasian - 5'5" - 127lbs\n\nCurrent meds: Lexap              0.000001
weight\n\n20f, 5'2, 149lbs, white\n\nfor the past few weeks i've            0.000001
Name: proportion, Length: 477300, dtype: float64

In [357]:
for i,r in df[df['selftext'] == ''].tail().iterrows():
    print(r['title'])
    print(r['url'])

Swollen booster injection site
https://www.reddit.com/r/AskDocs/comments/rt73fe/swollen_booster_injection_site/
Questions about blood panel results
https://www.reddit.com/r/AskDocs/comments/rt6z6p/questions_about_blood_panel_results/
Concerns with increased risk of myocarditis from covid vax while pregnant
https://www.reddit.com/r/AskDocs/comments/rt6yfi/concerns_with_increased_risk_of_myocarditis_from/
Post Foot Puncture Pain - 27/M
https://www.reddit.com/r/AskDocs/comments/rt6we3/post_foot_puncture_pain_27m/
Covid pneumonia
https://www.reddit.com/r/AskDocs/comments/rt6t6f/covid_pneumonia/


28% of posts have no text in the body. Many of these posts were deleted either by the user, bots or moderators.  

Many of these observations still contain useful text in the title that could work in NLP analyses. However, when a post is removed for not complying with the community guidelines, the moderators of this subreddit encourage the user to repost it with more details. Therefore, observations with a deleted post body are likely duplicated in the dataset by observations with non-blank post content. To avoid duplication in the analysis dataset, we filter out the records with blank `selftext` values.
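
The repost assumption could be spot-checked before filtering. The sketch below counts how many blank-body posts have a counterpart with the same author and title that does have a body. Matching on exact author and title is a crude assumption: reposts may change the title, and `[deleted]` authors cannot be matched at all, so this would only give a lower bound. The toy frame stands in for `df`.

```python
import pandas as pd

# Toy stand-in for df with only the columns needed here
posts = pd.DataFrame({
    "author": ["u1", "u1", "u2", "u3"],
    "title": ["Back pain", "Back pain", "Rash", "Cough"],
    "selftext": ["", "32M, lower back pain for 2 weeks", "", "25F, dry cough"],
})

has_body = posts["selftext"].fillna("").str.strip() != ""

# (author, title) pairs of posts that do have a body
with_body_keys = set(
    zip(posts.loc[has_body, "author"], posts.loc[has_body, "title"])
)

# For each blank-body post, check whether a matching post with a body exists
blank = posts[~has_body]
reposted = [key in with_body_keys for key in zip(blank["author"], blank["title"])]
repost_share = sum(reposted) / len(blank)
print(f"{repost_share:.0%} of blank-body posts have a matching post with a body")

# The filter itself: keep only posts with a non-blank body
filtered = posts[has_body]
```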

In [358]:
len(df)

683122

In [360]:
len(df[df['selftext'] == ''])

190734

In [361]:
df = df[df['selftext'] != '']

In [362]:
len(df)

492388

In [364]:
df['selftext'].str[:70].value_counts(dropna=False, normalize=True)

selftext
Please be as detailed as possible in your submissions. Please include:                  0.000181
Age\n\nSex\n\nHeight\n\nWeight\n\nRace\n\nDuration of complaint\n\nLocation\n\nAny e    0.000100
\[removed\]                                                                             0.000100
Male, 31, nonsmoker, very occasional drinker, 5 foot 10, 155 pounds, 2                  0.000065
Hi, female, 24 years old, 5’7, 115 pounds? (51kg)\n\nI’ve been having a                 0.000053
                                                                                          ...   
20M, Only condition I have is Vitiligo. For the last 5 years or so, I                   0.000002
Hey everyone. I'm 19 years old, 6' tall, and 150lbs. I've been a prett                  0.000002
(For clarification, all details will reflect my condition 6 months ago                  0.000002
23M/White/5'10/165lbs\n\nI've been taking buspirone at 12 am every night                0.000002
weight\n\n20f, 5'2, 1

In [365]:
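# regex=False: match the template sentence literally (otherwise '.' matches any character)
len(df[df['selftext'].str.contains('Please be as detailed as possible in your submissions.', regex=False)])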

120

In [366]:
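# regex=False: match the template sentence literally (otherwise '.' matches any character)
for i,r in df[
    df['selftext'].str.contains(
        'Please be as detailed as possible in your submissions.', regex=False
    )
].head().iterrows():
    print(r['url'])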

https://www.reddit.com/r/AskDocs/comments/7fxixq/chronic_nausea_and_general_sick_feeling_gp_keeps/
https://www.reddit.com/r/AskDocs/comments/72kdop/i_have_had_a_constant_headache_for_fucking_3/
https://www.reddit.com/r/AskDocs/comments/71xaqy/ate_a_moldy_date/
https://www.reddit.com/r/AskDocs/comments/6vsear/will_sleeping_beside_a_wireless_router_harm_your/
https://www.reddit.com/r/AskDocs/comments/6k0nfz/toe_cramps/


### Post body text length

In [429]:
df['selftext'] = df['selftext'].str.strip()

In [430]:
selftext_length_vc = df['selftext'].str.len().value_counts().sort_index().reset_index()
selftext_length_vc

Unnamed: 0,selftext,count
0,1,8
1,2,10
2,3,52
3,4,32
4,5,23
...,...,...
7313,24235,1
7314,24599,1
7315,25075,1
7316,25833,1


In [431]:
traces = []
trace = go.Scatter(
    x=selftext_length_vc['selftext'],
    y=selftext_length_vc['count'],
    mode='markers'
)
traces.append(trace)
fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='count', rangemode='tozero')
fig.update_xaxes(title='selftext length', rangemode='tozero')
fig.update_layout(
    title=f'<b>selftext length</b> frequency plot',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

In [432]:
selftext_length_vc[selftext_length_vc['selftext'] >= 5_000]['count'].sum()

3685

In [433]:
selftext_length_vc[selftext_length_vc['selftext'] >= 5_000]['count'].sum()/len(df)

0.007483935433032486

In [474]:
selftext_length_vcn = df['selftext'].str.len().value_counts(
    normalize=True, bins=100
).sort_index().reset_index()
selftext_length_vcn['cumulative_proportion'] = selftext_length_vcn['proportion'].cumsum()
selftext_length_vcn

Unnamed: 0,selftext,proportion,cumulative_proportion
0,"(-25.938000000000002, 270.37]",0.088918,0.088918
1,"(270.37, 539.74]",0.228302,0.317219
2,"(539.74, 809.11]",0.213900,0.531120
3,"(809.11, 1078.48]",0.152522,0.683642
4,"(1078.48, 1347.85]",0.099509,0.783151
...,...,...,...
95,"(25591.15, 25860.52]",0.000002,0.999998
96,"(25860.52, 26129.89]",0.000000,0.999998
97,"(26129.89, 26399.26]",0.000000,0.999998
98,"(26399.26, 26668.63]",0.000000,0.999998


In [475]:
selftext_length_vcn.head(20)

Unnamed: 0,selftext,proportion,cumulative_proportion
0,"(-25.938000000000002, 270.37]",0.088918,0.088918
1,"(270.37, 539.74]",0.228302,0.317219
2,"(539.74, 809.11]",0.2139,0.53112
3,"(809.11, 1078.48]",0.152522,0.683642
4,"(1078.48, 1347.85]",0.099509,0.783151
5,"(1347.85, 1617.22]",0.065379,0.84853
6,"(1617.22, 1886.59]",0.043072,0.891602
7,"(1886.59, 2155.96]",0.029178,0.92078
8,"(2155.96, 2425.33]",0.01958,0.94036
9,"(2425.33, 2694.7]",0.01422,0.954581


In [437]:
selftext_length_bins_ubs = [
    math.floor(x) for x in list(pd.IntervalIndex(selftext_length_vcn['selftext']).right)
]

In [436]:
cutoff = 5_000

In [438]:
cutoff_ind = len([x for x in selftext_length_bins_ubs if x <= cutoff])

In [439]:
traces = []
trace = go.Bar(
    x=selftext_length_bins_ubs[:cutoff_ind],
    y=selftext_length_vcn['proportion'].iloc[:cutoff_ind],
)
traces.append(trace)

fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='proportion', rangemode='tozero')
fig.update_xaxes(title='selftext length')
fig.update_layout(
    title=f'<b>selftext length bins</b> frequency plot, with lengths cutoff at 5,000 chars',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

In [453]:
cutoff = 100

In [454]:
selftext_length_vc2 = selftext_length_vc[selftext_length_vc['selftext'] < cutoff]

traces = []
trace = go.Scatter(
    x=selftext_length_vc2['selftext'],
    y=selftext_length_vc2['count'],
    mode='markers'
)
traces.append(trace)
fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='count', rangemode='tozero')
fig.update_xaxes(title='selftext length', rangemode='tozero')
fig.update_layout(
    title=f'<b>selftext length</b> frequency plot, cutoff at {cutoff} chars',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

In [455]:
selftext_length_vc[selftext_length_vc['selftext'] < 10]

Unnamed: 0,selftext,count
0,1,8
1,2,10
2,3,52
3,4,32
4,5,23
5,6,20
6,7,36
7,8,28
8,9,26


In [471]:
for i,r in df[df['selftext'].str.len() < 10].head().iterrows():
    print('\n--------------')
    print(len(r['selftext']))
    print(r['selftext'])
    print(r['title'])
    print(r['num_comments'])
    print(r['url'])


--------------
7
27 Male
Dull pain in wrist that pulses, then hand feels cold and weak. Any ideas?
0
https://www.reddit.com/r/AskDocs/comments/7kefe7/dull_pain_in_wrist_that_pulses_then_hand_feels/

--------------
8
Height 5
Rеduсе сhаnсе оf gеttіng арреndісіtіs? Whаt рrесаutіоns саn уоu tаkе (еg dіеt)?
1
https://www.reddit.com/r/AskDocs/comments/7a2orh/rеduсе_сhаnсе_оf_gеttіng_арреndісіtіs_whаt/

--------------
6
Thanks
Is it bad to have a Z pack that's 2 months expired
12
https://www.reddit.com/r/AskDocs/comments/716vjh/is_it_bad_to_have_a_z_pack_thats_2_months_expired/

--------------
1
G
F
3
https://www.reddit.com/r/AskDocs/comments/6ly5wn/f/

--------------
1
4
Infected wound
2
https://www.reddit.com/r/AskDocs/comments/6ipyai/infected_wound/


#### Key takeaways about the `selftext` field from the explorations above:  
- 28% of posts have no text in the body. Many of these posts were deleted by the user, bots, or moderators.
  - Many of these observations still contain useful text in the title that could work in NLP analyses. However, when a post is deleted for not complying with community guidelines, the moderators of this subreddit encourage the user to repost it with more details. These observations with a deleted post body are therefore likely duplicated in the dataset by observations that have non-blank post content. To avoid duplication in the analysis dataset, we filter out the records with blank `selftext` values.  
- about 50% of obs have post text length of less than 800 chars  
- about 95% of obs have post text length of less than 2,700 chars  
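
The percentages above are read off the upper edges of equal-width bins, so they are approximate. A minimal sketch of computing the same summary directly from the lengths, assuming a text column like `selftext`. `length_summary` is a hypothetical helper, and the demo values are made up:

```python
import pandas as pd


def length_summary(text: pd.Series, long_cutoff: int = 5_000) -> dict:
    """Median, 95th percentile, and share of long texts, by character count."""
    lengths = text.str.len()
    return {
        'median': lengths.quantile(0.50),
        'p95': lengths.quantile(0.95),
        # the mean of a boolean mask is the share of True values
        'share_long': (lengths >= long_cutoff).mean(),
    }


# Made-up sample: four short texts and one very long one.
demo = pd.Series(['a' * n for n in [10, 20, 30, 40, 6000]])
summary = length_summary(demo)
print(summary)
```

On the real data this would be called as `length_summary(df['selftext'])`. Quantiles avoid the dependence on bin width that the binned table has.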

#### Analysis ideas for this field:  
- We most likely won't need this field in analyses on its own. We will use the derived `full_post_text` field instead, because it combines the `title` and `selftext` fields into one text that contains the whole user question. Exploring the `title` and `selftext` fields shows many examples where users ask part of their question in the title and part of it in the body, so combining them is needed for the scope of this project.

## title

In [476]:
df['title'].str[:60].value_counts(dropna=False, normalize=True)

title
Help                                                            0.000335
Please help                                                     0.000242
What is this?                                                   0.000238
Chest pain                                                      0.000201
Is this normal?                                                 0.000191
                                                                  ...   
can scar tissue that has replaced the original tissue reduce    0.000002
White Piedra or folliculitis? Pic included                      0.000002
23M Eyes are blurry and tired all the time                      0.000002
Are sock marks always a sign of peripheral edema?               0.000002
what to do about constant hunger and thirst?                    0.000002
Name: proportion, Length: 469868, dtype: float64

In [478]:
df['title'].str[:60].value_counts(dropna=False, normalize=True).head(40)

title
Help                      0.000335
Please help               0.000242
What is this?             0.000238
Chest pain                0.000201
Is this normal?           0.000191
Should I be worried?      0.000158
Should I be concerned?    0.000130
What could this be?       0.000128
What is this rash?        0.000124
Question                  0.000120
Lower back pain           0.000118
Knee pain                 0.000114
Blood in stool            0.000112
What is wrong with me?    0.000108
Shoulder pain             0.000102
Shortness of breath       0.000100
Abdominal pain            0.000097
Chest Pain                0.000095
Help please               0.000089
Back pain                 0.000089
Testicular pain           0.000085
Please help me            0.000081
Heart palpitations        0.000077
Appendicitis?             0.000071
I need help               0.000071
What's wrong with me?     0.000071
Abdominal Pain            0.000067
Stomach issues            0.000067
Swollen lymph 

In [479]:
len(df[df['title'].isna()])

0

In [480]:
len(df[df['title'] == ''])

0

Let's see if we can find some test posts...

In [519]:
df[
    (df['title'].str.contains(r'\btest\b.*\bpost\b'))
    & (df['title'].str.len() < 40)
]['title']

121333    test post please ignore
121915    test post please ignore
122366                  test post
122371    test post please ignore
122376    test post please ignore
122377                test post 2
122378                  test post
136020    test post please ignore
Name: title, dtype: object

In [520]:
len(df[
    (df['title'].str.contains(r'\btest\b.*\bpost\b'))
    & (df['title'].str.len() < 40)
])

8

In [547]:
df[
    (df['title'].str.contains(r'^test$'))
]['title']

122511    test
131051    test
135192    test
619755    test
Name: title, dtype: object

We found a few test posts (12 in total), and there could be more that these patterns miss. Let's filter these test posts out.

In [521]:
len(df)

492388

In [548]:
df = df[~(
    (
        (df['title'].str.contains(r'\btest\b.*\bpost\b'))
        & (df['title'].str.len() < 40)
    )
    |
    (df['title'].str.contains(r'^test$'))
)]

In [549]:
len(df)

492376
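
The same filter can be sketched as a reusable mask, which makes it easier to inspect what would be dropped before dropping it. This uses the two patterns from the cells above. `is_test_post` is a hypothetical helper name, and `na=False` is an addition that only matters if titles can be missing (they are not in this dataset):

```python
import pandas as pd


def is_test_post(titles: pd.Series, max_len: int = 40) -> pd.Series:
    """Flag short 'test ... post' titles and titles that are exactly 'test'."""
    short_test_post = (
        titles.str.contains(r'\btest\b.*\bpost\b', na=False)
        & (titles.str.len() < max_len)
    )
    bare_test = titles.str.contains(r'^test$', na=False)
    return short_test_post | bare_test


# Made-up titles for illustration.
demo_titles = pd.Series([
    'test post please ignore',
    'test',
    'Chest pain',
    'My blood test came back abnormal, should I post the results?',
])
test_mask = is_test_post(demo_titles)
print(demo_titles[~test_mask].tolist())
```

The length limit is what keeps the last title: it matches the first pattern but is longer than 40 characters, so it is treated as a genuine question.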

#### Key takeaways about the `title` field from the explorations above:  
- no blanks or missing values    
- some values are generic, some summarize the medical question asked in the post, and some are all or part of the medical question the user is asking
- using this field we identified and filtered out some test posts

#### Analysis ideas for this field:  
- As with the `selftext` field, we most likely won't need this field in analyses on its own. We will use the derived `full_post_text` field instead, because it combines the `title` and `selftext` fields into one text that contains the whole user question.

## crosspost_subreddits  

This field could potentially be useful, because the crosspost subreddits are sometimes subreddits for specific health issues and could provide context to the user's medical question. However, the [num_crossposts](#num_crossposts) EDA above showed that over 94% of records don't have crossposts, so this field is too sparsely populated for this project.  

## full_post_text  
This derived field combines the `title` and `selftext` fields.
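
The prep step that builds this field is not shown in this notebook. Judging from the sample values in this section (for example `Infected wound | 4`), the two fields appear to be joined with a ` | ` separator. A minimal sketch under that assumption, where `build_full_post_text` and `SEPARATOR` are hypothetical names:

```python
import pandas as pd

SEPARATOR = ' | '  # inferred from the sample values; an assumption


def build_full_post_text(df: pd.DataFrame) -> pd.Series:
    """Join the stripped title and body into one text per post."""
    title = df['title'].fillna('').str.strip()
    body = df['selftext'].fillna('').str.strip()
    return title + SEPARATOR + body


demo = pd.DataFrame({
    'title': ['Infected wound', 'HIP PAIN'],
    'selftext': ['4', '18'],
})
demo['full_post_text'] = build_full_post_text(demo)
print(demo['full_post_text'].tolist())
```

The resulting lengths (18 and 13 characters) agree with the lengths printed for these two posts in the short-text examples in this section.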

In [550]:
df['full_post_text'].str[:60].value_counts(dropna=False, normalize=True)

full_post_text
Mom is in the ICU after a brain hemorrhage stroke. How can I      0.000028
Resting heart rate 137 | This has happened a few times.  I t      0.000028
Severe symptoms for 6 months, large swollen lymph nodes for       0.000026
Shortness of Breath/Having to Control Breathing | 21M, 230 l      0.000024
Extreme lethargy and brain fog. Any way to combat this? (19F      0.000022
                                                                    ...   
[22f] mysterious leg numbness going on 9 months | I am 22 an      0.000002
(NSFW) (20M) I have dry, flaky and peeling skin on penis hea      0.000002
CT scan revealed “slight density” in my appendix. What could      0.000002
Significant change in stool | Welp, this is gonna be super T      0.000002
what to do about constant hunger and thirst? | weight\n\n20f,     0.000002
Name: proportion, Length: 487679, dtype: float64

### `full_post_text` text length

In [551]:
df['full_post_text'] = df['full_post_text'].str.strip()

In [552]:
full_post_text_length_vc = df['full_post_text'].str.len().value_counts().sort_index().reset_index()
full_post_text_length_vc

Unnamed: 0,full_post_text,count
0,5,1
1,7,1
2,9,1
3,11,2
4,13,2
...,...,...
7413,24340,1
7414,24690,1
7415,25148,1
7416,25875,1


In [553]:
traces = []
trace = go.Scatter(
    x=full_post_text_length_vc['full_post_text'],
    y=full_post_text_length_vc['count'],
    mode='markers'
)
traces.append(trace)
fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='count', rangemode='tozero')
fig.update_xaxes(title='full_post_text length', rangemode='tozero')
fig.update_layout(
    title=f'<b>full_post_text length</b> frequency plot',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

In [554]:
full_post_text_length_vc[full_post_text_length_vc['full_post_text'] >= 5_000]['count'].sum()

3882

In [555]:
full_post_text_length_vc[full_post_text_length_vc['full_post_text'] >= 5_000]['count'].sum()\
    /len(df)

0.007884218564674151

In [556]:
full_post_text_length_vcn = df['full_post_text'].str.len().value_counts(
    normalize=True, bins=100
).sort_index().reset_index()
full_post_text_length_vcn['cumulative_proportion'] = \
    full_post_text_length_vcn['proportion'].cumsum()
full_post_text_length_vcn

Unnamed: 0,full_post_text,proportion,cumulative_proportion
0,"(-21.999000000000002, 274.98]",0.059050,0.059050
1,"(274.98, 544.96]",0.217515,0.276565
2,"(544.96, 814.94]",0.221004,0.497569
3,"(814.94, 1084.92]",0.162628,0.660197
4,"(1084.92, 1354.9]",0.107245,0.767442
...,...,...,...
95,"(25653.1, 25923.08]",0.000002,0.999998
96,"(25923.08, 26193.06]",0.000000,0.999998
97,"(26193.06, 26463.04]",0.000000,0.999998
98,"(26463.04, 26733.02]",0.000000,0.999998


In [557]:
full_post_text_length_vcn.head(20)

Unnamed: 0,full_post_text,proportion,cumulative_proportion
0,"(-21.999000000000002, 274.98]",0.05905,0.05905
1,"(274.98, 544.96]",0.217515,0.276565
2,"(544.96, 814.94]",0.221004,0.497569
3,"(814.94, 1084.92]",0.162628,0.660197
4,"(1084.92, 1354.9]",0.107245,0.767442
5,"(1354.9, 1624.88]",0.070314,0.837756
6,"(1624.88, 1894.86]",0.04627,0.884026
7,"(1894.86, 2164.84]",0.031817,0.915843
8,"(2164.84, 2434.82]",0.020915,0.936758
9,"(2434.82, 2704.8]",0.015129,0.951886


In [558]:
full_post_text_length_bins_ubs = [
    math.floor(x) for x in list(
        pd.IntervalIndex(full_post_text_length_vcn['full_post_text']).right
    )
]

In [559]:
cutoff = 5_000

In [560]:
cutoff_ind = len([x for x in full_post_text_length_bins_ubs if x <= cutoff])

In [561]:
traces = []
trace = go.Bar(
    x=full_post_text_length_bins_ubs[:cutoff_ind],
    y=full_post_text_length_vcn['proportion'].iloc[:cutoff_ind],
)
traces.append(trace)

fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='proportion', rangemode='tozero')
fig.update_xaxes(title='full_post_text length')
fig.update_layout(
    title=f'<b>full_post_text length bins</b> frequency plot, with lengths cutoff at 5,000 chars',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

In [562]:
cutoff = 100

In [563]:
full_post_text_length_vc2 = full_post_text_length_vc[
    full_post_text_length_vc['full_post_text'] < cutoff
]

traces = []
trace = go.Scatter(
    x=full_post_text_length_vc2['full_post_text'],
    y=full_post_text_length_vc2['count'],
    mode='markers'
)
traces.append(trace)
fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='count', rangemode='tozero')
fig.update_xaxes(title='full_post_text length', rangemode='tozero')
fig.update_layout(
    title=f'<b>full_post_text length</b> frequency plot, cutoff at {cutoff} chars',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

In [564]:
ub = 20

In [565]:
full_post_text_length_vc[full_post_text_length_vc['full_post_text'] < ub]
   

Unnamed: 0,full_post_text,count
0,5,1
1,7,1
2,9,1
3,11,2
4,13,2
5,15,1
6,18,2
7,19,2


In [566]:
lb = 10
ub = 20

In [567]:
for i,r in df[
    (df['full_post_text'].str.len() >= lb)
    & (df['full_post_text'].str.len() < ub)
].head().iterrows():
    print('\n--------------')
    print(len(r['full_post_text']))
    print(r['full_post_text'])
    print(r['num_comments'])
    print(r['url'])


--------------
18
Infected wound | 4
2
https://www.reddit.com/r/AskDocs/comments/6ipyai/infected_wound/

--------------
11
Hand | Hajf
1
https://www.reddit.com/r/AskDocs/comments/9fkjsp/hand/

--------------
15
sdfdsf | dsfsdf
1
https://www.reddit.com/r/AskDocs/comments/9y9dzy/sdfdsf/

--------------
13
Ahaha | Shshs
1
https://www.reddit.com/r/AskDocs/comments/c3rfed/ahaha/

--------------
13
HIP PAIN | 18
1
https://www.reddit.com/r/AskDocs/comments/ciq5jz/hip_pain/


#### Key takeaways about the `full_post_text` field from the explorations above:   
- about 50% of obs have post text length of less than 815 chars  
- about 95% of obs have post text length of less than 2,700 chars  

#### Analysis ideas for this field:  
- This is the main text field that will be used for NLP analyses

## created_utc_ns_dt

In [568]:
df['created_utc_ns_dt']

2         1514763799000000000
3         1514763188000000000
4         1514761839000000000
5         1514757843000000000
7         1514757201000000000
                 ...         
683196    1640995849000000000
683197    1640995722000000000
683198    1640995609000000000
683200    1640995566000000000
683201    1640995537000000000
Name: created_utc_ns_dt, Length: 492376, dtype: int64

In [570]:
df['created_utc_ns_dt'] = pd.to_datetime(df['created_utc_ns_dt'])

In [571]:
df['created_utc_ns_dt']

2        2017-12-31 23:43:19
3        2017-12-31 23:33:08
4        2017-12-31 23:10:39
5        2017-12-31 22:04:03
7        2017-12-31 21:53:21
                 ...        
683196   2022-01-01 00:10:49
683197   2022-01-01 00:08:42
683198   2022-01-01 00:06:49
683200   2022-01-01 00:06:06
683201   2022-01-01 00:05:37
Name: created_utc_ns_dt, Length: 492376, dtype: datetime64[ns]

In [572]:
df['created_utc_ns_dt'].describe()

count                           492376
mean     2020-03-10 15:29:36.459733760
min                2017-01-01 00:05:06
25%         2019-04-06 02:04:00.500000
50%         2020-05-27 07:54:11.500000
75%      2021-04-28 13:04:20.249999872
max                2022-01-22 00:41:27
Name: created_utc_ns_dt, dtype: object

In [581]:
df['created_utc_ns_dt'].isna().value_counts()

created_utc_ns_dt
False    492376
Name: count, dtype: int64

In [615]:
created_dates = df['created_utc_ns_dt'].dt.date.astype('str')

In [620]:
created_dates.name = 'created_date'

In [621]:
created_dates.describe()

count         492376
unique          1831
top       2021-01-14
freq             563
Name: created_date, dtype: object

In [622]:
created_date_vc = created_dates.value_counts(dropna=False).sort_index().reset_index()
created_date_vc

Unnamed: 0,created_date,count
0,2017-01-01,81
1,2017-01-02,94
2,2017-01-03,123
3,2017-01-04,120
4,2017-01-05,127
...,...,...
1826,2022-01-18,480
1827,2022-01-19,439
1828,2022-01-20,461
1829,2022-01-21,440


In [623]:
traces = []
trace = go.Scatter(
    x=created_date_vc['created_date'],
    y=created_date_vc['count'],
    mode='markers',
    name='daily posts volume',
    hovertemplate='Date: %{x|%Y-%m-%d}<br>Posts daily vol: %{y} <extra></extra>'
)
traces.append(trace)

trace = go.Scatter(
    x=created_date_vc['created_date'],
    y=created_date_vc['count'].rolling(30).mean(),
    mode='lines',
    name='30-day moving average of daily posts volume',
    hovertemplate='Date: %{x|%Y-%m-%d}<br>30-day MA daily vol: %{y} <extra></extra>'
)
traces.append(trace)


fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='daily posts volume', rangemode='tozero')
fig.update_xaxes(title='created_date', rangemode='tozero')
fig.update_layout(
    title=f'<b>created_date</b> frequency plot',
    plot_bgcolor='#fff',
    showlegend=True,
    height=400
)

fig.update_layout(
    legend=dict(
        orientation='h',
        yanchor='bottom',
        y=1.02,
        xanchor='right',
        x=1
    )
)

fig.show()

#### Key takeaways about the `created_utc_ns_dt` field from the explorations above:   
- starting in 2018, there's a mostly steady upward trend in daily post volumes, until mid-2020  
- unusual daily post volumes variation in 2021  

#### Analysis ideas for this field:  
- This is the main datetime field that will be used in further analyses in this project  

## edited_utc_ns_dt

In [624]:
df['edited_utc_ns_dt']

2        NaN
3        NaN
4        NaN
5        NaN
7        NaN
          ..
683196   NaN
683197   NaN
683198   NaN
683200   NaN
683201   NaN
Name: edited_utc_ns_dt, Length: 492376, dtype: float64

In [626]:
df['edited_utc_ns_dt'] = pd.to_datetime(df['edited_utc_ns_dt'])

In [627]:
df['edited_utc_ns_dt']

2        NaT
3        NaT
4        NaT
5        NaT
7        NaT
          ..
683196   NaT
683197   NaT
683198   NaT
683200   NaT
683201   NaT
Name: edited_utc_ns_dt, Length: 492376, dtype: datetime64[ns]

In [628]:
df['edited_utc_ns_dt'].isna().value_counts()

edited_utc_ns_dt
True     479167
False     13209
Name: count, dtype: int64

In [629]:
df['edited_utc_ns_dt'].describe()

count                            13209
mean     2018-11-07 21:17:46.541827072
min                2017-01-01 04:09:38
25%                2017-08-03 19:55:51
50%                2018-02-19 17:20:57
75%                2020-06-15 19:59:40
max                2021-12-09 06:57:08
Name: edited_utc_ns_dt, dtype: object

In [638]:
# keep date objects (not strings) so that NaT stays a missing value
edited_dates = df['edited_utc_ns_dt'].dt.date

In [639]:
edited_dates.name = 'edited_date'

In [640]:
edited_dates.describe()

count          13209
unique           782
top       2021-02-27
freq              69
Name: edited_date, dtype: object

In [641]:
edited_dates.value_counts(dropna=False).sort_index()

edited_date
2017-01-01        15
2017-01-02        13
2017-01-03        17
2017-01-04        25
2017-01-05        19
               ...  
2021-11-19         9
2021-12-07        20
2021-12-08        60
2021-12-09        24
NaT           479167
Name: count, Length: 783, dtype: int64

In [642]:
edited_date_vc = edited_dates.value_counts().sort_index().reset_index()
edited_date_vc

Unnamed: 0,edited_date,count
0,2017-01-01,15
1,2017-01-02,13
2,2017-01-03,17
3,2017-01-04,25
4,2017-01-05,19
...,...,...
777,2021-11-18,3
778,2021-11-19,9
779,2021-12-07,20
780,2021-12-08,60


In [644]:
traces = []
trace = go.Scatter(
    x=edited_date_vc['edited_date'],
    y=edited_date_vc['count'],
    mode='markers',
    name='daily edits volume',
    hovertemplate='Date: %{x|%Y-%m-%d}<br>Edits daily vol: %{y} <extra></extra>'
)
traces.append(trace)

trace = go.Scatter(
    x=edited_date_vc['edited_date'],
    y=edited_date_vc['count'].rolling(30).mean(),
    mode='lines',
    name='30-day moving average of daily edits volume',
    hovertemplate='Date: %{x|%Y-%m-%d}<br>30-day MA daily edits vol: %{y} <extra></extra>'
)
traces.append(trace)


fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='daily edits volume', rangemode='tozero')
fig.update_xaxes(title='edited_date', rangemode='tozero')
fig.update_layout(
    title=f'<b>edited_date</b> frequency plot',
    plot_bgcolor='#fff',
    showlegend=True,
    height=400
)

fig.update_layout(
    legend=dict(
        orientation='h',
        yanchor='bottom',
        y=1.02,
        xanchor='right',
        x=1
    )
)

fig.show()
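
Note that `rolling(30)` averages over the last 30 rows, which equals 30 days only when every calendar day has a row. `edited_date` has 782 distinct dates, so many days are absent from `edited_date_vc` and a 30-row window spans more than 30 days. A minimal sketch of one way to make the window calendar-based, by filling absent days with zero before smoothing. `daily_counts` is a hypothetical helper, and treating absent days as zero edits is an assumption:

```python
import pandas as pd


def daily_counts(timestamps: pd.Series) -> pd.Series:
    """Count events per calendar day, with zero for days that have none."""
    days = pd.to_datetime(timestamps).dropna().dt.floor('D')
    counts = days.value_counts().sort_index()
    full_range = pd.date_range(counts.index.min(), counts.index.max(), freq='D')
    return counts.reindex(full_range, fill_value=0)


# Made-up timestamps: two events on Jan 1, none on Jan 2-3, one on Jan 4.
demo = pd.Series(pd.to_datetime([
    '2021-01-01 10:00', '2021-01-01 12:00', '2021-01-04 09:00'
]))
counts = daily_counts(demo)
smoothed = counts.rolling(3).mean()  # 3-day window for the tiny demo
print(counts.tolist())
```

On the real data, `daily_counts(df['edited_utc_ns_dt']).rolling(30).mean()` would give a moving average over 30 calendar days.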

In [646]:
traces = []
trace = go.Scatter(
    x=created_date_vc['created_date'],
    y=created_date_vc['count'].rolling(30).mean(),
    mode='lines',
    name='30-day moving average of daily posts volume',
    hovertemplate='Date: %{x|%Y-%m-%d}<br>30-day MA daily posts vol: %{y} <extra></extra>'
)
traces.append(trace)

trace = go.Scatter(
    x=edited_date_vc['edited_date'],
    y=edited_date_vc['count'].rolling(30).mean(),
    mode='lines',
    name='30-day moving average of daily edits volume',
    hovertemplate='Date: %{x|%Y-%m-%d}<br>30-day MA daily edits vol: %{y} <extra></extra>'
)
traces.append(trace)


fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='daily volume', rangemode='tozero')
fig.update_xaxes(title='date', rangemode='tozero')
fig.update_layout(
    title=f'<b>created_date</b> and <b>edited_date</b> frequency plot',
    plot_bgcolor='#fff',
    showlegend=True,
    height=600
)

fig.update_layout(
    legend=dict(
        orientation='h',
        yanchor='bottom',
        y=1.02,
        xanchor='right',
        x=1
    )
)

fig.show()

#### Key takeaways about the `edited_utc_ns_dt` field from the explorations above:   
- overall, edit volumes are low  
- Up until the spring of 2018, daily edit volumes followed a fairly consistent trend. After that, they stayed very low until early 2020. Starting in early 2020, daily edit volumes became widely dispersed, but on average they returned to approximately the same level as at the end of 2018.  

#### Analysis ideas for this field:  
- This field doesn't seem very interesting for the purposes of this project.  

## age

In [657]:
df['age'].describe()

count    405967.000000
mean         25.022381
std           9.470582
min           1.000000
25%          19.000000
50%          23.000000
75%          28.000000
max          99.000000
Name: age, dtype: float64

In [648]:
df['age'].value_counts(dropna=False, normalize=True)

age
NaN     0.175494
20.0    0.055693
21.0    0.055279
22.0    0.054722
23.0    0.051142
          ...   
91.0    0.000051
96.0    0.000051
94.0    0.000049
93.0    0.000047
98.0    0.000041
Name: proportion, Length: 100, dtype: float64

In [652]:
age_vc = df['age'].value_counts().sort_index().reset_index()
age_vc['age'] = age_vc['age'].astype('int')
age_vc

Unnamed: 0,age,count
0,1,465
1,2,770
2,3,786
3,4,629
4,5,841
...,...,...
94,95,45
95,96,25
96,97,27
97,98,20


In [656]:
traces = []
trace = go.Bar(
    x=age_vc['age'],
    y=age_vc['count'],
)
traces.append(trace)


fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='count', rangemode='tozero')
fig.update_xaxes(title='age', rangemode='tozero')
fig.update_layout(
    title=f'<b>age</b> frequency plot',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400
)

fig.show()

#### Key takeaways about the `age` field from the explorations above:   
- approx 18% missing values  
- roughly bell-shaped distribution, centered around a mean of 25 years, skewed right  
- Younger ages are most likely undercounted in the histogram above, because they are described in less consistent ways and are therefore harder to parse. It is a good idea to exclude ages below 13 from age-related analyses.  
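
A minimal sketch of the age exclusion suggested above, assuming the threshold of 13. `age_analysis_subset` and `MIN_AGE` are hypothetical names. Note that the comparison also drops rows with a missing age, because `NaN >= 13` evaluates to `False`:

```python
import pandas as pd

MIN_AGE = 13  # threshold from the takeaway above


def age_analysis_subset(df: pd.DataFrame, min_age: int = MIN_AGE) -> pd.DataFrame:
    """Keep rows with a parsed age of at least `min_age` (missing ages are dropped)."""
    return df[df['age'] >= min_age]


# Made-up ages, including one missing value.
demo = pd.DataFrame({'age': [2.0, 12.0, 13.0, 25.0, None]})
subset = age_analysis_subset(demo)
print(subset['age'].tolist())
```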

#### Analysis ideas for this field:  
- This field will be used in analyses as a demographic characteristic.  

## sex

In [658]:
df['sex'].describe()

count     306244
unique         2
top         AMAB
freq      166470
Name: sex, dtype: object

In [660]:
df['sex'].value_counts(dropna=False)

sex
NaN     186132
AMAB    166470
AFAB    139774
Name: count, dtype: int64

In [661]:
df['sex'].value_counts(dropna=False, normalize=True)

sex
NaN     0.378028
AMAB    0.338095
AFAB    0.283877
Name: proportion, dtype: float64

In [664]:
traces = []
trace = go.Bar(
    x=df['sex'].value_counts(normalize=True).index,
    y=df['sex'].value_counts(normalize=True).values,
)
traces.append(trace)


fig = go.Figure(traces)

    
fig.update_yaxes(gridcolor='#eee', title='proportion', rangemode='tozero')
fig.update_xaxes(title='sex', rangemode='tozero')
fig.update_layout(
    title=f'<b>sex</b> frequency plot',
    plot_bgcolor='#fff',
    showlegend=False,
    height=400,
    width=400
)

fig.show()

#### Key takeaways about the `sex` field from the explorations above:   
- approx 38% missing values  
- of the non-missing values, the majority of the posts (54%) are about AMAB individuals
- NOTE: this field represents the sex of an individual described in the post, and not necessarily the poster  
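
The percentages above use two different denominators: the missing share is out of all posts, while the 54% is out of posts with a known value. A small worked check using the counts from the `value_counts(dropna=False)` output above:

```python
# Counts copied from the value_counts(dropna=False) output in this section.
counts = {'AMAB': 166470, 'AFAB': 139774, 'missing': 186132}

total = sum(counts.values())
known = counts['AMAB'] + counts['AFAB']

share_missing = counts['missing'] / total      # out of all posts
share_amab_known = counts['AMAB'] / known      # out of posts with a known sex
share_amab_all = counts['AMAB'] / total        # out of all posts

print(f'missing: {share_missing:.1%}')
print(f'AMAB among known: {share_amab_known:.1%}')
print(f'AMAB among all: {share_amab_all:.1%}')
```

In pandas, `value_counts(normalize=True)` excludes missing values by default, which gives the "among known" shares, while `dropna=False` gives shares out of all rows.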

#### Analysis ideas for this field:  
- This field will be used in analyses as a demographic characteristic.  

## Analysis ideas  

### Fields selected for further analyses:  
- author  
- num_comments  
- full_post_text  
- created_utc_ns_dt  
- age  
- sex  

### Potential analysis questions to explore:  
- Are there any interesting reposting patterns? Do reposters repost about the same subject? How often do people repost?  
- What kinds of posts get more engagement (as measured by comment counts)? How does this engagement vary with post subject and demographics?
- What are the posts about? How have post subjects changed over time?  
- Are there time-related patterns, e.g. seasonality?  
- What age/sex groups post more/less? About what subjects? Did that change over time?  
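
As a possible starting point for the reposting question, a minimal sketch that counts posts per author and measures the time between consecutive posts by the same author. It assumes that `[deleted]` is a shared placeholder rather than one user, so those rows are excluded. The demo data is made up:

```python
import pandas as pd

# Made-up posts for illustration.
demo = pd.DataFrame({
    'author': ['alice', 'bob', 'alice', '[deleted]', 'alice', '[deleted]'],
    'created_utc_ns_dt': pd.to_datetime([
        '2021-01-01', '2021-01-02', '2021-01-03',
        '2021-01-04', '2021-01-10', '2021-01-11',
    ]),
})

# Exclude the placeholder author (assumption: it does not identify one user).
known = demo[demo['author'] != '[deleted]']

posts_per_author = known.groupby('author').size()
reposters = posts_per_author[posts_per_author > 1]

# Time between consecutive posts by the same author.
known = known.sort_values(['author', 'created_utc_ns_dt'])
gaps = known.groupby('author')['created_utc_ns_dt'].diff().dropna()

print(reposters.to_dict())
print(gaps.dt.days.tolist())
```

Whether two posts by the same author are about the same subject would need the text in `full_post_text`, which this sketch does not look at.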
