# Parse demographics  
The rules of the AskDocs subreddit require at least basic demographics (age and sex) and encourage posters to include detailed demographics, diagnoses, medical history, and medication information.  

#### Notebook objectives:  
- Parse demographic data
- Parse any additional info  
- Save the resulting analysis dataset  

#### Steps:  
First, we work out the required parsing steps on a small random sample of the data.  
1. [Load and sample data](#Load-and-sample-data)  
2. [Filter out irrelevant messages](#Filter-out-irrelevant-messages)  
3. [Consolidate post and cross-post content](#Consolidate-post-and-cross-post-content)
   - [Get cross-post subreddit names](#Get-cross-post-subreddit-names)  
4. [Consolidate the title and post body as the full user question](#Consolidate-the-title-and-post-body-as-the-full-user-question)
5. [Fix timestamp formats](#Fix-timestamp-formats)
6. [Update the selected fields list](#Update-the-selected-fields-list)

Next, we apply the above parsing prep steps to the entire dataset.  
7. [Load all data](#Load-all-data)  
8. [Apply data prep steps](#Apply-data-prep-steps)  
9. [Save the analysis dataset](#Save-the-analysis-dataset)
   

In [123]:
import pickle
import pandas as pd
import numpy as np
from IPython.display import display, HTML, Markdown, clear_output
import ipywidgets as widgets
import time 
import re

import plotly.graph_objects as go

In [2]:
DATA_PATH = 'data/'
OUTPUT_PATH = 'output/'

## Load and sample data

In [52]:
df = pd.read_csv(
    DATA_PATH + 'reddit_askdocs_submissions_2017_to_20220121_analysis_ds.zip',
    low_memory=False
)

In [53]:
df.head()

Unnamed: 0,author,author_flair_text,domain,full_link,id,locked,num_comments,num_crossposts,over_18,score,selftext,title,url,crosspost_subreddits,full_post_text,created_utc_ns_dt,edited_utc_ns_dt
0,[deleted],,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbw...,7nbwtn,False,0,0.0,False,2,,Appendicitis removed 1 month ago but feel a pa...,https://www.reddit.com/r/AskDocs/comments/7nbw...,,Appendicitis removed 1 month ago but feel a pa...,1514764452000000000,
1,[deleted],,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbv...,7nbvln,False,1,0.0,False,1,,My grandma has neck/back pain and little to no...,https://www.reddit.com/r/AskDocs/comments/7nbv...,,My grandma has neck/back pain and little to no...,1514764055000000000,
2,DavisTheMagicSheep,This user has not yet been verified.,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbu...,7nburb,False,2,0.0,False,1,"I've had a cold for the last couple days now, ...",My ears feel like there is pressure inside of ...,https://www.reddit.com/r/AskDocs/comments/7nbu...,,My ears feel like there is pressure inside of ...,1514763799000000000,
3,Dontgetscooped,This user has not yet been verified.,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbs...,7nbsw2,False,1,0.0,False,1,(first about me : 32 white male 5 foot 5 225lb...,IBS maybe?,https://www.reddit.com/r/AskDocs/comments/7nbs...,,IBS maybe? | (first about me : 32 white male 5...,1514763188000000000,
4,AveryFenix,This user has not yet been verified.,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbo...,7nbolv,False,7,0.0,False,7,I've had these marks on my stomach ever since ...,Mole or scar? Should I be worried about melanoma?,https://www.reddit.com/r/AskDocs/comments/7nbo...,,Mole or scar? Should I be worried about melano...,1514761839000000000,


In [54]:
df_sample = df.sample(n=10_000, random_state=1)

In [56]:
del df

In [261]:
def num_freq_plot(df, field, color=''):
    '''Make a Plotly bar chart of value frequencies for the selected numeric field.'''
    freq_df = df[field].value_counts().sort_index().reset_index()
    freq_df.columns = ['value', 'count']
    
    traces = []
        
    trace = go.Bar(
        x=freq_df['value'],
        y=freq_df['count'],
        hovertext=freq_df['value'],
        hovertemplate="Value: %{x:,}<br>" +
            "Frequency: %{y:,}<br>" +
            "<extra></extra>",
    )
    traces.append(trace)

    fig = go.Figure(traces)
    
    if len(color) > 0:
        fig.update_traces(marker_color=color)
        
    fig.update_yaxes(gridcolor='#eee', title='frequency', rangemode='tozero')
    fig.update_xaxes(rangemode='tozero')
    fig.update_layout(
        title=f'<b>{field}</b> frequency distribution',
        plot_bgcolor='#fff',
        showlegend=False,
        height=400
    )
    
    if len(freq_df) >= 10:
        fig.update_layout(height=500)

    return fig
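
The frequency table that `num_freq_plot` builds before plotting can be checked on its own. The following is a minimal sketch with made-up ages (the `ages` frame is an assumption for illustration, not notebook data); it shows only the pandas step, without Plotly.

```python
import pandas as pd

# Hypothetical ages, made up for illustration
ages = pd.DataFrame({'patient_age': [23, 20, 23, 31, 20, 23]})

# Same steps as inside num_freq_plot: count each value, order by value,
# then flatten the index into a two-column frame
freq_df = ages['patient_age'].value_counts().sort_index().reset_index()
freq_df.columns = ['value', 'count']

print(freq_df)
```

Renaming the columns explicitly keeps the result independent of how a given pandas version labels the output of `value_counts().reset_index()`.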

In [388]:
# A working copy of the sample that is safe to modify
df_sample2 = df_sample.copy()

## Explore the text data

In [389]:
df_sample2['full_post_text'].str[:60].value_counts().head(20)

full_post_text
Help |                                                           3
Do cannabis withdrawal symptoms come and go through its dura     2
Tonsillitis question |                                           2
What do I have |                                                 2
Ear infection |                                                  2
Why does my sternum “pop” if I’ve been sitting leaning over      2
Dull chest pain after infected by COVID that doesn't feel se     2
Does this need stitches? |                                       2
Bright Red Blood &amp; Clotting In Stool - NSAID/Alcohol | H     1
What are these red dots on my skin? | Hi all, on Wednesday t     1
Lapses of Unresponsiveness | Hi, I’m a white 30F, 150lbs, an     1
Very watery stool/diarrhea, not sure what to do | 20 male 17     1
Should I be concerned about smelling old urine? | I'm 23, ma     1
Strange mark under toenail | There’s been this strange mark      1
Is functioning Dysphagia a regular side effect 

## Age and gender

In [390]:
df_sample.columns

Index(['author', 'author_flair_text', 'domain', 'full_link', 'id', 'locked',
       'num_comments', 'num_crossposts', 'over_18', 'score', 'selftext',
       'title', 'url', 'crosspost_subreddits', 'full_post_text',
       'created_utc_ns_dt', 'edited_utc_ns_dt'],
      dtype='object')

### Age and gender pattern 1

In [463]:
regexp_gender_kwds = r'(F|M|AFAB|AMAB|female|male|boy|girl|man|woman|father|mother|daughter|son|brother|sister|grandma|grandpa|grandmother|grandfather)'

In [464]:
regexp_age_gender_1 = r'''(?<![.,'"])\b([1-9]\d?) ?''' + regexp_gender_kwds + r'\b'

regexp_age_gender_1_extracts = df_sample2['full_post_text'].str.extract(
        regexp_age_gender_1,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age',
        1: 'patient_gender'
    })

regexp_age_gender_1_extracts.dropna(inplace=True)

regexp_age_gender_1_extracts['patient_age'] = \
    regexp_age_gender_1_extracts['patient_age'].astype('int')

In [465]:
len(regexp_age_gender_1_extracts)/len(df_sample2)

0.2724
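
To see what pattern 1 does and does not capture, here is a small sketch that applies the same regular expression to a few made-up snippets. The `examples` strings are assumptions for illustration and do not come from the dataset.

```python
import re
import pandas as pd

# Same keyword list and pattern as in the cells above
gender_kwds = r'(F|M|AFAB|AMAB|female|male|boy|girl|man|woman|father|mother|daughter|son|brother|sister|grandma|grandpa|grandmother|grandfather)'
pattern_1 = r'''(?<![.,'"])\b([1-9]\d?) ?''' + gender_kwds + r'\b'

# Hypothetical post snippets, made up for illustration
examples = pd.Series([
    'Chest pain | 25F, non-smoker',        # compact age + keyword form
    'Rash on arm | I am a 31 male',        # space between age and keyword
    'Weight question | I weigh 150lbs',    # number not followed by a keyword
    'Height question | I am 1.8 m tall',   # digit right after a decimal point
])

demo = examples.str.extract(pattern_1, flags=re.IGNORECASE)
demo.columns = ['patient_age', 'patient_gender']
print(demo)
```

The first two rows yield an age and a keyword. The last row is rejected only because of the negative lookbehind: without it, `8 m` would be read as age 8, gender `m`.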

In [466]:
regexp_age_gender_1_extracts.describe(include='all')

| | patient_age | patient_gender |
|---|---|---|
| count | 2724.0 | 2724 |
| unique | | 11 |
| top | | F |
| freq | | 1085 |
| mean | 25.064244 | |
| std | 9.068754 | |
| min | 1.0 | |
| 25% | 20.0 | |
| 50% | 23.0 | |
| 75% | 28.0 | |
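
Because extraction is case-insensitive and the keyword list mixes abbreviations with family terms, `patient_gender` holds many distinct raw values (variants such as `F` and `f` would be counted separately). One possible way to collapse them is sketched below. `GENDER_MAP` and `normalize_gender` are hypothetical names, and the mapping itself, including keeping AFAB/AMAB as their own categories, is an assumption rather than something this notebook defines.

```python
import pandas as pd

# Assumed mapping from lower-cased keywords to a coarse label
GENDER_MAP = {
    'f': 'F', 'female': 'F', 'girl': 'F', 'woman': 'F', 'mother': 'F',
    'daughter': 'F', 'sister': 'F', 'grandma': 'F', 'grandmother': 'F',
    'm': 'M', 'male': 'M', 'boy': 'M', 'man': 'M', 'father': 'M',
    'son': 'M', 'brother': 'M', 'grandpa': 'M', 'grandfather': 'M',
    'afab': 'AFAB', 'amab': 'AMAB',  # kept as separate categories (assumption)
}

def normalize_gender(raw: pd.Series) -> pd.Series:
    """Lower-case each matched keyword and map it to a coarse label."""
    return raw.str.lower().map(GENDER_MAP)

# Hypothetical raw values, made up for illustration
raw = pd.Series(['F', 'f', 'Male', 'son', 'AFAB'])
normalized = normalize_gender(raw)
print(normalized.tolist())
```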


In [467]:
regexp_age_gender_1_extracts['patient_age'].value_counts().sort_index()

patient_age
1     2
2     7
3     4
4     1
5     2
     ..
79    1
82    1
85    1
98    1
99    1
Name: count, Length: 73, dtype: int64

In [468]:
num_freq_plot(regexp_age_gender_1_extracts, 'patient_age')

In [469]:
upper_bound = 90
lower_bound = 12
in_range_ind = regexp_age_gender_1_extracts[
    (regexp_age_gender_1_extracts['patient_age'] <= upper_bound) 
    & (regexp_age_gender_1_extracts['patient_age'] >= lower_bound)
].index

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted values:')
    print(f'''Age = {regexp_age_gender_1_extracts['patient_age'].loc[i]}''')
    print(f'''Gender = {regexp_age_gender_1_extracts['patient_gender'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted values:
Age = 18
Gender = M

Full text:
I scratched open a little bumb on my head and now my mom got concerned | For as long as I [18M] can remember I've had this little, red (I think) bumb on my head a few centimeters above my ear (so under my hair). Today I scratched it open and it started bleeding a bit so I informed my mom about the bumb and she got quiet concerned. Is this something to worry about?

Url:
https://www.reddit.com/r/AskDocs/comments/j4z0az/i_scratched_open_a_little_bumb_on_my_head_and_now/


----------------
Extracted values:
Age = 25
Gender = F

Full text:
Forehead nerve hurts to touch?? | 25F. I found this spot (by touch, not visible) at the right side of my hair line. Ever so slightly raised and when I lightly graze it, it shoots an umbrella of headache on that side of my head. Is this weird??? I’m guessing it’s a nerve? Is it odd for it to be so superficial?

Url:
https://www.reddit.com/r/AskDocs/comments/ft6zjk/forehead_nerve_hurts_to

In [470]:
# Filter out out-of-range values
regexp_age_gender_1_extracts = regexp_age_gender_1_extracts.loc[in_range_ind]
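
The same range filter can also be written with `Series.between`, which is inclusive on both ends by default. This is a sketch on a hypothetical `extracts` frame standing in for `regexp_age_gender_1_extracts`; the values are made up.

```python
import pandas as pd

# Hypothetical extracts, standing in for regexp_age_gender_1_extracts
extracts = pd.DataFrame(
    {'patient_age': [1, 12, 25, 90, 98],
     'patient_gender': ['M', 'F', 'F', 'M', 'F']},
    index=[10, 11, 12, 13, 14],
)

lower_bound, upper_bound = 12, 90

# between() keeps both bounds, matching the >= / <= comparison used above
in_range = extracts['patient_age'].between(lower_bound, upper_bound)
filtered = extracts[in_range]
print(filtered)
```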

In [471]:
num_freq_plot(regexp_age_gender_1_extracts, 'patient_age')

This pattern matches the following proportion of the data sample:

In [472]:
len(regexp_age_gender_1_extracts)/len(df_sample2)

0.2698

In [473]:
remainder = df_sample2[
    ~df_sample2.index.isin(regexp_age_gender_1_extracts.index)
].copy()

In [474]:
len(remainder)

7302

### Age and gender pattern 2

In [478]:
regexp_age_gender_2 = r'''(?<![.,'"])\b([1-9]\d?)[ \-]?years?-? ?old,? ''' + regexp_gender_kwds + r'\b'

regexp_age_gender_2_extracts = remainder['full_post_text'].str.extract(
        regexp_age_gender_2,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age',
        1: 'patient_gender'
    })

regexp_age_gender_2_extracts.dropna(inplace=True)

regexp_age_gender_2_extracts['patient_age'] = \
    regexp_age_gender_2_extracts['patient_age'].astype('int')

In [479]:
len(regexp_age_gender_2_extracts)/len(df_sample2)

0.0608
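
The extract, drop, and cast steps are repeated for every pattern, so they could be wrapped in a helper. The sketch below assumes a hypothetical function name, `extract_demographics`, and uses a simplified pattern and made-up texts for the demonstration.

```python
import re
import pandas as pd

def extract_demographics(texts: pd.Series, pattern: str, columns: list) -> pd.DataFrame:
    """Extract capture groups, drop rows without a match, cast age to int."""
    extracts = texts.str.extract(pattern, flags=re.IGNORECASE)
    extracts.columns = columns
    extracts = extracts.dropna()
    return extracts.astype({'patient_age': 'int'})

# Hypothetical texts and a simplified pattern, for illustration only
texts = pd.Series(
    ['I am a 23 year old male', 'no demographics here'],
    index=[7, 8],
)
simple_pattern = r'\b([1-9]\d?)[ \-]?years?-? ?old,? (male|female)\b'

result = extract_demographics(texts, simple_pattern, ['patient_age', 'patient_gender'])
print(result)
```

The original index is preserved, which is what allows matched rows to be removed from `remainder` with `index.isin`.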

In [481]:
regexp_age_gender_2_extracts.describe(include='all')

| | patient_age | patient_gender |
|---|---|---|
| count | 608.0 | 608 |
| unique | | 19 |
| top | | male |
| freq | | 301 |
| mean | 24.019737 | |
| std | 9.671862 | |
| min | 1.0 | |
| 25% | 19.0 | |
| 50% | 23.0 | |
| 75% | 27.0 | |


In [482]:
regexp_age_gender_2_extracts['patient_age'].value_counts().sort_index()

patient_age
1      1
2      6
3      1
4      2
5      3
6      2
7      1
8      3
9      2
10     3
11     1
12     4
13     5
14     5
15    17
16    20
17    23
18    30
19    36
20    58
21    43
22    35
23    37
24    34
25    33
26    27
27    27
28    23
29    13
30    24
31     8
32    13
33     9
34     4
35     6
36     8
37     3
38     6
39     3
40     5
41     2
42     3
45     2
46     2
47     1
48     1
50     1
52     1
54     1
55     1
57     1
60     1
67     1
70     3
74     1
87     1
91     1
Name: count, dtype: int64

In [483]:
num_freq_plot(regexp_age_gender_2_extracts, 'patient_age')

In [484]:
# Inspect matches with ages below 13 to check whether they are genuine
low_age_ind = regexp_age_gender_2_extracts[
    (regexp_age_gender_2_extracts['patient_age'] < 13)
].index

for i in low_age_ind[:5]:
    print('\n\n----------------')
    print('Extracted values:')
    print(f'''Age = {regexp_age_gender_2_extracts['patient_age'].loc[i]}''')
    print(f'''Gender = {regexp_age_gender_2_extracts['patient_gender'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted values:
Age = 2
Gender = son

Full text:
Vaccines | So my son is 5 weeks old today and one of my friends has a 2 year old son who is not vaccinated what so ever. Can we go over to their house to visit? Please no hateful comments, I’m a first time mom and I’m just trying my best.

Url:
https://www.reddit.com/r/AskDocs/comments/kgix3s/vaccines/


----------------
Extracted values:
Age = 2
Gender = son

Full text:
The flu and contagiousness | Male, 36, 5’8, 170lbs. Non smoker, social drinker, Caucasian. Take cymbalta, and various supplements (Vit D drops, elderberry, Vit B complex, Milk thistle).

I got the flu vaccine in February, late, but seem to have come down with it since Friday.  I have a wife and 2 year old son. My son was vaccinated, but my wife chose not to. I am currently on tamiflu (only 2 doses thus far) and am worried about my wife. We were intimate the night before I started showing symptoms.  Is it pretty much guaranteed she’s going to get it?


Based on the examples above, we don't need to filter out out-of-range values here.

In [485]:
num_freq_plot(regexp_age_gender_2_extracts, 'patient_age')

This pattern matches the following proportion of the data sample:

In [486]:
len(regexp_age_gender_2_extracts)/len(df_sample2)

0.0608

In [487]:
remainder = remainder[
    ~remainder.index.isin(regexp_age_gender_2_extracts.index)
].copy()

In [488]:
len(remainder)

6694

In [489]:
len(remainder)/len(df_sample2)

0.6694
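
The notebook applies each pattern only to the rows left unmatched by the previous ones. That cascade could be sketched as a loop, shown below. `cascade_extract`, the pattern names, and the simplified patterns are all hypothetical and serve only to illustrate the flow.

```python
import re
import pandas as pd

def cascade_extract(texts: pd.Series, patterns: dict) -> pd.DataFrame:
    """Try patterns in order; each row keeps the first pattern that matches."""
    remainder = texts
    parts = []
    for name, pattern in patterns.items():
        found = remainder.str.extract(pattern, flags=re.IGNORECASE).dropna()
        found.columns = ['patient_age', 'patient_gender']
        found['matched_pattern'] = name
        parts.append(found)
        # Only unmatched rows are passed on to the next pattern
        remainder = remainder[~remainder.index.isin(found.index)]
    return pd.concat(parts)

# Simplified, hypothetical patterns and texts for illustration
patterns = {
    'age_then_gender': r'\b([1-9]\d?) ?(F|M)\b',
    'years_old': r'\b([1-9]\d?) years? old (male|female)\b',
}
texts = pd.Series(['25F with a cough', '30 year old male, back pain', 'no info'])

result = cascade_extract(texts, patterns)
print(result)
```

Recording which pattern matched makes it easier to audit false positives per pattern later on.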

### Age and gender pattern 3

In [490]:
regexp_age_gender_3 = r'''(?<![.,'"])\b([1-9]\d?)[ \-]?y\.?/? ?o\.?,? ''' + regexp_gender_kwds + r'\b'

regexp_age_gender_3_extracts = remainder['full_post_text'].str.extract(
        regexp_age_gender_3,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age',
        1: 'patient_gender'
    })

regexp_age_gender_3_extracts.dropna(inplace=True)

regexp_age_gender_3_extracts['patient_age'] = \
    regexp_age_gender_3_extracts['patient_age'].astype('int')

In [491]:
len(regexp_age_gender_3_extracts)/len(df_sample2)

0.0191

In [492]:
regexp_age_gender_3_extracts.describe(include='all')

| | patient_age | patient_gender |
|---|---|---|
| count | 191.0 | 191 |
| unique | | 12 |
| top | | male |
| freq | | 88 |
| mean | 24.958115 | |
| std | 9.919589 | |
| min | 1.0 | |
| 25% | 19.5 | |
| 50% | 23.0 | |
| 75% | 28.5 | |


In [493]:
regexp_age_gender_3_extracts['patient_age'].value_counts().sort_index()

patient_age
1      1
2      2
3      1
4      1
6      1
12     1
14     3
15     2
16    11
17     4
18     7
19    14
20    14
21     8
22     8
23    18
24    10
25    16
26    10
27     4
28     7
29     6
30     7
31     4
32     3
33     7
34     1
35     1
36     1
37     2
38     2
39     4
40     1
43     1
46     1
47     1
48     1
56     1
58     1
60     1
70     1
73     1
Name: count, dtype: int64

In [494]:
num_freq_plot(regexp_age_gender_3_extracts, 'patient_age')

In [495]:
# Inspect matches with ages below 13 to check whether they are genuine
low_age_ind = regexp_age_gender_3_extracts[
    (regexp_age_gender_3_extracts['patient_age'] < 13)
].index

for i in low_age_ind[:5]:
    print('\n\n----------------')
    print('Extracted values:')
    print(f'''Age = {regexp_age_gender_3_extracts['patient_age'].loc[i]}''')
    print(f'''Gender = {regexp_age_gender_3_extracts['patient_gender'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted values:
Age = 4
Gender = son

Full text:
4yo son broke his arm, received the doctor's "narrative and impression." What exactly does this mean and is there any action I need to take? | M4 45 pounds

 XR: Elbow Left Complete

Clinical  History: Displaced simple supracondylar fracture without intercondylar  fracture of left humerus, initial encounter for close fracture.

Comparison: 4/24/2021

Impression:

Type I supracondylar fracture with unchanged alignment. Increased boy sclerosis about the fracture line which remains visible.

The anterior humeral line is respected and radiocapitellar alignment is preserved. A small effusion remains present.

Can someone ELI5 here for me? Thank you!

Url:
https://www.reddit.com/r/AskDocs/comments/nm1eai/4yo_son_broke_his_arm_received_the_doctors/


----------------
Extracted values:
Age = 2
Gender = boy

Full text:
2M son has a weird bug bite... Wide circle and red | Our 2YO boy loves to play outside. Caucasian, about 28/

Based on the examples above, we don't need to filter out out-of-range values here.

In [496]:
num_freq_plot(regexp_age_gender_3_extracts, 'patient_age')

This pattern matches the following proportion of the data sample:

In [497]:
len(regexp_age_gender_3_extracts)/len(df_sample2)

0.0191

In [498]:
remainder = remainder[
    ~remainder.index.isin(regexp_age_gender_3_extracts.index)
].copy()

In [499]:
len(remainder)

6503

In [500]:
len(remainder)/len(df_sample2)

0.6503

In [502]:
remainder['selftext'].isna().value_counts()

selftext
False    3847
True     2656
Name: count, dtype: int64

In [501]:
for i, r in remainder[remainder['selftext'].notna()].head().iterrows():
    print('\n\n----------------')

    print('\nFull text:')
    print(r['full_post_text'])
    print('\nUrl:')
    print(r['url'])



----------------

Full text:
Pain in groin, itchy scrotum/anus | Hi all. I've been experiencing the following symptoms off and on for about four months:

1) Pain in groin. Sometimes extending to testicles (more the right testicle but pain is felt in both)

2) Intense itchy feeling on scrotum and anus. It feels like bugs crawling. There is no redness or rash.

3) Increased urgency when needing to urinate.

4) (potentially a red herring and is unrelated but I'm not sure) increased frequency of bowel movements. 

I've seen three general practitioners and one urologist with these issues and no one has been any help. The urologist visit was recent and he said it could potentially be a hernia and referred to me a surgeon, this doesn't seem right to me and he didn't seem convinced himself. We can rule out any STIs (including crabs/scabies) as I've been tested for all of those (twice!) since all of this started, although I did have and was treated for gonorrhea about seven months ago. 

----

## Age only patterns

### Age pattern 1

In [503]:
regexp_age_1 = r'''\baged? ?[:\-]? ([1-9]\d?)\b'''

regexp_age_1_extracts = remainder['full_post_text'].str.extract(
        regexp_age_1,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age'
    })

regexp_age_1_extracts.dropna(inplace=True)

regexp_age_1_extracts['patient_age'] = \
    regexp_age_1_extracts['patient_age'].astype('int')

In [504]:
len(regexp_age_1_extracts)/len(df_sample2)

0.092
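
A quick check of age pattern 1 on made-up strings shows which spellings it accepts. The `examples` below are assumptions for illustration. As written, the pattern requires a space directly before the number, so a form like `Age:27` is not captured.

```python
import re
import pandas as pd

# Same pattern as in the cell above
regexp_age_1 = r'''\baged? ?[:\-]? ([1-9]\d?)\b'''

# Hypothetical snippets, made up for illustration
examples = pd.Series([
    'Age: 84',                    # template-style field
    'aged 30 with asthma',        # "aged" followed by the number
    'Age:27',                     # no space before the number
    'slept an average 12 hours',  # "age" inside another word
])

ages = examples.str.extract(regexp_age_1, flags=re.IGNORECASE, expand=False)
print(ages.tolist())
```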

In [505]:
regexp_age_1_extracts.describe(include='all')

| | patient_age |
|---|---|
| count | 920.0 |
| mean | 25.978261 |
| std | 9.625951 |
| min | 4.0 |
| 25% | 20.0 |
| 50% | 24.0 |
| 75% | 30.0 |
| max | 84.0 |


In [506]:
regexp_age_1_extracts['patient_age'].value_counts().sort_index()

patient_age
4      1
5      2
6      1
7      1
8      3
9      1
11     4
12     3
13     4
14     6
15    18
16    25
17    33
18    59
19    37
20    62
21    62
22    57
23    50
24    47
25    52
26    48
27    29
28    50
29    34
30    40
31    28
32    20
33    22
34    11
35    18
36     7
37     8
38     8
39     7
40     4
41     2
42     3
43     1
44     4
45     5
46     3
47     4
49     3
50     1
52     2
53     5
55     4
57     2
58     1
59     1
60     2
61     3
63     2
64     4
65     2
69     1
74     1
77     1
84     1
Name: count, dtype: int64

In [507]:
num_freq_plot(regexp_age_1_extracts, 'patient_age')

In [508]:
# Inspect matches with ages above 80 to check whether they are genuine
high_age_ind = regexp_age_1_extracts[
    (regexp_age_1_extracts['patient_age'] > 80)
].index

for i in high_age_ind[:5]:
    print('\n\n----------------')
    print('Extracted value:')
    print(f'''Age = {regexp_age_1_extracts['patient_age'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted value:
Age = 84

Full text:
Is it ethical (or legal) for a doctor to advise family members to NOT call 911 if a dementia patient has a medical emergency because they "have no quality of life"? | Age: 84
Sex: M
Height: 5.8
Weight: 140
Race: white
Duration: 5 years
Location: not relevant
Relevant Medical Issues: dementia, heart attack, stroke, enlarged prostate
Current Meds: Flomax, Coreg, Tranxene, aspirin

Url:
https://www.reddit.com/r/AskDocs/comments/6bhkn7/is_it_ethical_or_legal_for_a_doctor_to_advise/


Based on the examples above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [509]:
len(regexp_age_1_extracts)/len(df_sample2)

0.092

In [510]:
remainder_age_only = remainder[
    ~remainder.index.isin(regexp_age_1_extracts.index)
].copy()

In [511]:
len(remainder_age_only)

5583

In [512]:
len(remainder_age_only)/len(df_sample2)

0.5583

### Age pattern 2

In [513]:
regexp_age_2 = r'''\b([1-9]\d?)[ \-]?years?-? ?old\b'''

regexp_age_2_extracts = remainder_age_only['full_post_text'].str.extract(
        regexp_age_2,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age'
    })

regexp_age_2_extracts.dropna(inplace=True)

regexp_age_2_extracts['patient_age'] = \
    regexp_age_2_extracts['patient_age'].astype('int')

In [514]:
len(regexp_age_2_extracts)/len(df_sample2)

0.0509

In [515]:
regexp_age_2_extracts.describe(include='all')

| | patient_age |
|---|---|
| count | 509.0 |
| mean | 24.603143 |
| std | 12.293958 |
| min | 1.0 |
| 25% | 18.0 |
| 50% | 22.0 |
| 75% | 27.0 |
| max | 97.0 |


In [516]:
regexp_age_2_extracts['patient_age'].value_counts().sort_index()

patient_age
1     1
2     1
3     2
4     3
5     3
     ..
75    1
79    1
81    1
95    1
97    1
Name: count, Length: 61, dtype: int64

In [517]:
num_freq_plot(regexp_age_2_extracts, 'patient_age')

In [518]:
in_range_ind = regexp_age_2_extracts[
    (regexp_age_2_extracts['patient_age'] < 80)
].index

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted value:')
    print(f'''Age = {regexp_age_2_extracts['patient_age'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted value:
Age = 16

Full text:
Spinal cord inflammation or muscle? | 16 years old

Athletic

245lbs

Right below my shoulder blades I feel like the two muscles are consistently contracting. In other words they just feel extremely tight. I’ve never had this trouble in this particular area before. I There is a sharp pain sometimes on the spine. I went to the chiropractor about five hours ago. He popped that area and my neck. Muscles are still tight as of now. How can I tell if this is inflammation instead of a tweaked nerve?

Url:
https://www.reddit.com/r/AskDocs/comments/kejttn/spinal_cord_inflammation_or_muscle/


----------------
Extracted value:
Age = 23

Full text:
I am 23 years old and believe I may be exhibiting symptoms of ALS. | Hi All, 

About a week ago I began having a numb/weak feeling in both of my legs (can’t really describe it) even though they both seem to still be pretty strong. I can run, jump etc. Yesterday I went to the ER and they drew bloo

Based on the examples above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [519]:
len(regexp_age_2_extracts)/len(df_sample2)

0.0509

In [520]:
remainder_age_only = remainder_age_only[
    ~remainder_age_only.index.isin(regexp_age_2_extracts.index)
].copy()

In [521]:
len(remainder_age_only)

5074

In [522]:
len(remainder_age_only)/len(df_sample2)

0.5074

In [523]:
remainder['selftext'].isna().value_counts()

selftext
False    3847
True     2656
Name: count, dtype: int64

### Age pattern 3

In [524]:
regexp_age_3 = r'''\b([1-9]\d?)[ \-]?y\.?/ ?o\.?\b'''

regexp_age_3_extracts = remainder_age_only['full_post_text'].str.extract(
        regexp_age_3,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age'
    })

regexp_age_3_extracts.dropna(inplace=True)

regexp_age_3_extracts['patient_age'] = \
    regexp_age_3_extracts['patient_age'].astype('int')

In [525]:
len(regexp_age_3_extracts)/len(df_sample2)

0.0041
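
Note that `regexp_age_3`, as written, makes the slash mandatory (`/ ?`), whereas age and gender pattern 3 makes it optional (`/? ?`). The sketch below compares the two behaviours on made-up strings. `relaxed_age_3` is a hypothetical variant, not part of this notebook, and it has not been validated against the sample.

```python
import re
import pandas as pd

# Pattern as defined in the cell above: the slash is required
regexp_age_3 = r'''\b([1-9]\d?)[ \-]?y\.?/ ?o\.?\b'''

# Hypothetical variant with an optional slash (assumption, for comparison only)
relaxed_age_3 = r'''\b([1-9]\d?)[ \-]?y\.?/? ?o\.?\b'''

# Hypothetical snippets, made up for illustration
examples = pd.Series(['I am 16 y/o', 'I am 16yo', 'I am 16 y.o.'])

strict = examples.str.extract(regexp_age_3, flags=re.IGNORECASE, expand=False)
relaxed = examples.str.extract(relaxed_age_3, flags=re.IGNORECASE, expand=False)

print(pd.DataFrame({'text': examples, 'strict': strict, 'relaxed': relaxed}))
```

Only the `y/o` spelling is captured by the pattern as written, which may partly explain the small share it matches.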

In [526]:
regexp_age_3_extracts.describe(include='all')

| | patient_age |
|---|---|
| count | 41.0 |
| mean | 22.682927 |
| std | 9.72224 |
| min | 2.0 |
| 25% | 17.0 |
| 50% | 20.0 |
| 75% | 28.0 |
| max | 49.0 |


In [527]:
regexp_age_3_extracts['patient_age'].value_counts().sort_index()

patient_age
2     1
5     1
12    1
13    1
14    1
15    2
16    2
17    3
18    3
19    2
20    4
21    2
22    1
23    4
24    1
26    1
28    2
30    2
33    1
36    1
38    1
39    1
41    2
49    1
Name: count, dtype: int64

In [528]:
num_freq_plot(regexp_age_3_extracts, 'patient_age')

In [529]:
in_range_ind = regexp_age_3_extracts[
    (regexp_age_3_extracts['patient_age'] < 80)
].index

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted value:')
    print(f'''Age = {regexp_age_3_extracts['patient_age'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted value:
Age = 16

Full text:
Difficulty in breathing through nose | I cannot breathe through my nose easily, this in turns make me can't taste food. I do not have a runny nose, the amount of mucus is very little. I tried clearing the mucus, it only improved by a bit. Still having difficulty. I just checked my nasal passage and it seemed like it had shrinked, adding to the difficulty in breathing through my nose. I am 16 y/o. My BMI is in the acceptable range, however close to the overweight range. Help please, i want to be able to taste food without breathing so loudly through my nose.
Additional info: male, 164cm, white, weight not sure, has been i would say 3 months, i do not smoke or drink, however recently i have outbreaks of boils( so far i have not see any boils for at least 2 weeks.

Url:
https://www.reddit.com/r/AskDocs/comments/88khki/difficulty_in_breathing_through_nose/


----------------
Extracted value:
Age = 19

Full text:
Spontaneous fatigue? 

Based on the examples above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [530]:
len(regexp_age_3_extracts)/len(df_sample2)

0.0041

In [532]:
remainder_age_only = remainder_age_only[
    ~remainder_age_only.index.isin(regexp_age_3_extracts.index)
].copy()

In [533]:
len(remainder_age_only)

5033

In [534]:
len(remainder_age_only)/len(df_sample2)

0.5033
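
As a sanity check, the per-pattern shares reported above should add up to the complement of this remainder. The sketch below simply re-enters those printed shares by hand; the dictionary keys are hypothetical labels.

```python
# Shares of the 10,000-post sample reported above for each pattern
pattern_shares = {
    'age_gender_1': 0.2698,  # after the 12 to 90 range filter
    'age_gender_2': 0.0608,
    'age_gender_3': 0.0191,
    'age_1': 0.0920,
    'age_2': 0.0509,
    'age_3': 0.0041,
}

covered = sum(pattern_shares.values())
remaining = 1 - covered

print(f'covered:   {covered:.4f}')
print(f'remainder: {remaining:.4f}')
```

The six patterns together cover just under half of the sample, which agrees with the remainder share computed above.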

In [535]:
remainder_age_only['selftext'].isna().value_counts()

selftext
True     2653
False    2380
Name: count, dtype: int64

In [538]:
for i, r in remainder_age_only[remainder_age_only['selftext'].notna()].tail(20).iterrows():
    print('\n\n----------------')

    print('\nFull text:')
    print(r['full_post_text'])
    print('\nUrl:')
    print(r['url'])



----------------

Full text:
Urethritis and shit | Im a male 20 yrs old, So i had sex with a girl i met on tinder, after a day or two, my pee starts to burn that’s the only symptom that i have, after 1 month that’s when I finally go to see the doctor and learned that i have urethritis for 1 month. I can feel some lumps Inside the urethra, under the shaft of the penis in the urethra. I dont know what that means but the doctor said it’s because of the inflammation of the urethra, i dont know what to do. Will it go away with the antibiotics? Is it just really inflammation of the urethra?

Url:
https://www.reddit.com/r/AskDocs/comments/e2nbau/urethritis_and_shit/


----------------

Full text:
random onset of nausea in the middle of the night lasting 10 minutes | Hi,

last night I woke up from my dream and I started feeling really nauseous and lightheaded. No other symptoms.

i went to the bathroom to presumably vomit but I ended up just curling up on my cold bathroom floor for about 5 m