# Parse demographics  
The rules of the AskDocs subreddit require at least some demographics (age and sex) and encourage detailed demographics, diagnoses, medical history, and medication information.  

#### Notebook objectives:  
- Parse demographic data
- Parse any additional info  
- Save the resulting analysis dataset  

#### Steps:  
First, we work out the parsing steps on a small random sample of the data.  
1. [Load and sample data](#Load-and-sample-data)  
2. Parse age & sex/gender patterns  
3. Parse age only patterns
4. Parse sex/gender only patterns  

Next, we apply the parsing steps above to the entire dataset.  
5. [Load all data](#Load-all-data)  
6. [Apply the parsing steps](#Apply-the-parsing-steps)  
7. [Save the analysis dataset](#Save-the-analysis-dataset)
   

In [1]:
%run notebook_setup.ipynb

DATA_PATH='data/'
OUTPUT_PATH='output/'


num_freq_plot(df, field, color='')


## Load and sample data

In [2]:
df = pd.read_csv(
    DATA_PATH + 'reddit_askdocs_submissions_2017_to_20220121_analysis_ds.zip',
    low_memory=False
)

In [3]:
df.head()

Unnamed: 0,author,author_flair_text,domain,full_link,id,locked,num_comments,num_crossposts,over_18,score,selftext,title,url,crosspost_subreddits,full_post_text,created_utc_ns_dt,edited_utc_ns_dt
0,[deleted],,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbw...,7nbwtn,False,0,0.0,False,2,,Appendicitis removed 1 month ago but feel a pa...,https://www.reddit.com/r/AskDocs/comments/7nbw...,,Appendicitis removed 1 month ago but feel a pa...,1514764452000000000,
1,[deleted],,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbv...,7nbvln,False,1,0.0,False,1,,My grandma has neck/back pain and little to no...,https://www.reddit.com/r/AskDocs/comments/7nbv...,,My grandma has neck/back pain and little to no...,1514764055000000000,
2,DavisTheMagicSheep,This user has not yet been verified.,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbu...,7nburb,False,2,0.0,False,1,"I've had a cold for the last couple days now, ...",My ears feel like there is pressure inside of ...,https://www.reddit.com/r/AskDocs/comments/7nbu...,,My ears feel like there is pressure inside of ...,1514763799000000000,
3,Dontgetscooped,This user has not yet been verified.,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbs...,7nbsw2,False,1,0.0,False,1,(first about me : 32 white male 5 foot 5 225lb...,IBS maybe?,https://www.reddit.com/r/AskDocs/comments/7nbs...,,IBS maybe? | (first about me : 32 white male 5...,1514763188000000000,
4,AveryFenix,This user has not yet been verified.,self.AskDocs,https://www.reddit.com/r/AskDocs/comments/7nbo...,7nbolv,False,7,0.0,False,7,I've had these marks on my stomach ever since ...,Mole or scar? Should I be worried about melanoma?,https://www.reddit.com/r/AskDocs/comments/7nbo...,,Mole or scar? Should I be worried about melano...,1514761839000000000,


In [4]:
df_sample = df.sample(n=10_000, random_state=1)

In [5]:
del df

In [6]:
# Working copy of the sample that is safe to modify
df_sample2 = df_sample.copy()

## Parse age & sex/gender patterns

In [15]:
df_sample2['full_post_text'].str[:60].value_counts().head(10)

full_post_text
Help |                                                          3
Do cannabis withdrawal symptoms come and go through its dura    2
Tonsillitis question |                                          2
What do I have |                                                2
Ear infection |                                                 2
Why does my sternum “pop” if I’ve been sitting leaning over     2
Dull chest pain after infected by COVID that doesn't feel se    2
Does this need stitches? |                                      2
Bright Red Blood &amp; Clotting In Stool - NSAID/Alcohol | H    1
What are these red dots on my skin? | Hi all, on Wednesday t    1
Name: count, dtype: int64

### Age and gender pattern 1  
Pattern like:  
 - 28F  
 - 28 female  
 - 28, AMAB  
 - etc.  

In [17]:
regexp_gender_kwds = r'(F|M|AFAB|AMAB|MTF|FTM|female|male|boy|girl|man|woman|father|mother|daughter|son|brother|sister|grandma|grandpa|grandmother|grandfather)'

In [18]:
regexp_age_gender_1 = r'''(?<![.,'"])\b([1-9]\d?)[,/]? ?''' + regexp_gender_kwds + r'\b'

regexp_age_gender_1_extracts = df_sample2['full_post_text'].str.extract(
        regexp_age_gender_1,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age',
        1: 'patient_gender'
    })

regexp_age_gender_1_extracts.dropna(inplace=True)

regexp_age_gender_1_extracts['patient_age'] = \
    regexp_age_gender_1_extracts['patient_age'].astype('int')
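
To see what this pattern captures, a minimal sketch can run it over a few short strings with the standard `re` module. The example strings and the `first_match` helper are hypothetical and not taken from the dataset; the keyword group and the pattern are copied from the cells above.

```python
import re

# Keyword group and pattern copied from the cells above.
regexp_gender_kwds = r'(F|M|AFAB|AMAB|MTF|FTM|female|male|boy|girl|man|woman|father|mother|daughter|son|brother|sister|grandma|grandpa|grandmother|grandfather)'
regexp_age_gender_1 = r'''(?<![.,'"])\b([1-9]\d?)[,/]? ?''' + regexp_gender_kwds + r'\b'


def first_match(pattern, text):
    """Hypothetical helper: return the groups of the first match, or None."""
    m = re.search(pattern, text, flags=re.IGNORECASE)
    return m.groups() if m else None


# Hypothetical example strings, for illustration only.
examples = [
    '25F. Forehead hurts to touch',   # age directly followed by a letter
    'I am a 28 female with a cough',  # age, space, keyword
    '28, AMAB, no medications',       # age, comma, keyword
    'I am 1.5 m tall',                # decimal number, should not match
]
results = [first_match(regexp_age_gender_1, t) for t in examples]
for text, groups in zip(examples, results):
    print(f'{text!r} -> {groups}')
```

The last string shows what the negative lookbehind guards against: without it, the `5 m` in `1.5 m` would be read as age 5 and gender `m`.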

In [19]:
len(regexp_age_gender_1_extracts)/len(df_sample2)

0.3183

In [20]:
regexp_age_gender_1_extracts.describe(include='all')

Unnamed: 0,patient_age,patient_gender
count,3183.0,3183
unique,,17
top,,F
freq,,1151
mean,25.121583,
std,8.937768,
min,1.0,
25%,20.0,
50%,23.0,
75%,28.0,


In [21]:
regexp_age_gender_1_extracts['patient_age'].value_counts().sort_index()

patient_age
1     2
2     7
3     5
4     1
5     3
     ..
82    1
85    1
91    1
98    1
99    1
Name: count, Length: 75, dtype: int64

In [22]:
num_freq_plot(regexp_age_gender_1_extracts, 'patient_age')

In [23]:
upper_bound = 90
lower_bound = 12
in_range_ind = regexp_age_gender_1_extracts[
    (regexp_age_gender_1_extracts['patient_age'] <= upper_bound) 
    & (regexp_age_gender_1_extracts['patient_age'] >= lower_bound)
].index

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted values:')
    print(f'''Age = {regexp_age_gender_1_extracts['patient_age'].loc[i]}''')
    print(f'''Gender = {regexp_age_gender_1_extracts['patient_gender'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted values:
Age = 18
Gender = M

Full text:
I scratched open a little bumb on my head and now my mom got concerned | For as long as I [18M] can remember I've had this little, red (I think) bumb on my head a few centimeters above my ear (so under my hair). Today I scratched it open and it started bleeding a bit so I informed my mom about the bumb and she got quiet concerned. Is this something to worry about?

Url:
https://www.reddit.com/r/AskDocs/comments/j4z0az/i_scratched_open_a_little_bumb_on_my_head_and_now/


----------------
Extracted values:
Age = 25
Gender = F

Full text:
Forehead nerve hurts to touch?? | 25F. I found this spot (by touch, not visible) at the right side of my hair line. Ever so slightly raised and when I lightly graze it, it shoots an umbrella of headache on that side of my head. Is this weird??? I’m guessing it’s a nerve? Is it odd for it to be so superficial?

Url:
https://www.reddit.com/r/AskDocs/comments/ft6zjk/forehead_nerve_hurts_to

In [24]:
# Filter out out-of-range values
regexp_age_gender_1_extracts = regexp_age_gender_1_extracts.loc[in_range_ind]

In [25]:
num_freq_plot(regexp_age_gender_1_extracts, 'patient_age')

This pattern matches the following proportion of the data sample:

In [26]:
len(regexp_age_gender_1_extracts)/len(df_sample2)

0.3152

In [27]:
remainder = df_sample2[
    ~df_sample2.index.isin(regexp_age_gender_1_extracts.index)
].copy()

In [28]:
len(remainder)

6848
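
The extract-then-exclude step above repeats for every pattern in this notebook: each pattern only sees the posts that earlier patterns did not match. A minimal pure-Python sketch of that cascade follows. It does not use pandas, and the `cascade_extract` helper, the two simplified patterns, and the post texts are all hypothetical.

```python
import re


def cascade_extract(texts, patterns):
    """Try each pattern in order; the first pattern that matches claims the post.

    texts: dict mapping a post id to its text.
    patterns: list of (name, regex) pairs, tried in order.
    Returns (matches, remainder), where matches maps post id -> (name, groups)
    and remainder holds the posts that no pattern matched.
    """
    matches = {}
    remainder = dict(texts)
    for name, regex in patterns:
        compiled = re.compile(regex, flags=re.IGNORECASE)
        # Iterate over a snapshot so that entries can be removed safely.
        for post_id, text in list(remainder.items()):
            m = compiled.search(text)
            if m:
                matches[post_id] = (name, m.groups())
                del remainder[post_id]
    return matches, remainder


# Hypothetical, simplified patterns and posts for illustration only.
patterns = [
    ('age_gender', r'\b([1-9]\d?) ?(F|M)\b'),
    ('age_only', r'\bage:? ?([1-9]\d?)\b'),
]
texts = {
    'a': '25F with a sore throat',
    'b': 'Age: 84, dementia',
    'c': 'What is this rash?',
}
matches, remainder = cascade_extract(texts, patterns)
print(matches)
print(remainder)
```

Ordering the patterns from most to least specific keeps a post such as `25F` from being claimed by an age-only pattern that would discard the gender.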

### Age and gender pattern 2  
Pattern like:  
 - 28 year-old F  
 - 28-years-old female  
 - 28 years old, AMAB  
 - etc.  

In [29]:
regexp_age_gender_2 = r'''(?<![.,'"])\b([1-9]\d?)[ \-]?years?-? ?old,? ''' + regexp_gender_kwds + r'\b'

regexp_age_gender_2_extracts = remainder['full_post_text'].str.extract(
        regexp_age_gender_2,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age',
        1: 'patient_gender'
    })

regexp_age_gender_2_extracts.dropna(inplace=True)

regexp_age_gender_2_extracts['patient_age'] = \
    regexp_age_gender_2_extracts['patient_age'].astype('int')
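
As a quick check of the "years old" variants this pattern accepts, the sketch below runs it over a few strings with the standard `re` module. The `first_match` helper and the first four strings are hypothetical; the last string is shortened from a post in the sample output of this section.

```python
import re

# Keyword group and pattern copied from the cells above.
regexp_gender_kwds = r'(F|M|AFAB|AMAB|MTF|FTM|female|male|boy|girl|man|woman|father|mother|daughter|son|brother|sister|grandma|grandpa|grandmother|grandfather)'
regexp_age_gender_2 = r'''(?<![.,'"])\b([1-9]\d?)[ \-]?years?-? ?old,? ''' + regexp_gender_kwds + r'\b'


def first_match(pattern, text):
    """Hypothetical helper: return the groups of the first match, or None."""
    m = re.search(pattern, text, flags=re.IGNORECASE)
    return m.groups() if m else None


examples = [
    '28 year-old F',
    '28-years-old female',
    '28 years old, AMAB',
    'pain for 2 years',                   # no "old", should not match
    'I have a wife and 2 year old son.',  # matches a relative, not the poster
]
results = [first_match(regexp_age_gender_2, t) for t in examples]
for text, groups in zip(examples, results):
    print(f'{text!r} -> {groups}')
```

The last string illustrates that the pattern captures whichever age and gender phrase appears first, which may describe a relative rather than the poster.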

In [30]:
len(regexp_age_gender_2_extracts)/len(df_sample2)

0.0603

In [31]:
regexp_age_gender_2_extracts.describe(include='all')

Unnamed: 0,patient_age,patient_gender
count,603.0,603
unique,,20
top,,male
freq,,298
mean,24.0199,
std,9.710234,
min,1.0,
25%,19.0,
50%,23.0,
75%,27.0,


In [32]:
regexp_age_gender_2_extracts['patient_age'].value_counts().sort_index()

patient_age
1      1
2      6
3      1
4      2
5      3
6      2
7      1
8      3
9      2
10     3
11     1
12     4
13     5
14     5
15    17
16    20
17    23
18    30
19    36
20    57
21    43
22    35
23    36
24    34
25    32
26    26
27    26
28    23
29    12
30    25
31     8
32    13
33     9
34     4
35     6
36     8
37     3
38     6
39     3
40     5
41     2
42     3
45     2
46     2
47     1
48     1
50     1
52     1
54     1
55     1
57     1
60     1
67     1
70     3
74     1
87     1
91     1
Name: count, dtype: int64

In [33]:
num_freq_plot(regexp_age_gender_2_extracts, 'patient_age')

In [34]:
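# The bounds are kept for reference but are not applied to this pattern.
upper_bound = 90
lower_bound = 12
# Spot-check the youngest extracted ages (under 13); despite its name,
# `in_range_ind` holds the rows selected for inspection.
in_range_ind = regexp_age_gender_2_extracts[
    (regexp_age_gender_2_extracts['patient_age'] < 13)
].index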

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted values:')
    print(f'''Age = {regexp_age_gender_2_extracts['patient_age'].loc[i]}''')
    print(f'''Gender = {regexp_age_gender_2_extracts['patient_gender'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted values:
Age = 2
Gender = son

Full text:
Vaccines | So my son is 5 weeks old today and one of my friends has a 2 year old son who is not vaccinated what so ever. Can we go over to their house to visit? Please no hateful comments, I’m a first time mom and I’m just trying my best.

Url:
https://www.reddit.com/r/AskDocs/comments/kgix3s/vaccines/


----------------
Extracted values:
Age = 2
Gender = son

Full text:
The flu and contagiousness | Male, 36, 5’8, 170lbs. Non smoker, social drinker, Caucasian. Take cymbalta, and various supplements (Vit D drops, elderberry, Vit B complex, Milk thistle).

I got the flu vaccine in February, late, but seem to have come down with it since Friday.  I have a wife and 2 year old son. My son was vaccinated, but my wife chose not to. I am currently on tamiflu (only 2 doses thus far) and am worried about my wife. We were intimate the night before I started showing symptoms.  Is it pretty much guaranteed she’s going to get it?


Based on the spot checks above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [35]:
len(regexp_age_gender_2_extracts)/len(df_sample2)

0.0603

In [36]:
remainder = remainder[
    ~remainder.index.isin(regexp_age_gender_2_extracts.index)
].copy()

In [37]:
len(remainder)

6245

In [38]:
len(remainder)/len(df_sample2)

0.6245

### Age and gender pattern 3  
Pattern like:  
 - 28YO F  
 - 28 y.o. female  
 - 28 y/o, AMAB  
 - etc.  

In [39]:
regexp_age_gender_3 = r'''(?<![.,'"])\b([1-9]\d?)[ \-]?y\.?/? ?o\.?,? ''' + regexp_gender_kwds + r'\b'

regexp_age_gender_3_extracts = remainder['full_post_text'].str.extract(
        regexp_age_gender_3,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age',
        1: 'patient_gender'
    })

regexp_age_gender_3_extracts.dropna(inplace=True)

regexp_age_gender_3_extracts['patient_age'] = \
    regexp_age_gender_3_extracts['patient_age'].astype('int')
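
The abbreviated "y/o" spellings can be checked the same way. This is a minimal sketch with the standard `re` module; the `first_match` helper and the example strings are hypothetical, apart from `4yo son broke his arm`, which is shortened from a post title in the sample output of this section.

```python
import re

# Keyword group and pattern copied from the cells above.
regexp_gender_kwds = r'(F|M|AFAB|AMAB|MTF|FTM|female|male|boy|girl|man|woman|father|mother|daughter|son|brother|sister|grandma|grandpa|grandmother|grandfather)'
regexp_age_gender_3 = r'''(?<![.,'"])\b([1-9]\d?)[ \-]?y\.?/? ?o\.?,? ''' + regexp_gender_kwds + r'\b'


def first_match(pattern, text):
    """Hypothetical helper: return the groups of the first match, or None."""
    m = re.search(pattern, text, flags=re.IGNORECASE)
    return m.groups() if m else None


examples = [
    '28YO F',
    '28 y.o. female',
    '28 y/o, AMAB',
    '4yo son broke his arm',
    'I did 10 yoga classes',  # "yo" inside a word, should not match
]
results = [first_match(regexp_age_gender_3, t) for t in examples]
for text, groups in zip(examples, results):
    print(f'{text!r} -> {groups}')
```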

In [40]:
len(regexp_age_gender_3_extracts)/len(df_sample2)

0.0191

In [41]:
regexp_age_gender_3_extracts.describe(include='all')

Unnamed: 0,patient_age,patient_gender
count,191.0,191
unique,,12
top,,male
freq,,88
mean,24.958115,
std,9.919589,
min,1.0,
25%,19.5,
50%,23.0,
75%,28.5,


In [42]:
regexp_age_gender_3_extracts['patient_age'].value_counts().sort_index()

patient_age
1      1
2      2
3      1
4      1
6      1
12     1
14     3
15     2
16    11
17     4
18     7
19    14
20    14
21     8
22     8
23    18
24    10
25    16
26    10
27     4
28     7
29     6
30     7
31     4
32     3
33     7
34     1
35     1
36     1
37     2
38     2
39     4
40     1
43     1
46     1
47     1
48     1
56     1
58     1
60     1
70     1
73     1
Name: count, dtype: int64

In [43]:
num_freq_plot(regexp_age_gender_3_extracts, 'patient_age')

In [44]:
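# The bounds are kept for reference but are not applied to this pattern.
upper_bound = 90
lower_bound = 12
# Spot-check the youngest extracted ages (under 13); despite its name,
# `in_range_ind` holds the rows selected for inspection.
in_range_ind = regexp_age_gender_3_extracts[
    (regexp_age_gender_3_extracts['patient_age'] < 13)
].index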

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted values:')
    print(f'''Age = {regexp_age_gender_3_extracts['patient_age'].loc[i]}''')
    print(f'''Gender = {regexp_age_gender_3_extracts['patient_gender'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted values:
Age = 4
Gender = son

Full text:
4yo son broke his arm, received the doctor's "narrative and impression." What exactly does this mean and is there any action I need to take? | M4 45 pounds

 XR: Elbow Left Complete

Clinical  History: Displaced simple supracondylar fracture without intercondylar  fracture of left humerus, initial encounter for close fracture.

Comparison: 4/24/2021

Impression:

Type I supracondylar fracture with unchanged alignment. Increased boy sclerosis about the fracture line which remains visible.

The anterior humeral line is respected and radiocapitellar alignment is preserved. A small effusion remains present.

Can someone ELI5 here for me? Thank you!

Url:
https://www.reddit.com/r/AskDocs/comments/nm1eai/4yo_son_broke_his_arm_received_the_doctors/


----------------
Extracted values:
Age = 2
Gender = boy

Full text:
2M son has a weird bug bite... Wide circle and red | Our 2YO boy loves to play outside. Caucasian, about 28/

Based on the spot checks above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [45]:
len(regexp_age_gender_3_extracts)/len(df_sample2)

0.0191

In [46]:
remainder = remainder[
    ~remainder.index.isin(regexp_age_gender_3_extracts.index)
].copy()

In [47]:
len(remainder)

6054

In [48]:
len(remainder)/len(df_sample2)

0.6054

### Age and gender pattern 4  
Pattern like:  
 - F28  
 - female28   
 - etc.

In [49]:
regexp_age_gender_4 = (
    r'\b' 
    + regexp_gender_kwds 
    + r'''([1-9]\d)\b'''
)

regexp_age_gender_4_extracts = remainder['full_post_text'].str.extract(
        regexp_age_gender_4,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_gender', 
        1: 'patient_age',
    })

regexp_age_gender_4_extracts.dropna(inplace=True)

regexp_age_gender_4_extracts['patient_age'] = \
    regexp_age_gender_4_extracts['patient_age'].astype('int')
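
This pattern reverses the group order (gender first, then age) and requires exactly two digits. A minimal sketch with the standard `re` module shows both properties; the `first_match` helper is hypothetical, and the example strings are either hypothetical or shortened from posts shown in this notebook.

```python
import re

# Keyword group and pattern copied from the cells above.
regexp_gender_kwds = r'(F|M|AFAB|AMAB|MTF|FTM|female|male|boy|girl|man|woman|father|mother|daughter|son|brother|sister|grandma|grandpa|grandmother|grandfather)'
regexp_age_gender_4 = r'\b' + regexp_gender_kwds + r'''([1-9]\d)\b'''


def first_match(pattern, text):
    """Hypothetical helper: return the groups of the first match, or None."""
    m = re.search(pattern, text, flags=re.IGNORECASE)
    return m.groups() if m else None


examples = [
    'F28',
    'female28 with asthma',
    'messing with my(F19) cycle',
    'M4 45 pounds',  # single-digit age, should not match
]
results = [first_match(regexp_age_gender_4, t) for t in examples]
for text, groups in zip(examples, results):
    print(f'{text!r} -> {groups}')
```

Because `([1-9]\d)` has no optional digit, ages below 10 are never captured by this pattern, which is consistent with the minimum age of 13 in the summary statistics for this pattern.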

In [50]:
len(regexp_age_gender_4_extracts)/len(df_sample2)

0.0213

In [51]:
regexp_age_gender_4_extracts.describe(include='all')

Unnamed: 0,patient_gender,patient_age
count,213,213.0
unique,6,
top,F,
freq,102,
mean,,24.169014
std,,8.672206
min,,13.0
25%,,19.0
50%,,22.0
75%,,27.0


In [52]:
regexp_age_gender_4_extracts['patient_age'].value_counts().sort_index()

patient_age
13     5
14     4
15     5
16     8
17     9
18    13
19    16
20    19
21    13
22    15
23    16
24     6
25    19
26     9
27     8
28    10
29     8
30     6
31     2
32     3
33     4
35     2
37     1
41     1
42     1
43     1
44     1
45     1
46     1
52     1
55     1
60     1
61     1
67     1
68     1
Name: count, dtype: int64

In [53]:
num_freq_plot(regexp_age_gender_4_extracts, 'patient_age')

In [54]:
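# The bounds are kept for reference but are not applied to this pattern.
upper_bound = 90
lower_bound = 12
# Spot-check the younger extracted ages (under 20); despite its name,
# `in_range_ind` holds the rows selected for inspection.
in_range_ind = regexp_age_gender_4_extracts[
    (regexp_age_gender_4_extracts['patient_age'] < 20)
].index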

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted values:')
    print(f'''Age = {regexp_age_gender_4_extracts['patient_age'].loc[i]}''')
    print(f'''Gender = {regexp_age_gender_4_extracts['patient_gender'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted values:
Age = 19
Gender = F

Full text:
Could my birth control/PCOS be messing with my(F19) cycle or could it be something else? | Female, 19, 5’3, 80kg, PCOS

I have PCOS and I started taking birth control during the first day of my last period. My period only lasted 3 days and it was really light compared to most of my periods but the blood was bright red and there were clots so I assumed it was my period and not implantation bleeding. I had protected sex on the 5th day once I was sure my period had ended. I bled again on the 6th day but it was really light, bright red with like 1 clot and had spotting/brown discharge for the 2 days following that. I’m assuming it was caused by the sex and any uterine contractions I might have experienced as well as the pills stopping my period early.

I am currently on the 11th day of my current cycle and I’m starting ovulation (on the first day of my fertile period according to my period tracking app), but for the first

Based on the spot checks above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [55]:
len(regexp_age_gender_4_extracts)/len(df_sample2)

0.0213

In [56]:
remainder = remainder[
    ~remainder.index.isin(regexp_age_gender_4_extracts.index)
].copy()

In [57]:
len(remainder)

5841

In [58]:
len(remainder)/len(df_sample2)

0.5841

In [59]:
remainder['selftext'].isna().value_counts()

selftext
False    3235
True     2606
Name: count, dtype: int64

In [265]:
for i, r in remainder[remainder['selftext'].notna()].head().iterrows():
    print('\n\n----------------')

    print('\nFull text:')
    print(r['full_post_text'])
    print('\nUrl:')
    print(r['url'])



----------------

Full text:
Pain in groin, itchy scrotum/anus | Hi all. I've been experiencing the following symptoms off and on for about four months:

1) Pain in groin. Sometimes extending to testicles (more the right testicle but pain is felt in both)

2) Intense itchy feeling on scrotum and anus. It feels like bugs crawling. There is no redness or rash.

3) Increased urgency when needing to urinate.

4) (potentially a red herring and is unrelated but I'm not sure) increased frequency of bowel movements. 

I've seen three general practitioners and one urologist with these issues and no one has been any help. The urologist visit was recent and he said it could potentially be a hernia and referred to me a surgeon, this doesn't seem right to me and he didn't seem convinced himself. We can rule out any STIs (including crabs/scabies) as I've been tested for all of those (twice!) since all of this started, although I did have and was treated for gonorrhea about seven months ago. 

----

## Parse age only patterns

### Age pattern 1

In [266]:
regexp_age_1 = r'''\baged? ?[:\-]? ?([1-9]\d?)\b'''

regexp_age_1_extracts = remainder['full_post_text'].str.extract(
        regexp_age_1,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age'
    })

regexp_age_1_extracts.dropna(inplace=True)

regexp_age_1_extracts['patient_age'] = \
    regexp_age_1_extracts['patient_age'].astype('int')
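
This pattern targets an explicit "age" label followed by a one- or two-digit number. The sketch below shows what it accepts, using the standard `re` module; the `extract_age` helper and the example strings are hypothetical.

```python
import re

# Pattern copied from the cell above.
regexp_age_1 = r'''\baged? ?[:\-]? ?([1-9]\d?)\b'''


def extract_age(text):
    """Hypothetical helper: return the first captured age as an int, or None."""
    m = re.search(regexp_age_1, text, flags=re.IGNORECASE)
    return int(m.group(1)) if m else None


examples = [
    'Age: 84',
    'aged 30',
    'Age-25, male',
    'Average 20 mg',  # "age" inside another word, should not match
    'Age: 105',       # three digits, should not match
]
ages = [extract_age(t) for t in examples]
for text, age in zip(examples, ages):
    print(f'{text!r} -> {age}')
```

The leading `\b` keeps words such as "Average" from matching, and the trailing `\b` rejects numbers with more than two digits.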

In [267]:
len(regexp_age_1_extracts)/len(df_sample2)

0.0942

In [268]:
regexp_age_1_extracts.describe(include='all')

Unnamed: 0,patient_age
count,942.0
mean,25.619958
std,9.490526
min,4.0
25%,20.0
50%,24.0
75%,29.0
max,84.0


In [269]:
regexp_age_1_extracts['patient_age'].value_counts().sort_index()

patient_age
4      2
5      2
6      1
7      1
8      3
9      1
11     4
12     3
13     4
14     7
15    21
16    29
17    38
18    62
19    44
20    62
21    58
22    61
23    51
24    48
25    52
26    50
27    30
28    50
29    33
30    42
31    27
32    20
33    21
34    10
35    17
36     6
37     8
38     7
39     7
40     4
41     2
42     3
43     1
44     4
45     5
46     4
47     4
49     3
50     1
52     2
53     4
55     3
57     1
58     1
59     1
60     2
61     3
63     2
64     4
65     2
69     1
74     1
77     1
84     1
Name: count, dtype: int64

In [270]:
num_freq_plot(regexp_age_1_extracts, 'patient_age')

In [271]:
# Spot-check the oldest extracted ages (over 80).
in_range_ind = regexp_age_1_extracts[
    (regexp_age_1_extracts['patient_age'] > 80)
].index

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted value:')
    print(f'''Age = {regexp_age_1_extracts['patient_age'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted value:
Age = 84

Full text:
Is it ethical (or legal) for a doctor to advise family members to NOT call 911 if a dementia patient has a medical emergency because they "have no quality of life"? | Age: 84
Sex: M
Height: 5.8
Weight: 140
Race: white
Duration: 5 years
Location: not relevant
Relevant Medical Issues: dementia, heart attack, stroke, enlarged prostate
Current Meds: Flomax, Coreg, Tranxene, aspirin

Url:
https://www.reddit.com/r/AskDocs/comments/6bhkn7/is_it_ethical_or_legal_for_a_doctor_to_advise/


Based on the spot checks above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [272]:
len(regexp_age_1_extracts)/len(df_sample2)

0.0942

In [273]:
remainder_age_only = remainder[
    ~remainder.index.isin(regexp_age_1_extracts.index)
].copy()

In [274]:
len(remainder_age_only)

4899

In [275]:
len(remainder_age_only)/len(df_sample2)

0.4899

### Age pattern 2

In [276]:
regexp_age_2 = r'''\b([1-9]\d?)[ \-]?ye?a?rs?-? ?old\b'''

regexp_age_2_extracts = remainder_age_only['full_post_text'].str.extract(
        regexp_age_2,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age'
    })

regexp_age_2_extracts.dropna(inplace=True)

regexp_age_2_extracts['patient_age'] = \
    regexp_age_2_extracts['patient_age'].astype('int')
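
This pattern covers "N years old" and its abbreviated spellings without requiring a gender keyword. A minimal sketch with the standard `re` module follows; the `extract_age` helper and the example strings are hypothetical.

```python
import re

# Pattern copied from the cell above.
regexp_age_2 = r'''\b([1-9]\d?)[ \-]?ye?a?rs?-? ?old\b'''


def extract_age(text):
    """Hypothetical helper: return the first captured age as an int, or None."""
    m = re.search(regexp_age_2, text, flags=re.IGNORECASE)
    return int(m.group(1)) if m else None


examples = [
    '16 years old',
    '23-year-old',
    '30 yrs old',
    '5yr old',
    '2 years ago',   # no "old", should not match
    '3 months old',  # months rather than years, should not match
]
ages = [extract_age(t) for t in examples]
for text, age in zip(examples, ages):
    print(f'{text!r} -> {age}')
```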

In [277]:
len(regexp_age_2_extracts)/len(df_sample2)

0.0556

In [278]:
regexp_age_2_extracts.describe(include='all')

Unnamed: 0,patient_age
count,556.0
mean,25.023381
std,12.402172
min,1.0
25%,19.0
50%,23.0
75%,28.0
max,98.0


In [279]:
regexp_age_2_extracts['patient_age'].value_counts().sort_index()

patient_age
1     1
2     1
3     3
4     2
5     5
     ..
79    1
81    1
95    1
97    1
98    1
Name: count, Length: 64, dtype: int64

In [280]:
num_freq_plot(regexp_age_2_extracts, 'patient_age')

In [281]:
# Spot-check a few extracted ages (under 80).
in_range_ind = regexp_age_2_extracts[
    (regexp_age_2_extracts['patient_age'] < 80)
].index

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted value:')
    print(f'''Age = {regexp_age_2_extracts['patient_age'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted value:
Age = 16

Full text:
Spinal cord inflammation or muscle? | 16 years old

Athletic

245lbs

Right below my shoulder blades I feel like the two muscles are consistently contracting. In other words they just feel extremely tight. I’ve never had this trouble in this particular area before. I There is a sharp pain sometimes on the spine. I went to the chiropractor about five hours ago. He popped that area and my neck. Muscles are still tight as of now. How can I tell if this is inflammation instead of a tweaked nerve?

Url:
https://www.reddit.com/r/AskDocs/comments/kejttn/spinal_cord_inflammation_or_muscle/


----------------
Extracted value:
Age = 23

Full text:
I am 23 years old and believe I may be exhibiting symptoms of ALS. | Hi All, 

About a week ago I began having a numb/weak feeling in both of my legs (can’t really describe it) even though they both seem to still be pretty strong. I can run, jump etc. Yesterday I went to the ER and they drew bloo

Based on the spot checks above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [282]:
len(regexp_age_2_extracts)/len(df_sample2)

0.0556

In [283]:
remainder_age_only = remainder_age_only[
    ~remainder_age_only.index.isin(regexp_age_2_extracts.index)
].copy()

In [284]:
len(remainder_age_only)

4343

In [285]:
len(remainder_age_only)/len(df_sample2)

0.4343

In [286]:
# Note: this checks `remainder` (not `remainder_age_only`), so the counts match In [59].
remainder['selftext'].isna().value_counts()

selftext
False    3235
True     2606
Name: count, dtype: int64

### Age pattern 3

In [287]:
regexp_age_3 = r'''\b([1-9]\d?)[ \-]?y\.?/? ?o\.?\b'''

regexp_age_3_extracts = remainder_age_only['full_post_text'].str.extract(
        regexp_age_3,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age'
    })

regexp_age_3_extracts.dropna(inplace=True)

regexp_age_3_extracts['patient_age'] = \
    regexp_age_3_extracts['patient_age'].astype('int')
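
This pattern is the age-only counterpart of the "y/o" abbreviations. The sketch below checks a few spellings with the standard `re` module; the `extract_age` helper and the example strings are hypothetical, apart from `41yo WF`, which comes from a post shown in this section.

```python
import re

# Pattern copied from the cell above.
regexp_age_3 = r'''\b([1-9]\d?)[ \-]?y\.?/? ?o\.?\b'''


def extract_age(text):
    """Hypothetical helper: return the first captured age as an int, or None."""
    m = re.search(regexp_age_3, text, flags=re.IGNORECASE)
    return int(m.group(1)) if m else None


examples = [
    '41yo WF',
    '28 y/o',
    '28 y.o. patient',
    'I did 10 yoga classes',  # "yo" inside a word, should not match
]
ages = [extract_age(t) for t in examples]
for text, age in zip(examples, ages):
    print(f'{text!r} -> {age}')
```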

In [288]:
len(regexp_age_3_extracts)/len(df_sample2)

0.0108

In [289]:
regexp_age_3_extracts.describe(include='all')

Unnamed: 0,patient_age
count,108.0
mean,26.703704
std,12.033644
min,1.0
25%,20.0
50%,24.0
75%,30.25
max,72.0


In [290]:
regexp_age_3_extracts['patient_age'].value_counts().sort_index()

patient_age
1     1
4     2
5     2
12    1
13    1
14    1
15    2
16    2
17    3
18    5
19    5
20    8
21    2
22    6
23    7
24    8
25    6
26    3
27    2
28    3
29    5
30    6
31    3
32    1
33    3
34    1
35    2
36    1
38    1
39    1
40    1
41    3
42    1
45    1
49    1
50    1
55    2
58    1
60    2
72    1
Name: count, dtype: int64

In [291]:
num_freq_plot(regexp_age_3_extracts, 'patient_age')

In [292]:
# Spot-check a few extracted ages (under 80).
in_range_ind = regexp_age_3_extracts[
    (regexp_age_3_extracts['patient_age'] < 80)
].index

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted value:')
    print(f'''Age = {regexp_age_3_extracts['patient_age'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted value:
Age = 41

Full text:
What is a breast "mass effect?" | Trying again, thank you automod!

41yo WF, 5 ft, 160 pounds, medical Hx includes depression, allergies, asthma, and pacemaker. Taking no medications that affect the current situation which has been going on for 2 weeks.

Two weeks ago I found a lump on my left breast, small, less than pea-sized, palpable under the skin but not visible through the skin.  I went to my family doctor and was scheduled for a mammogram.

After the mammogram I received a letter stating the mammogram showed need for further evaluation and images. I called the imaging center to ask what exactly had been seen; they couldn't tell me much other than a "mass effect" was seen in my left breast, which is where the lump is.  I can't find anything to tell me exactly what a "mass effect" in the breast is, and was hoping someone here would be able to help clarify. I am scheduled for focal point compressions and possible ultrasound 

Based on the spot checks above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [293]:
len(regexp_age_3_extracts)/len(df_sample2)

0.0108

In [294]:
remainder_age_only = remainder_age_only[
    ~remainder_age_only.index.isin(regexp_age_3_extracts.index)
].copy()

In [295]:
len(remainder_age_only)

4235

In [296]:
len(remainder_age_only)/len(df_sample2)

0.4235

### Age pattern 4

In [310]:
regexp_age_4 = r'''\bI'? ?a?m ([1-9]\d)\b(?!%)'''

regexp_age_4_extracts = remainder_age_only['full_post_text'].str.extract(
        regexp_age_4,
        flags=re.IGNORECASE
    ).rename(columns={
        0: 'patient_age'
    })

regexp_age_4_extracts.dropna(inplace=True)

regexp_age_4_extracts['patient_age'] = \
    regexp_age_4_extracts['patient_age'].astype('int')
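
This pattern looks for a two-digit number after "I am" or "I'm", and the negative lookahead `(?!%)` skips percentages. A minimal sketch with the standard `re` module shows the cases it accepts and rejects; the `extract_age` helper and the example strings are hypothetical.

```python
import re

# Pattern copied from the cell above.
regexp_age_4 = r'''\bI'? ?a?m ([1-9]\d)\b(?!%)'''


def extract_age(text):
    """Hypothetical helper: return the first captured age as an int, or None."""
    m = re.search(regexp_age_4, text, flags=re.IGNORECASE)
    return int(m.group(1)) if m else None


examples = [
    "I'm 33 white male",
    'I am 23 and worried',
    'Im 19',
    "I'm 90% sure",           # percentage, rejected by the lookahead
    'I am 5 weeks pregnant',  # single digit, should not match
    'I\u2019m 25',            # curly apostrophe, not covered by the pattern as written
]
ages = [extract_age(t) for t in examples]
for text, age in zip(examples, ages):
    print(f'{text!r} -> {age}')
```

The last string shows that a typographic apostrophe (U+2019), which appears in some posts, is not matched by `I'?`.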

In [311]:
len(regexp_age_4_extracts)/len(df_sample2)

0.01

In [312]:
regexp_age_4_extracts.describe(include='all')

Unnamed: 0,patient_age
count,100.0
mean,22.32
std,7.458816
min,13.0
25%,17.0
50%,21.0
75%,24.0
max,54.0


In [313]:
regexp_age_4_extracts['patient_age'].value_counts().sort_index()

patient_age
13     1
14     4
15     9
16     4
17     8
18     7
19     9
20     7
21     8
22     4
23    10
24     5
25     4
26     3
27     1
28     3
30     1
33     3
34     1
36     2
37     1
40     2
44     1
46     1
54     1
Name: count, dtype: int64

In [314]:
num_freq_plot(regexp_age_4_extracts, 'patient_age')

In [315]:
# Spot-check the extracted ages over 30.
in_range_ind = regexp_age_4_extracts[
    (regexp_age_4_extracts['patient_age'] > 30)
].index

for i in in_range_ind[:5]:
    print('\n\n----------------')
    print('Extracted value:')
    print(f'''Age = {regexp_age_4_extracts['patient_age'].loc[i]}''')

    print('\nFull text:')
    print(df_sample2['full_post_text'].loc[i])
    print('\nUrl:')
    print(df_sample2['url'].loc[i])



----------------
Extracted value:
Age = 33

Full text:
2 ER visits and no answers, dont know how to move forward. | Lately in the evenings I'll get a sensation that I forgot how to swallow. If I drink something it's easier but just manually I struggle. When this starts to occur I get a rushing feeling, that I assume now is adrenaline, probably body thinks I'm choking. This ultimately leads into Hyperventilating and heart rate and blood pressure skyrockets. (BP 200/140). This happened to me last night and I thought maybe it was the Benadryl I took so I drank a lot of water and induced vomiting. The vomit was like molasses and didn't come out. I ended up inhaling some of it which extreme panic took over as I'm on my hands and knees leaning forward trying to breathe. Ambulance picks me up and after some fluids and nothing else it all seemed to resolve a couple hours later, although lungs still kinda hurt.

I'm 33 white male 140lbs
Take Benadryl nightly for sleep
50ml Enbrel for autoimmu

Based on the spot checks above, we don't need to filter out out-of-range values here.

This pattern matches the following proportion of the data sample:

In [316]:
len(regexp_age_4_extracts)/len(df_sample2)

0.01

In [317]:
remainder_age_only = remainder_age_only[
    ~remainder_age_only.index.isin(regexp_age_4_extracts.index)
].copy()

In [318]:
len(remainder_age_only)

4135

In [319]:
len(remainder_age_only)/len(df_sample2)

0.4135

In [320]:
remainder_age_only['selftext'].isna().value_counts()

selftext
True     2595
False    1540
Name: count, dtype: int64

In [321]:
for i, r in remainder_age_only[remainder_age_only['selftext'].notna()].tail(20).iterrows():
    print('\n\n----------------')

    print('\nFull text:')
    print(r['full_post_text'])
    print('\nUrl:')
    print(r['url'])



----------------

Full text:
need help with abnormal ultrasound reading | hello, I was wondering if someone could be so kind and help me understand the findings of my ultrasound. 

I have been having issues with the right side of my body(only) but more so my right foot. it often goes cold, one toe will go numb and it can’t carry much of my weight. when I drive my leg sometimes shakes. I decided to finally get this checked out when one day I was going to the restroom and fell twice trying to get there because my ankle gave out (I was wearing heel, but I do OFTEN) 

I have recently been having issues with the right side of my arm. particularly my front shoulder blade and my wrist. my right had will become significantly colder than my left &amp; others can tell the extreme difference if I touch them. 

I have because foggy headed, I miss up words more often than not and my speak feels slurred. (however covid has hindered my social vocabulary working from home)

I was ordered an ultrasou