# Scrape Data and Initial Cleaning

In [1]:
import requests
from datetime import datetime
import pandas as pd

In [2]:
today_timestamp = datetime.today().timestamp()
round(today_timestamp)

1678772115

In [3]:
# Fetch posts from the Pushshift API and return a DataFrame with the
# subreddit, title, selftext and author_fullname columns.
def get_posts(epoch_time: float, sub_or_comment: str, no_of_posts: int, subreddit: str):
    url = f"https://api.pushshift.io/reddit/search/{sub_or_comment}"
    params = {
        'subreddit': subreddit,
        'before': round(epoch_time),
        'size':no_of_posts,
    }
    
    res = requests.get(url, params=params, timeout=30)
    print(f"Request Status Code = {res.status_code}")
    data = res.json()
    print(f"The last post's created utc is {data['data'][-1]['created_utc']} ")
    list_of_details = ['subreddit','title','selftext','author_fullname']
    df=pd.json_normalize(data['data'])[list_of_details]
    print(f"There {df.shape[0]} new posts for {subreddit}")
    return df
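
The function prints the `created_utc` of the last post in the response, which could serve as the next `before` value when more posts are needed than a single request returns. The sketch below illustrates that paging loop only. `collect_posts`, `fetch_page` and `fake_fetch` are hypothetical names, and it assumes that results arrive newest first and that `before` is exclusive; neither assumption is confirmed by this notebook.

```python
from typing import Callable, Dict, List


def collect_posts(fetch_page: Callable[[int], List[Dict]], before: int, target: int) -> List[Dict]:
    """Page backwards in time until `target` posts are collected or no progress is made."""
    posts: List[Dict] = []
    while len(posts) < target:
        page = fetch_page(before)
        if not page:
            break  # nothing older is available
        posts.extend(page)
        next_before = page[-1]["created_utc"]  # oldest post on this page (assumed newest-first order)
        if next_before >= before:
            break  # guard against looping forever if the cursor does not move
        before = next_before
    return posts[:target]


# In-memory stand-in for the API, used only to exercise the loop.
_FAKE = [{"created_utc": t} for t in range(100, 0, -1)]


def fake_fetch(before: int, size: int = 3) -> List[Dict]:
    return [p for p in _FAKE if p["created_utc"] < before][:size]


posts = collect_posts(fake_fetch, before=101, target=7)
print([p["created_utc"] for p in posts])
```

In the notebook, `fetch_page` would wrap the request made inside `get_posts` and return `data['data']`.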

## Scrape data from the CPAP and sleep subreddits

In [4]:
df1 = get_posts(today_timestamp, 'submission', 1000, 'CPAP')

Request Status Code = 200
The last post's created utc is 1673851854 
There 994 new posts for CPAP


In [5]:
df2 = get_posts(today_timestamp, 'submission', 1000, 'Sleep')

Request Status Code = 200
The last post's created utc is 1676621785 
There 999 new posts for Sleep
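
The `created_utc` values printed above are Unix epoch seconds. To see how far back each request reached, they can be converted to UTC datetimes. This is a minimal sketch; `epoch_to_utc` is a hypothetical helper, not part of the original notebook.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: float) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Oldest post returned for each subreddit, taken from the outputs above.
for name, ts in [("CPAP", 1673851854), ("Sleep", 1676621785)]:
    print(name, epoch_to_utc(ts).isoformat())
```

The CPAP request reaches back to 16 January 2023, while the Sleep request only reaches back to 17 February 2023, so the two samples cover different time windows.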


## Initial inspection, cleaning and merging of title and selftext

In [6]:
df1.head()

Unnamed: 0,subreddit,title,selftext,author_fullname
0,CPAP,MAD with a CPAP,Has anyone tried this? Would it be advisable t...,t2_3lhdm45e
1,CPAP,Anyone using a UPS or battery back up?,Power in my area has been having a rough time ...,t2_edo4p
2,CPAP,Mask with wide straps/better force distribution?,"Hi All, I need recommendations for full face m...",t2_e3dihhnz
3,CPAP,Mask Causing Permanent Face Indents,I really am at my wits end everyone. I have my...,t2_nkorz
4,CPAP,"AirMini issue - mask broke on vacation, used u...","TLDR: AirMini mask broke on a cruise, got a ne...",t2_7n6wwhqc


Some `selftext` entries contain `\n` newline characters, which become visible when the full text is printed. We will clean them after merging title and selftext.

In [7]:
df2.head()

Unnamed: 0,subreddit,title,selftext,author_fullname
0,sleep,Medication recommendations for sleep?,I (f19) have a really bad sleep schedule. When...,t2_isheye88
1,u_ode-2-sleep,test,r/Lynette,t2_c7sw7qpu
2,sleep,GOOD NOGHT EVERYONE,,t2_6i0bg98pf
3,sleep,Im tired of feeling desperate when I cant sleep,"I always had an issue with sleep, but now I’m ...",t2_78dr8i4k
4,sleep,Going to bed at 4am everyday,Does anyone know if it's bad if I go to sleep ...,t2_7duo0ary


Check for null values.

In [8]:
df1.isnull().sum()

subreddit          0
title              0
selftext           0
author_fullname    0
dtype: int64

In [9]:
df2.isnull().sum()

subreddit          0
title              0
selftext           0
author_fullname    4
dtype: int64

Since only 4 rows in df2 have a null value (all in `author_fullname`), we can drop them.

In [10]:
df2 = df2.dropna()
df2.isnull().sum()

subreddit          0
title              0
selftext           0
author_fullname    0
dtype: int64

## Initial clean and merge

We will merge `title` and `selftext` into a single `text` column and then check the really short texts, as they might not carry significant information for us.

In [11]:
df1['text'] = df1['title']+' '+df1['selftext']
df1.head()

Unnamed: 0,subreddit,title,selftext,author_fullname,text
0,CPAP,MAD with a CPAP,Has anyone tried this? Would it be advisable t...,t2_3lhdm45e,MAD with a CPAP Has anyone tried this? Would i...
1,CPAP,Anyone using a UPS or battery back up?,Power in my area has been having a rough time ...,t2_edo4p,Anyone using a UPS or battery back up? Power i...
2,CPAP,Mask with wide straps/better force distribution?,"Hi All, I need recommendations for full face m...",t2_e3dihhnz,Mask with wide straps/better force distributio...
3,CPAP,Mask Causing Permanent Face Indents,I really am at my wits end everyone. I have my...,t2_nkorz,Mask Causing Permanent Face Indents I really a...
4,CPAP,"AirMini issue - mask broke on vacation, used u...","TLDR: AirMini mask broke on a cruise, got a ne...",t2_7n6wwhqc,"AirMini issue - mask broke on vacation, used u..."


In [12]:
df1['text_word_count'] = df1['text'].map(lambda x:len(x.split()))

In [13]:
df1.sort_values(by='text_word_count').head(25)

Unnamed: 0,subreddit,title,selftext,author_fullname,text,text_word_count
830,CPAP,Cpa,,t2_lboqznjr,Cpa,1
488,CPAP,camping,,t2_crjqhmy9,camping,1
457,CPAP,Dreamstation 2,,t2_bqvpqoqb,Dreamstation 2,2
551,CPAP,OSCAR Question,,t2_y45g0,OSCAR Question,2
884,CPAP,PiSA Water?,,t2_13bk7k,PiSA Water?,2
23,CPAP,OSCAR HELP,,t2_2qzkjgci,OSCAR HELP,2
588,CPAP,Error message.,,t2_5ms8niw1,Error message.,2
272,CPAP,OSCAR Troubleshoot,,t2_2qzkjgci,OSCAR Troubleshoot,2
709,CPAP,mini cpap?,,t2_5ri53mii,mini cpap?,2
508,CPAP,cpap goggles,,t2_enj81kd4,cpap goggles,2


Some posts have been removed (their `selftext` is `[removed]`), possibly because they were not relevant to the subreddit.

Also, on further investigation (especially of the OSCAR posts), we found that posts with no `selftext` had pictures in them. For example, OSCAR is an open-source programme for CPAP analytics, which is why people post screenshots of it. We will keep those posts, as their titles provide words that help identify a CPAP user's post.

In [14]:
df1[df1['selftext']=='[removed]']

Unnamed: 0,subreddit,title,selftext,author_fullname,text,text_word_count
381,CPAP,CPAP filters for Resmed 11 from Amazon,[removed],t2_vp2dn1kn,CPAP filters for Resmed 11 from Amazon [removed],8
515,CPAP,Now you will earn money fast.,[removed],t2_srrh73d,Now you will earn money fast. [removed],7
533,CPAP,Koala Nap,[removed],t2_67fnw7bg,Koala Nap [removed],3
579,CPAP,UPDATE SQUARED: BiPAP high pressure Airfit F20...,[removed],t2_9jqp6,UPDATE SQUARED: BiPAP high pressure Airfit F20...,14
637,CPAP,Need help with chronic congestion.,[removed],t2_6pg44w23,Need help with chronic congestion. [removed],6
691,CPAP,When did your sleep apnea start?,[removed],t2_tnmrb62b,When did your sleep apnea start? [removed],7
885,CPAP,Win 500$ apple gift-card,[removed],t2_d0ky5my9,Win 500$ apple gift-card [removed],5
913,CPAP,Instale o aplicativo Vegas Casino &amp; Slots!,[removed],t2_lboqznjr,Instale o aplicativo Vegas Casino &amp; Slots!...,8


Since only 8 posts were removed, we will simply drop them.

In [15]:
df1 = df1[df1['selftext']!='[removed]'].copy() # copy so later column assignments do not act on a slice

In [16]:
df1.shape

(986, 6)

We will do the same for df2.

In [17]:
df2['text'] = df2['title']+' '+df2['selftext']
df2.head()

Unnamed: 0,subreddit,title,selftext,author_fullname,text
0,sleep,Medication recommendations for sleep?,I (f19) have a really bad sleep schedule. When...,t2_isheye88,Medication recommendations for sleep? I (f19) ...
1,u_ode-2-sleep,test,r/Lynette,t2_c7sw7qpu,test r/Lynette
2,sleep,GOOD NOGHT EVERYONE,,t2_6i0bg98pf,GOOD NOGHT EVERYONE
3,sleep,Im tired of feeling desperate when I cant sleep,"I always had an issue with sleep, but now I’m ...",t2_78dr8i4k,Im tired of feeling desperate when I cant slee...
4,sleep,Going to bed at 4am everyday,Does anyone know if it's bad if I go to sleep ...,t2_7duo0ary,Going to bed at 4am everyday Does anyone know ...


In [18]:
df2['text_word_count'] = df2['text'].map(lambda x:len(x.split()))

In [19]:
df2.sort_values(by='text_word_count').head(25)

Unnamed: 0,subreddit,title,selftext,author_fullname,text,text_word_count
979,u_Immediate-Sleep-6264,dasdasd,,t2_48res2xxd,dasdasd,1
90,sleep,issues,,t2_sv1op54x,issues,1
976,u_Immediate-Sleep-6264,dasdasd,,t2_48res2xxd,dasdasd,1
977,u_Immediate-Sleep-6264,dasdasd,,t2_48res2xxd,dasdasd,1
981,u_Immediate-Sleep-6264,fsdfsdf,,t2_48res2xxd,fsdfsdf,1
978,u_Immediate-Sleep-6264,dasdasd,,t2_48res2xxd,dasdasd,1
657,u_Illustrious-Sleep-48,Yas,,t2_7kpibclj,Yas,1
980,u_Immediate-Sleep-6264,dasdasd,,t2_48res2xxd,dasdasd,1
799,sleep,sleeping,,t2_4ovk6i89c,sleeping,1
982,u_Immediate-Sleep-6264,fsdfsdf,,t2_48res2xxd,fsdfsdf,1


It seems that the scraper also returned posts from sources other than the sleep subreddit whose names contain "sleep" (for example, `u_Immediate-Sleep-6264`). We will remove those.

Some posts were also removed, so we will drop those from our data set as well.

In [20]:
df2 = df2[df2['selftext']!='[removed]'] # keep rows whose selftext is not '[removed]'

In [21]:
df2 = df2[df2['subreddit']=='sleep'].copy() # keep rows from the sleep subreddit only; copy so later column assignments do not act on a slice

In [22]:
df2.sort_values(by='text_word_count').head(25)

Unnamed: 0,subreddit,title,selftext,author_fullname,text,text_word_count
90,sleep,issues,,t2_sv1op54x,issues,1
799,sleep,sleeping,,t2_4ovk6i89c,sleeping,1
2,sleep,GOOD NOGHT EVERYONE,,t2_6i0bg98pf,GOOD NOGHT EVERYONE,3
653,sleep,I’m so sleepy,,t2_cubpxp29,I’m so sleepy,3
920,sleep,Sleeping Music,[https://youtu.be/tobNZptAqfc](https://youtu.b...,t2_nvzkq1u8,Sleeping Music [https://youtu.be/tobNZptAqfc](...,3
15,sleep,I LOVE SLEEP,,t2_7izs47t3,I LOVE SLEEP,3
187,sleep,Sound of jungle rain,[https://www.youtube.com/watch?v=1z\_99D9Z1qI]...,t2_54r7vwfr,Sound of jungle rain [https://www.youtube.com/...,5
41,sleep,Can snoring always be cured?,,t2_16l00ec5,Can snoring always be cured?,5
927,sleep,sleeping 15+ hours a day,,t2_fawpoe1n,sleeping 15+ hours a day,5
810,sleep,I wonder what sleep tastes like,,t2_datn4san,I wonder what sleep tastes like,6


We can see that the shorter posts tend to be music related, probably to help people fall asleep, so we want to keep the words associated with the music.

We will keep these short posts as they contain indicators like snoring, sleep coach and sleep expert.

We will replace the YouTube links with an empty string since they do not add much information.
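
The notebook does not show the link-removal step, so the following is only a sketch of one way it could be done with the standard `re` module. `strip_links`, `MARKDOWN_LINK` and `BARE_URL` are hypothetical names, the example URLs are placeholders, and the patterns remove every `http(s)` link (Markdown-style `[label](url)` or bare), not only YouTube ones.

```python
import re

# Markdown-style link: [label](http://... or https://...)
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\(https?://[^)\s]+\)")
# Any remaining bare URL
BARE_URL = re.compile(r"https?://\S+")


def strip_links(text: str) -> str:
    """Remove Markdown links and bare URLs, then collapse leftover whitespace."""
    text = MARKDOWN_LINK.sub("", text)
    text = BARE_URL.sub("", text)
    return " ".join(text.split())


print(strip_links("Sleeping Music [https://example.com/video](https://example.com/video)"))
```

On the DataFrames used here it could be applied with `df2['text'].map(strip_links)`.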

Inspect first 5 texts of each df.

In [23]:
df1['text'][0]

'MAD with a CPAP Has anyone tried this? Would it be advisable to use both?'

In [24]:
df1['text'][1]

'Anyone using a UPS or battery back up? Power in my area has been having a rough time lately and I have been wondering about getting one just in case. Anyone been using one or have and recommendations?'

In [25]:
df1['text'][2]

"Mask with wide straps/better force distribution? Hi All, I need recommendations for full face masks with wide straps, especially the part that sits at the base of the skull, or a headgear system that distributes the force better and doesn't make it all converge on a single point.\n\nI'm currently using a large F20, which is a nice mask, but the straps give me headaches, even when very loose - the part that sits at the base of the skull is just too small and puts too much pressure on one single point. The Mirage Quattro's headgear is a big improvement. I wonder if there's a mask (or 3rd party headgear) that's even better?"

In [26]:
df1['text'][3]

"Mask Causing Permanent Face Indents I really am at my wits end everyone. I have my mask loosened as loose as possible, to the point it occasionally will leak, and still I have dents in my face. I am frustrated and depressed and am unsure what to do. I tried mask covers but those made it so there was 0 seal for my mask. I'm 23 but I feel as though my mask is aging me. On my cheeks besides my nose there's now this puff of fat on each side as a result of the pressure from my mask. Is there anything I can do?"

In [27]:
df1['text'][4]

'AirMini issue - mask broke on vacation, used universal adaptor TLDR: AirMini mask broke on a cruise, got a new mask when I returned to the US, universal adaptor worked ok, not sure if I like AirMini anymore. \n\nI’ve had an AirMini for around 5 years and take it frequently for business and vacation travel. It’s adequate and I’ve had no real complaints. \n\nThis month I took it on a cruise out of Florida. I have the aftermarket Resway universal circuit connector and always pack it with the AM. Trouble is there’s no home respiratory supply stores in the Bermuda Triangle and the stupid mask (“Airfit Nxx something-something”) separated from the frame on day 2 of the cruise. I’ve never liked that mask much anyway and now I hate it. I was able to get it to stay on somewhat effectively with pressure and a chin strap but the slightest move caused it to fall apart and enraged me at all hours of the night. \n\nWe got back to Florida on a Saturday and were scheduled to stay 2 more days before co

From df1, we can see that `\n` has to be cleaned from the data.

In [28]:
df2['text'][0]

'Medication recommendations for sleep? I (f19) have a really bad sleep schedule. Whenever I’m at home, don’t have work or somewhere to be my sleep schedule starts getting messed up and as I’m getting older I can feel my brain slowly giving up on me. I usually sleep at 8-9 am and wake up at 7-9 pm and it just ruins my whole day and the things I’ve had planned for the day. Yesterday, I tried fixing my sleep schedule by staying up the whole day and I couldn’t do it and knocked out at 3:30 pm, now I’m awake at 1 am eating because I didn’t eat anything throughout the whole day, and now I know I won’t sleep until 9-10 am again. Are there any medications to knock me out? I’m so tired of my sleeping habit this has been going on since I was 12 and it just ruins my whole day and makes me more depressed. I’ve googled if I should take NyQuil and was told it’s bad to take it constantly, what should I do.'

In [32]:
df2['text'][2]

'GOOD NOGHT EVERYONE '

In [33]:
df2['text'][3]

'Im tired of feeling desperate when I cant sleep I always had an issue with sleep, but now I’m exhausted can’t stand anymore. Yesterday I only slept 2 hours, woke up ate 5:30AM took the subway, watched 6 hours of class, tried to stay awake as much as possible but it didnt work I was just fueling my anxiety. Tried to study after class and got a big headache so I decided that I would sleep as soon as I could to rest and do my tasks tomorrow. Got home after a big storm, cooked my lunch and breakfast for tomorrow, lied down very very sleepy, my brother started to scream bc of his team losing the game, 40 minutes after that I had a anxiety attack cause I couldn’t sleep. Still crying and feeling desperate bc i cant sleep and if i cant sleep i wont be able to pay attention to my favorites classes, and i wont be able to write my essay, and i will probably holding my tears all day long. Im exhausted (a similar situation happened last month and last month, is like i need a anxiety attack bc of s

In [34]:
df2['text'][4]

"Going to bed at 4am everyday Does anyone know if it's bad if I go to sleep at 4 a.m. every day? I sleep 8 hours a day but only ever go to sleep at 4 a.m. Someone told me that the brain does It's repairing from 22 p.m. to 1 a.m. or something like that. Is that true? Will I get alzheimers like that?"

In [35]:
df2['text'][5]

'Advice Hello everyone, I’m new here, it’s late at night and i cant sleep so I figured I’d ask for advice. It’s the 5th night in a row that im having trouble sleeping. I rarely ever get decent sleep, it has been like this for a year now. Last night I only had 2 hours of sleep and tonight I’ve been awake for 2 hours now. I don’t have trouble falling asleep, I struggle with staying asleep, once i wake up it takes up to 6 hours to fall back asleep, i feel so hopeless, I don’t know what to do. I wake up really tired and i get this brain fog all the time. I take melatonin but it’s not helping. I don’t even know what doctor i should go to regarding this. Can someone recommend something?'

Similar to df1, we need to get rid of `\n`.

Based on our experience with other datasets, we will also replace any `&amp;` entities with `&`.

In [36]:
df1['text'] = df1['text'].str.replace('\n','')
df1['text'] = df1['text'].str.replace('&amp;','&')
df1['text'][4]

'AirMini issue - mask broke on vacation, used universal adaptor TLDR: AirMini mask broke on a cruise, got a new mask when I returned to the US, universal adaptor worked ok, not sure if I like AirMini anymore. I’ve had an AirMini for around 5 years and take it frequently for business and vacation travel. It’s adequate and I’ve had no real complaints. This month I took it on a cruise out of Florida. I have the aftermarket Resway universal circuit connector and always pack it with the AM. Trouble is there’s no home respiratory supply stores in the Bermuda Triangle and the stupid mask (“Airfit Nxx something-something”) separated from the frame on day 2 of the cruise. I’ve never liked that mask much anyway and now I hate it. I was able to get it to stay on somewhat effectively with pressure and a chin strap but the slightest move caused it to fall apart and enraged me at all hours of the night. We got back to Florida on a Saturday and were scheduled to stay 2 more days before coming home. I

In [37]:
df2['text'] = df2['text'].str.replace('\n','')
df2['text'] = df2['text'].str.replace('&amp;','&')
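
Deleting `\n` outright can join the words on either side when no space surrounds the newline (for example `point.\n\nI'm` in entry 2 of df1). A possible alternative, sketched here with the standard library only, is to decode HTML entities and collapse every run of whitespace into a single space. `clean_text` is a hypothetical helper and differs from the replacement used in the cells above.

```python
import html


def clean_text(text: str) -> str:
    """Decode HTML entities and collapse newlines and repeated spaces into single spaces."""
    text = html.unescape(text)      # '&amp;' becomes '&'; other entities are decoded too
    return " ".join(text.split())   # split() with no argument splits on any whitespace


print(clean_text("a single point.\n\nI'm currently using Casino &amp; Slots"))
```

It could be applied to a column with `df1['text'].map(clean_text)`.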

Since our problem requires identifying people who use CPAP, it will help our model if we have users with multiple posts.
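
To check how many authors actually have more than one post, the `author_fullname` values could be counted before that column is dropped. This is a small standard-library sketch; `authors_with_multiple_posts` is a hypothetical helper and the sample list reuses a few author IDs from the tables above.

```python
from collections import Counter
from typing import Dict, Iterable


def authors_with_multiple_posts(authors: Iterable[str]) -> Dict[str, int]:
    """Return a mapping of author ID to post count for authors with more than one post."""
    counts = Counter(authors)
    return {author: n for author, n in counts.items() if n > 1}


sample = ["t2_2qzkjgci", "t2_y45g0", "t2_2qzkjgci", "t2_13bk7k"]
print(authors_with_multiple_posts(sample))
```

With pandas, `df1['author_fullname'].value_counts()` gives the same per-author counts.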

We will merge our data, check datatypes and export the data for EDA in the next notebook.

In [38]:
final = pd.concat([df1,df2],ignore_index=True,axis=0)[['subreddit','text']]

In [39]:
final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1918 entries, 0 to 1917
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1918 non-null   object
 1   text       1918 non-null   object
dtypes: object(2)
memory usage: 30.1+ KB


In [40]:
final['subreddit'] = final['subreddit'].astype('category')

In [41]:
final.to_pickle('./datasets/df.pkl')