# Data Cleaning
---
This notebook cleans the submission data collected from the Anxiety and writing subreddits.

## Libraries Used
---

In [1]:
import numpy as np
import pandas as pd

## Import Data
---

In [2]:
anx = pd.read_csv('../../../data/original/anxiety_submissions.csv')
writ = pd.read_csv('../../../data/original/writing_submissions.csv')

## Clean Anxiety Data
---

In [3]:
anx.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,removed_by_category,author_flair_template_id,author_flair_text_color,author_flair_background_color,edited,author_cakeday,is_created_from_ads_ui,author_is_blocked,distinguished,banned_by
0,[],False,JackW357,,[],,text,t2_86tvd1p6,False,False,...,,,,,,,,,,
1,[],False,belladoll1021,,[],,text,t2_3imzzz6p,False,False,...,,,,,,,,,,
2,[],False,ashwinderegg,,[],,text,t2_3o3tfuf3,False,False,...,,,,,,,,,,
3,[],False,ashwinderegg,,[],,text,t2_3o3tfuf3,False,False,...,,,,,,,,,,
4,[],False,lachapoxxx,,[],,text,t2_93gbsj7i,False,False,...,,,,,,,,,,


In [4]:
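# .copy() gives an independent frame, so the later in-place edits do not raise SettingWithCopyWarning
anx = anx[['author', 'link_flair_text', 'num_comments', 'subreddit', 'selftext', 'title', 'created_utc']].copy()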

In [5]:
anx.isnull().sum()

author               0
link_flair_text    513
num_comments         0
subreddit            0
selftext            96
title                0
created_utc          0
dtype: int64

In [6]:
anx.fillna(' ', inplace = True)
anx.isnull().sum()

author             0
link_flair_text    0
num_comments       0
subreddit          0
selftext           0
title              0
created_utc        0
dtype: int64

In [7]:
anx[anx['selftext'] == '[removed]'].count()

author             92
link_flair_text    92
num_comments       92
subreddit          92
selftext           92
title              92
created_utc        92
dtype: int64

In [8]:
anx[anx['selftext'] == '[deleted]'].count()

author             40
link_flair_text    40
num_comments       40
subreddit          40
selftext           40
title              40
created_utc        40
dtype: int64
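
The two placeholder counts above could also be obtained in a single pass. The following is a minimal sketch on a made-up toy column (the `toy` frame and its values are assumptions for illustration, not the real data), using `isin` to build the mask and `value_counts` to tally each placeholder.

```python
import pandas as pd

# Toy stand-in for the `selftext` column; the values are made up for illustration.
toy = pd.DataFrame({
    'selftext': ['[removed]', 'real post body', '[deleted]', ' ', '[removed]']
})

placeholders = ['[removed]', '[deleted]']

# Boolean mask of rows whose body is exactly one of the placeholder strings.
mask = toy['selftext'].isin(placeholders)

# Tally each placeholder separately in one call.
placeholder_counts = toy.loc[mask, 'selftext'].value_counts()
print(placeholder_counts)
```

On the real data the same pattern would be applied to `anx['selftext']` or `writ['selftext']`.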

In [9]:
anx.replace({'[removed]':' ', '[deleted]':' '}, inplace = True)

In [10]:
anx['text'] = anx['title'] + ' ' + anx['selftext']

In [11]:
anx.head(3)

Unnamed: 0,author,link_flair_text,num_comments,subreddit,selftext,title,created_utc,text
0,JackW357,DAE Questions,9,Anxiety,,Anyone else scared of dying and scared of when...,1606687976,Anyone else scared of dying and scared of when...
1,belladoll1021,Health,1,Anxiety,Can a tight throat and gagging feeling be anxi...,Tight throat,1606687615,Tight throat Can a tight throat and gagging fe...
2,ashwinderegg,Advice Needed,3,Anxiety,Does anyone else feel like they can no longer ...,Anxiety overriding my intuition.,1606687588,Anxiety overriding my intuition. Does anyone e...


In [12]:
anx.drop(columns = ['selftext', 'title'], inplace = True)

In [13]:
anx.to_csv('../../../data/clean/anx_clean.csv')

## Clean Writing (Control) Data
---

In [14]:
writ.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,distinguished,media,media_embed,secure_media,secure_media_embed,author_flair_background_color,edited,banned_by,is_created_from_ads_ui,author_is_blocked
0,[],False,Yersenie,,[],,text,t2_9jnd5n8o,False,False,...,,,,,,,,,,
1,[],False,hoe4hob1,,[],,text,t2_8wg687lh,False,False,...,,,,,,,,,,
2,[],False,Pagliacci_Baby,,[],,text,t2_7d3owwgj,False,False,...,,,,,,,,,,
3,[],False,Jp_web_agency,,[],,text,t2_7tw9syp3,False,False,...,,,,,,,,,,
4,[],False,Jp_web_agency,,[],,text,t2_7tw9syp3,False,False,...,,,,,,,,,,


In [15]:
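# .copy() gives an independent frame, so the later in-place edits do not raise SettingWithCopyWarning
writ = writ[['author', 'link_flair_text', 'num_comments', 'subreddit', 'selftext', 'title', 'created_utc']].copy()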
writ.head()

Unnamed: 0,author,link_flair_text,num_comments,subreddit,selftext,title,created_utc
0,Yersenie,,2,writing,[removed],Jj,1609841667
1,hoe4hob1,Discussion,0,writing,[removed],My character dresses in gender neutral clothin...,1609840909
2,Pagliacci_Baby,,2,writing,[removed],Between The Lakes: Critique requested,1609840879
3,Jp_web_agency,,2,writing,,Skills needed to make your resume stand out,1609840727
4,Jp_web_agency,,2,writing,,Skills needed to make your resume stand out,1609840696


In [16]:
writ.isnull().sum()

author                0
link_flair_text    1440
num_comments          0
subreddit             0
selftext            226
title                 0
created_utc           0
dtype: int64

In [17]:
writ.fillna(' ', inplace = True)
writ.isnull().sum()

author             0
link_flair_text    0
num_comments       0
subreddit          0
selftext           0
title              0
created_utc        0
dtype: int64

In [18]:
writ[writ['selftext'] == '[removed]'].count()

author             857
link_flair_text    857
num_comments       857
subreddit          857
selftext           857
title              857
created_utc        857
dtype: int64

In [19]:
writ[writ['selftext'] == '[deleted]'].count()

author             16
link_flair_text    16
num_comments       16
subreddit          16
selftext           16
title              16
created_utc        16
dtype: int64

In [20]:
writ.replace({'[removed]':' ', '[deleted]':' '}, inplace = True)

In [21]:
writ['text'] = writ['title'] + ' ' + writ['selftext']

In [22]:
writ.head(3)

Unnamed: 0,author,link_flair_text,num_comments,subreddit,selftext,title,created_utc,text
0,Yersenie,,2,writing,,Jj,1609841667,Jj
1,hoe4hob1,Discussion,0,writing,,My character dresses in gender neutral clothin...,1609840909,My character dresses in gender neutral clothin...
2,Pagliacci_Baby,,2,writing,,Between The Lakes: Critique requested,1609840879,Between The Lakes: Critique requested


In [23]:
writ.drop(columns = ['selftext', 'title'], inplace = True)

In [24]:
writ.to_csv('../../../data/clean/writing_cleaning.csv')

## Merge Datasets
---

In [25]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([anx, writ])

In [26]:
df.head()

Unnamed: 0,author,link_flair_text,num_comments,subreddit,created_utc,text
0,JackW357,DAE Questions,9,Anxiety,1606687976,Anyone else scared of dying and scared of when...
1,belladoll1021,Health,1,Anxiety,1606687615,Tight throat Can a tight throat and gagging fe...
2,ashwinderegg,Advice Needed,3,Anxiety,1606687588,Anxiety overriding my intuition. Does anyone e...
3,ashwinderegg,Advice Needed,7,Anxiety,1606687588,Anxiety overriding my intuition. Does anyone e...
4,lachapoxxx,Advice Needed,1,Anxiety,1606687488,hey friends! i need some advice my anxiety has...


In [27]:
df.isnull().sum()

author             0
link_flair_text    0
num_comments       0
subreddit          0
created_utc        0
text               0
dtype: int64

In [28]:
df.to_csv('../../../data/clean/anx_writing.csv')
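
One thing to keep in mind when reloading these files: `to_csv` writes the row index by default, and `read_csv` then returns it as an `Unnamed: 0` column. As written, the merged frame also keeps each source frame's original row labels, so those labels repeat. The sketch below uses made-up toy frames and an in-memory buffer (all names here are assumptions for illustration) to show the effect of `ignore_index=True` and `index=False`. Whether to adopt them depends on what the downstream notebooks expect.

```python
import io
import pandas as pd

# Toy frames standing in for `anx` and `writ`; the values are made up.
a = pd.DataFrame({'subreddit': ['Anxiety', 'Anxiety'], 'text': ['a1', 'a2']})
b = pd.DataFrame({'subreddit': ['writing', 'writing'], 'text': ['w1', 'w2']})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's labels.
merged = pd.concat([a, b], ignore_index=True)

# Default behaviour: the index is written as an unnamed first column.
buf = io.StringIO()
merged.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False leaves the index out of the file entirely.
buf = io.StringIO()
merged.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```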

## Recap
---
This notebook cleaned the data collected from both subreddits: null values were filled, `[removed]` and `[deleted]` placeholders were blanked, unnecessary columns were dropped, and the title and selftext columns were combined into a single text column. The two cleaned datasets were then merged into one. Each cleaned dataset and the merged dataset were saved for use in exploratory data analysis (EDA) and modeling.
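
Since the same cleaning steps are applied to both subreddits, they could be collected into one function. The following is a sketch only: `clean_submissions`, `KEEP_COLS`, and the toy `raw` frame are hypothetical names and made-up values introduced here for illustration, and the function simply mirrors the steps used in the cells above.

```python
import pandas as pd

# Columns kept by the notebook's selection step.
KEEP_COLS = ['author', 'link_flair_text', 'num_comments',
             'subreddit', 'selftext', 'title', 'created_utc']


def clean_submissions(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning steps in this notebook."""
    # Keep only the columns of interest; .copy() avoids modifying the caller's frame.
    df = raw[KEEP_COLS].copy()
    # Fill nulls with a single space, as in the notebook.
    df = df.fillna(' ')
    # Blank out cells that are exactly '[removed]' or '[deleted]'.
    df = df.replace({'[removed]': ' ', '[deleted]': ' '})
    # Combine title and body into one text column, then drop the originals.
    df['text'] = df['title'] + ' ' + df['selftext']
    return df.drop(columns=['selftext', 'title'])


# Made-up rows for illustration only.
raw = pd.DataFrame({
    'author': ['u1', 'u2', 'u3'],
    'link_flair_text': ['Health', None, None],
    'num_comments': [1, 0, 2],
    'subreddit': ['Anxiety', 'writing', 'writing'],
    'selftext': ['body text', '[removed]', None],
    'title': ['First', 'Second', 'Third'],
    'created_utc': [1606687615, 1609840909, 1609840727],
    'score': [5, 1, 1],  # extra column that the selection step drops
})

clean = clean_submissions(raw)
print(clean)
```

With such a helper, each subreddit would be cleaned with a single call, for example `clean_submissions(anx)`, which keeps the two pipelines from drifting apart.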