# Cleaning and EDA
---

## Contents
---
- [Library Imports](#Library-Imports)
- [Cleaning](#Cleaning)

### Library Imports
---

In [2]:
#Imports
import pandas as pd

### Cleaning
---

In [28]:
# Load corpus.csv into a DataFrame. index_col = [0] uses the index column saved in the csv as the DataFrame index instead of creating a new one.
corpus = pd.read_csv('./data/corpus.csv', index_col = [0])

In [32]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2800 entries, 0 to 2799
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      2800 non-null   object
 1   selftext   1907 non-null   object
 2   subreddit  2800 non-null   object
dtypes: object(3)
memory usage: 87.5+ KB


**The following steps combine the title and selftext columns into a single column called 'text', with the two parts separated by a space.**

In [30]:
# Every post has a title, but about 32% (893 of 2,800) have no body text.  These are likely image posts.  Before I can concatenate the columns, I will replace the NaNs with a single space.
corpus.isnull().sum()

title          0
selftext     893
subreddit      0
dtype: int64
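
The 32% figure follows from the counts above (893 of 2,800 rows). A minimal sketch of computing that share directly, using a small made-up frame (`toy` is hypothetical, not the real corpus):

```python
import pandas as pd

# Hypothetical stand-in for the corpus: 2 of 4 posts have no body text.
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['body', None, 'body', None],
})

# isnull() gives a boolean frame; the column mean is the fraction missing.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

On the real data the same expression would be `corpus.isnull().mean() * 100`.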

In [37]:
corpus.fillna(' ', inplace = True)

In [39]:
# No NaNs! \o/
corpus.isnull().sum()

title        0
selftext     0
subreddit    0
dtype: int64

In [40]:
# Concatenate the title and selftext columns into a single text column.
corpus['text'] = corpus['title'] + ' ' + corpus['selftext']
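
One side effect of filling with a space is that title-only posts end up with trailing whitespace in `text`. A possible variant, sketched here on a hypothetical toy frame rather than the real corpus, fills only `selftext` with an empty string and strips the joined result:

```python
import pandas as pd

# Hypothetical toy data: the first post has a title but no body.
toy = pd.DataFrame({
    'title': ['Only a title', 'Title'],
    'selftext': [None, 'and a body'],
})

# Fill only the column that has gaps, then join and trim stray whitespace.
toy['selftext'] = toy['selftext'].fillna('')
toy['text'] = (toy['title'] + ' ' + toy['selftext']).str.strip()
print(toy['text'].tolist())
```

Whether the extra whitespace matters depends on the tokenizer used later; most will ignore it.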

In [42]:
corpus.head()

Unnamed: 0,title,selftext,subreddit,text
0,The Traveling Journal of r/fountainpens (Appro...,"Greetings, fountain pen family!\n\nSome of you...",fp,The Traveling Journal of r/fountainpens (Appro...
1,"Sometimes, we just need this simple reminder 🤗",,fp,"Sometimes, we just need this simple reminder 🤗"
2,I must confess I was wrong about Kaweco,I've been in the hobby on and off for over a d...,fp,I must confess I was wrong about Kaweco I've b...
3,Why are cheap fountain pens so much better?,This is the cheapest pen I have. I had zero pr...,fp,Why are cheap fountain pens so much better? Th...
4,Literally classic design,Beautiful stripes and engraving on the nib is ...,fp,Literally classic design Beautiful stripes and...


In [45]:
# Drop the title and selftext columns as they are now combined into one column.
columns_to_drop = ['title', 'selftext']

corpus.drop(columns = columns_to_drop, inplace = True)

In [56]:
corpus.head()

Unnamed: 0,subreddit,text
0,1,The Traveling Journal of r/fountainpens (Appro...
1,1,"Sometimes, we just need this simple reminder 🤗"
2,1,I must confess I was wrong about Kaweco I've b...
3,1,Why are cheap fountain pens so much better? Th...
4,1,Literally classic design Beautiful stripes and...


**For modeling, I will change the 'fp' subreddit label to 1 and the 'pens' subreddit label to 0.**

In [52]:
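# Map labels within the subreddit column only, so values in the text column are never touched.
corpus['subreddit'] = corpus['subreddit'].map({'fp': 1, 'pens': 0})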
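
A sketch, on a hypothetical toy frame, of encoding the labels within the `subreddit` column only, so that a post whose text happens to be exactly `pens` is left alone, followed by a check that no label was left unmapped:

```python
import pandas as pd

# Hypothetical toy data; the second row's text equals a label string.
toy = pd.DataFrame({
    'subreddit': ['fp', 'pens', 'fp'],
    'text': ['ink review', 'pens', 'nib question'],
})

label_map = {'fp': 1, 'pens': 0}
toy['subreddit'] = toy['subreddit'].map(label_map)

# map() yields NaN for labels missing from the dict, so check for that.
assert toy['subreddit'].notna().all(), 'unmapped subreddit label found'
print(toy)
```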

In [53]:
corpus

Unnamed: 0,subreddit,text
0,1,The Traveling Journal of r/fountainpens (Appro...
1,1,"Sometimes, we just need this simple reminder 🤗"
2,1,I must confess I was wrong about Kaweco I've b...
3,1,Why are cheap fountain pens so much better? Th...
4,1,Literally classic design Beautiful stripes and...
...,...,...
2795,0,Which pen are you picking?
2796,0,I want this pen! Any idea the name?!
2797,0,Looking for ID on the left pen. This is the on...
2798,0,"Current rotation, no complaints"


**Write the cleaned dataframe out to a csv file**

In [57]:
corpus.to_csv('./data/cleaned_corpus.csv', index = False)
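
Because the file is written with `index = False`, reading it back should not need an `index_col` argument. A quick round-trip sketch, using an in-memory buffer in place of the real path (the toy frame is hypothetical):

```python
import io
import pandas as pd

toy = pd.DataFrame({'subreddit': [1, 0], 'text': ['first post', 'second post']})

# Write without the index, then rewind the buffer and read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored)
```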

**Note that 'fp' (r/fountainpens) = 1 and 'pens' = 0**

In [58]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2800 entries, 0 to 2799
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  2800 non-null   int64 
 1   text       2800 non-null   object
dtypes: int64(1), object(1)
memory usage: 65.6+ KB


In [60]:
# The fountainpens subreddit makes up 55.6% of the data and the pens subreddit makes up 44.4%
corpus['subreddit'].value_counts(normalize = True) * 100

subreddit
1    55.607143
0    44.392857
Name: proportion, dtype: float64
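
These proportions also give a reference point for modeling: always predicting the majority class would be right about 55.6% of the time. A small sketch of that calculation. The counts 1557 and 1243 are inferred from the percentages above and the 2,800-row total, not read from the data:

```python
import pandas as pd

# Assumed class counts, back-calculated from the proportions shown above.
labels = pd.Series([1] * 1557 + [0] * 1243, name='subreddit')

# The largest class share is the accuracy of an always-majority guess.
proportions = labels.value_counts(normalize=True)
baseline_accuracy = proportions.max()
print(f'Majority-class baseline accuracy: {baseline_accuracy:.3f}')
```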