## Notebook 2. Cleaning

In the previous notebook I already removed duplicates and confirmed that selftext has no missing values, but additional cleaning is still needed. Later in this notebook, I compare RegexpTokenizer, PorterStemmer and WordNetLemmatizer for cleaning the text to see what works best. Some coding decisions were made with Sophie's help.

### 1. Importing data

In [13]:
# Import libraries

import pandas as pd
import numpy as np


from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
import re


import warnings
warnings.filterwarnings('ignore')

In [14]:
# import data

df1 = pd.read_csv('../data/book.csv')
df2 = pd.read_csv('../data/parenting.csv')

df1.head(3)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,crosspost_parent_list,author_cakeday,author_flair_background_color,author_flair_template_id,author_flair_text_color,distinguished,suggested_sort,banned_by,collections,edited
0,[],False,MesbaWhamed67,,[],,text,t2_dzy4gk0i,False,False,...,,,,,,,,,,
1,[],False,Yourmemoriesonsale,,[],,text,t2_1xkkexun,False,False,...,,,,,,,,,,
2,[],False,020Wombat,,[],,text,t2_b9p6ytvm,False,False,...,,,,,,,,,,


### 2. Fixing the size of columns

In [15]:
# Check the size of datasets: df1 - books, df2 - parenting

df1.shape

(4256, 83)

In [16]:
df2.shape

(4179, 70)

In [18]:
# keep only the columns that both dataframes share
# this code is adapted from Sophie's walkthrough lesson

df2.drop(set(df2.columns).difference(df1.columns), axis = 1, inplace = True)

df1.drop(set(df1.columns).difference(df2.columns), axis = 1, inplace = True)
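
The two `drop` calls above keep only the columns that exist in both dataframes. A minimal sketch of the same idea on toy data follows. The frames `books` and `parenting` and their columns are made up for illustration, and selecting the shared columns (rather than dropping in place) is just one possible variant, not the approach used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the two subreddit exports (hypothetical columns).
books = pd.DataFrame(
    {"id": ["a1"], "title": ["Book post"], "selftext": ["text"], "books_only": [1]}
)
parenting = pd.DataFrame(
    {"id": ["b1"], "selftext": ["text"], "title": ["Parenting post"], "parenting_only": [2]}
)

# Columns present in both frames, kept in the order they appear in `books`.
shared = [col for col in books.columns if col in parenting.columns]

# Select instead of dropping in place, so the original frames stay untouched.
books_aligned = books[shared]
parenting_aligned = parenting[shared]

print(shared)
print(books_aligned.shape, parenting_aligned.shape)
```

Selecting with a shared list also puts the columns in the same order in both frames, which makes the later concatenation easier to read.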

In [19]:
# check results

df1.shape

(4256, 70)

In [20]:
df2.shape

(4179, 70)

### 3. Exploring columns

In [21]:
# Books and parenting have the same columns now

df1.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts',
       'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'removed_by_category', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'tit

In [28]:
# Create a list of potential columns to use later
# Note: section 12 also selects 'subreddit', so add it here before running that cell

columns_to_keep = ['id', 'title', 'selftext']

### 4. Data dictionary

- I decided to create a list of columns that are of interest to me and could be used later. The idea of building a 'dictionary' came from Sophie's lesson walkthrough.

Data Element   | Description
---------------|-----------------
created_utc    | Date the post was created
id             | Unique identifier of the post
num_comments   | Number of comments
retrieved_on   | Timestamp of when the post was retrieved (some time after posting)
title          | Post title
selftext       | The text of the post. Not all posts have text; some are images or videos.
subreddit      | Name of the subreddit




### 5. Concatenating both subreddits to create a new dataframe with the columns I want to use later

In [29]:
# concat subreddits and keep only the selected columns

df = pd.concat([df1,df2], copy=False)[columns_to_keep]
df.shape

(8435, 3)

### 6. Check one more time for Duplicates and missing posts

In [30]:
# check duplicated id's and selftext

df[df.duplicated(subset='id')]
df[df.duplicated(subset='selftext')]

Unnamed: 0,id,title,selftext
6,p7e6e5,10 Quotes Of Wisdom By Iconic Author Dr Seuss,
6,p7dh6q,Free consultation,


In [31]:
# remove duplicates; keep=False drops every row whose selftext is repeated, including the first occurrence

df.drop_duplicates(subset=['selftext'], keep=False, inplace=True)
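
The `keep` argument decides which copies of a repeated `selftext` survive. A small sketch on made-up posts, assuming nothing about the real data beyond the column names, shows the difference between `keep='first'` and the `keep=False` used above.

```python
import pandas as pd

# Toy posts: "repeated post" appears twice (hypothetical data).
posts = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3", "p4"],
        "selftext": ["unique post", "repeated post", "repeated post", "another post"],
    }
)

# keep="first": one copy of each repeated text is retained.
keep_first = posts.drop_duplicates(subset=["selftext"], keep="first")

# keep=False: every row whose text occurs more than once is dropped.
keep_none = posts.drop_duplicates(subset=["selftext"], keep=False)

print(keep_first["id"].tolist())
print(keep_none["id"].tolist())
```

With `keep=False` the repeated text disappears completely, which is stricter than plain de-duplication; `keep='first'` would be the choice if one copy of each repeated post should stay in the data.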

In [32]:
# check data types

df.dtypes

id          object
title       object
selftext    object
dtype: object

In [33]:
# Check nulls

df.isnull().sum()

id          0
title       0
selftext    0
dtype: int64

### 7. Combine title and selftext into one column, text

In [34]:
# For later research and convenience, I've decided to combine title and selftext in one column

df['text'] = df['title']+' '+df['selftext']
df['text'].head(2)

1    Why do you like James Joyce?  James Joyce from...
2    We - Yevgeny Zamyatin Spoilers\n\nJust finishe...
Name: text, dtype: object
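
The null check above shows no missing values, so the plain `+` concatenation is safe here. On data where `selftext` can be missing, adding a string to a missing value gives a missing result, so a defensive variant may be useful. The sketch below uses toy rows and is an illustration, not the code this notebook ran.

```python
import pandas as pd

# Toy posts; the second one has no body text (hypothetical data).
posts = pd.DataFrame(
    {
        "title": ["Why do you like James Joyce?", "Free consultation"],
        "selftext": ["James Joyce from my reading", None],
    }
)

# Replace missing pieces with empty strings before joining,
# then trim the space left over when one side was empty.
combined = (
    posts["title"].fillna("") + " " + posts["selftext"].fillna("")
).str.strip()

print(combined.tolist())
```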

### 8. Removing links from posts

In [35]:
# check the shape of my dataframe
df.shape

(8432, 4)

In [36]:
# find posts with links and decide what to do with them

df[df['text'].str.contains('http')].shape


(328, 4)

In [18]:
# https://www.kite.com/python/answers/how-to-use-re.sub()-in-python#:~:text=to%20use%20re.-,sub()%20in%20Python,to%20replace%20substrings%20in%20strings.
# https://docs.python.org/3/library/re.html#module-re

# strip the links (anything starting with 'http') but keep the posts themselves
df['text'] = df['text'].apply(lambda x: re.sub(r'http\S+', '', x))
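
To make the effect of the `http\S+` pattern concrete, here is a short sketch using only the standard library. The helper name `strip_links` and the sample sentence are made up for illustration; collapsing the leftover whitespace is an extra step that the cell above does not perform.

```python
import re

# Matches "http" followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"http\S+")


def strip_links(text):
    """Remove http/https links and collapse the whitespace left behind."""
    without_links = URL_PATTERN.sub(" ", text)
    return " ".join(without_links.split())


sample = "Great list here: https://example.com/books?id=1 and more text"
cleaned = strip_links(sample)
print(cleaned)

# Links written without a scheme (e.g. "www.example.com") are not matched.
untouched = strip_links("see www.example.com for details")
print(untouched)
```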


### 9. Working on text column and cleaning all text

##### - PorterStemmer

In [19]:
# I wanted to test different ways of cleaning data. The 1st uses the Porter stemmer
# Here I've used code from Sophie's lesson

ps = PorterStemmer()

df['stemmed_text'] = df.text.apply(lambda x: ' '.join([ps.stem(w) for w in x.split()]))
df['stemmed_text'][0]

'whi doe my 13mo wake up so angry? my h goe and get him when he wake up in the middl of the night while i go pee befor i nurs him back to sleep. and lo is constantli sign “more” and “milk” and “mommy” all while tri to launch himself out of h’ arm and scream hi head off. what can h do?? he is worri that he is go to drop lo.'

##### - RegexpTokenizer

In [20]:
# The 2nd is to create a function 'clean_text' using RegexpTokenizer
# For this, I've used some ideas from https://coderoad.ru/42056872/%D0%9A%D0%B0%D0%BA-%D1%83%D0%B4%D0%B0%D0%BB%D0%B8%D1%82%D1%8C-in-strings-%D1%81-%D0%BF%D0%BE%D0%BC%D0%BE%D1%89%D1%8C%D1%8E-RegexpTokenizer
# https://www.programcreek.com/python/?CodeExample=clean+text


# build the tokenizer once instead of on every call
tokenizer = RegexpTokenizer('[a-z]+')

def clean_text(sen):
    # lowercase the post and keep only runs of the letters a-z
    post = sen.lower()
    clean_post = " ".join(tokenizer.tokenize(post))

    return clean_post

df['text'] = df['text'].map(clean_text)
df['text'][0]

'why does my mo wake up so angry my h goes and gets him when he wakes up in the middle of the night while i go pee before i nurse him back to sleep and lo is constantly signing more and milk and mommy all while trying to launch himself out of h s arms and screaming his head off what can h do he is worried that he is going to drop lo'
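
The output above shows two side effects of the `[a-z]+` pattern: digits are dropped ('13mo' becomes 'mo') and apostrophes split words ('h’s' becomes 'h s'). The sketch below reproduces this with the standard library's `re.findall`, which is assumed here to behave like `RegexpTokenizer('[a-z]+').tokenize` for this pattern. The function name `clean_text_re` is hypothetical.

```python
import re

# Runs of lowercase ASCII letters; everything else acts as a separator.
LETTERS = re.compile(r"[a-z]+")


def clean_text_re(sentence):
    """Lowercase the text and keep only runs of the letters a-z."""
    return " ".join(LETTERS.findall(sentence.lower()))


sample = "Why does my 13mo wake up so angry? H’s arms"
result = clean_text_re(sample)
print(result)
```

Because only a-z is kept, accented letters and numbers that carry meaning (ages, years, page counts) are removed as well, which is worth keeping in mind when reading the cleaned text.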

##### - WordNetLemmatizer

In [21]:
# Some words have similar roots and I wanted to use WordNetLemmatizer to connect them together
# Idea for this code came from Sophie's lesson walkthrough

lem = WordNetLemmatizer()

df['lem_text'] = df.text.apply(lambda post: ' '.join([lem.lemmatize(word) for word in post.split()]))
df['lem_text'][0]

'why doe my mo wake up so angry my h go and get him when he wake up in the middle of the night while i go pee before i nurse him back to sleep and lo is constantly signing more and milk and mommy all while trying to launch himself out of h s arm and screaming his head off what can h do he is worried that he is going to drop lo'

### 10. Comparing the tokenizer, stemmer and lemmatizer

In [22]:
# Compare the output of the tokenizer, stemmer and lemmatizer

df[['text','stemmed_text', 'lem_text']].head(3)

Unnamed: 0,text,stemmed_text,lem_text
1,why do you like james joyce james joyce from m...,whi do you like jame joyce? jame joyc from my ...,why do you like james joyce james joyce from m...
2,we yevgeny zamyatin spoilers just finished thi...,we - yevgeni zamyatin spoiler just finish thi ...,we yevgeny zamyatin spoiler just finished this...
3,books about non gaussian distributions i just ...,book about non-gaussian distribut i just finis...,book about non gaussian distribution i just fi...


 - After my exploration, all three approaches did a great job. But I like the RegexpTokenizer more because it doesn't change the root of my words and it deletes '/', ';', ':' and other signs attached to words. The lemmatizer is good too, but I didn't like that it cuts some smaller words so much that it is hard to tell which word is which. Here is my ranking:
 1. RegexpTokenizer
 2. WordNetLemmatizer
 3. Porter Stemmer

### RegexpTokenizer won!!!

### 11. Resetting index

In [23]:
# Index reset

df.reset_index(drop=True, inplace=True)
df.head(1)

Unnamed: 0,created_utc,id,retrieved_on,title,selftext,subreddit,text,stemmed_text,lem_text
0,1629379545,p7esvi,1629379556,Why do you like James Joyce?,James Joyce from my very brief reading of him...,books,why do you like james joyce james joyce from m...,whi do you like jame joyce? jame joyc from my ...,why do you like james joyce james joyce from m...
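
Resetting matters because `pd.concat` keeps each dataframe's original row labels unless told otherwise, so the same label can point at a books row and a parenting row. A minimal sketch with toy frames (the names and text are made up) shows the labels before and after, plus the `ignore_index=True` alternative.

```python
import pandas as pd

# Toy frames, each with its own 0..1 row labels (hypothetical data).
books = pd.DataFrame({"text": ["book post 0", "book post 1"]})
parenting = pd.DataFrame({"text": ["parenting post 0", "parenting post 1"]})

combined = pd.concat([books, parenting])
labels_before = combined.index.tolist()

# With repeated labels, .loc[0] returns one row from each source frame.
rows_labelled_zero = combined.loc[0]

# Renumber the rows 0..n-1 and discard the old labels.
combined = combined.reset_index(drop=True)
labels_after = combined.index.tolist()

# Equivalent result in one step at concatenation time.
combined_direct = pd.concat([books, parenting], ignore_index=True)

print(labels_before, labels_after)
```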


### 12. Final Columns to keep

In [24]:
# After some thought, I decided to keep only 2 columns: 'subreddit' and 'text'

cols = ['subreddit','text']
new_df = df[cols]

### 13. Convert column subreddit to 0 and 1

In [25]:
# My goal is to predict whether a post comes from the 'books' subreddit.
# That's why I'm converting subreddit to numerical: books would be 1 and parenting would be 0.
# get_dummies creates one indicator column per subreddit; 'subreddit_books' is the target.

new_df = pd.get_dummies(new_df, columns=['subreddit'])
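
`get_dummies` produces one indicator column for each subreddit rather than a single 0/1 column. If a single target column is preferred, a comparison against the 'books' label is one possible alternative. The sketch below uses toy rows; the column name `is_books` is hypothetical, and the 'parenting' label is assumed for the second subreddit.

```python
import pandas as pd

# Toy rows; "parenting" as the second label is an assumption.
labels = pd.DataFrame(
    {
        "subreddit": ["books", "parenting", "books"],
        "text": ["a book post", "a parenting post", "another book post"],
    }
)

# 1 when the post comes from 'books', 0 for anything else.
labels["is_books"] = (labels["subreddit"] == "books").astype(int)

print(labels[["subreddit", "is_books"]])
```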


### 14. Save the cleaned dataframe as 'cleaned_df.csv', ready for EDA

In [26]:
# Save to a new CSV file

new_df.to_csv('../data/cleaned_df.csv', index=False)
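
One thing to watch when the file is read back: if any post ends up as an empty string after cleaning, `read_csv` will load it as a missing value by default. The sketch below demonstrates this with an in-memory buffer instead of a real file; the toy rows are made up, and whether the actual data contains empty posts is not known from this notebook.

```python
import io

import pandas as pd

# Toy cleaned data; the second post is empty after cleaning (hypothetical).
cleaned = pd.DataFrame(
    {"subreddit_books": [1, 0], "text": ["why do you like james joyce", ""]}
)

# Write to an in-memory buffer the same way the notebook writes to disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
strict_read = pd.read_csv(buffer, keep_default_na=False)

missing_by_default = int(default_read["text"].isna().sum())
missing_when_strict = int(strict_read["text"].isna().sum())
print(missing_by_default, missing_when_strict)
```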


###### Proceed to the next notebook: EDA!