# Problem Statement:
    Based on people's book suggestions on Reddit (r/booksuggestions), which similar books have other people read or suggested? Using NLP and deep learning methods, let's analyze those posts and come up with a way to find those books.

## Project Overview
1. Pull data from Reddit posts (r/booksuggestions) between July 25, 2010 and March 30, 2021
2. Use advanced NLP methods to analyze the data:
    - clean the posts and remove special characters
    - use continuous skip-grams to find the most similar books
    - sensitivity analysis to detect similarities
    - cluster similar profiles/books
3. TBD

### Goals of this notebook
In this notebook I pull the Reddit posts, load them into a dataframe, and clean them for my analysis.

    Datasets (and sources)
     - Reddit r/booksuggestions

In [27]:
import pandas as pd
import numpy as np
import os
import requests
import re
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk, conlltags2tree, tree2conlltags
from nltk.stem import WordNetLemmatizer

import warnings
warnings.filterwarnings('ignore')

plt.style.use('ggplot')

In [None]:
%pip install --upgrade pandas

In [3]:
# print working directory
# print(f'pwd: {pwd}')

# list of files in data folder
print(os.listdir(path='../data/booksuggestions'))


['booksuggestions_1526605530.csv', 'booksuggestions_1419920389.csv', 'booksuggestions_1528985291.csv', 'booksuggestions_1319578598.csv', 'booksuggestions_1583849823.csv', 'booksuggestions_1421639638.csv', 'booksuggestions_1546512148.csv', 'booksuggestions_1605839965.csv', 'booksuggestions_1595454506.csv', 'booksuggestions_1599861587.csv', 'booksuggestions_1480859511.csv', 'booksuggestions_1599298082.csv', 'booksuggestions_1594457865.csv', 'booksuggestions_1588307222.csv', 'booksuggestions_1557265382.csv', 'booksuggestions_1548811809.csv', 'booksuggestions_1386303777.csv', 'booksuggestions_1401460556.csv', 'booksuggestions_1582390065.csv', 'booksuggestions_1399506108.csv', 'booksuggestions_1358208212.csv', 'booksuggestions_1579731983.csv', 'booksuggestions_1541265510.csv', 'booksuggestions_1575511094.csv', 'booksuggestions_1612185218.csv', 'booksuggestions_1592346507.csv', 'booksuggestions_1594605235.csv', 'booksuggestions_1450729375.csv', 'booksuggestions_1530282774.csv', 'booksuggesti

### Setting up Subreddit's API and Extracting Posts

In [None]:
#creating url and params variables
url = 'https://api.pushshift.io/reddit/search/submission'

# creating params for subreddits posts
param_booksuggestions = {
    'subreddit': 'booksuggestions', #importing booksuggestions subreddit
    'size': 100 #max posts that we can retrieve at once
}

In [None]:
# define a function that takes the url and params, requests one page of posts,
# saves the page to csv, and moves the `before` timestamp (utc) back for the next call

def pull_reddit_posts(url, params):

    res = requests.get(url, params=params)
    if res.status_code == 200:
        print('Status Code is Okay!')
        df = pd.DataFrame(res.json()['data'])
        if df.empty:
            # an empty page has no 'created_utc' column, so stop here
            print('No more posts to export.')
            return
        created_utc = df['created_utc'].min()
        params['before'] = created_utc  
        print(f"exporting {params['subreddit']}_{created_utc}")
        df.to_csv(f"../data/booksuggestions/{params['subreddit']}_{created_utc}.csv")
    else:
        print("No data to load. Please try again :'(") 
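The `before` cursor step in the function above can be checked without calling the API. Below is a minimal sketch, assuming each post is a dict with a `created_utc` key, as in the response used above. `next_before` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def next_before(posts):
    """Return the oldest created_utc in a page of posts, or None for an empty page."""
    # hypothetical helper: mirrors df['created_utc'].min() on plain dicts
    timestamps = [int(post['created_utc']) for post in posts if 'created_utc' in post]
    return min(timestamps) if timestamps else None


page = [{'created_utc': 1526861857}, {'created_utc': 1526854114}]
print(next_before(page))   # oldest timestamp on the page
print(next_before([]))     # empty page: nothing left to pull
```

Returning `None` for an empty page gives the caller a clear signal to stop requesting older posts.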

In [None]:
# loop to pull multiple pages of booksuggestions posts, newest first
# from: Tuesday, March 30, 2021 1:22:32PM (epoch 1617135752)
# to: Sunday, July 25, 2010 2:32:59PM (epoch 1280093579)
for i in range(200):
    pull_reddit_posts(url, param_booksuggestions)
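The two epoch values in the comments above can be checked against their dates with the standard library. This is a small sketch that converts them to UTC; `epoch_to_utc` is a hypothetical helper name. The wall-clock times in the comments appear to be in a local time zone, so the hours differ from the UTC values.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch):
    """Convert a Unix epoch (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


oldest = epoch_to_utc(1280093579)
newest = epoch_to_utc(1617135752)
print(oldest.isoformat())
print(newest.isoformat())
```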

In [9]:
# creating a files variable that lists everything in the data folder
files = os.listdir(path = '../data/booksuggestions')

# checking the number of files
len(files)

1092

In [None]:
# checking which files in the folder are booksuggestions exports
[file for file in files if file.startswith('booksuggestions_')]

In [5]:
# reimporting the booksuggestions files to create a list of dataframes
# (a single list comprehension is enough; no outer loop needed)
booksuggestions_list = [pd.read_csv('../data/booksuggestions/' + file)
                        for file in files
                        if file.startswith('booksuggestions_')]
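One caveat with the `startswith('booksuggestions_')` filter: the combined file `booksuggestions_data.csv` is exported to the same folder later in this notebook and matches the same prefix, so rerunning the load would read it in alongside the individual pages. A stricter filter is sketched below. It assumes the page files are always named `booksuggestions_<epoch>.csv`, as in the listing above; `select_page_files` is a hypothetical helper name.

```python
import re

# page exports look like booksuggestions_1526605530.csv (prefix + digits only)
PAGE_FILE = re.compile(r'^booksuggestions_\d+\.csv$')


def select_page_files(filenames):
    """Keep only the per-page exports, skipping the combined csv and system files."""
    return sorted(name for name in filenames if PAGE_FILE.match(name))


names = ['booksuggestions_1526605530.csv', '.DS_Store',
         'booksuggestions_data.csv', 'booksuggestions_1419920389.csv']
print(select_page_files(names))
```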

In [10]:
#dataframe of booksuggestions
booksuggestions_data = pd.concat(booksuggestions_list, axis=0)

In [13]:
booksuggestions_data.head()

Unnamed: 0.1,Unnamed: 0,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,can_mod_post,contest_mode,created_utc,domain,...,thumbnail_width,view_count,media,link_flair_template_id,author_id,secure_media,removed_by,og_description,og_title,media_metadata
0,0,Spoggy,,[],,text,False,False,1526861857,self.booksuggestions,...,,,,,,,,,,
1,1,type2adultdiabeetus,,[],,text,False,False,1526857596,self.booksuggestions,...,,,,,,,,,,
2,2,The69thDuncan,,[],,text,False,False,1526856465,self.booksuggestions,...,,,,,,,,,,
3,3,mrjamiemcc,,[],,text,False,False,1526855461,self.booksuggestions,...,,,,,,,,,,
4,4,FrankenHeart,,[],,text,False,False,1526854114,self.booksuggestions,...,,,,,,,,,,


In [17]:
booksuggestions_data.shape

(109122, 97)

In [14]:
# exporting the data
booksuggestions_data.to_csv('../data/booksuggestions/booksuggestions_data.csv')

### Data Cleaning

In [28]:
# reimporting the data and dropping cols
booksuggestions_data = pd.read_csv('/Users/ronald_asseko_messa/Google Drive/dsir-125-large-files/booksuggestions_data.csv')
booksuggestions_data.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1, inplace=True)

booksuggestions_data.head()

Unnamed: 0,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,can_mod_post,contest_mode,created_utc,domain,full_link,...,thumbnail_width,view_count,media,link_flair_template_id,author_id,secure_media,removed_by,og_description,og_title,media_metadata
0,Spoggy,,[],,text,False,False,1526861857,self.booksuggestions,https://www.reddit.com/r/booksuggestions/comme...,...,,,,,,,,,,
1,type2adultdiabeetus,,[],,text,False,False,1526857596,self.booksuggestions,https://www.reddit.com/r/booksuggestions/comme...,...,,,,,,,,,,
2,The69thDuncan,,[],,text,False,False,1526856465,self.booksuggestions,https://www.reddit.com/r/booksuggestions/comme...,...,,,,,,,,,,
3,mrjamiemcc,,[],,text,False,False,1526855461,self.booksuggestions,https://www.reddit.com/r/booksuggestions/comme...,...,,,,,,,,,,
4,FrankenHeart,,[],,text,False,False,1526854114,self.booksuggestions,https://www.reddit.com/r/booksuggestions/comme...,...,,,,,,,,,,


In [12]:
booksuggestions_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109122 entries, 0 to 109121
Data columns (total 96 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   author                         109114 non-null  object 
 1   author_flair_css_class         666 non-null     object 
 2   author_flair_richtext          69926 non-null   object 
 3   author_flair_text              394 non-null     object 
 4   author_flair_type              69926 non-null   object 
 5   can_mod_post                   74587 non-null   object 
 6   contest_mode                   78879 non-null   object 
 7   created_utc                    109114 non-null  object 
 8   domain                         109106 non-null  object 
 9   full_link                      109112 non-null  object 
 10  gilded                         38571 non-null   float64
 11  id                             109112 non-null  object 
 12  is_crosspostable              

In [29]:
# select only 4 columns (copy so the fill below doesn't act on a slice of the original)
df = booksuggestions_data[['author','title','num_comments', 'selftext']].copy()

# filling missing values with a placeholder
df.fillna('[...]', inplace=True)

In [30]:
#check for missing values
df.isna().sum().sort_values(ascending=False)

author          0
title           0
num_comments    0
selftext        0
dtype: int64

In [31]:
# combine title and selftext columns
df['text'] = df['title'] + df['selftext']
df.head()

Unnamed: 0,author,title,num_comments,selftext,text
0,Spoggy,Looking for Horror fiction that explores the u...,5.0,I love horror films that delve into the outer ...,Looking for Horror fiction that explores the u...
1,type2adultdiabeetus,Books that are about or talk about US Army PSYOPS,0.0,"Psyops, an abbreviation of Psychological Opera...",Books that are about or talk about US Army PSY...
2,The69thDuncan,Looking for new sci-fi,10.0,So I read a ton of sci-fi and struggle to find...,Looking for new sci-fiSo I read a ton of sci-f...
3,mrjamiemcc,Recommend me my very first book to read,4.0,Being honest. I have never read a book out of ...,Recommend me my very first book to readBeing h...
4,FrankenHeart,Started a book club. Suggestions?,19.0,Somehow I became the age of a person that star...,Started a book club. Suggestions?Somehow I bec...
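In the `text` column above the title runs straight into the body (for example `sci-fiSo I read`), because the two strings are joined with nothing between them. Here is a minimal sketch of joining them with a separator. `join_title_and_body` is a hypothetical helper, and the single-space separator is an assumption rather than a choice made in this notebook.

```python
def join_title_and_body(title, body, sep=' '):
    """Join a post title and body with a separator, skipping empty parts."""
    parts = [part.strip() for part in (title, body) if part and part.strip()]
    return sep.join(parts)


print(join_title_and_body('Looking for new sci-fi', 'So I read a ton of sci-fi'))
print(join_title_and_body('Started a book club. Suggestions?', ''))
```

On the dataframe itself, something like `df['title'].str.cat(df['selftext'], sep=' ')` should give the same effect column-wise.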


In [18]:
# define a function that casts the text to str and replaces newlines with "; "
def clean_text_simple(df, text, clean_text):
    df[clean_text] = df[text].astype(str)
    df[clean_text] = df[clean_text].apply(lambda elem: re.sub(r"\n", "; ", elem))  
    
    return df
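The function above only replaces newlines. If digits and other special characters should be removed as well, a stricter version could look like the sketch below. `clean_text_strict` is a hypothetical name, and the set of punctuation it keeps is my assumption, not a rule taken from this notebook.

```python
import re


def clean_text_strict(text):
    """Replace newlines with '; ', drop digits and special characters, tidy spaces."""
    text = re.sub(r'\s*\n\s*', '; ', str(text))        # one or more newlines become '; '
    text = re.sub(r"[^A-Za-z\s.,;:!?'-]", ' ', text)   # keep letters and basic punctuation only
    return re.sub(r'\s+', ' ', text).strip()           # collapse repeated whitespace


sample = 'Need sci-fi recs!\n\nLoved "Dune" & Hyperion (1989)'
print(clean_text_strict(sample))
```

It could be applied per row with `df['text'].apply(clean_text_strict)`, in the same way the lambda is applied above.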

In [32]:
# applying the clean_text_simple to my text
df = clean_text_simple(df, 'text', 'clean_text')
df.head()

Unnamed: 0,author,title,num_comments,selftext,text,clean_text
0,Spoggy,Looking for Horror fiction that explores the u...,5.0,I love horror films that delve into the outer ...,Looking for Horror fiction that explores the u...,Looking for Horror fiction that explores the u...
1,type2adultdiabeetus,Books that are about or talk about US Army PSYOPS,0.0,"Psyops, an abbreviation of Psychological Opera...",Books that are about or talk about US Army PSY...,Books that are about or talk about US Army PSY...
2,The69thDuncan,Looking for new sci-fi,10.0,So I read a ton of sci-fi and struggle to find...,Looking for new sci-fiSo I read a ton of sci-f...,Looking for new sci-fiSo I read a ton of sci-f...
3,mrjamiemcc,Recommend me my very first book to read,4.0,Being honest. I have never read a book out of ...,Recommend me my very first book to readBeing h...,Recommend me my very first book to readBeing h...
4,FrankenHeart,Started a book club. Suggestions?,19.0,Somehow I became the age of a person that star...,Started a book club. Suggestions?Somehow I bec...,Started a book club. Suggestions?Somehow I bec...


In [26]:
os.listdir(path='/Users/ronald_asseko_messa/Google Drive/dsir-125-large-files/')

['df_clean_tagged.csv',
 '.DS_Store',
 'books_clean_df.csv',
 'booksuggestions_clean_df.pkl',
 'booksuggestions_data.csv']

In [24]:
# exporting df as a pickle file 
df.to_pickle('/Users/ronald_asseko_messa/Google Drive/dsir-125-large-files/booksuggestions_clean_df.pkl')
