# Problem Statement:
    Based on people's book suggestions on Reddit (r/booksuggestions), which similar books have other people read or suggested? Using NLP and deep learning methods, let's analyze those posts and build a recommender system.

## Project Overview
1. Pull data from Reddit posts (r/booksuggestions) between July 25, 2010 and March 30, 2021
2. Use advanced NLP methods to analyze the data:
    - clean the posts and remove special characters
    - detect entities in each row
    - use continuous skip-grams from Word2Vec to find similar words
    - create a function that takes a post as input and returns 3 books
3. Conclusion and recommendations.

### Goals of this notebook
In this notebook I pull the Reddit posts, put them into a dataframe, and clean them for my analysis.

    Data source:
     - Reddit r/booksuggestions

Let's start by importing the libraries we'll need.

In [1]:
import pandas as pd
import numpy as np
import os
import requests
import re
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk, conlltags2tree, tree2conlltags
from nltk.stem import WordNetLemmatizer

import warnings
warnings.filterwarnings('ignore')

plt.style.use('ggplot')

Now, let's check our working directory and make sure that everything runs smoothly, since this is where we'll be downloading our data.

In [2]:
# print working directory
# print(f'pwd: {os.getcwd()}')

# list of files in data folder
# print(os.listdir(path='../data/booksuggestions'))


### Setting up Subreddit's API and Extracting Posts

Our data comes from Reddit. We use the Pushshift API to retrieve the posts we want, and we set up its parameters: the subreddit's name and the maximum number of posts to pull per request. Note that as of March 2021, that maximum is capped at 100.

We then define a function that takes the API's URL and those parameters. When called, the function requests the URL and checks the response status: it prints 'Successfully accessed the website (API) provided!' if the request succeeds and "Failed... No data has been loaded." otherwise. On success, it loads the posts from the JSON response into a data frame. It then takes the earliest timestamp (created_utc) among those 100 posts and uses it as the upper bound for the next batch of 100, so each call moves further back in time until there is nothing left to download. We could also move forward, depending on the timeframe we wish to capture. Finally, the function exports each batch to a local directory and prints the file name each time it does so.

In [None]:
#creating url and params variables
url = 'https://api.pushshift.io/reddit/search/submission'

# creating params for subreddits posts
param_booksuggestions = {
    'subreddit': 'booksuggestions', #importing booksuggestions subreddit
    'size': 100 #max posts that we can retrieve at once
}

# define a function that takes in the url and params, requests one batch of posts,
# and moves the 'before' timestamp (utc) back so the next call gets older posts
def pull_reddit_posts(url, params):

    res = requests.get(url, params=params)
    if res.status_code == 200:
        print('Successfully accessed the website (API) provided!')
        df = pd.DataFrame(res.json()['data'])
        # stop here when the API has no older posts to return
        if df.empty:
            print('No more posts to download.')
            return
        # the earliest post in this batch becomes the cutoff for the next one
        created_utc = df['created_utc'].min()
        params['before'] = created_utc
        print(f"exporting {params['subreddit']}_{created_utc}")
        df.to_csv(f"../data/booksuggestions/{params['subreddit']}_{created_utc}.csv")
    else:
        print("Failed... No data has been loaded.")
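The backward pagination described above can be sketched without touching the network. In this minimal sketch, `fake_fetch`, `pull_all`, and the in-memory `POSTS` list are hypothetical stand-ins for Pushshift; only the `before` and `size` parameters mirror the request used in this notebook.

```python
# Hypothetical stand-in data: 250 posts with increasing created_utc values.
POSTS = [{"id": i, "created_utc": 1_000 + i} for i in range(250)]

def fake_fetch(params):
    """Mimic the API: newest posts first, strictly older than params['before']."""
    before = params.get("before", float("inf"))
    older = [p for p in POSTS if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[: params["size"]]

def pull_all(params, max_batches=200):
    """Walk backward in time until a batch comes back empty."""
    batches = []
    for _ in range(max_batches):
        batch = fake_fetch(params)
        if not batch:  # nothing left to download
            break
        batches.append(batch)
        # the earliest timestamp in this batch bounds the next request
        params["before"] = min(p["created_utc"] for p in batch)
    return batches

batches = pull_all({"subreddit": "booksuggestions", "size": 100})
print([len(b) for b in batches])  # [100, 100, 50]
```

Because the cutoff is stored in `params`, each call picks up where the previous one stopped, which is why the notebook's function can simply be called repeatedly.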

Finally, we call our newly created function repeatedly to pull 200 batches. We can run this cell multiple times, depending on how much data we want. Note that pulling too many batches at once (more than about 200, though the limit varies with the subreddit you're pulling from) can raise an error.

In [None]:
# loop to pull multiple batches of booksuggestions posts
# from: Tuesday, March 30,2021 1:22:32PM (epoch 1617135752)
# to: Sunday, July 25,2010 2:32:59PM (epoch 1280093579)
for i in range(200):
    pull_reddit_posts(url, param_booksuggestions)

# creating a files variable listing everything in the data folder
files = os.listdir(path = '../data/booksuggestions')

# checking the list of booksuggestions files in the files variable
[file for file in files if file.startswith('booksuggestions_')]

# checking the number of files
print('How many files do we have?', len(files))
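The epoch values quoted in the comments can be checked with the standard library. This is a minimal sketch; `epoch_to_utc` is a hypothetical helper, and it prints UTC, whereas the times in the comments appear to be in a local timezone.

```python
from datetime import datetime, timezone

def epoch_to_utc(epoch):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)

earliest = epoch_to_utc(1280093579)
latest = epoch_to_utc(1617135752)

print(earliest.isoformat())  # 2010-07-25T21:32:59+00:00
print(latest.isoformat())    # 2021-03-30T20:22:32+00:00
```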

Now that we have all the data we want, let's put it into a single data frame and see how it looks. We have 109,122 rows and 97 columns. We then export the final dataset to our data directory before we start cleaning the data.

In [5]:
#reimporting the booksuggestions files to create a dataframe
booksuggestions_list = [pd.read_csv('../data/booksuggestions/' + file)
                        for file in files
                        if file.startswith('booksuggestions_')]

#dataframe of booksuggestions
booksuggestions_data = pd.concat(booksuggestions_list, axis=0)

# exporting the data
booksuggestions_data.to_csv('../data/booksuggestions/booksuggestions_data.csv')
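One thing to watch: the combined file is written into the same folder and its name also starts with `booksuggestions_`, so re-running the cell above would read it back in alongside the batches. A minimal sketch of a stricter filter follows, assuming batch files are always named `booksuggestions_<created_utc>.csv` with an integer timestamp, as in the export step; `is_batch_file` is a hypothetical helper and the file list is made up for illustration.

```python
import re

# Batch files are exported as booksuggestions_<created_utc>.csv (assumed integer timestamp)
BATCH_PATTERN = re.compile(r"booksuggestions_\d+\.csv")

def is_batch_file(name):
    """True only for per-batch exports, not for the combined dataset."""
    return BATCH_PATTERN.fullmatch(name) is not None

# made-up directory listing for illustration
files = ["booksuggestions_1526854114.csv", "booksuggestions_data.csv", ".DS_Store"]
batch_files = [f for f in files if is_batch_file(f)]
print(batch_files)  # ['booksuggestions_1526854114.csv']
```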

How many rows and columns do we have?

In [17]:
# how many rows and columns do we have?
booksuggestions_data.shape

(109122, 97)

### Data Cleaning

Our data cleaning starts by reimporting the data and checking the columns' names and types. We then keep only the author, title, comment (selftext), and number of comments (num_comments). Here is what our data looks like.

In [4]:
# reimporting the data and dropping cols
booksuggestions_data = pd.read_csv('/Users/ronald_asseko_messa/Google Drive/dsir-125-large-files/booksuggestions_data.csv')
booksuggestions_data.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1, inplace=True)

booksuggestions_data.head()

|  | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | can_mod_post | contest_mode | created_utc | domain | full_link | ... | thumbnail_width | view_count | media | link_flair_template_id | author_id | secure_media | removed_by | og_description | og_title | media_metadata |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Spoggy |  | [] |  | text | False | False | 1526861857 | self.booksuggestions | https://www.reddit.com/r/booksuggestions/comme... | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | type2adultdiabeetus |  | [] |  | text | False | False | 1526857596 | self.booksuggestions | https://www.reddit.com/r/booksuggestions/comme... | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | The69thDuncan |  | [] |  | text | False | False | 1526856465 | self.booksuggestions | https://www.reddit.com/r/booksuggestions/comme... | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | mrjamiemcc |  | [] |  | text | False | False | 1526855461 | self.booksuggestions | https://www.reddit.com/r/booksuggestions/comme... | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | FrankenHeart |  | [] |  | text | False | False | 1526854114 | self.booksuggestions | https://www.reddit.com/r/booksuggestions/comme... | ... |  |  |  |  |  |  |  |  |  |  |


In [8]:
# checking what the columns look like and their types
booksuggestions_data.info()

Let's filter the columns we want and replace all the missing values with "[...]", since our model doesn't handle missing values well. Those missing values typically come from posts with no comment or title. Once that's done, we check that the transformation worked by counting the remaining missing values.

In [5]:
# select only 3 columns (copy so we don't modify a slice of the original)
df = booksuggestions_data[['author','title', 'selftext']].copy()

# filling missing values
df = df.fillna('[...]')

#check for missing values
df.isna().sum().sort_values(ascending=False)

author      0
title       0
selftext    0
dtype: int64

Now we combine our titles and comments into a single cell for each row.

In [7]:
# combine title and selftext columns
df['text'] = df['title'] + df['selftext']
df.head(3)

|  | author | title | selftext | text |
| --- | --- | --- | --- | --- |
| 0 | Spoggy | Looking for Horror fiction that explores the u... | I love horror films that delve into the outer ... | Looking for Horror fiction that explores the u... |
| 1 | type2adultdiabeetus | Books that are about or talk about US Army PSYOPS | Psyops, an abbreviation of Psychological Opera... | Books that are about or talk about US Army PSY... |
| 2 | The69thDuncan | Looking for new sci-fi | So I read a ton of sci-fi and struggle to find... | Looking for new sci-fiSo I read a ton of sci-f... |


Once we have a single cell on each row, we clean the text by replacing the newline escape sequences (“\n”) with “; ” and output the result.

In [32]:
# define a function that replaces newline characters with "; "
def clean_text_simple(df, text, clean_text):
    df[clean_text] = df[text].astype(str)
    df[clean_text] = df[clean_text].apply(lambda elem: re.sub(r"\n", "; ", elem))  
    
    return df

# applying the clean_text_simple to my text
df = clean_text_simple(df, 'text', 'clean_text')
df.head()

|  | author | title | num_comments | selftext | text | clean_text |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Spoggy | Looking for Horror fiction that explores the u... | 5.0 | I love horror films that delve into the outer ... | Looking for Horror fiction that explores the u... | Looking for Horror fiction that explores the u... |
| 1 | type2adultdiabeetus | Books that are about or talk about US Army PSYOPS | 0.0 | Psyops, an abbreviation of Psychological Opera... | Books that are about or talk about US Army PSY... | Books that are about or talk about US Army PSY... |
| 2 | The69thDuncan | Looking for new sci-fi | 10.0 | So I read a ton of sci-fi and struggle to find... | Looking for new sci-fiSo I read a ton of sci-f... | Looking for new sci-fiSo I read a ton of sci-f... |
| 3 | mrjamiemcc | Recommend me my very first book to read | 4.0 | Being honest. I have never read a book out of ... | Recommend me my very first book to readBeing h... | Recommend me my very first book to readBeing h... |
| 4 | FrankenHeart | Started a book club. Suggestions? | 19.0 | Somehow I became the age of a person that star... | Started a book club. Suggestions?Somehow I bec... | Started a book club. Suggestions?Somehow I bec... |
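The same combine-and-clean step can be sketched on plain strings. `combine_and_clean` is a hypothetical helper, the sample strings are made up, and the space inserted between title and body is an assumption that differs from the notebook, where the two are joined directly (hence "sci-fiSo" in the output above).

```python
import re

def combine_and_clean(title, selftext, sep=" "):
    """Join title and body, then replace each newline with '; '."""
    text = f"{title}{sep}{selftext}"
    return re.sub(r"\n", "; ", text)

# made-up example strings, not taken from the dataset
cleaned = combine_and_clean("A title", "First line\nSecond line")
print(cleaned)  # A title First line; Second line
```

Passing `sep=""` reproduces the direct concatenation used in the notebook.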


Finally, let's check our file path and export our latest dataframe. Note that I'm using the "to_pickle" method to export the dataframe instead of "to_csv". Pickling keeps the dataframe in its original form, which matters especially for word vectors and tokens: they stay as lists, whereas "to_csv" writes list-valued cells out as plain strings.

In [9]:
# check file path
os.listdir(path='/Users/ronald_asseko_messa/Google Drive/dsir-125-large-files/')

# exporting df as a pickle file 
df.to_pickle('/Users/ronald_asseko_messa/Google Drive/dsir-125-large-files/booksuggestions_clean_df.pkl')
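The difference between the two export formats can be seen with the standard library alone. This is a minimal sketch using the `csv` and `pickle` modules instead of pandas, with a made-up row; the same effect applies to list-valued cells written with `to_csv` versus `to_pickle`.

```python
import csv
import io
import pickle

# made-up row holding a list of tokens
row = {"author": "example_user", "tokens": ["looking", "for", "sci-fi"]}

# CSV round trip: every cell is written and read back as text
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["author", "tokens"])
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# pickle round trip: Python objects keep their types
from_pickle = pickle.loads(pickle.dumps(row))

print(type(from_csv["tokens"]).__name__)     # str
print(type(from_pickle["tokens"]).__name__)  # list
```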
