## Problem statement

I recently came across a YouTube review of 'The Midnight Library' by Matt Haig, which made me want to read the book. A Google search for the book led me to its Goodreads page, where many people had reviewed it. It also won Best Fiction in the Goodreads Choice Awards 2020 with 72k votes.

On Goodreads, 20 fiction books were nominated for Best Fiction in the Goodreads Choice Awards 2020.
One interesting thing I noticed is that the runner-up, 'Anxious People' by Fredrik Backman, received just 5 votes fewer than 'The Midnight Library'. I want to know whether there are any similarities between these two books.

I am curious to find out why people love 'The Midnight Library' by Matt Haig and what is special about it that would make me want to read it. Are there any similarities between 'The Midnight Library' and 'Anxious People'? So, I am going to scrape the reviews of these two books from Goodreads and find out.

## Getting the data

I am going to scrape the reviews of these two books from Goodreads and clean the data.

In [1]:
# necessary imports
import requests
import bs4
from random import randint
from time import sleep
import pickle

In [91]:
# function to get reviews data from the goodreads website
def get_reviews_data(base_url):
    '''Returns the book reviews from goodreads.com'''
    n=1
    full_reviews=[]
    while n <= 10:
        scrap_url=base_url.format(n)
        page = requests.get(scrap_url)
        sleep(randint(5,15))
        soup=bs4.BeautifulSoup(page.text,'lxml')
        reviews_in_page =' '.join([review.text.strip().replace('\n',' ') for review in soup.select('.reviewText.stacked')])
        full_reviews.append(reviews_in_page)
        n+=1
    return ' '.join(full_reviews)    

#urls
urls= ['https://www.goodreads.com/book/show/52578297-the-midnight-library?from_choice=true&hide_last_page=true&language_code=en&page={}',
       'https://www.goodreads.com/book/show/49127718-anxious-people?authenticity_token=qiE093a2IvgTjmPpf8PSrWcbGo3%2F9Du24Jmj%2Bnp9Hbv1jyMhKK%2BsBEezAhlr6Ch9ItXw5Fnp0B0gI8rKeFygPg%3D%3D&from_choice=true&hide_last_page=true&language_code=en&page={}']
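
The `{}` placeholder at the end of each URL is what lets `get_reviews_data` step through the review pages. Here is a minimal sketch of that expansion. The `example.com` URL and the `page_urls` helper are made up for illustration and are not part of the notebook:

```python
# Hypothetical URL template; the trailing {} is filled with the page number.
base_url = "https://example.com/book/show/123-some-book?language_code=en&page={}"

def page_urls(template, first=1, last=10):
    """Return one URL per page, mirroring the while-loop in get_reviews_data."""
    return [template.format(n) for n in range(first, last + 1)]

urls_to_fetch = page_urls(base_url)
print(len(urls_to_fetch))   # number of pages that would be requested
print(urls_to_fetch[0])     # first page
print(urls_to_fetch[-1])    # last page
```

Building the list up front makes it easy to check the URLs before sending any requests.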

In [92]:
# Get reviews of 2 books selected from goodreads website
#reviews_data = [get_reviews_data(base_url) for base_url in urls]

In [3]:
#books
books = ['The Midnight Library', 'Anxious People']

In [99]:
# save the reviews_data to pickle file
!mkdir reviews_data

for i,b in enumerate(books):
    with open("reviews_data/" + b + ".txt", "wb") as file:
        pickle.dump(reviews_data[i], file)   

In [4]:
# load pickle file to dict
book_reviews={}

for i,b in enumerate(books):
    with open("reviews_data/" + b + ".txt", "rb") as file:
        book_reviews[b]=[pickle.load(file)]
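
The save and load cells above amount to a pickle round trip. Here is a self-contained sketch of it, with placeholder strings instead of the scraped reviews and a temporary directory instead of the working folder. It uses `os.makedirs(..., exist_ok=True)` in place of `!mkdir` so it can be re-run without an error. That is a convenience assumed here, not what the notebook did:

```python
import os
import pickle
import tempfile

books = ['The Midnight Library', 'Anxious People']
# Placeholder strings standing in for the scraped review text.
reviews_data = ['sample review text for book one', 'sample review text for book two']

out_dir = os.path.join(tempfile.mkdtemp(), 'reviews_data')
os.makedirs(out_dir, exist_ok=True)  # safe to call repeatedly

# Save one pickle file per book.
for title, text in zip(books, reviews_data):
    with open(os.path.join(out_dir, title + '.txt'), 'wb') as f:
        pickle.dump(text, f)

# Load them back into a dict keyed by book title (each value is a one-item list).
book_reviews = {}
for title in books:
    with open(os.path.join(out_dir, title + '.txt'), 'rb') as f:
        book_reviews[title] = [pickle.load(f)]

print(book_reviews['Anxious People'])
```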

In [5]:
book_reviews['Anxious People']



## Data cleaning

In [6]:
next(iter(book_reviews.keys()))

'The Midnight Library'

In [7]:
next(iter(book_reviews.values()))



In [8]:
# converting the dict items to dataframe
import pandas as pd
reviews_df=pd.DataFrame(book_reviews).transpose()
reviews_df.columns=['Reviews']
reviews_df

| | Reviews |
|---|---|
| The Midnight Library | Okay! No more words! This is one of the best s... |
| Anxious People | This is my goodreads 2020 choice as best ficti... |


In [9]:
# Data cleaning round 1
# Make all words lower case
# Remove punctuation and numbers

import re
import string

def data_clean_round1(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]',' ',text)
    return text

In [10]:
# data_clean_round1 function call
reviews_df.Reviews=reviews_df.Reviews.apply(data_clean_round1)
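
To see what round 1 does to a single review, here is a small sketch that applies the same regex to a made-up sentence. The whitespace-collapsing step at the end is an optional extra for readability and is not part of the notebook's function:

```python
import re

def data_clean_round1(text):
    """Lower-case the text and replace every character that is not a-z or whitespace with a space."""
    text = text.lower()
    return re.sub(r'[^a-z\s]', ' ', text)

sample = "Okay! No more words! 5/5 stars."  # made-up example sentence
cleaned = data_clean_round1(sample)

# Replacing characters with spaces leaves runs of blanks; split/join collapses them.
normalized = ' '.join(cleaned.split())
print(normalized)  # okay no more words stars
```

Because the pattern only keeps ASCII `a-z`, accented letters such as `é` are also replaced by spaces.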

In [11]:
reviews_df.Reviews.iloc[0]



In [12]:
#cleaned data
reviews_df

| | Reviews |
|---|---|
| The Midnight Library | okay no more words this is one of the best s... |
| Anxious People | this is my goodreads choice as best ficti... |


In [13]:
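import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
# One-time resource downloads (newer NLTK releases may also need 'punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')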

In [14]:
# Data cleaning round 2
# Split/tokenize the sentences
# Lemmatize the tokens to reduce inflected words to their base form

def data_clean_round2(text):
    token_words=nltk.word_tokenize(text)
    lem=WordNetLemmatizer()
    lem_output = ' '.join([lem.lemmatize(w) for w in token_words])
    return lem_output

In [15]:
# calling data_clean_round2
reviews_df.Reviews=reviews_df.Reviews.apply(data_clean_round2)

In [16]:
# cleaned data: the corpus (collection of texts)
reviews_df

| | Reviews |
|---|---|
| The Midnight Library | okay no more word this is one of the best sci ... |
| Anxious People | this is my goodreads choice a best fiction nov... |


In [17]:
# save cleaned data to pickle file for future use
reviews_df.to_pickle('Corpus.pkl')

### Document term matrix

In [18]:
# Converting the corpus to a document-term matrix using CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(stop_words='english')
reviews_cv=cv.fit_transform(reviews_df.Reviews)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
reviews_dtm = pd.DataFrame(reviews_cv.toarray(), columns=cv.get_feature_names_out())
reviews_dtm.index=reviews_df.index
reviews_dtm

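| | abandon | abandoned | abbott | abdominal | abiding | ability | abit | able | ably | abo | ... | yrralh | yummy | zany | zara | zero | zipped | zipping | zone | zoom | zr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The Midnight Library | 1 | 2 | 0 | 0 | 1 | 4 | 1 | 48 | 1 | 1 | ... | 1 | 2 | 0 | 0 | 4 | 2 | 1 | 1 | 0 | 0 |
| Anxious People | 0 | 0 | 4 | 2 | 0 | 21 | 2 | 32 | 0 | 1 | ... | 1 | 0 | 3 | 31 | 2 | 0 | 0 | 0 | 1 | 1 |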
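
To make the structure of the matrix above concrete, here is a toy version built with `collections.Counter` from the standard library. This is not `CountVectorizer`. It does no stop-word removal and uses plain whitespace splitting, and the two short "documents" are invented for illustration:

```python
from collections import Counter

# Toy corpus standing in for the two cleaned review strings (made-up text).
corpus = {
    'Book A': 'library life regret life',
    'Book B': 'bank hostage life',
}

# Count how often each term occurs in each document.
counts = {title: Counter(text.split()) for title, text in corpus.items()}

# The vocabulary is the sorted union of all terms, giving alphabetical columns
# like the ones in the output above.
vocab = sorted(set().union(*counts.values()))

# One row per document, one column per term; missing terms count as 0.
dtm = {title: [c[term] for term in vocab] for title, c in counts.items()}

print(vocab)
for title, row in dtm.items():
    print(title, row)
```

Each row is a book and each column a word, so comparing two books comes down to comparing two rows of counts.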


In [19]:
#saving the document term matrix to pickle file
reviews_dtm.to_pickle('reviews_dtm.pkl')