# Data Wrangling in Books

This notebook has two sections, one for each book dataset. The first cleans the dataset of general book information, which is useful for exploring languages, page counts, publication years and average ratings across a wide variety of books. The second applies NLP preprocessing techniques to reader reviews so that we can run sentiment analysis and later compare the computed sentiment patterns with the readers' scores. We apply two different sentiment criteria, then combine and compare the results to learn more about readers' preferences. The resulting datasets are saved in the `processed` data folder for EDA and a first look at findings.

## I. Books: general information

In this section, we check the information in the `book1.csv` and `book2.csv` datasets and apply some preprocessing and wrangling steps.

### Importing Data

Importing relevant packages:

In [1]:
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import random
import json
import re

Importing `book1.csv` and `book2.csv` from the `interim` folder:

In [2]:
df_book1 = pd.read_csv('../Data/interim/books/book1.csv', index_col=False)

In [3]:
df_book1.dropna(subset=['description'], inplace=True)

In [4]:
df_book2 = pd.read_csv('../Data/interim/books/book2.csv', index_col=False)

In [5]:
df_book2.dropna(subset=['description'], inplace=True)

In [6]:
df_book = pd.concat([df_book1, df_book2])

Inspecting the resulting DataFrame:

In [7]:
df_book.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 825175 entries, 1 to 499996
Data columns (total 29 columns):
asin                    141201 non-null object
authors                 825175 non-null object
average_rating          825175 non-null float64
book_id                 825175 non-null int64
country_code            825175 non-null object
description             825175 non-null object
edition_information     84392 non-null object
format                  661449 non-null object
image_url               825175 non-null object
is_ebook                825175 non-null object
isbn                    502210 non-null object
isbn13                  583432 non-null object
kindle_asin             361286 non-null object
language_code           492893 non-null object
link                    825175 non-null object
num_pages               612922 non-null float64
popular_shelves         825175 non-null object
publication_day         529733 non-null float64
publication_month       582375 non-null fl
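
The summary above reports 825175 entries with index labels that only run from 1 to 499996, so `pd.concat` kept each file's original row labels and some labels now occur more than once. A minimal sketch on toy frames (the names `part1` and `part2` and their values are illustrative, not the real data) shows the effect and how `ignore_index=True` avoids it:

```python
import pandas as pd

# Toy stand-ins for df_book1 and df_book2 (illustrative values only)
part1 = pd.DataFrame({'book_id': [10, 11, 12]})
part2 = pd.DataFrame({'book_id': [20, 21]})

# Default concat keeps the original labels: 0, 1, 2, 0, 1
stacked = pd.concat([part1, part2])
has_duplicate_labels = stacked.index.duplicated().any()

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([part1, part2], ignore_index=True)
```

Duplicate labels matter mainly for label-based operations such as `.loc` lookups or index-aligned assignment, so renumbering is a cheap safeguard.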

We note that some columns are incomplete (`publisher`, `isbn`), but we focus only on the relevant columns:
- authors
- average_rating
- book_id
- description
- is_ebook
- language_code
- num_pages
- publication_year
- ratings_count
- similar_books
- title

In [8]:
df_book_ = df_book.copy()
df_book_ = df_book_.loc[:, ['authors', 'average_rating', 'book_id', 
                            'description', 'is_ebook', 'language_code',
                            'num_pages', 'publication_year', 'ratings_count',
                            'similar_books', 'title']]

Some questions that we must answer are, for instance:

1. Are we including dictionaries in this analysis? Dictionaries tend to have many more pages than ordinary books.
2. Are the publication-year data consistent and correct?
3. How many people rated each book? A book with a single rating can't be placed in the same category as books with hundreds or thousands of ratings.
4. Must we delete the books with incomplete data? Probably not: using the language code as the criterion, we would delete almost 40% of the books, whereas descriptions, titles and authors are complete.
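
For question 4, the share of missing values per column can be read off directly. From the summary above, `language_code` has 492893 non-null values out of 825175 rows, so roughly 40% are missing (1 - 492893/825175 ≈ 0.403). A small sketch of the same check with `isna().mean()`, on a toy frame whose values are made up for illustration:

```python
import pandas as pd

# Toy frame (made-up values): 2 of 5 language codes are missing
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'language_code': ['eng', None, 'spa', None, 'eng'],
})

# isna() marks missing cells; the mean of a boolean column is the missing share
missing_share = toy.isna().mean()

# Same arithmetic as the figure derived from df_book.info()
language_missing = 1 - 492893 / 825175
```

On the real data, `df_book.isna().mean()` would give the corresponding share for every column at once.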

#### 1. Filtering dictionaries

In [9]:
# Flag titles containing the word "dictionary" (case-insensitive): 1 = dictionary, 0 = other
dictionary = [0 if re.search(r'dictionary', str(title).lower()) is None else 1
              for title in df_book.title]

df_book['is_dictionary'] = dictionary

In [10]:
df_book.groupby('is_dictionary').count()

is_dictionary,asin,authors,average_rating,book_id,country_code,description,edition_information,format,image_url,is_ebook,...,publication_year,publisher,ratings_count,series,similar_books,text_reviews_count,title,title_without_series,url,work_id
0,141187,824679,824679,824679,824679,824679,84337,661011,824679,824679,...,675273,653174,824679,824679,824679,824679,824678,824678,824679,824679
1,14,496,496,496,496,496,55,438,496,496,...,445,442,496,496,496,496,496,496,496,496


Almost 500 books (496) belong to the dictionary category. We filter them out using `is_dictionary`:

In [11]:
df_book_noDict = df_book[df_book['is_dictionary'] == 0]
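
The same flag can be computed without an explicit loop. The sketch below uses pandas' `str.contains` on a toy frame (the titles are illustrative); it is an equivalent alternative under the assumption that a plain case-insensitive substring match is what is wanted, not a change to the approach above:

```python
import pandas as pd

# Toy titles (illustrative); None stands in for a missing title
toy = pd.DataFrame({'title': ['Oxford English Dictionary', 'Invisible Man',
                              'A dictionary of slang', None]})

# case=False matches both 'Dictionary' and 'dictionary';
# na=False treats missing titles as non-matches
is_dictionary = toy['title'].str.contains('dictionary', case=False, na=False)

# Keep only the rows that are not flagged
toy_no_dict = toy[~is_dictionary]
```

Calling `is_dictionary.astype(int)` would reproduce the 0/1 encoding used in the notebook.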

#### 2. Outliers in rating counts

To work with the `average_rating` of books, we need to consider how many ratings (`ratings_count`) each book has and remove **outliers**. For instance, the `average_rating` of a book with a single rating can't be compared with that of a book with a hundred ratings. To deal with these differences, we compute the **z-score** of every rating count and keep only the books with |z| < 3 (a common threshold).

In [12]:
z = np.abs(stats.zscore(df_book_noDict.ratings_count))

In [13]:
df_book_noDict = df_book_noDict[(z < 3)]
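
To see what this filter removes, here is a small sketch on made-up counts (the `toy` frame is illustrative only). With a heavily right-skewed column like this one, only the very large count exceeds the threshold, while the book with a single rating is kept:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up rating counts: 19 modest values and one extreme value
toy = pd.DataFrame({'ratings_count': [1, 2, 3, 5, 8, 10, 12, 15, 20, 25, 30, 40,
                                      50, 60, 80, 100, 120, 150, 200, 100000]})

# z-score: distance from the mean in units of the standard deviation
z = np.abs(stats.zscore(toy['ratings_count'].to_numpy(dtype=float)))

# Keep rows within three standard deviations of the mean
kept = toy[z < 3]
```

If books with very few ratings should also be excluded, a separate minimum-count filter would be needed; that is not part of the steps above.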

#### 3. Average Rating vs Absolute Ratings

We add an absolute rating column that rounds the average ratings to whole scores between 0 and 5:

In [14]:
df_book_noDict['abs_rating'] = round(df_book_noDict.average_rating)
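
One detail worth knowing about this step: calling `round` on a pandas Series rounds exact halves to the nearest even number, so 2.5 becomes 2.0 and 4.5 becomes 4.0. The sketch below shows this on a few illustrative values, together with a half-up alternative in case that is the intended behaviour (which one is intended is an assumption left to the reader):

```python
import numpy as np
import pandas as pd

# Illustrative ratings, including exact halves
ratings = pd.Series([2.5, 3.5, 4.5, 3.49, 3.51])

# Series.round (which the built-in round() calls) rounds halves to even
half_even = ratings.round()

# Alternative: always round halves up
half_up = np.floor(ratings + 0.5)
```

For two-decimal averages only the exact `.50` values are affected, so the two variants differ on a small share of books.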

#### 4. Outliers in number of pages

In [15]:
df_book_noDict.dropna(subset=['num_pages'], inplace=True)

In [16]:
z_pages = np.abs(stats.zscore(df_book_noDict.num_pages))
df_book_noDict = df_book_noDict[(z_pages < 3)]

#### 5. Publication year: sanity check

Then, we build `df_book_noDict_year`, which drops the books with a missing publication year as well as the year outliers.

In [17]:
df_book_noDict.dropna(subset=['publication_year'], inplace=True)
z_year = np.abs(stats.zscore(df_book_noDict.publication_year))
df_book_noDict_year = df_book_noDict[(z_year < 3)]
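
Steps 2, 4 and 5 repeat the same pattern: drop missing values, compute z-scores, keep |z| < 3. A hypothetical helper (the name `drop_zscore_outliers` and the toy years are assumptions for illustration, not part of the notebook) could wrap that pattern:

```python
import numpy as np
import pandas as pd
from scipy import stats

def drop_zscore_outliers(df, column, threshold=3.0):
    """Drop rows with a missing `column`, then rows whose |z-score| is >= threshold."""
    complete = df.dropna(subset=[column])
    values = complete[column].to_numpy(dtype=float)
    z = np.abs(stats.zscore(values))
    # Boolean NumPy mask, so duplicated index labels are not a problem
    return complete[z < threshold]

# Toy data: plausible years, one obviously wrong year and one missing value
toy = pd.DataFrame({'publication_year': list(range(1995, 2014)) + [20150, None]})
cleaned = drop_zscore_outliers(toy, 'publication_year')
```

Here the missing year and the year 20150 are removed and the 19 plausible years remain.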

Finally, we save the DataFrames `df_book_noDict` and `df_book_noDict_year`:

In [18]:
#df_book_noDict.to_csv('../Data/processed/books/books_noDict.csv')
#df_book_noDict_year.to_csv('../Data/processed/books/books_noDict_year.csv')
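
A note on saving: by default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below demonstrates this with an in-memory buffer on a toy frame (illustrative values), and shows that `index=False` avoids it:

```python
import io
import pandas as pd

# Toy frame (illustrative values)
toy = pd.DataFrame({'book_id': [1, 2], 'title': ['A', 'B']})

# Round trip with the default settings: the index is written as an unnamed column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Round trip with index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
```

Whether to keep the index is a design choice; dropping it is usually reasonable when the row labels carry no meaning.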

## II. Book Reviews: preprocessing

Now we check the `books_reviews.csv` dataset and apply some NLP preprocessing steps.

### Importing Data

In [19]:
df_reviews = pd.read_csv('../Data/interim/books/books_reviews.csv', index_col=False)

In [20]:
df_reviews.head(2)

Unnamed: 0,book_id,has_spoiler,rating,review_id,review_sentences,timestamp,user_id
0,18245960,True,5,dfdbb7b0eb5a7e4c26d59a937e2e5feb,"[[0, 'This is a special book.'], [0, 'It start...",2017-08-30,8842281e1d1347389f2ab93d60773d4d
1,16981,False,3,a5d2c3628987712d0e05c4f90798eb67,"[[0, 'Recommended by Don Katz.'], [0, 'Avail f...",2017-03-22,8842281e1d1347389f2ab93d60773d4d


In [21]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1378033 entries, 0 to 1378032
Data columns (total 7 columns):
book_id             1378033 non-null int64
has_spoiler         1378033 non-null bool
rating              1378033 non-null int64
review_id           1378033 non-null object
review_sentences    1378033 non-null object
timestamp           1378033 non-null object
user_id             1378033 non-null object
dtypes: bool(1), int64(2), object(4)
memory usage: 64.4+ MB


Now we merge three columns of `df_book_` (`book_id`, `title`, `description`) into this DataFrame for later analysis.

In [22]:
df_book_esential = df_book_.loc[:, ['book_id', 'title', 'description']]

In [23]:
df_books_reviews = df_reviews.merge(df_book_esential)

In [24]:
df_books_reviews.head(2)

Unnamed: 0,book_id,has_spoiler,rating,review_id,review_sentences,timestamp,user_id,title,description
0,16981,False,3,a5d2c3628987712d0e05c4f90798eb67,"[[0, 'Recommended by Don Katz.'], [0, 'Avail f...",2017-03-22,8842281e1d1347389f2ab93d60773d4d,Invisible Man,First published in 1952 and immediately hailed...
1,16981,False,4,706a8032efbde550167bf0d96c2ab501,"[[0, 'This book was actually good, so long tho...",2015-02-25,2159f55d397e8fbe68d5e03668e7d9d2,Invisible Man,First published in 1952 and immediately hailed...
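
Called without arguments, `merge` joins on the columns the two frames share (here `book_id`) and keeps only matching rows, so reviews of books that were dropped earlier (for example, those without a description) disappear. A sketch on toy frames (ids and titles are made up) that makes the key explicit, guards against duplicated books, and counts unmatched reviews:

```python
import pandas as pd

# Toy data (made-up ids and titles)
reviews = pd.DataFrame({'book_id': [1, 1, 2, 3], 'rating': [3, 4, 5, 2]})
books = pd.DataFrame({'book_id': [1, 2], 'title': ['Book A', 'Book B']})

# validate='many_to_one' raises an error if a book_id repeats in `books`,
# which would otherwise silently duplicate review rows
merged = reviews.merge(books, on='book_id', how='inner', validate='many_to_one')

# A left merge with indicator=True shows which reviews found no book
checked = reviews.merge(books, on='book_id', how='left', indicator=True)
n_unmatched = int((checked['_merge'] == 'left_only').sum())
```

In this toy case the review of book 3 has no match, so the inner merge keeps three of the four reviews.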


### NLP preprocessing

In this section we apply NLP preprocessing techniques such as removal of special characters, tokenization and lemmatization. We define a `pre_processing` function as a pipeline of these steps, and a `sentiment_parameters_Pattern` function that computes the polarity and subjectivity of every review. We measure polarity with both TextBlob and AFINN. Finally, we add the normalized reviews and the TextBlob and AFINN sentiment patterns to the `df_books_reviews` DataFrame and save it as a CSV file.

Importing relevant packages

In [25]:
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from textblob.sentiments import PatternAnalyzer
from textblob import TextBlob
import nltk

Installing the AFINN package

In [26]:
import sys
!{sys.executable} -m pip install afinn

In [27]:
from afinn import Afinn

Building `expandContractions` function

In [28]:
"""
from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
all credits go to alko and arturomp @ stack overflow.
"""

with open('../Data_preprocessing/wordLists/contractionList.txt', 'r') as f:
    cList = json.loads(f.read())
    c_re = re.compile('(%s)' % '|'.join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)
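
To illustrate how the regex-based expansion works without the external file, here is a self-contained sketch. The three-entry mapping `toy_contractions` is a hypothetical stand-in for `contractionList.txt`, and sorting the keys by length and escaping them are extra precautions that the original function does not take:

```python
import re

# Hypothetical three-entry mapping standing in for contractionList.txt
toy_contractions = {"can't": "cannot", "won't": "will not", "it's": "it is"}

# Longest keys first, so a longer contraction is tried before a shorter prefix;
# re.escape guards against keys that contain regex metacharacters
keys = sorted(toy_contractions, key=len, reverse=True)
toy_re = re.compile('(%s)' % '|'.join(re.escape(k) for k in keys))

def expand_toy(text):
    # Replace every matched contraction with its expansion
    return toy_re.sub(lambda m: toy_contractions[m.group(0)], text)

expanded = expand_toy("it's late and i can't stay")
```

The result is `"it is late and i cannot stay"`. Matching is case-sensitive, which is why the text is lowercased before expansion in the pipeline.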

Pipeline preprocessing

1. Normalize and expand contractions
2. Delete special characters
3. Tokenize words
4. Lemmatization
5. Join sentences again

Building `pre_processing` and `sentiment_parameters_Pattern` functions

In [29]:
wpt = nltk.WordPunctTokenizer()
lemmatizer = WordNetLemmatizer() 

def pre_processing(text):
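    # Normalize curly apostrophes so that contractions match the lookup list
    text = re.sub(r'’', "'", text)
    # Lowercase and expand contractions
    text = expandContractions(text.lower())
    # Lowercase again in case the expansions introduce capital letters
    text = text.lower()
    # Remove everything except letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization (no stop-word filtering is applied here)
    tokens = wpt.tokenize(text)
    # Lemmatization
    words_lem = [lemmatizer.lemmatize(word) for word in tokens]
    # Join the lemmas back into a single string
    text_norm = ' '.join(words_lem)

    return text_norm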

def sentiment_parameters_Pattern(sentence):
    blob = TextBlob(sentence, analyzer=PatternAnalyzer())
    return blob.sentiment.polarity, blob.sentiment.subjectivity
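
The character-level steps of `pre_processing` can be traced on a single made-up sentence using only the standard library. This sketch deliberately skips contraction expansion and lemmatization, and uses `str.split()` as a stand-in for `WordPunctTokenizer` (an assumption that holds here because only letters and whitespace remain at that point):

```python
import re

# Made-up review sentence with a curly apostrophe, capitals, digits and punctuation
sample = "It’s a GREAT book!!! 10/10, would read again."

# Step 1: normalize the apostrophe and lowercase
normalized = re.sub(r'’', "'", sample).lower()

# Step 2: keep only letters and whitespace
letters_only = re.sub(r'[^a-zA-Z\s]', '', normalized)

# Step 3: split into tokens (stand-in for WordPunctTokenizer)
tokens = letters_only.split()
```

Without the expansion step, "it's" collapses to "its", which shows why contractions are expanded before the special characters are removed.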

Normalizing `review_sentences`:

In [30]:
text_normalized = [pre_processing(text) for text in df_books_reviews['review_sentences']]
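
The preview of `review_sentences` suggests that each cell is a string holding a list of `[number, sentence]` pairs. The pipeline above cleans that string directly, and the brackets and digits are removed by the special-character filter. If the structure is needed, one option is to parse it first. This is a sketch that assumes the cells are valid Python literals, which has not been verified for the whole column; the sample string is made up:

```python
import ast

# Made-up cell in the same shape as the preview of review_sentences
raw = "[[0, 'This is a special book.'], [1, 'It starts slowly.']]"

# literal_eval safely parses Python literals (lists, numbers, strings)
pairs = ast.literal_eval(raw)

# Separate the leading numbers from the sentence texts
flags = [flag for flag, _sentence in pairs]
review_text = ' '.join(sentence for _flag, sentence in pairs)
```

Parsing first keeps the sentence boundaries available, which could be useful for sentence-level sentiment later.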

Saving the normalized reviews as a new column of `df_books_reviews`

In [31]:
df_books_reviews['text_normalized'] = text_normalized

Measuring the polarity and subjectivity of every normalized review:

In [32]:
sentiment_pattern = [sentiment_parameters_Pattern(text) for text in text_normalized]

Saving new columns with the patterns extracted in `df_books_reviews`:

In [33]:
# Each element of sentiment_pattern is a (polarity, subjectivity) pair
polarity_textBlob = [polarity for polarity, _ in sentiment_pattern]
subjectivity_textBlob = [subjectivity for _, subjectivity in sentiment_pattern]
df_books_reviews['polarity_textBlob'] = polarity_textBlob
df_books_reviews['subjectivity_textBlob'] = subjectivity_textBlob

Now we'll use the AFINN lexicon to measure the polarity of the reviews.

In [34]:
af = Afinn()

We measure the polarity according to the AFINN lexicon and then classify each review as positive, negative or neutral depending on its polarity score. We then add both as new columns of `df_books_reviews`.

In [35]:
sentiment_scores = [af.score(text) for text in text_normalized]

In [36]:
sentiment_category = ['positive' if score > 0 
                          else 'negative' if score < 0 
                              else 'neutral' 
                                  for score in sentiment_scores]

In [37]:
df_books_reviews['sentiment_scores'] = sentiment_scores
df_books_reviews['sentiment_category'] = sentiment_category
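
Since both TextBlob and AFINN scores are now available, a simple way to compare them is to map each score to a category by its sign and measure how often the two agree. This is a sketch on made-up scores; the helper `to_category` and the column names `category_textBlob` and `category_afinn` are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up scores for four reviews
toy = pd.DataFrame({
    'polarity_textBlob': [0.5, -0.2, 0.0, 0.3],
    'sentiment_scores': [4.0, -3.0, 0.0, -1.0],
})

labels = np.array(['negative', 'neutral', 'positive'])

def to_category(scores):
    # np.sign gives -1, 0 or 1; adding 1 turns it into an index into `labels`
    idx = np.sign(np.asarray(scores, dtype=float)).astype(int) + 1
    return labels[idx]

toy['category_textBlob'] = to_category(toy['polarity_textBlob'])
toy['category_afinn'] = to_category(toy['sentiment_scores'])

# Share of reviews on which both methods give the same category
agreement = (toy['category_textBlob'] == toy['category_afinn']).mean()
```

In this toy example the two methods disagree only on the last review, so the agreement is 0.75.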

Finally, we save the file in the `processed` folder:

In [38]:
#df_books_reviews.to_csv('../Data/processed/books/sentiment_patterns_books_reviews.csv')