# 2. DATA

In total, the dataset contains 10,000 books, but we decided to use only 3,000 of them to obtain a network of a more manageable size.

The reason is that there are 6 million ratings in total, which results in a graph that is too dense. Even with 3,000 books, we still had to remove edges whose weight was lower than a certain threshold to bring the graph down to a manageable size.
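
The edge pruning mentioned above can be sketched as follows. This is a minimal illustration only: the edge list, its column names (`source`, `target`, `weight`) and the value of `MIN_WEIGHT` are assumptions made for the example, not the values used for the real network.

```python
import pandas as pd

# Hypothetical edge list: column names and values are illustrative only.
edges = pd.DataFrame({
    "source": [1, 1, 2, 3],
    "target": [2, 3, 3, 4],
    "weight": [12, 3, 7, 1],
})

MIN_WEIGHT = 5  # assumed threshold; the real value is not stated in this notebook

# Keep only edges whose weight reaches the threshold.
pruned = edges[edges["weight"] >= MIN_WEIGHT].reset_index(drop=True)
print(pruned)
```

Raising `MIN_WEIGHT` makes the graph sparser, so in practice the threshold would be tuned until the network reaches a workable size.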

**After cleaning, our dataset is around 100 MB in total.**

## Preparing data
- In the rest of this notebook we explain how we prepared our dataset for analysis.

### 2.1 Tags
Goodbooks-10k tags are stored in two files:
1. book_tags.csv: 
    - book_id 
    - tag_id 
    - number of users who gave this tag
2. tags.csv: 
    - tag_id 
    - tag_name
    
We want to keep only the **top 5 tags** per book, and we need to transform the data into the following form:
- book_id
- tag_name
- number of users who gave this tag
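
The target form above can be sketched on toy data as follows. The two small frames are made-up stand-ins for `book_tags.csv` and `tags.csv`; the column names `goodreads_book_id` and `count` follow the code cells of this notebook, and the merge plus `groupby(...).head(5)` is one possible way to get the result, not necessarily the one used here.

```python
import pandas as pd

# Toy stand-ins for book_tags.csv and tags.csv (values are made up).
book_tags = pd.DataFrame({
    "goodreads_book_id": [1] * 6 + [2] * 2,
    "tag_id": [10, 11, 12, 13, 14, 15, 10, 11],
    "count": [50, 40, 30, 20, 10, 5, 7, 9],
})
tags = pd.DataFrame({
    "tag_id": [10, 11, 12, 13, 14, 15],
    "tag_name": ["fantasy", "magic", "young-adult", "fiction", "classics", "series"],
})

# Attach the tag name, then keep the 5 most-used tags per book.
merged = book_tags.merge(tags, on="tag_id", how="left")
top5 = (
    merged.sort_values("count", ascending=False)
          .groupby("goodreads_book_id")
          .head(5)
          .sort_values(["goodreads_book_id", "count"], ascending=[True, False])
          [["goodreads_book_id", "tag_name", "count"]]
          .reset_index(drop=True)
)
print(top5)
```

Book 1 has six tags in the toy data, so its least-used tag is dropped; book 2 has only two tags and keeps both.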



In [4]:
import pandas as pd
#load book tags from csv
booktags = pd.read_csv(".\\data\\book_tags.csv")
#load tag names
tagnames = pd.read_csv(".\\data\\tags.csv")
#load list of 3000 book ids
bookids = pd.read_pickle(".\\data\\book_id3000.pkl")

In [7]:
#translate tag id to tag name
tagdict = tagnames.set_index('tag_id')['tag_name'].to_dict()
booktags['tag_name'] = booktags['tag_id'].map(tagdict)
booktags.head()

#### Blacklisting tags
- We also want to remove frequent tags that carry no information about a book's genre and are instead used to organize users' libraries.

In [8]:
blacklist_strict = ["to-read", "currently-reading", "favorites", "owned-books", "books-i-own", "owned", "re-read",
                    "library", "kindle", "default", "ebook", "my-books", "wish-list", "my-library", "audiobooks",
                    "i-own", "audio", "favourites", "own-it", "e-book", "e-books", "to-buy", "audiobook", "ebooks",
                    "books", "audible", "audio-books", "audio-book", "have"]
#remove blacklisted tags in a single vectorized filter
booktags = booktags[~booktags["tag_name"].isin(blacklist_strict)]

**Find Top 5 tags for each book**

In [None]:
#sort booktags by tag count, most used first
#(sort_values returns a new frame, so the result has to be assigned)
booktags = booktags.sort_values(by=['count'], ascending=False)

#find top 5 tags for each book_id
res = list()
for bookid in bookids:
    BI = booktags["goodreads_book_id"] == bookid
    res.append(booktags[BI][:5])

#concatenate all dataframes into 1
top5 = pd.concat(res)

### 2.2 Reviews & Descriptions

#### Downloading from GoodReads
- Our selected dataset doesn't contain user reviews or book descriptions, so we used the following script to download them.

In [None]:
import pickle
import re
from urllib.request import urlopen

import pandas as pd
from lxml.html.clean import clean_html as lxml_clean_html #cleaning javascript from html
from bs4 import BeautifulSoup #html tag removal

In [None]:
#load book ids
with open('.\\data\\book_id3000.pkl', 'rb') as f:
    bookIDs = pickle.load(f)

#one row per book; the 'page' column will hold the downloaded html
#(assumption: the original cell used `df` without showing how it was built)
df = pd.DataFrame({'bookID': list(bookIDs), 'page': None})

baseUrl = "https://www.goodreads.com/book/show/"

err_list = list()
for index, bookID in df['bookID'].items():
    #create a query for the book
    query = "%s%s"%(baseUrl,bookID)
    try:
        #download html page
        res = urlopen(query)
    except Exception:
        #save bookids with error
        err_list.append(bookID)
        continue

    #use utf-8 encoding and ignore other characters
    source = res.read().decode('utf-8', 'ignore')

    #remove javascript
    source = lxml_clean_html(source)

    #save page
    df.at[index,'page'] = source

**Cleaning HTML pages**
- Since the pages contain a lot of information we don't need, we have to find the tags that contain reviews and descriptions, deduplicate them, and strip the HTML tags.
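
The three steps above (find, strip, deduplicate) can be illustrated end to end on a toy fragment. This is a sketch under stated assumptions: the HTML snippet is hypothetical and real Goodreads markup may differ, `strip_tags` is a crude regex stand-in for BeautifulSoup so the example needs only the standard library, and `collapse_pairs` is a hypothetical helper name.

```python
import re

# Hypothetical page fragment; real Goodreads markup may differ.
page = "\n".join([
    '<div id="description">',
    '<span id="freeTextContainer1">A tale of two wizards who...</span>',
    '<span id="freeText1" style="display:none">A tale of two wizards who travel north.</span>',
    '</div>',
    '<span id="freeTextContainer2">Loved it.</span>',
])

def strip_tags(fragment):
    # Crude regex stand-in for BeautifulSoup(fragment, 'lxml').text
    return re.sub(r"<[^>]+>", "", fragment).strip()

def collapse_pairs(texts, prefix_len):
    # When two neighbours share a prefix, keep only the second (full) version.
    kept, i = [], 0
    while i < len(texts):
        if i + 1 < len(texts) and texts[i][:prefix_len] == texts[i + 1][:prefix_len]:
            kept.append(texts[i + 1])
            i += 2
        else:
            kept.append(texts[i])
            i += 1
    return kept

# 1) find candidate spans, 2) strip markup, 3) drop the short duplicates
spans = re.findall(r'<span id="freeText.*', page)
texts = [strip_tags(s) for s in spans]
unique = collapse_pairs(texts, prefix_len=20)  # the notebook compares 50 characters

description, reviews = unique[0], unique[1:]
print(description)
print(reviews)
```

The prefix comparison relies on the short version of a text appearing directly before its full version; texts that have no truncated twin pass through unchanged.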

In [None]:
#regular expression to find tag which includes text for descriptions and reviews
regexp = r'<span id="freeText.*'

#`data` is assumed to be the DataFrame that holds the downloaded pages in its 'page' column
#object columns, so that a list of reviews fits into a single cell
data['description'] = None
data['reviews'] = None

for index, p in data['page'].items():
    #skip books whose page could not be downloaded
    if not isinstance(p, str):
        continue

    #find tags including descriptions and reviews
    texts = re.findall(regexp, p)

    #remove html tags
    res = [BeautifulSoup(t, 'lxml').text for t in texts]

    #remove duplicates (each longer text has short version and full version):
    #when two neighbours share their first 50 characters, keep only the second (full) one
    fin = list()
    i = 0
    while i < len(res):
        if i + 1 < len(res) and res[i][:50] == res[i+1][:50]:
            fin.append(res[i+1])
            i += 2
        else:
            fin.append(res[i])
            i += 1

    if len(fin) == 0:
        continue

    #first text is the description, the rest are reviews
    d = fin[0]
    r = fin[1:]

    #save results
    data.at[index,'description'] = d
    data.at[index,'reviews'] = r