# Goodreads Database

## Summary
<b>Goals</b>:
- create a database with 3 tables:
    - books
    - genres
    - book-genre pairs  
    
<b>Source data</b>: 
- csv file for GoodReads created in Phase 1 of the project (link)  

<b>Implementation:</b>  
For this purpose we used an Amazon RDS instance running MySQL. We needed to design the schema, implement it on the existing instance, and then populate the created tables with our data.  
The SQL scripts were run with DBeaver. The same could be done with Python libraries, but DBeaver turned out to be faster and cleaner. 
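
As a rough illustration of the Python route mentioned above, the sketch below executes a script through a DB-API 2.0 connection. It is not what was used in this project: `run_statements` is a hypothetical helper, the standard-library `sqlite3` module stands in for a MySQL driver only so that the example runs without a server, and the `demo_books` table is a simplified placeholder.

```python
import sqlite3


def run_statements(conn, script):
    """Execute each semicolon-separated statement in `script` on a DB-API connection.

    Naive splitting: assumes there are no semicolons inside string literals.
    """
    cur = conn.cursor()
    executed = 0
    for statement in script.split(";"):
        statement = statement.strip()
        if statement:
            cur.execute(statement)
            executed += 1
    conn.commit()
    return executed


# Illustrative stand-in for a real MySQL connection (assumption, not the RDS setup).
conn = sqlite3.connect(":memory:")
demo_sql = """
CREATE TABLE demo_books (
    book_id BIGINT NOT NULL,
    title varchar(255) NOT NULL,
    CONSTRAINT demo_books_PK PRIMARY KEY (book_id)
);
"""
n_executed = run_statements(conn, demo_sql)
tables = [row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
print(n_executed, tables)
```

With a MySQL driver, only the line that creates the connection would need to change, provided the driver follows the same DB-API interface.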

## DB Schema

In [15]:
from IPython.display import Image, HTML
HTML("<h3>ER-diagram</h3>")

In [16]:
Image(url= "gr_erd.png")

### Books table creation script

In [33]:
books_table_sql ="""
CREATE TABLE bidb.gr_books (  
	book_id BIGINT NOT NULL,  
	title varchar(255) NOT NULL,  
	reviews_cnt INT NULL,  
	ratings_cnt INT NULL,  
	pub_year INT NULL,  
	avg_rating FLOAT NULL,  
	alt_avg_rating FLOAT NULL,  
	author_1_name varchar(100) NULL,  
	author_1_avg_rating FLOAT NULL,  
	CONSTRAINT gr_books_PK PRIMARY KEY (book_id)  
)  
ENGINE=InnoDB  
DEFAULT CHARSET=utf8  
COLLATE=utf8_general_ci;
"""

### Genres table creation script

In [34]:
genres_table_sql = """
CREATE TABLE bidb.gr_genres (
	genre_id varchar(100) NOT NULL,
	name varchar(100) NULL,
	CONSTRAINT gr_genres_PK PRIMARY KEY (genre_id)
)
ENGINE=InnoDB
DEFAULT CHARSET=utf8
COLLATE=utf8_general_ci;
"""

### Genre-book table (many-to-many relationship)

In [35]:
genre_book_sql = """
CREATE TABLE bidb.gr_genre_book (
	book_id BIGINT NOT NULL,
	genre_id VARCHAR(100) NOT NULL,
	id BIGINT NOT NULL AUTO_INCREMENT,
	CONSTRAINT gr_genre_book_PK PRIMARY KEY (id),
	CONSTRAINT gr_genre_book_gr_books_FK FOREIGN KEY (book_id) REFERENCES bidb.gr_books(book_id),
	CONSTRAINT gr_genre_book_gr_genres_FK FOREIGN KEY (genre_id) REFERENCES bidb.gr_genres(genre_id)
)
ENGINE=InnoDB
DEFAULT CHARSET=utf8
COLLATE=utf8_general_ci;
"""

## Step 1: Create books table
### Preparing the data
For the books table we already have book ids, so we can reuse them. Since we're going to create a separate table for genres, we need to exclude the genre columns. This can be done in various ways. 

In [1]:
# Import the necessary libraries
import pandas as pd
#import numpy as np

In [21]:
# read in the csv file prepared in Phase 1 of the project
df = pd.read_csv("gr_books_v4.csv")

In [47]:
# show the columns that we have now
print(df.columns)

Index(['id', 'title', 'reviews_cnt', 'ratings_cnt', 'pub_year', 'avg_rating',
       'alt_avg_rating', 'author1_name', 'author1_role', 'author1_avg_rating',
       'author2_name', 'author2_role', 'author2_avg_rating', 'author3_name',
       'author3_role', 'author3_avg_rating', 'genres', 'Genre_1', 'Genre_2',
       'Genre_3', 'Genre_4', 'Genre_5', 'Genre_6', 'Genre_7', 'Genre_8',
       'Genre_9', 'Genre_10', 'Genre_11'],
      dtype='object')


For our books table we only need columns 1-8 and 10 (counting from 1). We exclude author1_role because it's irrelevant for our purposes. 

In [46]:
columns = list(df.columns)
books_columns = columns[:8] + [columns[9]]
print(books_columns)

['id', 'title', 'reviews_cnt', 'ratings_cnt', 'pub_year', 'avg_rating', 'alt_avg_rating', 'author1_name', 'author1_avg_rating']
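
Positional slicing depends on the column order of the CSV file. As a possible alternative, the sketch below selects the same nine columns by name, so the result does not depend on column order and a missing column raises a `KeyError`. The `sample` frame holds placeholder values; only the column names are taken from the output above.

```python
import pandas as pd

# Column names taken from the df.columns output; the values are placeholders.
all_columns = ['id', 'title', 'reviews_cnt', 'ratings_cnt', 'pub_year', 'avg_rating',
               'alt_avg_rating', 'author1_name', 'author1_role', 'author1_avg_rating',
               'genres']
sample = pd.DataFrame([[1, 'A'] + [None] * 9, [2, 'B'] + [None] * 9], columns=all_columns)

# Select by name instead of by position.
books_columns_by_name = ['id', 'title', 'reviews_cnt', 'ratings_cnt', 'pub_year',
                         'avg_rating', 'alt_avg_rating', 'author1_name',
                         'author1_avg_rating']
books_only_by_name = sample[books_columns_by_name]
print(list(books_only_by_name.columns))
```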


Now we only need to save it to a file again: 

In [29]:
books_only = df[books_columns]
books_only.to_csv('books_only.csv', index = False)
books_only.head()

| | id | title | reviews_cnt | ratings_cnt | pub_year | avg_rating | alt_avg_rating | author1_name | author1_avg_rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 37424706 | The Art of Gathering: How We Meet and Why It M... | 1367 | 23 | | 4.3 | 4.304348 | Priya Parker | 4.3 |
| 1 | 117833 | The Master and Margarita | 331567 | 178007 | 1967.0 | 4.32 | 4.316718 | Mikhail Bulgakov | 4.25 |
| 2 | 18632929 | Kaip atpažinti psichopatą | 295551 | 96957 | 2012.0 | 3.92 | 3.922079 | Jon Ronson | 3.9 |
| 3 | 1953 | A Tale of Two Cities | 1240390 | 710415 | 1859.0 | 3.82 | 3.820596 | Charles Dickens | 3.87 |
| 4 | 5130 | Island | 50444 | 18565 | 1962.0 | 3.87 | 3.866792 | Aldous Huxley | 3.98 |
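
Before loading this file into `gr_books`, it can help to check it against the constraints declared in the schema, since rows that violate them may be skipped or truncated depending on the server settings. The sketch below covers the constraints visible in the table definition: a non-null, unique `book_id` and a non-null `title` of at most 255 characters. `check_books` is a hypothetical helper, and the `good` and `bad` frames are invented for illustration.

```python
import pandas as pd


def check_books(books):
    """Return a list of human-readable problems found in a books data frame."""
    problems = []
    if books['id'].isna().any():
        problems.append('missing id')
    if books['id'].duplicated().any():
        problems.append('duplicate id')
    if books['title'].isna().any():
        problems.append('missing title')
    # Measure only the titles that are present.
    title_lengths = books['title'].dropna().astype(str).str.len()
    if (title_lengths > 255).any():
        problems.append('title longer than 255 characters')
    return problems


good = pd.DataFrame({'id': [1, 2], 'title': ['A', 'B']})
bad = pd.DataFrame({'id': [1, 1], 'title': ['A', 'x' * 300]})
print(check_books(good))
print(check_books(bad))
```

In the notebook, the call would be `check_books(books_only)`.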


### Populating the table

In [36]:
books_table_populate_sql = """
LOAD DATA LOCAL INFILE '/Users/anamakarevich/gr_books_only.csv'
INTO TABLE gr_books
FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(book_id, title, reviews_cnt, ratings_cnt, pub_year, avg_rating, alt_avg_rating, author_1_name, author_1_avg_rating)
"""

## Step 2: Create genres table

### Preparing the data 

For genres we need to extract all the genres present in the table and assign them some ids so that we can reference them in the book_genre table. 
First, we will create a helper function to generate the column names for genres. 

In [37]:
def get_columns(n_genres):
    result = []
    for i in range(1,n_genres+1):
        result.append('Genre_'+str(i))
    return result

In [45]:
# get the columns with genres
genre_columns = get_columns(11)
print(genre_columns)

['Genre_1', 'Genre_2', 'Genre_3', 'Genre_4', 'Genre_5', 'Genre_6', 'Genre_7', 'Genre_8', 'Genre_9', 'Genre_10', 'Genre_11']
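
The same list can also be built with a list comprehension. This is only a more compact equivalent of `get_columns`, given a different (hypothetical) name so that it does not shadow the original.

```python
def get_columns_compact(n_genres):
    """Return ['Genre_1', ..., 'Genre_<n_genres>']."""
    return [f'Genre_{i}' for i in range(1, n_genres + 1)]


compact_columns = get_columns_compact(11)
print(compact_columns)
```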


Next we actually extract the values of the genres for all the books and create a set of genres.

In [44]:
# extract just the genres (replacing missing values with an empty string so that they are easy to remove later)
set_of_genres = set(df[genre_columns].fillna('').values.flatten())
# remove the empty string (discard does not fail if it is absent)
set_of_genres.discard('')
print(set_of_genres)

{'romance', 'history', 'science', 'science-fiction', 'personal-development', 'non-fiction', 'politics', 'literature', 'fiction', 'adult', 'children', 'american', 'biography', 'mystery', 'adventure', 'dystopia', 'contemporary', 'philosophy', 'fantasy', 'young-adult', 'british', 'classics', 'psychology', 'economics', 'novel'}


We have successfully extracted our genres and are ready to create a source file for them. The last thing we need to do is assign ids to the genres, which is easy to do when creating a data frame.  
Note: we could have created the ids automatically in the database, but I decided to try both ways. The last table will use automatic id creation with AUTO_INCREMENT. 

In [50]:
genres_df = pd.DataFrame({'genre_id':range(len(set_of_genres)), 'name': list(set_of_genres)})
# save to .csv
genres_df.to_csv('genres_only.csv', index = False)
genres_df.head()

| | genre_id | name |
|---|---|---|
| 0 | 0 | romance |
| 1 | 1 | history |
| 2 | 2 | science |
| 3 | 3 | science-fiction |
| 4 | 4 | personal-development |
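
One caveat with building ids from a `set`: the iteration order of a set of strings is not guaranteed to be the same across Python sessions, so rerunning the notebook could assign different ids to the same genres. A possible variant, sketched below, sorts the names first so that the ids are reproducible. The `sample` frame is made up for illustration; in the notebook the input would be `df[genre_columns]`.

```python
import pandas as pd

sample = pd.DataFrame({
    'Genre_1': ['fiction', 'classics', 'fiction'],
    'Genre_2': ['classics', None, 'romance'],
})

# Flatten all genre columns into one column, drop missing values, keep unique names.
unique_genres = sample.melt(value_name='genre')['genre'].dropna().unique()

# Sorting fixes the order, so each genre gets the same id on every run.
sorted_genres = sorted(unique_genres)
stable_genres_df = pd.DataFrame({'genre_id': range(len(sorted_genres)), 'name': sorted_genres})
print(stable_genres_df)
```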


### Populating the table 

In [51]:
genres_table_populate_sql = """
LOAD DATA LOCAL INFILE '/Users/anamakarevich/genres_only.csv'
INTO TABLE gr_genres
FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(genre_id, name)
"""

## Step 3: Create genre-book table 

The last table is the most complicated, since we have a many-to-many relationship here, which should be implemented with an additional table that matches book ids to genre ids.  
The first thing to do is to create a lookup dictionary from genre names to ids so that we can build the required table. 

In [54]:
# create dictionary for genres for fast lookup
genres_dict = dict(zip(genres_df.name, genres_df.genre_id))

In [55]:
print(genres_dict)

{'romance': 0, 'history': 1, 'science': 2, 'science-fiction': 3, 'personal-development': 4, 'non-fiction': 5, 'politics': 6, 'literature': 7, 'fiction': 8, 'adult': 9, 'children': 10, 'american': 11, 'biography': 12, 'mystery': 13, 'adventure': 14, 'dystopia': 15, 'contemporary': 16, 'philosophy': 17, 'fantasy': 18, 'young-adult': 19, 'british': 20, 'classics': 21, 'psychology': 22, 'economics': 23, 'novel': 24}


Next we will write a function that matches book ids to genre ids. To do that, we iterate through the rows of the original data frame. We only need two columns: id (the book id) and genres, which contains a comma-separated list of genres. We use .values instead of .iterrows() because .values gives us a NumPy array, which is more efficient to iterate over for this purpose. That is also why we address the columns by position instead of by name: each row is just an array in this case. 

In [60]:
def generate_many_to_many_df():
    books_ids = []
    genres_ids = []
    for row in df[['id','genres']].values:
        book_id = row[0]
        genres = row[1].split(',')
        for genre in genres:
            gen_id = genres_dict.get(genre)
            # we use "is not None" here because 0 is one of the ids in the
            # dictionary, and a plain truthiness check would treat it as False
            if gen_id is not None:
                books_ids.append(book_id)
                genres_ids.append(gen_id)
    return pd.DataFrame({'book_id': books_ids, 'genre_id': genres_ids})
res_df = generate_many_to_many_df()
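
For comparison, the same pairing can be expressed with pandas operations instead of an explicit loop. This is a sketch of an alternative, not the approach used in this notebook: it splits `genres` into lists, uses `explode` to get one row per (book, genre) pair, and maps genre names to ids. The `sample` frame and the two-entry `sample_dict` are illustrative. Genres that are missing from the dictionary are dropped, mirroring the `is not None` check.

```python
import pandas as pd

sample = pd.DataFrame({
    'id': [117833, 1953],
    'genres': ['fiction,classics', 'classics,unknown-tag'],
})
sample_dict = {'fiction': 8, 'classics': 21}

# One row per (book, genre name) pair.
pairs = sample.assign(genre=sample['genres'].str.split(',')).explode('genre', ignore_index=True)

# Map names to ids; names not in the dictionary become NaN and are dropped.
pairs = pairs.assign(genre_id=pairs['genre'].map(sample_dict))
pairs = (
    pairs.dropna(subset=['genre_id'])
    .astype({'genre_id': int})
    .rename(columns={'id': 'book_id'})[['book_id', 'genre_id']]
    .reset_index(drop=True)
)
print(pairs)
```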

Now we can finally save our last data frame to CSV:

In [62]:
res_df.to_csv('genre_book.csv', index=False)
res_df.head()

| | book_id | genre_id |
|---|---|---|
| 0 | 37424706 | 5 |
| 1 | 37424706 | 23 |
| 2 | 37424706 | 4 |
| 3 | 37424706 | 22 |
| 4 | 117833 | 8 |
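
Because `gr_genre_book` declares foreign keys to both `gr_books` and `gr_genres`, every pair has to reference rows that already exist in those tables. A quick check before loading might look like the sketch below. `find_orphans` is a hypothetical helper, and the sample ids are chosen only to trigger both kinds of mismatch.

```python
import pandas as pd


def find_orphans(pairs, book_ids, genre_ids):
    """Return the rows of `pairs` whose book_id or genre_id has no parent row."""
    missing_book = ~pairs['book_id'].isin(book_ids)
    missing_genre = ~pairs['genre_id'].isin(genre_ids)
    return pairs[missing_book | missing_genre]


sample_pairs = pd.DataFrame({'book_id': [117833, 117833, 999], 'genre_id': [8, 77, 8]})
orphans = find_orphans(sample_pairs, book_ids=[117833, 1953], genre_ids=[8, 21])
print(orphans)
```

In the notebook, the call would be `find_orphans(res_df, books_only['id'], genres_df['genre_id'])`, and an empty result would mean every pair has both parents.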


### Populating the table 

In [63]:
genre_book_populate_sql = """
LOAD DATA LOCAL INFILE '/Users/anamakarevich/genre_book.csv'
INTO TABLE gr_genre_book 
FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(book_id, genre_id)
"""

## Results

We can check that it all works by joining all tables. For example, let's extract all the genres for the book named "Breakfast at Tiffany's".

In [65]:
Image(url= "gr_results.png")
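
The check itself is shown only as a screenshot. A three-way join of the kind described might look like the sketch below. It is an assumption about the shape of the query, not the one actually run: `sqlite3` stands in for MySQL, the tables are reduced to the columns needed for the join, `genres_for_title` is a hypothetical helper, and only a few rows visible in the outputs above are inserted.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE gr_books (book_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE gr_genres (genre_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gr_genre_book (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    book_id INTEGER NOT NULL REFERENCES gr_books(book_id),
    genre_id INTEGER NOT NULL REFERENCES gr_genres(genre_id)
);
INSERT INTO gr_books VALUES (117833, 'The Master and Margarita'), (1953, 'A Tale of Two Cities');
INSERT INTO gr_genres VALUES (8, 'fiction'), (21, 'classics');
INSERT INTO gr_genre_book (book_id, genre_id) VALUES (117833, 8);
""")


def genres_for_title(conn, title):
    """Return the genre names linked to the book with the given title."""
    # The "?" placeholder is sqlite3's style; other drivers may use a different one.
    query = """
        SELECT g.name
        FROM gr_books AS b
        JOIN gr_genre_book AS gb ON gb.book_id = b.book_id
        JOIN gr_genres AS g ON g.genre_id = gb.genre_id
        WHERE b.title = ?
        ORDER BY g.name
    """
    return [row[0] for row in conn.execute(query, (title,))]


result = genres_for_title(conn, 'The Master and Margarita')
print(result)
```

Passing the title as a query parameter rather than formatting it into the SQL string also avoids quoting problems with titles that contain an apostrophe, such as "Breakfast at Tiffany's".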