# SQL Project 


The coronavirus took the entire world by surprise, changing everyone's daily routine. City dwellers no longer spent their free time outside, going to cafes and malls; more people were home, reading books. That attracted the attention of startups that rushed to develop new apps for book lovers.
You've been given a database of one of the services competing in this market. It contains data on books, publishers, authors, and customer ratings and reviews of books. This information will be used to generate a value proposition for a new product.


# Description of the data
books:
Contains data on books:
book_id
author_id


title
num_pages — number of pages
publication_date
publisher_id


authors:
Contains data on authors:
author_id
author


publishers:
Contains data on publishers:
publisher_id
publisher


ratings:
Contains data on user ratings:

rating_id
book_id
username — the name of the user who rated the book
rating


reviews:
Contains data on customer reviews:


review_id
book_id
username — the name of the user who reviewed the book
text — the text of the review



# The goals of the study

Find the number of books released after January 1, 2000.
Find the number of user reviews and the average rating for each book.
Identify the publisher that has released the greatest number of books with more than 50 pages (this will help you exclude brochures and similar publications from your analysis).
Identify the author with the highest average book rating: look only at books with at least 50 ratings.
Find the average number of text reviews among users who rated more than 50 books

In [1]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine

In [2]:


db_config = {'user': 'practicum_student',         # username
             'pwd': 's65BlTKV3faNIGhmvJVzOqhs', # password
             'host': 'rc1b-wcoijxj3yxfsf3fs.mdb.yandexcloud.net',
             'port': 6432,              # connection port
             'db': 'data-analyst-final-project-db'}          # the name of the database

connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(db_config['user'],
                                                                     db_config['pwd'],
                                                                       db_config['host'],
                                                                       db_config['port'],
                                                                       db_config['db'])

engine = create_engine(connection_string, connect_args={'sslmode':'require'})

# Study the tables

In [3]:


tables = ['books', 'authors', 'publishers', 'ratings', 'reviews']

for table in tables:
    query = 'SELECT * FROM ' + table + ' LIMIT 5'

    df = pd.io.sql.read_sql(query, con=engine)

    # Displaying the tables
    display(df)

  # Checking for duplicates
    duplicates = df.duplicated()
    if duplicates.any():
        print(f"Duplicates found in the '{table}' table")
        print(df[duplicates])
    else:
        print(f"No duplicate values found in the '{table}' table")

    # Checking for missing values
    missing_values = df.isnull().sum()
    if missing_values.any():
        print(f"Missing values found in the '{table}' table")
        print(missing_values)
    else:
        print(f"No missing values found in the '{table}' table")


Unnamed: 0,book_id,author_id,title,num_pages,publication_date,publisher_id
0,1,546,'Salem's Lot,594,2005-11-01,93
1,2,465,1 000 Places to See Before You Die,992,2003-05-22,336
2,3,407,13 Little Blue Envelopes (Little Blue Envelope...,322,2010-12-21,135
3,4,82,1491: New Revelations of the Americas Before C...,541,2006-10-10,309
4,5,125,1776,386,2006-07-04,268


No duplicate values found in the 'books' table
No missing values found in the 'books' table


Unnamed: 0,author_id,author
0,1,A.S. Byatt
1,2,Aesop/Laura Harris/Laura Gibbs
2,3,Agatha Christie
3,4,Alan Brennert
4,5,Alan Moore/David Lloyd


No duplicate values found in the 'authors' table
No missing values found in the 'authors' table


Unnamed: 0,publisher_id,publisher
0,1,Ace
1,2,Ace Book
2,3,Ace Books
3,4,Ace Hardcover
4,5,Addison Wesley Publishing Company


No duplicate values found in the 'publishers' table
No missing values found in the 'publishers' table


Unnamed: 0,rating_id,book_id,username,rating
0,1,1,ryanfranco,4
1,2,1,grantpatricia,2
2,3,1,brandtandrea,5
3,4,2,lorichen,3
4,5,2,mariokeller,2


No duplicate values found in the 'ratings' table
No missing values found in the 'ratings' table


Unnamed: 0,review_id,book_id,username,text
0,1,1,brandtandrea,Mention society tell send professor analysis. ...
1,2,1,ryanfranco,Foot glass pretty audience hit themselves. Amo...
2,3,2,lorichen,Listen treat keep worry. Miss husband tax but ...
3,4,3,johnsonamanda,Finally month interesting blue could nature cu...
4,5,3,scotttamara,Nation purpose heavy give wait song will. List...


No duplicate values found in the 'reviews' table
No missing values found in the 'reviews' table


In [4]:
#run sql queries
def execute(query):
    return pd.io.sql.read_sql(query, con = engine)

In [5]:
# the number of books released after January 1, 2000
query1= '''
SELECT
    (SELECT COUNT(*) FROM books) AS total_book_count,
    COUNT(*) AS book_count_after_date
FROM
    books
WHERE
    publication_date > '2000-01-01';
    '''

execute(query1)


Unnamed: 0,total_book_count,book_count_after_date
0,1000,819


As we have 81.9% books released after 2000-01-01, this dataset contains only contemporary literature in general.

In [6]:
#the number of user reviews and the average rating for each book:
query2 = ''' SELECT b.title AS book_title,
                    count(DISTINCT re.review_id) AS n_reviews,
                    ROUND(AVG(ra.rating), 1) AS avg_rating
            FROM books AS b
            LEFT JOIN reviews AS re ON re.book_id = b.book_id
            LEFT JOIN ratings AS ra ON ra.book_id = b.book_id
            GROUP BY b.book_id
            ORDER BY avg_rating DESC, n_reviews DESC
         '''

execute(query2)

Unnamed: 0,book_title,n_reviews,avg_rating
0,A Dirty Job (Grim Reaper #1),4,5.0
1,School's Out—Forever (Maximum Ride #2),3,5.0
2,Moneyball: The Art of Winning an Unfair Game,3,5.0
3,Crucial Conversations: Tools for Talking When ...,2,5.0
4,Misty of Chincoteague (Misty #1),2,5.0
...,...,...,...
995,The World Is Flat: A Brief History of the Twen...,3,2.3
996,Drowning Ruth,3,2.0
997,His Excellency: George Washington,2,2.0
998,Junky,2,2.0


As we can see in the output, the books with the biggest number of reviews have the highest average_rating. 
'Memoirs of a Geisha' have 8 reviews and an average rating is in top 5, but Harry Potter's books lead the rating. 

In [7]:
#the publisher that has released the greatest number of books with more than 50 pages:

query3 = '''
    SELECT p.publisher, COUNT(b.book_id) AS books_count
    FROM books b
    JOIN publishers p ON b.publisher_id = p.publisher_id
    WHERE b.num_pages>50
    GROUP BY p.publisher
    ORDER BY books_count DESC
    LIMIT 1;
'''

execute(query3)

Unnamed: 0,publisher,books_count
0,Penguin Books,42


'Penguin Books' has released the greatest number of books whith more than 50 pages.

In [8]:
#the author with the highest average book rating (only with at least 50 ratings):
query4 = '''
    SELECT a.author, AVG(ra.rating) AS average_rating
    FROM books b
    JOIN authors a ON b.author_id = a.author_id
    JOIN ratings ra ON b.book_id=ra.book_id
    
    GROUP BY a.author
    HAVING COUNT (ra.rating)>50
    ORDER BY average_rating DESC
    LIMIT 1;
'''

execute(query4)

Unnamed: 0,author,average_rating
0,J.K. Rowling/Mary GrandPré,4.288462


As we expected in analysing ratings for this sample of readers the best author is Joanne Rowling.

In [9]:
#the average number of text reviews among users who rated more than 50 books:
query5 = '''
SELECT AVG(review_count) AS average_text_reviews
FROM (
    SELECT r.username, COUNT(r.review_id) AS review_count
    FROM ratings ra
    JOIN reviews r ON ra.book_id = r.book_id AND ra.username = r.username
    WHERE ra.username IN (
        SELECT username
        FROM ratings
        GROUP BY username
        HAVING COUNT(book_id) > 50
    )
    GROUP BY r.username
) AS subquery;
'''
execute(query5)

Unnamed: 0,average_text_reviews
0,24.333333


The average number of text reviews among users who rated more than 50 books is approximately 24. This is a core of our active client base, they read a lot of books.

# Conclusion

In this analysis, we show customers behave on books market.  

Customers prefer contemporary authors.

They are passionate at Harry Potter saga.Those books have a the highest ratings. 

'Penguin Books' is our best publisher. 

We've got a community of core-users, who read a lot and share their reviews. We should focus our marketing on them.
