# Kaggle Goodreads - Book Ratings

## Topics Covered
* Reading in a badly-formatted dataset
* Working with `bytes` objects and the `io` module
* Cleaning *before* loading into pandas

In [None]:
import requests
from io import BytesIO, StringIO

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats 

This Goodreads dataset was originally posted on Kaggle:

https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks

In [None]:
#Incorrectly-formatted csv; fails to open. How do we fix it?
#df = pd.read_csv('books.csv')

books_csv_url = 'https://raw.githubusercontent.com/ClaremontCollegesLibrary/PersnicketyPython/refs/heads/main/books.csv'

df = pd.read_csv(books_csv_url)

#### What does this ParserError mean?

`ParserError: Error tokenizing data. C error: Expected 12 fields in line 3350, saw 13` means that pandas expected 12 fields per line (based on the header) but found 13 on line 3350. In other words, the number of commas is not the same on every line of the csv file, so `pd.read_csv()` isn't able to parse the document correctly.

This is frustrating, but we won't know how big a problem it is without investigating further.
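
The same error can be reproduced on a tiny in-memory example. The sketch below uses a made-up three-column csv (not the Goodreads data) in which the last row has one comma too many. The names `bad_csv` and `parser_error_raised` are only for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up sample: the header declares 3 fields, but the last row has 4.
bad_csv = "a,b,c\n1,2,3\n4,5,6,7\n"

try:
    pd.read_csv(StringIO(bad_csv))
    parser_error_raised = False
except pd.errors.ParserError as err:
    parser_error_raised = True
    # The message reports the expected and observed field counts.
    print(err)
```

A single inconsistent row is enough to stop the parser, which is why the whole file fails to load.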

### Looking at the Data

Pandas's `read_csv()` can pull directly from a csv file at a specific URL, but since it can't parse this csv correctly, we have to load the file into memory another way. We can use the `requests` module to make an HTTP GET request and look at the content of the response.

In [None]:
books_csv = requests.get(books_csv_url).content

Here are the first thousand bytes of the csv file, returned as a Python [bytes object](https://docs.python.org/3/library/stdtypes.html#bytes-objects).

Bytes objects display similarly to Python strings (they look like a string with a "b" before the opening quote), but they are fundamentally different: they hold raw byte values rather than text characters, so they have to be decoded before they can be treated as text.
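
To make the difference concrete, here is a minimal sketch using a short made-up bytes literal rather than the downloaded file. The variable names are illustrative only.

```python
raw = b"bookID,title"        # bytes: a sequence of integers from 0 to 255
text = raw.decode("utf-8")   # str: a sequence of Unicode characters

print(raw[0])     # indexing bytes gives an integer (98 is the code for "b")
print(text[0])    # indexing a str gives a one-character string
print(raw == text)                    # bytes and str never compare equal
print(text.encode("utf-8") == raw)    # encoding the text gives the bytes back

# One character is not always one byte, which is why decoding matters.
accented = "é"
print(len(accented), len(accented.encode("utf-8")))
```

This is why the later cells call `.decode('utf-8')` before counting commas or replacing text.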

In [None]:
books_csv[0:1000]

At first glance, it looks like it's formatted correctly, but this file is thousands of lines long. Let's see if we can identify which rows are formatted incorrectly.

First, we need to look for outliers in the number of commas per line.

To read the bytes object as if it were a file, we wrap it in `BytesIO` and open it with a mechanism called a context manager. This is essentially a way of opening and closing a file all in one sequence, so that system resources aren't left occupied and can be freed up for other processes. In Python, context managers typically take the form of a "with... as" statement.

All the code within the "with" block can use the open file object `f`; once the block ends, `f` is closed.

In [None]:
commas = []

with BytesIO(books_csv) as f:
    lines = f.readlines()
    for line in lines:
        line = line.decode('utf-8')
        commas.append(line.count(','))

In [None]:
#Running this code outside the "with" block will produce an error.

#f.readlines()
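
The closing behavior can be checked directly. This is a small sketch on a made-up bytes value standing in for `books_csv`; it assumes nothing beyond the standard library.

```python
from io import BytesIO

sample = b"line one\nline two\n"  # made-up stand-in for books_csv

with BytesIO(sample) as f:
    lines = f.readlines()  # works: the buffer is open inside the block

print(f.closed)  # the buffer is closed as soon as the block exits

try:
    f.readlines()  # reading after the block has ended
except ValueError as err:
    print(err)  # Python reports an I/O operation on a closed file
```

Note that `lines` is still available after the block, because the list was created while the buffer was open.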

In [None]:
set(commas)
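
`set()` shows which comma counts occur, but not how often. As a possible alternative, `collections.Counter` tallies how many lines have each count. The sketch below runs on made-up sample lines, not the real file.

```python
from collections import Counter

# Made-up sample: three lines with 2 commas, one line with 3.
sample_lines = ["a,b,c\n", "1,2,3\n", "4,5,6,7\n", "8,9,10\n"]

comma_counts = Counter(line.count(",") for line in sample_lines)
print(comma_counts)  # maps each comma count to the number of lines that have it
```

Applied to the real `commas` list, `Counter(commas)` would show at a glance whether the irregular lines are rare or widespread.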

If we look back at the first line of the csv, we see the following columns:

bookID,title,authors,average_rating,isbn,isbn13,language_code,  num_pages,ratings_count,text_reviews_count,publication_date,publisher

There are twelve in total, so there should be eleven commas per line separating entries. We need to look for lines that contain more than 11 commas.

In [None]:
with BytesIO(books_csv) as f:
    lines = f.readlines()
    for line in lines:
        line = line.decode('utf-8')
        
        if line.count(',') > 11:
            print(line)

Only four lines! That's manageable!
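
To tie the offending lines back to the line number reported in the ParserError, `enumerate` can record each line's position. This is a minimal sketch on made-up sample lines; `bad_line_numbers` is an illustrative name.

```python
# Made-up sample: the header has 2 commas, the third line has 3.
sample_lines = ["a,b,c\n", "1,2,3\n", "4,5,6,7\n"]

# Use the header line as the reference for how many commas to expect.
expected_commas = sample_lines[0].count(",")

# start=1 makes the numbering match how text editors count lines.
bad_line_numbers = [
    number
    for number, line in enumerate(sample_lines, start=1)
    if line.count(",") != expected_commas
]
print(bad_line_numbers)  # [3]
```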

We can use the `.replace()` string method to fix this pretty quickly. We just need to provide enough context around the comma so we don't accidentally replace text in other parts of the document.

In [None]:
with BytesIO(books_csv) as f:
    content = f.read()
    content = content.decode('utf-8')
    content = content.replace(', Jr', ' Jr')
    content = content.replace(', one of the', ' one of the')
    content = content.replace('Wesley, Rawles', 'Wesley Rawles')
    content = content.replace(', Son & Ferguson', ' Son & Ferguson')
    df = pd.read_csv(StringIO(content), sep=',')

*Note: if you are running this locally and pointing to a csv file on your hard drive instead of the result of an HTTP GET request, you should use the following code instead:*

In [None]:
#with open("books.csv", 'r', encoding='utf-8') as f:
#    content = f.read()
#    content = content.replace(', Jr', ' Jr')
#    content = content.replace(', one of the', ' one of the')
#    content = content.replace('Wesley, Rawles', 'Wesley Rawles')
#    content = content.replace(', Son & Ferguson', ' Son & Ferguson')
#    df = pd.read_csv(StringIO(content), sep=',')
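
After a set of replacements like these, it can be worth confirming that every line now has the same number of commas before handing the text to pandas. Here is a minimal sketch of that check on made-up sample text (the rows below are invented, not taken from the dataset).

```python
# Made-up sample with one stray comma inside an author name.
sample = (
    "id,title,authors,rating\n"
    "1,First Book,Jane Doe,4.1\n"
    "2,Second Book,John Doe, Jr,3.9\n"
)

# Targeted replacement: enough context to hit only the stray comma.
fixed = sample.replace(", Jr", " Jr")

fixed_lines = fixed.splitlines()
expected_commas = fixed_lines[0].count(",")

# True only if every line matches the header's comma count.
all_consistent = all(line.count(",") == expected_commas for line in fixed_lines)
print(all_consistent)
```

The same check could be run on `content` in the cell above, comparing each line against the 11 commas in the header.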

In [None]:
df

In [None]:
df.info()

### "  num_pages"

In [None]:
df.columns

In [None]:
df.columns = [column.strip() for column in df.columns]

In [None]:
df.columns
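
The leading spaces matter because pandas keeps them as part of the column name, so `df['num_pages']` would fail until they are removed. The sketch below shows this on a made-up two-column csv, and uses `Index.str.strip()` as an equivalent alternative to the list comprehension above.

```python
from io import StringIO

import pandas as pd

# Made-up sample whose second header has two leading spaces.
demo = pd.read_csv(StringIO("bookID,  num_pages\n1,100\n"))

print(list(demo.columns))  # the spaces are preserved in the name
had_plain_name_before = "num_pages" in demo.columns

# Vectorized equivalent of [column.strip() for column in demo.columns]
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```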

In [None]:
df.describe()

In [None]:
df['language_code'].value_counts()

In [None]:
plt.hist(df[df['num_pages'] < 2000]['num_pages'], bins=40)
plt.title('Distribution of Page Count')
plt.show()

In [None]:
plt.hist(df['average_rating'], bins=40)
plt.title('Distribution of User Ratings')
plt.show()

# Books with Mean Rating over 4.75 and More Than 5 Ratings

In [None]:
df[(df['average_rating'] > 4.75) & (df['ratings_count'] > 5)]
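
The filter above combines two boolean masks with `&`. Each comparison needs its own parentheses, because `&` binds more tightly than `>` in Python. Here is a minimal sketch of the same pattern on a made-up three-row DataFrame (the titles and numbers are invented).

```python
import pandas as pd

# Made-up data: only "B" has both a high rating and more than 5 ratings.
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "average_rating": [4.9, 4.8, 3.5],
    "ratings_count": [2, 40, 1000],
})

high_rating = demo["average_rating"] > 4.75   # boolean Series, one value per row
enough_ratings = demo["ratings_count"] > 5

mask = high_rating & enough_ratings           # True only where both are True
print(demo[mask])
```

Requiring a minimum number of ratings keeps books with a single 5-star rating from dominating the list.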


# End of Module 3