# Data cleaning and preparation for the pipeline

In [0]:
book_data_df = spark.read.csv("/Volumes/workspace/default/my_volume/books_data.csv", header=True, inferSchema=True)


In [0]:
book_data_df.write.mode("overwrite").saveAsTable("books")

In [0]:
book_data_df.show()

In [0]:
ratings_df = spark.read.csv("/Volumes/workspace/default/my_volume/Books_rating.csv", header=True, inferSchema=True)
ratings_df.write.mode("overwrite").saveAsTable("ratings")

In [0]:
display(ratings_df)

In [0]:
%sql
select count(*) as review_count from ratings

In [0]:
%sql
select count(*) as book_count from books

In [0]:
%sql
select * 
from books
where Title is null

In [0]:
%sql
select count(distinct(Title)) from ratings

There are 212,399 distinct book titles and 3,000,000 reviews. I wonder how the reviews are distributed across books.

In [0]:
%sql
create or replace temp view book_reviews as
select Id, count(*) as reviews
from ratings
group by Id;

-- order by gives a total ordering; sort by only sorts within each partition
select *
from book_reviews
order by reviews desc;

It looks like the highest number of reviews for a single Id is 4,426, and the review count per Id drops off quickly. I wonder how the number of reviews per book is distributed.

In [0]:
%sql

select reviews, count(*) as book_count
from book_reviews
group by reviews
order by reviews

Databricks visualization. Run in Databricks to view.
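
For anyone reading this outside Databricks, here is a minimal sketch of what the two queries above compute, in plain Python on made-up data. The ids in `review_ids` and the helper `share_at_most` are hypothetical and only illustrate the shape of the result: first reviews per book, then books per review count.

```python
from collections import Counter

# Hypothetical review rows: one book Id per review (toy data, not from the dataset).
review_ids = ["b1", "b1", "b1", "b1", "b1", "b2", "b2", "b3", "b4", "b5", "b5", "b5"]

# Equivalent of the book_reviews view: Id -> number of reviews.
reviews_per_book = Counter(review_ids)

# Equivalent of the histogram query: number of reviews -> number of books.
books_per_count = Counter(reviews_per_book.values())


def share_at_most(k):
    """Fraction of books that have k reviews or fewer."""
    total_books = sum(books_per_count.values())
    covered = sum(n for reviews, n in books_per_count.items() if reviews <= k)
    return covered / total_books


for reviews in sorted(books_per_count):
    print(reviews, books_per_count[reviews])
print(share_at_most(3))
```

On the real table the same cumulative share could be read off the histogram query's output to put a number on "most books have only a few reviews".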


It looks like the majority of books have three reviews or fewer. Not that big of a deal, but good to know. Assuming the number of reviews is correlated with popularity, this tells me that we carry many, many books, but most are not popular. Next I'll check whether each book has a single price; if not, I'll aggregate the price when appending it to the books table.

In [0]:
%sql
select Title, Id, count(distinct(Price)) as price_count
from ratings
group by Title, Id
having price_count > 1

In [0]:
%sql
Select *
from ratings
Where Id in (
  select Id
  from ratings
  group by Id
  having count(distinct(Price)) > 1
)

It looks like the quotes in titles messed up the file. There are only a few, so I'll just drop them.

In [0]:
%sql
CREATE OR REPLACE TABLE ratings AS
SELECT *
FROM ratings
WHERE Id NOT IN (
    SELECT Id
    FROM ratings
    GROUP BY Id
    HAVING COUNT(DISTINCT Price) > 1
)
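
One thing worth keeping in mind with a `NOT IN` filter like the one above: if the subquery ever returns a NULL Id, the comparison evaluates to unknown for every row and nothing survives the filter. The sketch below reproduces this with Python's built-in `sqlite3` on a toy table. The rows are made up, and SQLite is only a stand-in here; Spark SQL follows the same three-valued logic for `NOT IN`, but I have not checked whether the real ratings table contains NULL Ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (Id TEXT, Title TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        ("A1", "Clean Book", 10.0),   # one price: should be kept
        ("A1", "Clean Book", 10.0),
        ("B2", "Broken Row", 5.0),    # two prices: should be dropped
        ("B2", "Broken Row", 7.5),
        (None, "No Id", 1.0),         # NULL Id with two prices
        (None, "No Id", 2.0),
    ],
)

# Same subquery shape as the notebook cell: Ids with more than one price.
bad_ids = """
    SELECT Id FROM ratings
    GROUP BY Id
    HAVING COUNT(DISTINCT Price) > 1
"""

# Variant that keeps NULL out of the subquery result.
bad_ids_no_null = """
    SELECT Id FROM ratings
    WHERE Id IS NOT NULL
    GROUP BY Id
    HAVING COUNT(DISTINCT Price) > 1
"""

# The NULL group is in the subquery result, so NOT IN matches no rows at all.
not_in_rows = conn.execute(
    f"SELECT Id FROM ratings WHERE Id NOT IN ({bad_ids})"
).fetchall()

# With NULL filtered out, the clean rows come back (NULL-Id rows are still excluded).
null_safe_rows = conn.execute(
    f"SELECT Id FROM ratings WHERE Id NOT IN ({bad_ids_no_null})"
).fetchall()

print(not_in_rows)
print(null_safe_rows)
```

A quick row count before and after the `CREATE OR REPLACE TABLE` would confirm that only the intended rows were removed.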

I wonder how many titles have quotes like this that could mess up their rows?

In [0]:
%sql
Select * from books Where Title like '%"%'

Of course. This dataset is gross too. I wonder how many reviews these books have, and whether it would make a difference if I dropped them.

In [0]:
%sql
select b.Title, count(r.Id) as num_review
from books as b
join ratings as r on b.Title = r.Title
where b.Title like '%"%'
group by b.Title
order by num_review desc

In [0]:
%sql
select sum(num_review) as affected_rows from (
select b.Title, count(r.Id) as num_review
from books as b
join ratings as r on b.Title = r.Title
where b.Title like '%"%'
group by b.Title)

Only 8,910 reviews would need to be dropped, roughly 0.3% of the 3,000,000 total. That's small enough.


In [0]:
%sql
select Title
from ratings
where Title in (
  Select Title 
  from books 
  Where Title like '%"%'
)

In [0]:
from pyspark.sql.functions import col

display(book_data_df.filter(col("title").like('%"%')))
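
As an alternative to dropping the affected rows, it may be worth re-reading the files with different quote handling. One plausible cause, which I have not verified against these files, is that embedded quotes are escaped by doubling them (`""`), while Spark's CSV reader uses backslash as its default escape character. The sketch below uses Python's standard `csv` module on a made-up line to show the difference that quote-aware parsing makes. The `spark_csv_options` dict is an untested suggestion, and `multiLine` in particular assumes that some fields span several lines.

```python
import csv
import io

# Hypothetical two-line file; the quoted title contains doubled quotes and a comma.
raw = 'Title,Price\n"The ""Best"" Book, Vol. 1",9.99\n'

# Naive comma splitting shifts the columns: the title is cut at its own comma.
naive_fields = raw.splitlines()[1].split(",")

# A parser that treats "" as an escaped quote recovers two clean fields.
parsed_row = list(csv.reader(io.StringIO(raw)))[1]

# Options one might try with Spark's reader (an assumption, not tested on this
# dataset): use the double quote as both the quote and the escape character.
spark_csv_options = {
    "header": True,
    "inferSchema": True,
    "quote": '"',
    "escape": '"',
    "multiLine": True,
}

print(naive_fields)
print(parsed_row)
```

In the notebook this would be passed as `spark.read.options(**spark_csv_options).csv(path)`, after which the price and title checks above could be rerun to see whether the broken rows disappear.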


In [0]:
%sql
select Id, Title, Price
from ratings
where Title is null

In [0]:
%sql
select distinct Id as id, Title
from ratings
where Title is null

There are only 200 rows, across 9 distinct Ids, where the title is null in ratings, and one row where the title is null in books. We're good to just drop these rows.

In [0]:
%sql
CREATE OR REPLACE TABLE ratings AS
SELECT *
FROM ratings
WHERE title IS NOT NULL;

CREATE OR REPLACE TABLE books AS
SELECT *
FROM books
WHERE title IS NOT NULL;

Next, we'll check that each remaining Id corresponds to a single title, and then that every title in ratings is also present in books.

In [0]:
%sql
Select Id, COUNT(DISTINCT Title) as title_count
from ratings
group by Id
having count(distinct title) > 1; 


In [0]:
%sql
SELECT DISTINCT r.Title
FROM ratings r
LEFT ANTI JOIN books b
ON r.Title = b.title;


Cool. Each Id corresponds to exactly one title, and every title in ratings is present in books. This means we're good to add the Id and Price columns to books.
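
The two checks can be sketched on a toy dataset with Python's built-in `sqlite3`. The tables and rows below are made up. SQLite has no `LEFT ANTI JOIN`, so `NOT EXISTS` stands in for it. Note that the anti join in the notebook only looks in one direction (titles in ratings that are missing from books); the last query shows the reverse direction, which would be needed to say anything about books that have no reviews.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE books (Title TEXT);
    CREATE TABLE ratings (Id TEXT, Title TEXT);
    INSERT INTO books VALUES ('Dune'), ('Emma'), ('Ulysses');
    INSERT INTO ratings VALUES ('id1', 'Dune'), ('id1', 'Dune'), ('id2', 'Emma');
    """
)

# Check 1: Ids attached to more than one title (empty means each Id has one title).
ids_with_many_titles = conn.execute(
    "SELECT Id FROM ratings GROUP BY Id HAVING COUNT(DISTINCT Title) > 1"
).fetchall()

# Check 2: titles in ratings that are missing from books (anti join via NOT EXISTS).
ratings_not_in_books = conn.execute(
    """
    SELECT DISTINCT r.Title FROM ratings r
    WHERE NOT EXISTS (SELECT 1 FROM books b WHERE b.Title = r.Title)
    """
).fetchall()

# Reverse direction: titles in books that have no reviews at all.
books_without_reviews = conn.execute(
    """
    SELECT b.Title FROM books b
    WHERE NOT EXISTS (SELECT 1 FROM ratings r WHERE r.Title = b.Title)
    """
).fetchall()

print(ids_with_many_titles)
print(ratings_not_in_books)
print(books_without_reviews)
```

If books without reviews exist in the real data, a left join from books to ratings would leave their Id and Price as NULL, which may or may not be acceptable for the pipeline.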