In [38]:
import pandas as pd
import numpy as np
import pickle
import sys
sys.path.append("../")
import src.Resources as src

import warnings
warnings.filterwarnings('ignore')


In this notebook I will clean all the data from IMDb. We start with the base CSV, Movies.csv.

In [39]:
main_df = pd.read_csv('../Data/Movies.csv', index_col= 0)  

We rename each of the columns.

In [40]:
main_df.columns = ["Movie_name","Movie_img","Movie_genre","Movie_description","Movie_rating","Casting","Movie_votes","Movie_directors","Movie_stars"]

We run two cleaning functions from Resources.py

In [41]:
# Pass the cleaners directly; wrapping them in a lambda is not needed.
main_df["Movie_rating"] = main_df["Movie_rating"].apply(src.movie_rating_cleaner)
main_df["Movie_votes"] = main_df["Movie_votes"].apply(src.movie_vote_cleaner)
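
The real cleaners live in Resources.py and are not shown in this notebook. As a rough idea of what this kind of cleaning can look like, here is a minimal sketch. It assumes the raw votes are strings such as "1,234" and the raw ratings are strings such as "7.5"; the helper `to_number` and the `raw` DataFrame are hypothetical and are not part of `src`.

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Strip thousands separators and coerce to numbers (assumed raw format)."""
    cleaned = series.astype(str).str.replace(",", "", regex=False).str.strip()
    # Values that still cannot be parsed become NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up rows that only illustrate the assumed raw format.
raw = pd.DataFrame({"Movie_rating": ["7.5", " 8.1 ", "N/A"],
                    "Movie_votes": ["1,234", "56", "2,000,000"]})
raw["Movie_rating"] = to_number(raw["Movie_rating"])
raw["Movie_votes"] = to_number(raw["Movie_votes"])
```

Coercing with `errors="coerce"` leaves unparseable entries as NaN, so they show up later in the null check instead of stopping the notebook.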

Now it's time to run the main cleaning function for IMDb. It also loads all the .pkl files we have in Data and concatenates them into a single DataFrame.

In [42]:
clean_df = src.movie_cleaner(main_df)

Here we need to reset the index, drop duplicates, and reset the index again.

In [43]:
# reset_index(drop=True) renumbers the rows without creating an "index" column to drop.
clean_df = (
    clean_df.reset_index(drop=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
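
To see what this step does, here is a small sketch on a made-up DataFrame (`toy` and its values are only for illustration). Dropping duplicates leaves gaps in the index, and resetting it afterwards renumbers the rows from 0.

```python
import pandas as pd

toy = pd.DataFrame({"Movie_name": ["Avatar", "Avatar", "Titanic"],
                    "Movie_rating": [7.8, 7.8, 7.9]},
                   index=[10, 11, 12])

# drop_duplicates keeps the first of each identical row; reset_index(drop=True)
# renumbers the rows from 0 without adding an "index" column.
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Since `drop_duplicates` compares column values and ignores the index, the reset that matters for the final result is the one after the duplicates are gone.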

Here we separate the names in the Directors and Stars columns with the character "|"; this will be useful later on for my recommendation.

In [44]:
clean_df["Movie_stars"] = clean_df["Movie_stars"].apply(src.split_casting)
clean_df["Movie_directors"] = clean_df["Movie_directors"].apply(src.split_casting)
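
Keeping the names separated by "|" makes it easy to get the individual names back later. A minimal sketch, assuming strings shaped like "John Francis Daley| Jonathan Goldstein " (names separated by "|", with stray spaces); the helper `names_from` is hypothetical and is not part of `src`.

```python
import pandas as pd

def names_from(cell: str) -> list:
    """Split a "|"-separated string into a list of stripped names."""
    return [name.strip() for name in cell.split("|") if name.strip()]

directors = pd.Series(["John Francis Daley| Jonathan Goldstein ", "James Cameron "])
director_lists = directors.apply(names_from)
```

With lists of names, two titles can be compared by checking whether they share a director or a star.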

I don't need the Casting column anymore, so we drop it.

In [45]:
clean_df.drop("Casting",axis=1,inplace=True)

We make sure that there aren't any null values, and then we save the result as Movies_clean.csv.

In [46]:
clean_df["Movie_directors"]

0          John Francis Daley| Jonathan Goldstein 
1                                  Chad Stahelski 
2                                 Jeremy Garelick 
3                                   James Cameron 
4                                  Kyle Newacheck 
                            ...                   
130080             Jessica Kitrick| Lewis Lovhaug 
130081                              Tristan Price 
130082                              Aman Sachdeva 
130083                            Riccardo Ghione 
130084                               Edward Conna 
Name: Movie_directors, Length: 130085, dtype: object

In [47]:
clean_df.isnull().sum()

Movie_name           0
Movie_img            0
Movie_genre          0
Movie_description    0
Movie_rating         0
Movie_votes          0
Movie_directors      0
Movie_stars          0
dtype: int64

In [48]:
clean_df.to_csv("../Data/Movies_clean.csv")

Now we will clean the data scraped from Goodreads

In this part of the notebook I will clean the data scraped from Goodreads and combine it all into one DataFrame.
First, we start with the Books DataFrame as a baseline.

In [49]:
book_df = pd.read_csv('../Data/Books.csv', index_col= 0)  

In [50]:
book_df.head(1)

Unnamed: 0,Book_author,Book_img,Book_description,Book_rating,Book_votes,Book_title,Book_genre
0,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,"In Romeo and Juliet, Shakespeare creates a vio...",3.74,2462752,Romeo and Juliet,Classics|Plays|Fiction|Romance


This is the list of all the .pkl files that we want to load into our DataFrame. I chose them all. If you scrape new books in the future, you need to add the name of the new .pkl here as well, keeping in mind that the "Books_" prefix is removed from the name.

Example: Books_Action.pkl appears in this list as just Action.

In [51]:
genres = ["Action", "Adventure", "Comedy", "Crime", "Drama", "Fantasy", "History", "Horror", "Mystery", "Romance", "Sciencefiction", "Superhero", "Thriller", "List"]

Here we run the cleaning function. This function also puts all the pickles together into one DataFrame.

In [52]:
book_df = src.book_cleaner(genres,book_df)
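
The real `book_cleaner` is in Resources.py and is not shown here. The part that collects the pickles could look roughly like the sketch below, which follows the Books_<genre>.pkl naming described above. It assumes each pickle holds a DataFrame; `load_genre_pickles` and `data_dir` are hypothetical names, and the demo writes throwaway files to a temporary folder instead of touching ../Data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_genre_pickles(genres, data_dir, base_df):
    """Append every Books_<genre>.pkl found in data_dir to base_df."""
    frames = [base_df]
    for genre in genres:
        path = Path(data_dir) / f"Books_{genre}.pkl"
        if path.exists():  # skip genres that have not been scraped yet
            frames.append(pd.read_pickle(path))
    return pd.concat(frames, ignore_index=True)

# Demo with throwaway files instead of the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Book_title": ["Dune"]}).to_pickle(Path(tmp) / "Books_Action.pkl")
    base = pd.DataFrame({"Book_title": ["Hamlet"]})
    combined = load_genre_pickles(["Action", "Comedy"], tmp, base)
```

In the demo only Books_Action.pkl exists, so "Comedy" is skipped and the result has the baseline row followed by the row from the pickle.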

We reset the index, drop duplicates, reset the index again, and then multiply the rating by 2 so it's on the same scale as IMDb.

In [53]:
# reset_index(drop=True) renumbers the rows without creating an "index" column to drop.
book_df = (
    book_df.reset_index(drop=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
book_df["Book_rating"] = book_df["Book_rating"].apply(src.book_rating_multiplier)
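
The rescaling itself is simple arithmetic: Goodreads uses a 5-star scale and IMDb a 10-point scale, so a 3.74 on Goodreads becomes 7.48. A minimal sketch on made-up rows, assuming the multiplier does nothing more than double the value (the body of `book_rating_multiplier` is not shown in this notebook):

```python
import pandas as pd

ratings = pd.DataFrame({"Book_rating": [3.74, 4.02, 5.0]})

# Multiplying the whole column at once avoids a row-by-row apply;
# round(2) keeps the two decimals used by the ratings.
ratings["Book_rating"] = (ratings["Book_rating"] * 2).round(2)
```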

We make sure that there aren't any NaN values. In this case there is one, so I'll just replace it with "No description".

In [54]:
book_df.isnull().sum()

Book_author         0
Book_img            0
Book_description    1
Book_rating         0
Book_votes          0
Book_title          0
Book_genre          0
dtype: int64

In [55]:
book_df=book_df.fillna("No description")
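
Calling `fillna` on the whole DataFrame replaces missing values in every column. Here that is fine because only Book_description has one, but if other columns could have gaps too, a sketch that limits the fill to one column looks like this (`toy` is made-up data):

```python
import pandas as pd

toy = pd.DataFrame({"Book_title": ["Hamlet", "Dune"],
                    "Book_description": ["A tragedy.", None]})

# Only the description column is filled; any NaN in other columns is left alone.
toy["Book_description"] = toy["Book_description"].fillna("No description")
```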

The DataFrame is now clean and contains all of the new data from the pickles that were scraped with our Goodreads Selenium function.

Finally, we save it as Books_clean.csv.

In [56]:
book_df.to_csv("../Data/Books_clean.csv")

Next we take the cleaned books and simplify their genres.

The function map_genre from my library uses a dictionary to reduce the genres of each book to one simplified genre; it is applied to the Book_genre column with apply.

In [57]:
book_df['Book_genre'] = book_df['Book_genre'].apply(src.map_genre)

These are now the new genres for the books.

In [58]:
book_df.head()

Unnamed: 0,Book_author,Book_img,Book_description,Book_rating,Book_votes,Book_title,Book_genre
0,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,"In Romeo and Juliet, Shakespeare creates a vio...",7.48,2462752,Romeo and Juliet,Romance
1,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,"Among Shakespeare's plays, ""Hamlet"" is conside...",8.04,888492,Hamlet,Drama
2,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,"One night on the heath, the brave and respecte...",7.8,836710,Macbeth,Family
3,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,"In Othello, Shakespeare creates a powerful dra...",7.78,368863,Othello,Drama
4,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,Shakespeare's intertwined love polygons begin ...,7.9,513517,A Midsummer Night's Dream,Fantasy
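
The real dictionary and `map_genre` are not shown in this notebook, so the following is only a sketch of how a dictionary-based simplification can work. `GENRE_MAP` and `simplify_genre` are hypothetical, and returning the first tag found in the dictionary is an assumption about the rule.

```python
import pandas as pd

# Hypothetical mapping from Goodreads tags to simplified genres.
GENRE_MAP = {"Romance": "Romance", "Tragedy": "Drama", "Magic": "Fantasy"}

def simplify_genre(tags):
    """Return the simplified genre of the first known tag, or None if none match."""
    for tag in str(tags).split("|"):
        if tag.strip() in GENRE_MAP:
            return GENRE_MAP[tag.strip()]
    return None

genres = pd.Series(["Classics|Plays|Fiction|Romance", "Cookbooks|Food"])
simplified = genres.apply(simplify_genre)
```

A book whose tags are all missing from the dictionary gets None, which pandas treats as a missing value in the null count.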


We drop any NaN because it means the book doesn't match any of the genres I have for my movies, so I wouldn't be able to make a good genre-based recommendation for it.

In [59]:
book_df.isna().sum()

Book_author          0
Book_img             0
Book_description     0
Book_rating          0
Book_votes           0
Book_title           0
Book_genre          49
dtype: int64

In [60]:
book_df.dropna(inplace=True)
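
A plain `dropna` removes a row if any column is missing. Only Book_genre has NaN values at this point, so the result is the same, but a sketch that makes the intent explicit with `subset` looks like this (`toy` is made-up data):

```python
import pandas as pd

toy = pd.DataFrame({"Book_title": ["Hamlet", "Some Cookbook"],
                    "Book_genre": ["Drama", None]})

# subset limits the check to Book_genre, so NaN in other columns would not drop a row.
kept = toy.dropna(subset=["Book_genre"]).reset_index(drop=True)
```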

In [61]:
book_df.head()

Unnamed: 0,Book_author,Book_img,Book_description,Book_rating,Book_votes,Book_title,Book_genre
0,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,"In Romeo and Juliet, Shakespeare creates a vio...",7.48,2462752,Romeo and Juliet,Romance
1,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,"Among Shakespeare's plays, ""Hamlet"" is conside...",8.04,888492,Hamlet,Drama
2,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,"One night on the heath, the brave and respecte...",7.8,836710,Macbeth,Family
3,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,"In Othello, Shakespeare creates a powerful dra...",7.78,368863,Othello,Drama
4,William Shakespeare,https://images-na.ssl-images-amazon.com/images...,Shakespeare's intertwined love polygons begin ...,7.9,513517,A Midsummer Night's Dream,Fantasy


In [62]:
book_df.to_csv("../Data/Books_clean_filtered.csv")

Now we do the cleaning to prepare the recommendation

In [63]:
movies = pd.read_csv("../Data/Movies_clean.csv",index_col=0)
books = pd.read_csv("../Data/Books_clean_filtered.csv",index_col=0)

We rename our columns

In [64]:
movies.columns = ["Title","Image","Genre","Description","Rating","Votes","Directors","Stars"]
books.columns = ["Author","Image","Description","Rating","Votes","Title","Genre"]

Drop unnecessary columns

In [65]:
movies_rec = movies.drop(["Directors","Stars"],axis=1)
books_rec = books.drop(["Author"],axis=1)
books_rec = books_rec.reindex(columns=["Title","Image","Genre","Description","Rating","Votes"])
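
Dropping a column and then reordering the rest can also be done in one step by selecting the wanted columns by name. A minimal sketch on a one-row stand-in for `books` (the placeholder values and the name `shared_columns` are only for illustration):

```python
import pandas as pd

books = pd.DataFrame({"Author": ["William Shakespeare"], "Image": ["(image url)"],
                      "Description": ["(text)"], "Rating": [7.48], "Votes": [2462752],
                      "Title": ["Romeo and Juliet"], "Genre": ["Romance"]})

shared_columns = ["Title", "Image", "Genre", "Description", "Rating", "Votes"]
# Selecting by name drops Author and fixes the column order in a single step.
books_rec = books[shared_columns]
```

Unlike `reindex`, selecting with a list raises a KeyError if a name is misspelled, instead of silently creating an empty column.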

We create a column with a type identifier

In [66]:
books_rec["Type"] = "Book"
movies_rec["Type"] = "Movie"

We concatenate both movies and books

In [67]:
recomendation = pd.concat([books_rec,movies_rec])
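
A small sketch of what the concatenation does, on made-up one-row frames. `pd.concat` matches columns by name, not by position, and `ignore_index=True` is an optional extra (not used in the cell above) that gives the result a fresh index right away.

```python
import pandas as pd

books_rec = pd.DataFrame({"Title": ["Dune"], "Votes": [9000], "Type": ["Book"]})
movies_rec = pd.DataFrame({"Votes": [8000], "Title": ["Dune"], "Type": ["Movie"]})

# Columns are matched by name, and ignore_index=True gives a fresh 0..n-1 index.
combined = pd.concat([books_rec, movies_rec], ignore_index=True)
```

Without `ignore_index=True` both rows would keep their original index 0, which is why the index is reset later on.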

We create a title identifier that combines the title and the type, because the same title may appear twice if a book has a movie adaptation

In [68]:
recomendation["Title_id"] = recomendation["Title"]+"_"+recomendation["Type"]

We filter to have only titles with over 5000 votes

In [69]:
recomendation = recomendation[recomendation["Votes"]>5000]

In [70]:
# Fully identical rows also share a Title_id, so one drop_duplicates on Title_id covers both cases.
recomendation = recomendation.drop_duplicates(subset="Title_id").reset_index(drop=True)
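
The last three steps (identifier, vote filter, deduplication) can be checked together on a made-up DataFrame; `toy` and its values are only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Title": ["Dune", "Dune", "Dune", "Obscure"],
                    "Type": ["Book", "Movie", "Movie", "Book"],
                    "Votes": [9000, 8000, 7500, 120]})

toy["Title_id"] = toy["Title"] + "_" + toy["Type"]
result = (toy[toy["Votes"] > 5000]              # keep well-voted titles only
          .drop_duplicates(subset="Title_id")   # first row per title and type
          .reset_index(drop=True))
```

The book and the movie called "Dune" both survive because their identifiers differ, while the second "Dune_Movie" row and the low-vote title are removed.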

In [71]:
recomendation.head(1)

Unnamed: 0,Title,Image,Genre,Description,Rating,Votes,Type,Title_id
0,Romeo and Juliet,https://images-na.ssl-images-amazon.com/images...,Romance,"In Romeo and Juliet, Shakespeare creates a vio...",7.48,2462752,Book,Romeo and Juliet_Book


Now we reorder our columns and export to csv

In [72]:
recomendation = recomendation.reindex(columns=["Image","Title","Rating","Votes","Genre","Description","Type","Title_id"])

In [73]:
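# movies and books now carry the renamed columns, so these two calls overwrite
# the earlier Movies_clean.csv and Books_clean.csv (books was loaded from the filtered file).
movies.to_csv("../Data/Movies_clean.csv")
books.to_csv("../Data/Books_clean.csv")
recomendation.to_csv("../Data/recomendation.csv")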