This project creates an automated pipeline that takes in new data from Wikipedia, Kaggle metadata, and the MovieLens rating data.
Challenge: Use ETL to clean the data files, parse the extracted data into the desired format, merge the parsed data sets, and load the merged set into a PostgreSQL database (via pgAdmin) for further use.
Create an automated pipeline that takes in new data from Wikipedia, Kaggle metadata, and MovieLens ratings. The data is transformed into an appropriate format and then loaded into a PostgreSQL database for further analysis.
- Deliverable 1: Write an ETL function to read three data files
- Deliverable 2: Extract and Transform the Wikipedia Data
- Deliverable 3: Extract and Transform the Kaggle Data
- Deliverable 4: Create the Movie Database
Data Sources:
- movies_metadata.csv
- ratings.csv
- wikipedia-movies.json
Software:
- Anaconda Version 3.7.3
- MacOS Catalina Version 10.15.7
- PgAdmin4
- Jupyter Notebook
- PostgreSQL 11.9
- Python 3.7.7
A function was created to take in the Kaggle metadata, the Wikipedia JSON, and the MovieLens ratings.csv. Each data source was then read into its own separate DataFrame.
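A minimal sketch of that extraction function, assuming the three file paths above (the argument names and sample data here are hypothetical):

```python
import json

import pandas as pd

def extract_movie_data(wiki_file, kaggle_file, ratings_file):
    """Read the three source files into three separate DataFrames."""
    # Wikipedia data arrives as a JSON list of movie records.
    with open(wiki_file, mode="r") as f:
        wiki_movies_raw = json.load(f)
    wiki_movies_df = pd.DataFrame(wiki_movies_raw)

    # Kaggle metadata and MovieLens ratings are flat CSV files;
    # low_memory=False avoids mixed-dtype warnings on the large metadata file.
    kaggle_metadata = pd.read_csv(kaggle_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)

    return wiki_movies_df, kaggle_metadata, ratings
```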
The Wikipedia JSON data was merged with the Kaggle metadata. The merged DataFrame was filtered to remove TV shows, and columns that consisted mostly of missing or undesirable data were cleaned up or dropped.
The Kaggle metadata and MovieLens rating data were transformed and converted into separate DataFrames. The Kaggle, Wikipedia, and MovieLens rating DataFrames were then merged to create one DataFrame with ratings for analysis.
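One common way to attach the ratings is to pivot rating counts per movie and left-merge them onto the movie table; the sketch below assumes a `kaggle_id` key matching MovieLens `movieId`, which is a simplification of the real ID mapping:

```python
import pandas as pd

# Hypothetical merged movies table (Wikipedia + Kaggle), keyed on kaggle_id.
movies_df = pd.DataFrame({"kaggle_id": [1, 2], "title": ["Movie A", "Movie B"]})

# Hypothetical MovieLens ratings.
ratings = pd.DataFrame({"movieId": [1, 1, 2], "rating": [5.0, 4.0, 3.0]})

# Count ratings per movie per rating value, then pivot so each
# rating value becomes its own count column.
rating_counts = (
    ratings.groupby(["movieId", "rating"], as_index=False).size()
           .pivot(index="movieId", columns="rating", values="size")
)
rating_counts.columns = [f"rating_{c}" for c in rating_counts.columns]

# Left-merge so movies without any ratings are kept, then fill missing counts with 0.
movies_with_ratings_df = movies_df.merge(
    rating_counts, left_on="kaggle_id", right_index=True, how="left")
movies_with_ratings_df[rating_counts.columns] = (
    movies_with_ratings_df[rating_counts.columns].fillna(0))
```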
The merged DataFrame containing Kaggle and Wikipedia data, combined with the MovieLens rating data, was added to a SQL database where custom queries can be performed for analysis.
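The load step can be sketched with pandas' `to_sql` and SQLAlchemy. The pipeline targets PostgreSQL (a connection string such as `postgresql://postgres:<password>@localhost:5432/movie_data` is assumed); an in-memory SQLite engine stands in here so the sketch runs anywhere:

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine; the real pipeline would use a PostgreSQL connection string.
engine = create_engine("sqlite://")

# Hypothetical final merged table.
movies_with_ratings_df = pd.DataFrame(
    {"kaggle_id": [1], "title": ["Movie A"], "rating_count": [2]})

# Replace any existing table so the pipeline can be rerun end to end.
movies_with_ratings_df.to_sql("movies", engine, if_exists="replace", index=False)

# Example of a custom query against the loaded table.
result = pd.read_sql("SELECT title FROM movies WHERE kaggle_id = 1", engine)
```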
The Amazon Prime Video development team can now run the "Predict the Popular Pictures" hackathon as requested. With the new "Movie Data" database, the team can uncover low-budget releases and predict which ones are available at a bargain price.
David Supple