Movies data ETL

Large data Extract, Transform, and Load (ETL) using pandas, regex, and export to SQL database

Overview

The framework for this project was that a large movie streaming platform was preparing for a hackathon. In preparation several large datasets needed to be cleaned and prepared for use by the developers. The data used came from three sources:

movie data from Wikipedia, in a JSON file (6.2 MB);
movie metadata downloaded from Kaggle, as a CSV file (34 MB);
movie ratings data from MovieLens, as a CSV file (709.6 MB).

The focus of this project was to build a program that would take in large sources of data, clean them, merge them, and then export them to a SQL server, ready for use. This was achieved by:

importing data from large CSV files into pandas dataframes;
removing columns or rows with unnecessary or corrupted data;
renaming columns for easier use;
extracting and cleaning useful data from existing columns using regular expressions;
merging datasets to maximize utility;
exporting data to a postgresql database.

Results

After processing and merging the Wikipedia and Kaggle files we were left with 6,074 clean rows of usable data. The expected outcome was originally 6,052, but I was able to squeeze out an additional 22 entries by tweaking regex filters.

The large CSV of ratings data, after being exported to a PostgreSQL database, provided us with 26,024,289. This is exactly how many entries were expected. The export of the data to the database took 5,973.17 seconds, or 99.55 minutes.

Summary

Overall this was a very interesting exercise. Regular expressions are a very powerful tool, if somewhat scary at first glance. And the ability to manipulate large datasets programmatically will doubtlessly prove to be a very useful skill going forward. I would have been interested in playing with the additional included CSVs (keywords, links, credits) to see how they fit into the larger picture. Maybe I can import those into databases and create some relationships. All in all, this is a very interesting dataset... likely one I'll return to at a later date.

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
data		data
resources		resources
.gitattributes		.gitattributes
.gitignore		.gitignore
ETL_clean_kaggle_data.ipynb		ETL_clean_kaggle_data.ipynb
ETL_clean_wiki_movies.ipynb		ETL_clean_wiki_movies.ipynb
ETL_create_database.ipynb		ETL_create_database.ipynb
ETL_function_test.ipynb		ETL_function_test.ipynb
LICENSE		LICENSE
README.md		README.md
exploratory.ipynb		exploratory.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Movies data ETL

Overview

Results

Summary

About

Releases

Packages

Languages

License

ArtTucker/movies_ETL

Folders and files

Latest commit

History

Repository files navigation

Movies data ETL

Overview

Results

Summary

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages