Extract data from a variety of sources in JSON and CSV formats, clean and transform it using Python, and finally load it into a PostgreSQL database using SQLAlchemy and the `to_sql` method.
A company in the live-streaming business wants its data science team to develop an algorithm that predicts which low-budget movies will become popular on release. To do that, the data must pass through the full data pipeline (ETL):
- Extract: pull the data from multiple sources and formats. In our case there are three data sources: one JSON file scraped from Wikipedia, plus `ratings.csv` and `movies_metadata.csv`, both taken from Kaggle.com.
- Transform: prepare the data for storage in a database. We read the data into Pandas DataFrames, removed null values and columns that were unnecessary for our analysis, and produced one merged dataset with all the information we needed.
The Python code for each step is stored in Jupyter Notebooks:
- Cleaning Kaggle data
- Cleaning Wikipedia dataset
- Read and visualize cleaned datasets
- Merge datasets and create database
- Load: once the data is ready, we transferred it to its final destination, a PostgreSQL database. This was done directly from Python using SQLAlchemy and `to_sql`.
## Create database and load merged, clean movies dataset

```python
import time

import pandas as pd
from sqlalchemy import create_engine

# Build the connection string and engine.
# Note: SQLAlchemy 1.4+ requires the "postgresql" dialect name;
# the older "postgres" scheme is no longer accepted.
db_string = f"postgresql://postgres:{db_password}@127.0.0.1:5432/movie_data"
engine = create_engine(db_string)

# Load the merged, cleaned movies dataset into a "movies" table.
movies_df.to_sql(name='movies', con=engine, if_exists='replace')

# Load ratings.csv into a "ratings" table in one-million-row chunks,
# printing progress and elapsed time for each chunk.
rows_imported = 0
start_time = time.time()
for data in pd.read_csv(f'{file_dir}/ratings.csv', chunksize=1000000):
    print(f'importing rows {rows_imported} to {rows_imported + len(data)}...', end='')
    data.to_sql(name='ratings', con=engine, if_exists='append')
    rows_imported += len(data)
    # add elapsed time to final print out
    print(f'Done. {time.time() - start_time} total seconds elapsed')
```
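The transform step described above (dropping nulls, removing unneeded columns, and merging the sources) can be sketched with Pandas. This is a minimal illustration using tiny in-memory frames in place of the real Kaggle and Wikipedia files; the column names (`imdb_id`, `title`, `budget`, `adult`, `rating`) are hypothetical stand-ins, not the project's actual schema.

```python
import pandas as pd

# Stand-in for the Kaggle metadata file (column names are hypothetical).
kaggle_metadata = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002', 'tt003'],
    'title': ['Movie A', 'Movie B', None],   # a null value to clean out
    'budget': [1_000_000, 500_000, 250_000],
    'adult': ['False', 'False', 'False'],    # a column we don't need
})

# Stand-in for the Wikipedia-scraped data.
wiki_movies = pd.DataFrame({
    'imdb_id': ['tt001', 'tt002'],
    'rating': [7.1, 6.4],
})

# Remove rows with null values and drop unnecessary columns.
kaggle_metadata = kaggle_metadata.dropna(subset=['title'])
kaggle_metadata = kaggle_metadata.drop(columns=['adult'])

# Merge the two sources into one dataset on the shared key.
movies_df = kaggle_metadata.merge(wiki_movies, on='imdb_id', how='inner')
print(movies_df)
```

The resulting `movies_df` is the single cleaned, merged dataset that the load step then writes to PostgreSQL.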