Skip to content

Know the grape you want in your glass, but unsure of what region you’re in the mood for, or what vintage is best...

Notifications You must be signed in to change notification settings

Abla-Beatrice/ETL-Wines-Reviews

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Wines_Reviews

This project was completed by Abla.

I am delivering a database with data about wine reviews. It has been aggregated at the 'wine-level' meaning individual review-user data has been averaged (rating and points). By adding average annual temperature information to both the provinces, or countries, a link can be made between the quality of the wine and the temperature. For potential plotting purposes, the longitude and latitude per country were also added.

The program takes 5 datasources and eventually creates a normalized database with 6 tables in PostgreSQL.

These are the steps that were taken:

  1. The database schema was defined quickdatabasediagrams.com as follows:

Wine_DB_Diagram.png

  1. The export function was used to predefine the database in PostgresSQL (see Tables_for_wine_db.sql)

  2. Once the database is ready, the ETL process can run.

EXTRACT: First the following 5 data sources are loaded. Csv files can be found in the Resources folder

  1. Wine.csv - source - Wine Enthusiast
  2. winequalities.csv - source kaggle.com
  3. countries.csv - source: wikipedia
  4. read_html using pandas for annual temperature averages per county - source: wikipedia
  5. state_temps.cvs - source: www.currentresults.com

TRANSFORM:

  1. Wines:
  • remove all unnecessary columns (from datasource 1)
  • small data cleanup within the region_name column
  • drop any rows that have data missing (NaN values) >> df_wine_base
  • group remaining fields by unique wine and average price and rating >> df_wine_data
  1. Countries:
  • add average temperature per country when found to df_countries dataframe (combine datasource 3 with 2)
  • create unique country_id's
  1. Wineries:
  • create unique df_wineries dataframe from df_wine_base with winery_id for lookup
  1. Provinces:
  • create unique df_provinces dataframe from df_wine_base with province_id for lookup
  • add average temperature in Fahrenheit and Celcius for every state (=provence) in the united states to df_provinces
  1. Regions:
  • create unique df_regions dataframe from df_wine_base with region_id for lookup
  1. Wine_types:
  • create unique df_wine_types dataframe from df_wine_base with wine_type_id for lookup

Finally merge df_wine_base with 5 other dataframes, adding country_id, winery_id, province_id, region_id, wine_type_id as new columns. Remove country_name, winery_name province_name, region_name and wine_type from df_wine_data, leaving only foreign keys in place!

LOAD: Since the tables have been precreated in postgreSQL, loading of the data can happen only once (or primary keys will be violated). A test query combining all 6 tables shows it worked correctly.

About

Know the grape you want in your glass, but unsure of what region you’re in the mood for, or what vintage is best...

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published