Wines_Reviews

This project was completed by Abla.

I am delivering a database with data about wine reviews. It has been aggregated at the 'wine-level' meaning individual review-user data has been averaged (rating and points). By adding average annual temperature information to both the provinces, or countries, a link can be made between the quality of the wine and the temperature. For potential plotting purposes, the longitude and latitude per country were also added.

The program takes 5 datasources and eventually creates a normalized database with 6 tables in PostgreSQL.

These are the steps that were taken:

The database schema was defined quickdatabasediagrams.com as follows:

The export function was used to predefine the database in PostgresSQL (see Tables_for_wine_db.sql)
Once the database is ready, the ETL process can run.

EXTRACT: First the following 5 data sources are loaded. Csv files can be found in the Resources folder

Wine.csv - source - Wine Enthusiast
winequalities.csv - source kaggle.com
countries.csv - source: wikipedia
read_html using pandas for annual temperature averages per county - source: wikipedia
state_temps.cvs - source: www.currentresults.com

TRANSFORM:

Wines:

remove all unnecessary columns (from datasource 1)
small data cleanup within the region_name column
drop any rows that have data missing (NaN values) >> df_wine_base
group remaining fields by unique wine and average price and rating >> df_wine_data

Countries:

add average temperature per country when found to df_countries dataframe (combine datasource 3 with 2)
create unique country_id's

Wineries:

create unique df_wineries dataframe from df_wine_base with winery_id for lookup

Provinces:

create unique df_provinces dataframe from df_wine_base with province_id for lookup
add average temperature in Fahrenheit and Celcius for every state (=provence) in the united states to df_provinces

Regions:

create unique df_regions dataframe from df_wine_base with region_id for lookup

Wine_types:

create unique df_wine_types dataframe from df_wine_base with wine_type_id for lookup

Finally merge df_wine_base with 5 other dataframes, adding country_id, winery_id, province_id, region_id, wine_type_id as new columns. Remove country_name, winery_name province_name, region_name and wine_type from df_wine_data, leaving only foreign keys in place!

LOAD: Since the tables have been precreated in postgreSQL, loading of the data can happen only once (or primary keys will be violated). A test query combining all 6 tables shows it worked correctly.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Codes		Codes
Resources		Resources
.DS_Store		.DS_Store
README.md		README.md
Wine_DB_Diagram.png		Wine_DB_Diagram.png
Wine_db.ipynb		Wine_db.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Wines_Reviews

About

Releases

Packages

Languages

Abla-Beatrice/ETL-Wines-Reviews

Folders and files

Latest commit

History

Repository files navigation

Wines_Reviews

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages