Skip to content

StevenWDiamond/wine_pricing_study

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Wine Price Study

Author: Steve Diamond (GitHub)

Problem Statement

We, the members of the Data Science committee of the Princeton Wine Club, often sit in our local wine shops and wonder what we should try next. The questions that we ask ourselves include the following:

  • Is it worth the price?
  • Was this a good year for this wine variety, in this country, in this region?
  • How important are the wine reviews they display with the wines?
  • What are we really paying for?
    • The country prestige?
    • The region?
    • The variety?

Each time we go through this process, we start by thanking god that we only like red wine, greatly narrowing the scope of our search.

To gain a deeper understanding of the relationship between wine prices and these various factors, we are studying a dataset with over 70,000 red wines which was scraped from WineEnthusiast Magazine's website. The data includes many of the factors we are looking to explore along with pricing for each wine. Our team will use a series of regression techniques in order to:

  • Better understand what factors are most important to determining price.
  • Attempt to build a predictive model that can estimate the cost for a given bottle of wine

We will use Root Mean Square Error (RMSE) as our metric as we compare our models.

Executive Summary

The pricing of wine is difficult for the buyer and we're never quite sure of the answers to the following questions:

  • Is it worth the price?
  • Was this a good year for this wine variety, in this country, in this region?
  • How important are the wine reviews they display with the wines?
  • What are we really paying for?
    • The country prestige?
    • The region?
    • The variety?

To gain a deeper understanding of the relationship between wine prices and these various factors, we did the following:

  1. Data Gathering
    • We acquired a dataset of WineEnthusiast Magazine reviews that included over 70,000 red wines. This list was scraped by a fellow data scientist and posted on Kaggle.com.
  2. Data Processing
    • Our team imported the entire dataset into a Pandas DataFrame and did the following steps of cleaning:
      • Narrowed the list to only red wines.
      • Removed unneccesary columns.
      • Imputed missing data for a limited number of datapoints by visiting the WineEnthusiast site and gathering data that was available.
      • Dropped rows which didn't have pricing.
      • Combined wine varietal categories where appropriate.
  3. Exploratory Data Analysis
    • We used a variety of methods to discover broad trends in our data, inlcuding examinations of:
      • Data distribution.
      • Data correlations.
      • Review content analysis.
    • We did some unsupervised learning by doing a KPrototypes cluster analysis. Examine these clusters, we were able to better understand these wine groupings.
  4. Modeling and Evaluation
    • We added the results of our KPrototypes clustering to the DataFrame for modeling.
    • We then used regression modeling, attempting to predict wine price, using RMSE as our metric.

Table of Contents

  • Loading Data
    • Library Imports
    • Data Imports
    • Data Dictionary
  • Data Cleaning
    • Overview Analysis
    • Cleaning/EDA Needs by Column
    • Removing Non-Red Wines
    • Removing Columns
    • Imputing Data
    • Removing Additional Wines
    • Drop Non-Prices Rows
    • Variety Combinations
    • Winery Data
  • Exploratory Data Analysis
    • Feature Engineering
      • Province & Region Data
      • Special Designation Column
      • Vintage Column from Title Column
    • Data Correlations
    • Data Distributions
    • Data Interations
      • Price vs Review Score
      • Cost-Per-Point Analysis
    • Review Word Frequency Analysis
    • Post-EDA Data Preparation
  • Return To EDA
  • Specific Links in Notebook
  • Modeling
    • Model Preparation
    • Models
      • Baseline Model
      • Linear Regression - Original Data
      • Linear Regression - KModes Data
      • Ridge Regression
      • LASSO Regression
      • Decision Tree Regression
      • Random Forest Regresssion
      • Extra Trees Regression
      • Feed Forward Neural Network
  • Model Selection
  • Model Evaluation
    • Residual Analysis
    • Coefficient Analysis/Interpretation
  • Conclusion/Next Steps
  • References

Data Dictionary

Feature Name Data Type Description
country string Wine's country of origin
description string Wine review copy
designation string Part of wine name that separates this particular wine (i.e. Reserve)
points integer Wine Enthusiast review score
price float Cost of wine (on the Wine Enthusiast site, this includes a link to buy)
province string Wine's province or state of origin (i.e. Provence, Califorinia)
region_1 string Wine's specific region (i.e. Calistoga)
region_2 string Wine's general region (i.e. Napa)
taster_name string Name of reviewer
taster_twitter_handle string Twitter information for reviewer
title string Full name of wine
variety string Grapes used to make the wine, sometimes called varietals
Unnamed: 0 integer Remnant column from saving without removing index
winery string Winemaker name

NOTE: All of this data is from Wine Enthusiast Magazine and was obtained as part of the Kaggle study referenced below.

References

  1. https://www.kaggle.com/zynicide/wine-reviews
  2. https://www.winemag.com/
  3. https://www.wine-searcher.com/grape-varieties
  4. https://www.thewinecellarinsider.com/wine-topics/wine-educational-questions/abc-of-wine-glossary-of-wine-terms/
  5. https://en.wikipedia.org/wiki/Pinot_noir#Synonyms (example of wikipedia information used)
  6. https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published