Wine Price Study

Author: Steve Diamond (GitHub)

Problem Statement

We, the members of the Data Science committee of the Princeton Wine Club, often sit in our local wine shops and wonder what we should try next. The questions that we ask ourselves include the following:

Is it worth the price?
Was this a good year for this wine variety, in this country, in this region?
How important are the wine reviews they display with the wines?
What are we really paying for?
- The country prestige?
- The region?
- The variety?

Each time we go through this process, we start by thanking god that we only like red wine, which greatly narrows the scope of our search.

To gain a deeper understanding of the relationship between wine prices and these various factors, we are studying a dataset of over 70,000 red wines that was scraped from WineEnthusiast Magazine's website. The data includes many of the factors we are looking to explore, along with pricing for each wine. Our team will use a series of regression techniques in order to:

Better understand which factors matter most in determining price.
Attempt to build a predictive model that can estimate the cost of a given bottle of wine.

We will use Root Mean Square Error (RMSE) as the metric for comparing our models.
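
To make the metric concrete, here is a minimal sketch of how RMSE could be computed and used to compare a model against a naive baseline. RMSE is the square root of the mean of the squared differences between actual and predicted prices, so it is expressed in the same units as price (dollars). The prices, the predictions, and the mean-price baseline below are made-up illustrations, not values or methods taken from the study's notebooks.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical bottle prices in dollars (not taken from the dataset).
actual_prices = [10.0, 20.0, 30.0, 40.0]

# Assumed baseline: predict the mean price for every bottle.
mean_price = sum(actual_prices) / len(actual_prices)
baseline_predictions = [mean_price] * len(actual_prices)

# Hypothetical predictions from some fitted regression model.
model_predictions = [12.0, 18.0, 33.0, 39.0]

baseline_rmse = rmse(actual_prices, baseline_predictions)
model_rmse = rmse(actual_prices, model_predictions)

print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Model RMSE:    {model_rmse:.2f}")
```

A lower RMSE means predictions sit closer to the actual prices on average, so a model is only worth keeping if it beats the baseline.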

Executive Summary

Wine pricing is difficult for buyers to judge, and we are never quite sure of the answers to the following questions:

Is it worth the price?
Was this a good year for this wine variety, in this country, in this region?
How important are the wine reviews they display with the wines?
What are we really paying for?
- The country prestige?
- The region?
- The variety?

To gain a deeper understanding of the relationship between wine prices and these various factors, we did the following:

Data Gathering
- We acquired a dataset of WineEnthusiast Magazine reviews that included over 70,000 red wines. This list was scraped by a fellow data scientist and posted on Kaggle.com.
Data Processing
- Our team imported the entire dataset into a Pandas DataFrame and performed the following cleaning steps:
  - Narrowed the list to only red wines.
  - Removed unnecessary columns.
  - Imputed missing data for a limited number of data points by visiting the WineEnthusiast site and gathering the data that was available.
  - Dropped rows that didn't have pricing.
  - Combined wine varietal categories where appropriate.
Exploratory Data Analysis
- We used a variety of methods to discover broad trends in our data, including examinations of:
  - Data distribution.
  - Data correlations.
  - Review content analysis.
- We did some unsupervised learning by running a KPrototypes cluster analysis. By examining these clusters, we were able to better understand these wine groupings.
Modeling and Evaluation
- We added the results of our KPrototypes clustering to the DataFrame for modeling.
- We then fit regression models to predict wine price, using RMSE as our metric.
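
The data processing steps listed above could be sketched in pandas roughly as follows. This is an illustration under stated assumptions, not the code from the notebooks: the sample rows are made up, and the set of red varieties, the varietal merge map, and the choice of columns to drop are all hypothetical. The manual imputation step (looking up missing values on the WineEnthusiast site) is omitted because it cannot be expressed as a simple transformation.

```python
import pandas as pd

# Tiny made-up sample using the dataset's column names; values are illustrative.
reviews = pd.DataFrame({
    "title": ["A 2012 Syrah", "B 2015 Shiraz", "C 2014 Chardonnay", "D 2013 Merlot"],
    "variety": ["Syrah", "Shiraz", "Chardonnay", "Merlot"],
    "points": [90, 88, 91, 87],
    "price": [35.0, 20.0, 25.0, None],
    "taster_twitter_handle": ["@a", "@b", "@c", "@d"],
})

# Assumption: red wines are identified by a hand-maintained set of varieties.
RED_VARIETIES = {"Syrah", "Shiraz", "Merlot"}
# Assumption: one example merge (Shiraz and Syrah are names for the same grape).
VARIETY_MERGES = {"Shiraz": "Syrah"}
# Assumption: which columns count as unnecessary for modeling.
UNNEEDED_COLUMNS = ["taster_twitter_handle"]


def clean_reviews(df):
    """Apply the cleaning steps in order and return a new DataFrame."""
    reds = df[df["variety"].isin(RED_VARIETIES)]       # keep only red wines
    reds = reds.drop(columns=UNNEEDED_COLUMNS)          # remove unneeded columns
    reds = reds.dropna(subset=["price"])                # drop rows without pricing
    reds = reds.assign(variety=reds["variety"].replace(VARIETY_MERGES))  # combine varietals
    return reds.reset_index(drop=True)


cleaned = clean_reviews(reviews)
print(cleaned)
```

Each step returns a new DataFrame rather than modifying the input, which keeps the raw data available for comparison during exploratory analysis.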

Table of Contents

Data Cleaning/EDA Notebook with Links to the Following:

Loading Data
- Library Imports
- Data Imports
- Data Dictionary
Data Cleaning
- Overview Analysis
- Cleaning/EDA Needs by Column
- Removing Non-Red Wines
- Removing Columns
- Imputing Data
- Removing Additional Wines
- Drop Non-Prices Rows
- Variety Combinations
- Winery Data
Exploratory Data Analysis
- Feature Engineering
  - Province & Region Data
  - Special Designation Column
  - Vintage Column from Title Column
- Data Correlations
- Data Distributions
- Data Interactions
  - Price vs Review Score
  - Cost-Per-Point Analysis
- Review Word Frequency Analysis
- Post-EDA Data Preparation
Return To EDA

KModes Notebook with Links to the Following:

Specific Links in Notebook

Modeling-Conclusions Notebook

Modeling
- Model Preparation
- Models
  - Baseline Model
  - Linear Regression - Original Data
  - Linear Regression - KModes Data
  - Ridge Regression
  - LASSO Regression
  - Decision Tree Regression
  - Random Forest Regression
  - Extra Trees Regression
  - Feed Forward Neural Network
Model Selection
Model Evaluation
- Residual Analysis
- Coefficient Analysis/Interpretation
Conclusion/Next Steps
References

Data Dictionary

Feature Name	Data Type	Description
country	string	Wine's country of origin
description	string	Wine review copy
designation	string	Part of wine name that separates this particular wine (e.g. Reserve)
points	integer	Wine Enthusiast review score
price	float	Cost of wine (on the Wine Enthusiast site, this includes a link to buy)
province	string	Wine's province or state of origin (e.g. Provence, California)
region_1	string	Wine's specific region (e.g. Calistoga)
region_2	string	Wine's general region (e.g. Napa)
taster_name	string	Name of reviewer
taster_twitter_handle	string	Twitter information for reviewer
title	string	Full name of wine
variety	string	Grapes used to make the wine, sometimes called varietals
Unnamed: 0	integer	Remnant column from saving without removing index
winery	string	Winemaker name

NOTE: All of this data is from Wine Enthusiast Magazine and was obtained as part of the Kaggle study referenced below.
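
The `Unnamed: 0` entry in the dictionary above is a common pandas artifact. The sketch below reproduces it with a made-up three-column frame and shows two ways to handle it. It illustrates the general behavior only; it is not necessarily how the notebooks load the Kaggle file, and the sample values are hypothetical.

```python
import io

import pandas as pd

# Made-up sample rows; only the column names mirror the data dictionary.
original = pd.DataFrame({
    "country": ["France", "US"],
    "points": [90, 88],
    "price": [35.0, 20.0],
})

# Saving with the default index=True writes the row index as an unnamed first column.
buffer = io.StringIO()
original.to_csv(buffer)
csv_text = buffer.getvalue()

# Reading the file back naively yields the remnant "Unnamed: 0" column.
naive = pd.read_csv(io.StringIO(csv_text))

# Option 1: treat the first column as the index while reading.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the remnant column after reading.
dropped = naive.drop(columns=["Unnamed: 0"])

print(list(naive.columns))
print(list(with_index.columns))
```

Either option leaves only the meaningful columns. Writing with `to_csv(..., index=False)` avoids creating the remnant column in the first place.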
