**Perry Fox  
Capstone Notebook 1  
pyrus277@gmail.com  
December 12, 2022** 

---

# VineClime - Climate and Qualitative Trends in the California Wine Industry
## Part 1: Data Wrangling & Simple EDA

---

### Contents:

1. [Data source overview](#overview)
1. [Preparing the core dataset](#core)
1. [Gathering and Preparing the Climate Data](#climate)
1. [Preparing the Wine Reviews Data](#reviews)
1. [Next Steps](#next)

---

<a id="overview"></a>
### 1. Data Source Overview:
#### [California Wine Production 1980-2020](https://www.kaggle.com/datasets/jarredpriester/california-wine-production-19802020)
- This is the core dataset, sourced from Kaggle. It describes total wine grape production volume, along with production value in USD, in California over 41 years (1980 to 2020), broken out by 41 different counties.

#### [NASA POWER Project - Climate Data](https://power.larc.nasa.gov/data-access-viewer/)
- This resource was used to append climate features to the core dataset.

#### [Wine Reviews](https://www.kaggle.com/datasets/zynicide/wine-reviews) and [Expansion Data](https://www.kaggle.com/datasets/icaram/wine-reviews)
- These datasets were used to append the all-important subjective quality features to the dataset in the form of ratings and professional tasting notes.

---

<a id="core"></a>
### 2. Preparing the Core Dataset

In [2]:
# import packages
import numpy as np
import pandas as pd
import plotly.express as px

In [3]:
# Read csv data
cwp = pd.read_csv('data/Californa_Wine_Production_1980_2020.csv')

In [4]:
# confirm no duplicate rows
cwp.loc[cwp.duplicated()]

Unnamed: 0,Year,CommodityCode,CropName,CountyCode,County,HarvestedAcres,Yield(Unit/Acre),Production,Price(Dollars/Unit),Unit,Value(Dollars)


In [5]:
# Add the latitude and longitude into the dataframe
ll_dict = {'Alameda': [37.6017, -121.7195],
 'Merced': [37.201, -120.712],
 'Yolo': [38.7646,-121.9018],
 'Tulare': [36.1342,-118.8597],
 'Sonoma': [38.578,-122.9888],
 'Solano': [38.3105,-121.9018],
 'SanJoaquin': [37.9176,-121.171],
 'SanDiego': [32.7157,-117.1611],
 'SanBenito': [36.5761,-120.9876],
 'Sacramento': [38.4747,-121.3542],
 'Riverside': [33.9533,-117.3961],
 'Napa': [38.5025, -122.2654],
 'SanBernardino': [34.9592,-116.4194],
 'Fresno': [36.9859,-119.2321],
 'Madera': [37.2519,-119.6963],
 'Lake': [39.084,-122.8084],
 'Kings': [35.4937,-118.8597],
 'Kern': [35.4937,-118.8597],
 'Calaveras': [38.196, -120.6805],
 'Mendocino': [39.55, -123.4384],
 'Amador': [38.3489, -120.7741],
 'Nevada': [39.1347, -121.171],
 'SanLuisObispo': [35.3102, -120.4358],
 'SantaClara': [37.3337, -121.8907],
 'SantaBarbara': [34.4208, -119.6982],
 'Monterey': [36.3136, -121.3542],
 'SantaCruz': [37.0454, -121.958],
 'ElDorado': [38.7426, -120.4358],
 'Mariposa': [37.4894, -119.9679],
 'Placer': [39.0916, -120.8039],
 'SanMateo': [37.4337, -122.4014],
 'Marin': [38.0834, -122.7633],
 'Stanislaus': [37.5091, -120.9876],
 'Trinity': [40.6329, -123.0623],
 'Yuba': [39.2547, -121.3999],
 'Shasta': [40.7909, -121.8474],
 'ContraCosta': [37.8534, -121.9018],
 'Colusa': [39.1041, -122.2654],
 'Mono': [37.9219, -118.9529],
 'Tehama': [40.0982, -122.1746],
 'Glenn': [39.6438, -122.4467]}

cwp.insert(loc = 11,
               column = 'lat',
               value = cwp['County'].apply(lambda x: ll_dict[x][0]))
cwp.insert(loc = 12,
               column = 'lon',
               value = cwp['County'].apply(lambda x: ll_dict[x][1]))
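
Because the coordinates were entered by hand, a quick sanity check for counties that share the exact same coordinate pair can catch copy-paste slips. In the dictionary above, `Kings` and `Kern` carry identical values. The sketch below runs on an excerpt of `ll_dict`; the helper name `find_shared_coords` is hypothetical and not part of the original notebook.

```python
from collections import defaultdict

# Excerpt of ll_dict from the cell above (values copied as-is)
ll_excerpt = {
    'Tulare': [36.1342, -118.8597],
    'Kings': [35.4937, -118.8597],
    'Kern': [35.4937, -118.8597],
    'Napa': [38.5025, -122.2654],
}

def find_shared_coords(coords):
    """Return {(lat, lon): [counties]} for points used by more than one county."""
    by_point = defaultdict(list)
    for county, (lat, lon) in coords.items():
        by_point[(lat, lon)].append(county)
    return {pt: names for pt, names in by_point.items() if len(names) > 1}

shared = find_shared_coords(ll_excerpt)
print(shared)
```

A hit here is only a prompt to re-check the source of the coordinates, not proof of an error.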


In [6]:
cwp.head(10)

Unnamed: 0,Year,CommodityCode,CropName,CountyCode,County,HarvestedAcres,Yield(Unit/Acre),Production,Price(Dollars/Unit),Unit,Value(Dollars),lat,lon
0,2020,216299,GRAPESWINE,1,Alameda,2530.0,5.14,13000.0,1497.69,Tons,19470000,37.6017,-121.7195
1,2020,216299,GRAPESWINE,5,Amador,5360.0,2.31,12400.0,1318.31,Tons,16347000,38.3489,-120.7741
2,2020,216299,GRAPESWINE,9,Calaveras,579.0,3.06,1770.0,1325.99,Tons,2347000,38.196,-120.6805
3,2020,216299,GRAPESWINE,11,Colusa,747.0,6.02,4500.0,684.67,Tons,3081000,39.1041,-122.2654
4,2020,216299,GRAPESWINE,13,ContraCosta,1940.0,4.69,9090.0,751.27,Tons,6829000,37.8534,-121.9018
5,2020,216299,GRAPESWINE,17,ElDorado,2620.0,2.38,6240.0,1548.56,Tons,9663000,38.7426,-120.4358
6,2020,216299,GRAPESWINE,19,Fresno,56900.0,12.13,690000.0,362.14,Tons,249877000,36.9859,-119.2321
7,2020,216299,GRAPESWINE,29,Kern,25200.0,7.54,190000.0,314.04,Tons,59668000,35.4937,-118.8597
8,2020,216299,GRAPESWINE,31,Kings,3590.0,16.46,59100.0,286.87,Tons,16954000,35.4937,-118.8597
9,2020,216299,GRAPESWINE,33,Lake,9580.0,4.12,39500.0,1329.37,Tons,52510000,39.084,-122.8084


In [7]:
cwp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1315 entries, 0 to 1314
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year                 1315 non-null   int64  
 1   CommodityCode        1315 non-null   int64  
 2   CropName             1315 non-null   object 
 3   CountyCode           1315 non-null   int64  
 4   County               1315 non-null   object 
 5   HarvestedAcres       1302 non-null   float64
 6   Yield(Unit/Acre)     1266 non-null   float64
 7   Production           1278 non-null   float64
 8   Price(Dollars/Unit)  1278 non-null   float64
 9   Unit                 1279 non-null   object 
 10  Value(Dollars)       1315 non-null   int64  
 11  lat                  1315 non-null   float64
 12  lon                  1315 non-null   float64
dtypes: float64(6), int64(4), object(3)
memory usage: 133.7+ KB


In [8]:
print(f'Counties Represented: {len(cwp.County.unique())}')

Counties Represented: 41


In [9]:
print(f'Earliest year: {cwp.Year.min()}')
print(f'Latest year: {cwp.Year.max()}')

Earliest year: 1980
Latest year: 2020


The California Wine Production dataset gives us the following:
- 1315 rows with 11 features
- Production volume across 41 different counties
- A timeframe of 41 years from 1980 to 2020
- For each county that produced in a given year, we see
    - Harvested acres
    - Yield (units per acre)
    - Production volume in tons
    - Production value in USD

Next, we'll take a look at how complete the data is, and spin up some exploratory visualizations.

In [10]:
print(cwp.CropName.value_counts())
print(cwp.CommodityCode.value_counts())
print(cwp.Unit.value_counts())

GRAPESWINE    1315
Name: CropName, dtype: int64
216299    1315
Name: CommodityCode, dtype: int64
TONS    651
Tons    380
TON     211
tons     36
ACRE      1
Name: Unit, dtype: int64


`CropName` and `CommodityCode` are the same all the way down, and `Unit` is effectively constant (variant spellings of "tons", plus a single `ACRE` entry). `CountyCode` is also redundant with `County`, so we'll drop these.

In [11]:
cwp.drop(['CommodityCode', 'CropName', 'Unit', 'CountyCode'], axis=1, inplace=True)
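
Before a drop like this, one way to confirm that `Unit` carries no extra information is to normalize the labels and count again. The sample values below mirror the labels in the `value_counts()` output above rather than the full column, and the normalization rule (lowercase, trim, strip a trailing "s") is an assumption made for illustration.

```python
import pandas as pd

# Sample labels mirroring the Unit value counts above (not the full column)
units = pd.Series(['TONS', 'Tons', 'TON', 'tons', 'ACRE', None])

# Lowercase, trim whitespace, and strip a trailing "s" so the
# TON / TONS / Tons / tons variants collapse into one label
normalized = units.str.strip().str.lower().str.rstrip('s')

print(normalized.value_counts(dropna=False))
```

Any label other than the collapsed `ton` (here the lone `acre` row) is worth inspecting by hand before the column is discarded.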

In [12]:
# find null counts by feature:
cwp.isnull().sum()

Year                    0
County                  0
HarvestedAcres         13
Yield(Unit/Acre)       49
Production             37
Price(Dollars/Unit)    37
Value(Dollars)          0
lat                     0
lon                     0
dtype: int64

In [13]:
# create a dataframe of only the rows where the Yield value is null. Run the counts again to see if this captures all the other nulls:
null_yield_df = cwp[cwp['Yield(Unit/Acre)'].isnull()]
null_yield_df.isnull().sum()

Year                    0
County                  0
HarvestedAcres         13
Yield(Unit/Acre)       49
Production             37
Price(Dollars/Unit)    37
Value(Dollars)          0
lat                     0
lon                     0
dtype: int64

We see that every row with a null value also has a null `Yield(Unit/Acre)`, the feature with the most nulls.   
Since there is still useful data in the rows containing null values, I'll hold on to all the rows in the `cwp` dataframe, and drop them as needed in the future depending on the model or analysis.   
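
The overlap can also be checked directly with a boolean mask instead of comparing counts. This is a minimal sketch on a toy frame that reuses three of the `cwp` column names; the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with cwp-style column names (values are invented)
toy = pd.DataFrame({
    'HarvestedAcres': [100.0, np.nan, 250.0, 80.0],
    'Yield(Unit/Acre)': [5.0, np.nan, np.nan, 4.2],
    'Production': [500.0, np.nan, 900.0, 336.0],
})

has_any_null = toy.isnull().any(axis=1)
yield_is_null = toy['Yield(Unit/Acre)'].isnull()

# True when no row has a null somewhere while its yield is present
nulls_covered = not (has_any_null & ~yield_is_null).any()
print(nulls_covered)
```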

Here we take a look at which counties are active over time:

In [14]:
cwp.County.value_counts()

Alameda          41
Merced           41
Yolo             41
Tulare           41
Sonoma           41
Solano           41
SanJoaquin       41
SanDiego         41
SanBernardino    41
Sacramento       41
Riverside        41
Napa             41
SanBenito        41
Fresno           41
Madera           41
Lake             41
Kings            41
Kern             41
Calaveras        40
Mendocino        39
Amador           39
Nevada           38
SanLuisObispo    38
SantaClara       37
SantaBarbara     36
Monterey         36
SantaCruz        36
ElDorado         33
Mariposa         30
Placer           28
SanMateo         28
Marin            22
Stanislaus       21
Trinity          20
Yuba             14
Shasta           12
ContraCosta      11
Colusa           10
Mono              4
Tehama            3
Glenn             2
Name: County, dtype: int64

A little under half of the counties (18 of 41) reported data for the full 41-year time frame.

---

<a id="climate"></a>
### 3. Gathering and Preparing the Climate Data

Above we saw that 41 counties were included in the wine production dataset. In this section, I will import climate data from 1981 to 2020 for each of these counties. This data comes from the NASA POWER Project website. To produce a climate data CSV, the site requires single-point details for each location and a selection from a wide range of climate data parameters:  

- The Single Point details specified are:
    - User Community: Agroclimatology
    - Temporal Average: Monthly & Annual
    - Latitude and Longitude (Supplied from Google Maps)
    - Output File Format: CSV  
  
- The climate data parameters selected are:
    - T2M - Temperature
    - QV2M - Humidity 
    - WS2M - Wind Speed 
    - GWETTOP - Surface Soil Wetness 
    - GWETPROF - Profile Soil Wetness 
    - GWETROOT - Root Zone Soil Wetness
    - PRECTOTCORR - Precipitation
    - UV index data (along with `ALLSKY_SFC_PAR_TOT`) was also requested, but it came back very incomplete, with many values filled by a `-999` placeholder, so we'll drop those columns.


Performing these queries resulted in a directory with 41 CSV files.  
The procedure to wrangle this data and combine it into one useful dataframe is as follows:

In [31]:
# Make a list of the counties to iterate thru
county_list = list(cwp.County.unique())

In [32]:
# Each of the CSV files has a substantial header section, so in this cell 
# I confirm all 41 files are read in and that they all share the same shape
count = 0
shape = set()
for county in county_list:
    df = pd.read_csv(f'data/county_climate/{county}.csv', skiprows=17, index_col='YEAR')
    # print(f'{county}, {df.shape}')
    count += 1
    shape.add(df.shape[0])
    shape.add(df.shape[1])
print(count, shape)

41 {360, 14}
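
The hard-coded `skiprows=17` works as long as every file has a header of the same length. A less brittle option is to locate the end of the header by its marker line. The sketch below assumes the header ends with a line reading `-END HEADER-`; that marker, the helper name `read_after_header`, and the miniature sample file are all assumptions to verify against one of the downloaded files.

```python
import io
import pandas as pd

END_MARKER = '-END HEADER-'  # assumed marker; verify against an actual file

def read_after_header(handle, marker=END_MARKER):
    """Parse the CSV body that follows the marker line (whole file if no marker)."""
    lines = handle.read().splitlines()
    # Index of the first line after the marker, or 0 when the marker is absent
    start = next((i + 1 for i, line in enumerate(lines)
                  if line.strip() == marker), 0)
    return pd.read_csv(io.StringIO('\n'.join(lines[start:])))

# Made-up miniature file in the assumed layout
sample = io.StringIO(
    '-BEGIN HEADER-\n'
    'Free-form description lines...\n'
    '-END HEADER-\n'
    'PARAMETER,YEAR,JAN,FEB,ANN\n'
    'T2M,1981,9.5,11.0,14.22\n'
    'WS2M,1981,1.8,1.9,2.16\n'
)
demo = read_after_header(sample)
print(demo.shape)
```

With a real file, the same helper could be called on an open file handle and followed by `.set_index('YEAR')` to match the cells in this section.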


Now I'll perform some manipulations and pivots to get them all together

In [33]:
# Instantiate a blank dataframe
all_county_climates = pd.DataFrame()

# Iterate thru all the county names, create a df for each, and append to the main df
for county in county_list:
    # read in the corresponding CSV:
    df = pd.read_csv(f'data/county_climate/{county}.csv', skiprows=17, index_col='YEAR')
    
    # Add `County` and `Year` columns right after `PARAMETER`
    df.insert(loc = 1, 
               column = 'County', 
               value = f'{county}')
    df.insert(loc = 2,
               column = 'Year',
               value = df.index)
    
    # Insert summer and winter average columns at the end
    df['winter_avg'] = df[['JAN', 'FEB', 'MAR', 'OCT', 'NOV', 'DEC']].mean(axis=1).round(2)
    df['summer_avg'] = df[['APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP']].mean(axis=1).round(2)
    
    # Simplify by dropping the individual months (JAN to DEC): column positions 3 to 14
    df.drop(df.columns[3:15], axis=1, inplace=True)
  
    # Reset the index
    df.reset_index(inplace=True)

    # Pivot the table so we have the climate parameters as columns
    df = df.pivot(index='Year', columns='PARAMETER', values = ['ANN', 'summer_avg', 'winter_avg'])
    
    # Add back in the county name column
    df.insert(loc = 0, 
               column = 'county', 
               value = f'{county}')
    
    all_county_climates = pd.concat([all_county_climates, df])

#### Note - I engineered summer and winter average features from the monthly data because the relative intensity of the hot and cold months of the year can have significant effects on wine grape production and quality
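
To make the reshaping in the loop easier to follow, here is the same seasonal-average and pivot step on a tiny made-up table with one year and two parameters. The month groupings match the cell above; the numbers are invented.

```python
import pandas as pd

months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
winter = ['JAN', 'FEB', 'MAR', 'OCT', 'NOV', 'DEC']
summer = ['APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP']

# One row per (parameter, year), one column per month (invented values)
t2m = dict(zip(months, [8, 10, 12, 14, 17, 20, 22, 22, 20, 16, 11, 8]))
ws2m = dict(zip(months, [2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2]))
toy = pd.DataFrame([
    {'PARAMETER': 'T2M', 'Year': 1981, **t2m},
    {'PARAMETER': 'WS2M', 'Year': 1981, **ws2m},
])

toy['winter_avg'] = toy[winter].mean(axis=1).round(2)
toy['summer_avg'] = toy[summer].mean(axis=1).round(2)

# Parameters become columns, grouped under each seasonal statistic
wide = toy.pivot(index='Year', columns='PARAMETER',
                 values=['summer_avg', 'winter_avg'])
print(wide)
```

The result has one row per year and a two-level column index (statistic, parameter), which is why the column names need flattening afterwards.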

In [34]:
all_county_climates

Unnamed: 0_level_0,county,ANN,ANN,ANN,ANN,ANN,ANN,ANN,ANN,ANN,...,summer_avg,winter_avg,winter_avg,winter_avg,winter_avg,winter_avg,winter_avg,winter_avg,winter_avg,winter_avg
PARAMETER,Unnamed: 1_level_1,ALLSKY_SFC_PAR_TOT,ALLSKY_SFC_UV_INDEX,GWETPROF,GWETROOT,GWETTOP,PRECTOTCORR,QV2M,T2M,WS2M,...,WS2M,ALLSKY_SFC_PAR_TOT,ALLSKY_SFC_UV_INDEX,GWETPROF,GWETROOT,GWETTOP,PRECTOTCORR,QV2M,T2M,WS2M
Year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1981,Alameda,-999.00,-999.0,0.55,0.56,0.47,1.74,6.53,14.22,2.16,...,2.51,-999.00,-999.0,0.61,0.63,0.63,3.35,6.26,10.68,1.82
1982,Alameda,-999.00,-999.0,0.67,0.69,0.63,2.50,6.53,12.56,2.08,...,2.32,-999.00,-999.0,0.74,0.78,0.76,4.31,5.83,9.08,1.82
1983,Alameda,-999.00,-999.0,0.69,0.70,0.66,3.10,7.20,13.63,2.16,...,2.30,-999.00,-999.0,0.76,0.78,0.76,5.46,6.74,10.79,2.02
1984,Alameda,-999.00,-999.0,0.60,0.61,0.52,1.09,6.41,13.83,2.18,...,2.50,-999.00,-999.0,0.70,0.72,0.71,2.02,6.16,9.52,1.86
1985,Alameda,-999.00,-999.0,0.56,0.57,0.48,1.08,6.10,13.32,2.02,...,2.34,-999.00,-999.0,0.62,0.64,0.62,2.00,5.40,9.04,1.69
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016,Glenn,92.89,-999.0,0.64,0.66,0.55,2.57,5.86,16.14,1.70,...,1.69,53.19,-999.0,0.75,0.78,0.75,4.73,6.00,9.62,1.71
2017,Glenn,93.31,-999.0,0.64,0.66,0.53,2.49,5.68,16.44,1.82,...,1.74,55.67,-999.0,0.74,0.75,0.69,4.50,5.29,9.76,1.90
2018,Glenn,94.71,-999.0,0.55,0.55,0.43,1.53,5.19,16.65,1.77,...,1.67,57.49,-999.0,0.59,0.60,0.56,2.47,5.00,10.29,1.87
2019,Glenn,93.52,-999.0,0.64,0.66,0.54,2.89,5.80,15.90,1.79,...,1.76,54.86,-999.0,0.71,0.73,0.64,5.14,4.93,9.53,1.84


In [35]:
all_county_climates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1640 entries, 1981 to 2020
Data columns (total 28 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   (county, )                         1640 non-null   object 
 1   (ANN, ALLSKY_SFC_PAR_TOT)          1640 non-null   float64
 2   (ANN, ALLSKY_SFC_UV_INDEX)         1640 non-null   float64
 3   (ANN, GWETPROF)                    1640 non-null   float64
 4   (ANN, GWETROOT)                    1640 non-null   float64
 5   (ANN, GWETTOP)                     1640 non-null   float64
 6   (ANN, PRECTOTCORR)                 1640 non-null   float64
 7   (ANN, QV2M)                        1640 non-null   float64
 8   (ANN, T2M)                         1640 non-null   float64
 9   (ANN, WS2M)                        1640 non-null   float64
 10  (summer_avg, ALLSKY_SFC_PAR_TOT)   1640 non-null   float64
 11  (summer_avg, ALLSKY_SFC_UV_INDEX)  1640 non-null   fl

In [36]:
# collapse the unwieldy multi-level column names
all_county_climates.columns = [' '.join(col).strip() for col in all_county_climates.columns.values]
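
As a small illustration of what this flattening does, here is the same list comprehension applied to a toy frame with a two-level column index (values invented). Joining with a space and stripping is what turns `('county', '')` into plain `county`.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([
    ('county', ''), ('ANN', 'T2M'), ('summer_avg', 'T2M')
])
toy = pd.DataFrame([['Napa', 14.2, 19.1]], columns=cols)

# Same flattening as the cell above: join the levels, trim the stray space
toy.columns = [' '.join(col).strip() for col in toy.columns.values]
print(list(toy.columns))  # ['county', 'ANN T2M', 'summer_avg T2M']
```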

In [37]:
# Drop the PAR and UV index columns since they came back with many effective null values (-999 placeholders)
all_county_climates.drop(['ANN ALLSKY_SFC_PAR_TOT', 'ANN ALLSKY_SFC_UV_INDEX', 'summer_avg ALLSKY_SFC_PAR_TOT', 
              'summer_avg ALLSKY_SFC_UV_INDEX', 'winter_avg ALLSKY_SFC_PAR_TOT', 'winter_avg ALLSKY_SFC_UV_INDEX'], axis=1, inplace=True)
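
The `-999.0` entries in the table above behave like missing values, but `isnull()` does not count them because they are ordinary floats. A hedged way to quantify the "effective nulls" is to swap the placeholder for `NaN` first. The toy frame below borrows a few values from the output above; treating `-999` as the missing-value code is an assumption based on that output.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ANN ALLSKY_SFC_UV_INDEX': [-999.0, -999.0, -999.0, -999.0],
    'ANN ALLSKY_SFC_PAR_TOT': [-999.0, -999.0, 92.89, 93.31],
    'ANN T2M': [14.22, 12.56, 16.14, 16.44],
})

# Share of placeholder values per column after mapping -999 to NaN
missing_share = toy.replace(-999, np.nan).isna().mean()
print(missing_share)
```

Running the same check on the remaining climate columns would confirm that none of them carry the placeholder before they are used in a model.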

In [38]:
# Confirm changes
all_county_climates.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1640 entries, 1981 to 2020
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   county                  1640 non-null   object 
 1   ANN GWETPROF            1640 non-null   float64
 2   ANN GWETROOT            1640 non-null   float64
 3   ANN GWETTOP             1640 non-null   float64
 4   ANN PRECTOTCORR         1640 non-null   float64
 5   ANN QV2M                1640 non-null   float64
 6   ANN T2M                 1640 non-null   float64
 7   ANN WS2M                1640 non-null   float64
 8   summer_avg GWETPROF     1640 non-null   float64
 9   summer_avg GWETROOT     1640 non-null   float64
 10  summer_avg GWETTOP      1640 non-null   float64
 11  summer_avg PRECTOTCORR  1640 non-null   float64
 12  summer_avg QV2M         1640 non-null   float64
 13  summer_avg T2M          1640 non-null   float64
 14  summer_avg WS2M         1640 non-null

Finally, rename the columns to remove spaces and be more descriptive at a glance:

In [39]:
all_county_climates = all_county_climates.rename({'ANN GWETPROF': 'annual_profile_moisture', 
                                               'ANN GWETROOT': 'annual_root_moisture', 
                                               'ANN GWETTOP': 'annual_surface_moisture',
                                               'ANN PRECTOTCORR': 'annual_precipitation', 
                                               'ANN QV2M': 'annual_humidity', 
                                               'ANN T2M': 'annual_temperature', 
                                               'ANN WS2M': 'annual_wind_speed',
                                               'summer_avg GWETPROF': 'summer_profile_moisture', 
                                               'summer_avg GWETROOT': 'summer_root_moisture', 
                                               'summer_avg GWETTOP': 'summer_surface_moisture',
                                               'summer_avg PRECTOTCORR': 'summer_precipitation', 
                                               'summer_avg QV2M': 'summer_humidity', 
                                               'summer_avg T2M': 'summer_temperature',
                                               'summer_avg WS2M': 'summer_wind_speed', 
                                               'winter_avg GWETPROF': 'winter_profile_moisture', 
                                               'winter_avg GWETROOT': 'winter_root_moisture',
                                               'winter_avg GWETTOP': 'winter_surface_moisture', 
                                               'winter_avg PRECTOTCORR': 'winter_precipitation', 
                                               'winter_avg QV2M': 'winter_humidity',
                                               'winter_avg T2M': 'winter_temperature', 
                                               'winter_avg WS2M': 'winter_wind_speed'},
                                               axis = 'columns')
    
    
    

In [40]:
# Confirm change and check for nulls:
all_county_climates.isnull().sum()

county                     0
annual_profile_moisture    0
annual_root_moisture       0
annual_surface_moisture    0
annual_precipitation       0
annual_humidity            0
annual_temperature         0
annual_wind_speed          0
summer_profile_moisture    0
summer_root_moisture       0
summer_surface_moisture    0
summer_precipitation       0
summer_humidity            0
summer_temperature         0
summer_wind_speed          0
winter_profile_moisture    0
winter_root_moisture       0
winter_surface_moisture    0
winter_precipitation       0
winter_humidity            0
winter_temperature         0
winter_wind_speed          0
dtype: int64

In [41]:
# confirm no duplicate rows
all_county_climates.loc[all_county_climates.duplicated()]

Unnamed: 0_level_0,county,annual_profile_moisture,annual_root_moisture,annual_surface_moisture,annual_precipitation,annual_humidity,annual_temperature,annual_wind_speed,summer_profile_moisture,summer_root_moisture,...,summer_humidity,summer_temperature,summer_wind_speed,winter_profile_moisture,winter_root_moisture,winter_surface_moisture,winter_precipitation,winter_humidity,winter_temperature,winter_wind_speed
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1


In [42]:
all_county_climates.county.unique()

array(['Alameda', 'Amador', 'Calaveras', 'Colusa', 'ContraCosta',
       'ElDorado', 'Fresno', 'Kern', 'Kings', 'Lake', 'Madera', 'Marin',
       'Mendocino', 'Merced', 'Monterey', 'Napa', 'Nevada', 'Placer',
       'Riverside', 'Sacramento', 'SanBenito', 'SanBernardino',
       'SanDiego', 'SanJoaquin', 'SanLuisObispo', 'SanMateo',
       'SantaBarbara', 'SantaClara', 'SantaCruz', 'Shasta', 'Solano',
       'Sonoma', 'Stanislaus', 'Tehama', 'Tulare', 'Yolo', 'Mariposa',
       'Trinity', 'Mono', 'Yuba', 'Glenn'], dtype=object)

In [44]:
# Extra formatting so this dataset meshes well with the others on merge operations:
all_county_climates.index.names = ['year'] # lowercase year column name
all_county_climates.reset_index(inplace=True) # make year into a column
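
The merge itself is not part of this section, but a sketch shows how the two tables could line up. The production data uses `County` and `Year` while the climate data uses `county` and `year`, and production starts in 1980 while climate starts in 1981, so some rows will have no climate match. The toy frames and values below are invented; `indicator=True` is used only to surface the unmatched rows.

```python
import pandas as pd

production = pd.DataFrame({
    'Year': [1980, 1981, 1982],
    'County': ['Napa', 'Napa', 'Napa'],
    'Production': [1.0, 2.0, 3.0],
})
climate = pd.DataFrame({
    'year': [1981, 1982],
    'county': ['Napa', 'Napa'],
    'annual_temperature': [15.1, 15.4],
})

merged = production.merge(
    climate,
    left_on=['County', 'Year'],
    right_on=['county', 'year'],
    how='left',
    indicator=True,
)

# Production rows with no climate record (here, the 1980 row)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['County', 'Year']])
```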

---

<a id="reviews"></a>
### 4. Preparing the Wine Reviews Data

In [52]:
import re
import time
import shelve

In [53]:
# Pulling the reviews dataset and the expansion to it. 
reviews = pd.read_csv('data/winemag-data-130k-v2.csv', index_col=0)
reviews_expansion = pd.read_csv('data/winemag-data-2017-2020.csv')

In [54]:
reviews.head(3)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm


In [55]:
reviews_expansion.head(3)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_photo,taster_twitter_handle,title,variety,vintage,winery
0,Portugal,This is a deliciously creamy wine with light w...,Assobio Branco,87,14.0,Douro,,,Roger Voss,https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd...,@vossroger,Quinta dos Murças 2016 Assobio Branco White (D...,Portuguese White,2016,Quinta dos Murças
1,US,"Black plum juice, black pepper, caramel and sm...",,87,25.0,California,Paso Robles,Central Coast,Matt Kettmann,https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd...,@mattkettmann,Western Slope 2014 Cabernet Sauvignon (Paso Ro...,Cabernet Sauvignon,2014,Western Slope
2,Georgia,Aromas of green apple and white flowers prepar...,,87,14.0,Lechkhumi,,,Mike DeSimone,https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd...,@worldwineguys,Teliani Valley 2015 Tsolikouri (Lechkhumi),Tsolikouri,2015,Teliani Valley


These two datasets don't quite align, but we need to combine them:
- `reviews_expansion` comes with a taster photo column, while `reviews` does not, so this will be dropped  
- `reviews_expansion` also comes with a vintage column, which `reviews` does not have. Luckily, I can derive that column from the title column before concatenating the dataframes

In [56]:
# Drop the taster_photo column
reviews_expansion.drop('taster_photo',axis=1, inplace=True)

# Add a vintage column to the reviews dataset, and add it so it lines up with the reviews_expansion df
def pull_year(x): 
    try: 
        return int(re.search(r'\d{4}', x).group(0))
    except AttributeError:
        return 9999

# Extract the year from the title column and put that in a new col:
# reviews['vintage'] = reviews['title'].apply(pull_year)

reviews.insert(loc = 12,
               column = 'vintage',
               value = reviews['title'].apply(pull_year))
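
One caveat with `\d{4}` is that it grabs the first run of four digits, whether or not it is a plausible year. A slightly stricter variant, sketched below, only accepts standalone years starting with 19 or 20 and also tolerates non-string input. The pattern and the `pull_vintage` name are hypothetical alternatives, not what the notebook ran.

```python
import re

# Hypothetical stricter pattern: a standalone 4-digit year starting with 19 or 20
VINTAGE_RE = re.compile(r'\b(?:19|20)\d{2}\b')

def pull_vintage(title, missing=9999):
    """Return the first plausible year in a title, else the sentinel value."""
    if not isinstance(title, str):
        return missing
    match = VINTAGE_RE.search(title)
    return int(match.group(0)) if match else missing

print(pull_vintage('Nicosia 2013 Vulkà Bianco (Etna)'))  # 2013
print(pull_vintage('Example Cellars NV Brut'))           # made-up title, no year -> 9999
print(pull_vintage(None))                                # 9999
```

This still picks the first match, so a title where a year-like number appears before the real vintage would need manual review.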
               

In [57]:
# Confirm the shapes align
reviews.shape, reviews_expansion.shape

((129971, 14), (81115, 14))

#### Combine the two dataframes into one called `reviews`:

In [60]:
frames = [reviews, reviews_expansion]
reviews = pd.concat(frames)

In [61]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 292201 entries, 0 to 81114
Data columns (total 14 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                292128 non-null  object 
 1   description            292201 non-null  object 
 2   designation            212098 non-null  object 
 3   points                 292201 non-null  int64  
 4   price                  273911 non-null  float64
 5   province               292128 non-null  object 
 6   region_1               245128 non-null  object 
 7   region_2               112953 non-null  object 
 8   taster_name            265657 non-null  object 
 9   taster_twitter_handle  258836 non-null  object 
 10  title                  292201 non-null  object 
 11  variety                292200 non-null  object 
 12  vintage                292201 non-null  object 
 13  winery                 292201 non-null  object 
dtypes: float64(1), int64(1), object(12)
m
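
The row count reported above (292,201) is larger than the sum of the two shapes, 129,971 + 81,115 = 211,086. The difference is exactly 81,115, which is consistent with the concatenation cell having been run twice: because the result is assigned back to `reviews`, every re-run appends the expansion again. A minimal sketch of a re-run-safe pattern follows, using toy frames and a hypothetical `combined` name.

```python
import pandas as pd

base = pd.DataFrame({'title': ['A 2013', 'B 2011'], 'points': [87, 90]})
extra = pd.DataFrame({'title': ['C 2016'], 'points': [88]})

# Assign to a new name so the inputs are never overwritten;
# running this line again rebuilds the same result instead of growing it
combined = pd.concat([base, extra], ignore_index=True)
combined = pd.concat([base, extra], ignore_index=True)  # simulated re-run

print(len(combined))  # 3
```

`ignore_index=True` also avoids the repeated index labels that appear when two frames with their own 0-based indexes are stacked.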

This dataset contains 130k+ reviews of wines from wineries all over the world. This analysis is concerned with California wineries, limited to the counties and time frame I have climate data for (1981 to 2020). To get the proper time frame, the `vintage` column first needs to be converted to an integer data type so comparison operators can be used.

In [62]:
reviews['vintage'] = pd.to_numeric(reviews['vintage'], errors='coerce').fillna(9999).astype('Int64')

#### Grab all the California listings from the earliest year (1981) through 2020:

In [63]:
reviews = reviews[(reviews['country'] == 'US') & 
            (reviews['province'] == 'California') &
            (reviews['vintage'] > 1980) &
            (reviews['vintage'] < 2021)] 

In [64]:
# Review Count by Year
revct_by_yr = reviews['vintage'].value_counts().sort_index()
revct_by_yr

1985        1
1986        1
1987        1
1989        1
1990        3
1991        3
1992        5
1993        1
1994       11
1995       11
1996       14
1997      202
1998      211
1999      223
2000      235
2001      176
2002       32
2003       70
2004      441
2005     1130
2006     1780
2007     2072
2008     2104
2009     2944
2010     3590
2011     3208
2012     5750
2013     8550
2014    12184
2015    11380
2016    11571
2017     7304
2018     2760
2019       80
Name: vintage, dtype: Int64

Unfortunately, the reviews are sparse before 1997 (at most 14 per vintage), so the scope of modeling might have to pick up in the mid-to-late 1990s instead of 1981.

#### Web Scraping

To tie this review data in with the `Climate` and `California Wine Production` dataframes, I need to aggregate rows and join by county and year. It looks like I might be able to use the region columns for this, but unfortunately, the wine regions here do not map 1:1 to the state-defined counties. The only way to bridge this gap is to investigate the `winery` column and find the physical address for each winery. This required some web scraping.  
  
To pull the county information, I wrote two web scrapers.
- `scraper_google.py` takes a list of the unique winery names and performs Google searches on each, first finding the physical address of the winery, and then performing a search on the city name to get the county. 
- `scraper_winery_sage.py` scrapes the website www.winery-sage.com, which has tables listing wineries and county information, among other things, for vineyards across California, Oregon, and Washington.
- After running both of these, I ran the Google scraper again, adjusting the search term format to try to fill in a few more of the remaining gaps. 

This scraping effort was largely successful. As we'll see below, I was able to fill in a majority of the county names (58%). That is in the neighborhood of what would be expected, because not every California winery grows its own grapes: many are blenders that buy wines from growers and combine them to create a desired profile. If we are concerned with how climate affects the wines from a known location, we would not want to include blenders, who are not required to disclose the sources of their wines. Dropping the rows with unknown counties is therefore desirable.

In [31]:
# First, get a list of unique wineries to pass along to the scraper. I'll use the `shelve` library to pass things back and forth.
wineries = list(reviews.winery.unique())
len(wineries)

4646

In [None]:
## Don't run this cell again ##
## There are 4,646 wineries to scrape-- I'll Shelve this list for scraper_google.py:

# with shelve.open('wineries') as shelfFile:
#     shelfFile['wineries'] = wineries

## I'll also instantiate an empty variable for the scraper to populate with tuples-- (winery name, county)

# with shelve.open('wineries') as shelfFile:
#     ##shelfFile['with_counties'] = [] #especially don't run this again
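
For readers unfamiliar with `shelve`, the hand-off between the notebook and the scraper scripts boils down to writing and reading keys in a small on-disk dictionary. The sketch below is a self-contained round trip in a temporary directory; the winery names are made up, and the `'NA'` placeholder mirrors the value checked for later in this notebook.

```python
import os
import shelve
import tempfile

wineries_demo = ['Example Cellars', 'Sample Vineyards']  # made-up names

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'wineries_demo')

    # Notebook side: store the list under a key
    with shelve.open(path) as shelf:
        shelf['wineries'] = wineries_demo

    # Scraper side: read the list back, then store (winery, county) tuples
    with shelve.open(path) as shelf:
        to_scrape = shelf['wineries']
        shelf['with_counties'] = [(name, 'NA') for name in to_scrape]

    # Notebook side again: pick up the results
    with shelve.open(path) as shelf:
        results = shelf['with_counties']

print(results)
```

Note that assigning to a key replaces whatever was stored there, which is why the reset cell above is commented out.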

Okay, I ran a scraper and filled in some counties. Next, I'll bring in the results list, and isolate the wineries we couldn't find counties for, and adjust the scraping strategy for those, i.e. altering the Google search text and using the second scraper. 

In [53]:
with shelve.open('wineries') as shelfFile:
    with_counties = shelfFile['with_counties']

In [57]:
len(with_counties) # with_counties has all the wineries, and county names from the FIRST SCRAPE

4646

Now I'll convert `with_counties` into a dictionary, `with_counties_dict2`, and update this dictionary as I uncover more counties. A dictionary will be easier to use to populate the county column in the dataframe above.   

In [59]:
# Rebuild the winery -> county mapping as with_counties_dict2:
# turn the with_counties list of tuples into a dict
with_counties_dict2 = {}
for i in with_counties:
    with_counties_dict2[i[0]] = i[1] 


Also, let's see what percentage of unique wineries we have counties for:

In [71]:
na_lst = []
for winery, county in with_counties_dict2.items():
    if county == 'NA':
        na_lst.append(winery)
print(f'{round((len(na_lst)/len(with_counties_dict2)),4)*100}% of the total wineries do not have a county listed') 

62.79% of the total wineries do not have a county listed
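
The same share can be computed a little more compactly, and formatting with `:.2%` avoids the floating-point noise that `round(x, 4) * 100` can print for some values. A minimal sketch on a made-up dictionary:

```python
# Made-up winery -> county mapping using the same 'NA' placeholder
with_counties_demo = {
    'Winery A': 'Napa',
    'Winery B': 'NA',
    'Winery C': 'NA',
    'Winery D': 'Sonoma',
}

na_count = sum(county == 'NA' for county in with_counties_demo.values())
share = na_count / len(with_counties_demo)

print(f'{share:.2%} of the total wineries do not have a county listed')
```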


#### Let's see if this can be improved. The website [winery sage](https://www.winery-sage.com/) looks promising.
I'll feed `scraper_winery_sage.py` the `na_lst` just created, and see if it can pick up counties that the Google scraper couldn't.  
This scraper returns a new list of tuples, `with_counties_update`, and this can be used to update `with_counties_dict2`:

In [None]:
with shelve.open('wineries') as shelfFile:
    shelfFile['na_lst'] = na_lst

In [72]:
with shelve.open('wineries') as shelfFile:
    with_counties_update = shelfFile['with_counties_update']

In [73]:
# Turn with_counties_update into a dictionary
with_counties_update_dict = {}
for tup in with_counties_update:
    with_counties_update_dict[tup[0]] = tup[1] 

# Replace the ' ' values in with_counties_update_dict with 'NA'
for k,v in with_counties_update_dict.items():
    if v == ' ':
        with_counties_update_dict[k] = 'NA'

In [74]:
# Finally, update with_counties_dict2.
# Unlike the Google scraper, which returns the input winery names unchanged,
# winery-sage may return winery names in a different format than the na_lst input,
# so I have to do some text manipulation to align the scraper's winery names with the keys in our
# running dictionary, with_counties_dict2:

for winery, county in with_counties_dict2.items():
    # transform the winery-- remove hyphens, and change '&' to 'and'
    winery_mod = winery.replace('&', 'and')
    winery_mod = winery_mod.replace('-', ' ')
    if county == 'NA':
        try:
            for key, value in with_counties_update_dict.items():
                if winery_mod in key:
                    with_counties_dict2[winery] = value       
        except KeyError:
            continue
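
The matching step above can be isolated into two small functions, which makes it easier to test on a handful of names. This is only a sketch under stated assumptions: the function names and sample wineries are hypothetical, and unlike the cell above it also lowercases names, stops at the first match, and skips update entries that are themselves `'NA'`.

```python
def normalize(name):
    """Lowercase, swap '&' for 'and', turn hyphens into spaces, collapse whitespace."""
    name = name.lower().replace("&", "and").replace("-", " ")
    return " ".join(name.split())


def fill_from_update(current, update, missing="NA"):
    """Return a copy of `current` with missing counties filled from `update`."""
    normalized_update = {normalize(key): value for key, value in update.items()}
    filled = dict(current)
    for winery, county in current.items():
        if county != missing:
            continue
        target = normalize(winery)
        for key, value in normalized_update.items():
            # Substring match tolerates suffixes such as 'Winery' in the scraped name.
            if target in key and value != missing:
                filled[winery] = value
                break  # first match wins
    return filled


# Hypothetical sample data
current = {"Oak & Vine": "NA", "Blue-Heron Cellars": "NA", "Winery C": "Sonoma"}
update = {"Oak and Vine Winery": "Napa", "Blue Heron Cellars": "NA"}
result = fill_from_update(current, update)
print(result)
```

Substring matching is permissive, so a short winery name can match an unrelated longer one. Spot-checking a sample of the filled values is worthwhile.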

In [78]:
# okay, moment of truth.. how many NAs remain....
na_lst = []
for winery, county in with_counties_dict2.items():
    if county == 'NA':
        na_lst.append(winery)
print(f'{round((len(na_lst)/len(with_counties_dict2)),4)*100}% of the total wineries listed are still null')    

52.73% of the total wineries listed are still null


Great, that filled in counties for about 10 percentage points more of the wineries (from 62.79% missing down to 52.73%).  
Now I'll give the Google scraper another run with the remaining wineries lacking a county, but change up the search strategy a little by updating the search string.

In [None]:
# Update and reshelve na_lst
with shelve.open('wineries') as shelfFile:
    shelfFile['na_lst'] = na_lst

In [79]:
# Bring in the tuple list from the most recent scrape, with_counties3
with shelve.open('wineries') as shelfFile:
    with_counties3 = shelfFile['with_counties3']

In [80]:
# Turn with_counties3 into a dictionary
with_counties_update_dict3 = {}
for i in with_counties3:
    with_counties_update_dict3[i[0]] = i[1] 

In [81]:
# Update with_counties_dict2 yet again.
# The Google scraper returns the winery names unchanged, so an exact key lookup is enough here.
for winery, county in with_counties_dict2.items():
    if county == 'NA' and winery in with_counties_update_dict3:
        with_counties_dict2[winery] = with_counties_update_dict3[winery]

In [83]:
# Second moment of truth-- how many more counties did we fill in?
na_lst = []
for winery, county in with_counties_dict2.items():
    if county == 'NA':
        na_lst.append(winery)
print(f'{round((len(na_lst)/len(with_counties_dict2)),4)*100}% of the total wineries listed are still null') 

41.71% of the total wineries listed are still null


41.71%. I think this is the best we're gonna do for now. It could be that the majority of the remainders are wine blenders, as discussed above. 
Further efforts can be made to fill in the county info, but for now we'll roll with this.   
Next I'll perform a little more cleanup, and then shelve the final `with_counties_dict2` and use this to engineer the county column in the reviews dataframe.

In [88]:
# Trim the trailing ' County' off the county names
for winery, county in with_counties_dict2.items():
    county = county.split(' ')
    if county[-1] == 'County':
        with_counties_dict2[winery] = (' ').join(county[:-1])        
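
An equivalent way to strip the suffix is to test for it directly with `str.endswith`. This is a sketch with a hypothetical helper name, not the notebook's own code.

```python
def strip_county_suffix(name, suffix=" County"):
    """Remove a trailing ' County' if present; otherwise return the name unchanged."""
    return name[: -len(suffix)] if name.endswith(suffix) else name


for raw in ["Santa Clara County", "Napa", "San Luis Obispo County"]:
    print(repr(strip_county_suffix(raw)))
```

The check is case-sensitive, just like the cell above, so a value ending in lowercase `' county'` would pass through untouched.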

In [91]:
# Some of the county names we scraped are definitely not counties, so identify these errors and replace them with 'NA'
counties_set = set()
for winery, county in with_counties_dict2.items():
    counties_set.add(county)
print(counties_set)


{'Santa Cruz', 'Jackson', 'Yountville', 'Stanislaus', 'San Francisco', 'Riverside', 'Maricopa', 'Orange', 'Buncombe', 'Mariposa', 'Greenfield', 'Mendocino', 'Humboldt', 'Sonoma', 'Butte', 'Kern', 'Chelan', '93926', 'Sutter', 'Contra Costa', 'San Diego', 'Oakland', 'Mississauga', 'Fredericksburg', 'See results about', 'San Mateo', 'Tulare', 'Angwin', 'Lake', 'Santa Barbara', 'Ventura', 'Polk', 'Merriweather Coffee + Kitchen', 'Bassett Park', 'Spokane', 'San Benito', 'Modesto', 'San Luis Obispo', 'North County Joint Union School', 'Temecula', 'Coos', '95037', 'Sierra', 'Hollister', 'Lewis', 'West 4th Ave At North 12th St', 'Toms River', 'Marin', 'Murphys', 'Santa Clara', 'Point Richmond', 'Clearlake', 'King', 'Amador', 'El Dorado', 'Snohomish', 'Oceanside', 'Trinity', '95637', 'Jack London Village', 'Plumas', 'San Bernardino', 'Florence', 'Los Angeles', 'Newhall', 'Shasta', 'Napa', 'Lodi', 'Monterey', 'San Joaquin', 'Merced', 'Douglas', 'Chico', 'Siskiyou', 'Calaveras', '94573', 'Loudoun

In [98]:
not_counties = ['93926', 'See results about', 'Merriweather Coffee + Kitchen', 'North County Joint Union School', 
                'Coos', '95037', 'West 4th Ave At North 12th St', '95637', 'Meadowcroft wines', '94573', 'Walla Walla',
                'Robert Mondavi Winery', 'Cavaletti Vineyards', 'Westfield World Trade Center', 'United States','95441']
                
for winery, county in with_counties_dict2.items():
    if county in not_counties:
        with_counties_dict2[winery] = 'NA'
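
The list above removes known bad values one by one. An alternative is to keep only values found in a list of valid county names, which would also catch out-of-state places such as 'Spokane' or 'Mississauga' that appear in the printed set. The sketch below assumes a hand-built allow-list. The set shown is deliberately partial and the winery keys are placeholders, so a real run would need all 58 California counties.

```python
# Partial, illustrative allow-list (assumption): extend to all 58 California counties before real use.
KNOWN_COUNTIES = {"Napa", "Sonoma", "Mendocino", "Santa Clara", "San Luis Obispo", "Monterey"}


def clean_county(value, known=KNOWN_COUNTIES, missing="NA"):
    """Keep the value only if it is a recognized county name."""
    return value if value in known else missing


# Hypothetical scraped values
scraped = {"W1": "Napa", "W2": "95037", "W3": "See results about", "W4": "Sonoma"}
cleaned = {winery: clean_county(county) for winery, county in scraped.items()}
print(cleaned)
```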

In [100]:
# Shelve this dictionary to lock in the modifications. 
with shelve.open('wineries') as shelfFile:
    shelfFile['with_counties_dict2'] = with_counties_dict2

Finally, we can now add in the county column to the `reviews` dataframe!

In [103]:
def add_county(x): 
    return with_counties_dict2[x]

reviews.insert(loc = 14,
               column = 'county',
               value = reviews['winery'].apply(add_county))
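
The `apply` call above raises a `KeyError` if a winery is missing from the dictionary. A more forgiving variant uses `Series.map`, which returns `NaN` for unmatched keys. This is a sketch on a tiny made-up frame. 'Unknown Cellars' is a hypothetical winery that is absent from the lookup.

```python
import pandas as pd

sample_reviews = pd.DataFrame({"winery": ["Bianchi", "Mirassou", "Unknown Cellars"]})
sample_lookup = {"Bianchi": "San Luis Obispo", "Mirassou": "Santa Clara"}

# map() yields NaN for wineries missing from the lookup; fillna() converts those to 'NA'.
sample_reviews["county"] = sample_reviews["winery"].map(sample_lookup).fillna("NA")
print(sample_reviews)
```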

In [104]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,vintage,winery,county
10,US,"Soft, supple plum envelopes an oaky structure ...",Mountain Cuvée,87,19.0,California,Napa Valley,Napa,Virginie Boone,@vboone,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon,2011,Kirkland Signature,
12,US,"Slightly reduced, this wine offers a chalky, t...",,87,34.0,California,Alexander Valley,Sonoma,Virginie Boone,@vboone,Louis M. Martini 2012 Cabernet Sauvignon (Alex...,Cabernet Sauvignon,2012,Louis M. Martini,Napa
14,US,Building on 150 years and six generations of w...,,87,12.0,California,Central Coast,Central Coast,Matt Kettmann,@mattkettmann,Mirassou 2012 Chardonnay (Central Coast),Chardonnay,2012,Mirassou,Santa Clara
23,US,This wine from the Geneseo district offers aro...,Signature Selection,87,22.0,California,Paso Robles,Central Coast,Matt Kettmann,@mattkettmann,Bianchi 2011 Signature Selection Merlot (Paso ...,Merlot,2011,Bianchi,San Luis Obispo
25,US,Oak and earth intermingle around robust aromas...,King Ridge Vineyard,87,69.0,California,Sonoma Coast,Sonoma,Virginie Boone,@vboone,Castello di Amorosa 2011 King Ridge Vineyard P...,Pinot Noir,2011,Castello di Amorosa,Napa


In [114]:
round(len(reviews[reviews['county'] == 'NA'])/len(reviews.county),2)*100

19.0

While 41% of the unique wineries lacked a county name, only 19% of the rows in total lack a county name!  
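
The gap between the two percentages is what you would expect if the wineries with known counties account for a disproportionate share of the reviews. A toy illustration with made-up numbers:

```python
# Hypothetical data: one heavily reviewed winery with a county, two small ones without.
rows = ["Big Winery"] * 8 + ["Small A", "Small B"]
lookup = {"Big Winery": "Napa", "Small A": "NA", "Small B": "NA"}

# Share of unique wineries lacking a county
unique_share = sum(county == "NA" for county in lookup.values()) / len(lookup)

# Share of review rows lacking a county
row_share = sum(lookup[winery] == "NA" for winery in rows) / len(rows)

print(f"unique wineries missing: {unique_share:.0%}, rows missing: {row_share:.0%}")
```
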
Let's investigate how complete the rest of the data is for the other features:

In [115]:
reviews.isnull().sum()

country                      0
description                  0
designation              18180
points                       0
price                      334
province                     0
region_1                     2
region_2                  2118
taster_name              15756
taster_twitter_handle    15756
title                        0
variety                      0
vintage                      0
winery                       0
county                       0
dtype: int64

`designation`, `region_1`, `region_2`, and the taster name and Twitter handle are not features we are concerned with, so I'll drop those columns. As for the price nulls, 334 rows are well under 1% of the roughly 56,000 rows, so I'm okay with dropping these rows.  

In [116]:
reviews.drop(['designation', 'region_1', 'region_2', 'taster_name', 'taster_twitter_handle'], axis=1, inplace=True)

In [121]:
reviews = reviews.dropna()
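
After the column drop, `price` is the only column with nulls, so a bare `dropna()` removes just those 334 rows. Passing `subset` makes that intent explicit and protects against accidentally dropping rows if a nullable column is added later. Here is a small sketch on made-up data:

```python
import pandas as pd

sample = pd.DataFrame({
    "points": [87, 90, 88],
    "price": [19.0, float("nan"), 34.0],
    "designation": [None, "Reserve", "Estate"],
})

# Drop rows only when price is missing; nulls in other columns are left alone.
priced = sample.dropna(subset=["price"])
print(len(priced), "rows kept;", len(sample.dropna()), "would remain with a bare dropna()")
```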

In [122]:
reviews.isnull().sum()

country        0
description    0
points         0
price          0
province       0
title          0
variety        0
vintage        0
winery         0
county         0
dtype: int64

Nice. 

Now with this dataframe cleaned up, we're going to do two things with it. We'll keep `reviews` as is, with all the California information currently there, to do an NLP analysis on flavor words. 

We'll also make a version that can be combined with the climate and production datasets, so the data here can be incorporated into a climate-oriented time series analysis. This will mean dropping the rows with counties that cannot be tied to counties featured in the climate data, and aggregating the remaining rows by county and year. I'll set these up in the following section.
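
The filter-then-aggregate step might look roughly like the sketch below. It rests on assumptions: `vintage` is treated as the year, `climate_counties` stands in for whatever county list the climate data provides, the aggregates chosen (mean points, mean price, review count) are only examples, and the rows are made up.

```python
import pandas as pd

sample = pd.DataFrame({
    "county": ["Napa", "Napa", "Sonoma", "NA"],
    "vintage": [2011, 2011, 2012, 2011],
    "points": [87, 91, 90, 85],
    "price": [19.0, 69.0, 34.0, 12.0],
})

climate_counties = {"Napa", "Sonoma"}  # hypothetical stand-in for the climate data's counties

# Keep only rows whose county can be tied to the climate data.
matched = sample[sample["county"].isin(climate_counties)]

# One row per county and year, with a few summary features.
by_county_year = matched.groupby(["county", "vintage"], as_index=False).agg(
    mean_points=("points", "mean"),
    mean_price=("price", "mean"),
    n_reviews=("points", "size"),
)
print(by_county_year)
```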

In [107]:
# remove the spaces in the county column so it matches up with the other dataframes
reviews['county'] = reviews['county'].str.replace(' ', '', regex=False)

In [70]:
# About 2500 rows out of 56,000 here turned out to be duplicates. So we'll drop these:
reviews = reviews.drop_duplicates()

---

<a id="next"></a>
### 5. Next Steps

This notebook has yielded three dataframes:
- `cwp` - California Wine Production 
- `all_county_climates` - Climate data
- `reviews` - Wine reviews

I'll shelve these and carry them over into the second notebook where I'll combine these dataframes for further exploratory analysis and modeling. The goals there will be:
- To investigate trends between climate, wine production, and wine quality in California over the past 40 years.
- To reveal correlations between the features collected in this notebook.
- To see if the data has any predictive value when models are applied to it, including predictions for production volume and for which factors make for a highly rated wine.

In [101]:
# Don't run this cell again unless you're sure of any updates made!

# with shelve.open('capstone_dataframes') as shelfFile:
#     shelfFile['cwp'] = cwp
#     shelfFile['all_county_climates'] = all_county_climates
#     shelfFile['reviews'] = reviews

---