In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, sigmoid_kernel 

# Building a Wine Recommendation System

Creating a content-based recommendation system by applying NLP modeling to sommelier reviews.

In [2]:
df = pd.read_csv('../../Data/wine_data.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150930 entries, 0 to 150929
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   150930 non-null  int64  
 1   country      150925 non-null  object 
 2   description  150930 non-null  object 
 3   designation  105195 non-null  object 
 4   points       150930 non-null  int64  
 5   price        137235 non-null  float64
 6   province     150925 non-null  object 
 7   region_1     125870 non-null  object 
 8   region_2     60953 non-null   object 
 9   variety      150930 non-null  object 
 10  winery       150930 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 12.7+ MB


In [4]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
126310,126310,Chile,"Aromas of tree bark, vanilla and herbs mix wit...",Reserva Maitén Single Vineyard,87,13.0,Aconcagua Valley,,,Cabernet Sauvignon,Agustinos
121303,121303,Spain,"Quite sweaty, oily and hard, almost like an un...",,80,17.0,Galicia,Ribeiro,,White Blend,Coto de Gomariz
136970,136970,France,"Toasty, floral and full-bodied, with a soft af...",Blanc de Blancs Brut,89,65.0,Champagne,Champagne,,Chardonnay,Ruinart
69436,69436,France,Made from purchased grapes (Jaboulet finally a...,Les Jumelles,89,80.0,Rhône Valley,Côte Rôtie,,Syrah,Paul Jaboulet Aîné
104207,104207,US,Quilceda Creek's Red Mountain Cabernet Sauvign...,Galitzine Vineyard,96,110.0,Washington,Red Mountain,Columbia Valley,Cabernet Sauvignon,Quilceda Creek
10916,10916,US,This ripe Chardonnay is a successful example o...,,88,22.0,California,Central Coast,Central Coast,Chardonnay,Harmony Cellars
119488,119488,US,"Big and powerful, but nuanced despite its size...",Gap's Crown Vineyard,95,52.0,California,Sonoma Coast,Sonoma,Pinot Noir,Fulcrum
145092,145092,US,"Pours dark, and opens with aromas of cassis an...",,91,46.0,California,Rutherford,Napa,Cabernet Sauvignon,Sawyer Cellars
67329,67329,Argentina,"Big and bulky, with a mixed bag of aromas that...",Reserve,85,23.0,Mendoza Province,Tupungato,,Cabernet Sauvignon,Andeluna
63315,63315,US,"A very nice, likeable and super-drinkable Sauv...",,86,12.0,California,Paso Robles,Central Coast,Sauvignon Blanc,Villa San Juliette


In [5]:
df.drop('Unnamed: 0', axis=1, inplace=True)

In [6]:
df.drop('region_2', axis=1, inplace=True)

In [7]:
df.sample(10)

Unnamed: 0,country,description,designation,points,price,province,region_1,variety,winery
18763,Italy,Vicario is a light and luminous Soave Classico...,Vicario,85,,Veneto,Soave Classico,Garganega,Cantina di Monteforte
146724,US,"Some sweetness in the form of kirsch and Port,...",Holbrook Mitchell Trio,90,30.0,California,Napa Valley,Cabernet Sauvignon,Rosenblum
34054,France,"While Pernand-Vergelesses, which is located in...",,90,,Burgundy,Pernand-Vergelesses,Chardonnay,Olivier Leflaive
50629,US,"Dry and tannic, a wine that could soften and m...",Sy-Cab,87,35.0,California,Paso Robles,Cabernet Sauvignon-Syrah,Pipestone Vineyards
113186,US,Hat Trick is the best of the best of Morgan's ...,Double L Vineyard Hat Trick,95,65.0,California,Santa Lucia Highlands,Chardonnay,Morgan
96768,Spain,"Eight grapes comprise this weird, peanutty, al...",Las Ocho,80,25.0,Levante,Utiel-Requena,Red Blend,Chozas Carrascal
132828,Portugal,Coming from the Alvarinho region of Monção in ...,Portal do Fidalgo,90,16.0,Vinho Verde,,Alvarinho,Provam
86418,France,"Bold brush strokes characterize this wine, the...",Clos de la Mousse Premier Cru,91,58.0,Burgundy,Beaune,Pinot Noir,Bouchard Père & Fils
140088,Italy,This wine hits all the right buttons: Its pric...,Dipinti,87,11.0,Northeastern Italy,Trentino,Riesling,La Vis
145165,US,"This is a masculine Pinot, deep in color and m...",La Colline,92,60.0,California,Arroyo Grande Valley,Pinot Noir,Laetitia


In [8]:
predictors = df[['country', 'description', 'designation', 'province', 'region_1', 'variety', 'winery']]

In [9]:
predictors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150930 entries, 0 to 150929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   country      150925 non-null  object
 1   description  150930 non-null  object
 2   designation  105195 non-null  object
 3   province     150925 non-null  object
 4   region_1     125870 non-null  object
 5   variety      150930 non-null  object
 6   winery       150930 non-null  object
dtypes: object(7)
memory usage: 8.1+ MB


## Missing Data

For the first iteration of this recommender system, I will drop every observation that has a missing value in any column instead of being more selective. This cuts the available data from 150,930 rows to 85,614, a loss of roughly 43%. Later iterations could try modeling with fewer features but more observations.
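As a rough illustration of the more selective option mentioned above, the sketch below compares an across-the-board `dropna()` with one restricted to a subset of columns. The toy rows and the choice of `required` columns are assumptions made for the example, not the notebook's actual data or final feature set.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps similar in kind to the wine data (hypothetical rows).
toy = pd.DataFrame({
    "description": ["ripe cherry", "crisp lemon", "toasty oak"],
    "designation": ["Reserve", np.nan, "Estate"],
    "region_1": ["Napa Valley", "Rioja", np.nan],
    "winery": ["Winery A", "Winery B", "Winery C"],
})

# Dropping across the board keeps only fully populated rows.
strict = toy.dropna()

# Dropping only where the columns needed downstream are missing keeps more rows.
required = ["description", "designation", "winery"]  # assumed minimal set
selective = toy.dropna(subset=required)

print(len(strict), len(selective))
```

On real data the gain depends on which columns the model actually needs; here only `description`, `winery` and `designation` feed the vectorizer and the wine name.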

In [10]:
predictors = predictors.dropna()


In [11]:
predictors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 85614 entries, 0 to 150928
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   country      85614 non-null  object
 1   description  85614 non-null  object
 2   designation  85614 non-null  object
 3   province     85614 non-null  object
 4   region_1     85614 non-null  object
 5   variety      85614 non-null  object
 6   winery       85614 non-null  object
dtypes: object(7)
memory usage: 5.2+ MB


In [12]:
predictors = predictors.reset_index()

In [13]:
predictors.description.duplicated().value_counts()

False    55449
True     30165
Name: description, dtype: int64

In [14]:
predictors.duplicated().value_counts()

False    85614
dtype: int64

In [15]:
predictors[predictors.description.duplicated()]

Unnamed: 0,index,country,description,designation,province,region_1,variety,winery
203,300,US,This standout Rocks District wine brings earth...,The Funk Estate,Washington,Walla Walla Valley (WA),Syrah,Saviah
289,423,US,"The aromas on this wine are quite light, conve...",Weinbau,Washington,Wahluke Slope,Grenache,Sol Stone
290,424,Spain,"A mix of smoke and toast blends with fresh, cr...",Yá Cuvée 23 Brut Rosé,Catalonia,Cava,Sparkling Blend,Sumarroca
316,480,US,Made from what Californians call the Pommard c...,Charles Vineyard Clone O5,California,Anderson Valley,Pinot Noir,Foursight
520,810,Italy,Here's a lively Moscato made in a dry style th...,Bianco Dry,Sicily & Sardinia,Noto,Moscato,Planeta
...,...,...,...,...,...,...,...,...
85609,150923,France,"Rich and toasty, with tiny bubbles. The bouque...",Demi-Sec,Champagne,Champagne,Champagne Blend,Jacquart
85610,150924,France,"Really fine for a low-acid vintage, there's an...",Diamant Bleu,Champagne,Champagne,Champagne Blend,Heidsieck & Co Monopole
85611,150926,France,"Offers an intriguing nose with ginger, lime an...",Cuvée Prestige,Champagne,Champagne,Champagne Blend,H.Germain
85612,150927,Italy,This classic example comes from a cru vineyard...,Terre di Dora,Southern Italy,Fiano di Avellino,White Blend,Terredora


In [16]:
predictors.sample(10)

Unnamed: 0,index,country,description,designation,province,region_1,variety,winery
23369,40110,US,Petite Sirah has a long history in Livermore V...,Del Arroyo Vineyard,California,Livermore Valley,Petite Sirah,Occasio
44691,77730,Italy,This creamy and well-textured blend of Insolia...,Angimbé,Sicily & Sardinia,Sicilia,White Blend,Cusumano
33776,57804,Italy,What sets this beautiful Riserva apart from th...,Riserva,Tuscany,Chianti Classico,Sangiovese,Felsina
36089,61913,Australia,Scents of black tea and cassis mark the bouque...,Forever Red,Australia Other,Central Ranges,Shiraz-Cabernet Sauvignon,Five Friends
78779,137496,Italy,Ruggeri & C. is one of Prosecco's most importa...,Giall'Oro Extra Dry,Veneto,Conegliano Valdobbiadene Prosecco Superiore,Glera,Ruggeri & C.
69281,120940,Spain,"Neutral smelling, with a hint of pool water to...",Sauvignon Blanc - Macabeo,Central Spain,Vino de la Tierra de Castilla,White Blend,Spanish Vines
15313,26213,US,"Sweet and soft, with ripe berry flavors that v...",Chateaux du Lovall,California,Napa County,Red Blend,Robert James Lynch
56733,98944,US,Ed Sbragia is making some of the most interest...,Cimarossa Vineyard,California,Howell Mountain,Cabernet Sauvignon,Sbragia
78549,137080,US,"A tough, gritty, drily astringent wine with fl...",Scoprire,California,California,Red Blend,Millésimé
18623,31608,US,Dried cherry and blackberry aromas abound on t...,Eclipse,New York,New York,Bordeaux-style Red Blend,Heron Hill


## Feature Engineering & Unique Names for Wines

In [17]:
## Creating a more detailed name for each wine by combining Winery and Designation

predictors['name'] = predictors['winery'] + ', ' + predictors['designation']

In [18]:
predictors.drop('index', axis=1, inplace=True)

In [19]:
predictors.sample(10)

Unnamed: 0,country,description,designation,province,region_1,variety,winery,name
42204,Australia,"This wine represents an excellent value, and o...",George Wyndham Founder's Reserve,South Australia,Langhorne Creek,Shiraz,Wyndham Estate,"Wyndham Estate, George Wyndham Founder's Reserve"
17337,Spain,"Narrow aromas of red currant, raspberry and re...",Reserva,Northern Spain,Navarra,Red Blend,Príncipe de Viana,"Príncipe de Viana, Reserva"
13119,US,"Bee pollen, honey mead and lemon polish mingle...",Boushey Vineyard,Washington,Yakima Valley,Marsanne,Darby,"Darby, Boushey Vineyard"
39439,US,Concannon's Merlots have been variable. This l...,Reserve,California,Livermore Valley,Merlot,Concannon,"Concannon, Reserve"
55473,US,"Delicious, but kind of heavy for a Pinot Noir....",Lone Tree Vineyard,California,Carneros,Pinot Noir,Acacia,"Acacia, Lone Tree Vineyard"
81281,Spain,With this wine the Egurens of Rioja fame have ...,Termanthia,Northern Spain,Toro,Tinta de Toro,Numanthia,"Numanthia, Termanthia"
65247,US,"With only 72% Zin, they can't call it by the v...",The Imposter,California,California,Red Blend,JC Cellars,"JC Cellars, The Imposter"
52984,Italy,This opens with a meaty aroma of cured beef an...,San Giovanni Novantasette Riserva,Tuscany,Chianti Colli Fiorentini,Sangiovese,Fattoria San Michele a Torri,"Fattoria San Michele a Torri, San Giovanni Nov..."
54174,US,"Way too strong in feline emissions, that disag...",Estate Grown,California,Napa Valley,Sauvignon Blanc,St. Supéry,"St. Supéry, Estate Grown"
66863,France,"With 24 acres in this premier cru vineyard, La...",Vaudevey Premier Cru,Burgundy,Chablis,Chardonnay,Domaine Laroche,"Domaine Laroche, Vaudevey Premier Cru"


In [20]:
## Leaving this in for future ideas

## Creating a UID for each wine by combining all data into one variable

#predictors['uid'] = predictors['winery'] + ', ' + predictors['designation'] + ', ' + predictors['country'] + ', ' + predictors['description'] + ', ' + predictors['province'] + ', ' + predictors['region_1'] + ', ' + predictors['variety']

### Removing Duplicate Values

In [21]:
predictors.drop_duplicates(inplace=True)

In [22]:
predictors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 55461 entries, 0 to 85141
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   country      55461 non-null  object
 1   description  55461 non-null  object
 2   designation  55461 non-null  object
 3   province     55461 non-null  object
 4   region_1     55461 non-null  object
 5   variety      55461 non-null  object
 6   winery       55461 non-null  object
 7   name         55461 non-null  object
dtypes: object(8)
memory usage: 3.8+ MB


In [23]:
predictors.country.value_counts()

US           26157
Italy        10537
France        9761
Spain         4317
Argentina     2428
Australia     2147
Canada         114
Name: country, dtype: int64

In [24]:
predictors.name.duplicated().value_counts()

False    33549
True     21912
Name: name, dtype: int64

For the end user to receive recommendations with this model and approach, they need to enter a unique name for a wine they like. With so many duplicated names this becomes tricky. For now I will drop wines with duplicated names, which significantly reduces the volume of data (from 55,461 rows to 33,549) but solves the uniqueness issue. There is almost certainly a better way around this!
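One possible alternative to dropping rows, sketched under assumptions: keep every wine and make colliding names unique by appending the variety, with a running counter as a fallback. The rows below are invented for illustration, and whether variety is the right disambiguator for the real data is untested.

```python
import pandas as pd

# Hypothetical rows: two wines share the same winery and designation.
toy = pd.DataFrame({
    "winery": ["Winery A", "Winery A", "Winery B"],
    "designation": ["Reserve", "Reserve", "Estate Grown"],
    "variety": ["Merlot", "Petite Sirah", "Pinot Noir"],
})
toy["name"] = toy["winery"] + ", " + toy["designation"]

# Append the variety only where the short name collides.
collides = toy["name"].duplicated(keep=False)
toy["name"] = toy["name"].where(~collides, toy["name"] + " (" + toy["variety"] + ")")

# Names that still collide get a running counter as a last resort.
still = toy["name"].duplicated(keep=False)
counter = (toy.groupby("name").cumcount() + 1).astype(str)
toy["name"] = toy["name"].where(~still, toy["name"] + " #" + counter)

print(toy["name"].tolist())
```

This keeps all rows at the cost of longer names, so the end user would need a search box or dropdown rather than free-text entry.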

In [25]:
predictors.drop_duplicates(subset='name', keep='last', inplace=True)

In [26]:
predictors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33549 entries, 1 to 85141
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   country      33549 non-null  object
 1   description  33549 non-null  object
 2   designation  33549 non-null  object
 3   province     33549 non-null  object
 4   region_1     33549 non-null  object
 5   variety      33549 non-null  object
 6   winery       33549 non-null  object
 7   name         33549 non-null  object
dtypes: object(8)
memory usage: 2.3+ MB


In [27]:
predictors.reset_index(inplace=True)

In [28]:
predictors.drop('index', axis=1, inplace=True)

## Vectorizing With TF-IDF

In [29]:
vectors = TfidfVectorizer(min_df = 3,
                         max_features = None,
                         strip_accents = 'unicode',
                         analyzer = 'word',
                         token_pattern = r'\w{2,}',
                         ngram_range = (1,3),
                         stop_words = 'english')

In [30]:
vectors_matrix = vectors.fit_transform(predictors['description'])

In [31]:
vectors_matrix.shape

(33549, 73445)

## Calculating Similarity

In [32]:
sig_kern = sigmoid_kernel(vectors_matrix, vectors_matrix)

In [33]:
sig_kern

array([[0.76159987, 0.76159421, 0.76159418, ..., 0.76159416, 0.76159422,
        0.76159418],
       [0.76159421, 0.76159987, 0.76159418, ..., 0.76159417, 0.76159417,
        0.76159417],
       [0.76159418, 0.76159418, 0.76159987, ..., 0.76159418, 0.76159418,
        0.76159418],
       ...,
       [0.76159416, 0.76159417, 0.76159418, ..., 0.76159987, 0.76159417,
        0.76159423],
       [0.76159422, 0.76159417, 0.76159418, ..., 0.76159417, 0.76159987,
        0.76159419],
       [0.76159418, 0.76159417, 0.76159418, ..., 0.76159423, 0.76159419,
        0.76159987]])

In [35]:
sig_kern.shape

(33549, 33549)
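Every entry above sits near 0.7616 because scikit-learn's `sigmoid_kernel` computes tanh(gamma * <x, y> + coef0) with defaults gamma = 1 / n_features and coef0 = 1. With 73,445 features the dot product barely moves the result away from tanh(1) ≈ 0.7616. Since tanh is monotonic, the ranking should match what `linear_kernel` gives. A small check on a toy corpus (the sentences are invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, sigmoid_kernel

# Hypothetical descriptions: two red-fruit wines, two citrus wines.
docs = [
    "ripe cherry and plum with soft tannins",
    "ripe cherry, plum and vanilla oak",
    "crisp lemon and green apple acidity",
    "lemon zest, apple and mineral notes",
]
X = TfidfVectorizer().fit_transform(docs)

sig = sigmoid_kernel(X, X)   # tanh(gamma * <x, y> + coef0), default gamma and coef0
lin = linear_kernel(X, X)    # plain dot products (cosine, as TF-IDF rows are L2-normalized)

# Reproduce the sigmoid kernel by hand from the linear kernel.
by_hand = np.tanh(lin / X.shape[1] + 1.0)
matches_formula = np.allclose(sig, by_hand)

# Nearest neighbour of each document, ignoring the document itself.
sig_nn = np.argmax(sig - np.eye(len(docs)) * 10, axis=1)
lin_nn = np.argmax(lin - np.eye(len(docs)) * 10, axis=1)

print(matches_formula, sig_nn.tolist(), lin_nn.tolist())
```

If the rankings agree on the real data as well, `linear_kernel` would give scores that are easier to read, since they spread over [0, 1] instead of clustering in the sixth decimal place.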

In [34]:
index = pd.Series(predictors.index, index=predictors['name']).drop_duplicates()

In [39]:
name = 'Concannon, Reserve'
indx = index[name]
sigmoid_score = list(enumerate(sig_kern[indx]))
sigmoid_score

[(0, 0.7615941559557649),
 (1, 0.7615941559557649),
 (2, 0.7615941559557649),
 (3, 0.7615941734445322),
 (4, 0.7615941559557649),
 (5, 0.761594175897317),
 (6, 0.7615941713140068),
 (7, 0.7615941559557649),
 (8, 0.7615941784241632),
 (9, 0.761594194725326),
 (10, 0.7615941559557649),
 (11, 0.7615941744177317),
 (12, 0.7615941559557649),
 (13, 0.7615941559557649),
 (14, 0.7615941559557649),
 (15, 0.7615941886950643),
 (16, 0.7615941559557649),
 (17, 0.7615941559557649),
 (18, 0.7615941559557649),
 (19, 0.7615941559557649),
 (20, 0.7615941559557649),
 (21, 0.7615941559557649),
 (22, 0.7615941559557649),
 (23, 0.7615941559557649),
 (24, 0.7615941769666736),
 (25, 0.7615941559557649),
 (26, 0.7615941559557649),
 (27, 0.7615941559557649),
 (28, 0.7615941765639268),
 (29, 0.7615943037101537),
 (30, 0.7615941559557649),
 (31, 0.7615941559557649),
 (32, 0.7615941559557649),
 (33, 0.7615941559557649),
 (34, 0.7615941758218441),
 (35, 0.7615941783248212),
 (36, 0.7615942166600282),
 (37, 0.76159

In [40]:
name = 'Concannon, Reserve'
indx = index[name]
sigmoid_score = list(enumerate(sig_kern[indx]))
sigmoid_score = sorted(sigmoid_score, key = lambda x:x[1], reverse = True)
sigmoid_score

[(33263, 0.7615998741120271),
 (31880, 0.7615959208591037),
 (13734, 0.7615957995284612),
 (18378, 0.761595398370176),
 (11384, 0.7615952720533775),
 (16012, 0.7615952381478818),
 (29381, 0.7615951819372025),
 (4088, 0.7615951364429353),
 (16239, 0.7615951108672043),
 (28736, 0.7615950810489946),
 (32086, 0.761595041695919),
 (5100, 0.7615950415127648),
 (9361, 0.7615950255723922),
 (24125, 0.7615949994726887),
 (16832, 0.761594993482037),
 (30416, 0.7615949502952686),
 (13735, 0.7615949343139382),
 (12580, 0.7615949153197457),
 (32318, 0.7615949065139387),
 (32314, 0.761594904634078),
 (16058, 0.7615948883469006),
 (18645, 0.7615948826313386),
 (17338, 0.7615948715322283),
 (13621, 0.7615948683256646),
 (13983, 0.7615948553305649),
 (6737, 0.7615948458336239),
 (18630, 0.7615948441704374),
 (32396, 0.7615948416490055),
 (18609, 0.7615948352196182),
 (14537, 0.7615948325672195),
 (33208, 0.7615948237266952),
 (5301, 0.7615948148262034),
 (8298, 0.761594782998013),
 (29694, 0.7615947810

The data the model needs is written to separate CSVs below so that the Streamlit script can use it.

In [35]:
index.to_csv('sig_wines.csv')

In [41]:
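# Note: sigmoid_score holds the sorted scores for 'Concannon, Reserve' only, not the full kernel matrix
np.savetxt("sig_score.csv", sigmoid_score, delimiter=",", fmt='%s')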

In [42]:
predictors.to_csv('predictors.csv')

## The Recommender

In [92]:
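def recommend_wine(name, sig_kern=sig_kern):
    """Return the three wines most similar to `name` by sigmoid-kernel score."""
    indx = index[name]  # row position of the requested wine
    sigmoid_score = list(enumerate(sig_kern[indx]))
    # Sort by similarity, highest first
    sigmoid_score = sorted(sigmoid_score, key=lambda x: x[1], reverse=True)
    # Skip the top entry (expected to be the wine itself) and keep the next three
    sigmoid_score = sigmoid_score[1:4]
    position = [i[0] for i in sigmoid_score]
    return predictors.iloc[position]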

In [99]:
recommend_wine('Castelli del Grevepesa, Riserva Castelgreve')

Unnamed: 0,country,description,designation,province,region_1,variety,winery,name
19711,Italy,"Made with Sangiovese, this shows ripe berry an...",Terra Rossa Riserva,Tuscany,Chianti Colli Senesi,Sangiovese,Tenuta di Trecciano,"Tenuta di Trecciano, Terra Rossa Riserva"
30485,Argentina,This wine shows molasses and Boston baked bean...,Paris Goulart Reserva,Mendoza Province,Mendoza,Malbec-Cabernet Sauvignon,Bodega Goulart,"Bodega Goulart, Paris Goulart Reserva"
22443,Italy,"From the Classico zone of Amarone, this shows ...",Corte Vaona,Veneto,Amarone della Valpolicella Classico,"Corvina, Rondinella, Molinara",Novaia,"Novaia, Corte Vaona"
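The recommender above relies on a precomputed 33,549 x 33,549 kernel matrix. A lighter variant, sketched here under assumptions, scores one wine against the TF-IDF matrix on demand instead. The function name `recommend_on_demand`, the toy frame and the use of `linear_kernel` in place of `sigmoid_kernel` are all choices made for this illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for the notebook's `predictors` frame (hypothetical data).
wines = pd.DataFrame({
    "name": ["Alpha, Reserve", "Beta, Estate", "Gamma, Brut", "Delta, Cuvee"],
    "description": [
        "ripe cherry and plum with soft tannins",
        "ripe cherry, plum and vanilla oak",
        "crisp lemon and green apple acidity",
        "lemon zest, green apple and mineral notes",
    ],
})

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(wines["description"])
name_to_row = pd.Series(wines.index, index=wines["name"])


def recommend_on_demand(name, top_n=3):
    """Return the top_n most similar wines, scoring a single row at a time."""
    row = name_to_row[name]
    # One row against all rows: a 1 x n_wines result instead of n_wines x n_wines.
    scores = linear_kernel(matrix[row], matrix).ravel()
    ranked = scores.argsort()[::-1]
    ranked = ranked[ranked != row][:top_n]  # drop the query wine itself
    return wines.iloc[ranked]


result = recommend_on_demand("Alpha, Reserve", top_n=1)
print(result["name"].tolist())
```

With this approach only the fitted vectorizer output and the name lookup need to be stored, and the query wine is excluded explicitly rather than by assuming it sorts first.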
