# Introduction:
In this notebook we use scikit-learn's TfidfVectorizer to convert the text in the wine description column into numeric vectors, which we can then use to measure how similar a given wine is to the other wines in the gathered dataset.
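
As a minimal sketch of the idea, the example below vectorizes three made-up descriptions (they are assumptions for illustration, not rows from the dataset) and compares them with cosine similarity. Identical descriptions score 1.0, while descriptions that share only a few words score much lower.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy descriptions, not taken from the wine dataset
descriptions = [
    "rich red wine with dark cherry and oak",
    "rich red wine with dark cherry and oak",
    "crisp white wine with citrus and green apple",
]

# Turn each description into a TF-IDF vector, ignoring English stop words
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(descriptions)

# Pairwise cosine similarity: entry [i, j] compares description i with description j
similarities = cosine_similarity(tfidf_matrix)

print(similarities.round(2))
```

The first two descriptions are identical, so their similarity is 1.0; the third shares only the word "wine" with them, so its similarity to the others is small but not zero.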



In [2]:
## Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

pd.options.mode.chained_assignment = None 

## Import TFIDF libraries
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
# Read CSV file
wine_df = pd.read_csv('C:/Users/ADMIN/Desktop/DS Projects/My Vivino/My Vivino Project/new_wine.csv', index_col = 0) 

In [5]:
# Keep only the wines that have a description, since the recommendation system is built on the description text
wine_df = wine_df[wine_df['wine description'].notna()]

# Reset the index after dropping the rows without a description;
# drop=True discards the old index instead of adding it as an 'index' column
wine_df = wine_df.reset_index(drop=True)

wine_df

Unnamed: 0,wine name,winery,wine year,wine rating,wine price,wine type,wine region,wine country,grape information,wine acidity,wine intensity,wine sweetness,wine tannin,wine description,wine price log
0,Cabernet Sauvignon,Carta Vieja,2019.0,3.4,4.99,Red,Loncomilla Valley,Chile,Cabernet Sauvignon,3.043747,3.781781,1.772746,3.164342,Cabernet Sauvignon is the most widely grown gr...,1.607436
1,Merlot,Carta Vieja,2019.0,3.4,4.99,Red,Loncomilla Valley,Chile,Merlot,2.020424,3.482722,1.949938,2.366192,Merlot is a staple of the wine producing regio...,1.607436
2,Cabernet Sauvignon,Three Wishes,,3.1,4.99,Red,California,United States,Cabernet Sauvignon,3.206160,4.545627,1.962006,3.569953,"Known as the king of red wine grapes, Cabernet...",1.607436
3,Cabernet Sauvignon,Crane Lake,2016.0,3.4,4.99,Red,California,United States,Cabernet Sauvignon,3.014199,4.738935,1.743180,3.540428,"Known as the king of red wine grapes, Cabernet...",1.607436
4,Pinot Noir,Crane Lake,2016.0,3.4,4.99,Red,California,United States,Pinot Noir,3.405433,2.832203,1.500866,2.147871,Pinot Noir has the well deserved reputation of...,1.607436
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12791,Bin 27 Reserve Port,Fonseca,,3.8,13.00,Fortified,Porto,Portugal,Touriga Nacional,3.023872,4.767150,4.843369,,Produced exclusively in the Douro Valley of No...,2.564949
12792,Terra Bella Reserve Porto,Fonseca,,3.8,22.99,Fortified,Porto,Portugal,Touriga Nacional,2.979524,4.617777,4.679206,,Produced exclusively in the Douro Valley of No...,3.135059
12793,Malamado Malbec,Zuccardi,,3.7,25.99,Fortified,Mendoza,Argentina,Malbec,2.502118,3.910410,4.366497,2.316115,"Extremely popular, Argentinian Malbec is an in...",3.257712
12794,Malamado Malbec,Zuccardi,2012.0,3.7,19.99,Fortified,Mendoza,Argentina,Malbec,2.502118,3.910410,4.366497,2.316115,"Extremely popular, Argentinian Malbec is an in...",2.995232


I picked the five wines below to test the recommendation system. You do not have to list exactly five wines; feel free to enter different wines based on your own preferences.


In [6]:
list_of_wines  = ['Barrancas Vineyards Alta Syrah',
                  'Egri Bikavér Bulls Blood',
                  'Valle Escondido Merlot',
                  'Cuvée Tradition Brut Rosé',
                  'Rot']
wines_index = []

# Collect the index of the first row matching each picked wine name
for wine_name in list_of_wines:
    wines_index.append(wine_df[wine_df['wine name'] == wine_name].index[0])

# Create a new column called "user's choice", set to 0 everywhere except for the wines selected above
wine_df["user's choice"] = 0

for index in wines_index:
    wine_df.loc[index, "user's choice"] = 1
wine_df[wine_df["user's choice"] == 1]


# Replace the zeros with NaN so that only the selected wines carry a value
wine_df["user's choice"] = wine_df["user's choice"].apply(lambda x: np.nan if x == 0 else x)
wine_df["user's choice"].value_counts()
    

1.0    5
Name: user's choice, dtype: int64
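
The flagging step above can also be written without explicit loops. The sketch below is one possible alternative on a hypothetical four-row frame (toy_df and picked are made-up names); it keeps the first row for each picked name, mirroring the `.index[0]` lookup, and writes 1.0 or NaN directly.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; only the column names follow the notebook
toy_df = pd.DataFrame({"wine name": ["Rot", "Pitti", "Rot", "Syrah"]})
picked = ["Rot", "Syrah"]

# Index labels of the first row for each picked name
first_match = (
    toy_df[toy_df["wine name"].isin(picked)]
    .drop_duplicates("wine name")
    .index
)

# 1.0 for the picked wines, NaN everywhere else
toy_df["user's choice"] = np.nan
toy_df.loc[first_match, "user's choice"] = 1.0

print(toy_df)
```

Note that a plain `isin` without `drop_duplicates` would flag every vintage that shares a name, which is a different behaviour from the loop in the cell above.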

In [7]:
def tfidf_recommendation(df):
    
    """
    Takes in a data frame with wine descriptions, converts the descriptions into TF-IDF vectors, determines
    the similarity of each selected wine's description to all other descriptions using cosine similarity,
    and returns a data frame with the top similarity scores for each of the selected wines.
    """
    
    # Extract wine description column
    
    all_descriptions = df[['wine description']]
    
    # Initialize a TFIDF Vectorizer model to work with the text data
    
    tf = TfidfVectorizer(analyzer='word',
                     min_df=1,  # keep every term; recent scikit-learn versions reject the integer 0 here
                     stop_words='english')

    # Use the initiated TFIDF model to transform the data in descriptions
    
    tfidf_matrix = tf.fit_transform(all_descriptions['wine description'])
    
    # Compute the cosine similarities between the items in the newly transformed TFIDF matrix
    cosine_similarities = cosine_similarity(tfidf_matrix,tfidf_matrix)
    
    # Reset the index of the data frame to be able to iterate through the index with our smaller data frame
    
    user_indeces = df.reset_index()
    user_indeces = user_indeces[user_indeces["user's choice"] == 1]

    # Initialize a dictionary to store the results
    
    results = {} 

    # Iterate through the user's chosen wines
    
    for idx, row in user_indeces.iterrows():

        # argsort()[:-5:-1] returns the 4 highest-scoring positions, best first
        similar_indices = cosine_similarities[idx].argsort()[:-5:-1]
        # Pair each cosine similarity score with the wine's original index
        similar_items = [(cosine_similarities[idx][i], df.reset_index()['index'][i]) for i in similar_indices]
        # Keep everything after the first entry (normally the wine itself), leaving 3 recommendations per wine
        results[row['index']] = similar_items[1:]
        
    rec_frames = []
    tfidf_scores = []

    # Iterate through the dictionary of results, collecting each recommended row together with its score
    # (DataFrame.append was removed in pandas 2.0, so the rows are combined with pd.concat instead)
    
    for k, v in results.items():
    
        for score, wine_index in v:
        
            rec_frames.append(df[df.index.isin([wine_index])])
            tfidf_scores.append(score)
            
    tfidf_recs = pd.concat(rec_frames) if rec_frames else pd.DataFrame()
    
    tfidf_recs['tfidf_score'] = tfidf_scores
    
    return tfidf_recs
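
The slice `argsort()[:-5:-1]` used inside the function is easy to misread, so here is a small sketch with made-up scores (the array values are assumptions for illustration). It shows that the slice yields four positions, and that dropping the first one leaves three recommendations per selected wine.

```python
import numpy as np

# Hypothetical similarity scores of one wine against six wines (position 0 is the wine itself)
scores = np.array([1.0, 0.10, 0.45, 0.30, 0.05, 0.80])

# argsort() orders positions from lowest to highest score;
# [:-5:-1] walks backwards from the end and stops before the fifth-from-last, giving 4 positions
top_positions = scores.argsort()[:-5:-1]

# Dropping the first position (the wine itself here) leaves 3 recommendations
recommended = top_positions[1:]

print(top_positions)  # [0 5 2 3]
print(recommended)    # [5 2 3]
```

When several wines tie at a score of 1.0, the first position is not guaranteed to be the selected wine itself, which is one way a selected wine can end up in its own recommendations.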

### Description Based Recommendation:

In [8]:
new_tfidf_recs = tfidf_recommendation(wine_df)
new_tfidf_recs

Unnamed: 0,wine name,winery,wine year,wine rating,wine price,wine type,wine region,wine country,grape information,wine acidity,wine intensity,wine sweetness,wine tannin,wine description,wine price log,user's choice,tfidf_score
4777,Cuvée 7,Sauska,2016.0,4.1,57.99,Red,Villány,Hungary,Cabernet Sauvignon,,3.948318,1.615174,3.935267,Bordeaux blends from Hungary are jammy and ric...,4.060271,,1.0
1721,Dolomite Cabernet Franc,Raats,2016.0,3.7,27.99,Red,Stellenbosch,South Africa,Cabernet Franc,3.383157,3.543047,1.459468,3.583633,Cabernet Franc in South Africa can produce a v...,3.331847,,0.124832
2884,Dolomite Cabernet Franc,Raats,2017.0,3.8,18.040833,Red,Stellenbosch,South Africa,Cabernet Franc,3.383157,3.543047,1.459468,3.583633,Cabernet Franc in South Africa can produce a v...,2.892638,,0.124832
303,Estate Merlot,Humberto Canale,2016.0,3.4,9.99,Red,Rio Negro,Argentina,Merlot,2.399199,3.590889,1.947797,2.493179,Merlot is a relatively new grape in Argentina ...,2.301585,,1.0
2944,Merlot (Classic),Montes,2018.0,3.5,18.99,Red,Colchagua Valley,Chile,Merlot,1.80177,3.934795,2.032921,1.884926,Merlot is a staple of the wine producing regio...,2.943913,,0.245839
3207,Cuvée Alexandre Merlot (Apalta Vineyard),Lapostolle,2015.0,3.9,18.990833,Red,Colchagua Valley,Chile,Merlot,1.92677,3.967185,2.028907,2.073914,Merlot is a staple of the wine producing regio...,2.943956,,0.245839
3613,Rot,Schwarz,2013.0,4.1,64.99,Red,Burgenland,Austria,Zweigelt,3.816559,3.28241,1.787342,3.235215,"The fruit-forward and spicy Zweigelt, with fla...",4.174233,1.0,1.0
2522,Pitti,Pittnauer,2018.0,3.8,17.0,Red,Burgenland,Austria,Zweigelt,3.980691,3.023554,1.404613,3.307066,"The fruit-forward and spicy Zweigelt, with fla...",2.833213,,1.0
2788,Valpolicella Superiore,Musella,2017.0,3.5,17.99,Red,Valpolicella,Italy,Corvina,3.485892,2.967525,1.961507,2.543161,The red wines of Valpolicella have a lot to of...,2.889816,,0.109529
1261,Syrah,Smoking Loon,2017.0,3.8,13.99,Red,California,United States,Shiraz/Syrah,3.065899,4.351349,1.475967,3.377084,Californian Syrah certainly isn't a wine for t...,2.638343,,0.29608


We can see that the recommender does a good job, suggesting wines of the same type, from similar regions, and with similar price tags (the initially selected wines are shown a few cells below).

Compared with the previous recommendation system based on wine ratings, the quality of the suggestions has improved. Some wines have a tfidf score of 1.0, which means that their description is the same as that of one of the initially selected wines.

This happens because most wines that share a winery location or grape information also share the same description. It can be resolved by gathering or scraping higher-quality data from the web. I will do that after submitting this project :)
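
To gauge how often descriptions repeat, one could simply count them. The sketch below uses a hypothetical five-row stand-in for wine_df (toy_df and its contents are assumptions; only the 'wine description' column name comes from the notebook).

```python
import pandas as pd

# Hypothetical miniature stand-in for wine_df
toy_df = pd.DataFrame({
    "wine name": ["Pinot A", "Pinot B", "Pinot C", "Merlot A", "Syrah A"],
    "wine description": [
        "Pinot Noir text",
        "Pinot Noir text",
        "Pinot Noir text",
        "Merlot text",
        "Syrah text",
    ],
})

# How many wines share each description
description_counts = toy_df["wine description"].value_counts()

# Share of rows whose description also appears on at least one other row
shared_fraction = toy_df["wine description"].duplicated(keep=False).mean()

print(description_counts)
print(f"{toy_df['wine description'].nunique()} unique descriptions for {len(toy_df)} wines")
print(f"{shared_fraction:.0%} of wines share their description with another wine")
```

Running the same two calls on the real wine_df would show how many of the recommendations are bound to come back with a score of 1.0.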

### Initially selected wines:

In [9]:
wine_df[wine_df["user's choice"] == 1]


Unnamed: 0,wine name,winery,wine year,wine rating,wine price,wine type,wine region,wine country,grape information,wine acidity,wine intensity,wine sweetness,wine tannin,wine description,wine price log,user's choice
539,Egri Bikavér Bulls Blood,Egervin,2017.0,3.5,10.99,Red,Eger,Hungary,Cabernet Sauvignon,4.482264,3.895974,1.249128,4.287904,Bordeaux blends from Hungary are jammy and ric...,2.396986,1.0
1260,Valle Escondido Merlot,Gouguenheim,2017.0,3.8,13.99,Red,Mendoza,Argentina,Merlot,2.011888,3.913674,1.631071,2.21324,Merlot is a relatively new grape in Argentina ...,2.638343,1.0
3613,Rot,Schwarz,2013.0,4.1,64.99,Red,Burgenland,Austria,Zweigelt,3.816559,3.28241,1.787342,3.235215,"The fruit-forward and spicy Zweigelt, with fla...",4.174233,1.0
5051,Barrancas Vineyards Alta Syrah,Pascual Toso,2016.0,4.2,60.99,Red,Mendoza,Argentina,Shiraz/Syrah,,4.372199,1.906853,3.527991,"Syrah is a big, thick-skinned grape that origi...",4.11071,1.0
11809,Cuvée Tradition Brut Rosé,Miolo,2012.0,3.4,15.75,Sparkling,Vale dos Vinhedos,Brazil,Chardonnay,3.59065,2.973936,,,Sparkling wines from Brazil take on many diffe...,2.75684,1.0


# What happens if we choose Wines Randomly?

Let's randomly pick 5 wines from the dataset. Again, you can pick more or fewer than 5 wines.

In [10]:
# This cell creates a list of 5 distinct random row positions
# (random.sample draws without replacement, so the same wine cannot be picked twice)
random_numbers_list = random.sample(range(len(wine_df)), 5)
random_numbers_list


[937, 7961, 12562, 12133, 3560]
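
The random picks change on every run, so the tables below will differ if the notebook is re-executed. If reproducible picks are wanted, a seeded generator is one option. This is a sketch only; the seed value 42 and the dataset size of 1000 are arbitrary assumptions, and in the notebook the size would be len(wine_df).

```python
import random

n_wines = 1000           # hypothetical dataset size; the notebook would use len(wine_df)
rng = random.Random(42)  # the seed value is an arbitrary assumption

# Draw 5 distinct row positions; the same seed always gives the same picks
picks = rng.sample(range(n_wines), 5)

print(picks)
```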

In [11]:
# DataFrame of the wines at the randomly generated positions
wine_df.iloc[random_numbers_list]

Unnamed: 0,wine name,winery,wine year,wine rating,wine price,wine type,wine region,wine country,grape information,wine acidity,wine intensity,wine sweetness,wine tannin,wine description,wine price log,user's choice
937,Pinot Noir,Chasing Lions,2017.0,4.0,12.99,Red,California,United States,Pinot Noir,3.207955,2.944879,1.896073,2.101255,Pinot Noir has the well deserved reputation of...,2.56418,
7961,Diamond Collection Sauvignon Blanc,Francis Ford Coppola Winery,2015.0,3.7,15.99,White,California,United States,Sauvignon Blanc,3.996501,3.060451,1.384674,,California is known primarily for its Cabernet...,2.771964,
12562,Bandol Rosé,Domaine de la Tour du Bon,2019.0,3.9,23.21,Rose,Bandol,France,Shiraz/Syrah,3.895482,2.661442,1.484822,,No summer afternoon is complete without a litt...,3.144583,
12133,Palmes d'Or Vintage Brut Champagne,Nicolas Feuillatte,1999.0,4.4,139.99,Sparkling,Champagne,France,Chardonnay,4.455622,4.023271,,,While there are many sparkling wine regions ar...,4.941571,
3560,Pinot Noir,Wild Ridge,2014.0,4.0,42.99,Red,Sonoma Coast,United States,Pinot Noir,3.511116,3.302398,1.473219,2.181562,Pinot Noir has the well deserved reputation of...,3.760968,


In [12]:
# Same process as previously described
wine_df["user's choice"] = 0

for index in random_numbers_list:
    wine_df.loc[index, "user's choice"] = 1
wine_df[wine_df["user's choice"] == 1]

Unnamed: 0,wine name,winery,wine year,wine rating,wine price,wine type,wine region,wine country,grape information,wine acidity,wine intensity,wine sweetness,wine tannin,wine description,wine price log,user's choice
937,Pinot Noir,Chasing Lions,2017.0,4.0,12.99,Red,California,United States,Pinot Noir,3.207955,2.944879,1.896073,2.101255,Pinot Noir has the well deserved reputation of...,2.56418,1
3560,Pinot Noir,Wild Ridge,2014.0,4.0,42.99,Red,Sonoma Coast,United States,Pinot Noir,3.511116,3.302398,1.473219,2.181562,Pinot Noir has the well deserved reputation of...,3.760968,1
7961,Diamond Collection Sauvignon Blanc,Francis Ford Coppola Winery,2015.0,3.7,15.99,White,California,United States,Sauvignon Blanc,3.996501,3.060451,1.384674,,California is known primarily for its Cabernet...,2.771964,1
12133,Palmes d'Or Vintage Brut Champagne,Nicolas Feuillatte,1999.0,4.4,139.99,Sparkling,Champagne,France,Chardonnay,4.455622,4.023271,,,While there are many sparkling wine regions ar...,4.941571,1
12562,Bandol Rosé,Domaine de la Tour du Bon,2019.0,3.9,23.21,Rose,Bandol,France,Shiraz/Syrah,3.895482,2.661442,1.484822,,No summer afternoon is complete without a litt...,3.144583,1


In [13]:
# Same process as previously described

wine_df["user's choice"] = wine_df["user's choice"].apply(lambda x: np.nan if x ==0 else x)

wine_df["user's choice"].value_counts()

wine_df.iloc[random_numbers_list]

Unnamed: 0,wine name,winery,wine year,wine rating,wine price,wine type,wine region,wine country,grape information,wine acidity,wine intensity,wine sweetness,wine tannin,wine description,wine price log,user's choice
937,Pinot Noir,Chasing Lions,2017.0,4.0,12.99,Red,California,United States,Pinot Noir,3.207955,2.944879,1.896073,2.101255,Pinot Noir has the well deserved reputation of...,2.56418,1.0
7961,Diamond Collection Sauvignon Blanc,Francis Ford Coppola Winery,2015.0,3.7,15.99,White,California,United States,Sauvignon Blanc,3.996501,3.060451,1.384674,,California is known primarily for its Cabernet...,2.771964,1.0
12562,Bandol Rosé,Domaine de la Tour du Bon,2019.0,3.9,23.21,Rose,Bandol,France,Shiraz/Syrah,3.895482,2.661442,1.484822,,No summer afternoon is complete without a litt...,3.144583,1.0
12133,Palmes d'Or Vintage Brut Champagne,Nicolas Feuillatte,1999.0,4.4,139.99,Sparkling,Champagne,France,Chardonnay,4.455622,4.023271,,,While there are many sparkling wine regions ar...,4.941571,1.0
3560,Pinot Noir,Wild Ridge,2014.0,4.0,42.99,Red,Sonoma Coast,United States,Pinot Noir,3.511116,3.302398,1.473219,2.181562,Pinot Noir has the well deserved reputation of...,3.760968,1.0


In [14]:
alternative_tfidf_recs = tfidf_recommendation(wine_df)

alternative_tfidf_recs

Unnamed: 0,wine name,winery,wine year,wine rating,wine price,wine type,wine region,wine country,grape information,wine acidity,wine intensity,wine sweetness,wine tannin,wine description,wine price log,user's choice,tfidf_score
1589,Pinot Noir,Rickshaw,2019.0,3.6,14.99,Red,Sonoma County,United States,Pinot Noir,3.520112,2.865938,1.462285,2.05348,Pinot Noir has the well deserved reputation of...,2.707383,,1.0
6119,Russian River Valley Pinot Noir,Kosta Browne,2014.0,4.6,115.0,Red,Russian River Valley,United States,Pinot Noir,,3.143511,1.687798,2.058886,Pinot Noir has the well deserved reputation of...,4.744932,,1.0
1577,Pinot Noir Clone 4,Cambria,2016.0,3.8,14.99,Red,Santa Maria Valley,United States,Pinot Noir,3.484797,2.9307,1.590514,2.137889,Pinot Noir has the well deserved reputation of...,2.707383,,1.0
1589,Pinot Noir,Rickshaw,2019.0,3.6,14.99,Red,Sonoma County,United States,Pinot Noir,3.520112,2.865938,1.462285,2.05348,Pinot Noir has the well deserved reputation of...,2.707383,,1.0
6119,Russian River Valley Pinot Noir,Kosta Browne,2014.0,4.6,115.0,Red,Russian River Valley,United States,Pinot Noir,,3.143511,1.687798,2.058886,Pinot Noir has the well deserved reputation of...,4.744932,,1.0
1577,Pinot Noir Clone 4,Cambria,2016.0,3.8,14.99,Red,Santa Maria Valley,United States,Pinot Noir,3.484797,2.9307,1.590514,2.137889,Pinot Noir has the well deserved reputation of...,2.707383,,1.0
7895,Sauvignon Blanc,Girard,2017.0,3.6,15.979167,White,Napa Valley,United States,Sauvignon Blanc,4.077648,3.014764,1.310916,,California is known primarily for its Cabernet...,2.771286,,1.0
7597,Sauvignon Blanc,Thrive,2017.0,3.6,13.99,White,California,United States,Sauvignon Blanc,4.012355,2.93442,1.11686,,California is known primarily for its Cabernet...,2.638343,,1.0
8993,Sauvignon Blanc,Long Meadow Ranch,2018.0,4.0,26.0,White,Rutherford,United States,Sauvignon Blanc,4.0292,3.013259,1.416183,,California is known primarily for its Cabernet...,3.258097,,1.0
12027,Cuvée Alain Thienot Champagne,Thienot,2007.0,4.2,90.0,Sparkling,Champagne,France,Chardonnay,4.619952,4.102551,,,While there are many sparkling wine regions ar...,4.49981,,1.0


Here we can see that all recommended wines have a tfidf score of 1.0. This is because the Vivino API provides the same description for most wines that share a winery location and/or grape information. However, the previous example shows that the recommendation system is effective.

# Conclusion
We can see that the description-based recommendation system recommended 15 wines based on our input of 5 wines. I have shown two approaches to using description-based recommendations:

1) Input the wines you have previously consumed and liked. The system then looks for wines with similar descriptions and recommends them based on the tfidf value.

2) Randomly generate some wines to see how the recommendation system works. Since most wines share the same description and the number of unique wine descriptions is limited, the tfidf value will be 1.0, which means the system returns wines that have exactly the same description. This could be solved by scraping higher-quality data. Data quality is essential in building recommendation systems; building a collaborative filtering model also depends on the quality of the data gathered.
