## Overview of the Notebook

This notebook builds food and wine pairings from review text using Natural Language Processing (NLP) techniques and Word2Vec embeddings. It normalizes food and wine descriptions, trains word embeddings on them, and computes aroma and non-aroma (weight, sweet, acid, salt, piquant, fat, bitter) vectors for wine and food, which are then used to recommend pairings.

In [2]:
!pip install gensim
!pip install nltk



In [3]:
# Import Libraries
import os
import pandas as pd
import numpy as np
import string
# from operator import itemgetter
from collections import Counter, OrderedDict

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/michaelajackson/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/michaelajackson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Verify Installations 

In [5]:
from nltk.tokenize import sent_tokenize

sample_text = "This is a test. Let's see if this works."
print(sent_tokenize(sample_text))

['This is a test.', "Let's see if this works."]


## Download Datasets From Kaggle

In [7]:
!pip install kaggle



In [8]:
# Download datasets from Kaggle
!kaggle datasets download snap/amazon-fine-food-reviews

Dataset URL: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
License(s): CC0-1.0
amazon-fine-food-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [9]:
# Download datasets from Kaggle
!kaggle datasets download zynicide/wine-reviews

Dataset URL: https://www.kaggle.com/datasets/zynicide/wine-reviews
License(s): CC-BY-NC-SA-4.0
wine-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


## Load and Explore Data

In [11]:
#Wine Data Loading:
csv_pathwine = "./Data/winemag-data_first150k.csv"
# Load the CSV file
df_wine = pd.read_csv(csv_pathwine)

# Preview the data
print(df_wine.head())
print(df_wine.columns)
print(df_wine.info())
print(df_wine.isnull().sum())

   Unnamed: 0 country                                        description  \
0           0      US  This tremendous 100% varietal wine hails from ...   
1           1   Spain  Ripe aromas of fig, blackberry and cassis are ...   
2           2      US  Mac Watson honors the memory of a wine once ma...   
3           3      US  This spent 20 months in 30% new French oak, an...   
4           4  France  This is the top wine from La Bégude, named aft...   

                            designation  points  price        province  \
0                     Martha's Vineyard      96  235.0      California   
1  Carodorum Selección Especial Reserva      96  110.0  Northern Spain   
2         Special Selected Late Harvest      96   90.0      California   
3                               Reserve      96   65.0          Oregon   
4                            La Brûlade      95   66.0        Provence   

            region_1           region_2             variety  \
0        Napa Valley               

In [12]:
#Food Data Loading
csv_pathfood = "./Data/Reviews.csv"

# Load the CSV file
df_food = pd.read_csv(csv_pathfood)

# Preview the data
print(df_food.head())
print(df_food.columns)
print(df_food.info())
print(df_food.isnull().sum())

   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                     1                       1      4  1219017600   
3                     3                       3      2  1307923200   
4                     0                       0      5  1350777600   

                 Summary                                               Text  
0  Good Quality Dog Food  I have bought several of the Vitality canned d...  
1 

In [13]:
reviews_list_wine = list(df_wine['description'])
reviews_list_food = list(df_food['Text'])

In [14]:
full_wine_reviews_list = [str(r) for r in reviews_list_wine]
full_wine_corpus = ' '.join(full_wine_reviews_list)
sentences_tokenized_wine = sent_tokenize(full_wine_corpus)

full_food_reviews_list = [str(r) for r in reviews_list_food]
full_food_corpus = ' '.join(full_food_reviews_list)
sentences_tokenized_food = sent_tokenize(full_food_corpus)

print(sentences_tokenized_wine[:2])
print(sentences_tokenized_food[:2])

['This tremendous 100% varietal wine hails from Oakville and was aged over three years in oak.', 'Juicy red-cherry fruit and a compelling hint of caramel greet the palate, framed by elegant, fine tannins and a subtle minty tone in the background.']
['I have bought several of the Vitality canned dog food products and have found them all to be of good quality.', 'The product looks more like a stew than a processed meat and it smells better.']


In [15]:
stop_words = set(stopwords.words('english'))

punctuation_table = str.maketrans({key: None for key in string.punctuation})
sno = SnowballStemmer('english')

def normalize_text(raw_text):
    """Tokenize, lowercase, stem, strip punctuation, then drop stop words and 1-character tokens."""
    try:
        word_list = word_tokenize(raw_text)
    except TypeError:
        # Non-string input (e.g. NaN): return an empty token list
        return []
    normalized_sentence = []
    for w in word_list:
        lower_case_word = w.lower()
        stemmed_word = sno.stem(lower_case_word)
        no_punctuation = stemmed_word.translate(punctuation_table)
        if len(no_punctuation) > 1 and no_punctuation not in stop_words:
            normalized_sentence.append(no_punctuation)
    return normalized_sentence

normalized_sentences_wine = []
for s in sentences_tokenized_wine:
    normalized_text = normalize_text(s)
    normalized_sentences_wine.append(normalized_text)

normalized_sentences_food = []
for s in sentences_tokenized_food:
    normalized_text = normalize_text(s)
    normalized_sentences_food.append(normalized_text)

In [16]:
bigram_model_wine = Phrases(normalized_sentences_wine, min_count=100)
bigrams_wine = [bigram_model_wine[line] for line in normalized_sentences_wine]
trigram_model_wine = Phrases(bigrams_wine, min_count=50)
phrased_sentences_wine = [trigram_model_wine[line] for line in bigrams_wine]
trigram_model_wine.save('trigrams_wine.pkl')

In [17]:
bigram_model_food = Phrases(normalized_sentences_food, min_count=100)
bigrams_food = [bigram_model_food[sent] for sent in normalized_sentences_food]
trigram_model_food = Phrases(bigrams_food, min_count=50)
phrased_sentences_food = [trigram_model_food[sent] for sent in bigrams_food]
trigram_model_food.save('trigrams_food.pkl')

In [18]:
trigram_model_wine = Phraser.load('trigrams_wine.pkl')
trigram_model_food = Phraser.load('trigrams_food.pkl')
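
`Phrases` decides which adjacent tokens to join (for example `three_year` in the wine corpus) by scoring each candidate pair against a threshold. As a rough sketch, assuming gensim's default scorer and its default threshold of 10.0, the score can be reproduced by hand. `phrase_score` below is an illustrative re-implementation, not a gensim function, and the counts are invented.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Collocation score in the style of gensim's default scorer (illustrative only)."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

THRESHOLD = 10.0  # assumed: the documented default of Phrases(threshold=...)

# Invented counts for a candidate pair such as ("three", "year")
score = phrase_score(count_a=500, count_b=400, count_ab=350,
                     vocab_size=10_000, min_count=100)
print(score, score > THRESHOLD)
```

Under this formula a higher `min_count`, such as the 100 used for bigrams here, lowers every score, so only pairs that co-occur well above that floor are merged.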

In [19]:
# Load Descriptor Mappings
descriptor_mapping = pd.read_csv('./descriptor_mapping.csv', encoding='latin1').set_index('raw descriptor')

def return_mapped_descriptor(word, mapping):
    # Return the normalized (level_3) descriptor if the word is mapped, else the word itself
    if word in mapping.index:
        return mapping.at[word, 'level_3']
    return word

normalized_sentences_wine = []
for sent in phrased_sentences_wine:
    normalized_sentence_wine = []
    for word in sent:
        normalized_word = return_mapped_descriptor(word, descriptor_mapping)
        normalized_sentence_wine.append(str(normalized_word))
    normalized_sentences_wine.append(normalized_sentence_wine)

In [20]:
aroma_descriptor_mapping = descriptor_mapping.loc[descriptor_mapping['type'] == 'aroma']
normalized_sentences_food = []
for sent in phrased_sentences_food:
    normalized_sentence_food = []
    for word in sent:
        normalized_word = return_mapped_descriptor(word, aroma_descriptor_mapping)
        normalized_sentence_food.append(str(normalized_word))
    normalized_sentences_food.append(normalized_sentence_food)
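
The mapping step replaces a raw token with its normalized descriptor when one exists and otherwise leaves the token unchanged. A small self-contained sketch of the same idea with a plain dictionary, which gives constant-time lookups. The two mapping entries and the helper name `map_descriptor` are invented for illustration and are not taken from `descriptor_mapping.csv`.

```python
# Hypothetical raw-token -> descriptor entries (not from the real mapping file).
toy_mapping = {"redcherri": "cherry", "oaki": "oak"}

def map_descriptor(word, mapping):
    """Return the mapped descriptor if present, else the word itself."""
    return mapping.get(word, word)

sentence = ["juici", "redcherri", "fruit", "oaki"]
mapped = [map_descriptor(w, toy_mapping) for w in sentence]
print(mapped)  # ['juici', 'cherry', 'fruit', 'oak']
```

With the real DataFrame, an equivalent dictionary could be built with `descriptor_mapping['level_3'].to_dict()`, assuming the `raw descriptor` index contains no duplicates.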

In [21]:
normalized_sentences = normalized_sentences_wine + normalized_sentences_food

In [22]:
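# Train Word2Vec on the phrased wine sentences only; the combined
# `normalized_sentences` list built in the previous cell is not used here.
word2vec_model_wine = Word2Vec(
    sentences=phrased_sentences_wine,
    vector_size=300,  # gensim 4 name for the former `size` argument
    min_count=8,
    epochs=15  # gensim 4 name for the former `iter` argument
)

print(word2vec_model_wine)

# Save the model (the file name says "food", but the model is trained on wine text)
word2vec_model_wine.save('./food_word2vec_model.bin')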

Word2Vec<vocab=9749, vector_size=300, alpha=0.025>


In [23]:
print("Sample phrased sentences:", phrased_sentences_wine[:5])

Sample phrased sentences: [['tremend', '100_variet', 'wine', 'hail', 'oakvill', 'age', 'three_year', 'oak'], ['juici', 'redcherri', 'fruit', 'compel', 'hint', 'caramel', 'greet', 'palat', 'frame', 'eleg', 'fine', 'tannin', 'subtl', 'minti', 'tone', 'background'], ['balanc', 'reward', 'start', 'finish', 'year', 'ahead', 'develop', 'nuanc'], ['enjoy', '2022–2030'], ['ripe', 'aroma', 'fig', 'blackberri', 'cassi', 'soften', 'sweeten', 'slather', 'oaki', 'chocol', 'vanilla']]
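
Tokens such as `redcherri` in the output above come from the punctuation step in `normalize_text`: the translation table deletes every character in `string.punctuation`, so hyphenated words are fused rather than split. A minimal sketch of that step in isolation, with stemming and stop-word removal left out (the helper name `strip_punctuation` is made up for illustration):

```python
import string

# Same table construction as in normalize_text: map each ASCII punctuation mark to None.
punctuation_table = str.maketrans({key: None for key in string.punctuation})

def strip_punctuation(token):
    """Delete ASCII punctuation from a single token (hypothetical helper)."""
    return token.translate(punctuation_table)

print(strip_punctuation("red-cherry"))  # hyphen removed, the two halves are fused
print(strip_punctuation("2022–2030"))   # the en dash is not in string.punctuation, so it stays
```

This also explains why `2022–2030` survives as a single token in the sample output: only ASCII punctuation is removed.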


In [24]:
# Display the unique values in the 'variety' column
unique_varieties = df_wine['variety'].unique()
print(f"Unique grape varieties ({len(unique_varieties)} total):")
print(unique_varieties)

Unique grape varieties (632 total):
['Cabernet Sauvignon' 'Tinta de Toro' 'Sauvignon Blanc' 'Pinot Noir'
 'Provence red blend' 'Friulano' 'Tannat' 'Chardonnay' 'Tempranillo'
 'Malbec' 'Rosé' 'Tempranillo Blend' 'Syrah' 'Mavrud' 'Sangiovese'
 'Sparkling Blend' 'Rhône-style White Blend' 'Red Blend' 'Mencía'
 'Palomino' 'Petite Sirah' 'Riesling' 'Cabernet Sauvignon-Syrah'
 'Portuguese Red' 'Nebbiolo' 'Pinot Gris' 'Meritage' 'Baga' 'Glera'
 'Malbec-Merlot' 'Merlot-Malbec' 'Ugni Blanc-Colombard' 'Viognier'
 'Cabernet Sauvignon-Cabernet Franc' 'Moscato' 'Pinot Grigio'
 'Cabernet Franc' 'White Blend' 'Monastrell' 'Gamay' 'Zinfandel' 'Greco'
 'Barbera' 'Grenache' 'Rhône-style Red Blend' 'Albariño' 'Malvasia Bianca'
 'Assyrtiko' 'Malagouzia' 'Carmenère' 'Bordeaux-style Red Blend'
 'Touriga Nacional' 'Agiorgitiko' 'Picpoul' 'Godello' 'Gewürztraminer'
 'Merlot' 'Syrah-Grenache' 'G-S-M' 'Mourvèdre'
 'Bordeaux-style White Blend' 'Petit Verdot' 'Muscat'
 'Chenin Blanc-Chardonnay' 'Cabernet Sauvignon

In [25]:
variety_mapping = {
    # Whites
    'Pinot Gris': 'Pinot Grigio',
    'Pinot Grigio/Gris': 'Pinot Grigio',
    'Grüner Veltliner': 'Gruner Veltliner',
    'Fumé Blanc': 'Sauvignon Blanc',
    'Garganega': 'Soave',
    'Verdejo-Viura': 'Verdejo',
    'Riesling-Chardonnay': 'White Blend',
    'Sauvignon Blanc-Semillon': 'White Bordeaux Blend',
    'Semillon-Sauvignon Blanc': 'White Bordeaux Blend',
    'Trebbiano Spoletino': 'Trebbiano',
    'Trebbiano di Lugana': 'Trebbiano',
    'Malvasia Bianca': 'Malvasia',
    'Verdelho': 'Verdelho',
    'Picpoul': 'Piquepoul',
    'Alvarinho': 'Albarino',
    'Verdicchio': 'Verdicchio',
    'Marsanne-Roussanne': 'Rhone White Blend',
    'Chardonnay-Sauvignon Blanc': 'White Blend',
    'Sauvignon Blanc-Chenin Blanc': 'White Blend',
    'Chenin Blanc-Chardonnay': 'White Blend',
    'Viognier-Chardonnay': 'White Blend',
    'Grenache Blanc': 'Rhone White Blend',
    'Assyrtiko': 'Assyrtiko',
    'Müller-Thurgau': 'Muller-Thurgau',
    'Sylvaner': 'Silvaner',
    'Zibibbo': 'Muscat of Alexandria',
    'Muscat Blanc à Petits Grains': 'Muscat',
    'Prosecco': 'Glera',
    'Pinot Bianco': 'Pinot Blanc',
    'Sémillon': 'Semillon',

    # Reds
    'Shiraz': 'Syrah',
    'Syrah-Grenache': 'Rhone Red Blend',
    'Grenache-Syrah': 'Rhone Red Blend',
    'Garnacha': 'Grenache',
    'Cabernet Sauvignon-Merlot': 'Bordeaux Blend',
    'Merlot-Cabernet Sauvignon': 'Bordeaux Blend',
    'Petit Verdot': 'Petit Verdot',
    'Tempranillo-Cabernet Sauvignon': 'Tempranillo Blend',
    'Malbec-Cabernet Franc': 'Bordeaux Blend',
    'Tinta del Pais': 'Tempranillo',
    'Tinta Fina': 'Tempranillo',
    'Aragonês': 'Tempranillo',
    'Cabernet Sauvignon-Syrah': 'Cabernet-Syrah Blend',
    'Cabernet Sauvignon-Carmenère': 'Cabernet-Carmenere Blend',
    'Monastrell': 'Mourvedre',
    'Zinfandel': 'Zinfandel',  # canonical name; 'Primitivo' is mapped to it below
    'Blaufränkisch': 'Blaufrankisch',
    'Pinot Nero': 'Pinot Noir',
    'Spätburgunder': 'Pinot Noir',
    'Ribolla Gialla': 'Ribolla Gialla',
    'Frappato': 'Frappato',
    'Nero d\'Avola': 'Nero d\'Avola',
    'Aglianico': 'Aglianico',
    'Barbera-Nebbiolo': 'Barbera',
    'Cesanese d\'Affile': 'Cesanese',
    'Lagrein': 'Lagrein',

    # Sparkling
    'Champagne Blend': 'Champagne',
    'Sparkling Blend': 'Sparkling Wine',
    'Portuguese Sparkling': 'Sparkling Wine',

    # Rosés
    'Rosé': 'Rose',
    'Rosado': 'Rose',
    'Portuguese Rosé': 'Rose',

    # Fortified and Sweet Wines
    'Sherry': 'Sherry',
    'Port': 'Port',
    'Madeira Blend': 'Madeira',
    'Pedro Ximénez': 'Pedro Ximenez',
    'Moscatel de Alejandría': 'Muscat of Alexandria',
    'Tokaji': 'Tokaji',

    # Others and Rare Varieties
    'Roussanne': 'Rhone White Blend',
    'Marsanne': 'Rhone White Blend',
    'Carmenère': 'Carmenere',
    'Albariño': 'Albarino',
    'Gewürztraminer': 'Gewurztraminer',
    'Vermentino': 'Vermentino',
    'Viognier': 'Viognier',
    'Cortese': 'Cortese (Gavi)',
    'Nerello Mascalese': 'Nerello Mascalese',
    'Dolcetto': 'Dolcetto',
    'Cinsault': 'Cinsault',
    'Carignan': 'Carignan',
    'Savagnin': 'Savagnin',
    'Tannat': 'Tannat',
    'Malbec': 'Malbec',
    'Petit Manseng': 'Petit Manseng',
    'Grenache': 'Grenache',
    'Pinotage': 'Pinotage',
    'Negroamaro': 'Negroamaro',
    'Falanghina': 'Falanghina',
    'Vernaccia': 'Vernaccia',
    'Primitivo': 'Zinfandel',
    'Cabernet Franc': 'Cabernet Franc',
    'Cabernet Sauvignon': 'Cabernet Sauvignon',
    'Chardonnay': 'Chardonnay',
    'Merlot': 'Merlot',
    'Sangiovese': 'Sangiovese',
    'Nebbiolo': 'Nebbiolo',
    'Gamay': 'Gamay',
}

def consolidate_varieties(variety_name):
    # Return the canonical name if one is defined, otherwise keep the original
    return variety_mapping.get(variety_name, variety_name)


df_wine_clean = df_wine.copy()
df_wine_clean['variety'] = df_wine_clean['variety'].apply(consolidate_varieties)
df_wine_clean.columns = df_wine_clean.columns.str.title()
df_wine_clean.rename(columns={'Region_1': 'Subregion', 'Region_2': 'Region'}, inplace=True)
df_wine_clean.head()

Unnamed: 0.1,Unnamed: 0,Country,Description,Designation,Points,Price,Province,Subregion,Region,Variety,Winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude


In [26]:
order_of_geographies = ['Subregion', 'Region', 'Province', 'Country']

# Replace NaN and invalid values with 'none'
def replace_nan_for_zero(value):
    if str(value).lower() in ['0', 'nan', 'none']:
        return 'none'
    else:
        return value

for o in order_of_geographies:
    df_wine_clean[o] = df_wine_clean[o].apply(replace_nan_for_zero)

# Verify there are no NaN values
print(df_wine_clean[order_of_geographies].isnull().sum())

Subregion    0
Region       0
Province     0
Country      0
dtype: int64


In [27]:
# Group by 'Variety', 'Country', 'Province', 'Region', and 'Subregion', and count occurrences
variety_geo = df_wine_clean.groupby(['Variety', 'Country', 'Province', 'Region', 'Subregion']).size().reset_index(name='count')

# Filter for groups where count > 1
variety_geo_sliced = variety_geo.loc[variety_geo['count'] > 1]

# Create a new DataFrame with the relevant columns
df_vgeos = pd.DataFrame(variety_geo_sliced, columns=['Variety', 'Country', 'Province', 'Region', 'Subregion', 'count'])

# Save to CSV
df_vgeos.to_csv('varieties_all_geos.csv', index=False)

# Preview the result
df_vgeos.head(15)

Unnamed: 0,Variety,Country,Province,Region,Subregion,count
1,Agiorgitiko,Greece,Corinth,none,none,8
2,Agiorgitiko,Greece,Greece,none,none,10
3,Agiorgitiko,Greece,Nemea,none,none,84
4,Agiorgitiko,Greece,Pangeon,none,none,2
5,Agiorgitiko,Greece,Peloponnese,none,none,15
6,Aglianico,Italy,Southern Italy,none,Aglianico del Beneventano,4
7,Aglianico,Italy,Southern Italy,none,Aglianico del Taburno,12
8,Aglianico,Italy,Southern Italy,none,Aglianico del Vulture,86
9,Aglianico,Italy,Southern Italy,none,Basilicata,6
10,Aglianico,Italy,Southern Italy,none,Beneventano,2


In [28]:
# Add a new column 'geo_normalized' that combines Subregion, Region, Province, and Country excluding 'none'
df_vgeos['geo_normalized'] = (
    df_vgeos[['Subregion', 'Region', 'Province', 'Country']]
    .apply(lambda x: ', '.join(val for val in x if val and val.lower() != 'none'), axis=1)
)
df_vgeos.head(20)

Unnamed: 0,Variety,Country,Province,Region,Subregion,count,geo_normalized
1,Agiorgitiko,Greece,Corinth,none,none,8,"Corinth, Greece"
2,Agiorgitiko,Greece,Greece,none,none,10,"Greece, Greece"
3,Agiorgitiko,Greece,Nemea,none,none,84,"Nemea, Greece"
4,Agiorgitiko,Greece,Pangeon,none,none,2,"Pangeon, Greece"
5,Agiorgitiko,Greece,Peloponnese,none,none,15,"Peloponnese, Greece"
6,Aglianico,Italy,Southern Italy,none,Aglianico del Beneventano,4,"Aglianico del Beneventano, Southern Italy, Italy"
7,Aglianico,Italy,Southern Italy,none,Aglianico del Taburno,12,"Aglianico del Taburno, Southern Italy, Italy"
8,Aglianico,Italy,Southern Italy,none,Aglianico del Vulture,86,"Aglianico del Vulture, Southern Italy, Italy"
9,Aglianico,Italy,Southern Italy,none,Basilicata,6,"Basilicata, Southern Italy, Italy"
10,Aglianico,Italy,Southern Italy,none,Beneventano,2,"Beneventano, Southern Italy, Italy"
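
The `geo_normalized` column is built by joining the geography fields from most to least specific while skipping the `'none'` placeholders. The same logic on plain lists, as a sketch (`join_geo` is a made-up helper name):

```python
def join_geo(parts):
    """Join non-empty geography parts, skipping the 'none' placeholder."""
    return ", ".join(p for p in parts if p and p.lower() != "none")

# Order mirrors the notebook: Subregion, Region, Province, Country
print(join_geo(["none", "none", "Nemea", "Greece"]))
print(join_geo(["Aglianico del Vulture", "none", "Southern Italy", "Italy"]))
```

Because only the literal placeholder is skipped, a row whose province equals its country still yields a repeated name, as in the `Greece, Greece` row above.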


In [29]:
df_wine_merged = pd.merge(
    left=df_wine_clean,
    right=df_vgeos,
    left_on=['Variety', 'Country', 'Province', 'Region', 'Subregion'],
    right_on=['Variety', 'Country', 'Province', 'Region', 'Subregion']
)

# Drop unnecessary columns
columns_to_drop = [
    'Unnamed: 0', 'Designation', 'Price', 'Region', 'Province',
    'Subregion', 'Winery', 'count'
]
df_wine_merged.drop(columns=columns_to_drop, axis=1, inplace=True, errors='ignore')

# Verify the resulting shape
print("Merged DataFrame shape:", df_wine_merged.shape)
df_wine_merged.head()

Merged DataFrame shape: (148799, 5)


Unnamed: 0,Country,Description,Points,Variety,geo_normalized
0,US,This tremendous 100% varietal wine hails from ...,96,Cabernet Sauvignon,"Napa Valley, Napa, California, US"
1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",96,Tinta de Toro,"Toro, Northern Spain, Spain"
2,US,Mac Watson honors the memory of a wine once ma...,96,Sauvignon Blanc,"Knights Valley, Sonoma, California, US"
3,US,"This spent 20 months in 30% new French oak, an...",96,Pinot Noir,"Willamette Valley, Willamette Valley, Oregon, US"
4,France,"This is the top wine from La Bégude, named aft...",95,Provence red blend,"Bandol, Provence, France"


In [30]:
# Group by 'Variety' and 'geo_normalized', then filter groups with more than 30 occurrences
variety_geo_counts = df_wine_merged.groupby(['Variety', 'geo_normalized']).size().reset_index(name='count')

# Filter out groups with <= 30 occurrences
frequent_varieties = variety_geo_counts[variety_geo_counts['count'] > 30]

# Merge back to keep only rows with frequent 'Variety' and 'geo_normalized'
df_wine_filtered = df_wine_merged.merge(
    frequent_varieties[['Variety', 'geo_normalized']], 
    on=['Variety', 'geo_normalized']
)

# Retain only the necessary columns
df_wine_filtered = df_wine_filtered[['Variety', 'geo_normalized', 'Description']]

# Print the final shape
print("Filtered DataFrame shape:", df_wine_filtered.shape)
df_wine_filtered.head()

Filtered DataFrame shape: (112295, 3)


Unnamed: 0,Variety,geo_normalized,Description
0,Cabernet Sauvignon,"Napa Valley, Napa, California, US",This tremendous 100% varietal wine hails from ...
1,Tinta de Toro,"Toro, Northern Spain, Spain","Ripe aromas of fig, blackberry and cassis are ..."
2,Pinot Noir,"Willamette Valley, Willamette Valley, Oregon, US","This spent 20 months in 30% new French oak, an..."
3,Tinta de Toro,"Toro, Northern Spain, Spain","Deep, dense and pure from the opening bell, th..."
4,Tinta de Toro,"Toro, Northern Spain, Spain",Slightly gritty black-fruit aromas include a s...
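
The filter keeps only (variety, geography) pairs that occur more than 30 times. A sketch of the same count-then-filter pattern using `collections.Counter` on invented rows, with a lower threshold so the toy data stays small:

```python
from collections import Counter

# Invented (variety, geo) rows; the notebook uses a threshold of 30, here it is 2.
rows = [("Pinot Noir", "Oregon, US")] * 3 + [("Tannat", "Uruguay")]
MIN_EXCLUSIVE = 2  # keep groups whose count is strictly greater than this

counts = Counter(rows)
frequent = {key for key, n in counts.items() if n > MIN_EXCLUSIVE}
filtered = [r for r in rows if r in frequent]
print(len(filtered))
```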


In [31]:
wine_reviews = list(df_wine_filtered['Description'])

descriptor_mapping_tastes = pd.read_csv('./descriptor_mapping_tastes.csv', encoding='latin1').set_index('raw descriptor')

core_tastes = ['aroma', 'weight', 'sweet', 'acid', 'salt', 'piquant', 'fat', 'bitter']
descriptor_mappings = dict()
for c in core_tastes:
    if c=='aroma':
        descriptor_mapping_filtered=descriptor_mapping_tastes.loc[descriptor_mapping_tastes['type']=='aroma']
    else:
        descriptor_mapping_filtered=descriptor_mapping_tastes.loc[descriptor_mapping_tastes['primary taste']==c]
    descriptor_mappings[c] = descriptor_mapping_filtered                                                   
    

def return_descriptor_from_mapping(mapping, word, core_taste):
    # `core_taste` is unused but kept so the existing calls keep working
    if word in mapping.index:
        return mapping['combined'][word]
    return None

review_descriptors = []
for review in wine_reviews:
    taste_descriptors = []
    normalized_review = normalize_text(review)
    phrased_review = trigram_model_wine[normalized_review]

    for c in core_tastes:                                                      
        descriptors_only = [return_descriptor_from_mapping(descriptor_mappings[c], word, c) 
                            for word in phrased_review]
        no_nones = [str(d).strip() for d in descriptors_only if d is not None]
        descriptorized_review = ' '.join(no_nones)
        taste_descriptors.append(descriptorized_review)
    
    # One entry per review: a list with one descriptor string per core taste
    review_descriptors.append(taste_descriptors)

print(f"Total Reviews Processed: {len(review_descriptors)}")

Total Reviews Processed: 112295


In [32]:
print(f"Number of reviews: {len(wine_reviews)}")
print(f"Number of review descriptors: {len(review_descriptors)}")

Number of reviews: 112295
Number of review descriptors: 112295


In [33]:
taste_descriptors = []
taste_vectors = []

for n, taste in enumerate(core_tastes):
    print(f"Processing '{taste}'...")
    
    # Keep one descriptor string per review, in review order, so rows stay aligned
    all_taste_words = [r[n] for r in review_descriptors]
    # Use only the non-empty strings to fit TF-IDF
    taste_words = [w for w in all_taste_words if w.strip() != '']
    print(f"Total taste words for '{taste}': {len(taste_words)}")
    
    # Check for empty lists early
    if not taste_words:
        print(f"No valid taste words found for '{taste}'. Skipping.")
        taste_vectors.append([np.nan] * len(review_descriptors))
        taste_descriptors.append([''] * len(review_descriptors))
        continue

    # Fit TF-IDF and keep the per-term IDF weights
    vectorizer = TfidfVectorizer()
    vectorizer.fit(taste_words)
    dict_of_tfidf_weightings = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    wine_review_descriptors = []
    wine_review_vectors = []

    # Calculate weighted vectors (reviews without descriptors get NaN)
    for d in all_taste_words:
        weighted_review_terms = []
        terms = d.split(' ')
        for term in terms:
            if term in dict_of_tfidf_weightings:
                tfidf_weighting = dict_of_tfidf_weightings[term]
                try:
                    word_vector = word2vec_model_wine.wv.get_vector(term).reshape(1, 300)
                    weighted_word_vector = tfidf_weighting * word_vector
                    weighted_review_terms.append(weighted_word_vector)
                except KeyError:
                    continue
        
        if weighted_review_terms:
            review_vector = sum(weighted_review_terms) / len(weighted_review_terms)
            wine_review_vectors.append(review_vector[0])
        else:
            wine_review_vectors.append(np.nan)

        wine_review_descriptors.append(terms)

    taste_vectors.append(wine_review_vectors)
    taste_descriptors.append(wine_review_descriptors)



Processing 'aroma'...
Total taste words for 'aroma': 111386
Processing 'weight'...
Total taste words for 'weight': 55758
Processing 'sweet'...
Total taste words for 'sweet': 39634
Processing 'acid'...
Total taste words for 'acid': 49797
Processing 'salt'...
Total taste words for 'salt': 1124
Processing 'piquant'...
Total taste words for 'piquant': 8597
Processing 'fat'...
Total taste words for 'fat': 7525
Processing 'bitter'...
Total taste words for 'bitter': 53281
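
Each review vector in the loop above is an IDF-weighted mean of its descriptors' word vectors: every word vector is multiplied by the term's `idf_` value, and the sum is divided by the number of terms found. A dependency-free sketch of that computation, assuming scikit-learn's default smoothed IDF, ln((1 + n) / (1 + df)) + 1. The 2-dimensional vectors, the three toy documents, and the helper name `review_vector` are invented.

```python
import math

docs = ["cherry oak", "cherry", "oak vanilla"]  # invented descriptor strings
vectors = {"cherry": [1.0, 0.0], "oak": [0.0, 1.0], "vanilla": [1.0, 1.0]}  # toy embeddings

# Smoothed IDF, matching TfidfVectorizer's default (smooth_idf=True)
n_docs = len(docs)
doc_freq = {}
for doc in docs:
    for term in set(doc.split()):
        doc_freq[term] = doc_freq.get(term, 0) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def review_vector(doc):
    """IDF-weighted mean of the word vectors for the terms in one document."""
    weighted = [[idf[t] * x for x in vectors[t]] for t in doc.split() if t in vectors]
    if not weighted:
        return None  # no known descriptor in this document
    return [sum(col) / len(weighted) for col in zip(*weighted)]

print(review_vector("oak vanilla"))
```

Rarer descriptors (here `vanilla`) get a larger weight and therefore pull the review vector further in their direction than common ones.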


In [34]:
# Transpose from per-taste lists to per-review rows
taste_vectors_t = list(map(list, zip(*taste_vectors)))
taste_descriptors_t = list(map(list, zip(*taste_descriptors)))

# Build one DataFrame of vectors and one of descriptors
df_review_vecs = pd.DataFrame(taste_vectors_t, columns=core_tastes)

columns_taste_descriptors = [a + '_descriptors' for a in core_tastes]
df_review_descriptors = pd.DataFrame(taste_descriptors_t, columns=columns_taste_descriptors)

# Merge with the original DataFrame
df_wine_vecs = pd.concat([df_wine_filtered, df_review_descriptors, df_review_vecs], axis=1)
print("\nProcessing complete.")
df_wine_vecs.sample(30)
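
Transposing with `zip(*...)` turns a per-taste list of per-review values into per-review rows, but `zip` stops at the shortest input, so every per-taste list needs exactly one entry per review. A small sketch with invented values:

```python
per_taste = [
    ["a1", "a2", "a3"],  # e.g. aroma values for three reviews
    ["w1", "w2"],        # a shorter list, as happens if reviews are dropped for one taste
]

rows = list(map(list, zip(*per_taste)))
print(rows)  # [['a1', 'w1'], ['a2', 'w2']] (the third review is silently lost)
```

If padding is preferred over truncation, `itertools.zip_longest` from the standard library fills the missing positions with a chosen value instead.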

In [253]:
import numpy as np

avg_taste_vecs = dict()

for t in core_tastes:
    # Extract valid embeddings
    review_arrays = df_wine_vecs[t].dropna().to_list()
    
    # Check if the list is not empty
    if review_arrays:
        # Compute average embedding
        average_taste_vec = np.mean(np.stack(review_arrays), axis=0)
        avg_taste_vecs[t] = average_taste_vec
    else:
        avg_taste_vecs[t] = np.nan

# Display the average vectors
for taste, vec in avg_taste_vecs.items():
    print(f"{taste.capitalize()} Average Vector: {vec if not np.isnan(vec).any() else 'No Data'}")

Aroma Average Vector: [ 1.5739756  -2.329658    1.6649437   1.7141682   0.03100785  2.9910395
 -2.0314074   3.9998605  -5.324974    2.5763528  -1.5920631   2.9321184
 -1.5420034  -0.5104966  -0.25311288  0.20328876 -5.755668   -3.1249745
  1.3041464  -0.43352252  3.9632287   0.1387603  -1.1445435   2.689269
  0.55665505 -5.3279753   0.5558952  -0.70028746 -2.66059     1.7589175
 -1.7126216   0.9711234   3.293448    4.27934    -1.2564628   2.1557739
 -0.639848    2.2322743  -1.3258679   4.6566944   0.39310262 -1.2104645
  2.9676428  -1.4946461   2.0495856   0.0375484   2.3957496   1.8847392
 -4.2689347   1.0780224  -1.2589067  -0.10864718  1.1543659  -0.9916873
  0.16716404 -1.4913247  -1.2743292  -1.5254711   1.1635944   4.0775523
 -1.3469003   3.307718    0.07497174 -0.07332997 -1.1398369   4.2903247
  2.0993228   3.0513227   0.41216844  3.8158784   1.9265314   1.8614669
 -0.14239061  0.12564936  2.5984483   0.17181961 -0.63406676 -3.476078
  1.2985238  -4.736719    0.3308491   3.0842

In [261]:
# Debug function to inspect data
def inspect_data(data, label):
    print(f"Inspecting {label}:")
    if isinstance(data, np.ndarray):
        print(f"Shape: {data.shape}, NaNs: {np.isnan(data).sum()}")
    else:
        print(f"Type: {type(data)}, Length: {len(data)}")
    print("-" * 40)

# Adjusted PCA function
def pca_wine_variety(list_of_varieties, wine_attribute, pca=True):
    wine_var_vectors = subset_wine_vectors(list_of_varieties, wine_attribute)

    # Extract varieties and vectors
    wine_varieties = [f"{w[0][0]} - {w[0][1]}" for w in wine_var_vectors]
    wine_var_vec = [w[1] for w in wine_var_vectors]
    
    # Inspect before cleaning
    inspect_data(wine_var_vec, "Original Wine Vectors")
    
    # Clean NaNs, keeping each label paired with its vector so the index stays aligned
    valid = [(name, vec) for name, vec in zip(wine_varieties, wine_var_vec)
             if vec is not None and not np.isnan(vec).any()]
    if not valid:
        raise ValueError(f"No valid vectors found for {wine_attribute} after cleaning.")
    valid_varieties = [name for name, _ in valid]
    wine_var_vec = np.array([vec for _, vec in valid])

    # Inspect after cleaning
    inspect_data(wine_var_vec, "Cleaned Wine Vectors")
    
    if pca:
        pca_model = PCA(n_components=1)
        wine_var_vec = pca_model.fit_transform(wine_var_vec)
        wine_var_vec = pd.DataFrame(wine_var_vec, index=valid_varieties)
    else:
        wine_var_vec = pd.Series(wine_var_vec.tolist(), index=valid_varieties)
    
    wine_var_vec.sort_index(inplace=True)

    # Create descriptor DataFrame
    wine_descriptors = pd.DataFrame([w[2] for w in wine_var_vectors], index=wine_varieties)
    wine_descriptors = pd.melt(wine_descriptors.reset_index(), id_vars='index')
    wine_descriptors.sort_index(inplace=True)
    
    return wine_var_vec, wine_descriptors

# Execute with debugging output
taste_dataframes = []
kept_tastes = []  # tastes that actually produced a column
aroma_vec, aroma_descriptors = pca_wine_variety(normalized_geos, 'aroma', pca=False)
taste_dataframes.append(aroma_vec)
kept_tastes.append('aroma')

# Generate non-aroma scalars
for tw in core_tastes[1:]:
    try:
        pca_w_dataframe, nonaroma_descriptors = pca_wine_variety(normalized_geos, tw, pca=True)
        taste_dataframes.append(pca_w_dataframe)
        kept_tastes.append(tw)
    except ValueError as e:
        print(f"Skipping {tw}: {e}")

# Combine all dataframes; name columns only for the tastes that were not skipped
all_nonaromas = pd.concat(taste_dataframes, axis=1)
all_nonaromas.columns = kept_tastes

Inspecting Original Wine Vectors:
Type: <class 'list'>, Length: 343
----------------------------------------
Inspecting Cleaned Wine Vectors:
Shape: (343, 300), NaNs: 0
----------------------------------------
Inspecting Original Wine Vectors:
Type: <class 'list'>, Length: 343
----------------------------------------
Inspecting Cleaned Wine Vectors:
Shape: (343, 300), NaNs: 0
----------------------------------------
Inspecting Original Wine Vectors:
Type: <class 'list'>, Length: 338
----------------------------------------
Inspecting Cleaned Wine Vectors:
Shape: (338, 300), NaNs: 0
----------------------------------------
Inspecting Original Wine Vectors:
Type: <class 'list'>, Length: 343
----------------------------------------
Inspecting Cleaned Wine Vectors:
Shape: (343, 300), NaNs: 0
----------------------------------------
Inspecting Original Wine Vectors:
Type: <class 'list'>, Length: 0
----------------------------------------
Skipping salt: No valid vectors found for salt after 

In [None]:
aroma_descriptors_copy = aroma_descriptors.copy()
aroma_descriptors_copy.set_index('index', inplace=True)
aroma_descriptors_copy.dropna(inplace=True)

aroma_descriptors_copy = pd.DataFrame(aroma_descriptors_copy['value'].tolist(), index=aroma_descriptors_copy.index)
aroma_descriptors_copy.columns = ['descriptors', 'relative_frequency']
aroma_descriptors_copy.to_csv('wine_variety_descriptors.csv')