# Wine Recommender System

# Summary
Steps I've completed:
- data collection
- exploratory data analysis
- text preprocessing
    - removal of stop words
    - stemming with SnowballStemmer
    - document-term matrix (cv and tfidf)
    - use of unigrams and bigrams
- topic modeling
    - LSA
    - NMF
    - LDA
- content-based filtering

Still in Progress:
- topic tuning
- making recommendations

Next Steps:
- LightFM for collaborative filtering
- try CorEx from the Fancy NLP lesson

# Import Data

In [1]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import string
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.metrics import pairwise_distances
from gensim import corpora, matutils
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

In [2]:
# read in the data
data = pd.read_csv("wine_data.csv", index_col=0)

In [3]:
# drop some rows and columns
data.drop_duplicates(inplace=True)
data.drop(columns=["region_2", "taster_twitter_handle"], inplace=True)
data.dropna(inplace=True)
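
`dropna()` with no arguments removes every row that has a missing value in any remaining column, so it can discard reviews whose description is perfectly usable. The sketch below compares it with restricting the check to the text column. The frame is made up for illustration: the `price` column and all values are assumptions, not taken from `wine_data.csv`.

```python
import pandas as pd

# toy frame: one exact duplicate row, one missing price, one missing description
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "description": ["ripe cherry", "ripe cherry", "crisp apple", None],
    "price": [20.0, 20.0, None, 15.0],
})

toy = toy.drop_duplicates()

# drops any row with a missing value in any column
strict = toy.dropna()

# drops only rows whose description is missing
lenient = toy.dropna(subset=["description"])

print(len(strict), len(lenient))
```

Rows are not re-indexed after dropping, so positional lookups (`.values[i]`) and label lookups (`.loc[i]`) can point at different wines afterwards.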

# Text Preprocessing

In [4]:
# define the corpus as all descriptions
corpus = data.description.values

In [5]:
# define English stop words and add a few wine-related ones
with open("stop_words_english.txt", "r") as f:
    stopwords = [s.strip() for s in f]

stopwords.extend(["wine", "wines", "drink", "drinks", "drank", "drunk", "palate", "palates", "aroma", "aromas",
                  "flavor", "flavors", "note", "notes", "finish"])

In [6]:
# clean the corpus
def prep(doc, stemmer=SnowballStemmer("english"), stopwords=stopwords):
    # remove tokens containing digits, dashes, curly quotes and punctuation; lowercase the text
    doc = re.sub(r"\w*\d\w*", " ", doc)
    doc = re.sub("—", " ", doc)
    doc = re.sub('[“”]', " ", doc)
    doc = re.sub("[%s]" % re.escape(string.punctuation), " ", doc.lower())
    doc = doc.split()
    
    # drop stop words, then stem each remaining word
    final = [stemmer.stem(word) for word in doc if word not in stopwords]
    return " ".join(final)

vfunc = np.vectorize(prep)
corpus = vfunc(corpus)
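
To see what the regex steps in `prep` do before stop-word removal and stemming, here is a small standalone sketch. `strip_noise` is a hypothetical helper that repeats only the substitution lines, and the sample sentence is invented.

```python
import re
import string

def strip_noise(doc):
    # drop any token that contains a digit (years, percentages, scores)
    doc = re.sub(r"\w*\d\w*", " ", doc)
    # drop em-dashes and curly quotes, which string.punctuation does not cover
    doc = re.sub("—", " ", doc)
    doc = re.sub("[“”]", " ", doc)
    # lowercase, then replace ASCII punctuation with spaces
    doc = re.sub("[%s]" % re.escape(string.punctuation), " ", doc.lower())
    return doc.split()

tokens = strip_noise("It's a 2013 “Frappato”—fresh, with 13% alcohol.")
print(tokens)
# ['it', 's', 'a', 'frappato', 'fresh', 'with', 'alcohol']
```

Contractions such as "It's" leave a stray `s` token, which stays in the corpus unless the stop-word list covers it.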

In [7]:
# run prep a second time to catch stop words that only appear after stemming (e.g. "drinking" -> "drink")
corpus = vfunc(corpus)
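
A second pass helps because `prep` filters on the raw word and only then stems it, so a word can be stemmed into a stop word. The sketch below shows this with `toy_stem`, a deliberately crude stand-in for `SnowballStemmer` (an assumption made to keep the example dependency-free), and a four-word stop list. It also shows a possible single-pass alternative that checks the stem as well.

```python
def toy_stem(word):
    # crude suffix stripper, NOT the Snowball algorithm
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stop_words = {"drink", "wine", "with", "and"}

def one_pass(text):
    # same order as prep: filter on the raw word, then stem
    return " ".join(toy_stem(w) for w in text.split() if w not in stop_words)

def filter_after_stemming(text):
    # alternative: also drop words whose stem is a stop word
    stems = (toy_stem(w) for w in text.split() if w not in stop_words)
    return " ".join(s for s in stems if s not in stop_words)

doc = "easy drinking wine with soft tannins"
first = one_pass(doc)                 # 'easy drink soft tannin'
second = one_pass(first)              # 'easy soft tannin'
single = filter_after_stemming(doc)   # 'easy soft tannin'
print(first, "|", second, "|", single)
```

The single-pass variant avoids stemming text that has already been stemmed, which may matter if the stemmer changes some stems again on a second run.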

In [8]:
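# create the document-term matrix (unigrams and bigrams)
# min_df=2 keeps terms found in at least 2 documents; max_df=0.4 drops terms found in more than 40% of them
tfidf = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.4)
x_tfidf = tfidf.fit_transform(corpus)
# note: toarray() densifies the sparse matrix, which can be memory-hungry for a large vocabulary
doc_term_tfidf = pd.DataFrame(x_tfidf.toarray(), columns=tfidf.get_feature_names_out())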

# Topic Modeling

In [9]:
# function to display the top n terms in each topic and their categories
def display_topics(model, feature_names, no_top_words, topic_names=None): 
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("Topic ", ix + 1)
        else:
            print("Topic: ", topic_names[ix])
        print(", ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        print("")
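
The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` is compact but easy to misread: `argsort` orders indices by ascending weight, and the negative-step slice walks back from the end to pick the `no_top_words` largest. A small sketch with made-up weights and terms:

```python
import numpy as np

# made-up topic weights for five terms
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9])
terms = ["oak", "cherri", "lemon", "tannin", "black"]
n = 3

# ascending argsort, then take the last n indices in reverse order
top = weights.argsort()[:-n - 1:-1]

# equivalent and arguably easier to read: sort descending, keep the first n
top_alt = np.argsort(-weights)[:n]

print([terms[i] for i in top])
# ['black', 'cherri', 'tannin']
```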

In [10]:
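# fit the model and display the top terms for each topic
# (topic names are hand-assigned labels; they only line up with the topics of this particular fit)
nmf = NMF(n_components=9)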
nmf.fit(x_tfidf)
display_topics(nmf, tfidf.get_feature_names_out(), 5, ["Black fruit", "Tree fruit", "Varietal wine", "Full body",
                                                       "Red fruit", "Vinification", "Other", "Woody", "Medium body"])



Topic:  Black fruit
black, black cherri, cherri, tannin, pepper

Topic:  Tree fruit
appl, white, lemon, peach, pear

Topic:  Varietal wine
cabernet, cabernet sauvignon, sauvignon, merlot, blend

Topic:  Full body
age, rich, ripe, structur, wood

Topic:  Red fruit
fruiti, red, acid, crisp, soft

Topic:  Vinification
cherri, vineyard, nose, dri, bottl

Topic:  Other
berri, plum, feel, herbal, berri fruit

Topic:  Woody
oak, vanilla, toast, french, french oak

Topic:  Medium body
bodi, medium, medium bodi, textur, sweet



In [11]:
# project each document onto the 9 topics (NumPy array of shape n_documents x 9)
doc_topic = nmf.transform(x_tfidf)
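
To make the shapes concrete, the sketch below runs the same TF-IDF and NMF steps on a five-document toy corpus of pre-stemmed text. The corpus, `n_components=2`, `init`, `random_state` and `max_iter` are all assumptions chosen for illustration, not settings used above.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# invented, already-stemmed descriptions: three "dark fruit" and two "orchard fruit"
toy_corpus = [
    "black cherri tannin pepper",
    "black cherri plum tannin",
    "black cherri tannin",
    "appl lemon pear peach",
    "appl pear lemon crisp",
]

toy_tfidf = TfidfVectorizer()
x_toy = toy_tfidf.fit_transform(toy_corpus)

# random_state is fixed so that repeated runs give the same topics
toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
doc_topic_toy = toy_nmf.fit_transform(x_toy)

print(doc_topic_toy.shape)         # (5, 2): one row per document, one column per topic
print(toy_nmf.components_.shape)   # (2, 10): one row per topic, one column per term

# dominant topic for each document
dominant = doc_topic_toy.argmax(axis=1)
print(dominant)
```

Each row of `doc_topic_toy` is a low-dimensional description of one document, which is what the distance-based matching in the next section works on.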

# Recommender Systems

In [12]:
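# find the wine closest to the indicated one
def close_match(new_coords, wine_index, data):
    # Euclidean distance (the pairwise_distances default) from the chosen wine to every wine
    dist = pairwise_distances(new_coords[wine_index].reshape(1,-1), new_coords)[0]
    # exclude the wine itself explicitly instead of assuming it sorts first
    dist[wine_index] = np.inf
    closest = dist.argmin()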
    print("Wine to Match")
    print("")
    print("Title:")
    print(data.title.values[wine_index])
    print("")
    print("Description:")
    print(data.description.values[wine_index])
    print("")
    print("")
    print("")
    print("Closest Match")
    print("")
    print("Title:")
    print(data.title.values[closest])
    print("")
    print("Description:")
    print(data.description.values[closest])

In [13]:
close_match(doc_topic, 3, data)

Wine to Match

Title:
Terre di Giurfo 2013 Belsito Frappato (Vittoria)

Description:
Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory herb that carry over to the palate. It's balanced with fresh acidity and soft tannins.



Closest Match

Title:
Aldegheri 2016 Tenuta Villa Cariola  (Bardolino)

Description:
Aromas of charcuterie, smoke and grilled herb lead the nose. On the simple palate, a note of white pepper accents a red-berry core. It's easy drinking, with fresh acidity and soft tannins.
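
Since making recommendations is still in progress, here is one way the single-match lookup could be extended to a ranked list. This is a sketch under assumptions: `top_k_matches` is a hypothetical helper, the topic coordinates are invented, and plain NumPy Euclidean distance stands in for `pairwise_distances`.

```python
import numpy as np

def top_k_matches(coords, wine_index, k=3):
    """Return the indices of the k rows of coords closest to row wine_index."""
    diffs = coords - coords[wine_index]
    dist = np.sqrt((diffs ** 2).sum(axis=1))
    # never recommend the query wine itself
    dist[wine_index] = np.inf
    return np.argsort(dist, kind="stable")[:k]

# invented document-topic rows; row 3 is identical to row 0
toy = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.9, 0.1, 0.0],
])

matches = top_k_matches(toy, 0, k=2)
print(matches)
# [3 1]
```

Masking the query index matters when another wine has exactly the same topic vector: sorting alone would not guarantee that the query lands in first position.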
