# Machine Learning Fundamentals

Today we will implement two simple content-based recommender systems. During the semester we will have a lecture on recommender systems, where we look in depth at content-based and collaborative filtering approaches.

The main goal of this exercise is to understand why similarities and normalization are so important for machine learning.

In [None]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import skimage

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

from tqdm.notebook import tqdm
import ipywidgets as widgets

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

%matplotlib inline

## Exercise 2 - Simple recommender system for wine based on similarities

We now move from cars to wine.
Kaggle hosts a popular [dataset](https://www.kaggle.com/zynicide/wine-reviews) of 130,000 wine reviews. We want to build a recommender system that recommends a wine based on its description.

### Prepare the dataset

In [None]:
df = pd.read_csv("winemag-data-10k.csv")
df.head()

Next, we check whether the dataset contains any duplicates. Since it does, we remove them and then check again.

In [None]:
print('dataset contains duplicates: %s' % df.duplicated().any())
print('len dataset: %s' % str(len(df)))
df.drop_duplicates(inplace=True)
print('dataset contains duplicates (after cleaning): %s' % df.duplicated().any())
print('len dataset (after cleaning): %s' % str(len(df)))
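
To see what `duplicated()` and `drop_duplicates()` do, here is a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from the wine dataset). By default, `duplicated()` flags only the second and later occurrences of a row, and `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the first one exactly.
toy = pd.DataFrame({
    "title": ["wine a", "wine b", "wine a"],
    "description": ["ripe cherry", "crisp apple", "ripe cherry"],
})

# duplicated() marks repeats of an earlier row; the first occurrence stays False.
flags = toy.duplicated().tolist()

# drop_duplicates() keeps the first occurrence and leaves the index labels untouched.
deduped = toy.drop_duplicates()
remaining_labels = deduped.index.tolist()
print(flags, remaining_labels)
```

Because the index labels are not renumbered, a cleaned frame can have gaps in its index.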

The description and title columns mix upper- and lowercase letters, so we convert both to lowercase. This matters for the next step, where we remove stopwords by looking them up in a dictionary: if we did not lowercase the text first, capitalized stopwords such as "The" would not match and would be missed.

In [None]:
df['description'] = df['description'].str.lower()
df['title'] = df['title'].str.lower()
df.head()
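
The effect of lowercasing on stopword removal can be sketched without NLTK. The stopword set below is a hypothetical miniature list, and the sentence is invented for illustration.

```python
# Hypothetical miniature stopword list, all lowercase.
toy_stop_words = {"the", "and", "with"}

tokens = "The wine opens with Cherry And spice".split()

# Without lowercasing, capitalized stopwords slip through the filter.
kept_raw = [t for t in tokens if t not in toy_stop_words]

# Lowercasing first makes the membership test match.
kept_lower = [t for t in (tok.lower() for tok in tokens) if t not in toy_stop_words]

print(kept_raw)
print(kept_lower)
```

Only `with` is removed in the first case, while the lowercased version also drops `the` and `and`.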

We normalize each description by tokenizing it, dropping stopwords and tokens of three characters or fewer, and then applying lemmatization and stemming to the remaining tokens.

In [None]:
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()  # create once instead of once per token
stop_words = set(stopwords.words('english'))

def lemmatize_stemming(text):
    # lemmatize the token as a verb, then reduce it to its stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

def normalize(text):
    tokens = nltk.word_tokenize(text)
    # drop stopwords and tokens of three characters or fewer
    result = [lemmatize_stemming(token) for token in tokens
              if token not in stop_words and len(token) > 3]
    return result

df["normalized"] = df.description.apply(normalize)

Now our dataset looks like this.

In [None]:
df.head()

As before, we hold out part of our data as a test set: all rows from position 9960 onward.

In [None]:
train = df.iloc[:9960]
test = df.iloc[9960:]

X_train = train["normalized"].values
X_test = test["normalized"].values
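
Note that `iloc` slices by position, not by index label. A minimal sketch with a hypothetical frame whose index has a gap (as can happen after `drop_duplicates()`) illustrates this:

```python
import pandas as pd

# Hypothetical frame whose index label 2 is missing, as after removing a duplicate row.
toy = pd.DataFrame({"normalized": [["cherri"], ["appl"], ["oak"], ["plum"]]},
                   index=[0, 1, 3, 4])

# iloc counts rows by position, so the gap in the labels does not matter.
toy_train = toy.iloc[:3]
toy_test = toy.iloc[3:]
print(toy_train.index.tolist(), toy_test.index.tolist())
```

This is also why a position found in `X_train` can later be looked up with `train.iloc[...]`.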

To calculate our similarities, we only consider the normalized descriptions. For the first wine in our training set it looks like this:

In [None]:
X_train[0]

### Define similarity

To compare two wines, we use the Jaccard similarity.
> Implement the Jaccard similarity.
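
For reference, the usual definition for two sets $A$ and $B$ (here, the sets of normalized tokens of two descriptions) is:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1$$

The value is 1 when both sets are identical and 0 when they share no element. It is undefined when both sets are empty.
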

In [None]:
def jaccard_similarity(list1, list2):
    similarity = 0
    # START YOUR CODE
    
    # END YOUR CODE
    return similarity

*Click on the dots to display the solution*

In [None]:
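def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    union = s1.union(s2)
    # two empty token lists would divide by zero; returning 0.0 is a convention chosen here
    if not union:
        return 0.0
    return len(s1.intersection(s2)) / len(union)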
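
A small worked example may help to check the solution by hand. The token lists below are hypothetical. One token is repeated on purpose to show that converting to sets ignores duplicates.

```python
# Hypothetical token lists; "cherri" appears twice in the first list on purpose.
wine_a = ["cherri", "oak", "cherri", "spice"]
wine_b = ["cherri", "plum", "spice", "tannin"]

set_a, set_b = set(wine_a), set(wine_b)

shared = set_a & set_b      # tokens that occur in both descriptions
combined = set_a | set_b    # tokens that occur in either description

toy_similarity = len(shared) / len(combined)
print(sorted(shared), len(combined), toy_similarity)
```

Two of the five distinct tokens are shared, so the similarity is 2/5.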

If your code is correct, the following cell should run without an error.
> Verify your code

In [None]:
expected_similarity = 0.05405405405405406
similarity = jaccard_similarity(X_train[0], X_train[1])

np.testing.assert_allclose(similarity, expected_similarity)

### Visualize similarities
We can visualize the similarity between the wines using a heatmap. Let us show how similar the first 40 wines are to each other. The darker the box in the heatmap, the more similar the two wines are.

In [None]:
dm = np.asarray([[jaccard_similarity(p1, p2)
                  for p1 in X_train[0:40]]
                 for p2 in X_train[0:40]])
fig, ax = plt.subplots(figsize=(18, 12))
sns.heatmap(dm, linewidths=0.5, cmap="YlGnBu", ax=ax)
plt.show()
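
Two properties of such a similarity matrix are worth knowing when reading the heatmap: the diagonal is always 1 (every wine is identical to itself), and the matrix is symmetric. The sketch below checks both on hypothetical token lists. `toy_jaccard` is a stand-in helper so that the block runs on its own.

```python
import numpy as np

def toy_jaccard(a, b):
    # Set-based Jaccard similarity, repeated here so the block is self-contained.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical normalized descriptions.
docs = [["cherri", "oak"], ["cherri", "plum"], ["appl", "citrus"]]

sim = np.array([[toy_jaccard(p, q) for q in docs] for p in docs])

diagonal_is_one = np.allclose(np.diag(sim), 1.0)   # each wine compared with itself
is_symmetric = np.allclose(sim, sim.T)             # argument order does not matter
print(diagonal_is_one, is_symmetric)
```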

### Find most similar wine
> Now implement the `nearest_neighbor` function, which returns the index of the most similar wine and the similarity value. *Hint: We now use a similarity measure instead of a distance measure.*

In [None]:
def nearest_neighbor(wine, wines):
    idx = -1
    similarity = -1
    # START YOUR CODE
    
    # END YOUR CODE
    return idx, similarity

*Click on the dots to display the solution*

In [None]:
def nearest_neighbor(wine, wines):
    similarities = np.array([jaccard_similarity(wine, w) for w in wines])
    # we are working with similarities now, so the best match is the maximum, not the minimum
    similarity = np.max(similarities)
    idx = np.argmax(similarities)
    return idx, similarity
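
A recommender often returns several candidates rather than one. As a possible extension, and not part of the exercise, the sketch below ranks all wines by similarity and keeps the best `k`. The names `toy_jaccard` and `top_k_neighbors` and the toy catalog are hypothetical.

```python
import numpy as np

def toy_jaccard(a, b):
    # Set-based Jaccard similarity, repeated here so the block is self-contained.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def top_k_neighbors(wine, wines, k=3):
    # Hypothetical helper: positions and scores of the k most similar wines.
    sims = np.array([toy_jaccard(wine, w) for w in wines])
    # Sort by descending similarity; a stable sort keeps the original order among ties.
    order = np.argsort(-sims, kind="stable")[:k]
    return order.tolist(), sims[order].tolist()

catalog = [["cherri", "oak"], ["cherri", "plum", "spice"],
           ["appl", "citrus"], ["cherri", "spice"]]
query = ["cherri", "spice", "oak"]

indices, scores = top_k_neighbors(query, catalog, k=2)
print(indices, scores)
```

Here the wines at positions 0 and 3 tie with a similarity of 2/3. `np.argmax`, as used in the solution, would return only the first of them.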

If your code is correct, the following cell should run without an error.

In [None]:
wine = X_test[0]

expected_similarity = 0.30303030303030304
idx, similarity = nearest_neighbor(wine, X_train)
np.testing.assert_allclose(similarity, expected_similarity)

most_similar = train.iloc[[idx]]

print("Most similar wine: {}".format(most_similar.title.values[0]))
print("Similarity: ", similarity)

### Play around with our recommender system
We can now use our recommender system: pick a wine from the test set, and it recommends the wine from the training set with the most similar description.

In [None]:
@widgets.interact(wine_title=widgets.Dropdown(options=sorted(test.title), description="Wine title"))
def recommend(wine_title):
    wine = X_test[(test["title"] == wine_title).to_numpy()][0]
    idx, similarity = nearest_neighbor(wine, X_train)
    
    most_similar = train.iloc[[idx]]
    print("Most similar wine: '{}' with a similarity of {:.2f}".format(most_similar.title.values[0], similarity))

## Assignment

Now answer the ILIAS quiz **Machine Learning Fundamentals**.