# Simple diversity score

### The dataset
This assignment uses the [Wikipedia Movie Plots](https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots) dataset. The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

- Release Year - Year in which the movie was released
- Title - Movie title
- Origin/Ethnicity - Origin of movie (e.g. American, Bollywood, Tamil)
- Director - Director(s)
- Cast - Main actors and actresses
- Genre - Movie Genre(s)
- Wiki Page - URL of the Wikipedia page from which the plot description was scraped
- Plot - Long form description of movie plot (WARNING: May contain spoilers!!!)

### Assignment 1
#### Data preparation
- Load the dataset and explore its structure.
- Clean and preprocess the data as needed. Focus on the 'Plot' and 'Genre' columns for this assignment.

Note that the preprocessing steps are chosen to demonstrate some steps you COULD apply, and are not 
necessarily optimal for the recommender system.


In [None]:
import pandas as pd
import os.path as op
import re

fp = op.join('.', 'data', 'wiki_movie_plots_deduped.csv')
df = pd.read_csv(fp, sep=',')

def reduce_genre(genre):
    ''' Reduce multiple genres to a single one.
    Assumes the main genre is mentioned first.
    Various separators are used (",", "_", "/", " "),
    so the split is performed on any non-alphanumeric character.
    Note that "_" counts as a word character in regular expressions,
    so it is matched explicitly.
    '''
    first_word_pattern = r'([^\W_]+)[\W_]+.*'
    match = re.search(first_word_pattern, genre)
    if match:
        return match.group(1)
    return genre

# In case of multiple genres, keep only the first
df['Genre'] = df['Genre'].apply(reduce_genre)

# Discard movies with 'unknown' genre
df = df[df['Genre'] != 'unknown']

# Remove movies whose genre occurs fewer than N times; they will be hard to recommend
genre_cutoff = 5
genre_counts = df['Genre'].value_counts()
df = df[~df['Genre'].isin(genre_counts[genre_counts < genre_cutoff].index)]
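
The effect of these cleaning steps is easier to see on a handful of rows. The sketch below is a minimal illustration on made-up genre strings; `first_genre` and the `toy` frame are hypothetical stand-ins, not part of the assignment code. It also shows a side effect worth knowing about: hyphenated genres such as "sci-fi" are cut down to their first part.

```python
import re

import pandas as pd

def first_genre(genre):
    """Return the first alphanumeric token of a genre string (hypothetical helper)."""
    tokens = [t for t in re.split(r'[\W_]+', genre.strip()) if t]
    return tokens[0] if tokens else genre

toy = pd.DataFrame(
    {'Genre': ['comedy, drama', 'comedy/romance', 'sci-fi', 'unknown', 'drama']}
)
toy['Genre'] = toy['Genre'].apply(first_genre)
reduced = toy['Genre'].tolist()

# Same filtering logic as above, with a cutoff of 2 for the toy data
toy = toy[toy['Genre'] != 'unknown']
counts = toy['Genre'].value_counts()
toy = toy[~toy['Genre'].isin(counts[counts < 2].index)]

print(reduced)
print(toy['Genre'].tolist())
```

Only the genre that still occurs at least twice after reduction survives the cutoff.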

#### Feature engineering
- Combine relevant features to create a unified text representation for each movie. Suggested features to combine include 'Plot' and 'Genre'.

In [None]:
# We will consider 'Description' a combination of genre, director, and plot
df['Description'] = df['Genre'] + ' ' +  df['Director'] + ' ' +  df['Plot']
df['Description'].head()
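
One thing to check before vectorizing: if any of the combined columns contains a missing value, string concatenation in pandas yields a missing value for the whole description. Whether this affects the dataset depends on its contents, so the sketch below uses a made-up `toy` frame to show the behaviour and one possible guard (`fillna('')`).

```python
import pandas as pd

toy = pd.DataFrame({
    'Genre': ['drama', 'comedy'],
    'Director': ['Jane Doe', None],  # hypothetical row with a missing director
    'Plot': ['A family loses the farm.', 'A detective adopts a dog.'],
})

# Naive concatenation: one missing part makes the whole description missing
naive = toy['Genre'] + ' ' + toy['Director'] + ' ' + toy['Plot']

# Guarded concatenation: replace missing directors with an empty string first
safe = toy['Genre'] + ' ' + toy['Director'].fillna('') + ' ' + toy['Plot']

print(naive.isna().tolist())
print(safe.tolist())
```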

#### Vectorization:
  - Use TF-IDF (Term Frequency-Inverse Document Frequency) to vectorize the combined text features of each movie.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    strip_accents='unicode',
    lowercase=True,
    stop_words='english',
)

movie_vectors = vectorizer.fit_transform(df['Description'])

In [None]:
print(movie_vectors.shape)
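# get_stop_words() returns the effective stop word list; the fitted attribute
# stop_words_ only holds terms dropped because of max_df, min_df or max_features
print(vectorizer.get_stop_words())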

#### Similarity Calculation:
  - Calculate the cosine similarity between movies based on their vectorized features.

In [None]:
# We have shown that we can implement cosine similarity ourselves,
# so let's just use a package now

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Test the function, the similarity of the same row should be 1
print(cosine_similarity(movie_vectors[0], movie_vectors[0]))

# Calculate similarities between all movies
# Depending on your preprocessing, this can take a while
similarity_matrix = cosine_similarity(movie_vectors, movie_vectors)

In [None]:
# Note that the shape of the similarities equals
# N_movies x N_movies
similarity_matrix.shape
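
As a reminder of what the library call computes, here is a minimal NumPy sketch of pairwise cosine similarity on a small dense array. The function name `cosine_similarity_matrix` and the `toy` array are illustrative assumptions; the sketch works on dense input only, whereas the TF-IDF matrix above is sparse.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    X_unit = X / norms       # scale every row to unit length
    return X_unit @ X_unit.T # dot products of unit vectors are cosines

toy = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

The result is symmetric with ones on the diagonal. Keep in mind that the full matrix is dense: for N movies it holds N x N floating point values, which for tens of thousands of movies amounts to several gigabytes of memory.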

#### Diversity Scoring:

  - For each movie, calculate a diversity score as $\text{Diversity Score} = 1 - \text{average cosine similarity}$ with all other movies.

In [None]:
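similarities_df = pd.DataFrame(similarity_matrix, index=df.index, columns=df.index)
# Average over all *other* movies: leave out each movie's similarity with itself (the diagonal)
n_movies = len(similarities_df)
similarity_avg = (similarities_df.sum(axis=1) - np.diag(similarity_matrix)) / (n_movies - 1)
df['diversity'] = 1 - similarity_avg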
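
A worked toy example of the formula, assuming that "all other movies" means the self-similarity on the diagonal is left out of the average. The 3 x 3 matrix `toy_sim` is made up for illustration.

```python
import numpy as np
import pandas as pd

toy_sim = pd.DataFrame(
    [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.4],
     [0.2, 0.4, 1.0]],
    index=['A', 'B', 'C'],
    columns=['A', 'B', 'C'],
)

n = len(toy_sim)
self_sim = np.diag(toy_sim.to_numpy())            # similarity of each item with itself
avg_other = (toy_sim.sum(axis=1) - self_sim) / (n - 1)
toy_diversity = 1 - avg_other
print(toy_diversity)
```

Item C is the least similar to the rest, so it receives the highest diversity score (0.7), while B, which resembles both other items, receives the lowest (0.4).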

#### Recommendation Generation:
  - Implement a function to generate movie recommendations for a given movie. The recommendations should be based on both similarity (movies should be similar to the given movie) and diversity (recommendations should be diverse).

In [None]:
import functools

from sklearn.preprocessing import MinMaxScaler

def get_similarities(movie_id):
    '''Gets similarity scores for all other movies'''
    # drop() returns a new Series, so similarities_df itself is left untouched
    similarities = similarities_df.loc[movie_id].drop(movie_id)
    return similarities.rename('similarity').to_frame()

def scale_features(movie_df):
    '''Scales relevant features to domain [0, 1]'''
    scaler = MinMaxScaler((0,1))
    scaler.fit(movie_df.similarity.to_frame())
    movie_df['similarity_scaled'] = scaler.transform(movie_df.similarity.to_frame())

    scaler.fit(movie_df.diversity.to_frame())
    movie_df['diversity_scaled'] = scaler.transform(movie_df.diversity.to_frame())

    return movie_df

def weighted_score(movie, similarity_weight, diversity_weight):
    '''Calculates weighted average for relevant (scaled) features'''
    sw = movie['similarity_scaled'] * similarity_weight
    dw = movie['diversity_scaled'] * diversity_weight
    total_weights = similarity_weight + diversity_weight
    return (sw + dw) / total_weights

def recommend_movies(movie_id, n_results=10, diversity_factor=0.5, similarity_factor=1):
    '''Returns the n_results movies with the highest weighted score for movie_id'''
    similarities = get_similarities(movie_id)
    
    # dataframe with relevant features for a single movie
    movie_df = df.drop(movie_id, axis='rows').join(similarities)
    movie_df = scale_features(movie_df)

    # calculate the weighted score
    weight_func = functools.partial(weighted_score, 
                                    similarity_weight=similarity_factor,
                                    diversity_weight=diversity_factor)
    movie_df['recommender_score'] = movie_df.apply(weight_func, axis='columns')

    return movie_df.sort_values('recommender_score', ascending=False).head(n_results)
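
The scoring step can be traced by hand on three candidate movies. The sketch below uses a small pandas helper, `min_max`, in place of `MinMaxScaler`; for the range [0, 1] both compute (x - min) / (max - min). The candidate values and names are made up for illustration.

```python
import pandas as pd

def min_max(series):
    """Rescale a Series to [0, 1]; in this helper a constant Series maps to 0."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span

candidates = pd.DataFrame(
    {'similarity': [0.9, 0.5, 0.1], 'diversity': [0.2, 0.6, 1.0]},
    index=['near', 'middle', 'far'],
)
sim_scaled = min_max(candidates['similarity'])
div_scaled = min_max(candidates['diversity'])

similarity_weight, diversity_weight = 1.0, 0.5
candidates['score'] = (
    sim_scaled * similarity_weight + div_scaled * diversity_weight
) / (similarity_weight + diversity_weight)

print(candidates.sort_values('score', ascending=False))
```

With these weights the most similar candidate still ranks first (scores of about 0.67, 0.50 and 0.33). Raising `diversity_weight` above `similarity_weight` reverses the order in this toy example.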

#### Analysis:
  - Analyze the differences between the recommendations generated with and without the diversity adjustment.
  - Discuss how the diversity_factor influences the recommendation outcomes.
  - Reflect on the potential benefits and drawbacks of introducing diversity into recommendation systems.

In [None]:

def display_recommendations(movie_id, diversity_factors=(0, .25, .5, .75, 1)):
    recommendations = [
        recommend_movies(movie_id, 10, f)[['Title', 'recommender_score']].values.tolist()
        for f in diversity_factors
    ]

    # Look up the title via movie_id rather than relying on a global variable
    print('Recommendations for movie "{}":\n'.format(df.loc[movie_id, 'Title']))
    print(*recommendations, sep='\n')

# Select a random movie
random_movie = df.sample(1)
movie_id = random_movie.index[0]
# and show recommendations for various diversity factors
display_recommendations(movie_id)

### Assignment 2

* Similarity Score Calculation:
  - Implement a function to calculate cosine similarity scores for items in your dataset. Use scikit-learn's cosine similarity function or write your own.
  - Store the cosine similarity scores in a matrix.

* Diversity Metric Implementation:
  - Implement the diversity_metric function as provided in the code below. This function should adjust the similarity scores based on the diversity_factor to introduce diversity into the recommendations.
  - Experiment with different values of diversity_factor (e.g., 0.2, 0.4, 0.6) and observe how it affects the diversity of recommendations.

* Recommendation Generation:
  - Select a few items (e.g., movies or books) as the basis for generating recommendations.
  - Apply the diversity_metric to the cosine similarity matrix to adjust the similarity scores.
  - For each selected item, generate two sets of recommendations: one using the adjusted similarity scores and another using the original cosine similarity scores.
  - Compare the two sets of recommendations to evaluate the impact of the diversity adjustment.

* Analysis:
  - Analyze the differences between the recommendations generated with and without the diversity adjustment.
  - Discuss how the diversity_factor influences the recommendation outcomes.
  - Reflect on the potential benefits and drawbacks of introducing diversity into recommendation systems.

In [None]:
import numpy as np

def diversity_metric(similarities, diversity_factor=0.4):
    """
    Adjust similarity scores based on a diversity factor.
    A higher diversity_factor promotes more diverse recommendations.
    """
    if diversity_factor == 0:
        return similarities
    # Apply diversity adjustment
    adjusted_similarities = similarities - diversity_factor * np.random.rand(*similarities.shape)
    return adjusted_similarities
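
Because the adjustment subtracts random noise, the output changes on every run. For experiments it can be useful to make it reproducible. The sketch below is a variant under stated assumptions: `noisy_similarities` and its `seed` parameter are hypothetical additions, and it uses `np.random.default_rng` instead of `np.random.rand`.

```python
import numpy as np

def noisy_similarities(similarities, diversity_factor=0.4, seed=None):
    """Subtract uniform noise in [0, diversity_factor) from each score.

    `seed` is an added, hypothetical parameter for reproducibility.
    """
    if diversity_factor == 0:
        return similarities
    rng = np.random.default_rng(seed)
    return similarities - diversity_factor * rng.random(similarities.shape)

scores = np.array([0.90, 0.85, 0.80, 0.30, 0.10])
rankings = {}
for factor in (0, 0.2, 0.6):
    adjusted = noisy_similarities(scores, factor, seed=0)
    rankings[factor] = np.argsort(-adjusted).tolist()
    print(factor, rankings[factor])
```

Note that the noise does not use any information about the items themselves. Larger factors make it more likely that items with lower similarity overtake the top ones, but which items move up is left to chance.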

In [None]:
# Some input data
interesting_movies = ['Toy Story', 'Notting Hill', 'Mulholland Drive']
original_similarities = similarity_matrix # reuse this from earlier

In [None]:
def create_similarity_frames(similarities, diversity_factors=(0, .2, .4, .6, .8, 1)):
    '''Creates a dict mapping each diversity factor to its adjusted similarity matrix.
    diversity_factor = 0 gives the unaltered similarities'''
    return {f: diversity_metric(similarities, f) for f in diversity_factors}

# Generate diversified similarity matrices
# This can take a while
diversified_similarities = create_similarity_frames(original_similarities, [0, .4])

In [None]:
def recommend(movie_title, diversified_similarities, movies_df, N_results=10):
    '''Recommend movies using the diversity metric.
    Note that this implementation is written mostly in Numpy.
    Converting to a Pandas dataframe increased calculation time so much it made the kernel crash.
    
    If you manage to do it in Pandas, power to you!
    '''
    def get_title(idx):
        return movies_df.iloc[idx]['Title']

    # The similarity matrices are plain Numpy arrays, so we need the row *position*
    # of the movie; after filtering, the index labels of df no longer match positions
    matches = np.flatnonzero((movies_df['Title'] == movie_title).to_numpy())
    if matches.size == 0:
        print('No movie titled "{}" found\n'.format(movie_title))
        return
    movie_index = matches[0]
    for factor, similarities in diversified_similarities.items():
        print('Recommendations for {}, diversity factor {}:'.format(movie_title, factor))
        # get the similarities for this movie
        recommendations = similarities[movie_index]
        # retrieve N best
        n_best = (-recommendations).argsort()[:N_results]
        # look up the titles in movies_df
        for idx in n_best:
            print(f'{get_title(idx)} ({recommendations[idx]})')
        print('\n')

for movie_title in interesting_movies:
    recommend(movie_title, diversified_similarities, df, 5)
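
For the analysis, comparing the printed lists by eye works for a few movies. One possible way to quantify the difference is the Jaccard overlap between two recommendation sets. The sketch below is a suggestion rather than part of the assignment; the helper `jaccard_overlap` and the placeholder titles are hypothetical.

```python
def jaccard_overlap(recs_a, recs_b):
    """Share of titles that two recommendation lists have in common (0 to 1)."""
    a, b = set(recs_a), set(recs_b)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)

# Placeholder titles standing in for real recommendation output
baseline = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
diversified = ['Movie A', 'Movie C', 'Movie E', 'Movie F']

overlap = jaccard_overlap(baseline, diversified)
print(round(overlap, 3))
```

An overlap of 1 means the diversity adjustment changed nothing, and values near 0 mean the two lists share almost no titles. Here two of six distinct titles are shared, giving roughly 0.333.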