In [None]:
import numpy as np 
import pandas as pd 
import pickle

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Recommendation System
A recommendation system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item.

Recommender systems are used in a variety of areas, such as playlist generators for video and music services, product recommenders for online stores, and content recommenders for social media platforms and the open web. These systems can operate using a single input, like music, or multiple inputs within and across platforms, like news, books, and search queries. There are also popular recommender systems for specific topics like restaurants and online dating.

Recommender systems usually make use of collaborative filtering, content-based filtering, or both, as well as other systems such as knowledge-based systems. __Collaborative filtering__ approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. __Content-based filtering__ approaches use a series of discrete, pre-tagged characteristics of an item in order to recommend additional items with similar properties. Current recommender systems typically combine one or more approaches into a __hybrid system.__

![image.png](https://miro.medium.com/max/2000/1*rCK9VjrPgpHUvSNYw7qcuQ@2x.png)


Source: 

- [Recommendation System](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada)

- [Wikipedia](https://en.wikipedia.org/wiki/Recommender_system)

# Content Based Recommendations

Content-based filtering uses the features of items a customer has liked or interacted with to recommend other, similar items.

![image](https://miro.medium.com/max/998/1*O_GU8xLVlFx8WweIzKNCNw.png)

From the Google Machine Learning documentation:

To demonstrate content-based filtering, let’s give some examples for the Google Play store. The following figure shows a feature matrix where each row represents an app and each column represents a feature. Features could include categories (such as Education, Casual, Health), the publisher of the app, and many others. To simplify, assume this feature matrix is binary: a non-zero value means the app has that feature.

You also represent the user in the same feature space. Some of the user-related features could be explicitly provided by the user. For example, a user selects "Entertainment apps" in their profile. Other features can be implicit, based on the apps they have previously installed. For example, the user installed another app published by Science R Us.

The model should recommend items relevant to this user. To do so, you must first pick a similarity metric (for example, dot product). Then, you must set up the system to score each candidate item according to this similarity metric. Note that the recommendations are specific to this user, as the model did not use any information about other users.

__Using Dot Product as a Similarity Measure__

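When the user vector $x$ and the item vector $y$ are both binary (think of them as points in a shared coordinate system), the dot product $\langle x, y \rangle = \sum_i x_i y_i$ is the number of features that are active in both vectors simultaneously. A high dot product then indicates more common features, thus a higher similarity. (See the figure above.)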
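
As a minimal sketch of this scoring step, the snippet below ranks three apps for one user with a dot product. The app names, feature names, and all values are made up for illustration; they are not taken from the Google documentation or from any dataset used in this notebook.

```python
import numpy as np

# Hypothetical binary feature matrix: one row per app, one column per feature
feature_names = ["Education", "Casual", "Health", "Science R Us"]
app_names = ["App A", "App B", "App C"]
item_features = np.array([
    [1, 0, 0, 1],  # App A
    [0, 1, 0, 0],  # App B
    [1, 0, 1, 1],  # App C
])

# Hypothetical user profile expressed in the same feature space
user_vector = np.array([1, 0, 1, 1])

# Dot product of every app with the user: counts the shared active features
scores = item_features @ user_vector

# Rank apps from most to least similar to the user
ranking = [app_names[i] for i in np.argsort(-scores, kind="stable")]

print(dict(zip(app_names, scores.tolist())))
print(ranking)
```

With binary vectors the score is simply a count of shared features, so apps with many active features tend to score higher; a normalised measure such as cosine similarity is a common alternative when that is undesirable.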


__Advantages__

- The model doesn't need any data about other users, since the recommendations are specific to this user. This makes it easier to scale to a large number of users.

- The model can capture the specific interests of a user, and can recommend niche items that very few other users are interested in.

__Disadvantages__

- Since the feature representation of the items is hand-engineered to some extent, this technique requires a lot of domain knowledge. Therefore, the model can only be as good as the hand-engineered features.

- The model can only make recommendations based on existing interests of the user. In other words, the model has limited ability to expand on the users' existing interests.

Source: [Google Recommendation Systems Documentation](https://developers.google.com/machine-learning/recommendation/content-based/basics)

## Let's Code and Practice 🚀👨🏼‍💻

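We will use the open-source Online Retail dataset and suggest products to users at the basket stage.

Let's look at the dataset columns first. The list below follows the original dataset description; in the CSV loaded here, `InvoiceNo` appears as `Invoice` and `UnitPrice` as `Price`, which is what the code below uses.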

- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.

- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.

- Description: Product (item) name. Nominal.

- Quantity: The quantities of each product (item) per transaction. Numeric.

- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.

- UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).

- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.

- Country: Country name. Nominal. The name of the country where a customer resides.

In [None]:
import pandas as pd
import numpy as np

df_ = pd.read_csv("../input/online-retail-ii-uci/online_retail_II.csv")
# Let's copy the dataset for further changes
df = df_.copy()
df.head()

In [None]:
# Data Pre-Processing
def outlier_thresholds(dataframe, variable):
    # Despite the variable names, the 1st and 99th percentiles are used here
    # (not the quartiles), so only very extreme values fall outside the limits
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

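def retail_data_prep(dataframe):
    # Drop rows with missing values without modifying the caller's DataFrame
    dataframe = dataframe.dropna()
    # Invoices containing "C" are cancellations
    dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)]
    dataframe = dataframe[dataframe["Quantity"] > 0]
    # Copy the filtered slice so the capping below writes to an independent frame
    dataframe = dataframe[dataframe["Price"] > 0].copy()
    replace_with_thresholds(dataframe, "Quantity")
    replace_with_thresholds(dataframe, "Price")
    return dataframe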
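
To see what the capping step does, here is a small sketch on made-up quantities. It follows the same 1st/99th percentile rule as `outlier_thresholds`, but uses `Series.clip` as a shortcut instead of the `.loc` assignments above; the toy data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical quantities: 1..99 plus one extreme value
toy = pd.DataFrame({"Quantity": list(range(1, 100)) + [10_000]}, dtype=float)

# Same rule as outlier_thresholds: percentiles widened by 1.5 times their spread
q_low = toy["Quantity"].quantile(0.01)
q_high = toy["Quantity"].quantile(0.99)
spread = q_high - q_low
low_limit = q_low - 1.5 * spread
up_limit = q_high + 1.5 * spread

# Values outside the limits are pulled back to the nearest limit
capped = toy["Quantity"].clip(lower=low_limit, upper=up_limit)

print(low_limit, up_limit)
print(capped.tail(3))
```

Only the extreme value is changed; every ordinary quantity stays as it was, which is the point of using such wide limits.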

In [None]:
df = retail_data_prep(df)

In [None]:
# Let's prepare the data for customers in Germany

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
# Print wide DataFrames on one line instead of wrapping the columns
pd.set_option('display.expand_frame_repr', False)

# Copy the slice so later column assignments do not touch df
df_ge = df[df['Country'] == "Germany"].copy()


# 1 if the product appears on the invoice, 0 otherwise
df_ge.groupby(['Invoice', 'Description']). \
    agg({"Quantity": "sum"}). \
    unstack(). \
    fillna(0). \
    gt(0).astype(int).iloc[0:5, 0:5]

In [None]:
def create_invoice_product_df(dataframe, id=False):
    # Columns are stock codes when id=True, product descriptions otherwise
    product_col = "StockCode" if id else "Description"
    # 1 if the product appears on the invoice, 0 otherwise
    return dataframe.groupby(['Invoice', product_col])['Quantity'].sum().unstack().fillna(0). \
        gt(0).astype(int)

In [None]:
ge_inv_pro_df = create_invoice_product_df(df_ge)
ge_inv_pro_df.head()

In [None]:
ge_inv_pro_df = create_invoice_product_df(df_ge, id=True)
ge_inv_pro_df.head()
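
The invoice-product matrix is easier to picture on a tiny example. The sketch below applies the same groupby/unstack steps to a few made-up transactions; the invoice numbers and stock codes are hypothetical.

```python
import pandas as pd

# Hypothetical transactions using the same column names as the notebook
toy = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002", "1002", "1003"],
    "StockCode": ["A", "B", "A", "A", "C"],
    "Quantity": [2, 1, 3, 4, 5],
})

# One row per invoice, one column per product, 1 if the product was bought
basket = (
    toy.groupby(["Invoice", "StockCode"])["Quantity"].sum()
    .unstack()
    .fillna(0)
    .gt(0)
    .astype(int)
)

print(basket)
```

Invoice 1002 contains product A twice, but the matrix only records whether a product is present, not how many units were bought.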

In [None]:
# Let's create a function that looks up the product name for a stock code

df_ge['InvoiceDate'] = pd.to_datetime(df_ge['InvoiceDate'])

def check_id(dataframe, stock_code):
    product_name = dataframe.loc[dataframe["StockCode"] == stock_code, "Description"].values[0]
    print(product_name)

check_id(df_ge, "21987")
# check_id(df_ge, "23235")
# check_id(df_ge, "22747")

In [None]:
# Let's install mlxtend library
!pip install mlxtend

In [None]:
from mlxtend.frequent_patterns import apriori, association_rules

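# Let's mine association rules with the Apriori algorithm
# (this works on co-purchases in baskets rather than on item features)

# Frequent itemsets: product combinations that appear in at least 1% of invoices
frequent_itemsets = apriori(ge_inv_pro_df, min_support=0.01, use_colnames=True)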
frequent_itemsets.sort_values("support", ascending=False).head(10)

In [None]:
rules = association_rules(frequent_itemsets, metric="support", min_threshold=0.01)
rules.sort_values("support", ascending=False).head()

In [None]:
rules.sort_values("lift", ascending=False).head(20)
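
The rule tables above are sorted by support and lift. As a rough illustration of what those columns mean, the sketch below computes support, confidence, and lift by hand for one rule, using the standard definitions. The basket data and the `rule_metrics` helper are hypothetical and are not part of mlxtend.

```python
import pandas as pd

# Hypothetical baskets: one row per invoice, True if the product was bought
basket = pd.DataFrame(
    {
        "bread": [1, 1, 0, 1, 1],
        "butter": [1, 1, 0, 0, 1],
        "jam": [0, 1, 1, 0, 0],
    },
    dtype=bool,
)


def rule_metrics(basket, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    support_a = basket[antecedent].mean()   # share of baskets with the antecedent
    support_c = basket[consequent].mean()   # share of baskets with the consequent
    support_ac = (basket[antecedent] & basket[consequent]).mean()  # share with both
    confidence = support_ac / support_a     # P(consequent | antecedent)
    lift = confidence / support_c           # > 1 means bought together more than by chance
    return {"support": support_ac, "confidence": confidence, "lift": lift}


metrics = rule_metrics(basket, "bread", "butter")
print(metrics)
```

A lift above 1 suggests the two products appear together more often than independent purchases would predict, which is why the recommender further down ranks rules by lift.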

In [None]:
check_id(df_ge, "21987")

In [None]:
check_id(df_ge, "22747")

In [None]:
def arl_recommender(rules_df, product_id, rec_count=1):
    # Rank rules by lift so the strongest associations come first
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendation_list = []

    # Walk antecedents and consequents together, which avoids mixing up
    # index labels and row positions after sorting
    for antecedents, consequents in zip(sorted_rules["antecedents"], sorted_rules["consequents"]):
        if product_id in antecedents:
            for item in consequents:
                # Keep the lift order and skip duplicates
                if item not in recommendation_list:
                    recommendation_list.append(item)

    return recommendation_list[:rec_count]

In [None]:
arl_recommender(rules, "21987", 3)

# Collaborative Filtering Methods

To address some of the limitations of content-based filtering, collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B. Furthermore, the embeddings can be learned automatically, without relying on hand-engineering of features.

![Collaborative Filtering](https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-020-03266-2/MediaObjects/11227_2020_3266_Fig1_HTML.png)

Collaborative filtering methods rely on past interactions recorded between users and items to produce new recommendations. These interactions are stored in the user-item interaction matrix.

![User-item interactions matrix](https://buomsoo-kim.github.io/data/images/2020-08-08/0.png)


The main idea behind collaborative methods is that these past user-item interactions are sufficient to detect similar users and/or similar items and to make predictions based on these estimated proximities.

The class of collaborative filtering algorithms is divided into two sub-categories, generally called memory-based and model-based approaches. Memory-based approaches work directly with the values of recorded interactions, assuming no model, and are essentially based on nearest neighbours search (for example, find the users closest to a user of interest and suggest the most popular items among these neighbours). Model-based approaches assume an underlying “generative” model that explains the user-item interactions and try to discover it in order to make new predictions.
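
The memory-based idea can be sketched in a few lines: find the nearest neighbours of a user and suggest what they rated highly. The rating matrix below is made up, cosine similarity is just one possible choice of distance, and summing neighbour ratings is only one simple way to measure popularity.

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users, columns: items, 0 = no interaction)
ratings = np.array([
    [5, 4, 0, 0],  # user 0, the user of interest
    [5, 5, 4, 0],  # user 1
    [0, 1, 5, 4],  # user 2
    [4, 4, 5, 0],  # user 3
], dtype=float)


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))


target = 0
sims = np.array([cosine_similarity(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = -np.inf  # a user is not their own neighbour

# The two most similar users
neighbours = np.argsort(-sims)[:2]

# Score each item by the neighbours' summed ratings, ignoring items the target already rated
scores = ratings[neighbours].sum(axis=0)
scores[ratings[target] > 0] = -np.inf
best_item = int(np.argmax(scores))

print(neighbours, best_item)
```

No model is fitted here: the recommendation comes straight from the stored interactions, which is what distinguishes memory-based from model-based approaches.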


![Memory Based](https://miro.medium.com/max/2000/1*yV3-_A1q37WheNJCvzutqg@2x.png)

The main advantage of collaborative filtering methods is that they require no information about the users or items themselves; they only look at the interactions between users and items, so they can be applied in many different settings.

__Moreover, the more users interact with items, the more accurate new recommendations become.__


__However__, collaborative filtering suffers from the “cold start problem”: it is impossible to recommend anything to new users, or to recommend a new item to any user, when there are too few interactions to work with.


Source: 

- [Towards Data Science](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada)

- [Google Documentation](https://developers.google.com/machine-learning/recommendation/collaborative/basics)

## Item-Based Collaborative Filtering

## Let's Code and Practice 🚀👨🏼‍💻

We will use the MovieLens 20M movie and rating datasets.

__Movie Dataset__:

- movieId: Unique movie ID

- title: Movie Name

__Rating Dataset__:

- userId: Unique user ID

- movieId: Unique movie ID

- rating: User ratings for movies

- timestamp: Date for rating

In [None]:
def create_user_movie_df():
    import pandas as pd
    movie = pd.read_csv('../input/movielens-20m-dataset/movie.csv')
    rating = pd.read_csv('../input/movielens-20m-dataset/rating.csv')
    df = movie.merge(rating, how="left", on="movieId")
    # Number of ratings per title (a Series indexed by title)
    rating_counts = df["title"].value_counts()
    # Drop titles with 1000 ratings or fewer to keep the matrix manageable
    rare_movies = rating_counts[rating_counts <= 1000].index
    common_movies = df[~df["title"].isin(rare_movies)]
    # Rows: users, columns: movie titles, values: ratings (NaN if not rated)
    user_movie_df = common_movies.pivot_table(index=["userId"], columns=["title"], values="rating")
    return user_movie_df

In [None]:
user_movie_df = create_user_movie_df()

In [None]:
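def item_based_recommender(movie_name, user_movie_df):
    # Ratings column of the chosen movie
    movie_ratings = user_movie_df[movie_name]
    # Correlate every movie's ratings with it and keep the 10 most similar
    return user_movie_df.corrwith(movie_ratings).sort_values(ascending=False).head(10)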

In [None]:
item_based_recommender("Matrix, The (1999)", user_movie_df)
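
To make the `corrwith` step concrete, here is a small sketch on a made-up user-movie matrix. The titles and ratings are hypothetical; the point is only to show how rating patterns translate into a similarity ranking.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix: rows are users, columns are movies, NaN = not rated
toy_user_movie = pd.DataFrame(
    {
        "Movie A": [5, 4, 1, 2, np.nan],
        "Movie B": [4, 5, 2, 1, 3],
        "Movie C": [1, 2, 5, 4, 4],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

# Correlate every movie's rating column with Movie A's column;
# only users who rated both movies contribute to each correlation
similar = toy_user_movie.corrwith(toy_user_movie["Movie A"]).sort_values(ascending=False)

print(similar)
```

The movie itself always comes first with a correlation of 1, so in practice the first row of the result is usually skipped. Movie B is rated like Movie A by the same users and ranks next, while Movie C shows the opposite pattern and ends up last.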

In [None]:
# If we don't know the exact movie title or release year,
# we can use a function that searches the movie titles for a keyword
def check_film(keyword, user_movie_df):
    return [col for col in user_movie_df.columns if keyword in col]

check_film("Mission", user_movie_df)

In [None]:
item_based_recommender(check_film("Sherlock", user_movie_df)[0], user_movie_df)

In [None]:
# Building the matrix takes time, so we cache it in a pickle file
import pickle

# Context managers make sure the file handles are closed
with open("user_movie_df.pkl", "wb") as f:
    pickle.dump(user_movie_df, f)
with open("user_movie_df.pkl", "rb") as f:
    user_movie_df = pickle.load(f)

## User-Based Collaborative Filtering

In [None]:
# Let's use the pkl file that we created after data preprocessing
user_movie_df = pickle.load(open('user_movie_df.pkl', 'rb'))

In [None]:
# Pick a random user and list the movies they have rated
random_user = int(pd.Series(user_movie_df.index).sample(1, random_state=45).iloc[0])
random_user_df = user_movie_df[user_movie_df.index == random_user]
movies_watched = random_user_df.columns[random_user_df.notna().any()].tolist()
user_movie_df.loc[user_movie_df.index == random_user, user_movie_df.columns == "Schindler's List (1993)"]
len(movies_watched)

In [None]:
# Count how many of the random user's movies each user has also rated
movies_watched_df = user_movie_df[movies_watched]
user_movie_count = movies_watched_df.T.notnull().sum()
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ["userId", "movie_count"]
user_movie_count[user_movie_count["movie_count"] > 20].sort_values("movie_count", ascending=False)
user_movie_count[user_movie_count["movie_count"] == 33].count()
users_same_movies = user_movie_count[user_movie_count["movie_count"] > 20]["userId"]

In [None]:
users_same_movies.head()

In [None]:
# Let's count the users who share enough watched movies with the random user
users_same_movies.count()

In [None]:
# Preparing the user behaviour matrix

"""
To prepare the user behaviour matrix:
1-) Build the correlation matrix of the users
2-) Use the correlation matrix to select the top (most similar) users
"""

final_df = pd.concat([movies_watched_df[movies_watched_df.index.isin(users_same_movies)],
                      random_user_df[movies_watched]])

In [None]:
final_df.head()

In [None]:
final_df.count()

In [None]:
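# Pairwise user correlations in long form.
# Note: drop_duplicates() compares the correlation values, so it removes the mirrored
# (a, b)/(b, a) pairs but also any other pair that happens to have the same value
corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()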

In [None]:
corr_df.head()

In [None]:
corr_df = pd.DataFrame(corr_df, columns=["corr"])
corr_df.head()

In [None]:
corr_df.index.names = ['user_id_1', 'user_id_2']
corr_df.count()

In [None]:
corr_df.reset_index(inplace=True)
corr_df.head()

In [None]:
corr_df.sort_values(by="corr", ascending=False).head()

In [None]:
# Defining top users
# We keep users whose correlation with the random user is at least 0.65
top_users = corr_df[(corr_df["user_id_1"] == random_user) & (corr_df["corr"] >= 0.65)][
    ["user_id_2", "corr"]].reset_index(drop=True)

top_users = top_users.sort_values(by='corr', ascending=False)
top_users.rename(columns={"user_id_2": "userId"}, inplace=True)

In [None]:
# Let's create recommendation score using rating dataset for top users
rating = pd.read_csv('../input/movielens-20m-dataset/rating.csv')
top_users_ratings = top_users.merge(rating[["userId", "movieId", "rating"]], how='inner')

top_users_ratings = top_users_ratings[top_users_ratings["userId"] != random_user]
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']
top_users_ratings.head()

In [None]:
# Group by movie to get the mean weighted rating score
top_users_ratings.groupby('movieId').agg({"weighted_rating": "mean"}).head(20)

In [None]:
recommendation_df = top_users_ratings.groupby('movieId').agg({"weighted_rating": "mean"})
recommendation_df = recommendation_df.reset_index()
recommendation_df[["movieId"]].nunique()

In [None]:
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 3.5]. \
    sort_values("weighted_rating", ascending=False)

movies_to_be_recommend.head(5)
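
The weighting steps above can be traced on a small example. The users, correlations, and ratings below are made up; the sketch only mirrors the merge, the `corr * rating` weighting, and the per-movie mean used in this notebook.

```python
import pandas as pd

# Hypothetical top users with their correlation to the target user
top_users_toy = pd.DataFrame({"userId": [10, 20], "corr": [0.9, 0.7]})

# Hypothetical ratings; user 30 is not a top user and should be ignored
rating_toy = pd.DataFrame({
    "userId": [10, 10, 20, 20, 30],
    "movieId": [1, 2, 1, 3, 1],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0],
})

# Inner merge on userId keeps only the ratings given by top users
merged = top_users_toy.merge(rating_toy, how="inner")

# Ratings from more similar users count for more
merged["weighted_rating"] = merged["corr"] * merged["rating"]

# Average weighted rating per movie, best first
scores = merged.groupby("movieId")["weighted_rating"].mean().sort_values(ascending=False)

print(scores)
```

Because every rating is multiplied by a correlation of at most 1, the scores sit below the raw rating scale, so a cut-off such as 3.5 is stricter than it would be on unweighted ratings.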