# Overcoming Echo Chambers in Recommendation Systems

## Introduction

Automated recommendation systems have become a part of everyday life online. Whether it's Amazon recommending products, Facebook recommending news articles, or Netflix recommending movies, we frequently interact with recommendation systems, to the benefit of both consumers and businesses. A drawback of these systems is their potential, over time, to limit the diversity of items recommended as they narrow in on a user's preferences. This process results in echo chambers (exposure only to recommendations from others like yourself), feedback loops (recommendations that reinforce your recorded preferences), and filter bubbles (exposure only to recommendations similar to items you have historically liked). Users end up exposed only to recommendations that reinforce their own biases and points of view (e.g., news recommendations) or their existing purchase patterns, which limits a business's potential to increase product demand.

In this project I build a system that provides additional recommendations based on the preferences of others who are different, but not too different, from the user. These additional recommendations are identified by clustering users based on the latent user features obtained from an alternating least squares (ALS) decomposition of the ratings matrix. Once clusters are created, the top-rated items for each cluster (based on the cluster centroid) are identified. The top-rated items in each cluster form the set of items from which the recommendations are extracted. For a new user, two sets of recommendations are provided. The first set comes from the ALS model and the second set comes from the two clusters nearest to the user's predicted cluster. 

This project presents a method for overcoming the problems of echo chambers, feedback loops, and filter bubbles by augmenting the recommendations provided by a common recommendation algorithm (ALS) with additional, more diverse recommendations.
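
The cluster-based step described above can be illustrated with a minimal sketch. It is not the project's code: it assumes cluster centroids are already available (the clustering algorithm is not named in this section), and it assumes that "top-rated items for each cluster" means scoring each item by the dot product of the centroid with the item's latent factors. All function names and the toy data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def nearest_other_clusters(user_cluster, centroids, k=2):
    """Indices of the k centroids closest to the user's own centroid, excluding it."""
    home = centroids[user_cluster]
    others = [i for i in range(len(centroids)) if i != user_cluster]
    return sorted(others, key=lambda i: euclidean(home, centroids[i]))[:k]

def top_items_for_cluster(centroid, item_factors, n=3):
    """Assumed scoring: centroid . item_factor; return the n best item ids."""
    scores = {item: dot(centroid, factors) for item, factors in item_factors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

def diverse_recommendations(user_cluster, centroids, item_factors, als_recs, n=3):
    """Top items from the two nearest clusters, skipping items ALS already recommended."""
    recs = []
    for cluster in nearest_other_clusters(user_cluster, centroids):
        for item in top_items_for_cluster(centroids[cluster], item_factors, n):
            if item not in als_recs and item not in recs:
                recs.append(item)
    return recs

# Toy data (hypothetical): four cluster centroids and four items in a 2-D latent space.
centroids = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
item_factors = {"A": [1.0, 0.0], "B": [0.7, 0.7], "C": [0.0, 1.0], "D": [-1.0, 0.2]}
als_recs = ["A"]  # what the ALS model already recommended to this user
extra = diverse_recommendations(0, centroids, item_factors, als_recs, n=2)
print(extra)
```

Restricting the search to the two nearest clusters is what keeps the extra recommendations "different, but not too different": a distant cluster such as the fourth centroid in the toy data never contributes items.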

Continue reading to learn more about how this project was developed, or skip to <a href='#Get_Recommendations'>Get Recommendations</a> in the last section to test the augmented recommendation engine. Readers interested in replicating this project should read through the remainder of this section and follow the links in the order described.

## Business Understanding

Automated recommendation systems are commonplace among businesses with an online presence. They are used to recommend products, news articles, and even recipes. These systems aim to increase consumer engagement and purchasing by recommending new items the consumer will presumably like. A common method for producing recommendations is collaborative filtering (CF), which generates recommendations based on the combined preferences of the consumer requesting recommendations and other consumers. The ability of CF to make good recommendations (i.e., recommendations the consumer will like) improves as additional preferences are recorded. A limitation of CF is that, over time, the recommendations become narrower in scope, resulting in echo chambers, feedback loops, and filter bubbles.

The negative effects of echo chambers, feedback loops, and filter bubbles are felt both socially and economically. Socially, these effects limit exposure to diverse and contrary ideas, leading to a more divided and divisive society. Economically, these effects limit the variety of products to which consumers are exposed, limiting a business's potential to increase product demand.

The following sections describe a method for overcoming the limitations of CF recommendation systems by enhancing an ALS algorithm to provide more diverse recommendations. The goal is to provide additional recommendations that are qualitatively different from the items recommended by the ALS model, but not too different. Four metrics are used to evaluate how well the augmented model provides diverse (but not too diverse) recommendations.

The first metric evaluates the overlap between the ALS recommendations and recommendations from the augmented system intended to match the ALS recommendations. A high degree of overlap indicates that the augmented model can correctly classify users into groups with similar item preferences.

The second metric evaluates the overlap between the ALS recommendations and the actual set of diverse recommendations generated by the augmented model. A low degree of overlap indicates that the augmented model can identify recommendations that are different from the ALS recommendations.
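
The two overlap metrics can be sketched as a simple set computation. The exact definition of overlap is not given in this section, so the version below assumes it is the share of ALS-recommended items that also appear in the other list; `overlap_fraction` and the item ids are hypothetical.

```python
def overlap_fraction(reference, candidate):
    """Share of items in `reference` that also appear in `candidate` (0.0 to 1.0)."""
    reference, candidate = set(reference), set(candidate)
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)

# Hypothetical item ids for one user.
als_recs = [1, 2, 3, 4]          # recommendations from the ALS model
matched_recs = [2, 3, 4, 9]      # from the user's own predicted cluster (metric 1)
diverse_recs = [7, 8, 9, 4]      # from the nearest other clusters (metric 2)

metric_1 = overlap_fraction(als_recs, matched_recs)   # should be high
metric_2 = overlap_fraction(als_recs, diverse_recs)   # should be low
print(metric_1, metric_2)
```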

The third metric evaluates the extent to which the recommendations from the augmented model are qualitatively different from those of the ALS model. This metric utilizes a t-test for the difference in means. A negative and statistically significant t-statistic indicates items recommended by the augmented model are qualitatively different from those recommended by the ALS model. 
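
As an illustration of the test statistic, the sketch below computes a two-sample t-statistic in plain Python. Two things are assumptions rather than details from this project: it uses Welch's unequal-variance form (the section only says "t-test for the difference in means"), and the samples are hypothetical similarity scores between each recommended item and the user's liked items, so a negative statistic means the augmented items are less similar on average.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's t-test of mean(sample_a) - mean(sample_b)."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na  # squared standard error of mean(sample_a)
    vb = variance(sample_b) / nb  # squared standard error of mean(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical similarity scores (higher = more like the user's liked items).
augmented_scores = [0.2, 0.3, 0.4]
als_scores = [0.7, 0.8, 0.9]
t_stat, dof = welch_t(augmented_scores, als_scores)
print(t_stat, dof)
```

A p-value would still be needed to judge significance; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.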

The final metric evaluates whether the items recommended by the augmented model are too qualitatively different from the ALS recommendations. This metric utilizes a t-test for the difference in differences. A positive and statistically significant t-statistic indicates the augmented recommendations are not as qualitatively different from the ALS recommendations as other potential recommendations. This metric is taken as an indicator that the augmented recommendations are not too different from the ALS recommendations. Additional information about the evaluation metrics used in this project can be found in <a href='#Modeling'>Section 1.5</a> ("Modeling") and in the <a href='../notebooks/evaluation_drm_ec.ipynb'>evaluation notebook</a>.

## Data Understanding and Preparation

The project utilizes the <a href='https://grouplens.org/datasets/movielens/'>MovieLens</a> dataset (Harper & Konstan, 2015). The MovieLens dataset is an open source dataset containing 27,753,444 movie ratings from 283,228 users for 58,098 movies. The ratings are on a five-star scale ranging from 0.5 stars to 5 stars in 0.5-star increments. The files include data from between January 09, 1995 and September 26, 2018. The dataset includes a random sample of users with at least 1 movie rating. The MovieLens dataset is available at the above link. The usage license prohibits redistribution of the data without separate permission.

To prepare the ratings data for analysis, I removed users with fewer than 10 ratings and movies with fewer than 5 ratings. The processed dataset contains 27,500,000+ ratings for 243,000+ users and 28,000+ movies. The movies data was processed to produce two new datasets. The first contains only movies with more than 50 ratings. These "most rated" movies are used as the basis for all recommendations. This ensures that recommendations contain only movies that have high ratings from a large number of users, and prevents movies rated highly by only a few users from skewing the recommendation set. The second contains the top 100 most-rated movies (based on average rating). The top 100 movies are used to randomly select movies for a new user to rate. Detailed code for processing the raw data can be found in the <a href='../notebooks/clean_drm_ec.ipynb'>clean_drm_ec notebook</a> in the notebooks folder.
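
The filtering thresholds above can be sketched with the standard library. This is not the code from the clean_drm_ec notebook: the function names are hypothetical, and the sketch assumes the counts are taken once from the raw ratings (a single pass, not repeated until the counts stabilize).

```python
from collections import Counter

def filter_ratings(ratings, min_user_ratings=10, min_movie_ratings=5):
    """Keep rows whose user and movie both meet the minimum rating counts.

    `ratings` is a list of (user_id, movie_id, rating) tuples.
    """
    user_counts = Counter(user for user, _, _ in ratings)
    movie_counts = Counter(movie for _, movie, _ in ratings)
    return [
        (user, movie, rating)
        for user, movie, rating in ratings
        if user_counts[user] >= min_user_ratings
        and movie_counts[movie] >= min_movie_ratings
    ]

def most_rated_movies(ratings, min_count=50):
    """Ids of movies with more than `min_count` ratings."""
    counts = Counter(movie for _, movie, _ in ratings)
    return {movie for movie, count in counts.items() if count > min_count}

# Tiny example with lowered thresholds so the effect is visible.
sample = [(1, "a", 4.0), (1, "b", 3.0), (2, "a", 5.0), (3, "c", 2.0)]
kept = filter_ratings(sample, min_user_ratings=2, min_movie_ratings=2)
popular = most_rated_movies(sample, min_count=1)
print(kept, popular)
```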

After fitting the ALS model using Amazon Web Services (AWS) Elastic MapReduce (EMR), further data processing is required. The user and item factor outputs of the ALS model are saved as sets of files (a function of the MapReduce process). To work with the user and item factors outside of AWS EMR, each set of files needs to be combined into a single CSV file. In addition, the user factors must be scaled for use in the clustering algorithm. Detailed code for processing the ALS output from AWS can be found in the <a href='../notebooks/AWS_data.ipynb'>AWS_data notebook</a>.
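
A rough sketch of these two post-processing steps follows. It rests on assumptions not confirmed by this section: that the EMR output is a directory of headerless CSV part files named `part-*`, and that "scaled" means standardizing each factor column to zero mean and unit variance. The function names are hypothetical; the AWS_data notebook holds the actual code.

```python
import csv
import glob
import os
from statistics import mean, pstdev

def combine_part_files(input_dir, output_path, pattern="part-*"):
    """Concatenate every part file in input_dir into one CSV; return the row count."""
    rows = 0
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        # Sort so the combined file has a stable row order across runs.
        for path in sorted(glob.glob(os.path.join(input_dir, pattern))):
            with open(path, newline="") as part:
                for row in csv.reader(part):
                    if row:  # skip blank lines
                        writer.writerow(row)
                        rows += 1
    return rows

def standardize_columns(vectors):
    """Scale each column of a list of factor vectors to zero mean, unit variance."""
    columns = list(zip(*vectors))
    means = [mean(column) for column in columns]
    # A constant column has zero spread; divide by 1.0 instead to avoid an error.
    stds = [pstdev(column) or 1.0 for column in columns]
    return [
        [(x - m) / s for x, m, s in zip(vector, means, stds)]
        for vector in vectors
    ]
```

Scaling matters because distance-based clustering would otherwise be dominated by whichever latent factors happen to have the largest numeric range.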

The data cleaning and processing steps are contained within functions and are reproducible.

<a id='Modeling'></a>

## Modeling


Functions are contained in .py files, which are then imported into Jupyter Notebooks.
Whenever possible, documentation (README.md, comments, Markdown cells) is used to explain why modeling decisions are being made, so that another data scientist engaging with the project can understand the context of the technical decisions made during the modeling process.
During model selection, the simplest models are tried first (e.g., guessing the majority class, logistic regression, naive Bayes, linear regression) before more complex and less interpretable models (e.g., neural networks, random forests).


## Next Steps

Project "next steps" include potential ideas to improve the model through feature engineering, parameter tuning, etc.
Project "next steps" include ideas for future product improvements that further address the original business problem.


# Recommendations

## Local Code Imports

In [1]:
# DO NOT REMOVE THESE
%load_ext autoreload
%autoreload 2

In [2]:
# DO NOT REMOVE This
%reload_ext autoreload

In [3]:
# Filter warnings (comment out the next two lines to see them)
import warnings
warnings.filterwarnings('ignore')

In [9]:
# DO NOT REMOVE
# import local src module -
from src import recapp as ra

<a id='Get_Recommendations'></a>

## Get Recommendations

In [12]:
ra.get_recommendations()

Enter a ranking from 1 (lowest) to 5 (highest) for the following movies.
If you have not seen the movie, press enter.
Human Planet (2011): 4
Rear Window (1954): 
Last Year's Snow Was Falling (1983): 6
Ratings must be whole numbers between 1 and 5.
City of God (Cidade de Deus) (2002): 3
American Beauty (1999): 3
Grand Illusion (La grande illusion) (1937): 3
Sherlock - A Study in Pink (2010): 3
Vertigo (1958): 4
The Adventures of Sherlock Holmes and Dr. Watson: Bloody Signature (1979): 42
Ratings must be whole numbers between 1 and 5.
Maltese Falcon, The (1941): 4
Rashomon (Rashômon) (1950): 2
Godfather, The (1972): 4
Usual Suspects, The (1995): 2




Try:
    I, Claudius (1976)
    Planet Earth (2006)
    Planet Earth II (2016)
    Dylan Moran: Yeah, Yeah (2011)
    Harakiri (Seppuku) (1962)
    Human Condition III, The (Ningen no joken III) (1961)
    Civil War, The (1990)
    Smiley's People (1982)
    Life (2009)
    Blue Planet II (2017)


You may also like:
    Quiz Show (1994)
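
The transcript above shows ratings such as 6 and 42 being rejected and an empty entry being treated as "not seen". One way such input handling could look is sketched below. It is not the implementation inside `ra.get_recommendations()`; `parse_rating` is a hypothetical helper that only mirrors the behavior visible in the output.

```python
def parse_rating(raw, low=1, high=5):
    """Return an int rating, None for a skipped movie, or raise ValueError."""
    text = raw.strip()
    if not text:
        return None  # the user pressed enter: movie not seen
    # Reject non-integers ("4.5", "four", "-3") and out-of-range values ("6", "42").
    if not text.isdecimal() or not low <= int(text) <= high:
        raise ValueError(f"Ratings must be whole numbers between {low} and {high}.")
    return int(text)

def safe_parse(raw):
    """Return the parsed rating, or the error message if the input is invalid."""
    try:
        return parse_rating(raw)
    except ValueError as error:
        return str(error)

for entry in ["4", "", "6", "42", "3.5"]:
    print(repr(entry), "->", safe_parse(entry))
```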
   

# References

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. https://doi.org/10.1145/2827872