In [80]:
def sim_2_sol(sol_dict):
    a = True
    b = False
    c = "We can't be sure."


    pearson_dct = {"If when x increases, y always increases, Pearson's correlation will be always be 1.": b,
                   "If when x increases by 1, y always increases by 3, Pearson's correlation will always be 1.": a,
                   "If when x increases by 1, y always decreases by 5, Pearson's correlation will always be -1.": a,
                   "If when x increases by 1, y increases by 3 times x, Pearson's correlation will always be 1.": b
    }

    if pearson_dct == sol_dict:
        print("That's right!  Pearson's correlation relates to a linear relationship.  The second and third cases are examples of perfect linear relationships, where we would receive values of 1 and -1.  Only having an increase or decrease that are directly related will not lead to a Pearson's correlation coefficient of 1 or -1.  You can see this by testing out your function using the examples above without using assert statements.")

    else:
        print("Oops!  That doesn't look right... Pearson's correlation relates to a linear relationship.  The second and third cases are examples of perfect linear relationships, where we would receive values of 1 and -1.  Only having an increase or decrease that are directly related will not lead to a Pearson's correlation coefficient of 1 or -1.  You can see this by testing out your function using the examples above without using assert statements.  Try looking at the correlation of different relationships to prove the values to yourself.")


def sim_4_sol(sol_dict):
    a = True
    b = False
    c = "We can't be sure."


    spearman_dct = {"If when x increases, y always increases, Spearman's correlation will be always be 1.": a,
                    "If when x increases by 1, y always increases by 3, Spearman's correlation will always be 1.": a,
                    "If when x increases by 1, y always decreases by 5, Spearman's correlation will always be -1.": a,
                    "If when x increases by 1, y increases by 3 times x, Spearman's correlation will always be 1.": a
}
    if spearman_dct == sol_dict:
        print("That's right!  Unlike Pearson's correlation, Spearman's correlation can have perfect relationships (1 or -1 values) that aren't linear relationships.  You will notice that neither Spearman or Pearson correlation values suggest a relation when there are quadratic relationships.")

    else:
        print("Oops!  That doesn't look right...these are actually all true statements! Unlike Pearson's correlation, Spearman's correlation can have perfect relationships (1 or -1 values) that aren't linear relationships.  You will notice that neither Spearman or Pearson correlation values suggest a relation when there are quadratic relationships.")
        
def sim_6_sol(sol_dict):
    a = True
    b = False
    c = "We can't be sure."


    corr_comp_dct = {"For all columns of play_data, Spearman and Kendall's measures match.": a,
                    "For all columns of play_data, Spearman and Pearson's measures match.": b,
                    "For all columns of play_data, Pearson and Kendall's measures match.": b}

    if corr_comp_dct == sol_dict:
        print("That's right!  Pearson does not match the other two measures, as it looks specifically for linear relationships.  However, Spearman and Kendall's measures are exactly the same as one another in the cases related to play_data.")

    else:
        print("Oops!  That doesn't look right...Pearson does not match the other two measures, as it looks specifically for linear relationships.  However, Spearman and Kendall's measures are exactly the same as one another in the cases related to play_data.")

def test_eucl(x, y):
    return np.linalg.norm(x - y)
    

def test_manhat(x,y):
    return sum(abs(e - s) for s,e in zip(x, y))

# Recommendation Engines

## Introduction

Recommendation engines are used to suggest everything from movies to music to friends to new destinations. There are three main methods for implementing recommendations that you will become familiar with throughout this lesson:
* Knowledge Based Recommendations
* Collaborative Filtering Based Recommendations
* Content Based Recommendations

After completing this lesson, you will be ready for the upcoming lessons where you will:
* Learn about more advanced techniques.
* Deploy your recommendations in a web application.

These three lessons will aim to be extremely practical. The lessons will require that you write code to implement a number of different recommendation techniques.

**Example Recommendations:**

* LinkedIn and Facebook
> Both LinkedIn and Facebook recommend new connections (business contacts or friends).

* AirBnB Experiences and Destinations
> AirBnB uses recommendations to determine experiences and destinations for their users.

* Walmart, Amazon, and Other Retailers
> As humans on the Internet, we all get pinged with constant recommendations from retailers.

## What's Ahead

### Types of Recommendations

In this lesson, you will be working with the MovieTweetings data to apply each of the three methods of recommendations:
1. Knowledge Based Recommendations
2. Collaborative Filtering Based Recommendations
3. Content Based Recommendations

Within Collaborative Filtering, there are two main branches:
1. Model Based Collaborative Filtering
2. Neighborhood Based Collaborative Filtering

In this lesson, you will implement Neighborhood Based Collaborative Filtering. In the next lesson, you will implement Model Based Collaborative Filtering.

### Similarity Metrics

In order to implement Neighborhood Based Collaborative Filtering, you will learn about some common ways to measure the similarity between two users (or two items) including:
1. Pearson's correlation coefficient
2. Spearman's correlation coefficient
3. Kendall's Tau
4. Euclidean Distance
5. Manhattan Distance

You will learn why sometimes one metric works better than another by looking at a specific situation where one metric provides more information than another.

### Business Cases For Recommendations

Finally, you will look at the four ideas needed for businesses to implement successful recommendations to drive revenue, which include:
1. Relevance
2. Novelty
3. Serendipity
4. Increased Diversity

At the end of this lesson, you will have gained a ton of skills to build upon or to start creating your own recommendations in practice.

## Base Data - MovieTweetings

You can find additional information about the MovieTweetings data at the links provided here:
* [The MovieTweetings white paper(DEADLINK)](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf)
* [A Github account set up for MovieTweetings](https://github.com/sidooms/MovieTweetings)
* [A slide deck by Simon Dooms about MovieTweetings.](https://www.slideshare.net/simondooms/movie-tweetings-a-movie-rating-dataset-collected-from-twitter)
> Attached in repo as well

### Exercise - Recommendations with MovieTweetings: Getting to Know The Data

Throughout this lesson, you will be working with the [MovieTweetings Data](https://github.com/sidooms/MovieTweetings/tree/master/recsyschallenge2014).

**Note:** There are solutions to each of the notebooks available by hitting the orange jupyter logo in the top left of this notebook.  Additionally, you can watch me work through the solutions on the screencasts that follow each workbook. 

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.

In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import tests as t

%matplotlib inline

In [32]:
# Read in the MovieTweetings dataset originally taken from https://github.com/sidooms/MovieTweetings/tree/master/latest
movies = pd.read_csv(
    '06_recommendation_engines/movies.dat',
    delimiter='::',
    header=None,
    names=['movie_id', 'movie', 'genre'],
    dtype={'movie_id': object},
    engine='python')
reviews = pd.read_csv(
    '06_recommendation_engines/ratings.dat',
    delimiter='::',
    header=None,
    names=['user_id', 'movie_id', 'rating', 'timestamp'],
    dtype={'movie_id': object, 'user_id': object, 'timestamp': object},
    engine='python')

#### 1. Take a Look At The Data 

Take a look at the data and use your findings to fill in the dictionary below with the correct responses to show your understanding of the data.

In [33]:
print(movies.shape)
display(movies.head())

(35479, 3)


Unnamed: 0,movie_id,movie,genre
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short
1,10,La sortie des usines Lumière (1895),Documentary|Short
2,12,The Arrival of a Train (1896),Documentary|Short
3,25,The Oxford and Cambridge University Boat Race ...,
4,91,Le manoir du diable (1896),Short|Horror


In [34]:
print(reviews.shape)
display(reviews.head())

(863866, 4)


Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,114508,8,1381006850
1,2,208092,5,1586466072
2,2,358273,9,1579057827
3,2,10039344,5,1578603053
4,2,6751668,9,1578955697


In [35]:
dict_sol1 = {
'The number of movies in the dataset': movies['movie'].nunique()
,'The number of ratings in the dataset': reviews['rating'].notnull().sum()
,'The number of different genres': movies['genre'].nunique()
,'The number of unique users in the dataset': reviews['user_id'].nunique()
,'The number missing ratings in the reviews dataset': reviews['rating'].isna().sum()
,'The average rating given across all ratings': reviews['rating'].mean()
,'The minimum rating given across all ratings': reviews['rating'].min()
,'The maximum rating given across all ratings': reviews['rating'].max()
}

In [36]:
dict_sol1

{'The number of movies in the dataset': 35416,
 'The number of ratings in the dataset': 863866,
 'The number of different genres': 2736,
 'The number of unique users in the dataset': 67353,
 'The number missing ratings in the reviews dataset': 0,
 'The average rating given across all ratings': 7.315877693994207,
 'The minimum rating given across all ratings': 0,
 'The maximum rating given across all ratings': 10}
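
Note that `movies['genre'].nunique()` counts distinct genre *strings*, so a combination such as `Documentary|Short` is counted as one value. The figure of 2736 above is therefore the number of genre combinations, not of individual genres. A minimal sketch of the difference, using a small made-up frame (`toy_movies` is hypothetical and is not the real data):

```python
import pandas as pd

# Hypothetical toy data that mimics the '|'-separated genre column
toy_movies = pd.DataFrame({
    'movie': ['A (1894)', 'B (1895)', 'C (1896)', 'D (1896)'],
    'genre': ['Documentary|Short', 'Documentary|Short', None, 'Short|Horror'],
})

# Distinct genre strings (combinations), which is what nunique() reports
n_combinations = toy_movies['genre'].nunique()

# Split each string on '|', flatten the lists, and collect the individual genres
individual_genres = set(toy_movies['genre'].dropna().str.split('|').explode())
n_genres = len(individual_genres)

print(n_combinations, n_genres)
```

The same idea applied to the full `movies` dataframe should give the number of individual genres.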

#### 2. Data Cleaning

Next, we need to pull some additional relevant information out of the existing columns. 

For each of the datasets, there are a couple of cleaning steps we need to take care of:

#### Movies
* Pull the date from the title and create a new column
* Dummy the date column with 1's and 0's for each century of a movie (1800's, 1900's, and 2000's)
* Dummy the genre column with 1's and 0's for each genre

#### Reviews
* Create a date out of the timestamp

You can check your results against the header of my solution by running the cell below with the **show_clean_dataframes** function.

In [37]:
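def remove_year_in_paren(s):
    # Despite its name, this returns the text inside the last pair of
    # parentheses in the title, which is the release year (e.g. '1894')
    close_left = s.rfind('(')
    close_right = s.rfind(')')
    s_paren = s[close_left+1:close_right]

    return s_paren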

In [38]:
movies['date'] = movies['movie'].apply(remove_year_in_paren)
movies.head()

Unnamed: 0,movie_id,movie,genre,date
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895
2,12,The Arrival of a Train (1896),Documentary|Short,1896
3,25,The Oxford and Cambridge University Boat Race ...,,1895
4,91,Le manoir du diable (1896),Short|Horror,1896


In [39]:
date_ind = {'18':"1800's",'19':"1900's",'20':"2000's"}
for date in date_ind:
    movies.loc[:,date_ind[date]] = 0
    movies.loc[movies['date'].str[:2] == date, date_ind[date]] = 1
movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0


In [40]:
# number of different genres
genres = []
for val in movies.genre:
    try:
        genres.extend(val.split('|'))
    except AttributeError:
        pass
genres = set(genres)

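def split_genres(val, genre):
    # Return 1 if genre is one of the '|'-separated genres in val, else 0.
    # Matching whole names keeps 'Music' from also matching 'Musical'.
    try:
        return int(genre in val.split('|'))
    except AttributeError:
        # Missing genres are NaN (a float), which has no split method
        return 0

# Apply function for each genre
for gene in genres:
    movies[gene] = movies['genre'].apply(split_genres, genre=gene)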
# print("The number of genres is {}.".format(len(genres)))

# movies = pd.concat([movies,pd.get_dummies(movies['genre'])],axis=1)
movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Crime,Fantasy,Music,...,Short,Documentary,Mystery,Reality-TV,Talk-Show,History,Sci-Fi,Action,Biography,Thriller
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
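
As an alternative way to build the genre indicator columns, pandas provides `Series.str.get_dummies`, which splits each string on a separator and creates one 0/1 column per distinct value. The sketch below uses hypothetical toy data (`toy_movies` is made up for illustration); exact genre names are matched, so `Music` and `Musical` stay separate.

```python
import pandas as pd

# Hypothetical toy data with '|'-separated genres
toy_movies = pd.DataFrame({
    'movie': ['A (1894)', 'B (1990)', 'C (2001)'],
    'genre': ['Documentary|Short', 'Music', 'Musical|Comedy'],
})

# One indicator column per individual genre
genre_dummies = toy_movies['genre'].str.get_dummies(sep='|')

# Attach the indicator columns to the original frame
toy_clean = pd.concat([toy_movies, genre_dummies], axis=1)
print(toy_clean)
```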


In [41]:
# timestamp was read in as a string, so cast it to an integer before converting
reviews['date'] = pd.to_datetime(reviews['timestamp'].astype('int64'), unit='s')
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date
0,1,114508,8,1381006850,2013-10-05 21:00:50
1,2,208092,5,1586466072,2020-04-09 21:01:12
2,2,358273,9,1579057827,2020-01-15 03:10:27
3,2,10039344,5,1578603053,2020-01-09 20:50:53
4,2,6751668,9,1578955697,2020-01-13 22:48:17


## Solution

The solution to the previous notebook is available in two videos below. Remember you can access the solution notebooks from within the classroom workspaces by clicking on the orange, Jupyter Notebook icon in the upper left hand corner.

## Lesson 1 - Knowledge Based Recommendations

A knowledge based recommendation is one in which knowledge about the item or user preferences is used to make a recommendation.

Knowledge based recommendations are pretty common when purchasing luxury items. Take a look at the filters available on Zillow in the image below. This is an example of building in a knowledge based recommendation, as users can add their own preferences to the items that are provided.

<center><img src="rec_02.png" width=500></center>

* **Rank Based Recommendations:** Recommendations based on highest ratings, most purchases, most listened to, etc.
> Based on frequency independent of condition

* **Knowledge Based Recommendations:** Knowledge about the item or user preferences is used to make a recommendation

<center><img src="rec_01.png" width=500></center>

Often a rank based algorithm is provided along with knowledge based recommendations to bring the most popular items in particular categories to the user's attention.

In the next concept, you will get some practice implementing this type of recommendation for the MovieTweetings dataset.

## Exercise 2 - Recommendation with MovieTweetings: Most Popular Recommendation

Now that you have created the necessary columns we will be using throughout the rest of the lesson on creating recommendations, let's get started with the first of our recommendations.

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.

In [42]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import .rec_tests as t

%matplotlib inline

# Read in the datasets
movies = pd.read_csv('06_recommendation_engines/movies_clean.csv')
reviews = pd.read_csv('06_recommendation_engines/reviews_clean.csv')
del movies['Unnamed: 0']
del reviews['Unnamed: 0']

#### 1. How To Find The Most Popular Movies

For this notebook, we have a single task.  The task is that no matter the user, we need to provide a list of the recommendations based on simply the most popular items.

For this task, we will consider what is "most popular" based on the following criteria:

* A movie with the highest average rating is considered best
* With ties, movies that have more ratings are better
* A movie must have a minimum of 5 ratings to be considered among the best movies
* If movies are tied in their average rating and number of ratings, the movie with the most recent rating ranks higher

With these criteria, the goal for this notebook is to take a `user_id` and provide back the `n_top` recommendations.  Use the function below as the scaffolding that will be used for all the future recommendations as well.

In [43]:
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
0,1,68646,10,1381620027,2013-10-12 23:20:27,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
1,1,113277,10,1379466669,2013-09-18 01:11:09,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,2,422720,8,1412178746,2014-10-01 15:52:26,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3,2,454876,8,1394818630,2014-03-14 17:37:10,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,2,790636,7,1389963947,2014-01-17 13:05:47,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [44]:
def create_ranked_df(movies, reviews):
    '''
    INPUT
    movies - the movies dataframe
    reviews - the reviews dataframe

    OUTPUT
    ranked_movies - a dataframe with movies that are sorted by highest avg rating, more reviews,
                    then time, and must have more than 4 ratings
    '''

    # Pull the average ratings and number of ratings for each movie
    movie_ratings = reviews.groupby('movie_id')['rating']
    avg_ratings = movie_ratings.mean()
    num_ratings = movie_ratings.count()

    # Date of the most recent rating for each movie
    # (selecting 'date' before max() avoids aggregating every column)
    last_rating = pd.DataFrame(reviews.groupby('movie_id')['date'].max())
    last_rating.columns = ['last_rating']

    # Combine the average, count, and last rating date
    rating_count_df = pd.DataFrame({'avg_rating': avg_ratings, 'num_ratings': num_ratings})
    rating_count_df = rating_count_df.join(last_rating)

    # merge with the movies dataset
    movie_recs = movies.set_index('movie_id').join(rating_count_df)

    # sort by top avg rating, then number of ratings, then most recent rating
    ranked_movies = movie_recs.sort_values(['avg_rating', 'num_ratings', 'last_rating'], ascending=False)

    # for edge cases - subset the movie list to those with only 5 or more reviews
    ranked_movies = ranked_movies[ranked_movies['num_ratings'] > 4]

    return ranked_movies

    
def popular_recommendations(user_id, n_top, ranked_movies):
    '''
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time

    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''

    top_movies = list(ranked_movies['movie'][:n_top])

    return top_movies

Using the four criteria above, you should be able to put together the above function.  If you feel confident in your solution, check the results of your function against our solution. On the next page, you can see a walkthrough, and you can of course get the solution by looking at the solution notebook available in this workspace.

In [45]:
# Top 20 movies recommended for id 1
ranked_movies = create_ranked_df(movies, reviews) # only run this once - it is not fast

recs_20_for_1 = popular_recommendations('1', 20, ranked_movies)

# Top 5 movies recommended for id 53968
recs_5_for_53968 = popular_recommendations('53968', 5, ranked_movies)

# Top 100 movies recommended for id 70000
recs_100_for_70000 = popular_recommendations('70000', 100, ranked_movies)

# Top 35 movies recommended for id 43
recs_35_for_43 = popular_recommendations('43', 35, ranked_movies)

**Notice:** This wasn't the only way we could have determined the "top rated" movies.  You can imagine that in keeping track of trending news or trending social events, you would likely want to create a time window from the current time, and then pull the articles in the most recent time frame.  There are always going to be some subjective decisions to be made.  
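
The time-window idea described above could be sketched as follows. The 30-day window, the `recent_popular` helper, and the toy ratings are all assumptions made for illustration; they are not part of the lesson's solution.

```python
import pandas as pd

def recent_popular(reviews, now, window_days=30, n_top=3):
    """Rank items by how many ratings they received in the last window_days days."""
    cutoff = now - pd.Timedelta(days=window_days)
    recent = reviews[reviews['date'] >= cutoff]
    counts = recent.groupby('movie_id')['rating'].count()
    return list(counts.sort_values(ascending=False).index[:n_top])

# Hypothetical toy ratings
toy_reviews = pd.DataFrame({
    'movie_id': [1, 1, 2, 2, 2, 3],
    'rating':   [9, 8, 7, 6, 8, 10],
    'date': pd.to_datetime(['2020-01-01', '2020-03-20', '2020-03-25',
                            '2020-03-28', '2020-03-30', '2019-12-01']),
})

top_recent = recent_popular(toy_reviews, now=pd.Timestamp('2020-04-01'))
print(top_recent)
```

Here movie 3 has the highest rating overall but drops out entirely, because its only rating falls outside the window.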

If you find that no one is paying any attention to your most popular recommendations, then it might be time to find a new way to recommend, which is what the next parts of the lesson should prepare us to do!

### Part II: Adding Filters

Now that you have created a function to give back the **n_top** movies, let's make it a bit more robust.  Add arguments that will act as filters for the movie **year** and **genre**.  

Use the cells below to adjust your existing function to allow for **year** and **genre** arguments as **lists** of **strings**.  Your final results should then be filtered to only movies within the lists of provided years and genres (as `or` conditions within each list).  If no list is provided, there should be no filter applied.

You can adjust other inputs as necessary to retrieve the final results you are looking for!

In [46]:
def popular_recs_filtered(user_id, n_top, ranked_movies, years=None, genres=None):
    '''
    Return the n_top most popular movies, optionally filtered by year and genre.
    
    INPUT:
    user_id - the user_id (str) of the individual you are making recommendations for
    n_top - an integer of the number recommendations you want back
    ranked_movies - a pandas dataframe of the already ranked movies based on avg rating, count, and time
    years - a list of strings with years of movies
    genres - a list of strings with genres of movies
    
    OUTPUT:
    top_movies - a list of the n_top recommended movies by movie title in order best to worst
    '''
    # Filter movies based on year and genre
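    if years is not None:
        # Compare as strings so the filter works whether 'date' was read in as text or as integers
        ranked_movies = ranked_movies[ranked_movies['date'].astype(str).isin(years)]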

    if genres is not None:
        num_genre_match = ranked_movies[genres].sum(axis=1)
        ranked_movies = ranked_movies.loc[num_genre_match > 0, :]
            
            
    # create top movies list 
    top_movies = list(ranked_movies['movie'][:n_top])

    return top_movies



In [47]:
# Top 20 movies recommended for id 1 with years=['2015', '2016', '2017', '2018'], genres=['History']
recs_20_for_1_filtered = popular_recs_filtered('1', 20, ranked_movies, years=['2015', '2016', '2017', '2018'], genres=['History'])

# Top 5 movies recommended for id 53968 with no genre filter but years=['2015', '2016', '2017', '2018']
recs_5_for_53968_filtered = popular_recs_filtered('53968', 5, ranked_movies, years=['2015', '2016', '2017', '2018'])

# Top 100 movies recommended for id 70000 with no year filter but genres=['History', 'News']
recs_100_for_70000_filtered = popular_recs_filtered('70000', 100, ranked_movies, genres=['History', 'News'])

## Lesson 2 - More Personalized Recommendations

In some cases, we need to be able to send recommendations without a user telling us exactly what they want, or in a more personalized way than simply the top items. Imagine you want to send an email of recommendations or place recommendations within a web page (the side of a blog or as a banner advertisement); in these cases, it is often useful to use information that we know about users or items to make these recommendations. This leads to some additional recommendation methods!

#### Collaborative Filtering & Content Based Recommendations

* **Collaborative Filtering:** A method of making recommendations based on the interactions between users and items

<center><img src='rec_03.png' width=500></center>

* **Content Based Recommendations:** A method of making recommendations based on information about the users or items

<center><img src='rec_04.png' width=500></center>

* **Example of Data for Collaborative Filtering:**
    * Item ratings for each user
    * Item liked by user or not
    * Item used by user or not
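
The first kind of data listed above, item ratings for each user, is often arranged as a user-item matrix with one row per user and one column per item. A minimal sketch with made-up ratings (`toy_reviews` is hypothetical):

```python
import pandas as pd

# Hypothetical ratings in the same long format as the reviews data
toy_reviews = pd.DataFrame({
    'user_id':  ['1', '1', '2', '2', '3'],
    'movie_id': ['10', '20', '10', '30', '20'],
    'rating':   [8, 6, 9, 7, 5],
})

# Rows are users, columns are movies, and cells hold the rating (NaN if unrated)
user_item = toy_reviews.pivot(index='user_id', columns='movie_id', values='rating')
print(user_item)
```

Most cells in a real user-item matrix are missing, because each user rates only a small fraction of the items.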

<center><img src='rec_05.png' width=500></center>

> * When a user is inputting her/his information (location input), this is an example of knowledge based recommending
> * When we use connections between users and items (connecting Mike and Pradeep as similar), this is an example of collaborative filtering
> * When we use information about the items or users to recommend new items (items related to robotics), this is an example of content based recommending.

#### Collaborative Filtering Methods

1. Model Based
2. Neighborhood Based
> used to identify items or users that are "neighbors" with one another

<center><img src='rec_06.png' width=500></center>

There are a number of ways we might go about finding an individual's closest neighbors - the metrics we will take a closer look at include:
1. Pearson's correlation coefficient
2. Spearman's correlation coefficient
3. Kendall's Tau
4. Euclidean Distance
5. Manhattan Distance

<center><img src='rec_07.png' width=500></center>

<center><img src='rec_08.png' width=500></center>

In the next cells, you will work through a few examples to get more familiar with how each of these metrics is computed, and why you might use one over another.
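
Unlike the three correlation measures, the last two metrics in the list are distances, so smaller values mean the two users are more alike. A minimal sketch on two made-up rating vectors (the function names and the data are assumptions for illustration):

```python
import numpy as np

def euclidean_dist(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

def manhattan_dist(x, y):
    """City-block distance: sum of the absolute differences."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

# Hypothetical ratings that two users gave to the same three movies
user_a = np.array([8, 6, 9])
user_b = np.array([5, 10, 9])

print(euclidean_dist(user_a, user_b))  # differences are 3, -4, 0
print(manhattan_dist(user_a, user_b))
```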

### Exercise 3 - How to Find Your Neighbor?

In neighborhood based collaborative filtering, it is incredibly important to be able to identify an individual's neighbors. Let's look at a small dataset in order to understand how we can use different metrics to identify close neighbors.

In [48]:
import numpy as np
import pandas as pd
from scipy.stats import spearmanr, kendalltau
import matplotlib.pyplot as plt
# import tests as t
# import helper as h
%matplotlib inline

play_data = pd.DataFrame({'x1': [-3, -2, -1, 0, 1, 2, 3], 
               'x2': [9, 4, 1, 0, 1, 4, 9],
               'x3': [1, 2, 3, 4, 5, 6, 7],
               'x4': [2, 5, 15, 27, 28, 30, 31]
})

# keep the columns of play_data in a fixed order
play_data = play_data[['x1', 'x2', 'x3', 'x4']]

### Measures of Similarity

The first metrics we will look at have similar characteristics:

1. Pearson's Correlation Coefficient
2. Spearman's Correlation Coefficient
3. Kendall's Tau

Let's take a look at each of these individually.

### Pearson's Correlation

First, **Pearson's correlation coefficient** is a measure related to the strength and direction of a **linear** relationship.  

If we have two vectors x and y, we can compare their individual elements in the following way to calculate Pearson's correlation coefficient:


$$CORR(\textbf{x}, \textbf{y}) = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum\limits_{i=1}^{n}(y_i-\bar{y})^2}} $$

where 

$$\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}x_i$$

##### 1. Write a function that takes in two vectors and returns the Pearson correlation coefficient.  You can then compare your answer to the built in function in NumPy by using the assert statements in the following cell.

In [49]:
def pearson_corr(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    corr - the pearson correlation coefficient for comparing x and y
    '''
    a = x - np.mean(x)
    b = y - np.mean(y)
    numerator = np.sum(a*b)
    
    c = np.sum(a**2)
    c_1 = np.sqrt(c)
    d = np.sum(b**2)
    d_1 = np.sqrt(d)
    denominator = c_1 * d_1
    
    corr = numerator / denominator
                 
    return corr            

In [50]:
assert np.isclose(pearson_corr(play_data['x1'], play_data['x2']), np.corrcoef(play_data['x1'], play_data['x2'])[0][1]), 'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'.format(pearson_corr(play_data['x1'], play_data['x2']))
assert np.isclose(pearson_corr(play_data['x1'], play_data['x3']), np.corrcoef(play_data['x1'], play_data['x3'])[0][1]), 'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'.format(np.corrcoef(play_data['x1'], play_data['x3'])[0][1], pearson_corr(play_data['x1'], play_data['x3']))
assert np.isclose(pearson_corr(play_data['x3'], play_data['x4']), np.corrcoef(play_data['x3'], play_data['x4'])[0][1]), 'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'.format(np.corrcoef(play_data['x3'], play_data['x4'])[0][1], pearson_corr(play_data['x3'], play_data['x4']))
print("If this is all you see, it looks like you are all set!  Nice job coding up Pearson's correlation coefficient!")

If this is all you see, it looks like you are all set!  Nice job coding up Pearson's correlation coefficient!


##### 2. Now that you have computed **Pearson's correlation coefficient**, use the dictionary below to identify statements that are true about **this** measure.

In [51]:
a = True
b = False
c = "We can't be sure."


pearson_dct = {"If when x increases, y always increases, Pearson's correlation will be always be 1.": b,
               "If when x increases by 1, y always increases by 3, Pearson's correlation will always be 1.": a,
               "If when x increases by 1, y always decreases by 5, Pearson's correlation will always be -1.": a,
               "If when x increases by 1, y increases by 3 times x, Pearson's correlation will always be 1.": b
}

sim_2_sol(pearson_dct)

That's right!  Pearson's correlation relates to a linear relationship.  The second and third cases are examples of perfect linear relationships, where we would receive values of 1 and -1.  Only having an increase or decrease that are directly related will not lead to a Pearson's correlation coefficient of 1 or -1.  You can see this by testing out your function using the examples above without using assert statements.
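
One way to check these statements without assert statements is to build a few small relationships and look at the correlations directly. The sketch below uses NumPy's built-in `np.corrcoef`; the example vectors are made up for illustration.

```python
import numpy as np

x = np.arange(1, 8)        # 1, 2, ..., 7

linear_up = 3 * x + 2      # y increases by 3 whenever x increases by 1
linear_down = -5 * x + 1   # y decreases by 5 whenever x increases by 1
monotone = x ** 3          # y always increases with x, but not linearly

def pearson(u, v):
    # Off-diagonal entry of the 2x2 correlation matrix
    return np.corrcoef(u, v)[0, 1]

r_up = pearson(x, linear_up)
r_down = pearson(x, linear_down)
r_mono = pearson(x, monotone)

print(r_up, r_down, r_mono)
```

The two linear cases give 1 and -1 (up to floating point error), while the always-increasing cubic case gives a value noticeably below 1.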


### Spearman's Correlation

Now, let's look at **Spearman's correlation coefficient**. Spearman's correlation is what is known as a [non-parametric](https://en.wikipedia.org/wiki/Nonparametric_statistics) statistic, which is a statistic whose distribution doesn't depend on parameters. (Statistics that follow normal distributions or binomial distributions are examples of parametric statistics.)  

Frequently non-parametric statistics are based on the ranks of data rather than the original values collected.  This happens to be the case with Spearman's correlation coefficient, which is calculated similarly to Pearson's correlation.  However, instead of using the raw data, we use the rank of each value.

You can quickly change from the raw data to the ranks using the **.rank()** method; the first code cell below does this for `x1`.

If we map each of our data vectors to ranked values:

$$\textbf{x} \rightarrow \textbf{x}^{r}$$
$$\textbf{y} \rightarrow \textbf{y}^{r}$$

Here, we let the **r** indicate these are ranked values (this is not raising any value to the power of r).  Then we compute Spearman's correlation coefficient as:

$$SCORR(\textbf{x}, \textbf{y}) = \frac{\sum\limits_{i=1}^{n}(x^{r}_i - \bar{x}^{r})(y^{r}_i - \bar{y}^{r})}{\sqrt{\sum\limits_{i=1}^{n}(x^{r}_i-\bar{x}^{r})^2}\sqrt{\sum\limits_{i=1}^{n}(y^{r}_i-\bar{y}^{r})^2}} $$

where 

$$\bar{x}^r = \frac{1}{n}\sum\limits_{i=1}^{n}x^r_i$$

##### 3. Write a function that takes in two vectors and returns the Spearman correlation coefficient.  You can then compare your answer to the built in function in scipy stats by using the assert statements in the following cell.

In [52]:
print("The ranked values for the variable x1 are: {}".format(np.array(play_data['x1'].rank())))
print("The raw data values for the variable x1 are: {}".format(np.array(play_data['x1'])))

The ranked values for the variable x1 are: [1. 2. 3. 4. 5. 6. 7.]
The raw data values for the variable x1 are: [-3 -2 -1  0  1  2  3]


In [57]:
def corr_spearman(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    corr - the spearman correlation coefficient for comparing x and y
    '''
    x_r, y_r = x.rank(), y.rank()
    
    corr = pearson_corr(x=x_r,y=y_r)
                    
    return corr  

In [58]:
# This cell will test your function against the built in scipy function
assert corr_spearman(play_data['x1'], play_data['x2']) == spearmanr(play_data['x1'], play_data['x2'])[0], 'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'.format(corr_spearman(play_data['x1'], play_data['x2']))
assert round(corr_spearman(play_data['x1'], play_data['x3']), 2) == round(spearmanr(play_data['x1'], play_data['x3'])[0], 2), 'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'.format(spearmanr(play_data['x1'], play_data['x3'])[0], corr_spearman(play_data['x1'], play_data['x3']))
assert round(corr_spearman(play_data['x3'], play_data['x4']), 2) == round(spearmanr(play_data['x3'], play_data['x4'])[0], 2), 'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'.format(spearmanr(play_data['x3'], play_data['x4'])[0], corr_spearman(play_data['x3'], play_data['x4']))
print("If this is all you see, it looks like you are all set!  Nice job coding up Spearman's correlation coefficient!")

If this is all you see, it looks like you are all set!  Nice job coding up Spearman's correlation coefficient!


##### 4. Now that you have computed **Spearman's correlation coefficient**, use the dictionary below to identify statements that are true about **this** measure.

In [60]:
a = True
b = False
c = "We can't be sure."


spearman_dct = {"If when x increases, y always increases, Spearman's correlation will be always be 1.": a,
               "If when x increases by 1, y always increases by 3, Spearman's correlation will always be 1.": a,
               "If when x increases by 1, y always decreases by 5, Spearman's correlation will always be -1.": a,
               "If when x increases by 1, y increases by 3 times x, Spearman's correlation will always be 1.": a
}

sim_4_sol(spearman_dct)

That's right!  Unlike Pearson's correlation, Spearman's correlation can have perfect relationships (1 or -1 values) that aren't linear relationships.  You will notice that neither Spearman or Pearson correlation values suggest a relation when there are quadratic relationships.
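
One way to check these claims for yourself is sketched below. It assumes made-up data (not `play_data`) and uses `pearsonr` and `spearmanr` from `scipy.stats`: an exponential curve stands in for "always increasing but not linear", and a parabola centered at 0 stands in for the quadratic case.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data, not taken from play_data
x = np.arange(-3, 4)        # [-3, -2, -1, 0, 1, 2, 3]
y_monotonic = np.exp(x)     # always increases with x, but not linearly
y_quadratic = x ** 2        # decreases, then increases (not monotonic)

# Monotonic, non-linear relationship: Spearman is 1, Pearson is below 1
print(spearmanr(x, y_monotonic)[0])
print(pearsonr(x, y_monotonic)[0])

# Symmetric quadratic relationship: both measures are (numerically) 0
print(spearmanr(x, y_quadratic)[0])
print(pearsonr(x, y_quadratic)[0])
```

The ranks of `x` and `y_monotonic` are identical, which is why Spearman's coefficient reaches 1 even though the points do not fall on a line.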


### Kendall's Tau

Kendall's Tau is quite similar to Spearman's correlation coefficient.  Both are non-parametric measures of a relationship.  Specifically, both Spearman's and Kendall's coefficients are calculated from the ranks of the data rather than the raw values.  

Similar to both of the previous measures, Kendall's Tau is always between -1 and 1, where -1 suggests a strong, negative relationship between two variables and 1 suggests a strong, positive relationship between two variables.

Though Spearman's and Kendall's measures are very similar, there is a statistical advantage to choosing Kendall's measure: Kendall's Tau has smaller variability with larger sample sizes.  However, Spearman's measure is cheaper to compute, as the direct pairwise calculation of Kendall's Tau is O(n^2), while Spearman's correlation is O(n log n) because ranking only requires sorting the data. You can find more on this topic in [this thread](https://www.researchgate.net/post/Does_Spearmans_rho_have_any_advantage_over_Kendalls_tau).

Let's take a closer look at exactly how this measure is calculated.  Again, we want to map our data to ranks:

$$\textbf{x} \rightarrow \textbf{x}^{r}$$
$$\textbf{y} \rightarrow \textbf{y}^{r}$$

Then we calculate Kendall's Tau as:

$$TAU(\textbf{x}, \textbf{y}) = \frac{2}{n(n -1)}\sum_{i < j}sgn(x^r_i - x^r_j)sgn(y^r_i - y^r_j)$$

Here $sgn$ takes the sign of the difference in the ranked values.  An alternative way to write 

$$sgn(x^r_i - x^r_j)$$ 

is in the following way:

$$
 \begin{cases} 
      -1  & x^r_i < x^r_j \\
      0 & x^r_i = x^r_j \\
      1 & x^r_i > x^r_j 
   \end{cases}
$$

Therefore the possible results of 

$$sgn(x^r_i - x^r_j)sgn(y^r_i - y^r_j)$$

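are only 1, -1, or 0.  Summing these values and dividing by the number of pairs, $n(n-1)/2$, gives the proportion of pairs whose ranks in **x** and **y** move in the same direction minus the proportion of pairs whose ranks move in opposite directions.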
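
As a small worked example of the sum above, here is a sketch that assumes two made-up rank vectors with no ties (the names `x_r`, `y_r`, and `terms` are illustrative only):

```python
from itertools import combinations

import numpy as np

# Hypothetical ranked values (already ranks, no ties)
x_r = [1, 2, 3, 4]
y_r = [1, 3, 2, 4]

n = len(x_r)

# One term per pair i < j: +1 if both ranks move in the same direction,
# -1 if they move in opposite directions, 0 if either pair is tied
terms = [int(np.sign(x_r[i] - x_r[j]) * np.sign(y_r[i] - y_r[j]))
         for i, j in combinations(range(n), 2)]

tau = 2 / (n * (n - 1)) * sum(terms)

print(terms)  # [1, 1, 1, -1, 1, 1]
print(tau)    # 4 of the 6 pair terms remain after cancelling, so tau = 4/6
```

Only the pair of second and third observations is out of order, so five terms are +1 and one is -1, giving (5 - 1) / 6.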

##### 5. Write a function that takes in two vectors and returns Kendall's Tau.  You can then compare your answer to the built-in function in `scipy.stats` by using the assert statements in the following cell.

In [74]:
def kendalls_tau(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    tau - the kendall's tau for comparing x and y
    '''    
    # Change each vector to ranked values
    x_r, y_r = x.rank(), y.rank()

    n = len(x)
    
    frac = 2 / (n * (n-1))
    
    sum_vals = 0
    
    for i, (x_i, y_i) in enumerate(zip(x_r, y_r)):
        for j, (x_j, y_j) in enumerate(zip(x_r, y_r)):
            if i < j:
                sum_vals += np.sign(x_i - x_j)*np.sign(y_i - y_j)
                        
    tau = sum_vals * frac
    
    return tau

In [75]:
# This cell will test your function against the built in scipy function
assert kendalls_tau(play_data['x1'], play_data['x2']) == kendalltau(play_data['x1'], play_data['x2'])[0], 'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'.format(kendalls_tau(play_data['x1'], play_data['x2']))
assert round(kendalls_tau(play_data['x1'], play_data['x3']), 2) == round(kendalltau(play_data['x1'], play_data['x3'])[0], 2), 'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'.format(kendalltau(play_data['x1'], play_data['x3'])[0], kendalls_tau(play_data['x1'], play_data['x3']))
assert round(kendalls_tau(play_data['x3'], play_data['x4']), 2) == round(kendalltau(play_data['x3'], play_data['x4'])[0], 2), 'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'.format(kendalltau(play_data['x3'], play_data['x4'])[0], kendalls_tau(play_data['x3'], play_data['x4']))
print("If this is all you see, it looks like you are all set!  Nice job coding up Kendall's Tau!")


If this is all you see, it looks like you are all set!  Nice job coding up Kendall's Tau!


##### 6. Use your functions (and/or your knowledge of each of the above coefficients) to accurately identify each of the below statements as True or False.  **Note:** There may be some rounding differences due to the way numbers are stored, so it is recommended that you consider comparisons to 4 or fewer decimal places.

In [78]:
a = True
b = False
c = "We can't be sure."


corr_comp_dct = {"For all columns of play_data, Spearman and Kendall's measures match.": a,
                "For all columns of play_data, Spearman and Pearson's measures match.": b, 
                "For all columns of play_data, Pearson and Kendall's measures match.": b
}

sim_6_sol(corr_comp_dct)

That's right!  Pearson does not match the other two measures, as it looks specifically for linear relationships.  However, Spearman and Kenall's measures are exactly the same to one another in the cases related to play_data.


### Distance Measures

All of the above are measures of correlation.  There are also many distance measures.  [This is a great article](http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/) on some popular distance metrics.  In this notebook, we will be looking specifically at two of these measures.  

1. Euclidean Distance
2. Manhattan Distance

Unlike the three measures you built functions for, these two measures take on values between 0 and potentially infinity.  Values closer to 0 imply that two vectors are more similar to one another.  The larger these values become, the more dissimilar the two vectors are.

Choosing one of these two `distance` metrics vs. one of the three `similarity` measures above is often a matter of personal preference, audience, and the specifics of your data.  You will see in a bit a case where one of these measures (Euclidean or Manhattan distance) is preferable to Pearson's correlation coefficient.

### Euclidean Distance

Euclidean distance is the straight-line distance between two vectors.

For two vectors **x** and **y**, we can compute this as:

$$ EUC(\textbf{x}, \textbf{y}) = \sqrt{\sum\limits_{i=1}^{n}(x_i - y_i)^2}$$

### Manhattan Distance

Unlike Euclidean distance, Manhattan distance is a 'city block' distance from one vector to another.  You can imagine it as the distance between two points when you are not able to go through buildings and must instead travel along the street grid.

Specifically, this distance is computed as:

$$ MANHATTAN(\textbf{x}, \textbf{y}) = \sum\limits_{i=1}^{n}|x_i - y_i|$$

Using the two formulas above, write one function for each measure that takes two vectors and returns the Euclidean or Manhattan distance between them.


<img src="06_recommendation_engines/images/distances.png">

You can see in the above image, the **blue** line gives the **Manhattan** distance, while the **green** line gives the **Euclidean** distance between two points.
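
To make the two formulas concrete, here is a minimal sketch that assumes two made-up 2-D points (not from `play_data`).  It applies each formula directly with NumPy:

```python
import numpy as np

# Hypothetical points, not taken from play_data
x = np.array([0, 0])
y = np.array([3, 4])

# Straight-line distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# City-block distance: sum of the absolute differences per coordinate
manhattan = np.sum(np.abs(x - y))

print(euclidean)  # 5.0
print(manhattan)  # 7
```

Walking 3 blocks in one direction and 4 in the other covers 7 units, while the straight line between the same two points is only 5 units long.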

##### 7. Use the cell below to complete a function for each distance metric.  Then test your functions against the built-in values using the cell that follows.

In [83]:
def eucl_dist(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    euc - the euclidean distance between x and y
    '''  
    return np.linalg.norm(x - y)
    
def manhat_dist(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    manhat - the manhattan distance between x and y
    '''  
    return sum(abs(a - b) for a, b in zip(x, y))

In [84]:
# Test your functions
assert test_eucl(play_data['x1'], play_data['x2']) == eucl_dist(play_data['x1'], play_data['x2'])
assert test_eucl(play_data['x2'], play_data['x3']) == eucl_dist(play_data['x2'], play_data['x3'])
assert test_manhat(play_data['x1'], play_data['x2']) == manhat_dist(play_data['x1'], play_data['x2'])
assert test_manhat(play_data['x2'], play_data['x3']) == manhat_dist(play_data['x2'], play_data['x3'])


### Final Note

It is worth noting that two vectors can be very similar according to the three correlation measures at the top of the notebook while being very far apart according to these final two distance measures.  Again, understanding your specific situation will help you decide whether your metric is appropriate.
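
A quick sketch of this situation, assuming two made-up vectors where one is simply the other shifted by 100:

```python
import numpy as np

# Hypothetical vectors: y is x shifted up by a constant
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x + 100

# A perfect linear relationship, so Pearson's correlation is 1
pearson = np.corrcoef(x, y)[0, 1]

# Every coordinate differs by 100, so the distance is large
euclidean = np.linalg.norm(x - y)

print(pearson)    # approximately 1.0
print(euclidean)  # sqrt(5 * 100**2), about 223.6
```

The correlation measures ignore the shift entirely, while the distance measures are driven by it.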