## My solution

In [1]:
import pandas
import numpy
from src.load_data import load_csv,select_features
from src.operations import vectorize, similarity, similarities, top_similar_items
from src.helpers import get_movie_id, get_movie_name, get_movie_year
from src.analyze import get_most_similar,get_recommendations

## Load data

To generalize the solution to any kind of data, the first step is to implement a loader. It takes the paths to the files where the rating, user and movie data are stored (movies are treated generically as items).

In [2]:
Pathusers,Pathmovies,Pathratings = "data/users.csv","data/movies.csv","data/ratings.csv"

In [3]:
users,movies,ratings=load_csv(Pathusers,Pathmovies,Pathratings)

Then we choose the features used to build the similarity matrix. Note that the 'title' column is renamed to 'ID', so the same code can be used for other kinds of items such as restaurants or museums. The drawback is that 'title' (or its equivalent) must come first in the selected columns.

In [4]:
features = ['Animation', "Children's", 'Comedy',
       'Adventure', 'Fantasy', 'Romance', 'Drama', 'Action', 'Crime',
       'Thriller', 'Horror', 'Sci-Fi', 'Documentary', 'War', 'Musical',
       'Mystery', 'Film-Noir', 'Western']
cols= ['title']+features

In [5]:
moviesProcessed = select_features(movies,cols)

In [6]:
moviesProcessed.head()

Unnamed: 0,ID,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,Toy Story,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Jumanji,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Grumpier Old Men,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Waiting to Exhale,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Father of the Bride Part II,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
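
The renaming step described above could look roughly like the sketch below. It is not the code in `src.load_data`: the name `select_features_sketch` and the tiny demo frame are hypothetical, and the only behaviour taken from the notebook is that the first selected column ends up named 'ID'.

```python
import pandas as pd


def select_features_sketch(items, cols):
    """Keep only the selected columns and rename the first one to 'ID'.

    Hypothetical helper: the first entry of `cols` is assumed to be the
    item name column ('title' for movies, but it could be anything).
    """
    if not cols:
        raise ValueError("cols must contain at least the name column")
    selected = items[cols].copy()
    # Whatever the name column is called, expose it under a generic label.
    return selected.rename(columns={cols[0]: "ID"})


# Made-up two-row frame, only to show the effect of the renaming.
movies_demo = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Toy Story", "Jumanji"],
    "Animation": [1.0, 0.0],
    "Adventure": [0.0, 1.0],
})
processed_demo = select_features_sketch(movies_demo, ["title", "Animation", "Adventure"])
print(list(processed_demo.columns))
```

Relying on column position keeps the function independent of the dataset, at the cost of the ordering constraint mentioned above.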


## Operations

With the data loaded, we now implement the functions used to compute the similarity between items.

In [7]:
vectorize(moviesProcessed,features,'Toy Story')

Animation      1.0
Children's     1.0
Comedy         1.0
Adventure      0.0
Fantasy        0.0
Romance        0.0
Drama          0.0
Action         0.0
Crime          0.0
Thriller       0.0
Horror         0.0
Sci-Fi         0.0
Documentary    0.0
War            0.0
Musical        0.0
Mystery        0.0
Film-Noir      0.0
Western        0.0
Name: 0, dtype: float64

In [8]:
similarity(moviesProcessed,features,'Toy Story','Pocahontas')

2.0

In [9]:
similaritiesMatrix=similarities(moviesProcessed, features)

In [10]:
top_similar_items(similaritiesMatrix,moviesProcessed,0)

Similarity between Toy Story and Toy Story (index 0) is 3.0
Similarity between Toy Story and Jumanji (index 1) is 1.0
Similarity between Toy Story and Grumpier Old Men (index 2) is 1.0
Similarity between Toy Story and Waiting to Exhale (index 3) is 1.0
Similarity between Toy Story and Father of the Bride Part II (index 4) is 1.0
Similarity between Toy Story and Heat (index 5) is 0.0
Similarity between Toy Story and Sabrina (index 6) is 1.0
Similarity between Toy Story and Tom and Huck (index 7) is 1.0
Similarity between Toy Story and Sudden Death (index 8) is 0.0
Similarity between Toy Story and GoldenEye (index 9) is 0.0
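
The scores above are consistent with a dot product of the binary genre vectors (Toy Story with itself gives 3.0, one per genre). The sketch below assumes that this is the measure; the functions `dot_similarity` and `all_dot_similarities` are hypothetical and are not the ones in `src.operations`. The three demo rows reuse the genre flags shown in the table above, restricted to six genres.

```python
import numpy as np
import pandas as pd

genres = ["Animation", "Children's", "Comedy", "Adventure", "Fantasy", "Romance"]
demo = pd.DataFrame(
    [
        ["Toy Story", 1, 1, 1, 0, 0, 0],
        ["Jumanji", 0, 1, 0, 1, 1, 0],
        ["Grumpier Old Men", 0, 0, 1, 0, 0, 1],
    ],
    columns=["ID"] + genres,
)


def dot_similarity(frame, features, name_a, name_b):
    """Number of genres shared by two items (dot product of 0/1 vectors)."""
    a = frame.loc[frame["ID"] == name_a, features].iloc[0].to_numpy(dtype=float)
    b = frame.loc[frame["ID"] == name_b, features].iloc[0].to_numpy(dtype=float)
    return float(np.dot(a, b))


def all_dot_similarities(frame, features):
    """Pairwise similarities for every item at once: X @ X.T."""
    x = frame[features].to_numpy(dtype=float)
    return x @ x.T


sim_demo = all_dot_similarities(demo, genres)
print(dot_similarity(demo, genres, "Toy Story", "Jumanji"))
print(sim_demo)
```

Under this assumption the diagonal holds the number of genres of each item, so an item with many genres scores high against many others; a normalized measure such as cosine similarity would avoid that bias.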


## Analyze

Now that we have the similarity matrix, we want to infer recommendations for a given user.

In [11]:
get_most_similar(similaritiesMatrix,movies,'Toy Story')

[(667, 'Space Jam', 3.0),
 (3685, 'Adventures of Rocky and Bullwinkle, The', 3.0),
 (3682, 'Chicken Run', 3.0),
 (2009, 'Jungle Book, The', 3.0),
 (2011, 'Lady and the Tramp', 3.0),
 (2012, 'Little Mermaid, The', 3.0),
 (2033, 'Steamboat Willie', 3.0),
 (2072, 'American Tail, An', 3.0),
 (2073, 'American Tail: Fievel Goes West, An', 3.0)]

In [12]:
get_most_similar(similaritiesMatrix,movies,'Psycho',1960)

[(3593, "Puppet Master III: Toulon's Revenge", 2.0),
 (2923, 'Rawhead Rex', 2.0),
 (1312, 'Believers, The', 2.0),
 (3407, "Jacob's Ladder", 2.0),
 (1957, 'Disturbing Behavior', 2.0),
 (1927, 'Poltergeist III', 2.0),
 (1926, 'Poltergeist II: The Other Side', 2.0),
 (1925, 'Poltergeist', 2.0),
 (732, 'Thinner', 2.0),
 (69, 'From Dusk Till Dawn', 2.0)]
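
Looking up the most similar items from a precomputed matrix can be sketched as follows. This is an illustration only: `most_similar_sketch`, its `k` parameter and the tie-breaking by index are assumptions, and the real `get_most_similar` may order ties differently.

```python
import numpy as np


def most_similar_sketch(sim_matrix, titles, query_index, k=10):
    """Return up to k (index, title, score) tuples, best score first.

    The query item itself is skipped, since it always matches itself best.
    """
    scores = np.asarray(sim_matrix[query_index], dtype=float)
    # Sorting the negated scores gives descending order; the stable sort
    # keeps tied items in their original index order.
    order = np.argsort(-scores, kind="stable")
    ranked = [(int(i), titles[i], float(scores[i])) for i in order if i != query_index]
    return ranked[:k]


# Toy 3x3 matrix, made up for the example.
titles_demo = ["Toy Story", "Jumanji", "Grumpier Old Men"]
sim_small = np.array([
    [3.0, 1.0, 1.0],
    [1.0, 3.0, 0.0],
    [1.0, 0.0, 2.0],
])
print(most_similar_sketch(sim_small, titles_demo, 0, k=2))
```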

In [13]:
get_recommendations(999,movies, ratings, similaritiesMatrix)

Unnamed: 0,movie_id,title,similarity
0,166,First Knight,2.0
2,1451,Smilla's Sense of Snow,2.0
3,503,"Perfect World, A",2.0
4,3197,Man Bites Dog (C'est arrivé près de chez vous),2.0
5,1458,"Devil's Own, The",2.0


In [14]:
recommendations=get_recommendations(1176,movies,ratings,similaritiesMatrix)

## Evaluation

To judge whether a recommendation is good, I take into account the user's own rating (normalized) if they have already seen the recommended movie, since someone might enjoy watching a good movie again. The score also rewards the discovery of new movies; this exploration aspect can be disabled.

In [15]:
from src.evaluation import evaluation

In [29]:
evaluation(recommendations,ratings,0,explore=False),evaluation(recommendations,ratings,0,explore=True)

(0, 5)
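
One way to read the scoring rule described in this section is sketched below. It is an interpretation, not the code in `src.evaluation`: the function `evaluate_sketch`, the column names `movie_id` and `rating`, the maximum rating of 5 and the bonus of 1 per unseen movie are all assumptions.

```python
import pandas as pd


def evaluate_sketch(recommended_ids, user_ratings, explore=True, max_rating=5.0):
    """Score a list of recommended movie ids for one user.

    - A movie the user already rated adds its normalized rating (0 to 1).
    - An unseen movie adds 1 when `explore` is True, and nothing otherwise.
    """
    rating_by_movie = dict(zip(user_ratings["movie_id"], user_ratings["rating"]))
    score = 0.0
    for movie_id in recommended_ids:
        if movie_id in rating_by_movie:
            score += rating_by_movie[movie_id] / max_rating
        elif explore:
            score += 1.0
    return score


# Made-up ratings for a single user.
ratings_demo = pd.DataFrame({"movie_id": [1, 2], "rating": [5, 2]})
unseen = [10, 11, 12, 13, 14]
print(evaluate_sketch(unseen, ratings_demo, explore=False),
      evaluate_sketch(unseen, ratings_demo, explore=True))
```

With five recommendations that the user has never rated, this sketch gives 0 without exploration and 5 with it, which has the same shape as the result above.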

## Bonus

I started to implement a new feature based on the average ratings of similar users. The implementation is almost finished: what is missing is adding the column to the DataFrame and replacing the NaN values with a predefined value such as 0 (or -0.5 if we assume that a movie not seen by anyone in the user pool is unlikely to be a good recommendation).

To find similar users we use the user's age. Note that in this dataset ages are bucketed: all users in the same age range (10 years wide) are recorded with the same age value.
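
Because ages are bucketed, matching on age reduces to an equality test. A minimal sketch, assuming columns named `user_id` and `age` (the real column names and the real `similarUsers` function may differ) and returning row positions that can be passed to `users.iloc`:

```python
import numpy as np
import pandas as pd


def similar_users_by_age(user_id, users):
    """Row positions of the users recorded with the same age as `user_id`.

    The user is left out of their own peer group; this is a choice made
    for the sketch, not something stated in the notebook.
    """
    age = users.loc[users["user_id"] == user_id, "age"].iloc[0]
    mask = (users["age"] == age) & (users["user_id"] != user_id)
    return np.flatnonzero(mask.to_numpy()).tolist()


# Made-up users; the age values stand for age buckets.
users_demo = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [1, 56, 25, 56, 56],
})
peers = similar_users_by_age(2, users_demo)
print(peers)
print(users_demo.iloc[peers])
```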

In [19]:
from src.load_data import similarUsers

In [20]:
sim=similarUsers(2,users)

In [21]:
# Use a distinct name so the imported similarUsers function is not overwritten
similarUsersData=users.iloc[sim]

In [22]:
from src.load_data import newFeature

In [28]:
type(movies['movie_id'])

pandas.core.series.Series

In [23]:
newFeature(ratings,users,movies,sim)

[0.8402531645569621,
 0.6291187739463602,
 0.5943181818181819,
 0.5454545454545454,
 0.588034188034188,
 0.7822115384615385,
 0.6622222222222222,
 0.5846153846153845,
 0.47317073170731705,
 0.6871232876712329,
 0.746969696969697,
 0.36923076923076925,
 0.616,
 0.6861538461538461,
 0.4344827586206897,
 0.7564954682779457,
 0.8020477815699658,
 0.674074074074074,
 0.5020689655172415,
 0.52,
 0.7114093959731543,
 0.6511363636363636,
 0.5543859649122808,
 0.6391666666666667,
 0.7310657596371882,
 0.7076923076923076,
 0.5416666666666667,
 0.8344827586206897,
 0.8303030303030303,
 0.7560975609756098,
 0.6466666666666667,
 0.7958823529411765,
 0.6666666666666667,
 0.7915625000000001,
 0.6357142857142857,
 0.7796246648793566,
 0.6666666666666667,
 0.35,
 0.7302752293577981,
 0.8181818181818181,
 0.7936170212765957,
 0.5657142857142857,
 0.6379310344827587,
 0.5568345323741007,
 0.684732824427481,
 0.6029850746268657,
 0.8148698884758364,
 0.5364341085271318,
 0.6909090909090909,
 0.91048593350
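
The step that is still missing, attaching these averages to the movies DataFrame and filling the gaps, could be sketched like this. Everything here is an assumption made for illustration: the function `add_peer_rating_column`, the column names `user_id`, `movie_id`, `rating` and `peer_rating`, and the division by 5 (suggested by the values above lying between 0 and 1).

```python
import pandas as pd


def add_peer_rating_column(movies, ratings, peer_user_ids, fill_value=0.0, max_rating=5.0):
    """Add the normalized mean rating given by the peer users to each movie.

    Movies that none of the peers rated would get NaN; they receive
    `fill_value` instead (0, or -0.5 to penalize unseen movies).
    """
    peer_ratings = ratings[ratings["user_id"].isin(peer_user_ids)]
    mean_by_movie = peer_ratings.groupby("movie_id")["rating"].mean() / max_rating
    result = movies.copy()
    # map() looks up each movie_id in the index of mean_by_movie.
    result["peer_rating"] = result["movie_id"].map(mean_by_movie).fillna(fill_value)
    return result


# Made-up data: user 99 is not part of the peer group.
movies_small = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})
ratings_small = pd.DataFrame({
    "user_id": [10, 11, 11, 99],
    "movie_id": [1, 1, 2, 3],
    "rating": [4, 5, 3, 5],
})
with_feature = add_peer_rating_column(movies_small, ratings_small, [10, 11], fill_value=-0.5)
print(with_feature)
```

Using `map` on `movie_id` keeps the new column aligned with the movies even when the list of averages does not cover every movie.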

## Conclusion

Finally, I think there is still a lot to be done to improve the proposed solution:

- First of all, no class has been written, which is a real shame when coding in Python.
- Secondly, the provided tests are very simple and do not check edge cases, and for lack of time I did not write the last functions in a test-driven way. This is a mistake on my part. I also did not take the time to create a test folder, which makes the tests impractical to run: you have to change the lines `from src.load_data import *` to `from load_data import *`.
- Also, the proposed evaluation is really naive; now that we have defined similar users, we could rely on their average ratings.
- Finally, I did not have time to comment each function, which is far from ideal if the code is to be used by other people.