# Requirements of Totality Corp:
<b>Assignment 1 </b><br>(Design Recommendation System Architecture)<br>
Content discovery is a vital part of the Yovo app. Users often don't know what they want to watch and need a way to discover content without searching for it. <b>Create a feed personalisation algorithm strategy which enables users to discover the right content. The underlying algorithm must strike an elegant balance between Machine Learning and giving the user control over what content they want to see.</b>
Note:
- Video has certain text attributes: tags, category, title text (context)
- Use pseudocode wherever necessary
- Download the Yovo app to see the actual feed



# Potential solutions to the problem:

### Recommender Systems generally follow one of two methods:
##### 1. Collaborative filtering 
    
This approach <b>does not apply to the Yovo app, because users have no option to like videos in their feed.  
It would require the app to track each user's likes, shares, etc. in the form of a user-item matrix.</b>
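
To make that requirement concrete, here is a minimal sketch of the kind of user-item matrix collaborative filtering would need. The interaction log, its column names, and its values are all hypothetical; nothing in the assignment suggests Yovo records such data.

```python
import pandas as pd

# Hypothetical interaction log: one row per (user, video) "like" event.
interactions = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "videoId": [10, 11, 10, 11, 12],
    "liked":   [1, 1, 1, 1, 1],
})

# Pivot into a user x video matrix; 0 means no recorded interaction.
user_item = interactions.pivot_table(
    index="userId", columns="videoId", values="liked", fill_value=0
)
print(user_item)
```

Without a signal such as the `liked` column, every cell of this matrix would be empty, which is why the content-based route is considered next.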


##### 2. Content Based filtering (Suitable for YOVO app)

This approach utilizes a series of discrete characteristics of an item in order to recommend additional items with similar properties. 

<b>In Yovo's case the items are videos, and their characteristics are the tags, category, and title text (context).

This method is suitable when item metadata is available and no matrix of user IDs and preferences is available.</b>
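
One way to prepare such metadata for a content-based model is to merge the text attributes into a single field per video. The sketch below assumes a hypothetical table with `videoId`, `title`, `category`, and pipe-separated `tags` columns; the real Yovo schema is not known.

```python
import pandas as pd

# Hypothetical video metadata; the column names are assumptions.
videos = pd.DataFrame({
    "videoId":  [101, 102],
    "title":    ["Street food tour", "Goal of the week"],
    "category": ["Food", "Sports"],
    "tags":     ["food|travel|street", "football|goal"],
})

# Merge the three attributes into one text field per video.
videos["text"] = (
    videos["title"] + " "
    + videos["category"] + " "
    + videos["tags"].str.replace("|", " ", regex=False)
)
print(videos[["videoId", "text"]])
```

The resulting `text` column plays the same role as the `description` column in the example dataset used further down.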


# Practical Example of Content based filtering (Architecture):

## The Dataset 

The dataset contains 500 entries for different items such as shoes and shirts, each with an item id and a <b>textual description of the item.</b><br>
The system creates a profile for each item and recommends similar items.

<b>For Totality, I imagine this dataset will be replaced with one containing a textual description of each video (videoId),
built from the tags or title attributes.</b>

## Process

#### 1. Extract TF * IDF [(term frequency)*(Inverse document frequency)] Score

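TF*IDF scores how important a word is to one item's text relative to the whole collection: the score rises with the word's frequency in that item and falls with the number of items that contain the word. It is computed for every word of every item.<br>
<b>This is implemented using scikit-learn's built-in TfidfVectorizer.</b>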
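
As a small worked illustration, the sketch below runs `TfidfVectorizer` with default settings on three made-up item descriptions (they are not taken from the dataset) and compares the weight of a rare word with that of a common one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions, for illustration only.
docs = [
    "red cotton shirt",
    "blue cotton shirt",
    "leather hiking boots",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)   # one row per description, one column per word

vocab = vectorizer.vocabulary_            # word -> column index
row0 = matrix[0].toarray().ravel()        # weights for "red cotton shirt"

# "red" occurs in one description and "cotton" in two,
# so "red" receives the higher weight in the first row.
print(row0[vocab["red"]], row0[vocab["cotton"]])
```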

#### 2. Calculating Similarity using Cosine Similarity  
Once we have the vectors for each item, we can use cosine similarity to find items that are similar.<br>
Cosine similarity measures the cosine of the angle between two item vectors: the closer the value is to 1, the more similar the items.<br>

<b>This is done using the linear_kernel function of scikit-learn. It takes the TF-IDF matrix of the items as input and computes a similarity score for every pair of items.</b>
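
`linear_kernel` computes plain dot products, yet it can stand in for cosine similarity here because `TfidfVectorizer` L2-normalizes every row by default. The sketch below checks this on three made-up descriptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up descriptions, for illustration only.
docs = ["red cotton shirt", "blue cotton shirt", "leather hiking boots"]
tfidf = TfidfVectorizer().fit_transform(docs)

dot = linear_kernel(tfidf, tfidf)       # dot product of every pair of rows
cos = cosine_similarity(tfidf, tfidf)   # explicit cosine similarity

# Rows have unit length, so both matrices agree.
print(np.allclose(dot, cos))
print(dot.round(2))
```

The two shirts share words and score above zero against each other, while the boots share nothing with either and score zero.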
    
#### 3. Store results of cosine similarity 
The results of cosine similarity are stored in the dictionary results, where each item i maps to the other items sorted by decreasing similarity to i.
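
The ordering step can be sketched on a single, hypothetical row of similarity scores. Here the item itself is dropped by index rather than by position, which is one possible way to stay correct when another item ties with it at a score of 1.0.

```python
import numpy as np

# Hypothetical similarity row for item 0 against four items (itself included).
scores = np.array([1.0, 0.2, 0.7, 0.4])

# argsort sorts ascending, so reverse it to get the most similar first.
order = scores.argsort()[::-1]

# Skip index 0 (the item itself) and keep (score, index) pairs.
neighbours = [(float(scores[i]), int(i)) for i in order if i != 0]
print(neighbours)
```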

#### 4. Recommending Items
The function recommend takes the item for which a recommendation is to be made and the number of recommendations to return, and reads the most similar items out of results.
<br>
We also pass a threshold value so that only recommendations above a certain similarity score are shown.<br>
<b>The recommended items can then be fed into the user's personalized feed as relevant video recommendations.</b>
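
For feeding a feed rather than printing, a variant that returns the recommended ids may be more convenient. The sketch below assumes a hypothetical `results` dictionary of precomputed `(score, id)` pairs; the function name `recommend_ids` is also made up for this illustration.

```python
# Hypothetical precomputed neighbours: item id -> [(score, neighbour id), ...],
# already sorted by decreasing score.
results = {
    9: [(0.82, 4), (0.55, 17), (0.31, 2), (0.12, 30)],
}

def recommend_ids(item_id, num, threshold):
    """Return up to `num` neighbour ids whose score exceeds `threshold`."""
    top = results.get(item_id, [])[:num]
    return [neighbour for score, neighbour in top if score > threshold]

print(recommend_ids(9, num=3, threshold=0.5))
```

Because the threshold is applied after the top `num` items are taken, the function can return fewer than `num` ids, or none at all for items without close neighbours.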

In [20]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 

# reads dataset 
ds = pd.read_csv("sample-data.csv")

ds.head()

# 1. Calculate TF-IDF scores for the item descriptions
# (min_df=1 keeps every term; newer scikit-learn releases may reject the integer 0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(ds['description'])


# 2. Calculate cosine similarity 
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix) 
results = {}

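# 3. Save each item's neighbours to the dictionary results, ordered by similarity to item i

for idx, row in ds.iterrows():
    # argsort is ascending, so [:-100:-1] walks it backwards: the 99 most similar items, best first
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]

    # The first entry is normally the item itself, so skip it
    results[row['id']] = similar_items[1:]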

print('Saved')

def item(id):
    return ds.loc[ds['id'] == id]['description'].tolist()[0].split(' - ')[0]

# 4.a Reads the precomputed recommendations out of the results dictionary.
def recommend(item_id, num, threshold):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "... with threshold " + str(threshold))
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        # Only print items whose similarity score exceeds the threshold (e.g. 0.5 = 50%)
        if rec[0] > threshold:
            print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

# 4. Recommend up to `num` items similar to `item_id`, keeping only those above the similarity threshold


recommend(item_id=9, num=5, threshold=0.3)    # up to 5 items similar to item 9 with similarity > 0.3
print('')
print('')
print('')
recommend(item_id=22, num=10, threshold=0.5)  # up to 10 items similar to item 22 with similarity > 0.5

# Repeat for item ids 1 through 499
for itemid in range(1, 500):
    recommend(item_id=itemid, num=10, threshold=0.5)
    print('')
    print('')


Saved
Recommending 5 products similar to Baby micro d-luxe cardigan... with threshold 0.3
-------
Recommended: Micro d-luxe cardigan (score:0.37550840843454325)



Recommending 10 products similar to Cap 2 t-shirt... with threshold 0.5
-------
Recommended: Cap 2 crew (score:0.7049890803634637)
Recommended: Cap 2 t-shirt (score:0.7041906217524195)
Recommended: Cap 2 cap sleeve (score:0.6635362790007241)
Recommended: Cap 2 zip neck (score:0.6225162259563587)
Recommended: Cap 2 zip neck (score:0.5295225236280288)
Recommended: Cap 2 v-neck (score:0.5094581638236453)
Recommending 10 products similar to Active classic boxers... with threshold 0.5
-------


Recommending 10 products similar to Active sport boxer briefs... with threshold 0.5
-------


Recommending 10 products similar to Active sport briefs... with threshold 0.5
-------


Recommending 10 products similar to Alpine guide pants... with threshold 0.5
-------
Recommended: Alpine guide pants (score:0.8253856759948807)


Recommending 

Recommended: '73 logo t-shirt (score:0.90535306836541)
Recommended: Flying fish t-shirt (score:0.901594544659202)
Recommended: Live simply guitar t-shirt (score:0.8871062875585407)
Recommended: Girl's live simply deer t-shirt (score:0.8018828254625427)
Recommended: Girl's live simply seal t-shirt (score:0.7990848221388119)


Recommending 10 products similar to Live simply guitar t-shirt... with threshold 0.5
-------
Recommended: '73 logo t-shirt (score:0.8961878040125465)
Recommended: Flying fish t-shirt (score:0.8924673293996112)
Recommended: Gpiw classic t-shirt (score:0.8871062875585407)
Recommended: Girl's live simply deer t-shirt (score:0.8202636525774637)
Recommended: Girl's live simply seal t-shirt (score:0.817401513180821)


Recommending 10 products similar to Synch marsupial... with threshold 0.5
-------


Recommending 10 products similar to Torrentshell jkt... with threshold 0.5
-------


Recommending 10 products similar to La surfer maria t-shirt... with threshold 0.5
------

Recommending 10 products similar to Stormfront duffel 100... with threshold 0.5
-------


Recommending 10 products similar to Stormfront pack... with threshold 0.5
-------


Recommending 10 products similar to Stretch polo... with threshold 0.5
-------


Recommending 10 products similar to Stretch wading belt... with threshold 0.5
-------


Recommending 10 products similar to Sub divider... with threshold 0.5
-------
Recommended: Great divider (score:0.6735109210642198)


Recommending 10 products similar to Surf brim... with threshold 0.5
-------


Recommending 10 products similar to Surf sneaker... with threshold 0.5
-------


Recommending 10 products similar to Tech web belt... with threshold 0.5
-------


Recommending 10 products similar to Text logo t-shirt... with threshold 0.5
-------


Recommending 10 products similar to The more you know t-shirt... with threshold 0.5
-------


Recommending 10 products similar to Three trees shirt... with threshold 0.5
-------


Recommending 10 

Recommending 10 products similar to Solid betina btm... with threshold 0.5
-------
Recommended: Print banded betina btm (score:0.9132682180464679)


Recommending 10 products similar to Solid bibiana 1 piece... with threshold 0.5
-------


Recommending 10 products similar to Solid bibiana top... with threshold 0.5
-------
Recommended: Print bibiana top (score:0.9839873177805977)


Recommending 10 products similar to Solimar pants... with threshold 0.5
-------


Recommending 10 products similar to Solimar shorts... with threshold 0.5
-------


Recommending 10 products similar to S/s a/c shirt... with threshold 0.5
-------


Recommending 10 products similar to S/s rashguard... with threshold 0.5
-------
Recommended: L/s rashguard (score:0.6668707412901518)
Recommended: S/s rashguard (score:0.5124016137205425)


Recommending 10 products similar to S/s sol patrol shirt... with threshold 0.5
-------


Recommending 10 products similar to Sun shelter shirt... with threshold 0.5
-------


Recom

-------


Recommending 10 products similar to Alpine wind jkt... with threshold 0.5
-------
Recommended: Alpine wind jkt (score:0.9550036493156216)


Recommending 10 products similar to Aravis 1/4 zip... with threshold 0.5
-------


Recommending 10 products similar to Astrid top... with threshold 0.5
-------


Recommending 10 products similar to Astrid wrap... with threshold 0.5
-------


Recommending 10 products similar to Baggies shorts... with threshold 0.5
-------


Recommending 10 products similar to Kamala dress... with threshold 0.5
-------


Recommending 10 products similar to Kamala skirt... with threshold 0.5
-------
Recommended: Lithia skirt (score:0.5768478031212589)


Recommending 10 products similar to Kite town t-shirt... with threshold 0.5
-------
Recommended: Birdwalk t-shirt (score:0.5029180733443867)


Recommending 10 products similar to Barely bra (a/b)... with threshold 0.5
-------


Recommending 10 products similar to Barely everyday bra (b/c)... with threshold 0.

Recommended: Lw guide pants (score:0.8103583485161414)


Recommending 10 products similar to Lw hiking crew liner socks... with threshold 0.5
-------


Recommending 10 products similar to Storm light jkt... with threshold 0.5
-------
Recommended: Storm light jkt (score:0.9486837053276987)


Recommending 10 products similar to Stretch ascent jkt... with threshold 0.5
-------
Recommended: Stretch ascent jkt (score:0.9476346709211421)


Recommending 10 products similar to Lw sun hoody... with threshold 0.5
-------


Recommending 10 products similar to Lw travel courier... with threshold 0.5
-------


Recommending 10 products similar to Reg fit organic ctn jeans-short... with threshold 0.5
-------
Recommended: Reg fit organic ctn jeans-reg (score:0.9536045210725154)
Recommended: Reg fit organic ctn jeans-long (score:0.9522246792960233)
Recommended: Relax fit organic ctn jeans-shor (score:0.507668259865595)


Recommending 10 products similar to Relax fit organic ctn jeans-long... with thres



Recommending 10 products similar to Cap 2 crew... with threshold 0.5
-------
Recommended: Cap 2 t-shirt (score:0.7049890803634637)
Recommended: Cap 2 t-shirt (score:0.6334452370495761)
Recommended: Cap 2 cap sleeve (score:0.6081146129427677)
Recommended: Cap 2 zip neck (score:0.567262700882027)
Recommended: Cap 2 zip neck (score:0.5308649780944358)
Recommended: Cap 2 v-neck (score:0.5148329784985651)


Recommending 10 products similar to All-time shell... with threshold 0.5
-------


Recommending 10 products similar to All-wear cargo shorts... with threshold 0.5
-------




In [2]:
data = pd.read_csv('ml-latest-small/movies.csv')
data.info()
print('')
print('')


ratings = pd.read_csv('ml-latest-small/ratings.csv')
del ratings['timestamp']
ratings.info()
print('')
print('')

tags = pd.read_csv('ml-latest-small/tags.csv')
del tags['timestamp']
tags.info()
print('')
print('')

#Joining movies and ratings

movies_data = data.merge(ratings, on = 'movieId', how = 'inner')

#movies_data.info()
#movies_data.isnull().any()
#print(len(movies_data['movieId'].unique().tolist()))

# Drop duplicate movieId rows: the recommender takes a single item (a movieId here) as input and recommends
# from its tags as a single text stream, not from multiple rows for the same item.

movies_data = movies_data.drop_duplicates(subset = ['movieId'])
print('No of unique moviId in merged datset')
print(len(movies_data['movieId'].unique().tolist()))
print('')
print('')
movies_data.info()
print('')
print('')

#Joining movies_data and tags
tags = tags.drop_duplicates(subset = ['movieId'])
tags.info()

movies_data_tags = movies_data.merge(tags, on ='movieId', how = 'inner')
movies_data_tags.head()

#1554 unique movies with tags

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
movieId    9742 non-null int64
title      9742 non-null object
genres     9742 non-null object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 3 columns):
userId     100836 non-null int64
movieId    100836 non-null int64
rating     100836 non-null float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 3 columns):
userId     3683 non-null int64
movieId    3683 non-null int64
tag        3683 non-null object
dtypes: int64(2), object(1)
memory usage: 86.4+ KB


No of unique moviId in merged datset
9724


<class 'pandas.core.frame.DataFrame'>
Int64Index: 9724 entries, 0 to 100835
Data columns (total 5 columns):
movieId    9724 non-null int64
title      9724 non-null object
genres     972

Unnamed: 0,movieId,title,genres,userId_x,rating,userId_y,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,336,pixar
1,2,Jumanji (1995),Adventure|Children|Fantasy,6,4.0,62,fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance,1,4.0,289,moldy
3,5,Father of the Bride Part II (1995),Comedy,6,5.0,474,pregnancy
4,7,Sabrina (1995),Comedy|Romance,6,4.0,474,remake
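
Note that `drop_duplicates(subset = ['movieId'])` on the tags keeps only the first tag of each movie, as the single-word `tag` column above shows. A possible alternative, sketched here on a small hypothetical frame rather than on the MovieLens files, joins all tags of a movie into one string so that the vectorizer sees the full tag text.

```python
import pandas as pd

# Hypothetical stand-in for tags.csv with several tags per movie.
tags = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 3],
    "tag":     ["pixar", "fun", "fantasy", "moldy", "old"],
})

# Join every tag of a movie into one space-separated string.
tag_text = (
    tags.groupby("movieId")["tag"]
        .agg(" ".join)
        .reset_index()
)
print(tag_text)
```

The aggregated frame still has one row per `movieId`, so it could be merged with the movie data in the same way as the de-duplicated tags.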


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 

# 1. Calculate TF-IDF scores for each movie's tag text
# (min_df=1 keeps every term; newer scikit-learn releases may reject the integer 0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(movies_data_tags['tag'])


# 2. Calculate cosine similarity 
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix) 
results = {}

# 3. Save (score, movieId) pairs in order of similarity to movie i

for idx, row in movies_data_tags.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    # Store the movieId (not the tag) so that recommend() can look up titles
    similar_items = [(cosine_similarities[idx][i], movies_data_tags['movieId'][i]) for i in similar_indices]

    # The first entry is normally the movie itself, so skip it
    results[row['movieId']] = similar_items[1:]
    
print('Saved')
print(results)

def item(id):
    # Look up the title of a movie by its movieId
    return movies_data_tags.loc[movies_data_tags['movieId'] == id]['title'].tolist()[0]

# 4.a Reads the precomputed recommendations out of the results dictionary.
def recommend(item_id, num, threshold):
    print("Recommending " + str(num) + " movies similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        # Only print movies whose similarity score exceeds the threshold (e.g. 0.5 = 50%)
        if rec[0] > threshold:
            print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

# 4. Recommend up to `num` movies similar to `item_id`, keeping only those above the similarity threshold


recommend(item_id=1, num=5, threshold=0.3)   # up to 5 movies similar to movieId 1 with similarity > 0.3
print('')
print('')
print('')
recommend(item_id=3, num=10, threshold=0.5)  # up to 10 movies similar to movieId 3 with similarity > 0.5

