# Recommender System for [SocialComment](thesocialcomment.com)

<div>The project's main aim is to build a recommender system using two approaches:
<ul>
    <li><a href="#content">Content based Filtering</a>
    <li><a href="#collab">Collaborative Filtering</a>
</ul>
</div>

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Loading the datasets

## users.csv

This dataset contains the details of user id, name, gender and academics

In [3]:
users_data = pd.read_csv("input/users.csv")
users_data.head()

Unnamed: 0,_id,name,gender,academics
0,5d60098a653a331687083238,Nivesh Singh Chauhan,male,undergraduate
1,5d610ae1653a331687083239,Gaurav Sharma,male,graduate
2,5d618359fc5fcf3bdd9a0910,Akshay Mishra,male,undergraduate
3,5d6d2bb87fa40e1417a49315,Saksham Mathur,male,undergraduate
4,5d7c994d5720533e15c3b1e9,Varun Chowhan,male,undergraduate


## posts.csv
This dataset contains the details of each post, including its category and post type.

In [4]:
posts_data = pd.read_csv("input/posts.csv")
posts_data.head()

Unnamed: 0,_id,title,category,post_type
0,5d62abaa65218653a132c956,hello there,Plant Biotechnology,blog
1,5d6d39567fa40e1417a4931c,Ml and AI,Artificial Intelligence|Machine Learning|Infor...,blog
2,5d7d23315720533e15c3b1ee,What is an Operating System ?,Operating Systems,blog
3,5d7d405e5720533e15c3b1f3,Lord Shiva,Drawings,artwork
4,5d80dfbc6c53455f896e600e,How Competition law evolved?,Competition Laws,blog


## views.csv

This dataset contains the ID of each user and of the post they viewed, along with the time at which they viewed it.

In [5]:
views_data = pd.read_csv("input/views.csv")
views_data.head()

Unnamed: 0,user_id,post_id,timestamp
0,5df49b32cc709107827fb3c7,5ec821ddec493f4a2655889e,2020-06-01T10:46:45.131Z
1,5ed3748576027d35905ccaab,5ed4cbadbd514d602c1531a6,2020-06-01T09:39:20.021Z
2,5ed0defa76027d35905cc2de,5eac305f10426255a7aa9dd3,2020-06-01T08:12:42.682Z
3,5ed0defa76027d35905cc2de,5ed1ff0276027d35905cc60d,2020-06-01T08:10:23.880Z
4,5ed0defa76027d35905cc2de,5ed3820f76027d35905ccac8,2020-06-01T08:08:54.124Z


# Preprocessing
The category column of posts_data is converted from a '|'-separated string into a list of categories for each post so that it can be one-hot encoded in the next step.

In [6]:
posts_data = posts_data.rename(columns={' post_type':'post_type'})
posts_data_col = posts_data.copy()

# Converting categories to lists
posts_data_col['category'] = posts_data.category.str.split('|')

# Replacing NaN values in the category column with empty lists
posts_data_col['category'] = posts_data_col['category'].apply(lambda x: x if isinstance(x, list) else [])

# Stripping leading and trailing spaces from each category name
posts_data_col['category'] = posts_data_col.category.apply(lambda x: list(map(str.strip, x)))
posts_data_col[20:30]

Unnamed: 0,_id,title,category,post_type
20,5dbc631f99cbb90e4339c7fd,Calligraphy,"[Drawings, Calligraphy]",artwork
21,5dc065ca24b883670268772f,Colours of pushkar.,[Photography],artwork
22,5dd1751db802e41ed198b680,Marital Rape - Rape is Rape,[Empowerment],blog
23,5dde6a91369b28584ecca156,Spirituality,[Photography],artwork
24,5ddeb6e80eb5e25a8a07f065,Library Managment System: Software Requirement...,[],project
25,5de179d80eb5e25a8a07f079,Navigation system using BFS DFS algorithms,[],project
26,5de7971b8eab6401affbb137,Shadow Sketch,[Drawings],artwork
27,5de8d73249e8203ff9219a74,Promotional video.,[Video editing],skill
28,5dea816a42a8854bf6eaba89,The Periodic Table,[Inorganic Chemistry],blog
29,5dee9b5042a8854bf6eabaaf,Computer Aided Machine Drawing (CAMD),[],project
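
The same conversion can be sketched as a single helper applied to a toy frame. The rows below and the name `to_category_list` are illustrative assumptions, not part of posts.csv or of the notebook.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from posts.csv)
toy_posts = pd.DataFrame({
    "_id": ["p1", "p2", "p3"],
    "category": ["Drawings | Calligraphy", "Photography", None],
})

def to_category_list(value):
    """Split a '|'-separated string into stripped names; missing values become []."""
    if not isinstance(value, str):
        return []
    return [part.strip() for part in value.split("|") if part.strip()]

toy_posts["category"] = toy_posts["category"].apply(to_category_list)
print(toy_posts["category"].tolist())
# [['Drawings', 'Calligraphy'], ['Photography'], []]
```

Handling the missing value inside the helper keeps the split, the NaN replacement and the stripping in one pass.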


In [7]:
post_with_category = posts_data_col.copy()

# Creating a separate column for each category and setting it to 1 for the posts it applies to
for i, row in post_with_category.iterrows():
    for cat in row['category']:
        post_with_category.at[i, cat] = 1

# Filling every remaining (NaN) cell with 0
post_with_category = post_with_category.fillna(0.0)

# Dropping information which is not needed
# (the positional axis argument, e.g. drop('category', 1), is not accepted by recent pandas versions)
post_encoded = post_with_category.drop(columns=['category', 'title', 'post_type'])
post_encoded.head()

Unnamed: 0,_id,Plant Biotechnology,Artificial Intelligence,Machine Learning,Information Technology,Operating Systems,Drawings,Competition Laws,Eco System,Economic Policies,...,Test,Professionalism,Art,Science,Technology,Logo Design,Learning,Fictions,Typography,Media And Society
0,5d62abaa65218653a132c956,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5d6d39567fa40e1417a4931c,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5d7d23315720533e15c3b1ee,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,5d7d405e5720533e15c3b1f3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5d80dfbc6c53455f896e600e,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
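
For larger datasets the row-by-row loop can become slow. A possible vectorised alternative, sketched here on hypothetical toy rows, uses `explode` and `pd.get_dummies`; the helper name `one_hot_categories` is an assumption, and unlike the loop above it orders the category columns alphabetically.

```python
import pandas as pd

# Toy data (hypothetical rows) with categories already converted to lists
toy_posts = pd.DataFrame({
    "_id": ["p1", "p2", "p3"],
    "category": [["Drawings", "Calligraphy"], ["Photography"], []],
})

def one_hot_categories(posts):
    """One-hot encode a list-valued 'category' column without an explicit loop."""
    # One row per (post, category) pair; an empty list becomes a single NaN row
    exploded = posts["category"].explode()
    # Indicator columns per category, then collapse back to one row per post
    dummies = pd.get_dummies(exploded).astype(float).groupby(level=0).max()
    return pd.concat([posts[["_id"]], dummies], axis=1)

toy_encoded = one_hot_categories(toy_posts)
print(toy_encoded)
```

Posts without any category end up as all-zero rows, which matches the effect of `fillna(0.0)` in the loop-based version.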


<a id="content"> </a>

# Content-based Filtering
A user's views are retrieved as the input dataset, from which we predict the recommended posts.

In [8]:
# Getting the view data of the user with user id '5d610ae1653a331687083239'
input_data = post_encoded[post_encoded['_id'].isin(views_data[views_data['user_id'] == '5d610ae1653a331687083239']['post_id'])]

In [9]:
# We only need the categories of the posts, so let's drop the _id column
input_data = input_data.reset_index(drop=True)
input_category = input_data.drop(columns='_id')
input_category.head()

Unnamed: 0,Plant Biotechnology,Artificial Intelligence,Machine Learning,Information Technology,Operating Systems,Drawings,Competition Laws,Eco System,Economic Policies,Graphic,...,Test,Professionalism,Art,Science,Technology,Logo Design,Learning,Fictions,Typography,Media And Society
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
# The user profile: each category weighted by the number of viewed posts that belong to it
user_profile = input_category.sum(axis=0)
user_profile.head()

Plant Biotechnology        1.0
Artificial Intelligence    3.0
Machine Learning           2.0
Information Technology     1.0
Operating Systems          0.0
dtype: float64

In [11]:
# Creating the category table with the post _id as the index
category_table = post_with_category.set_index('_id')
category_table = category_table.drop(columns=['title', 'post_type', 'category'])
category_table.head()

_id,Plant Biotechnology,Artificial Intelligence,Machine Learning,Information Technology,Operating Systems,Drawings,Competition Laws,Eco System,Economic Policies,Graphic,...,Test,Professionalism,Art,Science,Technology,Logo Design,Learning,Fictions,Typography,Media And Society
5d62abaa65218653a132c956,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5d6d39567fa40e1417a4931c,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5d7d23315720533e15c3b1ee,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5d7d405e5720533e15c3b1f3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5d80dfbc6c53455f896e600e,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


To score the posts, we multiply the category table by the user profile, sum across categories, and divide by the sum of the profile weights to obtain a recommendation value for each post.

In [12]:
recommendation_table = ((category_table*user_profile).sum(axis=1)) / user_profile.sum()
recommendation_table.head()

_id
5d62abaa65218653a132c956    0.004115
5d6d39567fa40e1417a4931c    0.024691
5d7d23315720533e15c3b1ee    0.000000
5d7d405e5720533e15c3b1f3    0.065844
5d80dfbc6c53455f896e600e    0.008230
dtype: float64
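
To make the arithmetic concrete, here is a small worked sketch of the same profile-and-score computation. The four posts, three categories and the helper name `content_scores` are hypothetical; the guard for an empty profile is an addition that the cell above does not have.

```python
import pandas as pd

# Hypothetical one-hot category table (rows: posts, columns: categories)
toy_table = pd.DataFrame(
    {
        "Drawings": [1.0, 0.0, 1.0, 0.0],
        "Painting": [0.0, 1.0, 1.0, 0.0],
        "Law":      [0.0, 0.0, 0.0, 1.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

def content_scores(category_table, viewed_ids):
    """Score every post by its weighted overlap with the viewer's category profile."""
    # Profile: number of viewed posts per category
    profile = category_table.loc[viewed_ids].sum(axis=0)
    total = profile.sum()
    if total == 0:
        # No category information for this user: every post scores 0
        return pd.Series(0.0, index=category_table.index)
    # Multiplying a DataFrame by a Series aligns the Series with the columns
    return (category_table * profile).sum(axis=1) / total

scores = content_scores(toy_table, ["p1", "p3"])
print(scores)
# profile is Drawings=2, Painting=1, Law=0, so
# p1 = 2/3, p2 = 1/3, p3 = 3/3 = 1.0, p4 = 0.0
```

A post scores 1.0 only when it carries every category the user has viewed, so posts tagged with many categories tend to rank higher.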

In [13]:
# Sorting the values to get the best post recommendation on top
recommendation_table = recommendation_table.sort_values(ascending=False)
recommendation_table.head()

_id
5e7f39a3a3258347b42f2151    0.246914
5e5bb3eed701ab08af792bfa    0.246914
5e2d4d63c85ab714a7da66db    0.238683
5ecb72c0eaff6b0c3a58a48e    0.234568
5e5b59cbd701ab08af792b90    0.213992
dtype: float64

These are the recommendations generated for the user_id __5d610ae1653a331687083239__ using the content-based recommender system.

In [14]:
# The final recommendation table with the top 20 post recommendations
final = posts_data.loc[posts_data['_id'].isin(recommendation_table.head(20).keys())]
final.head(10)

Unnamed: 0,_id,title,category,post_type
40,5e2d4737c85ab714a7da66d9,LGBT PRIDE,Fashion Design|Visual Arts|Conceptual|Artistic...,artwork
41,5e2d4d63c85ab714a7da66db,Magazine Cover Redefined,Photography|Fashion Design|Visual Arts|Graphic...,artwork
42,5e2d516fc85ab714a7da66dd,'The Virtual ME',Fashion Design|Visual Arts|Graphic Design|Arti...,artwork
119,5e5b59cbd701ab08af792b90,The girl in the meadow,Drawings|Painting|Visual Arts|Graphic Design|P...,artwork
121,5e5bb3eed701ab08af792bfa,The Nerd who was thrown out.,Drawings|Visual Arts|Painting|Graphic Design|A...,artwork
155,5e79f3fccfc8b713f5ac7d54,The Enlightened Dreams,Drawings|Painting|Watercolours,artwork
178,5e7f39a3a3258347b42f2151,The Meenamma,Drawings|Painting|Visual Arts|Artistic design|...,artwork
247,5e94452ea3258347b42f282a,gripping,Visual Arts|Photography,artwork
269,5e94bf78a3258347b42f2925,Aesthetics,Photography|Visual Arts,artwork
288,5e96464da3258347b42f2a8e,filtered,Photography|Architecture|Painting,artwork
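
Because `isin` returns rows in the order of posts_data, the table above is not sorted by recommendation value, and it can contain posts the user has already viewed. The following is a minimal sketch of one way to keep the ranking and optionally skip viewed posts; the helper name `top_posts_in_score_order`, its parameters and the toy data are assumptions.

```python
import pandas as pd

def top_posts_in_score_order(posts, scores, n=20, exclude_ids=()):
    """Return the n best-scoring posts, best first, skipping exclude_ids."""
    ranked = scores.drop(labels=list(exclude_ids), errors="ignore")
    ranked = ranked.sort_values(ascending=False).head(n)
    top = ranked.rename("score").rename_axis("_id").reset_index()
    # An inner merge keeps the order of the left (ranked) keys
    return top.merge(posts, on="_id", how="inner")

# Toy data (hypothetical posts and scores)
toy_posts = pd.DataFrame({
    "_id": ["p1", "p2", "p3", "p4"],
    "title": ["one", "two", "three", "four"],
})
toy_scores = pd.Series([0.2, 0.9, 0.5, 0.7], index=["p1", "p2", "p3", "p4"])

result = top_posts_in_score_order(toy_posts, toy_scores, n=2, exclude_ids=["p2"])
print(result)
# p4 (0.7) comes first, then p3 (0.5); p2 is skipped as already viewed
```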


---

<a id="collab"> </a>

# Collaborative Filtering

Let's take a look at the datasets first

In [15]:
views_data.head()

Unnamed: 0,user_id,post_id,timestamp
0,5df49b32cc709107827fb3c7,5ec821ddec493f4a2655889e,2020-06-01T10:46:45.131Z
1,5ed3748576027d35905ccaab,5ed4cbadbd514d602c1531a6,2020-06-01T09:39:20.021Z
2,5ed0defa76027d35905cc2de,5eac305f10426255a7aa9dd3,2020-06-01T08:12:42.682Z
3,5ed0defa76027d35905cc2de,5ed1ff0276027d35905cc60d,2020-06-01T08:10:23.880Z
4,5ed0defa76027d35905cc2de,5ed3820f76027d35905ccac8,2020-06-01T08:08:54.124Z


In [16]:
posts_data.head()

Unnamed: 0,_id,title,category,post_type
0,5d62abaa65218653a132c956,hello there,Plant Biotechnology,blog
1,5d6d39567fa40e1417a4931c,Ml and AI,Artificial Intelligence|Machine Learning|Infor...,blog
2,5d7d23315720533e15c3b1ee,What is an Operating System ?,Operating Systems,blog
3,5d7d405e5720533e15c3b1f3,Lord Shiva,Drawings,artwork
4,5d80dfbc6c53455f896e600e,How Competition law evolved?,Competition Laws,blog


In [17]:
# We won't use the timestamp, so we can drop it to save memory
views_data = views_data.drop(columns="timestamp")

# Getting the input data for the user 5d610ae1653a331687083239
input_data = views_data[views_data['user_id'] == '5d610ae1653a331687083239']
input_data.head()

Unnamed: 0,user_id,post_id
118,5d610ae1653a331687083239,5ed13d2876027d35905cc4c2
138,5d610ae1653a331687083239,5ed0e31a76027d35905cc302
176,5d610ae1653a331687083239,5d80dfbc6c53455f896e600e
207,5d610ae1653a331687083239,5ecce8a5eaff6b0c3a58a5e9
208,5d610ae1653a331687083239,5ecd6ba47023451e66223604


In [18]:
# Getting the users who have also viewed the posts viewed by the input user
user_related = views_data[views_data['post_id'].isin(input_data['post_id'].tolist())]
user_related.head()

Unnamed: 0,user_id,post_id
2,5ed0defa76027d35905cc2de,5eac305f10426255a7aa9dd3
9,5ecb979eeaff6b0c3a58a4f0,5ed13d2876027d35905cc4c2
23,5ed3748576027d35905ccaab,5eb7b10ffd92f539c465ddda
26,5ed35aa376027d35905cca67,5ed13d2876027d35905cc4c2
34,5ed350ed76027d35905cca2c,5eb4fab110426255a7aaa0ed


In [19]:
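# Grouping the related views by user_id
# (a scalar key keeps the group names as plain strings in every pandas version)
user_grouped = user_related.groupby('user_id')
# Sorting so that the users sharing the most viewed posts with the input user come first
user_grouped = sorted(user_grouped, key=lambda x: len(x[1]), reverse=True)
# The input user matches all of their own views and is expected to sort first, so drop that entry
user_grouped = user_grouped[1:]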

# The top 2 users who are most similar to the given user are:
print(user_grouped[:2])

[('5d60098a653a331687083238',                        user_id                   post_id
115   5d60098a653a331687083238  5ed13d2876027d35905cc4c2
195   5d60098a653a331687083238  5ecd5d417023451e662235c5
196   5d60098a653a331687083238  5ecce8a5eaff6b0c3a58a5e9
289   5d60098a653a331687083238  5eca8fceeaff6b0c3a58a3c0
423   5d60098a653a331687083238  5ec5546bf2781131cc7e5140
...                        ...                       ...
1407  5d60098a653a331687083238  5e7de250a3258347b42f210a
1408  5d60098a653a331687083238  5e7f4fb3a3258347b42f2156
1421  5d60098a653a331687083238  5e7c7a44cfc8b713f5ac7dac
1422  5d60098a653a331687083238  5e7de48ca3258347b42f2110
1444  5d60098a653a331687083238  5e7a60edcfc8b713f5ac7d82

[74 rows x 2 columns]), ('5e1ef04c2a37d20505da2b8b',                       user_id                   post_id
307  5e1ef04c2a37d20505da2b8b  5ec8204cec493f4a26558893
375  5e1ef04c2a37d20505da2b8b  5eaed2f210426255a7aa9eef
377  5e1ef04c2a37d20505da2b8b  5ec5546bf2781131cc7e5140
378  5e1

> Similarity is measured as the number of views a user shares with the input user, divided by the total number of posts viewed by the input user.

In [20]:
# Keeping at most the 100 users with the largest overlap
user_grouped = user_grouped[:100]
similarity = {}
for user_id, group in user_grouped:
    similarity[user_id] = len(group)/len(input_data)

In [21]:
# Converting similarity to pandas dataframe
similar_df = pd.DataFrame.from_dict(similarity, orient='index')
similar_df.columns = ['similarityIndex']
similar_df['user_id'] = similar_df.index
similar_df.index = range(len(similar_df))
similar_df.head(10)

Unnamed: 0,similarityIndex,user_id
0,0.540146,5d60098a653a331687083238
1,0.270073,5e1ef04c2a37d20505da2b8b
2,0.255474,5d7c994d5720533e15c3b1e9
3,0.218978,5deeef6142a8854bf6eabab9
4,0.175182,5e5af599d701ab08af792b63
5,0.131387,5defd51362624b0135ea9fd2
6,0.131387,5ec3ba5374f7660d73aa1201
7,0.116788,5df3f8f2ee4bb5252b4f5393
8,0.116788,5e7cf05bcfc8b713f5ac7db7
9,0.094891,5df20f1fee4bb5252b4f5351
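
The overlap-based similarity can also be sketched as one function on toy data. The view rows and the name `overlap_similarity` are hypothetical, and the sketch removes repeated (user, post) pairs first, which is an assumption the cells above do not make.

```python
import pandas as pd

# Toy view log (hypothetical users and posts)
toy_views = pd.DataFrame({
    "user_id": ["u0", "u0", "u0", "u0", "u1", "u1", "u1", "u2", "u2", "u3"],
    "post_id": ["a", "b", "c", "d", "a", "b", "x", "a", "y", "z"],
})

def overlap_similarity(views, target_user):
    """Fraction of the target user's posts that each other user has also viewed."""
    # Assumption: count each (user, post) pair once, even if viewed repeatedly
    views = views.drop_duplicates(["user_id", "post_id"])
    target_posts = views.loc[views["user_id"] == target_user, "post_id"]
    shared = views[
        (views["user_id"] != target_user) & views["post_id"].isin(target_posts)
    ]
    similarity = shared.groupby("user_id").size() / len(target_posts)
    return similarity.sort_values(ascending=False)

sim = overlap_similarity(toy_views, "u0")
print(sim)
# u1 shares a and b with u0 (2 of 4 posts) -> 0.5
# u2 shares a only -> 0.25; u3 shares nothing and is absent
```

Filtering out the target user explicitly avoids relying on that user sorting first in the grouped list.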


In [22]:
# Let's take top 30 users for this model
topusers = similar_df.sort_values(by='similarityIndex', ascending=False)[:30]
topusers.head()

Unnamed: 0,similarityIndex,user_id
0,0.540146,5d60098a653a331687083238
1,0.270073,5e1ef04c2a37d20505da2b8b
2,0.255474,5d7c994d5720533e15c3b1e9
3,0.218978,5deeef6142a8854bf6eabab9
4,0.175182,5e5af599d701ab08af792b63


In [23]:
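# Adding the posts viewed by the similar users
# Note: this merges similar_df (all the similar users kept above), not topusers;
# use topusers here instead to restrict the model to the top 30 users
topuserspost = similar_df.merge(views_data, on='user_id', how='inner')

# Removing all the posts which the input user has already seen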
topuserspost = topuserspost[~topuserspost['post_id'].isin(input_data['post_id'])]
topuserspost.head()

Unnamed: 0,similarityIndex,user_id,post_id
0,0.540146,5d60098a653a331687083238,5ed3820f76027d35905ccac8
1,0.540146,5d60098a653a331687083238,5ed1ff0276027d35905cc60d
2,0.540146,5d60098a653a331687083238,5ecf96e876027d35905cbf46
3,0.540146,5d60098a653a331687083238,5ecfa0ca76027d35905cbf57
4,0.540146,5d60098a653a331687083238,5ed0e20776027d35905cc2fe


In [24]:
# Grouping by post to get the cumulative similarity score on which the recommendations are based
# (selecting the column before summing avoids aggregating the non-numeric user_id column)
topuserspost = topuserspost.groupby('post_id')['similarityIndex'].sum()
topuserspost.head(8)

post_id
5d62abaa65218653a132c956    0.540146
5d6d39567fa40e1417a4931c    0.540146
5d7d23315720533e15c3b1ee    0.109489
5d7d405e5720533e15c3b1f3    0.401460
5d80e7c16c53455f896e6014    0.131387
5d81323a6c53455f896e6044    0.014599
5d9b950768671220a1b2b153    0.218978
5dada695610ba040fbfdf585    0.145985
Name: similarityIndex, dtype: float64

In [25]:
# Building a DataFrame with the score and post_id columns
recommendations = pd.DataFrame()
recommendations['score'] = topuserspost
recommendations['post_id'] = topuserspost.index

# Sorting recommendations based on the obtained score
recommendations = recommendations.sort_values(by='score', ascending=False)
recommendations.index = range(len(recommendations))
recommendations.head()

Unnamed: 0,score,post_id
0,2.109489,5ec7a8bdec493f4a26558846
1,2.094891,5ec7a699ec493f4a2655883a
2,2.014599,5ec7a7a3ec493f4a26558840
3,1.963504,5ec7ad1aec493f4a26558869
4,1.919708,5ec2215374f7660d73aa1011
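
The aggregation step amounts to summing, for each unseen post, the similarity of every similar user who viewed it. A minimal sketch on hypothetical data follows; `collaborative_scores` and the toy similarity values are assumptions, and repeated views are counted once here.

```python
import pandas as pd

# Toy view log and similarity values (hypothetical)
toy_views = pd.DataFrame({
    "user_id": ["u0", "u0", "u1", "u1", "u1", "u1", "u2", "u2"],
    "post_id": ["a", "b", "a", "b", "x", "y", "a", "x"],
})
toy_similarity = pd.Series({"u1": 1.0, "u2": 0.5})

def collaborative_scores(views, similarity, target_user):
    """Sum the similarity of every similar user who viewed each unseen post."""
    seen = set(views.loc[views["user_id"] == target_user, "post_id"])
    candidates = views[
        views["user_id"].isin(similarity.index) & ~views["post_id"].isin(seen)
    ]
    # Assumption: a repeated view by the same user counts only once
    candidates = candidates.drop_duplicates(["user_id", "post_id"])
    weights = candidates["user_id"].map(similarity)
    return weights.groupby(candidates["post_id"]).sum().sort_values(ascending=False)

toy_scores = collaborative_scores(toy_views, toy_similarity, "u0")
print(toy_scores)
# x was viewed by u1 and u2 -> 1.0 + 0.5 = 1.5
# y was viewed by u1 only   -> 1.0
```

Since the score is a sum rather than an average, posts viewed by many moderately similar users can outrank posts viewed by a single very similar user.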


In [26]:
# Final recommendations by collaborative filtering
final_posts = posts_data[posts_data['_id'].isin(recommendations['post_id'][:20])]
final_posts.head(10)

Unnamed: 0,_id,title,category,post_type
187,5e81b47fa3258347b42f21d7,7 Steps To Stay Safe From Corona Virus,Mass Media|Indian Government,blog
211,5e8448aba3258347b42f2447,JUSTICE FOR NORTH EAST,Human Rights|Fundamental Rights,blog
226,5e8bfa8aa3258347b42f2611,Are We Alone In The Universe?,Archeology|Human Prehistory,blog
265,5e948fdfa3258347b42f28ca,monument,Photography,artwork
348,5ea3227010426255a7aa9ac1,Aesthetic.,Photography,artwork
349,5ea3236810426255a7aa9ac8,Solitude,Photography,artwork
353,5ea5ce9310426255a7aa9b8d,Photography,,project
414,5ebd5b46514aab59896bcd5a,Quick Sketch of Gangster Skull.,Sketch Video,skill
416,5ec2215374f7660d73aa1011,Women power,Painting,artwork
417,5ec278b574f7660d73aa10d5,Rides,Drawings,artwork


---

# Inference

The post predictions from collaborative filtering and from content-based filtering are very different. <br>
This shows that
* Content-based filtering relies only on the current user's own preferences, so it will not surface posts outside them. It may give more closely matched recommendations for the user, but it will not explore other categories.

* Collaborative filtering exploits other users' preferences to select the best posts. This may surface posts which the user has not explored yet.

* A limitation of both algorithms is that they need enough view data for each user.
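
One way to put a number on how different two recommendation lists are is the Jaccard overlap of their post IDs. The sketch below uses hypothetical IDs rather than the results above, and `jaccard_overlap` is an illustrative helper, not part of the notebook.

```python
def jaccard_overlap(ids_a, ids_b):
    """Size of the intersection divided by the size of the union of two ID collections."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    if not union:
        # Two empty lists: define the overlap as 0 to avoid dividing by zero
        return 0.0
    return len(set_a & set_b) / len(union)

# Hypothetical recommendation lists
content_ids = ["p1", "p2", "p3"]
collab_ids = ["p3", "p4"]
print(jaccard_overlap(content_ids, collab_ids))
# 1 shared post out of 4 distinct posts -> 0.25
```

A value near 0 would support the observation that the two methods recommend mostly different posts.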

### Follow-up
This project can be extended by using user reviews along with the views to give better recommendation results.