## Problem Statement


Sérendipité is an article aggregation platform where users share articles from domains such as technology, politics, and news, and where articles are then recommended on the basis of reading habits.

The platform has a rating system: after reading an article, users rate it on a scale of 1 to 5.

As a non-personalized recommender system, we provide article recommendations to the customer base by answering the following questions:

**Q.1: Which are the top 10 articles by rating among those rated by more than 5% of users in the dataset?**

**Q.2: Which are the 10 most-read articles given that their average rating is above 1.5?**

**Q.3: Using the following formula, which are the top 10 articles based on weighted rating?**

W= (R*v + C*m)/(v + m)

W  = weighted rating

R  = Average rating for the article 

v = number of ratings for the article 

m = minimum number of ratings required for an article to be on the recommendation list, (You can consider m = 2 for this task)

C = Mean ratings for all the articles.
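
The formula can also be read as a blend of the article's own mean and the global mean. This is only a rearrangement of the expression above, with the same symbols:

$$
W = \frac{R\,v + C\,m}{v + m} = \frac{v}{v + m}\,R + \frac{m}{v + m}\,C
$$

With m = 2, an article with v = 2 ratings sits exactly halfway between R and C, while an article with many ratings has W close to its own average R.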


## Data Description


**user_id** -- Unique ID for the user

**article_id** -- Unique ID for the article

**rating** -- Rating provided by the user (1-5)

## Table of Contents

[1. Reading Dataset](#Reading-Dataset)

[2. Basic Exploration](#Basic-Exploration)

[3. Top 10 articles based on a rating provided by more than 5% of users in the dataset](#TopTenArt_ratings)

[4. 10 most-read articles given that their average rating is above 1.5](#TopTenArt_read)

[5. Identify the top 10 articles based on weighted rating](#TopTenArt_weighted)

[6. Conclusion ](#Conclusion)






## 1. Reading Dataset <a class="anchor" id="Reading-Dataset"></a>


In [1]:
import pandas as pd
import numpy as np

In [2]:
import os
os.chdir(r"D:\Datascience\Analytics vidya\LabAV\w19   Article Recommendation  Non Personalized Recommender System")

In [3]:
#Reading file:
df = pd.read_csv('train.csv')

## 2. Basic Exploration <a class="anchor" id="Basic-Exploration"></a>


### Exploring user data

In [4]:
# shape of the ratings data (rows, columns)
print(df.shape)
# view the first rows of the ratings data
df.head()

(16731, 3)


,user_id,article_id,rating
0,1,456,1
1,1,2934,1
2,1,82,1
3,1,1365,1
4,1,221,1


In [5]:
#Unique values
df.nunique()

user_id        907
article_id    2529
rating           5
dtype: int64

In [6]:
# duplicate values
df.duplicated().sum()

0

In [7]:
#Missing values
pd.isnull(df).sum() 

user_id       0
article_id    0
rating        0
dtype: int64

In [8]:
n_users = df.user_id.unique().shape[0]
n_items = df.article_id.unique().shape[0]
print("users = ", n_users, "," ,"articles = " , n_items)

users =  907 , articles =  2529


So, we have 907 users in the dataset, and each row records a user_id together with two features: article_id and rating.
There is a total of 2529 articles in the dataset.

We have 16731 ratings for different user and article combinations.
We have no missing or duplicated values in the data.
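
These counts also tell us how sparse the user-article matrix will be. A quick sketch using only the numbers reported above (the variable names are illustrative):

```python
# Counts reported in the exploration above
n_ratings = 16731
n_users = 907
n_articles = 2529

# Share of all possible user-article pairs that actually carry a rating
density = n_ratings / (n_users * n_articles)
print(f"density = {density:.4%}")
```

Fewer than 1% of the possible user-article pairs are rated, so most cells of the matrix built below are NaN.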

--------------------------

### First, we create a user-item matrix with the pandas `pivot` function, with user_id as the index and one column per article_id, and name it user_article_matrix

In [9]:
user_article_matrix = df.pivot(index = 'user_id', columns = 'article_id', values = 'rating')
user_article_matrix

article_id,1,3,4,5,6,7,8,9,10,11,...,2963,2964,2965,2966,2968,2969,2970,2974,2975,2976
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1083,,,,,,,,,,,...,,,,,,,,,,
1084,,,,,,,,,,,...,,,,,,,,,,
1085,,,,,,,,,,,...,,,,,,,,,,
1086,,,,,,,,,,,...,,,,,,,,,,
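
To make the reshaping step concrete, here is a minimal sketch of the same `pivot` call on hypothetical toy data (the `toy` frames are invented for illustration; only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical toy ratings with the same column layout as the dataset
toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 3],
    "article_id": [10, 20, 10, 30],
    "rating":     [1, 3, 2, 5],
})

# One row per user, one column per article; unrated pairs become NaN
toy_matrix = toy.pivot(index="user_id", columns="article_id", values="rating")
print(toy_matrix)

# pivot() raises ValueError if a (user_id, article_id) pair repeats;
# pivot_table() aggregates the repeats instead (mean here)
toy_dup = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)
toy_matrix_dup = toy_dup.pivot_table(index="user_id", columns="article_id",
                                     values="rating", aggfunc="mean")
```

`pivot` works on the real data because the duplicate check above found no repeated rows.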


## 3. Top 10 articles based on a rating provided by more than 5% of users in the dataset <a class="anchor" id="TopTenArt_ratings"></a>  

In [10]:
# 5% of the 907 users (n_users was computed in the exploration above)
fiveperc = 0.05 * n_users
fiveperc

45.35

We are only interested in popular articles.
To restrict the ranking to articles rated by more than 5% of users, we keep articles with at least 46 user ratings (the smallest whole number above 45.35) in the dataframe and drop the rest.
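
The cutoff of 46 can be derived rather than typed in by hand. A small sketch, where `share` and `min_users` are illustrative names, not part of the notebook:

```python
import math

n_users = 907   # unique users reported by df.nunique()
share = 0.05    # "more than 5% of users"

# Smallest whole number of users strictly greater than share * n_users
min_users = math.floor(share * n_users) + 1
print(min_users)
```

Using `floor(...) + 1` rather than `ceil(...)` keeps the comparison strict even when `share * n_users` happens to be a whole number.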

In [11]:
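# Number of ratings per article (one rating per user, since there are no duplicate rows)
art_counts = df['article_id'].value_counts()

In [12]:
# Keep only the ratings of articles rated by more than 5% of users (46 or more ratings)
fivepercRatings = df[df['article_id'].isin(art_counts[art_counts > fiveperc].index)]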

In [13]:
user_article_matrix_Rating = fivepercRatings.pivot(index = 'user_id', columns = 'article_id', values = 'rating')
user_article_matrix_Rating

article_id,221,456,467,580,618,911,967,1148,1249,1425,1433,1539,1562,1755,1904,2388,2660,2709,2781
user_id,,,,,,,,,,,,,,,,,,,
1,1.0,1.0,,,,,,2.0,,,,,,,,,,,
2,,,,,,,,,,1.0,,,,,,,,,
3,,,,,,,,,,,,,1.0,,1.0,,,,
10,,,,,,,,,,,,,,1.0,,,,,
11,1.0,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1081,,,,,,,,,1.0,,,1.0,,,,,,,
1083,,,,1.0,,,,,,,,,,,1.0,,,,
1084,,,,1.0,,,,,2.0,,,,,,,,,,
1085,,,,,,,,,,1.0,,,,,,,,,


### Here we calculate the mean rating for each article, order with the highest rating listed first, and find the top 10 articles

In [14]:
Topfiveperc= user_article_matrix_Rating.mean(axis=0).sort_values(ascending=False).head(10)

In [15]:
# Top 10 article IDs by mean rating among articles rated by more than 5% of users
TopTenArt_ratings= list(Topfiveperc.index)
TopTenArt_ratings

[580, 1249, 2781, 1433, 967, 221, 618, 1755, 456, 2388]
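
The same filter-then-rank logic can be written without the pivot, using a single `groupby`. This is a sketch on hypothetical toy data, with `min_ratings` standing in for the 46-rating cutoff:

```python
import pandas as pd

# Hypothetical toy ratings; column names match the dataset
toy = pd.DataFrame({
    "user_id":    [1, 2, 3, 1, 2, 1],
    "article_id": [10, 10, 10, 20, 20, 30],
    "rating":     [2, 4, 3, 5, 5, 1],
})
min_ratings = 2  # stands in for the 46-rating cutoff on the real data

# Mean rating and number of ratings per article in one pass
stats = toy.groupby("article_id")["rating"].agg(["mean", "count"])

# Keep sufficiently rated articles, then rank by mean rating
top = (stats[stats["count"] >= min_ratings]
       .sort_values("mean", ascending=False)
       .head(10))
print(top)
```

Here article 30 is dropped because it has a single rating, and article 20 ranks above article 10 on mean rating.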

-----

## 4. 10 most-read articles given that their average rating is above 1.5 <a class="anchor" id="TopTenArt_read"></a>

In [16]:
# Here we count the number of ratings for each article, order them with the highest count first, and find the top ten

In [17]:
user_article_matrix

article_id,1,3,4,5,6,7,8,9,10,11,...,2963,2964,2965,2966,2968,2969,2970,2974,2975,2976
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1083,,,,,,,,,,,...,,,,,,,,,,
1084,,,,,,,,,,,...,,,,,,,,,,
1085,,,,,,,,,,,...,,,,,,,,,,
1086,,,,,,,,,,,...,,,,,,,,,,


In [18]:
# Number of ratings per article, ten most-rated articles first
user_article_matrix.count(axis=0).sort_values(ascending=False).head(10) 

article_id
967     122
1425     91
467      90
2660     73
1562     71
456      69
221      64
1433     61
911      61
1904     61
dtype: int64

In [19]:
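# How many times each rating value (1-5) was given, per article
# (Series.value_counts is used because the top-level pd.value_counts is deprecated in recent pandas)
user_article_matrix.apply(lambda col: col.value_counts())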

article_id,1,3,4,5,6,7,8,9,10,11,...,2963,2964,2965,2966,2968,2969,2970,2974,2975,2976
1.0,1.0,7.0,4.0,,8.0,4.0,2.0,7.0,2.0,1.0,...,3.0,5.0,1.0,10.0,16.0,7.0,8.0,5.0,1.0,24.0
2.0,1.0,,,1.0,2.0,,,,1.0,1.0,...,1.0,1.0,,2.0,5.0,1.0,,1.0,,10.0
3.0,,,3.0,,2.0,,,,1.0,,...,,1.0,,,,3.0,1.0,1.0,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,2.0,,,,,1.0,,,...,,,,,2.0,1.0,,,,1.0


In [20]:
# Share of each article's ratings that are 1.5 or higher
# (a per-rating share, not the article's average rating)
df_oneptfive = (user_article_matrix >= 1.5).sum(axis=0) / user_article_matrix.count(axis=0)
df_oneptfive

article_id
1       0.500000
3       0.000000
4       0.555556
5       1.000000
6       0.333333
          ...   
2969    0.416667
2970    0.111111
2974    0.285714
2975    0.000000
2976    0.314286
Length: 2529, dtype: float64

In [21]:
# Top 10 articles by share of ratings that are 1.5 or higher
oneptfive= df_oneptfive.sort_values(ascending = False).head(10)
oneptfive

article_id
1019    1.0
1435    1.0
769     1.0
767     1.0
1467    1.0
224     1.0
2532    1.0
1480    1.0
1493    1.0
1498    1.0
dtype: float64

In [22]:
# Article IDs used as the answer to Q.2
TopTenArt_read= list(oneptfive.index)
TopTenArt_read

[1019, 1435, 769, 767, 1467, 224, 2532, 1480, 1493, 1498]
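
Note that the ranking above orders articles by the share of their ratings that are 1.5 or higher, so an article with a single rating of 2 or more already scores 1.0. A literal reading of Q.2 would instead filter on the mean rating and then rank by the number of ratings. A minimal sketch of that reading on hypothetical toy data (not the notebook's result, and assuming the number of ratings is an acceptable proxy for reads):

```python
import pandas as pd

# Hypothetical toy ratings; column names match the dataset
toy = pd.DataFrame({
    "user_id":    [1, 2, 3, 1, 2, 3],
    "article_id": [10, 10, 10, 20, 20, 30],
    "rating":     [1, 1, 2, 2, 3, 5],
})

stats = toy.groupby("article_id")["rating"].agg(["mean", "count"])

# Step 1: keep articles whose average rating is above 1.5
well_rated = stats[stats["mean"] > 1.5]

# Step 2: rank the survivors by number of ratings
most_read = well_rated.sort_values("count", ascending=False).head(10)
print(most_read)
```

In this toy example article 10 has the most ratings but is excluded because its mean is below 1.5.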

------

## 5. Identify the top 10 articles based on weighted rating  <a class="anchor" id="TopTenArt_weighted"></a>

W= (R*v + C*m)/(v + m) 

W = weighted rating

R = Average rating for the article

v = number of ratings for the article

m = minimum number of ratings required for an article to be on the recommendation list, (You can consider m = 2 for this task)

C = Mean ratings for all the articles.
<a class="anchor" id="weakratings"></a>

In [23]:
art_counts = df['article_id'].value_counts()

### m = minimum number of ratings required for an article to be on the recommendation list (we use m = 2 for this task)

In [24]:
m = 2

In [25]:
# Keep only the ratings of articles that have at least m ratings
weightedRatings = df[df['article_id'].isin(art_counts[art_counts >= m].index)]

In [26]:
# user item matrix
user_article_matrix_weighted = weightedRatings.pivot(index = 'user_id', columns = 'article_id', values = 'rating')
user_article_matrix_weighted

article_id,1,3,4,6,7,8,9,10,11,12,...,2961,2962,2963,2964,2966,2968,2969,2970,2974,2976
user_id,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1083,,,,,,,,,,,...,1.0,,,,,,,,,
1084,,,,,,,,,,,...,,,,,,,,,,
1085,,,,,,,,,,,...,,,,,,,,,,
1086,,,,,,,,,,,...,,,,,,,,,,


### The Average rating for the article (R)

In [27]:
# R = Average rating for the article
# (computed on the full user_article_matrix, so articles with fewer than m ratings are included)
R = user_article_matrix.mean(axis=0)
R

article_id
1       1.500000
3       1.000000
4       2.555556
5       2.000000
6       1.500000
          ...   
2969    1.916667
2970    1.222222
2974    1.428571
2975    1.000000
2976    1.400000
Length: 2529, dtype: float64

### Number of ratings for each article (v)

In [28]:
#v = number of ratings for the article
v = user_article_matrix.count(axis=0)
v

article_id
1        2
3        7
4        9
5        1
6       12
        ..
2969    12
2970     9
2974     7
2975     1
2976    35
Length: 2529, dtype: int64

### Mean ratings for all the articles (C)

In [29]:
# C = mean of all individual ratings in the dataset
C= df['rating'].mean()
C

1.4539477616400693

### Weighted rating (W)

In [30]:
# Calculation of the weighted rating W based on the formula
W = (R * v + C * m) / (v + m)
W

article_id
1       1.476974
3       1.100877
4       2.355263
5       1.635965
6       1.493421
          ...   
2969    1.850564
2970    1.264354
2974    1.434211
2975    1.302632
2976    1.402916
Length: 2529, dtype: float64
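
As a sanity check, the first entries of W can be reproduced by hand from the R, v, and C values printed above. The helper name `weighted_rating` is illustrative, not part of the notebook:

```python
def weighted_rating(R, v, C, m=2):
    """Weighted rating W = (R*v + C*m) / (v + m)."""
    return (R * v + C * m) / (v + m)

C = 1.4539477616400693  # global mean rating printed above

# Article 1: R = 1.5 from v = 2 ratings; article 5: R = 2.0 from v = 1 rating
w_article_1 = weighted_rating(R=1.5, v=2, C=C)
w_article_5 = weighted_rating(R=2.0, v=1, C=C)
print(round(w_article_1, 6), round(w_article_5, 6))
```

Both values agree with the W output (1.476974 and 1.635965). Article 5 shows the pull toward C: its single rating of 2.0 is shrunk to about 1.64.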

In [31]:
# Top 10 articles based on their weighted average rating
weighted= W.sort_values(ascending = False).head(10)
weighted

article_id
239     3.226974
2283    3.226974
129     3.151316
2141    3.151316
2779    3.113487
931     3.090790
24      2.981579
739     2.981579
861     2.976974
2079    2.863487
dtype: float64

In [32]:
# Top 10 articles based on their weighted average rating
TopTenArt_weighted= list(weighted.index)
TopTenArt_weighted

[239, 2283, 129, 2141, 2779, 931, 24, 739, 861, 2079]
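
R and v above are computed from the full user_article_matrix, so articles with fewer than m ratings also receive a W. If m is meant as a hard eligibility cutoff as well as a smoothing weight, a sketch on hypothetical toy data could filter on the rating count before ranking:

```python
import pandas as pd

# Hypothetical toy ratings; column names match the dataset
toy = pd.DataFrame({
    "article_id": [10, 10, 10, 20, 30, 30],
    "rating":     [4, 5, 3, 5, 2, 2],
})
m = 2
C = toy["rating"].mean()  # global mean rating

# R = mean rating, v = number of ratings, per article
stats = toy.groupby("article_id")["rating"].agg(R="mean", v="count")
stats["W"] = (stats["R"] * stats["v"] + C * m) / (stats["v"] + m)

# Hard cutoff: only articles with at least m ratings are eligible
eligible = stats[stats["v"] >= m].sort_values("W", ascending=False)
print(eligible)
```

In this toy example article 20 has the highest W but only one rating, so it drops out once the cutoff is applied.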

## 6. Conclusion   <a class="anchor" id="Conclusion"></a>

In [33]:
# Q.1: Which are the top 10 articles based on a rating provided by more than 5% of users in the dataset?
TopTenArt_ratings

[580, 1249, 2781, 1433, 967, 221, 618, 1755, 456, 2388]

In [34]:
# Q.2: Which are the 10 most-read articles given that their average rating is above 1.5?
TopTenArt_read

[1019, 1435, 769, 767, 1467, 224, 2532, 1480, 1493, 1498]

In [35]:
# Q.3: Using the weighted-rating formula, which are the top 10 articles?
TopTenArt_weighted

[239, 2283, 129, 2141, 2779, 931, 24, 739, 861, 2079]