# Problem Statement

iPrint, an upcoming Indian media house that offers media and information services to the people, has decided to begin providing a more personalised experience to its customers. The company’s business extends across a wide range of media, including news and information services on sports, weather, education, health, research, stocks and healthcare.

iPrint has been managing its customer base by only recommending the most popular and similar news articles to its users. However, the recommended news articles are often not relevant to the majority of the users. The company is also unable to recommend any new content to its customers, and it has gradually started to lose such users, resulting in immense revenue loss.

In order to stay healthy in the market, iPrint needs to keep up with time and technology advancements.


# Suggestion

A probable solution for the company to stay competitive is to address the revenue leakage by personalising recommendations to users' tastes and introducing new content to its users at the start of the day on the home page of the application.


# Solution

##### Building a NEWS RECOMMENDATION SYSTEM that does the following:

--> Recommend new top 10 relevant articles to a user when he/she visits the app at the start of the day

--> Recommend top 10 similar news articles that match the ones clicked by the user.

The articles that are written in the English language must be considered.

## 1. Data Pre-Processing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 50000)
pd.set_option('display.max_columns', 50000)

consumer_ds = pd.read_csv('data/consumer_transanctions.csv', low_memory=False)

platform_ds = pd.read_csv('data/platform_content.csv')

consumer_ds.head()

platform_ds.head()

#### Increase column width to see Keywords values properly

pd.set_option('display.max_colwidth', None)

consumer_ds.head(2)

platform_ds.head(2)

consumer_ds.info()

consumer_ds.shape

##### Imputing 'Rating' values on the feature 'interaction_type' - numeric

consumer_ds.interaction_type.nunique()

consumer_ds['interaction_type'].unique()

consumer_ds['interaction_type'].value_counts()

# Defining the function for Imputing
def ratings(x):
    if x == 'content_followed':
        return 5
    elif x == 'content_commented_on':
        return 4
    elif x == 'content_saved':
        return 3
    elif x == 'content_liked':
        return 2
    elif x == 'content_watched':
        return 1

consumer_ds['Ratings'] = consumer_ds['interaction_type'].apply(ratings)
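
The same mapping can also be written as a lookup table. The block below is a minimal sketch in plain Python, independent of the notebook's data; `RATING_BY_INTERACTION` and `to_rating` are hypothetical names introduced only for illustration, and unknown interaction types map to `None`, just as the `ratings` function implicitly does.

```python
# Hypothetical lookup table mirroring the if/elif chain above.
RATING_BY_INTERACTION = {
    "content_followed": 5,
    "content_commented_on": 4,
    "content_saved": 3,
    "content_liked": 2,
    "content_watched": 1,
}

def to_rating(interaction_type):
    """Return the rating for an interaction type, or None if it is unknown."""
    return RATING_BY_INTERACTION.get(interaction_type)

# Illustrative inputs, including one type that is not in the table.
sample = ["content_watched", "content_liked", "content_followed", "unknown_type"]
sample_ratings = [to_rating(x) for x in sample]
print(sample_ratings)
```

With pandas, a comparable one-liner would be `consumer_ds['interaction_type'].map(RATING_BY_INTERACTION)`, which yields NaN for unknown types.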

consumer_ds.head()

consumer_ds['Ratings'].value_counts()

consumer_ds.describe()



#### Platform Dataset Analysis on Keywords Column

platform_ds.shape

platform_ds.language.nunique()

platform_ds['language'].unique()

platform_ds1 = platform_ds[platform_ds['language']=='en']

platform_ds1['language'].unique()

platform_ds1.head(2)

platform_ds1.shape

## 2. Exploratory Data Analysis

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

plt.hist(consumer_ds.interaction_type)
plt.xlabel('Interaction Type from Consumer DF', fontsize=18)
plt.ylabel('Distribution', fontsize=18)
plt.xticks(rotation=75, fontsize=14)
plt.show()

##### As visible from the above histogram, the most common interaction type for users on the app is 'content_watched', followed by some 'content_liked'

fig = plt.figure()
ax = fig.add_subplot(2, 1, 1)
ax.set_yscale('log')
plt.hist(platform_ds.item_type)
plt.xlabel('Item Types', fontsize=18)
plt.ylabel('Distribution', fontsize=18)
plt.xticks(rotation=75, fontsize=14)
plt.show()

##### As visible from the above histogram, the most common item type is 'HTML'

plt.hist(platform_ds.interaction_type)
plt.xlabel('Interaction Type from Platform DF', fontsize=18)
plt.ylabel('Distribution', fontsize=18)
plt.xticks(rotation=75, fontsize=14)
plt.show()

##### As visible from the above histogram, the most common interaction type in the platform dataset is 'content_present'

fig = plt.figure()
ax = fig.add_subplot(2, 1, 1)
ax.set_yscale('log')
plt.hist(platform_ds.language)
plt.xlabel('Language', fontsize=18)
plt.ylabel('Distribution', fontsize=18)
plt.xticks(rotation=75, fontsize=14)
plt.show()

##### As visible from the above histogram, the most common language is 'en' (English), followed by 'pt' (Portuguese)


### Some Bar Graph Plots

fig = plt.figure()
ax = fig.add_subplot(2, 1, 1)
ax.set_yscale('log')
country=consumer_ds.country
country.value_counts().plot(kind='bar', color='red')
plt.xlabel('Consumer Countries', fontsize=14)
plt.ylabel('Number of news', fontsize=14)
plt.xticks(rotation=80, fontsize=12)
plt.show()



consumer_ds['device_name'] = consumer_ds.consumer_device_info.str.split().str.get(0)

consumer_ds.head()

fig = plt.figure()
ax = fig.add_subplot(2, 1, 1)
ax.set_yscale('log')
device_n=consumer_ds.device_name
device_n.value_counts().plot(kind='bar', color='green')
plt.xlabel('Device Name Info', fontsize=14)
plt.ylabel('Number of news', fontsize=14)
plt.xticks(rotation=80, fontsize=14)
plt.show()

fig = plt.figure()
ax = fig.add_subplot(2, 1, 1)
ax.set_yscale('log')
prod_country=platform_ds1.producer_country
prod_country.value_counts().plot(kind='bar', color='orange')
plt.xlabel('Producer Countries', fontsize=14)
plt.ylabel('Number of news', fontsize=14)
plt.xticks(rotation=80, fontsize=14)
plt.show()

# Interaction Type --> Ratings Calculation

interact_ds1 = consumer_ds['interaction_type'].value_counts()*100/consumer_ds.shape[0]

interact_ds = pd.DataFrame(interact_ds1)

interact_ds.reset_index(inplace=True)

interact_ds.columns = ['interaction_type', 'value']
interact_ds

# Rating Value

interact_ds['rating'] = 100/interact_ds['value']
interact_ds
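
To make the inverse-frequency rating concrete, here is a small worked sketch in plain Python. The interaction log is made up for illustration and the variable names are hypothetical; only the formula `rating = 100 / share` is taken from the cells above.

```python
from collections import Counter

# Made-up interaction log, for illustration only.
interactions = ["content_watched"] * 6 + ["content_liked"] * 3 + ["content_saved"]

counts = Counter(interactions)
total = len(interactions)

# Share of each interaction type in percent, as in the 'value' column.
share_pct = {kind: count * 100 / total for kind, count in counts.items()}

# Rarer interaction types receive a larger rating: rating = 100 / share.
rating = {kind: 100 / pct for kind, pct in share_pct.items()}

for kind in sorted(rating, key=rating.get):
    print(f"{kind}: share={share_pct[kind]:.1f}%, rating={rating[kind]:.2f}")
```

The rating is simply the reciprocal of the relative frequency, so an interaction type that makes up 10% of the log weighs six times as much as one that makes up 60%.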

user_rating = pd.merge(consumer_ds, interact_ds, on='interaction_type', how='left')

user_rating = user_rating[['consumer_id', 'item_id', 'rating']]

user_rating.drop_duplicates(inplace=True)
user_rating.shape

user_rating.head(3)

user_rating['news_id'] = user_rating.groupby(['item_id']).ngroup()

user_rating['user_id'] = user_rating.groupby(['consumer_id']).ngroup()

user_rating.head(3)

ratings_ds = user_rating[['user_id','news_id','rating']]

ratings_ds.head(3)



# 3. Collaborative Filtering: User Based Recommendations

consumer_ds.head()

consumer_ds.item_id.nunique()

# Test and Train split of the dataset.
from sklearn.model_selection import train_test_split
train, test = train_test_split(consumer_ds, test_size=0.30, random_state=31)

print(train.shape)
print(test.shape)

# Pivot the train ratings' dataset into matrix format in which columns are item IDs and the rows are user IDs.
df_pivot = train.pivot_table(
    index='consumer_id',
    columns='item_id',
    values='Ratings'
).fillna(0)

df_pivot.head()

df_pivot.shape

# Copy the train dataset into dummy_train
dummy_train = train.copy()

dummy_train.head(2)

# News articles not rated by the user are marked as 1 for prediction; rated ones are marked as 0.
dummy_train['Ratings'] = dummy_train['Ratings'].apply(lambda x: 0 if x>=1 else 1)

# Convert the dummy train dataset into matrix format.
dummy_train = dummy_train.pivot_table(
    index='consumer_id',
    columns='item_id',
    values='Ratings'
).fillna(1)

dummy_train.head()
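
The purpose of this mask becomes clearer on a toy example. The sketch below uses plain Python lists with made-up numbers (all names are illustrative, not part of the notebook): multiplying predicted scores element-wise by the mask zeroes out the items a user has already interacted with.

```python
# Toy user-by-item ratings; 0 means "no interaction" (illustrative data).
ratings = [
    [5, 0, 1],
    [0, 2, 0],
]
# Toy predicted scores of the same shape.
predicted = [
    [0.9, 0.4, 0.7],
    [0.3, 0.8, 0.6],
]

# Mask: 1 where the user has NOT interacted with the item, 0 otherwise.
mask = [[0 if r >= 1 else 1 for r in row] for row in ratings]

# Element-wise product keeps scores only for unseen items.
masked = [[p * m for p, m in zip(p_row, m_row)]
          for p_row, m_row in zip(predicted, mask)]
print(masked)
```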



# User Similarity Matrix

## Using Cosine Similarity

df_pivot.index.nunique()

from sklearn.metrics.pairwise import pairwise_distances

# Creating the User Similarity Matrix using pairwise_distance function.
user_correlation = 1 - pairwise_distances(df_pivot, metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

user_correlation.shape
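
For intuition, cosine similarity between two users can be computed by hand. The following is a minimal plain-Python sketch with made-up rating vectors; `cosine_similarity` here is a hypothetical helper, not the scikit-learn function, and returning 0 for an all-zero vector is an assumption of this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user with no ratings is treated as similar to nobody
    return dot / (norm_u * norm_v)

alice = [5, 0, 1, 0]
bob = [5, 0, 1, 0]    # same taste as alice
carol = [0, 3, 0, 2]  # no items in common with alice

sim_ab = cosine_similarity(alice, bob)
sim_ac = cosine_similarity(alice, carol)
print(sim_ab, sim_ac)
```

Users with identical rating vectors score (approximately) 1, and users with no items in common score 0.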

## Using adjusted Cosine

### Here, we are not removing the NaN values and calculating the mean only for the news rated by the user

# Create a user-news matrix.
df_pivot = train.pivot_table(
    index='consumer_id',
    columns='item_id',
    values='Ratings'
)

df_pivot.head()

mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

df_subtracted.head()
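
The mean-centering step can be illustrated on a single row. This is a small plain-Python sketch, with `None` standing in for NaN and `mean_center` being a hypothetical helper: the mean is taken over rated entries only, and unrated entries stay missing.

```python
def mean_center(row):
    """Subtract the mean of the rated entries; unrated entries (None) stay None."""
    rated = [r for r in row if r is not None]
    if not rated:
        return list(row)
    mean = sum(rated) / len(rated)
    return [None if r is None else r - mean for r in row]

row = [5, None, 1, 3]          # one user's ratings, illustrative values
centered = mean_center(row)    # mean of rated entries is 3
print(centered)
```

Centering removes each user's personal rating bias, so a user who tends to generate high ratings does not look similar to everyone else for that reason alone.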



### Finding cosine similarity

from sklearn.metrics.pairwise import pairwise_distances

# Creating the User Similarity Matrix using pairwise_distance function.
user_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
user_correlation[np.isnan(user_correlation)] = 0
print(user_correlation)

user_correlation.shape



## Prediction - User User


user_correlation[user_correlation<0]=0
user_correlation

user_predicted_ratings = np.dot(user_correlation, df_pivot.fillna(0))
user_predicted_ratings
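
The dot product above produces, for each user and item, a similarity-weighted sum of the other users' ratings. A toy sketch in plain Python (made-up numbers, hypothetical `matmul` helper) shows what that computes for a 2-user, 3-item case:

```python
# Toy similarity between 2 users after clipping negatives to 0.
similarity = [
    [1.0, 0.5],
    [0.5, 1.0],
]
# Toy ratings: 2 users x 3 items, 0 for "not rated".
ratings = [
    [4, 0, 2],
    [0, 2, 0],
]

def matmul(a, b):
    """Plain-Python matrix product, equivalent to np.dot for 2-D inputs."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

scores = matmul(similarity, ratings)
print(scores)
```

These scores are not divided by the sum of similarities, so they are best read as ranking scores rather than calibrated ratings.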

user_predicted_ratings.shape

user_predicted_ratings

user_final_rating = np.multiply(user_predicted_ratings,dummy_train)
user_final_rating.head()

### Finding the top 10 news recommendation for the user

# Take the user ID as input

# Taking 1st userID from above list

user_input = int(input("Enter your user ID: "))
print(user_input)

user_final_rating.head(7)

d = user_final_rating.loc[user_input].sort_values(ascending=False)[0:10]
d

d2 = consumer_ds[['item_id', 'consumer_id']]

item_user_pair = pd.merge(d, d2, on = 'item_id', how='left')

item_user_pair

d3 = pd.merge(d,consumer_ds,left_on='item_id',right_on='item_id', how = 'left')
d3

# Evaluation - User User

# Find out the common users of test and train dataset.
common = test[test.consumer_id.isin(train.consumer_id)]
common.shape

common.head()

# Convert into the user-news matrix.
common_user_based_matrix = common.pivot_table(index='consumer_id', columns='item_id', values='Ratings')

common_user_based_matrix.head()

# Convert the user_correlation matrix into dataframe.
user_correlation_df = pd.DataFrame(user_correlation)

user_correlation_df.head()

df_subtracted.head(3)

user_correlation_df['consumer_id'] = df_subtracted.index

user_correlation_df.set_index('consumer_id',inplace=True)
user_correlation_df.head()

common.head(3)

list_name = common.consumer_id.tolist()

user_correlation_df.columns = df_subtracted.index.tolist()


user_correlation_df_1 =  user_correlation_df[user_correlation_df.index.isin(list_name)]

user_correlation_df_1.shape

user_correlation_df_2 = user_correlation_df_1.T[user_correlation_df_1.T.index.isin(list_name)]

user_correlation_df_3 = user_correlation_df_2.T

user_correlation_df_3.head()

user_correlation_df_3.shape

user_correlation_df_3[user_correlation_df_3<0]=0

common_user_predicted_ratings = np.dot(user_correlation_df_3, common_user_based_matrix.fillna(0))
common_user_predicted_ratings

dummy_test = common.copy()

dummy_test['Ratings'] = dummy_test['Ratings'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='consumer_id', columns='item_id', values='Ratings').fillna(0)

dummy_test.shape

common_user_based_matrix.head()

dummy_test.head(3)

common_user_predicted_ratings = np.multiply(common_user_predicted_ratings,dummy_test)

common_user_predicted_ratings.head()

# Model Evaluation

### RMSE

RMSE is calculated only for the news rated by users. Before computing it, the predicted ratings are rescaled to the (1, 5) range.

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from math import sqrt

from sklearn.preprocessing import MinMaxScaler
from numpy import *

X  = common_user_predicted_ratings.copy()
X = X[X>0]

scaler = MinMaxScaler(feature_range=(1, 5))
print(scaler.fit(X))
y = (scaler.transform(X))

print(y)

common_ = common.pivot_table(index='consumer_id', columns='item_id', values='Ratings')

# Finding total non-NaN value
total_non_nan = np.count_nonzero(~np.isnan(y))

rmse = (sum(sum((common_ - y )**2))/total_non_nan)**0.5
print(rmse)
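
As a cross-check of the idea, RMSE restricted to rated entries can be sketched in plain Python. The data below is made up, `None` marks an unrated entry, and `rmse_on_rated` is a hypothetical helper rather than the notebook's own computation.

```python
import math

def rmse_on_rated(actual, predicted):
    """RMSE over positions where an actual rating exists (not None)."""
    squared_errors = [
        (a - p) ** 2
        for a_row, p_row in zip(actual, predicted)
        for a, p in zip(a_row, p_row)
        if a is not None
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [[5, None, 1], [None, 2, None]]
predicted = [[4.0, 3.0, 1.0], [2.0, 4.0, 1.0]]

# Squared errors on rated cells: 1, 0 and 4, so RMSE = sqrt(5 / 3).
error = rmse_on_rated(actual, predicted)
print(round(error, 4))
```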

### Precision@k

user_index=20

ratings_ds[ratings_ds['user_id']==user_index]

user_set = ratings_ds[ratings_ds['user_id']==user_index].sort_values(by='rating',ascending=False)['news_id'].tolist()

user_set

len(user_set)

predict_ds = pd.DataFrame(user_predicted_ratings)

user_predicted_set = predict_ds.iloc[user_index].sort_values(ascending=False)[:10].index.tolist()

len(user_predicted_set)

user_predicted_set

precision_at_10 = len(list(set(user_set) & set(user_predicted_set)))/10

precision_at_10
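
The same calculation can be wrapped in a small reusable function. This is a minimal sketch with illustrative IDs; `precision_at_k` is a hypothetical helper name, and dividing by `k` even when fewer than `k` items are recommended is an assumption of this sketch.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = len(set(top_k) & set(relevant))
    return hits / k

recommended = [7, 3, 9, 1, 4]   # ranked best first (illustrative IDs)
relevant = [3, 4, 8]            # items the user actually interacted with

# Two of the five recommended items (3 and 4) are relevant.
p_at_5 = precision_at_k(recommended, relevant, k=5)
print(p_at_5)
```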

# Using Item similarity

## Item Based Similarity

df_pivot = train.pivot_table(
    index='consumer_id',
    columns='item_id',
    values='Ratings'
).T

df_pivot.head()

mean = np.nanmean(df_pivot, axis=1)
df_subtracted = (df_pivot.T-mean).T

df_subtracted.head()

from sklearn.metrics.pairwise import pairwise_distances

# Item Similarity Matrix
item_correlation = 1 - pairwise_distances(df_subtracted.fillna(0), metric='cosine')
item_correlation[np.isnan(item_correlation)] = 0
print(item_correlation)

item_correlation.shape

item_correlation[item_correlation<0]=0
item_correlation

## 4. Prediction - Item Item


item_predicted_ratings = np.dot((df_pivot.fillna(0).T),item_correlation)
item_predicted_ratings

item_predicted_ratings.shape

dummy_train.shape

### Filtering the rating only for the news not rated by the user for recommendation

item_final_rating = np.multiply(item_predicted_ratings,dummy_train)
item_final_rating.head()

### Finding the top 10 recommendation for the *user*

# Take the user ID as input

# Taking 2nd user ID from above

user_input = int(input("Enter your user ID: "))
print(user_input)

# Recommending the top 10 news articles to the user.
d = item_final_rating.loc[user_input].sort_values(ascending=False)[0:10]
d

d1 = pd.merge(d,consumer_ds,left_on='item_id',right_on='item_id',how = 'left')
d1

train_new = pd.merge(train,consumer_ds,left_on='item_id',right_on='item_id',how='left')

train_new[train_new.consumer_id_x == -9223121837663643404]

# Evaluation - Item Item

test.columns

common =  test[test.item_id.isin(train.item_id)]
common.shape

common.head()

common_item_based_matrix = common.pivot_table(index='consumer_id', columns='item_id', values='Ratings').T

common_item_based_matrix.shape

item_correlation_df = pd.DataFrame(item_correlation)

item_correlation_df.head(3)

item_correlation_df['item_id'] = df_subtracted.index
item_correlation_df.set_index('item_id',inplace=True)
item_correlation_df.head()

list_name = common.item_id.tolist()

item_correlation_df.columns = df_subtracted.index.tolist()

item_correlation_df_1 =  item_correlation_df[item_correlation_df.index.isin(list_name)]

item_correlation_df_2 = item_correlation_df_1.T[item_correlation_df_1.T.index.isin(list_name)]

item_correlation_df_3 = item_correlation_df_2.T

item_correlation_df_3.head()

item_correlation_df_3[item_correlation_df_3<0]=0

common_item_predicted_ratings = np.dot(item_correlation_df_3, common_item_based_matrix.fillna(0))
common_item_predicted_ratings

common_item_predicted_ratings.shape

dummy_test = common.copy()

dummy_test['Ratings'] = dummy_test['Ratings'].apply(lambda x: 1 if x>=1 else 0)

dummy_test = dummy_test.pivot_table(index='consumer_id', columns='item_id', values='Ratings').T.fillna(0)

common_item_predicted_ratings = np.multiply(common_item_predicted_ratings,dummy_test)

common_ = common.pivot_table(index='consumer_id', columns='item_id', values='Ratings').T

from sklearn.preprocessing import MinMaxScaler
from numpy import *

X  = common_item_predicted_ratings.copy()
X = X[X>0]

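# Ratings range from 1 to 5, so predictions are rescaled to the same range as in the user-based evaluation.
scaler = MinMaxScaler(feature_range=(1, 5))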
print(scaler.fit(X))
y = (scaler.transform(X))

print(y)

# Finding total non-nan value
total_non_nan = np.count_nonzero(~np.isnan(y))

rmse = (sum(sum((common_ - y )**2))/total_non_nan)**0.5
print(rmse)

## Filtering News from Recommendation which are already seen by the user

ds_user = ratings_ds[ratings_ds['user_id']==17]

ds_user.head(3)

ds_user.shape

ratings_ds.head(3)



prediction_ds = pd.DataFrame(user_predicted_ratings)

prediction_ds.iloc[17].sort_values(ascending=False)[:10]

user_rating_final = user_rating[['item_id', 'news_id']]

user_rating_final = user_rating_final.drop_duplicates()

user_rating_final.head(2)

recommended_news_df = pd.DataFrame(prediction_ds.iloc[17].sort_values(ascending=False))

recommended_news_df.reset_index(inplace=True)

recommended_news_df.head()

platform_ds3 = platform_ds[platform_ds['language']=='en']

news_title = platform_ds3[['item_id', 'title']]

news_title.head()

recommended_news_df.columns = ['news_id', 'score']

recommended_news_df.head(2)

recommended_merged = pd.merge(recommended_news_df, user_rating, how='left', on='news_id')

recommended_merged.head(2)

news_output = pd.merge(recommended_merged,news_title,how='left', on='item_id')

news_output = news_output[['news_id', 'score', 'item_id', 'title']]

news_output.head(2)

merged_filter = pd.merge(news_output, ds_user, on='news_id', how='left')

merged_filter.shape

merged_filter.head(3)

merged_filter = merged_filter.drop(merged_filter[merged_filter['rating']>0].index)

merged_filter['title'][:10]



# 5. Content-based Filtering

platform_ds1['text_description'].str.len()

#### Average text description length (in characters)

np.mean(platform_ds1['text_description'].str.len())

#### Number of Text Description

number_of_text_description = []
for keywords in platform_ds1['text_description']:
    n_keywords = len(keywords.split(','))
    number_of_text_description.append(n_keywords)

number_of_text_description

sum(number_of_text_description)

# Plotting

plt.hist(number_of_text_description)
plt.xlabel('Distribution of Number of Words in News Articles Text Description column')
plt.ylabel('Number of News')
plt.show()



##### Selected DF

# Take an explicit copy so that new columns can be added without a SettingWithCopyWarning.
platform_ds1 = platform_ds1[['title', 'text_description']].copy()

platform_ds1.head(2)

# Data Pre-Processing

import nltk
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

ps = nltk.PorterStemmer()
wn = nltk.WordNetLemmatizer()


def make_lower_case(text_description):
    return text_description.lower()

# Define function for removing stop words
def remove_stop_words(text_description):
    text_description = text_description.split()
    stops = set(stopwords.words("english"))
    text_description = [w for w in text_description if w not in stops]
    texts = [w for w in text_description if w.isalpha()]
    texts = " ".join(texts)
    return texts

# Define function for removing punctuation
def remove_punctuation(text_description):
    tokenizer = RegexpTokenizer(r'\w+')
    text_description = tokenizer.tokenize(text_description)
    text_description = " ".join(text_description)
    return text_description

# Define function for removing the html tags
def remove_html(text_description):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text_description)


# Apply the functions above to 'text_description' and store the result in a new column named 'cleaned_desc'.
# HTML tags are removed before punctuation, because remove_punctuation strips the angle brackets that
# remove_html relies on; stop words are filtered last, once tokens no longer carry punctuation.
platform_ds1['cleaned_desc'] = platform_ds1['text_description'].apply(func = make_lower_case)
platform_ds1['cleaned_desc'] = platform_ds1.cleaned_desc.apply(func = remove_html)
platform_ds1['cleaned_desc'] = platform_ds1.cleaned_desc.apply(func = remove_punctuation)
platform_ds1['cleaned_desc'] = platform_ds1.cleaned_desc.apply(func = remove_stop_words)
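
The effect of such a cleaning pipeline can be checked on a single string. The block below is a self-contained sketch using only `re`; the tiny `STOP_WORDS` set and the `clean` helper are hypothetical stand-ins for the NLTK-based functions above. It strips HTML tags while the angle brackets are still present, then tokenises and filters.

```python
import re

STOP_WORDS = {"the", "is", "a", "of", "to"}  # tiny illustrative stop list

def clean(text):
    """Lower-case, strip HTML tags and punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)    # HTML tags first, while < > still exist
    tokens = re.findall(r"\w+", text)     # keeps runs of word characters only
    return " ".join(t for t in tokens if t.isalpha() and t not in STOP_WORDS)

cleaned = clean("<p>The price of Bitcoin is rising, again!</p>")
print(cleaned)
```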

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

tf = TfidfVectorizer(analyzer='word',
                     stop_words='english',
                     max_df=0.8,
                     min_df=0.0,
                     use_idf=True,
                     ngram_range=(1,3))
tfidf_matrix = tf.fit_transform(platform_ds1['cleaned_desc'])

platform_ds1.head(2)

platform_ds1['cleaned_desc'][:2]

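# Reset the index after dropping duplicates so that row labels match positions in the corpus built below.
platform_ds1 = platform_ds1.drop_duplicates(subset=None, keep='first', inplace=False).reset_index(drop=True)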

platform_ds1['cleaned_desc'][:2]

platform_ds1.shape

### Create numpy array from text_description column

keywords_array = platform_ds1['cleaned_desc'].to_numpy()

keywords_array

type(keywords_array[0])

### Use splitting to generate words from keywords_array

words_list = []
for keyword in keywords_array:
    splitted_words = keyword.lower().split(' ')
    words_list.append(splitted_words)

words_list

len(words_list), len(words_list[0]), len(words_list[1]), len(words_list[10])

### Create Dictionary

# Install Gensim

! pip install gensim

from gensim.corpora.dictionary import Dictionary

Create a dictionary of words from the words list

dictionary = Dictionary(words_list)

dictionary

len(dictionary)

#### Total number of words in the words_list

number_words = 0
for word in words_list:
    number_words = number_words + len(word)

number_words

dictionary.get(0), dictionary.get(1), dictionary.get(20), dictionary.get(1000)

## Generate Bag of Words

words_list[0]

#### doc2bow = Document to BOW (Bag of Words)

bow = dictionary.doc2bow(words_list[0])

bow

len(words_list[0]), len(bow)

### Generate Corpus by creating BOW of each document

corpus = [dictionary.doc2bow(doc) for doc in words_list]

#print(corpus)

len(corpus), len(corpus[0]), len(corpus[1]), len(corpus[20])

len(words_list), len(words_list[0])

### TF-IDF Model

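#### TF = term frequency -> (count of the term in the document) / (number of terms in the document)

#### IDF = inverse document frequency -> log(N/n), where N is the number of documents and n is the number of documents containing the term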
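
A worked example helps to see how the two factors interact. The sketch below implements the textbook definition on a made-up three-document corpus in plain Python; `tf_idf` is a hypothetical helper, and gensim's `TfidfModel` may use a different log base and normalisation, so its numbers need not match these.

```python
import math
from collections import Counter

# Made-up tokenised corpus, for illustration only.
docs = [
    ["bitcoin", "price", "rises"],
    ["bitcoin", "regulation", "debate"],
    ["machine", "learning", "price"],
]

def tf_idf(doc, docs):
    """TF-IDF weights for one tokenised document against a small corpus."""
    counts = Counter(doc)
    n_docs = len(docs)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc)                          # count / number of terms
        n_containing = sum(term in d for d in docs)    # documents with the term
        idf = math.log(n_docs / n_containing)          # log(N / n)
        weights[term] = tf * idf
    return weights

weights = tf_idf(docs[0], docs)
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.4f}")
```

The term "rises" appears in only one document, so it receives the largest weight, while "bitcoin" and "price" each appear in two documents and are weighted lower.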

from gensim.models.tfidfmodel import TfidfModel

# Create tfidf model of the corpus

tfidf = TfidfModel(corpus)

tfidf

print(tfidf[corpus[0]])

## Generate Similarity Matrix

from gensim.similarities import MatrixSimilarity

sims = MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

print(sims)

sims[corpus[0]]

len(sims[corpus[0]])

## Generate Recommendation

news_title = "IEEE to Talk Blockchain at Cloud Computing Oxford-Con - CoinDesk"

# Number of recommendations

n=10

platform_ds2 = platform_ds1.loc[platform_ds1.title==news_title]

platform_ds2

#### Generate words_list by splitting the cleaned_desc column

words_list = platform_ds2['cleaned_desc'].iloc[0].split(' ')

words_list

#### set the query_doc to the words_list

query_doc = words_list

query_doc

#### Get the Bag Of Words

query_doc_bow = dictionary.doc2bow(query_doc)

query_doc_bow

len(query_doc_bow), len(query_doc)

### Generate TF-IDF values for the query_doc_bow

query_doc_tfidf = tfidf[query_doc_bow]

query_doc_tfidf



### Get Similarity Score using Similarity Matrix

similarity_array = sims[query_doc_tfidf]

similarity_array

len(similarity_array)

### Create a Series to Visualize Similarity Score along with News Title

similarity_series = pd.Series(similarity_array.tolist(), index=platform_ds1.title.values)

similarity_series.head()

### Sort the series to get Top Recommended News

recommended_news = similarity_series.sort_values(ascending=False)[1:n+1]

recommended_news.head()

### Corpus for 15th index

corpus[15]

corpus[platform_ds2.index.values.tolist()[0]]

#### TF-IDF Values

tfidf[corpus[platform_ds2.index.values.tolist()[0]]]

#### Sorted TF-IDF values

sorted_tfidf_weights = sorted(tfidf[corpus[platform_ds2.index.values.tolist()[0]]], key=lambda w: w[1], reverse=True)

sorted_tfidf_weights

dictionary.get(3000)

dictionary.get(5000)

print('Top words associated with this news by tf-idf values are: ')
for term_id, weight in sorted_tfidf_weights[:10]:
    print(" '%s' %.5f" %(dictionary.get(term_id), weight))

platform_ds2.columns

# Creating a Modular code by creating a function for Query News Input

def news_recommendation(news_title, number_of_hits):
    platform_ds2 = platform_ds1.loc[platform_ds1.title==news_title] # get the news title row

    keywords = platform_ds2['cleaned_desc'].iloc[0].split(' ') # get the cleaned description of the news as a list of words

    query_doc = keywords #set the query_doc to the list of keywords

    query_doc_bow = dictionary.doc2bow(query_doc) # get a bag of words from the query_doc

    query_doc_tfidf = tfidf[query_doc_bow] # convert the regular bag of words model to a tf-idf model

    similarity_array = sims[query_doc_tfidf] # get the array of similarity values between our news and every other news.

    similarity_series = pd.Series(similarity_array.tolist(), index=platform_ds1.title.values) # convert to a Series

    recommended_news = similarity_series.sort_values(ascending=False)[1:number_of_hits+1]

    # get the top matching results, i.e. most similar news title
    # start from index 1 because every item is most similar to itself

    return recommended_news

type(recommended_news)

news_recommendation("French Senate Will Debate on Bitcoin Regulation", 10)

news_recommendation('Fooling The Machine', 10)

# 6. ALS Model - Alternating Least Square

## Building Recommendation system using ALS on Consumer Dataset

consumer_ds.head(2)

consumer_ds.shape

consumer_ds['consumer_id'].nunique()

consumer_ds['item_id'].nunique()



## Creating Sparse User-Item Matrix

from scipy.sparse import csr_matrix

alpha = 40

consumer_ds.shape

consumer_ds.dropna(inplace=True)

consumer_ds = consumer_ds.drop_duplicates(subset=None, keep='first', inplace=False)

consumer_ds.shape

#consumer_ds['consumer_id_abs'] = consumer_ds['consumer_id'].apply(lambda x: abs(x))

#consumer_ds['item_id_abs'] = consumer_ds['item_id'].apply(lambda x: abs(x))

consumer_ds.head(2)

type(consumer_ds)

#sparse_user_item = csr_matrix( ([alpha]*consumer_ds.shape[0], (consumer_ds['consumer_id_abs'], consumer_ds['item_id_abs']) ))

users_items_pivot_matrix_df = consumer_ds.pivot_table(index='consumer_id',
                                                columns='item_id',
                                                values='Ratings').fillna(0)

users_items_pivot_matrix_df.head(10)

users_ids = list(users_items_pivot_matrix_df.index)
users_ids[:10]

users_items_pivot_sparse_matrix = csr_matrix(users_items_pivot_matrix_df)
users_items_pivot_sparse_matrix

The result is a 1895 x 2982 sparse matrix.

csr_user_array = users_items_pivot_sparse_matrix.toarray()

csr_user_array

len(csr_user_array), len(csr_user_array[0]), csr_user_array[1][1]

max(csr_user_array[1])



### A CSR (Compressed Sparse Row) matrix stores only the non-zero entries, here the non-zero ratings

print(users_items_pivot_sparse_matrix)
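
To see what "storing only the non-zero entries" means, the CSR layout can be built by hand. This plain-Python sketch uses a made-up matrix; `to_csr` is a hypothetical helper, and the three lists correspond to the `data`, `indices` and `indptr` attributes of a SciPy `csr_matrix`.

```python
def to_csr(dense):
    """Build the three CSR arrays (data, indices, indptr) from a dense matrix."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # its column position
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

dense = [
    [0, 3, 0, 0],
    [1, 0, 0, 5],
    [0, 0, 0, 0],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
```

Row `i` occupies the slice `data[indptr[i]:indptr[i + 1]]`, so an empty row costs only one extra entry in `indptr`.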

users_items_pivot_sparse_matrix = users_items_pivot_sparse_matrix.T.tocsr()

users_items_pivot_sparse_matrix

csr_item_array = users_items_pivot_sparse_matrix.toarray()

csr_item_array

len(csr_item_array), len(csr_item_array[0]), csr_item_array[1][1]

print(users_items_pivot_sparse_matrix)



#! pip install implicit

import implicit
from implicit.evaluation import train_test_split

train, test = train_test_split(users_items_pivot_sparse_matrix, train_percentage=0.8)

train

test

model = implicit.als.AlternatingLeastSquares(factors=100, regularization=0.1, iterations=20, calculate_training_loss=False)

model

# Train the Model

model.fit(train)

user_id = 100

consumer_ds['consumer_id'].loc[100]

model.recommend(user_id, users_items_pivot_sparse_matrix)

model.recommend(user_id, users_items_pivot_sparse_matrix, N=30)

### Generating recommendations for News_id / item_id

item_id = 20
n_similar = 2982

similar = model.similar_items(item_id, n_similar)

similar

output = pd.DataFrame(similar, columns=['news_id', 'als_score'])

output

## Merging recommendations

consumer_ds1 = consumer_ds[['consumer_id', 'item_id', 'interaction_type']]

# Consumer Interaction Type
interaction = consumer_ds1['interaction_type'].value_counts()*100/consumer_ds1.shape[0]

interaction

interaction_ds = pd.DataFrame(interaction)

type(interaction_ds)

interaction_ds

interaction_ds.reset_index(inplace=True)

interaction_ds

interaction_ds.columns = ['interaction_type', 'value']

### Creating a rating column

interaction_ds['rating1'] = 100/interaction_ds['value']

interaction_ds.head(3)

consumer_ratings = pd.merge(consumer_ds1, interaction_ds, on='interaction_type', how='left')

consumer_ratings.head(3)

consumer_ratings = consumer_ratings[['consumer_id', 'item_id', 'rating1']]

consumer_ratings.head(3)

consumer_ratings.shape

consumer_ratings['news_id'] = consumer_ratings.groupby(['item_id']).ngroup()

consumer_ratings['user_id'] = consumer_ratings.groupby(['consumer_id']).ngroup()

consumer_ratings.head(3)

consumer_ratings.describe()

ratings_ds = consumer_ratings[['user_id', 'news_id', 'rating1']]

ratings_ds.head(3)

news_mapping = consumer_ratings[['item_id', 'news_id']]

news_mapping.head(3)

news_mapping.shape

news_mapping1 = news_mapping.drop_duplicates()

news_mapping1.shape

als_ds1 = pd.merge(output, news_mapping1, on='news_id', how='left')

als_ds1.head()

news_title.shape

als_title = pd.merge(als_ds1, news_title, on='item_id', how='left')

als_title.head(3)

als_title.shape

# Calculating ALS Score using Normalization

als_title['als_score_normalized'] = (als_title['als_score']-min(als_title['als_score'])) / (max(als_title['als_score']) - min(als_title['als_score']))

als_title
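
Min-max normalisation, as used above, maps the smallest score to 0 and the largest to 1. The following is a minimal plain-Python sketch with illustrative values; `min_max_normalize` is a hypothetical helper, and returning zeros for constant input is an assumption made here to avoid division by zero.

```python
def min_max_normalize(scores):
    """Rescale scores linearly to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # avoid division by zero
    return [(s - lo) / (hi - lo) for s in scores]

als_scores = [0.2, 0.5, 1.1, 0.8]  # illustrative values
normalized = min_max_normalize(als_scores)
print(normalized)
```

Putting every model's scores on the same [0, 1] scale is what makes it meaningful to average them in a hybrid model.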



# Hybrid Recommendation System

## Hybrid-1: Content Based + Item Based Collaborative Model

news_index = 30

news_prediction = pd.DataFrame(item_correlation)

news_prediction.head(3)

item_recommendation = pd.DataFrame(news_prediction.iloc[news_index].sort_values(ascending=False))

item_recommendation

item_recommendation.reset_index(inplace=True)

item_recommendation.columns = ['news_id', 'score']
item_recommendation.head(3)

### Merging

merged_ds1 = pd.merge(item_recommendation, news_mapping1, on='news_id', how='left')

merged_ds1.head(3)

hybrid1 = pd.merge(merged_ds1, platform_ds3, on='item_id', how='left')

hybrid1.shape

hybrid1 = hybrid1[['news_id','item_id','title','score']]

hybrid1.head(3)

hybrid1['collaborative_score_normalized'] = (hybrid1['score']-min(hybrid1['score']))/(max(hybrid1['score'])-min(hybrid1['score']))

hybrid1.head(3)

news_recom = news_recommendation('Machine Learning for Designers', 10)

news_ds = pd.DataFrame(news_recom)

news_ds.head(3)

news_ds.reset_index(inplace=True)

news_ds.columns = ['title', 'score']

news_ds.head(3)

news_ds['content_score_normalized'] = (news_ds['score']-min(news_ds['score'])) / (max(news_ds['score'])-min(news_ds['score']))

news_ds



hybrid1_result = pd.merge(hybrid1, news_ds, how='left', on='title')

hybrid1_result.head(3)



hybrid1_result['Main_Score'] = (hybrid1_result['collaborative_score_normalized']+hybrid1_result['content_score_normalized'])/2

hybrid1_result = hybrid1_result[['title','Main_Score']]

hybrid1_result.shape

hybrid1_result = hybrid1_result.drop_duplicates()

hybrid1_result.sort_values(by='Main_Score', ascending=False)[:10]
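
One detail of this averaging is worth spelling out: after a left merge, titles that are absent from the content-based list have no content score, so their average is undefined. The plain-Python sketch below (made-up scores, hypothetical `hybrid_scores` helper) shows two ways of handling this, either skipping such titles or substituting a default score.

```python
def hybrid_scores(collab, content, default=None):
    """Average two normalised score dicts keyed by title.

    Titles missing from `content` are skipped when `default` is None,
    otherwise `default` is used in place of the missing content score.
    """
    combined = {}
    for title, c_score in collab.items():
        t_score = content.get(title, default)
        if t_score is None:
            continue
        combined[title] = (c_score + t_score) / 2
    return combined

collab = {"A": 1.0, "B": 0.5, "C": 0.0}   # illustrative normalised scores
content = {"A": 0.5, "C": 1.0}            # "B" has no content score

strict = hybrid_scores(collab, content)
lenient = hybrid_scores(collab, content, default=0.0)
top = sorted(lenient, key=lenient.get, reverse=True)
print(strict, lenient, top)
```

Which option is preferable depends on whether an item without a content score should still be eligible for recommendation.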



## Hybrid-2: ALS + Item-based Collaborative Filtering

hybrid2 = pd.merge(hybrid1, als_title, how='left', on='title')

hybrid2.head(3)

hybrid2.shape

hybrid2['Main_Score'] = (hybrid2['collaborative_score_normalized']+hybrid2['als_score_normalized'])/2

hybrid2.head(3)

hybrid2 = hybrid2[['title','Main_Score']]

hybrid2_result = hybrid2.drop_duplicates()

hybrid2_result.sort_values(by='Main_Score', ascending=False)[:10]



# Evaluation Metrics & Conclusion

Main Evaluation Metrics used here are:

#### RMSE
Root Mean Square Error is the standard deviation of the residuals (prediction errors). It aggregates the magnitudes of the prediction errors for various data points into a single measure of predictive power. RMSE is a measure of accuracy used to compare the forecasting errors of different models on a particular dataset, not across datasets, as it is scale-dependent. It is extremely helpful to have a single number to judge a model’s performance. RMSE is a good choice here as it is mainly used for analysis and forecasting.


$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y^{(i)} - \hat{y}^{(i)}\right)^{2}}$$

Root mean square error can be expressed as above, where $N$ is the number of data points, $y^{(i)}$ is the i-th measurement, and $\hat{y}^{(i)}$ is its corresponding prediction.


#### Precision@k
Precision@k is the proportion of recommended items in the top-k set that are relevant.
We can say that,

Precision@k = (No. of recommended items @k that are relevant) / (No. of recommended items @k)


For instance, Precision@10 calculated here, corresponds to the number of relevant results among the top 10 retrieved recommendations. This metric is chosen as it is easier to score manually since only the top k results need to be examined to determine if they are relevant or not.


$$\text{Precision@}k = \frac{\left|\{\text{relevant items}\} \cap \{\text{top-}k\text{ recommended items}\}\right|}{k}$$

The formula above expresses precision@k for the top-k set.

There are various other evaluation metrics, including MAE, MAP@k and recall@k, that could also bring some insights into the recommendation model.

