# Introduction

Recommender systems suggest items to users based on a range of signals, such as past behavior, stated preferences, and item attributes. They predict which products a user is most likely to purchase or find interesting. Companies such as Netflix and Amazon use recommender systems to help their users find suitable products or movies.

A recommender system copes with a large volume of information by filtering it down to the most relevant items, using data provided by the user along with other signals of that user's preferences and interests. It matches users to items and computes similarities between users and between items to produce recommendations.

Both users and service providers benefit from these systems, which can also improve the quality of decisions and the decision-making process.

Many kinds of items can be recommended, including movies, books, news, articles, jobs, and advertisements. Netflix uses a recommender system to suggest movies and web series to its users, and YouTube similarly recommends videos. Many other recommender systems are in wide use today.

## Types of Recommendation Systems

### Popularity-Based Recommendation System

A popularity-based recommendation system works on the principle of popularity, that is, whatever is currently trending. It identifies the products or movies that are trending or most popular among users and recommends those directly.

For example, if a product is purchased by most people, the system treats it as the most popular product and recommends it to every new user who has just signed up, on the assumption that the new user is also likely to purchase it.

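
As a minimal sketch of the popularity idea, the snippet below counts purchases per product and recommends the most purchased ones to everyone. The `purchases` table and the `most_popular` helper are hypothetical and only illustrate the principle; they are not part of the news dataset used later.

```python
import pandas as pd

# Hypothetical purchase log: one row per (user, product) purchase
purchases = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1", "u2", "u4"],
    "product": ["laptop", "laptop", "laptop", "mouse", "mouse", "desk"],
})

def most_popular(purchases, n=2):
    """Return the n most frequently purchased products."""
    counts = purchases["product"].value_counts()  # sorted by count, descending
    return counts.head(n).index.tolist()

# Every new user receives the same list, regardless of personal taste
top_products = most_popular(purchases, n=2)
print(top_products)
```

Because the list is identical for all users, this approach needs no user history, but it also cannot personalize.
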
### Content-Based Recommendation System

A content-based recommendation system works on the principle of similar content. If a user is watching a movie, the system looks for other movies with similar content or in the same genre. Various attributes of the items are used to compute this similarity.
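
The sketch below illustrates content similarity on a toy example, assuming that each item is described by a short text and that TF-IDF vectors with cosine similarity are an acceptable similarity measure. The movie titles, descriptions, and the `most_similar` helper are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical movie descriptions
movies = {
    "Space Quest": "astronauts explore a distant planet in space",
    "Galaxy Run": "a crew of astronauts races across space",
    "Baking Days": "a pastry chef opens a small bakery",
}
titles = list(movies)

# Turn each description into a TF-IDF vector and compare all pairs
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(list(movies.values()))
sim = cosine_similarity(matrix)

def most_similar(title):
    """Return the title whose description is closest to the given one."""
    i = titles.index(title)
    scores = sim[i].copy()
    scores[i] = -1.0  # exclude the item itself
    return titles[scores.argmax()]

print(most_similar("Space Quest"))
```

The two space movies share terms such as "astronauts" and "space", so they score as similar, while the bakery movie shares no terms with either.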

### Collaborative Filtering

Collaborative filtering is considered one of the smarter approaches. It works on the similarity between different users and also between items, and it is widely used on e-commerce websites and online movie websites. It looks at the tastes of similar users and makes recommendations accordingly.

The similarity is not restricted to users' tastes; similarity between items can also be considered. The system tends to give better recommendations when a large volume of information about users and items is available.
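
A minimal user-based sketch of this idea follows. It assumes a small, made-up ratings matrix in which 0 means "not rated", and it predicts a missing rating as a similarity-weighted average of other users' ratings. The `cosine` and `predict` helpers are hypothetical, and treating unrated items as 0 when comparing users is a simplification.

```python
import numpy as np

# Hypothetical ratings matrix: rows = users, columns = items, 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],   # user 0 (has not rated item 2)
    [5, 5, 4, 1],   # user 1: tastes close to user 0
    [1, 1, 2, 5],   # user 2: tastes differ from user 0
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    others = [u for u in range(len(ratings))
              if u != user and ratings[u, item] > 0]
    sims = np.array([cosine(ratings[user], ratings[u]) for u in others])
    vals = np.array([ratings[u, item] for u in others])
    return float(sims @ vals / sims.sum())

pred = predict(ratings, user=0, item=2)
print(round(pred, 2))
```

The prediction lands closer to user 1's rating than to user 2's, because user 1's rating history looks more like user 0's.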

# Goal

Create a recommender system that recommends articles similar to the article a customer is reading.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Load and Inspect the Data

In [None]:
data = pd.read_csv('../input/news-dataset-18920/result_final.csv')

# Inspect the first and last n rows of a dataframe
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
def display_df(df, nrows):
    return pd.concat([df.head(nrows), df.tail(nrows)])

print(data.shape)
print(data.dtypes)
display_df(data, 3)

## Data Cleaning

In [None]:
# Count how many missing values are in each column
data.isna().sum()

In [None]:
# Check if there's a pattern in the missing data
sns.heatmap(data.isna().transpose(),
            cmap="YlGnBu",
            cbar_kws={'label': 'Missing Data'})

In [None]:
# Check if we can fill na's by using last date
display_df(data['date'], 8)

In [None]:
# Drop news with missing title_summary
data.dropna(how = 'any', subset = ['title_summary'], inplace = True)
print(data.isna().sum())

In [None]:
# Check duplicated news
for column in data.columns:
    print(f' {column} has {data[data.duplicated(subset = column, keep = False)].shape[0]} duplicates')

In [None]:
# Check some duplicated news with the same title
display_df(data[data.duplicated(subset = 'title', keep = False)].sort_values('title'), 4)

In [None]:
# Remove all duplicated news with the same title except for the first occurrence
data.drop_duplicates(subset = 'title', keep = 'first', inplace = True)
print(data.shape)

In [None]:
# Check if there are still duplicated news
for column in data.columns:
    print(f' {column} has {data[data.duplicated(subset = column, keep = False)].shape[0]} duplicates')

In [None]:
# Check duplicated news with same summary
data[data.duplicated(subset = 'summary', keep = False)].sort_values('summary')

In [None]:
# Remove all duplicated news with the same summary except for the first occurrence
data.drop_duplicates(subset = 'summary', keep = 'first', inplace = True)
print(data.shape)

In [None]:
# Check if there are still duplicated news
for column in data.columns:
    print(f' {column} has {data[data.duplicated(subset = column, keep = False)].shape[0]} duplicates')

In [None]:
# Last check for duplicates
data[data.duplicated(subset = 'text')]

In [None]:
# Remove all duplicated news with the same text except for the first occurrence
data.drop_duplicates(subset = 'text', keep = 'first', inplace = True)

In [None]:
# Remove unnecessary columns
data.drop(labels = ['Unnamed: 0', 'Unnamed: 0.1', 'title_summary'], axis = 1, inplace = True)
date = data['date']
display_df(data, 3)

## Analysis

In [None]:
# Plot the number of articles published per day
dates = pd.DataFrame({'date' : date})
dates['date'] = pd.to_datetime(dates['date'], utc = True)
dates['day'] = dates.date.dt.day
dates['month'] = dates.date.dt.month
dates = dates.convert_dtypes()
dates.groupby(['month','day']).agg('count').plot(rot = 45)

### Preprocessing

In [None]:
# Join title, keywords, summary and text to make a more robust recommender system
def create_soup(x):
    return x['title'] + ' ' + x['keywords'] + ' ' + x['summary'] + ' ' + x['text']
data['soup'] = data.apply(create_soup, axis=1)

In [None]:
# Compute TF-IDF weights for each term
from sklearn.feature_extraction.text import TfidfVectorizer

tfidv = TfidfVectorizer(strip_accents = 'ascii', stop_words = 'english')
tfidfv_matrix = tfidv.fit_transform(data['soup'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = pd.DataFrame(tfidfv_matrix.toarray(), columns=tfidv.get_feature_names_out(), index = data['title'])
features.head()

In [None]:
# Applying non-negative matrix factorization (NMF) to find topics within news
from sklearn.decomposition import NMF

nmf = NMF(n_components=20)
topics = nmf.fit_transform(features)

In [None]:
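# Calculating similarities
from sklearn.metrics.pairwise import cosine_similarity

# NMF topic weights are not L2-normalized, so a plain dot product (linear_kernel)
# would not be a cosine similarity; cosine_similarity normalizes the rows first
cosine_sim = cosine_similarity(topics)
# Map each title to its row position in `data`, which matches the rows of cosine_sim
# (data.index has gaps after the rows dropped during cleaning)
indices = pd.Series(np.arange(len(data)), index = data['title'])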

# Function to print recommendations for a given article title
def get_recommendations(title, no_of_news_article):
    # Get the index of the article that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all articles with that article
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the articles by similarity score, most similar first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the requested number of top matches, skipping the first entry (the article itself)
    sim_scores = sim_scores[1:no_of_news_article+1]
    
    print("Article Read -- " + data['title'].iloc[idx] +" link --"+ data['link'].iloc[idx])
    print(" ---------------------------------------------------------- ")
    
    # Get the indices of the recommended articles
    news_indices = [i[0] for i in sim_scores]
    
    # Print the most similar articles
    for i in range(len(news_indices)):
        print("Recommendation "+ str(i+1)+" --- " +str(news_indices[i])+"(IDX)  "+str(data['date'].iloc[news_indices[i]])+" : "+
              data['title'].iloc[news_indices[i]] +" || Link --"+ data['link'].iloc[news_indices[i]] +" score -- "+ str(sim_scores[i][1]))
        print()
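
One detail worth checking in the similarity step: a plain dot product, which is what `linear_kernel` computes, equals cosine similarity only when the rows are L2-normalized. TF-IDF rows from `TfidfVectorizer` are normalized by default, but NMF topic weights are not. The sketch below uses three made-up vectors to show how the two measures can rank neighbors differently; the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
from sklearn.preprocessing import normalize

# Hypothetical unnormalized topic vectors
vectors = np.array([
    [1.0, 0.0],
    [10.0, 1.0],   # points almost the same way as row 0, but is much longer
    [0.5, 0.5],
])

dot = linear_kernel(vectors)                     # raw dot products
cos = cosine_similarity(vectors)                 # normalizes rows internally
cos_via_dot = linear_kernel(normalize(vectors))  # dot product of unit-length rows

# With raw dot products, row 0 looks closer to the long row 1 than to itself
print(dot[0])
# With cosine similarity, row 0 is closest to itself, as expected
print(cos[0])
```

This matters for any function that skips the first entry of the sorted scores on the assumption that an item is always most similar to itself.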

In [None]:
get_recommendations('Oracle boots out Microsoft and wins bid for TikTok, reports say – TechCrunch', 5)

# Conclusion

We built a content-based recommender system capable of recommending news similar to the articles read by the client. Judging from the recommendations obtained, it appears to work well and is ready to be tested.