# Introduction to Blog Recommendation System

In today's information age, tech blogs have become an invaluable source of knowledge, insights, and entertainment. With the ever-growing abundance of blog content, finding the most relevant and engaging articles can be a daunting task for users. To address this challenge, we need a comprehensive blog recommendation system, leveraging cutting-edge techniques and machine learning algorithms to enhance the user experience on blog platforms.

In this kernel we will build blog recommendation system based on <a href="https://www.kaggle.com/datasets/yakshshah/blog-recommendation-data?datasetId=3253367&sortBy=dateRun&tab=profile">Blog Recommendation Data</a> published by me. Which consists of blog data collected from __Medium__ and the ratings collected from over 5000 users by traking their activity.

### Their are two types of Recommendation System

1. **Collaborative Filtering**:

Collaborative filtering is a popular recommendation technique that leverages the collective wisdom of users to make recommendations. It analyzes user behavior and item similarity to identify patterns and generate personalized suggestions. There are two main approaches within collaborative filtering:

> * **User-Based Collaborative Filtering**: This approach recommends items to a user based on the preferences of similar users. It identifies users with similar tastes and recommends items that those similar users have enjoyed. For example, if User A and User B have similar preferences and User A likes a specific blog post, the system will recommend that post to User B.

> * **Item-Based Collaborative Filtering**: Instead of focusing on similar users, item-based collaborative filtering recommends items similar to those a user has previously enjoyed. It analyzes the relationships between items and suggests new items based on their similarities. For instance, if User A likes Blog Post X, and Blog Post Y is similar to Blog Post X, the system will recommend Blog Post Y to User A.

2.  __Content-Based Filtering__:

Content-based filtering recommends items based on their attributes or characteristics. It uses user preferences and item features to make recommendations. This approach builds user profiles based on their historical preferences and item attributes. It then recommends items that align with the user's preferences and exhibit similar features. For instance, if a user has shown interest in technology-related blog posts, the system will recommend more articles with similar technological topics.

3. 
__Hybrid Recommendation Systems__:
Hybrid recommendation systems combine multiple recommendation approaches to provide more accurate and diverse suggestions. They leverage the strengths of different techniques to overcome limitations and offer enhanced recommendations. For example, a hybrid system might combine collaborative filtering and content-based filtering to balance between user preferences and item attributes. By using collaborative filtering to capture user behavior and content-based filtering to capture item characteristics, a hybrid system can provide more accurate and personalized recommendations.

We will be using Content Based Filtering approach filtering based for building our recommendation system.

### Let's get started

In [1]:
# import required packages

import pandas as pd
import numpy as np
import nltk
import re
from nltk import corpus
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk import wsd
from nltk.corpus import wordnet as wn

nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('wordnet2022')

! cp -rf /usr/share/nltk_data/corpora/wordnet2022 /usr/share/nltk_data/corpora/wordnet # temp fix for lookup error.

[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet2022 to /usr/share/nltk_data...
[nltk_data]   Unzipping corpora/wordnet2022.zip.


Let us load the data now

In [2]:
blog_df = pd.read_csv('/kaggle/input/blog-recommendation-data/Medium Blog Data.csv')
author_df = pd.read_csv('/kaggle/input/blog-recommendation-data/Author Data.csv')
ratings_df = pd.read_csv('/kaggle/input/blog-recommendation-data/Blog Ratings.csv')

The first dataset has following features,

* blog_id : Unique ID given to the blog
* author_id : Unique ID given to the author of the blog
* blog_title : Title of the Blog
* blog _ content : Brief Summary of what the blog content is about
* blog_link : link to the specific blog
* blog_img : image related to that blog
* blog_topic : domain it belongs to for eg. AI,Data Science etc.

The Second dataset has following features,

* author_id : Unique ID given to the author
* author_name : Name of the author

The Third dataset has following features,

* blog_id : ID of the blog
* user_id : ID of the User
* ratings : ratings given by the user

# Content Based Filtering

Let us first see how many blogs we have for each domain 

In [3]:
blog_df['topic'].value_counts()

ai                      736
blockchain              644
cybersecurity           642
web-development         635
data-analysis           594
cloud-computing         589
security                527
web3                    471
machine-learning        467
nlp                     453
data-science            444
deep-learning           430
android                 426
dev-ops                 384
information-security    374
image-processing        354
flutter                 343
backend                 341
cloud-services          339
Cryptocurrency          331
app-development         322
backend-development     312
Software-Development    309
Name: topic, dtype: int64

### Remove the columns from blog data that are not needed
Let us remove __author_id__, __blog_link__, __blog_img__ and __scrape_time__ from blog_df

In [4]:
blog_df.drop(['author_id','blog_link','blog_img','scrape_time'],axis='columns',inplace=True)

We need to remove duplicate blog data

In [5]:
blog_df.drop_duplicates(['blog_title','blog_content'],inplace=True)

### Preprocessing Text Data
It is necessary to remove the stopwords from blog content and also apply lemmatization to bring all the words to theirt root word this is the basic step we need to perform before we move forward

In [6]:
lst_stopwords=corpus.stopwords.words('english')
def pre_process_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    text=str(text).lower()
    text=text.strip()
    text = re.sub(r'[^\w\s]', '', text)
    lst_text = text.split()
    if lst_stopwords is not None:
        lst_text=[word for word in lst_text if word not in lst_stopwords]
    if flg_lemm:
        lemmatizer = WordNetLemmatizer()
        lst_text = [lemmatizer.lemmatize(word) for word in lst_text]
    if flg_stemm:
        stemmer = PorterStemmer()
        lst_text = [stemmer.stem(word) for word in lst_text]
    text=" ".join(lst_text)
    return text

In [7]:
blog_df['clean_blog_content'] = blog_df['blog_content'].apply(lambda x: pre_process_text(x,flg_stemm=False,flg_lemm=True,lst_stopwords=lst_stopwords))

### Using TFIDF Vectorizer to vectorize the blog content

TF-IDF, short for Term Frequency-Inverse Document Frequency, is a widely used technique in natural language processing and information retrieval to quantify the importance of a term in a document within a collection of documents. TF-IDF combines two factors: term frequency (TF) and inverse document frequency (IDF).

> * **Term Frequency (TF)**:
TF measures the frequency of a term within a document. It calculates the number of times a term appears in a document and represents it as a raw count or a normalized value. The rationale behind TF is that terms that appear more frequently in a document are likely to be more important or relevant to that document.

> * **Inverse Document Frequency (IDF)**:
IDF measures the significance of a term in the entire collection of documents. It calculates the logarithm of the inverse fraction of the number of documents that contain the term. The idea behind IDF is that terms that occur in a small number of documents are more informative and valuable than those that appear in a large number of documents.

The TF-IDF calculation is performed by multiplying the TF and IDF values together. The resulting score represents the importance of a term within a document in the context of the entire collection of documents. Higher scores indicate that a term is more relevant or distinctive to a specific document.

The formula for calculating TF-IDF for a term (t) in a document (d) within a collection of documents is as follows:

<img src="https://ptime.s3.ap-northeast-1.amazonaws.com/media/natural_language_processing/text_feature_Engineering/tf-idf-formula.PNG">

In [8]:
tfidf_vecotorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vecotorizer.fit_transform(blog_df['clean_blog_content'])
print(tfidf_matrix.shape)

(10466, 25157)


Hence, thier are 25157 unique words or vectors used to describe total 10467 blogs we have in our dataset

# Using Cosine Similarity for content based filtering

Cosine similarity is a measure used to determine the similarity between two vectors in a multi-dimensional space. It calculates the cosine of the angle between the vectors, which indicates how closely related the vectors are in terms of their orientation and direction.

Here is the formula to calculate cosine similarity,

<img src="https://clay-atlas.com/wp-content/uploads/2020/03/cosine-similarity.png">

It will return how similar two vectors are based on the distance between them. This value ranges from 0 to 1 . Where 0 represents least similar and 1 the most similar content. It is very widely used and efficient method for building content based recommendation systems. Thats the only reason we are using it for building our blog recommendation system.

In [9]:
cosine_sim = cosine_similarity(tfidf_matrix)
print(cosine_sim)

[[1.         0.         0.         ... 0.02173711 0.         0.        ]
 [0.         1.         0.         ... 0.00452585 0.00905365 0.00985712]
 [0.         0.         1.         ... 0.         0.         0.        ]
 ...
 [0.02173711 0.00452585 0.         ... 1.         0.         0.        ]
 [0.         0.00905365 0.         ... 0.         1.         0.03097127]
 [0.         0.00985712 0.         ... 0.         0.03097127 1.        ]]


In [10]:
# Let us have the blogs rated by user with user id 12
user_rating = ratings_df[ratings_df['userId']==12]

# consider blogs with ratings greater than or equal to 3.5 just for simplification
blogs_to_consider = user_rating[user_rating['ratings']>=3.5]['blog_id']

# Now we need Id's of this blogs in form of a list
high_rated_blogs = blogs_to_consider.values

In [11]:
rated_blogs = blog_df[blog_df['blog_id'].isin(high_rated_blogs)]
rated_blogs

Unnamed: 0,blog_id,blog_title,blog_content,topic,clean_blog_content
198,217,Stream Builder in Flutter,How to use StreamBuilder in Flutter? Flutter i...,flutter,use streambuilder flutter flutter popular open...
1301,1328,April 1st Recommendation on Alignment,"I want to make a basic assumption here, becaus...",ai,want make basic assumption date youre truly ra...
1377,1404,How ServiceNow users can use AI and what AI se...,ServiceNow is a cloud computing company that p...,ai,servicenow cloud computing company provides so...
3375,3402,Realtime Object detection using TensorFlow in ...,Real-time object detection using TensorFlow ca...,ai,realtime object detection using tensorflow ach...
3432,3459,2 ChatGPT (Free) Chrome Extensions so Useful T...,"Save hours on writing emails, googling, learni...",ai,save hour writing email googling learning unle...
8696,8723,Testing and Debugging in Flutter,Testing in Flutter: Testing is crucial for any...,flutter,testing flutter testing crucial app developmen...
8719,8746,Exploring Flutter Stream Builder: A Beginner’s...,"Hello there, little friend! Do you want to lea...",flutter,hello little friend want learn something calle...
8720,8747,Getting Started with Augmented Reality Mobile ...,A Comprehensive Guide to Learning and Building...,flutter,comprehensive guide learning building immersiv...
8722,8749,Mastering Bloc with ‘GetCubit’ in Flutter,"When you’re working with Flutter, you might wa...",flutter,youre working flutter might want manage state ...
8730,8757,Flutter State Management: An In-Depth Explorat...,Learn how to efficiently manage state in your ...,flutter,learn efficiently manage state flutter applica...


Let us create a function to recommended blogs based on the how similar blogs are.

In [12]:
def get_similar_blog(high_rated_blogs):
    """
        Args:
            high_rated_blogs : list of blog id's of the blogs rated by the user
        Returns:
            recommended_blogs : list of blog id's of the blogs that are to be recommended
    """

    recommended_blogs = []
    
    for blog_id in high_rated_blogs:
        
        # Find out the index value of particular blog
        temp_id = blog_df[blog_df['blog_id'] == blog_id].index.values[0]
        
        # Find out the index value of all the blogs which have similarity greater than 0.95
        temp_blog_id = blog_df[cosine_sim[temp_id] > 0.95]['blog_id'].index.values
        
        # Check whether the blog is already recommended or not and also verify that it is not seen by user previously
        for b_id in temp_blog_id:
            if b_id not in recommended_blogs and b_id not in high_rated_blogs:
                recommended_blogs.append(b_id)
                
    return recommended_blogs

# Generating Recommendation

In [13]:
recommended_blogs=get_similar_blog(high_rated_blogs)

In [14]:
blog_df.iloc[recommended_blogs]

Unnamed: 0,blog_id,blog_title,blog_content,topic,clean_blog_content
3377,3404,Here are the 3 ideas that you can use for Noti...,Leverage Notion using AI. — 3 ideas for Notio...,ai,leverage notion using ai 3 idea notionaiwwwins...
3434,3461,"Reading Herculaneum Scrolls: $250,000 Challeng...",Scientists have announced a contest with a pri...,ai,scientist announced contest prize quarter mill...
1379,1406,Unbabel — The AI-powered Translation Solution ...,"In today’s global economy, language barriers c...",ai,today global economy language barrier major ob...
1303,1330,"ChatGPT tweets, and it’s painful",OpenAI’s ChaptGPT engine has taken the world b...,ai,openais chaptgpt engine taken world storm beco...
8799,8826,10 Widgets Every Flutter Developer Must Master,A Comprehensive Guide to Flutter Widgets for B...,flutter,comprehensive guide flutter widget building am...
8907,8934,Data Persistence in Flutter,Data persistence is an essential aspect of any...,flutter,data persistence essential aspect mobile appli...
8698,8725,The Ultimate Flutter Navigator 2.0 series usin...,"In the first part, you learned about how you c...",flutter,first part learned set auto_route package flut...
8871,8898,Google Pay: A success story of Flutter,Google Pay is a popular digital wallet and onl...,flutter,google pay popular digital wallet online payme...
8836,8863,Flutter vs React Native: Which One is Better?,"When it comes to mobile app development, there...",flutter,come mobile app development two major player m...
200,219,Building beautiful product item widget in Flut...,Product item widgets are a fundamental aspect ...,flutter,product item widget fundamental aspect ecommer...
