In [None]:
URK21CS1004                                                          18/10/2023

In [None]:
AIM: To design a content-based recommender system using scikit-learn.

In [None]:
DESCRIPTION:
    TfidfVectorizer: 
        TfidfVectorizer is a scikit-learn feature extraction class that transforms a collection of raw text documents into a sparse matrix of numerical features, with one row per document and one column per term. Each entry is the Term Frequency-Inverse Document Frequency (TF-IDF) weight of that term in that document.
            TF-IDF quantifies the importance of a word in a document by combining how often the word occurs in that document with how rare it is across the corpus. TfidfVectorizer tokenizes the text, builds the vocabulary, computes these weights, and by default scales each row to unit L2 norm, giving a numerical representation of the documents for further analysis.
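        As a minimal sketch of what the vectorizer produces, the snippet below fits it on three made-up product names (the `docs` list is an assumption for illustration and is not taken from shop_details.csv) and inspects the shape of the matrix and the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical product names, used only for illustration.
docs = ["active sport briefs", "active boxer briefs", "cotton cap"]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each learned term to its column; sorting the keys
# lists the terms in column order.
vocabulary = sorted(vectorizer.vocabulary_)

print(matrix.shape)  # (number of documents, number of distinct terms)
print(vocabulary)
```

        Each row of `matrix` is the TF-IDF vector of one product name; these rows are what the similarity step compares.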
    
    linear_kernel:
        linear_kernel is a function provided by scikit-learn that computes the dot product between every pair of rows in two arrays. It does not normalize its inputs. When the rows already have unit L2 norm, as the rows produced by TfidfVectorizer do by default, the dot product equals the cosine similarity, so linear_kernel is commonly used in place of cosine_similarity on TF-IDF matrices.
            In content-based recommendation systems, cosine similarity is a common metric for comparing items. It measures the cosine of the angle between two vectors; for TF-IDF vectors, whose entries are non-negative, it ranges from 0 (no shared terms) to 1 (same direction). High values indicate that items have similar content, which makes them suitable for recommendation.
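        The relationship between the dot product and cosine similarity can be checked directly. The sketch below uses made-up product names (an assumption, not the lab dataset) and compares linear_kernel with cosine_similarity, first on L2-normalized TF-IDF rows and then on unnormalized ones.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical product names, used only for illustration.
docs = ["active sport briefs", "active boxer briefs", "cotton cap"]

# norm="l2" is the default; it is spelled out because the equivalence depends on it.
tfidf = TfidfVectorizer(norm="l2").fit_transform(docs)
same_when_normalized = np.allclose(
    linear_kernel(tfidf, tfidf), cosine_similarity(tfidf, tfidf)
)

# Without row normalization, the plain dot product is no longer a cosine.
raw = TfidfVectorizer(norm=None).fit_transform(docs)
same_when_raw = np.allclose(
    linear_kernel(raw, raw), cosine_similarity(raw, raw)
)

print(same_when_normalized, same_when_raw)
```

        If the vectorizer were configured with norm=None, cosine_similarity would be the safer choice, since it normalizes the rows itself.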
    
    Content-Based Recommendation:
        Content-based recommendation suggests items to a user based on the attributes of items the user has previously shown interest in. It works from the content or features of the items themselves and personalizes recommendations by finding items with similar attributes.
            Content-based recommendation systems analyze the content, descriptions, and attributes of items to find those that closely match items a user has expressed interest in. The approach needs only item descriptions and at least one known item of interest; it does not rely on other users' ratings.
    
    Term Frequency-Inverse Document Frequency (TF-IDF):
        Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic used in text mining and information retrieval to weigh the importance of a term in a document relative to a collection of documents (the corpus). It is computed for every term in every document and is used for feature extraction in text analysis.
            TF-IDF combines two components: Term Frequency (TF) measures how often a term appears in a document, and Inverse Document Frequency (IDF) measures how rare the term is across the corpus. Their product is high for terms that are frequent in one document but uncommon elsewhere, and low for terms that appear in most documents, so it gives a numerical representation of text in which distinctive terms carry the most weight.
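        As a worked sketch, the weights can be recomputed by hand and compared with scikit-learn. This assumes the TfidfVectorizer defaults: raw counts for TF, the smoothed IDF ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t, followed by L2 normalization of each row. The `docs` list is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical product names, used only for illustration.
docs = ["active sport briefs", "active boxer briefs", "cotton cap"]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
sk_matrix = vectorizer.fit_transform(docs).toarray()
terms = sorted(vectorizer.vocabulary_)  # terms in column order

# TF: raw count of each term in each document.
tf = np.array([[doc.split().count(t) for t in terms] for doc in docs], dtype=float)

# IDF with smoothing, as described in the lead-in.
n_docs = len(docs)
doc_freq = (tf > 0).sum(axis=0)  # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF, then scale every row to unit L2 norm.
manual = tf * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

matches = np.allclose(manual, sk_matrix)
print(matches)
```

        Splitting on whitespace stands in for the vectorizer's tokenizer here only because the sample strings are lowercase words of two or more letters; real product names may tokenize differently.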
    
    Cosine Similarity:
        Cosine similarity is a metric that measures how closely two non-zero vectors point in the same direction. It is the cosine of the angle between the vectors and is often used in content-based recommendation systems to compare items.
            Because it depends only on direction and not on vector length, cosine similarity is unaffected by how long an item's description is. In content-based recommendation it is applied to vectors representing the content or characteristics of items. High values suggest that items are more alike, making them suitable candidates for recommendation to users who have shown interest in similar items.
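        The definition, cos(theta) = (a . b) / (||a|| ||b||), can be written out in a few lines. The helper name `cosine_similarity_1d` and the sample vectors below are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two non-zero 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parallel vectors: same direction, different length.
same_direction = cosine_similarity_1d([1, 2, 0], [2, 4, 0])

# Vectors with no overlapping non-zero components.
orthogonal = cosine_similarity_1d([1, 0, 0], [0, 3, 0])

print(same_direction, orthogonal)
```

        The first pair scores 1 even though one vector is twice as long as the other, and the second pair scores 0 because the vectors share no component.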

In [None]:
PROGRAM:

In [None]:
2. Develop an e-commerce item recommender system with content-based recommendation using scikit-learn.

In [10]:
print("BISWAYAN MEHRA \nURK21CS1004")
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
df = pd.read_csv('shop_details.csv')

BISWAYAN MEHRA 
URK21CS1004


In [None]:
a. Use the column: 'product'. 

In [11]:
print("BISWAYAN MEHRA \nURK21CS1004")
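# Drop rows with no product name and renumber, so index labels match matrix row positions.
df = df.dropna(subset=['product']).reset_index(drop=True)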

BISWAYAN MEHRA 
URK21CS1004


In [None]:
b. Remove the leading and trailing whitespaces in that column. 

In [12]:
print("BISWAYAN MEHRA \nURK21CS1004")
df['product'] = df['product'].str.strip()

BISWAYAN MEHRA 
URK21CS1004


In [None]:
c. Perform feature extraction using Term Frequency Inverse Document Frequency (TF-IDF). 

In [13]:
print("BISWAYAN MEHRA \nURK21CS1004")
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df['product'])

BISWAYAN MEHRA 
URK21CS1004


In [None]:
d. Compute the cosine similarity. 

In [14]:
print("BISWAYAN MEHRA \nURK21CS1004")
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

BISWAYAN MEHRA 
URK21CS1004


In [None]:
e. Display the top 'n' suggestions with the similarity score for the given user input. 

In [15]:
print("BISWAYAN MEHRA \nURK21CS1004")
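def get_recommendations(product_name, num_recommendations=5):
    # Find the row position (not the index label) of the query product.
    positions = (df['product'] == product_name).to_numpy().nonzero()[0]
    if len(positions) == 0:
        raise ValueError(f"Product not found: {product_name!r}")
    idx = positions[0]
    # Pair each row position with its similarity to the query, excluding the query row itself.
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    # Keep the highest-scoring rows.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:num_recommendations]
    product_indices = [i for i, _ in sim_scores]
    recommendations = df['product'].iloc[product_indices].tolist()
    similarity_scores = [s for _, s in sim_scores]
    return recommendations, similarity_scores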

BISWAYAN MEHRA 
URK21CS1004


In [None]:
Input: product_name, number of recommendations

In [17]:
print("BISWAYAN MEHRA \nURK21CS1004")
user_input = "Active sport boxer briefs"
num_recommendations = 5
recommendations, similarity_scores = get_recommendations(user_input, num_recommendations)
for rank, (product, score) in enumerate(zip(recommendations, similarity_scores), start=1):
    print(f"Recommendation {rank}: {product} (Similarity Score: {score:.2f})")

BISWAYAN MEHRA 
URK21CS1004
Recommendation 1: Active sport briefs (Similarity Score: 0.85)
Recommendation 2: Active boxer briefs (Similarity Score: 0.85)
Recommendation 3: Active briefs (Similarity Score: 0.65)
Recommendation 4: Active briefs (Similarity Score: 0.65)
Recommendation 5: Cap 1 boxer briefs (Similarity Score: 0.65)


In [None]:
RESULT:
    Thus, the program to design the Content-based Recommender System using scikit-learn was executed and verified successfully.