**TF-IDF** (Term Frequency-Inverse Document Frequency) is a weighting scheme used in natural language processing (NLP) and machine learning to score how important a word is to a document within a collection of documents. It is built from the following components:

1. **Term Frequency (TF)**:
   - Represents how often a word appears in a document.
   - Calculated as the count of a specific word divided by the total number of words in that document.
   - Intuitively, it measures the relevance of a word within a single document.

2. **Document Frequency (DF)**:
   - Measures how common a word is across the entire corpus (collection of documents).
   - It's the count of documents containing a specific word.
   - A high DF means the word is common across the dataset and therefore less useful for telling documents apart.

3. **Inverse Document Frequency (IDF)**:
   - Evaluates the significance of a word globally.
   - The IDF of a word is the logarithm of the total number of documents divided by the document frequency of that word.
   - It penalizes common words and emphasizes rare ones.

4. **TF-IDF Score**:
   - Multiplies TF by IDF to assign a weight to each word in a document.
   - Words with higher TF-IDF scores are considered more important.
   - Used for tasks like text classification, information retrieval, and document ranking¹².
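The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up three-document corpus, not the exact formula scikit-learn applies: by default `TfidfVectorizer` uses raw counts for TF, a smoothed IDF, and L2 normalization of each row. The corpus and the helper function names are assumptions made for this example.

```python
import math

# Hypothetical toy corpus: each document is a list of lowercase tokens.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]

def term_frequency(term, doc):
    """Count of `term` divided by the total number of words in `doc`."""
    return doc.count(term) / len(doc)

def document_frequency(term, corpus):
    """Number of documents in `corpus` that contain `term`."""
    return sum(1 for doc in corpus if term in doc)

def inverse_document_frequency(term, corpus):
    """log(N / DF); assumes `term` occurs in at least one document."""
    return math.log(len(corpus) / document_frequency(term, corpus))

def tf_idf(term, doc, corpus):
    """Product of TF (within `doc`) and IDF (across `corpus`)."""
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)

# Score three words of the first document.
for term in ["the", "cat", "mat"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In this corpus "the" appears in every document, so its IDF is log(1) = 0 and its TF-IDF score is 0 even though it is the most frequent word in the first document. "mat" appears in only one document and therefore scores higher than "cat", which appears in two.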

In summary, TF-IDF highlights terms that are frequent within a document but rare across the entire dataset, which makes it a useful feature-extraction step for text analysis in machine learning.

In [1]:
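# Core libraries for data handling and similarity computation
import pandas as pd
import numpy as np
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity

# Load the articles dataset (columns: Article, Title) and preview the first rows
data = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/articles.csv", encoding="latin-1")
data.head()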

| | Article | Title |
|---|---|---|
| 0 | Data analysis is the process of inspecting and... | Best Books to Learn Data Analysis |
| 1 | The performance of a machine learning algorith... | Assumptions of Machine Learning Algorithms |
| 2 | You must have seen the news divided into categ... | News Classification with Machine Learning |
| 3 | When there are only two classes in a classific... | Multiclass Classification Algorithms in Machin... |
| 4 | The Multinomial Naive Bayes is one of the vari... | Multinomial Naive Bayes in Machine Learning |


**TfidfVectorizer** is scikit-learn's class for turning text data into TF-IDF features:

1. **Purpose**:
   - Converts a collection of raw text documents into a matrix of **TF-IDF** features.
   - Essentially, it combines the functionality of **CountVectorizer** (which counts word occurrences) and **TfidfTransformer** (which computes TF-IDF scores).

2. **How It Works**:
   - **Term Frequency-Inverse Document Frequency (TF-IDF)**:
     - Measures the importance of a word in a document relative to its frequency across the entire dataset.
     - Combines two components:
       - **Term Frequency (TF)**: How often a word appears in a document.
       - **Inverse Document Frequency (IDF)**: How rare a word is across all documents; words that appear in many documents get a low IDF.
     - The product of TF and IDF gives the TF-IDF score for each word.
   - **TfidfVectorizer** computes these scores for all words in your text data.

3. **Parameters** (some key ones):
   - `input`: Determines whether the input is raw content, filenames, or file-like objects.
   - `stop_words`: Stop words to exclude (e.g., "the," "and," "in"); accepts the string `"english"` for the built-in list or a custom list of words.
   - Other parameters control tokenization and n-grams, such as `analyzer`, `tokenizer`, and `ngram_range`.

4. **Usage**:
   - Commonly used for tasks like text classification, information retrieval, and document ranking¹³.
   - Example: If you have a set of articles and want to find similar ones based on their content, TfidfVectorizer can help!

In short, TfidfVectorizer is a convenient way to extract numeric features from text, and its options are worth exploring when you integrate it into an ML workflow.


In [5]:
# This dataset is ready to use for a recommender system, so let's use
# cosine similarity and write a Python function to recommend articles.
from sklearn.feature_extraction.text import TfidfVectorizer
articles = data["Article"].tolist()

# Create a TF-IDF vectorizer
uni_tfidf = TfidfVectorizer(stop_words="english")
uni_matrix = uni_tfidf.fit_transform(articles)

# Calculate cosine similarity
uni_sim = cosine_similarity(uni_matrix)

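# Define a function to recommend articles.
# x is one row of the similarity matrix; argsort() orders indices from least
# to most similar, so [-5:-1] keeps the four best matches and skips the last
# index, which is normally the article itself (self-similarity is 1.0).
def recommend_articles(x):
    return ", ".join(data["Title"].iloc[x.argsort()[-5:-1]])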

# Add recommended articles to the DataFrame
data["Recommended Articles"] = [recommend_articles(x) for x in uni_sim]

# Display the updated DataFrame
data.head()

| | Article | Title | Recommended Articles |
|---|---|---|---|
| 0 | Data analysis is the process of inspecting and... | Best Books to Learn Data Analysis | Introduction to Recommendation Systems, Best B... |
| 1 | The performance of a machine learning algorith... | Assumptions of Machine Learning Algorithms | Applications of Deep Learning, Best Books to L... |
| 2 | You must have seen the news divided into categ... | News Classification with Machine Learning | Language Detection with Machine Learning, Appl... |
| 3 | When there are only two classes in a classific... | Multiclass Classification Algorithms in Machin... | Assumptions of Machine Learning Algorithms, Be... |
| 4 | The Multinomial Naive Bayes is one of the vari... | Multinomial Naive Bayes in Machine Learning | Assumptions of Machine Learning Algorithms, Me... |


The code works as follows, step by step:

1. **Importing Libraries**:
   - We start by importing the necessary libraries:
     - `TfidfVectorizer` from `sklearn.feature_extraction.text`: This class converts a collection of text documents into a matrix of TF-IDF features.
     - `cosine_similarity` from `sklearn.metrics.pairwise`: This function calculates the cosine similarity between vectors.

2. **Creating the TF-IDF Matrix**:
   - We have a dataset with articles, and we want to create a recommender system.
   - The `TfidfVectorizer` is initialized with the following parameters:
     - `stop_words="english"`: Removes common English stop words (e.g., "the," "and," "in").
   - We convert the list of articles (`data["Article"].tolist()`) into a matrix of TF-IDF features (`uni_matrix`).

3. **Calculating Cosine Similarity**:
   - The `cosine_similarity` function computes the cosine similarity between vectors.
   - In our case, it calculates the similarity between articles based on their TF-IDF representations.
   - The resulting `uni_sim` matrix contains pairwise cosine similarity scores.

4. **Defining the Recommendation Function**:
   - We create a function called `recommend_articles(x)`:
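     - It takes a similarity vector `x` (one row of `uni_sim`: the similarity of an article with every article in the dataset, including itself).
     - `x.argsort()` returns the article indices ordered from least to most similar.
     - The slice `[-5:-1]` keeps the 4 most similar articles and drops the last index, which is normally the article itself because its self-similarity is 1.
     - The matching titles are joined into a comma-separated string, so the closest match appears last.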
   - This function will be used to recommend articles.

5. **Adding Recommendations to the DataFrame**:
   - We apply the `recommend_articles` function to each row of the similarity matrix.
   - The resulting recommendations are stored in a new column called "Recommended Articles" in the DataFrame.

6. **Displaying the Updated DataFrame**:
   - Finally, we show the first few rows of the DataFrame with the recommended articles included.
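Steps 3 and 4 can be reproduced on a tiny example with NumPy alone. The sketch below uses made-up vectors and placeholder titles (both are assumptions for illustration), computes cosine similarity as the dot product of L2-normalized rows, and applies the same `argsort` slicing idea with a hypothetical helper `top_k_similar`.

```python
import numpy as np

# Hypothetical TF-IDF-like vectors for four documents (rows) over three terms.
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.4, 0.0],
])
titles = ["A", "B", "C", "D"]  # placeholder titles

# Cosine similarity: dot products of the L2-normalized rows.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T

def top_k_similar(row, k):
    """Indices of the k most similar items, least similar first.

    Assumes the item itself has the highest similarity in `row`,
    so it sorts last and is dropped by the slice.
    """
    return row.argsort()[-(k + 1):-1]

for i, row in enumerate(sim):
    picks = [titles[j] for j in top_k_similar(row, k=2)]
    print(titles[i], "->", picks)
```

For document "A" this selects "D" and then "B", with the closest match last. Reversing the slice with `[::-1]` would list the closest match first. If the dataset contains exact duplicates, the item itself is not guaranteed to sort last, so a more defensive version would remove the item's own index explicitly.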

This code can be adapted to other datasets and use cases.

In [8]:
print(data["Recommended Articles"][20])

DBSCAN Clustering in Machine Learning, BIRCH Clustering in Machine Learning, K-Means Clustering in Machine Learning, Agglomerative Clustering in Machine Learning
