
# TF-IDF: Term Frequency-Inverse Document Frequency



# Understanding TF-IDF: A Quick Overview

TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used term-weighting technique in text analysis and natural language processing. It scores how important a word is to one document relative to a collection of documents (the corpus): a word scores high when it is frequent in that document but rare across the rest of the corpus.

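# Key Components

- Term Frequency (TF): Measures how often a term appears in a document, normalized by document length so that longer documents are not favored.
- Inverse Document Frequency (IDF): Measures how rare a term is across the corpus; terms that appear in many documents receive a lower weight.

The TF-IDF score of a term in a document is the product of the two.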
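
To make the weighting concrete, here is a minimal sketch in plain Python. It assumes the textbook definitions, TF as a term's share of the document's tokens and IDF as `log(N / df)`, where `N` is the number of documents and `df` the number that contain the term. The toy corpus and the helper names are hypothetical, and libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each string is one "document".
docs = [
    "ai systems learn from data",
    "ai systems analyze data",
    "robots use computer vision",
]
tokenized = [doc.split() for doc in docs]


def term_frequency(term, tokens):
    """Share of the document's tokens that are `term`."""
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, all_docs):
    """log(N / df): larger when fewer documents contain `term`."""
    df = sum(1 for tokens in all_docs if term in tokens)
    return math.log(len(all_docs) / df) if df else 0.0


def tf_idf(term, tokens, all_docs):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, all_docs)


# "ai" occurs in two documents, "learn" in only one,
# so "learn" gets the higher weight in the first document.
for term in ("ai", "learn"):
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
```

Both words occur once in the first document, so their TF is identical; the difference in score comes entirely from IDF.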

# Applications


- Information Retrieval: Ranks documents by relevance to a query.
- Text Mining: Extracts significant words from documents.
- Document Clustering: Measures document similarity for tasks like topic modeling.
- Feature Extraction: Converts text to numerical features for machine learning.
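
As an illustration of the information-retrieval use case, the sketch below ranks a few documents against a query by cosine similarity of their TF-IDF vectors. The documents and the query are hypothetical; `TfidfVectorizer` and `cosine_similarity` are scikit-learn functions used with their default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus and query, for illustration only.
documents = [
    "machine learning improves fraud detection in finance",
    "computer vision helps self driving cars see the road",
    "doctors use machine learning to diagnose diseases",
]
query = "fraud detection"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)  # learn vocabulary and IDF from the documents
query_vector = vectorizer.transform([query])       # map the query into the same vector space

# Cosine similarity between the query and every document.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]  # document indices, best match first

for idx in ranking:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```

Only the first document shares terms with the query, so it ranks first and the other two score zero. The same similarity measure can serve as the distance for document clustering.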

In [None]:
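import nltk

# Download the NLTK resources used below
nltk.download('punkt')      # sentence tokenizer models (newer NLTK releases may also need 'punkt_tab')
nltk.download('wordnet')    # dictionary used by WordNetLemmatizer
nltk.download('stopwords')  # English stopword list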

In [None]:
import nltk

# Defining a paragraph of text about Artificial Intelligence (AI)
# This paragraph will be used to demonstrate text preprocessing techniques
paragraph = """Artificial Intelligence (AI) is a transformative technology that mimics human intelligence
to perform tasks such as learning, reasoning, problem-solving, and decision-making. It encompasses various subfields
including machine learning, natural language processing, computer vision, and robotics. AI systems analyze
vast amounts of data to identify patterns, make predictions, and improve their performance over time through iterative
processes. This technology has vast applications across industries, from healthcare, where it aids in diagnosing
diseases and personalizing treatment plans, to finance, where it enhances fraud detection and automates trading.
AI also powers virtual assistants like Siri and Alexa, self-driving cars, and advanced manufacturing processes.
As AI continues to evolve, it promises to revolutionize the way we live and work, offering unprecedented opportunities
for innovation and efficiency while also posing ethical and societal challenges that must be carefully managed."""

# Text cleaning process

import re  # Allows using regular expressions for advanced text processing

from nltk.corpus import stopwords  # Provides a set of common words like "the", "is", "in" to filter out
from nltk.stem import WordNetLemmatizer  # Helps in reducing words to their base or dictionary form
from nltk.stem import PorterStemmer  # Assists in reducing words to their root form



In [None]:
# Create the stemmer object (not used below; the loop lemmatizes instead)
stemmer = PorterStemmer()
# Create the lemmatizer object
lemmatizer = WordNetLemmatizer()
# Use sent_tokenize to split the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)

# Will hold one cleaned string per sentence
corpus = []

In [None]:
import nltk

# Download NLTK stopwords data
nltk.download('stopwords')

In [None]:
# Build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))

# Iterating over each sentence in the 'sentences' list
for i in range(len(sentences)):
    # Replacing non-alphabetical characters with spaces
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])

    # Converting all characters to lowercase
    review = review.lower()

    # Splitting the sentence into individual words
    review = review.split()

    # Lemmatizing each word (reducing it to its base form) if it's not a stopword
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]

    # Joining the lemmatized words back into a sentence
    review = ' '.join(review)

    # Adding the processed sentence to the 'corpus' list
    corpus.append(review)



In [None]:
print(corpus)

In [None]:
# Importing necessary tools from the sklearn library
from sklearn.feature_extraction.text import TfidfVectorizer


# Creating the TF-IDF vectorizer with scikit-learn's default settings
vectorizer = TfidfVectorizer()

# Learning the vocabulary and computing TF-IDF weights for the corpus
# Each cleaned sentence is treated as one document, so the result has one row per sentence
tfidf_matrix = vectorizer.fit_transform(corpus)






In [None]:
print(tfidf_matrix)