# NLP

Natural Language Processing (NLP) is a field that combines computer science, artificial intelligence and language studies. It helps computers understand, process and create human language in a way that makes sense and is useful. With the growing amount of text data from social media, websites and other sources, NLP is becoming a key tool to gain insights and automate tasks like analyzing text or translating languages.

## NLP Techniques

### 1. Text Processing and Preprocessing 
a) Tokenization: Dividing text into smaller units, such as words or sentences. \
b) Stemming and Lemmatization: Reducing words to their base or root forms. \
c) Stopword Removal: Removing common words (like "and", "the", "is") that may not carry significant meaning. \
d) Text Normalization: Standardizing text, including case normalization, removing punctuation and correcting spelling errors.

## Tokenization in NLP

Tokenization is a fundamental step in Natural Language Processing (NLP). It divides a textual input into smaller units known as tokens, which can be words, characters, sub-words, or sentences. Splitting text into consistent units makes it easier for models to process and interpret. Let's look at how tokenization works.

![image.png](attachment:image.png)

### Types of Tokenization

#### 1. Word Tokenization
Word tokenization is the most commonly used method where text is divided into individual words. It works well for languages with clear word boundaries, like English. For example, "Machine learning is fascinating" becomes:

Input before tokenization: ["Machine learning is fascinating"]

Output when tokenized by words: ["Machine", "learning", "is", "fascinating"]
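As a rough illustration, word-level splitting can be sketched with the standard-library `re` module. The function name `simple_word_tokenize` and the pattern (runs of word characters, or single punctuation marks) are assumptions chosen for this example; they are not the rules used by NLTK or spaCy.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits or underscores (a "word");
    # [^\w\s] matches one punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Machine learning is fascinating")
print(tokens)
```

A pattern like this works for space-delimited languages such as English, but it splits contractions and hyphenated words into pieces, which is one reason dedicated tokenizers exist.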

#### 2. Character Tokenization
In character tokenization, the text is split into a sequence of individual characters. This is useful for tasks that need fine-grained analysis, such as spelling correction, and for text where word boundaries are unclear. It can also be used for character-level language modelling.

Example

Input before tokenization: ["You are helpful"]

Output when tokenized by characters: ["Y", "o", "u", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"]
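In plain Python, character tokenization needs no library at all. The sketch below also builds a small character-to-id mapping, which is one common way such tokens are fed to a model; the variable names are illustrative only.

```python
text = "You are helpful"

# list() iterates over the string one character at a time,
# so spaces become tokens as well.
char_tokens = list(text)
print(char_tokens)

# A character-level vocabulary maps each distinct character to an integer id.
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]
print(char_ids)
```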

#### 3. Sub-word Tokenization
This strikes a balance between word and character tokenization by breaking down text into units that are larger than a single character but smaller than a full word. This is useful when dealing with morphologically rich languages or rare words.

Example

"Timetable" becomes ["Time", "table"]\
"Raincoat" becomes ["Rain", "coat"]\
"Gracefully" becomes ["Grace", "fully"]\
"Runway" becomes ["Run", "way"]

Sub-word tokenization helps to handle out-of-vocabulary words in NLP tasks and for languages that form words by combining smaller units.
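One simple way to picture sub-word splitting is greedy longest-match segmentation against a fixed vocabulary. The sketch below assumes a tiny hand-written vocabulary (`toy_vocab`) and a hypothetical helper `greedy_subword_tokenize`; real sub-word tokenizers learn their vocabulary from a corpus and usually mark continuation pieces, which is omitted here.

```python
def greedy_subword_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: mark the whole word as unknown.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"time", "table", "rain", "coat", "grace", "fully", "run", "way"}

for w in ["timetable", "raincoat", "gracefully", "runway", "zebra"]:
    print(w, "->", greedy_subword_tokenize(w, toy_vocab))
```

A word that cannot be covered by the vocabulary ("zebra" here) falls back to an unknown marker; in practice vocabularies include single characters so that this rarely happens.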

#### 4. Sentence Tokenization
Sentence tokenization splits a paragraph or a longer passage into individual sentences, each of which becomes a token. This is useful for tasks that analyze or process one sentence at a time.

Input before tokenization: ["Artificial Intelligence is an emerging technology. Machine learning is fascinating. Computer Vision handles images. "]

Output when tokenized by sentences: ["Artificial Intelligence is an emerging technology.", "Machine learning is fascinating.", "Computer Vision handles images."]
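A naive version of sentence splitting can be written with a regular expression. The rule used here (split after `.`, `!` or `?` when whitespace follows) is an assumption made for this sketch; it breaks on abbreviations such as "e.g." or "Dr.", which is why trained sentence tokenizers are normally preferred.

```python
import re

def simple_sent_tokenize(text):
    # Split at whitespace that directly follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

paragraph = ("Artificial Intelligence is an emerging technology. "
             "Machine learning is fascinating. Computer Vision handles images. ")
print(simple_sent_tokenize(paragraph))
```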

#### 5. N-gram Tokenization
N-gram tokenization splits text into overlapping sequences of n consecutive tokens (for example, n = 2 produces bigrams).

Input before tokenization: ["Machine learning is powerful"]

Output when tokenized by bigrams: [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
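The sliding-window idea behind n-grams fits in a few lines of plain Python. The helper name `ngrams` is chosen for this sketch, and the input is split on whitespace for simplicity.

```python
def ngrams(tokens, n):
    # Slide a window of width n across the token list, one step at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Machine learning is powerful".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

When n is larger than the number of tokens, the function returns an empty list.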

### Need for Tokenization
Tokenization is an essential step in text processing and natural language processing (NLP) for several reasons:

Effective Text Processing: Breaks raw text into manageable units, which makes statistical and computational analysis easier and more efficient.\
Feature Extraction: Tokens serve as features in ML models, so text can be represented numerically for algorithms to work with.\
Information Retrieval: Tokenization is essential for indexing and searching in systems that store and retrieve information efficiently based on words or phrases.\
Text Analysis: Supports tasks such as sentiment analysis and named entity recognition, which depend on the function and context of individual words in a sentence.\
Vocabulary Management: Generates a list of distinct tokens, which helps manage a corpus's vocabulary.\
Task-Specific Adaptation: The tokenization scheme can be adapted to the needs of a particular NLP task, such as summarization or machine translation.
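The feature-extraction and vocabulary-management points can be made concrete with a small bag-of-words sketch. The two documents below are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

docs = [
    "machine learning is fascinating",
    "machine translation is useful",
]

# Tokenize each document (whitespace split, for simplicity).
tokenized = [d.split() for d in docs]

# Vocabulary management: the sorted set of distinct tokens in the corpus.
vocab = sorted({tok for doc in tokenized for tok in doc})

# Feature extraction: one count per vocabulary term for every document.
vectors = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a fixed-length numeric vector, which is the form most classical ML models expect.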

## Coding

In [1]:
!pip install nltk






In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Sentence Tokenization

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

# Tokenize the text into sentences
text = "Hello everyone. Welcome to the workshop. You are studying NLP article."
sentences = sent_tokenize(text)

# Output
print(sentences)


['Hello everyone.', 'Welcome to the workshop.', 'You are studying NLP article.']


In [12]:
import nltk.data

# Loading PunktSentenceTokenizer using English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
tokenizer.tokenize(text)

['Hello everyone.',
 'Welcome to the workshop.',
 'You are studying NLP article.']

In [13]:
import nltk.data

spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)

['Hola amigo.', 'Estoy bien.']

### Word Tokenization

In [15]:
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to the workshop."
word_tokenize(text)

['Hello', 'everyone', '.', 'Welcome', 'to', 'the', 'workshop', '.']

In [16]:
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']

In [17]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
text = "Let's see how it's working."
tokenizer.tokenize(text)

['Let', 's', 'see', 'how', 'it', 's', 'working']

In [19]:
! pip install -U pip setuptools wheel



In [20]:
! pip install -U spacy



In [22]:
! python -m spacy download en_core_web_sm



In [23]:
import spacy

# Creating blank language object then
# tokenizing words of the sentence
nlp = spacy.blank("en")

doc = nlp("GeeksforGeeks is a one stop "
          "learning destination for geeks.")

for token in doc:
    print(token)

GeeksforGeeks
is
a
one
stop
learning
destination
for
geeks
.


In [24]:
import spacy

# loading modules to the pipeline.
nlp = spacy.load("en_core_web_sm")

# Initialising doc with a sentence.
doc = nlp("If you want to be an excellent programmer \
, be consistent to practice daily on GFG.")

# Using properties of token i.e. Part of Speech and Lemmatization
for token in doc:
    print(token, " | ",
          spacy.explain(token.pos_),
          " | ", token.lemma_)

If  |  subordinating conjunction  |  if
you  |  pronoun  |  you
want  |  verb  |  want
to  |  particle  |  to
be  |  auxiliary  |  be
an  |  determiner  |  an
excellent  |  adjective  |  excellent
programmer  |  noun  |  programmer
,  |  punctuation  |  ,
be  |  auxiliary  |  be
consistent  |  adjective  |  consistent
to  |  particle  |  to
practice  |  verb  |  practice
daily  |  adverb  |  daily
on  |  adposition  |  on
GFG  |  proper noun  |  GFG
.  |  punctuation  |  .


## Lemmatization vs. Stemming

### What is Lemmatization? 
Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma. This technique considers the context and the meaning of the words, ensuring that the base form belongs to the language's dictionary. For example, the words "running," "ran," and "runs" are all lemmatized to the lemma "run."

### How Lemmatization Works
Lemmatization involves several steps:

Part-of-Speech (POS) Tagging: Identifying the grammatical category of each word (e.g., noun, verb, adjective).\
Morphological Analysis: Analyzing the structure of the word to understand its root form.\
Dictionary Lookup: Using a predefined vocabulary to find the lemma of the word.

For example, the word "better" would be lemmatized to "good" if it is identified as an adjective, whereas "running" would be lemmatized to "run" if identified as a verb.
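The role of the POS tag in the lookup step can be sketched with a tiny hand-written lexicon. Everything here (`TOY_LEMMAS`, `toy_lemmatize`, the tag names) is hypothetical and only illustrates the idea; real lemmatizers rely on large dictionaries and morphological rules.

```python
# Hypothetical toy lexicon: (surface form, POS tag) -> lemma.
TOY_LEMMAS = {
    ("better", "ADJ"): "good",
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("feet", "NOUN"): "foot",
}

def toy_lemmatize(word, pos):
    # Dictionary lookup keyed on both the word and its POS tag;
    # unknown pairs fall back to the lower-cased word itself.
    key = (word.lower(), pos)
    return TOY_LEMMAS.get(key, word.lower())

print(toy_lemmatize("better", "ADJ"))    # looked up as an adjective
print(toy_lemmatize("running", "VERB"))  # looked up as a verb
print(toy_lemmatize("running", "NOUN"))  # not in the lexicon as a noun
```

The same surface form can map to different lemmas depending on its tag, which is why POS tagging comes first.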

### Techniques in Lemmatization
Rule-Based Lemmatization: Uses predefined grammatical rules to transform words. For instance, removing the "-ed" suffix from regular past tense verbs.\
Dictionary-Based Lemmatization: Looks up words in a dictionary to find their base forms.\
Machine Learning-Based Lemmatization: Employs machine learning models trained on annotated corpora to predict the lemma of a word.

### What is Stemming?
Stemming is a more straightforward process that cuts off prefixes and suffixes (i.e., affixes) to reduce a word to its root form. This root form, known as the stem, may not be a valid word in the language. For example, the words "running," "runner," and "runs" might all be stemmed to "run" or "runn," depending on the stemming algorithm used.

### How Stemming Works
Stemming algorithms apply a series of rules to strip affixes from words. The most common stemming algorithms include:

Porter Stemmer: Uses a set of heuristic rules to iteratively remove suffixes.\
Snowball Stemmer: An extension of the Porter Stemmer with more robust rules.\
Lancaster Stemmer: A more aggressive stemmer that can sometimes over-stem words.

For example, the words "running", "runner", and "runs" might all be reduced to "run" by a stemming algorithm, but the result is not always a valid word: "arguing" might be reduced to "argu".
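To show what rule-based suffix stripping looks like, here is a deliberately tiny stemmer. It is not the Porter, Snowball or Lancaster algorithm; the three suffixes, the minimum stem length and the function name `toy_stem` are assumptions made for this sketch.

```python
def toy_stem(word):
    word = word.lower()
    for suffix in ("ing", "er", "s"):
        # Strip the suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling left behind by the suffix ("runn" -> "run").
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "runner", "runs", "arguing", "sing"]:
    print(w, "->", toy_stem(w))
```

Because the rules know nothing about the dictionary, "arguing" ends up as the non-word "argu", which is exactly the behaviour described above.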

### Practical Implementation: Lemmatization with NLTK

In [2]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary NLTK data
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Function to get POS tag
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()
word = "running"
lemma = lemmatizer.lemmatize(word, get_wordnet_pos(word))
print(f"Lemmatized word: {lemma}")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Lemmatized word: run


### Stemming with NLTK in Python

In [3]:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
word = "running"
stem = stemmer.stem(word)
print(f"Stemmed word: {stem}")

Stemmed word: run


### Tokenization, Stemming and Lemmatization Together

In [5]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "The striped bats are hanging on their feet for best"

# Tokenize the text
words = nltk.word_tokenize(text)

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

# Function to get the part of speech tag for lemmatization
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Apply lemmatization
lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]

# Print results
print("Original Text: ", text)
print("Tokenized Words: ", words)
print("Stemmed Words: ", stemmed_words)
print("Lemmatized Words: ", lemmatized_words)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Original Text:  The striped bats are hanging on their feet for best
Tokenized Words:  ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
Stemmed Words:  ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best']
Lemmatized Words:  ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']


## Removing stop words with NLTK in Python

A stop word is a commonly used word (such as "the", "a", "an", or "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. 

### Types of Stopwords
Stopwords are words that occur very often in a language and are usually omitted from natural language processing (NLP) tasks because they contribute little to the meaning of a text. The exact list of stopwords depends on the language and the context. Broad categories of stopwords include:

1) Common Stopwords: These are the most frequently occurring words in a language and are often removed during text preprocessing. Examples include "the," "is," "in," "for," "where," "when," "to," "at," etc.\
2) Custom Stopwords: Depending on the specific task or domain, additional words may be considered as stopwords. These could be domain-specific terms that don't contribute much to the overall meaning. For example, in a medical context, words like "patient" or "treatment" might be considered as custom stopwords.\
3) Numerical Stopwords: Numbers and numeric characters may be treated as stopwords in certain cases, especially when the analysis is focused on the meaning of the text rather than specific numerical values.\
4) Single-Character Stopwords: Single characters, such as "a," "I," "s," or "x," may be considered stopwords, particularly in cases where they don't convey much meaning on their own.\
5) Contextual Stopwords: Words that are stopwords in one context but meaningful in another may be considered as contextual stopwords. For instance, the word "will" might be a stopword in the context of general language processing but could be important in predicting future events.
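The categories above can be combined in a single filter. The sketch below is plain Python with a very small, hand-picked `COMMON` set standing in for a real stopword list; the function name and its parameters are illustrative assumptions.

```python
# Tiny hypothetical subset of common English stopwords, for illustration only.
COMMON = {"the", "is", "in", "for", "to", "at", "a", "was", "were"}

def remove_stopwords(tokens, custom=frozenset(),
                     drop_numbers=False, drop_single_chars=False):
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in COMMON or low in custom:
            continue  # common or domain-specific (custom) stopword
        if drop_numbers and low.isdigit():
            continue  # numerical stopword
        if drop_single_chars and len(low) == 1 and low.isalpha():
            continue  # single-character stopword
        kept.append(tok)
    return kept

tokens = "In 2023 the 1200 patients were treated in batch 5".split()

print(remove_stopwords(tokens))
print(remove_stopwords(tokens, custom={"patients"}, drop_numbers=True))
```

Keeping each category behind its own switch makes it easy to decide per task which kinds of tokens count as noise.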

### Removing stop words with NLTK

In [6]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = """This is a sample sentence,
                  showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)
# Case-insensitive filter: lower-case each token before the lookup,
# so "This" would also be removed.
filtered_lowercased = [w for w in word_tokens if w.lower() not in stop_words]

# Case-sensitive filter: "This" survives because the stop-word list is lower-case.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

print(word_tokens)
print(filtered_sentence)

['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


### Removing stop words with spaCy

In [7]:
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "There is a pen on the table"

# Process the text using spaCy
doc = nlp(text)

# Remove stopwords
filtered_words = [token.text for token in doc if not token.is_stop]

# Join the filtered words to form a clean text
clean_text = ' '.join(filtered_words)

print("Original Text:", text)
print("Text after Stopword Removal:", clean_text)

Original Text: There is a pen on the table
Text after Stopword Removal: pen table


Common stopwords can also be removed with the Gensim and scikit-learn packages.

### Custom stopwords

In [8]:
text = "The patient was taken for treatment after being admitted to the hospital."
tokens = word_tokenize(text.lower())

custom_stopwords = {"patient", "treatment", "admitted", "hospital"}
filtered_custom = [word for word in tokens if word not in custom_stopwords and word.isalnum()]

print("Custom stopwords removed:")
print(filtered_custom)


Custom stopwords removed:
['the', 'was', 'taken', 'for', 'after', 'being', 'to', 'the']


### Numerical stopwords

In [9]:
text = "In 2023, 1200 patients were treated in batch 5."
tokens = word_tokenize(text.lower())

numerical_stopwords = {token for token in tokens if token.isdigit()}
filtered_numerical = [word for word in tokens if word not in numerical_stopwords and word.isalnum()]

print("Numerical stopwords removed:")
print(filtered_numerical)


Numerical stopwords removed:
['in', 'patients', 'were', 'treated', 'in', 'batch']


### Single-character stopwords

In [12]:
text = "x is a variable, and I am learning NLP."
tokens = word_tokenize(text.lower())

single_char_stopwords = {token for token in tokens if len(token) == 1 and token.isalpha()}
filtered_single_char = [word for word in tokens if word not in single_char_stopwords and word.isalnum()]

print("Single-character stopwords removed:")
print(filtered_single_char)


Single-character stopwords removed:
['is', 'variable', 'and', 'am', 'learning', 'nlp']


### Contextual stopwords

In [13]:
text = "Will you go to the event? The will was signed yesterday."
tokens = word_tokenize(text.lower())

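# In this context, we decide to filter out the future-intent word "will".
# Note: this set-based filter removes every occurrence of "will",
# including the noun in "The will was signed".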
contextual_stopwords = {"will"}
filtered_contextual = [word for word in tokens if word not in contextual_stopwords and word.isalnum()]

print("Contextual stopwords removed (context-dependent):")
print(filtered_contextual)


Contextual stopwords removed (context-dependent):
['you', 'go', 'to', 'the', 'event', 'the', 'was', 'signed', 'yesterday']


## Normalizing Textual Data with Python

### Steps Required

Here are the basic steps needed for text normalization:

1) Take the input text string.
2) Convert all letters of the string to one case (either lower or upper case).
3) Convert numbers to words if they are essential; otherwise remove them.
4) Remove punctuation and other grammatical markers.
5) Remove extra white space.
6) Remove stop words and apply any other required processing.

In [1]:
# download stopwords
import nltk
nltk.download('stopwords')

# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

# assign string
no_wspace_string='python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows'

# convert string to list of words (split() also collapses repeated spaces)
lst_string = no_wspace_string.split()
print(lst_string)

# remove stopwords and join the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)
print(no_stpwords_string)

{'had', 'or', 'doing', 'shouldn', 'why', 'both', 'she', 'me', "we'd", 'where', "we've", "you've", "shan't", 'about', 'which', 'again', 'yours', "shouldn't", 'between', 'will', 'a', 'hasn', 'is', 'aren', 'couldn', "doesn't", "it'll", 'has', "they're", 'don', 'after', 'd', "haven't", "mustn't", 'further', 'nor', 'do', 'not', 'at', 'having', 'theirs', 'in', 'hadn', 'from', 'our', "won't", 'now', 'above', 'needn', 'be', 'does', 'each', 'shan', 'them', "couldn't", 'ours', 'ourselves', 's', "weren't", 'any', 'been', "it'd", "she'll", 'am', 'm', 'very', "it's", "they'd", 'the', 'here', 'themselves', 'under', 'did', "needn't", "mightn't", 'wouldn', 'her', 'over', "aren't", "they've", "you'll", 'have', "she'd", "we'll", 'with', "they'll", 'to', 'can', 'those', 'ma', 'whom', 'against', 'itself', 'of', 'these', 'because', 'yourself', 'few', 'their', 'wasn', "hadn't", 'so', "we're", 'out', "you're", "she's", "wouldn't", 'before', 'herself', 'himself', 'than', 'won', 'while', 'other', 'who', 'on', 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# import regex
import re

# download stopwords
import nltk
nltk.download('stopwords')

# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


# input string 
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove punctuation (keep only word characters and white space)
no_punc_string = re.sub(r'[^\w\s]','', no_number_string) 

# remove leading and trailing white space
no_wspace_string = no_punc_string.strip()

# convert string to list of words (split() also collapses repeated spaces)
lst_string = no_wspace_string.split()
print(lst_string)

# remove stopwords and join the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)

# output
print(no_stpwords_string)

['python', 'released', 'in', 'was', 'a', 'major', 'revision', 'of', 'the', 'language', 'that', 'is', 'not', 'completely', 'backward', 'compatible', 'and', 'much', 'python', 'code', 'does', 'not', 'run', 'unmodified', 'on', 'python', 'with', 'python', 's', 'endoflife', 'only', 'python', 'x', 'and', 'later', 'are', 'supported', 'with', 'older', 'versions', 'still', 'supporting', 'eg', 'windows', 'and', 'old', 'installers', 'not', 'restricted', 'to', 'bit', 'windows']
python released major revision language completely backward compatible much python code run unmodified python python endoflife python x later supported older versions still supporting eg windows old installers restricted bit windows


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 2. Syntax and Parsing
a) Part-of-Speech (POS) Tagging: Assigning parts of speech to each word in a sentence (e.g., noun, verb, adjective).\
b) Dependency Parsing: Analyzing the grammatical structure of a sentence to identify relationships between words.\
c) Constituency Parsing: Breaking down a sentence into its constituent parts or phrases (e.g., noun phrases, verb phrases).

## POS (Part-of-Speech) Tagging in NLP
One of the core tasks in Natural Language Processing (NLP) is part-of-speech (POS) tagging: assigning each word in a text a grammatical category such as noun, verb, adjective, or adverb. Knowing these categories gives machines a better grasp of phrase structure and semantics, so they can analyze human language more accurately.

For example, after POS tagging the sentence "The quick brown fox jumps over the lazy dog":

"The" is tagged as determiner (DT)\
"quick" is tagged as adjective (JJ)\
"brown" is tagged as adjective (JJ)\
"fox" is tagged as noun (NN)\
"jumps" is tagged as verb (VBZ)\
"over" is tagged as preposition (IN)\
"the" is tagged as determiner (DT)\
"lazy" is tagged as adjective (JJ)\
"dog" is tagged as noun (NN)

In [7]:
# Importing the NLTK library
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download necessary resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text
text = "NLTK is a powerful library for natural language processing."

# Step 1: Tokenize the text into words
words = word_tokenize(text)

# Step 2: Perform Part-of-Speech tagging
pos_tags = pos_tag(words)

# Step 3: Display results
print("Original Text:")
print(text)

print("\nPoS Tagging Result:")
for word, tag in pos_tags:
    print(f"{word}: {tag}")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\SV937CY\AppData\Roaming\nltk_data...


Original Text:
NLTK is a powerful library for natural language processing.

PoS Tagging Result:
NLTK: NNP
is: VBZ
a: DT
powerful: JJ
library: NN
for: IN
natural: JJ
language: NN
processing: NN
.: .


[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Named Entity Recognition
Named Entity Recognition (NER) in NLP focuses on identifying and categorizing important pieces of information, known as entities, in text. These entities can be names of people, places, organizations, dates, etc. NER turns unstructured text into structured information, which supports tasks such as text summarization, knowledge graph creation and question answering.

In [9]:
!pip install pandas



In [11]:
!pip install beautifulsoup4



In [12]:
import pandas as pd 
import spacy 
import requests 
from bs4 import BeautifulSoup
nlp = spacy.load("en_core_web_sm")
pd.set_option("display.max_rows", 200)

In [22]:
content = "Trinamool Congress leader Mahua Moitra has moved the Supreme Court against her expulsion from the Lok Sabha over the cash-for-query allegations against her. Moitra was ousted from the Parliament last week after the Ethics Committee of the Lok Sabha found her guilty of jeopardising national security by sharing her parliamentary portal's login credentials with businessman Darshan Hiranandani."
doc = nlp(content)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Trinamool Congress 0 18 ORG
Mahua Moitra 26 38 PERSON
the Supreme Court 49 66 ORG
Moitra 157 163 NORP
Parliament 184 194 ORG
last week 195 204 DATE
the Ethics Committee 211 231 ORG
Darshan Hiranandani 373 392 PERSON


In [23]:
entities = [(ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
df = pd.DataFrame(entities, columns=['text', 'type', 'lemma'])
print(df)

                   text    type                 lemma
0    Trinamool Congress     ORG    Trinamool Congress
1          Mahua Moitra  PERSON          Mahua Moitra
2     the Supreme Court     ORG     the Supreme Court
3                Moitra    NORP                Moitra
4            Parliament     ORG            Parliament
5             last week    DATE             last week
6  the Ethics Committee     ORG  the Ethics Committee
7   Darshan Hiranandani  PERSON   Darshan Hiranandani


## Word Sense Disambiguation: Importance in Natural Language Processing

Word Sense Disambiguation (WSD) is the NLP task of determining which meaning of a word is intended in a particular context. Many words have several senses, and picking the right one for a given sentence matters in applications such as machine translation, search, and information extraction.

In short, WSD resolves the ambiguity that arises when the same word is used with different meanings in different situations.

In [24]:
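from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

# lesk() returns the WordNet synset whose definition overlaps most with the context.
# No part of speech is given here; passing pos='v' or pos='n' would restrict the candidates.
a1 = lesk(word_tokenize('This device is used to jam the signal'), 'jam')
print(a1, a1.definition())
a2 = lesk(word_tokenize('I am stuck in a traffic jam'), 'jam')
print(a2, a2.definition())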

Synset('jamming.n.01') deliberate radiation or reflection of electromagnetic energy for the purpose of disrupting enemy use of electronic devices or systems
Synset('jam.v.05') get stuck and immobilized
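
The idea behind `lesk` can be sketched in a few lines: choose the sense whose dictionary gloss shares the most words with the surrounding sentence. The sketch below is a simplified version, not NLTK's implementation. The `SENSES` inventory, its sense names, and its glosses are hypothetical stand-ins for WordNet, and the stopword list is hand-picked for this example.

```python
# Simplified Lesk: choose the sense whose gloss overlaps most with the context.
# SENSES is a hypothetical, hand-written sense inventory (not WordNet).
SENSES = {
    "jam": {
        "jam_block_signal": "deliberately block or disrupt a radio signal or device",
        "jam_traffic": "a crowd of vehicles stuck in traffic on a road",
        "jam_food": "a sweet spread made from fruit and sugar",
    }
}

# Small hand-picked stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "or", "in", "on", "to", "is",
             "and", "from", "i", "am", "this"}


def content_words(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simple_lesk(sentence, word):
    """Return the sense name whose gloss shares the most content words with the sentence."""
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simple_lesk("This device is used to jam the signal", "jam"))  # jam_block_signal
print(simple_lesk("I am stuck in a traffic jam", "jam"))            # jam_traffic
```

Because the method only counts overlapping words, its answers depend heavily on how the glosses are worded. That is one reason the NLTK output above does not always pick the part of speech a human reader would expect.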


## How TF-IDF Works
TF-IDF combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency (TF): Measures how often a word appears in a document. A higher frequency suggests greater importance. If a term appears frequently in a document, it is likely relevant to the document’s content. Formula:

![image.png](attachment:image.png)

Inverse Document Frequency (IDF): Reduces the weight of common words across multiple documents while increasing the weight of rare words. If a term appears in fewer documents, it is more likely to be meaningful and specific. Formula:

![image.png](attachment:image.png)

---

## 🧠 What is TF-IDF? (Layman’s Explanation)

> **TF-IDF** stands for **Term Frequency – Inverse Document Frequency**.
> It helps a computer **figure out which words in a document are important**.

---

### 🧺 Analogy: Grocery Store Flyers

Imagine you’re reading flyers from **three grocery stores**.

* Every flyer mentions words like **“fresh”**, **“sale”**, and **“store”**. These appear **everywhere** — they don’t tell you much.
* But if one flyer uses the word **“avocados”** and none of the others do, you might think:

  > 💡 “That flyer is all about avocados!”

**TF-IDF helps a computer think that way.**
It gives **more weight to special words** and **less weight to common words.**

---

## 🔣 TF-IDF = TF × IDF

Let’s break it down.

---

### 🔹 1. **TF = Term Frequency**

> How often does the word appear in one document?

📘 **Formula**:

$$
\text{TF}(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}
$$

📦 **Example**:

* Word: `banana`
* Appears: 3 times in a document of 100 words
* Then: `TF(banana) = 3 / 100 = 0.03`

---

### 🔹 2. **IDF = Inverse Document Frequency**

> How rare is the word **across all documents**?

📘 **Formula**:

$$
\text{IDF}(t) = \log\left(\frac{N}{df(t)}\right)
$$

* `N` = total number of documents
* `df(t)` = number of documents **containing** term `t`

📦 **Example**:

* Total documents: 1000
* Word "banana" appears in 10 documents
* Then: `IDF(banana) = log10(1000 / 10) = log10(100) = 2.0` (a base-10 logarithm is used here; a different base only rescales the scores)

---

### 🔹 3. **TF-IDF = TF × IDF**

$$
\text{TF-IDF}(banana) = 0.03 \times 2.0 = 0.06
$$

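So, the word **"banana"** gets a score of `0.06` in this document.
The number is only meaningful in comparison: terms with higher scores in the same document are more distinctive of it, while a word found in every document would score 0.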
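
The banana arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function names `term_frequency` and `inverse_document_frequency` are illustrative, and a base-10 logarithm is assumed because it reproduces `log(100) = 2.0`.

```python
import math


def term_frequency(term_count, total_terms):
    """TF: share of the document's terms that are this term."""
    return term_count / total_terms


def inverse_document_frequency(num_docs, docs_with_term):
    """IDF with a base-10 logarithm, matching the worked example."""
    return math.log10(num_docs / docs_with_term)


tf = term_frequency(3, 100)                 # "banana" appears 3 times in 100 words
idf = inverse_document_frequency(1000, 10)  # 10 of 1000 documents contain "banana"
tfidf = tf * idf

print(f"TF = {tf:.2f}, IDF = {idf:.2f}, TF-IDF = {tfidf:.2f}")
# TF = 0.03, IDF = 2.00, TF-IDF = 0.06

# A word that appears in every document gets IDF = log10(1000 / 1000) = 0.
print(inverse_document_frequency(1000, 1000))  # 0.0
```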

---

## 🔍 What’s the Big Idea?

* **Common words** (like “the”, “and”) appear everywhere → IDF is low → TF-IDF is low
* **Rare but important words** get high TF-IDF → They stand out

---

## 📊 Real World Uses

| Use Case               | How TF-IDF Helps                           |
| ---------------------- | ------------------------------------------ |
| Search engines         | Ranks documents by relevance to query keywords |
| Spam Detection         | Weighs words like "free", "win", "click"   |
| Text Classification    | Picks keywords to help assign topics       |
| Chatbots               | Matches user input to relevant answers     |
| Recommendation Systems | Finds important terms in user preferences  |

---

## ✅ Quick Summary

| Term       | Meaning                             |
| ---------- | ----------------------------------- |
| **TF**     | Word frequency in 1 document        |
| **IDF**    | Rarity of the word in all documents |
| **TF-IDF** | Importance of the word in a document, relative to the corpus |

---



In [26]:
! pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.7.0-cp311-cp311-win_amd64.whl.metadata (14 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Downloading scipy-1.16.0-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.0-cp311-cp311-win_amd64.whl (10.7 MB)
Downloading scipy-1.16.0-cp311-cp311-win_amd64.whl (38.6 MB)


In [28]:
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0 = 'Geeks for geeks'
d1 = 'Geeks'
d2 = 'r2j'

# merge documents into a single corpus
corpus = [d0, d1, d2]

# create the vectorizer
tfidf = TfidfVectorizer()

# fit on the corpus and get tf-idf values
result = tfidf.fit_transform(corpus)

# get idf values
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())


idf values:
for : 1.6931471805599454
geeks : 1.2876820724517808
r2j : 1.6931471805599454

Word indexes:
{'geeks': 1, 'for': 0, 'r2j': 2}

tf-idf value:
<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (3, 3)>
  Coords	Values
  (0, 1)	0.8355915419449176
  (0, 0)	0.5493512310263033
  (1, 1)	1.0
  (2, 2)	1.0

tf-idf values in matrix form:
[[0.54935123 0.83559154 0.        ]
 [0.         1.         0.        ]
 [0.         0.         1.        ]]
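
The IDF values printed above do not match the plain `log(N / df)` formula. With its default settings, `TfidfVectorizer` uses a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, and then scales each document row to unit (L2) length. The sketch below reproduces the numbers in plain Python under those assumptions. The documents are tokenized and lowercased by hand here, which is an assumption standing in for the vectorizer's own tokenizer.

```python
import math

# The three documents from the cell above, lowercased and split by hand.
docs = [["geeks", "for", "geeks"], ["geeks"], ["r2j"]]
n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})  # ['for', 'geeks', 'r2j']


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in d for d in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


idf = {t: smoothed_idf(t) for t in vocab}


def tfidf_row(doc):
    """Raw count times IDF for each vocabulary term, then L2-normalized."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


matrix = [tfidf_row(d) for d in docs]

for t in vocab:
    print(t, ":", round(idf[t], 4))
for row in matrix:
    print([round(x, 4) for x in row])
```

The first row comes out as roughly `[0.5494, 0.8356, 0.0]`, matching the matrix above. "geeks" outweighs "for" there because it occurs twice in that document, even though its IDF is lower.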


In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Step 1: Load real news data (subset of topics)
categories = ['rec.sport.hockey', 'sci.space', 'talk.politics.mideast']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Step 2: Split into train and test
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Step 3: Create model pipeline with TF-IDF + Naive Bayes
model = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())

# Step 4: Train model
model.fit(X_train, y_train)

# Step 5: Predict and evaluate
predictions = model.predict(X_test)

# Step 6: Show results
print("Target Names:", data.target_names)
print("\nClassification Report:\n")
print(classification_report(y_test, predictions, target_names=data.target_names))


Target Names: ['rec.sport.hockey', 'sci.space', 'talk.politics.mideast']

Classification Report:

                       precision    recall  f1-score   support

     rec.sport.hockey       0.94      0.94      0.94       303
            sci.space       0.92      0.93      0.92       290
talk.politics.mideast       0.94      0.93      0.94       285

             accuracy                           0.93       878
            macro avg       0.93      0.93      0.93       878
         weighted avg       0.93      0.93      0.93       878



| Feature           | **TF-IDF**                                    | **Word2Vec**                                      |
| ----------------- | --------------------------------------------- | ------------------------------------------------- |
| What it does      | Counts and scores **how important** a word is | Learns **what a word means** based on its context |
| Data structure    | Sparse vector (lots of 0s)                    | Dense vector (real numbers that capture meaning)  |
| Output            | A vector per document (one dimension per vocabulary term) | A vector per word (based on meaning and context)  |
| Captures meaning? | ❌ No (only frequency)                         | ✅ Yes (semantics and word relationships)          |
| Learns synonyms?  | ❌ No                                          | ✅ Often (similar or related words get nearby vectors, e.g., “dog” ≈ “puppy”, “king” ≈ “queen”) |


📌 TF-IDF:
You go to a fruit market.

TF-IDF says: "Banana was mentioned 10 times at this stall but hardly anywhere else in the market, so it’s important here!"

📌 Word2Vec:
You go to the same market every day and hear people talk about fruits.

Word2Vec says:

"Banana and mango are often bought together, used in similar sentences, and eaten as snacks… so they’re related!"



In [11]:
! pip install gensim







### Code for Word2Vec

In [12]:
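from gensim.models import Word2Vec
from sklearn.datasets import fetch_20newsgroups
from nltk.tokenize import word_tokenize
import nltk

# Tokenizer data for word_tokenize; newer NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('punkt')
nltk.download('punkt_tab')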

# Load real news articles
categories = ['rec.sport.hockey', 'sci.space', 'talk.politics.mideast']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Tokenize each document into a list of words
sentences = [word_tokenize(doc.lower()) for doc in data.data]

# Train Word2Vec
model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, workers=4)

# Example: Find similar words to 'space'
print("Words similar to 'space':")
print(model.wv.most_similar('space', topn=5))

# Example: Check similarity between words
print("\nSimilarity between 'nasa' and 'space':", model.wv.similarity('nasa', 'space'))
print("Similarity between 'goal' and 'hockey':", model.wv.similarity('goal', 'hockey'))


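ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

The cell fails at import time. This `ValueError` means the installed gensim binaries were built against a different NumPy version than the one in the environment. Reinstalling gensim together with a NumPy version it supports usually resolves it. The following cells use spaCy's pretrained word vectors instead.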

In [14]:
! python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)


In [15]:
import spacy
from sklearn.datasets import fetch_20newsgroups

# Load spaCy's medium model (has word vectors)
nlp = spacy.load("en_core_web_md")

# Load some news articles
categories = ['rec.sport.hockey', 'sci.space', 'talk.politics.mideast']
data = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))

# Just take the first 5 documents for demo
texts = data.data[:5]

# Convert each text to spaCy Doc and get vector
docs = [nlp(text) for text in texts]

# Print the vector for each document (average word vectors)
for i, doc in enumerate(docs):
    print(f"\n--- Document {i+1} (Category: {data.target_names[data.target[i]]}) ---")
    print(f"Text snippet: {doc.text[:100].strip()}...")
    print(f"Vector shape: {doc.vector.shape}")
    print(f"Vector preview: {doc.vector[:5]}")  # first 5 dims only



--- Document 1 (Category: rec.sport.hockey) ---
Text snippet: Last night during a Sharks' broadcast, Commissioner Bettman was
interviewed during the first interm...
Vector shape: (300,)
Vector preview: [-0.62975585  0.23110239 -0.22710201 -0.07602932 -0.05759767]

--- Document 2 (Category: talk.politics.mideast) ---
Text snippet: I think he is trying to mislead people.  In cases where race
information is sought, it is completel...
Vector shape: (300,)
Vector preview: [-0.6198063   0.13307561 -0.22647174 -0.08648135 -0.07353649]

--- Document 3 (Category: rec.sport.hockey) ---
Text snippet: Playoff leaders as of April 19, 1993

    Player       Team   GP  G   A  Pts +/- PIM

    M.Lemieux...
Vector shape: (300,)
Vector preview: [-0.31055307  0.19050068  0.09343816 -0.04042069  0.10491956]

--- Document 4 (Category: talk.politics.mideast) ---
Text snippet: I wouldn't bet on it.

Arab governments generally don't care much about the Palestineans and
their...
Vector shape: (300,)
Vector pr

In [16]:
# Word similarity (like Word2Vec)
word1 = nlp("space")
word2 = nlp("nasa")
word3 = nlp("hockey")

print("\nWord Similarities:")
print(f"'space' ~ 'nasa': {word1.similarity(word2):.2f}")
print(f"'space' ~ 'hockey': {word1.similarity(word3):.2f}")



Word Similarities:
'space' ~ 'nasa': 0.39
'space' ~ 'hockey': 0.08


In [17]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Compare document 0 to others
base_doc = docs[0].vector.reshape(1, -1)
similarities = cosine_similarity(base_doc, np.array([doc.vector for doc in docs]))

print("\nSimilarity to Document 0:")
for i, score in enumerate(similarities[0]):
    print(f"Doc {i}: {score:.2f}")



Similarity to Document 0:
Doc 0: 1.00
Doc 1: 0.97
Doc 2: 0.31
Doc 3: 0.96
Doc 4: 0.97


The following real-world examples show where **TF-IDF** and **Word2Vec** each fit, and when to choose one over the other.

---

## 🧠 Real-Life Application Example

### 🎯 Use Case: **Customer Support Ticket Classification**

Imagine you run a company that receives thousands of support requests like:

* *“I can’t reset my password”*
* *“Billing invoice is incorrect”*
* *“App crashes when I upload a photo”*

You want to automatically assign each ticket to a department:

🔧 Technical Support | 💰 Billing | 🔑 Account Management

---

## ✅ Where to Use TF-IDF?

### 🔍 Application: **Fast Ticket Classification with ML**

You can use **TF-IDF to convert the text of each support request** into a numeric vector, then feed it to a machine learning classifier (e.g., Naive Bayes or SVM).

### 🧰 Pipeline:

1. Preprocess: clean & tokenize ticket text
2. Use `TfidfVectorizer` to convert text to numeric features
3. Train a classifier (e.g., Logistic Regression)
4. Predict department for new tickets
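
The four steps above can be sketched with scikit-learn as follows. The tickets and department labels are hypothetical and far too few for a real system. `LogisticRegression` is one reasonable classifier choice here, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled tickets; a real system would need many more examples.
tickets = [
    "I can't reset my password",
    "Forgot my password and cannot log in",
    "Billing invoice is incorrect",
    "I was charged twice on my invoice",
    "App crashes when I upload a photo",
    "The app freezes and crashes on startup",
]
departments = ["account", "account", "billing", "billing", "technical", "technical"]

# Steps 1-3: TfidfVectorizer lowercases and tokenizes the text,
# then the classifier is trained on the resulting TF-IDF features.
router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(tickets, departments)

# Step 4: predict the department for a new ticket.
new_ticket = "My invoice shows the wrong amount"
predicted = router.predict([new_ticket])[0]
print(predicted)
```

Wrapping both stages in one pipeline means new tickets are vectorized with the vocabulary and IDF weights learned during training, so the features stay consistent between fitting and prediction.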

✅ TF-IDF is great here because:

* Simple and fast
* Easy to understand
* Works well when you have labeled data

---

## ✅ Where to Use Word2Vec?

### 🧠 Application: **Semantic Search in Knowledge Base**

Let’s say users often search your help center with questions like:

> *“How do I log in with Google?”*

Your knowledge base article is titled:

> *“Steps to sign in using third-party providers”*

⚠ TF-IDF **won’t work well here** because the keywords don’t match exactly.

✅ Word2Vec helps because:

* It **captures meaning** and context
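* Given enough training text, it places words used in similar contexts close together (for example *“log”* and *“sign”* in *“log in”* / *“sign in”*), so the query and the article can get similar vectors without sharing keywords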

### 🧰 Pipeline:

1. Train Word2Vec model on your support tickets + help articles
2. Convert both queries and articles to **average Word2Vec vectors**
3. Use **cosine similarity** to return the most relevant article
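
Steps 2 and 3 can be sketched in plain Python. The 3-dimensional vectors in `VECTORS` are hypothetical values invented for this illustration, and the second article title is made up as well. A trained Word2Vec model would supply real vectors with many more dimensions.

```python
import math

# Hypothetical word vectors, made up for illustration (not from a trained model).
VECTORS = {
    "log":       [0.90, 0.10, 0.0],
    "sign":      [0.85, 0.15, 0.0],
    "in":        [0.50, 0.50, 0.0],
    "google":    [0.10, 0.90, 0.1],
    "providers": [0.20, 0.80, 0.1],
    "refund":    [0.00, 0.10, 0.9],
    "invoice":   [0.10, 0.00, 0.9],
}


def average_vector(text):
    """Average the vectors of all known words in the text; unknown words are skipped."""
    vecs = [VECTORS[w] for w in text.lower().split() if w in VECTORS]
    if not vecs:
        return None
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


articles = [
    "Steps to sign in using third-party providers",
    "How to request a refund for an invoice",  # hypothetical second article
]
query = "How do I log in with Google"

query_vec = average_vector(query)
ranked = sorted(articles,
                key=lambda title: cosine(query_vec, average_vector(title)),
                reverse=True)
print(ranked[0])  # Steps to sign in using third-party providers
```

The query and the top article share only the word "in", yet their averaged vectors point in nearly the same direction. That is the effect the pipeline relies on.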

---

## 🚀 Summary Table

| Feature                   | TF-IDF                                | Word2Vec                                     |
| ------------------------- | ------------------------------------- | -------------------------------------------- |
| 🔧 Use Case               | Classification (e.g., ticket routing) | Semantic search (e.g., query ≈ article)      |
| 📈 Data Requirement       | Raw text for the features; labels only for the classifier | Can work unsupervised (learns from raw text) |
| 💬 Understands Synonyms?  | ❌ No (literal match only)             | ✅ Yes (“sign in” ≈ “log in”)                 |
| 🧠 Captures Word Meaning? | ❌ No                                  | ✅ Yes                                        |
| 🏃 Performance            | Fast, light                           | Needs more memory and time                   |

---

