# **Introduction to NLP**


# Required Libraries and Their Purpose

| **Library**       | **Installation Command**     | **Purpose** |
|-------------------|----------------------------|-------------|
| `nltk`           | `pip install nltk`          | Natural Language Processing (Tokenization, Stopword Removal, Lemmatization) |
| `scikit-learn`   | `pip install scikit-learn`  | Machine Learning utilities (TF-IDF, Cosine Similarity) |
| `pandas`         | `pip install pandas`        | Data handling and processing |
| `numpy`          | `pip install numpy`         | Mathematical computations (used internally by `sklearn`) |

## Additional NLTK Downloads

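After installing `nltk`, download the resources this notebook uses by calling `nltk.download('punkt')`, `nltk.download('stopwords')`, and `nltk.download('wordnet')` once (recent NLTK releases may also ask for `punkt_tab`). The cell below then checks that all four libraries import correctly.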



In [3]:
import nltk
import sklearn
import pandas
import numpy

print("All required libraries are installed successfully!")


All required libraries are installed successfully!




Natural Language Processing (NLP) is a branch of artificial intelligence that enables machines to understand, interpret, and generate human language. One of the first steps in NLP is text preprocessing, which involves cleaning and preparing text for further analysis. This includes:

* Tokenization
* Stopword Removal
* Stemming
* Lemmatization

## 1. Tokenization
Tokenization is the process of breaking a text into smaller pieces, called tokens. These tokens can be words, phrases, or sentences.

#### Types of Tokenization

1. Word Tokenization: Splitting text into individual words.
   
2. Sentence Tokenization: Splitting text into sentences.

### **Why is Tokenization Important in NLP?**

Tokenization is **one of the first steps** in NLP (Natural Language Processing). It helps break down text into smaller parts (tokens), making it easier for computers to understand and analyze language.

### **Importance of Tokenization:**

1.  **Helps in Understanding Text** – Computers can’t read text like humans. Breaking it into words or sentences makes it easier to process.
    
2.  **Prepares Data for Analysis** – Many NLP tasks (like sentiment analysis or chatbots) require working with individual words or sentences.
    
3.  **Improves Machine Learning Models** – Models learn better when the text is structured into meaningful parts.
    
4.  **Removes Unnecessary Complexity** – Instead of working with a full sentence, analyzing word-by-word makes processing more efficient.
    

### **Example:**

Text:👉 _"Natural Language Processing is amazing!"_

#### **Word Tokenization:**

👉 \['Natural', 'Language', 'Processing', 'is', 'amazing', '!'\]

#### **Sentence Tokenization:**

👉 \["Natural Language Processing is amazing!"\]

In [6]:
import nltk
from nltk.tokenize import word_tokenize

text = "Hello! How are you doing today?"
tokens = word_tokenize(text)

print(tokens)


['Hello', '!', 'How', 'are', 'you', 'doing', 'today', '?']


### Sentence Tokenization

In [8]:
from nltk.tokenize import sent_tokenize

text = "Hello! How are you doing today? I hope you are learning NLP."
sentences = sent_tokenize(text)

print(sentences)

['Hello!', 'How are you doing today?', 'I hope you are learning NLP.']
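To get a feel for what a tokenizer does beyond a plain `str.split()`, here is a minimal regex-based sketch. The helper names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration, and the rules are far simpler than the ones NLTK applies (for example, abbreviations such as "Dr." would be split incorrectly).

```python
import re

def simple_word_tokenize(text):
    # A run of word characters is one token; every punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that directly follows ".", "!" or "?".
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Hello! How are you doing today? I hope you are learning NLP."

# str.split() leaves punctuation glued to the words ("Hello!", "today?", "NLP.").
print(text.split())

# The regex version separates punctuation into its own tokens.
print(simple_word_tokenize(text))

# ['Hello!', 'How are you doing today?', 'I hope you are learning NLP.']
print(simple_sent_tokenize(text))
```

For this short text the sketch happens to agree with the NLTK output above; on real-world text the two will diverge.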


## 2. Stopword Removal
   
### **What are Stop Words?**

Stop words are common words in a language that do not add much meaning to a sentence and are usually **removed** from text analysis to save space and improve efficiency.

### **Example:**

In the sentence:👉 _"I love machine learning and AI."_

*   Words like **"I", "and"** are stop words because they don't carry significant meaning.
    

### **Why Remove Stop Words?**

1.  **Reduces Noise** – Makes the text cleaner and more focused.
    
2.  **Improves Performance** – Helps machine learning models process important words faster.
    
3.  **Saves Storage & Memory** – Eliminates unnecessary words from data storage.
    

### **Common Stop Words (English)**

*   "the", "is", "in", "and", "to", "on", "at", "for", "a", "an", "of", "it"
    

### **In NLP (Natural Language Processing)**

When working with text data, we often remove stop words to focus on important words that affect the meaning, such as **nouns, verbs, and adjectives**.

The cells below show how to remove stop words in Python with NLTK.

In [10]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print(stop_words)


{'between', 'we', 'doesn', 'under', 'about', 'how', 'any', 'such', 'couldn', 'once', 'through', "doesn't", 'very', 'should', "haven't", 'you', 'or', 'ain', 're', 'll', 'it', "you've", 'over', 'most', 'until', 'our', 'from', "won't", 'above', 'because', 'few', 'aren', 'won', 'she', 'he', "shouldn't", 'them', 'does', "it's", 'a', 'am', 'just', 'this', 'that', 'each', 'don', "weren't", 'didn', "needn't", 'both', 'be', 'when', 'can', 'him', "you'd", 'shan', 'own', "you'll", 'whom', 'nor', "aren't", 'with', 'so', 'yours', 'of', 'there', 'but', 's', 'to', 'up', 'hasn', 'further', 'no', 'have', 'off', 'into', 'having', 'yourself', 'those', 'itself', 'before', 'been', 'do', 'himself', "don't", 'now', 'out', 'in', 'as', "shan't", 'who', 'what', "wouldn't", 'y', 'will', 'hadn', "mightn't", 'being', "you're", 't', "didn't", 'hers', 'all', "couldn't", 'below', 'too', 'haven', 'during', 'at', 'shouldn', 'on', 'my', 'd', 'o', 'had', 'down', 'are', 'ourselves', 'i', 'mustn', 'than', 'same', 'is', 'ag

In [11]:
text2 = "Hello! How are you doing today?"
tokens1 = word_tokenize(text2)

# Tokens that ARE stop words (these are the ones that will be removed)
stop_word_tokens = [word for word in tokens1 if word.lower() in stop_words]
print(stop_word_tokens)


['How', 'are', 'you', 'doing']


In [12]:
filtered_words = [word for word in tokens1 if word.lower() not in stop_words]
print(filtered_words)

['Hello', '!', 'today', '?']
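Note that punctuation survives the filter above because it is not in the stop word list. A small sketch of a filter that drops both stop words and punctuation follows. It assumes a tiny hand-written stop list (`SMALL_STOP_WORDS`, taken from the common words listed earlier plus "i") and a hypothetical helper name `remove_stop_words`; NLTK's English list is much longer.

```python
import re

# Hypothetical mini stop list for illustration only.
SMALL_STOP_WORDS = {"i", "the", "is", "in", "and", "to", "on", "at", "for", "a", "an", "of", "it"}

def remove_stop_words(text, stop_words=SMALL_STOP_WORDS):
    # Rough tokenization: words and punctuation marks become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Keep alphabetic tokens whose lowercase form is not a stop word.
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

print(remove_stop_words("I love machine learning and AI."))
# ['love', 'machine', 'learning', 'AI']
```

Comparing on `t.lower()` while returning the original token keeps the casing of words such as "AI" intact.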


## 3. Stemming

### **Why is Stemming Important in NLP?**

Stemming helps computers **understand the core meaning of words** by reducing them to their root form. This makes text processing **faster and more efficient** in NLP tasks.

### **Importance of Stemming:**

1.  **Reduces Word Variations** – Words like _running_ and _runs_ both reduce to _run_, helping models treat them as the same word. Related forms such as _runner_ or the irregular _ran_ may be left unchanged by a rule-based stemmer.
    
2.  **Saves Storage & Processing Power** – Instead of storing multiple versions of the same word, we only keep the root word.
    
3.  **Improves Search & Information Retrieval** – A search for _“connect”_ will also match _“connected”_, _“connecting”_, and _“connection”_.
    
4.  **Enhances Text Analysis** – Makes NLP models more effective by focusing on meaning rather than word variations.
    

### **Example:**

In [14]:
from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Example words for different NLP tasks
examples = {
    "Search Engine Optimization": ["running", "runs", "runner", "ran", "run"],
    "Sentiment Analysis": ["happily", "happiness", "happy", "unhappily", "unhappy"],
    "Chatbot Responses": ["studying", "studied", "study", "studies"],
    "Spam Detection": ["buying", "bought", "buys", "buy"],
    "Resume Screening": ["developed", "developing", "developer", "development"]
}

# Apply stemming and print results
for use_case, words in examples.items():
    stemmed_words = [stemmer.stem(word) for word in words]
    print(f"\n🔹 {use_case}:\n   Original: {words}\n   Stemmed:  {stemmed_words}")



🔹 Search Engine Optimization:
   Original: ['running', 'runs', 'runner', 'ran', 'run']
   Stemmed:  ['run', 'run', 'runner', 'ran', 'run']

🔹 Sentiment Analysis:
   Original: ['happily', 'happiness', 'happy', 'unhappily', 'unhappy']
   Stemmed:  ['happili', 'happi', 'happi', 'unhappili', 'unhappi']

🔹 Chatbot Responses:
   Original: ['studying', 'studied', 'study', 'studies']
   Stemmed:  ['studi', 'studi', 'studi', 'studi']

🔹 Spam Detection:
   Original: ['buying', 'bought', 'buys', 'buy']
   Stemmed:  ['buy', 'bought', 'buy', 'buy']

🔹 Resume Screening:
   Original: ['developed', 'developing', 'developer', 'development']
   Stemmed:  ['develop', 'develop', 'develop', 'develop']


## 4. Lemmatization

### **Why is Lemmatization Important in NLP?**

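Lemmatization is usually **more accurate than stemming** because it reduces words to their meaningful base form (lemma), which is always a real dictionary word. It relies on **a dictionary and the word's part of speech**, which makes the results more **accurate and natural** but requires the correct part-of-speech tag (NLTK's `WordNetLemmatizer` assumes a noun by default).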

### **Importance of Lemmatization:**

1.  **Produces Real Words** – Unlike stemming, which sometimes gives incorrect root forms (_e.g., “studies” → “studi”_), lemmatization gives correct words (_e.g., “studies” → “study”_).
    
2.  **Better Text Understanding** – Helps NLP applications like **chatbots, search engines, and sentiment analysis** process words correctly.
    
3.  **Handles Different Word Forms** – Converts **verbs, nouns, and adjectives** to their dictionary forms (e.g., _"running"_ → _"run"_, _"mice"_ → _"mouse"_).
    
4.  **Improves Model Accuracy** – Machine learning models perform better when trained on proper words rather than incorrect stems.

In [16]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["running", "flies", "easily", "fairly", "connected"]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]

print(lemmatized_words)


# Example words for different NLP tasks
examples = {
    "Search Engine Optimization": ["running", "runs", "runner", "ran", "run"],
    "Sentiment Analysis": ["happily", "happiness", "happy", "unhappily", "unhappy"],
    "Chatbot Responses": ["studying", "studied", "study", "studies"],
    "Spam Detection": ["buying", "bought", "buys", "buy"],
    "Resume Screening": ["developed", "developing", "developer", "development"]
}

# Apply lemmatization and print results.
# No `pos` argument is given, so every word is treated as a noun (the default);
# that is why verb forms such as "running" or "studied" come back unchanged.
for use_case, words in examples.items():
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    print(f"\n🔹 {use_case}:\n   Original:   {words}\n   Lemmatized: {lemmatized_words}")

['run', 'fly', 'easily', 'fairly', 'connect']

🔹 Search Engine Optimization:
   Original:   ['running', 'runs', 'runner', 'ran', 'run']
   Lemmatized: ['running', 'run', 'runner', 'ran', 'run']

🔹 Sentiment Analysis:
   Original:   ['happily', 'happiness', 'happy', 'unhappily', 'unhappy']
   Lemmatized: ['happily', 'happiness', 'happy', 'unhappily', 'unhappy']

🔹 Chatbot Responses:
   Original:   ['studying', 'studied', 'study', 'studies']
   Lemmatized: ['studying', 'studied', 'study', 'study']

🔹 Spam Detection:
   Original:   ['buying', 'bought', 'buys', 'buy']
   Lemmatized: ['buying', 'bought', 'buy', 'buy']

🔹 Resume Screening:
   Original:   ['developed', 'developing', 'developer', 'development']
   Lemmatized: ['developed', 'developing', 'developer', 'development']


# **Why Do We Use Stemming If Lemmatization Is Better?**

Although lemmatization is more accurate, stemming is still used in many NLP applications because of its **speed and simplicity**. Here’s why:

---

## **1. Stemming is Faster**
- Stemming just **chops off suffixes** (e.g., `"running"` → `"run"`).
- Lemmatization **uses a dictionary** to find the correct word, making it **slower**.
- In large datasets (e.g., millions of documents), **stemming is preferred** for quick text processing.

✅ **Example:**
- **Stemming:** `"computing"` → `"comput"` (Fast, but not a real word)
- **Lemmatization:** `"computing"` → `"compute"` (Correct but slower)
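The difference between "chopping suffixes" and "looking words up" can be sketched in a few lines. This is a toy illustration only: `naive_stem`, `toy_lemmatize`, and the tiny `LEMMA_TABLE` are hypothetical stand-ins, not the Porter algorithm or WordNet.

```python
# Toy rule-based stemmer: strip the first matching suffix, no dictionary needed.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a lookup keyed by (word, part of speech).
# A real lemmatizer ships a large dictionary; this table is made up for the demo.
LEMMA_TABLE = {
    ("computing", "v"): "compute",
    ("running", "v"): "run",
    ("studies", "n"): "study",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    # Unknown (word, pos) pairs are returned unchanged.
    return LEMMA_TABLE.get((word, pos), word)

print(naive_stem("computing"))          # comput  (fast, not a real word)
print(toy_lemmatize("computing", "v"))  # compute (needs the dictionary entry)
print(naive_stem("mice"))               # mice    (no rule applies)
print(toy_lemmatize("mice"))            # mouse
print(toy_lemmatize("running"))         # running (looked up as a noun)
print(toy_lemmatize("running", "v"))    # run
```

The stemmer needs only a handful of string checks, while the lemmatizer is only as good as its dictionary and the part-of-speech tag it is given. That is the speed versus accuracy trade-off in miniature.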

---

## **2. Stemming is Useful in Search Engines**
- In **Google Search, Elasticsearch, or e-commerce searches**, we need to quickly match words.
- Users searching for **"buying"** should still find results containing **"buy"** or **"buys"**, because stemming reduces all three forms to `"buy"`.

✅ **Example:**
- **User searches** `"buying laptop"`
- **Stemming reduces** `"buying"` → `"buy"`, matching **"buy laptop"** results faster.

---

## **3. When High Accuracy is Not Needed**
- If **perfect word accuracy is not required**, stemming is **good enough**.
- Example: **Spam detection, keyword matching, sentiment analysis**.

✅ **Example:**
- **Stemming:** `"happy"` → `"happi"`, `"happiness"` → `"happi"`
- **Lemmatization:** `"happy"` and `"happiness"` stay separate words
- In sentiment analysis, **both words signal the same positive sentiment**, so merging them through stemming works fine.

---

## **4. Stemming Uses Less Memory**
- Since stemming is **rule-based and doesn’t use dictionaries**, it **consumes less memory**.
- Useful in **low-power applications** (e.g., mobile NLP apps, embedded AI).

---

## **When to Use What?**

| **Factor**                | **Stemming**  | **Lemmatization**  |
|---------------------------|--------------|--------------------|
| **Speed**                 | ✅ Fast      | ❌ Slower         |
| **Accuracy**              | ❌ Less accurate | ✅ More accurate |
| **Computational Cost**    | ✅ Low       | ❌ High           |
| **Real-World Use**        | Search engines, quick text analysis | Chatbots, machine learning, deep NLP |

---

## **Conclusion:**
✔ **Use Stemming** when speed is more important than accuracy (e.g., search engines, quick filtering).  
✔ **Use Lemmatization** when meaning is important (e.g., chatbots, NLP models, grammar-based applications).

--- 


# **TF-IDF (Term Frequency-Inverse Document Frequency)**

## 📌 **What is TF-IDF?**
TF-IDF (**Term Frequency-Inverse Document Frequency**) is a numerical statistic used in **Natural Language Processing (NLP)** to measure the importance of words in a document relative to a collection of documents (**corpus**).  

It helps **identify important words** while **down-weighting common words** that appear in most documents and add little meaning (e.g., "the", "is", "in").  

---

## **1. Understanding TF (Term Frequency)**
**TF (Term Frequency)** measures **how often a word appears** in a document.  
The formula is:


\[
TF = \frac{\text{Number of times the word appears in the document}}{\text{Total number of words in the document}}
\]

✅ **Example:**  
Document: `"Machine learning is amazing. Learning is fun."`  
- TF for `"learning"` = **2/7** = **0.2857**  
- TF for `"amazing"` = **1/7** = **0.1429**  

**TF alone is not enough** because common words like "is" and "the" will have high frequency but are not important.
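The TF numbers above can be reproduced with a short sketch. It assumes case-insensitive counting and a simple regex tokenizer, and the helper name `term_frequencies` is chosen for this example only.

```python
import re
from collections import Counter

def term_frequencies(document):
    # Lowercase so that "Learning" and "learning" are counted together.
    tokens = re.findall(r"\w+", document.lower())
    counts = Counter(tokens)
    # TF = occurrences of the word / total number of words in the document.
    return {word: count / len(tokens) for word, count in counts.items()}

tf = term_frequencies("Machine learning is amazing. Learning is fun.")

print(round(tf["learning"], 4))  # 0.2857
print(round(tf["amazing"], 4))   # 0.1429
print(round(tf["is"], 4))        # 0.2857
```

The filler word "is" scores exactly as high as "learning", which is the weakness of raw TF that IDF is meant to correct.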

---

## **2. Understanding IDF (Inverse Document Frequency)**
**IDF (Inverse Document Frequency)** gives **less importance to common words** and **more importance to rare words** across all documents.  

The formula is:


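\[
IDF = \log_{10}\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)
\]

The logarithm is taken to base 10, which is the base the example values that follow are computed with.

---

### ✅ **Example: IDF Calculation with Numbers**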
We have **3 documents**:

1️⃣ **"Machine learning is powerful"**  
2️⃣ **"Deep learning improves AI"**  
3️⃣ **"Learning algorithms are useful"**  

---

### **Step 1: Count the Documents Containing Each Word**
| **Word**       | **Appears in how many documents?** |
|---------------|---------------------------------|
| **machine**    | 1 |
| **learning**   | 3 |
| **is**         | 1 |
| **powerful**   | 1 |
| **deep**       | 1 |
| **improves**   | 1 |
| **AI**         | 1 |
| **algorithms** | 1 |
| **are**        | 1 |
| **useful**     | 1 |

---

### **Step 2: Apply the IDF Formula**
The formula for **Inverse Document Frequency (IDF)** is:

\[
IDF = \log_{10}\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)
\]

---

### **IDF for "learning"**
IDF = log ( 3 / 3 ) = log(1) = **0**  
🔹 "learning" appears in all 3 documents → **Low IDF (not important).**

---

### **IDF for "powerful"**
IDF = log ( 3 / 1 ) = log(3) = **0.4771**  
🔹 "powerful" appears in only 1 document → **High IDF (important word).**

---

### **IDF for "AI"**
IDF = log ( 3 / 1 ) = log(3) = **0.4771**  
🔹 "AI" is also rare, so it has a **high IDF**.
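The same IDF values can be checked in plain Python. This sketch assumes lowercase whitespace tokenization and the base-10 logarithm used in the calculations above. The helper names are illustrative.

```python
import math

documents = [
    "Machine learning is powerful",
    "Deep learning improves AI",
    "Learning algorithms are useful",
]

def document_frequency(word, docs):
    # Number of documents whose lowercase tokens include the word.
    return sum(word in doc.lower().split() for doc in docs)

def idf(word, docs):
    # Assumes the word occurs in at least one document (otherwise this divides by zero).
    return math.log10(len(docs) / document_frequency(word, docs))

for word in ["learning", "powerful", "ai"]:
    print(word, document_frequency(word, documents), round(idf(word, documents), 4))
# learning 3 0.0
# powerful 1 0.4771
# ai 1 0.4771
```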

---

## **3. Calculating TF-IDF Score**

The **TF-IDF score** is calculated using the formula:

\[
\text{TF-IDF} = TF \times IDF
\]

### ✅ **Example:**
If **TF = 0.2857** and **IDF = 0.4771**, then:

\[
\text{TF-IDF} = 0.2857 \times 0.4771 = 0.1363
\]

### 🔹 **Interpretation:**  
- A **higher TF-IDF score** means the word is **more important** in that document.  
- A **lower TF-IDF score** means the word is **less significant** and might appear frequently in other documents.
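Putting the two parts together, the following sketch scores every word of the first example document using the textbook formulas above (TF as a fraction of the document length, IDF with a base-10 logarithm). The function names are illustrative, and the tokenization is a simple lowercase split.

```python
import math

documents = [
    "Machine learning is powerful",
    "Deep learning improves AI",
    "Learning algorithms are useful",
]
tokenized = [doc.lower().split() for doc in documents]

def tf(word, tokens):
    return tokens.count(word) / len(tokens)

def idf(word, all_tokens):
    containing = sum(word in tokens for tokens in all_tokens)
    return math.log10(len(all_tokens) / containing)

def tf_idf(word, tokens, all_tokens):
    return tf(word, tokens) * idf(word, all_tokens)

first_doc = tokenized[0]
scores = {word: round(tf_idf(word, first_doc, tokenized), 4) for word in first_doc}
print(scores)
# {'machine': 0.1193, 'learning': 0.0, 'is': 0.1193, 'powerful': 0.1193}
```

"learning" scores 0 because it occurs in every document. In such a tiny corpus "is" scores as high as "powerful" simply because it happens to appear in only one document; with a realistic corpus its IDF would be close to zero.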

---

## **4. Why is TF-IDF Important?**
- **Improves text search (Google, Elasticsearch)**
- **Used in chatbots and recommendation systems**
- **Important for document classification**
- **Down-weights unimportant words while highlighting key terms**

---

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Define the documents
documents = [
    "Machine learning is powerful",
    "Deep learning improves AI",
    "Learning algorithms are useful"
]

# Initialize the TF-IDF Vectorizer (default settings)
vectorizer = TfidfVectorizer()

# Fit and transform the documents into a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Convert the TF-IDF matrix into a DataFrame
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
df_tfidf

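|   | ai | algorithms | are | deep | improves | is | learning | machine | powerful | useful |
|---|----|------------|-----|------|----------|----|----------|---------|----------|--------|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.546454 | 0.322745 | 0.546454 | 0.546454 | 0.0 |
| 1 | 0.546454 | 0.0 | 0.0 | 0.546454 | 0.546454 | 0.0 | 0.322745 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.546454 | 0.546454 | 0.0 | 0.0 | 0.0 | 0.322745 | 0.0 | 0.0 | 0.546454 |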
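These values differ from the hand calculation (for instance, "learning" is 0.322745 rather than 0) because `TfidfVectorizer` with default settings uses a different variant of the formula: raw counts as TF, a smoothed natural-log IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The sketch below reproduces the first row in plain Python under those assumptions; the helper names are illustrative and a lowercase whitespace split stands in for scikit-learn's tokenizer, which is sufficient for these three sentences.

```python
import math

documents = [
    "Machine learning is powerful",
    "Deep learning improves AI",
    "Learning algorithms are useful",
]
tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

def smoothed_idf(word):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so no term ever gets weight 0.
    df = sum(word in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw term count times smoothed IDF, then scale the row to unit L2 length.
    raw = [tokens.count(word) * smoothed_idf(word) for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

first_row = dict(zip(vocabulary, tfidf_row(tokenized[0])))
print(round(first_row["learning"], 6))  # 0.322745
print(round(first_row["powerful"], 6))  # 0.546454
```

Because of the "+ 1" in the IDF, a word that appears in every document keeps a small positive weight instead of dropping to zero.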


# Example resume 1

In [21]:
resume_text = """
John Doe
Email: johndoe@example.com | LinkedIn: linkedin.com/in/johndoe | GitHub: github.com/johndoe

🔹 **Professional Summary:**
Data Scientist with **5+ years of experience** in machine learning, deep learning, and AI-driven solutions. Proven ability to analyze large datasets, develop predictive models, and deploy scalable AI applications. Passionate about leveraging data science to drive business insights and automation.

🔹 **Technical Skills:**
- Programming: Python, R, SQL, Scala
- Machine Learning: Scikit-Learn, XGBoost, LightGBM
- Deep Learning: TensorFlow, PyTorch, Keras
- Data Engineering: Spark, Hadoop, Airflow, ETL Pipelines
- Visualization: Matplotlib, Seaborn, Power BI, Tableau
- Cloud Platforms: AWS (S3, Lambda, SageMaker), Azure, Google Cloud (BigQuery)

🔹 **Work Experience:**
**Senior Data Scientist** | ABC Tech Solutions | Jan 2020 – Present  
- Developed an **AI-driven fraud detection system** reducing fraudulent transactions by 35%.
- Designed and deployed a **real-time recommendation engine** using collaborative filtering, improving customer retention by 25%.
- Spearheaded an **automated NLP pipeline** for sentiment analysis on 100,000+ customer reviews.
- Built and deployed **ML models for demand forecasting**, reducing inventory costs by 20%.

**Data Scientist** | XYZ Analytics | July 2017 – Dec 2019  
- Created **predictive risk models** for financial services, reducing loan default rates by 18%.
- Led a **big data analytics initiative**, processing 5TB+ data weekly using Spark.
- Improved **email marketing campaign targeting**, increasing conversion rates by 22%.
- Conducted **A/B testing** and customer segmentation analysis for data-driven decision-making.

🔹 **Education:**
Master’s in Data Science | Stanford University | 2017  
Bachelor’s in Computer Science | University of California, Berkeley | 2015  

🔹 **Certifications:**
- Google Professional Data Engineer  
- AWS Certified Machine Learning – Specialty  
- TensorFlow Developer Certification  

🔹 **Projects:**
- Developed **anomaly detection models** for network security using Autoencoders.
- Built an **AI chatbot** for automated customer support using NLP techniques.
- Designed a **real-time dashboard** for sales forecasting using Power BI.

🔹 **Publications & Research:**
- "Enhancing Fraud Detection with Machine Learning" – Published in IEEE  
- Speaker at **PyCon 2022** on Explainable AI  

🔹 **Soft Skills:**
- Strong problem-solving and analytical skills  
- Effective communication and stakeholder collaboration  
- Agile and cross-functional team leadership  

"""


# Job Description Examples

In [23]:
job_description_1 = """
Job Title: Senior Data Scientist  
Location: San Francisco, CA | Remote Option Available  

Job Overview:  
We are seeking an experienced Senior Data Scientist to develop machine learning models, deep learning solutions, and AI-driven insights. You will work with large-scale datasets, build predictive analytics solutions, and deploy AI models in production environments.  

Key Responsibilities:  
- Develop and deploy machine learning and deep learning models using Scikit-Learn, TensorFlow, and PyTorch.  
- Build real-time recommendation engines and fraud detection systems to improve business performance.  
- Work with big data technologies (Spark, Hadoop) for data preprocessing and model training.  
- Create data pipelines and manage ETL processes for structured and unstructured data.  
- Design AI-powered NLP solutions for sentiment analysis and chatbots.  
- Collaborate with engineering teams to deploy ML models on AWS, GCP, or Azure.  
- Conduct A/B testing, customer segmentation, and data visualization using Power BI or Tableau.  

Required Qualifications:  
- 5+ years of experience in Data Science and AI model deployment.  
- Expertise in Python, SQL, and Scala for data analysis and modeling.  
- Strong background in Machine Learning, Deep Learning, and NLP.  
- Hands-on experience with Big Data (Spark, Hadoop, Airflow).  
- Proficiency in TensorFlow, PyTorch, Keras, and Scikit-Learn.  
- Experience working with AWS SageMaker, GCP BigQuery, and Azure ML.  
- Excellent problem-solving skills and ability to work in agile teams.  

Preferred Qualifications:  
- Google or AWS Certified Machine Learning Engineer.  
- Experience with real-time analytics and AI product deployment.  

Salary: 15,00,000 - 18,00,000 + Benefits  
Job Type: Full-Time | Remote / Hybrid Option  
"""



In [24]:
job_description_2 = """
Job Title: Data Engineer / AI Specialist  
Location: New York, NY  

Job Overview:  
We are looking for a Data Engineer with AI experience to build and optimize data pipelines, work on machine learning solutions, and contribute to AI-based projects.  

Key Responsibilities:  
- Build ETL pipelines and manage big data platforms (Hadoop, Spark).  
- Support ML model deployment and collaborate with Data Scientists.  
- Develop automated reporting dashboards using Power BI and Tableau.  
- Work on real-time data ingestion and transformation processes.  
- Assist in training and fine-tuning ML models in cloud environments.  
- Implement SQL-based data storage solutions for structured datasets.  

Required Qualifications:  
- 3+ years of experience in Data Engineering or Machine Learning.  
- Strong experience with SQL, Python, and Big Data Tools (Spark, Hadoop, Kafka).  
- Familiarity with Cloud Computing (AWS, GCP, or Azure).  
- Understanding of AI workflows and data-driven insights.  

Preferred Qualifications:  
- Experience working with Scikit-Learn, TensorFlow, or PyTorch.  
- Background in recommendation systems or AI analytics.  

Salary: 12,00,000 - 14,00,000 + Bonus  
Job Type: Full-Time | On-site  
"""


In [25]:
job_description_3 = """
Job Title: Business Intelligence Analyst  
Location: Chicago, IL  

Job Overview:  
We are hiring a Business Intelligence (BI) Analyst to develop reports, analyze market trends, and create dashboards for executive decision-making.  

Key Responsibilities:  
- Design business intelligence dashboards in Power BI and Tableau.  
- Perform market research and business analytics to drive strategic decisions.  
- Work with Excel, SQL, and statistical models for financial forecasting.  
- Collaborate with marketing and sales teams to identify customer insights.  
- Develop monthly reports and KPI tracking for management.  

Required Qualifications:  
- 2+ years of experience in Business Intelligence or Market Analytics.  
- Proficiency in SQL, Excel, and BI Tools (Power BI, Tableau).  
- Strong background in data visualization and business reporting.  

Preferred Qualifications:  
- Experience with CRM analytics and sales forecasting.  
- Basic knowledge of Python or R for data modeling.  

Salary: 8,00,000 - 10,00,000  
Job Type: Full-Time | Hybrid  
"""


In [26]:
import re
import nltk
import warnings
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Suppress warnings
warnings.filterwarnings("ignore")

# Download required NLTK resources silently
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Function to preprocess text and display steps
def preprocess_text(text):
    print("\nOriginal Text:\n", text)
    
    # Remove special characters
    text = re.sub(r'\W', ' ', text)
    
    # Tokenization
    tokens = word_tokenize(text.lower())  
    print("\nTokenization:\n", tokens)
    
    # Stopword Removal
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    print("\nAfter Stopword Removal:\n", filtered_tokens)
    
    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    print("\nAfter Stemming:\n", stemmed_tokens)
    
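    # Lemmatization (run on the stems here only to display every step;
    # a real pipeline would normally choose either stemming or lemmatization)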
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word, pos='v') for word in stemmed_tokens]
    print("\nAfter Lemmatization:\n", lemmatized_tokens)

    return " ".join(lemmatized_tokens)  # Return as a string for vectorization


# Function to calculate similarity using Cosine Similarity
def calculate_resume_match(resume_text, job_description):
    # Preprocess both texts
    resume_processed = preprocess_text(resume_text)
    job_processed = preprocess_text(job_description)

    # Convert text into TF-IDF vectors
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([resume_processed, job_processed])

    # Compute Cosine Similarity
    similarity_score = cosine_similarity(vectors[0], vectors[1])[0][0]

    # Convert similarity to percentage
    similarity_percentage = round(similarity_score * 100, 2)

    # Display results
    print("\nResume Match Percentage:", similarity_percentage, "%")

    return similarity_percentage


# Call function
match_percentage = calculate_resume_match(resume_text, job_description_1)
print("\nFinal Resume Similarity Score:", match_percentage, "%")



Original Text:
 
John Doe
Email: johndoe@example.com | LinkedIn: linkedin.com/in/johndoe | GitHub: github.com/johndoe

🔹 **Professional Summary:**
Data Scientist with **5+ years of experience** in machine learning, deep learning, and AI-driven solutions. Proven ability to analyze large datasets, develop predictive models, and deploy scalable AI applications. Passionate about leveraging data science to drive business insights and automation.

🔹 **Technical Skills:**
- Programming: Python, R, SQL, Scala
- Machine Learning: Scikit-Learn, XGBoost, LightGBM
- Deep Learning: TensorFlow, PyTorch, Keras
- Data Engineering: Spark, Hadoop, Airflow, ETL Pipelines
- Visualization: Matplotlib, Seaborn, Power BI, Tableau
- Cloud Platforms: AWS (S3, Lambda, SageMaker), Azure, Google Cloud (BigQuery)

🔹 **Work Experience:**
**Senior Data Scientist** | ABC Tech Solutions | Jan 2020 – Present  
- Developed an **AI-driven fraud detection system** reducing fraudulent transactions by 35%.
- Designed and d