1. Text Classification
Build and compare sentiment analysis models (e.g., movie/product reviews) using scikit-learn and Hugging Face Transformers.
2. Named Entity Recognition (NER)
Annotate a small dataset for NER (e.g., extract names, organizations, dates from documents).
Fine-tune a pre-trained NER model on this data and evaluate results.
3. Text Preprocessing Pipeline
Implement and document a pipeline: tokenization, stopword removal, lemmatization/stemming.
Compare effects of different preprocessing steps on downstream tasks.
4. Keyword/Topic Extraction
Use TF-IDF, RAKE, or LDA to extract and summarize main topics/keywords from company data, FAQs, or user queries.
5. Text Similarity/Matching
Calculate pairwise document similarities (cosine similarity with embeddings or TF-IDF).
Build a simple FAQ bot: match a user question to its closest answer from documentation.
6. Basic Chatbot/QA System
Use RAG or a pre-built Q&A API to answer questions from a specific document or webpage.
Evaluate accuracy and suggest improvements.
7. Sentiment/Emotion Analysis Pipeline
Try simple and advanced approaches (lexicon-based vs. Hugging Face Transformers).
8. Experiment Tracking
Integrate model runs with WandB or MLflow for experiment logging (useful for reproducibility).

# **1. Text Classification**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


data = {
    "text": [
        "I loved this movie, it was amazing!",
        "Terrible film, I hated it.",
        "It was an okay movie, not great.",
        "Best movie ever, fantastic!",
        "Worst movie I have ever seen.",
        "This product is excellent, highly recommended!",
        "Very bad quality, do not buy.",
        "The product is okay, but could be better.",
        "Amazing product, I love it!",
        "Terrible product, waste of money."
    ],
    "label": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # 1 = positive, 0 = negative
}

df = pd.DataFrame(data)


X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)


vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)


clf = LogisticRegression()
clf.fit(X_train_vec, y_train)


y_pred = clf.predict(X_test_vec)
print("Sklearn Logistic Regression Results:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Sklearn Logistic Regression Results:
Accuracy: 0.5
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2
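
The report above is computed on only two test reviews (`test_size=0.2` of ten), and both were predicted as class 1, which is why every metric for class 0 is 0.00. As a rough illustration, the sketch below recomputes the per-class numbers by hand. The label vectors are inferred from the report rather than taken from the run, and `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Assumed vectors consistent with the report: one review per class, both predicted as 1
y_true = [0, 1]
y_pred = [1, 1]

for label in (0, 1):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

Class 0 is never predicted, so its precision is undefined and reported as 0. With so few samples, a larger dataset or cross-validation would give a more meaningful estimate.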





# **2. Named-Entity Recognition**

In [None]:
!pip install transformers torch --quiet

In [None]:
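# Identify names, organizations, dates, locations, etc.
# Note: this checkpoint is a CamemBERT model trained on French text; it handles the
# simple English sentence below, but an English NER model may be more reliable.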

from transformers import pipeline

ner = pipeline(
    "ner",
    model="Jean-Baptiste/camembert-ner-with-dates",
    aggregation_strategy="simple"
)

text = "Elon Musk founded SpaceX in 2002."

results = ner(text)

print("\nNER Predictions:")
for r in results:
    print(r["word"], "->", r["entity_group"])


NER Predictions:
Elon Musk -> PER
SpaceX -> ORG
in 2002 -> DATE
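
With `aggregation_strategy="simple"`, the pipeline returns whole entities such as `Elon Musk` instead of one prediction per token. The sketch below shows the general idea of merging token-level tags into entity groups. It assumes BIO-style tags (`B-PER`, `I-PER`, `O`) and a hypothetical `group_entities` helper; it is not the library's implementation, and the actual tag scheme depends on the model.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens that share an entity type into (text, type) groups."""
    groups = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        # Strip the B-/I- prefix; "O" means the token is outside any entity
        entity_type = tag.split("-", 1)[1] if "-" in tag else None
        starts_new = tag.startswith("B-") or entity_type != current_type
        if current_words and starts_new:
            groups.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if entity_type is not None:
            current_words.append(token)
            current_type = entity_type
    if current_words:
        groups.append((" ".join(current_words), current_type))
    return groups

# Hand-written token tags (assumption) for the sentence used above
tokens = ["Elon", "Musk", "founded", "SpaceX", "in", "2002", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "B-DATE", "I-DATE", "O"]
print(group_entities(tokens, tags))
```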


# **3. Text Preprocessing**

In [None]:
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [None]:
text = 'NLP stands for Natural Language Processing . it is widey used concept in Deep Learning'

In [None]:
# Word Tokenizer

word_tokens = nltk.word_tokenize(text)
print(word_tokens)

['NLP', 'stands', 'for', 'Natural', 'Language', 'Processing', '.', 'it', 'is', 'widey', 'used', 'concept', 'in', 'Deep', 'Learning']


In [None]:
# Sentence Tokenizer

sentence_tokens = nltk.sent_tokenize(text)
print(sentence_tokens)

['NLP stands for Natural Language Processing .', 'it is widey used concept in Deep Learning']


In [None]:
# Remove stopwords such as "in", "a", "an", "the", etc.

stop_words = set(nltk.corpus.stopwords.words('english'))
filtered_tokens = [word for word in word_tokens if word not in stop_words]
print(filtered_tokens)

['NLP', 'stands', 'Natural', 'Language', 'Processing']
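
The membership test above is case-sensitive, so a capitalized stopword such as `It` would pass through, and punctuation tokens such as `.` are not stopwords either. A minimal sketch of a stricter filter follows; the tiny `STOP_WORDS` set is a stand-in assumed for illustration, not NLTK's English list, and `filter_tokens` is a hypothetical helper.

```python
# Small illustrative stopword set (assumption); in the notebook this would be
# set(nltk.corpus.stopwords.words('english'))
STOP_WORDS = {"for", "it", "is", "in", "the", "a", "an"}

def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stopwords (case-insensitively) and tokens with no letters or digits."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words and any(ch.isalnum() for ch in tok)
    ]

tokens = ["NLP", "stands", "for", "Natural", "Language", "Processing", ".",
          "It", "is", "used", "in", "Deep", "Learning"]
print(filter_tokens(tokens))
```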


In [None]:
# Identify nouns in the text

text = 'NLP stands for Natural Language Processing'
word_tokens = nltk.word_tokenize(text)
POS_tags = nltk.pos_tag(word_tokens)
NOUN = [word for word, pos in POS_tags if pos.startswith('NN')]
print(NOUN)

['NLP', 'Natural', 'Language', 'Processing']


In [None]:
# List related words (synonyms) from WordNet

wordnet = nltk.corpus.wordnet
synonyms = []
a=input("Enter any One Word/Object Name/etc: ")
for syn in wordnet.synsets(a):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

Enter any One Word/Object Name/etc: screen
['screen', 'silver_screen', 'projection_screen', 'blind', 'screen', 'screen', 'CRT_screen', 'screen', 'cover', 'covert', 'concealment', 'screen', 'filmdom', 'screenland', 'screen', 'sieve', 'screen', 'screen_door', 'screen', 'screen', 'screen', 'test', 'screen', 'screen', 'screen_out', 'sieve', 'sort', 'screen', 'screen', 'block_out', 'riddle', 'screen', 'shield', 'screen']


In [None]:
# Identify Parts of Speech

b = input("Enter any Sentence: ")
word_tokens_b = nltk.word_tokenize(b)
pos_tags = nltk.pos_tag(word_tokens_b)
print(pos_tags)

Enter any Sentence: hello my name is meena
[('hello', 'NN'), ('my', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('meena', 'JJ')]


In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk

lemmatizer = WordNetLemmatizer()

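def get_wordnet_pos(word):
    """Map a word's Penn Treebank tag to the matching WordNet POS constant."""
    # The word is tagged in isolation, so sentence context is not used
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)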

text = input("Enter any Sentence: ")
tokens = word_tokenize(text)

lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in tokens]

print(f"Original Text: {text}")
print(f"Lemmatized Words: {lemmatized_words}")

Enter any Sentence: hello everyone my name is meena
Original Text: hello everyone my name is meena
Lemmatized Words: ['hello', 'everyone', 'my', 'name', 'be', 'meena']
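
`get_wordnet_pos` tags each word on its own, so the tagger never sees the sentence. One alternative is to tag the whole sentence once and map each Penn Treebank tag to a WordNet part of speech. The sketch below shows only the mapping step on already-tagged pairs (copied from the POS-tagging output earlier in this section). `to_wordnet_pos` is a hypothetical helper, and the single-letter values are the ones behind `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV`.

```python
# WordNet POS constants as plain strings (wordnet.ADJ == "a", wordnet.NOUN == "n", ...)
TAG_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def to_wordnet_pos(tagged_tokens):
    """Map (word, Penn Treebank tag) pairs to (word, WordNet POS), defaulting to noun."""
    return [(word, TAG_TO_WORDNET.get(tag[:1].upper(), "n")) for word, tag in tagged_tokens]

# Tagged pairs as produced by nltk.pos_tag on the full sentence
tagged = [("hello", "NN"), ("my", "PRP$"), ("name", "NN"), ("is", "VBZ"), ("meena", "JJ")]
print(to_wordnet_pos(tagged))
```

Each resulting pair can then be passed to `lemmatizer.lemmatize(word, pos)`.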


In [None]:
# Lemmatize a word as a verb (e.g., "running" -> "run")

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

word = input("Enter any Word with Suffix ")
lemma = lemmatizer.lemmatize(word, pos='v')
print(f"Lemmatized Word: {lemma}")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Enter any Word with Suffix running
Lemmatized Word: run


In [None]:
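# Remove a prefix and a suffix

a = input("Enter any Word: ")

# Tuples ordered longest-first, so 'diss' is tried before 'dis' and 'fully' before 'ly'.
# A set has no guaranteed iteration order, so the affix that matched could vary between runs.
prefixes = ('diss', 'dis', 'pre', 'sub', 'un')
suffixes = ('fully', 'full', 'ful', 'ing', 'ed', 'ly')

processed_word = a

# Strip at most one prefix
for p in prefixes:
    if processed_word.startswith(p):
        processed_word = processed_word[len(p):]
        break

# Strip at most one suffix
for s in suffixes:
    if processed_word.endswith(s):
        processed_word = processed_word[:-len(s)]
        break

print(processed_word)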

Enter any Word: flying
fly


In [None]:
# Remove Punctuation

a = input("Enter any Sentence: ")

# Raw string so the backslash is taken literally (avoids an invalid-escape warning)
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
no_punct = ""
for char in a:
    if char not in punctuations:
        no_punct = no_punct + char
print(no_punct)


Enter any Sentence: hello......!!!!!!!!
hello
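
The character loop above can also be written with `str.translate`, which removes all listed characters in one pass. This sketch uses `string.punctuation` from the standard library, which covers a slightly larger set of ASCII punctuation than the hand-written string; `strip_punctuation` is a name chosen here for illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletion)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with ASCII punctuation removed."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("hello......!!!!!!!!"))
```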


# **4. Keyword/Topic Extraction (TF-IDF, RAKE, LDA)**

In [None]:
!pip install scikit-learn nltk rake-nltk gensim --quiet

In [None]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from rake_nltk import Rake
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

documents = [
    "Our company provides cloud-based storage solutions for small businesses.",
    "Users can reset their passwords and manage their accounts online.",
    "We offer 24/7 customer support and technical assistance.",
    "Our team helps clients migrate their data to the cloud securely."
]
stop_words = stopwords.words('english')

# TF-IDF Keywords

print("==== TF-IDF Keywords ====")
vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=(1,2))
X = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

for i, doc in enumerate(documents):
    scores = zip(feature_names, X[i].toarray()[0])
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)[:5]  # top 5 keywords
    keywords = [word for word, score in sorted_scores if score > 0]
    print(f"Document {i+1} TF-IDF Keywords: {keywords}")

# RAKE Keywords

print("\n==== RAKE Keywords ====")
r = Rake(stopwords=stop_words, min_length=1, max_length=3)
for i, doc in enumerate(documents):
    r.extract_keywords_from_text(doc)
    keywords = r.get_ranked_phrases()[:5]  # top 5 phrases
    print(f"Document {i+1} RAKE Keywords: {keywords}")

# LDA Topics

print("\n==== LDA Topics ====")

texts = [[word.lower() for word in word_tokenize(doc) if word.isalpha() and word.lower() not in stop_words]
         for doc in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]


lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15, random_state=42)

for idx, topic in lda_model.print_topics(num_words=5):
    print(f"Topic {idx+1}: {topic}")

==== TF-IDF Keywords ====
Document 1 TF-IDF Keywords: ['based', 'based storage', 'businesses', 'cloud based', 'company']
Document 2 TF-IDF Keywords: ['accounts', 'accounts online', 'manage', 'manage accounts', 'online']
Document 3 TF-IDF Keywords: ['24', '24 customer', 'assistance', 'customer', 'customer support']
Document 4 TF-IDF Keywords: ['clients', 'clients migrate', 'cloud securely', 'data', 'data cloud']

==== RAKE Keywords ====
Document 1 RAKE Keywords: ['company provides cloud', 'based storage solutions', 'small businesses']
Document 2 RAKE Keywords: ['accounts online', 'users', 'reset', 'passwords', 'manage']
Document 3 RAKE Keywords: ['7 customer support', 'technical assistance', 'offer 24']
Document 4 RAKE Keywords: ['cloud securely', 'data']

==== LDA Topics ====
Topic 1: 0.065*"solutions" + 0.065*"small" + 0.065*"provides" + 0.065*"businesses" + 0.065*"storage"
Topic 2: 0.060*"migrate" + 0.060*"clients" + 0.060*"securely" + 0.060*"data" + 0.060*"cloud"
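
The TF-IDF keywords above appear in alphabetical order, most likely because nearly every term occurs once in a single document and therefore ties on score; `sorted` is stable, so tied terms keep their vocabulary order. The sketch below recomputes the weights by hand on a toy corpus to show the tie. It follows the default formula documented for `TfidfVectorizer` (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, then L2 normalization); the pre-tokenized documents and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF, L2 norm)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({term: w / norm for term, w in raw.items()})
    return weighted

docs = [
    ["cloud", "storage", "solutions"],
    ["migrate", "data", "cloud"],
]
for weights in tfidf(docs):
    print({term: round(w, 3) for term, w in weights.items()})
```

Here `storage` and `solutions` receive the same weight, while `cloud`, which occurs in both documents, scores lower. On a corpus this small, the ranking says little about which terms are actually important.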


# **5. Text Similarity Matching**

In [None]:
# Ask one of the questions listed below

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq_questions = [
    "How do I reset my password?",
    "What is the refund policy?",
    "How can I contact support?",
    "Where can I download the app?",
    "How do I change my email address?"
]

faq_answers = [
    "To reset your password, go to settings > account > reset password.",
    "Our refund policy allows refunds within 30 days of purchase.",
    "You can contact support via email at support@example.com.",
    "You can download the app from the App Store or Google Play.",
    "To change your email address, go to your profile settings and update your email."
]


vectorizer = TfidfVectorizer()
faq_vectors = vectorizer.fit_transform(faq_questions)

print(" FAQ Bot is ready! Type 'exit' to quit.\n")

while True:
    user_question = input("You: ")
    if user_question.lower() in ["exit", "quit"]:
        print("Bot: Goodbye!")
        break

    # Vectorize the question with the vocabulary learned from the FAQ questions
    user_vector = vectorizer.transform([user_question])

    # Pick the FAQ entry with the highest cosine similarity
    similarities = cosine_similarity(user_vector, faq_vectors)
    best_idx = similarities.argmax()

    print("Bot:", faq_answers[best_idx], "\n")

 FAQ Bot is ready! Type 'exit' to quit.

You: how can i contact support ?
Bot: You can contact support via email at support@example.com. 

You: exit
Bot: Goodbye!
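
`argmax` always returns some FAQ entry, even when the question shares no words with any of them (all similarities are zero and index 0 wins). A common safeguard is a minimum-similarity threshold. The sketch below illustrates it with a plain bag-of-words cosine in place of TF-IDF; the `0.3` threshold, the fallback message, the shortened answers, and the helper names are assumptions for illustration.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector as a Counter."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two Counters (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer(question, faq_questions, faq_answers, threshold=0.3):
    """Return the best-matching answer, or a fallback when nothing is similar enough."""
    q_vec = bow(question)
    scores = [cosine(q_vec, bow(q)) for q in faq_questions]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    if scores[best_idx] < threshold:
        return "Sorry, I could not find a matching FAQ entry."
    return faq_answers[best_idx]

faq_questions = ["How do I reset my password?", "How can I contact support?"]
faq_answers = ["Go to settings > account > reset password.", "Email support@example.com."]

print(answer("how can i contact support ?", faq_questions, faq_answers))
print(answer("What is the weather today?", faq_questions, faq_answers))
```

A suitable threshold depends on the vectorizer and the data, so it is best tuned on a handful of real questions.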


# **6. Basic Chatbot/QA System**

In [None]:
# Ask a question about document_text

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

document_text = """
Plants grow using sunlight, water, and nutrients.
Sunlight allows photosynthesis, which provides energy.
Water transports nutrients throughout the plant.
Nutrients build plant tissues and help the plant grow.
"""

sentences = [s.strip() for s in document_text.split(".") if s.strip()]


embedder = SentenceTransformer("all-MiniLM-L6-v2")
sentence_embeddings = embedder.encode(sentences)


model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=150,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=0.8,
    top_p=0.9
)

print(" Offline QA Assistant (type 'exit' to quit)\n")

while True:
    question = input("You: ")
    if question.lower() in ["exit", "quit"]:
        print("Assistant: Goodbye!")
        break


    q_embed = embedder.encode([question])
    similarities = cosine_similarity(q_embed, sentence_embeddings)[0]
    top_idx = similarities.argsort()[-2:][::-1]
    context = " ".join(sentences[i] for i in top_idx)


    prompt = f"Use the following information to answer the question naturally:\n{context}\nQuestion: {question}\nAnswer:"
    response = generator(prompt, num_return_sequences=1)[0]["generated_text"]
    answer_text = response.replace(prompt, "").strip()

    print("Assistant:", answer_text, "\n")

 Offline QA Assistant (type 'exit' to quit)

You: what nutrients does ?
Assistant: Nutrients are a fundamental part of the plant's body and 

You: exit
Assistant: Goodbye!
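
To evaluate the assistant, its answers can be compared against reference answers written by hand. One common choice is a token-overlap F1 score, similar in spirit to the metric used for extractive QA benchmarks. The sketch below is a simplified version; the prediction is the assistant output shown above, the reference answer is made up for illustration, and `token_f1` is a hypothetical helper.

```python
import re
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer (case-insensitive)."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

prediction = "Nutrients are a fundamental part of the plant's body and"
reference = "Nutrients build plant tissues and help the plant grow"  # hypothetical gold answer
print(round(token_f1(prediction, reference), 3))
```

Averaging this score over a small set of question and reference pairs gives a rough baseline against which changes to the retrieval step or the generator can be compared.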


# **7. Sentiment/Emotion Analysis**

In [None]:
!pip install nltk --quiet
!pip install transformers torch --quiet

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from transformers import pipeline

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
text = "I drink juice everyday"

# Lexicon-Based Sentiment

sia = SentimentIntensityAnalyzer()

score = sia.polarity_scores(text)
print("\nLexicon-Based Sentiment:")
print(score)

# Transformer Sentiment

sentiment_model = pipeline("sentiment-analysis")

result = sentiment_model(text)
print("\nTransformer Sentiment:")
print(result)

# Emotion Score

emotion_model = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    return_all_scores=True
)

results = emotion_model(text)

print("\nEmotion Scores:")
for r in results[0]:
    print(r["label"], "->", round(r["score"], 3))


Lexicon-Based Sentiment:
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

Transformer Sentiment:
[{'label': 'POSITIVE', 'score': 0.9973875880241394}]

Emotion Scores:
anger -> 0.039
disgust -> 0.131
fear -> 0.014
joy -> 0.014
neutral -> 0.761
sadness -> 0.025
surprise -> 0.015
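
The three approaches disagree on this sentence: VADER and the emotion model treat it as neutral, while the default transformer sentiment model, which only offers POSITIVE and NEGATIVE labels, has to pick one. To compare them on the same footing, the VADER `compound` score can be mapped to a label. The sketch below uses the conventional cut-offs of 0.05 and -0.05; `vader_label` is a hypothetical helper.

```python
def vader_label(scores, threshold=0.05):
    """Map a VADER polarity_scores dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the lexicon-based output above
score = {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
print(vader_label(score))
```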
