# 📘 **Sentiment Analysis Algorithm**

**Definition:**
Sentiment Analysis is a method that uses **NLP (Natural Language Processing)** and **machine learning** to figure out whether a piece of text expresses a **positive 😊, negative 😠, or neutral 😐** feeling.

---



## 🔍 **What it Does**

* Detects emotions or opinions in text
* Used on reviews, tweets, comments, feedback, etc.



## 🧠 **How it Works**

## 🧹 **1. Text Preprocessing**

* Remove punctuation, stopwords, emojis
* Tokenize the text into words
* Lemmatize or stem them



## 🧾 **2. Feature Extraction**

Turn words into numbers using:

* **Bag of Words** 👜
* **TF-IDF** 📊
* **Word Embeddings** 🌐 (e.g., Word2Vec, GloVe)
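
As a rough illustration of the first two options, here is a minimal sketch that builds Bag of Words counts and a plain TF-IDF weight using only the standard library. The three toy documents and the `tf_idf` helper are made up for this example; libraries such as scikit-learn use smoothed variants of the formula, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical example documents)
docs = [
    "good movie good acting",
    "bad movie",
    "good plot",
]

# Bag of Words: one Counter of raw term counts per document
tokenized = [doc.split() for doc in docs]
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def tf_idf(term, doc_index):
    """Plain tf * log(N / df), without any smoothing."""
    tf = bow[doc_index][term] / len(tokenized[doc_index])
    idf = math.log(n_docs / df[term])
    return tf * idf

print(bow[0])                          # Counter({'good': 2, 'movie': 1, 'acting': 1})
print(round(tf_idf("good", 0), 3))     # 0.203
print(round(tf_idf("acting", 0), 3))   # 0.275
```

"acting" ends up with a higher weight than "good" even though it occurs less often, because it appears in only one document.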



## 🤖 **3. Algorithms**

Used to classify the sentiment:

* **Logistic Regression** 📈
* **Naive Bayes** 🤓
* **SVM (Support Vector Machine)** 💻
* **Deep Learning models**: LSTM, GRU, BERT 🧠
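
To make the classification step concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written from scratch. The four training sentences and the `predict` helper are invented for illustration; a real project would more likely use a library implementation and far more data.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-made training set (hypothetical)
train = [
    ("i love this great phone", "pos"),
    ("great value and good battery", "pos"),
    ("terrible screen and bad battery", "neg"),
    ("i hate this bad phone", "neg"),
]

# Count words per class and documents per class
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}

def predict(text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))  # log prior
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("great battery"))  # pos
print(predict("bad screen"))     # neg
```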



## 🧰 **Popular Tools & Libraries** (💯 Easy & Effective)

## 🧸 **TextBlob**

* Built on NLTK
* Gives:

  * **Polarity**: from −1 (negative) to +1 (positive)
  * **Subjectivity**: 0 to 1 (objective to subjective)
* Very beginner-friendly

## 🚀 **VADER (Valence Aware Dictionary and sEntiment Reasoner)**

* Designed for **social media text** (slang, emojis, all caps)
* Returns:

  * **Positive**, **Negative**, **Neutral**, and **Compound Score**
* Comes with **NLTK**



## 🛠️ **Applications**

* 📱 Social Media Monitoring
* 🛍️ Product Review Analysis
* 📰 Public Opinion Tracking
* 📞 Customer Support Feedback
* 💬 Chatbot Emotion Detection




---

# 📘 What is Sentiment Recognition?

**Sentiment Recognition** is the process where a computer system **detects and understands the emotion or sentiment** behind a piece of text. It tells us if the text is:

* Positive 😊
* Negative 😠
* Neutral 😐
* Or even more specific feelings like happy, sad, angry, surprised, etc.



## 🔍 How Sentiment Recognition Works

1. **Text Cleaning** 🧹
   Remove noise like punctuation, emojis (sometimes), and stopwords.

2. **Feature Extraction** 🧾
   Convert text into numbers using methods like Bag of Words, TF-IDF, or word embeddings.

3. **Modeling** 🤖
   Use algorithms or pretrained models to classify the sentiment.



## 🧰 Tools & Libraries for Sentiment Recognition

* **VADER** 🚀
  Great for social media text, quick and easy to use, no training needed.

* **TextBlob** 🧸
  Simple polarity and subjectivity scores, beginner-friendly.

* **Pretrained Deep Learning Models** 🧠
  Models like BERT can recognize complex emotions with better accuracy.



## ✅ When to Use Sentiment Recognition

* Social media monitoring 📱
* Customer feedback analysis 🛍️
* Chatbot emotion understanding 🤖
* Market research and brand reputation 📊



## ⚠️ Important Tips

* Prebuilt tools (VADER, TextBlob) are fast and easy but work best on general text.
* For **domain-specific** or **fine-grained emotions**, training your own model or fine-tuning pretrained models is better.





---

# 📝 Notes on Sentiment Recognition without Labeled Data

---

# 1️⃣ When You Have No Labeled Sentiment Data

You **cannot directly train** a supervised sentiment model without labeled classes (positive/negative/neutral). Here are ways to handle this:

## 🔹 Manual Labeling / Annotation 📝

* Label a **subset** of your data manually.
* Use this subset for training your model.
* Time-consuming but gives **high-quality data**.



## 🔹 Unsupervised or Semi-Supervised Methods 🔍

**a) Unsupervised Learning**

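* Use clustering algorithms like K-Means to group similar texts.
* Inspect a sample from each cluster and decide by hand which sentiment it represents (no explicit labels needed). Clusters often follow topic rather than sentiment, so verify before trusting them.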

**b) Semi-Supervised Learning**

* Start with a small labeled set (from manual labeling or weak labeling).
* Train a model, predict on unlabeled data, and iteratively improve.



## 🔹 Weak Supervision / Distant Supervision 🤖

* Use **rules, heuristics, or lexicons** to assign automatic (noisy) sentiment labels.
* Labels may be imperfect but enough to train a simple model.



## 🔹 Transfer Learning with Pretrained Models 🧠

* Use models like BERT, trained on large text corpora.
* Fine-tune on small labeled data (which you may label yourself).
* Reduces the need for large labeled datasets.



# 2️⃣ Rule-Based Sentiment Labeling (Lexicon Approach) 📚

## How it works:

* Use a **sentiment lexicon**: a dictionary of words with sentiment scores.
* Scan text for these words, sum their scores.
* Assign sentiment based on total score (positive, negative, neutral).



## Example Lexicon

| Word     | Sentiment Score |
| -------- | --------------- |
| happy    | +1              |
| good     | +1              |
| bad      | -1              |
| terrible | -2              |



## Pros 👍

* No need for labeled data!
* Simple and interpretable.
* Useful when you know the domain vocabulary.

## Cons 👎

* Doesn’t capture context, sarcasm, or complex sentiments.
* Less accurate than ML models.
* Misses sentiment when the words are not in the lexicon.



## Simple Python Example:

```python
import re

# Sentiment lexicon dictionary
sentiment_lexicon = {
    "happy": 1,
    "good": 1,
    "bad": -1,
    "terrible": -2,
    # Add more words as needed
}

def assign_sentiment(text):
    # Pull out lowercase word tokens so that "happy!" still matches "happy"
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(sentiment_lexicon.get(word, 0) for word in words)
    if score > 0:
        return "Positive 😊"
    elif score < 0:
        return "Negative 😠"
    else:
        return "Neutral 😐"

# Example Usage
print(assign_sentiment("I am happy and good today"))      # Positive 😊
print(assign_sentiment("This is a terrible and bad day")) # Negative 😠
```



# 3️⃣ Summary Table for No-Label Scenarios

| Situation                              | Recommended Approach                     |
| -------------------------------------- | ---------------------------------------- |
| No labeled data, want sentiment        | Manual labeling or weak supervision      |
| Want to find patterns without labels   | Unsupervised clustering                  |
| Small labeled set available            | Semi-supervised learning + fine-tuning   |
| Want high accuracy with minimal labels | Transfer learning with pretrained models |





---

# Rule-Based Sentiment Assignment Using Word Lists

## Concept:

* Prepare **3 lists**: `positive_words`, `negative_words`, `neutral_words`
* For each review (sentence/text), count how many words from each list appear.
* Assign sentiment based on the highest count category.



## Example Python code:

```python
# Define word lists
positive_words = ["good", "happy", "excellent", "love", "great", "awesome"]
negative_words = ["bad", "terrible", "hate", "poor", "awful", "worst"]
neutral_words = ["okay", "fine", "average", "normal", "mediocre"]

def assign_sentiment(review):
    words = review.lower().split()
    
    # Count matches for each sentiment list
    pos_count = sum(word in positive_words for word in words)
    neg_count = sum(word in negative_words for word in words)
    neu_count = sum(word in neutral_words for word in words)
    
    # Determine the sentiment based on max count
    if pos_count > neg_count and pos_count > neu_count:
        return "Positive 😊"
    elif neg_count > pos_count and neg_count > neu_count:
        return "Negative 😠"
    elif neu_count > pos_count and neu_count > neg_count:
        return "Neutral 😐"
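    else:
        return "Mixed / Uncertain 🤔"

# Example usage
print(assign_sentiment("great product and i love it"))  # Positive 😊
print(assign_sentiment("the worst purchase ever"))       # Negative 😠
print(assign_sentiment("it works"))                      # Mixed / Uncertain 🤔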
```


## Explanation:

* You scan each review’s words against your three lists.
* Whichever list has the highest word count "wins" and sets the sentiment.
* If there's a tie or no clear majority, you assign **Mixed / Uncertain**.



## Advantages

* Easy to customize word lists for your specific domain (e.g., Amazon reviews).
* No training data needed.
* Transparent and explainable.


## Limitations

* Context or negation (e.g., "not good") isn’t handled.
* May miss nuances or sarcasm.
* Limited vocabulary coverage unless your lists are big and well-curated.
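
The negation limitation can be softened with a simple heuristic. The sketch below is one possible approach, not part of the original word-list method: it flips the sign of the first sentiment word found within three tokens after a negation. The `negations` set, the window size, and the function name `score_with_negation` are assumptions chosen for illustration.

```python
import re

positive_words = {"good", "happy", "excellent", "love", "great", "awesome"}
negative_words = {"bad", "terrible", "hate", "poor", "awful", "worst"}
negations = {"not", "no", "never", "cannot"}  # hypothetical list, extend as needed

def score_with_negation(review):
    """Sum +1/-1 per sentiment word, flipping the first one that follows a negation."""
    tokens = re.findall(r"[a-z']+", review.lower())
    score = 0
    window = 0  # how many upcoming tokens a negation can still reach
    for token in tokens:
        if token in negations or token.endswith("n't"):
            window = 3
            continue
        value = (token in positive_words) - (token in negative_words)
        if value:
            score += -value if window > 0 else value
            window = 0  # a negation flips only one sentiment word
        elif window > 0:
            window -= 1
    return score

print(score_with_negation("not good"))                 # -1
print(score_with_negation("this is not a good idea"))  # -1
print(score_with_negation("isn't bad"))                # 1
```

This still misses sarcasm and long-range negation; it only covers the most common "not X" patterns.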



In [34]:
import pandas as pd

data = []
with open(r"C:\Users\Nagesh Agrawal\OneDrive\Desktop\6_MACHINE LEARNING\3__NATURAL LANGUAGE PROCESSING\NLP_DATASETS\SENTIMENT ANALYSIS DATA\train.ft.txt\train.ft.txt", "r", encoding="utf-8") as file:
    for line in file:
        label, text = line.split(' ', 1)
        data.append([label, text.strip()])

# Convert to DataFrame
DATA = pd.DataFrame(data, columns=["label", "text"])

# Select the first 500 rows (copy so that later column assignments do not act on a slice)
DATA = DATA.iloc[:500].copy()

DATA


Unnamed: 0,label,text
0,__label__2,Stuning even for the non-gamer: This sound tra...
1,__label__2,The best soundtrack ever to anything.: I'm rea...
2,__label__2,Amazing!: This soundtrack is my favorite music...
3,__label__2,Excellent Soundtrack: I truly like this soundt...
4,__label__2,"Remember, Pull Your Jaw Off The Floor After He..."
...,...,...
495,__label__2,"If you don't have this, get it!: This book is ..."
496,__label__2,Read the angry anti-Hazlitt reviews for a real...
497,__label__2,Give this book to a your children.: Economics ...
498,__label__2,ECONOMICS IN ONE LESSON: Economics in One Less...


In [35]:
DATA["label"].value_counts()

label
__label__2    253
__label__1    247
Name: count, dtype: int64
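
The `__label__N` prefix is the fastText text-classification layout. A small sketch of turning those prefixes into readable names is shown below. It runs on two made-up lines rather than the real file, and the mapping `__label__1` → negative, `__label__2` → positive is an assumption that should be checked against the dataset description.

```python
import io

# Two made-up lines in the same "__label__N text" layout as train.ft.txt
sample = io.StringIO(
    "__label__2 Great book: I could not put it down.\n"
    "__label__1 Broke in a week: very disappointed.\n"
)

# Assumption: __label__1 marks negative reviews and __label__2 positive ones
label_names = {"__label__1": "negative", "__label__2": "positive"}

rows = []
for line in sample:
    # Split only on the first space: the label comes first, the rest is the review
    label, text = line.rstrip("\n").split(" ", 1)
    rows.append({"label": label_names.get(label, "unknown"), "text": text})

print(rows[0])
```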

In [36]:
STRING=DATA["text"][0]
STRING

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

## TEXT PREPROCESSING 🧑‍🏫



### LOWER CASE CONVERSION

In [38]:
DATA["text"] = DATA["text"].str.lower()

'''OR'''

STRING = STRING.lower()
STRING

'stuning even for the non-gamer: this sound track was beautiful! it paints the senery in your mind so well i would recomend it even to people who hate vid. game music! i have played the game chrono cross but out of all of the games i have ever played it has the best music! it backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. it would impress anyone who cares to listen! ^_^'

### TRIMMING EXTRA SPACES

In [39]:
DATA["text"] = DATA["text"].str.strip()     # removes spaces at the start and end of each string
DATA["text"] = DATA["text"].str.replace(r'\s+', ' ', regex=True)  # replaces multiple spaces with a single space

'''OR'''


import re

STRING = STRING.strip()                     # remove leading/trailing spaces
STRING = re.sub(r'\s+', ' ', STRING)                # replace multiple spaces with one
STRING

'stuning even for the non-gamer: this sound track was beautiful! it paints the senery in your mind so well i would recomend it even to people who hate vid. game music! i have played the game chrono cross but out of all of the games i have ever played it has the best music! it backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. it would impress anyone who cares to listen! ^_^'

### REMOVE PUNCTUATION MARKS

In [40]:
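import re
import string

# re.escape keeps characters such as ], ^ and \ literal inside the character class
DATA["text"] = DATA["text"].str.replace(f"[{re.escape(string.punctuation)}]", "", regex=True)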

'''OR'''

STRING = STRING.translate(str.maketrans("","",string.punctuation))
STRING

'stuning even for the nongamer this sound track was beautiful it paints the senery in your mind so well i would recomend it even to people who hate vid game music i have played the game chrono cross but out of all of the games i have ever played it has the best music it backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras it would impress anyone who cares to listen '

In [41]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

**`str.maketrans()`** builds a translation table: it says which characters get replaced by which, and which ones get deleted.

```python
str.maketrans('', '', string.punctuation)
```

* `''` → nothing to replace
* `''` → nothing to replace it with
* `string.punctuation` → delete all of these (`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``)

✅ **Short:**
`maketrans` and `translate` work together: **build a character mapping, then replace or remove characters according to it**.

To remove punctuation:

```python
text.translate(str.maketrans('', '', string.punctuation))
```

| Function      | Use                                              |
| ------------- | ------------------------------------------------ |
| `maketrans()` | Builds the mapping of **what to replace/remove** |
| `translate()` | **Applies** that mapping **to the string**       |
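
A short sketch of the three ways `str.maketrans` can be used may help here. The sample strings are made up. The last variant, which replaces punctuation with spaces instead of deleting it, is an alternative worth considering because it keeps words such as "non-gamer" from being fused into "nongamer".

```python
import string

# Third argument: characters to delete
remove_punct = str.maketrans("", "", string.punctuation)
print("Hello, world!!! (test)".translate(remove_punct))  # Hello world test

# First two arguments: each character in the first string is mapped to the
# character at the same position in the second string
swap = str.maketrans("-_", "  ")
print("non-gamer sound_track".translate(swap))            # non gamer sound track

# Map every punctuation mark to a space instead of deleting it
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
print("non-gamer".translate(to_space))                    # non gamer
```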



### REMOVE DIGITS (NUMBERS)

In [42]:
DATA["text"] = DATA["text"].str.replace(r'\d+', '', regex=True)

'''OR'''
import re
STRING = re.sub(r'\d+', '', STRING)
STRING

'stuning even for the nongamer this sound track was beautiful it paints the senery in your mind so well i would recomend it even to people who hate vid game music i have played the game chrono cross but out of all of the games i have ever played it has the best music it backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras it would impress anyone who cares to listen '

### ✅ `re.sub()` — Easy Explanation

`re.sub()` is used to **find and replace** using a **pattern** (regex).


### 📌 Syntax:

```python
re.sub(pattern, replacement, string)
```

* `pattern` → what to search for (digits, spaces, etc.)
* `replacement` → what to put in its place
* `string` → the string to work on
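
A few variations on the same call, as a quick sketch on a made-up sentence. The optional `count` argument and a function as the replacement are both part of the standard `re.sub` signature.

```python
import re

text = "Order 66  shipped in 3 days"

print(re.sub(r"\d+", "<NUM>", text))      # Order <NUM>  shipped in <NUM> days
print(re.sub(r"\s+", " ", text))          # Order 66 shipped in 3 days
print(re.sub(r"\d+", "", text, count=1))  # removes only the first number

# The replacement can also be a function that receives each match object
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)
print(doubled)                            # Order 132  shipped in 6 days
```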

---

### REMOVE EMOJIS

In [43]:
import emoji

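# Note: demojize() does not delete emojis; it converts each one to a text alias such as ":thumbs_up:".
# To delete them instead, emoji 2.x offers emoji.replace_emoji(text, replace="").
DATA["text"] = DATA["text"].apply(emoji.demojize)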

'''OR'''

STRING = emoji.demojize(STRING)
STRING

'stuning even for the nongamer this sound track was beautiful it paints the senery in your mind so well i would recomend it even to people who hate vid game music i have played the game chrono cross but out of all of the games i have ever played it has the best music it backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras it would impress anyone who cares to listen '

### REMOVE EMAILS OR EXTRACT EMAIL

In [44]:
EMAILS = DATA["text"].str.findall(r'\b[\w.-]+?@\w+?\.\w+?\b')
EMAILS

'''OR'''

emails = re.findall(r'\b[\w.-]+?@\w+?\.\w+?\b', STRING)

In [45]:
DATA["text"] = DATA["text"].str.replace(r'\b[\w.-]+?@\w+?\.\w+?\b', '', regex=True)

'''OR'''
import re

STRING = re.sub(r'\b[\w.-]+?@\w+?\.\w+?\b', '', STRING)

## ✅ Regex Pattern:

```
r'\b[\w.-]+?@\w+?\.\w+?\b'
```

---

## 🔍 Piece-by-Piece Breakdown:

| Part       | Meaning                                                                                                          |
| ---------- | ---------------------------------------------------------------------------------------------------------------- |
| `r'...'`   | A Python raw string, so `\` reaches the regex engine as-is instead of being treated as a string escape           |
| `\b`       | Word boundary: no other word characters may be attached in front of the email                                    |
| `[\w.-]+?` | The username part of the email: letters (`a-z`, `A-Z`), numbers (`0-9`), underscore `_`, dot `.`, or dash `-`    |
| `@`        | The `@` in the middle of the email; it must be present                                                           |
| `\w+?`     | The domain name, e.g. `gmail`, `yahoo`, `company`                                                                |
| `\.`       | A literal dot, as in `.com`, `.org`                                                                              |
| `\w+?`     | The domain extension, e.g. `com`, `org`, `in`                                                                    |
| `\b`       | Another word boundary: nothing else may be attached after the email                                              |



---

## ✅ What Does `\w` Mean?

### `\w` stands for:

**"word character"**, and for ASCII text it matches:

```
[a-zA-Z0-9_]
```

### That is:

* **a to z**
* **A to Z**
* **0 to 9**
* **underscore `_`**

> For ASCII text it matches just these four kinds of characters. On Python `str` patterns, `\w` also matches Unicode letters and digits unless the `re.ASCII` flag is set.



## 🔍 So what does `[\w.-]` mean?

It is a **character set**. It means:

> Match **any character that is `\w`, `.` or `-`**
> That is:

```
[a-zA-Z0-9_.-]
```

### 🧠 Example:

```python
re.findall(r'[\w.-]+', "hello.user-123")
```

This returns:

```
['hello.user-123']
```



## ✅ Then What Is `+?`

* `+` means: **repeat one or more times**
* `?` after it means: **non-greedy** (take as few characters as possible)

So:

```python
[\w.-]+?
```

takes as few characters as possible, but **as many as the rest of the pattern needs** (here, the username part of the email).
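
The difference between greedy and non-greedy matching is easiest to see side by side. This is a small sketch on made-up strings; the last call reuses the email pattern from this section to show that the trailing `\b` makes the lazy parts keep expanding until a whole address is covered.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)   # one match, from the first "<" to the last ">"
lazy = re.findall(r"<.+?>", html)    # stops at the first ">" each time
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

pattern = r"\b[\w.-]+?@\w+?\.\w+?\b"
found = re.findall(pattern, "mail john.doe@example.com or sales@shop.org today")
print(found)   # ['john.doe@example.com', 'sales@shop.org']
```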

---


### REMOVE URL & HTML TAGS

In [46]:
DATA["text"] = DATA["text"].str.replace(r'https?://\S+|www\.\S+', '', regex=True)
DATA["text"] = DATA["text"].str.replace(r'<.*?>', '', regex=True)

'''OR'''

import re

STRING = re.sub(r'https?://\S+|www\.\S+', '', STRING)
STRING = re.sub(r'<.*?>', '', STRING)

| Part        | What does it do?                                                                  |
| ----------- | --------------------------------------------------------------------------------- |
| `https?://` | Finds links that start with **http or https** (`s?` means *the s is optional*)    |
| `\S+`       | Takes **all the non-space characters** that follow, up to the end of the link     |
| `\|`        | Means **or**                                                                      |
| `www\.\S+`  | Also catches links that start with **www.**                                       |


| Part  | What does it mean?                                                          |
| ----- | --------------------------------------------------------------------------- |
| `<`   | Finds the **start** of a tag                                                |
| `.*?` | **Anything at all**, but taking the **shortest possible match** (non-greedy) |
| `>`   | Finds the **end** of a tag                                                  |
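
One thing to watch in the cells above: punctuation is stripped before emails, URLs and HTML tags are searched for, so by then the `@`, `:`, `/`, `<` and `.` characters those patterns rely on are already gone. The sketch below shows one possible ordering that avoids this. The function name `clean_text` and the sample review are made up, and the email pattern is a simplified greedy variant of the one used earlier.

```python
import re
import string

def clean_text(text):
    """Apply the cleaning steps in an order that keeps URL/HTML/email patterns usable."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)     # URLs first
    text = re.sub(r"<.*?>", " ", text)                     # HTML tags
    text = re.sub(r"\b[\w.-]+@[\w-]+\.\w+\b", " ", text)   # email addresses
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)                        # digits
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace last

raw = "Great phone!!! See https://example.com/review or mail me@site.org <br> 5 STARS"
print(clean_text(raw))  # great phone see or mail stars
```

Collapsing whitespace at the very end also cleans up the gaps left behind by the earlier removals.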


### EXPAND CONTRACTION

In [47]:
! pip install contractions



In [48]:
import contractions # (e.g., don't -> do not)

DATA["text"] = DATA["text"].apply(lambda x: contractions.fix(x))

'''OR'''

STRING = contractions.fix(STRING)
STRING

'stuning even for the nongamer this sound track was beautiful it paints the senery in your mind so well i would recomend it even to people who hate vid game music i have played the game chrono cross but out of all of the games i have ever played it has the best music it backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras it would impress anyone who cares to listen '

### TEXT IS CLEANED; NOW WE MOVE ON TO FURTHER TEXT PROCESSING


# 🔍 **What is NLTK?**

**NLTK (Natural Language Toolkit)** is a Python library for processing natural (human) language.

---

## 🧠 Use Case Examples:

| Task                                      | NLTK Feature        |
| ----------------------------------------- | ------------------- |
| Splitting text into sentences or words    | `tokenize`          |
| Identifying parts of speech (noun, verb)  | `pos_tag`           |
| Finding the root of a word                | `stem`, `lemmatize` |
| Looking up meanings and synonyms          | `wordnet`           |
| Removing stopwords (e.g., is, the, a)     | `stopwords`         |
| Sentiment Analysis                        | (Basic)             |
| Text Classification                       | (Naive Bayes, etc.) |



# 📦 Installations & Download Essentials

```bash
pip install nltk
```

```python
import nltk
nltk.download('punkt')        # For tokenization
nltk.download('stopwords')    # For removing common words
nltk.download('wordnet')      # For lemmatization
nltk.download('averaged_perceptron_tagger')  # For POS tagging
nltk.download('omw-1.4')      # For wordnet lemmatizer
```

## 📌 Summary Table

| Task                   | Function            |
| ---------------------- | ------------------- |
| Tokenize words         | `word_tokenize()`   |
| Tokenize sentences     | `sent_tokenize()`   |
| Stem                   | `PorterStemmer`     |
| Lemmatize              | `WordNetLemmatizer` |
| Remove stopwords       | `stopwords.words()` |
| POS Tagging            | `pos_tag()`         |
| Synonyms/Word meanings | `wordnet.synsets()` |



## 🏁 Conclusion:

NLTK is 🔥 powerful:

* Best for **educational, lightweight** NLP
* Easy for **text cleaning, processing, and tokenization**
* Great tool before jumping into spaCy or Transformers





### ✅ **Text Preprocessing Flow**

```markdown
📌 **TEXT PREPROCESSING PIPELINE FOR SENTIMENT ANALYSIS**

📝 RAW TEXT  
   ↓  
✂️ TOKENIZATION  
   → Break sentence into words  
   ["I", "am", "loving", "this", "movie"]

   ↓  
🗑️ STOPWORD REMOVAL  
   → Remove common words (e.g., "I", "am", "this")  
   ["loving", "movie"]

   ↓  
🪄 LEMMATIZATION / STEMMING  
   → Reduce words to base/root form  
   ["love", "movie"]

   ↓  
🔗 REJOIN TOKENS  
   → Join cleaned words back to sentence  
   "love movie"

   ↓  
✅ CLEANED TEXT READY FOR MODEL
```

---



### TOKENIZATION

## 📥 **Why do we need `nltk.download()`?**

🧠 **Some NLTK tools need extra data to run**, such as:

* 📚 Word lists (stopwords)
* 🧩 Tokenizer models (punkt)
* 📖 Dictionaries (wordnet)
* 🏷️ Grammar taggers (pos tagger)



💡 **This data is not bundled with the NLTK package**, so **it has to be downloaded separately**, but only once!



⚠️ Without these downloads, functions such as:
`word_tokenize()`, `stopwords.words()`, `WordNetLemmatizer()`, `pos_tag()`
**will not work and will raise an error!**


In [49]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Nagesh
[nltk_data]     Agrawal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [50]:
from nltk.tokenize import word_tokenize, sent_tokenize

DATA["TOKENS"] = DATA["text"].apply(word_tokenize)

'''OR'''

STRING_TOKENS = word_tokenize(STRING)
STRING_TOKENS

['stuning',
 'even',
 'for',
 'the',
 'nongamer',
 'this',
 'sound',
 'track',
 'was',
 'beautiful',
 'it',
 'paints',
 'the',
 'senery',
 'in',
 'your',
 'mind',
 'so',
 'well',
 'i',
 'would',
 'recomend',
 'it',
 'even',
 'to',
 'people',
 'who',
 'hate',
 'vid',
 'game',
 'music',
 'i',
 'have',
 'played',
 'the',
 'game',
 'chrono',
 'cross',
 'but',
 'out',
 'of',
 'all',
 'of',
 'the',
 'games',
 'i',
 'have',
 'ever',
 'played',
 'it',
 'has',
 'the',
 'best',
 'music',
 'it',
 'backs',
 'away',
 'from',
 'crude',
 'keyboarding',
 'and',
 'takes',
 'a',
 'fresher',
 'step',
 'with',
 'grate',
 'guitars',
 'and',
 'soulful',
 'orchestras',
 'it',
 'would',
 'impress',
 'anyone',
 'who',
 'cares',
 'to',
 'listen']

### STOPWORD REMOVAL

In [51]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Nagesh
[nltk_data]     Agrawal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [52]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

DATA["TOKEN"] = DATA["TOKENS"].apply(lambda x: [word for word in x if word not in stop_words])


'''OR'''


STRING_FILTERED_TOKENS  = [word for word in STRING_TOKENS if word not in stop_words]
STRING_FILTERED_TOKENS

['stuning',
 'even',
 'nongamer',
 'sound',
 'track',
 'beautiful',
 'paints',
 'senery',
 'mind',
 'well',
 'would',
 'recomend',
 'even',
 'people',
 'hate',
 'vid',
 'game',
 'music',
 'played',
 'game',
 'chrono',
 'cross',
 'games',
 'ever',
 'played',
 'best',
 'music',
 'backs',
 'away',
 'crude',
 'keyboarding',
 'takes',
 'fresher',
 'step',
 'grate',
 'guitars',
 'soulful',
 'orchestras',
 'would',
 'impress',
 'anyone',
 'cares',
 'listen']
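
One caveat for sentiment work: NLTK's English stopword list includes negation words such as "not" and "no", so removing every stopword can erase exactly the words that flip a review's meaning. A minimal sketch of keeping them is shown below. The tiny `stop_words` set is a hypothetical stand-in so the example runs without NLTK; in the notebook it would be `set(stopwords.words('english'))`.

```python
# Hypothetical mini stopword list standing in for NLTK's English list
stop_words = {"i", "do", "not", "this", "is", "a", "the", "it", "no"}

# Negation words that are usually worth keeping for sentiment analysis
keep = {"not", "no", "never", "nor"}
sentiment_stop_words = stop_words - keep

tokens = ["i", "do", "not", "like", "this", "book"]

plain = [t for t in tokens if t not in stop_words]
with_negation = [t for t in tokens if t not in sentiment_stop_words]

print(plain)          # ['like', 'book']
print(with_negation)  # ['not', 'like', 'book']
```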

### LEMMATIZATION / STEMMING

In [53]:
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')

[nltk_data] Downloading package omw-1.4 to C:\Users\Nagesh
[nltk_data]     Agrawal\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Nagesh
[nltk_data]     Agrawal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [54]:
from nltk.stem import WordNetLemmatizer

LEMMATIZER = WordNetLemmatizer()

DATA["LEMMATIZED"] = DATA["TOKEN"].apply(lambda x: [LEMMATIZER.lemmatize(word) for word in x])

'''OR'''

STRING_LEMMATIZED_TOKENS = [LEMMATIZER.lemmatize(word) for word in STRING_FILTERED_TOKENS]
STRING_LEMMATIZED_TOKENS

['stuning',
 'even',
 'nongamer',
 'sound',
 'track',
 'beautiful',
 'paint',
 'senery',
 'mind',
 'well',
 'would',
 'recomend',
 'even',
 'people',
 'hate',
 'vid',
 'game',
 'music',
 'played',
 'game',
 'chrono',
 'cross',
 'game',
 'ever',
 'played',
 'best',
 'music',
 'back',
 'away',
 'crude',
 'keyboarding',
 'take',
 'fresher',
 'step',
 'grate',
 'guitar',
 'soulful',
 'orchestra',
 'would',
 'impress',
 'anyone',
 'care',
 'listen']

In [55]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
DATA["STEMMED"] = DATA["TOKEN"].apply(lambda x: [stemmer.stem(word) for word in x])

'''OR'''

STRING_STEMMED_TOKENS = [stemmer.stem(word) for word in STRING_FILTERED_TOKENS]

In [56]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("fairies")  # ➡️ "fairi"

DATA["S_STEMMED"] = DATA["TOKEN"].apply(lambda x: [stemmer.stem(word) for word in x])

'''OR'''


STRING_STEMMED_TOKENS = [stemmer.stem(word) for word in STRING_FILTERED_TOKENS]

### DETOKENIZATION (REJOIN TOKEN TO TEXT)


In [57]:
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()

DATA['detokenized_text'] = DATA['LEMMATIZED'].apply(detokenizer.detokenize)


'''OR'''


STRING = detokenizer.detokenize(STRING_LEMMATIZED_TOKENS)



In [58]:
DATA['JOIN_TEXT'] = DATA['LEMMATIZED'].apply(lambda tokens: ' '.join(tokens) + '.')


'''OR'''

STRING = ' '.join(STRING_LEMMATIZED_TOKENS)

In [59]:
DATA

Unnamed: 0,label,text,TOKENS,TOKEN,LEMMATIZED,STEMMED,S_STEMMED,detokenized_text,JOIN_TEXT
0,__label__2,stuning even for the nongamer this sound track...,"[stuning, even, for, the, nongamer, this, soun...","[stuning, even, nongamer, sound, track, beauti...","[stuning, even, nongamer, sound, track, beauti...","[stune, even, nongam, sound, track, beauti, pa...","[stune, even, nongam, sound, track, beauti, pa...",stuning even nongamer sound track beautiful pa...,stuning even nongamer sound track beautiful pa...
1,__label__2,the best soundtrack ever to anything i am read...,"[the, best, soundtrack, ever, to, anything, i,...","[best, soundtrack, ever, anything, reading, lo...","[best, soundtrack, ever, anything, reading, lo...","[best, soundtrack, ever, anyth, read, lot, rev...","[best, soundtrack, ever, anyth, read, lot, rev...",best soundtrack ever anything reading lot revi...,best soundtrack ever anything reading lot revi...
2,__label__2,amazing this soundtrack is my favorite music o...,"[amazing, this, soundtrack, is, my, favorite, ...","[amazing, soundtrack, favorite, music, time, h...","[amazing, soundtrack, favorite, music, time, h...","[amaz, soundtrack, favorit, music, time, hand,...","[amaz, soundtrack, favorit, music, time, hand,...",amazing soundtrack favorite music time hand in...,amazing soundtrack favorite music time hand in...
3,__label__2,excellent soundtrack i truly like this soundtr...,"[excellent, soundtrack, i, truly, like, this, ...","[excellent, soundtrack, truly, like, soundtrac...","[excellent, soundtrack, truly, like, soundtrac...","[excel, soundtrack, truli, like, soundtrack, e...","[excel, soundtrack, truli, like, soundtrack, e...",excellent soundtrack truly like soundtrack enj...,excellent soundtrack truly like soundtrack enj...
4,__label__2,remember pull your jaw off the floor after hea...,"[remember, pull, your, jaw, off, the, floor, a...","[remember, pull, jaw, floor, hearing, played, ...","[remember, pull, jaw, floor, hearing, played, ...","[rememb, pull, jaw, floor, hear, play, game, k...","[rememb, pull, jaw, floor, hear, play, game, k...",remember pull jaw floor hearing played game kn...,remember pull jaw floor hearing played game kn...
...,...,...,...,...,...,...,...,...,...
495,__label__2,if you do not have this get it this book is a ...,"[if, you, do, not, have, this, get, it, this, ...","[get, book, wonderful, illustration, simple, s...","[get, book, wonderful, illustration, simple, s...","[get, book, wonder, illustr, simpl, simpl, go,...","[get, book, wonder, illustr, simpl, simpl, go,...",get book wonderful illustration simple simple ...,get book wonderful illustration simple simple ...
496,__label__2,read the angry antihazlitt reviews for a real ...,"[read, the, angry, antihazlitt, reviews, for, ...","[read, angry, antihazlitt, reviews, real, indi...","[read, angry, antihazlitt, review, real, indic...","[read, angri, antihazlitt, review, real, indic...","[read, angri, antihazlitt, review, real, indic...",read angry antihazlitt review real indication ...,read angry antihazlitt review real indication ...
497,__label__2,give this book to a your children economics in...,"[give, this, book, to, a, your, children, econ...","[give, book, children, economics, one, lesson,...","[give, book, child, economics, one, lesson, el...","[give, book, children, econom, one, lesson, el...","[give, book, children, econom, one, lesson, el...",give book child economics one lesson elucidate...,give book child economics one lesson elucidate...
498,__label__2,economics in one lesson economics in one lesso...,"[economics, in, one, lesson, economics, in, on...","[economics, one, lesson, economics, one, lesso...","[economics, one, lesson, economics, one, lesso...","[econom, one, lesson, econom, one, lesson, ano...","[econom, one, lesson, econom, one, lesson, ano...",economics one lesson economics one lesson anot...,economics one lesson economics one lesson anot...


## SENTIMENT RECOGNITION

## 1. **TextBlob**

* A simple library that works from pre-built rules and word polarities.
* Returns a basic polarity score that can be mapped to Positive, Negative, or Neutral.
* Beginner-friendly!

**Explanation:**

Polarity > 0 means positive, < 0 means negative, 0 means neutral.



In [60]:
from textblob import TextBlob

text = "I love this product! It is amazing 😊"
blob = TextBlob(text)
print("Sentiment polarity:", blob.sentiment.polarity)  # -1 to +1 (negative to positive)
print("Sentiment subjectivity:", blob.sentiment.subjectivity)  # 0 to 1 (fact to opinion)

Sentiment polarity: 0.6125
Sentiment subjectivity: 0.75


In [None]:
# 🧠  Define a function to get polarity
def get_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity

# 📊  Apply the function to the 'detokenized_text' column
DATA["TEXTBLOB_SENTIMENT_SCORE"] = DATA["detokenized_text"].apply(get_sentiment)

# 😄 Add sentiment label (positive, negative, neutral)
def label_sentiment(score):
    if score > 0:
        return "Positive 😊"
    elif score < 0:
        return "Negative 😞"
    else:
        return "Neutral 😐"

DATA["TEXTBLOB_SENTIMENTS"] = DATA["TEXTBLOB_SENTIMENT_SCORE"].apply(label_sentiment)
DATA.head()

# 🧠 DEFINE FUNCTION TO GET SUBJECTIVE SCORE !!!

Unnamed: 0,label,text,TOKENS,TOKEN,LEMMATIZED,STEMMED,S_STEMMED,detokenized_text,JOIN_TEXT,TEXTBLOB_SENTIMENT_SCORE,TEXTBLOB_SENTIMENTS
0,__label__2,stuning even for the nongamer this sound track...,"[stuning, even, for, the, nongamer, this, soun...","[stuning, even, nongamer, sound, track, beauti...","[stuning, even, nongamer, sound, track, beauti...","[stune, even, nongam, sound, track, beauti, pa...","[stune, even, nongam, sound, track, beauti, pa...",stuning even nongamer sound track beautiful pa...,stuning even nongamer sound track beautiful pa...,-0.045,Negative 😞
1,__label__2,the best soundtrack ever to anything i am read...,"[the, best, soundtrack, ever, to, anything, i,...","[best, soundtrack, ever, anything, reading, lo...","[best, soundtrack, ever, anything, reading, lo...","[best, soundtrack, ever, anyth, read, lot, rev...","[best, soundtrack, ever, anyth, read, lot, rev...",best soundtrack ever anything reading lot revi...,best soundtrack ever anything reading lot revi...,0.29375,Positive 😊
2,__label__2,amazing this soundtrack is my favorite music o...,"[amazing, this, soundtrack, is, my, favorite, ...","[amazing, soundtrack, favorite, music, time, h...","[amazing, soundtrack, favorite, music, time, h...","[amaz, soundtrack, favorit, music, time, hand,...","[amaz, soundtrack, favorit, music, time, hand,...",amazing soundtrack favorite music time hand in...,amazing soundtrack favorite music time hand in...,0.243382,Positive 😊
3,__label__2,excellent soundtrack i truly like this soundtr...,"[excellent, soundtrack, i, truly, like, this, ...","[excellent, soundtrack, truly, like, soundtrac...","[excellent, soundtrack, truly, like, soundtrac...","[excel, soundtrack, truli, like, soundtrack, e...","[excel, soundtrack, truli, like, soundtrack, e...",excellent soundtrack truly like soundtrack enj...,excellent soundtrack truly like soundtrack enj...,0.272727,Positive 😊
4,__label__2,remember pull your jaw off the floor after hea...,"[remember, pull, your, jaw, off, the, floor, a...","[remember, pull, jaw, floor, hearing, played, ...","[remember, pull, jaw, floor, hearing, played, ...","[rememb, pull, jaw, floor, hear, play, game, k...","[rememb, pull, jaw, floor, hear, play, game, k...",remember pull jaw floor hearing played game kn...,remember pull jaw floor hearing played game kn...,0.369841,Positive 😊



---

## 2. **VADER (Valence Aware Dictionary and sEntiment Reasoner)**

* Built specifically for short text such as social media posts.
* Uses a lexicon plus rule-based approach.
* Returns Neutral, Positive, and Negative scores.

**Explanation:**
Look at the `compound` score:

* >= 0.05: positive
* <= -0.05: negative
* anything in between: neutral


In [62]:
! pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [63]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
text = "This movie is awesome! 😍 But some parts were boring..."
scores = analyzer.polarity_scores(text)
print(scores)  # {'neg':..., 'neu':..., 'pos':..., 'compound':...}


{'neg': 0.166, 'neu': 0.562, 'pos': 0.272, 'compound': 0.2244}


In [70]:
# 🧠  Define a function to get VADER compound score
def get_vader_score(text):
    return analyzer.polarity_scores(text)['compound']

# 📊 Apply the function to the 'detokenized_text' column
DATA["VADER_COMPOUND_SCORE"] = DATA["detokenized_text"].apply(get_vader_score)

# 😄 Convert score to sentiment label
def vader_label(score):
    if score >= 0.05:
        return "Positive 😊"
    elif score <= -0.05:
        return "Negative 😞"
    else:
        return "Neutral 😐"

DATA["VADER_SENTIMENTS"] = DATA["VADER_COMPOUND_SCORE"].apply(vader_label)


---

## 3. **Pretrained Deep Learning Models (BERT)**

* Models like BERT can be fine-tuned or used directly.
* They take context into account, which gives better accuracy, but they are heavy to run.
* A strong choice for transfer learning!

**(using HuggingFace transformers):**



In [65]:
! pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.19.0-py3-none-any.whl.metadata (1.8 kB)
Downloading tf_keras-2.19.0-py3-none-any.whl (1.7 MB)

In [None]:
# 😎 Let's apply Hugging Face's pipeline("sentiment-analysis") (like BERT-based models)
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I am very happy today! 😊")
print(result)  # [{'label': 'POSITIVE', 'score': 0.999}]

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.







[{'label': 'POSITIVE', 'score': 0.9998797178268433}]


In [74]:
# 🧠  Apply the classifier to each detokenized_text
DATA["HUGGING_FACE_RESULT"] = DATA["detokenized_text"].apply(lambda x: classifier(x)[0])  # Each result is a dictionary

# 🔍 Extract label and score from the result
DATA["HUGGING_FACE_LABEL"] = DATA["HUGGING_FACE_RESULT"].apply(lambda x: x['label'])
DATA["CONFIDENCE"] = DATA["HUGGING_FACE_RESULT"].apply(lambda x: round(x['score'], 3))

# 🌈 Add emojis to labels
def decorate(label):
    if label == "POSITIVE":
        return "Positive 😊"
    elif label == "NEGATIVE":
        return "Negative 😞"
    else:
        # Fallback only: the default SST-2 model returns just POSITIVE or NEGATIVE
        return "Neutral 😐"

DATA["HUGGING_FACE_SENTIMENTS"] = DATA["HUGGING_FACE_LABEL"].apply(decorate)

DATA.head()


Unnamed: 0,label,text,TOKENS,TOKEN,LEMMATIZED,STEMMED,S_STEMMED,detokenized_text,JOIN_TEXT,TEXTBLOB_SENTIMENT_SCORE,TEXTBLOB_SENTIMENTS,VADER_COMPOUND_SCORE,VADER_SENTIMENTS,HUGGING_FACE_RESULT,HUGGING_FACE_LABEL,CONFIDENCE,HUGGING_FACE_SENTIMENTS
0,__label__2,stuning even for the nongamer this sound track...,"[stuning, even, for, the, nongamer, this, soun...","[stuning, even, nongamer, sound, track, beauti...","[stuning, even, nongamer, sound, track, beauti...","[stune, even, nongam, sound, track, beauti, pa...","[stune, even, nongam, sound, track, beauti, pa...",stuning even nongamer sound track beautiful pa...,stuning even nongamer sound track beautiful pa...,-0.045,Negative 😞,0.9136,Positive 😊,"{'label': 'POSITIVE', 'score': 0.6898343563079...",POSITIVE,0.69,Positive 😊
1,__label__2,the best soundtrack ever to anything i am read...,"[the, best, soundtrack, ever, to, anything, i,...","[best, soundtrack, ever, anything, reading, lo...","[best, soundtrack, ever, anything, reading, lo...","[best, soundtrack, ever, anyth, read, lot, rev...","[best, soundtrack, ever, anyth, read, lot, rev...",best soundtrack ever anything reading lot revi...,best soundtrack ever anything reading lot revi...,0.29375,Positive 😊,0.9559,Positive 😊,"{'label': 'POSITIVE', 'score': 0.989570140838623}",POSITIVE,0.99,Positive 😊
2,__label__2,amazing this soundtrack is my favorite music o...,"[amazing, this, soundtrack, is, my, favorite, ...","[amazing, soundtrack, favorite, music, time, h...","[amazing, soundtrack, favorite, music, time, h...","[amaz, soundtrack, favorit, music, time, hand,...","[amaz, soundtrack, favorit, music, time, hand,...",amazing soundtrack favorite music time hand in...,amazing soundtrack favorite music time hand in...,0.243382,Positive 😊,0.9887,Positive 😊,"{'label': 'POSITIVE', 'score': 0.99959796667099}",POSITIVE,1.0,Positive 😊
3,__label__2,excellent soundtrack i truly like this soundtr...,"[excellent, soundtrack, i, truly, like, this, ...","[excellent, soundtrack, truly, like, soundtrac...","[excellent, soundtrack, truly, like, soundtrac...","[excel, soundtrack, truli, like, soundtrack, e...","[excel, soundtrack, truli, like, soundtrack, e...",excellent soundtrack truly like soundtrack enj...,excellent soundtrack truly like soundtrack enj...,0.272727,Positive 😊,0.9808,Positive 😊,"{'label': 'POSITIVE', 'score': 0.9985697269439...",POSITIVE,0.999,Positive 😊
4,__label__2,remember pull your jaw off the floor after hea...,"[remember, pull, your, jaw, off, the, floor, a...","[remember, pull, jaw, floor, hearing, played, ...","[remember, pull, jaw, floor, hearing, played, ...","[rememb, pull, jaw, floor, hear, play, game, k...","[rememb, pull, jaw, floor, hear, play, game, k...",remember pull jaw floor hearing played game kn...,remember pull jaw floor hearing played game kn...,0.369841,Positive 😊,0.9831,Positive 😊,"{'label': 'POSITIVE', 'score': 0.9981192946434...",POSITIVE,0.998,Positive 😊
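
Row 0 above shows that the methods can disagree: TextBlob scores the review slightly negative, while VADER and the Hugging Face model call it positive. One way to compare them is to check each method against the dataset's own `label` column. The sketch below assumes `__label__2` means a positive review and `__label__1` a negative one (an assumption about this dataset's encoding, so verify it on your data). It uses small made-up lists in place of the real `DATA` columns, and the helper names `to_gold`, `strip_emoji`, and `accuracy` are illustrative.

```python
# Assumed label encoding (not confirmed by the notebook):
# __label__2 = positive review, __label__1 = negative review.
GOLD_MAP = {"__label__1": "Negative", "__label__2": "Positive"}

def to_gold(raw_label):
    """Map a raw dataset label to 'Positive' or 'Negative'."""
    return GOLD_MAP[raw_label]

def strip_emoji(pred):
    """Reduce 'Positive 😊' to 'Positive' so it can be compared with the gold label."""
    return pred.split()[0]

def accuracy(gold, predicted):
    """Fraction of rows where the prediction matches the gold label."""
    matches = sum(g == strip_emoji(p) for g, p in zip(gold, predicted))
    return matches / len(gold)

# Toy stand-ins for DATA["label"], DATA["TEXTBLOB_SENTIMENTS"], DATA["VADER_SENTIMENTS"]
labels = ["__label__2", "__label__2", "__label__1", "__label__2"]
textblob_preds = ["Negative 😞", "Positive 😊", "Negative 😞", "Negative 😞"]
vader_preds = ["Positive 😊", "Positive 😊", "Positive 😊", "Positive 😊"]

gold = [to_gold(raw) for raw in labels]
textblob_acc = accuracy(gold, textblob_preds)
vader_acc = accuracy(gold, vader_preds)
print("TextBlob accuracy:", textblob_acc)
print("VADER accuracy:", vader_acc)
```

Because the gold labels here are only positive or negative, any "Neutral 😐" prediction counts as a miss in this sketch.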



---

# When you have **no labeled data** 😟

## 1. **Manual Labeling / Annotation**

* Humans label the sentences themselves.
* Time-consuming, but gives the best quality.



## 2. **Unsupervised or Semi-Supervised Methods**

* When there is no labeled data, these algorithms learn from patterns in the text.
* Options include clustering and sentiment lexicons.



## 3. **Weak Supervision / Distant Supervision**

* Generate labels from indirect signals (such as hashtags or emojis).
* Example: if a tweet contains "😊", give it a positive label.
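
The emoji rule above can be sketched in a few lines. This is a minimal illustration, assuming two small hand-picked emoji sets (`POSITIVE_EMOJIS`, `NEGATIVE_EMOJIS`) and a hypothetical helper `weak_label`; a real project would curate the signals more carefully and expect the resulting labels to be noisy.

```python
# Hypothetical emoji signals chosen for illustration only.
POSITIVE_EMOJIS = {"😊", "😍", "😄", "👍"}
NEGATIVE_EMOJIS = {"😠", "😞", "😢", "👎"}

def weak_label(text):
    """Return a noisy label from emoji signals, or None when there is no clear signal."""
    has_pos = any(e in text for e in POSITIVE_EMOJIS)
    has_neg = any(e in text for e in NEGATIVE_EMOJIS)
    if has_pos and not has_neg:
        return "Positive"
    if has_neg and not has_pos:
        return "Negative"
    return None  # no emoji, or conflicting emojis: leave the text unlabeled

tweets = [
    "Just got my order 😊",
    "Worst service ever 😠",
    "Meeting moved to 3pm",
    "Loved the food 😍 but the wait 😞",
]
weak_labels = [weak_label(t) for t in tweets]
print(weak_labels)
```

Texts that come back as `None` are simply skipped, so only the tweets with an unambiguous signal end up in the weakly labeled training set.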



## 4. **Transfer Learning with Pretrained Models (BERT)**

* Take an already-trained model and train it a little further on your own data.
* Works even with a small amount of labeled data.



## 5. **Rule-Based Sentiment Labeling (Lexicon Approach)**

* Uses a word list in which each word has a sentiment score.
* Adds up the scores of the words in the text to get the overall sentiment.




---

# **Example: Rule-Based Sentiment Assignment Using Word Lists**

Suppose we have this word list:

```python
lexicon = {
    "good": 1,
    "happy": 2,
    "bad": -2,
    "sad": -1,
    "amazing": 3,
    "boring": -2
}
```

**Python code:**

```python
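def rule_based_sentiment(text, lexicon):
    # Lowercase, split on whitespace, and strip surrounding punctuation
    # so that "amazing!" still matches the lexicon entry "amazing".
    words = [w.strip(".,!?;:") for w in text.lower().split()]
    # Add up the scores of the words found in the lexicon (unknown words count as 0).
    score = sum(lexicon.get(w, 0) for w in words)
    if score > 0:
        return "Positive 😊"
    elif score < 0:
        return "Negative 😞"
    else:
        return "Neutral 😐"

text = "This product is amazing but a bit boring"
# amazing (+3) + boring (-2) = +1, so the result is positive
print(rule_based_sentiment(text, lexicon))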
```
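
A plain word sum misses negation: "not good" still scores +1. The variant below is a minimal sketch of one common workaround, flipping the sign of a lexicon word when the word directly before it is a negator. The `NEGATORS` set, the one-word lookback, and the name `rule_based_sentiment_with_negation` are illustrative assumptions, not part of the example above. The lexicon is repeated so the block runs by itself.

```python
lexicon = {"good": 1, "happy": 2, "bad": -2, "sad": -1, "amazing": 3, "boring": -2}

# Assumed negator list; extend it for your own data.
NEGATORS = {"not", "no", "never"}

def rule_based_sentiment_with_negation(text, lexicon):
    """Return the summed lexicon score, flipping words that follow a negator."""
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    words = [w.strip(".,!?;:") for w in text.lower().split()]
    score = 0
    for i, w in enumerate(words):
        value = lexicon.get(w, 0)
        # Flip the sign when the word directly before is a negator.
        if i > 0 and words[i - 1] in NEGATORS:
            value = -value
        score += value
    return score

not_good = rule_based_sentiment_with_negation("The ending was not good.", lexicon)
mixed = rule_based_sentiment_with_negation("This product is amazing but a bit boring", lexicon)
print(not_good, mixed)
```

This version returns the raw score rather than a label, so the same positive, negative, and neutral cut-offs can be applied afterwards. A one-word lookback will not catch cases such as "not very good".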

