
### **Day 1: Introduction to Sentiment Analysis & Text Processing**

**Theory:**

* What is sentiment analysis? Applications in industry
* Binary vs. multi-class sentiment classification
* Text preprocessing pipeline: tokenization, stopword removal, stemming/lemmatization
* Introduction to bag-of-words and TF-IDF representations

**Hands-on Task:**
1. Install required libraries: nltk, scikit-learn, pandas, numpy, matplotlib
2. Create a simple text preprocessing function
3. Test on sample sentences with varying sentiment


**Deliverable:** Python script with preprocessing function and test cases


# **THEORY:**

---

## **a) What is sentiment analysis?**

Sentiment analysis is the use of NLP to identify, extract, and quantify subjective information, primarily the emotional tone (positive, negative, neutral) expressed in text.

Text is unstructured, so sentiment analysis turns it into structured, actionable insights.
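
To make "quantify" concrete, here is a minimal sketch of a lexicon-based scorer. The word list `TOY_LEXICON` and the function `toy_sentiment` are made up for illustration and are not part of this notebook's pipeline; practical systems usually learn such weights from labeled data or rely on a curated lexicon.

```python
import re

# Hypothetical toy lexicon: word -> polarity score (not taken from any published resource).
TOY_LEXICON = {
    "great": 1, "love": 1, "amazing": 1,
    "terrible": -1, "waste": -1, "worst": -1,
}

def toy_sentiment(text):
    """Return (score, label) by summing the lexicon scores of the words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(TOY_LEXICON.get(word, 0) for word in words)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label

print(toy_sentiment("Great movie, I love it"))          # (2, 'positive')
print(toy_sentiment("Terrible film, a waste of time"))  # (-2, 'negative')
print(toy_sentiment("The film starts at noon"))         # (0, 'neutral')
```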

## **Applications in industry**

* **Brand Monitoring:** Companies track mentions on X/Twitter or Reddit. Example: a consumer brand such as Coca-Cola can monitor real-time sentiment during an ad campaign to detect backlash early.
* **Customer Feedback:** E-commerce platforms such as Amazon analyze product reviews to prioritize improvements.
* **Social Media Crisis Detection:** NGOs or social platforms flag rising negative sentiment in user posts as early distress signals.
* **Market Research:** Political campaigns gauge public opinion on candidates from news/forums.
* **Financial Trading:** Hedge funds analyze earnings call transcripts or tweets for stock sentiment.

During a product launch, for example, a sudden drop in sentiment can trigger a PR response before the problem escalates.

---

## **b) Binary vs. multi-class sentiment classification**

**Binary Classification:**

- Two classes: Positive vs. Negative.
- Simpler and usually more accurate, because there are fewer classes to confuse.
- Example: IMDb movie reviews (pos/neg).
- Best when neutral is rare or irrelevant.

**Multi-class Classification:**

- Three or more classes, e.g. Positive, Negative, Neutral.
- More informative but harder (data imbalance, lower accuracy, needs more labeled data).
- Example: Analyzing X posts for mental health by detecting "neutral" vs. "anxious" vs. "depressed" tones.
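
The difference between the two setups often comes down to how labels are derived. As a minimal sketch, assume reviews carry a 1-5 star rating (an assumption for illustration; the thresholds and the function names `to_binary` and `to_three_class` are hypothetical). The binary scheme discards the middle rating, while the three-class scheme keeps it as "neutral".

```python
def to_binary(stars):
    """Map a 1-5 star rating to a binary label; 3-star reviews are dropped (None)."""
    if stars == 3:
        return None
    return "positive" if stars > 3 else "negative"

def to_three_class(stars):
    """Map a 1-5 star rating to negative / neutral / positive."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

ratings = [1, 2, 3, 4, 5]
print([to_binary(s) for s in ratings])
# ['negative', 'negative', None, 'positive', 'positive']
print([to_three_class(s) for s in ratings])
# ['negative', 'negative', 'neutral', 'positive', 'positive']
```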
---

## **c) Text preprocessing pipeline: tokenization, stopword removal, stemming/lemmatization**

Raw text is noisy (punctuation, capitalization, common words). Preprocessing standardizes it so models focus on meaningful signals.

Standard steps:

1. Lowercasing: "Great" → "great" (reduces vocabulary size).
2. Remove punctuation/numbers/URLs: "Wow!!!" → "Wow".
3. Tokenization: Split into words/tokens.
   Example: "I love NLP!" → ["I", "love", "NLP"] (with punctuation already removed).
4. Stopword removal: Drop high-frequency, low-meaning words.
   * English stopwords: "the", "is", "and", "of", "to".
   * Example: "the movie is great" → "movie great".
   * Caution: standard stopword lists also contain negations such as "not", so "not good" can collapse to "good".
5. Stemming or lemmatization:
   * Stemming: Rule-based suffix stripping (Porter/Snowball). Fast but crude: "running", "runs" → "run", but "studies" → "studi", and the irregular form "ran" is left unchanged.
   * Lemmatization: Dictionary-based (NLTK uses WordNet) and most accurate when given each word's part of speech. Slower but returns real words: "running" (verb) → "run", "better" (adjective) → "good", "mice" → "mouse".

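The stemming behaviour described in step 5 can be checked directly. This is a small sketch using NLTK's `PorterStemmer`, which is pure rule-based code and should not need any extra data download; the word list is chosen only for illustration. Lemmatization is not shown here because `WordNetLemmatizer` depends on the WordNet data being downloaded first.

```python
from nltk.stem import PorterStemmer  # rule-based stemmer; no corpus download required

stemmer = PorterStemmer()
words = ["running", "runs", "ran", "studies"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>8} -> {stem}")
# Regular suffixes are stripped ("running", "runs" -> "run"),
# "studies" becomes the non-word "studi", and irregular "ran" is untouched.
```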
---

## **d) Introduction to bag-of-words and TF-IDF representations**

**Bag-of-Words (BoW):**

* Simplest representation: each text becomes a vector of word counts (word order is ignored).

Example corpus:

* Doc1: "great movie"
* Doc2: "terrible movie"

Vocabulary: ["great", "movie", "terrible"]

BoW vectors:

* Doc1: [1, 1, 0]
* Doc2: [0, 1, 1]

Trade-offs:

* Pros: Simple, interpretable.
* Cons: Ignores word order, weights every word equally, and captures no semantics.
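
The two-document example can be reproduced with the standard library alone. This is a minimal sketch, assuming whitespace tokenization and an alphabetically sorted vocabulary; the helper name `bow_vector` is hypothetical.

```python
from collections import Counter

docs = ["great movie", "terrible movie"]

# Vocabulary: every distinct word in the corpus, sorted for a stable column order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['great', 'movie', 'terrible']
print(vectors)     # [[1, 1, 0], [0, 1, 1]]
```

On a real corpus, scikit-learn's `CountVectorizer` builds the same kind of count matrix and handles tokenization and sparse storage.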



**TF-IDF (Term Frequency–Inverse Document Frequency):**

* Improves on BoW by weighting each count:
  * TF: How often a word appears in a document.
  * IDF: Reduces the weight of words that appear in many documents.
* Example: "movie" appears in both documents → low IDF, so it is downweighted. "great" and "terrible" each appear in only one document → high IDF, so they count for more.
* Result: Rarer, sentiment-bearing words get higher scores.
* A common baseline representation for classic classifiers such as Naive Bayes and logistic regression.
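
The weighting can be worked through by hand on the same two documents. This sketch assumes the textbook definitions TF = count / document length and IDF = log(N / df); other variants exist, and the helper name `tfidf_vector` is hypothetical.

```python
import math
from collections import Counter

docs = ["great movie", "terrible movie"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
df = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}

# Textbook IDF: log(N / df). A word present in every document gets log(1) = 0.
idf = {word: math.log(n_docs / df[word]) for word in vocabulary}

def tfidf_vector(tokens):
    """Return TF-IDF weights for one tokenized document, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) * idf[word] for word in vocabulary]

for doc, tokens in zip(docs, tokenized):
    print(doc, [round(weight, 3) for weight in tfidf_vector(tokens)])
# great movie [0.347, 0.0, 0.0]
# terrible movie [0.0, 0.0, 0.347]
```

Here "movie" is weighted down to zero while "great" and "terrible" keep a positive weight. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ and "movie" keeps a small non-zero weight there.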

# **Hands-On Tasks**

### 1. Install required libraries: nltk, scikit-learn, pandas, numpy, matplotlib

In [1]:
# Installing libraries

!pip install nltk scikit-learn pandas numpy matplotlib --quiet

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

print("Libraries imported successfully!")

Libraries imported successfully!


### 2. Create a simple text preprocessing function

In [2]:
# Download NLTK resources

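nltk.download('punkt')                  # Tokenizer models
nltk.download('punkt_tab', quiet=True)  # Tokenizer tables that newer NLTK releases look up
nltk.download('stopwords')              # English stopwords
nltk.download('wordnet')                # WordNet data used by the lemmatizer
nltk.download('omw-1.4')                # Open Multilingual WordNet data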

print("NLTK data downloaded!")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


NLTK data downloaded!


In [3]:
# Build the Processing Function

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

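def preprocess_text(text,
                    lowercase=True,
                    remove_punct=True,
                    remove_stopwords=True,
                    lemmatize=True):
    """
    Clean a raw string and return the remaining tokens joined by spaces.

    Each flag switches one step on or off for experimentation:
      lowercase        - convert the text to lowercase
      remove_punct     - delete every character that is not a letter or whitespace
      remove_stopwords - drop NLTK English stopwords
      lemmatize        - reduce tokens to WordNet lemmas (tokens are treated as nouns)
    """
    if lowercase:
        text = text.lower()

    if remove_punct:
        # Delete everything except ASCII letters and whitespace.
        # Characters are removed, not replaced by a space, so "can't" becomes "cant".
        text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords (the NLTK list is lowercase, so compare in lowercase)
    if remove_stopwords:
        tokens = [token for token in tokens if token.lower() not in stop_words]

    # Lemmatize
    if lemmatize:
        tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Return as cleaned string
    return ' '.join(tokens)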

print("Function defined!")

Function defined!


### 3. Test on sample sentences with varying sentiment

In [6]:
nltk.download('punkt_tab', quiet=True)

# Diverse sample sentences (movie reviews + social-style)
samples = [
    "This movie was ABSOLUTELY AMAZING!!! I loved every second of it 😍",
    "Terrible film. Complete waste of time and money.",
    "It was okay, nothing special. Not bad, not great.",
    "The acting was brilliant but the plot was so predictable...",
    "Can't believe how bad this was!! Worst movie ever 👎",
    "Not going to lie, I actually enjoyed it more than expected!",
    "The visuals were stunning, but the story felt empty.",
    "What a masterpiece! Everyone should watch this."
]

print("Original vs Cleaned:\n")
for i, sentence in enumerate(samples, 1):
    cleaned = preprocess_text(sentence)
    print(f"{i}. Original: {sentence}")
    print(f"   Cleaned : {cleaned}\n")

Original vs Cleaned:

1. Original: This movie was ABSOLUTELY AMAZING!!! I loved every second of it 😍
   Cleaned : movie absolutely amazing loved every second

2. Original: Terrible film. Complete waste of time and money.
   Cleaned : terrible film complete waste time money

3. Original: It was okay, nothing special. Not bad, not great.
   Cleaned : okay nothing special bad great

4. Original: The acting was brilliant but the plot was so predictable...
   Cleaned : acting brilliant plot predictable

5. Original: Can't believe how bad this was!! Worst movie ever 👎
   Cleaned : cant believe bad worst movie ever

6. Original: Not going to lie, I actually enjoyed it more than expected!
   Cleaned : going lie actually enjoyed expected

7. Original: The visuals were stunning, but the story felt empty.
   Cleaned : visuals stunning story felt empty

8. Original: What a masterpiece! Everyone should watch this.
   Cleaned : masterpiece everyone watch
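
Sample 3 shows a side effect worth noting: "Not bad, not great" is cleaned to "bad great", because "not" is in the stopword list. One possible remedy, sketched below, is to exclude negation words from the stopword set before filtering. The names `NEGATIONS`, `remove_stopwords_keep_negations`, and `demo_stop_words` are hypothetical, and the tiny stopword set exists only so the sketch runs without NLTK data; in this notebook the `stop_words` set defined earlier would be passed instead.

```python
# Hypothetical variant: keep negation words when filtering stopwords.
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords_keep_negations(tokens, stop_words):
    """Drop stopwords from `tokens`, but never drop the words listed in NEGATIONS."""
    filtered_stop_words = set(stop_words) - NEGATIONS
    return [token for token in tokens if token not in filtered_stop_words]

# Tiny illustrative stopword set (an assumption, not the NLTK list).
demo_stop_words = {"it", "was", "not", "the", "a"}
tokens = "it was okay nothing special not bad not great".split()

print(remove_stopwords_keep_negations(tokens, demo_stop_words))
# ['okay', 'nothing', 'special', 'not', 'bad', 'not', 'great']
```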



In [7]:
# EXPERIMENT
# Trying variations to see impact
print("Experiment: Without lemmatization")
print(preprocess_text(samples[0], lemmatize=False))

print("\nExperiment: Keep stopwords")
print(preprocess_text(samples[0], remove_stopwords=False))

print("\nExperiment: Minimal cleaning (only lowercase + tokenize)")
print(preprocess_text(samples[0], remove_punct=False, remove_stopwords=False, lemmatize=False))

Experiment: Without lemmatization
movie absolutely amazing loved every second

Experiment: Keep stopwords
this movie wa absolutely amazing i loved every second of it

Experiment: Minimal cleaning (only lowercase + tokenize)
this movie was absolutely amazing ! ! ! i loved every second of it 😍
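
The "wa" in the "Keep stopwords" output comes from `lemmatizer.lemmatize(token)` treating every token as a noun by default: "was" loses its final "s" as if it were a plural, and "loved" is left unchanged. A possible refinement is to pass a part-of-speech tag to the lemmatizer, as sketched below. The helper names are hypothetical, and `lemmatize_with_pos` assumes the WordNet data and an NLTK POS tagger model have already been downloaded (the tagger's resource name varies between NLTK versions). With verb tags, "was" should lemmatize to "be" and "loved" to "love".

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

def lemmatize_with_pos(tokens):
    """Lemmatize tokens using the POS tags assigned by nltk.pos_tag.

    Assumes the WordNet corpus and a POS tagger model are already downloaded.
    """
    # Imported lazily so to_wordnet_pos can be used without NLTK installed.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token, to_wordnet_pos(tag))
            for token, tag in pos_tag(tokens)]

print(to_wordnet_pos("VBD"))  # v
print(to_wordnet_pos("NN"))   # n
```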
