# Lesson L1: Natural Language Pre‑processing 🌟

Welcome to **Lesson 1** of the *DeepLearning.AI NLP Specialization* teaching notebook.  
This notebook shows you **how to clean and prepare raw text** so it can be fed to classical machine‑learning algorithms *and* modern neural networks.

---

## Why does text preprocessing matter? 🤔
* Raw text is messy – full of punctuation, contractions, emojis, hyperlinks and *noise* that distract a model.  
* Careful preprocessing lets the model focus on **signal**, which typically shrinks the vocabulary, shortens training time and often improves accuracy.
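
To make the vocabulary-size point concrete, here is a minimal sketch using only the Python standard library. The three example strings and the `normalise` helper are made up for illustration; they are not part of the course material.

```python
import string

# Hypothetical mini-corpus: the same words appear with different casing and punctuation.
raw_texts = [
    "Great movie!!!",
    "great MOVIE",
    "Great, great movie...",
]

def normalise(token: str) -> str:
    """Lower-case a token and strip leading/trailing punctuation."""
    return token.lower().strip(string.punctuation)

# Vocabulary from naive whitespace splitting, before any cleaning
raw_vocab = {tok for text in raw_texts for tok in text.split()}

# Vocabulary after normalisation (empty strings are dropped)
clean_vocab = {normalise(tok) for text in raw_texts for tok in text.split()} - {""}

print(len(raw_vocab), sorted(raw_vocab))
print(len(clean_vocab), sorted(clean_vocab))
```

In this toy corpus, six distinct raw tokens collapse to just two (`great` and `movie`) once casing and punctuation are normalised.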

---

## What you'll do in this notebook  🗺️

1. **Build intuition with a toy example** – walk a single sentence through tokenization ➡️ stop‑word filtering ➡️ stemming.  
2. **Peek at a real dataset** – load the NLTK Twitter corpus and inspect its structure.  
3. **Implement a complete preprocessing pipeline** you can reuse in later lessons and assignments.  
4. **Play in an interactive Gradio demo** – paste any tweet and see the cleaned, stemmed tokens it produces.

> 👉 Feel free to run each cell, tweak the code and break things. That's the fastest way to learn!


In [None]:
# @title Install & import required libraries
# pip skips any package whose pinned version is already installed
!pip -q install nltk==3.8.1 gradio==4.16.0 --progress-bar off

import nltk, re, string, random
from nltk.corpus import twitter_samples, stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer

# Download small NLTK resources 📥
nltk.download('stopwords')
nltk.download('twitter_samples')
nltk.download('punkt')


## 1️⃣ Toy example – preprocessing one sentence

Before diving into thousands of tweets, let's warm up with **one short sentence**.  
We'll walk through each preprocessing step so you can *see* what changes and *why* it matters.

In [None]:
sentence = "I love Natural Language Processing! :) #NLP"

print("🔸 Original sentence:")
print(sentence)

# 1. Tokenize
tokenizer = TweetTokenizer(preserve_case=False,
                           strip_handles=True,
                           reduce_len=True)
tokens = tokenizer.tokenize(sentence)
print("\n🔹 After tokenization:")
print(tokens)

# 2. Remove stop‑words & punctuation
stopwords_en = stopwords.words('english')
tokens_no_sw = [w for w in tokens if w not in stopwords_en and w not in string.punctuation]
print("\n🔹 After stop‑word & punctuation filtering:")
print(tokens_no_sw)

# 3. Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in tokens_no_sw]
print("\n🔹 After stemming:")
print(stemmed)


**What happened?**

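* `TweetTokenizer` **standardises** the text – lower‑casing, shortening elongated words (*soooo → sooo*, because `reduce_len=True` cuts runs of three or more repeated characters down to three), and removing Twitter handles.  
* Stop‑words like *'i'* (which carry little meaning for sentiment analysis) are filtered out, along with the lone `!`. Note that *'love'* is **not** in NLTK's English stop‑word list, so it is kept, as are the emoticon `:)` and the hashtag `#nlp`.  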
* Finally, **stemming** reduces words to a common root (*processing → process*), shrinking the vocabulary your model has to learn.
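
The length-reduction rule mentioned in the first bullet can be sketched with a single regular expression. This is an illustrative re-implementation, assuming the rule is "any character repeated three or more times is cut to exactly three". The `shorten_runs` name is hypothetical, and the real `TweetTokenizer` does much more than this.

```python
import re

# Matches any character followed by two or more copies of itself.
_RUN = re.compile(r"(.)\1{2,}")

def shorten_runs(text: str) -> str:
    """Cut every run of 3+ identical characters down to exactly 3."""
    return _RUN.sub(r"\1\1\1", text)

for word in ["soooo", "sooooooooo", "good", "yesss!!!!!"]:
    print(word, "->", shorten_runs(word))
```

Under this rule, `soooo` and `sooooooooo` both become `sooo`, so they end up as the same token, while `good` is left alone because its run has only two characters.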

## 2️⃣ Exploring the Twitter sentiment dataset

We'll use the **`twitter_samples`** corpus that ships with NLTK.  
It contains **5,000 positive** and **5,000 negative** English tweets.

Let's load the text and take a quick look.

In [None]:
# Load positive & negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

print(f"✅ Loaded {len(all_positive_tweets):,} positive tweets")
print(f"✅ Loaded {len(all_negative_tweets):,} negative tweets")

# Quick sanity check – view one random tweet from each class
print("\n🔸 Example positive tweet:")
print(random.choice(all_positive_tweets))
print("\n🔸 Example negative tweet:")
print(random.choice(all_negative_tweets))


*Notice the hashtags, emojis, URLs and random user mentions.*  
We'll need to clean all that up before feeding the text to a model.
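
If you'd like a rough number for how common that noise is, the sketch below counts how many tweets contain at least one URL, mention or hashtag. The `sample_tweets` list is a hypothetical stand-in so the snippet runs on its own, and the patterns are deliberately simple rather than exhaustive.

```python
import re

# Hypothetical stand-ins for real tweets; swap in any list of strings.
sample_tweets = [
    "@anna thanks for the follow! :) https://example.com/welcome",
    "Monday again... #mondayblues :(",
    "New blog post www.example.org #nlp #python",
    "just a plain sentence",
]

# One simple pattern per kind of noise (rough, not exhaustive).
patterns = {
    "url": re.compile(r"https?://\S+|www\.\S+"),
    "mention": re.compile(r"@\w+"),
    "hashtag": re.compile(r"#\w+"),
}

# Count how many tweets contain at least one match for each pattern.
counts = {name: sum(bool(rx.search(t)) for t in sample_tweets)
          for name, rx in patterns.items()}

for name, n in counts.items():
    print(f"{name:8s} {n}/{len(sample_tweets)} tweets")
```

To try it on the real data, pass `all_positive_tweets` or `all_negative_tweets` in place of `sample_tweets`.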

## 3️⃣ Building a reusable `process_tweet()` function

Instead of repeating the three steps above every time, let's wrap them in a **single helper** you can call from future notebooks.

In [None]:
def process_tweet(tweet: str):
    """Preprocess a single tweet into a list of cleaned, stemmed tokens.

    Steps
    -----
    1. Lower‑case; remove hyperlinks and handles; strip the `#` from hashtags (the tag word itself is kept).
    2. Tokenize with `TweetTokenizer`.
    3. Filter stop‑words & punctuation.
    4. Stem remaining tokens.
    """
    # 1. Normalise text
    tweet = tweet.lower()
    tweet = re.sub(r'https?://\S+|www\.\S+', '', tweet)   # remove links
    tweet = re.sub(r'@\w+', '', tweet)                     # remove handles
    tweet = re.sub(r'#', '', tweet)                         # strip hashtag symbol

    # 2. Tokenise
    tokenizer = TweetTokenizer(preserve_case=False,
                               strip_handles=True,
                               reduce_len=True)
    tokens = tokenizer.tokenize(tweet)

    # 3. Remove stopwords & punctuation (a set makes membership tests fast)
    stopwords_en = set(stopwords.words('english'))
    tokens_clean = [tok for tok in tokens
                    if tok not in stopwords_en
                    and tok not in string.punctuation]

    # 4. Stemming
    stemmer = PorterStemmer()
    stems = [stemmer.stem(tok) for tok in tokens_clean]

    return stems
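

One subtlety in step 3 is worth seeing in isolation: `string.punctuation` is a plain string, so `tok not in string.punctuation` is a substring test rather than a per-character test. The sketch below uses a hypothetical token list and a hypothetical `is_punct` helper to show which tokens each kind of check treats as punctuation.

```python
import string

tokens = ["!", "...", ":)", "<=", "great"]

# Substring test, as used in the filter above: True only if the whole
# token occurs as a contiguous piece of string.punctuation.
by_substring = [tok for tok in tokens if tok in string.punctuation]

def is_punct(tok: str) -> bool:
    """True if every character of the token is a punctuation character."""
    return bool(tok) and all(ch in string.punctuation for ch in tok)

by_character = [tok for tok in tokens if is_punct(tok)]

print(by_substring)
print(by_character)
```

The substring test flags only `!` and `<=`, so `...` and `:)` stay in the output. Keeping emoticons is often useful for sentiment analysis, so the per-character check is not automatically better; which behaviour you want depends on the task.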


In [None]:
sample = all_positive_tweets[2277]
print("Original tweet:\n", sample)

print("\nProcessed tokens:")
print(process_tweet(sample))


✨ *Much cleaner!* You can now vectorise these tokens (e.g., with TF‑IDF or word embeddings) and train a classifier.
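
As a rough idea of what "vectorising" means, the sketch below turns token lists into raw count vectors (a bag-of-words, which is simpler than TF‑IDF). The token lists are hypothetical examples of the kind of output a preprocessing step produces, and `to_count_vector` is an illustrative helper, not a library function.

```python
from collections import Counter

# Hypothetical preprocessed tweets (lists of stemmed tokens).
docs = [
    ["love", "natur", "languag", "process"],
    ["love", "love", "nlp"],
]

# Fixed, sorted vocabulary so every vector has the same column order.
vocab = sorted({tok for doc in docs for tok in doc})

def to_count_vector(tokens, vocab):
    """Return how often each vocabulary word occurs in `tokens`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [to_count_vector(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Each tweet becomes a fixed-length list of numbers, one per vocabulary word, which is the kind of input a classical classifier expects.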

## 4️⃣ Interactive playground 🔧

Use the widget below to **experiment** – type any short snippet and see exactly what `process_tweet()` returns.

In [None]:
import gradio as gr

def preprocess(text):
    """Helper for the demo"""
    return ' '.join(process_tweet(text))

demo = gr.Interface(
    fn=preprocess,
    inputs=gr.Textbox(lines=2, placeholder="Type a tweet here..."),
    outputs=gr.Textbox(label="Processed tokens"),
    title="Tweet Pre‑processor",
    description="Watch how raw text is cleaned, tokenised and stemmed."
)

demo.launch(debug=False, share=False)


---

### 🎉 You finished Lesson L1!

You're now equipped to **clean tweets** for sentiment analysis and beyond.  
In the next lesson you'll build features from these tokens and train your first classifier.