<a href="https://colab.research.google.com/github/098Steve/Jupyter/blob/main/NLP_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Basics

Go to File -> "Save a copy in Drive".  
This gives you your own copy of the notebook that you can edit.

Tokenisation – breaking sentences into words (like breaking a KitKat into sticks)

Lowercasing – so we don’t treat “Cat” and “cat” like different species

Removing punctuation – because commas and full stops don’t usually help much

Removing stop words – getting rid of boring words like “the” and “is”

Stemming and Lemmatization – turning “running” into “run”, like trimming the fat

Putting it all together – step-by-step

In [None]:
!pip install nltk textblob
!pip install -U spacy
!python -m spacy download en_core_web_sm
!python -m nltk.downloader punkt stopwords wordnet

1. ✂️ Tokenisation – Chopping up text  
Imagine a sentence like a loaf of bread. Tokenisation slices it into manageable chunks (words).

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

text = "Lovely weather we're having today, isn't it?"
tokens = word_tokenize(text)

print(tokens)
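
To see why a proper tokeniser is worth having, here is a small sketch that uses only the Python standard library. It compares a plain whitespace split with a simple regular expression. The pattern is an assumption made up for illustration, not what word_tokenize does internally (NLTK, for instance, splits "we're" into "we" and "'re", while this pattern keeps it whole).

```python
import re

text = "Lovely weather we're having today, isn't it?"

# Naive approach: split on whitespace. Punctuation stays glued to the words.
whitespace_tokens = text.split()

# Slightly better: keep words (including inner apostrophes) and
# punctuation marks as separate tokens.
regex_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Notice that the whitespace split leaves "today," and "it?" with punctuation attached, which is exactly the clutter a tokeniser is meant to separate out.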

2. 🔡 Lowercasing – Keeping it simple  
Computers are picky. “Apple” and “apple” are not the same to them. Lowercasing helps avoid that.

In [None]:
lower_tokens = [token.lower() for token in tokens]
print(lower_tokens)
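
lower() is fine for English. For text that mixes in other languages, Python's built-in str.casefold() is a more aggressive alternative. A minimal sketch, with a word list made up for illustration:

```python
# Made-up word list: three spellings of "apple" and two of the German "Strasse"
words = ["Apple", "apple", "APPLE", "Straße", "STRASSE"]

# lower() merges the apples but leaves "straße" and "strasse" apart
lowered = {w.lower() for w in words}

# casefold() also maps "ß" to "ss", so both spellings collapse into one
folded = {w.casefold() for w in words}

print(sorted(lowered))
print(sorted(folded))
```

Keep in mind that any case normalisation throws information away: after lowercasing, "US" (the country) and "us" (the pronoun) look identical.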

3. ❌ Remove punctuation – Clearing the clutter  
Punctuation is like the scaffolding of a sentence. Handy for us, but not so helpful for machines.

In [None]:
import string

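# Note: `in string.punctuation` is a substring test, so it reliably removes
# single-character tokens like "," and "?" but keeps tokens such as "n't" or "..."
no_punct = [word for word in lower_tokens if word not in string.punctuation]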
print(no_punct)
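
One subtlety: string.punctuation is a single string, so `word not in string.punctuation` checks whether the token is a substring of it. The sketch below compares that with a character-by-character check. The token list is made up for illustration.

```python
import string

# Made-up tokens, including some multi-character punctuation
tokens = ["lovely", ",", "...", "n't", "?", ",-"]

# Substring test: "..." survives (it is not a substring of string.punctuation),
# while ",-" is dropped only because those two characters happen to be adjacent
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Character-wise test: drop a token only if every character is punctuation
punct_chars = set(string.punctuation)
charwise_filtered = [t for t in tokens if not all(ch in punct_chars for ch in t)]

print(substring_filtered)
print(charwise_filtered)
```

The character-wise version is a little more predictable when the text contains ellipses or repeated marks like "!!!".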

4. 💤 Remove stop words – Get to the point  
Stop words are the little filler words like "is", "the", and "of". They're like background noise in a conversation.

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in no_punct if word not in stop_words]

print(filtered_tokens)
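
A word of caution: stop word lists often include negations (NLTK's English list contains "not"), and removing those can flip the meaning of a sentence. Here is a minimal sketch of keeping a few words back. The tiny stop list and the remove_stop_words helper are hypothetical, written just for this example.

```python
# Tiny hand-picked stop list for illustration only; real lists are much longer
DEMO_STOP_WORDS = {"the", "is", "of", "a", "not", "this"}

def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop stop words, except any that appear in `keep`."""
    return [t for t in tokens if t not in stop_words or t in keep]

tokens = ["this", "film", "is", "not", "good"]

without_negation = remove_stop_words(tokens, DEMO_STOP_WORDS)
with_negation = remove_stop_words(tokens, DEMO_STOP_WORDS, keep={"not"})

print(without_negation)
print(with_negation)
```

The first result reads like praise and the second does not, which matters a lot if the tokens feed into something like sentiment analysis later on.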

5a. 🪓 Stemming – Chop it off!  
Stemming is a bit aggressive. It chops suffixes off words using fixed rules, so the result is not always a real word (for example, "studies" becomes "studi"). Not always elegant, but gets the job done.

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]

print(stemmed)
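
To get a feel for what "chopping" means, here is a deliberately crude suffix stripper. It is a toy sketch, not the Porter algorithm (which applies many more rules in several passes); the suffix list and the toy_stem name are assumptions for illustration.

```python
# Suffixes are tried in this order; the first one that fits is removed
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip one common suffix, as long as at least min_stem_len letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["running", "jumped", "foxes", "dogs", "is"]
stems = [toy_stem(w) for w in words]

print(stems)
```

The toy version turns "running" into "runn", which shows why real stemmers need extra rules, such as undoing doubled consonants.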

5b. Lemmatization – The tidy version  
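Lemmatization is more thoughtful. It looks each word up in a dictionary (WordNet) and gives you the real base word, or “lemma”. To do this well it needs to know the part of speech: NLTK's WordNetLemmatizer treats every word as a noun unless told otherwise, so in the cell below “running” stays “running” while “cats” would become “cat”.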

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]

print(lemmatized)
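
The role of the part of speech can be sketched with a toy lookup table. This is only an illustration of the idea: the table and the toy_lemmatize helper are hypothetical, and a real lemmatiser relies on a full dictionary. With NLTK itself, the equivalent step is passing the optional pos argument, for example lemmatizer.lemmatize("running", pos="v").

```python
# Hypothetical mini-dictionary keyed by (word, part of speech)
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("were", "v"): "be",
    ("cats", "n"): "cat",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word, pos), word)

as_noun = toy_lemmatize("running")       # treated as a noun by default
as_verb = toy_lemmatize("running", "v")  # treated as a verb

print(as_noun, as_verb)
```

The same word gives different answers depending on the part of speech, which is why lemmatisation is usually paired with a part-of-speech tagger.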

⚙️ All Together Now: A Clean-up Function  
Let’s now build our ultimate function — like a Swiss Army knife of text cleaning 🛠️

In [None]:
def clean_text(text, use_lemmatization=True):
    # 1. Tokenise
    tokens = word_tokenize(text)

    # 2. Lowercase
    tokens = [word.lower() for word in tokens]

    # 3. Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]

    # 4. Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # 5. Stemming or Lemmatization
    if use_lemmatization:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    else:
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(word) for word in tokens]

    return tokens


🧪 Let's try it!

In [None]:
sample = "Cats were running faster than the dogs. Amazing, isn't it?"
print(clean_text(sample))

In [None]:
print(clean_text(sample, use_lemmatization=False))

🧪 Let's try it again slightly differently!

In [None]:
def basic_nlp(text):
    # Step 1: Tokenise
    tokens = word_tokenize(text)

    # Step 2: Lowercase
    lower = [w.lower() for w in tokens]

    # Step 3: Remove punctuation
    no_punct = [w for w in lower if w not in string.punctuation]

    # Step 4: Remove stop words
    stop_words = set(stopwords.words('english'))
    no_stops = [w for w in no_punct if w not in stop_words]

    # Step 5: Stemming
    stemmer = PorterStemmer()
    stems = [stemmer.stem(w) for w in no_stops]

    # Step 6: Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmata = [lemmatizer.lemmatize(w) for w in no_stops]

    return {
        "original_tokens": tokens,
        "lowercase": lower,
        "no_punctuation": no_punct,
        "no_stopwords": no_stops,
        "stems": stems,
        "lemmata": lemmata
    }

In [None]:
result = basic_nlp("The quick brown foxes were jumping over the lazy dogs!")
for step, output in result.items():
    print(f"{step}:\n{output}\n")

🕵️ Named Entity Recognition (NER) – Spotting the VIPs  
NER is all about picking out the named things in a sentence: people, places, organisations, dates and so on.  
  
“Barack Obama was born in Hawaii.”  
→ "Barack Obama" = PERSON  
→ "Hawaii" = GPE (Geo-Political Entity)  
  
Think of it like highlighting names, dates and places in a newspaper article — but with code.  

First, install and load spaCy and its English model (if you haven't already)

In [None]:
!pip install -U spacy
!python -m spacy download en_core_web_sm

🧠 NER in action with spaCy

In [None]:
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Example sentence
text = "King Charles visited London in April 2023 to attend a meeting with NHS officials."

# Process the text
doc = nlp(text)

# Print named entities
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")
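
To make the idea of an entity as "a span of text plus a label" concrete, here is a toy lookup-based sketch using only the standard library. It is not how spaCy works (spaCy uses a trained statistical model and can label names it has never seen); the GAZETTEER dictionary and the toy_ner function are assumptions for illustration.

```python
import re

# Hypothetical list of known names and their labels
GAZETTEER = {"King Charles": "PERSON", "London": "GPE", "NHS": "ORG"}

def toy_ner(text):
    """Return (start, end, text, label) for every known name found in text."""
    entities = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.start(), match.end(), name, label))
    return sorted(entities)  # order by position in the text

text = "King Charles visited London in April 2023 to attend a meeting with NHS officials."
found = toy_ner(text)

for start, end, name, label in found:
    print(f"{name} ({start}-{end}) -> {label}")
```

A lookup like this misses anything not in the list, such as the date in the sentence above, which is the gap a trained model is there to fill.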

In [None]:
# Look up a plain-English description of an entity label
spacy.explain("GPE")  # or any other label

# We can add NER to our pipeline of NLP tools

In [None]:
def basic_nlp_with_ner(text):
    import nltk
    import string
    import spacy
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

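    # Newer NLTK releases look for 'punkt_tab'; older ones use 'punkt'
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    nltk.download('stopwords', quiet=True)
    nltk.download('wordnet', quiet=True)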

    # NLTK pipeline
    tokens = word_tokenize(text)
    lower = [w.lower() for w in tokens]
    no_punct = [w for w in lower if w not in string.punctuation]
    stop_words = set(stopwords.words('english'))
    no_stops = [w for w in no_punct if w not in stop_words]
    stemmer = PorterStemmer()
    stems = [stemmer.stem(w) for w in no_stops]
    lemmatizer = WordNetLemmatizer()
    lemmata = [lemmatizer.lemmatize(w) for w in no_stops]

    # spaCy NER
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    named_entities = [(ent.text, ent.label_) for ent in doc.ents]

    return {
        "original_tokens": tokens,
        "lowercase": lower,
        "no_punctuation": no_punct,
        "no_stopwords": no_stops,
        "stems": stems,
        "lemmata": lemmata,
        "named_entities": named_entities
    }

In [None]:
results = basic_nlp_with_ner("King Charles visited London in April 2023 to attend a meeting with NHS officials.")
for key, value in results.items():
    print(f"{key}:\n{value}\n")

😃😐😠 Sentiment Analysis – Reading the mood  
We’ll use TextBlob, which is beginner-friendly and built on top of NLTK. It gives us two numbers:  
  
Polarity (from –1 to +1): negative → positive  
  
Subjectivity (from 0 to 1): fact → opinion

In [None]:
from textblob import TextBlob

text = "I absolutely love how easy this is to use!"
blob = TextBlob(text)

print("Polarity:", blob.sentiment.polarity)
print("Subjectivity:", blob.sentiment.subjectivity)
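
Where might a polarity number come from? One common approach is a lexicon: a list of words with scores attached, averaged over the sentence. The sketch below shows only that general idea with a made-up four-word lexicon. It is not TextBlob's actual implementation, and DEMO_LEXICON and toy_polarity are hypothetical names.

```python
import string

# Made-up scores between -1 (negative) and +1 (positive)
DEMO_LEXICON = {"love": 0.5, "easy": 0.4, "hate": -0.8, "awful": -1.0}

def toy_polarity(text):
    """Average the scores of the words found in the lexicon (0.0 if none)."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    scores = [DEMO_LEXICON[w] for w in words if w in DEMO_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

positive = toy_polarity("I absolutely love how easy this is to use!")
negative = toy_polarity("I hate this awful menu.")
neutral = toy_polarity("The meeting is on Tuesday.")

print(positive, negative, neutral)
```

A scorer this simple ignores negation and intensifiers ("not good", "absolutely"), so treat it as a picture of the concept rather than something to rely on.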


# Okay, that's a lot of NLP basics!

# Now go back through and insert different sentences to see how that changes the output

# Once you're satisfied, start looking at tokenisation:
https://platform.openai.com/tokenizer  
Type sentences into the box and look at how they are split into tokens.