# Natural Language Processing with Python

Welcome to my Jupyter notebook, where I share my learning journey in Natural Language Processing (NLP) with Python. In this notebook, I'll walk through key concepts, techniques, and libraries, and share hands-on examples as I dive into my first experiences with NLP.

## What is Natural Language Processing (NLP)?

NLP is a subfield of artificial intelligence that enables computers to understand, interpret, and generate human language. Key tasks in NLP include:
- Text analysis
- Sentiment analysis
- Named Entity Recognition (NER)
- Language translation
- Text summarization

## Overview

In this notebook, we will explore key NLP techniques and tools, primarily focusing on Python’s `NLTK` library. Key concepts include:
- **Segmentation**: Dividing text into smaller chunks (e.g., sentences or words).
- **Tokenization**: Converting text into meaningful units (tokens).
- **Stop Words**: Removing common, non-essential words.
- **Stemming & Lemmatization**: Reducing words to their root or base forms.
- **Part-of-Speech (POS) Tagging**: Assigning word classes to tokens.
- **Named Entity Recognition (NER)**: Extracting and classifying entities (e.g., people, organizations, locations).

## Learning Objectives

By the end of this notebook, you will understand:
1. How to segment text into sentences and words.
2. Tokenization techniques for breaking down text.
3. How to remove stop words to focus on meaningful terms.
4. The differences between stemming and lemmatization.
5. POS tagging and how it classifies words in context.
6. NER for extracting entities from text.

## Resources

For further learning, explore:
1. [GeeksforGeeks: Part of Speech Tagging](https://www.geeksforgeeks.org/nlp-part-of-speech-default-tagging/)
2. [YouTube: NLP Overview and Basics](https://www.youtube.com/watch?v=MpIagClRELI)
3. [YouTube: Additional NLP Insights](https://www.youtube.com/watch?v=fLvJ8VdHLA0&list=PLhBFZf0L5I7qN_qb4P1lvrjhHtMd7403_&index=18)

---


#### 1) Segmentation
Segmentation in NLP refers to dividing text into meaningful units, such as sentences, words, or topics. It's a foundational step for further processing and analysis. Common types of segmentation include:
- Sentence Segmentation: Splitting a document into sentences using punctuation and context.
- Word Segmentation: Breaking down sentences into words, crucial for languages without clear word boundaries (e.g., Chinese).
- Topic Segmentation: Dividing a document into sections based on themes or topics for better understanding.

Segmentation helps NLP systems organize and analyze text more effectively for tasks like summarization, translation, or sentiment analysis.
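
To get a feel for why sentence segmentation needs more than punctuation, here is a minimal sketch using only Python's `re` module. It is not how NLTK's Punkt tokenizer works; the function name `naive_sentence_split` and the sample passage (including the "Dr. Smith" sentence) are my own assumptions for illustration.

```python
import re

def naive_sentence_split(passage: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", passage.strip())
    return [p for p in parts if p]

# Hypothetical sample passage; the last sentence contains an abbreviation on purpose
sample = (
    "The ceremony was held at Westminster Abbey. "
    "Queen Camilla was crowned alongside him. "
    "Dr. Smith attended."
)

naive_sentences = naive_sentence_split(sample)
for s in naive_sentences:
    print(repr(s))
```

The rule splits "Dr. Smith attended." into two pieces, because it cannot tell an abbreviation from the end of a sentence. Handling cases like this is the reason to prefer a trained tokenizer over a hand-written pattern.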

---

In [41]:
# Below is a set of text that I will be working with in order to test NLP methods
text = "Millions of people across the UK and beyond have celebrated the coronation of King Charles III - a symbolic ceremony combining a religious service and pageantry. The ceremony was held at Westminster Abbey, with the King becoming the 40th reigning monarch to be crowned there since 1066. Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace. Here's how the day of splendour and formality, which featured customs dating back more than 1,000 years, unfolded."
text

"Millions of people across the UK and beyond have celebrated the coronation of King Charles III - a symbolic ceremony combining a religious service and pageantry. The ceremony was held at Westminster Abbey, with the King becoming the 40th reigning monarch to be crowned there since 1066. Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace. Here's how the day of splendour and formality, which featured customs dating back more than 1,000 years, unfolded."

In [None]:
# Importing required libraries (NLTK)
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

In [47]:
# Splitting the text into sentences with PUNKT
# Punkt is a tokenizer that includes pre-trained data for different languages to help with splitting text into sentences.
sentences = sent_tokenize(text)
sentences

['Millions of people across the UK and beyond have celebrated the coronation of King Charles III - a symbolic ceremony combining a religious service and pageantry.',
 'The ceremony was held at Westminster Abbey, with the King becoming the 40th reigning monarch to be crowned there since 1066.',
 'Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace.',
 "Here's how the day of splendour and formality, which featured customs dating back more than 1,000 years, unfolded."]

In [60]:
# Calling separated parts of the text data (elements, via index positions)
sentences[1]

'The ceremony was held at Westminster Abbey, with the King becoming the 40th reigning monarch to be crowned there since 1066.'

In [81]:
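# Punctuation removal with the re module
import re

# Replace every character that is not a letter or digit with a space.
# Note: this reassigns `text`, so from here on `text` holds only the cleaned third sentence.
text = re.sub(r"[^a-zA-Z0-9]", " ", sentences[2])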
text

'Queen Camilla was crowned alongside him before a huge parade back to Buckingham Palace '
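
One thing worth checking is what the same pattern does to apostrophes and to commas inside numbers. The sketch below applies it to the fourth sentence of the passage; the variable names and the whitespace clean-up step are my own additions, not part of the cells above.

```python
import re

sentence = (
    "Here's how the day of splendour and formality, which featured customs "
    "dating back more than 1,000 years, unfolded."
)

# Same pattern as above: anything that is not a letter or digit becomes a space
stripped = re.sub(r"[^a-zA-Z0-9]", " ", sentence)

# Collapse the runs of spaces left behind and drop the trailing one
normalized = " ".join(stripped.split())
print(normalized)
```

The contraction "Here's" turns into "Here s" and "1,000" turns into "1 000", so this pattern is a blunt tool. Depending on the task, it may be better to tokenize first and then drop the punctuation tokens.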

#### 2) Tokenization
Tokenization is the process of breaking down a text into smaller units called "tokens." Tokens can be words, phrases, or even individual characters, depending on the use case. For example:
- Sentence Tokenization: Splits text into sentences.
- Word Tokenization: Breaks sentences into words.
- Character Tokenization: Divides words into individual characters.
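
The word and character levels above can be sketched with the standard library alone. This is only an illustration of the idea under a simple assumption (words are runs of word characters, and every other non-space character is its own token); `word_tokenize` applies more careful rules.

```python
import re

sentence = "Queen Camilla was crowned."

# Word level: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Character level: the individual characters of the first word
char_tokens = list(word_tokens[0])

print(word_tokens)
print(char_tokens)
```

Here the final period becomes a token of its own, which is also how most word tokenizers treat sentence-ending punctuation.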

---

In [86]:
# Importing required libraries
from nltk.tokenize import word_tokenize

In [94]:
# Creating a list with each element corresponding to a separate word
words = word_tokenize(text)
print(words)

['Queen', 'Camilla', 'was', 'crowned', 'alongside', 'him', 'before', 'a', 'huge', 'parade', 'back', 'to', 'Buckingham', 'Palace']


#### 3) Removal of Stop Words
Stop words are common words like "a," "the," and "in" that occur frequently in text but carry little meaning on their own. In NLP, they are often removed so that analysis focuses on content-rich terms and has fewer tokens to process. Whether to remove them depends on the task: in sentiment analysis, for example, a stop word such as "not" can reverse the meaning of a sentence.

---

In [97]:
# Importing required libraries
# nltk.download('stopwords')  # run once if the stop-word lists are not installed yet
from nltk.corpus import stopwords

In [99]:
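# Removing stop words (build the lookup set once rather than reloading the list for every word)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]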
print(words)

['Queen', 'Camilla', 'crowned', 'alongside', 'huge', 'parade', 'back', 'Buckingham', 'Palace']
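
A detail that is easy to miss: NLTK's English stop-word list contains lowercase entries only, so a capitalized "The" at the start of a sentence passes straight through a case-sensitive filter. The sketch below shows the difference with a tiny, hypothetical stop list (`STOP_WORDS` is my own stand-in so the example runs without downloading anything).

```python
# Hypothetical mini stop list for illustration; like NLTK's list, it is all lowercase
STOP_WORDS = {"the", "was", "a", "to", "at", "with"}

tokens = ["The", "ceremony", "was", "held", "at", "Westminster", "Abbey"]

# Case-sensitive membership test: "The" is not equal to "the", so it survives
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Lowercasing only for the comparison removes it while keeping the original casing of the rest
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

Comparing on `t.lower()` keeps capitalized names such as "Westminster" intact, which matters later for POS tagging and NER.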


In [101]:
# Printing the French stop-word list (NLTK's corpus collection ships stop-word lists for several languages)
print(stopwords.words("french"))

['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aur

#### 4) Stemming and Lemmatization
Stemming trims word endings with simple rules, so the result is not always a valid word (for example, "studies" becomes "studi"). Lemmatization reduces a word to its base or dictionary form, known as its "lemma," taking the word's part of speech into account. For example, the verbs "running" and "ran" both reduce to the lemma "run." Because the result is a valid word, lemmatization is generally the more accurate choice for tasks like text classification, search, or sentiment analysis, while stemming is simpler and faster.

---

In [130]:
# Importing required libraries for stemming
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in words]
print(stemmed)

['queen', 'camilla', 'crown', 'alongsid', 'huge', 'parad', 'back', 'buckingham', 'palac']


In [132]:
# Importing required libraries for lemmatization
#nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer

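# Reduce words to their lemmas. lemmatize() treats each word as a noun unless a pos argument is passed,
# which is why 'crowned' comes back unchanged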
lemmatized = [WordNetLemmatizer().lemmatize(w) for w in words]
print(lemmatized)

['Queen', 'Camilla', 'crowned', 'alongside', 'huge', 'parade', 'back', 'Buckingham', 'Palace']


In [134]:
# Another stemming and lemmatization example
words2 = ['wait', 'waiting', 'studies', 'studying', 'computers']

# Stemming: reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in words2]
print("Stemming output: {}".format(stemmed))

# Lemmatization: reduce words to their lemmas (as nouns by default, so 'waiting' and 'studying' are left as-is)
lemmatized = [WordNetLemmatizer().lemmatize(w) for w in words2]
print("Lemmatization output: {}".format(lemmatized))

Stemming output: ['wait', 'wait', 'studi', 'studi', 'comput']
Lemmatization output: ['wait', 'waiting', 'study', 'studying', 'computer']
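
The lemmatizer leaves 'waiting' and 'studying' alone because NLTK's `lemmatize` treats words as nouns unless it is given a `pos` argument. To make the contrast between the two approaches concrete without needing WordNet, here is a toy sketch: `crude_stem` is a naive suffix stripper (not the Porter algorithm), and `LEMMAS` is a small hypothetical lookup table keyed by word and part of speech.

```python
def crude_stem(word: str) -> str:
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ies", "s"):
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table keyed by (word, part of speech)
LEMMAS = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("studying", "v"): "study",
    ("waiting", "v"): "wait",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look the word up under the given part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

for w in ["waiting", "studies", "studying"]:
    print(w, "|", crude_stem(w), "|", toy_lemmatize(w), "|", toy_lemmatize(w, pos="v"))
```

The stemmer produces "stud" for "studies", which is not a word, while the lookup only returns valid words but needs the right part of speech to find "wait" and "study" for the "-ing" forms.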


#### 5) Part of Speech (POS) Tagging
Part-of-speech (POS) tagging assigns a grammatical category, like noun, verb, or adjective, to each word in a sentence based on its context. For example, in "She runs fast," "runs" is tagged as a verb, while in "She bought new running shoes," "running" modifies the noun "shoes" rather than acting as the main verb, so a tagger should label it differently. POS tagging helps NLP systems understand sentence structure and meaning, aiding in tasks like syntactic parsing, sentiment analysis, and machine translation.

---

In [None]:
# Downloading the resources needed for POS tagging and NER chunking
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker')

In [145]:
from nltk import pos_tag

In [149]:
# Tag each word with part of speech
pos_tag(words)

[('Queen', 'NNP'),
 ('Camilla', 'NNP'),
 ('crowned', 'VBD'),
 ('alongside', 'RB'),
 ('huge', 'JJ'),
 ('parade', 'NN'),
 ('back', 'RB'),
 ('Buckingham', 'NNP'),
 ('Palace', 'NNP')]

You can find the tagging labels here: https://www.geeksforgeeks.org/nlp-part-of-speech-default-tagging/
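
For quick reference, the tags that appear in the output above can be mapped to plain-English descriptions. The descriptions follow the Penn Treebank tag set; the dictionary name `TAG_MEANINGS` and the helper `describe` are my own, and the table covers only the tags seen here.

```python
# Penn Treebank descriptions for the tags that appear in the tagged sentence
TAG_MEANINGS = {
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "VBD": "verb, past tense",
    "RB": "adverb",
    "JJ": "adjective",
}

# A few (word, tag) pairs copied from the pos_tag output
tagged = [
    ("Queen", "NNP"),
    ("crowned", "VBD"),
    ("alongside", "RB"),
    ("huge", "JJ"),
    ("parade", "NN"),
]

def describe(tagged_tokens):
    """Pair each word with a readable description of its tag."""
    return [(word, TAG_MEANINGS.get(tag, "unknown tag")) for word, tag in tagged_tokens]

described = describe(tagged)
for word, meaning in described:
    print(f"{word}: {meaning}")
```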

#### 6) Named Entity Recognition
Named Entity Recognition (NER) is the process of identifying and categorizing specific entities in text, such as names of people, organizations, locations, dates, and more. For example, in "Bill Gates founded Microsoft in 1975," "Bill Gates" is identified as a person, "Microsoft" as an organization, and "1975" as a date. NER helps NLP systems extract valuable information for tasks like information retrieval, chatbots, and summarization.

---

In [None]:
# Importing required libraries and downloading the chunker resources
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

In [175]:
# This performs NER by tokenizing the third sentence, tagging words with their parts of speech, 
# identifying named entities (like people or places), and outputting them in a structured tree format.
ner_tree = ne_chunk(pos_tag(word_tokenize(sentences[2])))
print(ner_tree)

(S
  (PERSON Queen/NNP)
  (PERSON Camilla/NNP)
  was/VBD
  crowned/VBN
  alongside/RB
  him/PRP
  before/IN
  a/DT
  huge/JJ
  parade/NN
  back/RB
  to/TO
  (PERSON Buckingham/NNP Palace/NNP)
  ./.)


In [183]:
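# Running the same pipeline on `text`. Because `text` was overwritten earlier with the punctuation-free third sentence,
# this covers that sentence only (not the whole passage), and the tree matches the previous one minus the final period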
ner_tree = ne_chunk(pos_tag(word_tokenize(text)))
print(ner_tree)

(S
  (PERSON Queen/NNP)
  (PERSON Camilla/NNP)
  was/VBD
  crowned/VBN
  alongside/RB
  him/PRP
  before/IN
  a/DT
  huge/JJ
  parade/NN
  back/RB
  to/TO
  (PERSON Buckingham/NNP Palace/NNP))


In [191]:
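# A sentence mixing a role, an organization, a person, and a date shows both what the chunker finds and what it misses:
# 'Jensen Huang' is tagged PERSON, but 'CEO' is labelled ORGANIZATION, 'NVIDIA' is labelled GPE, and the date is left untagged.
# (The birth date and age in this test sentence are illustrative, not factual.)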
new_text = "The CEO of NVIDIA, Jensen Huang, was born on April 5th, 1993 and is 62 years old!"
ner_tree = ne_chunk(pos_tag(word_tokenize(new_text)))
print(ner_tree)

(S
  The/DT
  (ORGANIZATION CEO/NNP)
  of/IN
  (GPE NVIDIA/NNP)
  ,/,
  (PERSON Jensen/NNP Huang/NNP)
  ,/,
  was/VBD
  born/VBN
  on/IN
  April/NNP
  5th/CD
  ,/,
  1993/CD
  and/CC
  is/VBZ
  62/CD
  years/NNS
  old/JJ
  !/.)
