## PART 4: NATURAL LANGUAGE PROCESSING

### **A}** NLP & WORD VECTORIZATION

> INTRODUCTION
* **Natural Language Processing (NLP)**: A field focused on making computers understand and interact with human language.
* Modern NLP is driven by machine learning.
> TEXT PREPROCESSING
1. **TOKENIZATION**
* Splitting text into individual words or tokens.
2. **CLEANING**
* Removing punctuation, converting everything to lowercase & handling special characters to prepare the text.
3. **STOP WORDS REMOVAL**
* Eliminating common but non-informative words like 'the', 'of'
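
The three preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the `STOP_WORDS` set and the `preprocess` helper are hypothetical names invented here, and the sketch cleans the text before splitting it, which is one of several reasonable orderings.

```python
import re

# Hypothetical, deliberately tiny stop word list (real lists are much longer).
STOP_WORDS = {"the", "of", "a", "an", "and", "is", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())      # cleaning
    tokens = cleaned.split()                                  # tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]  # stop word removal

tokens = preprocess("The History of NLP: a short, simple overview!")
print(tokens)  # ['history', 'nlp', 'short', 'simple', 'overview']
```
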
> STEMMING VS LEMMATIZATION
1. **STEMMING**
* Reduces words to their root form by chopping off the suffixes.
* e.g. "Studies" becomes "studi"
2. **LEMMATIZATION**
* More sophisticated: converts words to their dictionary base form (lemma).
* e.g. "Studies" becomes "Study"
> VECTORIZATION
1. **BAG OF WORDS (COUNT VECTORIZATION)**
* Counts the frequency of each word in a document, ignoring word order.
* When working with one document, we create a single vector where each element corresponds to the count of a unique word in the document.
* When working with multiple documents, we store one such vector per document, e.g. as the rows of a DataFrame.
2. **TF-IDF (TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY)**
* Weighs words based on their importance across a document collection.
* This gives higher values to rare but meaningful terms and lower values to words that appear in most documents.
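
To make the TF-IDF weighting concrete, here is a small sketch in plain Python. It assumes the textbook formulation (term frequency times `log(N / df)`); the toy `docs` corpus and the `tf_idf` helper are invented for illustration, and libraries typically use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for term in ("the", "cat", "mat"):
    print(term, round(weights[0][term], 3))
```

In the first document, "the" appears in every document and so gets a weight of zero, while "mat" (found in one document only) outweighs "cat" (found in two).
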
> APPLICATIONS
1. **TEXT CLASSIFICATION & SPAM DETECTION**
* Bayesian methods, e.g. Naive Bayes, are often used to classify text {e.g. spam filtering}
2. **REAL WORLD NLP PRODUCTS**
* From simple chatbots to complex systems like Siri & Google Duplex
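
As a rough illustration of the Naive Bayes spam filtering mentioned above, the sketch below implements a multinomial Naive Bayes classifier with add-one (Laplace) smoothing in plain Python. The `TRAIN` data and the `train` and `predict` helpers are hypothetical and far too small for real use; in practice a library implementation would normally be preferred.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (label, text) pairs invented for illustration.
TRAIN = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("ham", "meeting agenda for monday"),
    ("ham", "lunch on monday"),
]

def train(examples):
    """Count documents per label and words per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        label_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, label_counts, word_counts, vocab):
    """Return the label with the highest log posterior score."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen (label, word) pairs above zero.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict("free money", *model))      # spam
print(predict("monday meeting", *model))  # ham
```
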

### **B}** INTRODUCTION TO REGULAR EXPRESSIONS

> INTRO
* **Regex**: A powerful tool for pattern matching & text filtering, enabling quick searches & manipulations within text data.
* Useful in: Web scraping & text data preprocessing in NLP.
* It is essential during the tokenization stage in NLP, e.g. it controls how words with apostrophes (they're) are split into tokens.
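
The apostrophe point can be seen by comparing two tokenizing patterns. This is only a sketch: the sample `text` is invented, and the second pattern is one possible way to keep contractions together, not the only one.

```python
import re

text = "They're sure it's Sam's book, aren't they?"

# Letters only: contractions are split at the apostrophe.
naive = re.findall(r"[A-Za-z]+", text)

# Letters, optionally followed by an apostrophe and more letters.
with_apostrophes = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

print(naive)
print(with_apostrophes)
```

The first pattern yields fragments such as `'They'` and `'re'`, while the second keeps `"They're"`, `"it's"`, `"Sam's"` and `"aren't"` as single tokens.
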
> BASIC PATTERNS 
* The `re` library in Python is used to compile patterns & find matches within strings.
> REGEX COMPONENTS
* **RANGES**: Define sets of characters using square brackets, e.g. `[A-Z]` for uppercase letters.
* **CHARACTER CLASSES**: Shortcuts for common character sets, e.g. `\d` for digits and `\w` for word characters.
* **GROUPS & QUANTIFIERS**: Use parentheses for groups & curly braces for specifying repetition, e.g. `([A-Z0-9]){3}`.
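
A short sketch of each component in action, using a made-up sample string (the flight details are invented for illustration):

```python
import re

text = "Flight BA2490 departs at 09:45 from Gate C12"

# Range: runs of uppercase letters.
uppercase_runs = re.findall(r"[A-Z]+", text)

# Character class: runs of digits.
digit_runs = re.findall(r"\d+", text)

# Quantifier: exactly two digits, a colon, exactly two digits.
times = re.findall(r"\d{2}:\d{2}", text)

# Groups: capture the airline code and the flight number separately.
flight = re.search(r"([A-Z]{2})(\d{4})", text)

# A repeated group matches three characters but captures only the last one.
repeated = re.search(r"([A-Z0-9]){3}", "X9Z")

print(uppercase_runs)   # ['F', 'BA', 'G', 'C']
print(digit_runs)       # ['2490', '09', '45', '12']
print(times)            # ['09:45']
print(flight.groups())  # ('BA', '2490')
print(repeated.group(0), repeated.group(1))  # X9Z Z
```
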
> EXAMPLE 
* A regex pattern to match a basic email address:
* `([A-Za-z]+)@([A-Za-z]+)\.com`
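
Applied to a made-up string, the email pattern behaves as sketched below. The addresses are invented, and the pattern is intentionally basic: it only accepts letters on both sides of the `@` and only the `.com` suffix.

```python
import re

email_pattern = re.compile(r"([A-Za-z]+)@([A-Za-z]+)\.com")

text = "Contact alice@example.com or bob@mail.com; ignore carol@site.org."

# With two groups, findall returns one (user, domain) tuple per match.
matches = email_pattern.findall(text)
print(matches)  # [('alice', 'example'), ('bob', 'mail')]
```
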


### **C}** REGULAR EXPRESSIONS - CODEALONG

> 1. READING FILE

In [1]:
import re

with open(r"C:\Users\User\Documents\Moringa_labs\PHASE 4\1.-Phase-4-SUMMARY-\DATA\menu.txt", "r") as f:
    file = f.read()

print(file)

Flatiron School Cafe Menu


Appetizers

Nachos - $10
Calamari - $12
3 Cheese Platter - $8.75

Entrees

Chicken Sandwich - $16.95
A fried chicken sandwich with lettuce, tomato, and mayo. Add cheese for $1.50

Flatiron Steak - $22
A prime cut of Flatiron Steak, cooked to your liking. Comes with a side of vegetables. Add a salad or cup of soup for $3

Garden Salad - $14
A salad with stuff from the garden on our roof. 3 types of dressing available.


Want to place an order for delivery? Call us at (555) 123-8452!



> 2. FINDING DIGITS 
* Create a pattern to match all individual digits

In [2]:
pattern = r"\d"  # raw string so the backslash reaches the regex engine unchanged
p = re.compile(pattern)
digits = p.findall(file)
print(digits)

['1', '0', '1', '2', '3', '8', '7', '5', '1', '6', '9', '5', '1', '5', '0', '2', '2', '3', '1', '4', '3', '5', '5', '5', '1', '2', '3', '8', '4', '5', '2']


>  3. MATCHING PRICES 
* Modify the pattern to find dollar amounts by escaping the dollar sign (`\$`) and following it with a digit.
* **This approach captures only the first digit after the dollar sign, so $10 is truncated to $1.**

In [3]:
pattern = r"\$\d"
p = re.compile(pattern)
digits = p.findall(file)
print(digits)

['$1', '$1', '$8', '$1', '$1', '$2', '$3', '$1']


> 4. IMPROVING PRICE PATTERN 
* Update the pattern to capture more characters after the dollar sign.
* **It worked for some prices, but it left out any price with fewer than 3 characters after the initial match on the same line (e.g. $10), and it cut $16.95 short to $16.9.**

In [4]:
pattern = r"\$\d.{3}"
p = re.compile(pattern)
digits = p.findall(file)
print(digits)

['$8.75', '$16.9', '$1.50']


> 5. CREATE COMPREHENSIVE PRICE PATTERN 
* Use groups and ranges to match all prices including decimal points

In [5]:
pattern = r"(\$\d+\.?\d*)"
p = re.compile(pattern)
digits = p.findall(file)
print(digits)

['$10', '$12', '$8.75', '$16.95', '$1.50', '$22', '$3', '$14']


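1. `\$` matches a literal dollar sign.
    * `$` is a regex metacharacter that marks the end of a string, so we escape it with a backslash (`\`).
2. `\d+`
    * `\d` matches any digit from 0-9.
    * `+` means one or more of the preceding element.
    * `\$\d+` therefore matches a dollar sign followed by one or more digits, e.g. `$1`, `$10`, `$123`.
3. `\.?`
    * `\.` matches a literal dot, i.e. the decimal point. We escape it because an unescaped `.` is a metacharacter.
    * `?` makes the dot optional, so whole-dollar prices such as `$10` still match.
4. `\d*`
    * Matches zero or more digits after the decimal point.
    * e.g. `10.`, `10.0`, `10.99`
5. `()`
    * Creates a capturing group. With a single group in the pattern, `findall` returns the text captured by that group for each match.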
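
One consequence of the optional dot is that a sentence-ending period directly after a whole-dollar price is captured too. The sketch below shows this on a made-up sentence and compares it with a stricter alternative that only accepts a dot when digits follow it. The `strict` pattern is a suggestion introduced here, not part of the original codealong.

```python
import re

loose = re.compile(r"\$\d+\.?\d*")       # the pattern explained above
strict = re.compile(r"\$\d+(?:\.\d+)?")  # assumed alternative: dot only with digits

text = "The soup costs $3. The platter costs $8.75."

print(loose.findall(text))   # ['$3.', '$8.75']
print(strict.findall(text))  # ['$3', '$8.75']

# Once extracted cleanly, prices can be converted to numbers.
amounts = [float(price[1:]) for price in strict.findall(text)]
print(amounts)  # [3.0, 8.75]
```
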

> 6. EXTRACTING PHONE NUMBERS

In [6]:
pattern = r"(\(\d{3}\) (\d{3}-\d{4}))"
p = re.compile(pattern)
digits = p.findall(file)
digits

[('(555) 123-8452', '123-8452')]

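* `\(` matches the opening parenthesis and `\)` matches the closing parenthesis. Both are escaped because unescaped parentheses create groups.
* The first `\d{3}` matches the first 3 digits of the phone number (the area code); the second `\d{3}` matches the next 3 digits.
* `\d{4}` matches the last 4 digits of the number.
* The literal space in the pattern matches a space character, and `-` matches the hyphen.
* The pattern contains two capturing groups (the whole number and the `123-8452` part), so `findall` returns one tuple per match.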

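When the parts of a phone number are needed individually, named groups can make the result easier to read than a positional tuple. The sketch below is one possible variation on the pattern above; the group names `area` and `local` are arbitrary choices made for this example.

```python
import re

text = "Want to place an order for delivery? Call us at (555) 123-8452!"

# (?P<name>...) creates a capturing group that can be read back by name.
phone = re.compile(r"\((?P<area>\d{3})\) (?P<local>\d{3}-\d{4})")

match = phone.search(text)
print(match.group(0))        # (555) 123-8452
print(match.group("area"))   # 555
print(match.group("local"))  # 123-8452
```
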
### **D}** BAG OF WORDS MODEL {BoW}

> CONCEPT OF BoW MODEL
* We treat text as a "bag of words" without considering grammar or word order.
* The focus is on the presence & frequency of words in a document.
* It creates a vocabulary {the unique words} from all documents, then represents each document as a vector in which each dimension corresponds to a word.

> STEPS TO BUILD A BOW MODEL

1. TOKENIZATION
* Break text into individual words {tokens}, normalizing case where appropriate.
  EXAMPLE:
    * "I Love NLP and I love Machine Learning"
    It will be tokenized as:
    * **["I", "love", "NLP", "and", "I", "love", "machine", "learning"]**
2. BUILDING A VOCABULARY
* The unique words across the entire corpus (collection of documents) are identified to form a vocabulary.
    VOCABULARY EXAMPLE:
    * **["I", "love", "NLP", "and", "machine", "learning"]**
3. CREATING A VECTOR REPRESENTATION
* Each sentence is then represented as a vector of word counts, one count per vocabulary word.
  EXAMPLE:
  ~ SENTENCE
    * "I love NLP and I love machine learning"
  ~ VECTOR REPRESENTATION
    * [2,2,1,1,1,1] ("I" and "love" each appear twice; every other word appears once)
4. DOCUMENT TERM MATRIX
* With multiple documents, BoW creates a document-term matrix where each row represents a document and each column represents a word from the vocabulary.
* Each cell in the matrix shows how often that word appears in that document.
* EXAMPLE: We have 2 sentences  
DOC 1. "I love NLP"  
DOC 2. "I love machine learning"

* THE DOCUMENT TERM MATRIX

| Document | I | love | NLP | and | machine | learning |
| --- | --- | --- | --- | --- | --- | --- |
| Doc 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Doc 2 | 1 | 1 | 0 | 0 | 1 | 1 |
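
The vector and the document-term matrix above can be reproduced with a few lines of plain Python. This is a minimal sketch: the `vectorize` helper is a hypothetical name, the vocabulary is the fixed six-word list from the steps above, and the text is assumed to be already case-normalized.

```python
from collections import Counter

vocab = ["I", "love", "NLP", "and", "machine", "learning"]

def vectorize(text, vocab):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

# Single sentence -> single count vector.
sentence_vector = vectorize("I love NLP and I love machine learning", vocab)
print(sentence_vector)  # [2, 2, 1, 1, 1, 1]

# Several documents -> document-term matrix (one row per document).
docs = ["I love NLP", "I love machine learning"]
matrix = [vectorize(doc, vocab) for doc in docs]
for row in matrix:
    print(row)
```
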






> Advantages and disadvantages of BoW

| Advantages | Disadvantages |
| ----       | ----        |
| Simplicity: It is easy to implement & gives a straightforward text representation | Ignores context: It doesn't capture the order of, or relationships between, words, e.g. "I love NLP" would be the same as "NLP love I" |
| Effective for text classification tasks | High dimensionality: As the vocabulary grows, the feature space becomes large, causing sparsity in the document-term matrix |
|  | Unable to capture semantics: BoW considers word frequencies & ignores the meaning behind the words |


> Variations & Extensions of BoW
1. TF-IDF (Term Frequency - Inverse Document Frequency)
* An extension of BoW that adjusts word counts based on how frequently words appear across documents.
* Common words across many documents {the, is} are given less importance.
* Rare but meaningful words {machine, learning} are given more weight.
2. N-grams
* Considers combinations of n consecutive words (bigrams, trigrams) to capture word order information.
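
To illustrate how n-grams recover some word order, the sketch below builds n-grams with a small hypothetical `ngrams` helper and compares two sentences that a plain bag of words cannot tell apart.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

doc_a = "I love NLP".split()
doc_b = "NLP love I".split()

# Same bag of words, because the same tokens occur the same number of times.
print(sorted(doc_a) == sorted(doc_b))  # True

# Different bigrams, because the order differs.
print(ngrams(doc_a, 2))  # [('I', 'love'), ('love', 'NLP')]
print(ngrams(doc_b, 2))  # [('NLP', 'love'), ('love', 'I')]

trigrams = ngrams("I love machine learning".split(), 3)
print(trigrams)  # [('I', 'love', 'machine'), ('love', 'machine', 'learning')]
```
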

### **E}** NLP FEATURE ENGINEERING WITH NLTK

> NLP Feature techniques are:
   1. **Stopword Removal**: It helps reduce noise.
   2. **Frequency distributions**: They provide insights into common words.
   3. **Stemming & Lemmatization**: They reduce word forms for more meaningful analysis.
   4. **Bigrams & N-grams**: They capture word dependencies.
   5. **Mutual Information**: It quantifies word associations, especially for bigrams.

> 1. **STOPWORDS REMOVAL**
* Stopwords are common words, e.g. "a", "and", "but", "or", that carry little meaning but often dominate text.
* Removing stopwords reduces the dimensionality of the data & highlights meaningful words.

In [7]:
from nltk.corpus import stopwords
import string

# Get the english stopwords & add punctuation
stopwords_list = stopwords.words("english")
stopwords_list += list(string.punctuation)
print(stopwords_list)

# Tokenization & stopword removal example
from nltk import word_tokenize
tokens = word_tokenize("This is a sample text to demonstrate stopword removal.")
stopped_tokens = [word.lower() for word in tokens if word.lower() not in stopwords_list]
print(stopped_tokens)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

> 2. FREQUENCY DISTRIBUTIONS
* It shows how often each word appears in a dataset.
* This allows us to understand common terms used in corpus.

In [8]:
from nltk import FreqDist

# Tokenize words from sample text
tokens = ['sample', 'text', 'demonstrate', 'sample', 'word']

# Create the frequency distribution
freq_dist = FreqDist(tokens)

# display most common words
most_common = freq_dist.most_common(3)
print(most_common)

[('sample', 2), ('text', 1), ('demonstrate', 1)]


> 3. STEMMING & LEMMATIZATION
* **A] Stemming** reduces words to their root form by following a set of predefined rules, e.g. "running" & "runs" both become "run".
* A commonly used stemmer is the Porter stemmer.
* **B] Lemmatization** maps words to their base forms using a dictionary.
* E.g. "running" becomes "run" & "better" becomes "good" (given the correct part of speech).
* Because it returns real dictionary words rather than truncated stems, *lemmatization* is generally more accurate than stemming.

In [9]:
# A] STEMMING
from nltk import PorterStemmer

# Initialize the porter stemmer
stemmer = PorterStemmer()

# Example words
words = ["running", "ran", "runs"]

# stemming
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['run', 'ran', 'run']


In [12]:
# B] LEMMATIZATION
from nltk.stem.wordnet import WordNetLemmatizer

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Example words
lemmatized_words = [lemmatizer.lemmatize(word) for word in ["feet", "running"]]
print(lemmatized_words)

# Without a POS tag, "running" is treated as a noun and left unchanged.
# WordNetLemmatizer needs a POS {part of speech} tag to lemmatize a word correctly.
pos_word = [("feet","n"), ("running", "v")]
lemmatized_word = [lemmatizer.lemmatize(word,pos) for word,pos in pos_word]
print(lemmatized_word)

['foot', 'running']
['foot', 'run']


> 4. BIGRAMS & N-GRAMS
* A **bigram** is a pair of adjacent words.
* Bigrams help capture word dependencies, e.g. {"New York" & "San Francisco"}
* **N-grams** are sequences of n adjacent words.

In [13]:
from nltk import bigrams

# Example sentence
sentence = "the dog played outside"

# Tokenize & create bigrams
tokens = word_tokenize(sentence)
bigram_tokens = list(bigrams(tokens))
print(bigram_tokens)

[('the', 'dog'), ('dog', 'played'), ('played', 'outside')]


> 5. MUTUAL INFORMATION {MI} SCORE
* The **MI score** measures the dependence between the 2 words in a bigram.
* High MI score: shows that the 2 words are strongly associated {e.g. "San Francisco"}
* **Pointwise MI**: compares how often the bigram actually occurs with how often it would occur if the 2 words were independent.
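
Pointwise MI is commonly defined as `log2( P(w1, w2) / (P(w1) * P(w2)) )`. The sketch below estimates it from raw counts on a toy corpus invented for illustration; the `pmi` helper is hypothetical and uses simple relative frequencies with no smoothing.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
tokens = (
    "san francisco is foggy . i love san francisco . "
    "the city is big . i love the fog"
).split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(w1, w2):
    """Pointwise mutual information of the bigram (w1, w2); -inf if unseen."""
    joint = bigram_counts[(w1, w2)] / n_bigrams
    if joint == 0:
        return float("-inf")
    p1 = unigram_counts[w1] / n_unigrams
    p2 = unigram_counts[w2] / n_unigrams
    return math.log2(joint / (p1 * p2))

print(round(pmi("san", "francisco"), 2))  # "san" is always followed by "francisco"
print(round(pmi("love", "the"), 2))       # a looser pairing scores lower
```

On small corpora PMI tends to favour rare pairs, so a minimum frequency filter is often applied before ranking bigrams.
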

### **F}** CONTEXT FREE GRAMMAR{CFG} & PART OF SPEECH(POS)

> INTRO
* A CFG is a set of rules that describe the structure of sentences at the syntactic level, without regard to meaning.
* Used in both NLP (especially POS tagging) & linguistics.
> CFG rules for sentence structure
* A sentence(S) = Noun Phrase(NP) + Verb Phrase(VP)
* Noun Phrase(NP) = Determiner(Det) + Noun(N)
* Verb Phrase(VP) = Verb(V) + Noun Phrase(NP)
> POS Tagging 
* It assigns grammatical labels (e.g. noun, verb) to words.
* It helps computers interpret sentence structure.
> PARSE TREES
* A parse tree visually represents how a sentence is broken down into grammatical components.
> CFG's IMPORTANCE
* Helps computers process human language by modelling sentence structure.
* Helps interpret ambiguous sentences & words (e.g. "run" can be both a noun & a verb).
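> CFG EXAMPLE FOR: "I SHOT AN ELEPHANT IN MY PYJAMAS"
* Sentence(S) -> Noun Phrase(NP) + Verb Phrase(VP)
* Noun Phrase(NP) -> "I"
* Verb Phrase(VP) -> "shot" + Noun Phrase(NP) + Prepositional Phrase(PP)
* Prepositional Phrase(PP) -> "in" + Noun Phrase(NP)
* The remaining noun phrases ("an elephant", "my pyjamas") expand with the earlier rule Noun Phrase(NP) -> Determiner(Det) + Noun(N)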
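
This sentence is a classic example of structural ambiguity: "in my pyjamas" can attach to the verb phrase or to "an elephant". The sketch below counts the parses in plain Python. The `GRAMMAR` dictionary is an assumed, slightly extended version of the rules above (it adds the lexical categories `Det`, `N`, `V`, `P` and the rules `NP -> Det N PP` and `VP -> V NP`), and `count_parses` is a hypothetical helper written for this note, not a library function.

```python
from functools import lru_cache

# Assumed grammar: each symbol maps to its possible right-hand sides.
# Any string that is not a key is treated as a literal word.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("I",), ("Det", "N"), ("Det", "N", "PP")],
    "VP": [("V", "NP"), ("V", "NP", "PP")],
    "PP": [("P", "NP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pyjamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}

def count_parses(tokens, start="S"):
    """Count the distinct parse trees that derive `tokens` from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of ways `symbol` can produce tokens[i:j].
        if symbol not in GRAMMAR:
            return 1 if j - i == 1 and tokens[i] == symbol else 0
        return sum(match(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def match(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can produce tokens[i:j].
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        total = 0
        # Try every split that leaves at least one token per remaining symbol.
        for k in range(i + 1, j - len(rhs) + 2):
            left = derive(rhs[0], i, k)
            if left:
                total += left * match(rhs[1:], k, j)
        return total

    return derive(start, 0, len(tokens))

n_parses = count_parses("I shot an elephant in my pyjamas".split())
print(n_parses)  # 2
```

The two parses correspond to the two readings: the speaker was wearing the pyjamas, or the elephant was.
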
> POS TAGGING  
* NLP tools such as NLTK simplify this by automatically assigning POS tags based on predefined datasets, e.g. the Penn Treebank.


### **G}** TEXT CLASSIFICATION

* Involves preprocessing, cleaning & transforming text data into a format suitable for ML models.
> STEPS IN TEXT CLASSIFICATION
* 1. 