<a href="https://colab.research.google.com/github/GigiQR99/NLP-exercise/blob/main/Vectorization_NLP_Lee_wk02_RQA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **VECTORIZATION**

### **VECTORIZE:** Convert words into numbers. Make the text understandable to the ML algorithms
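
As a minimal sketch of what "converting words into numbers" means, the snippet below builds a bag-of-words count vector with the standard library only. The two toy documents and the helper name `vectorize` are assumptions made for illustration; scikit-learn's `CountVectorizer`, imported further down in this notebook, does the same job with many more options.

```python
from collections import Counter

# Two toy documents (made up for illustration)
docs = ["the cat sat", "the cat sat on the mat"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocab = sorted({word for doc in docs for word in doc.split()})

def vectorize(doc):
    """Return the count of each vocabulary word in doc, in vocab order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [vectorize(doc) for doc in docs]
print(vocab)    # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each document becomes a list of numbers of the same length, which is the shape an ML algorithm expects.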

### **PRIOR CLASS:** Pre-processing the data: **Stemming** (strip affixes to extract a word's root), **Lemmatization** (reduce a word to its dictionary form, taking context into account), **Stopwords**

## **Categorizing Data**


1.   **By Structure:** Structured (Excel, CSV); Semi-structured (JSON, HTML), which can be modified; Unstructured (e.g. the text of a phone call), which is the hardest for ML to process without loss of information.
2.   **Based on Content:** text, image, audio, video

### **Noise must be removed (Cleaning):**
Otherwise it degrades the results and wastes GPU and processing time. Noise (punctuation, emojis, etc.) does not contribute to the meaning and semantics of the text.

## **Clean Text Data**

1. Stopwords removal
2. Tokenization: split a sentence into its words and punctuation marks
3. Stemming
4. Lemmatization
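
The first two steps can be sketched with the standard library alone. The stopword set below is a tiny hand-picked assumption for illustration (real lists, such as NLTK's or spaCy's, are much longer), and the helper names are hypothetical.

```python
import re

# Tiny hand-picked stopword list, for illustration only
STOPWORDS = {"the", "a", "an", "is", "with", "from", "but", "yet"}

def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence)

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = tokenize("The cute little boy is playing with the kitten.")
content_words = remove_stopwords(tokens)
print(content_words)  # ['cute', 'little', 'boy', 'playing', 'kitten']
```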


### **Basic NLP libraries**: for string manipulation & pattern matching in strings
1. **re** (regular expressions)
2. **textblob (~NLTK)**
3. **Keras**, now bundled with TensorFlow (`tensorflow.keras`)


### **IMPORT LIBRARIES FROM SKLEARN**

In [38]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
# %matplotlib inline

## **Clean Text Code**

#### **import re**: Imports Python's **regular expression** module.
#### ***def clean_text(sentence):*** Defines the function *clean_text*, which accepts the argument *sentence* (the text string you want to clean) and returns a list of clean words.
#### ***return re.sub(r'([^\s\w]|_)+', ' ', sentence).split()***: *([^\s\w]|_)+* is a ***regex pattern*** (regular expression pattern). The function takes a sentence, removes punctuation and underscores by replacing them with spaces, and then returns a list of the words in the cleaned sentence.


*   **re.sub(r'([^\s\w]|_)+', ' ', sentence):** This uses the re.sub function to find and replace patterns in the input sentence
*   r'([^\s\w]|_)+' is the regular expression pattern. It matches one or more characters that are not whitespace (\s) or word characters (\w), or are an underscore (_).
* In essence, it targets punctuation and underscores.
* ' ' is the replacement string. Any characters matched by the pattern are replaced with a single space. sentence is the input string where the replacements will be made.
* ***.split():*** After the punctuation and underscores are replaced with spaces, the resulting string is split into a list of words using whitespace as the delimiter.
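
To make the two stages visible, here is a small sketch that runs the same pattern step by step. The input string is made up for illustration.

```python
import re

PATTERN = r'([^\s\w]|_)+'

raw = "snake_case, e-mail!"

# Stage 1: every run of punctuation/underscores becomes a single space
spaced = re.sub(PATTERN, ' ', raw)

# Stage 2: split on whitespace; repeated spaces produce no empty tokens
tokens = spaced.split()

print(repr(spaced))  # 'snake case  e mail '
print(tokens)        # ['snake', 'case', 'e', 'mail']
```

Because of the trailing `+`, a run such as `!!!` is replaced by one space rather than three.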




###**Clean Data Example**

In [39]:
import re
def clean_text(sentence):
    return re.sub(r'([^\s\w]|_)+', ' ', sentence).split()

In [40]:
# Your sentence
sentence = '''The class ordered the book from MDC Shark Pack. But the book hasn't arrived yet''' #Triple quotes allows me to put anything in between.

# Call the function
result = clean_text(sentence)

print(result)

['The', 'class', 'ordered', 'the', 'book', 'from', 'MDC', 'Shark', 'Pack', 'But', 'the', 'book', 'hasn', 't', 'arrived', 'yet']
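
Note that the apostrophe counts as punctuation, so "hasn't" comes out as 'hasn' and 't'. If contractions should stay whole, one possible variant (an assumption, not part of the original notebook) is to exclude the apostrophe from the characters being replaced:

```python
import re

def clean_text_keep_apostrophes(sentence):
    """Like clean_text, but apostrophes are not treated as punctuation."""
    return re.sub(r"([^\s\w']|_)+", ' ', sentence).split()

sentence = "The class ordered the book from MDC Shark Pack. But the book hasn't arrived yet"
tokens = clean_text_keep_apostrophes(sentence)
print(tokens[-3:])  # ["hasn't", 'arrived', 'yet']
```

The trade-off is that apostrophes used as single quotes around a word would also be kept.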


##**"n_gram_extractor"**

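### An **n-gram** is a sequence of **n consecutive words** (here up to 4 words). N-grams keep multi-word expressions together, e.g. "United States" (2 words) or "United States of America" (4 words), instead of treating them as unrelated single words.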

Function for extracting n-grams from a sentence.

Here's how it works:

***import re***: Imports the regular expression module, as seen before, for text manipulation.

***def n_gram_extractor(sentence, n):*** Defines the function n_gram_extractor that takes two arguments: sentence (the input text) and n (the desired size of the n-grams).

***tokens = re.sub(r'([^\s\w]|_)+', ' ', sentence).split():*** This line is similar to the clean_text function you saw earlier. It cleans the input sentence by removing punctuation and underscores and then splits it into a list of individual tokens.

***for i in range(len(tokens)-n+1):*** This loop iterates through the list of tokens. The range function is set up so that the loop stops when there are no longer enough tokens left to form an n-gram of size n.

***print(tokens[i:i+n]):*** Inside the loop, this line extracts a slice of the tokens list starting from index i and ending at index i+n. This slice represents an n-gram of size n. The print() function then displays this n-gram.

### In essence, this **function takes a sentence, cleans it, splits it into words**, and then iterates through the words to extract and print all possible sequences of n consecutive words (n-grams).


In [41]:
import re
def n_gram_extractor(sentence, n):
    tokens = re.sub(r'([^\s\w]|_)+', ' ', sentence).split()
    for i in range(len(tokens)-n+1):
        print(tokens[i:i+n])
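
Printing is handy for a demo, but returning the n-grams makes them reusable (for counting, for example). A possible variant, with the hypothetical name `n_grams`, is sketched below.

```python
import re

def n_grams(sentence, n):
    """Return all n-grams of a sentence as a list of tuples."""
    tokens = re.sub(r'([^\s\w]|_)+', ' ', sentence).split()
    # range() is empty when n is larger than the number of tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

trigrams = n_grams('The cute little boy is playing with the kitten.', 3)
print(len(trigrams))  # 7
print(trigrams[0])    # ('The', 'cute', 'little')
```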

###**Exercise n_gram_extractor**

***n_gram_extractor('The cute little boy is playing with the kitten.', 3):*** This line calls the n_gram_extractor function with two arguments:
The sentence string: 'The cute little boy is playing with the kitten.'
The desired n-gram size: 3

###***from nltk import ngrams***: Returns a generator of tuples; wrapped in list() with n=2 it gives a list of tuples of 2 words each.




In [42]:
# Run this to see trigrams (n=3)
n_gram_extractor('The cute little boy is playing with the kitten.', 3)

['The', 'cute', 'little']
['cute', 'little', 'boy']
['little', 'boy', 'is']
['boy', 'is', 'playing']
['is', 'playing', 'with']
['playing', 'with', 'the']
['with', 'the', 'kitten']


In [43]:
from nltk import ngrams
list(ngrams('The cute little boy is playing with the kitten.'.split(), 2))

[('The', 'cute'),
 ('cute', 'little'),
 ('little', 'boy'),
 ('boy', 'is'),
 ('is', 'playing'),
 ('playing', 'with'),
 ('with', 'the'),
 ('the', 'kitten.')]

In [44]:
import nltk
nltk.download('punkt_tab')
from textblob import TextBlob
blob = TextBlob("The cute little boy is playing with the kitten.")
blob.ngrams(n=2)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


[WordList(['The', 'cute']),
 WordList(['cute', 'little']),
 WordList(['little', 'boy']),
 WordList(['boy', 'is']),
 WordList(['is', 'playing']),
 WordList(['playing', 'with']),
 WordList(['with', 'the']),
 WordList(['the', 'kitten'])]

###***Import nltk:*** Imports the Natural Language Toolkit library.

***nltk.download('punkt_tab')***: This attempts to download the 'punkt_tab' resource from NLTK. This resource is a **tokenizer model**. The output shows that it's downloading and unzipping the resource.

***from textblob import TextBlob***: Imports the TextBlob class from the textblob library. TextBlob provides a simple API for common NLP tasks.

***blob = TextBlob("The cute little boy is playing with the kitten.")***: Creates a TextBlob object named blob from the input sentence.

***blob.ngrams(n=2)***: This calls the ngrams method on the blob object.
n=2 specifies that you want to extract bigrams (sequences of 2 words).

The output above shows the bigrams extracted by TextBlob. Notice that ***TextBlob drops the final period (kitten)***, like the previous n_gram_extractor function, whereas ***ngrams on a plain .split() keeps it attached (kitten.)*** because .split() only separates on whitespace.

### **Pip List**  | ***!pip list***
### Lists the Python packages that have been installed

### The **"!"** at the beginning of **!pip** is **specific to Jupyter notebooks** or environments like **Google Colab.**
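
Outside a notebook, the same check can be done from Python itself. The sketch below uses the standard-library `importlib.metadata` module; the helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

print(installed_version("nltk"))  # a version string if nltk is installed, else None
```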

In [45]:
!pip
!pip list | grep nltk


Usage:   
  pip3 <command> [options]

Commands:
  install                     Install packages.
  download                    Download packages.
  uninstall                   Uninstall packages.
  freeze                      Output installed packages in requirements format.
  inspect                     Inspect the python environment.
  list                        List installed packages.
  show                        Show information about installed packages.
  check                       Verify installed packages have compatible dependencies.
  config                      Manage local and global configuration.
  search                      Search PyPI for packages.
  cache                       Inspect and manage pip's wheel cache.
  index                       Inspect information available from package indexes.
  wheel                       Build wheels from your requirements.
  hash                        Compute hashes of package archives.
  completion                  A helper c

###**KERAS & TEXTBLOB**

**Keras**: Helps to clean this text, which has some typos

**KERAS TOKENIZATION** does **aggressive cleaning**: it **removes punctuation and most special characters**, lowercases everything, and splits on spaces.
Keras is ideal for tokenization when training neural networks on normalized data.

***from tensorflow.keras.preprocessing.text import text_to_word_sequence***: This **imports the text_to_word_sequence function** specifically from the Keras deep learning library (as part of TensorFlow). This is designed to tokenize text, particularly for use in neural network models.

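**TEXTBLOB**: Ideal when you **need more control** over subsequent processing, for example when case sensitivity matters or when a domain name such as miamishots.com should stay in one piece. Note that `blob.words` still drops punctuation, so '#', '@' and emoticons such as ':)' are not preserved; NLTK's TweetTokenizer (used later in this notebook) keeps them.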
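
To see why Keras output looks so normalized, here is a standard-library approximation of what `text_to_word_sequence` does with its default settings (lowercase, replace the filter characters with spaces, split). It is a sketch for understanding, not the Keras implementation; the function name `keras_like_tokens` is hypothetical.

```python
# Default filter characters documented for Keras' text_to_word_sequence
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def keras_like_tokens(text, filters=KERAS_DEFAULT_FILTERS):
    """Lowercase, turn every filter character into a space, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

sample = "Amazing halftime by Shakira! @nfl #Miami carlos@miamishots.com :)"
print(keras_like_tokens(sample))
# ['amazing', 'halftime', 'by', 'shakira', 'nfl', 'miami', 'carlos', 'miamishots', 'com']
```

The '@' and '.' are both filter characters, which is why an email address ends up as three separate tokens.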

In [46]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from textblob import TextBlob

sentence = 'Carlos posted, "Watching Super Bowl LVIII from Hard Rock Stadium, \
Miami Gardens. Amazing halftimeby Shakira! Incredible atmosphere! @miamidolphins \
@nfl #Miami #SuperBowl2024. For event photos contact me carlos@miamishots.com :)"'

Here the handle @miamidolphins contains no space, so it comes out as ***"one word"*** (miamidolphins) in both tokenizers. "Hard Rock Stadium" and "Miami Gardens" are still split into separate words.

In [47]:
def get_keras_tokens(text):
    return text_to_word_sequence(text)

# Run this to see Keras output
result_keras = get_keras_tokens(sentence)
print(result_keras)

['carlos', 'posted', 'watching', 'super', 'bowl', 'lviii', 'from', 'hard', 'rock', 'stadium', 'miami', 'gardens', 'amazing', 'halftimeby', 'shakira', 'incredible', 'atmosphere', 'miamidolphins', 'nfl', 'miami', 'superbowl2024', 'for', 'event', 'photos', 'contact', 'me', 'carlos', 'miamishots', 'com']


In [48]:
def get_textblob_tokens(text):
    blob = TextBlob(text)
    return blob.words

# Run this to see TextBlob output
result_textblob = get_textblob_tokens(sentence)
print(result_textblob)

['Carlos', 'posted', 'Watching', 'Super', 'Bowl', 'LVIII', 'from', 'Hard', 'Rock', 'Stadium', 'Miami', 'Gardens', 'Amazing', 'halftimeby', 'Shakira', 'Incredible', 'atmosphere', 'miamidolphins', 'nfl', 'Miami', 'SuperBowl2024', 'For', 'event', 'photos', 'contact', 'me', 'carlos', 'miamishots.com']


In [49]:
# Run this to see both outputs side by side
sentence = 'Carlos posted, "Watching Super Bowl LVIII from Hard Rock Stadium, \
Miami Gardens. Amazing halftimeby Shakira! Incredible atmosphere! @miamidolphins \
@nfl #Miami #SuperBowl2024. For event photos contact me carlos@miamishots.com :)"'

print("KERAS TOKENS:")
keras_tokens = get_keras_tokens(sentence)
print(keras_tokens)
print(f"Token count: {len(keras_tokens)}")

print("\nTEXTBLOB TOKENS:")
textblob_tokens = get_textblob_tokens(sentence)
print(list(textblob_tokens))
print(f"Token count: {len(textblob_tokens)}")

print("\nDIFFERENCES:")
print("Email handling: Keras splits on '@' and '.', TextBlob splits on '@' but keeps 'miamishots.com'")
print("Hashtags: both drop '#'")
print("Mentions: both drop '@'")
print("Case: Keras lowercases, TextBlob preserves case")

KERAS TOKENS:
['carlos', 'posted', 'watching', 'super', 'bowl', 'lviii', 'from', 'hard', 'rock', 'stadium', 'miami', 'gardens', 'amazing', 'halftimeby', 'shakira', 'incredible', 'atmosphere', 'miamidolphins', 'nfl', 'miami', 'superbowl2024', 'for', 'event', 'photos', 'contact', 'me', 'carlos', 'miamishots', 'com']
Token count: 29

TEXTBLOB TOKENS:
['Carlos', 'posted', 'Watching', 'Super', 'Bowl', 'LVIII', 'from', 'Hard', 'Rock', 'Stadium', 'Miami', 'Gardens', 'Amazing', 'halftimeby', 'Shakira', 'Incredible', 'atmosphere', 'miamidolphins', 'nfl', 'Miami', 'SuperBowl2024', 'For', 'event', 'photos', 'contact', 'me', 'carlos', 'miamishots.com']
Token count: 28

DIFFERENCES:
Email handling: Keras splits on '@' and '.', TextBlob splits on '@' but keeps 'miamishots.com'
Hashtags: both drop '#'
Mentions: both drop '@'
Case: Keras lowercases, TextBlob preserves case


In [50]:
# Example: Analyzing Miami tourism tweets
miami_tweet = "Just visited @ViscayaMuseum! Beautiful gardens 🌺 #MiamiDade #ArtDeco. Book tours at info@viscaya.org"

print("For hashtag analysis (use TextBlob):")
print(get_textblob_tokens(miami_tweet))  # Keeps case (MiamiDade, ArtDeco) but drops the '#'

print("\nFor word frequency analysis (use Keras):")
print(get_keras_tokens(miami_tweet))  # Normalizes everything for counting

For hashtag analysis (use TextBlob):
['Just', 'visited', 'ViscayaMuseum', 'Beautiful', 'gardens', '🌺', 'MiamiDade', 'ArtDeco', 'Book', 'tours', 'at', 'info', 'viscaya.org']

For word frequency analysis (use Keras):
['just', 'visited', 'viscayamuseum', 'beautiful', 'gardens', '🌺', 'miamidade', 'artdeco', 'book', 'tours', 'at', 'info', 'viscaya', 'org']
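
Both outputs above lose the '#' and '@' markers. When those markers matter, one option is a small regex tokenizer that lists the special token shapes before the generic ones. The pattern below is a simplified sketch under that assumption (the email part in particular is far from a full address validator), and `social_tokens` is a hypothetical name.

```python
import re

SOCIAL_TOKEN = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # email addresses (simplified)
    r"|[@#]\w+"                      # @mentions and #hashtags
    r"|\w+"                          # ordinary words
    r"|[^\w\s]+"                     # runs of punctuation or symbols, e.g. ':)'
)

def social_tokens(text):
    """Tokenize text while keeping emails, mentions and hashtags intact."""
    return SOCIAL_TOKEN.findall(text)

miami_tweet = "Just visited @ViscayaMuseum! Beautiful gardens 🌺 #MiamiDade #ArtDeco. Book tours at info@viscaya.org"
print(social_tokens(miami_tweet))
```

Alternatives are tried left to right, so the email pattern must come before the plain-word pattern; otherwise "info" would be consumed on its own.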


In [51]:
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import MWETokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize import WordPunctTokenizer

sentence = 'Carlos tweeted, "Experiencing Ultra Music Festival from Bayfront Park, \
Miami. Incredible performance by Swedish House Mafia! Amazing visuals! @ultra \
@marshmello #Miami #UltraFest2024. For VIP tickets contact carlos@ultramiami.com :)"'

In [52]:
def tokenize_with_tweet_tokenizer(text):
    tweet_tokenizer = TweetTokenizer()
    return tweet_tokenizer.tokenize(text)

# Run this
result = tokenize_with_tweet_tokenizer(sentence)
print(result)

['Carlos', 'tweeted', ',', '"', 'Experiencing', 'Ultra', 'Music', 'Festival', 'from', 'Bayfront', 'Park', ',', 'Miami', '.', 'Incredible', 'performance', 'by', 'Swedish', 'House', 'Mafia', '!', 'Amazing', 'visuals', '!', '@ultra', '@marshmello', '#Miami', '#UltraFest2024', '.', 'For', 'VIP', 'tickets', 'contact', 'carlos@ultramiami.com', ':)', '"']


In [53]:
def tokenize_with_mwe(text):
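    # MWEs are matched token-by-token against text.split(), so punctuation matters:
    # ('Bayfront', 'Park') will not match the tokens 'Bayfront', 'Park,' (trailing comma)
    mwe_tokenizer = MWETokenizer([('Bayfront', 'Park')])  # Define multi-word expressions
    # 'Mafia!' includes the '!' so that it matches the whitespace-split token
    mwe_tokenizer.add_mwe(('Swedish', 'House', 'Mafia!'))  # Add another MWE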
    mwe_tokenizer.add_mwe(('Ultra', 'Music', 'Festival'))
    return mwe_tokenizer.tokenize(text.split())

# Run this
result = tokenize_with_mwe(sentence)
print(result)

['Carlos', 'tweeted,', '"Experiencing', 'Ultra_Music_Festival', 'from', 'Bayfront', 'Park,', 'Miami.', 'Incredible', 'performance', 'by', 'Swedish_House_Mafia!', 'Amazing', 'visuals!', '@ultra', '@marshmello', '#Miami', '#UltraFest2024.', 'For', 'VIP', 'tickets', 'contact', 'carlos@ultramiami.com', ':)"']
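
In the output above 'Bayfront' and 'Park,' stay separate because the comma makes the token differ from 'Park'. The idea of merging multi-word expressions after stripping punctuation can be sketched with the standard library; `merge_mwes` is a hypothetical helper written for this note, not NLTK's implementation.

```python
import re

def merge_mwes(tokens, mwes, separator="_"):
    """Join any run of tokens that exactly matches a multi-word expression."""
    # Try longer expressions first so the longest match wins
    ordered = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position: keep the token as-is
            merged.append(tokens[i])
            i += 1
    return merged

text = "from Bayfront Park, Miami. Incredible performance by Swedish House Mafia!"
words = re.findall(r"\w+", text)  # punctuation is dropped before matching
mwes = [("Bayfront", "Park"), ("Swedish", "House", "Mafia")]
merged = merge_mwes(words, mwes)
print(merged)
# ['from', 'Bayfront_Park', 'Miami', 'Incredible', 'performance', 'by', 'Swedish_House_Mafia']
```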


In [54]:
def tokenize_with_regex_tokenizer(text):
    # Pattern: \w+ (words) | \$[\d\.]+ (prices) | \S+ (non-spaces)
    reg_tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
    return reg_tokenizer.tokenize(text)

# Run this
result = tokenize_with_regex_tokenizer(sentence)
print(result)

['Carlos', 'tweeted', ',', '"Experiencing', 'Ultra', 'Music', 'Festival', 'from', 'Bayfront', 'Park', ',', 'Miami', '.', 'Incredible', 'performance', 'by', 'Swedish', 'House', 'Mafia', '!', 'Amazing', 'visuals', '!', '@ultra', '@marshmello', '#Miami', '#UltraFest2024.', 'For', 'VIP', 'tickets', 'contact', 'carlos', '@ultramiami.com', ':)"']


In [55]:
def tokenize_with_wst(text):
    wh_tokenizer = WhitespaceTokenizer()
    return wh_tokenizer.tokenize(text)

# Run this
result = tokenize_with_wst(sentence)
print(result)

['Carlos', 'tweeted,', '"Experiencing', 'Ultra', 'Music', 'Festival', 'from', 'Bayfront', 'Park,', 'Miami.', 'Incredible', 'performance', 'by', 'Swedish', 'House', 'Mafia!', 'Amazing', 'visuals!', '@ultra', '@marshmello', '#Miami', '#UltraFest2024.', 'For', 'VIP', 'tickets', 'contact', 'carlos@ultramiami.com', ':)"']


In [56]:
def tokenize_with_wordpunct_tokenizer(text):
    wp_tokenizer = WordPunctTokenizer()
    return wp_tokenizer.tokenize(text)

# Run this
result = tokenize_with_wordpunct_tokenizer(sentence)
print(result)

['Carlos', 'tweeted', ',', '"', 'Experiencing', 'Ultra', 'Music', 'Festival', 'from', 'Bayfront', 'Park', ',', 'Miami', '.', 'Incredible', 'performance', 'by', 'Swedish', 'House', 'Mafia', '!', 'Amazing', 'visuals', '!', '@', 'ultra', '@', 'marshmello', '#', 'Miami', '#', 'UltraFest2024', '.', 'For', 'VIP', 'tickets', 'contact', 'carlos', '@', 'ultramiami', '.', 'com', ':)"']
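
The difference between the last two tokenizers comes down to one pattern. NLTK documents `WordPunctTokenizer` as using the regex `\w+|[^\w\s]+`, while `WhitespaceTokenizer` splits on whitespace only. A standard-library sketch of both behaviours (hypothetical function names, `re.findall` standing in for the NLTK classes):

```python
import re

def wordpunct_tokens(text):
    """Runs of word characters, or runs of punctuation, as separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)

def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()

sample = 'contact carlos@ultramiami.com :)"'
print(wordpunct_tokens(sample))   # ['contact', 'carlos', '@', 'ultramiami', '.', 'com', ':)"']
print(whitespace_tokens(sample))  # ['contact', 'carlos@ultramiami.com', ':)"']
```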


In [57]:
# Analyzing Miami restaurant reviews with different needs
review = "Amazing dinner @JoesStone! Stone crabs = $49.95. Must-try! #MiamiFood :)"

# For sentiment with emoticons
print("For sentiment analysis:")
print(tokenize_with_tweet_tokenizer(review))

# For price extraction
print("\nFor price extraction:")
price_tokenizer = RegexpTokenizer(r'\$[\d\.]+|\w+|\S+')  # raw string avoids invalid-escape warnings
print(price_tokenizer.tokenize(review))

# For keeping restaurant names
print("\nFor entity recognition:")
mwe = MWETokenizer([('Stone', 'crabs')])
print(mwe.tokenize(review.split()))

For sentiment analysis:
['Amazing', 'dinner', '@JoesStone', '!', 'Stone', 'crabs', '=', '$', '49.95', '.', 'Must-try', '!', '#MiamiFood', ':)']

For price extraction:
['Amazing', 'dinner', '@JoesStone!', 'Stone', 'crabs', '=', '$49.95.', 'Must', '-try!', '#MiamiFood', ':)']

For entity recognition:
['Amazing', 'dinner', '@JoesStone!', 'Stone_crabs', '=', '$49.95.', 'Must-try!', '#MiamiFood', ':)']
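
In the price extraction output the sentence-final period stays attached to the price ('$49.95.') because `[\d\.]+` accepts any mix of digits and dots. A tighter pattern is sketched below, with `re.findall` standing in for `RegexpTokenizer`; the pattern is an assumption about what "a price" should look like here.

```python
import re

# Dollar sign, digits, and optionally one decimal part
PRICE_AWARE = r"\$\d+(?:\.\d+)?|\w+|\S+"

review = "Amazing dinner @JoesStone! Stone crabs = $49.95. Must-try! #MiamiFood :)"
tokens = re.findall(PRICE_AWARE, review)
prices = [t for t in tokens if t.startswith("$")]

print(tokens)
print(prices)  # ['$49.95']
```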




In [58]:
from nltk.stem import RegexpStemmer
def get_stems(text):
    # RegexpStemmer removes whatever matches the regex 'ing$' (a trailing "ing");
    # min=4 leaves words shorter than 4 characters untouched.
    regex_stemmer = RegexpStemmer('ing$', min=4)
    # Stem every whitespace-separated word, then join the stems with spaces.
    return ' '.join([regex_stemmer.stem(wd) for wd in text.split()])


sentence = "I love playing football"
get_stems(sentence)

'I love play football'
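
The same rule can be written in a few lines of plain Python, which also makes its main weakness easy to see. This is a sketch of the idea only; `strip_ing` is a hypothetical name.

```python
import re

def strip_ing(word, min_length=4):
    """Remove a trailing 'ing' from words of at least min_length characters."""
    if len(word) < min_length:
        return word
    return re.sub(r"ing$", "", word)

print(strip_ing("playing"))  # 'play'
print(strip_ing("king"))     # 'k'  (over-stemming: 'ing' is part of the root)
print(strip_ing("I"))        # 'I'
```

A purely pattern-based stemmer has no notion of which words really carry a suffix, which is one reason rule sets like Porter's, or lemmatizers, are usually preferred.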

In [59]:
from nltk.stem.porter import PorterStemmer

sentence = "Before eating it would be nice to sanitize your hands with a sanitizer"

def get_stems(text):
    ps_stemmer = PorterStemmer()
    return ' '.join([ps_stemmer.stem(wd) for wd in text.split()])

get_stems(sentence)

'befor eat it would be nice to sanit your hand with a sanit'

In [60]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
nltk.download('wordnet')


sentence = "The products produced by the process today are far better than what it produces generally."

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [61]:
lemmatizer = WordNetLemmatizer()
def get_lemmas(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in word_tokenize(text)])



get_lemmas(sentence)

'The product produced by the process today are far better than what it produce generally .'

In [62]:
from textblob import TextBlob
sentence = TextBlob('She sells seashells on the seashore')

In [63]:
sentence.words

WordList(['She', 'sells', 'seashells', 'on', 'the', 'seashore'])

In [64]:
def singularize(word):
    return word.singularize()

singularize(sentence.words[2])

'seashell'

In [65]:
def pluralize(word):
    return word.pluralize()

pluralize(sentence.words[5])

'seashores'

In [66]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.


###***!pip install textblob[translate]*** = not available in this textblob version

This command tries to **install the additional dependencies required for translation within textblob**. The ! at the beginning signifies that this is a shell command executed directly in the Jupyter/Colab notebook environment.

The **output** shows that the requirements are already satisfied, meaning the necessary packages are already installed. It also ***shows a warning that textblob version 0.19.0 does not provide the extra 'translate'***, which means the translation functionality may not be available as expected.



In [67]:
!pip install textblob[translate]



### **ALTERNATIVE: Use Google translate**

In [68]:
#===============================
# USE GOOGLETRANS
#===============================

!pip install googletrans==4.0.0-rc1

from googletrans import Translator

def translate_alt(text, from_l, to_l):
    translator = Translator()
    result = translator.translate(text, src=from_l, dest=to_l)
    return result.text

# Test
result = translate_alt(text='por favor', from_l='es', to_l='en')
print(result)  # Output in this run: "please"

Collecting googletrans==4.0.0-rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Downloading httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading hstspreload-2025.1.1-py3-none-any.whl.metadata (2.1 kB)
Collecting chardet==3.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading chardet-3.0.4-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting idna==2.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading idna-2.10-py2.py3-none-any.whl.metadata (9.1 kB)
Collecting rfc3986<2,>=1.3 (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading rfc3986-1.5.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting httpcore==0.9.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading httpcore-0.9.1-py3-none-any.whl.metadata (4.6 kB)
Collecting h11<0.10,>=0.8 (from httpcore==0.9.*->httpx==0.13.3->googl

please


### **The session must be restarted to get the proper Google translation**

The warning means that some parts of the "chardet" and "idna" libraries were already loaded into the Colab environment before you installed the specific version of googletrans that required different versions of those libraries. To ensure that the newly installed versions of chardet and idna are used by all libraries, you need to restart the Colab runtime.

It's like needing to restart your computer after installing some software updates for them to take full effect.


## ▶ **WHAT IS SpaCy?**: ❗Review this!!
### https://spacy.io/

1. It is a free, open-source library for industrial-strength NLP
2. The spacy-llm package **integrates LLMs** into spaCy, featuring a modular system for **fast prototyping & prompting**, and turning unstructured responses into robust outputs for various NLP tasks, no training data required.
3. If you **add spaCy as a connector to your LLM** (ideally Claude), it provides the code for **automatic text cleaning**.

**Video NLP class 2**, time 2:00hrs, Dr. Lee explains how to add spacy.io to Claude

### **Steps to use SpaCy repository:**
1. Copy its repository from **github**: https://github.com/explosion/spaCy
2. go to https://gitmcp.io/
3. Paste it in the **gitmcp** ***"try example"*** blank space, and click on **"to MCP"**.

An MCP server is a backend component that serves tools and resources to an AI system (the MCP client) through the Model Context Protocol (MCP). It acts as an intermediary, handling complex tasks like database operations.

4. It will be converted into an MCP server immediately; then copy that MCP server URL (https://gitmcp.io/explosion/spaCy)

5. Open Claude, and **"+ add as connector"** > click on **"Manage connector"**
6. Scroll down and click on **"Add custom connector"**, then set a name, e.g. "spaCy documentation"

So whenever you have a question about NLP programming with spaCy, you can use this directly in the prompt to Claude:

***"please give me a small spaCy proof of concept that will take a paragraph and clean it. use mcp"***

✊ **ALL 307 LINES OF CODE BELOW WERE GENERATED BY CLAUDE THROUGH THE SPACY CONNECTOR, with no errors!!**



## EXAMPLE OF SPACY CODE THROUGH CLAUDE

This code uses the spaCy library to clean and analyze text. It defines a class called TextCleaner which has tools to:

**Load a spaCy language model.**
1. Clean text by ***removing things like punctuation, extra spaces, stop words*** (common words like 'the', 'a'), ***numbers, emails, and website addresses***.

2. It can also **convert text to lowercase** and **find the base form of words** (lemmatization).

3. **Find & list specific types of information** in the text, like names of people, organizations, or dates (***NER: extracting entities***).

4. **Identify the grammatical role of each word** (like noun, verb, adjective)

5. **Remove words based on their grammatical role**.

The main part of the code shows how to **use the TextCleaner** by applying these ***cleaning and analysis steps*** to a sample paragraph and printing the results.

In [1]:
#!/usr/bin/env python3

#=====================================
# SpaCy Text Cleaning Proof of Concept
#=====================================
# A comprehensive example showing various text cleaning techniques using SpaCy.

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import re
from typing import List, Set

class TextCleaner:
    """A comprehensive text cleaning pipeline using SpaCy."""

    def __init__(self, model_name: str = "en_core_web_sm"):
        """
        Initialize the text cleaner with a SpaCy model.

        Args:
            model_name: Name of the SpaCy model to load
        """
        try:
            self.nlp = spacy.load(model_name)
        except OSError:
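            print(f"Model '{model_name}' not found. Installing...")
            import subprocess
            import sys
            # Use the running interpreter so the model installs into this environment,
            # and raise an error if the download fails
            subprocess.run([sys.executable, "-m", "spacy", "download", model_name], check=True)
            self.nlp = spacy.load(model_name)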

        # Get stop words from SpaCy
        self.stop_words = STOP_WORDS

    def clean_text(self,
                   text: str,
                   lowercase: bool = True,
                   remove_stopwords: bool = True,
                   remove_punctuation: bool = True,
                   remove_numbers: bool = False,
                   remove_spaces: bool = True,
                   lemmatize: bool = True,
                   remove_emails: bool = True,
                   remove_urls: bool = True,
                   remove_special_chars: bool = True,
                   min_token_length: int = 2) -> str:
        """
        Clean text using various techniques.

        Args:
            text: Input text to clean
            lowercase: Convert to lowercase
            remove_stopwords: Remove stop words
            remove_punctuation: Remove punctuation
            remove_numbers: Remove numeric tokens
            remove_spaces: Remove extra whitespace
            lemmatize: Convert words to lemmas
            remove_emails: Remove email addresses
            remove_urls: Remove URLs
            remove_special_chars: Remove special characters
            min_token_length: Minimum token length to keep

        Returns:
            Cleaned text
        """

        # Pre-processing: Remove URLs and emails using regex
        if remove_urls:
            text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)
            text = re.sub(r'www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)

        if remove_emails:
            text = re.sub(r'\S+@\S+', '', text)

        # Process with SpaCy
        doc = self.nlp(text)

        # Token-level cleaning
        cleaned_tokens = []

        for token in doc:
            # Skip based on various criteria
            if remove_punctuation and token.is_punct:
                continue

            if remove_stopwords and token.text.lower() in self.stop_words:
                continue

            if remove_numbers and (token.like_num or token.is_digit):
                continue

            if token.is_space:
                continue

            # Get the token text (lemma or original)
            if lemmatize and not token.is_punct:
                token_text = token.lemma_
            else:
                token_text = token.text

            # Apply lowercase
            if lowercase:
                token_text = token_text.lower()

            # Check minimum length
            if len(token_text) < min_token_length:
                continue

            # Remove special characters if requested
            if remove_special_chars:
                token_text = re.sub(r'[^a-zA-Z0-9\s]', '', token_text)

            # Add to cleaned tokens if not empty
            if token_text.strip():
                cleaned_tokens.append(token_text)

        # Join tokens
        cleaned_text = ' '.join(cleaned_tokens)

        # Remove extra spaces
        if remove_spaces:
            cleaned_text = ' '.join(cleaned_text.split())

        return cleaned_text

    def extract_entities(self, text: str) -> List[tuple]:
        """
        Extract named entities from text.

        Args:
            text: Input text

        Returns:
            List of (entity_text, entity_label) tuples
        """
        doc = self.nlp(text)
        return [(ent.text, ent.label_) for ent in doc.ents]

    def get_pos_tags(self, text: str) -> List[tuple]:
        """
        Get part-of-speech tags for tokens.

        Args:
            text: Input text

        Returns:
            List of (token, pos_tag) tuples
        """
        doc = self.nlp(text)
        return [(token.text, token.pos_) for token in doc]

    def remove_by_pos(self, text: str, pos_to_remove: Set[str]) -> str:
        """
        Remove tokens based on their POS tags.

        Args:
            text: Input text
            pos_to_remove: Set of POS tags to remove (e.g., {'ADV', 'ADJ'})

        Returns:
            Cleaned text
        """
        doc = self.nlp(text)
        tokens = [token.text for token in doc if token.pos_ not in pos_to_remove]
        return ' '.join(tokens)

    def normalize_whitespace(self, text: str) -> str:
        """
        Normalize various types of whitespace characters.

        Args:
            text: Input text

        Returns:
            Text with normalized whitespace
        """
        # Replace various whitespace characters with regular space
        text = re.sub(r'[\t\n\r\f\v]+', ' ', text)
        # Remove multiple spaces
        text = re.sub(r'\s+', ' ', text)
        return text.strip()


def main():
    """Demonstrate the text cleaning capabilities."""

    # Sample paragraph with various issues
    sample_text = """
    Hello World!!!  This is a SAMPLE paragraph with     various issues.
    Visit our website at https://www.example.com or contact us at info@example.com.

    The meeting is scheduled for 15th December, 2024 at 3:30 PM.
    We'll be discussing the Q4 2024 results and planning for 2025.

    Some unnecessary words: very, really, actually, basically are often overused.
    Numbers like 123, 456.78 and special characters @#$% should be handled!

    Microsoft Corporation (MSFT) announced new features.   Apple Inc. is also innovating.
    """

    print("=" * 80)
    print("SPACY TEXT CLEANING PROOF OF CONCEPT")
    print("=" * 80)

    # Initialize cleaner
    cleaner = TextCleaner()

    print("\nORIGINAL TEXT:")
    print("-" * 40)
    print(sample_text)

    # 1. Basic cleaning
    print("\n1. BASIC CLEANING (lowercase, punctuation, stopwords):")
    print("-" * 40)
    cleaned = cleaner.clean_text(
        sample_text,
        lowercase=True,
        remove_stopwords=True,
        remove_punctuation=True,
        lemmatize=False
    )
    print(cleaned)

    # 2. Advanced cleaning with lemmatization
    print("\n2. WITH LEMMATIZATION:")
    print("-" * 40)
    cleaned = cleaner.clean_text(
        sample_text,
        lowercase=True,
        remove_stopwords=True,
        remove_punctuation=True,
        lemmatize=True
    )
    print(cleaned)

    # 3. Removing URLs, emails, and numbers
    print("\n3. REMOVING URLs, EMAILS, AND NUMBERS:")
    print("-" * 40)
    cleaned = cleaner.clean_text(
        sample_text,
        remove_urls=True,
        remove_emails=True,
        remove_numbers=True,
        lemmatize=True
    )
    print(cleaned)

    # 4. Minimal cleaning (preserve more information)
    print("\n4. MINIMAL CLEANING (preserve case and some punctuation):")
    print("-" * 40)
    cleaned = cleaner.clean_text(
        sample_text,
        lowercase=False,
        remove_stopwords=True,
        remove_punctuation=False,
        lemmatize=False,
        remove_urls=True,
        remove_emails=True
    )
    print(cleaned)

    # 5. Extract named entities
    print("\n5. NAMED ENTITIES FOUND:")
    print("-" * 40)
    entities = cleaner.extract_entities(sample_text)
    for text, label in entities:
        print(f"  - {text}: {label}")

    # 6. Remove specific POS tags
    print("\n6. REMOVE ADJECTIVES AND ADVERBS:")
    print("-" * 40)
    cleaned = cleaner.remove_by_pos(sample_text, {'ADJ', 'ADV'})
    print(cleaned)

    # 7. Custom cleaning configuration
    print("\n7. CUSTOM CONFIGURATION (strict cleaning):")
    print("-" * 40)
    cleaned = cleaner.clean_text(
        sample_text,
        lowercase=True,
        remove_stopwords=True,
        remove_punctuation=True,
        remove_numbers=True,
        remove_spaces=True,
        lemmatize=True,
        remove_emails=True,
        remove_urls=True,
        remove_special_chars=True,
        min_token_length=3  # Only keep tokens with 3+ characters
    )
    print(cleaned)

    # 8. Show POS tags for understanding
    print("\n8. PART-OF-SPEECH TAGS (first 20 tokens):")
    print("-" * 40)
    pos_tags = cleaner.get_pos_tags(sample_text)[:20]
    for token, pos in pos_tags:
        if not token.isspace():
            print(f"  {token:15} -> {pos}")

    print("\n" + "=" * 80)
    print("CLEANING COMPLETE!")
    print("=" * 80)


if __name__ == "__main__":
    main()

SPACY TEXT CLEANING PROOF OF CONCEPT

ORIGINAL TEXT:
----------------------------------------

    Hello World!!!  This is a SAMPLE paragraph with     various issues.
    Visit our website at https://www.example.com or contact us at info@example.com.

    The meeting is scheduled for 15th December, 2024 at 3:30 PM.
    We'll be discussing the Q4 2024 results and planning for 2025.

    Some unnecessary words: very, really, actually, basically are often overused.
    Numbers like 123, 456.78 and special characters @#$% should be handled!

    Microsoft Corporation (MSFT) announced new features.   Apple Inc. is also innovating.
    

1. BASIC CLEANING (lowercase, punctuation, stopwords):
----------------------------------------
hello world sample paragraph issues visit website contact meeting scheduled 15th december 2024 330 pm discussing q4 2024 results planning 2025 unnecessary words actually basically overused numbers like 123 45678 special characters handled microsoft corporation msf

The code defines a TextCleaner class with several methods for preprocessing text data:

__init__(self, model_name="en_core_web_sm"): Initializes the TextCleaner by loading a SpaCy language model. If the model is not found, it automatically downloads it.

***clean_text(...):*** This is the core method for cleaning text. It takes boolean arguments that control the cleaning process: converting to lowercase; removing stop words, punctuation, numbers, spaces, emails, URLs, and special characters; lemmatizing; and setting a minimum token length. It first uses regular expressions to remove URLs and emails, then processes the text with the loaded SpaCy model to handle the remaining cleaning steps based on token properties.

***extract_entities(self, text: str):*** Extracts named entities (like people, organizations, dates) from the input text using the SpaCy model.

***get_pos_tags(self, text: str):*** Gets the part-of-speech tag for each token in the input text.

***remove_by_pos(self, text: str, pos_to_remove: Set[str]):*** Removes tokens from the text based on a provided set of part-of-speech tags.

***normalize_whitespace(self, text: str)***: Cleans up various forms of whitespace in the text.

The main() function provides a proof of concept by:

* Defining a sample_text with various cleaning challenges.
* Initializing a TextCleaner object.
* Demonstrating different text cleaning configurations using the clean_text method (basic cleaning, with lemmatization, removing specific elements, minimal cleaning, custom strict cleaning).
* Showing examples of entity extraction, removing tokens by POS tags, and displaying POS tags for understanding.

In short, this script gives a flexible way to clean text data for natural language processing tasks, letting you customize the cleaning steps to your needs.
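The regex pre-pass mentioned for clean_text can be sketched on its own. This is a minimal illustration, not the class's actual code: the two patterns and the helper name `strip_urls_and_emails` are assumptions made for this example.

```python
import re

# Hypothetical patterns for illustration; the TextCleaner class may use different ones.
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
EMAIL_PATTERN = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')

def strip_urls_and_emails(text: str) -> str:
    """Replace URLs and email addresses with a space, then collapse whitespace."""
    text = URL_PATTERN.sub(' ', text)
    text = EMAIL_PATTERN.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Visit our website at https://www.example.com or contact us at info@example.com."
stripped = strip_urls_and_emails(sample)
print(stripped)  # Visit our website at or contact us at .
```

The leftover period shows why this pass runs before tokenization: punctuation and stop words are left for the later token-level steps.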

## **Example 2:** Tokenize & stopword removal

In [2]:
from nltk import word_tokenize
sentence = "She sells seashells on the seashore"

### **Then, Stopwords Removal:**
Why? Because if you remove these words, you can still understand the sentence from context.

In [3]:
def remove_stop_words(text,stop_word_list):
    return ' '.join([word for word in word_tokenize(text) if word.lower() not in stop_word_list])


custom_stop_word_list = ['she', 'on', 'the', 'am', 'is', 'not']
remove_stop_words(sentence,custom_stop_word_list)

'sells seashells seashore'
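The same idea can be tried without NLTK's tokenizer data. This is a sketch that assumes a crude regex tokenizer is good enough for the sentence at hand; `remove_stop_words_simple` is a name made up for this example.

```python
import re

def remove_stop_words_simple(text, stop_words):
    """Drop stop words using a regex tokenizer (no NLTK data download needed)."""
    stop_set = {w.lower() for w in stop_words}   # a set gives fast membership tests
    tokens = re.findall(r"\w+", text)            # crude tokenizer: runs of word characters
    return ' '.join(t for t in tokens if t.lower() not in stop_set)

result = remove_stop_words_simple(
    "She sells seashells on the seashore",
    ['she', 'on', 'the', 'am', 'is', 'not'],
)
print(result)  # sells seashells seashore
```

Unlike word_tokenize, the regex discards punctuation entirely, so it is only a stand-in for simple sentences like this one.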

### **FEATURE EXTRACTION FROM TEXT**
#### **General features:** statistical calculations that do not depend on the content of the text, e.g. the ***number of tokens or characters*** in the text.

#### **Specific features**: depend on the inherent meaning of the text and represent its semantics, e.g. the ***frequency of unique words*** in the text.

Example: two sentences with the same number of words (4): "The sky is blue" and "The pillar is yellow".
They have the same general feature (number of words) but different individual tokens.
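The two example sentences can be compared directly. This is a small sketch, assuming whitespace tokenization and lowercasing, that computes one general feature (token count) and one specific feature (word frequencies) for each.

```python
from collections import Counter

s1 = "The sky is blue"
s2 = "The pillar is yellow"

tokens1 = s1.lower().split()
tokens2 = s2.lower().split()

# General feature: number of tokens (identical for both sentences)
n1, n2 = len(tokens1), len(tokens2)

# Specific feature: frequency of each unique word (differs between sentences)
freq1, freq2 = Counter(tokens1), Counter(tokens2)
shared = sorted(set(freq1) & set(freq2))

print(n1, n2)   # 4 4
print(shared)   # ['is', 'the']
```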

***import pandas as pd*** and ***from textblob import TextBlob***: import the libraries, **pandas for data manipulation** and **TextBlob for text processing** (TextBlob is not used in this cell itself; it is used in the cells that follow).

* ***df = pd.DataFrame(...)***: Creates a pandas DataFrame. The data is provided as a list of lists, where each inner list contains a single sentence.
* ***df.columns = ['text']***: Assigns the column name 'text' to the newly created DataFrame.
* ***df.head()***: Displays the first few rows of the DataFrame, which here is the entire DataFrame since it has only four rows.

This sets up the data used for the text processing and analysis in the following cells.

**EXAMPLE: Give me the # of words in this dataset**

In [None]:
import pandas as pd
from textblob import TextBlob
df = pd.DataFrame([['The interim budget for 2019 will be announced on 1st February.'], ['Do you know how much expectation the middle-class working population is having from this budget?'], ['February is the shortest month in a year.'], ['This financial year will end on 31st March.']])
df.columns = ['text']
df.head()

Unnamed: 0,text
0,The interim budget for 2019 will be announced ...
1,Do you know how much expectation the middle-cl...
2,February is the shortest month in a year.
3,This financial year will end on 31st March.



### COUNT WORDS IN A SENTENCE:

The code below defines and uses a function, ***add_num_words(df)***, that counts the number of words in each sentence of the DataFrame:

***def add_num_words(df)***: Defines a function called add_num_words that takes a pandas DataFrame (df) as input.

***df['number_of_words'] = df['text'].apply(lambda x : len(TextBlob(str(x)).words))***: This is the core of the function. It creates a new column in the DataFrame called number_of_words by applying a lambda function to each entry in the 'text' column (df['text'].apply(...)).
The lambda function takes each text entry (x), converts it to a string (str(x)), creates a TextBlob object from it, accesses the .words attribute (which gives a list of words), and takes the length of that list (len(...)), which is the word count.

***return df***: The function returns the modified DataFrame with the new column.

***add_num_words(df)['number_of_words']***: Calls add_num_words on the DataFrame df, then selects and displays only the newly created 'number_of_words' column.

The output (11, 15, 8, 8 for rows 0 to 3) shows the number of words in each sentence of the DataFrame.

In [None]:
def add_num_words(df):
    df['number_of_words'] = df['text'].apply(lambda x : len(TextBlob(str(x)).words))
    return df
add_num_words(df)['number_of_words']

Unnamed: 0,number_of_words
0,11
1,15
2,8
3,8


**Which sentences contain a wh-word? (Is the intersection between a sentence's tokens and the set of wh-words non-empty?)**

***def is_present(wh_words, df):*** defines a Python function named is_present.

The function takes two arguments: `wh_words` (expected to be a set of words) and `df` (expected to be a pandas DataFrame). It checks whether any word from the wh_words set appears in each text entry of the DataFrame df.

In [None]:
def is_present(wh_words, df):

    # The below line of code will find the intersection between set of tokens of
    #  every sentence and the wh_words and will return true if the length of intersection
    #  set is non-zero.
    df['is_wh_words_present'] = df['text'].apply(lambda x : True if \
                                                 len(set(TextBlob(str(x)).words).intersection(wh_words))>0 else False)
    return df

wh_words = set(['why', 'who', 'which', 'what', 'where', 'when', 'how'])

is_present(wh_words, df)['is_wh_words_present']

Unnamed: 0,is_wh_words_present
0,False
1,True
2,False
3,False
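One caveat worth checking: the set intersection compares strings exactly, so a sentence that starts with a capitalized "How" or "What" would likely be missed unless the tokens are lowercased first. The sketch below assumes a simple regex tokenizer in place of TextBlob; `has_wh_word` is a hypothetical helper name.

```python
import re

WH_WORDS = {'why', 'who', 'which', 'what', 'where', 'when', 'how'}

def has_wh_word(text, wh_words=WH_WORDS):
    """True if any lowercased token of `text` is a wh-word."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return bool(tokens & wh_words)

checks = [has_wh_word(s) for s in [
    "How much does the budget cost?",             # capitalized wh-word at sentence start
    "February is the shortest month in a year.",
]]
print(checks)  # [True, False]
```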


 #### **Import essential tools from the NLTK library** for NLP processing, specifically for splitting text into words (word_tokenize) and identifying their grammatical roles (pos_tag).

 #### It also **downloads the necessary data** (tagsets & averaged_perceptron_tagger) for these NLTK functions to work correctly.

In [4]:
import pandas as pd
from string import punctuation
import nltk

nltk.download('tagsets')
from nltk.data import load

nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk import word_tokenize
from collections import Counter

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [5]:
def get_tagsets():
    tagdict = load('help/tagsets/upenn_tagset.pickle')
    return list(tagdict.keys())

tag_list = get_tagsets()

print(tag_list)

['PRP$', 'VBG', 'FW', 'VB', 'POS', "''", 'VBP', 'VBN', 'JJ', 'WP', 'VBZ', 'DT', 'RP', '$', 'NN', ')', '(', 'RBR', 'VBD', ',', '.', 'TO', 'LS', 'RB', ':', 'NNS', 'NNP', '``', 'WRB', 'CC', 'PDT', 'RBS', 'PRP', 'CD', 'EX', 'IN', 'WP$', 'MD', 'NNPS', '--', 'JJS', 'JJR', 'SYM', 'UH', 'WDT']


## **BAG OF WORDS (BoW)**

### **Convert every sentence into a vector.** It is one of the first algorithms to learn in NLP

#### The **vector's length  =  # of unique words**

Every unique word is assigned an index. For *"The dog runs in a park"* the indices are: The = 0, dog = 1, runs = 2, in = 3, a = 4, park = 5.

This is done in 2 steps:
1. The vocabulary (or dictionary) of all the words is generated.
2. Every doc is represented by a list whose length = # of words in that vocabulary/dictionary, recording the presence or absence (or the count) of each word.
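The two steps can be sketched by hand in plain Python. This is an illustration under simple assumptions (lowercasing and a `\w+` tokenizer), with two made-up sentences; note that scikit-learn's CountVectorizer, by default, drops one-character tokens such as "a", which this sketch keeps.

```python
import re

docs = ["The dog runs in a park", "The dog sees the other dog"]

def tokenize(text):
    return re.findall(r"\w+", text.lower())

# Step 1: build the vocabulary and give every unique word an index
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
index = {word: i for i, word in enumerate(vocabulary)}

# Step 2: represent each doc as a list as long as the vocabulary
def to_count_vector(text):
    vec = [0] * len(vocabulary)
    for tok in tokenize(text):
        vec[index[tok]] += 1
    return vec

count_vectors = [to_count_vector(d) for d in docs]
# Presence/absence variant: 1 if the word occurs at all, else 0
binary_vectors = [[1 if c else 0 for c in v] for v in count_vectors]

print(vocabulary)         # ['a', 'dog', 'in', 'other', 'park', 'runs', 'sees', 'the']
print(count_vectors[1])   # [0, 2, 0, 1, 0, 0, 1, 2]
print(binary_vectors[1])  # [0, 1, 0, 1, 0, 0, 1, 1]
```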

**Example of BoW:**

* Suppose we collect all the reviews of a restaurant, broken down into sentences:

Review 1: "Great Cuban food. Amazing Cuban coffee"

Review 2: "Terrible service. Food was terrible"

Review 3: "Great Service. Amazing food! Great coffee"

* Then list the unique words:

**Step 1** = identify all unique words in the corpus

vocabulary = ['amazing', 'coffee', 'cuban', 'food', 'great', 'service', 'terrible', 'was']

**Step 2** = Count the words in each doc (in this case, each "review"):

| Word | Review 1 | Review 2 | Review 3 |
|---|---|---|---|
| amazing | 1 | 0 | 1 |
| coffee | 1 | 0 | 1 |
| cuban | 2 | 0 | 0 |
| food | 1 | 1 | 1 |
| great | 1 | 0 | 2 |
| service | 0 | 1 | 1 |
| terrible | 0 | 2 | 0 |
| was | 0 | 1 | 0 |


**Step 3** = vectors (one entry per vocabulary word, so each has length 8)

review_1_vector = [1,1,2,1,1,0,0,0]  # Great Cuban food...

review_2_vector = [0,0,0,1,0,1,2,1]  # Terrible service...

review_3_vector = [1,1,0,1,2,1,0,0]  # Great service...

In [6]:
#===================================
# Our Miami restaurant reviews
#===================================

# Import the word-count vectorizer

from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "Great Cuban food. Amazing Cuban coffee.",
    "Terrible service. Food was terrible.",
    "Great service. Amazing food. Great coffee."
]

# Create the Bag of Words matrix
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(reviews)  # Fit on the custom review data

# See the vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nBag of Words Matrix:")
print(bow_matrix.toarray())

Vocabulary: ['amazing' 'coffee' 'cuban' 'food' 'great' 'service' 'terrible' 'was']

Bag of Words Matrix:
[[1 1 2 1 1 0 0 0]
 [0 0 0 1 0 1 2 1]
 [1 1 0 1 2 1 0 0]]


### **With the simple BoW algorithm, we can do:**

 * Spam detection
 * Categorize news articles or content
 * Sentiment analysis (identify complaints)
 * Find document similarity (plagiarism)
 * Language detection

## **Spam detection**


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# Miami-themed email examples
emails = [
    # Spam emails
    "WIN FREE cruise to Bahamas from Miami port! Click NOW!!!",
    "URGENT! Your Miami bank account needs verification! Act fast!",
    "Congratulations! You won $1000 Miami shopping spree! Claim here!",
    "FREE tickets to Ultra Music Festival! Limited time offer!",

    # Normal emails
    "Meeting tomorrow at the Brickell office at 2pm",
    "Can you send me the quarterly report when you have time?",
    "Let's grab Cuban coffee after the presentation",
    "The project deadline has been moved to next Friday"
]

labels = ['spam', 'spam', 'spam', 'spam',
          'normal', 'normal', 'normal', 'normal']

# Create BoW
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X, labels)

# Test new email
new_email = ["FREE Miami Heat tickets! Click this link now!"]
new_email_bow = vectorizer.transform(new_email)
prediction = classifier.predict(new_email_bow)
print(f"Prediction: {prediction[0]}")  # Output: spam

Prediction: spam


## **Categorize news articles or content**: art? sports? nature?
E.g. product categorization; on social media, BoW can help you identify the most popular trends or hashtags.

In [8]:
# Categorize Miami news articles
articles = [
    "The Miami Heat defeated the Lakers in overtime last night at the arena",
    "New coral restoration project launches in Biscayne Bay this week",
    "Art Basel Miami Beach announces featured artists for 2024 exhibition",
    "Hurricane season preparation tips for South Florida residents",
    "Dolphins quarterback throws three touchdowns in victory",
    "Climate change impacts on Miami Beach erosion studied by scientists"
]

categories = ['sports', 'environment', 'arts', 'weather', 'sports', 'environment']

# BoW + Classification
vectorizer = CountVectorizer(max_features=50)
X = vectorizer.fit_transform(articles)
# Now you can train any classifier (SVM, Random Forest, etc.)
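The cell above stops at the feature matrix. A minimal sketch of the next step follows, assuming a Multinomial Naive Bayes classifier (the same model used in the spam example) and a made-up headline. With only six training articles the prediction is not reliable, so no expected output is shown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

articles = [
    "The Miami Heat defeated the Lakers in overtime last night at the arena",
    "New coral restoration project launches in Biscayne Bay this week",
    "Art Basel Miami Beach announces featured artists for 2024 exhibition",
    "Hurricane season preparation tips for South Florida residents",
    "Dolphins quarterback throws three touchdowns in victory",
    "Climate change impacts on Miami Beach erosion studied by scientists",
]
categories = ['sports', 'environment', 'arts', 'weather', 'sports', 'environment']

# BoW features, capped at the 50 most frequent terms
vectorizer = CountVectorizer(max_features=50)
X = vectorizer.fit_transform(articles)

# Train the classifier on the BoW matrix
clf = MultinomialNB()
clf.fit(X, categories)

# Hypothetical new headline; transform (not fit_transform) reuses the learned vocabulary
new_article = ["Heat quarterback scores in overtime victory"]
predicted = clf.predict(vectorizer.transform(new_article))
print(f"Predicted category: {predicted[0]}")
```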

## **Sentiment analysis**
Positive or Negative review?

In [9]:
from sklearn.linear_model import LogisticRegression

# Miami restaurant reviews
reviews = [
    "Amazing Cuban sandwich! Best in Miami! Will definitely return!",
    "Terrible service, cold food, never coming back to this place",
    "Decent food but nothing special, average Miami restaurant",
    "Horrible experience, worst meal ever, completely disappointed",
    "Outstanding seafood! Fresh catch! Excellent service! Love it!",
    "Mediocre at best, expected more from the reviews"
]

sentiments = ['positive', 'negative', 'neutral', 'negative', 'positive', 'neutral']

# Create BoW
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# Train sentiment classifier
sentiment_classifier = LogisticRegression()
sentiment_classifier.fit(X, sentiments)

# Analyze new review
new_review = ["The food was fantastic and service was great!"]
new_bow = vectorizer.transform(new_review)
sentiment = sentiment_classifier.predict(new_bow)
print(f"Sentiment: {sentiment[0]}")  # One might expect "positive"; this run printed "neutral"

Sentiment: neutral
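A likely reason the result is "neutral" rather than "positive": the sentiment-bearing words in the new review never occur in the six training reviews, so the model cannot use them. The sketch below checks this with a regex that approximates CountVectorizer's default tokenization (lowercase, tokens of two or more word characters); that approximation and the helper name are assumptions.

```python
import re

def sk_like_tokens(text):
    """Approximate CountVectorizer's default analysis: lowercase, 2+ word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

reviews = [
    "Amazing Cuban sandwich! Best in Miami! Will definitely return!",
    "Terrible service, cold food, never coming back to this place",
    "Decent food but nothing special, average Miami restaurant",
    "Horrible experience, worst meal ever, completely disappointed",
    "Outstanding seafood! Fresh catch! Excellent service! Love it!",
    "Mediocre at best, expected more from the reviews",
]
vocabulary = {tok for r in reviews for tok in sk_like_tokens(r)}

new_review = "The food was fantastic and service was great!"
tokens = set(sk_like_tokens(new_review))
known = sorted(tokens & vocabulary)
unknown = sorted(tokens - vocabulary)

print("Known:  ", known)    # ['food', 'service', 'the']
print("Unknown:", unknown)  # ['and', 'fantastic', 'great', 'was']
```

Only "food", "service", and "the" reach the classifier, and those words appear in negative, neutral, and positive training reviews alike, so a larger training set would be needed before the prediction is meaningful.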


## **Find document similarity** (plagiarism or similar reviews)

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

# Miami event descriptions
events = [
    "Beach volleyball tournament at South Beach this Saturday",
    "Sand volleyball competition on South Beach this weekend",
    "Art gallery opening in Wynwood Arts District Friday night",
    "New exhibition opens at Wynwood Walls this Friday evening",
    "Food truck festival at Bayfront Park all weekend"
]

# Convert to BoW
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(events)

# Calculate similarity
similarity_matrix = cosine_similarity(bow_matrix)

# Find most similar events to event 0
event_idx = 0
similarities = similarity_matrix[event_idx]
most_similar_idx = similarities.argsort()[-2]  # -1 would be itself

print(f"Event: {events[event_idx]}")
print(f"Most similar: {events[most_similar_idx]}")
print(f"Similarity score: {similarities[most_similar_idx]:.2f}")

Event: Beach volleyball tournament at South Beach this Saturday
Most similar: Sand volleyball competition on South Beach this weekend
Similarity score: 0.56
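The 0.56 score can be reproduced by hand. Cosine similarity is the dot product of the two count vectors divided by the product of their lengths. A sketch in plain Python, assuming the same tokenization as CountVectorizer's default (lowercase, tokens of two or more word characters):

```python
import math
import re
from collections import Counter

def bow_counts(text):
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    # Dot product over shared words only; all other terms are zero
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

e0 = bow_counts("Beach volleyball tournament at South Beach this Saturday")
e1 = bow_counts("Sand volleyball competition on South Beach this weekend")
score = cosine(e0, e1)
print(f"{score:.2f}")  # 0.56
```

The shared words are beach (2 x 1), volleyball, south, and this, so the dot product is 5; the vector lengths are sqrt(10) and sqrt(8), giving 5 / sqrt(80) ≈ 0.56.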


### This code implements a simple search engine using the **Bag of Words model and cosine similarity** to **find documents most similar to a given query.**

In [12]:
def search_documents(query, documents):
    """Simple BoW-based search engine"""

    # Combine query with documents
    all_text = [query] + documents

    # Create BoW
    vectorizer = CountVectorizer()
    bow_matrix = vectorizer.fit_transform(all_text)

    # Calculate similarity between query (index 0) and all docs
    query_vector = bow_matrix[0]
    doc_vectors = bow_matrix[1:]

    similarities = cosine_similarity(query_vector, doc_vectors).flatten()

    # Rank documents
    ranked_idx = similarities.argsort()[::-1]

    return [(documents[idx], similarities[idx]) for idx in ranked_idx if similarities[idx] > 0]

# Miami business descriptions
businesses = [
    "Joe's Stone Crab serves fresh seafood and famous stone crabs",
    "Versailles Restaurant offers authentic Cuban cuisine and coffee",
    "Books & Books is an independent bookstore with author events",
    "Jungle Island features exotic animals and interactive shows",
    "Pérez Art Museum Miami showcases contemporary and modern art"
]

# Search
results = search_documents("Cuban food restaurant", businesses)
for doc, score in results[:3]:
    print(f"Score: {score:.2f} - {doc}")

Score: 0.41 - Versailles Restaurant offers authentic Cuban cuisine and coffee


***def search_documents(query, documents):*** Defines the function search_documents, which takes a search query string and a list of documents (strings) as input.

***all_text = [query] + documents***: Combines the query and all documents into a single list of strings.

***vectorizer = CountVectorizer()***: Creates an instance of CountVectorizer to convert text into a Bag of Words representation.

***bow_matrix = vectorizer.fit_transform(all_text)***: Fits the vectorizer to all the text (query + documents) and transforms them into a BoW matrix.

***query_vector = bow_matrix[0]***: Selects the BoW vector for the query (which is the first item in all_text).

***doc_vectors = bow_matrix[1:]***: Selects the BoW vectors for all the documents.

***similarities = cosine_similarity(query_vector, doc_vectors).flatten():*** Calculates the cosine similarity between the query vector and each document vector. flatten() converts the result into a 1D array.

***ranked_idx = similarities.argsort()[::-1]:*** Gets the indices that would sort the similarities array in descending order.

***return [(documents[idx], similarities[idx]) for idx in ranked_idx if similarities[idx] > 0]:*** Returns a list of tuples, where each tuple contains a document and its similarity score, sorted from highest similarity to lowest, and only includes documents with a similarity score greater than 0.

The code then defines a list of sample Miami business descriptions, calls the search_documents function with a query, and prints the top 3 results.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

#=======================
# Miami news snippets
#=======================

news_snippets = [
    "Hurricane preparedness workshop scheduled for residents",
    "Beach erosion concerns grow after recent storms",
    "Heat win playoff game with Butler scoring 35 points",
    "Dolphins draft new quarterback in first round",
    "Art Basel brings international artists to Miami Beach",
    "Storm surge warnings issued for coastal areas",
    "Museum opens new contemporary art exhibition",
    "Basketball team advances to conference finals"
]

# Create BoW
vectorizer = CountVectorizer(max_features=20, stop_words='english')
bow = vectorizer.fit_transform(news_snippets)

# Topic modeling
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(bow)

# Display topics
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words_idx = topic.argsort()[-5:][::-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")

Topic 1: beach, art, artists, brings, basel
Topic 2: new, exhibition, contemporary, draft, dolphins
Topic 3: finals, advances, basketball, conference, areas


In [None]:
#===================
#HOTEL FEEDBACK
#===================
feedback = [
    "Great location near beach but parking was expensive",
    "Beautiful beach view and excellent pool area",
    "Parking was terrible and very expensive",
    "Love the beach access and pool facilities",
    "Room was clean but parking situation is horrible"
]

# Create BoW focusing on specific aspects
vectorizer = CountVectorizer(ngram_range=(1, 2))  # Include bigrams
bow = vectorizer.fit_transform(feedback)

# Sum word frequencies
word_freq = bow.sum(axis=0).A1
word_names = vectorizer.get_feature_names_out()

# Find most mentioned aspects
import numpy as np
top_indices = word_freq.argsort()[-10:][::-1]

print("Most mentioned aspects:")
for idx in top_indices:
    print(f"  '{word_names[idx]}': {word_freq[idx]} mentions")

Most mentioned aspects:
  'was': 3 mentions
  'parking': 3 mentions
  'and': 3 mentions
  'beach': 3 mentions
  'pool': 2 mentions
  'parking was': 2 mentions
  'but parking': 2 mentions
  'expensive': 2 mentions
  'but': 2 mentions
  'was clean': 1 mentions


In [14]:
#===================================
# Miami's multilingual environment
#===================================

texts = [
    "Welcome to Miami Beach, enjoy your stay!",  # English
    "Bienvenido a Miami Beach, disfruta tu estancia!",  # Spanish
    "The best Cuban coffee in all of Miami",  # English
    "El mejor café cubano de todo Miami",  # Spanish
    "Bienvenue à Miami Beach!"  # French
]

languages = ['english', 'spanish', 'english', 'spanish', 'french']

# Character-level BoW can help detect languages
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 3))
X = char_vectorizer.fit_transform(texts)

# Train language detector
from sklearn.naive_bayes import MultinomialNB
lang_detector = MultinomialNB()
lang_detector.fit(X, languages)

# Detect language
new_text = ["Hola, dónde está la playa?"]
new_bow = char_vectorizer.transform(new_text)
detected = lang_detector.predict(new_bow)
print(f"Detected language: {detected[0]}")

Detected language: spanish
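To see what a character-level BoW counts, here is a rough sketch of extracting 2- and 3-character n-grams from a string. It only approximates what `analyzer='char'` with `ngram_range=(2, 3)` does; scikit-learn's exact whitespace handling may differ, and `char_ngrams` is a name made up for this example.

```python
def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams of length n_min..n_max from the lowercased text."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

grams = char_ngrams("Hola")
print(grams)  # ['ho', 'ol', 'la', 'hol', 'ola']
```

Character sequences such as "ola" or "est" recur within a language even when whole words do not, which is why this representation can separate languages with very little training text.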



## Activity 2.01: Extracting Top Keywords from the News Article

In this activity, you will extract the most frequently occurring keywords from a sample news article using Python and the Natural Language Toolkit (NLTK).

### Prerequisites

- Basic understanding of Python programming.
- An environment to run Python code (like Jupyter Notebook or Google Colab).

### Data

The news article used in this activity is available at the following link: [news_article.txt](https://github.com/fenago/natural-language-processing-workshop/blob/master/Lab02/data/news_article.txt).

### Steps to Follow

1. **Set Up Your Environment**:
   - Open Jupyter Notebook or Google Colab.
   - Ensure Python is installed along with NLTK. Install NLTK if not already installed using `!pip install nltk`.

2. **Import Necessary Libraries**:
   - Import `nltk` and other necessary Python libraries.

3. **Define Helper Functions**:
   - Create functions to load the text file, convert text to lowercase, tokenize the text, remove stop words, perform stemming, and calculate word frequencies.

4. **Load the News Article**:
   - Use Python's file handling methods to load `news_article.txt` into a string.

5. **Preprocess the Text**:
   - Convert the text to lowercase.
   - Tokenize the text using a whitespace tokenizer.
   - Remove stop words from the tokens.
   - Perform stemming on the remaining tokens.

6. **Calculate Word Frequencies**:
   - Count the frequency of each word after stemming.
   - Display the most frequent keywords.
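The preprocessing and counting steps can be sketched as below. This is a starting point rather than a full solution: it assumes a deliberately tiny, hypothetical stop-word list, uses a shortened stand-in for the article text, and leaves out stemming (NLTK's PorterStemmer could be applied to each token before counting).

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOP_WORDS = {'the', 'of', 'to', 'in', 'a', 'an', 'and', 'is', 'its', 'has', 'been'}

def top_keywords(text, n=5, stop_words=STOP_WORDS):
    tokens = text.lower().split()                      # lowercase + whitespace tokenizer
    tokens = [t.strip('.,:;()"\'') for t in tokens]    # trim surrounding punctuation
    tokens = [t for t in tokens if t and t not in stop_words]
    return Counter(tokens).most_common(n)              # (word, frequency) pairs

# Shortened stand-in for the article, for illustration only
snippet = ("The European Commission is worried, too. In 2017 the commission took "
           "Poland to the European Court of Justice.")
keywords = top_keywords(snippet, n=2)
print(keywords)  # [('european', 2), ('commission', 2)]
```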

### Challenge for Students

Now that you've learned how to extract keywords from a news article, challenge yourself by applying these techniques to a different dataset. Here's what you can do:

- **Find a Unique Dataset**: Select a text dataset of your interest. This could be another news article, a blog post, or any textual data.
- **Implement the Keyword Extraction Process**: Apply the steps you've learned in this activity to your dataset. This includes text preprocessing, tokenization, stop word removal, stemming, and frequency analysis.
- **Analyze Your Results**: Look at the most frequent keywords in your dataset. Do they give you insights into the main themes or topics of the text?

**Contextualize Your Learning**: Reflect on how this process could be useful in real-world applications like search engine optimization, content analysis, or summarizing information.


**For `news_article.txt`:** create a text file and paste in the following contents:

Ever since the populist Law and Justice (pis) party took power in 2015, Adam Bodnar, Poland’s
 human-rights ombudsman, has been warning against its relentless efforts to get control of the
 courts. To illustrate the danger, he uses an expression from communist times: lex telefonica.
 In the Polish People’s Republic, verdicts were routinely dictated by a phone call from an
 apparatchik at party headquarters. Today’s government has more subtle techniques,
 but the goal is the same, Mr Bodnar says: “If a judge has a case on his desk with some
 political importance, he should be afraid.”

The European Commission is worried, too. It accuses pis of violating Poland’s commitments
to the rule of law under the European Union’s founding treaty. In 2017 the commission took
Poland to the European Court of Justice (ecj) over laws that gave politicians control over
appointing judges. (For example, they lowered judges’ retirement age while letting the justice
 minister pick whom to exempt.) The ecj ruled against the Poles, who had in the meantime
 scrapped some of the measures.

## Activity 2.02: Text Visualization

In this activity, you will create a word cloud for the 50 most frequent words in a dataset. The dataset consists of random sentences that need to be cleaned and analyzed to identify frequently occurring words.

### Prerequisites

- Basic understanding of Python programming.
- Familiarity with text processing and visualization libraries in Python.

### Data

The dataset used in this activity is available at the following link: [text_corpus.txt](https://github.com/fenago/natural-language-processing-workshop/blob/master/Lab02/data/text_corpus.txt).

### Steps to Follow

1. **Import Necessary Libraries**:
   - Import libraries required for data fetching, text processing, and visualization (like `pandas`, `nltk`, `matplotlib`, `wordcloud`, etc.).

2. **Fetch the Dataset**:
   - Retrieve the `text_corpus.txt` file and load its contents.

3. **Preprocess the Text**:
   - Perform text cleaning to remove unwanted characters and formats.
   - Tokenize the text.
   - Apply lemmatization to convert words to their base form.

4. **Identify Top 50 Words**:
   - Calculate the frequency of each word in the cleaned dataset.
   - Create a set of the top 50 most frequent words along with their frequencies.

5. **Create a Word Cloud**:
   - Use the word cloud library to visualize the top 50 words.
   - Customize the word cloud's appearance as needed.

6. **Analyze the Word Cloud**:
   - Compare the word cloud with the calculated word frequencies.
   - Justify the representation of words in the word cloud based on their frequencies.

### Challenge for Students

Now that you have created a word cloud for a given dataset, try extending your skills with these tasks:

- **Use a Different Dataset**: Find another text dataset that interests you. It could be a collection of social media posts, reviews, or any other textual content.
- **Apply Enhanced Text Processing**: Experiment with different preprocessing techniques like stop word removal, n-grams, or POS tagging.
- **Visualize Your Findings**: Create a word cloud for your chosen dataset. How does the word cloud reflect the key themes or sentiments in the data?
- **Draw Insights**: Reflect on how word clouds can aid in quick data analysis, highlighting key areas for deeper exploration.

**Explore Further**: Consider how word clouds can be used in areas like marketing analysis, sentiment analysis, or summarizing large volumes of text.


In [None]:
news_article = '''Ever since the populist Law and Justice (pis) party took power in 2015, Adam Bodnar, Poland’s human-rights ombudsman, has been warning against its relentless efforts to get control of the courts. To illustrate the danger, he uses an expression from communist times: lex telefonica. In the Polish People’s Republic, verdicts were routinely dictated by a phone call from an apparatchik at party headquarters. Today’s government has more subtle techniques, but the goal is the same, Mr Bodnar says: “If a judge has a case on his desk with some political importance, he should be afraid.”

The European Commission is worried, too. It accuses pis of violating Poland’s commitments to the rule of law under the European Union’s founding treaty. In 2017 the commission took Poland to the European Court of Justice (ecj) over laws that gave politicians control over appointing judges. (For example, they lowered judges’ retirement age while letting the justice minister pick whom to exempt.) The ecj ruled against the Poles, who had in the meantime scrapped some of the measures.'''


In [None]:
news_article

