## Prerequisites ##

1. Natural Language Processing (NLP) is the field concerned with getting computers to interpret and generate human language, whether written or spoken.

2. Common techniques include:

*   **Tokenization**: separating a piece of text into smaller units (words --> subwords --> characters)
*   **Stemming / Lemmatization**: reduce a word to its root form (stemming strips affixes; lemmatization maps the word to its dictionary form)
*   **Word-sense Disambiguation**: determine the correct meaning of a word based on its context in a sentence
*   **Named-entity Recognition**: find unique entities (e.g. names, places) in text
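
As a rough illustration of tokenization at the three granularities listed above, here is a minimal sketch that uses only the standard library. The helper names (`word_tokens`, `naive_subwords`, `char_tokens`) and the fixed-size chunking rule are assumptions made for illustration; real subword tokenizers such as BPE or WordPiece learn their vocabulary from data.

```python
import re

def word_tokens(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def naive_subwords(word, size=3):
    """Hypothetical stand-in for subword tokenization: fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]

def char_tokens(word):
    """Split a word into individual characters."""
    return list(word)

words = word_tokens("Tokenization splits text.")
print(words)                      # word level
print(naive_subwords(words[0]))   # subword level (toy rule)
print(char_tokens(words[2]))      # character level
```

Each step down trades a larger vocabulary for longer sequences, which is why subword schemes are a common middle ground.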

3. Common subgoals include:


*  **Text Classification**:  Assigning predefined categories or labels to text.  
   - Example: Sentiment analysis (positive, negative, neutral), spam detection, topic categorization.

*  **Text Similarity**:  Measuring how similar two pieces of text are.  
   - Example: Cosine similarity, Jaccard similarity, or using embeddings like Word2Vec, BERT.

*  **Chatbot**:  Building conversational agents that can interact with users in natural language.  
   - Example: Rule-based chatbots, retrieval-based chatbots, or generative models like GPT.

*  **Machine Translation**:  Automatically translating text from one language to another.  
   - Example: Google Translate, neural machine translation (NMT) models like Transformer.

*  **Topic Modeling**:  Identifying abstract topics within a collection of documents.  
   - Example: Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF).

*  **Text Summarization**:  Generating a concise summary of a longer text while retaining key information.  
   - Example: Extractive summarization (selecting important sentences) or abstractive summarization (generating new sentences).

* **Language Modeling**:  Predicting the next word in a sequence or generating coherent text.  
   - Example: GPT, BERT, or traditional n-gram models.

* **Information Extraction**:  Extracting structured information from unstructured text.  
   - Example: Named Entity Recognition (NER), relation extraction, event extraction.

* **Information Retrieval**:  Finding relevant documents or information from a large dataset.  
   - Example: Search engines, document retrieval systems using TF-IDF or BM25.

* **Voice Assistant**:  Building systems that can understand and respond to spoken language.  
    - Example: Siri, Alexa, Google Assistant, which combine Automatic Speech Recognition (ASR) and NLP.

* **Content Moderation**: Monitoring and managing user-generated content to ensure it adheres to guidelines and standards.  
   - Example: Detecting hate speech, spam, or inappropriate content using rule-based systems, machine learning, or deep learning.

* **Sentiment Analysis**: Determining the emotional tone or attitude expressed in a piece of text.  
   - Example: Classifying text as positive, negative, or neutral using lexicon-based approaches, machine learning, or deep learning models.
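
To make the **Text Similarity** entry above concrete, the following is a minimal sketch of Jaccard similarity over word sets. The function name and the whitespace tokenization are assumptions for illustration; a real pipeline would usually tokenize and normalize more carefully.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity of the word sets of two texts (0.0 to 1.0)."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)

score = jaccard_similarity("the cat sat on the mat", "the cat lay on the rug")
print(score)  # 3 shared words out of 7 distinct words
```

Because it only looks at word overlap, Jaccard similarity ignores word order and meaning, which is the gap that embedding-based measures try to close.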


4.  **Text vectorization** is the process of converting text data (words, sentences, or documents) into numerical representations (vectors) that machine learning models can process. It is the foundational step that the tasks listed above build on.

  * **Traditional Methods**:
      - **Bag-of-Words (BoW)**: Represents text as a vector of word frequencies.
      - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weighs words based on their importance in a document relative to a corpus.
      - **One-Hot Encoding**: Represents each word as a binary vector.

  * **Word Embeddings**:
      - **Word2Vec**: Represents words as dense vectors in a continuous vector space.
      - **GloVe**: Combines global statistics with local context to create word vectors.

  * **Contextual Embeddings**:
      - **BERT**: Generates context-aware embeddings for words or sentences, using the words on both sides of each token.
      - **GPT**: Produces embeddings in which each token is conditioned on the text that precedes it.
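
The traditional methods above can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up three-document corpus, assuming the unsmoothed weighting `tf * log(N / df)`; libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data for illustration)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(tokens):
    """Bag-of-Words: raw count of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def idf(word):
    """Inverse document frequency: log(N / number of docs containing word)."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(tokens):
    """TF-IDF: relative term frequency scaled by idf."""
    counts = Counter(tokens)
    return [(counts[w] / len(tokens)) * idf(w) for w in vocab]

bow = [bow_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]
print(vocab)
print(bow[0])
print([round(x, 3) for x in tfidf[0]])
```

In the first document, "cat" ends up with a higher TF-IDF weight than "sat" even though both occur once, because "sat" also appears in the second document.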

5. **Cosine Similarity** is a metric used to measure how similar two vectors are, regardless of their magnitude.

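* Calculates the cosine of the angle between two vectors in a multi-dimensional space. The result ranges from **-1 to 1**: 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.
* Widely used in NLP tasks, e.g. to compare the similarity of two texts represented as vectors.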
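
The definition is cos θ = (u · v) / (‖u‖ ‖v‖). A minimal standard-library sketch follows; the function name and the choice to raise on zero vectors are assumptions made here, not a library API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(a, [2.0, 4.0, 6.0])   # scaled copy of a
opposite = cosine_similarity(a, [-1.0, -2.0, -3.0])      # negated copy of a
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # perpendicular axes
print(same_direction, opposite, orthogonal)
```

Doubling a vector leaves the score unchanged, which is the sense in which the metric ignores magnitude.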
   





In [None]:
from gensim.models import Word2Vec
from nltk.corpus import brown
import nltk

# Download the Brown corpus
nltk.download('brown')

# Load the Brown corpus
sentences = brown.sents()

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector for a word
vector = model.wv['computer']
print("Vector for 'computer':", vector)

# Find similar words
similar_words = model.wv.most_similar('computer', topn=5)
print("Similar words to 'computer':", similar_words)

# Perform vector arithmetic
result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print("king - man + woman =", result)

# Save the model
model.save("word2vec.model")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Vector for 'computer': [ 0.0081644   0.08825592  0.00719581  0.06027758 -0.02150829 -0.09762977
  0.05002759  0.162107   -0.05735681 -0.08673133  0.03525983 -0.15187016
  0.04029834 -0.0135191   0.06247007 -0.01725993  0.05179335 -0.02849256
 -0.05461214 -0.08266587  0.10455804  0.04346529  0.05108863  0.00600776
  0.04418414  0.03388423 -0.12310209  0.01332964 -0.06466891  0.01582695
  0.06006123 -0.02829382  0.03596974 -0.08448923 -0.00498709 -0.03576018
  0.04804855 -0.06727072 -0.03595547 -0.02523234  0.01799497 -0.09745809
 -0.03569978  0.00483696  0.02099412 -0.02849314 -0.04460847 -0.01337569
 -0.05066493  0.05920564  0.00652491 -0.09858956 -0.11757163 -0.00296414
 -0.02751409 -0.02523994  0.09571654 -0.09371163  0.01030799  0.02648861
 -0.06512243  0.02275435  0.01157099  0.01632542 -0.03085877  0.10753866
 -0.00861219  0.07730311 -0.10477111  0.09956023  0.00507206  0.09231879
  0.05559046 -0.05996138  0.02511782  0.02273149  0.01172667  0.07508503
 -0.05046707 -0.01198102 -0.

## Capstone project 1: word embeddings

In [2]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.6 MB)
Installing collected packages: scipy, gensim
  Attempting uninstall: scipy
    Found existing installation: scipy 1.14.1
    Uninstalling scipy-1.14.1:
      Successfully 

In [5]:
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")




In [6]:
wv.similarity(w1="great", w2="well")

0.4098271

In [7]:
wv.most_similar("good")

[('great', 0.7291510105133057),
 ('bad', 0.7190051078796387),
 ('terrific', 0.6889115571975708),
 ('decent', 0.6837348341941833),
 ('nice', 0.6836092472076416),
 ('excellent', 0.644292950630188),
 ('fantastic', 0.6407778263092041),
 ('better', 0.6120728850364685),
 ('solid', 0.5806034803390503),
 ('lousy', 0.576420247554779)]

In [8]:
wv.most_similar(positive=["king", "woman"], negative=["man"])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]

In [9]:
wv.doesnt_match(['dog', 'cat', 'lion', 'microsoft'])

'microsoft'

In [None]:
glv = api.load("glove-twitter-25")

In [None]:
glv.doesnt_match("banana grapes orange human".split())

In [10]:
wv_great = wv["great"]
wv_good = wv["good"]
wv_great.shape

(300,)

In [21]:
import pandas as pd

df = pd.read_csv("fake_and_real_news.csv")
df.shape

(9900, 2)

In [22]:
df.head()

| | Text | label |
|---|---|---|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake |
| 1 | U.S. conservative leader optimistic of common ... | Real |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake |
| 4 | Democrats say Trump agrees to work on immigrat... | Real |


In [None]:
df.label.value_counts()

In [23]:
df["label_num"] = df.label.map({
    "Fake": 0,
    "Real": 1
})
df.head()

| | Text | label | label_num |
|---|---|---|---|
| 0 | Top Trump Surrogate BRUTALLY Stabs Him In The... | Fake | 0 |
| 1 | U.S. conservative leader optimistic of common ... | Real | 1 |
| 2 | Trump proposes U.S. tax overhaul, stirs concer... | Real | 1 |
| 3 | Court Forces Ohio To Allow Millions Of Illega... | Fake | 0 |
| 4 | Democrats say Trump agrees to work on immigrat... | Real | 1 |


In [None]:
!python -m spacy download en_core_web_lg

In [29]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [39]:
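def preprocess_and_vectorize(text):
  # Run the spaCy pipeline to tokenize and lemmatize the text
  doc = nlp(text)

  # Keep the lemma of every token that is neither punctuation nor a stop word
  filtered_tokens = []
  for token in doc:
    if token.is_punct or token.is_stop:
      continue
    filtered_tokens.append(token.lemma_)

  # Average the Word2Vec vectors of the remaining lemmas into one vector
  return wv.get_mean_vector(filtered_tokens)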

In [40]:
preprocess_and_vectorize("Don't worry if you don't understand").shape

(300,)

In [36]:
v1 = wv["worry"]
v2 = wv["understand"]

import numpy as np
np.mean([v1, v2], axis=0)[:3]

array([ 0.00976562, -0.00561523, -0.08905029], dtype=float32)

In [35]:
wv.get_mean_vector(["worry", "understand"], pre_normalize=False)

array([ 0.00976562, -0.00561523, -0.08905029,  0.01330566, -0.2709961 ,
        0.14746094,  0.3408203 , -0.01840591,  0.15161133, -0.06945801,
       -0.05749512, -0.17822266, -0.03805542,  0.08730698, -0.22216797,
        0.2578125 ,  0.06481934,  0.29589844,  0.00537109, -0.1875    ,
       -0.1159668 ,  0.0715332 ,  0.08691406,  0.05912399,  0.18359375,
        0.17687988,  0.09130859, -0.22705078,  0.10522461, -0.2475586 ,
       -0.02436638,  0.01245117, -0.06616211, -0.02587891,  0.13476562,
       -0.02604675,  0.06582642,  0.0612793 ,  0.07128906,  0.13867188,
        0.03234863, -0.03295898,  0.17736816, -0.08789062, -0.21777344,
       -0.11010742, -0.08728027, -0.01922607, -0.04943848,  0.05273438,
       -0.18066406,  0.13122559, -0.07498932, -0.10064697, -0.01171875,
        0.12963867, -0.10766602, -0.14624023,  0.11303711, -0.12280273,
       -0.03540039,  0.03601074, -0.01379395,  0.01042175,  0.1105957 ,
       -0.03820801, -0.20751953,  0.1352539 , -0.0625    , -0.01
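
The two cells above agree because `pre_normalize=False` turns `get_mean_vector` into a plain element-wise average. Assuming gensim 4.x's default of `pre_normalize=True`, each word vector is scaled to unit length before averaging, so `preprocess_and_vectorize` as written would average normalized vectors instead. The sketch below shows the difference with toy 2-d vectors standing in for real embeddings; the helper names are hypothetical and not part of gensim.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def unit(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

short_vec = [1.0, 0.0]   # small magnitude
long_vec = [0.0, 10.0]   # large magnitude

raw_mean = mean_vector([short_vec, long_vec])
normalized_mean = mean_vector([unit(short_vec), unit(long_vec)])
print(raw_mean)          # dominated by the long vector
print(normalized_mean)   # both words contribute equally
```

Pre-normalizing keeps a few high-magnitude words from dominating the document vector, which is a plausible reason for it being the default.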

In [42]:
df['vector'] = df['Text'].apply(preprocess_and_vectorize)

In [44]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.vector,
    df.label_num,
    test_size=0.2,
    random_state=2022,
    stratify=df.label_num)


In [45]:
X_train[:2]

| | vector |
|---|---|
| 5454 | [-0.0025821957, 0.0066654426, 0.0025264677, 0.... |
| 2881 | [0.0076183933, 0.0024890136, -0.017599065, 0.0... |


In [46]:
X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

In [47]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

clf = GradientBoostingClassifier()
clf.fit(X_train_2d, y_train)

y_pred = clf.predict(X_test_2d)

print(classification_report(y_test, y_pred))

KeyboardInterrupt: 

In [None]:
test_news = [
    "Michigan governor denies misleading U.S. House on Flint water (Reuters) - Michigan Governor Rick Snyder denied Thursday that he had misled a U.S. House of Representatives committee last year over testimony on Flint’s water crisis after lawmakers asked if his testimony had been contradicted by a witness in a court hearing. The House Oversight and Government Reform Committee wrote Snyder earlier Thursday asking him about published reports that one of his aides, Harvey Hollins, testified in a court hearing last week in Michigan that he had notified Snyder of an outbreak of Legionnaires’ disease linked to the Flint water crisis in December 2015, rather than 2016 as Snyder had testified. “My testimony was truthful and I stand by it,” Snyder told the committee in a letter, adding that his office has provided tens of thousands of pages of records to the committee and would continue to cooperate fully.  Last week, prosecutors in Michigan said Dr. Eden Wells, the state’s chief medical executive who already faced lesser charges, would become the sixth current or former official to face involuntary manslaughter charges in connection with the crisis. The charges stem from more than 80 cases of Legionnaires’ disease and at least 12 deaths that were believed to be linked to the water in Flint after the city switched its source from Lake Huron to the Flint River in April 2014. Wells was among six current and former Michigan and Flint officials charged in June. The other five, including Michigan Health and Human Services Director Nick Lyon, were charged at the time with involuntary manslaughter",
    " WATCH: Fox News Host Loses Her Sh*t, Says Investigating Russia For Hacking Our Election Is Unpatriotic This woman is insane.In an incredibly disrespectful rant against President Obama and anyone else who supports investigating Russian interference in our election, Fox News host Jeanine Pirro said that anybody who is against Donald Trump is anti-American. Look, it s time to take sides,  she began.",
    " Sarah Palin Celebrates After White Man Who Pulled Gun On Black Protesters Goes Unpunished (VIDEO) Sarah Palin, one of the nigh-innumerable  deplorables  in Donald Trump s  basket,  almost outdid herself in terms of horribleness on Friday."
]

test_news_vectors = [preprocess_and_vectorize(n) for n in test_news]
clf.predict(test_news_vectors)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm


from matplotlib import pyplot as plt
import seaborn as sn
plt.figure(figsize = (10,7))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Prediction')
plt.ylabel('Truth')

## Simple transformers implementation

In [None]:
!pip install transformers torch tensorflow

In [48]:
from transformers import pipeline

# Load a pre-trained sentiment analysis model
classifier = pipeline("sentiment-analysis")

# Analyze sentiment of a sentence
result = classifier("I love using Hugging Face Transformers!")
print(result)  # A list with one dict holding 'label' and 'score'; the exact score depends on the model revision

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9971315860748291}]


In [None]:
from transformers import pipeline

# Load a pre-trained text generation model
generator = pipeline("text-generation", model="gpt2")

# Generate text
result = generator("Once upon a time", max_length=50, num_return_sequences=1)
print(result[0]['generated_text'])

In [None]:
from transformers import pipeline

# Load a pre-trained NER model
ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True

# Extract entities from a sentence
result = ner("My name is John and I work at Google in California.")
print(result)

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Load a dataset (e.g., IMDb reviews)
dataset = load_dataset("imdb")

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

# Fine-tune the model
trainer.train()

In [49]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Tokenize input text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

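# Perform inference
# Note: the classification head is freshly initialized (see the warning in the output),
# so these logits are not meaningful until the model has been fine-tuned.
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    print(logits)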


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([[ 0.3127, -0.4075]])
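
Logits are raw scores, not probabilities. A softmax turns them into a distribution over the two classes; the sketch below applies it to the values printed above using only the standard library. Since the classification head is untrained here, the resulting numbers only illustrate the arithmetic and say nothing about the input sentence.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.3127, -0.4075]  # values copied from the tensor above
probs = softmax(logits)
print([round(p, 4) for p in probs])
```

With PyTorch tensors, `torch.softmax(logits, dim=-1)` performs the same computation.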


## Text classification with BERT

In [52]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.3.2-py3-none-any.whl (485 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
Downloading x

In [50]:
from transformers import BertTokenizer, BertForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [53]:
from datasets import load_dataset

# Load the IMDb dataset
dataset = load_dataset("imdb")

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [None]:
from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

# Fine-tune the model
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

In [None]:
# Evaluate the model
results = trainer.evaluate()
print(results)