# Welcome!

## Plan for this workshop:

- Import dataset
- Text Preprocessing
- Working with Datasets
- Tokenization
- Data Preparation
- Classification

## What is NLP?
NLP, or Natural Language Processing, is a subfield of artificial
intelligence (AI) that deals with the interaction between computers and
humans in natural language.

## Usages of NLP
NLP has numerous applications across various industries:

- Sentiment Analysis: Determining the emotional tone of text, such as identifying positive, negative, or neutral sentiments in customer reviews.
- Chatbots: Creating conversational agents that can interact with users in a human-like manner.
- Machine Translation: Automatically translating text from one language to another.
- Text Summarization: Condensing large amounts of text into shorter, concise summaries.
- Question Answering: Developing systems that can answer questions posed in natural language.



## Machine translation influenced the development of NLP

- Early Days: In the early days of computing, machine translation was a primary focus. Researchers sought to develop algorithms and techniques to automatically translate between languages.
- Rule-Based Approaches: Initial efforts relied on rule-based approaches, where systems followed predefined grammatical rules to convert text. However, these methods proved limited in handling the nuances and complexities of human language.
- Statistical Methods: The field shifted towards statistical methods, leveraging large amounts of parallel text data to learn patterns and probabilities for translation.
- Foundation for NLP: These statistical techniques, along with linguistic concepts developed for machine translation, formed the foundation for many core NLP tasks like part-of-speech tagging, parsing, and text analysis.
- Modern NLP: While machine translation remains a key area within NLP, the field has expanded to encompass a wider range of applications, including sentiment analysis, text summarization, question answering, and more.

## Example using Transformers

### Sentiment classification 💗

Here's a simple example of **sentiment analysis** using the Hugging Face Transformers library:

First, install the transformers library:

In [60]:
!pip install transformers -q

In [None]:
!pip install datasets pyarrow==14.0.2

Then, run the following code:



In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("This is a positive sentence.")

In [None]:
result

This code snippet uses a pre-trained sentiment analysis model to classify the sentiment of the input sentence.

### Text generation 📚

Text generation uses a model to create new text, such as completing sentences, writing stories, or producing other creative formats.

Here's an example of how to perform text generation using the Hugging Face Transformers library:

The generator takes a prompt, processes it with the model, and returns a generated continuation. Feel free to modify the prompt and experiment with different inputs.

In [None]:
generator = pipeline("text-generation")
result = generator("We are at a hackers conference  ", max_length=50)

In [None]:
result

- Chat 🐈

In [None]:
# Initialize the text generation pipeline with a dialog model
generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")

# Function to generate chatbot responses
def generate_response(user_input):
  response = generator(user_input, max_length=1000)[0]["generated_text"]
  return response

user_input = "Hi"

response = generate_response(user_input)

In [None]:
print(response)


# NLP pipeline


## Overview

- Load Data: Import dataset from Hugging Face Datasets.
- Text Preprocessing:
  - Lowercasing
  - Punctuation removal
- Tokenization: Utilize a pre-trained tokenizer from Transformers
- Data Preparation:
  - Create vocabulary
  - Convert text to numerical vectors
- Classification

![Pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

## Data import and cleaning



### Importing a dataset 📫:

In [None]:
!pip install datasets -q

Loading the IMDB dataset: [dataset card](https://huggingface.co/datasets/stanfordnlp/imdb)

---



In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")

In [None]:
dataset

In [None]:
dataset['train'][0]

In [None]:
df = dataset["train"].to_pandas()
df

### Cleaning 🧹

Now that we have the movie review dataset, let's perform some cleaning on the text data before tokenization. We'll focus on **lowercasing** and **removing punctuation** for this example.

In [None]:
df['text'][0]

In [None]:
import re

def clean_text(text):
  text = text.lower()  # Lowercase the text
  text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
  return text

df['text'] = df['text'].apply(clean_text)
print(df.head())

In [None]:
df['text'][0]

This code defines a clean_text function that takes a text string as input, converts it to lowercase using .lower(), and removes punctuation using a regular expression (re.sub()). We then apply this function to the 'text' column of the DataFrame using .apply().
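
One side effect worth checking: IMDB reviews often contain HTML line breaks such as `<br />`, and stripping punctuation alone leaves the letters `br` behind. The sketch below shows this on a made-up sample string. `clean_text_html` is a hypothetical variant, not part of the workshop code, that replaces the tags with spaces before cleaning.

```python
import re

def clean_text(text):
  # Same steps as the workshop function: lowercase, then drop punctuation
  text = text.lower()
  return re.sub(r'[^\w\s]', '', text)

def clean_text_html(text):
  # Hypothetical variant: turn <br> / <br /> tags into spaces before cleaning
  text = re.sub(r'<br\s*/?>', ' ', text, flags=re.IGNORECASE)
  text = clean_text(text)
  # Collapse the runs of whitespace left behind by the removed tags
  return re.sub(r'\s+', ' ', text).strip()

sample_review = "Great movie!<br /><br />Loved it."  # made-up review snippet
print(clean_text(sample_review))       # great moviebr br loved it
print(clean_text_html(sample_review))  # great movie loved it
```

Whether the extra step is worth it depends on the downstream model; it is shown here only to make the behaviour of the regular expression visible.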

### Let's also remove stop words: ⏹

In [None]:
from nltk.corpus import stopwords
import nltk

text = df['text'][0]
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

tokens = text.split()
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

In [None]:
filtered_text = ' '.join(filtered_tokens)
print(filtered_text)

### Lemmatization

Lemmatization and stemming are both text normalization techniques used in natural language processing (NLP) to prepare text for analysis. They aim to reduce words to their base or root form, but they differ in their approach and results.

**Stemming**

Chops off the ends of words to remove suffixes, prefixes, etc.
It reduces words to their stem (root form).
Example: "running" becomes "run," "studies" becomes "studi".

**Lemmatization**

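Reduces words to their dictionary form (lemma), using vocabulary and part-of-speech information.
Example: "better" becomes "good," "running" becomes "run" (when the lemmatizer knows the word is an adjective or a verb, respectively).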
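
To make the difference concrete, here is a toy sketch in plain Python. It is neither the Porter algorithm nor WordNet: `toy_stem` applies a few hand-written suffix rules and `toy_lemmatize` looks words up in a tiny dictionary, both invented for illustration.

```python
# Toy suffix rules (assumption: far simpler than a real stemmer such as Porter)
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

# Tiny hand-made lemma dictionary (assumption: a real lemmatizer uses a full lexicon)
LEMMAS = {"better": "good", "running": "run", "studies": "study"}

def toy_stem(word):
  for suffix, replacement in SUFFIX_RULES:
    if word.endswith(suffix) and len(word) > len(suffix) + 2:
      stem = word[: -len(suffix)] + replacement
      # Undo a doubled final consonant, e.g. "runn" -> "run"
      if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
        stem = stem[:-1]
      return stem
  return word

def toy_lemmatize(word):
  # Dictionary lookup; unknown words are returned unchanged
  return LEMMAS.get(word, word)

for word in ["running", "studies", "better"]:
  print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The stemmer produces "studi", which is not a real word, while the dictionary lookup returns "study" and maps "better" to "good".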


### Stemming

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

for token in stemmed_tokens:
  print(token)


In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

for token in lemmatized_tokens:
  print(token)


## Tokenization: Turning Words into Numbers
While humans understand words and sentences, computers primarily work with numbers. To enable computers to process and understand human language, we need a way to convert text into a numerical representation. This is where tokenization comes in.

Tokenization is the process of breaking down text into individual units called tokens. These tokens can be words, subwords, or even characters, depending on the chosen approach.

For instance, the sentence "This is an example." can be tokenized into the following words:

["This", "is", "an", "example", "."]

Each token is then assigned a unique numerical identifier, allowing computers to represent and manipulate text using numbers.
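
As a rough illustration of that mapping, the sketch below builds a toy vocabulary from the example tokens and looks up each token's id. The `build_vocab` and `encode` helpers, and the convention of reserving id 0 for an `[UNK]` (unknown) token, are assumptions made for this example. Pre-trained tokenizers ship with a fixed vocabulary instead.

```python
def build_vocab(tokens):
  # Reserve id 0 for unknown tokens, then number tokens in order of first appearance
  vocab = {"[UNK]": 0}
  for token in tokens:
    if token not in vocab:
      vocab[token] = len(vocab)
  return vocab

def encode(tokens, vocab):
  # Tokens missing from the vocabulary fall back to the [UNK] id
  return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

example_tokens = ["This", "is", "an", "example", "."]
toy_vocab = build_vocab(example_tokens)

print(encode(example_tokens, toy_vocab))                     # [1, 2, 3, 4, 5]
print(encode(["This", "is", "a", "test", "."], toy_vocab))   # [1, 2, 0, 0, 5]
```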

Here's a simple example of tokenization using the Hugging Face Transformers library:

In [None]:
#just taking our previous sentence for later:
example1 = df['text'][0]

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer.tokenize("We are at a hacking conference.")
print(tokens)

In [None]:
lem_tokens = [tokenizer.tokenize(token) for token in lemmatized_tokens]

print(lem_tokens)

### Tokenization Examples
Let's explore different tokenization methods and their importance:

1. Basic Splitting

A simple way to tokenize text is by splitting it based on spaces:

In [None]:

hacking_sent = "We are at a hacking conference."
#text = df['text'][0]
#tokens = text.split()
tokens = hacking_sent.split()
print(tokens)

This example splits on whitespace only, so punctuation stays attached to the neighbouring word (the last token is "conference.").
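
For comparison, here is a minimal sketch of two regex-based alternatives using Python's standard `re` module: one drops punctuation and the other keeps each punctuation mark as its own token. The patterns are illustrative choices, not the rules any particular tokenizer uses.

```python
import re

hacking_sentence = "We are at a hacking conference."

# Whitespace splitting keeps punctuation attached to the neighbouring word
split_tokens = hacking_sentence.split()

# Keep only runs of word characters, dropping punctuation
word_tokens = re.findall(r"\w+", hacking_sentence)

# Keep words and emit each punctuation mark as its own token
word_and_punct_tokens = re.findall(r"\w+|[^\w\s]", hacking_sentence)

print(split_tokens)
print(word_tokens)
print(word_and_punct_tokens)
```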

2. Transformers Tokenizers

For advanced NLP tasks, using tokenizers from pre-trained models is crucial:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer.tokenize(hacking_sent)
print(tokens)


## Tokens to ids 🔢

A token id is the unique index of a token in the tokenizer's vocabulary.

In [None]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)


In [None]:

tokens = tokenizer.tokenize("We are at a hacking party.")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)


In [None]:
tokens = tokenizer.tokenize("You are at a hacking party.")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)

In [None]:
tokens = tokenizer.tokenize("You are at a hacking party!!")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)

These tokenizers are trained on large datasets and can handle various linguistic nuances, including subword tokenization and special characters.
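
To give a feel for subword tokenization, here is a toy greedy longest-match splitter in the style of WordPiece. The vocabulary `TOY_VOCAB` is invented for this sketch. A real BERT vocabulary is far larger, and its tokenizer also handles casing, punctuation, and other details omitted here.

```python
# Invented vocabulary; continuation pieces carry the "##" prefix
TOY_VOCAB = {"hack", "##ing", "##er", "##s", "party", "we", "are", "[UNK]"}

def wordpiece(word, vocab):
  pieces = []
  start = 0
  while start < len(word):
    end = len(word)
    match = None
    # Try the longest remaining substring first, then shrink it
    while end > start:
      candidate = word[start:end]
      if start > 0:
        candidate = "##" + candidate
      if candidate in vocab:
        match = candidate
        break
      end -= 1
    if match is None:
      # No piece fits: the whole word becomes unknown
      return ["[UNK]"]
    pieces.append(match)
    start = end
  return pieces

for word in ["hacking", "hackers", "party", "conference"]:
  print(word, "->", wordpiece(word, TOY_VOCAB))
```

With this vocabulary, "hacking" splits into `hack` and `##ing`, while "conference" has no matching pieces and falls back to `[UNK]`.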



## Padding and Truncation 🔪

When working with Transformers for NLP tasks, we often deal with multiple sentences or text sequences of varying lengths. However, these models typically require input sequences to have the same length. This is where padding and truncation come into play.

### **Padding**

Padding involves adding special tokens (e.g., [PAD]) to shorter sequences to make them match the length of the longest sequence in a batch. This ensures consistent input dimensions for the model.

### **Truncation**

Truncation involves shortening longer sequences to a maximum length. This can be done by removing tokens from the beginning or end of the sequence.

**Example**

Suppose we have two sentences:

- Sentence 1: "This is a short sentence."
- Sentence 2: "This is a much longer sentence with more words."

After tokenization, we might have:

- Sentence 1: ["This", "is", "a", "short", "sentence", "."]
- Sentence 2: ["This", "is", "a", "much", "longer", "sentence", "with", "more", "words", "."]

To process these sentences with a Transformer, we might set a maximum length of 8 tokens.

- Padding: Sentence 1 would be padded with two [PAD] tokens: ["This", "is", "a", "short", "sentence", ".", [PAD], [PAD]]
- Truncation: Sentence 2 would be truncated to 8 tokens: ["This", "is", "a", "much", "longer", "sentence", "with", "more"]

Proper padding and truncation can prevent errors and improve the performance of Transformer models.
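
The worked example above can be sketched in plain Python. `pad_or_truncate` and `attention_mask` are hypothetical helpers written for this illustration. A real tokenizer also adds special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
PAD = "[PAD]"

def pad_or_truncate(tokens, max_length, pad_token=PAD):
  # Cut from the end if the sequence is too long, otherwise pad on the right
  if len(tokens) >= max_length:
    return tokens[:max_length]
  return tokens + [pad_token] * (max_length - len(tokens))

def attention_mask(tokens, pad_token=PAD):
  # 1 marks a real token, 0 marks padding the model should ignore
  return [0 if token == pad_token else 1 for token in tokens]

short_tokens = ["This", "is", "a", "short", "sentence", "."]
long_tokens = ["This", "is", "a", "much", "longer", "sentence", "with", "more", "words", "."]

padded = pad_or_truncate(short_tokens, max_length=8)
truncated = pad_or_truncate(long_tokens, max_length=8)

print(padded)
print(attention_mask(padded))
print(truncated)
```

Both results have exactly 8 entries, so they can be stacked into a single batch.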

In [None]:
sentences = ["A white rabbit.", "A lot of black cats in the garden."]

padded_sequences = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt") #pt for pyTorch

In [None]:
padded_sequences

In [None]:
import torch

input_ids_tensor1 = padded_sequences['input_ids'][0]
input_ids_tensor2 = padded_sequences['input_ids'][1]

print("Input IDs Tensor 1:\n", input_ids_tensor1)
print("\nInput IDs Tensor 2:\n", input_ids_tensor2)

combined_tensor = torch.cat((input_ids_tensor1.unsqueeze(0), input_ids_tensor2.unsqueeze(0)), dim=0)
print("\nCombined Tensor:\n", combined_tensor)


Let's define a tokenization function and add a new column to our DataFrame with the tokens for each review. We'll use a basic tokenizer from the Hugging Face Transformers library for this example.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_text(text):
  tokens = tokenizer.tokenize(text)
  return tokens

df['tokens'] = df['text'].apply(tokenize_text)
print(df.head())

In [None]:
df['tokens']

In [None]:
df

## Sentiment analysis

Let's focus on a classification task using the first 5 rows of our processed and tokenized movie review data. For simplicity, we'll perform sentiment analysis and try to predict whether a review is positive or negative.


In [None]:
data = df[['tokens', 'label']].head(5)
data

We can perform sentiment classification on each of the first 5 movie reviews using a pre-trained model. Since we have limited data, using a pre-trained model is a good approach as it leverages knowledge learned from a massive dataset.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

data = df['text'].head(5)
sentiments = data.apply(lambda text: classifier(text)[0]['label'])
print(sentiments)

Pretty bad results 😶

Here might be why (01:19 am reflection):

Token indices sequence length is longer than the specified maximum sequence length for this model (606 > 512). Running this sequence through the model **will result in indexing errors**.


In [None]:
# Count the tokens of the third review (index 2)
third_review_tokens = df['tokens'][2]
num_tokens = len(third_review_tokens)
print(f"Number of tokens in the third review: {num_tokens}")


🤔 01:22 am...

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# bert-base-uncased has no trained classification head, so its predictions would be arbitrary.
# Assumption: use a checkpoint fine-tuned for sentiment (SST-2) instead.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def predict_sentiment(text):
  # Tokenize with the model's own tokenizer and truncate to its 512-token limit
  inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
  with torch.no_grad():  # Inference only, no gradients needed
    outputs = model(**inputs)
  return outputs.logits.argmax(-1).item()  # 0 = negative, 1 = positive

data = df['text'].head(5)
sentiments = data.apply(predict_sentiment)
print(sentiments)

😀 01:24 am

It looks better!!

# Your turn

## Tokenizing a Sentence

Exercise:

- Use the BERT tokenizer to tokenize the third review of the dataset.
- Store the tokens in a variable called "tokens" and print them.

(Solution):



In [None]:
sentence = df['text'][2]
print(sentence)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer.tokenize(sentence)
print(tokens)

## Removing Stop Words

- Remove stop words from the following sentence: "This is a sentence with some stop words."
- Use the NLTK library and its stopwords corpus.
- Store the filtered tokens in a variable called "filtered_tokens" and print them.

In [None]:
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

In [None]:
# Use the sentence given in the exercise
sentence = "This is a sentence with some stop words."
tokens = sentence.split()
tokens

In [None]:
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)

## Calculating Sentiment Score

- Use a pre-trained sentiment analysis model from the transformers library to calculate the sentiment of your sentence.
- Print the result, which should include the predicted label and score.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier(sentence)
print(result)

## What we covered

- What is NLP?
- Applications of NLP
- Transformers Library
- Tokenization
- Text Preprocessing
- Padding and Truncation
- Run a sentiment classifier

## Now your data is ready for training!
## Let's go deeper..

# Predictions on a real dataset

In [None]:
import pandas as pd

In [None]:
second_sample = df.head(10)

In [None]:
df.head(10)

Oh, only '0' labels! 😯

In [None]:
# Create a 10-row subset of df with 5 reviews labeled 0 and 5 labeled 1

df_subset = df[df.label.isin([0, 1])].groupby('label').head(5)

df_subset

In [None]:
# Shuffle the DataFrame rows
shuffled_df = df_subset.sample(frac=1).reset_index(drop=True)

print(shuffled_df)


In [None]:
# Show the first 4 reviews of this dataframe

In [None]:
shuffled_df[:4]

In [None]:
# Show only the labels for the first 4 reviews

In [None]:
shuffled_df[:4]['label']

### Predict sentiment

In [None]:

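from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Reload a matching tokenizer/model pair: `tokenizer` was reassigned in the exercises above.
# Assumption: a checkpoint fine-tuned for sentiment (SST-2); 0 = negative, 1 = positive, as in IMDB.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Pad every review to the model's maximum length and truncate longer ones
tokenized_reviews = shuffled_df["text"].apply(lambda text: tokenizer(text, padding="max_length", truncation=True)).tolist()

# Convert tokenized data into PyTorch tensors
input_ids = [review['input_ids'] for review in tokenized_reviews]
attention_mask = [review['attention_mask'] for review in tokenized_reviews]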

input_ids = torch.tensor(input_ids)
attention_mask = torch.tensor(attention_mask)


# Run the model for prediction
with torch.no_grad():
  outputs = model(input_ids, attention_mask=attention_mask)

# Get the predicted labels
predicted_labels = torch.argmax(outputs.logits, dim=1).numpy()

# Add the predicted sentiment to the sample DataFrame
shuffled_df["predicted_sentiment"] = predicted_labels

shuffled_df

In [None]:
def calculate_accuracy(df):
  correct_predictions = (df['label'] == df['predicted_sentiment']).sum()
  total_predictions = len(df)
  accuracy = correct_predictions / total_predictions
  return accuracy

In [None]:
accuracy = calculate_accuracy(shuffled_df)
print(f"Accuracy: {accuracy}")

### How to improve accuracy?

# Named entity recognition

### What are entities? 👽

BERT-base-NER, a BERT model fine-tuned for Named Entity Recognition (NER), uses the following entity types:

**Common Entities:**

- PER: Person
- ORG: Organization
- LOC: Location
- MISC: Miscellaneous

**Less Common, but Sometimes Included** (in other NER tag sets, not in this model):

- DATE: Date
- TIME: Time
- MONEY: Monetary values
- PERCENT: Percentage

**Important Notes:**

**B-, I- Prefixes:** You'll often see these prefixes before entity labels. They indicate the beginning (B-) and inside (I-) of a multi-word entity. For example:

- B-PER (Beginning of a person's name)
- I-PER (Inside a person's name)
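
To show how the B-/I- prefixes described above are used, here is a minimal sketch that merges tagged tokens back into whole entities. The sentence and its tags are written by hand for illustration, not produced by a model, and `group_entities` is a hypothetical helper. `O` is the usual tag for tokens outside any entity.

```python
def group_entities(tokens, tags):
  # Merge B-/I- tagged tokens into (entity_text, entity_type) pairs
  entities = []
  current_words, current_type = [], None
  for token, tag in zip(tokens, tags):
    if tag.startswith("B-"):
      # A new entity starts: close the one that was open, if any
      if current_words:
        entities.append((" ".join(current_words), current_type))
      current_words, current_type = [token], tag[2:]
    elif tag.startswith("I-") and current_type == tag[2:]:
      current_words.append(token)
    else:
      # "O" or an I- tag that does not continue the open entity
      if current_words:
        entities.append((" ".join(current_words), current_type))
      current_words, current_type = [], None
  if current_words:
    entities.append((" ".join(current_words), current_type))
  return entities

ner_tokens = ["Ada", "Lovelace", "worked", "in", "London", "."]
ner_tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(group_entities(ner_tokens, ner_tags))
# [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```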


In [None]:
dataset['train']['text'][0:2]

In [None]:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Choose a pre-trained model for NER
model_name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
NER_model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create a NER pipeline
nlp = pipeline("ner", model=NER_model, tokenizer=tokenizer)

# Take the first five reviews from the dataset for NER
subset_datasets = dataset["train"].select(range(5))

# Process and print NER results for each example
for example in subset_datasets:
  text = example["text"]
  print(f"Text: {text}")
  ner_results = nlp(text)
  print(f"NER Results: {ner_results}")
  print("---")


In [None]:
for example in subset_datasets:
  text = example["text"]
  short_text = ' '.join(text.split()[:30])
  print(f"Text: {short_text}")
  ner_results = nlp(short_text)  # Run NER on the same shortened text that is printed
  print("Token\tNER")
  for result in ner_results:
    print(f"{result['word']}\t{result['entity']}")
  print("---")

# Wrap it all up 👉 an example of dialog generation

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Function to generate chatbot responses
def generate_response(user_input):
  input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")
  chat_history_ids = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
  response = tokenizer.decode(chat_history_ids[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
  return response


user_input = "Thanks guys, this is the end of this workshop. Hope you enjoyed it!"
response = generate_response(user_input)


In [None]:
print(response)


# Questions?


# Thank you!

🧑
Twitter: @hello_locked | C00kie_two@infosec.exchange
