# Hacker's Guide to Natural Language Processing (NLP)

🚀 Created for the [Duke AI Hackathon 2024](https://dukeaihackathon.com/)

👋 by [Dr. Brinnae Bent](https://www.linkedin.com/in/brinnaebent/)

## What is NLP?
Natural Language Processing (NLP) is a subfield of AI focused on enabling computers to understand, interpret, and generate language. It combines linguistics, computer science, and machine learning.

NLP includes various tasks such as:
* speech recognition
* language translation
* sentiment analysis
* text summarization

In this tutorial, we will cover some of the basics, using popular NLP Python libraries `nltk`, `spaCy`, and `transformers`.



## Table of Contents
* [Environment Variables](https://colab.research.google.com/drive/1c0Qn7vqiVs1YzW69wcJsAefBPNwJCqPP#scrollTo=4vGJIX1-UTIg&line=29&uniqifier=1)

* [Libraries](https://colab.research.google.com/drive/1c0Qn7vqiVs1YzW69wcJsAefBPNwJCqPP#scrollTo=JbkFW64HY3Uj&line=16&uniqifier=1)

* NLTK
  * [Tokenization](https://colab.research.google.com/drive/1c0Qn7vqiVs1YzW69wcJsAefBPNwJCqPP#scrollTo=hDCw9klGaSTO&line=4&uniqifier=1)
  * [Parts-of-speech Tagging](https://colab.research.google.com/drive/1c0Qn7vqiVs1YzW69wcJsAefBPNwJCqPP#scrollTo=oZJo_1vYa82Q&line=32&uniqifier=1)
* spaCy
  * [Named Entity Recognition](https://colab.research.google.com/drive/1c0Qn7vqiVs1YzW69wcJsAefBPNwJCqPP#scrollTo=JwLf8G3Ycwmv&line=4&uniqifier=1)
  * [Dependency Parsing](https://colab.research.google.com/drive/1c0Qn7vqiVs1YzW69wcJsAefBPNwJCqPP#scrollTo=PtZ2oMWAdHBX&line=5&uniqifier=1)
* transformers
  * [Sentiment Analysis](https://colab.research.google.com/drive/1c0Qn7vqiVs1YzW69wcJsAefBPNwJCqPP#scrollTo=2MIpV37-dbYW&line=5&uniqifier=1)
  * [Text Summarization](https://colab.research.google.com/drive/1c0Qn7vqiVs1YzW69wcJsAefBPNwJCqPP#scrollTo=ScyP1hMkfBi1&line=5&uniqifier=1)
  * [Text Generation](https://colab.research.google.com/drive/1c0Qn7vqiVs1YzW69wcJsAefBPNwJCqPP#scrollTo=g7oTGYiNfVUC&line=7&uniqifier=1)




---



### Environment Variables (i.e., API keys, tokens)

For some models, you may need a Hugging Face token. You can create one by following these [instructions](https://huggingface.co/docs/hub/en/security-tokens).

Environment variables are an important part of application configuration when dealing with sensitive information like API keys. They let you keep your credentials and configuration separate from your code.

API keys and tokens should always be treated as sensitive information. Never commit them to version control systems like Git, and don't hard-code them in your source files.

**Using environment variables in Google Colab**

1. Locate the key icon on the left sidebar of the Colab interface
2. Click on the key icon to open the "Secrets" panel
3. Add your secret key-value pairs:

> Name: `HUGGINGFACE_TOKEN`

> Value: `your_actual_token_here`


Then read the secret in your notebook using the code below:

```python
from google.colab import userdata

# Store the secret in a variable so it can be passed to libraries later
huggingface_token = userdata.get('HUGGINGFACE_TOKEN')
```

**Using environment variables locally**

When working on your local machine, it's common to use a `.env` file to manage environment variables. Always add `.env` to your `.gitignore` file to prevent accidental commits!

1. Create a file named `.env` in your project's root directory
2. Add your environment variables to this file:

`HUGGINGFACE_TOKEN=your_actual_token_here`


3. Install the python-dotenv library:

`pip install python-dotenv`

4. In your Python script, load and use the variables:

```python
from dotenv import load_dotenv
import os

load_dotenv()
huggingface_token = os.getenv('HUGGINGFACE_TOKEN')
```
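
A missing token tends to surface later as a confusing authentication error. One way to fail early is a small helper like the sketch below. `require_env` is a hypothetical name introduced here for illustration; it is not part of `python-dotenv` or any other library, and it assumes the variable has already been loaded into the process environment (for example by `load_dotenv()`).

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable {name!r}. "
            "Add it to your .env file or Colab Secrets panel."
        )
    return value

try:
    huggingface_token = require_env("HUGGINGFACE_TOKEN")
    # Print only the length, never the token itself
    print("Token loaded, length:", len(huggingface_token))
except RuntimeError as err:
    print(err)
```

Printing the length rather than the value keeps the token out of notebook outputs, which are easy to share by accident.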



## Install libraries

We will be working with some NLP libraries in this tutorial, including `nltk`, `spacy`, and `transformers`.

### NLTK (Natural Language Toolkit)
NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources. It is commonly used for text classification, sentiment analysis, tokenization and word segmentation, part-of-speech tagging, named entity recognition, and syntactic parsing.

[Documentation](https://www.nltk.org/)

### spaCy
spaCy is a Python library for fast and efficient tokenization, named entity recognition, dependency parsing, and sentence segmentation, with pre-trained statistical models and word vectors.

[Documentation](https://spacy.io/)

### transformers 🤗
Transformers is a library from Hugging Face that provides state-of-the-art machine learning models for NLP tasks. It gives you access to pre-trained models like BERT, GPT-2, T5, and many others. You can use its APIs to fine-tune models on custom tasks, and it supports both PyTorch and TensorFlow backends. (It is also useful for non-NLP tasks like computer vision and audio.)

[Documentation](https://huggingface.co/docs/transformers/en/index)


In [1]:
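# This code installs the packages. If you do this outside of Google Colab, you will want to add these to a requirements.txt file (or a similar dependency file).
# Note: the `+cu121` suffix names a CUDA-specific torch build (assumed here to match Colab's preinstalled one); elsewhere you may need plain `torch==2.4.1`.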
!pip install nltk==3.8.1 spacy==3.7.5 transformers==4.44.2 torch==2.4.1+cu121
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import nltk
import spacy
from transformers import pipeline

## Tokenization (NLTK)

### What is tokenization?
Tokenization is the process of breaking down text into smaller units called tokens, typically words, subwords, or characters. It's a preprocessing step in NLP that splits input text into meaningful units.

For example, the sentence `"I love AI"` might be tokenized into `["I", "love", "AI"]` using word-level tokenization.
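
To make the difference between token granularities concrete, here is a minimal sketch using only the standard library. The regular expression is a rough approximation chosen for illustration; it is not the algorithm `nltk.word_tokenize` uses, and the function names are hypothetical.

```python
import re

def word_tokens(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokens(text):
    # Character-level tokenization: every character is its own token
    return list(text)

sentence = "I love AI"
print(word_tokens(sentence))  # ['I', 'love', 'AI']
print(char_tokens("AI"))      # ['A', 'I']
```

Subword tokenizers, such as those used by transformer models, sit between these two extremes and need a learned vocabulary, so they are not sketched here.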

In [3]:
nltk.download('punkt')

text = "NLTK is a leading platform for building Python programs to work with human language data."

# Tokenization
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Tokens: ['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']


## Parts-of-speech tagging (NLTK)

### What is parts-of-speech (POS) tagging?
Parts-of-speech tagging is the process of labeling words in a text with their corresponding grammatical categories, such as nouns, verbs, adjectives, and adverbs.

POS tagging is important for various NLP tasks, including syntactic parsing, information extraction, and machine translation.

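For instance, in the sentence "The cat chased the mouse," a POS tagger would label "cat" and "mouse" as nouns, "chased" as a verb, and both occurrences of "the" as determiners.

Examples of tags from the Penn Treebank tagset, which `nltk.pos_tag` uses by default for English: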

- DT: Determiner
- EX: Existential there
- JJ: Adjective
- IN: Preposition or subordinating conjunction
- TO: to
- NN: Noun, singular or mass
- NNS: Noun, plural
- NNP: Proper noun, singular
- NNPS: Proper noun, plural
- PRP: Personal pronoun
- RB: Adverb
- RBR: Adverb, comparative
- RBS: Adverb, superlative
- SYM: Symbol
- UH: Interjection
- VB: Verb, base form
- VBD: Verb, past tense
- VBG: Verb, gerund or present participle
- VBN: Verb, past participle
- VBP: Verb, non-3rd person singular present
- VBZ: Verb, 3rd person singular present

In [4]:
nltk.download('averaged_perceptron_tagger')

# This will use the tokens generated during tokenization, above. Make sure to run that first!

# Part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


POS Tags: [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]
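
Because the tagger returns plain `(word, tag)` tuples, you can filter them with ordinary Python. The sketch below hard-codes the pairs from the output above so it runs without NLTK; `words_with_prefix` is a hypothetical helper name. It relies on the fact that all noun tags in the list above start with `NN` and all verb tags start with `VB`.

```python
# (word, tag) pairs copied from the POS tagging output above
tagged = [
    ('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'),
    ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'),
    ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
    ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.'),
]

def words_with_prefix(pairs, prefix):
    """Return the words whose tag starts with `prefix` (e.g. 'NN' or 'VB')."""
    return [word for word, tag in pairs if tag.startswith(prefix)]

nouns = words_with_prefix(tagged, "NN")
verbs = words_with_prefix(tagged, "VB")
print("Nouns:", nouns)
print("Verbs:", verbs)
```

Note that the tagger labeled "leading" as `VBG` here even though it acts as an adjective in the sentence, a reminder that tagger output is a prediction and worth spot-checking.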


## Named Entities (spaCy)

### What is Named Entity Recognition?
Named Entity Recognition (NER) identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, and dates.

In [5]:
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

Named Entities:
Apple - ORG
U.K. - GPE
$1 billion - MONEY
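
A common next step is to group the entities by label. The sketch below hard-codes the `(text, label)` pairs from the output above so it runs without spaCy; in a live session you could build the same list with `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above
entities = [("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY")]

# Group entity texts under their label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(f"{label}: {texts}")
```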


## Dependency Parsing (spaCy)

Dependency parsing analyzes the grammatical structure of a sentence by identifying relationships between words.

It creates a tree-like structure where each word is connected to its syntactic head, showing how words depend on or modify each other. This process helps determine the sentence's meaning by revealing subject-verb-object relationships and other grammatical connections.

In [6]:
# This uses doc, created in the code block above. Make sure to run that first!

print("\nDependency Parsing:")
for token in doc:
    print(f"{token.text} - {token.dep_} - {token.head.text}")


Dependency Parsing:
Apple - nsubj - looking
is - aux - looking
looking - ROOT - looking
at - prep - looking
buying - pcomp - at
U.K. - dobj - buying
startup - dep - looking
for - prep - startup
$ - quantmod - billion
1 - compound - billion
billion - pobj - for
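
To see the tree structure hiding in this flat output, the sketch below rebuilds it from the `token - dep - head` triples printed above, using only the standard library. It assumes every token text is unique, which holds for this sentence; with real spaCy objects you would key on `token.i` instead, since a word can repeat.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above
triples = [
    ("Apple", "nsubj", "looking"), ("is", "aux", "looking"),
    ("looking", "ROOT", "looking"), ("at", "prep", "looking"),
    ("buying", "pcomp", "at"), ("U.K.", "dobj", "buying"),
    ("startup", "dep", "looking"), ("for", "prep", "startup"),
    ("$", "quantmod", "billion"), ("1", "compound", "billion"),
    ("billion", "pobj", "for"),
]

children = defaultdict(list)
root = None
for text, dep, head in triples:
    if dep == "ROOT":
        root = text  # in this output the root token lists itself as its head
    else:
        children[head].append((text, dep))

def print_tree(node, label="ROOT", depth=0):
    """Print the subtree under `node`, indenting one level per depth."""
    print("  " * depth + f"{node} ({label})")
    for child, dep in children.get(node, []):
        print_tree(child, dep, depth + 1)

print_tree(root)
```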


## Sentiment Analysis (transformers)

Sentiment analysis determines the emotional tone expressed in text. It uses algorithms to classify text as positive, negative, or neutral, often assigning a numerical score to indicate the "intensity" of sentiment.

This technique is widely used in business, social media monitoring, and customer feedback/marketing analysis to better understand opinions about products, services, or topics.

Sentiment analysis can range from simple rule-based systems to complex machine learning models that consider context and nuance in language.
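
As an illustration of the simple end of that range, here is a toy rule-based classifier. The word list is a tiny hypothetical lexicon made up for this sketch; real lexicons are far larger, and this approach ignores negation ("not cool") and context.

```python
import re

# Hypothetical toy lexicon: +1 for positive words, -1 for negative words
LEXICON = {"love": 1, "cool": 1, "great": 1, "hate": -1, "boring": -1, "bad": -1}

def rule_based_sentiment(text):
    """Sum the lexicon scores of all words and map the total to a label."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "POSITIVE", score
    if score < 0:
        return "NEGATIVE", score
    return "NEUTRAL", score

print(rule_based_sentiment("I love learning about NLP! It's cool."))
print(rule_based_sentiment("This talk was bad and boring."))
```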

The transformers library allows us to access many models trained for sentiment analysis. You can also fine-tune your own sentiment analysis models using the transformers library. [See tutorial here](https://huggingface.co/blog/sentiment-analysis-python).

In [7]:
sentiment_analyzer = pipeline("sentiment-analysis")

text = "I love learning about NLP! It's cool."
result = sentiment_analyzer(text)
print(f"Sentiment: {result[0]['label']}, Score: {result[0]['score']:.4f}")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Sentiment: POSITIVE, Score: 0.9999


Using a specific model:

Find specific models [here](https://huggingface.co/models?sort=trending&search=sentiment+analysis).

In [8]:
specific_model_sentiment_analyzer = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
text = "I love learning about NLP! It's cool."
result = specific_model_sentiment_analyzer(text)
print(f"Sentiment: {result[0]['label']}, Score: {result[0]['score']:.4f}")

emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0


Sentiment: POS, Score: 0.9927
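
Notice that the two models spell their labels differently (`POSITIVE` vs. `POS`). If you swap models during a hackathon, a small normalization step keeps downstream code stable. The sketch below works on plain result dictionaries shaped like the pipeline output; the `NEG` and `NEU` entries are an assumption about the second model's remaining labels, so check the model card before relying on them.

```python
# Label spellings seen in the outputs above, plus assumed counterparts
LABEL_MAP = {
    "POSITIVE": "positive",
    "NEGATIVE": "negative",
    "POS": "positive",
    "NEG": "negative",  # assumed label name
    "NEU": "neutral",   # assumed label name
}

def normalize(result):
    """Map one pipeline-style result dict to a shared lowercase label."""
    label = LABEL_MAP.get(result["label"].upper(), "unknown")
    return {"label": label, "score": result["score"]}

print(normalize({"label": "POSITIVE", "score": 0.9999}))
print(normalize({"label": "POS", "score": 0.9927}))
```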


## Text Summarization (transformers)

Text Summarization involves condensing a longer document into a shorter version while preserving its key information and meaning.

It can be **extractive**, selecting and combining existing sentences from the original text, or **abstractive**, generating new sentences to capture the essence of the content.

Summarization algorithms analyze factors like sentence importance, keyword frequency, and semantic relationships to identify the most crucial information.
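
To make the extractive idea concrete, here is a minimal frequency-based sketch using only the standard library. It scores each sentence by the average corpus frequency of its non-stopword words and keeps the top ones. The stopword list and function name are hypothetical, and this is not how the transformer pipeline below works (that model is abstractive).

```python
import re
from collections import Counter

# Tiny hypothetical stopword list for illustration
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "with", "as", "for"}

def content_words(text):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def extractive_summary(text, num_sentences=1):
    # Split on whitespace that follows sentence-ending punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Emit the selected sentences in their original order
    return " ".join(s for s in sentences if s in top)

sample = "Cats sleep a lot. Cats and dogs are popular pets. The weather was mild."
print(extractive_summary(sample, num_sentences=1))
```

Averaging by sentence length keeps long sentences from winning simply because they contain more words.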

In [9]:
summarizer = pipeline("summarization")

long_text = """
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of understanding the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
"""

summary = summarizer(long_text, max_length=75, min_length=30, do_sample=False)
print("Summary:", summary[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Summary:  Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language . The goal is a computer capable of understanding the contents of documents, including the contextual nuances of the language within them .


## Text Generation (transformers)

Text generation is the process of creating human-like text using AI algorithms.

The transformers library allows us to utilize text generation models like GPT-2.

These models use neural networks, specifically transformer architectures, to learn patterns in language and generate new text. Generation typically starts with a prompt: the model predicts a probability distribution over the next token (a word or word piece), one token is chosen and appended to the text, and the process repeats until a length limit or an end-of-sequence token is reached.
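
The predict-append-repeat loop can be sketched with a toy bigram model, which predicts the next word from counts of word pairs in a tiny made-up corpus. This is a deliberately simplified stand-in: GPT-2 uses a transformer over subword tokens rather than bigram counts, and the corpus and function name here are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus
corpus = "the cat sat on the mat . the cat sat on the sofa ."
tokens = corpus.split()

# Count how often each word follows each other word
next_counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    next_counts[current][following] += 1

def generate(prompt, max_new_tokens=5):
    """Greedy decoding: always append the most frequent next word."""
    output = prompt.split()
    for _ in range(max_new_tokens):
        candidates = next_counts.get(output[-1])
        if not candidates:
            break  # the last word was never followed by anything in the corpus
        output.append(candidates.most_common(1)[0][0])
    return " ".join(output)

print(generate("the", max_new_tokens=3))  # the cat sat on
```

Always taking the most frequent word is called greedy decoding; sampling from the distribution instead is what makes real model outputs vary between runs.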

In [10]:
text_generator = pipeline("text-generation")

prompt = "In the future, artificial intelligence will"
generated_text = text_generator(prompt, max_length=50, num_return_sequences=1)
print("Generated Text:", generated_text[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text: In the future, artificial intelligence will learn more and more.
