## 💻 UnpackAI DL201 Bootcamp - Week 1 - Skills: NLP

### 📕 Learning Objectives

* Solidify the basic notion of NLP and how it can be applied to a variety of tasks.
* Practice using Pandas for loading and processing text data.
* Illustrate the process of converting a text document into a dataframe and from there into a tensor.

### A basic NLP Overview

From Wikipedia:
- "Natural language processing (NLP) is a subfield of **linguistics, computer science, and artificial intelligence** concerned with the interactions between computers and **human language**, in particular how to program computers to **process and analyze** large amounts of natural language data. The goal is a computer capable of **"understanding"** the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."

- Approaches to NLP tasks:
    - Rule-based
    - Traditional machine learning
    - Deep learning

In NLP, we often need to perform text preprocessing, such as removing stop words, stemming, lemmatization, and tokenization.
A nice overview is presented in: 
- https://stanfordnlp.github.io/CoreNLP/ 
- https://www.techtarget.com/searchenterpriseai/definition/natural-language-processing-NLP
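
The preprocessing steps named above can be sketched in a few lines of plain Python. This is only a toy illustration: the stop-word list and the suffix-stripping `crude_stem` helper are assumptions made for this example, not what a real NLP library does (real stemmers and lemmatizers use far more careful rules).

```python
import re

# Toy stop-word list, chosen only for illustration; real projects use a curated list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "to"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least 3 characters of the word remain
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

text = "The genie granted three wishes and is waiting in the lamp."
tokens = tokenize(text)
filtered = remove_stop_words(tokens)
stems = [crude_stem(t) for t in filtered]

print(tokens)
print(filtered)
print(stems)
```

Note that the crude stemmer turns "wishes" into "wishe", which is exactly the kind of rough result that motivates lemmatization, where words are mapped to dictionary forms instead.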

Common NLP tasks:
- Classification
- Masked filling
- Text prediction
- Sentiment analysis
    - Positive
    - Negative
    - Subjectivity
- Entity recognition
    - Person
    - Location
    - Organization
- Entity extraction
- Keyword extraction
- Topic extraction

### Illustrative example

The code example below illustrates how to use Pandas for text manipulation, followed by a few exploratory steps toward creating tensors that represent the text data.

In [None]:
# Install packages
#! pip install transformers
#!pip install openpyxl

# Import libraries
import os
import pandas as pd
import torch
import matplotlib.pyplot as plt
from transformers import BertTokenizer, BertModel

It is important to set your data folder path correctly as a local variable, depending on where you run this notebook.

In [None]:
# Set the data directory path as a variable

# Uncomment this for Kaggle
#!git clone https://github.com/unpackAI/DL201.git
#DATA_DIR = '/kaggle/working/DL201/data/nlp' #uncomment for kaggle (plain string, since Path is not imported)

# Uncomment this for local
os.chdir('../data/nlp')
DATA_DIR = os.getcwd()

print(f'data directory is {DATA_DIR}')

Let's load a sample text file and feed it into the BERT model. The data/nlp folder of the repository contains a txt file with sentences from a book taken from http://www.textfiles.com/stories/. Feel free to download a different book and use it when exploring this notebook.

In [None]:
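# Load a sample text from the data folder
os.chdir(DATA_DIR)
# Use a context manager so the file is closed after reading
with open('alad10.txt') as f:
    sample_text = f.read()

# Split the text into lines (each line is treated as one sentence here)
sentences = sample_text.split('\n')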

# Load the sentences into a dataframe
df = pd.DataFrame(sentences, columns=['sentence'])

As has been reiterated before, loading the data into Pandas gives us tremendous flexibility to perform data cleaning and preprocessing with ease.

In [None]:
# Inspect some of the sentences
df.sample(15)

In [None]:
# Remove punctuation from all sentences
# (regex=True is required: since pandas 2.0 the pattern is treated as a literal string by default)
df['sentence'] = df['sentence'].str.replace(r'[^\w\s]', '', regex=True)

# Note: instead of a regex, a list of punctuation characters can be used, give it a try!
punctuation = [
    '.', ',', '!', '?', ':', ';', '"', "'", '-', '_', '(', ')', '[', ']', '{', '}', '#', '@', '$', '%', '^', '&', '*',
    '+', '=', '<', '>', '/', '\\', '|', '~', '`', '“', '”', '‘', '’'
]

df.sample(10)
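
One possible way to approach the list-based alternative suggested above is a translation table. The sketch below uses plain Python strings so it runs on its own; the example sentences are made up, and `string.punctuation` plus the four curly quotes is assumed to cover the same characters as the `punctuation` list.

```python
import string

# Characters to delete: ASCII punctuation plus the curly quotes from the list above
punctuation_chars = string.punctuation + "“”‘’"

# A translation table that maps every punctuation character to "deleted"
table = str.maketrans("", "", punctuation_chars)

def strip_punctuation(sentence):
    """Return the sentence with all punctuation characters removed."""
    return sentence.translate(table)

# Made-up example sentences, for illustration only
examples = ["“Open, Sesame!” he cried.", "Who's there? (No one...)"]
cleaned = [strip_punctuation(s) for s in examples]
print(cleaned)
```

The same table can be applied to the dataframe column with `df['sentence'].str.translate(table)`. One difference from the regex version: `[^\w\s]` keeps underscores, because `\w` matches `_`, while the list-based approach removes them.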

In [None]:
# Convert all sentences to lowercase
df['sentence'] = df['sentence'].str.lower()
df.sample(10)

**Sentences (along with tokens) are a key unit of information in NLP.** To represent our data as a uniform "block" of text, we need to find the length of our longest sentence; the shorter sentences will later be padded with padding tokens up to that length.

In [None]:
# Get the length (in characters) of the longest sentence
max_len = df['sentence'].str.len().max()
print(f'max sentence length is {max_len} characters')
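
The cell above measures length in characters, while padding is normally applied to token sequences. The following is a minimal sketch of that padding idea, assuming a simple whitespace split in place of a real tokenizer and made-up example sentences; `[PAD]` is the padding token used by BERT.

```python
PAD = "[PAD]"  # BERT's padding token; the whitespace split below is a simplification

# Made-up example sentences, for illustration only
sentences = ["aladdin rubbed the lamp", "a genie appeared", "he wished"]

# Split each sentence into tokens and find the longest sequence
tokenized = [s.split() for s in sentences]
max_tokens = max(len(t) for t in tokenized)

# Pad every sequence on the right so that all have the same length
padded = [t + [PAD] * (max_tokens - len(t)) for t in tokenized]

# 1 marks a real token, 0 marks padding the model should ignore
attention_mask = [[1] * len(t) + [0] * (max_tokens - len(t)) for t in tokenized]

for row, mask in zip(padded, attention_mask):
    print(row, mask)
```

Because every row now has the same length, the block can be stored as a rectangular tensor once the tokens are mapped to integer ids.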

The transformers library provides a convenient way to load a variety of BERT models. Let's first load and explore a tokenizer.

In [None]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
# Get the tokenizer vocabulary words
vocab = bert_tokenizer.vocab
vocab_size = len(vocab)
print(f'vocab size is: {vocab_size}')

In [None]:
# Get the vocabulary words as a list, load them into a dataframe
vocab_list = list(vocab.keys())
vocab_df = pd.DataFrame(vocab_list, columns=['tokens'])
vocab_df.sample(15)
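
Many entries in the vocabulary start with `##`. BERT's tokenizer uses WordPiece, where `##` marks a piece that continues a word. The sketch below is a simplified greedy longest-match-first procedure over a tiny hypothetical vocabulary; it is meant to show the idea, not to reproduce the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split a single word into pieces using greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"play", "##ing", "##ed", "lamp", "##s", "un", "##play", "##able"}

for word in ["playing", "lamps", "unplayable", "genie"]:
    print(word, wordpiece(word, toy_vocab))
```

With a real vocabulary of tens of thousands of entries, most common words are matched whole and only rarer words are broken into several pieces.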

In [None]:
# Get the count of tokens that contain 'unused' (e.g. '[unused0]')
unused_tokens = vocab_df[vocab_df['tokens'].str.contains('unused', regex=False)]
print(f'There are {len(unused_tokens)} tokens that contain "unused"')
unused_tokens.sample(10)

In [None]:
# Get the tokens that have a size of 1 character
one_char_tokens = vocab_df[vocab_df['tokens'].str.len()==1]
print(f'There are {len(one_char_tokens)} tokens that have a size of 1 character')
one_char_tokens.sample(10)

In [None]:
# Get the tokens that have more than 2 characters and do not contain the word 'unused'
two_char_tokens = vocab_df[(vocab_df['tokens'].str.len()>2) & (vocab_df['tokens'].str.find('unused')<0)]
print(f'There are {len(two_char_tokens)} tokens that likely represent English words')
two_char_tokens.sample(10)

In [None]:
bert_model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
