# BERT

**Bidirectional Encoder Representations from Transformers.**

_ | _
- | -
![alt](https://pytorch.org/assets/images/bert1.png) | ![alt](https://pytorch.org/assets/images/bert2.png)


### **Overview**

BERT was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin *et al.* The model is based on the Transformer architecture introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani *et al.* and has led to significant improvements in a wide range of natural language tasks.

At the highest level, BERT maps from a block of text to a numeric vector which summarizes the relevant information in the text. 

What is remarkable is that this numeric summary is sufficiently informative that, for example, the numeric summary of a paragraph followed by a reading comprehension question contains all the information necessary to satisfactorily answer the question.

#### **Transfer Learning**

BERT is a great example of a paradigm called *transfer learning*, which has proved very effective in recent years. In the first step, a network is trained on an unsupervised task using massive amounts of data. In the case of BERT, it was trained to predict masked-out words and to detect whether the second of a pair of sentences actually follows the first in the source text, using English Wikipedia together with a large corpus of books. This was initially done by Google, using intensive computational resources.
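
To make the word-prediction task concrete, here is a minimal sketch of how training examples can be corrupted. It follows the recipe described in the BERT paper (select about 15% of tokens; replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged). The function name `mask_tokens`, the toy sentence, and the whitespace tokenization are illustrative assumptions, not part of any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # the network must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # usually: hide the token
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # sometimes: swap in a random token
            else:
                corrupted.append(tok)                # sometimes: leave it unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

# Toy example; mask_prob is raised so that the effect is visible on a short sentence
tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```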

Once this network has been trained, it is then used to perform many other supervised tasks using only limited data and computational resources: for example, sentiment classification in tweets or question answering. The network is re-trained to perform these other tasks in such a way that only the final, output parts of the network are allowed to adjust by very much, so that most of the "information" originally learned by the network is preserved. This process is called *fine tuning*.
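
The idea can be sketched in a few lines of NumPy. This is a deliberately simplified illustration: a fixed random feature map stands in for the pre-trained network, it is frozen completely (rather than merely being allowed to move a little), and only a linear output head is trained on a small synthetic dataset. All names and sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained network: a fixed nonlinear feature map
W_backbone = rng.normal(size=(20, 8))

def backbone(x):
    return np.tanh(x @ W_backbone)

# Small synthetic labelled dataset for the downstream task
X = rng.normal(size=(200, 20))
y = backbone(X) @ rng.normal(size=8) + 0.1 * rng.normal(size=200)

# Only the output head is trained; W_backbone is never updated
W_before = W_backbone.copy()
features = backbone(X)
w_head = np.zeros(8)
loss_start = np.mean((features @ w_head - y) ** 2)
for _ in range(500):
    grad = 2 * features.T @ (features @ w_head - y) / len(y)  # gradient of the MSE
    w_head -= 0.1 * grad
loss_end = np.mean((features @ w_head - y) ** 2)

print("MSE before training the head:", loss_start)
print("MSE after training the head: ", loss_end)
```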

## Getting to know BERT

BERT and many of its variants are made available to the public by the open source [Hugging Face Transformers](https://huggingface.co/transformers/) project. This is an amazing resource, giving researchers and practitioners easy-to-use access to this technology.

In order to use BERT for modeling, we simply need to download the pre-trained neural network and fine tune it on our dataset, which is illustrated below. 

In [None]:
!pip install transformers

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
from transformers import TFBertModel, BertTokenizer

# Formatting tools
from pprint import pformat 
np.set_printoptions(threshold=10)

In [None]:
# Download text pre-processor ("tokenizer")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [None]:
# Download BERT model
bert = TFBertModel.from_pretrained("bert-base-uncased")

### Tokenization

The first step in using BERT (or any similar text embedding tool) is to *tokenize* the data. This step standardizes blocks of text, so that meaningless differences in text presentation don't affect the behavior of our algorithm. 

Typically the text is transformed into a sequence of 'tokens,' each of which corresponds to a numeric code. 
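
As a rough illustration of what happens inside a tokenizer, the sketch below implements greedy longest-match subword splitting in the style of WordPiece, the scheme BERT uses. The vocabulary `toy_vocab` and its numeric codes are made up for this example; the real `bert-base-uncased` vocabulary is far larger and will split text differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                      # no subword matches: unknown token
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary mapping tokens to numeric codes
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "what": 3, "happen": 4,
             "##s": 5, "to": 6, "this": 7, "string": 8, "?": 9}

def encode(text, vocab):
    # Standardize: lower-case and separate the question mark from the last word
    words = text.lower().replace("?", " ?").split()
    tokens = ["[CLS]"] + [p for w in words for p in wordpiece(w, vocab)] + ["[SEP]"]
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("What happens to this string?", toy_vocab)
print(tokens)
print(ids)
```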

In [None]:
# Let's try it out!
s = "What happens to this string?"
print('Original String: \n\"{}\"\n'.format(s))
tensors = tokenizer(s)
print('Numeric encoding: \n' + pformat(tensors))

# What does this mean?
print('\nActual tokens:')
tokenizer.convert_ids_to_tokens(tensors['input_ids'])

### BERT in a nutshell

Once we have our numeric tokens, we can simply plug them into the BERT network and get a numeric vector summary. Note that in applications, the BERT summary will be "fine tuned" to a particular task, which hasn't happened yet. 
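
For intuition about where the summary vector comes from: in BERT, the pooled output is computed from the final hidden state of the first token (`[CLS]`) by a dense layer followed by a `tanh`. The NumPy sketch below mimics only that last step, using random numbers in place of real hidden states and learned weights, so the values are meaningless; only the shapes and the operation are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 9, 768   # 768 is the hidden size of bert-base

# Hypothetical final-layer hidden states for one tokenized sentence
last_hidden_state = rng.normal(size=(seq_len, hidden))

# Random stand-ins for the pooler's learned weights and bias
W_pool = rng.normal(scale=0.02, size=(hidden, hidden))
b_pool = np.zeros(hidden)

cls_state = last_hidden_state[0]               # position 0 holds the [CLS] token
pooled = np.tanh(cls_state @ W_pool + b_pool)  # one vector per input text

print(pooled.shape)
```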

In [None]:
print('Input: "What happens to this string?"\n')

# Tokenize the string
tensors_tf = tokenizer("What happens to this string?", return_tensors="tf")

# Run it through BERT
output = bert(tensors_tf)

# Inspect the output
_shape = output['pooler_output'].shape
print(
"""Output type: {}\n 
Output shape: {}\n
Output preview: {}\n"""
.format(
type(output['pooler_output']),
 _shape, 
pformat(output['pooler_output'].numpy())))


# A practical introduction to BERT

In the next part of the notebook, we are going to explore how a tool like BERT may be useful to an econometrician. 

In particular, we are going to apply BERT to a subset of data from the Amazon marketplace consisting of roughly 10,000 listings for products in the toy category. Each product comes with a text description, a price, and a number of times reviewed (which we'll use as a proxy for demand / market share). 

**Problem 1**:
What are some issues you may anticipate when using number of reviews as a proxy for demand or market share?

### Getting to know the data

First, we'll download and clean up the data, and do some preliminary inspection.

In [None]:
# Download data
DATA_URL = 'https://www.dropbox.com/s/on2nzeqcdgmt627/amazon_co-ecommerce_sample.csv?dl=1'
data = pd.read_csv(DATA_URL)

# Clean numeric data fields
# (regex=True is required in recent pandas versions for the pattern to be treated as a regex)
data['number_of_reviews'] = pd.to_numeric(data
                              .number_of_reviews
                              .str.replace(r"\D+", '', regex=True))
data['price'] = (data
                    .price
                    .str.extract(r'(\d+\.*\d+)', expand=False)
                    .astype('float'))

# Drop products with no reviews
data = data[data['number_of_reviews'] > 0]

# Compute log prices
data['ln_p'] = np.log(data.price)

# Impute market shares
data['ln_q'] =  np.log(data['number_of_reviews'] / data['number_of_reviews'].sum())

# Collect relevant text data
data['text'] = (data[[
                    'product_name',
                    'product_description']]
                  .astype('str')
                  .agg(' | '.join, axis=1))
 
#  Drop irrelevant data and inspect
data = data[['text','ln_p','ln_q']]
data = data.dropna()
data.head()
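
It can help to see what the two regular expressions in the cleaning step do to individual strings. The sketch below applies the same patterns with Python's `re` module to a few hypothetical raw values (the helper names and example strings are made up). Note that the price pattern needs at least two digits and does not understand thousands separators.

```python
import re

def parse_reviews(raw):
    """Remove every non-digit character, then convert to an integer."""
    digits = re.sub(r"\D+", "", raw)
    return int(digits) if digits else None

def parse_price(raw):
    """Return the first match of the price pattern as a float, or None."""
    match = re.search(r"(\d+\.*\d+)", raw)
    return float(match.group(1)) if match else None

print(parse_reviews("1,040"))      # the comma is stripped before conversion
print(parse_price("£3.42"))        # currency symbol is ignored
print(parse_price("£5"))           # single digit: the pattern does not match
print(parse_price("£1,250.00"))    # only the part after the comma is picked up
```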

Let's make a two-way scatter plot of prices and (proxied) market shares. 

In [None]:
# Plot log price against market share
data.plot.scatter('ln_p','ln_q')

Let's begin with a simple prediction task. We will see how well we can explain the price of these products using their textual descriptions.

**Problem 2**:
 1. Build a linear model that explains the price of each product using its text embedding vector as the explanatory variables. 

 2. Build a two-layer perceptron neural network that explains the price of each product using the text embedding vector as input (see example code below).
<!-- 3. Now, instead of taking the text embeddings as fixed, we allow the it to ``fine tune.'' Construct a neural network by combining the (pre-loaded) BERT network -->

 3. Report the $R^2$ of both approaches. 

 4. As an econometrician, what are some concerns you may have about how to interpret these models?
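
For reference, here is one way the $R^2$ could be computed on a holdout sample. The sketch uses synthetic data in place of real embeddings and prices, and a least-squares fit in place of either model; the helper `r_squared` is an assumption of this example, not a function from the notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d = 500, 16
E = rng.normal(size=(n, d))                       # synthetic stand-in for embeddings
y = E @ rng.normal(size=d) + rng.normal(size=n)   # synthetic stand-in for log prices

# Fit on the first 300 rows, evaluate on the remaining 200
train, test = slice(0, 300), slice(300, n)
X_train = np.column_stack([np.ones(300), E[train]])
X_test = np.column_stack([np.ones(n - 300), E[test]])
coef, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

r2_holdout = r_squared(y[test], X_test @ coef)
print("Holdout R^2:", r2_holdout)
```

Evaluating on data not used for fitting matters here: with hundreds of embedding dimensions, in-sample $R^2$ can be high even when the model predicts poorly out of sample.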

In [None]:
## First, let's split and preprocess (tokenize) the text to prepare it for BERT 

main = data.sample(frac=0.6,random_state=200)
holdout = data.drop(main.index)

tensors = tokenizer(
    list(main["text"]),
    padding="max_length",  # pad every row to exactly 128 tokens, matching Input(shape=(128,))
    truncation=True, 
    max_length=128,
    return_tensors="tf")

ln_p = main["ln_p"]
ln_q = main["ln_q"]

In [None]:
## Now let's prepare our model

from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense, Dropout, Concatenate

input_ids = Input(shape=(128,), dtype=tf.int32)
token_type_ids = Input(shape=(128,), dtype=tf.int32)
attention_mask = Input(shape=(128,), dtype=tf.int32)

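# First we compute the text embedding
# (keyword arguments: TFBertModel expects attention_mask before token_type_ids)
Z = bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)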

# We want the "pooled / summary" embedding, not individual word embeddings
Z = Z[1]

# Then we do a regular regression
Z = Dense(128, activation='relu')(Z)
Z = Dropout(0.2)(Z)
Z = Dense(32, activation='relu')(Z)
Z = Dropout(0.2)(Z)
Z = Dense(8, activation='relu')(Z)
ln_p_hat = Dense(1, activation='linear')(Z)

PricePredictionNetwork = Model([input_ids, token_type_ids, attention_mask], ln_p_hat)
PricePredictionNetwork.compile(optimizer='adam', loss='mse')
PricePredictionNetwork.summary()

In [None]:
PricePredictionNetwork.fit(
                [tensors['input_ids'], tensors['token_type_ids'], tensors['attention_mask']], 
                ln_p,
                epochs=3,
                batch_size=16,
                shuffle=True)

Now, let's go one step further and construct a double/debiased machine learning (DML) estimator of the average price elasticity. In particular, we will model market share $q_i$ as
$$\ln q_i = \alpha + \beta \ln p_i + \psi(d_i) + \epsilon_i,$$ where $d_i$ denotes the description of product $i$ and $\psi$ is the composition of text embedding and a two-layer perceptron. 
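
The logic of the estimator can be checked on synthetic data where the true $\beta$ is known. In the sketch below, a three-dimensional vector `x` plays the role of the text embedding and a least-squares fit stands in for the neural networks; these simplifications, and all numbers, are assumptions of the example. Both $\ln p_i$ and $\ln q_i$ are predicted from `x` on one half of the sample, and the residuals are regressed on each other in the other half.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta_true = 4000, -1.0

# Synthetic data: x affects both price and quantity, so it confounds the naive regression
x = rng.normal(size=(n, 3))
psi = x @ np.array([1.0, -0.5, 0.25])
ln_p = x @ np.array([0.5, 0.5, 0.0]) + rng.normal(size=n)
ln_q = beta_true * ln_p + psi + rng.normal(size=n)

main, hold = np.arange(n) < n // 2, np.arange(n) >= n // 2

def fit_predict(X_fit, y_fit, X_new):
    """Least-squares predictor with intercept; stands in for the neural networks."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

# Fit on the main sample, residualize in the holdout sample
r_p = ln_p[hold] - fit_predict(x[main], ln_p[main], x[hold])
r_q = ln_q[hold] - fit_predict(x[main], ln_q[main], x[hold])

beta_dml = np.mean(r_p * r_q) / np.mean(r_p ** 2)
beta_naive = np.polyfit(ln_p, ln_q, 1)[0]   # slope of ln_q on ln_p, ignoring x

print("True beta: ", beta_true)
print("Naive OLS: ", beta_naive)
print("DML:       ", beta_dml)
```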

**Problem 3**: 
 1. Split the sample in two, and predict $\ln p_i$ and $\ln q_i$ using $d_i$ with a two-layer perceptron as before, using the main sample.
 2. In the holdout sample, perform an OLS regression of the residual of $\ln q_i$ on the residual of $\ln p_i$ (using the previous problem's model). 
 3. What do you find? 

In [None]:
## Build the quantity prediction network

# Initialize new BERT model from original
bert2 = TFBertModel.from_pretrained("bert-base-uncased")

# Define inputs
input_ids = Input(shape=(128,), dtype=tf.int32)
token_type_ids = Input(shape=(128,), dtype=tf.int32)
attention_mask = Input(shape=(128,), dtype=tf.int32)

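# First we compute the text embedding
# (keyword arguments: TFBertModel expects attention_mask before token_type_ids)
Z = bert2(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)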

# We want the "pooled / summary" embedding, not individual word embeddings
Z = Z[1]

# Construct network
Z = Dense(128, activation='relu')(Z)
Z = Dropout(0.2)(Z)
Z = Dense(32, activation='relu')(Z)
Z = Dropout(0.2)(Z)
Z = Dense(8, activation='relu')(Z)
ln_q_hat = Dense(1, activation='linear')(Z)

# Compile model and optimization routine
QuantityPredictionNetwork = Model([input_ids, token_type_ids, attention_mask], ln_q_hat)
QuantityPredictionNetwork.compile(optimizer='adam', loss='mse')
QuantityPredictionNetwork.summary()

In [None]:
## Fit the quantity prediction network in the main sample
QuantityPredictionNetwork.fit(
                [tensors['input_ids'], tensors['token_type_ids'], tensors['attention_mask']], 
                ln_q,
                epochs=3,
                batch_size=16,
                shuffle=True)

In [None]:
## Predict in the holdout sample, residualize and regress

# Preprocess holdout sample
tensors_holdout = tokenizer(
    list(holdout["text"]),
    padding="max_length",  # pad every row to exactly 128 tokens, matching Input(shape=(128,))
    truncation=True, 
    max_length=128,
    return_tensors="tf")

# Compute predictions
ln_p_hat_holdout = PricePredictionNetwork.predict([tensors_holdout['input_ids'], tensors_holdout['token_type_ids'], tensors_holdout['attention_mask']])
ln_q_hat_holdout = QuantityPredictionNetwork.predict([tensors_holdout['input_ids'], tensors_holdout['token_type_ids'], tensors_holdout['attention_mask']])


In [None]:
# Compute residuals
r_p = holdout["ln_p"] - ln_p_hat_holdout.reshape((-1,))
r_q = holdout["ln_q"] - ln_q_hat_holdout.reshape((-1,))

# Regress to obtain elasticity estimate
beta = np.mean(r_p * r_q) / np.mean(r_p * r_p)

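# Heteroskedasticity-robust standard error on the elasticity estimate,
# based on the regression residuals r_q - beta * r_p
resid = r_q - beta * r_p
se = np.sqrt(np.mean((r_p * resid)**2) / (np.mean(r_p * r_p)**2) / holdout["ln_p"].size)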

print('Elasticity of Demand with Respect to Price: {}'.format(beta))
print('Standard Error: {}'.format(se))
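
The final regression step can also be wrapped in a small helper that reports a confidence interval. The sketch below is one common heteroskedasticity-robust variant, in which the standard error is computed from the regression residuals $r_q - \hat\beta r_p$; the function name `residual_regression` and the synthetic residuals are assumptions of this example.

```python
import numpy as np

def residual_regression(r_p, r_q):
    """No-intercept OLS of r_q on r_p with a robust standard error and 95% interval."""
    r_p = np.asarray(r_p, dtype=float)
    r_q = np.asarray(r_q, dtype=float)
    beta = np.mean(r_p * r_q) / np.mean(r_p ** 2)
    resid = r_q - beta * r_p
    se = np.sqrt(np.mean((r_p * resid) ** 2) / np.mean(r_p ** 2) ** 2 / r_p.size)
    return beta, se, (beta - 1.96 * se, beta + 1.96 * se)

# Synthetic residuals with true slope -0.7 and noise that grows with |r_p|
rng = np.random.default_rng(1)
r_p_demo = rng.normal(size=2000)
r_q_demo = -0.7 * r_p_demo + rng.normal(size=2000) * (1 + np.abs(r_p_demo))

beta_demo, se_demo, ci_demo = residual_regression(r_p_demo, r_q_demo)
print("beta:", beta_demo, "se:", se_demo, "95% CI:", ci_demo)
```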

## Clustering Products

In this final part of the notebook, we'll illustrate how the BERT text embeddings can be used to cluster products based on their descriptions.

Intuitively, our neural network has now learned which aspects of the text description are relevant to predict prices and market shares. 
We can therefore use the embeddings produced by our network to cluster products, and we might expect that the clusters reflect market-relevant information. 

In the following block of cells, we compute embeddings using our learned models and cluster them using $k$-means clustering with $k=10$. Finally, we will explore how the estimated price elasticity differs across clusters.

### Overview of **$k$-means clustering**
The $k$-means clustering algorithm seeks to divide $n$ data vectors into $k$ groups, each of which contains points that are "close together."

In particular, let $C_1, \ldots, C_k$ be a partition of the data into $k$ disjoint, nonempty subsets (clusters), and define
$$\bar{C_i}=\frac{1}{\#C_i}\sum_{x \in C_i} x$$
to be the *centroid* of the cluster $C_i$. The $k$-means clustering score $\mathrm{sc}(C_1, \ldots, C_k)$ is defined to be
$$\mathrm{sc}(C_1, \ldots, C_k) = \sum_{i=1}^k \sum_{x \in C_i} \left\|x - \bar{C_i}\right\|^2,$$
where $\|\cdot\|$ denotes the Euclidean norm.

The $k$-means clustering is then defined to be any partition $C^*_1, \ldots, C^*_k$ that minimizes the score $\mathrm{sc}(-)$.
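
To make the definition concrete, the sketch below computes the score for a given assignment and runs a few rounds of Lloyd's algorithm, the standard heuristic that alternates between assigning points to the nearest centroid and recomputing centroids. It finds a local minimum of the score, not necessarily the minimizing partition. The function names and the two-blob toy data are assumptions of this example; the notebook itself relies on scikit-learn's `KMeans`.

```python
import numpy as np

def kmeans_score(X, labels, k):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for i in range(k):
        members = X[labels == i]
        if len(members):
            total += np.sum((members - members.mean(axis=0)) ** 2)
    return total

def lloyd(X, k, n_iter=20, seed=0):
    """Alternate between nearest-centroid assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # start from k data points
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels

# Toy data: two tight, well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [5, 5])])

labels_demo = lloyd(X_demo, k=2)
print("Cluster sizes:", np.bincount(labels_demo, minlength=2))
print("Score:", kmeans_score(X_demo, labels_demo, 2))
```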

**Problem 4** Show that the $k$-means clustering depends only on the pairwise distances between points. *Hint: verify that $\sum_{x,y \in C_i} \langle x - \bar{C_i},\, y - \bar{C_i} \rangle = 0$.*
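
Before attempting a proof, it can be reassuring to check the claim numerically. The sketch below (a sanity check on random data, not a proof) compares the within-cluster sum of squares for a single cluster with a quantity built only from pairwise squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(30, 5))   # one hypothetical cluster: 30 points in 5 dimensions

# Contribution of this cluster to the k-means score (uses the centroid)
within = np.sum((C - C.mean(axis=0)) ** 2)

# Sum of squared distances over all ordered pairs, divided by 2 * (cluster size)
diffs = C[:, None, :] - C[None, :, :]
pairwise = np.sum(diffs ** 2) / (2 * len(C))

print(within, pairwise)
```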

In [None]:
## STEP 1: Compute embeddings

input_ids = Input(shape=(128,), dtype=tf.int32)
token_type_ids = Input(shape=(128,), dtype=tf.int32)
attention_mask = Input(shape=(128,), dtype=tf.int32)

Y1 = bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[1]
Y2 = bert2(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[1]
Y = Concatenate()([Y1,Y2])

embedding_model = Model([input_ids, token_type_ids, attention_mask], Y)

embeddings = embedding_model.predict([tensors_holdout['input_ids'], tensors_holdout['token_type_ids'], tensors_holdout['attention_mask']])

### Dimension reduction and the **Johnson-Lindenstrauss transform**

Our learned embeddings have dimension $1536$ (two concatenated $768$-dimensional BERT summaries), and $k$-means clustering on high-dimensional data is often an expensive operation. To improve the situation, we will use a neat trick that is used extensively in machine learning applications: the *Johnson-Lindenstrauss transform*. 

This trick involves finding a low-dimensional linear projection of the embeddings that approximately preserves pairwise distances. 

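In fact, the Johnson-Lindenstrauss lemma leads to a much more interesting statement: a suitably scaled Gaussian random matrix will, with high probability, approximately preserve pairwise distances, provided the target dimension is large enough relative to the logarithm of the number of points.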
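
This is easy to try out. The sketch below draws a Gaussian matrix with entries of variance $1/k$, projects random $1536$-dimensional points down to $k$ dimensions, and compares squared pairwise distances before and after. The target dimension uses the bound $k \geq 4 \ln n / (\epsilon^2/2 - \epsilon^3/3)$, which is the formula documented for scikit-learn's `johnson_lindenstrauss_min_dim`; the data, the helper names, and the choice $\epsilon = 0.5$ are assumptions of this example.

```python
import numpy as np

def jl_min_dim(n_samples, eps):
    """Target dimension from the bound 4 ln(n) / (eps^2 / 2 - eps^3 / 3)."""
    return int(4 * np.log(n_samples) / (eps ** 2 / 2 - eps ** 3 / 3))

def pairwise_sq_dists(A):
    """Matrix of squared Euclidean distances between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

rng = np.random.default_rng(0)
n, d, eps = 50, 1536, 0.5
k = jl_min_dim(n, eps)

X = rng.normal(size=(n, d))                        # synthetic high-dimensional points
P = rng.normal(scale=1 / np.sqrt(k), size=(d, k))  # Gaussian random projection
X_low = X @ P

# Ratio of projected to original squared distance for every pair of distinct points
iu = np.triu_indices(n, k=1)
ratios = pairwise_sq_dists(X_low)[iu] / pairwise_sq_dists(X)[iu]

print("Projected dimension:", k)
print("Smallest and largest distortion ratio:", ratios.min(), ratios.max())
```

Ratios close to 1 mean that the projection has roughly preserved the geometry that $k$-means depends on.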

**Problem 5** Suppose we have a low-dimensional projection matrix $\Pi$ that preserves pairwise distances, and let $X$ be the design matrix. Explain how and why we could compute the $k$-means clustering using only the projected data $\Pi X$. *Hint: use Problem 4.*



In [None]:
# STEP 2 Make low-dimensional projections
from sklearn.random_projection import GaussianRandomProjection

jl = GaussianRandomProjection(eps=.25)
embeddings_lowdim = jl.fit_transform(embeddings)

In [None]:
# STEP 3 Compute clusters
from sklearn.cluster import KMeans

k_means = KMeans(n_clusters=10)
k_means.fit(embeddings_lowdim)
cluster_ids = k_means.labels_

In [None]:
# STEP 4 Regress within each cluster

betas = np.zeros(10)
ses = np.zeros(10)

for c in range(10):

  r_p_c = r_p[cluster_ids == c]
  r_q_c = r_q[cluster_ids == c]

  # Regress to obtain elasticity estimate
  betas[c] = np.mean(r_p_c * r_q_c) / np.mean(r_p_c * r_p_c)

  # Robust standard error on elasticity estimate, based on within-cluster regression residuals
  resid_c = r_q_c - betas[c] * r_p_c
  ses[c] = np.sqrt(np.mean((r_p_c * resid_c)**2)/(np.mean(r_p_c*r_p_c)**2)/r_p_c.size)

In [None]:
# STEP 5 Plot
from matplotlib import pyplot as plt

plt.bar(range(10),betas, yerr = ses)