# BERT

**Bidirectional Encoder Representations from Transformers.**

_ | _
- | -
![alt](https://pytorch.org/assets/images/bert1.png) | ![alt](https://pytorch.org/assets/images/bert2.png)


### **Overview**

BERT was released together with the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Jacob Devlin *et al.* The model is based on the Transformer architecture introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani *et al.* and has led to significant improvements in a wide range of natural language tasks.

At the highest level, BERT maps from a block of text to a numeric vector which summarizes the relevant information in the text.

What is remarkable is that this numeric summary is sufficiently informative that, for example, the summary of a paragraph followed by a reading comprehension question contains all the information necessary to satisfactorily answer the question.

#### **Transfer Learning**

BERT is a great example of a paradigm called *transfer learning*, which has proved very effective in recent years. In the first step, a network is trained on an unsupervised task using massive amounts of data. In the case of BERT, it was trained to predict masked (missing) words and to detect whether the second sentence of a pair actually follows the first, using English Wikipedia and the BooksCorpus. This was initially done by Google, using intensive computational resources.

Once this network has been trained, it is then used to perform many other supervised tasks using only limited data and computational resources: for example, sentiment classification in tweets or question answering. The network is re-trained to perform these other tasks in such a way that only the final, output parts of the network are allowed to adjust by very much, so that most of the "information" originally learned by the network is preserved. This process is called *fine tuning*.

## Getting to know BERT

BERT and many of its variants are made available to the public by the open source [Huggingface Transformers](https://huggingface.co/transformers/) project. This is an amazing resource, giving researchers and practitioners easy-to-use access to this technology.

In order to use BERT for modeling, we simply need to download the pre-trained neural network and fine tune it on our dataset, which is illustrated below.

In [None]:
%%capture
# Install Huggingface Transformers toolkit
!pip install transformers==4.37.2    # Last version of transformers before keras 3
!pip install shap
!pip install tensorflow_addons
!pip install livelossplot
!pip install sqldf
!pip install auto-sklearn
!pip install -U scikit-learn

In [None]:
# Mount google drive so we can save stuff for later
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# Load dependencies
import tensorflow as tf
import numpy as np
import pandas as pd
import sqldf as sql
import plotnine as p9; p9.theme_set(p9.theme_bw)

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
import statsmodels.formula.api as sm
from transformers import TFBertModel, BertTokenizer, DistilBertTokenizer, TFDistilBertModel
from transformers.models.bert.modeling_tf_bert import TFBertMainLayer

# Formatting tools
from pprint import pformat
np.set_printoptions(threshold=10)
import warnings
warnings.simplefilter('ignore')

In [None]:
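# Sum of squares of a vector
ssq = lambda x: np.inner(x,x)
def get_r2(y,yhat):
    # Print the residual and total sums of squares and the R^2 of predictions yhat for outcome y
    resids = yhat.reshape(-1) - y
    flucs = y - np.mean(y)
    print('RSS: {}, TSS + MEAN^2: {}, TSS: {}, R^2: {}'.format(ssq(resids), ssq(y), ssq(flucs), 1 - ssq(resids)/ssq(flucs)))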

In [None]:
# Load TensorFlow, and ensure GPU is present
# The GPU will massively speed up neural network training
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

In [None]:
# Download text pre-processor ("tokenizer")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [None]:
# Download BERT model
bert = TFBertModel.from_pretrained("bert-base-uncased")

### Tokenization

The first step in using BERT (or any similar text embedding tool) is to *tokenize* the data. This step standardizes blocks of text, so that meaningless differences in text presentation don't affect the behavior of our algorithm.

Typically the text is transformed into a sequence of 'tokens,' each of which corresponds to a numeric code.
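
To make the idea concrete, here is a minimal sketch of the kind of sub-word splitting a BERT-style tokenizer performs: greedy longest-match-first against a vocabulary, with `##` marking word continuations and `[CLS]` / `[SEP]` wrapped around the sequence. The tiny vocabulary and the helper names `wordpiece` and `toy_tokenize` are hypothetical and exist only for illustration; the real tokenizer has a vocabulary of tens of thousands of entries and more careful text normalization.

```python
import re

# Hypothetical toy vocabulary, for illustration only
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "what", "happens", "to", "this",
             "string", "?", "token", "##ize", "##r"}

def wordpiece(word, vocab):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split off punctuation, and apply wordpiece to every word."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = ["[CLS]"]
    for w in words:
        tokens.extend(wordpiece(w, vocab))
    return tokens + ["[SEP]"]

print(toy_tokenize("What happens to this string?"))
# ['[CLS]', 'what', 'happens', 'to', 'this', 'string', '?', '[SEP]']
print(toy_tokenize("Tokenizer"))
# ['[CLS]', 'token', '##ize', '##r', '[SEP]']
```

Each token is then looked up in the vocabulary to obtain its numeric code, which is what the network actually consumes.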

In [None]:
# Let's try it out!
s = "What happens to this string?"
print('Original String: \n\"{}\"\n'.format(s))
tensors = tokenizer(s)
print('Numeric encoding: \n' + pformat(tensors))

# What does this mean?
print('\nActual tokens:')
tokenizer.convert_ids_to_tokens(tensors['input_ids'])

### BERT in a nutshell

Once we have our numeric tokens, we can simply plug them into the BERT network and get a numeric vector summary. Note that in applications, the BERT summary will be "fine tuned" to a particular task, which hasn't happened yet.
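
As a rough picture of where the summary vector comes from: BERT produces one hidden vector per token, and the pooled summary is obtained by passing the vector of the first (`[CLS]`) token through a dense layer with a `tanh` activation. The sketch below only illustrates the shapes involved; the arrays `last_hidden_state`, `W_pool` and `b_pool` are random stand-ins introduced here, not the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 8, 768  # 768 is the hidden size of bert-base

# Random stand-ins for the encoder output and the pooler parameters
last_hidden_state = rng.normal(size=(seq_len, hidden))
W_pool = rng.normal(scale=0.02, size=(hidden, hidden))
b_pool = np.zeros(hidden)

# Pooled summary: dense layer + tanh applied to the first ([CLS]) token's vector
cls_vector = last_hidden_state[0]
pooled = np.tanh(cls_vector @ W_pool + b_pool)

print(pooled.shape)  # (768,)
```

Whatever the length of the input text, the summary has the same fixed dimension, which is what lets us feed it into a regression layer later on.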

In [None]:
print('Input: "What happens to this string?"\n')

# Tokenize the string
tensors_tf = tokenizer("What happens to this string?", return_tensors="tf")

# Run it through BERT
output = bert(tensors_tf)

# Inspect the output
_shape = output['pooler_output'].shape
print(
"""Output type: {}\n
Output shape: {}\n
Output preview: {}\n"""
.format(
type(output['pooler_output']),
 _shape,
pformat(output['pooler_output'].numpy())))


# A practical introduction to BERT

In the next part of the notebook, we are going to explore how a tool like BERT may be useful for causal inference.

In particular, we are going to apply BERT to a subset of data from the Amazon marketplace consisting of roughly 10,000 listings for products in the toy category. Each product comes with a text description, a price, and a number of times reviewed (which we'll use as a proxy for demand / market share).

For more information on the dataset, check out the [Dataset README](https://github.com/CausalAIBook/MetricsMLNotebooks/blob/main/data/amazon_toys.md).

**For thought**:
What are some issues you may anticipate when using number of reviews as a proxy for demand or market share?

### Getting to know the data

First, we'll download and clean up the data, and do some preliminary inspection.

In [None]:
DATA_URL = 'https://github.com/CausalAIBook/MetricsMLNotebooks/raw/main/data/amazon_toys.csv'
data = pd.read_csv(DATA_URL)

In [None]:
data.columns

In [None]:
# Clean numeric data fields (remove all non-digit characters and parse as a numeric value)
data['number_of_reviews'] = pd.to_numeric(data
                              .number_of_reviews
                              .str.replace(r"\D+", '', regex=True))  # regex=True is required in recent pandas
data['price'] = (data
                    .price
                    .str.extract(r'(\d+\.*\d+)')
                    .astype('float'))

# Drop products with no reviews (or a missing review count)
data = data[data['number_of_reviews'] > 0]

# Compute log prices
data['ln_p'] = np.log(data.price)

# Impute market shares from # of reviews
data['ln_q'] =  np.log(data['number_of_reviews'] / data['number_of_reviews'].sum())

# Collect relevant text data
data['text'] = (data[[
                    'product_name',
                    'manufacturer',
                    'product_description'
                    ]]
                  .astype('str')
                  .agg(' | '.join, axis=1))

#  Drop irrelevant data and inspect
data = data[['text','ln_p','ln_q', 'amazon_category_and_sub_category']]
data = data.dropna()
data.head()

In [None]:
# Text lengths
data['text_num_words'] = data['text'].str.split().apply(len)
print(np.nanquantile(data['text_num_words'], 0.99))
(p9.ggplot(data, p9.aes('text_num_words')) + p9.geom_density())

Let's make a two-way scatter plot of prices and (proxied) market shares.

In [None]:
(p9.ggplot(data, p9.aes('ln_p','ln_q')) + p9.geom_point() + p9.stat_smooth(color="red"))

In [None]:
(p9.ggplot(data, p9.aes('ln_p','ln_q')) + p9.stat_smooth(color="red"))

In [None]:
result = sm.ols('ln_q ~ ln_p ', data=data).fit() # + C(subcat)
print('Elasticity: {}, SE: {}, R2: {}'.format(result.params['ln_p'],result.bse['ln_p'], result.rsquared_adj))
result.conf_int(alpha=0.05)
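
Why read the coefficient on `ln_p` as an elasticity? In a log-log regression the slope measures the percentage change in quantity associated with a one percent change in price. The following sketch checks this on simulated data; the sample size, the true elasticity of -1.5 and the noise level are all assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_elasticity = -1.5  # assumed value for the simulation

# Simulate log prices and log quantities from a constant-elasticity demand curve
ln_p = rng.normal(size=n)
ln_q = 2.0 + true_elasticity * ln_p + rng.normal(scale=0.5, size=n)

# Least-squares fit of ln_q on ln_p; polyfit returns [slope, intercept]
slope, intercept = np.polyfit(ln_p, ln_q, 1)
print('Estimated elasticity: {:.3f}'.format(slope))
```

In the simulation price is assigned at random, so the slope recovers the elasticity. In the Amazon data, price is likely correlated with product characteristics that also drive demand, which is why we will later control for the text description.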

Let's begin with a simple prediction task. We will discover how well we can explain the price of these products using their textual descriptions.

In [None]:
from sklearn.model_selection import train_test_split

main_ind, test_ind = train_test_split(np.arange(data.shape[0]), test_size=.2, shuffle=True, random_state=124)
main = data.iloc[main_ind]

train_ind, val_ind = train_test_split(np.arange(main.shape[0]), test_size=0.25, random_state=124) # 0.25 x 0.8 = 0.2

train = main.iloc[train_ind]
val = main.iloc[val_ind]
holdout = data.iloc[test_ind]

tensors = tokenizer(
    list(train["text"]),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf")

val_tensors = tokenizer(
    list(val["text"]),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf")

# Preprocess holdout sample
tensors_holdout = tokenizer(
    list(holdout["text"]),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf")
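
# The `padding`, `truncation` and `max_length` options above turn texts of different
# lengths into one rectangular array. The sketch below mimics that step on made-up
# token ids. The helper `pad_and_truncate` is hypothetical, and it simplifies the real
# tokenizer, which also keeps the closing [SEP] token when it truncates.

```python
def pad_and_truncate(token_id_lists, max_length, pad_id=0):
    """Cut every sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # The attention mask is 1 for real tokens and 0 for padding
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Made-up token ids for a short and a long text
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
ids, mask = pad_and_truncate(batch, max_length=6)
print(ids)   # [[101, 7, 8, 102, 0, 0], [101, 7, 8, 9, 10, 11]]
print(mask)  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```

# The mask tells the network which positions hold real tokens, so that padding
# does not influence the summary.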

In [None]:
ln_p = train["ln_p"]
ln_q = train["ln_q"]
val_ln_p = val["ln_p"]
val_ln_q = val["ln_q"]

# Using BERT as Feature Extractor

In [None]:
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense, Dropout, Concatenate
import tensorflow_addons as tfa
from tensorflow.keras import regularizers

input_ids = Input(shape=(128,), dtype=tf.int32)
token_type_ids = Input(shape=(128,), dtype=tf.int32)
attention_mask = Input(shape=(128,), dtype=tf.int32)

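# Pass the mask and segment ids by keyword: TFBertModel's positional order is (input_ids, attention_mask, token_type_ids)
Z = bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[1]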

embedding_model = Model([input_ids, token_type_ids, attention_mask], Z)

embeddings = embedding_model.predict([tensors['input_ids'], tensors['token_type_ids'], tensors['attention_mask']])

In [None]:
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

lcv = make_pipeline(StandardScaler(), LassoCV(cv=KFold(n_splits=5, shuffle=True, random_state=123), random_state=123))
lcv.fit(embeddings, ln_p)

In [None]:
embeddings_val = embedding_model.predict([val_tensors['input_ids'], val_tensors['token_type_ids'], val_tensors['attention_mask']])

In [None]:
get_r2(val_ln_p, lcv.predict(embeddings_val))

In [None]:
embeddings_holdout = embedding_model.predict([tensors_holdout['input_ids'], tensors_holdout['token_type_ids'], tensors_holdout['attention_mask']])

In [None]:
get_r2(holdout['ln_p'], lcv.predict(embeddings_holdout))

In [None]:
ln_p_hat_holdout = lcv.predict(embeddings_holdout)

# Linear Probing: Training Only Final Layer after BERT

In [None]:
### Now let's prepare our model

from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense, Dropout, Concatenate
import tensorflow_addons as tfa
from tensorflow.keras import regularizers

tf.keras.utils.set_random_seed(123)

input_ids = Input(shape=(128,), dtype=tf.int32)
token_type_ids = Input(shape=(128,), dtype=tf.int32)
attention_mask = Input(shape=(128,), dtype=tf.int32)

# # First we compute the text embedding
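# Pass the mask and segment ids by keyword: TFBertModel's positional order is (input_ids, attention_mask, token_type_ids)
Z = bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)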

for layer in bert.layers:
    layer.trainable=False
    for w in layer.weights: w._trainable=False

# # We want the "pooled / summary" embedding, not individual word embeddings
Z = Z[1]

# # Then we do a regular regression
# Z = Dropout(0.2)(Z)
ln_p_hat = Dense(1, activation='linear',
                 kernel_regularizer=regularizers.L2(1e-3))(Z)

PricePredictionNetwork = Model([
                                input_ids,
                                token_type_ids,
                                attention_mask,
                                ], ln_p_hat)
PricePredictionNetwork.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=tfa.metrics.RSquare(),
)
PricePredictionNetwork.summary()

In [None]:
from livelossplot import PlotLossesKeras

tf.keras.utils.set_random_seed(123)
earlystopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
modelcheckpoint = tf.keras.callbacks.ModelCheckpoint("/content/gdrive/MyDrive/pweights.hdf5", monitor='val_loss', save_best_only=True, save_weights_only=True)

PricePredictionNetwork.fit(
                x= [tensors['input_ids'],
                    tensors['token_type_ids'],
                    tensors['attention_mask'],],
                y=ln_p,
                validation_data = (
                    [val_tensors['input_ids'],
                     val_tensors['token_type_ids'],
                     val_tensors['attention_mask']], val_ln_p
                ),
                epochs=5,
                callbacks = [earlystopping, modelcheckpoint,
                             PlotLossesKeras(groups = {'train_loss': ['loss'], 'train_rsq':['r_square'], 'val_loss': ['val_loss'], 'val_rsq': ['val_r_square']})],
                batch_size=16,
                shuffle=True)

# Fine Tuning starting from the Linear Probing Trained Weights

Now we train the whole network, initializing the weights based on the result of the linear probing phase in the previous section.
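
The two-phase recipe can be seen in miniature in the sketch below, which is a toy stand-in and not the notebook's model. A small `tanh` "body" with random weights `W` plays the role of the pre-trained network, and a linear "head" `v` plays the role of the final dense layer. Phase 1 freezes the body and fits only the head (linear probing); phase 2 starts from those weights and updates everything with a small learning rate (fine tuning). The data, dimensions and learning rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 200, 5, 4
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

W = rng.normal(scale=0.5, size=(d, h))  # stand-in for the pre-trained body

def features(X, W):
    return np.tanh(X @ W)

def mse(W, v):
    return np.mean((features(X, W) @ v - y) ** 2)

# Phase 1, linear probing: body frozen, head fitted by ridge regression
F = features(X, W)
v = np.linalg.solve(F.T @ F + 1e-3 * np.eye(h), F.T @ y)
loss_probe = mse(W, v)

# Phase 2, fine tuning: gradient steps on body and head, small learning rate
lr = 1e-3
for _ in range(500):
    F = features(X, W)
    err = F @ v - y
    grad_v = 2 * F.T @ err / n
    grad_W = 2 * X.T @ ((err[:, None] * v[None, :]) * (1 - F ** 2)) / n
    v -= lr * grad_v
    W -= lr * grad_W
loss_ft = mse(W, v)

print('Loss after probing: {:.4f}, after fine tuning: {:.4f}'.format(loss_probe, loss_ft))
```

Starting the second phase from the probed head means the first gradient steps do not have to undo a randomly initialized output layer, which is one motivation for the small learning rate used for fine tuning in this notebook.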

In [None]:
### Now let's prepare our model

from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Dense, Dropout, Concatenate
import tensorflow_addons as tfa
from tensorflow.keras import regularizers

tf.keras.utils.set_random_seed(123)

input_ids = Input(shape=(128,), dtype=tf.int32)
token_type_ids = Input(shape=(128,), dtype=tf.int32)
attention_mask = Input(shape=(128,), dtype=tf.int32)

# # First we compute the text embedding
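# Pass the mask and segment ids by keyword: TFBertModel's positional order is (input_ids, attention_mask, token_type_ids)
Z = bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)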

for layer in bert.layers:
    layer.trainable=True
    for w in layer.weights: w._trainable=True

# # We want the "pooled / summary" embedding, not individual word embeddings
Z = Z[1]

# # Then we do a regularized linear regression
ln_p_hat = Dense(1, activation='linear',
                 kernel_regularizer=regularizers.L2(1e-3))(Z)

PricePredictionNetwork = Model([
                                input_ids,
                                token_type_ids,
                                attention_mask,
                                ], ln_p_hat)
PricePredictionNetwork.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=tfa.metrics.RSquare(),
)
PricePredictionNetwork.summary()

In [None]:
PricePredictionNetwork.load_weights("/content/gdrive/MyDrive/pweights.hdf5")

In [None]:
from livelossplot import PlotLossesKeras

tf.keras.utils.set_random_seed(123)

earlystopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
modelcheckpoint = tf.keras.callbacks.ModelCheckpoint("/content/gdrive/MyDrive/pweights.hdf5", monitor='val_loss', save_best_only=True, save_weights_only=True)

PricePredictionNetwork.fit(
                x= [tensors['input_ids'],
                    tensors['token_type_ids'],
                    tensors['attention_mask'],],
                y=ln_p,
                validation_data = (
                    [val_tensors['input_ids'],
                     val_tensors['token_type_ids'],
                     val_tensors['attention_mask']], val_ln_p
                ),
                epochs=10,
                callbacks = [earlystopping, modelcheckpoint,
                             PlotLossesKeras(groups = {'train_loss': ['loss'], 'train_rsq':['r_square'], 'val_loss': ['val_loss'], 'val_rsq': ['val_r_square']})],
                batch_size=16,
                shuffle=True)

In [None]:
PricePredictionNetwork.load_weights("/content/gdrive/MyDrive/pweights.hdf5")

In [None]:
# Compute predictions
ln_p_hat_holdout = PricePredictionNetwork.predict([
                                                   tensors_holdout['input_ids'],
                                                   tensors_holdout['token_type_ids'],
                                                   tensors_holdout['attention_mask'],
                                                   ])

In [None]:
print('Neural Net R^2, Price Prediction:')
get_r2(holdout['ln_p'], ln_p_hat_holdout)

In [None]:
import matplotlib.pyplot as plt
plt.hist(ln_p_hat_holdout)
plt.show()

Now, let's go one step further and construct a double/debiased machine learning (DML) estimator of the average price elasticity. In particular, we will model market share $q_i$ as
$$\ln q_i = \alpha + \beta \ln p_i + \psi(d_i) + \epsilon_i,$$ where $d_i$ denotes the description of product $i$ and $\psi$ is the composition of text embedding and a linear layer.
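
To see why predicting prices and quantities from text helps estimate $\beta$, here is a short derivation under the additional assumption (not stated above) that $E[\epsilon_i \mid p_i, d_i] = 0$. Taking conditional expectations of the model given the description $d_i$ yields
$$E[\ln q_i \mid d_i] = \alpha + \beta\, E[\ln p_i \mid d_i] + \psi(d_i).$$
Subtracting this from the model removes both $\alpha$ and $\psi(d_i)$:
$$\underbrace{\ln q_i - E[\ln q_i \mid d_i]}_{r^q_i} = \beta\, \underbrace{\left(\ln p_i - E[\ln p_i \mid d_i]\right)}_{r^p_i} + \epsilon_i.$$
Regressing the quantity residual on the price residual therefore identifies the elasticity:
$$\beta = \frac{E[r^p_i\, r^q_i]}{E[(r^p_i)^2]}.$$
The two neural networks play the role of the conditional expectations $E[\ln p_i \mid d_i]$ and $E[\ln q_i \mid d_i]$, and the residuals are computed on the holdout sample.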

In [None]:
## Build the quantity prediction network

tf.keras.utils.set_random_seed(123)

# Initialize new BERT model from original
bert2 = TFBertModel.from_pretrained("bert-base-uncased")

# for layer in bert2.layers:
#     layer.trainable=False
#     for w in layer.weights: w._trainable=False

# Define inputs
input_ids = Input(shape=(128,), dtype=tf.int32)
token_type_ids = Input(shape=(128,), dtype=tf.int32)
attention_mask = Input(shape=(128,), dtype=tf.int32)

# First we compute the text embedding
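# Pass the mask and segment ids by keyword: TFBertModel's positional order is (input_ids, attention_mask, token_type_ids)
Z = bert2(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)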

# We want the "pooled / summary" embedding, not individual word embeddings
Z = Z[1]

ln_q_hat = Dense(1, activation='linear', kernel_regularizer=regularizers.L2(1e-3))(Z)

# Compile model and optimization routine
QuantityPredictionNetwork = Model([
                                   input_ids,
                                   token_type_ids,
                                   attention_mask,
                                   ], ln_q_hat)
QuantityPredictionNetwork.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=tfa.metrics.RSquare(),
)
QuantityPredictionNetwork.summary()

In [None]:
## Fit the quantity prediction network in the main sample
tf.keras.utils.set_random_seed(123)

earlystopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
modelcheckpoint = tf.keras.callbacks.ModelCheckpoint("/content/gdrive/MyDrive/qweights.hdf5", monitor='val_loss', save_best_only=True, save_weights_only=True)

QuantityPredictionNetwork.fit(
                [
                 tensors['input_ids'],
                 tensors['token_type_ids'],
                 tensors['attention_mask'],
                 ],
                ln_q,
                validation_data = (
                    [val_tensors['input_ids'],
                 val_tensors['token_type_ids'],
                 val_tensors['attention_mask']], val_ln_q
                ),
                epochs=10,
                callbacks = [earlystopping, modelcheckpoint,
                             PlotLossesKeras(groups = {'train_loss': ['loss'], 'train_rsq':['r_square'], 'val_loss': ['val_loss'], 'val_rsq': ['val_r_square']})],
                batch_size=16,
                shuffle=True)

In [None]:
QuantityPredictionNetwork.load_weights("/content/gdrive/MyDrive/qweights.hdf5")

In [None]:
## Predict in the holdout sample, residualize and regress

ln_q_hat_holdout = QuantityPredictionNetwork.predict([
                                                      tensors_holdout['input_ids'],
                                                      tensors_holdout['token_type_ids'],
                                                      tensors_holdout['attention_mask'],
                                                      ])

In [None]:
print('Neural Net R^2, Quantity Prediction:')
get_r2(holdout['ln_q'], ln_q_hat_holdout)

In [None]:
# Compute residuals
r_p = holdout["ln_p"] - ln_p_hat_holdout.reshape((-1,))
r_q = holdout["ln_q"] - ln_q_hat_holdout.reshape((-1,))

# Regress to obtain elasticity estimate
beta = np.mean(r_p * r_q) / np.mean(r_p * r_p)

# Standard error on elasticity estimate
se = np.sqrt(np.mean( (r_p* r_q)**2)/(np.mean(r_p*r_p)**2)/holdout["ln_p"].size)

print('Elasticity of Demand with Respect to Price: {}'.format(beta))
print('Standard Error: {}'.format(se))
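
# To check that residualizing behaves as advertised, the sketch below runs the same
# steps on simulated data where the true elasticity is known. Everything here is an
# assumption for illustration: a scalar `z` stands in for what the text reveals about
# the product, a linear fit on a "main" half stands in for the neural networks, and
# the standard error uses the regression residual `r_q - beta * r_p`, a common
# heteroskedasticity-robust variant that differs slightly from the formula in the
# cell above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
true_beta = -1.0  # assumed elasticity for the simulation

z = rng.normal(size=n)  # stand-in for text-derived product quality
ln_p = 0.8 * z + rng.normal(scale=0.5, size=n)
ln_q = true_beta * ln_p + 1.5 * z + rng.normal(scale=0.5, size=n)

# Fit the nuisance predictions on one half, evaluate on the other half
main = np.arange(n) < n // 2
hold = ~main

def fit_predict(x_train, y_train, x_new):
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return intercept + slope * x_new

r_p = ln_p[hold] - fit_predict(z[main], ln_p[main], z[hold])
r_q = ln_q[hold] - fit_predict(z[main], ln_q[main], z[hold])

# Residual-on-residual regression and its standard error
beta = np.mean(r_p * r_q) / np.mean(r_p ** 2)
eps = r_q - beta * r_p
se = np.sqrt(np.mean((r_p * eps) ** 2) / np.mean(r_p ** 2) ** 2 / r_p.size)

# Naive regression that ignores z, for comparison
naive = np.polyfit(ln_p, ln_q, 1)[0]
print('Naive: {:.3f}, DML-style: {:.3f} (SE {:.3f})'.format(naive, beta, se))
```

# In this simulation the naive slope even has the wrong sign, because higher quality
# raises both price and demand, while the residualized estimate is close to the
# assumed elasticity.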

# Heterogeneous Elasticities within Major Product Categories

We now look at the major product categories, those containing many products, and investigate whether the "within group" price elasticities differ across categories.

In [None]:
holdout['category'] = holdout['amazon_category_and_sub_category'].str.split('>').apply(lambda x: x[0])

In [None]:
# Elasticity within the main product categories
sql.run("""
  SELECT category, COUNT(*)
  FROM holdout
  GROUP BY 1
  HAVING COUNT(*)>=100
  ORDER BY 2 desc
""")

In [None]:
main_cats = sql.run("""
  SELECT category
  FROM holdout
  GROUP BY 1
  HAVING COUNT(*)>=100
""")['category']

dfs = []
for cat in main_cats:
    r_p = holdout[holdout['category'] == cat]["ln_p"] - ln_p_hat_holdout.reshape((-1,))[holdout['category'] == cat]
    r_q = holdout[holdout['category'] == cat]["ln_q"] - ln_q_hat_holdout.reshape((-1,))[holdout['category'] == cat]
    # Regress to obtain elasticity estimate
    beta = np.mean(r_p * r_q) / np.mean(r_p * r_p)

    # Standard error on elasticity estimate (scaled by the number of products in this category)
    se = np.sqrt(np.mean( (r_p* r_q)**2)/(np.mean(r_p*r_p)**2)/r_p.size)

    df = pd.DataFrame({'point': beta, 'se': se, 'lower': beta - 1.96 * se, 'upper': beta + 1.96 * se}, index=[0])
    df['category'] = cat
    df['N'] = holdout[holdout['category'] == cat].shape[0]
    dfs.append(df)

df = pd.concat(dfs)
df

## Clustering Products

In this final part of the notebook, we'll illustrate how the BERT text embeddings can be used to cluster products based on their descriptions.

Intuitively, our neural network has now learned which aspects of the text description are relevant to predict prices and market shares.
We can therefore use the embeddings produced by our network to cluster products, and we might expect that the clusters reflect market-relevant information.

In the following block of cells, we compute embeddings using our learned models and cluster them using $k$-means clustering with $k=10$. Finally, we will explore how the estimated price elasticity differs across clusters.

### Overview of **$k$-means clustering**
The $k$-means clustering algorithm seeks to divide $n$ data vectors into $k$ groups, each of which contains points that are "close together."

In particular, let $C_1, \ldots, C_k$ be a partitioning of the data into $k$ disjoint, nonempty subsets (clusters), and define
$$\bar{C_i}=\frac{1}{\#C_i}\sum_{x \in C_i} x$$
to be the *centroid* of the cluster $C_i$. The $k$-means clustering score $\mathrm{sc}(C_1 \ldots C_k)$ is defined to be
$$\mathrm{sc}(C_1 \ldots C_k) = \sum_{i=1}^k \sum_{x \in C_i} \left\lVert x - \bar{C_i}\right\rVert^2.$$

The $k$-means clustering is then defined to be any partitioning $C^*_1 \ldots C^*_k$ that minimizes the score $\mathrm{sc}(-)$.
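
Finding the exact minimizer is hard in general, so in practice one uses an iterative heuristic. The sketch below implements the classic alternating scheme (often called Lloyd's algorithm): assign each point to its nearest centroid, recompute the centroids, and repeat. It is a minimal illustration with a naive initialization (the first $k$ points), not the routine that scikit-learn runs; the function names and the toy data are made up for this example.

```python
import numpy as np

def assign(X, centroids):
    """Label each point with the index of its nearest centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_score(X, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

def lloyd(X, k, n_iter=20):
    centroids = X[:k].astype(float).copy()  # naive init: the first k points
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return assign(X, centroids), centroids

# Two well-separated groups of four points each
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels, centroids = lloyd(X, k=2)
print(labels)                              # [0 0 0 0 1 1 1 1]
print(kmeans_score(X, labels, centroids))  # 4.0
```

Each of the two steps can only lower the score, so the procedure converges, although possibly to a local rather than a global minimum.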


In [None]:
## STEP 1: Compute embeddings

input_ids = Input(shape=(128,), dtype=tf.int32)
token_type_ids = Input(shape=(128,), dtype=tf.int32)
attention_mask = Input(shape=(128,), dtype=tf.int32)

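# Pass the mask and segment ids by keyword: TFBertModel's positional order is (input_ids, attention_mask, token_type_ids)
Y1 = bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[1]
Y2 = bert2(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[1]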
Y = Concatenate()([Y1,Y2])

embedding_model = Model([input_ids, token_type_ids, attention_mask], Y)

embeddings = embedding_model.predict([tensors_holdout['input_ids'],
                                      tensors_holdout['token_type_ids'],
                                      tensors_holdout['attention_mask']])

### Dimension reduction and the **Johnson-Lindenstrauss transform**

Our learned embeddings are high-dimensional (two concatenated 768-dimensional BERT summaries, so 1,536 dimensions in total), and $k$-means clustering is often an expensive operation. To improve the situation, we will use a neat trick that is used extensively in machine learning applications: the *Johnson-Lindenstrauss transform*.

This trick involves finding a low-dimensional linear projection of the embeddings that approximately preserves pairwise distances.

In fact, Johnson and Lindenstrauss proved a much more interesting statement: a Gaussian random matrix will *almost always* approximately preserve pairwise distances.
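
We can check this claim numerically. The sketch below projects random high-dimensional points through a Gaussian matrix scaled by $1/\sqrt{k}$ and compares pairwise distances before and after. The dimensions and the number of points are arbitrary choices for the illustration, and the helper `pairwise_distances` is defined here, not taken from a library.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 500  # number of points, original and reduced dimension

X = rng.normal(size=(n, d))
P = rng.normal(size=(d, k)) / np.sqrt(k)  # Gaussian random projection matrix
X_low = X @ P

def pairwise_distances(A):
    """Euclidean distances between all rows of A."""
    sq = (A ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * A @ A.T
    return np.sqrt(np.maximum(d2, 0.0))

# Ratio of projected to original distance for every pair of distinct points
iu = np.triu_indices(n, k=1)
ratios = pairwise_distances(X_low)[iu] / pairwise_distances(X)[iu]
print('Distance ratios range from {:.3f} to {:.3f}'.format(ratios.min(), ratios.max()))
```

All ratios stay close to one even though the dimension was halved and the projection matrix was drawn without looking at the data.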



In [None]:
# STEP 2 Make low-dimensional projections
from sklearn.random_projection import GaussianRandomProjection

jl = GaussianRandomProjection(eps=.25, random_state=123)  # fix the seed so the projection is reproducible
embeddings_lowdim = jl.fit_transform(embeddings)

In [None]:
# STEP 3 Compute clusters
from sklearn.cluster import KMeans

k_means = KMeans(n_clusters=10, random_state=123)  # fix the seed so cluster labels are reproducible
k_means.fit(embeddings_lowdim)
cluster_ids = k_means.labels_

In [None]:
# STEP 4 Regress within each cluster

betas = np.zeros(10)
ses = np.zeros(10)

r_p = holdout["ln_p"] - ln_p_hat_holdout.reshape((-1,))
r_q = holdout["ln_q"] - ln_q_hat_holdout.reshape((-1,))

for c in range(10):

  r_p_c = r_p[cluster_ids == c]
  r_q_c = r_q[cluster_ids == c]

  # Regress to obtain elasticity estimate
  betas[c] = np.mean(r_p_c * r_q_c) / np.mean(r_p_c * r_p_c)

  # Standard error on elasticity estimate
  ses[c] = np.sqrt(np.mean( (r_p_c * r_q_c)**2)/(np.mean(r_p_c*r_p_c)**2)/r_p_c.size)

In [None]:
# STEP 5 Plot
from matplotlib import pyplot as plt

plt.bar(range(10), betas, yerr = 1.96 * ses)