# Tutorial: Predictions with CultureBERT

CultureBERT was trained on 1,400 employee reviews to measure corporate culture. More specifically, it predicts corporate culture along the four culture dimensions of the Competing Values Framework (CVF): clan, adhocracy, market, and hierarchy.

Please cite: 
Koch, Sebastian; Pasch, Stefan (2022): CultureBERT: Fine-Tuning Transformer Based Language Models for Corporate Culture. Available online at http://arxiv.org/abs/2212.00509.

### Install Transformers

Make sure your runtime uses a GPU (in Colab: Runtime > Change runtime type).

In [None]:
!pip install transformers
!pip install datasets
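
A quick way to confirm that a GPU is visible is to ask PyTorch, the backend the models in this tutorial run on. The helper name `gpu_available` is made up for this sketch, and it simply returns `False` when PyTorch is not installed.

```python
def gpu_available() -> bool:
    """Return True if PyTorch can see a CUDA GPU (hypothetical helper)."""
    try:
        import torch  # assumed to be installed, as it is on Colab
    except ImportError:
        return False
    return torch.cuda.is_available()

print("GPU available:", gpu_available())
```

If this prints `False` in Colab, the runtime type is probably still set to CPU.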

### Load texts

Your own data can be uploaded by drag and drop into the file browser in Colab (uploaded files are lost when the runtime ends) or read from a mounted Google Drive.

In [48]:
import pandas as pd

## example text
text_input = ["Don't treat your employees like numbers!",
              "Too much red-tape",
              "Very friendly and collaborative work environment"]

## load own texts (uncomment to use)
# my_df = pd.read_csv("/content/my_data.csv", sep=";")
# text_input = my_df["full_review"].to_list()

df = pd.DataFrame({'text': text_input})
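
Real review files often contain missing or blank entries. Missing values (for example `NaN` from pandas) are not strings, and the tokenizer typically raises an error on them. The following is a minimal sketch of a filter that could be applied to `text_input` before building `df`; the helper name `clean_texts` is an assumption and not part of the original notebook.

```python
def clean_texts(texts):
    """Keep only non-empty strings, stripped of surrounding whitespace."""
    cleaned = []
    for text in texts:
        if isinstance(text, str) and text.strip():
            cleaned.append(text.strip())
    return cleaned

raw = ["Too much red-tape", None, "   ", float("nan"), " Great team "]
print(clean_texts(raw))  # ['Too much red-tape', 'Great team']
```

Filtering before the DataFrame is created keeps the texts and the later score columns the same length.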

In [49]:
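### tokenize texts
from datasets import Dataset
from transformers import AutoTokenizer

# Load the tokenizer here so that this cell also runs top to bottom;
# the model cells further down load the same tokenizer again.
tokenizer = AutoTokenizer.from_pretrained("CultureBERT/roberta-large-dominant-culture")

def tokenize_function(examples):
    # Pad or truncate every text to exactly 200 tokens
    return tokenizer(examples["text"], padding="max_length", max_length=200, truncation=True)

text_dataset = Dataset.from_pandas(df)
text_tokenized = text_dataset.map(tokenize_function, batched=True)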
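
With `padding="max_length"`, `max_length=200`, and `truncation=True`, every text ends up as exactly 200 token IDs: longer inputs are cut and shorter ones are filled with a padding ID that the attention mask marks as ignorable. The toy function below only illustrates that shape logic. It is not the tokenizer's implementation, it ignores special tokens, and `pad_id=1` is an assumed placeholder value.

```python
def pad_or_truncate(token_ids, max_length=200, pad_id=1):
    """Toy illustration: force a list of token IDs to a fixed length."""
    clipped = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(clipped)                   # number of pad slots needed
    input_ids = clipped + [pad_id] * n_pad              # padding
    attention_mask = [1] * len(clipped) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([10, 11, 12])
long_ids, long_mask = pad_or_truncate(list(range(500)))
print(len(short_ids), sum(short_mask))  # 200 3
print(len(long_ids), sum(long_mask))    # 200 200
```

Reviews much longer than 200 tokens therefore lose their tail, which may matter if the relevant statement comes late in the text.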



# Predict Dominant Culture

In this section we load the model *roberta-large-dominant-culture*, which predicts which of the four culture dimensions of the Competing Values Framework best fits the text at hand, i.e., its dominant culture.

In [None]:
### load model
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("CultureBERT/roberta-large-dominant-culture")
model = AutoModelForSequenceClassification.from_pretrained("CultureBERT/roberta-large-dominant-culture", num_labels=4)

In [51]:
### make predictions
from transformers import Trainer
from scipy.special import softmax

trainer = Trainer(model=model)

predictions = trainer.predict(test_dataset = text_tokenized)


No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 3
  Batch size = 8


In [52]:
#transform predictions into probabilities
probabilities = softmax(predictions.predictions, axis = 1)

clan_scores = []
adhocracy_scores = []
market_scores = []
hierarchy_scores = []


for prediction in probabilities:
  clan_scores.append(prediction[0])
  adhocracy_scores.append(prediction[1])
  market_scores.append(prediction[2])
  hierarchy_scores.append(prediction[3])
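
`predictions.predictions` holds one row of raw scores (logits) per text. Softmax exponentiates each score and divides by the row sum, so each row becomes a probability distribution over the four cultures. The following is a plain-Python sketch of what `softmax(..., axis=1)` computes for a single row; the logits are made up and the function name `softmax_row` is illustrative.

```python
import math

def softmax_row(logits):
    """Softmax for one row of logits, shifted by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [0.2, -0.1, 1.8, 0.9]  # hypothetical values, not model output
probs = softmax_row(example_logits)
print([round(p, 3) for p in probs])
print(sum(probs))  # approximately 1.0
```

The largest logit always receives the largest probability, so softmax changes the scale of the scores but not their ranking.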

Create a DataFrame and determine the dominant culture

In [53]:
df_dominant_culture = pd.DataFrame(
    {
     'text': text_input,
     'clan': clan_scores,
     'adhocracy': adhocracy_scores,
     'market': market_scores,
     'hierarchy': hierarchy_scores,
    })

df_dominant_culture['dominant_culture'] = df_dominant_culture[['clan','adhocracy', 'market', 'hierarchy']].idxmax(axis=1)

Save the DataFrame

In [40]:
df_dominant_culture.to_csv("/content/my_scores.csv", sep = ";")

In [54]:
### see results
df_dominant_culture

| | text | clan | adhocracy | market | hierarchy | dominant_culture |
|---|---|---|---|---|---|---|
| 0 | Don't treat your employees like numbers! | 0.114001 | 0.096637 | 0.552592 | 0.23677 | market |
| 1 | Too much red-tape | 0.183804 | 0.125862 | 0.290897 | 0.399437 | hierarchy |
| 2 | Very friendly and collaborative work environment | 0.732498 | 0.082554 | 0.084714 | 0.100233 | clan |
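
The `dominant_culture` column is simply the label with the highest probability in each row, which is what `idxmax(axis=1)` returns. The same logic in plain Python, using the probabilities of the first example text from the results above; the function name `dominant_label` is made up for this sketch.

```python
LABELS = ["clan", "adhocracy", "market", "hierarchy"]  # column order used in this tutorial

def dominant_label(probabilities, labels=LABELS):
    """Return the label whose probability is highest."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index]

row = [0.114001, 0.096637, 0.552592, 0.23677]  # "Don't treat your employees like numbers!"
print(dominant_label(row))  # market
```

Note that the winning probability can be well below 0.5, as in the second example text, so the scores are worth inspecting alongside the label.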


# Predict Culture Dimension

In this section we load a model that classifies a text along one CVF dimension as neutral, positive, or negative.

In [None]:
### load model
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

## choose one dimension. Example with Clan Dimension

culture_model ="CultureBERT/roberta-large-clan"
#culture_model ="CultureBERT/roberta-large-adhocracy"
#culture_model ="CultureBERT/roberta-large-market"
#culture_model ="CultureBERT/roberta-large-hierarchy"

tokenizer = AutoTokenizer.from_pretrained(culture_model)
model = AutoModelForSequenceClassification.from_pretrained(culture_model, num_labels=3)
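
To score the texts on all four dimensions rather than one, the four model names above can be kept in a mapping and looped over, repeating the loading and prediction cells for each entry. The sketch below only builds the identifiers; the variable names are illustrative.

```python
DIMENSIONS = ["clan", "adhocracy", "market", "hierarchy"]

# Hugging Face model identifiers as listed in the cell above
culture_models = {dim: f"CultureBERT/roberta-large-{dim}" for dim in DIMENSIONS}

for dim, model_name in culture_models.items():
    # Here one would load `model_name` and run the prediction cells for `dim`
    print(dim, "->", model_name)
```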

In [None]:
### make predictions
from transformers import Trainer
from scipy.special import softmax

trainer = Trainer(model=model)

predictions = trainer.predict(test_dataset = text_tokenized)

In [57]:
#transform predictions into probabilities
probabilities = softmax(predictions.predictions, axis = 1)

neutral_scores = []
positive_scores = []
negative_scores = []



for prediction in probabilities:
  neutral_scores.append(prediction[0])
  positive_scores.append(prediction[1])
  negative_scores.append(prediction[2])


In [58]:
df_culture = pd.DataFrame(
    {
     'text': text_input,
     'neutral': neutral_scores,
     'positive': positive_scores,
     'negative': negative_scores,
    })

df_culture['prediction'] = df_culture[['neutral','positive', 'negative']].idxmax(axis=1)

In [None]:
df_culture.to_csv("/content/single_culture_dimension.csv", sep = ";")

In [59]:
df_culture

| | text | neutral | positive | negative | prediction |
|---|---|---|---|---|---|
| 0 | Don't treat your employees like numbers! | 0.009679 | 0.006638 | 0.983683 | negative |
| 1 | Too much red-tape | 0.925018 | 0.015644 | 0.059339 | neutral |
| 2 | Very friendly and collaborative work environment | 0.007218 | 0.990893 | 0.001889 | positive |
