[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/Stefan-Pasch/CultureBERT/blob/main/Tutorial_CultureBERT.ipynb)

# Tutorial: Predict Corporate Culture with CultureBERT

CultureBERT was trained on 1,400 employee reviews to measure corporate culture. More specifically, it predicts corporate culture based on the four culture dimensions of the Competing Values Framework: clan, adhocracy, market, and hierarchy.

Please cite: 
Koch, Sebastian; Pasch, Stefan (2022): CultureBERT: Fine-Tuning Transformer-Based Language Models for Corporate Culture. Available online at http://arxiv.org/abs/2212.00509.


The models are available on the Hugging Face model hub: https://huggingface.co/CultureBERT.

### Install Transformers

Make sure your runtime uses a GPU (in Colab: Runtime > Change runtime type).

In [None]:
!pip install transformers
!pip install datasets
!pip install accelerate -U
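
To confirm that the runtime actually sees a GPU before loading a model, a minimal check along the following lines may help. It assumes PyTorch is the backend (it is typically preinstalled on Colab); the helper name `describe_device` is hypothetical and not part of the original notebook.

```python
def describe_device():
    """Return 'cuda' if PyTorch sees a GPU, 'cpu' if not, or a note if PyTorch is missing."""
    try:
        import torch  # assumption: transformers runs on the PyTorch backend here
    except ImportError:
        return "torch not installed"
    return "cuda" if torch.cuda.is_available() else "cpu"

print(describe_device())
```

If this prints `cpu`, the models should still run, but prediction on larger text collections is likely to be slow.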

### Load Texts 

Your own texts can be loaded into Colab's file section by drag and drop (files are lost when the runtime ends) or read from a mounted Google Drive.

In [None]:
import pandas as pd

## example text
text_input = ["Don't treat your employees like numbers!",
              "Too much red-tape",
              "Very friendly and collaborative work environment"]

## load your own texts (uncomment and adapt the file path and column name)
# my_df = pd.read_csv("/content/MY_FILE.csv", sep=";")
# text_input = my_df["MY_TEXT_COLUMN"].to_list()

df = pd.DataFrame({'text': text_input})
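
The commented-out loading code above expects a semicolon-separated file with a header row. As a rough illustration of that layout, the sketch below parses a small in-memory sample with the standard library only. The column `review_id` and the sample rows are made up for illustration, and `text_input_example` is a hypothetical name chosen so that it does not overwrite `text_input`.

```python
import csv
import io

# Hypothetical file content: one header row, fields separated by ";"
sample_csv = (
    "review_id;MY_TEXT_COLUMN\n"
    "1;Too much red-tape\n"
    "2;Very friendly and collaborative work environment\n"
)

# Read the rows and keep only the text column, one string per review
reader = csv.DictReader(io.StringIO(sample_csv), delimiter=";")
text_input_example = [row["MY_TEXT_COLUMN"] for row in reader]
print(text_input_example)
```

A file laid out like `sample_csv` should also be readable with `pd.read_csv(..., sep=";")` as in the cell above.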

# Predict Dominant Culture

In this section, we load the model *roberta-large-dominant-culture*, which predicts which of the four culture dimensions of the Competing Values Framework best fits the text at hand, i.e., the dominant culture.

In [None]:
### load model
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("CultureBERT/roberta-large-dominant-culture")
model = AutoModelForSequenceClassification.from_pretrained("CultureBERT/roberta-large-dominant-culture", num_labels=4)

In [None]:
### tokenize texts
from datasets import Dataset

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",  max_length=200, truncation=True)

text_dataset = Dataset.from_pandas(df)
text_tokenized = text_dataset.map(tokenize_function, batched=True)
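
The tokenizer call above pads short texts and truncates long ones so that every example has exactly `max_length` tokens. The following is a simplified sketch of that idea in plain Python, not the tokenizer's actual implementation. The token ids and the `pad_id` value are placeholders, and the helper `pad_or_truncate` is hypothetical. Unlike a real tokenizer, it does not preserve special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop tokens beyond the limit
    n_pad = max_length - len(kept)                   # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad              # padding: fill up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask

# Made-up token ids; max_length=8 instead of 200 to keep the output short
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=8, pad_id=1)
long_ids, long_mask = pad_or_truncate(list(range(100, 112)), max_length=8, pad_id=1)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=200`, reviews longer than 200 tokens lose their tail, so very long texts may need to be split before classification.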

In [None]:
### make predictions
from transformers import Trainer
from scipy.special import softmax

trainer = Trainer(model=model)

predictions = trainer.predict(test_dataset = text_tokenized)


In [None]:
#transform predictions into probabilities
probabilities = softmax(predictions.predictions, axis = 1)

clan_scores = []
adhocracy_scores = []
market_scores = []
hierarchy_scores = []


for prediction in probabilities:
  clan_scores.append(prediction[0])
  adhocracy_scores.append(prediction[1])
  market_scores.append(prediction[2])
  hierarchy_scores.append(prediction[3])
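
The model returns raw scores (logits), and the softmax step above turns each row into probabilities that sum to 1. The sketch below shows the same computation for a single row in plain Python. The logit values are made up, and `softmax_row` is a hypothetical helper, not part of the original notebook.

```python
import math

def softmax_row(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)                           # subtracting the max avoids overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 0.5, -1.0, 0.0]  # made-up values, one per culture dimension
example_probs = softmax_row(example_logits)
print([round(p, 3) for p in example_probs])
print(sum(example_probs))
```

The largest logit always receives the largest probability, so softmax changes the scale of the scores but not their ranking.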

Create data frame and determine dominant culture.

In [None]:
df_dominant_culture = pd.DataFrame(
    {
     'text': text_input,
     'clan': clan_scores,
     'adhocracy': adhocracy_scores,
     'market': market_scores,
     'hierarchy': hierarchy_scores,
    })

df_dominant_culture['dominant_culture'] = df_dominant_culture[['clan','adhocracy', 'market', 'hierarchy']].idxmax(axis=1)
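
The `idxmax(axis=1)` call above picks, for each row, the name of the column with the highest probability. A minimal plain-Python sketch of the same selection for one row is shown below. The helper `dominant_label` is hypothetical, and the label order is assumed to match the column order used in this notebook.

```python
# Assumed label order: same as the score columns in this notebook
labels = ["clan", "adhocracy", "market", "hierarchy"]

def dominant_label(probabilities, labels):
    """Return the label whose probability is highest (first one wins on ties)."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index]

# Probabilities for "Too much red-tape", taken from the results table in this tutorial
row = [0.183804, 0.125862, 0.290897, 0.399437]
print(dominant_label(row, labels))  # hierarchy
```

Note that the winning probability here is only about 0.40, so the dominant label can rest on a fairly narrow margin. Inspecting the individual scores alongside `dominant_culture` is therefore worthwhile.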

Store data frame.

In [None]:
df_dominant_culture.to_csv("/content/my_scores.csv", sep = ";")

In [None]:
### see results
df_dominant_culture

|   | text | clan | adhocracy | market | hierarchy | dominant_culture |
|---|---|---|---|---|---|---|
| 0 | Don't treat your employees like numbers! | 0.114001 | 0.096637 | 0.552592 | 0.23677 | market |
| 1 | Too much red-tape | 0.183804 | 0.125862 | 0.290897 | 0.399437 | hierarchy |
| 2 | Very friendly and collaborative work environment | 0.732498 | 0.082554 | 0.084714 | 0.100233 | clan |


# Predict Culture Dimension

In this section, we load a model that classifies text with respect to one of the four culture dimensions of the Competing Values Framework. More specifically, the model determines whether a given text is in line with the culture dimension of interest (positive), contradicts it (negative), or does not allow any inference about it (neutral). As an example, we load the model *roberta-large-clan* to classify text with respect to the culture dimension Clan (a collaborative corporate culture).

In [None]:
### load model
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

## Choose the culture dimension of interest. The following example is for the culture dimension Clan (collaborative culture).

culture_model ="CultureBERT/roberta-large-clan"
#culture_model ="CultureBERT/roberta-large-adhocracy"
#culture_model ="CultureBERT/roberta-large-market"
#culture_model ="CultureBERT/roberta-large-hierarchy"

tokenizer = AutoTokenizer.from_pretrained(culture_model)
model = AutoModelForSequenceClassification.from_pretrained(culture_model, num_labels=3)

In [None]:
### tokenize texts
from datasets import Dataset

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",  max_length=200, truncation=True)

text_dataset = Dataset.from_pandas(df)
text_tokenized = text_dataset.map(tokenize_function, batched=True)

In [None]:
### make predictions
from transformers import Trainer
from scipy.special import softmax

trainer = Trainer(model=model)

predictions = trainer.predict(test_dataset = text_tokenized)

In [None]:
#transform predictions into probabilities
probabilities = softmax(predictions.predictions, axis = 1)

neutral_scores = []
positive_scores = []
negative_scores = []



for prediction in probabilities:
  neutral_scores.append(prediction[0])
  positive_scores.append(prediction[1])
  negative_scores.append(prediction[2])


In [None]:
df_culture = pd.DataFrame(
    {
     'text': text_input,
     'neutral': neutral_scores,
     'positive': positive_scores,
     'negative': negative_scores,
    })

df_culture['prediction'] = df_culture[['neutral','positive', 'negative']].idxmax(axis=1)

In [None]:
df_culture.to_csv("/content/single_culture_dimension.csv", sep = ";")

In [None]:
df_culture

|   | text | neutral | positive | negative | prediction |
|---|---|---|---|---|---|
| 0 | Don't treat your employees like numbers! | 0.009679 | 0.006638 | 0.983683 | negative |
| 1 | Too much red-tape | 0.925018 | 0.015644 | 0.059339 | neutral |
| 2 | Very friendly and collaborative work environment | 0.007218 | 0.990893 | 0.001889 | positive |
