## Steps included in this project
- Loading the Dataset
- Pre-processing the raw data
- Getting BERT Pre-trained model and its tokenizer
- Training and evaluation
- Prediction Pipeline

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Loading the dataset
The dataset used here is a personality classification dataset of German texts with metadata.

In [4]:
import pandas as pd
df = pd.read_excel('/content/drive/MyDrive/data/data.xlsx')
df.head()

Unnamed: 0,id,Geschlecht,Alter,Text,Extraversion_Classification,Gewissenhaftigkeit_Classification,EmotionaleStabilitaet_Classification,Offenheit_Classification,Empathie_Classification,Wirksamkeitsueberzeugung_Classification,Optimismus_Classification,Resilienz_Classification,UnternehmerischesKapital_Classification,AgilityMindset_Classification,Machiavellismus_Classification,Narzissmus_Classification,Psychopathie_Classification,ZerstoererischesPotential_Classification
0,15,M,1972,Ich habe meinen Vater bis zu seinem Tode gepfl...,LOW,AVERAGE,HIGH,AVERAGE,LOW,AVERAGE,LOW,AVERAGE,AVERAGE,AVERAGE,HIGH,LOW,AVERAGE,AVERAGE
1,20,F,1976,Ich habe die Aufnahmeprüfung beim Landesgymnas...,AVERAGE,LOW,AVERAGE,HIGH,AVERAGE,AVERAGE,LOW,AVERAGE,AVERAGE,AVERAGE,AVERAGE,AVERAGE,AVERAGE,AVERAGE
2,24,F,1966,Bei meinem Sohn wurde in Kindertagen ADHS diag...,LOW,HIGH,AVERAGE,HIGH,HIGH,AVERAGE,HIGH,AVERAGE,HIGH,AVERAGE,LOW,AVERAGE,AVERAGE,AVERAGE
3,45,F,1966,"Mein größter Erfolg war, dass es mir trotz mei...",AVERAGE,LOW,LOW,LOW,AVERAGE,LOW,AVERAGE,LOW,LOW,LOW,LOW,HIGH,AVERAGE,AVERAGE
4,49,M,1975,Ich habe siebeneinhalb Wohnungen gekauft. Eine...,AVERAGE,LOW,LOW,HIGH,LOW,AVERAGE,AVERAGE,AVERAGE,AVERAGE,AVERAGE,AVERAGE,HIGH,HIGH,HIGH


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980 entries, 0 to 1979
Data columns (total 18 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   id                                        1980 non-null   int64 
 1   Geschlecht                                1980 non-null   object
 2   Alter                                     1980 non-null   int64 
 3   Text                                      1980 non-null   object
 4   Extraversion_Classification               1980 non-null   object
 5   Gewissenhaftigkeit_Classification         1980 non-null   object
 6   EmotionaleStabilitaet_Classification      1980 non-null   object
 7   Offenheit_Classification                  1980 non-null   object
 8   Empathie_Classification                   1980 non-null   object
 9   Wirksamkeitsueberzeugung_Classification   1980 non-null   object
 10  Optimismus_Classification                 1980 n

In [7]:
df['Extraversion_Classification'].value_counts()

AVERAGE    860
LOW        650
HIGH       470
Name: Extraversion_Classification, dtype: int64

# Splitting the dataset
We split the DataFrame into a training set containing a fraction of 0.8 of all rows and a test set containing the rest. We also define **random_state**, which is the seed for the shuffle, so that the results are reproducible.

In [8]:
#df_train = df.sample(frac=0.8, random_state=25)
#df_test = df.drop(df_train.index)
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, random_state=25)

print(f"No. of training examples: {df_train.shape[0]}")
print(f"No. of testing examples: {df_test.shape[0]}")


No. of training examples: 1584
No. of testing examples: 396
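
To illustrate why a fixed seed makes the split reproducible, here is a minimal sketch in plain Python. It is a stand-in for the idea only, not scikit-learn's actual implementation; `seeded_split` is a hypothetical helper introduced for this example.

```python
import random

def seeded_split(n_rows, test_size=0.2, seed=25):
    """Shuffle row indices with a fixed seed and cut off a test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order on every run
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = seeded_split(1980)
print(len(train_idx), len(test_idx))  # 1584 396

# Calling it again with the same seed gives exactly the same split.
assert seeded_split(1980) == (train_idx, test_idx)
```

Changing `seed` gives a different but equally valid split, which is why the seed should be reported alongside any results.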


# Converting our **Extraversion_Classification** into Categorical data
- Map the class labels to integers using a Python dictionary, then convert them into a one-hot encoded matrix using to_categorical.

In [9]:
encoded_dict = {'LOW':0, 'AVERAGE':1, 'HIGH':2}
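# Map each split's own label column (instead of relying on index alignment with df)
df_train['Extraversion_Classification'] = df_train['Extraversion_Classification'].map(encoded_dict)
df_test['Extraversion_Classification'] = df_test['Extraversion_Classification'].map(encoded_dict)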

Importing the to_categorical function from Keras utils:

In [10]:
from tensorflow.keras.utils import  to_categorical

Converting the integer-coded Extraversion_Classification column into one-hot encoded data (a matrix):

In [11]:
y_train = to_categorical(df_train.Extraversion_Classification)
y_test = to_categorical(df_test.Extraversion_Classification)

In [12]:
y_train,y_test

(array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.],
        ...,
        [0., 0., 1.],
        [1., 0., 0.],
        [0., 1., 0.]], dtype=float32), array([[0., 1., 0.],
        [0., 1., 0.],
        [0., 1., 0.],
        ...,
        [0., 1., 0.],
        [1., 0., 0.],
        [0., 1., 0.]], dtype=float32))
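
To make explicit what the one-hot matrix above contains, here is a small sketch in plain Python of the same encoding. `one_hot` is a hypothetical helper written for illustration; it is not how Keras implements to_categorical.

```python
def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # a single 1.0 at the position of the class id
        rows.append(row)
    return rows

encoded_dict = {'LOW': 0, 'AVERAGE': 1, 'HIGH': 2}
labels = [encoded_dict[name] for name in ['LOW', 'AVERAGE', 'HIGH']]
print(one_hot(labels, num_classes=3))
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Each row has exactly one non-zero entry, so the first row of y_train above, `[1., 0., 0.]`, corresponds to the label LOW.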

We have successfully processed our target column, Extraversion_Classification; now it's time to process our input text data using a tokenizer.

# Getting transformers package
- Install the transformers package, then import it.

In [None]:
!pip install transformers
import transformers

Loading Model and Tokenizer from the transformers package 

In [None]:
from transformers import AutoTokenizer,TFBertModel
tokenizer = AutoTokenizer.from_pretrained('bert-base-german-cased')
bert = TFBertModel.from_pretrained('bert-base-german-cased')

We need a tokenizer to split the input text into tokens:
- The **AutoTokenizer** class loads the tokenizer that matches the given model name.
- **TFBertModel** is the pre-trained BERT model for TensorFlow.
- Here we are loading the **bert-base-german-cased** model.

In [15]:
tokenizer.tokenize("ich liebe Deutschland. Es ist ein schönes Land")

['ich',
 'liebe',
 'Deutschland',
 '.',
 'Es',
 'ist',
 'ein',
 'schöne',
 '##s',
 'Land']
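
The split of "schönes" into 'schöne' and '##s' comes from WordPiece tokenization, which greedily matches the longest known piece from the left. The sketch below shows that idea in plain Python. The vocabulary `toy_vocab` is a hypothetical four-entry set made up for this example; the real bert-base-german-cased vocabulary is far larger, and the real tokenizer also handles punctuation and normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"schöne", "##s", "Land", "ein"}  # hypothetical vocabulary
print(wordpiece("schönes", toy_vocab))  # ['schöne', '##s']
print(wordpiece("Land", toy_vocab))     # ['Land']
```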

# Input Data Modeling
Before training, we need to convert the input text into BERT's input format using a tokenizer. Since we have loaded the **bert-base-german-cased** model, we use the matching **bert-base-german-cased** tokenizer.

In [16]:
# Tokenize the input (takes some time)
# using the bert-base-german-cased tokenizer loaded above
x_train = tokenizer(
    text=df_train.Text.tolist(),
    add_special_tokens=True,
    max_length=256,
    truncation=True,
    padding=True, 
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)
x_test = tokenizer(
    text=df_test.Text.tolist(),
    add_special_tokens=True,
    max_length=256,
    truncation=True,
    padding=True, 
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)

The tokenizer takes the parameters below and returns tensors in the format BERT accepts.

- **return_token_type_ids = False**: token_type_ids are not necessary for our training in this case.
- **return_attention_mask = True**: we want to include attention_mask in our input.
- **return_tensors='tf'**: we want the output as TensorFlow tensors.
- **max_length=256**: together with **truncation=True**, this sets the maximum length of each text to 256 tokens; longer texts are truncated.
- **padding=True**: shorter texts are padded to the longest sequence in the batch, which is 256 tokens here, as the tensor shapes in the outputs show.
- **add_special_tokens=True**: the CLS and SEP tokens are added during tokenization.

After tokenization, the tokenizer returns a dictionary-like object containing 'input_ids' and 'attention_mask' as keys for their respective data.
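
As a rough illustration of what truncation, padding and the attention mask mean, here is a sketch in plain Python. `pad_or_truncate` is a hypothetical helper and the token ids are illustrative; the real tokenizer additionally makes sure the special tokens survive truncation, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length or fill it up with pad_id.

    Returns the fixed-length ids and an attention mask that is
    1 for real tokens and 0 for padding.
    """
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

ids, mask = pad_or_truncate([3, 3599, 16621, 4], max_length=6)
print(ids)   # [3, 3599, 16621, 4, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The trailing zeros in the `input_ids` tensor printed further down are this kind of padding, and the attention mask tells BERT to ignore those positions.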

In [17]:
input_ids = x_train['input_ids']
attention_mask = x_train['attention_mask']

In [18]:
x_test['input_ids']

<tf.Tensor: shape=(396, 256), dtype=int32, numpy=
array([[    3,  3599, 16621, ...,     0,     0,     0],
       [    3,    62, 26914, ...,     0,     0,     0],
       [    3,  3599, 16621, ...,     0,     0,     0],
       ...,
       [    3,  1671,   555, ...,     0,     0,     0],
       [    3,    39,   391, ...,     0,     0,     0],
       [    3, 15430,  9651, ...,     0,     0,     0]], dtype=int32)>

# Model Building
Importing necessary libraries

In [19]:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Input, Dense

In [21]:
max_len = 256
input_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
input_mask = Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
embeddings = bert(input_ids,attention_mask = input_mask)[0] 
out = tf.keras.layers.GlobalMaxPool1D()(embeddings)
out = Dense(128, activation='relu')(out)
out = tf.keras.layers.Dropout(0.1)(out)
out = Dense(32,activation = 'relu')(out)
y = Dense(3,activation = 'sigmoid')(out)
sentiment_model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=y)
sentiment_model.layers[2].trainable = True

The BERT layer accepts three input arrays: **input_ids**, **attention_mask**, and **token_type_ids**.
- **input_ids** are the encoded input tokens, and **attention_mask** marks which positions hold real tokens rather than padding.
- **token_type_ids** are needed for sentence-pair tasks such as question answering; in this case we do not pass token_type_ids.
- For the BERT layer we therefore need two input layers: input_ids and attention_mask.
- **embeddings** contains the hidden states of the BERT layer: bert(...)[0] is the last hidden state and bert(...)[1] is the pooler_output.
- On top of the last hidden state we use GlobalMaxPool1D followed by dense layers. These are fully connected layers, not convolutional (CNN) layers, and they yield our output.
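
To show what the pooling step does to the hidden states, here is a sketch in plain Python for a single example. The numbers are made up, and `global_max_pool_1d` is a hypothetical helper that mirrors the idea of Keras' GlobalMaxPool1D rather than its implementation.

```python
def global_max_pool_1d(hidden_states):
    """Reduce [seq_len][hidden_size] to [hidden_size] by taking,
    for each hidden dimension, the maximum over all token positions."""
    return [max(column) for column in zip(*hidden_states)]

# Two token positions, three hidden dimensions (illustrative values).
hidden = [[0.1, -0.5, 0.3],
          [0.7,  0.2, -0.1]]
print(global_max_pool_1d(hidden))  # [0.7, 0.2, 0.3]
```

In the model above the same reduction turns the last hidden state of shape (batch, 256, hidden size) into one fixed-size vector per text, which the dense layers can then consume.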

# Model compilation
Defining learning parameters and compiling the model.

In [22]:
optimizer = Adam(
    learning_rate=5e-05, # this learning rate is for bert model , taken from huggingface website 
    epsilon=1e-08,
    decay=0.01,
    clipnorm=1.0)
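# Set loss and metrics
# The output layer already applies an activation, so the loss receives activations, not raw logits
loss = CategoricalCrossentropy(from_logits=False)
# Plain categorical accuracy; 'balanced_accuracy' is only the display name of the metric
metric = [CategoricalAccuracy('balanced_accuracy')]
#metric = [tf.keras.metrics.BinaryAccuracy(name='accuracy'),tf.keras.metrics.Precision(name='precision'),tf.keras.metrics.Recall(name='recall')]
# Compile the model
sentiment_model.compile(optimizer=optimizer, loss=loss, metrics=metric)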

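**learning_rate = 5e-05**: fine-tuning a pre-trained BERT model calls for a much lower learning rate than training a model from scratch.

**Loss = CategoricalCrossentropy**: since we are passing one-hot encoded data as the target.

**CategoricalAccuracy** (named 'balanced_accuracy' here) reports the share of examples whose highest-scoring class matches the target. It is ordinary accuracy and does not average the accuracy per class.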
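
Categorical cross-entropy compares a one-hot target with a probability distribution over the classes. The sketch below, in plain Python with hypothetical logits, shows how softmax produces such a distribution while independent sigmoids do not, and how the loss is computed for one example. The function names are introduced for this example only.

```python
import math

def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, probs):
    """Loss for one example: minus the log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, probs))

logits = [0.4, 0.6, 0.2]  # hypothetical raw scores for LOW, AVERAGE, HIGH
print(sum(sigmoid(logits)))  # greater than 1: each class is scored independently
print(sum(softmax(logits)))  # 1 up to rounding: a distribution over the classes
print(categorical_crossentropy([0.0, 1.0, 0.0], softmax(logits)))
```

For a single-label problem with three mutually exclusive classes, a softmax output is the usual pairing with this loss.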

In [23]:
sentiment_model.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 256)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 256)]        0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  109081344   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 256,                                         

# Model training
We have the model ready, along with x_train and y_train, so we can train the model.
Training and fine-tuning the BERT model takes time.

In [24]:
x ={'input_ids':x_train['input_ids'],'attention_mask':x_train['attention_mask']}
print(x['input_ids'].shape)
print(x['attention_mask'].shape)
print(y_train.shape)
print(y_train[1])

(1584, 256)
(1584, 256)
(1584, 3)
[0. 1. 0.]


In [27]:
train_history = sentiment_model.fit(
    x ={'input_ids':x_train['input_ids'],'attention_mask':x_train['attention_mask']} ,
    y = y_train,
    validation_data = ({'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']}, y_test),epochs=1)





- **model.fit** returns a history object that records the training history.
- **x_train** and **x_test** are dictionary-like objects containing 'input_ids' and 'attention_mask' after pre-processing. We pass the input_ids and attention_mask of x_train for the training.
- As validation data, we pass the test data.

# Model Evaluation
Testing our model on the test data

In [28]:
x_test = {'input_ids':x_test['input_ids'],'attention_mask':x_test['attention_mask']}

In [29]:
predicted_raw = sentiment_model.predict(x_test)
predicted_raw

array([[0.59371257, 0.63438976, 0.56323344],
       [0.5728156 , 0.58896255, 0.55890733],
       [0.5942178 , 0.635093  , 0.60654104],
       ...,
       [0.588902  , 0.6130559 , 0.57000756],
       [0.5423725 , 0.5897281 , 0.55537117],
       [0.5250233 , 0.62670827, 0.5935005 ]], dtype=float32)

Taking the index of the class with the highest score:

In [30]:
import numpy as np
y_predicted = np.argmax(predicted_raw,axis=1)
y_true = np.array(df_test['Extraversion_Classification'])

In [31]:
y_predicted

array([1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1,
       2, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 2, 1, 2, 1, 1, 1, 1, 1,
       2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 1, 1, 2, 1, 2, 0, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1,
       2, 0, 1, 1, 1, 1, 2, 1, 2, 1, 1, 0, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 2, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2,
       0, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 0, 2, 2, 1,
       1, 0, 1, 1, 1, 1, 1, 2, 0, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1,
       2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 0,
       1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1,
       2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1,
       1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 1, 1,

# Classification Report

In [32]:
from sklearn.metrics import classification_report
print(classification_report(y_true,y_predicted))

              precision    recall  f1-score   support

           0       0.31      0.04      0.07       129
           1       0.48      0.70      0.57       192
           2       0.10      0.13      0.12        75

    accuracy                           0.38       396
   macro avg       0.30      0.29      0.25       396
weighted avg       0.35      0.38      0.32       396
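
To put the report in context, the sketch below recomputes two reference numbers from the support and recall values printed above: the accuracy of always predicting the most frequent class, and balanced accuracy in the usual sense of the mean per-class recall. Only the numbers from the report are used; the variable names are introduced for this example.

```python
# Values copied from the classification report above.
support = {0: 129, 1: 192, 2: 75}
recall = {0: 0.04, 1: 0.70, 2: 0.13}

total = sum(support.values())  # 396 test examples

# Accuracy of a baseline that always predicts the largest class (class 1).
majority_baseline = max(support.values()) / total

# Balanced accuracy: the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall.values()) / len(recall)

print(f"majority baseline: {majority_baseline:.2f}")  # 0.48
print(f"balanced accuracy: {balanced_accuracy:.2f}")  # 0.29
```

The reported accuracy of 0.38 is below the majority baseline of 0.48, and the balanced accuracy of 0.29 is below the 0.33 expected from uniform guessing over three classes, so after one epoch the model has not yet learned a useful signal.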



## Prediction Pipeline
Predicting the class scores for a new input text and printing them next to the class labels:

In [34]:
texts = input('input the text:')
x_val = tokenizer(
    text=texts,
    add_special_tokens=True,
    max_length=256,
    truncation=True,
    padding='max_length', 
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True) 
validation = sentiment_model.predict({'input_ids':x_val['input_ids'],'attention_mask':x_val['attention_mask']})*100
for key , value in zip(encoded_dict.keys(),validation[0]):
    print(key,value)

input the text:Ich habe die Aufnahmeprüfung beim Landesgymnas.
LOW 56.318832
AVERAGE 60.72657
HIGH 56.320847
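
To turn these scores into a single predicted label, the index of the highest score can be mapped back through the inverse of encoded_dict. A minimal sketch in plain Python, using the three scores printed above; `decoded_dict` and `decode_prediction` are hypothetical names introduced for this example.

```python
encoded_dict = {'LOW': 0, 'AVERAGE': 1, 'HIGH': 2}
# Invert the mapping: class index -> label
decoded_dict = {index: label for label, index in encoded_dict.items()}

def decode_prediction(scores):
    """Return the label whose score is highest."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return decoded_dict[best_index]

print(decode_prediction([56.318832, 60.72657, 56.320847]))  # AVERAGE
```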
