# PoliticEs 2022 BoW baselines and tutorial
This Google Colab notebook is an example baseline based on Bag-of-Words (BoW) models for the PoliticEs shared task. We show how to load the development dataset and how to train four baseline models, one per trait (gender, profession, ideology_binary and ideology_multiclass), each a logistic regression over a simple BoW representation. We also show how to calculate the final F1-score of each model and how to generate the final submission file.

More information regarding the shared task can be found at: https://codalab.lisn.upsaclay.fr/competitions/1948


In [None]:
# The first step is to import the required libraries
# We rely on Pandas, Numpy and Scikit-learn in order to manage the input data, 
# and train the machine-learning models
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from tqdm import tqdm


In [None]:
# Next, we download the datasets.
# The data is split into two files: one for training and one for testing
# the performance of our models
# (-f avoids an error when the files do not exist yet)
!rm -f development.csv
!rm -f development_test.csv
!wget https://pln.inf.um.es/corpora/politices/development.csv --no-check-certificate
!wget https://pln.inf.um.es/corpora/politices/development_test.csv --no-check-certificate


--2022-02-11 17:30:30--  https://pln.inf.um.es/corpora/politices/development.csv
Resolving pln.inf.um.es (pln.inf.um.es)... 155.54.204.105
Connecting to pln.inf.um.es (pln.inf.um.es)|155.54.204.105|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 1482900 (1.4M) [text/csv]
Saving to: ‘development.csv’


2022-02-11 17:30:31 (1.65 MB/s) - ‘development.csv’ saved [1482900/1482900]

--2022-02-11 17:30:31--  https://pln.inf.um.es/corpora/politices/development_test.csv
Resolving pln.inf.um.es (pln.inf.um.es)... 155.54.204.105
Connecting to pln.inf.um.es (pln.inf.um.es)|155.54.204.105|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 305529 (298K) [text/csv]
Saving to: ‘development_test.csv’


2022-02-11 17:30:32 (548 KB/s) - ‘development_test.csv’ saved [305529/305529]



# Load data

Next, we load the development files. They are organised at the document level: each row is one document (tweet) written by a user, who is identified by the label column. Each user has two demographic traits (gender and profession) and two psychographic traits (binary political ideology and multiclass political ideology).

In [None]:
# Load the training and test files
df_train = pd.read_csv ('development.csv')
df_test = pd.read_csv ('development_test.csv')

# Show the training dataframe
print (df_train)

# Count the documents per user: we can observe that each user has 50
df_train.groupby ('label').size ()

      Unnamed: 0  ...                                              tweet
0          36617  ...  EE UU y China: Los dos grandes pelean, el mund...
1          11991  ...  Sensación Previsible a esta hora: Alegría [POL...
2          40804  ...  No te salves. no te quedes inmóvil al borde de...
3          48101  ...  Al menos 25 militares venezolanos, todos de ba...
4          27627  ...  Rivera que , con Sanchez ,da una mayoría absol...
...          ...  ...                                                ...
4995        6914  ...  Dani Mateo insiste en Catalunya Radio en sus g...
4996       39390  ...  Si.... Una condenada por apología del terroris...
4997       44724  ...  Cuidémonos de la “colaboración y lealtad” de C...
4998       39401  ...  Cuenta [POLITICAL_PARTY] que el Gobierno se ha...
4999       29880  ...  Ese lío de la extrema izquierda con la ley:. D...

[5000 rows x 7 columns]


label
@user10     50
@user105    50
@user110    50
@user117    50
@user12     50
            ..
@user85     50
@user86     50
@user93     50
@user94     50
@user96     50
Length: 100, dtype: int64
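
Because the traits are repeated on every document row, it can be worth checking that each user carries a single combination of traits before aggregating. The following is a minimal sketch of such a check. It runs on a small made-up dataframe (the `toy` rows are illustrative and not taken from the corpus) that reuses the column names of development.csv.

```python
import pandas as pd

# Toy stand-in for development.csv: same column names, made-up rows
toy = pd.DataFrame({
    'label': ['@a', '@a', '@b', '@b'],
    'gender': ['female', 'female', 'male', 'male'],
    'profession': ['politician', 'politician', 'journalist', 'politician'],
    'ideology_binary': ['left', 'left', 'right', 'right'],
    'ideology_multiclass': ['left', 'left', 'moderate_right', 'moderate_right'],
    'tweet': ['t1', 't2', 't3', 't4'],
})

traits = ['gender', 'profession', 'ideology_binary', 'ideology_multiclass']

# Count the distinct trait combinations attached to each user
combos_per_user = toy[['label'] + traits].drop_duplicates().groupby('label').size()

# Users with more than one combination have inconsistent annotations
inconsistent = combos_per_user[combos_per_user > 1].index.tolist()
print(inconsistent)
```

On the real data, the same check would be applied to `df_train` and `df_test` in place of `toy`.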

# Train the baseline
In the next cells, we prepare the training dataset for the author profiling
tasks. We first combine all the documents of each author, and then we train
a Logistic Regression model for each trait.

In [None]:
# Author profiling results must be reported at the author level, but the
# dataframe is organised at the document level. Therefore, we merge all the
# texts from the same user and concatenate them with a custom separator

# From now on, `dataframes` holds both splits, keyed by split name
dataframes = {
  'train': df_train, 
  'test': df_test
}

# NOTE: Reassigning the loop variable `df` would not update the dictionary,
# so we iterate over (key, df) pairs and write back into `dataframes[key]`
for key, df in dataframes.items ():

  # These columns are shared for all documents of each user
  columns_to_group_by_user = ['label', 'gender', 'profession', 'ideology_binary', 'ideology_multiclass']


  # Group the dataframe by user (label)
  group = df.groupby (by = columns_to_group_by_user, dropna = False, observed = True, sort = False)


  # Create a dataframe with one row per user: the group keys become the columns
  df_users = group.size ().index.to_frame (index = False)


  # Temporary list holding one merged record per user
  merged_fields = []


  # We merge the documents with a fancy TQDM progress bar
  pbar = tqdm (df_users.iterrows (), total = df_users.shape[0], desc = "merging users")
    
    
  # Iterate over rows in a fancy way
  for index, row in pbar:
      df_user = df[(df['label'] == row['label'])]
      merged_fields.append ({**row, **{field: ' [SEP] '.join (df_user[field].fillna ('')) for field in ['tweet']}})
    
  # Modify the original variable dataframe
  dataframes[key] = pd.DataFrame (merged_fields)
  


merging users: 100%|██████████| 101/101 [00:00<00:00, 674.74it/s]
merging users: 100%|██████████| 20/20 [00:00<00:00, 932.55it/s]
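
The per-user merge can also be written as a single groupby aggregation. This is a sketch of an alternative formulation, not the code used in the cell above. The `toy` rows are made up, and the sketch assumes that each user has one consistent set of trait values (only `gender` is included to keep it short).

```python
import pandas as pd

# Toy document-level data: two users, two documents each, one missing text
toy = pd.DataFrame({
    'label': ['@a', '@b', '@a', '@b'],
    'gender': ['female', 'male', 'female', 'male'],
    'tweet': ['hola', 'uno', None, 'dos'],
})

merged = (
    toy.assign(tweet=toy['tweet'].fillna(''))                   # missing texts become ''
       .groupby(['label', 'gender'], sort=False, dropna=False)['tweet']
       .agg(' [SEP] '.join)                                     # concatenate per user
       .reset_index()                                           # group keys back to columns
)
print(merged)
```

The result has one row per user, with the trait columns preserved and all texts joined by the ` [SEP] ` separator.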


In [None]:
# Create a TF-IDF vectorizer with scikit-learn. It represents each text as a vector
# of TF-IDF weights (term frequencies scaled by inverse document frequency)
# over the vocabulary
vectorizer = TfidfVectorizer (
  analyzer = 'word',
  min_df = .1,
  max_features = 5000,
  lowercase = True
) 


# Get the TF-IDF values from the training set
X_train = vectorizer.fit_transform (dataframes['train']['tweet'])

# Get the TF-IDF values from the test set
# Note that we apply the TF-IDF learned from the training split 
X_test = vectorizer.transform (dataframes['test']['tweet'])


# We are going to store a baseline per trait
baselines = {}

# As we observed, this task covers four traits: two demographic and two psychographic. Therefore, we
# train a separate model for each trait
for label in ['gender', 'profession', 'ideology_binary', 'ideology_multiclass']:

  # Get a baseline classifier
  baselines[label] = LogisticRegression ()


  # Train the baseline for this label
  baselines[label].fit (X_train, dataframes['train'][label])



# Evaluation of the baseline
With the models trained, we now score their predictions. Note that this is possible because the development_test.csv file includes the official test labels.

In [None]:
# Validate the results
# As we observed, this task covers four traits: two demographic and two psychographic. Therefore, we
# evaluate the model of each trait separately
# Note that we can do this only because we know the labels of the test set
for label in ['gender', 'profession', 'ideology_binary', 'ideology_multiclass']:

  # Get the predictions
  y_pred = baselines[label].predict (X_test)

  # Then the results are printed
  print (label)
  print (classification_report (dataframes['test'][label], y_pred, zero_division = 0, digits = 6))



gender
              precision    recall  f1-score   support

      female   0.166667  0.250000  0.200000         4
        male   0.785714  0.687500  0.733333        16

    accuracy                       0.600000        20
   macro avg   0.476190  0.468750  0.466667        20
weighted avg   0.661905  0.600000  0.626667        20

profession
              precision    recall  f1-score   support

  journalist   0.000000  0.000000  0.000000         6
  politician   0.700000  1.000000  0.823529        14

    accuracy                       0.700000        20
   macro avg   0.350000  0.500000  0.411765        20
weighted avg   0.490000  0.700000  0.576471        20

ideology_binary
              precision    recall  f1-score   support

        left   0.357143  1.000000  0.526316         5
       right   1.000000  0.400000  0.571429        15

    accuracy                       0.550000        20
   macro avg   0.678571  0.700000  0.548872        20
weighted avg   0.839286  0.550000  0.560

In the following cell, the final F1-score is calculated as the unweighted mean of the four per-trait macro-F1 scores (macro-f1-gender, macro-f1-profession, macro-f1-ideology_binary and macro-f1-ideology_multiclass).

In [None]:
f1_scores = {}

# Next, we are going to calculate the total result
for label in ['gender', 'profession', 'ideology_binary', 'ideology_multiclass']:

  # Get the predictions
  y_pred = baselines[label].predict (X_test)

  f1_scores[label] = f1_score(dataframes['test'][label], y_pred, average='macro')

f1_scores = list (f1_scores.values ())

print ("Your final F1-score is {f1}".format (f1 = sum(f1_scores) / len(f1_scores)))


Your final F1-score is 0.4237901739643226
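
To make the metric concrete, the macro-F1 for gender can be recomputed by hand. The confusion counts below are not printed by the notebook. They are reconstructed from the precision, recall and support columns of the gender report shown earlier, so treat them as a derived assumption. The helper name `f1` is only illustrative.

```python
def f1(tp, fp, fn):
    # F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Counts reconstructed from the gender report (not printed by the notebook):
# female: support 4, recall 0.25    -> TP=1,  FN=3; precision 1/6   -> FP=5
# male:   support 16, recall 0.6875 -> TP=11, FN=5; precision 11/14 -> FP=3
per_class = {
    'female': f1(tp=1, fp=5, fn=3),
    'male': f1(tp=11, fp=3, fn=5),
}

# Macro-F1 is the unweighted mean of the per-class F1 scores
macro_f1 = sum(per_class.values()) / len(per_class)
print(per_class, round(macro_f1, 6))
```

The values match the f1-score column of the report (0.200000 and 0.733333) and its macro avg row (0.466667). The final score then averages the four per-trait macro-F1 values in the same unweighted way.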


# Generation of the submission file
Finally, an output file is generated with the predictions in the format required for submission to CodaLab.

In [None]:
# Now we generate the output for the CodaLab submission page
# The output rows follow the same order as the test file, so
# we do not need to keep any index or ID
output_df = pd.DataFrame ()
output_df['user'] = dataframes['test']['label']

# Generate the output
for label in ['gender', 'profession', 'ideology_binary', 'ideology_multiclass']:
  output_df[label] = baselines[label].predict (X_test)

print (output_df)
output_df.to_csv ('results.csv', index = False)

        user  gender  profession ideology_binary ideology_multiclass
0   @user106  female  politician            left       moderate_left
1   @user180  female  politician            left       moderate_left
2   @user226    male  politician            left      moderate_right
3    @user23  female  politician            left       moderate_left
4   @user237    male  politician           right      moderate_right
5   @user250  female  politician            left       moderate_left
6   @user280    male  politician            left       moderate_left
7   @user295    male  politician            left       moderate_left
8   @user332    male  politician           right      moderate_right
9   @user334    male  politician            left      moderate_right
10  @user350  female  politician            left       moderate_left
11  @user361    male  politician           right      moderate_right
12  @user406  female  politician            left       moderate_left
13   @user42    male  politician  
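
Before uploading, a quick sanity check of the file can catch missing columns or empty predictions. The sketch below uses only the standard library. `check_submission` is a hypothetical helper, the expected column names are those written by the cell above, and the checks are an assumption about what is useful rather than a description of what CodaLab enforces. The second sample row is made up to trigger a failure.

```python
import csv
import io

EXPECTED_COLUMNS = ['user', 'gender', 'profession', 'ideology_binary', 'ideology_multiclass']

def check_submission(handle, n_users):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return ['unexpected columns: {}'.format(reader.fieldnames)]
    problems = []
    rows = list(reader)
    if len(rows) != n_users:
        problems.append('expected {} rows, found {}'.format(n_users, len(rows)))
    for i, row in enumerate(rows):
        # Flag rows where any prediction (or the user) is missing or blank
        if any(not (row[col] or '').strip() for col in EXPECTED_COLUMNS):
            problems.append('row {} has an empty field'.format(i))
    return problems

sample = io.StringIO(
    'user,gender,profession,ideology_binary,ideology_multiclass\n'
    '@user106,female,politician,left,moderate_left\n'
    '@user180,female,politician,left,\n'  # made-up row with a missing prediction
)
problems = check_submission(sample, n_users=2)
print(problems)
```

For the file written above, the call would be `check_submission (open ('results.csv', newline = ''), n_users = 20)`, since the test split contains 20 users.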