# Notebook for labeling articles with the ranking of related Google searches (extracted from Google Trends)



**Preface:** The classification can be accelerated with a GPU. On Google Colab with a T4 runtime, processing is faster by a factor of 10 compared to a 2020 Mac with a 6-core Intel CPU.
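
To let the same notebook run with or without a GPU, the `device` argument of `pipeline` can be chosen at runtime. The following is a minimal sketch, assuming the PyTorch backend is in use; the helper `pick_device` is hypothetical and not part of this notebook. In `transformers` pipelines, `device=0` addresses the first GPU and `device=-1` the CPU.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) if PyTorch reports one, otherwise -1 (CPU)."""
    try:
        import torch  # assumption: PyTorch backend; the notebook itself does not import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(f"pipeline device: {device}")
```

The result could then be passed as `device=device` when the pipeline is created.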

<details>
<summary>
Click here for considerations on Google Colab…
</summary>

**1. Loading the data**

Upload your source files to your Google Drive and replace the code in section "## 1. Load data for labeling process" with the following code:

>```python
>from google.colab import drive
>drive.mount("/content/drive")
>
>file_path_features = 'drive/My Drive/Colab Notebooks/data_features.csv'
>file_path_labels = 'drive/My Drive/Colab Notebooks/related_queries.csv'
>
>df = pd.read_csv(file_path_features)
>df_labels = pd.read_csv(file_path_labels)
>```

**2. Loading the model**

When Google Colab tries to create the pipeline with a model from huggingface.co, it will likely fail with `OSError: You are trying to access a gated repo.`
You can still access the model by supplying your Hugging Face user access token. [How to create a user access token for huggingface](https://huggingface.co/docs/hub/en/security-tokens).
Add the token to the cell that loads the model from Hugging Face. Once the model has been loaded for the first time, remove the token from the notebook before you save it. As long as the Colab notebook stays connected to the runtime environment, the model remains available. If you lose the connection for some reason, you need the token again.

Replace the code cell "# Initialisation of the pipeline" with the following code if you want to use Google Colab:

>```python
>import os
>api_token = "..." # remove the token once the pipeline has been initialised successfully
>os.environ["HF_TOKEN"] = api_token # the environment variable has to be set in addition to the pipeline parameter
>classifier = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli", use_auth_token=api_token, device=0)
>```
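
To avoid pasting the token into a cell at all, it can be read from the environment or typed at a hidden prompt. This is a minimal sketch; the helper `resolve_hf_token` is hypothetical, while `HF_TOKEN` is the environment variable already used above.

```python
import os
from getpass import getpass


def resolve_hf_token(env=os.environ, prompt=getpass):
    """Return the token from HF_TOKEN if it is set, otherwise ask for it interactively."""
    token = env.get("HF_TOKEN")
    if not token:
        # getpass does not echo the input, so the token never shows up in the cell output
        token = prompt("Hugging Face token: ")
    return token


# Demonstration with a stand-in environment and prompt (no real token involved)
print(resolve_hf_token(env={"HF_TOKEN": "hf_dummy"}))
print(resolve_hf_token(env={}, prompt=lambda _: "typed_dummy"))
```

In the notebook, `api_token = resolve_hf_token()` would replace the hard-coded string, so nothing has to be deleted before saving.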

</details>

**Main Chapters:**

**1.** Load data for labeling process

**2.** Enrich page_ids with Google score

**3.** Postprocessing analysis

In [1]:
from transformers import pipeline
import pandas as pd
from tqdm import tqdm

2024-04-25 15:03:52.201811: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## 1. Load data for labeling process

In [15]:
file_path_features = '../data/data_features.csv'
file_path_labels = '../data/related_queries.csv'

df = pd.read_csv(file_path_features)
df_labels = pd.read_csv(file_path_labels)

In [3]:
df.shape

(6815, 54)

## 2. Enrich page_ids with Google score

### 2.1 Functions for selecting score with highest probability and looping over dataframe

In [4]:
def get_predictions_score(prediction):
    """
    Description:
        Return the label and score with the highest prediction probability
        from the output of a classification pipeline
    Args:
        prediction (dict): output of the classification pipeline for one text,
            containing the lists 'labels' and 'scores'

    Returns:
        max_label (str): label with highest probability
        max_probability (float): highest probability
    """
    pred_labels = prediction['labels']
    pred_scores = prediction['scores']

    # Find the index of the label with the highest probability
    max_index = pred_scores.index(max(pred_scores))

    # Extract the label and its corresponding probability
    max_label = pred_labels[max_index]
    max_probability = pred_scores[max_index]

    return max_label, max_probability
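
For a single input text, the zero-shot pipeline returns a dictionary with the keys `sequence`, `labels` and `scores`, which is the structure `get_predictions_score` expects. The sketch below uses a hand-written dictionary (labels and scores are invented for illustration) and a compact, hypothetical equivalent named `top_label`.

```python
def top_label(prediction):
    """Compact equivalent of get_predictions_score: pick the label with the highest score."""
    label, score = max(
        zip(prediction["labels"], prediction["scores"]),
        key=lambda pair: pair[1],
    )
    return label, score


# Hand-written stand-in for a pipeline result; the values are illustrative only
example_prediction = {
    "sequence": "Wallbox für das E-Auto installieren",
    "labels": ["wallbox förderung", "e auto laden", "solaranlage kosten"],
    "scores": [0.71, 0.24, 0.05],
}

print(top_label(example_prediction))  # ('wallbox förderung', 0.71)
```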

In [13]:
def trends_classify(product, df_labels, df_gscore, classifier, limititer=0):
    """Label the articles of one product category with the query of highest prediction probability
    Args:
        product (str): value of classification_product used for filtering both dataframes
        df_labels (dataframe): dataframe df_labels containing all labels, to be filtered by product.
        df_gscore (dataframe): dataframe df_gscore containing all articles with text to be classified, to be filtered by product.
        classifier (pipeline): classifier for labelling.
        limititer (int): Defaults to 0 (no limit). If a positive int is given, only the first limititer rows are classified
    Returns:
        df_gscore_iter: filtered copy of df_gscore with two new columns
            (predicted_query_label and predicted_probability), one labeling result per row
    """
    df_labels_per_category = df_labels[df_labels['classification_product'] == product]
    candidate_labels = df_labels_per_category['query'].astype(str).tolist()

    df_gscore_iter = df_gscore[df_gscore['classification_product'] == product]

    # shorten df to only return limited rows per function call
    if 0 < limititer < len(df_gscore_iter):
        df_gscore_iter = df_gscore_iter.iloc[0:limititer]

    # work on a copy so that adding columns does not raise a SettingWithCopyWarning
    df_gscore_iter = df_gscore_iter.copy()

    tqdm.pandas(desc=f"Google search related keyword classification for {product}")
    # apply is used for runs on Google Colab with a T4 or another GPU;
    # use progress_apply instead of apply to get a progress bar per row
    df_gscore_iter['predicted_query_label'], df_gscore_iter['predicted_probability'] = zip(*df_gscore_iter['text_to_classify'].apply(lambda x: get_predictions_score(classifier(x, candidate_labels))))

    return df_gscore_iter
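
Because the model is large, the row-wise labeling logic can be checked first with a stand-in classifier. The following is a minimal sketch under stated assumptions: `fake_classifier` is hypothetical and simply scores each candidate label by the number of words it shares with the text, the column names follow this notebook, and the data is invented.

```python
import pandas as pd


def fake_classifier(text, candidate_labels):
    """Stand-in for the zero-shot pipeline: score = shared words, normalised to sum to 1."""
    words = set(text.lower().split())
    raw = [len(words & set(label.lower().split())) + 1e-9 for label in candidate_labels]
    total = sum(raw)
    ranked = sorted(
        zip(candidate_labels, (value / total for value in raw)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "sequence": text,
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


articles = pd.DataFrame({
    "classification_product": ["E-Bike", "E-Bike"],
    "text_to_classify": ["e bike akku laden", "e bike versicherung vergleich"],
})
candidate_labels = ["e bike akku", "e bike versicherung"]


def label_row(text):
    """Return the (label, probability) tuple for one text, as trends_classify does per row."""
    prediction = fake_classifier(text, candidate_labels)
    return prediction["labels"][0], prediction["scores"][0]


# Same pattern as in trends_classify: one tuple per row, unzipped into two new columns
results = articles["text_to_classify"].apply(label_row)
articles["predicted_query_label"], articles["predicted_probability"] = zip(*results)
print(articles[["predicted_query_label", "predicted_probability"]])
```

Passing `fake_classifier` as the `classifier` argument of `trends_classify` should work the same way, since the function only relies on the `labels` and `scores` keys of the result.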

### 2.3 Initialise the pipeline

- The selected pipeline uses the pretrained multilingual classification model "joeddav/xlm-roberta-large-xnli". It is downloaded to a local cache folder on the machine where this notebook is run.


In [6]:
# Initialisation of the pipeline
classifier = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli", device=0)

2024-04-25 15:04:09.373541: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-04-25 15:04:09.373575: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
All model checkpoint layers were used when initializing TFXLMRobertaForSequenceClassification.

All the layers of TFXLMRobertaForSequenceClassification were initialized from the model checkpoint at joeddav/xlm-roberta-large-xnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaForSequenceClassification for predictions without further training.


### 2.4 Prepare the dataset

In [16]:
# Prepare the dataset
relevant_columns = ['page_id', 'classification_product', 'abstract', 'meta_description', 'meta_title' ]
df_gscore = df[relevant_columns].copy()
df_gscore['text_to_classify'] = df_gscore['abstract'].fillna('') + ' ' + df_gscore['meta_description'].fillna('') + ' ' + df_gscore['meta_title'].fillna('')
class_product = df.classification_product.unique().tolist()

display(df_gscore.shape)
display(df_gscore.isna().sum())

(6815, 6)

page_id                   0
classification_product    0
abstract                  7
meta_description          0
meta_title                0
text_to_classify          0
dtype: int64
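
The `fillna('')` calls matter because concatenating a string with a missing value yields a missing value, which would leave `text_to_classify` empty for the seven rows without an abstract. A minimal illustration with invented values:

```python
import pandas as pd

demo = pd.DataFrame({
    "abstract": ["Kurzer Abstract", None],
    "meta_title": ["Titel A", "Titel B"],
})

# Without fillna the missing abstract propagates into the combined text
without_fill = demo["abstract"] + " " + demo["meta_title"]

# With fillna the remaining fields are still available for classification
with_fill = demo["abstract"].fillna("") + " " + demo["meta_title"].fillna("")

print(without_fill.isna().tolist())  # [False, True]
print(with_fill.tolist())            # ['Kurzer Abstract Titel A', ' Titel B']
```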

### 2.5 Classify the dataset
- A test run with a small iteration limit (e.g. 1 or 2) is recommended before classifying the full dataset.

In [19]:
iterations = 0  # set to a small number for a test run, e.g. 1 or 2 (0 means unlimited)

# collect the classified rows per product category and concatenate them once at the end
classified_frames = []
for cp in tqdm(class_product):
    print(f"Google search related keyword classification for {cp}")
    classified_frames.append(trends_classify(cp, df_labels, df_gscore, classifier, limititer=iterations))
df_gscore_out = pd.concat(classified_frames, axis=0).reset_index(drop=True)

  0%|          | 0/17 [00:00<?, ?it/s]

Google search related keyword classification for E-Auto


  6%|▌         | 1/17 [00:15<04:01, 15.09s/it]

Google search related keyword classification for Auto


 12%|█▏        | 2/17 [00:30<03:44, 15.00s/it]

Google search related keyword classification for Zubehör


 18%|█▊        | 3/17 [00:34<02:22, 10.19s/it]

Google search related keyword classification for Motorrad


 24%|██▎       | 4/17 [00:49<02:39, 12.25s/it]

Google search related keyword classification for Energie


 29%|██▉       | 5/17 [01:02<02:26, 12.23s/it]

Google search related keyword classification for Verkehr


 35%|███▌      | 6/17 [01:05<01:40,  9.17s/it]

Google search related keyword classification for Wallbox/Laden


 41%|████      | 7/17 [01:20<01:50, 11.05s/it]

Google search related keyword classification for Solaranlagen


 47%|████▋     | 8/17 [01:24<01:19,  8.85s/it]

Google search related keyword classification for E-Bike


 53%|█████▎    | 9/17 [01:39<01:26, 10.78s/it]

Google search related keyword classification for Fahrrad


 59%|█████▉    | 10/17 [01:54<01:24, 12.08s/it]

Google search related keyword classification for E-Scooter


 65%|██████▍   | 11/17 [02:10<01:19, 13.18s/it]

Google search related keyword classification for Solarspeicher


 71%|███████   | 12/17 [02:24<01:07, 13.41s/it]

Google search related keyword classification for Balkonkraftwerk


 76%|███████▋  | 13/17 [02:39<00:55, 13.91s/it]

Google search related keyword classification for Solargenerator


 82%|████████▏ | 14/17 [02:44<00:34, 11.50s/it]

Google search related keyword classification for THG


 88%|████████▊ | 15/17 [02:46<00:16,  8.45s/it]

Google search related keyword classification for Wärmepumpe


 94%|█████████▍| 16/17 [03:01<00:10, 10.53s/it]

Google search related keyword classification for Versicherung


100%|██████████| 17/17 [03:17<00:00, 11.64s/it]


### 2.6 Join score to dataset and export

In [9]:
# the values in df_labels['query'] correspond to predicted_query_label, so rename the columns for the merge
df_labels = df_labels.rename(columns={'query': 'predicted_query_label', 'value': 'query_score'})

In [10]:
# join score in one step
df_gscore_new = df_gscore_out.merge(df_labels, on=['classification_product', 'predicted_query_label'], how='left')
df_gscore_new = df_gscore_new.drop_duplicates(keep='first')

In [11]:
display(df_labels.info())
display(df_gscore_out.shape)
display(df_gscore_new.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351 entries, 0 to 350
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   predicted_query_label   351 non-null    object
 1   query_score             351 non-null    int64 
 2   classification_product  351 non-null    object
dtypes: int64(1), object(2)
memory usage: 8.4+ KB


None

(17, 8)

(17, 9)

In [47]:
df_gscore_new.to_csv('../data/google_trends/data_trends_classified.csv', encoding='utf-8', index=False)

## 3. Postprocessing analysis
- Processing in Google Colab finished successfully after approx. 1 h 30 min.
- The cell output became too large due to extensive use of progress bars.
- The dataset contains 34 duplicates, which have to be removed.

--> add these steps to the previous chapter
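
One plausible source of such duplicates (an assumption, not verified against the data) is a label that occurs more than once per `classification_product` in `df_labels`, because a left merge then emits one output row per match. The sketch below reproduces the effect with invented data and shows how `validate="many_to_one"` would surface it at merge time.

```python
import pandas as pd

articles = pd.DataFrame({
    "page_id": [1, 2],
    "classification_product": ["Auto", "Auto"],
    "predicted_query_label": ["auto leasing", "auto kaufen"],
})
labels = pd.DataFrame({
    "classification_product": ["Auto", "Auto", "Auto"],
    "predicted_query_label": ["auto leasing", "auto leasing", "auto kaufen"],  # duplicated key
    "query_score": [100, 100, 80],
})
keys = ["classification_product", "predicted_query_label"]

# The duplicated label row doubles the matching article row
merged = articles.merge(labels, on=keys, how="left")
print(len(articles), len(merged))  # 2 3

# Deduplicating the label table first keeps the row count stable
merged_clean = articles.merge(
    labels.drop_duplicates(subset=keys), on=keys, how="left", validate="many_to_one"
)
print(len(merged_clean))  # 2

# validate raises a MergeError if the right-hand keys are not unique
try:
    articles.merge(labels, on=keys, how="left", validate="many_to_one")
except pd.errors.MergeError as err:
    print(f"MergeError: {err}")
```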

In [55]:
df_post = pd.read_csv('../data/google_trends/data_trends_classified_orig.csv')

In [56]:
df_post.shape

(6849, 10)

In [57]:
df_post.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6849 entries, 0 to 6848
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              6849 non-null   int64  
 1   page_id                 6849 non-null   int64  
 2   classification_product  6849 non-null   object 
 3   abstract                6842 non-null   object 
 4   meta_description        6849 non-null   object 
 5   meta_title              6849 non-null   object 
 6   text_to_classify        6849 non-null   object 
 7   predicted_query_label   6849 non-null   object 
 8   predicted_probability   6849 non-null   float64
 9   query_score             6849 non-null   int64  
dtypes: float64(1), int64(3), object(6)
memory usage: 535.2+ KB


In [58]:
df_post.drop(['Unnamed: 0'], axis=1, inplace=True)

In [72]:
df_first = df_post.drop_duplicates(keep='first')

In [73]:
df_first.shape

(6815, 9)

In [74]:
df_first.to_csv('../data/google_trends/data_trends_classified.csv', encoding='utf-8', index=False)