# Notebook for labeling articles with the ranking of related Google searches (extracted from Google Trends)

1. Test-run with sample data
2. Interpretation of test-run
3. Classification and export for feature candidates

In [50]:
from transformers import pipeline
import pandas as pd
from tqdm import tqdm

In [51]:
classifier = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")

All model checkpoint layers were used when initializing TFXLMRobertaForSequenceClassification.

All the layers of TFXLMRobertaForSequenceClassification were initialized from the model checkpoint at joeddav/xlm-roberta-large-xnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaForSequenceClassification for predictions without further training.


## 1. Test run with sample data
- Take one example ('E-Auto') from the page_ids in the feature data set and apply the classifier model with labels from the Google Trends related queries for 'E-Auto'.

In [5]:
file_path_features = '../data/data_features.csv'
file_path_labels = '../data/related_queries.csv'

df = pd.read_csv(file_path_features)
df_labels = pd.read_csv(file_path_labels)


In [None]:
df_labels.info()

In [None]:
df_labels.head(20)

In [8]:
# get sample text to be classified from classification_product = 'E-Auto'

df_e_auto = df[df['classification_product'] == 'E-Auto']

sample_abstract = df_e_auto.abstract.iloc[0]
sample_meta_desc = df_e_auto.meta_description.iloc[0]
sample_meta_title = df_e_auto.meta_title.iloc[0]

In [9]:
# get labels for classification from google trends related searches for search term = 'E-Auto'
df_labels_e_auto = df_labels[df_labels['classification_product'] == 'E-Auto']

labels_e_auto = df_labels_e_auto['query'].astype(str).tolist()

In [None]:
df_labels_e_auto['query']

In [11]:
def get_predictions_score_tab(prediction):
    """
    Function to display predictions from the model in a tabular form
    """
    pred_labels = prediction['labels']
    pred_scores = prediction['scores']
    seq = [prediction['sequence']]
    return  pd.concat([
                pd.DataFrame(seq),
                pd.DataFrame(pred_labels),
                pd.DataFrame(pred_scores),
            ], axis=1, ignore_index=True).rename(columns={0:'Sequence',1:'Labels', 2:'Probability'}).set_index(['Sequence'])

In [46]:
sequence1 = sample_abstract
sequence2 = sample_meta_desc
sequence3 = sample_meta_title
sequence_concat = ' '.join([sequence1, sequence2, sequence3])

In [None]:
candidate_labels = labels_e_auto # ["geography",  "delivery"]

In [40]:
pred1 = classifier(sequence1, candidate_labels)
pred2 = classifier(sequence2, candidate_labels)
pred3 = classifier(sequence3, candidate_labels)

In [48]:
pred4 = classifier(sequence_concat, candidate_labels)

In [34]:
get_predictions_score_tab(pred1)

Sequence,Labels,Probability
"Im Grunde kann man sein E-Auto überall laden, wo es Strom gibt. Tatsächlich unterscheiden sich die Preise aber je nach Ladepunkt erheblich. An manchen Stellen kostet das Laden mehr als doppelt so viel, als an anderen.",elektroauto,0.554633
,e auto,0.220633
,e-auto laden,0.074661
,e auto laden,0.046809
,förderung e-auto,0.017415
,e-auto kaufen,0.013685
,e-auto vergleich,0.01308
,e-auto reichweite,0.009276
,e-auto kleinwagen,0.006359
,e-auto ladestation,0.00487


### Label and score result for *Abstract*: 'elektroauto' --> Google related query score (34)

In [41]:
get_predictions_score_tab(pred2)

Sequence,Labels,Probability
Ladestationen für Elektroautos. Kosten und Anbieter im Vergleich,e-auto vergleich,0.376705
,e-auto laden,0.204149
,e-auto ladestation,0.1778
,elektroauto,0.096548
,e auto laden,0.08119
,förderung e-auto,0.024533
,e-auto kaufen,0.007046
,e-auto prämie,0.006781
,e-auto reichweite,0.003179
,e auto,0.003076


### Label and score result for *Meta Description*: 'e-auto vergleich' (15)

In [42]:
get_predictions_score_tab(pred3)

Sequence,Labels,Probability
"Elektrofahrzeug-Ladestation: Anbieter, Kosten, Funktion",e-auto ladestation,0.295233
,e-auto vergleich,0.186813
,e-auto laden,0.165971
,förderung e-auto,0.126846
,elektroauto,0.076566
,e auto laden,0.04446
,e-auto prämie,0.039827
,e-auto reichweite,0.011048
,e-auto kleinwagen,0.009537
,e-auto kaufen,0.004864


### Label and score result for *Meta Title*: 'e-auto ladestation' (27)

In [49]:
get_predictions_score_tab(pred4)

Sequence,Labels,Probability
"Im Grunde kann man sein E-Auto überall laden, wo es Strom gibt. Tatsächlich unterscheiden sich die Preise aber je nach Ladepunkt erheblich. An manchen Stellen kostet das Laden mehr als doppelt so viel, als an anderen. Ladestationen für Elektroautos. Kosten und Anbieter im Vergleich Elektrofahrzeug-Ladestation: Anbieter, Kosten, Funktion",e-auto vergleich,0.534223
,e-auto laden,0.122227
,elektroauto,0.118224
,e-auto ladestation,0.070181
,e auto laden,0.04192
,e auto,0.030883
,förderung e-auto,0.015133
,e-auto reichweite,0.011212
,e-auto prämie,0.011115
,e-auto kaufen,0.007534


### Label and score result for *Abstract + Meta description + Meta Title*: 'e-auto vergleich' (15)

## 2. Test run conclusion
Labeling by abstract, meta_description, and meta_title leads to different results in probability and Google related query score:

Input / Probability / Score
- Abstract / 55% / 34
- Meta Description / 38% / 15
- Meta Title / 30% / 27
- All together / 53% / 15

--> **Outlook / Improvements**: try multi-label classification and custom hypothesis templates
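
The two ideas from the outlook could be sketched as follows. `multi_label` and `hypothesis_template` are arguments of the `transformers` zero-shot pipeline call (older releases named the first one `multi_class`). The German template, the 0.5 threshold, and the helper name `classify_multi_label` are assumptions for illustration only. A stub stands in for the real model so the snippet runs without downloading it.

```python
def classify_multi_label(classifier, text, candidate_labels,
                         template="In diesem Text geht es um {}.",
                         threshold=0.5):
    """Return every (label, score) pair whose independent score reaches the threshold."""
    result = classifier(
        text,
        candidate_labels,
        multi_label=True,              # score each label independently
        hypothesis_template=template,  # assumed German template, not from the notebook
    )
    return [(label, score)
            for label, score in zip(result["labels"], result["scores"])
            if score >= threshold]


def stub_classifier(text, candidate_labels, multi_label=False, hypothesis_template="{}"):
    """Hypothetical stand-in that mimics the output format of the pipeline."""
    scores = [0.9 if label.split()[-1] in text.lower() else 0.1
              for label in candidate_labels]
    ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
    return {"sequence": text,
            "labels": [label for label, _ in ranked],
            "scores": [score for _, score in ranked]}


hits = classify_multi_label(stub_classifier, "E-Auto laden: Kosten im Vergleich",
                            ["e-auto laden", "e-auto vergleich", "e-auto kaufen"])
print(hits)
```

With the real model, `stub_classifier` would be replaced by the `classifier` object created at the top of the notebook; whether a German template improves the scores would have to be checked on the sample data.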

## 3. Enrich page_ids with Google score

In [12]:
# Define the function get_predictions_score
def get_predictions_score(prediction):
    pred_labels = prediction['labels']
    pred_scores = prediction['scores']
    
    # Find the index of the label with the highest probability
    max_index = pred_scores.index(max(pred_scores))
    
    # Extract the label and its corresponding probability
    max_label = pred_labels[max_index]
    max_probability = pred_scores[max_index]
    
    return max_label, max_probability

In [55]:
get_predictions_score(pred4)

('e-auto vergleich', 0.5342233777046204)
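
To attach the Google Trends score of the winning query to a page, the predicted label still has to be looked up in the related-queries table. A minimal sketch of that lookup follows. The column name `value` for the score in `related_queries.csv` is an assumption (the notebook never shows it), and `lookup_trend_score` and the toy frame are hypothetical. The scores 34, 15 and 27 are the ones quoted in the test run above.

```python
import pandas as pd

# Toy stand-in for related_queries.csv; 'value' is an assumed column name.
df_labels_demo = pd.DataFrame({
    "classification_product": ["E-Auto", "E-Auto", "E-Auto"],
    "query": ["elektroauto", "e-auto vergleich", "e-auto ladestation"],
    "value": [34, 15, 27],
})


def lookup_trend_score(df_labels, product, query, score_col="value"):
    """Return the Trends score for (product, query), or None if there is no match."""
    match = df_labels[(df_labels["classification_product"] == product)
                      & (df_labels["query"].astype(str) == query)]
    return None if match.empty else match[score_col].iloc[0]


score = lookup_trend_score(df_labels_demo, "E-Auto", "e-auto vergleich")
print(score)
```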

### Start with 'Versicherung' (only 16 records)


In [53]:
# classification per entity of classification_product
class_product = df.classification_product.unique().tolist()
class_product

['E-Auto',
 'Auto',
 'Zubehör',
 'Motorrad',
 'Energie',
 'Verkehr',
 'Wallbox/Laden',
 'Solaranlagen',
 'E-Bike',
 'Fahrrad',
 'E-Scooter',
 'Solarspeicher',
 'Balkonkraftwerk',
 'Solargenerator',
 'THG',
 'Wärmepumpe',
 'Versicherung']

In [None]:
relevant_columns = ['page_id', 'classification_product', 'abstract', 'meta_description', 'meta_title' ]
df_gscore = df[relevant_columns].copy()
df_gscore['text_to_classify'] = df_gscore['abstract'] + ' ' + df_gscore['meta_description'] + ' ' + df_gscore['meta_title']


df_gscore_out = pd.DataFrame(columns=relevant_columns + ['text_to_classify', 'predicted_query_label', 'predicted_probability'])

df_gscore.shape
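
Concatenating string columns with `+` yields `NaN` for every row in which one of the fields is missing, which is not a usable input for the classifier. One possible guard, assuming missing fields should simply be skipped (the helper name `join_text_fields` is hypothetical):

```python
import pandas as pd


def join_text_fields(df, columns, sep=" "):
    """Join the given text columns row by row, skipping missing values."""
    return df[columns].apply(
        lambda row: sep.join(str(value) for value in row if pd.notna(value)),
        axis=1,
    )


# Toy data: the second row has no abstract
demo = pd.DataFrame({
    "abstract": ["Text A", None],
    "meta_description": ["Desc A", "Desc B"],
    "meta_title": ["Title A", "Title B"],
})
demo["text_to_classify"] = join_text_fields(
    demo, ["abstract", "meta_description", "meta_title"]
)
print(demo["text_to_classify"].tolist())
```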

In [None]:
filter = 'Versicherung'

In [34]:
iter = filter

df_labels_per_category = df_labels[df_labels['classification_product'] == iter]
candidate_labels = df_labels_per_category['query'].astype(str).tolist()

# copy the filtered rows so the new columns are not written to a slice of df_gscore
df_gscore_iter = df_gscore[df_gscore['classification_product'] == iter].copy()

tqdm.pandas(desc=f"Google search related keyword classification for {iter}")
df_gscore_iter['predicted_query_label'], df_gscore_iter['predicted_probability'] = zip(*df_gscore_iter['text_to_classify'].progress_apply(lambda x: get_predictions_score(classifier(x, candidate_labels))))

df_gscore_iter


In [23]:
df_gscore_iter.to_csv('../data/google_trends/data_trends_class_versicherung.csv')

### Solargenerator does not exist


### Solarspeicher

In [45]:
filter = 'Solarspeicher'

In [48]:
iter = filter

df_labels_per_category = df_labels[df_labels['classification_product'] == iter]
candidate_labels = df_labels_per_category['query'].astype(str).tolist()

# copy the filtered rows so the new columns are not written to a slice of df_gscore
df_gscore_iter = df_gscore[df_gscore['classification_product'] == iter].copy()

tqdm.pandas(desc=f"Google search related keyword classification for {iter}")
df_gscore_iter['predicted_query_label'], df_gscore_iter['predicted_probability'] = zip(*df_gscore_iter['text_to_classify'].progress_apply(lambda x: get_predictions_score(classifier(x, candidate_labels))))

df_gscore_iter

Google search related keyword classification for Solarspeicher: 100%|██████████| 80/80 [1:34:22<00:00, 70.79s/it]

Unnamed: 0,page_id,classification_product,abstract,meta_description,meta_title,text_to_classify,predicted_query_label,predicted_probability
103,104438,Solarspeicher,"Betrachtet man sein E-Auto ein wenig anders, k...",SEO Description ändern: E-Auto als Stromspeich...,Elektroauto als Stromspeicher nutzen: So funkt...,"Betrachtet man sein E-Auto ein wenig anders, k...",stromspeicher,0.650090
284,107520,Solarspeicher,Auch wenn das Design des portablen Riesenakkus...,Auch wenn das Design des portablen Riesenakkus...,Batterie auf Rädern versorgt mühelos ein ganze...,Auch wenn das Design des portablen Riesenakkus...,stromspeicher,0.416165
296,107679,Solarspeicher,Mittlerweile gibt es eine immer größer werdend...,Solarspeicher - Testsieger: Das sind die beste...,Solarspeicher im Test 2023/24: Die besten Mode...,Mittlerweile gibt es eine immer größer werdend...,photovoltaik,0.340139
739,1010660,Solarspeicher,Wenn Sie Ihre PV-Anlage mit einem Energiespeic...,Wenn Sie Ihre PV-Anlage mit einem Energiespeic...,Energiespeicher bei Solaranlage: So lange hält...,Wenn Sie Ihre PV-Anlage mit einem Energiespeic...,solar speicher,0.229040
1109,1011212,Solarspeicher,"PV-Anlagen mit Speicher sorgen dafür, dass eig...","PV-Anlagen mit Speicher sorgen dafür, dass eig...","PV-Anlage mit Speicher: Preise, Anbieter, Infos","PV-Anlagen mit Speicher sorgen dafür, dass eig...",solaranlage,0.236906
...,...,...,...,...,...,...,...,...
6011,1017508,Solarspeicher,Eine beeindruckende Entwicklung in Sachen Sola...,Eine beeindruckende Entwicklung in Sachen Sola...,Verdoppelung in nur einem Jahr: Deutsche insta...,Eine beeindruckende Entwicklung in Sachen Sola...,solar speicher,0.462494
6191,1017790,Solarspeicher,In Deutschland markiert das vergangene Jahr ei...,In Deutschland markiert das vergangene Jahr ei...,Meilenstein geknackt: Stromspeicher reichen je...,In Deutschland markiert das vergangene Jahr ei...,solar speicher,0.258887
6248,1017871,Solarspeicher,Umfassende Transparenz in der Branche der Heim...,Umfassende Transparenz in der Branche der Heim...,Die besten Stromspeicher 2024: Deutsche Herste...,Umfassende Transparenz in der Branche der Heim...,stromspeicher,0.711050
6794,1018741,Solarspeicher,Eine neue Speicherlösung für das eigene Balkon...,Zur Markteinführung des VitaPower-Speichers ge...,Neuer Speicher für Balkonkraftwerke kommt: Frü...,Eine neue Speicherlösung für das eigene Balkon...,balkonkraftwerk speicher,0.244425


In [49]:
df_gscore_iter.to_csv('../data/google_trends/data_trends_class_solarspeicher.csv')

### E-Bike


In [57]:
def trends_classify(product, df_labels=df_labels, df_gscore=df_gscore, classifier=classifier):
    """
    Classify all articles of one classification_product against its related Google queries
    """
    # candidate labels: related queries for this product
    df_labels_per_category = df_labels[df_labels['classification_product'] == product]
    candidate_labels = df_labels_per_category['query'].astype(str).tolist()

    # copy the filtered rows so that adding columns does not trigger SettingWithCopyWarning
    df_gscore_iter = df_gscore[df_gscore['classification_product'] == product].copy()

    tqdm.pandas(desc=f"Google search related keyword classification for {product}")
    df_gscore_iter['predicted_query_label'], df_gscore_iter['predicted_probability'] = zip(*df_gscore_iter['text_to_classify'].progress_apply(lambda x: get_predictions_score(classifier(x, candidate_labels))))

    return df_gscore_iter

In [58]:
filter = 'E-Bike'

In [None]:
trends_classify(filter)

In [85]:

# prepare dataset for processing
relevant_columns = ['page_id', 'classification_product', 'abstract', 'meta_description', 'meta_title' ]
df_gscore = df[relevant_columns].copy()
df_gscore['text_to_classify'] = df_gscore['abstract'] + ' ' + df_gscore['meta_description'] + ' ' + df_gscore['meta_title']


# classification per entity of classification_product
class_product = df.classification_product.unique().tolist()


df_gscore_out = pd.DataFrame(columns=relevant_columns + ['text_to_classify', 'predicted_query_label', 'predicted_probability'])

product_filter = 'Versicherung'

for product in tqdm(class_product):
    if product == product_filter:
        tqdm.pandas(desc=f"Google search related keyword classification for {product}")
        # get relevant candidate_labels
        df_labels_per_category = df_labels[df_labels['classification_product'] == product]
        candidate_labels = df_labels_per_category['query'].astype(str).tolist()
        print(product)
        print(candidate_labels)
        # copy the filtered rows so the new columns are not written to a slice of df_gscore
        df_gscore_iter = df_gscore[df_gscore['classification_product'] == product].copy()

        df_gscore_iter['predicted_query_label'], df_gscore_iter['predicted_probability'] = zip(*df_gscore_iter['text_to_classify'].progress_apply(lambda x: get_predictions_score(classifier(x, candidate_labels))))
        # pd.concat expects a list of frames as its first argument
        df_gscore_out = pd.concat([df_gscore_out, df_gscore_iter], ignore_index=True)


  0%|          | 0/17 [00:00<?, ?it/s]

Versicherung
['versicherung kfz', 'versicherung auto', 'allianz', 'allianz versicherung', 'ergo versicherung', 'huk', 'huk versicherung', 'ergo', 'adac versicherung', 'adac', 'vhv versicherung', 'check24 versicherung', 'axa versicherung', 'versicherung kündigen', 'lvm versicherung', 'württembergische versicherung', 'devk versicherung', 'devk', 'württembergische', 'kfz versicherung vergleich', 'wgv versicherung', 'hdi versicherung', 'wgv', 'nürnberger versicherung', 'hdi']




Google search related keyword classification for Versicherung:   0%|          | 5/6815 [04:16<96:54:08, 51.23s/it]
 94%|█████████▍| 16/17 [04:16<00:16, 16.01s/it]


KeyboardInterrupt: 

(6815, 6)
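
Given the long runtimes (about 70 s per record for Solarspeicher) and the interrupted run above, one option is to process one category at a time and skip categories whose CSV already exists, so that an aborted run can be resumed. The sketch below assumes a per-category function such as `trends_classify`; `classify_all`, the slug rule for file names, and the stub are hypothetical, and the file-name pattern only mirrors the exports used earlier in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def classify_all(products, classify_fn, out_dir):
    """Run classify_fn per product, writing one CSV each and skipping finished ones."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for product in products:
        # file-system friendly name, e.g. 'Wallbox/Laden' -> 'wallbox_laden'
        slug = product.lower().replace("/", "_")
        target = out_dir / f"data_trends_class_{slug}.csv"
        if target.exists():
            continue  # already written by an earlier run
        classify_fn(product).to_csv(target)
        processed.append(product)
    return processed


def stub_classify(product):
    """Stand-in for trends_classify so the sketch runs without the model."""
    return pd.DataFrame({"classification_product": [product],
                         "predicted_query_label": ["demo"],
                         "predicted_probability": [1.0]})


with tempfile.TemporaryDirectory() as tmp:
    first = classify_all(["E-Bike", "Wallbox/Laden"], stub_classify, tmp)
    # second call simulates a resumed run: only the new category is processed
    second = classify_all(["E-Bike", "Wallbox/Laden", "THG"], stub_classify, tmp)
print(first, second)
```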