In [1]:
%config Completion.use_jedi = False
%reload_ext autoreload
%autoreload 2

In [2]:
import json

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import JsonOutputParser

from ki_gegen_rechts import llm_chains
from ki_gegen_rechts.analyser import parallel_text_analyser_chains

<a id="toc"></a>
# Table of Contents

- [1. General Information](#cap1)
- [2. Hatespeech Analyser Full Chain](#cap2)
  - [2.1 Define example messages](#cap3) 
  - [2.2 Define LLM settings](#cap4)
  - [2.3 Example: Analyse single message](#cap5)
  - [2.4 View result of example message](#cap6)
  - [2.5 Example: Analyse batch of messages](#cap7)    
- [3. Deconstruct full analyser chain into components](#cap8)
  - [3.1 Hatespeech Detector Chain](#cap9)
  - [3.2 Hatespeech Validator Chain](#cap10)
  - [3.3 Hatespeech Classifier Chain](#cap11)
  - [3.4 Right-wing Rater Chain](#cap12)
  - [3.5 OpenAI Moderator](#cap13)
- [4. Conclusion and next steps](#last) 

# 1. General Information <a id="cap1"></a>

This project explores how OpenAI's GPT-3.5 Turbo and moderation layer can be used to identify hate speech in text messages. Hate speech is a sensitive topic, and automated classification always requires human oversight. One of the main goals is to create a more granular dataset for addressing hate speech and right-wing extremism, which will be valuable for fine-tuning future models.

>  Caution: This notebook contains examples of hate speech, including antisemitism, sexism and other forms of hate speech. These examples, whether they are created for this project or collected from external sources, are strictly for educational and research purposes. They do not reflect my personal views.

For further reading, consider checking out OpenAI's blog post on using [GPT-4 for content moderation](https://openai.com/blog/using-gpt-4-for-content-moderation), which might offer additional insights.

# 2. Hatespeech Analyser Full Chain <a id="cap2"></a>

For the Hatespeech Analyser I mainly use LangChain to build my LLM chains. Each chain consists of a custom prompt, an LLM (OpenAI gpt-3.5-turbo), and a JSON parser with custom format instructions. The input to each chain is a single message.
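
As a rough illustration of the prompt, model, JSON parser pattern described above, the sketch below wires the three steps together in plain Python. It is not the project's implementation: `PROMPT_TEMPLATE`, `FORMAT_INSTRUCTIONS`, `fake_llm` and `run_chain` are hypothetical stand-ins, and the stub returns a canned reply instead of calling OpenAI.

```python
import json

# Hypothetical, shortened stand-ins for a real prompt and its format instructions.
FORMAT_INSTRUCTIONS = 'Return a JSON object with the keys "explanation" and "classification".'
PROMPT_TEMPLATE = (
    "You are a hate speech expert.\n"
    "{format_instructions}\n\n"
    "<message>\n{message}\n</message>\n\nAnswer:"
)


def fake_llm(prompt: str) -> str:
    """Stub model: always returns the same canned JSON string (no API call)."""
    return '{"explanation": "Neutral statement.", "classification": "No hate speech"}'


def run_chain(inputs: dict, llm=fake_llm) -> dict:
    """Format the prompt, call the model, and parse the reply as JSON."""
    prompt = PROMPT_TEMPLATE.format(
        format_instructions=FORMAT_INSTRUCTIONS, message=inputs["message"]
    )
    raw_reply = llm(prompt)
    return json.loads(raw_reply)


result = run_chain({"message": "Ich mag Bananen"})
print(result["classification"])
```

Swapping `fake_llm` for a real model call would leave the other two steps unchanged, which is the main appeal of keeping prompt, model and parser as separate stages.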

The full chain combines four LLM chains and one moderator chain (more details in the scripts and in [Section 3](#cap8)):


| Chain                  | Description                                                                                                                                                                              |
|-----------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Hatespeech Detector   | Analyzes messages to determine the presence of hatespeech, categorizing responses as "Direct Hatespeech", "No Hatespeech", etc., and provides an explanation for its classification.    |
| Hatespeech Validator  | Similar to the Detector, it evaluates messages for hatespeech using a different prompt, incorporating the Detector's output in its analysis to validate the presence of hatespeech.       |
| Hatespeech Classifier | Identifies the specific type of hatespeech present in a message (e.g., "racism", "sexism") and further categorizes the message as "Offensive Insult", "Personal Experience", or "Historical Reference". |
| Right-wing Rater      | Examines messages for signs of right-wing ideologies, assigning a severity rating on a scale from 0 to 3, where 0 indicates no right-wing content and 3 signals a clear indication of right-wing ideology that warrants reporting. |
| Moderator Chain       | Uses OpenAI's moderation model to check the content of the message. |

In this iteration, most chains run in parallel because they do not depend on each other. Only the Hatespeech Validator needs the classification output of the Hatespeech Detector.
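
The dependency structure can be sketched with the standard library alone. This is a sketch under assumptions, not the code behind `parallel_text_analyser_chains`: every function below is a hypothetical stub with canned output, the `flagged` field is invented for illustration, and the `right_wing_rater` key is assumed from the row labels in the result tables.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stubs standing in for the real chains; each returns canned output.
def detector(message: str) -> dict:
    return {"classification": "No hate speech", "explanation": "stub"}


def validator(message: str, detector_result: dict) -> dict:
    # Depends on the detector's classification, so it must run after it.
    return {"classification": detector_result["classification"], "explanation": "stub"}


def classifier(message: str) -> dict:
    return {"classification": "Personal experience"}


def right_wing_rater(message: str) -> dict:
    return {"classification": "Category 0"}


def moderator(message: str) -> dict:
    return {"message": message, "flagged": False}


def detection_branch(message: str) -> dict:
    """Sequential branch: detector first, then validator."""
    detected = detector(message)
    return {"detector": detected, "validator": validator(message, detected)}


def analyse(message: str) -> dict:
    """Run the independent branches concurrently and collect their results."""
    branches = {
        "detection": detection_branch,
        "classifier": classifier,
        "right_wing_rater": right_wing_rater,
        "moderator": moderator,
    }
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, message) for name, fn in branches.items()}
        return {name: future.result() for name, future in futures.items()}


result = analyse("Ich mag Bananen")
print(sorted(result))
```

Only the detector/validator pair is sequential; the other three branches never wait on each other.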

See the concept diagram below to understand how the chains are connected.

#### Concept diagram of the full analyzer chain:

<img src="../visuals/concept_llm_chains.svg">

To make the full chain easier to understand, I've added an example message and its outputs to the diagram.

#### Example output of the full analyzer chain:

<img src="../visuals/concept_llm_with_example_output.svg">

I will first run the full analyzer chain and show the results. Afterwards, in Section 3, I will break the chain down into its components.

### 2.1 Define example messages  <a id="cap3"></a>

- These messages are taken from the web or constructed and DO NOT reflect my opinion!
- The messages are in German; please use a translator to understand the content.
- The messages contain antisemitism, racism, sexism and other forms of hate speech.
- "Ich mag Bananen" ("I like bananas") is meant as a neutral control.

In [3]:
examples = [
    {"message": "Diese gierigen Juden gehören vergast - leider hat Hilter seine Arbeit damals nicht beendet."},
    {"message": "Als Kind wurde ich immer als Neger beschimpft"},
    {"message": "Adolf Hitler war ein deutscher Politiker. Mit seiner Schrift Mein Kampf (1925/26) prägte er die antisemitische und rassistische, auf Eroberung von Lebensraum ausgerichtete Ideologie des Nationalsozialismus."},
    {"message": "Ich mag Bananen"},
    {"message": "Du dumme Fotze, ich ficke dein Leben"},
    {"message": "Du Schwuchtel!"},
    {"message": "Scheiß Neger, die erlauben sich wie immer alles - sollen mal wieder zurück in den Dschungel"},
    {"message": "Diese scheiß Grünen, die uns nur alles verbieten und wegnehmen. 88 - der einzig wahre Führer, der die deutschen Bedürfnisse verstanden hat."}
]

### 2.2 Define LLM settings <a id="cap4"></a>
- I am using OpenAI gpt-3.5-turbo-1106 (why not GPT-4? To save some money).
- The temperature is set to 0.0 and a seed is passed to try to get deterministic results (open question: is the seed actually forwarded by LangChain?).

In [4]:
llm = ChatOpenAI(model="gpt-3.5-turbo-1106", temperature=0.0, model_kwargs={"seed": 0}) 

### 2.3 Example: Analyse single message <a id="cap5"></a>

In [5]:
message = examples[0] # using the first message
message

{'message': 'Diese gierigen Juden gehören vergast - leider hat Hilter seine Arbeit damals nicht beendet.'}

In [6]:
analyser_full_chain = parallel_text_analyser_chains(llm)
result = analyser_full_chain.invoke(message)

In [7]:
print(json.dumps(result, indent=4))

{
    "detection": {
        "detector": {
            "classification": "Direct hate speech",
            "explanation": "The message contains explicit hate speech targeting Jews, using derogatory language and expressing a desire for their extermination. The author's intention is to disparage and promote hatred towards this group."
        },
        "validator": {
            "classification": "Direct hate speech",
            "explanation": "I agree with the previous expert's classification. The message contains explicit hate speech targeting Jews, using derogatory language and expressing a desire for their extermination. The author's intention is clearly to disparage and promote hatred towards this group."
        }
    },
    "classifier": {
        "classification": "Offensive insult",
        "racism": true,
        "anitsemitism": true,
        "homophobia": false,
        "ableism": false,
        "violence": true,
        "sexism": false,
        "other_hate_speech": false,
 

### 2.4 View results <a id="cap6"></a>

Let's format the results to get a better overview.
- The first table shows the string classification of each chain and the "expert's" explanation.
- The red dots in the second table indicate that the category applies (True).

In [8]:
from ki_gegen_rechts.utils import pretty_tables_single_result
import pandas as pd
pd.options.display.max_colwidth = 500

df_evaluation, df_explanations = pretty_tables_single_result(result)
txt = message["message"]

In [9]:
print(f"Original message: {txt}")
df_explanations

Original message: Diese gierigen Juden gehören vergast - leider hat Hilter seine Arbeit damals nicht beendet.


| Chain | Classification | Explanation |
|---|---|---|
| detector | Direct hate speech | The message contains explicit hate speech targeting Jews, using derogatory language and expressing a desire for their extermination. The author's intention is to disparage and promote hatred towards this group. |
| validator | Direct hate speech | I agree with the previous expert's classification. The message contains explicit hate speech targeting Jews, using derogatory language and expressing a desire for their extermination. The author's intention is clearly to disparage and promote hatred towards this group. |
| classifier | Offensive insult | The message contains offensive and hateful language directed at Jewish people, promoting violence and expressing antisemitic sentiments. |
| right_wing_rater | Category 3 | The message contains explicit and extreme right-wing rhetoric, advocating for the genocide of Jewish people. It openly promotes violence and hatred based on racial supremacy, which is a severe violation of German law. Immediate action is required to report the content and notify law enforcement. |


In [10]:
df_evaluation  # I worked in VS Code the whole time, where the formatting differs slightly (more borders)

Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Right Wing Rater,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results
other_hate_speech,sexism,ableism,racism,anitsemitism,violence,homophobia,right_wing_indicator,harassment,harassment_threatening,hate,hate_threatening,self_harm,self_harm_instructions,self_harm_intent,sexual,sexual_minors,violence,violence_graphic,self-harm,sexual/minors,hate/threatening,violence/graphic,self-harm/intent,self-harm/instructions,harassment/threatening
●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●


[Back to Table of Contents](#toc)

### 2.5 Example: Analyse batch of messages <a id="cap7"></a>

Because the analyser is built on LangChain's parallel chain, you can use all the corresponding methods such as `.batch` or the asynchronous functions (not tested yet). I'm working on a visualisation of multiple results.
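
For the asynchronous route, the general shape might look like the sketch below. LangChain runnables expose `ainvoke` and `abatch`, but since those are untested here, the example uses only `asyncio` with a hypothetical `fake_ainvoke` coroutine in place of the real chain; the second message and the `max_concurrency` default are likewise made up for illustration.

```python
import asyncio


async def fake_ainvoke(inputs: dict) -> dict:
    """Hypothetical stand-in for an async chain call; echoes the message."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"moderator": {"message": inputs["message"]}}


async def analyse_batch(batch: list[dict], max_concurrency: int = 4) -> list[dict]:
    """Analyse all inputs concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(inputs: dict) -> dict:
        async with semaphore:
            return await fake_ainvoke(inputs)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in batch))


demo_batch = [{"message": "Ich mag Bananen"}, {"message": "Hallo Welt"}]
demo_results = asyncio.run(analyse_batch(demo_batch))
print([r["moderator"]["message"] for r in demo_results])
```

Inside a notebook, where an event loop is already running, `await analyse_batch(demo_batch)` would replace the `asyncio.run(...)` call.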

[Jump to section 3](#cap8)

In [11]:
results = analyser_full_chain.batch(examples)

In [12]:
for result in results:
    orig_message = result["moderator"]["message"]
    print(f"Original message: {orig_message}")
    df_evaluation, df_explanations = pretty_tables_single_result(result)
    display(df_evaluation)
    display(df_explanations)
    print("\n\n")

Original message: Diese gierigen Juden gehören vergast - leider hat Hilter seine Arbeit damals nicht beendet.


Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Right Wing Rater,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results
other_hate_speech,sexism,ableism,racism,anitsemitism,violence,homophobia,right_wing_indicator,harassment,harassment_threatening,hate,hate_threatening,self_harm,self_harm_instructions,self_harm_intent,sexual,sexual_minors,violence,violence_graphic,self-harm,sexual/minors,hate/threatening,violence/graphic,self-harm/intent,self-harm/instructions,harassment/threatening
●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●


| Chain | Classification | Explanation |
|---|---|---|
| detector | Direct hate speech | The message contains explicit hate speech targeting Jews, using derogatory language and expressing a desire for their extermination. The content directly reflects the author's hateful views. |
| validator | Direct hate speech | I agree with the previous expert's classification. The message contains explicit hate speech targeting Jews, using derogatory language and expressing a desire for their extermination. The content directly reflects the author's hateful views. |
| classifier | Offensive insult | The message contains offensive and hateful language directed at Jewish people, promoting violence and expressing antisemitic views. |
| right_wing_rater | Category 3 | The message openly advocates for violence and hatred based on right-wing extremist ideologies. It contains explicit anti-Semitic rhetoric, promoting the genocide of Jewish people. This constitutes a severe violation of German law and immediate action is required to report the content and notify law enforcement. |





Original message: Als Kind wurde ich immer als Neger beschimpft


Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Right Wing Rater,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results
other_hate_speech,sexism,ableism,racism,anitsemitism,violence,homophobia,right_wing_indicator,harassment,harassment_threatening,hate,hate_threatening,self_harm,self_harm_instructions,self_harm_intent,sexual,sexual_minors,violence,violence_graphic,self-harm,sexual/minors,hate/threatening,violence/graphic,self-harm/intent,self-harm/instructions,harassment/threatening
●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●


| Chain | Classification | Explanation |
|---|---|---|
| detector | Indirect hate speech | The message contains a derogatory term used to describe the author's personal experiences of being called a derogatory term based on race. It does not express the author's personal hateful beliefs, but rather shares a personal experience of hate speech encountered. Therefore, it falls under the category of Indirect hate speech. |
| validator | Indirect hate speech | I agree with the previous expert's classification. The message contains a derogatory term used to describe the author's personal experiences of being called a derogatory term based on race. It does not express the author's personal hateful beliefs, but rather shares a personal experience of hate speech encountered. Therefore, it falls under the category of Indirect hate speech. |
| classifier | Personal experience | The message includes a personal experience of being called a derogatory term based on race, which falls under the category of racism. |
| right_wing_rater | Category 0 | The message does not contain any indication of right-wing ideologies. It describes a situation where the author was the victim of racial discrimination, but it does not promote or endorse any harmful right-wing ideologies. |





Original message: Adolf Hitler war ein deutscher Politiker. Mit seiner Schrift Mein Kampf (1925/26) prägte er die antisemitische und rassistische, auf Eroberung von Lebensraum ausgerichtete Ideologie des Nationalsozialismus.


Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Right Wing Rater,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results
other_hate_speech,sexism,ableism,racism,anitsemitism,violence,homophobia,right_wing_indicator,harassment,harassment_threatening,hate,hate_threatening,self_harm,self_harm_instructions,self_harm_intent,sexual,sexual_minors,violence,violence_graphic,self-harm,sexual/minors,hate/threatening,violence/graphic,self-harm/intent,self-harm/instructions,harassment/threatening
●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●


| Chain | Classification | Explanation |
|---|---|---|
| detector | Direct hate speech | The message contains explicit hate speech, as it promotes antisemitism and racism through the ideology of National Socialism. It directly reflects the author's views and intentions to disparage and demean individuals based on their ethnicity. |
| validator | Direct hate speech | I agree with the previous expert's classification of direct hate speech. The message explicitly promotes antisemitism and racism through the ideology of National Socialism, reflecting the author's intention to disparage and demean individuals based on their ethnicity. The language used in the message leaves no doubt about the hateful nature of the content. |
| classifier | Historical reference | The message contains a historical reference to Adolf Hitler and his influence on the antisemitic and racist ideology of National Socialism, meeting the criteria for the historical reference category and the subcategories of racism and antisemitism. |
| right_wing_rater | Category 3 | The message contains explicit references to Adolf Hitler and Mein Kampf, which are associated with the extreme right-wing ideology of National Socialism (Nazism). The mention of antisemitism, racism, and the pursuit of Lebensraum (living space) clearly aligns with harmful right-wing ideologies. This message constitutes a severe violation of German law and openly promotes supremacist and violent beliefs. |





Original message: Ich mag Bananen


Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Right Wing Rater,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results
other_hate_speech,sexism,ableism,racism,anitsemitism,violence,homophobia,right_wing_indicator,harassment,harassment_threatening,hate,hate_threatening,self_harm,self_harm_instructions,self_harm_intent,sexual,sexual_minors,violence,violence_graphic,self-harm,sexual/minors,hate/threatening,violence/graphic,self-harm/intent,self-harm/instructions,harassment/threatening
●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●


| Chain | Classification | Explanation |
|---|---|---|
| detector | No hate speech | The message is neutral and does not contain any form of hate speech, derogatory, discriminatory, or hostile language. It is a positive statement about liking bananas. |
| validator | No hate speech | I agree with the previous expert's classification. The message is indeed neutral and does not contain any form of hate speech. It is a positive statement about liking bananas, and there are no derogatory, discriminatory, or hostile language used. |
| classifier | Personal experience | The message is a personal expression of liking bananas and does not contain any hate speech or discriminatory content. |
| right_wing_rater | Category 0 | The message contains no indication of right-wing ideologies. It is a simple statement expressing a preference for bananas. |





Original message: Du dumme Fotze, ich ficke dein Leben


Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Right Wing Rater,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results
other_hate_speech,sexism,ableism,racism,anitsemitism,violence,homophobia,right_wing_indicator,harassment,harassment_threatening,hate,hate_threatening,self_harm,self_harm_instructions,self_harm_intent,sexual,sexual_minors,violence,violence_graphic,self-harm,sexual/minors,hate/threatening,violence/graphic,self-harm/intent,self-harm/instructions,harassment/threatening
●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●


| Chain | Classification | Explanation |
|---|---|---|
| detector | Direct hate speech | The message contains explicit derogatory language and expresses hostility towards an individual. It directly promotes hatred and discrimination based on gender. |
| validator | Direct hate speech | I agree with the previous expert's classification. The message contains explicit derogatory language and expresses hostility towards an individual, promoting hatred and discrimination based on gender. The language used is clearly indicative of hate speech, and the intent of the author is to spread negativity and discrimination. |
| classifier | Offensive insult | The message contains offensive and violent language directed at a specific person, using derogatory and sexist terms. |
| right_wing_rater | Category 0 | The message contains no indication of right-wing ideologies. It is a vulgar insult but does not contain any specific right-wing rhetoric. |





Original message: Du Schwuchtel!


Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Right Wing Rater,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results
other_hate_speech,sexism,ableism,racism,anitsemitism,violence,homophobia,right_wing_indicator,harassment,harassment_threatening,hate,hate_threatening,self_harm,self_harm_instructions,self_harm_intent,sexual,sexual_minors,violence,violence_graphic,self-harm,sexual/minors,hate/threatening,violence/graphic,self-harm/intent,self-harm/instructions,harassment/threatening
●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●


| Chain | Classification | Explanation |
|---|---|---|
| detector | Direct hate speech | The message contains a derogatory slur targeting an individual based on their sexual orientation, reflecting the author's intention to express hostility. The language used is explicitly demeaning and promotes discrimination. |
| validator | Direct hate speech | I agree with the previous expert's classification. The message contains a derogatory slur targeting an individual based on their sexual orientation, and the language used is explicitly demeaning and promotes discrimination. The author's intention to express hostility is evident, supporting the classification of direct hate speech. |
| classifier | Offensive insult | The message contains a homophobic slur, which is a clear example of offensive insult and promotes hate speech against homosexual people. |
| right_wing_rater | Category 2 | The message contains a derogatory and offensive term with clear homophobic connotations. This language promotes exclusion and division, aligning with harmful right-wing ideologies. It is potentially harmful and should be reported and further reviewed. |





Original message: Scheiß Neger, die erlauben sich wie immer alles - sollen mal wieder zurück in den Dschungel


Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Right Wing Rater,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results
other_hate_speech,sexism,ableism,racism,anitsemitism,violence,homophobia,right_wing_indicator,harassment,harassment_threatening,hate,hate_threatening,self_harm,self_harm_instructions,self_harm_intent,sexual,sexual_minors,violence,violence_graphic,self-harm,sexual/minors,hate/threatening,violence/graphic,self-harm/intent,self-harm/instructions,harassment/threatening
●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●


| Chain | Classification | Explanation |
|---|---|---|
| detector | Direct hate speech | The message contains derogatory language and expresses hostility towards individuals based on race. The author's intention is to disparage and demean a specific racial group. |
| validator | Direct hate speech | I agree with the expert's opinion. The message contains derogatory language and expresses hostility towards individuals based on race, with the clear intention to disparage and demean a specific racial group. The classification of direct hate speech is accurate based on the language used and the author's intent. |
| classifier | Offensive insult | The message contains a racial slur and promotes hate speech and violence against people of African descent. |
| right_wing_rater | Category 3 | The message contains extremely harmful and racist language, advocating for the return of individuals to the 'jungle' based on their race. This openly promotes racial supremacy and hatred, constituting a severe violation of German law. Immediate action is required to report the content and notify law enforcement or relevant authorities. |





Original message: Diese scheiß Grünen, die uns nur alles verbieten und wegnehmen. 88 - der einzig wahre Führer, der die deutschen Bedürfnisse verstanden hat.


Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Hate Speech Classifier,Right Wing Rater,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results,Moderator Results
other_hate_speech,sexism,ableism,racism,anitsemitism,violence,homophobia,right_wing_indicator,harassment,harassment_threatening,hate,hate_threatening,self_harm,self_harm_instructions,self_harm_intent,sexual,sexual_minors,violence,violence_graphic,self-harm,sexual/minors,hate/threatening,violence/graphic,self-harm/intent,self-harm/instructions,harassment/threatening
●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●,●


| Chain | Classification | Explanation |
|---|---|---|
| detector | Direct hate speech | The message contains direct hate speech as it expresses hostility towards a political party and uses a reference to '88' which is a neo-Nazi code for 'Heil Hitler.' The language used is derogatory and promotes hatred towards a specific group. |
| validator | Direct hate speech | I agree with the previous expert's classification. The message contains explicit hostility and derogatory language towards a specific political party, as well as a reference to '88' which is associated with neo-Nazi ideology. The language used clearly promotes hatred and intolerance towards a particular group, justifying the classification as direct hate speech. |
| classifier | Offensive insult | The message contains offensive language and references to Nazi propaganda, promoting violence and hate speech against certain groups. |
| right_wing_rater | Category 2 | The message contains clear right-wing ideological rhetoric, promoting a divisive and exclusionary viewpoint. The reference to '88' is a well-known neo-Nazi code for 'Heil Hitler', and the mention of 'der einzig wahre Führer' (the one true leader) is a direct reference to Adolf Hitler. This kind of language is harmful and needs to be reported and further reviewed. |







## Findings/Evaluation

- The results look promising, but there are issues we need to address:
    - Original message: Adolf Hitler war ein deutscher Politiker... -> This is the first sentence about Adolf Hitler from Wikipedia.
        - The Classifier correctly detects that the message is a historical reference.
        - But the Detector and Validator fail to classify the message properly because "they" rely on keywords such as "antisemitic" and "racist", which drives the chains to classify the message as "Direct hate speech".
        - The same applies to the Right-wing Rater.
    - I sometimes have trouble reproducing the results. It could help to pass the same message through the chain multiple times and count the occurrences, at least for the classifications and boolean outputs.
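
The repeat-and-count idea from the last point could look roughly like the sketch below. It is an assumption about how such voting might be done, not something the project implements yet: `majority_vote` and `flaky_classifier` are hypothetical names, and the stub cycles through canned labels to mimic run-to-run variation instead of calling the model.

```python
from collections import Counter
from itertools import cycle


def majority_vote(classify, message: str, runs: int = 5) -> tuple[str, float]:
    """Call classify(message) several times; return the most common label and its share."""
    labels = [classify(message) for _ in range(runs)]
    label, count = Counter(labels).most_common(1)[0]
    return label, count / runs


# Hypothetical flaky classifier: cycles through canned labels to mimic varying model output.
_canned = cycle(["Direct hate speech", "Indirect hate speech", "Direct hate speech"])


def flaky_classifier(message: str) -> str:
    return next(_canned)


label, agreement = majority_vote(flaky_classifier, "example message", runs=5)
print(label, agreement)
```

The agreement share doubles as a cheap uncertainty signal: messages with low agreement are natural candidates for human review.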

# 3. Deconstruct full analyser chain into components (18.02.24) <a id="cap8"></a>
[Back to Table of Contents](#toc)

> Please keep in mind that the model is not deterministic; the results can differ slightly between invocations.

### 3.1 Hatespeech Detector Chain <a id="cap9"></a>

You can find all prompts in [this script](../src/ki_gegen_rechts/prompts.py).

#### 3.1.1 The Prompt 

```python
"""You are a hate speech expert. Your role is to categorize each message by analyzing its content for the presence and type of hate speech.
Utilize the following definitions which are delimited with XML tags to ensure accurate and nuanced classifications:

<category>Direct hate speech</category>
<definition>This category is for messages where the author explicitly expresses hate speech. The content directly reflects the author's own views,
showing an intention to disparage, demean, or express hostility towards individuals or groups based on race, ethnicity, religion, gender, sexual orientation,
or other identity markers. Direct hate speech is characterized by the author's use of derogatory language, slurs, or any explicit statements that promote
hatred or discrimination. if the message contains hate speech and is expressed by the author of the message and reflects directly his opinion.</definition>

<category>Indirect hate speech</category>
<definition>Assign a message to this category if it contains hate speech articulated through the lens of personal experiences
or observations, yet does not directly express the author's personal hateful beliefs. This includes scenarios where the author discusses hate speech encountered
in personal experiences, shares narratives that include hate speech to highlight societal issues, or articulates scenarios involving hate speech without endorsing it.
This category also could include quoting someone else's hate speech</definition>

<category>No hate speech</category>
<definition>Messages without any form of hate speech, derogatory, discriminatory, or hostile language towards any group or individual,
belong here. This includes content that is neutral, positive, or unrelated to hate speech.</definition>

<category>Review needed</category>
<definition>Messages without any form of hate speech, derogatory, discriminatory, or hostile language towards any group or individual,
belong here. This includes content that is neutral, positive, or unrelated to hate speech.</definition>

<category>Unknown</category>
<definition>If the classification is unclear due to lack of context, ambiguous language, or other factors preventing a definitive categorization,
label the message as "Unknown." This category is for messages that need additional information or context for accurate classification.</definition>

Carefully read each message, paying close attention to the context and the language used. Always think step by step and decide based on your conclusion.
Your detailed assessment is vital for ensuring a respectful and safe communication environment.
First explain your conclusion in 80 words and categorize it. Always answer in the following format:
{format_instructions}

<message>
{message}
</message>

Answer:"""
```

#### 3.1.2 Invoking the Hatespeech Detector

<b>Inputs: </b>
- message

<b>Outputs: JSON with following keys</b>
- classification: one of the strings 'Direct hate speech', 'Indirect hate speech', 'No hate speech', 'Review needed' or 'Unknown'
- explanation: An explanation of the model's decision

In [13]:
from ki_gegen_rechts.prompts import HATESPEECH_DETECTOR_PROMPT, HatespeechDetectionFormat

In [14]:
parser = JsonOutputParser(pydantic_object=HatespeechDetectionFormat)
chain = llm_chains.create_public_chat_gpt_chain(HATESPEECH_DETECTOR_PROMPT, parser)
result = chain.invoke(message)
print(json.dumps(result, indent=4))

{
    "explanation": "The message contains explicit hate speech targeting Jews, using derogatory language and expressing a desire for their extermination. The author's intention is to disparage and promote hatred towards this group.",
    "classification": "Direct hate speech"
}
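
Because the classification is free text generated by the LLM, it can be useful to check the parsed result before passing it on. The following is a minimal sketch, not part of `ki_gegen_rechts`: `check_detection` is a hypothetical helper, and the allowed labels are copied from the detector prompt above.

```python
# Labels copied from the detector prompt; the helper itself is hypothetical.
ALLOWED_LABELS = {
    "Direct hate speech",
    "Indirect hate speech",
    "No hate speech",
    "Review needed",
    "Unknown",
}


def check_detection(result: dict) -> dict:
    """Return the result unchanged, or raise ValueError if it is malformed."""
    missing = {"classification", "explanation"} - result.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if result["classification"] not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {result['classification']!r}")
    return result


# Shortened stand-in for the detector output shown above.
example = {
    "explanation": "The message contains explicit hate speech.",
    "classification": "Direct hate speech",
}
checked = check_detection(example)
print(checked["classification"])
```

A result with a label outside the five categories would raise a `ValueError`, which could then be routed to manual review.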


## 3.2 Hatespeech Validator Chain <a id="cap10"></a>

In [15]:
from ki_gegen_rechts.prompts import HATESPEECH_VALIDATOR_PROMPT, HatespeechValidatorFormat

#### 3.2.1 The Prompt

```python
"""You are tasked with re-evaluating the classification of a message that has been previously 
assessed by another expert in the context of hate speech detection. 
The message could fall into one of the following categories:

    `Direct hate speech`: The message contains hate speech expressed directly by the author. This does not include any personal experience where the author was the victim of hate speech.
    `Indirect hate speech`: The message contains hate speech that is not directly expressed by the author but implies endorsement or propagation of hate speech.
    `No hate speech`: The message does not exhibit any characteristics of hate speech.
    `Review needed`: The message exhibits some characteristics of hate speech but is not definitive, and therefore, further review is necessary.
    `Unknown`: The message could not be confidently classified due to ambiguity or lack of clear indicators.
    
Instructions: Carefully read the message, paying close attention to the context and the language used. Accurately apply the categories based on
the provided definitions, focusing on how hate speech is presented and the author's intent. Your detailed assessment is vital for ensuring a respectful
and safe communication environment. Always think step by step and decide based on your conclusion. Provide your perspective to the message and explain your conclusion.

Here is the classification provided by the previous expert below:
<opinion>    
Classification: {classification}
</opinion>    

Review the content of the message:
<message>
{message}
</message>

Always answer in the following format:
{format_instructions}

Answer:"""
```

#### 3.2.2 Invoke the Hatespeech Validator

<b>Inputs: </b>
- message
- classification (of the Hatespeech Detector chain)

<b>Outputs: JSON with following keys</b>
- classification: one of the strings 'Direct hate speech', 'Indirect hate speech', 'No hate speech', 'Review needed' or 'Unknown'
- explanation: An explanation of the model's decision

In [16]:
result.update(message) # we need to pass the classification of the hatespeech detector and the original message
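
The `update` call above merges the original message into the detector result in place. A small sketch of what happens, assuming that `message` is a dict with a single `"message"` key (the texts below are placeholders, not real outputs):

```python
# Assumed input shapes; the placeholder strings are not real data.
message = {"message": "<original message text>"}
result = {
    "explanation": "<detector explanation>",
    "classification": "Direct hate speech",
}

# dict.update merges in place and returns None, so `result` itself changes.
result.update(message)

# The validator prompt only references {message} and {classification}.
validator_inputs = {key: result[key] for key in ("message", "classification")}
print(validator_inputs)
```

After the merge, `result` still carries the detector's `explanation` in addition to the two keys the validator prompt uses.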

In [17]:
parser = JsonOutputParser(pydantic_object=HatespeechValidatorFormat)
chain_2 = llm_chains.create_public_chat_gpt_chain(HATESPEECH_VALIDATOR_PROMPT,
                                                  parser)
print(json.dumps(chain_2.invoke(result), indent=4))

{
    "classification": "Direct hate speech",
    "explanation": "I agree with the previous expert's classification. The message contains direct hate speech targeting Jews, expressing a desire for them to be gassed. The reference to Hitler's work not being finished further emphasizes the hateful intent of the message."
}


## 3.3 Hatespeech Classifier Chain <a id="cap11"></a>

In [18]:
from ki_gegen_rechts.prompts import HATESPEECH_CLASSIFICATION_PROMPT, HatespeechClassifierFormat

#### 3.3.1 The Prompt

```python
"""Please analyze the following message and categorize its content based on the listed categories.
Categorize the message in two major steps.

In the first step you categorize if one of the following main categories apply to the message:

    - Personal experience: Indicate if the message includes personal stories, anecdotes, or life experiences, where the author of the message suffered from hate speech.
    - Historical reference: Identify references or quotes to historical events, figures, or contexts. Only classify it as a `historical reference` if the historical content is used to provide information and is not used to insult specific groups or persons.
    - Offensive insult: The message clearly insults, discriminates against, or promotes hate and violence against particular groups, persons, and others.

In the second step you identify if the message contains any elements of racism, antisemitism, homophobia, ableism, sexism or other forms of hate speech.

Subcategories to analyze in the second step:

    - Racism: Identify any elements that promote, condone, or express prejudice, discrimination, or antagonism against a person or people based on their race or ethnic origin.
    - Antisemitism: Detect any expressions or actions that show hostility, prejudice, or discrimination against Jewish people.
    - Homophobia: Highlight statements or attitudes that express dislike of or prejudice against homosexual people.
    - Ableism: Point out any discrimination or social prejudice against people with disabilities or who are perceived to be disabled.
    - Violence: Indicate if the message promotes violence in any form or threatens particular groups, persons, and others.
    - Sexism: Highlight expressions or actions that show prejudice, discrimination, or antagonism based on gender, including demeaning statements, enforcement of traditional gender roles, or denial of opportunities due to gender.
    - Other hate speech: Identify any content that doesn't fall into the above categories but can be considered as promoting hate or discrimination against a particular group based on attributes such as gender, sexual orientation, religion, or others.

    
General instructions:
    - Analyze the content of the message carefully.
    - Categorize the second step with python booleans format and don't save it as a string.
    - If a subcategory does not apply, simply state false and move on to the next one.
    - Maintain an objective and unbiased stance throughout the analysis.
    - Always think step by step and decide based on your conclusion.

Always answer in the following format:
{format_instructions}

<message>
{message}
</message>

Answer:
"""
```

#### 3.3.2 Invoke the Hatespeech Classifier

<b>Inputs: </b>
- message

<b>Outputs: JSON with following keys</b>
- classification: a string stating whether the message is a 'Personal experience', a 'Historical reference' or an 'Offensive insult'
- One boolean per hatespeech subcategory, see prompt (True if the message contains the subcategory)
- explanation: An explanation of the model's decision

In [19]:
parser = JsonOutputParser(pydantic_object=HatespeechClassifierFormat)
chain_3 = llm_chains.create_public_chat_gpt_chain(HATESPEECH_CLASSIFICATION_PROMPT, parser)
print(json.dumps(chain_3.invoke(result), indent=4))

{
    "classification": "Offensive insult",
    "racism": true,
    "anitsemitism": true,
    "homophobia": false,
    "ableism": false,
    "violence": true,
    "sexism": false,
    "other_hate_speech": false,
    "explanation": "The message contains offensive insults targeting Jewish people, promoting violence and expressing antisemitism."
}
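
For later analysis it is often handy to reduce the boolean flags to a list of the subcategories that apply. A minimal sketch with a hypothetical helper `active_subcategories`; the example dict mirrors the output above, including the key spelling returned by the chain:

```python
def active_subcategories(result: dict) -> list[str]:
    """Return the names of all fields whose value is exactly True."""
    return [key for key, value in result.items() if value is True]


# Mirrors the classifier output above (explanation shortened).
classifier_result = {
    "classification": "Offensive insult",
    "racism": True,
    "anitsemitism": True,  # key spelled as in the chain output
    "homophobia": False,
    "ableism": False,
    "violence": True,
    "sexism": False,
    "other_hate_speech": False,
    "explanation": "shortened",
}

print(active_subcategories(classifier_result))
# ['racism', 'anitsemitism', 'violence']
```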


## 3.4 Right-wing Rater Chain <a id="cap12"></a>

In [20]:
from ki_gegen_rechts.prompts import RightWingRatingFormat, RIGHT_WING_RATING_PROMPT

#### 3.4.1 The Prompt

```python
"""You are an expert in understanding a broad spectrum of political ideologies, with a focus on recognizing extreme and harmful rhetoric, including fascism, Nazism, and racial supremacy. You possess a deep understanding of historical contexts, particularly related to Germany's history. Your task is to analyze messages in the first step if the message contain any indication of right-wing ideologies and then rate them into four categories, based on the presence and intensity of harmful right-wing ideologies. It is crucial to capture the intensity of the author's emotions, including feelings of strong dislike or aversion. If the message describes situations where the author was the victim of right-wing ideologies, this should be classified with no right-wing ideologies.

First step - Right-wing Indication:
    Classify with a boolean if the message contains any indication of right-wing ideologies. 

Second Step - Rating Scale:
    Category 0: The message contains no indication of right-wing ideologies.
    Category 1: The message contains subtle hints of right-wing ideologies but is not overtly harmful or inciting.
    Category 2: The message contains clear right-wing ideological rhetoric, promoting a divisive or exclusionary viewpoint. It is potentially harmful and needs to be reported and further reviewed.
    Category 3: The message is extremely dangerous and constitutes a severe violation of German law. It openly advocates for violence, hatred, or supremacy based on right-wing extremist ideologies. Immediate action is required: reporting the content to the platform it was found on, and notifying law enforcement or other relevant authorities to assess the need for pressing charges.

Always think step by step and decide based on your conclusion. Also explain why and which part in the message was the most relevant for your conclusion.

Always answer in the following format:
{format_instructions}

<message>
{message}
</message>

Answer:
"""
```

#### 3.4.2 Invoke the Right-wing Rater

<b>Inputs: </b>
- message

<b>Outputs: JSON with following keys</b>
- right_wing_indicator: a boolean indicating whether the message contains right-wing ideologies (True if it applies)
- rating: a category from 0 to 3. Category 0 means there are no right-wing ideologies; Category 3 is the highest possible rating.
- explanation: An explanation of the model's decision

In [21]:
parser = JsonOutputParser(pydantic_object=RightWingRatingFormat)
chain_4 = llm_chains.create_public_chat_gpt_chain(RIGHT_WING_RATING_PROMPT, parser)
print(json.dumps(chain_4.invoke(result), indent=4))

{
    "right_wing_indicator": true,
    "rating": "Category 3",
    "explanation": "The message contains explicit and extreme right-wing rhetoric, advocating for the genocide of Jewish people. It openly promotes violence and hatred based on racial supremacy, which is a severe violation of German law. Immediate action is required to report the content to the platform it was found on, and notify law enforcement or other relevant authorities to assess the need for pressing charges."
}
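
The rating comes back as a string such as `"Category 3"`. To sort or filter results it helps to turn it into an integer. The sketch below uses a hypothetical helper `parse_rating`; the follow-up actions are paraphrased from the rating scale in the prompt, which only names actions for categories 2 and 3.

```python
import re

# Actions paraphrased from the prompt's rating scale (categories 2 and 3 only).
ACTIONS = {
    2: "report and review further",
    3: "report to the platform and notify the authorities",
}


def parse_rating(rating) -> int:
    """Turn 'Category 3', '3' or 3 into the integer 3; raise ValueError otherwise."""
    match = re.fullmatch(r"(?:category\s*)?([0-3])", str(rating).strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"cannot parse rating: {rating!r}")
    return int(match.group(1))


level = parse_rating("Category 3")
print(level, "->", ACTIONS.get(level, "no action named in the prompt"))
```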


## 3.5 Moderation Classifier from OpenAI (not an LLM Chain) <a id="cap13"></a>
[Back to Table of Content](#toc)

For details on this chain, see OpenAI's [moderation guide](https://platform.openai.com/docs/guides/moderation).

In [22]:
from ki_gegen_rechts.llm_chains import OpenAIModerationChain
mod_chain = OpenAIModerationChain()
print(json.dumps(mod_chain.invoke(message, return_only_outputs=False), indent=4))

{
    "message": "Diese gierigen Juden geh\u00f6ren vergast - leider hat Hilter seine Arbeit damals nicht beendet.",
    "mod_results": {
        "categories": {
            "harassment": true,
            "harassment_threatening": true,
            "hate": true,
            "hate_threatening": true,
            "self_harm": false,
            "self_harm_instructions": false,
            "self_harm_intent": false,
            "sexual": false,
            "sexual_minors": false,
            "violence": true,
            "violence_graphic": false,
            "self-harm": false,
            "sexual/minors": false,
            "hate/threatening": true,
            "violence/graphic": false,
            "self-harm/intent": false,
            "self-harm/instructions": false,
            "harassment/threatening": true
        },
        "category_scores": {
            "harassment": 0.9950785636901855,
            "harassment_threatening": 0.9740835428237915,
            "hate": 0.9758500456

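The `categories` dict in the output above lists several categories twice, once with underscores (`hate_threatening`) and once with a slash or hyphen (`hate/threatening`, `self-harm`). A minimal sketch of how the flagged categories could be collected without duplicates; `flagged_categories` is a hypothetical helper and the example dict is a shortened copy of the output:

```python
def flagged_categories(categories: dict) -> list[str]:
    """Return flagged category names, merging 'a/b' and 'a-b' spellings into 'a_b'."""
    flagged = set()
    for name, is_flagged in categories.items():
        if is_flagged:
            flagged.add(name.replace("/", "_").replace("-", "_"))
    return sorted(flagged)


# Shortened copy of the "categories" block shown above.
categories = {
    "harassment": True,
    "harassment_threatening": True,
    "hate": True,
    "hate_threatening": True,
    "self_harm": False,
    "violence": True,
    "violence_graphic": False,
    "self-harm": False,
    "hate/threatening": True,
    "harassment/threatening": True,
}

print(flagged_categories(categories))
# ['harassment', 'harassment_threatening', 'hate', 'hate_threatening', 'violence']
```
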
# 4. Conclusion and next steps <a id="cap14"></a>

This first attempt goes in the right direction, but it is not yet scalable or usable as a proper benchmark. Several parts of the prompts can still be refined. The right-wing rater is not really reliable, and I assume it works better when run after the other chains with some kind of chain routing. The hatespeech validator often repeats the wording of the detector, so it may be better to change its prompt or to use a different LLM or different parameters.
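
One possible shape of such chain routing, sketched with plain Python callables instead of real chains so that it runs without an API key. Everything here is an assumption for illustration: `analyse_with_routing` and the stub chains are hypothetical, and in LangChain the same idea could probably be expressed with a `RunnableBranch`.

```python
from typing import Callable, Optional

Chain = Callable[[dict], dict]  # stand-in for a chain's invoke method


def analyse_with_routing(message: dict, detector: Chain, rater: Chain) -> dict:
    """Run `rater` only if `detector` does not answer 'No hate speech'."""
    detection = detector(message)
    rating: Optional[dict] = None
    if detection["classification"] != "No hate speech":
        rating = rater(message)
    return {**message, "detection": detection, "right_wing": rating}


# Stub chains with fixed answers, for illustration only.
def stub_detector(message: dict) -> dict:
    return {"classification": "No hate speech", "explanation": "stub"}


def stub_rater(message: dict) -> dict:
    return {"right_wing_indicator": False, "rating": "Category 0", "explanation": "stub"}


routed = analyse_with_routing({"message": "Good morning!"}, stub_detector, stub_rater)
print(routed["right_wing"])  # the rater was skipped for this message
```

Skipping the rater for harmless messages would also save one LLM call per message.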


The next steps will include:

- Refine prompts and parallel chain logic
- Save messages and results in a database (to prepare for potential fine-tuning)
- Build an evaluation tool (accuracy, recall, precision or F1 for single-label/multilabel) for the batched LLM output
- Build an app/dashboard to show evaluation results
- Try different LLMs (models trained for the German language)
- Build message collectors for social media
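
As a starting point for the evaluation tool, precision, recall and F1 for a single label can be computed with a few lines of plain Python. This is only a sketch: `precision_recall_f1` is a hypothetical helper, and the labels below are made-up toy data, not results from the chains.

```python
def precision_recall_f1(y_true: list, y_pred: list, positive: str) -> tuple:
    """Compute precision, recall and F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up toy labels: human annotation vs. model output.
y_true = ["Direct hate speech", "No hate speech", "Direct hate speech", "No hate speech"]
y_pred = ["Direct hate speech", "Direct hate speech", "No hate speech", "No hate speech"]

print(precision_recall_f1(y_true, y_pred, positive="Direct hate speech"))
# (0.5, 0.5, 0.5)
```

For the multilabel case the same function could be applied once per subcategory and the scores averaged.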