# Hate Speech Detection Using DistilBERT:
**Project Overview:**
This notebook provides a complete pipeline for identifying hate speech in YouTube comments using DistilBERT, a transformer-based model optimized for speed and efficiency. The project aims to categorize comments from Lok Sabha election discussions on YouTube as "Hate," "Offensive," or "Not Hate."

**Notebook Outline**
**Objective:** The focus of this project is to leverage a pretrained DistilBERT model to accurately label and analyze hate speech in election-related YouTube comments. This approach enables high-throughput and reliable hate speech classification on a large dataset.

# DistilBERT Model Overview:

**Model Choice:** DistilBERT is chosen for its balance of performance and computational efficiency, which allows fast inference on large datasets. It is about 40% smaller than BERT and, according to its authors, runs about 60% faster while retaining roughly 97% of BERT's language-understanding performance, making it suitable for real-time hate speech classification.

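**Labeling Categories:** Each comment is assigned one of three labels, "Hate," "Offensive," or "Not Hate," so that offensive but non-hateful speech is captured as well. Because the `distilbert-base-uncased` checkpoint ships without a trained classification head, the model has to be fine-tuned on labeled examples before these labels are reliable.

# Data Preprocessing: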

**Text Cleaning and Deduplication:** The comments are preprocessed to remove duplicates, hyperlinks, and irrelevant special characters, preparing them for input to DistilBERT.
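
The cleaning and deduplication steps described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact code: the names `clean_comment` and `deduplicate` and the URL pattern are assumptions made for the example.

```python
import re

# Assumed patterns for this sketch: http(s) links, bare www links, whitespace runs
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
WHITESPACE_PATTERN = re.compile(r'\s+')

def clean_comment(text):
    """Lowercase the text, drop hyperlinks, and collapse whitespace."""
    text = text.lower()
    text = URL_PATTERN.sub('', text)
    return WHITESPACE_PATTERN.sub(' ', text).strip()

def deduplicate(comments):
    """Remove exact duplicates while keeping the first occurrence of each."""
    return list(dict.fromkeys(comments))

raw_comments = [
    "Very good",
    "very   good",
    "Watch this https://example.com/video now",
]
cleaned = deduplicate([clean_comment(c) for c in raw_comments])
print(cleaned)
```

Cleaning before deduplicating means that comments differing only in case or spacing are treated as the same comment.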

**Translation for Multilingual Data:** As comments are in various languages (including Hinglish, Manglish, and other regional languages), a translation step converts non-English text to English, ensuring DistilBERT can process them effectively.

# Labeling with DistilBERT:

**Probability-Based Labeling:** DistilBERT outputs a logit for each category. Softmax converts the logits to probabilities, and each comment receives the label with the highest probability ("Hate," "Offensive," or "Not Hate").

**Custom Thresholding:** Per-label probability thresholds can replace the highest-probability rule to tailor the classification to political discussions, for example by requiring a higher probability before a comment is labeled "Hate."
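
Threshold-based labeling of this kind can be sketched as follows. The threshold values and the helper names `softmax` and `label_from_probs` are hypothetical, chosen only for illustration; the class order (not hate, offensive, hate) follows the index mapping used later in the notebook.

```python
import math

# Hypothetical thresholds for illustration; real values would be tuned on labeled data
HATE_THRESHOLD = 0.5
OFFENSIVE_THRESHOLD = 0.4

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_from_probs(probs):
    """probs is ordered as [p_not_hate, p_offensive, p_hate]."""
    _, p_offensive, p_hate = probs
    # Check the most severe category first
    if p_hate >= HATE_THRESHOLD:
        return "hate"
    if p_offensive >= OFFENSIVE_THRESHOLD:
        return "offensive"
    return "not hate"

probs = softmax([0.2, 1.5, 0.1])
label = label_from_probs(probs)
print(label)
```

Checking the "hate" threshold first gives the most severe label priority when more than one threshold is met.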

In [None]:
import pandas as pd
import matplotlib.pyplot as pyplot
import seaborn as sns
import numpy as np
import nltk
import re

In [None]:
df=pd.read_excel('/content/translated.xlsx')

In [None]:
df.head(10)

Unnamed: 0,text
0,Dont remember the last time hindus crashed a p...
1,Being a Muslim it is our duty to te...
2,Very good
3,All Indian muslim go Pakistan
4,So modi pushing for more children 🧒
5,40 million Hindus killed in bangladesh
6,He is telling what people want every politicia...
7,🫡🫡 India
8,modi is not anti muslim\npakistanis dont want ...
9,Please 🙏 muslim leave india 😂😂😂😂😂


In [None]:
# df=df.drop('Unnamed: 1',axis=1)

In [None]:
# df.head()

In [None]:
# removal of capitalization
def lower_case(text):
    return text.lower()
df['text'] = df['text'].apply(lower_case)

In [None]:
df.head(10)

Unnamed: 0,text
0,dont remember the last time hindus crashed a p...
1,being a muslim it is our duty to te...
2,very good
3,all indian muslim go pakistan
4,so modi pushing for more children 🧒
5,40 million hindus killed in bangladesh
6,he is telling what people want every politicia...
7,🫡🫡 india
8,modi is not anti muslim\npakistanis dont want ...
9,please 🙏 muslim leave india 😂😂😂😂😂


In [None]:
# Compile the regex pattern to match @mentions
regex_pat = re.compile(r'@[\w\-]+')

# Function to remove @mentions from the text
def remove_mentions(text):
    return re.sub(regex_pat, '', text)

# Apply the function to the 'text' column
df['text'] = df['text'].apply(remove_mentions)
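
A quick check of what the mention pattern matches, using two made-up sample strings. The same pattern is applied here outside pandas; note that it also strips the part of an e-mail address that follows the `@`.

```python
import re

# Same pattern as in the cell above
MENTION_PATTERN = re.compile(r'@[\w\-]+')

samples = [
    "@john-doe nice video",
    "mail me at user@example.com",
]
stripped = [MENTION_PATTERN.sub('', s) for s in samples]
print(stripped)  # → [' nice video', 'mail me at user.com']
```

The leading space left behind in the first sample is handled by the later whitespace-collapsing and `strip()` steps.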

In [None]:
df.head(10)

Unnamed: 0,text
0,dont remember the last time hindus crashed a p...
1,being a muslim it is our duty to te...
2,very good
3,all indian muslim go pakistan
4,so modi pushing for more children 🧒
5,40 million hindus killed in bangladesh
6,he is telling what people want every politicia...
7,🫡🫡 india
8,modi is not anti muslim\npakistanis dont want ...
9,please 🙏 muslim leave india 😂😂😂😂😂


In [None]:
# Removal of extra spaces using pandas' str.replace with regex=True
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)

In [None]:
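# Replace runs of whitespace with a single space (regex=True is required for the pattern to be treated as a regex in pandas >= 2.0)
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)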

In [None]:
# from google.colab import files
# df.to_excel('trans.xlsx', index=False)
# files.download('trans.xlsx')

In [None]:
#finding duplicates comments
# df[df.duplicated(subset='text')]

In [None]:
# Remove duplicates while keeping the first occurrence
# df = df.drop_duplicates(subset='text', keep='first')

In [None]:
# Optionally, reset index after removing duplicates
# df.reset_index(drop=True, inplace=True)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21428 entries, 0 to 21427
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    21428 non-null  object
dtypes: object(1)
memory usage: 167.5+ KB


In [None]:
# import re

# # Assuming your text data is in the 'text' column of the dataframe
# special_chars = set()

# # Regular expression to match special characters (excluding alphanumeric characters and spaces)
# regex = re.compile(r'[^a-zA-Z0-9\s]')

# # Loop through each text entry in the dataframe
# for text in df['text']:
#     # Find all special characters in the text
#     found_chars = regex.findall(str(text))  # Convert to string to avoid errors
#     special_chars.update(found_chars)

# # Display the unique special characters found
# print(special_chars)

In [None]:
# Removing leading and trailing whitespace from the 'text' column
df['text'] = df['text'].str.strip()

In [None]:
hash_comments = df[df['text'] == '#value!']

In [None]:
hash_comments

Unnamed: 0,text
5954,#value!
15438,#value!
20944,#value!
21214,#value!


In [None]:
df=df[df['text'] != '#value!']

In [None]:
# Install Emoji library.
!pip install emoji

Collecting emoji
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.0


In [None]:
# Import module emoji
import emoji

In [None]:
# Function to extract emojis from a comment
def extract_emojis(comment):
    return ''.join([char for char in comment if char in emoji.EMOJI_DATA])

# Apply the function to the 'text' column
emojis = df['text'].apply(extract_emojis)

# Display the DataFrame with extracted emojis
print(emojis)

0        ☪
1         
2         
3         
4        🧒
        ..
21423     
21424     
21425     
21426     
21427     
Name: text, Length: 21424, dtype: object


In [None]:
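# Concatenate every emoji character found across all comments.
# The name `str` shadows Python's built-in type; it is kept because later cells use it.
str = ''.join(c for text in df.text for c in text if c in emoji.EMOJI_DATA)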

In [None]:
# How many emojis do we have in our dataset?
len(str)

18668

In [None]:
# This is what our str looks like
str

'☪🧒\U0001fae1\U0001fae1🙏😂😂😂😂😂😢😢😢😢😂😂😢😢😂❤😊😂❤❤❤😅😅😅❤🤔🤔🤔🤔😂😂😂💩💀❤❤😂🙏🩴😢😢❤❤😊😡😡😡🙏😂😂😂😂😂😂⛑🍬❤❤😂😂😂😂😂😂🗿😂😂😂😂😊😊❤❤❤❤❤❤❤❤❤❤❤❤❤❤❤😊😅😅🤬🤬🤬❤❤❤❤👏❤😂😂😅😂❤❤❤😂😂😂🚩🚩🚩🚩😂😂😂😅😅😢☕👎☕☕\U0001faf5😃😃😂😂😂😂😅😂😅😂😅❤❤❤😂🙏🙏🙏😢❤❤❤❤🙂🔥💪😊😢😂😂😂😂❌✅😂😂😂😂😂😂🥱😂🤦♂🤷♂🕉🙏❤🕉😢😢😊😊🔥🔥🔥❤❤❤🙏🙏🙏🤮🤮🤮❤❤❤❤❤❤❤❤❤🌼🌺🙏🌼🌺🙏❤❤❤❤❤❤😢😢😢😢😢❤😂✅😢👎👎👎👎👎👎❤😂😂😂😂😂😂✅🤚🤚🌎😅😅😡😭😭😢🤣🤣🤣🤣🤣😂😡😡😡😡😡😡😡😡😡😡😡😂✌💔🥲😂😂😂😂😂😂❤🚩🚩🚩🚩😂😂👈👈👎👎👎😂😂😂😂😂😂😂😂🤣🤣🤣🤣😢😢😢😢😢😢😢😢🕉❤😢😢😢😢😢❤😂😂😂😂😂😅🥶🗿😂🗿❤❤❤❤❤👏😊👎😱😈👿😡😱🤪😨😡👿👿👿👿😂😂💞🤲🕌💖🤞😂😂😂🎉😡😂😂🎉😂❤❤❤❤❤❤❤😂😂👹👹❌✅❤❤❤❤😊😂🗳🤬🤬🤬😡😡😢😢😭😢🔥🔥🔥😂😊😂😂😂😂😂😂🕉🌸☝🏻💚🤍😂😂😢😢😢👍👍👍❤🤔💔😂😂😂😂❎✅⚡😂😡😂🥺🥺☪☝🏼☝🏾☝🏾☝🏾☝🏾☝🏽😢😂😂😂😂😂😂😂😂😂😂😡😡😡🔥😂😮😢😮😮😮😮😮😮😮😅😡😂😂😂💩🤮🤮😤😂😂❌✅🙄🙏😕❤💙😂😂😂😂😂😂😂🤡❤❤❤❤❤😮❤📈📈📈📈🗿🗿🗿🗿😢😢😊😢😢😂😂😂😂😂😂😂🙏😢🤣🤣🤣🤣❌✅💀😂😅😅😅✌😂🤣💔❤💯💯💪💪💪🚩🚩🚩😈🐄🚩🚩🤣🤣🤣🤣🤣🤣😂😮👍💐💐💐😂😢😢😢🗳😢😢😢😢😢😢😢😅🤩🤩😂❌😡😠😡😠❤❤❤😊😂😂😂😂😂💩💩💩💩😢😢🙂🤣🤦🏼♂😊😢👹👺😡🤡😂😂😢👈🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶🐶😂😂😂😂😂😂😂☕☕☕😅😂😂😂❤🚩😡😡😡😡😡😡😡😡😡😡😡😡😡😡😡😡😡❌❌❌❌❌❌❌❌❌❌❌❌🙂❤❤❤❤😢😢😢😢😢😢😢❤❤❤❤😊😢😢😢😢😢😢😮💨🦛🦄🥲🥲🥲😢😢😢😢🎉❤😂😂😂😂😂😂☪😅😅😅😅😂🪑😢😨😂😂🤢😢🅱😂❤😢😢😢😂✊😆🤬🤬🤬😂😂😂😂🤡🙌🏻🙃❤☪😁😢😂💪👍💪👍❤🤣🤣👍😢😢😢😴😂😂❤❤❤❤❤❤😢🚩🚩🚩🚩🚩🚩🚩🚩🕉🔯❤😅😡😡😡😡😡😡😢😢😢😢😡🎈😂😂😂😂😂😃😆😡❤❤❤❤🚩🚩🚩🚩😂😢😢😢😢👍🙏❤❤❤😢😅🐖☕🐶😂😂😂😂🙏🙏😮😡😢😢❤🤣🤣🤣😶❤🩹✊❤🩹😢😂😂👎👎👎😆😆😆😆😆😆😆😆😆❤😂☯😡🛑🤐🤫😢😢😢😢😢☠😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😂😅👹👹👍🙏❤😡👍💯👎👎👎🙄👺👺👺😜😜🥊🥊🥊🥊😡😍😢😢🤮💩🐖🐍🤣🤣😅😅☪🕉☪😂❌❌❌❌❌❌❌❌❌❌✅✅✅✅✅✅✅✅✅✅✅

In [None]:
# Let's count the unique emojis
result={}
for i in set(str):
    result[i]= str.count(i)

In [None]:
result.items()

dict_items([('🥧', 1), ('🦜', 1), ('👟', 3), ('🎖', 6), ('👿', 11), ('👏', 103), ('😱', 11), ('♓', 3), ('🧠', 4), ('🐷', 9), ('😘', 4), ('👣', 1), ('🤪', 24), ('🚀', 12), ('🤓', 1), ('🖤', 1), ('😊', 308), ('👹', 11), ('🤮', 34), ('🍬', 1), ('🚫', 2), ('🌐', 1), ('🩴', 4), ('🌍', 8), ('🍆', 5), ('👦', 5), ('🏵', 9), ('🔬', 2), ('🤎', 17), ('💺', 1), ('😠', 14), ('🥲', 6), ('😑', 4), ('👽', 3), ('👨', 13), ('🥀', 4), ('😺', 1), ('✖', 4), ('\U0001fab7', 137), ('🐮', 15), ('🚷', 2), ('💷', 1), ('💪', 87), ('😁', 50), ('🤢', 1), ('🍀', 7), ('☺', 3), ('🍍', 1), ('😝', 12), ('🛑', 1), ('🏿', 33), ('🌽', 1), ('💀', 5), ('📈', 4), ('🍌', 9), ('🤠', 1), ('🤟', 5), ('🦄', 4), ('🤷', 11), ('🐖', 16), ('🥵', 3), ('🧒', 1), ('⛏', 4), ('✝', 6), ('🙏', 767), ('🌙', 2), ('🤫', 5), ('🐭', 2), ('🦶', 1), ('❌', 83), ('🙀', 4), ('🙂', 13), ('⚔', 2), ('🎉', 684), ('😵', 1), ('😛', 1), ('🗳', 17), ('💯', 268), ('🚩', 1903), ('🏽', 11), ('\U0001faf2', 1), ('🤍', 22), ('🐩', 1), ('🌷', 50), ('🙉', 5), ('🤥', 1), ('😖', 1), ('♋', 1), ('💓', 7), ('🌵', 5), ('🤗', 5), ('🏹', 15), ('\U0001fae2

In [None]:
# Build a dictionary `final` that maps each emoji (key) to its count (value), sorted by ascending count
final={}
for key, value in sorted(result.items(), key= lambda item:item[1]):
    final[key]= value
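
The counting and sorting steps above can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch on a short made-up emoji string, not the notebook's own code.

```python
from collections import Counter

# Made-up stand-in for the concatenated emoji string built earlier
emoji_string = "😂😂🙏🚩🚩🚩"

# Counter tallies each character in a single pass
counts = Counter(emoji_string)

# most_common(n) returns the n most frequent items, highest count first
top_two = counts.most_common(2)
print(top_two)  # → [('🚩', 3), ('😂', 2)]
```

Unlike calling `count` once per unique character, `Counter` scans the string only once, which matters more as the string grows.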

In [None]:
# Display our final result
final

{'🥧': 1,
 '🦜': 1,
 '👣': 1,
 '🤓': 1,
 '🖤': 1,
 '🍬': 1,
 '🌐': 1,
 '💺': 1,
 '😺': 1,
 '💷': 1,
 '🤢': 1,
 '🍍': 1,
 '🛑': 1,
 '🌽': 1,
 '🤠': 1,
 '🧒': 1,
 '🦶': 1,
 '😵': 1,
 '😛': 1,
 '\U0001faf2': 1,
 '🐩': 1,
 '🤥': 1,
 '😖': 1,
 '♋': 1,
 '\U0001fae2': 1,
 '✈': 1,
 '🙅': 1,
 '🪁': 1,
 '👀': 1,
 '\U0001fae4': 1,
 '🍎': 1,
 '👸': 1,
 '🐜': 1,
 '👖': 1,
 '🦠': 1,
 '🦗': 1,
 '🛺': 1,
 '🅱': 1,
 '🦉': 1,
 '🔯': 1,
 '📰': 1,
 '📲': 1,
 '🥋': 1,
 '🩳': 1,
 '🦟': 1,
 '💨': 1,
 '🦋': 1,
 '🤙': 1,
 '👕': 1,
 '\U0001f979': 1,
 '☯': 1,
 '🦨': 1,
 '💛': 1,
 '🐾': 1,
 '💜': 1,
 '🕊': 1,
 '🪐': 1,
 '☘': 1,
 '🦛': 1,
 '🦇': 1,
 '📚': 1,
 '🍷': 1,
 '⛑': 1,
 '\U0001faf6': 1,
 '\U0001fae8': 1,
 '😬': 1,
 '💌': 1,
 '🌿': 1,
 '\U0001faf4': 1,
 '🌝': 1,
 '👻': 1,
 '‼': 1,
 '🤧': 1,
 '🏍': 1,
 '📿': 1,
 '🗡': 1,
 '🦾': 1,
 '🎩': 1,
 '🐦': 1,
 '⚠': 1,
 '🐀': 1,
 '🎓': 1,
 '👐': 1,
 '🍒': 1,
 '📷': 1,
 '🎀': 1,
 '🔘': 1,
 '🏁': 1,
 '🐑': 1,
 '🪲': 1,
 '🐂': 1,
 '🧍': 1,
 '🍥': 1,
 '🌀': 1,
 '🚫': 2,
 '🔬': 2,
 '🚷': 2,
 '🌙': 2,
 '🐭': 2,
 '⚔': 2,
 '😤': 2,
 '🩸': 2,
 '🗞': 2,
 '🤯': 2,
 

In [None]:
# Collect the emojis and their counts; `final` is sorted by ascending count,
# so the last 10 entries are the 10 most-used emojis
keys = [*final.keys()]
values = [*final.values()]

In [None]:
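# Build the top-10 table. values[-10:] is a slice, so each emoji gets its own count;
# the scalar values[-10] would be broadcast to every row (the same count, e.g. 308, throughout).
emojis = pd.DataFrame({'chars': keys[-10:], 'num': values[-10:]})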

In [None]:
emojis.head()

Unnamed: 0,chars,num
0,😊,308
1,👍,308
2,😢,308
3,🤣,308
4,🎉,308


In [None]:
# Import libraries and modules
import plotly.graph_objs as go
from plotly.offline import iplot

In [None]:
graph = go.Bar(
x= emojis['chars'],
y= emojis['num'])
iplot([graph] )
# Hover over the bars to view the emojis along with the count

In [None]:
from transformers import AutoTokenizer
import emoji

In [None]:
import re
import emoji

# Function to remove duplicate emojis
def remove_duplicate_emojis(text):
    # Create a set to track used emojis
    used_emojis = set()
    # Iterate over each character in the text
    result = []
    for char in text:
        # Check if the character is an emoji
        if char in emoji.EMOJI_DATA:
            # If emoji is not already used, add it to result and mark as used
            if char not in used_emojis:
                used_emojis.add(char)
                result.append(char)
        else:
            # If it's not an emoji, just add the character to result
            result.append(char)
    return ''.join(result)

# Apply the function to the 'text' column in your dataset
df['text'] = df['text'].apply(remove_duplicate_emojis)

# Display the updated DataFrame
print(df[['text']])


                                                    text
0      dont remember the last time hindus crashed a p...
1      being a muslim it is our duty to tell you on i...
2                                              very good
3                          all indian muslim go pakistan
4                    so modi pushing for more children 🧒
...                                                  ...
21423                          bjp win 25 seat in bengal
21424  an opinion poll done on the theme of 400 plus ...
21425                              paid channel from bjp
21426                                maharashtra bjp+ 44
21427                                      manipur bjp 1

[21424 rows x 1 columns]


In [None]:
# from google.colab import files
# df.to_excel('emoji.xlsx', index=False)
# files.download('emoji.xlsx')

In [None]:
# tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [None]:
# list1=final.keys()

In [None]:
# print(type(final))


In [None]:
# def emoji2description(text):
#   return emoji.replace_emoji(text, replace=lambda chars, data_dict: ' '.join(data_dict['en'].split('_')).strip(':'))

In [None]:
def emoji2description(text):
    # Replace each emoji with its English shortcode, e.g. :child: (data_dict['en'] already includes the colons)
    return emoji.replace_emoji(text, replace=lambda chars, data_dict: data_dict['en'] )


In [None]:
df['text'] = df['text'].apply(emoji2description)

In [None]:
df.head(10)

Unnamed: 0,text
0,dont remember the last time hindus crashed a p...
1,being a muslim it is our duty to tell you on i...
2,very good
3,all indian muslim go pakistan
4,so modi pushing for more children :child:
5,40 million hindus killed in bangladesh
6,he is telling what people want every politicia...
7,:saluting_face: india
8,modi is not anti muslim pakistanis dont want u...
9,please :folded_hands: muslim leave india :face...


In [None]:
# from google.colab import files
# df.to_excel('emoji.xlsx', index=False)
# files.download('emoji.xlsx')

In [None]:
# Assuming your text data is in the 'text' column of the dataframe
special_chars = set()

# Regular expression to match special characters (excluding alphanumeric characters and spaces)
regex = re.compile(r'[^a-zA-Z0-9\s]')

# Loop through each text entry in the dataframe
for text in df['text']:
    # Find all special characters in the text
    found_chars = regex.findall(text)
    special_chars.update(found_chars)

# Display the unique special characters found
print(special_chars)

{'"', ';', '\U0001fbed', '✓', '౼', ',', '#', '>', '★', ']', '→', '&', '{', '.', '∞', '٫', '\u200b', '☬', '|', '~', '।', '″', '￼', ')', '±', '¹', '\u200d', '₹', '`', '•', '=', '%', '[', '”', '\u2060', '–', '-', '(', '$', '❝', '/', '‘', '}', '_', '<', ':', '²', '@', '+', "'", '۔', '°', '⁹', '✧', '?', '\\', '’', '*', '》', '!', '❞', '—', '“', '·', '…'}


In [None]:
# Function to remove special characters except # * @ ! ? :
def clean_comments(comment):
    # Keep letters, numbers, whitespace, and the characters listed above.
    # Underscores are removed too, so emoji shortcodes such as :saluting_face: become :salutingface:
    return re.sub(r'[^a-zA-Z0-9\s#*@!?:]', '', comment)

# Apply the function to the 'text' column
df['text'] = df['text'].apply(clean_comments)
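
If the emoji shortcodes should survive this step intact, one option is to add the underscore to the set of kept characters. The sketch below shows this variant on a made-up comment; `clean_keep_shortcodes` is a hypothetical name, not a function used elsewhere in this notebook.

```python
import re

# Same character set as clean_comments, plus the underscore
KEEP_PATTERN = re.compile(r'[^a-zA-Z0-9\s#*@!?:_]')

def clean_keep_shortcodes(comment):
    """Remove special characters but keep underscores inside emoji shortcodes."""
    return KEEP_PATTERN.sub('', comment)

sample = ":saluting_face: india, jai hind!"
cleaned_sample = clean_keep_shortcodes(sample)
print(cleaned_sample)  # → :saluting_face: india jai hind!
```

Whether the underscore helps depends on the tokenizer: it may split `saluting_face` and `salutingface` into different subword pieces.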

In [None]:
df.head(10)

Unnamed: 0,text
0,dont remember the last time hindus crashed a p...
1,being a muslim it is our duty to tell you on i...
2,very good
3,all indian muslim go pakistan
4,so modi pushing for more children :child:
5,40 million hindus killed in bangladesh
6,he is telling what people want every politicia...
7,:salutingface: india
8,modi is not anti muslim pakistanis dont want u...
9,please :foldedhands: muslim leave india :facew...


In [None]:
# Removal of extra spaces using pandas' str.replace with regex=True
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)


In [None]:
!pip install transformers torch




In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load DistilBERT tokenizer and model
model_name = "distilbert-base-uncased"  # Or a fine-tuned version if available
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)  # For three classes





Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Class index -> label; adjust to match how your model was fine-tuned
LABELS = ["not hate", "offensive", "hate"]

model.eval()  # make sure dropout is disabled for inference

def classify_comment(comment):
    # Tokenize one comment, truncating to DistilBERT's 512-token limit
    inputs = tokenizer(comment, return_tensors="pt", truncation=True, padding=True, max_length=512)
    # Gradients are not needed for inference
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
    pred_label = torch.argmax(probs, dim=-1).item()
    return LABELS[pred_label]


In [None]:
# Apply classification function
df['label'] = df['text'].apply(classify_comment)
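
Classifying one comment per call is simple but slow on roughly 21,000 rows. Hugging Face tokenizers accept a list of strings, so comments can be processed in batches instead. The sketch below shows only the batching logic: `batched`, `label_in_batches`, and the stand-in `fake_predict_batch` are hypothetical names, and a real `predict_batch` would tokenize the batch and run the model on it.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def label_in_batches(comments, predict_batch, batch_size=32):
    """Apply `predict_batch` (list of texts -> list of labels) batch by batch."""
    labels = []
    for batch in batched(comments, batch_size):
        labels.extend(predict_batch(batch))
    return labels

# Stand-in predictor so the sketch runs without a model
def fake_predict_batch(batch):
    return ["not hate" for _ in batch]

comments = [f"comment {i}" for i in range(70)]
labels = label_in_batches(comments, fake_predict_batch, batch_size=32)
print(len(labels))  # → 70
```

The batch size is a trade-off between throughput and memory, so it is worth tuning to the available hardware.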

In [None]:
df.head(50)

Unnamed: 0,text,label
0,dont remember the last time hindus crashed a p...,not hate
1,being a muslim it is our duty to tell you on i...,not hate
2,very good,not hate
3,all indian muslim go pakistan,offensive
4,so modi pushing for more children :child:,offensive
5,40 million hindus killed in bangladesh,offensive
6,he is telling what people want every politicia...,offensive
7,:salutingface: india,not hate
8,modi is not anti muslim pakistanis dont want u...,not hate
9,please :foldedhands: muslim leave india :facew...,not hate


In [None]:
df['label'].value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
not hate,15817
offensive,5501
hate,106


In [None]:
df.to_excel('distbert_comnts.xlsx', index=False)