## **Urdu Sentence Annotation for Hate Speech Detection**

### **Objective**
The objective of this assignment was to scrape Urdu comments from YouTube videos and label them as offensive, hateful, or neutral to build a dataset for hate speech detection. The dataset is meant to support training machine learning models that detect hate speech in Urdu. The pipeline had several stages, including data collection, preprocessing, and labeling.

The key objectives were to:

- Collect 300 Urdu comments (100 from each category: offensive, hateful, and neutral).

- Clean the collected text to remove irrelevant characters.

- Label the comments based on their content into three categories: offensive, hateful, or neutral.

- Validate the annotation with the help of a peer.

### **Data Collection**

The data collection was carried out by scraping YouTube comments using the YouTube Data API. Several YouTube video URLs were chosen from Urdu news, political, and cultural discussions, where comments might include offensive or hateful content. The process used in this project was:

**API Setup:** The `googleapiclient.discovery` module was used to interact with the YouTube Data API. The API key was obtained via the Google Developer Console.

**Video Selection:** A few YouTube videos were chosen based on their relevance to topics that could attract hate speech or offensive language.

**Comment Extraction:** The comments were extracted by calling the API once per video ID. The API returns comments in pages of up to 100, and the script follows `nextPageToken` until no pages remain, so every available top-level comment is collected.

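**Keeping only Nastaliq Urdu:** A filter was applied to each comment. A comment was kept only if it was longer than 15 characters and at least 60% of its characters fell in the Unicode Arabic block (U+0600 to U+06FF), which contains the Urdu alphabet. This drops Roman Urdu and English comments. Character-level cleaning happens later, in preprocessing.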

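**Sampling:** The collected comments were de-duplicated, shuffled, and capped at 1,000 before being saved for labeling, giving a pool from which the target of 300 labeled comments (100 per category) could be reached.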
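Rather than writing the API key into the notebook, it can be read from the environment at run time. The following is a minimal sketch; the variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions for illustration and are not part of the original script.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical name)."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty key to the API
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

The returned value can then be passed as `developerKey` to `build('youtube', 'v3', ...)`, which keeps the key out of the saved notebook.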

In [1]:
import re
import random
import pandas as pd
from googleapiclient.discovery import build

In [None]:
# Setup YouTube API
api_key = 'YOUR_API_KEY'  # <-- Put your YouTube Data API v3 key here; do not commit a real key
youtube = build('youtube', 'v3', developerKey=api_key)

# List of video URLs
video_urls = [
    "https://www.youtube.com/watch?v=B85_wZF7XCY",
    "https://www.youtube.com/watch?v=vSQeAgGUA80",   
    "https://www.youtube.com/watch?v=sCjNi75u-y4",   
    "https://www.youtube.com/watch?v=oHEs19lx0ZU",   
    "https://www.youtube.com/watch?v=8LpGOBQxwhU",   
    "https://www.youtube.com/watch?v=-0jcXcN7F_I",
    "https://www.youtube.com/watch?v=WNfzFSSRGrU",
    "https://www.youtube.com/watch?v=4zzsVCsTxGU",   
    "https://www.youtube.com/watch?v=0f2RTcz9Dho" 
]

# Urdu script detector
def is_urdu_nastaliq(text, threshold=0.6):
    urdu_chars = re.findall(r'[\u0600-\u06FF]', text)
    return len(urdu_chars) / max(len(text), 1) >= threshold

comments = []

# Fetch comments
for video_url in video_urls:
    video_id = video_url.split('v=')[1]
    nextPageToken = None
    try:
        while True:
            response = youtube.commentThreads().list(
                part="snippet",
                videoId=video_id,
                pageToken=nextPageToken,
                maxResults=100,
                textFormat="plainText"
            ).execute()

            for item in response['items']:
                comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
                if comment and len(comment) > 15 and is_urdu_nastaliq(comment):
                    comments.append(comment)

            nextPageToken = response.get('nextPageToken')
            if not nextPageToken:
                break

    except Exception as e:
        print(f"Error fetching comments for {video_url}: {e}")

# De-duplicate, shuffle, and keep up to 1000 comments
comments = list(set(comments))
random.shuffle(comments)
selected_comments = comments[:1000]

# Save
df = pd.DataFrame(selected_comments, columns=["comments"])
df.to_csv("nastaliq_urdu_youtube_comments.csv", index=False, encoding='utf-8-sig')

print(f"Scraped {len(selected_comments)} Nastaliq Urdu comments! Saved to nastaliq_urdu_youtube_comments.csv")


Scraped 771 Nastaliq Urdu comments! Saved to nastaliq_urdu_youtube_comments.csv
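The script obtains each video ID with `video_url.split('v=')[1]`. That works for the URLs listed, but it would return `ID&t=30s` for a URL with extra query parameters and would fail on `youtu.be` short links. A more tolerant parser might look like the sketch below; `extract_video_id` is a hypothetical helper, not part of the original code.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the video ID from a YouTube watch URL or short link, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID in the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host.endswith("youtube.com"):
        # Watch URLs carry the ID in the "v" query parameter
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None
    return None
```

Returning `None` for unrecognized URLs lets the calling loop skip them instead of raising an `IndexError`.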


### **Data Preprocessing**
The preprocessing of comments included:

**Text Cleaning:** Using regular expressions, all non-relevant characters were removed, leaving only the Urdu characters.

- Emojis were removed

- Latin-script text, ASCII digits, and common punctuation were removed

- Excess spaces were removed

- Empty rows were dropped

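**Unicode Handling:** Only characters in the Arabic block (U+0600 to U+06FF) and spaces were kept. Characters from the Arabic Supplement, Arabic Extended-A, and Arabic Presentation Forms blocks were removed.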

In [None]:
# Load the CSV
df = pd.read_csv("nastaliq_urdu_youtube_comments.csv", encoding='utf-8-sig')

# Emoji pattern
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags
    u"\U00002700-\U000027BF"  # Dingbats
    u"\U000024C2-\U0001F251"  # Enclosed characters
"]+", flags=re.UNICODE)

# Arabic Unicode block (U+0600-U+06FF); it contains the Urdu alphabet
# along with other Arabic-script letters, digits, and punctuation
urdu_allowed = r'\u0600-\u06FF'

# Strip emojis, Latin text, ASCII digits, punctuation, and other Arabic blocks
def preprocess_comment(text):
    # Remove emojis
    text = emoji_pattern.sub('', text)
    # Remove English, digits, punctuation, etc.
    text = re.sub(r'[a-zA-Z0-9@#%^&*()_+=\[\]{}|\\:;"\'<>,./?!~`–“”\'\"]+', ' ', text)
    # Remove Arabic Supplement, Arabic Extended-A, and Arabic Presentation Forms A/B
    text = re.sub(r'[\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]', ' ', text)
    # Keep only Arabic-block (Urdu) characters and spaces
    text = re.sub(f'[^{urdu_allowed} ]+', ' ', text)
    # Normalize whitespace
    return re.sub(r'\s+', ' ', text).strip()

# Apply preprocessing
df['cleaned'] = df['comments'].astype(str).apply(preprocess_comment)

# Drop empty rows after cleaning
df = df[df['cleaned'].str.strip() != '']

# Save
df[['cleaned']].to_csv("preprocessed_urdu_youtube_comments.csv", index=False, encoding='utf-8-sig')

print(f"Preprocessing complete. Saved {len(df)} clean Urdu comments to preprocessed_urdu_youtube_comments.csv")


Preprocessing complete. Saved 1032 clean Urdu comments to preprocessed_urdu_youtube_comments.csv
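One detail of the cleaning step is worth noting: because the whole U+0600 to U+06FF block is kept, Arabic-Indic digits (U+0660 to U+0669) and Extended Arabic-Indic digits (U+06F0 to U+06F9) survive preprocessing even though ASCII digits are removed. If those should be dropped too, an extra pass along the following lines could be added. This is a sketch, and `strip_arabic_digits` is a hypothetical helper that is not part of the original pipeline.

```python
import re

# Arabic-Indic (U+0660-U+0669) and Extended Arabic-Indic (U+06F0-U+06F9) digits
ARABIC_DIGITS = re.compile(r'[\u0660-\u0669\u06F0-\u06F9]+')

def strip_arabic_digits(text):
    """Replace runs of Arabic-script digits with a space, then tidy whitespace."""
    text = ARABIC_DIGITS.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()
```

It could be applied to the `cleaned` column after `preprocess_comment`, for example with `df['cleaned'].apply(strip_arabic_digits)`.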


### **Annotation**

The comments were annotated manually with the interactive script below, using the following categories:

**Offensive (o):** Comments that use profane language or derogatory remarks aimed at individuals or groups.

**Hate Speech (h):** Comments that promote hatred or violence towards specific groups, often based on religion, ethnicity, or nationality.

**Neutral (n):** Comments that do not contain harmful language, hate speech, or offensive content. These are simply informational or neutral statements.


In [None]:
# Load the CSV file
input_file = 'preprocessed_urdu_youtube_comments.csv'
df = pd.read_csv(input_file)

# Add a new column for annotations
df['label'] = ''

print("Type 'h' for Hate Speech, 'o' for Offensive, 'n' for Neutral.")
print("----------------------------------------------------------")

# Loop through each comment
for idx, row in df.iterrows():
    print(f"\nComment {idx + 1}:")
    print(row['cleaned'])

    while True:
        annotation = input("Enter annotation (h/o/n): ").strip().lower()
        if annotation in ['h', 'o', 'n']:
            df.at[idx, 'label'] = annotation  # write to the 'label' column created above
            break
        else:
            print("Invalid input. Please enter 'h', 'o', or 'n'.")

# Save the annotated file
output_file = 'annotated_dataset.csv'
df.to_csv(output_file, index=False)

print(f"\nAnnotation complete! Saved as '{output_file}'.")


Type 'h' for Hate Speech, 'o' for Offensive, 'n' for Neutral.
----------------------------------------------------------

Comment 1:
اپنا آئی ایم ایف والا قرضہ دے کر مرنا نہیں تو تمہارے حصے کا ہمیں بھرنا پڑے گا



Comment 2:
ڈکی حرامخور ہیرا منڈی کی پیداوار ہے لعنت تیری نسل پر جھوٹے

Comment 3:
ایک میاں جہنم میں جائے گا اپنی بیوی کو منع نہیں کرتا بھائی جان میں جائے گا باپ باب جہنم میں جائے گا ماں جہنم میں جائے گی جو منع نہیں کرتی پردے کا باقی اپ کی موج ہے بے حیائی پھیلاتے جائیں اپ کو بہت ثواب ملے گا بے حیائی پھیلاتے جائیں بالکل منع نہ کریں

Comment 4:
شامیر بھی گانڈو ہے اور شام بھی بس فروگی اچھی ہے

Comment 5:
بہن سے کبھی ادھار مت لینا لیا تھا بار دے چکا ہوں ابھی بھی باقی ہے

Comment 6:
محنت تو بھکاری بھی کرتے ہیں اور ڈاکو بھی۔آخر اِنکی محنت کیوں کِسی کو نظر نہیں آتی ؟

Comment 7:
لکھتے قلم ختم ہو جائے گی لیکن میرے نبی کی شان کبھی کم نہیں ہوگی

Comment 8:
ڈکی بھائی آپ ولوگ شروع کرے ہم آپ کے ساتھ ہے

Comment 9:
ڈٹے رہو ڈکی بھائ

Comment 10:
اِنَرَبِّیِ یَفعَلُمَا یَشَاَءُ بیشک میرا رب جو چاھے، کر سکتا ھے

Comment 11:
ہر بےعزتی والے کمنٹس پر ڈھیٹ بن کر ہنس رہے ہیں

Comment 12:
آپ بی اپنا کام اچھا کرو ۔ ہر وقت دوسرے کی عزت اڑاتے ہو بس۔

Comment 13:
اللہ نہ کرے شہزادے آپ کو کچھ ھو اللہ اپنی امان می

### **Validation**

The annotated dataset was validated with the help of a peer, and the final dataset file `annotated_data_validated.csv` was created.
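The report does not say how agreement with the peer was measured. One common option for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration under that assumption; the function name and the example label lists are made up and do not come from the actual dataset.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for eight comments (h = hate, o = offensive, n = neutral)
annotator = ['h', 'o', 'n', 'n', 'o', 'h', 'n', 'o']
peer      = ['h', 'o', 'n', 'o', 'o', 'h', 'n', 'n']
kappa = cohens_kappa(annotator, peer)
```

Here the two lists agree on 6 of 8 items (0.75), chance agreement is 22/64, and kappa works out to 13/21, or about 0.62.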

### **Results**

The final dataset contains 300 comments, distributed as follows:

- 100 Offensive Comments

- 100 Hate Speech Comments

- 100 Neutral Comments

Each comment was carefully reviewed and assigned the appropriate label. The comments were saved in the CSV file `annotated_data_validated.csv`, with each comment paired with its corresponding label.
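If annotation yields more than 100 comments in a category, a balanced set can be drawn by sampling a fixed number per label. The following sketch uses only the standard library and assumes the annotated data is available as `(comment, label)` pairs. The helper `sample_balanced` and the seed value are hypothetical, and the original report does not describe how the final 300 were selected.

```python
import random
from collections import defaultdict

def sample_balanced(rows, per_label=100, seed=42):
    """Draw `per_label` random comments for each label from (comment, label) pairs."""
    by_label = defaultdict(list)
    for comment, label in rows:
        by_label[label].append(comment)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    sampled = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_label:
            raise ValueError(f"Only {len(pool)} comments labeled '{label}'; need {per_label}.")
        sampled.extend((comment, label) for comment in rng.sample(pool, per_label))
    return sampled
```

Raising an error when a category is short makes it clear that more comments need to be annotated before the 100/100/100 split can be reached.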