#🧪 Practical 2: Dataset Creation – Clean, Annotate & Validate a Text Classification Dataset

#🎯 Learning Objectives
By the end of this practical, you will be able to:

Construct a text classification dataset from scratch

Perform data cleaning (punctuation, symbols, casing)

Simulate manual and AI-assisted annotation

Apply quality control and export the dataset to .csv

#✅ Step-by-Step Guide
#🔧 Step 1: Install Required Libraries

In [1]:
!pip install pandas langdetect langchain langchain-google-genai google-generativeai



#🔐 Step 2: Configure Gemini API (Optional AI-Labeling)

In [2]:
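import os
from getpass import getpass
from langchain_google_genai import ChatGoogleGenerativeAI

# Use free-tier Gemini 1.5 Flash.
# Never hardcode an API key in a notebook: read it from the environment,
# or prompt for it so it is not saved with the file.
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")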

llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    temperature=0.3,
    convert_system_message_to_human=True
)


#📦 Step 3: Simulate Raw Text Dataset (Can Be from CSV/Web/Forms)

In [3]:
import pandas as pd

data = {
    "text": [
        "I love this product! It's amazing 😍",
        "Horrible experience. Will never buy again!!",
        "Delivery was on time. Packaging was good.",
        "Customer support didn’t help me. Waste of money.",
        "Wow, absolutely loved it! <3",
        "Meh. It was okay I guess...",
        "Terrible. Broke after one day.",
        "Super fast shipping, very happy!",
        "Why does this even exist?? useless",
        "The quality is top-notch. Highly recommend."
    ]
}

df = pd.DataFrame(data)
df.head()


Unnamed: 0,text
0,I love this product! It's amazing 😍
1,Horrible experience. Will never buy again!!
2,Delivery was on time. Packaging was good.
3,Customer support didn’t help me. Waste of money.
4,"Wow, absolutely loved it! <3"


#🧹 Step 4: Clean the Text

In [4]:
import re
import string

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove ASCII punctuation only
    text = re.sub(r"\d+", "", text)  # remove numbers
    text = re.sub(r"\s+", " ", text)  # remove extra whitespace
    return text.strip()

df["clean_text"] = df["text"].apply(clean_text)
df.head()


Unnamed: 0,text,clean_text
0,I love this product! It's amazing 😍,i love this product its amazing 😍
1,Horrible experience. Will never buy again!!,horrible experience will never buy again
2,Delivery was on time. Packaging was good.,delivery was on time packaging was good
3,Customer support didn’t help me. Waste of money.,customer support didn’t help me waste of money
4,"Wow, absolutely loved it! <3",wow absolutely loved it
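
In the output above, the emoji in row 0 and the curly apostrophe in row 3 survive cleaning, because `string.punctuation` covers only ASCII punctuation. The sketch below is one possible Unicode-aware variant, assuming you want to delete every punctuation and symbol character. The name `clean_text_unicode` is hypothetical and not part of the practical.

```python
import re
import unicodedata

def clean_text_unicode(text):
    """Lowercase, then strip URLs, digits, and all Unicode punctuation/symbols."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)  # remove URLs
    # Drop characters whose Unicode category is punctuation (P*) or symbol (S*);
    # this covers ASCII punctuation as well as curly quotes and emoji.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("P", "S")
    )
    text = re.sub(r"\d+", "", text)   # remove numbers
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()
```

Whether emoji should be removed is a design choice: for sentiment data they often carry signal, so you may prefer to keep them and strip only punctuation.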


#🌐 Step 5: Filter Non-English Texts (Optional)

In [5]:
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

# Note: langdetect is non-deterministic and unreliable on very short texts;
# set DetectorFactory.seed = 0 before detecting if you need reproducible results.
def detect_language(text):
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty string)
        return "error"

df["lang"] = df["clean_text"].apply(detect_language)
# Filter, drop the helper column, and reindex in one chain so pandas returns
# a new DataFrame instead of modifying a slice (avoids SettingWithCopyWarning)
df = df[df["lang"] == "en"].drop(columns=["lang"]).reset_index(drop=True)
df.head()


Unnamed: 0,text,clean_text
0,I love this product! It's amazing 😍,i love this product its amazing 😍
1,Horrible experience. Will never buy again!!,horrible experience will never buy again
2,Delivery was on time. Packaging was good.,delivery was on time packaging was good
3,Customer support didn’t help me. Waste of money.,customer support didn’t help me waste of money
4,"Wow, absolutely loved it! <3",wow absolutely loved it


#✍️ Step 6: Annotate Labels (Manual + AI-Suggested)

We define:

Positive = praise, love, satisfaction

Negative = complaint, anger, bad quality

Neutral = factual, mixed

In [7]:
def gemini_label_suggestion(text):
    prompt = f"""You are a classifier for sentiment.

Classify the following sentence as: Positive, Negative, or Neutral.

Sentence: {text}

Label:"""
    response = llm.invoke(prompt)
    return response.content.strip()

# Apply Gemini suggestion
df["label_suggested"] = df["clean_text"].apply(gemini_label_suggestion)
df.head()




Unnamed: 0,text,clean_text,label_suggested
0,I love this product! It's amazing 😍,i love this product its amazing 😍,Positive
1,Horrible experience. Will never buy again!!,horrible experience will never buy again,Negative
2,Delivery was on time. Packaging was good.,delivery was on time packaging was good,Positive
3,Customer support didn’t help me. Waste of money.,customer support didn’t help me waste of money,Negative
4,"Wow, absolutely loved it! <3",wow absolutely loved it,Positive
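
The cell above makes one API call per row and stops at the first failed call. The sketch below shows one way to make that loop more forgiving: each distinct text is labeled once, and failures are marked for later review. It assumes a failed call should not abort the run. `label_unique`, `label_fn`, and the `"Unlabeled"` fallback are hypothetical names introduced for illustration.

```python
def label_unique(texts, label_fn, fallback="Unlabeled"):
    """Label each distinct text once and reuse the result for duplicates."""
    cache = {}
    labels = []
    for text in texts:
        if text not in cache:
            try:
                cache[text] = label_fn(text)
            except Exception:
                # Assumption: a failed call is recorded, not retried,
                # so the row can be reviewed by hand afterwards.
                cache[text] = fallback
        labels.append(cache[text])
    return labels
```

In this notebook it could be used as `df["label_suggested"] = label_unique(df["clean_text"], gemini_label_suggestion)`.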


#✅ Step 7: Manually Correct Labels (Optional for real scenarios)

In [8]:
# Simulate human-approved label (normally done via UI or review)
df["label_final"] = df["label_suggested"]  # or replace manually
df[["text", "clean_text", "label_suggested", "label_final"]]


Unnamed: 0,text,clean_text,label_suggested,label_final
0,I love this product! It's amazing 😍,i love this product its amazing 😍,Positive,Positive
1,Horrible experience. Will never buy again!!,horrible experience will never buy again,Negative,Negative
2,Delivery was on time. Packaging was good.,delivery was on time packaging was good,Positive,Positive
3,Customer support didn’t help me. Waste of money.,customer support didn’t help me waste of money,Negative,Negative
4,"Wow, absolutely loved it! <3",wow absolutely loved it,Positive,Positive
5,Meh. It was okay I guess...,meh it was okay i guess,Label: Neutral,Label: Neutral
6,"Super fast shipping, very happy!",super fast shipping very happy,Positive,Positive
7,Why does this even exist?? useless,why does this even exist useless,Negative,Negative
8,The quality is top-notch. Highly recommend.,the quality is topnotch highly recommend,Label: Positive,Label: Positive
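
Because `label_final` is a straight copy of `label_suggested` here, the two columns agree by construction. Once a reviewer actually edits labels, the agreement between the model and the human becomes a useful quality signal. The sketch below computes Cohen's kappa in plain Python. `cohen_kappa` is a hypothetical helper, not something the practical defines.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used one identical label throughout
    return (observed - expected) / (1 - expected)
```

It can be called as `cohen_kappa(df["label_suggested"], df["label_final"])`. A value near 1 means the suggestions rarely needed correcting, and a value near 0 means agreement is no better than chance.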


#✅ Step 8: Final Quality Check

In [9]:
print("Label distribution:")
print(df["label_final"].value_counts())


Label distribution:
label_final
Positive           4
Negative           3
Label: Neutral     1
Label: Positive    1
Name: count, dtype: int64
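
The distribution above exposes a labeling defect: the model sometimes echoes the prompt's `Label:` prefix, so `Label: Positive` and `Positive` are counted as different classes. The sketch below normalizes raw replies to the three labels defined in Step 6. `normalize_label` and `ALLOWED_LABELS` are hypothetical names, and returning `None` for unrecognized replies is an assumption about how you want to flag them.

```python
import re

ALLOWED_LABELS = ("Positive", "Negative", "Neutral")

def normalize_label(raw):
    """Map a raw model reply to one of ALLOWED_LABELS, or None if unclear."""
    # Remove an echoed "Label:" prefix (any casing) from the start of the reply
    cleaned = re.sub(r"^\s*label\s*:\s*", "", str(raw), flags=re.IGNORECASE)
    # Trim whitespace and stray quotes or sentence punctuation, then fix casing
    cleaned = cleaned.strip().strip(".!\"'").capitalize()
    return cleaned if cleaned in ALLOWED_LABELS else None
```

Applied as `df["label_final"] = df["label_suggested"].map(normalize_label)`, the nine rows shown in Step 7 would collapse to 5 Positive, 3 Negative, and 1 Neutral. Any row left empty after mapping needs manual review.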


#💾 Step 9: Save Dataset to CSV

In [10]:
df_final = df[["clean_text", "label_final"]]
df_final.to_csv("clean_labeled_dataset.csv", index=False)

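# Colab only: trigger a browser download of the exported file.
# Skip these two lines when running outside Google Colab.
from google.colab import files
files.download("clean_labeled_dataset.csv")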
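
After exporting, it can be worth reading the file back to confirm it has the expected columns and no empty cells. The sketch below does this with the standard `csv` module. `validate_csv` is a hypothetical helper, and the checks it performs are assumptions about what counts as a valid dataset here.

```python
import csv

def validate_csv(path, required_columns=("clean_text", "label_final")):
    """Return a list of problems found in the exported file (empty if none)."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        missing = [c for c in required_columns if c not in fieldnames]
        if missing:
            return [f"missing columns: {missing}"]
        # Data rows are numbered from 1, not counting the header
        for row_no, row in enumerate(reader, start=1):
            for col in required_columns:
                if not (row[col] or "").strip():
                    problems.append(f"row {row_no}: empty {col}")
    return problems
```

Calling `validate_csv("clean_labeled_dataset.csv")` should return an empty list for a clean export.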

