## YouTube comments spam detection using auto-label

Import libraries and set up the API key:

In [1]:
import os
import pandas as pd
import evadb
from dataframe.labeling_agent import LabelingAgent
os.environ['OPENAI_API_KEY'] = 'sk-xxx'

Declare the config and the CSV file:

In [2]:
config = {
    "task_name": "SpamClassification",
    "task_type": "classification",
    "dataset": {
      "label_column": "class",
      "label_separator": ", ",
      "delimiter": ","
    },
    "prompt": {
      "task_guidelines": "You are an expert at identifying spam and legitimate messages. Your goal is to maintain the quality of communication by accurately classifying incoming messages as either 'spam' or 'ham' (legitimate). Any message that is unsolicited, contains promotional content, or attempts to deceive or defraud users should be labeled as 'spam'. Messages that are personal, non-promotional, and relevant should be marked as 'ham'. Your job is to correctly label the provided input example into one of the following categories:\n{labels}",
      "output_guidelines": "You will return the answer as a comma separated list of labels sorted in alphabetical order. For example: \"label1, label2, label3\"",
      "labels": [
        "spam",
        "ham"
      ],
      "few_shot_examples": "spam-ham-label/data/seed.csv",
      "example_template": "Input: {example}\nOutput: {labels}\n"
    }
}

csv_file = "spam-ham-label/data/test.csv"
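To make the role of the `prompt` fields concrete, here is a minimal sketch of how they could combine into one prompt. The `render_prompt` helper, the shortened guidelines, and the seed examples are hypothetical and only illustrate the placeholders (`{labels}`, `{example}`); the labeling library may assemble its prompt differently.

```python
# Hypothetical helper: approximates how the config's prompt fields fit together.
def render_prompt(prompt_cfg, seed_examples, comment):
    # {labels} in task_guidelines is filled with the allowed label names
    labels = "\n".join(prompt_cfg["labels"])
    parts = [
        prompt_cfg["task_guidelines"].format(labels=labels),
        prompt_cfg["output_guidelines"],
    ]
    template = prompt_cfg["example_template"]
    # Few-shot examples: each seed row becomes one Input/Output pair
    for text, label in seed_examples:
        parts.append(template.format(example=text, labels=label))
    # The comment to classify is rendered with an empty Output slot
    parts.append(template.format(example=comment, labels="").rstrip())
    return "\n\n".join(parts)


# Shortened stand-in for the config above (assumption, for illustration only)
demo_prompt_cfg = {
    "task_guidelines": "Label the input as one of the following categories:\n{labels}",
    "output_guidelines": "Return only the label.",
    "labels": ["spam", "ham"],
    "example_template": "Input: {example}\nOutput: {labels}\n",
}
# Made-up seed examples, not taken from seed.csv
demo_seeds = [("subscribe to my channel", "spam"), ("great song", "ham")]

prompt = render_prompt(demo_prompt_cfg, demo_seeds, "Very nice")
print(prompt)
```

The rendered text ends with an open `Output:` line, which is where the model is expected to write its label.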
  

In [3]:
cursor = evadb.connect().cursor()
print("Connected to EvaDB")

Connected to EvaDB


In [4]:
# Recreate the AutoLabel function from its implementation file
create_function_query = """CREATE FUNCTION IF NOT EXISTS AutoLabel
            IMPL  './functions/autolabel.py';
            """
cursor.query("DROP FUNCTION IF EXISTS AutoLabel;").execute()
cursor.query(create_function_query).execute()
print("Created Function")

# Table schema for the YouTube comments
create_table_query = """
CREATE TABLE IF NOT EXISTS YTCOMMENTS(
comment_id TEXT(200),
author TEXT(30),
date TEXT(10),
content TEXT(255)
);
"""

# Load the CSV file declared earlier into the table
load_data_query = f"LOAD CSV '{csv_file}' INTO YTCOMMENTS;"



Created Function
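Before loading, it can help to confirm that the CSV header contains the columns named in the `CREATE TABLE` statement. The sketch below assumes the load step matches CSV columns to table columns by header name; `missing_columns` is a hypothetical helper, and the sample row is made up for illustration.

```python
import csv
import io

# Column names taken from the CREATE TABLE statement
EXPECTED_COLUMNS = ["comment_id", "author", "date", "content"]


def missing_columns(csv_stream, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(csv_stream), [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Illustrative in-memory CSV (made-up row, not from the real dataset)
sample = io.StringIO(
    "comment_id,author,date,content\nabc123,someone,2015-01-01,Nice video\n"
)
print(missing_columns(sample))
```

With the real file, the same check would be `with open(csv_file, newline="") as f: missing_columns(f)`; an empty list means every table column has a matching header.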


Create table and load data:

In [5]:
cursor.query(create_table_query).execute()
cursor.query(load_data_query).execute()

<evadb.models.storage.batch.Batch at 0x7f460b0ec450>

Perform data labeling:

In [6]:
query = "SELECT AutoLabel(comment_id, author, date, content) FROM YTCOMMENTS;"

result = cursor.query(query).execute()
print(result)



Now I want you to label the following comments:
Input: Hello everyone :) I know most of you probably pass up these kind of comments, but for those who are still reading this, thanks! I don’t have any money for advertisements, no chance of getting heard, nothing. I live in such a small town... If this comes off as spam, sorry. I’m an instrumental songwriter from Columbus, Mississippi. Please go to my channel and check out my original music. It would be highly appreciated if you thumbs up this comment so my music can be heard! Thank you, Adam Whitney 
Input: She is perfect
Input: Shakira u are so wiredo
Input: :D subscribe to me for daily vines
Input: Very nice
Input: I really love watching football and also I’ve started off making income with out financial risk from acquiring bonus deals. It’s this weird technique where you wager on something with one bookmakers and bet against it on Betfair. You secure the bonus as income . A chap named Jim Vanstone is finding the wagers free on his ow

Validate results:

In [9]:
import csv
from itertools import zip_longest

# Function to compare two CSV files row by row
def compare_csv_files(file1, file2):
    with open(file1, 'r', newline='') as csv_file1, open(file2, 'r', newline='') as csv_file2:
        csv_reader1 = csv.reader(csv_file1)
        csv_reader2 = csv.reader(csv_file2)

        # zip_longest pads the shorter file with None, so a missing row
        # compares unequal to any real row (plain zip would silently drop
        # one extra row from the first file)
        for row1, row2 in zip_longest(csv_reader1, csv_reader2):
            if row1 != row2:
                return False  # CSV files differ in content or length

    return True  # CSV files are identical

# Paths to the test and validate CSV files
test_csv = 'spam-ham-label/data/labeled_data.csv'
validate_csv = 'spam-ham-label/data/validate.csv'

# Check if the CSV files are identical
if compare_csv_files(test_csv, validate_csv):
    print("The labeled data CSV is the same as the validate CSV.")
else:
    print("The labeled data CSV is different from the validate CSV.")


The labeled data CSV is the same as the validate CSV.
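An exact file comparison only gives a yes/no answer. When the files differ, a per-row agreement score is more informative. The sketch below assumes both CSVs have the same row order and a `class` column (the `label_column` from the config); `label_agreement` is a hypothetical helper, and the in-memory rows are made up for illustration.

```python
import csv
import io


def label_agreement(predicted_file, reference_file, label_column="class"):
    """Return the fraction of rows whose labels match between two CSV streams."""
    predicted = [row[label_column] for row in csv.DictReader(predicted_file)]
    reference = [row[label_column] for row in csv.DictReader(reference_file)]
    if len(predicted) != len(reference):
        raise ValueError("files have different numbers of rows")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Made-up rows standing in for labeled_data.csv and validate.csv
predicted = io.StringIO("content,class\nfree money,spam\nnice song,ham\nsub to me,ham\n")
reference = io.StringIO("content,class\nfree money,spam\nnice song,ham\nsub to me,spam\n")

score = label_agreement(predicted, reference)
print(f"agreement: {score:.2f}")
```

For the real files, open `test_csv` and `validate_csv` with `newline=''` and pass the file objects in the same way.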
