<a href="https://colab.research.google.com/github/AryAgarwal/yt-classifier/blob/main/fine_tuned_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [9]:
!pip install accelerate -U
!pip install transformers[torch]
import pandas as pd
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
!pip install datasets
from datasets import Dataset
# import os
# os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
category_to_domain = {
    '1': 'Film & Animation',
    '2': 'Autos & Vehicles',
    '10': 'Music',
    '15': 'Pets & Animals',
    '17': 'Sports',
    '19': 'Travel & Events',
    '20': 'Gaming',
    '22': 'People & Blogs',
    '23': 'Comedy',
    '24': 'Entertainment',
    '25': 'News & Politics',
    '26': 'Howto & Style',
    '27': 'Education',
    '28': 'Science & Technology',
    '29': 'Nonprofits & Activism'
}

# Define domains as per the above mapping
domains = list(category_to_domain.values())



# Load the DataFrame from the CSV file
df = pd.read_csv('dataset.csv')
print(df.head())

# Preprocess the text and labels
df['text'] = df['2'] + " " + df['3']
df['text'] = df['text'].astype(str)
# Map domain labels to numeric labels
domain_to_id = {domain: idx for idx, domain in enumerate(domains)}
df['label'] = df['5'].apply(lambda x: domain_to_id[x])

# Split the dataset
train_df, val_df = train_test_split(df, test_size=0.1, stratify=df['label'])

# Load tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=len(domains))

# Tokenize the data
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

train_dataset = train_dataset.rename_column("label", "labels")
val_dataset = val_dataset.rename_column("label", "labels")

# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()

print(results)


   0            1                                                  2  \
0  0  pTnk3ziVVRM  Psychedelic Horizons Beyond Psychotherapy Work...   
1  0  pTnk3ziVVRM  Psychedelic Horizons Beyond Psychotherapy Work...   
2  0  pTnk3ziVVRM  Psychedelic Horizons Beyond Psychotherapy Work...   
3  1  cuJjSeHZIrg                     Episode 35 - Dr. James Fadiman   
4  1  cuJjSeHZIrg                     Episode 35 - Dr. James Fadiman   

                                                   3   4  \
0  Watch the full workshop at http://psychedelics...  29   
1  Watch the full workshop at http://psychedelics...  29   
2  Watch the full workshop at http://psychedelics...  29   
3  Dr. James Fadiman is the father of modern psyc...  22   
4  Dr. James Fadiman is the father of modern psyc...  22   

                       5  
0  Nonprofits & Activism  
1  Nonprofits & Activism  
2  Nonprofits & Activism  
3         People & Blogs  
4         People & Blogs  


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/310 [00:00<?, ? examples/s]

Map:   0%|          | 0/35 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss
1,No log,2.136734
2,No log,1.898899
3,No log,1.838245


{'eval_loss': 1.8382446765899658, 'eval_runtime': 30.2207, 'eval_samples_per_second': 1.158, 'eval_steps_per_second': 0.099, 'epoch': 3.0}


In [11]:
import torch
def classify_transcription(transcript):
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True, padding=True)
    outputs = model(**inputs)[0]
    predicted_class_idx = torch.argmax(outputs).item()
    predicted_class = domains[predicted_class_idx]
    return [{'domain': predicted_class, 'justification': f'The transcription is classified as {predicted_class}'}]

In [12]:
transcript=" Welcome back, you're watching The Morning Show. Now, 72 Ministers who took oath on Sunday have now been allocated their respective portfolios. BJP veteran Rajnath Singh has been, has retained rather defense ministry. Amit Shah will continue to remain the country's home minister while Nirmal Aasitha Raman has also retained the finance ministry. As far as S. Jai Shankar is concerned, he will continue to represent India globally as the foreign minister. Now, he's often known as India's highway man, Anithin Guttkuri, who has also retained road transport and highways. As far as we speak about Ashwini Vaishnav has been given more than one portfolio. In fact, he's been given three that includes railways, INB as well as the IT ministry. However, there are new additions to the Prime Ministers Cabinet who are there. Let's pick that down for you or rather the Dreejigs as well. BJP President J.P. Naddha, the newly inducted MP, has replaced Mansook Mandavya. He will now be India's health minister. Jyotir Aditya Sindhya, who so far held the aviation ministry, will now head the telecom ministry. Very interestingly, TDP's Ram Mohan Nido will now give India's aviation ministry a new Udan. Now, Ram Vilas Paswan San Chirapaswan will be heading the ministry of food processing industries. As far as Manohar Lalkatter is concerned, remember who stepped down as the Chief Minister months ago has been given housing and power ministry. And then, Anna Puranadevi, Jharkhand's OBC leader has replaced Smriti Arani as the women and child development minister. First of all, not only I am satisfied, not only I am happy, I think this is absolutely my Prime Minister's discretion, you know, whom he wants to give what responsibility. Having said that for me, it's a huge responsibility. And definitely, I mean, I'll be fulfilling it. Yes, I'll be needing some time before I can understand and I can study the whole department. And then only I can comment in detail over these things. If you like this video, then like, share and subscribe to Etynaow."
clas=classify_transcription(transcript)
print(clas)

[{'domain': 'People & Blogs', 'justification': 'The transcription is classified as People & Blogs'}]


In [10]:
!pip install pandas torch google-api-python-client youtube-transcript-api yt-dlp whisper

Collecting youtube-transcript-api
  Downloading youtube_transcript_api-0.6.2-py3-none-any.whl (24 kB)
Collecting yt-dlp
  Downloading yt_dlp-2024.5.27-py3-none-any.whl (3.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m26.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting whisper
  Downloading whisper-1.1.10.tar.gz (42 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting brotli (from yt-dlp)
  Downloading Brotli-1.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.0/3.0 MB[0m [31m52.4 MB/s[0m eta [36m0:00:00[0m
Collecting mutagen (from yt-dlp)
  Downloading mutagen-1.47.0-py3-none-any.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.4/194.4 kB[0m 

In [4]:







import os
import pandas as pd
from googleapiclient.discovery import build
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound, CouldNotRetrieveTranscript

# YouTube API configuration
api_key = 'AIzaSyDRdE7pbPZsyep2qNYhYbcjBtXCDS_57GI'
youtube = build('youtube', 'v3', developerKey=api_key)

def fetch_video_ids(query, max_results=10, video_category_id=None):
    search_response = youtube.search().list(
        q=query,
        part='id,snippet',
        maxResults=max_results,
        type='video',
        videoCategoryId=video_category_id
    ).execute()

    video_ids = [item['id']['videoId'] for item in search_response.get('items', [])]
    return video_ids

def fetch_transcription(video_id):
    try:
        transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
        transcript = ' '.join([entry['text'] for entry in transcript_list])
        return transcript
    except (TranscriptsDisabled, NoTranscriptFound, CouldNotRetrieveTranscript) as e:
        print(f"Could not retrieve transcript for video ID {video_id}: {str(e)}")
        return None

def main():
    video_category_ids = {
        '1': 'Film & Animation',
        '2': 'Autos & Vehicles',
        '10': 'Music',
        '15': 'Pets & Animals',
        '17': 'Sports',
        '19': 'Travel & Events',
        '20': 'Gaming',
        '22': 'People & Blogs',
        '23': 'Comedy',
        '24': 'Entertainment',
        '25': 'News & Politics',
        '26': 'Howto & Style',
        '27': 'Education',
        '28': 'Science & Technology',
        '29': 'Nonprofits & Activism'      # Category ID for Howto & Style (closest to Health)
    }
    queries=list(video_category_ids.values())

    video_data = []

    for query in queries:
        print(f"Fetching videos for category: {query}")
        video_ids = fetch_video_ids(query, max_results=5, video_category_id=video_category_ids[query])

        for video_id in video_ids:
            print(f"Fetching transcript for video ID: {video_id}")
            transcript = fetch_transcription(video_id)

            if transcript:
                video_data.append({
                    'category': query,
                    'video_id': video_id,
                    'transcript': transcript
                })

    df = pd.DataFrame(video_data)
    df.to_csv('custom_video_dataset.csv', index=False)
    print("Video data saved to custom_video_dataset.csv")

if __name__ == "__main__":
    main()


# Initialize the YouTube API client
# youtube = build('youtube', 'v3', developerKey=API_KEY)

# def fetch_videos_and_transcripts(query, max_results):
#     data = []
#     next_page_token = None

#     while True:
#         request = youtube.search().list(
#             q=query,
#             part='snippet',
#             type='video',
#             maxResults=max_results,
#             pageToken=next_page_token
#         )
#         response = request.execute()

#         for item in response['items']:
#             video_id = item['id']['videoId']
#             title = item['snippet']['title']
#             description = item['snippet']['description']
#             transcript = YouTubeTranscriptApi.get_transcript(video_id)
#             if transcript:
#                 text = f"{title} {description} {' '.join([entry['text'] for entry in transcript])}"
#                 data.append((text, query))

#         next_page_token = response.get('nextPageToken')
#         if not next_page_token:
#             break

#     return data

# # Function to save data to a CSV file
# def save_to_csv(data, filename):
#     with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
#         writer = csv.writer(csvfile)
#         writer.writerow(['text', 'label'])
#         writer.writerows(data)

# # Fetch videos and transcripts for each domain
# dataset = []
# for domain in DOMAINS:
#     print(f"Fetching data for domain: {domain}")
#     data = fetch_videos_and_transcripts(domain, max_results=50)
#     dataset.extend(data)

# # Save the dataset to a CSV file
# save_to_csv(dataset, 'youtube_video_dataset.csv')

Fetching videos for category: Film & Animation


KeyError: 'Film & Animation'

In [2]:
import pandas as pd
import numpy as np
# from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
# from sklearn.model_selection import train_test_split
# from datasets import Dataset
df = pd.read_csv('data.csv')
df=df.drop(['channelTitle','channelId','publishedAt','duration','durationSec','definition','caption','viewCount','likeCount','dislikeCount','commentCount'],axis=1)
print(df.head())
df= pd.DataFrame(np.repeat(df.values, 3, axis=0))
df.to_csv('dataset.csv',index=False)

   Unnamed: 0      videoId                                         videoTitle  \
0           0  pTnk3ziVVRM  Psychedelic Horizons Beyond Psychotherapy Work...   
1           1  cuJjSeHZIrg                     Episode 35 - Dr. James Fadiman   
2           2  IuyuZfWtGgg  #325 Microdosing from The Adam and Dr Drew Sho...   
3           3  cng_ZhQf8iY                Microdosing Away The Same Old Blues   
4           4  OpQIQEx7J5A  Erschossener Kiffer / Drogen in Mikro-Dosierun...   

                                    videoDescription  videoCategoryId  \
0  Watch the full workshop at http://psychedelics...               29   
1  Dr. James Fadiman is the father of modern psyc...               22   
2  Adam and Dr. Drew are solo today and they open...               22   
3  Source: https://www.spreaker.com/user/springwi...               24   
4  Von erschossenen "Dealern", von demonstrierend...               24   

      videoCategoryLabel  
0  Nonprofits & Activism  
1         People & B