# Classify your YouTube history

## Prepare data

In [1]:
import pandas as pd
import numpy as np
import re
import nltk

from preparation import prepare_data
from select_by_date_interval import select_by_date_interval

In [2]:
PATH = "C:/Users/San/Documents/CS projects/yt_activity_analysis/data/Takeout/YouTube and YouTube Music/history/watch-history.json"
df = prepare_data(PATH)

start_date = "2023-01-01"
# end_date = "2022-11-01"
df = select_by_date_interval(df, start=start_date)
# df = df.loc[df["app"] == "YouTube Music"] # if you wanna work only with YT Music data
df

Unnamed: 0,video_title,channel_name,time,app
1,Жбурляю,Харцизи,2023-04-04 14:59:28.805000+03:00,YouTube Music
2,Ліхтар,Rohata Zhaba,2023-04-04 14:25:11.177000+03:00,YouTube
3,Drinker's Chasers - ANOTHER Rey Skywalker Movie?!,Critical Drinker After Hours,2023-04-04 14:12:17.193000+03:00,YouTube
4,Test Your English Vocabulary: SHAPES & PATTERNS,Learn English with Gill · engVid,2023-04-04 14:06:00.179000+03:00,YouTube
5,Finland joins NATO in historic shift prompted ...,FRANCE 24 English,2023-04-04 14:02:41.452000+03:00,YouTube
...,...,...,...,...
4391,Assassin's Creed Odyssey - Before You Buy,gameranx,2023-01-01 10:02:05.355000+02:00,YouTube
4392,#Ukraine's spy chief tells ABC News there will...,ABC News,2023-01-01 09:44:58.841000+02:00,YouTube
4393,Russian missile zooms over Kyiv before being s...,The Sun,2023-01-01 09:43:07.927000+02:00,YouTube
4394,YARMAK FT. TOF - МОЯ КРАЇНА,Yarmak Music,2023-01-01 00:07:14.838000+02:00,YouTube


Let's think about possible categories:
- music - the easiest one, because I can do it without any classification algorithm by treating every video in YouTube Music as a song. However, songs listened to in the YouTube app still need handling. One option: check whether the video_title or channel_name already appears in YouTube Music. If so, it's a song as well
- entertainment - games, movies, books, and other not-so-productive stuff
- enlightenment/learning - English, math, CS, books, and other subjects
- politics - war, domestic policy, and so on

However, what if some videos combine different categories? For example, a history video can be both entertaining and enlightening

In [3]:
nltk.download('toolbox')

[nltk_data] Downloading package toolbox to
[nltk_data]     C:\Users\San\AppData\Roaming\nltk_data...
[nltk_data]   Package toolbox is already up-to-date!


True

In [4]:
def categorize():
    pass

Let's start with deciding whether the video is a song or not

In [5]:
df["app"].value_counts()

YouTube          2507
YouTube Music    1888
Name: app, dtype: int64

In [6]:
def initial_classification(row):
    if row["app"] == "YouTube Music":
        category = "Music"
    else:
        category = "Not Music"
    return category

In [7]:
df["category"] = df.apply(initial_classification, axis=1)
df.head(3)

Unnamed: 0,video_title,channel_name,time,app,category
1,Жбурляю,Харцизи,2023-04-04 14:59:28.805000+03:00,YouTube Music,Music
2,Ліхтар,Rohata Zhaba,2023-04-04 14:25:11.177000+03:00,YouTube,Not Music
3,Drinker's Chasers - ANOTHER Rey Skywalker Movie?!,Critical Drinker After Hours,2023-04-04 14:12:17.193000+03:00,YouTube,Not Music


In [8]:
music_df = df.loc[df["app"] == "YouTube Music"]

In [9]:
"Antytila" in music_df["channel_name"].values

True

In [10]:
def classification(row):
    category = row["category"]
    if row["category"] == "Not Music":
        if row["channel_name"] in music_df["channel_name"].values:
            category = "Music"
        elif row["video_title"] in music_df["video_title"].values:
            category = "Music"
    return category

In [11]:
df["category"] = df.apply(classification, axis=1)
df

Unnamed: 0,video_title,channel_name,time,app,category
1,Жбурляю,Харцизи,2023-04-04 14:59:28.805000+03:00,YouTube Music,Music
2,Ліхтар,Rohata Zhaba,2023-04-04 14:25:11.177000+03:00,YouTube,Not Music
3,Drinker's Chasers - ANOTHER Rey Skywalker Movie?!,Critical Drinker After Hours,2023-04-04 14:12:17.193000+03:00,YouTube,Not Music
4,Test Your English Vocabulary: SHAPES & PATTERNS,Learn English with Gill · engVid,2023-04-04 14:06:00.179000+03:00,YouTube,Not Music
5,Finland joins NATO in historic shift prompted ...,FRANCE 24 English,2023-04-04 14:02:41.452000+03:00,YouTube,Not Music
...,...,...,...,...,...
4391,Assassin's Creed Odyssey - Before You Buy,gameranx,2023-01-01 10:02:05.355000+02:00,YouTube,Not Music
4392,#Ukraine's spy chief tells ABC News there will...,ABC News,2023-01-01 09:44:58.841000+02:00,YouTube,Not Music
4393,Russian missile zooms over Kyiv before being s...,The Sun,2023-01-01 09:43:07.927000+02:00,YouTube,Not Music
4394,YARMAK FT. TOF - МОЯ КРАЇНА,Yarmak Music,2023-01-01 00:07:14.838000+02:00,YouTube,Music


In [12]:
df["category"].value_counts()

Not Music    2378
Music        2017
Name: category, dtype: int64

Well, with a couple of rules (that is, a rule-based system of sorts), I got 129 more videos classified as songs (2017 instead of 1888). However, some game/movie soundtrack collections are still classified as 'Not Music'
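
The two rules used so far (everything played in YouTube Music is a song; a YouTube video is a song if its channel or title also appears in YouTube Music) can also be written without a row-wise `apply`. A minimal sketch on a made-up toy frame; the rows are illustrative, only the column names match the notebook:

```python
import pandas as pd

# Toy data (hypothetical rows, same column names as the notebook's df)
toy = pd.DataFrame({
    "video_title": ["Song A", "Song A", "News clip", "Live set"],
    "channel_name": ["Band X", "Some reuploader", "News Y", "Band X"],
    "app": ["YouTube Music", "YouTube", "YouTube", "YouTube"],
})

# Rule 1: everything played in YouTube Music is a song
is_music_app = toy["app"] == "YouTube Music"

# Rule 2: a YouTube video is a song if its channel or title
# also shows up among the YouTube Music rows
music_rows = toy.loc[is_music_app]
seen_channel = toy["channel_name"].isin(music_rows["channel_name"])
seen_title = toy["video_title"].isin(music_rows["video_title"])

toy["category"] = "Not Music"
toy.loc[is_music_app | seen_channel | seen_title, "category"] = "Music"
print(toy["category"].tolist())
```

`isin` builds the lookup once instead of scanning `music_df[...].values` for every row, which should matter more as the history grows.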

Time to assign some labels

In [13]:
# after sampling, drop columns that are useless for classification,
# such as weekday and time
# subdf = df.sample(n=109, random_state=42).drop(["time", "weekday"], axis=1)
# subdf.head(5)

To assign labels, sample your df and save that sample to a .csv file. Then assign labels manually in the .csv file itself and import it for further processing

In [14]:
# commented out so I don't accidentally overwrite my labeled data with the unlabeled sample
# subdf.to_csv("labeled.csv", index=False)
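
Instead of commenting the export out, the write can be guarded so that an existing labeled file is never replaced. A small sketch; `save_if_absent` is a hypothetical helper name, and the toy frame and temporary directory only stand in for the real sample and path:

```python
import os
import tempfile
import pandas as pd

def save_if_absent(frame: pd.DataFrame, path: str) -> bool:
    """Write frame to path unless the file already exists. Returns True if written."""
    if os.path.exists(path):
        return False
    frame.to_csv(path, index=False)
    return True

# Toy usage in a temporary directory (file name is illustrative)
toy = pd.DataFrame({"video_title": ["a", "b", "c"], "category": ["", "", ""]})
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "labeled.csv")
    first = save_if_absent(toy.sample(n=2, random_state=42), target)
    second = save_if_absent(toy, target)  # skipped: the file is already there
    rows_on_disk = len(pd.read_csv(target))

print(first, second, rows_on_disk)
```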

Possible labels:
- music
- rec (short for recreation)/entertainment - games, movies, fiction books, and other not-so-productive stuff
- studies (enlightenment/learning/study) - English, math, CS, books, and other subjects
- politics - war, domestic politics, news, and so on
- sport - workouts and so on

However, what if some videos combine different categories? For example, a history video can be both entertaining and enlightening

In [15]:
labeled_df = pd.read_csv('labeled.csv', header = 0)
labeled_df

Unnamed: 0,video_title,channel_name,app,category
0,Valhalla (Extended Mix),Miss Monique,YouTube Music,Music
1,Why You Should Read The Stormlight Archive - B...,Daniel Greene,YouTube,Rec
2,"Тихо прийшов, тихо пішов або пісня спеціальног...",Riffmaster,YouTube Music,Music
3,"🎙️ [SingingMarch] ♯23 MARIAH CAREY – ""Whenever...",Mioune,YouTube,Music
4,Ukraine frontline: the battle for Bakhmut - B...,BBC News,YouTube,Politics
...,...,...,...,...
104,Браття українці,Shablya,YouTube Music,Music
105,Tran,Miss Monique,YouTube Music,Music
106,How South Koreans got so much taller,Vox,YouTube,Studies
107,Deep Rock Galactic - Playstation Launch Traile...,PlayStation,YouTube,Rec


In [16]:
labeled_df["category"].value_counts()

Music       52
Rec         21
Politics    20
Studies     14
Sport        2
Name: category, dtype: int64

In [17]:
labeled_df["target_label"] = labeled_df["category"].map({
    "Music": 0,
    "Rec": 1,
    "Politics": 2,
    "Studies": 3,
    "Sport": 4
})
labeled_df

Unnamed: 0,video_title,channel_name,app,category,target_label
0,Valhalla (Extended Mix),Miss Monique,YouTube Music,Music,0
1,Why You Should Read The Stormlight Archive - B...,Daniel Greene,YouTube,Rec,1
2,"Тихо прийшов, тихо пішов або пісня спеціальног...",Riffmaster,YouTube Music,Music,0
3,"🎙️ [SingingMarch] ♯23 MARIAH CAREY – ""Whenever...",Mioune,YouTube,Music,0
4,Ukraine frontline: the battle for Bakhmut - B...,BBC News,YouTube,Politics,2
...,...,...,...,...,...
104,Браття українці,Shablya,YouTube Music,Music,0
105,Tran,Miss Monique,YouTube Music,Music,0
106,How South Koreans got so much taller,Vox,YouTube,Studies,3
107,Deep Rock Galactic - Playstation Launch Traile...,PlayStation,YouTube,Rec,1


In [18]:
def preprocess(text):
    sentences = re.split("[\n\t]", text)
    # remove empty lines
    sentences = [sentence for sentence in sentences if sentence]
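    # further cleaning: keep only ASCII letters, digits, and whitespace
    # (flags go in by keyword; the 4th positional argument of re.sub is count)
    sentences = [re.sub(r"[^0-9a-zA-Z\s]", "", sentence, flags=re.I | re.A) for sentence in sentences]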
    sentences = [sentence.lower().strip() for sentence in sentences]
    wpt = nltk.WordPunctTokenizer()
    stop_words = nltk.corpus.stopwords.words("english")
    output = []
    for sentence in sentences:
        tokens = wpt.tokenize(sentence)
        filtered_tokens = [token for token in tokens if token not in stop_words]
        output.append(" ".join(filtered_tokens))
    return " ".join(output)

df["prepped"] = df["video_title"].apply(preprocess)
df.head(5)

Unnamed: 0,video_title,channel_name,time,app,category,prepped
1,Жбурляю,Харцизи,2023-04-04 14:59:28.805000+03:00,YouTube Music,Music,
2,Ліхтар,Rohata Zhaba,2023-04-04 14:25:11.177000+03:00,YouTube,Not Music,
3,Drinker's Chasers - ANOTHER Rey Skywalker Movie?!,Critical Drinker After Hours,2023-04-04 14:12:17.193000+03:00,YouTube,Not Music,drinkers chasers another rey skywalker movie
4,Test Your English Vocabulary: SHAPES & PATTERNS,Learn English with Gill · engVid,2023-04-04 14:06:00.179000+03:00,YouTube,Not Music,test english vocabulary shapes patterns
5,Finland joins NATO in historic shift prompted ...,FRANCE 24 English,2023-04-04 14:02:41.452000+03:00,YouTube,Not Music,finland joins nato historic shift prompted ukr...


From the output above, we can see that non-English text is not handled properly: titles written in Cyrillic end up with an empty 'prepped' value

Time for some language detection

In [19]:
from langdetect import detect
detect("Тихо прийшов, тихо пішов або пісня спеціального...")

'uk'

In [20]:
from langdetect import detect_langs
detect_langs("Тихо прийшов, тихо пішов або пісня спеціального...")

[uk:0.9999963383809275]

In [21]:
# from langdetect import detect

# def detect_language(text):
#     try:
#         return detect(text)
#     except:
#         return 'unknown'

# df['language'] = df['video_title'].apply(detect_language)
# df["language"].value_counts().head(10)

Well, this library mislabeled a lot of rows. I need to either read the library documentation or treat an empty cell in 'prepped' as a Ukrainian video and preprocess it again, this time with a Ukrainian preprocessing function

In [22]:
# df.drop(["prepped"], axis=1).to_csv("languages_detected.csv", index=False)

I believe language detection is a dead end. I should first try preprocessing the text as English. If that returns an empty string, I should preprocess the text with the Ukrainian preprocessing function. If that fails too, I should just drop the row
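
The fallback idea (English first, then Ukrainian, otherwise drop) can be sketched roughly as follows. The helpers here are hypothetical simplifications, not the notebook's `preprocess` functions: they skip tokenization and stop-word removal and only filter characters. The `min_len=4` threshold is an assumption borrowed from the length check used elsewhere in this notebook:

```python
import re

# Hypothetical helpers: keep letters/digits of one alphabet and lowercase the result
def clean_english(text: str) -> str:
    return " ".join(re.sub(r"[^0-9a-zA-Z\s]", "", text).lower().split())

def clean_ukrainian(text: str) -> str:
    # the а-я / А-Я ranges do not cover і, ї, є, ґ, so they are listed explicitly
    return " ".join(re.sub(r"[^0-9а-яА-ЯіїєґІЇЄҐ\s]", "", text).lower().split())

def preprocess_with_fallback(title: str, min_len: int = 4):
    """Return (language_guess, cleaned_text), or None if both attempts fail."""
    english = clean_english(title)
    if len(english) >= min_len:
        return "en", english
    ukrainian = clean_ukrainian(title)
    if len(ukrainian) >= min_len:
        return "uk", ukrainian
    return None  # the caller drops the row

print(preprocess_with_fallback("How GitHub works"))
print(preprocess_with_fallback("Ліхтар"))
print(preprocess_with_fallback("!!!"))
```

The language tag is only a guess based on which alphabet survived the cleaning, so a Russian title would also come back as "uk" here.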

In [23]:
df.head()

Unnamed: 0,video_title,channel_name,time,app,category,prepped
1,Жбурляю,Харцизи,2023-04-04 14:59:28.805000+03:00,YouTube Music,Music,
2,Ліхтар,Rohata Zhaba,2023-04-04 14:25:11.177000+03:00,YouTube,Not Music,
3,Drinker's Chasers - ANOTHER Rey Skywalker Movie?!,Critical Drinker After Hours,2023-04-04 14:12:17.193000+03:00,YouTube,Not Music,drinkers chasers another rey skywalker movie
4,Test Your English Vocabulary: SHAPES & PATTERNS,Learn English with Gill · engVid,2023-04-04 14:06:00.179000+03:00,YouTube,Not Music,test english vocabulary shapes patterns
5,Finland joins NATO in historic shift prompted ...,FRANCE 24 English,2023-04-04 14:02:41.452000+03:00,YouTube,Not Music,finland joins nato historic shift prompted ukr...


In [24]:
len(df)

4395

In [25]:
count = sum(df['prepped'].str.len() < 4)
print(count)

991


We can see that around a quarter of all rows (991 of 4395, about 23%) are probably Ukrainian videos
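
The `len < 4` check is only a proxy: it also counts very short English titles and misses mixed titles whose English part survives the cleaning. A more direct check, sketched here on made-up titles, is to look for Cyrillic characters in the raw title (using the basic Cyrillic Unicode block is my assumption about what is good enough for this data):

```python
import pandas as pd

toy = pd.DataFrame({"video_title": [
    "Ліхтар",                         # Ukrainian only
    "YARMAK - ДИКЕ ПОЛЕ(FT. ALISA)",  # mixed
    "How GitHub works",               # English only
    "Tran",                           # short English title
]})

# U+0400..U+04FF is the basic Cyrillic Unicode block
has_cyrillic = toy["video_title"].str.contains(r"[\u0400-\u04FF]", regex=True)
print(has_cyrillic.tolist())
```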

In [26]:
import nltk
# nltk.download('stopwords')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')
# nltk.download('tagsets')
# nltk.download('wordnet')
# nltk.download('omw')
# nltk.download('words')
nltk.download('ukrainian')

[nltk_data] Error loading ukrainian: Package 'ukrainian' not found in
[nltk_data]     index


False

In [27]:
from nltk.corpus import stopwords
print(stopwords.fileids())

['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']


In [28]:
def preprocess_ukr(text):
    sentences = re.split("[\n\t]", text)
    # remove empty lines
    sentences = [sentence for sentence in sentences if sentence]
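    # further cleaning: keep only digits, Cyrillic letters in the а-я/А-Я ranges, and whitespace
    # (flags go in by keyword; the 4th positional argument of re.sub is count)
    # note: the а-я range does not include the Ukrainian letters і, ї, є, ґ
    sentences = [re.sub(r"[^0-9а-яА-Я\s]", "", sentence, flags=re.I | re.A).lower() for sentence in sentences]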
    sentences = [sentence.lower().strip() for sentence in sentences]
    wpt = nltk.WordPunctTokenizer()
    # stop_words = nltk.corpus.stopwords.words("ukrainian")
    output = []
    for sentence in sentences:
        tokens = wpt.tokenize(sentence)
        filtered_tokens = [token for token in tokens]
        output.append(" ".join(filtered_tokens))
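    # note: output is a list of lines, so len(output) counts lines, not characters;
    # a single-line title therefore always yields ""
    return " ".join(output) if len(output) >= 4 else ""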

# filter DataFrame to only include rows where 'prepped' has length < 4;
# .copy() gives an independent frame, which avoids pandas' SettingWithCopyWarning
empty_rows = df[df['prepped'].str.len() < 4].copy()

# apply ukrainian preprocessing to the 'video_title' column for these rows
empty_rows['prepped_ukr'] = empty_rows['video_title'].apply(preprocess_ukr)

# update the original DataFrame with the preprocessed values for the empty rows
# (df.update only touches columns that already exist in df, so 'prepped_ukr' is not added)
df.update(empty_rows)


In [29]:
empty_rows

Unnamed: 0,video_title,channel_name,time,app,category,prepped,prepped_ukr
1,Жбурляю,Харцизи,2023-04-04 14:59:28.805000+03:00,YouTube Music,Music,,
2,Ліхтар,Rohata Zhaba,2023-04-04 14:25:11.177000+03:00,YouTube,Not Music,,
9,Мавка,Authentix,2023-04-04 12:57:54.594000+03:00,YouTube,Not Music,,
10,Харцизи - Забуті боги,Харцизи,2023-04-04 12:50:32.692000+03:00,YouTube,Music,,
13,Злива,Харцизи,2023-04-04 12:44:35.400000+03:00,YouTube,Music,,
...,...,...,...,...,...,...,...
4257,"Тихо прийшов, тихо пішов або пісня спеціальног...",Riffmaster,2023-01-03 17:29:21.578000+02:00,YouTube Music,Music,,
4310,Сухпай Збройних сил Республіки Корея 한국군 배급,ХЛОПЦІ З ЛІСУ,2023-01-02 21:39:10.283000+02:00,YouTube,Not Music,,
4335,"D.M.G. - –ù–µ–Ω–∞–≤–∏–∂—É , –±–ª—è—Ç—å , —Ü—ã–≥–∞–Ω!!!",–°–º–µ—Ç–∞–Ω–∏–Ω –í–∞—Å–∏–ª–∏–π,2023-01-02 13:01:31.968000+02:00,YouTube,Not Music,dmg,
4346,Як військові відреагували на звернення Зеленсь...,Ukrainian Witness,2023-01-02 08:53:54.148000+02:00,YouTube,Not Music,,
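
`prepped_ukr` is empty for every row shown. One likely cause, judging from the code rather than from a confirmed diagnosis, is the final check in `preprocess_ukr`: `output` is a list with one entry per line of the title, so `len(output) >= 4` counts lines, not characters. A small illustration with a made-up helper (`join_if_long_enough` is hypothetical):

```python
def join_if_long_enough(parts, min_chars=4):
    """Hypothetical corrected tail: measure the joined string, not the list."""
    joined = " ".join(parts)
    return joined if len(joined) >= min_chars else ""

output = ["харцизи забуті боги"]  # one cleaned line, as for a typical title

list_based = " ".join(output) if len(output) >= 4 else ""  # check on the list length
string_based = join_if_long_enough(output)                 # check on the text length

print(repr(list_based))
print(repr(string_based))
```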


Come to think of it, let's keep it simple for now and select only English rows

In [30]:
prepped_df = df[df['prepped'].str.len() >= 4]
prepped_df

Unnamed: 0,video_title,channel_name,time,app,category,prepped
3,Drinker's Chasers - ANOTHER Rey Skywalker Movie?!,Critical Drinker After Hours,2023-04-04 14:12:17.193000+03:00,YouTube,Not Music,drinkers chasers another rey skywalker movie
4,Test Your English Vocabulary: SHAPES & PATTERNS,Learn English with Gill · engVid,2023-04-04 14:06:00.179000+03:00,YouTube,Not Music,test english vocabulary shapes patterns
5,Finland joins NATO in historic shift prompted ...,FRANCE 24 English,2023-04-04 14:02:41.452000+03:00,YouTube,Not Music,finland joins nato historic shift prompted ukr...
6,Finland joins NATO in the alliance's fastest-e...,euronews,2023-04-04 14:01:33.369000+03:00,YouTube,Not Music,finland joins nato alliances fastestever acces...
7,Finland's Election Results Explained: How Sann...,TLDR News EU,2023-04-04 14:01:00.878000+03:00,YouTube,Not Music,finlands election results explained sanna mari...
...,...,...,...,...,...,...
4390,Pathfinder: Kingmaker Review,MandaloreGaming,2023-01-01 10:15:11.845000+02:00,YouTube,Not Music,pathfinder kingmaker review
4391,Assassin's Creed Odyssey - Before You Buy,gameranx,2023-01-01 10:02:05.355000+02:00,YouTube,Not Music,assassins creed odyssey buy
4392,#Ukraine's spy chief tells ABC News there will...,ABC News,2023-01-01 09:44:58.841000+02:00,YouTube,Not Music,ukraines spy chief tells abc news likely attac...
4393,Russian missile zooms over Kyiv before being s...,The Sun,2023-01-01 09:43:07.927000+02:00,YouTube,Not Music,russian missile zooms kyiv shot time


Let's preserve the dataset above just in case

In [31]:
# prepped_df.to_csv("prepped_data.csv")

Time for labeling

Try to increase precision by working with a bigger sample

Double sample size to 600

In [32]:
sample = prepped_df.sample(n=600, random_state=42).drop(["time"], axis=1)
sample

Unnamed: 0,video_title,channel_name,app,category,prepped
397,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,YouTube,Not Music,testing sharks smell drop blood
971,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,YouTube,Not Music,yakuza 0 karaoke 24hour cinderella majima
2754,Why You Should Read: Mistborn By Brandon Sande...,Mike's Book Reviews,YouTube,Not Music,read mistborn brandon sanderson spoilerfree
575,Taking advantage of noobs - Scammer,Viva La Dirt League,YouTube,Not Music,taking advantage noobs scammer
637,How GitHub works,GitHub,YouTube,Not Music,github works
...,...,...,...,...,...
394,The Business Behind Kurzgesagt,Kurzgesagt – In a Nutshell,YouTube,Not Music,business behind kurzgesagt
3352,Folsom Prison Blues,Johnny Cash,YouTube,Music,folsom prison blues
2338,Modals of Ability - Free English Grammar Lesso...,Maltalingua English Language School,YouTube,Not Music,modals ability free english grammar lesson b2 ...
250,Hallomann,Rammstein,YouTube Music,Music,hallomann


In [33]:
sample = sample.reset_index(drop=True)
sample.head(2)

Unnamed: 0,video_title,channel_name,app,category,prepped
0,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,YouTube,Not Music,testing sharks smell drop blood
1,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,YouTube,Not Music,yakuza 0 karaoke 24hour cinderella majima


In [34]:
len(sample)

600

In [35]:
sample

Unnamed: 0,video_title,channel_name,app,category,prepped
0,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,YouTube,Not Music,testing sharks smell drop blood
1,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,YouTube,Not Music,yakuza 0 karaoke 24hour cinderella majima
2,Why You Should Read: Mistborn By Brandon Sande...,Mike's Book Reviews,YouTube,Not Music,read mistborn brandon sanderson spoilerfree
3,Taking advantage of noobs - Scammer,Viva La Dirt League,YouTube,Not Music,taking advantage noobs scammer
4,How GitHub works,GitHub,YouTube,Not Music,github works
...,...,...,...,...,...
595,The Business Behind Kurzgesagt,Kurzgesagt – In a Nutshell,YouTube,Not Music,business behind kurzgesagt
596,Folsom Prison Blues,Johnny Cash,YouTube,Music,folsom prison blues
597,Modals of Ability - Free English Grammar Lesso...,Maltalingua English Language School,YouTube,Not Music,modals ability free english grammar lesson b2 ...
598,Hallomann,Rammstein,YouTube Music,Music,hallomann


In [36]:
sample.iloc[295:301]

Unnamed: 0,video_title,channel_name,app,category,prepped
295,THE Kingdom Guide For Bannerlord,Strat Gaming,YouTube,Not Music,kingdom guide bannerlord
296,The Might Of Sovietunion,Andreas Waldetoft,YouTube Music,Music,might sovietunion
297,Deep Rock Galactic | Overdrive Booster Is So S...,ReapeeRon,YouTube,Not Music,deep rock galactic overdrive booster strong
298,З вогню повстану – Reignite (Mass Effect cover...,Eileen,YouTube,Music,reignite mass effect cover ukrainian
299,'Would' Past Habits - English Grammar Lesson (...,Maltalingua English Language School,YouTube,Not Music,would past habits english grammar lesson upper...
300,YARMAK - ДИКЕ ПОЛЕ(FT. ALISA),Yarmak Music,YouTube Music,Music,yarmak ft alisa


In [37]:
new_rows = sample.iloc[300:]
new_rows.head(2)

Unnamed: 0,video_title,channel_name,app,category,prepped
300,YARMAK - ДИКЕ ПОЛЕ(FT. ALISA),Yarmak Music,YouTube Music,Music,yarmak ft alisa
301,On Writing: Magic Systems and Handling Power E...,Hello Future Me,YouTube,Not Music,writing magic systems handling power escalatio...


In [38]:
len(new_rows)

300

In [39]:
df.head(2)

Unnamed: 0,video_title,channel_name,time,app,category,prepped
1,Жбурляю,Харцизи,2023-04-04 14:59:28.805000+03:00,YouTube Music,Music,
2,Ліхтар,Rohata Zhaba,2023-04-04 14:25:11.177000+03:00,YouTube,Not Music,


In [40]:
df_combined = pd.concat([df, new_rows], axis=0)
df_combined

Unnamed: 0,video_title,channel_name,time,app,category,prepped
1,Жбурляю,Харцизи,2023-04-04 14:59:28.805000+03:00,YouTube Music,Music,
2,Ліхтар,Rohata Zhaba,2023-04-04 14:25:11.177000+03:00,YouTube,Not Music,
3,Drinker's Chasers - ANOTHER Rey Skywalker Movie?!,Critical Drinker After Hours,2023-04-04 14:12:17.193000+03:00,YouTube,Not Music,drinkers chasers another rey skywalker movie
4,Test Your English Vocabulary: SHAPES & PATTERNS,Learn English with Gill · engVid,2023-04-04 14:06:00.179000+03:00,YouTube,Not Music,test english vocabulary shapes patterns
5,Finland joins NATO in historic shift prompted ...,FRANCE 24 English,2023-04-04 14:02:41.452000+03:00,YouTube,Not Music,finland joins nato historic shift prompted ukr...
...,...,...,...,...,...,...
595,The Business Behind Kurzgesagt,Kurzgesagt – In a Nutshell,NaT,YouTube,Not Music,business behind kurzgesagt
596,Folsom Prison Blues,Johnny Cash,NaT,YouTube,Music,folsom prison blues
597,Modals of Ability - Free English Grammar Lesso...,Maltalingua English Language School,NaT,YouTube,Not Music,modals ability free english grammar lesson b2 ...
598,Hallomann,Rammstein,NaT,YouTube Music,Music,hallomann


In [41]:
# df_combined.to_csv("sample_size_600.csv")

In [42]:
df_combined["category"].value_counts()

Not Music    2547
Music        2148
Name: category, dtype: int64

In [43]:
df["category"].value_counts()

Not Music    2378
Music        2017
Name: category, dtype: int64

In [44]:
# sample.to_csv("eng_only_labeled.csv")

In [45]:
df = pd.read_csv('sample_size_700.csv', header = 0).drop(["Unnamed: 0"], axis=1)
df.head(3)

Unnamed: 0,video_title,channel_name,app,category,prepped
0,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,YouTube,Rec,testing sharks smell drop blood
1,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,YouTube,Rec,yakuza 0 karaoke 24hour cinderella majima
2,Why You Should Read: Mistborn By Brandon Sande...,Mike's Book Reviews,YouTube,Rec,read mistborn brandon sanderson spoilerfree


In [46]:
df["category"].value_counts()

Music       354
Rec         133
Politics    107
Studies      87
Sport        18
Name: category, dtype: int64

In [47]:
# df['target_label'] = pd.factorize(df['category'])[0]
df["target_label"] = df["category"].map({
    "Music": 0,
    "Rec": 1,
    "Politics": 2,
    "Studies": 3,
    "Sport": 4
})
df.head(3)

Unnamed: 0,video_title,channel_name,app,category,prepped,target_label
0,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,YouTube,Rec,testing sharks smell drop blood,1
1,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,YouTube,Rec,yakuza 0 karaoke 24hour cinderella majima,1
2,Why You Should Read: Mistborn By Brandon Sande...,Mike's Book Reviews,YouTube,Rec,read mistborn brandon sanderson spoilerfree,1


In [48]:
df["target_label"].value_counts()

0    354
1    133
2    107
3     87
4     18
Name: target_label, dtype: int64

At last, time to try classification

In [49]:
from sklearn.model_selection import train_test_split

train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names = train_test_split(
    np.array(df['prepped']), np.array(df['target_label']), np.array(df['category']), test_size=0.2, random_state=0)
train_corpus.shape, test_corpus.shape

((559,), (140,))
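
With classes as unbalanced as these (Sport has only 18 rows), a plain random split can leave a class over- or under-represented in the test set. `train_test_split` accepts a `stratify` argument for this. A sketch on made-up labels with a similar kind of imbalance; it is not the split used in this notebook, so the scores reported here would change if it were applied:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a similar kind of imbalance (made-up sizes)
labels = np.array(["Music"] * 50 + ["Rec"] * 20 + ["Sport"] * 10)
texts = np.array([f"title {i}" for i in range(len(labels))])

train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

# With stratify, each class keeps its share in the test split
values, counts = np.unique(test_y, return_counts=True)
test_counts = dict(zip(values.tolist(), counts.tolist()))
print(test_counts)
```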

In [50]:
from sklearn.model_selection import cross_val_score

from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., norm='l2', use_idf=True, smooth_idf=True)
tv_train_features = tv.fit_transform(train_corpus)
tv_test_features = tv.transform(test_corpus)

In [51]:
from sklearn.svm import LinearSVC
svm = LinearSVC(penalty='l2', C=1, random_state=0)
svm.fit(tv_train_features, train_label_names)
svm_bow_tv_scores = cross_val_score(svm, tv_train_features, train_label_names, cv=5)
svm_bow_tv_mean_score = np.mean(svm_bow_tv_scores)

print('CV Accuracy (5-fold):', svm_bow_tv_scores)
print('Mean CV Accuracy:', svm_bow_tv_mean_score)
svm_bow_test_score = svm.score(tv_test_features, test_label_names)
print('Test Accuracy:', svm_bow_test_score)

CV Accuracy (5-fold): [0.76785714 0.75892857 0.83035714 0.76785714 0.79279279]
Mean CV Accuracy: 0.7835585585585585
Test Accuracy: 0.9
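
One detail of the cross-validation above: the TF-IDF features were fitted on the whole training set before the folds were made, so each validation fold has already contributed to the vocabulary and IDF weights. Putting the vectorizer and the classifier into one scikit-learn pipeline refits both inside every fold. A sketch on a tiny made-up corpus (the real inputs would be `train_corpus` and `train_label_names`); I am not claiming this explains the gap between the CV and test scores:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus standing in for the labeled titles
corpus = np.array(
    ["official music video", "live concert song", "album full song",
     "acoustic cover song", "remix music track", "piano music live"]
    + ["election results explained", "nato summit news", "war frontline report",
       "president speech news", "sanctions vote explained", "parliament debate news"]
)
labels = np.array(["Music"] * 6 + ["Politics"] * 6)

# The vectorizer is refit inside every fold, so validation text never
# contributes to the vocabulary or the IDF weights
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1, random_state=0))
scores = cross_val_score(model, corpus, labels, cv=3)
print(scores.shape, scores.mean())
```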


In [52]:
from sklearn.metrics import classification_report
# predict labels for the test set
svm_predictions = svm.predict(tv_test_features)
# get the unique classes
unique_classes = list(set(test_label_names))
# print the classification report
print(classification_report(test_label_names, svm_predictions, labels=unique_classes))

              precision    recall  f1-score   support

     Studies       0.82      0.90      0.86        20
    Politics       0.88      0.67      0.76        21
       Sport       1.00      1.00      1.00         4
         Rec       0.89      0.83      0.86        29
       Music       0.93      1.00      0.96        66

    accuracy                           0.90       140
   macro avg       0.90      0.88      0.89       140
weighted avg       0.90      0.90      0.90       140
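
The report shows that Politics has the lowest recall (0.67) but not which classes those videos were assigned to instead. A confusion matrix answers that. The sketch below uses made-up labels in place of `test_label_names` and `svm_predictions`, so the numbers it prints say nothing about the real model:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true/predicted labels standing in for the real test results
y_true = ["Politics", "Politics", "Politics", "Rec", "Rec", "Music"]
y_pred = ["Politics", "Studies", "Rec", "Rec", "Music", "Music"]
classes = ["Music", "Politics", "Rec", "Studies"]

cm = confusion_matrix(y_true, y_pred, labels=classes)
# Rows are actual classes, columns are predicted classes
cm_df = pd.DataFrame(cm, index=classes, columns=classes)
print(cm_df)
```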



Now let's see the assigned labels as a DataFrame

In [53]:
# Assuming you have already trained your model and obtained the predicted labels
predicted_labels = svm.predict(tv_test_features)

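# Create a new dataframe that contains the test data and the predicted labels
# (note: 'video_title' holds the preprocessed text, 'actual_label' is the numeric
# target, and 'predicted_label' is a class name because the model was fit on names)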
test_df = pd.DataFrame({'video_title': test_corpus, 'category': test_label_names, 'actual_label': test_label_nums, 'predicted_label': predicted_labels})

# Print the first 10 rows of the new dataframe
test_df.head(10)

Unnamed: 0,video_title,category,actual_label,predicted_label
0,happens arthur sawedoff shotgun instead revolver,Rec,1,Music
1,hero ages brandon sanderson stick landing part,Rec,1,Rec
2,let go frozensoundtrack version,Music,0,Music
3,kids song,Music,0,Music
4,one final effort,Music,0,Music
5,bad boys theme cops,Music,0,Music
6,sector,Music,0,Music
7,day 4 best full body yoga stretch 30 days yoga,Sport,4,Sport
8,containers vs vms whats difference,Studies,3,Studies
9,sonne,Music,0,Music


Select 100 rows from the full df (roughly 3-4k rows) to test how the model performs on data it has not seen

Let's reach sample size 1000
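
Checking the model on rows it has not seen comes down to passing their preprocessed titles through the already fitted vectorizer and calling `predict`. A minimal sketch with made-up titles; in the notebook the fitted objects would be `tv` and `svm`, and the new titles would come from the `prepped` column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up labeled titles standing in for the labeled sample
train_titles = ["official music video", "live concert song", "piano music live",
                "nato summit news", "war frontline report", "election news explained"]
train_labels = ["Music", "Music", "Music", "Politics", "Politics", "Politics"]

vectorizer = TfidfVectorizer()
clf = LinearSVC(C=1, random_state=0)
clf.fit(vectorizer.fit_transform(train_titles), train_labels)

# New, unlabeled titles must go through transform (not fit_transform)
# so they are mapped onto the vocabulary learned from the training data
new_titles = ["acoustic song live", "frontline news report"]
predicted = clf.predict(vectorizer.transform(new_titles))
print(list(predicted))
```

Words that never occurred in the training titles are ignored by `transform`, so titles made up entirely of unseen words produce an all-zero feature vector and an essentially arbitrary prediction.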

In [64]:
prepped_df

Unnamed: 0,video_title,channel_name,time,app,category,prepped
3,Drinker's Chasers - ANOTHER Rey Skywalker Movie?!,Critical Drinker After Hours,2023-04-04 14:12:17.193000+03:00,YouTube,Not Music,drinkers chasers another rey skywalker movie
4,Test Your English Vocabulary: SHAPES & PATTERNS,Learn English with Gill · engVid,2023-04-04 14:06:00.179000+03:00,YouTube,Not Music,test english vocabulary shapes patterns
5,Finland joins NATO in historic shift prompted ...,FRANCE 24 English,2023-04-04 14:02:41.452000+03:00,YouTube,Not Music,finland joins nato historic shift prompted ukr...
6,Finland joins NATO in the alliance's fastest-e...,euronews,2023-04-04 14:01:33.369000+03:00,YouTube,Not Music,finland joins nato alliances fastestever acces...
7,Finland's Election Results Explained: How Sann...,TLDR News EU,2023-04-04 14:01:00.878000+03:00,YouTube,Not Music,finlands election results explained sanna mari...
...,...,...,...,...,...,...
4390,Pathfinder: Kingmaker Review,MandaloreGaming,2023-01-01 10:15:11.845000+02:00,YouTube,Not Music,pathfinder kingmaker review
4391,Assassin's Creed Odyssey - Before You Buy,gameranx,2023-01-01 10:02:05.355000+02:00,YouTube,Not Music,assassins creed odyssey buy
4392,#Ukraine's spy chief tells ABC News there will...,ABC News,2023-01-01 09:44:58.841000+02:00,YouTube,Not Music,ukraines spy chief tells abc news likely attac...
4393,Russian missile zooms over Kyiv before being s...,The Sun,2023-01-01 09:43:07.927000+02:00,YouTube,Not Music,russian missile zooms kyiv shot time


In [65]:
sample = prepped_df.sample(n=1000, random_state=42)
sample = sample.reset_index(drop=True)
sample

Unnamed: 0,video_title,channel_name,time,app,category,prepped
0,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,2023-03-28 17:00:03.715000+03:00,YouTube,Not Music,testing sharks smell drop blood
1,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,2023-03-17 00:52:54.331000+02:00,YouTube,Not Music,yakuza 0 karaoke 24hour cinderella majima
2,Why You Should Read: Mistborn By Brandon Sande...,Mike's Book Reviews,2023-02-04 20:50:07.297000+02:00,YouTube,Not Music,read mistborn brandon sanderson spoilerfree
3,Taking advantage of noobs - Scammer,Viva La Dirt League,2023-03-24 19:54:10.913000+02:00,YouTube,Not Music,taking advantage noobs scammer
4,How GitHub works,GitHub,2023-03-22 20:53:21.421000+02:00,YouTube,Not Music,github works
...,...,...,...,...,...,...
995,Emotional Man Causes A Ruckus At Stranger's House,Active Self Protection,2023-01-21 22:15:32.978000+02:00,YouTube,Not Music,emotional man causes ruckus strangers house
996,Wingardium Leviosa (Harry Potter Parody Animat...,OneyNG,2023-02-28 12:22:45.766000+02:00,YouTube,Not Music,wingardium leviosa harry potter parody animati...
997,Heart of Steel (Eurovision Version),Tvorchi,2023-03-09 23:58:03.759000+02:00,YouTube Music,Music,heart steel eurovision version
998,"""Пісня Хоробрих"" - бойова пісня 3 полку ССО | ...",Український Повстанець,2023-03-01 21:47:03.586000+02:00,YouTube Music,Music,3 song 3rd regiment ukr special operation forces


In [68]:
df.tail()

Unnamed: 0,video_title,channel_name,app,category,prepped,target_label
694,Berserkir,Danheim,YouTube Music,Music,berserkir,0
695,Let's talk about Abrams approval and timelines...,Beau of the Fifth Column,YouTube,Politics,lets talk abrams approval timelines,2
696,Covert Operations,Adam Schneider,YouTube Music,Music,covert operations,0
697,What is NLP (Natural Language Processing)?,IBM Technology,YouTube,Studies,nlp natural language processing,3
698,Ukraine - The Beginning of the End,Adam Something,YouTube,Politics,ukraine beginning end,2


In [69]:
sample.iloc[697:703]

Unnamed: 0,video_title,channel_name,time,app,category,prepped
697,Covert Operations,Adam Schneider,2023-01-07 13:43:01.832000+02:00,YouTube Music,Music,covert operations
698,What is NLP (Natural Language Processing)?,IBM Technology,2023-02-26 14:09:52.755000+02:00,YouTube,Not Music,nlp natural language processing
699,Ukraine - The Beginning of the End,Adam Something,2023-01-09 08:59:47.539000+02:00,YouTube,Not Music,ukraine beginning end
700,Endless Space 2 Review,IGN,2023-03-15 12:29:59.182000+02:00,YouTube,Not Music,endless space 2 review
701,5 MIN KILLER ABS WORKOUT (At Home No Equipment...,Sean Vigue Fitness,2023-01-27 09:01:53.183000+02:00,YouTube,Not Music,5 min killer abs workout home equipment power ...
702,I tried 10 code editors,Fireship,2023-01-02 12:05:09.259000+02:00,YouTube,Not Music,tried 10 code editors


In [99]:
df = df.drop(["time"], axis=1)
df

Unnamed: 0,video_title,channel_name,app,category,prepped,target_label
0,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,YouTube,Rec,testing sharks smell drop blood,1
1,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,YouTube,Rec,yakuza 0 karaoke 24hour cinderella majima,1
2,Why You Should Read: Mistborn By Brandon Sande...,Mike's Book Reviews,YouTube,Rec,read mistborn brandon sanderson spoilerfree,1
3,Taking advantage of noobs - Scammer,Viva La Dirt League,YouTube,Rec,taking advantage noobs scammer,1
4,How GitHub works,GitHub,YouTube,Studies,github works,3
...,...,...,...,...,...,...
694,Berserkir,Danheim,YouTube Music,Music,berserkir,0
695,Let's talk about Abrams approval and timelines...,Beau of the Fifth Column,YouTube,Politics,lets talk abrams approval timelines,2
696,Covert Operations,Adam Schneider,YouTube Music,Music,covert operations,0
697,What is NLP (Natural Language Processing)?,IBM Technology,YouTube,Studies,nlp natural language processing,3


In [97]:
sample.iloc[395:405]

Unnamed: 0,video_title,channel_name,time,app,category,prepped
395,"ТНМК - Дивись, куди ідеш [Official Video]",ТНМК,2023-01-15 16:28:56.093000+02:00,YouTube,Not Music,official video
396,Miss Monique - Elamy [Siona Records],Miss Monique,2023-02-08 23:20:44.281000+02:00,YouTube Music,Music,miss monique elamy siona records
397,🔥 ЧМУТ про рецепт ПЕРЕМОГИ від ЗАЛУЖНОГО #shorts,ПРЯМА ЧЕРВОНА,2023-02-10 18:53:29.227000+02:00,YouTube,Not Music,shorts
398,Heathens,Twenty One Pilots,2023-03-19 21:49:22.618000+02:00,YouTube Music,Music,heathens
399,Getting Past Scottish Immigration,Foil Arms and Hog,2023-01-04 13:52:50.418000+02:00,YouTube,Not Music,getting past scottish immigration
400,Ghost,Star-Lord Band,2023-03-27 13:05:50.910000+03:00,YouTube Music,Music,ghost
401,Eigenvectors and Generalized Eigenspaces,Serrano.Academy,2023-03-30 01:08:04.085000+03:00,YouTube,Not Music,eigenvectors generalized eigenspaces
402,Sector,Daniel Deluxe,2023-03-26 21:05:16.185000+03:00,YouTube Music,Music,sector
403,Ghost,Star-Lord Band,2023-02-20 20:09:48.052000+02:00,YouTube Music,Music,ghost
404,Never Forget,Michael Salvatori,2023-01-09 22:39:26.781000+02:00,YouTube Music,Music,never forget


In [75]:
old_rows = sample.iloc[:700]
old_rows["time"]

0     2023-03-28 17:00:03.715000+03:00
1     2023-03-17 00:52:54.331000+02:00
2     2023-02-04 20:50:07.297000+02:00
3     2023-03-24 19:54:10.913000+02:00
4     2023-03-22 20:53:21.421000+02:00
                    ...               
695   2023-01-18 14:27:28.915000+02:00
696   2023-01-25 17:53:54.385000+02:00
697   2023-01-07 13:43:01.832000+02:00
698   2023-02-26 14:09:52.755000+02:00
699   2023-01-09 08:59:47.539000+02:00
Name: time, Length: 700, dtype: datetime64[ns, Europe/Kiev]

In [77]:
df["time"] = old_rows["time"]
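
A note on the assignment above: pandas aligns a Series to the target frame by index label, not by row position. If `df` lost a row and had its index reset, every later row would pick up the timestamp of a different video. The sketch below reproduces that on toy data (the titles, timestamps, and the names `old`, `labelled`, and `first_bad` are made up for illustration) and finds the first position where the two frames disagree.

```python
import pandas as pd

# Toy stand-in for `old_rows`: four videos with timestamps (values are made up).
old = pd.DataFrame({
    "video_title": ["a", "b", "c", "d"],
    "time": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"]),
})

# Toy stand-in for `df`: row "b" was dropped and the index was reset afterwards.
labelled = old.drop(index=1).drop(columns="time").reset_index(drop=True)

# Assignment aligns on index labels, so "c" (now at label 1) receives b's time.
labelled["time"] = old["time"]

# Compare titles position by position to find where the frames diverge.
n = min(len(labelled), len(old))
mismatch = labelled["video_title"].to_numpy()[:n] != old["video_title"].to_numpy()[:n]
first_bad = int(mismatch.argmax()) if mismatch.any() else None
print(first_bad)
```

On the real data, the same comparison between `df["video_title"]` and `old_rows["video_title"]` should point at the first row that went missing, assuming the two frames are otherwise in the same order.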

In [95]:
old_rows[["video_title", "channel_name", "time"]].iloc[395:415]

Unnamed: 0,video_title,channel_name,time
395,"ТНМК - Дивись, куди ідеш [Official Video]",ТНМК,2023-01-15 16:28:56.093000+02:00
396,Miss Monique - Elamy [Siona Records],Miss Monique,2023-02-08 23:20:44.281000+02:00
397,🔥 ЧМУТ про рецепт ПЕРЕМОГИ від ЗАЛУЖНОГО #shorts,ПРЯМА ЧЕРВОНА,2023-02-10 18:53:29.227000+02:00
398,Heathens,Twenty One Pilots,2023-03-19 21:49:22.618000+02:00
399,Getting Past Scottish Immigration,Foil Arms and Hog,2023-01-04 13:52:50.418000+02:00
400,Ghost,Star-Lord Band,2023-03-27 13:05:50.910000+03:00
401,Eigenvectors and Generalized Eigenspaces,Serrano.Academy,2023-03-30 01:08:04.085000+03:00
402,Sector,Daniel Deluxe,2023-03-26 21:05:16.185000+03:00
403,Ghost,Star-Lord Band,2023-02-20 20:09:48.052000+02:00
404,Never Forget,Michael Salvatori,2023-01-09 22:39:26.781000+02:00


In [107]:
missed_row = old_rows.iloc[[397]].drop(["time"], axis=1)
missed_row

Unnamed: 0,video_title,channel_name,app,category,prepped
397,🔥 ЧМУТ про рецепт ПЕРЕМОГИ від ЗАЛУЖНОГО #shorts,ПРЯМА ЧЕРВОНА,YouTube,Not Music,shorts


In [108]:
missed_row["category"] = "Politics"
missed_row

Unnamed: 0,video_title,channel_name,app,category,prepped
397,🔥 ЧМУТ про рецепт ПЕРЕМОГИ від ЗАЛУЖНОГО #shorts,ПРЯМА ЧЕРВОНА,YouTube,Politics,shorts


In [113]:
df2 = pd.concat([df.iloc[:397], missed_row, df.iloc[397:]]).reset_index(drop=True)
df2.iloc[395:400]

Unnamed: 0,video_title,channel_name,app,category,prepped
395,"ТНМК - Дивись, куди ідеш [Official Video]",ТНМК,YouTube,Music,official video
396,Miss Monique - Elamy [Siona Records],Miss Monique,YouTube Music,Music,miss monique elamy siona records
397,🔥 ЧМУТ про рецепт ПЕРЕМОГИ від ЗАЛУЖНОГО #shorts,ПРЯМА ЧЕРВОНА,YouTube,Politics,shorts
398,Heathens,Twenty One Pilots,YouTube Music,Music,heathens
399,Getting Past Scottish Immigration,Foil Arms and Hog,YouTube,Rec,getting past scottish immigration
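
The splice in the cell above can be wrapped in a small helper so the position is stated once. This is only a sketch: `insert_row` is a hypothetical name, and the toy frame stands in for `df` and `missed_row`.

```python
import pandas as pd


def insert_row(frame: pd.DataFrame, pos: int, row: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with `row` spliced in at integer position `pos`."""
    # ignore_index=True renumbers the result 0..n-1, like reset_index(drop=True).
    return pd.concat([frame.iloc[:pos], row, frame.iloc[pos:]], ignore_index=True)


# Toy data: the row for "b" is missing between "a" and "c".
frame = pd.DataFrame({"video_title": ["a", "c", "d"],
                      "category": ["Music", "Rec", "Music"]})
row = pd.DataFrame({"video_title": ["b"], "category": ["Politics"]})

fixed = insert_row(frame, 1, row)
print(fixed["video_title"].tolist())
```

The original `frame` is left untouched, which makes it easy to compare before and after.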


In [114]:
old_rows.iloc[395:400]

Unnamed: 0,video_title,channel_name,time,app,category,prepped
395,"ТНМК - Дивись, куди ідеш [Official Video]",ТНМК,2023-01-15 16:28:56.093000+02:00,YouTube,Not Music,official video
396,Miss Monique - Elamy [Siona Records],Miss Monique,2023-02-08 23:20:44.281000+02:00,YouTube Music,Music,miss monique elamy siona records
397,🔥 ЧМУТ про рецепт ПЕРЕМОГИ від ЗАЛУЖНОГО #shorts,ПРЯМА ЧЕРВОНА,2023-02-10 18:53:29.227000+02:00,YouTube,Not Music,shorts
398,Heathens,Twenty One Pilots,2023-03-19 21:49:22.618000+02:00,YouTube Music,Music,heathens
399,Getting Past Scottish Immigration,Foil Arms and Hog,2023-01-04 13:52:50.418000+02:00,YouTube,Not Music,getting past scottish immigration


In [115]:
df2

Unnamed: 0,video_title,channel_name,app,category,prepped
0,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,YouTube,Rec,testing sharks smell drop blood
1,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,YouTube,Rec,yakuza 0 karaoke 24hour cinderella majima
2,Why You Should Read: Mistborn By Brandon Sande...,Mike's Book Reviews,YouTube,Rec,read mistborn brandon sanderson spoilerfree
3,Taking advantage of noobs - Scammer,Viva La Dirt League,YouTube,Rec,taking advantage noobs scammer
4,How GitHub works,GitHub,YouTube,Studies,github works
...,...,...,...,...,...
695,Berserkir,Danheim,YouTube Music,Music,berserkir
696,Let's talk about Abrams approval and timelines...,Beau of the Fifth Column,YouTube,Politics,lets talk abrams approval timelines
697,Covert Operations,Adam Schneider,YouTube Music,Music,covert operations
698,What is NLP (Natural Language Processing)?,IBM Technology,YouTube,Studies,nlp natural language processing


In [104]:
df = df.drop(["target_label"], axis=1)
df

Unnamed: 0,video_title,channel_name,app,category,prepped
0,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,YouTube,Rec,testing sharks smell drop blood
1,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,YouTube,Rec,yakuza 0 karaoke 24hour cinderella majima
2,Why You Should Read: Mistborn By Brandon Sande...,Mike's Book Reviews,YouTube,Rec,read mistborn brandon sanderson spoilerfree
3,Taking advantage of noobs - Scammer,Viva La Dirt League,YouTube,Rec,taking advantage noobs scammer
4,How GitHub works,GitHub,YouTube,Studies,github works
...,...,...,...,...,...
694,Berserkir,Danheim,YouTube Music,Music,berserkir
695,Let's talk about Abrams approval and timelines...,Beau of the Fifth Column,YouTube,Politics,lets talk abrams approval timelines
696,Covert Operations,Adam Schneider,YouTube Music,Music,covert operations
697,What is NLP (Natural Language Processing)?,IBM Technology,YouTube,Studies,nlp natural language processing


In [94]:
df[["video_title", "channel_name", "time"]].iloc[395:415]

Unnamed: 0,video_title,channel_name,time
395,"ТНМК - Дивись, куди ідеш [Official Video]",ТНМК,2023-01-15 16:28:56.093000+02:00
396,Miss Monique - Elamy [Siona Records],Miss Monique,2023-02-08 23:20:44.281000+02:00
397,Heathens,Twenty One Pilots,2023-02-10 18:53:29.227000+02:00
398,Getting Past Scottish Immigration,Foil Arms and Hog,2023-03-19 21:49:22.618000+02:00
399,Ghost,Star-Lord Band,2023-01-04 13:52:50.418000+02:00
400,Eigenvectors and Generalized Eigenspaces,Serrano.Academy,2023-03-27 13:05:50.910000+03:00
401,Sector,Daniel Deluxe,2023-03-30 01:08:04.085000+03:00
402,Ghost,Star-Lord Band,2023-03-26 21:05:16.185000+03:00
403,Never Forget,Michael Salvatori,2023-02-20 20:09:48.052000+02:00
404,Learn Docker in 7 Easy Steps - Full Beginner's...,Fireship,2023-01-09 22:39:26.781000+02:00


In [102]:
old_rows.sort_values(by='time', ascending=True)

Unnamed: 0,video_title,channel_name,time,app,category,prepped
574,Russian missile zooms over Kyiv before being s...,The Sun,2023-01-01 09:43:07.927000+02:00,YouTube,Not Music,russian missile zooms kyiv shot time
209,Pathfinder: Kingmaker Review,MandaloreGaming,2023-01-01 10:15:11.845000+02:00,YouTube,Not Music,pathfinder kingmaker review
257,"Warhammer 40,000: Darktide - Official Soundtra...",Fatshark,2023-01-01 13:04:06.366000+02:00,YouTube Music,Music,warhammer 40000 darktide official soundtrack i...
576,Every 1's a Winner (2011 Remaster),Hot Chocolate,2023-01-01 14:02:38.353000+02:00,YouTube Music,Music,every 1s winner 2011 remaster
329,Erika,Grosse Blas-Orchester Mit Chor,2023-01-01 14:13:58.721000+02:00,YouTube Music,Music,erika
...,...,...,...,...,...,...
100,"Warhammer 40,000: Darktide - Official Soundtra...",Fatshark,2023-04-03 22:48:28.158000+03:00,YouTube Music,Music,warhammer 40000 darktide official soundtrack i...
554,[ 다크타이드 OST ] DISPOSAL UNITIMPERIUM MIX,Kestrel,2023-04-03 22:52:49.468000+03:00,YouTube Music,Music,ost disposal unitimperium mix
83,Deep Rock Galactic - 5th Anniversary Space Rig...,Thai,2023-04-03 23:20:38.807000+03:00,YouTube,Not Music,deep rock galactic 5th anniversary space rig m...
307,How to Fall Safely - 3 Breakfall Techniques,GMB Fitness,2023-04-04 00:52:27.367000+03:00,YouTube,Not Music,fall safely 3 breakfall techniques


In [81]:
df["category"].value_counts()

Music       354
Rec         133
Politics    107
Studies      87
Sport        18
Name: category, dtype: int64

In [70]:
new_rows = sample.iloc[700:]
new_rows.head(2)

Unnamed: 0,video_title,channel_name,time,app,category,prepped
700,Endless Space 2 Review,IGN,2023-03-15 12:29:59.182000+02:00,YouTube,Not Music,endless space 2 review
701,5 MIN KILLER ABS WORKOUT (At Home No Equipment...,Sean Vigue Fitness,2023-01-27 09:01:53.183000+02:00,YouTube,Not Music,5 min killer abs workout home equipment power ...


In [72]:
new_rows.tail()

Unnamed: 0,video_title,channel_name,time,app,category,prepped
995,Emotional Man Causes A Ruckus At Stranger's House,Active Self Protection,2023-01-21 22:15:32.978000+02:00,YouTube,Not Music,emotional man causes ruckus strangers house
996,Wingardium Leviosa (Harry Potter Parody Animat...,OneyNG,2023-02-28 12:22:45.766000+02:00,YouTube,Not Music,wingardium leviosa harry potter parody animati...
997,Heart of Steel (Eurovision Version),Tvorchi,2023-03-09 23:58:03.759000+02:00,YouTube Music,Music,heart steel eurovision version
998,"""Пісня Хоробрих"" - бойова пісня 3 полку ССО | ...",Український Повстанець,2023-03-01 21:47:03.586000+02:00,YouTube Music,Music,3 song 3rd regiment ukr special operation forces
999,Every 1's a Winner (2011 Remaster),Hot Chocolate,2023-02-24 09:05:19.925000+02:00,YouTube Music,Music,every 1s winner 2011 remaster


In [58]:
new_rows = sample.iloc[600:]
new_rows.head(2)

Unnamed: 0,video_title,channel_name,app,category,prepped
600,Sector,Daniel Deluxe,YouTube Music,Music,sector
601,"Let's talk about the tech layoffs, objectively.",Karolina Sowinska,YouTube,Not Music,lets talk tech layoffs objectively


In [59]:
# get the features for the new videos using the trained vectorizer
new_features = tv.transform(new_rows["prepped"])
# predict the category of the new videos using the trained LinearSVC
predicted_labels = svm.predict(new_features)
# add the predicted labels to new_rows; assign() returns a new DataFrame,
# so nothing is written into a slice of `sample` (no SettingWithCopyWarning)
new_rows = new_rows.assign(category=predicted_labels)
new_rows


Unnamed: 0,video_title,channel_name,app,category,prepped
600,Sector,Daniel Deluxe,YouTube Music,Music,sector
601,"Let's talk about the tech layoffs, objectively.",Karolina Sowinska,YouTube,Politics,lets talk tech layoffs objectively
602,Defence strategy for small nations - force des...,Perun,YouTube,Politics,defence strategy small nations force design fr...
603,Berserkir,Danheim,YouTube Music,Music,berserkir
604,SadSvit - Силуети (feat. СТРУКТУРА ЩАСТЯ) Lyri...,SadSvit,YouTube Music,Music,sadsvit feat lyric video
...,...,...,...,...,...
695,Berserkir,Danheim,YouTube Music,Music,berserkir
696,Let's talk about Abrams approval and timelines...,Beau of the Fifth Column,YouTube,Politics,lets talk abrams approval timelines
697,Covert Operations,Adam Schneider,YouTube Music,Music,covert operations
698,What is NLP (Natural Language Processing)?,IBM Technology,YouTube,Studies,nlp natural language processing
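
For reference, here is a minimal end-to-end sketch of the fit-then-predict step on toy data. It assumes `tv` is a scikit-learn `TfidfVectorizer` (the notebook only calls it "the trained vectorizer") and uses default `LinearSVC` settings; the training texts are invented, not taken from the real sample.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented, already-preprocessed titles with hand-assigned categories.
train = pd.DataFrame({
    "prepped": [
        "official soundtrack remaster",
        "lyric video feat",
        "game review gameplay",
        "parody animation funny",
        "defence strategy nations",
        "lets talk layoffs",
    ],
    "category": ["Music", "Music", "Rec", "Rec", "Politics", "Politics"],
})

# Assumption: TF-IDF features; the vectorizer is fitted on the labelled rows only.
tv = TfidfVectorizer()
svm = LinearSVC()
svm.fit(tv.fit_transform(train["prepped"]), train["category"])

# Unlabelled rows: reuse the fitted vocabulary with transform(), never fit_transform().
new_rows = pd.DataFrame({"prepped": ["soundtrack lyric video", "defence strategy talk"]})
new_rows = new_rows.assign(category=svm.predict(tv.transform(new_rows["prepped"])))
print(new_rows)
```

Building the result with `assign` keeps the predictions in a fresh DataFrame instead of writing into a slice of another frame.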


In [60]:
# new_rows.to_csv("tried_600-700.csv")

Hmm, I had a look at the output, and it misclassified only 12 out of 100 rows (88% correct). Nice.
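
The count above came from reading the output by eye. If the hand-corrected labels were kept next to the predictions, the same check could be scripted. A rough sketch, where `review_summary` and the two label lists are hypothetical:

```python
from collections import Counter


def review_summary(predicted, corrected):
    """Count disagreements between predicted and hand-corrected labels."""
    if not predicted or len(predicted) != len(corrected):
        raise ValueError("need two non-empty label lists of equal length")
    # Tally each (predicted, corrected) pair that disagrees.
    errors = Counter((p, c) for p, c in zip(predicted, corrected) if p != c)
    wrong = sum(errors.values())
    accuracy = 1 - wrong / len(predicted)
    return wrong, accuracy, errors


# Made-up labels: one "Studies" video was predicted as "Politics".
predicted = ["Music", "Politics", "Rec", "Studies", "Politics"]
corrected = ["Music", "Studies", "Rec", "Studies", "Politics"]

wrong, accuracy, errors = review_summary(predicted, corrected)
print(wrong, accuracy, errors.most_common())
```

The `errors` counter shows which category pairs get confused most often, which is more useful than the raw count when deciding what to label next.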

In [61]:
tried_df = pd.read_csv("tried_600-700.csv", index_col=0)
tried_df

Unnamed: 0,video_title,channel_name,app,category,prepped
600,Sector,Daniel Deluxe,YouTube Music,Music,sector
601,"Let's talk about the tech layoffs, objectively.",Karolina Sowinska,YouTube,Politics,lets talk tech layoffs objectively
602,Defence strategy for small nations - force des...,Perun,YouTube,Politics,defence strategy small nations force design fr...
603,Berserkir,Danheim,YouTube Music,Music,berserkir
604,SadSvit - Силуети (feat. СТРУКТУРА ЩАСТЯ) Lyri...,SadSvit,YouTube Music,Music,sadsvit feat lyric video
...,...,...,...,...,...
695,Berserkir,Danheim,YouTube Music,Music,berserkir
696,Let's talk about Abrams approval and timelines...,Beau of the Fifth Column,YouTube,Politics,lets talk abrams approval timelines
697,Covert Operations,Adam Schneider,YouTube Music,Music,covert operations
698,What is NLP (Natural Language Processing)?,IBM Technology,YouTube,Studies,nlp natural language processing


In [62]:
df_combined = pd.concat([df.drop(["target_label"], axis=1), new_rows], axis=0)
df_combined

Unnamed: 0,video_title,channel_name,app,category,prepped
0,Testing if Sharks Can Smell a Drop of Blood,Mark Rober,YouTube,Rec,testing sharks smell drop blood
1,Yakuza 0- Karaoke: 24-hour Cinderella (Majima),IosonoOtakuman,YouTube,Rec,yakuza 0 karaoke 24hour cinderella majima
2,Why You Should Read: Mistborn By Brandon Sande...,Mike's Book Reviews,YouTube,Rec,read mistborn brandon sanderson spoilerfree
3,Taking advantage of noobs - Scammer,Viva La Dirt League,YouTube,Rec,taking advantage noobs scammer
4,How GitHub works,GitHub,YouTube,Studies,github works
...,...,...,...,...,...
695,Berserkir,Danheim,YouTube Music,Music,berserkir
696,Let's talk about Abrams approval and timelines...,Beau of the Fifth Column,YouTube,Politics,lets talk abrams approval timelines
697,Covert Operations,Adam Schneider,YouTube Music,Music,covert operations
698,What is NLP (Natural Language Processing)?,IBM Technology,YouTube,Studies,nlp natural language processing


In [63]:
# df_combined.to_csv("sample_size_700.csv")