## Recap:
In the last notebook we created a model to classify the CBtiness (clickbaitiness) of a text title. Because our dataset was based mostly on news articles, the actual accuracy here will be lower than the 99.5% we achieved, but it should still be very high.

Now we will use a .csv file of US trending videos in 2017, which can be found at this [Kaggle link](https://www.kaggle.com/datasnaek/youtube-new#USvideos.csv). We will do several things:

1. Load and Preview the .csv file
2. Load our TextCB Model
3. Create a CB column using the model based on video titles 
4. Load thumbnails (we already have most of the code for this from Clickbait 1.0)
5. STOP! No training yet

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

np.random.seed(179)
print(tf.__version__) # 2.x required

2.1.0


In [35]:
usvideos_filename = 'data/USvideos.csv'

df = pd.read_csv(usvideos_filename)
df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [21]:
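# Download all thumbnails into the data/imgs/ folder (or make sure they are already there)
import os
import requests
from os import path

os.makedirs('data/imgs', exist_ok=True)  # create the target folder if it is missing

for index, row in df.iterrows():
    vid_id = row['video_id']
    tmb_url = row['thumbnail_link']
    vid_path = f'data/imgs/{vid_id}.jpg'
    if path.exists(vid_path):
        continue  # already downloaded (this also skips repeated video_ids)
    # A timeout keeps one stalled request from hanging the whole loop
    img_data = requests.get(tmb_url, timeout=10).content
    with open(vid_path, 'wb') as handler:
        handler.write(img_data)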

In [27]:
!ls -1 data/imgs | wc -l
df.shape

    6351


(40949, 16)

I made this mistake again: I forgot that the dataset contains duplicates. The same `video_id` appears in several rows, so we actually have only 6351 unique videos, not 40949.
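
A quick way to catch this earlier is to compare the row count with the number of distinct `video_id` values before downloading anything. The sketch below uses a small made-up frame (the IDs and dates are illustrative, not rows from `USvideos.csv`):

```python
import pandas as pd

# Toy stand-in for the trending data; IDs and dates are made up.
toy = pd.DataFrame({
    'video_id': ['a1', 'a1', 'b2', 'c3', 'c3', 'c3'],
    'trending_date': ['17.14.11', '17.15.11', '17.14.11',
                      '17.14.11', '17.15.11', '17.16.11'],
})

n_rows = len(toy)                           # counts every row
n_unique = toy['video_id'].nunique()        # counts distinct videos
deduped = toy.drop_duplicates('video_id')   # keeps the first row per video

print(n_rows, n_unique, deduped.shape)
```

If the two counts differ, as they do here, the frame needs `drop_duplicates('video_id')` before anything is computed per video.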

In [102]:
import cv2 

df = df.drop_duplicates('video_id')
images = []
is404 = []

tmbn_404_path = "data/404_thumbnail.jpg"  # The placeholder thumbnail that is produced for a deleted image
tmbn_404_img = cv2.imread(tmbn_404_path)

for index, row in df.iterrows():
    vid_id = row['video_id']
    img_path = f'data/imgs/{vid_id}.jpg'
    img = cv2.imread(img_path)
    images.append(img)
    is404.append(np.array_equal(img, tmbn_404_img)) # Is 404

df['image'] = images
df['is404'] = is404

df = df[~df['is404']]  # keep only rows whose thumbnail is not the 404 placeholder

In [103]:
df.shape

(5937, 18)
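
The 404 filter relies on exact pixel equality with a saved placeholder image. Here is a minimal sketch of that comparison on synthetic arrays (the 90x120 shape and the pixel values are assumptions for illustration). It also shows that an unreadable file, for which `cv2.imread` returns `None`, is not flagged as a placeholder and so stays in the frame:

```python
import numpy as np

# Synthetic stand-ins for decoded thumbnails; shape and values are made up.
placeholder = np.full((90, 120, 3), 128, dtype=np.uint8)
same = placeholder.copy()
different = placeholder.copy()
different[0, 0, 0] = 0   # a single differing pixel breaks equality
unreadable = None        # what cv2.imread returns when a file cannot be read

flags = [np.array_equal(img, placeholder)
         for img in (same, different, unreadable)]
print(flags)
```

Because `None` entries pass this filter, it may be worth checking `df['image'].isna().sum()` before using the images downstream.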

In [114]:
import tensorflow_hub as hub
from tensorflow_hub import KerasLayer

text_model = tf.keras.models.load_model("pretrained_dense.h5", custom_objects={"KerasLayer":KerasLayer})

In [146]:
# Flatten the predictions (typically shape (n, 1)) so they fit a single column
cb_labels = text_model.predict(df['title'].values).ravel() > 0.5
df['clickbait'] = cb_labels

In [160]:
df[df.clickbait].shape[0], df[~df.clickbait].shape[0]

(3889, 2048)

We have 3889 videos classified as clickbait and 2048 classified as not clickbait.
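
These labels come from thresholding the model's score at 0.5. The sketch below repeats that step on made-up scores (not real model output), assuming the scores arrive with shape `(n, 1)`, as is typical for a Keras model with a single output unit:

```python
import numpy as np
import pandas as pd

# Made-up scores shaped like a single-unit model output: (n_samples, 1).
scores = np.array([[0.91], [0.12], [0.50], [0.77]])

labels = scores.ravel() > 0.5   # flatten to 1-D, then threshold
toy = pd.DataFrame({'title': ['t1', 't2', 't3', 't4'], 'clickbait': labels})

n_clickbait = int(toy['clickbait'].sum())    # True counts as 1
n_not = int((~toy['clickbait']).sum())
print(n_clickbait, n_not)
```

Note that a score of exactly 0.5 falls on the "not clickbait" side, because the comparison is strict.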

Okay, I'll save our new db.

In [161]:
df.to_pickle('labeled_clickbait.pkl')
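
Pickle is a reasonable choice here, presumably because the `image` column holds NumPy arrays, which a CSV file would store only as text. A minimal sketch of the round trip, using a toy frame and a temporary file (both hypothetical, not the real `labeled_clickbait.pkl`):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame with an array-valued column, mimicking the 'image' column.
toy = pd.DataFrame({'video_id': ['a1', 'b2'], 'clickbait': [True, False]})
toy['image'] = [np.zeros((2, 2, 3), dtype=np.uint8),
                np.ones((2, 2, 3), dtype=np.uint8)]

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'toy_labeled.pkl')  # hypothetical file name
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# The arrays come back as arrays, with their dtype and shape intact.
first_image = restored.loc[0, 'image']
print(type(first_image), first_image.shape, restored['clickbait'].tolist())
```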