# Testing Model in Production

In this notebook, we load the trained classifiers and the title vectorizer and use them to predict the appeal scores of new videos on YouTube. This gives a better idea of how well the classifiers, or an ensemble of classifiers, identify which of the most recent YouTube videos about artificial intelligence, machine learning and deep learning are most likely to interest me.

If I am satisfied with the predictions, we will select the best-performing predictor and use it in the web application that will be built next.

In [1]:
import pandas as pd
import joblib as jb
from scipy.sparse import hstack, csr_matrix
import numpy as np
import json
import youtube_dl

## Importing trained models

In [2]:
# Trained models:
clf_rf = jb.load("random_forest_2021.pkl.z")
clf_lr = jb.load("logistic_reg_2021.pkl.z")
clf_lgbm = jb.load("lgbm_2021.pkl.z")
# Trained vectorizer:
title_vec = jb.load("title_vectorizer_2021.pkl.z")

## Extracting and cleaning new data

In [3]:
# My subjects of interest:
queries = ["artificial+intelligence", "machine+learning", "deep+learning"]
# Defining youtube_dl object:
ydl = youtube_dl.YoutubeDL({"ignoreerrors": True})
# Starting the list where we will store the collected data:
results = []
# Collecting data for each query:
for query in queries:
    r = ydl.extract_info("ytsearchdate40:{}".format(query), download=False)
    results += r['entries']

[download] Downloading playlist: artificial+intelligence
[youtube:search:date] query "artificial+intelligence": Downloading page 1
[youtube:search:date] query "artificial+intelligence": Downloading page 2
[youtube:search:date] playlist artificial+intelligence: Downloading 40 videos
[download] Downloading video 1 of 40
[youtube] 0Pb3gQ3MoQw: Downloading webpage
[youtube] 0Pb3gQ3MoQw: Downloading MPD manifest
[download] Downloading video 2 of 40
[youtube] 0fpSndBPQZc: Downloading webpage
[youtube] 0fpSndBPQZc: Downloading MPD manifest
[download] Downloading video 3 of 40
[youtube] p3pJMH2rv6w: Downloading webpage
[youtube] p3pJMH2rv6w: Downloading MPD manifest
[download] Downloading video 4 of 40
[youtube] u-8ccBP8za0: Downloading webpage
[youtube] u-8ccBP8za0: Downloading MPD manifest
[download] Downloading video 5 of 40
[youtube] LbfWkj-KMPc: Downloading webpage
[youtube] LbfWkj-KMPc: Downloading MPD manifest
[download] Downloading video 6 of 40
[youtube] SEk3FR6BcPY: Downloading webpa

[download] Downloading video 24 of 40
[youtube] 5cbuCahd3IE: Downloading webpage
[youtube] 5cbuCahd3IE: Downloading MPD manifest
[download] Downloading video 25 of 40
[youtube] 6M3kklVVsqE: Downloading webpage
[youtube] 6M3kklVVsqE: Downloading MPD manifest
[download] Downloading video 26 of 40
[youtube] HQ1U3SSk6kU: Downloading webpage
[youtube] HQ1U3SSk6kU: Downloading MPD manifest
[download] Downloading video 27 of 40
[youtube] t51TyyPvHDc: Downloading webpage
[youtube] t51TyyPvHDc: Downloading MPD manifest
[download] Downloading video 28 of 40
[youtube] rQL1zOz44ns: Downloading webpage
[youtube] rQL1zOz44ns: Downloading MPD manifest
[download] Downloading video 29 of 40
[youtube] ZoEi8XlV524: Downloading webpage
[youtube] ZoEi8XlV524: Downloading MPD manifest
[download] Downloading video 30 of 40
[youtube] Cse-3MM7mso: Downloading webpage
[download] Downloading video 31 of 40
[youtube] hJ1xovAZzII: Downloading webpage
[download] Downloading video 32 of 40
[youtube] 49pSgkip6B4: Dow
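
Because `ignoreerrors` is enabled, a search may return placeholders for videos that failed to load, and the same video can match more than one query. Below is a minimal sketch of a cleaning step. It assumes failed entries show up as `None` and that each entry carries an `id` key. The helper name `clean_entries` is hypothetical and not part of the notebook.

```python
def clean_entries(entries):
    """Drop failed (None) entries and keep the first occurrence of each video id."""
    seen = set()
    cleaned = []
    for entry in entries:
        if entry is None:            # assumed marker of a failed extraction
            continue
        video_id = entry.get("id")   # assumed unique video identifier
        if video_id in seen:         # same video returned by another query
            continue
        seen.add(video_id)
        cleaned.append(entry)
    return cleaned


# Made-up entries, only to show the behaviour:
sample = [
    {"id": "a", "title": "A"},
    None,
    {"id": "a", "title": "A"},
    {"id": "b", "title": "B"},
]
print([e["id"] for e in clean_entries(sample)])
```

Applying such a step to `results` before scoring would keep the prediction functions from having to deal with missing entries.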

In [4]:
# FUNCTIONS WHICH TRANSFORM THE FEATURES =========================================================================
# Maps each YouTube category to the ordinal code used by the models:
CATEG_CODES = {'Autos & Vehicles': 0, 'Gaming': 0, 'Comedy': 0,
               'Sports': 0, 'Pets & Animals': 0, 'Music': 0,
               'Film & Animation': 1, 'Howto & Style': 1,
               'People & Blogs': 2, 'Entertainment': 2,
               'Education': 3, 'Science & Technology': 4,
               'News & Politics': 5, 'Nonprofits & Activism': 6}

def categ(x):
    # x is the list of categories reported for the video; anything that is
    # not a single known category falls back to 0.
    if isinstance(x, list) and len(x) == 1:
        return CATEG_CODES.get(x[0], 0)
    return 0

def durat(x):
    t = x/60
    if t<=30: return 0
    elif t>30 and t<=70: return 1
    else: return 2

def view(data):
    if 'view_count' in data.keys():
        x = data['view_count']
        if x<=50: return 0
        elif x>50 and x<=1e3: return 1
        elif x>1e3 and x<=1e5: return 2
        else: return 3
    else: return 0
    
def ch_appeal(channel_name):
    list_3 = ['study','mit ','lex ','amini','proj','brunton']
    list_2 = ['tedx','stanford','marr','youtube','online']
    list_1 = ['school','sci','lab','google','krish','course']

    l3 = [True for word in list_3 if word in channel_name]
    l2 = [True for word in list_2 if word in channel_name]
    l1 = [True for word in list_1 if word in channel_name]
    if any(l3): return 3
    elif any(l2): return 2
    elif any(l1): return 1
    else: return 0
    
def like_ra(data):
    if 'like_count' in data and 'dislike_count' in data:
        if data['like_count']==0: return 1
        else:
            ratio = data['dislike_count']/data['like_count']
            if ratio>0.1: return 0
            else: return 1
    else: return 0
    
# FUNCTION WHICH RETURNS THE FINAL DATASET ======================================================================
def compute_features(data):
    """Creates the same features which we used to train the models. """
    
    categories     = categ(data['categories'])
    duration       = durat(data['duration'])
    views          = view(data)
    channel_appeal = ch_appeal(data['uploader'].lower())
    like_ratio     = like_ra(data)
    title          = data['title']

    features = dict()

    features['categ']          = categories
    features['duration']       = duration
    features['views']          = views
    features['like_ratio']     = like_ratio
    features['channel_appeal'] = channel_appeal
    features['title']          = title

    vectorized_title = title_vec.transform([title])

    feats=csr_matrix(np.array([features['categ'],features['duration'],features['views'],features['like_ratio'],features['channel_appeal']]))
    feature_array = hstack([feats, vectorized_title])
    return feature_array
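
`compute_features` indexes fields such as `data['duration']` directly, which raises an error if a field is absent or `None`. Below is a hedged sketch of a more defensive variant, assuming some entries may lack a duration. `safe_duration_bucket` is a hypothetical helper that mirrors the thresholds of `durat`, and the fallback bucket of 0 is an arbitrary choice.

```python
def safe_duration_bucket(data, default=0):
    """Bucket the duration (in seconds) like durat, but tolerate a missing value."""
    seconds = data.get("duration")
    if seconds is None:        # field absent or explicitly None
        return default
    minutes = seconds / 60
    if minutes <= 30:
        return 0
    if minutes <= 70:
        return 1
    return 2


print(safe_duration_bucket({"duration": 3600}))  # a 60-minute video
print(safe_duration_bucket({}))                  # no duration available
```

The same pattern (`dict.get` plus an explicit fallback) could be applied to `categories` and `uploader` if those fields turn out to be unreliable as well.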

## Evaluating models performances on new data

In [5]:
# FUNCTIONS WHICH RETURN THE PREDICTIONS ===========================================================
def compute_prediction(data,w_lr,w_rf,w_lgbm):
    """Computes the scores of the new videos. """
    if data is None:
        return 0
    
    feature_array = compute_features(data)
    #print(feature_array)

    if feature_array is None:
        return 0


    p_rf   = clf_rf.predict_proba(feature_array)[0][1]
    p_lr   = clf_lr.predict_proba(feature_array)[0][1]
    p_lgbm = clf_lgbm.predict_proba(feature_array)[0][1]
    p = w_lr*p_lr + w_rf*p_rf + w_lgbm*p_lgbm
    return p

def predictions_list(w_lr,w_rf,w_lgbm):
    # With ignoreerrors=True, entries that failed to load may be None, so skip them:
    rows = [{'title':i['title'], 'score':compute_prediction(i,w_lr,w_rf,w_lgbm)}
            for i in results if i is not None]
    # DataFrame.append was removed in pandas 2.0, so the frame is built in one go:
    predictions = pd.DataFrame(rows, columns=['title', 'score'])
    predictions = predictions.drop_duplicates(subset=['title','score'], keep='first')
    return predictions

pd.options.display.max_colwidth = 120
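
Each predictor compared in the sections that follow is a weighted average of the three model probabilities. Below is a small standalone sketch of that blend, with an added check that the weights are non-negative and sum to 1, so that the result stays in the range of a probability. The function name `blend` and the check are illustrative and not part of the notebook.

```python
import math


def blend(p_lr, p_rf, p_lgbm, w_lr, w_rf, w_lgbm):
    """Weighted average of the three model probabilities."""
    weights = (w_lr, w_rf, w_lgbm)
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return w_lr * p_lr + w_rf * p_rf + w_lgbm * p_lgbm


# Made-up probabilities for a single video:
print(blend(0.7, 0.5, 0.9, w_lr=0.3, w_rf=0.5, w_lgbm=0.2))
```

Setting one weight to 1 and the others to 0 recovers a single model, which is how the individual classifiers are evaluated here.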

### Random Forest predictions

In [6]:
w_lr=0; w_rf=1; w_lgbm=0;

rf_predictions = predictions_list(w_lr,w_rf,w_lgbm).drop_duplicates(subset=['title','score'],keep='first')
rf_predictions_sorted = rf_predictions.sort_values(by='score',ascending=False)
rf_predictions_sorted[:20]



index,title,score
105,Image Colorization Using GANs | Deep Learning | TensorFlow | Python,0.630703
84,Deep Learning Meets Sparse Coding,0.572992
110,Cancer Detection Using Deep Learning | Deep Learning Projects | Deep Learning Training | Edureka,0.563592
112,The Dimpled Manifold Model of Adversarial Examples in Machine Learning (Research Paper Explained),0.554787
108,Uncertainty Estimation for Object Detection Using Deep Learning Approaches,0.547998
56,Dive into Deep Learning: Coding Session #6 GANs (Americas/EMEA),0.543161
57,Dive into Deep Learning: Coding Session #6 GANs (APAC),0.537145
117,Sharing My Time Table And Efficient Strategy To Learn Machine Learning In Quick Time,0.525612
111,plant diseases detection using image processing and deep learning,0.522563
33,The Race for Artificial Intelligence | Vice Documentary CyberWar Series,0.518318


In [7]:
rf_predictions_sorted[-10:]

index,title,score
6,artificial intelligence,0.333709
64,Wafer Health Predictive Modeling - End to End Machine Learning Project 01 - Urdu/Hindi,0.327973
15,Artificial Intelligence - What is it and how far can it go in Hindi Urduارٹی فیشل انٹلی جنس ۔کیا ہے,0.32298
32,Artificial Intelligence PowerPoint Templates Designs,0.289827
0,Artificial Intelligence Jerk,0.284792
115,What is Machine Learning,0.281476
35,ai artificial intelligence. artificial intelligence course. artificial intelligence tutorial.,0.277768
49,machine learning revise3d,0.270784
4,Fundamentals of Artificial Intelligence in Hindi,0.269062
36,Artificial Intelligence Machine Learning Deep Learning Ppt Powerpoint Presentation Slide Templates,0.252998


### LGBM predictions

In [8]:
w_lr=0; w_rf=0; w_lgbm=1;

lgbm_predictions = predictions_list(w_lr,w_rf,w_lgbm).drop_duplicates(subset=['title','score'],keep='first')
lgbm_predictions_sorted = lgbm_predictions.sort_values(by='score',ascending=False)
lgbm_predictions_sorted[:20]



index,title,score
110,Cancer Detection Using Deep Learning | Deep Learning Projects | Deep Learning Training | Edureka,0.878264
84,Deep Learning Meets Sparse Coding,0.83655
105,Image Colorization Using GANs | Deep Learning | TensorFlow | Python,0.808218
56,Dive into Deep Learning: Coding Session #6 GANs (Americas/EMEA),0.725182
108,Uncertainty Estimation for Object Detection Using Deep Learning Approaches,0.706868
118,"Lecture on ""Simple, Fast and Practical Uncertainty Estimation in Deep Learning"" by Jishnu Mukhoti.",0.676862
26,"Leti Innovation Days 2021 - #Semiconductors, #Innovation Engine for #Artificial #Intelligence",0.672647
57,Dive into Deep Learning: Coding Session #6 GANs (APAC),0.6154
112,The Dimpled Manifold Model of Adversarial Examples in Machine Learning (Research Paper Explained),0.615226
33,The Race for Artificial Intelligence | Vice Documentary CyberWar Series,0.608709


In [9]:
lgbm_predictions_sorted[-10:]

index,title,score
64,Wafer Health Predictive Modeling - End to End Machine Learning Project 01 - Urdu/Hindi,0.043154
28,IAT PE DXB - Vertical Jump - Flipped Learning - Artificial Intelligence App,0.039342
32,Artificial Intelligence PowerPoint Templates Designs,0.038561
88,Matthias Nowak's Innovative Technical and Creative Exercises 👉 Deep Learning with Issam and Amir ⚽️🔥,0.019888
34,Įspūdis 19. Dirbtinis intelektas. Impression 19. Artificial Intelligence,0.019453
49,machine learning revise3d,0.012376
4,Fundamentals of Artificial Intelligence in Hindi,0.008516
19,Borderless Security - Artificial Intelligence and Machine Learning (AI/ML),0.007719
115,What is Machine Learning,0.006294
36,Artificial Intelligence Machine Learning Deep Learning Ppt Powerpoint Presentation Slide Templates,0.002912


### Logistic Regression predictions

In [10]:
w_lr=1; w_rf=0; w_lgbm=0;

lr_predictions = predictions_list(w_lr,w_rf,w_lgbm).drop_duplicates(subset=['title','score'],keep='first')
lr_predictions_sorted = lr_predictions.sort_values(by='score',ascending=False)
lr_predictions_sorted[:20]



index,title,score
110,Cancer Detection Using Deep Learning | Deep Learning Projects | Deep Learning Training | Edureka,0.725457
105,Image Colorization Using GANs | Deep Learning | TensorFlow | Python,0.706402
112,The Dimpled Manifold Model of Adversarial Examples in Machine Learning (Research Paper Explained),0.651395
57,Dive into Deep Learning: Coding Session #6 GANs (APAC),0.634164
108,Uncertainty Estimation for Object Detection Using Deep Learning Approaches,0.630856
33,The Race for Artificial Intelligence | Vice Documentary CyberWar Series,0.623608
111,plant diseases detection using image processing and deep learning,0.613881
84,Deep Learning Meets Sparse Coding,0.571818
56,Dive into Deep Learning: Coding Session #6 GANs (Americas/EMEA),0.557809
26,"Leti Innovation Days 2021 - #Semiconductors, #Innovation Engine for #Artificial #Intelligence",0.552628


In [11]:
lr_predictions_sorted[-10:]

index,title,score
52,What Is Data Science | Data Science vs Machine Learning (Data Science Explained),0.145629
73,Machine Learning: Dimension Reduction,0.144246
32,Artificial Intelligence PowerPoint Templates Designs,0.129145
19,Borderless Security - Artificial Intelligence and Machine Learning (AI/ML),0.128007
115,What is Machine Learning,0.12288
0,Artificial Intelligence Jerk,0.112756
35,ai artificial intelligence. artificial intelligence course. artificial intelligence tutorial.,0.097196
36,Artificial Intelligence Machine Learning Deep Learning Ppt Powerpoint Presentation Slide Templates,0.09496
49,machine learning revise3d,0.090161
4,Fundamentals of Artificial Intelligence in Hindi,0.085167


### Ensemble 1 : 0.3 p_lr + 0.5 p_rf + 0.2 p_lgbm

In [12]:
w_lr=0.3; w_rf=0.5; w_lgbm=0.2;

e1_predictions = predictions_list(w_lr,w_rf,w_lgbm).drop_duplicates(subset=['title','score'],keep='first')
e1_predictions_sorted = e1_predictions.sort_values(by='score',ascending=False)
e1_predictions_sorted[:20]



index,title,score
105,Image Colorization Using GANs | Deep Learning | TensorFlow | Python,0.688916
110,Cancer Detection Using Deep Learning | Deep Learning Projects | Deep Learning Training | Edureka,0.675086
84,Deep Learning Meets Sparse Coding,0.625352
108,Uncertainty Estimation for Object Detection Using Deep Learning Approaches,0.604629
112,The Dimpled Manifold Model of Adversarial Examples in Machine Learning (Research Paper Explained),0.595857
56,Dive into Deep Learning: Coding Session #6 GANs (Americas/EMEA),0.58396
57,Dive into Deep Learning: Coding Session #6 GANs (APAC),0.581902
33,The Race for Artificial Intelligence | Vice Documentary CyberWar Series,0.567983
111,plant diseases detection using image processing and deep learning,0.527998
90,Object detection using deep learning dataset cctv road video,0.520898


In [13]:
e1_predictions_sorted[-10:]

index,title,score
73,Machine Learning: Dimension Reduction,0.237592
15,Artificial Intelligence - What is it and how far can it go in Hindi Urduارٹی فیشل انٹلی جنس ۔کیا ہے,0.227669
19,Borderless Security - Artificial Intelligence and Machine Learning (AI/ML),0.218226
35,ai artificial intelligence. artificial intelligence course. artificial intelligence tutorial.,0.202702
32,Artificial Intelligence PowerPoint Templates Designs,0.191369
0,Artificial Intelligence Jerk,0.188984
115,What is Machine Learning,0.178861
49,machine learning revise3d,0.164915
4,Fundamentals of Artificial Intelligence in Hindi,0.161784
36,Artificial Intelligence Machine Learning Deep Learning Ppt Powerpoint Presentation Slide Templates,0.155569


### Ensemble 2 : 0.4 p_lr + 0.6 p_lgbm

In [14]:
w_lr=0.4; w_rf=0; w_lgbm=0.6;

e2_predictions = predictions_list(w_lr,w_rf,w_lgbm).drop_duplicates(subset=['title','score'],keep='first')
e2_predictions_sorted = e2_predictions.sort_values(by='score',ascending=False)
e2_predictions_sorted[:20]



index,title,score
110,Cancer Detection Using Deep Learning | Deep Learning Projects | Deep Learning Training | Edureka,0.817141
105,Image Colorization Using GANs | Deep Learning | TensorFlow | Python,0.767492
84,Deep Learning Meets Sparse Coding,0.730657
108,Uncertainty Estimation for Object Detection Using Deep Learning Approaches,0.676463
56,Dive into Deep Learning: Coding Session #6 GANs (Americas/EMEA),0.658233
112,The Dimpled Manifold Model of Adversarial Examples in Machine Learning (Research Paper Explained),0.629694
26,"Leti Innovation Days 2021 - #Semiconductors, #Innovation Engine for #Artificial #Intelligence",0.62464
57,Dive into Deep Learning: Coding Session #6 GANs (APAC),0.622906
33,The Race for Artificial Intelligence | Vice Documentary CyberWar Series,0.614668
118,"Lecture on ""Simple, Fast and Practical Uncertainty Estimation in Deep Learning"" by Jishnu Mukhoti.",0.577029


In [15]:
e2_predictions_sorted[-10:]

index,title,score
28,IAT PE DXB - Vertical Jump - Flipped Learning - Artificial Intelligence App,0.113252
15,Artificial Intelligence - What is it and how far can it go in Hindi Urduارٹی فیشل انٹلی جنس ۔کیا ہے,0.112013
73,Machine Learning: Dimension Reduction,0.108005
0,Artificial Intelligence Jerk,0.083387
32,Artificial Intelligence PowerPoint Templates Designs,0.074795
19,Borderless Security - Artificial Intelligence and Machine Learning (AI/ML),0.055834
115,What is Machine Learning,0.052928
49,machine learning revise3d,0.04349
36,Artificial Intelligence Machine Learning Deep Learning Ppt Powerpoint Presentation Slide Templates,0.039731
4,Fundamentals of Artificial Intelligence in Hindi,0.039176


### Ensemble 3 : 0.1 p_rf + 0.9 p_lgbm

In [16]:
w_lr=0; w_rf=0.1; w_lgbm=0.9;

e3_predictions = predictions_list(w_lr,w_rf,w_lgbm).drop_duplicates(subset=['title','score'],keep='first')
e3_predictions_sorted = e3_predictions.sort_values(by='score',ascending=False)
e3_predictions_sorted[:20]



index,title,score
110,Cancer Detection Using Deep Learning | Deep Learning Projects | Deep Learning Training | Edureka,0.846796
84,Deep Learning Meets Sparse Coding,0.810194
105,Image Colorization Using GANs | Deep Learning | TensorFlow | Python,0.790467
56,Dive into Deep Learning: Coding Session #6 GANs (Americas/EMEA),0.70698
108,Uncertainty Estimation for Object Detection Using Deep Learning Approaches,0.690981
118,"Lecture on ""Simple, Fast and Practical Uncertainty Estimation in Deep Learning"" by Jishnu Mukhoti.",0.659202
26,"Leti Innovation Days 2021 - #Semiconductors, #Innovation Engine for #Artificial #Intelligence",0.648893
112,The Dimpled Manifold Model of Adversarial Examples in Machine Learning (Research Paper Explained),0.609182
57,Dive into Deep Learning: Coding Session #6 GANs (APAC),0.607575
33,The Race for Artificial Intelligence | Vice Documentary CyberWar Series,0.59967


In [17]:
e3_predictions_sorted[-10:]

index,title,score
64,Wafer Health Predictive Modeling - End to End Machine Learning Project 01 - Urdu/Hindi,0.071636
28,IAT PE DXB - Vertical Jump - Flipped Learning - Artificial Intelligence App,0.0709
32,Artificial Intelligence PowerPoint Templates Designs,0.063688
88,Matthias Nowak's Innovative Technical and Creative Exercises 👉 Deep Learning with Issam and Amir ⚽️🔥,0.054201
34,Įspūdis 19. Dirbtinis intelektas. Impression 19. Artificial Intelligence,0.052998
19,Borderless Security - Artificial Intelligence and Machine Learning (AI/ML),0.042603
49,machine learning revise3d,0.038217
4,Fundamentals of Artificial Intelligence in Hindi,0.03457
115,What is Machine Learning,0.033812
36,Artificial Intelligence Machine Learning Deep Learning Ppt Powerpoint Presentation Slide Templates,0.02792


## Conclusion

When we compare the three classifiers, we see that LightGBM is by far the most confident, i.e., it assigns both very high and very low scores to the new videos. The second most confident is Logistic Regression, while Random Forest only assigns scores between roughly 0.25 and 0.63.
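
One rough way to quantify this "confidence" is the spread between the highest and the lowest score each model assigns. The sketch below uses the top and bottom scores from the single-model tables above. The `score_range` dictionary is typed in by hand for illustration and is not produced by the notebook.

```python
# (lowest score, highest score) read off the single-model tables above
score_range = {
    "random_forest": (0.252998, 0.630703),
    "logistic_regression": (0.085167, 0.725457),
    "lightgbm": (0.002912, 0.878264),
}

# Spread of each model and ranking from widest to narrowest
spread = {name: high - low for name, (low, high) in score_range.items()}
ranking = sorted(spread, key=spread.get, reverse=True)

for name in ranking:
    print(f"{name}: {spread[name]:.3f}")
```

A wide spread only shows that a model separates the videos strongly; whether the separation is right still has to be judged from the rankings themselves.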

I consider Ensemble 3 the best predictor, because it combines the confidence of LightGBM with a little bit of doubt from Random Forest. This combination, which gives LightGBM's prediction a weight of 0.9 and Random Forest's a weight of 0.1, produces the ranking I find most satisfactory for recent YouTube videos about artificial intelligence, machine learning and deep learning.

Therefore, we will use

$$p = 0.1p_{rf}+0.9p_{lgbm}$$

as our predictor in the web application for this recommender, which can be accessed at this [link](https://stormy-lake-83008.herokuapp.com/).
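
As a quick consistency check of this formula against the tables above, take the video with index 110, for which Random Forest gives 0.563592 and LightGBM gives 0.878264:

$$p = 0.1 \times 0.563592 + 0.9 \times 0.878264 \approx 0.8468$$

This agrees with the score of 0.846796 reported for that video under Ensemble 3.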