<h1 id="basics" style="font-family:verdana;"> 
    <center> Measurement Problems for YouTube Trends
    </center>
</h1>
<div style="width:100%;text-align: center;"> <img align=middle src="https://www.teknohall.com/wp-content/uploads/2021/06/turkiye-youtube-trend.png" alt="Heat beating" style="height:300px;margin-top:3rem;"> </div>



<div style="font-size:15px; font-family:verdana;">YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they’re not the most-viewed videos overall for the calendar year”. Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for. This dataset is a daily record of the top trending YouTube videos.<br><br></div>

## Main topics of the study

* [Aim of the study](#section-one)
* [Understand the data](#section-two)
* [Preparation of the Data](#section-three)
* [Scoring Likes and Dislikes](#section-four)
* [Scoring Average Rating](#section-five)
* [Sorting with Wilson Lower Bound](#section-six)
* [Comparison of the Rating Methods](#section-seven)
* [Conclusion](#section-eight)



<a id="section-one"></a>
## 1. Aim of the Study

The main purpose of the study is to sort the trending videos with statistical methods and to see how the ranking changes under each method. Before we start, here is the description of the dataset:

> This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the US, GB, DE, CA, and FR regions (USA, Great Britain, Germany, Canada, and France, respectively), with up to 200 listed trending videos per day. EDIT: Now includes data from RU, MX, KR, JP and IN regions (Russia, Mexico, South Korea, Japan and India respectively) over the same time period. Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count. The data also includes a category_id field, which varies between regions. To retrieve the categories for a specific video, find it in the associated JSON. One such file is included for each of the five regions in the dataset.

In this study we will work with the "CAvideos.csv" file, which holds the data for the Canada (CA) region.
<div style="width:100%;text-align: center;"> <img align=middle src="https://www.noxinfluencer.com/blog/wp-content/uploads/2019/11/2.png" alt="Heat beating" style="height:300px;margin-top:3rem;"> </div>

<a id="section-two"></a>
## 2. Understand the Data

First of all, we import the libraries that will be used during the analysis and rating parts.

In [1]:
# Let's import the libraries

import pandas as pd
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler
pd.set_option("display.width", 500)
pd.set_option("display.max_columns", None)

In [2]:
# Let's load the dataset

df = pd.read_csv(r"/kaggle/input/youtube-new/CAvideos.csv")
# Only these variables are needed for this analysis.

df = df[["title", "channel_title", "views", "likes", "dislikes", "comment_count"]]

In [3]:
# The "check_df" function summarizes the data so that we can decide how to handle it.

def check_df(dataframe, head=5):
    print("########## Info #############")
    print(dataframe.info())
    print("########## Shape #############")
    print(dataframe.shape)
    print("########## Data Types #############")
    print(dataframe.dtypes)
    print("########## Head of Data #############")
    print(dataframe.head(head))
    print("########## Tail of Data #############")
    print(dataframe.tail(head))
    print("########## Null Values of Data #############")
    print(dataframe.isnull().sum())
    print("########## Describe of the Numerical Datas #############")
    print(dataframe.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(df)

########## Info #############
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40881 entries, 0 to 40880
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          40881 non-null  object
 1   channel_title  40881 non-null  object
 2   views          40881 non-null  int64 
 3   likes          40881 non-null  int64 
 4   dislikes       40881 non-null  int64 
 5   comment_count  40881 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 1.9+ MB
None
########## Shape #############
(40881, 6)
########## Data Types #############
title            object
channel_title    object
views             int64
likes             int64
dislikes          int64
comment_count     int64
dtype: object
########## Head of Data #############
                                               title channel_title     views    likes  dislikes  comment_count
0         Eminem - Walk On Water (Audio) ft. Beyoncé    EminemVEVO  17158579  

Before starting the analysis, note that according to the summary above the dataset has 6 variables. Let's check them:

1. title: name of the video
2. channel_title: channel that owns the video
3. views: total number of views of the video
4. likes: number of likes of the video
5. dislikes: number of dislikes of the video
6. comment_count: total number of comments on the video


<a id="section-three"></a>
## 3. Preparation of the Data

In this stage, any rows that contain null values are dropped from the dataset.

In [4]:
# dropna() drops the rows that contain null values.
df.dropna(inplace = True)

# Let's check the data
df.describe().T

# As we can see, the row count is still 40881, so the dataset had no null values to drop.


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| views | 40881.0 | 1147036.0 | 3390913.0 | 733.0 | 143902.0 | 371204.0 | 963302.0 | 137843120.0 |
| likes | 40881.0 | 39582.69 | 132689.5 | 0.0 | 2191.0 | 8780.0 | 28717.0 | 5053338.0 |
| dislikes | 40881.0 | 2009.195 | 19008.37 | 0.0 | 99.0 | 303.0 | 950.0 | 1602383.0 |
| comment_count | 40881.0 | 5042.975 | 21579.02 | 0.0 | 417.0 | 1301.0 | 3713.0 | 1114800.0 |


<a id="section-four"></a>
## 4. Scoring Likes and Dislikes


The first part of the analysis considers the relationship between likes and dislikes. To see the effect of the likes on the sorting process, we use the "score_pos_neg_diff" function, which returns the number of likes minus the number of dislikes.

In [5]:
# Score Positive - Negative Difference Function

def score_pos_neg_diff(up, down):
    return up - down
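
As a quick illustration of how this score behaves, here is a minimal sketch with made-up like and dislike counts (the numbers are hypothetical, not taken from the dataset). It suggests that the difference favours videos with many votes even when a smaller share of viewers liked them.

```python
def score_pos_neg_diff(up, down):
    # Same definition as in the cell above: likes minus dislikes.
    return up - down

# Hypothetical counts, chosen only for illustration.
video_a = score_pos_neg_diff(600, 400)     # 60% of the votes are likes
video_b = score_pos_neg_diff(5500, 4500)   # 55% of the votes are likes

print(video_a, video_b)
```

Here video B gets the higher score although a smaller fraction of its voters liked it, which is why the difference alone may not be a fair ranking signal.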


<a id="section-five"></a>
## 5. Scoring Average Rating


The second part of the analysis considers the relationship between likes and dislikes through the average rating method, that is, the share of likes among all votes. To see the effect of the likes on the sorting process, we use the "score_average_rating" function.

In [6]:
# Score Average Rating Function

def score_average_rating(up, down):
    if up + down == 0:
        return 0

    return up/(up+down)
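
The sketch below applies the same function to two like/dislike pairs that appear in the outputs later in this notebook (20 likes with 0 dislikes, and 5053338 likes with 165854 dislikes). It is only meant to show that a handful of votes can produce a perfect average rating.

```python
def score_average_rating(up, down):
    # Same definition as in the cell above: share of likes among all votes.
    if up + down == 0:
        return 0
    return up / (up + down)

small_sample = score_average_rating(20, 0)            # only 20 votes in total
large_sample = score_average_rating(5053338, 165854)  # more than 5 million votes

print(small_sample, round(large_sample, 6))
```

The video with only 20 votes scores higher than the one with millions of votes, so the average rating alone says nothing about how much evidence stands behind the score.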

<a id="section-six"></a>
## 6. Sorting with Wilson Lower Bound

> The idea here is to treat the existing set of user ratings as a statistical sampling of a hypothetical set of user ratings from all users and then use this score. In other words, what user community would think about upvoting a product with 95% confidence given that we have an existing rating for this product with a sample (subset from the whole community) user ratings.
Therefore if we know what a sample population thinks i.e. user reviews for a product, you can use this to estimate the preferences of the whole community.
If there are X positive votes and Y negative votes for a product and we want to understand how popular the product will be across the whole community. We can estimate that with 95% confidence between wilson_lower_bound_score and wilson_upper_bound_score% of users will upvote this product using the Wilson Score of Confidence interval.

<div style="width:100%;text-align: center;"> <img align=middle src="https://miro.medium.com/v2/resize:fit:786/format:webp/1*a55XGo_ZIv6lFGn23nCQeQ.png" alt="WLB" style="height:100px;margin-top:3rem;"> </div>
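
For reference, the lower bound computed by the function in the next cell can be written out as follows. The symbols are chosen here to mirror the variables in that code: $\hat{p}$ is `phat` (the share of likes), $n$ is the total number of votes, and $z$ is the standard normal quantile for the chosen confidence level (about 1.96 for 95%).

$$
\text{WLB} = \frac{\hat{p} + \dfrac{z^{2}}{2n} - z\sqrt{\dfrac{\hat{p}\,(1-\hat{p}) + \dfrac{z^{2}}{4n}}{n}}}{1 + \dfrac{z^{2}}{n}},
\qquad \hat{p} = \frac{\text{up}}{n}, \quad n = \text{up} + \text{down}
$$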

In [7]:
def wilson_lower_bound(up, down, confidence = 0.95):
    """
    Compute the lower bound of the Wilson score interval.


    Parameters
    ----------
    up: int
        number of positive votes (likes)
    down: int
        number of negative votes (dislikes)

    confidence: float
        confidence level of the interval, 0.95 (95%) by default

    Returns
    -------
    wilson_score: float
        Wilson lower bound score

    """

    n = up + down
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * up / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1- phat) + z * z / (4*n)) / n)) / (1 + z*z/n)
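
To see what the lower bound does in practice, here is a small sketch that evaluates the same formula for a video with few votes and for one with many votes. The helper name `wilson_lower_bound_sketch` is made up for this example, and it uses `statistics.NormalDist` from the standard library instead of `scipy.stats` so that it runs without extra packages; the like/dislike counts come from the outputs later in this notebook.

```python
import math
from statistics import NormalDist

def wilson_lower_bound_sketch(up, down, confidence=0.95):
    # Same formula as wilson_lower_bound above; only the z lookup differs.
    n = up + down
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    return (phat + z * z / (2 * n)
            - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

few_votes = wilson_lower_bound_sketch(20, 0)             # average rating is 1.0
many_votes = wilson_lower_bound_sketch(5053338, 165854)  # average rating is about 0.968

print(round(few_votes, 6), round(many_votes, 6))
```

With only 20 votes the score is pulled well below the average rating of 1.0, while with millions of votes it stays almost equal to the average rating. This is the penalty for small samples that the quoted explanation describes.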

<a id="section-seven"></a>
## 7. Comparison of the Rating Methods

In [8]:
# Each function is applied row by row and its result is added to the dataset as a new variable. All three functions use only the like and dislike counts of each video.

df["score_pos_neg_diff"] = df.apply( lambda x: score_pos_neg_diff(x["likes"],
                                                                              x["dislikes"]), axis = 1)

df["score_average_rating"] = df.apply( lambda x: score_average_rating(x["likes"],
                                                                              x["dislikes"]), axis = 1)

df["wilson_lower_bound"] = df.apply( lambda x: wilson_lower_bound(x["likes"],
                                                                              x["dislikes"]), axis = 1)
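
Row-wise `apply` works, but it calls a Python function once per row. As an optional alternative, the sketch below computes the Wilson lower bound for whole arrays at once with NumPy. The function name `wilson_lower_bound_vec` is an assumption made for this example and is not part of the original notebook; it again takes z from `statistics.NormalDist` rather than `scipy.stats`.

```python
import numpy as np
from statistics import NormalDist

def wilson_lower_bound_vec(up, down, confidence=0.95):
    # Vectorized version of the Wilson lower bound (hypothetical helper).
    up = np.asarray(up, dtype=float)
    down = np.asarray(down, dtype=float)
    n = up + down
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Replace n == 0 by 1 to avoid division by zero; those rows are set to 0 below.
    safe_n = np.where(n == 0, 1.0, n)
    phat = up / safe_n
    score = (phat + z * z / (2 * safe_n)
             - z * np.sqrt((phat * (1 - phat) + z * z / (4 * safe_n)) / safe_n)) / (1 + z * z / safe_n)
    return np.where(n == 0, 0.0, score)

# Like/dislike pairs taken from the outputs in this notebook, plus an empty row.
scores = wilson_lower_bound_vec([20, 144, 0], [0, 0, 0])
print(scores)
```

Under these assumptions, the column could be filled with `df["wilson_lower_bound"] = wilson_lower_bound_vec(df["likes"], df["dislikes"])`, which should give the same values as the `apply` call above.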

In [9]:
# Let's see the individual results of each rating method.

df.sort_values("score_pos_neg_diff", ascending = False).head(10)


Unnamed: 0,title,channel_title,views,likes,dislikes,comment_count,score_pos_neg_diff,score_average_rating,wilson_lower_bound
36453,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,80738011,5053338,165854,1114800,4887484,0.968222,0.968071
36153,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,73463137,4924056,156026,1084435,4768030,0.969287,0.969136
35900,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,65396157,4750254,141966,1040912,4608288,0.970981,0.970832
35685,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,62796390,4470888,119046,905912,4351842,0.974064,0.973918
35515,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,39349927,3880074,72707,692311,3807367,0.981606,0.981473
34361,Childish Gambino - This Is America (Official V...,ChildishGambinoVEVO,98938809,3037318,161813,319502,2875505,0.94942,0.949179
34131,Childish Gambino - This Is America (Official V...,ChildishGambinoVEVO,85092067,2735961,140711,289682,2595250,0.951085,0.950836
4699,Marvel Studios' Avengers: Infinity War Officia...,Marvel Entertainment,89930713,2606665,53011,347982,2553654,0.980069,0.9799
4451,Marvel Studios' Avengers: Infinity War Officia...,Marvel Entertainment,87450245,2584675,52176,341571,2532499,0.980213,0.980044
4202,Marvel Studios' Avengers: Infinity War Officia...,Marvel Entertainment,84281319,2555414,51008,339708,2504406,0.98043,0.980261


In [10]:
df.sort_values("score_average_rating", ascending = False).head(10)

Unnamed: 0,title,channel_title,views,likes,dislikes,comment_count,score_pos_neg_diff,score_average_rating,wilson_lower_bound
8893,Brendan Shanahan speaks on the Passing of Leaf...,Toronto Maple Leafs,4138,144,0,15,144,1.0,0.974016
13111,莫斯科行动 23 | Operation Moscow 23（夏雨、吴优、姚芊羽 领衔主演）,感谢订阅中剧独播,23926,20,0,31,20,1.0,0.838875
25882,Ghost Adventures S16E01 - Ripley's Believe It ...,Ghost Adventures TV,5173,232,0,73,232,1.0,0.983712
20855,Voyage backpack // Marina Bastarache,Marina Bastarache,5213,604,0,32,604,1.0,0.99368
11499,Spirit of Canada - Home For A Rest,alanthomasdoyle,13197,143,0,13,143,1.0,0.973839
3962,Daniel Sedin | Letter from Markus Naslund,Canucks,1626,98,0,8,98,1.0,0.96228
24444,Maple Leafs Post-Game: Curtis McElhinney - Mar...,Toronto Maple Leafs,1754,60,0,16,60,1.0,0.939828
15251,SNL - Reality Stars With Will Ferrell | Saturd...,Global TV,12004,42,0,1,42,1.0,0.916201
7187,RECETTE CORRECTE de NOËL : Le pain-sandwich au...,2FillesOrdinaires,3955,352,0,108,352,1.0,0.989205
20708,Maple Leafs Morning Skate: Mike Babcock - Febr...,Toronto Maple Leafs,5145,64,0,4,64,1.0,0.943376


In [11]:
df.sort_values("wilson_lower_bound", ascending = False).head(10)

Unnamed: 0,title,channel_title,views,likes,dislikes,comment_count,score_pos_neg_diff,score_average_rating,wilson_lower_bound
4508,The Reaction of The Streets (I Wait-Day6 Edition),JaeSix,88889,25599,9,3619,25590,0.999649,0.999332
32215,G.C.F in Osaka,BANGTANTV,2942269,688754,687,61516,688067,0.999004,0.998926
33206,THE POPULAR DANCE TUTORIALS OF 90s-CURRENT W/B...,JaeSix,165176,34756,28,4429,34728,0.999195,0.998837
29621,Day6 Tomfoolery in NY and Japan,JaeSix,98947,22743,17,4102,22726,0.999253,0.998804
6771,BTS Tell Us What They Love About Each Other & ...,AskAnythingChat,324230,31439,40,2575,31399,0.998729,0.99827
40781,"180613 JIN & V - Even If I Die, It's You (Hwar...",Jung Hyun Ran,793776,95387,141,4841,95246,0.998524,0.99826
19059,This Video Is My Wife's Anniversary Gift,Marcus Johns,69205,10311,10,663,10301,0.999031,0.998217
33078,When you been playing Fortnite for too long,Lenarr Young,841698,91115,144,6845,90971,0.998422,0.998143
29254,The Rose (더 로즈) - BABY MV,CJENMMUSIC Official,573371,115729,189,8825,115540,0.99837,0.99812
35790,BTS Dish About Debuting New Music At The 2018 ...,Access,189531,16371,20,737,16351,0.99878,0.998116


As we can see, the three functions produce different rankings, and none of them looks fully sensible, because all of them rely only on the like and dislike counts. To obtain a useful and meaningful ranking, the "views" and "comment_count" variables also have to be considered in the sorting process.

In [12]:
# Let's scale the view and comment counts to the range 1 - 5 with MinMaxScaler.

df["views_rating"] = MinMaxScaler(feature_range = (1,5)).\
    fit(df[["views"]])\
    .transform(df[["views"]])

df["comment_rating"] = MinMaxScaler(feature_range = (1,5)).\
    fit(df[["comment_count"]])\
    .transform(df[["comment_count"]])

# Now we can combine the scaled comment and view ratings with the like/dislike scores in a weighted rating. This way likes, dislikes, comments and views are all considered in the same method.

# First, let's use the "Wilson Lower Bound" as the like/dislike score.

def weighted_rating(dataframe, w1 = 32, w2 = 28, w3 = 40):
    return (dataframe["comment_rating"]* w1/100 +
         dataframe["views_rating"]* w2/100 +
         dataframe["wilson_lower_bound"] * w3/100)

df["weighted_sorting_score_1"] = weighted_rating(df)

# Next, let's use the "score_average_rating" as the like/dislike score.

def weighted_rating(dataframe, w1 = 32, w2 = 28, w3 = 40):
    return (dataframe["comment_rating"]* w1/100 +
         dataframe["views_rating"]* w2/100 +
         dataframe["score_average_rating"] * w3/100)

df["weighted_sorting_score_2"] = weighted_rating(df)
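
To make the arithmetic behind this cell concrete, the following sketch reproduces the first weighted score for a single row by hand. The helper `min_max_scale` is a hypothetical stand-in for what `MinMaxScaler(feature_range = (1,5))` does to one value; the minimum and maximum values come from the `describe()` output above, and the row values come from the outputs shown in this notebook.

```python
def min_max_scale(value, lo, hi, new_min=1.0, new_max=5.0):
    # Map a value from the range [lo, hi] onto [new_min, new_max].
    return new_min + (value - lo) / (hi - lo) * (new_max - new_min)

# Minimum and maximum of each column, read from the describe() output.
views_min, views_max = 733, 137843120
comments_min, comments_max = 0, 1114800

# One row from the notebook outputs: 80738011 views, 1114800 comments.
views_rating = min_max_scale(80738011, views_min, views_max)
comment_rating = min_max_scale(1114800, comments_min, comments_max)
wilson = 0.968071  # wilson_lower_bound of the same row

# Same weights as weighted_rating: w1 = 32, w2 = 28, w3 = 40.
score = comment_rating * 32 / 100 + views_rating * 28 / 100 + wilson * 40 / 100
print(round(views_rating, 6), comment_rating, round(score, 6))
```

One thing this example makes visible: the two ratings lie between 1 and 5, while the Wilson lower bound lies between 0 and 1, so in this row the 40% term contributes only about 0.39 to a total of about 2.92. If the like/dislike score is meant to dominate, scaling it to the same 1 - 5 range first might be worth considering.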



In [13]:
df.sort_values("weighted_sorting_score_1", ascending= False).head(10)

Unnamed: 0,title,channel_title,views,likes,dislikes,comment_count,score_pos_neg_diff,score_average_rating,wilson_lower_bound,views_rating,comment_rating,weighted_sorting_score_1,weighted_sorting_score_2
36453,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,80738011,5053338,165854,1114800,4887484,0.968222,0.968071,3.342887,5.0,2.923237,2.923297
5900,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,137843120,3014479,1602383,817582,1412096,0.652928,0.652494,5.0,3.933556,2.919735,2.919909
36153,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,73463137,4924056,156026,1084435,4768030,0.969287,0.969136,3.13178,4.891048,2.829688,2.829748
5623,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,125431369,2912715,1545018,807558,1367697,0.653407,0.652965,4.639828,3.897589,2.807566,2.807743
35900,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,65396157,4750254,141966,1040912,4608288,0.970981,0.970832,2.897687,4.734883,2.714848,2.714908
5398,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,113876217,2811217,1470387,787174,1340830,0.65658,0.65613,4.304513,3.824449,2.69154,2.691719
35685,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,62796390,4470888,119046,905912,4351842,0.974064,0.973918,2.822245,4.250492,2.539953,2.540011
5197,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,100911567,2656678,1353655,682890,1303023,0.662458,0.661995,3.928296,3.450269,2.468807,2.468992
4996,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,75969469,2251826,1127811,827755,1124015,0.666292,0.665789,3.20451,3.970057,2.433997,2.434198
34361,Childish Gambino - This Is America (Official V...,ChildishGambinoVEVO,98938809,3037318,161813,319502,2875505,0.94942,0.949179,3.871049,2.146401,2.150414,2.15051


In [14]:
df.sort_values("weighted_sorting_score_2", ascending= False).head(10)

Unnamed: 0,title,channel_title,views,likes,dislikes,comment_count,score_pos_neg_diff,score_average_rating,wilson_lower_bound,views_rating,comment_rating,weighted_sorting_score_1,weighted_sorting_score_2
36453,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,80738011,5053338,165854,1114800,4887484,0.968222,0.968071,3.342887,5.0,2.923237,2.923297
5900,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,137843120,3014479,1602383,817582,1412096,0.652928,0.652494,5.0,3.933556,2.919735,2.919909
36153,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,73463137,4924056,156026,1084435,4768030,0.969287,0.969136,3.13178,4.891048,2.829688,2.829748
5623,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,125431369,2912715,1545018,807558,1367697,0.653407,0.652965,4.639828,3.897589,2.807566,2.807743
35900,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,65396157,4750254,141966,1040912,4608288,0.970981,0.970832,2.897687,4.734883,2.714848,2.714908
5398,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,113876217,2811217,1470387,787174,1340830,0.65658,0.65613,4.304513,3.824449,2.69154,2.691719
35685,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,62796390,4470888,119046,905912,4351842,0.974064,0.973918,2.822245,4.250492,2.539953,2.540011
5197,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,100911567,2656678,1353655,682890,1303023,0.662458,0.661995,3.928296,3.450269,2.468807,2.468992
4996,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,75969469,2251826,1127811,827755,1124015,0.666292,0.665789,3.20451,3.970057,2.433997,2.434198
34361,Childish Gambino - This Is America (Official V...,ChildishGambinoVEVO,98938809,3037318,161813,319502,2875505,0.94942,0.949179,3.871049,2.146401,2.150414,2.15051


In [15]:
df.sort_values("views", ascending= False).head(10)

Unnamed: 0,title,channel_title,views,likes,dislikes,comment_count,score_pos_neg_diff,score_average_rating,wilson_lower_bound,views_rating,comment_rating,weighted_sorting_score_1,weighted_sorting_score_2
5900,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,137843120,3014479,1602383,817582,1412096,0.652928,0.652494,5.0,3.933556,2.919735,2.919909
5623,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,125431369,2912715,1545018,807558,1367697,0.653407,0.652965,4.639828,3.897589,2.807566,2.807743
5398,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,113876217,2811217,1470387,787174,1340830,0.65658,0.65613,4.304513,3.824449,2.69154,2.691719
5197,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,100911567,2656678,1353655,682890,1303023,0.662458,0.661995,3.928296,3.450269,2.468807,2.468992
34361,Childish Gambino - This Is America (Official V...,ChildishGambinoVEVO,98938809,3037318,161813,319502,2875505,0.94942,0.949179,3.871049,2.146401,2.150414,2.15051
4699,Marvel Studios' Avengers: Infinity War Officia...,Marvel Entertainment,89930713,2606665,53011,347982,2553654,0.980069,0.9799,3.609647,2.24859,2.12221,2.122277
4451,Marvel Studios' Avengers: Infinity War Officia...,Marvel Entertainment,87450245,2584675,52176,341571,2532499,0.980213,0.980044,3.537667,2.225587,2.094752,2.09482
34131,Childish Gambino - This Is America (Official V...,ChildishGambinoVEVO,85092067,2735961,140711,289682,2595250,0.951085,0.950836,3.469236,2.039404,2.00433,2.00443
4202,Marvel Studios' Avengers: Infinity War Officia...,Marvel Entertainment,84281319,2555414,51008,339708,2504406,0.98043,0.980261,3.445709,2.218902,2.066952,2.067019
36453,BTS (방탄소년단) 'FAKE LOVE' Official MV,ibighit,80738011,5053338,165854,1114800,4887484,0.968222,0.968071,3.342887,5.0,2.923237,2.923297


<a id="section-eight"></a>

## 8. Conclusion

At the end of the sorting study, we have considered most of the factors that can affect how YouTube videos are ranked.

The main variables for the sorting process are:

1. Likes
2. Dislikes
3. Views
4. Comment Count

In the first part of the sorting we tried three methods to find the best ordering of the videos. However, all of these functions rely only on likes and dislikes. The "Positive - Negative Difference" score gives meaningful results, but it misses the effect of views and comments. The other two methods, "Average Rating" and "Wilson Lower Bound", are not enough on their own to sort the dataset.

Therefore, views and comments were also taken into account when sorting the videos.

The weighted_rating function combines views, comments and the likes/dislikes score so that all of these variables affect the sorting. Although views and comments are important, the likes/dislikes score was given the largest weight (40, against 32 for comments and 28 for views). Both likes/dislikes methods (Wilson Lower Bound and average rating) were plugged into the function and the results were compared.

According to the results, some videos have the most views but a lower likes/dislikes ratio than other popular videos. Both "weighted_sorting_score_1" and "weighted_sorting_score_2" give the same top 10 list.

As a result, in this kind of sorting problem a single visible variable is not enough; social proof should also be taken into account.

## Keep in Touch!

You can follow my other social media accounts to see more work like this!

1. [GitHub](https://github.com/KeskinHakan)
2. [LinkedIn](https://www.linkedin.com/in/hakan-keskin-/)
3. [Medium](https://medium.com/@hakan-keskin)
