<img src="./magical-place.png" />

In [None]:
## Import Libraries
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot, plot
from plotly.subplots import make_subplots

from IPython.display import HTML
from IPython.display import Image, Audio, Video

%matplotlib inline

############################
## Start Helper functions ##
############################

# Wrapper around pandas cut() method.
def my_cut(x, bins, lower_infinite=False, upper_infinite=False, **kwargs):
    """
    Wrapper around pandas cut() to create infinite lower/upper bounds with proper labeling.

    Takes all the same arguments as pandas cut(), plus two more.

    Args :
        lower_infinite (bool, optional) : set whether the lower bound is infinite
            Default is False. If True, and your first bin element is something like 20, the
            first bin label will be '<= 20' (depending on other cut() parameters)
        upper_infinite (bool, optional) : set whether the upper bound is infinite
            Default is False. If True, and your last bin element is something like 20, the
            last bin label will be '> 20' (depending on other cut() parameters)
        **kwargs : any standard pandas cut() labeled parameters

    Returns :
        out : same as pandas cut() return value
        bins : same as pandas cut() return value
    
    Code slightly modified from sparc_spread, StackOverflow: https://stackoverflow.com/a/30199132/1843511
    """

    # Quick passthru if no infinite bounds
    if not lower_infinite and not upper_infinite:
        return pd.cut(x, bins, **kwargs)

    # Setup
    num_labels      = len(bins) - 1
    include_lowest  = kwargs.get("include_lowest", False)
    right           = kwargs.get("right", True)

    # Prepend/Append infinities where indicated
    bins_final = bins.copy()
    if upper_infinite:
        bins_final.insert(len(bins),float("inf"))
        num_labels += 1
    if lower_infinite:
        bins_final.insert(0,float("-inf"))
        num_labels += 1

    # Decide all boundary symbols based on traditional cut() parameters
    symbol_lower  = "<=" if include_lowest and right else "<"
    left_bracket  = "(" if right else "["
    right_bracket = "]" if right else ")"
    symbol_upper  = ">" if right else ">="

    # Inner function reused in multiple clauses for labeling
    def make_label(i, lb=left_bracket, rb=right_bracket):
        return "{0} - {1}".format(bins_final[i], bins_final[i+1])

    # Create custom labels
    labels=[]
    for i in range(0,num_labels):
        new_label = None

        if i == 0:
            if lower_infinite:
                new_label = "{0} {1}".format(symbol_lower, bins_final[i+1])
            elif include_lowest:
                new_label = make_label(i, lb="[")
            else:
                new_label = make_label(i)
        elif upper_infinite and i == (num_labels - 1):
            new_label = "{0} {1}".format(symbol_upper, bins_final[i])
        else:
            new_label = make_label(i)

        labels.append(new_label)

    # Pass thru to pandas cut()
    return pd.cut(x, bins_final, labels=labels, **kwargs)

# Import images
import shutil 
dest = shutil.copy('../input/images/magical-place.png', './magical-place.png')
dest = shutil.copy('../input/images/dabbing-unicorn.png', './dabbing-unicorn.png')
dest = shutil.copy('../input/images/erd-tiktok-data-800.PNG', './erd-tiktok-data-800.png')
dest = shutil.copy('../input/images/erd-audd-data-800.PNG', './erd-audd-data-800.png')

##########################
## END Helper functions ##
##########################

## Set CSS Styles
HTML("""<style>
    @import url('https://fonts.googleapis.com/css2?family=Cookie&display=swap');

    .credits {
        display: block;
        border-radius: 4px;
        font-weight: bold;
        background: #452756;
        padding: 10rem;
        position: relative;
    }
    
    .creditsunicorn {
        display: inline-block;
        background: url('./dabbing-unicorn.png') no-repeat;
        width: 190px;
        height: 100px;
        position: absolute;
        right:0;
        bottom:0;
    }
    
    .credits-title {
        font-family: 'Cookie', cursive;
        position: relative;
        top: -10px;
        font-size: 50px;
        color: #A2D2FF;
        margin-bottom: 20px;
    }
    
    .credits-title > span {
        color: #FFAFCC;
    }
    
    .credits-text {
        color: white;
        line-height: 1.8 !important;
    }
    
    .credits-title-star {
        position: absolute;
        right: 10px;
        top: 10px;
        font-size: 80px;
    }
    
    strong {
        font-weight: bold!important
    }
    </style>""")

> About the banner: whoever watches or has watched Agents of S.H.I.E.L.D. probably knows this phrase. Agent Coulson keeps referring to Tahiti as a *magical place*, after his holiday over there. He thinks he was sent there by Nick Fury to recover after he almost died. But is that true? Has he even been there? Who knows... Anyway, because this notebook is all about performing some Python magic on my TikTok dataset, I thought it would be a nice banner to use.

# What's this all about 🤷‍♂️
It isn't that hard to just create a dataset and post it on Kaggle for anyone to use, so I would like to do a bit extra. I will show the structure of the dataset and how you can start using it. I'll try doing some first analysis as well. So in this notebook, I will:
- show the dataset structure
- show alternatives of combining the data
- enrich the data
- do some first analysis

# Our Data Structure 📁


So let's get started with the current data structure. The previous version of our dataset consisted of multiple files, but because this resulted in data errors, I decided to replace the files with the raw data and work from there:
* `trending.json`, this is the raw data file. It contains all scraped information of the TikTok videos.

This file contains a list `collector`, which contains a JSON object for each video with the following fields:
* `id`: the unique identifier of the video
* `text`: the text shown below the video
* `createTime`: timestamp of the datetime when the video was created
* `authorMeta`: an object with detailed information about the author
* `musicMeta`: an object with detailed information about the music used with the video
* `covers`: an object with all covers of the video
* `webVideoUrl`: link to the TikTok video
* `videoUrl`: exact link to the TikTok video (not reachable directly)
* `videoUrlNoWaterMark`: the URL of the video without a watermark
* `videoMeta`: an object with dimensions and duration of the video
* `diggCount`: number of likes
* `shareCount`: how many times the video has been shared
* `playCount`: how many times the video has been watched
* `commentCount`: number of comments
* `downloaded`: whether the video was downloaded by the scraper
* `mentions`: list with users mentioned in the video
* `hashtags`: list with hashtags used in the video
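
To make the field list a bit more concrete, here is a minimal sketch of how a single video object could be read. The values are invented for illustration; only the key names come from the list above and from the columns used later in this notebook (`authorMeta.id`, `musicMeta.musicId`, `musicMeta.musicName` and the hashtag `name`).

```python
# Hypothetical, hand-written video object: the values are made up,
# only the key names follow the structure described above.
sample_video = {
    "id": "0000000000000000001",
    "text": "example caption #fyp",
    "authorMeta": {"id": "42"},
    "musicMeta": {"musicId": "7", "musicName": "origineel geluid"},
    "diggCount": 1200,
    "commentCount": 35,
    "hashtags": [{"name": "fyp"}],
    "mentions": [],
}

# Top-level fields are read directly
likes = sample_video["diggCount"]

# Nested objects need a second lookup
music_name = sample_video["musicMeta"]["musicName"]

# List fields hold zero or more entries
hashtag_names = [tag["name"] for tag in sample_video["hashtags"]]

print(likes, music_name, hashtag_names)
```

The nested objects and the lists are the reason the raw data needs some reshaping before it fits into a flat dataframe.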

<div class="alert alert-info" role="alert">
    <strong>Note:</strong> The new version of the current dataset contains a folder "audd", which contains the enrichments from the <a href="#Enrich-Data-%E2%9C%A8">enrichment chapter below</a>
</div><br/>

In [None]:
# Open the file with the raw data and load it as JSON;
# the `with` block closes the file automatically
with open('../input/tiktok-trending-december-2020/trending.json', encoding="utf8") as file:
    raw_data = json.load(file)

# Select only the list with the video data
trending_videos_list = raw_data['collector']

# Example of a video object
print(json.dumps(trending_videos_list[15], indent=4, sort_keys=True))

# Convert JSON to DataFrame 🐱‍💻

# Let's explode() the cell 💣
Don't worry, we are not destroying all our stuff! We like to keep things clean. 

Now that we have loaded the raw data, we would like to give every hashtag and mention its own row. Both are stored as lists per video (a many-to-one relationship), which is a bit trickier to fit into a flat dataframe.

The `explode()` function transforms each element of a list-like into a row, replicating the index values and the values of all other columns.
So we would like to perform the following steps:

1. Create a dataframe from the list of video objects
2. Create a new row for each hashtag and each mention with the `explode()` function
3. Expand the cells that still contain dictionaries (such as `authorMeta` and `musicMeta`) into separate columns

So let's get started.
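
As a small illustration of what `explode()` does with a list-valued column, here is a sketch on an invented toy frame (the column names mirror the dataset, the values do not):

```python
import pandas as pd

# Toy frame with a list-valued column (invented data)
df_toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "hashtags": [["fyp", "dance"], ["fyp"], []],
})

# Each list element becomes its own row; the other columns are repeated
df_exploded = df_toy.explode("hashtags")

print(df_exploded)
```

Note that an empty list still yields one row, with a missing value, and that the original index is repeated for every exploded row. A `reset_index(drop=True)` afterwards is useful whenever a unique index is needed.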

In [None]:
# Create a DataFrame of the data
df_tiktok_dataset = pd.DataFrame(trending_videos_list)

# Let's expand the hashtags and mentions cells containing lists to multiple rows
df_tiktok_dataset = df_tiktok_dataset.explode('hashtags').explode('mentions')

In [None]:
def object_to_columns(dfRow, **kwargs):
    '''Function to expand cells containing dictionaries, to columns'''
    for column, prefix in kwargs.items():
        if isinstance(dfRow[column], dict):
            for key, value in dfRow[column].items():
                columnName = '{}.{}'.format(prefix, key)
                dfRow[columnName] = value
    return dfRow

# Expand certain cells containing dictionaries to columns
df_tiktok_dataset = df_tiktok_dataset.apply(object_to_columns, 
                            authorMeta='authorMeta',  
                            musicMeta='musicMeta',
                            covers='cover',
                            videoMeta='videoMeta',
                            hashtags='hashtag', axis = 1)

# Remove the original columns containing the dictionaries
df_tiktok_dataset = df_tiktok_dataset.drop(['authorMeta','musicMeta','covers','videoMeta','hashtags'], axis = 1)
df_tiktok_dataset

In [None]:
df_tiktok_dataset.info()

In [None]:
# Get unique rows from dataset
df_unique_videos = df_tiktok_dataset.drop_duplicates(subset='id', keep="first")
df_unique_music = df_tiktok_dataset.drop_duplicates(subset='musicMeta.musicId', keep="first")
df_unique_authors = df_tiktok_dataset.drop_duplicates(subset='authorMeta.id', keep="first")

# Show amount of rows per dataset
{
    'df_tiktok_dataset': df_tiktok_dataset.shape,
    'df_unique_videos': df_unique_videos.shape,
    'df_unique_music': df_unique_music.shape,
    'df_unique_authors': df_unique_authors.shape
}

# Some first Analysis 📈

In [None]:
# Set bucket ranges
buckets = list(range(0,105000,5000))

# Count videos per bucket range, for likes and for comments
# (include_lowest=True keeps videos with a count of exactly 0 in the first bucket)
likes = df_unique_videos.groupby( my_cut( df_unique_videos['diggCount'], buckets, upper_infinite=True, include_lowest=True ) ).diggCount.count()
comments = df_unique_videos.groupby( my_cut( df_unique_videos['commentCount'], buckets, upper_infinite=True, include_lowest=True ) ).commentCount.count()

# Transform from series to dataframe with some small modifications
likes = likes.rename('likes').to_frame().reset_index() 
comments = comments.rename('comments').to_frame().reset_index() 

# create subplots, two rows and 1 column each row
fig = make_subplots(2,1,subplot_titles=("Distribution of Likes", "Distribution of Comments"))

# First plot
fig.add_trace(
    go.Bar(y = likes['diggCount'], 
           x = likes['likes'], 
           name="Likes",
           text = likes['likes'], 
           orientation='h',
           texttemplate='%{text:.2s}', 
           textposition='outside', 
           marker_color='rgb(162, 210, 255)'
    ),
    row=1,col=1
)

# Second plot
fig.add_trace(
    go.Bar(y = comments['commentCount'], 
           x = comments['comments'], 
           name="Comments",
           text = comments['comments'], 
           orientation='h',
           texttemplate='%{text:.2s}', 
           textposition='outside', 
           marker_color='rgb(205, 180, 219)'
    ),
    row=2,col=1
)

fig.update_layout(uniformtext_minsize=8, 
                  uniformtext_mode='hide', 
                  title_text="Distribution of Likes and Comments per Video",
                  height=1200,
                  template='plotly_white',
                  margin=go.layout.Margin(
                      l=130,r=5,b=5,t=100,pad=10
                  ))

fig.update_xaxes(title_text='Videos')
fig.update_yaxes(title_text='Likes', col=1, row=1, automargin=False)
fig.update_yaxes(title_text='Comments', col=1, row=2, automargin=False)

fig.show(config={'displayModeBar': False})

So we started with the most obvious: how many videos have received how many likes, i.e. the like distribution. And how many videos have received how many comments, i.e. the comment distribution. 

So far, most of the videos seem to have fewer than 50,000 likes and comments. It still doesn't tell us how many of the enormous number of videos in the **0 - 5000** bucket are close to zero, somewhere in the middle, or close to 5,000.
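
One way to look inside that first bucket would be to cut it again with finer bins. The sketch below uses plain `pd.cut()` on invented like counts, standing in for `df_unique_videos['diggCount']`; the 1,000-wide bin size is an arbitrary choice for illustration.

```python
import pandas as pd

# Invented like counts, for illustration only
toy_likes = pd.Series([3, 40, 120, 950, 1800, 2600, 4100, 4999, 7000, 52000])

# Keep only the first bucket and split it into 1,000-wide bins
first_bucket = toy_likes[toy_likes <= 5000]
fine_bins = list(range(0, 6000, 1000))

# include_lowest=True makes sure a value of exactly 0 would be counted too
fine_counts = (
    pd.cut(first_bucket, fine_bins, include_lowest=True)
    .value_counts()
    .sort_index()
)

print(fine_counts)
```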

Let's create a scatter plot to get a better idea of the relation between the number of comments and likes, and of their distribution, for all videos with at most 50,000 likes.

In [None]:
# Focus on videos with at most 50,000 likes
df_videos_users_focus = df_unique_videos[df_unique_videos['diggCount'] <= 50000]

# Create a scatter plot with a trendline
fig = px.scatter(df_videos_users_focus, trendline="ols",
                 x="diggCount", 
                 y="commentCount",
                 labels={
                     "diggCount": "Likes",
                     "commentCount": "Comments"
                 },
                 log_y=True,
                 trendline_color_override="#ff7096", 
                 template='plotly_white')

fig.update_traces(marker=dict(
                     color='#4cc9f0',
                     opacity=0.6,
                 ))
fig.show()

I use a log scale for the y-axis, so a large range can be displayed without the small values being compressed into the bottom of the graph.

The low $R^2$ value indicates that the independent variable (likes) explains little of the variation in the dependent variable (comments).
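
For readers who want to see where such an $R^2$ value comes from, here is a minimal sketch that fits a straight line with NumPy and computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand. The numbers are invented and this is not the calculation plotly performs internally, only the same idea on toy data.

```python
import numpy as np

# Invented likes/comments pairs, for illustration only
likes = np.array([100, 500, 1500, 4000, 9000, 20000, 35000], dtype=float)
comments = np.array([30, 5, 400, 60, 1500, 90, 2500], dtype=float)

# Fit a straight line: comments ≈ slope * likes + intercept
slope, intercept = np.polyfit(likes, comments, deg=1)
predicted = slope * likes + intercept

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((comments - predicted) ** 2)
ss_tot = np.sum((comments - comments.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.3f}")
```

A value close to 1 would mean the line explains almost all variation in the comments; a value close to 0 means it explains almost none.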

It seemed obvious that a video with more likes would also receive more comments, but the content of the video might still be the biggest factor, even though a higher number of likes should bring in more viewers and hitting the like button is easier than leaving a comment. We do see, however, that there isn't a single video with over 20k likes in this dataset that has fewer than 50 comments.

And even with the naked eye we can see:
- there are simply more videos under 20,000 likes
- even videos with a low number of likes can still have a lot of comments
- there are two interesting outliers around 35k likes that have a large number of comments.


## Popular Hashtags 🏷

In [None]:
# Create a DataFrame of the data
df_hashtags = pd.DataFrame(trending_videos_list)

# Let's expand the hashtag cell containing lists to multiple rows
df_hashtags = df_hashtags.explode('hashtags')

# Expand certain cells containing dictionaries to columns
df_hashtags = df_hashtags.apply(object_to_columns, 
                                hashtags='hashtag', axis = 1)

hashtags = df_hashtags[['hashtag.name']].copy().dropna()
hashtags.info()

In [None]:
# Add column with default value
hashtags['count'] = 1

# Count all hashtags, group and replace the count column value with the sum
hashtags = hashtags.groupby(["hashtag.name"])["count"].count().reset_index()

# Sort by most popular hashtags and keep the top 15
hashtags = hashtags.sort_values(by='count', ascending=False)[:15]

# Create a pie chart of the top 15 hashtags, with custom colours set in the layout
fig = go.Figure(data=[go.Pie(
                        labels=hashtags["hashtag.name"], 
                        values=hashtags["count"], 
                        textinfo='label+percent',
                        insidetextorientation='radial'
                )], 
                layout={"colorway": ["#f72585","#b5179e",
                                     "#7209b7","#560bad",
                                     "#480ca8","#3a0ca3",
                                     "#3f37c9","#4361ee",
                                     "#4895ef","#4cc9f0"]})
fig.show()

# Enrich Data ✨
The music from the data contains not only a link to the actual sound, but a title as well. The "origineel geluid", which is Dutch for "original sound", might contain speech or perhaps some music (in the background) as well. But we don't know that just by looking at this title.

On top of that, we might be interested in more information, besides the title of the sound/music used in the TikTok video, to get a better idea if and how much value the music adds to those videos. I could use some more information like:

* genre of the music
* artist
* original name
* popularity
* etc.

We are probably all familiar with SoundHound and Shazam, which are able to recognize a playing song and return the artist and name. If only they had an API which we could use to enrich our data... Unfortunately, SoundHound and Shazam don't have an API (as far as I know), but there are alternatives!

<div class="alert alert-info" role="alert"><strong>Note:</strong> I used <a href="https://audd.io">audd.io</a> to enrich my music.csv with their information about the song. I added the information to the most recent version of the dataset, inside the `audd` folder.</div>
<div></div>

## Shazam/Soundhound-like Data 🎼
Let's say the **trending_dec_2020.csv** is the dataframe we created from the `trending.json` file.

<img src='./erd-audd-data-800.png' />

In [None]:
# Import Audd Data
df_audd_music = pd.read_csv('../input/tiktok-trending-december-2020/audd/audd_music.csv', index_col='id')
df_audd_music_apple = pd.read_csv('../input/tiktok-trending-december-2020/audd/audd_music_apple_music.csv')
df_audd_music_spotify = pd.read_csv('../input/tiktok-trending-december-2020/audd/audd_music_spotify_music.csv')
df_audd_music_spotify_artists = pd.read_csv('../input/tiktok-trending-december-2020/audd/audd_music_spotify_music_artists.csv')

In [None]:
# The current version of the dataset contains duplicated rows, let's remove them
df_audd_music = df_audd_music.drop_duplicates()

# Add prefix to this dataset, before merging
df_audd_music = df_audd_music.add_prefix('_audd_music.')
df_audd_music.shape

In [None]:
# Create a DataFrame of the data
df_tiktok_music = pd.DataFrame(trending_videos_list)

# Expand certain cells containing dictionaries to columns
df_tiktok_music = df_tiktok_music.apply(object_to_columns, 
                                        musicMeta='musicMeta', axis = 1)

# Convert the column dtype to int64 so we can merge
df_tiktok_music['musicMeta.musicId'] = df_tiktok_music['musicMeta.musicId'].astype('int64')
df_tiktok_music.shape

In [None]:
df_tiktok_audd_music = df_tiktok_music.merge(df_audd_music, how='left', right_on='id', left_on='musicMeta.musicId')
df_tiktok_audd_music.shape

### Original sound examples 🔊
As you might hear in the background, the audio file contains other sounds as well.
When you open the link, it seems the music was added to the video while keeping the original audio. So the data says "origineel geluid" (= original sound), but audd.io still recognised the music being used.
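
The selection behind these examples can be sketched on two invented toy frames: a left merge keeps every TikTok row, and the filter then keeps only the "original sound" rows for which Audd still returned an artist. The column names mirror the ones used in this notebook; the values are made up.

```python
import pandas as pd

# Invented stand-ins for the TikTok music data and the Audd results
df_music_toy = pd.DataFrame({
    "musicMeta.musicId": [1, 2, 3],
    "musicMeta.musicName": ["origineel geluid", "Some Song", "origineel geluid"],
})
df_audd_toy = pd.DataFrame({
    "id": [1, 2],
    "_audd_music.artist": ["Recognised Artist", "Another Artist"],
})

# A left merge keeps every TikTok row; unmatched rows get NaN in the Audd columns
merged = df_music_toy.merge(df_audd_toy, how="left",
                            left_on="musicMeta.musicId", right_on="id")

# "Original sound" rows for which Audd still recognised an artist
recognised = merged[(merged["musicMeta.musicName"] == "origineel geluid")
                    & merged["_audd_music.artist"].notna()]

print(recognised)
```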

In [None]:
df_tiktok_audd_music = df_tiktok_audd_music[(df_tiktok_audd_music['musicMeta.musicName'] == 'origineel geluid') & df_tiktok_audd_music['_audd_music.artist'].notna()]
df_tiktok_audd_music

In [None]:
# Select one example row and show its links and the recognised song
row = df_tiktok_audd_music.iloc[2]
print('Url to full video: ', row['webVideoUrl'])
print('Sound recognised by Audd: ', row['_audd_music.artist'], '-', row['_audd_music.title'])
print('original sound:↴')
Audio(row['musicMeta.playUrl'])

In [None]:
# Select one example row and show its links and the recognised song
row = df_tiktok_audd_music.iloc[9]
print('Url to full video: ', row['webVideoUrl'])
print('Sound recognised by Audd: ', row['_audd_music.artist'], '-', row['_audd_music.title'])
print('original sound:↴')
Audio(row['musicMeta.playUrl'])

In [None]:
# Select one example row and show its links and the recognised song
row = df_tiktok_audd_music.iloc[15]
print('Url to full video: ', row['webVideoUrl'])
print('Sound recognised by Audd: ', row['_audd_music.artist'], '-', row['_audd_music.title'])
print('original sound:↴')
Audio(row['musicMeta.playUrl'])

# To be continued ... ⏳

I still want to do some more analysis on the Audd data. Quite a lot, to be honest. But this notebook has already taken quite some time, and it's always good to get some feedback, even at this early stage!

So if anyone has any feedback already, go ahead! Leave a comment, it will surely help me a lot! And if there are any analysis ideas you would like to see, those are very welcome as well!


# Thank you and Credits 💰
I have been a member of Kaggle for four years now and haven't published a single "real" notebook in all that time. Why? Because I thought I couldn't do it, I wasn't good enough or I just didn't know how or where to start. Last Saturday's meetup by Andrada and Parul was most inspiring. It encouraged me to start this notebook.

But there are more people to thank, so for all of you, some most deserved credit:

<div class="credits">
    <div class="credits-title"><span>Awesome</span> Unicorns</div><div class="credits-title-star" style="color: rgba(0,0,0,1);">✨</div>
    <div class="creditsunicorn"></div>
    <p class="credits-text" style="color: rgba(255,255,255,1);">A huge thank you to <a style="color:#BDE0FE" href="https://www.kaggle.com/andradaolteanu">Andrada Olteanu</a> and <a style="color:#BDE0FE" href="https://www.kaggle.com/parulpandey">Parul Pandey</a> for sharing their valuable experience and knowledge with the rest of the community, during the meetup of last Saturday, and of course Team Kaggle Days Meetup Delhi NCR for organising the event. <br/><br/>Thanks <a style="color:#BDE0FE" href="https://www.kaggle.com/andradaolteanu">Andrada Olteanu</a>, again, for her awesome tips on improving the Notebooks Flow and her helpful and most inspiring notebooks. I most probably used a lot of her ideas for this notebook, so make sure to check out her profile.<br/><br/>
    Thank you <a style="color:#BDE0FE" href="https://www.kaggle.com/bariscal">Baris Cal</a> for his great notebook on visualisations with the plotly library, which gave me great advice on how to create and write cleaner code for the visualisations with this library.<br/><br/>
    Thank you to every member of the KaggleNoobs Slack Channel! It's awesome to see how many Masters and Grandmasters are still members of this Slack channel and are always happy to help others out, no matter their experience, background or skillset.<br/><br/> You guys make me love Data Science and the people in it, even more ❤</p>
</div>

<div></div>