# TikTok Analytics

Graded assignment for BDL03_1 H21 by Patrik Widmer (wkwidmer)

## Summary: Project Goal


The goal of this project is to analyze the social media platform TikTok. Social media managers and marketers need to understand what is happening on these platforms and identify the current trends. With this project, I will answer questions like the following:

* Which TikTokers (users) have the most followers, which TikToks (posts) are trending right now, and which TikToks get the most views and the most comments?
* Which hashtags are trending right now, and which countries do most of the "Top 1000 TikTokers" come from?

## Project Components

This Jupyter notebook scrapes data from two sources: 
* First, we scrape data directly from tiktok.com using "The Unofficial TikTok API Wrapper", which is called using a docker image. The data from this source is saved in the MongoDB collection "trending".
* The second source is the hypeauditor.com website, which lists the "Top 1000 TikTokers" globally. Since this website is powered by its own API, we call that proprietary API directly to get the data. The data from this source is saved in the MongoDB collection "hypeauditor_toptiktoker".

You can see these components in the image below.



<img src="https://i2.paste.pics/EFI05.png" alt="Drawing" style="width: 800px;"/>


# Installation & Setup
In order to get this Jupyter notebook up and running, please follow the steps below.

## Step 1: Clone Git Repository

Please clone the following git repository: 
* git clone https://github.com/PatWint/BDL03_1.H21_wkw.git

## Step 2: Install requirements

Run the following command in the Jupyter notebook or on the command line to install the required packages, then restart the Jupyter notebook kernel.

In [1]:
# !pip3 install -r requirements.txt

## Step 3: MongoDB User

If you opened this Jupyter notebook from the git repository, no username and password are set for MongoDB. Either replace the notebook from the git repository with the notebook from Moodle, or set a username and password for MongoDB in the chapter "Mongo DB Settings": 
* mongodb_username = ""
* mongodb_pw = ""
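If the username or password contains characters such as `@`, `:` or `/`, they have to be percent-encoded before they are placed in the connection string. The following is a minimal sketch of how this could be done; the helper name `build_mongodb_uri` and the demo credentials are assumptions for illustration, while the host and database name are taken from the connection cell of this notebook.

```python
from urllib.parse import quote_plus


def build_mongodb_uri(username, password,
                      host="cluster0.t9sw1.mongodb.net", database="test"):
    """Return a mongodb+srv URI with percent-encoded credentials.

    Hypothetical helper, not part of the original notebook.
    """
    return (
        "mongodb+srv://"
        + quote_plus(username) + ":" + quote_plus(password)
        + "@" + host + "/" + database
    )


# Demo credentials (made up) that contain reserved characters
example_uri = build_mongodb_uri("demo_user", "p@ss:word/1")
print(example_uri)
```

The resulting string can be passed to `pymongo.MongoClient` in the same way as the concatenated string used later in this notebook.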


## Step 4: TikTok API Key

To retrieve/scrape data directly from https://www.tiktok.com, I use "The Unofficial TikTok API Wrapper" by David Teather. You can use the TikTok API without a "Key", but you will only get limited data. In order to get a larger number of TikToks, you need to set the custom_verifyFp value. This value expires after some hours. Therefore, you need to manually reset it.

To get a custom_verifyFp key, please follow these steps and see the screenshots below: 

1. Open Chrome and go to: www.tiktok.com
2. Click on the lock icon and then on Cookies.
3. Click on www.tiktok.com (not on tiktok.com).
4. Click on cookies
5. Scroll down until you see s_v_web_id
6. Copy the verify_ key to your clipboard.
7. Scroll down in this Jupyter notebook to the chapter "TikTok API Settings" and paste the key you just copied into custom_verifyFp_settings. 




<img src="https://i2.paste.pics/EGSKX.png" alt="Drawing" style="width: 700px;"/>


<img src="https://i2.paste.pics/0b6b8b29ee74020023673655490e7dff.png" alt="Drawing" style="width: 500px;"/>


<img src="https://i2.paste.pics/bebb0f3a7e686f4c203b272a0aaac0c6.png" alt="Drawing" style="width: 500px;"/>


# Imports

In [23]:
import json
import pymongo
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import warnings
import requests
import numpy as np
from pprint import pprint
from datetime import date, timedelta
from time import sleep
from IPython.display import Image, HTML
from datetime import datetime


# Run Configuration & Settings


## Execution Mode
By changing these settings, you can enable or disable parts of the code and specify what you want to do.



### Scrape Trending Tiktoks
This mode scrapes the currently trending TikToks and stores them in MongoDB.

In [3]:
modea = []
modea.append('get_trending_today') 

### Scrape Hypeauditor
This mode resets the hypeauditor collection, scrapes the current Top 1000 TikTokers from the Hypeauditor website and stores them in MongoDB.


In [4]:
modea.append('hypeauditor_scrape') 

### Analytics
This mode activates the TikTok Analytics part.

In [5]:
modea.append('tiktok_analytics') 

### Debug
This mode activates advanced output for debugging.

In [6]:
modea.append('debug') 

### ResetDB
I do <font color='red'> __not recommend__ </font> resetting the entire database as you will   <font color='red'> __lose historical data that cannot be recovered__ </font> or scraped again. Charts such as trending tiktoks will not show any data because for this you need to have data for several days. 

However, if you run this notebook in the  __get_trending_today__ mode,  <font color='green'> today's data will be added to the database. </font>


In [7]:
# reset_db

### Review
Review the activated modes: 

In [8]:
### Review
mode = set(modea)
pprint(mode)

{'debug', 'get_trending_today', 'tiktok_analytics', 'hypeauditor_scrape'}
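Because the modes are plain strings, a typo such as `'get_trending_todya'` would silently disable a part of the notebook. The following sketch shows one possible way to catch this; the names `KNOWN_MODES` and `build_mode` are assumptions and not part of the original notebook, and the list of known modes is taken from the sections above.

```python
# All mode names used in this notebook (collected from the sections above)
KNOWN_MODES = {
    "get_trending_today",
    "hypeauditor_scrape",
    "tiktok_analytics",
    "debug",
    "reset_db",
}


def build_mode(*requested):
    """Return the requested modes as a set, rejecting unknown names.

    Hypothetical helper for illustration only.
    """
    unknown = set(requested) - KNOWN_MODES
    if unknown:
        raise ValueError(f"Unknown mode(s): {sorted(unknown)}")
    return set(requested)


example_mode = build_mode("get_trending_today", "tiktok_analytics")
print(example_mode)
```

The checks used throughout the notebook, for example `if 'debug' in mode:`, work unchanged with a set built this way.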


## Mongo DB Settings

In [9]:
mongodb_username = ""
mongodb_pw = ""

## TikTok API Settings

In [26]:
# Nr of TikToks to scrape
# You will be blocked by TikTok if you scrape too many
scrape_nr_results = 57 #bookmark

# Tiktok Token (please read the TikTok API Installation chapter)
custom_verifyFp_settings = "verify_kupjggmp_......"


## Hypeauditor Scraper Settings

In [11]:
nr_pages = 3 # page max is 20
min_sleep_seconds = 1  # pause between api calls, minimum pause in seconds
max_sleep_seconds = 15 # pause between api calls, maximum pause in seconds (exclusive upper bound of np.random.randint)
hypeauditor_api_url = "https://hypeauditor.com/tiktok/getRankingData/?p="
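The pause between two API calls is drawn at random from the range defined by the two settings above. Note that `np.random.randint` excludes the upper bound, whereas `random.randint` from the standard library includes it. The sketch below uses the standard library variant; the helper name `pick_pause_seconds` is an assumption for illustration.

```python
import random


def pick_pause_seconds(min_seconds, max_seconds):
    """Return a random whole number of seconds in [min_seconds, max_seconds].

    Hypothetical helper; both bounds are inclusive because random.randint
    includes its upper bound.
    """
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not be larger than max_seconds")
    return random.randint(min_seconds, max_seconds)


# Draw a few example pauses with the values from the settings cell
example_pauses = [pick_pause_seconds(1, 15) for _ in range(5)]
print(example_pauses)
```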

# General
## Connection to Mongo DB

In [12]:
client = pymongo.MongoClient("mongodb+srv://" + mongodb_username + ":" + mongodb_pw + "@cluster0.t9sw1.mongodb.net/test" ,)
db = client['tiktok']

## Functions
### Mongo DB Functions

In [13]:
def drop_collection(collections):
    for collection in collections:
        print("~" * 100)
        print("Run: Drop Collection:", collection )
        print("Docs in collection before drop:", db[collection].count_documents({}))

        db.drop_collection(collection)
        print("Docs in collection after drop:", db[collection].count_documents({}))

def mongodb_insert_many(collection, data):
    print("~" * 100)
    print("Write Data to Collection:", collection )
    print("Number of:" , collection , "documents before write: " ,db[collection].count_documents({}))
    # write the documents passed in as `data` to the collection passed in as `collection`
    db[collection].insert_many(data)
    print("Number of:" , collection , "documents after write: " ,db[collection].count_documents({}))


### Helper Functions

In [14]:
# This function returns a date as integers (day, month, year)
def get_date_as_int(date_datetime):    
    today_d = int(date_datetime.strftime("%d"))
    today_m = int(date_datetime.strftime("%m"))
    today_y = int(date_datetime.strftime("%Y"))
    
    return today_d, today_m, today_y
        
today_d, today_m, today_y = get_date_as_int(date.today())
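Several pipelines in this notebook filter on "all documents scraped on a given day". Building the end of that range by adding 1 to the day number fails on the last day of a month, so a `timedelta` is the safer choice. The following is a minimal sketch; the helper name `day_bounds` is an assumption and not part of the original notebook.

```python
from datetime import date, datetime, timedelta


def day_bounds(day):
    """Return (start, end) datetimes covering the given calendar day.

    start is midnight of that day, end is midnight of the following day,
    so the range is meant to be used as start <= scrapeDate < end.
    """
    start = datetime(day.year, day.month, day.day)
    return start, start + timedelta(days=1)


# Works across month and year boundaries
example_start, example_end = day_bounds(date(2021, 10, 31))
print(example_start, example_end)
```

The two values can be used directly as the `$gte` and `$lt` bounds of a `scrapeDate` filter.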

# Reset DB 
## Reset DB Mode

I do <font color='red'> __not recommend__ </font> resetting the entire database as you will   <font color='red'> __lose historical data that cannot be recovered__ </font> or scraped again. Charts such as trending tiktoks will not show any data because they need data from several days. 

However, if you run this notebook in the  __get_trending_today__ mode,  <font color='green'> today's data will be added to the database. </font>

The reset_db mode will drop all/the following collections:
- trending
- hypeauditor_toptiktoker


In [15]:
if 'reset_db' in mode:
    drop_collection_list = []
    drop_collection_list.append("trending")
    drop_collection_list.append("hypeauditor_toptiktoker")

    drop_collection(drop_collection_list)


# ELT
## TikTok API

To retrieve data directly from https://www.tiktok.com, I use "The Unofficial TikTok API Wrapper" by David Teather.
You can find the library on GitHub: https://github.com/davidteather/TikTok-Api. Since I made minor changes, you should follow the installation instructions above to use the TikTok API Wrapper in this project. 



##  Trending Tiktoks
### Mode: get_trending_today

In order to get as many Trending Tiktoks as possible per day, the following __strategy__ will be used:
* We run the Jupyter notebook in the __get_trending_today mode__ multiple times per day. Each run calls the TikTok API and returns a list of the currently trending TikToks (posts). 
    1. TikTok returns a different list of trending TikToks on every call. When we call the API again, we get back a mix of TikToks that we haven't seen yet and TikToks that we have already seen in a previous call. 
    2. We always write the full list of TikToks, including duplicates, to the DB. This causes duplicate entries, which we later identify using a MongoDB pipeline. 
    3. We delete the duplicates.


### Fetch data: API Call Trending Tiktoks
 

In [16]:
if 'get_trending_today' in mode:
    custom_verifyFp = custom_verifyFp_settings
    api_return_filename = "api_trending_return.json"


Since the TikTok API uses the Playwright Async API rather than the asyncio loop that Jupyter notebook runs on, calling it inside a Jupyter notebook causes an error. For this reason, I run the actual API call from a separate Python script outside of the Jupyter notebook.


In [17]:
if 'get_trending_today' in mode:
    !ipython ./TikTokApi/api_call_trending.py {scrape_nr_results} {custom_verifyFp} {api_return_filename}

scrape_nr_results: 57
custom_verifyFp: verify_kupjggmp_......
filename: api_trending_return.json
Call TikTokApi
Write result to file: api_trending_return.json
Done
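Running the scraper in a separate process, as described above, could also be done with Python's `subprocess` module instead of the `!ipython` shell escape. The sketch below only illustrates the idea: the helper names are assumptions, and whether `api_call_trending.py` behaves the same under plain `python` as under `ipython` has not been verified here. The demo call therefore runs a harmless one-line command instead of the real script.

```python
import subprocess
import sys


def build_scrape_command(nr_results, verify_fp, out_file,
                         script="./TikTokApi/api_call_trending.py"):
    """Build the argument list for the external scraper script.

    Hypothetical helper; the argument order mirrors the shell call above.
    """
    return [sys.executable, script, str(nr_results), verify_fp, out_file]


def run_command(command):
    """Run a command, capture its output and raise if it exits with an error."""
    return subprocess.run(command, capture_output=True, text=True, check=True)


example_command = build_scrape_command(57, "verify_demo", "api_trending_return.json")
print(example_command[1:])

# Demo with a harmless command instead of the real scraper
demo_result = run_command([sys.executable, "-c", "print('Done')"])
print(demo_result.stdout.strip())
```

Passing the arguments as a list avoids shell quoting problems if the verifyFp value contains special characters.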


* Check Data from the TikTok API Scraper 
* Add a scrapeDate so that we later know when the Tiktok was trending.

In [18]:
if 'get_trending_today' in mode:
    with open(api_return_filename) as json_file:
        trending = json.load(json_file)
        
    trending_transformed = []
    
    print_limit= 3
    print_counter = 0
    
    print("Print first:" , print_limit , "trending TikToks:")
    print("-" * 80)   

    for tiktok in trending:
         
        print_counter += 1
        
        # add the current timestamp as scrapeDate; use a separate name so the
        # imported `date` class is not overwritten
        scrape_date = datetime.today()

        tiktok["scrapeDate"] = scrape_date
        
        if( print_counter < print_limit):
            print("print_counter" , print_counter)
            print("des: " , tiktok['desc']) # tiktok description
            print("id: " , tiktok['id']) # tiktok id 
            print("scrapeDate:", tiktok["scrapeDate"]) # tiktok scrapeDate 

            # print tiktok hashtags 
            if "textExtra" in tiktok:
                for e in tiktok['textExtra']:
                    print("-" , e['hashtagName'])
            else:
                print("Tiktok has no textExtra")
            print("-" * 80) 
            
        trending_transformed.append(tiktok)
    nr_scraped_tiktoks = len(trending)
    print("Number of trending TikToks scraped: " ,nr_scraped_tiktoks)
    print("-" * 80) 


Print first: 3 trending TikToks:
--------------------------------------------------------------------------------
print_counter 1
des:  OMG, wie findest DU diese Schuhe?🤔👟 #foryou #fürdich #schuhe #sneaker #airjordan1 #nike #SneakerHacks #AusgehOutfit
id:  7020823103102078213
scrapeDate: 2021-10-19 23:30:08.663473
- foryou
- fürdich
- schuhe
- sneaker
- airjordan1
- nike
- sneakerhacks
- ausgehoutfit
--------------------------------------------------------------------------------
print_counter 2
des:  Hahahahah kann nicht mehr #viral #foryou #fyp #funny
id:  7007090297264000262
scrapeDate: 2021-10-19 23:30:08.664310
- viral
- foryou
- fyp
- funny
--------------------------------------------------------------------------------
Number of trending TikToks scraped:  57
--------------------------------------------------------------------------------


### Insert into MongoDB: Trending TikToks

In [19]:
if 'get_trending_today' in mode:
    if(len(trending) > 0):
        mongodb_insert_many("trending" , trending_transformed)
    

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Write Data to Collection: trending
Number of: trending documents before write:  12683
Number of: trending documents after write:  12740


### Find duplicates 

When a TikTok (post) is trending for multiple days, it does not count as a duplicate. A duplicate is a scraped TikTok (post) on the same day. We use the following __strategy__ to find duplicates:
* Filter documents for scrapeDate today
* Find documents which have the same id *(not _id)*

Note: 
* The key: id is issued by tiktok.com and is the official id of this TikTok (post).
* The key: _id is the document key of MongoDB
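The duplicate rule described above can be illustrated without a database. The following sketch applies it to a small in-memory list; the function name `same_day_duplicates` and the sample documents are assumptions made up for this example, and the MongoDB pipeline in the next cell remains the actual implementation.

```python
from datetime import date, datetime


def same_day_duplicates(docs, day):
    """Return the _id values of documents that repeat a TikTok id on `day`.

    The first document per TikTok id is kept; every later document with the
    same id and the same scrape day is reported as a duplicate.
    """
    seen_ids = set()
    delete_ids = []
    for doc in docs:
        if doc["scrapeDate"].date() != day:
            continue  # other days never count as duplicates
        if doc["id"] in seen_ids:
            delete_ids.append(doc["_id"])
        else:
            seen_ids.add(doc["id"])
    return delete_ids


# Made-up sample documents: TikTok "A" was scraped on two different days
sample_docs = [
    {"_id": 1, "id": "A", "scrapeDate": datetime(2021, 10, 18, 9, 0)},
    {"_id": 2, "id": "A", "scrapeDate": datetime(2021, 10, 19, 9, 0)},
    {"_id": 3, "id": "A", "scrapeDate": datetime(2021, 10, 19, 18, 0)},
    {"_id": 4, "id": "B", "scrapeDate": datetime(2021, 10, 19, 18, 0)},
]
example_delete_ids = same_day_duplicates(sample_docs, date(2021, 10, 19))
print(example_delete_ids)
```

Only the second document of TikTok "A" scraped on 19 October is reported; the document from 18 October is left untouched.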

In [None]:
def find_duplicate_trending(filter_date):
    
    today_d, today_m, today_y = get_date_as_int(filter_date)
    
    # find duplicates of trending tiktoks in db using a pipeline

    pipeline = [
        {
            '$match': {
                'scrapeDate': {
                    '$gte': datetime(today_y, today_m, today_d), 
                    # add one day with timedelta so the range also works on the last day of a month
                    '$lt': datetime(today_y, today_m, today_d) + timedelta(days=1)
                }
            }
        }, {
            '$project': {
                '_id': 1, 
                'id': 1
            }
        }, {
            '$group': {
                '_id': '$id', 
                'count': {
                    '$sum': 1
                }
            }
        }, {
            '$match': {
                'count': {
                    '$gte': 2
                }
            }
        }
    ]


    duplicates = []

    for d in db.trending.aggregate(pipeline):
        duplicates.append(d)

    duplicates_clean = []
    for duplicate in duplicates:
        duplicates_clean.append(duplicate["_id"])

    return duplicates_clean


def get_delete_list_trendingdocs(duplicates, filter_date): 
    # Prepare the list of _id's to delete: for every duplicated TikTok id,
    # keep the first document scraped on filter_date and delete the others.
    # The date filter makes sure documents from other days are not deleted.
    day_start = datetime(filter_date.year, filter_date.month, filter_date.day)
    day_end = day_start + timedelta(days=1)

    delete_list = []
    seen_social_ids = set()

    query = {
        "id": {"$in": duplicates},
        "scrapeDate": {"$gte": day_start, "$lt": day_end}
    }

    for d in db.trending.find(query, {"_id": 1, "id": 1}):
        if d["id"] in seen_social_ids:
            delete_list.append(d["_id"])
        else:
            seen_social_ids.add(d["id"])

    return delete_list

if 'get_trending_today' in mode:
    today = date.today()
    delete_list = get_delete_list_trendingdocs(find_duplicate_trending(today), today)

### Delete duplicates 


In [None]:
def delete_documents_by_list(delete_list, collectionstr):
    delete_filter = {"_id":{"$in":delete_list}}

    nr_total_trending_before_delete = db[collectionstr].count_documents({})
    nr_docs_to_delete = db[collectionstr].count_documents(delete_filter)

    if 'debug' in mode:
        print("Delete duplicates:" , "-" * 80) 
        print("Docs in ", collectionstr , " before delete: ", nr_total_trending_before_delete)
        print("Delete docs in" , collectionstr , nr_docs_to_delete)
        print("Start delete")

    db[collectionstr].delete_many(delete_filter) # delete duplicate documents

    nr_total_trending_after_delete = db[collectionstr].count_documents({})

    if 'debug' in mode:
        print("Target number of docs after delete: ", nr_total_trending_before_delete - nr_docs_to_delete)
        print("Docs in" , collectionstr , " after delete: ", nr_total_trending_after_delete)
        if( nr_total_trending_before_delete - nr_docs_to_delete == nr_total_trending_after_delete):
            print("Correct number of docs deleted.")
        else:
            print("Wrong number of docs deleted.")
    return nr_docs_to_delete

if 'get_trending_today' in mode:
    print("Scraped TikToks:", nr_scraped_tiktoks)
    nr_docs_to_deleted = delete_documents_by_list(delete_list, "trending")    
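    # guard against division by zero and print the share as a percentage
    if nr_scraped_tiktoks > 0:
        print("Percentage of duplicates in Scraped TikToks:", f"{nr_docs_to_deleted / nr_scraped_tiktoks:.1%}")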

## Hypeauditor Top 1000 TikTokers

Since we only receive a tiny fraction of all TikToks (posts) worldwide, we only know which TikToks are trending for our user in Switzerland. To get a global overview, we would need to scrape tiktok.com with multiple accounts from multiple IP locations. Since this is beyond the scope of this project, we instead scrape an API from Hypeauditor, which provides the top 1000 TikTokers (users). We can later use this information to get more data about the top TikTokers directly from tiktok.com using "The Unofficial TikTok API Wrapper".


### Remove all existing documents -> Reset collection

In [None]:
# clear hypeauditor_toptiktoker collection
if 'hypeauditor_scrape' in mode:  
    drop_collection_list = []
    drop_collection_list.append("hypeauditor_toptiktoker")
    drop_collection(drop_collection_list)

    

### Fetch Data: Hypeauditor

In [None]:
def hypeauditor_api_call(page_counter, url):
    # Scrape the API of the hypeauditor website: https://hypeauditor.com/top-tiktok/
    full_request_url = f'{url}{page_counter}'
    print("Calling: ", full_request_url)

    payload = {}
    headers = {
        'authority': 'hypeauditor.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'sec-ch-ua': '"Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"',
        'accept': 'application/json, text/plain, */*',
        'x-requested-with': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
        'sec-ch-ua-platform': '"macOS"',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://hypeauditor.com/top-tiktok?p=1',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8,de;q=0.7',
        'Cookie': 'kyb-account=1388226; kyb-last-login=refresh; kyb-hash=%242y%2404%24Xfg3Dtl8FsqoSiri0Vkm9OBi9AUDwt5lRmsUxQiCVMbuuPFCPDpSe; PHPSESSID=09d8d49eb5034c31d656262df7304a08; dd=16'
    }

    response = requests.request("GET", full_request_url, headers=headers, data=payload)
    response_json = response.json()

    return response_json["result"]

In [None]:
# Call the hypeauditor API once for each subpage
# Visit this url to see an example: https://hypeauditor.com/top-tiktok/
if 'hypeauditor_scrape' in mode:
    top_tiktoker_pages_list = []
    for page_counter in range(1, nr_pages + 1):
        toptiktoker_page = hypeauditor_api_call(page_counter, hypeauditor_api_url)
        # get all information from website
        top_tiktoker_pages_list.append(toptiktoker_page)

        #sleep if page_counter is not last page
        if(nr_pages  != page_counter):
            sleep_randint = np.random.randint(min_sleep_seconds, max_sleep_seconds)
            print("Sleep seconds: " , sleep_randint)
            sleep(sleep_randint)

    # extract tiktoker information from website data
    top_tiktoker_list = []

    for page in top_tiktoker_pages_list:
        for tiktoker in page:
            top_tiktoker_list.append(tiktoker)
            #print(tiktoker)
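The nested loop above turns the list of pages into one flat list of TikTokers. The same step can be sketched with `itertools.chain`; the function name `flatten_pages` and the sample records are assumptions for illustration, since the real field names depend on what the Hypeauditor API returns.

```python
from itertools import chain


def flatten_pages(pages):
    """Combine a list of pages (each a list of records) into one flat list."""
    return list(chain.from_iterable(pages))


# Made-up sample data: two pages with placeholder records
sample_pages = [
    [{"rank": 1}, {"rank": 2}],
    [{"rank": 3}],
]
example_flat = flatten_pages(sample_pages)
print(len(example_flat))
```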

    


### Insert into MongoDB: Hypeauditor Data

In [None]:
if 'hypeauditor_scrape' in mode:
    # write top tiktoker to mongo db
    print("Number of top TikToker in db before write:" ,db.hypeauditor_toptiktoker.count_documents({}))
    db.hypeauditor_toptiktoker.insert_many(top_tiktoker_list)
    print("Number of top TikToker in db after write:" ,db.hypeauditor_toptiktoker.count_documents({}))

# Data Analysis


## Trending TikToks (Switzerland)

Analysis of scraped trending TikToks with a Swiss TikTok Account and a Swiss IP location.



In [None]:
# Query and group the hashtags from mongodb using a pipeline
if 'tiktok_analytics' in mode: 
    today_d, today_m, today_y = get_date_as_int(date.today())

    pipeline =  [
        {
            '$match': {
                'scrapeDate': {
                    '$gte': datetime(today_y, today_m, today_d), 
                    # add one day with timedelta so the range also works on the last day of a month
                    '$lt': datetime(today_y, today_m, today_d) + timedelta(days=1),
                }
            }
        }, {
            '$project': {
                '_id': 0, 
                'textExtra.hashtagName': 1, 
                'textExtra.hashtagId': 1
            }
        }, {
            '$unwind': {
                'path': '$textExtra'
            }
        }, {
            '$replaceWith': {
                'hashtagName': '$textExtra.hashtagName', 
                'hashtagId': '$textExtra.hashtagId'
            }
        }, {
            '$group': {
                '_id': {
                    'hashtagId': '$hashtagId', 
                    'hashtagName': '$hashtagName'
                }, 
                'count': {
                    '$sum': 1
                }
            }
        }, {
            '$sort': {
                'count': -1
            }
        }, {
            '$replaceWith': {
                'hashtagName': '$_id.hashtagName', 
                'hashtagId': '$_id.hashtagId', 
                'count': '$count'
            }
        
        }, {
            '$match': {
                '$and': [
                    {
                        'hashtagName': {
                            '$ne': ''
                        }
                    }
                    
                ]
            }
        }, {
            '$limit': 50
        }
    ]
    if False:
        if 'debug' in mode:
            for d in db.trending.aggregate(pipeline):
                print(d)

    df_hashtags = pd.DataFrame.from_dict(db.trending.aggregate(pipeline))
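To make the stages of the pipeline above easier to follow, the same counting logic can be mirrored in plain Python on a few in-memory documents. This is only a sketch for understanding: the function name `count_hashtags`, the sample documents and the hashtag ids are made up, and the actual aggregation still runs on the database.

```python
from collections import Counter


def count_hashtags(tiktoks, limit=50):
    """Count hashtags across TikTok documents, most frequent first.

    Mirrors the pipeline stages: unwind textExtra, group by
    (hashtagId, hashtagName), sort by count, drop empty names, limit.
    """
    counts = Counter()
    for tiktok in tiktoks:
        # documents without textExtra contribute nothing, as with $unwind
        for extra in tiktok.get("textExtra", []):
            name = extra.get("hashtagName", "")
            if name != "":
                counts[(extra.get("hashtagId"), name)] += 1
    return [
        {"hashtagName": name, "hashtagId": hashtag_id, "count": count}
        for (hashtag_id, name), count in counts.most_common(limit)
    ]


# Made-up sample documents with hypothetical hashtag ids
sample_tiktoks = [
    {"textExtra": [{"hashtagName": "foryou", "hashtagId": "h1"},
                   {"hashtagName": "nike", "hashtagId": "h2"}]},
    {"textExtra": [{"hashtagName": "foryou", "hashtagId": "h1"},
                   {"hashtagName": "", "hashtagId": ""},
                   {"hashtagName": "funny", "hashtagId": "h3"}]},
    {"desc": "TikTok without hashtags"},
]
example_hashtags = count_hashtags(sample_tiktoks)
print(example_hashtags[0])
```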

    
    

### Hashtags in Trending Tiktoks

In [None]:
if 'debug' in mode:
    display(df_hashtags.head(15))

    # Visualize the Hashtags
    # Note: Some hashtags contain glyph icons. These are aggregated as separate hashtags,
    # but not shown in the chart below
    warnings.filterwarnings('ignore') # Jupyter can't show glyph icons, ignore warning

#### Barchart: Hashtags used in today's trending Tiktoks
The barchart below shows the most used hashtags in trending TikToks (posts).

In [None]:
if 'tiktok_analytics' in mode and len(df_hashtags) > 0: 
    plt.figure();
    plt.rcParams['figure.figsize'] = [15, 5]

    df_hashtags.iloc[0:30].plot(title="Trending Hashtags as of " + str(today_d) + "." + str(today_m) + "."+ str(today_y) , kind="bar" , x='hashtagName');
    plt.axhline(0, color="k");

### #switzerland related hashtags



In [None]:
if 'tiktok_analytics' in mode:

    pipeline = [
        {
            '$project': {
                '_id': 0, 
                'textExtra.hashtagName': 1, 
                'textExtra.hashtagId': 1
            }
        }, {
            '$match': {
                '$or': [
                    {
                        'textExtra.hashtagName': {
                            '$eq': 'switzerland'
                        }
                    }, {
                        'textExtra.hashtagName': {
                            '$eq': 'schweiz'
                        }
                    }, {
                        'textExtra.hashtagName': {
                            '$eq': 'schwiz'
                        }
                    }, {
                        'textExtra.hashtagName': {
                            '$eq': 'suisse'
                        }
                    }, {
                        'textExtra.hashtagName': {
                            '$eq': 'swiss'
                        }
                    }, {
                        'textExtra.hashtagName': {
                            '$eq': 'svizzera'
                        }
                    }, {
                        'textExtra.hashtagName': {
                            '$eq': '🇨🇭'
                        }
                    }
                ]
            }
        }, {
            '$unwind': {
                'path': '$textExtra'
            }
        }, {
            '$replaceWith': {
                'hashtagName': '$textExtra.hashtagName', 
                'hashtagId': '$textExtra.hashtagId'
            }
        }, {
            '$group': {
                '_id': {
                    'hashtagId': '$hashtagId', 
                    'hashtagName': '$hashtagName'
                }, 
                'count': {
                    '$sum': 1
                }
            }
        }, {
            '$sort': {
                'count': -1
            }
        }, {
            '$replaceWith': {
                'hashtagName': '$_id.hashtagName', 
                'hashtagId': '$_id.hashtagId', 
                'count': '$count'
            }
        }, {
            '$match': {
                '$and': [
                    {
                        'hashtagName': {
                            '$ne': 'switzerland'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'schweiz'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'schwiz'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'switzerland'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'suisse'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'swiss'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'svizzera'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': '🇨🇭'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'fyp'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'foryou'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': '4u'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'fy'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'fypage'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'viral'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'goviral'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'trend'
                        }
                     }, {
                        'hashtagName': {
                            '$ne': 'pourtoi'
                        }
                     }, {
                        'hashtagName': {
                            '$ne': 'fürdich'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'fypシ'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'fanpage'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'foryoupage'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': 'fypp'
                        }
                    }, {
                        'hashtagName': {
                            '$ne': ''
                        }
                    }
                ]
            }
        }, {
            '$limit': 100
        }
    ]
    
    df_hashtags_sw  = pd.DataFrame.from_dict(db.trending.aggregate(pipeline))
    

    display(df_hashtags_sw.iloc[0:10])


#### Barchart:  #switzerland related hashtags


This chart shows which hashtags are used together with the #switzerland hashtag and how often (count). 
Excluded hashtags:
* #Switzerland related hashtags
* #FYPage related hashtags
* #Viral related hashtags
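The long `$or` and `$and` lists in the pipeline above could be written more compactly with the MongoDB operators `$in` and `$nin`, which match a field against a list of values. The following sketch only builds the two `$match` stages as Python dictionaries; the names `SWITZERLAND_TAGS`, `GENERIC_TAGS` and `build_related_hashtag_stages` are assumptions, and the tag lists are copied from the pipeline above.

```python
# Hashtags that select Switzerland-related TikToks
SWITZERLAND_TAGS = ["switzerland", "schweiz", "schwiz", "suisse",
                    "swiss", "svizzera", "🇨🇭"]

# Generic hashtags (FYPage, viral, ...) that are excluded from the result
GENERIC_TAGS = ["fyp", "foryou", "4u", "fy", "fypage", "viral", "goviral",
                "trend", "pourtoi", "fürdich", "fypシ", "fanpage",
                "foryoupage", "fypp", ""]


def build_related_hashtag_stages(include_tags, exclude_tags):
    """Return the two $match stages of the related-hashtag pipeline.

    The first stage keeps TikToks that use at least one of include_tags.
    The second stage runs after grouping and drops include_tags and
    exclude_tags from the resulting hashtag list.
    """
    match_posts = {"$match": {"textExtra.hashtagName": {"$in": include_tags}}}
    drop_tags = {"$match": {"hashtagName": {"$nin": include_tags + exclude_tags}}}
    return match_posts, drop_tags


example_match, example_drop = build_related_hashtag_stages(SWITZERLAND_TAGS, GENERIC_TAGS)
print(example_match)
```

With this approach, adding another hashtag to exclude only requires extending one of the two lists.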

In [None]:
if 'tiktok_analytics' in mode and len(df_hashtags_sw) > 0: 
    plt.figure();
    plt.rcParams['figure.figsize'] = [20,10]
    df_hashtags_sw.iloc[0:40].plot(title="#switzerland related hashtags " , kind="bar" , x='hashtagName', ylabel="count" );
    plt.axhline(0, color="k");

### Trending Tiktoks 

#### Helper Function

In [None]:
# This function builds a pipeline to get the trending TikToks from mongodb.
# The results are sorted by the function argument sort_field and limited to
# limit_nr records; both steps run directly on the database.
if 'tiktok_analytics' in mode: 
    def pipeline_trending_tiktoks_by(sort_field, asof_date, limit_nr= 10):
        today_d, today_m, today_y = get_date_as_int(asof_date)

        pipeline = [
            {
                '$match': {
                    'scrapeDate': {
                        '$gte': datetime(today_y, today_m, today_d), 
                        # add one day with timedelta so the range also works on the last day of a month
                        '$lt': datetime(today_y, today_m, today_d) + timedelta(days=1),
                    }
                }
            }, {
                '$project': {
                    'id': 1, 
                    'desc': 1, 
                    'video.dynamicCover': 1, 
                    'video.playAddr': 1, 
                    'stats': 1, 
                    'author.id': 1, 
                    'author.nickname': 1,
                    'scrapeDate': 1
                }
            }, {
                '$replaceWith': {
                    'desc': '$desc', 
                    'id': '$id', 
                    'authorNickname': '$author.nickname', 
                    'playCount': '$stats.playCount', 
                    'shareCount': '$stats.shareCount', 
                    'commentCount': '$stats.commentCount',
                    'scrapeDate': '$scrapeDate'
                }
            }, {
                '$sort': {
                    'sortkey': -1
                }
            },
            {
                '$limit': limit_nr
            }
        ]

        pipeline[-2]["$sort"] = {sort_field : -1} # replace sortkey with sort_field from function input
        
        return pipeline
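Because the sort field is passed in as a string, a misspelled field name would not raise an error; the database would simply return the documents unsorted with respect to that field. The sketch below shows one possible way to build the last two pipeline stages with a check; the names `ALLOWED_SORT_FIELDS` and `build_sort_and_limit` are assumptions, and the allowed fields are the count fields produced by the `$replaceWith` stage above.

```python
# Count fields produced by the $replaceWith stage of the pipeline
ALLOWED_SORT_FIELDS = {"playCount", "shareCount", "commentCount"}


def build_sort_and_limit(sort_field, limit_nr=10):
    """Return the $sort and $limit stages, sorted descending by sort_field.

    Hypothetical helper for illustration only.
    """
    if sort_field not in ALLOWED_SORT_FIELDS:
        raise ValueError(f"Unsupported sort field: {sort_field}")
    if limit_nr < 1:
        raise ValueError("limit_nr must be at least 1")
    return [{"$sort": {sort_field: -1}}, {"$limit": limit_nr}]


example_stages = build_sort_and_limit("shareCount", 20)
print(example_stages)
```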

#### Table:  Today's Trending TikToks by Play Count

The table below shows the trending Tiktoks, ordered by the number of views (play count). You can click on the Play TikTok link to watch the TikTok on tiktok.com. 

In [None]:
if 'tiktok_analytics' in mode: 
    def print_tiktok_by(sort_by, asof_date, limit):
        df_tiktoks_by_view = pd.DataFrame.from_dict(db.trending.aggregate(pipeline_trending_tiktoks_by(sort_by , asof_date, limit)))
        if(len(df_tiktoks_by_view) > 0): 
            pd.set_option('display.max_colwidth', 30)
            df_tiktoks_by_view["play_url"] = '<div style="width: 170px"><a href="https://www.tiktok.com/@totouchanemu/video/' + df_tiktoks_by_view["id"] + '?is_from_webapp=v1" target="_blank">Play TikTok</a></div>'
            #df_tiktoks_by_view.drop('id', axis=1, inplace=True)
            display(HTML(df_tiktoks_by_view.to_html(escape=False)))
        else:
            print("No data for: ", asof_date)
    
    print_tiktok_by("playCount" , date.today(), 20)

### Trending Tiktoks by Share Count
#### Table: Today's Trending Tiktoks by Share Count

Trending TikToks today by Share Count.

In [None]:
if 'tiktok_analytics' in mode: 
    print_tiktok_by("shareCount" , date.today(),  10)

#### Table: Yesterday's Trending Tiktoks by Share Count

Trending TikToks yesterday by Share Count.
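Selecting "today" or "yesterday" comes down to a half-open datetime range on `scrapeDate`. The sketch below shows one way to compute such a range with `timedelta`, so that it stays valid across month and year boundaries. The helper name `day_window` and its `days_back` parameter are illustrative assumptions, not part of the notebook.

```python
from datetime import date, datetime, timedelta


def day_window(asof_date: date, days_back: int = 0):
    """Return (start, end) datetimes covering asof_date plus days_back earlier days.

    start is inclusive, end is exclusive (midnight of the following day).
    """
    midnight = datetime(asof_date.year, asof_date.month, asof_date.day)
    return midnight - timedelta(days=days_back), midnight + timedelta(days=1)


# Last day of a month: the end rolls over to the first of the next month
start, end = day_window(date(2021, 10, 31))

# The range can be used directly in a $match stage
match_stage = {"$match": {"scrapeDate": {"$gte": start, "$lt": end}}}
print(match_stage)
```

Adding 1 to the day number instead would raise a `ValueError` on the last day of a month, which is why the arithmetic is done on `datetime` objects here.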

In [None]:
if 'tiktok_analytics' in mode: 
    yesterday = date.today() - timedelta(1)
    print_tiktok_by("shareCount" , yesterday ,  10)

### Development of trending TikTok over Time

#### Development of trending TikTok over Time: Functions


In [None]:
def pipeline_trending_tiktoks_overtime(asof_date, sort_metric="playCount"):
    today_d, today_m, today_y = get_date_as_int(asof_date)
    pipeline = [
        {
            '$match': {
                '$or': [
                
                    {
                        'scrapeDate': {
                            '$gte': datetime(today_y, today_m, today_d, 0, 0, 0) - timedelta(days=7),  # timedelta keeps the range valid across month boundaries
                            '$lt': datetime(today_y, today_m, today_d, 0, 0, 0) + timedelta(days=1),
                        }
                    }
                ]
            }
        }, {
            '$replaceWith': {
                'desc': '$desc', 
                'id': '$id', 
                'commentCount': '$stats.commentCount', 
                'videoDynamicCover': '$video.dynamicCover', 
                'videoPlayAddr': '$video.playAddr', 
                'authorId': '$author.id', 
                'authorNickname': '$author.nickname', 
                'playCount': '$stats.playCount', 
                'shareCount': '$stats.shareCount', 
                'scrapeDate': '$scrapeDate'
            }
        }, {
            '$sort': {
                'playCount': -1, 
                'id': -1, 
                'scrapeDate': -1
            }
        } , 
        {
          '$limit': 3000
        }
    ]
    
    # sort by the requested metric first, then by id and scrape date
    pipeline[-2]["$sort"] = {sort_metric: -1, 'id': -1, 'scrapeDate': -1}
    return pipeline



In [None]:
def get_df_trending_tiktoks_over_time(pipeline):
    # get data from db
    df_trending_development = pd.DataFrame.from_dict( db.trending.aggregate(pipeline))

    # trending tiktoks which appeared on more than one day
    df_trending_ids = df_trending_development['id'].value_counts() 
    df_trending_ids = df_trending_ids[df_trending_ids > 1] # get id's which we have on multiple days
    df_trending_development = df_trending_development[df_trending_development['id'].isin(list(df_trending_ids.index))].copy() # filter dataset; copy avoids SettingWithCopyWarning below
    df_trending_development["playCountMio"] = df_trending_development["playCount"] / 1000000
    df_trending_development["shareCountK"] = df_trending_development["shareCount"] / 1000
    df_trending_development["commentCountK"] = df_trending_development["commentCount"] / 1000
    
    return df_trending_development

def get_single_tiktok_dataframe(df, tiktok_id):
    df_trending_development = df #get_df_trending_tiktoks_over_time(pipeline)
    # pick one tiktok (copy so the assignments below do not touch the source frame)
    df_trending_development_single = df_trending_development[df_trending_development["id"] == tiktok_id].copy()

    #normalize date for chart
    df_trending_development_single['scrapeDate'] = df_trending_development_single['scrapeDate'].dt.normalize()
    df_trending_development_single.reset_index(drop=True, inplace=True)

    return df_trending_development_single



In [None]:
def plot_tiktoks_over_time(pipeline, top_x_nr_start,top_x_nr_end , metric="playCountMio", id_list = False):
    
    # metric: playCountMio, shareCountK, commentCountK
    
    df_trending_tiktoks = get_df_trending_tiktoks_over_time(pipeline)
    if(id_list):
        #plot specific list
        list_top_x_ids = id_list
        
    else: 
        
        # plot top x id's 
        list_top_x_ids = list(df_trending_tiktoks.id.unique()[top_x_nr_start: top_x_nr_end + 1 ])
        if((len(list_top_x_ids)) == 0):
            print("Choose another top_x_nr_start number.")
      
    
    fig,ax = plt.subplots()
    ax.xaxis.set_major_locator(mdates.DayLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y %m %d'))


    for tiktok_id in list_top_x_ids:  # avoid shadowing the built-in id()
        df_trending_tiktok1 = get_single_tiktok_dataframe(df_trending_tiktoks, tiktok_id)
        x = df_trending_tiktok1['scrapeDate']
        y = df_trending_tiktok1[metric]
        if(len(df_trending_tiktok1['desc'])> 0): # skip tiktoks without data
            if(len(df_trending_tiktok1['desc'][0]) > 0): # use the description as label if there is one
                tiktok_desc = df_trending_tiktok1['desc'][0][0:20]
            else:
                tiktok_desc = df_trending_tiktok1['id'][0]
            plt.plot(x, y, label = "TikTok: " + tiktok_desc )


    plt.title("TikTok " + metric + " over Time") 
    # show a legend on the plot
    plt.legend()
    # Display a figure.
    
    plt.ylabel(metric)


    plt.show()



#### Chart: Development of trending TikTok over Time: Top 5

This chart shows how the views of the top 5 TikToks have evolved. 

Note:
* Only TikToks with more than 1 data point are included here
* If the line is interrupted, we have not been able to collect data for that day

Interpretation
* The trending TikToks with the most views are gaining views only slowly at this point. In the graph below, this sometimes shows as an almost flat line. It is also interesting to note that we only have data on a top trending TikTok once its views flatten out, so TikTok shows us the top trending TikToks very late in their hype/life cycle.
* To catch a trending TikTok early, we would have to scrape many more TikToks around the world, using multiple scrapers with multiple users and IP locations, and re-scrape the same TikToks every day rather than only the currently trending ones.
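How flat a line is can be quantified as the view gain between consecutive scrapes. The sketch below shows the idea with pandas on invented toy data; the column names follow the pipeline output, while `dailyGain` and `dailyGainPct` are names introduced here for illustration.

```python
import pandas as pd

# Toy snapshots standing in for the scraped data (all values are invented)
snapshots = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "scrapeDate": pd.to_datetime(
        ["2021-10-01", "2021-10-02", "2021-10-03", "2021-10-01", "2021-10-02"]
    ),
    "playCount": [90_000_000, 90_500_000, 90_600_000, 1_000_000, 4_000_000],
})

# Order each TikTok's snapshots chronologically before differencing
snapshots = snapshots.sort_values(["id", "scrapeDate"])

# Gain since the previous snapshot of the same TikTok (NaN for the first one)
snapshots["dailyGain"] = snapshots.groupby("id")["playCount"].diff()

# The same gain relative to the previous play count, in percent
previous = snapshots.groupby("id")["playCount"].shift()
snapshots["dailyGainPct"] = snapshots["dailyGain"] / previous * 100

print(snapshots)
```

In this toy data, TikTok "a" has many views but small gains, while "b" has few views and a large relative gain, which is the pattern an early-detection approach would look for.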

In [None]:
plot_tiktoks_over_time(pipeline_trending_tiktoks_overtime( date.today()), 0, 10, "playCountMio")


## Development of selected TikTok over Time

### Chart: Development of selected TikTok over Time by play count

Add a TikTok ID to the list: selected_tiktok_ids to see how the TikTok has evolved. This only works if we have data over several days for the respective TikTok ID.
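To find IDs that qualify, one could count the distinct scrape days per TikTok, as sketched below on invented toy data. Note that this counts calendar days, whereas the notebook's own filter counts rows per ID; the IDs and the name `candidate_ids` are placeholders.

```python
import pandas as pd

# Toy snapshots (invented): "333" was scraped twice, but on the same day
snapshots = pd.DataFrame({
    "id": ["111", "111", "222", "333", "333"],
    "scrapeDate": pd.to_datetime([
        "2021-10-01 08:00", "2021-10-02 08:00",
        "2021-10-01 08:00",
        "2021-10-01 08:00", "2021-10-01 20:00",
    ]),
})

# Number of distinct calendar days on which each TikTok was seen
days_per_id = (
    snapshots.assign(day=snapshots["scrapeDate"].dt.normalize())
    .groupby("id")["day"]
    .nunique()
)

# Only IDs seen on at least two different days produce a line in the chart
candidate_ids = sorted(days_per_id[days_per_id >= 2].index)
print(candidate_ids)
```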

In [None]:
selected_tiktok_ids = ["7008948234081160453", "6991178928714943749", "7009323168288591110", "7016989005157895425" ]
plot_tiktoks_over_time(pipeline_trending_tiktoks_overtime( date.today()), 1, 2, "playCountMio", selected_tiktok_ids)

### Chart: Development of selected TikTok over Time by comment count

In [None]:
plot_tiktoks_over_time(pipeline_trending_tiktoks_overtime( date.today()), 1,1, "commentCountK", selected_tiktok_ids)


### Chart: Development of selected TikTok over Time by share count


In [None]:
plot_tiktoks_over_time(pipeline_trending_tiktoks_overtime( date.today()), 0, 4, "shareCountK", selected_tiktok_ids)

### Number of trending TikToks by Author

In [None]:
if 'tiktok_analytics' in mode: 
    filter_date = date.today()
    
    today_d, today_m, today_y = get_date_as_int(filter_date)
    pipeline = [{
                    '$match': {
                        'scrapeDate': {
                            '$gte': datetime(today_y, today_m, today_d, 0, 0, 0), 
                            '$lt': datetime(today_y, today_m, today_d, 0, 0, 0) + timedelta(days=1),  # safe on the last day of a month
                        }
                    }
                }, {
                    '$project': {
                        'id': 1, 
                        'desc': 1, 
                        'video.dynamicCover': 1, 
                        'video.playAddr': 1, 
                        'stats': 1, 
                        'author.id': 1, 
                        'author.nickname': 1
                    }
                }, {
                    '$replaceWith': {
                        'desc': '$desc', 
                        'id': '$id', 
                        'videoDynamicCover': '$video.dynamicCover', 
                        'videoPlayAddr': '$video.playAddr', 
                        'authorId': '$author.id', 
                        'authorNickname': '$author.nickname', 
                        'playCount': '$stats.playCount', 
                        'shareCount': '$stats.shareCount', 
                        'commentCount': '$stats.commentCount'
                    }
                }, {
                    '$group': {
                        '_id': {
                            'authorId': '$authorId', 
                            'authorNickname': '$authorNickname'
                        }, 
                        'count': {
                            '$sum': 1
                        }
                    }
                }, {
                    '$sort': {
                        'count': -1
                    }
                }, {
                    '$replaceWith': {
                        'authorId': '$_id.authorId', 
                        'authorNickname': '$_id.authorNickname', 
                        'count': '$count'
                    }
                }, {
                    '$limit': 30
                }
            ]

#### Barchart: Number of trending TikToks by Author (today)

TikTokers (users) with the most TikToks (posts) among the trending TikToks.

In [None]:
if 'tiktok_analytics' in mode: 

    df_tiktoks_by_tiktoker = pd.DataFrame.from_dict( db.trending.aggregate(pipeline))
    if len(df_tiktoks_by_tiktoker) > 0:
        plt.figure();
        plt.rcParams['figure.figsize'] = [15, 5]

        df_tiktoks_by_tiktoker.iloc[0:30].plot(title="Number of trending TikToks by Author as of " + str(today_d) + "." + str(today_m) + "."+ str(today_y) , kind="bar" , x='authorNickname');
        plt.axhline(0, color="k");

#### Table: Number of trending TikToks by Author (today)
TikTokers (users) with the most TikToks (posts) among the trending TikToks.
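The counting done by the `$group` and `$sort` stages can be illustrated in plain Python. This is a sketch on invented documents that mimic the `author.id` and `author.nickname` fields, not the query the notebook runs.

```python
from collections import Counter

# Invented documents shaped like the relevant part of a trending entry
trending = [
    {"id": "1", "author": {"id": "u1", "nickname": "Alice"}},
    {"id": "2", "author": {"id": "u2", "nickname": "Bob"}},
    {"id": "3", "author": {"id": "u1", "nickname": "Alice"}},
]

# Group by (authorId, authorNickname) and count, like the $group stage
counts = Counter(
    (doc["author"]["id"], doc["author"]["nickname"]) for doc in trending
)

# most_common() sorts by count descending, like $sort followed by $limit
top_authors = [
    {"authorId": author_id, "authorNickname": nickname, "count": count}
    for (author_id, nickname), count in counts.most_common(30)
]
print(top_authors)
```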

In [None]:
if 'tiktok_analytics' in mode: 
    if 'authorId' in df_tiktoks_by_tiktoker.columns:
        df_tiktoks_by_tiktoker["profileUrl"] = '<div style="width: 200px"><a href="https://www.tiktok.com/@' + df_tiktoks_by_tiktoker["authorId"] + '?is_from_webapp=v1" target="_blank">TikTok Profile: ' +  df_tiktoks_by_tiktoker["authorNickname"] + ' </a></div>'
        df_tiktoks_by_tiktoker.drop('authorId', axis=1, inplace=True)

    display(HTML(df_tiktoks_by_tiktoker.head(10).to_html(escape=False)))
    

### Trending Music used in TikToks


In [None]:
pipeline = [
    {
        '$match': {
            'music.original': False
        }
    }, {
        '$replaceWith': {
            'id': '$id', 
            'music_title': '$music.title', 
            'music_id': '$music.id', 
            'music_cover': '$music.coverMedium', 
            'music_author_name': '$music.authorName', 
            'shareCount': '$stats.shareCount', 
            'commentCount': '$stats.commentCount', 
            'playCount': '$stats.playCount'
        }
    }, {
        '$group': {
            '_id': {
                'music_id': '$music_id', 
                'music_title': '$music_title', 
                'music_cover': '$music_cover',
                'music_author_name': '$music_author_name', 
            }, 
            'count': {
                '$sum': 1
            }
        }
    }, {
        '$sort': {
            'count': -1
        }
    }, {
        '$replaceWith': {
            'music_id': '$_id.music_id', 
            'music_title': '$_id.music_title', 
            'music_cover': '$_id.music_cover', 
            'music_author_name': '$_id.music_author_name',
            'count': '$count'
        }
    },{
            '$limit': 10
    }
]

df_trending_music = pd.DataFrame.from_dict( db.trending.aggregate(pipeline))



#### Table: Trending Music used in TikToks

In this part we analyze how often a music title is used in a trending TikTok.
About 50% of all TikToks use voice recordings or mix music with voice recordings. These TikToks are classified as "original sound" for which we don't get data about the music title. Hence, in this part, we only analyze the TikToks without original sound.
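The share of "original sound" TikToks could be measured as sketched below. The data is invented and the flat column names `music_original` and `music_title` are assumptions standing in for the nested `music.original` and `music.title` fields, so the resulting number says nothing about the real dataset.

```python
import pandas as pd

# Invented rows; True means the TikTok uses its own "original sound"
trending = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5"],
    "music_original": [True, False, True, False, False],
    "music_title": ["original sound", "Song A", "original sound", "Song A", "Song B"],
})

# The mean of a boolean column is the fraction of True values
original_share = trending["music_original"].mean()

# Title counts for the remaining TikTokers' posts, most used first
title_counts = trending.loc[~trending["music_original"], "music_title"].value_counts()

print(f"original sound share: {original_share:.0%}")
print(title_counts)
```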

In [None]:
df_trending_music["music_cover_html"] = '<img src="' + df_trending_music["music_cover"] +'" style="width:80px;height:80px;" >'
df_trending_music.drop('music_cover', axis=1, inplace=True)
display(HTML(df_trending_music.head(10).to_html(escape=False)))

## Top TikTokers (Worldwide)
### Top TikTokers by followers

In [None]:
if 'tiktok_analytics' in mode: 
    pipeline = [
        {
            '$replaceWith': {
                'username': '$basic.username', 
                'userId': '$basic.social_id', 
                'avatar_url': '$basic.avatar_url', 
                'followers': '$metrics.subscribers_count.value', 
                'country': '$features.blogger_geo.data.country'
            }
        }, {
            '$sort': {
                'followers': -1
            }
        }, {
             '$limit': 25
        }
    ]

    df_worldwide_toptiktokers = pd.DataFrame.from_dict( db.hypeauditor_toptiktoker.aggregate(pipeline))
    df_worldwide_toptiktokers['country'] = df_worldwide_toptiktokers['country'].fillna("unknown")
   
    df_worldwide_toptiktokers["avatar_url_img"] = '<img src="' + df_worldwide_toptiktokers["avatar_url"] +'" style="width:80px;height:80px;" >'
    df_worldwide_toptiktokers["profileLink"] = '<div style="width: 200px"><a href="https://www.tiktok.com/@' + df_worldwide_toptiktokers["userId"] + '?is_from_webapp=v1" target="_blank"> TikTok Profile: ' +  df_worldwide_toptiktokers["username"] + ' </a></div>'
    if 'userId' in df_worldwide_toptiktokers.columns:
        df_worldwide_toptiktokers.drop('userId', axis=1, inplace=True)
        df_worldwide_toptiktokers.drop('avatar_url', axis=1, inplace=True)
        



#### Table: Top TikTokers by followers

In [None]:
if 'tiktok_analytics' in mode: 
    display(HTML(df_worldwide_toptiktokers.head(10).to_html(escape=False)))

### Number of Top 1000 TikTokers by country
Which countries do the top 1000 TikTokers come from?

In [None]:
if 'tiktok_analytics' in mode: 
    pipeline = [
        {
            '$replaceWith': {
                'username': '$basic.username', 
                'userId': '$basic.social_id', 
                'avatar_url': '$basic.avatar_url', 
                'followers': '$metrics.subscribers_count.value', 
                'country': '$features.blogger_geo.data.country'
            }
        }, {
            '$sort': {
                'followers': -1
            }
        }, {
            '$group': {
                '_id': '$country', 
                'count': {
                    '$sum': 1
                }
            }
        }, {
            '$match': {
                '_id': {
                    '$exists': True, 
                    '$ne': None
                }
            }
        }, {
            '$sort': {
                'count': -1
            }
        }
    ]
    
    df_worldwide_toptiktokers_by_country_agg = pd.DataFrame.from_dict( db.hypeauditor_toptiktoker.aggregate(pipeline))
    df_worldwide_toptiktokers_by_country_agg

    


#### Barchart: Number of Top 1000 TikTokers by country

Most of the top 1000 TikTokers (by followers count) are from the US. 
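Whether the leading country holds a majority or only the largest single share can be checked by turning the counts into percentages. The sketch below does this with pandas on invented country codes; entries without a country are dropped, mirroring the `$match` stage of the pipeline.

```python
import pandas as pd

# Invented country codes; None stands for a TikToker without country data
countries = pd.Series(["US", "US", "BR", None, "US", "IN"], name="country")

# normalize=True returns fractions instead of absolute counts
share_pct = countries.dropna().value_counts(normalize=True).mul(100).round(1)

print(share_pct)
```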

In [None]:
if 'tiktok_analytics' in mode: 
    plt.figure();
    plt.rcParams['figure.figsize'] = [15, 5]

    df_worldwide_toptiktokers_by_country_agg.iloc[0:30].plot(title="Number of Top 1000 TikTokers by Country" , kind="bar" , x='_id');
    plt.axhline(0, color="k");

# Conclusion and Outlook



This analysis gives us a first insight into the world of TikTok and summarizes what is trending right now. However, to get deeper insights and identify trends early, we need to scrape many more TikToks around the world, using multiple scrapers with multiple users and IP locations, and also scrape the same TikToks every day. With this information, we could analyze what makes a TikTok go viral/trending. 


About 50% of all TikToks use voice recordings or mix music with voice recordings. These TikToks are classified as "original sound" for which we don't get data about the music title. AI could be used to identify this music and collect more data about the trending music titles.


Depending on the target audience of this tool, different information could be interesting:
* An influencer or a company with its own presence on TikTok could learn how to create a trending TikTok and build an audience more effectively.
* Marketers trying to sell a product could try to understand Gen Z's sense of humor and values to craft ads better suited to them.
* Marketers and product managers could try to use this tool to better understand Generation Z and incorporate the insights into their products and services.



# Remarks
## Learnings
