Cluster Analysis 

After the descriptive statistics and visualizations of users, authors, and works, we explore the data further with data mining methods.
For the TikTok platform itself, classifying or grading users in order to provide differentiated services is an important direction.
For business partners and advertisers, classifying authors and choosing which ones to collaborate with is an equally valuable problem.
In section 3.1, the k-means clustering algorithm is used to classify the two groups quantitatively, based on the data characteristics of users and authors.

In [1]:
import numpy as np
import pandas as pd

from pyecharts.charts import Line
from pyecharts import options as opts

from sklearn.cluster import KMeans
import joblib
from sklearn import metrics
from scipy.spatial.distance import cdist

1. Data reading and processing


1.1 Data Fetch

In [2]:
user_feature = pd.read_csv(
    'user_characteristics.csv',
    index_col=0,
    usecols=range(0, 10)
)

author_feature = pd.read_csv(
    'author_characteristics.csv',
    index_col=0,
    usecols=range(0, 10)
)

1.2 Data Processing

User clustering can support the platform's user grading and the exploratory analysis of user characteristics. However, the earlier data visualization analysis shows that some users barely use the platform: they have few views and no likes. Clustering such users is neither effective nor necessary, so a filter is added that keeps only users who have watched at least one short video to completion and have at least 5 page views.

In [3]:
user_data = user_feature[(user_feature['Complete_views']>=1)&(user_feature['Page_view']>=5)]
print(len(user_data)/len(user_feature))

0.7097514856834144


For authors, the clustering results serve business cooperation and advertising placement, where the core metric is page views.
Most authors have very few total page views and do not need to be considered, so only authors with at least one complete view and at least 3 total views are kept.

In [4]:
author_data = author_feature[(author_feature['Total_complete_views']>=1)&(author_feature['Total_views']>=3)]
print(len(author_data)/len(author_feature))

0.2990244347629775


2. Clustering methods and definitions

2.1 Clustering method

Key parameters:

init='k-means++'
For the k-means algorithm, the choice of the initial centers is crucial.
Purely random initialization may place the initial centers too close together, which leads to slow convergence and poor clustering results.
k-means++ improves the initialization by choosing the centers one at a time and favoring points that are far from the centers already chosen.
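
The seeding idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `kmeans_pp_init` and the toy data are hypothetical, and the library's own routine differs in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers from the rows of X with k-means++ style seeding."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First center: one sample chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    # Squared distance from every sample to its nearest chosen center.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Sample the next center with probability proportional to d2,
        # so far-away points are favored but not guaranteed.
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)

# Toy data: three well-separated 2-D blobs (synthetic, for illustration only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.1, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])
centers = kmeans_pp_init(X, k=3)
print(centers)
```

A point that is already a center has distance zero and therefore zero probability of being drawn again, which is what keeps the initial centers apart.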

n_clusters: number of clusters
The number of clusters is chosen by combining several indicators and applying the elbow method.

Evaluation

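SSE: sum of squared errors
The sum of squared Euclidean distances from each sample to the center of the cluster it is assigned to. Note that the km function in section 2.2 stores the mean (unsquared) distance from each sample to its nearest center under the key 'sse'; like the SSE, this value decreases as k grows, so it can be read with the same elbow method.

Silhouette coefficient (sc): combines the intra-cluster cohesion and the inter-cluster separation of a clustering. It takes values in [-1, 1], and the closer it is to 1, the better.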
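
To make the difference between these quantities concrete, the following sketch computes them on synthetic data. It assumes scikit-learn is installed; the toy blobs and variable names are illustrative and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three 2-D blobs (illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in (0.0, 4.0, 8.0)])

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Distance from each sample to every center, shape (n_samples, n_clusters).
dist = np.linalg.norm(X[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)
nearest = dist.min(axis=1)

sse = np.sum(nearest ** 2)            # sum of squared errors
mean_dist = nearest.mean()            # mean distance to the nearest center
sc = silhouette_score(X, labels)      # silhouette coefficient

print('SSE (manual):  ', sse)
print('SSE (inertia_):', kmeans.inertia_)
print('mean distance: ', mean_dist)
print('silhouette:    ', sc)
```

The manually computed SSE should agree with the fitted model's `inertia_` attribute up to floating-point error, while the mean distance is a different quantity on a different scale.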

2.2 Definition of Related Functions

Kmeans

In [5]:
def km(data, name):
    """Fit k-means for k = 2..9, save each model, and return the evaluation scores."""
    K = range(2, 10)  # candidate values of k
    X = data
    scores = {'sc': [], 'sse': []}
    for _k in K:
        # Initialize the model and perform clustering
        kmeans = KMeans(n_clusters=_k, init='k-means++', random_state=0)
        kmeans.fit(X)
        _y = kmeans.predict(X)  # cluster label of each sample
        # Mean Euclidean distance from each sample to its nearest center
        # (not squared; the true SSE is available as kmeans.inertia_)
        sse = sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0]
        sc = metrics.silhouette_score(X, _y)  # silhouette coefficient
        joblib.dump(kmeans, f'{name}{_k}cluster.model')
        # Store the evaluation values
        scores['sse'].append(sse)
        scores['sc'].append(sc)
        print(f'Calculation of cluster {_k} class completed', end='\t')
    joblib.dump(scores, f'{name}cluster index.score')
    print('Indicator storage completed')
    return scores

Draw sse and sc curves

In [6]:
def draw(k, sse, sc):
    chart = (
        Line(init_opts=opts.InitOpts(
            theme='dark',
            width='400px',
            height='400px'
        ))
        .add_xaxis(k)
        .add_yaxis('sse', sse, yaxis_index=0, label_opts=opts.LabelOpts(is_show=False))
        .add_yaxis('sc', sc, yaxis_index=1, label_opts=opts.LabelOpts(is_show=False))
        .extend_axis(yaxis=opts.AxisOpts())
        .set_global_opts(
            title_opts=opts.TitleOpts(title='Clustering effect'),
            xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=True),
            yaxis_opts=opts.AxisOpts(
                type_="value",
                axistick_opts=opts.AxisTickOpts(is_show=True),
                splitline_opts=opts.SplitLineOpts(is_show=True),
            ),
        )
    )
    return chart

3. User feature clustering

3.1 Model Training and Saving

In [7]:
user_score = km(user_data, 'user ')

Calculation of cluster 2 class completed	Calculation of cluster 3 class completed	Calculation of cluster 4 class completed	Calculation of cluster 5 class completed	Calculation of cluster 6 class completed	Calculation of cluster 7 class completed	Calculation of cluster 8 class completed	Calculation of cluster 9 class completed	Indicator storage completed


3.2 Clustering k-value selection

In [8]:
user_score = joblib.load('user cluster index.score')
draw([str(x) for x in range(2,10)], user_score['sse'], user_score['sc']).render_notebook()

Combining the elbow method with the sc value, k=4 is selected for the user clustering model.
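
The elbow is read off the chart by eye here. As a rough cross-check, one possible heuristic is to pick the k at which the curve bends most, measured by the largest second difference. The helper `elbow_k` below is hypothetical and the score values are made up for illustration; they are not the notebook's actual scores.

```python
import numpy as np

def elbow_k(ks, sse):
    """Return the k where the curve bends most (largest second difference)."""
    sse = np.asarray(sse, dtype=float)
    # Second difference sse[i-1] - 2*sse[i] + sse[i+1], defined for interior points.
    curvature = sse[:-2] - 2 * sse[1:-1] + sse[2:]
    # Offset by 1 because curvature[0] belongs to ks[1].
    return ks[int(np.argmax(curvature)) + 1]

# Made-up values for k = 2..9 (NOT the scores produced by km()).
ks = list(range(2, 10))
sse = [100.0, 60.0, 30.0, 25.0, 21.0, 18.0, 16.0, 15.0]
print(elbow_k(ks, sse))  # 4
```

Such a heuristic is only a sanity check; it ignores the silhouette coefficient, which the selection above also takes into account.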

3.3 Clustering Results

In [9]:
user_km = joblib.load('user 4cluster.model')
user_centers = pd.DataFrame(user_km.cluster_centers_, columns=user_feature.columns)
user_centers['Number of people']=pd.Series(user_km.predict(user_data)).value_counts()
user_centers

Cluster,Page_view,Like_count,Number_of_authors_viewed,Number_of_viewed_works,Average_duration_of_viewing_works,Number_of_background_music_views,Complete_views,Number_of_cities_visited,Number_of_citzes_viewing_works,Number of people
0,16.891674,0.191183,16.609704,16.89154,11.314563,16.02305,8.166387,1.170956,14.647672,29740
1,163.284336,1.268519,155.473765,163.282407,11.079584,132.366898,54.16821,1.295525,88.876929,2595
2,66.012507,0.655094,64.052183,66.011644,11.174247,58.415849,27.607763,1.2731,47.351375,9294
3,382.953771,3.038929,354.287105,382.948905,10.968649,277.233577,92.829684,1.321168,141.182482,411


This model can be applied to assess user value. For example, the first type of users (cluster 0) has relatively few page views, likes, and complete views. These users tend to pay more attention to the first half of a video, so their interests can be judged from how long they stay on it. However, their usage time is relatively long, which reflects product dependence, and to some extent they can be regarded as core users. For this group, dwell time can be used to infer preferences and tune the recommendation algorithm, with a focus on content that is highly appealing in its first half.
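
Assigning a user to one of these clusters amounts to finding the nearest center by Euclidean distance, which is what `KMeans.predict` does. The sketch below reproduces that step with NumPy, using the centers from the table above rounded to two decimals. The function `assign_cluster` and the three users are hypothetical, with made-up feature values.

```python
import numpy as np

# User cluster centers, rounded from the table above
# (same column order as the table, without the count column).
centers = np.array([
    [16.89, 0.19, 16.61, 16.89, 11.31, 16.02, 8.17, 1.17, 14.65],
    [163.28, 1.27, 155.47, 163.28, 11.08, 132.37, 54.17, 1.30, 88.88],
    [66.01, 0.66, 64.05, 66.01, 11.17, 58.42, 27.61, 1.27, 47.35],
    [382.95, 3.04, 354.29, 382.95, 10.97, 277.23, 92.83, 1.32, 141.18],
])

def assign_cluster(users, centers):
    """Return the index of the nearest center for each row of users."""
    dist = np.linalg.norm(users[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

# Three hypothetical users (made-up feature values).
new_users = np.array([
    [70, 1, 68, 70, 11, 60, 30, 1, 50],
    [20, 0, 20, 20, 12, 19, 9, 1, 17],
    [400, 5, 370, 400, 10, 290, 100, 2, 150],
], dtype=float)

labels = assign_cluster(new_users, centers)
print(labels)  # [2 0 3]
```

Because the features are not rescaled, the view-count columns dominate the distance, so the assignment is driven mostly by how much a user watches.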


4. Author feature clustering

4.1 Model Training and Saving

In [10]:
author_score = km(author_data, 'author ')

Calculation of cluster 2 class completed	Calculation of cluster 3 class completed	Calculation of cluster 4 class completed	Calculation of cluster 5 class completed	Calculation of cluster 6 class completed	Calculation of cluster 7 class completed	Calculation of cluster 8 class completed	Calculation of cluster 9 class completed	Indicator storage completed


4.2 Clustering k-value selection

In [11]:
author_score = joblib.load('author cluster index.score')
draw([str(x) for x in range(2,10)], author_score['sse'], author_score['sc']).render_notebook()

Combining the elbow method with the sc value, k=4 is selected for the author clustering model.

4.3 Clustering Results

In [12]:
author_km = joblib.load('author 4cluster.model')
author_centers = pd.DataFrame(author_km.cluster_centers_, columns=author_feature.columns)
author_centers['Number of people'] = pd.Series(author_km.predict(author_data)).value_counts()
author_centers

Cluster,Total_views,Total_likes,Total_complete_views,Total_number_of_works,Average_duration_of_works,Number_of_background_music_used,Number_of_days_since_the_publication_of_the_work,Creative_activity_(daily),Number_of_cities_visited,Number of people
0,12.474092,0.121136,5.241201,3.696408,10.785327,3.239715,3.69627,11.385818,1.115037,57873
1,426.856624,3.865699,185.480944,21.119782,11.137147,14.341198,21.119782,24.655172,1.337568,551
2,131.20582,1.271757,57.975707,12.973038,10.995015,9.791778,12.971436,21.422851,1.233582,3750
3,1158.822785,8.962025,490.721519,32.810127,11.129757,21.835443,32.810127,28.443038,1.329114,79


Summary:

The clustering results are readily interpretable: the clusters are separated mainly by view volume, so the model provides a quantitative classification based on these data characteristics.

This model can also help authors improve their creative efficiency. The table above shows that authors with high total views, likes, and complete views usually use more background music, publish more works, and visit more cities than other authors, which suggests that authors need to accumulate substantial creative experience to produce more popular videos.
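
One way to look at "creative efficiency" more directly is to derive per-cluster ratios from the author centers in the table above. The ratios below (views per work, completion rate, like rate) are an assumption about what efficiency could mean here; they are not part of the original analysis.

```python
import numpy as np

# Author cluster centers copied from the table above (clusters 0..3).
total_views = np.array([12.474092, 426.856624, 131.20582, 1158.822785])
total_likes = np.array([0.121136, 3.865699, 1.271757, 8.962025])
complete_views = np.array([5.241201, 185.480944, 57.975707, 490.721519])
num_works = np.array([3.696408, 21.119782, 12.973038, 32.810127])

views_per_work = total_views / num_works        # average views earned per work
completion_rate = complete_views / total_views  # share of views watched to the end
like_rate = total_likes / total_views           # likes per view

# Print the clusters from the lowest to the highest total views.
order = np.argsort(total_views)
for c in order:
    print(f'cluster {c}: views/work={views_per_work[c]:.2f}, '
          f'completion={completion_rate[c]:.3f}, like rate={like_rate[c]:.4f}')
```

On these centers, views per work rise from about 3.4 in the smallest cluster to about 35.3 in the largest, so the high-view clusters do not simply publish more: each work also attracts more views.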