Construction of characteristic indicators

In [1]:
import pandas as pd
import numpy as np

Dataset description:
uid - user ID
user_city - city where the user is located
item_id - work ID
author_id - author ID
item_city - city where the author is located
channel - channel of the work
finish - whether the user finished watching the work
like - whether the user liked the work
music_id - music ID
duration_time - duration of the work
real_time - exact release time
H, date - hour and day of release

The columns of the browsing behavior data fall into the following simple categories:

User Information:
uid, user_city

Item Information:
item_id, item_city, channel, music_id, duration_time, real_time, H, date

Author Information:
author_id

Behavior Descriptions:
finish, like

Beyond these columns, the browsing records also let us abstract entities such as users, items, authors, music, and cities. This project focuses on simple analyses from the perspectives of users, authors, and items, using basic data analysis methods.
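The grouping above can be written down as a small lookup table. The following is only a sketch: `COLUMN_GROUPS`, `select_group`, and the tiny `sample` frame are hypothetical names and made-up values introduced for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical mapping from entity to the columns that describe it.
COLUMN_GROUPS = {
    "user": ["uid", "user_city"],
    "item": ["item_id", "item_city", "channel", "music_id",
             "duration_time", "real_time", "H", "date"],
    "author": ["author_id"],
    "behavior": ["finish", "like"],
}


def select_group(frame: pd.DataFrame, group: str) -> pd.DataFrame:
    """Return only the columns that belong to the named group."""
    return frame[COLUMN_GROUPS[group]]


# Made-up rows that only reuse the dataset's column names.
sample = pd.DataFrame({
    "uid": [1, 1, 2], "user_city": [10.0, 10.0, 20.0],
    "item_id": [100, 101, 100], "item_city": [5.0, 6.0, 5.0],
    "channel": [0, 0, 0], "music_id": [7.0, 8.0, 7.0],
    "duration_time": [10, 9, 10],
    "real_time": ["2019-10-28 21:55:10", "2019-10-21 22:27:03",
                  "2019-10-28 21:55:10"],
    "H": [21, 22, 21],
    "date": ["2019-10-28", "2019-10-21", "2019-10-28"],
    "author_id": [50, 51, 50], "finish": [0, 1, 1], "like": [0, 0, 1],
})

behavior = select_group(sample, "behavior")
# Every column should belong to exactly one group.
all_grouped = sorted(c for cols in COLUMN_GROUPS.values() for c in cols)
```

Keeping the mapping in one place makes it easy to check that no column is missing from, or duplicated across, the categories.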

1. Simple data processing

In [2]:
#Simple data processing
df = pd.read_csv('douyin_dataset.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,uid,user_city,item_id,author_id,item_city,channel,finish,like,music_id,duration_time,real_time,H,date
0,3,15692,109.0,691661,18212,213.0,0,0,0,11513.0,10,2019-10-28 21:55:10,21,2019-10-28
1,5,44071,80.0,1243212,34500,68.0,0,0,0,1274.0,9,2019-10-21 22:27:03,22,2019-10-21
2,16,10902,202.0,3845855,634066,113.0,0,0,0,762.0,10,2019-10-26 00:38:51,0,2019-10-26
3,19,25300,21.0,3929579,214923,330.0,0,0,0,2332.0,15,2019-10-25 20:36:25,20,2019-10-25
4,24,3656,138.0,2572269,182680,80.0,0,0,0,238.0,9,2019-10-21 20:46:29,20,2019-10-21


In [3]:
del df['Unnamed: 0']
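An `Unnamed: 0` column usually appears when a CSV was written together with its row index and then read back without declaring that index. The sketch below, which uses a made-up in-memory CSV rather than `douyin_dataset.csv`, shows two common ways of handling it; whether either applies to the real file is an assumption.

```python
import io

import pandas as pd

# Made-up CSV text whose first header cell is empty, as happens when a
# DataFrame is saved together with its index.
csv_text = ",uid,like\n0,15692,0\n1,44071,0\n"

raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: drop every column whose name starts with "Unnamed".
cleaned = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

# Option 2: declare the first column as the index while loading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
```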

In [4]:
df.info(show_counts=True)  # Basic information about the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1737312 entries, 0 to 1737311
Data columns (total 13 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   uid            1737312 non-null  int64  
 1   user_city      1737312 non-null  float64
 2   item_id        1737312 non-null  int64  
 3   author_id      1737312 non-null  int64  
 4   item_city      1737312 non-null  float64
 5   channel        1737312 non-null  int64  
 6   finish         1737312 non-null  int64  
 7   like           1737312 non-null  int64  
 8   music_id       1737312 non-null  float64
 9   duration_time  1737312 non-null  int64  
 10  real_time      1737312 non-null  object 
 11  H              1737312 non-null  int64  
 12  date           1737312 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 172.3+ MB
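The summary shows `real_time` and `date` stored as `object` (strings). A minimal sketch, on two made-up rows in the same format, of converting them to datetimes so that date arithmetic and the `.dt` accessor become available:

```python
import pandas as pd

# Made-up rows that follow the same timestamp format as the dataset.
sample = pd.DataFrame({
    "real_time": ["2019-10-28 21:55:10", "2019-10-26 00:38:51"],
    "date": ["2019-10-28", "2019-10-26"],
})

sample["real_time"] = pd.to_datetime(sample["real_time"])
sample["date"] = pd.to_datetime(sample["date"])

# The hour can then be derived directly from the timestamp.
hours = sample["real_time"].dt.hour.tolist()
```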


2. Construction of characteristic indicators

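User indicators include: number of views, number of likes, number of complete views, number of works viewed, number of authors viewed, average duration of the works viewed, number of background music tracks in the works viewed, number of cities the user was recorded in, number of cities of the works viewed, and the user's usual city.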
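Several of these per-user indicators can be computed in one pass with pandas named aggregation. This is a sketch on made-up data; the frame `views` and the output column names are illustrative assumptions, not the names used later in the notebook.

```python
import pandas as pd

# Made-up browsing records: one row per view.
views = pd.DataFrame({
    "uid": [1, 1, 1, 2],
    "item_id": [100, 101, 100, 100],
    "author_id": [50, 51, 50, 50],
    "like": [0, 1, 0, 1],
    "finish": [1, 0, 1, 1],
    "duration_time": [10, 20, 10, 10],
})

# Each keyword is an output column: (source column, aggregation).
user_stats = views.groupby("uid").agg(
    views=("like", "count"),
    likes=("like", "sum"),
    complete_views=("finish", "sum"),
    works_viewed=("item_id", "nunique"),
    authors_viewed=("author_id", "nunique"),
    avg_duration=("duration_time", "mean"),
)
```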

Author indicators include: creative activity (the span in days between the first and last recorded dates), number of cities visited, number of days on which works were published, number of music tracks used, total views, total likes, total complete views, total number of works, average duration of works, and the author's usual city.
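Two of the author indicators are easy to confuse: the activity span counts the calendar days between the first and last date, while the number of publishing days counts only distinct dates. A small sketch on made-up data (the frame `works` is hypothetical):

```python
import pandas as pd

# Made-up records: author 50 appears on two distinct dates, author 51 on one.
works = pd.DataFrame({
    "author_id": [50, 50, 50, 51],
    "date": ["2019-10-21", "2019-10-25", "2019-10-25", "2019-10-28"],
})
works["date"] = pd.to_datetime(works["date"])

per_author = works.groupby("author_id")["date"]

# Inclusive span between first and last date, in days.
active_span_days = (per_author.max() - per_author.min()).dt.days + 1

# Number of distinct dates on which the author appears.
publishing_days = per_author.nunique()
```

For author 50 the span covers 21 to 25 October inclusive, while only two of those days carry a record.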

Work indicators include: number of likes, number of views, publishing city, and background music.

3. Statistical analysis of characteristic indicators

3.1 Statistical analysis of user characteristics

In [5]:
grouped = df.groupby('uid')  # Group the browsing records by user once and reuse the grouping
user_df = pd.DataFrame(index=grouped.size().index)  # One row per uid; later columns align on this index
user_df['Page_view'] = grouped['like'].count()  # Number of views per user
user_df['Like_count'] = grouped['like'].sum()  # Number of likes per user
user_df['Number_of_authors_viewed'] = grouped['author_id'].nunique()  # Distinct authors viewed
user_df['Number_of_viewed_works'] = grouped['item_id'].nunique()  # Distinct works viewed
user_df['Average_duration_of_viewing_works'] = grouped['duration_time'].mean()  # Average duration of the works viewed
user_df['Number_of_background_music_views'] = grouped['music_id'].nunique()  # Distinct background music tracks in the works viewed
user_df['Complete_views'] = grouped['finish'].sum()  # Number of works watched to the end
user_df['Number_of_cities_visited'] = grouped['user_city'].nunique()  # Distinct cities the user was recorded in
user_df['Number_of_cities_viewing_works'] = grouped['item_city'].nunique()  # Distinct cities of the works viewed
# Most frequent city for each uid (the city where the user is usually located)
user_city_mode = grouped['user_city'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
user_df['City_where_the_user_is_located'] = user_city_mode  # Add to user_df
user_df.to_csv('user_characteristics.csv', encoding='utf_8_sig')  # Save the results
user_df.describe()  # Summary statistics; placed last so the notebook displays them
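The usual city is taken as the mode of `user_city`. `Series.mode()` returns all tied values in sorted order, so taking `.iloc[0]` silently resolves ties in favor of the smallest city ID. A minimal sketch on made-up data; `most_frequent` is a hypothetical helper equivalent to the lambda used in the cell:

```python
import pandas as pd


def most_frequent(values: pd.Series):
    """Return the most frequent value, or None if there is none."""
    modes = values.mode()  # sorted; contains several values on a tie
    return modes.iloc[0] if not modes.empty else None


# Made-up records: user 1 has a clear mode, user 2 has a tie.
visits = pd.DataFrame({
    "uid": [1, 1, 1, 2, 2],
    "user_city": [109.0, 80.0, 109.0, 202.0, 21.0],
})

home_city = visits.groupby("uid")["user_city"].agg(most_frequent)
```

If ties matter for the analysis, a different rule (for example the most recent city) would have to be chosen explicitly.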

3.2 Statistical analysis of author characteristics

In [6]:
# Ensure that duration_time is numeric, then fill any unparseable values with the column mean
df['duration_time'] = pd.to_numeric(df['duration_time'], errors='coerce')
df['duration_time'] = df['duration_time'].fillna(df['duration_time'].mean())

#Constructing an Author Feature Table
author_df = pd.DataFrame()
author_df['Total_views'] = df.groupby('author_id')['like'].count()
author_df['Total_likes'] = df.groupby('author_id')['like'].sum()
author_df['Total_complete_views'] = df.groupby('author_id')['finish'].sum()
author_df['Total_number_of_works'] = df.groupby('author_id')['item_id'].nunique()

#Calculate the average duration of the work
item_time = df.groupby(['author_id', 'item_id'])['duration_time'].mean().reset_index()
author_df['Average_duration_of_works'] = item_time.groupby('author_id')['duration_time'].mean()

author_df['Number_of_background_music_used'] = df.groupby('author_id')['music_id'].nunique()
# Count distinct dates (days), not distinct real_time timestamps
author_df['Number_of_days_since_the_publication_of_the_work'] = df.groupby('author_id')['date'].nunique()

#Calculate creative activity: inclusive span in days between each author's first and last date
author_days = df.groupby('author_id')['date']
date_diff = pd.to_datetime(author_days.max()) - pd.to_datetime(author_days.min())
author_df['Creative_activity_(daily)'] = date_diff.dt.days + 1

#Number of cities visited
author_df['Number_of_cities_visited'] = df.groupby('author_id')['item_city'].nunique()

#Obtain the city where each author most frequently appears in the work (author's location)
author_city_mode = df.groupby('author_id')['item_city'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
author_df['City_where_the_author_is_located'] = author_city_mode

#Save Results
author_df.to_csv('author_characteristics.csv', encoding='utf_8_sig')
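The average duration of an author's works is computed in two steps (first per work, then per author) because the raw table has one row per view. Averaging the raw rows directly would weight each work by how often it was viewed. A small sketch on made-up data illustrating the difference:

```python
import pandas as pd

# Made-up records: work 100 (10 s) was viewed three times, work 101 (30 s) once.
views = pd.DataFrame({
    "author_id": [50, 50, 50, 50],
    "item_id": [100, 100, 100, 101],
    "duration_time": [10, 10, 10, 30],
})

# Averaging rows directly weights each work by its number of views.
per_view_mean = views.groupby("author_id")["duration_time"].mean()

# Collapsing to one row per work first gives every work equal weight.
per_item = (
    views.groupby(["author_id", "item_id"])["duration_time"]
    .mean()
    .reset_index()
)
per_work_mean = per_item.groupby("author_id")["duration_time"].mean()
```

Here the per-view average is 15 while the per-work average is 20, which is the quantity the indicator is meant to describe.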

3.3 Statistical analysis of work features

In [7]:
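item_df = pd.DataFrame()
item_df['item_id'] = df.groupby('item_id')['like'].count().index.tolist()  # Every work ID as the item_id column
item_df.set_index('item_id', inplace=True)  # Use item_id as the index so later columns align automatically
item_df['Page_view'] = df.groupby('item_id')['like'].count()  # Number of views per work
item_df['Like_count'] = df.groupby('item_id')['like'].sum()  # Number of likes per work
# City and music IDs are labels, so take the recorded value rather than averaging the IDs
item_df['Publish_city'] = df.groupby('item_id')['item_city'].first()
item_df['Background_music'] = df.groupby('item_id')['music_id'].first()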

item_df.to_csv('features_of_the_work.csv', encoding='utf_8_sig')
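City and music IDs are labels rather than quantities, so a per-work aggregate of them is only meaningful if every row of a work carries the same value. The sketch below, on made-up data, checks that assumption and contrasts averaging the IDs with taking the first recorded value; whether the real dataset contains such inconsistent works is not known here.

```python
import pandas as pd

# Made-up records: work 100 appears with two different city IDs.
views = pd.DataFrame({
    "item_id": [100, 100, 101],
    "item_city": [213.0, 68.0, 113.0],
})

by_item = views.groupby("item_id")["item_city"]

by_mean = by_item.mean()    # averages the labels: not a real city ID on conflict
by_first = by_item.first()  # keeps a value that actually occurs in the data

# Works recorded with more than one distinct city deserve a closer look.
n_cities = by_item.nunique()
inconsistent = n_cities[n_cities > 1].index.tolist()
```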