# Project Overview and Purpose

This project comes from the Google Advanced Data Analytics program.

We aim to conduct a data analytics project for TikTok. As part of this project, we want to determine whether there is a statistically significant difference in the number of views between TikTok videos posted by verified accounts and those posted by unverified accounts.

In [90]:
# import libraries

import numpy as np
import pandas as pd
import scipy.stats  # import the stats submodule explicitly so scipy.stats.ttest_ind is always available
import seaborn as sns

In [98]:
TikTok = pd.read_csv("C:\\Users\\Amirhossein Hosseini\\OneDrive - Queen's University\\Coursera_Google_Advanced_Data_Analytics_Professional\\tiktok_dataset.csv")


In [99]:
TikTok.head(10)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0




`claim_status`: Whether the published video has been identified as an “opinion” or a “claim.” In this dataset, an “opinion” refers to an individual’s or group’s personal beliefs or thoughts. A “claim” refers to information that is either unsourced or from an unverified source.

`video_id`: Random identifying number assigned to a video upon publication on TikTok.

`video_duration_sec`: Length of the published video, in seconds.

`video_transcription_text`: Transcribed text of the words spoken in the published video.

`verified_status`: Indicates the status of the TikTok user who published the video in terms of their verification, either “verified” or “not verified.” 

`author_ban_status`: Indicates the status of the TikTok user who published the video in terms of their permissions: “active,” “under review,” or “banned.”

`video_view_count`: The total number of times the published video has been viewed. 

`video_like_count`: The total number of times the published video has been liked by other users. 

`video_share_count`: The total number of times the published video has been shared by other users. 

`video_download_count`: The total number of times the published video has been downloaded by other users. 

`video_comment_count`: The total number of comments on the published video. 

In [100]:
TikTok.describe(include='all')

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19084,19382.0,19382.0,19084,19382,19382,19084.0,19084.0,19084.0,19084.0,19084.0
unique,,2,,,19012,2,3,,,,,
top,,claim,,,a friend read in the media a claim that badmi...,not verified,active,,,,,
freq,,9608,,,2,18142,15663,,,,,
mean,9691.5,,5627454000.0,32.421732,,,,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,,2536440000.0,16.229967,,,,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,,1234959000.0,5.0,,,,20.0,0.0,0.0,0.0,0.0
25%,4846.25,,3430417000.0,18.0,,,,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,,5618664000.0,32.0,,,,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,,7843960000.0,47.0,,,,504327.0,125020.0,18222.0,1156.25,292.0


In [101]:
TikTok.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [111]:
# count the missing values in each column
TikTok.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [113]:
TikTok = TikTok.dropna(axis=0)

In [114]:
TikTok.groupby('verified_status')['video_view_count'].mean()

verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64

Based on the averages shown, videos from unverified accounts appear to receive more views on average. However, this difference might arise from random sampling rather than reflect a true difference in view counts. To assess whether the difference is statistically significant, we conduct a hypothesis test.
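Before testing, it can help to look at group sizes and medians alongside the means, since a few very popular videos can pull a mean upward. The following is a minimal sketch on toy data: the values are made up for illustration and only the column names match the dataset.

```python
import pandas as pd

# Toy data with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "verified_status": ["not verified"] * 4 + ["verified"] * 3,
    "video_view_count": [900000.0, 5000.0, 400000.0, 7000.0,
                         3000.0, 250000.0, 8000.0],
})

# Group size, mean, and median of view counts for each verification status.
summary = (
    toy.groupby("verified_status")["video_view_count"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

A large gap between the mean and the median within a group is a hint that the distribution is skewed.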

# A/B test
We conduct an A/B test using a two-sample t-test at the 5% significance level.

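`Null hypothesis`: There is no difference in the average number of views between videos posted by verified accounts and videos posted by unverified accounts.

`Alternative hypothesis`: There is a difference in the average number of views between videos posted by verified accounts and videos posted by unverified accounts.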

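Note that because both groups are large, the central limit theorem lets us treat the sampling distribution of each group mean as approximately normal, even though the view counts themselves are skewed. We also pass `equal_var=False`, so the test is Welch's t-test, which does not assume that the two groups have equal variances.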
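To make the test statistic concrete, the sketch below computes the unequal-variance (Welch) t-statistic by hand and compares it with `scipy.stats.ttest_ind(..., equal_var=False)`. The two samples are synthetic and hypothetical (random lognormal draws), not the TikTok data.

```python
import numpy as np
from scipy import stats

# Synthetic, hypothetical samples: skewed values with unequal group sizes.
rng = np.random.default_rng(0)
a = rng.lognormal(mean=10.0, sigma=1.5, size=2000)
b = rng.lognormal(mean=9.5, sigma=1.2, size=300)

# Standard error of the difference in means, using each group's own
# sample variance (ddof=1) instead of a pooled variance.
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)

# Welch's t-statistic: difference in sample means divided by that standard error.
t_manual = (a.mean() - b.mean()) / se

# The same statistic from SciPy, plus the two-sided p-value.
result = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, result.statistic, result.pvalue)
```

The manual value and `result.statistic` should agree up to floating-point error.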

In [115]:
scipy.stats.ttest_ind(TikTok[TikTok['verified_status'] == 'not verified']['video_view_count'].dropna(),
                      TikTok[TikTok['verified_status'] == 'verified']['video_view_count'].dropna(),
                      equal_var=False)

Ttest_indResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120)

# Results and recommendations

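The p-value (about 2.6e-120) is far below the 0.05 significance level, so we reject the null hypothesis and conclude that there **is** a statistically significant difference in the mean video view count between verified and unverified accounts on TikTok. In this sample, unverified accounts have the higher mean view count.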
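The decision rule can be written out explicitly. This is a small sketch: the helper `decide` is a hypothetical name introduced here for illustration, and the p-value is the one reported in the test output above.

```python
ALPHA = 0.05  # significance level chosen for the A/B test
p_value = 2.6088823687177823e-120  # p-value reported by the t-test output

def decide(p_value: float, alpha: float = ALPHA) -> str:
    """Return the hypothesis-test decision for a given p-value."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

decision = decide(p_value)
print(decision)
```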


It would be interesting to investigate the root cause of this behavioral difference. For example, do unverified accounts tend to post more clickbait-y videos? Or are unverified accounts associated with spam bots that help inflate view counts?

The next step will be to build a regression model on `verified_status`. A regression model is the natural next step because the end goal is to make predictions on claim status, and a model for `verified_status` can help analyze how verified and unverified users behave. Technical note for preparing the model: `verified_status` is a binary outcome, the data is skewed, and the two account types differ significantly in view counts, so it will be key to build a logistic regression model.
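As a rough illustration of the skewness mentioned in the technical note, the sketch below measures the skew of a synthetic, hypothetical view-count series before and after a log transform. The lognormal data and its parameters are assumptions chosen only to mimic a long right tail; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical view counts with a long right tail.
rng = np.random.default_rng(42)
views = pd.Series(rng.lognormal(mean=9.0, sigma=2.0, size=5000),
                  name="video_view_count")

# Sample skewness of the raw values and of log(1 + x).
raw_skew = views.skew()
log_skew = np.log1p(views).skew()
print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
```

If the real view counts behave similarly, a log transform is one option to consider when preparing features for the model.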

