# XGBoost Predictive Model for TikTok's Claim Classification: EDA, Hypothesis Testing, Logistic Regression, Tree-Based Models

### 3rd Jupyter Notebook -- Conducting Statistical Test 

### *Data exploration and hypothesis testing*

**Questions:**

1) Do videos from verified accounts and videos from unverified accounts have different average view counts?
2) Is there a relationship between the account being verified and the associated videos' view counts?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
import statsmodels.api as sm

In [2]:
data = pd.read_csv("tiktok-dataset.csv")

In [3]:
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [4]:
# Descriptive statistics table
data.describe(include = np.number)

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


In [5]:
# Checking for missing values
data.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [6]:
# Dropping rows with missing values
data_V2 = data.dropna(axis = 0)

In [7]:
data_V2.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [8]:
# Mean 'video_view_count' for each verified_status group
data_V2.groupby(['verified_status'])['video_view_count'].mean()

verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64

### **Stating the Hypotheses:**

**H0:** There is no difference in mean view count between videos posted by verified accounts and videos posted by unverified accounts. Any difference observed in the sample data is due to chance or sampling variability.

**HA:** There is a difference in mean view count between videos posted by verified accounts and videos posted by unverified accounts. The difference observed in the sample reflects an actual difference in the corresponding population means.

**Note:**
The significance level is set to 5% (α = 0.05) before running the test.
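In symbols, writing $\mu_{\text{verified}}$ and $\mu_{\text{unverified}}$ for the population mean view counts of the two groups (notation introduced here for illustration only), the two-sided test at the 5% significance level can be stated as:

```latex
\begin{aligned}
H_0 &: \mu_{\text{verified}} = \mu_{\text{unverified}} \\
H_A &: \mu_{\text{verified}} \neq \mu_{\text{unverified}} \\
\text{Reject } H_0 &\text{ if } p < \alpha = 0.05
\end{aligned}
```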

In [9]:
# Split view counts by verification status for a two-sample t-test
data_verif = data_V2[data_V2['verified_status'] == 'verified']['video_view_count']
data_unverif = data_V2[data_V2['verified_status'] == 'not verified']['video_view_count']

# Compare sample variances to judge whether the equal-variance assumption is reasonable
print(data_verif.var(), data_unverif.var())
print(data_unverif.var() / data_verif.var())

48902353353.169945 106068688207.418
2.168989443951626


In [10]:
# Student's t-test (pooled variance, assumes equal variances)
stats.ttest_ind(a = data_unverif, b = data_verif)

TtestResult(statistic=18.250939509545823, pvalue=8.632160883925904e-74, df=19082.0)

In [11]:
# Welch's t-test (does not assume equal variances)
stats.ttest_ind(a = data_unverif, b = data_verif, equal_var = False)

TtestResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)
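To make the `equal_var = False` option less of a black box, here is a minimal sketch of the arithmetic behind Welch's t statistic and its Welch-Satterthwaite degrees of freedom. The helper name `welch_t` and the toy samples are hypothetical and are not part of the notebook or the TikTok dataset; the sketch only illustrates the formula and does not compute a p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance uses the n - 1 denominator, like pandas' Series.var()
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)

    # Squared standard error contributed by each group
    se_a, se_b = var_a / n_a, var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, df


# Toy data for illustration only (not from the TikTok dataset)
toy_a = [1, 2, 3, 4, 5]
toy_b = [10, 12, 14, 16, 18]
t_stat, df = welch_t(toy_a, toy_b)
print(round(t_stat, 3), round(df, 3))
```

Because the degrees of freedom depend on both variances and both group sizes, they are generally not an integer, which is why the Welch output above reports a fractional `df`.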

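**Remarks:**
The group variances differ by a factor of about 2.17, so the Welch result (`equal_var = False`) is the more appropriate of the two tests. Since both p-values are far smaller than the 0.05 significance level:
- **H0 is rejected.**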
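The decision rule applied here can be written out explicitly. The sketch below hard-codes the two p-values reported by the `ttest_ind` calls and compares each against the 5% significance level; the variable names are illustrative and do not appear in the notebook.

```python
ALPHA = 0.05  # 5% significance level chosen before running the test

# p-values copied from the two ttest_ind outputs
p_values = {
    "student (equal_var=True)": 8.632160883925904e-74,
    "welch (equal_var=False)": 2.6088823687177823e-120,
}

# Reject H0 whenever the p-value falls below the significance level
decisions = {name: p < ALPHA for name, p in p_values.items()}
for name, reject in decisions.items():
    print(f"{name}: {'reject H0' if reject else 'fail to reject H0'}")
```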

**Conclusion:**

1) There is a **statistically significant difference in mean view count between videos posted by verified accounts and videos posted by unverified accounts**: in this sample, videos from unverified accounts average about 265,664 views, versus about 91,439 for verified accounts.
This suggests key behavioral differences between the two account groups.

2) Investigating the cause of this difference is recommended.
For instance, are unverified accounts more likely to post clickbait videos? Or are unverified accounts associated with spam bots that inflate their view counts?

3) Since the end goal is a machine learning model, the next step is to build a regression model.
A regression model for verified_status will help analyze user behavior in this group.

*Technical note:* because the data is skewed and the account types differ significantly, building a logistic regression model will be essential.