# TikTok Capstone Project

TikTok leadership has asked for one more item to be added to the initial project scope: determine whether there is a statistically significant difference between verified and unverified accounts. Which variable would be most insightful to test for these two groups?

### Tasks

* Compute descriptive statistics on the claims classification data
* Conduct a two-sample hypothesis test of verified versus unverified accounts in terms of video view counts

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [3]:
tik_tok_raw = pd.read_csv("./Raw_data/tiktok_dataset.csv")
tik_tok_raw.head(12)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


In [4]:
tik_tok_raw.describe(include="all")

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19084,19382.0,19382.0,19084,19382,19382,19084.0,19084.0,19084.0,19084.0,19084.0
unique,,2,,,19012,2,3,,,,,
top,,claim,,,a colleague read in the media that butterflie...,not verified,active,,,,,
freq,,9608,,,2,18142,15663,,,,,
mean,9691.5,,5627454000.0,32.421732,,,,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,,2536440000.0,16.229967,,,,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,,1234959000.0,5.0,,,,20.0,0.0,0.0,0.0,0.0
25%,4846.25,,3430417000.0,18.0,,,,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,,5618664000.0,32.0,,,,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,,7843960000.0,47.0,,,,504327.0,125020.0,18222.0,1156.25,292.0


In [5]:
tik_tok_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [6]:
claim_nas_mask = tik_tok_raw["claim_status"].isna()
claim_nas_mask.sum()

np.int64(298)

In [7]:
claim_nas = tik_tok_raw[claim_nas_mask]
claim_nas

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
19084,19085,,4380513697,39,,not verified,active,,,,,
19085,19086,,8352130892,60,,not verified,active,,,,,
19086,19087,,4443076562,25,,not verified,active,,,,,
19087,19088,,8328300333,7,,not verified,active,,,,,
19088,19089,,3968729520,8,,not verified,active,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
19377,19378,,7578226840,21,,not verified,active,,,,,
19378,19379,,6079236179,53,,not verified,active,,,,,
19379,19380,,2565539685,10,,verified,under review,,,,,
19380,19381,,2969178540,24,,not verified,active,,,,,
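The `info()` output suggests that the same 298 rows are missing every field from `claim_status` through `video_comment_count`. A minimal sketch of one way to confirm such a pattern follows. The frame `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: two rows lack every engagement field.
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, None],
    "verified_status": ["verified", "not verified", "not verified", "verified"],
    "video_view_count": [100.0, 250.0, np.nan, np.nan],
    "video_like_count": [10.0, 20.0, np.nan, np.nan],
})

# Count missing values per row, then tabulate how many rows share each count.
missing_per_row = toy.isna().sum(axis=1)
pattern_counts = missing_per_row.value_counts().sort_index()
print(pattern_counts)
```

If only two counts appear (zero and one fixed number), the missing values are concentrated in the same rows, which makes dropping those rows a simpler decision.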


In [8]:
claims = tik_tok_raw.dropna()
claims.isna().sum()

#                           0
claim_status                0
video_id                    0
video_duration_sec          0
video_transcription_text    0
verified_status             0
author_ban_status           0
video_view_count            0
video_like_count            0
video_share_count           0
video_download_count        0
video_comment_count         0
dtype: int64

In [9]:
claims.head(10)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


In [10]:
status_desc = claims.groupby("verified_status").agg({"video_view_count":"mean","video_like_count":"mean","video_comment_count":"mean"})
status_desc

verified_status,video_view_count,video_like_count,video_comment_count
not verified,265663.785339,87925.772422,363.700514
verified,91439.164167,30337.633333,134.8775


In [11]:
status_desc.loc[:,"video_view_count"]

verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64
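The earlier `describe()` output shows an overall mean view count of about 254,709 against a median of 9,954.5, so the distribution is strongly right-skewed and group means alone may be misleading. As a hedged sketch, the same `groupby` could also report the group size and the median. The frame below is invented for illustration and does not reproduce the dataset.

```python
import pandas as pd

# Hypothetical values chosen so that one large video pulls the mean upward.
toy = pd.DataFrame({
    "verified_status": ["verified", "verified",
                        "not verified", "not verified", "not verified"],
    "video_view_count": [100.0, 300.0, 50.0, 150.0, 10000.0],
})

# Report count, mean, and median side by side for each group.
summary = (
    toy.groupby("verified_status")["video_view_count"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

A large gap between `mean` and `median` within a group is a sign that a few very popular videos dominate the average.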

## Hypothesis test

* H0: There is *no* difference in the mean of the tested variable (video view count or video comment count) between verified and unverified accounts
* Ha: There is a difference in the mean of the tested variable between verified and unverified accounts
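The tests in this notebook call `stats.ttest_ind` with `equal_var=False`, which corresponds to Welch's t-test (no assumption of equal group variances). For reference, with sample means $\bar{x}_1, \bar{x}_2$, sample variances $s_1^2, s_2^2$, and sample sizes $n_1, n_2$ (symbols introduced here only for illustration), the statistic and its approximate degrees of freedom are:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

Group 1 is the sample passed as `a` and group 2 the sample passed as `b`, so a negative $t$ means the first group has the lower sample mean.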

In [12]:
# set the significance level

significance_level = 0.05

verified = claims[claims["verified_status"] == "verified"]
not_verified = claims[claims["verified_status"] == "not verified"]

# Check statistical difference by the "video_view_count"
t_score, p_value = stats.ttest_ind(a = verified["video_view_count"], b = not_verified["video_view_count"], equal_var= False)
print(f"The t_score for this two tail test is: {t_score:.4f}")
print(f"The p_value for this test is: {p_value:.4f}")

The t_score for this two tail test is: -25.4994
The p_value for this test is: 0.0000
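The `:.4f` format rounds any p-value below 0.00005 to `0.0000`, which can read as if the p-value were exactly zero. A small sketch of the difference between fixed-point and scientific formatting, using a made-up p-value (the number below is an assumption for illustration, not the result of this test):

```python
# Hypothetical, very small p-value used only to illustrate formatting.
p_value = 3.2e-120

fixed = f"{p_value:.4f}"        # fixed-point, four decimals
scientific = f"{p_value:.3e}"   # scientific notation, three decimals

print(f"Fixed-point: {fixed}")
print(f"Scientific:  {scientific}")
```

Scientific notation keeps the order of magnitude visible, which may be preferable when reporting results to stakeholders.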


In [13]:
# Reuses significance_level, verified, and not_verified from the previous cell

# Check statistical difference by the "video_comment_count"
t_score, p_value = stats.ttest_ind(a=verified["video_comment_count"], b=not_verified["video_comment_count"], equal_var=False)
print(f"The t_score for this two tail test is: {t_score:.4f}")
print(f"The p_value for this test is: {p_value:.4f}")

The t_score for this two tail test is: -13.5301
The p_value for this test is: 0.0000


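* In both tests the p-value is effectively 0, well below the 5% significance level, so I reject the null hypothesis.

* This means there is statistically significant evidence that verified and unverified accounts differ in mean video view count and in mean video comment count. The negative t-scores match the group means above: verified accounts have the lower sample mean for both variables.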
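The decision rule used here (compare the p-value with the significance level) can be written as a small helper. This is only a sketch: the function name `decide` and its return strings are assumptions introduced for illustration and are not part of the original notebook.

```python
def decide(p_value: float, significance_level: float = 0.05) -> str:
    """Return the conclusion of a hypothesis test for a given p-value."""
    # Reject H0 only when the p-value falls below the chosen threshold.
    if p_value < significance_level:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


print(decide(0.0000))  # a p-value that rounds to zero
print(decide(0.20))    # a hypothetical large p-value
```

Note that the alternative outcome is "fail to reject", not "accept": a large p-value does not show that the two groups are the same.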

In [14]:
map_dict = {"verified": 1, "not verified": 2}

# Take an explicit copy so pandas does not raise SettingWithCopyWarning on the assignment below
claims = claims.copy()
claims["verified_status_key"] = claims["verified_status"].map(map_dict)

claims_main = claims[["claim_status", "verified_status", "verified_status_key", "video_view_count", "video_comment_count"]]
claims_main.head(20)


Unnamed: 0,claim_status,verified_status,verified_status_key,video_view_count,video_comment_count
0,claim,not verified,2,343296.0,0.0
1,claim,not verified,2,140877.0,684.0
2,claim,not verified,2,902185.0,329.0
3,claim,not verified,2,437506.0,584.0
4,claim,not verified,2,56167.0,152.0
5,claim,not verified,2,336647.0,1857.0
6,claim,not verified,2,750345.0,5446.0
7,claim,not verified,2,547532.0,11.0
8,claim,not verified,2,24819.0,27.0
9,claim,verified,1,931587.0,2540.0
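`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a misspelled or oddly cased status would silently become missing. A minimal sketch of a check for unmapped labels, using an invented series (the third label is a deliberate, hypothetical typo):

```python
import pandas as pd

# Hypothetical labels; the last one has different casing and a trailing space.
status = pd.Series(["verified", "not verified", "Verified "])
map_dict = {"verified": 1, "not verified": 2}

# Values missing from the dictionary come back as NaN after mapping.
keys = status.map(map_dict)
unmapped = status[keys.isna()]
print(unmapped.tolist())
```

An empty list here would indicate that every label was covered by the dictionary.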
