## Overview 

In this part, I prepare, perform, and analyze hypothesis tests using Python. My goal is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing to generate actionable insights.

This project includes three main parts:


------------------------------------------------------------------------------------------------------

## 1- PACE: Plan

Import the packages needed for hypothesis testing.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

Load the dataset into a pandas DataFrame


In [2]:
data = pd.read_csv("/Users/sufyanalwjali/Desktop/Data analysis/Tik_Tok_Project-/tiktok_dataset.csv")

## 2- PACE: Analyze

In [3]:
data.head(10)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


In [5]:
# Get summary statistics
data.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


Check for and handle missing values.

In [6]:
data.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

Drop rows with missing values

In [7]:
data = data.dropna(axis=0)
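Before dropping rows, it can help to confirm how many rows are actually affected. The sketch below uses a small made-up frame (`toy` is hypothetical, not the TikTok data) to count the rows with at least one missing value and to check the row count after `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "claim_status": ["claim", None, "opinion", "claim"],
    "video_view_count": [100.0, np.nan, 250.0, 80.0],
    "video_like_count": [10.0, np.nan, 30.0, 5.0],
})

# Number of rows that contain at least one missing value
rows_with_na = int(toy.isna().any(axis=1).sum())

# Drop those rows and count how many were removed
cleaned = toy.dropna(axis=0)
rows_dropped = len(toy) - len(cleaned)

print(f"rows with missing values: {rows_with_na}, rows dropped: {rows_dropped}")
```

On the real data, comparing `rows_with_na` with the per-column counts from `isna().sum()` would show whether the missing values fall in the same rows.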

We are interested in the relationship between `verified_status` and `video_view_count`. A good starting point is to compare the average view counts across the two verification statuses in the sample data.

In [8]:
# Select the numeric column before aggregating so that text columns are not averaged
data.groupby('verified_status')[['video_view_count']].mean()

| verified_status | video_view_count |
|---|---|
| not verified | 265663.785339 |
| verified | 91439.164167 |
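The summary statistics above show heavily right-skewed view counts (the median is far below the mean), so comparing medians and group sizes alongside the means can give a fuller picture. A minimal sketch on made-up numbers (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one very large view count in the first group
toy = pd.DataFrame({
    "verified_status": ["not verified", "not verified", "not verified",
                        "verified", "verified"],
    "video_view_count": [1000.0, 2000.0, 900000.0, 500.0, 1500.0],
})

# Group size, mean, and median of view counts per verification status
summary = (
    toy.groupby("verified_status")["video_view_count"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

A large gap between the mean and the median within a group is a sign that a few very popular videos dominate the average.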


#### Hypothesis testing
In hypothesis testing, the null hypothesis (H₀) represents the assumption of no effect or no difference, while the alternative hypothesis (H₁) represents what we are trying to find evidence for.

For this project, the hypotheses are:

Null Hypothesis (H₀):
There is no difference in the average number of views between TikTok videos from verified and unverified accounts. Any observed difference is due to random variation.

Alternative Hypothesis (H₁):
There is a difference in the average number of views between videos from verified and unverified accounts, suggesting a real effect in the population.
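In symbols, writing $\mu_{\text{verified}}$ and $\mu_{\text{not verified}}$ for the population mean view counts of the two groups (notation introduced here for convenience), the hypotheses can be stated as:

$$
\begin{aligned}
H_0 &: \mu_{\text{verified}} = \mu_{\text{not verified}} \\
H_1 &: \mu_{\text{verified}} \neq \mu_{\text{not verified}}
\end{aligned}
$$

Because $H_1$ only claims that the means differ, without specifying a direction, the test is two-sided.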

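Let's conduct a two-sample t-test to compare the means. With `equal_var=False`, this is Welch's t-test, which does not assume that the two groups have equal variances.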


In [9]:
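# Split the view counts by verification status
not_verified = data[data['verified_status'] == 'not verified']['video_view_count']
verified = data[data['verified_status'] == 'verified']['video_view_count']

# Welch's two-sample t-test (equal_var=False: variances are not assumed equal)
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)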

Ttest_indResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120)
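To make explicit what `equal_var=False` does, the sketch below computes Welch's t-statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `stats.ttest_ind`. The arrays `a` and `b` are randomly generated stand-ins, not the TikTok data.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and variances (assumed for illustration)
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=3.0, size=200)
b = rng.normal(loc=9.0, scale=1.0, size=50)

# Variance of each sample mean: sample variance divided by sample size
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t-statistic: difference in means over its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df)

result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual values should agree with SciPy's up to floating-point error.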

Since the p-value (about 2.6e-120) is far below the 5% significance level, we reject the null hypothesis. This indicates a statistically significant difference in the average video view count between verified and unverified TikTok accounts. In this sample, videos from unverified accounts average more views (about 265,664) than videos from verified accounts (about 91,439).

#### Conclusion


The analysis reveals a statistically significant difference in average video view counts between verified and unverified TikTok accounts. This suggests potential behavioral or structural differences between the two groups.

It raises important questions for further investigation, such as:

- Do unverified accounts post more clickbait-style content?
- Are some unverified accounts spam bots inflating views?

#### Next Steps: Regression Modeling
The next phase will involve building a regression model to better understand how verification status relates to user behavior, with the eventual aim of predicting claim status.

A logistic regression model is appropriate here because:

- The data is skewed, and logistic regression does not require normally distributed predictors.
- We are dealing with categorical outcomes (e.g., verified vs. unverified, claim vs. opinion).
- It will allow us to explore the predictive power of verification status and other features.
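As a rough preview of that next step, here is a minimal sketch of fitting a logistic regression with scikit-learn. Everything in it is an assumption made for illustration: the data is synthetic, the columns `is_verified` and `is_claim` are hypothetical 0/1 encodings, and the simulated relationship between them is made up rather than taken from the TikTok dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 500

# Hypothetical synthetic predictors: a verified flag and a video duration
toy = pd.DataFrame({
    "is_verified": rng.integers(0, 2, size=n),
    "video_duration_sec": rng.integers(5, 61, size=n),
})

# Simulated binary outcome (1 = claim, 0 = opinion); the effect size is invented
logit = 1.0 - 2.0 * toy["is_verified"]
prob_claim = 1 / (1 + np.exp(-logit))
toy["is_claim"] = (rng.random(n) < prob_claim).astype(int)

X = toy[["is_verified", "video_duration_sec"]]
y = toy["is_claim"]

# Fit the model; max_iter is raised because the features are not scaled
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Coefficients are on the log-odds scale, one per feature
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs)
```

A negative coefficient for `is_verified` here only reflects how the synthetic outcome was generated; the real model would need the actual features, a train/test split, and an evaluation step.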