# **TikTok Claim Classification Project**

# **Data exploration and hypothesis testing**

In this lab, we will explore the data provided and conduct a hypothesis test.

**Hypothesis testing** is a method we use in statistics to figure out if what we're seeing in our data is just a fluke or if it's something real we can actually trust.

First off, we start with what we call hypotheses. Think of these like two different ideas we're testing. One is called the null hypothesis, which basically says there's no real effect or difference going on. It's like saying things are just normal, nothing special. Then, we have the alternative hypothesis, which is what we're hoping to find evidence for. This one says there is some kind of effect or difference happening.

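Now, we need to decide how sure we want to be before we make any conclusions. We set something called the significance level, often written as `α`. It's like setting a bar: `α` is the chance we're willing to accept of rejecting the null hypothesis when it's actually true. If the evidence against the null hypothesis clears this bar (that is, the p-value comes in below `α`), we'll believe there's something going on.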

Next, we pick a statistic to help us make our decision. This depends on what kind of data we have and what question we're trying to answer. There are different tests for different situations, like t-tests, chi-square tests, and others.

Once we have our data, we crunch some numbers to calculate our chosen statistic.

Now comes the fun part – we compare what we found to what we expected. We look at our calculated statistic and see if it's big enough to cross into what we call the critical region. This is like our danger zone – if our statistic lands here, it means we should start to doubt our null hypothesis.

Finally, we make a call based on where our statistic ended up. If it's in the critical region, we reject the null hypothesis and say, "Hey, there's something interesting going on here!" If it's not in the critical region, we don't have enough evidence to reject the null hypothesis – so we stick with the idea that things are just normal.
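
The steps above can be sketched in a few lines. This is a minimal illustration rather than the analysis itself: the two samples are synthetic (the names `group_a` and `group_b` and their parameters are assumptions made for the example), and the decision is made by comparing the p-value to `α`, which is equivalent to checking whether the statistic lands in the critical region.

```python
import numpy as np
from scipy import stats

# Synthetic samples (hypothetical values, not the TikTok data)
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=100.0, scale=15.0, size=200)
group_b = rng.normal(loc=110.0, scale=15.0, size=200)

# Significance level, chosen before looking at the results
alpha = 0.05

# Two-sample t-test (Welch's version, which does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Decision rule: reject the null hypothesis when the p-value is below alpha
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, reject H0: {reject_null}")
```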

### Formulating research question

There are a few possible ways to frame the research question. For example:

1) Do videos from verified accounts and videos from unverified accounts have different average view counts?

2) Is there a relationship between the account being verified and the associated videos' view counts?

Based on these questions, we will construct our null and alternative hypotheses.

### **Imports and Data Loading**

We will import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [2]:
# packages for data manipulation
import pandas as pd
import numpy as np

# packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# packages for statistical analysis/hypothesis testing
from scipy import stats

In [3]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

### **Data exploration**

In [4]:
# Display first few rows
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [5]:
# Generate a table of descriptive statistics about the data
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


Check for and handle missing values.

In [5]:
# Check for missing values
data.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [6]:
# Drop rows with missing values
data = data.dropna(axis=0)

In [7]:
# Display first few rows after handling missing values
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


We are interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean values of `video_view_count` for each group of `verified_status` in the sample data.

In [8]:
# Compute the mean `video_view_count` for each group in `verified_status`
data.groupby("verified_status")["video_view_count"].mean()

verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64
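
Because view counts are heavily right-skewed (in the descriptive statistics above, the overall mean is far larger than the median), it can help to look at group sizes and medians alongside the means. The sketch below shows the pattern on a toy DataFrame with made-up values; only the column names are taken from the dataset, and `toy` and `summary` are names introduced for the example.

```python
import pandas as pd

# Toy frame with made-up values; the column names mirror the dataset
toy = pd.DataFrame({
    "verified_status": ["not verified", "not verified", "not verified",
                        "verified", "verified"],
    "video_view_count": [1000.0, 3000.0, 500000.0, 2000.0, 4000.0],
})

# Group size, mean and median of view counts for each verification status
summary = (
    toy.groupby("verified_status")["video_view_count"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

The same `.agg(["count", "mean", "median"])` call could be applied to `data` in place of `toy`.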

### Hypothesis testing

Before conducting the hypothesis test, we need to formulate the null and alternative hypotheses.

**$H_0$**: There is no difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to chance or sampling variability).

**$H_A$**: There is a difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to an actual difference in the corresponding population means).


We can perform a t-test to compare the means of the two groups since it appears we're dealing with numerical data (view count) and comparing two independent groups (verified vs. not verified).


#### Independent Two-Sample T-Test

**Objective:** To determine if there is a significant difference between the means of two independent groups.

**Assumptions:**
1. The data in each group are independent.
2. The data in each group are approximately normally distributed.
3. The variances of the two groups are equal (though there's a version of the test that does not assume equal variances, known as Welch's t-test).
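
One possible way to probe assumptions 2 and 3 is sketched below, assuming right-skewed synthetic samples stand in for view counts (`sample_1` and `sample_2` are hypothetical). Sample skewness gives a rough read on symmetry, and Levene's test checks the null hypothesis that the groups have equal variances.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed samples standing in for view counts
rng = np.random.default_rng(seed=1)
sample_1 = rng.lognormal(mean=10.0, sigma=1.5, size=500)
sample_2 = rng.lognormal(mean=9.0, sigma=1.0, size=500)

# Skewness near 0 suggests symmetry; large positive values suggest a long right tail
skew_1 = stats.skew(sample_1)
skew_2 = stats.skew(sample_2)

# Levene's test: the null hypothesis is that both groups have equal variances
levene_stat, levene_p = stats.levene(sample_1, sample_2)

print(f"skewness: {skew_1:.2f}, {skew_2:.2f}")
print(f"Levene statistic = {levene_stat:.2f}, p = {levene_p:.3g}")
```

A small Levene p-value would be a reason to prefer Welch's t-test over the pooled version. With large samples, the t-test is usually fairly robust to moderate departures from normality.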

**Hypotheses:**
- Null hypothesis ($H_0$): There is no difference in means ($\mu_1 = \mu_2$).
- Alternative hypothesis ($H_A$): There is a difference ($\mu_1 \neq \mu_2$).

**Formulas:**

1. **Test Statistic:**
$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\text{SE}_{\text{difference}}} $$
Where $\bar{x}_1$ and $\bar{x}_2$ are the sample means of groups 1 and 2, respectively, and $\text{SE}_{\text{difference}}$ is the standard error of the difference between the two means.

2. **Standard Error of the Difference:**
$$ \text{SE}_{\text{difference}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $$
Where $s_1^2$ and $s_2^2$ are the sample variances of groups 1 and 2, respectively, and $n_1$ and $n_2$ are the sample sizes.

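3. **Degrees of Freedom:**
For the equal-variance version of the test:
$$ \text{df} = n_1 + n_2 - 2 $$
For unequal variances (Welch's t-test), the degrees of freedom are approximated from the sample sizes and variances of the two groups (the Welch–Satterthwaite approximation):
$$ \text{df} \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} $$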

4. **Confidence Interval for the Difference:**
The confidence interval provides a range of values within which the true difference in means is expected to lie, with a certain level of confidence (e.g., 95%):
$$ (\bar{x}_1 - \bar{x}_2) \pm t_{\text{critical}} \times \text{SE}_{\text{difference}} $$
Where $t_{\text{critical}}$ is the critical value of t for the desired level of confidence and degrees of freedom.

5. **Margin of Error (MOE):**
$$ \text{MOE} = t_{\text{critical}} \times \text{SE}_{\text{difference}} $$
It represents the maximum expected difference between the true difference in population means and the sample estimate, at the chosen confidence level.
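
The formulas above can be checked by computing them by hand and comparing the t-statistic with SciPy's result. The sketch below assumes Welch's version of the test (unpooled standard error, Welch–Satterthwaite degrees of freedom). `welch_summary` is a hypothetical helper written for this illustration, not a SciPy function, and the samples `a` and `b` are synthetic.

```python
import numpy as np
from scipy import stats

def welch_summary(x1, x2, confidence=0.95):
    """Hand-computed Welch t-test quantities (hypothetical helper for illustration)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.size, x2.size
    # Per-group variance of the mean: s^2 / n, using the sample variance (ddof=1)
    v1 = x1.var(ddof=1) / n1
    v2 = x2.var(ddof=1) / n2
    diff = x1.mean() - x2.mean()
    se = np.sqrt(v1 + v2)                      # standard error of the difference
    t_stat = diff / se                         # test statistic
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # Two-sided critical value for the requested confidence level
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df)
    moe = t_crit * se                          # margin of error
    return {"t": t_stat, "df": df, "se": se, "moe": moe,
            "ci": (diff - moe, diff + moe)}

# Synthetic samples with different variances and sizes
rng = np.random.default_rng(seed=2)
a = rng.normal(50.0, 10.0, size=120)
b = rng.normal(45.0, 20.0, size=80)

result = welch_summary(a, b)
scipy_t, scipy_p = stats.ttest_ind(a, b, equal_var=False)

print(result)
print(f"SciPy t = {scipy_t:.4f}, p = {scipy_p:.3g}")
```

The hand-computed `result["t"]` should agree with `scipy_t` up to floating-point error, since `equal_var=False` selects Welch's test in `scipy.stats.ttest_ind`.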

After calculating the t-statistic, you compare it to a critical value from the t-distribution table (based on your chosen significance level and degrees of freedom) to decide whether to reject the null hypothesis. If the absolute value of the t-statistic is greater than the critical value, you reject the null hypothesis.
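
In code, the table lookup can be replaced by the inverse CDF of the t-distribution. The sketch below uses made-up values for the t-statistic and the degrees of freedom (both are assumptions for illustration) and shows that the critical-value rule and the p-value rule lead to the same decision.

```python
from scipy import stats

alpha = 0.05     # significance level
df = 30          # hypothetical degrees of freedom
t_stat = 2.5     # hypothetical test statistic

# Two-sided test: alpha is split across both tails of the t-distribution
t_critical = stats.t.ppf(1 - alpha / 2, df)
reject_by_critical = abs(t_stat) > t_critical

# Equivalent rule using the two-sided p-value
p_value = 2 * stats.t.sf(abs(t_stat), df)
reject_by_p = p_value < alpha

print(f"critical value = {t_critical:.3f}, p-value = {p_value:.4f}")
print(f"reject H0: {reject_by_critical} (critical value), {reject_by_p} (p-value)")
```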

This procedure allows researchers to infer whether the observed differences in sample means are statistically significant and likely reflect true differences in the population means, rather than being due to random chance.


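We choose 5% as the significance level and proceed with a two-sample t-test. Because the call below passes `equal_var=False`, it runs Welch's version of the test, which does not assume equal variances.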

In [9]:
# Conduct a two-sample t-test to compare means
# Save each sample in a variable
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

# Implement a t-test using the two samples
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)

Ttest_indResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120)

The p-value is extremely low ($2.61 \times 10^{-120}$), far below any conventional significance level (e.g., 0.05 or 0.01), indicating that we have strong evidence to reject the null hypothesis. This suggests that there is a statistically significant difference in the average view count between videos from verified accounts and videos from unverified accounts.

Given the mean values, videos from unverified accounts have a higher average view count than videos from verified accounts. This result is intriguing and may suggest that factors other than verification status influence the popularity of videos on the platform, or perhaps the sample of verified videos behaves differently from what one might intuitively expect.



### **Communicate insights with stakeholders**

*Ask yourself the following question:*

*  What business insight(s) can you draw from the result of your hypothesis test?


The analysis shows that there is a statistically significant difference in the average view counts between videos from verified accounts and videos from unverified accounts. This suggests there might be fundamental behavioral differences between these two groups of accounts.

It would be interesting to investigate the root cause of this behavioral difference. For example, do unverified accounts tend to post more clickbait-y videos? Or are unverified accounts associated with spam bots that help inflate view counts?

The next step will be to build a regression model on `verified_status`. A regression model is the natural next step because the end goal is to make predictions on claim status, and a model for `verified_status` can help analyze how verified and unverified accounts behave. Technical note for preparing the model: because `verified_status` is a binary outcome, the data is skewed, and there is a significant difference between account types, it will be key to build a logistic regression model.