# **TikTok Project**
**The Power of Statistics**

# 📘 Final Project: Data Analysis and Hypothesis Testing

This project involves analysing a dataset and performing hypothesis testing using Python.

<br/>

## 🎯 Objective  
Demonstrate your ability to structure, carry out, and interpret hypothesis tests.

## 🧠 Skills Applied  
This activity puts into practice:
- Descriptive statistics  
- Inferential statistics  
- Probability distributions  
- Hypothesis testing in Python

<br/>

## 📌 Project Structure

The analysis is organised into three main parts:

### 🔹 Part 1: Data Import and Preparation
- Identify the necessary Python libraries for hypothesis testing.

### 🔹 Part 2: Hypothesis Testing Process
- Use descriptive statistics to better understand the data.  
- Formulate and state the null and alternative hypotheses.

### 🔹 Part 3: Communicating Insights
- Identify key business insights based on the results of the hypothesis test.  
- Provide actionable recommendations supported by the analysis.

<br/>

Once all the steps are complete, summarize your findings using the PACE Strategy framework to reflect on the business implications of your results.


# **Data exploration and hypothesis testing**


# **PACE stages**

This project follows the PACE problem-solving framework, which stands for Plan, Analyze, Construct, and Execute.  
Each section of the notebook is aligned with one of these stages to guide the analytical process step by step.


## **PACE: Plan**



Research question: Is there a statistically significant difference in the mean number of video views between verified and unverified TikTok accounts?

*Complete the following steps to perform statistical analysis of your data:*

### **Part 1. Imports and Data Loading**

In [None]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats



In [None]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

## **PACE: Analyze and Construct**

Using descriptive statistics helps me quickly grasp the key features of the dataset.  
It allows me to examine values like the mean and median (central tendency), as well as measures of variability such as standard deviation and variance.  
This also gives insight into the overall distribution of the data.

Through this analysis, I can identify possible outliers, compare the range of views between verified and unverified accounts, and evaluate whether the data is appropriate for hypothesis testing.  
These insights lay the groundwork for accurate interpretation in the later stages of inferential analysis.
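The kind of summary described above can be sketched as follows. This is a minimal illustration on a small made-up frame, not the project data: the `toy` values are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical toy data standing in for tiktok_dataset.csv (values are made up)
toy = pd.DataFrame({
    "verified_status": ["verified", "verified", "verified",
                        "not verified", "not verified", "not verified", "not verified"],
    "video_view_count": [1200.0, 3400.0, 560.0, 980.0, 45000.0, 2100.0, 870000.0],
})

# Central tendency and spread of views within each verification group
summary = toy.groupby("verified_status")["video_view_count"].agg(
    ["count", "mean", "median", "std", "min", "max"]
)
print(summary)

# Flag potential high-end outliers with the 1.5 * IQR rule (an assumed convention)
q1 = toy["video_view_count"].quantile(0.25)
q3 = toy["video_view_count"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy[toy["video_view_count"] > upper_fence]
print(outliers)
```

A large gap between a group's mean and median in such a table is a quick sign of skew, which matters when choosing and interpreting a test on means.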

### **Part 2. Data exploration**

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).



In [None]:
# Display first few rows
print(data.head())

   # claim_status    video_id  video_duration_sec  \
0  1        claim  7017666017                  59   
1  2        claim  4014381136                  32   
2  3        claim  9859838091                  31   
3  4        claim  1866847991                  25   
4  5        claim  7105231098                  19   

                            video_transcription_text verified_status  \
0  someone shared with me that drone deliveries a...    not verified   
1  someone shared with me that there are more mic...    not verified   
2  someone shared with me that american industria...    not verified   
3  someone shared with me that the metro of st. p...    not verified   
4  someone shared with me that the number of busi...    not verified   

  author_ban_status  video_view_count  video_like_count  video_share_count  \
0      under review          343296.0           19425.0              241.0   
1            active          140877.0           77355.0            19034.0   
2            a

In [None]:
# Generate a table of descriptive statistics about the data
print(data.describe())

                  #      video_id  video_duration_sec  video_view_count  \
count  19382.000000  1.938200e+04        19382.000000      19084.000000   
mean    9691.500000  5.627454e+09           32.421732     254708.558688   
std     5595.245794  2.536440e+09           16.229967     322893.280814   
min        1.000000  1.234959e+09            5.000000         20.000000   
25%     4846.250000  3.430417e+09           18.000000       4942.500000   
50%     9691.500000  5.618664e+09           32.000000       9954.500000   
75%    14536.750000  7.843960e+09           47.000000     504327.000000   
max    19382.000000  9.999873e+09           60.000000     999817.000000   

       video_like_count  video_share_count  video_download_count  \
count      19084.000000       19084.000000          19084.000000   
mean       84304.636030       16735.248323           1049.429627   
std       133420.546814       32036.174350           2004.299894   
min            0.000000           0.000000          

In [None]:
# Check for missing values
print(data.isnull().sum())

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64


In [None]:
# Drop rows with missing values
data = data.dropna()
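
Before dropping rows, it can help to check how much data is lost. In this dataset, 298 of 19,382 rows (about 1.5%) have missing values in the affected columns, so the loss appears small. The sketch below shows one way to run that check; the `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a missing value
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified", "verified"],
    "video_view_count": [1500.0, np.nan, 320.0, 87000.0],
    "video_like_count": [200.0, np.nan, 15.0, 9100.0],
})

# Count rows that have at least one missing value
rows_before = len(toy)
rows_with_missing = int(toy.isnull().any(axis=1).sum())
share_missing = rows_with_missing / rows_before

# Drop those rows and confirm how many remain
cleaned = toy.dropna()
print(f"{rows_with_missing} of {rows_before} rows ({share_missing:.1%}) contain missing values")
print(f"{len(cleaned)} rows remain")
```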

In [None]:
# Display first few rows after handling missing values
print(data.head())

   # claim_status    video_id  video_duration_sec  \
0  1        claim  7017666017                  59   
1  2        claim  4014381136                  32   
2  3        claim  9859838091                  31   
3  4        claim  1866847991                  25   
4  5        claim  7105231098                  19   

                            video_transcription_text verified_status  \
0  someone shared with me that drone deliveries a...    not verified   
1  someone shared with me that there are more mic...    not verified   
2  someone shared with me that american industria...    not verified   
3  someone shared with me that the metro of st. p...    not verified   
4  someone shared with me that the number of busi...    not verified   

  author_ban_status  video_view_count  video_like_count  video_share_count  \
0      under review          343296.0           19425.0              241.0   
1            active          140877.0           77355.0            19034.0   
2            a

In [None]:
# Compute the mean `video_view_count` for each group in `verified_status`
mean_views_by_verified_status = data.groupby('verified_status')['video_view_count'].mean()

# Display the result
print(mean_views_by_verified_status)


verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64
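
To put the gap between the two group means in concrete terms, the short calculation below uses the values printed above. It is only arithmetic on the sample means and says nothing yet about statistical significance.

```python
# Group means copied from the printed output above
mean_not_verified = 265663.785339
mean_verified = 91439.164167

# Absolute and relative gap between the two sample means
difference = mean_not_verified - mean_verified
ratio = mean_not_verified / mean_verified

print(f"Difference in mean views: {difference:,.0f}")
print(f"Ratio (not verified / verified): {ratio:.2f}")
```

In this sample, unverified accounts average roughly 2.9 times the views of verified accounts.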


### **Part 3. Hypothesis testing**




Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis



##### Null Hypothesis (H₀):

There is no difference in the mean number of video views between verified and unverified TikTok accounts.  
(The mean views for verified accounts are equal to the mean views for unverified accounts.)

##### Alternative Hypothesis (H₁):

There is a difference in the mean number of video views between verified and unverified TikTok accounts.  
(The mean views for verified accounts are not equal to the mean views for unverified accounts.)





The significance level is 5% (α = 0.05). Because the two groups are not assumed to share a variance, the test below is Welch's two-sample t-test.

In [None]:
# Create samples for verified and not verified accounts
verified = data[data['verified_status'] == 'verified']['video_view_count']
not_verified = data[data['verified_status'] == 'not verified']['video_view_count']

# Conduct a two-sample t-test to compare means (Welch's t-test)
t_statistic, p_value = stats.ttest_ind(
    verified,
    not_verified,
    equal_var=False  # Do not assume equal variances
)

# Display results
print("t-statistic:", t_statistic)
print("p-value:", p_value)


t-statistic: -25.499441780633777
p-value: 2.6088823687177823e-120


Since the p-value (2.61e-120) is far smaller than the significance level of 0.05, we reject the null hypothesis.
The data provide statistically significant evidence that the mean number of video views differs between verified and unverified TikTok accounts. The negative t-statistic reflects that verified accounts have the lower sample mean.
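
For readers who want to see what the test computes, the sketch below reproduces Welch's t-statistic by hand and compares it with `stats.ttest_ind(..., equal_var=False)`. The two samples are synthetic (drawn from assumed lognormal distributions to mimic skewed view counts), and `welch_t` is a hypothetical helper written for this illustration.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed samples standing in for the two groups (assumed values)
rng = np.random.default_rng(seed=0)
group_a = rng.lognormal(mean=8.0, sigma=1.0, size=200)
group_b = rng.lognormal(mean=9.0, sigma=1.5, size=1500)

def welch_t(a, b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    se2_a = np.var(a, ddof=1) / len(a)  # squared standard error of mean(a)
    se2_b = np.var(b, ddof=1) / len(b)  # squared standard error of mean(b)
    t = (np.mean(a) - np.mean(b)) / np.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

t_manual, df = welch_t(group_a, group_b)
p_manual = 2 * stats.t.sf(abs(t_manual), df)  # two-sided p-value

# Same test via SciPy
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)

# Decision rule at the 5% significance level
alpha = 0.05
decision = "reject H0" if p_scipy < alpha else "fail to reject H0"
print(t_manual, t_scipy, decision)
```

The statistic divides the difference in sample means by a standard error built from each group's own variance and size, which is why unequal group sizes and spreads do not invalidate the comparison.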



## **PACE: Execute**


## **Communicate insights with stakeholders**

The hypothesis test revealed a statistically significant difference in the mean number of video views between verified and unverified TikTok accounts.  
This suggests that verification status is associated with differences in video visibility or engagement.

Interestingly, in the sample data, unverified accounts averaged about 265,664 views per video, compared with about 91,439 for verified accounts.  
This insight could prompt TikTok's strategy team to further investigate factors influencing video performance beyond verification status, such as content type, posting frequency, or algorithmic promotion.

It would also be important to explore potential root causes for this behavioral difference.  
For instance, are unverified accounts more likely to post clickbait-style videos, or could spam bots be inflating their view counts?

Given the skewed nature of the data and the categorical nature of `verified_status`, a logical next step would be to build a logistic regression model with `verified_status` as the outcome.  
Such a model could show which video characteristics are associated with verification status and help explain the dynamics behind video engagement patterns.
