---

## **Part 2: Statistical Analysis to Modeling**

---

### **Transition from Part 1**  
In **Part 1: Project Proposal to EDA**, I established the project’s scope, structured the dataset, and conducted exploratory data analysis to uncover initial insights. This set the foundation for understanding claim-related patterns and engagement metrics.  

Now, in **Part 2**, I will transition from EDA to **statistical analysis, including hypothesis testing, and machine learning modeling, covering multiple logistic regression, model evaluation, and champion model selection (Random Forest and XGBoost).** This phase will focus on validating statistical assumptions, developing a baseline logistic regression model, and selecting the best-performing model for claims classification.

### **Objectives of Part 2** 

![Phases4_6.jpg](attachment:Phases4_6.jpg)

- **Phase 4 - Statistical Tests:** Statistical Hypothesis Testing for TikTok Engagement: Verified vs. Unverified Accounts  
- **Phase 5 - Regression Modeling:** Multiple Logistic Regression & Model Evaluation  
- **Phase 6 - Machine Learning:** Champion Model Selection (Random Forest and XGBoost)  

This phase will be crucial in refining the classification model and ensuring robust performance for claim moderation.  

### **Looking Back: Part 1**  
To view the project foundation, including dataset structuring and exploratory data analysis, refer to:

➡ **[Part 1: TikTok_Claims_Classification_Part1_Project_Proposal_to_EDA.ipynb](https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok_Claims_Classification_Part1_Project_Proposal_to_EDA.ipynb)**  

---

| ![Analyze.png](attachment:Analyze.png) | ![Construct.png](attachment:Construct.png) |
|----------------------------------------|--------------------------------------------|
| **Analyze**                            | **Construct**                              |


# Phase 4: Statistical Hypothesis Testing for TikTok Engagement: Verified vs. Unverified Accounts
| ![Phase_4-2.png](attachment:Phase_4-2.png) | 
|----------------------------------------|

### **Introduction**

In **Phase 4**, I will conduct **hypothesis testing and statistical analysis** to further investigate engagement trends in the TikTok claims classification dataset. Building on **Phase 1** (project proposal), **Phase 2** (Python-based EDA and data structuring), and **Phase 3** (advanced EDA, data visualization, and pattern identification), this phase focuses on **statistical significance testing** to validate insights and guide decision-making.  

The TikTok data analytics team has reached the midpoint of the claims classification project. So far, the team has completed **a project proposal, exploratory data analysis, and visualizations** using Python and Tableau. Now, I receive an email from **Mary Joanna Rodgers**, a Project Management Officer at TikTok, introducing a new request: to determine whether there is a **statistically significant difference in the number of views between verified and unverified accounts**. Follow-up emails from **Rosie Mae Bradshaw**, the Data Science Manager, and **Willow Jaffey**, the Data Science Lead, provide further details on the analysis. Finally, **Orion Rainier**, a Data Scientist at TikTok, assigns me the task of conducting a hypothesis test on **verified vs. unverified accounts in terms of video view count**.  

With exploratory data analysis completed, the team is now ready to apply statistical methods to **validate observed patterns** and ensure data-driven decision-making. My assignment is to **determine the appropriate hypothesis testing method** and execute the analysis to assess engagement differences.  

#### **Task**  

For this phase, I will:  

- **Compute descriptive statistics** to summarize key engagement metrics.  
- **Conduct hypothesis tests** to analyze differences between verified and unverified accounts.  
- **Interpret statistical results** to determine their relevance to the claims classification project.  

This phase is essential for ensuring that insights drawn from the dataset are statistically valid. The results of my hypothesis tests will inform future decisions on engagement trends and contribute to the larger goal of **machine learning-driven claims classification**.

### **Overview**  

In **Phase 4**, I will apply **statistical methods** to analyze and interpret the TikTok claims classification dataset. This phase builds upon previous work: Phase 1 (project proposal and milestones), Phase 2 (data inspection and structuring), and Phase 3 (exploratory data analysis). The focus now shifts to **descriptive statistics, hypothesis testing, and effective communication of insights**. I will conduct hypothesis tests to examine patterns in the dataset and summarize key findings through an executive summary to update stakeholders.

### **Project Background**  

TikTok’s data team is progressing with the **claims classification project**, and at this stage, the following tasks must be completed:
- **Explore the project data** to gain statistical insights.
- **Conduct a hypothesis test** to validate findings.
- **Communicate insights** to stakeholders within TikTok.

### **Phase 4 Tasks**  

During this phase, I will complete the following:
- **Import relevant packages and load the TikTok dataset** for analysis.
- **Compute descriptive statistics** to understand data distributions.
- **Conduct a hypothesis test** to compare video view counts for verified versus unverified accounts.
- **Summarize key insights** and communicate findings effectively.

### **Phase 4 Deliverables**  

At the end of this phase, I will produce the following:
- **Hypothesis test results** prepared using Python.
- **Executive summary** outlining key insights for stakeholders.
- **PACE Strategy Document** detailing project considerations and action items.

### **Key Stakeholders for This Phase**  

- **Mary Joanna Rodgers** – Project Management Officer  
- **Rosie Mae Bradshaw** – Data Science Manager  
- **Orion Rainier** – Data Scientist  
- **Willow Jaffey** – Data Science Lead  

### **Review the Email**  

📎 [Email](https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok%20Claims%20Classification%20Project%20Emails/Phase-4-Project-Emails.pdf)

## Milestone 4:  Compute descriptive statistics

### **PACE: Plan**

### **Research Question: Verified vs. Unverified Accounts and Views**

The research question for this project is whether there is a statistically significant difference in the number of views for TikTok videos posted by verified accounts versus unverified accounts.

### **Task 1: Imports and Data Loading**  

To perform statistical analysis, I imported the necessary Python libraries:  

- **pandas** and **numpy** for data manipulation and numerical operations  
- **seaborn** and **matplotlib.pyplot** for data visualization  
- **scipy.stats** for statistical analysis and hypothesis testing

In [4]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

The dataset was then loaded into a Pandas DataFrame:

In [5]:
data = pd.read_csv(r"C:\Users\saswa\Documents\GitHub\TikTok-Claims-Classification\Data\tiktok_dataset.csv")

### **PACE: Analyze and Construct**

#### **The Role of Descriptive Statistics in Exploratory Data Analysis**

Descriptive statistics play a crucial role in Exploratory Data Analysis (EDA) by helping data professionals quickly explore and understand large datasets. They provide a high-level summary of the data’s central tendency, spread, and distribution. For example, calculating the mean of `video_view_count` for each group of `verified_status` helps uncover differences between groups. In pandas, `.describe()` summarizes numerical fields and `.describe(include='all')` covers both numerical and categorical fields; individual NumPy or pandas functions can compute specific statistics. This gives a quick way to assess the dataset and to spot potential insights, such as how video views vary with verification status.
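To make the difference between the two `.describe()` calls concrete, here is a minimal sketch on a hypothetical four-row frame. The column names mirror the project dataset, but the values are invented for illustration and are not taken from the TikTok data.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the project dataset, values are invented.
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified", "verified"],
    "video_view_count": [1200.0, 560000.0, 8300.0, 450.0],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe()

# All columns: adds unique, top, and freq rows for categorical fields.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

The second summary contains `NaN` wherever a statistic does not apply to a column, for example `mean` for `verified_status`.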

#### **Task 2: Data exploration**

To inspect the dataset, I performed the following steps:

In [6]:
data.head(5)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


In [7]:
print(data.describe())
print(data.describe(include='all'))
# info() prints its own report and returns None, so it is not wrapped in print()
data.info()

                  #      video_id  video_duration_sec  video_view_count  \
count  19382.000000  1.938200e+04        19382.000000      19084.000000   
mean    9691.500000  5.627454e+09           32.421732     254708.558688   
std     5595.245794  2.536440e+09           16.229967     322893.280814   
min        1.000000  1.234959e+09            5.000000         20.000000   
25%     4846.250000  3.430417e+09           18.000000       4942.500000   
50%     9691.500000  5.618664e+09           32.000000       9954.500000   
75%    14536.750000  7.843960e+09           47.000000     504327.000000   
max    19382.000000  9.999873e+09           60.000000     999817.000000   

       video_like_count  video_share_count  video_download_count  \
count      19084.000000       19084.000000          19084.000000   
mean       84304.636030       16735.248323           1049.429627   
std       133420.546814       32036.174350           2004.299894   
min            0.000000           0.000000          

In [8]:
data.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [9]:
# Drop every row that contains at least one missing value
data = data.dropna(axis=0)
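The missing-value counts above show 298 missing entries in each of seven columns, out of 19,382 rows (about 1.5%). Identical counts suggest, but do not prove, that the same rows are incomplete in every column. The sketch below shows one way to check this and to confirm how many rows `dropna` removes. It uses a hypothetical toy frame with invented values, not the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "claim_status": ["claim", None, "opinion", "claim"],
    "video_view_count": [100.0, np.nan, 250.0, 75.0],
    "verified_status": ["verified", "not verified", "not verified", "verified"],
})

rows_before = len(toy)

# Number of rows that have at least one missing value.
rows_with_na = int(toy.isna().any(axis=1).sum())

# Drop every row containing a missing value.
cleaned = toy.dropna(axis=0)

rows_dropped = rows_before - len(cleaned)
print(f"rows with any NaN: {rows_with_na}, rows dropped: {rows_dropped} of {rows_before}")
```

If the number of rows with any missing value equals the per-column missing count, the gaps are concentrated in the same rows, and dropping them costs no more data than the per-column counts imply.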

In [10]:
data.head(10)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


#### **Comparing Mean Video Views for Verified vs. Unverified Accounts**  

I am interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in my sample data.

In [11]:
round(data.groupby('verified_status')['video_view_count'].mean(), 3)

verified_status
not verified    265663.785
verified         91439.164
Name: video_view_count, dtype: float64
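Group means can be pulled upward by a few very popular videos. The summary output earlier shows an overall mean of about 254,709 views against a median of 9,954.5, which indicates a strongly right-skewed distribution. As a hedged complement to the mean comparison, the sketch below reports count, mean, and median per group. It runs on a hypothetical toy frame with invented values, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows; values are invented for illustration.
toy = pd.DataFrame({
    "verified_status": [
        "verified", "verified",
        "not verified", "not verified", "not verified",
    ],
    "video_view_count": [400.0, 1600.0, 900.0, 5000.0, 600000.0],
})

# Count, mean, and median of views for each verification group.
group_summary = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median"])
    .round(3)
)

print(group_summary)
```

In the toy data a single large value dominates the "not verified" mean while the median stays close to the typical video, which is the pattern worth checking for in the real groups.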

## Milestone 4a:  Conduct hypothesis testing


#### **Task 3: Hypothesis testing**

#### **Defining Hypotheses for TikTok Video Engagement Analysis**

The null hypothesis (H₀) and alternative hypothesis (H<sub>A</sub>) are central to hypothesis testing. The null hypothesis states that there is no effect or difference in the population, while the alternative hypothesis states that there is an effect or difference. In this project, the hypotheses are as follows:

- Null hypothesis (H₀): There is no difference between the mean `video_view_count` of verified and unverified TikTok accounts.
- Alternative hypothesis (H<sub>A</sub>): There is a difference between the mean `video_view_count` of verified and unverified TikTok accounts.

The key distinction between these hypotheses is that the null hypothesis represents a statement of no effect or difference, often associated with equality, while the alternative hypothesis challenges the null and reflects the potential difference or effect being tested.

My goal in this step is to conduct a **two-sample t-test** to analyze the relationship between `verified_status` and `video_view_count`. The process follows these steps:  

1. **State the hypotheses**: Define the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_A\)).  
2. **Choose a significance level**: Determine an appropriate alpha level (e.g., 0.05).  
3. **Calculate the p-value**: Perform the two-sample t-test to obtain the p-value.  
4. **Make a decision**: Reject or fail to reject the null hypothesis based on the p-value.  

By following these steps, I will assess whether there is a significant difference in `video_view_count` between verified and non-verified users.
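The four steps above can be sketched as a small helper. This is an illustrative outline, not the notebook's actual cell: the function name `two_sample_decision` is hypothetical, and the two groups are synthetic normal samples rather than TikTok view counts. It uses `scipy.stats.ttest_ind` with `equal_var=False`, which is Welch's version of the two-sample t-test.

```python
import numpy as np
from scipy import stats


def two_sample_decision(sample_a, sample_b, alpha=0.05):
    """Run Welch's two-sample t-test and return (p_value, reject_null).

    Hypothetical helper for illustration:
    H0: the two population means are equal.
    HA: the two population means differ (two-sided).
    """
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    p_value = float(result.pvalue)
    # Reject H0 only when the p-value falls below the chosen significance level.
    return p_value, p_value < alpha


# Synthetic groups with clearly different means (not project data).
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=100.0, scale=10.0, size=200)
group_b = rng.normal(loc=120.0, scale=10.0, size=200)

p_value, reject_null = two_sample_decision(group_a, group_b, alpha=0.05)
print(f"p-value = {p_value:.3g}, reject H0: {reject_null}")
```

Fixing `alpha` before looking at the p-value keeps the decision rule independent of the result.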

The null and alternative hypotheses for my analysis are as follows:  

- **Null hypothesis (\(H_0\))**: There is no difference in the number of views between TikTok videos posted by verified and unverified accounts. Any observed difference in the sample data is due to chance or sampling variability.  
- **Alternative hypothesis (\(H_A\))**: There is a difference in the number of views between TikTok videos posted by verified and unverified accounts. Any observed difference in the sample data is due to an actual difference in the corresponding population means.  

In simpler terms, the null hypothesis assumes that verified and unverified accounts receive the same average number of views, while the alternative hypothesis states that their average view counts differ.

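I choose 5% as the significance level and proceed with a two-sample t-test. Because the call below sets `equal_var=False`, SciPy runs Welch's version of the test, which does not assume that the two groups have equal variances.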

In [12]:
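# Isolate the view counts for each verification group
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

# Two-sample t-test; equal_var=False means the group variances are not assumed equal
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)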

TtestResult(statistic=np.float64(25.499441780633777), pvalue=np.float64(2.6088823687177823e-120), df=np.float64(1571.163074387424))

The p-value (about 2.6 × 10⁻¹²⁰) is far below the 5% significance level, so I reject the null hypothesis. There is a statistically significant difference in mean video view count between verified and unverified TikTok accounts. The positive test statistic (t ≈ 25.5) reflects the higher sample mean for unverified accounts, which were passed as the first sample.
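The fractional degrees of freedom in the output (about 1571.16) come from the Welch-Satterthwaite approximation used when variances are not assumed equal. As a sketch of what the test computes, the hypothetical helper `welch_t` below reproduces the statistic and two-sided p-value by hand and compares them with SciPy. The samples are synthetic, so the printed numbers are unrelated to the TikTok result.

```python
import numpy as np
from scipy import stats


def welch_t(sample_a, sample_b):
    """Return (t_statistic, degrees_of_freedom, two_sided_p_value) for Welch's t-test."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)

    # Variance of each sample mean: s^2 / n (ddof=1 gives the sample variance).
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size

    # Test statistic: difference in means over its estimated standard error.
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)

    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

    # Two-sided p-value from the t distribution's survival function.
    p_value = 2.0 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_value


# Synthetic samples with unequal sizes and variances (not project data).
rng = np.random.default_rng(seed=1)
a = rng.normal(50.0, 5.0, size=40)
b = rng.normal(53.0, 12.0, size=25)

t_stat, df, p_value = welch_t(a, b)
reference = stats.ttest_ind(a=a, b=b, equal_var=False)

print(t_stat, df, p_value)
print(reference.statistic, reference.pvalue)
```

The two printed lines should agree to floating-point precision, which confirms that `equal_var=False` corresponds to this formula.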

### **PACE: Execute**

#### **Task 4: Communicate insights with stakeholders**

#### **Business Insights from Hypothesis Testing on TikTok Video Engagement**

The hypothesis test shows a statistically significant difference in mean video view counts between verified and unverified TikTok accounts. In this sample, videos from unverified accounts averaged far more views (about 265,664) than videos from verified accounts (about 91,439), which points to a possible behavioral difference between the two groups. One hypothesis is that unverified accounts post more attention-grabbing or clickbait-like content. Statistical significance alone does not establish how strongly verification status determines engagement, so the result is best treated as motivation for further analysis rather than as proof of a causal link. A useful next step is to examine whether unverified accounts tend to post more controversial or clickbait content, which could inform a deeper analysis of content strategies for both verified and unverified accounts.
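With samples this large, even a modest difference can produce a tiny p-value, so it helps to report the size of the difference alongside its significance. One common standardized measure is Cohen's d. The sketch below is a supplementary illustration and not part of the original analysis: the helper name `cohens_d` and the pooled-standard-deviation form are assumptions of this example, and the inputs are invented numbers. For heavily skewed data such as view counts, d should be read with caution.

```python
import numpy as np


def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled sample standard deviation (hypothetical helper)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)

    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)
    ) / (a.size + b.size - 2)

    # Difference in means expressed in pooled standard deviations.
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Invented values for illustration only.
group_a = np.array([2.0, 4.0, 6.0, 8.0])
group_b = np.array([1.0, 3.0, 5.0, 7.0])

d = cohens_d(group_a, group_b)
print(f"Cohen's d = {d:.3f}")
```

On the project data, the same function could be called with the `not_verified` and `verified` series from the hypothesis-testing cell.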

### **Conclusion**  

In **Phase 4**, I conducted **data exploration and hypothesis testing** on the TikTok dataset, focusing on analyzing statistical patterns and drawing meaningful conclusions. This phase built upon previous work, allowing me to apply **descriptive and inferential statistics** to understand key relationships in the data. Using **Python (pandas, scipy.stats, numpy)**, I examined distributions, computed summary statistics, and performed hypothesis tests to validate insights.  

A key aspect of this phase was evaluating whether **video view counts differ between verified and unverified TikTok accounts**. By formulating and testing hypotheses at a 5% significance level, I assessed statistical significance and interpreted the result: the difference in mean views is statistically significant, with unverified accounts averaging more views in this sample. The test relies on the t distribution to convert the test statistic into a p-value, which keeps the decision tied to a stated significance level.  

This phase strengthened my skills in **statistical analysis**, **hypothesis testing**, and **data interpretation**—all essential for making data-driven decisions. I also compiled an **executive summary**, ensuring that findings were effectively communicated to stakeholders.  

#### **Deliverables**  

- 📎 [Executive Summary](https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok%20Claims%20Classification%20Project%20Documents/Phase-4-Project-Documents/Phase-4-Executive%20Summary.pdf)

- 📎 [PACE Strategy Document](https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok%20Claims%20Classification%20Project%20Documents/Phase-4-Project-Documents/Phase-4-PACE%20Strategy%20Document.pdf)