![tt_logo-2.png](attachment:tt_logo-2.png)

# TikTok Claims Classification: End-to-End Analysis and Modeling

## **Introduction**

TikTok, one of the world's leading platforms for short-form mobile videos, has brought me on board as the newest member of its data analytics team. My team is working on developing a machine learning model to classify claims made in videos submitted to the platform.

At TikTok, the mission is to inspire creativity and bring joy. TikTok employees lead with curiosity and move at the speed of culture. Combined with the company's flat structure, this gives me dynamic opportunities to make a real impact on a rapidly expanding company and grow my career.

### **Company Background**  

TikTok, the short-form video hosting company, is the global hub and leading destination for short-form mobile video. The platform is built to help imaginations thrive. TikTok's mission is to create a place for inclusive, joyful, and authentic content, where people can safely discover, create, and connect.

### **The TikTok Scenario**  

As part of TikTok’s data team, my role involves tackling an important challenge: handling user-reported claims in videos. TikTok users can flag videos and comments they believe contain unverified claims, generating a high volume of reports. These reports must be reviewed by moderators, but the sheer volume makes it difficult to process them efficiently.

To address this, TikTok is developing a **predictive model** that can automatically determine whether a video contains a **claim** or an **opinion**. I will work mainly on this task, ensuring the model is effective in **reducing the backlog of user reports** and **prioritizing moderation efforts more efficiently**.

### **Project Scope**  

This project focuses on developing a machine learning model to classify claims in TikTok videos. It involves data exploration, statistical analysis, and visualization using Python and Tableau, followed by the development of a multiple logistic regression model. The project then selects a champion model by building and evaluating Random Forest and XGBoost models, with an n-gram tokenizer used to process the textual data for classification. A detailed breakdown of these steps is provided in the Overview section.

With this approach, the project aims to create a reliable machine learning pipeline to enhance content moderation efficiency on TikTok.  

## **Overview**
- This project strictly follows the **PACE (Plan, Analyze, Construct, Execute) framework** at its core. More details about this framework are explained in **Phase 1 - Milestone 1**.
  
- **Emails**: At the start of each phase, an **email from the phase's relevant stakeholder** is linked (stakeholders specific to each phase are mentioned in the phase overview). These emails, sent by stakeholders, team members, or cross-functional team members, provide essential details about the task and the project phase. They should be reviewed carefully, and the provided instructions should be followed to complete the **phase project**, including the **phase-specific code notebook, executive summary, PACE strategy document, and any other phase-specific document**.

- **Executive Summary**: At the end of applicable phases, an **executive summary** is linked. This summary is designed to effectively communicate the insights gained during the phase. It keeps **TikTok teammates informed** about project progress in a **concise one-page format**, respecting the time constraints of stakeholders who may not be able to read the full report.
 
- **PACE Strategy Document**: Similarly, at the end of each phase, the **PACE strategy document** is linked. This document outlines my approach to the tasks and answers key questions about the project phase. The **Data Project Questions & Considerations** section of the document includes questions that help deepen my understanding of data analytics. Completing this document is essential before preparing the **executive summary**, ensuring that the phase-related insights are communicated concisely.

- Each phase follows this structured workflow, where I:  
  - **Review the email** from the stakeholder to understand the project phase details and gather information about the business problem or question to be addressed.  
  - **Execute the tasks** outlined for the phase in the respective project section.  
  - **Complete the PACE strategy document** to define the approach and respond to key guiding questions.  
  - **Prepare the executive summary** to effectively communicate findings and share insights with stakeholders and team members.

This approach ensures **effective communication and strong project management** throughout the claims classification project.

### **Stakeholders & Team Members**

**TikTok Data Team:**
- **Willow Jaffey** – Data Science Lead
- **Rosie Mae Bradshaw** – Data Science Manager
- **Orion Rainier** – Data Scientist  
  *These technical team members require concise and specific communication.*

**Cross-Functional Team Members:**
- **Mary Joanna Rodgers** – Project Management Officer
- **Margery Adebowale** – Finance Lead, Americas
- **Maika Abadi** – Operations Lead  
  *These managers oversee operations and require communication that is adapted to their less technical roles.*

### **Effective Communication in Each Phase**  

Each phase not only focuses on technical execution but also helps me develop critical **communication skills** essential for success in a data-driven role. Throughout the project, I will:  

- **Ask questions** to clarify project goals, requirements, and stakeholder expectations.  
- **Share project needs** to ensure alignment and resource availability.  
- **Communicate with stakeholders** by presenting findings clearly and concisely.  
- **Give and receive feedback** to refine analyses and improve project outcomes.  
- **Stay in contact with team members** to foster collaboration and maintain project momentum.  

By applying these communication strategies in each phase, I ensure that both technical and non-technical stakeholders remain informed and engaged throughout the **claims classification project phases**.

### **Project Phases Overview**  

- **Phase 1:** Creating a project proposal by defining the required data analytical tasks and organizing them into realistic milestones to guide future steps in the claims classification project.

- **Phase 2:** The team has been granted access to TikTok’s user data. To gain clear insights, the data must be inspected, organized, and prepared for analysis. I will assist by building a dataframe and structuring the claims data for exploratory data analysis.  

- **Phase 3:** Initiating the exploratory data analysis (EDA) process. I will conduct EDA for the claims classification project and use Tableau to create visualizations for an executive summary, enabling non-technical stakeholders to engage with and interpret the data.  

- **Phase 4:** With exploratory data analysis completed, the team will begin hypothesis testing. I will analyze TikTok's user claim dataset to determine the most appropriate hypothesis testing method for the project.  

- **Phase 5:** Using the project data, I will develop a logistic regression model as a baseline for classification.  

- **Phase 6:** The final milestone involves building the machine learning model and selecting the champion model based on evaluation. Random Forest and XGBoost will be considered for model selection, and an n-gram tokenizer will be used to process textual data for classification.

Each phase follows the **PACE framework** and includes corresponding **emails, executive summaries, and PACE strategy documents** to ensure a structured, well-documented approach.
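
To make the n-gram tokenizer mentioned for Phase 6 more concrete, here is a minimal sketch of what word-level n-gram tokenization does. It is written in plain Python for illustration only: the function name `word_ngrams`, the lowercasing, the whitespace splitting, and the example sentence are my own assumptions, not the tokenizer used later in the project. In practice a library class such as scikit-learn's `CountVectorizer` (with its `ngram_range` parameter) could play this role.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return word n-grams of sizes n_min..n_max (hypothetical helper).

    Tokens are produced by lowercasing and splitting on whitespace,
    which is a simplifying assumption for this sketch.
    """
    tokens = text.lower().split()
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


# Made-up sentence, used only to show the shape of the output.
example = "someone shared with me"
print(word_ngrams(example))
# ['someone', 'shared', 'with', 'me', 'someone shared', 'shared with', 'with me']

# Counting n-grams turns a piece of text into numeric features.
counts = Counter(word_ngrams(example))
```

Each distinct n-gram can then become one feature column whose value is its count in a given transcription, which is the general idea behind using n-grams for text classification.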

## **Dataset Structure**

This dataset contains information about TikTok videos where a claim or opinion has been made. It consists of **19,383 rows**, with each row representing a unique published video. Various fields capture details about the video's characteristics, including **claim status**, which distinguishes between "opinions" (personal beliefs or thoughts) and "claims" (unsourced or unverified information). The dataset includes **video metadata** such as duration, transcription text, and engagement metrics like view count, like count, share count, download count, and comment count. Additionally, it provides information about the video's author, including **verification status** and **ban status**, indicating whether the user is verified, active, under scrutiny, or banned. Each video is uniquely identified by a **video_id**, while the **#** column represents a TikTok-assigned unique number specifically for videos containing a claim or opinion.


| Column Name                | Type  | Description  |
|----------------------------|-------|-------------|
| #                          | int   | TikTok assigned number for video with claim/opinion. |
| claim_status              | obj   | Whether the published video has been identified as an “opinion” or a “claim.” In this dataset, an “opinion” refers to an individual’s or group’s personal belief or thought. A “claim” refers to information that is either unsourced or from an unverified source. |
| video_id                  | int   | Random identifying number assigned to video upon publication on TikTok. |
| video_duration_sec        | int   | Length of the published video, measured in seconds. |
| video_transcription_text  | obj   | Transcribed text of the words spoken in the published video. |
| verified_status           | obj   | Indicates the status of the TikTok user who published the video in terms of their verification, either “verified” or “not verified.” |
| author_ban_status         | obj   | Indicates the status of the TikTok user who published the video in terms of their permissions: “active,” “under scrutiny,” or “banned.” |
| video_view_count         | float | The total number of times the published video has been viewed. |
| video_like_count         | float | The total number of times the published video has been liked by other users. |
| video_share_count        | float | The total number of times the published video has been shared by other users. |
| video_download_count     | float | The total number of times the published video has been downloaded by other users. |
| video_comment_count      | float | The total number of comments on the published video. |


This dataset offers a structured view of how claims and opinions spread on the platform and provides the foundation for further analysis in the claims classification project.
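
As a rough illustration of the structure described above, the sketch below encodes the column names, types, and allowed category values from the table and uses them to parse rows. It relies only on the standard library. The helper `parse_row`, the constant names, and the two sample rows are hypothetical and made up for this example; they are not taken from the real dataset, and the exact spelling of category values in the actual file may differ.

```python
import csv
import io

# Column names and types as listed in the dataset table.
EXPECTED_SCHEMA = {
    "#": int,
    "claim_status": str,
    "video_id": int,
    "video_duration_sec": int,
    "video_transcription_text": str,
    "verified_status": str,
    "author_ban_status": str,
    "video_view_count": float,
    "video_like_count": float,
    "video_share_count": float,
    "video_download_count": float,
    "video_comment_count": float,
}

# Allowed values for categorical columns, per the column descriptions.
ALLOWED_VALUES = {
    "claim_status": {"claim", "opinion"},
    "verified_status": {"verified", "not verified"},
    "author_ban_status": {"active", "under scrutiny", "banned"},
}


def parse_row(raw):
    """Cast one CSV row (a dict of strings) to the expected types.

    Empty strings become None so missing values stay visible.
    Raises ValueError on unexpected columns or category values.
    """
    if set(raw) != set(EXPECTED_SCHEMA):
        raise ValueError("columns do not match the expected schema")
    row = {}
    for name, caster in EXPECTED_SCHEMA.items():
        value = raw[name]
        row[name] = None if value == "" else caster(value)
    for name, allowed in ALLOWED_VALUES.items():
        if row[name] is not None and row[name] not in allowed:
            raise ValueError(f"{name}: unexpected value {row[name]!r}")
    return row


# Made-up sample rows for illustration only.
SAMPLE_CSV = """#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
1,claim,1234567890,30,example transcription text,not verified,active,1000.0,200.0,50.0,5.0,2.0
2,opinion,9876543210,15,another example,verified,banned,,,,,
"""

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0]["claim_status"], rows[0]["video_view_count"])
```

The second sample row leaves the engagement counts empty to show how missing values surface as `None` rather than being silently converted.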

![Plan-2.png](attachment:Plan-2.png) 
### **Plan**

# Phase 1: PACE Planning & Stakeholder Alignment for TikTok Claims Classification Project
![Phase_1.png](attachment:Phase_1.png)

### **Introduction**

In this phase, as a data analyst on TikTok’s data team, I will develop a **project proposal** based on new considerations from the leadership team, defining **realistic milestones** for the required data analytical tasks to ensure a structured approach to the **claims classification project**. At the start of this phase, my **supervisor has sent an email** outlining the task, providing instructions for creating the **project proposal**. I will carefully review this email and follow the provided guidelines to complete the **PACE strategy document** and the **claims classification project proposal**.

#### **Task**  

For this first task, I will create a **project proposal** that establishes clear **milestones** for the claims classification project. While planning the deliverable, I will consider:  

- The **target audience** for the proposal.  
- **Team roles and responsibilities** within the project.  
- The **project goal** and expected outcomes.  
- The **PACE framework** to align each task with its corresponding stage.  

This phase sets the foundation for the project by outlining a clear, structured roadmap for execution.

### **Overview**

I will demonstrate my understanding of an effective data analytics workflow by developing a **project proposal**. This proposal will outline key tasks and milestones to guide the **claims classification project**. To ensure a structured approach, I will use the **PACE strategy document**, a tool designed to support project planning and development.

#### **Project Background**  
TikTok’s data team is in the earliest stages of the **claims classification project**. Before beginning data analysis, the team requires:  
- A structured **project proposal** that:  
  - Organizes project tasks into **milestones**.  
  - Classifies tasks according to the **PACE framework**.  
  - Identifies **relevant stakeholders**.  

#### **Phase 1 Tasks**  
- Gather relevant information from stakeholder notes within TikTok, as provided in the email.  
- Assign **PACE stages** to each requested task in the classification project.  
- Organize tasks into realistic **milestones**.  
- Develop a **project proposal** for the TikTok data team.  

#### **Phase 1 Deliverables**  
This phase will help me apply key **project planning** skills by completing:  
- **Project Proposal** – A formal document outlining the project’s key tasks, milestones, and stakeholder considerations.  
- **PACE Strategy Document** – A structured framework to guide planning and execution within the classification project.  

#### **Key Stakeholders for This Phase**  
- **Rosie Mae Bradshaw** – Data Science Manager  
- **Mary Joanna Rodgers** – Project Management Officer  

#### **Review the Email**

📎 [Email](https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok%20Claims%20Classification%20Project%20Emails/Phase-1-Project-Emails.pdf): 

https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok%20Claims%20Classification%20Project%20Emails/Phase-1-Project-Emails.pdf


## Milestone 1: Establish structure for project workflow (PACE) - Plan

### Understanding the PACE Framework  
![PACE.png](attachment:PACE.png)

The **PACE** framework (**Plan, Analyze, Construct, Execute**) provides a structured yet flexible approach to managing data analysis projects. It ensures clear organization, facilitates communication, and supports an iterative workflow in which earlier stages can be revisited without disrupting the project. Communication is essential throughout PACE, much like electricity flowing through a circuit: it enables continuous collaboration, whether through asking questions, gathering data, updating stakeholders, or presenting findings and receiving feedback.

### Importance of Workflow Structures  
Large-scale data projects require structured workflows to manage tasks efficiently. Identifying potential blockers early in the process allows for better resource planning and minimizes disruptions. A well-defined workflow promotes **efficiency, collaboration, and streamlined decision-making**.

### Applying PACE to This Project  
Each **phase** in this project is categorized into one or more **PACE stages**. Within each phase, there are **milestones**, and each milestone is further classified under a specific PACE stage.  

For example, **Milestone 1 of Phase 1** falls under the **Plan** stage. To visually indicate the primary **PACE stage** of each phase, an **image representing the stage** is placed at the beginning of the phase—for instance, the **Plan** image is attached **at the start of Phase 1** since it belongs to the **Plan** stage. If a phase spans **multiple PACE stages**, multiple images will be used accordingly at the beginning of the phase.  

In **Phase 1**, both milestones were categorized as **Plan** in the project proposal, making this phase fully part of the **Plan** stage. However, other phases may span multiple PACE stages depending on how their milestones are classified. The project proposal attached in **Phase 1 - Milestone 1a** identifies the corresponding PACE stage for each milestone, ensuring a structured approach throughout the project.
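
The relationship between phases, milestones, and PACE stages described above can be sketched as a small data structure. This is only an illustration of the idea: the class names `PaceStage`, `Milestone`, and `Phase` and the `stages` method are my own hypothetical choices, and the notebook itself does not define them. The Phase 1 milestone names and their **Plan** classification come from this document.

```python
from dataclasses import dataclass, field
from enum import Enum


class PaceStage(Enum):
    PLAN = "Plan"
    ANALYZE = "Analyze"
    CONSTRUCT = "Construct"
    EXECUTE = "Execute"


@dataclass
class Milestone:
    name: str
    stage: PaceStage  # each milestone belongs to exactly one PACE stage


@dataclass
class Phase:
    number: int
    milestones: list = field(default_factory=list)

    def stages(self):
        """PACE stages this phase spans, in Plan-to-Execute order, without repeats."""
        present = {m.stage for m in self.milestones}
        return [s for s in PaceStage if s in present]


phase_1 = Phase(1, [
    Milestone("Milestone 1: Establish structure for project workflow (PACE)", PaceStage.PLAN),
    Milestone("Milestone 1a: Write a project proposal", PaceStage.PLAN),
])

print([s.value for s in phase_1.stages()])  # ['Plan']
```

A phase whose milestones fall under more than one stage would return several stages here, which corresponds to showing several stage images at the start of that phase.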

### Overview of PACE Stages  

Let’s take a closer look at each stage of the PACE model, along with the images that represent each stage:

#### **Plan**  
![Plan.png](attachment:Plan.png)

This stage establishes a solid foundation by defining the project scope, gathering requirements, and setting objectives. Key activities include:  
- Researching business data  
- Defining project scope  
- Developing a workflow  
- Assessing stakeholder needs  

#### **Analyze**  
![Analyze-2.png](attachment:Analyze-2.png)

Here, data is acquired, cleaned, and explored through **exploratory data analysis (EDA)**. Key activities include:  
- Data collection and formatting  
- Handling missing values and inconsistencies  
- Performing initial statistical analysis  

#### **Construct**
![Construct-2.png](attachment:Construct-2.png)

This stage focuses on building and refining models, often incorporating machine learning techniques. Key activities include:  
- Selecting modeling approaches  
- Building and training models  
- Evaluating model performance  

#### **Execute**  
![Execute-2.png](attachment:Execute-2.png)

Findings are communicated to stakeholders, incorporating feedback and refining outputs. Key activities include:  
- Presenting results  
- Addressing stakeholder feedback  
- Finalizing reports  

### Communication in PACE  
Effective **communication** is essential across all PACE stages. Whether clarifying project goals, presenting findings, or incorporating feedback, ongoing dialogue ensures alignment and enhances decision-making.

### Adaptability of PACE  
While PACE is presented sequentially, real-world projects require flexibility. It’s common to revisit earlier stages as new insights emerge. This adaptability prepares professionals for dynamic, evolving data projects.

By following the **PACE framework**, I will structure this project efficiently, ensuring each phase progresses smoothly while maintaining the flexibility to adapt as needed.

## Milestone 1a: Write a project proposal

A **project proposal** serves as a structured document that outlines the **scope, objectives, key tasks, and milestones** of the project. It ensures alignment among stakeholders, provides a clear roadmap for execution, and helps anticipate potential challenges.  

For this milestone, I will develop a **detailed project proposal** for the **TikTok Claims Classification Project**, incorporating the **PACE framework** to categorize tasks and define milestones. This proposal will act as a foundation for the project's workflow and execution.  

### **Attached Document:**  
📎 [TikTok Claims Classification Project Proposal](https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok%20Claims%20Classification%20Project%20Documents/Phase-1-Project-Documents/TikTok%20Claims%20Classification%20Project%20Proposal.pdf): 

https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok%20Claims%20Classification%20Project%20Documents/Phase-1-Project-Documents/TikTok%20Claims%20Classification%20Project%20Proposal.pdf

### **Conclusion**

In **Phase 1**, I explored the role of data professionals and how data analysis aligns with organizational goals. This phase emphasized both **technical and professional workplace skills**, particularly the importance of **effective communication** in data analytics.  

My success in **workflow management, data analysis, visualizations, statistics, regression analysis, and machine learning** relies on my ability to communicate insights clearly with **cross-functional teams**. Strong communication ensures alignment, collaboration, and impactful decision-making.  

#### **Deliverables**  

- 📎 Project Proposal: *Attached in Phase 1 - Milestone 1a*

- 📎[PACE Strategy Document](https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok%20Claims%20Classification%20Project%20Documents/Phase-1-Project-Documents/Phase-1-PACE%20Strategy%20Document.pdf): 

https://github.com/Cyberoctane29/TikTok-Claims-Classification-End-to-End-Analysis-and-Modeling/blob/main/TikTok%20Claims%20Classification%20Project%20Documents/Phase-1-Project-Documents/Phase-1-PACE%20Strategy%20Document.pdf