![tt_logo.png](attachment:tt_logo.png)

# TikTok Claims Classification: End-to-End Analysis and Modeling

## **Introduction**

TikTok, one of the world's leading platforms for short-form mobile videos, has brought me on board as the newest member of its data analytics team. My team is working on developing a machine learning model to classify claims made in videos submitted to the platform.

At TikTok, the mission is to inspire creativity and bring joy. TikTok employees lead with curiosity and move at the speed of culture. The company's flat structure gives me dynamic opportunities to make a real impact on a rapidly expanding company and grow my career.

### **Background**  

TikTok, the short-form video hosting company, is the global hub and leading destination for short-form mobile video. The platform is built to help imaginations thrive. TikTok's mission is to create a place for inclusive, joyful, and authentic content, where people can safely discover, create, and connect.

### **The TikTok Scenario**  

As part of TikTok’s data team, my role involves tackling an important challenge: handling user-reported claims in videos. TikTok users can flag videos and comments they believe contain unverified claims, generating a high volume of reports. These reports must be reviewed by moderators, but the sheer volume makes it difficult to process them efficiently.

To address this, TikTok is developing a **predictive model** that can automatically determine whether a video contains a **claim** or an **opinion**. I will work mainly on this task, ensuring the model is effective in **reducing the backlog of user reports** and **prioritizing moderation efforts more efficiently**.

### **Project Scope**  

This project focuses on developing a machine learning model to classify claims in TikTok videos. It involves data exploration, statistical analysis, and visualization using Python and Tableau, followed by the development of a multiple logistic regression model. The project then builds and evaluates Random Forest and XGBoost models and selects a champion model from them, using an n-gram tokenizer (a technique commonly used to turn text into features for prediction and classification tasks) to process the textual data. A detailed breakdown of these steps is provided in the Overview section.

With this approach, the project aims to create a reliable machine learning pipeline to enhance content moderation efficiency on TikTok.  
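
To make the n-gram tokenization step concrete, here is a minimal, dependency-free sketch of the idea. It is not the tokenizer used later in the project. The function names `word_ngrams` and `ngram_counts` are hypothetical, and splitting on whitespace is a simplifying assumption (punctuation stays attached to words).

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return word n-grams of length n_min..n_max (illustrative helper)."""
    # Assumption: lowercase and split on whitespace; no punctuation handling.
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def ngram_counts(text, n_min=1, n_max=2):
    """Count how often each n-gram occurs; these counts can serve as features."""
    return Counter(word_ngrams(text, n_min, n_max))


# Made-up text, not taken from the dataset.
print(word_ngrams("drones deliver mail", 1, 2))
# ['drones', 'deliver', 'mail', 'drones deliver', 'deliver mail']
```

Each distinct n-gram becomes a candidate feature whose count can be passed to a classifier alongside the numeric video metadata.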

## **Overview**
- This project is built around the **PACE (Plan, Analyze, Construct, Execute) framework**, which is explained in more detail in **Phase 1-1**.
  
- At the start of each phase, an **email from the phase's relevant stakeholder** is linked (stakeholders specific to each phase are mentioned in the phase overview). These emails, sent by stakeholders, team members, or cross-functional team members, provide essential details about the task and the project phase. They should be reviewed carefully, and the provided instructions should be followed to complete the **phase project**, including the **section-specific code notebook, executive summary, PACE strategy document, and any other phase-specific document**.

- At the end of each phase, the **executive summary** is linked. This summary is designed to effectively communicate the insights gained during the phase. It keeps **TikTok teammates informed** about project progress in a **concise one-page format**, respecting the time constraints of stakeholders who may not be able to read the full report.
 
- Similarly, at the end of each phase, the **PACE strategy document** is linked. This document outlines my approach to the tasks and answers key questions about the project phase. The **Data Project Questions & Considerations** section of the document includes questions that help deepen my understanding of data analytics. Completing this document is essential before preparing the **executive summary**, ensuring that the phase-related insights are communicated concisely.

- Each phase follows this structured workflow, where I:  
  - **Review the email** from the stakeholder to understand the project phase details and gather information about the business problem or question to be addressed.  
  - **Execute the tasks** outlined for the phase in the respective project section.  
  - **Complete the PACE strategy document** to define the approach and respond to key guiding questions.  
  - **Prepare the executive summary** to effectively communicate findings and share insights with stakeholders and team members.

This approach ensures **effective communication and strong project management** throughout the claims classification project.

### **Stakeholders & Team Members**

**TikTok Data Team:**
- **Willow Jaffey** – Data Science Lead
- **Rosie Mae Bradshaw** – Data Science Manager
- **Orion Rainier** – Data Scientist  
  *These technical team members require concise and specific communication.*

**Cross-Functional Team Members:**
- **Mary Joanna Rodgers** – Project Management Officer
- **Margery Adebowale** – Finance Lead, Americas
- **Maika Abadi** – Operations Lead  
  *These managers oversee operations and require communication that is adapted to their less technical roles.*

### **Project Phases Overview**  

- **Phase 1:** Creating a project proposal by defining the required data analytics tasks and organizing them into realistic milestones to guide future steps in the claims classification project.  

- **Phase 2:** The team has been granted access to TikTok’s user data. To gain clear insights, the data must be inspected, organized, and prepared for analysis. I will assist by building a dataframe and structuring the claims data for exploratory data analysis.  

- **Phase 3:** Initiating the exploratory data analysis (EDA) process. I will conduct EDA for the claims classification project and use Tableau to create visualizations for an executive summary, enabling non-technical stakeholders to engage with and interpret the data.  

- **Phase 4:** With exploratory data analysis completed, the team will begin hypothesis testing. I will analyze TikTok's user claim dataset to determine the most appropriate hypothesis testing method for the project.  

- **Phase 5:** Using the project data, I will develop a logistic regression model as a baseline for classification.  

- **Phase 6:** The final milestone involves building the machine learning model and selecting the champion model based on evaluation. Random Forest and XGBoost will be considered for model selection, and an n-gram tokenizer will be used to process textual data for classification.

Each phase follows the **PACE framework** and includes corresponding **emails, executive summaries, and PACE strategy documents** to ensure a structured, well-documented approach.

## **Dataset Structure**

This dataset contains information about TikTok videos in which a claim or opinion has been made. It consists of **19,383 rows**, with each row representing a unique published video. Various fields capture details about the video's characteristics, including **claim status**, which distinguishes between "opinions" (personal beliefs or thoughts) and "claims" (unsourced or unverified information). The dataset includes **video metadata** such as duration, transcription text, and engagement metrics like view count, like count, share count, download count, and comment count. It also describes the video's author through **verification status** (verified or not verified) and **ban status** (active, under scrutiny, or banned). Each video is uniquely identified by a **video_id**, while the **#** column holds a unique number that TikTok assigns to each video containing a claim or opinion.


| Column Name                | Type  | Description  |
|----------------------------|-------|-------------|
| #                          | int   | TikTok-assigned number for a video with a claim or opinion. |
| claim_status              | obj   | Whether the published video has been identified as an “opinion” or a “claim.” In this dataset, an “opinion” refers to an individual’s or group’s personal belief or thought. A “claim” refers to information that is either unsourced or from an unverified source. |
| video_id                  | int   | Random identifying number assigned to video upon publication on TikTok. |
| video_duration_sec        | int   | Length of the published video, measured in seconds. |
| video_transcription_text  | obj   | Transcribed text of the words spoken in the published video. |
| verified_status           | obj   | Indicates the status of the TikTok user who published the video in terms of their verification, either “verified” or “not verified.” |
| author_ban_status         | obj   | Indicates the status of the TikTok user who published the video in terms of their permissions: “active,” “under scrutiny,” or “banned.” |
| video_view_count         | float | The total number of times the published video has been viewed. |
| video_like_count         | float | The total number of times the published video has been liked by other users. |
| video_share_count        | float | The total number of times the published video has been shared by other users. |
| video_download_count     | float | The total number of times the published video has been downloaded by other users. |
| video_comment_count      | float | The total number of comments on the published video. |


This dataset offers a structured view of how claims and opinions spread on the platform and provides the foundation for further analysis in the claims classification project.
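
The table above can also be written down as a small schema check. The sketch below is dependency-free; in the notebook itself a pandas dataframe would be the more usual tool. The names `EXPECTED_COLUMNS`, `ALLOWED_VALUES`, and `validate_row` are hypothetical, and the exact spelling of the category values in the raw file is assumed from the column descriptions.

```python
# Column names and types as listed in the table above.
EXPECTED_COLUMNS = {
    "#": int,
    "claim_status": str,
    "video_id": int,
    "video_duration_sec": int,
    "video_transcription_text": str,
    "verified_status": str,
    "author_ban_status": str,
    "video_view_count": float,
    "video_like_count": float,
    "video_share_count": float,
    "video_download_count": float,
    "video_comment_count": float,
}

# Assumption: category values are spelled as in the column descriptions.
ALLOWED_VALUES = {
    "claim_status": {"claim", "opinion"},
    "verified_status": {"verified", "not verified"},
    "author_ban_status": {"active", "under scrutiny", "banned"},
}


def validate_row(row):
    """Return a list of problems found in one row (a dict of strings)."""
    problems = []
    for col, caster in EXPECTED_COLUMNS.items():
        if col not in row:
            problems.append(f"missing column: {col}")
            continue
        value = row[col]
        if value == "":
            continue  # empty cell: treated as a missing value, not an error
        try:
            caster(value)  # check that the text parses as the expected type
        except ValueError:
            problems.append(f"{col}: cannot parse {value!r} as {caster.__name__}")
            continue
        allowed = ALLOWED_VALUES.get(col)
        if allowed is not None and value not in allowed:
            problems.append(f"{col}: unexpected value {value!r}")
    return problems


# Made-up row for illustration only; these are not values from the dataset.
sample_row = {
    "#": "1",
    "claim_status": "claim",
    "video_id": "1234567890",
    "video_duration_sec": "30",
    "video_transcription_text": "example transcription text",
    "verified_status": "not verified",
    "author_ban_status": "active",
    "video_view_count": "1000.0",
    "video_like_count": "100.0",
    "video_share_count": "10.0",
    "video_download_count": "1.0",
    "video_comment_count": "0.0",
}
print(validate_row(sample_row))  # []
```

Rows read with `csv.DictReader` arrive as dictionaries of strings, so `validate_row` can be applied to each one before any analysis begins.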

![Plan-2.png](attachment:Plan-2.png)