# **TikTok Project**
**The Power of Statistics**

This portfolio project follows the PACE framework (Plan, Analyze, Construct, Execute) and is inspired by a Google course on data analytics. 
The project demonstrates the application of hypothesis testing using Python, focusing on descriptive and inferential statistics, probability distributions, and statistical analysis.

Across three structured parts (data preparation, hypothesis testing, and insights communication), the project explores how statistical methods support data-driven decision-making. It closes with key business insights and recommendations based on the test results.

# **Data exploration and hypothesis testing**


# **PACE stages**

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

## **PACE: Plan**

For now, the main question I want to answer is: **Is the difference in video view count between Unverified and Verified accounts statistically significant?**

Starting with...

### **Task 1. Imports and Data Loading**

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [1]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
from matplotlib import pyplot as plt
import seaborn as sns


# Import packages for statistical analysis/hypothesis testing
from scipy import stats 
# Other utilities: read the dataset path from environment variables (.env file)
from dotenv import load_dotenv
import os

In [2]:
load_dotenv()

True

In [3]:
filepath = os.getenv('TIKTOKFILE_PATH')
print(filepath)

C:\Users\trevi\OneDrive\PythonScripts\TikTok_Project\tiktok_dataset.csv
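`os.getenv` returns `None` when the variable is not set, and `pd.read_csv` would then fail with a less obvious error. A minimal sketch of a guard is shown here; the helper name `resolve_dataset_path` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_dataset_path(var_name: str = "TIKTOKFILE_PATH") -> Path:
    """Return the dataset path stored in an environment variable.

    Hypothetical helper: raises a descriptive error if the variable
    is unset or if it points to a file that does not exist.
    """
    raw = os.getenv(var_name)
    if not raw:
        raise KeyError(f"Environment variable {var_name!r} is not set")
    path = Path(raw)
    if not path.is_file():
        raise FileNotFoundError(f"{var_name} points to a missing file: {path}")
    return path
```

Calling `resolve_dataset_path()` after `load_dotenv()` would then either return a usable path or stop early with a clear message.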


Load the dataset.

In [4]:
# Load dataset into dataframe
data = pd.read_csv(filepath)

In [5]:
data.info()
# Check data types and non-null counts.
# Several columns have fewer non-null entries (19,084) than there are rows (19,382), so missing values need to be addressed.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB



## **PACE: Analyze and Construct**

Consider the questions in your PACE Strategy Document and those below to craft your response:
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


### **Task 2. Data exploration**

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).



Inspect the first five rows of the dataframe.

In [6]:
# Display first few rows
data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


In [7]:
# video_view_count ranges from 20 to 999,817 views, with a mean of about 254,708.56.
# To start, the analysis focuses on video_view_count grouped by verified_status.
# Question: how is video_view_count distributed among verified and unverified accounts?
data.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |
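In this summary, the mean of `video_view_count` (about 254,709) is far above its median (9,954.5), which suggests a strongly right-skewed distribution. The sketch below shows one way to quantify that gap; the `views` series holds made-up values for illustration and is not the TikTok data.

```python
import pandas as pd

# Made-up, right-skewed view counts standing in for `video_view_count`.
views = pd.Series(
    [20, 4_000, 5_000, 9_000, 10_000, 500_000, 990_000], dtype="float64"
)

summary = {
    "mean": views.mean(),
    "median": views.median(),
    # Sample skewness: a value above 0 indicates a longer right tail.
    "skew": views.skew(),
}
print(summary)
```

On the real dataframe, the same calls could be applied to `data["video_view_count"]`.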


In [8]:
data['verified_status'].value_counts()
# Check the distribution of verified_status, the grouping variable for this project.
# The groups are imbalanced: most videos come from unverified accounts.

verified_status
not verified    18142
verified         1240
Name: count, dtype: int64

Check for and handle missing values.

In [9]:
# Check for missing values
data.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64
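Seven columns each report exactly 298 missing values, which is about 1.5% of the 19,382 rows. Before dropping rows, it can help to confirm that the gaps fall in the same rows. The following is a minimal sketch on a toy frame with made-up values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified", "verified"],
    "claim_status": ["claim", None, "opinion", None],
    "video_view_count": [100.0, np.nan, 250.0, np.nan],
})

# Number of missing columns in each row.
missing_per_row = toy.isna().sum(axis=1)

# Rows with at least one missing value.
incomplete = toy[missing_per_row > 0]

# True if every incomplete row is missing the same set of columns.
same_pattern = incomplete.isna().drop_duplicates().shape[0] == 1

# Share of rows that dropna() would remove.
dropped_share = len(incomplete) / len(toy)
print(same_pattern, dropped_share)
```

If the pattern is shared and the dropped share is small, removing the incomplete rows is a reasonable simplification for this analysis.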

In [10]:
# Drop rows with missing values
data.dropna(axis=0, inplace=True)

In [11]:
# Confirm that no missing values remain after dropping incomplete rows
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19084 entries, 0 to 19083
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19084 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19084 non-null  int64  
 3   video_duration_sec        19084 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19084 non-null  object 
 6   author_ban_status         19084 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.9+ MB


You are interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in the sample data.

In [12]:
# Compute the mean `video_view_count` for each group in `verified_status`
data.groupby('verified_status').agg(Mean = pd.NamedAgg(column='video_view_count', aggfunc='mean')).reset_index()

| | verified_status | Mean |
|---|---|---|
| 0 | not verified | 265663.785339 |
| 1 | verified | 91439.164167 |
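The group means alone do not show how many videos each group contains or how spread out the values are, and both matter for the hypothesis test. As a sketch, several statistics can be requested in one `agg` call; the toy frame below uses made-up numbers rather than the real dataset.

```python
import pandas as pd

# Toy data with the same two columns as the real dataframe (values are made up).
toy = pd.DataFrame({
    "verified_status": ["not verified"] * 4 + ["verified"] * 3,
    "video_view_count": [
        1_000.0, 5_000.0, 400_000.0, 900_000.0, 2_000.0, 8_000.0, 50_000.0,
    ],
})

# Group size, center, and spread for each verification status.
group_summary = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median", "std"])
)
print(group_summary)
```

Comparing `mean` with `median` per group also indicates whether the skew seen in the overall data is present in both groups.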


### **Task 3. Hypothesis testing**

Before you conduct your hypothesis test, consider the following questions where applicable to complete your code response:

1. Recall the difference between the null hypothesis and the alternative hypotheses. What are your hypotheses for this data project?

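<b>Null hypothesis:</b> There is no difference in mean video view count between the two groups; any difference observed in the sample occurred by chance.
<b>Alternative hypothesis:</b> There is a difference in mean video view count between the two groups; the observed difference did not occur by chance alone.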



Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis



<b>Null hypothesis:</b> There is no difference in mean video view count between unverified and verified accounts; any difference observed in the sample occurred by chance.
<b>Alternative hypothesis:</b> There is a difference in mean video view count between unverified and verified accounts; the observed difference did not occur by chance alone.

You choose 5% as the significance level and proceed with a two-sample t-test.

In [13]:
# Conduct a two-sample t-test to compare means

# Save each sample in a variable
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

# Implement a t-test using the two samples
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)

TtestResult(statistic=np.float64(25.499441780633777), pvalue=np.float64(2.6088823687177823e-120), df=np.float64(1571.163074387424))
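Passing `equal_var=False` makes `scipy.stats.ttest_ind` run Welch's t-test, which does not assume that the two groups share a variance. The sketch below recomputes the statistic from its definition and applies the 5% decision rule. It uses synthetic samples (`group_a`, `group_b`), not the TikTok data, so its numbers are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic samples with unequal sizes and variances (not the TikTok data).
group_a = rng.normal(loc=10.0, scale=3.0, size=200)
group_b = rng.normal(loc=9.0, scale=1.0, size=50)

# Welch's t-test, selected by equal_var=False.
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# The same statistic from its definition:
# difference in sample means divided by the combined standard error.
sq_se_a = group_a.var(ddof=1) / group_a.size  # squared standard error, group a
sq_se_b = group_b.var(ddof=1) / group_b.size  # squared standard error, group b
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(sq_se_a + sq_se_b)

alpha = 0.05  # significance level
reject_null = result.pvalue < alpha
print(result.statistic, t_manual, reject_null)
```

The manual value should match `result.statistic` up to floating-point error, which is a quick check that the test being run is the one intended.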


## **PACE: Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### **Task 4. Communicate insights**

<h3>Conclusion</h3>
<div style="background-color: #f8f9fa; border-left: 5px solid #007bff; padding: 10px 15px; margin-top: 1em;">
  <p>
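  Based on the <b>very low p-value</b> (about 2.6e-120, far below the 5% significance level), we can <b>reject the null hypothesis</b>. The difference in mean video view count between unverified and verified accounts is statistically significant: a gap this large would be very unlikely if both groups had the same mean. In this sample, unverified accounts average about 265,664 views per video, compared with about 91,439 for verified accounts.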
  </p>
  <p>
  As a next step, a more in-depth <b>regression analysis</b> could be performed to identify variables that influence this behavior.
  </p>
</div>
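A p-value indicates that a difference exists but not how large it is. One complementary summary is a confidence interval for the difference in means. The function below is a sketch under the same unequal-variance assumption as the test; `welch_confidence_interval` is a hypothetical helper, and the sample values are made up.

```python
import numpy as np
from scipy import stats


def welch_confidence_interval(a, b, confidence=0.95):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sq_se_a = a.var(ddof=1) / a.size  # squared standard error of mean(a)
    sq_se_b = b.var(ddof=1) / b.size  # squared standard error of mean(b)
    diff = a.mean() - b.mean()
    std_err = np.sqrt(sq_se_a + sq_se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (sq_se_a + sq_se_b) ** 2 / (
        sq_se_a**2 / (a.size - 1) + sq_se_b**2 / (b.size - 1)
    )
    # Two-sided critical value of the t distribution.
    margin = stats.t.ppf(0.5 + confidence / 2, dof) * std_err
    return diff - margin, diff + margin


# Made-up numbers for illustration only.
low, high = welch_confidence_interval(
    [12.0, 15.0, 11.0, 14.0, 13.0], [9.0, 10.0, 8.0, 11.0]
)
print(low, high)
```

Applied to the `not_verified` and `verified` samples from the notebook, an interval that excludes zero would be consistent with rejecting the null hypothesis at the matching significance level.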