Data exploration and hypothesis testing

The purpose of this project is to demonstrate knowledge of how to prepare, conduct, and analyze
hypothesis tests.
The goal is to apply descriptive and inferential statistics, probability distributions, and hypothesis
testing in Python.

This activity has three parts:

Part 1: Imports and data loading

* What data packages will be necessary for hypothesis testing?
Ans: pandas, NumPy, SciPy, Matplotlib/seaborn, statsmodels

Part 2: Conduct hypothesis testing 

* How will descriptive statistics help you analyze your data?
Ans: Descriptive statistics summarize the data before any test is run:
     measures of central tendency (mean, median, mode)
     measures of spread (range, variance, standard deviation)
     visual tools such as histograms and box plots
     graphs and charts that make trends and patterns easier to identify
     comparisons between groups or datasets
  
• How will you formulate your null hypothesis and alternative hypothesis?

Part 3: Communicate insights with stakeholders

• What key business insight(s) emerge from your hypothesis test?

• What business recommendations do you propose based on your results?

PACE: Plan
1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test.

For the TikTok classification project, the research question examined here is: "Do videos posted by verified accounts and videos posted by unverified accounts differ significantly in mean view count?"

In [1]:
#for data manipulation
import pandas as pd
import numpy as np

#for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

#for statistical analysis/hypothesis testing
import scipy.stats as stats
import statsmodels.api as sm

In [2]:
data = pd.read_csv("C:/Users/hp/Desktop/PYTHON/Stat/Tiktok project/tiktok_dataset.csv")


In [3]:
#display first few rows
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [4]:
# Generate a table of descriptive statistics for the numeric columns
# (use data.describe(include='all') to also summarize non-numeric columns)
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0
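
In the summary above, the mean of each count column is far larger than its median (for example, a mean of about 254,709 views against a median of 9,954.5), which points to strongly right-skewed distributions. A minimal sketch of how group-wise descriptive statistics can expose this kind of gap is shown below. The `toy` DataFrame and its values are made up for illustration and are not taken from the TikTok dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are invented for illustration)
toy = pd.DataFrame({
    "verified_status": ["verified", "verified", "not verified",
                        "not verified", "not verified"],
    "video_view_count": [100.0, 300.0, 50.0, 150.0, 1000.0],
})

# Count, mean, median, and standard deviation of views per group
summary = (
    toy.groupby("verified_status")["video_view_count"]
       .agg(["count", "mean", "median", "std"])
)
print(summary)
```

Comparing the `mean` and `median` columns side by side shows how much a few large values pull the mean upward within each group.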


In [5]:
#check for missing values
data.isnull().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64
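
Seven columns each report 298 missing values, which suggests (but does not prove) that the same rows are incomplete across all of them. One way to check this is sketched below on a small made-up DataFrame; `toy`, `rows_with_any_missing`, and `same_rows` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two rows are missing values in both nullable columns
toy = pd.DataFrame({
    "claim_status": ["claim", None, "opinion", None],
    "video_view_count": [10.0, np.nan, 30.0, np.nan],
    "video_id": [1, 2, 3, 4],
})

# Missing values per column, as in the cell above
missing_per_column = toy.isnull().sum()

# Number of rows that have at least one missing value
rows_with_any_missing = int(toy.isnull().any(axis=1).sum())

# If this equals the largest per-column count, every missing value
# sits in the rows where the most incomplete column is missing
same_rows = rows_with_any_missing == int(missing_per_column.max())
print(missing_per_column)
print(rows_with_any_missing, same_rows)
```

When `same_rows` is true, dropping every row that contains a missing value removes no more rows than the most incomplete column already requires.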

In [6]:
#drop rows with missing values and keep the result (dropna returns a new DataFrame)
data = data.dropna(axis=0)
data

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
...,...,...,...,...,...,...,...,...,...,...,...,...
19079,19080,opinion,1492320297,49,in our opinion the earth holds about 11 quinti...,not verified,active,6067.0,423.0,81.0,8.0,2.0
19080,19081,opinion,9841347807,23,in our opinion the queens in ant colonies live...,not verified,active,2973.0,820.0,70.0,3.0,0.0
19081,19082,opinion,8024379946,50,in our opinion the moon is moving away from th...,not verified,active,734.0,102.0,7.0,2.0,1.0
19082,19083,opinion,7425795014,8,in our opinion lightning strikes somewhere on ...,not verified,active,3394.0,655.0,123.0,11.0,4.0


In [7]:
#display first few rows after handling missing values
data.dropna().head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [34]:
# Convert necessary columns to numeric if needed
data['video_view_count'] = pd.to_numeric(data['video_view_count'], errors='coerce')

In [35]:
# Compute the mean video_view_count for each group in verified_status

mean_views = data.groupby('verified_status')['video_view_count'].mean().reset_index()
mean_views

Unnamed: 0,verified_status,video_view_count
0,not verified,265663.785339
1,verified,91439.164167


In [36]:
# Rename the columns for clarity
mean_views.columns = ['Verified Status', 'Mean Video View Count']
mean_views

Unnamed: 0,Verified Status,Mean Video View Count
0,not verified,265663.785339
1,verified,91439.164167


In [37]:
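#Hypothesis testing
#H0: there is no difference in mean video_view_count between verified and unverified accounts
#HA: there is a difference in mean video_view_count between verified and unverified accounts
#Significance level: 5%
verified_data = data[data['verified_status'] == 'verified']['video_view_count'].dropna()
unverified_data = data[data['verified_status'] == 'not verified']['video_view_count'].dropna()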

In [38]:
# Welch's two-sample t-test: equal_var=False does not assume equal group variances
t_stat, p_value = stats.ttest_ind(verified_data, unverified_data, equal_var=False)
print(t_stat, p_value)

-25.499441780633777 2.6088823687177823e-120
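
To make clear what `stats.ttest_ind(..., equal_var=False)` computes, the sketch below reproduces Welch's t-statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand and compares them with SciPy's result. The samples `a` and `b` are synthetic draws from a seeded random generator, not the TikTok data, and all variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different means, variances, and sizes (illustrative only)
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=40)
b = rng.normal(loc=12.0, scale=5.0, size=60)

# Squared standard error of each sample mean
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t-statistic: difference in means over the combined standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# SciPy's Welch test for comparison
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The sign of the statistic follows the argument order: a negative value means the first sample has the lower mean. In the notebook, `verified_data` is passed first, so the negative statistic agrees with the group means computed earlier, where verified accounts have the lower mean view count.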


Since the p-value (about 2.6e-120) is far smaller than the 0.05 significance level, you reject the null hypothesis that verified and unverified accounts have the same mean video view count. There is strong evidence that the two means differ.

What business insight(s) can you draw from the result of your hypothesis test?

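The hypothesis test shows a statistically significant difference in mean video view counts between verified and unverified accounts. In this dataset the difference runs against verified accounts: videos from unverified accounts average about 265,664 views, compared with about 91,439 for verified accounts. Verification is therefore not associated with higher view counts here, and the test alone does not explain why. Before acting on this result, the business should investigate what drives the gap, for example whether other variables in the dataset such as claim_status or author_ban_status differ between the two groups.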