# <center> **TikTok Project**
# <center> **Hypothesis Testing**

In this activity, you will explore the provided data and conduct a hypothesis test.  
**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze hypothesis tests.  
**The goal** is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.  
* **Conduct hypothesis testing**
    * How will descriptive statistics help you analyze your data?
    * How will you formulate your null hypothesis and alternative hypothesis?

* **Communicate insights with stakeholders**
    * What key business insight(s) emerge from your hypothesis test?
    * What business recommendations do you propose based on your results?

*  **There are a few possible ways to frame the research question. For instance:**
    * Do videos from verified accounts and videos from unverified accounts have different average view counts?
    * Is there a relationship between the account being verified and the associated videos' view counts?
    * Is there any relationship between categorical variables such as `verified_status` and `author_ban_status`?
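
The third framing, a relationship between two categorical variables, is usually tested with a chi-square test of independence rather than a t-test. A minimal sketch follows, assuming a small made-up DataFrame named `toy` whose values are randomly generated for illustration and are not taken from the TikTok dataset:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data; the real analysis would use the loaded `data` frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "verified_status": rng.choice(["verified", "not verified"], size=500, p=[0.1, 0.9]),
    "author_ban_status": rng.choice(["active", "under review", "banned"], size=500),
})

# Contingency table of observed counts for the two categorical variables
contingency = pd.crosstab(toy["verified_status"], toy["author_ban_status"])

# Chi-square test of independence (H0: the two variables are independent)
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(contingency)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```

A small p-value here would suggest the two variables are associated; it would not say anything about the direction or cause of that association.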

### **1. Imports**

In [13]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [16]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")
# The dataset is cleaned: no missing values or duplicates.

### **2. Data Exploration & Cleaning**

In [14]:
data.head()

Unnamed: 0,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [17]:
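# Drop rows that contain missing values
data.dropna(inplace=True, axis=0)
# Drop duplicate rows, keeping the first occurrence
data.drop_duplicates(inplace=True, keep='first')
# Drop the `#` column, which is not used in the analysis
data.drop(columns=['#'], inplace=True)
# Summarize columns, dtypes, and non-null counts
data.info()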

<class 'pandas.core.frame.DataFrame'>
Index: 19084 entries, 0 to 19083
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   claim_status              19084 non-null  object 
 1   video_id                  19084 non-null  int64  
 2   video_duration_sec        19084 non-null  int64  
 3   video_transcription_text  19084 non-null  object 
 4   verified_status           19084 non-null  object 
 5   author_ban_status         19084 non-null  object 
 6   video_view_count          19084 non-null  float64
 7   video_like_count          19084 non-null  float64
 8   video_share_count         19084 non-null  float64
 9   video_download_count      19084 non-null  float64
 10  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(2), object(4)
memory usage: 1.7+ MB


In [12]:
# Descriptive statistics for the numeric columns
data[["video_duration_sec", "video_view_count",
      "video_like_count", "video_share_count",
      "video_download_count", "video_comment_count"]].describe().round(2).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
video_duration_sec,19084.0,32.42,16.23,5.0,18.0,32.0,47.0,60.0
video_view_count,19084.0,254708.56,322893.28,20.0,4942.5,9954.5,504327.0,999817.0
video_like_count,19084.0,84304.64,133420.55,0.0,810.75,3403.5,125020.0,657830.0
video_share_count,19084.0,16735.25,32036.17,0.0,115.0,717.0,18222.0,256130.0
video_download_count,19084.0,1049.43,2004.3,0.0,7.0,46.0,1156.25,14994.0
video_comment_count,19084.0,349.31,799.64,0.0,1.0,9.0,292.0,9599.0
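
In the summary above, the means of the count columns sit far above their medians (for `video_view_count`, 254,708.56 versus 9,954.5), which suggests strong right skew. A minimal sketch of how that could be quantified, using made-up log-normal data as a stand-in for a count column (the `views` series is hypothetical and not drawn from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for a count column
rng = np.random.default_rng(42)
views = pd.Series(rng.lognormal(mean=9, sigma=2, size=5_000), name="views")

# A mean well above the median and a positive skewness both indicate right skew
summary = pd.Series({
    "mean": views.mean(),
    "median": views.median(),
    "skewness": views.skew(),
    "skewness_after_log1p": np.log1p(views).skew(),
})
print(summary.round(2))
```

The last entry shows one common remedy: a `log1p` transform typically pulls the skewness of such a variable much closer to zero.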


### **3. Hypothesis Testing** 

* You are interested in the relationship between **`verified_status`** and **`video_view_count`**. 
* One approach is to examine the mean values of **`video_view_count`** for each group of **`verified_status`** in the sample data.

In [19]:
# Compute the mean `video_view_count` for each group in `verified_status`
data.groupby("verified_status")["video_view_count"].mean().round(2)

verified_status
not verified    265663.79
verified         91439.16
Name: video_view_count, dtype: float64

#### **3.1. Formulate Hypothesis**

* **Null & Alternative Hypotheses:** 
   * **Null Hypothesis:**  
   There is no difference in the mean number of views between TikTok videos posted by **`verified accounts`** and TikTok videos posted by **`unverified accounts`**  
   (any observed difference in the sample data is due to chance or sampling variability).  
   * **Alternative Hypothesis:**  
   There is a difference in the mean number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts  
   (any observed difference in the sample data is due to an actual difference in the corresponding population means).
* **Your goal in this step is to conduct a `two-sample t-test`**  
* **The steps for conducting a hypothesis test:**
    *  State the null hypothesis and the alternative hypothesis
    *  Choose a significance level
    *  Find the p-value
    *  Reject or fail to reject the null hypothesis
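
The four steps above can be sketched as a small helper. This is an illustration under stated assumptions: the function name `welch_t_test_decision`, its default significance level, and the synthetic samples are all hypothetical, and Welch's t-test (`equal_var=False`) is used so that equal group variances need not be assumed.

```python
import numpy as np
from scipy import stats

def welch_t_test_decision(sample_a, sample_b, alpha=0.05):
    """Two-sided Welch's t-test; returns (t statistic, p-value, decision)."""
    # Step 1: H0: mean(a) == mean(b); H1: mean(a) != mean(b)
    # Step 2: the significance level is `alpha`
    # Step 3: find the p-value
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    # Step 4: reject or fail to reject H0
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    return result.statistic, result.pvalue, decision

# Synthetic samples for illustration only (not the TikTok data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=100, scale=15, size=300)
group_b = rng.normal(loc=80, scale=25, size=120)

t_stat, p_value, decision = welch_t_test_decision(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4g} -> {decision}")
```

Fixing `alpha` before looking at the p-value, as the function signature encourages, keeps the decision rule independent of the observed result.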

In [21]:
# We choose 5% as the significance level and proceed with a two-sample t-test.
# Conduct a two-sample t-test to compare means
# Save each sample in a variable
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

# Run Welch's t-test on the two samples (equal_var=False does not assume equal group variances)
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)

TtestResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)

* Since the p-value is extremely small (much smaller than the significance level of 5%), you reject the null hypothesis.  
* You conclude that there **is a statistically significant** difference in the mean video view count between verified and unverified accounts on TikTok.
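
For reference, with `equal_var=False` scipy computes Welch's t-test: the statistic is the difference in sample means divided by `sqrt(s1^2/n1 + s2^2/n2)`, and the degrees of freedom come from the Welch-Satterthwaite approximation, which is why the `df` in the output above is not an integer. A minimal sketch on synthetic samples (the arrays `x1` and `x2` are made up) that reproduces scipy's values by hand:

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal sizes and variances (not the TikTok data)
rng = np.random.default_rng(7)
x1 = rng.normal(loc=50, scale=10, size=400)
x2 = rng.normal(loc=45, scale=20, size=80)

# Variance of each sample mean (sample variance uses ddof=1)
v1 = x1.var(ddof=1) / x1.size
v2 = x2.var(ddof=1) / x2.size

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
df_manual = (v1 + v2) ** 2 / (v1 ** 2 / (x1.size - 1) + v2 ** 2 / (x2.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

result = stats.ttest_ind(a=x1, b=x2, equal_var=False)
print(f"manual: t = {t_manual:.4f}, df = {df_manual:.2f}, p = {p_manual:.4g}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4g}")
```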


#### **3.2. Interpret & communicate**

- The analysis shows that there is a statistically significant difference in the average view counts between videos from verified accounts and videos from unverified accounts (sample means of about 91,439 and 265,664 views, respectively). This suggests there might be fundamental behavioral differences between these two groups of accounts.
- It would be interesting to investigate the root cause of this behavioral difference. For example, do unverified accounts tend to post **more clickbait-y videos**?  
Or are unverified accounts associated with **spam bots** that help inflate view counts?
- The next step will be to build a regression model with `verified_status` as the outcome.  
- A regression model is the natural next step because the end goal is to make predictions on claim status, and a model for `verified_status` can help analyze user behavior in the group of verified users.  
- Technical note for preparing the regression model: because `verified_status` is a binary categorical variable, a logistic regression model is the appropriate choice. The skew in the count variables and the significant difference between account types should also be taken into account when preparing the data.
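
A minimal sketch of what such a logistic regression could look like, assuming a binary verified/not-verified outcome and a log-transformed view count as the only predictor. The data below is synthetic, the column names `verified` and `log_views` are hypothetical, and the coefficients used to generate the data are arbitrary; a real model would be fit on the cleaned `data` frame with predictors chosen during feature selection.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (not the TikTok dataset)
rng = np.random.default_rng(3)
n = 2_000
view_count = rng.lognormal(mean=10, sigma=2, size=n)
log_views = np.log1p(view_count)  # log transform reduces right skew

# Arbitrary generating model: more views -> lower probability of being verified
prob_verified = 1 / (1 + np.exp(-(2.0 - 0.4 * log_views)))
toy = pd.DataFrame({
    "log_views": log_views,
    "verified": rng.binomial(1, prob_verified),
})

# Logistic regression of the binary outcome on the log-transformed predictor
model = smf.logit("verified ~ log_views", data=toy).fit(disp=False)
print(model.params.round(3))
```

The fitted coefficient on `log_views` is on the log-odds scale; exponentiating it gives the multiplicative change in the odds of being verified per one-unit increase in `log_views`.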