# Hypothesis Testing

**The purpose** of this notebook is to prepare, create, and analyze hypothesis tests.

**The goal** is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.

---

*This notebook has four parts:*

**Part 1:** Imports and Load Data

**Part 2:** Data Exploration

**Part 3:** Hypothesis Testing

**Part 4:** Results/Insights

## Part 1: Imports and Load Data

In [24]:
# Import packages for data manipulation
import numpy as np
import pandas as pd

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

In [25]:
# Load dataset into dataframe
data = pd.read_csv('data/tiktok_dataset.csv')

## Part 2: Data Exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).

In [26]:
# Display the first few rows
data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


In [27]:
# Generate a table of descriptive statistics about the data
data.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


In [28]:
# Check for missing values
data.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [29]:
# Drop rows with missing values
data = data.dropna()
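
Before dropping rows, it can help to check how much of the dataset is affected. The sketch below does this on a small made-up frame (`toy` and its values are hypothetical, not the TikTok data) by comparing row counts before and after `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "claim_status": ["claim", None, "opinion", "claim"],
    "video_view_count": [100.0, np.nan, 250.0, 75.0],
    "verified_status": ["verified", "not verified", "not verified", "verified"],
})

rows_before = len(toy)
toy_clean = toy.dropna()  # drops every row that has at least one missing value
rows_dropped = rows_before - len(toy_clean)
share_dropped = rows_dropped / rows_before

print(f"Dropped {rows_dropped} of {rows_before} rows ({share_dropped:.1%})")
# Dropped 1 of 4 rows (25.0%)
```

In this notebook the same check would report 298 of 19,382 rows (about 1.5%), which matches the gap between the `count` values in the descriptive statistics table.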

In [30]:
# Display first few rows after handling missing values
data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


The leadership team is interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in the sample data.

In [31]:
# Compute the mean `video_view_count` for each group in `verified_status`
data.groupby('verified_status')['video_view_count'].mean().reset_index(name='average_video_view_count')

| | verified_status | average_video_view_count |
|---|---|---|
| 0 | not verified | 265663.785339 |
| 1 | verified | 91439.164167 |
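
View counts in this dataset are strongly skewed (the descriptive statistics show a mean of about 254,709 against a median of 9,954.5), so a group mean alone can hide a lot. As a rough sketch, the group size and median can be reported next to the mean in one call. The `toy` frame below is hypothetical and only illustrates the pattern.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    "verified_status": ["verified", "verified",
                        "not verified", "not verified", "not verified"],
    "video_view_count": [100.0, 300.0, 50.0, 150.0, 1000.0],
})

# Group size, mean, and median of view count for each verification status
summary = (
    toy.groupby("verified_status")["video_view_count"]
       .agg(["count", "mean", "median"])
       .reset_index()
)
print(summary)
```

A large gap between a group's mean and median suggests that a few very popular videos are pulling the mean upward.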


## Part 3: Hypothesis Testing

Steps to conduct a hypothesis test (two-sample t-test):

1. State the null hypothesis and the alternative hypothesis
2. Choose a significance level
3. Find the p-value
4. Reject or fail to reject the null hypothesis
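
The steps above can be sketched as one small helper. This is only an illustration on synthetic data: `run_two_sample_ttest`, `group_a`, and `group_b` are hypothetical names, and the samples are randomly generated rather than taken from the dataset.

```python
import numpy as np
from scipy import stats


def run_two_sample_ttest(sample_a, sample_b, alpha=0.05):
    """Steps 2-4: take a significance level, find the p-value, decide.

    Step 1 (stating the hypotheses) happens before calling this helper:
    H0 says the two population means are equal, HA says they differ.
    """
    t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)
    return {
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "reject_null": bool(p_value < alpha),
    }


# Synthetic samples with clearly different means (assumed values)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=10.0, size=200)
group_b = rng.normal(loc=120.0, scale=25.0, size=200)

result = run_two_sample_ttest(group_a, group_b)
print(result)
```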

### Step 1

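Null Hypothesis: There is **no** difference in mean video views between verified and unverified accounts; any difference observed in the sample is due to chance or sampling variability.

Alternative Hypothesis: There is **a** difference in mean video views between verified and unverified accounts; the difference observed in the sample reflects an actual difference in the population.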
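
In symbols, one way to write these two-sided hypotheses is shown below, where $\mu_{\text{verified}}$ and $\mu_{\text{not verified}}$ (notation introduced here, not in the original notebook) denote the population mean of `video_view_count` for each group:

$$
H_0:\ \mu_{\text{verified}} = \mu_{\text{not verified}}
\qquad
H_A:\ \mu_{\text{verified}} \neq \mu_{\text{not verified}}
$$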

### Step 2

In [36]:
# Choose 5% as the significance level
significance_level = 0.05

### Step 3

In [40]:
# Conduct a two-sample t-test to compare means
# (equal_var=False selects Welch's t-test, which does not assume equal variances)
data_verified = data[data['verified_status']=='verified']
data_not_verified = data[data['verified_status']=='not verified']

tstat, pvalue = stats.ttest_ind(a=data_verified['video_view_count'], b=data_not_verified['video_view_count'], equal_var=False)

In [41]:
print(f't-statistic: {tstat:f}\np-value: {pvalue:f}')

t-statistic: -25.499442
p-value: 0.000000
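
The sign of the statistic follows the argument order: `a` is the verified group, whose sample mean is lower, so the statistic is negative. To make explicit what `equal_var=False` computes, the sketch below reproduces Welch's t-statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand on synthetic samples, then compares them with `stats.ttest_ind`. The samples `a` and `b` are randomly generated assumptions, not the notebook's data.

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal sizes and variances (assumed values)
rng = np.random.default_rng(42)
a = rng.normal(loc=90.0, scale=15.0, size=120)
b = rng.normal(loc=100.0, scale=30.0, size=400)

n_a, n_b = len(a), len(b)
var_a, var_b = a.var(ddof=1), b.var(ddof=1)  # unbiased sample variances

# Welch's t-statistic: difference in means over its standard error
se = np.sqrt(var_a / n_a + var_b / n_b)
t_manual = (a.mean() - b.mean()) / se

# Welch-Satterthwaite approximation for the degrees of freedom
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```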


### Step 4

In [46]:
# Reject or fail to reject the null hypothesis
if pvalue < significance_level:
    print(f'Since the p-value {pvalue:f} is less than the significance level {significance_level}, we reject the null hypothesis.')
else:
    print(f'Since the p-value {pvalue:f} is greater than the significance level {significance_level}, we fail to reject the null hypothesis.')

Since the p-value 0.000000 is less than the significance level 0.05, we reject the null hypothesis.


Rejecting the null hypothesis means that there is a statistically significant difference in the mean video view count between verified and unverified accounts.
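
A printed p-value of `0.000000` does not mean the p-value is exactly zero; the `:f` format rounds to six decimal places. As a small sketch, scientific notation keeps the magnitude visible. The value `p_small` below is hypothetical and is not the p-value from this test.

```python
# Hypothetical very small p-value (NOT the value computed in this notebook)
p_small = 3.2e-120

fixed = f"{p_small:f}"   # fixed-point, six decimals
sci = f"{p_small:.3e}"   # scientific notation, three decimals

print(fixed)  # 0.000000
print(sci)    # 3.200e-120
```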

## Part 4: Results/Insights

The hypothesis test shows a statistically significant difference in the mean video view count between verified and unverified accounts. In this sample, unverified accounts average more views per video than verified accounts. The test by itself does not establish a cause, so it would be worth investigating what drives the difference.
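
Statistical significance says nothing about how large the gap is. As a quick arithmetic sketch, the two sample means reported in Part 2 can be compared directly (the variable names are illustrative):

```python
# Sample means of video_view_count taken from the groupby output in Part 2
mean_not_verified = 265663.785339
mean_verified = 91439.164167

diff = mean_not_verified - mean_verified   # absolute gap in mean views
ratio = mean_not_verified / mean_verified  # relative gap

print(f"Difference in means: {diff:,.0f} views")  # Difference in means: 174,225 views
print(f"Ratio of means: {ratio:.2f}")             # Ratio of means: 2.91
```

In the sample, videos from unverified accounts average roughly 2.9 times the views of videos from verified accounts.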

A possible next step is to build a regression model on `verified_status`. Investigating how `verified_status` relates to the other variables can reveal more about user behavior and may help us predict claim status.