# **TikTok Project**
**Course 4 - The Power of Statistics**

You are a data professional at TikTok. The current project is reaching its midpoint; a project proposal, Python coding work, and exploratory data analysis have all been completed.

The team has reviewed the results of the exploratory data analysis and the previous executive summary the team prepared. You received an email from Orion Rainier, Data Scientist at TikTok, with your next assignment: determine and conduct the necessary hypothesis tests and statistical analysis for the TikTok classification project.

A notebook was structured and prepared to help you in this project. Please complete the following questions.


# **Course 4 End-of-course project: Data exploration and hypothesis testing**

In this activity, you will explore the data provided and conduct hypothesis testing.
<br/>

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze hypothesis tests.

**The goal** is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct hypothesis testing
* How will descriptive statistics help you analyze your data?

* How will you formulate your null hypothesis and alternative hypothesis?

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerge from your hypothesis test?

* What business recommendations do you propose based on your results?

<br/>

Follow the instructions and answer the questions below to complete the activity. Then, complete an executive summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.



# **Data exploration and hypothesis testing**

<img src="../Automatidata/images/Pace.png" width="100" height="100" align=left>

# **PACE stages**

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="../Automatidata/images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response.

1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.

There are many research questions I could pose for this data project. Here are a few:
* Is there a relationship between a video going viral and the verification status of the user?
* Is there a relationship between a user's verified status and their associated video view count?
* Is there a relationship between a user's verified status and their associated video like count?
* Do videos from verified accounts and videos from unverified accounts have different engagement counts?

Having considered all the above, I will focus on **the relationship between users' verified status and the associated video view count**.

*Complete the following steps to perform statistical analysis of your data:*

### **Task 1. Imports and Data Loading**

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

<details>
  <summary><h4><strong>Hint:</strong></h4></summary>

Be sure to import `pandas`, `numpy`, `matplotlib.pyplot`, `seaborn`, and `scipy`.

</details>

In [None]:
# Import packages for data manipulation
### YOUR CODE HERE ###
import pandas as pd
import numpy as np

# Import packages for data visualization
### YOUR CODE HERE ###
import matplotlib.pyplot as plt
import seaborn as sns
# Import packages for statistical analysis/hypothesis testing
### YOUR CODE HERE ###
from scipy import stats


Load the dataset.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# Load dataset into dataframe
from google.colab import drive
drive.mount('/content/drive')
path_file = '/content/drive/MyDrive/TikTok/tiktok_dataset.csv'
data = pd.read_csv(path_file)

Mounted at /content/drive


<img src="../Automatidata/images/Analyze.png" width="100" height="100" align=left>

<img src="../Automatidata/images/Construct.png" width="100" height="100" align=left>

## **PACE: Analyze and Construct**

Consider the questions in your PACE Strategy Document and those below to craft your response:
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


* Descriptive statistics provide a snapshot of the central tendency and dispersion of the data, including the count, mean, median, and standard deviation of each variable. At this stage of my analysis, this information helps me understand the mean value of `video_view_count` for each group of `verified_status` in the sample data.

### **Task 2. Data exploration**

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).



<details>
  <summary><h4><strong>Hint:</strong></h4></summary>

Refer back to *Self Review Descriptive Statistics* for this step-by-step process.

</details>

Inspect the first five rows of the dataframe.

In [None]:
# Display first few rows
### YOUR CODE HERE
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [None]:
# Generate a table of descriptive statistics about the data
### YOUR CODE HERE ###
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


Check for and handle missing values.

In [None]:
# Check for missing values
### YOUR CODE HERE ###
data.isna().mean().sort_values(ascending=False)

Unnamed: 0,0
claim_status,0.015375
video_transcription_text,0.015375
video_view_count,0.015375
video_like_count,0.015375
video_share_count,0.015375
video_download_count,0.015375
video_comment_count,0.015375
#,0.0
video_id,0.0
video_duration_sec,0.0


The same share of values (about 1.5%) is missing in `claim_status`, `video_transcription_text`, and all the **count** variables. Because this is a small fraction of the dataset, I have opted to handle the missing values by dropping the affected rows.

In [None]:
print(f"The size of the original dataset is {len(data)}")
# Drop rows with missing values
### YOUR CODE HERE ###
data = data.dropna()

print(f"The size of dataset after removing the missing observation is {len(data)}")


The size of the original dataset is 19382
The size of dataset after removing the missing observation is 19084
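The row counts above show that `dropna()` removed 298 of 19,382 rows (about 1.5%), the same share reported for each affected column, which is consistent with the missing values sitting in the same rows. A minimal sketch of how that could be checked directly, using a small made-up dataframe (`toy` is hypothetical and is not the TikTok data; the column names are borrowed only for readability):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 3 are missing in both columns
toy = pd.DataFrame({
    "video_id": [1, 2, 3, 4, 5],
    "claim_status": ["claim", np.nan, "opinion", np.nan, "claim"],
    "video_view_count": [100.0, np.nan, 250.0, np.nan, 40.0],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Share of rows with at least one missing value
rows_with_any_missing = toy.isna().any(axis=1).mean()

# If the missing values all sit in the same rows, the largest per-column
# share equals the share of rows with any missing value
same_rows = np.isclose(missing_share.max(), rows_with_any_missing)

toy_clean = toy.dropna()
print(missing_share)
print(f"Rows with any missing value: {rows_with_any_missing:.1%}")
print(f"Missing values concentrated in the same rows: {same_rows}")
print(f"Rows kept after dropna: {len(toy_clean)} of {len(toy)}")
```

If the two shares differed, dropping rows would discard more data than any single column's missing rate suggests, and that would be worth knowing before calling `dropna()`.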


In [None]:
# Display first few rows after handling missing values

### YOUR CODE HERE ###
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


You are interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in the sample data.

In [None]:
# Compute the mean `video_view_count` for each group in `verified_status`
### YOUR CODE HERE ###
data.groupby(["verified_status"])["video_view_count"].agg(["mean"]).rename(columns={"mean":"mean_video_view_count"})

verified_status,mean_video_view_count
not verified,265663.785339
verified,91439.164167



The sample means differ noticeably: videos from unverified accounts average about 265,664 views, while videos from verified accounts average about 91,439. In this sample, videos posted by unverified accounts attract more views on average than those posted by verified accounts. Further investigation may be warranted to understand the factors contributing to this disparity and to inform strategies for increasing engagement among verified accounts.
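View counts in this dataset are strongly right-skewed (in the `describe()` output, the overall median is about 9,955 while the overall mean is about 254,709), so it may help to look at the median alongside the mean for each group. A minimal sketch of reporting several statistics per group in one call, using a made-up dataframe rather than the TikTok data:

```python
import pandas as pd

# Hypothetical toy data, not the TikTok dataset
toy = pd.DataFrame({
    "verified_status": ["verified", "verified", "verified",
                        "not verified", "not verified", "not verified"],
    "video_view_count": [10.0, 20.0, 900.0, 15.0, 30.0, 3000.0],
})

# Count, mean, and median of views for each verification group
summary = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median"])
)
print(summary)
```

In the toy data a single large value pulls each group's mean far above its median, which is the pattern to watch for in the real `video_view_count` column.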

### **Task 3. Hypothesis testing**

Before you conduct your hypothesis test, consider the following questions where applicable to complete your code response:

1. Recall the difference between the null hypothesis and the alternative hypotheses. What are your hypotheses for this data project?

*   **Null hypothesis**: There is no difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to chance or sampling variability).

*    **Alternative hypothesis**: There is a difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to an actual difference in the corresponding population means).




Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis



**$H_0$**: There is no difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to chance or sampling variability).

**$H_A$**: There is a difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to an actual difference in the corresponding population means).


You choose 5% as the significance level and proceed with a two-sample t-test.

In [None]:
# Conduct a two-sample t-test to compare means
### YOUR CODE HERE ###
verified = data.loc[data['verified_status'] == "verified", "video_view_count"]
unverified = data.loc[data['verified_status'] == "not verified", "video_view_count"]


# Implement Welch's t-test (equal_var=False) using the two samples.
# `random_state` only matters for permutation tests, so it is omitted here.
stats.ttest_ind(a=verified, b=unverified, equal_var=False)

TtestResult(statistic=-25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)

**Question:** Based on the p-value you got above, do you reject or fail to reject the null hypothesis?


The p-value (p ≈ 2.61e-120) is far smaller than the 0.05 significance level, which is strong evidence against the null hypothesis of equal means. Therefore, we reject the null hypothesis and conclude that there **is** a statistically significant difference in the mean video view count between verified and unverified accounts on TikTok.
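To make `equal_var=False` concrete: it selects Welch's t-test, which does not assume equal population variances and estimates the degrees of freedom with the Welch-Satterthwaite formula. The sketch below computes the statistic by hand and compares it with `stats.ttest_ind`. The samples are synthetic, generated with an arbitrary seed purely for illustration; they are not the TikTok data.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and variances (illustration only)
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=40)
b = rng.normal(loc=12.0, scale=5.0, size=200)

# Squared standard error of each sample mean
se2_a = a.var(ddof=1) / len(a)
se2_b = b.var(ddof=1) / len(b)

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
df_manual = (se2_a + se2_b) ** 2 / (
    se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# Same test through SciPy for comparison
result = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The hand-computed statistic and p-value should agree with SciPy's up to floating-point error. The non-integer `df` in the notebook's `TtestResult` output comes from the same degrees-of-freedom approximation.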


<img src="../Automatidata/images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

## **Step 4: Communicate insights with stakeholders**

*Ask yourself the following questions:*

1. What business insight(s) can you draw from the result of your hypothesis test?

In a business context, the result of the test can inform various decisions and strategies. Here are some potential business insights that can be drawn from the result:

- **Market Segmentation**: The analysis shows that there is a statistically significant difference in the average view counts between videos from verified accounts and videos from unverified accounts. This suggests there might be fundamental behavioral or preference differences between these two groups of accounts.

- **Resource Allocation**: Understanding the differences between the two groups can help TikTok allocate resources more effectively. For example, they can prioritize investments in areas that are more impactful or relevant to specific customer segments, thereby optimizing resource allocation and maximizing returns.

- It would be interesting to investigate the root cause of this behavioral difference. For example, do unverified accounts tend to post more clickbait-y videos? Or are unverified accounts associated with spam bots that help inflate view counts?

- The next step will be to build a regression model on `verified_status`. A regression model is the natural next step because the end goal is to make predictions on claim status, and a model for `verified_status` can help analyze how verified users behave. Technical note for preparing the model: because the data is skewed and there is a significant difference between account types, it will be key to build a logistic regression model.


**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.