# **Case Study** - *TikTok Accounts*

The Data Team at TikTok is working on the development of a predictive model that can determine whether a video contains a claim or offers an opinion.

As part of the milestones for the project, a Hypothesis Test could help the team to understand more about the population behavior and make a decision on the most relevant variables for the model.  

## **1. Setting Up the Environment**

As a first step in this case study, we will prepare the environment for our analysis and load our data.

In [None]:
# First, let's import the libraries that are essential for the analysis.

import pandas as pd
from scipy import stats

In [None]:
# Load dataset into dataframe
views_data = pd.read_csv("tiktok_dataset.csv")

# NOTE: The CSV is referenced by file name only, so the code runs as long as the file is stored in the working directory.

## **2. Descriptive Analysis & Data Preparation**

Now that everything is set up, we give the dataset a quick descriptive analysis. This lets us understand the data, check for missing values, examine the data structure and distribution and, for this specific case study, compute the mean of ***video_view_count*** for each account type to see whether the difference is relevant for the analysis.

We will also prepare the dataset, which can include transforming data types, dropping duplicates or missing values, and creating new columns if necessary.

In [None]:
# See the first rows of our data set and familiarize with the data.
views_data.head()

In [None]:
# Table of descriptive measures of the data set
views_data.describe()

In [None]:
# Check for rows where claim_status is missing
views_data[views_data["claim_status"].isna()]

In [None]:
# Drop rows where claim_status is missing
views_data = views_data[~views_data["claim_status"].isna()]
views_data.head()
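
The check above looks only at `claim_status`. As a hedged sketch of a broader check, the snippet below counts missing values in every column and then drops incomplete rows. The small `toy` data frame and its values are made up for illustration and are not taken from `tiktok_dataset.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "verified_status": ["verified", "not verified", "not verified", "verified"],
    "video_view_count": [1200.0, 530.0, np.nan, 88.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows that have no missing value in any column
toy_clean = toy.dropna(axis=0).reset_index(drop=True)
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Whether dropping rows is appropriate depends on how many are affected; if a large share of the data were missing, it would be worth investigating the cause first.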

In [None]:
# Create the data frames that will be used in the hypothesis test

verified_acc = views_data[views_data["verified_status"] == "verified"]
not_verified_acc = views_data[views_data["verified_status"] == "not verified"]

# Compare the mean view count of each group
views_data.groupby("verified_status")["video_view_count"].mean()

# Note: The mean views from not verified accounts (265,663) are greater than the mean views from verified accounts (91,439)

The code shows the following results:

```
not verified - 265663.785339
verified - 91439.164167
```

We notice that the mean views from not verified accounts (***265,663***) are greater than the mean views from verified accounts (***91,439***).



## **3. Hypothesis Testing**

Now that our dataset is ready, we can proceed with our hypothesis test.
Something I find really interesting while executing this step is recalling relevant concepts such as:

***Null Hypothesis:*** A statement that is assumed to be true unless there is convincing evidence to the contrary.

***Alternative Hypothesis:*** A statement that contradicts the null hypothesis and is accepted as true only if there is convincing evidence for it.

---

For this specific case study, the hypothesis would be:

***Null Hypothesis***

> There is no difference in the population mean views between verified and non-verified accounts. Any difference observed in the sample is due to random sampling.

***Alternative Hypothesis***

> There is a difference in the population mean views between verified and non-verified accounts. This is accepted only if the sample provides convincing evidence.


We set the **significance level at 5%** and proceed with a two-sample t-test.

In [None]:
# Run the two-sample t-test (equal_var=False does not assume equal variances in the two groups)

stats.ttest_ind(a=not_verified_acc["video_view_count"], b=verified_acc["video_view_count"], alternative="two-sided", equal_var=False)

# NOTE: I chose a two-sided test because the alternative hypothesis does not specify the direction of the difference.
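
With `equal_var=False`, `stats.ttest_ind` performs Welch's t-test, which does not assume that both groups share the same variance. As a minimal sketch of what that computes, the snippet below derives the statistic and the two-sided p-value by hand and compares them with SciPy's result. The samples are synthetic and purely illustrative (they are not the TikTok data), and `welch_t` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumption: illustrative only, not the TikTok data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=100.0, scale=30.0, size=200)
group_b = rng.normal(loc=90.0, scale=10.0, size=150)


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )
    return t_stat, dof


t_manual, dof = welch_t(group_a, group_b)
# Two-sided p-value: probability mass in both tails of the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

result = stats.ttest_ind(group_a, group_b, equal_var=False)
print("manual:", t_manual, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

The sign of the statistic depends on the order of the arguments: a positive value means the first sample has the larger mean.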

## **4. Results and Conclusion**

The code shows the following results:


```
Ttest_indResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120)
```

Recalling what a p-value is:
> The probability of observing results as extreme as, or more extreme than, those observed, assuming the null hypothesis is true.

Based on the result of the two-sided t-test, the p-value is far below the 5% significance level, so we can conclude that there is a statistically significant difference between the mean views of the verified accounts and the non-verified accounts.
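
The decision rule behind this conclusion can be sketched in a few lines, using the p-value reported above and the 5% significance level set in section 3. The variable names are illustrative.

```python
ALPHA = 0.05  # significance level chosen for the test

# p-value reported in the t-test output above
p_value = 2.6088823687177823e-120

# Reject the null hypothesis only when the p-value falls below the significance level
if p_value < ALPHA:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(decision)
```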


***We can confidently reject the null hypothesis and say that there is a difference in mean views between non-verified accounts and verified accounts.***


## **5. Next Steps**

As part of the analysis, here are some questions I would recommend analyzing further to contribute to the project's objectives:

*   Do the **not verified accounts** upload videos related to a specific topic?
*   Are **not verified accounts** using some kind of bot to increase the views whenever they upload a video?


