# **TikTok Project**
**Course 2 - Get Started with Python**

Welcome to the TikTok Project!

You have just started as a data professional at TikTok.

The team is still in the early stages of the project. You have received notice that TikTok's leadership team has approved the project proposal. To gain clear insights to prepare for a claims classification model, TikTok's provided data must be examined to begin the process of exploratory data analysis (EDA).



# **Course 2 End-of-course project: Inspect and analyze data**

In this activity, you will examine data provided and prepare it for analysis.
<br/>

**The purpose** of this project is to investigate and understand the data provided. This activity will:

1.   Acquaint you with the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA, hypothesis testing, and statistical analysis

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of your findings.
<br/>
*This activity has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided TikTok information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning and future exploratory data analysis (EDA) and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables

<br/>

To complete the activity, follow the instructions and answer the questions below. Then, you will use your responses to these questions and the questions included in the Course 2 PACE Strategy Document to create an executive summary.

Be sure to complete this activity before moving on to Course 3. You can assess your work by comparing the results to a completed exemplar after completing the end-of-course project.

# **Identify data types and compile summary information**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

# **PACE stages**



## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response:



### **Task 1. Understand the situation**

*   How can you best prepare to understand and organize the provided information?


*Begin by exploring your dataset and consider reviewing the Data Dictionary.*

To prepare, I would first review the data dictionary and the data fields to understand how the dataset is structured. I would then compute descriptive statistics for the columns I consider useful and preprocess the data for the EDA stage. To organize the information, I would import it into the working environment (a Jupyter notebook in this case), load it into a pandas dataframe, and check how the data is organized and what further changes are needed to keep it well organized.


## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### **Task 2a. Imports and data loading**

Start by importing the packages that you will need to load and explore the dataset. Make sure to use the following import statements:
*   `import pandas as pd`

*   `import numpy as np`


In [1]:
import pandas as pd

import numpy as np

Then, load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

In [None]:
# Load dataset into dataframe
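# Raw string (r"...") so that backslashes such as "\U" and "\t" are not treated as escape sequences
data = pd.read_csv(r"C:\Users\saswa\Documents\GitHub\TikTok-Claims-Classification\Data\tiktok_dataset.csv")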
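
Windows paths are easy to get wrong in ordinary Python string literals, because sequences such as `\U` and `\t` are read as escape sequences. The following is a minimal sketch of the issue and of two safe spellings. The folder names are hypothetical placeholders, not the project's actual location.

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only to illustrate the escaping issue.
raw_path = r"C:\example\data\tiktok_dataset.csv"         # raw string: backslashes kept as-is
escaped_path = "C:\\example\\data\\tiktok_dataset.csv"   # same path with doubled backslashes

# Both literals describe the same path.
same = raw_path == escaped_path

# In an ordinary literal, "\t" silently becomes a tab character.
broken = "data\tiktok_dataset.csv"

# PureWindowsPath parses Windows-style paths on any operating system.
file_name = PureWindowsPath(raw_path).name
print(same, "\t" in broken, file_name)
```

Forward slashes (`"C:/example/data/tiktok_dataset.csv"`) are another common option, since pandas accepts them on Windows as well.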

### **Task 2b. Understand the data - Inspect the data**

View and inspect summary information about the dataframe by **coding the following:**

1. `data.head(10)`
2. `data.info()`
3. `data.describe()`

*Consider the following questions:*

**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

In [3]:
data.head(10)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [5]:
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

**Answer:**

Each row represents a single TikTok video together with its metadata. In the first ten rows, every video is labeled `claim`, nine of the ten authors are not verified, eight accounts are active and two are under review, and the average video duration is approximately 35 seconds. These rows are only a small sample, so they do not necessarily reflect the dataset as a whole. The dataset captures several informative fields that together give a comprehensive view of each video.

The dataset contains 12 columns, each representing a different feature of the videos and their associated metadata:

#### **Column Breakdown:**

- **#**  
  - A sequential row number (1 to 19,382) that uniquely identifies each record, where each record is a video reported as containing a claim or an opinion.  
  - Type: `int64`  
  - Non-null values: Yes

- **claim_status**  
  - Indicates whether the video was classified as a claim or an opinion.  
  - Possible values: `claim`, `opinion`  
  - Type: `object`  
  - Null values: 298

- **video_id**  
  - The unique ID given to each TikTok video.  
  - Type: `int64`  
  - Non-null values: Yes

- **video_duration_sec**  
  - Represents the length of the video in seconds.  
  - Type: `int64`  
  - Non-null values: Yes

- **video_transcription_text**  
  - The transcribed text representing the content of the video, which is classified as either a claim or an opinion.  
  - Type: `object`  
  - Null values: 298

- **verified_status**  
  - Indicates whether the user who posted the video is verified.  
  - Possible values: `verified`, `not verified`  
  - Type: `object`  
  - Non-null values: Yes

- **author_ban_status**  
  - Represents the current status of the account that posted the video (whether the account is banned).  
  - Type: `object`  
  - Non-null values: Yes

- **video_view_count**  
  - The total number of times the video has been viewed.  
  - Type: `float64`  
  - Null values: 298

- **video_like_count**  
  - The total number of likes the video has received.  
  - Type: `float64`  
  - Null values: 298

- **video_share_count**  
  - The total number of times the video has been shared.  
  - Type: `float64`  
  - Null values: 298

- **video_download_count**  
  - The total number of times the video has been downloaded.  
  - Type: `float64`  
  - Null values: 298

- **video_comment_count**  
  - The total number of comments on the video.  
  - Type: `float64`  
  - Null values: 298


**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

**Answer:**

When reviewing the output of `data.info()`, I noticed the following:

- **Textual Information**: The columns containing textual data, such as `video_transcription_text`, are classified as `object` type in Pandas. This is because strings in Pandas are represented as `object` type.
  
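- **Numeric Data**: Some fields are of `float64` type, while others are `int64`. The `float64` fields hold discrete counts (e.g., view counts, like counts), and they are most likely stored as floats only because they contain null values, which a plain `int64` column cannot hold. They could be converted to an integer type without loss of information once the nulls are handled, or by using pandas' nullable `Int64` type.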

- **Null Values**: There are **298 null values** in 7 of the 12 columns, which are:
  - `claim_status`
  - `video_transcription_text`
  - `video_view_count`
  - `video_like_count`
  - `video_share_count`
  - `video_download_count`
  - `video_comment_count`
  
- **Numeric Variables**: Out of the 12 fields, **8 are numeric**, with 5 of them being `float64` and 3 being `int64`.

- **Null Values Pattern**: What stands out is that all seven fields listed above have exactly 19,084 non-null values. This suggests that the same 298 rows are missing all of these values together, although `data.info()` alone cannot confirm it; a row-level check is needed.
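
The points above can be checked directly. The following is a minimal sketch on a toy dataframe with made-up values (it does not use the real dataset): it counts nulls per column, tests whether the nulls fall on the same rows, and shows the nullable `Int64` dtype as one possible way to store counts that contain missing values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "video_id": [1, 2, 3, 4],
    "claim_status": ["claim", "opinion", np.nan, "claim"],
    "video_view_count": [100.0, 20.0, np.nan, 300.0],
    "video_like_count": [10.0, 2.0, np.nan, 30.0],
})

# Null count per column (complements the non-null counts from info()).
null_counts = toy.isna().sum()

# Do the nulls fall on the same rows? Count nulls per row among the affected columns.
affected = ["claim_status", "video_view_count", "video_like_count"]
nulls_per_row = toy[affected].isna().sum(axis=1)
rows_all_null = int((nulls_per_row == len(affected)).sum())
rows_partly_null = int(((nulls_per_row > 0) & (nulls_per_row < len(affected))).sum())

# The nullable "Int64" dtype stores whole numbers and keeps missing values as <NA>.
views_int = toy["video_view_count"].astype("Int64")

print(null_counts)
print(rows_all_null, rows_partly_null, views_int.dtype)
```

If `rows_partly_null` is zero on the real data, the missing values occur together on the same rows.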

---

This analysis provides insights into the data types and the null value distribution across the dataset, along with an observation regarding the potential interrelationship of missing values in specific fields.


**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

**Answer:**

When reviewing the `data.describe()` output, I observed the following insights about the distributions of the relevant variables:

- **Relevant Columns**: The columns included in the `describe()` output for analysis are:
  - `video_duration_sec`
  - `video_view_count`
  - `video_like_count`
  - `video_share_count`
  - `video_download_count`
  - `video_comment_count`

### Observations:
- **Video Duration (`video_duration_sec`)**:
  - The **mean** is 32.42 seconds, and the **median** is 32 seconds, suggesting that the distribution is roughly symmetric around this value.
  - The **standard deviation (std)** is 16.23, about half the mean, and the quartiles (18, 32, and 47 seconds) are spread almost evenly between the minimum (5) and the maximum (60). Durations therefore appear to be spread fairly evenly across the 5 to 60 second range rather than clustered near 32 seconds, and there are no obvious outliers.

- **Video View Count (`video_view_count`)**:
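  - The **mean** is 254,708, and the **std** is 322,893, which is larger than the mean, indicating **high variability** in the number of views across videos.
  - The **median (9,954.5)** is far below the mean, and the **75th percentile (504,327)** is roughly 50 times the median, so the distribution is strongly **right-skewed**: a subset of videos receives far more views than the majority. The **maximum (999,817)** is only about twice the 75th percentile, so this may reflect a large group of high-view videos rather than a few isolated outliers.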

- **Video Like Count (`video_like_count`)**:
  - The **std** of 133,420 is large relative to the mean, reflecting **high variability** in the like counts.
  - The distribution is **right-skewed**, with some videos having significantly higher likes than the majority. The gap between the **75th percentile (125,020)** and the **maximum (657,830)** confirms the presence of outliers.

- **Video Share Count (`video_share_count`)**:
  - The **std** is 32,036, which is large relative to the mean of 16,735, indicating **high variability** in shares.
  - The distribution appears to be **right-skewed**, with a few videos having significantly higher share counts compared to the majority.

- **Video Download Count (`video_download_count`)**:
  - The **std** is 2,004, which is significantly larger than the mean (1,049), highlighting **variability** in download counts.
  - There are a few videos with exceptionally high download counts, as evidenced by the gap between the **75th percentile (1,156.25)** and the **maximum value (14,994)**.

- **Video Comment Count (`video_comment_count`)**:
  - The **std** is 799.64, which is high compared to the mean (349.31), indicating **variability** in comment counts.
  - The distribution is also **right-skewed**, with a few videos receiving disproportionately higher comments.

### Outliers and Standard Deviation:
- For most columns, the **standard deviation is large**, particularly for `video_view_count`, `video_like_count`, and `video_share_count`. This indicates that the values are widely spread and influenced by **outliers**.
- The **high difference between the 75th percentile and the maximum values** in several columns supports the observation of **outliers**.

### Conclusion:
- There is noticeable **high variability** and **right-skewness** in key variables such as `video_view_count`, `video_like_count`, and `video_share_count`.
- **Outliers** are a significant factor in these distributions, as reflected by both the **high standard deviation** and the difference between quartiles and the maximum values. These should be carefully examined, as they can impact further analysis and model performance.
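
The skewness and outlier observations above can be quantified. The following is a minimal sketch on made-up view counts (not the real data) that compares mean and median, computes sample skewness, and applies the common 1.5 × IQR rule; the choice of that rule is an assumption, not something prescribed by the project.

```python
import pandas as pd

# Made-up view counts: mostly small values plus a few very large ones.
views = pd.Series([20, 50, 80, 120, 150, 200, 250, 300, 400_000, 900_000])

# A mean far above the median is one sign of right skew.
mean_above_median = views.mean() > views.median()
skewness = views.skew()  # positive for a right-skewed sample

# 1.5 * IQR rule: flag values above Q3 + 1.5 * (Q3 - Q1).
q1, q3 = views.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
high_outliers = views[views > upper_fence]

print(mean_above_median, round(skewness, 2), upper_fence)
print(high_outliers.tolist())
```

On the real dataset the same steps could be applied to each count column, for example `data['video_like_count']`.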

### **Task 2c. Understand the data - Investigate the variables**

In this phase, you will begin to investigate the variables more closely to better understand them.

You know from the project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the `claim_status` variable. Begin by determining how many videos there are for each different claim status.

In [6]:
# What are the different values for claim status and how many of each are in the data?

data.value_counts('claim_status')

claim_status
claim      9608
opinion    9476
Name: count, dtype: int64

**Question:** What do you notice about the values shown?


**Answer:**

In the `claim_status` field, there are two distinct types of values: `claim` and `opinion`. The counts are as follows:
- `claim = 9608`
- `opinion = 9476`

**Observations:**
1. The numbers of claims and opinions are quite close to each other, with a difference of only 132, so the two classes are nearly balanced.
2. Since `value_counts()` only considers non-null values, the counts mentioned above reflect only the non-null entries in the `claim_status` column.
3. The total number of rows in the dataset is 19,382. However, if we add the counts of `claim` and `opinion`, we get 19,084 rows, indicating that 298 rows have a null value for `claim_status`.

**Conclusion:**
- There are 298 null values in the `claim_status` column, which makes up a small portion of the total dataset.
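
To see the null entries alongside the labeled ones, `value_counts()` accepts `dropna=False`, and `normalize=True` returns proportions instead of counts. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Made-up labels with one missing value.
status = pd.Series(
    ["claim", "opinion", "claim", np.nan, "opinion", "claim"],
    name="claim_status",
)

# Default behaviour: nulls are dropped from the counts.
counts = status.value_counts()

# dropna=False adds a row for the missing values.
counts_with_nulls = status.value_counts(dropna=False)

# normalize=True gives each label's share of the non-null entries.
shares = status.value_counts(normalize=True)

print(counts_with_nulls)
print(shares)
```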

Next, examine the engagement trends associated with each different claim status.

Start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [7]:
# What is the average view count of videos with "claim" status?

# print(data.loc[data['claim_status'] == 'claim', 'video_view_count'].mean())

# print(round(data.loc[data['claim_status'] == 'claim', 'video_view_count'].mean()))

# This would round to the nearest integer, which might better reflect the true average in some scenarios.

claim_mask=data['claim_status']=='claim'

claim_mean=data.loc[claim_mask,'video_view_count'].mean()

claim_median=data.loc[claim_mask,'video_view_count'].median()

print(f'Claim\nmean: {claim_mean}\nmedian: {claim_median}')

Claim
mean: 501029.4527477102
median: 501555.0


In [8]:
# What is the average view count of videos with "opinion" status?
opinion_mask=data['claim_status']=='opinion'

opinion_mean=data.loc[opinion_mask,'video_view_count'].mean()

opinion_median=data.loc[opinion_mask,'video_view_count'].median()

print(f'Opinion\nmean: {opinion_mean}\nmedian: {opinion_median}')


Opinion
mean: 4956.43224989447
median: 4953.0


**Question:** What do you notice about the mean and median within each claim category?

**Answer:**

- The mean and median are very close to each other in both the `claim` and `opinion` categories.
- This suggests that the data within each category is fairly symmetrical and does not contain extreme outliers that could skew the results significantly.
- However, there is a **vast discrepancy in view counts** between videos labeled as `claims` and videos labeled as `opinions`.

**Conclusion:**
- The close proximity between the mean and median values indicates that the distributions of both `claim` and `opinion` are likely balanced without substantial outliers.
- The significant difference in view counts between `claim` and `opinion` videos highlights a potential pattern that warrants further investigation.
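
The same comparison can be produced in one step with `groupby()` instead of separate Boolean masks. The following is a sketch on a toy dataframe with made-up values; the column `mean_minus_median` is a hypothetical helper added only for illustration.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [400_000.0, 500_000.0, 600_000.0, 4_000.0, 5_000.0, 6_000.0],
})

# Mean and median view count for every claim status at once.
summary = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])

# A small gap between mean and median hints at a roughly symmetric distribution.
summary["mean_minus_median"] = summary["mean"] - summary["median"]

print(summary)
```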

Now, examine trends associated with the ban status of the author.

Use `groupby()` to calculate how many videos there are for each combination of categories of claim status and author ban status.

In [9]:
# Get counts for each group combination of claim status and author ban status
data.groupby(['claim_status', 'author_ban_status']).size()


claim_status  author_ban_status
claim         active               6566
              banned               1439
              under review         1603
opinion       active               8817
              banned                196
              under review          463
dtype: int64

**Question:** What do you notice about the number of claims videos with banned authors? Why might this relationship occur?

**Answer:**  

- I noticed that the number of videos from banned authors is much higher in the `claim` category (1,439) than in the `opinion` category (196). The same holds for authors under review (1,603 versus 463).  
- This relationship may occur because claims are often made by unsourced or untrusted individuals. As a result, a significant portion of these claims may come from accounts that have been banned, possibly due to spreading misinformation or violating platform policies.  
- Additionally, it is possible that claim videos are more strictly policed than opinion videos, or that authors posting claims are required to comply with stricter rules compared to those posting opinions.  
- It is important to note that there is no way of determining if claim videos are inherently more likely than opinion videos to result in author bans or if the authors of claim videos are more likely to post content violating platform terms.  

**Conclusion:**  
- The higher number of banned users in the `claim` category may indicate a trend where accounts making claims are more likely to be flagged and banned for policy violations.  
- While this data highlights the relationship between banned/active authors and claims/opinions, it does not provide evidence to conclude whether specific banned videos directly caused the bans. Banned authors may have also posted content that fully complied with the platform's terms of service.  
- The presence of claim videos does not necessarily imply that these videos violated platform policies. Authors could have been banned for reasons unrelated to the claim videos, making it important to avoid assuming a direct causal relationship between claim videos and author bans based on the data.
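
Because the two claim categories contain slightly different numbers of videos, proportions are easier to compare than raw counts. The sketch below rebuilds the counts from the `groupby()` output above and converts them to row shares.

```python
import pandas as pd

# Counts copied from the claim_status x author_ban_status output above.
counts = pd.DataFrame(
    {"active": [6566, 8817], "banned": [1439, 196], "under review": [1603, 463]},
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Divide each row by its total to get the share of each ban status per claim status.
row_shares = counts.div(counts.sum(axis=1), axis=0)

print(row_shares.round(3))
```

About 15% of claim videos come from banned authors, compared with about 2% of opinion videos. On the full dataframe, `pd.crosstab(data['claim_status'], data['author_ban_status'], normalize='index')` should give the same shares.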

Continue investigating engagement levels, now focusing on `author_ban_status`.

Calculate the median video share count of each author ban status.

In [10]:
# What's the median video share count of each author ban status?
data.groupby(['author_ban_status'])['video_share_count'].median()

author_ban_status
active            437.0
banned          14468.0
under review     9444.0
Name: video_share_count, dtype: float64

**Question:** What do you notice about the share count of banned authors, compared to that of active authors? Explore this in more depth.

Use `groupby()` to group the data by `author_ban_status`, then use `agg()` to get the count, mean, and median of each of the following columns:
* `video_view_count`
* `video_like_count`
* `video_share_count`

Remember, the argument for the `agg()` function is a dictionary whose keys are columns. The values for each column are a list of the calculations you want to perform.

In [11]:
data.groupby(['author_ban_status']).agg({'video_view_count':['count','mean','median'],'video_like_count':['count','mean','median'],'video_share_count':['count','mean','median']})

author_ban_status,video_view_count (count),video_view_count (mean),video_view_count (median),video_like_count (count),video_like_count (mean),video_like_count (median),video_share_count (count),video_share_count (mean),video_share_count (median)
active,15383,215927.039524,8616.0,15383,71036.533836,2222.0,15383,14111.466164,437.0
banned,1635,445845.439144,448201.0,1635,153017.236697,105573.0,1635,29998.942508,14468.0
under review,2066,392204.836399,365245.5,2066,128718.050339,71204.5,2066,25774.696999,9444.0


In [12]:
data.groupby(['claim_status','author_ban_status']).agg({'video_view_count':['count','mean','median'],'video_like_count':['count','mean','median'],'video_share_count':['count','mean','median']})

claim_status,author_ban_status,video_view_count (count),video_view_count (mean),video_view_count (median),video_like_count (count),video_like_count (mean),video_like_count (median),video_share_count (count),video_share_count (mean),video_share_count (median)
claim,active,6566,499221.733171,499500.5,6566,164960.302924,121943.0,6566,32769.101889,17774.5
claim,banned,1439,505907.917304,512572.0,1439,173719.102849,132044.0,1439,34056.580959,19018.0
claim,under review,1603,504054.640674,500774.0,1603,165566.95446,125882.0,1603,33155.623206,18084.0
opinion,active,8817,4958.120563,4958.0,8817,1091.714982,820.0,8817,217.16695,121.0
opinion,banned,196,4876.530612,5083.5,196,1027.515306,799.5,196,208.423469,108.5
opinion,under review,463,4958.105832,4884.0,463,1139.663067,876.0,463,220.431965,124.0


**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?

**Answer:**

I noticed the following differences between banned and active authors in terms of views, likes, and shares:

- **Views:**
  - Banned users have a higher mean (445,845.44) and median (448,201.0) of views compared to active users (mean: 215,927.04, median: 8,616.0). However, banned users account for far fewer videos (1,635 versus 15,383).
  
- **Likes:**
  - Banned users have a higher mean (153,017.24) and median (105,573.0) of likes compared to active users (mean: 71,036.53, median: 2,222.0).

- **Shares:**
  - Again, banned users have a higher mean (29,998.94) and median (14,468.0) of shares compared to active users (mean: 14,111.47, median: 437.0).

**Conclusion:**
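- Although banned authors account for far fewer videos, their videos tend to show higher engagement per video (mean and median) than those of active authors. The breakdown by claim status shows that, within each claim status, engagement is similar across ban statuses, so this gap largely reflects the higher share of claim videos among banned authors. This could suggest that controversial content (such as claims) tends to garner more attention, possibly due to its nature or sensationalism.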
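
One way to see where the gap comes from is to rebuild each ban status's overall mean view count as a size-weighted average of its claim and opinion means. The sketch below uses the group sizes and means from the `claim_status` by `author_ban_status` table above; the helper functions are hypothetical names introduced for illustration.

```python
# Group sizes and mean view counts copied from the table above:
# ban status -> claim status -> (number of videos, mean view count)
groups = {
    "active": {"claim": (6566, 499221.733171), "opinion": (8817, 4958.120563)},
    "banned": {"claim": (1439, 505907.917304), "opinion": (196, 4876.530612)},
}

def overall_mean(parts):
    """Weighted mean of subgroup means, weighted by subgroup size."""
    total = sum(n for n, _ in parts.values())
    return sum(n * mean for n, mean in parts.values()) / total

def claim_share(parts):
    """Fraction of the group's videos that are claims."""
    total = sum(n for n, _ in parts.values())
    return parts["claim"][0] / total

active_mean = overall_mean(groups["active"])
banned_mean = overall_mean(groups["banned"])

print(round(active_mean, 2), round(claim_share(groups["active"]), 3))
print(round(banned_mean, 2), round(claim_share(groups["banned"]), 3))
```

The weighted averages reproduce the overall means reported for active and banned authors. Claims make up roughly 43% of videos from active authors but roughly 88% of videos from banned authors, which is consistent with the higher overall engagement figures for banned authors.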




Now, create three new columns to help better understand engagement rates:
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [13]:
# Create a likes_per_view column
data['likes_per_view']=data['video_like_count']/data['video_view_count']

# Create a comments_per_view column
data['comments_per_view']=data['video_comment_count']/data['video_view_count']

# Create a shares_per_view column
data['shares_per_view']=data['video_share_count']/data['video_view_count']

data.head(10)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count,likes_per_view,comments_per_view,shares_per_view
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0,0.056584,0.0,0.000702
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0,0.549096,0.004855,0.135111
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0,0.108282,0.000365,0.003168
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0,0.548459,0.001335,0.079569
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0,0.62291,0.002706,0.073175
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0,0.521454,0.005516,0.185069
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0,0.647958,0.007258,0.258429
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0,0.001958,2e-05,9.1e-05
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0,0.409364,0.001088,0.042306
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0,0.183612,0.002727,0.072714


Use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group.

In [14]:
data.groupby(['claim_status','author_ban_status']).agg({'likes_per_view':['count','mean','median'],
  'comments_per_view':['count','mean','median'],
  'shares_per_view':['count','mean','median']})


claim_status,author_ban_status,likes_per_view (count),likes_per_view (mean),likes_per_view (median),comments_per_view (count),comments_per_view (mean),comments_per_view (median),shares_per_view (count),shares_per_view (mean),shares_per_view (median)
claim,active,6566,0.329542,0.326538,6566,0.001393,0.000776,6566,0.065456,0.049279
claim,banned,1439,0.345071,0.358909,1439,0.001377,0.000746,1439,0.067893,0.051606
claim,under review,1603,0.327997,0.320867,1603,0.001367,0.000789,1603,0.065733,0.049967
opinion,active,8817,0.219744,0.21833,8817,0.000517,0.000252,8817,0.043729,0.032405
opinion,banned,196,0.206868,0.198483,196,0.000434,0.000193,196,0.040531,0.030728
opinion,under review,463,0.226394,0.228051,463,0.000536,0.000293,463,0.044472,0.035027
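
Passing a dictionary of lists to `agg()` returns two-level column labels (column, statistic), which can be awkward to read or export. As an alternative, pandas' named aggregation produces flat column names. The following is a sketch on a toy dataframe with made-up values; the output column names are illustrative choices.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion"],
    "author_ban_status": ["active", "active", "active", "banned"],
    "video_view_count": [1000.0, 2000.0, 100.0, 200.0],
    "video_like_count": [500.0, 400.0, 20.0, 50.0],
})
toy["likes_per_view"] = toy["video_like_count"] / toy["video_view_count"]

# Named aggregation: new_column=(source_column, function)
flat = (
    toy.groupby(["claim_status", "author_ban_status"])
    .agg(
        n_videos=("likes_per_view", "count"),
        likes_per_view_mean=("likes_per_view", "mean"),
        likes_per_view_median=("likes_per_view", "median"),
    )
    .reset_index()
)

print(flat)
```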


In [15]:
data.groupby(['claim_status','author_ban_status']).agg({'video_view_count':['count','mean','median'],'video_like_count':['count','mean','median'],'video_share_count':['count','mean','median'],
'video_comment_count':['count','mean','median'],})

claim_status,author_ban_status,video_view_count (count),video_view_count (mean),video_view_count (median),video_like_count (count),video_like_count (mean),video_like_count (median),video_share_count (count),video_share_count (mean),video_share_count (median),video_comment_count (count),video_comment_count (mean),video_comment_count (median)
claim,active,6566,499221.733171,499500.5,6566,164960.302924,121943.0,6566,32769.101889,17774.5,6566,687.827901,286.0
claim,banned,1439,505907.917304,512572.0,1439,173719.102849,132044.0,1439,34056.580959,19018.0,1439,698.402363,296.0
claim,under review,1603,504054.640674,500774.0,1603,165566.95446,125882.0,1603,33155.623206,18084.0,1603,698.336245,279.0
opinion,active,8817,4958.120563,4958.0,8817,1091.714982,820.0,8817,217.16695,121.0,8817,2.696609,1.0
opinion,banned,196,4876.530612,5083.5,196,1027.515306,799.5,196,208.423469,108.5,196,2.311224,1.0
opinion,under review,463,4958.105832,4884.0,463,1139.663067,876.0,463,220.431965,124.0,463,2.87689,1.0


**Question:**

How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.

**Answer:**

The data comparison between claim and opinion videos reveals several patterns in terms of views, comments, likes, and shares across different groups (active, banned, and under review). Here are the key observations:

#### **Count (Common in Views, Comments, Likes, and Shares):**
- Claims outnumber opinions overall (9,608 vs. 9,476 videos). However, the claim-active group (6,566) is smaller than the opinion-active group (8,817).
- The opinion-banned (196) and opinion-under review (463) groups are much smaller than the claim-banned (1,439) and claim-under review (1,603) groups.

#### **Views:**
- **Mean:**
  - Mean views for the claim-active, claim-banned, and claim-under review groups (roughly 499,000 to 506,000) are about 100 times those of the corresponding opinion groups (roughly 4,900 to 5,000).
- **Median:**
  - The medians show the same gap: roughly 500,000 to 513,000 views for the claim groups versus roughly 4,900 to 5,100 for the opinion groups.

#### **Comments:**
- **Mean:**
  - The mean number of comments for each claim group (roughly 690 to 700) is much greater than for the corresponding opinion groups (between 2 and 3).
- **Median:**
  - The comment median follows the same pattern: 279 to 296 for the claim groups versus 1 for every opinion group.

#### **Comments per View:**
- **Mean:**
  - The mean of comments per view for all three claim groups is higher than for all three opinion groups.
- **Median:**
  - The median of comments per view follows the same trend, with the claim groups having higher values than the opinion groups.

#### **Likes:**
- **Mean:**
  - The like mean for each claim group (roughly 165,000 to 174,000) is much greater than for the corresponding opinion groups (roughly 1,030 to 1,140).
- **Median:**
  - The like median mirrors the mean: roughly 122,000 to 132,000 for the claim groups versus roughly 800 to 880 for the opinion groups.

#### **Likes per View:**
- **Mean:**
  - The mean of likes per view for all three claim groups is higher than for the opinion groups.
- **Median:**
  - The median of likes per view follows the same trend, with the claim groups having higher values than the opinion groups.

#### **Shares:**
- **Mean:**
  - The mean share count for each claim group (roughly 32,800 to 34,100) is much greater than for the corresponding opinion groups (roughly 210 to 220).
- **Median:**
  - The share median shows the same pattern: roughly 17,800 to 19,000 for the claim groups versus roughly 110 to 125 for the opinion groups.

#### **Shares per View:**
- **Mean:**
  - The mean of shares per view for all three claim groups is higher than for all three opinion groups.
- **Median:**
  - The median of shares per view follows the same trend, with the claim groups having higher values than the opinion groups.

### **Conclusion:**
- The claim category generally shows higher values for views, comments, likes, and shares, both in terms of the mean and median, compared to the opinion category. This trend is observed across different groups (active, banned, and under review). 
- The higher engagement in the claim category could indicate that claim videos tend to generate more interaction, possibly due to their controversial or attention-grabbing nature.
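
The size of the gap between claim and opinion videos can be checked directly from the grouped table. The following is a minimal sketch that divides the claim means by the opinion means for active authors, using values copied from that table. The dictionary names are illustrative and are not part of the notebook.

```python
# Mean engagement for active authors, copied from the grouped table above.
claim_active_means = {
    "views": 499221.733171,
    "likes": 164960.302924,
    "shares": 32769.101889,
    "comments": 687.827901,
}
opinion_active_means = {
    "views": 4958.120563,
    "likes": 1091.714982,
    "shares": 217.16695,
    "comments": 2.696609,
}

# Ratio of the claim mean to the opinion mean for each metric.
mean_ratios = {
    metric: claim_active_means[metric] / opinion_active_means[metric]
    for metric in claim_active_means
}

for metric, ratio in mean_ratios.items():
    print(f"{metric}: claim mean is about {ratio:.0f}x the opinion mean")
```

The same calculation can be repeated for the banned and under review groups by swapping in their rows from the table.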


## **PACE: Construct**

**Note**: The Construct stage does not apply to this workflow. The PACE framework can be adapted to fit the specific requirements of any project.

## **PACE: Execute**

Consider the questions in your PACE Strategy Document and those below to craft your response.

### **Given your efforts, what can you summarize for Rosie Mae Bradshaw and the TikTok data team?**

*Note for Learners: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?


In [16]:
claim_count = 9608
opinion_count = 9476
total_rows = 19382  # includes rows with a missing claim_status

claim_pct_all = (claim_count / total_rows) * 100
opinion_pct_all = (opinion_count / total_rows) * 100

print(f"Relative to all rows: Claim: {claim_pct_all:.2f}%, Opinion: {opinion_pct_all:.2f}%")


Relative to all rows: Claim: 49.57%, Opinion: 48.89%


What percentage of the data is comprised of claims and what percentage is comprised of opinions?

**Answer:**

The whole dataset comprises 49.57% claims, 48.89% opinions, and 1.54% unclassified data points (298 rows with a missing `claim_status`).
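
The three shares can also be computed in one step with pandas. This is a minimal sketch that assumes a synthetic `claim_status` Series rebuilt from the counts above; in the notebook, `data['claim_status']` would take its place. Passing `dropna=False` keeps rows with a missing label in the denominator.

```python
import pandas as pd

# Synthetic stand-in for data['claim_status'], rebuilt from the counts above.
n_claim, n_opinion, n_total = 9608, 9476, 19382
n_missing = n_total - n_claim - n_opinion  # rows with no claim_status

claim_status = pd.Series(
    ["claim"] * n_claim + ["opinion"] * n_opinion + [None] * n_missing,
    name="claim_status",
)

# dropna=False keeps missing labels in the denominator.
shares = claim_status.value_counts(normalize=True, dropna=False) * 100
missing_pct = claim_status.isna().mean() * 100

print(f"Claim: {shares['claim']:.2f}%")
print(f"Opinion: {shares['opinion']:.2f}%")
print(f"Unclassified: {missing_pct:.2f}%")
```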

What factors correlate with a video's claim status?

**Answer:**  
- Views, likes, shares, downloads, and comments correlate with a video's `claim_status`. These factors are indicative of the level of engagement a video receives, which may influence its categorization as a claim.  
- For instance, videos with higher views, likes, comments, or shares might be more likely to contain claims, as they often attract attention and provoke discussion. Additionally, claim videos tend to have higher engagement rates (e.g., likes per view or shares per view), further demonstrating their correlation with these factors.  
- The strong correlation between engagement level and `claim_status` highlights an area for further inquiry, as it raises questions about how audience interaction influences or is influenced by the nature of the video's content.


What factors correlate with a video's engagement level?

**Answer:**  
- A video's `claim_status` correlates significantly with its engagement level. Claim videos tend to exhibit higher levels of engagement—such as views, likes, shares, and comments—compared to opinion videos.  
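- Additionally, when videos are pooled across claim status, banned authors show the highest average engagement, authors under review the next highest, and active authors the lowest. This ordering largely reflects composition: banned and under-review authors post a much higher share of claim videos than active authors do. Within each claim status, the grouped table shows only small differences across ban statuses (for claim videos, mean views range from about 499,000 to 506,000).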
- Metrics like likes per view or shares per view can further highlight the relationship between `claim_status` and engagement, providing deeper insights into audience interaction patterns.
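
Because ban status and claim status are intertwined, it helps to look at them separately. The sketch below uses only the counts and mean view counts from the grouped table to compute, for each ban status, the share of claim videos and the pooled mean view count. The variable names are illustrative.

```python
# (count, mean video_view_count) per group, copied from the grouped table.
groups = {
    ("claim", "active"): (6566, 499221.733171),
    ("claim", "banned"): (1439, 505907.917304),
    ("claim", "under review"): (1603, 504054.640674),
    ("opinion", "active"): (8817, 4958.120563),
    ("opinion", "banned"): (196, 4876.530612),
    ("opinion", "under review"): (463, 4958.105832),
}

summary = {}
for ban_status in ("active", "under review", "banned"):
    n_claim, mean_claim = groups[("claim", ban_status)]
    n_opinion, mean_opinion = groups[("opinion", ban_status)]
    n_total = n_claim + n_opinion
    summary[ban_status] = {
        "claim_share": n_claim / n_total,
        # Count-weighted average of the two group means.
        "pooled_mean_views": (n_claim * mean_claim + n_opinion * mean_opinion) / n_total,
    }

for ban_status, stats in summary.items():
    print(f"{ban_status}: {stats['claim_share']:.1%} claims, "
          f"pooled mean views {stats['pooled_mean_views']:,.0f}")
```

The pooled means differ mainly because the claim share differs across ban statuses. Among claim videos alone, the three mean view counts in the table are within about 1.5% of each other.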

#### **Further Analysis**

In [17]:
percentile_75 = data['video_view_count'].quantile(0.75)

# Filter rows where video_view_count is above the 75th percentile
filtered_df = data[data['video_view_count'] > percentile_75]

# Count occurrences of claims and opinions.
# value_counts must be called (with parentheses) to return the counts.
counts = filtered_df['claim_status'].value_counts()

# Print results
print("Occurrences above 75th percentile:")
print(counts)

The recorded output for this cell covers 4,771 rows above the 75th percentile; every row it displays (indices 2 through 9605) is labeled `claim`.
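
The filter-and-count step can be sketched on a small synthetic frame. The values below are made up for illustration and the frame name `toy` is hypothetical. Note that `value_counts` has to be called with parentheses to return a Series of counts.

```python
import pandas as pd

# Synthetic stand-in for the notebook's `data` frame (values are made up).
toy = pd.DataFrame({
    "claim_status": ["claim"] * 4 + ["opinion"] * 4,
    "video_view_count": [900, 800, 700, 600, 10, 20, 30, 40],
})

# 75th percentile of views (pandas interpolates linearly by default).
percentile_75 = toy["video_view_count"].quantile(0.75)

# Keep rows strictly above the threshold, then count the labels.
top_quartile = toy[toy["video_view_count"] > percentile_75]
counts = top_quartile["claim_status"].value_counts()

print(percentile_75)
print(counts)
```

In this toy frame only claim rows exceed the threshold, so `counts` contains a single `claim` entry.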


In [18]:
null_claim_status_df = data[data['claim_status'].isna()]

null_claim_status_df

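| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count | likes_per_view | comments_per_view | shares_per_view |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19084 | 19085 | NaN | 4380513697 | 39 | NaN | not verified | active | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19085 | 19086 | NaN | 8352130892 | 60 | NaN | not verified | active | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19086 | 19087 | NaN | 4443076562 | 25 | NaN | not verified | active | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19087 | 19088 | NaN | 8328300333 | 7 | NaN | not verified | active | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19088 | 19089 | NaN | 3968729520 | 8 | NaN | not verified | active | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19377 | 19378 | NaN | 7578226840 | 21 | NaN | not verified | active | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19378 | 19379 | NaN | 6079236179 | 53 | NaN | not verified | active | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19379 | 19380 | NaN | 2565539685 | 10 | NaN | verified | under review | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19380 | 19381 | NaN | 2969178540 | 24 | NaN | not verified | active | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |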
