Data dictionary
This project uses a dataset called tiktok_dataset.csv. It contains synthetic data created for this project in partnership with TikTok.

The dataset contains:

19,382 rows – Each row represents a different published TikTok video in which a claim or opinion has been made.

12 columns

| Column name             | Type  | Description                                                                                         |
|-------------------------|-------|-----------------------------------------------------------------------------------------------------|
| #                       | int   | TikTok-assigned number for the video with a claim/opinion.                                           |
| claim_status            | obj   | Whether the published video has been identified as an “opinion” or a “claim.”                       |
| video_id                | int   | Random identifying number assigned to video upon publication on TikTok.                              |
| video_duration_sec      | int   | Length of the published video, in seconds.                                                           |
| video_transcription_text| obj   | Transcribed text of the words spoken in the published video.                                         |
| verified_status         | obj   | Verification status of the TikTok user who published the video.                                      |
| author_ban_status       | obj   | Permission status of the TikTok user who published the video: “active,” “under review,” or “banned.” |
| video_view_count        | float | The total number of times the published video has been viewed.                                        |
| video_like_count        | float | The total number of times the published video has been liked by other users.                          |
| video_share_count       | float | The total number of times the published video has been shared by other users.                         |
| video_download_count    | float | The total number of times the published video has been downloaded by other users.                     |
| video_comment_count     | float | The total number of comments on the published video.                                                  |



# **PACE stages**

<img src="Pace.png" width="150" height="150" align=left>

   * [Plan](#plan-anchor)
   * [Analyze](#analyze-anchor)
   * [Construct](#construct-anchor)
   * [Execute](#execute-anchor)


## **PACE: Plan** <a id='plan-anchor'></a>

<img src="Plan.png" width="100" height="100" align=left>

### **Understand the situation**

<img src="Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze** <a id='analyze-anchor'></a>

In [57]:
import numpy as np
import pandas as pd

In [58]:
tiktok_df = pd.read_csv('tiktok_dataset.csv')

In [59]:
tiktok_df.head()

,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


In [60]:
tiktok_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB
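
The `info()` output above lists 19,084 non-null values for several columns against 19,382 rows, which leaves 298 missing values in each of those columns. A minimal sketch of how such gaps could be counted is shown below. It uses a small hypothetical frame (`toy_df`, with made-up values) in place of `tiktok_df`, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for tiktok_df (values are made up).
toy_df = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_view_count": [343296.0, 4953.0, np.nan, 140877.0],
    "video_duration_sec": [59, 32, 31, 25],
})

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()

# Number of rows that have at least one missing value.
rows_with_missing = int(toy_df.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```

Running the same two expressions on `tiktok_df` would show whether the 298 gaps fall in the same rows, which the `info()` summary alone does not confirm.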


In [61]:
tiktok_df.describe()

,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


### **Understand the data - Investigate the variables**

In [62]:
tiktok_df_grouped = tiktok_df.groupby('claim_status', as_index=False)['#'].count()

In [63]:
tiktok_df_grouped

,claim_status,#
0,claim,9608
1,opinion,9476
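
The counts above can be turned into percentages of the videos that have a known `claim_status`. The sketch below rebuilds the two counts by hand from the output above rather than reading the CSV, so it runs on its own; the name `status_counts` is introduced here only for illustration.

```python
import pandas as pd

# Counts copied from the grouped output above.
status_counts = pd.Series({"claim": 9608, "opinion": 9476}, name="video_count")

# Share of each class among videos with a known claim_status, in percent.
status_pct = (status_counts / status_counts.sum() * 100).round(2)

print(status_pct)
```

On the full DataFrame, `tiktok_df['claim_status'].value_counts(normalize=True)` should give the same proportions as fractions, since `value_counts` excludes missing values by default.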


In [64]:
claim_video_view_mean = tiktok_df[tiktok_df['claim_status'] == 'claim']['video_view_count'].agg(['mean', 'median'])
opinion_video_view_mean = tiktok_df[tiktok_df['claim_status']=='opinion']['video_view_count'].agg(['mean', 'median'])

In [65]:
claim_video_view_mean.round()

mean      501029.0
median    501555.0
Name: video_view_count, dtype: float64

In [66]:
opinion_video_view_mean.round()

mean      4956.0
median    4953.0
Name: video_view_count, dtype: float64
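
The two filtered aggregations above could also be written as a single `groupby`, which returns both classes in one table. This is a sketch on a hypothetical toy frame (`toy_df`, made-up view counts), not the project data.

```python
import pandas as pd

# Hypothetical toy data standing in for tiktok_df; the view counts are made up.
toy_df = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", None],
    "video_view_count": [100.0, 300.0, 800.0, 10.0, 30.0, None],
})

# groupby drops rows whose key is missing by default, so the None row is ignored.
view_stats = toy_df.groupby("claim_status")["video_view_count"].agg(["mean", "median"])

print(view_stats)
```

The result has one row per claim status and one column per statistic, which makes the claim and opinion values easy to compare side by side.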

In [67]:
tiktok_df['author_ban_status'].unique()

array(['under review', 'active', 'banned'], dtype=object)

In [68]:
videos_by_status = tiktok_df.groupby(['claim_status', 'author_ban_status']).size().reset_index(name='video_count')


In [69]:
videos_by_status

,claim_status,author_ban_status,video_count
0,claim,active,6566
1,claim,banned,1439
2,claim,under review,1603
3,opinion,active,8817
4,opinion,banned,196
5,opinion,under review,463
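
Raw counts are hard to compare because the ban-status groups differ so much in size. One way to read the table, sketched below, is to compute the fraction of each group's videos that are claims. The counts are copied by hand from the output above, and `counts_wide` and `claim_share` are names introduced here for illustration.

```python
import pandas as pd

# Counts copied from the videos_by_status output above.
videos_by_status = pd.DataFrame({
    "claim_status": ["claim"] * 3 + ["opinion"] * 3,
    "author_ban_status": ["active", "banned", "under review"] * 2,
    "video_count": [6566, 1439, 1603, 8817, 196, 463],
})

# Reshape to one row per ban status and one column per claim status.
counts_wide = videos_by_status.pivot(
    index="author_ban_status", columns="claim_status", values="video_count"
)

# Fraction of each ban-status group's videos that are claims.
claim_share = counts_wide["claim"] / counts_wide.sum(axis=1)

print(claim_share.round(3))
```

Under these counts, claims make up a much larger fraction of videos from banned and under-review authors than from active authors.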


In [70]:
video_median_status = tiktok_df.groupby('author_ban_status')['video_share_count'].median()

In [71]:
video_median_status

author_ban_status
active            437.0
banned          14468.0
under review     9444.0
Name: video_share_count, dtype: float64

In [72]:
video_status = tiktok_df.groupby('author_ban_status')[['video_view_count','video_like_count','video_share_count']].agg(['median','mean','count'])

In [73]:
video_status.round()

author_ban_status,video_view_count,video_view_count,video_view_count,video_like_count,video_like_count,video_like_count,video_share_count,video_share_count,video_share_count
,median,mean,count,median,mean,count,median,mean,count
active,8616.0,215927.0,15383,2222.0,71037.0,15383,437.0,14111.0,15383
banned,448201.0,445845.0,1635,105573.0,153017.0,1635,14468.0,29999.0,1635
under review,365246.0,392205.0,2066,71204.0,128718.0,2066,9444.0,25775.0,2066


In [74]:
tiktok_df['likes_per_view'] = tiktok_df['video_like_count']/tiktok_df['video_view_count']
tiktok_df['comments_per_view'] = tiktok_df['video_comment_count']/tiktok_df['video_view_count']
tiktok_df['shares_per_view'] = tiktok_df['video_share_count']/tiktok_df['video_view_count']

In [75]:
tiktok_df.head(1)

,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count,likes_per_view,comments_per_view,shares_per_view
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0,0.056584,0.0,0.000702
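
As a quick sanity check, the three ratios for the first row can be reproduced by hand from the counts shown above. The variable names in this sketch are illustrative and not part of the notebook.

```python
# Counts for the first row, copied from the output above.
views = 343296.0
likes = 19425.0
comments = 0.0
shares = 241.0

# Each engagement rate is the count divided by the view count.
likes_per_view = likes / views
comments_per_view = comments / views
shares_per_view = shares / views

print(round(likes_per_view, 6), round(comments_per_view, 6), round(shares_per_view, 6))
```

The division is safe for this dataset because the minimum `video_view_count` reported by `describe()` is 20, so no row has zero views.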


In [76]:
agg_tiktok_newly_col = tiktok_df.groupby(['author_ban_status','claim_status'])[['likes_per_view','comments_per_view','shares_per_view']].agg(['mean','median','count'])

In [77]:
agg_tiktok_newly_col 

author_ban_status,claim_status,likes_per_view,likes_per_view,likes_per_view,comments_per_view,comments_per_view,comments_per_view,shares_per_view,shares_per_view,shares_per_view
,,mean,median,count,mean,median,count,mean,median,count
active,claim,0.329542,0.326538,6566,0.001393,0.000776,6566,0.065456,0.049279,6566
active,opinion,0.219744,0.21833,8817,0.000517,0.000252,8817,0.043729,0.032405,8817
banned,claim,0.345071,0.358909,1439,0.001377,0.000746,1439,0.067893,0.051606,1439
banned,opinion,0.206868,0.198483,196,0.000434,0.000193,196,0.040531,0.030728,196
under review,claim,0.327997,0.320867,1603,0.001367,0.000789,1603,0.065733,0.049967,1603
under review,opinion,0.226394,0.228051,463,0.000536,0.000293,463,0.044472,0.035027,463
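
Aggregating several columns with several functions produces two-level column headers, which can be awkward to export or plot. A minimal sketch of one way to flatten them is shown below, using a hypothetical toy frame (`toy_df`, made-up ratios) rather than the project data.

```python
import pandas as pd

# Hypothetical toy data standing in for tiktok_df; the ratios are made up.
toy_df = pd.DataFrame({
    "author_ban_status": ["active", "active", "banned", "banned"],
    "likes_per_view": [0.2, 0.4, 0.3, 0.5],
    "shares_per_view": [0.02, 0.04, 0.06, 0.08],
})

agg = toy_df.groupby("author_ban_status")[["likes_per_view", "shares_per_view"]].agg(
    ["mean", "median"]
)

# Join the two header levels, e.g. ("likes_per_view", "mean") -> "likes_per_view_mean".
agg.columns = ["_".join(col) for col in agg.columns]

# Move the group key from the index back into a regular column.
flat = agg.reset_index()

print(flat.columns.tolist())
```

The flattened frame has one plain column per statistic, so it can be written to CSV or passed to a plotting function without multi-level headers.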


<img src="Construct.png" width="100" height="100" align=left>

## **PACE: Construct** <a id='construct-anchor'></a>

<img src="Execute.png" width="100" height="100" align=left>

## **PACE: Execute** <a id='execute-anchor'></a>

### **Given your efforts, what can you summarize for Rosie Mae Bradshaw and the TikTok data team?**

*Note for Learners: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data consists of claims, and what percentage consists of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?