***
# TikTok Project

### Scenario

TikTok users can submit reports that identify videos and comments containing user claims. These reports flag content for review by moderators. The process generates more user reports than moderators can review in a timely manner.

TikTok is working on the development of a predictive model that can determine whether a video contains a claim or offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

### Assignment
You will build a dataframe for the claims classification data. After the dataframe is complete, you will organize the claims data for the process of exploratory data analysis, and update the team on your progress and insights.

***

## Team members

### Data team roles
The members of the data team at TikTok are well versed in data analysis and data science. Messages to these more technical coworkers should be concise and specific.
* Willow Jaffey - Data Science Lead
* Rosie Mae Bradshaw - Data Science Manager
* Orion Rainier - Data Scientist

### Cross-functional team members
* Mary Joanna Rodgers - Project Management Officer
* Margery Adebowale - Finance Lead, Americas
* Maika Abadi - Operations Lead

***

### Specific project deliverables

* Complete the Course 2 PACE Strategy Document to plan your project while considering your audience members, teammates, key milestones, and overall project goal.
* Summarize the column dtypes.
* Communicate important findings in the form of an executive summary.

In [1]:
import pandas as pd 
import numpy as np

In [2]:
#Load the data

tiktok_raw = pd.read_csv("../raw_data/tiktok_dataset.csv")
tiktok_raw.head(15)

,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


In [3]:
#rename the first column, change it from # to claim_id

print(tiktok_raw.columns)
tiktok_raw = tiktok_raw.rename(columns={"#":"claim_id"})
tiktok_raw.head()

Index(['#', 'claim_status', 'video_id', 'video_duration_sec',
       'video_transcription_text', 'verified_status', 'author_ban_status',
       'video_view_count', 'video_like_count', 'video_share_count',
       'video_download_count', 'video_comment_count'],
      dtype='object')


,claim_id,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


***

* claim_status (object): Whether the published video has been identified as an “opinion” or a “claim.” In this dataset, an “opinion” refers to an individual’s or group’s personal belief or thought. A “claim” refers to information that is either unsourced or from an unverified source.

In [4]:
# review columns data type, row count
tiktok_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   claim_id                  19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB
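
The `info()` output shows 19,084 non-null entries in seven columns against 19,382 rows, so 298 values are missing in each of those columns. A minimal sketch of how such gaps could be counted and, if the analysis calls for it, dropped. The frame `toy` and its values are made up for illustration and are not the TikTok data; whether to drop or keep those rows is left to the analyst.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for tiktok_raw; values are made up.
toy = pd.DataFrame({
    "claim_id": [1, 2, 3, 4],
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_view_count": [100.0, 50.0, np.nan, 75.0],
})

# Count missing values per column.
null_counts = toy.isna().sum()

# Keep only the rows where the target label is present.
toy_clean = toy.dropna(subset=["claim_status"])

print(null_counts)
print(len(toy_clean))
```
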


In [5]:
#Check basic statistics of the dataset
tiktok_raw.describe().round()

,claim_id,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9692.0,5627454000.0,32.0,254709.0,84305.0,16735.0,1049.0,349.0
std,5595.0,2536440000.0,16.0,322893.0,133421.0,32036.0,2004.0,800.0
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.0,3430417000.0,18.0,4942.0,811.0,115.0,7.0,1.0
50%,9692.0,5618664000.0,32.0,9954.0,3404.0,717.0,46.0,9.0
75%,14537.0,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.0,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


In [6]:
# What are the different values for claim status and how many of each are in the data?
tiktok_raw.value_counts("claim_status")

claim_status
claim      9608
opinion    9476
Name: count, dtype: int64
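
To judge class balance it can help to turn these counts into shares. The sketch below starts from the two counts printed above rather than from the raw file; on the DataFrame itself, `value_counts(normalize=True)` on the `claim_status` column should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"claim": 9608, "opinion": 9476}, name="count")

# Share of each class among the labelled rows.
shares = (counts / counts.sum()).round(3)
print(shares)
```

The two classes come out at roughly 50% each, so the labelled data is close to balanced.
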

In [7]:
# What is the average view count of videos with "claim" status?

claims_vids = tiktok_raw[tiktok_raw["claim_status"] == "claim"]
# Mean view count across all videos labelled "claim"
avg_claim_views = round(claims_vids["video_view_count"].mean(), 2)
avg_claim_views

np.float64(501029.45)

In [8]:
# What is the average view count of videos with "opinion" status?
opinions_vids = tiktok_raw[tiktok_raw["claim_status"] == "opinion"]
# Mean view count across all videos labelled "opinion"
avg_opinion_views = round(opinions_vids["video_view_count"].mean(), 2)
avg_opinion_views

np.float64(4956.43)
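
The two cells above filter and average each class separately. As an alternative sketch, a single `groupby()` can return the mean for both classes at once, with the median alongside as a check against skew. The frame `toy` below is hypothetical and its numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame; values are made up, not taken from the dataset.
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [100.0, 200.0, 900.0, 10.0, 20.0, 30.0],
})

# One row per claim status, one column per statistic.
summary = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])
print(summary)
```
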

In [10]:
# Get counts for each group combination of claim status and author ban status

tiktok_raw.groupby(["claim_status","author_ban_status"])["claim_id"].count()

claim_status  author_ban_status
claim         active               6566
              banned               1439
              under review         1603
opinion       active               8817
              banned                196
              under review          463
Name: claim_id, dtype: int64
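
Raw counts are hard to compare because the two classes differ in size. A short sketch that converts the counts printed above into row shares, so each claim status sums to 1. On the raw frame, `pd.crosstab()` with `normalize="index"` would be one way to get the same table.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.DataFrame(
    {"active": [6566, 8817], "banned": [1439, 196], "under review": [1603, 463]},
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Divide each row by its total to get the share of each ban status per claim status.
shares = counts.div(counts.sum(axis=1), axis=0).round(3)
print(shares)
```

About 15% of claim videos come from banned authors, against about 2% of opinion videos.
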

In [11]:
# Calculate the median video share count of each author ban status.

tiktok_raw.groupby("author_ban_status")["video_share_count"].median()

author_ban_status
active            437.0
banned          14468.0
under review     9444.0
Name: video_share_count, dtype: float64

In [14]:
# Use groupby() to group the data by author_ban_status, then use agg() to get the mean and median
# of each of the following columns (the count is taken once, from video_view_count):

# video_view_count
# video_like_count
# video_share_count

tiktok_raw.groupby("author_ban_status").agg({
    "video_view_count": ["count", "mean", "median"],
    "video_like_count": ["mean", "median"],
    "video_share_count": ["mean", "median"],
}).round()

,video_view_count,video_view_count,video_view_count,video_like_count,video_like_count,video_share_count,video_share_count
author_ban_status,count,mean,median,mean,median,mean,median
active,15383,215927.0,8616.0,71037.0,2222.0,14111.0,437.0
banned,1635,445845.0,448201.0,153017.0,105573.0,29999.0,14468.0
under review,2066,392205.0,365246.0,128718.0,71204.0,25775.0,9444.0
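
Passing a dict to `agg()` produces two-level column headers. If flat column names are easier to work with downstream, pandas named aggregation is one option. A minimal sketch on a hypothetical frame `toy`; the output names such as `view_mean` are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy frame; values are made up, not taken from the dataset.
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "banned", "banned"],
    "video_view_count": [10.0, 30.0, 100.0, 300.0],
    "video_like_count": [1.0, 3.0, 40.0, 60.0],
})

# Named aggregation: each keyword becomes one flat output column,
# defined as (source column, aggregation function).
flat = toy.groupby("author_ban_status").agg(
    view_count=("video_view_count", "count"),
    view_mean=("video_view_count", "mean"),
    view_median=("video_view_count", "median"),
    like_mean=("video_like_count", "mean"),
)
print(flat)
```
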


Now, create three new columns to help better understand engagement rates. Each ratio is multiplied by 100, so the values are percentages of views:

* likes_per_view: the number of likes divided by the number of views for each video, times 100
* comments_per_view: the number of comments divided by the number of views for each video, times 100
* shares_per_view: the number of shares divided by the number of views for each video, times 100

In [16]:
tiktok_raw["likes_per_view"] = (tiktok_raw["video_like_count"] / tiktok_raw["video_view_count"]) *100
tiktok_raw["comments_per_view"] = (tiktok_raw["video_comment_count"] / tiktok_raw["video_view_count"]) *100
tiktok_raw["shares_per_view"] = (tiktok_raw["video_share_count"] / tiktok_raw["video_view_count"]) *100

tiktok_raw.head(15)

,claim_id,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count,likes_per_view,comments_per_view,shares_per_view
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0,5.658382,0.0,0.070202
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0,54.909602,0.48553,13.511077
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0,10.828156,0.036467,0.316786
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0,54.845876,0.133484,7.956919
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0,62.291025,0.270622,7.317464
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0,52.145422,0.551616,18.506923
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0,64.795794,0.725799,25.842912
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0,0.195788,0.002009,0.009132
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0,40.936379,0.108788,4.23063
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0,18.361248,0.272653,7.271355
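
As a spot check, the three new values for row 0 can be recomputed by hand from the counts shown in that row. Because the cell above multiplies each ratio by 100, the results read as percentages of views.

```python
# Counts copied from row 0 of the table above.
views, likes, comments, shares = 343296.0, 19425.0, 0.0, 241.0

likes_pct = likes / views * 100
comments_pct = comments / views * 100
shares_pct = shares / views * 100

print(round(likes_pct, 6), round(comments_pct, 6), round(shares_pct, 6))
```

These agree with the `likes_per_view`, `comments_per_view`, and `shares_per_view` values in row 0 (5.658382, 0.0, and 0.070202).
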


In [18]:
# Use groupby() to compile the three newly created columns for each combination of claim status and author ban status,
# then use agg() to calculate the mean and median of each group (the count is taken once, from likes_per_view).
tiktok_raw.groupby(["claim_status", "author_ban_status"]).agg({
    "likes_per_view": ["count", "mean", "median"],
    "comments_per_view": ["mean", "median"],
    "shares_per_view": ["mean", "median"],
})


,,likes_per_view,likes_per_view,likes_per_view,comments_per_view,comments_per_view,shares_per_view,shares_per_view
claim_status,author_ban_status,count,mean,median,mean,median,mean,median
claim,active,6566,32.954203,32.653813,0.139263,0.077551,6.545579,4.927911
claim,banned,1439,34.507055,35.890867,0.137719,0.074562,6.789342,5.160571
claim,under review,1603,32.799666,32.086742,0.136709,0.078887,6.573347,4.99666
opinion,active,8817,21.974352,21.833031,0.051708,0.025176,4.372857,3.240497
opinion,banned,196,20.68684,19.848304,0.043436,0.019277,4.053072,3.072761
opinion,under review,463,22.639437,22.80508,0.053556,0.029261,4.447242,3.502682
