# **TikTok Project**

You have just started as a data professional at TikTok.

The team is still in the early stages of the project. You have received notice that TikTok's leadership team has approved the project proposal. Before a claims classification model can be built, the data that TikTok provided must be examined, which begins the process of exploratory data analysis (EDA).

# **Inspect and analyze data**

**The purpose** of this project is to investigate and understand the data provided. This activity will:

1.   Acquaint you with the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA, hypothesis testing, and statistical analysis

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of your findings.
<br/>
*This activity has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided TikTok information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning and future exploratory data analysis (EDA) and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables

### **Task 1. Understand the situation**

*   How can you best prepare to understand and organize the provided information?


*Begin by exploring your dataset and consider reviewing the Data Dictionary.*

### **Task 2a. Imports and data loading**

Start by importing the packages that you will need to load and explore the dataset. Make sure to use the following import statements:
*   `import pandas as pd`

*   `import numpy as np`


In [1]:
# Import packages
import pandas as pd
import numpy as np

Then, load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

In [2]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

### **Task 2b. Understand the data - Inspect the data**

View and inspect summary information about the dataframe by **coding the following:**

1. `data.head(10)`
2. `data.info()`
3. `data.describe()`

*Consider the following questions:*

**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

















In [7]:
# Display and examine the first ten rows of the dataframe
data.head(10)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


In [8]:
data.shape

(19382, 12)

In [10]:
data[10000:10008]

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
10000,10001,opinion,1421538844,5,my family thinks that windshield wipers were i...,not verified,active,6000.0,1699.0,515.0,39.0,14.0
10001,10002,opinion,7761568882,18,my family thinks that the average car has more...,not verified,active,5279.0,637.0,250.0,7.0,1.0
10002,10003,opinion,6451317096,44,my family thinks that violins are constructed ...,not verified,under review,2953.0,859.0,67.0,2.0,1.0
10003,10004,opinion,8265851170,46,my family thinks that the most recorded song i...,not verified,active,4429.0,366.0,134.0,1.0,0.0
10004,10005,opinion,5559772026,50,my family thinks that the best selling single ...,not verified,active,5438.0,1139.0,55.0,14.0,3.0
10005,10006,opinion,8007367020,36,my family thinks that the japanese word ‘karao...,not verified,active,9197.0,2342.0,589.0,56.0,7.0
10006,10007,opinion,4444073469,26,my family thinks that the musical artist princ...,not verified,active,1924.0,325.0,25.0,0.0,0.0
10007,10008,opinion,8802429822,6,my family thinks that 90% of the world’s data ...,not verified,active,4950.0,1293.0,467.0,24.0,0.0


In [5]:
# Get summary info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB
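The non-null counts reported by `info()` can also be read as missing-value counts per column. The following is a minimal sketch on a small, made-up frame (the `toy` data and its values are hypothetical, not taken from the TikTok dataset); the same calls could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data; values are made up.
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", np.nan, "claim"],
    "video_duration_sec": [59, 32, 31, 25],
    "video_view_count": [343296.0, 140877.0, np.nan, 437506.0],
})

# Count missing values in each column.
missing_per_column = toy.isna().sum()

# Keep only the columns that actually have missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)

# Count rows in which both of these fields are missing at the same time.
rows_all_missing = toy[["claim_status", "video_view_count"]].isna().all(axis=1).sum()
print("Rows missing both fields:", rows_all_missing)
```

Checking whether the same rows are missing across several columns can help decide whether those rows can simply be dropped later.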


In [6]:
# Get summary statistics
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


Question 1: The dataframe contains a collection of categorical, text, and numerical data. Each row represents a distinct TikTok video that presents either a claim or an opinion and the accompanying metadata about that video.

Question 2: The dataframe contains five float64s, three int64s, and four objects. There are 19,382 observations, but some of the variables are missing values, including claim status, the video transcription, and all of the count variables.

Question 3: Many of the count variables seem to have outliers at the high end of the distribution. They have very large standard deviations and maximum values that are very high compared to their quartile values.
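One common way to make the outlier observation more concrete is the 1.5 × IQR rule, which flags values above `Q3 + 1.5 * (Q3 - Q1)`. The sketch below applies that rule to the quartiles and maxima copied from the `describe()` output above; the choice of this rule and of the 1.5 multiplier is an assumption, not something the project prescribes.

```python
import pandas as pd

# Quartiles and maxima copied from the describe() output above.
summary = pd.DataFrame(
    {
        "q1": [4942.5, 810.75, 115.0, 7.0, 1.0],
        "q3": [504327.0, 125020.0, 18222.0, 1156.25, 292.0],
        "max": [999817.0, 657830.0, 256130.0, 14994.0, 9599.0],
    },
    index=[
        "video_view_count",
        "video_like_count",
        "video_share_count",
        "video_download_count",
        "video_comment_count",
    ],
)

# Interquartile range and the conventional upper fence for each column.
summary["iqr"] = summary["q3"] - summary["q1"]
summary["upper_fence"] = summary["q3"] + 1.5 * summary["iqr"]

# Does the largest observed value lie beyond the fence?
summary["max_exceeds_fence"] = summary["max"] > summary["upper_fence"]
print(summary[["upper_fence", "max", "max_exceeds_fence"]])
```

Under this rule, the maxima for likes, shares, downloads, and comments lie beyond the fence, while the maximum view count does not, because the interquartile range of views is itself very wide.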

### **Task 2c. Understand the data - Investigate the variables**

In this phase, you will begin to investigate the variables more closely to better understand them.

You know from the project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the `claim_status` variable. Begin by determining how many videos there are for each different claim status.

In [23]:
data.claim_status.unique()

array(['claim', 'opinion', nan], dtype=object)

In [95]:
data['claim_status'].value_counts()

claim      9608
opinion    9476
Name: claim_status, dtype: int64

In [44]:
# What are the different values for claim status and how many of each are in the data?

claim = data['claim_status'] == 'claim'
opinion = data['claim_status'] == 'opinion'
nan = data['claim_status'].isna()

claim_d = data[claim]
opinion_d = data[opinion]

print("Number of claims:", claim_d.shape[0], "\nNumber of opinions:", opinion_d.shape[0], "\nNumber of null values:", nan.sum())

Number of claims: 9608 
Number of opinions: 9476 
Number of null values: 298


**Question:** What do you notice about the values shown? 
The counts of each claim status are quite balanced.
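The counts above can also be obtained in a single call by asking `value_counts()` to keep missing values. This is a minimal sketch on a made-up series (the `status` values are hypothetical); on the real data the call would be `data['claim_status'].value_counts(dropna=False)`.

```python
import numpy as np
import pandas as pd

# Hypothetical claim statuses, including one missing value.
status = pd.Series(
    ["claim", "opinion", "claim", np.nan, "opinion", "claim"],
    name="claim_status",
)

# dropna=False adds a row for NaN instead of silently skipping it.
counts = status.value_counts(dropna=False)
print(counts)
```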

Next, examine the engagement trends associated with each different claim status.

Start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [96]:
# What is the average view count of videos with "claim" status?

claim_d = data[data['claim_status'] == 'claim']
print('Mean view count claims:', claim_d['video_view_count'].mean())
print('Median view count claims:', claim_d['video_view_count'].median())

Mean view count claims: 501029.4527477102
Median view count claims: 501555.0


In [98]:
# What is the average view count of videos with "opinion" status?

opinion_d = data[data['claim_status'] == 'opinion']
print('Mean view count opinions:', opinion_d['video_view_count'].mean())
print('Median view count opinions:', opinion_d['video_view_count'].median())

Mean view count opinions: 4956.43224989447
Median view count opinions: 4953.0


In [99]:
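# Restrict to numeric columns so that mean and median are defined for every column
data.select_dtypes(include='number').groupby(data['claim_status']).agg(['mean', 'median'])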

Unnamed: 0_level_0,#,#,video_id,video_id,video_duration_sec,video_duration_sec,video_view_count,video_view_count,video_like_count,video_like_count,...,video_download_count,video_download_count,video_comment_count,video_comment_count,likes_per_view,likes_per_view,comments_per_view,comments_per_view,shares_per_view,shares_per_view
Unnamed: 0_level_1,mean,median,mean,median,mean,median,mean,median,mean,median,...,mean,median,mean,median,mean,median,mean,median,mean,median
claim_status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
claim,4804.5,4804.5,5627264000.0,5607672000.0,32.486886,32.0,501029.452748,501555.0,166373.331182,123649.0,...,2070.952227,1139.5,691.164863,286.0,0.33161,0.329734,0.001386,0.000774,0.065867,0.049673
opinion,14346.5,14346.5,5622382000.0,5611857000.0,32.359856,32.0,4956.43225,4953.0,1092.729844,823.0,...,13.67729,7.0,2.697446,1.0,0.219802,0.218127,0.000516,0.000253,0.043699,0.032487


**Question:** What do you notice about the mean and median within each claim category? 
The mean and the median within each claim category are close to one another, but there is a vast discrepancy between view counts for videos labeled as claims and videos labeled as opinions.

Now, examine trends associated with the ban status of the author.

Use `groupby()` to calculate how many videos there are for each combination of categories of claim status and author ban status.

In [52]:
data.author_ban_status.unique()

array(['under review', 'active', 'banned'], dtype=object)

In [100]:
# Get counts for each group combination of claim status and author ban status

data.groupby(['claim_status', 'author_ban_status']).count()[['#']]

Unnamed: 0_level_0,Unnamed: 1_level_0,#
claim_status,author_ban_status,Unnamed: 2_level_1
claim,active,6566
claim,banned,1439
claim,under review,1603
opinion,active,8817
opinion,banned,196
opinion,under review,463


**Question:** What do you notice about the number of claims videos with banned authors? Why might this relationship occur?

There are many more claim videos with banned authors than there are opinion videos with banned authors. This could mean a number of things, including the possibilities that:

* Claim videos are more strictly policed than opinion videos

* Authors must comply with a stricter set of rules if they post a claim than if they post an opinion


Note also that this data cannot show whether claim videos are inherently more likely than opinion videos to result in author bans, or whether authors who post claim videos are simply more likely to post videos that violate the terms of service.

Finally, while you can use this data to draw conclusions about banned/active authors, you cannot draw conclusions about banned videos. There's no way of determining whether a particular video caused the ban, and banned authors could have posted videos that complied with the terms of service.
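Because the two claim categories contain slightly different numbers of videos, it can help to express the counts above as shares within each claim status. The sketch below rebuilds the count table by hand from the output above and normalizes each row; the variable names are illustrative only.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.DataFrame(
    {
        "active": [6566, 8817],
        "banned": [1439, 196],
        "under review": [1603, 463],
    },
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Divide each row by its total to get the share of each ban status
# within a claim status.
row_share = counts.div(counts.sum(axis=1), axis=0)
print((row_share * 100).round(1))
```

About 15% of claim videos come from banned authors, compared with about 2% of opinion videos. On the full dataframe, `pd.crosstab(data['claim_status'], data['author_ban_status'], normalize='index')` should give the same shares directly.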


Continue investigating engagement levels, now focusing on `author_ban_status`.

Calculate the median video share count of each author ban status.

In [102]:
# Select the column before aggregating so that non-numeric columns are never passed to median()
data.groupby('author_ban_status')['video_share_count'].median()

author_ban_status
active            437.0
banned          14468.0
under review     9444.0
Name: video_share_count, dtype: float64

In [105]:
# Alternatively, take the median of every numeric column and keep only the share count
data.groupby(['author_ban_status']).median(numeric_only=True)[
    ['video_share_count']]

Unnamed: 0_level_0,video_share_count
author_ban_status,Unnamed: 1_level_1
active,437.0
banned,14468.0
under review,9444.0


**Question:** What do you notice about the share count of banned authors, compared to that of active authors? Explore this in more depth.

Banned authors have a median share count that's 33 times the median share count of active authors! 
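The ratio quoted above can be checked directly from the medians in the output. This small sketch copies those three medians into a series by hand and divides each by the value for active authors.

```python
import pandas as pd

# Median share counts copied from the output above.
median_shares = pd.Series(
    {"active": 437.0, "banned": 14468.0, "under review": 9444.0},
    name="video_share_count",
)

# Express each group's median as a multiple of the active-author median.
ratio_to_active = median_shares / median_shares["active"]
print(ratio_to_active.round(1))
```

The median for authors under review is also far above that of active authors, at roughly 22 times.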


Use `groupby()` to group the data by `author_ban_status`, then use `agg()` to get the mean and median of each of the following columns:
* `video_view_count`
* `video_like_count`
* `video_share_count`

Remember, `agg()` accepts a dictionary whose keys are column names. The value for each column is a list of the calculations you want to perform on it.
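To illustrate how the dictionary form shapes the result, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical). It also shows named aggregation, an alternative pandas syntax that produces flat column names; the names `view_mean` and `like_median` are chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "banned", "banned"],
    "video_view_count": [100.0, 300.0, 1000.0, 3000.0],
    "video_like_count": [10.0, 50.0, 200.0, 400.0],
})

# Dictionary form: keys are column names, values are lists of aggregations.
nested = toy.groupby("author_ban_status").agg(
    {
        "video_view_count": ["mean", "median"],
        "video_like_count": ["mean", "median"],
    }
)

# The result has two-level column labels, selected with a tuple.
print(nested[("video_view_count", "mean")])

# Named aggregation: each keyword becomes a flat output column.
flat = toy.groupby("author_ban_status").agg(
    view_mean=("video_view_count", "mean"),
    like_median=("video_like_count", "median"),
)
print(flat)
```

The dictionary form is convenient for a quick overview, while flat column names can be easier to work with in later steps.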

In [106]:
data.groupby(['author_ban_status']).agg(
    {'video_view_count': ['mean', 'median'],
     'video_like_count': ['mean', 'median'],
     'video_share_count': ['mean', 'median']})

Unnamed: 0_level_0,video_view_count,video_view_count,video_like_count,video_like_count,video_share_count,video_share_count
Unnamed: 0_level_1,mean,median,mean,median,mean,median
author_ban_status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
active,215927.039524,8616.0,71036.533836,2222.0,14111.466164,437.0
banned,445845.439144,448201.0,153017.236697,105573.0,29998.942508,14468.0
under review,392204.836399,365245.5,128718.050339,71204.5,25774.696999,9444.0


**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?


Banned authors and those under review get far more views, likes, and shares than active authors.
In most groups, the mean is much greater than the median, which indicates that there are some videos with very high engagement counts.


Now, create three new columns to help better understand engagement rates:
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [76]:
# Create a likes_per_view column
data['likes_per_view'] = data['video_like_count']/data['video_view_count']

# Create a comments_per_view column
data['comments_per_view'] = data['video_comment_count']/data['video_view_count']

# Create a shares_per_view column
data['shares_per_view'] = data['video_share_count']/data['video_view_count']
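The three ratios above divide by `video_view_count`. In this dataset the minimum view count reported by `describe()` is 20, so no division by zero occurs, and rows with missing counts simply yield `NaN`. If zero views were possible, one way to guard the division is sketched below on made-up data (the `toy` frame is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing row and a zero-view row.
toy = pd.DataFrame({
    "video_like_count": [50.0, 0.0, np.nan, 5.0],
    "video_view_count": [200.0, 100.0, np.nan, 0.0],
})

# Treat zero views as undefined so the ratio becomes NaN rather than inf.
views = toy["video_view_count"].replace(0, np.nan)
toy["likes_per_view"] = toy["video_like_count"] / views
print(toy)
```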

Use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group.

In [107]:
data.groupby(['claim_status', 'author_ban_status']).agg(
    {'likes_per_view': ['count', 'mean', 'median'],
     'comments_per_view': ['count', 'mean', 'median'],
     'shares_per_view': ['count', 'mean', 'median']})

Unnamed: 0_level_0,Unnamed: 1_level_0,likes_per_view,likes_per_view,likes_per_view,comments_per_view,comments_per_view,comments_per_view,shares_per_view,shares_per_view,shares_per_view
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,median,count,mean,median,count,mean,median
claim_status,author_ban_status,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
claim,active,6566,0.329542,0.326538,6566,0.001393,0.000776,6566,0.065456,0.049279
claim,banned,1439,0.345071,0.358909,1439,0.001377,0.000746,1439,0.067893,0.051606
claim,under review,1603,0.327997,0.320867,1603,0.001367,0.000789,1603,0.065733,0.049967
opinion,active,8817,0.219744,0.21833,8817,0.000517,0.000252,8817,0.043729,0.032405
opinion,banned,196,0.206868,0.198483,196,0.000434,0.000193,196,0.040531,0.030728
opinion,under review,463,0.226394,0.228051,463,0.000536,0.000293,463,0.044472,0.035027


**Question:**

How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.


We know that videos by banned authors and those under review tend to get far more views, likes, and shares than videos by non-banned authors. However, when a video does get viewed, its engagement rate is less related to author ban status and more related to its claim status.

Also, we know that claim videos get far more views than opinion videos, and this table shows that they also receive more likes per view on average, so they are more favorably received as well. Furthermore, they receive more comments and shares per view than opinion videos.

Note that for claim videos, banned authors have slightly higher likes/view and shares/view rates than active authors or those under review. However, for opinion videos, active authors and those under review both get higher engagement rates than banned authors in all categories.


### **Given your efforts, what can you summarize for Rosie Mae Bradshaw and the TikTok data team?**

*Note for Learners: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?


Of the 19,382 samples in this dataset, 9,608 (just under 50%) are claims, 9,476 (about 49%) are opinions, and 298 (about 1.5%) have no claim status.


Engagement level is strongly correlated with claim status. This should be a focus of further inquiry.


Videos with banned authors have substantially higher engagement than videos with active authors. Videos with authors under review fall between these two categories in terms of engagement levels.


In [94]:
print("Proportion of claims:", len(claim_d)/len(data), 
      "\nProportion of opinions:", len(opinion_d)/len(data),
      "\nProportion of null values:", nan.sum()/len(data))

Proportion of claims: 0.4957176761944072 
Proportion of opinions: 0.4889072335156331 
Proportion of null values: 0.015375090289959757


About 50% of the data consists of claims and about 49% consists of opinions. The remaining rows, about 1.5%, have no claim status.
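For the summary, the proportions can be expressed as rounded percentages. The sketch below recomputes them from the three counts reported earlier (9,608 claims, 9,476 opinions, 298 missing); the dictionary and its labels are illustrative only.

```python
# Counts reported earlier in this notebook.
counts = {"claim": 9608, "opinion": 9476, "missing": 298}

# The three counts add up to the full 19,382 rows.
total = sum(counts.values())

# Convert each count to a percentage of all rows, rounded to one decimal.
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```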