# Preliminary Data Investigation

**The purpose** of this notebook is to investigate and understand the data provided. This notebook aims to:

1. Become acquainted with the data.

2. Compile summary information about the data.

3. Begin exploratory data analysis (EDA) and surface initial insights from the data.

4. Prepare for more in-depth EDA, hypothesis testing, and statistical analysis.

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and share findings with the TikTok data team members.

---

*This notebook has four parts:*

**Part 1:** Understand the Situation

**Part 2:** Understand the Data

**Part 3:** Understand the Variables

**Part 4:** Summary

## Part 1: Understand the Situation

Requirements: import the data, compute summary statistics, check non-null counts, and identify relevant and irrelevant columns.

Data Dictionary:

![Data](imgs/tiktok_data_dictionary.png)

## Part 2: Understand the Data

In [1]:
# Import packages
import pandas as pd
import numpy as np

In [2]:
# Load dataset into dataframe
data = pd.read_csv("data/tiktok_dataset.csv")

In [3]:
# Display and examine the first ten rows of the dataframe
data.head(10)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


The columns include string, integer, and float data types, and each row represents a video. The `#` column looks redundant, since `video_id` can identify each video provided it contains no duplicates or nulls; this needs further investigation.
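The uniqueness assumption about `video_id` can be checked directly. The following is a minimal sketch on a hypothetical toy frame (the name `toy` and its repeated ID are illustrative, not taken from the real dataset); the same three checks would apply to `data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real `data`; the last ID repeats on purpose.
toy = pd.DataFrame({
    "#": [1, 2, 3, 4],
    "video_id": [7017666017, 4014381136, 9859838091, 4014381136],
})

# Number of missing identifiers
n_missing_ids = int(toy["video_id"].isna().sum())

# Number of rows whose video_id already appeared earlier in the frame
n_duplicate_ids = int(toy["video_id"].duplicated().sum())

# True only if video_id alone identifies every row
id_is_key = n_missing_ids == 0 and n_duplicate_ids == 0

print(n_missing_ids, n_duplicate_ids, id_is_key)
```

If `id_is_key` comes out `True` on the real data, dropping `#` loses no identifying information.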

In [4]:
# Get summary info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


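There are 5 float, 3 integer, and 4 object columns. Seven columns have missing values, each with 19,084 non-null entries out of 19,382, so 298 values (about 1.5%) are missing per column. Also, the last 5 columns (`video_view_count`, ..., `video_comment_count`) could be stored as integers instead of floats, because there is no such thing as half a like or half a comment. Because these columns contain nulls, the conversion requires either dropping the incomplete rows first or using pandas' nullable `Int64` dtype.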
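As a rough sketch of the integer conversion, the example below uses a hypothetical toy frame with one missing row (the names `toy` and `count_cols` are assumptions for illustration). It relies on pandas' nullable `Int64` dtype, which keeps missing values as `<NA>` instead of forcing the column back to float.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the third row mimics a record with missing counts.
toy = pd.DataFrame({
    "video_view_count": [343296.0, 140877.0, np.nan],
    "video_like_count": [19425.0, 77355.0, np.nan],
})
count_cols = ["video_view_count", "video_like_count"]

# Rows with at least one missing value
n_incomplete = int(toy.isna().any(axis=1).sum())

# Nullable Int64 (capital "I") stores whole numbers and keeps missing values as <NA>
converted = toy.astype({col: "Int64" for col in count_cols})

print(n_incomplete)
print(converted.dtypes)
```

Plain `int64` would raise an error here because NaN cannot be represented as an integer, which is why the nullable dtype (or dropping the incomplete rows) is needed.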

In [5]:
# Get summary statistics
data.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


The `#` column simply runs from 1 to 19382, while the dataframe index already runs from 0 to 19381, so we can consider removing `#`. The five engagement count variables (`video_view_count`, ..., `video_comment_count`) are right-skewed: in each case the mean is far greater than the median. Outliers may be present as well, since the maximum values, particularly for likes, shares, downloads, and comments, are extreme compared to the 75th percentile. Further investigation is needed to determine how many outliers there are and how to handle them.
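One possible way to count outliers is the 1.5 × IQR rule. This is a common convention rather than a method chosen in this notebook, and the helper name `count_iqr_outliers` and the toy series are hypothetical. A minimal sketch:

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]; missing values are ignored."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    outside = (clean < q1 - k * iqr) | (clean > q3 + k * iqr)
    return int(outside.sum())

# Hypothetical toy counts: one value sits far above the rest
toy_counts = pd.Series([1, 2, 3, 4, 5, 100])
n_outliers = count_iqr_outliers(toy_counts)
print(n_outliers)
```

On the real data the helper could be applied column by column, for example `count_iqr_outliers(data['video_like_count'])`. For heavily right-skewed counts this rule can flag a large share of rows, so the threshold `k` is a judgment call.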

## Part 3: Understand the Variables

The ultimate objective is to use machine learning to classify videos as either claims or opinions. Therefore, a good first step towards understanding the variables might be examining the `claim_status` variable.

In [6]:
# Claim status value counts
data['claim_status'].value_counts()

claim_status
claim      9608
opinion    9476
Name: count, dtype: int64

The two classes are nearly balanced: 9,608 claim videos and 9,476 opinion videos.
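To put the balance in relative terms, the counts above can be turned into shares. This sketch rebuilds the counts by hand so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above
class_counts = pd.Series({"claim": 9608, "opinion": 9476}, name="count")

# Share of labeled videos in each class
class_shares = class_counts / class_counts.sum()

print(class_shares.round(4))
```

With the real dataframe, `data['claim_status'].value_counts(normalize=True)` gives the same shares in one call.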

### Examine the engagement trends associated with each different claim status

In [7]:
# The average and median view count per claim_status
data.groupby('claim_status')['video_view_count'].agg(['mean', 'median']).reset_index()

| | claim_status | mean | median |
| --- | --- | --- | --- |
| 0 | claim | 501029.452748 | 501555.0 |
| 1 | opinion | 4956.43225 | 4953.0 |


Claim videos (information that is either unsourced or from an unverified source) draw far more views than opinion videos: roughly 100 times as many, by both mean and median.

### Examine trends associated with the ban status of the author

In [8]:
# Get counts for each group combination of claim status and author ban status
data.groupby(['claim_status','author_ban_status']).size().to_frame('count')

| claim_status | author_ban_status | count |
| --- | --- | --- |
| claim | active | 6566 |
| claim | banned | 1439 |
| claim | under review | 1603 |
| opinion | active | 8817 |
| opinion | banned | 196 |
| opinion | under review | 463 |


Claim videos are far more likely than opinion videos to come from authors who are banned or under review. I suspect this is because those videos spread information from unsourced or unverified sources. However, these counts alone do not tell us whether posting claim videos makes an author more likely to be banned.
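Raw counts are easier to compare once they are normalized within each claim status. The sketch below rebuilds the count table by hand from the output above and computes row-wise shares; it describes videos, not authors, because the dataset has no author identifier.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.DataFrame(
    {"active": [6566, 8817], "banned": [1439, 196], "under review": [1603, 463]},
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Divide each row by its total so every row sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares.round(3))
```

With the real dataframe, `pd.crosstab(data['claim_status'], data['author_ban_status'], normalize='index')` should produce the same table. These shares show an association only; they do not establish that posting claims causes a ban.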

In [9]:
# Median video share count per author ban status
data.groupby('author_ban_status')['video_share_count'].median().reset_index()

| | author_ban_status | video_share_count |
| --- | --- | --- |
| 0 | active | 437.0 |
| 1 | banned | 14468.0 |
| 2 | under review | 9444.0 |


The median share count for banned authors and authors under review is far higher than for active authors (14,468 and 9,444 versus 437). This is plausible, since widely shared claims are more likely to attract TikTok's attention.

In [10]:
# Video view, like, and share statistics per author_ban_status
data.groupby('author_ban_status')[['video_view_count','video_like_count','video_share_count']].agg(['count','mean','median'])

| author_ban_status | video_view_count (count) | video_view_count (mean) | video_view_count (median) | video_like_count (count) | video_like_count (mean) | video_like_count (median) | video_share_count (count) | video_share_count (mean) | video_share_count (median) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| active | 15383 | 215927.039524 | 8616.0 | 15383 | 71036.533836 | 2222.0 | 15383 | 14111.466164 | 437.0 |
| banned | 1635 | 445845.439144 | 448201.0 | 1635 | 153017.236697 | 105573.0 | 1635 | 29998.942508 | 14468.0 |
| under review | 2066 | 392204.836399 | 365245.5 | 2066 | 128718.050339 | 71204.5 | 2066 | 25774.696999 | 9444.0 |


Banned authors account for relatively few videos (1,635, versus 15,383 from active authors), but their videos have much higher median views, likes, and shares.

### Create three new columns to help better understand engagement rates

- `likes_per_view`: represents the number of likes divided by the number of views for each video
- `comments_per_view`: represents the number of comments divided by the number of views for each video
- `shares_per_view`: represents the number of shares divided by the number of views for each video

In [11]:
# Create a likes_per_view column
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']

# Create a comments_per_view column
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']

# Create a shares_per_view column
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count | likes_per_view | comments_per_view | shares_per_view |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 | 0.056584 | 0.0 | 0.000702 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 | 0.549096 | 0.004855 | 0.135111 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 | 0.108282 | 0.000365 | 0.003168 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 | 0.548459 | 0.001335 | 0.079569 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 | 0.62291 | 0.002706 | 0.073175 |


In [14]:
# Count, mean, and median for the three new columns per claim_status and author_ban_status
data.groupby(['claim_status','author_ban_status'])[['likes_per_view','comments_per_view','shares_per_view']]\
    .agg(['count','mean','median']).reset_index()

| | claim_status | author_ban_status | likes_per_view (count) | likes_per_view (mean) | likes_per_view (median) | comments_per_view (count) | comments_per_view (mean) | comments_per_view (median) | shares_per_view (count) | shares_per_view (mean) | shares_per_view (median) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | claim | active | 6566 | 0.329542 | 0.326538 | 6566 | 0.001393 | 0.000776 | 6566 | 0.065456 | 0.049279 |
| 1 | claim | banned | 1439 | 0.345071 | 0.358909 | 1439 | 0.001377 | 0.000746 | 1439 | 0.067893 | 0.051606 |
| 2 | claim | under review | 1603 | 0.327997 | 0.320867 | 1603 | 0.001367 | 0.000789 | 1603 | 0.065733 | 0.049967 |
| 3 | opinion | active | 8817 | 0.219744 | 0.21833 | 8817 | 0.000517 | 0.000252 | 8817 | 0.043729 | 0.032405 |
| 4 | opinion | banned | 196 | 0.206868 | 0.198483 | 196 | 0.000434 | 0.000193 | 196 | 0.040531 | 0.030728 |
| 5 | opinion | under review | 463 | 0.226394 | 0.228051 | 463 | 0.000536 | 0.000293 | 463 | 0.044472 | 0.035027 |


Claim videos in general generate more likes, comments, and shares per view than opinion videos. Among claim videos, those by banned authors have higher likes per view and shares per view than those by active authors or authors under review. Among opinion videos, by contrast, those by banned authors have lower likes, comments, and shares per view than those by the other two groups.

At this point in the investigation, TikTok users seem to be more engaged with claim videos than with opinion videos, and engagement level appears to be strongly associated with claim status.

## Part 4: Summary

The dataset contains 12 columns:

| Column | Non-null count | Data type |
| --- | --- | --- |
| `#` | 19382 | int64 |
| `claim_status` | 19084 | object |
| `video_id` | 19382 | int64 |
| `video_duration_sec` | 19382 | int64 |
| `video_transcription_text` | 19084 | object |
| `verified_status` | 19382 | object |
| `author_ban_status` | 19382 | object |
| `video_view_count` | 19084 | float64 |
| `video_like_count` | 19084 | float64 |
| `video_share_count` | 19084 | float64 |
| `video_download_count` | 19084 | float64 |
| `video_comment_count` | 19084 | float64 |

Relevant columns: `claim_status`, `video_id`, `author_ban_status`, `video_view_count`, `video_like_count`, `video_share_count`, `video_download_count`, `video_comment_count`

Irrelevant column(s): `#`, `video_duration_sec`, `video_transcription_text`, `verified_status`

---

The goal of the project is to build a machine learning model that can streamline the claims process by identifying whether statements made in videos are claims or opinions.

Engagement level seems to be strongly associated with `claim_status` and somewhat with `author_ban_status`. Therefore, `author_ban_status` and the engagement counts (`video_view_count`, ..., `video_comment_count`) could be useful features for classifying whether a video contains a claim or an opinion.

Claim and opinion videos' view statistics:

| Claim status | Average views | Median views |
| --- | --- | --- |
| Claim | 501029 | 501555 |
| Opinion | 4956 | 4953 |
