In [2]:
# Import packages
import pandas as pd
import numpy as np

In [None]:
# Load dataset into dataframe
data = pd.read_csv("../Statistical Testing/tiktok_dataset.csv")

In [None]:
# Display and examine the first ten rows of the dataframe
data.head(10)


In [None]:
# Get summary info
data.info()

In [None]:
# Get summary statistics
data.describe()

In [None]:
# The dataframe contains a collection of categorical, text, and numerical data. Each row represents a distinct TikTok video that presents either a claim or an opinion and the accompanying metadata about that video.
# The dataframe contains five float64s, three int64s, and four objects. There are 19,382 observations, but some of the variables are missing values, including claim status, the video transcription, and all of the count variables.
# Many of the count variables seem to have outliers at the high end of the distribution. They have very large standard deviations and maximum values that are very high compared to their quartile values.
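
The missing values noted above can be counted directly rather than read off `info()`. The following is a minimal sketch on a small made-up dataframe (the `toy` frame and its values are hypothetical stand-ins, not the real TikTok data); the same calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the TikTok dataframe (hypothetical values, not the real data)
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_view_count": [1200.0, 35.0, np.nan, 980000.0],
    "video_like_count": [300.0, 4.0, np.nan, 410000.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows in which at least one value is missing
rows_with_missing = toy[toy.isna().any(axis=1)]
print(len(rows_with_missing))
```

If the missing values are concentrated in the same rows, as in this toy example, dropping those rows loses less information than the per-column counts might suggest.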

In [None]:
# What are the different values for claim status and how many of each are in the data?
data.groupby('claim_status').count()
# The counts of each claim status are quite balanced
#  1. claim with 9608
#  2. opinion with 9476
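
To quantify how balanced the classes are, `value_counts` can return proportions as well as raw counts. This is a sketch on a hypothetical toy column, not the real data; by default `value_counts` ignores missing values.

```python
import pandas as pd

# Toy stand-in for the claim_status column (hypothetical values)
toy = pd.DataFrame(
    {"claim_status": ["claim", "claim", "opinion", "opinion", "claim", None]}
)

# Raw counts per class (missing values are excluded by default)
counts = toy["claim_status"].value_counts()

# Share of each class among the non-missing rows
shares = toy["claim_status"].value_counts(normalize=True)

print(counts)
print(shares)
```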

In [None]:
# What is the average view count of videos with "claim" status?
mask_claim = data['claim_status'] == 'claim'
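# Print both statistics; a notebook cell only displays its last expression
print('Mean view count for claims:', data.loc[mask_claim, 'video_view_count'].mean())
print('Median view count for claims:', data.loc[mask_claim, 'video_view_count'].median())

# The mean view count for claims is 501029.4527477102
# The median is 501555.0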

In [None]:
# What is the average view count of videos with "opinion" status?
mask_opinion = data['claim_status'] == 'opinion'
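# Print both statistics; a notebook cell only displays its last expression
print('Mean view count for opinions:', data.loc[mask_opinion, 'video_view_count'].mean())
print('Median view count for opinions:', data.loc[mask_opinion, 'video_view_count'].median())
# The mean view count for opinions is 4956.43224989447
# The median is 4953.0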

# The mean and the median within each claim category are close to one another, but there is a vast discrepancy between view counts for videos labeled as claims and videos labeled as opinions.
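
The two masked computations above can also be done in one step with a grouped aggregation, which makes the gap between the categories easy to read off. A minimal sketch, assuming a toy frame with made-up view counts (not the real data):

```python
import pandas as pd

# Toy stand-in with hypothetical view counts
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [400000, 500000, 600000, 4000, 5000, 6000],
})

# Mean and median view count for each claim status in a single table
summary = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])
print(summary)

# How many times larger the claim median is than the opinion median
ratio = summary.loc["claim", "median"] / summary.loc["opinion", "median"]
print(ratio)
```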

In [None]:
# Get counts for each group combination of claim status and author ban status
data.groupby(['claim_status','author_ban_status']).count()[['#']]




There are many **more claim videos with banned authors than there are opinion videos with banned authors**. This could mean a number of things, including the possibilities that:

* Claim videos are more strictly policed than opinion videos
* Authors must comply with a stricter set of rules if they post a claim than if they post an opinion

Also, there's no way of knowing whether claim videos are inherently more likely than opinion videos to result in author bans, or whether authors who post claim videos are simply more likely to post videos that violate the terms of service.

Finally, while you can use this data to draw conclusions about banned/active authors, you cannot draw conclusions about banned videos. **There's no way of determining whether a particular video caused the ban**, and banned authors could have posted videos that complied with the terms of service.
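
Raw counts can be hard to compare when the groups differ in size, so it may help to look at the share of each ban status within each claim status. The sketch below uses `pd.crosstab` with `normalize="index"` on a hypothetical toy frame; the values are invented for illustration only.

```python
import pandas as pd

# Toy stand-in with hypothetical claim and ban statuses
toy = pd.DataFrame({
    "claim_status": ["claim"] * 4 + ["opinion"] * 4,
    "author_ban_status": [
        "active", "banned", "banned", "under review",
        "active", "active", "active", "banned",
    ],
})

# Each row sums to 1: the distribution of ban status within a claim status
ban_share = pd.crosstab(
    toy["claim_status"], toy["author_ban_status"], normalize="index"
)
print(ban_share)
```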

In [None]:
# What's the median video share count of each author ban status?
# Select the column first; median() does not take a column name as an argument
data.groupby('author_ban_status')['video_share_count'].median()
# Banned authors have a median share count that's 33 times the median share count of active authors!

In [None]:
# Get the count, mean, and median of each of the following columns:
# `video_view_count`
# `video_like_count`
# `video_share_count`
data.groupby('author_ban_status').agg({'video_view_count':['count', 'mean', 'median'],
                                       'video_like_count': ['count', 'mean', 'median'],
                                       'video_share_count': ['count', 'mean', 'median'] })

A few observations stand out:

* Banned authors and those under review get far more views, likes, and shares than active authors.
* In most groups, the mean is much greater than the median, which indicates that there are some videos with very high engagement counts.
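
One rough way to make the mean-versus-median observation concrete is the ratio of the two: a ratio well above 1 suggests a long right tail. This is only a sketch on a hypothetical toy frame, and the ratio is an informal indicator rather than a formal skewness measure.

```python
import pandas as pd

# Toy stand-in with hypothetical like counts; one "active" video is an outlier
toy = pd.DataFrame({
    "author_ban_status": ["active"] * 4 + ["banned"] * 4,
    "video_like_count": [10, 20, 30, 1000, 100, 200, 300, 400],
})

# Mean and median per group, plus their ratio as a rough skew indicator
stats = toy.groupby("author_ban_status")["video_like_count"].agg(["mean", "median"])
stats["mean_to_median"] = stats["mean"] / stats["median"]
print(stats)
```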

In [None]:
# Create a likes_per_view column
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']

# Create a comments_per_view column
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']

# Create a shares_per_view column
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']
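
The per-view ratios above divide by `video_view_count`, so a video with zero views would produce `inf` (or `NaN` for 0/0). Whether such rows exist in this dataset is not established here; the sketch below, on a hypothetical toy frame, shows one way to guard against it by treating zero views as missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in; the zero-view row is hypothetical and only there to show the guard
toy = pd.DataFrame({
    "video_view_count": [100.0, 0.0, 50.0],
    "video_like_count": [10.0, 2.0, 5.0],
})

# Unguarded division: the zero-view row becomes inf
raw = toy["video_like_count"] / toy["video_view_count"]

# Guarded division: replace zero views with NaN so the rate is simply missing
views = toy["video_view_count"].replace(0, np.nan)
toy["likes_per_view"] = toy["video_like_count"] / views
print(toy)
```

Keeping the rate as `NaN` means later `mean` and `median` aggregations skip the row instead of being distorted by an infinite value.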

In [None]:
# Get the count, mean, and median of each engagement rate by claim status and author ban status
data.groupby(['claim_status', 'author_ban_status']).agg(
    {'likes_per_view': ['count', 'mean', 'median'],
     'comments_per_view': ['count', 'mean', 'median'],
     'shares_per_view': ['count', 'mean', 'median']})

We know that videos by banned authors and those under review tend to get far more views, likes, and shares than videos by non-banned authors. However, when a video does get viewed, its engagement rate is less related to author ban status and more related to its claim status.

Also, we know that claim videos get far more views than opinion videos, but this table shows that claim videos also have a higher rate of likes per view on average, so they are more favorably received as well. They also receive more comments and shares per view than opinion videos.

Note that for claim videos, banned authors have slightly higher likes/view and shares/view rates than active authors or those under review. However, for opinion videos, active authors and those under review both get higher engagement rates than banned authors in all categories.

***Summary***
* Of the 19,382 samples in this dataset, just under 50% (9,608) are claims.
* Engagement level is strongly associated with claim status. This should be a focus of further inquiry.
* Videos with banned authors have much higher engagement counts than videos with active authors. Videos with authors under review fall between these two categories in terms of engagement levels.