### This part of the project aims to explore and understand the provided TikTok dataset, laying the groundwork for deeper analysis. The process is divided into three key parts:

Part 1: Understand the Context
- Clarify the purpose of the data
- Identify key objectives and data structure

Part 2: Explore the Data
- Load the data into a pandas DataFrame
- Review structure, data types, and summary statistics
- Check for missing or unusual values

Part 3: Analyze Variables
- Examine relationships between variables
- Identify key features for further analysis

------------------------------------------------------------------------------------------------------

## 1- PACE: Plan
As part of the planning stage, I will start by identifying any outliers in the data.

I will use a couple of simple tools:

NumPy: I’ll compare the mean() and median() of the numeric columns to get a feel for the data’s central tendency. A mean that sits far from the median hints at skew or extreme values.

Boxplots: Visuals always help, so I will generate boxplots to see whether any extreme values stand out.
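A minimal sketch of both checks on a small made-up array (`sample_counts` is hypothetical, not TikTok data). It compares the mean with the median and computes the 1.5 × IQR fences that a standard boxplot uses for its whiskers.

```python
import numpy as np

# Hypothetical values for illustration only: one value is far larger than the rest.
sample_counts = np.array([12, 15, 18, 20, 22, 25, 30, 900], dtype=float)

# The mean is pulled toward the extreme value; the median is not.
mean_value = sample_counts.mean()
median_value = np.median(sample_counts)

# Quartiles and the 1.5 * IQR fences used by a standard boxplot.
q1, q3 = np.percentile(sample_counts, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the ones a boxplot draws individually.
outliers = sample_counts[(sample_counts < lower_fence) | (sample_counts > upper_fence)]

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print(f"fences=({lower_fence:.2f}, {upper_fence:.2f}), outliers={outliers}")
```

A large gap between the mean and the median is a quick numeric hint, while the fences show which points a boxplot would flag.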

## What Should I Do With Outliers?
Not all outliers are bad, and some of them might be important. Depending on the project goals, I will decide whether to:

- Delete them if they look like obvious errors or typos.

- Reassign them if the dataset is small or if I need to smooth the data for a sensitive model.

- Leave them as-is if I’m just doing EDA or if the model I plan to use can handle them well.
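The three options can be sketched as follows. The `shares` Series is made up, and the 1.5 × IQR threshold is one common choice that I am assuming here, not a rule taken from the project.

```python
import pandas as pd

# Hypothetical share counts; the last value is an extreme outlier.
shares = pd.Series([5, 8, 10, 12, 15, 18, 20, 500], dtype=float, name="share_count")

# Upper threshold from the 1.5 * IQR rule (an assumed convention).
q1, q3 = shares.quantile([0.25, 0.75])
upper_limit = q3 + 1.5 * (q3 - q1)

# Option 1: delete rows above the threshold.
deleted = shares[shares <= upper_limit]

# Option 2: reassign (cap) values above the threshold to the threshold itself.
capped = shares.clip(upper=upper_limit)

# Option 3: leave the data unchanged.
untouched = shares.copy()

print(len(deleted), capped.max(), untouched.max())
```

Capping keeps the row count intact, which matters for a small dataset, while deleting changes the sample size.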

## 2- PACE: Analyze
Before we dive into the analysis, let’s bring in the essential Python libraries we'll use for loading and exploring the TikTok dataset.

In [1]:
import numpy as np
import pandas as pd


Now, let’s load the dataset into a pandas DataFrame.


In [2]:
data = pd.read_csv("/Users/sufyanalwjali/Desktop/Data analysis/Tik_Tok_Project-/tiktok_dataset.csv")

Let's view the first 10 rows, then inspect summary information about the DataFrame.

In [3]:
data.head(10)

,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


The dataset includes a mix of categorical, text, and numerical features. Each row represents a unique TikTok video that presents either a claim or an opinion.

In [4]:
data.info()
 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


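Here we can see missing values in claim_status, video_transcription_text, and the five count fields (video_view_count, video_like_count, video_share_count, video_download_count, and video_comment_count). Each of these columns has 19,084 non-null entries out of 19,382.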

In [5]:
# Get summary statistics
data.describe()

,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


The count variables are heavily right-skewed: their means are far above their medians, and their maximum values and standard deviations are large compared to the quartiles. These possible outliers will need attention during my analysis.
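One way to quantify this, sketched with the numbers from the `describe()` output rather than the raw data: compare each mean with its median, and check each maximum against an upper fence of Q3 + 1.5 × IQR. The 1.5 multiplier is a common convention that I am assuming here, not a rule set by the project.

```python
# Summary values copied from the data.describe() output for the five count columns.
summary = {
    "video_view_count": {"mean": 254708.558688, "q1": 4942.5, "median": 9954.5, "q3": 504327.0, "max": 999817.0},
    "video_like_count": {"mean": 84304.63603, "q1": 810.75, "median": 3403.5, "q3": 125020.0, "max": 657830.0},
    "video_share_count": {"mean": 16735.248323, "q1": 115.0, "median": 717.0, "q3": 18222.0, "max": 256130.0},
    "video_download_count": {"mean": 1049.429627, "q1": 7.0, "median": 46.0, "q3": 1156.25, "max": 14994.0},
    "video_comment_count": {"mean": 349.312146, "q1": 1.0, "median": 9.0, "q3": 292.0, "max": 9599.0},
}

results = {}
for column, stats in summary.items():
    iqr = stats["q3"] - stats["q1"]
    upper_fence = stats["q3"] + 1.5 * iqr  # assumed 1.5 * IQR convention
    results[column] = {
        "mean_to_median": stats["mean"] / stats["median"],
        "upper_fence": upper_fence,
        "max_exceeds_fence": stats["max"] > upper_fence,
    }

for column, result in results.items():
    print(column, round(result["mean_to_median"], 1), round(result["upper_fence"], 1), result["max_exceeds_fence"])
```

Under this assumed rule, every mean is more than 20 times its median. The maxima of likes, shares, downloads, and comments exceed their fences, while the video_view_count maximum (999,817) stays below its fence of about 1,253,404.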

In [6]:
data.isnull().sum()
 

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

Data cleaning, including dropping the null values, is handled in the file "Data exploration and hypothesis testing".
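That file is not shown here, so the following is only a sketch of how such a step might look, assuming every row with a missing value is dropped. The `toy` frame is hypothetical and just mimics the pattern of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame: the second row is missing its label and its count.
toy = pd.DataFrame({
    "video_id": [1, 2, 3],
    "claim_status": ["claim", np.nan, "opinion"],
    "video_view_count": [100.0, np.nan, 50.0],
})

# Flag rows where any column is missing.
rows_with_nulls = toy.isna().any(axis=1)

# Drop those rows and rebuild a clean 0..n-1 index.
cleaned = toy.dropna(axis=0).reset_index(drop=True)

print(rows_with_nulls.sum(), len(cleaned))
```

Running `data.isna().any(axis=1).sum()` on the real data would show whether the 298 missing values in each column belong to the same 298 rows, which the identical counts suggest but do not prove.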

Let's investigate the variables in the dataset.

In [6]:
data['claim_status'].value_counts()

claim      9608
opinion    9476
Name: claim_status, dtype: int64

The counts of each claim status are quite balanced.

Let's now look at the engagement trends associated with each claim status.

In [7]:
claims = data[data['claim_status']=='claim']
print('Mean view count claims:', claims['video_view_count'].mean())
print('Median view count claims:', claims['video_view_count'].median())


Mean view count claims: 501029.4527477102
Median view count claims: 501555.0


In [8]:
opinion = data[data['claim_status']=='opinion']
print('Mean view count opinions:', opinion['video_view_count'].mean())
print('Median view count opinions:', opinion['video_view_count'].median())

Mean view count opinions: 4956.43224989447
Median view count opinions: 4953.0


Mean and median views are similar within each group, but claim videos have far higher view counts than opinion videos: the median is 501,555 for claims versus 4,953 for opinions, roughly 100 times higher. This suggests claims may drive more engagement.
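The same comparison can be written as a single grouped aggregation instead of two filtered cells. This is a sketch on a made-up frame (`toy` and its values are hypothetical); only the column names match the dataset.

```python
import pandas as pd

# Hypothetical data with the same column names as the TikTok dataset.
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [400.0, 500.0, 600.0, 4.0, 5.0, 9.0],
})

# Mean and median view count for each claim status in one table.
view_stats = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])

# Ratio of the claim median to the opinion median.
median_ratio = view_stats.loc["claim", "median"] / view_stats.loc["opinion", "median"]

print(view_stats)
print(median_ratio)
```

Putting both groups in one table makes the gap between them easier to read and avoids mislabeled print statements.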

Let's calculate how many videos there are for each combination of claim status and author ban status.

In [9]:
data.groupby(['claim_status','author_ban_status']).count()[['#']]


claim_status,author_ban_status,#
claim,active,6566
claim,banned,1439
claim,under review,1603
opinion,active,8817
opinion,banned,196
opinion,under review,463


Far more videos from banned authors are claims (1,439) than opinions (196).

This may suggest that claim content is more strictly moderated, or that authors posting claims face stricter content guidelines.

However, it's important to note that while we can analyze author status, we can't determine if a specific video caused a ban.
Bans may result from other content, and even banned authors might have posted videos that did not violate any rules.
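Because the two claim statuses have slightly different totals, proportions are easier to compare than raw counts. A small sketch using the counts from the grouped output (the `counts` and `row_shares` names are mine):

```python
import pandas as pd

# Video counts copied from the claim_status / author_ban_status table.
counts = pd.DataFrame(
    {"active": [6566, 8817], "banned": [1439, 196], "under review": [1603, 463]},
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Share of each author ban status within each claim status (each row sums to 1).
row_shares = counts.div(counts.sum(axis=1), axis=0)

print(row_shares.round(3))
```

About 15% of claim videos come from banned authors, compared with about 2% of opinion videos.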

Let's calculate engagement statistics, including the median video share count, for each author ban status.

In [10]:
data.groupby(['author_ban_status']).agg(
        {'video_view_count':['mean','median'],
         'video_like_count':['mean','median'],
         'video_share_count':['mean','median']})

author_ban_status,video_view_count (mean),video_view_count (median),video_like_count (mean),video_like_count (median),video_share_count (mean),video_share_count (median)
active,215927.039524,8616.0,71036.533836,2222.0,14111.466164,437.0
banned,445845.439144,448201.0,153017.236697,105573.0,29998.942508,14468.0
under review,392204.836399,365245.5,128718.050339,71204.5,25774.696999,9444.0


In [11]:
data.groupby(['author_ban_status']).median(numeric_only=True)[['video_share_count']]


author_ban_status,video_share_count
active,437.0
banned,14468.0
under review,9444.0


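Banned authors have the highest median share count (14,468), about 33 times that of active authors (437). This suggests their content may have been more widely shared or potentially more viral, which could have drawn increased attention from moderators.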

In [12]:
data.groupby(['author_ban_status']).agg(
            {'video_view_count':['count','mean','median'],
            'video_like_count':['count','mean','median'],
            'video_share_count':['count','mean','median']})

author_ban_status,video_view_count (count),video_view_count (mean),video_view_count (median),video_like_count (count),video_like_count (mean),video_like_count (median),video_share_count (count),video_share_count (mean),video_share_count (median)
active,15383,215927.039524,8616.0,15383,71036.533836,2222.0,15383,14111.466164,437.0
banned,1635,445845.439144,448201.0,1635,153017.236697,105573.0,1635,29998.942508,14468.0
under review,2066,392204.836399,365245.5,2066,128718.050339,71204.5,2066,25774.696999,9444.0


In [14]:
# Engagement rates: likes, comments, and shares per view
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

In [15]:
data.groupby(['claim_status','author_ban_status']).agg({'likes_per_view':['count','mean','median'],
                                                        'comments_per_view':['count','mean','median'],
                                                        'shares_per_view':['count','mean','median']})


claim_status,author_ban_status,likes_per_view (count),likes_per_view (mean),likes_per_view (median),comments_per_view (count),comments_per_view (mean),comments_per_view (median),shares_per_view (count),shares_per_view (mean),shares_per_view (median)
claim,active,6566,0.329542,0.326538,6566,0.001393,0.000776,6566,0.065456,0.049279
claim,banned,1439,0.345071,0.358909,1439,0.001377,0.000746,1439,0.067893,0.051606
claim,under review,1603,0.327997,0.320867,1603,0.001367,0.000789,1603,0.065733,0.049967
opinion,active,8817,0.219744,0.21833,8817,0.000517,0.000252,8817,0.043729,0.032405
opinion,banned,196,0.206868,0.198483,196,0.000434,0.000193,196,0.040531,0.030728
opinion,under review,463,0.226394,0.228051,463,0.000536,0.000293,463,0.044472,0.035027


### Claims vs. Opinions Engagement Insights
Claim videos get far more views than opinion videos, and they also earn more likes, comments, and shares per view.
This suggests claim content is more engaging and, judging by likes per view, more favorably received.

### Author Status & Engagement
Videos by banned authors tend to get more views, likes, and shares, but once a video is viewed, engagement rates depend more on claim status than author status.

### Engagement Rate by Group

For claim videos:
- Banned authors have slightly higher likes/view and shares/view rates than active or under-review authors.

For opinion videos:
- Active and under-review authors show higher engagement rates than banned authors across all metrics.
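To check how claim status and author status compare as drivers of engagement rate, here is a sketch that reshapes the mean likes_per_view values from the grouped table. The names `rates`, `wide`, `spread_within`, and `gap_between` are mine.

```python
import pandas as pd

# Mean likes_per_view copied from the claim_status / author_ban_status table.
rates = pd.DataFrame({
    "claim_status": ["claim"] * 3 + ["opinion"] * 3,
    "author_ban_status": ["active", "banned", "under review"] * 2,
    "likes_per_view_mean": [0.329542, 0.345071, 0.327997, 0.219744, 0.206868, 0.226394],
})

# One row per author ban status, one column per claim status.
wide = rates.pivot(index="author_ban_status", columns="claim_status", values="likes_per_view_mean")

# Largest difference between ban statuses within each claim status.
spread_within = wide.max() - wide.min()

# Difference between claim and opinion videos at each ban status.
gap_between = wide["claim"] - wide["opinion"]

print(spread_within.round(4))
print(gap_between.round(4))
```

The smallest claim-versus-opinion gap (about 0.10) is roughly five times the largest spread across ban statuses (about 0.02), which is consistent with claim status mattering more than author status for likes per view.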

------------------------------------------------------------------------------------------------------------

# Summary 

### 1- Claim vs Opinion Breakdown:
Of the 19,084 videos that have a claim status label (the dataset has 19,382 rows, 298 of them unlabeled), just over 50% (9,608) are labeled as claims.

### 2- Engagement & Claim Status:
Engagement metrics (views, likes, comments, shares) are strongly associated with claim status: claim videos consistently perform better than opinion videos. This is a key area for further analysis.

### 3- Author Status & Engagement:

1- Videos by banned authors show substantially higher engagement than those by active authors (median views of 448,201 vs. 8,616).

2- Videos by authors under review fall between the two in terms of performance (median views of 365,245.5).