# **TikTok Project**

Welcome to the TikTok Project!

You have just started as a data professional at TikTok.

The team is still in the early stages of the project. You have received notice that TikTok's leadership team has approved the project proposal.

---

### **Project Proposal: Predictive Model for Video Claims Detection**  
**Using the PACE Workflow (Plan, Analyze, Construct, Execute)**  

## **Part 1. Plan Stage**  

### **Objective**  
Develop a **predictive model** that determines whether a video contains a **claim** or expresses an **opinion**.  
This model aims to **enhance efficiency** in addressing claims, thereby **reducing the backlog of user reports**.  

### **Key Questions to Address**  
- **What variables should be considered?**  
- **What is the condition of the dataset?** (e.g., when and how it was collected)  
- **How can bias be minimized?**  
- **What trends can be observed in the data?**  
- **Which variables are likely to yield better predictive results?**  
- **What is the economic impact of this project?**  

### **Resources Required**  
- A **Python notebook** and necessary **hardware (computer)**  
- The **dataset**  
- **Input from stakeholders**  

## **Part 2. Analyze Stage**  
### **Tasks**  
- **Exploratory Data Analysis (EDA)**: Understanding the dataset, identifying patterns, and addressing irregularities.  
- **Hypothesis Testing**: Running statistical tests to support business needs and model development.  
- **Data Cleaning**: Handling missing values and removing inconsistencies to improve model accuracy.  

### **Why These Steps?**  
- EDA **uncovers patterns and structures** in the dataset.  
- Hypothesis testing **validates statistical assumptions** required for model training.  
- Data cleaning **ensures model reliability** by addressing irregularities.  

## **Part 3. Construct Stage**  
### **Tasks**  
- **Feature Selection & Engineering**: Identifying relevant features that influence predictions.  
- **Model Training & Optimization**: Developing and refining the predictive model.  
- **Model Validation**: Ensuring predictive accuracy using statistical methods.  

## **Part 4. Execute Stage**  
### **Tasks**  
- **Model Evaluation**: Assessing accuracy, recall, and precision to measure performance.  
- **Stakeholder Communication**: Presenting insights, gathering feedback, and ensuring business needs are met.  

### **Why These Steps?**  
- Model evaluation determines how well the solution **meets project goals**.  
- Effective communication **aligns project outcomes** with business needs.  


## **Overall Impact**  
This project will improve **efficiency and accuracy** in handling claims by automating the review process. It will enhance decision-making, reduce manual workload, and provide **scalable AI-driven solutions** for content moderation.  

---

To prepare for building a claims classification model, begin by examining the data TikTok has provided. This is the first step of exploratory data analysis (EDA).

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Part 2: Inspect and analyze data**

In this part, you will examine data provided and prepare it for analysis.
<br/>

**The purpose** of this part is to investigate and understand the data provided.

This part will:

1.   Import packages and load the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA

5. Build visualizations

6.   Conduct hypothesis testing
* How will descriptive statistics help you analyze your data?

* How will you formulate your null hypothesis and alternative hypothesis?

7. Communicate insights with stakeholders

* What key business insight(s) emerge from your hypothesis test?

* What business recommendations do you propose based on your results?

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, create visualizations, and inform TikTok data team members of your findings. You will then apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.
<br/>

### **Identify data types and compile summary information**


### **Task 1. Understand the situation**

*   How can you best prepare to understand and organize the provided information?


* A: Begin by exploring your dataset and consider reviewing the Data Dictionary. Prepare by reading the dataset, viewing the metadata, and exploring the dataset to identify key variables relevant for the stakeholders.

### **Task 2a. Imports and data loading**

Start by importing the packages that you will need to load and explore the dataset. Make sure to use the following import statements:
1. **Import packages for data manipulation**
* import pandas as pd
* import numpy as np

2. **Import packages for data visualization**
* import matplotlib.pyplot as plt
* import seaborn as sns


In [6]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

# Import packages for data preprocessing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.utils import resample

# Import packages for data modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

Then, load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

In [7]:
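# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")
# Later cells refer to the dataframe as `tiktok_dataset`; bind that name to the same object
tiktok_dataset = data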

### **Task 2b. Understand the data - Inspect the data**

View and inspect summary information about the dataframe by **coding the following:**

1. `data.head(10)`
2. `data.info()`
3. `data.describe()`

*Consider the following questions:*

**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

* A:

**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

* A:

**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

* A:
    

In [None]:
# Display the first 10 rows of the dataframe
data.head(10)



In [None]:
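# Print column types and non-null counts
data.info()
# Summary statistics for numeric columns (last expression, so the notebook displays it)
data.describe()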


In [None]:
# Get summary statistics, including categorical columns
tiktok_dataset.describe(include='all')



### **Task 2c. Understand the data - Investigate the variables**

In this phase, you will begin to investigate the variables more closely to better understand them.

You know from the project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the `claim_status` variable. Begin by determining how many videos there are for each different claim status.

In [None]:
# What are the different values for claim status and how many of each are in the data?
# Count each unique value in the 'claim_status' column, including NaN
tiktok_dataset['claim_status'].value_counts(dropna=False)


In [None]:
# Check the data head again
# Display the first 10 rows of the dataframe
tiktok_dataset.head(10)


**Question:** What do you notice about the values shown?
* A:

Next, examine the engagement trends associated with each different claim status.

Start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [None]:
# What are the mean and median view counts of videos with "claim" status?
# Filter the dataframe for rows where claim_status is "claim" (Boolean mask)
claimed_videos = tiktok_dataset[tiktok_dataset['claim_status'] == 'claim']

# Calculate the mean and median view counts
claim_mean_views = claimed_videos['video_view_count'].mean()
claim_median_views = claimed_videos['video_view_count'].median()

# Display the results
print("Mean view count of videos with 'claim' status:", claim_mean_views)
print("Median view count of videos with 'claim' status:", claim_median_views)


In [None]:
# What are the mean and median view counts of videos with "opinion" status?
# Filter the dataframe for rows where claim_status is "opinion" (Boolean mask)
opinion_videos = tiktok_dataset[tiktok_dataset['claim_status'] == 'opinion']

# Calculate the mean and median view counts
opinion_mean_views = opinion_videos['video_view_count'].mean()
opinion_median_views = opinion_videos['video_view_count'].median()

# Display the results
print("Mean view count of videos with 'opinion' status:", opinion_mean_views)
print("Median view count of videos with 'opinion' status:", opinion_median_views)



**Question:** What do you notice about the mean and median within each claim category?
* A:

Now, examine trends associated with the ban status of the author.

Use `groupby()` to calculate how many videos there are for each combination of categories of claim status and author ban status.

In [None]:
# Get counts for each group combination of claim status and author ban status
# Group the loaded dataframe by both columns, then count the rows in each group
grouped_counts = (
    tiktok_dataset.groupby(["claim_status", "author_ban_status"])
    .size()
    .reset_index(name="count")
)

print(grouped_counts)



**Question:** What do you notice about the number of claims videos with banned authors? Why might this relationship occur?
* A:

Continue investigating engagement levels, now focusing on `author_ban_status`.

Calculate the median video share count of each author ban status.

In [None]:
# What's the median video share count of each author ban status?
# Calculate the median video share count for each author ban status
median_shares = tiktok_dataset.groupby("author_ban_status")["video_share_count"].median().reset_index()

# Display the result
print(median_shares)



**Question:** What do you notice about the share count of banned authors, compared to that of active authors? Explore this in more depth.
* A:

Use `groupby()` to group the data by `author_ban_status`, then use `agg()` to get the count, mean, and median of each of the following columns:
* `video_view_count`
* `video_like_count`
* `video_share_count`

Remember, the argument for the `agg()` function is a dictionary whose keys are columns. The values for each column are a list of the calculations you want to perform.
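
As a rough illustration of the dictionary form of `agg()` described above, the sketch below applies the pattern to a tiny, made-up dataframe. The rows and the `toy` name are invented for demonstration and are not taken from the TikTok dataset; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "banned", "banned"],
    "video_view_count": [100, 300, 1000, 3000],
    "video_like_count": [10, 30, 200, 400],
    "video_share_count": [1, 3, 50, 70],
})

# Keys are column names; values list the statistics to compute for that column
summary = toy.groupby("author_ban_status").agg({
    "video_view_count": ["count", "mean", "median"],
    "video_like_count": ["count", "mean", "median"],
    "video_share_count": ["count", "mean", "median"],
})

print(summary)
```

The result has one row per group and a two-level column index (column name, then statistic), so a single value can be read with, for example, `summary.loc["banned", ("video_view_count", "mean")]`.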

In [17]:
### YOUR CODE HERE ###


**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?
* A:

Now, create three new columns to help better understand engagement rates:
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [None]:

# Create a likes_per_view column
tiktok_dataset['likes_per_view'] = tiktok_dataset['video_like_count'] / tiktok_dataset['video_view_count']

# Create a comments_per_view column
tiktok_dataset['comments_per_view'] = tiktok_dataset['video_comment_count'] / tiktok_dataset['video_view_count']

# Create a shares_per_view column
tiktok_dataset['shares_per_view'] = tiktok_dataset['video_share_count'] / tiktok_dataset['video_view_count']



Use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group.
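
One possible shape for this step is sketched below on invented rows so that it runs on its own. The `toy` dataframe and its values are assumptions made for illustration, and only one of the three rate columns is shown to keep the example short.

```python
import pandas as pd

# Hypothetical toy data (values invented for illustration)
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion"],
    "author_ban_status": ["active", "active", "active", "active"],
    "video_view_count": [1000, 2000, 100, 200],
    "video_like_count": [500, 500, 10, 40],
})

# Engagement rate: likes divided by views for each video
toy["likes_per_view"] = toy["video_like_count"] / toy["video_view_count"]

# Group on both categorical columns, then summarize the rate column
rates = toy.groupby(["claim_status", "author_ban_status"]).agg(
    {"likes_per_view": ["count", "mean", "median"]}
)

print(rates)
```

Grouping on a list of columns produces one row per combination that actually occurs in the data, which is why the output index has two levels.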

In [19]:
### YOUR CODE HERE ###


**Question:**

How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.
*  A:

### **Given your efforts, what can you summarize for the TikTok data team so far?**

*Note: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?


* A:
* A:
* A:



### **Task 3a: Data exploration**

Consider functions that help you understand and structure the data.

*    `.head()`
*    `.info()`
*    `.describe()`
*    `.groupby()`
*    `.sort_values()`

Consider the following questions as you work:

What do you do about missing data (if any)?

Are there data outliers?
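
A minimal sketch of how these two questions might be approached is shown below. It uses a small invented dataframe (`toy`) rather than the TikTok data, and dropping rows is only one of several reasonable ways to handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing row and one extreme value
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim", "opinion"],
    "video_view_count": [120.0, 150.0, np.nan, 130.0, 9000.0],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One option among several: drop rows that contain any missing value
toy_clean = toy.dropna(axis=0)
print(toy_clean.shape)

# Sorting is a quick way to eyeball unusually large values
print(toy_clean.sort_values("video_view_count", ascending=False).head(3))
```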

In [None]:

# Generate descriptive statistics for all columns, including non-numeric ones
tiktok_dataset.describe(include='all')




In [None]:
# Get the shape of the dataset (rows, columns)
tiktok_dataset.shape




In [None]:
# Get basic information about the dataset (column types, non-null counts)
tiktok_dataset.info()






### **Task 3b. Select visualization type(s)**

Select data visualization types that will help you understand and explain the data.

Now that you know which data columns you’ll use, it is time to decide which data visualization makes the most sense for EDA of the TikTok dataset. What type of data visualization(s) would be most helpful? Consider the distribution of the data.

* Line graph
* Bar chart
* Box plot
* Histogram
* Heat map
* Scatter plot
* A geographic map


Hint:
* **Box plot**: Helps visualize the spread and center of a group of values in terms of their quartiles. It shows the distribution of *quantitative data* in a way that facilitates comparisons between variables or across levels of a categorical variable.

* **Histogram**: Represents a frequency distribution, showing how often values in each range of a variable occur. It helps visualize numerical variables and answer questions such as: What range do the observations cover? What is their central tendency? Are they heavily skewed in one direction? Is there evidence of bimodality?

* **Bar plot**: A bar plot represents an aggregate or statistical estimate for a numeric variable with the height of each rectangle and indicates the uncertainty around that estimate using an error bar.
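
The box plot hint can be made concrete with a short calculation. The sketch below assumes the common convention, which is the default in both matplotlib and seaborn, of extending whiskers up to 1.5 times the interquartile range and drawing anything beyond that as an individual outlier point. The numbers are invented for illustration.

```python
import numpy as np

# Hypothetical values, invented for illustration
values = np.array([5, 7, 8, 9, 10, 11, 12, 14, 60])

# Quartiles and interquartile range (IQR)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the box are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("median:", median)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Here the fences work out to 2 and 18, so only the value 60 would appear as a separate point on the plot.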

### **Task 3c. Build visualizations**
Now that you have assessed your data, it’s time to plot your visualization(s).

#### **video_duration_sec**
Create a box plot to examine the spread of values in the `video_duration_sec` column.

In [None]:
# Uses the tiktok_dataset dataframe loaded earlier in the notebook
# Create the boxplot
plt.figure(figsize=(8, 5))
sns.boxplot(x=tiktok_dataset['video_duration_sec'])

# Labels and title
plt.xlabel('Video Duration (seconds)')
plt.title('Boxplot of Video Duration')
plt.show()



**Verify**: There are no outliers in the `video_duration_sec` column. The median is about 33 seconds, and the distribution shows no obvious skew.

Create a histogram of the values in the `video_duration_sec` column to further explore the distribution of this variable.

In [None]:
# Uses the tiktok_dataset dataframe loaded earlier in the notebook
# Create the histogram
plt.figure(figsize=(8, 5))
sns.histplot(tiktok_dataset['video_duration_sec'], bins=10, kde=True)

# Labels and title
plt.xlabel('Video Duration (seconds)')
plt.ylabel('Frequency')
plt.title('Histogram of Video Duration')
plt.show()



**Question:** What do you notice about the duration and distribution of the videos?

* A:

#### **video_view_count**

Create a box plot to examine the spread of values in the `video_view_count` column.

In [None]:
# Uses the tiktok_dataset dataframe loaded earlier in the notebook
# Create the boxplot for video view count
plt.figure(figsize=(8, 5))
sns.boxplot(x=tiktok_dataset['video_view_count'])

# Labels and title
plt.xlabel('Video View Count')
plt.title('Boxplot of Video View Count')
plt.show()


**Verify**: The box plot shows no outliers in the data. However, the median sits close to zero, far below the upper end of the range, which indicates a right-skewed distribution (a long tail toward high values). This can be explored further with a histogram. Clearly, `video_view_count` is not normally distributed.
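
The direction of skew can also be checked numerically before plotting. As a sketch on invented numbers (not the TikTok data), a mean well above the median and a positive value from pandas' `Series.skew()` both point to a long tail on the right.

```python
import pandas as pd

# Hypothetical view counts: most values are small, a few are very large
views = pd.Series([10, 20, 30, 40, 50, 60, 5000, 20000])

mean_views = views.mean()
median_views = views.median()
skewness = views.skew()  # sample skewness; positive means a longer right tail

print("mean:", mean_views)
print("median:", median_views)
print("skewness:", skewness)
```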

Create a histogram of the values in the `video_view_count` column to further explore the distribution of this variable.

In [None]:
# Uses the tiktok_dataset dataframe loaded earlier in the notebook
# Create the histogram for video view count
plt.figure(figsize=(8, 5))
sns.histplot(tiktok_dataset['video_view_count'], bins=10, kde=True)

# Labels and title
plt.xlabel('Video View Count')
plt.ylabel('Frequency')
plt.title('Histogram of Video View Count')
plt.show()



**Question:** What do you notice about the distribution of this variable?
* A:

#### **video_like_count**

Create a box plot to examine the spread of values in the `video_like_count` column.

In [None]:
# Uses the tiktok_dataset dataframe loaded earlier in the notebook
# Create the boxplot for video like count
plt.figure(figsize=(8, 5))
sns.boxplot(x=tiktok_dataset['video_like_count'])

# Labels and title
plt.xlabel('Video Like Count')
plt.title('Boxplot of Video Like Count')
plt.show()



**Verify**: The box plot shows outliers in the data, beginning at `video_like_count` values greater than roughly 300k. The median is also close to zero, which indicates a right-skewed distribution (a long tail toward high values). This can be explored further with a histogram. Clearly, `video_like_count` is not normally distributed.

Create a histogram of the values in the `video_like_count` column to further explore the distribution of this variable.

In [None]:
# Uses the tiktok_dataset dataframe loaded earlier in the notebook
# Create the histogram for video like count
plt.figure(figsize=(8, 5))
sns.histplot(tiktok_dataset['video_like_count'], bins=10, kde=True)

# Labels and title
plt.xlabel('Video Like Count')
plt.ylabel('Frequency')
plt.title('Histogram of Video Like Count')
plt.show()



**Question:** What do you notice about the distribution of this variable?
* A:

#### **video_comment_count**

Create a box plot to examine the spread of values in the `video_comment_count` column.

In [None]:
# Uses the tiktok_dataset dataframe loaded earlier in the notebook
# Create the boxplot for video comment count
plt.figure(figsize=(8, 5))
sns.boxplot(x=tiktok_dataset['video_comment_count'])

# Labels and title
plt.xlabel('Video Comment Count')
plt.title('Boxplot of Video Comment Count')
plt.show()



**Verify**: The box plot shows outliers in the data. The median is also close to zero, which indicates a right-skewed distribution (a long tail toward high values). This can be explored further with a histogram. Clearly, `video_comment_count` is not normally distributed.

Create a histogram of the values in the `video_comment_count` column to further explore the distribution of this variable.

In [None]:
# Uses the tiktok_dataset dataframe loaded earlier in the notebook
# Create the histogram for video comment count
plt.figure(figsize=(8, 5))
sns.histplot(tiktok_dataset['video_comment_count'], bins=10, kde=True)

# Labels and title
plt.xlabel('Video Comment Count')
plt.ylabel('Frequency')
plt.title('Histogram of Video Comment Count')
plt.show()

**Question:** What do you notice about the distribution of this variable?
* A:

#### **video_share_count**

Create a box plot to examine the spread of values in the `video_share_count` column.

In [None]:
# Uses the tiktok_dataset dataframe loaded earlier in the notebook
# Create the boxplot for video share count
plt.figure(figsize=(8, 5))
sns.boxplot(x=tiktok_dataset['video_share_count'])

# Labels and title
plt.xlabel('Video Share Count')
plt.title('Boxplot of Video Share Count')
plt.show()


Create a histogram of the values in the `video_share_count` column to further explore the distribution of this variable.

In [None]:
import matplotlib.pyplot as plt

# Histogram of video_share_count (missing values are dropped before plotting)
plt.figure(figsize=(10, 2))
plt.hist(tiktok_dataset['video_share_count'].dropna(), bins=20, edgecolor='black')

plt.xlabel('Video Share Count')
plt.ylabel('Frequency')
plt.title('Histogram of Video Share Count')

plt.show()



**Question:** What do you notice about the distribution of this variable?
* A:

#### **video_download_count**

Create a box plot to examine the spread of values in the `video_download_count` column.

In [None]:
import matplotlib.pyplot as plt

# Create a boxplot for 'video_download_count'
# (missing values are dropped first; plt.boxplot cannot summarize NaN values)
plt.figure(figsize=(8, 4))
plt.boxplot(tiktok_dataset['video_download_count'].dropna(), vert=False, patch_artist=True)

plt.xlabel('Video Download Count')
plt.title('Boxplot of Video Download Count on TikTok')

plt.show()



Create a histogram of the values in the `video_download_count` column to further explore the distribution of this variable.

In [None]:
import matplotlib.pyplot as plt

# Histogram of video_download_count (missing values are dropped before plotting)
plt.figure(figsize=(10, 2))
plt.hist(tiktok_dataset['video_download_count'].dropna(), bins=20, edgecolor='black')

plt.xlabel('Video Download Count')
plt.ylabel('Frequency')
plt.title('Histogram of Video Download Count')

plt.show()


**Question:** What do you notice about the distribution of this variable?
* A:

#### **Claim status by verification status**

Now, create a histogram with four bars: one for each combination of claim status and verification status.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Grouped bars: one bar for each combination of claim status and verification status
plt.figure(figsize=(7, 4))
sns.histplot(data=tiktok_dataset, x='claim_status', hue='verified_status',
             multiple='dodge', shrink=0.9)

plt.xlabel('Claim Status')
plt.ylabel('Count')
plt.title('Claim Status by Verification Status')

plt.show()



**Question:** What do you notice about the number of verified users compared to unverified? And how does that affect their likelihood to post opinions?

* A:

#### **Claim status by author ban status**

The previous course used a `groupby()` statement to examine the count of each claim status for each author ban status. Now, use a histogram to communicate the same information.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Grouped bars: count of each claim status for each author ban status
plt.figure(figsize=(7, 4))
sns.histplot(data=tiktok_dataset, x='claim_status', hue='author_ban_status',
             multiple='dodge', shrink=0.9)

plt.xlabel('Claim Status')
plt.ylabel('Count')
plt.title('Claim Status by Author Ban Status')

plt.show()




**Question:** What do you notice about the number of active authors compared to banned authors for both claims and opinions?

* A:

#### **Median view counts by ban status**

Create a bar plot with three bars: one for each author ban status. The height of each bar should correspond with the median number of views for all videos with that author ban status.

In [None]:
import matplotlib.pyplot as plt

# Median view count for each author ban status
median_views_by_ban = tiktok_dataset.groupby('author_ban_status')['video_view_count'].median()

plt.figure(figsize=(10, 5))
median_views_by_ban.plot(kind='bar', color='skyblue', edgecolor='black')

plt.xlabel('Author Ban Status')
plt.ylabel('Median View Count')
plt.title('Median View Count by Author Ban Status')

plt.xticks(rotation=45)
plt.show()



**Question:** What do you notice about the median view counts for non-active authors compared to that of active authors? Based on that insight, what variable might be a good indicator of claim status?

* A:

In [None]:
# Median view count for each claim status
median_views = tiktok_dataset.groupby('claim_status')['video_view_count'].median()
print(median_views)



#### **Total views by claim status**

Create a bar graph that depicts the proportions of total views for claim videos and total views for opinion videos.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Aggregate total views by claim status
total_views = tiktok_dataset.groupby('claim_status')['video_view_count'].sum()

# Define colors using a Seaborn bright palette
colors = sns.color_palette('bright', len(total_views))

# Create the bar graph
plt.figure(figsize=(10, 5))
bars = plt.bar(total_views.index, total_views.values, color=colors, edgecolor='black')

# Increase the y-axis limit to avoid truncation of high-value annotations
plt.ylim(0, total_views.max() * 1.1)

# Annotate bars with percentage values
total_sum = total_views.sum()
for bar in bars:
    height = bar.get_height()
    percentage = (height / total_sum) * 100
    plt.text(bar.get_x() + bar.get_width()/2, height, f'{percentage:.1f}%', 
             ha='center', va='bottom', fontsize=12)

# Labels and title
plt.xlabel('Claim Status')
plt.ylabel('Total Views')
plt.title('Total Views by Claim Status')

plt.show()



In [None]:
# Count the occurrences of each claim_status
claim_status_counts = tiktok_dataset['claim_status'].value_counts()

# Display the counts
print(claim_status_counts)




**Question:** What do you notice about the overall view count for claim status?

* A:

### **Task 4. Determine outliers**

When building predictive models, the presence of outliers can be problematic. For example, if you were trying to predict the view count of a particular video, videos with extremely high view counts might introduce bias to a model. Also, some outliers might indicate problems with how data was captured or recorded.

The ultimate objective of the TikTok project is to build a model that predicts whether a video is a claim or opinion. The analysis you've performed indicates that a video's engagement level is strongly correlated with its claim status. There's no reason to believe that any of the values in the TikTok data are erroneously captured, and they align with expectations of how social media works: a very small proportion of videos get super high engagement levels. That's the nature of viral content.

Nonetheless, it's good practice to get a sense of just how many of your data points could be considered outliers. The definition of an outlier can change based on the details of your project, and it helps to have domain expertise to decide a threshold. You've learned that a common way to determine outliers in a normal distribution is to calculate the interquartile range (IQR) and set a threshold that is 1.5 * IQR above the 3rd quartile.

In this TikTok dataset, the values for the count variables are not normally distributed. They are heavily skewed to the right. One way of modifying the outlier threshold is by calculating the **median** value for each variable and then adding 1.5 * IQR. This results in a threshold that is, in this case, much lower than it would be if you used the 3rd quartile.
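
To see why the median-based threshold is lower, the sketch below compares the two thresholds on synthetic right-skewed counts. The data, the seed, and the variable names are illustrative assumptions, not values from the TikTok dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts (illustrative only, not the TikTok data)
rng = np.random.default_rng(0)
counts = pd.Series(rng.lognormal(mean=5, sigma=1.5, size=1_000)).round()

q1, median, q3 = counts.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

# Conventional threshold versus the median-based variant described above
q3_threshold = q3 + 1.5 * iqr
median_threshold = median + 1.5 * iqr

print(f"Q3-based threshold:     {q3_threshold:.0f} ({(counts > q3_threshold).sum()} outliers)")
print(f"Median-based threshold: {median_threshold:.0f} ({(counts > median_threshold).sum()} outliers)")
```

Because the median can never exceed the 3rd quartile, the median-based threshold is never higher than the conventional one, so it flags at least as many points.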

Write a for loop that iterates over the column names of each count variable. For each iteration:
1. Calculate the IQR of the column
2. Calculate the median of the column
3. Calculate the outlier threshold (median + 1.5 * IQR)
4. Calculate the number of videos with a count in that column that exceeds the outlier threshold
5. Print "Number of outliers, {column name}: {outlier count}"

```
Example:
Number of outliers, video_view_count: ___
Number of outliers, video_like_count: ___
Number of outliers, video_share_count: ___
Number of outliers, video_download_count: ___
Number of outliers, video_comment_count: ___
```

In [None]:
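# Count the videos above the median + 1.5 * IQR threshold for each count variable
count_cols = ['video_view_count', 'video_like_count', 'video_share_count',
              'video_download_count', 'video_comment_count']

for column in count_cols:
    q1, q3 = tiktok_dataset[column].quantile([0.25, 0.75])
    outlier_threshold = tiktok_dataset[column].median() + 1.5 * (q3 - q1)
    outlier_count = (tiktok_dataset[column] > outlier_threshold).sum()
    print(f'Number of outliers, {column}: {outlier_count}')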



#### **Scatterplot**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create the scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=tiktok_dataset, x='video_view_count', y='video_like_count', hue='claim_status', palette='deep')

# Labels and title
plt.xlabel('Video View Count')
plt.ylabel('Video Like Count')
plt.title('Scatterplot of Video View Count vs. Video Like Count by Claim Status')

plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter the dataset for 'opinions' only
opinions_data = tiktok_dataset[tiktok_dataset['claim_status'] == 'opinions']

# Create the scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=opinions_data, x='video_view_count', y='video_like_count')

# Labels and title
plt.xlabel('Video View Count')
plt.ylabel('Video Like Count')
plt.title('Scatterplot of Video View Count vs. Video Like Count for Opinions')

plt.show()



You can also create a scatterplot in Tableau Public, which can be easier to manipulate and present.

We have learned ....
* *We have learned about data distribution/spread, count frequencies, mean and median values, outliers, missing data, and more. We also analyzed correlations between variables, particularly between the **claim_status** variable and others.*

Our other questions are ....
* Besides the count variables, we would like to investigate which other variables would help us understand the data.
* We will also investigate which variables are correlated with each class of claim.

Our audience would likely want to know ...
* Some of our assumptions
* How the observations from the data impact the business.

### **Task 5. Data Cleaning**

In [None]:
# Check for missing values in the dataset
missing_values = tiktok_dataset.isnull().sum()

# Display the count of missing values for each column
print(missing_values)


In [None]:
# Drop rows with missing values
tiktok_dataset_cleaned = tiktok_dataset.dropna()

# Display the cleaned dataset (first few rows)
print(tiktok_dataset_cleaned.head())



In [None]:
# Compute the mean video_view_count for each group in verified_status
mean_video_view_count = tiktok_dataset.groupby('verified_status')['video_view_count'].mean()

# Display the results
print(mean_video_view_count)


### **Task 6. Hypothesis testing**

Before you conduct your hypothesis test, consider the following questions where applicable to complete your code response:

1. Recall the difference between the null hypothesis and the alternative hypotheses. What are your hypotheses for this data project?

*   **Null hypothesis**: There is no difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to chance or sampling variability).

*    **Alternative hypothesis**: There is a difference in number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified accounts (any observed difference in the sample data is due to an actual difference in the corresponding population means).


Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis



**$H_0$**:

**$H_A$**:


You choose 5% as the significance level and proceed with a two-sample t-test.

In [None]:
import scipy.stats as stats

# Assuming 'verified_status' has two groups: 'Verified' and 'Not Verified'
# Extract data for both groups
verified_data = tiktok_dataset[tiktok_dataset['verified_status'] == 'Verified']['video_view_count']
not_verified_data = tiktok_dataset[tiktok_dataset['verified_status'] == 'Not Verified']['video_view_count']

# Perform a two-sample t-test
t_stat, p_value = stats.ttest_ind(verified_data, not_verified_data, nan_policy='omit')

# Display the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")
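
The fourth step, the decision, can be sketched as below. This is a minimal illustration on synthetic data, not the TikTok dataset: the group sizes, means, and variable names are assumptions. It also swaps in Welch's t-test (`equal_var=False`), which does not assume equal variances in the two groups, whereas the cell above uses the default pooled-variance test.

```python
import numpy as np
from scipy import stats

# Synthetic view counts for two groups (illustrative only, not the TikTok data)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=90_000, scale=20_000, size=200)
group_b = rng.normal(loc=260_000, scale=60_000, size=2_000)

alpha = 0.05  # significance level chosen before running the test

# Welch's t-test (equal_var=False) does not assume equal group variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Step 4: compare the p-value with the significance level
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"t = {t_stat:.2f}, p = {p_value:.3g}: {decision}")
```

The same comparison of `p_value` against the chosen significance level applies to the result computed on the real data.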



**Question:** Based on the p-value you got above, do you reject or fail to reject the null hypothesis?

A:

**Question:** What business insight(s) can you draw from the result of your hypothesis test?

In the business context, the result of the test can be leveraged to inform various decisions and strategies. Here are some potential business insights that can be drawn from the result:

- **Market Segmentation**: The analysis shows that there is a statistically significant difference in the average view counts between videos from verified accounts and videos from unverified accounts. This suggests there might be fundamental behavioral or preference differences between these two groups of accounts.

- **Resource Allocation**: Understanding the differences between the two groups can help TikTok allocate resources more effectively. For example, they can prioritize investments in areas that are more impactful or relevant to specific customer segments, thereby optimizing resource allocation and maximizing returns.

- It would be interesting to investigate the root cause of this behavioral difference. For example, do unverified accounts tend to post more clickbait-y videos? Or are unverified accounts associated with spam bots that help inflate view counts?

- The next step will be to build a regression model on `verified_status`. A regression model is the natural next step because the end goal is to make predictions on claim status, and a model for `verified_status` can help analyze how verified accounts behave. Technical note for preparing the model: because the outcome is binary (verified or not verified), a logistic regression model is the appropriate choice; the skew in the count variables and the imbalance between account types will need attention during data preparation.


# **Part 3: Regression modeling**

In this part, you will build a logistic regression model in Python. As you have learned, logistic regression helps you estimate the probability of an outcome. For data science professionals, this is a useful skill because it allows you to consider more than one explanatory variable when modeling the outcome you're measuring. This opens the door to much more thorough and flexible analysis.
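
As a reminder of how a logistic regression model turns features into a probability, here is a minimal sketch. The intercept, coefficients, and feature values are hypothetical numbers chosen for illustration; they are not fitted to the TikTok data.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficients for two features (not fitted to any data)
intercept = -1.0
coefficients = np.array([0.8, -0.5])

# One observation with two feature values
x = np.array([2.0, 1.0])

# The model is linear in the log-odds; the sigmoid converts log-odds to a probability
log_odds = intercept + coefficients @ x
probability = sigmoid(log_odds)

print(f"log-odds = {log_odds:.2f}, estimated probability = {probability:.3f}")
```

A positive coefficient raises the log-odds (and therefore the estimated probability) as its feature increases, while a negative coefficient lowers it.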

<br/>

**The purpose** of this project is to demonstrate knowledge of regression models.

**The goal** is to build a logistic regression model and evaluate the model.
<br/>
*This part has three tasks:*

**Task 1:** EDA & Checking Model Assumptions
* What are some purposes of EDA before constructing a logistic regression model?

**Task 2:** Model Building and Evaluation
* What resources do you find yourself using as you complete this stage?

**Task 3:** Interpreting Model Results

* What key insights emerged from your model(s)?

* What business recommendations do you propose based on the models built?

Follow the instructions and answer the question below to complete the part.

### **Task 1a. EDA & Checking Model Assumptions**

In [None]:
# Check for duplicate rows in the dataset
duplicates = tiktok_dataset.duplicated()

# Display the number of duplicate rows
print(f"Number of duplicate rows: {duplicates.sum()}")

# Optionally, display the duplicate rows
print(tiktok_dataset[duplicates])


Check for and handle outliers. Remember that there are outliers in the `video_like_count` and `video_comment_count` columns.

In [None]:
import matplotlib.pyplot as plt

# Create a boxplot to visualize the distribution of `video_like_count`
# (plt.boxplot has no `color` argument; the fill color is set through `boxprops`)
plt.figure(figsize=(8, 6))
plt.boxplot(tiktok_dataset['video_like_count'].dropna(), vert=False, patch_artist=True,
            boxprops=dict(facecolor='skyblue'))

plt.xlabel('Video Like Count')
plt.title('Boxplot of Video Like Count')
plt.show()

# Create a boxplot to visualize the distribution of `video_comment_count`
plt.figure(figsize=(8, 6))
plt.boxplot(tiktok_dataset['video_comment_count'].dropna(), vert=False, patch_artist=True,
            boxprops=dict(facecolor='lightgreen'))

plt.xlabel('Video Comment Count')
plt.title('Boxplot of Video Comment Count')
plt.show()




In [None]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = tiktok_dataset['video_like_count'].quantile(0.25)
Q3 = tiktok_dataset['video_like_count'].quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

# Define the lower and upper bounds for non-outlier data
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = tiktok_dataset[(tiktok_dataset['video_like_count'] < lower_bound) | 
                          (tiktok_dataset['video_like_count'] > upper_bound)]

# Display outliers
print(f"Outliers:\n{outliers[['video_like_count']]}")




In [None]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile) for 'video_comment_count'
Q1 = tiktok_dataset['video_comment_count'].quantile(0.25)
Q3 = tiktok_dataset['video_comment_count'].quantile(0.75)

# Calculate the IQR
IQR = Q3 - Q1

# Define the lower and upper bounds for non-outlier data
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = tiktok_dataset[(tiktok_dataset['video_comment_count'] < lower_bound) | 
                          (tiktok_dataset['video_comment_count'] > upper_bound)]

# Display outliers
print(f"Outliers:\n{outliers[['video_comment_count']]}")



Check class balance.

In [None]:
# Check the class balance of the outcome variable as proportions
class_balance = tiktok_dataset['verified_status'].value_counts(normalize=True)

# Display the class balance
print(class_balance)



**Verify**: Approximately 93.7% of the dataset represents videos posted by unverified accounts and 6.3% represents videos posted by verified accounts. So the outcome variable is not very balanced.

Use resampling to create class balance in the outcome variable, if needed.

In [None]:
# Use resampling to create class balance in the outcome variable, if needed


# Identify majority and minority classes in the 'verified_status' column
majority_class = tiktok_dataset[tiktok_dataset['verified_status'] == 'Not Verified']
minority_class = tiktok_dataset[tiktok_dataset['verified_status'] == 'Verified']

# Display counts of each class
print(f"Majority class count: {len(majority_class)}")
print(f"Minority class count: {len(minority_class)}")



# Upsample the minority class to match the majority class size
minority_class_upsampled = minority_class.sample(n=len(majority_class), replace=True, random_state=42)

# Display the size of the upsampled minority class
print(f"Upsampled minority class count: {len(minority_class_upsampled)}")



# Combine the majority class with the upsampled minority class
tiktok_dataset_balanced = pd.concat([majority_class, minority_class_upsampled])

# Display the size of the new balanced dataset
print(f"Total dataset size after balancing: {len(tiktok_dataset_balanced)}")



# Display the new class distribution
class_counts = tiktok_dataset_balanced['verified_status'].value_counts()
print(f"Class distribution after balancing:\n{class_counts}")




Get the average `video_transcription_text` length for videos posted by verified accounts and the average `video_transcription_text` length for videos posted by unverified accounts.



In [None]:
# Calculate the length of video_transcription_text for each row (missing text yields NaN)
tiktok_dataset['transcription_length'] = tiktok_dataset['video_transcription_text'].str.len()

# Average transcription length for videos from verified and unverified accounts
average_length_by_status = tiktok_dataset.groupby('verified_status')['transcription_length'].mean()

# Display the results
print(average_length_by_status)



Extract the length of each `video_transcription_text` and add this as a column to the dataframe, so that it can be used as a potential feature in the model.

In [None]:
# Extract the length of each video_transcription_text and add it as a new column
# (.str.len() returns NaN for missing text instead of raising an error)
tiktok_dataset['transcription_length'] = tiktok_dataset['video_transcription_text'].str.len()

# Display the first few rows to confirm the new column
print(tiktok_dataset[['video_transcription_text', 'transcription_length']].head())



In [None]:
import matplotlib.pyplot as plt

# Filter the dataset into verified and unverified accounts
verified_videos = tiktok_dataset[tiktok_dataset['verified_status'] == 'Verified']
unverified_videos = tiktok_dataset[tiktok_dataset['verified_status'] == 'Not Verified']

# Create the histograms
plt.figure(figsize=(10, 6))
plt.hist(verified_videos['transcription_length'].dropna(), bins=30, alpha=0.5, label='Verified', color='skyblue')
plt.hist(unverified_videos['transcription_length'].dropna(), bins=30, alpha=0.5, label='Unverified', color='salmon')

# Add labels and title
plt.xlabel('Transcription Length')
plt.ylabel('Frequency')
plt.title('Distribution of Transcription Length for Verified and Unverified Accounts')

# Add legend
plt.legend()

# Show the plot
plt.show()


### **Task 1b. Examine correlations**
Next, code a correlation matrix to help determine most correlated variables.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix for the numeric variables only
# (without numeric_only=True, recent pandas versions raise an error on text columns)
correlation_matrix = tiktok_dataset.corr(numeric_only=True)

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Add title to the heatmap
plt.title('Correlation Heatmap of Variables')

# Show the plot
plt.show()



One of the model assumptions for logistic regression is no severe multicollinearity among the features. Take this into consideration as you examine the heatmap and choose which features to proceed with.
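
Reading pairs off a heatmap by eye can be error-prone, so one option is to list the strongly correlated pairs programmatically. The sketch below is an illustration on synthetic data: the helper name `highly_correlated_pairs`, the 0.8 cutoff, and the column names are assumptions, not part of the project.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and self-correlations are skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic example: 'likes' is built to track 'views' closely, 'duration' is independent
rng = np.random.default_rng(1)
views = rng.lognormal(mean=8, sigma=1, size=500)
demo = pd.DataFrame({
    'views': views,
    'likes': views * 0.3 + rng.normal(0, 50, size=500),
    'duration': rng.integers(5, 60, size=500),
})

print(highly_correlated_pairs(demo))
```

When a pair exceeds the cutoff, a common remedy is to keep only one of the two features in the model.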

**Question:** What variables are shown to be correlated in the heatmap? Check for multicollinearity.

* A:

To build a logistic regression model, I will drop the __________ feature.

### **Task 3a. Select variables**
Set your Y and X variables.

Select the outcome variable.

In [None]:
# Select the outcome variable (for example, 'verified_status')
outcome_variable = tiktok_dataset['verified_status']

# Display the first few values of the outcome variable
print(outcome_variable.head())


Select the features.

In [None]:
# Select the features by excluding the outcome variable ('verified_status')
features = tiktok_dataset.drop(columns=['verified_status'])

# Display the first few rows of the features dataframe
print(features.head())


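# Select specific columns as features: numeric metrics plus the two categorical
# columns that are one-hot encoded later (adjust after reviewing the heatmap)
features = tiktok_dataset[['video_view_count', 'video_like_count', 'video_comment_count',
                           'transcription_length', 'claim_status', 'author_ban_status']]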

# Display the first few rows of the features dataframe
print(features.head())




In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% for training and 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(features, outcome_variable, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets
print(f"Training set shape: X_train = {X_train.shape}, y_train = {y_train.shape}")
print(f"Testing set shape: X_test = {X_test.shape}, y_test = {y_test.shape}")



In [None]:
# Get the shape of each training and testing set
print(f"Shape of X_train (features for training): {X_train.shape}")
print(f"Shape of X_test (features for testing): {X_test.shape}")
print(f"Shape of y_train (outcome for training): {y_train.shape}")
print(f"Shape of y_test (outcome for testing): {y_test.shape}")


In [None]:
# Check the data types of each column in the dataset
print(tiktok_dataset.dtypes)



In [None]:
# Get unique values in the 'claim_status' column
unique_claim_status = tiktok_dataset['claim_status'].unique()

# Display the unique values
print(unique_claim_status)



In [None]:
# Get unique values in the 'author_ban_status' column
unique_author_ban_status = tiktok_dataset['author_ban_status'].unique()

# Display the unique values
print(unique_author_ban_status)



In [None]:
# Select the categorical columns in the features DataFrame
categorical_columns = X_train.select_dtypes(include=['object']).columns

# Display the first few rows of the categorical columns
print(X_train[categorical_columns].head())




In [None]:
from sklearn.preprocessing import OneHotEncoder

# Initialize the OneHotEncoder
# (`sparse_output` requires scikit-learn >= 1.2; older versions use `sparse=False`)
encoder = OneHotEncoder(drop='first', sparse_output=False)

# Fit and transform the categorical features in the training set
X_train_encoded = encoder.fit_transform(X_train[categorical_columns])

# Convert the encoded features to a DataFrame for easier readability
X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out(categorical_columns))

# Display the first few rows of the encoded features
print(X_train_encoded_df.head())


In [None]:
# Place the encoded training features (array) into a DataFrame
X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=encoder.get_feature_names_out(categorical_columns))

# Display the first few rows of the encoded features DataFrame
print(X_train_encoded_df.head())



In [None]:
# Drop the 'claim_status' and 'author_ban_status' columns from X_train
X_train_dropped = X_train.drop(columns=['claim_status', 'author_ban_status'])

# Display the first few rows of the modified X_train
print(X_train_dropped.head())



In [None]:
# Drop the 'claim_status' and 'author_ban_status' columns from X_train
X_train_dropped = X_train.drop(columns=['claim_status', 'author_ban_status'])

# Reset index of X_train_dropped to align with the encoded DataFrame
X_train_dropped.reset_index(drop=True, inplace=True)

# Concatenate X_train_dropped and X_train_encoded_df
X_train_final = pd.concat([X_train_dropped, X_train_encoded_df], axis=1)

# Display the first few rows of the final training DataFrame
print(X_train_final.head())





In [None]:
# Check the data type of the outcome variable (e.g., y_train)
print(y_train.dtype)



In [None]:
# Get unique values of the outcome variable (e.g., y_train)
unique_outcome_values = y_train.unique()

# Display the unique values
print(unique_outcome_values)




As shown above, the outcome variable is currently of data type `object`. One-hot encoding can be used to make this variable numeric.

Encode the categorical values of the outcome variable in the training set using an appropriate method.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Use a separate encoder for the outcome so the fitted feature encoder (`encoder`) is not overwritten
# (`sparse_output` requires scikit-learn >= 1.2; older versions use `sparse=False`)
y_encoder = OneHotEncoder(drop='first', sparse_output=False)

# Reshape y_train to a 2D array and apply the one-hot encoder
y_train_encoded = y_encoder.fit_transform(y_train.values.reshape(-1, 1))

# Flatten the array using .ravel() for further use in model training
y_train_encoded_flat = y_train_encoded.ravel()

# Display the encoded training outcome variable
print(y_train_encoded_flat[:10])  # Display first 10 values for inspection





### **Task 3d. Model building**
Construct a model and fit it to the training set.

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model
logreg_model = LogisticRegression()

# Fit the model to the training data (features and encoded outcome variable)
logreg_model.fit(X_train_final, y_train_encoded_flat)

# Display the coefficients of the fitted model
print("Model coefficients:", logreg_model.coef_)
print("Model intercept:", logreg_model.intercept_)





### **Task 4a. Results and evaluation**

Evaluate your model.

Encode categorical features in the testing set using an appropriate method.

In [None]:
# Use the same categorical columns that were encoded in the training set
categorical_columns_test = ['claim_status', 'author_ban_status']

# Select the testing features that need to be encoded
X_test_categorical = X_test[categorical_columns_test]

# Display the first few rows of the selected testing features
print(X_test_categorical.head())




In [None]:
# Transform the testing features using the fitted encoder
X_test_encoded = encoder.transform(X_test_categorical)

# Convert the encoded features to a DataFrame for easier readability
X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out(categorical_columns_test))

# Display the first few rows of the encoded testing features
print(X_test_encoded_df.head())


In [None]:
# Convert the encoded testing features to a DataFrame
X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out(categorical_columns_test))

# Display the first few rows of the encoded testing features DataFrame
print(X_test_encoded_df.head())



In [None]:
# Drop the 'claim_status' and 'author_ban_status' columns from X_test
X_test_dropped = X_test.drop(columns=['claim_status', 'author_ban_status'])

# Display the first few rows of the modified X_test
print(X_test_dropped.head())



In [None]:
# Drop the 'claim_status' and 'author_ban_status' columns from X_test
X_test_dropped = X_test.drop(columns=['claim_status', 'author_ban_status'])

# Reset the index to align with the encoded DataFrame
X_test_dropped.reset_index(drop=True, inplace=True)

# Concatenate X_test_dropped and X_test_encoded_df
X_test_final = pd.concat([X_test_dropped, X_test_encoded_df], axis=1)

# Display the first few rows of the final X_test dataframe
print(X_test_final.head())



In [None]:
# Use the logistic regression model to get predictions on the encoded testing set
y_test_pred = logreg_model.predict(X_test_final)

# Display the first few predictions
print(y_test_pred[:10])  # Display the first 10 predictions


In [None]:
# Display the predictions on the encoded testing set
print("Predictions on the encoded testing set:")
print(y_test_pred)



In [None]:
# Display the true labels of the testing set
print("True labels of the testing set:")
print(y_test)


In [None]:
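# Reshape y_test to a 2D array and apply the one-hot encoder that was fitted on the training outcome variable.
# NOTE: `y_encoder` is an assumed name for that encoder. The feature encoder (`encoder`) was fitted on two
# categorical columns, so it cannot transform a single outcome column.
y_test_encoded = y_encoder.transform(y_test.values.reshape(-1, 1))

# Flatten the array using .ravel() for further use in comparison
y_test_encoded_flat = y_test_encoded.ravel()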

# Display the encoded testing outcome variable
print("Encoded testing outcome variable:")
print(y_test_encoded_flat[:10])  # Display the first 10 values for inspection




Confirm again that the dimensions of the training and testing sets are in alignment since additional features were added.

In [None]:
# Get the shape of the training set
print("Shape of the training set (X_train_final):", X_train_final.shape)

# Get the shape of the testing set
print("Shape of the testing set (X_test_final):", X_test_final.shape)
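
Matching shapes do not guarantee that the columns appear in the same order, which matters because the model matches features by position. The sketch below shows one way to check and fix this, using hypothetical toy DataFrames (`train_toy`, `test_toy`) and illustrative column names rather than the project's actual data.

```python
import pandas as pd

# Hypothetical stand-ins for X_train_final and X_test_final
train_toy = pd.DataFrame({"video_duration_sec": [10, 20], "claim_status_opinion": [1.0, 0.0]})
test_toy = pd.DataFrame({"claim_status_opinion": [0.0], "video_duration_sec": [15]})

# Same number of columns, but are they in the same order?
same_columns = list(train_toy.columns) == list(test_toy.columns)
print("Columns identical and in the same order:", same_columns)

# Reorder the testing columns to match the training columns
test_aligned = test_toy[train_toy.columns]
print(test_aligned.columns.tolist())
```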


### **Task 4b. Visualize model results**
Create a confusion matrix to visualize the results of the logistic regression model.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Use the encoded true labels and the model's predictions on the testing set
y_true = y_test_encoded_flat
y_pred = y_test_pred

# Compute the confusion matrix
cm = confusion_matrix(y_true, y_pred)

# Create the display (0 = not verified, 1 = verified)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["not verified", "verified"])

# Plot the confusion matrix
disp.plot(cmap='Blues')

# Show the plot
plt.show()




- The upper-left quadrant displays the number of true negatives: videos posted by unverified accounts that the model correctly classified as such.

- The upper-right quadrant displays the number of false positives: videos posted by unverified accounts that the model misclassified as posted by verified accounts.

- The lower-left quadrant displays the number of false negatives: videos posted by verified accounts that the model misclassified as posted by unverified accounts.

- The lower-right quadrant displays the number of true positives: videos posted by verified accounts that the model correctly classified as such.

A perfect model would yield all true negatives and true positives, and no false negatives or false positives.
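
The four quadrants described above can also be read off programmatically. This is a minimal sketch on hypothetical toy labels (`y_true_toy`, `y_pred_toy`), assuming 0 means not verified and 1 means verified.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 0 = not verified, 1 = verified
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the counts row by row:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```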

Create a classification report that includes precision, recall, f1-score, and accuracy metrics to evaluate the performance of the logistic regression model.

In [None]:
from sklearn.metrics import classification_report

# Use the encoded true labels and the model's predictions on the testing set
y_true = y_test_encoded_flat
y_pred = y_test_pred

# Generate the classification report (0 = not verified, 1 = verified)
report = classification_report(y_true, y_pred, target_names=["not verified", "verified"])

# Display the report
print(report)
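
To make the report's metrics concrete, the sketch below computes precision, recall, F1, and accuracy by hand from confusion-matrix counts and compares them with scikit-learn's own functions. The toy labels are hypothetical and are not the project's results.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical toy labels: 0 = not verified, 1 = verified
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 1, 1, 1, 0, 0]

# Count each outcome type by comparing true and predicted labels pairwise
pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

precision_manual = tp / (tp + fp)      # share of predicted positives that are correct
recall_manual = tp / (tp + fn)         # share of actual positives that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
accuracy_manual = (tp + tn) / len(pairs)

print("Manual: ", precision_manual, recall_manual, f1_manual, accuracy_manual)
print("sklearn:", precision_score(y_true_toy, y_pred_toy), recall_score(y_true_toy, y_pred_toy),
      f1_score(y_true_toy, y_pred_toy), accuracy_score(y_true_toy, y_pred_toy))
```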



### **Task 4c. Interpret model coefficients**

In [None]:
import pandas as pd

# Use the fitted logistic regression model (logreg_model) and the final training features (X_train_final)

# Get the feature names
feature_names = X_train_final.columns

# Get the coefficients from the model; each is the change in log-odds per one-unit increase in the feature
coefficients = logreg_model.coef_.flatten()  # Flatten to 1D for use in a DataFrame

# Create a DataFrame to display the feature names and corresponding coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient (log-odds)': coefficients
})

# Display the DataFrame
print(coef_df)


### **Task 4d. Conclusion**

1. What are the key takeaways from this project?

2. What results can be presented from this project?

- `video_like_count` is strongly correlated with several other features, which can cause multicollinearity, so I dropped it from the model.
- Based on the logistic regression model, each additional second of video duration is associated with a 0.009 increase in the log-odds of the user having a verified status.
- The logistic regression model had acceptable, though not strong, predictive power: a precision of 61% is less than ideal, but a recall of 84% is very good. Overall accuracy is toward the lower end of what would typically be considered acceptable.
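
A log-odds coefficient is easier to interpret once it is exponentiated into an odds ratio. The short sketch below does that arithmetic for the 0.009 coefficient reported above. The variable names are illustrative, and the 10-second case assumes the relationship is linear in log-odds, as the model does.

```python
import math

coef_per_second = 0.009  # coefficient reported in the takeaways above

# exp(coefficient) is the multiplicative change in odds per one-unit increase
odds_ratio_1s = math.exp(coef_per_second)
odds_ratio_10s = math.exp(coef_per_second * 10)

print(f"Odds ratio per extra second:     {odds_ratio_1s:.4f}")
print(f"Odds ratio per extra 10 seconds: {odds_ratio_10s:.4f}")
```

In other words, each additional second multiplies the odds of verified status by roughly 1.009 (about a 0.9% increase), and ten additional seconds by roughly 1.094.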


I developed a logistic regression model that predicts verified status from video features. The model had decent predictive power. Based on the estimated coefficients, longer videos tend to be associated with higher odds of the user being verified. The other video features have small estimated coefficients, so their association with verified status appears to be weak.