# **TikTok Project**
**Course 5 - Regression Analysis: Simplify complex data relationships**

In [1]:
import numpy as np
import pandas as pd
import platform
import statsmodels
print('Python version: ', platform.python_version())
print('numpy version: ', np.__version__)
print('pandas version: ', pd.__version__)
print('statsmodels version: ', statsmodels.__version__)

Python version:  3.12.11
numpy version:  2.3.0
pandas version:  2.3.0
statsmodels version:  0.14.5


You are a data professional at TikTok. The data team is working towards building a machine learning model that can be used to determine whether a video contains a claim or whether it offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

The team is getting closer to completing the project, having completed an initial plan of action, initial Python coding work, EDA, and hypothesis testing.

The TikTok team has reviewed the results of the hypothesis testing. TikTok’s Operations Lead, Maika Abadi, is interested in how different variables are associated with whether a user is verified. Earlier, the data team observed that if a user is verified, they are much more likely to post opinions. Now, the data team has decided to explore how to predict verified status to help them understand how video characteristics relate to verified users. Therefore, you have been asked to conduct a logistic regression using verified status as the outcome variable. The results may be used to inform the final model related to predicting whether a video is a claim vs an opinion.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Course 5 End-of-course project: Regression modeling**


In this activity, you will build a logistic regression model in Python. As you have learned, logistic regression helps you estimate the probability of an outcome. For data science professionals, this is a useful skill because it lets you relate more than one predictor variable to the outcome you are measuring, which allows for more thorough and flexible analysis.

<br/>

**The purpose** of this project is to demonstrate knowledge of EDA and regression models.

**The goal** is to build a logistic regression model and evaluate the model.
<br/>
*This activity has three parts:*

**Part 1:** EDA & Checking Model Assumptions
* What are some purposes of EDA before constructing a logistic regression model?

**Part 2:** Model Building and Evaluation
* What resources do you find yourself using as you complete this stage?

**Part 3:** Interpreting Model Results

* What key insights emerged from your model(s)?

* What business recommendations do you propose based on the models built?

Follow the instructions and answer the questions below to complete the activity. Then, you will complete an executive summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.


# **Build a regression model**

<img src="../../../images/Pace.png" width="100" height="100" align=left>

# **PACE stages**

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="../../../images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**
Consider the questions in your PACE Strategy Document to reflect on the Plan stage.

### **Task 1. Imports and loading**
Import the data and packages that you've learned are needed for building regression models.

In [4]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for data preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils import resample

# Import packages for data modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


Load the TikTok dataset.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [5]:
# Load dataset into dataframe
data = pd.read_csv("../../../data/tiktok_dataset.csv")

<img src="../../../images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

In this stage, consider the following question where applicable to complete your code response:

* What are some purposes of EDA before constructing a logistic regression model?


==> ENTER YOUR RESPONSE HERE

### **Task 2a. Explore data with EDA**

Analyze the data and check for and handle missing values and duplicates.

Inspect the first five rows of the dataframe.

In [6]:
# Display first few rows
### YOUR CODE HERE ###

data.head(5)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


Get the number of rows and columns in the dataset.

In [7]:
# Get number of rows and columns
### YOUR CODE HERE ###

data.shape

(19382, 12)

Get the data types of the columns.

In [8]:
# Get data types of columns
### YOUR CODE HERE ###

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


Get basic information about the dataset.

In [None]:
# Get basic information
### YOUR CODE HERE ###


Generate basic descriptive statistics about the dataset.

In [None]:
# Generate basic descriptive stats
### YOUR CODE HERE ###


Check for and handle missing values.

In [None]:
# Check for missing values
### YOUR CODE HERE ###

In [None]:
# Drop rows with missing values
### YOUR CODE HERE ###

In [None]:
# Display first few rows after handling missing values
### YOUR CODE HERE ###

Check for and handle duplicates.

In [None]:
# Check for duplicates
### YOUR CODE HERE ###
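
As a rough illustration of the missing-value and duplicate checks described above, here is a minimal sketch. It runs on a small made-up dataframe named `toy` (an assumption for illustration, not the real TikTok data); in the notebook you would apply the same calls to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the TikTok data (not the real dataset)
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_view_count": [343296.0, 140877.0, np.nan, 56167.0],
    "verified_status": ["not verified", "verified", "not verified", "not verified"],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
toy_clean = toy.dropna(axis=0)

# Count rows that are exact duplicates of an earlier row
n_duplicates = toy_clean.duplicated().sum()
print(toy_clean.shape, n_duplicates)
```

Dropping rows is only one way to handle missing values; it is a reasonable choice here because the rows with missing values are a small share of the 19,382 rows reported by `data.info()`.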

Check for and handle outliers.

In [None]:
# Create a boxplot to visualize distribution of `video_duration_sec`
### YOUR CODE HERE ###



In [None]:
# Create a boxplot to visualize distribution of `video_view_count`
### YOUR CODE HERE ###



In [None]:
# Create a boxplot to visualize distribution of `video_like_count`
### YOUR CODE HERE ###



In [None]:
# Create a boxplot to visualize distribution of `video_comment_count`
### YOUR CODE HERE ###



In [None]:
# Check for and handle outliers for video_like_count
### YOUR CODE HERE ###
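
One possible way to handle outliers is to cap values above an upper limit derived from the interquartile range (IQR). The sketch below assumes that approach and the common `Q3 + 1.5 * IQR` rule; neither is prescribed by this notebook, and the `toy` values are invented for illustration.

```python
import pandas as pd

# Hypothetical like counts with one extreme value (not the real dataset)
toy = pd.DataFrame({"video_like_count": [10.0, 20.0, 30.0, 40.0, 1000.0]})

# Compute the interquartile range
q1 = toy["video_like_count"].quantile(0.25)
q3 = toy["video_like_count"].quantile(0.75)
iqr = q3 - q1

# Assumed rule: anything above Q3 + 1.5 * IQR is treated as an outlier
upper_limit = q3 + 1.5 * iqr

# Cap the outliers at the upper limit instead of dropping the rows
toy.loc[toy["video_like_count"] > upper_limit, "video_like_count"] = upper_limit
print(upper_limit, toy["video_like_count"].tolist())
```

Capping keeps every row in the dataset, which may matter when the minority class is already small.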



Check class balance of the target variable. Remember, the goal is to predict whether the user of a given post is verified or unverified.

In [None]:
# Check class balance
### YOUR CODE HERE ###


Approximately 94.2% of the dataset represents videos posted by unverified accounts and 5.8% represents videos posted by verified accounts. So the outcome variable is not very balanced.

Use resampling to create class balance in the outcome variable, if needed.

In [None]:
# Use resampling to create class balance in the outcome variable, if needed

# Identify data points from majority and minority classes
### YOUR CODE HERE ###

# Upsample the minority class (which is "verified")
### YOUR CODE HERE ###

# Combine majority class with upsampled minority class
### YOUR CODE HERE ###

# Display new class counts
### YOUR CODE HERE ###
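
The class balance check and the upsampling steps outlined above could look roughly like the following sketch. It uses `sklearn.utils.resample` (imported earlier in this notebook) on an invented dataframe named `toy`; the 8-to-2 class split and `random_state=0` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 8 unverified rows, 2 verified rows
toy = pd.DataFrame({
    "verified_status": ["not verified"] * 8 + ["verified"] * 2,
    "video_duration_sec": [59, 32, 31, 25, 19, 35, 16, 41, 50, 45],
})

# Check class balance as proportions
print(toy["verified_status"].value_counts(normalize=True))

# Identify data points from the majority and minority classes
majority = toy[toy["verified_status"] == "not verified"]
minority = toy[toy["verified_status"] == "verified"]

# Upsample the minority class with replacement to match the majority size
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=0,
)

# Combine the majority class with the upsampled minority class
balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

# Display new class counts
print(balanced["verified_status"].value_counts())
```

Note that upsampling before the train-test split means copies of the same row can end up in both sets, which may make test metrics look better than they would on unseen data.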

Get the average `video_transcription_text` length for videos posted by verified accounts and the average `video_transcription_text` length for videos posted by unverified accounts.



In [None]:
# Get the average `video_transcription_text` length for videos posted by verified accounts and the average `video_transcription_text` length for videos posted by unverified accounts
### YOUR CODE HERE ###


Extract the length of each `video_transcription_text` and add this as a column to the dataframe, so that it can be used as a potential feature in the model.

In [None]:
# Extract the length of each `video_transcription_text` and add this as a column to the dataframe
### YOUR CODE HERE ###


In [None]:
# Display first few rows of dataframe after adding new column
### YOUR CODE HERE ###
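
A minimal sketch of the text-length feature described above, assuming a new column named `text_length` (a hypothetical name) and a tiny invented dataframe in place of the real data:

```python
import pandas as pd

# Hypothetical transcriptions (not the real dataset)
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified"],
    "video_transcription_text": ["abc", "abcde", "abcdefg"],
})

# Number of characters in each transcription, stored as a new column
toy["text_length"] = toy["video_transcription_text"].str.len()

# Average transcription length for each verified status
mean_length = toy.groupby("verified_status")["text_length"].mean()
print(mean_length)
```

Storing the length as a column first makes it easy to reuse the same values for the group averages, the histogram, and the model features.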


Visualize the distribution of `video_transcription_text` length for videos posted by verified accounts and videos posted by unverified accounts.

In [None]:
# Visualize the distribution of `video_transcription_text` length for videos posted by verified accounts and videos posted by unverified accounts
# Create two histograms in one plot
### YOUR CODE HERE ###


### **Task 2b. Examine correlations**

Next, code a correlation matrix to help determine the most correlated variables.

In [None]:
# Code a correlation matrix to help determine the most correlated variables
### YOUR CODE HERE ###


Visualize a correlation heatmap of the data.

In [None]:
# Create a heatmap to visualize how correlated variables are
### YOUR CODE HERE ###


One of the model assumptions for logistic regression is no severe multicollinearity among the features. Take this into consideration as you examine the heatmap and choose which features to proceed with.
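
To make the multicollinearity check concrete, the sketch below computes a correlation matrix on numeric columns and lists the pairs whose absolute correlation exceeds a cutoff. The `toy` values and the 0.8 cutoff are assumptions for illustration; the notebook does not specify a threshold.

```python
import pandas as pd

# Hypothetical data in which views and likes move together (not the real dataset)
toy = pd.DataFrame({
    "video_view_count": [100.0, 200.0, 300.0, 400.0, 500.0],
    "video_like_count": [10.0, 21.0, 29.0, 41.0, 50.0],
    "video_duration_sec": [30, 12, 45, 20, 33],
    "claim_status": ["claim", "opinion", "claim", "claim", "opinion"],
})

# Correlation matrix over the numeric columns only
corr = toy.select_dtypes(include="number").corr()
print(corr.round(3))

# Assumed cutoff for flagging a strongly correlated pair
threshold = 0.8

# Collect each pair of distinct columns once (upper triangle of the matrix)
high_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

When a pair is flagged, one common remedy is to keep only one of the two features in the model.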

**Question:** What variables are shown to be correlated in the heatmap?

<img src="../../../images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**

After analyzing the data and identifying closely related variables, it is time to begin constructing the model. Consider the questions in your PACE Strategy Document to reflect on the Construct stage.

### **Task 3a. Select variables**

Set your Y and X variables.

Select the outcome variable.

In [None]:
# Select outcome variable
### YOUR CODE HERE ###


Select the features.

In [None]:
# Select features
### YOUR CODE HERE ###


# Display first few rows of features dataframe
### YOUR CODE HERE ###


### **Task 3b. Train-test split**

Split the data into training and testing sets.

In [None]:
# Split the data into training and testing sets
### YOUR CODE HERE ###


Confirm that the dimensions of the training and testing sets are in alignment.

In [None]:
# Get shape of each training and testing set
### YOUR CODE HERE ###
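
Here is a small sketch of the split and the shape check, using `train_test_split` as imported earlier. The 25% test size, `random_state=0`, and the `toy` dataframe are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with 20 rows (not the real dataset)
toy = pd.DataFrame({
    "video_duration_sec": range(20),
    "verified_status": ["verified", "not verified"] * 10,
})

# Features (X) and outcome (y)
X = toy[["video_duration_sec"]]
y = toy["verified_status"]

# Hold out 25% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The row counts of X and y must match within each set
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```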


### **Task 3c. Encode variables**

Check the data types of the features.

In [None]:
# Check data types
### YOUR CODE HERE ###


In [None]:
# Get unique values in `claim_status`
### YOUR CODE HERE ###



In [None]:
# Get unique values in `author_ban_status`
### YOUR CODE HERE ###


As shown above, the `claim_status` and `author_ban_status` features are each of data type `object` currently. In order to work with the implementations of models through `sklearn`, these categorical features will need to be made numeric. One way to do this is through one-hot encoding.

Encode categorical features in the training set using an appropriate method.

In [None]:
# Select the training features that need to be encoded
### YOUR CODE HERE ###


# Display first few rows
### YOUR CODE HERE ###


In [None]:
# Set up an encoder for one-hot encoding the categorical features
### YOUR CODE HERE ###


In [None]:
# Fit and transform the training features using the encoder
### YOUR CODE HERE ###


In [None]:
# Get feature names from encoder
### YOUR CODE HERE ###


In [None]:
# Display first few rows of encoded training features
### YOUR CODE HERE ###


In [None]:
# Place encoded training features (which is currently an array) into a dataframe
### YOUR CODE HERE ###


# Display first few rows
### YOUR CODE HERE ###


In [None]:
# Display first few rows of `X_train` with `claim_status` and `author_ban_status` columns dropped (since these features are being transformed to numeric)
### YOUR CODE HERE ###


In [None]:
# Concatenate `X_train` and `X_train_encoded_df` to form the final dataframe for training data (`X_train_final`)
# Note: Using `.reset_index(drop=True)` to reset the index in X_train after dropping `claim_status` and `author_ban_status`,
# so that the indices align with those in `X_train_encoded_df`
### YOUR CODE HERE ###

# Display first few rows
### YOUR CODE HERE ###


Check the data type of the outcome variable.

In [None]:
# Check data type of outcome variable
### YOUR CODE HERE ###


In [None]:
# Get unique values of outcome variable
### YOUR CODE HERE ###


As shown above, the outcome variable is of data type `object` currently. One-hot encoding can be used to make this variable numeric.

Encode categorical values of the outcome variable in the training set using an appropriate method.

In [None]:
# Set up an encoder for one-hot encoding the categorical outcome variable
### YOUR CODE HERE ###


In [None]:
# Encode the training outcome variable
# Notes:
#   - Adjusting the shape of `y_train` before passing into `.fit_transform()`, since it takes in 2D array
#   - Using `.ravel()` to flatten the array returned by `.fit_transform()`, so that it can be used later to train the model
### YOUR CODE HERE ###

# Display the encoded training outcome variable
### YOUR CODE HERE ###
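
A minimal sketch of encoding the outcome variable, following the notes in the cell above (reshape to 2D, then flatten with `.ravel()`). The `y_train` values are invented, and `drop="first"` is an assumption so that a single 0/1 column remains.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training labels (not the real dataset)
y_train = pd.Series(
    ["not verified", "verified", "not verified", "verified"],
    name="verified_status",
)

# Encoder for the outcome; dropping the first category leaves one binary column
y_encoder = OneHotEncoder(drop="first")

# Reshape to a 2D column, encode, convert to dense, then flatten to 1D
y_train_final = (
    y_encoder.fit_transform(y_train.values.reshape(-1, 1)).toarray().ravel()
)
print(y_train_final)
```

Because categories are sorted alphabetically, "not verified" is the dropped category here, so 1 corresponds to "verified".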


### **Task 3d. Model building**

Construct a model and fit it to the training set.

In [None]:
# Construct a logistic regression model and fit it to the training set
### YOUR CODE HERE ###
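
Fitting the model could look like the sketch below. The training data is invented, and `random_state=0` and `max_iter=800` are assumed settings rather than values given in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical encoded training data (not the real dataset)
X_train_final = pd.DataFrame({
    "video_duration_sec": [5, 10, 15, 20, 40, 45, 50, 55],
    "claim_status_opinion": [0, 0, 0, 1, 1, 1, 1, 0],
})
y_train_final = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Construct the model and fit it to the training set
log_clf = LogisticRegression(random_state=0, max_iter=800).fit(
    X_train_final, y_train_final
)

# Estimated probability of the positive class (1) for each training row
probabilities = log_clf.predict_proba(X_train_final)[:, 1]
print(probabilities.round(3))
```

A larger `max_iter` than the default gives the solver more room to converge when features are on very different scales.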



<img src="../../../images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### **Task 4a. Results and evaluation**

Evaluate your model.

Encode categorical features in the testing set using an appropriate method.

In [None]:
# Select the testing features that need to be encoded
### YOUR CODE HERE ###


# Display first few rows
### YOUR CODE HERE ###


In [None]:
# Transform the testing features using the encoder
### YOUR CODE HERE ###


# Display first few rows of encoded testing features
### YOUR CODE HERE ###


In [None]:
# Place encoded testing features (which is currently an array) into a dataframe
### YOUR CODE HERE ###


# Display first few rows
### YOUR CODE HERE ###


In [None]:
# Display first few rows of `X_test` with `claim_status` and `author_ban_status` columns dropped (since these features are being transformed to numeric)
### YOUR CODE HERE ###


In [None]:
# Concatenate `X_test` and `X_test_encoded_df` to form the final dataframe for testing data (`X_test_final`)
# Note: Using `.reset_index(drop=True)` to reset the index in X_test after dropping `claim_status` and `author_ban_status`,
# so that the indices align with those in `X_test_encoded_df`
### YOUR CODE HERE ###


# Display first few rows
### YOUR CODE HERE ###


Test the logistic regression model. Use the model to make predictions on the encoded testing set.

In [None]:
# Use the logistic regression model to get predictions on the encoded testing set
### YOUR CODE HERE ###


Display the predictions on the encoded testing set.

In [None]:
# Display the predictions on the encoded testing set
### YOUR CODE HERE ###


Display the true labels of the testing set.

In [None]:
# Display the true labels of the testing set
### YOUR CODE HERE ###


Encode the true labels of the testing set so they can be compared to the predictions.

In [None]:
# Encode the testing outcome variable
# Notes:
#   - Adjusting the shape of `y_test` before passing into `.transform()`, since it takes in 2D array
#   - Using `.ravel()` to flatten the array returned by `.transform()`, so that it can be used later to compare with predictions
### YOUR CODE HERE ###


# Display the encoded testing outcome variable
### YOUR CODE HERE ###


Confirm again that the dimensions of the training and testing sets are in alignment since additional features were added.

In [None]:
# Get shape of each training and testing set
### YOUR CODE HERE ###


### **Task 4b. Visualize model results**

Create a confusion matrix to visualize the results of the logistic regression model.

In [None]:
# Compute values for confusion matrix
### YOUR CODE HERE ###

# Create display of confusion matrix
### YOUR CODE HERE ###

# Plot confusion matrix
### YOUR CODE HERE ###

# Display plot
### YOUR CODE HERE ###

Create a classification report that includes precision, recall, f1-score, and accuracy metrics to evaluate the performance of the logistic regression model.

In [None]:
# Create a classification report
### YOUR CODE HERE ###
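
As an illustration of the evaluation steps above, the sketch below computes a confusion matrix and a classification report from invented labels and predictions (`y_test_final` and `y_pred` here are hypothetical arrays, not model output).

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical encoded true labels and predictions (1 = verified)
y_test_final = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test_final, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)

# Precision, recall, f1-score, and accuracy in one text report
report = classification_report(
    y_test_final, y_pred, target_names=["not verified", "verified"]
)
print(report)
```

To draw the matrix, you can pass `cm` to `ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[...])`, call `.plot()`, and then `plt.show()`.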


### **Task 4c. Interpret model coefficients**

In [None]:
# Get the feature names from the model and the model coefficients (which represent log-odds ratios)
# Place into a DataFrame for readability
### YOUR CODE HERE ###
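
A possible way to tabulate the coefficients is sketched below, using invented training data. Each coefficient is the estimated change in the log-odds of the positive class for a one-unit increase in that feature, so exponentiating it gives an odds ratio; the `Odds Ratio` column is an addition for illustration and is not required by the task.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical encoded training data (not the real dataset)
X_train_final = pd.DataFrame({
    "video_duration_sec": [5, 10, 15, 20, 40, 45, 50, 55],
    "claim_status_opinion": [0, 0, 0, 1, 1, 1, 1, 0],
})
y_train_final = np.array([0, 0, 0, 0, 1, 1, 1, 1])

log_clf = LogisticRegression(random_state=0, max_iter=800).fit(
    X_train_final, y_train_final
)

# Pair each feature name with its coefficient (change in log-odds per unit)
coef_table = pd.DataFrame({
    "Feature Name": X_train_final.columns,
    "Model Coefficient": log_clf.coef_[0],
})

# Exponentiate to express each coefficient as an odds ratio
coef_table["Odds Ratio"] = np.exp(coef_table["Model Coefficient"])
print(coef_table)
```

An odds ratio above 1 suggests the feature is associated with higher odds of the positive class, and a value below 1 suggests lower odds, holding the other features fixed.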


### **Task 4d. Conclusion**

1. What are the key takeaways from this project?

2. What results can be presented from this project?

==> ENTER YOUR RESPONSE TO QUESTIONS 1 AND 2 HERE

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged. 