# **TikTok Project**
**Course 5 - Regression Analysis: Simplify complex data relationships**

You are a data professional at TikTok. The data team is working towards building a machine learning model that can be used to determine whether a video contains a claim or whether it offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

The team is getting closer to completing the project, having completed an initial plan of action, initial Python coding work, EDA, and hypothesis testing.

The TikTok team has reviewed the results of the hypothesis testing. TikTok’s Operations Lead, Maika Abadi, is interested in how different variables are associated with whether a user is verified. Earlier, the data team observed that if a user is verified, they are much more likely to post opinions. Now, the data team has decided to explore how to predict verified status to help them understand how video characteristics relate to verified users. Therefore, you have been asked to conduct a logistic regression using verified status as the outcome variable. The results may be used to inform the final model related to predicting whether a video is a claim vs an opinion.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Course 5 End-of-course project: Regression modeling**


In this activity, you will build a logistic regression model in Python. As you have learned, logistic regression helps you estimate the probability of an outcome. For data science professionals, this is a useful skill because it lets you examine how several variables jointly relate to the outcome you are modeling, which allows for a more thorough and flexible analysis.

<br/>

**The purpose** of this project is to demonstrate knowledge of EDA and regression models.

**The goal** is to build a logistic regression model and evaluate the model.
<br/>
*This activity has three parts:*

**Part 1:** EDA & Checking Model Assumptions
* What are some purposes of EDA before constructing a logistic regression model?

**Part 2:** Model Building and Evaluation
* What resources do you find yourself using as you complete this stage?

**Part 3:** Interpreting Model Results

* What key insights emerged from your model(s)?

* What business recommendations do you propose based on the models built?

Follow the instructions and answer the questions below to complete the activity. Then, you will complete an executive summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.


# **Build a regression model**

<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**
Consider the questions in your PACE Strategy Document to reflect on the Plan stage.

### **Task 1. Imports and loading**
Import the data and packages that you've learned are needed for building regression models.

In [None]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
#import plotly.express as px
import seaborn as sns

# Import packages for data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# Import packages for data modeling
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


Load the TikTok dataset.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

<img src="images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

In this stage, consider the following question where applicable to complete your code response:

* What are some purposes of EDA before constructing a logistic regression model?


==> ENTER YOUR RESPONSE HERE

### **Task 2a. Explore data with EDA**

Analyze the data and check for and handle missing values and duplicates.

Inspect the first five rows of the dataframe.

In [None]:
# Display first few rows
### YOUR CODE HERE ###
data.head()

Get the number of rows and columns in the dataset.

In [None]:
# Get number of rows and columns
### YOUR CODE HERE ###
rows, columns = data.shape
rows, columns

Get the data types of the columns.

In [None]:
# Get data types of columns
### YOUR CODE HERE ###
data_types = data.dtypes
data_types

Get basic information about the dataset.

In [None]:
# Get basic information
### YOUR CODE HERE ###
# `info()` prints its summary directly and returns None, so there is nothing to assign
data.info()

Generate basic descriptive statistics about the dataset.

In [None]:
# Generate basic descriptive stats
descriptive_stats = data.describe(include='all')
descriptive_stats

Check for and handle missing values.

In [None]:
# Check for missing values
### YOUR CODE HERE ###
missing_values = data.isnull().sum()
missing_values

In [None]:
# Drop rows with missing values
### YOUR CODE HERE ###
data_cleaned = data.dropna()

# Verify the shape of the cleaned data
data_cleaned.shape

In [None]:
# Display first few rows after handling missing values
### YOUR CODE HERE ###
data_cleaned.head()

Check for and handle duplicates.

In [None]:
# Check for duplicates
### YOUR CODE HERE ###
duplicate_rows = data_cleaned.duplicated().sum()
duplicate_rows

Check for and handle outliers.

In [None]:
# Create a boxplot to visualize distribution of `video_duration_sec`
### YOUR CODE HERE ###

# Create a boxplot to visualize the distribution of 'video_duration_sec'
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.boxplot(data_cleaned['video_duration_sec'])
plt.title('Boxplot of Video Duration (Seconds)')
plt.ylabel('Duration (Seconds)')
plt.show()


In [None]:
# Create a boxplot to visualize distribution of `video_view_count`
### YOUR CODE HERE ###

# Create a boxplot to visualize the distribution of 'video_view_count'

plt.figure(figsize=(8, 6))
plt.boxplot(data_cleaned['video_view_count'])
plt.title('Boxplot of Video View Count')
plt.ylabel('View Count')
plt.show()


In [None]:
# Create a boxplot to visualize distribution of `video_like_count`
### YOUR CODE HERE ###
# Create a boxplot to visualize the distribution of 'video_like_count'

plt.figure(figsize=(8, 6))
plt.boxplot(data_cleaned['video_like_count'])
plt.title('Boxplot of Video Like Count')
plt.ylabel('Like Count')
plt.show()


In [None]:
# Create a boxplot to visualize distribution of `video_comment_count`
### YOUR CODE HERE ###
# Create a boxplot to visualize the distribution of 'video_comment_count'

plt.figure(figsize=(8, 6))
plt.boxplot(data_cleaned['video_comment_count'])
plt.title('Boxplot of Video Comment Count')
plt.ylabel('Comment Count')
plt.show()


In [None]:
# Check for and handle outliers for video_like_count
### YOUR CODE HERE ###

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = data_cleaned['video_like_count'].quantile(0.25)
Q3 = data_cleaned['video_like_count'].quantile(0.75)

# Calculate the Interquartile Range (IQR)
IQR = Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter the dataset to remove outliers
data_no_outliers = data_cleaned[(data_cleaned['video_like_count'] >= lower_bound) & (data_cleaned['video_like_count'] <= upper_bound)]

# Verify how many rows were removed
outliers_removed = data_cleaned.shape[0] - data_no_outliers.shape[0]
outliers_removed, data_no_outliers.shape
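
To make the interquartile-range rule used above concrete, here is a dependency-free sketch on a small list. The like counts are hypothetical. The `method="inclusive"` option interpolates linearly between data points, which should correspond to the default behavior of pandas' `quantile`.

```python
import statistics

# Hypothetical like counts; the last value is deliberately extreme
like_counts = [2, 4, 4, 5, 7, 9, 10, 12, 95]

# Quartiles with linear interpolation between data points
q1, _median, q3 = statistics.quantiles(like_counts, n=4, method="inclusive")
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are treated as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

kept = [x for x in like_counts if lower_bound <= x <= upper_bound]
outliers = [x for x in like_counts if x < lower_bound or x > upper_bound]
print(f"bounds = ({lower_bound}, {upper_bound}), outliers = {outliers}")
```

Removing rows is only one way to handle outliers. Capping values at the upper bound is a common alternative that keeps every row in the dataset.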


Check class balance.

In [None]:
# Check class balance for the outcome variable, `verified_status`
### YOUR CODE HERE ###
# Proportion of videos posted by verified vs. unverified accounts
verified_status_distribution = data_no_outliers['verified_status'].value_counts(normalize=True)

verified_status_distribution


Approximately 94.2% of the dataset represents videos posted by unverified accounts and 5.8% represents videos posted by verified accounts. So the outcome variable is not very balanced.

Use resampling to create class balance in the outcome variable, if needed.

In [None]:
# Use resampling to create class balance in the outcome variable, if needed

# Identify data points from majority and minority classes
### YOUR CODE HERE ###

# Upsample the minority class (which is "verified")
### YOUR CODE HERE ###

# Combine majority class with upsampled minority class
### YOUR CODE HERE ###

# Display new class counts
### YOUR CODE HERE ###
from sklearn.utils import resample

# Identify data points from the majority ("not verified") and minority ("verified") classes
data_majority = data_no_outliers[data_no_outliers['verified_status'] == 'not verified']
data_minority = data_no_outliers[data_no_outliers['verified_status'] == 'verified']

# Upsample the minority class so that it matches the majority class size
data_minority_upsampled = resample(data_minority,
                                   replace=True,                  # Sample with replacement
                                   n_samples=len(data_majority),  # Match the majority class size
                                   random_state=42)               # Make the result reproducible

# Combine the majority class with the upsampled minority class
balanced_data = pd.concat([data_majority, data_minority_upsampled]).reset_index(drop=True)

# Display new class counts for `verified_status`
balanced_data['verified_status'].value_counts()
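
The idea behind upsampling with replacement can be sketched without any libraries. The rows below are hypothetical `(video_id, verified_status)` pairs, not records from the dataset: the minority class is sampled with replacement until it is as large as the majority class.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical rows: (video_id, verified_status)
rows = [(i, "not verified") for i in range(16)] + [(i, "verified") for i in range(16, 20)]

majority = [row for row in rows if row[1] == "not verified"]
minority = [row for row in rows if row[1] == "verified"]

# Draw from the minority class with replacement until it matches the majority class size
minority_upsampled = rng.choices(minority, k=len(majority))

balanced = majority + minority_upsampled
class_counts = Counter(status for _, status in balanced)
print(class_counts)
```

Because the upsampled rows are copies of existing rows, the balanced set contains no new information. It only changes how much weight each class carries during training.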


Get the average `video_transcription_text` length for videos posted by verified accounts and the average `video_transcription_text` length for videos posted by unverified accounts.



In [None]:
# Get the average `video_transcription_text` length for videos posted by verified and unverified accounts
### YOUR CODE HERE ###
# First, calculate the length (in characters) of each 'video_transcription_text'
balanced_data['transcription_length'] = balanced_data['video_transcription_text'].apply(lambda x: len(str(x)))

# Then, average the length within each `verified_status` group
balanced_data.groupby('verified_status')['transcription_length'].mean()


Extract the length of each `video_transcription_text` and add this as a column to the dataframe, so that it can be used as a potential feature in the model.

In [None]:
# Extract the length of each `video_transcription_text` and add this as a column to the dataframe
### YOUR CODE HERE ###
# Extract the length of each 'video_transcription_text' and add it as a column to the dataframe
balanced_data['transcription_length'] = balanced_data['video_transcription_text'].apply(lambda x: len(str(x)))

# Display the first few rows to confirm the new column has been added
balanced_data[['video_transcription_text', 'transcription_length']].head()


In [None]:
# Display first few rows of dataframe after adding new column
### YOUR CODE HERE ###
# Display the first few rows of the dataframe after adding the 'transcription_length' column
balanced_data.head()


Visualize the distribution of `video_transcription_text` length for videos posted by verified accounts and videos posted by unverified accounts.

In [None]:
# Visualize the distribution of `video_transcription_text` length for videos posted by verified accounts and videos posted by unverified accounts
# Create two histograms in one plot
### YOUR CODE HERE ###
# Filter the data based on verified and unverified accounts
verified_accounts = balanced_data[balanced_data['verified_status'] == 'verified']
unverified_accounts = balanced_data[balanced_data['verified_status'] == 'not verified']

# Plot two histograms in one plot
plt.figure(figsize=(10, 6))

plt.hist(verified_accounts['transcription_length'], bins=30, alpha=0.5, label='Verified Accounts', color='blue')
plt.hist(unverified_accounts['transcription_length'], bins=30, alpha=0.5, label='Unverified Accounts', color='orange')

plt.title('Distribution of Video Transcription Length for Verified and Unverified Accounts')
plt.xlabel('Transcription Length (Characters)')
plt.ylabel('Frequency')
plt.legend(loc='upper right')

plt.show()


### **Task 2b. Examine correlations**

Next, code a correlation matrix to help determine the most correlated variables.

In [None]:
# Select only the numeric columns from the dataset
numeric_columns = balanced_data.select_dtypes(include=['float64', 'int64'])

# Calculate the correlation matrix for the numeric columns
correlation_matrix = numeric_columns.corr()

# Display the correlation matrix
correlation_matrix



Visualize a correlation heatmap of the data.

In [None]:
# Create a heatmap to visualize how correlated variables are
### YOUR CODE HERE ###
# Create a heatmap to visualize the correlation matrix
import seaborn as sns
plt.figure(figsize=(10, 8))

# Generate a heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Add titles and labels
plt.title('Correlation Heatmap of Numeric Variables')
plt.show()


One of the model assumptions for logistic regression is no severe multicollinearity among the features. Take this into consideration as you examine the heatmap and choose which features to proceed with.

**Question:** What variables are shown to be correlated in the heatmap?
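
As a rough illustration of how a correlation matrix can flag candidate multicollinearity, the sketch below computes Pearson correlations by hand for a few hypothetical columns and lists the pairs whose absolute correlation exceeds a cutoff. The values and the 0.8 cutoff are assumptions for illustration, not results from this dataset.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns (illustration only)
columns = {
    "views": [100, 200, 300, 400, 500],
    "likes": [12, 19, 33, 38, 52],
    "duration": [30, 12, 45, 8, 27],
}

threshold = 0.8  # assumed cutoff for a "strong" correlation

# Collect every pair of columns whose absolute correlation exceeds the cutoff
strong_pairs = [
    (a, b, round(pearson(columns[a], columns[b]), 3))
    for a, b in combinations(columns, 2)
    if abs(pearson(columns[a], columns[b])) > threshold
]
print(strong_pairs)
```

When two features are this strongly correlated, one common option is to keep only one of them in the model.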

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**

After analysis and deriving variables with close relationships, it is time to begin constructing the model. Consider the questions in your PACE Strategy Document to reflect on the Construct stage.

### **Task 3a. Select variables**

Set your Y and X variables.

Select the outcome variable.

In [None]:
# Select outcome variable
### YOUR CODE HERE ###
# The outcome is whether the account that posted the video is verified
y = balanced_data['verified_status']



Select the features.

In [None]:
# Select features
### YOUR CODE HERE ###


# Display first few rows of features dataframe
### YOUR CODE HERE ###
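# Select relevant features, including the categorical `claim_status` and `author_ban_status`
# columns that are encoded in Task 3c
features = balanced_data[['video_duration_sec', 'claim_status', 'author_ban_status', 'video_view_count',
                          'video_like_count', 'video_share_count', 'video_download_count',
                          'video_comment_count', 'transcription_length']]

# Display the first few rows of the selected features
features.head()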


### **Task 3b. Train-test split**

Split the data into training and testing sets.

In [None]:
# Split the data into training and testing sets
### YOUR CODE HERE ###
# Split the selected features into training and testing sets
# Define the features and the outcome variable (`verified_status`)
X = features
y = balanced_data['verified_status']

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape


Confirm that the dimensions of the training and testing sets are in alignment.

In [None]:
# Get shape of each training and testing set
### YOUR CODE HERE ###
# Display the shapes of the training and testing sets for both features and target
X_train.shape, X_test.shape, y_train.shape, y_test.shape


### **Task 3c. Encode variables**

Check the data types of the features.

In [None]:
# Check data types
### YOUR CODE HERE ###
# Check the data types of the features in the training set
X_train.dtypes


In [None]:
# Get unique values in `claim_status`
### YOUR CODE HERE ###
# Get the unique values in the 'claim_status' column
unique_claim_status = balanced_data['claim_status'].unique()
unique_claim_status



In [None]:
# Get unique values in `author_ban_status`
### YOUR CODE HERE ###
# Get the unique values in the 'author_ban_status' column
unique_author_ban_status = balanced_data['author_ban_status'].unique()
unique_author_ban_status


As shown above, the `claim_status` and `author_ban_status` features are each of data type `object` currently. In order to work with the implementations of models through `sklearn`, these categorical features will need to be made numeric. One way to do this is through one-hot encoding.
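
A minimal, library-free sketch of one-hot encoding with the first category dropped is shown below. The category values are hypothetical stand-ins for a column such as `author_ban_status`, and the helper `one_hot_drop_first` is an illustrative name, not part of `sklearn`.

```python
def one_hot_drop_first(values):
    """One-hot encode `values`, dropping the first (alphabetically sorted) category.

    Returns the kept category names and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is represented by all zeros
    rows = [[1 if value == category else 0 for category in kept] for value in values]
    return kept, rows


# Hypothetical category values (illustration only)
statuses = ["active", "banned", "under review", "active"]

encoded_columns, encoded_rows = one_hot_drop_first(statuses)
print(encoded_columns)
for status, row in zip(statuses, encoded_rows):
    print(status, row)
```

Dropping one category avoids a redundant column: if every kept indicator is 0, the row must belong to the dropped category.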

Encode categorical features in the training set using an appropriate method.

In [None]:
# Select the training features that need to be encoded
### YOUR CODE HERE ###


# Display first few rows
### YOUR CODE HERE ###
# Select the categorical training features that need to be encoded
categorical_features = X_train[['claim_status', 'author_ban_status']]

# Display the first few rows of the selected features
categorical_features.head()


In [None]:
# Set up an encoder for one-hot encoding the categorical features
### YOUR CODE HERE ###
# Set up the encoder; `sparse_output=False` returns a dense array
# (scikit-learn 1.2 renamed the older `sparse` parameter, which was removed in 1.4)
encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' to avoid multicollinearity

# Fit and transform the categorical features
encoded_features = encoder.fit_transform(categorical_features)

# Convert the result to a DataFrame and display the first few rows
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(categorical_features.columns))
encoded_df.head()



In [None]:
# Get feature names from encoder
### YOUR CODE HERE ###
# Get the feature names from the encoder
encoded_feature_names = encoder.get_feature_names_out(categorical_features.columns)
encoded_feature_names


In [None]:
# Display first few rows of encoded training features
### YOUR CODE HERE ###
# Display the first few rows of the encoded training features (categorical features that have been one-hot encoded)
encoded_df.head()


In [None]:
# Place encoded training features (which is currently an array) into a dataframe
### YOUR CODE HERE ###
# Convert the encoded array into a DataFrame
encoded_train_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)

# Display the first few rows of the newly created DataFrame
encoded_train_df.head()


# Display first few rows
### YOUR CODE HERE ###


In [None]:
# Display first few rows of `X_train` with `claim_status` and `author_ban_status` columns dropped (since these features are being transformed to numeric)
### YOUR CODE HERE ###
# Display the first few rows of X_train with 'claim_status' and 'author_ban_status' columns dropped
X_train_dropped = X_train.drop(columns=['claim_status', 'author_ban_status'], errors='ignore')

# Display the first few rows
X_train_dropped.head()


In [None]:
# Concatenate `X_train` and `X_train_encoded_df` to form the final dataframe for training data (`X_train_final`)
# Note: Using `.reset_index(drop=True)` to reset the index in X_train after dropping `claim_status` and `author_ban_status`,
# so that the indices align with those in `X_train_encoded_df` and `count_df`
### YOUR CODE HERE ###

# Display first few rows
### YOUR CODE HERE ###
# Concatenate X_train_dropped and the encoded features DataFrame (encoded_train_df)
X_train_final = pd.concat([X_train_dropped.reset_index(drop=True), encoded_train_df.reset_index(drop=True)], axis=1)

# Display the first few rows of the final training DataFrame
X_train_final.head()


Check the data type of the outcome variable.

In [None]:
# Check data type of outcome variable
### YOUR CODE HERE ###
# Check the data type of the outcome variable (y_train)
y_train.dtypes



In [None]:
# Get unique values of outcome variable
### YOUR CODE HERE ###
# Get the unique values of the outcome variable (y_train)
unique_outcome_values = y_train.unique()
unique_outcome_values


As shown above, the outcome variable is of data type `object` currently. One-hot encoding can be used to make this variable numeric.

Encode categorical values of the outcome variable of the training set using an appropriate method.

In [None]:
# Set up an encoder for one-hot encoding the categorical outcome variable
### YOUR CODE HERE ###
### YOUR CODE HERE ###
# Set up one-hot encoding for the outcome variable
# `drop='first'` keeps a single 0/1 column, which suits a binary outcome
encoder_outcome = OneHotEncoder(drop='first', sparse_output=False)


In [None]:
# Encode the training outcome variable
# Notes:
#   - Adjusting the shape of `y_train` before passing into `.fit_transform()`, since it takes in 2D array
#   - Using `.ravel()` to flatten the array returned by `.fit_transform()`, so that it can be used later to train the model
### YOUR CODE HERE ###

# Display the encoded training outcome variable
### YOUR CODE HERE ###
# Adjust the shape of y_train and perform one-hot encoding
y_train_encoded = encoder_outcome.fit_transform(y_train.values.reshape(-1, 1)).ravel()

# Display the first 10 entries of the encoded training outcome variable
y_train_encoded[:10]


### **Task 3d. Model building**

Construct a model and fit it to the training set.

In [None]:
# Construct a logistic regression model and fit it to the training set
### YOUR CODE HERE ###
# Initialize the logistic regression model (`LogisticRegression` was imported in Task 1)
logistic_model = LogisticRegression(random_state=42, max_iter=1000)

# Fit the model to the encoded training features and the encoded training outcome
logistic_model.fit(X_train_final, y_train_encoded)



<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### **Task 4a. Results and evaluation**

Evaluate your model.

Encode categorical features in the testing set using an appropriate method.

In [None]:
# Select the testing features that need to be encoded
### YOUR CODE HERE ###
X_test_to_encode = X_test[['claim_status', 'author_ban_status']]

# Display first few rows
### YOUR CODE HERE ###
X_test_to_encode.head()


In [None]:
# Transform the testing features using the encoder
### YOUR CODE HERE ###
# Use `.transform()` (not `.fit_transform()`) so the test set reuses the categories learned from the training set
X_test_encoded = encoder.transform(X_test_to_encode)

# Display first few rows of encoded testing features
### YOUR CODE HERE ###
X_test_encoded[:5]


In [None]:
# Place encoded testing features (which is currently an array) into a dataframe
X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=encoder.get_feature_names_out())

# Display first few rows
X_test_encoded_df.head()


In [None]:
# Display first few rows of `X_test` with `claim_status` and `author_ban_status` columns dropped (since these features are being transformed to numeric)
### YOUR CODE HERE ###
X_test_dropped = X_test.drop(columns=['claim_status', 'author_ban_status'], errors='ignore')

# Display first few rows
X_test_dropped.head()


In [None]:
# Concatenate `X_test` and `X_test_encoded_df` to form the final dataframe for testing data (`X_test_final`)
# Note: Using `.reset_index(drop=True)` to reset the index in X_test after dropping `claim_status`, and `author_ban_status`,
# so that the indices align with those in `X_test_encoded_df`
### YOUR CODE HERE ###
X_test_final = pd.concat([X_test_dropped.reset_index(drop=True), X_test_encoded_df], axis=1)

# Display first few rows
### YOUR CODE HERE ###
X_test_final.head()


Test the logistic regression model. Use the model to make predictions on the encoded testing set.

In [None]:
# Use the fitted logistic regression model to get predictions on the encoded testing set
### YOUR CODE HERE ###
y_pred = logistic_model.predict(X_test_final)


Display the predictions on the encoded testing set.

In [None]:
# Display the predictions on the encoded testing set
### YOUR CODE HERE ###
y_pred

Display the true labels of the testing set.

In [None]:
# Display the true labels of the testing set
### YOUR CODE HERE ###
y_test

Encode the true labels of the testing set so it can be compared to the predictions.

In [None]:
# Encode the testing outcome variable
# Notes:
#   - Adjusting the shape of `y_test` before passing into `.transform()`, since it takes in 2D array
#   - Using `.ravel()` to flatten the array returned by `.transform()`, so that it can be used later to compare with predictions
### YOUR CODE HERE ###
# Reuse the encoder fitted on the training outcome so that both sets share the same 0/1 mapping
y_test_encoded = encoder_outcome.transform(y_test.values.reshape(-1, 1)).ravel()

# Display the first 10 entries of the encoded testing outcome variable
y_test_encoded[:10]


Confirm again that the dimensions of the training and testing sets are in alignment since additional features were added.

In [None]:
# Get shape of each training and testing set
### YOUR CODE HERE ###
# Display the shape of each set
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)


### **Task 4b. Visualize model results**

Create a confusion matrix to visualize the results of the logistic regression model.

In [None]:
# Compute values for confusion matrix
### YOUR CODE HERE ###

# Create display of confusion matrix
### YOUR CODE HERE ###

# Plot confusion matrix
### YOUR CODE HERE ###

# Display plot
### YOUR CODE HERE ###
# Compute the confusion matrix from the encoded true labels and the model's predictions
# (`confusion_matrix`, `plt`, and `sns` were imported in Task 1)
conf_matrix = confusion_matrix(y_test_encoded, y_pred)

# Create a display of the confusion matrix using seaborn's heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')

# Add labels and title
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')

# Display the plot
plt.show()


Create a classification report that includes precision, recall, f1-score, and accuracy metrics to evaluate the performance of the logistic regression model.

In [None]:
# Create a classification report
### YOUR CODE HERE ###
# Generate the classification report (0 = "not verified", 1 = "verified")
class_report = classification_report(y_test_encoded, y_pred, target_names=['not verified', 'verified'])

# Display the classification report
print("Classification Report:\n", class_report)
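
To show what the numbers in a classification report mean, the sketch below derives precision, recall, F1, and accuracy for the positive class directly from confusion-matrix counts. The labels and predictions are hypothetical, and `binary_metrics` is an illustrative helper, not an `sklearn` function.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion counts and summary metrics for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)

    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical true labels and predictions (1 = verified, 0 = not verified)
example_true = [1, 0, 1, 1, 0, 0, 1, 0]
example_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = binary_metrics(example_true, example_pred)
print(metrics)
```

Precision answers "of the videos predicted as verified, how many were?", while recall answers "of the videos that were verified, how many did the model find?".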


### **Task 4c. Interpret model coefficients**

In [None]:
# Get the feature names from the model and the model coefficients (which represent log-odds ratios)
# Place into a DataFrame for readability
### YOUR CODE HERE ###
# Extract the names of the features the model was fitted on
feature_names = logistic_model.feature_names_in_

# Get the model coefficients (log-odds) from the logistic regression model
coefficients = logistic_model.coef_[0]

# Create a DataFrame to display feature names alongside their coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient (Log-Odds Ratio)': coefficients
})

# Display the DataFrame sorted by the absolute value of coefficients
coef_df_sorted = coef_df.sort_values(by='Coefficient (Log-Odds Ratio)', key=abs, ascending=False)
print(coef_df_sorted)
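
Log-odds coefficients are easier to read once they are exponentiated into odds ratios. The sketch below uses hypothetical coefficient values and feature names, not output from the fitted model.

```python
import math

# Hypothetical log-odds coefficients (illustration only)
example_coefficients = {"feature_a": 0.69, "feature_b": -0.10, "feature_c": 0.0}

# exp(coefficient) is the multiplicative change in the odds of the positive class
# for a one-unit increase in the feature, holding the other features fixed
odds_ratios = {name: math.exp(value) for name, value in example_coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers" if ratio < 1 else "does not change"
    print(f"{name}: odds ratio = {ratio:.3f} ({direction} the odds)")
```

An odds ratio above 1 means the feature is associated with higher odds of the positive class, a ratio below 1 with lower odds, and a ratio of exactly 1 with no change. Because the features in this notebook are on very different scales, the size of a coefficient alone does not indicate which feature matters most.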


### **Task 4d. Conclusion**

1. What are the key takeaways from this project?

2. What results can be presented from this project?

==> ENTER YOUR RESPONSE TO QUESTIONS 1 AND 2 HERE

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged. 