# **TikTok Project**
**Course 6 - The Nuts and Bolts of Machine Learning**

Recall that you are a data professional at TikTok. Your supervisor was impressed with the work you have done and has requested that you build a machine learning model that can be used to determine whether a video contains a claim or whether it offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Course 6 End-of-course project: Classifying videos using machine learning**

In this activity, you will practice using machine learning techniques to predict a binary outcome variable.
<br/>

**The purpose** of this model is to reduce response time and increase system efficiency by automating the initial stages of the claims process.

**The goal** of this model is to predict whether a TikTok video presents a "claim" or presents an "opinion".
<br/>

*This activity has three parts:*

**Part 1:** Ethical considerations
* Consider the ethical implications of the request

* Should the objective of the model be adjusted?

**Part 2:** Feature engineering

* Perform feature selection, extraction, and transformation to prepare the data for modeling

**Part 3:** Modeling

* Build the models, evaluate them, and advise on next steps

Follow the instructions and answer the questions below to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.



# **Classify videos using machine learning**

<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document to reflect on the Plan stage.

In this stage, consider the following questions:


1.   **What are you being asked to do? What metric should you use to evaluate the success of your business/organizational objective?**

2.   **What are the ethical implications of the model? What are the consequences of your model making errors?**
  *   What is the likely effect of the model when it predicts a false negative (i.e., when the model says a video does not contain a claim and it actually does)?

  *   What is the likely effect of the model when it predicts a false positive (i.e., when the model says a video does contain a claim and it actually does not)?

3.   **How would you proceed?**
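
To make the two error types in question 2 concrete, here is a minimal sketch that counts false negatives and false positives from a handful of made-up labels. The label lists, and the convention that 1 means "claim" and 0 means "opinion", are assumptions for illustration only and are not taken from the dataset.

```python
# Hypothetical labels: 1 = claim, 0 = opinion (an assumed encoding)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# False negative: the video contains a claim, but the model says it does not
false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)

# False positive: the video does not contain a claim, but the model says it does
false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)

# True positive: the video contains a claim and the model says so
true_positives = sum(1 for t, p in pairs if t == 1 and p == 1)

# Recall drops as false negatives rise; precision drops as false positives rise
recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)

print(false_negatives, false_positives, recall, precision)
```

Which of the two errors is more costly depends on the business objective, and that choice is what determines whether recall or precision is the better metric to optimize.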


==> ENTER YOUR RESPONSES HERE

**Modeling workflow and model selection process**

Previous work with this data has revealed that there are ~20,000 videos in the sample. This is sufficient to conduct a rigorous model validation workflow, broken into the following steps:

1. Split the data into train/validation/test sets (60/20/20)
2. Fit models and tune hyperparameters on the training set
3. Perform final model selection on the validation set
4. Assess the champion model's performance on the test set

![](https://raw.githubusercontent.com/adacert/tiktok/main/optimal_model_flow_numbered.svg)
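
Step 1 of the workflow above can be sketched with two successive calls to scikit-learn's `train_test_split`. The data below is synthetic, and the use of `stratify` and a fixed `random_state` are assumptions made here for reproducibility rather than requirements of the activity.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1,000 rows, 3 features, balanced binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.tile([0, 1], 500)

# First split: hold out 20% of all rows as the test set
X_tr, X_test, y_tr, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Second split: 25% of the remaining 80% becomes validation (20% of the original)
X_train, X_val, y_train, y_val = train_test_split(
    X_tr, y_tr, test_size=0.25, stratify=y_tr, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Because 0.25 × 0.80 = 0.20, the two splits leave 60% of the rows for training and 20% each for validation and testing.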


### **Task 1. Imports and data loading**

Start by importing packages needed to build machine learning models to achieve the goal of this project.

In [None]:
# Import packages for data manipulation
### YOUR CODE HERE ###


# Import packages for data visualization
### YOUR CODE HERE ###


# Import packages for data preprocessing
### YOUR CODE HERE ###


# Import packages for data modeling
### YOUR CODE HERE ###


Now load the data from the provided csv file into a dataframe.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

<img src="images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### **Task 2: Examine data, summary info, and descriptive stats**

Inspect the first five rows of the dataframe.

In [None]:
# Display first few rows
### YOUR CODE HERE ###


Get the number of rows and columns in the dataset.

In [None]:
# Get number of rows and columns
### YOUR CODE HERE ###


Get the data types of the columns.

In [None]:
# Get data types of columns
### YOUR CODE HERE ###


Get basic information about the dataset.

In [None]:
# Get basic information
### YOUR CODE HERE ###


Generate basic descriptive statistics about the dataset.

In [None]:
# Generate basic descriptive stats
### YOUR CODE HERE ###


Check for and handle missing values.

In [None]:
# Check for missing values
### YOUR CODE HERE ###


In [None]:
# Drop rows with missing values
### YOUR CODE HERE ###


In [None]:
# Display first few rows after handling missing values
### YOUR CODE HERE ###


Check for and handle duplicates.

In [None]:
# Check for duplicates
### YOUR CODE HERE ###
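
As a rough illustration of the missing-value and duplicate checks described above, the sketch below uses a tiny made-up dataframe. The column names and values are hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe; column names and values are illustrative only
df = pd.DataFrame({
    "video_id": [1, 2, 3, 3, 5],
    "video_view_count": [100.0, np.nan, 250.0, 250.0, 40.0],
})

# Count missing values per column
missing_per_column = df.isna().sum()

# Drop every row that contains at least one missing value
df_clean = df.dropna(axis=0)

# Count fully duplicated rows (the first occurrence is not counted)
n_duplicates = df_clean.duplicated().sum()

print(missing_per_column["video_view_count"], len(df_clean), n_duplicates)
```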


Check for and handle outliers.

In [None]:
### YOUR CODE HERE ###


Check class balance.

In [None]:
# Check class balance
### YOUR CODE HERE ###
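
One possible way to check class balance is to look at the share of each class rather than raw counts. In this sketch the series name `claim_status` and its values are assumptions, not details given in this notebook.

```python
import pandas as pd

# Hypothetical target column; "claim_status" is an assumed name
labels = pd.Series(
    ["claim", "opinion", "claim", "opinion", "claim"], name="claim_status"
)

# Share of each class; values near 0.5 indicate a balanced binary target
class_share = labels.value_counts(normalize=True)

print(class_share)
```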


<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**
Consider the questions in your PACE Strategy Document to reflect on the Construct stage.

### **Task 3: Feature engineering**

Extract the length of each `video_transcription_text` and add this as a column to the dataframe, so that it can be used as a potential feature in the model.

In [None]:
# Extract the length of each `video_transcription_text` and add this as a column to the dataframe
### YOUR CODE HERE ###


Calculate the average `text_length` for claims and opinions.

In [None]:
# Calculate the average text_length for claims and opinions
### YOUR CODE HERE ###
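
The two steps above, extracting the text length and averaging it per class, might look like the following on made-up rows. The target column name `claim_status` and the transcription strings are assumptions for illustration. The length here is a character count, which is one reasonable reading of "length".

```python
import pandas as pd

# Hypothetical rows; "claim_status" is an assumed name for the target column
df = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion"],
    "video_transcription_text": [
        "someone shared that a study found this",
        "a report said that",
        "i think so",
        "in my opinion yes",
    ],
})

# Character count of each transcription
df["text_length"] = df["video_transcription_text"].str.len()

# Mean text length per class
mean_length = df.groupby("claim_status")["text_length"].mean()

print(mean_length)
```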


Visualize the distribution of `text_length` for claims and opinions.

In [None]:
# Visualize the distribution of `text_length` for claims and opinions
# Create two histograms in one plot
### YOUR CODE HERE ###


**Feature selection and transformation**

Encode target and categorical variables.

In [None]:
# Create a copy of the X data
### YOUR CODE HERE ###

# Drop unnecessary columns
### YOUR CODE HERE ###

# Encode target variable
### YOUR CODE HERE ###

# Dummy encode remaining categorical values
### YOUR CODE HERE ###
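
A minimal sketch of the encoding steps, assuming a string target and one string-valued categorical feature. All column names and values below are hypothetical, and the mapping of "opinion" to 0 and "claim" to 1 is an assumed convention.

```python
import pandas as pd

# Hypothetical rows; all column names and values here are assumptions
df = pd.DataFrame({
    "claim_status": ["claim", "opinion", "claim"],
    "verified_status": ["verified", "not verified", "not verified"],
    "text_length": [95, 82, 101],
})

# Work on a copy so the original dataframe is left untouched
X = df.copy()

# Encode the target as integers: opinion -> 0, claim -> 1
X["claim_status"] = X["claim_status"].map({"opinion": 0, "claim": 1})

# Dummy encode the categorical column, dropping one level to avoid redundancy
X = pd.get_dummies(X, columns=["verified_status"], drop_first=True)

print(X.columns.tolist())
```

Dropping the first level keeps a single indicator column for a two-level category, since the second column would carry no extra information.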


### **Task 4: Split the data**

Assign target variable.

In [None]:
# Isolate target variable
### YOUR CODE HERE ###


Isolate the features.

In [None]:
# Isolate features
### YOUR CODE HERE ###

# Display first few rows of features dataframe
### YOUR CODE HERE ###


### **Task 5: Create train/validate/test sets**

Split data into training and testing sets, 80/20.

In [None]:
# Split the data into training and testing sets
### YOUR CODE HERE ###


Split the training set into training and validation sets, 75/25, to result in a final ratio of 60/20/20 for train/validate/test sets.

In [None]:
# Split the training data into training and validation sets
### YOUR CODE HERE ###


Confirm that the dimensions of the training, validation, and testing sets are in alignment.

In [None]:
# Get shape of each training, validation, and testing set
### YOUR CODE HERE ###


### **Task 6. Build models**


### **Build a random forest model**

Fit a random forest model to the training set. Use cross-validation to tune the hyperparameters and select the model that performs best on recall.

In [None]:
# Instantiate the random forest classifier
### YOUR CODE HERE ###

# Create a dictionary of hyperparameters to tune
### YOUR CODE HERE ###

# Define a list of scoring metrics to capture
### YOUR CODE HERE ###

# Instantiate the GridSearchCV object
### YOUR CODE HERE ###


In [None]:
# Fit the model to the data
### YOUR CODE HERE ###


In [None]:
# Examine best recall score
### YOUR CODE HERE ###


In [None]:
# Examine best parameters
### YOUR CODE HERE ###


Check the precision score to make sure the model isn't labeling everything as claims. You can do this by using the `cv_results_` attribute of the fitted `GridSearchCV` object, which is a dictionary of numpy arrays that can be converted to a pandas dataframe. Then, examine the `mean_test_precision` column of this dataframe at the index containing the results from the best model. This index can be accessed by using the `best_index_` attribute of the fitted `GridSearchCV` object.

In [None]:
# Access the GridSearch results and convert it to a pandas df
### YOUR CODE HERE ###

# Examine the GridSearch results df at column `mean_test_precision` in the best index
### YOUR CODE HERE ###
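
The tuning and precision check described above can be sketched end to end on synthetic data. This is only an illustration: the data comes from `make_classification`, the hyperparameter grid is deliberately tiny and arbitrary, and the names `rf_cv`, `cv_params`, and `results` are placeholders chosen here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical), not the TikTok dataset
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rf = RandomForestClassifier(random_state=0)

# Deliberately tiny grid so the sketch runs quickly; the values are arbitrary
cv_params = {"max_depth": [3, None], "n_estimators": [20, 50]}

# Capture several metrics, but select and refit the best model on recall
scoring = ["accuracy", "precision", "recall", "f1"]
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=3, refit="recall")
rf_cv.fit(X, y)

# cv_results_ is a dictionary of arrays; one row per hyperparameter combination
results = pd.DataFrame(rf_cv.cv_results_)

# Mean cross-validated precision of the combination that scored best on recall
best_precision = results["mean_test_precision"][rf_cv.best_index_]

print(rf_cv.best_score_, best_precision)
```

With `refit="recall"`, `best_score_` is the mean cross-validated recall of the selected combination, so reading `mean_test_precision` at the same index shows whether high recall came at the cost of precision.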


**Question:** How well is your model performing? Consider average recall score and precision score.

### **Build an XGBoost model**

In [None]:
# Instantiate the XGBoost classifier
### YOUR CODE HERE ###

# Create a dictionary of hyperparameters to tune
### YOUR CODE HERE ###

# Define a list of scoring metrics to capture
### YOUR CODE HERE ###

# Instantiate the GridSearchCV object
### YOUR CODE HERE ###


In [None]:
# Fit the model to the data
### YOUR CODE HERE ###


In [None]:
# Examine best recall score
### YOUR CODE HERE ###


In [None]:
# Examine best parameters
### YOUR CODE HERE ###


Repeat the steps used for random forest to examine the precision score of the best model identified in the grid search.

In [None]:
# Access the GridSearch results and convert it to a pandas df
### YOUR CODE HERE ###

# Examine the GridSearch results df at column `mean_test_precision` in the best index
### YOUR CODE HERE ###


**Question:** How well does your model perform? Consider recall score and precision score.

<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**
Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### **Task 7. Evaluate model**

Evaluate models against validation criteria.

#### **Random forest**

In [None]:
# Use the random forest "best estimator" model to get predictions on the validation set
### YOUR CODE HERE ###

Display the predictions on the validation set.

In [None]:
# Display the predictions on the validation set
### YOUR CODE HERE ###

Display the true labels of the validation set.

In [None]:
# Display the true labels of the validation set
### YOUR CODE HERE ###

Create a confusion matrix to visualize the results of the classification model.

In [None]:
# Create a confusion matrix to visualize the results of the classification model

# Compute values for confusion matrix
### YOUR CODE HERE ###

# Create display of confusion matrix using ConfusionMatrixDisplay()
### YOUR CODE HERE ###

# Plot confusion matrix
### YOUR CODE HERE ###

# Display plot
### YOUR CODE HERE ###


Create a classification report that includes precision, recall, f1-score, and accuracy metrics to evaluate the performance of the model.
<br/>

**Note:** In other labs there was a custom-written function to extract the accuracy, precision, recall, and F<sub>1</sub> scores from the GridSearchCV report and display them in a table. You can also use scikit-learn's built-in [`classification_report()`](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) function to obtain a similar table of results.

In [None]:
# Create a classification report for the random forest model
### YOUR CODE HERE ###
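
For reference, here is a small sketch of scikit-learn's `confusion_matrix()` and `classification_report()` on made-up validation labels. The label values and the encoding (1 for "claim", 0 for "opinion") are assumptions. Plotting with `ConfusionMatrixDisplay` is left out so the sketch runs without a plotting backend.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels: 1 = claim, 0 = opinion (an assumed encoding)
y_val = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels and columns are predicted labels, ordered as in `labels`
cm = confusion_matrix(y_val, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Per-class precision, recall, and F1, plus overall accuracy
report = classification_report(
    y_val, y_pred, labels=[0, 1], target_names=["opinion", "claim"]
)

print(cm)
print(report)
```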


**Question:** What does your classification report show? What does the confusion matrix indicate?

#### **XGBoost**

Now, evaluate the XGBoost model on the validation set.

In [None]:
# Use the best estimator to predict on the validation data
### YOUR CODE HERE ###


In [None]:
# Compute values for confusion matrix
### YOUR CODE HERE ###

# Create display of confusion matrix using ConfusionMatrixDisplay()
### YOUR CODE HERE ###

# Plot confusion matrix
### YOUR CODE HERE ###

# Display plot
### YOUR CODE HERE ###


In [None]:
# Create a classification report
### YOUR CODE HERE ###


**Question:** Describe your XGBoost model results. How does your XGBoost model compare to your random forest model?

### **Use champion model to predict on test data**

In [None]:
### YOUR CODE HERE ###


In [None]:
# Compute values for confusion matrix
### YOUR CODE HERE ###

# Create display of confusion matrix using ConfusionMatrixDisplay()
### YOUR CODE HERE ###

# Plot confusion matrix
### YOUR CODE HERE ###

# Display plot
### YOUR CODE HERE ###


#### **Feature importances of champion model**


In [None]:
### YOUR CODE HERE ###
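
Tree ensembles in scikit-learn expose impurity-based importances through the `feature_importances_` attribute. The sketch below uses a random forest on synthetic data as a stand-in for whichever model you selected as champion. The feature names are hypothetical, and the target is constructed so that only `signal` matters.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical names); only "signal" drives the target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=400),
    "noise_a": rng.normal(size=400),
    "noise_b": rng.normal(size=400),
})
y = (X["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One importance value per feature; the values sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)

print(importances.sort_values(ascending=False))
```

Sorting the series makes it easy to plot or read off the most predictive features.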


**Question:** Describe your most predictive features. Were your results surprising?

### **Task 8. Conclusion**

In this step, use the results of the models above to formulate a conclusion. Consider the following questions:

1. **Would you recommend using this model? Why or why not?**

2. **What was your model doing? Can you explain how it was making predictions?**

3. **Are there new features that you can engineer that might improve model performance?**

4. **What features would you want to have that would likely improve the performance of your model?**

Remember, sometimes your data simply will not be predictive of your chosen target. This is common. Machine learning is a powerful tool, but it is not magic. If your data does not contain predictive signal, even the most complex algorithm will not be able to deliver consistent and accurate predictions. Do not be afraid to draw this conclusion.


==> ENTER YOUR RESPONSES HERE

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.