# **TikTok Project**
**Course 6 - The Nuts and Bolts of Machine Learning**

Recall that you are a data professional at TikTok. Your supervisor was impressed with the work you have done and has requested that you build a machine learning model that can be used to determine whether a video contains a claim or whether it offers an opinion. With a successful prediction model, TikTok can reduce the backlog of user reports and prioritize them more efficiently.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Course 6 End-of-course project: Classifying videos using machine learning**

In this activity, you will practice using machine learning techniques to predict on a binary outcome variable.
<br/>

**The purpose** of this model is to improve response time and system efficiency by automating the initial stages of the claims process.

**The goal** of this model is to predict whether a TikTok video presents a "claim" or presents an "opinion".
<br/>

*This activity has three parts:*

**Part 1:** Ethical considerations
* Consider the ethical implications of the request

* Should the objective of the model be adjusted?

**Part 2:** Feature engineering

* Perform feature selection, extraction, and transformation to prepare the data for modeling

**Part 3:** Modeling

* Build the models, evaluate them, and advise on next steps

Follow the instructions and answer the questions below to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.



# **Classify videos using machine learning**

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.



## **PACE: Plan**

Consider the questions in your PACE Strategy Document to reflect on the Plan stage.

In this stage, consider the following questions:


1.   **What are you being asked to do? What metric should I use to evaluate success of my business/organizational objective?**

2.   **What are the ethical implications of the model? What are the consequences of your model making errors?**
  *   What is the likely effect of the model when it predicts a false negative (i.e., when the model says a video does not contain a claim and it actually does)?

  *   What is the likely effect of the model when it predicts a false positive (i.e., when the model says a video does contain a claim and it actually does not)?

3.   **How would you proceed?**


1.   **What are you being asked to do? What metric should I use to evaluate success of my business/organizational objective?**


The problem is a large backlog of user reports, which must be reduced so that reports can be reviewed effectively. To address it, I am asked to build a machine learning model, specifically a random forest model, that classifies reported content as a claim or an opinion. Further processing can then determine whether the content needs to be taken down.

I think the F1 score would be the most useful metric here, because a high rate of either false positives or false negatives is a problem:

* A false positive means a video is labeled a claim when it is an opinion. This would hurt the author's credibility and, in turn, create problems for the app.
* A false negative is expensive because a claim video classified as an opinion may contain malicious content or misinformation, which tends to attract a lot of engagement and can cause unrest among viewers, which in turn lowers the credibility of the app.

Hence both error types should be kept as low as possible.

2.   **What are the ethical implications of the model? What are the consequences of your model making errors?**
   
The ethical implications bring in a different view: here, recall seems a better metric for judging the model. We should aim for high recall, because a video that is a claim should not be classified as an opinion; that is, the model should avoid false negatives (type II errors). Even if the model classifies some opinions as claims (false positives, or type I errors), that is acceptable, because subsequent review by human moderators or other processes will eventually tag the video as an opinion, and the author will not be flagged as problematic. However, letting videos that are actually claims be tagged as opinions is harmful, because claim videos are more likely to contain malicious content or misinformation, which could have more serious negative consequences.
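To make the trade-off between precision, recall, and F1 concrete, here is a minimal sketch that computes all three from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; "claim" is assumed to be the positive class.

```python
# Hypothetical confusion-matrix counts (claim = positive class); not real results.
tp, fp, fn, tn = 90, 30, 10, 70

precision = tp / (tp + fp)  # share of predicted claims that are actually claims
recall = tp / (tp + fn)     # share of actual claims that the model caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts, recall is high because few claims are missed, while precision is lower because a number of opinions are flagged as claims. That is the kind of error profile the answer above considers acceptable.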

3.   **How would you proceed?**

I would start with EDA, although the data has already been through several earlier phases and is fairly clean and structured. I would then encode the categorical variables and assign the outcome and predictor variables, using as predictors the variables that the analysis in previous phases found to be important. After splitting the data into training, validation, and test sets, I would carry out the remaining tasks: grid search with cross-validation, model testing, and other evaluations.

exemplar answer
**Modeling workflow and model selection process**

Previous work with this data has revealed that there are ~20,000 videos in the sample. This is sufficient to conduct a rigorous model validation workflow, broken into the following steps:

1. Split the data into train/validation/test sets (60/20/20)
2. Fit models and tune hyperparameters on the training set
3. Perform final model selection on the validation set
4. Assess the champion model's performance on the test set



![](https://raw.githubusercontent.com/adacert/tiktok/main/optimal_model_flow_numbered.svg)


### **Task 1. Imports and data loading**

Start by importing packages needed to build machine learning models to achieve the goal of this project.

In [1]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for data preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# Import packages for data modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance

Now load the data from the provided csv file into a dataframe.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# Load dataset into dataframe
data = pd.read_csv(r"C:\Users\saswa\Documents\GitHub\Python-For-Data-Analysis\Course-6\Data\module_5_data\tiktok_dataset.csv")



## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### **Task 2: Examine data, summary info, and descriptive stats**

Inspect the first five rows of the dataframe.

In [3]:
# Display first few rows
data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


Get the number of rows and columns in the dataset.

In [4]:
# Get number of rows and columns
data.shape

(19382, 12)

Get the data types of the columns.

In [10]:
# Get data types of columns
data.dtypes

#                             int64
claim_status                 object
video_id                      int64
video_duration_sec            int64
video_transcription_text     object
verified_status              object
author_ban_status            object
video_view_count            float64
video_like_count            float64
video_share_count           float64
video_download_count        float64
video_comment_count         float64
dtype: object

Get basic information about the dataset.

In [11]:
# Get basic information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


Generate basic descriptive statistics about the dataset.

In [12]:
# Generate basic descriptive stats
data.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


Check for and handle missing values.

In [16]:
# Check for missing values (sum() counts the missing cells; count() would only return the row count)
data.isna().sum()


#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

Each affected column is missing 298 of its 19,382 values (about 1.5%), which is very few relative to the number of samples in the dataset. Therefore, observations with missing values can be dropped.

In [14]:
# Drop rows with missing values
data = data.dropna(axis=0)
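
As a minimal sketch on a made-up three-column frame (not the TikTok data), the following shows one way to count missing values per column and then drop incomplete rows. The frame `toy` and its values are assumptions for illustration. Note that `isna().sum()` counts missing cells, whereas `isna().count()` only reports how many rows each column has.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are invented.
toy = pd.DataFrame({
    "claim_status": ["claim", None, "opinion", "claim"],
    "video_view_count": [100.0, np.nan, 250.0, 75.0],
    "video_duration_sec": [10, 20, 30, 40],
})

missing_per_column = toy.isna().sum()   # number of missing cells in each column
rows_per_column = toy.isna().count()    # counts every cell, so it equals the row count

toy_clean = toy.dropna(axis=0)          # drop rows with at least one missing value

print(missing_per_column)
print(toy_clean.shape)
```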

In [18]:

# Display first few rows after handling missing values
data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


Check for and handle duplicates.

In [None]:
# Check for duplicates
data.duplicated().sum()

There are no duplicate observations in the data.

Check for and handle outliers.

Tree-based models are robust to outliers, so there is no need to impute or drop any values based on where they fall in their distribution.

Check class balance.

In [19]:
# Check class balance
data["claim_status"].value_counts(normalize=True)

claim_status
claim      0.503458
opinion    0.496542
Name: proportion, dtype: float64

Approximately 50.3% of the dataset represents claims and 49.7% represents opinions, so the outcome variable is balanced.



## **PACE: Construct**
Consider the questions in your PACE Strategy Document to reflect on the Construct stage.

### **Task 3: Feature engineering**

Extract the length of each `video_transcription_text` and add this as a column to the dataframe, so that it can be used as a potential feature in the model.

In [None]:
# Extract the length of each `video_transcription_text` and add this as a column to the dataframe
### YOUR CODE HERE ###


Calculate the average text_length for claims and opinions.

In [None]:
# Calculate the average text_length for claims and opinions
### YOUR CODE HERE ###
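
One possible approach to the two steps above, sketched on a tiny invented frame rather than the real dataset: compute the character length of each transcription with `.str.len()`, then average it per class with `groupby`. The transcription strings here are made up for illustration.

```python
import pandas as pd

# Invented examples; only the column names mirror the notebook's dataset.
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion"],
    "video_transcription_text": [
        "someone shared with me that drones deliver mail",
        "a friend read that bees can count",
        "i think cats are great",
        "in my opinion tea is best",
    ],
})

# Character count of each transcription, stored as a new feature column
toy["text_length"] = toy["video_transcription_text"].str.len()

# Average text length for claims and for opinions
mean_length = toy.groupby("claim_status")["text_length"].mean()
print(mean_length)
```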


Visualize the distribution of `text_length` for claims and opinions.

In [None]:
# Visualize the distribution of `text_length` for claims and opinions
# Create two histograms in one plot
### YOUR CODE HERE ###


**Feature selection and transformation**

Encode target and categorical variables.

In [None]:
# Create a copy of the X data
### YOUR CODE HERE ###

# Drop unnecessary columns
### YOUR CODE HERE ###

# Encode target variable
### YOUR CODE HERE ###

# Dummy encode remaining categorical values
### YOUR CODE HERE ###
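
The encoding steps might look like the sketch below, shown on a small invented frame. Which columns count as "unnecessary" is an assumption here (the identifier columns `#` and `video_id`), as is the choice to map `claim` to 1 and to use `drop_first=True` when dummy encoding.

```python
import pandas as pd

# Invented rows; only the column names mirror the notebook's dataset.
toy = pd.DataFrame({
    "#": [1, 2, 3],
    "video_id": [111, 222, 333],
    "claim_status": ["claim", "opinion", "claim"],
    "verified_status": ["not verified", "verified", "not verified"],
    "author_ban_status": ["active", "under review", "banned"],
    "video_view_count": [100.0, 250.0, 75.0],
})

X = toy.copy()

# Assumption: identifier columns carry no predictive signal, so drop them
X = X.drop(columns=["#", "video_id"])

# Encode the target as integers (claim = positive class)
X["claim_status"] = X["claim_status"].map({"opinion": 0, "claim": 1})

# Dummy encode the remaining categorical columns
X = pd.get_dummies(X, columns=["verified_status", "author_ban_status"], drop_first=True)
print(X.columns.tolist())
```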


### **Task 4: Split the data**

Assign target variable.

In [None]:
# Isolate target variable
### YOUR CODE HERE ###


Isolate the features.

In [None]:
# Isolate features
### YOUR CODE HERE ###

# Display first few rows of features dataframe
### YOUR CODE HERE ###


### **Task 5: Create train/validate/test sets**

Split data into training and testing sets, 80/20.

In [None]:
# Split the data into training and testing sets
### YOUR CODE HERE ###


Split the training set into training and validation sets, 75/25, to result in a final ratio of 60/20/20 for train/validate/test sets.

In [None]:
# Split the training data into training and validation sets
### YOUR CODE HERE ###


Confirm that the dimensions of the training, validation, and testing sets are in alignment.

In [None]:
# Get shape of each training, validation, and testing set
### YOUR CODE HERE ###
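
A minimal sketch of the two-stage split on synthetic arrays (not the project data): holding out 20% for testing and then 25% of the remainder for validation gives 0.25 × 0.80 = 0.20 of the full data for validation, hence 60/20/20. The array sizes and `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and target
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Stage 1: 80% interim training data, 20% test data
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Stage 2: split the interim set 75/25 into final training and validation data
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
print(y_train.shape, y_val.shape, y_test.shape)
```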


### **Task 6. Build models**


### **Build a random forest model**

Fit a random forest model to the training set. Use cross-validation to tune the hyperparameters and select the model that performs best on recall.

In [None]:
# Instantiate the random forest classifier
### YOUR CODE HERE ###

# Create a dictionary of hyperparameters to tune
### YOUR CODE HERE ###

# Define a list of scoring metrics to capture
### YOUR CODE HERE ###

# Instantiate the GridSearchCV object
### YOUR CODE HERE ###


In [None]:
# Fit the model to the data
### YOUR CODE HERE ###


In [None]:
# Examine best recall score
### YOUR CODE HERE ###


In [None]:
# Examine best parameters
### YOUR CODE HERE ###


Check the precision score to make sure the model isn't labeling everything as claims. You can do this by using the `cv_results_` attribute of the fit `GridSearchCV` object, which is a dictionary of numpy arrays that can be converted to a pandas dataframe. Then, examine the `mean_test_precision` column of this dataframe at the index containing the results from the best model. This index can be accessed by using the `best_index_` attribute of the fit `GridSearchCV` object.

In [None]:
# Access the GridSearch results and convert it to a pandas df
### YOUR CODE HERE ###

# Examine the GridSearch results df at column `mean_test_precision` in the best index
### YOUR CODE HERE ###
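
The grid search and precision check described above can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the project data, and the hyperparameter grid, `cv=3`, and `random_state` values are deliberately small, arbitrary assumptions chosen so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the real features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

rf = RandomForestClassifier(random_state=0)
cv_params = {"max_depth": [3, None], "n_estimators": [25, 50]}  # illustrative grid
scoring = ["accuracy", "precision", "recall", "f1"]

# refit="recall" selects the best model by mean cross-validated recall
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=3, refit="recall")
rf_cv.fit(X, y)

# cv_results_ is a dict of arrays; a dataframe makes it easy to inspect
results = pd.DataFrame(rf_cv.cv_results_)
best_precision = results["mean_test_precision"].iloc[rf_cv.best_index_]

print(rf_cv.best_params_)
print("best recall:", rf_cv.best_score_)
print("precision of that model:", best_precision)
```

Because `refit="recall"` is used, `best_score_` is the mean recall of the winning parameter combination, and `best_index_` points at the same row of the results dataframe.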


**Question:** How well is your model performing? Consider average recall score and precision score.

### **Build an XGBoost model**

In [None]:
# Instantiate the XGBoost classifier
### YOUR CODE HERE ###

# Create a dictionary of hyperparameters to tune
### YOUR CODE HERE ###

# Define a list of scoring metrics to capture
### YOUR CODE HERE ###

# Instantiate the GridSearchCV object
### YOUR CODE HERE ###


In [None]:
# Fit the model to the data
### YOUR CODE HERE ###


In [None]:
# Examine best recall score
### YOUR CODE HERE ###


In [None]:
# Examine best parameters
### YOUR CODE HERE ###


Repeat the steps used for random forest to examine the precision score of the best model identified in the grid search.

In [None]:
# Access the GridSearch results and convert it to a pandas df
### YOUR CODE HERE ###

# Examine the GridSearch results df at column `mean_test_precision` in the best index
### YOUR CODE HERE ###


**Question:** How well does your model perform? Consider recall score and precision score.

## **PACE: Execute**
Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### **Task 7. Evaluate model**

Evaluate models against validation criteria.

#### **Random forest**

In [None]:
# Use the random forest "best estimator" model to get predictions on the validation set
### YOUR CODE HERE ###

Display the predictions on the validation set.

In [None]:
# Display the predictions on the validation set
### YOUR CODE HERE ###

Display the true labels of the validation set.

In [1]:
# Display the true labels of the validation set
### YOUR CODE HERE ###

Create a confusion matrix to visualize the results of the classification model.

In [None]:
# Create a confusion matrix to visualize the results of the classification model

# Compute values for confusion matrix
### YOUR CODE HERE ###

# Create display of confusion matrix using ConfusionMatrixDisplay()
### YOUR CODE HERE ###

# Plot confusion matrix
### YOUR CODE HERE ###

# Display plot
### YOUR CODE HERE ###


Create a classification report that includes precision, recall, f1-score, and accuracy metrics to evaluate the performance of the model.
<br/>

**Note:** In other labs there was a custom-written function to extract the accuracy, precision, recall, and F<sub>1</sub> scores from the GridSearchCV report and display them in a table. You can also use scikit-learn's built-in [`classification_report()`](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) function to obtain a similar table of results.

In [None]:
# Create classification report for random forest model
### YOUR CODE HERE ###
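
As a small sketch of the evaluation steps, the following computes a confusion matrix and a classification report for hand-written labels. The labels are invented, and the mapping 0 = opinion, 1 = claim is an assumption about how the target was encoded.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented true labels and predictions (0 = opinion, 1 = claim)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# Rows are true labels, columns are predicted labels, both ordered [0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

report = classification_report(y_true, y_pred, target_names=["opinion", "claim"])

print(cm)
print(report)
```

Here the single false negative is the claim predicted as an opinion, which is the error type that recall penalizes. To plot the matrix, `ConfusionMatrixDisplay(confusion_matrix=cm).plot()` can be used.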


**Question:** What does your classification report show? What does the confusion matrix indicate?

#### **XGBoost**

Now, evaluate the XGBoost model on the validation set.

In [None]:
# Use the best estimator to predict on the validation data
### YOUR CODE HERE ###


In [None]:
# Compute values for confusion matrix
### YOUR CODE HERE ###

# Create display of confusion matrix using ConfusionMatrixDisplay()
### YOUR CODE HERE ###

# Plot confusion matrix
### YOUR CODE HERE ###

# Display plot
### YOUR CODE HERE ###


In [None]:
# Create a classification report
### YOUR CODE HERE ###


**Question:** Describe your XGBoost model results. How does your XGBoost model compare to your random forest model?

### **Use champion model to predict on test data**

In [None]:
### YOUR CODE HERE ###


In [None]:
# Compute values for confusion matrix
### YOUR CODE HERE ###

# Create display of confusion matrix using ConfusionMatrixDisplay()
### YOUR CODE HERE ###

# Plot confusion matrix
### YOUR CODE HERE ###

# Display plot
### YOUR CODE HERE ###


#### **Feature importances of champion model**


In [None]:
### YOUR CODE HERE ###
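
One way to inspect feature importances, sketched with a random forest fit on synthetic data: wrap `feature_importances_` in a pandas Series and sort it. The feature names below are attached to random synthetic columns purely as labels, so the resulting ranking says nothing about the real TikTok features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Labels only: the underlying columns are synthetic, not real video statistics
feature_names = ["video_view_count", "video_like_count", "video_share_count", "text_length"]
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```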


**Question:** Describe your most predictive features. Were your results surprising?

### **Task 8. Conclusion**

In this step, use the results of the models above to formulate a conclusion. Consider the following questions:

1. **Would you recommend using this model? Why or why not?**

2. **What was your model doing? Can you explain how it was making predictions?**

3. **Are there new features that you can engineer that might improve model performance?**

4. **What features would you want to have that would likely improve the performance of your model?**

Remember, sometimes your data simply will not be predictive of your chosen target. This is common. Machine learning is a powerful tool, but it is not magic. If your data does not contain predictive signal, even the most complex algorithm will not be able to deliver consistent and accurate predictions. Do not be afraid to draw this conclusion.


==> ENTER YOUR RESPONSES HERE

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.