# **<span style="color: red;">CLASSIFICATION TASK</span>**

# **1. Top Streamers on Twitch**
---
(https://www.kaggle.com/datasets/aayushmishra1512/twitchdata)

## **1.1 Importing Libraries**
---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score, precision_recall_curve, average_precision_score

## **1.2 Printing CSV file**
---

In [2]:
twitch_data = pd.read_csv("/kaggle/input/twitchdata/twitchdata-update.csv")
print(twitch_data)
twitch_data.columns

              Channel  Watch time(Minutes)  Stream time(minutes)  \
0               xQcOW           6196161750                215250   
1            summit1g           6091677300                211845   
2              Gaules           5644590915                515280   
3            ESL_CSGO           3970318140                517740   
4                Tfue           3671000070                123660   
..                ...                  ...                   ...   
995         LITkillah            122524635                 13560   
996  빅헤드 (bighead033)            122523705                153000   
997    마스카 (newmasca)            122452320                217410   
998     AndyMilonakis            122311065                104745   
999              Remx            122192850                 99180   

     Peak viewers  Average viewers  Followers  Followers gained  Views gained  \
0          222720            27716    3246298           1734810      93036735   
1          310998    

Index(['Channel', 'Watch time(Minutes)', 'Stream time(minutes)',
       'Peak viewers', 'Average viewers', 'Followers', 'Followers gained',
       'Views gained', 'Partnered', 'Mature', 'Language'],
      dtype='object')

In [3]:
twitch_data.head()

| | Channel | Watch time(Minutes) | Stream time(minutes) | Peak viewers | Average viewers | Followers | Followers gained | Views gained | Partnered | Mature | Language |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xQcOW | 6196161750 | 215250 | 222720 | 27716 | 3246298 | 1734810 | 93036735 | True | False | English |
| 1 | summit1g | 6091677300 | 211845 | 310998 | 25610 | 5310163 | 1370184 | 89705964 | True | False | English |
| 2 | Gaules | 5644590915 | 515280 | 387315 | 10976 | 1767635 | 1023779 | 102611607 | True | True | Portuguese |
| 3 | ESL_CSGO | 3970318140 | 517740 | 300575 | 7714 | 3944850 | 703986 | 106546942 | True | False | English |
| 4 | Tfue | 3671000070 | 123660 | 285644 | 29602 | 8938903 | 2068424 | 78998587 | True | False | English |


## **1.3 Cleaning**
---
Since there were no errors when printing the dataset, there is probably no need for cleaning. To double-check, pandas' .isna().sum() is used to count the missing values in each column. If every count is 0, the dataset can be used directly.

In [4]:
twitch_data.isna().sum()

Channel                 0
Watch time(Minutes)     0
Stream time(minutes)    0
Peak viewers            0
Average viewers         0
Followers               0
Followers gained        0
Views gained            0
Partnered               0
Mature                  0
Language                0
dtype: int64
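
Note that .isna() only flags missing values (NaN/None) and treats infinities as ordinary numbers, so ruling out infinite values needs a separate check. A minimal sketch of such a check, assuming a small made-up frame (`demo` and its values are hypothetical, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for twitch_data; the values are made up for illustration.
demo = pd.DataFrame({
    "Channel": ["a", "b", "c"],
    "Followers": [100.0, np.inf, 250.0],
    "Views gained": [10.0, 20.0, np.nan],
})

# isna() counts NaN/None but treats inf as a regular value.
missing_per_column = demo.isna().sum()

# np.isinf only works on numeric data, so restrict to numeric columns first.
numeric = demo.select_dtypes(include="number")
inf_per_column = np.isinf(numeric).sum()

print(missing_per_column)
print(inf_per_column)
```

Running the same two checks on `twitch_data` would confirm whether the dataset is free of both missing and infinite values.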

In [5]:
twitch_data['Mature'].value_counts()

Mature
False    770
True     230
Name: count, dtype: int64

In [6]:
twitch_data['Partnered'].value_counts()

Partnered
True     978
False     22
Name: count, dtype: int64

## **1.4 Classification Task & Visualization**
---

### **1.40 Variables & Function**
---

In [7]:
# --- Splitting Data ---
X = twitch_data.drop(['Partnered', 'Channel', 'Mature'], axis=1)
X = pd.get_dummies(X, columns=['Language'], drop_first=True)
y = twitch_data['Mature']

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# --- Function ---
def evaluate_model(model, train_X, val_X, train_y, val_y):
    model.fit(train_X, train_y)
    val_predictions = model.predict(val_X)
    
    accuracy = accuracy_score(val_y, val_predictions)
    report = classification_report(val_y, val_predictions)
    
    print("Accuracy of the model:\n", accuracy)
    print("\nReport:\n", report)
    
    return val_predictions

### **1.41 Logistic Regression Model**
---

In [8]:
lr_model = LogisticRegression(max_iter=10000)
evaluate_model(lr_model, train_X, val_X, train_y, val_y)

# --- ROC Curve ---
y_proba = lr_model.predict_proba(val_X)
fpr, tpr, thresholds = roc_curve(val_y, y_proba[:, 1])
auc = roc_auc_score(val_y, y_proba[:, 1])
print(f"ROC AUC: {auc:.3f}")

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines+markers', name='TPR'))
fig.update_layout(
    title="(Logistic Regression) ROC Curve",
    xaxis_title="FPR",
    yaxis_title="TPR"
)
fig.show()

# --- Precision-Recall Curve ---
y_proba = lr_model.predict_proba(val_X)
precision, recall, thresholds = precision_recall_curve(val_y, y_proba[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x=recall, y=precision, mode='lines+markers', name='Precision-Recall'))
fig.update_layout(
    title="(Logistic Regression) Precision-Recall Curve",
    xaxis_title="Recall",
    yaxis_title="Precision"
)
fig.show()

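# --- Confusion Matrix ---
val_predictions = lr_model.predict(val_X)
# sklearn layout: rows = actual classes, columns = predicted classes,
# with classes in sorted order [False, True]
cm = confusion_matrix(val_y, val_predictions)

# Tick labels and cell annotations; x, y, and cm_text are reused by the later cells.
# Note: this rebinds y, which is safe only because the target was already split above.
x = ['Negative', 'Positive']
y = ['Negative', 'Positive']
cm_text = [['TN', 'FP'], ['FN', 'TP']]

fig = px.imshow(cm, 
                labels=dict(x='Predicted Values', y='Actual Values'), 
                x=x, y=y
               )
fig.update_xaxes(side="top")
fig.update_traces(text=cm_text, texttemplate="%{text}")
fig.update_layout(
    title="(Logistic Regression) Confusion Matrix"
)
fig.show()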

Accuracy of the model:
 0.764

Report:
               precision    recall  f1-score   support

       False       0.77      0.99      0.87       192
        True       0.33      0.02      0.03        58

    accuracy                           0.76       250
   macro avg       0.55      0.50      0.45       250
weighted avg       0.67      0.76      0.67       250

ROC AUC: 0.593
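
When reading the confusion matrix, it may help to recall that scikit-learn's confusion_matrix puts actual classes on the rows and predicted classes on the columns, with labels in sorted order (False before True). A small sketch with made-up labels (not the validation data) shows how the four counts can be unpacked:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = [False, False, True, True, False]
y_pred = [False, True, True, False, False]

cm = confusion_matrix(y_true, y_pred)
# For binary labels sorted as [False, True], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

With True as the positive class, the top-left cell is therefore the true negatives and the bottom-right cell the true positives.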


### **1.42 Decision Tree Model**
---

In [9]:
dt_model = DecisionTreeClassifier(random_state = 1)
evaluate_model(dt_model, train_X, val_X, train_y, val_y)

# --- ROC Curve ---
y_proba = dt_model.predict_proba(val_X)
fpr, tpr, thresholds = roc_curve(val_y, y_proba[:, 1])
auc = roc_auc_score(val_y, y_proba[:, 1])
print(f"ROC AUC: {auc:.3f}")

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines+markers', name='TPR'))
fig.update_layout(
    title="(Decision Tree) ROC Curve",
    xaxis_title="FPR",
    yaxis_title="TPR"
)
fig.show()

# --- Precision-Recall Curve ---
y_proba = dt_model.predict_proba(val_X)
precision, recall, thresholds = precision_recall_curve(val_y, y_proba[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x=recall, y=precision, mode='lines+markers', name='Precision-Recall'))
fig.update_layout(
    title="(Decision Tree) Precision-Recall Curve",
    xaxis_title="Recall",
    yaxis_title="Precision"
)
fig.show()

# --- Confusion Matrix ---
val_predictions = dt_model.predict(val_X)
cm = confusion_matrix(val_y, val_predictions)

# rows of cm are the actual classes (y axis), columns are the predicted classes (x axis)
fig = px.imshow(cm, 
                labels=dict(x='Predicted Values', y='Actual Values'), 
                x=x, y=y
               )
fig.update_xaxes(side="top")
fig.update_traces(text=cm_text, texttemplate="%{text}")
fig.update_layout(
    title="(Decision Tree) Confusion Matrix"
)
fig.show()

Accuracy of the model:
 0.684

Report:
               precision    recall  f1-score   support

       False       0.79      0.80      0.79       192
        True       0.32      0.31      0.31        58

    accuracy                           0.68       250
   macro avg       0.55      0.55      0.55       250
weighted avg       0.68      0.68      0.68       250

ROC AUC: 0.554


### **1.43 Random Forest Model**
---

In [10]:
rf_model = RandomForestClassifier(random_state = 1)
evaluate_model(rf_model, train_X, val_X, train_y, val_y)

# --- ROC Curve ---
y_proba = rf_model.predict_proba(val_X)
fpr, tpr, thresholds = roc_curve(val_y, y_proba[:, 1])
auc = roc_auc_score(val_y, y_proba[:, 1])
print(f"ROC AUC: {auc:.3f}")

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines+markers', name='TPR'))
fig.update_layout(
    title="(Random Forest) ROC Curve",
    xaxis_title="FPR",
    yaxis_title="TPR"
)
fig.show()

# --- Precision-Recall Curve ---
y_proba = rf_model.predict_proba(val_X)
precision, recall, thresholds = precision_recall_curve(val_y, y_proba[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x=recall, y=precision, mode='lines+markers', name='Precision-Recall'))
fig.update_layout(
    title="(Random Forest) Precision-Recall Curve",
    xaxis_title="Recall",
    yaxis_title="Precision"
)
fig.show()

# --- Confusion Matrix ---
val_predictions = rf_model.predict(val_X)
cm = confusion_matrix(val_y, val_predictions)

# rows of cm are the actual classes (y axis), columns are the predicted classes (x axis)
fig = px.imshow(cm, 
                labels=dict(x='Predicted Values', y='Actual Values'), 
                x=x, y=y
               )
fig.update_xaxes(side="top")
fig.update_traces(text=cm_text, texttemplate="%{text}")
fig.update_layout(
    title="(Random Forest) Confusion Matrix"
)
fig.show()

Accuracy of the model:
 0.764

Report:
               precision    recall  f1-score   support

       False       0.78      0.96      0.86       192
        True       0.46      0.10      0.17        58

    accuracy                           0.76       250
   macro avg       0.62      0.53      0.52       250
weighted avg       0.71      0.76      0.70       250

ROC AUC: 0.600


# **2. Deliverables**
---
Code references:
1. https://www.kaggle.com/code/leodaniel/roc-curve-simply-explained (Plotly instead of MatPlotLib)
2. https://www.kaggle.com/code/dansbecker/classification#Confusion-Matrix
3. https://plotly.com/python/heatmaps/

## **2.1 Dataset Description**
This dataset contains data on the top Twitch streamers in (2019?).

The CSV file contains 1000 rows and 11 columns. The .isna().sum() check reported no missing values, so the dataset did not need any specific cleaning. Four columns hold categorical rather than numeric values: "Partnered", "Channel", "Mature", and "Language". Of those, "Partnered" and "Mature" were the most relevant as classification targets, since they are Boolean (True/False) columns.

## **2.2 Preprocessing**
### 2.2a Cleaning:
After the CSV file printed without errors, pandas' .isna().sum() was used to double-check for missing values. Every column returned 0, so the dataset was used as-is.

### 2.2b Turning Language string into dummy data:
Fitting a model on the raw data raises a ValueError, because scikit-learn expects numeric features rather than strings. To solve this, pd.get_dummies() was applied to the "Language" column (with drop_first=True) to turn it into numeric indicator columns.
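
As a rough illustration of what this encoding step produces, here is a sketch on a tiny made-up frame (the `demo` frame and its values are hypothetical; the real dataset has more languages and columns):

```python
import pandas as pd

# Hypothetical mini-frame; values are made up for illustration.
demo = pd.DataFrame({
    "Followers": [100, 200, 300],
    "Language": ["English", "Portuguese", "Korean"],
})

# One indicator column per language; drop_first=True drops the first category
# (alphabetically "English" here) so the remaining columns are not redundant.
encoded = pd.get_dummies(demo, columns=["Language"], drop_first=True)
print(encoded.columns.tolist())
```

A row with all language indicators set to False then stands for the dropped category.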

### 2.2c Splitting data:
The features and target were split into training and validation sets with sklearn's train_test_split, using random_state=1 and the default 75/25 split (750 training rows and 250 validation rows).
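
One option the notebook does not use is the `stratify` argument of train_test_split, which keeps the True/False ratio the same in both splits. The sketch below illustrates it under that assumption, using a synthetic 770/230 target rather than the real data; all names ending in `_d` or `_demo` are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 1000 rows with the same 770/230 False/True counts as 'Mature'.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([False] * 770 + [True] * 230)

# Default test_size is 0.25; stratify=y_demo preserves the class ratio in each split.
train_X_d, val_X_d, train_y_d, val_y_d = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

print(len(val_y_d), val_y_d.mean(), train_y_d.mean())
```

Stratifying does not fix the imbalance itself; it only ensures the validation set reflects it consistently.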

### 2.2d Creating an evaluate_model() function:
All three models reuse the same code, so a function keeps the notebook cleaner and less repetitive. The function takes five arguments (model, train_X, val_X, train_y, val_y), fits the model on the training data, predicts on the validation data, prints the accuracy and classification report, and returns the predictions.

## **2.3 Model Implementation**
### 2.3a Logistic Regression, Decision Tree, Random Forests:
With evaluate_model() in place, implementing each model only requires instantiating it from scikit-learn and passing that instance to the function as its first argument.

## **2.4 Results**
### 2.4a Logistic Regression:
Accuracy: 0.764

Positive class: True (Mature)

True Positives: 1

True Negatives: 190

False Positives: 2

False Negatives: 57

Precision (False/True): 0.77/0.33

Recall  (False/True): 0.99/0.02

ROC AUC: 0.593
___
Achieved 76.4% accuracy with an ROC AUC of 0.593. Recovers nearly all False cases (recall 0.99) but almost none of the True cases (recall 0.02).

### 2.4b Decision Tree:
Accuracy: 0.684

Positive class: True (Mature)

True Positives: 18

True Negatives: 153

False Positives: 39

False Negatives: 40

Precision (False/True): 0.79/0.32

Recall  (False/True): 0.80/0.31

ROC AUC: 0.554
___
Achieved 68.4% accuracy with an ROC AUC of 0.554. Performs moderately on False cases (recall 0.80) but still misses most True cases (recall 0.31).

### 2.4c Random Forests:
Accuracy: 0.764

Positive class: True (Mature)

True Positives: 6

True Negatives: 185

False Positives: 7

False Negatives: 52

Precision (False/True): 0.78/0.46

Recall  (False/True): 0.96/0.10

ROC AUC: 0.600
___
Achieved 76.4% accuracy with an ROC AUC of 0.600. Correctly identifies most False cases (recall 0.96) but detects only a few True cases (recall 0.10).
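
The accuracy, precision, and recall figures in this section follow directly from the four confusion-matrix counts. As a sketch, taking True (Mature) as the positive class, the counts implied by the classification reports (support of 192 False and 58 True) reproduce the reported values. The helper name `summarize` is hypothetical:

```python
def summarize(tp, fp, fn, tn):
    """Return accuracy, precision, and recall for the positive class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Counts with True (Mature) as the positive class, derived from the reports.
counts = {
    "Logistic Regression": dict(tp=1, fp=2, fn=57, tn=190),
    "Decision Tree": dict(tp=18, fp=39, fn=40, tn=153),
    "Random Forest": dict(tp=6, fp=7, fn=52, tn=185),
}

results = {name: summarize(**c) for name, c in counts.items()}
for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.3f} "
          f"precision={r['precision']:.2f} recall={r['recall']:.2f}")
```

The precision and recall computed here correspond to the "True" row of each classification report.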

## **2.5 Interpretation**
### 2.5a Logistic Regression, Decision Tree, Random Forests:
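All three models separate the two classes poorly. With ROC AUC scores between 0.554 and 0.600, they are only slightly better than random guessing (0.5). The class imbalance in the training data is a likely cause, although the features may also carry little signal about the 'Mature' flag. The Logistic Regression and Random Forest models share the same accuracy (0.764) largely because both predict False for almost every row; always predicting False would already score 192/250 = 0.768 on this validation set. The single Decision Tree has the lowest accuracy (0.684) but the highest recall on True cases (0.31).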

### 2.5b Overall:
Unfortunately, the data was skewed: the 'Mature' column has 770 False rows and only 230 True rows, which biases the models towards predicting False. However, 'Mature' was still the better target for the classification task than 'Partnered', which has just 22 False rows against 978 True rows.
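
A quick way to put the accuracies in context is a majority-class baseline, i.e. the accuracy of always predicting False. The sketch below uses the validation support from the classification reports (192 False, 58 True); the variable names are illustrative only:

```python
# Validation-set class counts, taken from the classification reports.
n_false, n_true = 192, 58

# Accuracy of a model that always predicts the majority class (False).
baseline_accuracy = n_false / (n_false + n_true)

# Reported validation accuracies of the three models.
model_accuracy = {
    "Logistic Regression": 0.764,
    "Decision Tree": 0.684,
    "Random Forest": 0.764,
}

print(f"Majority-class baseline: {baseline_accuracy:.3f}")
for name, acc in model_accuracy.items():
    print(f"{name}: {acc:.3f} ({acc - baseline_accuracy:+.3f} vs baseline)")
```

On these numbers, none of the three models beats the baseline on accuracy, which is why recall on the True class and ROC AUC are the more informative metrics here.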

## **2.6 Critical Reflection**
My time spent on the Classification Task was more streamlined than when I was working on the Regression Task since it was building on my previous experience, and my chosen dataset was easier to use than the one I used in the first experiment. With the Twitch Streamer dataset, I was only able to apply the binary classification method, so the task was more straightforward than expected. I was also able to make use of code references and documentation more clearly than with the Regression task, where I had to progress by repeated failures and testing.

In this second experiment, all I had to do was figure out what data the model was going to be trained on. I only had two choices, but I had to select the one that was more appropriate for training, which was the column that was more balanced in terms of classes. Since model performance is dependent on balanced targets, I had to check the True/False value counts for the two columns. The ‘Mature’ column was the more ‘balanced’ out of the two, having a 77/23 split over the ‘Partnered’ column’s 97.8/2.2 split. I then went ahead and let the models be trained on the ‘Mature’ data, and then visualized the results using Plotly.

I don't think I'm as interested in learning more about classification, despite having read that it can be used in medical diagnosis scenarios. My hesitance could partly be because the dataset I used for the task wasn't as challenging for me as the Violent Crime Rate data I worked with to predict future trends. I also can't think of any other projects I could try in order to keep learning about classification tasks, so my interest leans more towards regression tasks, where I can compare and contrast numerical values easily.

Despite my preference, both tasks still play equally important roles in machine learning. Classification tasks are used for categorical predictions, while regression models are great at estimating and predicting continuous outcomes. Both techniques complement each other, particularly in real-world scenarios, and can be found together in complex systems such as the previously mentioned medical space. Just as a classification model could identify a disease, a regression model could predict its severity.

Overall, this final experiment helped in developing my understanding of machine learning fundamentals. Even though I feel more aligned with regression tasks, my experience with classification tasks made me appreciate both of their uses in the field of predictive models. 