# **<span style="color: red;">CLASSIFICATION TASK</span>**

# **1. Top Streamers on Twitch**
---
Dataset: https://www.kaggle.com/datasets/aayushmishra1512/twitchdata

## **1.1 Importing Libraries**
---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, roc_auc_score, precision_recall_curve, average_precision_score

## **1.2 Printing CSV file**
---

In [2]:
twitch_data = pd.read_csv("/kaggle/input/twitchdata/twitchdata-update.csv")
print(twitch_data)
twitch_data.columns

              Channel  Watch time(Minutes)  Stream time(minutes)  \
0               xQcOW           6196161750                215250   
1            summit1g           6091677300                211845   
2              Gaules           5644590915                515280   
3            ESL_CSGO           3970318140                517740   
4                Tfue           3671000070                123660   
..                ...                  ...                   ...   
995         LITkillah            122524635                 13560   
996  빅헤드 (bighead033)            122523705                153000   
997    마스카 (newmasca)            122452320                217410   
998     AndyMilonakis            122311065                104745   
999              Remx            122192850                 99180   

     Peak viewers  Average viewers  Followers  Followers gained  Views gained  \
0          222720            27716    3246298           1734810      93036735   
1          310998    

Index(['Channel', 'Watch time(Minutes)', 'Stream time(minutes)',
       'Peak viewers', 'Average viewers', 'Followers', 'Followers gained',
       'Views gained', 'Partnered', 'Mature', 'Language'],
      dtype='object')

In [3]:
twitch_data.head()

| | Channel | Watch time(Minutes) | Stream time(minutes) | Peak viewers | Average viewers | Followers | Followers gained | Views gained | Partnered | Mature | Language |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xQcOW | 6196161750 | 215250 | 222720 | 27716 | 3246298 | 1734810 | 93036735 | True | False | English |
| 1 | summit1g | 6091677300 | 211845 | 310998 | 25610 | 5310163 | 1370184 | 89705964 | True | False | English |
| 2 | Gaules | 5644590915 | 515280 | 387315 | 10976 | 1767635 | 1023779 | 102611607 | True | True | Portuguese |
| 3 | ESL_CSGO | 3970318140 | 517740 | 300575 | 7714 | 3944850 | 703986 | 106546942 | True | False | English |
| 4 | Tfue | 3671000070 | 123660 | 285644 | 29602 | 8938903 | 2068424 | 78998587 | True | False | English |


## **1.3 Cleaning**
---
The dataset loaded and printed without errors, but that alone does not rule out missing values. To check, the Pandas call twitch_data.isna().sum() counts the missing entries in each column. If every count is 0, the dataset can be used directly, with no rows dropped or values imputed.

In [4]:
twitch_data.isna().sum()

Channel                 0
Watch time(Minutes)     0
Stream time(minutes)    0
Peak viewers            0
Average viewers         0
Followers               0
Followers gained        0
Views gained            0
Partnered               0
Mature                  0
Language                0
dtype: int64
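
The same check can be written as an explicit guard, so that a later change to the data does not slip through unnoticed. The sketch below is only an illustration: it runs on a small made-up frame (`toy`) rather than the Twitch file, and the helper name `assert_no_missing` is hypothetical.

```python
import numpy as np
import pandas as pd


def assert_no_missing(df: pd.DataFrame) -> None:
    """Raise a ValueError naming every column that contains missing values."""
    missing = df.isna().sum()
    bad = missing[missing > 0]
    if not bad.empty:
        raise ValueError(f"Missing values found: {bad.to_dict()}")


# Toy data for illustration only (not taken from the Twitch dataset).
toy = pd.DataFrame({
    "Followers": [10, 20, 30],
    "Language": ["English", "Korean", "English"],
})
assert_no_missing(toy)  # complete data: passes silently

# Same frame with one value removed: the guard should now raise.
toy_with_gap = toy.assign(Followers=[10, np.nan, 30])
try:
    assert_no_missing(toy_with_gap)
    gap_detected = False
except ValueError:
    gap_detected = True
```

In the notebook, the equivalent call would be `assert_no_missing(twitch_data)` placed right after loading the file.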

## **1.4 Classification Task & Visualization**
---

### **1.40 Variables & Function**
---

In [5]:
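# --- Splitting Data ---
# Target: whether the channel is flagged as 'Mature'.
# Features: every other column except the channel name and 'Partnered'.
X = twitch_data.drop(['Partnered', 'Channel', 'Mature'], axis=1)
# One-hot encode 'Language'; drop_first omits one category as the reference level.
X = pd.get_dummies(X, columns=['Language'], drop_first=True)
y = twitch_data['Mature']

# Default split: 75% training, 25% validation.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)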

# --- Function ---
def evaluate_model(model, train_X, val_X, train_y, val_y):
    model.fit(train_X, train_y)
    val_predictions = model.predict(val_X)
    
    accuracy = accuracy_score(val_y, val_predictions)
    cm = confusion_matrix(val_y, val_predictions)
    report = classification_report(val_y, val_predictions)
    
    print("Accuracy of the model:\n", accuracy)
    print("\nConfusion Matrix:\n", cm)
    print("\nReport:\n", report)
    
    return val_predictions
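
To make the encoding step above more concrete, here is a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does to a categorical column. The three-row frame `toy` is made up for illustration and is not taken from the Twitch data.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "Average viewers": [100, 250, 80],
    "Language": ["English", "Korean", "Portuguese"],
})

# Each remaining language becomes its own indicator column. With drop_first=True
# the first category in sorted order ("English") is omitted, so a row with all
# language indicators set to zero/False stands for English.
encoded = pd.get_dummies(toy, columns=["Language"], drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
# ['Average viewers', 'Language_Korean', 'Language_Portuguese']
```

Dropping one category avoids a redundant column, since the omitted language is implied whenever all other indicators are off.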

### **1.41 Logistic Regression Model**
---

In [6]:
lr_model = LogisticRegression(max_iter=10000)
evaluate_model(lr_model, train_X, val_X, train_y, val_y)

# --- ROC Curve ---
y_proba = lr_model.predict_proba(val_X)
fpr, tpr, thresholds = roc_curve(val_y, y_proba[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines+markers', name='TPR'))
fig.update_layout(
    title="(Logistic Regression) ROC Curve",
    xaxis_title="FPR",
    yaxis_title="TPR"
)
fig.show()

# --- Precision-Recall Curve ---
y_proba = lr_model.predict_proba(val_X)
precision, recall, thresholds = precision_recall_curve(val_y, y_proba[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x=recall, y=precision, mode='lines+markers', name='Precision-Recall'))
fig.update_layout(
    title="(Logistic Regression) Precision-Recall Curve",
    xaxis_title="Recall",
    yaxis_title="Precision"
)
fig.show()

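# --- Confusion Matrix ---
val_predictions = lr_model.predict(val_X)
# labels=[True, False] puts the positive class (Mature == True) first, so the
# matrix layout matches cm_text below: rows are actual, columns are predicted.
cm = confusion_matrix(val_y, val_predictions, labels=[True, False])

# Note: x, y and cm_text are reused by the later cells; after this point
# `y` holds the axis labels, not the target column.
x = ['Positive', 'Negative']
y = ['Positive', 'Negative']
cm_text = [['TP', 'FN'], ['FP', 'TN']]

fig = px.imshow(cm, 
                labels=dict(x='Predicted Values', y='Actual Values'), 
                x=x, y=y
               )
fig.update_xaxes(side="top")
fig.update_traces(text=cm_text, texttemplate="%{text}")
fig.update_layout(
    title="(Logistic Regression) Confusion Matrix"
)
fig.show()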

Accuracy of the model:
 0.764

Confusion Matrix:
 [[190   2]
 [ 57   1]]

Report:
               precision    recall  f1-score   support

       False       0.77      0.99      0.87       192
        True       0.33      0.02      0.03        58

    accuracy                           0.76       250
   macro avg       0.55      0.50      0.45       250
weighted avg       0.67      0.76      0.67       250
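
The import cell brings in `roc_auc_score` and `average_precision_score`, but neither is called. As a hedged sketch of how they could summarize the two curves in a single number each, the example below uses synthetic data from `make_classification` (an assumption, not the Twitch features); the names `demo_model`, `va_X`, and `va_y` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the real features (illustration only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.77, 0.23], random_state=1
)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

demo_model = LogisticRegression(max_iter=1000).fit(tr_X, tr_y)

# Both metrics take the predicted probability of the positive class,
# the same column that roc_curve and precision_recall_curve receive.
pos_scores = demo_model.predict_proba(va_X)[:, 1]
auc = roc_auc_score(va_y, pos_scores)
ap = average_precision_score(va_y, pos_scores)

print(f"ROC AUC: {auc:.3f}")
print(f"Average precision: {ap:.3f}")
```

In the notebook the same two calls would take `val_y` and `y_proba[:, 1]`. A ROC AUC near 0.5 would indicate that the model ranks validation samples little better than chance.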



### **1.42 Decision Tree Model**
---

In [7]:
dt_model = DecisionTreeClassifier(random_state = 1)
evaluate_model(dt_model, train_X, val_X, train_y, val_y)

# --- ROC Curve ---
y_proba = dt_model.predict_proba(val_X)
fpr, tpr, thresholds = roc_curve(val_y, y_proba[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines+markers', name='TPR'))
fig.update_layout(
    title="(Decision Tree) ROC Curve",
    xaxis_title="FPR",
    yaxis_title="TPR"
)
fig.show()

# --- Precision-Recall Curve ---
y_proba = dt_model.predict_proba(val_X)
precision, recall, thresholds = precision_recall_curve(val_y, y_proba[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x=recall, y=precision, mode='lines+markers', name='Precision-Recall'))
fig.update_layout(
    title="(Decision Tree) Precision-Recall Curve",
    xaxis_title="Recall",
    yaxis_title="Precision"
)
fig.show()

# --- Confusion Matrix ---
val_predictions = dt_model.predict(val_X)
# Positive class first so the layout matches cm_text (rows: actual, columns: predicted).
cm = confusion_matrix(val_y, val_predictions, labels=[True, False])

fig = px.imshow(cm, 
                labels=dict(x='Predicted Values', y='Actual Values'), 
                x=x, y=y
               )
fig.update_xaxes(side="top")
fig.update_traces(text=cm_text, texttemplate="%{text}")
fig.update_layout(
    title="(Decision Tree) Confusion Matrix"
)
fig.show()

Accuracy of the model:
 0.684

Confusion Matrix:
 [[153  39]
 [ 40  18]]

Report:
               precision    recall  f1-score   support

       False       0.79      0.80      0.79       192
        True       0.32      0.31      0.31        58

    accuracy                           0.68       250
   macro avg       0.55      0.55      0.55       250
weighted avg       0.68      0.68      0.68       250
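
The figures in the report follow directly from the printed confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall, and F1 for the True class from the decision tree's matrix, using scikit-learn's default layout (rows are actual classes, columns are predicted classes, ordered False then True).

```python
import numpy as np

# Decision tree confusion matrix as printed above.
cm = np.array([[153, 39],
               [40, 18]])

# With labels ordered [False, True], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # (18 + 153) / 250
precision = tp / (tp + fp)          # 18 / 57
recall = tp / (tp + fn)             # 18 / 58
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.684 precision=0.32 recall=0.31 f1=0.31
```

These values match the accuracy line and the True row of the report.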



### **1.43 Random Forest Model**
---

In [8]:
rf_model = RandomForestClassifier(random_state = 1)
evaluate_model(rf_model, train_X, val_X, train_y, val_y)

# --- ROC Curve ---
y_proba = rf_model.predict_proba(val_X)
fpr, tpr, thresholds = roc_curve(val_y, y_proba[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines+markers', name='TPR'))
fig.update_layout(
    title="(Random Forest) ROC Curve",
    xaxis_title="FPR",
    yaxis_title="TPR"
)
fig.show()

# --- Precision-Recall Curve ---
y_proba = rf_model.predict_proba(val_X)
precision, recall, thresholds = precision_recall_curve(val_y, y_proba[:, 1])

fig = go.Figure()
fig.add_trace(go.Scatter(x=recall, y=precision, mode='lines+markers', name='Precision-Recall'))
fig.update_layout(
    title="(Random Forest) Precision-Recall Curve",
    xaxis_title="Recall",
    yaxis_title="Precision"
)
fig.show()

# --- Confusion Matrix ---
val_predictions = rf_model.predict(val_X)
# Positive class first so the layout matches cm_text (rows: actual, columns: predicted).
cm = confusion_matrix(val_y, val_predictions, labels=[True, False])

fig = px.imshow(cm, 
                labels=dict(x='Predicted Values', y='Actual Values'), 
                x=x, y=y
               )
fig.update_xaxes(side="top")
fig.update_traces(text=cm_text, texttemplate="%{text}")
fig.update_layout(
    title="(Random Forest) Confusion Matrix"
)
fig.show()

Accuracy of the model:
 0.764

Confusion Matrix:
 [[185   7]
 [ 52   6]]

Report:
               precision    recall  f1-score   support

       False       0.78      0.96      0.86       192
        True       0.46      0.10      0.17        58

    accuracy                           0.76       250
   macro avg       0.62      0.53      0.52       250
weighted avg       0.71      0.76      0.70       250
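
Because the validation set is imbalanced (192 False against 58 True, per the support column), accuracy is easier to read next to a majority-class baseline. The sketch below is an illustration under stated assumptions: it rebuilds labels with those class counts and, for brevity, fits scikit-learn's `DummyClassifier` on the same labels it scores; in the notebook the baseline would be fitted on `train_X`, `train_y` and scored on `val_X`, `val_y`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels with the same class counts as the validation set (192 False, 58 True).
val_y_demo = np.array([False] * 192 + [True] * 58)
# DummyClassifier ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((len(val_y_demo), 1))

# Always predicts the most frequent class (False here).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, val_y_demo)

baseline_accuracy = accuracy_score(val_y_demo, baseline.predict(X_placeholder))
print(round(baseline_accuracy, 3))  # 0.768, i.e. 192 / 250
```

Always predicting False would therefore score 0.768 on this split, slightly above the 0.764, 0.684, and 0.764 reported for the three models, which suggests precision and recall on the True class are the more informative figures here.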



# **2. Deliverables**
---
Code references:
1. https://www.kaggle.com/code/leodaniel/roc-curve-simply-explained (Plotly instead of MatPlotLib)
2. https://www.kaggle.com/code/dansbecker/classification#Confusion-Matrix
3. https://plotly.com/python/heatmaps/

## **2.1 Dataset Description**
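This dataset contains data on the top 1000 Twitch streamers. The period it covers is not confirmed here (possibly 2019).

The CSV file contains 1000 rows and 11 columns: Channel, Watch time(Minutes), Stream time(minutes), Peak viewers, Average viewers, Followers, Followers gained, Views gained, Partnered, Mature, and Language.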

## **2.2 Preprocessing**


## **2.3 Model Implementation**


## **2.4 Results**


## **2.5 Interpretation**


## **2.6 Critical Reflection**
