In [11]:
import numpy as np 
import pandas as pd

In [12]:
df = pd.read_csv(r"A:\Code\PY\DSc\datasets\05-multiclass-classfication\Cardiotocographic.csv")
df

Unnamed: 0,LB,AC,FM,UC,DL,DS,DP,ASTV,MSTV,ALTV,...,Min,Max,Nmax,Nzeros,Mode,Mean,Median,Variance,Tendency,NSP
0,120,0.000000,0.000000,0.000000,0.000000,0.0,0.0,73,0.5,43,...,62,126,2,0,120,137,121,73,1,2
1,132,0.006380,0.000000,0.006380,0.003190,0.0,0.0,17,2.1,0,...,68,198,6,1,141,136,140,12,0,1
2,133,0.003322,0.000000,0.008306,0.003322,0.0,0.0,16,2.1,0,...,68,198,5,1,141,135,138,13,0,1
3,134,0.002561,0.000000,0.007682,0.002561,0.0,0.0,16,2.4,0,...,53,170,11,0,137,134,137,13,1,1
4,132,0.006515,0.000000,0.008143,0.000000,0.0,0.0,16,2.4,0,...,53,170,9,0,137,136,138,11,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2121,140,0.000000,0.000000,0.007426,0.000000,0.0,0.0,79,0.2,25,...,137,177,4,0,153,150,152,2,0,2
2122,140,0.000775,0.000000,0.006971,0.000000,0.0,0.0,78,0.4,22,...,103,169,6,0,152,148,151,3,1,2
2123,140,0.000980,0.000000,0.006863,0.000000,0.0,0.0,79,0.4,20,...,103,170,5,0,153,148,152,4,1,2
2124,140,0.000679,0.000000,0.006110,0.000000,0.0,0.0,78,0.4,27,...,103,169,6,0,152,147,151,4,1,2


### Preprocessing + Basic EDA

In [13]:
df['NSP'].value_counts()

NSP
1    1655
2     295
3     176
Name: count, dtype: int64
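The counts above show a strong class imbalance, which matters when reading the accuracy scores later in the notebook. A minimal sketch of the class proportions and of the accuracy a constant majority-class predictor would reach on the full data set. It uses only the three counts printed above, hard-coded here as an assumption rather than read from the CSV.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 1655, 2: 295, 3: 176}, name="count")

# Share of each class in the full data set.
proportions = counts / counts.sum()

# Accuracy of always predicting the most frequent class.
majority_baseline = proportions.max()

print(proportions.round(3))
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Always predicting class 1 would be right about 78% of the time, so accuracy scores are best read against that baseline rather than against 0.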

In [14]:
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

In [15]:
X.shape, y.shape

((2126, 21), (2126,))

In [50]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

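# Logistic regression is a parametric model fitted with a regularized solver,
# so standardize the features. Fit the scaler on the training split only.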
scaler = StandardScaler()

X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns)


In [51]:
lr1 = OneVsOneClassifier(LogisticRegression())
lr1.fit(X_train_scaled, y_train)
y_pred1 = lr1.predict(X_test_scaled)

print(accuracy_score(y_test, y_pred1))

0.8826291079812206


In [52]:
lr2 = OneVsRestClassifier(LogisticRegression())
lr2.fit(X_train_scaled, y_train)
y_pred2 = lr2.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred2))

0.8779342723004695


OutputCodeClassifier is scikit-learn's implementation of the output-code strategy for multiclass classification. Each class is assigned a binary code word, and one binary classifier is trained per bit of the code, so the multiclass problem is split into several binary classification tasks.

Output codes can be viewed as a generalization of the One-vs-Rest (OvR) and One-vs-One (OvO) strategies: OvR corresponds to a code book in which each class owns exactly one bit, and OvO to one in which each bit separates two classes and ignores the rest. At prediction time the outputs of all binary classifiers are combined and compared against the code book to make the final decision.
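For context on the two strategies that output codes are compared against: OvR fits one binary model per class, while OvO fits one per pair of classes. A small sketch that counts the fitted binary models. The synthetic 4-class data from `make_classification` and the names `X_demo`, `y_demo`, `ovr`, `ovo` are assumptions for illustration, not the cardiotocography data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class data, used only to count the fitted binary models.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

n_classes = len(ovr.classes_)
print("classes:", n_classes)
print("OvR classifiers:", len(ovr.estimators_))  # one per class
print("OvO classifiers:", len(ovo.estimators_))  # one per pair of classes
```

With n classes, OvR trains n models and OvO trains n(n-1)/2, so the gap grows quickly as the number of classes increases.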

### How Does It Work?
- Output Code Matrix: Each class is represented by a binary code, and the codes of all classes form the rows of a code book. For example, 5 classes need at least a 3-bit code to get distinct codes (because ⌈log2 5⌉=3). scikit-learn does not use this minimal code; it draws a random code book with int(n_classes * code_size) bits.

- Multiple Binary Classifiers: One binary classifier is trained per bit. The classifier for a given bit predicts whether that bit of a sample's class code is 0 or 1.

- Final Prediction: The outputs of all classifiers form a predicted code, and the predicted class is the one whose code is closest to it (scikit-learn measures closeness with Euclidean distance).
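The three steps above can be sketched by hand in NumPy. This is an illustration only: the 3-bit code book, the class labels `A` to `E`, and the predicted bits are hypothetical, and the sketch decodes with Hamming distance on hard 0/1 bits, whereas scikit-learn uses a random code book and compares classifier outputs to it.

```python
import numpy as np

# Hypothetical 3-bit code book for 5 classes (one row per class).
code_book = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
])
classes = np.array(["A", "B", "C", "D", "E"])

# Bits predicted by the three binary classifiers for two samples.
predicted_bits = np.array([
    [0, 0, 1],  # exact match with the code of class B
    [1, 1, 1],  # matches no code exactly
])

# Hamming distance from each sample's bits to every class code.
distances = (predicted_bits[:, None, :] != code_book[None, :, :]).sum(axis=2)

# The predicted class is the one with the nearest code.
decoded = classes[distances.argmin(axis=1)]
print(distances)
print(decoded)
```

The second sample shows why decoding picks the closest code rather than requiring an exact match: the bit pattern belongs to no class, so the nearest row of the code book decides.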

In [54]:
from sklearn.multiclass import OutputCodeClassifier

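occ = OutputCodeClassifier(LogisticRegression(), code_size=3)
# code_size: multiplier on the number of classes, not a bit count. scikit-learn
# trains int(n_classes * code_size) binary classifiers, here 3 * 3 = 9.

occ.fit(X_train_scaled, y_train)
# Predict on the scaled test set so the features match what the model was
# trained on. The accuracy printed below came from a run on unscaled X_test.
y_pred_occ = occ.predict(X_test_scaled)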

print(accuracy_score(y_test,y_pred_occ))

0.19014084507042253
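In scikit-learn, `code_size` is a multiplier on the number of classes rather than an absolute bit count, which can be checked by inspecting a fitted model. A minimal sketch, assuming synthetic 3-class data as a stand-in for the scaled training set; `X_demo`, `y_demo`, `occ_demo`, and the `random_state` values are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

# Synthetic 3-class stand-in for the scaled training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# random_state fixes the randomly drawn code book so reruns are comparable.
occ_demo = OutputCodeClassifier(
    LogisticRegression(max_iter=1000), code_size=3, random_state=0
).fit(X_demo, y_demo)

print(occ_demo.code_book_.shape)  # (n_classes, int(n_classes * code_size))
print(len(occ_demo.estimators_))  # one binary classifier per code bit
print(occ_demo.code_book_)        # one row of code bits per class
```

Because the code book is drawn at random, setting `random_state` on `OutputCodeClassifier` is a reasonable way to make accuracy figures reproducible between runs.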
