# Semi-Supervised Learning

In this exercise, we use a breast cancer dataset to explore the concepts of semi-supervised learning. In particular, we will perform the following tasks: 

1. Create a dataset suitable for semi-supervised learning
2. Create a baseline and report accuracy
3. Solve the classification task using a semi-supervised method and report accuracy
4. Create a classification model that utilizes the predicted output from the semi-supervised learning

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from numpy import concatenate
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
from seaborn import catplot

### Load the data

data location: `/dsa/data/DSA-8410/Wisconsin-Breast-Cancer-Cytology/BreastCancer.csv`

In [2]:
data = pd.read_csv("/dsa/data/DSA-8410/Wisconsin-Breast-Cancer-Cytology/BreastCancer.csv")

In [3]:
data.shape

(683, 4)

In [4]:
data.head()

|   | id | thickness | size | class |
|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 0 |
| 1 | 1002945 | 5 | 4 | 0 |
| 2 | 1015425 | 3 | 1 | 0 |
| 3 | 1016277 | 6 | 8 | 0 |
| 4 | 1017023 | 4 | 1 | 0 |


### Remove the 'id' column

In [5]:
data = data.drop(["id"], axis=1)
data.head()

|   | thickness | size | class |
|---|---|---|---|
| 0 | 5 | 1 | 0 |
| 1 | 5 | 4 | 0 |
| 2 | 3 | 1 | 0 |
| 3 | 6 | 8 | 0 |
| 4 | 4 | 1 | 0 |


### Extract the first two features and class variable

In [6]:
# the first two columns are the features: thickness and size
X = data.iloc[:, 0:2]
y = data.loc[:,"class"]

### T1. Create datasets for semi-supervised learning

1. Create train and test datasets with a stratified 50-50 split
2. Split the training set into labeled and unlabeled subsets with a stratified 50-50 split

In [7]:
# divide the dataset into training set and test set, in 50-50 ratio with stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1, stratify=y)

# divide the training set into labeled and unlabeled subsets, in a 50-50 ratio with stratified sampling
X_train_lab, X_train_unlab, y_train_lab, y_train_unlab = train_test_split(
    X_train, y_train, test_size=0.5, random_state=1, stratify=y_train)

# check the amount of data
print(f"Labeled Train Set: {X_train_lab.shape}, {y_train_lab.shape}")
print(f"Unlabeled Train Set: {X_train_unlab.shape}, {y_train_unlab.shape}")
print(f"Test Set: {X_test.shape}, {y_test.shape}")

Labeled Train Set: (170, 2), (170,)
Unlabeled Train Set: (171, 2), (171,)
Test Set: (342, 2), (342,)
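
Stratification is meant to keep the class balance roughly equal across the labeled, unlabeled, and test sets. The sketch below checks this on a synthetic stand-in dataset; `make_classification`, the `weights` setting, and the names `X_demo`, `y_demo`, and `positive_fraction` are assumptions made so that the example runs on its own, not part of the exercise data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y (assumption: 683 rows,
# 2 features, imbalanced binary classes).
X_demo, y_demo = make_classification(
    n_samples=683, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.65, 0.35], random_state=1)

# Same two-stage stratified split as in the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1, stratify=y_demo)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_tr, y_tr, test_size=0.5, random_state=1, stratify=y_tr)


def positive_fraction(labels):
    """Share of samples that belong to class 1."""
    return float(np.mean(labels))


# Collect the class-1 share of every subset next to the full dataset.
fractions = {
    "full": positive_fraction(y_demo),
    "labeled": positive_fraction(y_lab),
    "unlabeled": positive_fraction(y_unlab),
    "test": positive_fraction(y_te),
}
for name, frac in fractions.items():
    print(f"{name:>9}: class-1 share = {frac:.3f}")
```

With the notebook's own variables, the same check would apply `positive_fraction` to `y`, `y_train_lab`, `y_train_unlab`, and `y_test`.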


### T2. Report the sizes of the labeled, unlabeled, and test sets

In [8]:
print(f"Labeled Train Set Size: X={X_train_lab.shape}, y={y_train_lab.shape}")
print(f"Unlabeled Train Set Size: X={X_train_unlab.shape}, y={y_train_unlab.shape}")
print(f"Test Set Size: X={X_test.shape}, y={y_test.shape}")

Labeled Train Set Size: X=(170, 2), y=(170,)
Unlabeled Train Set Size: X=(171, 2), y=(171,)
Test Set Size: X=(342, 2), y=(342,)


### T3. Baseline Performance 

We can establish a baseline by fitting a classifier on the labeled training data only. This matters because we expect a semi-supervised learning algorithm to outperform a supervised algorithm that is fit on the labeled data alone. If it does not, we need to rethink the semi-supervised model and/or the data we are using.

### T4. Define and fit the random forest model as a baseline

In [12]:
# define a random forest classifier
baseline_model = RandomForestClassifier(random_state=1)

# train the model on the labeled training data only
baseline_model.fit(X_train_lab, y_train_lab)

# evaluate model performance on test set
y_pred = baseline_model.predict(X_test)
baseline_accuracy = accuracy_score(y_test, y_pred)

# accuracy of the baseline model
print(f"Baseline Model Accuracy: {baseline_accuracy:.2f}")

Baseline Model Accuracy: 0.94


### T5. Report baseline prediction accuracy

Random Forest, trained only on the labeled portion of the training data, reached an accuracy of 0.94 on the test set. This is the baseline that the semi-supervised model is expected to match or beat.

### T6. Fit a label propagation model 


In [17]:
# define the label propagation model
label_prop_model = LabelPropagation()

# prepare the training labels; unlabeled samples are marked with -1
y_train_mixed = np.copy(y_train)

# keep labels for the first len(y_train_lab) rows of X_train and mask the rest with -1
# (note: these rows are selected by position, so they are not the same rows as X_train_lab)
y_train_mixed[len(y_train_lab):] = -1

# fit the model on the mixed training data, including the unlabeled samples
label_prop_model.fit(X_train, y_train_mixed)

# evaluate model performance on the test set
y_test_pred = label_prop_model.predict(X_test)
label_prop_accuracy = accuracy_score(y_test, y_test_pred)

# print accuracy
print(f"Label Propagation Model Accuracy: {label_prop_accuracy:.2f}")


Label Propagation Model Accuracy: 0.93
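
The cell above masks labels by row position in `X_train`, so the rows that keep their labels are not the rows in `X_train_lab` that the baseline was trained on. One way to make the two comparable is to stack the labeled and unlabeled subsets explicitly. The following is a minimal sketch of that alternative on synthetic stand-in data; `X_demo`, `y_demo`, `X_mixed`, and `y_mixed` are illustrative names, and the numbers it prints are unrelated to the results reported in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation

# Synthetic stand-in for the notebook's X and y (assumption).
X_demo, y_demo = make_classification(
    n_samples=683, n_features=2, n_informative=2, n_redundant=0,
    random_state=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1, stratify=y_demo)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_tr, y_tr, test_size=0.5, random_state=1, stratify=y_tr)

# Stack labeled rows first and unlabeled rows second, so that features
# and labels stay aligned row by row.
X_mixed = np.concatenate((X_lab, X_unlab))
# Keep the true labels for the labeled rows; mark the rest with -1.
y_mixed = np.concatenate((y_lab, np.full(len(y_unlab), -1)))

model = LabelPropagation()
model.fit(X_mixed, y_mixed)

acc = accuracy_score(y_te, model.predict(X_te))
print(f"Label propagation accuracy on the synthetic test set: {acc:.2f}")
```

With the notebook's variables, the stacked inputs would be built from `X_train_lab`, `X_train_unlab`, and `y_train_lab`. Doing so would change the reported accuracy, so the figures above apply only to the cell as originally run.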


### T7. Report prediction accuracy by label propagation method

Label propagation reached an accuracy of 0.93 on the test set. This is still high, but slightly below the 0.94 baseline.

### T8. Fit a supervised model using the estimated labels for the training dataset

In [18]:
# obtain the estimated labels for the training set from the label propagation model
y_train_estimated = label_prop_model.transduction_

# train a supervised model (here, a random forest) on the estimated labels
supervised_model = RandomForestClassifier(random_state=1)
supervised_model.fit(X_train, y_train_estimated)

# evaluate the supervised model on the test set
y_test_pred_supervised = supervised_model.predict(X_test)
supervised_model_accuracy = accuracy_score(y_test, y_test_pred_supervised)

# print the accuracy
print(f"Supervised Model Accuracy (using estimated labels): {supervised_model_accuracy:.2f}")

Supervised Model Accuracy (using estimated labels): 0.94
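
Because the true labels of the unlabeled rows were held back rather than lost, the quality of the estimated labels can also be measured directly before they are used to train a supervised model. The sketch below does this on synthetic stand-in data; `X_demo`, `y_demo`, `pseudo_unlab`, and `pseudo_acc` are illustrative names, and the printed value says nothing about the breast cancer data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation

# Synthetic stand-in for the notebook's X and y (assumption).
X_demo, y_demo = make_classification(
    n_samples=683, n_features=2, n_informative=2, n_redundant=0,
    random_state=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1, stratify=y_demo)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_tr, y_tr, test_size=0.5, random_state=1, stratify=y_tr)

# Labeled rows first, unlabeled rows (marked -1) second.
X_mixed = np.concatenate((X_lab, X_unlab))
y_mixed = np.concatenate((y_lab, np.full(len(y_unlab), -1)))

model = LabelPropagation()
model.fit(X_mixed, y_mixed)

# transduction_ holds one estimated label per training row; the slice
# below selects the rows whose labels were hidden from the model.
pseudo_unlab = model.transduction_[len(y_lab):]

# Compare the estimated labels with the held-back true labels.
pseudo_acc = accuracy_score(y_unlab, pseudo_unlab)
print(f"Accuracy of estimated labels on the unlabeled rows: {pseudo_acc:.2f}")
```

A low value here would suggest that a supervised model trained on the estimated labels inherits many label errors.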


### T9. Discuss your observations

# Save your notebook, then `File > Close and Halt`