# Semi-Supervised Learning

In this exercise, we use a breast cancer dataset to explore semi-supervised learning. We will perform the following tasks:

1. Create a dataset suitable for semi-supervised learning
2. Create a baseline and report accuracy
3. Solve the classification task using a semi-supervised method and report accuracy
4. Create a classification model that is trained on the labels estimated by the semi-supervised method

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from numpy import concatenate
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
from seaborn import catplot

### Load the data

data location: `/dsa/data/DSA-8410/Wisconsin-Breast-Cancer-Cytology/BreastCancer.csv`

In [None]:
data = pd.read_csv("/dsa/data/DSA-8410/Wisconsin-Breast-Cancer-Cytology/BreastCancer.csv")

In [None]:
data.shape

In [None]:
data.head()

### Remove the 'id' column

In [None]:
data = data.drop(columns=["id"])
data.head()

### Extract the first two features and class variable

In [None]:
X = data.iloc[:,0:2]
y = data.loc[:,"class"]

### T1. Create datasets for semi-supervised learning

1. Create train and test datasets with a stratified 50-50 split
2. Split the training set into labeled and unlabeled datasets with a stratified 50-50 split
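
The two splits above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not on the breast cancer data; the names `X_demo`, `y_demo`, `X_lab`, `X_unlab`, and the `random_state` value are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the exercise data (assumption: 2 features, binary class).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=1,
)

# Step 1: stratified 50-50 split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=1
)

# Step 2: stratified 50-50 split of the training set into labeled and unlabeled parts.
# y_unlab is kept only for later checks; a semi-supervised model must not see it.
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=1
)

print("labeled:", X_lab.shape, "unlabeled:", X_unlab.shape, "test:", X_test.shape)
```

Passing `stratify` keeps the class proportions roughly equal across the resulting subsets, which matters when the labeled set is small.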

### T2. Report the sizes of the labeled, unlabeled, and test sets

### T3. Baseline Performance 

We can establish a baseline by fitting a classifier on the labeled training data only. This matters because we expect a semi-supervised learning algorithm to outperform a supervised learning algorithm that is fit on the labeled data alone. If it does not, we need to rethink the semi-supervised model and/or the data we are using.
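
As a rough sketch of such a baseline, the snippet below fits a random forest on the labeled portion only and scores it on the held-out test set. It runs on synthetic data as a stand-in for the exercise data; the variable names and the `random_state` value are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the exercise data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=1
)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=1
)

# Baseline: a supervised model that sees only the labeled training rows.
baseline = RandomForestClassifier(random_state=1)
baseline.fit(X_lab, y_lab)

# Accuracy on the held-out test set is the number to beat.
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy: {baseline_acc:.3f}")
```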

### T4. Define and fit the random forest model as a baseline

### T5. Report baseline prediction accuracy

### T6. Fit a label propagation model 
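
One way to set this up, sketched on synthetic data rather than the exercise data: scikit-learn's `LabelPropagation` expects a single training array in which unlabeled rows carry the label `-1`. The variable names, the default kernel settings, and the `random_state` value below are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation

# Synthetic stand-in for the exercise data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=1
)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=1
)

# Stack labeled and unlabeled rows; mark the unlabeled ones with -1.
X_mixed = np.concatenate((X_lab, X_unlab))
y_mixed = np.concatenate((y_lab, np.full(len(X_unlab), -1)))

# Fit with default settings (RBF kernel); tuning is left out of this sketch.
propagator = LabelPropagation()
propagator.fit(X_mixed, y_mixed)

# Score the fitted model on the held-out test set.
lp_acc = accuracy_score(y_test, propagator.predict(X_test))
print(f"Label propagation accuracy: {lp_acc:.3f}")
```

After fitting, `propagator.transduction_` holds one estimated label per row of `X_mixed`, including the rows that were passed in as `-1`.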


### T7. Report prediction accuracy by label propagation method

### T8. Fit a supervised model using the estimated labels for the training dataset
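
A possible shape for this step, again on synthetic data as a stand-in: take the labels that label propagation estimated for the whole training set (`transduction_`) and use them as targets for an ordinary supervised model. Choosing a random forest here mirrors the baseline; the variable names and `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation

# Synthetic stand-in for the exercise data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=1
)
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train, random_state=1
)

# Semi-supervised step: unlabeled rows are marked with -1.
X_mixed = np.concatenate((X_lab, X_unlab))
y_mixed = np.concatenate((y_lab, np.full(len(X_unlab), -1)))
propagator = LabelPropagation().fit(X_mixed, y_mixed)

# Estimated labels for every training row (true labels plus propagated ones).
y_estimated = propagator.transduction_

# Supervised step: train on the full training set with the estimated labels.
final_model = RandomForestClassifier(random_state=1)
final_model.fit(X_mixed, y_estimated)

final_acc = accuracy_score(y_test, final_model.predict(X_test))
print(f"Accuracy with estimated labels: {final_acc:.3f}")
```

Comparing this accuracy with the baseline and with label propagation alone gives the material for the discussion in T9.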

### T9. Discuss your observations

# Save your notebook, then `File > Close and Halt`