
# Dataset

#### Breast Cancer Wisconsin (Diagnostic) dataset

Number of Instances: 569

Number of Attributes: 30 numeric, predictive attributes and the class

Attribute Information:

        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radius, field
        10 is Radius SE, field 20 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign
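
The field layout described above (10 base measurements, each reported as mean, standard error, and worst value) can be sketched in plain Python. The name strings follow the `mean radius`, `radius error`, `worst radius` pattern used by scikit-learn's copy of this dataset; the column names in the CSV loaded later in this notebook may differ, so treat them as an assumption.

```python
# Ten base measurements computed for each cell nucleus.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]

# Fields 0-9 hold the means, 10-19 the standard errors, 20-29 the "worst" values.
# The naming pattern is an assumption borrowed from scikit-learn's feature names.
feature_names = (
    [f"mean {name}" for name in BASE_FEATURES]
    + [f"{name} error" for name in BASE_FEATURES]
    + [f"worst {name}" for name in BASE_FEATURES]
)

# Look up the three fields that belong to the same base measurement.
for index in (0, 10, 20):
    print(index, feature_names[index])
```

Fields `i`, `i + 10`, and `i + 20` always describe the same base measurement, which is handy when selecting related columns by position.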

    
Class Distribution: 212 - Malignant, 357 - Benign
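
The class counts above give a useful reference point: a classifier that always predicts the majority class is already right for every benign sample. A small sketch of that arithmetic, using only the two counts quoted above:

```python
n_malignant = 212
n_benign = 357
n_total = n_malignant + n_benign

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(n_malignant, n_benign) / n_total

print(n_total)
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model trained on this dataset should be compared against this baseline (roughly 63%) rather than against 50%.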

Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

Donor: Nick Street

Date: November, 1995

This is a copy of the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

The separating plane was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
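
To get a feel for the size of such an exhaustive search, the sketch below counts the candidate configurations. It assumes the search ranges over every subset of 1 to 4 of the 30 features, each combined with 1, 2, or 3 separating planes; the description above does not spell out the exact enumeration, so this is only an illustration.

```python
from math import comb

N_FEATURES = 30  # number of predictive attributes in the dataset

# Number of feature subsets of each size from 1 to 4 (assumed search space).
subsets_per_size = {k: comb(N_FEATURES, k) for k in range(1, 5)}
total_subsets = sum(subsets_per_size.values())

# Each subset is paired with 1, 2, or 3 separating planes.
total_configs = total_subsets * 3

print(subsets_per_size)
print(total_subsets, total_configs)
```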

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

References

   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.

# Loading data

In [None]:
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/D012513A-Specialised-Bio-informatics-Machine-Learning/main/scikit-learn-example/breast_cancer_dataset.csv")
#data = pd.read_csv("breast_cancer_dataset.csv")

data

# The features

In [None]:
# Features: every column except the last one (the label).
X = data.iloc[:, :-1]
X

In [None]:
X.describe().transpose()

In [None]:
from sklearn.preprocessing import StandardScaler

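# Standardize each feature to zero mean and unit variance.
# Note: the scaler is fit on all rows, before any train-test split.
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)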

In [None]:
X.describe().transpose()
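
After scaling, `describe()` reports a standard deviation slightly above 1 rather than exactly 1. `StandardScaler` divides by the population standard deviation (ddof=0), while pandas reports the sample standard deviation (ddof=1); with 569 rows the ratio is sqrt(569/568), about 1.0009. The plain-Python sketch below shows the same effect on a small, made-up column.

```python
from statistics import fmean, pstdev, stdev

# Hypothetical feature column, chosen only for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = fmean(values)
sigma = pstdev(values)  # population std (ddof=0), the convention StandardScaler uses
z = [(v - mu) / sigma for v in values]

print("mean of z:          ", round(fmean(z), 10))
print("population std of z:", round(pstdev(z), 10))
print("sample std of z:    ", round(stdev(z), 4))  # ddof=1, as pandas describe() reports
```

The gap between the two conventions shrinks as the number of rows grows, which is why it is barely visible on the full dataset.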

# The label

In [None]:
# Label: the last column.
y = data.iloc[:, -1]
y

# Train-test split

In [None]:
from sklearn.model_selection import train_test_split

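# Hold out 20% of the samples for testing.
# The split is random; pass random_state=<int> to make it reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)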

# Creating the model

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# Model fitting

In [None]:
model.fit(X_train,y_train)
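
In this notebook the features are standardized using all rows before the split, so test-set statistics influence the scaling. One common alternative, sketched here under stated assumptions, is to put the scaler and the classifier in a single pipeline so that the scaler is fit on the training rows only. The data below is synthetic (`make_classification`), not the breast cancer CSV, and the names `X_demo`, `y_demo`, and `pipeline` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the dataset (569 x 30); not the real data.
X_demo, y_demo = make_classification(n_samples=569, n_features=30, random_state=0)

# stratify keeps the class ratio similar in both parts; random_state fixes the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The scaler is fit inside the pipeline, using the training rows only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_tr, y_tr)

test_accuracy = pipeline.score(X_te, y_te)
print(f"held-out accuracy on synthetic data: {test_accuracy:.3f}")
```

On a dataset of this size the difference is usually small, but the pipeline form keeps the held-out score free of information from the test rows.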

# Computing predictions

In [None]:
predictions = model.predict(X_test)

predictions

# Evaluate the predictions

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predictions)

cm
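
To read the matrix: scikit-learn's `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns, with labels in sorted order. The sketch below rebuilds that layout in plain Python on made-up predictions and derives accuracy, precision, and recall. Treating label `1` as the positive class is an assumption; which class is "positive" here depends on how the CSV encodes malignant and benign.

```python
def confusion_counts(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts samples with true labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix

# Hypothetical labels and predictions, for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0, 1]

cm_demo = confusion_counts(y_true_demo, y_pred_demo, labels=[0, 1])
(tn, fp), (fn, tp) = cm_demo  # row 0: true 0, row 1: true 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the true positives, how many were found

print(cm_demo)
print(accuracy, precision, recall)
```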

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay(confusion_matrix=cm, display_labels = model.classes_).plot()

# Feature extraction: t-SNE

In [None]:
from sklearn.manifold import TSNE

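# Project the standardized features onto 2 dimensions.
# With init='random' and no random_state, the embedding differs between runs.
prj_tsne = TSNE(n_components=2, learning_rate='auto', init='random', perplexity=10)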

X_embedded = prj_tsne.fit_transform(X)

In [None]:
X_embedded

In [None]:
tsne_result = pd.DataFrame(X_embedded, columns=["t-SNE_1","t-SNE_2"])
tsne_result["label"] = y
tsne_result

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="label",data=tsne_result)

# K-means clustering of feature vectors

## k = 2 clusters

In [None]:
from sklearn.cluster import KMeans

cls_kmns = KMeans(n_clusters=2, init='k-means++')

In [None]:
kmeans_result = cls_kmns.fit_predict(X)

kmeans_result

In [None]:
tsne_result["kmeans_full"] = kmeans_result

tsne_result

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12,6))
axes[0] = sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="label",data=tsne_result,ax=axes[0])
axes[0].set_title("true label")
axes[1] = sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="kmeans_full",data=tsne_result,ax=axes[1])
axes[1].set_title("kmeans_full")
plt.show()
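
The two panels can also be compared numerically. K-means cluster ids are arbitrary (cluster 0 may correspond to either class), so a direct comparison has to allow for swapped ids. The sketch below does this for two clusters on made-up data; it assumes both sequences are encoded as 0 and 1, which may not match the label column of the CSV. For a general, permutation-invariant score, `sklearn.metrics.adjusted_rand_score` is one option.

```python
def two_cluster_agreement(labels, clusters):
    """Fraction of samples whose cluster id matches the label, allowing the two ids to be swapped.

    Assumes both sequences contain only 0 and 1.
    """
    matches = sum(1 for label, cluster in zip(labels, clusters) if label == cluster)
    direct = matches / len(labels)
    # If the ids are swapped, the agreement is the complement of the direct match rate.
    return max(direct, 1.0 - direct)

# Hypothetical labels and cluster assignments, for illustration only.
labels_demo = [0, 0, 0, 1, 1, 1]
clusters_demo = [1, 1, 0, 0, 0, 0]

agreement = two_cluster_agreement(labels_demo, clusters_demo)
print(f"agreement up to relabeling: {agreement:.3f}")
```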

## k = 3 clusters

In [None]:
cls_kmns = KMeans(n_clusters=3, init='k-means++')

kmeans_result = cls_kmns.fit_predict(X)

tsne_result["kmeans_full_3"] = kmeans_result

fig, axes = plt.subplots(1, 2, figsize=(12,6))
axes[0] = sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="label",data=tsne_result,ax=axes[0])
axes[0].set_title("true label")
axes[1] = sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="kmeans_full_3",data=tsne_result,ax=axes[1])
axes[1].set_title("kmeans_full_3")
plt.show()
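
With three clusters and two classes there is no one-to-one mapping between cluster ids and labels, so a contingency count per cluster is more informative than a match rate. The sketch below computes cluster purity on made-up data: each cluster is credited with its most frequent label. The `"M"`/`"B"` strings are hypothetical placeholders for whatever encoding the label column uses.

```python
from collections import Counter, defaultdict

def cluster_purity(labels, clusters):
    """Return (purity, per-cluster label counts); purity credits each cluster with its majority label."""
    by_cluster = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster][label] += 1
    majority_total = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority_total / len(labels), dict(by_cluster)

# Hypothetical labels and cluster assignments, for illustration only.
labels_demo3 = ["M", "M", "M", "B", "B", "B", "B", "B"]
clusters_demo3 = [0, 0, 1, 1, 2, 2, 2, 1]

purity, counts_per_cluster = cluster_purity(labels_demo3, clusters_demo3)
print(counts_per_cluster)
print(f"purity: {purity:.3f}")
```

Purity never decreases as clusters are split further, so it should not be used on its own to compare k = 2 with k = 3.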

# K-means clustering of embedding

## k = 2 clusters

In [None]:
cls_kmns = KMeans(n_clusters=2, init='k-means++')

kmeans_result = cls_kmns.fit_predict(X_embedded)

In [None]:
tsne_result["kmeans_embedded"] = kmeans_result

tsne_result

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12,6))
axes[0] = sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="label",data=tsne_result,ax=axes[0])
axes[0].set_title("true label")
axes[1] = sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="kmeans_embedded",data=tsne_result,ax=axes[1])
axes[1].set_title("kmeans_embedded")
plt.show()

## k = 3 clusters

In [None]:
cls_kmns = KMeans(n_clusters=3, init='k-means++')

kmeans_result = cls_kmns.fit_predict(X_embedded)

tsne_result["kmeans_embedded_3"] = kmeans_result

fig, axes = plt.subplots(1, 2, figsize=(12,6))
axes[0] = sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="label",data=tsne_result,ax=axes[0])
axes[0].set_title("true label")
axes[1] = sns.scatterplot(x="t-SNE_1",y="t-SNE_2",hue="kmeans_embedded_3",data=tsne_result,ax=axes[1])
axes[1].set_title("kmeans_embedded_3")
plt.show()