## Gene Expression Analysis using Python

This notebook shows how to apply a support vector machine and the k-means algorithm to cancer gene expression classification.
Before passing the data through these algorithms, we use Principal Component Analysis to reduce dimensionality.

#### Dataset for this project comes from: https://www.kaggle.com/crawford/gene-expression

#### First we will import the necessary libraries for reading and processing our dataset

In [1]:
import os #IO functions
import pandas as pd # data preprocessing
import numpy as np # linear algebra
import plotly.plotly as py # data visualization (plotly < 4; later releases moved this module to chart_studio.plotly)

### Content of our dataset (description from Kaggle)

Golub et al "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

There are two datasets containing the initial (training, 38 samples) and independent (test, 34 samples) datasets used in the paper. These datasets contain measurements corresponding to ALL and AML samples from Bone Marrow and Peripheral Blood. Intensity values have been re-scaled such that overall intensities for each chip are equivalent.

In [2]:
MAIN_DIR = os.getcwd()

X_df = pd.read_csv(os.path.join(MAIN_DIR, 'data_set_ALL_AML_train.csv'), encoding = 'utf-8')
y_data = pd.read_csv(os.path.join(MAIN_DIR,'actual.csv'))

print(f'Shape of X data: {X_df.shape}')
print(f'Shape of y data: {y_data.shape}')

X_df.head()

Shape of X data: (7129, 78)
Shape of y data: (72, 2)


Unnamed: 0,Gene Description,Gene Accession Number,1,call,2,call.1,3,call.2,4,call.3,...,29,call.33,30,call.34,31,call.35,32,call.36,33,call.37
0,AFFX-BioB-5_at (endogenous control),AFFX-BioB-5_at,-214,A,-139,A,-76,A,-135,A,...,15,A,-318,A,-32,A,-124,A,-135,A
1,AFFX-BioB-M_at (endogenous control),AFFX-BioB-M_at,-153,A,-73,A,-49,A,-114,A,...,-114,A,-192,A,-49,A,-79,A,-186,A
2,AFFX-BioB-3_at (endogenous control),AFFX-BioB-3_at,-58,A,-1,A,-307,A,265,A,...,2,A,-95,A,49,A,-37,A,-70,A
3,AFFX-BioC-5_at (endogenous control),AFFX-BioC-5_at,88,A,283,A,309,A,12,A,...,193,A,312,A,230,P,330,A,337,A
4,AFFX-BioC-3_at (endogenous control),AFFX-BioC-3_at,-295,A,-264,A,-376,A,-419,A,...,-51,A,-139,A,-367,A,-188,A,-407,A


### Data preprocessing 

Before implementing the PCA algorithm we have to prepare our dataset.
PCA requires a purely numeric matrix, so we have to
delete all the unnecessary columns from the data.

#### Analysis steps:
        
        1. Transpose dataframe so that each row is a patient and each column is a gene
        2. Remove the Gene Description and Gene Accession Number rows
        3. Remove the "call" rows
        4. Reset index
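
The four steps above can be sketched as one helper. This is a minimal illustration, not the notebook's own code: the function name `tidy_expression_frame` and the toy frame are hypothetical, and the sketch drops the annotation and "call" columns before transposing rather than dropping rows afterwards, which gives the same patients-by-genes matrix.

```python
import pandas as pd

def tidy_expression_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a patients-by-genes float matrix from the raw column layout."""
    # Keep only the per-patient intensity columns.
    keep = [c for c in raw.columns
            if c not in ("Gene Description", "Gene Accession Number")
            and not str(c).startswith("call")]
    tidy = raw[keep].T                   # rows become patients, columns become genes
    tidy = tidy.reset_index(drop=True)   # replace patient labels with a 0..n-1 index
    return tidy.astype(float)

# Tiny hypothetical frame that mimics the layout of the training CSV.
toy = pd.DataFrame({
    "Gene Description": ["gene A", "gene B"],
    "Gene Accession Number": ["A_at", "B_at"],
    "1": [-214, -153], "call": ["A", "A"],
    "2": [-139, -73], "call.1": ["A", "P"],
})
tidy_toy = tidy_expression_frame(toy)
print(tidy_toy.shape)
```

Casting to float at the end also avoids the object dtype that a transposed mixed-type frame would otherwise carry.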


#### 1. Transpose the dataframe

In [3]:
X_df = X_df.T

print(f'Shape of X transposed data: {X_df.shape}')
X_df.head()

Shape of X transposed data: (78, 7129)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
Gene Description,AFFX-BioB-5_at (endogenous control),AFFX-BioB-M_at (endogenous control),AFFX-BioB-3_at (endogenous control),AFFX-BioC-5_at (endogenous control),AFFX-BioC-3_at (endogenous control),AFFX-BioDn-5_at (endogenous control),AFFX-BioDn-3_at (endogenous control),AFFX-CreX-5_at (endogenous control),AFFX-CreX-3_at (endogenous control),AFFX-BioB-5_st (endogenous control),...,Transcription factor Stat5b (stat5b) mRNA,Breast epithelial antigen BA46 mRNA,GB DEF = Calcium/calmodulin-dependent protein ...,TUBULIN ALPHA-4 CHAIN,CYP4B1 Cytochrome P450; subfamily IVB; polypep...,PTGER3 Prostaglandin E receptor 3 (subtype EP3...,HMG2 High-mobility group (nonhistone chromosom...,RB1 Retinoblastoma 1 (including osteosarcoma),GB DEF = Glycophorin Sta (type A) exons 3 and ...,GB DEF = mRNA (clone 1A7)
Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
call,A,A,A,A,A,A,A,A,A,A,...,A,A,A,P,A,A,A,A,A,A
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14


#### 2. We will drop the first two rows of our dataset since they do not provide any information helping us with our classification task

In [4]:
X_df = X_df.iloc[2:,:]
X_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
call,A,A,A,A,A,A,A,A,A,A,...,A,A,A,P,A,A,A,A,A,A
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
call.1,A,A,A,A,A,A,A,A,A,A,...,A,A,A,A,A,A,A,A,A,A
3,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41


#### 3. Next we will drop all the rows with a 'call' label, since they are not needed for our classification task

In [5]:
X_df = X_df.drop([index for index, row in X_df.iterrows() if "call" in index])
X_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
1,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
2,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
3,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41
4,-135,-114,265,12,-419,-585,158,-253,49,31,...,240,835,218,174,-110,627,170,-50,126,-91
5,-106,-125,-76,168,-230,-284,4,-122,70,252,...,156,649,57,504,-26,250,314,14,56,-25


 Now we will separate category from input data

#### 4. Reset index in dataframe

In [6]:
X_df.reset_index(drop=True, inplace=True)
X_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
0,-214,-153,-58,88,-295,-558,199,-176,252,206,...,185,511,-125,389,-37,793,329,36,191,-37
1,-139,-73,-1,283,-264,-400,-330,-168,101,74,...,169,837,-36,442,-17,782,295,11,76,-14
2,-76,-49,-307,309,-376,-650,33,-367,206,-215,...,315,1199,33,168,52,1138,777,41,228,-41
3,-135,-114,265,12,-419,-585,158,-253,49,31,...,240,835,218,174,-110,627,170,-50,126,-91
4,-106,-125,-76,168,-230,-284,4,-122,70,252,...,156,649,57,504,-26,250,314,14,56,-25


In [7]:
print(f'Shape of our processed data: {X_df.shape}')

X_df.describe()

Shape of our processed data: (38, 7129)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
count,38,38,38,38,38,38,38,38,38,38,...,38,38,38,38,38,38,38,38,38,38
unique,35,35,35,37,38,36,37,38,34,36,...,33,38,36,38,36,38,37,31,36,36
top,-81,-114,-1,132,-407,-284,-31,-194,206,350,...,103,1215,57,255,-22,987,295,26,246,-22
freq,2,2,2,2,1,2,2,1,2,2,...,2,1,3,1,2,1,2,2,2,2


### Fantastic, now we are ready for implementing our PCA

#### Principal Component Analysis 


    1. Standardize the d-dimensional data using formula

$$s= \frac {value - mean}{std}$$ 

    2. Eigendecomposition - get eigenvectors and eigenvalues
        - using the correlation/covariance matrix; the covariance 
          between two features j and k is defined as follows:
        
\begin{equation*}
\sigma_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)
\end{equation*}

        In matrix form, the covariance matrix is:
        
\begin{equation*}
\Sigma = \frac{1}{n-1}((X - \bar{x})^T(X - \bar{x}))
\end{equation*}
        
        where the mean vector is:
\begin{equation*} \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \end{equation*}

        The mean vector is a d-dimensional vector in which each value is the sample mean of one feature column of the dataset.
        
        - using Singular Value Decomposition
        
    3. Sort eigenvalues in decreasing order then
       take the k eigenvectors corresponding to
       the k highest eigenvalues.
       k - number of dimensions of our subspace
       
    4. Create the projection matrix W from k eigenvectors
    
    5. Transform the original d-dimensional data with projection
       matrix W in order to get subspace representation of our data
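
The five steps above can be sketched compactly in NumPy. This is an illustrative sketch on random toy data, not the class used in this notebook: the function name `pca_sketch` is hypothetical, and it uses `np.linalg.eigh`, which is intended for symmetric matrices and returns real eigenvalues.

```python
import numpy as np

def pca_sketch(X, k):
    """Project X (n_samples, d) onto its top-k principal components."""
    # 1. Standardize each column (population std, as StandardScaler does).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix and its eigendecomposition.
    cov = X_std.T @ X_std / (X_std.shape[0] - 1)
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # 3. eigh returns ascending eigenvalues, so reorder to descending.
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    # 4. Projection matrix W from the k leading eigenvectors.
    W = eig_vecs[:, :k]
    # 5. Project the standardized data onto the subspace.
    return X_std @ W, eig_vals

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 5))   # hypothetical data: 20 samples, 5 features
Z, vals = pca_sketch(X_toy, k=2)
print(Z.shape)
```

The sample variance of each projected column equals the corresponding eigenvalue, which is a convenient check on any implementation.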
       

In [8]:
from sklearn.preprocessing import StandardScaler 


class Principal_Component_Analysis:
    
    def __init__(self, X):
        
        """
        Description:
            Constructor of PCA 

        Arguments:
            X - data to be processed

        Returns:
            Nothing, only sets the parameters for the given object.
        """

        self.X = X.astype(float) # data
        self.scaler = StandardScaler() # standardization
        self.explained_variance = None

    def standarize(self):
        
        """
        Description:
            Method for data standardization, in this case
            we use the StandardScaler object from the scikit-learn library

        Returns:
            Scaled X data.
        """
        
        return self.scaler.fit_transform(self.X)
    
    def create_covariance_matrix(self, X_scaled):
        
        """
        Description:
            This method uses scaled data to create the covariance matrix

        Arguments:
            X_scaled - standardized data

        Returns:
            Covariance matrix
        """

        mean_vec = np.mean(X_scaled, axis=0)
        print(f'Scaled data shape: {X_scaled.shape}')
        
        covariance_matrix = (X_scaled - mean_vec).T.dot((X_scaled - mean_vec)) / (X_scaled.shape[0]-1)
        print(f'The covariance matrix shape: \n {covariance_matrix.shape}')

        return covariance_matrix
    
    def eigendecomposion(self, covariance_matrix):

        """
        Description:
            This method calculates the eigenvalues and eigenvectors of the covariance matrix.
            Sorting is done separately in pair_eigen.

        Arguments:
            covariance_matrix - square covariance matrix

        Returns:
            Tuple (eigenvalues, eigenvectors); column i of the eigenvector
            matrix corresponds to eigenvalue i

        """

        eig_vals, eig_vecs = np.linalg.eig(covariance_matrix)
        
        return eig_vals, eig_vecs
    


    def pair_eigen(self, eig_vals, eig_vecs):
        
        # Make a list of (eigenvalue, eigenvector) tuples
        eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

        # Sort the (eigenvalue, eigenvector) tuples from high to low by eigenvalue
        eig_pairs = sorted(eig_pairs, reverse=True, key=lambda pair: pair[0])

        return eig_pairs
        
    def create_projection_matrix(self, eig_pairs, nb_dim: int):
        
        eig_pair_list = []
        num_of_cols = eig_pairs[0][1].shape[0]
        
        for i in range(0, nb_dim):
            eig_pair_list.append(eig_pairs[i][1].reshape(num_of_cols,1))
            
        eig_pair_tup = tuple(eig_pair_list)
        
        matrix_w = np.hstack(eig_pair_tup)
        
        return matrix_w
    
    
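    def project(self, projection_matrix):

        """
        Description:
            Projects the standardized data onto the subspace spanned by
            the columns of the projection matrix.

        Arguments:
            projection_matrix - matrix W of shape (n_features, nb_dim)

        Returns:
            Data in the reduced subspace, shape (n_samples, nb_dim)
        """

        return self.standarize().dot(projection_matrix)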
    
    

#### 1. Initialize the object and pass the data

In [9]:
pca = Principal_Component_Analysis(X=X_df)

#### 2. Now let's standardize the data so that each feature has mean 0 and standard deviation 1.
#### This is useful when comparing features measured in different units

In [10]:
scaled_data = pca.standarize()
scaled_data

array([[-0.86149567, -0.03310102, -0.3517011 , ...,  0.54606799,
        -0.43582025, -0.25587506],
       [-0.16772267,  1.03740009,  0.13913948, ..., -0.26704265,
        -0.59574421,  0.49964792],
       [ 0.41504666,  1.35855042, -2.49589941, ...,  0.70869012,
        -0.38436645, -0.38727036],
       ...,
       [ 0.82206015,  1.35855042,  0.56970139, ..., -1.4704464 ,
        -0.51647755, -0.09163093],
       [-0.02896807,  0.95711251, -0.1708651 , ...,  0.64364126,
        -0.28702143,  0.86098499],
       [-0.13072144, -0.47468273, -0.45503596, ..., -1.01510444,
         0.397175  ,  0.63104322]])

#### 3. Calculate the covariance matrix

In [11]:
cov_matrix = pca.create_covariance_matrix(X_scaled=scaled_data)

Scaled data shape: (38, 7129)
The covariance matrix shape: 
 (7129, 7129)
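
With 38 samples and 7129 features, the 7129 x 7129 covariance matrix has at most 37 non-zero eigenvalues, so the Singular Value Decomposition route mentioned earlier avoids building it at all. The sketch below, on hypothetical random data, checks that the squared singular values of the centered data divided by n-1 match the leading eigenvalues of the explicit covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 50))                 # hypothetical: 6 samples, 50 features
X_c = X - X.mean(axis=0)                     # center each column

# Thin SVD: X_c = U @ diag(s) @ Vt, with Vt of shape (6, 50).
U, s, Vt = np.linalg.svd(X_c, full_matrices=False)
svd_eig_vals = s**2 / (X_c.shape[0] - 1)     # eigenvalues of the covariance matrix

# Reference: leading eigenvalues of the explicit 50 x 50 covariance matrix.
cov = X_c.T @ X_c / (X_c.shape[0] - 1)
ref_eig_vals = np.linalg.eigvalsh(cov)[::-1][:len(s)]

print(np.allclose(svd_eig_vals, ref_eig_vals))
```

The rows of `Vt` play the role of the eigenvectors, so `Vt[:k].T` could serve as a projection matrix.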


#### 4. Get the eigenvectors and eigenvalues

In [12]:
eig_vals, eig_vecs = pca.eigendecomposion(cov_matrix)

#### Let's visualize the principal component of our problem

We need to extract the real part of eigenvalues in order to visualize it

In [13]:
eig_vals_real = np.real(eig_vals)
tot = sum(eig_vals_real)
var_exp = [(i / tot)*100 for i in sorted(eig_vals_real, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

trace1 = dict(type='bar',x=['PC %s' %i for i in range(1,31)],y=var_exp[:30],name='Individual')
trace2 = dict(type='scatter',x=['PC %s' %i for i in range(1,31)], y=cum_var_exp[:30],name='Cumulative')

data = [trace1, trace2]

layout=dict(title='Explained variance by different principal components',yaxis=dict(title='Explained variance in percent'),
annotations=list([
    dict(
        x=1.16,
        y=1.05,
        xref='paper',
        yref='paper',
        text='Explained Variance',
        showarrow=False)]))

fig = dict(data=data, layout=layout)
py.iplot(fig, filename='selecting-principal-components')





#### Plot discussion

As the plot above shows, most of the variance (93%) is explained by the first 30 principal components. That should be more than enough for the classifier to get acceptable results.
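
The number of components needed for a given share of variance can also be read off numerically rather than from the plot. A minimal sketch follows; the helper name `components_for_threshold` and the toy eigenvalues are hypothetical.

```python
import numpy as np

def components_for_threshold(eig_vals, threshold=0.93):
    """Smallest number of leading components whose cumulative variance share reaches threshold."""
    vals = np.sort(np.real(eig_vals))[::-1]          # descending eigenvalues
    cum_share = np.cumsum(vals) / vals.sum()         # cumulative explained-variance ratio
    k = int(np.searchsorted(cum_share, threshold)) + 1
    return min(k, len(vals))

toy_eig_vals = np.array([5.0, 3.0, 1.0, 1.0])        # hypothetical eigenvalues
print(components_for_threshold(toy_eig_vals, threshold=0.75))
```

In this toy case the cumulative shares are 0.5, 0.8, 0.9 and 1.0, so two components are needed to reach 75%.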

#### 5. Creating  the projection matrix 

We will pack our eigenvalues and eigenvectors into tuples and then sort them in descending order

In [14]:
eigen_pairs = pca.pair_eigen(eig_vals, eig_vecs)
print('Eigenvalues in descending order (first 5):')
for i in range(0,5):
    
    print(eigen_pairs[i][0])


Eigenvalues in descending order (first 5):
1097.3575909754795
876.9764850290308
483.27218616584304
357.65814522778464
339.1704257562485


Now we finally create the projection matrix with the specified number of output dimensions

In [15]:
projection_matrix = pca.create_projection_matrix(eig_pairs=eigen_pairs,nb_dim=3)
print('W matrix:', projection_matrix)

W matrix: [[ 1.29741138e-02+0.j -2.17619942e-03+0.j -7.17359328e-03+0.j]
 [ 6.48473527e-03+0.j -4.86244308e-03+0.j -7.91294911e-03+0.j]
 [-1.86511421e-03+0.j -3.56293818e-03+0.j -1.59078930e-03+0.j]
 ...
 [-7.14956201e-03+0.j  1.57028822e-02+0.j  7.21514208e-03+0.j]
 [-3.55688572e-03+0.j -8.27612611e-03+0.j -1.96830386e-02+0.j]
 [ 5.95097737e-05+0.j -4.76305008e-03+0.j -7.56848583e-03+0.j]]


In [16]:
X_train = scaled_data.dot(projection_matrix)

Let's do a sanity check and see if the output dimensions are correct

In [17]:
print(X_train.shape)

(38, 3)


### Preparing validation data

In [18]:
X_valid = pd.read_csv(os.path.join(MAIN_DIR,'data_set_ALL_AML_independent.csv'))
X_valid = X_valid.T
X_valid.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
Gene Description,AFFX-BioB-5_at (endogenous control),AFFX-BioB-M_at (endogenous control),AFFX-BioB-3_at (endogenous control),AFFX-BioC-5_at (endogenous control),AFFX-BioC-3_at (endogenous control),AFFX-BioDn-5_at (endogenous control),AFFX-BioDn-3_at (endogenous control),AFFX-CreX-5_at (endogenous control),AFFX-CreX-3_at (endogenous control),AFFX-BioB-5_st (endogenous control),...,Transcription factor Stat5b (stat5b) mRNA,Breast epithelial antigen BA46 mRNA,GB DEF = Calcium/calmodulin-dependent protein ...,TUBULIN ALPHA-4 CHAIN,CYP4B1 Cytochrome P450; subfamily IVB; polypep...,PTGER3 Prostaglandin E receptor 3 (subtype EP3...,HMG2 High-mobility group (nonhistone chromosom...,RB1 Retinoblastoma 1 (including osteosarcoma),GB DEF = Glycophorin Sta (type A) exons 3 and ...,GB DEF = mRNA (clone 1A7)
Gene Accession Number,AFFX-BioB-5_at,AFFX-BioB-M_at,AFFX-BioB-3_at,AFFX-BioC-5_at,AFFX-BioC-3_at,AFFX-BioDn-5_at,AFFX-BioDn-3_at,AFFX-CreX-5_at,AFFX-CreX-3_at,AFFX-BioB-5_st,...,U48730_at,U58516_at,U73738_at,X06956_at,X16699_at,X83863_at,Z17240_at,L49218_f_at,M71243_f_at,Z78285_f_at
39,-342,-200,41,328,-224,-427,-656,-292,137,-144,...,277,1023,67,214,-135,1074,475,48,168,-70
call,A,A,A,A,A,A,A,A,A,A,...,A,A,A,A,A,A,A,A,A,A
40,-87,-248,262,295,-226,-493,367,-452,194,162,...,83,529,-295,352,-67,67,263,-33,-33,-21


In [19]:
X_valid = X_valid.iloc[2:,:]
X_valid.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
39,-342,-200,41,328,-224,-427,-656,-292,137,-144,...,277,1023,67,214,-135,1074,475,48,168,-70
call,A,A,A,A,A,A,A,A,A,A,...,A,A,A,A,A,A,A,A,A,A
40,-87,-248,262,295,-226,-493,367,-452,194,162,...,83,529,-295,352,-67,67,263,-33,-33,-21
call.1,A,A,A,A,A,A,A,A,A,A,...,A,A,A,P,A,A,A,A,A,A
42,22,-153,17,276,-211,-250,55,-141,0,500,...,413,399,16,558,24,893,297,6,1971,-42


In [20]:
X_valid = X_valid.drop([index for index, row in X_valid.iterrows() if "call" in index])
X_valid.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
39,-342,-200,41,328,-224,-427,-656,-292,137,-144,...,277,1023,67,214,-135,1074,475,48,168,-70
40,-87,-248,262,295,-226,-493,367,-452,194,162,...,83,529,-295,352,-67,67,263,-33,-33,-21
42,22,-153,17,276,-211,-250,55,-141,0,500,...,413,399,16,558,24,893,297,6,1971,-42
47,-243,-218,-163,182,-289,-268,-285,-172,52,-134,...,174,277,6,81,2,722,170,0,510,-73
48,-130,-177,-28,266,-170,-326,-222,-93,10,159,...,233,643,51,450,-46,612,370,29,333,-19


In [21]:
X_valid.reset_index(drop=True, inplace=True)
X_valid.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,7119,7120,7121,7122,7123,7124,7125,7126,7127,7128
0,-342,-200,41,328,-224,-427,-656,-292,137,-144,...,277,1023,67,214,-135,1074,475,48,168,-70
1,-87,-248,262,295,-226,-493,367,-452,194,162,...,83,529,-295,352,-67,67,263,-33,-33,-21
2,22,-153,17,276,-211,-250,55,-141,0,500,...,413,399,16,558,24,893,297,6,1971,-42
3,-243,-218,-163,182,-289,-268,-285,-172,52,-134,...,174,277,6,81,2,722,170,0,510,-73
4,-130,-177,-28,266,-170,-326,-222,-93,10,159,...,233,643,51,450,-46,612,370,29,333,-19


In [22]:
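# NOTE: fit_transform re-estimates the mean and std on the validation samples;
# pca.scaler.transform(X_valid) would reuse the training statistics instead.
X_valid = pca.scaler.fit_transform(X_valid)
X_valid = X_valid.dot(projection_matrix)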


Data with input dtype object were all converted to float64 by StandardScaler.





In [23]:
cancer_vals = y_data.iloc[:,1]
len(cancer_vals)

72

In [24]:
y_train = cancer_vals[:38]
y_valid = cancer_vals[38:]
y_train

0     ALL
1     ALL
2     ALL
3     ALL
4     ALL
5     ALL
6     ALL
7     ALL
8     ALL
9     ALL
10    ALL
11    ALL
12    ALL
13    ALL
14    ALL
15    ALL
16    ALL
17    ALL
18    ALL
19    ALL
20    ALL
21    ALL
22    ALL
23    ALL
24    ALL
25    ALL
26    ALL
27    AML
28    AML
29    AML
30    AML
31    AML
32    AML
33    AML
34    AML
35    AML
36    AML
37    AML
Name: cancer, dtype: object
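
The positional slices above assume that the rows of the expression matrices appear in patient order. The validation preview lists patients 39, 40, 42, 47, 48, so that may not hold. A hedged sketch of looking labels up by patient number instead is shown below; the column name `patient` and the toy frames are assumptions for illustration, and in this notebook the lookup would have to happen before the index is reset.

```python
import pandas as pd

# Hypothetical stand-ins: labels keyed by patient number, and expression rows
# whose index holds the patient numbers in file order (not sorted).
labels = pd.DataFrame({"patient": [1, 2, 3, 4],
                       "cancer": ["ALL", "ALL", "AML", "AML"]})
expr = pd.DataFrame([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], index=["1", "4", "2"])

# Look up each row's label through its patient number instead of its position.
by_patient = labels.set_index("patient")["cancer"]
y_aligned = by_patient.loc[expr.index.astype(int)].to_numpy()
print(y_aligned)
```

Here the second expression row belongs to patient 4, so it receives the AML label even though it sits in position 1.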

In [25]:
from sklearn.preprocessing import LabelEncoder as LE
le = LE()

y_train = le.fit_transform(y_train)
y_valid = le.transform(y_valid)  # reuse the encoding learned from the training labels

print(f'Shape train : {y_train.shape}')
print(f'Shape valid : {y_valid.shape}')

Shape train : (38,)
Shape valid : (34,)


In [26]:
print(X_train)

[[-18.54479866+0.j   5.82694304+0.j  19.44231273+0.j]
 [  7.34138894+0.j  10.00866306+0.j -11.79569766+0.j]
 [-52.85214592+0.j  11.39062982+0.j  29.54856592+0.j]
 [-13.68855104+0.j  -6.24683926+0.j  22.62331824+0.j]
 [ 36.79010623+0.j  32.79969788+0.j  -5.25331356+0.j]
 [  9.64084235+0.j -20.99224287+0.j  22.47134432+0.j]
 [-21.80887035+0.j -16.25820427+0.j  25.14513081+0.j]
 [-56.00037786+0.j -21.41652449+0.j  35.86157605+0.j]
 [-21.6187088 +0.j  30.85448981+0.j   8.31335286+0.j]
 [ 22.87497029+0.j  -8.69773133+0.j   3.90786836+0.j]
 [ 18.36152492+0.j   6.89758927+0.j  -0.9767139 +0.j]
 [ 43.2081958 +0.j -34.17327535+0.j   4.1054309 +0.j]
 [ 32.58248935+0.j  33.29863232+0.j   0.7273315 +0.j]
 [  9.97387307+0.j  13.0831566 +0.j   5.47073533+0.j]
 [ 35.33518017+0.j  36.19388859+0.j   1.01962677+0.j]
 [ 16.02774859+0.j  11.32504965+0.j  10.26390708+0.j]
 [-72.52752229+0.j  69.56845595+0.j   6.60329508+0.j]
 [ 31.02865129+0.j  -8.61783154+0.j   2.24325396+0.j]
 [ 37.49570105+0.j  -6.95014

In [28]:
from sklearn.preprocessing import LabelEncoder as LE

In [29]:
y_train = y_train.reshape((38,1))
y_valid = y_valid.reshape((34,1))

In [30]:
print(f'Shape train : {y_train.shape}')
print(f'Shape valid : {y_valid.shape}')

Shape train : (38, 1)
Shape valid : (34, 1)


In [31]:
X_train = np.real(X_train)
X_valid = np.real(X_valid)

In [32]:
print(f'Shape train : {X_train.shape}')
print(f'Shape valid : {X_valid.shape}')

Shape train : (38, 3)
Shape valid : (34, 3)


In [33]:
all_x, all_y, all_z = [], [] , []
aml_x, aml_y, aml_z = [], [] , []
iterator = 0


for cat in y_train:
    print(cat)
    if cat == 0:
        
        all_x.append(X_train[iterator,0])
        all_y.append(X_train[iterator,1])
        all_z.append(X_train[iterator,2])
    else:
        aml_x.append(X_train[iterator,0])
        aml_y.append(X_train[iterator,1])
        aml_z.append(X_train[iterator,2])
    
    iterator +=1
    


[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[0]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]
[1]


In [34]:
all_x, all_y, all_z = np.array(all_x),np.array(all_y),np.array(all_z)
aml_x, aml_y, aml_z = np.array(aml_x),np.array(aml_y),np.array(aml_z)

In [35]:
print(all_x.shape)
print(aml_x.shape)

print(all_y.shape)
print(aml_y.shape)

print(all_z.shape)
print(aml_z.shape)

(27,)
(11,)
(27,)
(11,)
(27,)
(11,)


In [36]:
import plotly.plotly as py
import plotly.graph_objs as go


trace1 = go.Scatter3d(
    x=all_x,
    y=all_y,
    z=all_z,
    mode='markers',
    marker=dict(
        size=12,
        line=dict(
            color='rgba(217, 217, 217, 0.14)',
            width=0.5
        ),
        opacity=0.8
    )
)


trace2 = go.Scatter3d(
    x=aml_x,
    y=aml_y,
    z=aml_z,
    mode='markers',
    marker=dict(
        color='rgb(127, 127, 127)',
        size=12,
        symbol='circle',
        line=dict(
            color='rgb(204, 204, 204)',
            width=1
        ),
        opacity=0.9
    )
)
data = [trace1, trace2]
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='simple-3d-scatter')






### Now it's time to build our classifier

In [37]:
from sklearn.svm import SVC

In [38]:
clf = SVC(gamma='auto')

In [39]:
clf.fit(X_train, y_train.ravel())  # SVC expects a 1-D label array of shape (n_samples,)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [40]:
print('True classes: ' , y_valid)

True classes:  [[0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]]


In [41]:
print('Predicted classes: ', clf.predict(X_valid))

Predicted classes:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [42]:
print('Score: ', clf.score(X_valid, y_valid))

Score:  0.5882352941176471
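
Since every prediction is class 0, it helps to compare the score with the share of the majority class. The sketch below rebuilds the true and predicted labels printed above (0 = ALL, 1 = AML) and uses scikit-learn's `confusion_matrix` and `accuracy_score`; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# True validation labels in the order printed above.
y_true = np.array([0]*11 + [1]*5 + [0]*2 + [1]*2 + [0] + [1]*7 + [0]*6)
y_pred = np.zeros_like(y_true)            # the classifier predicted 0 for every sample

cm = confusion_matrix(y_true, y_pred)     # rows: true class, columns: predicted class
accuracy = accuracy_score(y_true, y_pred)
majority_share = (y_true == 0).mean()     # accuracy of always predicting class 0

print(cm)
print(accuracy, majority_share)
```

The accuracy equals the majority-class share (20 of 34), so on this split the classifier does no better than always answering ALL.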


### We will now use clustering in order to separate our data

In [43]:
from sklearn.cluster import KMeans

# Number of clusters
kmeans = KMeans(n_clusters=2)
# Fitting the input data
kmeans = kmeans.fit(X_train)
# Getting the cluster labels
labels = kmeans.predict(X_train)
# Centroid values
centroids = kmeans.cluster_centers_

In [44]:
print(centroids) # From scikit-learn

[[ 25.2566482    6.60730318   2.38216031]
 [-28.06294245  -7.34144798  -2.64684479]]


In [45]:
centroid_all_x =[centroids[0][0]]
centroid_all_y =[centroids[0][1]]
centroid_all_z =[centroids[0][2]]


centroid_aml_x =[centroids[1][0]]
centroid_aml_y =[centroids[1][1]]
centroid_aml_z =[centroids[1][2]]


In [46]:
trace1 = go.Scatter3d(
    x=all_x,
    y=all_y,
    z=all_z,
    mode='markers',
    marker=dict(
        size=12,
        line=dict(
            color='rgba(217, 217, 217, 0.14)',
            width=0.5
        ),
        opacity=0.8
    )
)


trace2 = go.Scatter3d(
    x=aml_x,
    y=aml_y,
    z=aml_z,
    mode='markers',
    marker=dict(
        color='rgb(127, 127, 127)',
        size=12,
        symbol='circle',
        line=dict(
            color='rgb(204, 204, 204)',
            width=1
        ),
        opacity=0.9
    )
)

trace3 = go.Scatter3d(
    x=centroid_all_x,
    y=centroid_all_y,
    z=centroid_all_z,
    mode='markers',
    marker=dict(
        color='rgb(30, 130, 0)',
        size=12,
        symbol='circle',
        line=dict(
            color='rgb(240, 240, 240)',
            width=1
        ),
        opacity=0.9
    )
)

trace4 = go.Scatter3d(
    x=centroid_aml_x,
    y=centroid_aml_y,
    z=centroid_aml_z,
    mode='markers',
    marker=dict(
        color='rgb(250, 0, 0)',
        size=12,
        symbol='circle',
        line=dict(
            color='rgb(150, 150, 150)',
            width=1
        ),
        opacity=0.9
    )
)

data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='simple-3d-scatter')





In [47]:
print(y_train)
print(labels)

[[0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]]
[1 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 1 1
 1]


In [48]:
iterator = 0
positive = 0

for val in labels:
    
    if val == y_train[iterator]:
        positive +=1
    
    iterator +=1
    
print(positive)
print(iterator)

score = positive/iterator

print(score)

29
38
0.7631578947368421
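
K-means cluster ids are arbitrary, so cluster 0 does not have to correspond to class 0. A small sketch that scores a two-cluster assignment under both possible mappings, using the training labels and cluster ids printed above, is given here; the helper name is hypothetical.

```python
import numpy as np

def clustering_accuracy_two_clusters(y_true, cluster_ids):
    """Accuracy of a 2-cluster assignment under the better of the two label mappings."""
    y_true = np.asarray(y_true).ravel()
    cluster_ids = np.asarray(cluster_ids).ravel()
    direct = np.mean(cluster_ids == y_true)          # cluster 0 -> class 0
    flipped = np.mean((1 - cluster_ids) == y_true)   # cluster 0 -> class 1
    return max(direct, flipped)

y_true = np.array([0]*27 + [1]*11)
cluster_ids = np.array([1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
                        0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1])
best = clustering_accuracy_two_clusters(y_true, cluster_ids)
print(best)
```

For this run the direct mapping is already the better one, matching the 29 of 38 counted above. A different random initialization could swap the cluster ids, in which case the flipped mapping would apply.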


In [49]:
preds = kmeans.predict(X_valid)
print(preds)

[1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1]


In [50]:
iterator = 0
positive = 0

for val in preds:
    
    if val == y_valid[iterator]:
        positive +=1
    
    iterator +=1
    
print(positive)
print(iterator)

score = positive/iterator

print(score)

18
34
0.5294117647058824


In [51]:
all_x, all_y, all_z = [], [] , []
aml_x, aml_y, aml_z = [], [] , []
iterator = 0


for cat in y_valid:
    
    if cat == 0:
        
        all_x.append(X_valid[iterator,0])
        all_y.append(X_valid[iterator,1])
        all_z.append(X_valid[iterator,2])
    else:
        aml_x.append(X_valid[iterator,0])
        aml_y.append(X_valid[iterator,1])
        aml_z.append(X_valid[iterator,2])
    
    iterator +=1
    
all_x, all_y, all_z = np.array(all_x),np.array(all_y),np.array(all_z)
aml_x, aml_y, aml_z = np.array(aml_x),np.array(aml_y),np.array(aml_z)

print(all_x.shape)
print(aml_x.shape)

print(all_y.shape)
print(aml_y.shape)

print(all_z.shape)
print(aml_z.shape)


(20,)
(14,)
(20,)
(14,)
(20,)
(14,)


In [52]:
trace1 = go.Scatter3d(
    x=all_x,
    y=all_y,
    z=all_z,
    mode='markers',
    marker=dict(
        size=12,
        line=dict(
            color='rgba(217, 217, 217, 0.14)',
            width=0.5
        ),
        opacity=0.8
    )
)


trace2 = go.Scatter3d(
    x=aml_x,
    y=aml_y,
    z=aml_z,
    mode='markers',
    marker=dict(
        color='rgb(127, 127, 127)',
        size=12,
        symbol='circle',
        line=dict(
            color='rgb(204, 204, 204)',
            width=1
        ),
        opacity=0.9
    )
)

trace3 = go.Scatter3d(
    x=centroid_all_x,
    y=centroid_all_y,
    z=centroid_all_z,
    mode='markers',
    marker=dict(
        color='rgb(30, 130, 0)',
        size=12,
        symbol='circle',
        line=dict(
            color='rgb(240, 240, 240)',
            width=1
        ),
        opacity=0.9
    )
)

trace4 = go.Scatter3d(
    x=centroid_aml_x,
    y=centroid_aml_y,
    z=centroid_aml_z,
    mode='markers',
    marker=dict(
        color='rgb(250, 0, 0)',
        size=12,
        symbol='circle',
        line=dict(
            color='rgb(150, 150, 150)',
            width=1
        ),
        opacity=0.9
    )
)

data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    margin=dict(
        l=0,
        r=0,
        b=0,
        t=0
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='simple-3d-scatter')



