# <center>Cassava Leaf Disease Classification</center>
# ![image](https://previews.123rf.com/images/bugning/bugning1307/bugning130700289/21226772-cassava.jpg)

## Contents:

- [About competition](#section-one)
- [Exploratory Data Analysis](#section-two)
 - [Class Distribution](#subsection-one)
 - [Sample Images](#subsection-two)
 - [Height, Width and Aspect ratio for all diseases](#subsection-three)
 - [Outlier Detection using Mean and Standard Deviation](#subsection-four)
   - [CBB](#outlier-ss-1)  
   - [CMD](#outlier-ss-2)  
   - [CBSD](#outlier-ss-3)
   - [CGM](#outlier-ss-4)
   - [Healthy](#outlier-ss-5)
  
 - [Outlier detection using K-means Clustering](#subsection-five)
 - [Feature Extraction](#section-six)
- [Next Steps](#section-three)

<a id="section-one"> </a>
## About Competition

Cassava, or Manihot esculenta, belongs to the family Euphorbiaceae and is cultivated in tropical and subtropical regions for its edible starchy tuberous root, which is commonly dried into a powder and named tapioca.

As the second-largest provider of carbohydrates in Africa, cassava is a key food security crop grown by smallholder farmers because it can withstand harsh conditions. At least 80% of household farms in Sub-Saharan Africa grow this starchy root, but viral diseases are major sources of poor yields. With the help of data science, it may be possible to identify common diseases so they can be treated.

Existing methods of disease detection require farmers to solicit the help of government-funded agricultural experts to visually inspect and diagnose the plants. This approach is labor-intensive, costly, and limited by the small supply of experts. As an added challenge, effective solutions for farmers must perform well under significant constraints, since African farmers may only have access to mobile-quality cameras and low bandwidth.

In this competition, we introduce a dataset of 21,367 labeled images collected during a regular survey in Uganda. Most images were crowdsourced from farmers taking photos of their gardens, and annotated by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University, Kampala. This is in a format that most realistically represents what farmers would need to diagnose in real life.

### HEALTH BENEFITS

Tapioca has been associated with some health benefits, such as **healthy weight gain, increased red blood cell count, improved digestion, preventing diabetes, protecting bone mineral density, preventing Alzheimer’s disease and maintaining fluid balance within the body.**

### EVALUATION
**$$Accuracy=\frac{TP + TN}{TP + FP + TN + FN}$$**

where,  
 - TP: True Positive
 - FP: False Positive
 - TN: True Negative
 - FN: False Negative
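
With five classes, the formula above reduces to the share of images whose predicted label equals the true label. The snippet below is a minimal sketch of that computation; the function name `accuracy` and the label arrays are made up for illustration and are not part of the competition code.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float((y_true == y_pred).mean())

# Hypothetical labels in the range 0-4, for illustration only
y_true = [0, 3, 3, 4, 1, 2]
y_pred = [0, 3, 2, 4, 1, 3]

acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.3f}")
```

Four of the six hypothetical predictions match, so the sketch reports an accuracy of 4/6.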

<a id="section-two"></a>
## Exploratory Data Analysis

In [1]:
!pip install -q imutils

In [2]:
import pandas as pd
import numpy as np
import os,json
from tqdm.notebook import tqdm
import time

#for graphs and images
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import seaborn as sns
import matplotlib.pyplot as plt
import cv2,imutils

#k-means
from scipy.cluster.vq import kmeans,whiten
from scipy.stats import zscore

#pytorch(feature-extraction)
from torchvision import models
import torch

from albumentations import *
from albumentations.pytorch import ToTensor

#train_valid split
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [3]:
input_path = './input/cassava-leaf-disease-classification/'

In [4]:
#reading the train.csv file
train_df = pd.read_csv(input_path+'train.csv')
print(train_df.shape)


In [None]:
#mapping the class labels in the JSON file to their respective disease names
with open(input_path+'label_num_to_disease_map.json') as f:
    disease_names = json.load(f)
train_df['disease_name'] = train_df['label'].apply(lambda x: disease_names[str(x)])
#visualize the top five rows of the table
train_df.head()

<a id='subsection-one'></a>
### Class Labels Distribution

In [None]:
fig = make_subplots(rows=1, cols=2,
            specs=[[{"type": "xy"}, {"type": "domain"}]],)
# value_counts: to count number of images in each class with respect to disease_name column
# Bar plot 
t1 = go.Bar(x=train_df['disease_name'].value_counts().index, 
            y=train_df['disease_name'].value_counts().values,
            text=train_df['disease_name'].value_counts().values,
            textposition='auto',name='Count',
           marker_color='indianred')
#Pie chart with labels and counts
t2 = go.Pie(labels=train_df['disease_name'].value_counts().index,
           values=train_df['disease_name'].value_counts().values,
           hole=0.3)
fig.add_trace(t1,row=1, col=1)
fig.add_trace(t2,row=1, col=2)
fig.update_layout(title='Distribution of Class Labels')
fig.show()

<a id="subsection-two"></a>
### Sample Images from each class

In [None]:
#random seed is used to replicate the same images in every run
np.random.seed(2020)
#plotting 5 random samples for each class with image name and disease name as title
for class_name in train_df['disease_name'].unique():
    plt.figure(figsize=(20,50))
    for idx,img_name in enumerate(np.random.choice(train_df[train_df['disease_name'] == class_name]['image_id'].values,
                                                   size=5,replace=False)):
        plt.subplot(1,5,idx+1)
        #reading the image and converting BGR color space to RGB
        img = cv2.cvtColor(cv2.imread(input_path+'train_images/'+img_name), cv2.COLOR_BGR2RGB)
        plt.imshow(img)
        plt.axis('off')
        plt.title(r"$\bf{"+class_name + "}$"+'\n'+img_name )
    plt.show()

<a id="subsection-three"></a>
### Height, Width and Aspect ratio for all diseases

Check the distribution of image sizes, which in turn helps define the model input shape.

 - In this dataset, the image size (600 x 800, height x width) is the same for all images.
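
When reading height and width from an image array, keep in mind that NumPy and OpenCV index images as (rows, columns, channels), so the first entry of `shape` is the height. The sketch below uses a synthetic array as a stand-in for an image loaded with `cv2.imread`; the array is an assumption made for illustration, not real competition data.

```python
import numpy as np

# Synthetic stand-in for a loaded image: 600 rows, 800 columns, 3 color channels
img = np.zeros((600, 800, 3), dtype=np.uint8)

# shape is ordered (rows, columns, channels) = (height, width, channels)
height, width, channels = img.shape
aspect_ratio = width / height

print(height, width, channels, round(aspect_ratio, 3))
```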

In [None]:
for idx in tqdm(train_df.index):
    img_name = train_df.loc[idx,'image_id']
    #reading the image and converting BGR color space to RGB
    img = cv2.cvtColor(cv2.imread(input_path+'train_images/'+img_name), cv2.COLOR_BGR2RGB)
    
    #normalize the image in the range [0,1]
    norm_image = cv2.normalize(img, None, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)
    
    #img.shape is ordered (rows, columns, channels), i.e. (height, width, depth)
    height,width,depth = img.shape
    
    #adding new columns to the table with width, height and aspect ratio for every image
    train_df.loc[idx,'Width'] = width
    train_df.loc[idx,'Height'] = height
    train_df.loc[idx,'Aspect Ratio'] = width/height
    
    #calculate mean and standard deviation for each image
    train_df.loc[idx,'Mean'] = img.mean()
    train_df.loc[idx,'SD'] = img.std()
    
    #calculate mean and standard deviation for each normalized image
    train_df.loc[idx,'Normalized_Mean'] = norm_image.mean()
    train_df.loc[idx,'Normalized_SD'] = norm_image.std()

<a id="subsection-four"></a>
### Outlier Detection using Mean and Standard Deviation

In [None]:
fig =  make_subplots(rows=2,cols=1,subplot_titles=['Original Image', 'Normalized Image'])
colors = ['rgb(13, 200, 58)','rgb(13, 160, 200)','rgb(190, 81, 249)','rgb(248, 104, 73)','rgb(243, 247, 15)']
for idx,class_name in enumerate(train_df['disease_name'].unique()):
    #scatter plot of mean vs. standard deviation of the images for every disease
    fig.add_trace(go.Scatter(x=train_df[train_df['disease_name'] == class_name]['Mean'],
                             y=train_df[train_df['disease_name'] == class_name]['SD'],
                            mode = 'markers',name=class_name,
                            marker_color=colors[idx]),1,1)
    
    #scatter plot of mean vs. standard deviation of the normalized images for every disease
    fig.add_trace(go.Scatter(x=train_df[train_df['disease_name'] == class_name]['Normalized_Mean'],
                             y=train_df[train_df['disease_name'] == class_name]['Normalized_SD'],
                            mode = 'markers',name=class_name,
                            marker_color=colors[idx], showlegend=False),2,1)
#x-axis and y axis title
fig.update_xaxes(title_text="Mean", row=1, col=1)
fig.update_yaxes(title_text="Standard Deviation", row=1, col=1)

fig.update_xaxes(title_text="Mean", row=2, col=1)
fig.update_yaxes(title_text="Standard Deviation", row=2, col=1)
fig.show()
    

**From the Scatter Plot:**

- There are clearly **outliers in the dataset**: images whose normalized mean is greater than 0.7 or less than 0.15. To get a clearer picture of the outliers in each class, we will use a **Box Plot**.
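
Box plots conventionally draw their whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside the whiskers are the candidates for outliers. The sketch below applies that rule directly with NumPy. The helper name `iqr_bounds` and the sample values are hypothetical; the thresholds used later in this notebook were read from the plots by eye, so they need not coincide with these bounds.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR below Q1 and above Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical normalized image means, for illustration only
means = np.array([0.35, 0.38, 0.40, 0.41, 0.42, 0.44, 0.45, 0.47, 0.05, 0.90])

low, high = iqr_bounds(means)
outlier_mask = (means < low) | (means > high)
print("fences:", low, high)
print("outliers:", means[outlier_mask])
```

In this toy sample only the values 0.05 and 0.90 fall outside the fences.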

In [None]:
fig = make_subplots(rows=2,cols=2,
                    subplot_titles=['Mean','Standard Deviation','Normalized Mean','Normalized Standard Deviation'],
                    shared_xaxes=True)
colors = ['rgb(13, 200, 58)','rgb(13, 160, 200)','rgb(190, 81, 249)','rgb(248, 104, 73)','rgb(243, 247, 15)']
for idx,class_name in enumerate(train_df['disease_name'].unique()):
    fig.add_trace(go.Box(y=train_df[train_df['disease_name'] == class_name]['Mean'],
                        name=class_name,showlegend=False,
                        marker_color=colors[idx]),1,1)
    fig.add_trace(go.Box(y=train_df[train_df['disease_name'] == class_name]['Normalized_Mean'],
                        name=class_name,showlegend=False,
                        marker_color=colors[idx]),2,1)
    fig.add_trace(go.Box(y=train_df[train_df['disease_name'] == class_name]['SD'],
                        name=class_name,showlegend=False,
                        marker_color=colors[idx]),1,2)
    fig.add_trace(go.Box(y=train_df[train_df['disease_name'] == class_name]['Normalized_SD'],
                        name=class_name,showlegend=False,
                        marker_color=colors[idx]),2,2)
fig.update_layout(title='Outlier Detection - Box Plot')
fig.show()

<a id="outlier-ss-1"></a>
#### Outliers in Cassava Bacterial Blight (CBB)

In [None]:
#filtering only Cassava Bacterial Blight (CBB) class from original data
CBB  = train_df[train_df['disease_name'] ==  'Cassava Bacterial Blight (CBB)']
print('Number of Images in Cassava Bacterial Blight (CBB) Class: '+str(len(CBB)))

#keep the CBB rows whose normalized mean lies outside [0.17, 0.58] (range observed from the box plot)
#between() includes both endpoints by default
outliers_CBB = CBB[~CBB['Normalized_Mean'].between(0.17,0.58)]
print('Number of Outlier Images in Cassava Bacterial Blight (CBB) Class: '+str(len(outliers_CBB))+'\n')
print(input_path+'train_images/'+outliers_CBB['image_id'].astype(str).values)

In [None]:
c = 0
#plot all the outlier images based on Mean value
for i,idx in enumerate(outliers_CBB.index):
    # start a new figure (a new row of three images) after every three images
    if int(i/3) == c:
        c+=1
        plt.figure(figsize=(25,30))
    plt.subplot(c,3,i%3+1)
    img_name = outliers_CBB.loc[idx,'image_id']
    img = cv2.cvtColor(cv2.imread(os.path.join(input_path,'train_images',img_name)),cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.title(img_name+'\n'+'Mean: '+ str(outliers_CBB.loc[idx,'Normalized_Mean'].round(2)) + '\n'+\
              'Standard Deviation: '+str(outliers_CBB.loc[idx,'Normalized_SD'].round(2)))
    plt.axis('off')
plt.show()

<a id="outlier-ss-2"></a>
#### Outliers in 'Cassava Mosaic Disease (CMD)'

In [None]:
#filtering only Cassava Mosaic Disease (CMD) class from original data
CMD  = train_df[train_df['disease_name'] ==  'Cassava Mosaic Disease (CMD)']
print('Number of Images in Cassava Mosaic Disease (CMD) Class: '+str(len(CMD)))

#keep the CMD rows whose normalized mean lies outside [0.19, 0.64] (range observed from the box plot)
#between() includes both endpoints by default
outliers_CMD = CMD[~CMD['Normalized_Mean'].between(0.19,0.64)]
print('Number of Outlier Images in Cassava Mosaic Disease (CMD) Class: '+str(len(outliers_CMD))+'\n')
print(input_path+'train_images/'+outliers_CMD['image_id'].values)

In [None]:
c = 0
#plot all the outlier images based on Mean value
for i,idx in enumerate(outliers_CMD.index):
    # start a new figure (a new row of three images) after every three images
    if int(i/3) == c:
        c+=1
        plt.figure(figsize=(25,30))
    plt.subplot(c,3,i%3+1)
    img_name = outliers_CMD.loc[idx,'image_id']
    img = cv2.cvtColor(cv2.imread(os.path.join(input_path,'train_images',img_name)),cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.title(img_name+'\n'+'Mean: '+ str(outliers_CMD.loc[idx,'Normalized_Mean'].round(2)) + '\n'+\
              'Standard Deviation: '+str(outliers_CMD.loc[idx,'Normalized_SD'].round(2))+ '\n')
    plt.axis('off')
plt.show()

<a id="outlier-ss-3"></a>
#### Outliers in 'Cassava Brown Streak Disease (CBSD)'

In [None]:
#filtering only  'Cassava Brown Streak Disease (CBSD)' class from original data
CBSD  = train_df[train_df['disease_name'] ==   'Cassava Brown Streak Disease (CBSD)']
print('Number of Images in  Cassava Brown Streak Disease (CBSD) Class: '+str(len(CBSD)))

#keep the CBSD rows whose normalized mean lies outside [0.158, 0.64] (range observed from the box plot)
#between() includes both endpoints by default
outliers_CBSD = CBSD[~CBSD['Normalized_Mean'].between(0.158,0.64)]
print('Number of Outlier Images in  Cassava Brown Streak Disease (CBSD) Class: '+str(len(outliers_CBSD))+'\n')
print(input_path+'train_images/'+outliers_CBSD['image_id'].values)

In [None]:
#plot all the outlier images based on Mean value
plt.figure(figsize=(20,30))
for i,idx in enumerate(outliers_CBSD.index):
    plt.subplot(1,3,i%3+1)
    img_name = outliers_CBSD.loc[idx,'image_id']
    img = cv2.cvtColor(cv2.imread(os.path.join(input_path,'train_images',img_name)),cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.title(img_name+'\n'+'Mean: '+ str(outliers_CBSD.loc[idx,'Normalized_Mean'].round(2)) + '\n'+\
              'Standard Deviation: '+str(outliers_CBSD.loc[idx,'Normalized_SD'].round(2))+ '\n')
    plt.axis('off')
plt.show()

<a id="outlier-ss-4"></a>
#### Outliers in 'Cassava Green Mottle (CGM)'

In [None]:
#filtering only   'Cassava Green Mottle (CGM)' class from original data
CGM  = train_df[train_df['disease_name'] ==    'Cassava Green Mottle (CGM)']
print('Number of Images in Cassava Green Mottle (CGM) Class: '+str(len(CGM)))

#keep the CGM rows whose normalized mean lies outside [0.2, 0.64] (range observed from the box plot)
#between() includes both endpoints by default
outliers_CGM = CGM[~CGM['Normalized_Mean'].between(0.2,0.64)]
print('Number of Outlier Images in Cassava Green Mottle (CGM) Class: '+str(len(outliers_CGM))+'\n')
print(input_path+'train_images/'+outliers_CGM['image_id'].values)

In [None]:
plt.figure(figsize=(20,30))
#plot all the outlier images based on Mean value
for i,idx in enumerate(outliers_CGM.index):
    plt.subplot(1,3,i%3+1)
    img_name = outliers_CGM.loc[idx,'image_id']
    img = cv2.cvtColor(cv2.imread(os.path.join(input_path,'train_images',img_name)),cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.title(img_name+'\n'+'Mean: '+ str(outliers_CGM.loc[idx,'Normalized_Mean'].round(2)) + '\n'+\
              'Standard Deviation: '+str(outliers_CGM.loc[idx,'Normalized_SD'].round(2)))
    plt.axis('off')
plt.show()

<a id="outlier-ss-5"></a>
#### Outliers in 'Healthy'

In [None]:
#filtering only   'Healthy' class from original data
Healthy  = train_df[train_df['disease_name'] ==    'Healthy']
print('Number of Images in Healthy Class: '+str(len(Healthy)))

#keep the Healthy rows whose normalized mean lies outside [0.175, 0.64] (range observed from the box plot)
#between() includes both endpoints by default
outliers_Healthy = Healthy[~Healthy['Normalized_Mean'].between(0.175,0.64)]
print('Number of Outlier Images in Healthy Class: '+str(len(outliers_Healthy))+'\n')
print(input_path+'train_images/'+outliers_Healthy['image_id'].values)

In [None]:
c = 0
#plot all the outlier images based on Mean value
for i,idx in enumerate(outliers_Healthy.index):
    # start a new figure (a new row of four images) after every four images
    if int(i/4) == c:
        c+=1
        plt.figure(figsize=(25,30))
    plt.subplot(c,4,i%4+1)
    img_name = outliers_Healthy.loc[idx,'image_id']
    img = cv2.cvtColor(cv2.imread(os.path.join(input_path,'train_images',img_name)),cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.title(img_name+'\n'+'Mean: '+ str(outliers_Healthy.loc[idx,'Normalized_Mean'].round(2)) + '\n'+\
              'Standard Deviation: '+str(outliers_Healthy.loc[idx,'Normalized_SD'].round(2)))
    plt.axis('off')
plt.show()

<a id="subsection-five"></a>
### Outliers Detection using K-Means Clustering

In [None]:
# create a dict to store the flattened k-means cluster color values of every image, per class
k_means_cluster_colors = {'Cassava Bacterial Blight (CBB)':[],
                          'Cassava Mosaic Disease (CMD)':[],
                          'Cassava Brown Streak Disease (CBSD)':[],
                          'Cassava Green Mottle (CGM)':[],
                          'Healthy' : []}
images = {'Cassava Bacterial Blight (CBB)':[],
                          'Cassava Mosaic Disease (CMD)':[],
                          'Cassava Brown Streak Disease (CBSD)':[],
                          'Cassava Green Mottle (CGM)':[],
                          'Healthy' : []}

for class_name in tqdm(train_df['disease_name'].unique()):
    
    #filter different classes
    df = train_df[train_df['disease_name'] == class_name]
    
    for idx in tqdm(df.index,desc=class_name):
        #read image
        img_name = train_df.loc[idx,'image_id']
        img = cv2.cvtColor(cv2.imread(input_path+'train_images/'+img_name), cv2.COLOR_BGR2RGB)
        img = imutils.resize(img,height=150)
        #normalize the given image
        norm_img = cv2.normalize(img, None, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_32F)
        images[class_name].append(img_name) 
  
        #k-means clustering with 5 clusters
        cluster_centers, distortion = kmeans(norm_img.reshape((-1,3)),5)

        #standard deviation for each color band
        std = np.expand_dims(img.reshape((-1,3)).std(axis=0),1) 

        k_means_cluster_colors[class_name].append((np.matmul(cluster_centers,std).T).astype(int)[0])
    

<a id="ss-k-1"></a>
#### Z-Score

The z-score, also called the standard score, tells whether a data value is greater or smaller than the mean and how far away from the mean it lies. More specifically, it measures how many standard deviations a data point is from the mean. If the absolute value of the z-score is greater than 3, the data point is quite different from the other data points and may be an outlier.

**$$Zscore=\frac{x-Mean}{Standard\ Deviation}$$**
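
As a small worked example of the formula, the sketch below computes z-scores with NumPy and flags the points whose absolute z-score exceeds 3. The helper `z_scores` and the sample values are hypothetical. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

def z_scores(x):
    """Standardize x: subtract the mean and divide by the population std."""
    x = np.asarray(x, dtype=float)
    std = x.std()  # ddof=0, i.e. population standard deviation
    if std == 0:
        # all values identical: nothing deviates from the mean
        return np.zeros_like(x)
    return (x - x.mean()) / std

# Hypothetical data: 29 values near 10 and one value far away
values = np.array([10.0] * 15 + [11.0] * 14 + [60.0])

z = z_scores(values)
outlier_idx = np.where(np.abs(z) > 3)[0]
print("indices with |z| > 3:", outlier_idx)
```

Note that a single extreme value also inflates the standard deviation, so in very small samples even a clear outlier may not reach a z-score of 3.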


In [None]:
colors = ['rgb(13, 200, 58)','rgb(13, 160, 200)','rgb(190, 81, 249)','rgb(248, 104, 73)','rgb(243, 247, 15)']
for idx,class_name in enumerate(k_means_cluster_colors):
    x = np.sum(k_means_cluster_colors[class_name],axis=1)
    z_score = zscore(x)
    fig=go.Figure()
    fig.add_trace(go.Histogram(x=z_score,
                              marker_color=colors[idx]))
    fig.update_layout(title=class_name)
    fig.show()

In [None]:
k_means_outliers = {'Cassava Bacterial Blight (CBB)':[],
                          'Cassava Mosaic Disease (CMD)':[],
                          'Cassava Brown Streak Disease (CBSD)':[],
                          'Cassava Green Mottle (CGM)':[],
                          'Healthy' : []}
print('Outliers based on K-Means result')
for idx,class_name in enumerate(k_means_cluster_colors):
    x = np.sum(k_means_cluster_colors[class_name],axis=1)
    z_score = zscore(x)
    k_means_outliers[class_name].append(list(np.where((z_score>3))[0]) + list(np.where((z_score<-3))[0]))
    print('Number of Outliers from '+class_name+': '+str(len(k_means_outliers[class_name][0])))

##### Visualizing 5 random outlier images from the K-Means method

In [None]:
for class_name in k_means_outliers:
    print('Outliers: '+ class_name)
    plt.figure(figsize=(20,5))
    imgs = []
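    outlier_ids = k_means_outliers[class_name][0]
    #sample at most 5 outliers, since a class may have fewer than 5
    for idx,img_idx in enumerate(np.random.choice(outlier_ids,min(5,len(outlier_ids)),replace=False)):
        img_name = images[class_name][img_idx]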
        img = cv2.cvtColor(cv2.imread(input_path+'train_images/'+img_name),cv2.COLOR_BGR2RGB)
        plt.subplot(1,5,idx%5+1)
        plt.imshow(img)
        plt.axis('off')
        plt.title(class_name+'\n'+img_name)
    plt.show()
    print('Image Names: '+'\n')
    for idx in k_means_outliers[class_name][0]:
        imgs.append(images[class_name][idx])
    print(', '.join(imgs)+'\n\n')
        

<a id="section-six"></a>
## Feature Extraction

In this part, we extract features from a pretrained network and then use PCA and t-SNE (dimensionality reduction methods) to project them into a few dimensions that can be visualized.
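
To illustrate what the PCA step does with a matrix of descriptors, here is a minimal NumPy sketch that centers the data and projects it onto its leading principal directions using the singular value decomposition. It is not scikit-learn's implementation. The helper `pca_svd` is hypothetical, and the random matrix only stands in for the real network features.

```python
import numpy as np

def pca_svd(features, n_components):
    """Project features onto the top principal components via SVD."""
    X = np.asarray(features, dtype=float)
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, S holds the singular values
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = U[:, :n_components] * S[:n_components]
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return projected, ratio[:n_components]

rng = np.random.default_rng(2020)
# Synthetic stand-in for CNN descriptors: 200 samples with 64 features,
# scaled so that earlier features carry more variance
descriptors = rng.normal(size=(200, 64)) * np.linspace(5.0, 0.1, 64)

projected, ratio = pca_svd(descriptors, n_components=5)
print(projected.shape)
print("explained variance ratio of the top 5 components:", ratio)
```

The explained variance ratios are sorted in decreasing order, and their sum shows how much of the total variance the kept components retain.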

<a id="ss-fe-1"></a>
#### VGG

In [None]:
class Load_Dataset(torch.utils.data.Dataset):
    def __init__(self,df):
        self.image_paths = df['image_id']
        self.labels = df['label']
        self.default_transform = Compose([
            Normalize((0.485, 0.456, 0.406),
                                 (0.229, 0.224, 0.225),always_apply=True),
            Resize(224,224),
            ToTensor()
        ])
        
    def __len__(self):
        return self.image_paths.shape[0]
    
    def __getitem__(self,i):
        image_name = self.image_paths[i]
        img_path = os.path.join(input_path,'train_images',image_name)
        image = cv2.cvtColor(cv2.imread(img_path),cv2.COLOR_BGR2RGB)
        image = self.default_transform(image=image)['image']
        label = torch.tensor(self.labels[i])

        return image,label

In [None]:
#Load dataset
train_data = Load_Dataset(train_df)
train_loader = torch.utils.data.DataLoader(train_data,batch_size=128)

#loading pretrained model for vgg19
vgg_m = models.vgg19(pretrained=True)
#taking only the first layer from classifier (4096)
vgg_m.classifier = torch.nn.Sequential(vgg_m.classifier[0])
output_descriptor = np.zeros((1,4096))
output_label = np.zeros((1))
#use the GPU when available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
vgg_m.to(device)
#switch to evaluation mode once, before the loop
vgg_m.eval()
with torch.no_grad():
    for batch_images,labels in tqdm(train_loader):
        batch_images = batch_images.to(device)
        #extract the 4096-dimensional descriptor for every image in the batch
        pred = vgg_m(batch_images)
        #append the descriptors and labels of this batch to the running arrays
        output_descriptor = np.concatenate((output_descriptor,pred.cpu().numpy()),0)
        output_label = np.concatenate((output_label,labels.numpy()))
output_descriptor = output_descriptor[1:]
output_label = output_label[1:]

#### PCA

In [None]:
st_time = time.time()
#extracting top 5 principal component features
pca = PCA(n_components=5)
pca_result_train = pca.fit_transform(output_descriptor)

print('PCA done; Time taken {} seconds'.format(time.time()-st_time))
print('Explained variance ratio: {}'.format(pca.explained_variance_ratio_))
print('Share of variance explained by the top five components: {:.2f}%'.format(100*(pca.explained_variance_ratio_.sum())))

##PCA df
pca_tr = pd.DataFrame()
for idx in range(pca_result_train.shape[1]):
    pca_tr['pca'+str(idx+1)] = pca_result_train[:,idx]

pca_tr['label'] = output_label.astype(int)
pca_tr['disease_name'] = pca_tr['label'].apply(lambda x: disease_names[str(x)])
pca_tr.head()

In [None]:
fig = go.Figure()
colors = ['rgb(13, 200, 58)','rgb(13, 160, 200)','rgb(190, 81, 249)','rgb(248, 104, 73)','rgb(243, 247, 15)']
for idx,dn in enumerate(pca_tr['disease_name'].unique()):
    df = pca_tr[pca_tr['disease_name'] == dn]
    fig.add_trace(go.Scatter3d(x=df['pca1'],y=df['pca2'],z=df['pca3'],mode='markers',marker_color = colors[idx],name=dn))
fig.update_layout(title='PCA 1 Vs PCA 2 Vs PCA 3')
fig.show()

#### T-SNE

In [None]:
st_time = time.time()
t_sne = TSNE(random_state=2020)
t_sne_tr = t_sne.fit_transform(output_descriptor)
print('t-SNE done; Time taken {} seconds'.format(time.time()-st_time))

##T-SNE df
tsne_tr = pd.DataFrame()
for idx in range(t_sne_tr.shape[1]):
    tsne_tr['t_sne'+str(idx+1)] = t_sne_tr[:,idx]
tsne_tr['label'] = output_label.astype(int)
tsne_tr['disease_name'] = tsne_tr['label'].apply(lambda x: disease_names[str(x)])
tsne_tr.head()

In [None]:
fig = go.Figure()
colors = ['rgb(13, 200, 58)','rgb(13, 160, 200)','rgb(190, 81, 249)','rgb(248, 104, 73)','rgb(243, 247, 15)']
for idx,dn in enumerate(tsne_tr['disease_name'].unique()):
    df = tsne_tr[tsne_tr['disease_name'] == dn]
    fig.add_trace(go.Scatter(x=df['t_sne1'],y=df['t_sne2'],mode='markers',marker_color = colors[idx],name=dn))
fig.update_layout(title='TSNE 1 Vs TSNE 2')
fig.update_xaxes(title_text="TSNE_1")
fig.update_yaxes(title_text="TSNE_2")
fig.show()

## Please share your feedback and comments, and do **"Upvote"**!