# Modeling our Data

Author: Sean Flannery [sflanner@purdue.edu](mailto:sflanner@purdue.edu)

Last Updated: June 15th, 2019

**Libraries Needed**: plotly, pandas, numpy, os, gensim, nltk, tqdm, sklearn

We use `pandas` to read the data into a DataFrame and rely on it throughout this notebook. For an introduction to pandas, see this [tutorial](https://www.learnpython.org/en/Pandas_Basics).

In [1]:
FILE_PREFIX = ''

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook as tqdm
import os
import collections
import random

In [3]:
import plotly
plotly.__version__

'3.10.0'

Recommended formula for `sizeref`, which controls how marker size values are scaled on screen:
`sizeref = 2. * max(array of size values) / (desired maximum marker size ** 2)`

This recommendation comes from [this tutorial](https://plot.ly/python/bubble-maps/)
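As a quick check of the formula above, here is a minimal sketch that evaluates it for a small list of sizes. The helper name `compute_sizeref` and the example values are assumptions made for illustration; they are not part of plotly.

```python
def compute_sizeref(sizes, max_marker_px):
    """Apply the recommended formula: 2 * max(sizes) / max_marker_px ** 2."""
    return 2.0 * max(sizes) / (max_marker_px ** 2)


# Hypothetical size values; the largest one (100) is the reference for scaling.
example_sizes = [5, 20, 100]
example_sizeref = compute_sizeref(example_sizes, max_marker_px=100)
print(example_sizeref)
```

With a largest size of 100 and a desired maximum marker size of 100, this gives the same value as the `2*100/(100**2)` expression used later in this notebook.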

In [4]:
import plotly.plotly as py
import plotly.graph_objs as go
import pandas as pd
import math

Read in the clustering results from the previous section.

In [5]:
DESIRED_CLUSTER = 'xmeans_cluster'
#DESIRED_CLUSTER = 'kmeans_cluster'
#DESIRED_CLUSTER = 'markov_cluster'
#DESIRED_CLUSTER = 'hierarchical_cluster'

In [6]:
df = pd.read_csv(FILE_PREFIX + 'cluster_results.csv')[
    ['year', 
     'title', 
     'abstract', 
     'introduction', 
     'article-link' ,
     'xmeans_cluster', 
     'kmeans_cluster', 
     'pca_feature1', 
     'pca_feature2', 
     'pca_feature3']]
#df = df[df.year==2019]
df.head()

Unnamed: 0,year,title,abstract,introduction,article-link,xmeans_cluster,kmeans_cluster,pca_feature1,pca_feature2,pca_feature3
0,2019,Database Resources of the BIG Data Center in 2019,The BIG Data Center at Beijing Institute of Ge...,The BIG Data Center (http://bigd.big.ac.cn) at...,https://doi.org/10.1093/nar/gky993,122,2,-2.56182,-1.052873,1.107861
1,2019,The European Bioinformatics Institute in 2018:...,The European Bioinformatics Institute (https:/...,"A primary mission of EMBL-EBI is to collect, o...",https://doi.org/10.1093/nar/gky1124,53,32,-0.760801,3.730246,0.050672
2,2019,Database resources of the National Center for ...,The National Center for Biotechnology Informat...,The National Center for Biotechnology Informat...,https://doi.org/10.1093/nar/gky1069,56,29,0.40927,0.206475,0.310989
3,2019,AmtDB: a database of ancient human mitochondri...,Ancient mitochondrial DNA is used for tracing ...,Ancient DNA (aDNA) is a genetic material obtai...,https://doi.org/10.1093/nar/gky843,48,19,12.458119,-5.506885,4.119361
4,2019,AnimalTFDB 3.0: a comprehensive resource for a...,The Animal Transcription Factor DataBase (Anim...,Transcription factors (TFs) are special protei...,https://doi.org/10.1093/nar/gky822,124,2,-5.437143,-1.979282,2.420911


Get the top N most common clusters from the column selected by `DESIRED_CLUSTER`.

In [7]:
from collections import Counter
num_clusters = 10
top_counter = Counter(df[DESIRED_CLUSTER]).most_common(num_clusters) 
print("Top " + str(num_clusters) + " Largest Clusters ")
print(top_counter)
top_clusters = [c[0] for c in top_counter]

Top 10 Largest Clusters 
[(57, 121), (80, 107), (85, 84), (79, 84), (129, 83), (180, 81), (53, 76), (39, 75), (124, 71), (7, 71)]
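To make the counting step concrete, here is a small sketch of `Counter.most_common` on a toy list. The labels below are made up for illustration and do not come from `cluster_results.csv`.

```python
from collections import Counter

# Hypothetical cluster labels, one per article.
example_labels = [57, 80, 57, 85, 57, 80, 7]

# most_common(n) returns (label, count) pairs, largest count first.
example_top = Counter(example_labels).most_common(2)

# Keep only the labels, as the notebook does for top_clusters.
example_top_ids = [cluster_id for cluster_id, _count in example_top]
print(example_top, example_top_ids)
```

Note that when two clusters have the same count (as clusters 85 and 79 do above), which one makes the cut at the boundary depends on how `most_common` orders ties, so the selection near the cutoff should not be read as meaningful.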


In [8]:
cluster_colors = dict()
for c in top_clusters:
    r = random.randint(0,255)
    g = random.randint(0,255)
    b = random.randint(0,255)
    cluster_colors[c] = 'rgb(' + str(r) + ',' + str(g) + ',' + str(b) + ')'
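Because the colors above are drawn with `random.randint`, they change on every run and two clusters can end up with near-identical colors. One possible alternative, sketched here under the assumption that evenly spaced hues are acceptable, uses the standard-library `colorsys` module. The function name `make_cluster_colors` and the saturation and value settings are illustrative choices.

```python
import colorsys


def make_cluster_colors(cluster_ids, saturation=0.75, value=0.9):
    """Assign each cluster id an 'rgb(r,g,b)' string with evenly spaced hues."""
    n = len(cluster_ids)
    colors = {}
    for i, cluster_id in enumerate(cluster_ids):
        # Hue runs over [0, 1); spacing by 1/n keeps the colors apart.
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colors[cluster_id] = 'rgb({},{},{})'.format(
            int(r * 255), int(g * 255), int(b * 255))
    return colors


# Hypothetical cluster ids for illustration.
example_colors = make_cluster_colors([57, 80, 85])
print(example_colors)
```

The same input list always produces the same mapping, which makes figures comparable across runs.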

In [9]:
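# Keep only rows in the largest clusters; copy() so later column assignments
# modify this DataFrame directly instead of a view of the original.
df = df[df[DESIRED_CLUSTER].isin(top_clusters)].copy()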
df.head()

Unnamed: 0,year,title,abstract,introduction,article-link,xmeans_cluster,kmeans_cluster,pca_feature1,pca_feature2,pca_feature3
1,2019,The European Bioinformatics Institute in 2018:...,The European Bioinformatics Institute (https:/...,"A primary mission of EMBL-EBI is to collect, o...",https://doi.org/10.1093/nar/gky1124,53,32,-0.760801,3.730246,0.050672
4,2019,AnimalTFDB 3.0: a comprehensive resource for a...,The Animal Transcription Factor DataBase (Anim...,Transcription factors (TFs) are special protei...,https://doi.org/10.1093/nar/gky822,124,2,-5.437143,-1.979282,2.420911
6,2019,ChIPprimersDB: a public repository of verified...,Chromatin immunoprecipitation (ChIP) has usher...,Polymerase chain reaction (PCR) is widely used...,https://doi.org/10.1093/nar/gky813,85,17,-0.543181,-0.780428,1.649599
8,2019,COXPRESdb v7: a gene coexpression database for...,The advent of RNA-sequencing and microarray te...,"Owing to high-throughput technologies, a huge ...",https://doi.org/10.1093/nar/gky1155,85,29,-2.825087,0.891143,-0.628933
10,2019,DDBJ update: the Genomic Expression Archive (G...,The Genomic Expression Archive (GEA) for funct...,"The DNA Data Bank of Japan (DDBJ,https://www.d...",https://doi.org/10.1093/nar/gky1002,85,2,-0.221178,-1.608512,2.00044


In [10]:
df['text'] = 'Title: ' + df['title'] + '<br>Year: ' + df['year'].astype(str) + '<br>Cluster: ' + df[DESIRED_CLUSTER].astype(str)
limits = [(0,2),(3,10),(11,20),(21,50),(50,3000)]
colors = ["rgb(0,116,217)","rgb(255,65,54)","rgb(133,20,75)","rgb(255,133,27)","lightgrey"]
cities = []
scale = 5000

In [11]:
# sort_values returns a new DataFrame, so the result must be assigned back.
df = df.sort_values(['year', DESIRED_CLUSTER])

In [12]:
# iterrows() yields copies of each row, so assigning to row['text'] never
# reaches df; build the hover text per row and assign the column instead.
df['text'] = df.apply(
    lambda row: ('Title: {title}<br>' +
                 'Year: {year}<br>' +
#                'Abstract: {abstract}<br>' +
#                'Introduction: {introduction}<br>' +
                 'Cluster: {cluster}<br>'
                 ).format(title=row['title'],
                          year=row['year'],
#                         abstract=row['abstract'],
#                         introduction=row['introduction'],
                          cluster=row[DESIRED_CLUSTER]),
    axis=1)

In [13]:
sizeref = 2*100/(100**2)

In [14]:
def getTraceByYear(year, is3d=False):
    """Build one scatter trace (2D, or 3D if is3d) for the rows of a given year."""
    tmpdf = df[df.year == year]
    text = tmpdf['text']
    # Look up the color assigned to each row's cluster.
    color_vals = tmpdf[DESIRED_CLUSTER].map(cluster_colors).tolist()
    if is3d:
        return go.Scatter3d(
            x=tmpdf['pca_feature1'],
            y=tmpdf['pca_feature2'],
            z=tmpdf['pca_feature3'],
            mode='markers',
            text=text,
            name=str(year),
            marker=dict(
                symbol='circle',
                sizemode='area',
                sizeref=sizeref,
                size=[3]*len(tmpdf),
                color=color_vals))
    else:
        return go.Scatter(
            x=tmpdf['pca_feature1'],
            y=tmpdf['pca_feature2'],
            mode='markers',
            text=text,
            name=str(year),
            marker=dict(
                symbol='circle',
                sizemode='area',
                sizeref=sizeref,
                size=[1]*len(tmpdf),
                color=color_vals))

## 2D Plot of NAR Data

In [15]:
# One trace per year, in chronological order so the legend is sorted.
data = [getTraceByYear(year) for year in sorted(df['year'].unique())]

In [16]:
layout = go.Layout(
    title='NAR Database Clusters 2D',
    xaxis=dict(
        title='PCA Component 1'
    ),
    yaxis=dict(
        title='PCA Component 2'
    ),
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
    legend=None
)

fig = go.Figure(data=data, layout=layout)

In [17]:
py.iplot(fig, filename='Markov NAR Cluster Figures 2D')





In [18]:
plotly.offline.plot(fig)

'temp-plot.html'

## 3D Plot of NAR Data

In [19]:
# One 3D trace per year, in chronological order so the legend is sorted.
data = [getTraceByYear(year, is3d=True) for year in sorted(df['year'].unique())]

Next, we set the layout, which holds the title and axis labels for the figure.

In [20]:
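layout = go.Layout(
    title='NAR Database Clusters 3D',
    # 3D traces read their axis settings from `scene`, not from the
    # top-level xaxis/yaxis used by 2D plots.
    scene=dict(
        xaxis=dict(title='PCA Component 1'),
        yaxis=dict(title='PCA Component 2'),
        zaxis=dict(title='PCA Component 3')
    ),
    paper_bgcolor='rgb(243, 243, 243)',
    plot_bgcolor='rgb(243, 243, 243)',
    legend=None
)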

fig = go.Figure(data=data, layout=layout)

In [21]:
py.iplot(fig, filename='Markov NAR Cluster Figure 3D')





In [22]:
plotly.offline.plot(fig)

'temp-plot.html'