## Unsupervised Metadata Assessment

This notebook contains the functions and methodology used to classify key class variables.  
The dataset comes from the Gene Expression Omnibus ([GEO](https://www.ncbi.nlm.nih.gov/geo/)).  
The pipeline is divided into several phases:  
     **1. Because the classifier needs binary classes, the data is split into one subset per class (that class versus the rest).**  
     **2. Features are created for each class from a topic modeling perspective.**  
     **3. A matrix is built with one row per case and, for each topic word found, its corresponding value.**  
     **4. Because each subset is unbalanced, the larger class is downsampled to obtain a balanced set and reduce the risk of overfitting.**  
     **5. A logistic classifier is trained, since the input variables form a matrix of weights.**  
     **6. The model is evaluated on a held-out test set.**

In [1]:
import turicreate as tc

### Geo Dataset

In [2]:
sf_keys = tc.SFrame('NCBItrainset.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
sf_keys

diseaseName,class
skin tumour,DiseaseClass
cancer,DiseaseClass
colon cancers,DiseaseClass
adenomatous polyposis coli ...,SpecificDisease
APC,SpecificDisease
colon and some other cancers ...,CompositeMention
skin tumours,DiseaseClass
pilomatricomas,SpecificDisease
pilomatricomas,SpecificDisease
skin tumour,DiseaseClass


In [4]:
#import pandas as pd
#df_values = pd.read_csv('../datasets/geo_values.csv')
#df_values[df_values['values'].notna()]
#sf_values = tc.SFrame('../datasets/geo_values.csv')

In [4]:
disease_classes = [i for i in sf_keys['class'].unique()]

In [5]:
print('Number of key value pairs: {}\nDifferent key classes: {}'.format(len(sf_keys), disease_classes))

Number of key value pairs: 1301
Different key classes: ['Modifier', 'CompositeMention', 'SpecificDisease', 'DiseaseClass']


###  Key class subsets

One subset is needed per key class, so the data is filtered by class: each subset keeps the rows of that class and relabels all other rows as `no <class>`.

In [6]:
def create_subsets(df, column_category):
    tuples = []
    for category in df[column_category].unique():
        yes_category = df[df[column_category] == category]
        no_category = df[df[column_category] != category]
        no_category[column_category] = 'no '+category
        table = yes_category.append(no_category)
        tuples.append((category, table))
        
    tables = {key: value for (key, value) in tuples}
    print(tables.keys())
    return tables

In [7]:
geo_tables = create_subsets(sf_keys, 'class')

dict_keys(['Modifier', 'CompositeMention', 'SpecificDisease', 'DiseaseClass'])
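
To make the one-vs-rest relabeling concrete, here is a minimal plain-Python sketch of the same idea on a handful of rows taken from the tables in this notebook. It does not use turicreate; the helper name `one_vs_rest` is hypothetical and only mirrors what `create_subsets` does.

```python
# Plain-Python sketch of one-vs-rest relabeling (no turicreate needed).
rows = [
    ("skin tumour", "DiseaseClass"),
    ("APC", "SpecificDisease"),
    ("pilomatricomas", "SpecificDisease"),
    ("breast or ovarian cancer", "CompositeMention"),
]

def one_vs_rest(rows, category):
    """Keep rows of `category` first, then relabel every other row as 'no <category>'."""
    yes = [(name, label) for name, label in rows if label == category]
    no = [(name, "no " + category) for name, label in rows if label != category]
    return yes + no

# One binary table per class, keyed by the class name.
subsets = {label: one_vs_rest(rows, label) for _, label in rows}

for name, label in subsets["SpecificDisease"]:
    print(f"{name!r:35} {label}")
```

Every subset has the same number of rows as the input; only the labels change, which is why each subset inherits the class imbalance of the full table.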


In [9]:
for k in geo_tables.keys():
    print(str(k), round((sum(list(geo_tables[str(k)]['class'] == str(k)))/len(sf_keys)),2))

Modifier 0.25
CompositeMention 0.02
SpecificDisease 0.6
DiseaseClass 0.13


Here is an example of what the first 40 rows of a subset look like:

In [10]:
geo_tables['Modifier'].print_rows(40,2)

+--------------------------+----------+
|       diseaseName        |  class   |
+--------------------------+----------+
|          tumour          | Modifier |
|     hemochromatosis      | Modifier |
|            HH            | Modifier |
|            HH            | Modifier |
|            HH            | Modifier |
|            HH            | Modifier |
|            HH            | Modifier |
|      ovarian cancer      | Modifier |
|      breast cancer       | Modifier |
|      ovarian cancer      | Modifier |
|      ovarian cancer      | Modifier |
|      ovarian cancer      | Modifier |
|      ovarian cancer      | Modifier |
|          cancer          | Modifier |
|      ovarian cancer      | Modifier |
|      ovarian cancer      | Modifier |
| absence of functional C7 | Modifier |
|         dystonic         | Modifier |
|           FMF            | Modifier |
|           ALPS           | Modifier |
|           ALPS           | Modifier |
|           ALPS           | Modifier |


### Pipeline method for one class

In [59]:
key_class = 'CompositeMention'
class_column_name = 'class'  # name of the label column in sf_keys

### Extracting features from topic modeling

In [60]:
def create_features(category_df, n_features):
    # Remove stopwords and convert to bag of words
    doc = tc.text_analytics.count_words(category_df['diseaseName'])
    doc = doc.dict_trim_by_keys(tc.text_analytics.stop_words(), True)
    
    # Learn topic model
    model = tc.topic_model.create(doc, verbose=False)
    # Aggregate the scores of each unique word across all topics
    sf_topics = model.get_topics()
    sf_words = sf_topics.groupby(key_column_names='word', operations={'sum_scores': tc.aggregate.SUM('score')})
    
    # Sort by summed score and filter out any word that is itself a class label
    sf_words = sf_words.sort('sum_scores', ascending=False).filter_by(disease_classes, 'word', exclude=True)
    
    # Keep the top n_features words as the features for this key class
    features = [i for i in sf_words['word']][0:n_features]
    return sf_words, features

In [61]:
sf_words, features = create_features(geo_tables[key_class], 10)

In [62]:
sf_words.print_rows(10,2)

+------------+---------------------+
|    word    |      sum_scores     |
+------------+---------------------+
| deficiency | 0.37253717819240517 |
|  disease   | 0.19708714006519984 |
|  syndrome  |  0.1947309410646745 |
| dystrophy  |  0.1333002841441469 |
|    ald     | 0.13079235760136804 |
|   cancer   | 0.11069063386944271 |
|  muscular  | 0.10244560966096244 |
|    apc     | 0.10150260656240491 |
|   breast   | 0.09964754886254477 |
|   linked   | 0.08435706864537848 |
+------------+---------------------+
[40 rows x 2 columns]
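
The aggregation step inside `create_features` can be sketched without turicreate. The example below assumes the topic model yields `(topic, word, score)` rows; the rows and the helper name `top_words` are hypothetical and only illustrate summing a word's score over topics, dropping class labels, and keeping the top `n` words.

```python
from collections import defaultdict

# Hypothetical (topic, word, score) rows standing in for model.get_topics().
topic_rows = [
    (0, "deficiency", 0.20),
    (0, "syndrome", 0.10),
    (1, "deficiency", 0.15),
    (1, "cancer", 0.12),
    (2, "SpecificDisease", 0.30),  # a class label appearing as a word
]
class_labels = {"Modifier", "CompositeMention", "SpecificDisease", "DiseaseClass"}

def top_words(topic_rows, excluded, n_features):
    """Sum each word's score over all topics, drop excluded words, return the top n."""
    totals = defaultdict(float)
    for _topic, word, score in topic_rows:
        totals[word] += score
    ranked = sorted(
        ((w, s) for w, s in totals.items() if w not in excluded),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n_features]

ranked = top_words(topic_rows, class_labels, n_features=2)
print(ranked)
```

A word that appears in several topics accumulates score from each of them, so frequent generic terms such as "disease" or "syndrome" tend to rank high for every class.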



### Creating a feature matrix

In [63]:
def get_input_matrix(features, category_df, sf_words):
    tuples = []
    for word in features:
        feature_vector = [1 if (word in i) else 0 for i in category_df['diseaseName']]
        tuples.append((word, feature_vector))
        
    sf_features = tc.SFrame({key: value for (key, value) in tuples})
    #concatenating the features with the category matrix
    category_df = category_df.add_row_number()
    sf_features = sf_features.add_row_number()
    final_table = category_df.join(sf_features, on='id', how='left')
    for f in features:
        score = sf_words[sf_words['word'] == str(f)]['sum_scores'].astype(float)[0]
        final_table[str(f)] = [(1.0+score) * i for i in final_table[str(f)]]
        
    return final_table

In [64]:
input_matrix = get_input_matrix(features, geo_tables[key_class], sf_words)

In [65]:
input_matrix

id,diseaseName,class,ald,apc,breast,cancer,deficiency
0,colon and some other cancers ...,CompositeMention,0.0,0.0,0.0,1.110690633869443,0.0
1,breast or ovarian cancer,CompositeMention,0.0,0.0,1.0996475488625448,1.110690633869443,0.0
2,disorder of lymphocyte homeostasis and ...,CompositeMention,0.0,0.0,0.0,0.0,0.0
3,vasculopathy of the heart and brain ...,CompositeMention,0.0,0.0,0.0,0.0,0.0
4,breast and ovarian cancer,CompositeMention,0.0,0.0,1.0996475488625448,1.110690633869443,0.0
5,breast or ovarian cancer,CompositeMention,0.0,0.0,1.0996475488625448,1.110690633869443,0.0
6,breast or ovarian cancer,CompositeMention,0.0,0.0,1.0996475488625448,1.110690633869443,0.0
7,"contractures of the elbows, Achilles tendons ...",CompositeMention,0.0,0.0,0.0,0.0,0.0
8,inherited breast and ovarian cancers ...,CompositeMention,0.0,0.0,1.0996475488625448,1.110690633869443,0.0
9,familial breast and ovarian cancers ...,CompositeMention,0.0,0.0,1.0996475488625448,1.110690633869443,0.0

disease,dystrophy,linked,muscular,syndrome
0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0,0.0
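
The weighting rule used by `get_input_matrix` is: a cell is `1 + sum_scores` when the word occurs in the name, and `0.0` otherwise. The plain-Python sketch below reproduces it with scores copied from the `sum_scores` table above. It also illustrates two properties of the `word in name` test that are worth knowing: it is case-sensitive, so the lower-cased word `apc` does not match `APC`, and it matches substrings, so `ald` would match inside a longer word. The third name and the `token_features` variant are hypothetical, added only for illustration.

```python
# Scores copied from the sum_scores table; the last name is a hypothetical example.
scores = {
    "cancer": 0.11069063386944271,
    "breast": 0.09964754886254477,
    "apc": 0.10150260656240491,
    "ald": 0.13079235760136804,
}
names = ["breast or ovarian cancer", "APC", "Waldenstrom macroglobulinemia"]

def weighted_features(names, scores):
    """One row per name: (1 + score) if the word is a substring of the name, else 0.0."""
    return [
        {word: (1.0 + score) if word in name else 0.0 for word, score in scores.items()}
        for name in names
    ]

def token_features(names, scores):
    """Variant: match whole lower-cased tokens instead of raw substrings."""
    rows = []
    for name in names:
        tokens = set(name.lower().split())
        rows.append({w: (1.0 + s) if w in tokens else 0.0 for w, s in scores.items()})
    return rows

matrix = weighted_features(names, scores)
tokens = token_features(names, scores)
print(matrix[1]["apc"], tokens[1]["apc"])  # substring rule vs. token rule for "APC"
print(matrix[2]["ald"], tokens[2]["ald"])  # substring rule vs. token rule for "ald"
```

Whether token matching is preferable depends on the data; plural forms such as "cancers" match under the substring rule but not under the token rule.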


In [33]:
input_matrix.export_csv('feature_matrix_modifier.csv')

### Class balancing

In [16]:
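def binary_balancing(sf, key_class, class_column_name):
    # Keep every row of the target class and downsample the remaining rows
    class_set = sf[sf[class_column_name] == key_class]
    inverse_set = sf[sf[class_column_name] != key_class]
    # The fraction is relative to the rows being sampled and capped at 1.0
    fraction = min(1.0, float(len(class_set)) / float(len(inverse_set)))
    return class_set.append(inverse_set.sample(fraction))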

In [17]:
balanced_input_matrix = binary_balancing(input_matrix, key_class, class_column_name)
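
As a rough illustration of the balancing idea, the sketch below downsamples the larger group in plain Python. It is not the turicreate code: `balance_binary` is a hypothetical helper that draws an exact number of rows, whereas `SFrame.sample` takes a fraction and, by default, returns roughly that fraction of rows rather than an exact count. The toy data is made up.

```python
import random

def balance_binary(rows, key_class, seed=0):
    """Keep all rows of key_class and draw an equally sized random sample of the rest."""
    positives = [r for r in rows if r[1] == key_class]
    negatives = [r for r in rows if r[1] != key_class]
    k = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return positives + rng.sample(negatives, k)

# Hypothetical toy data: 3 positives and 12 negatives.
rows = [(f"case {i}", "CompositeMention") for i in range(3)]
rows += [(f"case {i}", "no CompositeMention") for i in range(3, 15)]

balanced = balance_binary(rows, "CompositeMention")
counts = {}
for _, label in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

Downsampling discards data, so with a very rare class (CompositeMention is about 2% of the rows here) the balanced set becomes small and the evaluation correspondingly noisy.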

Here is an example of what the first 10 rows of the balanced input matrix look like:

In [18]:
balanced_input_matrix

id,key,key_class,cell,days,infection,line,months,point
0,human cell line,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0
1,dendritic cell lineages,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0
2,or cell line,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0
3,huh7 cell line,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0
4,hybrid cell line,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0
5,tumor cell line,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0
6,atcc cell line,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0
7,host cell line,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0
8,donor cell line,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0
9,fibrosarcoma cell line,cell line,1.238190701358203,0.0,0.0,1.160765616429564,0.0,0.0

post,source,type,years
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0
0.0,0.0,0.0,0.0


### Logistic Classifier

In [69]:
train_data, test_data = input_matrix.random_split(0.65)

In [70]:
model = tc.logistic_classifier.create(train_data, target = 'class', features = features, max_iterations = 10)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [71]:
model

Class                          : LogisticClassifier

Schema
------
Number of coefficients         : 11
Number of examples             : 829
Number of classes              : 2
Number of feature columns      : 10
Number of unpacked features    : 10

Hyperparameters
---------------
L1 penalty                     : 0.0
L2 penalty                     : 0.01

Training Summary
----------------
Solver                         : newton
Solver iterations              : 9
Solver status                  : SUCCESS: Optimal solution found.
Training time (sec)            : 0.0258

Settings
--------
Log-likelihood                 : 61.1179

Highest Positive Coefficients
-----------------------------
linked                         : 5.4973
dystrophy                      : 4.9701
(intercept)                    : 4.9608
syndrome                       : 3.5437
disease                        : 3.4448

Lowest Negative Coefficients
----------------------------
muscular                       : -8.5409
cancer  
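
To show how such coefficients turn a feature row into a probability, here is a minimal sketch of the logistic scoring rule, `sigmoid(intercept + sum(coefficient * feature))`. The coefficients are copied from the summary above; the `cancer` coefficient is cut off in the output and is therefore left out, so the numbers are illustrative only. The helper name and the example rows are hypothetical, and the sketch makes no claim about which of the two labels the model treats as the positive one.

```python
import math

# Coefficients copied from the model summary (only the ones that are visible).
intercept = 4.9608
coefficients = {
    "linked": 5.4973,
    "dystrophy": 4.9701,
    "syndrome": 3.5437,
    "disease": 3.4448,
    "muscular": -8.5409,
}

def predict_probability(row, coefficients, intercept):
    """Logistic regression score: sigmoid(intercept + sum of coefficient * feature value)."""
    z = intercept + sum(coef * row.get(name, 0.0) for name, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical rows: no feature present, and only "muscular" present (1 + its sum_scores).
p_empty = predict_probability({}, coefficients, intercept)
p_muscular = predict_probability({"muscular": 1.0 + 0.10244560966096244}, coefficients, intercept)
print(p_empty, p_muscular)
```

The large positive intercept means a row with no matching feature is already scored close to 1 for the positive label, so a strongly negative coefficient is needed to push a row toward the other label.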

In [72]:
predictions = model.classify(test_data)
evaluation_test_set = model.evaluate(test_data)

In [73]:
evaluation_test_set

{'accuracy': 0.9836448598130841,
 'auc': 0.7709219858156029,
 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 3
 
 Data:
 +---------------------+---------------------+-------+
 |     target_label    |   predicted_label   | count |
 +---------------------+---------------------+-------+
 |   CompositeMention  | no CompositeMention |   5   |
 | no CompositeMention | no CompositeMention |  421  |
 | no CompositeMention |   CompositeMention  |   2   |
 +---------------------+---------------------+-------+
 [3 rows x 3 columns],
 'f1_score': 0.9917550058892814,
 'log_loss': 0.06729743629046574,
 'precision': 0.9882629107981221,
 'recall': 0.9952718676122931,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+-----+---+
 | threshold | fpr | tpr |  p  | n |
 +-----------+-----+-----+-----+---+
 |    0.0    | 1.0 | 1.0 | 423 | 5 |
 |   1e-05   | 1.0 | 1.0 | 423 | 5 |
 |   
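
The summary metrics can be recomputed from the confusion matrix, which helps when reading them on an unbalanced test set. The sketch below uses the three counts shown above and a hypothetical helper, `per_class_metrics`. Treating `no CompositeMention` as the positive label reproduces the reported precision, recall and F1, while treating `CompositeMention` as positive shows that none of its 5 test rows were predicted correctly.

```python
# Counts copied from the confusion matrix: (target_label, predicted_label) -> count.
confusion = {
    ("CompositeMention", "no CompositeMention"): 5,
    ("no CompositeMention", "no CompositeMention"): 421,
    ("no CompositeMention", "CompositeMention"): 2,
}

def per_class_metrics(confusion, positive):
    """Precision, recall and F1 when `positive` is treated as the positive label."""
    tp = confusion.get((positive, positive), 0)
    fp = sum(c for (t, p), c in confusion.items() if p == positive and t != positive)
    fn = sum(c for (t, p), c in confusion.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

total = sum(confusion.values())
accuracy = sum(c for (t, p), c in confusion.items() if t == p) / total
majority = per_class_metrics(confusion, "no CompositeMention")
minority = per_class_metrics(confusion, "CompositeMention")
print("accuracy:", accuracy)
print("no CompositeMention (precision, recall, f1):", majority)
print("CompositeMention (precision, recall, f1):", minority)
```

High accuracy here mostly reflects the share of the majority label in the test set, which is the motivation for the class balancing step.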

---
### Pipeline methodology

In [26]:
def method_pipeline(sf, class_column_name):
    # Run the full pipeline (features, matrix, balancing, classifier) once per key class
    output = {}
    # Build the one-vs-rest subsets from sf itself rather than from a global table
    tables = create_subsets(sf, class_column_name)
    for key_class, table in tables.items():
        sf_words, features = create_features(table, 10)
        input_matrix = get_input_matrix(features, table, sf_words)
        balanced_input_matrix = binary_balancing(input_matrix, key_class, class_column_name)
        train_data, test_data = balanced_input_matrix.random_split(0.7)
        model = tc.logistic_classifier.create(train_data, target=class_column_name, features=features, max_iterations=10, verbose=False)
        # Evaluate on the held-out split and store the metrics under the class name
        output[key_class] = model.evaluate(test_data)
    return output

In [27]:
results = method_pipeline(sf_keys, 'key_class')

In [28]:
results.keys()

dict_keys(['gender', 'cell line', 'genotype', 'sex', 'treatment', 'age', 'cell type', 'strain', 'time', 'disease', 'tissue'])

In [33]:
results['cell type']

{'accuracy': 0.9545454545454546, 'auc': 1.0, 'confusion_matrix': Columns:
 	target_label	str
 	predicted_label	str
 	count	int
 
 Rows: 3
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |  cell type   |    cell type    |   10  |
 | no cell type |   no cell type  |   11  |
 |  cell type   |   no cell type  |   1   |
 +--------------+-----------------+-------+
 [3 rows x 3 columns], 'f1_score': 0.9565217391304348, 'log_loss': 0.09449158474618659, 'precision': 0.9166666666666666, 'recall': 1.0, 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+-----+-----+----+----+
 | threshold | fpr | tpr | p  | n  |
 +-----------+-----+-----+----+----+
 |    0.0    | 1.0 | 1.0 | 11 | 11 |
 |   1e-05   | 1.0 | 1.0 | 11 | 11 |
 |   2e-05   | 1.0 | 1.0 | 11 | 11 |
 |   3e-05   | 1.0 | 1.0 | 11 | 11 |
 |   4e-05   | 1.0 | 1.0 | 11 | 11 |
 |   5e-

#### ROC curve

In [None]:
import matplotlib.pyplot as plt

In [None]:
roc_curve_table = results['treatment']['roc_curve']
plt.scatter(roc_curve_table['fpr'], roc_curve_table['tpr'])
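
The `auc` value in the evaluation output corresponds to the area under this curve. As a minimal sketch, the area can be approximated from `(fpr, tpr)` points with the trapezoidal rule. The helper name `auc_trapezoid` and the two sets of points are hypothetical and not taken from the notebook's results.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (fpr, tpr) points, using the trapezoidal rule."""
    pts = sorted(points)  # order by fpr, then tpr
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical ROC points: a chance-level classifier and a perfect one.
chance = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(auc_trapezoid(chance), auc_trapezoid(perfect))
```

With the table above, the points could be built as `zip(roc_curve_table['fpr'], roc_curve_table['tpr'])`, assuming both columns are present as shown in the evaluation output.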

In [None]:
#test_data.export_csv('../datasets/cell line/test_data_cell_line.csv')
#predictions.export_csv('../datasets/cell line/prediction_cell_line.csv')