# Data

Create three databases, one per cancer type. The types under study are **skin**, **stomach** and **thyroid** cancer.

In [1]:
import pymongo

DATABASE_NAMES = ["skin_cancer_db", "stomach_cancer_db", "thyroid_cancer_db"]
SKIN = 0
STOMACH = 1
THYROID = 2

mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")

mongo_client.drop_database(DATABASE_NAMES[SKIN])
mongo_client.drop_database(DATABASE_NAMES[STOMACH])
mongo_client.drop_database(DATABASE_NAMES[THYROID])

skin_cancer_db = mongo_client[DATABASE_NAMES[SKIN]]
stomach_cancer_db = mongo_client[DATABASE_NAMES[STOMACH]]
thyroid_cancer_db = mongo_client[DATABASE_NAMES[THYROID]]

databases = [skin_cancer_db, stomach_cancer_db, thyroid_cancer_db]

# server_info is a method: call it to query the server (raises if MongoDB is unreachable)
print("MongoDB version:", mongo_client.server_info()["version"])


Each database has three collections: **Gene Expression**, **Copy Number Analysis** and **Human Methylation**.

The attributes of each collection are listed in a header file in the `data` folder.

In [2]:
import csv

COLLECTIONS_NAMES = ["Copy Number Analysis", "Human Methylation", "Gene Expression"]
CNA = 0
METHYLATION = 1
RNA = 2

# Reads the first row of a header file and returns it as a list of attribute names
def read_header(path):
    with open(path, newline="") as header_file:
        return next(csv.reader(header_file))

# List with each collection's attributes
cna_attributes = read_header("data/cna.header")
methylation_attributes = read_header("data/methylation.header")
rna_attributes = read_header("data/rna.header")

print("CNA attributes: ", len(cna_attributes))
print("Methylation attributes: ", len(methylation_attributes))
print("RNA attributes: ", len(rna_attributes))

CNA attributes:  24777
Methylation attributes:  16049
RNA attributes:  20441
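
If the header files are meant to list the same column names as the data CSVs (an assumption, the notebook does not state it), the two can be compared before loading. A minimal sketch, where `compare_columns` and the sample column names are hypothetical and in-memory strings stand in for the real files:

```python
import csv
import io

def compare_columns(header_file, data_file):
    """Return (missing, unexpected) column names of data_file relative to header_file."""
    expected = next(csv.reader(header_file), [])
    found = csv.DictReader(data_file).fieldnames or []
    expected_set, found_set = set(expected), set(found)
    missing = [name for name in expected if name not in found_set]
    unexpected = [name for name in found if name not in expected_set]
    return missing, unexpected

# Hypothetical stand-ins for a .header file and its CSV
header = io.StringIO("SAMPLE_ID,GENE_A,GENE_B\n")
data = io.StringIO("SAMPLE_ID,GENE_A,GENE_C\nsample-01,0.5,-1\n")
missing, unexpected = compare_columns(header, data)
print("Missing:", missing, "Unexpected:", unexpected)
```

With real files, the two arguments would be file objects opened on, for example, `data/cna.header` and `data/skin/cna.csv`.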


The data for each database is stored in three CSV files (one per collection) in the `skin`, `stomach` and `thyroid` folders under the main `data` folder.
For each database, read every collection's CSV file and insert its rows into MongoDB as documents.

In [3]:
COLLECTIONS_CSV = ["cna.csv", "methylation_hm450.csv", "rnaZscore.csv"]

# Inserts a document in the database and collection provided
def insert_document(database, collection, document):
    database[collection].insert_one(document)

# Inserts every row of a CSV file as a document (all values are kept as strings)
# and returns the number of rows inserted
def populate_collection(csv_path, database, collection):
    records = 0
    with open(csv_path, newline="") as csv_file:
        for document in csv.DictReader(csv_file):
            insert_document(database, collection, document)
            records += 1
    return records

# Runs through the csvs to populate the database
def populate_database(csvs_folder, database):
    cna_records = populate_collection(csvs_folder + COLLECTIONS_CSV[CNA], database, COLLECTIONS_NAMES[CNA])
    print("\tNumber of CNA records:", cna_records)

    methylation_records = populate_collection(csvs_folder + COLLECTIONS_CSV[METHYLATION], database, COLLECTIONS_NAMES[METHYLATION])
    print("\tNumber of Methylation records:", methylation_records)

    rna_records = populate_collection(csvs_folder + COLLECTIONS_CSV[RNA], database, COLLECTIONS_NAMES[RNA])
    print("\tNumber of RNA records:", rna_records)


# skin_cancer_db
print("Skin Cancer Database:")
populate_database("data/skin/", skin_cancer_db)

# stomach_cancer_db
print("Stomach Cancer Database:")
populate_database("data/stomach/", stomach_cancer_db)

# thyroid_cancer_db
print("Thyroid Cancer Database:")
populate_database("data/thyroid/", thyroid_cancer_db)

Skin Cancer Database:
	Number of CNA records: 367
	Number of Methylation records: 473
	Number of RNA records: 472
Stomach Cancer Database:
	Number of CNA records: 441
	Number of Methylation records: 397
	Number of RNA records: 415
Thyroid Cancer Database:
	Number of CNA records: 499
	Number of Methylation records: 567
	Number of RNA records: 509
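
Note that `csv.DictReader` yields every value as a string, so the documents inserted above store measurements as text. If numeric queries are needed later, the rows could be converted before insertion. A minimal sketch, assuming that any value parseable as a number should become a `float`; `to_document` and the sample columns are hypothetical and not part of the original notebook:

```python
import csv
import io

def to_document(row):
    """Copy a DictReader row, converting values that parse as numbers to float."""
    document = {}
    for key, value in row.items():
        try:
            document[key] = float(value)
        except (TypeError, ValueError):
            document[key] = value  # keep identifiers and other text unchanged
    return document

# Hypothetical stand-in for one of the data CSVs
sample_csv = io.StringIO("SAMPLE_ID,GENE_A,GENE_B\nsample-01,0.5,-1\n")
rows = list(csv.DictReader(sample_csv))
converted = to_document(rows[0])
print(converted)
```

An identifier made only of digits would also be converted by this rule, so a real loader would probably exclude the identifier column explicitly.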


Check that the databases exist and contain the expected collections and record counts.

In [4]:
def print_database_stats(database):
    print("\tCollections:", database.list_collection_names())
    print("\n\tCNA records:", database[COLLECTIONS_NAMES[CNA]].count_documents({}))
    print("\tMethylation records:", database[COLLECTIONS_NAMES[METHYLATION]].count_documents({}))
    print("\tRNA records:", database[COLLECTIONS_NAMES[RNA]].count_documents({}))

print("Databases:", mongo_client.list_database_names())

print("\nSkin Cancer:")
print_database_stats(skin_cancer_db)

print("\nStomach Cancer:")
print_database_stats(stomach_cancer_db)

print("\nThyroid Cancer:")
print_database_stats(thyroid_cancer_db)

Databases: ['admin', 'config', 'local', 'skin_cancer_db', 'stats', 'stomach_cancer_db', 'thyroid_cancer_db']

Skin Cancer:
	Collections: ['Gene Expression', 'Copy Number Analysis', 'Human Methylation']

	CNA records: 367
	Methylation records: 473
	RNA records: 472

Stomach Cancer:
	Collections: ['Human Methylation', 'Gene Expression', 'Copy Number Analysis']

	CNA records: 441
	Methylation records: 397
	RNA records: 415

Thyroid Cancer:
	Collections: ['Copy Number Analysis', 'Gene Expression', 'Human Methylation']

	CNA records: 499
	Methylation records: 567
	RNA records: 509
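
The counts above match the ones printed while loading. The same comparison can be automated instead of read by eye. A minimal sketch using the numbers reported in this notebook; `find_mismatches` is a hypothetical helper, and in practice the `stored` values would come from `count_documents({})` rather than being typed in:

```python
# Record counts printed while loading the CSVs
loaded = {
    "skin_cancer_db": {"CNA": 367, "Methylation": 473, "RNA": 472},
    "stomach_cancer_db": {"CNA": 441, "Methylation": 397, "RNA": 415},
    "thyroid_cancer_db": {"CNA": 499, "Methylation": 567, "RNA": 509},
}

# Record counts reported by the check against MongoDB
stored = {
    "skin_cancer_db": {"CNA": 367, "Methylation": 473, "RNA": 472},
    "stomach_cancer_db": {"CNA": 441, "Methylation": 397, "RNA": 415},
    "thyroid_cancer_db": {"CNA": 499, "Methylation": 567, "RNA": 509},
}

def find_mismatches(loaded, stored):
    """Return (database, collection, loaded_count, stored_count) for every differing count."""
    mismatches = []
    for database, counts in loaded.items():
        for collection, loaded_count in counts.items():
            stored_count = stored.get(database, {}).get(collection)
            if stored_count != loaded_count:
                mismatches.append((database, collection, loaded_count, stored_count))
    return mismatches

mismatches = find_mismatches(loaded, stored)
print("All counts match" if not mismatches else mismatches)
```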
