# Data Science Foundations Capstone Proposal
In order to get your capstone approved, you must complete all of the following steps.

## 1) Get your data
You may use any data set(s) you like, so long as they meet these criteria:

* Your data should be interesting to _you_. You want your capstone to be something you're proud of.
* Your data should be "big enough":
    - It should have at least 1,000 rows.
    - It should have enough columns to be interesting.
    - If you have questions, contact a member of the instructional team.

## 2) Provide a link to your data
Here, provide a URL to your data or describe how you will access it.

In [None]:
# Enter link here.
# PubMed search for Marfan and closely related Syndromes, searched using specified indexed MeSH Headings on July 24, 2019
# https://www.ncbi.nlm.nih.gov/pubmed/?term=Marfan+Syndrome+%5BMH%5D+OR+Loeys-Dietz+Syndrome+%5BMH%5D+OR+Ehlers-Danlos+Syndrome+%5BMH%5D+OR+Weill-Marchesani+Syndrome+%5BMH%5D%22

In [None]:
# I uploaded the data in the file data_pubmed_marfan1.csv on July 24.
# Below is the Biopython program used to run ESearch and EFetch against PubMed.

## 3) Import your data
In the space below, import your data. If your data span multiple files, read them all in. If applicable, merge or append them as needed.

In [2]:
from Bio import Entrez
from Bio import Medline
import pandas as pd
import time
Entrez.email = "Preeti.Kochar@nih.gov" 
# The code to search PubMed and import the data into Jupyter Lab was shared with me by Melanie Huston, a colleague.

In [3]:
# return a list of the PMIDs that match your search term

def getPubMedIDs(searchstring,maxrecords):
    IDlist=[]
    if not maxrecords > 100000: #maximum possible = 100,000 records
        handle = Entrez.esearch(db="pubmed", term=searchstring, retmax = maxrecords) 
        result = Entrez.read(handle)
        IDlist= result["IdList"]
        handle.close()

    return IDlist

In [4]:
# get MEDLINE data records for each PMID and store in a dataframe
# searchPMIDlist was created by getPubMedIDs function
# this will only get the first 10000 records from your PMID list

def getPubMeddata(searchPMIDlist,dataframename):
    
    # a trick to remove duplicates from the PMID list using set and list
    searchPMIDlist = list(set(searchPMIDlist))
    
    # for displaying number of records processed
    counter=0
    
    # get MEDLINE data records
    fetchhandle = Entrez.efetch(db="pubmed", id=searchPMIDlist, rettype="medline", retmode="text")
    fetchresult = Medline.parse(fetchhandle)

    # parse the dictionary of returned records
    for record in fetchresult: 
        if "PMID" not in record: # if there's no PMID in this record (rare), skip it
            continue
        JT=''
        if "JT" in record:
            JT=record["JT"]
        PT=''
        if "PT" in record: # if there's a pub type list in this record, store it
            PT=record["PT"]
        TI=''
        if "TI" in record: # if there's a title in this record, store it
            TI=record["TI"]
        AB=''
        if "AB" in record: # if there's an abstract in this record, store it
            AB=record["AB"]
        # otherwise AB stays '' (an empty string, not NaN)
        MH=''
        if "MH" in record:
            MH=record["MH"]

        # put the data you found into a new row in the dataframe
        # you might want to collect different data for your purposes
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        new_row = pd.DataFrame([{'PMID': record["PMID"],
                                 'Journal': JT,
                                 'PT': PT,
                                 'Title': TI,
                                 'Abstract': AB,
                                 'MeSH Terms': MH}])
        dataframename = pd.concat([dataframename, new_row], ignore_index=True)
        
        # count this record; after every 500 records, display the number processed
        counter += 1
        if not counter % 500:
            print(counter, "records processed") 
            
    time.sleep(5) # wait time between repeated fetches
    fetchhandle.close()
    
    return dataframename
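
The comments above note that one call only gets the first 10000 records and that there is a wait between repeated fetches. One possible way to handle a longer PMID list is to split it into batches and call `getPubMeddata` once per batch. The following is a minimal sketch, not part of the original program: `chunk_pmids` is a hypothetical helper, and the batch size and PMIDs are made up for illustration.

```python
def chunk_pmids(pmids, batch_size):
    """Yield successive batches of at most batch_size PMIDs (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pmids), batch_size):
        yield pmids[start:start + batch_size]


# Illustration with made-up PMIDs; no network access is needed here.
example_pmids = [str(n) for n in range(1, 12)]
batches = list(chunk_pmids(example_pmids, 5))
print([len(batch) for batch in batches])  # [5, 5, 1]
```

Each batch could then be passed to `getPubMeddata` in a loop, reusing the dataframe returned by the previous call.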
        

In [5]:
myPMIDlist=[]
searchstring="Marfan Syndrome [MH] OR Loeys-Dietz Syndrome [MH] OR Ehlers-Danlos Syndrome [MH] OR Weill-Marchesani Syndrome [MH]"

myPMIDlist = getPubMedIDs(searchstring, 10000)
print("Search string:", searchstring)
print("Total PMIDs found:",len(myPMIDlist))

Search string: Marfan Syndrome [MH] OR Loeys-Dietz Syndrome [MH] OR Ehlers-Danlos Syndrome [MH] OR Weill-Marchesani Syndrome [MH]
Total PMIDs found: 8848


In [6]:
# initialize your dataframe for the citation record data
# you might want to collect different data for your purposes

columnlist=['PMID', 'Journal','PT', 'Title', 'Abstract','MeSH Terms']
marfan_like = pd.DataFrame(columns=columnlist, index=None)

In [9]:
# Fetch data
marfan_like = getPubMeddata(myPMIDlist, marfan_like) # will only get the first 10000 records from your PMID list
print('Record Table Length:',len(marfan_like))

500 records processed
1000 records processed
1500 records processed
2000 records processed
2500 records processed
3000 records processed
3500 records processed
4000 records processed
4500 records processed
5000 records processed
5500 records processed
6000 records processed
6500 records processed
7000 records processed
7500 records processed
8000 records processed
8500 records processed
Record Table Length: 8848


## 4) Show me the head of your data.

In [11]:
marfan_like.head(5)

| | PMID | Journal | PT | Title | Abstract | MeSH Terms |
|---|---|---|---|---|---|---|
| 0 | 2606899 | Journal of biochemistry | [Journal Article, Research Support, Non-U.S. G... | Partial characterization of an unusual 185 kDa... | Production of an unusual collagenous protein w... | [Antibodies/immunology, Collagen/*analysis, El... |
| 1 | 14302561 | Clinical obstetrics and gynecology | [Journal Article] | THE MARFAN SYNDROME AND PREGNANCY. | | [*Aortic Aneurysm, *Aortic Rupture, *Arachnoda... |
| 2 | 4251765 | Clinical chemistry | [Journal Article] | Ion-exchange chromatography of free amino acid... | | [Amino Acids/*metabolism, Angiomatosis/metabol... |
| 3 | 25459960 | The Journal of hand surgery | [Journal Article, Review] | Ehlers-Danlos syndrome. | | [Ehlers-Danlos Syndrome/*diagnosis, *Hand, Hum... |
| 4 | 27339264 | Disability and rehabilitation | [Journal Article] | The association between muscle strength and ac... | PURPOSE: The patients diagnosed with Ehlers-Da... | [Adult, Ehlers-Danlos Syndrome/*rehabilitation... |


## 5) Show me the shape of your data

In [12]:
marfan_like.shape

(8848, 6)

## 6) Show me the proportion of missing observations for each column of your data

In [13]:
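marfan_like.isnull().sum()
# This reports zero missing values even though the head shows that Abstract is empty for some citations.
# getPubMeddata stores a missing field as an empty string (''), which isnull() does not count as missing.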

PMID          0
Journal       0
PT            0
Title         0
Abstract      0
MeSH Terms    0
dtype: int64
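
Because `getPubMeddata` fills absent fields with `''`, one way to get the proportion of missing observations is to count blank cells explicitly. The following is a minimal sketch on toy rows rather than the real PubMed data; `is_blank` and `missing_proportion` are hypothetical helpers, and treating empty lists as missing is an assumption.

```python
import pandas as pd


def is_blank(value):
    """Treat None, NaN, empty strings and empty lists as missing."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only float unequal to itself
        return True
    if isinstance(value, (str, list)) and len(value) == 0:
        return True
    return False


def missing_proportion(df):
    """Return the share of blank cells in each column (hypothetical helper)."""
    return pd.Series({col: df[col].map(is_blank).mean() for col in df.columns})


# Toy rows, not the real PubMed data.
toy = pd.DataFrame({
    "PMID": ["1", "2", "3", "4"],
    "Abstract": ["Some text", "", "", "More text"],
    "MeSH Terms": [["Humans"], ["Adult"], "", ["Hand"]],
})
proportions = missing_proportion(toy)
print(proportions)
```

On the real data the same call would be `missing_proportion(marfan_like)`.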

## 7) Give me a problem statement.
Below, write a problem statement. Keep in mind that your task is to tease out relationships in your data and eventually build a predictive model. Your problem statement can be vague, but you should have a goal in mind. Your problem statement should be between one sentence and one paragraph.

In [None]:
# I plan to analyze PubMed/MEDLINE data on Marfan-like syndromes (congenital connective tissue diseases) and on
# genetic mutations or other aspects of the diseases, based mainly on the indexed MeSH terms. My dataset is derived
# from a PubMed search and contains 8,848 citations. The data have 6 columns: the ID of the citation (PMID), the
# journal, the publication type (PT), the title, the abstract, and the MeSH terms indexed for each citation.
# As I work on the data, I may import more data (e.g., about the genes) and may decide not to use certain columns
# (e.g., Abstract).

## 8) What is your _y_-variable?
For Part C of your capstone, you will need to build a statistical model. This means you will have to accurately predict some y-variable from some combination of x-variables. From your problem statement in part 7, what is that y-variable?

In [None]:
# My y-variable will be the MeSH term for the protein indexed with the specific Marfan-like syndrome.
# The assumption is that when a particular gene is indexed in the same MEDLINE citation as a syndrome, it is
# associated with that disease. I may explore other variables such as the pathology, disease severity, etc.
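
The MeSH entries in the head output have the form `Ehlers-Danlos Syndrome/*diagnosis`: a main heading, an optional qualifier after the slash, and an asterisk marking a major topic. One possible first step toward this y-variable is to count which headings are indexed on the same citation as each syndrome. The following is a minimal sketch under those assumptions; `mesh_heading` and `cooccurrence_counts` are hypothetical helpers, and the toy MeSH lists are invented for illustration.

```python
from collections import Counter

# The four syndromes from the search string.
SYNDROMES = {
    "Marfan Syndrome",
    "Loeys-Dietz Syndrome",
    "Ehlers-Danlos Syndrome",
    "Weill-Marchesani Syndrome",
}


def mesh_heading(term):
    """Return the main heading of a MeSH entry such as 'Collagen/*analysis'.

    The qualifier after '/' and the '*' major-topic marker are dropped.
    """
    return term.split("/", 1)[0].replace("*", "").strip()


def cooccurrence_counts(mesh_lists):
    """Count (syndrome, other heading) pairs indexed on the same citation."""
    counts = Counter()
    for terms in mesh_lists:
        headings = {mesh_heading(t) for t in terms}
        for syndrome in headings & SYNDROMES:
            for other in headings - SYNDROMES:
                counts[(syndrome, other)] += 1
    return counts


# Toy MeSH lists, invented for illustration only.
toy_mesh = [
    ["Marfan Syndrome/*genetics", "Fibrillin-1/genetics", "Humans"],
    ["*Marfan Syndrome", "Fibrillin-1/*genetics"],
    ["Ehlers-Danlos Syndrome/*diagnosis", "Humans"],
    "",  # citation with no MeSH terms, stored as an empty string
]
counts = cooccurrence_counts(toy_mesh)
print(counts[("Marfan Syndrome", "Fibrillin-1")])  # 2
```

On the real data, `marfan_like['MeSH Terms']` could be passed in place of `toy_mesh`; the resulting counts would still need filtering to protein or gene headings before they could serve as a y-variable.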