# Module 3 Final Project Submission

Please fill out:
* Student name: David Braslow
* Student pace: self paced 
* Scheduled project review date/time: 
* Instructor name: Eli
* Blog post URL: TBD
* Data source: https://www.kaggle.com/kevinarvai/clinvar-conflicting/version/3


# Overview

This project uses a Kaggle dataset to predict whether the clinical classifications of genetic variants conflict. The dataset contains many genetic variants along with various properties of each. Expert raters at different laboratories classified these variants by their perceived clinical significance, with ratings ranging from Benign to Pathogenic. The target variable indicates whether the raters' classifications are concordant, meaning that they fall in the same clinical category.

## Initialization

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Obtaining the Data

For this project, I downloaded the dataset from the Kaggle page as a CSV file.

In [2]:
df = pd.read_csv('clinvar_conflicting.csv')
df.head()


Unnamed: 0,CHROM,POS,REF,ALT,AF_ESP,AF_EXAC,AF_TGP,CLNDISDB,CLNDISDBINCL,CLNDN,...,SIFT,PolyPhen,MOTIF_NAME,MOTIF_POS,HIGH_INF_POS,MOTIF_SCORE_CHANGE,LoFtool,CADD_PHRED,CADD_RAW,BLOSUM62
0,1,955563,G,C,0.0,0.0,0.0,"MedGen:C3808739,OMIM:615120|MedGen:CN169374",,"Myasthenic_syndrome,_congenital,_8|not_specified",...,,,,,,,0.421,11.39,1.133255,-2.0
1,1,955597,G,T,0.0,0.42418,0.2826,MedGen:CN169374,,not_specified,...,,,,,,,0.421,8.15,0.599088,
2,1,955619,G,C,0.0,0.03475,0.0088,"MedGen:C3808739,OMIM:615120|MedGen:CN169374",,"Myasthenic_syndrome,_congenital,_8|not_specified",...,,,,,,,0.421,3.288,0.069819,1.0
3,1,957640,C,T,0.0318,0.02016,0.0328,"MedGen:C3808739,OMIM:615120|MedGen:CN169374",,"Myasthenic_syndrome,_congenital,_8|not_specified",...,,,,,,,0.421,12.56,1.356499,
4,1,976059,C,T,0.0,0.00022,0.001,MedGen:CN169374,,not_specified,...,,,,,,,0.421,17.74,2.234711,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65188 entries, 0 to 65187
Data columns (total 46 columns):
CHROM                 65188 non-null object
POS                   65188 non-null int64
REF                   65188 non-null object
ALT                   65188 non-null object
AF_ESP                65188 non-null float64
AF_EXAC               65188 non-null float64
AF_TGP                65188 non-null float64
CLNDISDB              65188 non-null object
CLNDISDBINCL          76 non-null object
CLNDN                 65188 non-null object
CLNDNINCL             76 non-null object
CLNHGVS               65188 non-null object
CLNSIGINCL            76 non-null object
CLNVC                 65188 non-null object
CLNVI                 27659 non-null object
MC                    58219 non-null object
ORIGIN                59065 non-null float64
SSR                   104 non-null float64
CLASS                 65188 non-null int64
Allele                65188 non-null object
Consequence        

# Scrubbing the Data

There seem to be a number of fields with missing data and incorrect types. In this section, I scrub the dataset squeaky-clean.

## Very Low Incidence Features

Here I drop features with fewer than 600 non-null entries (roughly 1% of the dataset).

In [4]:
df = df.drop(['CLNDISDBINCL', 'CLNDNINCL', 'CLNSIGINCL', 'SSR', 'DISTANCE', 'MOTIF_NAME', 'MOTIF_POS', 'HIGH_INF_POS', 'MOTIF_SCORE_CHANGE'], axis = 1)
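
The column list above is written out by hand. As a rough sketch, the same kind of list could be derived from the share of non-missing values per column. The toy frame and the `min_fraction` cutoff below are assumptions for illustration, not values taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the ClinVar data (hypothetical values).
toy = pd.DataFrame({
    "POS": np.arange(200),
    "SSR": [1.0] + [np.nan] * 199,        # 0.5% populated
    "CLNVI": ["x"] * 80 + [None] * 120,   # 40% populated
})

min_fraction = 0.01  # assumed cutoff: keep columns populated in at least 1% of rows

# notna().mean() gives the fraction of non-missing entries per column.
populated = toy.notna().mean()
sparse_cols = populated[populated < min_fraction].index.tolist()

toy = toy.drop(columns=sparse_cols)
print(sparse_cols)
```

Deriving the list this way keeps the cutoff explicit and avoids typos in column names.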

## Low Incidence Features

Here I dichotomize features that are populated for less than half of the dataset, coding 1 where a value is present and 0 otherwise.

In [5]:
for var in ['CLNVI', 'INTRON', 'BAM_EDIT', 'SIFT', 'PolyPhen', 'BLOSUM62']:
    # notna() marks populated entries; this replaces the `x == x` NaN trick
    df[var] = df[var].notna().astype(int).astype('category')
    print(df[var].value_counts())

0    37529
1    27659
Name: CLNVI, dtype: int64
0    56385
1     8803
Name: INTRON, dtype: int64
0    33219
1    31969
Name: BAM_EDIT, dtype: int64
0    40352
1    24836
Name: SIFT, dtype: int64
0    40392
1    24796
Name: PolyPhen, dtype: int64
0    39595
1    25593
Name: BLOSUM62, dtype: int64


## Target: CLASS

The CLASS variable is the target variable, which indicates whether there were conflicting submissions.

In [6]:
# Pass the mapping via columns=; a bare dict is applied to the index labels instead
df.rename(columns={'CLASS': 'target'}, inplace=True)
df['target'] = df['target'].astype('category')

## CHROM

This variable captures the chromosome on which the variant is located. This should be a categorical variable.

In [None]:
df['CHROM'].value_counts()

In [None]:
df['CHROM'] = df['CHROM'].astype('category')

## POS

This variable captures the position of the variant on the chromosome. I will need to treat it with care in analysis, since its meaning depends on CHROM.

In [None]:
df['POS'].describe()
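
One possible way to handle the dependence on CHROM, sketched here as an assumption rather than something this notebook does, is to rescale POS within each chromosome so that positions are comparable across chromosomes. The toy frame and the `POS_scaled` column name are hypothetical.

```python
import pandas as pd

# Toy data: two chromosomes with very different position ranges.
toy = pd.DataFrame({
    "CHROM": ["1", "1", "1", "2", "2"],
    "POS": [100, 200, 300, 5000, 7000],
})

grouped = toy.groupby("CHROM")["POS"]
pos_min = grouped.transform("min")
pos_max = grouped.transform("max")

# Min-max scale POS within each chromosome. A chromosome with a single
# distinct position has a span of 0, so those rows are set to 0.
span = pos_max - pos_min
toy["POS_scaled"] = ((toy["POS"] - pos_min) / span.where(span != 0)).fillna(0.0)
print(toy)
```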

## REF, ALT, Allele

These variables capture the variant alleles - they should be categorical.

In [None]:
for var in ['REF', 'ALT', 'Allele']:
    print(df[var].value_counts()[0:10])

There are a lot of low-frequency categories - I will lump them together into an "other" category.

In [None]:
for var in ['REF', 'ALT', 'Allele']:
    df[var] = df[var].apply(lambda x: 'O' if x not in ['A', 'C', 'G', 'T'] else x).astype('category')
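
The same lumping can be written without a row-wise `apply`. A minimal sketch on a toy Series (the values are made up for illustration):

```python
import pandas as pd

alleles = pd.Series(["A", "G", "AT", "C", "GCC", "T", "-"])
bases = ["A", "C", "G", "T"]

# Series.where keeps values where the condition holds and substitutes 'O' elsewhere.
lumped = alleles.where(alleles.isin(bases), "O").astype("category")
print(lumped.value_counts())
```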

## AF_ESP, AF_EXAC, and AF_TGP

These variables capture the allele frequency as found in other datasets. They are almost all zero, so I dichotomize them into zero vs non-zero.

In [None]:
df[['AF_ESP', 'AF_EXAC', 'AF_TGP']].describe()

In [None]:
df[['AF_ESP', 'AF_EXAC', 'AF_TGP']].hist()

In [None]:
df['AF_ESP'] = df['AF_ESP'].apply(lambda x: 1 if x > 0 else 0).astype('category')
df['AF_EXAC'] = df['AF_EXAC'].apply(lambda x: 1 if x > 0 else 0).astype('category')
df['AF_TGP'] = df['AF_TGP'].apply(lambda x: 1 if x > 0 else 0).astype('category')

## CLNDISDB

This variable contains IDs for diseases in other databases. It has a large number of distinct values, so it will be difficult to use. Different values often contain the same identifiers, making the values arguably not unique (e.g. 'MedGen:CN169374' appears in multiple values). I choose to drop it.

In [None]:
print(len(df['CLNDISDB'].unique()))
df['CLNDISDB'].value_counts()[0:10]

In [None]:
df = df.drop('CLNDISDB', axis = 1)

## CLNDN

This captures the preferred disease name using the identifiers from CLNDISDB. This may be cleaner than the other variable, and is probably important for prediction, so I will attempt to clean it.

In [None]:
print(len(df['CLNDN'].unique()))
df['CLNDN'].value_counts()[0:20]

Each value is a list of diseases. It seems like I could clean this by creating dummy variables for specific common diseases in each list. I will create dummies for the top 100 diseases.

In [None]:
name_df = df['CLNDN'].str.split(pat = '|', expand = True)
name_df.head()
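# The top-level pd.value_counts is deprecated in recent pandas; count per column instead
top_100_dn = name_df.apply(lambda col: col.value_counts()).sum(axis=1).sort_values(ascending = False)[0:100]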
print(top_100_dn[0:10])

top_100_dn_list = list(top_100_dn.index)
print(top_100_dn_list[0:10])

In [None]:
for dn in top_100_dn_list:
    df[dn] = df['CLNDN'].apply(lambda x: 1 if dn in x else 0).astype('category')
df = df.drop('CLNDN', axis = 1)
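
One caveat with the loop above: `dn in x` is a substring test on the whole string, so a short disease name also matches inside a longer one. A sketch of the difference, using `Series.str.get_dummies` to match whole list entries instead. The disease names below are illustrative toy values, not taken from the dataset.

```python
import pandas as pd

# Toy disease lists (illustrative names).
clndn = pd.Series([
    "Long_QT_syndrome|not_specified",
    "Long_QT_syndrome_1",
    "not_specified",
])

target = "Long_QT_syndrome"

# Substring test: also fires on the longer name in the second row.
substring_flag = clndn.apply(lambda x: 1 if target in x else 0)

# Exact test: split each list on '|' and build one indicator column per name.
dummies = clndn.str.get_dummies(sep="|")
exact_flag = dummies[target]

print(substring_flag.tolist())
print(exact_flag.tolist())
```

Whether the looser substring behavior matters depends on how often disease names nest inside one another in the real data, which I have not checked here.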

In [None]:
print(df.columns)

## CLNHGVS

This variable holds the HGVS expression for each variant. The values are all unique and I don't know how to interpret them, so I choose to drop it.

In [None]:
print(len(df['CLNHGVS'].unique()))
df = df.drop('CLNHGVS', axis = 1)

## MC

Molecular consequence is a categorical variable, and I need to clean up its rare values. Since the values are lists of consequences, I handle them the same way as the disease names: splitting up the series and coding dummies.

In [None]:
df['MC'].value_counts()[0:10]

In [None]:
name_df = df['MC'].str.split(pat = '[|,]', expand = True)
name_df.head()
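# The top-level pd.value_counts is deprecated in recent pandas; count per column instead
top_mc = name_df.apply(lambda col: col.value_counts()).sum(axis=1).sort_values(ascending = False)[0:20]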
print(top_mc)

top_mc_list = [x for x in list(top_mc.index) if 'SO:' not in x]
print(top_mc_list)

In [None]:
df['MC'] = df['MC'].fillna('unknown')
for mc in top_mc_list:
    df[mc] = df['MC'].apply(lambda x: 1 if mc in x else 0).astype('category')
    print(df[mc].value_counts())
df = df.drop('MC', axis = 1)

## ORIGIN

Here is the description: "Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other". Since almost all records have origin 1 (germline), I recode this as a dummy variable: 1 for germline, 0 for all other values.

In [None]:
df['ORIGIN'] = df['ORIGIN'].fillna(0).apply(lambda x: 1 if x == 1.0 else 0).astype('category')
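
Because the description says the values "may be added", ORIGIN reads like a set of bit flags, so a value such as 3 would mean germline plus somatic. That reading is my assumption from the quoted text. Under it, an exact comparison with 1 and a bitwise test give different answers, as this toy sketch shows:

```python
import pandas as pd

GERMLINE = 1  # flag value from the description quoted above

# Toy values; 3 is assumed to mean germline + somatic (1 + 2).
origin = pd.Series([1.0, 3.0, 2.0, None, 0.0])

codes = origin.fillna(0).astype(int)

# Exact match: only rows whose origin is germline and nothing else.
exactly_germline = (codes == GERMLINE).astype(int)

# Bitwise test: rows where the germline flag is set, possibly with other flags.
has_germline = ((codes & GERMLINE) != 0).astype(int)

print(exactly_germline.tolist())
print(has_germline.tolist())
```

Which definition is preferable depends on whether combined origins should count as germline.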

## Consequence

This variable is similar to MC, but with slightly different values. I'm not sure why. I will use it to update the MC dummy variables from before.

In [None]:
name_df = df['Consequence'].str.split(pat = '&', expand = True)
name_df.head()
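# The top-level pd.value_counts is deprecated in recent pandas; count per column instead
top_mc = name_df.apply(lambda col: col.value_counts()).sum(axis=1).sort_values(ascending = False)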
print(top_mc[0:20])

In [None]:
for mc in top_mc_list:
    mc2 = mc + '2'
    df[mc2] = df['Consequence'].apply(lambda x: 1 if mc in x else 0).astype('category')
    df[mc] = df[[mc, mc2]].apply(lambda x: max(x[mc], x[mc2]), axis = 1).astype('category')
    print(df[mc].value_counts())
    df=df.drop(mc2, axis = 1)
df = df.drop('Consequence', axis = 1)
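
The update above amounts to a logical OR: a consequence dummy is 1 if the term appears in either MC or Consequence. A vectorized sketch of that idea on toy data follows; the rows are made up, and here the flag is built directly from the two raw text columns rather than from an existing dummy.

```python
import pandas as pd

# Toy rows (illustrative values).
toy = pd.DataFrame({
    "MC": ["SO:0001583|missense_variant", None, "SO:0001819|synonymous_variant"],
    "Consequence": [
        "missense_variant",
        "missense_variant&splice_region_variant",
        "synonymous_variant",
    ],
})
term = "missense_variant"

# regex=False makes this a plain substring search; fillna('') guards against missing values.
in_mc = toy["MC"].fillna("").str.contains(term, regex=False)
in_consequence = toy["Consequence"].fillna("").str.contains(term, regex=False)

toy[term] = (in_mc | in_consequence).astype(int).astype("category")
print(toy[term].tolist())
```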

## IMPACT

Categorical variable capturing variant impact

In [None]:
df['IMPACT'].value_counts()

In [None]:
df['IMPACT'] = df['IMPACT'].astype('category')

## SYMBOL

This variable is the Gene symbol/ID. It has many values - I will make it categorical, but only keep the top 100 values, recoding the rest as "Other".

In [None]:
len(df['SYMBOL'].unique())

In [None]:
df['SYMBOL'].value_counts()[0:10]

In [None]:
top_100_symb = df['SYMBOL'].value_counts()[0:100].index
df['SYMBOL'] = df['SYMBOL'].apply(lambda x: x if x in top_100_symb else 'Other').astype('category')

In [None]:
df['SYMBOL'].value_counts()[0:100]

## Feature

This is an ID associated with the gene name - I delete it due to redundancy.

In [None]:
df = df.drop('Feature', axis = 1)

## Feature_type and BIOTYPE

These features have little information (almost all records have same value), so I drop them.

In [None]:
for var in ['Feature_type', 'BIOTYPE']:
    print(df[var].value_counts())
    df = df.drop(var, axis = 1)

## EXON

This captures the relative exon number. Given the very large number of unique values, I choose to drop it.

In [None]:
len(df['EXON'].unique())

In [None]:
df = df.drop('EXON', axis = 1)

## cDNA_position, CDS_position, Protein_position

These represent relative positions of the base pair in various ways. They are all distance measures, which I think are irrelevant to the problem at hand, and they are difficult to clean, so I drop them.

In [None]:
df = df.drop(['cDNA_position', 'CDS_position', 'Protein_position'], axis = 1)

## Amino_acids, Codons

These have a large number of unique values, so I drop them.

In [None]:
# Each column name must be its own string in the list
df = df.drop(['Amino_acids', 'Codons'], axis = 1)

## STRAND

Categorical: defined as + (forward) or - (reverse)

In [None]:
df['STRAND'].value_counts()

In [None]:
df['STRAND'] = df['STRAND'].astype('category')

## LoFtool

Numeric variable: Loss of Function tolerance score for loss of function variants. Will fill missing values with median.

In [None]:
df['LoFtool'] = df['LoFtool'].fillna(df['LoFtool'].median())

## CADD_PHRED, CADD_RAW

Different scores of deleteriousness - I keep them and fill missing values with medians.

In [None]:
df['CADD_PHRED'] = df['CADD_PHRED'].fillna(df['CADD_PHRED'].median())

In [None]:
df['CADD_RAW'] = df['CADD_RAW'].fillna(df['CADD_RAW'].median())
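
The three median fills above follow the same pattern, so they could be collapsed into one step. A minimal sketch on a toy frame with made-up scores:

```python
import numpy as np
import pandas as pd

# Toy scores with gaps (illustrative values).
toy = pd.DataFrame({
    "LoFtool": [0.1, np.nan, 0.5, 0.9],
    "CADD_PHRED": [10.0, 20.0, np.nan, 30.0],
})

score_cols = ["LoFtool", "CADD_PHRED"]

# median() skips NaN by default; fillna with a Series fills each column
# using the value whose index label matches the column name.
medians = toy[score_cols].median()
toy[score_cols] = toy[score_cols].fillna(medians)
print(toy)
```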

# Exploring the Data

In [None]:
df.info()