# Kidney Tumor data set
## Emily Isko and Joe Jessee
### Biof509: Applied Machine Learning

Data used in this project was downloaded from https://portal.gdc.cancer.gov/projects/TCGA-KIRC

Project ID:	TCGA-KIRC

DbGaP Study Accession:	phs000178

Project Name:	Kidney Renal Clear Cell Carcinoma

Disease Type:	Adenomas and Adenocarcinomas

Primary Site:	Kidney


There are **28** variables and **537** data entries/observations.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

In [None]:
# importing data
data = pd.read_csv("clinical.tsv", sep='\t')

print(data.info())
sb.pairplot(data)

## What to do with empty entries

In this data set, empty entries are denoted with **--** or **not reported**.

We replaced both with NaN, except in the categorical variables ethnicity and race, where **not reported** is kept as its own category.

In [None]:
# Replace the placeholder strings '--' and 'not reported' with NaN
data = data.replace('--', np.nan)
data = data.replace('not reported', np.nan)

# Keep 'not reported' as an explicit category in the categorical variables ethnicity and race
data['ethnicity'] = data['ethnicity'].replace(np.nan, 'not reported')
data['race'] = data['race'].replace(np.nan, 'not reported')
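
The same cleanup can be tried on a toy table before running it on the full file. This is a minimal sketch: the rows below are made up for illustration and are not taken from the TCGA-KIRC download, and `demographic_cols` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for clinical.tsv; the values are illustrative only.
toy = pd.DataFrame({
    "race": ["white", "not reported", "asian"],
    "ethnicity": ["not reported", "not hispanic or latino", "--"],
    "days_to_death": ["--", "1200", "not reported"],
})

# Replace both placeholder strings with NaN in a single call.
cleaned = toy.replace(["--", "not reported"], np.nan)

# Restore an explicit category for the two demographic columns.
demographic_cols = ["race", "ethnicity"]
cleaned[demographic_cols] = cleaned[demographic_cols].fillna("not reported")

# Count the remaining missing values per column.
print(cleaned.isna().sum())
```

Note that, as in the cell above, a `--` entry in ethnicity or race also ends up labelled `not reported`.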


In [None]:
# looking at where the empty data are
print(data.isna().sum())

data.head()

Note that some variables have no recorded data or hold a single value for all entries. We will delete these columns because they carry no information that distinguishes one case from another.

In [None]:
# Delete columns with no recorded data or a single value for all entries
delete_cols = [
    'case_id', 'project_id', 'classification_of_tumor',
    'last_known_disease_status', 'morphology',
    'days_to_last_known_disease_status', 'days_to_recurrence',
    'tumor_grade', 'tissue_or_organ_of_origin', 'days_to_birth',
    'progression_or_recurrence', 'prior_malignancy',
    'site_of_resection_or_biopsy', 'therapeutic_agents',
    'treatment_intent_type', 'treatment_or_therapy',
]

data = data.drop(columns=delete_cols)



data.shape

The new data frame has only 12 variables (the original 28 minus the 16 dropped columns).
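
Columns with no data or a single value can also be found programmatically rather than by inspection. The sketch below is one possible approach, not the procedure used in this notebook: `uninformative_columns` is a hypothetical helper, and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

def uninformative_columns(df):
    """Return names of columns that are all-NaN or hold one distinct value."""
    # nunique() ignores NaN here, so an all-NaN column has a count of 0.
    counts = df.nunique(dropna=True)
    return counts[counts <= 1].index.tolist()

# Toy data: one constant column, one empty column, one informative column.
toy = pd.DataFrame({
    "project_id": ["TCGA-KIRC"] * 3,
    "tumor_grade": [np.nan] * 3,
    "gender": ["male", "female", "male"],
})

to_drop = uninformative_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop, reduced.shape)
```

A check like this would not flag every column a hand-written list might contain. An identifier column, for example, has many distinct values and would have to be dropped separately.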

## Modifying data types to encode meaningful information

Certain variables are encoded as categorical data (i.e. string objects) when the variable actually contains numeric data.

### Tumor stage

The tumor stage variable is imported as a string object (for example, `stage i`). Because the stages are ordered, we wanted them encoded as ranked integers, so we replaced each stage label with its rank from 1 to 4.

In [None]:
print("Before Replacement")
print(data.tumor_stage.head())

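# Map each stage label to its rank. Restricting the replacement to
# tumor_stage avoids touching matching strings in other columns.
stage_map = {'stage i': 1, 'stage ii': 2, 'stage iii': 3, 'stage iv': 4}
data['tumor_stage'] = data['tumor_stage'].replace(stage_map)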


print('\n')
print("After Replacement:")
print(data.tumor_stage.head())
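
The stage encoding can be checked on a toy column first. This is a minimal sketch that swaps in `Series.map` with a dictionary; the values in `toy_stage` are made up for illustration.

```python
import pandas as pd

# Rank for each stage label, mirroring the four stages encoded above.
stage_map = {"stage i": 1, "stage ii": 2, "stage iii": 3, "stage iv": 4}

# Toy column with one missing entry; values are illustrative only.
toy_stage = pd.Series(["stage i", "stage iv", None, "stage ii"], name="tumor_stage")

# map() looks up every entry in the dictionary; entries without a
# matching key (including missing values) become NaN.
encoded = toy_stage.map(stage_map)
print(encoded.tolist())
```

One design difference worth noting: `map` turns any label that is not in the dictionary into NaN, whereas `replace` leaves unmatched labels unchanged, which would later make a conversion to float fail on the leftover strings.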


## Encoding numerical data

We identified which variables should be encoded as numeric variables (anything related to year, age, or days) and converted the data types to floats.

In [None]:
to_float = ['year_of_birth','year_of_death', 'tumor_stage','age_at_diagnosis', 'days_to_death', 'days_to_last_follow_up']

for col in to_float:
    data[col] = data[col].astype('float64')


    
# Add a new column, age_at_death. The sum is in days, assuming
# age_at_diagnosis is recorded in days like days_to_death.
data['age_at_death'] = data['age_at_diagnosis'] + data['days_to_death']
    
    
data.info()
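
If a column still contains stray text, `astype('float64')` raises an error. A more tolerant conversion is sketched below using `pd.to_numeric` with `errors="coerce"`. The toy values are illustrative, and the conversion to years assumes that `age_at_diagnosis` is recorded in days, as the sum above implies. The `age_at_death_days` and `age_at_death_years` column names are hypothetical.

```python
import pandas as pd

# Toy data stored as strings, as it would be after reading the TSV.
toy = pd.DataFrame({
    "age_at_diagnosis": ["21900", "18250", "unknown"],  # assumed to be in days
    "days_to_death": ["730", None, "365"],
})

for col in ["age_at_diagnosis", "days_to_death"]:
    # errors="coerce" turns unparseable entries into NaN instead of raising.
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Both terms are in days, so the sum is an age in days.
toy["age_at_death_days"] = toy["age_at_diagnosis"] + toy["days_to_death"]

# Convert to years for readability (365.25 days per year on average).
toy["age_at_death_years"] = toy["age_at_death_days"] / 365.25
print(toy)
```

Rows with a missing term propagate NaN into the derived columns, so patients without a recorded death have no `age_at_death` value.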

## Is there redundancy in the data?

We plotted the pairwise relationships between the numeric variables in a pair plot to identify redundant variables that may be encoding the same information.

In [None]:
sb.pairplot(data)
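
To put numbers on the pairwise relationships plotted above, a correlation matrix can be computed with `DataFrame.corr`. The sketch below uses made-up toy values, and the 0.9 cutoff is an arbitrary illustrative threshold, not one chosen in this project.

```python
import numpy as np
import pandas as pd

# Toy numeric data; the values are illustrative only.
toy = pd.DataFrame({
    "year_of_birth": [1940.0, 1950.0, 1960.0, 1970.0],
    "age_at_diagnosis": [25000.0, 21000.0, 17500.0, 14000.0],
    "tumor_stage": [1.0, 3.0, 2.0, 4.0],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Keep only the upper triangle so each pair is listed once
# and the diagonal (always 1.0) is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Flag pairs whose absolute correlation exceeds the illustrative cutoff.
redundant = pairs[pairs.abs() > 0.9]
print(redundant)
```

On the real data, `sb.heatmap(corr, annot=True)` is one way to display the matrix. Strongly correlated pairs are candidates for dropping one of the two variables.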

## Data visualization

In [None]:
from pandas.api.types import is_numeric_dtype

# One panel per variable; submitter_id is an identifier, so it is skipped.
plot_cols = [col for col in data.columns if col != "submitter_id"]

fig, axs = plt.subplots(4, 3, figsize=(10, 10))

# zip stops at the shorter sequence, so at most 12 variables are drawn.
for ax, col in zip(axs.flat, plot_cols):
    if is_numeric_dtype(data[col]):
        # Numeric variable: histogram. Drop NaN first so missing
        # values do not break the binning.
        ax.hist(data[col].dropna())
    else:
        # Everything else is treated as categorical: pie chart of
        # the category frequencies.
        valcounts = data[col].value_counts()
        ax.pie(valcounts.values, labels=valcounts.index)
    ax.set_title(col)

# Apply the layout after drawing so titles and labels are accounted for.
fig.tight_layout(pad=5)
plt.show()

In [None]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
