# CCTS 40500: ML Midterm
### Abdallah Aboelela

To do:
1. Go through the first run-through CSV and figure out which models are best to use
2. Create master df merged on id values
3. Consider only first x weeks of a child's life
4. Add fraction of time sick in first 6 months, 2 years, overall (with different diseases?)
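
Items 3 and 4 could be approached along these lines. This is a rough sketch, assuming each child's record is a list with one value per week of life holding the proportion of that week spent sick, with `None` for missing weeks. The helper names `first_weeks` and `fraction_sick` are hypothetical, and 26 and 104 weeks stand in for 6 months and 2 years.

```python
def first_weeks(weekly, x):
    """Keep only the first x weeks of a child's record."""
    return weekly[:x]

def fraction_sick(weekly, weeks=None):
    """Mean proportion of time sick over the first `weeks` weeks (all weeks if None).

    Missing weeks (None) are skipped; returns None if no week was observed.
    """
    window = weekly if weeks is None else first_weeks(weekly, weeks)
    observed = [v for v in window if v is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)

# Toy record (made up): 5 weeks, one of them missing
record = [0.0, 0.5, 1.0, None, 0.5]

features = {
    '6mos': fraction_sick(record, 26),
    '2y': fraction_sick(record, 104),
    'overall': fraction_sick(record),
}
```

Averaging over observed weeks, rather than summing, keeps children with shorter records comparable to those with longer ones.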

In [64]:
import pandas as pd 
import numpy as np
import os
import csv

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

# Questions A/B
#### Predict POS for Males and Females using infectious disease history

Here, we will attempt to predict POS using four models (Decision Tree, Random Forest, Naive Bayes, and Linear Regression) and, based on the AUC and ROC curve, create a model that randomly chooses between them.
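
The "randomly chooses between them" idea can be sketched as below. This is only one possible reading: the weighting rule (selection probability proportional to each model's AUC above the 0.5 chance level) and the helper names `selection_weights` and `choose_model` are assumptions for illustration. The AUC values are the male-column results from the table in this section.

```python
import random

def selection_weights(aucs):
    """Map {model name: AUC} to selection probabilities.

    Assumed rule: weight each model by how far its AUC exceeds 0.5 (chance),
    so a model at or below chance is never selected.
    """
    excess = {name: max(auc - 0.5, 0.0) for name, auc in aucs.items()}
    total = sum(excess.values())
    if total == 0:
        raise ValueError("no model performs better than chance")
    return {name: value / total for name, value in excess.items()}

def choose_model(weights, rng):
    """Randomly pick one model name according to the given probabilities."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# AUCs for males, copied from the results table in this section
male_aucs = {'NB': 0.582018, 'DT': 0.60657, 'RF': 0.596939, 'LinReg': 0.680163}
weights = selection_weights(male_aucs)

# Seeded generator so the random choices are reproducible
rng = random.Random(42)
picks = [choose_model(weights, rng) for _ in range(1000)]
```

Each test case would then be scored by whichever fitted model `choose_model` returns, so stronger models are consulted more often without excluding the weaker ones entirely.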

In [73]:
# Contains cols sex, diagnosis, state, county, proportion of time; read_file is defined in the appendix
infectious = read_file('data/Infectious') 

['F', 'M']


In [74]:
# Using label encoder values printed above to filter by gender (F = 0, M = 1)
infectious_m = infectious[infectious.sex == 1]
infectious_f = infectious[infectious.sex == 0]

mX_train, mX_test, my_train, my_test = train_test_split(infectious_m.iloc[:, 1:].drop('diagnosis', axis = 1), 
                                                        infectious_m.diagnosis, random_state = 42)

fX_train, fX_test, fy_train, fy_test = train_test_split(infectious_f.iloc[:, 1:].drop('diagnosis', axis = 1), 
                                                        infectious_f.diagnosis, random_state = 42)

In [75]:
nb = GaussianNB()
dt = DecisionTreeClassifier()
rf = RandomForestClassifier(n_estimators = 100)
reg = LinearRegression()

classifiers = [nb, dt, rf, reg]
names = ['NB', 'DT', 'RF', 'LinReg']

cols = ['clf','M', 'F']
rows = []

for i, clf in enumerate(classifiers):
    clf.fit(mX_train, my_train)
    y_pred = clf.predict(mX_test)
    m_auc = roc_auc_score(my_test, y_pred)
    
    clf.fit(fX_train, fy_train)
    y_pred = clf.predict(fX_test)
    f_auc = roc_auc_score(fy_test, y_pred)
    
    rows.append([names[i], m_auc, f_auc])

infectious_clfs = pd.DataFrame(rows, columns = cols).set_index('clf')

In [76]:
infectious_clfs

               M         F
clf
NB      0.582018  0.592048
DT      0.606570  0.675583
RF      0.596939  0.678571
LinReg  0.680163  0.713307


Across the board, it is easier to predict autism diagnosis for females than for males: every model has a higher AUC on the female subset. Linear regression does particularly well, exceeding 0.65 AUC in both cases. Decision Tree and Random Forest also exceed 0.65 for females.
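
One caveat when comparing these numbers: `roc_auc_score` ranks test cases by the score it is given. The classifiers' `predict` returns hard class labels, while `LinearRegression.predict` returns continuous values, so the two kinds of model may not be evaluated on equal footing. The dependency-free sketch below shows how thresholding the same scores can change the AUC. The helper `pairwise_auc` and the toy labels and scores are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

# Toy data (made up): every positive case scores above every negative case
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.25, 0.45]

# Hard labels obtained by thresholding the same scores at 0.5
labels = [int(s >= 0.5) for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, labels)
```

With scikit-learn classifiers that support it, passing `clf.predict_proba(X_test)[:, 1]` instead of `clf.predict(X_test)` would give `roc_auc_score` a continuous score to rank.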

# Questions C/D/E
#### Predict POS for Males/Females using any/all combination of diseases
Strategy: Run the four predictive methods separately by gender, disease, and number of weeks considered, to identify which are best. Then, we will combine the most useful diseases into one merged df on c_id, and create a combined classifier that chooses between the best classifiers.
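
The "identify which are best" step could look roughly like this. It is a minimal sketch: the helper name `best_per_disease` is hypothetical, the 0.65 cutoff is an assumption borrowed from the AUC level used in the Questions A/B discussion, and the four rows are copied from the `general_auc_analysis.csv` output shown below (SVC omitted).

```python
# A few rows copied from the general_auc_analysis.csv output (SVC omitted)
aucs = {
    'Metabolic': {'NB': 0.566709, 'DT': 0.594859, 'RF': 0.555556, 'LinReg': 0.808795},
    'Urinary':   {'NB': 0.785251, 'DT': 0.494327, 'RF': 0.666667, 'LinReg': 0.570502},
    'PNS':       {'NB': 0.608439, 'DT': 0.728929, 'RF': 0.735294, 'LinReg': 0.783565},
    'Endocrine': {'NB': 0.489496, 'DT': 0.491597, 'RF': 0.500000, 'LinReg': 0.428571},
}

def best_per_disease(aucs, min_auc=0.65):
    """Return {disease: (best classifier, its AUC)} for diseases whose best AUC reaches min_auc."""
    selected = {}
    for disease, by_clf in aucs.items():
        # Pick the classifier with the highest AUC for this disease
        clf, auc = max(by_clf.items(), key=lambda item: item[1])
        if auc >= min_auc:
            selected[disease] = (clf, auc)
    return selected

result = best_per_disease(aucs)
for disease, (clf, auc) in result.items():
    print(f"{disease}: {clf} ({auc:.3f})")
```

Diseases that survive the cutoff would be the candidates for the merged df, each paired with the classifier that handled it best.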

In [None]:
disease_rows = []

for fname in os.listdir('data'):
    if fname not in ['Positive', 'Negative']:
        # read_file returns a single DataFrame (see appendix)
        disease = read_file('data/' + fname)

        # Split by gender using the same label encoding as above (F = 0, M = 1)
        disease_m = disease[disease.sex == 1]
        disease_f = disease[disease.sex == 0]
        
        mX_train, mX_test, my_train, my_test = train_test_split(disease_m.iloc[:, 1:].drop('diagnosis', axis = 1), 
                                                        disease_m.diagnosis, random_state = 42)

        fX_train, fX_test, fy_train, fy_test = train_test_split(disease_f.iloc[:, 1:].drop('diagnosis', axis = 1), 
                                                        disease_f.diagnosis, random_state = 42)
        for i, clf in enumerate(classifiers):
            clf.fit(mX_train, my_train)
            y_pred = clf.predict(mX_test)
            m_auc = roc_auc_score(my_test, y_pred)

            clf.fit(fX_train, fy_train)
            y_pred = clf.predict(fX_test)
            f_auc = roc_auc_score(fy_test, y_pred)

            # Record the disease so rows from different files stay distinguishable
            disease_rows.append([fname, names[i], m_auc, f_auc])

disease_clfs = pd.DataFrame(disease_rows, columns = ['disease'] + cols).set_index(['disease', 'clf'])

Below is the result of a function that applied Naive Bayes, Decision Tree, Random Forest, SVC, and Linear Regression models to each disease.

In [5]:
simple_aucs = pd.read_csv('general_auc_analysis.csv', index_col = 'Disease')
print(simple_aucs)
print()
print(simple_aucs.mean())
print()
print(simple_aucs.drop('SVC', axis = 1).mean(axis = 1))

                        NB        DT        RF  SVC    LinReg
Disease                                                      
Musculoskeletal   0.599812  0.605690  0.600000  0.5  0.602927
Positive          0.599433  0.626699  0.578947  0.5  0.555611
Development       0.569473  0.627991  0.617021  0.5  0.734572
Metabolic         0.566709  0.594859  0.555556  0.5  0.808795
Neoplastic        0.703814  0.589567  0.590909  0.5  0.717596
Digestive         0.553241  0.624882  0.588235  0.5  0.660349
Negative          0.670250  0.624723  0.616438  0.5  0.791504
Urinary           0.785251  0.494327  0.666667  0.5  0.570502
PNS               0.608439  0.728929  0.735294  0.5  0.783565
Endocrine         0.489496  0.491597  0.500000  0.5  0.428571
Immune            0.576491  0.645830  0.623457  0.5  0.708723
Reproductive      0.542819  0.619301  0.562500  0.5  0.680087
Hematologic       0.641562  0.579149  0.500000  0.5  0.719665
Cardiovascular    0.614819  0.610321  0.617647  0.5  0.695378
Infectio

In [None]:
merged = merge()

Musculoskeletal
Development
Metabolic
Neoplastic
Hepatic
Digestive
Urinary
PNS
Endocrine
Immune
Reproductive
Hematologic
Cardiovascular
Infectious
Respiratory
Integumentary
Ophthalmological
Procedural


# Appendix

In [72]:
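def read_file(fname, new_features = False):
    '''
    Reads a disease file into a pandas dataframe indexed by c_id.
    Derives sex, diagnosis, state, and county from the p_id string.
    If new_features is True, also adds cumulative sums (total, 6mos, 1y, 2y).
    All columns except p_id and the cumulative sums are label encoded.
    '''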
    with open(fname) as f:
        content = f.readlines()
    
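    # split() with no argument collapses repeated whitespace, so empty
    # fields do not inflate the column count
    content = [x.split() for x in content]

    # Rows can have different lengths; size the frame to the longest one
    max_cols = max(len(line) for line in content)

    # Disease name is the file name without its directory, e.g. 'Infectious'
    name = os.path.basename(fname)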

    columns = ['p_id', 'c_id'] + [name + str(i) for i in range(max_cols - 2)]

    # sep = r'\s+' replaces delim_whitespace = True, which is deprecated in recent pandas
    df = pd.read_csv(fname, names = columns, engine = 'python', 
        sep = r'\s+', index_col = 'c_id')

    df['sex'] = df.p_id.apply(lambda x: x[0])
    df['diagnosis'] = df.p_id.apply(lambda x: x[1:4])
    df['state'] = df.p_id.apply(lambda x: x[4:6])
    df['county'] = df.p_id.apply(lambda x: x[6:9])
    
    end_cols = 0
    if new_features:
        # Using sum overweights other diseases but generally proportional
        df['total'] = df.iloc[:, 1:df.shape[1]-4].sum(axis = 1)
        df['6mos'] = df.iloc[:, 1:28].sum(axis = 1)
        df['1y'] = df.iloc[:, 1:53].sum(axis = 1)
        df['2y'] = df.iloc[:, 1:105].sum(axis = 1)
        
        end_cols = 4
    
    df = df.replace(np.nan, -1)

    for col in df.columns[1:df.shape[1]-end_cols]:
        le = LabelEncoder()
        le = le.fit(df[col])
        
        df[col] = le.transform(df[col])
        
        if col == 'sex':
            print(list(le.classes_))
        
    return df
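
The `p_id` slicing and label encoding in `read_file` can be checked in isolation. The sketch below mirrors those steps in plain Python. The ID `'MPOS36061'` is a made-up example, the field layout (1 character sex, 3 characters diagnosis, 2 characters state, 3 characters county) is inferred from the slice indices rather than from a documented format, and `parse_p_id` and `label_encode` are hypothetical helpers.

```python
def parse_p_id(p_id):
    """Split a p_id string into the fields read_file derives from it."""
    return {
        'sex': p_id[0],
        'diagnosis': p_id[1:4],
        'state': p_id[4:6],
        'county': p_id[6:9],
    }

def label_encode(values):
    """Map each distinct value to its index in sorted order.

    This mirrors how LabelEncoder assigns codes: classes are sorted,
    so 'F' becomes 0 and 'M' becomes 1.
    """
    classes = sorted(set(values))
    index = {value: code for code, value in enumerate(classes)}
    return [index[v] for v in values], classes

# Made-up ID for illustration only
fields = parse_p_id('MPOS36061')
codes, classes = label_encode(['M', 'F', 'M'])
```

The sorted class order is why the gender filters earlier in the notebook use `sex == 1` for males and `sex == 0` for females.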