## Predicting Overall Survival Status with Machine Learning

I selected several features as the independent variables and 'Overall Survival Status' as the dependent variable. The goal is to use supervised learning models to predict whether a patient is alive or dead from the selected features. My analysis follows:

- **Dependent Variable:** Overall Survival Status
- **Independent Variables:** 
    - Sex 
    - Diagnosis Age
    - Fraction Genome Altered
    - Longest Dimension
    - Smoking_year
    - Mutation Count
    - Shortest Dimension
    - Person Cigarette Smoking History Pack Year Value
    - Specimen Second Longest Dimension
    - TMB (nonsynonymous)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Set up function parameters for different cross validation strategies
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

kfold = KFold(n_splits=5) # I use this in PART 1.
skfold = StratifiedKFold(n_splits=5, shuffle=True) 
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)
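
The splitters defined in this cell are not used by the later cells, which score each model on a single train/test split. A minimal sketch of how they could be passed to `cross_val_score` follows. The synthetic data from `make_classification` is only a stand-in for the lung dataset, and the names `X_demo`, `y_demo`, `splitters`, and `cv_results` are hypothetical; the choice of `DecisionTreeClassifier` is likewise just for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    KFold,
    RepeatedKFold,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 79 rows, 9 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=79, n_features=9, random_state=0)

splitters = {
    "KFold": KFold(n_splits=5),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
    "RepeatedKFold": RepeatedKFold(n_splits=5, n_repeats=10, random_state=1),
}

cv_results = {}
for name, cv in splitters.items():
    # One accuracy value per fold; RepeatedKFold yields n_splits * n_repeats values
    scores = cross_val_score(
        DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv
    )
    cv_results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} "
          f"(std {scores.std():.3f}, {len(scores)} folds)")
```

With only 79 rows, the spread across folds is often as informative as the mean.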

In [3]:
data = pd.read_csv("Lung_new_data.csv")

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 12 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   Unnamed: 0                                        79 non-null     int64  
 1   Overall Survival Status                           79 non-null     int64  
 2   Sex                                               79 non-null     int64  
 3   Diagnosis Age                                     79 non-null     float64
 4   Fraction Genome Altered                           79 non-null     float64
 5   Longest Dimension                                 79 non-null     float64
 6   Smoking_year                                      79 non-null     float64
 7   Mutation Count                                    79 non-null     float64
 8   Shortest Dimension                                79 non-null     float64
 9   Person Cigarette Smokin

In [5]:
data.head()

Unnamed: 0.1,Unnamed: 0,Overall Survival Status,Sex,Diagnosis Age,Fraction Genome Altered,Longest Dimension,Smoking_year,Mutation Count,Shortest Dimension,Person Cigarette Smoking History Pack Year Value,Specimen Second Longest Dimension,TMB (nonsynonymous)
0,5,0,1,66.0,0.0661,0.8,24.0,119.0,0.4,20.0,0.8,4.033333
1,7,0,0,58.0,0.3056,1.8,30.0,487.0,0.3,15.0,0.9,16.8
2,9,1,1,76.0,0.234,1.6,37.0,464.0,0.5,19.0,0.9,15.8
3,14,0,0,74.0,0.3903,0.9,43.0,344.0,0.3,65.0,0.7,11.866667
4,15,0,1,62.0,0.3183,1.0,49.0,956.0,0.4,98.0,0.8,32.9


In [6]:
# Note: 'Sex' is listed among the independent variables above but is not included in X here
X = data[['Diagnosis Age','Fraction Genome Altered', 'Longest Dimension', 'Smoking_year',
               'Mutation Count', 'Shortest Dimension', 'Person Cigarette Smoking History Pack Year Value',
               'Specimen Second Longest Dimension', 'TMB (nonsynonymous)']]
y = data['Overall Survival Status']


In [7]:
# Step 1: Split the data into training and test sets
from sklearn.model_selection import train_test_split

# randomly assign some data to the test-set and the rest to the training-set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42) 
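
With the default `test_size` of 0.25, a 79-row dataset is split into 59 training and 20 test rows. On a sample this small, an unstratified split can leave the two classes in different proportions in each part. The sketch below shows the `stratify` option on synthetic data; the arrays `X_demo` and `y_demo` and the 55/24 class balance are assumptions for illustration, not values from the lung dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 79 rows, 9 features, imbalanced binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(79, 9))
y_demo = np.array([0] * 55 + [1] * 24)

# stratify=y_demo keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)
print("positive rate: train", round(y_tr.mean(), 3), "test", round(y_te.mean(), 3))
```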

In [8]:
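# KNN model
# Note: this uses KNeighborsRegressor, so score() returns R^2 rather than accuracy.
# The value is therefore not comparable to the classifier accuracies in later cells;
# KNeighborsClassifier would report accuracy instead.
from sklearn.neighbors import KNeighborsRegressor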

knn = KNeighborsRegressor(n_neighbors = 14) 
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))


0.009961127308065754
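
The score above is an R² value because the cell fits a `KNeighborsRegressor`. A minimal sketch of the classification variant follows, assuming synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names). Adding `StandardScaler` is a suggestion rather than part of the original analysis: KNN relies on distances, which large-valued columns would otherwise dominate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 79 rows, 9 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=79, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Scale first, then classify by majority vote among the 14 nearest neighbours
knn_clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=14))
knn_clf.fit(X_tr, y_tr)

# For a classifier, score() is the fraction of correctly predicted labels
knn_accuracy = knn_clf.score(X_te, y_te)
print(knn_accuracy)
```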


In [9]:
#Logistic Regression model

import warnings
warnings.filterwarnings("ignore", category=FutureWarning) # I did this to remove all the warnings :)
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print(logreg.score(X_test, y_test))


0.55
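
The features here span very different scales (in the `data.head()` output, Mutation Count is in the hundreds while Fraction Genome Altered is below 1). This can slow the solver's convergence, and the warnings that would report it are suppressed in this notebook. The sketch below standardizes the inputs and raises `max_iter`, using synthetic stand-in data; the names and the rescaled column are assumptions for illustration, and whether this changes the score on the real data is untested.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 79 rows, 9 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=79, n_features=9, random_state=0)
X_demo[:, 0] *= 1000  # mimic one column with much larger values than the rest
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Standardize each feature, then fit with a generous iteration budget
logreg_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg_scaled.fit(X_tr, y_tr)

logreg_accuracy = logreg_scaled.score(X_te, y_te)
n_iterations = logreg_scaled.named_steps["logisticregression"].n_iter_[0]
print(logreg_accuracy, n_iterations)
```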


In [10]:
#Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

tree_class = DecisionTreeClassifier(random_state=0)
tree_class.fit(X_train, y_train)
print(tree_class.score(X_test, y_test))


0.8


In [11]:
#Random forest classifier

from sklearn.ensemble import RandomForestClassifier

forest_classifier = RandomForestClassifier(random_state=0)
forest_classifier.fit(X_train, y_train)
print(forest_classifier.score(X_test, y_test))


0.75


In [12]:
#SVM
from sklearn import svm

svc = svm.SVC()
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))


0.7


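### Decision Tree has the highest test accuracy (0.80) for predicting whether the patient is alive or dead.

This comparison rests on a single train/test split with 20 test samples. The KNN value is an R² from `KNeighborsRegressor`, not an accuracy, so it is not directly comparable to the other scores.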
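
An accuracy figure is easier to interpret next to a majority-class baseline, since a model that always predicts the more common outcome can already score well when the classes are imbalanced. A minimal sketch with `DummyClassifier` follows. The synthetic data and its roughly 70/30 class balance are assumptions; the class balance of the lung dataset is not shown in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 79 rows, 9 features, roughly 70/30 class balance
X_demo, y_demo = make_classification(
    n_samples=79, n_features=9, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Always predicts the most frequent class seen in the training labels
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)

baseline_accuracy = baseline.score(X_te, y_te)
print(baseline_accuracy)
```

A model is only informative to the extent that its accuracy exceeds this baseline on the same test rows.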