The purpose of this notebook is to apply a support vector machine (SVM) model to RNA-Seq data. The data come from the study by Kopp et al., which aimed to determine the effectiveness of the medication lumacaftor/ivacaftor for cystic fibrosis (CF). The experiment involved 40 participants: 20 healthy controls and 20 patients with CF. The 20 CF patients were sampled both before and after treatment with lumacaftor/ivacaftor, giving 60 samples in total.

In [231]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns #For visualization
%matplotlib inline
from sklearn.metrics import classification_report,confusion_matrix #For analysis after model application
from sklearn.model_selection import train_test_split #To train SVM model
from sklearn.model_selection import GridSearchCV #To change parameters in the SVM 
from sklearn.svm import SVC

In [47]:
data_df = pd.read_excel ('/Users/danielaquijano/Documents/GitHub/Machine-Learning-Course-Projects/sourcefiles/GSE124548_RNAseq.xlsx')
data_df.head()

Unnamed: 0,RowID,ID,Description,EntrezID,Class,pre_drug.vs.control_ShrunkenLog2FC,pre_drug.vs.control_MLELog2FC,pre_drug.vs.control_pVal,pre_drug.vs.control_pAdj,post_drug.vs.control_ShrunkenLog2FC,...,Raw_Orkambi_021_Base,Raw_Orkambi_021_V2,Raw_Orkambi_022_Base,Raw_Orkambi_022_V2,Raw_Orkambi_024_Base,Raw_Orkambi_024_V2,Raw_Orkambi_025_Base,Raw_Orkambi_025_V2,Raw_Orkambi_026_Base,Raw_Orkambi_026_V2
0,1,A1BG,alpha-1-B glycoprotein,1,protein_coding,-0.050497,-0.066327,0.691693,0.824882,0.082631,...,113,89,31,125,44,78,113,98,38,70
1,2,A1BG-AS1,A1BG antisense RNA 1,503538,lncRNA,0.004178,0.032895,0.971929,0.986309,0.155902,...,40,36,26,87,23,72,74,55,64,53
2,3,A2M-AS1,A2M antisense RNA 1 (head to head),144571,lncRNA,0.356337,0.932311,0.050847,0.179776,-0.359549,...,16,23,40,104,370,199,81,30,228,33
3,4,AAAS,aladin WD repeat nucleoporin,8086,protein_coding,-0.139005,-0.152091,0.285215,0.495514,-0.11887,...,130,71,28,178,62,219,176,106,160,121
4,5,AACS,acetoacetyl-CoA synthetase,65985,protein_coding,0.138032,0.143003,0.088856,0.251817,0.075925,...,192,157,71,228,130,185,198,114,131,157


Possible comparisons to perform:

An unpaired analysis comparing healthy vs cystic fibrosis. This will identify genes that can distinguish a healthy patient from one with CF.

A paired analysis comparing the CF patients before and after treatment. This will give us an idea of how the CF patients are responding to the treatment. However, we expect this to be noisy, since the motivation for the experiment was that not all patients respond similarly.

Expression data from the top 10 most differentially expressed genes will be used for unpaired analysis.

First, test how well the model distinguishes healthy from CF patients. Then treat the treated CF samples as a held-out set and predict their class.

In [393]:
#Trim dataframe to only contain 60 columns
#Get column names/dataset information
data_df.info()
for col in data_df.columns:
    print(col)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15570 entries, 0 to 15569
Columns: 140 entries, RowID to Raw_Orkambi_026_V2
dtypes: float64(75), int64(62), object(3)
memory usage: 16.6+ MB
RowID
ID
Description
EntrezID
Class
pre_drug.vs.control_ShrunkenLog2FC
pre_drug.vs.control_MLELog2FC
pre_drug.vs.control_pVal
pre_drug.vs.control_pAdj
post_drug.vs.control_ShrunkenLog2FC
post_drug.vs.control_MLELog2FC
post_drug.vs.control_pVal
post_drug.vs.control_pAdj
post_drug.vs.pre_drug_ShrunkenLog2FC
post_drug.vs.pre_drug_MLELog2FC
post_drug.vs.pre_drug_pVal
post_drug.vs.pre_drug_pAdj
control_Mean
pre-drug_Mean
post-drug_Mean
Norm_10_HC_Auto_066_237
Norm_11_Orkambi_006_Base
Norm_12_Orkambi_007_V2
Norm_13_HC_Auto_068_239
Norm_14_Orkambi_007_Base
Norm_15_Orkambi_009_V2
Norm_16_HC_Auto_072_243
Norm_17_Orkambi_009_Base
Norm_18_Orkambi_010_V2
Norm_19_HC_Auto_074_245
Norm_1_Orkambi_001_Base
Norm_20_Orkambi_010_Base
Norm_21_HC_Immune_004
Norm_22_Orkambi_012_V2
Norm_23_HC_Auto_076_247
Norm_24_Orkambi_

In [49]:
#Obtain column numbers to extract 60 column raw data
print(data_df.columns.get_loc("Raw_10_HC_Auto_066_237"))
print(data_df.columns.get_loc('Raw_Orkambi_026_V2'))

80
139


In [88]:
#Slice Dataframe to only include raw data
raw_data=data_df.iloc[:,80:140]
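
As an alternative to looking up positions with `get_loc`, the raw-count columns could be selected by name. The sketch below is a minimal illustration on a made-up miniature of `data_df` (the `toy_df` frame is hypothetical); it assumes every raw-count column, and no other column, starts with `Raw_`, which the column listing above suggests but does not fully confirm because it is truncated.

```python
import pandas as pd

# Hypothetical miniature of data_df: one annotation column, one Norm_ column, two Raw_ columns
toy_df = pd.DataFrame(
    {"ID": ["A1BG", "AAAS"],
     "Norm_10_HC_Auto_066_237": [1.5, 2.5],
     "Raw_10_HC_Auto_066_237": [81, 161],
     "Raw_Orkambi_026_V2": [70, 121]}
)

# Keep only the columns whose name starts with "Raw_"
raw_only = toy_df.loc[:, toy_df.columns.str.startswith("Raw_")]
print(list(raw_only.columns))
```

Selecting by prefix keeps working if columns are added or reordered, whereas fixed positions such as `80:140` would need to be looked up again.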

In [89]:
#Add gene id names to the raw data
gene_ids=data_df["ID"]
raw_data.insert(0, "Gene_ID", gene_ids)
raw_data.head()

Unnamed: 0,Gene_ID,Raw_10_HC_Auto_066_237,Raw_11_Orkambi_006_Base,Raw_12_Orkambi_007_V2,Raw_13_HC_Auto_068_239,Raw_14_Orkambi_007_Base,Raw_15_Orkambi_009_V2,Raw_16_HC_Auto_072_243,Raw_17_Orkambi_009_Base,Raw_18_Orkambi_010_V2,...,Raw_Orkambi_021_Base,Raw_Orkambi_021_V2,Raw_Orkambi_022_Base,Raw_Orkambi_022_V2,Raw_Orkambi_024_Base,Raw_Orkambi_024_V2,Raw_Orkambi_025_Base,Raw_Orkambi_025_V2,Raw_Orkambi_026_Base,Raw_Orkambi_026_V2
0,A1BG,81,38,112,46,64,196,125,140,72,...,113,89,31,125,44,78,113,98,38,70
1,A1BG-AS1,49,31,84,26,38,86,80,76,38,...,40,36,26,87,23,72,74,55,64,53
2,A2M-AS1,74,108,62,47,41,31,105,45,63,...,16,23,40,104,370,199,81,30,228,33
3,AAAS,161,86,197,92,116,143,243,136,151,...,130,71,28,178,62,219,176,106,160,121
4,AACS,199,128,198,117,162,263,218,243,201,...,192,157,71,228,130,185,198,114,131,157


### Normalizing the data

Adjust the values such that each column will add up to 1,000,000. This assumes all samples have the same number of transcripts.
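
The counts-per-million scaling described above can also be written without an explicit loop. The sketch below is a minimal illustration on a made-up three-gene table (the `toy_counts` frame and its values are hypothetical, not taken from the dataset): every sample column is divided by its own column total and multiplied by 1,000,000.

```python
import pandas as pd

# Hypothetical toy counts: 3 genes x 2 samples (not taken from the dataset)
toy_counts = pd.DataFrame(
    {"Gene_ID": ["g1", "g2", "g3"],
     "sample_A": [10, 30, 60],
     "sample_B": [5, 5, 90]}
)

sample_cols = toy_counts.columns.drop("Gene_ID")     # every column except the gene label
library_sizes = toy_counts[sample_cols].sum(axis=0)  # total count per sample (per column)

toy_cpm = toy_counts.copy()
# Divide each column by its own total, then scale to one million
toy_cpm[sample_cols] = toy_counts[sample_cols].div(library_sizes, axis=1) * 1_000_000

print(toy_cpm)
print(toy_cpm[sample_cols].sum(axis=0))  # per-sample totals after scaling
```

Summing over all genes in a column is what makes each sample total one million; summing over only a subset of rows would give a different scale factor.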

In [90]:
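#NOTE: raw_data[1:61] slices rows 1-60 (60 genes), not the 60 sample columns,
#so the sums printed below cover only those 60 genes rather than every gene.
#Whole-column totals would need raw_data.iloc[:, 1:61].sum(axis=0)
raw_cols_sums=raw_data[1:61].sum(axis = 0, skipna = True)
print(raw_cols_sums)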
#Copy raw_data dataframe in order to replace the values with the normalized values
raw_data_normalized = raw_data.copy() 


Gene_ID                    A1BG-AS1A2M-AS1AAASAACSAAED1AAGABAAK1AAMDCAAMP...
Raw_10_HC_Auto_066_237                                                 49554
Raw_11_Orkambi_006_Base                                                36371
Raw_12_Orkambi_007_V2                                                  59219
Raw_13_HC_Auto_068_239                                                 35697
                                                 ...                        
Raw_Orkambi_024_V2                                                     52219
Raw_Orkambi_025_Base                                                   57331
Raw_Orkambi_025_V2                                                     53249
Raw_Orkambi_026_Base                                                   31577
Raw_Orkambi_026_V2                                                     45619
Length: 61, dtype: object


In [91]:
#Replace the first value of raw_cols_sums since it is a meaningless string of letters from gene names
raw_cols_sums.iat[0]=0

49554

In [92]:
#Insert a row at the end of the original raw_data that consists of the sums of each column
raw_data.loc[len(raw_data.index)] = raw_cols_sums
raw_data.tail()

Unnamed: 0,Gene_ID,Raw_10_HC_Auto_066_237,Raw_11_Orkambi_006_Base,Raw_12_Orkambi_007_V2,Raw_13_HC_Auto_068_239,Raw_14_Orkambi_007_Base,Raw_15_Orkambi_009_V2,Raw_16_HC_Auto_072_243,Raw_17_Orkambi_009_Base,Raw_18_Orkambi_010_V2,...,Raw_Orkambi_021_Base,Raw_Orkambi_021_V2,Raw_Orkambi_022_Base,Raw_Orkambi_022_V2,Raw_Orkambi_024_Base,Raw_Orkambi_024_V2,Raw_Orkambi_025_Base,Raw_Orkambi_025_V2,Raw_Orkambi_026_Base,Raw_Orkambi_026_V2
15566,ZYG11B,2591,2248,2374,2005,2406,3574,4167,4939,3513,...,3103,3112,1895,4289,2191,2274,2857,3537,1056,2432
15567,ZYX,826,1097,836,610,1431,1166,1411,3067,2651,...,1592,1118,950,2321,1160,1215,1144,872,2198,876
15568,ZZEF1,2552,2668,2583,2093,2486,2645,3005,4363,3036,...,2413,2023,960,3538,2939,3225,2370,2050,4532,2432
15569,ZZZ3,1743,1095,2415,1046,1343,2157,2279,2001,1690,...,1579,1734,654,2152,1153,1772,1768,1428,946,1362
15570,0,49554,36371,59219,35697,43462,64669,64171,67510,60646,...,45774,48848,18034,60077,33020,52219,57331,53249,31577,45619


#Iterate through dataframe
#for i in range(len(raw_data)):
    #for j in range(1, len(raw_data.columns)):
        #print(raw_data.iat[i,j])

In [93]:
#For each gene, divide its count by the corresponding raw_cols_sums entry, then multiply by 1,000,000
#Results are written into raw_data_normalized

#Loop over the rows of raw_data_normalized: raw_data now has one extra row (the appended sums),
#so looping over len(raw_data) would run past the end of raw_data_normalized and raise an IndexError
for i in range(len(raw_data_normalized)):
    for j in range(1, len(raw_data.columns)):
        raw_data_normalized.iat[i,j]=(int(raw_data.iat[i,j])*1000000)/int(raw_cols_sums.iat[j])

In [121]:
#This is the normalized dataframe
raw_data_normalized.head()

Unnamed: 0,Gene_ID,Raw_10_HC_Auto_066_237,Raw_11_Orkambi_006_Base,Raw_12_Orkambi_007_V2,Raw_13_HC_Auto_068_239,Raw_14_Orkambi_007_Base,Raw_15_Orkambi_009_V2,Raw_16_HC_Auto_072_243,Raw_17_Orkambi_009_Base,Raw_18_Orkambi_010_V2,...,Raw_Orkambi_021_Base,Raw_Orkambi_021_V2,Raw_Orkambi_022_Base,Raw_Orkambi_022_V2,Raw_Orkambi_024_Base,Raw_Orkambi_024_V2,Raw_Orkambi_025_Base,Raw_Orkambi_025_V2,Raw_Orkambi_026_Base,Raw_Orkambi_026_V2
0,A1BG,1634,1044,1891,1288,1472,3030,1947,2073,1187,...,2468,1821,1718,2080,1332,1493,1971,1840,1203,1534
1,A1BG-AS1,988,852,1418,728,874,1329,1246,1125,626,...,873,736,1441,1448,696,1378,1290,1032,2026,1161
2,A2M-AS1,1493,2969,1046,1316,943,479,1636,666,1038,...,349,470,2218,1731,11205,3810,1412,563,7220,723
3,AAAS,3248,2364,3326,2577,2668,2211,3786,2014,2489,...,2840,1453,1552,2962,1877,4193,3069,1990,5066,2652
4,AACS,4015,3519,3343,3277,3727,4066,3397,3599,3314,...,4194,3214,3937,3795,3937,3542,3453,2140,4148,3441


### Filtering Data 

In [138]:
#Analyze data based on the following genes: make a subset of the normalized data for the selected genes
#List of the selected genes
top10_ci_genes = ['LOC105372578','MCEMP1','MMP9','SOCS3','ANXA3','G0S2','IL1R2','PFKFB3','OSM','SEMA6B']

In [139]:
#Make a logic statement to check if the gene ID from the list of genes is found in the Gene_ID column
#Filter dataframe, save it to norm_cpm_df, print new filtered dataframe
logic_top10_genes=raw_data_normalized.Gene_ID.isin(top10_ci_genes)
norm_cpm_df=raw_data_normalized[logic_top10_genes]
norm_cpm_df

Unnamed: 0,Gene_ID,Raw_10_HC_Auto_066_237,Raw_11_Orkambi_006_Base,Raw_12_Orkambi_007_V2,Raw_13_HC_Auto_068_239,Raw_14_Orkambi_007_Base,Raw_15_Orkambi_009_V2,Raw_16_HC_Auto_072_243,Raw_17_Orkambi_009_Base,Raw_18_Orkambi_010_V2,...,Raw_Orkambi_021_Base,Raw_Orkambi_021_V2,Raw_Orkambi_022_Base,Raw_Orkambi_022_V2,Raw_Orkambi_024_Base,Raw_Orkambi_024_V2,Raw_Orkambi_025_Base,Raw_Orkambi_025_V2,Raw_Orkambi_026_Base,Raw_Orkambi_026_V2
555,ANXA3,25285,35605,9895,44569,177327,54570,39940,339105,216749,...,113994,131898,161029,168816,73743,36059,91660,177280,9532,62298
4206,G0S2,565,2721,996,784,4164,4963,997,6236,5969,...,4391,3541,5434,7956,1090,1014,3000,3417,538,2652
5313,IL1R2,25265,46850,23337,44681,88836,74502,41140,113864,102578,...,65801,85407,118054,82244,73561,41479,56025,102912,45666,54911
6886,LOC105372578,1594,3794,1182,2549,17210,6401,1683,47948,46103,...,9415,13183,51846,57492,8843,2719,6383,12188,1045,7365
8139,MCEMP1,2683,2447,1300,2549,17325,5010,2290,34587,26959,...,15489,9314,5766,13732,2604,1876,6907,7943,1805,3003
8458,MMP9,22056,34533,39345,31627,219064,155221,26320,377025,158163,...,157600,81252,82067,241773,40642,24646,80305,137035,35500,57717
9470,OSM,3148,7368,2516,4426,23537,10855,4410,30943,33472,...,18569,16541,28279,23103,6147,4117,12122,17765,2153,7957
9800,PFKFB3,18141,30051,12968,33560,143780,35596,31213,210502,243593,...,110564,82418,103859,133628,28740,24607,57176,58742,40029,39676
11818,SEMA6B,565,439,574,532,1472,773,498,1821,3693,...,2381,1760,1885,2113,333,344,1290,1389,443,591
12605,SOCS3,30572,66124,19942,44653,203258,75476,44708,206221,370428,...,167453,149217,341632,239126,54603,35772,116708,185186,17449,73587


### Filtering Samples

The sample group is encoded in each column name:

any column name containing HC is a healthy control

any column name ending in Base is a CF patient before treatment

any column name ending in V2 is a CF patient after treatment
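
The naming rules above can be sketched as a small helper that maps a column name to its sample group. The function name `sample_group` and the group labels are hypothetical, chosen only for this illustration; the example column names are copied from the tables above.

```python
import re

def sample_group(column_name):
    """Return 'healthy', 'cf_untreated', 'cf_treated', or None for a column name."""
    if re.search(r"_HC_", column_name):
        return "healthy"
    if re.search(r"_Base$", column_name):
        return "cf_untreated"
    if re.search(r"_V2$", column_name):
        return "cf_treated"
    return None  # e.g. the Gene_ID column

example_columns = [
    "Gene_ID",
    "Raw_10_HC_Auto_066_237",
    "Raw_11_Orkambi_006_Base",
    "Raw_12_Orkambi_007_V2",
    "Raw_HC_Immune_006",
]
for name in example_columns:
    print(name, "->", sample_group(name))
```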

In [140]:
#Separate the gene-filtered dataframe into healthy and cystic fibrosis dataframes
#Split norm_cpm_df into 2 different dataframes
    #one with just healthy values, hc_top10
    #one with just CF (Base) values, cf_top10

In [337]:
#Filter Healthy samples 

hc_top10=norm_cpm_df.filter(regex=("_HC_"))
hc_top10.insert(0, "Gene_ID", gene_ids)
hc_top10.head(2)

Unnamed: 0,Gene_ID,Raw_10_HC_Auto_066_237,Raw_13_HC_Auto_068_239,Raw_16_HC_Auto_072_243,Raw_19_HC_Auto_074_245,Raw_21_HC_Immune_004,Raw_23_HC_Auto_076_247,Raw_26_HC_Auto_078_249,Raw_29_HC_Auto_080_251,Raw_3_HC_Auto_062_233,...,Raw_4_HC_Immune_002,Raw_7_HC_Auto_064_235,Raw_HC_Auto_082_253,Raw_HC_Auto_084_255,Raw_HC_Auto_088_259,Raw_HC_Auto_091_263,Raw_HC_Auto_093_265,Raw_HC_Auto_095_267,Raw_HC_Immune_006,Raw_HC_Immune_008
555,ANXA3,25285,44569,39940,17105,46019,26726,59648,33351,15090,...,26571,60781,14381,15115,54502,19848,38542,88176,29319,44281
4206,G0S2,565,784,997,956,909,1307,1396,1009,372,...,764,1330,534,459,1606,987,1079,1240,1027,1398


In [338]:
#Filter CF samples 
cf_top10=norm_cpm_df.filter(regex=("Base"))
cf_top10.insert(0, "Gene_ID", gene_ids)
cf_top10.head(2)

Unnamed: 0,Gene_ID,Raw_11_Orkambi_006_Base,Raw_14_Orkambi_007_Base,Raw_17_Orkambi_009_Base,Raw_1_Orkambi_001_Base,Raw_20_Orkambi_010_Base,Raw_24_Orkambi_012_Base,Raw_27_Orkambi_013_Base,Raw_30_Orkambi_014_Base,Raw_5_Orkambi_002_Base,...,Raw_Orkambi_015_Base,Raw_Orkambi_016_Base,Raw_Orkambi_017_Base,Raw_Orkambi_018_Base,Raw_Orkambi_020_Base,Raw_Orkambi_021_Base,Raw_Orkambi_022_Base,Raw_Orkambi_024_Base,Raw_Orkambi_025_Base,Raw_Orkambi_026_Base
555,ANXA3,35605,177327,339105,121128,244406,19587,30987,208073,70112,...,253041,48893,147532,69603,454134,113994,161029,73743,91660,9532
4206,G0S2,2721,4164,6236,5643,6408,1635,570,1718,6214,...,11065,3263,4527,2027,6360,4391,5434,1090,3000,538


In [331]:
#Filter CF patients with treatment
cf_treat_top10=norm_cpm_df.filter(regex=("V2"))
cf_treat_top10.insert(0, "Gene_ID", gene_ids)


### Format Dataframe for testing Treatment Samples

In [332]:
cf_treat_top10

Unnamed: 0,Gene_ID,Raw_12_Orkambi_007_V2,Raw_15_Orkambi_009_V2,Raw_18_Orkambi_010_V2,Raw_22_Orkambi_012_V2,Raw_25_Orkambi_013_V2,Raw_28_Orkambi_014_V2,Raw_2_Orkambi_001_V2,Raw_6_Orkambi_004_V2,Raw_9_Orkambi_006_V2,...,Raw_Orkambi_016_V2,Raw_Orkambi_017_V2,Raw_Orkambi_018_V2,Raw_Orkambi_019_V2,Raw_Orkambi_020_V2,Raw_Orkambi_021_V2,Raw_Orkambi_022_V2,Raw_Orkambi_024_V2,Raw_Orkambi_025_V2,Raw_Orkambi_026_V2
555,ANXA3,9895,54570,216749,28204,31907,324844,148511,159789,93737,...,14024,53873,19521,86842,46557,131898,168816,36059,177280,62298
4206,G0S2,996,4963,5969,2494,908,3369,2483,1111,3003,...,1500,3964,1424,5885,3475,3541,7956,1014,3417,2652
5313,IL1R2,23337,74502,102578,40146,79609,133071,75662,659918,60289,...,22658,77943,47435,78747,73825,85407,82244,41479,102912,54911
6886,LOC105372578,1182,6401,46103,2062,3257,33070,14987,16702,15594,...,1222,9350,3018,7330,7588,13183,57492,2719,12188,7365
8139,MCEMP1,1300,5010,26959,3083,4659,19720,11075,13962,6848,...,1278,5220,5754,5737,2545,9314,13732,1876,7943,3003
8458,MMP9,39345,155221,158163,63793,59351,103555,102582,240666,63172,...,15914,83676,33710,83103,62981,81252,241773,24646,137035,57717
9470,OSM,2516,10855,33472,5087,5390,23090,6197,13755,11077,...,3927,9466,3272,7755,12973,16541,23103,4117,17765,7957
9800,PFKFB3,12968,35596,243593,22587,25055,139295,73266,123694,54882,...,17137,43217,30635,38544,28149,82418,133628,24607,58742,39676
11818,SEMA6B,574,773,3693,451,671,3185,285,956,528,...,314,792,409,807,758,1760,2113,344,1389,591
12605,SOCS3,19942,75476,370428,34312,63952,295862,62960,133467,87129,...,43130,86302,26672,39607,131838,149217,239126,35772,185186,73587


In [356]:
cf_treat_top10_transposed=cf_treat_top10.transpose()


In [357]:
cf_treat_top10_transposed.columns = cf_treat_top10_transposed.iloc[0]
cf_treat_top10_transposed=cf_treat_top10_transposed.drop('Gene_ID',axis=0)

In [358]:
cf_treat_top10_transposed.head()


Gene_ID,ANXA3,G0S2,IL1R2,LOC105372578,MCEMP1,MMP9,OSM,PFKFB3,SEMA6B,SOCS3
Raw_12_Orkambi_007_V2,9895,996,23337,1182,1300,39345,2516,12968,574,19942
Raw_15_Orkambi_009_V2,54570,4963,74502,6401,5010,155221,10855,35596,773,75476
Raw_18_Orkambi_010_V2,216749,5969,102578,46103,26959,158163,33472,243593,3693,370428
Raw_22_Orkambi_012_V2,28204,2494,40146,2062,3083,63793,5087,22587,451,34312
Raw_25_Orkambi_013_V2,31907,908,79609,3257,4659,59351,5390,25055,671,63952


In [360]:
cf_treat_top10_transposed.to_csv('CF_treated_samples.csv')

### Format DataFrames for training

In [339]:
#Transpose dataframes so that the gene ids will now be columns: hc_top10_t and cf_top10_t
#Genes will be used as features for ML model
hc_top10_t=hc_top10.transpose()
cf_top10_t=cf_top10.transpose()

In [340]:
#Make gene ID the column label 
hc_top10_t.columns = hc_top10_t.iloc[0]
hc_top10_t=hc_top10_t.drop('Gene_ID',axis=0)

In [341]:
#Make Gene ID the column label 
cf_top10_t.columns = cf_top10_t.iloc[0]
cf_top10_t=cf_top10_t.drop('Gene_ID',axis=0)
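
The transpose-then-relabel steps above could also be done in one expression. The sketch below is a minimal illustration on a made-up two-gene frame (`toy` is hypothetical, with values borrowed from the healthy table above); it sets `Gene_ID` as the index before transposing, so no label row has to be dropped afterwards.

```python
import pandas as pd

# Hypothetical two-gene, two-sample frame in the same layout as hc_top10
toy = pd.DataFrame(
    {"Gene_ID": ["ANXA3", "G0S2"],
     "sample_1": [25285, 565],
     "sample_2": [44569, 784]}
)

# Use the gene IDs as the index, then transpose: rows become samples, columns become genes
toy_t = toy.set_index("Gene_ID").T
print(toy_t)
```

Because the string column never enters the transposed values, the result stays numeric instead of falling back to the `object` dtype.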

In [394]:
#Healthy samples will be given a value of 0, CF samples will be given a value of 1 
import itertools
cf_top10_t['Y']=(list(itertools.repeat(1, len(cf_top10_t))))


In [343]:
hc_top10_t['Y']=0 #Every healthy sample gets label 0; the 'Gene_ID' row was already dropped above

In [344]:
hc_cf_top10=pd.concat([hc_top10_t, cf_top10_t], ignore_index=True)
hc_cf_top10

Gene_ID,ANXA3,G0S2,IL1R2,LOC105372578,MCEMP1,MMP9,OSM,PFKFB3,SEMA6B,SOCS3,Y
0,25285,565,25265,1594,2683,22056,3148,18141,565,30572,0
1,44569,784,44681,2549,2549,31627,4426,33560,532,44653,0
2,39940,997,41140,1683,2290,26320,4410,31213,498,44708,0
3,17105,956,29726,1359,1508,11028,2422,19931,488,20398,0
4,46019,909,45917,3302,2948,25983,5156,26270,556,51765,0
5,26726,1307,26712,6762,2418,18980,4695,22171,421,38649,0
6,59648,1396,35909,6760,4145,27774,7115,28793,310,58673,0
7,33351,1009,38037,1607,1453,20010,3796,24970,495,39816,0
8,15090,372,21201,613,1708,19514,1730,14652,481,14236,0
9,10586,510,22034,876,1307,27679,1227,10810,286,10953,0


### Support Vector Machines Model

Using the samples labeled as healthy or CF, train an SVM model to recognize whether a sample is healthy. The treated CF data can then be passed to the model to see whether it classifies those samples as healthy.
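
A minimal sketch of this train-and-evaluate step follows, run on synthetic stand-in data rather than the notebook's dataframes. Two things here are assumptions that go beyond the notebook: the features are standardized with `StandardScaler` before the SVM (RBF kernels are sensitive to feature scale), and the split is stratified and seeded so it is repeatable. All variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 "healthy" and 20 "CF" samples with 10 features each.
# The second group is drawn with a higher mean so the classes differ.
X_demo = np.vstack([
    rng.normal(loc=20_000, scale=5_000, size=(20, 10)),
    rng.normal(loc=80_000, scale=20_000, size=(20, 10)),
])
y_demo = np.array([0] * 20 + [1] * 20)

# stratify keeps the class balance in both splits; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.40, stratify=y_demo, random_state=0
)

# Standardize each feature, then fit an SVC with default settings
demo_model = make_pipeline(StandardScaler(), SVC())
demo_model.fit(X_tr, y_tr)
demo_pred = demo_model.predict(X_te)
print(demo_pred.shape)
```

With only 40 labeled samples, an unstratified 40% split can leave very few samples of one class in the test set, which makes per-class metrics unstable.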

In [345]:
X=hc_cf_top10.drop('Y',axis=1) #Features, drop the target column since we are trying to predict Y, either healthy or disease
y=hc_cf_top10['Y'] #value that is to be predicted, 0 or 1, healthy or disease
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
model = SVC()
model.fit(X_train,y_train)

SVC()

In [346]:

model_predict = model.predict(X_test)
print(classification_report(y_test,model_predict))

              precision    recall  f1-score   support

           0       0.33      1.00      0.50         4
           1       1.00      0.33      0.50        12

    accuracy                           0.50        16
   macro avg       0.67      0.67      0.50        16
weighted avg       0.83      0.50      0.50        16



In [376]:
CF_treated=pd.read_csv('CF_treated_samples.csv')
CF_treated=CF_treated.drop(columns=['Unnamed: 0'])

### Use model to predict CF vs healthy in treated patients

In [None]:
#The held-out accuracy depends on the random train/test split, which is not seeded:
#an earlier run reached 94% accuracy on healthy versus CF, while the report shown above gives 50%
#Now, predict the treated CF patients
# define input
x2=CF_treated

In [381]:
# get prediction for new input
treatment_predict = model.predict(x2)
# summarize input and output
print(x2, treatment_predict)

     ANXA3  G0S2   IL1R2  LOC105372578  MCEMP1    MMP9    OSM  PFKFB3  SEMA6B  \
0     9895   996   23337          1182    1300   39345   2516   12968     574   
1    54570  4963   74502          6401    5010  155221  10855   35596     773   
2   216749  5969  102578         46103   26959  158163  33472  243593    3693   
3    28204  2494   40146          2062    3083   63793   5087   22587     451   
4    31907   908   79609          3257    4659   59351   5390   25055     671   
5   324844  3369  133071         33070   19720  103555  23090  139295    3185   
6   148511  2483   75662         14987   11075  102582   6197   73266     285   
7   159789  1111  659918         16702   13962  240666  13755  123694     956   
8    93737  3003   60289         15594    6848   63172  11077   54882     528   
9   191141  9984   64797         23758   18381  216385  17229  101612    1356   
10   14024  1500   22658          1222    1278   15914   3927   17137     314   
11   53873  3964   77943    

According to the model, 8 patients were predicted to still have CF based on their gene expression profile. The rest were predicted not to have CF, that is, to fall under the healthy category.
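
A tally like this can be computed directly from the prediction array. The sketch below uses a made-up prediction vector (`example_predictions` is hypothetical and is not the notebook's actual output) to show the counting step.

```python
from collections import Counter

# Hypothetical prediction vector (0 = healthy, 1 = CF); not the notebook's actual output
example_predictions = [1, 0, 0, 1, 0, 1, 0, 0]

label_names = {0: "healthy", 1: "CF"}
counts = Counter(label_names[p] for p in example_predictions)
print(counts)
```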

### Using GridSearch Cross validation, check model performance with different values for C and different kernels.

### First, tune parameters with rbf kernel

In [385]:
#Grid search allows trying different parameters in the SVC model
#Grid search takes in a dictionary of parameters to try
#SVC requires C > 0, so the negative C candidates fail with "ValueError: C <= 0" and score nan in the output below
parameter_tuning = {'C': [-0.1,-1, -10, -100,-1000,0.1,1, 10, 100,1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf',]} 
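
For comparison, a grid restricted to strictly positive `C` values might look like the sketch below. It runs on synthetic stand-in data, and the names `positive_grid` and `demo_grid` are hypothetical; `np.logspace` is used only as a compact way to write powers of ten.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic stand-in data: two groups of 20 samples with 4 features each
X_demo = np.vstack([rng.normal(0, 1, size=(20, 4)), rng.normal(2, 1, size=(20, 4))])
y_demo = np.array([0] * 20 + [1] * 20)

positive_grid = {
    "C": np.logspace(-1, 3, 5).tolist(),      # 0.1, 1, 10, 100, 1000
    "gamma": np.logspace(0, -4, 5).tolist(),  # 1, 0.1, 0.01, 0.001, 0.0001
    "kernel": ["rbf"],
}
demo_grid = GridSearchCV(SVC(), positive_grid, cv=5)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(len(demo_grid.cv_results_["params"]))  # 5 C values x 5 gamma values = 25 candidates
```

Every candidate in this grid can be fitted, so none of the mean test scores is `nan`.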

In [386]:
#refit=True refits the best parameter combination on the full dataset once the search finishes
grid = GridSearchCV(SVC(),parameter_tuning,refit=True,verbose=True)
grid.fit(X,y)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


Traceback (most recent call last):
  File "/Users/danielaquijano/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/danielaquijano/opt/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py", line 226, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "/Users/danielaquijano/opt/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py", line 277, in _dense_fit
    self._probB, self.fit_status_ = libsvm.fit(
  File "sklearn/svm/_libsvm.pyx", line 192, in sklearn.svm._libsvm.fit
ValueError: C <= 0

 nan nan nan nan nan nan nan 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]


GridSearchCV(estimator=SVC(),
             param_grid={'C': [-0.1, -1, -10, -100, -1000, 0.1, 1, 10, 100,
                               1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf']},
             verbose=True)

In [387]:
#Inspect results of grid search 
grid.best_params_

{'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}

In [395]:
grid.best_estimator_ 

SVC(C=0.1, gamma=1)

In [391]:
grid_prediction = grid.predict(X)
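print(classification_report(y,grid_prediction)) #100% accuracy here is measured on the same 40 samples used for fitting, so it is training accuracy; the cross-validation scores printed above are 0.5 for every valid candidate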

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        20
           1       1.00      1.00      1.00        20

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40
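
The report above scores the refit model on the same samples it was fitted on. The sketch below, on synthetic data whose labels are unrelated to the features, is one way to compare that kind of training accuracy with a cross-validated estimate for the same `C=0.1, gamma=1` settings. The data and variable names are hypothetical, and the feature scale is only an assumption meant to resemble unscaled expression values.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Synthetic, unscaled features with large magnitudes; labels alternate 0/1
# independently of the features, so there is no real signal to learn
X_noise = rng.normal(loc=50_000, scale=20_000, size=(40, 10))
y_noise = np.array([0, 1] * 20)

clf = SVC(C=0.1, gamma=1, kernel="rbf")
train_acc = clf.fit(X_noise, y_noise).score(X_noise, y_noise)  # scored on the fitting data
cv_acc = cross_val_score(clf, X_noise, y_noise, cv=5).mean()   # scored on held-out folds

print(f"training accuracy: {train_acc:.2f}")
print(f"5-fold CV accuracy: {cv_acc:.2f}")
```

A large gap between the two numbers is a sign that the model is memorizing the training samples rather than learning a pattern that generalizes.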



In [392]:
#Apply the grid search model to the CF treated samples

# get prediction for new input
treatment_predict_grid = grid.predict(x2)
# summarize input and output
print(x2, treatment_predict_grid)


     ANXA3  G0S2   IL1R2  LOC105372578  MCEMP1    MMP9    OSM  PFKFB3  SEMA6B  \
0     9895   996   23337          1182    1300   39345   2516   12968     574   
1    54570  4963   74502          6401    5010  155221  10855   35596     773   
2   216749  5969  102578         46103   26959  158163  33472  243593    3693   
3    28204  2494   40146          2062    3083   63793   5087   22587     451   
4    31907   908   79609          3257    4659   59351   5390   25055     671   
5   324844  3369  133071         33070   19720  103555  23090  139295    3185   
6   148511  2483   75662         14987   11075  102582   6197   73266     285   
7   159789  1111  659918         16702   13962  240666  13755  123694     956   
8    93737  3003   60289         15594    6848   63172  11077   54882     528   
9   191141  9984   64797         23758   18381  216385  17229  101612    1356   
10   14024  1500   22658          1222    1278   15914   3927   17137     314   
11   53873  3964   77943    

### Second, Tune parameters with linear kernel 


In [396]:
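#Same C and gamma candidates as before; the negative C values fail again, and gamma is not used by the linear kernel
parameter_tuning2 = {'C': [-0.1,-1, -10, -100,-1000,0.1,1, 10, 100,1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['linear']} 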
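
Because `gamma` only enters the `rbf`, `poly` and `sigmoid` kernels, varying it should not change a linear-kernel fit. The sketch below checks this on synthetic stand-in data (all names are hypothetical) by comparing decision values from two fits that differ only in `gamma`.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Synthetic stand-in data: two groups of 20 samples with 4 features each
X_demo = np.vstack([rng.normal(0, 1, size=(20, 4)), rng.normal(2, 1, size=(20, 4))])
y_demo = np.array([0] * 20 + [1] * 20)

# Fit the same linear SVC twice, changing only gamma
scores_by_gamma = [
    SVC(kernel="linear", C=1, gamma=g).fit(X_demo, y_demo).decision_function(X_demo)
    for g in (1, 0.0001)
]
same = np.allclose(scores_by_gamma[0], scores_by_gamma[1])
print(same)
```

For a linear kernel, a grid over `C` alone would therefore cover the same models with five times fewer fits.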

In [397]:
#refit=True refits the best parameter combination on the full dataset once the search finishes
grid2 = GridSearchCV(SVC(),parameter_tuning2,refit=True,verbose=True)
grid2.fit(X,y)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


Traceback (most recent call last):
  File "/Users/danielaquijano/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/danielaquijano/opt/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py", line 226, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "/Users/danielaquijano/opt/anaconda3/lib/python3.8/site-packages/sklearn/svm/_base.py", line 277, in _dense_fit
    self._probB, self.fit_status_ = libsvm.fit(
  File "sklearn/svm/_libsvm.pyx", line 192, in sklearn.svm._libsvm.fit
ValueError: C <= 0

 nan nan nan nan nan nan nan 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8
 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8]


GridSearchCV(estimator=SVC(),
             param_grid={'C': [-0.1, -1, -10, -100, -1000, 0.1, 1, 10, 100,
                               1000],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['linear']},
             verbose=True)

In [399]:
#Inspect results of grid search 
grid2.best_params_

{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}

In [400]:
# get prediction for new input
treatment_predict_grid2 = grid2.predict(x2)
# summarize input and output
print(x2, treatment_predict_grid2)

     ANXA3  G0S2   IL1R2  LOC105372578  MCEMP1    MMP9    OSM  PFKFB3  SEMA6B  \
0     9895   996   23337          1182    1300   39345   2516   12968     574   
1    54570  4963   74502          6401    5010  155221  10855   35596     773   
2   216749  5969  102578         46103   26959  158163  33472  243593    3693   
3    28204  2494   40146          2062    3083   63793   5087   22587     451   
4    31907   908   79609          3257    4659   59351   5390   25055     671   
5   324844  3369  133071         33070   19720  103555  23090  139295    3185   
6   148511  2483   75662         14987   11075  102582   6197   73266     285   
7   159789  1111  659918         16702   13962  240666  13755  123694     956   
8    93737  3003   60289         15594    6848   63172  11077   54882     528   
9   191141  9984   64797         23758   18381  216385  17229  101612    1356   
10   14024  1500   22658          1222    1278   15914   3927   17137     314   
11   53873  3964   77943    