# Dataset Source:

### Title: Lung Cancer Data

#### Source Information:
* Data was published in:
	  Hong, Z.Q. and Yang, J.Y. "Optimal Discriminant Plane for a Small
	  Number of Samples and Design Method of Classifier on the Plane",
	  Pattern Recognition, Vol. 24, No. 4, pp. 317-324, 1991.
* Donor: Stefan Aeberhard, stefan@coral.cs.jcu.edu.au
* Date: May, 1992

#### Past Usage:
* Hong, Z.Q. and Yang, J.Y. "Optimal Discriminant Plane for a Small
          Number of Samples and Design Method of Classifier on the Plane",
          Pattern Recognition, Vol. 24, No. 4, pp. 317-324, 1991.
* Aeberhard, S., Coomans, D., De Vel, O. "Comparisons of
          Classification Methods in High Dimensional Settings",
          submitted to Technometrics.
* Aeberhard, S., Coomans, D., De Vel, O. "The Dangers of
          Bias in High Dimensional Settings", submitted to
          Pattern Recognition.

#### Relevant Information:
* This data was used by Hong and Yang to illustrate the
          power of the optimal discriminant plane even in ill-posed
          settings. Applying the KNN method in the resulting plane
          gave 77% accuracy. However, these results are strongly
          biased (see Aeberhard's second reference above, or email
          stefan@coral.cs.jcu.edu.au). Results obtained by
          Aeberhard et al. are:
          RDA: 62.5%, KNN: 53.1%, Opt. Disc. Plane: 59.4%
* The data describe 3 types of pathological lung cancers.
          The authors give no information on the individual
          variables nor on where the data was originally used.

* In the original data, 4 values for the 5th attribute were -1.
          These values have been changed to ? (unknown). (*)
* In the original data, 1 value for the 39th attribute was 4. This
          value has been changed to ? (unknown). (*)
    
	  
* Number of Instances: 32

* Number of Attributes: 57 (1 class attribute, 56 predictive)

* Attribute Information:

	attribute 1 is the class label.
	
	- All predictive attributes are nominal, taking on integer 
	  values 0-3

* Missing Attribute Values: Attributes 5 and 39 (*)

* Class Distribution (3 classes):
	- Class 1: 9 observations
	- Class 2: 13 observations
	- Class 3: 10 observations



Load the libraries.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

plt.style.use('ggplot')
%matplotlib inline

Read the data. The file has no header row, so header=None is passed.


In [6]:
filepath = 'lung-cancer.data'
lun_cancer_df = pd.read_csv(filepath,header=None)
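As an aside, `pd.read_csv` can also mark the `?` placeholders as missing at load time through its `na_values` parameter. The sketch below is only an illustration: it reads a small made-up sample (`sample_csv`, not the real file) so that it runs on its own.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout as the data file.
sample_csv = "1,0,3,0,?,0\n1,0,3,3,1,0\n2,0,2,3,2,0\n"

# na_values="?" tells pandas to parse "?" as NaN while reading.
sample_df = pd.read_csv(io.StringIO(sample_csv), header=None, na_values="?")

# Total number of missing cells in the sample.
print(sample_df.isnull().sum().sum())
```

With this option the affected column is parsed as numeric from the start, so no separate replacement step would be needed.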

Check the dataframe.

In [7]:
lun_cancer_df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
0,1,0,3,0,?,0,2,2,2,1,...,2,2,2,2,2,1,1,1,2,2
1,1,0,3,3,1,0,3,1,3,1,...,2,2,2,2,2,2,2,1,2,2
2,1,0,3,3,2,0,3,3,3,1,...,2,2,2,2,2,2,2,2,1,2
3,1,0,2,3,2,1,3,3,3,1,...,2,2,2,2,2,2,2,2,2,2
4,1,0,3,2,1,1,3,3,3,2,...,2,2,2,2,2,2,2,1,2,2
5,1,0,3,3,2,0,3,3,3,1,...,2,2,2,2,2,2,2,2,1,2
6,1,0,3,2,1,0,3,3,3,1,...,2,2,2,2,1,2,2,2,1,2
7,1,0,2,2,1,0,3,1,3,3,...,2,2,1,2,2,2,2,1,2,2
8,1,0,3,1,1,0,3,1,3,1,...,2,2,2,2,2,2,2,1,2,2
9,2,0,2,3,2,0,2,2,2,1,...,2,2,2,1,3,2,1,1,2,2


Add a header to the dataframe.

In [15]:
# Column 0 is the class label; the remaining columns are named feature_1 ... feature_56.
col_names = ['class'] + ['feature_' + str(i) for i in range(1, len(lun_cancer_df.columns))]
lun_cancer_df.columns = col_names
print(col_names)

['class', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21', 'feature_22', 'feature_23', 'feature_24', 'feature_25', 'feature_26', 'feature_27', 'feature_28', 'feature_29', 'feature_30', 'feature_31', 'feature_32', 'feature_33', 'feature_34', 'feature_35', 'feature_36', 'feature_37', 'feature_38', 'feature_39', 'feature_40', 'feature_41', 'feature_42', 'feature_43', 'feature_44', 'feature_45', 'feature_46', 'feature_47', 'feature_48', 'feature_49', 'feature_50', 'feature_51', 'feature_52', 'feature_53', 'feature_54', 'feature_55', 'feature_56']


In [19]:
lun_cancer_df.head(20)

Unnamed: 0,class,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_47,feature_48,feature_49,feature_50,feature_51,feature_52,feature_53,feature_54,feature_55,feature_56
0,1,0,3,0,?,0,2,2,2,1,...,2,2,2,2,2,1,1,1,2,2
1,1,0,3,3,1,0,3,1,3,1,...,2,2,2,2,2,2,2,1,2,2
2,1,0,3,3,2,0,3,3,3,1,...,2,2,2,2,2,2,2,2,1,2
3,1,0,2,3,2,1,3,3,3,1,...,2,2,2,2,2,2,2,2,2,2
4,1,0,3,2,1,1,3,3,3,2,...,2,2,2,2,2,2,2,1,2,2
5,1,0,3,3,2,0,3,3,3,1,...,2,2,2,2,2,2,2,2,1,2
6,1,0,3,2,1,0,3,3,3,1,...,2,2,2,2,1,2,2,2,1,2
7,1,0,2,2,1,0,3,1,3,3,...,2,2,1,2,2,2,2,1,2,2
8,1,0,3,1,1,0,3,1,3,1,...,2,2,2,2,2,2,2,1,2,2
9,2,0,2,3,2,0,2,2,2,1,...,2,2,2,1,3,2,1,1,2,2


In [18]:
lun_cancer_df.tail(20)

Unnamed: 0,class,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_47,feature_48,feature_49,feature_50,feature_51,feature_52,feature_53,feature_54,feature_55,feature_56
12,2,0,2,1,1,0,1,2,2,1,...,2,2,2,2,2,2,2,1,2,2
13,2,0,2,2,1,1,2,3,3,1,...,2,2,2,2,2,1,1,1,2,2
14,2,1,3,0,?,1,1,2,2,1,...,2,2,2,2,2,2,2,1,2,1
15,2,0,3,2,2,1,2,2,2,1,...,2,2,2,2,2,2,2,2,2,2
16,2,0,3,2,2,0,1,1,3,1,...,2,2,2,2,2,2,2,1,2,2
17,2,0,2,1,1,0,2,1,3,1,...,2,2,2,2,2,1,1,1,2,2
18,2,0,2,0,?,0,2,3,3,3,...,2,2,2,2,2,2,2,2,1,2
19,2,0,1,2,1,0,3,3,3,1,...,2,2,2,2,2,1,1,2,2,1
20,2,0,2,0,?,1,3,3,3,1,...,2,2,2,2,1,2,2,1,2,2
21,2,0,3,3,2,0,2,1,3,1,...,2,2,1,2,2,2,2,2,1,2


### Missing Values:
1. Replace "?" with NaN.
2. Identify all columns that contain missing values.
3. Replace the missing values with the most frequent value (the mode) of each column.

1. Replace "?" with NaN.

In [22]:
lun_cancer_df = lun_cancer_df.replace("?",np.nan)
lun_cancer_df.head(20)

Unnamed: 0,class,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_47,feature_48,feature_49,feature_50,feature_51,feature_52,feature_53,feature_54,feature_55,feature_56
0,1,0,3,0,,0,2,2,2,1,...,2,2,2,2,2,1,1,1,2,2
1,1,0,3,3,1.0,0,3,1,3,1,...,2,2,2,2,2,2,2,1,2,2
2,1,0,3,3,2.0,0,3,3,3,1,...,2,2,2,2,2,2,2,2,1,2
3,1,0,2,3,2.0,1,3,3,3,1,...,2,2,2,2,2,2,2,2,2,2
4,1,0,3,2,1.0,1,3,3,3,2,...,2,2,2,2,2,2,2,1,2,2
5,1,0,3,3,2.0,0,3,3,3,1,...,2,2,2,2,2,2,2,2,1,2
6,1,0,3,2,1.0,0,3,3,3,1,...,2,2,2,2,1,2,2,2,1,2
7,1,0,2,2,1.0,0,3,1,3,3,...,2,2,1,2,2,2,2,1,2,2
8,1,0,3,1,1.0,0,3,1,3,1,...,2,2,2,2,2,2,2,1,2,2
9,2,0,2,3,2.0,0,2,2,2,1,...,2,2,2,1,3,2,1,1,2,2


2. Identify all columns that contain missing values.

In [23]:
missing_values = lun_cancer_df.isnull().sum()
print(missing_values)

class         0
feature_1     0
feature_2     0
feature_3     0
feature_4     4
feature_5     0
feature_6     0
feature_7     0
feature_8     0
feature_9     0
feature_10    0
feature_11    0
feature_12    0
feature_13    0
feature_14    0
feature_15    0
feature_16    0
feature_17    0
feature_18    0
feature_19    0
feature_20    0
feature_21    0
feature_22    0
feature_23    0
feature_24    0
feature_25    0
feature_26    0
feature_27    0
feature_28    0
feature_29    0
feature_30    0
feature_31    0
feature_32    0
feature_33    0
feature_34    0
feature_35    0
feature_36    0
feature_37    0
feature_38    1
feature_39    0
feature_40    0
feature_41    0
feature_42    0
feature_43    0
feature_44    0
feature_45    0
feature_46    0
feature_47    0
feature_48    0
feature_49    0
feature_50    0
feature_51    0
feature_52    0
feature_53    0
feature_54    0
feature_55    0
feature_56    0
dtype: int64
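Because the full printout is long, it can help to show only the columns whose count is non-zero. A minimal sketch, using a small made-up frame (`toy_df`) in place of `lun_cancer_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for lun_cancer_df.
toy_df = pd.DataFrame({
    "class": [1, 2, 3],
    "feature_1": [0, np.nan, 1],
    "feature_2": [3, 2, 2],
})

missing_counts = toy_df.isnull().sum()

# Keep only the columns whose missing-value count is non-zero.
cols_with_missing = missing_counts[missing_counts > 0]
print(cols_with_missing)
```

Applied to the real dataframe, the same boolean filter should leave only the two affected columns.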


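There are 5 missing values: 4 in feature_4 and 1 in feature_38. These correspond to attributes 5 and 39 in the dataset description, which counts the class label as attribute 1. Next, find the most frequent value in each of the two columns.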


In [26]:
fet_mode_4 = lun_cancer_df['feature_4'].value_counts().idxmax() 
fet_mode_38 = lun_cancer_df['feature_38'].value_counts().idxmax()
print("feature_4_mode:{0} \nfeature_38_mode:{1}".format(fet_mode_4,fet_mode_38))

feature_4_mode:1 
feature_38_mode:2
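One caveat, stated as an assumption about how the file was parsed: a column that originally contained "?" is likely read as text, so its remaining entries (and the mode computed from them) may still be strings rather than numbers. The sketch below uses a made-up column (`raw_col`) to show how `pd.to_numeric` and `Series.mode` could be combined to get a numeric mode.

```python
import numpy as np
import pandas as pd

# Hypothetical column as it might look after replacing "?" with NaN:
# the remaining entries are still strings.
raw_col = pd.Series(["1", "2", np.nan, "1", "2", "1"])

# pd.to_numeric converts the strings to numbers; NaN stays NaN.
numeric_col = pd.to_numeric(raw_col)

# Series.mode() ignores NaN by default and returns the modes sorted; take the first.
col_mode = numeric_col.mode().iloc[0]
print(numeric_col.dtype, col_mode)
```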


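3. Replace the missing values with the most frequent value (the mode) of each column.

In [30]:
features = [('feature_4', fet_mode_4), ('feature_38', fet_mode_38)]

# Assign the result back to the column instead of using inplace=True on a
# column selection, which may not update the dataframe in recent pandas versions.
for fet, mode in features:
    lun_cancer_df[fet] = lun_cancer_df[fet].fillna(mode)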
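The same mode imputation can be written without listing the columns by hand. This is a sketch on a small made-up frame (`toy_df`), not the notebook's own method: `DataFrame.mode()` gives the per-column modes, and `fillna` accepts them as a Series keyed by column name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns.
toy_df = pd.DataFrame({
    "feature_4": [1, 2, np.nan, 1],
    "feature_38": [2, np.nan, 2, 3],
    "feature_5": [0, 1, 0, 0],
})

# DataFrame.mode() returns one row per mode; row 0 holds the first mode of each column.
column_modes = toy_df.mode().iloc[0]

# fillna with a Series fills each column using the value at the matching label.
filled_df = toy_df.fillna(column_modes)
print(filled_df.isnull().sum().sum())
```

Columns without missing values are left unchanged, so this form also works when the affected columns are not known in advance.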

Check whether any missing values remain after the imputation.

In [31]:
lun_cancer_df.isnull().sum()

class         0
feature_1     0
feature_2     0
feature_3     0
feature_4     0
feature_5     0
feature_6     0
feature_7     0
feature_8     0
feature_9     0
feature_10    0
feature_11    0
feature_12    0
feature_13    0
feature_14    0
feature_15    0
feature_16    0
feature_17    0
feature_18    0
feature_19    0
feature_20    0
feature_21    0
feature_22    0
feature_23    0
feature_24    0
feature_25    0
feature_26    0
feature_27    0
feature_28    0
feature_29    0
feature_30    0
feature_31    0
feature_32    0
feature_33    0
feature_34    0
feature_35    0
feature_36    0
feature_37    0
feature_38    0
feature_39    0
feature_40    0
feature_41    0
feature_42    0
feature_43    0
feature_44    0
feature_45    0
feature_46    0
feature_47    0
feature_48    0
feature_49    0
feature_50    0
feature_51    0
feature_52    0
feature_53    0
feature_54    0
feature_55    0
feature_56    0
dtype: int64

## Decision Tree Classifier:
<br>
Build a decision tree classifier and test its hyper-parameters.
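One way such a hyper-parameter test could be set up is a small grid search. The sketch below is an assumption-laden illustration: it uses synthetic random data with the same shape as the dataset (32 rows, 56 features with values 0-3, class sizes 9/13/10) instead of the cleaned dataframe, and the grid values are hypothetical picks rather than an exhaustive list of hyper-parameters.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data (random values, not the real dataset).
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(32, 56))
y = np.array([1] * 9 + [2] * 13 + [3] * 10)

# Stratify so that every class appears in both the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid over a few common DecisionTreeClassifier hyper-parameters.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, None],
    "min_samples_leaf": [1, 2],
}

# 3-fold cross-validation on the training split picks the best combination.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

# Accuracy of the refitted best tree on the held-out test split.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

With only 32 instances, any accuracy estimate will be noisy, so cross-validated scores are probably more informative than a single train/test split.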