# Reading dataset

In [1]:
import pandas as pd

In [2]:
# reading from csv file using pandas
data = pd.read_csv('dataset/set.csv')

# display first 5 rows from dataset
data.head()

Unnamed: 0,Age,Race,Marital Status,T Stage,N Stage,6th Stage,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
0,43,"Other (American Indian/AK Native, Asian/Pacifi...",Married (including common law),T2,N3,IIIC,Moderately differentiated; Grade II,Regional,40,Positive,Positive,19,11,1,Alive
1,47,"Other (American Indian/AK Native, Asian/Pacifi...",Married (including common law),T2,N2,IIIA,Moderately differentiated; Grade II,Regional,45,Positive,Positive,25,9,2,Alive
2,67,White,Married (including common law),T2,N1,IIB,Poorly differentiated; Grade III,Regional,25,Positive,Positive,4,1,2,Dead
3,46,White,Divorced,T1,N1,IIA,Moderately differentiated; Grade II,Regional,19,Positive,Positive,26,1,2,Dead
4,63,White,Married (including common law),T2,N2,IIIA,Moderately differentiated; Grade II,Regional,35,Positive,Positive,21,5,3,Dead


# Checking for missing/null values

In [3]:
# Print information about a dataset including the index dtype and columns, non-null values and memory usage.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Age                     4024 non-null   int64 
 1   Race                    4024 non-null   object
 2   Marital Status          4024 non-null   object
 3   T Stage                 4024 non-null   object
 4   N Stage                 4024 non-null   object
 5   6th Stage               4024 non-null   object
 6   Grade                   4024 non-null   object
 7   A Stage                 4024 non-null   object
 8   Tumor Size              4024 non-null   int64 
 9   Estrogen Status         4024 non-null   object
 10  Progesterone Status     4024 non-null   object
 11  Regional Node Examined  4024 non-null   int64 
 12  Reginol Node Positive   4024 non-null   int64 
 13  Survival Months         4024 non-null   int64 
 14  Status                  4024 non-null   object
dtypes: i

In [4]:
# display number of null values in every column
data.isnull().sum(axis=0)

Age                       0
Race                      0
Marital Status            0
T Stage                   0
N Stage                   0
6th Stage                 0
Grade                     0
A Stage                   0
Tumor Size                0
Estrogen Status           0
Progesterone Status       0
Regional Node Examined    0
Reginol Node Positive     0
Survival Months           0
Status                    0
dtype: int64

<strong> There are no null or missing values. </strong>

# Encoding of non-numeric values (see section 4)
We need to convert non-numeric values to numeric ones. First, we need to identify the type of each non-numeric column. There are three main types:
1. Binary: The column contains exactly 2 distinct values (example: married: yes/no).
2. Nominal: The column contains more than 2 distinct values, and the values have no inherent order (example: country: Egypt/France/UK).
3. Ordinal: The column contains more than 2 distinct values, and the values have an inherent order (example: size: small/medium/large).

To determine the type of each non-numeric column, we need to know which unique values each column contains.

In [5]:
# display each non-numeric column with its number of unique values and the values themselves
# (a DataFrame would be a tidier way to present this)
for col in ['Race','Marital Status','T Stage', 'N Stage','6th Stage','Grade','A Stage',
           'Estrogen Status','Progesterone Status','Status']:
    print("column name: '"+ col +"'\n", "number of values: " +  str(len(data[col].unique())) + "\n", "The values are", 
          data[col].unique())
    print("----------------------------------")


column name: 'Race'
 number of values: 3
 The values are ['Other (American Indian/AK Native, Asian/Pacific Islander)' 'White'
 'Black']
----------------------------------
column name: 'Marital Status'
 number of values: 5
 The values are ['Married (including common law)' 'Divorced' 'Single (never married)'
 'Widowed' 'Separated']
----------------------------------
column name: 'T Stage'
 number of values: 4
 The values are ['T2' 'T1' 'T3' 'T4']
----------------------------------
column name: 'N Stage'
 number of values: 3
 The values are ['N3' 'N2' 'N1']
----------------------------------
column name: '6th Stage'
 number of values: 5
 The values are ['IIIC' 'IIIA' 'IIB' 'IIA' 'IIIB']
----------------------------------
column name: 'Grade'
 number of values: 4
 The values are ['Moderately differentiated; Grade II' 'Poorly differentiated; Grade III'
 'Well differentiated; Grade I' 'Undifferentiated; anaplastic; Grade IV']
----------------------------------
column name: 'A Stage'
 n

1. The columns <code>A Stage</code>, <code>Estrogen Status</code>, <code>Progesterone Status</code> and <code>Status</code> are binary properties <br/><br/>
2. The columns <code>T Stage</code>, <code>N Stage</code>, <code>6th Stage</code> and <code>Grade</code>  are ordinal (categorical) properties <br/><br/>
3. The columns <code>Race</code> and <code>Marital Status</code> are nominal (categorical) properties <br/><br/>
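The same per-column summary can also be collected into a DataFrame, which is easier to scan than printed text. The sketch below is illustrative only: `toy` is a small made-up stand-in for the real dataset, and the names `toy`, `cols`, and `summary` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Status": ["Alive", "Dead", "Alive"],
    "N Stage": ["N1", "N3", "N2"],
    "Age": [43, 47, 67],
})

# Non-numeric columns to inspect, listed explicitly as in the loop above.
cols = ["Status", "N Stage"]

# One row per column: how many distinct values it has and what they are.
rows = [
    {
        "column": col,
        "n_unique": toy[col].nunique(),
        "values": sorted(toy[col].unique()),
    }
    for col in cols
]
summary = pd.DataFrame(rows).set_index("column")
print(summary)
```

A column with `n_unique == 2` is a binary candidate; for the others, the listed values have to be read to decide between nominal and ordinal.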



## Binary encoding

In [6]:
data_binary_encoded = data.replace({
    'A Stage': {'Regional': 1, 'Distant': 0},
    'Estrogen Status': {'Positive': 1, 'Negative': 0},
    'Progesterone Status' : {'Positive': 1 ,'Negative':0},
    'Status' : {'Alive': 1,'Dead':0}
})
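One thing to keep in mind with `replace` is that any value missing from the mapping is left unchanged, so a misspelled category would stay in the column as a string. A possible sanity check, sketched here on made-up data, is to use `Series.map`, which turns unmapped values into `NaN` so they can be listed. The series `status` and its deliberately misspelled entry are hypothetical.

```python
import pandas as pd

# Hypothetical column with one misspelled value ("alive").
status = pd.Series(["Alive", "Dead", "alive"])
mapping = {"Alive": 1, "Dead": 0}

# map() returns NaN for every value that is not a key of the mapping.
encoded = status.map(mapping)

# Collect the original values that were not covered by the mapping.
unmapped = status[encoded.isna()].unique()
print("unmapped values:", list(unmapped))
```

An empty list of unmapped values means the mapping covers the whole column.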

## Ordinal encoding

In [7]:
data_ordinal_binary_encoded = data_binary_encoded.replace({
    'T Stage': {'T1': 1,'T2': 2,'T3':3,'T4':4},
    'N Stage': {'N1': 1,'N2': 2,'N3':3},
    '6th Stage': {'IIA': 1,'IIB': 2,'IIIA':3,'IIIB':4,'IIIC':5},
    'Grade' : {'Well differentiated; Grade I': 1,'Moderately differentiated; Grade II': 2,
              'Poorly differentiated; Grade III':3,'Undifferentiated; anaplastic; Grade IV':4}
})
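An alternative way to express the same idea is an ordered `pd.Categorical`, where the order is declared once as a list and the integer codes follow from it. This is only a sketch of that alternative, not the approach used in this notebook; the series `stages` holds made-up sample values.

```python
import pandas as pd

# Made-up sample of '6th Stage' values.
stages = pd.Series(["IIIC", "IIA", "IIIB", "IIB"])

# Categories listed from lowest to highest stage.
order = ["IIA", "IIB", "IIIA", "IIIB", "IIIC"]
cat = pd.Categorical(stages, categories=order, ordered=True)

# Codes start at 0, so add 1 to match the 1..5 scheme used above.
# A value missing from `order` gets code -1, i.e. 0 after the shift.
codes = pd.Series(cat.codes + 1, index=stages.index)
print(codes.tolist())
```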

## Nominal encoding (one-hot encoding)
Then we use <code>pd.get_dummies()</code> to one-hot encode the remaining non-numeric columns (<code>Race</code> and <code>Marital Status</code>).

In [8]:
data_encoded_final = pd.get_dummies(data_ordinal_binary_encoded)

data_encoded_final.head()

Unnamed: 0,Age,T Stage,N Stage,6th Stage,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,...,Survival Months,Status,Race_Black,"Race_Other (American Indian/AK Native, Asian/Pacific Islander)",Race_White,Marital Status_Divorced,Marital Status_Married (including common law),Marital Status_Separated,Marital Status_Single (never married),Marital Status_Widowed
0,43,2,3,5,2,1,40,1,1,19,...,1,1,0,1,0,0,1,0,0,0
1,47,2,2,3,2,1,45,1,1,25,...,2,1,0,1,0,0,1,0,0,0
2,67,2,1,2,3,1,25,1,1,4,...,2,0,0,0,1,0,1,0,0,0
3,46,1,1,1,2,1,19,1,1,26,...,2,0,0,0,1,1,0,0,0,0
4,63,2,2,3,2,1,35,1,1,21,...,3,0,0,0,1,0,1,0,0,0
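To make the effect of one-hot encoding easier to see, here is a minimal sketch on a made-up three-row frame. The frame `toy` and its shortened category labels are assumptions for illustration. Passing `dtype=int` is also an assumption on my part: it requests 0/1 integer columns explicitly, since the default dtype of the dummy columns differs between pandas versions.

```python
import pandas as pd

# Made-up frame with one numeric and one nominal column.
toy = pd.DataFrame({
    "Age": [43, 67, 46],
    "Race": ["Other", "White", "Black"],
})

# Each distinct value of 'Race' becomes its own 0/1 column;
# numeric columns such as 'Age' pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Race"], dtype=int)
print(encoded)
```

Each row has exactly one `1` among the `Race_*` columns, which is what makes the encoding order-free.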


In [9]:
# data types of final-encoded-data
data_encoded_final.dtypes

Age                                                               int64
T Stage                                                           int64
N Stage                                                           int64
6th Stage                                                         int64
Grade                                                             int64
A Stage                                                           int64
Tumor Size                                                        int64
Estrogen Status                                                   int64
Progesterone Status                                               int64
Regional Node Examined                                            int64
Reginol Node Positive                                             int64
Survival Months                                                   int64
Status                                                            int64
Race_Black                                                      

# Splitting data into input and output

In [10]:
data_input = data_encoded_final.drop(columns=['Status'])
data_output = data_encoded_final['Status']

# Splitting data into train, validation, and test

In [11]:
from sklearn.model_selection import train_test_split

X, X_test, y, y_test = train_test_split(
    data_input, data_output, test_size=0.30, random_state=0
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=0
)

print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('-------------------------')
print('X_val:', X_val.shape)
print('y_val:', y_val.shape)
print('-------------------------')
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

X_train: (1971, 20)
y_train: (1971,)
-------------------------
X_val: (845, 20)
y_val: (845,)
-------------------------
X_test: (1208, 20)
y_test: (1208,)
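The shapes above follow from applying `test_size=0.30` twice. The arithmetic can be reproduced as below, assuming the test count is rounded up and the remainder goes to the other part, which is consistent with the printed shapes. The helper `split_sizes` is a hypothetical name used only for this sketch.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_rest, n_test), rounding the test part up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_total = 4024

# First split: hold out 30% of all rows as the test set.
n_rest, n_test = split_sizes(n_total, 0.30)

# Second split: hold out 30% of the remaining rows as the validation set.
n_train, n_val = split_sizes(n_rest, 0.30)

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.0%} of the data)")
```

So the effective proportions are roughly 49% train, 21% validation, and 30% test, not 40/30/30.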


# Solving the problem of imbalanced data
First, we display the class counts of the output in our training set:

In [12]:
y_train.value_counts()

1    1662
0     309
Name: Status, dtype: int64

We use the `imbalanced-learn` package to balance our training set. We use two methods:
1. Undersampling: Removing samples from the majority class (class 1)
2. Oversampling: Repeating samples from the minority class (class 0)


**Undersampling:**

Using undersampling to reduce the samples of class 1 so that Class 0 : Class 1 = 0.5

In [13]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy=0.5, random_state=0)

X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

y_train_rus.value_counts()

1    618
0    309
Name: Status, dtype: int64

**Oversampling:**

Using oversampling to increase the samples of class 0 so that Class 0 : Class 1 = 1

In [14]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy=1.0, random_state=0)

X_train_balanced, y_train_balanced = ros.fit_resample(X_train_rus, y_train_rus)

# Uncomment the following line if you want to see the difference when the data is not balanced
#X_train_balanced, y_train_balanced = X_train, y_train

y_train_balanced.value_counts()

0    618
1    618
Name: Status, dtype: int64
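The counts produced by the two resampling steps can be checked by hand. For a binary problem, a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling. The sketch below reproduces the arithmetic only and does not call `imbalanced-learn`; the two helper functions are hypothetical names.

```python
def undersampled_majority(n_minority, ratio):
    """Majority count after undersampling to minority/majority == ratio."""
    return int(n_minority / ratio)

def oversampled_minority(n_majority, ratio):
    """Minority count after oversampling to minority/majority == ratio."""
    return int(n_majority * ratio)

# Class counts in the original training set (class 1 = Alive, class 0 = Dead).
n_alive, n_dead = 1662, 309

# Step 1: shrink the majority class (class 1) until 309 / n_alive_rus == 0.5.
n_alive_rus = undersampled_majority(n_dead, 0.5)

# Step 2: grow the minority class (class 0) until n_dead_ros / 618 == 1.0.
n_dead_ros = oversampled_minority(n_alive_rus, 1.0)

print("after undersampling:", {1: n_alive_rus, 0: n_dead})
print("after oversampling: ", {1: n_alive_rus, 0: n_dead_ros})
```

Combining the two steps discards fewer majority samples than undersampling straight to a 1:1 ratio, and duplicates fewer minority samples than oversampling from the original 1662.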

Now we use `(X_train_balanced, y_train_balanced)` to train our model.