# Train Test Split

Libraries:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

Data Import:

In [2]:
data = pd.read_csv("data.csv")
data.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,DrugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,DrugY


## Data Wrangling

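Tidy the *Drug* column by stripping the inconsistently capitalised `Drug`/`drug` prefix, so that only the drug letter remains.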

In [3]:
data["Drug"] = data["Drug"].replace(r"[Dd]rug", "", regex=True)

In [4]:
data

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,Y
1,47,M,LOW,HIGH,13.093,C
2,47,M,LOW,HIGH,10.114,C
3,28,F,NORMAL,HIGH,7.798,X
4,61,F,LOW,HIGH,18.043,Y
...,...,...,...,...,...,...
195,56,F,LOW,HIGH,11.567,C
196,16,M,LOW,HIGH,12.006,C
197,52,M,NORMAL,HIGH,9.894,X
198,23,M,NORMAL,NORMAL,14.020,X
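
To make the effect of the pattern `[Dd]rug` concrete, here is a minimal sketch on a small, hypothetical sample. The labels `drugA` and `drugB` are assumed for illustration, since only `DrugY`, `drugC` and `drugX` are visible in the output above.

```python
import pandas as pd

# Hypothetical sample mirroring the mixed-case labels seen in the data
drug = pd.Series(["DrugY", "drugC", "drugX", "drugA", "drugB"])

# Remove the "Drug"/"drug" prefix, keeping only the drug letter
tidy = drug.replace(r"[Dd]rug", "", regex=True)

print(tidy.tolist())
```

Because the character class `[Dd]` matches either capitalisation, both spellings collapse to the same single-letter label.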


### Map ordinal features

The features *BP* and *Cholesterol* are ordinal (ordered categorical) variables. Most machine learning algorithms expect numeric input, so their levels are encoded as integers that preserve the order.

In [5]:
bp_mapping = {"LOW": 0,
              "NORMAL": 1,
              "HIGH": 2}
cholesterol_mapping = {"NORMAL": 1,
                       "HIGH": 2}

In [6]:
data_encoded = data.copy()
data_encoded["BP"] = data_encoded["BP"].map(bp_mapping)
data_encoded["Cholesterol"] = data_encoded["Cholesterol"].map(cholesterol_mapping)

In [7]:
data_encoded

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,2,2,25.355,Y
1,47,M,0,2,13.093,C
2,47,M,0,2,10.114,C
3,28,F,1,2,7.798,X
4,61,F,0,2,18.043,Y
...,...,...,...,...,...,...
195,56,F,0,2,11.567,C
196,16,M,0,2,12.006,C
197,52,M,1,2,9.894,X
198,23,M,1,1,14.020,X


### One-hot encoding of nominal features

The features *Sex* and *Drug* are nominal (they have no natural order), so they are one-hot encoded. To avoid redundant, perfectly collinear columns, the first level of each one-hot encoded variable is dropped. The column *Sex_M* indicates whether the patient is male, and the drug columns indicate which drug is taken. If all drug columns are zero, drug A is taken.

In [8]:
data_encoded = pd.get_dummies(data_encoded, drop_first=True)

In [9]:
data_encoded

Unnamed: 0,Age,BP,Cholesterol,Na_to_K,Sex_M,Drug_B,Drug_C,Drug_X,Drug_Y
0,23,2,2,25.355,0,0,0,0,1
1,47,0,2,13.093,1,0,1,0,0
2,47,0,2,10.114,1,0,1,0,0
3,28,1,2,7.798,0,0,0,1,0
4,61,0,2,18.043,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...
195,56,0,2,11.567,0,0,1,0,0
196,16,0,2,12.006,1,0,1,0,0
197,52,1,2,9.894,1,0,0,1,0
198,23,1,1,14.020,1,0,0,1,0


## Train test split

Before we start, check the installed version of scikit-learn, since a given seed may produce a different split under another version. Here I have version 1.2.2.

In [10]:
!conda list scikit-learn

# packages in environment at /Users/Alexander/opt/anaconda3:
#
# Name                    Version                   Build  Channel
scikit-learn              1.2.2                    pypi_0    pypi


In [11]:
!pip show scikit-learn

Name: scikit-learn
Version: 1.2.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: 
Author-email: 
License: new BSD
Location: /Users/Alexander/opt/anaconda3/lib/python3.9/site-packages
Requires: joblib, numpy, scipy, threadpoolctl
Required-by: 
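
The version can also be read from within Python, which avoids depending on `conda` or `pip` being available in the notebook's shell. This is a minimal sketch using the standard-library `importlib.metadata`; the variable name `sklearn_version` is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version

try:
    sklearn_version = version("scikit-learn")
except PackageNotFoundError:
    sklearn_version = None  # package is not installed in this environment

print(sklearn_version)
```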


Rebind the encoded data to a shorter variable name.

In [12]:
data = data_encoded

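Split the data into independent variables (the features, `X`) and dependent variables (the last four columns, i.e. the one-hot encoded drug columns, `y`).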

In [13]:
X, y = data.iloc[:, :-4], data.iloc[:, -4:]
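
Positional slicing relies on the drug columns being the last four. A possible alternative, sketched here on a hypothetical two-row frame with the same column names, is to select the targets by their `Drug_` prefix so that the split does not depend on column order.

```python
import pandas as pd

# Hypothetical two-row frame with the same columns as the encoded data
data = pd.DataFrame({
    "Age": [23, 47], "BP": [2, 0], "Cholesterol": [2, 2],
    "Na_to_K": [25.355, 13.093], "Sex_M": [0, 1],
    "Drug_B": [0, 0], "Drug_C": [0, 1], "Drug_X": [0, 0], "Drug_Y": [1, 0],
})

# Select target columns by their "Drug_" prefix instead of by position
target_cols = [c for c in data.columns if c.startswith("Drug_")]
X = data.drop(columns=target_cols)
y = data[target_cols]

print(X.columns.tolist())
print(y.columns.tolist())
```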

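Perform the train-test split with scikit-learn, holding out 20% of the rows for testing and stratifying on `y` so that both sets keep similar drug proportions.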

In [14]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=10,
    stratify=y,
)
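
One way to check that stratification had the intended effect is to compare the share of each drug in the two sets. The sketch below does this on hypothetical, perfectly balanced toy data (20 rows for each of five drugs), not on the real dataset; the `Age` feature is a placeholder.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, perfectly balanced toy data: 20 rows for each of 5 drugs
labels = pd.Series(list("ABCXY") * 20, name="Drug")
X = pd.DataFrame({"Age": range(len(labels))})
y = pd.get_dummies(labels, prefix="Drug", drop_first=True, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y
)

# Column means are the share of each drug; they should be close in both sets
print(y_train.mean().round(2).to_dict())
print(y_test.mean().round(2).to_dict())
```

On the real data, the same two `mean()` calls on `y_train` and `y_test` should give similar values per column if the stratified split worked.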

Inspect the resulting frames to check that the split worked.

In [15]:
X_train

Unnamed: 0,Age,BP,Cholesterol,Na_to_K,Sex_M
38,39,1,1,9.709,0
94,56,0,2,15.015,1
20,57,0,1,19.128,1
187,47,2,2,10.403,1
159,34,0,1,12.923,0
...,...,...,...,...,...
15,16,2,1,15.516,0
148,61,0,1,7.340,0
134,42,2,2,21.036,0
181,59,1,2,13.884,0


In [16]:
y_train

Unnamed: 0,Drug_B,Drug_C,Drug_X,Drug_Y
38,0,0,1,0
94,0,0,0,1
20,0,0,0,1
187,0,0,0,0
159,0,0,1,0
...,...,...,...,...
15,0,0,0,1
148,0,0,1,0
134,0,0,0,1
181,0,0,1,0


In [17]:
X_test

Unnamed: 0,Age,BP,Cholesterol,Na_to_K,Sex_M
62,67,0,1,20.693,1
52,62,0,1,27.183,1
116,67,1,1,9.514,1
98,20,2,1,35.639,1
164,16,2,1,19.007,1
151,68,2,2,11.009,1
66,29,2,2,12.856,1
121,15,2,1,17.206,1
195,56,0,2,11.567,0
64,60,2,2,13.303,0


In [18]:
y_test

Unnamed: 0,Drug_B,Drug_C,Drug_X,Drug_Y
62,0,0,0,1
52,0,0,0,1
116,0,0,1,0
98,0,0,0,1
164,0,0,0,1
151,1,0,0,0
66,0,0,0,0
121,0,0,0,1
195,0,1,0,0
64,1,0,0,0


## Export

Merge X and y for the training and test sets so that there are fewer files to export.

In [19]:
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

Export the data to CSV. With `index=False`, the original row index is not written to the files.

In [20]:
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
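
For completeness, here is a minimal sketch of how an exported file could be read back and split into features and targets again. It uses a hypothetical two-row stand-in for the training set and writes to a temporary directory, so it does not touch the real `train.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the exported training set
train = pd.DataFrame({
    "Age": [39, 56], "BP": [1, 0], "Cholesterol": [1, 2],
    "Na_to_K": [9.709, 15.015], "Sex_M": [0, 1],
    "Drug_B": [0, 0], "Drug_C": [0, 0], "Drug_X": [1, 0], "Drug_Y": [0, 1],
}, index=[38, 94])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.csv"
    train.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# index=False means the original row labels (38, 94) are not stored
print(reloaded.index.tolist())

target_cols = [c for c in reloaded.columns if c.startswith("Drug_")]
X_train, y_train = reloaded.drop(columns=target_cols), reloaded[target_cols]
print(X_train.shape, y_train.shape)
```

If the original row labels are needed later, writing the files without `index=False` would keep them as an extra first column.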