In machine learning, the goal is to create a program that can perform tasks it has never been explicitly taught to perform. We do this by using data we have collected to **train** or **fit** a mathematical or statistical model. The data used to fit the model is called **training data**.

The resulting trained model is then used to make predictions on new, previously unseen data. In this way, the program can handle new situations without human intervention.

One of the challenges for a machine learning model is **overfitting**: the model performs well on the training data but is not able to generalize to new, previously unseen data.

To address this problem, machine learning engineers set aside a portion of the data, called **test data**, and use it to assess the performance of the trained model instead of including it in the training dataset.
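As a rough illustration of why held-out test data matters, the sketch below fits an unconstrained decision tree to synthetic data and compares its accuracy on the training set and on a held-out test set. The synthetic dataset, the choice of `DecisionTreeClassifier`, and all variable names here are assumptions made for this example only; they are not part of the missile dataset used later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration). With flip_y=0.2 the class of
# roughly 20% of the samples is assigned at random, so the labels contain noise
# that cannot be predicted on unseen data.
x_demo, y_demo = make_classification(
    n_samples=500, n_features=20, flip_y=0.2, random_state=0
)

# Hold out 20% of the data as a test set.
x_demo_train, x_demo_test, y_demo_train, y_demo_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

# A tree with no depth limit can memorize the training set, noise included.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(x_demo_train, y_demo_train)

train_acc = tree.score(x_demo_train, y_demo_train)
test_acc = tree.score(x_demo_test, y_demo_test)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

A large gap between the two numbers is the typical symptom of overfitting; without the held-out test set, the training accuracy alone would look deceptively good.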

Overfitting in cyber security is an omnipresent danger.

One small oversight, such as using only benign data from one locale, can lead to a poor classifier.

There are various other ways to validate model performance, such as cross-validation. For simplicity, we will focus mainly on **train-test splitting**.
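For reference, here is a minimal sketch of what cross-validation might look like with `sklearn`'s `cross_val_score`. The synthetic data, the logistic regression model, and the choice of 5 folds are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption for illustration).
x_cv, y_cv = make_classification(n_samples=200, n_features=10, random_state=0)

# With cv=5 the data is divided into 5 folds; each fold is used once for
# evaluation while the model is fitted on the other 4.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, x_cv, y_cv, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

Every sample is used for both fitting and evaluation, just never in the same fold, which makes the estimate less dependent on one particular split.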

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/My Drive/Colab Notebooks

/content/drive/My Drive/Colab Notebooks


In [None]:
# Import the train_test_split function and the pandas library.

from sklearn.model_selection import train_test_split
import pandas as pd

In [None]:
# Read the dataset into a DataFrame.
df = pd.read_csv("north_korea_missile_test_database.csv")

In [None]:
# Use the missile name as the label y and the remaining columns as the features x.
y = df["Missile Name"]
x = df.drop("Missile Name", axis=1)

In [None]:
# Randomly split the dataset and its labels into a training set comprising 80% of the original dataset and a testing set which is 20% of the original.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=31)

In [None]:
# We will apply the train_test_split method once more, to obtain a validation set, which is x_val and y_val
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.25, random_state=31)

In [None]:
# The training set is now 60% of the size of the original data, accompanied by a validation set and a testing set of roughly 20% each
print(len(x_train))
print(len(y_train))
print(len(x_val))
print(len(y_val))
print(len(x_test))
print(len(y_test))

81
81
27
27
27
27


**How does it work?**


*   Step 1: We start by reading our dataset, which consists of historical and continuing missile experiments in North Korea. We aim to predict the type of missile from the remaining features, such as facility and time of launch.

*   Step 2: We apply `sklearn`'s `train_test_split` method to subdivide `x` and `y` into a training set, `x_train` and `y_train`, and also a testing set, `x_test` and `y_test`.  The `test_size=0.2` parameter means that the testing set consists of 20% of the original data, while the remainder is placed in the training set.  The `random_state` parameter allows us to reproduce the same, randomly generated split.

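*   Step 3: We often compare several different models. The danger of using the testing set to select the best model is that we may end up overfitting the testing set. This is similar to the statistical sin of **data fishing**. To solve this problem, we create an additional dataset called a **validation set** by applying `train_test_split` once more to the training set. Because the training set holds 80% of the original data, `test_size=0.25` sets aside 0.25 × 80% = 20% of the original data for validation and leaves 60% for training.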

*   Step 4: We double-check our assumptions by using the `len` function to compute the size of each set.
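To make the role of the validation set concrete, the following sketch selects among a few candidate models using validation accuracy and only then evaluates the chosen model once on the test set. The synthetic data (135 rows, to mirror the set sizes printed above), the candidate models, and every name in the block are assumptions for illustration; the notebook itself does not train a model on the missile data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration).
x_syn, y_syn = make_classification(n_samples=135, n_features=10, random_state=31)

# Same two-stage split as above: 20% test, then 25% of the remainder for validation.
x_rest, x_syn_test, y_rest, y_syn_test = train_test_split(
    x_syn, y_syn, test_size=0.2, random_state=31
)
x_syn_train, x_syn_val, y_syn_train, y_syn_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=31
)

# Hypothetical candidate models to compare.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "shallow_tree": DecisionTreeClassifier(max_depth=3, random_state=31),
    "deep_tree": DecisionTreeClassifier(random_state=31),
}

# Fit each candidate on the training set and score it on the validation set.
val_scores = {}
for name, model in candidates.items():
    model.fit(x_syn_train, y_syn_train)
    val_scores[name] = model.score(x_syn_val, y_syn_val)

# Choose the model with the best validation accuracy, then touch the test set
# exactly once to estimate how the chosen model performs on unseen data.
best_name = max(val_scores, key=val_scores.get)
test_acc = candidates[best_name].score(x_syn_test, y_syn_test)

print("validation accuracy:", val_scores)
print("selected model:", best_name)
print("test accuracy of selected model:", test_acc)
```

Because the test set plays no part in choosing the model, its accuracy remains an honest estimate of performance on new data.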