# Train Test Split in Machine Learning
To measure a model's performance on unseen data, we use the [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function, part of the sklearn package. This function holds out a subsample of the data, so that only the remaining portion is used to build the model. After building the model, we measure its performance by running classification or regression on the held-out features, then comparing the predictions to the held-out labels or actual values.

`train_test_split` is commonly used to detect overfitting and evaluate a model's performance. A common split is 80% training to 20% testing, although you may change those percentages, for example depending on how much data you have available.
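One way to see how a hold-out set helps detect overfitting is to compare accuracy on the training data with accuracy on the test data. The sketch below is only an illustration: the names `train_acc` and `test_acc` and the `random_state=0` seed are assumptions chosen for this example, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the features and labels directly as arrays
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing (seed chosen arbitrarily for this sketch)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training portion only
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# score() returns mean accuracy for classifiers
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

If the training accuracy is much higher than the test accuracy, the model has likely memorized the training data rather than learned patterns that generalize.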

In [None]:
# Load libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# Load IRIS dataset
iris = load_iris()
X, y = iris.data, iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = y
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [None]:
# Split the dataset into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2024) # Try changing the test_size parameter!!

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

# Train the Decision Tree classifier
clf = DecisionTreeClassifier(random_state=2024)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: \n{accuracy*100:.2f}% \n")

X_train shape: (120, 4)
X_test shape: (30, 4)
Accuracy: 
86.67% 
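To follow up on the suggestion to try different `test_size` values, the sketch below repeats the split and evaluation for a few ratios. The particular values `(0.2, 0.3, 0.5)` and the `results` dictionary are assumptions picked for illustration. Accuracy on a dataset this small can shift noticeably from one split to another, so treat any single number with caution.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Maps test_size -> (training rows, test rows, test accuracy)
results = {}

for test_size in (0.2, 0.3, 0.5):
    # Same seed each time so only the split ratio changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=2024
    )

    clf = DecisionTreeClassifier(random_state=2024).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))

    results[test_size] = (len(X_tr), len(X_te), acc)
    print(f"test_size={test_size}: train={len(X_tr)}, test={len(X_te)}, accuracy={acc * 100:.2f}%")
```

A larger test set gives a more stable accuracy estimate but leaves fewer rows for training, which is the trade-off behind choosing a split ratio.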



In [None]:
## End of Script