**Lab: Train-Test Split in Python (with KNN)**

**Objective**
In this lab, you will learn how to:
* Perform a train-test split on a dataset.
* Train a K-Nearest Neighbor (KNN) classifier on the training data.
* Evaluate the model's accuracy on the test data.

By the end of this lab, you will have a working understanding of the basic process of training and testing a machine learning model.

**Steps for the Lab**

Step 1: Import Required Libraries

First, open your Python environment and import the following libraries:

In [1]:
# Importing libraries
from sklearn.datasets import load_iris  # Dataset
from sklearn.model_selection import train_test_split  # Train-test split
from sklearn.neighbors import KNeighborsClassifier  # KNN model
from sklearn.metrics import accuracy_score  # Accuracy metric

Step 2: Load the Dataset

We will use the **Iris dataset** for this lab. It contains 150 iris flowers, each described by four measurements in centimeters (sepal length, sepal width, petal length, petal width) and labeled with one of three species (setosa, versicolor, virginica).

In [2]:
# Load the Iris dataset
iris = load_iris()

# Features and labels
X = iris.data  # Features: sepal length, sepal width, petal length, petal width (cm)
y = iris.target  # Target labels (species of iris)
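
Before splitting, it can help to confirm what was loaded. The cell below is an optional sketch, not part of the original lab; it only uses attributes that load_iris() returns (data, target, feature_names, target_names) and Counter from the standard library.

```python
from collections import Counter

from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Number of samples and number of features per sample
print(f"Feature matrix shape: {X.shape}")

# Column names of X and the species that each integer label stands for
print(f"Feature names: {iris.feature_names}")
print(f"Species names: {list(iris.target_names)}")

# How many samples belong to each class label
class_counts = Counter(y.tolist())
print(f"Samples per class: {dict(class_counts)}")
```

The labels in y are the integers 0, 1 and 2; iris.target_names maps them back to species names.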

Step 3: Perform the Train-Test Split

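A train-test split does not prevent **overfitting** by itself, but it lets us detect it: a model that has only memorized its training data will score poorly on data it has never seen. We split the dataset into two parts, 80% for training (120 samples) and 20% for testing (30 samples), train the model on the first part, and evaluate it on the second. Setting random_state makes the split reproducible.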

In [3]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the training and testing sets to verify the split
print(f"Training Set Shape: {X_train.shape}")
print(f"Testing Set Shape: {X_test.shape}")
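
The split above is purely random, so the three species are not guaranteed to appear in equal numbers in the 30-sample test set. As an optional variation (an assumption of this sketch, not something the lab requires), train_test_split accepts a stratify argument that keeps the class proportions the same in both parts:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks for the same class proportions in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many samples of each class ended up in each part
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print(f"Training Set Shape: {X_train.shape}")
print(f"Testing Set Shape: {X_test.shape}")
print(f"Training samples per class: {train_counts}")
print(f"Testing samples per class: {test_counts}")
```

Stratifying matters most for small or imbalanced datasets, where a random split can leave a class badly under-represented in the test set.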

Step 4: Train the KNN Model

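We will use the **K-Nearest Neighbors (KNN)** algorithm for classification. To classify a new data point, KNN finds the k training points closest to it (scikit-learn uses Euclidean distance by default) and assigns the class that is most common among those neighbors. We will set the number of neighbors to 3 in this example.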

In [4]:
# Create and train the KNN model
knn = KNeighborsClassifier(n_neighbors=3)  # K=3
knn.fit(X_train, y_train)  # Train the model
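
To make the majority-vote idea concrete, here is a minimal from-scratch sketch in plain Python. It is not how scikit-learn implements KNeighborsClassifier (which uses optimized search structures and its own tie-breaking); the function name knn_predict and the toy points are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the labels of the k closest points
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled "a" and "b"
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction_near_a = knn_predict(train_points, train_labels, (1.1, 1.0))
prediction_near_b = knn_predict(train_points, train_labels, (5.1, 5.0))
print(prediction_near_a, prediction_near_b)
```

Note that "training" a KNN model does no real computation: fit() essentially stores the training data, and the work happens at prediction time.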

Step 5: Test the Model

Once the model is trained, we can use it to make predictions on the test data (data the model hasn't seen before).

In [5]:
# Predict the labels for the test set
y_pred = knn.predict(X_test)

# Print the predictions
print(f"Predicted Labels: {y_pred}")
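
The predictions are integer labels (0, 1, 2), which are hard to read at a glance. The optional sketch below repeats Steps 2 to 5 so that it runs on its own, then translates predicted and actual labels into species names by indexing iris.target_names. The variable names predicted_species and actual_species are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Repeat Steps 2-5 so this cell runs on its own
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Index the array of species names with the integer labels
predicted_species = iris.target_names[y_pred]
actual_species = iris.target_names[y_test]

# Show the first five test samples side by side
for predicted, actual in zip(predicted_species[:5], actual_species[:5]):
    print(f"predicted: {predicted:<12} actual: {actual}")
```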

Step 6: Evaluate the Model

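The model's accuracy is the fraction of test samples whose predicted label matches the actual label. accuracy_score returns this as a value between 0 and 1, which we multiply by 100 to report a percentage. With only 30 test samples, each misclassified flower changes the result by about 3.3 percentage points, so treat the number as a rough estimate.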

In [6]:
# Calculate and print the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Model Accuracy: 100.00%
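
To see what accuracy_score computes, the optional sketch below recomputes the accuracy by hand as the mean of a boolean "prediction was correct" array, and also prints a confusion matrix, which shows which species are confused with which. confusion_matrix is an addition to the lab's imports; the earlier steps are repeated so the cell runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Repeat Steps 2-5 so this cell runs on its own
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy by hand: True counts as 1 and False as 0, so the mean is the hit rate
manual_accuracy = np.mean(y_pred == y_test)
library_accuracy = accuracy_score(y_test, y_pred)
print(f"Manual accuracy:  {manual_accuracy:.4f}")
print(f"Library accuracy: {library_accuracy:.4f}")

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Correct predictions sit on the diagonal of the confusion matrix; any off-diagonal entry is a misclassification.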


**Lab Questions**

Answer the following questions after completing this lab:

1) What is the purpose of splitting the data into training and testing sets?
2) Why is it important to evaluate a model's performance on data it hasn't seen before?
3) What could happen if we don’t use a train-test split (i.e., if we test on the same data we trained on)?
4) How does changing the number of neighbors (n_neighbors) in KNN affect the model's accuracy?
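
One way to explore questions 3 and 4 empirically is to fit the model for several values of n_neighbors and compare accuracy on the training set with accuracy on the test set. This is only a starting sketch: the list of k values is an arbitrary choice, and score() is the classifier's built-in accuracy method. Run it and interpret the numbers yourself.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = []
for k in (1, 3, 5, 7, 9, 15):  # arbitrary values chosen for illustration
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
    test_acc = model.score(X_test, y_test)     # accuracy on unseen data
    results.append((k, train_acc, test_acc))
    print(f"k={k:>2}  train accuracy: {train_acc:.3f}  test accuracy: {test_acc:.3f}")
```

Try changing random_state or test_size as well; on a dataset this small, the results can shift noticeably from one split to another.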