# Project Goal

This project aims to predict the **species of an Iris flower** based on measurements (sepal length, sepal width, petal length, and petal width).
 

## Step 1: Load and Understand the Dataset
The Iris dataset ships with the scikit-learn library, so it can be loaded with a single function call.

In [113]:
# Import necessary libraries
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

# Load the dataset
iris = load_iris()

# Convert to a DataFrame for easy viewing
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Preview the first 4 rows of the last 4 columns (df must exist before this call)
print(df.iloc[:4, -4:])

# Show the first 10 rows
df.head(10)

   sepal width (cm)  petal length (cm)  petal width (cm)  species
0               3.5                1.4               0.2        0
1               3.0                1.4               0.2        0
2               3.2                1.3               0.2        0
3               3.1                1.5               0.2        0


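|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |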


Here, `iris.data` contains the features (flower measurements), and `iris.target` represents the species. The species are represented by numbers (0, 1, and 2), which stand for **setosa**, **versicolor**, and **virginica**.
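To make the number-to-name mapping concrete, here is a small sketch that builds a lookup from `iris.target_names` and attaches a readable column. The names `label_to_name`, `df_named`, and `species_name` are my own illustrative choices, not part of the dataset.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

# target_names is indexed by the numeric label: position 0 is setosa, and so on
label_to_name = {i: str(name) for i, name in enumerate(iris.target_names)}

# Build a separate DataFrame so the original df used in later steps is untouched
df_named = pd.DataFrame(iris.data, columns=iris.feature_names)
df_named["species"] = iris.target
df_named["species_name"] = df_named["species"].map(label_to_name)  # illustrative column

# Count how many flowers belong to each species
print(df_named["species_name"].value_counts())
```

The dataset is balanced, so each species should appear the same number of times in the count.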

## Step 2: Preprocess the Data
Because the Iris dataset is clean and has no missing values, preprocessing is minimal. I'll still split the data into **training** and **testing sets** so the model is evaluated on examples it has not seen during training.

In [24]:
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y)
X = df.drop(columns='species')
y = df['species']

# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
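
A quick sanity check on the split can be useful. The sketch below rebuilds the same split from scratch and prints the resulting shapes. It also shows an optional variant using the `stratify` parameter of `train_test_split`, which keeps the class proportions equal in both sets. The stratified variant is my addition and is not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="species")

# Same split as in the notebook: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Optional variant (an assumption, not the notebook's method):
# stratify=y keeps each species equally represented in the test set
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_te_s.value_counts().sort_index())
```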

## Step 3: Train a Model
**Goal**: I want the model to learn the relationship between measurements (features) and species (target) from the training data so it can make predictions on new data. I use the **k-Nearest Neighbors (k-NN)** algorithm, which classifies a data point based on how close it is to other points in the feature space. The algorithm finds the k **closest points (neighbors)** in the training set and assigns the most common class among those neighbors to the new point.

In [29]:
# Import the k-NN classifier
from sklearn.neighbors import KNeighborsClassifier

# Create a k-NN model with k=3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)

# Train (fit) the model on the training data
knn.fit(X_train, y_train)

- When I set `n_neighbors=3`, I am telling the model to consider the **3 closest neighbors** for classification.
- `knn.fit(X_train, y_train)`: This is where the model "learns." It memorizes the locations of all the training points in the feature space.
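
To show what "closest neighbors" and "most common class" mean in practice, here is a minimal from-scratch sketch of the k-NN idea using NumPy. It assumes Euclidean distance and a simple majority vote. The function `knn_predict_one` and the toy data are hypothetical, and scikit-learn's real implementation is more efficient and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters in 2D
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_toy, y_toy, np.array([1.1, 1.0])))  # → 0
print(knn_predict_one(X_toy, y_toy, np.array([5.1, 5.0])))  # → 1
```

Note that there is no training loop here: "fitting" amounts to keeping `X_train` and `y_train` around, and all the work happens at prediction time.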





## Step 4: Evaluate the Model's Accuracy
**Goal**: Measure how well the model performs on new data it hasn’t seen before (the test set). This helps us gauge if our model can generalize well to new examples.

**Steps to evaluate the model:** \
**1. Make Predictions:** Use the trained model to predict labels for the test set. \
**2. Measure Accuracy:** Compare the model's predictions with the actual labels from the test set.

In [43]:
from sklearn.metrics import accuracy_score

# Make predictions on the test data
y_pred = knn.predict(X_test)

- `knn.predict(X_test)`: The model takes each example in the test set, finds the **3 closest points** in the training set (since `n_neighbors=3`),
and assigns the most common species among those neighbors.
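
To look inside a single prediction, the fitted classifier's `kneighbors` method returns the distances to, and training-set indices of, the nearest neighbors. The sketch below is self-contained and fits on plain NumPy arrays rather than the DataFrame. That is a simplification on my part, but it yields the same split because `random_state` is the same.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Distances to, and training-set indices of, the 3 nearest neighbors
# of the first test flower
distances, indices = knn.kneighbors(X_test[:1])
neighbor_labels = y_train[indices[0]]

print("Neighbor distances:", distances[0])
print("Neighbor species:", iris.target_names[neighbor_labels])
print("Predicted species:", iris.target_names[knn.predict(X_test[:1])[0]])
```

The predicted species should match the majority among the three neighbor labels printed just above it.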

In [57]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Model Accuracy: 100.00%


- `accuracy_score(y_test, y_pred)`: This function calculates the proportion of correct predictions by comparing `y_pred` (model predictions) and `y_test` (actual labels).
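- High accuracy (close to 100%) indicates the model performs well on this test set. With only 30 test samples, each misclassification changes the score by about 3.3 percentage points, so a single split gives only a coarse estimate.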
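
As a rough illustration of what `accuracy_score` computes, the sketch below reproduces it by hand as the fraction of matching labels. It also prints a confusion matrix, which is an extra I am adding here to show which species, if any, get mixed up. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Accuracy is the share of predictions that equal the true label
manual_accuracy = np.mean(y_pred == y_test)
library_accuracy = accuracy_score(y_test, y_pred)
print(f"Manual: {manual_accuracy:.4f}, accuracy_score: {library_accuracy:.4f}")

# Rows are true species, columns are predicted species;
# off-diagonal entries are misclassifications
cm = confusion_matrix(y_test, y_pred)
print(cm)
```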

## Step 5: Make Predictions
**Goal**: Use the trained model to make predictions on new data. This is useful when I want to classify an unseen example based on the model’s learned patterns.

**Steps to make a prediction**:\
**Define the new sample data**: Input measurements for a new flower.\
**Predict the species**: Use the model to predict which species the new sample most likely belongs to.

In [84]:
# Import pandas to build the sample as a DataFrame
import pandas as pd

# Example measurement (sepal length, sepal width, petal length, petal width)
sample = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=X.columns)

# Predict the species
predicted_species = knn.predict(sample)
species_name = iris.target_names[predicted_species[0]]
print(f"The predicted species is: {species_name}")

The predicted species is: setosa


- `pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=X.columns)`: This represents a new flower’s measurements (in the format: [sepal length, sepal width, petal length, petal width]).
- `knn.predict(sample)`: The model looks at this sample’s nearest neighbors and assigns the most common species among those neighbors as the prediction.
  
The model then outputs the species (`setosa`, `versicolor`, or `virginica`), giving us a prediction for the new sample.
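
Beyond the single predicted label, `predict_proba` shows how the neighbors voted. With the default uniform weights and `n_neighbors=3`, each probability is the share of the 3 neighbors belonging to that species. This sketch is self-contained and passes the sample as a plain nested list, which is a simplification of the DataFrame used above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Same measurements as the sample above:
# sepal length, sepal width, petal length, petal width
sample = [[5.1, 3.5, 1.4, 0.2]]

# One probability per species, in the order of iris.target_names
proba = knn.predict_proba(sample)[0]
for name, p in zip(iris.target_names, proba):
    print(f"{name}: {p:.2f}")
```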

## Iris Classification Project Summary

**Goal**: Predict the species of an Iris flower based on measurements like sepal length, sepal width, petal length, and petal width.

**Steps**:

1. **Data Exploration**: Loaded and reviewed the Iris dataset, identifying features and target species.
2. **Data Split**: Divided data into training (80%) and test (20%) sets for unbiased model evaluation.
3. **Model Training**: Chose and trained a k-Nearest Neighbors (k-NN) model (k=3) to classify each flower based on its 3 closest neighbors.\
(Choosing 3 neighbors allows for a balance between being sensitive enough to detect class boundaries while being robust enough to ignore isolated noise.)
4. **Evaluation**: Measured model accuracy on test data, confirming its predictive reliability.
5. **Prediction**: Used the model to classify a new flower sample, translating the prediction into the species name.
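
The choice of k=3 can be checked empirically. The sketch below compares several values of k using 5-fold cross-validation via `cross_val_score`. The candidate values and the use of the full dataset are assumptions I made for brevity. In a stricter workflow, the search would run on the training portion only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Mean cross-validated accuracy for each candidate k (candidates are illustrative)
mean_scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    mean_scores[k] = scores.mean()
    print(f"k={k}: mean accuracy = {mean_scores[k]:.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print(f"Best k among candidates: {best_k}")
```

Averaging over several folds gives a steadier estimate than a single 30-sample test set, which makes differences between values of k easier to judge.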

## Key Takeaways
- **Data Preparation** is crucial: Properly splitting data allows for fair testing.
- **Model Selection** matters: k-NN is great for small, structured datasets like Iris.
- **Model Evaluation** confirms reliability: Accuracy lets us know if the model’s predictions are dependable.
- **Practical Prediction** tests the model’s usefulness on new data.