<a href="https://colab.research.google.com/github/Tejes-Aulakh/Python/blob/main/Intro_to_AI_KMC_Iris_DataSet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Iris Clustering Using K-Means Clustering

The Iris dataset is a classic dataset in the field of machine learning and statistics. It is often used for benchmarking algorithms and learning how to handle data. Here are the key details about the Iris dataset:

## Description:

The Iris dataset consists of 150 samples of iris flowers, each described by four features. The dataset contains three different species of iris flowers: *Iris setosa*, *Iris versicolor*, and *Iris virginica*. Each species has 50 samples.

## Features:

The dataset includes the following four features (all measured in centimetres):

* Sepal length
* Sepal width
* Petal length
* Petal width

These features are the dimensions of the flowers' sepals and petals.

## Target:

The target variable is the species of the iris flower, which can take one of three possible values:

* *Iris setosa*
* *Iris versicolor*
* *Iris virginica*

## Data Structure:

* Number of samples: 150
* Number of features: 4
* Number of classes: 3 (*Iris setosa*, *Iris versicolor*, *Iris virginica*)


## Example Data:
Here is a sample from the dataset:


| Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | *Iris setosa* |
| 7.0 | 3.2 | 4.7 | 1.4 | *Iris versicolor* |
| 6.3 | 3.3 | 6.0 | 2.5 | *Iris virginica* |


# Import Libraries
Python includes a large number of libraries, either pre-installed or available through package managers such as pip, to support machine learning. The four libraries we will be using are NumPy, pandas, scikit-learn, and Matplotlib. It may be useful to spend some time reading about these libraries.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load a Dataset
scikit-learn provides a one-shot (single-line) load routine for the Iris dataset, allowing us to load data that we know is already sanitized (checked and cleaned of errors) and split it into data and target variables, `x` and `y`.

In [None]:
# Load the Iris dataset
iris = load_iris()
x = iris.data
y = iris.target
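
Before moving on, it can help to confirm that the loaded data matches the description at the top of this notebook. The following is a minimal sketch rather than part of the original script; the `df` data frame and its `species` column are names we have made up for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data    # one row per flower, one column per feature
y = iris.target  # integer class codes: 0, 1 or 2

# Check the dimensions against the "Data Structure" section
print(x.shape, y.shape)
print(iris.feature_names)
print(iris.target_names)

# Illustrative only: wrap the measurements in a data frame for easier viewing
df = pd.DataFrame(x, columns=iris.feature_names)
df["species"] = iris.target_names[y]  # translate the codes into species names
print(df.head())
```

Note that scikit-learn stores the species names without the *Iris* prefix (`setosa`, `versicolor`, `virginica`).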

# Standardise the data features
In our example, the standardized features are the four measurements from the Iris dataset (sepal length, sepal width, petal length, and petal width) that have been scaled to have a mean of 0 and a standard deviation of 1. Standardizing features is a common preprocessing step in machine learning, helping to ensure that each feature contributes equally to the analysis and improving the performance of many algorithms.

We use the `StandardScaler` from scikit-learn to standardize the features. Note that the variable holding the data is the lowercase `x` we created when loading the dataset:

```python
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)
```

After standardisation, the features will have the following properties:
- Mean: 0
- Standard Deviation: 1


In [None]:
# Standardise the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)
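
To check the claim that each standardized feature has a mean of 0 and a standard deviation of 1, we can compare the scaler's output with a manual z-score calculation. This is a small sketch for verification only; `X_manual` is a name we introduce here and is not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

x = load_iris().data
X_scaled = StandardScaler().fit_transform(x)

# Manual z-score: subtract each column's mean, then divide by its standard deviation
X_manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.round(X_scaled.mean(axis=0), 6))  # per-feature means, approximately 0
print(np.round(X_scaled.std(axis=0), 6))   # per-feature standard deviations, approximately 1
print(np.allclose(X_scaled, X_manual))     # do the two approaches agree?
```

The means may print as tiny non-zero values (or `-0.`) because of floating-point rounding.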

# Perform the k-means clustering
Now we need to actually run the algorithm. In this case, we know there are three possible options for the irises (the three species of the flower), so we set the number of clusters to 3. The choice of 42 for the random state is arbitrary, but fixing the seed makes the initial placement of the cluster centres, and therefore the results, reproducible. Your clusters should match ours, provided you are running a similar version of scikit-learn. Once we've set up our specifications, we create the clusters on our scaled data using the `fit` method.

In [None]:
# Perform K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_scaled)
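
Once the model is fitted, its results are stored as attributes on the `kmeans` object. The sketch below shows a few that are worth inspecting. One assumption to flag: we pass `n_init=10` explicitly because the default value differs between scikit-learn versions, so your numbers may differ slightly from a run that relies on the default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# n_init=10 is our assumption; the cell above relies on the library default
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_scaled)

print(kmeans.labels_[:10])            # cluster index assigned to the first ten flowers
print(np.bincount(kmeans.labels_))    # number of flowers in each cluster
print(kmeans.cluster_centers_.shape)  # one centre per cluster, one coordinate per feature
print(kmeans.inertia_)                # sum of squared distances to the nearest centre
```

The cluster indices (0, 1, 2) are arbitrary and carry no meaning on their own, which is why the next step maps them to species names.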

# Map cluster labels to species names
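Now that we have our clusters, it's all about making the results readable by humans. K-means only returns numeric cluster labels (0, 1 and 2), and those numbers are arbitrary: cluster 0 is not necessarily *Iris setosa*. A pandas data frame, which is like a table in Excel though a lot more powerful, is a convenient way to inspect results like these, but for the mapping itself NumPy is enough. In this case we will find the most common species in each cluster and then apply that species name to the cluster.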

In [None]:
species = np.array(['Iris setosa', 'Iris versicolor', 'Iris virginica'])
cluster_mapping = {}
for i in range(3):
    # Find the most common species in each cluster
    cluster_species = y[kmeans.labels_ == i]
    most_common_species = np.bincount(cluster_species).argmax()
    cluster_mapping[i] = species[most_common_species]
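
The majority-vote mapping above hides how mixed each cluster is. One way to see this, sketched here with pandas, is to cross-tabulate cluster labels against the true species. The `results` and `table` names are ours, and `n_init=10` is again an assumption made so the example behaves the same across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# One row per flower: the cluster it was assigned to and its true species
results = pd.DataFrame({
    "cluster": kmeans.labels_,
    "species": iris.target_names[iris.target],
})

# Rows are clusters, columns are species, cells are counts of flowers
table = pd.crosstab(results["cluster"], results["species"])
print(table)
```

If two clusters turn out to share the same most common species, the majority-vote mapping would give both the same name, so it is worth glancing at this table before trusting the labels.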

# Plot the results
Of course, we want to see the results, and Matplotlib's plotting functions make that easy. In this case, we will view only the first two features of the dataset, sepal length and sepal width (pesky computer screens and their 2D displays). To change what is displayed, change the column indices (0 and 1) in the two `X_scaled[kmeans.labels_ == i, n]` arguments of the `plt.scatter` call, along with the matching axis labels.

In [None]:
# Create a scatter plot with labeled clusters
plt.figure(figsize=(8, 6))
colors = ['r', 'g', 'b']
markers = ['o', 's', 'D']

for i in range(3):
    plt.scatter(X_scaled[kmeans.labels_ == i, 0], X_scaled[kmeans.labels_ == i, 1],
                c=colors[i], marker=markers[i], label=cluster_mapping[i], edgecolor='k')

plt.title('K-Means Clustering of Iris Dataset')
plt.xlabel('Standardized Sepal Length')
plt.ylabel('Standardized Sepal Width')
plt.legend()
plt.show()
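
As an example of changing the displayed features, the sketch below plots petal length against petal width (columns 2 and 3) instead. It is a variation on the cell above rather than part of the original script: `col_x` and `col_y` are names we introduce, the legend uses plain cluster numbers rather than the species mapping, and `n_init=10` is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)

# Column indices: 0 sepal length, 1 sepal width, 2 petal length, 3 petal width
col_x, col_y = 2, 3

fig, ax = plt.subplots(figsize=(8, 6))
for i, (color, marker) in enumerate(zip(['r', 'g', 'b'], ['o', 's', 'D'])):
    mask = kmeans.labels_ == i  # flowers assigned to cluster i
    ax.scatter(X_scaled[mask, col_x], X_scaled[mask, col_y],
               c=color, marker=marker, label=f'Cluster {i}', edgecolor='k')

# Build the axis labels from the feature names, dropping the unit
# because standardized values are no longer in centimetres
ax.set_title('K-Means Clustering of Iris Dataset (Petal Features)')
ax.set_xlabel('Standardized ' + iris.feature_names[col_x].replace(' (cm)', ''))
ax.set_ylabel('Standardized ' + iris.feature_names[col_y].replace(' (cm)', ''))
ax.legend()
plt.show()
```

Setting `col_x` and `col_y` to any other pair of indices from 0 to 3 gives a different 2D view of the same four-dimensional clustering.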