# Practice Sheet: A Simple Introduction to Data Analysis with Iris

This notebook will guide you through a basic data analysis workflow using the famous Iris dataset. We will use **Pandas** for data manipulation and **NumPy** for a simple numerical operation. Finally, we'll use **Matplotlib** to create a beautiful visualization.

### The Goal
Our objective is to load the Iris dataset, explore its basic properties, calculate a new feature, and visualize the relationship between sepal length and sepal width for the different flower varieties.

## The Iris Dataset

The [Iris Dataset](https://archive.ics.uci.edu/dataset/53/iris) is a small, classic dataset from Fisher (1936). It is one of the earliest known datasets used for evaluating classification methods.

![image-2.png](attachment:image-2.png)

**What do the instances in this dataset represent?**

Each instance is a single iris plant.

**Additional Information**

This is one of the earliest datasets used in the literature on classification methods and is widely used in statistics and machine learning. The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.

**Predicted attribute:** class of iris plant.

This data differs from the data presented in Fisher's article (identified by Steve Chadwick, spchadwick@espeedaz.net). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa", where the error is in the fourth feature. The 38th sample should be: 4.9,3.6,1.4,0.1,"Iris-setosa", where the errors are in the second and third features.

**Has Missing Values?**

No

### Step 1: Import Necessary Libraries
Import the following libraries:
- `pandas` for loading and working with tabular data (like our CSV).
- `numpy` for efficient numerical operations.
- `matplotlib.pyplot` for creating plots and visualizations.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

### Step 2: Load the Data

[Download](https://gist.github.com/netj/8836201) the CSV file and load it manually (recommended for practice purposes).

We need to get our data from the `iris.csv` file into a structure that we can work with in Python. The Pandas `DataFrame` is the standard for this. We use `pd.read_csv()` to do this.

```python
# Load the dataset from the CSV file
file_path = 'iris.csv'
iris_df = pd.read_csv(file_path)
```
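
A quick sanity check right after loading can catch a wrong file or unexpected column names early. The sketch below is only an illustration: it reads a three-row stand-in for `iris.csv` from an in-memory string so that it runs without the file, and it assumes the header and variety labels (`Setosa`, `Versicolor`, `Virginica`) match the column names used later in this sheet. The names `sample_csv`, `sample_df`, and `expected_columns` are hypothetical.

```python
import io

import pandas as pd

# Stand-in for iris.csv (assumed header, three rows of the Iris data),
# so the snippet runs without the downloaded file.
sample_csv = io.StringIO(
    '"sepal.length","sepal.width","petal.length","petal.width","variety"\n'
    '5.1,3.5,1.4,0.2,"Setosa"\n'
    '7.0,3.2,4.7,1.4,"Versicolor"\n'
    '6.3,3.3,6.0,2.5,"Virginica"\n'
)
sample_df = pd.read_csv(sample_csv)

# Column names this sheet relies on in later steps.
expected_columns = ["sepal.length", "sepal.width", "petal.length", "petal.width", "variety"]
missing = [col for col in expected_columns if col not in sample_df.columns]

print("Shape (rows, columns):", sample_df.shape)
print("Missing columns:", missing)
print(sample_df["variety"].value_counts())
```

Running the same checks on `iris_df` should report 150 rows (3 classes of 50 instances) if the full file was loaded.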

### Step 3: Initial Data Exploration

Before diving into analysis, it's important to get a feel for your data. What do the columns look like? Are there any missing values? What are the data types? Pandas provides simple functions for this initial inspection.

Use the following methods to display basic features of the Iris dataset:
- `.head()` shows us the first few rows of the data.
- `.info()` gives a concise summary of the DataFrame, including data types and non-null counts.
- `.describe()` provides descriptive statistics for the numerical columns (like mean, std, min, max).
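
One possible way to run these three inspections is sketched below. To stay runnable on its own, it uses a hypothetical six-row stand-in called `sample_df` (a few rows of the Iris data, with column names assumed to match the CSV); in the notebook you would call the same methods on `iris_df` from Step 2.

```python
import pandas as pd

# Small stand-in sample so the snippet runs without iris.csv.
# In the notebook, use the iris_df loaded in Step 2 instead.
sample_df = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal.width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal.length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal.width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "variety": ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica", "Virginica"],
})

# First rows of the table (5 by default).
print(sample_df.head())

# Column dtypes and non-null counts; info() prints directly.
sample_df.info()

# Descriptive statistics for the numerical columns only.
summary = sample_df.describe()
print(summary)

# Explicit count of missing values per column.
print(sample_df.isna().sum())
```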

### Step 4: Process Data with NumPy and Pandas

This is where we see how Pandas and NumPy work together. Let's create a new feature to see if the ratio of petal length to width is a distinguishing characteristic. 

Pandas `DataFrame` columns can be treated like NumPy arrays, allowing us to perform fast, vectorized operations on them. We will calculate `petal.length / petal.width` and store it in a new column called `petal_ratio`.

**HINT:** Select the two columns with Pandas, convert each to a NumPy array with `.to_numpy()`, divide one by the other, and assign the result to a new column called `petal_ratio`.

Once you are done, display the first 5 rows.
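
If you get stuck, the hint can be followed roughly as below. This is a minimal sketch rather than the only solution: it works on a hypothetical six-row stand-in, `sample_df`, with column names assumed to match the CSV, and uses `np.divide`, which is equivalent to the `/` operator on arrays.

```python
import numpy as np
import pandas as pd

# Small stand-in sample so the snippet runs without iris.csv.
# In the notebook, apply the same steps to iris_df.
sample_df = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal.width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal.length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal.width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "variety": ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica", "Virginica"],
})

# Select each column and convert it to a NumPy array.
petal_length = sample_df["petal.length"].to_numpy()
petal_width = sample_df["petal.width"].to_numpy()

# Element-wise (vectorized) division, stored as a new column.
sample_df["petal_ratio"] = np.divide(petal_length, petal_width)

print(sample_df.head())
```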

### Step 5: Visualization

#### 5.1 Histogram
Let's start with a histogram. It reveals the distribution of a single variable. It groups numbers into ranges (bins) and the height of the bar shows how many data points fall into that range.

By overlaying the histograms of `petal.length` for all three species, we can answer questions like:

* Does this feature follow a normal (bell-curve) distribution?
* How much do the petal lengths overlap between species?
* Are there clear cut-off points in petal length that could help us classify the flowers?
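
An overlaid histogram could be drawn along these lines. Treat it as a sketch under assumptions: `sample_df` is a hypothetical nine-row stand-in (so the bars will look sparse compared with the full 150 rows), the column names are assumed to match the CSV, and the choice of 10 shared bins and 0.6 transparency is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small stand-in sample so the snippet runs without iris.csv.
# In the notebook, use iris_df instead.
sample_df = pd.DataFrame({
    "petal.length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "variety": ["Setosa"] * 3 + ["Versicolor"] * 3 + ["Virginica"] * 3,
})

# Shared bin edges so the bars of all species line up (10 equal-width bins).
values = sample_df["petal.length"]
bins = np.linspace(values.min(), values.max(), 11)

fig, ax = plt.subplots(figsize=(8, 5))
for variety, group in sample_df.groupby("variety"):
    # alpha < 1 keeps overlapping bars visible.
    ax.hist(group["petal.length"], bins=bins, alpha=0.6, label=variety)

ax.set_xlabel("Petal length (cm)")
ax.set_ylabel("Number of flowers")
ax.set_title("Distribution of petal length by variety")
ax.legend(title="Variety")
```

In a notebook the figure renders at the end of the cell; when running as a script, call `plt.show()` afterwards.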

#### 5.2 Scatter Plot
A scatter plot is an excellent way to see the relationship between two numerical variables. We will plot `sepal.length` vs. `sepal.width`.

To make the plot even more insightful, we will color-code each point based on its `variety`. This will help us see if the different species of Iris have distinct sepal characteristics and if they form visible clusters.
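
One way to produce such a color-coded scatter plot is sketched below. The red/green/blue mapping is chosen to match the colors named in the analysis that follows; `sample_df` and `colors` are hypothetical names, the nine rows are only a stand-in for the full dataset, and the variety labels are assumed to match the CSV.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in sample so the snippet runs without iris.csv.
# In the notebook, use iris_df instead.
sample_df = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "sepal.width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "variety": ["Setosa"] * 3 + ["Versicolor"] * 3 + ["Virginica"] * 3,
})

# One color per variety (assumed labels).
colors = {"Setosa": "red", "Versicolor": "green", "Virginica": "blue"}

fig, ax = plt.subplots(figsize=(8, 5))
for variety, group in sample_df.groupby("variety"):
    # One scatter call per variety so each gets its own legend entry.
    ax.scatter(
        group["sepal.length"],
        group["sepal.width"],
        color=colors[variety],
        label=variety,
        alpha=0.7,
    )

ax.set_xlabel("Sepal length (cm)")
ax.set_ylabel("Sepal width (cm)")
ax.set_title("Sepal length vs. sepal width by variety")
ax.legend(title="Variety")
```

Plotting each variety in its own `scatter` call is a simple design choice that keeps the legend automatic; passing a list of colors to a single call would also work but needs a hand-built legend.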

### Analysis of the Visualization

From the scatter plot, we can immediately draw some conclusions:

1.  **Setosa (Red):** This species forms a very distinct cluster. It generally has a smaller sepal length but a larger sepal width compared to the other two.
2.  **Versicolor (Green) and Virginica (Blue):** These two species are more similar to each other than to Setosa. However, we can still see a pattern: Virginica generally has a larger sepal length than Versicolor.

This simple visualization shows that sepal dimensions are useful features for distinguishing between the Iris varieties, especially for identifying Setosa.