# Iris Dataset Analysis

### Step 1 - Importing Packages

Before we do anything else, we import the packages we need for the data extraction and evaluation. The packages are as follows:

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

As we can see, we need Pandas to read and interact with the Iris data file, and NumPy for data manipulation. (Pandas can do much of what NumPy does, but I needed NumPy for some specific issues I ran into at the beginning, which we'll dive into later on.)

Then, for all our plots, we'll need matplotlib and seaborn. (Seaborn is built on top of matplotlib and covers much of the same ground; I started off with matplotlib and only moved to seaborn at the end, as it was the easiest way to produce a pairplot.)

### Step 2 - Reading our Data Set and organising it

The Iris data set comes with no column names, so once it is loaded with Pandas we have no way of knowing which column is which. The accompanying names text file tells us what the attributes are, but we need to name the columns ourselves in order to work with them.

In [7]:
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

After our columns have been named, we go ahead and open the data set in our folder with Pandas.

In [8]:
df = pd.read_csv('iris.data', names=column_names)
print(df)

     sepal_length  sepal_width  petal_length  petal_width         species
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]


Now it's much clearer. I had some issues here because all the documentation I was checking was about csv files, not data files. It was ChatGPT that explained to me that I can read data files the same way I would read csv files.

__Prompt__ - I have iris.data what file format is that?

__ChatGPT response__:
_The iris.data file is essentially a plain text file formatted as comma-separated values (CSV)._
_Even though it doesn't have a ".csv" extension, it follows the same structure: each row represents a record, and the values are separated by commas._
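
To illustrate the point about the file format, here is a minimal sketch that reads comma-separated text from an in-memory buffer instead of a file. The `raw` string is an assumption: it rewrites the first three rows from the output above in the comma-separated layout the response describes. The names `raw`, `no_names` and `with_names` are made up for this example.

```python
import io
import pandas as pd

# First three rows of the output above, written as comma-separated text
# (assumed layout of the raw file; no header line).
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Without names=, pandas uses the first data row as the header,
# so one observation is lost and the columns get meaningless labels.
no_names = pd.read_csv(io.StringIO(raw))

# With names=, every row is kept as data and the columns are labelled.
with_names = pd.read_csv(io.StringIO(raw), names=column_names)

print(len(no_names), list(no_names.columns))
print(len(with_names), list(with_names.columns))
```

The extension plays no role here: `pd.read_csv` only looks at the content it is given, which is why the same call works on `iris.data`.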

We also assign the values in each column to their own variables. This will make it easier to work with each feature later on.

In [9]:
sepal_length = df['sepal_length']
sepal_width  = df['sepal_width']
petal_length = df['petal_length']
petal_width  = df['petal_width']
species      = df['species']
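
As a small sketch of why separate variables are convenient: each one is a pandas Series that keeps the row index of the original DataFrame, so it can still be filtered by species. The frame below is a hypothetical four-row sample built from rows shown in the output above, and the `_col` names are only for this example.

```python
import pandas as pd

# Hypothetical four-row sample (rows 0, 1, 148 and 149 of the output above).
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.2, 5.9],
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica'],
})

sepal_length_col = sample['sepal_length']
species_col = sample['species']

# Each column is a Series sharing the DataFrame's index,
# so a condition on one column can filter another.
setosa_lengths = sepal_length_col[species_col == 'Iris-setosa']

print(type(sepal_length_col).__name__)
print(setosa_lengths.tolist())
```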

### Step 3 - Outputting Summaries of variables
#### Step 3.1 - Fixing df.describe

This was the hardest part of the project for me. Pandas gives us a very handy function called .describe(), which makes our lives incredibly easy when it comes to getting statistics and values out of the data set.

The main issue is the presentation. By default, Pandas displays the median as 50%, and I don't want that. Unfortunately, according to the pandas documentation, df.describe(percentiles=[]) still includes the 50% row. This is more of a stylistic choice, but I think it's worth solving.
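
To make the labelling issue concrete, here is a minimal sketch on a hypothetical five-row sample (the first five rows of the output above). It shows the default row labels of `.describe()` and, as one possible alternative that is not the approach taken in this notebook, relabels the `50%` row with `rename`.

```python
import pandas as pd

# Hypothetical sample: the first five rows of two numeric columns.
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6],
})

summary = sample.describe()

# Default labels include the quartiles, with the median shown as '50%'.
print(list(summary.index))

# Alternative: keep describe() and only relabel the '50%' row.
renamed = summary.rename(index={'50%': 'median'})
print(list(renamed.index))
```

Renaming keeps the 25% and 75% rows, so it only addresses the label and does not trim the summary down to chosen statistics.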

I then found out about select_dtypes and .agg. The first keeps just the numeric columns, since we're more interested in those than in the species at the moment, and the second computes only the aggregates I want. So, we first select the numeric variables and then compute the aggregates we want.

In [10]:
numeric_df = df.select_dtypes(include=['number'])
numeric_summary = numeric_df.agg(['count', 'mean', 'std', 'min', 'median', 'max'])
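
The same two calls can be tried on a small frame to see what shape the result takes. This is a sketch on a hypothetical five-row sample (values taken from the first rows of the output above), with `sample`, `numeric_sample` and `sample_summary` as made-up names.

```python
import pandas as pd

# Hypothetical sample: two numeric columns plus the species label.
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'petal_length': [1.4, 1.4, 1.3, 1.5, 1.4],
    'species': ['Iris-setosa'] * 5,
})

# Keep only the numeric columns; 'species' is dropped.
numeric_sample = sample.select_dtypes(include=['number'])

# One row per statistic, in the order requested, one column per feature.
sample_summary = numeric_sample.agg(['count', 'mean', 'std', 'min', 'median', 'max'])

print(sample_summary)
```

The row labels come straight from the list passed to `.agg`, which is why the median appears as `median` rather than `50%`.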

#### Step 3.2 - Creating the Summary txt files