# Iris Dataset Analysis

### Step 1 - Importing Packages

The first step, before we do anything else, is to import the packages we need for the data extraction and evaluation. The packages are as follows:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

As we can see, we need Pandas to interact with the Iris data file, as well as NumPy for our data manipulation.

Then, for all our plots, we'll need matplotlib and seaborn. (While seaborn can do much of what matplotlib does, I started off with matplotlib and only moved to seaborn at the end, as it was the easiest way to achieve a pairplot.)

### Step 2 - Reading our Data Set and organising it

The Iris data set comes with no column names, so Pandas alone gives us no way of knowing which column is which. Through the names txt file, we know what the attributes are, but we need to name the columns ourselves in order to work with them.

In [None]:
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

After our columns have been named, we go ahead and open the data set in our folder with Pandas.

In [None]:
df = pd.read_csv('iris.data', names=column_names)
print(df)

Now it's much clearer. I had some issues here because all the documentation I was checking was about csv files, not data files. It was ChatGPT that explained to me that I can apply the same reading method to data files as I would to csv files.

__Prompt__ - I have iris.data what file format is that?

__ChatGPT response__: 
_The iris.data file is essentially a plain text file formatted as comma-separated values (CSV)._
_Even though it doesn't have a ".csv" extension, it follows the same structure: each row represents a record, and the values are separated by commas._
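
To illustrate the point, here is a minimal sketch that reads headerless comma-separated text with `pd.read_csv` and `names=`. It uses a few inline sample rows (an assumption, standing in for the real `iris.data`) so that it runs on its own; `sample_text` and `sample_df` are hypothetical names.

```python
import io
import pandas as pd

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Illustrative stand-in for iris.data: comma-separated, no header row.
sample_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# read_csv accepts a path or any file-like object; names= supplies the header.
sample_df = pd.read_csv(io.StringIO(sample_text), names=column_names)
print(sample_df.dtypes)
```

The file extension plays no part here: only the comma-separated layout of the text matters.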

We also need to assign the values in each column to their own variables. This will make it easier to work with each feature later on.

In [None]:
sepal_length = df['sepal_length']
sepal_width  = df['sepal_width']
petal_length = df['petal_length']
petal_width  = df['petal_width']
species      = df['species']

### Step 3 - Outputting Summaries of variables
#### Step 3.1 - Fixing df.describe

This was the hardest part of the project for me. With Pandas, we have access to a very handy function called .describe() which makes our lives incredibly easy when it comes to getting statistics and values out of the data set. 

The main issue is the presentation. I don't want the median displayed as 50%, which is how Pandas displays it by default. Unfortunately, according to the pandas documentation, df.describe(percentiles=[]) still includes the 50%. This is more of a stylistic choice, but I think we need to solve it. 

I then found out about select_dtypes, which lets us keep just the numeric columns, seeing that we're more interested in those than in the species at the moment. 

In [None]:
numeric_df = df.select_dtypes(include=['number'])

This creates a variable called "numeric_df" and assigns it the result of calling select_dtypes, which returns only the columns with numeric data types (such as integers and floats). 

In other words, it scans the entire DataFrame (df, which contains the Iris dataset) and keeps only those columns that contain numerical values, ignoring any columns with non-numeric data (like strings or categories, such as the species column).
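
A small sketch of that filtering, using a hypothetical three-row frame (`demo_df`) rather than the real data set, so the values are for illustration only:

```python
import pandas as pd

# Hypothetical three-row frame with a layout similar to the Iris data.
demo_df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Keep only integer and float columns; the string column is left out.
demo_numeric = demo_df.select_dtypes(include=["number"])
print(list(demo_numeric.columns))
```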

Next, we move on to getting our actual values. For this, we're going to use the following functions from numpy:

np.mean

np.min

np.max

np.std

np.median

Together, these give us what df.describe would provide in one line of code, but instead of showing the median as 50%, we can name our stats what we want and organise them how we wish. 
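
One detail worth knowing when comparing these numbers with df.describe: np.std defaults to the population standard deviation (ddof=0), while pandas uses the sample standard deviation (ddof=1), so the two can differ slightly. The sketch below shows this on hypothetical values (not from the Iris file), alongside the agg route from reference 4 as one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the Iris file.
values = pd.Series([1.0, 2.0, 3.0, 4.0], name="demo")

# agg lets us pick the statistics and their order, so the median
# appears under the name "median" rather than "50%".
summary = values.agg(["mean", "min", "max", "std", "median"])
print(summary)

# np.std defaults to ddof=0 (population); pandas .std() defaults to ddof=1 (sample).
population_std = np.std(values.to_numpy())
sample_std = values.std()
print(population_std, sample_std)
```

Passing `ddof=1` to np.std would make the two agree, if matching df.describe matters.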

#### Step 3.2 - Creating the Summary txt files

Now, for the actual creation of the txt file, I'm going to break the code down and explain in comments what is happening.

In [None]:
# We begin by opening (or creating) a txt file called summary, which we will refer to as f. The with statement means we don't need to close the file afterwards,
# as it is closed automatically when the block ends. 
with open("summary.txt", "w") as f:

# Now, we're going to iterate through the columns. This for loop goes through each column in the numeric_df variable we created above, which should only
# contain numeric values.
    for column in numeric_df.columns:

# Next, we will extract all the values from the current column into a Numpy array called "data".
        data = numeric_df[column].values

# Now we can write what we want into the txt file. We write a heading for the current column (the first one should be sepal_length),
# followed by its stats formatted to 2 decimal places. The loop repeats this for every numeric column.
        f.write(f"Statistics for {column}:\n")
        f.write(f"  Mean: {np.mean(data):.2f}\n")
        f.write(f"  Minima: {np.min(data):.2f}\n")
        f.write(f"  Maxima: {np.max(data):.2f}\n")
        f.write(f"  Standard Deviation: {np.std(data):.2f}\n")
        f.write(f"  Median: {np.median(data):.2f}\n")
        f.write("\n")  

# We'll finish with a print to signal to whoever ran the code that the summary txt was successfully generated. 
print("Summary exported to summary.txt")
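
As a small aside on the two mechanisms used in that cell, here is a sketch showing that the file is already closed once the with block ends, and what the `:.2f` format produces. It writes to a temporary folder, and the file name `demo_summary.txt` and the number are hypothetical.

```python
import os
import tempfile

# Write to a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_summary.txt")

    with open(path, "w") as f:
        # :.2f formats the number with exactly two decimal places.
        f.write(f"  Mean: {5.8433:.2f}\n")

    # Outside the inner with block the file has been closed for us.
    file_was_closed = f.closed

    # Read the file back to see exactly what was written.
    with open(path) as f:
        written = f.read()

print(file_was_closed, repr(written))
```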

With our txt file now created with all our statistics, we can try and extract some analysis. 

```
Statistics for sepal_length:
  Mean: 5.84
  Minima: 4.30
  Maxima: 7.90
  Standard Deviation: 0.83
  Median: 5.80

Statistics for sepal_width:
  Mean: 3.05
  Minima: 2.00
  Maxima: 4.40
  Standard Deviation: 0.43
  Median: 3.00

Statistics for petal_length:
  Mean: 3.76
  Minima: 1.00
  Maxima: 6.90
  Standard Deviation: 1.76
  Median: 4.35

Statistics for petal_width:
  Mean: 1.20
  Minima: 0.10
  Maxima: 2.50
  Standard Deviation: 0.76
  Median: 1.30
```

From the summaries we have extracted, we can see that overall everything falls in line with what we expected, with nothing particularly egregious. We do notice, however, that some features have a higher standard deviation than others. For instance, petal length has a standard deviation of 1.76, indicating greater spread across the entire dataset. In contrast, sepal width (0.43) shows much less variation and appears more uniform across species.

A standard deviation of 1.76 is large next to a mean of 3.76, and it already suggests that some species (we're not sure which ones yet) are driving the spread in petal length up. In other words, the values for this feature are probably not shared across the species, and the gap between some of the species looks vast.
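
One way to check which species are behind the spread would be to compare the overall standard deviation with the standard deviation inside each species. This is a hedged sketch on a hypothetical six-row frame, so the numbers are illustrative and are not results from the Iris file.

```python
import pandas as pd

# Hypothetical values: two rows per species, for illustration only.
demo_df = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.5, 4.7, 5.9, 6.1],
    "species": ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2,
})

# Spread over the whole column versus spread inside each species.
overall_std = demo_df["petal_length"].std()
per_species = demo_df.groupby("species")["petal_length"].agg(["mean", "std"])
print(overall_std)
print(per_species)
```

If the per-species values come out much smaller than the overall one, the spread is mostly a gap between species rather than variation within them.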

### Step 4 - Creating the Plots (Histograms)

So, for the plot, I wasn't really sure how to approach it. Should we plot each variable on its own? Or each variable across all three species? After some research, I found a very interesting article that showcased the different variables with all species plotted on top of each other in histograms. This was what I decided to go for.

The article used Seaborn, so I went ahead and imported that.
We first create a variable for each species by searching the dataframe for the matching rows via the loc function, checking whether the species column matches the species we've chosen. 
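
A minimal sketch of that loc lookup, assuming the species labels are spelled as in the data file (for example "Iris-setosa"). The four-row `demo_df` and the variable name `setosa` are hypothetical.

```python
import pandas as pd

# Hypothetical four-row frame, for illustration only.
demo_df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 1.3],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
})

# A boolean mask inside loc keeps only the rows whose species matches.
setosa = demo_df.loc[demo_df["species"] == "Iris-setosa"]
print(setosa)
```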

I ran into several problems, particularly when it came to sepal_width. I discovered seaborn's own built-in palettes and after several tries I settled on colorblind.

In [None]:
sns.set_palette("colorblind")

After this is done, we go ahead with Seaborn itself, passing our df (dataset), our hue, which tells seaborn that each species needs a different colour, and the height, which I left as the original.

Next comes the histplot itself. We get each variable as defined, followed by the number of bins (I went with 10 after several tries) and opacity. 

In [None]:
petal_length_histogram = sns.FacetGrid(df, hue="species", height=3).map(sns.histplot, "petal_length", bins=10, alpha=0.5).add_legend()
petal_width_histogram = sns.FacetGrid(df, hue="species",  height=3).map(sns.histplot, "petal_width", bins=10, alpha=0.5).add_legend()
sepal_length_histogram = sns.FacetGrid(df, hue="species",  height=3).map(sns.histplot, "sepal_length", bins=10, alpha=0.5).add_legend()
sepal_width_histogram = sns.FacetGrid(df, hue="species",  height=3).map(sns.histplot, "sepal_width", bins=10, alpha=0.5).add_legend()
plt.show()
plt.close()
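
On the choice of 10 bins: I picked it by eye, but NumPy can also suggest a bin count from named rules, which is one way to sanity-check the choice. The sketch below applies Sturges' rule to hypothetical evenly spaced data with the same number of values as the Iris rows (150); it is not the method used for the plots above.

```python
import numpy as np

# Hypothetical data: 150 evenly spaced values, the same count as the Iris rows.
data = np.linspace(1.0, 6.9, 150)

# histogram_bin_edges returns the edges a named rule would use.
edges = np.histogram_bin_edges(data, bins="sturges")
n_bins = len(edges) - 1
print(n_bins)
```

Other rule names such as "fd" or "auto" can be passed the same way to compare suggestions.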

Finally, we save all of the above to different pngs. ChatGPT<sup>12</sup> helped me here and reminded me that each of the plots above needed to be assigned to its own variable.


In [None]:
petal_length_histogram.savefig("petal_length_histogram.png", bbox_inches='tight')
petal_width_histogram.savefig("petal_width_histogram.png", bbox_inches='tight')
sepal_length_histogram.savefig("sepal_length_histogram.png", bbox_inches='tight')
sepal_width_histogram.savefig("sepal_width_histogram.png", bbox_inches='tight')

### Step 5 - Creating the Plots Part 2 (Scatterplots)

Now, for the scatterplot, we once again need to decide what colours to use. Since we've already decided on "colorblind" for the histograms, let's keep that one for consistency and pull it from seaborn.
Following the documentation on the seaborn website, we just need to pass our data (df), assign our x and y columns, which have already been defined, and choose the palette. 

Afterwards, we make sure we also export the plot to its respective png file.

In [None]:
sepal_scatterplot = sns.scatterplot(data=df, x="sepal_length", y="sepal_width", hue="species", palette="colorblind")
plt.savefig("sepal_scatterplot.png", bbox_inches='tight')
plt.show()

We repeat the same for the petals.

In [None]:
petal_scatterplot = sns.scatterplot(data=df, x="petal_length", y="petal_width", hue="species", palette="colorblind")
plt.savefig("petal_scatterplot.png", bbox_inches='tight')
plt.show()
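
If the two scatterplots ever end up drawn on top of each other (for example when running everything as one script), one option is to give each plot its own figure and save through that figure. This is a sketch under assumptions: the six-row `demo_df` is hypothetical, the plot is saved to an in-memory buffer instead of a png file, and the "Agg" backend is selected only so the sketch runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical values, for illustration only.
demo_df = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.5, 4.7, 5.9, 6.1],
    "petal_width": [0.2, 0.3, 1.4, 1.5, 2.1, 2.3],
    "species": ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2,
})

# Each plot gets its own figure and axes, so nothing is shared between plots.
fig, ax = plt.subplots()
sns.scatterplot(data=demo_df, x="petal_length", y="petal_width",
                hue="species", palette="colorblind", ax=ax)

# Save through the figure itself rather than through the global plt state.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", bbox_inches="tight")
plt.close(fig)
```

In the notebook itself, the buffer would simply be replaced by a file name such as the ones used above.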

### Step 6 - Extra (Pairplots)

One final analysis I found worth using was the pairplot. It showcases the relationships across all variables, but while it is more complete, it is also more overwhelming to the eyes, which makes finding minute or specific details more difficult. I considered using histograms for the diagonal, but after several attempts at making it look less "squished" I eventually settled on leaving the default KDE, since we've already exported the histograms anyway.

In [None]:
pair_plot = sns.pairplot(df, hue="species")
pair_plot.savefig("pairplot.png", bbox_inches='tight')
plt.show()
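
Since the pairplot can be hard to read in detail, a correlation matrix of the numeric columns is one possible numeric companion to it. The sketch below uses a hypothetical six-row frame, so the output is illustrative and not a result from the Iris file.

```python
import pandas as pd

# Hypothetical values, for illustration only.
demo_df = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.5, 4.7, 5.9, 6.1],
    "petal_width": [0.2, 0.3, 1.4, 1.5, 2.1, 2.3],
    "species": ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2,
})

# Drop the species column first, then compute pairwise Pearson correlations.
corr = demo_df.select_dtypes(include=["number"]).corr()
print(corr.round(2))
```

Each cell corresponds to one off-diagonal panel of the pairplot, condensed into a single number between -1 and 1.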

### References

1. https://realpython.com/pandas-dataframe/#filtering-data - On filtering the data for specific columns.

2. https://medium.com/@SamTaylor92/data-analysis-python-exploring-a-dataset-summary-statistics-afc7a690ec96 - On approaching the Iris Dataset.

3. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html - On selecting specific types of data (e.g. numbers).

4. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html - On the aggregate function to bypass the Median showcasing problem.

5. https://www.w3schools.com/python/python_file_write.asp - On writing files with Python.

6. https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_numeric_dtype.html - Documentation on api.types that shows us how to check for specific values. Very important since one of the features ("species") doesn't have any agg values.

7. https://seaborn.pydata.org/generated/seaborn.FacetGrid.html - On the basics of Seaborn and getting our data onto the plot.

8. https://medium.com/@nirajan.acharya777/exploratory-data-analysis-of-iris-dataset-9c0df76771df - On the various ways of showcasing the Iris Dataset and possible correlations.

9. https://matplotlib.org/stable/users/explain/colors/colors.html - On the many colour keywords for the plots.

10. https://medium.com/@maxmarkovvision/optimal-number-of-bins-for-histograms-3d7c48086fde - A study on the most appropriate number of bins for histograms.

11. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html - On saving plots as png files.

12. __ChatGPT:__ _"The best approach is to assign each FacetGrid to a variable, save its figure immediately with the savefig method (or plt.savefig using its .fig attribute), and then close that figure before moving on. This ensures that each plot is saved independently without overlapping."_ 

13. https://seaborn.pydata.org/generated/seaborn.scatterplot.html - On making scatterplots with seaborn

14. https://seaborn.pydata.org/generated/seaborn.pairplot.html - On making pairplots with seaborn

15. https://www.analyticsvidhya.com/blog/2024/02/pair-plots-in-machine-learning/ - On reading pairplots






## End