# Iris Dataset Analysis

### Step 1 - Importing Packages

Before we do anything else, we import the packages we need for data extraction and evaluation. The packages are as follows:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

As we can see, we need Pandas both to read the iris file and to handle all our data manipulation.

Then, for all our plots, we'll need matplotlib and seaborn. (While seaborn can do what matplotlib does, I started off with matplotlib and only moved to seaborn at the end, as it was the easiest way to achieve a pairplot.)

### Step 2 - Reading our Data Set and organising it

The Iris data set comes with no column names, so after loading it with Pandas we have no way of knowing which column is which. The accompanying names txt file tells us what the attributes are, but we need to name the columns ourselves in order to work with them.

In [None]:
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

After our columns have been named, we go ahead and open the data set in our folder with Pandas.

In [None]:
df = pd.read_csv('iris.data', names=column_names)
print(df)

Now it's much clearer. I had some issues here because all the documentation I was checking was about csv files, not data files. It was ChatGPT that explained to me that I can apply the same reading method to data files as I would to csv files.

__Prompt__ - I have iris.data what file format is that?

__ChatGPT response__: 
_The iris.data file is essentially a plain text file formatted as comma-separated values (CSV)._
_Even though it doesn't have a ".csv" extension, it follows the same structure: each row represents a record, and the values are separated by commas._
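
To illustrate the point, here is a minimal sketch showing that read_csv looks at the content rather than the extension. It reads a few sample rows from an in-memory string instead of the real file; `raw_text` and `sample_df` are names made up for this example.

```python
import io

import pandas as pd

# A few sample rows in the same layout as iris.data:
# comma-separated values with no header line.
raw_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Passing names= supplies the column labels and tells pandas
# that the first line is data rather than a header.
sample_df = pd.read_csv(io.StringIO(raw_text), names=column_names)

print(sample_df.shape)
print(sample_df.dtypes)
```

The same call works with a file path such as 'iris.data' in place of the StringIO object.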

We are also going to assign the values in each column to their own variables. This will make it easier to work with each feature later on.

In [None]:
sepal_length = df['sepal_length']
sepal_width  = df['sepal_width']
petal_length = df['petal_length']
petal_width  = df['petal_width']
species      = df['species']

### Step 3 - Outputting Summaries of variables
#### Step 3.1 - Fixing df.describe

This was the hardest part of the project for me. With Pandas, we have access to a very handy function called .describe(), which makes our lives incredibly easy when it comes to getting statistics and values out of the data set. 

The main issue is the presentation: I don't want the median displayed as 50%, which is how Pandas displays it by default. Unfortunately, according to the pandas documentation, df.describe(percentiles=[]) still includes the 50% row. This is more of a stylistic choice, but I think it's worth solving. 

I then found out about select_dtypes and .agg. With select_dtypes we keep only the numeric columns, since we're more interested in the numbers than in the species at the moment, and with .agg we request exactly the aggregates we want.

In [None]:
numeric_df = df.select_dtypes(include=['number'])
numeric_summary = numeric_df.agg(['count', 'mean', 'std', 'min', 'median', 'max'])
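
As a quick check of the difference in row labels, the sketch below compares describe() and agg() on a tiny made-up frame. The values and the names `demo_df`, `numeric_demo` and `demo_summary` are purely illustrative.

```python
import pandas as pd

# Small made-up frame with the same mix of numeric and text columns.
demo_df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 5.1],
    "petal_width": [0.2, 1.4, 2.5, 1.9],
    "species": ["a", "b", "c", "c"],
})

# Keep only the numeric columns, dropping "species".
numeric_demo = demo_df.select_dtypes(include=["number"])

# describe() labels the median row "50%".
describe_labels = numeric_demo.describe().index.tolist()
print(describe_labels)

# agg() uses the function names we pass in as the row labels.
demo_summary = numeric_demo.agg(["count", "mean", "std", "min", "median", "max"])
print(demo_summary.index.tolist())
```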

#### Step 3.2 - Creating the Summary txt files

Now that this is done, we'll create a for loop to iterate through each column and write each variable's summary to its own named txt file.

In [None]:
for col in df.columns:
    file_name = f"{col}_summary.txt"

Next, within this loop, we use is_numeric_dtype to check whether each column is numeric. This is needed because the species column is non-numerical.
If the column is numeric, the if statement assigns its summary to a variable using the agg function from before; if it holds strings (the species names),
it counts their values with value_counts instead.

In [None]:
if pd.api.types.is_numeric_dtype(df[col]):
    col_summary = df[col].agg(['count', 'mean', 'std', 'min', 'median', 'max'])
else:
    col_summary = df[col].value_counts()


We then move to the txt generation part. We open the file name in write mode and write in the summaries that we organised in the if statement.
I struggled with this because only the species was getting its own txt file, until I realised that this happened because the code above was not inside the for loop.

In [None]:
with open(file_name, "w") as file:
    file.write(col_summary.to_string())
    
print(f"Summary for {col} has been written to {file_name}")
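
Since the indentation was what tripped me up, here is a sketch of the three pieces above assembled into a single loop. It runs on a small made-up frame and writes into a temporary folder; `demo_df`, `output_dir` and `written_files` are hypothetical names used only for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small made-up frame standing in for the iris DataFrame.
demo_df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Temporary folder so the sketch does not clutter the project directory.
output_dir = Path(tempfile.mkdtemp())
written_files = []

for col in demo_df.columns:
    file_path = output_dir / f"{col}_summary.txt"

    # Everything below is indented, so it runs once per column.
    if pd.api.types.is_numeric_dtype(demo_df[col]):
        col_summary = demo_df[col].agg(['count', 'mean', 'std', 'min', 'median', 'max'])
    else:
        col_summary = demo_df[col].value_counts()

    with open(file_path, "w") as file:
        file.write(col_summary.to_string())

    written_files.append(file_path.name)
    print(f"Summary for {col} has been written to {file_path}")
```

If the if/else and the with block sit outside the loop, they only run once, after the loop has finished, using the last column (species).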

### Step 4 - Creating the Plots (Histograms)

For the plot, I wasn't really sure how to approach it. Should we plot each variable on its own? Or maybe each variable across all three species? After some research, 
I found a very interesting article that showed each variable as a histogram with all three species plotted on top of each other. This was what I decided to go for.

The article used Seaborn, so I went ahead and imported that.
We first create a variable for each species and search the dataframe for its rows via the loc function, checking whether the species
matches the one we've chosen. 
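
The filtering step is not shown in the cells of this notebook, so the following is only a minimal sketch of what selecting rows per species with loc could look like. The frame and the variable names (`setosa`, `versicolor`, `virginica`) are assumptions for illustration.

```python
import pandas as pd

# Small made-up frame using the species labels found in iris.data.
demo_df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0],
    "species": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica",
    ],
})

# loc with a boolean mask keeps only the rows where the species matches.
setosa = demo_df.loc[demo_df["species"] == "Iris-setosa"]
versicolor = demo_df.loc[demo_df["species"] == "Iris-versicolor"]
virginica = demo_df.loc[demo_df["species"] == "Iris-virginica"]

print(len(setosa), len(versicolor), len(virginica))
```

With seaborn's hue parameter this manual split is not strictly required, because seaborn groups the rows by species itself.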

I ran into several problems, particularly with sepal_width. I discovered seaborn's own built-in palettes and, after several tries, I settled on colorblind.

In [None]:
sns.set_palette("colorblind")

After this is done, we go ahead with Seaborn itself, passing our df (dataset), the hue, which tells seaborn to give each species a different colour,
and the height, which I left at the original value.

Next comes the histplot itself. We get each variable as defined, followed by the number of bins (I went with 10 after several tries) and opacity. 

In [None]:
petal_length_histogram = sns.FacetGrid(df, hue="species", height=3).map(sns.histplot, "petal_length", bins=10, alpha=0.5).add_legend()
petal_width_histogram = sns.FacetGrid(df, hue="species",  height=3).map(sns.histplot, "petal_width", bins=10, alpha=0.5).add_legend()
sepal_length_histogram = sns.FacetGrid(df, hue="species",  height=3).map(sns.histplot, "sepal_length", bins=10, alpha=0.5).add_legend()
sepal_width_histogram = sns.FacetGrid(df, hue="species",  height=3).map(sns.histplot, "sepal_width", bins=10, alpha=0.5).add_legend()
plt.show()
plt.close()

Finally, we save all of the above to separate pngs. ChatGPT<sup>12</sup> helped me here and reminded me that each plot above needed to be assigned to its own variable.


In [None]:
petal_length_histogram.savefig("petal_length_histogram.png", bbox_inches='tight')
petal_width_histogram.savefig("petal_width_histogram.png", bbox_inches='tight')
sepal_length_histogram.savefig("sepal_length_histogram.png", bbox_inches='tight')
sepal_width_histogram.savefig("sepal_width_histogram.png", bbox_inches='tight')

### Step 5 - Creating the Plots Part 2 (Scatterplots)

Now, for the scatterplot, we once again need to decide which colours to use. Since we've already decided on "colorblind"
for the histograms, let's keep that one for consistency and pull from it with seaborn.
By following the documentation on the seaborn website, we just need to pass our data (df), assign x and y to the columns we named earlier, and choose the palette. 

Afterwards, we make sure to export the plot to its respective png file.

In [None]:
sepal_scatterplot = sns.scatterplot(data=df, x="sepal_length", y="sepal_width", hue="species", palette="colorblind")
plt.savefig("sepal_scatterplot.png", bbox_inches='tight')
plt.show()

We repeat the same for the petals.

In [None]:
petal_scatterplot = sns.scatterplot(data=df, x="petal_length", y="petal_width", hue="species", palette="colorblind")
plt.savefig("petal_scatterplot.png", bbox_inches='tight')
plt.show()

### Step 6 - Extra (Pairplots)

One final analysis I found worth using was the pairplot. It showcases the relationships across all variables, so while it is more complete, it is also
more overwhelming to the eyes, making it harder to find minute or specific details. I considered using histograms for the diagonal, but after several 
attempts at making them look less "squished" I eventually settled on leaving the default KDE, since we've already exported the histograms anyway.

In [None]:
pair_plot = sns.pairplot(df, hue="species")
pair_plot.savefig("pairplot.png", bbox_inches='tight')
plt.show()

### References

1. https://realpython.com/pandas-dataframe/#filtering-data - On filtering the data for specific columns.

2. https://medium.com/@SamTaylor92/data-analysis-python-exploring-a-dataset-summary-statistics-afc7a690ec96 - On approaching the Iris Dataset.

3. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html - On selecting specific types of data (e.g. numbers).

4. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html - On the aggregate function to bypass the Median showcasing problem.

5. https://www.w3schools.com/python/python_file_write.asp - On writing files with Python.

6. https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_numeric_dtype.html - Documentation on api.types that shows us how to check for specific values. Very important since one of the features ("species") doesn't have any agg values.

7. https://seaborn.pydata.org/generated/seaborn.FacetGrid.html - On the basics of Seaborn and getting our data onto the plot.

8. https://medium.com/@nirajan.acharya777/exploratory-data-analysis-of-iris-dataset-9c0df76771df - On the various ways of showcasing the Iris Dataset and possible correlations.

9. https://matplotlib.org/stable/users/explain/colors/colors.html - On the many colour keywords for the plots.

10. https://medium.com/@maxmarkovvision/optimal-number-of-bins-for-histograms-3d7c48086fde - A study on the most appropriate number of bins for histograms.

11. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html - On saving plots as png files.

12. __ChatGPT:__ _"The best approach is to assign each FacetGrid to a variable, save its figure immediately with the savefig method (or plt.savefig using its .fig attribute), and then close that figure before moving on. This ensures that each plot is saved independently without overlapping."_ 

13. https://seaborn.pydata.org/generated/seaborn.scatterplot.html - On making scatterplots with seaborn

14. https://seaborn.pydata.org/generated/seaborn.pairplot.html - On making pairplots with seaborn

15. https://www.analyticsvidhya.com/blog/2024/02/pair-plots-in-machine-learning/ - On reading pairplots






## End