# Investigating a Data Set using Python - a worked example

This notebook accompanies the program `analysis.py`, which performs exploratory data analysis, and creates plots, using the Fisher Iris data set. The code in this program is broken down here, with added comments, notes, and sources.

## (1) Libraries Used
A number of Python libraries are included here to provide useful functions and tools.
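- The [sys](https://docs.python.org/3/library/sys.html) library provides functions and variables that allow us to interact with the Python interpreter and the program's environment, such as the standard input and output streams. Note that opening, reading, and writing files is done with Python's built-in `open()` function, which needs no import.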

- The [pandas](https://pandas.pydata.org/docs/index.html) library contains tools made specifically for data analysis and manipulation. Including it allows us to put our data into a Pandas DataFrame, which makes it easy to access and manipulate the data and to pass it to plotting functions.

- The [matplotlib](https://matplotlib.org) library contains tools to create plots and other data visualisations. This program uses the [Pyplot](https://matplotlib.org/stable/tutorials/pyplot.html) part of Matplotlib.

- The [seaborn](https://seaborn.pydata.org/) library is built on Matplotlib and extends its capabilities to create informative plots. Seaborn works particularly well with data stored in a Pandas DataFrame object.

```python
# Library for interacting with the interpreter and the program's environment
import sys
# Library for data analysis
import pandas as pd
# Libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
```

## (2) Downloading and reading-in the dataset

- Data downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/53/iris), using the URL provided by Andrew.
- There are two sets of data: `iris.data` and `bezdekiris.data`. According to [Bezdek et al. (1999)](https://doi.org/10.1109%2F91.771092), the original UCI `iris.data` contains two data points that are inaccurate when compared to Fisher's 1936 data. After inspection, I determined that `bezdekiris.data` contains the corrected data, so this is the dataset I will use.
- Read into a Pandas DataFrame with `pandas.read_csv`, following this [example](https://www.angela1c.com/projects/iris_project/downloading-iris/). See the [documentation for pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv).

```python
# Variable to hold the Iris data filename
FILENAME = "bezdekiris.data"

# List of features (columns) in the data
feature_names = ["sepal length", "sepal width", "petal length", "petal width"]

with open(FILENAME, "rt") as iris_file:
    # Create a Pandas DataFrame from the Iris data file.
    # Add column names manually, as the data file has no header row.
    iris = pd.read_csv(
        iris_file,
        names=feature_names + ["species"],
    )

# Print out the first 5 rows of the DataFrame
print(iris.head())
```

```
   sepal length  sepal width  petal length  petal width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
```
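The inspection of `iris.data` against `bezdekiris.data` mentioned above is not shown in this notebook. One way it might be done is sketched here, assuming both files have the same number of rows in the same order. The helper name `differing_rows` is hypothetical, and the two-row strings are toy stand-ins for the real files, used only so the example runs on its own.

```python
import io

import pandas as pd

COLUMNS = ["sepal length", "sepal width", "petal length", "petal width", "species"]


def differing_rows(source_a, source_b):
    """Return the rows where two headerless Iris CSV sources disagree.

    Assumes both sources hold the same number of rows in the same order.
    """
    a = pd.read_csv(source_a, names=COLUMNS)
    b = pd.read_csv(source_b, names=COLUMNS)
    # True for every row in which at least one column differs
    mask = (a != b).any(axis=1)
    # Put the two versions of each differing row side by side
    return a[mask].join(b[mask], lsuffix=" (a)", rsuffix=" (b)")


# Toy stand-ins for the two files (not the real data)
toy_a = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.1,1.5,0.1,Iris-setosa\n"
toy_b = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.1,1.5,0.2,Iris-setosa\n"

diff = differing_rows(io.StringIO(toy_a), io.StringIO(toy_b))
print(diff)
```

With the real files, the call would be `differing_rows("iris.data", "bezdekiris.data")`, since `pandas.read_csv` accepts a file path as well as a file-like object.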


## (3.1) Output a summary of each variable to a single text file
Now that the dataset is in a Pandas DataFrame, we can get summary statistics for each feature using the `pandas.DataFrame.describe()` method ([documentation for pandas.DataFrame.describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe)).

```python
iris.describe()
```

```
       sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
```
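The section heading mentions writing the summary to a single text file, a step the cell above does not show. A minimal sketch of one way to do it follows; it is not necessarily how `analysis.py` does it. The helper name `build_summary`, the output file name `summary.txt`, and the four-row toy DataFrame are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_summary(df):
    """Return a plain-text summary of the numeric columns and the species counts."""
    sections = [
        "Summary statistics of numeric variables",
        df.describe().to_string(),
        "",
        "Number of samples per species",
        df["species"].value_counts().to_string(),
    ]
    return "\n".join(sections) + "\n"


# Toy data standing in for the full Iris DataFrame (values are made up)
toy = pd.DataFrame({
    "sepal length": [5.1, 4.9, 7.0, 6.3],
    "species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Write to a temporary directory so the example leaves no files behind
summary_path = Path(tempfile.mkdtemp()) / "summary.txt"
summary_path.write_text(build_summary(toy))
```

Building the text first and writing it in one call keeps the file handling separate from the formatting, so the same string can also be printed for a quick check.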


## (3.2) Save a histogram of each variable to png files
- Creates a single plot comprising four subplots, using [matplotlib.pyplot.subplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html)
- Uses [seaborn.histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html) to plot each histogram
- Specifies the `species` column as the grouping variable (the `hue` parameter of `histplot`), which separates each histogram by species and colour-codes them
