# Introduction

This notebook investigates the famous Fisher Iris dataset (Fisher, R. (1936). Iris [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56C76).

In [None]:
# First, import the libraries needed to load the data and run the analysis that follows.
# These will be moved into a requirements file later.
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn import datasets
import matplotlib.pyplot as plt 
from sklearn.preprocessing import LabelEncoder as le
from sklearn.linear_model import LinearRegression
import seaborn as sns


# Step 1: Acquiring the Data

The first step is acquiring the data.
The data comes from https://archive.ics.uci.edu/dataset/53/iris, which includes an option to import it directly using Python.


In [None]:
# ucimlrepo is a package for importing datasets from the UC Irvine Machine Learning Repository.
# See: https://github.com/uci-ml-repo/ucimlrepo     
from ucimlrepo import fetch_ucirepo 

In [None]:
# fetch the data. the ID specifies which of the UCI datasets you want.
iris = fetch_ucirepo(id=53) 

The fetched object also contains metadata.

In [None]:
# metadata contains details of the dataset including its main characteristics, shape, info on missing data, and relevant links (e.g. where to find raw data)
# the metadata also contains detailed additional information including text descriptions of variables, funding sources, and the purpose of the data.
print(iris.metadata) 

In [None]:
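# let's take the data portion and save it to a variable called iris.
# note: this overwrites the fetched object, so the metadata is no longer reachable through iris after this cell.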
iris = iris.data

# Step 2: Initial exploration of data package structure

In [None]:
# print iris to see what it contains
print(iris)

As shown above, the loaded iris data is a dictionary-like object which contains the features, the classes (targets), and every instance in the dataset.
What does each of these elements represent?
- Each class represents a different species of iris
- Each feature represents a different measured aspect of the flowers
- Each instance represents a specific flower and the measurements of its features in centimeters.

Let's look at each of these below

In [None]:
# look at the features of the data. you can see the columns represent sepal length, sepal width, petal length, and petal width.
print(iris.features)

In [None]:
# the targets are labels for the data. in this case, they are the species of iris flower (setosa, versicolor, virginica).
print(iris.targets)

# Step 3: Explore and summarise the dataset

In [None]:
# I would like to have both the targets and features in one dataframe to make my analysis and code easier. 
# the code in x suggests putting the targets and features into x and y variables.
# data (as pandas dataframes) 
X = iris.features 
y = iris.targets 

In [None]:
# I then use these two variables to create a new dataframe called iris_df.
# we'll use the pandas function concat to do this. we'll specify axis=1, which means the frames are joined side by side (as columns), with rows matched by index.
# see: https://pandas.pydata.org/docs/user_guide/merging.html#joining-logic-of-the-resulting-axis 
iris_df = pd.concat([X, y], axis=1)
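
The column-wise join above relies on both frames sharing the same row index. The following is a minimal sketch of that behaviour, not the notebook's own data: `X_demo`, `y_demo`, and `y_shifted` are hypothetical stand-ins for `iris.features` and `iris.targets`, with illustrative values.

```python
import pandas as pd

# Hypothetical stand-ins for iris.features and iris.targets (illustrative values).
X_demo = pd.DataFrame({"sepal length": [5.1, 4.9], "sepal width": [3.5, 3.0]})
y_demo = pd.DataFrame({"class": ["Iris-setosa", "Iris-setosa"]})

# axis=1 places the frames side by side, matching rows by index label.
combined = pd.concat([X_demo, y_demo], axis=1)
print(combined.shape)

# If the indexes differ, rows are not matched and NaN values fill the gaps.
y_shifted = y_demo.copy()
y_shifted.index = [1, 2]
misaligned = pd.concat([X_demo, y_shifted], axis=1)
print(misaligned)
```

Because `concat` aligns on index labels rather than row position, it is worth confirming that the combined frame has the same number of rows as each input.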

In [None]:
# let's explore our new dataframe. we'll start looking at the top and bottom 5 rows to get a sense of what the data looks like.

In [None]:
# return top 5 rows
# see: https://www.w3schools.com/python/pandas/ref_df_head.asp#:~:text=The%20head()%20method%20returns,a%20number%20is%20not%20specified.&text=Note%3A%20The%20column%20names%20will,addition%20to%20the%20specified%20rows.
iris_df.head()

In [None]:
# return bottom 5 rows.
# https://www.w3schools.com/python/pandas/ref_df_tail.asp#:~:text=The%20tail()%20method%20returns,a%20number%20is%20not%20specified.
iris_df.tail()

In [None]:
# double check the types of data in iris. we can see that each column is a float64 type, except for the target/class column.
iris_df.dtypes
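
Since only the class column is non-numeric, the dtypes can also be used to pick out the numeric columns programmatically instead of naming the class column by hand. This is a small sketch under that assumption; `demo_df` and its values are hypothetical and stand in for `iris_df`.

```python
import pandas as pd

# Hypothetical miniature of iris_df (illustrative values only).
demo_df = pd.DataFrame({
    "sepal length": [5.1, 7.0, 6.3],
    "petal length": [1.4, 4.7, 6.0],
    "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Keep only columns with a numeric dtype; the string class column is dropped.
numeric_cols = demo_df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

A list like `numeric_cols` could then drive any per-column statistic without a hard-coded column name.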

Now we will move on to summarizing the basic descriptive aspects of the dataset, which will tell us about the flowers themselves.

In [None]:
# Describe the data set. This will show basic descriptive statistics for each column in the dataframe.
# This includes the count, mean, standard deviation, min, max, and 25th, 50th, and 75th percentiles.
# see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html 
iris_df.describe()

In [None]:
# we can check for nulls by combining the isnull function with the sum function
# see: https://www.w3schools.com/python/pandas/ref_df_isnull.asp 
# and https://www.w3schools.com/python/pandas/ref_df_sum.asp
print(iris_df.isnull().sum())
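
To illustrate what the null check reports when values are actually missing, here is a minimal sketch on a hypothetical frame (`demo_df`, with gaps inserted on purpose; it is not the Iris data).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately missing entries.
demo_df = pd.DataFrame({
    "sepal length": [5.1, np.nan, 6.3],
    "sepal width": [3.5, 3.2, np.nan],
    "class": ["Iris-setosa", None, "Iris-virginica"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = demo_df.isnull().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells in the frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```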

In [None]:
# Describe has shown us the mean, median and quartiles for each column. Let's look at other measures of distribution for each column.

In [None]:
# the skew function will show us the skewness of the data. skewness is a measure of how asymmetrically the data is distributed around the mean.
# see: https://www.datacamp.com/tutorial/understanding-skewness-and-kurtosis
# I want it for each column, so I'm going to use a for loop to save time. see: https://statisticsglobe.com/iterate-over-columns-pandas-dataframe-python

# for each column in iris df, calculate the skewness and then print it out.  
for column in iris_df:
    # skip the class column: it holds strings, so skew() would fail on it (learned this from an earlier error).
    if column != 'class':
        skew = iris_df[column].skew()
        print(f"Skewness of {column}: {skew}")


Skewness of sepal length: 0.3149109566369728
Skewness of sepal width: 0.3340526621720866
Skewness of petal length: -0.27446425247378287
Skewness of petal width: -0.10499656214412734
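
For readers curious about what `skew()` computes, the sketch below reproduces it by hand on a small hypothetical series. It assumes the bias-corrected (adjusted Fisher-Pearson) sample skewness, which is what pandas documents as its unbiased skew; `sample_skewness` and `demo` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd


def sample_skewness(values):
    """Bias-corrected (adjusted Fisher-Pearson) sample skewness."""
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment
    m3 = np.mean(deviations ** 3)  # third central moment
    g1 = m3 / m2 ** 1.5            # uncorrected skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1


# Hypothetical data with one large value, giving a long right tail.
demo = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 10.0])
manual = sample_skewness(demo)
print(manual, demo.skew())
```

A positive value indicates a longer right tail and a negative value a longer left tail, which is how the signs in the output above can be read.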


In [23]:
# Similarly, we can check the data for kurtosis. According to DataCamp, "kurtosis focuses more on the height. It tells us how peaked or flat our normal (or normal-like) distribution is."
# see https://www.datacamp.com/tutorial/understanding-skewness-and-kurtosis
# for each column in iris df, calculate the kurtosis and then print it out.
for column in iris_df:
    # skip the class column: it holds strings, so kurtosis() would fail on it (learned this from an earlier error).
    if column != 'class':
        kurtosis = iris_df[column].kurtosis()
        print(f"Kurtosis of {column}: {kurtosis}")

Kurtosis of sepal length: -0.5520640413156395
Kurtosis of sepal width: 0.2907810623654279
Kurtosis of petal length: -1.4019208006454036
Kurtosis of petal width: -1.3397541711393433
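
The values above are excess kurtosis, meaning a normal distribution scores 0 rather than 3. As a rough sketch of the calculation, assuming the bias-corrected sample estimator that pandas uses, the function below matches `kurtosis()` on a small hypothetical series; `sample_excess_kurtosis` and `flat` are illustrative names.

```python
import numpy as np
import pandas as pd


def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (a normal distribution gives about 0)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment
    m4 = np.mean(deviations ** 4)  # fourth central moment
    g2 = m4 / m2 ** 2 - 3.0        # uncorrected excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


# Hypothetical evenly spread data: flatter than a normal curve.
flat = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
manual = sample_excess_kurtosis(flat)
print(manual, flat.kurtosis())
```

Negative values, like those for petal length and petal width, point to a distribution that is flatter than a normal curve.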


We can see from the analysis above that the mean and median are largely similar. Similarly, both our skewness and kurtosis are within the normal range. These findings indicate our data is fairly normally distributed and not affected by many outliers (see: https://www.smartpls.com/documentation/functionalities/excess-kurtosis-and-skewness).

The mean sepal length across the dataset is approx. 5.8 cm and the mean sepal width is approx. 3.1 cm. The means for petal length and width are approx. 3.8 cm and 1.2 cm, respectively.
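
The mean-versus-median comparison can be made explicit by putting both statistics side by side and computing the gap per column. This is a minimal sketch on a hypothetical frame (`demo_df`, illustrative values), not a rerun of the analysis above.

```python
import pandas as pd

# Hypothetical measurements (illustrative values only).
demo_df = pd.DataFrame({
    "petal length": [1.4, 1.5, 4.5, 5.0, 6.0],
    "petal width": [0.2, 0.2, 1.5, 1.8, 2.5],
})

# One row per column, with the mean and median next to each other.
summary = demo_df.agg(["mean", "median"]).T

# A gap near zero suggests a roughly symmetric distribution;
# a negative gap means low values are pulling the mean below the median.
summary["gap"] = summary["mean"] - summary["median"]
print(summary)
```

Applied to the numeric columns of `iris_df`, a table like this would show at a glance which measurements have the largest gap.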

In [None]:
# The class column is a string variable and therefore we cannot calculate mean, median, skewness, or kurtosis as we did above. However, we can count the occurrence of each value.
# the value_counts function will return a series containing counts of unique values. 
# see: https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html 
iris_df['class'].value_counts()

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
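
The counts above show three equally sized classes. As a sketch of how class balance could be checked more generally, `value_counts` also accepts `normalize=True` to return proportions; the `labels` series below is hypothetical and deliberately unbalanced for contrast.

```python
import pandas as pd

# Hypothetical, deliberately unbalanced labels (not the Iris counts).
labels = pd.Series(
    ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 2 + ["Iris-virginica"],
    name="class",
)

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # proportion per class, sums to 1
print(shares)

# The classes are balanced only if every class has the same count.
is_balanced = counts.nunique() == 1
print(is_balanced)
```

For the Iris data, where each species has 50 rows, the same check would report a balanced dataset.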