# Fundamentals of Data Analysis Project

**Kevin O'Leary**

***

## **Introduction to Dataset**

Fisher's Iris dataset was made famous by the statistician Ronald Fisher when he used it in his 1936 paper "The use of multiple measurements in taxonomic problems". However, it was compiled before this by Edgar Anderson, a botanist who was examining the variation within the Iris flower.

The 1936 paper proposed 'Fisher's linear discriminant', which underlies what is known today as linear discriminant analysis. This is a statistical method for finding a combination of features that best separates the data into distinct classes.
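To make the idea of "a combination of features" concrete, here is a minimal two-class sketch using only NumPy. It is not Fisher's original calculation: the measurements are illustrative petal length and width values (assumed, not loaded from the dataset), and the names `group_a`, `group_b` and `scatter` are hypothetical.

```python
import numpy as np

# Petal length and petal width (cm) for two small groups of flowers.
# Illustrative values only (assumed for this sketch, not read from the dataset).
group_a = np.array([[1.4, 0.2], [1.4, 0.2], [1.3, 0.2], [1.5, 0.2], [1.4, 0.2]])
group_b = np.array([[4.7, 1.4], [4.5, 1.5], [4.9, 1.5], [4.0, 1.3], [4.6, 1.5]])


def scatter(points):
    """Within-class scatter: sum of outer products of deviations from the class mean."""
    centred = points - points.mean(axis=0)
    return centred.T @ centred


mean_a, mean_b = group_a.mean(axis=0), group_b.mean(axis=0)
within = scatter(group_a) + scatter(group_b)

# Fisher's direction is proportional to S_w^{-1} (mean_b - mean_a).
w = np.linalg.solve(within, mean_b - mean_a)

# Projecting onto w turns each flower's two features into a single number.
proj_a = group_a @ w
proj_b = group_b @ w

print("direction:", w)
print("group A projections:", proj_a)
print("group B projections:", proj_b)
```

On these values the two sets of projections do not overlap, which is what "separating the classes" means in one dimension.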

The dataset is hosted on the UCI Machine Learning Repository. It consists of 3 classes of iris:

<img src="https://github.com/Kevin002023/pands-project/blob/main/images/iris-image.png">
- The image above shows the 3 classes of iris in this data set.

It is a multivariate dataset containing information on 150 specimens of iris. There are 5 attributes recorded for each specimen. [These are as follows](https://archive.ics.uci.edu/ml/datasets/iris):

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
- Iris Setosa
- Iris Versicolour
- Iris Virginica

Fisher used this information to identify a method of distinguishing between the classes of iris. It has since been used as a benchmark dataset for machine learning algorithms.

## **Project Outline**

The purpose of this project was to research the Iris dataset, import it into a Jupyter notebook and carry out an analysis of it. The variables are classified and their distributions examined.

## **Software Used**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## **Import of Dataset**

The dataset was imported from its URL using the pandas read_csv() function. This was done so that the code works even when the dataset is not included in the repository. The dataset is called 'data' in the code hereafter.

In [2]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_headings = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']
data = pd.read_csv(data_url, sep=",", names=col_headings)
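The `names` argument matters here because the raw file has no header line. The sketch below illustrates this without any network access: `raw_text` is a hypothetical in-memory stand-in for the first few lines of the remote file, read through `io.StringIO`.

```python
import io

import pandas as pd

# Stand-in for the remote file: a few data lines with no header row.
raw_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

col_headings = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Class']

# With names= supplied (and no header= argument), every line is treated as data.
sample = pd.read_csv(io.StringIO(raw_text), names=col_headings)

# Without names=, pandas uses the first data line as the column labels.
no_names = pd.read_csv(io.StringIO(raw_text))

print(sample.shape)    # all 3 rows are kept
print(no_names.shape)  # one row is lost to the header
```

Passing the column headings explicitly therefore both labels the columns and avoids silently losing the first specimen.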

In [8]:
# Check to make sure it looks as expected. 

data.head()

| | Sepal Length | Sepal Width | Petal Length | Petal Width | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
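Since head() only shows the first five rows, the overall size is better checked with `shape`, alongside the column types and a count of missing values. The sketch below runs these checks on a small stand-in frame built from the five rows shown above; the name `sample` is hypothetical, and on the full dataset the same calls would be made on `data`.

```python
import pandas as pd

# Small stand-in frame built from the five rows of the head() output
# (not the full 150-row dataset).
sample = pd.DataFrame(
    {
        'Sepal Length': [5.1, 4.9, 4.7, 4.6, 5.0],
        'Sepal Width': [3.5, 3.0, 3.2, 3.1, 3.6],
        'Petal Length': [1.4, 1.4, 1.3, 1.5, 1.4],
        'Petal Width': [0.2, 0.2, 0.2, 0.2, 0.2],
        'Class': ['Iris-setosa'] * 5,
    }
)

n_rows, n_cols = sample.shape             # (rows, columns)
missing = int(sample.isna().sum().sum())  # total number of missing cells

print(n_rows, n_cols)
print(sample.dtypes)
print("missing values:", missing)
```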


As expected, the dataset contains 150 samples with 5 different variables. The next step was to get some preliminary statistics for each variable.

In [9]:
# Using Pandas function describe()

data.describe()

| | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
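As a rough guide to what describe() reports, the sketch below recomputes three of its rows by hand on a small stand-in sample (the five sepal lengths from the head() output, not the full column). The names `manual_mean`, `manual_std` and `manual_median` are hypothetical.

```python
import pandas as pd

# Five sepal lengths from the head() output, used as a small stand-in sample.
sepal_length = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name='Sepal Length')

summary = sepal_length.describe()

# 'mean' is the arithmetic mean, 'std' is the sample standard deviation
# (ddof=1), and '50%' is the median.
manual_mean = sepal_length.sum() / len(sepal_length)
manual_std = sepal_length.std(ddof=1)
manual_median = sepal_length.quantile(0.5)

print(summary)
print(manual_mean, manual_std, manual_median)
```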


## Classification of Variables

There are [2](https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch8/5214817-eng.htm) main categories of variables, each with its own subcategories:
- Categorical - A variable that cannot be quantified
  - Nominal - A variable that has distinct categories with no intrinsic order or ranking (e.g. gender, nationality).
  - Ordinal - A variable whose categories have a specific order, allowing comparisons between them (e.g. quality levels: "Bad", "Average", "Good", "Excellent").

- Numeric - A quantifiable characteristic whose values are numbers
  - Continuous - A variable that can assume an infinite number of real values (e.g. temperature or weight, which can take any value between the whole numbers).
  - Discrete - A variable that can only assume a limited set of distinct values, typically whole numbers (e.g. the number of students in a class can be any whole-number count, but it cannot be fractional or represent partial students).
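These categories can be sketched in pandas. The example below is only an illustration with made-up values: `pd.Categorical` with `ordered=True` models an ordinal variable, an unordered categorical models a nominal one, and integer and float Series stand in for discrete and continuous variables.

```python
import pandas as pd

# Nominal: categories with no order, so only equality checks are meaningful.
nationality = pd.Categorical(["Irish", "French", "Irish"])

# Ordinal: labels with an explicit ranking from worst to best.
quality = pd.Categorical(
    ["Good", "Bad", "Excellent"],
    categories=["Bad", "Average", "Good", "Excellent"],
    ordered=True,
)

print(nationality.ordered)           # False
print(quality.ordered)               # True
print(quality > "Average")           # element-wise comparison uses the ranking
print(quality.min(), quality.max())  # lowest and highest ranked labels present

# Numeric: a discrete count versus a continuous measurement.
students_per_class = pd.Series([28, 31, 25])      # whole-number counts
temperature_c = pd.Series([18.4, 19.05, 17.975])  # any real value in a range
print(students_per_class.dtype, temperature_c.dtype)
```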

### Class Variable

In [18]:
data['Class'].describe()

count             150
unique              3
top       Iris-setosa
freq               50
Name: Class, dtype: object

From a quick look at the above values, we can see that "Class" is a nominal categorical variable. There are only 3 possible values: "Iris-setosa", "Iris-versicolor" and "Iris-virginica".

The describe() output above confirms this, reporting 3 unique values across the 150 samples.
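For a categorical column, `nunique()` and `value_counts()` give a fuller picture than describe(), which only reports the most frequent value. The sketch below uses a hypothetical stand-in Series, `class_column`, built with 50 labels per species to match the count of 150 and the frequency of 50 reported above; in the notebook the same calls would be made on `data['Class']`.

```python
import pandas as pd

# Stand-in for data['Class']: 50 labels for each of the 3 species.
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
class_column = pd.Series(
    [name for name in species for _ in range(50)], name="Class"
)

counts = class_column.value_counts()  # number of rows per distinct label

print(class_column.nunique())
print(counts)
```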

### Sepal Length Variable

In [17]:
data['Sepal Length'].describe()

count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: Sepal Length, dtype: float64

### Sepal Width Variable

In [19]:
data['Sepal Width'].describe()

count    150.000000
mean       3.054000
std        0.433594
min        2.000000
25%        2.800000
50%        3.000000
75%        3.300000
max        4.400000
Name: Sepal Width, dtype: float64

### Petal Length Variable

In [21]:
data['Petal Length'].describe()

count    150.000000
mean       3.758667
std        1.764420
min        1.000000
25%        1.600000
50%        4.350000
75%        5.100000
max        6.900000
Name: Petal Length, dtype: float64

### Petal Width Variable

In [20]:
data['Petal Width'].describe()

count    150.000000
mean       1.198667
std        0.763161
min        0.100000
25%        0.300000
50%        1.300000
75%        1.800000
max        2.500000
Name: Petal Width, dtype: float64

The other 4 variables ('Sepal Length', 'Sepal Width', 'Petal Length' and 'Petal Width') are all continuous numeric variables, recorded as decimal values. In principle, each could be measured more precisely by moving to more decimal places. In practice, however, the method used and the accuracy of the measuring instrument restrict the precision of the variable. In this case it was restricted to one decimal place.
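One way to check the one-decimal-place claim is to compare each value with itself rounded to one decimal place. The sketch below does this for a small stand-in sample of sepal lengths; the interval calculation assumes the measurements were rounded to the nearest 0.1 cm, which the source does not state.

```python
import numpy as np

# Sepal lengths from the first rows of the dataset (stand-in for the full column).
sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# A value is recorded to one decimal place if rounding it to 1 d.p. leaves it
# unchanged; np.isclose absorbs floating-point representation error.
one_decimal = np.isclose(sepal_length, np.round(sepal_length, 1)).all()

# Assumption: values were rounded to the nearest 0.1 cm, so each recorded
# value stands for an interval of possible true lengths.
lower, upper = sepal_length - 0.05, sepal_length + 0.05

print(one_decimal)
print(list(zip(lower.round(2), upper.round(2))))
```

This is why the variables are still classed as continuous: the underlying length can take any value in the interval, even though only a limited set of recorded values appears in the data.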

***

## End