# Fundamentals of Data Analysis - Project 

**Author: Cecilia Pastore**

***


### 1. Introduction

<details>
    <summary>Task requested:</summary>
           <p>
           
Project

• The project is to create a notebook investigating the variables and data points within the well-known iris flower data set associated with Ronald A Fisher.

• In the notebook, you should discuss the classification of each variable within the data set according to common variable types and scales of measurement in mathematics, statistics, and Python.

• Select, demonstrate, and explain the most appropriate summary statistics to describe each variable.

• Select, demonstrate, and explain the most appropriate plot(s) for each variable.

• The notebook should follow a cohesive narrative about the data set.

</p>
</details>


### 2. Iris Flower Dataset

The iris flower dataset is a well-known dataset in the field of data science, and it is often used to illustrate basic data analysis, visualization techniques, and machine learning. The Iris dataset is often described as the "Hello World" of data science.

<div>
  <center><img src="https://machinelearninghd.com/wp-content/uploads/2021/03/iris-dataset.png" width="600"></center>
</div>

<div>
  <center><a href="https://machinelearninghd.com/iris-dataset-uci-machine-learning-repository-project/"><i>[Fig. 1] - Iris Flowers </i></a></center>
</div>

The dataset contains measurements of the sepal length, sepal width, petal length, and petal width for 150 iris flowers belonging to three different species: Iris setosa, Iris versicolor, and Iris virginica. Fifty samples were collected for each species, and the dataset contains no missing values.

<div>
  <center><img src="https://www.bogotobogo.com/python/scikit-learn/images/features/iris-data-set.png" width="400"></center>
</div>

<div>
  <center><a href="https://www.bogotobogo.com/python/scikit-learn/scikit_machine_learning_features_extraction.php"><i>[Fig. 2] - Iris Flowers measurement </i></a></center>
</div>

The iris flower dataset was originally collected by the botanist Edgar Anderson in the 1930s and later popularized by the statistician Ronald Fisher in his seminal 1936 paper on discriminant analysis. The dataset is sometimes referred to as Anderson's Iris dataset, as Anderson collected it to quantify the morphological variation among three closely related Iris species. Two of the three species were collected from the same pasture on the Gaspé Peninsula, on the same day, and measured at the same time by the same person using the same apparatus [[]](https://rpubs.com/AjinkyaUC/Iris_DataSet).

Since then, the dataset has become a classic example in the field of data science, and it is frequently used in introductory courses and textbooks. It is a simple and well-understood dataset, making it a popular choice for teaching and research purposes, providing an excellent opportunity to explore different aspects of data analysis and visualization.

The dataset is widely available and can be accessed freely on the UCI website [[]](https://archive.ics.uci.edu/dataset/53/iris).

### 2. Import the Dataset

#### Import the dataset into a pandas DataFrame

The dataset first needs to be imported. For this project I decided to follow the UCI website's suggestion for importing the dataset in Python [[]](https://archive.ics.uci.edu/dataset/53/iris).


In [1]:
import pandas as pd

# https://www.angela1c.com/projects/iris_project/downloading-iris/
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# the file has no header row, so use the attribute information as the column names
col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
iris = pd.read_csv(csv_url, names=col_names)
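
As a small offline illustration of how `names=` behaves, the sketch below reads a few lines in the same comma-separated, header-less layout from an in-memory string instead of the URL. The three rows are the first three Setosa records of the dataset; `raw_text` and `sample` are names made up for this example and are not part of the notebook.

```python
import io
import pandas as pd

# Three rows in the same layout as iris.data: no header line, comma-separated
raw_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

col_names = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

# names= supplies the column labels, so the first data row is not consumed as a header
sample = pd.read_csv(io.StringIO(raw_text), names=col_names)

print(sample.shape)
print(sample.columns.tolist())
```

Without `names=`, pandas would treat the first row as the header, and one flower would be lost from the data.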

[[]](https://datagy.io/pandas-replace-values/)

In [2]:
# replace the 3 species labels with friendlier names
# (assigning the result back avoids relying on an in-place change to a column selection)
iris["Species"] = iris["Species"].replace({
    "Iris-setosa": "Setosa",
    "Iris-versicolor": "Versicolor",
    "Iris-virginica": "Virginica",
})
iris



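|     | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width | Species   |
|-----|--------------|-------------|--------------|-------------|-----------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | Setosa    |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | Setosa    |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | Setosa    |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | Setosa    |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | Setosa    |
| ... | ...          | ...         | ...          | ...         | ...       |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | Virginica |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | Virginica |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | Virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | Virginica |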
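
The same renaming can also be done with string methods rather than one replacement per label. The following is only a sketch of that alternative, not the approach used in this notebook; `raw_species` is a hypothetical three-element series holding the raw labels from the UCI file.

```python
import pandas as pd

# Hypothetical mini-series holding the raw labels used in the UCI file
raw_species = pd.Series(["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

# Drop the literal "Iris-" prefix, then capitalise the first letter
friendly = raw_species.str.replace("Iris-", "", regex=False).str.capitalize()

print(friendly.tolist())
```

This version would also handle any additional `Iris-` label without being listed explicitly, at the cost of being less explicit about which values are expected.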


### 3. Exploring and Cleaning the Dataset

Once the dataset has been imported and formatted as needed, we need to explore it and prepare it for analysis.

https://www.geeksforgeeks.org/python-pandas-dataframe-series-head-method/

In [4]:
# check the first 5 lines of the dataset to confirm the format is as expected
print("==== First 5 lines of the dataset ==== \n \n")
print(str(iris.head())+'\n \n')

==== First 5 lines of the dataset ==== 
 

   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width Species
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa
 



https://www.geeksforgeeks.org/exploratory-data-analysis-on-iris-dataset/

In [None]:
# print the unique values in the Species column to check for duplicate labels and that the replacement worked
print("==== Unique values of Species ==== \n \n")
unique_species = pd.unique(iris['Species'])
print("Species\n \n")
for species in unique_species:
    print(species)
print("\n")

In [34]:
iris.groupby('Species').count().reset_index()

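|   | Species    | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width |
|---|------------|--------------|-------------|--------------|-------------|
| 0 | Setosa     | 50           | 50          | 50           | 50          |
| 1 | Versicolor | 50           | 50          | 50           | 50          |
| 2 | Virginica  | 50           | 50          | 50           | 50          |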


https://www.geeksforgeeks.org/exploratory-data-analysis-on-iris-dataset/

In [39]:
# check for missing values
print("==== Checking missing values ==== \n \n")
print(str(iris.isnull().sum())+'\n \n')

==== Checking missing values ==== 
 

Sepal_Length    0
Sepal_Width     0
Petal_Length    0
Petal_Width     0
Species         0
dtype: int64
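
Because every count above is zero, it may help to see what the same check reports when values really are missing. The sketch below uses a hypothetical three-row frame (`demo`) with two gaps introduced on purpose; it is not iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing measurement and one missing label
demo = pd.DataFrame({
    "Sepal_Length": [5.1, np.nan, 4.7],
    "Species": ["Setosa", "Setosa", None],
})

# isnull() marks each missing cell as True; sum() counts the True values per column
missing_per_column = demo.isnull().sum()

print(missing_per_column)
```

Both `NaN` and `None` are counted, so the check covers numeric and text columns alike.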
 



In [38]:
# shape of the dataset
print("==== Shape of the dataset ==== \n \n")
print("Number of rows: {}\n".format(iris.shape[0]))
print("Number of columns: {}\n".format(iris.shape[1]))
print("Size: {}\n".format(iris.size))
print("Columns: {}\n\n".format(", ".join(iris.columns)))
print(str(iris.value_counts("Species"))+'\n\n')

==== Shape of the dataset ==== 
 

Number of rows: 150

Number of columns: 5

Size: 750

Columns: Sepal_Length, Sepal_Width, Petal_Length, Petal_Width, Species


Species
Setosa        50
Versicolor    50
Virginica     50
Name: count, dtype: int64




In [47]:
# get the data type 
print("==== Data type ==== \n")
print("Attribute \t Type \n")
print(str(iris.dtypes)+'\n\n') 

==== Data type ==== 

Attribute 	 Type 

Sepal_Length    float64
Sepal_Width     float64
Petal_Length    float64
Petal_Width     float64
Species          object
dtype: object
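
The `object` type reported for `Species` only says that the column holds text. Since species is a nominal (unordered) categorical variable, one option is to store it with the pandas `category` dtype. The sketch below shows this on a hypothetical four-row sample (`sample`); it is an optional step and is not applied in this notebook.

```python
import pandas as pd

# Hypothetical four-row sample with values taken from the Setosa and Virginica rows
sample = pd.DataFrame({
    "Petal_Length": [1.4, 1.4, 5.2, 5.0],
    "Species": ["Setosa", "Setosa", "Virginica", "Virginica"],
})

# Convert the text column to an unordered categorical
sample["Species"] = sample["Species"].astype("category")

print(sample.dtypes)
print(sample["Species"].cat.categories.tolist())
```

The categories are unordered by default, which matches the nominal scale of the variable: the species can be told apart but not ranked.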




#### Statistics

In [6]:
# get statistics 
print("==== Statistics ==== \n \n")
print(str(iris.describe())+'\n\n') 

==== Statistics ==== 
 

       Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
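
Note that `describe()` on the whole frame summarises only the four numeric columns. For a text column such as `Species`, calling `describe()` on the column itself returns counts instead of means and quartiles. The sketch below uses a hypothetical nine-label series (`species`), deliberately unbalanced so that the most frequent label is unambiguous.

```python
import pandas as pd

# Hypothetical, deliberately unbalanced set of labels
species = pd.Series(["Setosa"] * 5 + ["Virginica"] * 4, name="Species")

# For non-numeric data, describe() reports count, unique, top and freq
summary = species.describe()

print(summary)
```

In the full dataset each species appears 50 times, so `top` would simply be one of the three tied labels and `freq` would be 50.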




https://learnpython.com/blog/how-to-summarize-data-in-python/
https://www.w3schools.com/python/pandas/ref_df_agg.asp


In [7]:
# get a more user-friendly view of the statistics
# loop over the numeric columns and print, for each species, the mean, median, std, min and max
for column in iris.columns[:-1]:
    print(f"==== {column} Statistics ====\n\n")
    print(f"{iris.groupby('Species')[column].agg(['mean', 'median', 'std', 'min', 'max'])}\n\n")

==== Sepal_Length Statistics ====


             mean  median       std  min  max
Species                                      
Setosa      5.006     5.0  0.352490  4.3  5.8
Versicolor  5.936     5.9  0.516171  4.9  7.0
Virginica   6.588     6.5  0.635880  4.9  7.9


==== Sepal_Width Statistics ====


             mean  median       std  min  max
Species                                      
Setosa      3.418     3.4  0.381024  2.3  4.4
Versicolor  2.770     2.8  0.313798  2.0  3.4
Virginica   2.974     3.0  0.322497  2.2  3.8


==== Petal_Length Statistics ====


             mean  median       std  min  max
Species                                      
Setosa      1.464    1.50  0.173511  1.0  1.9
Versicolor  4.260    4.35  0.469911  3.0  5.1
Virginica   5.552    5.55  0.551895  4.5  6.9


==== Petal_Width Statistics ====


             mean  median       std  min  max
Species                                      
Setosa      0.244     0.2  0.107210  0.1  0.6
Versicolor  1.326     1.

https://stackoverflow.com/questions/55009203/how-does-pandas-calculate-quartiles

In [9]:
# get quartile per species 
print("==== Quartiles per species ==== \n \n")
print(str(iris.groupby('Species').quantile([0.25, 0.50, 0.75]))+'\n\n')   

==== Quartiles per species ==== 
 

                 Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
Species                                                              
Setosa     0.25         4.800        3.125         1.400          0.2
           0.50         5.000        3.400         1.500          0.2
           0.75         5.200        3.675         1.575          0.3
Versicolor 0.25         5.600        2.525         4.000          1.2
           0.50         5.900        2.800         4.350          1.3
           0.75         6.300        3.000         4.600          1.5
Virginica  0.25         6.225        2.800         5.100          1.8
           0.50         6.500        3.000         5.550          2.0
           0.75         6.900        3.175         5.875          2.3
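
A common measure of spread derived from these quartiles is the interquartile range, IQR = Q3 - Q1. The sketch below computes it for `Petal_Length` using the 0.25 and 0.75 values printed in the table above; the frame `quartiles` and its column names `q25`, `q75` and `iqr` are introduced here only for illustration.

```python
import pandas as pd

# Petal_Length quartiles copied from the table above
quartiles = pd.DataFrame(
    {"q25": [1.400, 4.000, 5.100], "q75": [1.575, 4.600, 5.875]},
    index=["Setosa", "Versicolor", "Virginica"],
)

# Interquartile range: the width of the middle 50% of the values
quartiles["iqr"] = quartiles["q75"] - quartiles["q25"]

print(quartiles.round(3))
```

Setosa has by far the narrowest range (about 0.175), which is consistent with its small standard deviation for petal length in the per-species statistics.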




***

### End