# Background
The Iris dataset contains four features (the length and width of the sepals and petals) for 50 samples from each of three Iris species (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total. These measurements were used to create a linear discriminant model to classify the species. The dataset is widely used in data mining, classification and clustering examples and for testing algorithms.

## Data Description

Petal Length - in cm

Petal Width - in cm

Sepal Length - in cm

Sepal Width - in cm

Species - Setosa, Versicolor, and Virginica

### Getting Started with Pandas:

In [5]:
import pandas as pd 

### Load the dataset
The Iris flower data set, also known as Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. It is one of the most widely used datasets among people learning machine learning.

The dataset is available in the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/datasets/iris). Here we load it through the seaborn library, whose `load_dataset` function returns it as a pandas DataFrame.

In [4]:
import pandas as pd
import seaborn
data = seaborn.load_dataset("iris")
type(data)



pandas.core.frame.DataFrame

### Check if the dataset has been loaded correctly

In [6]:
data.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [8]:
# Looking at the last few rows of the data frame
data.tail()

,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


We can save this dataset to our local system as a csv file.

### Export dataframe as csv

In [9]:
data.to_csv('iris.csv', index=False) # Saves the file in the same folder that contains the notebook
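
To check that an export like this keeps the data intact, the file can be read back and compared with the original. The sketch below is only an illustration: it uses a three-row stand-in frame (values copied from the output above) and an in-memory buffer instead of `iris.csv`, and the names `sample`, `buffer` and `restored` are assumptions, not part of the notebook.

```python
import io

import pandas as pd

# Three rows copied from the iris output above; a stand-in for the full frame
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.2],
    "species": ["setosa", "setosa", "versicolor"],
})

# Write to an in-memory buffer instead of 'iris.csv' so nothing touches the disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False leaves the row labels out of the file
csv_text = buffer.getvalue()

# Read the CSV text back and compare it with the original frame
restored = pd.read_csv(io.StringIO(csv_text))
round_trip_ok = restored.equals(sample)

print(csv_text)
print("Round trip preserved the data:", round_trip_ok)
```

Without `index=False`, the row labels would be written as an extra unnamed first column, which then shows up as an additional column when the file is read back.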

Let us now look at the data itself.

### Displaying a random sample of rows

In [10]:
data.sample(10) 

,sepal_length,sepal_width,petal_length,petal_width,species
129,7.2,3.0,5.8,1.6,virginica
68,6.2,2.2,4.5,1.5,versicolor
54,6.5,2.8,4.6,1.5,versicolor
97,6.2,2.9,4.3,1.3,versicolor
105,7.6,3.0,6.6,2.1,virginica
75,6.6,3.0,4.4,1.4,versicolor
134,6.1,2.6,5.6,1.4,virginica
18,5.7,3.8,1.7,0.3,setosa
52,6.9,3.1,4.9,1.5,versicolor
26,5.0,3.4,1.6,0.4,setosa
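
Because `sample` draws rows at random, rerunning the cell will usually show different rows. If a repeatable draw is wanted, the `random_state` parameter can fix the seed. The following is a minimal sketch on a hypothetical 20-row frame (`frame` and the seed value 0 are arbitrary choices for illustration):

```python
import pandas as pd

# Hypothetical stand-in frame with 20 rows
frame = pd.DataFrame({"value": range(20)})

# The same seed gives the same rows on every run
first_draw = frame.sample(5, random_state=0)
second_draw = frame.sample(5, random_state=0)
same_rows = first_draw.index.equals(second_draw.index)

# frac=1 returns every row exactly once, in shuffled order
shuffled = frame.sample(frac=1, random_state=0)

print("Identical draws:", same_rows)
print("Rows after shuffling:", len(shuffled))
```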


### Check out the shape of the dataset

In [11]:
data.shape

(150, 5)

The dataset has 150 rows of observations and 5 columns.

### Slicing the rows
To print or work on a particular group of rows, for example the rows at index 10 through 20, slice the DataFrame.

In [12]:
# data[start:end] selects rows by position
# start is inclusive, whereas end is exclusive
print(data[10:21])
# prints the rows at positions 10 through 20

# the slice can also be saved in a variable for further analysis
sliced_data = data[10:21]
print(sliced_data)

    sepal_length  sepal_width  petal_length  petal_width species
10           5.4          3.7           1.5          0.2  setosa
11           4.8          3.4           1.6          0.2  setosa
12           4.8          3.0           1.4          0.1  setosa
13           4.3          3.0           1.1          0.1  setosa
14           5.8          4.0           1.2          0.2  setosa
15           5.7          4.4           1.5          0.4  setosa
16           5.4          3.9           1.3          0.4  setosa
17           5.1          3.5           1.4          0.3  setosa
18           5.7          3.8           1.7          0.3  setosa
19           5.1          3.8           1.5          0.3  setosa
20           5.4          3.4           1.7          0.2  setosa
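    sepal_length  sepal_width  petal_length  petal_width species
10           5.4          3.7           1.5          0.2  setosa
11           4.8          3.4           1.6          0.2  setosa
12           4.8          3.0           1.4          0.1  setosa
13           4.3          3.0           1.1          0.1  setosa
14           5.8          4.0           1.2          0.2  setosa
15           5.7          4.4           1.5          0.4  setosa
16           5.4          3.9           1.3          0.4  setosa
17           5.1          3.5           1.4          0.3  setosa
18           5.7          3.8           1.7          0.3  setosa
19           5.1          3.8           1.5          0.3  setosa
20           5.4          3.4           1.7          0.2  setosa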
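
Plain bracket slicing works by position. pandas also offers the explicit indexers `iloc` (position based, end exclusive) and `loc` (label based, end inclusive). The sketch below compares them on a hypothetical 30-row frame; the names `frame` and `reordered` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in frame with the default 0..29 index
frame = pd.DataFrame({"value": [v * 10 for v in range(30)]})

by_brackets = frame[10:21]       # positions 10..20, end exclusive
by_position = frame.iloc[10:21]  # same rows, explicitly by position
by_label = frame.loc[10:20]      # labels 10..20, end inclusive

print(len(by_brackets), len(by_position), len(by_label))

# Once the rows are reordered, position and label no longer coincide
reordered = frame.sort_values("value", ascending=False)
first_by_position = reordered.iloc[0, 0]   # first row as displayed
row_labelled_zero = reordered.loc[0, "value"]  # row whose label is 0
print(first_by_position, row_labelled_zero)
```

With the default index the three forms return the same 11 rows; the difference only matters after sorting, filtering or setting a custom index.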

### Displaying only specific columns

In [13]:
# Select the petal_width and species columns from the iris data
# and save them in another variable named "specific_data"

specific_data = data[["petal_width", "species"]]
# general form: data[["column_name1", "column_name2", "column_name3"]]

# now print the first 10 rows of the specific_data dataframe
print(specific_data.head(10))

   petal_width species
0          0.2  setosa
1          0.2  setosa
2          0.2  setosa
3          0.2  setosa
4          0.2  setosa
5          0.4  setosa
6          0.3  setosa
7          0.2  setosa
8          0.2  setosa
9          0.1  setosa


In [14]:
data['petal_width']

0      0.2
1      0.2
2      0.2
3      0.2
4      0.2
      ... 
145    2.3
146    1.9
147    2.0
148    2.3
149    1.8
Name: petal_width, Length: 150, dtype: float64
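
Note the difference between the two selections above: a list of column names returns a DataFrame, while a single column name returns a Series. A small sketch on a three-row stand-in frame (the variable names are illustrative only):

```python
import pandas as pd

# Three rows copied from the iris output above
frame = pd.DataFrame({
    "petal_width": [0.2, 0.2, 1.5],
    "species": ["setosa", "setosa", "versicolor"],
})

as_series = frame["petal_width"]   # single name: one-dimensional Series
as_frame = frame[["petal_width"]]  # list of names: two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```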

### Calculating sum, mean, median and mode of a particular column

In [15]:
data.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [20]:
# data["column_name"].sum() 
  
sum_data = data['sepal_length'].sum() 
mean_data = data['sepal_length'].mean() 
median_data = data['sepal_length'].median() 
mode_data = data['sepal_length'].mode() 
  
print("Sum:",sum_data, "\nMean:", mean_data, "\nMedian:",median_data, "\nMode:",mode_data) 

Sum: 876.5 
Mean: 5.843333333333335 
Median: 5.8 
Mode: 0    5.0
dtype: float64
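
The mode is printed with an index and a dtype because `mode()` returns a Series rather than a single number: when several values tie for the highest count, all of them are returned in ascending order. A minimal sketch with made-up values (not taken from the dataset):

```python
import pandas as pd

# Made-up values in which 5.0 and 5.1 both occur twice
lengths = pd.Series([5.0, 5.0, 5.1, 5.1, 5.8], name="sepal_length")

modes = lengths.mode()       # Series holding every tied value
all_modes = modes.tolist()   # plain Python list of the tied values
first_mode = modes.iloc[0]   # a single value, if one number is enough

print(all_modes)
print(first_mode)
```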


### Calculating sum, mean, median and mode for a particular species

In [21]:
data.species.unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [16]:
data.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

In [19]:
# A Series has no `value` attribute; value_counts() counts the rows per species
data.species.value_counts()

Each of the three species has 50 observations.

In [17]:
# Sepal lengths of the rows where species == 'setosa'
setosa_sepal_length = data.loc[data['species'] == 'setosa', 'sepal_length']

sum_data_setosa = setosa_sepal_length.sum()
mean_data_setosa = setosa_sepal_length.mean()
median_data_setosa = setosa_sepal_length.median()
mode_data_setosa = setosa_sepal_length.mode()  # a Series; tied values are all returned

print("Sum:", sum_data_setosa, "\nMean:", mean_data_setosa, "\nMedian:", median_data_setosa, "\nMode:", mode_data_setosa)

Sum: 250.3 
Mean: 5.005999999999999 
Median: 5.0 
Mode: 0    5.0
1    5.1
dtype: float64


The `groupby` function is very helpful when we want to analyse this kind of per-group information.
Try it on this dataset for practice.

We will discuss `groupby` and several other data manipulation functions in more detail in the next session.

In [20]:
data.groupby('species').mean()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,5.006,3.428,1.462,0.246
versicolor,5.936,2.77,4.26,1.326
virginica,6.588,2.974,5.552,2.026


In [22]:
# Per-species sum, mean, median and most frequent value (the first entry of value_counts)
stats = data.groupby('species').agg(['sum', 'mean', 'median', lambda x: x.value_counts().index[0]])
stats

,sepal_length,sepal_length,sepal_length,sepal_length,sepal_width,sepal_width,sepal_width,sepal_width,petal_length,petal_length,petal_length,petal_length,petal_width,petal_width,petal_width,petal_width
species,sum,mean,median,<lambda_0>,sum,mean,median,<lambda_0>,sum,mean,median,<lambda_0>,sum,mean,median,<lambda_0>
setosa,250.3,5.006,5.0,5.1,171.4,3.428,3.4,3.4,73.1,1.462,1.5,1.4,12.3,0.246,0.2,0.2
versicolor,296.8,5.936,5.9,5.5,138.5,2.77,2.8,3.0,213.0,4.26,4.35,4.5,66.3,1.326,1.3,1.3
virginica,329.4,6.588,6.5,6.3,148.7,2.974,3.0,3.0,277.6,5.552,5.55,5.1,101.3,2.026,2.0,1.8
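
The `<lambda_0>` column label is not very descriptive. One possible alternative is pandas named aggregation, where each output column gets an explicit name. The sketch below applies it to a single column of a small stand-in frame (sepal lengths copied from rows shown earlier); the helper `first_mode` and the output column names are assumptions made for illustration. Note that `first_mode` breaks ties by taking the smallest tied value, which can differ from the `value_counts().index[0]` approach used above.

```python
import pandas as pd

# Small stand-in frame; sepal lengths copied from rows shown earlier
frame = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 6.2, 6.5, 6.2, 7.2, 5.9],
    "species": ["setosa", "setosa", "setosa",
                "versicolor", "versicolor", "versicolor",
                "virginica", "virginica"],
})


def first_mode(values):
    """Return the smallest of the most frequent values (hypothetical helper)."""
    return values.mode().iloc[0]


# Named aggregation: keyword = output column name, value = aggregation to apply
summary = frame.groupby("species")["sepal_length"].agg(
    total="sum",
    average="mean",
    middle="median",
    mode=first_mode,
)

print(summary)
```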


In [24]:
data.petal_length.value_counts()

1.5    13
1.4    13
5.1     8
4.5     8
1.3     7
1.6     7
5.6     6
4.0     5
4.9     5
4.7     5
4.8     4
1.7     4
4.4     4
4.2     4
5.0     4
4.1     3
5.5     3
4.6     3
6.1     3
5.7     3
3.9     3
5.8     3
1.2     2
1.9     2
6.7     2
3.5     2
5.9     2
6.0     2
5.4     2
5.3     2
3.3     2
4.3     2
5.2     2
6.3     1
1.1     1
6.4     1
3.6     1
3.7     1
3.0     1
3.8     1
6.6     1
6.9     1
1.0     1
Name: petal_length, dtype: int64