# Iris

### Import _pandas_

In [1]:
import pandas as pd

NumPy is used alongside pandas, so it's convenient to import it as `np` as well (the old `pd.np` alias is deprecated and has been removed from recent pandas releases):

In [2]:
import numpy as np

### Import the dataset: read it into a pandas DataFrame and assign it to df.

In [3]:
?pd.read_csv

In [4]:
df = pd.read_csv('Iris_data.csv')

### Have a look at the first 5 rows of df using the head method

In [5]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1.0,5.1,3.5,1.4,0.2,Iris-setosa
1,2.0,4.9,3.0,1.4,0.2,Iris-setosa
2,3.0,4.7,3.2,1.3,0.2,Iris-setosa
3,4.0,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,5.0,3.6,1.4,0.2,Iris-setosa


### Have a look at the last 3 rows of df using the tail method

In [6]:
?pd.DataFrame.tail

In [7]:
df.tail(3)     # or df.tail(n=3)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
151,151.0,6.5,3.0,5.2,2.0,Iris-virginica
152,152.0,6.2,3.4,5.4,2.3,Iris-virginica
153,153.0,5.9,3.0,5.1,1.8,Iris-virginica


### Have a look at the size of the dataset
**The first number is the number of rows, the second one the number of columns.**

In [8]:
df.shape

(154, 6)

**What is the number of observations in df?**

In [9]:
df.shape[0]

154

**What is the number of columns in the dataset?**

In [10]:
df.shape[1]

6

### Get the names of the columns and info about them (number of non-null values and dtype)

In [11]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154 entries, 0 to 153
Data columns (total 6 columns):
Id               153 non-null float64
SepalLengthCm    153 non-null float64
SepalWidthCm     152 non-null float64
PetalLengthCm    152 non-null float64
PetalWidthCm     152 non-null float64
Species          153 non-null object
dtypes: float64(5), object(1)
memory usage: 7.3+ KB


**We can also get a list of the column names:**

In [12]:
df.columns.tolist()

['Id',
 'SepalLengthCm',
 'SepalWidthCm',
 'PetalLengthCm',
 'PetalWidthCm',
 'Species']

### Force pandas to display fewer rows

In [13]:
pd.options.display.max_rows

60

In [14]:
pd.options.display.max_rows = 25
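
Assigning to `pd.options.display.max_rows` changes the setting for the rest of the session. As a minimal sketch (the value 10 is arbitrary), `pd.option_context` applies a setting only inside a `with` block and restores the previous value afterwards:

```python
import pandas as pd

# Remember whatever the current setting is
default_rows = pd.get_option("display.max_rows")

# The new value only applies inside the with block
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")

# Back to the previous value once the block is left
after = pd.get_option("display.max_rows")

print(inside, after == default_rows)
```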

___
## _Subsetting_
We can subset a DataFrame by label, by integer position, or by a combination of both.  
There are different ways to do it: `.loc`, `.iloc` and plain `[ ]`. See the documentation:  
https://pandas.pydata.org/pandas-docs/stable/indexing.html

**Let's have a look at the 'SepalLengthCm' column:**

In [15]:
df['SepalLengthCm']   # we could also use df.SepalLengthCm

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
5      5.4
6      4.6
7      5.0
8      4.4
9      4.9
10     5.4
11     4.8
      ... 
142    6.0
143    6.9
144    6.7
145    6.9
146    5.8
147    6.8
148    6.7
149    6.7
150    6.3
151    6.5
152    6.2
153    5.9
Name: SepalLengthCm, Length: 154, dtype: float64

**Then at the observation at position 12 (the 13th row, since positions start at 0):**

In [16]:
df.iloc[12]    # .iloc uses positions ("i" stands for integer)

Id                        12
SepalLengthCm            4.8
SepalWidthCm             3.4
PetalLengthCm            1.6
PetalWidthCm             0.2
Species          Iris-setosa
Name: 12, dtype: object

In [17]:
df.loc[12]    # .loc uses index labels (here the labels happen to match the positions)

Id                        12
SepalLengthCm            4.8
SepalWidthCm             3.4
PetalLengthCm            1.6
PetalWidthCm             0.2
Species          Iris-setosa
Name: 12, dtype: object
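
`.iloc[12]` and `.loc[12]` return the same row here only because the default index labels coincide with the row positions. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the Iris file) whose labels differ from the positions, so the two accessors can be told apart:

```python
import pandas as pd

# Toy frame whose index labels are deliberately NOT 0, 1, 2
toy = pd.DataFrame(
    {"PetalLengthCm": [1.4, 4.7, 6.0]},
    index=[10, 20, 30],
)

by_position = toy.iloc[1]   # second row, whatever its label is
by_label = toy.loc[20]      # the row whose index label is 20

print(by_position["PetalLengthCm"], by_label["PetalLengthCm"])

# toy.loc[1] would raise a KeyError: no row carries the label 1
```

This matters after filtering or dropping rows, because the remaining labels are no longer consecutive.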

In [18]:
df[10:15]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
10,11.0,5.4,3.7,1.5,0.2,Iris-setosa
11,12.0,4.8,3.4,1.6,0.2,Iris-setosa
12,12.0,4.8,3.4,1.6,0.2,Iris-setosa
13,13.0,4.8,3.0,1.4,0.1,Iris-setosa
14,14.0,4.3,3.0,1.1,0.1,Iris-setosa


**Next, the 'SepalLengthCm' of the last three observations:**

In [19]:
df.iloc[-3:, 1] 

151    6.5
152    6.2
153    5.9
Name: SepalLengthCm, dtype: float64

In [20]:
df.loc[151:, 'SepalLengthCm']   

151    6.5
152    6.2
153    5.9
Name: SepalLengthCm, dtype: float64

**And finally look at the PetalLengthCm and PetalWidthCm of the 146th, the 8th and the 1st observations:**

In [21]:
df.loc[[145, 7, 0], ['PetalLengthCm', 'PetalWidthCm']]

Unnamed: 0,PetalLengthCm,PetalWidthCm
145,5.1,2.3
7,1.5,0.2
0,1.4,0.2


In [22]:
df.iloc[[145, 7, 0], [3, -2]]   # column positions 3 and -2 are PetalLengthCm and PetalWidthCm

Unnamed: 0,PetalLengthCm,PetalWidthCm
145,5.1,2.3
7,1.5,0.2
0,1.4,0.2


**!!WARNING!!**  Unlike Python slicing and ``.iloc``, a slice passed to ``.loc`` **includes** its end label.

In [23]:
df.loc[5:10]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
5,6.0,5.4,3.9,1.7,0.4,Iris-setosa
6,7.0,4.6,3.4,1.4,0.3,Iris-setosa
7,8.0,5.0,3.4,1.5,0.2,Iris-setosa
8,9.0,4.4,2.9,1.4,0.2,Iris-setosa
9,10.0,4.9,3.1,1.5,0.1,Iris-setosa
10,11.0,5.4,3.7,1.5,0.2,Iris-setosa


In [24]:
df.iloc[5:10]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
5,6.0,5.4,3.9,1.7,0.4,Iris-setosa
6,7.0,4.6,3.4,1.4,0.3,Iris-setosa
7,8.0,5.0,3.4,1.5,0.2,Iris-setosa
8,9.0,4.4,2.9,1.4,0.2,Iris-setosa
9,10.0,4.9,3.1,1.5,0.1,Iris-setosa
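
The difference between the two slices can be checked by counting rows. This is a small sketch on a made-up 20-row frame (`toy` is an illustrative name) with the default integer index:

```python
import pandas as pd

toy = pd.DataFrame({"value": range(20)})

label_slice = toy.loc[5:10]      # labels 5 to 10, end label included
position_slice = toy.iloc[5:10]  # positions 5 to 9, end position excluded
plain_slice = toy[5:10]          # an integer slice with [] is positional, like .iloc

print(len(label_slice), len(position_slice), len(plain_slice))
```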


**We can also use condition(s) to filter.**  

Display the rows of df where **PetalWidthCm** is greater than 2.

In [25]:
df[df['PetalWidthCm'] > 2]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
102,102.0,6.3,3.3,6.0,2.5,Iris-virginica
104,104.0,7.1,3.0,5.9,2.1,Iris-virginica
106,106.0,6.5,3.0,5.8,2.2,Iris-virginica
107,107.0,7.6,3.0,6.6,2.1,Iris-virginica
111,111.0,7.2,3.6,6.1,2.5,Iris-virginica
115,115.0,6.8,3.0,5.5,2.1,Iris-virginica
117,117.0,5.8,2.8,5.1,2.4,Iris-virginica
118,118.0,6.4,3.2,5.3,2.3,Iris-virginica
120,120.0,7.7,3.8,6.7,2.2,Iris-virginica
121,121.0,7.7,2.6,6.9,2.3,Iris-virginica


Display the rows of df where **PetalWidthCm** is greater than 2 and **PetalLengthCm** is less than 5.5.

In [26]:
df[(df['PetalWidthCm'] > 2) & (df['PetalLengthCm'] < 5.5)]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
117,117.0,5.8,2.8,5.1,2.4,Iris-virginica
118,118.0,6.4,3.2,5.3,2.3,Iris-virginica
143,143.0,6.9,3.1,5.4,2.1,Iris-virginica
145,145.0,6.9,3.1,5.1,2.3,Iris-virginica
149,149.0,6.7,3.0,5.2,2.3,Iris-virginica
152,152.0,6.2,3.4,5.4,2.3,Iris-virginica
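
A condition such as `df['PetalWidthCm'] > 2` is itself a boolean Series with one value per row, and `[ ]` keeps the rows where it is `True`. The sketch below, on made-up values, shows how such masks combine with `&` (and), `|` (or) and `~` (not). The parentheses in the cell above are required because `&` binds more tightly than `>` and `<` in Python.

```python
import pandas as pd

toy = pd.DataFrame({
    "PetalLengthCm": [5.1, 6.0, 5.3, 1.4],
    "PetalWidthCm": [2.4, 2.5, 1.8, 0.2],
})

wide = toy["PetalWidthCm"] > 2       # boolean Series, one value per row
short = toy["PetalLengthCm"] < 5.5

both = toy[wide & short]     # rows satisfying both conditions
either = toy[wide | short]   # rows satisfying at least one condition
not_wide = toy[~wide]        # rows where the first condition is False

print(both.index.tolist(), either.index.tolist(), not_wide.index.tolist())
```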


___
## Let's clean here and there

**Get the number of unique values from the Species column.**

In [27]:
df['Species'].nunique()

3

**Get the proportion for these values.**

In [28]:
df['Species'].value_counts(normalize = True)

Iris-versicolor    0.339869
Iris-setosa        0.333333
Iris-virginica     0.326797
Name: Species, dtype: float64
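
By default `value_counts` ignores missing values, so the proportions above are computed over the non-null species only. A minimal sketch on a made-up Series shows the effect of the `normalize` and `dropna` parameters:

```python
import pandas as pd

species = pd.Series(["Iris-setosa", "Iris-setosa", "Iris-virginica", None])

counts = species.value_counts()                # absolute counts, missing value ignored
shares = species.value_counts(normalize=True)  # proportions of the 3 non-null values
shares_with_nan = species.value_counts(normalize=True, dropna=False)  # proportions of all 4 rows

print(counts["Iris-setosa"], shares["Iris-setosa"], shares_with_nan.iloc[0])
```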

### NaN

**Find the number of observations with NaN in the SepalLengthCm column.**

In [29]:
df['SepalLengthCm'].isnull().sum()

1

In [30]:
df['PetalLengthCm'].isnull().sum()

2

**Find the indexes of the observations without PetalLengthCm.**

In [31]:
df[df['PetalLengthCm'].isnull()].index

Int64Index([112, 138], dtype='int64')

**Use the `dropna` method to remove the row which only has NaN values.**

In [32]:
df = df.dropna(how = 'all')

In [33]:
df.shape   # checking the shape of df

(153, 6)

**Use the `dropna` method to remove the rows with at least one NaN value.**

In [34]:
df = df.dropna(how = 'any')

In [35]:
df.shape   # checking the shape of df

(151, 6)
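
The two `how` settings are easier to compare on a tiny frame. The sketch below uses made-up values and also shows the `subset` parameter, which restricts the check to chosen columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7, np.nan],
    "PetalLengthCm": [1.4, np.nan, np.nan, 1.5],
})

no_empty_rows = toy.dropna(how="all")             # drops row 1, where every value is NaN
complete_rows = toy.dropna(how="any")             # keeps row 0, the only row without NaN
has_petal = toy.dropna(subset=["PetalLengthCm"])  # only looks at PetalLengthCm

print(no_empty_rows.index.tolist())
print(complete_rows.index.tolist())
print(has_petal.index.tolist())
```

Note that the surviving rows keep their original index labels; `reset_index(drop=True)` renumbers them if needed.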

### Duplicates

In [36]:
df = df.drop_duplicates()

In [37]:
df.shape   # checking the shape of df

(150, 6)
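
Before dropping duplicates it can be useful to look at them. As a sketch on made-up rows (mirroring the repeated observation seen earlier at indexes 11 and 12), `duplicated` returns a boolean mask that can be used like any other filter:

```python
import pandas as pd

toy = pd.DataFrame({
    "Id": [11, 12, 12, 13],
    "SepalLengthCm": [5.4, 4.8, 4.8, 4.8],
})

repeat_mask = toy.duplicated()                # True for the second and later copies of a row
all_copies = toy[toy.duplicated(keep=False)]  # every row that has an identical twin
deduplicated = toy.drop_duplicates()          # keeps the first copy of each row

print(repeat_mask.tolist())
print(all_copies.index.tolist(), deduplicated.index.tolist())
```

Rows only count as duplicates when all columns match, which is why the row with Id 13 survives even though its SepalLengthCm repeats.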

___
## _Some stats_

**Use the describe method to see how the data is distributed (numerical features only!)**

In [38]:
df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,76.393333,5.843333,3.054,3.758667,1.198667
std,44.376803,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,76.5,5.8,3.0,4.35,1.3
75%,114.75,6.4,3.3,5.1,1.8
max,153.0,7.9,4.4,6.9,2.5


We can convert the **Id** column to string:

In [39]:
df['Id'] = df['Id'].astype('str')

In [40]:
df.describe()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


We can also convert the **Species** column to the `category` dtype to save memory.

In [41]:
df['Species'] = df['Species'].astype('category')

In [42]:
df.dtypes   # checking the types of the columns

Id                 object
SepalLengthCm     float64
SepalWidthCm      float64
PetalLengthCm     float64
PetalWidthCm      float64
Species          category
dtype: object
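
The memory saving can be measured with `memory_usage(deep=True)`. The sketch below rebuilds a 150-value species column by hand (an assumption standing in for the real data) and compares the two representations; exact byte counts depend on the pandas version, so none are quoted here:

```python
import pandas as pd

# 150 labels, 50 per species, as plain strings
species = pd.Series(["Iris-setosa", "Iris-versicolor", "Iris-virginica"] * 50)
as_category = species.astype("category")

# deep=True also counts the memory held by the strings themselves
bytes_strings = species.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

print(bytes_strings, bytes_category)
print(as_category.cat.categories.tolist())
```

A categorical column stores each distinct label once plus a small integer code per row, so the gain grows with the number of repeated values.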

**We can also use the functions count(), mean(), sum(), median(), std(), min() and max() separately if we are only interested in one of those.**

**Get the minimum for each column.**

In [43]:
df.min()

Id                       1.0
SepalLengthCm            4.3
SepalWidthCm               2
PetalLengthCm              1
PetalWidthCm             0.1
Species          Iris-setosa
dtype: object

**Calculate the maximum of SepalLengthCm.**

In [44]:
df['SepalLengthCm'].max()

7.9

**We can also get information for each type of flower using the groupby method.**  

We'll get the median for each species.

In [45]:
df.groupby('Species').median(numeric_only=True)   # skip the non-numeric Id column

Unnamed: 0_level_0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
Species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Iris-setosa,5.0,3.4,1.5,0.2
Iris-versicolor,5.9,2.8,4.35,1.3
Iris-virginica,6.5,3.0,5.55,2.0
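
Several statistics can be computed per group in one call with `agg`. This is a minimal sketch on four made-up rows, not the actual Iris values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "PetalLengthCm": [1.4, 1.6, 5.1, 5.9],
})

# One row per species, one column per statistic
summary = toy.groupby("Species")["PetalLengthCm"].agg(["median", "min", "max"])

print(summary)
```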


### Correlation between the numerical features

In [46]:
df.corr(numeric_only=True)   # numeric_only (pandas >= 1.5) skips the Id and Species columns

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
SepalLengthCm,1.0,-0.109369,0.871754,0.817954
SepalWidthCm,-0.109369,1.0,-0.420516,-0.356544
PetalLengthCm,0.871754,-0.420516,1.0,0.962757
PetalWidthCm,0.817954,-0.356544,0.962757,1.0


### Saving the dataframe as a csv file

In [47]:
df.to_csv('iris_data_cleaned.csv')
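
By default `to_csv` also writes the index as an unnamed first column, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below, on a made-up two-row frame, compares the default with `index=False`. It writes to a string instead of a file (calling `to_csv` without a path returns the CSV text) so it can be run without touching the disk:

```python
import io

import pandas as pd

toy = pd.DataFrame({"Id": ["1.0", "2.0"], "SepalLengthCm": [5.1, 4.9]})

with_index = toy.to_csv()                # index written as an unnamed first column
without_index = toy.to_csv(index=False)  # only the data columns are written

reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```

Dtypes are not stored in a CSV file, so conversions such as `astype('category')` have to be reapplied after reloading.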