# Fundamental data analysis in Python using the Iris dataset and Pandas

Let's look at doing some "data analysis" using pandas in Python.

Pandas is a library designed specifically for working with data in Python; think of it as an extremely high-powered, customisable Excel.

To get started, we need to install our required software.

While everything required in this course can be run (with some setup) via Google Colab, I would recommend familiarising yourself with Anaconda, as it's widely used in data analysis.

## Requirements

### Conda

```conda install pandas xlrd```

See [Download Anaconda](https://www.anaconda.com/products/distribution)

### Colab/PIP

```!pip install pandas xlrd```

See [Google Colab](https://colab.research.google.com/?utm_source=scs-index)

# Importing Libraries

In [1]:
# Importing the Python Library: pandas
# pandas is used as a data analysis and manipulation tool.
# For detailed info on pandas, please go to the website https://pandas.pydata.org/
import pandas as pd


## Opening the data file
Now we need to open the iris data file, which we do using pandas. You can look these up in the pandas documentation yourself, but the functions of interest are:
- `pandas.read_csv` and `DataFrame.to_csv` to read and write CSV files respectively.
- `pandas.read_excel` and `DataFrame.to_excel` to read and write MS Excel files respectively.
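
As a minimal sketch of the read/write pairing, the snippet below writes a tiny, made-up table to CSV and reads it back. The `small` DataFrame and the in-memory `buffer` are illustrative assumptions standing in for a real file on disk; note that the readers are top-level pandas functions, while the writers are methods on the DataFrame.

```python
import io

import pandas as pd

# A tiny hypothetical table standing in for the real iris file.
small = pd.DataFrame(
    {
        "Sepal.Length": [5.1, 4.9],
        "Species": ["setosa", "setosa"],
    }
)

# to_csv is a DataFrame method; index=False leaves the row labels out of the output.
buffer = io.StringIO()
small.to_csv(buffer, index=False)

# read_csv is a top-level pandas function; rewind the buffer before reading it back.
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip)
```

With a real file you would pass a path such as `"iris.csv"` instead of the buffer.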

Note: you will need to edit the path in the code below so that it points to where you have downloaded the `iris.xls` file.

In [4]:
# Read the Iris dataset from the Excel file
irisdata_path = "./iris.xls"
iris_data = pd.read_excel(irisdata_path)

# Alternatively, read the Iris dataset from a CSV file:
# open iris.xls in Excel and save it as iris.csv via File > Save As
# iris_data = pd.read_csv("C:/Keras/iris.csv")

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


# Details about the Iris Plants Database

## Notes
Data Set Characteristics:
* Number of Instances: 150 (50 in each of three classes)
* Number of Attributes: 4 numeric, predictive attributes and the class
* Attribute Information:
    * sepal length in cm
    * sepal width in cm
    * petal length in cm
    * petal width in cm
* Classes:
    * Iris-Setosa
    * Iris-Versicolour
    * Iris-Virginica

# Taking a quick look at the data


## Head, Tail, and Sample

* ```df.head(n)``` retrieves the first ```n``` rows of the dataframe
* ```df.tail(n)``` retrieves the last ```n``` rows of the dataframe
* ```df.sample(n)``` retrieves ```n``` randomly selected rows from the dataframe

In [5]:
# head
iris_data.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [6]:
# tail
iris_data.tail()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


In [7]:
# sample
iris_data.sample(5)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
2,4.7,3.2,1.3,0.2,setosa
121,5.6,2.8,4.9,2.0,virginica
12,4.8,3.0,1.4,0.1,setosa
109,7.2,3.6,6.1,2.5,virginica
42,4.4,3.2,1.3,0.2,setosa
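
Because `sample` picks rows at random, the output above will differ from run to run. A minimal sketch of making it reproducible, using a made-up ten-row DataFrame (`df` is a hypothetical stand-in for `iris_data`), is to pass the `random_state` parameter:

```python
import pandas as pd

# Hypothetical stand-in for iris_data: ten rows with a default integer index.
df = pd.DataFrame({"value": range(10)})

first_three = df.head(3)  # rows labelled 0, 1, 2
last_three = df.tail(3)   # rows labelled 7, 8, 9

# Passing random_state fixes the seed, so the same rows come back on every call.
sample_a = df.sample(4, random_state=0)
sample_b = df.sample(4, random_state=0)

print(first_three.index.tolist())
print(last_three.index.tolist())
print(sample_a.equals(sample_b))
```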


# Descriptive Statistics

You can see the column names with `df.columns`.

In [8]:
# Access the columns
iris_data.columns

Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
       'Species'],
      dtype='object')

In [9]:
iris_data.describe()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
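
Note that `describe()` summarises only the numeric columns by default, which is why `Species` does not appear above. The sketch below, on a small made-up DataFrame (not the real iris data), shows this along with a few individual statistics you can compute directly:

```python
import pandas as pd

# Small hypothetical table with one numeric and one text column.
df = pd.DataFrame(
    {
        "Petal.Length": [1.4, 1.3, 4.7, 5.2],
        "Species": ["setosa", "setosa", "versicolor", "virginica"],
    }
)

# By default describe() covers numeric columns only, so 'Species' is left out.
summary = df.describe()
print(summary.columns.tolist())

# Individual statistics are available as Series methods.
mean_length = df["Petal.Length"].mean()
median_length = df["Petal.Length"].median()

# For a categorical column, value_counts() gives the number of rows per class.
species_counts = df["Species"].value_counts()
print(species_counts)
```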


# Accessing data
## Accessing columns
To access a single column, index the DataFrame with the column name: `df[colname]`. This returns a Series.
For example, `iris_data['Sepal.Length']` returns a Series. In contrast, passing a list of names, as in `iris_data[['Sepal.Length']]`, returns a DataFrame.

In [10]:
# To access a column as a series
iris_data['Sepal.Width']

0      3.5
1      3.0
2      3.2
3      3.1
4      3.6
      ... 
145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: Sepal.Width, Length: 150, dtype: float64

In [11]:
# To access a column as a DataFrame
iris_data[['Sepal.Width']]

Unnamed: 0,Sepal.Width
0,3.5
1,3.0
2,3.2
3,3.1
4,3.6
...,...
145,3.0
146,2.5
147,3.0
148,3.4
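
To make the Series/DataFrame distinction concrete, here is a minimal sketch on a made-up three-row table (`df` is a hypothetical stand-in for `iris_data`) that checks the type and shape of each result:

```python
import pandas as pd

# Hypothetical three-row stand-in for iris_data.
df = pd.DataFrame({"Sepal.Width": [3.5, 3.0, 3.2], "Species": ["setosa"] * 3})

as_series = df["Sepal.Width"]    # single label -> one-dimensional Series
as_frame = df[["Sepal.Width"]]   # list of labels -> two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```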


## Accessing a subset of rows
To access a subset of rows, you can do the following: `iris_data[0:3]` selects rows 0, 1 and 2.

In [12]:
#To access a subset of rows
iris_data[0:3]

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [13]:
# To access a subset of columns by position (here the columns at positions 2 and 3)
iris_data[iris_data.columns[2:4]]

Unnamed: 0,Petal.Length,Petal.Width
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2
...,...,...
145,5.2,2.3
146,5.0,1.9
147,5.2,2.0
148,5.4,2.3


### loc and iloc
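A more explicit and more powerful way to access parts of the DataFrame is with `loc`, which takes a label or list of labels, and `iloc`, which takes an integer position or list of positions. Slices behave differently between the two: `loc` includes the end label, whereas `iloc` excludes the end position. Note: you're indexing, not calling a function, so use `[]`, not `()`.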

In [47]:
# Remember: loc takes a label or list of labels.
# Accessing all rows and particular columns ('Petal.Length' and 'Petal.Width') using loc.
iris_data.loc[:, ['Petal.Length', 'Petal.Width']]


Unnamed: 0,Petal.Length,Petal.Width
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2
...,...,...
145,5.2,2.3
146,5.0,1.9
147,5.2,2.0
148,5.4,2.3


In [14]:
# Accessing particular rows (the first 4 rows, labels 0 to 3 inclusive) and particular columns ('Petal.Length' and 'Petal.Width') using loc.
iris_data.loc[0:3, ['Petal.Length', 'Petal.Width']]

Unnamed: 0,Petal.Length,Petal.Width
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2


In [50]:
# Accessing particular rows (labels 3 to 5, inclusive) and particular columns ('Petal.Length' and 'Petal.Width') using loc.
iris_data.loc[3:5, ['Petal.Length', 'Petal.Width']]

Unnamed: 0,Petal.Length,Petal.Width
3,1.5,0.2
4,1.4,0.2
5,1.7,0.4


In [15]:
# Remember: iloc takes an integer position or list of positions.
# Accessing all rows and all columns using iloc.
iris_data.iloc[:]

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [15]:
# Accessing all rows and the columns at positions 1 to 3 using iloc (the end position 4 is excluded).
iris_data.iloc[:, 1:4]

Unnamed: 0,Sepal.Width,Petal.Length,Petal.Width
0,3.5,1.4,0.2
1,3.0,1.4,0.2
2,3.2,1.3,0.2
3,3.1,1.5,0.2
4,3.6,1.4,0.2
...,...,...,...
145,3.0,5.2,2.3
146,2.5,5.0,1.9
147,3.0,5.2,2.0
148,3.4,5.4,2.3


In [17]:
# Accessing particular rows (positions 3 and 4) and particular columns (positions 2 and 3) using iloc; the end positions are excluded.
iris_data.iloc[3:5, 2:4]

Unnamed: 0,Petal.Length,Petal.Width
3,1.5,0.2
4,1.4,0.2
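
In `iris_data` the row labels happen to equal the row positions, which can hide the difference between `loc` and `iloc`. The sketch below uses a made-up DataFrame with labels 10, 20, 30, 40 (an assumption for illustration only) to show that `loc` slices by label and includes the end, while `iloc` slices by position and excludes it:

```python
import pandas as pd

# Hypothetical table whose row labels differ from the row positions.
df = pd.DataFrame(
    {"Petal.Length": [1.4, 1.3, 4.7, 5.2]},
    index=[10, 20, 30, 40],
)

# loc slices by label and includes the end label: rows 10, 20 and 30.
by_label = df.loc[10:30]

# iloc slices by position and excludes the end position: positions 0 and 1.
by_position = df.iloc[0:2]

print(by_label.index.tolist())     # [10, 20, 30]
print(by_position.index.tolist())  # [10, 20]
```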


# (Basic) Data Modification

In [65]:
# Renaming the column "Sepal.Length" to "Sepal"

iris_data.rename(columns={'Sepal.Length': 'Sepal'}, inplace=True)
iris_data.columns

Index(['Sepal', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'], dtype='object')
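
The cell above modifies `iris_data` in place. As an alternative sketch, `rename` without `inplace=True` returns a new DataFrame and leaves the original untouched, which can be easier to reason about when re-running notebook cells. The two-row `df` here is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical two-row stand-in for iris_data.
df = pd.DataFrame({"Sepal.Length": [5.1, 4.9], "Sepal.Width": [3.5, 3.0]})

# Without inplace=True, rename returns a renamed copy.
renamed = df.rename(columns={"Sepal.Length": "Sepal"})

print(df.columns.tolist())       # ['Sepal.Length', 'Sepal.Width']
print(renamed.columns.tolist())  # ['Sepal', 'Sepal.Width']
```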

# Accessing Rows by Condition (Bonus)

We can also check the rows of a dataframe against some condition or set of conditions, resulting in a series of `True` or `False` values.

In [63]:
iris_data['Petal.Length'] > 3

0      False
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149     True
Name: Petal.Length, Length: 150, dtype: bool

We can then use this boolean Series to filter the dataframe itself, keeping only the rows where the value is `True`.

In [64]:
iris_data.loc[iris_data['Petal.Length'] > 3]

Unnamed: 0,Sepal,Sepal.Width,Petal.Length,Petal.Width,Species
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
52,6.9,3.1,4.9,1.5,versicolor
53,5.5,2.3,4.0,1.3,versicolor
54,6.5,2.8,4.6,1.5,versicolor
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
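
For a set of conditions, boolean Series can be combined element-wise with `&` (and), `|` (or) and `~` (not). The sketch below uses a small made-up table rather than the real iris data; note that comparisons written inline need their own parentheses, because `&` and `|` bind more tightly than `>` or `<`.

```python
import pandas as pd

# Small hypothetical table for illustration.
df = pd.DataFrame(
    {
        "Petal.Length": [1.4, 4.7, 5.2, 6.1],
        "Species": ["setosa", "versicolor", "virginica", "virginica"],
    }
)

long_petals = df["Petal.Length"] > 3
is_virginica = df["Species"] == "virginica"

# Rows satisfying both conditions.
both = df.loc[long_petals & is_virginica]

# Rows satisfying either condition; the inline comparison is parenthesised.
either = df.loc[(df["Petal.Length"] < 2) | is_virginica]

# Rows where the condition is False.
not_virginica = df.loc[~is_virginica]

print(both.index.tolist())           # [2, 3]
print(either.index.tolist())         # [0, 2, 3]
print(not_virginica.index.tolist())  # [0, 1]
```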
