---

# Python Part 9. Pandas 

*Pandas* is an open-source library built on top of the NumPy library (see the [pandas documentation](https://pandas.pydata.org/docs/)). It is a Python package that offers data structures and operations for manipulating numerical data and time series, typically stored in tabular formats (click [here](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html?highlight=data%20reader) for the formats pandas can read). Using the pandas module is as simple as running the code: ```import pandas as pd```


### Reading in CSV Files with ```read_csv(path-to-file)```
One tabular format relevant to this notebook is the *CSV* file format, which is stored on disk with the *.csv* extension (click [here](https://en.wikipedia.org/wiki/Comma-separated_values) for a description of CSV files). Specifically, my current working directory contains a folder called Datasets, which holds a CSV file called iris_dataset.csv. The *path* to this file, relative to the location of this notebook, is
```python
windows_path = r"Datasets\iris_dataset.csv" # raw string, so the backslash is not read as an escape character

macOS_linux_path = "Datasets/iris_dataset.csv"

```
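If you would rather not keep two versions of the path, one option is to build it with the standard library's ```pathlib``` module. The snippet below is a minimal sketch that reuses the folder and file names from above; ```data_path``` is just an illustrative variable name.

```python
from pathlib import Path

# Join the folder and file name; Path uses the separator that matches
# the operating system the notebook is running on.
data_path = Path("Datasets") / "iris_dataset.csv"

print(data_path)           # Datasets/iris_dataset.csv on macOS or Linux, Datasets\iris_dataset.csv on Windows
print(data_path.exists())  # True only if the file is actually at that location
```

A ```Path``` object can be passed directly to ```pd.read_csv()``` in place of a string.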

Once you have located your CSV file called iris_dataset.csv you may read this data into your notebook with the following code:
```python
import matplotlib.pyplot as plt # import matplotlib.pyplot for plotting
import numpy as np # import numpy for numerical operations
import pandas as pd # import pandas!

iris = pd.read_csv("Datasets/iris_dataset.csv") # Currently on a MacOS or Linux System
print(f"{type(iris) = }")
```
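If you do not have the file on hand, you can still try ```read_csv()``` on text held in memory. The sketch below uses three made-up rows; the column layout (including ```petal_width```) is an assumption about how iris_dataset.csv is organized, and ```sample``` is an illustrative name.

```python
import io

import pandas as pd

# A tiny, made-up CSV with the column layout assumed for the iris file.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts any file-like object, not only a path on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))
print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # the type pandas inferred for each column
```

The first line of the text is treated as the header row, so it becomes the column names rather than a row of data.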

---

---

### Viewing the DataFrame
In the above code cell we assigned the pandas DataFrame generated from the iris_dataset.csv file to the variable ```iris```. We can view this DataFrame in Jupyter notebooks in the following two ways:
```python
print(f"{iris = } \n")

print("iris DataFrame without print call")
iris # Only works when called in the last line of a Jupyter notebook cell
```


---

---

### DataFrame ```head()``` and ```tail()``` Methods

When dealing with large tabular data, it often helps to view only the first or last entries in the DataFrame. To do this, simply call the DataFrame methods ```head()``` and ```tail()``` respectively. The following code illustrates these methods.
```python
print("The first 10 rows from the DataFrame")
print(iris.head(10))

print("The last 3 rows from the DataFrame")
print(iris.tail(3))
```
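Both methods also work without an argument, in which case they return 5 rows. A small sketch on a made-up DataFrame (```df``` and its ```value``` column are illustrative only):

```python
import pandas as pd

# A made-up DataFrame with 20 rows, numbered 0 through 19.
df = pd.DataFrame({"value": range(20)})

print(df.head())    # no argument: the first 5 rows
print(df.tail())    # no argument: the last 5 rows
print(df.head(-3))  # negative argument: every row except the last 3
```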

---

---

### What Do DataFrames Do for Data Scientists?

The data contained in the ```iris``` DataFrame object is *actual real-world data* and is often referred to as the "hello world" of machine learning datasets. This data consists of 4 numerical measurements and 1 categorical measurement (the columns) for 150 instances of [iris flowers (description in this link)](https://en.wikipedia.org/wiki/Iris_flower_data_set) (the rows). To be clear, 

1. Each column is a single measurement over many instances.
2. Each row is a collection of measurements over one single instance.

To access the column names (names of measurements) we can use the DataFrame ```columns``` attribute, and to access the row names (names of each instance) we can use the DataFrame ```index``` attribute. Both are attributes rather than methods, so they are written without parentheses. For example, run the following code in the cell below. 
```python
print(f"{iris.columns = } \n")
print(f"{iris.index = }")
```
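To see what these two attributes hold without the full dataset, here is a minimal sketch on a two-row DataFrame. The values are made up and ```toy``` is an illustrative name.

```python
import pandas as pd

# A made-up DataFrame with two columns and two rows.
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "species": ["setosa", "versicolor"],
})

print(toy.columns.tolist())  # column names as a plain Python list
print(toy.index.tolist())    # row labels; pandas numbers rows from 0 by default
print(len(toy.index), len(toy.columns))  # number of rows, number of columns
```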

---

---

### Accessing a Single Column of Data from a DataFrame
There are two ways to access a single column from a pandas DataFrame, both of which are illustrated with the code below. Note that the output of these calls is not a pandas DataFrame, but rather a pandas *Series*, which is essentially a single-column DataFrame with some differences in functionality. 
```python
print(f"{type(iris.sepal_length) = }")
print(iris.sepal_length)

print(f"{type(iris['sepal_length']) = }")
print(iris["sepal_length"])
```
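The two forms are not always interchangeable. The sketch below uses two hypothetical column names, one containing a space and one that matches an existing DataFrame method, to show cases where only the bracket form reaches the column.

```python
import pandas as pd

# Hypothetical column names chosen to show where attribute access falls short.
toy = pd.DataFrame({"sepal length": [5.1, 4.9], "count": [1, 2]})

# A name with a space cannot be written after a dot, so brackets are required.
col = toy["sepal length"]
print(type(col))

# toy.count is the DataFrame's count() method, not the "count" column.
print(callable(toy.count))
print(toy["count"].tolist())  # brackets always return the column
```

For this reason the bracket form is the safer default when column names come from a file you did not create.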


---

---

### Accessing Multiple Columns of Data from a DataFrame
We can also access multiple columns of data from a pandas DataFrame by passing a list of column names inside the square brackets, as illustrated with the code below. The result is itself a DataFrame.  
```python
iris[["sepal_length", "sepal_width"]]
```
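The inner brackets are a Python list, and passing a list changes the type you get back. A minimal sketch on made-up values (```toy``` is an illustrative name):

```python
import pandas as pd

# A made-up DataFrame with two numerical columns.
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
})

as_series = toy["sepal_length"]                # a single name gives a Series
as_frame = toy[["sepal_length"]]               # a one-item list gives a DataFrame
two_cols = toy[["sepal_width", "sepal_length"]]  # columns come back in the order listed

print(type(as_series), as_series.shape)
print(type(as_frame), as_frame.shape)
print(two_cols.columns.tolist())
```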


---

---

### Accessing the Rows of Data from a DataFrame
There are two ways to access a single row from a pandas DataFrame. However, for now we will focus on using the DataFrame ```iloc``` indexer (*integer index location*), which is used with square brackets rather than parentheses. It works in a similar manner to accessing the rows of a numpy ```ndarray``` type. For example, try running the following code in the cell below. 
```python
print("iris.iloc[0] = ")
print(iris.iloc[0], "\n \n")

print("iris.iloc[0:50] = ")
iris.iloc[0:50]
```
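The other common way to reach rows is the label-based ```loc``` indexer. The sketch below, on made-up values, contrasts the two; the main thing to notice is that an ```iloc``` slice excludes its end position (like a Python list), while a ```loc``` slice includes its end label.

```python
import pandas as pd

# A made-up DataFrame with the default row labels 0, 1, 2, 3.
toy = pd.DataFrame({"sepal_length": [5.1, 4.9, 7.0, 6.3]})

by_position = toy.iloc[0:2]  # positions 0 and 1
by_label = toy.loc[0:2]      # labels 0, 1 and 2

print(len(by_position), len(by_label))

# Negative positions count from the end, as with lists and ndarrays.
last_row = toy.iloc[-1]
print(last_row["sepal_length"])
```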


---

---

### Collecting Data According to Boolean Conditions
Rather than slicing through a DataFrame to access specific rows of data, it is often convenient to access rows of data subject to some Boolean condition. For example, maybe we want only the rows from ```iris``` where the species entry is "setosa". In the following code we first make a pandas Series of Boolean values, and then we use this Series to select the rows of ```iris``` where the value is ```True```. 


```python
print('iris.species == "setosa"')
print(iris.species == "setosa", "\n \n")

print('iris[iris.species == "setosa"] = ')
iris[iris.species == "setosa"]
```
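Conditions can also be combined. The sketch below, on made-up rows, uses ```&``` (and), ```|``` (or), and ```~``` (not); each condition needs its own parentheses because these operators bind more tightly than ```==``` and ```>```.

```python
import pandas as pd

# Made-up rows using the same column names as the iris DataFrame.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.6, 7.0, 6.3],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

# Rows that are setosa AND have a sepal length above 5.0.
long_setosa = toy[(toy.species == "setosa") & (toy.sepal_length > 5.0)]

# Rows that are versicolor OR virginica.
either = toy[(toy.species == "versicolor") | (toy.species == "virginica")]

# Rows that are NOT setosa.
not_setosa = toy[~(toy.species == "setosa")]

print(long_setosa)
print(either)
print(not_setosa)
```

Python's ```and```, ```or```, and ```not``` keywords do not work element by element on a Series, which is why the symbol operators are used here.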


---

---

### Adding a New Column of Data
When performing *supervised machine learning* tasks, *classification* algorithms typically require numerical values. So, if we were to try to classify iris species, we would first need to either change the ```species``` column to be numerical, or create a new column with numerical labels. Because we want to retain the species name of each instance, we decide to make a new DataFrame column. The following code shows how to do this, though you might want to see this [link](https://www.guru99.com/python-map-function.html) for more on the built-in Python ```map()``` function. 
```python
def make_target(species):
    if species == "setosa":
        return 0
    elif species == "versicolor":
        return 1
    else:
        return 2

iris["target"] = list(map(make_target, iris.species))
iris

```
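A pandas Series also has its own ```map()``` method, which accepts a dictionary. The sketch below is an alternative to the function above, not a replacement for it; the rows are made up and ```label_of``` is an illustrative name.

```python
import pandas as pd

# Made-up rows; only the species column matters for this example.
toy = pd.DataFrame({"species": ["setosa", "setosa", "versicolor", "virginica"]})

# Dictionary from species name to the numerical label.
label_of = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Series.map looks each entry up in the dictionary.
toy["target"] = toy["species"].map(label_of)
print(toy)

# A name missing from the dictionary becomes NaN instead of a label.
unknown = pd.Series(["setosa", "not_a_species"]).map(label_of)
print(unknown)
```

One difference in behavior: ```make_target()``` labels anything that is not setosa or versicolor as 2, whereas the dictionary version leaves unrecognized names as missing values, which can make typos in the data easier to spot.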


---

---

### Relations to NumPy

```python
print(f"{iris.shape = } \n")
print(f"{type(iris.sepal_length.to_numpy()) = } \n \n")
print(f"{iris.sepal_length.to_numpy() = } \n \n")

print("iris[['sepal_length', 'sepal_width']].iloc[0:25].to_numpy() = ")
iris[['sepal_length', 'sepal_width']].iloc[0:25].to_numpy()
```
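Once the data is a NumPy array, the usual array operations apply. A minimal sketch on made-up values, selecting only the numerical columns before converting (```features``` and ```column_means``` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up rows with two numerical columns and one text column.
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "species": ["setosa", "versicolor", "virginica"],
})

# Keep only the numerical columns so the array has a single float type.
features = toy[["sepal_length", "sepal_width"]].to_numpy()

print(type(features))
print(features.shape)  # one row per flower, one column per measurement
print(features.dtype)

# Column-wise mean computed with NumPy.
column_means = features.mean(axis=0)
print(column_means)
```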

---

---

### Visualizing the data 
Pandas DataFrames do provide their own plotting methods, including scatter plots. However, I find these methods insufficient for the plots and figures I like to make. Because of this, we will use ```matplotlib.pyplot``` together with ```seaborn``` to produce the following objectively pretty plot. 

```python
import seaborn as sns # import seaborn!
sns.set_theme() # Set a gray background theme 

plt.figure(figsize = (10, 8), dpi = 80)

# Grab unique values in the species column, in order of first appearance
# (a set has no fixed order, so colors could change between runs)
species_list = iris.species.unique()

for color, species in zip(["magenta", "seagreen", "blue"], species_list):
    temp_iris = iris[iris.species == species] # Make a DataFrame for a specific species 
    plt.scatter(temp_iris.sepal_length,
                temp_iris.sepal_width,
                c = color, # Marker color
                s = temp_iris.petal_length * 100, # Marker size 
                alpha = .56, # Marker opacity (0 is transparent, 1 is solid)
                label = species # Marker label
                )

plt.xlabel("sepal length (cm)", fontsize  = 15)
plt.ylabel("sepal width (cm)", fontsize = 15)
plt.title("The Iris Data", fontsize = 20)
plt.xlim(4, 8.5)
plt.legend(fontsize = 14)
plt.show()
```

---

---

### Seaborn ```pairplot()``` Function

```python
sns.pairplot(iris, hue  = "species")
```


---