# Pandas Data Structures and Operations

Pandas is a well-documented library. Its official documentation is available at https://pandas.pydata.org/pandas-docs/stable/

In [None]:
import pandas as pd

## Loading data into a Pandas DataFrame

In [None]:
?pd.read_csv

In [None]:
grades_df = pd.read_csv("sample_gradebook.csv")
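
If `sample_gradebook.csv` is not at hand, the same call can be tried on in-memory text. The sketch below is only an illustration: the column names match the ones used later in this notebook, but the rows are hypothetical and the real file may contain more columns.

```python
import io

import pandas as pd

# Hypothetical gradebook contents; the real sample_gradebook.csv may differ.
csv_text = """Name,Section,CSMODEL
Ana,S11,3.5
Ben,S12,2.0
Cara,S11,4.0
"""

# read_csv accepts a file path or a file-like object such as StringIO.
demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.dtypes)
```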

### For Google Colab:
1. Mount your Google Drive
2. Pass the path of the file in your Google Drive to `read_csv`

In [None]:
from google.colab import drive 
drive.mount('/content/gdrive')

In [None]:
# Replace this path with the location of your CSV file in your Drive
grades_df = pd.read_csv('gdrive/My Drive/sample_gradebook.csv')

## View dataset info

In [None]:
grades_df.info()

The DataFrame dimensions, as a `(rows, columns)` tuple, can be retrieved via the [`shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) property.

In [None]:
grades_df.shape

In [None]:
grades_df

## Selecting a Column

In [None]:
grades_df["Name"]

In [None]:
# Attribute access also works when the column name is a valid Python identifier
grades_df.Name

In [None]:
type(grades_df["Name"])
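
The type check above matters because single brackets and double brackets return different objects. A minimal sketch on a hypothetical two-row frame (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical data standing in for the gradebook.
demo_df = pd.DataFrame({"Name": ["Ana", "Ben"], "CSMODEL": [3.5, 2.0]})

single = demo_df["Name"]     # one column label -> Series
wrapped = demo_df[["Name"]]  # list of labels -> DataFrame with one column

print(type(single).__name__, type(wrapped).__name__)
```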

## Selecting Multiple Columns

In [None]:
grades_df[["Name","Section"]]

## Selecting a Row
To select rows, we use `.loc` or `.iloc`.

`.loc` selects a row by its index label, whereas `.iloc` selects a row by its integer position (i.e., where the row sits in the current state of the DataFrame).

In [None]:
grades_df.iloc[2]

In [None]:
type(grades_df.loc[2])
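
With the default index, labels and positions coincide, so `.loc[2]` and `.iloc[2]` return the same row. The difference shows up once the rows are reordered. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data standing in for the gradebook.
demo_df = pd.DataFrame(
    {"Name": ["Ana", "Ben", "Cara"], "CSMODEL": [3.5, 2.0, 4.0]}
)

# Sorting keeps the original labels, so the index order becomes 1, 0, 2.
sorted_df = demo_df.sort_values(by="CSMODEL")

by_label = sorted_df.loc[0]      # row whose index label is 0
by_position = sorted_df.iloc[0]  # first row in the current order

print(by_label["Name"], by_position["Name"])
```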

Or a range of rows:

*Take note that `loc` includes the end bound of the range, whereas `iloc` excludes it*

In [None]:
grades_df.loc[0:4]

In [None]:
grades_df.iloc[0:4]
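
The effect of the two end-bound rules can be checked by counting rows. This sketch assumes a default integer index and uses made-up values:

```python
import pandas as pd

# Hypothetical six-row frame with the default index 0..5.
demo_df = pd.DataFrame({"CSMODEL": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]})

label_slice = demo_df.loc[0:4]      # labels 0 through 4, end included
position_slice = demo_df.iloc[0:4]  # positions 0 through 3, end excluded

print(len(label_slice), len(position_slice))
```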

You may also specify column(s) with `.loc` and `.iloc`:

In [None]:
grades_df.iloc[0:2, 0:4]
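
The `.loc` counterpart takes column labels instead of positions. The sketch below, on hypothetical data, selects the same block both ways; note that the row slice differs (`0:1` versus `0:2`) because of the end-bound rule.

```python
import pandas as pd

# Hypothetical data standing in for the gradebook.
demo_df = pd.DataFrame(
    {
        "Name": ["Ana", "Ben", "Cara"],
        "Section": ["S11", "S12", "S11"],
        "CSMODEL": [3.5, 2.0, 4.0],
    }
)

by_label = demo_df.loc[0:1, ["Name", "Section"]]  # rows 0-1, columns by label
by_position = demo_df.iloc[0:2, 0:2]              # rows 0-1, first two columns

print(by_label.equals(by_position))
```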

## Filter Rows

In [None]:
grades_df[grades_df.CSMODEL >= 3.5]
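
The expression inside the brackets is a boolean Series, one value per row, and only rows marked `True` are kept. Several conditions can be combined with `&` (and) or `|` (or), each wrapped in parentheses. A sketch with hypothetical rows and section names:

```python
import pandas as pd

# Hypothetical data standing in for the gradebook.
demo_df = pd.DataFrame(
    {
        "Name": ["Ana", "Ben", "Cara", "Dan"],
        "Section": ["S11", "S12", "S11", "S12"],
        "CSMODEL": [3.5, 2.0, 4.0, 3.5],
    }
)

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (demo_df["CSMODEL"] >= 3.5) & (demo_df["Section"] == "S11")
filtered = demo_df[mask]

print(filtered)
```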

## Sort Rows

In [None]:
grades_df.sort_values(by='CSMODEL')

In [None]:
# iloc[0] returns the first row after sorting, i.e. the lowest CSMODEL grade
grades_df.sort_values(by='CSMODEL').iloc[0]
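
`sort_values` returns a new DataFrame and sorts in ascending order by default. As an illustrative variation on hypothetical data, the sketch below sorts in descending order and resets the index so that labels match positions again:

```python
import pandas as pd

# Hypothetical data standing in for the gradebook.
demo_df = pd.DataFrame(
    {"Name": ["Ana", "Ben", "Cara"], "CSMODEL": [3.5, 2.0, 4.0]}
)

# Highest grade first; drop=True discards the old index labels.
ranked = demo_df.sort_values(by="CSMODEL", ascending=False).reset_index(drop=True)

print(ranked)
```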

## Adding New Columns

For this, we are going to need the numpy [`where`](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function. `np.where(condition, x, y)` returns `x` where the condition is true and `y` elsewhere.

In [None]:
import numpy as np

In [None]:
grades_df["Type"] = np.where(grades_df["CSMODEL"] >= 3.5, "Excellent", "Good")

In [None]:
grades_df
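
For more than two categories, one option is to nest `np.where` calls. The thresholds and the "Needs Improvement" label below are hypothetical and only illustrate the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical grades standing in for the gradebook.
demo_df = pd.DataFrame({"CSMODEL": [4.0, 3.0, 2.0]})

# The inner np.where handles rows that fail the outer condition.
demo_df["Type"] = np.where(
    demo_df["CSMODEL"] >= 3.5,
    "Excellent",
    np.where(demo_df["CSMODEL"] >= 2.5, "Good", "Needs Improvement"),
)

print(demo_df)
```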

## Aggregate Data

In [None]:
# Getting the Mean of the CSMODEL column
grades_df["CSMODEL"].mean()

In [None]:
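# Getting the Mean of all numeric columns in the DataFrame
# (numeric_only=True skips text columns such as Name and Section,
# which would otherwise raise a TypeError in pandas 2.0 and later)
grades_df.mean(numeric_only=True)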

You can also aggregate data using the [`agg`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) function. 

In [None]:
grades_df["CSMODEL"].agg(['mean','std','min','max'])

## Group By Variable

In [None]:
grades_df[["Section","CSMODEL"]].groupby(by="Section").min()

In [None]:
grades_df[["Section", "CSMODEL"]]

In [None]:
grades_df[["Section","CSMODEL"]].groupby(by="Section").max()

To get the [`min`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html) and [`max`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html) of each section in a single DataFrame, we can use the [`agg`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) method.

In [None]:
grades_df[["Section","CSMODEL"]].groupby(by="Section").agg(['min','max'])
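
Passing a list to `agg` produces two-level column labels. If flat, descriptive column names are preferred, named aggregation is one alternative. The sketch below uses hypothetical rows, and the output names `lowest` and `highest` are arbitrary choices:

```python
import pandas as pd

# Hypothetical data standing in for the gradebook.
demo_df = pd.DataFrame(
    {
        "Section": ["S11", "S12", "S11", "S12"],
        "CSMODEL": [3.5, 2.0, 4.0, 3.0],
    }
)

# Each keyword becomes an output column; its value names the aggregation.
summary = demo_df.groupby("Section")["CSMODEL"].agg(lowest="min", highest="max")

print(summary)
```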