# Pandas 

__Purpose:__ The purpose of this lecture is to explore the extremely powerful Python package - Pandas. We will learn what Pandas is, why it is so useful for Data Scientists, how to use Pandas Series and DataFrames for indexing and manipulation tasks and more advanced capabilities like grouping, aggregating, and sorting.  

__At the end of this lecture you will be able to:__
> 1. Understand what the Pandas Package is and what it can be used for 
> 2. Understand the concept of a Pandas Series and Pandas DataFrame and how they can be created, accessed, and manipulated
> 3. Perform indexing and manipulation (concatenating/merging/joining/reshaping) tasks on Pandas DataFrames
> 4. Perform advanced tasks on Pandas DataFrames such as aggregating and grouping 
> 5. Use Pandas built-in capabilities for Time Series data 

# Pandas Introduction and Uses

### What is Pandas?

__Overview:__
- __[Pandas](http://pandas.pydata.org/pandas-docs/stable/index.html):__ Pandas is a Python package that provides fast and flexible data structures that are designed to make working with [relational](https://en.wikipedia.org/wiki/Relational_database) or "labeled" data easy and intuitive
- [Wes McKinney](https://en.wikipedia.org/wiki/Wes_McKinney) created Pandas in 2008 and published [this](http://www.dlr.de/sc/Portaldata/15/Resources/dokumente/pyhpc2011/submissions/pyhpc2011_submission_9.pdf) paper at PyHPC in 2011 describing the usefulness of and need for Pandas (the name is a play on [__Pan__ el __Da__ ta](https://en.wikipedia.org/wiki/Panel_data)). In his words:

_"Pandas enables people to analyze and work with data who are not expert computer scientists...the code is intuitive and accessible. Pandas helps people move beyond just using Excel for data analysis"_ 

- Before Pandas, it was cumbersome in Python to perform tasks such as importing CSV files, dealing with spreadsheet-like datasets with rows and columns, and merging tables 
- Pandas was developed to solve these problems. With the introduction of the DataFrame, Pandas made it possible to do intuitive analysis and exploration in Python that previously required much more code 
- In recent years, the Pandas Package has become a staple in the Data Scientist's toolbox for some of the following reasons. In fact, Python is one of the most popular programming languages for Data Scientists specifically because of packages such as Pandas and Matplotlib (Lecture 6)
> 1. As a Data Scientist, it is common to work with tabular data where each column can hold a different type of data, known as __heterogeneously-typed data__ (similar to a SQL table or Excel spreadsheet). The Pandas DataFrame represents tabular data and lets you do what you would do in a spreadsheet, but programmatically and usually faster 
> 2. As a Data Scientist, it is common to work with time series data that may be ordered or unordered and Pandas has extensive capabilities to treat dates, times, etc. 
> 3. The most time-consuming part of any Data Scientist's job is __[Data Munging](https://en.wikipedia.org/wiki/Data_wrangling)__ (Data Cleaning/Wrangling) and Pandas provides all the necessary tools at your fingertips to do this quicker and cleaner 
> 4. Exploratory Data Analysis is often overlooked in Data Science, but remains one of the most important tasks of a Data Scientist and Pandas provides many easy and intuitive methods to perform data manipulation 

- Now you understand why Data Scientists use the Pandas Package, but what is it about the Pandas Package that allows us, as users, to realize these benefits? 
> 1. __Missing Data:__ Pandas handles missing data well (represented as NaN - recall NumPy missing value)
> 2. __Size Mutability:__ Pandas DataFrames are size mutable which means columns and rows can be inserted and deleted 
> 3. __Data Alignment:__ Pandas allows you to align an object to a specific set of labels OR let Pandas align the data for you
> 4. __Grouping Data:__ Pandas `groupby` function allows both aggregating and transforming of data 
> 5. __Data Access:__ Pandas has extensive capabilities of slicing, indexing and subsetting large data sets 
> 6. __Reshaping Data:__ Pandas has extensive capabilities of merging, joining, and reshaping data 
> 7. __Input/Output:__ Pandas allows easy import and export of flat files such as CSV 
> 8. __Time-Series:__ Pandas has specific Time-Series functionality to work with dates

### Pandas Data Structures 

__Overview:__ 
- Recall that the usefulness of Pandas has to do with its fundamental data structures
- There are 2 types of data structures in Pandas:
> 1. [`Series`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series): Series is a one-dimensional labeled array that is capable of holding any data type (i.e. `int`, `str`, `float`, etc.), but every element is of this same type. The axis labels of a Series are referred to as the __Index__ of the Series
>> - The `pd.Series(data, index)` function which creates a Series data structure, has 2 arguments:<br>
>> a. `index`: The `index` argument is a list of axis labels<br>
>> b. `data`: The `data` argument can be any of the following:
>> > 1. Python dictionary (`dict`)
>> > 2. NumPy Array (`ndarray`)
>> > 3. Scalar value (5)<br> 
> 2. [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html): DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It largely resembles a spreadsheet or SQL table. The first axis labels of a DataFrame (rows) are referred to as the __Index__ of the DataFrame, whereas the second axis labels of a DataFrame (columns) are referred to as the __Columns__ of the DataFrame
>> - The `pd.DataFrame(data, index, columns)` function which creates a Dataframe data structure, has 3 arguments:<br>
>> a. `index`: The `index` argument is a list of axis 0 labels<br>
>> b. `columns`: The `columns` argument is a list of axis 1 labels<br>
>> c. `data`: The `data` argument can be any of the following:
>> > 1. Dictionary of 1D ndarrays, lists, dicts, or Series 
>> > 2. 2-D NumPy Array (`ndarray` object) 
>> > 3. A `Series` object 
>> > 4. Another `DataFrame` object
>> > 5. List of Python dictionaries 
>> > 6. From a CSV or Excel file (or any of the possible file formats, which can be found [here](http://pandas.pydata.org/pandas-docs/stable/io.html))

__Helpful Points:__
1. Remember that data alignment in Pandas is intrinsic which means that the link between labels and data will not be broken unless you do so explicitly 
2. If you pass an index and/or columns into the function, you guarantee these in the resulting object. Therefore, if you also pass a dictionary, for example, you will lose all the data from the dictionary that does not match with the passed index  
3. If axis labels are not passed into the function, they will be constructed from the input data 
4. We will see examples of creating both types of data structures below 

__Practice:__ Examples of creating Pandas Data Structures in Python 

In [None]:
import pandas as pd
import numpy as np 

## Part 1 (Creating Series in Pandas):

### Creating Series from Python Dictionary

In [None]:
my_dict = {"Gordon": 20, "Roberto": 10, "Jerod":15}
my_dict

In [None]:
ex_series_1 = pd.Series(my_dict)
ex_series_1

In [None]:
pd.Series(my_dict, index = ["Jerod", "Bob", "Mary"])

See "Helpful Point 2" above which states that if you pass an index into the function (`["Jerod", "Bob", "Mary"]`), you guarantee these indexes in the resulting object, whether they existed in your data or not. We passed in a dictionary of data that only had 1 of 3 indexes the same as the `index` argument. Therefore, the remaining indexes that were passed in, but not in the data are shown with `NaN` to represent "missing data" and the remaining indexes that were in the data, but not passed in, are simply not shown. 

### Creating Series from NumPy Array

In [None]:
my_array = np.arange(2,6)
print(my_array, type(my_array))

In [None]:
pd_1 = pd.Series(my_array) # no index argument so Pandas will infer it 
pd_1

In [None]:
type(pd_1)

See "Helpful Point 3" above which states that if you do not pass an index into the function, the index will be constructed for you. The default index holds the integers `[0, ..., len(data) - 1]`

In [None]:
pd_2 = pd.Series(my_array, index = ["a", "b", "c", "d"]) # explicitly define index argument 
pd_2

In [None]:
pd_2.index # extract the index labels 

### Creating Series from Scalar Value

In [None]:
my_scalar = 10

In [None]:
pd.Series(my_scalar) # no index provided, length equal to 1 by default 

In [None]:
pd.Series(my_scalar, index = [0,0,0,1,1]) # index provided with length 5, match this length in Series 

Two points here:

1. It is possible to have non-unique index values (i.e. the index passed in repeats both 0 and 1). If you perform an operation on a Pandas Series or DataFrame object with non-unique index labels and the operation requires unique labels, you will receive an error only at that time. Many operations do not require unique labels, so duplicates often go unnoticed until then
2. If you create a Series object based on a scalar value, the scalar will be repeated as many times as required to match the length of the index. If no index is given, the Series has a length of 1
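
The behaviour with repeated labels can be sketched as follows. This is a minimal illustration on a made-up Series (`dup_series` is a hypothetical name), and it uses `reindex` as one example of an operation that needs unique labels:

```python
import pandas as pd

# Series with a repeated index: labels 0 and 1 each appear more than once
dup_series = pd.Series(10, index=[0, 0, 0, 1, 1])

# The index knows that its labels are not unique
print(dup_series.index.is_unique)

# Label lookup still works; it returns every row that carries the label
rows_with_label_0 = dup_series.loc[0]
print(len(rows_with_label_0))

# reindex needs unique labels, so the error only appears at this point
try:
    dup_series.reindex([0, 1])
    reindex_failed = False
except ValueError as err:
    reindex_failed = True
    print("reindex failed:", err)
```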

### Using the Series Name Attribute

In [None]:
my_series = pd.Series(my_scalar, index = [0,0,0,1,1], name = "Name1")

In [None]:
print(my_series.name,'\n')
print(my_series)

In [None]:
my_series_1 = my_series.rename("Name2") # rename the name attribute 

In [None]:
print(my_series_1.name,'\n')
print(my_series_1)

The Name Attribute will come in handy to understand when we look at DataFrames since each column of a DataFrame is a Series Object with the Name Attribute equal to the column name. 
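
As a small preview of that relationship, the sketch below builds a made-up DataFrame (`scores_df` and its column names are illustrative only) and checks the `name` of a selected column:

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
scores_df = pd.DataFrame({"points": [20, 10, 15], "games": [3, 2, 4]})

# Selecting one column gives back a Series named after that column
points_column = scores_df["points"]
print(type(points_column))
print(points_column.name)

# Going the other way, a named Series becomes a column with that name
bonus = pd.Series([1, 2, 3], name="bonus")
combined = pd.concat([scores_df, bonus], axis=1)
print(list(combined.columns))
```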

## Part 2 (Creating DataFrames in Pandas):

### Creating DataFrames from Dictionaries

In [None]:
my_dict = {"ndarray":np.arange(4), # first value is ndarray 
           "List":[10,12,1,2], # second value is list 
           "Series":pd.Series('a', index = ["row_1", "row_2", "row_3", "row_4"])} # third value is pandas series 
my_dict

In [None]:
my_df = pd.DataFrame(my_dict)
my_df

In [None]:
type(my_df)

A few points here:

1. We can see that to create this DataFrame from a dictionary that has a NumPy Array, List, and Series object, we need all objects to have the same length (4), which corresponds to the number of rows
2. The Series object was given explicit index labels, so the DataFrame uses those labels for its rows instead of generating a default integer index
3. Each key-value pair in the dictionary becomes a column in the associated DataFrame

In [None]:
my_df.dtypes # types of each column

In [None]:
my_df.index # row names 

In [None]:
my_df.columns # column names

### Creating DataFrames from NumPy Arrays

In [None]:
my_array = np.arange(4).reshape(2,2)
my_array

In [None]:
df_1 = pd.DataFrame(my_array, index = ["row_1", "row_2"], columns = ["column_1", "column_2"])
df_1

### Creating DataFrames from Series Objects

In [None]:
my_series = pd.Series(my_scalar, index = [0,0,0,1,1])
my_series

In [None]:
df_2 = pd.DataFrame(my_series, columns = ["column_1"])
df_2

In [None]:
type(df_2)

### Creating DataFrames from other DataFrames

In [None]:
df_3 = pd.DataFrame(df_2, columns = ["new_column_1"])
df_3

Since column labels were passed into the function, the new DataFrame keeps only the columns with those labels. `df_2` has no column named `new_column_1`, so the new column has nothing to align with and is filled with missing data (`NaN`)
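
A minimal sketch of this alignment on a made-up DataFrame (`source_df` is a hypothetical name). It also shows `rename`, which is one way to change a column label while keeping the data, assuming that is what was intended:

```python
import pandas as pd

# Hypothetical source DataFrame with a single column
source_df = pd.DataFrame({"column_1": [10, 10, 10]})

# Asking for a column that exists keeps its data
kept = pd.DataFrame(source_df, columns=["column_1"])

# Asking for a column that does not exist gives a column of NaN
missing = pd.DataFrame(source_df, columns=["new_column_1"])

# rename changes the label and keeps the values
renamed = source_df.rename(columns={"column_1": "new_column_1"})

print(kept)
print(missing)
print(renamed)
```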

### Creating DataFrames from List of Dictionaries

In [None]:
my_list = [{"First":"Gordon", "Last":"Dri"},
           {"First":"Roberto", "Last":"Reif"},
           {"First":"Jerod", "Last":"Rubalcava"}]
my_list

In [None]:
pd.DataFrame(my_list, index = ["Employee_1", "Employee_2", "Employee_3"])

We can see that each item in the list turns out to be a row in the resulting DataFrame 

### Creating DataFrames from CSV

Pandas has extensive capabilities for reading from and writing to different formats. The most common input/output (I/O) that you will perform is with the [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file format, but Pandas also supports many other formats, some of which are shown below with their respective reader and writer functions:

> 1. __CSV:__ `read_csv` and `to_csv`
> 2. __MS Excel:__ `read_excel` and `to_excel`
> 3. __Python Pickle:__ `read_pickle` and `to_pickle`
> 4. __SQL:__ `read_sql` and `to_sql`

See below for the CSV file that we will create a DataFrame from: <img src="img29.png">

In [None]:
pd.read_csv("csv_file_for_pandas.csv") # Note: CSV file must be in the same directory as this Jupyter Notebook, otherwise include exact path

The `pd.read_csv` function assumed that Column A was a column in our data instead of the row labels, so we need to fix this below:

In [None]:
pd.read_csv("csv_file_for_pandas.csv", index_col = 0) # specify location of row labels (at first column)
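
If the CSV file is not at hand, the effect of `index_col` can be reproduced with an in-memory buffer. The sketch below is an illustration only: the data is made up and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical data standing in for the CSV file used in this lecture
people_df = pd.DataFrame(
    {"First": ["Ann", "Bo"], "Age": [31, 45]},
    index=["Employee_1", "Employee_2"],
)

# Write the DataFrame as CSV text into an in-memory buffer
buffer = io.StringIO()
people_df.to_csv(buffer)
csv_text = buffer.getvalue()

# Without index_col, the row labels come back as an ordinary column
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row labels again
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(without_index)
print(with_index)
```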

# Inspecting Pandas Data Structures 

__Overview:__
- Pandas offers many ways to "inspect" Data Structures, which become paramount when performing Exploratory Data Analysis with Pandas 
- The following are useful ways of inspecting Pandas Data Structures where `obj` is either a Series or DataFrame Object
> 1. __Dimensions:__ Dimensions can be accessed using `len(obj)`, `obj.shape`, and `obj.size`
> 2. __Row Labels:__ Row labels can be accessed using `obj.index`, `obj.index.name`
> 3. __Column Labels:__ (Only for DataFrames) Column labels can be accessed using `obj.columns`, `obj.columns.values`, `obj.columns.values.tolist()`, `obj.columns.tolist()`
> 4. __Name Labels:__ (Only for Series, unless accessing specific columns) Name labels can be accessed using `obj.name`
> 5. __Data Values:__ Data values can be accessed using `obj.values` 
> 6. __Data Type:__ Data types can be accessed using `obj.dtypes`
> 7. __Data Quick Look:__  Data quick looks can be accessed using `obj.head(n)`, `obj.tail(n)`
> 8. __Data Summary:__ A data summary can be produced using `obj.describe()`

__Helpful Points:__
1. It is possible to inspect both Series and DataFrames, however DataFrames have some additional functions for inspection given that they have an additional dimension and attributes 
2. Pandas Series and DataFrames will be explored separately below

__Practice:__ Examples of Inspecting Pandas Data Structures in Python 

## Part 1 (Inspecting Pandas Series):

In [None]:
my_series_1 = pd.Series([10, 3.2, 6, 1], index = ["row_1", "row_2", "row_3", "row_4"], name = "Series_1")
my_series_1

### Dimensions

In [None]:
len(my_series_1)

In [None]:
my_series_1.shape

In [None]:
my_series_1.size

### Row Labels

In [None]:
print(my_series_1.index)
print(list(my_series_1.index))

### Name Labels

In [None]:
my_series_1.name

### Data Values

In [None]:
print(my_series_1.values)
type(my_series_1.values)

### Data Type

In [None]:
my_series_1.dtypes 

### Data Quick Look

In [None]:
my_series_1.head(2)

In [None]:
my_series_1.tail(3)

In [None]:
my_series_1.sample(2)

### Data Summary

In [None]:
my_series_1.describe()

## Part 2 (Inspecting Pandas DataFrames):

In [None]:
my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)
my_df

### Dimensions

In [None]:
len(my_df)

In [None]:
my_df.shape

In [None]:
my_df.size

### Row Labels

In [None]:
my_df.index

### Column Labels

In [None]:
my_df.columns

In [None]:
my_df.columns.values

In [None]:
my_df.columns.values.tolist()

In [None]:
my_df.columns.tolist()

### Data Values

In [None]:
my_df.values

### Data Type

In [None]:
my_df.dtypes

### Data Quick Look

In [None]:
my_df.head(2) # first 2 rows 

In [None]:
my_df.tail(1) # last row 

In [None]:
my_df.sample(2) # choose any 2 rows randomly (run the cell multiple times and you will see different result)

### Data Summary

Only summarizes the numeric variables.

In [None]:
my_df.describe() # notice it only does this for the columns that are of dtype = int64
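
When the non-numeric columns should be summarized as well, `describe` accepts an `include` argument. A minimal sketch on made-up data (`sample_df` is a hypothetical name):

```python
import pandas as pd

# Hypothetical DataFrame with one text column and one numeric column
sample_df = pd.DataFrame({"First": ["Ann", "Bo", "Cy"], "Age": [31, 45, 28]})

# Default behaviour: only the numeric column is summarized
numeric_summary = sample_df.describe()

# include="all" adds count/unique/top/freq rows for the text column
full_summary = sample_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```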

### Info

In [None]:
my_df.info()

# Indexing Pandas Data Structures:

__Overview:__ 
- Pandas offers many different techniques to index (access) elements within Data Structures which can be found [here](http://pandas.pydata.org/pandas-docs/stable/indexing.html)
- There are 6 main ways of indexing Pandas data structures:
> 1. __Select by Column Name:__ The syntax of this method is `df[col_name]` and it returns a `Series` object
> 2. __[Select by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label):__ This describes __purely label-based indexing__, which selects data based on the labels in the index of the object. The syntax of this method (for DataFrames) is `df.loc[row_label, column_label]`. It returns a scalar when both arguments are single labels, a `Series` when exactly one of them is a single label, and a `DataFrame` otherwise. The following provides a list of the possible arguments for the `row_label` and/or `column_label`:
>> a. A single label (i.e. `5` or `'a'`; the number `5` is interpreted as a label and NOT as an integer position, use `.iloc` for positions)<br>
>> b. A list or array of labels (i.e. `['a', 'b', 'c']`)<br>
>> c. A slice object with labels (i.e. `'a':'f'`; unlike other slices in Python, the `stop` label is included in the slice)<br>
>> d. A boolean array 
>> d. A boolean array 
> 3. __[Select by Integer Location (Position)](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer):__ This describes __purely integer-based indexing__, which requires an integer (0-based indexing) for input. The syntax of this method is (for DataFrames) `df.iloc[row_number, column_number]`. It returns a scalar when both arguments are single integers, a `Series` when exactly one of them is a single integer, and a `DataFrame` otherwise. The following provides a list of the possible arguments for `row_number` and/or `column_number`:
>> a. An integer (i.e. `5`)<br>
>> b. A list or array of integers (i.e. `[4, 3, 0]`)<br>
>> c. A slice object with integers (i.e. `1:7`)<br>
>> d. A boolean array 
> 4. __[Slicing Ranges](http://pandas.pydata.org/pandas-docs/stable/indexing.html#slicing-ranges):__ The syntax of this method is (for DataFrames) `df[row_number_1:row_number_2]` and it returns a `DataFrame` object. The `[` and `]` operator is responsible for the slicing and this ONLY operates on rows
> 5. __[Boolean Indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing):__ The syntax of this method is (for DataFrames) `df[boolean_vector]` and it returns a `DataFrame` object. Conditions are combined using the operators `|` for `or`, `&` for `and`, and `~` for `not`, and each condition must be wrapped in parentheses
> 6. __[Indexing with isin](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-with-isin):__ The syntax of this method is `obj.isin(values)` and it returns a boolean object of the same shape as `obj` (a boolean `Series` for a Series, a boolean `DataFrame` for a DataFrame) that is `True` wherever an element exists in the passed values 

__Helpful Points:__
1. Recall in previous lectures how we performed indexing on Lists, Strings, and other ordered sequence types. The process with Pandas Data Structures is quite similar, with a few minor syntactical differences
2. Both Series and DataFrames can be indexed. Naturally, Series offer fewer options because they have only one dimension

__Practice:__ Examples of Indexing Pandas Data Structures in Python 

## Part 1 (Select by Column Name):

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)
employee_df

### Access a Column by Name - Method 1

In [None]:
employee_df["Last"] # 2nd column

In [None]:
type(employee_df["Last"]) # returns a Series object 

### Access a Column by Name - Method 2

This option is limited to column names that do not have blank spaces.

In [None]:
employee_df.Last # 2nd column 

In [None]:
type(employee_df.Last) # returns a Series object

In [None]:
test = pd.DataFrame([{'First 1':'Roberto','Last_1':'Reif'},{'First 1':'Gordon','Last_1':'Dri'}])
test

In [None]:
test['First 1']
# test.First 1 would be a SyntaxError because the column name contains a space

In [None]:
test['Last_1']
# test.Last_1

### Access Multiple Columns by Names

In [None]:
employee_df[["First", "Last"]] # 1st and 2nd column

In [None]:
type(employee_df[["First", "Last"]]) # returns a DataFrame object

## Part 2 (Select by Label):

Recall that any `pandas` object (Series and DataFrames) has an index. The elements in these indexes are called _labels_.

### Access with Single Label - Series

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])
series_1

In [None]:
series_1.loc["b"] # 2nd row value 

In [None]:
series_1.loc[1] # 2nd row value???

The above will output a (rather lengthy) error message (in times like these, just skip to the very bottom for the actual error info). The reason this results in an error is because when using `loc`, we are indexing by row label so the number `1` is interpreted as a row label and there is clearly not a row label named `1`.

In [None]:
series_1.loc[4] # no error 

However, there IS a row label named `4`, therefore we can index with the integer 4. Make sure you understand though, this is NOT integer indexing and the argument `4` is interpreted as a row label and NOT an integer position. 
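
The difference is easiest to see when the labels are themselves integers in a different order than the positions. A minimal sketch on a made-up Series (`int_labeled` is a hypothetical name):

```python
import pandas as pd

# Integer labels deliberately out of order: label 0 sits at position 1
int_labeled = pd.Series(["x", "y", "z"], index=[2, 0, 1])

# loc looks up the LABEL 0
by_label = int_labeled.loc[0]

# iloc looks up the POSITION 0 (the first row)
by_position = int_labeled.iloc[0]

print(by_label, by_position)
```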

### Access with Single Label - DataFrame

In [None]:
employee_df

In [None]:
employee_df.loc["Employee_1"] # 1st row 

In [None]:
employee_df.loc["Employee_1", "First"] # 1st row, 1st column 

### Access with List of Labels - Series

In [None]:
series_1

In [None]:
series_1.loc[["a", "c", 4]] # 1st row, 3rd row, last row 

### Access with List of Labels - DataFrame

In [None]:
employee_df.loc[["Employee_1", "Employee_3"]] # 1st row and last row

In [None]:
employee_df.loc[["Employee_1", "Employee_3"], ["Last", "Age"]] # 1st row and last row and last two columns 

### Access with Slice Object - Series

In [None]:
series_1.loc["b":"d"] # 2nd to 4th row, inclusive 

### Access with Slice Object - DataFrame

In [None]:
employee_df.loc["Employee_2":"Employee_3"] # 2nd and 3rd rows, inclusive

Notice how the `stop` argument in the slice `"Employee_2":"Employee_3"` is included in the result 

In [None]:
employee_df.loc["Employee_2":"Employee_3", "Last":] # 2nd and 3rd rows, inclusive, and 2nd and 3rd columns

Notice how the open-ended slice `"Last":` starts at `Last` and runs through the final column. With label-based slices, both endpoints are included, which is not the same as regular slicing in Python 
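
To compare the two slicing conventions side by side, here is a minimal sketch on a made-up Series (`letters` is a hypothetical name):

```python
import pandas as pd

letters = pd.Series([3, 4, 5, 6, 7], index=["a", "b", "c", "d", "e"])

# Label-based slice: the stop label "d" IS included
by_label = letters.loc["b":"d"]

# Position-based slice: the stop position 3 is NOT included
by_position = letters.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```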

In [None]:
employee_df.loc["Employee_2":"Employee_3", ["First", "Age"]] # mix slice object with list of labels

### Access with Boolean Array - Series

In [None]:
series_1

In [None]:
series_1.loc["a"] > 5

In [None]:
series_1 > 5

In [None]:
series_1[series_1 > 5]

### Access with Boolean Array - DataFrame

In [None]:
employee_df

In [None]:
employee_df.loc[:, "Age"] > 30

In [None]:
employee_df.loc[employee_df.loc[:, "Age"] >30]

### Access Scalar Values

Indexing the traditional way as we saw above is very versatile, but that comes with a downside - which is that it takes a bit of time to figure out which of the many options is being asked for. If you only want to access a __[scalar value](https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#arrays-scalars)__ using label indexing, the fastest way is to use the `at` accessor. It works in the same way as `loc`, in that it requires labels as arguments and not integer positions, but it only accepts a single row label and a single column label. 

In [None]:
employee_df.loc["Employee_1", "Age"] # method 1

In [None]:
employee_df.at["Employee_1", "Age"] # method 2

In [None]:
employee_df.loc["Employee_1", "Last":] # non scalar value is okay for .loc 

In [None]:
employee_df.at["Employee_1", "Last":] # non scalar value is not okay for .at

## Part 3 (Select by Integer Location):

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])
series_1

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)
employee_df

### Access with Integer - Series

In [None]:
series_1.iloc[2] # 3rd row 

In [None]:
series_1.loc["c"] # same result as above using label-based indexing 

### Access with Integer - DataFrame

In [None]:
employee_df.iloc[1, 2] # 2nd row, 3rd column

In [None]:
employee_df.loc["Employee_2","Age"] # same result as above using label-based indexing 

### Access with List of Integers - Series

In [None]:
series_1.iloc[[1,2,4]] # 2nd, 3rd and 5th rows

### Access with List of Integers - DataFrame

In [None]:
employee_df.iloc[[0,2], [1,2]] # 1st row and 3rd row, 2nd and 3rd columns

### Access with Slice Object - Series

In [None]:
series_1.iloc[2:4] # 3rd and 4th rows

In [None]:
series_1.iloc[2:10] # out-of-bound indexing is handled gracefully like we saw in Lecture 2 

In [None]:
series_1.iloc[10:] # out-of-bound indexing is handled gracefully like we saw in Lecture 2 

### Access with Slice Object - DataFrame

In [None]:
employee_df.iloc[1:, :] # 2nd and 3rd rows, all columns

In [None]:
employee_df.iloc[1:, [0,1,2]] # mix of slice object and list of integers

In [None]:
employee_df.iloc[:, 2:8] # out-of-bound indexing is handled gracefully like we saw in Lecture 2

### Access with Boolean Array - Series

In [None]:
series_1

In [None]:
series_1.iloc[2] > 1

### Access with Boolean Array - DataFrame

In [None]:
employee_df.iloc[:, 2] > 35

### Access Scalar Values

Indexing the traditional way as we saw above is very versatile, but that comes with a downside - which is that it takes a bit of time to figure out which of the many options is being asked for. If you only want to access a __[scalar value](https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#arrays-scalars)__ with integer indexing, the fastest way is to use the `iat` accessor. It works in the same way as `iloc`, in that it requires integer positions as arguments and not labels, but it only accepts a single row position and a single column position. 

In [None]:
employee_df

In [None]:
employee_df.iloc[2, 2] # method 1

In [None]:
employee_df.iat[2, 2] # method 2

In [None]:
employee_df.iloc[0:2, [0,2]] # non scalar value is okay for .iloc 

In [None]:
employee_df.iat[0:2, [0,2]] # non scalar value is not okay for .iat
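
The two scalar accessors can be compared directly. This is a minimal sketch on made-up data (`staff_df` and its values are hypothetical, not the contents of the lecture's CSV file):

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
staff_df = pd.DataFrame(
    {"First": ["Ann", "Bo"], "Age": [31, 45]},
    index=["Employee_1", "Employee_2"],
)

# at takes one row LABEL and one column LABEL
age_by_label = staff_df.at["Employee_2", "Age"]

# iat takes one row POSITION and one column POSITION
age_by_position = staff_df.iat[1, 1]

# Both agree with the more general loc and iloc
same_as_loc = age_by_label == staff_df.loc["Employee_2", "Age"]
same_as_iloc = age_by_position == staff_df.iloc[1, 1]

print(age_by_label, age_by_position, same_as_loc, same_as_iloc)
```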

## Part 4 (Slicing Ranges):

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])
series_1

### Slicing Ranges - Series

In [None]:
# start = 2, stop = 4, step = 1
series_1[2:4] # 3rd and 4th rows (the stop position is not included) 

In [None]:
# start = 0, stop = end of the Series, step = 2
series_1[::2] # every other row

In [None]:
# start = last row, stop = first row (included), step = -1 
series_1[::-1] # reverse the rows 

### Slicing Ranges - DataFrames

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)
employee_df

In [None]:
employee_df[1:2] # 2nd row, returns a DataFrame

In [None]:
employee_df.iloc[1] # 2nd row, returns a Series

In [None]:
employee_df[::2] # every other row 

In [None]:
employee_df[::-1] # reverse the rows 

Notice that when slicing ranges using the `[` and `]` operator, without `loc` or `iloc`, the slicing acts on the row and was designed like this since it is such a common operation. Note that you can't access rows AND columns here as we are just subsetting rows. If you would like to access rows AND columns, use `iloc` or `loc`.
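
The three cases can be put next to each other in a short sketch. The data is made up and `staff_df` is a hypothetical name:

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
staff_df = pd.DataFrame(
    {"First": ["Ann", "Bo", "Cy"], "Age": [31, 45, 28]},
    index=["Employee_1", "Employee_2", "Employee_3"],
)

# A slice inside [] selects ROWS and returns a DataFrame
first_two_rows = staff_df[0:2]

# A column name inside [] selects a COLUMN and returns a Series
age_column = staff_df["Age"]

# Rows AND columns together need iloc (or loc)
first_two_ages = staff_df.iloc[0:2, [1]]

print(first_two_rows.shape, age_column.shape, first_two_ages.shape)
```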

## Part 5 (Boolean Indexing):

### Boolean Indexing - Series

In [None]:
series_1

In [None]:
series_1[series_1 > 4] # all rows greater than 4

In [None]:
series_1[(series_1 > 3) & (series_1 < 5)] # all rows greater than 3 AND less than 5

In [None]:
series_1[~(series_1 == 4)] # all rows not equal to 4

### Boolean Indexing - DataFrames

In [None]:
employee_df[employee_df["Age"] > 30] # all rows with age greater than 30

In [None]:
employee_df[(employee_df["Age"] > 30) & (employee_df["Age"] < 50)] # all rows with age greater than 30 and less than 50

In [None]:
employee_df[~(employee_df["First"] == "Gordon")] # all rows where first name is not equal to Gordon 

In [None]:
employee_df[employee_df["First"] != "Gordon"] # all rows where first name is not equal to Gordon 
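
The `|` (or) operator follows the same pattern. In the sketch below the data is made up (`ages_df` is a hypothetical name). The parentheses matter because `|` and `&` bind more tightly than comparison operators such as `<` and `>`, so leaving them out changes how Python groups the expression.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
ages_df = pd.DataFrame({"First": ["Ann", "Bo", "Cy"], "Age": [31, 45, 28]})

# Rows where Age is below 30 OR above 40; each condition is parenthesized
young_or_old = ages_df[(ages_df["Age"] < 30) | (ages_df["Age"] > 40)]

print(young_or_old)
```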

## Part 6 (Using `isin`):

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])
series_1

### Using `isin` - Series

In [None]:
series_1.isin([3,6,7])

In [None]:
series_1[series_1.isin([3,6,7])]

### Using `isin` - DataFrames

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)
employee_df

In [None]:
values = ["Gordon", "Roberto", "Reif", 31] # criteria spans multiple columns, find rows that meet this multi-part criteria 
employee_df.isin(values)

In [None]:
values = {'First': ['Gordon', 'Jerod'], 'Age': [31]} # match certain values with certain columns
employee_df.isin(values)
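
Since `isin` on a DataFrame returns a boolean DataFrame, an extra step is needed to turn it into a row filter. The sketch below shows two possible approaches on made-up data (`staff_df` is a hypothetical name):

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
staff_df = pd.DataFrame({"First": ["Ann", "Bo", "Cy"], "Age": [31, 45, 28]})

# Filter rows by one column: isin on a Series gives a boolean Series
selected = staff_df[staff_df["First"].isin(["Ann", "Cy"])]

# Filter rows where ANY column matches: collapse the boolean DataFrame by row
any_match = staff_df[staff_df.isin(["Ann", 45]).any(axis=1)]

print(selected)
print(any_match)
```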

# Simple Manipulation of Pandas Data Structures:

__Overview:__
- Pandas offers many useful ways to manipulate both Series and DataFrames:

> 1. __Setting Values:__ Pandas offers a few different methods of setting values:
>> a. Using __Slicing Ranges__ (i.e. `df[:5] = 0`)<br>
>> b. Using __Labels__ (i.e. `df.loc["a":] = 0` for a range, or `df.at["a", "col_name"] = 0` for a single value)<br>
>> c. Using __Position__ (i.e. `df.iloc[:3] = 0` for a range, or `df.iat[0, 0] = 0` for a single value)<br>

> 2. __Manipulating Rows:__ Pandas offers many different methods to manipulate rows:

>> a. __Changing Row Index:__ 
>> > 1. Renaming the row index (`df.rename(index=)`)
>> > 2. Setting the row index back to 0-based indexing (`df.reset_index(drop=True)`), etc.
>> > 3. Re-indexing labels (`df.reindex(index=)`) 

>> b. __Adding Rows:__
>> > 1. Inserting row within the DataFrame (`pd.concat()`)
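>> > 2. Adding row at the bottom of the DataFrame (`df.loc[len(df)] = val`, or `df.append()` in older versions; `append` was deprecated in pandas 1.4 and removed in pandas 2.0, where `pd.concat()` is used instead)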

>> c. __Removing Rows:__
>> > 1. Removing rows based on row index (`df.drop(df.index[])` or `df.drop([row_name])`)

>> d. __Sorting Rows:__
>> > 1. Sorting rows based on column values (`df.sort_values(by = ["col_name_1", "col_name_2"])`)
>> > 2. Sorting rows based on labels (`df.sort_index(axis=0)`)

> 3. __Manipulating Columns:__ Pandas offers many different methods to manipulate columns:

>> a. __Changing Column Names:__ 
>> > 1. Explicitly assign the column names (`df.columns = ["new_col_1", "new_col_2"]`)
>> > 2. Renaming the columns (`df.rename(columns=)`)
>> > 3. Re-indexing column names (`df.reindex(columns=)`) 

>> b. __Adding Columns:__
>> > 1. Inserting column within the DataFrame (`df.insert(column_position, column = "new_col_name", value = )`)
>> > 2. Adding column to the end of the DataFrame (`df["new_col_name"] = pd.Series()`)

>> c. __Removing Columns:__
>> > 1. Deleting column using `del` (`del df["col_name"]`)
>> > 2. Deleting column using `pop` (`df.pop("col_name")`)
>> > 3. Removing columns using `drop` (`df.drop(["col_name_1", "col_name_5"], axis = 1)`)

>> d. __Sorting Columns:__
>> > 1. Sorting columns based on column names (`df.sort_index(axis=1)`)

__Helpful Points:__
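1. Remember, there is often more than one way to do something in Python, and this is especially true when changing, adding, and removing rows/columns in Pandas Series and DataFrames 
2. Don't worry if you can't remember all these possibilities now. Come back and review this list as needed
3. Most of the methods mentioned above (such as `rename`, `reset_index`, `reindex`, `drop`, `sort_values`, and `sort_index`) return a new object by default and leave the original unchanged, so assign the result to a variable to keep it. Assignment through indexing, `del`, `pop`, and `insert` act "in-place", which means they perform their operation on the actual object itself 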
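
To check which operations change the object itself and which return a new object, a short experiment helps. This is a minimal sketch on made-up data (`demo_df` is a hypothetical name):

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
demo_df = pd.DataFrame({"a": [3, 1, 2], "b": [9, 8, 7]})

# sort_values returns a NEW DataFrame; demo_df keeps its original order
sorted_copy = demo_df.sort_values(by="a")
original_order = demo_df["a"].tolist()

# Assigning a column and using del both change demo_df itself
demo_df["c"] = 0
del demo_df["b"]

# rename also returns a new object, so the result is assigned back
demo_df = demo_df.rename(columns={"a": "A"})

print(sorted_copy)
print(demo_df)
```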

__Practice:__ Examples of Simple Row/Column Manipulation of Pandas Series and DataFrames in Python 

## Part 1 (Setting Values):

### Setting Values - Slicing Ranges (Series)

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])
series_1

In [None]:
series_1[:2] = 100 # setting values in first 2 rows 
series_1

### Setting Values - Slicing Ranges (DataFrames)

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df

In [None]:
employee_df[0:2] = "New" # setting values in first 2 rows
employee_df

### Setting Values - By Labels (Series):

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])

In [None]:
series_1.loc["a":"d"] = 5 # setting values in first 4 rows using loc 
series_1

In [None]:
series_1.at[4] = 100 # setting scalar value in last row using at 
series_1

### Setting Values - By Labels (DataFrame):

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.loc["Employee_2":"Employee_3"] = "New" # setting values in last 2 rows using loc 
employee_df

In [None]:
employee_df.at["Employee_1", "First"] = "New" # setting a single value (row label, column label) using at
employee_df
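
The sketch below contrasts `.loc` with `.at` on a DataFrame: `.loc` with only a row label addresses the whole row, while `.at` addresses a single cell and takes both a row label and a column label. The table is a hypothetical stand-in with made-up names; its columns are all strings so that assigning the string `"New"` does not change any column's dtype.

```python
import pandas as pd

# Hypothetical stand-in table (string columns only)
df = pd.DataFrame(
    {"First": ["Ann", "Bob"], "Last": ["Lee", "Ray"]},
    index=["Employee_1", "Employee_2"],
)

# .loc with a row label sets every column in that row
df.loc["Employee_1"] = "New"

# .at sets exactly one cell, identified by (row label, column label)
df.at["Employee_2", "First"] = "New"

row_1 = df.loc["Employee_1"].tolist()
row_2 = df.loc["Employee_2"].tolist()
print(df)
```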

### Setting Values - By Position (Series):

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])

In [None]:
series_1.iloc[:2] = 21 # setting values in first 2 rows using iloc 
series_1

In [None]:
series_1.iat[3] = 100 # setting value in 4th row using iat
series_1

### Setting Values - By Position (DataFrame):

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.iloc[2:] = "New" # setting values from the 3rd row onward using iloc
employee_df

## Part 2 (Manipulating Rows):

### Index Rename - Series

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rename.html

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])
series_1

In [None]:
series_1.rename({"a":"A", "b":"B", "c":"C", "d":"D", 4:"four"}) # map old index to new index 

### Index Rename - DataFrame

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.rename({"Employee_1":"Employee1", "Employee_2":"Employee2"}) # map old row index to new row index (returns a new DataFrame)

### Index Reset - Series

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.reset_index.html

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])
type(series_1)

In [None]:
series_1.reset_index() # resets index and adds a column of old index (returns dataframe), and it is not in place

Note that the output is now a DataFrame: the old index has become a regular column next to the column of values 
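
A minimal sketch of the two common ways to call `reset_index` on a Series, using the same `series_1` as above. The `drop=True` variant is not used elsewhere in this lecture; it discards the old labels instead of keeping them as a column.

```python
import numpy as np
import pandas as pd

series_1 = pd.Series(np.arange(3, 8), index=["a", "b", "c", "d", 4])

# Default: returns a DataFrame holding the old index and the values
as_frame = series_1.reset_index()

# drop=True: returns a Series with a fresh 0..n-1 index, old labels discarded
as_series = series_1.reset_index(drop=True)

print(type(as_frame).__name__, list(as_frame.columns))
print(type(as_series).__name__, list(as_series.index))
```

In both cases `series_1` itself keeps its original index, because the result is a new object.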

### Index Reset - DataFrame

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html#pandas.DataFrame.reset_index

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.reset_index() # resets index and adds a column of old index. It is not in place.

In [None]:
employee_df

### Re-index - Series

We can also re-index labels. Note that this can delete data from your series if you aren't careful.

- Labels in the new index that appear in the old index will preserve the original data. 
- New labels will appear with missing values in the data. 
- Old labels that do not appear in the new index will drop data from the series.

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.reindex.html

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])

In [None]:
new_index = ["a", "b", "New_1", "New_2", 4] # change some indexes 
series_1.reindex(new_index)
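
The three re-indexing rules can be checked directly. This is a small sketch using the same `series_1` and `new_index` as the cell above; the `fill_value` argument at the end is an optional extra that replaces the missing values with a chosen default.

```python
import numpy as np
import pandas as pd

series_1 = pd.Series(np.arange(3, 8), index=["a", "b", "c", "d", 4])
new_index = ["a", "b", "New_1", "New_2", 4]
reindexed = series_1.reindex(new_index)

# Rule 1: labels present in both indexes keep their data
kept_value = reindexed.loc["a"]

# Rule 2: brand-new labels are filled with missing values
n_missing = int(reindexed.isna().sum())

# Rule 3: old labels absent from the new index are dropped
dropped_labels = [label for label in series_1.index if label not in reindexed.index]

# Optional: choose the value used for new labels instead of NaN
filled = series_1.reindex(new_index, fill_value=0)

print(kept_value, n_missing, dropped_labels)
```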

### Re-index - DataFrame

(Same principles apply for losing data or adding missing values, but this time across all columns when a new index is added -- indexes in DataFrames point to entire rows.)

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)
employee_df

In [None]:
new_index = ["Employee_1", "Employee_2", "Employee_3", "Employee_4"]
employee_df.reindex(new_index) # change some indexes

### Using `concat` to Add Rows to a DataFrame

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)
employee_df

In [None]:
new_entry_list = [{"First":"Jason", "Last":"Moss", "Age":34}]
new_entry_df = pd.DataFrame(new_entry_list, index = ["Employee_4"], columns = employee_df.columns) # use a label not already in the index
new_entry_df

In [None]:
# form a new dataframe which is a concatenation of 2 separate dataframes 
employee_df = pd.concat([employee_df, new_entry_df])

employee_df

### Using `append` to Add Rows to a DataFrame

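Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html

Note: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells in this section only run on older versions. On current versions, use `pd.concat` as in the previous section.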

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
new_entry_list = [{"First":"Jason", "Last":"Moss", "Age":34}]
new_entry_df = pd.DataFrame(new_entry_list, index = ["Employee_4"], columns = employee_df.columns)
new_entry_df

In [None]:
# append new dataframe to bottom 
employee_df = employee_df.append(new_entry_df)
employee_df

In [None]:
new_entry_series = pd.Series(["Paul", "Trowbridge", 29], index = ["First", "Last", "Age"])
new_entry_series

In [None]:
# append new series to bottom
employee_df = employee_df.append(new_entry_series, ignore_index = True)
employee_df
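
On pandas 2.0 and later, where `DataFrame.append` no longer exists, the Series example above can be reproduced with `pd.concat`. The sketch below is one way to do it, assuming a hypothetical stand-in table with made-up names rather than the real CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the employee table
employee_df = pd.DataFrame(
    {"First": ["Ann", "Bob"], "Last": ["Lee", "Ray"], "Age": [30, 41]},
    index=["Employee_1", "Employee_2"],
)

new_entry_series = pd.Series(["Paul", "Trowbridge", 29], index=["First", "Last", "Age"])

# Turn the Series into a one-row DataFrame, then concatenate.
# ignore_index=True replaces the row labels with 0, 1, 2, ...
new_row_df = new_entry_series.to_frame().T
employee_df = pd.concat([employee_df, new_row_df], ignore_index=True)

print(employee_df)
```

As with `append(..., ignore_index = True)`, the original `Employee_N` labels are lost in the result.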

### Using `.loc` to Add Rows to a DataFrame

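Assigning to a row label that does not exist yet with `.loc` adds a new row. When a DataFrame has the default integer index (`0` to `len(df)-1`), `len(df)` is the next unused label, so `df.loc[len(df)]` adds a row below the last one. Here `employee_df` is indexed by employee labels, so the new row gets the integer label `3`, which we then rename.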

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.loc[len(employee_df)] = ["Paul", "Trowbridge", 40]
employee_df

In [None]:
employee_df.rename(index = {3:"Employee_4"}) # give the new row a label that matches the others (returns a new DataFrame)
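
A short sketch of adding rows with `.loc` in both situations: a default integer index, where `len(df)` works as the new label, and a string index, where the new label can be assigned directly so no rename is needed. Both tables are hypothetical stand-ins with made-up values.

```python
import pandas as pd

# Default integer index: len(df) is the next unused label
df_default = pd.DataFrame({"First": ["Ann", "Bob"], "Age": [30, 41]})
df_default.loc[len(df_default)] = ["Cy", 25]

# String index: assign straight to the new label
df_labeled = pd.DataFrame(
    {"First": ["Ann", "Bob"], "Age": [30, 41]},
    index=["Employee_1", "Employee_2"],
)
df_labeled.loc["Employee_3"] = ["Cy", 25]

print(df_default)
print(df_labeled)
```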

### Dropping Elements - Series

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.drop.html

In [None]:
series_1 = pd.Series(np.arange(3,8), index = ["a", "b", "c", "d", 4])

In [None]:
series_1 = series_1.drop(["a"]) # delete the "a" row 
series_1

In [None]:
series_1 = series_1.drop(series_1.index[0]) # delete the first remaining label (the "b" row)
series_1

### Dropping Rows - DataFrames

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df = employee_df.drop(["Employee_2"]) # delete the "Employee_2" row 
employee_df

In [None]:
employee_df = employee_df.drop(employee_df.index[1]) # delete the second remaining label (the "Employee_3" row)
employee_df

### Dropping Columns - DataFrames

We can also use `.drop` to remove columns from a DataFrame, by setting the `axis` parameter to 1.

In [None]:
employee_df = employee_df.drop(['Last'], axis=1)
employee_df

### Sorting Rows by Values in a Column - DataFrames

Documentation: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.sort_values.html

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.sort_values(by = "Age") # sort by Age column in ascending fashion

In [None]:
employee_df.sort_values(by = "Last", ascending = False) # sort by Last column in descending fashion

### Sorting Rows by Index - DataFrames

Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.sort_index(axis=0, ascending=False) # sort rows

## Part 3 (Manipulating Columns):

### Renaming All Columns with a List

Method 1: explicitly assign all the column names (`df.columns = ["new_col_1", "new_col_2"]`)

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.columns = ["First_Name", "Last_Name", "Age_Years"]
employee_df

### Renaming Some Columns With A Dictionary

Method 2: rename selected columns (`df.rename(columns=)`)

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.rename(columns = {"First":"First_Name", "Last":"Last_Name"}) # map old column names to new column names

### Changing Column Names with `reindex`

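Method 3: re-index the columns (`df.reindex(columns=)`). Note that `reindex` does not rename anything: columns whose names appear in the new list keep their data, new names are added as columns of missing values, and columns left out of the list are dropped.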

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
new_columns = ["First", "First_Name", "Last", "Last_Name", "Age", "Age_Years"]
employee_df.reindex(columns=new_columns)
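
To make the difference between `rename` and `reindex` concrete, the sketch below applies both to a one-row, hypothetical stand-in table (the values are made up).

```python
import pandas as pd

# Hypothetical stand-in for the employee table
df = pd.DataFrame(
    {"First": ["Ann"], "Last": ["Lee"], "Age": [30]},
    index=["Employee_1"],
)

# rename: same data, new label
renamed = df.rename(columns={"First": "First_Name"})

# reindex: "First" keeps its data, "First_Name" is a new all-missing column,
# and "Last" and "Age" are dropped because they are not in the list
reindexed = df.reindex(columns=["First", "First_Name"])

print(renamed)
print(reindexed)
```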

### Inserting Columns

Method 1: insert a column at a given position within the DataFrame (`df.insert(column_position, column = "new_col_name", value = )`)

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.insert(2, column = "Salary", value = [100, 200, 150])
employee_df

### Adding New Columns

Method 2: add a column to the end of the DataFrame (`df["new_col_name"] = pd.Series()`)

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
new_series = pd.Series([100,200,150], index = employee_df.index)
new_series

In [None]:
employee_df["Salary"] = new_series
employee_df

### Removing Columns with `del`

Method 1: delete a column using `del` (`del df["col_name"]`)

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
del employee_df["Age"]
employee_df

### Removing Columns with `pop`

Method 2: delete a column using `pop`, which also returns the removed column (`df.pop("col_name")`)

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)
employee_df

In [None]:
last = employee_df.pop("Last")
print(last)
employee_df

### Removing Columns with `drop`

Method 3: remove columns using `drop` (`df.drop(["col_name_1", "col_name_5"], axis = 1)`)

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.drop(["First", "Last"], axis = 1)

### Sorting Column Names 

Method 1: sort columns by column name (`df.sort_index(axis=1)`)

In [None]:
employee_df = my_df = pd.read_csv("csv_file_for_pandas.csv", index_col = 0)

In [None]:
employee_df.sort_index(axis=1)