In [None]:
https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-part-4-c4216f84d388
    
https://www.practicaldatascience.org/html/views_and_copies_in_pandas.html

<img align="right" width="300" src="libraries_short_color.png" alt="NYU Libraries Logo">

# Getting Started with Python Pandas

**Nicholas Wolf**<br/>
[ORCID 0000-0001-5512-6151](https://orcid.org/0000-0001-5512-6151)

This lesson is licensed under a [Creative Commons Attribution-NonCommercial 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).

**Overview**


**Materials**

 - OpenRefine, and a good text editor such as [TextWrangler](https://apps.apple.com/us/app/textwrangler/id404010395?mt=12) (for Mac) or [Notepad++](https://notepad-plus-plus.org/downloads/) (for Windows).

### 1. Using Pandas...and NOT using Pandas

Pandas is a powerful tool, especially for users with a background in other statistical software who are looking for a way to work with tabular data. But it isn't the only (or in some cases even the best) means of data munging or data analysis in Python, particularly for large data.

For example, note the respective size of these two Python objects:

In [68]:
import pandas as pd

# Create a 900 x 900 table of integers and store it as a simple Python list of list-rows:

list_lists = [list(range(0,900)) for i in range(0,900)]

# Make a Pandas dataframe out of that same table

df_list_lists = pd.DataFrame(list_lists)

# Note the size difference in memory of these two objects. This is size in bytes

print(list_lists.__sizeof__())
print(df_list_lists.__sizeof__())

7960
6480080


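Measured this way, the Python list of lists looks considerably smaller in bytes than the dataframe. Keep in mind, though, that `__sizeof__()` on a list counts only the outer container (the 900 references to its rows), not the inner lists or the integers they hold, while the dataframe's figure includes its data. The two numbers are therefore not directly comparable.

Users do periodically run into memory problems when reading large tables into a Pandas dataframe, since the whole table is typically held in memory at once. A sense of these problems and common workarounds can be found on this [Stack Overflow thread](https://stackoverflow.com/questions/11622652/large-persistent-dataframe-in-pandas).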
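
As a rough sketch of a fuller comparison: since `__sizeof__()` on a list counts only the outer container, the sizes of the inner row lists can be added with `sys.getsizeof`, and the dataframe can be measured with `memory_usage(deep=True)`. Neither call appears in the cell above, and the variable names `shallow_size`, `with_rows_size`, and `df_size` are made up for illustration. The list figure still ignores the integer objects themselves, so treat the numbers as approximate.

```python
import sys
import pandas as pd

list_lists = [list(range(0, 900)) for i in range(0, 900)]
df_list_lists = pd.DataFrame(list_lists)

# Shallow size: only the outer list and its 900 references to the row lists
shallow_size = sys.getsizeof(list_lists)

# Add the size of each inner list (the integer objects are still not counted)
with_rows_size = shallow_size + sum(sys.getsizeof(row) for row in list_lists)

# Pandas reports memory use per column (plus the index); sum it for a total
df_size = int(df_list_lists.memory_usage(deep=True).sum())

print(shallow_size, with_rows_size, df_size)
```

The exact values depend on the Python and Pandas versions in use.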

On the other hand, a Pandas dataframe will start to outperform Python loops for modifying data as our table/matrix gets larger:

In [69]:
# Update the fourth column of our list of lists

def update_list(list_lists):
    new_list_lists = []
    for row in list_lists:
        new_list_lists.append(row[0:3] + [row[3] * 3] + row[4:])
    print(new_list_lists[0][0:5])
    

print(list_lists[0][0:5])

%timeit -n 1 -r 1 update_list(list_lists)

[0, 1, 2, 3, 4]
[0, 1, 2, 9, 4]
22.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [70]:
%timeit -n 1 -r 1 df_list_lists[3] = df_list_lists[3].apply(lambda x: x*3)

df_list_lists.head(5)

2.88 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,890,891,892,893,894,895,896,897,898,899
0,0,1,2,9,4,5,6,7,8,9,...,890,891,892,893,894,895,896,897,898,899
1,0,1,2,9,4,5,6,7,8,9,...,890,891,892,893,894,895,896,897,898,899
2,0,1,2,9,4,5,6,7,8,9,...,890,891,892,893,894,895,896,897,898,899
3,0,1,2,9,4,5,6,7,8,9,...,890,891,892,893,894,895,896,897,898,899
4,0,1,2,9,4,5,6,7,8,9,...,890,891,892,893,894,895,896,897,898,899
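
The `.apply` call above still invokes a Python function once per value. As a hedged sketch, the same update can also be written as a vectorized operation on the whole column. The names `df`, `applied`, and `vectorized` are illustrative only, and timings will vary by machine and Pandas version.

```python
import pandas as pd

df = pd.DataFrame([list(range(0, 900)) for i in range(0, 900)])

# Element-by-element: the lambda is called once for each of the 900 values in the column
applied = df[3].apply(lambda x: x * 3)

# Vectorized: the multiplication is applied to the whole column in one operation
vectorized = df[3] * 3

# Both approaches should produce the same column of values
print(applied.equals(vectorized))
```

In a notebook, `%timeit df[3] * 3` can be compared against the `.apply` version to see the difference on your own machine.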


#### Don't forget: a Pandas dataframe is a special kind of two-dimensional array, and arrays excel at performing matrix-based transformations

In other words, if you simply need a container to "hold" your data, a simple core Python structure is often all you need. But if you need to do full-table transformations, quick statistics, advanced statistics, and relational joins between tables, then Pandas is a great option.

**It is essential if you want to do the steps above AND you have non-uniform data types.**

Unlike arrays in numpy, another commonly used matrix library, Pandas dataframes readily accommodate tables/matrices that mix integers, strings, and other data types, with each column keeping its own type. (Pandas is itself built on top of numpy.)
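
A small sketch of that difference, using made-up column names (`count`, `label`) and values: a numpy array built from mixed values ends up with a single type for every element, while a dataframe records a type per column.

```python
import numpy as np
import pandas as pd

# A numpy array has one dtype for all elements; mixing ints and strings
# converts everything to a single common type (here, strings)
mixed_array = np.array([[1, "a"], [2, "b"]])
print(mixed_array.dtype)

# A dataframe tracks a dtype per column, so the integer column stays integer
mixed_df = pd.DataFrame({"count": [1, 2], "label": ["a", "b"]})
print(mixed_df.dtypes)
```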

### 2. Building a Dataframe: Series

To understand how a dataframe works in Pandas (or any other environment) we can think of the multiple ways we can assemble a two-dimensional table like this:

<img align="left" width="300" src="pandas_table_1.png" alt="A two-dimensional table illustrating how data might be organized"><br/><br/><br/><br/><br/><br/><br/><br/><br/>

Now we might conceive of this table as consisting of three rows, or observations, with each row consisting of elements that are ordered so that they align with a column location that tells us what the value is for any given variable.

But we also might think of a table as consisting of vertical uniform-length columns, each representing the measurement of a single variable across the same number of observations, that are then assembled by stacking them from left to right:

<img align="left" width="350" src="pandas_table_6.png" alt="An image showing how a table is also built out of uniform columns"><br/><br/><br/><br/><br/><br/><br/><br/><br/>

In Python terms, we might think of rows and columns in a table as having some "dictionary-like" qualities, and some "list-like" qualities. For example, the elements of a row can be conceived of as values that are each paired with a key corresponding to our column headers (or variables):

<img align="left" width="800" src="pandas_table_3.png" alt="Image showing how we might think of a table row as an equivalent of a Python key-value dictionary"><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/>

We can also think of the rows and columns as having an index order, like a Python list, so that we might slice and retrieve a column or row (or a value using both column and row) using its index:

<img align="left" width="350" src="pandas_table_2.png" alt="Image showing how we might think of a table as having index ordered rows and columns"><br/><br/><br/><br/><br/><br/><br/><br/><br/>

And we can think of each column as being like a Python list: it has an order, so that once again we might access individual values, and several such lists can be stacked next to each other to form a table:

<img align="left" width="600" src="pandas_table_4.png" alt="Image showing how we might think of a table as consisting of several uniform-length lists placed next to each other"><br/><br/><br/><br/><br/><br/><br/><br/><br/>
 

#### The Pandas Series object

Recognizing these hybrid dictionary- and list-like qualities of the components of a two-dimensional array, the building block of a Pandas dataframe is the Series object.

In [75]:
# We might create a Series from a list:

list_series = pd.Series(["student1", "student2", "student3", "student4"])

list_series

0    student1
1    student2
2    student3
3    student4
dtype: object

Note that our resulting Series has an index, and looks like a 4 x 1 (4 rows x 1 column) array. Let's add a name so that we understand what this column/vector of values refers to:

In [80]:
named_list_series = pd.Series(["student1", "student2", "student3", "student4"], name="student_name")

named_list_series

0    student1
1    student2
2    student3
3    student4
Name: student_name, dtype: object

Our Series can be sliced by index location, much like a list:

In [81]:
named_list_series[0]

'student1'

In [82]:
named_list_series[0:2]

0    student1
1    student2
Name: student_name, dtype: object
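
To illustrate the dictionary-like side as well, here is a minimal sketch that gives a Series explicit labels. The labels `"a"` through `"d"` and the name `labeled_series` are invented for this example and are not part of the lesson data.

```python
import pandas as pd

# Hypothetical labels: the index values "a".."d" are made up for illustration
labeled_series = pd.Series(
    ["student1", "student2", "student3", "student4"],
    index=["a", "b", "c", "d"],
    name="student_name",
)

# Dictionary-like: look a value up by its index label
print(labeled_series["b"])
print(labeled_series.loc["b"])

# List-like: look a value up by its integer position
print(labeled_series.iloc[1])
```

Using `.loc` for labels and `.iloc` for positions keeps the two kinds of lookup explicit, which helps when the index labels are themselves integers.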


Great! In and of itself, though, a Series object isn't that helpful. Putting several together gives us a dataframe. We can do this by instantiating a DataFrame object which has been passed a dictionary of Series, i.e. one or more Series objects identified with a key that will serve as the column header:


In [87]:
year_series = pd.Series([1990, 1991, 1992, 1993])

pop_series = pd.Series([1.5, 1.6, 1.8, 2.0])

population_table = pd.DataFrame({"year":year_series, "pop":pop_series})

population_table

| | year | pop |
|---|---|---|
| 0 | 1990 | 1.5 |
| 1 | 1991 | 1.6 |
| 2 | 1992 | 1.8 |
| 3 | 1993 | 2.0 |


Note that Pandas automatically builds a row index for us, highlighted in bold, on the left-hand side. If we had failed to provide column names, it would have used index numbers to label the columns.

That's all we need to know about the Pandas Series object to get started. Mostly, this is helpful so that we know that when we operate on a single column sliced from a dataframe, we are operating on a Series object.

### 3. Loading a DataFrame

We have several options for how to make a dataframe and start working in Pandas:

1. We can load a tabular data file and allow Pandas to parse it as a dataframe

2. We can instantiate an empty dataframe and append rows or columns in the form of Series objects

3. We can transform a nested Python structure (such as a list of lists or a list of dictionaries) into a dataframe

No matter what approach is taken, I recommend taking some time to set the various parameters of the pd.DataFrame object so that your work on the dataframe later has expected results. This includes setting column names, column order, data types of variables, and (when reading from file) encoding.

Here are examples of all three:

#### Load from CSV/Excel/TSV, etc.
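
A minimal sketch of loading from CSV with some parameters set explicitly. To keep it self-contained, the "file" here is an in-memory string with invented contents; the file name `population.csv` in the comment is hypothetical.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the contents are invented for this example
csv_text = "year,pop\n1990,1.5\n1991,1.6\n1992,1.8\n1993,2.0\n"

population_from_csv = pd.read_csv(
    io.StringIO(csv_text),  # normally a path, e.g. "population.csv" (hypothetical)
    usecols=["year", "pop"],  # which columns to keep
    dtype={"year": "int64", "pop": "float64"},  # data type for each column
)
# When reading a real file, the encoding can be set too, e.g. encoding="utf-8"

print(population_from_csv.dtypes)
```

Excel and tab-separated files follow the same pattern with `pd.read_excel` and `pd.read_csv(..., sep="\t")`.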

In [71]:
type(df_list_lists.iloc[0:1,:])

pandas.core.frame.DataFrame

In [72]:
type(df_list_lists[3])

pandas.core.series.Series

In [74]:
x = pd.Series([0.05, 0.04, 0.03])

x

0    0.05
1    0.04
2    0.03
dtype: float64
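
The other two options listed at the start of this section, building up from an empty dataframe and converting a nested Python structure, might look like the following sketch. The names `built_up`, `rows`, and `from_dicts` are illustrative, and the values reuse the small population table from earlier.

```python
import pandas as pd

# Option 2: start from an empty dataframe and add columns as Series objects
built_up = pd.DataFrame()
built_up["year"] = pd.Series([1990, 1991, 1992, 1993])
built_up["pop"] = pd.Series([1.5, 1.6, 1.8, 2.0])

# Option 3: convert a list of dictionaries (one dictionary per row)
rows = [
    {"year": 1990, "pop": 1.5},
    {"year": 1991, "pop": 1.6},
    {"year": 1992, "pop": 1.8},
    {"year": 1993, "pop": 2.0},
]
# The columns argument fixes the column order explicitly
from_dicts = pd.DataFrame(rows, columns=["year", "pop"])

print(built_up)
print(from_dicts)
```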

In [None]:
### 4. Selecting/filtering rows and columns from a dataframe

.iloc
.loc
.at
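
As a brief, hedged sketch of the three accessors named above, using the small population table from earlier (the variable names `first_years`, `later_rows`, and `pop_1992` are made up for illustration):

```python
import pandas as pd

population_table = pd.DataFrame(
    {"year": [1990, 1991, 1992, 1993], "pop": [1.5, 1.6, 1.8, 2.0]}
)

# .iloc selects by integer position: first two rows, first column
first_years = population_table.iloc[0:2, 0]

# .loc selects by label, and also accepts a boolean filter on the rows
later_rows = population_table.loc[population_table["year"] > 1991, ["year", "pop"]]

# .at retrieves a single value by row label and column label
pop_1992 = population_table.at[2, "pop"]

print(first_years)
print(later_rows)
print(pop_1992)
```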