![pandas logo](https://pandas.pydata.org/_static/pandas_logo.png)
# Pandas

> *`pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.* - [pandas.pydata.org](https://pandas.pydata.org)

### Library Highlights
* <mark>A fast and efficient **DataFrame** object for data manipulation with integrated indexing;</mark>
* <mark>Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;</mark>
* Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
* <mark>Flexible reshaping and pivoting of data sets;</mark>
* <mark>Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;</mark>
* <mark>Columns can be inserted and deleted from data structures for size mutability;</mark>
* <mark>Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;</mark>
* High performance merging and joining of data sets;
* Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
* Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
* Highly optimized for performance, with critical code paths written in Cython or C.
* <mark>Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.</mark>

***
#### Installation
If you use **anaconda** you probably already have `pandas`. It's one of the many packages included in the Anaconda distribution.

If you use **anaconda** and don't have `pandas` you can easily install it with:<br>
`conda install pandas` from a **terminal** or <br>`!conda install pandas -y` from within a `jupyter notebook` cell.

You can also use another popular Python package manager, `pip`:
`pip install pandas` from a **terminal** or <br>`!pip install pandas` from within a `jupyter notebook` cell.

***
__note__: If you're wondering why the command in a `jupyter notebook` cell has a `!` before it, it's because `jupyter` is running a `python` kernel, but you need your computer (not `python`) to run the command. Adding a `!` tells `jupyter` to send the command to the computer's shell instead of trying to run it as `python` code.<br>
The `-y` at the end of the `conda install` command means "when the computer asks me something, tell it I said 'yes'". This matters when you install something with `conda` from within a `jupyter notebook`, because before downloading anything `conda` asks you to confirm that you do want the package (in this case `pandas` and its _dependencies_ [other `python` libraries it depends on]). From a `jupyter notebook` cell you have no way to answer that prompt in the `terminal`, so you send the confirmation along with the command. <br>
`pip` installs a package without asking for confirmation. You asked for it, you get it. 
***
***

## pandas basics

### reading and writing data

`pandas` uses the functions `pd.read_***()` to read data and the methods `.to_***()` to write data, where `***` is one of the supported data formats (for example `csv`, `excel` or `stata`).

```python
import pandas as pd ## this is a convention. It just saves time, 66.7% to be exact.

# we store the data in a DataFrame. Conventionally, you'd name it 'df' but it can be anything you want.
df = pd.read_csv("path_to_your_file.csv")
```
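As a rough sketch of the read/write pair, the example below reads a small CSV and writes it back out. The CSV text and the name `df_example` are made up for illustration, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "State,County,Value\nCALIFORNIA,FRESNO,10\nCALIFORNIA,KERN,20\n"

# pd.read_csv accepts a file path or any file-like object.
df_example = pd.read_csv(io.StringIO(csv_text))

# With no path, .to_csv() returns the CSV as a string.
# index=False leaves the row labels out of the output.
round_trip = df_example.to_csv(index=False)
print(round_trip)
```

With a real file you would pass a path instead, for example `df_example.to_csv("my_output.csv", index=False)`.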

In [None]:
import pandas as pd

In [None]:
df = pd.read_

In [None]:
df.to_

***
**cool fact**: You can load the contents of your `clipboard` into a pandas dataframe with `pd.read_clipboard()`. So if a website shows its data in a table but doesn't offer a download, you can highlight > copy > `pd.read_clipboard()` > do your analysis > save it to an excel file or csv or .dta or whatever you want.
***

### slicing/filtering datasets

Once you have your dataset set up you may want to use only a few columns or rows. In `pandas` you can `slice` the dataframe in a few different ways.

* `.loc[]` and `.iloc[]` notation.
* boolean indexing

In [None]:
# import pandas as pd

df = pd.read_csv("../data/raw/Bee Colony Census Data by County.csv")

df.head()

***
```python
df.head()
```
will show you the first rows of your __`dataframe`__ (`.tail()` shows you the last rows). By default, it shows 5 but you can change it to whatever you'd like by *changing the parameter __`n`__*.

To explore this more, type out `df.head()` and, with your cursor **between the parentheses**, press `shift + tab`.
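Here is a minimal sketch of the `n` parameter, using a made-up ten-row dataframe called `numbers` rather than the bee data:

```python
import pandas as pd

# Hypothetical dataframe with ten rows, numbered 0 through 9.
numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)   # first 3 rows instead of the default 5
last_two = numbers.tail(n=2)    # last 2 rows, passing n by name

print(first_three)
print(last_two)
```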


***

##### .iloc[] / .loc[]

Both `.iloc[]` and `.loc[]` accept a _row indexer_ and a _column indexer_ as parameters like this:
```python
df.iloc[4, 5]
df.loc[df['State'] == 'CALIFORNIA', 'County']
```
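`.iloc[]` is integer based: it locates rows and columns by position, counting from 0.<br>
`.loc[]` is primarily label based: it locates rows and columns by their `index` and column labels.

What you pass to `.loc[]` therefore depends on the values of the `index` in your __dataframe__.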

In [None]:
df.index

Your `index` here is the default integer range, so its labels match the row positions. You can use `.iloc[]` to access its rows and columns by position. For example,

In [None]:
df.iloc[4, 3:8]

This translates to "using the __dataframe__ `df`, locate the row at position 4 and the columns at positions 3 through 7". Positions start at 0, so this is the fifth row, and the ending position (8) is not included.
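To see the positional slicing on something small, here is a sketch with a made-up dataframe `toy` whose values encode their own position (row 4, column 3 holds `43`):

```python
import pandas as pd

# Hypothetical 6 x 9 dataframe: each value is row_position * 10 + column_position.
toy = pd.DataFrame(
    [[r * 10 + c for c in range(9)] for r in range(6)],
    columns=list("abcdefghi"),
)

# Row at position 4, columns at positions 3, 4, 5, 6 and 7 (8 is excluded).
row_slice = toy.iloc[4, 3:8]
print(row_slice)
```

Because a single row was selected, the result is a __series__ whose labels are the selected column names.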

If your index holds labels, such as state names, you can use `.loc[]` to select rows by label:

In [None]:
dff = df.set_index('State')

dff.head()

In [None]:
dff.loc['CALIFORNIA']
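The same idea on a small made-up dataframe (the names `toy` and `by_state` and the values are only for illustration). Note that `.loc[]` returns a __dataframe__ when the label appears on several rows and a __series__ when it appears once:

```python
import pandas as pd

# Hypothetical data with a repeated state label.
toy = pd.DataFrame({
    "State": ["CALIFORNIA", "CALIFORNIA", "TEXAS"],
    "County": ["FRESNO", "KERN", "BEXAR"],
    "Value": [10, 20, 30],
})
by_state = toy.set_index("State")

california_rows = by_state.loc["CALIFORNIA"]                 # two rows: a dataframe
texas_row = by_state.loc["TEXAS"]                            # one row: a series
california_counties = by_state.loc["CALIFORNIA", "County"]   # row label and column label

print(california_rows)
print(texas_row)
```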

##### boolean indexing / filtering

Another way of grabbing subsets of your __dataframe__ is by using _boolean indexing_. Basically, logic conditions. For example,

In [None]:
df['Ag District'] == 'SAN JOAQUIN VALLEY'

This long __series__ of `True` and `False` can work as a filter. Essentially saying "show me all the rows where _X condition_ is true".

The notation is as follows:
```python
df[df[column] == value] 
```
where `column` is one of your column names and `value` is one of the values found in that column. Common operators are `<`, `>`, `<=`, `>=`, `==` and `!=`.  

Our condition before was
```python
df['Ag District'] == 'SAN JOAQUIN VALLEY'
```

In [None]:
df[df['Ag District'] == 'SAN JOAQUIN VALLEY']
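It can help to look at the two steps separately: first build the `True`/`False` __series__ (often called a mask), then use it to filter. A small sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows; only the column name matches the bee data.
toy = pd.DataFrame({
    "Ag District": ["SAN JOAQUIN VALLEY", "CENTRAL COAST", "SAN JOAQUIN VALLEY"],
    "Year": [2012, 2012, 2007],
})

# Step 1: the condition produces one True/False value per row.
mask = toy["Ag District"] == "SAN JOAQUIN VALLEY"

# Step 2: indexing with the mask keeps only the rows where it is True.
filtered = toy[mask]

# True counts as 1, so summing the mask counts the matching rows.
matches = int(mask.sum())
print(filtered)
```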

***
**Note**: The filtered result can be stored in another variable. It is a new __dataframe__ holding a copy of the matching rows, not a "view" of the original __dataframe__ `df`. 
```python
san_joaquin_df = df[df['Ag District'] == 'SAN JOAQUIN VALLEY']
```
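However, you __cannot__ edit the data in the original __dataframe__ `df` by editing `san_joaquin_df`: the changes stay in the new __dataframe__. Depending on your `pandas` version, assigning to it as in the cells below may also raise a `SettingWithCopyWarning`.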

In [None]:
san_joaquin_df = df[df['Ag District'] == 'SAN JOAQUIN VALLEY']

In [None]:
san_joaquin_df['County ANSI'] = 33

In [None]:
san_joaquin_df.head()

In [None]:
df[df['Ag District'] == 'SAN JOAQUIN VALLEY'].head()
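If you do want an independent subset that you can edit freely, one common approach is to call `.copy()` on the filtered result. A minimal sketch with made-up data (the `County ANSI` values are invented):

```python
import pandas as pd

# Hypothetical rows for illustration.
toy = pd.DataFrame({
    "Ag District": ["SAN JOAQUIN VALLEY", "CENTRAL COAST", "SAN JOAQUIN VALLEY"],
    "County ANSI": [19, 53, 29],
})

# .copy() makes it explicit that the subset is its own dataframe.
subset = toy[toy["Ag District"] == "SAN JOAQUIN VALLEY"].copy()
subset["County ANSI"] = 33

print(subset)   # edited values
print(toy)      # original is untouched
```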

***

**Note**: You can use multiple conditions to subset a __dataframe__
```python
df[(df['Ag District'] == 'SAN JOAQUIN VALLEY') & (df['Year'] == 2012)]
```
Just make sure to use __parentheses__ around each condition and use `&` for 'and' and `|` for 'or'. 
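The parentheses matter because `&` and `|` bind more tightly than `==` in `python`, so without them the expression is grouped the wrong way. A short sketch of both operators on made-up rows:

```python
import pandas as pd

# Hypothetical rows for illustration.
toy = pd.DataFrame({
    "Ag District": ["SAN JOAQUIN VALLEY", "SAN JOAQUIN VALLEY", "CENTRAL COAST"],
    "Year": [2012, 2007, 2012],
})

in_valley = toy["Ag District"] == "SAN JOAQUIN VALLEY"
in_2012 = toy["Year"] == 2012

# 'and': both conditions must be True for a row to be kept.
both = toy[(toy["Ag District"] == "SAN JOAQUIN VALLEY") & (toy["Year"] == 2012)]

# 'or': a row is kept if at least one condition is True.
either = toy[in_valley | in_2012]

print(both)
print(either)
```

Storing each condition in its own variable, as with `in_valley` and `in_2012`, is an optional way to keep long filters readable.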

***

### grabbing a subset of columns

You can access a column of a `pandas` __dataframe__ like this `df[column_name]`.
```python
df['County']
```
This returns all the values of that column. Because __dataframes__ are composed of __series__, when you access one column of a __dataframe__ you get a __series__ in return. __dataframes__ and __series__ have different properties. We'll be working mostly with __dataframes__. 

To grab more than one column you must pass a `list` with the column names you'd like to access. For example,
```python
df[['State','County', 'Value', 'Year']]
```
Notice the double brackets. `lists` in `python` use `[]` and to access a column in a __dataframe__ you use `[]` too. This is why you end up with two sets of brackets. Another way to code this would be,
```python
columns_i_need = ['State', 'County', 'Value', 'Year']

df[columns_i_need]
```
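A quick way to check what you got back is `type()`. In this sketch (made-up one-row data), single brackets give a __series__ while a `list`, even a list with one name, gives a __dataframe__:

```python
import pandas as pd

# Hypothetical single-row dataframe.
toy = pd.DataFrame({"State": ["CALIFORNIA"], "County": ["KERN"], "Value": [1], "Year": [2012]})

one_column = toy["County"]                      # single brackets: a series
one_column_frame = toy[["County"]]              # list with one name: a dataframe
several_columns = toy[["State", "County"]]      # list with two names: a dataframe

print(type(one_column))
print(type(one_column_frame))
```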

***
__Notice__ that the columns in the new __dataframe__ are in the order you specified in your list and not the original order. This is because your new __dataframe__ is built one step at a time, by going back to the original __dataframe__ and grabbing the column that matches each value in your list. <br>
You could in theory pass the `list` `columns_i_need = ['State', 'Year', 'County', 'Year', 'Value', 'Year']` and your new __dataframe__ would have the 6 columns in that specific order. 
***
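A small sketch of that ordering behaviour, on a made-up one-row dataframe:

```python
import pandas as pd

# Hypothetical single-row dataframe with the same column names as above.
toy = pd.DataFrame({"State": ["CALIFORNIA"], "County": ["KERN"], "Value": [1], "Year": [2012]})

# Columns come back in the order of the list, not the original order.
reordered = toy[["Year", "State"]]

# A repeated name simply repeats the column.
repeated = toy[["State", "Year", "County", "Year", "Value", "Year"]]

print(list(reordered.columns))
print(list(repeated.columns))
```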

### series and `dtypes`

In `pandas` a __series__ is a data structure that represents a column in a __dataframe__, the most basic unit of a __dataframe__. `stata`, for example, doesn't have a separate data structure for this.
Here's a labeled closer look into the components of a __dataframe__.
![dataframe_df](../images/anatomy_df.png)


Here `director_name`, `duration`, `imdb_score` are all __series__. Each series has a `dtype`. There are a few different `dtypes` in `pandas`. There are those common across programming languages like `bool`, `int` and `float`. In `pandas`, string __series__ are usually of `dtype` `object`, which is a blanket type for most data that is non-numeric. `pandas` also has `datetime` and `categorical` __series__ which have their own special properties.

To _access_ each `dtype`'s properties we use `accessors`. For example, to access the `string` properties of a column, which means you can apply `string` methods to each value in that __series__, you would do the following:
```python
df['State'].str.upper()
```
`.str` allows us to `access` the string methods like `.upper()`, `.capitalize()`, `.split()`, `.strip()` and apply them to each of the values in `df['State']`.<br>
Try below, grab a column (or __series__) and use some string methods to change the values in it.

__Tip__: when you apply these methods to a __series__ you are not actually changing the values of that __series__ themselves. You are telling `pandas` "grab the column XX in __dataframe__ `df` apply this `.str` method, give me __that__ series you just made." <br>
To clean up the data __in__ the __dataframe__ you need to _reassign_ the values of that __series__. For example,
```python
df['State'].str.lower() # this would give me back a series with all the values lowercased

df['State'] = df['State'].str.lower() # this takes that new series and assigns it to the column 'State'
```
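The sketch below shows both halves of that tip on made-up values: `.str` methods can be chained, the original column stays as it was until you reassign it, and only the reassignment changes the __dataframe__.

```python
import pandas as pd

# Hypothetical messy values with stray spaces and mixed case.
toy = pd.DataFrame({"State": [" california ", "Texas"]})

# Chained string methods return a new series.
cleaned = toy["State"].str.strip().str.upper()

# The column in the dataframe has not changed yet.
before_reassign = toy["State"].tolist()

# Reassigning stores the cleaned values in the dataframe.
toy["State"] = cleaned

print(before_reassign)
print(toy["State"].tolist())
```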

***

### grouping / aggregating

To aggregate values by a group you can use `.groupby()`. For example,
```python
df.groupby('State')
```
If you run this command you'll get something like this `<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x0000000007F3BE10>` which means a `groupby` object was created. To do anything useful with it we apply aggregators to it. For example, 
```python
df.groupby('State').sum()
```
would return the sum of the values in your __dataframe__ by 'State'. In `pandas` 2.0 and later, pass `numeric_only=True` to `.sum()` if you only want the numeric columns.
If you want to look at one column only, you can grab that column like this:
```python
df.groupby('State')['Value'].sum()
```
Use a `list` if you want to grab multiple columns.
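For instance, here is a sketch with made-up numbers (the `Colonies` column is hypothetical and not part of the bee data):

```python
import pandas as pd

# Hypothetical numeric data for illustration.
toy = pd.DataFrame({
    "State": ["CA", "CA", "TX"],
    "Value": [10, 20, 30],
    "Colonies": [1, 2, 3],
})

# One column: the result is a series indexed by State.
value_totals = toy.groupby("State")["Value"].sum()

# A list of columns: the result is a dataframe indexed by State.
totals = toy.groupby("State")[["Value", "Colonies"]].sum()

print(value_totals)
print(totals)
```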

***
__note__: there are multiple `aggregators` like `.sum()`. You can use:
* `.mean()`
* `.median()`
* `.min()`
* `.max()` 
* `.count()` 

and many others. You will not remember all of these, so you can always google them or _assign_ your `groupby` object to a `variable` and explore your options with `tab` like we explored in the [first notebook](00_Intro.ipynb).
```python
state_groups = df.groupby('State') # run this in a cell

# in another cell type your variable, add a . at the end, hit `tab`
state_groups.
```
***
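If you want several of these at once, one option is `.agg()` with a `list` of aggregator names. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Hypothetical numeric data for illustration.
toy = pd.DataFrame({
    "State": ["CA", "CA", "TX"],
    "Value": [10, 20, 30],
})

# Each name in the list becomes a column in the result.
stats = toy.groupby("State")["Value"].agg(["min", "max", "mean", "count"])
print(stats)
```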

***
If you ran that code above you'd see you don't get the values you expected. That is because the column `'Value'` is not a numeric column.

Run the following code to explore your __dataframe__ further:
```python
df.info()
```

In [None]:
df.info()

The `dtype` (or data type) of the `Value` column is _object_ which is a blanket term for "anything non-numeric". 
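As a preview of the kind of cleanup involved, here is a hedged sketch on made-up values. It assumes the numbers are stored as text with thousands separators and an occasional non-numeric placeholder. The real `Value` column may need different steps.

```python
import pandas as pd

# Hypothetical raw values: text with commas and a non-numeric placeholder.
toy = pd.DataFrame({
    "State": ["CA", "CA", "TX"],
    "Value": ["1,200", " (D)", "300"],
})

numeric_before = pd.api.types.is_numeric_dtype(toy["Value"])

# Remove the commas, then convert; errors="coerce" turns anything
# that still isn't a number into NaN (a missing value).
without_commas = toy["Value"].str.replace(",", "", regex=False)
toy["Value"] = pd.to_numeric(without_commas, errors="coerce")

numeric_after = pd.api.types.is_numeric_dtype(toy["Value"])

# Missing values are skipped by .sum().
totals = toy.groupby("State")["Value"].sum()
print(numeric_before, numeric_after)
print(totals)
```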

***
Let's clean up the __dataframe__ then! <br>
[case study](02_case-study.ipynb)

or go back to the
[index](04_index.ipynb)