# Getting Started With Pandas

We will begin by introducing the `Series`, `DataFrame`, and `Index` classes, which are the basic building blocks of the pandas library, and showing how to work with them. By the end of this section, you will be able to create DataFrames and perform operations on them to inspect and filter the data.

For more information, see the [pandas user guide](https://pandas.pydata.org/docs/user_guide/index.html).

## Anatomy of a DataFrame

A **DataFrame** is composed of one or more **Series**. The names of the **Series** form the column names, and the row labels form the **Index**.

In [None]:
import pandas as pd

meteorites = pd.read_csv('data/Meteorite_Landings.csv', nrows=5)
meteorites

*Source: [NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh)*

#### Series:

In [None]:
meteorites.name

#### Columns:

In [None]:
meteorites.columns

#### Index:

In [None]:
meteorites.index
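
The relationship between these three pieces can also be checked on a tiny hand-built frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the meteorite file.

```python
import pandas as pd

# Hypothetical toy data, not taken from the meteorite file
toy = pd.DataFrame({
    'name': ['rock_a', 'rock_b', 'rock_c'],
    'mass (g)': [21.0, 720.0, 107000.0],
})

name_column = toy['name']                    # a single column is a Series
print(type(name_column).__name__)            # class of the selected column
print(name_column.name)                      # the Series name is the column label
print(toy.columns.tolist())                  # all column labels
print(name_column.index.equals(toy.index))   # the column shares the row Index
```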

## Creating DataFrames

We can create DataFrames from a variety of sources, such as other Python objects, flat files, web scraping, and API requests. Here, we will see just a couple of examples, but be sure to check out [this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) in the documentation for a complete list.
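
As a quick illustration of the "other Python objects" case, the sketch below builds the same small DataFrame from a list of dictionaries and from a dictionary of lists. The records are hypothetical values invented for this example.

```python
import pandas as pd

# Hypothetical records; the values are made up for illustration
records = [
    {'name': 'rock_a', 'mass (g)': 21.0},
    {'name': 'rock_b', 'mass (g)': 720.0},
]
from_records = pd.DataFrame(records)   # list of dicts: one dict per row

columns = {'name': ['rock_a', 'rock_b'], 'mass (g)': [21.0, 720.0]}
from_columns = pd.DataFrame(columns)   # dict of lists: one list per column

# Both routes should describe the same table
print(from_records.equals(from_columns))
```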

### Using a flat file

In [None]:
import pandas as pd

meteorites = pd.read_csv('data/Meteorite_Landings.csv')

*Tip: This function has many parameters that handle some initial processing while reading in the file &ndash; be sure to check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).*
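
For instance, `usecols` and `nrows` limit what gets read. The sketch below uses an in-memory string as a stand-in for a file on disk; the CSV text is hypothetical and only loosely mimics the layout of the meteorite data.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """name,id,mass (g),year
rock_a,1,21.0,1880
rock_b,2,720.0,1951
rock_c,6,107000.0,1952
"""

subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['name', 'mass (g)'],  # read only these columns
    nrows=2,                       # stop after the first two data rows
)
print(subset)
```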

### Using data from an API

Collect the data from [NASA's Open Data Portal](https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh) using the Socrata Open Data API (SODA) with the `requests` library:

In [None]:
import requests

response = requests.get(
    'https://data.nasa.gov/resource/gh4g-9sfh.json',
    params={'$limit': 50_000}
)

if response.ok:
    payload = response.json()
else:
    print(f'Request was not successful and returned code: {response.status_code}.')
    payload = None

In [None]:
payload

Create the DataFrame with the resulting payload:

In [None]:
import pandas as pd

df = pd.DataFrame(payload)
df.head(3)

*Tip: `df.to_csv('data.csv')` writes this data to a new file called `data.csv`.*
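
By default, `to_csv()` also writes the row labels as the first column. A minimal sketch of a round trip that leaves them out with `index=False` follows; it assumes a made-up `toy` DataFrame and writes to an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({'name': ['rock_a', 'rock_b'], 'mass (g)': [21.0, 720.0]})

buffer = io.StringIO()            # in-memory stand-in for a file path
toy.to_csv(buffer, index=False)   # index=False leaves out the row labels
buffer.seek(0)                    # rewind before reading the text back

round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())
```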

## Inspecting the data
Now that we have some data, we need to perform an initial inspection of it. This tells us what the data looks like, how many rows and columns there are, and what type of data each column holds.

Let's inspect the `meteorites` data.

#### How many rows and columns are there?

In [None]:
meteorites.shape

#### What are the column names?

In [None]:
meteorites.columns

#### What type of data does each column currently hold?

In [None]:
meteorites.dtypes

#### What does the data look like?

In [None]:
meteorites.head()

Sometimes there may be extraneous data at the end of the file, so checking the bottom few rows is also important:

In [None]:
meteorites.tail()

#### Get some information about the DataFrame

In [None]:
meteorites.info()

##### [Exercise 1.1](../exercises/pandas_intro_exercise.ipynb#Exercise-1.1)

Create a DataFrame by reading in the `2019_Yellow_Taxi_Trip_Data.csv` file. Examine the first 5 rows.

##### [Exercise 1.2](../exercises/pandas_intro_exercise.ipynb#Exercise-1.2)

Find the dimensions (number of rows and number of columns) in the data.

## Extracting subsets

A crucial part of working with DataFrames is extracting subsets of the data: finding rows that meet a certain set of criteria, isolating columns/rows of interest, etc. After narrowing down our data, we are closer to discovering insights. This section will be the backbone of many analysis tasks.

#### Selecting columns

We can select columns as attributes if their names would be valid Python variables:

In [None]:
meteorites.name

If they aren't, we have to select them as keys. Key-based selection also lets us select multiple columns at once by passing a list of names:

In [None]:
meteorites[['name', 'mass (g)']]
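
The sketch below contrasts the two access styles on a made-up `toy` DataFrame. The `count` column is a hypothetical addition that illustrates a second pitfall of attribute access: a column whose name matches an existing DataFrame method can only be reached as a key.

```python
import pandas as pd

# Hypothetical toy data; the 'count' column is invented for this example
toy = pd.DataFrame({
    'name': ['rock_a', 'rock_b'],
    'mass (g)': [21.0, 720.0],
    'count': [1, 2],
})

print(toy.name.equals(toy['name']))  # valid identifier: both forms agree
print(callable(toy.count))           # 'count' resolves to DataFrame.count()
print(toy['count'].tolist())         # key access still reaches the column
# `toy.mass (g)` is not valid Python, so toy['mass (g)'] is the only option
```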

#### Selecting rows

In [None]:
meteorites[100:104]

#### Indexing

We use `iloc[]` to select rows and columns by their position:

In [None]:
meteorites.iloc[100:104, [0, 3, 4, 6]]

We use `loc[]` to select by label. Note that, unlike positional slicing, a `loc[]` slice includes its end label:

In [None]:
meteorites.loc[100:104, 'mass (g)':'year']
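
The two indexers treat the end of a slice differently, which is easy to verify on a small frame. This is a sketch with made-up values and the default integer index.

```python
import pandas as pd

# Hypothetical toy data with the default index 0..4
toy = pd.DataFrame({'mass (g)': [10.0, 20.0, 30.0, 40.0, 50.0]})

by_position = toy.iloc[1:3]  # end position excluded, like a Python list slice
by_label = toy.loc[1:3]      # end label included

print(len(by_position), len(by_label))
```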

#### Filtering with Boolean masks

A **Boolean mask** is an array-like structure of Boolean values &ndash; it's a way to specify which rows/columns we want to select (`True`) and which we don't (`False`).

Here's an example of a Boolean mask for meteorites weighing more than 50 grams that were found on Earth (i.e., they were not observed falling):

In [None]:
(meteorites['mass (g)'] > 50) & (meteorites.fall == 'Found')

**Important**: Take note of the syntax here. We surround each condition with parentheses, and we use bitwise operators (`&`, `|`, `~`) instead of logical operators (`and`, `or`, `not`).
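
A small sketch of these rules on made-up data follows. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>` and `==`, and `and` fails because it asks a whole Series for a single truth value.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    'mass (g)': [21.0, 720.0, 107000.0],
    'fall': ['Found', 'Fell', 'Found'],
})

heavy = toy['mass (g)'] > 50
found = toy.fall == 'Found'

print((heavy & found).tolist())  # both conditions hold
print((heavy | found).tolist())  # at least one condition holds
print((~found).tolist())         # negation

try:
    result = heavy and found     # `and` needs one True/False per operand
except ValueError as error:
    print(f'ValueError: {error}')
```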

We can use a Boolean mask to select the subset of meteorites weighing more than 1 million grams (1,000 kilograms or roughly 2,205 pounds) that were observed falling:

In [None]:
meteorites[(meteorites['mass (g)'] > 1e6) & (meteorites.fall == 'Fell')]

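*Tip: Boolean masks can be used with `loc[]` and `iloc[]`. Note that `iloc[]` expects a plain Boolean array (e.g., `mask.to_numpy()`) rather than a Boolean `Series`.*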
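
As a sketch of that tip, the mask below selects rows while a label or a position picks the column. The `toy` data is made up, and the mask is converted with `to_numpy()` before it is handed to `iloc[]`.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    'name': ['rock_a', 'rock_b', 'rock_c'],
    'mass (g)': [21.0, 720.0, 107000.0],
})

mask = toy['mass (g)'] > 50

with_loc = toy.loc[mask, 'name']           # rows by mask, column by label
with_iloc = toy.iloc[mask.to_numpy(), 0]   # rows by Boolean array, column by position
print(with_loc.tolist())
```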

An alternative to this is the `query()` method:

In [None]:
meteorites.query("`mass (g)` > 1e6 and fall == 'Fell'")

*Tip: Here, we can use both logical operators and bitwise operators.*
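
To illustrate, the two query strings below are written once with `and` and once with `&`. This is a sketch on made-up data, and both are expected to select the same rows.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    'mass (g)': [21.0, 720.0, 107000.0],
    'fall': ['Found', 'Fell', 'Found'],
})

# Backticks quote column names that are not valid Python identifiers
logical = toy.query("`mass (g)` > 50 and fall == 'Found'")
bitwise = toy.query("(`mass (g)` > 50) & (fall == 'Found')")

print(logical.equals(bitwise))
```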

## Calculating summary statistics

In the next section of this workshop, we will discuss data cleaning for a more meaningful analysis of our datasets; however, we can already extract some interesting insights from the `meteorites` data by calculating summary statistics.

#### How many of the meteorites were found versus observed falling?

In [None]:
meteorites.fall.value_counts()

*Tip: Pass in `normalize=True` to see this result as proportions (fractions that sum to 1) rather than counts. Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) for additional functionality.*
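
A minimal sketch of the difference, using a made-up `fall` Series rather than the real column:

```python
import pandas as pd

# Hypothetical values for illustration
fall = pd.Series(['Found', 'Found', 'Found', 'Fell'], name='fall')

counts = fall.value_counts()                 # absolute counts per value
shares = fall.value_counts(normalize=True)   # fractions of the total

print(counts['Found'], shares['Found'])
```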

#### What was the mass of the average meteorite?

In [None]:
meteorites['mass (g)'].median()

We can take this a step further and look at quantiles:

In [None]:
meteorites['mass (g)'].quantile([0.01, 0.05, 0.95, 0.99])
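
To see how the median and quantiles relate to the mean, here is a small sketch on deliberately skewed, made-up values (not the meteorite data). A single extreme value moves the mean a long way but leaves the median untouched.

```python
import pandas as pd

# Hypothetical masses with one extreme value
masses = pd.Series([10.0, 20.0, 30.0, 40.0, 10_000.0])

print(masses.mean())                   # pulled upward by the extreme value
print(masses.median())                 # middle value of the sorted data
print(masses.quantile([0.25, 0.75]))   # bounds of the middle half of the data
```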

#### What was the mass of the heaviest meteorite?

In [None]:
meteorites['mass (g)'].max()

Let's extract the information on this meteorite:

In [None]:
meteorites.loc[meteorites['mass (g)'].idxmax()]
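
Note that `idxmax()` returns an index *label*, which is why it pairs with `loc[]` rather than `iloc[]`. The sketch below makes this visible by giving a made-up `toy` DataFrame a hypothetical string index.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {'name': ['rock_a', 'rock_b', 'rock_c'],
     'mass (g)': [21.0, 107000.0, 720.0]},
    index=['x', 'y', 'z'],
)

heaviest_label = toy['mass (g)'].idxmax()  # label of the row holding the maximum
heaviest_row = toy.loc[heaviest_label]     # look that label up with loc[]

print(heaviest_label)
print(heaviest_row['name'])
```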

#### How many different types of meteorite classes are represented in this dataset?

In [None]:
meteorites.recclass.nunique()

Some examples:

In [None]:
meteorites.recclass.unique()[:14]

*Note: All fields preceded with "rec" are the values recommended by The Meteoritical Society. Check out [this Wikipedia article](https://en.wikipedia.org/wiki/Meteorite_classification) for some information on meteorite classes.*

#### Get some summary statistics on the data itself
We can get common summary statistics for all columns at once. By default, only numeric columns are summarized, but here we pass `include='all'` to summarize everything together:

In [None]:
meteorites.describe(include='all')

**Important**: `NaN` values signify missing data. For instance, the `fall` column contains strings, so there is no value for `mean`; likewise, `mass (g)` is numeric, so we don't have entries for the categorical summary statistics (`unique`, `top`, `freq`).
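
The same pattern of `NaN` entries shows up on a tiny made-up frame with one string column and one numeric column. This is only a sketch with invented values.

```python
import pandas as pd

# Hypothetical toy data: one string column and one numeric column
toy = pd.DataFrame({
    'fall': ['Found', 'Fell', 'Found'],
    'mass (g)': [21.0, 720.0, 107000.0],
})

summary = toy.describe(include='all')

print(pd.isna(summary.loc['mean', 'fall']))     # no mean for strings
print(pd.isna(summary.loc['top', 'mass (g)']))  # no "top" value for numbers
print(summary.loc['top', 'fall'])               # most frequent string value
```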

#### Check out the documentation for more descriptive statistics:

- [Series](https://pandas.pydata.org/docs/reference/series.html#computations-descriptive-stats)
- [DataFrame](https://pandas.pydata.org/docs/reference/frame.html#computations-descriptive-stats)

##### [Exercise 1.3](../exercises/pandas_intro_exercise.ipynb#Exercise-1.3)

Using the data in the `2019_Yellow_Taxi_Trip_Data.csv` file, calculate summary statistics for the `fare_amount`, `tip_amount`, `tolls_amount`, and `total_amount` columns.

##### [Exercise 1.4](../exercises/pandas_intro_exercise.ipynb#Exercise-1.4)
Find the dimensions (number of rows and number of columns) in the data.

return to [overview](../00_overview.ipynb)