# Python Workshop 2: Getting started with Pandas
[Placeholder for intro text]

## What is a Python library?

We've covered how to code in the Python programming language, and now we'll move on to the Pandas library. A "library" in this context is a package of code that adds functionality to Python. Base Python offers many features, but not everything. You import a library at the beginning of your code to use its features for your specific purpose.

For example, you may import Matplotlib to create graphs and plots, or the Natural Language Toolkit (NLTK) to do natural language processing. Today we will use the Pandas library to manipulate a dataset.

## What is Pandas?

Pandas is a high-level data manipulation tool first created in 2008 by Wes McKinney. The name comes from the term “panel data,” an econometrics term for data sets that include observations over multiple time periods for the same individuals.<sup>[[wikipedia](https://en.wikipedia.org/wiki/Pandas_(software))]</sup>

From Jake VanderPlas's book [**Python Data Science Handbook**](http://shop.oreilly.com/product/0636920034919.do):

> As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

### What does Pandas do?
* Reading and writing data from persistent storage
* Cleaning, filtering, and otherwise preparing data
* Calculating statistics and analyzing data
* Visualization with help from Matplotlib

## Importing a Python library

To use any library, we must first import it into our Python script or notebook.

In [2]:
# Import the Pandas library as pd (callable in our code as pd)
import pandas as pd
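
The same `import` statement works for any library. As a small sketch of the common import forms, the cell below uses three modules from Python's standard library (chosen only for illustration; they are not part of this workshop's dataset work):

```python
import math                      # import a whole module
import statistics as stats       # import a module under a shorter alias
from collections import Counter  # import a single name from a module

root = math.sqrt(16)
average = stats.mean([1, 2, 3])
most_common_letter = Counter("pandas").most_common(1)

print(root)
print(average)
print(most_common_letter)
```

The alias form is the one used for Pandas above: after `import pandas as pd`, every Pandas function is reached through the `pd.` prefix.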

## Importing files into Pandas
We have prepared the data from the FAA website for this workshop. We will import those datasets into our notebook to use them for data analysis.

Datasets can be stored in several types of files, including .csv, .json, .txt, .xls, .xlsx, and more. Here we will import a .csv file and a .json file.

### CSV Files

A comma separated values (CSV) file is a plain text file containing data separated by commas.

In [3]:
# Import a comma-separated values (csv) file as a DataFrame

# The file location
csv_file_url = 'https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Python_Open_Labs/data/FAA_Wildlife_strikes_1990-1999.csv'

# Read in the file and print out the DataFrame
wl_strikes_csv = pd.read_csv(csv_file_url)
wl_strikes_csv.head()

Unnamed: 0,INDX_NR,INCIDENT_DATE,INCIDENT_MONTH,INCIDENT_YEAR,TIME,TIME_OF_DAY,AIRPORT_ID,AIRPORT,RUNWAY,STATE,...,SIZE,NR_INJURIES,NR_FATALITIES,COMMENT,REPORTER_NAME,REPORTER_TITLE,SOURCE,PERSON,LUPDATE,TRANSFER
0,633309,1999-12-24,12,1999,10:15,Day,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,18R,NC,...,Small,,,SOURCE = TWO XXXX-X REPTS /Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,Air Transport Operations,2000-03-10,False
1,634726,1999-12-15,12,1999,,Day,KRDU,RALEIGH-DURHAM INTL,23R,NC,...,Small,,,/Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,,2000-03-09,False
2,636216,1999-12-14,12,1999,07:40,Day,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,18R,NC,...,Small,,,SOURCE = TWO XXXX-X REPTS /Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,Tower,2000-03-09,False
3,633739,1999-12-11,12,1999,17:00,Dusk,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,36R,NC,...,Large,,,SOURCE = TWO XXXX-X REPTS /Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,Tower,2000-03-09,False
4,636671,1999-12-11,12,1999,,,KRDU,RALEIGH-DURHAM INTL,,NC,...,Medium,,,SOURCE = X AIRLINE REPTS /Legacy Record=XXXXXX/,REDACTED,REDACTED,Air Transport Report,Air Transport Operations,2001-06-25,False
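
To see how comma-separated text maps onto a DataFrame without downloading anything, here is a minimal sketch that reads a CSV from an in-memory string. The three rows are hypothetical values made up for illustration, not records from the FAA file:

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration
csv_text = """INDX_NR,AIRPORT_ID,STATE
1,KCLT,NC
2,KRDU,NC
3,KGSO,NC
"""

# read_csv accepts a file-like object as well as a file path or URL
sample_csv = pd.read_csv(io.StringIO(csv_text))

print(sample_csv.shape)          # (number of rows, number of columns)
print(list(sample_csv.columns))  # labels taken from the header row
```

The first line of the text becomes the column labels, and each following line becomes one row.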


### Excel Files

[Placeholder for info about Excel files]

In [23]:
excel_file_url = ""

# Read in the file and print out the DataFrame
wl_strikes_excel = pd.read_excel(excel_file_url)
wl_strikes_excel.head()

FileNotFoundError: [Errno 2] No such file or directory: ''

### JSON Files

JSON (JavaScript Object Notation) is a data storage format that uses name/value pairs to create objects and associative arrays. Learn more about [JSON file structure and syntax from W3Schools](https://www.w3schools.com/js/js_json_syntax.asp).

In [10]:
# Importing a JavaScript object notation (JSON) file

# The file location
json_file_url = 'https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Python_Open_Labs/data/FAA_Wildlife_strikes_2010-2019.json'

# Read in the file and print out the DataFrame
wl_strikes_json = pd.read_json(json_file_url)
wl_strikes_json.head()

Unnamed: 0,INCIDENT_DATE,INCIDENT_MONTH,INCIDENT_YEAR,TIME,TIME_OF_DAY,AIRPORT_ID,AIRPORT,RUNWAY,STATE,FAAREGION,...,SIZE,NR_INJURIES,NR_FATALITIES,COMMENT,REPORTER_NAME,REPORTER_TITLE,SOURCE,PERSON,LUPDATE,TRANSFER
1080125,2020-11-05,11,2020,05:00,,KRDU,RALEIGH-DURHAM INTL,23R,NC,ASO,...,Small,,,,REDACTED,REDACTED,FAA Form 5200-7-E,Airport Operations,2020-12-04,False
1080118,2020-11-05,11,2020,22:35,,KGSO,PIEDMONT TRIAD INTL,5R,NC,ASO,...,Large,,,,REDACTED,REDACTED,FAA Form 5200-7-E,Airport Operations,2020-12-04,False
1080126,2020-11-05,11,2020,11:00,Day,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,36C,NC,ASO,...,,,,,REDACTED,REDACTED,FAA Form 5200-7-E,Airport Operations,2020-12-04,False
1080130,2020-11-05,11,2020,06:52,Day,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,36C,NC,ASO,...,,,,,REDACTED,REDACTED,FAA Form 5200-7-E,Airport Operations,2020-12-04,False
1078243,2020-11-04,11,2020,05:20,Dawn,KRDU,RALEIGH-DURHAM INTL,23R,NC,ASO,...,Small,,,,REDACTED,REDACTED,FAA Form 5200-7-E,Airport Operations,2020-11-27,False
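
As a rough illustration of how name/value pairs turn into columns, the sketch below parses a small JSON string. The two records are hypothetical, and `orient="records"` is an assumption about the layout of this sample only (a list of objects, one per row); the workshop file may be laid out differently:

```python
import io

import pandas as pd

# Hypothetical records: a JSON array holding one object per row
json_text = (
    '[{"AIRPORT_ID": "KCLT", "INCIDENT_YEAR": 2019},'
    ' {"AIRPORT_ID": "KRDU", "INCIDENT_YEAR": 2020}]'
)

# Each name in the objects becomes a column label
sample_json = pd.read_json(io.StringIO(json_text), orient="records")

print(sample_json.shape)
print(list(sample_json.columns))
```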


## Pandas data structures

Pandas uses two main data structures: `Series` and `DataFrame`.

<img src="https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Data_Manipulation_with_Python/assets/nc_dataframes.png" alt="DataFrames are composed of Series" width="75%">

### `Series`
A `Series` is a one-dimensional array of indexed data, or a single column of data. It can be thought of as a specialized dictionary or a generalized NumPy array. You can learn more about the Series data type in the [Pandas documentation for Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html).
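
A minimal sketch of the "specialized dictionary" and "generalized array" views of a `Series`, using made-up strike counts keyed by airport code:

```python
import pandas as pd

# Hypothetical strike counts keyed by airport code (made-up numbers)
strike_counts = pd.Series({"KCLT": 3, "KRDU": 2, "KGSO": 1}, name="strikes")

by_label = strike_counts["KRDU"]     # look up by index label, like a dictionary
by_position = strike_counts.iloc[0]  # look up by position, like an array
total = strike_counts.sum()          # operations apply to the whole Series

print(by_label, by_position, total)
```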

### `DataFrame`
A `DataFrame` is a two-dimensional data structure composed of one or more `Series`, similar to tabular data (think of an Excel spreadsheet). Every `DataFrame` has an `Index` of row labels (a default integer index is created if you don't supply one) and flexible column names.

It can be thought of as a generalization of a two-dimensional NumPy array, or a specialization of a dictionary in which each column name maps to a `Series` of column data. You can learn more about the DataFrame data type in the [Pandas documentation for DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).

A `DataFrame` is made up of `Series` in much the same way that a table is made up of columns. The only restriction is that each column has a single data type; different columns may have different data types. Many of the operations that can be performed on a `DataFrame` can also be performed on an individual `Series`.
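
The relationship between the two structures can be sketched by building a small `DataFrame` from two `Series`. The column names mirror the workshop data, but the values are hypothetical:

```python
import pandas as pd

# Hypothetical columns with made-up values
airports = pd.Series(["KCLT", "KRDU", "KGSO"])
heights = pd.Series([100.0, 0.0, 750.0])

# Each dictionary key becomes a column label; each Series becomes a column
sample_df = pd.DataFrame({"AIRPORT_ID": airports, "HEIGHT": heights})

print(sample_df.dtypes)           # one data type per column
print(type(sample_df["HEIGHT"]))  # selecting one column gives back a Series
```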

## Exploring your data

Now that we have our data, we can use Pandas to explore it. This is especially useful when you are new to a dataset and want to see what it contains before deciding how to analyze it.

### View DataFrame column labels

Our DataFrame has 91 columns. We can quickly view the label of each column using the DataFrame `columns` attribute.

In [11]:
# View column labels (headers)
wl_strikes_csv.columns

Index(['INDX_NR', 'INCIDENT_DATE ', 'INCIDENT_MONTH', 'INCIDENT_YEAR', 'TIME',
       'TIME_OF_DAY', 'AIRPORT_ID', 'AIRPORT', 'RUNWAY', 'STATE', 'FAAREGION',
       'LOCATION', 'ENROUTE STATE', 'OPID', 'REG', 'FLT', 'AIRCRAFT', 'AMA',
       'AMO', 'EMA', 'EMO', 'AC_CLASS', 'AC_MASS', 'TYPE_ENG', 'NUM_ENGS',
       'ENG_1_POS', 'ENG_2_POS', 'ENG_3_POS', 'ENG_4_POS', 'PHASE_OF_FLIGHT',
       'HEIGHT', 'SPEED', 'DISTANCE', 'SKY', 'PRECIPITATION', 'AOS',
       'COST_REPAIRS', 'OTHER_COST', 'COST_REPAIRS_INFL_ADJ',
       'COST_OTHER_INFL_ADJ', 'INGESTED', 'INDICATED_DAMAGE', 'DAMAGE_LEVEL',
       'STR_RAD', 'DAM_RAD', 'STR_WINDSHLD', 'DAM_WINDSHLD', 'STR_NOSE',
       'DAM_NOSE', 'STR_ENG1', 'DAM_ENG1', 'STR_ENG2', 'DAM_ENG2', 'STR_ENG3',
       'DAM_ENG3', 'STR_ENG4', 'DAM_ENG4', 'STR_PROP', 'DAM_PROP',
       'STR_WING_ROT', 'DAM_WING_ROT', 'STR_FUSE', 'DAM_FUSE', 'STR_LG',
       'DAM_LG', 'STR_TAIL', 'DAM_TAIL', 'STR_LGHTS', 'DAM_LGHTS', 'STR_OTHER',
       'DAM_OTHER', 'OTHER_SPEC
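
Note that one label in the output above, `'INCIDENT_DATE '`, ends with a space, so a lookup without that space would not match it. One possible way to tidy such labels is sketched below on a hypothetical two-column frame; whether to rename columns in the real dataset is a judgment call:

```python
import pandas as pd

# Hypothetical frame whose second label has a trailing space
sample_df = pd.DataFrame(
    {"INDX_NR": [1, 2], "INCIDENT_DATE ": ["1999-12-24", "1999-12-15"]}
)

# Strip leading and trailing whitespace from every column label
sample_df.columns = sample_df.columns.str.strip()

print(list(sample_df.columns))
```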

### View summaries of a DataFrame

We can quickly generate summaries of our DataFrame to observe some basic statistics and information such as column data types and non-null value counts.

In [12]:
# Get summary statistics of DataFrame columns using "describe()" (only includes
# numerical data types)
wl_strikes_csv.describe()

Unnamed: 0,INDX_NR,INCIDENT_MONTH,INCIDENT_YEAR,ENROUTE STATE,AMO,EMA,EMO,AC_MASS,NUM_ENGS,ENG_1_POS,...,HEIGHT,SPEED,DISTANCE,AOS,COST_REPAIRS,OTHER_COST,COST_REPAIRS_INFL_ADJ,COST_OTHER_INFL_ADJ,NR_INJURIES,NR_FATALITIES
count,669.0,669.0,669.0,0.0,639.0,628.0,620.0,647.0,647.0,640.0,...,622.0,542.0,211.0,18.0,16.0,7.0,16.0,7.0,3.0,0.0
mean,626943.282511,7.544096,1994.789238,,26.032864,26.56051,13.229032,3.491499,2.060278,3.81875,...,746.172026,139.924354,0.654028,105.944444,53193.0,5523.571429,100497.75,9129.142857,1.0,
std,37159.222694,2.882413,2.847754,,24.238506,10.51654,14.268682,0.872439,0.410107,1.831294,...,1557.121925,38.821595,3.717234,334.724904,125021.768963,4954.84335,244639.428923,8437.919163,0.0,
min,608234.0,1.0,1985.0,,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,20.0,0.0,1.0,100.0,250.0,168.0,398.0,1.0,
25%,613346.0,6.0,1992.0,,11.0,19.0,4.0,3.0,2.0,1.0,...,0.0,120.0,0.0,2.25,3025.0,2050.0,4755.5,3263.5,1.0,
50%,619597.0,8.0,1995.0,,21.0,34.0,10.0,4.0,2.0,5.0,...,100.0,135.0,0.0,9.5,12000.0,5000.0,18750.0,7665.0,1.0,
75%,627351.0,10.0,1997.0,,36.0,34.0,10.0,4.0,2.0,5.0,...,737.5,150.0,0.0,54.75,29925.0,7657.5,50603.75,13513.0,1.0,
max,829805.0,12.0,1999.0,,97.0,37.0,46.0,4.0,4.0,7.0,...,17000.0,270.0,40.0,1440.0,500000.0,14000.0,978000.0,22288.0,1.0,


In [13]:
# Get summary statistics of single column using "describe()"
wl_strikes_csv['AIRPORT'].describe()

count                             669
unique                             26
top       CHARLOTTE/DOUGLAS INTL ARPT
freq                              306
Name: AIRPORT, dtype: object
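
The difference between the two kinds of `describe()` summary is easier to see on a tiny example. This sketch uses a hypothetical four-row frame with one text column and one numeric column containing a missing value:

```python
import pandas as pd

# Hypothetical data; None marks a missing height
sample_df = pd.DataFrame({
    "AIRPORT_ID": ["KCLT", "KRDU", "KCLT", "KCLT"],
    "HEIGHT": [0.0, 100.0, 200.0, None],
})

numeric_summary = sample_df.describe()             # numeric columns only
text_summary = sample_df["AIRPORT_ID"].describe()  # count, unique, top, freq

print(numeric_summary.loc["count", "HEIGHT"])      # missing values are not counted
print(text_summary["top"], text_summary["freq"])
```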

In [14]:
# Summarize column data types, non-null values, and memory usage using "info()"
wl_strikes_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 669 entries, 0 to 668
Data columns (total 91 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   INDX_NR                669 non-null    int64  
 1   INCIDENT_DATE          669 non-null    object 
 2   INCIDENT_MONTH         669 non-null    int64  
 3   INCIDENT_YEAR          669 non-null    int64  
 4   TIME                   669 non-null    object 
 5   TIME_OF_DAY            633 non-null    object 
 6   AIRPORT_ID             669 non-null    object 
 7   AIRPORT                669 non-null    object 
 8   RUNWAY                 669 non-null    object 
 9   STATE                  669 non-null    object 
 10  FAAREGION              669 non-null    object 
 11  LOCATION               10 non-null     object 
 12  ENROUTE STATE          0 non-null      float64
 13  OPID                   669 non-null    object 
 14  REG                    669 non-null    object 
 15  FLT   

### Referencing and indexing a DataFrame

#### Referencing Rows (.loc and .iloc)

In [15]:
# Reference a row by index label
# Returns a Series

# Access first row of wl_strikes_csv by index label
# In this case the index label is 0
wl_strikes_csv.loc[0]

# Access first row of wl_strikes_json by index label
# In this case the index label is not 0 but 1080125
wl_strikes_json.loc[1080125]

In [16]:
# Reference multiple rows by index label (in this case the index labels 0 through 3)
# Note that a .loc slice includes the stop label
# Returns a DataFrame
wl_strikes_csv.loc[0:3]

Unnamed: 0,INDX_NR,INCIDENT_DATE,INCIDENT_MONTH,INCIDENT_YEAR,TIME,TIME_OF_DAY,AIRPORT_ID,AIRPORT,RUNWAY,STATE,...,SIZE,NR_INJURIES,NR_FATALITIES,COMMENT,REPORTER_NAME,REPORTER_TITLE,SOURCE,PERSON,LUPDATE,TRANSFER
0,633309,1999-12-24,12,1999,10:15,Day,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,18R,NC,...,Small,,,SOURCE = TWO XXXX-X REPTS /Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,Air Transport Operations,2000-03-10,False
1,634726,1999-12-15,12,1999,,Day,KRDU,RALEIGH-DURHAM INTL,23R,NC,...,Small,,,/Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,,2000-03-09,False
2,636216,1999-12-14,12,1999,07:40,Day,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,18R,NC,...,Small,,,SOURCE = TWO XXXX-X REPTS /Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,Tower,2000-03-09,False
3,633739,1999-12-11,12,1999,17:00,Dusk,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,36R,NC,...,Large,,,SOURCE = TWO XXXX-X REPTS /Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,Tower,2000-03-09,False


In [17]:
# Reference a row or multiple rows by zero-based integer position

# Access first row of wl_strikes_csv by row integer value
# In this case the row is row 0
wl_strikes_csv.iloc[0]

# Access first row of wl_strikes_json by row integer value
# In this case the row is also row 0
wl_strikes_json.iloc[0]

INCIDENT_DATE             2020-11-05
INCIDENT_MONTH                    11
INCIDENT_YEAR                   2020
TIME                           05:00
TIME_OF_DAY                     None
                         ...        
REPORTER_TITLE              REDACTED
SOURCE             FAA Form 5200-7-E
PERSON            Airport Operations
LUPDATE                   2020-12-04
TRANSFER                       False
Name: 1080125, Length: 90, dtype: object

In [18]:
# Reference multiple rows by row number (in this case rows 0 through 2)
# Note that this time the range doesn't include the stop number
wl_strikes_csv.iloc[0:3]

Unnamed: 0,INDX_NR,INCIDENT_DATE,INCIDENT_MONTH,INCIDENT_YEAR,TIME,TIME_OF_DAY,AIRPORT_ID,AIRPORT,RUNWAY,STATE,...,SIZE,NR_INJURIES,NR_FATALITIES,COMMENT,REPORTER_NAME,REPORTER_TITLE,SOURCE,PERSON,LUPDATE,TRANSFER
0,633309,1999-12-24,12,1999,10:15,Day,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,18R,NC,...,Small,,,SOURCE = TWO XXXX-X REPTS /Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,Air Transport Operations,2000-03-10,False
1,634726,1999-12-15,12,1999,,Day,KRDU,RALEIGH-DURHAM INTL,23R,NC,...,Small,,,/Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,,2000-03-09,False
2,636216,1999-12-14,12,1999,07:40,Day,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,18R,NC,...,Small,,,SOURCE = TWO XXXX-X REPTS /Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,Tower,2000-03-09,False
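
The difference between `.loc` and `.iloc` is clearest when the index labels are not 0, 1, 2, and so on. The following sketch uses a hypothetical frame whose index labels start at 101:

```python
import pandas as pd

# Hypothetical frame whose index labels are record numbers, not positions
sample_df = pd.DataFrame(
    {"AIRPORT_ID": ["KRDU", "KGSO", "KCLT", "KCLT"]},
    index=[101, 102, 103, 104],
)

first_by_label = sample_df.loc[101]    # row whose index label is 101
first_by_position = sample_df.iloc[0]  # row at position 0 (the same row)

label_slice = sample_df.loc[101:103]   # label slice: the stop label is included
position_slice = sample_df.iloc[0:2]   # position slice: the stop position is excluded

print(len(label_slice))     # 3
print(len(position_slice))  # 2
```

Here `sample_df.loc[0]` would raise a `KeyError`, because no row carries the label 0.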


#### Referencing Columns

In [19]:
# Referencing a column by column label (in this case, "AIRPORT")
wl_strikes_csv['AIRPORT']

0      CHARLOTTE/DOUGLAS INTL ARPT
1              RALEIGH-DURHAM INTL
2      CHARLOTTE/DOUGLAS INTL ARPT
3      CHARLOTTE/DOUGLAS INTL ARPT
4              RALEIGH-DURHAM INTL
                  ...             
664            RALEIGH-DURHAM INTL
665    CHARLOTTE/DOUGLAS INTL ARPT
666    CHARLOTTE/DOUGLAS INTL ARPT
667            PIEDMONT TRIAD INTL
668             SMITH REYOLDS ARPT
Name: AIRPORT, Length: 669, dtype: object

In [20]:
# Referencing multiple columns by a list of column labels
# (in this case, the columns "INDX_NR" and "AIRPORT")
wl_strikes_csv[['INDX_NR', 'AIRPORT']]

# Equivalent: store the list of labels in a variable first
wl_list = ['INDX_NR', 'AIRPORT']
wl_strikes_csv[wl_list]

Unnamed: 0,INDX_NR,AIRPORT
0,633309,CHARLOTTE/DOUGLAS INTL ARPT
1,634726,RALEIGH-DURHAM INTL
2,636216,CHARLOTTE/DOUGLAS INTL ARPT
3,633739,CHARLOTTE/DOUGLAS INTL ARPT
4,636671,RALEIGH-DURHAM INTL
...,...,...
664,611224,RALEIGH-DURHAM INTL
665,609163,CHARLOTTE/DOUGLAS INTL ARPT
666,611004,CHARLOTTE/DOUGLAS INTL ARPT
667,828185,PIEDMONT TRIAD INTL
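
Single and double brackets return different types, which matters for what you can do with the result. A short sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical data with made-up values
sample_df = pd.DataFrame({
    "INDX_NR": [1, 2, 3],
    "AIRPORT": ["A", "B", "C"],
    "STATE": ["NC", "NC", "NC"],
})

one_column = sample_df["AIRPORT"]                 # a label gives a Series
one_column_frame = sample_df[["AIRPORT"]]         # a list of labels gives a DataFrame
two_columns = sample_df[["INDX_NR", "AIRPORT"]]   # columns come back in list order

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```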


#### Referencing both rows and columns

In [21]:
# Referencing a subset of rows and columns using index and column labels
# Note that we're using a range of column labels instead of a list
# Make sure that your column range starts with the leftmost label
wl_strikes_csv.loc[:10, 'INDX_NR':'TIME']

Unnamed: 0,INDX_NR,INCIDENT_DATE,INCIDENT_MONTH,INCIDENT_YEAR,TIME
0,633309,1999-12-24,12,1999,10:15
1,634726,1999-12-15,12,1999,
2,636216,1999-12-14,12,1999,07:40
3,633739,1999-12-11,12,1999,17:00
4,636671,1999-12-11,12,1999,
5,634197,1999-11-28,11,1999,
6,633113,1999-11-22,11,1999,
7,632226,1999-11-07,11,1999,
8,632444,1999-11-07,11,1999,
9,633640,1999-11-05,11,1999,16:15
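
The same row-and-column subset can be expressed by label with `.loc` or by position with `.iloc`. This sketch compares the two on a hypothetical four-row frame with a default integer index:

```python
import pandas as pd

# Hypothetical data with made-up values
sample_df = pd.DataFrame({
    "INDX_NR": [1, 2, 3, 4],
    "INCIDENT_YEAR": [1999, 1999, 1998, 1997],
    "TIME": ["10:15", None, "07:40", "17:00"],
    "STATE": ["NC", "NC", "NC", "NC"],
})

# By label: rows 0 through 2 and columns INDX_NR through TIME (both ends included)
by_label = sample_df.loc[:2, "INDX_NR":"TIME"]

# By position: the first three rows and first three columns (stops excluded)
by_position = sample_df.iloc[:3, 0:3]

print(by_label.shape)
print(by_label.equals(by_position))
```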


## Writing data to a file

In [22]:
# Save the subset from the previous cell in a variable
first_few = wl_strikes_csv.loc[:10, 'INDX_NR':'TIME']

# Write to csv
first_few.to_csv('new_data.csv')

In [None]:
# Write to an Excel file (.xlsx format; requires an Excel engine such as openpyxl)
first_few.to_excel('new_data.xlsx')

In [None]:
# Write to a JSON file
first_few.to_json('new_data.json')
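
To check that a file was written as expected, you can read it back in. The sketch below does a round trip through a temporary folder with a hypothetical two-row frame. The `index=False` and `orient="records"` arguments are choices made for this example (they leave the row labels out of the files), not settings used in the cells above:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data with made-up values
sample_df = pd.DataFrame({"INDX_NR": [1, 2], "AIRPORT_ID": ["KCLT", "KRDU"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    json_path = os.path.join(tmp_dir, "sample.json")

    # index=False leaves the row labels out of the CSV file
    sample_df.to_csv(csv_path, index=False)
    # orient="records" writes one JSON object per row
    sample_df.to_json(json_path, orient="records")

    # Read both files back while the temporary folder still exists
    csv_round_trip = pd.read_csv(csv_path)
    json_round_trip = pd.read_json(json_path, orient="records")

print(list(csv_round_trip.columns))
print(list(json_round_trip.columns))
```

Without `index=False`, `to_csv` also writes the index as an extra unnamed first column, which then shows up as `Unnamed: 0` when the file is read back.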