# Python Workshop 2: Getting started with Pandas
[Placeholder for intro text]

Notes:

- Converting to a JSON file?
- Web API materials

## What is a Python library?

We've covered how to code in the Python programming language, and now we'll move on to the Pandas library. A "library" in this context is a package of code that extends the functionality of Python. Base Python offers a lot of features, but not everything. You can import a library at the beginning of your code to use it for your specific purpose.

For example, you may import Matplotlib to create graphs and plots, or the Natural Language Toolkit (NLTK) to do natural language processing. Today we will use the Pandas library to manipulate a dataset.

## What is Pandas?

Pandas is a high-level data manipulation tool first created in 2008 by Wes McKinney. The name comes from the term “panel data,” an econometrics term for data sets that include observations over multiple time periods for the same individuals.<sup>[[wikipedia](https://en.wikipedia.org/wiki/Pandas_(software))]</sup>

From Jake VanderPlas's book [**Python Data Science Handbook**](http://shop.oreilly.com/product/0636920034919.do):

> As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

### What does Pandas do?
* Reading and writing data from persistent storage
* Cleaning, filtering, and otherwise preparing data
* Calculating statistics and analyzing data
* Visualization with help from Matplotlib

## Importing a Python library

To use any library, we must first import it into our Python code.

In [1]:
# Import the Pandas library as pd (callable in our code as pd)
import pandas as pd

## Importing files into Pandas
We have prepared a dataset of perovskite materials for this workshop. We will import it into our notebook to use it for data analysis.

Datasets can be stored in several types of files, including .csv, .json, .txt, .xls, .xlsx, and more. Here we will import a .csv file, an .xlsx file, and a .json file.

### CSV Files

A comma-separated values (CSV) file is a plain text file in which each line is a row of data and the values in each row are separated by commas.
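
To make the format concrete, here is a minimal sketch that reads CSV text from a string instead of a file. The column names and values are made up for illustration and are not the workshop dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text: a header row followed by two data rows
csv_text = """material,a_site,energy
Ba1Sr7V8O24,Ba,29.7
Ba2Ca6Fe8O24,Ba,171.6
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)          # (number of rows, number of columns)
print(list(small_df.columns))  # column labels taken from the header row
```

The first line of the text becomes the column labels, and each following line becomes one row.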

In [2]:
# Import a comma-separated values (CSV) file as a DataFrame

# The file location
csv_file_url = 'https://raw.githubusercontent.com/NCSU-Libraries/data-viz-instruction/main/MI_REU_2021/data/perovskite_DFT_EaH_FormE.csv'

# Read in the file and print out the DataFrame
ts_csv = pd.read_csv(csv_file_url)
ts_csv.head()

Unnamed: 0,Material Composition,A site #1,A site #2,A site #3,B site #1,B site #2,B site #3,X site,Empty,Empty.1,Empty.2,Empty.3,Empty.4,energy_above_hull (meV/atom),formation_energy (eV/atom)
0,Ba1Sr7V8O24,Ba,Sr,,V,,,O,,,,,,29.747707,-2.113335
1,Ba2Bi2Pr4Co8O24,Ba,Bi,Pr,Co,,,O,,,,,,106.702335,-1.311863
2,Ba2Ca6Fe8O24,Ba,Ca,,Fe,,,O,,,,,,171.608093,-1.435607
3,Ba2Cd2Pr4Ni8O24,Ba,Cd,Pr,Ni,,,O,,,,,,284.89819,-0.868639
4,Ba2Dy6Fe8O24,Ba,Dy,,Fe,,,O,,,,,,270.007913,-1.746806


### Excel Files

[Placeholder for info about Excel files]

In [3]:
# The file location
excel_file_url = 'https://github.com/NCSU-Libraries/data-viz-instruction/blob/main/MI_REU_2021/data/perovskite_DFT_EaH_FormE.xlsx?raw=true'

# Read in the file and print out the DataFrame
ts_excel = pd.read_excel(excel_file_url)
ts_excel.head()

Unnamed: 0,Material Composition,A site #1,A site #2,A site #3,B site #1,B site #2,B site #3,X site,Empty,Empty.1,Empty.2,Empty.3,Empty.4,energy_above_hull (meV/atom),formation_energy (eV/atom)
0,Ba1Sr7V8O24,Ba,Sr,,V,,,O,,,,,,29.747707,-2.113335
1,Ba2Bi2Pr4Co8O24,Ba,Bi,Pr,Co,,,O,,,,,,106.702335,-1.311863
2,Ba2Ca6Fe8O24,Ba,Ca,,Fe,,,O,,,,,,171.608093,-1.435607
3,Ba2Cd2Pr4Ni8O24,Ba,Cd,Pr,Ni,,,O,,,,,,284.89819,-0.868639
4,Ba2Dy6Fe8O24,Ba,Dy,,Fe,,,O,,,,,,270.007913,-1.746806


### JSON Files

JSON (JavaScript Object Notation) is a data storage format that uses name/value pairs to create objects and associative arrays. Learn more about [JSON file structure and syntax from W3Schools](https://www.w3schools.com/js/js_json_syntax.asp).

In [4]:
# Importing a JavaScript object notation (JSON) file

# The file location
json_file_url = ''

# Read in the file and print out the DataFrame
ts_json = pd.read_json(json_file_url)
ts_json.head()

ValueError: Expected object or value
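
The cell above has no file location filled in, so it has nothing to read. As a small self-contained illustration of the name/value structure, the sketch below reads JSON text from a string. The field names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical JSON text: a list of objects, each made of name/value pairs
json_text = '[{"material": "A", "energy": 1.5}, {"material": "B", "energy": 2.5}]'

# Wrap the string in StringIO so read_json treats it as a file-like object
json_df = pd.read_json(io.StringIO(json_text))

print(json_df)
```

Each object in the list becomes a row, and each name becomes a column label.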

### Web API

[Placeholder for web API instructions]

## Pandas data structures

Pandas uses two main data structures: `Series` and `DataFrame`.

<img src="https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Data_Manipulation_with_Python/assets/nc_dataframes.png" alt="DataFrames are composed of Series" width="75%">

### `Series`
A `Series` is a one-dimensional array of indexed data, or a single column of data. It can be thought of as a specialized dictionary or a generalized NumPy array. You can learn more about the Series data type in the [Pandas documentation for Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html).
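
A minimal sketch of the dictionary-like and array-like sides of a `Series`, using made-up labels and values:

```python
import pandas as pd

# Hypothetical values: building a Series from a dictionary makes the keys the index
energies = pd.Series({"Ba": 29.7, "Sr": 106.7, "Ca": 171.6}, name="energy")

print(energies["Sr"])    # look up a value by index label, like a dictionary
print(energies.iloc[0])  # look up a value by position, like an array
print(energies.mean())   # array-style calculation over all values
```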

### `DataFrame`
A `DataFrame` is a two-dimensional structure composed of one or more `Series`, similar to tabular data (think of Excel). Every `DataFrame` has a row `Index` and column labels, and both are flexible: you can set them to values that suit your data.

It can be thought of as a generalization of a two-dimensional NumPy array, or a specialization of a dictionary in which each column name maps to a `Series` of column data. You can learn more about the DataFrame data type in the [Pandas documentation for DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).

A `DataFrame` is made up of `Series` in the same way that a table is made up of columns. The only restriction is that all values within a single column share one data type; different columns can have different data types. Many of the operations that can be performed on a `DataFrame` can also be performed on an individual `Series`.
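
The relationship between the two structures can be sketched as follows. The data and variable names here are hypothetical.

```python
import pandas as pd

# Hypothetical data: two Series that share the same default index (0, 1, 2)
elements = pd.Series(["Ba", "Sr", "Ca"], name="element")
energies = pd.Series([29.7, 106.7, 171.6], name="energy")

# Combine the Series into a DataFrame; each Series becomes one column
df = pd.DataFrame({"element": elements, "energy": energies})

print(df.dtypes)           # each column reports its own data type
print(type(df["energy"]))  # selecting a single column returns a Series
print(df["energy"].max())  # Series operations work on a DataFrame column
```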

## Exploring your data

Now that we have our data, we can use Pandas to explore it. This is useful when you are new to a dataset and want to see what it contains and how to start analyzing it.

### View DataFrame column labels

Our DataFrame has 15 columns. We can quickly view the label names for each column using the DataFrame `columns` property.

In [5]:
# View column labels (headers)
ts_csv.columns

Index(['Material Composition', 'A site #1', 'A site #2', 'A site #3',
       'B site #1', 'B site #2', 'B site #3', 'X site', 'Empty', 'Empty.1',
       'Empty.2', 'Empty.3', 'Empty.4', 'energy_above_hull (meV/atom)',
       'formation_energy (eV/atom)'],
      dtype='object')

### View summaries of a DataFrame

We can quickly generate summaries of our DataFrame to observe some basic statistics and information such as column data types and non-null value counts.

In [6]:
# Get summary statistics of DataFrame columns using "describe()" (only includes
# numerical data types)
ts_csv.describe()

Unnamed: 0,Empty,Empty.1,Empty.2,Empty.3,Empty.4,energy_above_hull (meV/atom),formation_energy (eV/atom)
count,0.0,0.0,0.0,0.0,0.0,1929.0,1929.0
mean,,,,,,105.532633,-1.91446
std,,,,,,98.395552,0.57034
min,,,,,,0.0,-3.2085
25%,,,,,,33.436112,-2.315473
50%,,,,,,84.202506,-1.900529
75%,,,,,,155.909864,-1.474341
max,,,,,,956.831956,-0.488125


In [8]:
# Get summary statistics of single column using "describe()"
ts_csv['formation_energy (eV/atom)'].describe()

count    1929.000000
mean       -1.914460
std         0.570340
min        -3.208500
25%        -2.315473
50%        -1.900529
75%        -1.474341
max        -0.488125
Name: formation_energy (eV/atom), dtype: float64

In [9]:
# Summarize column data types, non-null values, and memory usage using "info()"
ts_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1929 entries, 0 to 1928
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Material Composition          1929 non-null   object 
 1   A site #1                     1929 non-null   object 
 2   A site #2                     1161 non-null   object 
 3   A site #3                     34 non-null     object 
 4   B site #1                     1929 non-null   object 
 5   B site #2                     1249 non-null   object 
 6   B site #3                     33 non-null     object 
 7   X site                        1929 non-null   object 
 8   Empty                         0 non-null      float64
 9   Empty.1                       0 non-null      float64
 10  Empty.2                       0 non-null      float64
 11  Empty.3                       0 non-null      float64
 12  Empty.4                       0 non-null      float64
 13  ene

### Referencing and indexing a DataFrame

#### Referencing Rows (.loc and .iloc)

In [10]:
# Reference a row by index label
# Returns a Series

# Access the first row of ts_csv by index label
# In this case the index label is 0
ts_csv.loc[0]

Material Composition            Ba1Sr7V8O24
A site #1                                Ba
A site #2                                Sr
A site #3                               NaN
B site #1                                 V
B site #2                               NaN
B site #3                               NaN
X site                                    O
Empty                                   NaN
Empty.1                                 NaN
Empty.2                                 NaN
Empty.3                                 NaN
Empty.4                                 NaN
energy_above_hull (meV/atom)        29.7477
formation_energy (eV/atom)         -2.11333
Name: 0, dtype: object

In [11]:
# Reference multiple rows by index label (in this case the index labels 0 through 3)
# Note that a .loc range includes the stop label
# Returns a DataFrame
ts_csv.loc[0:3]

Unnamed: 0,Material Composition,A site #1,A site #2,A site #3,B site #1,B site #2,B site #3,X site,Empty,Empty.1,Empty.2,Empty.3,Empty.4,energy_above_hull (meV/atom),formation_energy (eV/atom)
0,Ba1Sr7V8O24,Ba,Sr,,V,,,O,,,,,,29.747707,-2.113335
1,Ba2Bi2Pr4Co8O24,Ba,Bi,Pr,Co,,,O,,,,,,106.702335,-1.311863
2,Ba2Ca6Fe8O24,Ba,Ca,,Fe,,,O,,,,,,171.608093,-1.435607
3,Ba2Cd2Pr4Ni8O24,Ba,Cd,Pr,Ni,,,O,,,,,,284.89819,-0.868639


In [12]:
# Reference a row or multiple rows by zero-based integer position

# Access the first row of ts_csv by row integer position
# In this case the row is row 0
ts_csv.iloc[0]

Material Composition            Ba1Sr7V8O24
A site #1                                Ba
A site #2                                Sr
A site #3                               NaN
B site #1                                 V
B site #2                               NaN
B site #3                               NaN
X site                                    O
Empty                                   NaN
Empty.1                                 NaN
Empty.2                                 NaN
Empty.3                                 NaN
Empty.4                                 NaN
energy_above_hull (meV/atom)        29.7477
formation_energy (eV/atom)         -2.11333
Name: 0, dtype: object

In [13]:
# Reference multiple rows by row number (in this case rows 0 through 2)
# Note that this time the range doesn't include the stop number
ts_csv.iloc[0:3]

Unnamed: 0,Material Composition,A site #1,A site #2,A site #3,B site #1,B site #2,B site #3,X site,Empty,Empty.1,Empty.2,Empty.3,Empty.4,energy_above_hull (meV/atom),formation_energy (eV/atom)
0,Ba1Sr7V8O24,Ba,Sr,,V,,,O,,,,,,29.747707,-2.113335
1,Ba2Bi2Pr4Co8O24,Ba,Bi,Pr,Co,,,O,,,,,,106.702335,-1.311863
2,Ba2Ca6Fe8O24,Ba,Ca,,Fe,,,O,,,,,,171.608093,-1.435607
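
In `ts_csv` the index labels happen to match the row positions, which can hide the difference between `.loc` and `.iloc`. The sketch below uses a hypothetical DataFrame with string index labels to show the two side by side.

```python
import pandas as pd

# Hypothetical DataFrame whose index labels are strings rather than 0, 1, 2, ...
df = pd.DataFrame(
    {"energy": [29.7, 106.7, 171.6, 284.9]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c"]   # label slice: includes the stop label "c"
by_position = df.iloc[0:2]   # position slice: excludes the stop position 2

print(list(by_label.index))     # ['a', 'b', 'c']
print(list(by_position.index))  # ['a', 'b']
```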


#### Referencing Columns

In [14]:
# Referencing a column by column label (in this case, "A site #1")
ts_csv['A site #1']

0       Ba
1       Ba
2       Ba
3       Ba
4       Ba
        ..
1924     Y
1925     Y
1926     Y
1927     Y
1928     Y
Name: A site #1, Length: 1929, dtype: object

In [15]:
# Referencing multiple columns by a list of column labels 
# (in this case, the columns "A site #1" and "A site #3")
ts_csv[['A site #1', 'A site #3']]

Unnamed: 0,A site #1,A site #3
0,Ba,
1,Ba,Pr
2,Ba,
3,Ba,Pr
4,Ba,
...,...,...
1924,Y,
1925,Y,
1926,Y,
1927,Y,


#### Referencing both rows and columns

In [17]:
# Referencing a subset of rows and columns using index and column labels
# Note that we're using a range of column labels instead of a list
# Make sure that your column range starts with the leftmost label
ts_csv.loc[:10, 'A site #1':'X site']

Unnamed: 0,A site #1,A site #2,A site #3,B site #1,B site #2,B site #3,X site
0,Ba,Sr,,V,,,O
1,Ba,Bi,Pr,Co,,,O
2,Ba,Ca,,Fe,,,O
3,Ba,Cd,Pr,Ni,,,O
4,Ba,Dy,,Fe,,,O
5,Ba,Gd,,Fe,,,O
6,Ba,Ho,,Fe,,,O
7,Ba,La,,Co,,,O
8,Ba,La,,Cr,,,O
9,Ba,La,,Fe,,,O


## Writing data to a file

In [None]:
# Save the subset from the previous cell in a variable
first_rows = ts_csv.loc[:10, 'A site #1':'X site']

# Write to csv
first_rows.to_csv('new_data.csv')

In [None]:
# Write to an Excel file (.xlsx; assumes the openpyxl package is installed)
first_rows.to_excel('new_data.xlsx')

In [None]:
# Write to a JSON file
first_rows.to_json('new_data.json')
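
One way to check that a file was written as expected is to read it back in. The sketch below does this with a hypothetical two-row DataFrame and a temporary folder, so it does not touch the workshop files. The `index=False` argument leaves the row index out of the file, which avoids an extra unnamed column when the file is read again.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small DataFrame standing in for a subset of the workshop data
subset = pd.DataFrame({"element": ["Ba", "Sr"], "energy": [29.7, 106.7]})

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "subset.csv")

    # index=False leaves the row index out of the file
    subset.to_csv(csv_path, index=False)

    # Read the file back while the temporary folder still exists
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```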