<div style="text-align:center;">
  <img src="images/molssi_main_horizontal.png" style="display: block; margin: 0 auto; max-height:200px;">
</div>

# Analyzing Periodic Trends with Python

<div class="alert alert-block alert-info">

Questions:

* How can I analyze data using Python?

* How can I use Python to visualize data?


Objectives:

* Learn to import a Python package.

* Learn to use Python to analyze tabular data.

* Learn to use the Python package `plotly` to create interactive plots.

</div>

## Introduction

As chemists, one of the first things we learn about in our courses is the periodic table.
The periodic table arranges chemical elements into rows and columns based on their atomic number and electron configuration.

In this notebook, we will use two Python libraries to analyze and visualize periodic trends.

Our data is stored in a comma-separated values (CSV) file. You might be familiar with this format from spreadsheet software like Microsoft Excel or Google Sheets. We will use the `pandas` library to read the data from the CSV file and analyze it. We will then examine periodic trends using the `plotly` library to create plots.

We obtained this data from the [Periodic Table on PubChem](https://pubchem.ncbi.nlm.nih.gov/periodic-table/).
For this workshop, we also slightly processed the data so that there is a column for the period of the element in the periodic table.
Our data is in a CSV file in the `data` directory of this repository.

### Importing Libraries

For this notebook, we will be reading and analyzing data using `pandas`.
`pandas` is not part of standard Python, so we must import the package to use it.
When we import a package, we can give it an alias to make it easier to use.

The syntax for importing a package is

```python
import package_name as alias
```

Then, we can use functions in the package by calling `alias.function_name`.
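
As a small illustration of this syntax, the sketch below imports Python's built-in `math` module under the alias `m`. The alias `m` is an arbitrary choice made for this example.

```python
# Import the built-in math module and give it the alias "m"
import math as m

# Functions are called as alias.function_name
root = m.sqrt(16)
print(root)  # 4.0
```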

## Reading and Analyzing Data using `pandas`

In [None]:
# This is how we import the pandas library
# we give it the name "pd"
import pandas as pd

To read data from a CSV file, we will use the `read_csv` function from the `pandas` package.
You can read more about this function in your Jupyter notebook by typing `pd.read_csv?` in the cell below. When you run the cell, a window will pop up with the documentation for the function.

In [None]:
# Use this cell to read about pd.read_csv


<div class="alert alert-block alert-success"> 
<strong> Python Documentation </strong>

Most popular Python libraries have very good online documentation. 
You can find the pandas documentation by googling "pandas docs".
You will be able to find the same help message you get for `read_csv` as well as tutorials and other types of documentation.

1. [Pandas Documentation](https://pandas.pydata.org/docs/)
2. [`read_csv` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
</div>

Now that we've read about `read_csv`, let's use the function to read our data. 
It is common to name the variable from the `read_csv` function `df`. This is short for `DataFrame`.
Since this file is relatively simple, we do not need any additional arguments to the function. The `read_csv` function reads in tabular data which is comma delimited by default.
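
To see what `read_csv` does without needing a file, here is a minimal sketch that reads comma-separated text from memory. The three-row table and the name `example_df` are made up for illustration and are not the workshop data; `io.StringIO` lets a string stand in for a file.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the column headers
csv_text = """AtomicNumber,Symbol,Name
1,H,Hydrogen
2,He,Helium
3,Li,Lithium
"""

# read_csv accepts a file path or a file-like object
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df.shape)          # (number of rows, number of columns)
print(list(example_df.columns))  # names taken from the header line
```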


In [None]:
# Use the read_csv function to read the file "data/PubChemElements_all.csv"

### Examining the data

The variable `df` is now a pandas DataFrame with the information contained in the CSV file. You can examine the DataFrame using the `.head()` method.
This shows the first 5 rows stored in the DataFrame.
The first row of the file was used for column headers.


In [None]:
# Use the df.head() method


<div class="alert alert-block alert-success"> 
<strong>Python Skills: Python Objects</strong>

In this notebook, we will be working with pandas DataFrame objects.
In Python, we use the word "object" to refer to a variable structure that has associated data and can perform actions (called "methods").
One example of an object we have seen in notebooks is a list - we could also call it a "list object". An object has `attributes` (data) and `methods`. 
You access information about objects with the syntax

```python
object.data
```
where data is the attribute name.

You access object methods with the syntax
```python
object.method(arguments)
```

For example, `append` is a list method that was covered in the introductory lesson.

```python
my_list = []
my_list.append(1) # "append" is a method
```
</div>    
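
The same distinction applies to DataFrames. The sketch below uses a small made-up table (not the workshop data) to contrast an attribute, `shape`, with a method, `head`.

```python
import pandas as pd

# Made-up two-column table used only for this illustration
example_df = pd.DataFrame({"Symbol": ["H", "He", "Li"], "AtomicNumber": [1, 2, 3]})

# An attribute is data stored on the object: no parentheses
print(example_df.shape)

# A method is an action the object can perform: called with parentheses
first_two = example_df.head(2)
print(first_two)
```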

The `.info()` method will give information about the columns and the data type of those columns. The data type will become very important later as we work with more data.

In [None]:
# use the df.info() method

This output will show information about each column of data.
pandas assigns a data type to each column, and will do its best to infer that type from the values the column contains.
We can also see how many values are in each column.
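
As a rough sketch of how this type inference behaves, the made-up table below mixes text, whole numbers, and decimal numbers with one missing value. The column contents are invented for illustration; the exact name pandas reports for the text column's type can differ between pandas versions.

```python
import pandas as pd

# Made-up table: one text column, one integer column, one decimal column
example_df = pd.DataFrame({
    "Symbol": ["H", "He", "Li"],
    "AtomicNumber": [1, 2, 3],
    "ExampleValue": [1.5, None, 7.25],  # None becomes a missing value (NaN)
})

# The data type pandas chose for each column
print(example_df.dtypes)

# The number of non-missing values in each column
print(example_df.count())
```

The missing entry in `ExampleValue` is why that column reports one fewer value than the others.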

We can also easily see descriptive statistics using the `.describe()` method.

In [None]:
# use the df.describe() method

### Accessing Data in DataFrames

One way to get information from a DataFrame is to select a column by its header (column name) using square brackets. The syntax for this is

```python
dataframe["column_name"]
```

For example, to get the column with information about the atomic symbol, we would use the "Symbol" column.

In [None]:
df["Symbol"]

In [None]:
# try some other column names

In [None]:
# try some other column names


If you want to select multiple columns, you use a list of column names in square brackets.

In [None]:
df[["Symbol", "AtomicNumber"]]

### Get data using number index

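If we want to get information in the dataframe using row and column numbers, we use `iloc`. Note that `iloc` is followed by square brackets rather than parentheses, and that row and column numbers start at 0.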

The syntax for `iloc` is

```python
dataframe.iloc[row_number, column_number]
```

If you specify only a row number, you will get all the columns.

In [None]:
# This will get the first row, all of the columns.
df.iloc[0]

In [None]:
# This will get the first row and the second column.
df.iloc[0, 1]

<div class="alert alert-block alert-warning"> 

<h3>Check Your Understanding</h3>

How would you get the `ElectronConfiguration` column?

How would you get the value in row index 10 of the `ElectronConfiguration` column?
</div>

## Performing Calculations
You can do mathematical operations on entire columns or rows of `pandas` dataframes using single lines of code.
For example, if we wanted to subtract 273.15 from every value in our melting point column (converting from Kelvin to degrees Celsius), we could do so by writing one expression.

In [None]:
df["MeltingPoint"] - 273.15

To save your calculation in a new column in your dataframe,
use the syntax

```python
df["new_column_name"] = CALCULATION
```

In [None]:
df["MeltingPointC"] = df["MeltingPoint"] - 273.15

### Filtering Data

We can also filter data using signs like greater than `>`, less than `<`, or equal to `==`.

In the cell below, we check whether or not each boiling point is greater than 500.
The result is `True` where it is and `False` where it is not.


In [None]:
df["BoilingPoint"] > 500

We can use this as an index or slice, similar to how we learned with lists.
We can see from the output that this gives us 80 elements.

In [None]:
above_500 = df[df["BoilingPoint"]>500]

above_500.head()
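
One way to check how many rows pass a filter is to count them directly. The sketch below uses made-up boiling points (not the workshop data) and the illustrative names `mask` and `filtered`.

```python
import pandas as pd

# Made-up boiling points, for illustration only
example_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "BoilingPoint": [20.0, 650.0, 1200.0, 373.0],
})

mask = example_df["BoilingPoint"] > 500  # Series of True/False values
filtered = example_df[mask]              # keeps only rows where mask is True

print(mask.sum())     # True counts as 1, so this is the number of matches
print(len(filtered))  # the same number, from the filtered DataFrame
```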

<div class="alert alert-block alert-warning"> 

<strong>Check Your Understanding</strong>

Use filtering to find all elements with a "Solid" standard state.

You can solve this problem by working in steps:

(1) Get the `StandardState` column.

(2) Use a logic operation to find if the value of the `StandardState` column is equal to "Solid".

(3) Use the result of the logic operation to filter the data.

</div>

In [None]:
# Put your code here

### Saving a new DataFrame

If you wanted to save your data to a CSV file, you could do it using the method `to_csv`.

In [None]:
df.to_csv("data/periodic_data_processed.csv", index=False)
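
To illustrate what `index=False` does, here is a small round-trip sketch that writes a made-up table to a temporary file and reads it back. The table, the file name `example.csv`, and the use of a temporary directory are assumptions for this example only.

```python
import os
import tempfile
import pandas as pd

# Made-up table, for illustration only
example_df = pd.DataFrame({"Symbol": ["H", "He"], "AtomicNumber": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")
    # index=False leaves the row labels (0, 1, ...) out of the file
    example_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Only the original columns come back; no extra index column was written
print(list(reloaded.columns))
```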

## Visualization: Analyzing Periodic Trends

Now that we've learned how to access and manipulate data in a DataFrame, we can start to analyze periodic trends.

The first trends we'll observe are related to **Atomic radius** - how does the atomic radius change as we increase atomic number? How does it change across a period or down a group? An effective way to see this will be to create visualizations of the trends. We will use a library called [plotly](https://plotly.com/python/) to create interactive plots. 

Plotly is a powerful library for interactive plotting in Jupyter notebooks. 
You can see examples by clicking the provided link.

In this workshop, we'll be keeping our plots simple. We'll use an interface called `plotly.express`, which is meant to be a quick way to create plots.

In [None]:
import plotly.express as px

After we've imported the `plotly.express` package, we can use the `line` function to create a line plot.
The first argument is the DataFrame, the second argument is the x-axis, and the third argument is the y-axis.

In [None]:
fig = px.line(df, x="AtomicNumber", y="AtomicRadius")
fig.show()

When you hover your mouse over the points in the plot above, you can see the values of the data points. This is a feature of `plotly` plots.

You could also use the `scatter` function to create a scatter plot.
Try using the `scatter` function to create a plot with `AtomicRadius` on the y-axis and `AtomicNumber` on the x-axis in the cell below.

In [None]:
# Use px.scatter to make a scatter plot

<div class="alert alert-block alert-warning"> 

<strong>Check Your Understanding</strong>

Create a scatter plot with `ElectronAffinity` on the y-axis and `AtomicNumber` on the x-axis.

Try adding another argument called `color` to the `scatter` function and set it equal to `GroupBlock`. Try coloring by `Period`. Are there any trends you can see in the data?

What happens if you use `px.line` instead of `px.scatter`?

</div>

Plotly can also be used to create box plots to more easily see data distributions.

In [None]:

# Create a box plot
fig = px.box(df, x="GroupBlock", y="AtomicRadius", title="Atomic Radius by Group Block")

# Show plot
fig.show()


<div class="alert alert-block alert-warning"> 

<strong>Final Challenge</strong>

Create a line plot with `ElectronAffinity` on the y axis and `Period` on the x axis. Your plot should have a line for `Halogen`, a line for `Alkali metal`, and a line for `Noble gas`.

Starting code is given in the cell below to show you how to filter the data for multiple groups.

The steps to solving this will be:

1. Add `Alkali metal` to the filter.

2. Get just the rows where the filter is `True` using indexing.

3. Save a new dataframe with the filtered data.

4. Create the line plot with the new dataframe.

</div>

In [None]:
# This will tell you if the value in "GroupBlock" for each row is either "Halogen" or "Noble gas"
df["GroupBlock"].isin(["Halogen", "Noble gas"])
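
To show how an `isin` filter can be used for indexing, here is a sketch on a made-up table with hypothetical columns `Item` and `Category`. It is not the solution to the challenge, only an illustration of the pattern.

```python
import pandas as pd

# Made-up table, for illustration only
example_df = pd.DataFrame({
    "Item": ["a", "b", "c", "d", "e"],
    "Category": ["red", "blue", "green", "red", "blue"],
})

# True where the Category value is one of the listed options
keep = example_df["Category"].isin(["red", "green"])

# Use the True/False values to keep only the matching rows
subset = example_df[keep]
print(subset)
```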