[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1QecD3OF_fvgp73ZcbDweGS-0la53h6t-?usp=sharing)

# **Week 2 - Introduction to Matplotlib**

## **Abstract**


1. **Introduction to Matplotlib**
   
   This educational codebase serves as an in-depth guide to data visualization and management in the field of Chemistry, utilizing Python's Matplotlib and Pandas libraries.


2. **Introduction to Molecular Properties:**
   
   The content dives deep into the concepts of **"Highest Occupied Molecular Orbital (*HOMO*)"** and **"Lowest Unoccupied Molecular Orbital (*LUMO*)"** and their relevance in Chemistry. It elaborates on the energy gap between these levels, influencing molecular stability and reactivity.

3. **Analyzing Molecular Properties:**
   
   The guide uses **Matplotlib** to visualize key molecular attributes like **HOMO** and **LUMO** levels. These visualizations offer insights into molecular stability and reactivity.

4. **Introduction to Heatmaps with Seaborn**
   
   The guide introduces **heatmaps** as a powerful visualization tool for complex datasets, using the [**Seaborn**](https://seaborn.pydata.org/) library. **Heatmaps** are particularly useful for identifying patterns and correlations in large datasets.

5. **Plotting: Data Analysis and Visualization:**
   
   This section uses [**Matplotlib**](https://matplotlib.org/stable/contents.html) alongside basic data analysis, covering metrics such as the mean, median, standard deviation, and correlation. It also covers improving plot aesthetics with titles, line styles, and color schemes to make the data easier to interpret.

6. **Additional Resources:**
   
   While the primary focus is on Python coding, a selection of additional academic resources is included to supplement the hands-on experience.




## **References: Essential Resources for Further Learning**

1. **Matplotlib User Guide**: [Official Documentation](https://matplotlib.org/stable/contents.html)
2. **Pandas Documentation**: [Official Website](https://pandas.pydata.org/docs/)
3. **"Python for Data Analysis" by Wes McKinney**: [Book Information](https://wesmckinney.com/pages/book.html)
4. **Seaborn User Guide**: [Official Documentation](https://seaborn.pydata.org/)
5. **Computational Chemistry Tools**: [ScienceDirect Topic Page on the Highest Occupied Molecular Orbital](https://www.sciencedirect.com/topics/chemistry/highest-occupied-molecular-orbital)

Feel free to explore these resources to deepen your understanding of data visualization, data management, and computational tools in Chemistry.


## **Introduction: The Role of Data Visualization in Chemistry**

Data visualization is an indispensable tool in the field of Chemistry, especially when leveraging computational tools and machine learning. Whether you're identifying patterns in molecular structures, analyzing spectroscopic data, or interpreting results from machine learning models, visual representation can elucidate intricate details that might otherwise be obscured in raw numerical data.


### **Further Learning**
> Learn more about the Role of Data Visualization in Chemistry:
> 1. **Recent Progress in Visualization and Analysis of Fingerprint**: [Journal Article](https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/open.202200091)
> 2. **Good Practices in Data Analytics**: [Webinar](https://www.chemistryworld.com/webinars/good-practices-in-data-analytics/4013015.article)


### **Why is Data Visualization Crucial?**

Data visualization is a vital component of modern research. It allows for more efficient interpretation of data, enables deeper understanding, and aids decision-making. Below is a simple plot to illustrate the point.

In [None]:
# Sample code block to generate a simple plot
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y)
plt.title('Sample Plot: Sine Function')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

### **Mathematical Concepts in Data Visualization**

In data analysis, grasping the numerical aspects is key. **Covariance** measures how two variables change relative to one another: when both tend to increase together, the covariance is positive, and when one tends to decrease as the other increases, it is negative. Dividing the covariance by the product of the two standard deviations gives the correlation, a unitless measure of this dependence that ranges from $-1$ to $1$.

<br>

$$
\text{Correlation (r)} = \frac{\text{Covariance (X, Y)}}{\text{Standard Deviation (X)} \times \text{Standard Deviation (Y)}}
$$

<br>

**Covariance** is one of many statistical measures researchers can use to infer more from their data.
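
The correlation formula above can be checked numerically. The sketch below is a minimal illustration with made-up values for `x` and `y` (they are not taken from any dataset in this lesson). It computes the correlation from the covariance and the standard deviations, then compares the result with NumPy's built-in `np.corrcoef()`.

```python
import numpy as np

# Hypothetical paired measurements (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# np.cov returns a 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y)
cov_xy = np.cov(x, y)[0, 1]

# ddof=1 makes np.std use the same (N - 1) normalization that np.cov uses by default
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Cross-check against NumPy's built-in correlation coefficient
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"Covariance:            {cov_xy:.4f}")
print(f"Correlation (manual):  {r_manual:.4f}")
print(f"Correlation (NumPy):   {r_numpy:.4f}")
```

The two correlation values should agree as long as the covariance and both standard deviations use the same normalization.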

Integrating mathematical tools with Python libraries such as **`Matplotlib`** and **`Pandas`** for data visualization and analysis aids researchers, particularly in Chemistry.

<br>

#### **Further Learning**
> Learn more about Data Visualization in Python using **`Matplotlib`** and **`Pandas`** by reading the article in [**Computing in Science & Engineering (2007)**](https://ieeexplore.ieee.org/document/4160265) by [John D. Hunter](https://ieeexplore.ieee.org/author/38192695400). For an extended introduction to Data Visualization, watch [**The Beauty of Data Visualization**](https://www.youtube.com/watch?v=5Zg-C8AAIGg&ab_channel=TED-Ed) by [David McCandless](https://davidmccandless.com/).








## **Data Loading: Initiating the Analytical Process**

Before visualizing or analyzing data, the first step is to load it into an environment where it can be manipulated. This is crucial in scientific computing, where datasets can be large and complex.

### **Introduction to Pandas: Python's Data Management Library**

[**Pandas**](https://pandas.pydata.org/docs/) is a versatile Python library for data manipulation and analysis. It is well suited to labeled, tabular data, like tables in a spreadsheet. The code block below demonstrates how to load a dataset.


In [None]:
# Sample code to demonstrate data loading in Pandas
import numpy as np
import pandas as pd

# Load a chemistry dataset from a GitHub repository
data_url = "https://github.com/RodrigoAVargasHdz/CHEM-4PB3/raw/main/Course_Notes/data/qm9.csv"
data = pd.read_csv(data_url)

# Print the index and name of each property (column) in the dataset
print('Properties:')
print('------------')
for i, c in enumerate(data.columns):
    print(i, ': ', c)

### **Example: Data Handling with Pandas**
After loading the dataset, it's important to get a quick snapshot of its structure and contents.


#### **Pandas - Data Preview**
The [`head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) function provides the first few rows of the DataFrame, which can help in understanding what types of data are in each column. For instance, if one column contains numerical data and another contains text data, this will be immediately obvious. This preliminary look can also reveal if any preprocessing steps, such as handling missing values, might be required later.








In [None]:
print("Preview of Data:")
print(data.head())


#### **Pandas - Data Information**
The [`info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) function serves a dual purpose: it identifies the data types for each column and shows the count of non-null entries. This snapshot is vital for recognizing if data types align with their expected values and if missing data needs to be handled. Being aware of these aspects aids in streamlining the data preprocessing stage.

In [None]:
print("Data Information:")
data.info()  # info() prints its report directly and returns None, so no print() is needed

#### **Pandas - Statistical Summary**
By using [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html), users obtain a quick statistical summary of the numerical columns in the DataFrame. This includes central tendency measures like the mean and spread measures like standard deviation. This initial insight into the data's characteristics assists in choosing appropriate analytical or visualization techniques.

In [None]:
print("Statistical Summary of Data:")
print(data.describe())

While the dataset may initially seem overwhelming, it's worth noting that ***Statistical Techniques*** and ***Matplotlib*** can be powerful allies in simplifying and understanding the data. Through effective visualization and analysis, complex numerical patterns can be distilled into actionable insights, making the data much more approachable.



## **Basic Plotting: Visual Data Representation**

Visualizing your data is the next step after loading and initial exploration. Effective data visualization can offer nuanced insights into complex datasets. In the context of Chemistry, this can involve plotting properties like **Highest Occupied Molecular Orbital (HOMO)** and **Lowest Unoccupied Molecular Orbital (LUMO)**.


### **Worked Example: Scatter Plot of Molecular Orbitals**

Scatter plots are invaluable for understanding the relationship between two variables. Here, we visualize the relationship between *HOMO* and *LUMO*. These properties can be utilized in a Machine Learning model.  

> Before delving into the visual representation of data, it's essential to prepare the dataset by extracting variables of interest, specifically *HOMO* and *LUMO*.









In [None]:
# Select single columns for HOMO and LUMO
homo = np.array(data['homo'])  # Extract HOMO
lumo = data.lumo.to_numpy()  # Extract LUMO
print('HOMO ->', homo.shape)
print('LUMO ->', lumo.shape)

# Select multiple columns for HOMO and LUMO
homo_and_lumo = np.array(data[['homo', 'lumo']])
print('HOMO & LUMO ->', homo_and_lumo.shape)

> With the data prepared, the next step involves setting up the plot libraries.

In [None]:
# load relevant libraries
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

> Now that [*Matplotlib*](https://matplotlib.org/stable/index.html) is set up, the core components of a scatter plot can be implemented. This involves using the extracted *HOMO* and *LUMO* values as coordinates for the plot.


In [None]:
%matplotlib inline
figure(figsize=(8, 6), dpi=80)

plt.scatter(homo,lumo)
plt.xlabel('HOMO',fontsize=15)
plt.ylabel('LUMO',fontsize=15)
# plt.show()

A scatter plot offers an intuitive view of how these molecular properties relate to each other, and it suggests hypotheses that can then be checked with statistical measures such as correlation.

### **Understanding HOMO and LUMO: A Contextual Overview**

In molecular chemistry, **HOMO** and **LUMO** are acronyms for the **Highest Occupied Molecular Orbital** and the **Lowest Unoccupied Molecular Orbital**, respectively. These orbitals are central to understanding a molecule's electronic structure, reactivity, and various other chemical properties.

<br>

#### **Why Are *HOMO* and *LUMO* Important?**
The energy gap between *HOMO* and *LUMO* levels can reveal a lot about a molecule's stability and reactivity. A smaller gap often signifies that the molecule is more reactive, whereas a larger gap typically implies greater stability. This is crucial information when analyzing molecular interactions in computational chemistry.
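
As a rough illustration of the gap described above, the sketch below builds a small table of made-up orbital energies and computes the gap as the LUMO energy minus the HOMO energy. The molecule labels and values are hypothetical and are not taken from the QM9 dataset. The column names `homo` and `lumo` simply mirror the ones used earlier in this lesson.

```python
import pandas as pd

# Hypothetical orbital energies for three made-up molecules (illustrative values only)
toy = pd.DataFrame(
    {
        "molecule": ["A", "B", "C"],
        "homo": [-0.25, -0.22, -0.30],
        "lumo": [0.05, -0.02, 0.10],
    }
)

# The gap is the LUMO energy minus the HOMO energy
toy["gap"] = toy["lumo"] - toy["homo"]

# Sort so the molecule with the smallest gap is listed first
ranked = toy.sort_values("gap")
print(ranked)
```

Following the rule of thumb in the text, the molecule at the top of the sorted table would be expected to be the most reactive of the three, and the one at the bottom the most stable.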

<br>

![HOMOS and LUMOS for the Butadiene System ](https://cdn.masterorganicchemistry.com/wp-content/uploads/2019/12/6-molecular-orbitals-of-butadiene-showing-homo-and-lumo-orbitals.gif) <br>
[**Figure 1 - HOMOS and LUMOS for the Butadiene System. Image by James Ashenhurst**](https://www.masterorganicchemistry.com/2018/03/23/molecular-orbitals-in-the-diels-alder-reaction/)

<br>

#### **Real-World Application: Organic Photovoltaic Cells**

One specific area where the understanding of *HOMO* and *LUMO* levels is vital is in the development of ***Organic Photovoltaic Cells (OPVs)***. These are solar cells that use organic compounds to convert solar energy into electrical energy. The energy levels of *HOMO* and *LUMO* play a crucial role in the efficiency of these cells.


They determine the ease with which electrons can be "excited" from the *HOMO* to the *LUMO* level upon the absorption of photons. This excited state facilitates the movement of electrons, thereby generating an electric current.

<br>

![Organic Photovoltaic Devices](https://www.oe.phy.cam.ac.uk/files/media/organic11.png) <br>
[**Figure 2 - Organic Photovoltaic Devices. Image by the University of Cambridge**](https://www.oe.phy.cam.ac.uk/research/photovoltaics/ophotovoltaics)

<br>

> By leveraging computational chemistry and data visualization techniques to study these levels, researchers can optimize the materials used in OPVs. This enables the development of more effective and efficient solar energy solutions.

<br>

#### **Learn More**
>1. **HOMO and LUMO Molecular Orbitals for Conjugated Systems**: [YouTube Video](https://www.youtube.com/watch?v=cEOgSn5pvhw&ab_channel=Leah4sci)
>2. **Frontier MOs: An Acid-Base Theory**: [Report](https://chem.libretexts.org/Bookshelves/General_Chemistry/General_Chemistry_Supplement_(Eames)/Molecular_Orbital_Theory/Frontier_MOs%3A_An_Acid-Base_Theory)
>3. **Describing Chemical Reactivity with Frontier Molecular Orbitalets**: [Journal Article](https://pubs.acs.org/doi/10.1021/jacsau.2c00085)


## **Plotting: Visuals and Data Analysis**
#### **Exercises and Challenges - Visualizing Data on Scatter Plots**

1. **Exercises**
   - **Plotting *Alpha* and *Mu*:** Remake the scatter plot using the columns **Alpha** and **Mu** instead of **HOMO** and **LUMO**. How do the distributions compare?
   - **Statistical Insights:** Calculate and print the mean, median, and standard deviation for both the **HOMO** and **LUMO** columns.
   

2. **Challenges**

   - **Enhanced Visualization:** Modify the scatter plot by adding a title, grid lines, and a legend.
   - **Data Highlighting:** Highlight any points where **HOMO** values are greater than **$-0.2$** and **LUMO** values are less than **$0.2$** in a different color.  



In [None]:
# Exercise - Plotting Alpha and Mu

"""Your Code Here"""




In [None]:
# Exercise - Statistical Insights

"""Your Code Here"""




>  Review the Mean, Median, and Mode syntax in the Week 1 lesson.

In [None]:
# Challenge - Enhanced Visualization

"""Your code here"""




> Consult the [**`plt.grid()`**](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.grid.html) documentation to learn more about its syntax and use cases.



In [None]:
# Challenge - Data Highlighting

"""Your code here"""




## **Unlocking Insights through Correlation Analysis**

When stepping into the domain of machine learning, a solid grasp of correlation analysis can be instrumental. Correlation serves as a gateway concept that familiarizes one with the nuances of data relationships, thereby paving the way for more complex machine learning models. Let's delve deeper.

<br>

### **Correlation Analysis: Decoding Variable Relationships for Machine Learning**

Before delving into more complex machine learning models, it's vital to understand what correlation is. This statistical measure represents the strength and direction of a relationship between two variables.
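
To make "strength and direction" concrete, the sketch below generates three synthetic datasets (assumed purely for illustration, with a fixed random seed) and computes the Pearson correlation coefficient for each with `np.corrcoef()`. One dataset rises with `x`, one falls with `x`, and one is unrelated to `x`.

```python
import numpy as np

# Synthetic data with a fixed seed so the run is reproducible
rng = np.random.default_rng(seed=0)
x = np.linspace(0, 10, 200)
noise = rng.normal(scale=1.0, size=x.size)

datasets = {
    "positive": 2.0 * x + noise,       # y tends to increase with x
    "negative": -2.0 * x + noise,      # y tends to decrease with x
    "none": rng.normal(size=x.size),   # y is generated independently of x
}

# The off-diagonal entry of the 2x2 matrix from np.corrcoef is the coefficient r
r_values = {label: np.corrcoef(x, y)[0, 1] for label, y in datasets.items()}

for label, r in r_values.items():
    print(f"{label:>8}: r = {r:+.3f}")
```

The sign of `r` gives the direction of the relationship, and its magnitude (between 0 and 1) gives the strength. Note that this coefficient only captures linear relationships.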

<br>

#### **Exercise: Fitting Curves with `np.polyfit()`**

The task is to use the function [**`np.polyfit()`**](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html) to determine the parameters of a linear model for a given set of data. The **NumPy** documentation can be consulted for guidance on how to use the function.



#### **Quick recap of Linear Model**


*   The goal here is to calculate the slope **$(m)$** and intercept **$(b)$** for a line that fits the data points.

*   $y = m * x + b$

*   $y = [m,b]^T [x,1]$

> Use the formula to calculate the slope and intercept for **HOMO** and **LUMO** - and plot the graph with the regression model.
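
Before attempting the exercise on the real data, it may help to see the linear model recovered from synthetic points. The sketch below assumes a "true" line with slope 3 and intercept -1 plus a little noise. These values are invented for illustration. It fits the line with `np.polyfit()` and then repeats the fit in the vector form $[m, b]^T [x, 1]$ using a least-squares solve.

```python
import numpy as np

# Synthetic data around an assumed line y = 3x - 1 (illustrative values only)
rng = np.random.default_rng(seed=1)
x = np.linspace(0, 5, 50)
y = 3.0 * x - 1.0 + rng.normal(scale=0.2, size=x.size)

# A degree-1 fit returns the coefficients from highest power to lowest: [m, b]
m, b = np.polyfit(x, y, 1)

# Vector form: each row of A is [x_i, 1], so A @ [m, b] gives m * x_i + b
A = np.column_stack([x, np.ones_like(x)])
params, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predicted values along the fitted line
y_fit = A @ np.array([m, b])

print(f"polyfit: m = {m:.3f}, b = {b:.3f}")
print(f"lstsq:   m = {params[0]:.3f}, b = {params[1]:.3f}")
```

Both approaches solve the same least-squares problem, so the two pairs of parameters should match closely. Passing `y_fit` to `plt.plot(x, y_fit)` would draw the fitted line on top of a scatter plot.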



In [None]:
# Exercise: Fitting Curves with np.polyfit()

"Your Code Here"




> Calculating the slope **$(m)$** and intercept **$(b)$** defines the regression line, which makes the trend in the data easier to visualize and analyze.

### **Exercises and Challenges - First Foray into Machine Learning**

#### **Exercises**

1. **Experimenting with Colors and Symbols**
   - Alter the scatter plot by changing both the color and marker symbols. This will help you understand the flexibility and capabilities of data visualization tools.

2. **Customizing Line Styles**
   - Modify the line style of the regression model. This alteration will provide a different visual impact and may help you interpret the model better.

#### **Challenges**

1. **Creating a Histogram with *HOMO* Data**
   - Construct a histogram using the **HOMO** data. Try to interpret the distribution and see if it reveals anything interesting about your dataset.

2. **Exploring Covariance and Correlation**
   - Covariance was introduced earlier; calculate and visualize the **covariance** and **correlation** between **HOMO** and **LUMO**.

> **Challenge** - Exploring Covariance and Correlation can be completed mathematically with **NumPy's** **[`np.cov()`](https://numpy.org/doc/stable/reference/generated/numpy.cov.html)** and can be further visualized with **Seaborn's** [**Heatmaps**](https://seaborn.pydata.org/generated/seaborn.heatmap.html).

<br>


---


> Recall the formula relating Correlation and Covariance:



$$
\text{Correlation (r)} = \frac{\text{Covariance (X, Y)}}{\text{Standard Deviation (X)} \times \text{Standard Deviation (Y)}}
$$

<br>

---


In [None]:
# Exercise - Experimenting with Colors and Symbols

"""Your Code Here"""





> Learn more about **`Matplotlib Styling`** syntax and use cases at **Python Charts'** **[lesson](https://python-charts.com/matplotlib/title/)**.

In [None]:
# Exercise - Customizing Line Styles

"""Your Code Here"""




In [None]:
# Challenge - Creating a Histogram with HOMO Data

"""Your Code Here"""




In [None]:
# Challenge - Exploring Covariance and Correlation

"""Your Code Here"""




> Learn more about **`ddof`** syntax, type, and use cases at **NumPy's** **[documentation](https://numpy.org/doc/stable/reference/generated/numpy.std.html)**.
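
The effect of `ddof` is easiest to see on a tiny array. The sketch below uses a short list of made-up numbers to compare the default `np.std()` (which divides by $N$) with `ddof=1` (which divides by $N - 1$), and shows that `np.cov()` uses the $N - 1$ convention by default.

```python
import numpy as np

# Small illustrative sample: mean is 5 and the squared deviations sum to 32
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default ddof=0: divide by N (population standard deviation)
pop_std = np.std(values)

# ddof=1: divide by N - 1 (sample standard deviation)
sample_std = np.std(values, ddof=1)

# For a single 1-D input, np.cov returns the variance using N - 1 by default
sample_var = float(np.cov(values))

print(f"std with ddof=0: {pop_std:.4f}")
print(f"std with ddof=1: {sample_std:.4f}")
print(f"np.cov variance: {sample_var:.4f}")
print(f"ddof=1 std squared: {sample_std**2:.4f}")
```

When computing a correlation by hand from `np.cov()` and `np.std()`, the two functions should use the same `ddof`, otherwise the result will be slightly off.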

## **Challenge - Extra Styles**

Use Matplotlib and NumPy to generate a **["scatter plot with histograms"](https://matplotlib.org/stable/gallery/lines_bars_and_markers/scatter_hist.html#sphx-glr-gallery-lines-bars-and-markers-scatter-hist-py)** template for the columns of the QM9 data set.

> How could the plot be made interactive?


**Hints - Extra Styles**
> - **Matplotlib's** [**`Axes.inset_axes`**](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.inset_axes.html) and [**`plt.subplots`**](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) can assist with this challenge.
> - Learn more about the [**`bins`**](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) in **Matplotlib's** Histograms.



In [None]:
# Challenge - Extra Styles

"Your code here"





In [None]:
# Challenge - Extra Styles Interactive

"Your code here"



