# Best Practices: Rewriting this notebook using my own code

Based on the paper "Machine Learning for Materials Scientists: An Introductory Guide toward Best Practices",
I am creating my own notebooks on machine learning, train-validate-test splitting, and related topics.
The aim is to become familiar with packages I have not used before and with the general process of writing code that can be used to analyse the chemical data available.

The focus of these notebooks is not to create the most efficient code, but rather to create code that helps me learn and understand this procedure, get familiar with new packages, and provide an example to look back on for future reference.
The code is therefore heavily commented to make every step explicit, both while I am learning as I write it and when I read it again in the future.

# Data loading, cleanup and processing

The first step in an ML project is to obtain the dataset you will be working with.
There are many repositories for materials science-specific data (whether online or offline)---consult the accompanying paper for a list of the more commonly used ones.

Once you have identified the repository and dataset you will use for your project, you will have to download it to your local machine, or establish a way to reliably access the dataset.
Consult the documentation of the repository for how to do this.

For this tutorial, we have collected heat capacity ($C_p$) data from the [NIST-JANAF Thermochemical Tables](https://doi.org/10.18434/T42S31).

In [15]:
#importing libraries

import os                #Portable way of using operating-system-dependent functionality (paths, working directory)

import numpy as np

import pandas as pd       #Pandas is a Python library used for working with data sets.
                          #It has functions for analyzing, cleaning, exploring, and manipulating data.

import matplotlib.pyplot as plt

%matplotlib inline
#Shows graphs inline in the notebook, so plt.show() is not needed when plotting.
#Recent Jupyter versions usually select the inline backend automatically,
#but keeping this line makes the choice explicit.

%config InlineBackend.figure_format='retina'
#Renders inline figures at double resolution so they look sharp on high-DPI ("retina") displays.
#On a standard display the figures may simply appear larger.

from pandas_profiling import ProfileReport   #Pandas works with DataFrames (basically spreadsheets).
                                             #ProfileReport builds an exploratory summary report of a DataFrame
                                             #(per-column statistics, missing values, warnings).
                                             #Note: newer releases of this package are published as ydata-profiling.

## Load data

Using Pandas, we read the dataset into a DataFrame.

We also print the shape of the DataFrame, which indicates the number of rows and columns in this dataset.

In [16]:
PATH = os.getcwd() #os.getcwd() returns the current working directory of the process.
                   #Here the current working directory is C:\Users\sweet\Desktop\FYP\code

data_path = os.path.join(PATH, "../data_for_notebook_bestpractice/cp_data_demo.csv")
#Joins the cwd with the relative path '../data_for_notebook_bestpractice/cp_data_demo.csv'.
#".." means the parent directory, so the data folder sits next to the code folder.
#Forward slashes are accepted in paths on Windows as well as on Linux/macOS.

df = pd.read_csv(data_path) #reads the csv from the given location (the data_path created above)

print("Original DataFrame shape:", df.shape) #df.shape gives (rows, columns) of the DataFrame

Original DataFrame shape: (4583, 3)


This means that our input dataset has 4583 data samples, each with 3 variables.
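
On the relative path used in the loading cell: `..` refers to the parent directory, and Python accepts forward slashes in paths on every operating system. The following is a minimal sketch of how such a path resolves. The directory `/home/user/project/code` is a made-up placeholder, not the real location, and `posixpath` is used only so that the result looks the same on any platform.

```python
import posixpath

# Hypothetical working directory, used only for illustration
cwd = "/home/user/project/code"

# Joining keeps the ".." segment as-is
joined = posixpath.join(cwd, "../data_for_notebook_bestpractice/cp_data_demo.csv")
print(joined)

# normpath collapses "code/.." so the path points one level above cwd
resolved = posixpath.normpath(joined)
print(resolved)
```

In the notebook itself `os.path.join` does the joining, and the operating system resolves the `..` segment when the file is opened.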

## Examine the data

We examine some rows and look at the data's basic statistics.

We see that the dataset contains information about the formula, measurement condition (in this case, temperature in K), and the target property, heat capacity (in J/(mol * K)).

In [17]:
df.head(10) #.head(n) returns the first n rows of the dataframe

| | FORMULA | CONDITION: Temperature (K) | PROPERTY: Heat Capacity (J/mol K) |
|---|---|---|---|
| 0 | B2O3 | 1400.0 | 134.306 |
| 1 | B2O3 | 1300.0 | 131.294 |
| 2 | B2O3 | 1200.0 | 128.072 |
| 3 | B2O3 | 1100.0 | 124.516 |
| 4 | B2O3 | 1000.0 | 120.625 |
| 5 | B2O3 | 900.0 | 116.19 |
| 6 | B2O3 | 800.0 | 111.169 |
| 7 | B2O3 | 723.0 | 106.692 |
| 8 | B2O3 | 700.0 | 105.228 |
| 9 | B2O3 | 600.0 | 98.115 |


The first thing to notice: we have many observations of the same compound (B2O3), measured under different conditions and therefore with different property values.

We can get some simple summary statistics of the DataFrame by calling the `.describe()` method on it.

In [18]:
df.describe() #Returns a statistical summary of the numeric columns of the DataFrame.

| | CONDITION: Temperature (K) | PROPERTY: Heat Capacity (J/mol K) |
|---|---|---|
| count | 4579.0 | 4576.0 |
| mean | 1170.920341 | 107.483627 |
| std | 741.254366 | 67.019055 |
| min | -2000.0 | -102.215 |
| 25% | 600.0 | 61.3125 |
| 50% | 1000.0 | 89.497 |
| 75% | 1600.0 | 135.645 |
| max | 4700.0 | 494.967 |
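
Two things in this summary are worth a closer look: the `count` values are lower than the 4583 rows reported by `df.shape`, and the `min` values are negative. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from the dataset) to show that `count` only includes non-missing entries and that `min` is a quick way to spot physically impossible values.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values) with one missing and one negative temperature
toy = pd.DataFrame({
    "T": [300.0, 400.0, np.nan, -500.0],
    "Cp": [50.0, 60.0, 70.0, 80.0],
})

summary = toy.describe()

# "count" skips missing entries, so it can be smaller than the number of rows
print(len(toy), summary.loc["count", "T"])

# "min" exposes the negative temperature
print(summary.loc["min", "T"])
```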


Using the `pandas-profiling` library, we can generate a more in-depth report of our starting dataset.
Note that generating this profile report might take upwards of 20 seconds.

In [19]:
profile = ProfileReport(df) #creates a detailed report of the DataFrame
#The original notebook calls
#ProfileReport(df.copy(), title='Pandas Profiling Report of Cp dataset', html={'style':{'full_width':True}})
#df.copy() hands the report a copy so the original DataFrame cannot be modified,
#title sets the heading of the report, and the html option asks the HTML report to use the full page width.
#The widget output looks the same to me without them, so I leave them out for now.

profile.to_widgets() #displays the report as a tidier interactive widget instead of one long scrolling page


Notice a few things from the profile report:
* We have some missing cells in the dataset ("Overview" tab)
* We have some unrealistic Temperature and Heat Capacity values in the dataset ("Variables" tab)
* We have some missing Temperature, Formula and Heat Capacity values in the dataset ("Variables" tab)

Also notice that on the "Overview" tab, there is the following warning: `FORMULA` has a high cardinality: 245 distinct values.

Cardinality is the number of distinct values in a column of a table, relative to the number of rows in the table.
In our dataset, we have a total of 4583 data observations, but only 245 distinct formulae.
We will have to keep this in mind later, when we process and split the dataset.
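
As a rough illustration of what the cardinality warning is measuring, the sketch below counts rows and distinct formulae on a small made-up DataFrame (the rows are illustrative only). The same two calls, `nunique()` and `value_counts()`, could be applied to the real formula column.

```python
import pandas as pd

# Toy data (made-up rows) mimicking repeated measurements of the same compound
toy = pd.DataFrame({
    "formula": ["B2O3", "B2O3", "B2O3", "Al2O3", "Al2O3"],
    "T": [600.0, 700.0, 800.0, 600.0, 700.0],
})

n_rows = len(toy)
n_unique = toy["formula"].nunique()   # number of distinct formulae
print(n_rows, n_unique)

# Number of measurements recorded for each formula
counts = toy["formula"].value_counts()
print(counts)
```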

## Rename the column names for brevity

In [20]:
df.columns

Index(['FORMULA', 'CONDITION: Temperature (K)',
       'PROPERTY: Heat Capacity (J/mol K)'],
      dtype='object')

In [21]:
new_col_names = {                                     #dictionary mapping old column names to new ones
    "FORMULA": "formula",
    "CONDITION: Temperature (K)": "T",
    "PROPERTY: Heat Capacity (J/mol K)": "Cp",
}

df = df.rename(columns = new_col_names) #assigning new column names to df
df.columns

Index(['formula', 'T', 'Cp'], dtype='object')

## Check for and remove `NaN` values

Here we use the built-in Pandas functions to check for `NaN` (missing) values in the dataset.
We then remove the dataset rows which contain `NaN` values.

In [22]:
#Check for NaNs in the respective dataset columns, and get the indices
df2 = df.copy() #work on a copy so that the original DataFrame isn't altered if a mistake is made

#pd.isnull() takes a scalar or array-like object and indicates whether values are missing
#(NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
#pd.isnull(df2["formula"]) gives the same result as df2["formula"].isnull()
bool_nan_formula = pd.isnull(df2["formula"])
bool_nan_T = pd.isnull(df2["T"])
bool_nan_Cp = pd.isnull(df2["Cp"])

#Drop the rows of the DataFrame which contain NaNs

#The drop() method removes the specified rows or columns.
#With axis='columns' it removes the specified columns.
#With axis='index' (the same as axis=0) it removes the specified rows.
#df2.loc[bool_nan_formula] keeps only the rows where the mask is True (formula is missing),
#and .index gives the row labels of those rows, so drop() removes exactly the rows with a missing formula.
df2 = df2.drop(df2.loc[bool_nan_formula].index, axis = "index")
df2 = df2.drop(df2.loc[bool_nan_T].index, axis = "index")
df2 = df2.drop(df2.loc[bool_nan_Cp].index, axis = "index")

print("DataFrame shape before dropping NaNs:", df.shape)
print("DataFrame shape after dropping NaNs:", df2.shape)

#Why .loc is used: the masks were computed before any rows were dropped, so after the first drop
#they carry more row labels than df2. Plain df2[mask] then warns
#"Boolean Series key will be reindexed to match DataFrame index", while df2.loc[mask] aligns the mask silently.


DataFrame shape before dropping NaNs: (4583, 3)
DataFrame shape after dropping NaNs: (4570, 3)
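
To make the mask-then-drop pattern concrete, here is a minimal sketch on a made-up three-row DataFrame (the values are illustrative only). It shows that `.loc[mask]` selects only the rows where the mask is `True`, which is why only those rows end up being dropped.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): row 1 has no formula, row 2 has no temperature
toy = pd.DataFrame({
    "formula": ["B2O3", None, "Al2O3"],
    "T": [600.0, 700.0, np.nan],
})

mask = pd.isnull(toy["formula"])      # True only where the formula is missing
rows_to_drop = toy.loc[mask].index    # row labels of the rows where mask is True
print(list(rows_to_drop))

cleaned = toy.drop(rows_to_drop, axis="index")
print(cleaned.shape)
```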


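Pandas also includes the convenient built-in method `.dropna()`, which checks for and removes rows containing `NaN` values in a single call (it returns a new DataFrame rather than modifying the original):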

In [23]:
df3 = df.copy()
df3 = df3.dropna(axis = "index", how = "any") #how="any" drops a row if any of its values is NaN;
                                              #how="all" drops it only if all of its values are NaN

print("DataFrame shape before removing NaNs:", df.shape)
print("DataFrame shape after removing NaNs:", df3.shape)

df = df3.copy()

DataFrame shape before removing NaNs: (4583, 3)
DataFrame shape after removing NaNs: (4570, 3)
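
The difference between `how="any"` and `how="all"` is easiest to see on a tiny example. The sketch below uses made-up values and also shows that `dropna()` leaves the DataFrame it is called on unchanged unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): row 1 is partly missing, row 2 is entirely missing
toy = pd.DataFrame({
    "T": [600.0, np.nan, np.nan],
    "Cp": [98.1, 105.2, np.nan],
})

any_dropped = toy.dropna(axis="index", how="any")   # drops rows 1 and 2
all_dropped = toy.dropna(axis="index", how="all")   # drops only row 2

print(len(toy), len(any_dropped), len(all_dropped))
```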


## Check for and remove unrealistic values

In some cases, you might also get data values that simply don't make sense.
For our case, this could be negative temperature or heat capacity values.

In [24]:
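invalid_T = df['T'] < 0     #True where the temperature is negative
invalid_Cp = df['Cp'] < 0   #True where the heat capacity is negative

df = df.drop(df.loc[invalid_T].index, axis = 0)
df = df.drop(df.loc[invalid_Cp].index, axis = 0)

print("DataFrame shape after removing unrealistic values:", df.shape)

DataFrame shape after removing unrealistic values: (4564, 3)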
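
An alternative to dropping rows in two steps is to build a single "valid" mask and keep only the rows that pass it. This is a sketch on made-up values, not the approach used in the cell above. One assumption to be aware of: comparisons with `NaN` evaluate to `False`, so this form would also discard rows with missing values, which is harmless here only because the NaNs were already removed.

```python
import pandas as pd

# Toy data (made-up values) with one negative temperature and one negative heat capacity
toy = pd.DataFrame({
    "T": [300.0, -2000.0, 500.0, 700.0],
    "Cp": [50.0, 60.0, -102.2, 80.0],
})

# A row is valid only if both values are non-negative
valid = (toy["T"] >= 0) & (toy["Cp"] >= 0)

cleaned = toy.loc[valid]
print(cleaned.shape)
```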


## Save cleaned data to csv

Finally, after cleaning and processing the data, you can save it to disk in a cleaned state for you to use later.

Pandas allows us to save our data as a comma-separated values (`.csv`) file.

In [25]:
out_path = os.path.join(PATH, "../data_for_notebook_bestpractice/cp_data_cleaned_by_me.csv")
df.to_csv(out_path, index = False) #index = False leaves the row index out of the file; index = True would write it as an extra first column
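
To see what the `index` argument changes, the sketch below writes a made-up DataFrame to a temporary folder twice and reads both files back. The file names and values are illustrative only. When the index is written, it comes back from `read_csv` as an extra column named `Unnamed: 0`.

```python
import os
import tempfile

import pandas as pd

# Toy data (made-up file names; values chosen only for illustration)
toy = pd.DataFrame({
    "formula": ["B2O3", "B2O3"],
    "T": [600.0, 700.0],
    "Cp": [98.115, 105.228],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    no_index_file = os.path.join(tmp_dir, "toy_no_index.csv")
    with_index_file = os.path.join(tmp_dir, "toy_with_index.csv")

    toy.to_csv(no_index_file, index=False)   # only the three data columns
    toy.to_csv(with_index_file, index=True)  # row index written as a first column

    no_index = pd.read_csv(no_index_file)
    with_index = pd.read_csv(with_index_file)

print(list(no_index.columns))
print(list(with_index.columns))
```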