## Linear Regression Model
The code blocks below use the linear regression model implemented in Scikit-learn to predict eigenvalues
of HSE06 quality from the elementwise orbital contributions of the atoms to the corresponding eigenstates in
PBE calculations.
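
As a minimal sketch of the idea, assuming synthetic data in place of the real PBE features and HSE06 targets
(the array shapes and coefficients below are made up for illustration), a Scikit-learn linear model can be fit
and scored like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data: 200 "eigenstates" with 15 features each (hypothetical values).
rng = np.random.default_rng(0)
X = rng.random((200, 15))
true_coef = rng.normal(size=15)
y = X @ true_coef + 0.5  # exactly linear target, so the fit should be near-perfect

model = LinearRegression()
model.fit(X, y)
mae = mean_absolute_error(y, model.predict(X))
print(f"training MAE: {mae:.2e}")
```

On real data the error will not be close to zero; the sketch only shows the fit/predict/score sequence.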
### Note:
Running this notebook requires two .tar.gz files: specific_concatenations.tar.gz and
atom_sum_csv_files.tar.gz. Both tarballs exceed 25 MB, so I could not upload them directly to
GitHub. They are available from Dropbox (dropbox/ML_eigenshift) or
from the Hyperion cluster (/work/sutton-lab/ml_eigenshift/krr/concatenated).

## Importing modules:
The following modules are required for the linear model. Install any missing module with pip,
e.g. pip install numpy.
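
A quick way to see which modules still need installing is sketched below. It assumes the import names used in
this notebook; note that Scikit-learn is imported as sklearn but installed with pip install scikit-learn.

```python
import importlib.util

# Import names used in this notebook (Scikit-learn is imported as "sklearn").
required = ["numpy", "scipy", "pandas", "matplotlib", "ase", "sklearn"]

# find_spec returns None when a top-level module cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```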

In [None]:
import numpy as np
import scipy as sp
import pandas as pd
import os
import sys
from pathlib import Path
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib as mpl
import pylab as pl
import ase
from ase.io import read, write
from sklearn.metrics import mean_squared_log_error, mean_absolute_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
import tarfile

Three functions are defined below; see the comments above each function for details.
The first two, 'two_csv_train_test' and 'single_csv_train_test', prepare the numpy arrays
needed for training and testing the linear regression model. The last one, 'plot', produces the
scatter plots from the outputs of the first two functions.
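
One way to do a shuffled train/test split is sketched here on a toy array (the values and the 3+1 column layout
are hypothetical). It uses a permutation of the row indices, so every row lands in exactly one of the two sets.

```python
import numpy as np

# Toy array: 10 rows, 3 feature columns plus 1 target column (hypothetical values).
data = np.arange(40, dtype=float).reshape(10, 4)

rng = np.random.default_rng(42)
shuffled = data[rng.permutation(data.shape[0])]  # each row appears exactly once

split = int(data.shape[0] * 0.8)  # 80% of the rows go to training
train, test = shuffled[:split], shuffled[split:]

train_input, train_target = train[:, :-1], train[:, -1]
test_input, test_target = test[:, :-1], test[:, -1]
print(train_input.shape, test_input.shape)
```

The last column is treated as the target, matching the convention used by the functions in this notebook.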

In [None]:
# Function two_csv_train_test takes 2 arguments: the path of the csv file used for training and
# the path of the csv file used for testing. The default input columns are [1s,2s,.....,6p,PBE_EF].
# It returns four numpy arrays in the following order: 1. train_input 2. test_input 3. train_target 4. test_target.
# These, in the same order, are the arguments of the function plot.

def two_csv_train_test(path_csv_train,path_csv_test):
    train_csv_to_df=pd.read_csv(path_csv_train) # eg. concat_dir+'Ca1O1.csv'
    columns_input=[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,20]
    train_input_feature=train_csv_to_df.iloc[:,columns_input].values
    train_target_feature=train_csv_to_df.iloc[:,-1].values
    #print(train_input_feature)
    #print(train_target_feature)
    test_csv_to_df=pd.read_csv(path_csv_test) # eg. all_files_dir+'Ag2O1.csv'
    test_input_feature=test_csv_to_df.iloc[:,columns_input].values
    test_target_feature=test_csv_to_df.iloc[:,-1].values
    return train_input_feature,test_input_feature,train_target_feature,test_target_feature

# Function single_csv_train_test takes 2 arguments: the path of the csv file used for both training and
# testing, and the fraction of rows used for training (recommended = 0.8). It returns four numpy arrays in the following
# order: train_input, test_input, train_target, test_target. These, in the same order, are the arguments of the function plot.

def single_csv_train_test(path_csv_file,split_frac): 
    ag2o_csv_to_df=pd.read_csv(path_csv_file) #eg. all_files_dir+'Ag2O1.csv'
    columns_input=[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,20]
    actual_columns=columns_input+[-1]
    #ag2o_input_feature=ag2o_csv_to_df.iloc[:,[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,20,-1]].values
    ag2o_input_feature=ag2o_csv_to_df.iloc[:,actual_columns].values
    # Shuffle the rows without replacement so that no row appears in both the training and test sets.
    shuffled_rows=ag2o_input_feature[np.random.permutation(ag2o_input_feature.shape[0]), :]
    siz_x,siz_yy=np.shape(shuffled_rows)
    split=int(siz_x*split_frac)
    array1=shuffled_rows[:split,:] # first split_frac of the rows: training set
    array2=shuffled_rows[split:,:] # remaining rows: test set
    return array1[:, :-1],array2[:, :-1],array1[:,-1],array2[:,-1]

# The function plot generates the scatter plot. It takes four numpy arrays in the following order as
# input: 1. train_input 2. test_input 3. train_target 4. test_target. It returns the plot as plt, train_mae, and test_mae.
# The returned plot can be saved or displayed. Do not forget to close the plot afterwards, e.g. plt.close()

def plot(x,p,y,q):
    clf=linear_model.LinearRegression()
    clf.fit(x,y) # fit the linear model on the training data
    y_hat2=clf.predict(x)
    q_hat2=clf.predict(p)
    train_mae=mean_absolute_error(y,y_hat2)
    test_mae=mean_absolute_error(q,q_hat2)
    #print(train_mae,test_mae)
    a=np.min(y)
    b=np.max(y)
    #print(a,b)
    diag_array=np.linspace(a, b, num=1000)
    aspect_ratio=(3,5)
    left_margin=0.3 # Adjust it for your best results
    #plt.ylim(band_min,band_max)
    #plt.xlim(xmin,xmax)
    plt.scatter(diag_array,diag_array,s=0.5,color='black',alpha=1.0)
    plt.scatter(y,y_hat2,color='blue',alpha=0.2,label='train')
    plt.scatter(q,q_hat2,color='red',alpha=0.2,label='test')
    plt.xlabel("DFT shift (in eV)",fontsize=14)
    plt.ylabel("predicted shift (in eV)",fontsize=14)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    #plt.axhline(y=0.0,color='purple',linestyle='--',linewidth=1.5)
    plt.subplots_adjust(left=left_margin)
    plt.legend(fontsize='10',ncol=1, loc='best',frameon=False)
    #plt.savefig('_train='+str(train_mae)+'_test'+str(test_mae)+'.pdf',dpi=80)
    #plt.show()
    return plt,train_mae,test_mae

The cell below sets the location of the csv files, which mostly contain information from PBE calculations.
Each file lists the sum of the elementwise orbital contributions to the eigenstates, the PBE eigenvalues and their
shift with respect to the Fermi level, the corresponding eigenvalues and shifts from HSE06 calculations, and finally
the shift between the PBE and HSE06 eigenvalues at each eigenstate. The csv file for each compound is in the folder
atom_sum_csv_files (or its tarball). The folder specific_concatenations (or its tarball) holds concatenations of
various files, whose names mostly indicate their contents. As noted in the first block, these folders and tarballs
are not on GitHub because they exceed 25 MB; get them from Dropbox or the Hyperion cluster as described in the
first block above.
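
The extract-if-missing step in the next cell can be sketched in isolation as follows. This is an illustration only:
it builds a small stand-in tarball in a temporary directory (the name demo_csv_files and its contents are
hypothetical) and extracts it only when the folder is not already present.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a small stand-in tarball (hypothetical name and contents).
    src = os.path.join(workdir, "demo_csv_files")
    os.mkdir(src)
    with open(os.path.join(src, "example.csv"), "w") as fh:
        fh.write("a,b\n1,2\n")
    tar_path = os.path.join(workdir, "demo_csv_files.tar.gz")
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(src, arcname="demo_csv_files")

    # Extract into a separate directory only if the folder is not already there.
    dest = os.path.join(workdir, "extracted")
    os.mkdir(dest)
    if "demo_csv_files" not in os.listdir(dest):
        with tarfile.open(tar_path) as tar:
            tar.extractall(dest)

    extracted_files = os.listdir(os.path.join(dest, "demo_csv_files"))
    print(extracted_files)
```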

In [None]:
cwd=os.getcwd()  # gives the path of the current working directory
list_cwd=os.listdir(cwd)
#print(list_cwd)
tar_files=['specific_concatenations','atom_sum_csv_files']
for entries in tar_files:
    # extract a tarball only if its folder is not already present
    if entries not in list_cwd:
        with tarfile.open(str(entries)+'.tar.gz') as file:
            file.extractall()

#path of the directories containing csv file of all systems and specific concatenations
concat_dir=os.path.join(cwd,'specific_concatenations/')
all_files_dir=os.path.join(cwd,'atom_sum_csv_files/')
concat_files_list=os.listdir(concat_dir)
all_files_list=os.listdir(all_files_dir)

#print(cwd)
#print(concat_dir)
#print(concat_files_list)
#print(all_files_dir)
#print(all_files_list)




The block below uses the functions defined above to show or save the plot. To save the figure, call
plt.savefig("name.pdf",dpi=80) before plt.show() and plt.close().
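
A minimal, self-contained sketch of saving a scatter plot to a PDF is given here. The data points are made up, the
file is written to a temporary directory, and the Agg backend is an assumption chosen so that the sketch runs
without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([0.0, 1.0, 2.0], [0.1, 0.9, 2.1], label="demo")  # made-up points
ax.set_xlabel("DFT shift (in eV)")
ax.set_ylabel("predicted shift (in eV)")
ax.legend(frameon=False)

out_path = os.path.join(tempfile.mkdtemp(), "name.pdf")
fig.savefig(out_path, dpi=80)  # save before closing the figure
plt.close(fig)
print(os.path.exists(out_path))
```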

In [None]:
# Option 1: train and test on a single csv file with an 80/20 split
x,p,y,q=single_csv_train_test(all_files_dir+'Ag2O1.csv',0.8)
# Option 2: train on one csv file and test on another (uncomment to use instead of Option 1)
#x,p,y,q=two_csv_train_test(concat_dir+'Ca1O1.csv',all_files_dir+'Ag2O1.csv')
plt,train_mae,test_mae=plot(x,p,y,q)
print(train_mae,test_mae)
plt.show()
plt.close()