# Task

Build a regression-based machine learning model to predict saturation vapour pressure, given in the data as `log_pSat_Pa`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.feature_selection import RFE

# Data Exploration and Pre-processing

In [9]:
# Load data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Drop "ID" since it is a random number for the test data
train_data = train_data.drop(columns=["ID"])
test_data = test_data.drop(columns=["ID"])

# Data exploration
print(train_data.describe())
train_data.info()  # info() prints directly; wrapping it in print() also prints "None"

# Check missing values
train_data.isna().any()

        log_pSat_Pa            MW    NumOfAtoms        NumOfC        NumOfO  \
count  26637.000000  26637.000000  26637.000000  26637.000000  26637.000000   
mean      -5.516747    264.638341     26.251567      6.862409      9.937042   
std        3.120191     49.618151      5.229818      1.453679      2.485167   
min      -18.822563     30.010565      4.000000      1.000000      0.000000   
25%       -7.515147    233.017166     23.000000      6.000000      8.000000   
50%       -5.450577    266.986260     26.000000      7.000000     10.000000   
75%       -3.429192    299.012475     30.000000      7.000000     12.000000   
max        8.390642    386.044503     41.000000     10.000000     17.000000   

             NumOfN  NumHBondDonors     NumOfConf  NumOfConfUsed  \
count  26637.000000    26637.000000  26637.000000   26637.000000   
mean       1.063558        2.201637    229.856778      25.700417   
std        0.710745        1.021029    203.234312      14.689993   
min        0.000

log_pSat_Pa                     False
MW                              False
NumOfAtoms                      False
NumOfC                          False
NumOfO                          False
NumOfN                          False
NumHBondDonors                  False
NumOfConf                       False
NumOfConfUsed                   False
parentspecies                    True
C=C (non-aromatic)              False
C=C-C=O in non-aromatic ring    False
hydroxyl (alkyl)                False
aldehyde                        False
ketone                          False
carboxylic acid                 False
ester                           False
ether (alicyclic)               False
nitrate                         False
nitro                           False
aromatic hydroxyl               False
carbonylperoxynitrate           False
peroxide                        False
hydroperoxide                   False
carbonylperoxyacid              False
nitroester                      False
dtype: bool

The "parentspecies" column is categorical and is the only column with missing values, so it must be converted to a numerical format, with the missing entries handled explicitly.

Some features span a wide range of values (for example "NumOfConf"); a logarithmic transform may bring them into a more manageable range.

Finally, standardize the numerical features.

In [None]:
# To do: transform "parentspecies" categorical variables into a numerical format
# To do: log transform
# To do: Normalization/Standardization
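The three to-do items above could be sketched roughly as follows. This is only an illustration on a tiny made-up frame, not the real `train.csv`: the category labels, the choice to fill missing "parentspecies" values with an `"unknown"` label, the use of one-hot encoding, and the decision to log-transform "NumOfConf" are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical stand-in for train_data (values and labels are made up)
demo = pd.DataFrame({
    "MW": [264.6, 233.0, 299.0, 30.0],
    "NumOfConf": [229.0, 12.0, 640.0, 1.0],
    "parentspecies": ["species_a", None, "species_b", "species_a"],
    "log_pSat_Pa": [-5.5, -7.5, -3.4, 8.4],
})

# 1) Categorical -> numeric: treat missing as its own category, then one-hot encode
demo["parentspecies"] = demo["parentspecies"].fillna("unknown")
demo = pd.get_dummies(demo, columns=["parentspecies"], dtype=float)

# 2) Log transform a wide-ranged, non-negative feature (log1p is safe at zero)
demo["NumOfConf"] = np.log1p(demo["NumOfConf"])

# 3) Standardize the numeric predictors (not the target, not the dummy columns)
numeric_cols = ["MW", "NumOfConf"]
scaler = StandardScaler()
demo[numeric_cols] = scaler.fit_transform(demo[numeric_cols])
```

On the real data, the scaler would be fitted on the training set only and then applied to the test set with `scaler.transform`, so that no test information leaks into preprocessing.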

# Feature selection



In [None]:
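# Check for correlations (numeric columns only; "parentspecies" is still categorical here)
correlation_matrix = train_data.corr(numeric_only=True)
plt.figure(figsize=(16, 12))  # enlarge the figure so the annotations stay readable
sns.heatmap(correlation_matrix, annot=True, fmt=".1f")
plt.show()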


Based on the correlation analysis, we may remove one feature from each highly correlated pair to reduce multicollinearity. We may also create new features, such as ratios or interactions between chemical properties, which might improve the model's performance.
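One possible way to act on the correlation analysis is to drop one feature from each highly correlated pair. The sketch below uses synthetic data with hypothetical column names, and the 0.9 threshold is an assumed cut-off that would need tuning on the real features.

```python
import numpy as np
import pandas as pd

# Synthetic example: feat_a_copy is almost identical to feat_a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_demo = pd.DataFrame({
    "feat_a": a,
    "feat_a_copy": a + rng.normal(scale=0.01, size=200),
    "feat_b": rng.normal(size=200),
})

threshold = 0.9  # assumed cut-off for "highly correlated"
corr = X_demo.corr().abs()

# Keep only the strict upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose |correlation| exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
X_reduced = X_demo.drop(columns=to_drop)
```

This rule always keeps the first column of a correlated pair; which member to keep is a judgment call and could instead be based on correlation with the target.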

Use models like Random Forest or XGBoost to identify important features. Alternatively, techniques like Recursive Feature Elimination (RFE) can be used to select features.
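Both approaches can be sketched on synthetic data where only the first two columns drive the target. The data, the number of trees, the choice of `LinearRegression` as the RFE estimator, and `n_features_to_select=2` are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only columns 0 and 1 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Impurity-based importances from a random forest
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)
ranking = np.argsort(forest.feature_importances_)[::-1]  # most important first

# Recursive Feature Elimination with a linear model as the base estimator
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)
rfe.fit(X_demo, y_demo)
selected = np.flatnonzero(rfe.support_)  # indices of the retained features
```

On the real data, `ranking` and `selected` would index into the list of feature columns, and the two methods need not agree.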

# Model Selection

Test various regression models such as Linear Regression, Ridge/Lasso Regression, Decision Trees, Random Forest, and Support Vector Regression.

Use K-fold cross-validation to evaluate the performance of each model.
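A comparison of the listed model families under K-fold cross-validation might look like the sketch below. It runs on synthetic data, and the hyperparameters (for example `alpha=0.1` for Lasso or 50 trees for the forest) are placeholder assumptions rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "SVR": SVR(),
}

# 5-fold CV with shuffling; every model sees the same folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}

for name, score in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R2 = {score:.3f}")
```

Reusing one `KFold` object for all models keeps the comparison fair, since differences in score then come from the models rather than from different splits.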

Adjust model parameters using techniques like Grid Search or Random Search to find the optimal settings for models.
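As a minimal sketch of grid search, the example below tunes the `alpha` parameter of `Ridge` on synthetic data. The grid values and the choice of Ridge are assumptions; `RandomizedSearchCV` could be substituted when the parameter space is too large to enumerate.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the real features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}  # assumed candidate values

# Each candidate is scored by 5-fold cross-validated R2
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_score = search.best_score_  # mean CV R2 of the best candidate
print(f"best alpha = {best_alpha}, mean CV R2 = {best_score:.3f}")
```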

# R2 Score Estimation
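
One way the R2 score could be estimated is on a held-out validation split, using the `train_test_split` and `r2_score` functions imported at the top. The sketch below uses synthetic data and a plain `LinearRegression`; the 80/20 split and the model choice are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_val)

r2 = r2_score(y_val, y_pred)

# The same quantity by hand: 1 - SS_res / SS_tot
ss_res = np.sum((y_val - y_pred) ** 2)
ss_tot = np.sum((y_val - y_val.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
```

A score close to 1 means the model explains most of the variance in the held-out target; a negative score means it does worse than always predicting the validation mean.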
