Session 3 Homework Solution
========================

## Exercise 1 - Linear Fitting

Use scikit-learn to perform multiple linear regression for solubility using the dataset `delaney-processed.csv`. Unfortunately, the dataset doesn't contain all of the molecular descriptors described in the original paper. Use the available descriptors as independent variables in the linear fit.

- Use your model to predict solubilities for the dataset.
- Compute the $R^2$ statistic for the fit.


In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression

### Loading and Transforming the Data

In [2]:
df = pd.read_csv("data/delaney-processed.csv")

In [3]:
df.head()

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles
0,Amigdalin,-0.974,1,457.432,7,3,7,202.32,-0.77,OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...
1,Fenfuram,-2.885,1,201.225,1,2,2,42.24,-3.3,Cc1occc1C(=O)Nc2ccccc2
2,citral,-2.579,1,152.237,0,0,4,17.07,-2.06,CC(C)=CCCC(C)=CC(=O)
3,Picene,-6.618,2,278.354,0,5,0,0.0,-7.87,c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43
4,Thiophene,-2.232,2,84.143,0,1,0,0.0,-1.33,c1ccsc1


In [4]:
# Select the six descriptor columns as X (X and Y are converted to NumPy arrays below)
X = df.iloc[:, 2:8]
X.head()

Unnamed: 0,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area
0,1,457.432,7,3,7,202.32
1,1,201.225,1,2,2,42.24
2,1,152.237,0,0,4,17.07
3,2,278.354,0,5,0,0.0
4,2,84.143,0,1,0,0.0


In [5]:
Y = df["measured log solubility in mols per litre"]
Y.head()

0   -0.77
1   -3.30
2   -2.06
3   -7.87
4   -1.33
Name: measured log solubility in mols per litre, dtype: float64

In [6]:
X = X.to_numpy()
Y = Y.to_numpy()

### Fitting the linear model

In [7]:
regression = LinearRegression().fit(X, Y)

In [8]:
regression.coef_

array([-0.49921068, -0.01362162,  0.07281654, -0.41338402, -0.14337233,
        0.03159255])
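
For reference, the model behind these coefficients can be written out. This assumes the default `fit_intercept=True`, which `LinearRegression()` uses when called without arguments. The symbols are notation introduced here for illustration: $\beta_j$ are the entries of `regression.coef_` in column order, and $\beta_0$ is `regression.intercept_` (not printed above).

$$
\hat{y}_i = \beta_0 + \sum_{j=1}^{6} \beta_j x_{ij}
$$

Here $x_{ij}$ is the value of descriptor $j$ for molecule $i$, so the first coefficient (about $-0.50$) multiplies `Minimum Degree` and the last (about $0.03$) multiplies `Polar Surface Area`.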

### Using the model to predict values

In [9]:
predicted = regression.predict(X)

In [10]:
df["multiple_regression"] = predicted

In [11]:
df.head()

Unnamed: 0,Compound ID,ESOL predicted log solubility in mols per litre,Minimum Degree,Molecular Weight,Number of H-Bond Donors,Number of Rings,Number of Rotatable Bonds,Polar Surface Area,measured log solubility in mols per litre,smiles,multiple_regression
0,Amigdalin,-0.974,1,457.432,7,3,7,202.32,-0.77,OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...,-2.081763
1,Fenfuram,-2.885,1,201.225,1,2,2,42.24,-3.3,Cc1occc1C(=O)Nc2ccccc2,-2.955797
2,citral,-2.579,1,152.237,0,0,4,17.07,-2.06,CC(C)=CCCC(C)=CC(=O),-2.616479
3,Picene,-6.618,2,278.354,0,5,0,0.0,-7.87,c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43,-6.866323
4,Thiophene,-2.232,2,84.143,0,1,0,0.0,-1.33,c1ccsc1,-2.567318


### Computing the $R^2$ score

In [12]:
from sklearn.metrics import r2_score

In [13]:
r2_score(Y, predicted)

0.6856666003196058

In [14]:
help(r2_score)

Help on function r2_score in module sklearn.metrics._regression:

r2_score(y_true, y_pred, *, sample_weight=None, multioutput='uniform_average')
    R^2 (coefficient of determination) regression score function.
    
    Best possible score is 1.0 and it can be negative (because the
    model can be arbitrarily worse). A constant model that always
    predicts the expected value of y, disregarding the input features,
    would get a R^2 score of 0.0.
    
    Read more in the :ref:`User Guide <r2_score>`.
    
    Parameters
    ----------
    y_true : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Ground truth (correct) target values.
    
    y_pred : array-like of shape (n_samples,) or (n_samples, n_outputs)
        Estimated target values.
    
    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights.
    
    multioutput : {'raw_values', 'uniform_average', 'variance_weighted'},             array-like of shape (n_outputs,) or None
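
The statements in the docstring (a perfect fit scores 1.0, a constant model that predicts the mean scores 0.0, and worse models go negative) can be checked with a small hand-written version of the score. This is a minimal pure-Python sketch of the standard formula $R^2 = 1 - SS_{res}/SS_{tot}$, not scikit-learn's implementation; the function name `r_squared` is made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot


# First five measured values shown earlier in this notebook.
y = [-0.77, -3.30, -2.06, -7.87, -1.33]
mean_y = sum(y) / len(y)

perfect = r_squared(y, y)                    # predictions equal the truth
constant = r_squared(y, [mean_y] * len(y))   # always predict the mean
worse = r_squared(y, [0.0] * len(y))         # a poor constant guess

print(perfect, constant, worse < 0)  # 1.0 0.0 True
```

On the full dataset, `r_squared(list(Y), list(predicted))` should agree with `r2_score(Y, predicted)` up to floating-point rounding.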

## Exercise 2 - Regular Expressions

Parse the file `SBE-b-CD-data.sdf` in your `data` folder using regular expressions. 

The SDF file contains information for 220 molecules. For each molecule, there is a section which looks like this:

```
>  <ID>
(-)__Sulpiride

>  <Temperature_K>
293

>  <Kapp>
35
```

### Loading the data

For this homework, we recommend reading in the file using the `open` function. We recommend this because the file is irregular; it is not just a table. Additionally, the assignment is to use regular expressions, which work on strings. When you use `with open` and then call `.read()`, the entire file contents will be in one string. Do note that if your data is tabular and contains text, you can still [use regular expressions in pandas dataframes](https://pandas.pydata.org/docs/user_guide/text.html). The question of how to open a file and the question of whether to use regular expressions are not related. You should choose the best option for opening a file based on the file structure, and the best option for text processing based on what you're looking for.

Notably, there are other options you might use with `open`. You could also use `readlines` (which puts each line into a list) or `readline` (which reads a single line at a time).
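
The difference between the three reading methods can be seen on a tiny file. The sketch below writes a hypothetical stand-in file to a temporary location (it is not the real SDF file) and reads it back three ways.

```python
import os
import tempfile

# Hypothetical stand-in content, written to a temporary file for the demo.
sample_text = ">  <ID>\n1-naphthol\n\n>  <Kapp>\n1720.0\n"
with tempfile.NamedTemporaryFile("w", suffix=".sdf", delete=False) as tmp:
    tmp.write(sample_text)
    path = tmp.name

with open(path) as f:
    whole = f.read()        # one string holding the entire file

with open(path) as f:
    lines = f.readlines()   # list of lines, each keeping its trailing "\n"

with open(path) as f:
    first = f.readline()    # only the first line
    second = f.readline()   # the next call continues where the last one stopped

os.remove(path)  # clean up the temporary file

print(type(whole).__name__, len(lines), repr(first), repr(second))
# str 5 '>  <ID>\n' '1-naphthol\n'
```

Because a regular expression that spans a line break needs both lines in the same string, `read` is the convenient choice for this exercise.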

In [15]:
import re

In [16]:
with open("data/SBE-b-CD-data.sdf") as f:
    data = f.read()

In [17]:
type(data)

str

In [18]:
# Preview the first 500 characters
print(data[:500])

(-)__Sulpiride
  Marvin  01071110562D
 
 23 24  0  0  0  0  0  0  0  0999 V2000
    0.3200    0.2956    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
    1.1450    0.2956    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
    1.5575    1.0101    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
    2.3824    1.0101    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.7949    0.2956    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.3824   -0.4189    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0



### Composing our regular expression

Now we will write our regular expression. As a proof of concept, I would recommend constructing your regular expression using part of the file and [pythex](https://pythex.org/).

Let's break down the rules for what we're looking for then translate that into regular expression characters.


```
>  <ID>
(-)__Sulpiride

>  <Temperature_K>
293

>  <Kapp>
35
```


- Notice that these all follow the pattern of `<PROPERTY>`, a line break, and then the value. Let's write a regular expression to capture this kind of group. For `PROPERTY`, what we are looking for is mostly letters, along with a `_` for `Temperature_K`.

Let's first write an expression for getting `<PROPERTY>`. We want letters and `_` to be between `<>`. We could write `[A-Za-z]` to indicate any letter A-Z of either case. You can add a `_` to make it a valid character as well.

`<[A-Za-z_]>`

This pattern will not match anything because, as written, it only looks for a single character between the brackets. To say that there should be one or more characters, add `+` after the character class. Note that you could have also used `.` instead of `[A-Za-z_]` to indicate that you would accept *any* character. For our file, this should not change the results, because the property names contain only letters and underscores. If a name contained numbers (like `<Kapp_2>`), the first pattern would not find the key, while the second would.
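
The three variants discussed here can be compared on a short string. The sample is hypothetical: its last key, `<Kapp_2>`, contains a digit, which the keys in the real file do not.

```python
import re

# Hypothetical sample; <Kapp_2> is invented to show the effect of a digit.
sample = ">  <ID>\n1-naphthol\n\n>  <Temperature_K>\n298\n\n>  <Kapp_2>\n1720.0\n"

single = re.findall(r"<[A-Za-z_]>", sample)    # exactly one character between < and >
letters = re.findall(r"<[A-Za-z_]+>", sample)  # one or more letters or underscores
anything = re.findall(r"<.+>", sample)         # one or more of any character except newline

print(single)    # []
print(letters)   # ['<ID>', '<Temperature_K>']
print(anything)  # ['<ID>', '<Temperature_K>', '<Kapp_2>']
```

One caveat with `<.+>`: the `+` is greedy, so if a single line held two bracketed keys, the match would run from the first `<` to the last `>` on that line. The character class avoids that.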

In [63]:
pattern = re.compile('<[A-Za-z_]+>')

In [64]:
found = pattern.findall(data)

# Show first 20 found
found[:20]

['<ID>',
 '<Temperature_K>',
 '<Kapp>',
 '<ID>',
 '<Temperature_K>',
 '<Kapp>',
 '<ID>',
 '<Temperature_K>',
 '<Kapp>',
 '<ID>',
 '<Temperature_K>',
 '<Kapp>',
 '<ID>',
 '<Temperature_K>',
 '<Kapp>',
 '<ID>',
 '<Temperature_K>',
 '<Kapp>',
 '<ID>',
 '<Temperature_K>']

Next, we'll extend the pattern to look for a newline (`\n`). On the following line, we would like to get the value. We'll specify this should be any character up to the end of the line (`.`). Recall that the period `.` represents any character except a line break. Use the plus modifier (`+`) to indicate one or more characters.

In [65]:
pattern = re.compile('<[A-Za-z_]+>\n.+')

In [66]:
found = pattern.findall(data)
found[:20]

['<ID>\n(-)__Sulpiride',
 '<Temperature_K>\n293',
 '<Kapp>\n35',
 '<ID>\n1-naphthol',
 '<Temperature_K>\n298',
 '<Kapp>\n1720.0',
 '<ID>\n1-naphthylamine',
 '<Temperature_K>\n293',
 '<Kapp>\n518',
 '<ID>\n1-phenylpyrrole',
 '<Temperature_K>\n293',
 '<Kapp>\n555',
 '<ID>\n17-a-methyltestosterone',
 '<Temperature_K>\n298',
 '<Kapp>\n12933',
 '<ID>\n1_2_3-Trichlorobenzene',
 '<Temperature_K>\n293',
 '<Kapp>\n31567',
 '<ID>\n2-(1-Adamantyl)-4-methylphenol',
 '<Temperature_K>\n293']
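
The reason each match stops at the end of the value line is that `.` does not match a line break by default. A small sketch on the example section from the exercise shows what changes when the `re.DOTALL` flag lets `.` match newlines as well.

```python
import re

# The example section from the exercise description.
sample = ">  <ID>\n(-)__Sulpiride\n\n>  <Temperature_K>\n293\n\n>  <Kapp>\n35\n"

# Default: "." stops at the line break, so we get one match per property.
default = re.findall(r"<[A-Za-z_]+>\n.+", sample)

# With DOTALL, ".+" also consumes newlines and runs to the end of the string.
dotall = re.findall(r"<[A-Za-z_]+>\n.+", sample, flags=re.DOTALL)

print(default)      # ['<ID>\n(-)__Sulpiride', '<Temperature_K>\n293', '<Kapp>\n35']
print(len(dotall))  # 1
```

For this file the default behavior is the one we want, so no flag is needed.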

Finally, we'll add parentheses to group the parts we would like to keep. Put parentheses around `[A-Za-z_]+` and `.+` to get the property name and value in different groups.

In [67]:
pattern = re.compile('<([A-Za-z_]+)>\n(.+)')

In [68]:
found = pattern.findall(data)
found[:20]

[('ID', '(-)__Sulpiride'),
 ('Temperature_K', '293'),
 ('Kapp', '35'),
 ('ID', '1-naphthol'),
 ('Temperature_K', '298'),
 ('Kapp', '1720.0'),
 ('ID', '1-naphthylamine'),
 ('Temperature_K', '293'),
 ('Kapp', '518'),
 ('ID', '1-phenylpyrrole'),
 ('Temperature_K', '293'),
 ('Kapp', '555'),
 ('ID', '17-a-methyltestosterone'),
 ('Temperature_K', '298'),
 ('Kapp', '12933'),
 ('ID', '1_2_3-Trichlorobenzene'),
 ('Temperature_K', '293'),
 ('Kapp', '31567'),
 ('ID', '2-(1-Adamantyl)-4-methylphenol'),
 ('Temperature_K', '293')]
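
A possible next step, not required by the exercise, is to group the flat list of pairs into one record per molecule. The sketch below assumes every molecule's section starts with an `<ID>` entry, as in the output above, and that `Temperature_K` and `Kapp` are always numeric. The sample string reuses the first two molecules from that output.

```python
import re

# First two molecules from the output above, laid out like the SDF section.
sample = (
    ">  <ID>\n(-)__Sulpiride\n\n>  <Temperature_K>\n293\n\n>  <Kapp>\n35\n\n"
    ">  <ID>\n1-naphthol\n\n>  <Temperature_K>\n298\n\n>  <Kapp>\n1720.0\n\n"
)

pattern = re.compile(r"<([A-Za-z_]+)>\n(.+)")
pairs = pattern.findall(sample)

records = []
for key, value in pairs:
    # Assumption: an <ID> entry marks the start of a new molecule.
    if key == "ID" or not records:
        records.append({})
    records[-1][key] = value

# Assumption: these two fields are always numeric in the file.
for record in records:
    record["Temperature_K"] = float(record["Temperature_K"])
    record["Kapp"] = float(record["Kapp"])

print(records)
# [{'ID': '(-)__Sulpiride', 'Temperature_K': 293.0, 'Kapp': 35.0},
#  {'ID': '1-naphthol', 'Temperature_K': 298.0, 'Kapp': 1720.0}]
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(records)` to get a table with one row per molecule.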

## Exercise 3 - Bonus

This exercise shows you how to retrieve papers from ChemRxiv (a chemistry preprint server) using its REST API. Your task is then to do some processing in pandas to retrieve article abstracts, and then to look for phrases in the abstracts using regular expressions. This homework requires learning some extra material, so it is a bonus.

First, you can use the Python `requests` library (a third-party package, not part of the Python Standard Library) to query the REST API for ChemRxiv. This is the URL we will go to. If you visit this URL in your browser, you will get a list of the 100 most recent papers uploaded to ChemRxiv.

`https://api.figshare.com/v2/articles?institution=259&page_size=100`

To retrieve this information using Python, do

```python
import requests

results = requests.get('https://api.figshare.com/v2/articles?institution=259&page_size=100')
```

The 'payload' of this is stored in `results.json()`. This is the same information you see in your browser. You can convert what you've retrieved into a dataframe by doing `df = pd.DataFrame(results.json())`.

You can retrieve the abstracts by calling `requests.get` on the `url` for each paper. Save this in a column called `detail`. This step will take a while to execute.

After retrieving the details, you must get the JSON and retrieve the "description" field. You can write a custom function which does this and apply it to the `detail` column.
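
As an alternative sketch, the two steps (fetching the detail and pulling out the description) can be combined in one function that also checks for HTTP errors. The function name `fetch_abstract` and the `FakeResponse` and `fake_get` stand-ins are hypothetical; the stand-ins exist only so the example runs without network access. `raise_for_status()` and `json()` are real methods on the response object that `requests.get` returns.

```python
def fetch_abstract(url, get):
    """Return the "description" field for one article, or "" if it is missing.

    `get` is any callable with the same interface as `requests.get`.
    """
    response = get(url)
    response.raise_for_status()  # raises an exception on 4xx/5xx responses
    return response.json().get("description", "")


class FakeResponse:
    """Offline stand-in for a requests response (hypothetical, for the demo only)."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass  # the fake response always succeeds

    def json(self):
        return self._payload


def fake_get(url):
    # Returns a made-up payload shaped like an article detail record.
    return FakeResponse({"url": url, "description": "<p>An example abstract.</p>"})


abstract = fetch_abstract("https://api.figshare.com/v2/articles/14579535", fake_get)
print(abstract)  # <p>An example abstract.</p>
```

With the real library, the same function could be applied as `df["url"].apply(lambda u: fetch_abstract(u, requests.get))`.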

Next, use [pandas str contains](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html) to search the abstract for phrases of interest. We suggest trying out machine/deep learning. Your results will vary since it will always grab the 100 most recently uploaded papers!
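
What a regular-expression search over abstracts does can be sketched with plain `re` on a few strings. The abstracts below are hypothetical, not taken from ChemRxiv. The pattern uses a non-capturing group, `(?:...)`, which matches the same text as a plain group but does not store it.

```python
import re

# Hypothetical abstracts for illustration only.
abstracts = [
    "<p>We apply deep learning to predict binding energies.</p>",
    "Machine learning models were trained on the dataset.",
    "A new class of luciferins is reported.",
]

# Case-sensitive search: "Machine learning" with a capital M is missed.
pattern = re.compile(r"(?:machine|deep) learning")
case_sensitive = [bool(pattern.search(text)) for text in abstracts]

# Case-insensitive search catches it.
pattern_ci = re.compile(r"(?:machine|deep) learning", flags=re.IGNORECASE)
case_insensitive = [bool(pattern_ci.search(text)) for text in abstracts]

print(case_sensitive)    # [True, False, False]
print(case_insensitive)  # [True, True, False]
```

In pandas, `str.contains` is case-sensitive by default as well; passing `case=False` gives the case-insensitive behavior.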

In [35]:
import requests

In [36]:
results = requests.get('https://api.figshare.com/v2/articles?institution=259&page_size=100')

In [38]:
df = pd.DataFrame(results.json())

In [39]:
df.head()

Unnamed: 0,defined_type_name,handle,url_private_html,timeline,url_private_api,url_public_api,id,doi,thumb,title,url,defined_type,resource_title,url_public_html,resource_doi,published_date,group_id
0,preprint,,https://figshare.com/account/articles/14579535,"{'revision': '2021-05-14T07:58:17', 'firstOnli...",https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/14579535,14579535,10.26434/chemrxiv.14579535.v1,https://s3-eu-west-1.amazonaws.com/ppreviews-c...,Screening of Influenza a (H1N1) Neuraminidase ...,https://api.figshare.com/v2/articles/14579535,12,,https://chemrxiv.org/articles/preprint/Screeni...,,2021-05-14T07:58:15Z,13668
1,preprint,,https://figshare.com/account/articles/14587656,"{'revision': '2021-05-13T13:11:44', 'firstOnli...",https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/14587656,14587656,10.26434/chemrxiv.14587656.v1,https://s3-eu-west-1.amazonaws.com/ppreviews-c...,COVID-19: The CaMKII_Like System of S Protein ...,https://api.figshare.com/v2/articles/14587656,12,,https://chemrxiv.org/articles/preprint/COVID-1...,,2021-05-13T13:11:31Z,13668
2,preprint,,https://figshare.com/account/articles/14587401,"{'revision': '2021-05-13T13:04:54', 'firstOnli...",https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/14587401,14587401,10.26434/chemrxiv.14587401.v1,https://s3-eu-west-1.amazonaws.com/ppreviews-c...,DoE Optimization Empowers the Automated Prepar...,https://api.figshare.com/v2/articles/14587401,12,,https://chemrxiv.org/articles/preprint/DoE_Opt...,,2021-05-13T13:04:51Z,13668
3,preprint,,https://figshare.com/account/articles/14585934,"{'revision': '2021-05-13T12:57:08', 'firstOnli...",https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/14585934,14585934,10.26434/chemrxiv.14585934.v1,https://s3-eu-west-1.amazonaws.com/ppreviews-c...,A Quantitative Metric for Organic Radical Pers...,https://api.figshare.com/v2/articles/14585934,12,,https://chemrxiv.org/articles/preprint/A_Quant...,,2021-05-13T12:57:04Z,13668
4,preprint,,https://figshare.com/account/articles/14052293,"{'revision': '2021-05-13T09:26:26', 'firstOnli...",https://api.figshare.com/v2/account/articles/1...,https://api.figshare.com/v2/articles/14052293,14052293,10.26434/chemrxiv.14052293.v3,https://s3-eu-west-1.amazonaws.com/ppreviews-c...,Mixed Chirality α-Helix in a Stapled Bicyclic ...,https://api.figshare.com/v2/articles/14052293,12,,https://chemrxiv.org/articles/preprint/Mixed_C...,,2021-05-13T09:26:21Z,13668


In [40]:
df["detail"] = df["url"].apply(requests.get)

In [47]:
def get_abstract(paper_detail):
    abstract = paper_detail.json()["description"]
    return abstract

In [48]:
df["abstract"] = df["detail"].apply(get_abstract)

In [49]:
df["abstract"]

0     <p>Due\nto erratic climate change, vector-born...
1     COVID-19 is a unique disease characterized by ...
2     PARP inhibitors are proven chemotherapeutics a...
3     <p>Long-lived organic radicals are promising c...
4     <p></p><p>The\npeptide α-helix is right-handed...
                            ...                        
95    <p>The use of enzymes for organic synthesis al...
96    We prepared a new class of luciferins based on...
97    <p>The functional diversity of the green fluor...
98    <p>Lasso peptides are a structurally diverse s...
99    The oxygen evolution reaction (OER) from water...
Name: abstract, Length: 100, dtype: object

In [53]:
# Use str.contains to find abstracts which contain the phrase. Regular expressions work with this function.
# The group is non-capturing, (?:...), so pandas does not warn about match groups.
ml_papers = df[df["abstract"].str.contains("(?:machine|deep) learning")]


In [54]:
ml_papers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 11 to 95
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   defined_type_name  8 non-null      object
 1   handle             8 non-null      object
 2   url_private_html   8 non-null      object
 3   timeline           8 non-null      object
 4   url_private_api    8 non-null      object
 5   url_public_api     8 non-null      object
 6   id                 8 non-null      int64 
 7   doi                8 non-null      object
 8   thumb              8 non-null      object
 9   title              8 non-null      object
 10  url                8 non-null      object
 11  defined_type       8 non-null      int64 
 12  resource_title     8 non-null      object
 13  url_public_html    8 non-null      object
 14  resource_doi       8 non-null      object
 15  published_date     8 non-null      object
 16  group_id           8 non-null      int64 
 17 

In [59]:
ml_papers["abstract"]

11    Asymmetric catalysis enabling divergent contro...
28    For CO* and H* binding energy prediction, we d...
37    Lithium-ion batteries (LIBs) represent the sta...
42    <p></p><p>Enclosed you will find the article e...
59    <p>Application of deep learning techniques for...
78    <div>MOFs and COFs are porous materials with a...
80    <p></p><p>The accurate description of protein ...
95    <p>The use of enzymes for organic synthesis al...
Name: abstract, dtype: object

In [73]:
# Read more from one paper. Notice that the row label is preserved from the original dataframe.
# Note that yours will be different because this code returns the 100 most recently uploaded papers to ChemRxiv!
ml_papers["abstract"][37]

'Lithium-ion batteries (LIBs) represent the state of the art in high-density energy storage. To further advance LIB technology, a fundamental understanding of the underlying chemical processes is required. In particular, the decomposition of electrolyte species and associated formation of the solid electrolyte interphase (SEI) is critical for LIB performance. However, SEI formation is poorly understood, in part due to insufficient exploration of the vast reactive space.  The Lithium-Ion Battery Electrolyte(LIBE) dataset reported here aims to provide accurate first-principles data to improve the understanding of SEI species and associated reactions. The dataset was generated by fragmenting a set of principal molecules, including solvents, salts, and SEI products, and then selectively recombining a subset of the fragments.  All candidate molecules were analyzed at the ωB97X-V/def2-TZVPPD/SMD level of theory at various charges and spin multiplicities.  In total, LIBE contains structural,t