# Introduction

**Main Topic**

This notebook shows how to **generate molecular images using [PubChem](https://pubchem.ncbi.nlm.nih.gov/)** data.

**References**

[**PubChem Official Docs**](https://pubchemdocs.ncbi.nlm.nih.gov/about)

[**Generate SMILES Molecular Image(Korean)**](https://dacon.io/competitions/official/235640/codeshare/1630?dtype=recent)


# Install RDKit

In [None]:
!conda install -y -c rdkit rdkit;

## Download the PubChem Compound ID (CID) to InChI mapping

PubChem publishes bulk data for a very large number of compounds at https://ftp.ncbi.nlm.nih.gov/pubchem, and we can generate molecular images from it.

The FTP index at /pubchem/Compound/Extras lists the available files. I'm going to download **CID-InChI-Key.gz**.

![](https://drive.google.com/uc?export=view&id=1kgOTcGQnZFchzyQvZV4HbVdxEtf5uME2)

In [None]:
! wget https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-InChI-Key.gz

## Set up environment

In [None]:
import cv2
import os
import gzip
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt


import rdkit
from rdkit import Chem
from rdkit.Chem import Draw

## Open gz file

**CID-InChI-Key.gz** contains a huge number of InChI records, so I'll extract only the first 500 of them.

**- note -**

Similar compounds are stored next to each other in the file, so it would be better to sample randomly if you want to use this dataset.
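
The random selection mentioned in the note could be sketched with reservoir sampling, which picks `k` lines uniformly from a stream without loading it all into memory. This is only a minimal sketch: the helper name `reservoir_sample` and its `seed` parameter are hypothetical and not part of this notebook, and it still has to read the whole file once, which can be slow for a file of this size.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Hypothetical helper (not part of the original notebook).
    """
    rng = random.Random(seed)
    sample = []
    for n, line in enumerate(lines):
        if n < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Replace a random slot with probability k / (n + 1).
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample


# Usage sketch (assumes CID-InChI-Key.gz was downloaded as shown earlier):
# import gzip
# with gzip.open('CID-InChI-Key.gz', 'rt') as f:
#     data = reservoir_sample(f, 500)
```

If the input has fewer than `k` lines, the function simply returns all of them.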

In [None]:
length = 500
with gzip.open('CID-InChI-Key.gz', 'r') as InChIs:
    data = [InChIs.readline().decode() for _ in tqdm(range(length))]

In [None]:
Chem.MolFromInchi(data[0].split('\t')[1])

In [None]:
InChI_dict = {'InChI': []}
for d in tqdm(data):
    InChI = d.split('\t')[1]
    # Keep only the InChI strings that RDKit can parse into a molecule
    m = Chem.MolFromInchi(InChI)
    if m is not None:
        InChI_dict['InChI'].append(InChI)
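
The `split('\t')[1]` indexing above relies on each record being tab-separated with the InChI in the second column. A small parser makes that assumption explicit. This sketch assumes three columns (CID, InChI, InChIKey), which is inferred from the file name rather than stated in this notebook; `parse_cid_inchi_line` is a hypothetical helper, and the sample record is illustrative, not copied from the file.

```python
def parse_cid_inchi_line(line):
    """Split one tab-separated record into (cid, inchi, inchi_key).

    Hypothetical helper; the three-column layout is an assumption.
    """
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 3:
        raise ValueError(f'expected 3 tab-separated fields, got {len(fields)}')
    cid, inchi, inchi_key = fields
    return int(cid), inchi, inchi_key


# Illustrative record for methane (not read from the downloaded file)
example = '297\tInChI=1S/CH4/h1H4\tVNWKTOKETHGBQD-UHFFFAOYSA-N\n'
cid, inchi, inchi_key = parse_cid_inchi_line(example)
```

Raising on a malformed line, rather than silently indexing into it, makes it easier to notice if the file layout differs from this assumption.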

## Generate DataFrame

In [None]:
save_path = './images/'
train = pd.DataFrame(InChI_dict)
train['file_name'] = 'train_' + train.index.astype('str') + '.png'
train['file_path'] = save_path + train['file_name']
train.head()

## Save Images

In [None]:
# Create the output directory if it does not exist yet
os.makedirs(save_path, exist_ok=True)

for idx, row in tqdm(train.iterrows(), total=len(train)):
    file = row['file_path']
    InChI = row['InChI']
    m = Chem.MolFromInchi(InChI)
    if m is not None:
        # Render the molecule as a 300x300 image and save it as PNG
        img = Draw.MolToImage(m, size=(300, 300))
        img.save(file)

In [None]:
image_paths = train.file_path

## Show Images

In [None]:
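plt.figure(figsize=(20, 18))
# Show up to 20 images in a 5x4 grid
for i in range(min(20, len(image_paths))):
    # cv2 loads images as BGR; convert to RGB so matplotlib shows the right colors
    img = cv2.cvtColor(cv2.imread(image_paths[i]), cv2.COLOR_BGR2RGB)
    plt.subplot(5, 4, i+1)
    plt.imshow(img)
plt.show()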

## Next Step

As we know, the competition dataset consists of low-resolution images.
So it would be better to match the resolutions using image processing.

- low resolution --> high resolution

- high resolution --> low resolution

I don't know which one is better yet, but we can figure it out :)
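
As one illustration of the high resolution --> low resolution direction, the sketch below shrinks a grayscale image by averaging non-overlapping pixel blocks with NumPy. It is only a sketch under stated assumptions: `downscale_by_block_mean` is a hypothetical helper, the input is assumed to be a 2-D `uint8` array, and the factor of 3 is arbitrary. In practice `cv2.resize` with `interpolation=cv2.INTER_AREA` plays a similar role.

```python
import numpy as np


def downscale_by_block_mean(img, factor):
    """Shrink a 2-D grayscale image by averaging factor x factor blocks.

    Hypothetical helper (not part of the original notebook).
    """
    h, w = img.shape
    h2, w2 = h // factor, w // factor
    # Crop so that both sides are exact multiples of the factor.
    cropped = img[:h2 * factor, :w2 * factor].astype(np.float64)
    # Group pixels into blocks, then average within each block.
    blocks = cropped.reshape(h2, factor, w2, factor)
    return blocks.mean(axis=(1, 3)).round().astype(np.uint8)


# Synthetic 300x300 image: white top half, black bottom half
demo = np.zeros((300, 300), dtype=np.uint8)
demo[:150] = 255
small = downscale_by_block_mean(demo, 3)  # shape (100, 100)
```

Block averaging blurs thin bond lines slightly, which may or may not resemble the competition images, so it is worth comparing the result visually before settling on a method.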

I hope this notebook is helpful.