## Scientists Database Creator

- The aim is to crawl various websites to build a database of (Indian) scientists working in physics and related fields.

- Results will be stored in the following structure:

```
project root
│
├── code.ipynb
│
└── database folder
    │
    ├── high_energy_physics.csv
    │
    ├── strongly_correlated_electrons.csv
    │
    ├── quantum_magnetism.csv
    │
    └── low_dimensional_materials.csv
```


- Websites to crawl include Google Scholar and those of institutes like the IITs, IISERs, etc.

- The code will be structured as separate modules, one for each website.
- Each module outputs its results in a common format: a dictionary of the form

`results={'complex_networks':[[Name 1, Affil 1, Webpage 1, Interests 1], [Name 2, Affil 2, Webpage 2, Interests 2]], 'quantum_magnetism':[[...]]}`

- In other words, the keys of the dictionary are the interests, and the value of each key is a list of lists. Each inner list holds the information for one scientist.

- A scientist appears under a key only if they have that particular interest.

- As an example, let's say the results are as follows. Scientist A1 with affiliation B1 has interests X1 and X2, while scientist A2 with affiliation B2 has interests X2 and X3. Then, omitting the webpage for brevity, the complete dictionary is `results={X1: [[A1, B1, "X1, X2"]], X2: [[A1, B1, "X1, X2"],[A2, B2, "X2, X3"]], X3: [[A2, B2, "X2, X3"]]}`
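
The grouping in this example can be sketched in a few lines. This is only an illustration, not the notebook's own code: the helper name `group_by_interest` and the tuple layout of its input are assumptions made for the sketch, and the webpage field is left out to match the example.

```python
def group_by_interest(scientists):
    """Group scientist records under each of their interests.

    `scientists` is an iterable of (name, affiliation, interests) tuples,
    where `interests` is a list of strings. This input layout is a
    hypothetical choice made for the sketch.
    """
    results = {}
    for name, affil, interests in scientists:
        # the same row is stored under every interest of this scientist,
        # with all interests joined into a single string
        row = [name, affil, ", ".join(interests)]
        for interest in interests:
            # create the key on first use, then append the row
            results.setdefault(interest, []).append(row)
    return results


scientists = [
    ("A1", "B1", ["X1", "X2"]),
    ("A2", "B2", ["X2", "X3"]),
]
results = group_by_interest(scientists)
```

Here `results["X2"]` holds both scientists, while `results["X1"]` and `results["X3"]` hold one each.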

- Each key will be written to its own file: `X1.csv`, `X2.csv` and so on.

- In each file, the first row will hold the headers ("Name", "Affiliation", etc.), and each subsequent row will hold the details of one scientist.

- The delimiter has been chosen to be `'\t'`, because commas are common in the text.
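
One possible way to write the dictionary to disk is sketched below. It is not the notebook's own writer: the function name `write_results`, the default folder name `database`, and the exact header tuple are assumptions, and the demo row uses placeholder values.

```python
import csv
import os
import tempfile


def write_results(results, folder="database",
                  header=("Name", "Affiliation", "Webpage", "Interests")):
    """Write each key of `results` to `<folder>/<key>.csv`, tab-delimited.

    The folder name and header are assumptions made for this sketch.
    """
    os.makedirs(folder, exist_ok=True)
    for interest, rows in results.items():
        path = os.path.join(folder, interest + ".csv")
        # newline="" lets the csv module control the line endings
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(header)  # first row: column headers
            writer.writerows(rows)   # one scientist per row


# demo with placeholder data, written to a temporary folder
demo_folder = tempfile.mkdtemp()
demo = {
    "quantum_magnetism": [
        ["Name 1", "Affil 1, City 1", "Webpage 1",
         "quantum magnetism, spin liquids"],
    ],
}
write_results(demo, folder=demo_folder)
```

Because the delimiter is a tab, the commas inside the affiliation and interests fields need no quoting. Keys containing characters that are not valid in file names would need further sanitising before being used as paths.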

In [3]:
# All imports go here

import os
import sys
import requests
from bs4 import BeautifulSoup as bs
import csv 

## Google Scholar Retriever

In [19]:
def google_scholar():
    
    # URL to search for scientists. The keyword is 'physic', which will hopefully
    # return most physics scientists. The search is restricted to India through
    # the domains .ac.in and .res.in. Currently only the first page of results
    # is searched; this needs to be extended to all pages.
    
    url = "https://scholar.google.co.in/citations?hl=en&view_op=search_authors&mauthors=physic+%2B+.ac.in+%7C+.res.in&btnG="

    page = requests.get(url)
    soup = bs(page.content, features='lxml')
    results = {}

    for tag in soup.find_all('h3', attrs={'class': "gs_ai_name"}):

        # obtain name, affiliation, interests and homepage (if exists)
        name = tag.text
        link = "https://scholar.google.com" + tag.find('a')['href']
        author_soup = bs(requests.get(link).content, features='lxml')
        affil_tag = author_soup.find('div', attrs={'class': "gsc_prf_il"})
        affil = affil_tag.text if affil_tag else ""
        # the homepage link is optional, so fall back to an empty string
        homepage_tag = author_soup.find('a', string="Homepage")
        homepage = homepage_tag['href'] if homepage_tag else ""
        # the interests block may be absent; treat that as no interests
        interests_tag = author_soup.find('div', attrs={'class': "gsc_prf_il", 'id': "gsc_prf_int"})
        interests = [child.text for child in interests_tag.find_all()] if interests_tag else []

        data = [name, affil, homepage, ', '.join(interests)]
        
        # append data for this scientist to the dictionary
        for interest in interests:
            
            # sanitise interest by changing spaces to
            # underscores and converting to lower case
            interest_sanitised = interest.replace(' ', '_').lower()
            
            # create the key if it does not exist, then append
            results.setdefault(interest_sanitised, []).append(data)
        
    return results