# Some Starter Code for Retrieving, and Analyzing Data Using API
In this notebook I include a basic example of 
1. retrieving data using [SemanticScholar APIs](https://api.semanticscholar.org/graph/v1)
2. store it in a pandas dataframe  
3. write it to a .csv file. 

In [1]:
import requests 
import pandas as pd 

As an example, the following API performs a search by keyword and:
1. Returns with total=639637, offset=0, next=100, and data is a list of 100 papers.
2. Each paper has paperId, abstract, year, referenceCount, citationCount, influentialCitationCount and fieldsOfStudy 

Feel free to change the strings after 'query=' and 'fields='to specify what keyword you want to search and what fields, i.e. data, you want the API to return.  Add 'limit=' to specify how many data you want it to return.
For more information on other APIs refer to [SemanticScholar APIs](https://api.semanticscholar.org/graph/v1)

In [12]:
response = requests.get('https://api.semanticscholar.org/graph/v1/paper/search?&query=convex&fields=abstract,year,referenceCount,authors,citationCount,influentialCitationCount,fieldsOfStudy&offest=0&limit=100')

Using api to access a author using authorID, and calculate the h-index of the specified author.

In [4]:
def get_h_index(authorID):
    """given string authorID, calculate H-Index"""
    response_author = requests.get('https://api.semanticscholar.org/graph/v1/author/{}?fields=name,papers,papers.citationCount'.format(authorID))
    papers = response_author.json()['papers']

    paper_citation = []
    for i in papers:
        paper_citation.append(i['citationCount'])
    paper_citation.sort(key = lambda x: -x)
    
    h = 0
    for i, c in enumerate(paper_citation):
        if i + 1 > c:
            h = i 
            break
    return h


exampleID = '1741103'
print('H-index of {} is {}'.format(exampleID,get_h_index(exampleID)))

H-index of 1741103 is 19


The API only supports 100 resquests per 5 minutes. Hence after every 99 requests are made the code will sleep for 5 mins. The keywords contains the disciplines and their corresponding keywords that are used to conduct a search using the API. The data collected are saved to the current working directory as csv files. 

The features that are collected are:
1. paperid 
2. abstract
3. Year of publish
4. fields of study
5. list of authors (to be transformed to h-index later using get_h_index func above)
6. reference_count
7. citation_count
8. influential_citation_count 

To collect data, run the following cell. Modify keywords dict if you want to collect data from other disciplines or containg other keyword. 

In [46]:
import time 
num_requests = 33
keywords = {
    'Aeronautics': ['Aerospace','aircraft','fluid','aerodynamics', 'radar', 'orbital', 'combustion'],
    'Mathematics':['Analysis', 'Algebraic', 'Arithmetic', 'Number', 'Vector', 'Set', 'Geometric'],
    'Chemistry': ['Chemical', 'Thermodynamic', 'kinetics', 'electrochemistry', 'spectroscopy', 'molecular', 'geochemistry'],
    'Computer science': ['Algorithm', 'Computation', 'Intelligent', 'System', 'Graphics', 'Visualization', 'Architecture'],
    'Physics': ['Force', 'Newtonian', 'Mechanics', 'Relativity', 'Equilibrium', 'Quantum', 'Nuclear', 'Electromagnetic'],
    'Material Science': ['Solids', 'metallurgy', 'mineralogy', 'nanotechnology', 'biomaterials', 'metallurgy', 'failure'],
    'Civil Engineering': ['Geology', 'Soils', 'Environmental', 'Design', 'pavement', 'construction', 'residential', 'commercial'],
    'Biology': ['natural science', 'organisms', 'physiology', 'anatomy', 'plants', 'animals', 'earth', 'ecosystem' ],
    'Medicine': ['Cardiology', 'Cardiovascular Surgery', 'Dermatology', 'Dentistry', 'Emergency Medicine', 'Endocrinology', 'Gastroenterology', 'General Practice'],
    'Economics':['Goods', 'services', 'production', 'consumption', 'macroeconomics', 'microeconomics', 'contract', 'econophysics', 'political economy']
}

for fields in keywords.keys():
    data = []
    for counter,q in enumerate(keywords[fields]):
        for i in range(num_requests):
            query = 'https://api.semanticscholar.org/graph/v1/paper/search?&query={}&fields=abstract,year,referenceCount,citationCount,influentialCitationCount,fieldsOfStudy,authors&offest={}&limit=100'.format(q,i*100)
            response = requests.get(query)
            data += response.json()['data']
        if (counter + 1) % 3 == 0:
            time.sleep(301) #sleep for 5 min
    df = pd.DataFrame(data)
    df = df[df['year'] < 2010]
    df.to_csv(fields+'data.csv',index=False) # this writes a csv file to the current working directory 

The following cell is a example of storing the retrived data into a pandas dataframe and write it into a csv file. 

Search by some keyword and then filter the data by year/discipline. Get 10k datapoints for 10 different disciplines each.   

Potential disciplines to consider: [Math,Physics,Chemetry,Computer science, Aeronautics, Material Science, Civil Engineering, Biology, Medicine, scociology,economics]

For each paper, get author ID, perform a search with author ID, get papers and citation count published by the author, calculate H-index.       


train Models:        

1.Linear Regression (with kernel)           
2.NN: fully connected, 1st layer: D * 20, 2nd layer 20 * 1 , activation function: relu, loss: MSE  Try regularization. 

The following cells demonstrate how to define and train a simple regression model using Pytorch. We will use the data collecred above. The model will be a linear regression model that takes citationCount as input and predicts influentialCitationCount.

In [84]:
import torch
import torch.nn as nn
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = df['citationCount'].to_numpy(dtype=np.float32)
y_train = df['influentialCitationCount'].to_numpy(dtype=np.float32)
sc = MinMaxScaler() #scale the input so the gradient won't explode. 
X_train=sc.fit_transform(X_train.reshape(-1,1))
y_train =y_train.reshape(-1,1)

X_train = torch.from_numpy(X_train)
y_train = torch.from_numpy(y_train)

input_size,output_size = 1,1

class LinearRegressionModel(torch.nn.Module):

    def __init__(self):
        super(LinearRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(input_size, output_size)  

    def forward(self, x):
        y_pred = self.linear(x)
        return y_pred

model = LinearRegressionModel()
learning_rate = 0.01
l = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr =learning_rate)

Train the model:

In [85]:
num_epochs = 20000

for epoch in range(num_epochs):
     #forward feed
    y_pred = model(X_train.requires_grad_())

    #calculate the loss
    loss= l(y_pred, y_train)

    #backward propagation: calculate gradients
    loss.backward()

    #update the weights
    optimizer.step()

    #clear out the gradients from the last step loss.backward()
    optimizer.zero_grad()
    
    if epoch % 1000 == 0:
     print('epoch {}, loss {}'.format(epoch, loss.item()))

epoch 0, loss 9078.8662109375
epoch 1000, loss 2224.139404296875
epoch 2000, loss 1482.6478271484375
epoch 3000, loss 1100.35009765625
epoch 4000, loss 903.2459716796875
epoch 5000, loss 801.622314453125
epoch 6000, loss 749.2279663085938
epoch 7000, loss 722.2144775390625
epoch 8000, loss 708.2868041992188
epoch 9000, loss 701.106201171875
epoch 10000, loss 697.4038696289062
epoch 11000, loss 695.4951171875
epoch 12000, loss 694.510986328125
epoch 13000, loss 694.0035400390625
epoch 14000, loss 693.741943359375
epoch 15000, loss 693.6071166992188
epoch 16000, loss 693.5374755859375
epoch 17000, loss 693.501708984375
epoch 18000, loss 693.483154296875
epoch 19000, loss 693.4736938476562


In [86]:
model.forward(X_train).detach().numpy() #make prediction

array([[457.44882 ],
       [325.22952 ],
       [155.64798 ],
       ...,
       [ 33.368767],
       [ 35.577686],
       [ 29.992285]], dtype=float32)