# Some Starter Code for Retrieving, and Analyzing Data Using API
In this notebook I include a basic example of 
1. retrieving data using [SemanticScholar APIs](https://api.semanticscholar.org/graph/v1)
2. store it in a pandas dataframe  
3. write it to a .csv file. 
test

In [2]:
import requests 
import pandas as pd 

As an example, the following API performs a search by keyword and:
1. Returns with total=639637, offset=0, next=100, and data is a list of 100 papers.
2. Each paper has paperId, abstract, year, referenceCount, citationCount, influentialCitationCount and fieldsOfStudy 

Feel free to change the strings after 'query=' and 'fields='to specify what keyword you want to search and what fields, i.e. data, you want the API to return.  Add 'limit=' to specify how many data you want it to return.
For more information on other APIs refer to [SemanticScholar APIs](https://api.semanticscholar.org/graph/v1)

In [33]:
response = requests.get('https://api.semanticscholar.org/graph/v1/paper/search?&query=covid&fields=abstract,year,referenceCount,authors,citationCount,influentialCitationCount,fieldsOfStudy&offest=0&limit=100')

An example of the paper instance returned by the above API call

In [35]:
response.json()['data'][0]

{'paperId': '8e787e925eeb7ad735a228b2b1e8dd6d9620be83',
 'abstract': '\n               Summary\n               \n                  Background\n                  Since December, 2019, Wuhan, China, has experienced an outbreak of coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Epidemiological and clinical characteristics of patients with COVID-19 have been reported but risk factors for mortality and a detailed clinical course of illness, including viral shedding, have not been well described.\n               \n               \n                  Methods\n                  In this retrospective, multicentre cohort study, we included all adult inpatients (≥18 years old) with laboratory-confirmed COVID-19 from Jinyintan Hospital and Wuhan Pulmonary Hospital (Wuhan, China) who had been discharged or had died by Jan 31, 2020. Demographic, clinical, treatment, and laboratory data, including serial samples for viral RNA detection, 

The API only supports 100 resquests per 5 minutes. Here is an example of making 50 requests that retrieve the papers with keyword 'covid' and 'vaccination' and load them into a pandas dataframe       

Note here the response.json() is a dictionary with keys 'total', 'offset','next', and 'data'. Here the value of the key 'data' is of our interest, and it is a list of dictionaries. Each dictionary stores the relevant data of a paper specified in your query. 

For demonstration purpose only [paperId	year,referenceCount,citationCount,influentialCitationCount,fieldsOfStudy] are collecrted.

In [24]:
num_requests = 50
keyword = ['covid','vaccination']  
data = []
for i in range(num_requests):
    q = '+'.join(keyword)
    query = 'https://api.semanticscholar.org/graph/v1/paper/search?&query={}&fields=year,referenceCount,citationCount,influentialCitationCount,fieldsOfStudy&offest={}&limit=100'.format(q,i*100)
    response = requests.get(query)
    data += response.json()['data']

The following cell is a example of storing the retrived data into a pandas dataframe and write it into a csv file. 

In [25]:
df = pd.DataFrame(data)
df

Unnamed: 0,paperId,year,referenceCount,citationCount,influentialCitationCount,fieldsOfStudy
0,8e787e925eeb7ad735a228b2b1e8dd6d9620be83,2020,43,14366,485,[Medicine]
1,97881c6577c310f50fc86738c0268896b970dfa4,2020,12,10176,341,[Medicine]
2,ca019e1e38edf9d2112ea987362da454f909ac1b,2020,4,4802,227,[Medicine]
3,dd86b3551add27004b5bf3f5fb206bec9cd69c4f,2020,18,5011,129,[Medicine]
4,d23288ee99138421d6a771a14a98a9cdddd97f98,2020,5,5014,143,[Medicine]
...,...,...,...,...,...,...
4995,ebe3f062c05b57cb1f0f0f1e73ad23d8af6aef33,2020,126,775,60,[Medicine]
4996,b638d404a28a56d5553e84bea7450712f5cf00ba,2020,61,974,48,"[Biology, Medicine]"
4997,535ae4b3525c0a104b007f190fcce59de617a56e,2020,135,927,44,"[Chemistry, Medicine]"
4998,b557fb52771f9a5ff953ca9825f38e82dff33f50,2020,48,997,44,[Medicine]


In [29]:
df.describe()

Unnamed: 0,year,referenceCount,citationCount,influentialCitationCount
count,5000.0,5000.0,5000.0,5000.0
mean,2020.0,52.01,2210.5064,73.87
std,0.0,56.934684,1759.102379,61.506339
min,2020.0,0.0,775.0,26.0
25%,2020.0,16.0,1349.5,45.75
50%,2020.0,35.5,1742.5,57.5
75%,2020.0,61.25,2395.25,78.5
max,2020.0,353.0,14367.0,485.0


In [31]:
df.to_csv('data.csv',index=False) # this writes a csv file to the current working directory 

Search by some keyword and then filter the data by year/discipline. Get 1k datapoints for 10 different disciplines each.   

Potential disciplines to consider: [Math,Physics,Chemetry,Computer science, Aeronautics, Material Science, Civil Engineering, Biology, Medicine, scociology,economics]

For each paper, get author ID, perform a search with author ID, get papers and citation count published by the author, calculate H-index.       


train Models:        

1.Linear Regression (with kernel)           
2.NN            

The following cells demonstrate how to define and train a simple regression model using Pytorch. We will use the data collecred above. The model will be a linear regression model that takes citationCount as input and predicts influentialCitationCount.

In [84]:
import torch
import torch.nn as nn
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = df['citationCount'].to_numpy(dtype=np.float32)
y_train = df['influentialCitationCount'].to_numpy(dtype=np.float32)
sc = MinMaxScaler() #scale the input so the gradient won't explode. 
X_train=sc.fit_transform(X_train.reshape(-1,1))
y_train =y_train.reshape(-1,1)

X_train = torch.from_numpy(X_train)
y_train = torch.from_numpy(y_train)

input_size,output_size = 1,1

class LinearRegressionModel(torch.nn.Module):

    def __init__(self):
        super(LinearRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(input_size, output_size)  

    def forward(self, x):
        y_pred = self.linear(x)
        return y_pred

model = LinearRegressionModel()
learning_rate = 0.01
l = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr =learning_rate)

Train the model:

In [85]:
num_epochs = 20000

for epoch in range(num_epochs):
     #forward feed
    y_pred = model(X_train.requires_grad_())

    #calculate the loss
    loss= l(y_pred, y_train)

    #backward propagation: calculate gradients
    loss.backward()

    #update the weights
    optimizer.step()

    #clear out the gradients from the last step loss.backward()
    optimizer.zero_grad()
    
    if epoch % 1000 == 0:
     print('epoch {}, loss {}'.format(epoch, loss.item()))

epoch 0, loss 9078.8662109375
epoch 1000, loss 2224.139404296875
epoch 2000, loss 1482.6478271484375
epoch 3000, loss 1100.35009765625
epoch 4000, loss 903.2459716796875
epoch 5000, loss 801.622314453125
epoch 6000, loss 749.2279663085938
epoch 7000, loss 722.2144775390625
epoch 8000, loss 708.2868041992188
epoch 9000, loss 701.106201171875
epoch 10000, loss 697.4038696289062
epoch 11000, loss 695.4951171875
epoch 12000, loss 694.510986328125
epoch 13000, loss 694.0035400390625
epoch 14000, loss 693.741943359375
epoch 15000, loss 693.6071166992188
epoch 16000, loss 693.5374755859375
epoch 17000, loss 693.501708984375
epoch 18000, loss 693.483154296875
epoch 19000, loss 693.4736938476562


In [86]:
model.forward(X_train).detach().numpy() #make prediction

array([[457.44882 ],
       [325.22952 ],
       [155.64798 ],
       ...,
       [ 33.368767],
       [ 35.577686],
       [ 29.992285]], dtype=float32)