# Raw Data Processing

## 1. Web crawler and flux site HTML data preprocessing
For the 2017 version of this paper, the source of site information must be reliable, so only two websites were selected for crawling: the new Fluxnet site and the old Fluxnet ORNL site. Because the web pages of the two sites differ considerably, we used Scrapy to download all the pages to local disk and BeautifulSoup to extract information from them.

### 1.1 Web crawler (Scrapy) deployment

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class FluxSpider(scrapy.Spider):
    name = "fluxnet"
    allowed_domains = ["fluxnet.ornl.gov", "fluxnet.fluxdata.org"]  # domain names only, no trailing slash
    start_urls = [
        "https://fluxnet.ornl.gov/site_list/sitename/-",
        "http://fluxnet.fluxdata.org/sites/site-list-and-pages/"
                ]

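    def parse(self, response):
        # Name the file after the last non-empty URL segment; without the
        # rstrip, a URL ending in "/" would give an empty file name.
        filename = response.url.rstrip("/").split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)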

#Initiate the Spider in Python Console
settings = get_project_settings()
process = CrawlerProcess(settings=settings)
process.crawl(FluxSpider)
process.start()
#Crawl results are regrouped into FluxNet_Old_ORNL and FLuxnet_2015Datasets
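
Deriving a local file name from a URL needs some care when the URL ends in a slash, because the last path segment is then empty. The following is a minimal sketch of one way to handle this, assuming Python 3. The helper `url_to_filename` and its `default` and `suffix` parameters are hypothetical names for illustration, not part of Scrapy or of the original notebook.

```python
from urllib.parse import urlparse


def url_to_filename(url, default="index", suffix=".html"):
    """Build a file name from the last non-empty path segment of a URL."""
    path = urlparse(url).path
    # Drop empty segments so a trailing "/" does not yield an empty name
    segments = [segment for segment in path.split("/") if segment]
    stem = segments[-1] if segments else default
    return stem + suffix


print(url_to_filename("http://fluxnet.fluxdata.org/sites/site-list-and-pages/"))
print(url_to_filename("https://fluxnet.ornl.gov/site_list/sitename/-"))
```

Appending a fixed suffix is a design choice made here only so that the saved pages are easy to recognise on disk; the file lists used later in this notebook would have to use the same names.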

### 1.2 Data extraction and cleaning using BeautifulSoup
BeautifulSoup is quite efficient at extracting information from tabular web pages. We inspected the source code of one page to find the pattern, then applied this pattern in the extraction code, which automatically delivers the values into the right fields. The pages come from two different sites, so we extracted them separately and then combined them. A few 'br/' tags remain in the Investigators field, so we define a function to delete them. Let's see what the table looks like.

In [15]:
from bs4 import BeautifulSoup
import pandas as pd

def Delbr(List):
    #Return a new list without '<br/>' entries; removing items from the
    #list while iterating over it would skip elements
    return [line for line in List if line != '<br/>']

#Define the positions of webpages
FluxORNLPos = 'D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/SiteInfo/FluxNet_Old_ORNL/'
FluxPos = 'D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/SiteInfo/FLuxnet_2015Datasets/'

#Old site
Fileinput = open('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/SiteInfo/Oldfluxnet.txt')
htmllist = Fileinput.read()
htmltable = []
#Parser for extraction
for html in htmllist.split('\n')[:-1]:
    htmlmarker = open(FluxORNLPos + html)
    soup = BeautifulSoup(htmlmarker,'lxml')
    tables = soup.find_all('table') #Locate all tables
    Fluxinfo = tables[0].find_all('td')#Extract each table
    Locinfo = tables[1].find_all('td')
    Investinfo = tables[2].find_all('td')
    temp = {}
    #First td tab processing
    temp.update({'SiteName':Fluxinfo[1].contents})
    temp.update({'Description':Fluxinfo[3].contents})
    temp.update({'Status':Fluxinfo[5].contents})
    temp.update({'Code':Fluxinfo[7].contents})
    #Second
    temp.update({'Country':Locinfo[1].contents})
    temp.update({'Coordinates':Locinfo[3].contents})
    #Third
    Investors = []
    for i in range(len(Investinfo)):
        if i % 2 != 0:
            Investors.append(str(Investinfo[i]).replace('<br/>','|')[4:-5])
    temp.update({'Investigators':Investors})
    htmltable.append(temp)
Fileinput.close()

#New Site
Fileinput = open('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/SiteInfo/Fluxnet.txt')
htmllist = Fileinput.read()
Newhtmltable = []
#Parser for extraction
for html in htmllist.split('\n')[:-1]:
    htmlmarker = open(FluxPos+ html)
    soup = BeautifulSoup(htmlmarker,'lxml')
    tables = soup.find_all('table') # Locate all tables
    Fluxinfo = tables[0].find_all('td')# Only one table left in New Files 
    temp = {}
    temp.update({'SiteName':Fluxinfo[4].contents})
    temp.update({'Code':Fluxinfo[2].contents})
    temp.update({'Coordinates':(Fluxinfo[8].contents,Fluxinfo[10].contents)})
    Investors = []
    for Inves in Fluxinfo[7].contents[0].split('">'):
        Investors.append(Inves)
    temp.update({'Investigators':Investors})
    Newhtmltable.append(temp)
Fileinput.close()

pd.DataFrame.from_dict(htmltable)

Unnamed: 0,Code,Coordinates,Country,Description,Investigators,SiteName,Status
0,[US-Dix],"[39.97122889, -74.43455028]","[New Jersey, United States]",[The Fort Dix site is located in the upland fo...,[Kenneth Clark|kennethclark@fs.fed.us||USDA Fo...,[Fort Dix],"[Inactive, core measurements no longer being m..."
1,[US-Slt],"[39.91375444, -74.595985]","[New Jersey, United States]",[The Silas Little Experimental Forest site is ...,[Kenneth Clark|kennethclark@fs.fed.us||USDA Fo...,[Silas Little Experimental Forest],"[Active, core measurements presently being made]"
2,[US-NMj],"[46.6465, -88.5194]","[Michigan, United States]",[The jack pine site is owned by Michigan Techn...,[Jiquan Chen|jqchen@msu.edu||Michigan State Un...,[Northern Michigan Jack Pine Stand],"[Inactive, core measurements no longer being m..."
3,[US-Syv],"[46.242017, -89.34765]","[Michigan, United States]",[The Sylvania Wilderness Area is a 8500 ha old...,"[Ankur Desai<a href=""http://flux.aos.wisc.edu/...",[Sylvania Wilderness Area],"[Inactive, core measurements no longer being m..."
4,[US-NPn],"[42.31288889, -106.5571111]","[Wyoming, United States]",[Subalpine/alpine],"[William Smith<a href=""http://www.wfu.edu/~smi...",[Northern Plains Site],"[Inactive, core measurements no longer being m..."
5,[US-Oho],"[41.55454, -83.84376]","[Ohio, United States]",[The Ohio Oak Openings site is located within ...,[Jiquan Chen|jqchen@msu.edu||Michigan State Un...,[Oak Openings],"[Active, core measurements presently being made]"
6,[US-MRf],"[44.64649416, -123.551483]","[Oregon, United States]","[The Marys River Fir site is part of the ""Synt...","[Beverly Law<a href=""http://terraweb.forestry....",[Marys River (Fir) site],"[Active, core measurements presently being made]"
7,[US-PFa],"[45.94587778, -90.27230417]","[Wisconsin, United States]",[The 447 m tall WLEF-TV television tower is lo...,[Arlyn Andrews|arlyn.andrews@noaa.gov||NOAA ES...,[Park Falls],"[Active, core measurements presently being made]"
8,[US-Pon],"[36.76666667, -97.13333333]","[Oklahoma, United States]",[The Ponca Winter Wheat site is a 65 ha rainfe...,"[Shashi Verma<a href=""http://snr.unl.edu/about...",[Ponca City],"[Inactive, core measurements no longer being m..."
9,[US-Upa],"[70.28147222, -148.8848333]","[Alaska, United States]",[Arctic tundra:tossock tundra],"[Walter Oechel<a href=""http://www.sci.sdsu.edu...",[Upad],"[Inactive, core measurements no longer being m..."
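
The Investigators column above is produced by plain string operations on each table cell: line breaks become '|' separators and the surrounding td tags are sliced off. The sketch below illustrates that step on its own, assuming Python 3. The sample cell is made up for illustration, and the function names `clean_investigator_cell` and `investigator_name` are hypothetical.

```python
def clean_investigator_cell(cell_html):
    """Turn one '<td>...</td>' string into a '|'-separated text field."""
    # Replace line breaks with the field separator used in the notebook
    text = cell_html.replace('<br/>', '|')
    # Drop the leading '<td>' (4 characters) and the trailing '</td>' (5 characters)
    return text[4:-5]


def investigator_name(field):
    """Keep only the name: the text before the first '<a' or '|'."""
    return field.split('<a')[0].split('|')[0]


# Made-up sample cell, not taken from the crawled pages
sample = '<td>Jane Doe<br/>jane.doe@example.org<br/><br/>Example University</td>'
field = clean_investigator_cell(sample)
print(field)
print(investigator_name(field))
```

The fixed slice only works when the cell starts with a bare `<td>` tag; a cell with attributes on the tag would need a proper parser instead.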


Now we are almost there. The next job is to extract the investigators and their respective sites, bring them together, and save this information in a list. As this list is still a Site -> Investigators list, we transform it into an Investigator -> Sites list. In case some pages did not mention an investigator, we define a Name_Exam function to drop those entries.

In [22]:
def Name_Exam(name):
    #A name is valid unless it is empty or still contains a stray '</td>' tag
    return not (('</td>' in name) or name == '')

#Combine Investors and Sites
Site_Invest = []
for html in htmltable:
    for invest in html['Investigators']:
        Site_Invest.append([html['Code'][0],invest.split('<a')[0].split('|')[0]]) #Here <a is just the head of human page introduction

#Create Investor List and Remove Duplicate
InvestList = []
for Site in Site_Invest:
    Investor = Site[1]
    if Investor not in InvestList:
        if Name_Exam(Investor):
            InvestList.append(Investor)

InvestList = sorted(InvestList)

#Rebuild Investor >> Site List
Invest_Site = {}
for Site in Site_Invest:
    name = Site[1]
    if name in Invest_Site:  #dict.has_key() no longer exists in Python 3
        Invest_Site[name] += ',' + Site[0]
    else:
        Invest_Site.update({name:Site[0]})
    
Invest_Site

{'': u'SK-Ta1,SK-Ta2,CA-MA1,CA-MA2,CA-MA3',
 'A. Chris Oishi': u'US-Dk1,US-Dk2,US-Dk3',
 'Abad Chabbi': u'FR-Lus',
 'Abel Rodrigues': u'PT-Esp',
 'Achim Grelle': u'SE-Asa,SE-Fla,SE-Kno,SE-Sk1',
 'Adam Wolf': u'KZ-AL1,KZ-AL2,KZ-AL3,KZ-AL4,KZ-CW1,KZ-CW2,KZ-CW3,KZ-CW4,KZ-VL1,KZ-VL2,KZ-VL3,KZ-VL4,KZ-VL5,KZ-Wht',
 'Adrian Rocha': u'US-An1,US-An2,US-An3',
 'Aikaterini Trepekli': u'GR-Vcs',
 'Akira Miyata': u'JP-Mas,JP-Aka,JP-Ksa,JP-Onn,JP-Yaw',
 'Alan Barr': u'CA-SF2,CA-SF3,CA-SJ1,CA-SJ2,CA-SJ3',
 'Alan Knapp': u'US-Kon,US-Man,US-Ra1,US-Ra2',
 'Alana Oakins': u'US-ICt,US-ICh,US-ICs',
 'Albert Olioso': u'FR-Avi',
 'Alberto Trotta': u'IT-MsN',
 'Alejandro Castellanos': u'MX-Col',
 'Alejandro Cueva': u'MX-EMg',
 'Alessandra Lagomarsino': u'IT-Cng',
 'Alessandra Vinci': u'IT-CdD,IT-Mrs',
 'Alessandro Cescatti': u'IT-Cas,IT-La2,IT-Vig,IT-Isp,IT-SR2,IT-PT1,IT-SRo',
 'Alessandro Matese': u'IT-Lec,IT-Pia',
 'Alessandro Peressotti': u'IT-Be1,IT-Be2',
 'Alessandro Zaldei': u'IT-OVr,IT-OXm,IT-Lec,IT-Pi
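
The inversion from Site -> Investigator pairs to an Investigator -> Sites mapping can also be written with `collections.defaultdict`. This is a minimal sketch assuming Python 3; `invert_pairs` is a hypothetical name. Unlike the dictionary shown above, it filters out empty names while building the mapping.

```python
from collections import defaultdict


def invert_pairs(site_invest):
    """Map each investigator name to a comma-separated string of site codes."""
    invest_site = defaultdict(list)
    for site, name in site_invest:
        # Skip empty names and names that still contain a stray '</td>' tag
        if name and '</td>' not in name:
            invest_site[name].append(site)
    return {name: ','.join(sites) for name, sites in invest_site.items()}


pairs = [
    ['US-Dk1', 'A. Chris Oishi'],
    ['US-Dk2', 'A. Chris Oishi'],
    ['FR-Lus', 'Abad Chabbi'],
    ['SK-Ta1', ''],
]
print(invert_pairs(pairs))
```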

### 1.3 Save processed data to disk

In [23]:
Fileoutput = open('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/SiteInfo/InvestSite.txt','w')
for investor in InvestList:
    #write() works under both Python 2 and Python 3, unlike "print >> file"
    Fileoutput.write(investor + '|' + str(Invest_Site[investor]) + '\n')
Fileoutput.close()
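
The saved file holds one investigator per line in the form name|site,site,.... As a rough sketch of how such a file could be written and read back, assuming Python 3, the helpers below use hypothetical names (`write_invest_site`, `read_invest_site`) and a temporary directory instead of the project path.

```python
import os
import tempfile


def write_invest_site(path, invest_site):
    """Write 'name|site,site,...' lines, sorted by investigator name."""
    with open(path, 'w', encoding='utf-8') as handle:
        for name in sorted(invest_site):
            handle.write(name + '|' + invest_site[name] + '\n')


def read_invest_site(path):
    """Read the file back into a dict of name -> list of site codes."""
    result = {}
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            name, _, sites = line.rstrip('\n').partition('|')
            result[name] = sites.split(',')
    return result


sample = {'Abad Chabbi': 'FR-Lus', 'A. Chris Oishi': 'US-Dk1,US-Dk2,US-Dk3'}
path = os.path.join(tempfile.mkdtemp(), 'InvestSite.txt')
write_invest_site(path, sample)
loaded = read_invest_site(path)
print(loaded)
```

Using `with` closes the file even if an error occurs partway through the loop.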

## 2. Web of Science raw data processing
Web of Science full records downloaded directly from the WoS interface are an important resource for bibliometric analysis, but they come in the form of TAG + contents, which is not convenient for Python to handle. We'll transform them into a neater form, then save the new records to disk for further processing.

### 2.1 Load the data by Tags
The downloaded WoS full record data is encoded in UTF-8 and contains a byte order mark (\xef\xbb\xbf) at the head and TM symbols (\xe2\x84\xa2) in the text. We delete both. Because the original data is rather small, we load the whole Web of Science ISI file into memory directly. For a large dataset this should usually be done line by line, since reading all files into memory at once does not scale.

In [1]:
import pandas as pd

Fileinput = open('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/5654Records_Original.txt')

#Set Records as a list and each record as a dict to save attributes
Records = []
record = {}
CurrentTag = ''

for line in Fileinput.readlines():
    #Delete BOM Data in front, TM symbol in middle and \n in the last
    line = line.replace('\xef\xbb\xbf','').replace('\xe2\x84\xa2','').replace('\n','')
    #Extraction
    if line[:2] != '  ' and len(line) > 0:
        #A new tag starts: either a field tag or the end-of-record tag ER
        CurrentTag = line[:2]
        if CurrentTag != 'ER':
            record.update({CurrentTag:[line[3:]]})
        else:
            Records.append(record)
            record = {}
    elif line[:2] == '  ':
        #Continuation line: append to the field of the most recent tag
        record[CurrentTag].append(line[3:])
        
Fileinput.close()
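
For larger exports, the same tag logic can be applied line by line with a generator, so that only one record is held in memory at a time. The following is a sketch under stated assumptions: Python 3, text already decoded to `str`, and a made-up sample in the TAG + contents layout. `iter_wos_records` is a hypothetical name, and stopping at the `EF` end-of-file tag is an assumption about the export format.

```python
def iter_wos_records(lines):
    """Yield one dict per record from an iterable of tagged text lines."""
    record, tag = {}, None
    for raw in lines:
        # Strip the line ending, a leading BOM and any TM symbols
        line = raw.rstrip('\r\n').lstrip('\ufeff').replace('\u2122', '')
        if not line.strip():
            continue
        if line.startswith('  '):
            # Continuation line: belongs to the most recent tag
            if tag is not None:
                record[tag].append(line.strip())
            continue
        tag = line[:2]
        if tag == 'ER':      # end of record
            yield record
            record, tag = {}, None
        elif tag == 'EF':    # end of file (assumed)
            break
        else:
            record[tag] = [line[3:]]


# Made-up sample in the TAG + contents layout
sample = [
    '\ufeffPT J\n',
    'AU Rumsey, IC\n',
    '   Walker, JT\n',
    'TI Placeholder title\n',
    'ER\n',
    '\n',
    'PT J\n',
    'AU Yang, ZS\n',
    'ER\n',
    'EF\n',
]
records = list(iter_wos_records(sample))
print(records)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example one opened with `encoding='utf-8-sig'`, which also removes the BOM.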

### 2.2 New record structure
Now the raw data are stored in Records and can be processed in Python. All attributes are held as lists in memory. For single-value attributes there is only one value in the list, so we transform them back into plain strings. For multi-value attributes such as AF, AU and CR, the values are joined into one string separated by '|'.
Load the data into a 2D table (pandas.DataFrame) and see how they look.

In [2]:
for index,record in enumerate(Records):
    for attr in record:
        if isinstance(record[attr],list):
            #Join the values of each attribute with '|'
            Records[index][attr] = '|'.join(record[attr])
    
PRecords = pd.DataFrame.from_records(Records, index = range(len(Records)))
PRecords

Unnamed: 0,AB,AF,AR,AU,BE,BN,BP,BS,C1,CA,...,SU,TC,TI,U1,U2,UT,VL,VR,WC,Z9
0,"In this study, net surface radiation (R-n) was...","Mahalakshmi, D. V.|Paul, Arati|Dutta, D.|Ali, ...",,"Mahalakshmi, DV|Paul, A|Dutta, D|Ali, MM|Dadhw...",,,1,,"[Mahalakshmi, D. V.; Ali, M. M.; Dadhwal, V. K...",,...,,0,Estimation of net surface radiation using eddy...,1,1,WOS:000381162400001,33,1.0,Geochemistry & Geophysics,0
1,"To date, direct validation of city-wide emissi...","Vaughan, Adam R.|Lee, James D.|Misztal, Pawel ...",,"Vaughan, AR|Lee, JD|Misztal, PK|Metzger, S|Sha...",,,455,,"[Vaughan, Adam R.] Univ York, Dept Chem, York,...",,...,,3,Spatially resolved flux measurements of NOx fr...,8,10,WOS:000380099700022,189,,"Chemistry, Physical",3
2,Large variability in N2O emissions from manage...,"Grant, Robert F.|Neftel, Albrecht|Calanca, Pie...",,"Grant, RF|Neftel, A|Calanca, P",,,3549,,"[Grant, Robert F.] Univ Alberta, Dept Renewabl...",,...,,0,Ecological controls on N2O emission in surface...,11,12,WOS:000379427700003,13,,"Ecology; Geosciences, Multidisciplinary",0
3,"Conversions of natural ecosystems, e.g., from ...","Merten, Jennifer|Roell, Alexander|Guillaume, T...",5,"Merten, J|Roll, A|Guillaume, T|Meijide, A|Tari...",,,,,"[Merten, Jennifer; Dittrich, Christoph; Faust,...",,...,,2,Water scarcity and oil palm expansion: social ...,16,28,WOS:000380049100006,21,,Ecology; Environmental Studies,2
4,A scheme describing the process of stream-aqui...,"Zeng, Yujin|Xie, Zhenghui|Yu, Yan|Liu, Shuang|...",,"Zeng, YJ|Xie, ZH|Yu, Y|Liu, S|Wang, LY|Jia, BH...",,,2333,,"[Zeng, Yujin; Xie, Zhenghui; Liu, Shuang; Wang...",,...,,3,Ecohydrological effects of stream-aquifer wate...,10,15,WOS:000379419500013,20,,"Geosciences, Multidisciplinary; Water Resources",3
5,There have been few studies conducted on the c...,"Yang, Zesu|Zhang, Qiang|Hao, Xiaocui",6809749,"Yang, ZS|Zhang, Q|Hao, XC",,,,,"[Yang, Zesu] Chengdu Univ Informat Technol, Co...",,...,,0,Evapotranspiration Trend and Its Relationship ...,8,8,WOS:000379433600001,,,Meteorology & Atmospheric Sciences,0
6,The lifetime of nitrogen oxides (NOx) affects ...,"Romer, Paul S.|Duffey, Kaitlin C.|Wooldridge, ...",,"Romer, PS|Duffey, KC|Wooldridge, PJ|Allen, HM|...",,,7623,,"[Romer, Paul S.; Duffey, Kaitlin C.; Wooldridg...",,...,,2,The lifetime of nitrogen oxides in an isoprene...,16,26,WOS:000379417300009,16,,Meteorology & Atmospheric Sciences,2
7,"The emission, dispersion, and photochemistry o...","Su, Luping|Patton, Edward G.|de Arellano, Jord...",,"Su, LP|Patton, EG|de Arellano, JVG|Guenther, A...",,,7725,,"[Su, Luping; Mak, John E.] SUNY Stony Brook, S...",,...,,3,Understanding isoprene photooxidation using ob...,7,10,WOS:000379417300016,16,,Meteorology & Atmospheric Sciences,3
8,"We measured volatile organic compounds (VOCs),...","Rantala, Pekka|Jarvi, Leena|Taipale, Risto|Lau...",,"Rantala, P|Jarvi, L|Taipale, R|Laurila, TK|Pat...",,,7981,,"[Rantala, Pekka; Jarvi, Leena; Taipale, Risto;...",,...,,0,Anthropogenic and biogenic influence on VOC fl...,3,12,WOS:000379417300032,16,,Meteorology & Atmospheric Sciences,0
9,The dry component of total nitrogen and sulfur...,"Rumsey, Ian C.|Walker, John T.",,"Rumsey, IC|Walker, JT",,,2581,,"[Rumsey, Ian C.] Coll Charleston, Dept Phys & ...",,...,,0,Application of an online ion-chromatography-ba...,4,10,WOS:000379397100008,9,,Meteorology & Atmospheric Sciences,0
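
Because every attribute is stored as a '|'-joined string in the table, multi-value attributes have to be split again before per-author or per-reference analysis. A minimal sketch of the round trip, assuming Python 3 and that AF, AU and CR are the multi-value tags of interest; `flatten`, `expand` and `MULTI_VALUE_TAGS` are hypothetical names.

```python
MULTI_VALUE_TAGS = ('AF', 'AU', 'CR')  # assumption: the tags named in the text


def flatten(record, sep='|'):
    """Join every list-valued attribute into one separator-delimited string."""
    return {tag: sep.join(values) for tag, values in record.items()}


def expand(flat, multi_tags=MULTI_VALUE_TAGS, sep='|'):
    """Split multi-value attributes back into lists; keep the rest as strings."""
    return {
        tag: value.split(sep) if tag in multi_tags else value
        for tag, value in flat.items()
    }


record = {'AU': ['Rumsey, IC', 'Walker, JT'], 'VL': ['9']}
flat = flatten(record)
restored = expand(flat)
print(flat)
print(restored)
```

This only works if the separator never occurs inside a value, which is worth checking for free-text fields before relying on it.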


### 2.3 Data output
Output the whole dataframe to a single xlsx file with pandas.DataFrame.to_excel() and to a csv file with pandas.DataFrame.to_csv().

In [9]:
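import sys
#Python 2 only: make utf8 the default encoding; Python 3 needs no such workaround
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf8')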

writer = pd.ExcelWriter('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/Full_Record_WoS.xlsx')
PRecords.to_excel(writer, sheet_name='Sheet1')
writer.close()  #close() writes the file; ExcelWriter.save() was removed in pandas 2.0

In [11]:
PRecords.to_csv('D:/_Research/Project_Sharing_Data_FromLinux/ProjectRebuild_2017/Data/Full_Record_WoS.csv')
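
Where pandas is not available, a similar CSV can be produced with the standard library. This is a sketch of an alternative, not the method used above, assuming Python 3: `csv.DictWriter` writes records whose tag sets differ by filling missing tags with empty strings. `records_to_csv` is a hypothetical name, and the output goes to an in-memory buffer rather than the project path.

```python
import csv
import io


def records_to_csv(records, handle):
    """Write dict records to CSV, using the union of all tags as columns."""
    fieldnames = sorted({tag for record in records for tag in record})
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(records)
    return fieldnames


sample = [
    {'AU': 'Rumsey, IC|Walker, JT', 'VL': '9'},
    {'AU': 'Yang, ZS|Zhang, Q|Hao, XC', 'AR': '6809749'},
]
buffer = io.StringIO()
columns = records_to_csv(sample, buffer)
# Read the text back to check that commas inside values survive the round trip
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(columns)
print(rows)
```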