This code is based on Python 2.7.

To keep things simple, this version does not read the file in chunks: the whole file is processed in a single step. A chunked version can easily be derived from the other exercises.

For speed, I used an uncompressed data file but, if necessary, the "pd.read_csv" function can also read compressed files by adding the option "compression='bz2'".
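
As a rough sketch of how the chunked, compressed variant mentioned above could look: the sample data, the helper name "top_airports_chunked" and the chunk size are assumptions made for illustration, not part of the original exercise. Each chunk is aggregated on its own and the partial sums are merged, so the whole file never has to fit in memory.

```python
import bz2
import os
import tempfile

import pandas as pd

# Hypothetical sample in the same '^'-separated layout as bookings.csv
sample = "arr_port^pax\nLHR^2\nMAD^1\nLHR^3\nCDG^1\nMAD^-1\n"

# Write the sample as a bz2-compressed file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "bookings_sample.csv.bz2")
with bz2.BZ2File(path, "wb") as handle:
    handle.write(sample.encode("ascii"))


def top_airports_chunked(path, top_number, chunk_size=2):
    """Sum 'pax' per 'arr_port' chunk by chunk, then keep the largest totals."""
    totals = None
    reader = pd.read_csv(path, sep="^", usecols=["arr_port", "pax"],
                         compression="bz2", chunksize=chunk_size)
    for chunk in reader:
        partial = chunk.groupby("arr_port")["pax"].sum()
        # Merge the partial sums; airports missing on one side count as 0
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    if totals is None:
        # No data rows at all: return an empty result
        return pd.Series(dtype="float64")
    return totals.sort_values(ascending=False).head(top_number)


top = top_airports_chunked(path, 2)
print(top)
```

A small chunk size is used here only so that the tiny sample spans several chunks; a real bookings file would use a much larger value.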

This code uses GeoBase to retrieve airport countries.

To test the web service, open
   http://localhost:8080/top/n
where n is the number of airports requested. It returns a JSON document with the top n airports.


In [1]:
import web # Import the library web to create the web service
import pandas as pd # import the library pandas
import json # import the library json for JSON files
from GeoBases import GeoBase # import GeoBase to retrieve airport countries

# Column names and filenames
filenameBooking="bookings.csv" # name of the bookings file
usedColumns = ['arr_port','pax'] # columns used to process the file

The next cell loads the GeoBase information on airports.

It also defines the lookup function, which returns the default value "UNKNOWN" for airport codes that GeoBase does not recognize.

In [2]:
geo_o = GeoBase(data='ori_por', verbose=False) # load the GeoBase data

# Create a function that looks up an airport code in GeoBase
# If the airport code is unknown, it returns the default value "UNKNOWN"
# I use the strip function to remove whitespace in airport codes (otherwise, no airport code is recognized)
# Note: the field requested here is 'city_name_ascii', so the value returned is the city name;
# request a country field such as 'country_code' to obtain the country instead
def strCountryName(x):
    return geo_o.get(x.strip(), 'city_name_ascii',default='UNKNOWN')
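
To illustrate why the strip call matters, here is a minimal sketch that uses a plain dictionary as a stand-in for GeoBase. The table "AIRPORT_TO_COUNTRY" and the helper "lookup_country" are hypothetical names invented for this example; only the strip-then-default pattern is taken from the cell above.

```python
# Hypothetical stand-in for the GeoBase data: airport code -> country code
AIRPORT_TO_COUNTRY = {"LHR": "GB", "CDG": "FR"}


def lookup_country(code, table=AIRPORT_TO_COUNTRY, default="UNKNOWN"):
    """Strip the padding around the code, then fall back to a default."""
    return table.get(code.strip(), default)


padded = "LHR     "  # codes padded with spaces, as assumed for the bookings file

print(AIRPORT_TO_COUNTRY.get(padded, "UNKNOWN"))  # lookup without strip
print(lookup_country(padded))                     # lookup with strip
print(lookup_country("XXX"))                      # code absent from the table
```

Without strip, the padded key never matches, so every airport would be reported as unknown.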

In the next cell, I define the function that produces the top ranking and writes the JSON file.

In [3]:
# Define the function that writes the JSON file
def writeJsonFile(topNumber,filenameJSON):
    # Read the CSV file, keeping only the columns 'arr_port' and 'pax'
    # Note: nrows=1000 limits the read to the first 1000 rows; remove it to process the whole file
    df = pd.read_csv(filenameBooking,sep='^', usecols=usedColumns, nrows=1000)
    # Sum 'pax' per arrival airport, sort in decreasing order and keep the first topNumber rows
    dfTop=df['pax'].groupby(df['arr_port']).sum().reset_index().sort_values(by='pax', ascending=False)[:topNumber]

    # Rank from 1 to the number of rows kept (there may be fewer than topNumber airports)
    dfTop['Rank']=range(1,len(dfTop)+1)
    dfTop = dfTop.reindex(columns=['Rank','arr_port','pax'])
    dfTopJSON=dfTop.rename(columns={'arr_port': 'Airport','pax': 'Number of bookings'})

    # I "map" the function 'strCountryName' to each value of the column 'Airport' to create a new column named 'Country'
    dfTopJSON['Country']=dfTopJSON['Airport'].map(strCountryName)

    # Write the data to a JSON file (to_json returns None when a path is given)
    dfTopJSON.to_json(path_or_buf=filenameJSON, orient='columns')
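
The ranking steps can be tried on a small in-memory table, which also shows what the two JSON layouts look like. This is only a sketch: the sample bookings and the helper name "top_airports" are assumptions, and the 'records' layout is shown for comparison rather than being what the function above writes.

```python
import json

import pandas as pd

# Hypothetical sample bookings (the real data comes from bookings.csv)
bookings = pd.DataFrame({
    "arr_port": ["LHR", "MAD", "LHR", "CDG"],
    "pax": [2, 1, 3, 4],
})


def top_airports(df, top_number):
    """Sum 'pax' per airport, keep the largest totals and add a rank."""
    totals = df.groupby("arr_port", as_index=False)["pax"].sum()
    top = totals.sort_values(by="pax", ascending=False).head(top_number).copy()
    # The rank follows the rows actually kept, which may be fewer than top_number
    top.insert(0, "Rank", list(range(1, len(top) + 1)))
    return top.rename(columns={"arr_port": "Airport",
                               "pax": "Number of bookings"})


ranking = top_airports(bookings, 10)  # asks for 10, only 3 airports exist

# orient='columns' nests the values under each column name, keyed by row label
columns_view = json.loads(ranking.to_json(orient="columns"))
# orient='records' produces one object per row
records_view = json.loads(ranking.to_json(orient="records"))

print(json.dumps(columns_view, indent=2))
print(json.dumps(records_view, indent=2))
```

A client that iterates over airports in rank order may find the 'records' layout easier to consume, but that is a design choice outside the original exercise.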

The following cell defines the web service.

Warning: this code starts a web server on localhost, which Jupyter also uses, so the two may conflict. The best way to test the code is to run it from a .py file, which is available in the GitHub repository.

In [4]:
# Define the URL mapping
urls = (
    '/top/(.*)', 'get_top'
)

app = web.application(urls, globals())

# 'number' is the number of top airports to return
class get_top:
    def GET(self, number):
        # Cast to float, then to integer, to accept values such as 2.3
        try:
            topNumber=int(float(number))
        except (ValueError, OverflowError):
            return "Requested number of top airports is not a valid number! Please retry."
        # If the value is negative or zero, return a message to the user
        if topNumber <= 0:
            return "Requested number of top airports is negative or zero! Please retry."
        filenameJSON = "top"+str(topNumber)+".json" # name of the JSON file
        writeJsonFile(topNumber,filenameJSON) # write the JSON file
        web.header('Content-Type', 'application/json') # Specify the content type of the returned data
        with open(filenameJSON) as json_data: # Open the JSON file and close it once read
            data = json.load(json_data) # Read the data inside the JSON file
        return json.dumps(data) # Return the JSON content
        
# Launch the webservice
if __name__ == "__main__":
    app.run()    

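Run inside Jupyter, the cell fails because web.py reads the kernel's command-line arguments (here "-f") as the address to listen on:

ValueError: -f is not a valid IP address/port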
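
The handling of the "number" part of the URL can be checked without starting the server. The helper below is a hypothetical sketch (the name "parse_top_number" is not in the original code) of the same cast-to-float-then-int idea, extended to reject values that cannot be converted.

```python
def parse_top_number(raw):
    """Return a positive integer parsed from the URL segment, or None."""
    try:
        # float() first so that "2.3" is accepted and truncated to 2
        value = int(float(raw))
    except (TypeError, ValueError, OverflowError):
        # Not a number at all, or a value such as "nan" or "inf"
        return None
    return value if value > 0 else None


for segment in ["3", "2.3", "0", "-1", "abc", "inf"]:
    print(repr(segment), parse_top_number(segment))
```

Returning None for every rejected value leaves it to the caller to choose the message sent back to the user.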