## Esther Kundin - The Bloomberg presentation

https://databricks.com/session/integrating-existing-c-libraries-into-pyspark

## From the SWIG tutorial page

http://www.swig.org/tutorial.html

## Efficiency

### Challenges
- UDFs run once per row
- Every object passed from the driver to the workers inside the UDF must be picklable, but C/C++ modules cannot be pickled
- Most interfaces can't be pickled
- If an object can't be pickled, it has to be created on the executor, row by row
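
The pickling limitation above can be seen with a few lines of plain Python. This is only a sketch: it uses the standard library `pickle`, with `math` standing in for a compiled C/C++ extension module, and the helper name `can_pickle` is hypothetical. PySpark serializes closures with its own bundled cloudpickle, so the exact behaviour on a cluster may differ.

```python
import math    # stand-in for a compiled extension module such as a SWIG wrapper
import pickle


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# A module object is rejected by the standard pickler.
module_ok = can_pickle(math)

# A plain value pickles fine and can be shipped from the driver to the workers.
value_ok = can_pickle(7)

print(module_ok, value_ok)  # False True
```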

### Solutions
- Do not keep state in your C++ objects
- Spark 2.3 - Use Apache Arrow on vectorised UDFs (good for Pandas, but won't help us with C/C++)
- Use Python Singletons for state
- Use df.rdd.mapPartitions() so that setup runs once per partition instead of once per row


## spark-submit directives

| Directive Passed | Set To | Purpose |
| ---------------------------------------------- | -------------- | ------------------------------------- |
| spark.executor.extraLibraryPath | Append the path where the .so is deployed | Ensures the C++ library is loadable on the executors |
| spark.driver.extraLibraryPath | Append the path where the .so is deployed | Ensures the C++ library is loadable on the driver |
| --archives | .zip or .tgz file that has the .so and config files | Distributes the file to all worker locations |
| --py-files | .py file that has the UDF code | Distributes the UDF to the workers. Alternatively, include the UDF directly in the .py file passed to spark-submit |
| spark.executorEnv.ENVIRONMENT_VARIABLE | Env. var value | If the UDF needs environment variables |
| spark.yarn.appMasterEnv.ENVIRONMENT_VARIABLE | Env. var value | If the driver code needs environment variables |
    
## Example spark-submit

$ spark2-submit --master yarn --deploy-mode cluster \  
--conf "spark.executor.extraLibraryPath=<path>:myfolder" \  
--conf "spark.driver.extraLibraryPath=<path>:myfolder" \   
--archives myfolder.zip#myfolder \  
--conf "spark.executorEnv.MY_ENV=my_environment_variable" \  
--conf "spark.yarn.appMasterEnv.MY_DRIVER_ENV=my_driverenvironment_variable" \  
the_pyspark_program.py &lt; add file params here &gt;


## Using the mapPartitions example

The module is imported once per partition and then applied row by row to the dataframe rows in that partition.

### The Partitioner class

import sys

class Partitioner:
    def __init__(self):
        self.callPerDriverSetup()                   # runs once, on the driver

    def callPerDriverSetup(self):
        pass

    def callPerPartitionSetup(self):
        sys.path.append('example')                  ### <=== Either append or use the --py-files parameter
        import example
        self.example=example
        
    def doProcess(self, element):
        return self.example.my_mod(element.wire, 7) ### <==== here's the call to the library
    
    def processPartition(self, partition):
        self.callPerPartitionSetup()
        for element in partition:
            yield self.doProcess(element)
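
The once-per-partition behaviour can be checked locally without a cluster. The sketch below is an assumption-laden stand-in, not the notebook's code: `FakeExample` replaces the SWIG-wrapped `example` module with a pure-Python `my_mod`, `Row` is a hypothetical namedtuple with a `wire` field, and plain lists play the role of partitions.

```python
from collections import namedtuple

Row = namedtuple("Row", ["wire"])   # hypothetical stand-in for a Spark Row with a `wire` column


class FakeExample:
    """Pure-Python stand-in for the SWIG-wrapped `example` module."""

    @staticmethod
    def my_mod(x, y):
        return x % y


class LocalPartitioner:
    def __init__(self):
        self.setup_calls = 0            # counts how often the setup step runs

    def callPerPartitionSetup(self):
        self.setup_calls += 1
        self.example = FakeExample      # the real class would import the library here

    def doProcess(self, element):
        return self.example.my_mod(element.wire, 7)

    def processPartition(self, partition):
        self.callPerPartitionSetup()    # once per partition, not once per row
        for element in partition:
            yield self.doProcess(element)


partitions = [[Row(7), Row(8)], [Row(20), Row(21), Row(22)]]
p = LocalPartitioner()
out = [list(p.processPartition(part)) for part in partitions]
print(out, p.setup_calls)  # [[0, 1], [6, 0, 1]] 2
```

Five rows are processed with only two setup calls. On a real cluster each task works on its own deserialized copy of the object, so a counter like `setup_calls` would not be visible on the driver; it is used here only to make the call pattern observable.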

### The UDF

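import sys
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType as iType   # assumption: iType is an alias for IntegerType

def calculateMod7(val):
    sys.path.append('example')   # runs on every call, i.e. once per row
    import example
    return example.my_mod(val, 7)

def main():
    calcmod7 = udf(calculateMod7, iType())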

### How we set up the session and call the class and module once per partition

from pyspark.sql import SparkSession, HiveContext

def main():
    # ====================================================
    # Set up hive and spark contexts
    # ====================================================
    
    sc = SparkSession.builder.enableHiveSupport().getOrCreate()
    

    sCtx=sc.sparkContext
    sc.sql("use fits_investigation")
    hiveContext = HiveContext(sCtx)   # HiveContext expects the SparkContext; sc here is a SparkSession
    
    # ====================================================
    # now we get the dataframe
    #  ====================================================
    
    sqlStmt = ("""
        <create the dataframe from the source files>
    """)
    
    try:
        sofiaDF=sc.sql(sqlStmt)

    except Exception as e:
        errMsg = "Creating the data input dataframe has failed - {}".format(e)
        raise DataframeError(errMsg)   # DataframeError: user-defined exception, not shown here
        
    p = Partitioner()
    
    rddout = sofiaDF.rdd.mapPartitions(p.processPartition)
    
    # ...
    # ...
    # additional code ...
    # ...
    # ...

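import time as tm   # assumption: tm is the standard time module

if __name__ == '__main__':
    timer=Timer()   # Timer: user-defined helper, not shown here

    start_elapsed=tm.time()
    start_cpu=tm.process_time()   # time.clock() was removed in Python 3.8; process_time() needs Python 3.3+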
    
    timer.start()
    
    main()
    
    timer.stop()
    

