#### Welcome to the example notebook for FIFEforSpark!

You may recognize the following example from FIFE's notebook, found [here](https://github.com/IDA-HumanCapital/fife/blob/master/examples/country_leadership.ipynb). Like that notebook, we use the July 2020 edition of the Rulers, Elections, and Irregular Governance (REIGN) dataset, a monthly panel of national leaders and political conditions since January 1950. We load the REIGN data directly from its online archive.

First, we import the necessary packages. In addition to several fifeforspark modules, we import SparkFiles, which is required to read the data in from the URL.

In [0]:
import pyspark
from pyspark import SparkFiles
import fifeforspark
from fifeforspark.utils import create_example_data2
from fifeforspark.processors import PanelDataProcessor
from fifeforspark.lgb_modelers import LGBSurvivalModeler

Now that the necessary packages are loaded, we read in the data from a URL:

In [0]:
url = "https://www.dl.dropboxusercontent.com/s/3tdswu2jfgwp4xw/REIGN_2020_7.csv?dl=0"
spark.sparkContext.addFile(url)

df = spark.read.csv("file://" + SparkFiles.get("REIGN_2020_7.csv"), header=True, inferSchema=True)

The data is stored in a Spark DataFrame, which may differ from what you expect if you are more familiar with FIFE, where data is held in a pandas DataFrame. Let's examine our data a bit more.

In [0]:
df.show()

This isn't very pleasant to look at; however, the advantage of using a Spark DataFrame here (even though this could fit on one node) is that it's distributed. Let's see how many partitions we have.

In [0]:
df.rdd.getNumPartitions()

Wow! The data is split across 8 partitions! We can always change that; however, for this example, we will leave it as 8.

Next, we make some changes to the data to prepare it for analysis:

In [0]:
from pyspark.sql.functions import lit, lpad, col, concat, date_format
from pyspark.sql.types import DateType


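# Individual identifier: "<country>:<leader>"
df = df.withColumn('country-leader', concat(col('country'), lit(":"), col('leader')))
# Time identifier: "<year>-<zero-padded month>-01", cast to a date below
df = df.withColumn(
    'year-month',
    concat(
        col('year').cast('integer').cast('string'),
        lit("-"),
        lpad(col('month').cast('integer').cast('string'), 2, "0"),
        lit("-"),
        lit("01"),
    ),
)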

df = df.withColumn('year-month', df['year-month'].cast(DateType()))

cols = ['country-leader', 'year-month'] + [x for x in df.columns if x not in ["ccode", "country-leader", "leader", "year-month"]]
df = df.select(cols)
total_obs = df.count()
df = df.drop_duplicates(subset=["country-leader", "year-month"])
n_duplicates = total_obs - df.count()
print(f"{n_duplicates} observations with a duplicated identifier pair deleted.")
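
To make the preparation step easier to follow without a Spark session, here is a minimal plain-Python sketch of the same logic: building the two identifiers and counting rows that share an identifier pair. The toy rows and the `make_keys` helper are hypothetical and only for illustration; they are not part of fifeforspark or the REIGN data.

```python
from datetime import date

# Hypothetical toy rows standing in for a few records (not real REIGN data).
rows = [
    {"country": "Country A", "leader": "Leader X", "year": 1950.0, "month": 1.0},
    {"country": "Country A", "leader": "Leader X", "year": 1950.0, "month": 2.0},
    {"country": "Country A", "leader": "Leader X", "year": 1950.0, "month": 2.0},  # repeated pair
    {"country": "Country B", "leader": "Leader Y", "year": 1950.0, "month": 11.0},
]


def make_keys(row):
    """Build the (country-leader, year-month) identifier pair for one row."""
    individual = f"{row['country']}:{row['leader']}"
    # Mirrors cast('integer') followed by lpad(..., 2, "0"): 2.0 becomes "02".
    year_month = f"{int(row['year'])}-{int(row['month']):02d}-01"
    return individual, date.fromisoformat(year_month)


seen = set()
deduplicated = []
for row in rows:
    keys = make_keys(row)
    # This sketch keeps the first row per pair; Spark makes no such ordering promise.
    if keys not in seen:
        seen.add(keys)
        deduplicated.append(keys)

n_duplicates = len(rows) - len(deduplicated)
print(f"{n_duplicates} observations with a duplicated identifier pair deleted.")
```

The zero-padding matters because the string is later cast to a date: an unpadded month would not form a standard `yyyy-MM-dd` value.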

Now that we have created unique identifiers for the individual and the time period, we pass the data through the PanelDataProcessor, specifying a value of 4 for 'TEST_INTERVALS' because we want to test on the last 4 periods. For the time being, we transform the time_id back to a numeric feature given constraints regarding datetime functionality.

In [0]:
test_intervals = 4
processor = PanelDataProcessor(data=df, config={'TEST_INTERVALS': test_intervals}, shuffle_parts=20)
processor.build_processed_data()
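
As a rough illustration of what testing on the last 4 periods means, the plain-Python sketch below splits a toy list of monthly periods into training and test periods. It assumes that the most recent `test_intervals` periods are the ones withheld; the processor's internal logic is not shown in this notebook and may differ in detail, and the toy periods are hypothetical.

```python
from datetime import date

# Hypothetical monthly periods observed in a toy panel (January to July 2020).
periods = [date(2020, month, 1) for month in range(1, 8)]
test_intervals = 4

# Assumption: the most recent `test_intervals` periods are withheld for testing.
sorted_periods = sorted(set(periods))
test_periods = set(sorted_periods[-test_intervals:])

train = [p for p in sorted_periods if p not in test_periods]
test = [p for p in sorted_periods if p in test_periods]

print("train:", [p.isoformat() for p in train])
print("test: ", [p.isoformat() for p in test])
```

Splitting by time rather than at random keeps the evaluation honest: the model is scored on periods that come after everything it was trained on.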

Now we build the model. You can also pass parameters to the modeler that will be passed on to LightGBM.

In [0]:
modeler = LGBSurvivalModeler(data=processor.data)
modeler.build_model(n_intervals=test_intervals)

Now we want to see how well our model performs on the test data, so we evaluate it and take a look at the resulting metrics.

In [0]:
metrics = modeler.evaluate()

In [0]:
metrics
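
The contents of `metrics` are not reproduced here. Assuming the evaluation reports an area under the ROC curve (AUROC) for each forecast horizon, as FIFE's evaluation does, the following plain-Python sketch shows what that number measures. The `auroc` helper and the toy values are hypothetical and are not part of fifeforspark.

```python
def auroc(actuals, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for a, s in zip(actuals, scores) if a]
    negatives = [s for a, s in zip(actuals, scores) if not a]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (True = survived the horizon) and predicted survival probabilities.
survived = [True, True, False, True, False]
predicted = [0.9, 0.5, 0.6, 0.8, 0.2]

score = auroc(survived, predicted)
print(f"AUROC: {score:.3f}")
```

A value of 0.5 corresponds to ranking survivors and non-survivors no better than chance, while 1.0 means every survivor received a higher predicted probability than every non-survivor.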

And finally, we forecast future survival probabilities for the country-leaders observed in the last period.

In [0]:
forecasts = modeler.forecast()
forecasts
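
To help read the forecast output, here is a minimal sketch of how survival probabilities over several horizons relate to one another. It assumes, following FIFE's description of its survival modelers, that each horizon has its own conditional survival estimate and that these are multiplied cumulatively; the numbers below are hypothetical and not taken from the model.

```python
from itertools import accumulate
from operator import mul

# Hypothetical probabilities of surviving each successive period,
# each conditional on having survived all earlier periods.
conditional_survival = [0.98, 0.97, 0.99, 0.95]

# The probability of surviving through horizon k is the product of the first k terms.
cumulative_survival = list(accumulate(conditional_survival, mul))

for horizon, probability in enumerate(cumulative_survival, start=1):
    print(f"{horizon}-period survival probability: {probability:.4f}")
```

Because every factor is at most 1, the cumulative probabilities can never increase with the horizon, which is a quick sanity check to apply to any row of forecasts.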