# 1. Insert project token, API key, and region

<img src="https://cp4d-outcomes.techzone.ibm.com/img/data-fabric-lab/trusted-ai/project_token_for_notebook.png" width=400 align=left>

Click the **three vertical dots** icon above and select **Insert project token** to provide this notebook API access to your project.

Paste the API key you created earlier in the lab into the cell below as the value for `API_KEY`.

The **LOCATION** value below will depend on where you provisioned your services. According to the [WML Client documentation](https://ibm-wml-api-pyclient.mybluemix.net/#authentication), valid values for **LOCATION** are:
* Dallas: https://us-south.ml.cloud.ibm.com
* London: https://eu-gb.ml.cloud.ibm.com
* Frankfurt: https://eu-de.ml.cloud.ibm.com
* Tokyo: https://jp-tok.ml.cloud.ibm.com

Run the cell above, and continue running cells individually until you reach step 2.

In [None]:
import os

API_KEY = 'xxxxxxxxxxxxxxxxxxx'
PROJECT_ID = os.environ['PROJECT_ID']
LOCATION = 'https://us-south.ml.cloud.ibm.com'
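
As a small illustration of the region list above, the sketch below maps each region name to its endpoint and rejects anything else before it reaches the client. The `WML_ENDPOINTS` dictionary and the `endpoint_for` helper are hypothetical names introduced for this example; only the four URLs come from the list above.

```python
# Hypothetical lookup table: the four URLs are the ones listed above.
WML_ENDPOINTS = {
    "dallas": "https://us-south.ml.cloud.ibm.com",
    "london": "https://eu-gb.ml.cloud.ibm.com",
    "frankfurt": "https://eu-de.ml.cloud.ibm.com",
    "tokyo": "https://jp-tok.ml.cloud.ibm.com",
}


def endpoint_for(region):
    """Return the endpoint URL for a region name, ignoring case and padding."""
    key = region.strip().lower()
    if key not in WML_ENDPOINTS:
        raise ValueError(
            f"Unknown region {region!r}; expected one of {sorted(WML_ENDPOINTS)}"
        )
    return WML_ENDPOINTS[key]


print(endpoint_for("Dallas"))
```

With a helper like this, `LOCATION = endpoint_for("Dallas")` would fail immediately on a mistyped region instead of failing later during authentication.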

In [None]:
if PROJECT_ID.startswith("p-"):
    raise Exception(
        "You have not correctly set the value for your PROJECT_ID. The value beginning with 'p-' "
        "is your project access token. Please copy the value of the project_id into the previous "
        "cell and re-run it."
    )
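
A stricter variant of this check could confirm that the value has the shape of a project ID rather than only ruling out the token prefix. The sketch below assumes that project IDs are UUID-formatted, which this notebook does not state, and `looks_like_project_id` is a hypothetical helper name. The sample ID is made up.

```python
import uuid


def looks_like_project_id(value):
    """Return True if value parses as a UUID (the assumed project ID format)."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


# Made-up values for illustration only.
print(looks_like_project_id("12345678-1234-5678-1234-567812345678"))
print(looks_like_project_id("p-not-a-project-id"))
```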

The first model you will create in this notebook uses the scikit-learn framework. The `sklearn` package is available by default in Watson Studio Python environments, and does not need to be installed.

In [None]:
import sklearn
sklearn.__version__

The next cell uses the API key and location variables defined above to authenticate with your Watson Machine Learning service. An error in this cell likely means that you do not have access to a WML service, or that the API key or location provided above is incorrect.

In [None]:
from ibm_watson_machine_learning import APIClient

wml_credentials = {
    "apikey": API_KEY,
    "url": LOCATION
}

wml_client = APIClient(wml_credentials)

## <span style="color:red">LIKELY ACTION REQUIRED: restart the kernel on error messages</span>

The cell below will install the IBM Factsheets service using the `pip` utility, then authenticate with the IBM Factsheet service using credentials you have already supplied and initialize Factsheet monitoring for this model.

**If you receive an error message from running the cell, you will need to restart the kernel and run all previous cells again**. Because the Python and Spark environments can provide different library levels, you may receive an error message when importing *ibm_aigov\_facts\_client* from the Factsheets library. Restarting the kernel typically fixes this issue, though in rare cases you may have to do it more than once. Click the **Kernel** menu item above and select **Restart**. Once the kernel has restarted, click the **Cell** item and select **Run All Above**. Once those cells have finished executing, run the cell below.

Note that Python notebooks in Watson Studio have full support for `pip install`, which allows you to add whatever libraries you need to the notebook environment. For example, if you wanted to use Python to parse command line arguments, you could run `!pip install argparse`.

In [None]:
!pip uninstall -y ibm-aigov-facts-client
!pip install --upgrade ibm-aigov-facts-client  --no-cache | tail -n 1

from ibm_aigov_facts_client import AIGovFactsClient
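
If the import keeps failing, it can help to confirm which version of the package the kernel actually sees. The following is a minimal sketch using the standard library's `importlib.metadata`; the `installed_version` helper is a name introduced here for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string after a successful install, or None otherwise.
print(installed_version("ibm-aigov-facts-client"))
```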

# <span style="color:red">2. !!--STOP--!! Insert data to code below</span>

Place your cursor in the empty code cell below. Then click the **Code snippets** icon in the upper right corner of the screen -- it looks like an HTML tag.

<img src="https://cp4d-outcomes.techzone.ibm.com/img/data-fabric-lab/ai-governance/find_and_add_data.png" width=400 align=left>

Click the **Read data** tile beneath the **Data Ingestion** header, then click the **Select data from project** button. Click **Data asset** from the **Categories** list, then select *modeling_records_2022.csv* from the asset list, then click **Select**.

<img src="https://cp4d-outcomes.techzone.ibm.com/img/data-fabric-lab/ai-governance/data_asset.png" width=400 align=left>

Use the **Load as** dropdown beneath to select **pandas DataFrame**, then click the **Insert code to cell** button. A code block is automatically inserted into the empty cell that will import your data into a dataframe. Like the `sklearn` package, `pandas` is automatically provided in Watson Studio Python environments.

## <span style="color:red">IMPORTANT: replace all instances of `df_data_x` with `df_data_1` in the code</span>

The automatically generated code will likely use the `df_data_3` variable to hold the data. Update the last two lines of code to import data into the `df_data_1` variable for the rest of the notebook to work correctly. The last lines of your cell should look like this:

<img src="https://cp4d-outcomes.techzone.ibm.com/img/data-fabric-lab/trusted-ai/dataframe_insert.png" width=300 align=left>

Run the inserted code cell below. If you have correctly imported the data, you will see a table populated with employee data. Continue running cells individually until you reach step 3.

The next cell splits the data into the feature columns and the label column, and then further splits it into a training data set and a testing data set. If this cell generates an error, it is likely because you have not imported the data into the `df_data_1` variable as described above. You will need to alter the previous cell to use `df_data_1` and then rerun it.

In [None]:
from sklearn.model_selection import train_test_split

X = df_data_1.drop(['ATTRITION'], axis=1)  # Features
y = df_data_1['ATTRITION']  # Labels

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15) # 85% training and 15% test
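
To make the 85/15 split concrete, here is a plain-Python sketch of the idea: shuffle the row indices, then set aside a fraction of them for testing. It is only an illustration of the concept, not scikit-learn's implementation, and `split_indices` is a hypothetical helper. The fixed `seed` plays the role that the `random_state` argument plays in `train_test_split`.

```python
import math
import random


def split_indices(n_rows, test_size=0.15, seed=42):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

Because the cell above does not pass `random_state`, each run produces a different split, so the accuracy reported later can vary slightly between runs.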

The next cell lists the feature columns that will be used to train the model.

In [None]:
X.columns.tolist()

The cell below tells the Watson Machine Learning client to save the models in the current project. If you receive an error here, it is likely because you did not correctly set your project ID at the beginning of the notebook.

In [None]:
wml_client.set.default_project(PROJECT_ID)

The following cell provides connection information for the model training data, which will be stored with the model and in FactSheets. You could use the Cloud Object Storage information for this particular project by changing the credentials to match those in the cell above where you inserted the data-loading code, but for simplicity's sake, you will use a pre-existing file.

In [None]:
training_data_references = [
                {
                    "id": "attrition",
                    "type": "container",
                    "connection": {},
                    "location": {
                        "path": "modeling_records_2022.csv"
                    },

                    #"type": "s3",
                    #"connection": {
                    #    "access_key_id": "yqcPbWZ0AQPHleHVerrR4Wx5e9pymBdMgydbEra5zCif",
                    #    "endpoint_url": "https://s3.us.cloud-object-storage.appdomain.cloud",
                    #    "resource_instance_id": "crn:v1:bluemix:public:cloud-object-storage:global:a/7d8b3c34272c0980d973d3e40be9e9d2:2883ef10-23f1-4592-8582-2f2ef4973639::"
                    #},
                    #"location": {
                    #    "bucket": "faststartlab-donotdelete-pr-nhfd4jnhlxgpc7",
                    #    "path": "modeling_records_2022.csv"
                    #},
                    "schema": {
                        "id": "training_schema",
                        "fields": [
                            {"name": "POSITION_CODE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DEPARTMENT_CODE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DAYS_WITH_COMPANY", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "COMMUTE_TIME", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "AGE_BEGIN_PERIOD", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "GENDER_CODE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "PERIOD_TOTAL_DAYS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "STARTING_SALARY", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "ENDING_SALARY", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "NB_INCREASES", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "BONUS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "NB_BONUS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "VACATION_DAYS_TAKEN", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "SICK_DAYS_TAKEN", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "PROMOTIONS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "NB_MANAGERS", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DAYS_IN_POSITION", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DAYS_SINCE_LAST_RAISE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "RANKING_CODE", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "OVERTIME", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "DBLOVERTIME", "nullable": True, "metadata": {}, "type": "double"},
                            {"name": "TRAVEL", "nullable": True, "metadata": {}, "type": "double"}
                        ]
                    }
                }
            ]
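
The `fields` list above repeats the same four keys for every column, so it could also be generated from a list of column names. The sketch below assumes, as the hand-written list does, that every column is a nullable `double`; `build_schema_fields` is a hypothetical helper, not part of the WML client.

```python
def build_schema_fields(column_names, field_type="double"):
    """Build one schema entry per column, mirroring the hand-written list above."""
    return [
        {"name": name, "nullable": True, "metadata": {}, "type": field_type}
        for name in column_names
    ]


# Three of the column names from the schema above, for illustration.
example_fields = build_schema_fields(["POSITION_CODE", "DEPARTMENT_CODE", "TRAVEL"])
print(example_fields[0])
```

In this notebook, the column names could come from `X.columns.tolist()`, which keeps the schema in step with the data if columns are added or removed.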

The next three cells construct metadata for your model and connect to the Factsheets client. This metadata will be saved with the model itself and will appear on its Factsheet. If you get errors trying to save the model, they will most likely come from the metadata contained in the model properties, specifically the `TYPE` and `SOFTWARE_SPEC_UID`, which change frequently as Watson Studio adds support for new versions of Python and removes support for outdated versions. You can get a list of currently supported specifications by running `wml_client.software_specifications.list()`.

In [None]:
fields=X_train.columns.tolist()
metadata_dict = {'target_col' : 'ATTRITION', 'fields':fields}

In [None]:
PROJECT_UID = os.environ['PROJECT_ID']
CPD_URL=os.environ['RUNTIME_ENV_APSX_URL'][len('https://api.'):]
CONTAINER_ID=PROJECT_UID
CONTAINER_TYPE='project'
EXPERIMENT_NAME='predictive_attrition'

PROJECT_ACCESS_TOKEN=project.project_context.accessToken.replace('Bearer ','')

facts_client = AIGovFactsClient(api_key=API_KEY,experiment_name=EXPERIMENT_NAME,container_type=CONTAINER_TYPE,container_id=CONTAINER_ID,set_as_current_experiment=True)
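
The `CPD_URL` line above removes the `https://api.` prefix by slicing a fixed number of characters, which silently produces a wrong host if the environment variable ever has a different prefix. A more defensive version might look like the sketch below; `host_from_api_url` is a hypothetical helper and the URL passed to it is a placeholder, not a real endpoint.

```python
def host_from_api_url(api_url, prefix="https://api."):
    """Strip the API prefix from a URL, failing loudly if the prefix is absent."""
    if not api_url.startswith(prefix):
        raise ValueError(f"Expected URL to start with {prefix!r}, got {api_url!r}")
    return api_url[len(prefix):]


# Placeholder URL for illustration only.
print(host_from_api_url("https://api.example.test"))
```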

In [None]:
software_spec_uid = wml_client.software_specifications.get_id_by_name("runtime-22.2-py3.10")
print("Software Specification ID: {}".format(software_spec_uid))
model_props = {
    wml_client._models.ConfigurationMetaNames.NAME:"{}".format("attrition challenger - sklearn"),
    wml_client._models.ConfigurationMetaNames.TYPE: "scikit-learn_1.0",
    wml_client._models.ConfigurationMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    wml_client._models.ConfigurationMetaNames.TRAINING_DATA_REFERENCES: training_data_references,
    wml_client._models.ConfigurationMetaNames.LABEL_FIELD: "ATTRITION",
    wml_client._models.ConfigurationMetaNames.CUSTOM: metadata_dict
}

facts_client.export_facts.prepare_model_meta(wml_client=wml_client,meta_props=model_props)

The next three cells fit a Random Forest classifier to the training data, run predictions on the test data, and then print the model's accuracy on the test data. Finally, the notebook calculates and displays feature importance. For more information on Random Forest classifiers, see [here](https://www.ibm.com/cloud/learn/random-forest).

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)

# Train the model using the training set
clf.fit(X_train, y_train)

# Predict labels for the test set
y_pred = clf.predict(X_test)

In [None]:
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
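
For reference, the accuracy printed above is simply the fraction of test rows whose predicted label matches the true label. The plain-Python sketch below shows that calculation on made-up labels; the `accuracy` function is written here for illustration and is not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: three of the four predictions match.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```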

In [None]:
import pandas as pd  # usually already imported by the inserted data cell
feature_imp = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
feature_imp

The next three cells export data from the model you just created to the FactSheet. The first lists the runs that FactSheets has tracked for the experiment. The second writes the URL and other information about this notebook to the FactSheet as custom data. The third exports the facts for the current run. Note that any data that might be helpful for model validators can be written to the FactSheet.

In [None]:
facts_client.runs.list_runs_by_experiment('1')

In [None]:
nb_name = "attrition model creation and deployment"
nb_asset_id = "tbd"
nb_asset_url = "https://" + CPD_URL + "/analytics/notebooks/v2/" + nb_asset_id + "?projectid=" + PROJECT_UID + "&context=cpdaas"

latestRunId = facts_client.runs.list_runs_by_experiment('1').sort_values('start_time').iloc[-1]['run_id']
facts_client.runs.set_tags(latestRunId, {"Notebook name": nb_name, "Notebook id": nb_asset_id, "Notebook URL" : nb_asset_url})
facts_client.export_facts.export_payload(latestRunId)
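
The notebook URL above is assembled by string concatenation, which works as long as none of the values contain characters that need escaping. As a hedged alternative, the query string could be built with `urllib.parse.urlencode`. The `notebook_url` helper and the host, asset ID, and project ID passed to it are placeholders for illustration.

```python
from urllib.parse import urlencode


def notebook_url(cpd_host, asset_id, project_id):
    """Build a notebook URL with the same path and query layout as the cell above."""
    query = urlencode({"projectid": project_id, "context": "cpdaas"})
    return f"https://{cpd_host}/analytics/notebooks/v2/{asset_id}?{query}"


# Placeholder values for illustration only.
print(notebook_url("example.test", "tbd", "abc-123"))
```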

In [None]:
RUN_ID=facts_client.runs.get_current_run_id()
facts_client.export_facts.export_payload(RUN_ID)

Finally, the model is stored to the project with all of the metadata defined above.

In [None]:
print("Storing model...")
published_model_details = wml_client.repository.store_model(
    model=clf, 
    meta_props=model_props,
    training_target=['ATTRITION'],
    training_data=X)
model_uid = wml_client.repository.get_model_id(published_model_details)

print("Done")
print("Model ID: {}".format(model_uid))

Next, the notebook uses Apache Spark to create a second model. Because you specified a Spark environment when you created this notebook, the `pyspark` runtime will be available without needing to be installed via `pip`.

In [None]:
try:
    from pyspark.sql import SparkSession
except ImportError:
    print('Error: Spark runtime is missing. If you are using Watson Studio, change the notebook runtime to Spark by clicking '
          'the View notebook info button above (the lowercase i in a circle). Click on the Environment tab and use the Environment '
          'definition dropdown to select an environment with Spark and Python.')
    raise
spark.version

# <span style="color:red">3. !!--STOP--!! Insert data to code below</span>

Place your cursor in the empty code cell below. Then click the **Find and add data** icon in the upper right corner of the screen like you did in step 2. Locate the *modeling_records_2022.csv* file, click its associated **Insert to code** dropdown, and select **SparkSession DataFrame**.

## <span style="color:red">IMPORTANT: replace all instances of `df_data_x` with `df_data_2` in the code</span>

The automatically generated code will likely use the `df_data_3` variable to hold the data. Update the last two lines of code to import data into the `df_data_2` variable for the rest of the notebook to work correctly. The last lines of your cell should look like this:

<img src="https://cp4d-outcomes.techzone.ibm.com/img/data-fabric-lab/trusted-ai/dataframe_insert_2.png" width=700 align=left>

Run the inserted code cell below. If you have correctly imported the data, you will see a table populated with employee data. The remainder of the notebook is very similar to the training of the sklearn model. It will enable FactSheets for the second model, train a Spark Gradient Boosted Tree classifier, and then save that model to the project. You may run the rest of the notebook to its conclusion.

As with the `sklearn` model, you need to specify metadata for the Spark model.

In [None]:
software_spec_uid = wml_client.software_specifications.get_id_by_name("spark-mllib_3.3")
print("Software Specification ID: {}".format(software_spec_uid))
model_props = {
    wml_client._models.ConfigurationMetaNames.NAME:"{}".format("attrition challenger - spark"),
    wml_client._models.ConfigurationMetaNames.TYPE: "mllib_3.3",
    wml_client._models.ConfigurationMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    wml_client._models.ConfigurationMetaNames.TRAINING_DATA_REFERENCES: training_data_references,
    wml_client._models.ConfigurationMetaNames.LABEL_FIELD: "ATTRITION"
}

facts_client.export_facts.prepare_model_meta(wml_client=wml_client,meta_props=model_props)

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline, Model

For the second model, you will create a Gradient Boosted Tree classifier. For more information on Gradient Boosting, see [here](https://www.ibm.com/cloud/learn/boosting).

In [None]:
# Cast every feature column to float and the label column to int
for field in fields:
    df_data_2 = df_data_2.withColumn(field, df_data_2[field].cast("float"))
df_data_2 = df_data_2.withColumn('ATTRITION', df_data_2['ATTRITION'].cast("int"))
df_data_2.take(5)

In [None]:
va = VectorAssembler(inputCols = fields, outputCol='features')
va_df = va.transform(df_data_2)
va_df = va_df.select(['features', 'ATTRITION'])
va_df.show(3)

In [None]:
gbtc = GBTClassifier(labelCol="ATTRITION", maxIter=20)

pipeline = Pipeline(stages=[va, gbtc])

In [None]:
split_data = df_data_2.randomSplit([0.8, 0.2], 24)
train_data = split_data[0]
test_data = split_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

In [None]:
spark_model = pipeline.fit(train_data)

pred = spark_model.transform(test_data)
pred.show(3) 

In [None]:
evaluator = BinaryClassificationEvaluator()
evaluator.setLabelCol("ATTRITION")
print("Test Area Under ROC: " + str(evaluator.evaluate(pred, {evaluator.metricName: "areaUnderROC"})))
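
The area under the ROC curve reported above can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The plain-Python sketch below computes it that way on made-up labels and scores. It is a conceptual illustration only: Spark's evaluator derives the value from the ROC curve over its raw prediction column, so its result may differ slightly, and `area_under_roc` is a function written for this example.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: three of the four pairs are ranked correctly.
print(area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why this metric is often preferred over accuracy when the classes are imbalanced.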

In [None]:
print("Storing spark model...")
published_model_details = wml_client.repository.store_model(
    model=spark_model, 
    meta_props=model_props,
    training_target=['ATTRITION'],
    training_data=train_data,
    pipeline=pipeline
)
model_uid = wml_client.repository.get_model_id(published_model_details)

print("Done")
print("Model ID: {}".format(model_uid))

# Congratulations!

You have completed this notebook. You can now return to the [Data and AI Live Demos lab page](https://cp4d-outcomes.techzone.ibm.com/data-fabric-lab/trusted-ai) and continue with the lab.