# 1. Insert project token, API key, and region

Click the **three vertical dots** icon above and select **Insert project token** to provide this notebook API access to your project. The code inserted above will have a line that looks like this:

`project = Project(project_id='xxxxxxxx-xxx-xxxx-xxxx-xxxxxxxxxx', project_access_token='p-xxxxxxxxxxxxxxxxxx')`

Paste that `project_id` value into the cell below as the value for `PROJECT_ID`. Paste the API key you created earlier in the lab as the value for `API_KEY`, and the region your account is in (such as `us-south`) as the value for `LOCATION`.

Run the cell above, and continue running cells individually until you reach step 2.

In [None]:
API_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
PROJECT_ID = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
LOCATION = 'us-south'
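
Before moving on, it can help to confirm that none of the three values still holds the `xxxx` placeholder text. The following is only a sketch: the helper name `is_placeholder` is hypothetical and not part of any IBM library, and the sample values are made up for illustration.

```python
def is_placeholder(value):
    """Return True if value is empty or made up only of 'x' characters and dashes."""
    stripped = value.replace("-", "").strip().lower()
    return stripped == "" or set(stripped) == {"x"}


# Made-up sample values standing in for API_KEY, PROJECT_ID, and LOCATION.
settings = {
    "API_KEY": "xxxxxxxxxxxxxxxx",
    "PROJECT_ID": "1234abcd-0000-1111-2222-333344445555",
    "LOCATION": "us-south",
}

# Collect the names of any settings that were never filled in.
unset = [name for name, value in settings.items() if is_placeholder(value)]
print("Still unset:", unset)
```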

The first model you will create in this notebook uses the scikit-learn framework. The `sklearn` package is available by default in Watson Studio Python environments, and does not need to be installed.

In [None]:
import sklearn
sklearn.__version__

The next cell uses the API key and location variables defined above to authenticate with your Watson Machine Learning service. An error in this cell likely means that you do not have access to a WML service, or that the API key or location provided above is incorrect.

In [None]:
from ibm_watson_machine_learning import APIClient

wml_credentials = {
    "apikey": API_KEY,
    "url": 'https://' + LOCATION + '.ml.cloud.ibm.com'
}

wml_client = APIClient(wml_credentials)
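
One plausible cause of an authentication error is a stray space or capital letter in `LOCATION`, because the value is concatenated directly into the service URL. The sketch below shows the same concatenation with the input tidied first; the function name `wml_url` is hypothetical and introduced only for this illustration.

```python
def wml_url(location):
    """Build the service URL from a region name, tolerating stray whitespace and capitals."""
    return "https://" + location.strip().lower() + ".ml.cloud.ibm.com"


print(wml_url("us-south"))
print(wml_url("  US-South "))
```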

# 2. --STOP-- Insert data to code below

Place your cursor in the empty cell below. Then click the **Find and add data** icon in the upper right corner of the screen. Locate the *modeling_records_2022.csv* file, and use the **Insert to code** dropdown beneath it to insert the data as a pandas DataFrame. Like the `sklearn` package, `pandas` is automatically provided in Watson Studio Python environments.

Verify that the data is imported into the `df_data_1` variable. If it is imported into a different variable, you will need to alter code in some of the cells below to reflect the correct variable.

Run the cell below, and continue running cells individually until you reach step 3.

The next cell separates the data into the feature columns and the label column (`ATTRITION`), and then splits it further into a training data set (85% of the rows) and a testing data set (15%).

In [None]:
from sklearn.model_selection import train_test_split

X = df_data_1.drop(['ATTRITION'], axis=1)  # Features
y = df_data_1['ATTRITION']  # Labels

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15) # 85% training and 15% test
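
To make the 85/15 proportion concrete, here is a minimal standard-library sketch of a shuffled split. It is not how scikit-learn implements `train_test_split`; the helper `simple_split` and the 100-row toy data are assumptions made for illustration. Note that the cell above does not fix a seed, so its split differs from run to run; `train_test_split` accepts a `random_state` argument if you want a repeatable split.

```python
import random


def simple_split(rows, test_size, seed):
    """Shuffle a copy of rows and cut it into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: 100 row indices instead of a real DataFrame.
train, test = simple_split(range(100), test_size=0.15, seed=42)
print(len(train), len(test))
```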

Next you will tell Watson Machine Learning to use the current project to store the model.

In [None]:
wml_client.set.default_project(PROJECT_ID)

The following cell provides connection information to the model training data, which will be stored with the model and in FactSheets. You could use the Cloud Object Storage information for this particular project by changing the credentials to match those from above where you inserted the file to code, but for simplicity's sake, you will use a pre-existing file.

In [None]:
training_data_references = [
                {
                    "id": "attrition",
                    "type": "s3",
                    "connection": {
                        "access_key_id": "yqcPbWZ0AQPHleHVerrR4Wx5e9pymBdMgydbEra5zCif",
                        "endpoint_url": "https://s3.us.cloud-object-storage.appdomain.cloud",
                        "resource_instance_id": "crn:v1:bluemix:public:cloud-object-storage:global:a/7d8b3c34272c0980d973d3e40be9e9d2:2883ef10-23f1-4592-8582-2f2ef4973639::"
                    },
                    "location": {
                        "bucket": "faststartlab-donotdelete-pr-nhfd4jnhlxgpc7",
                        "path": "modeling_records_2022.csv"
                    }
                }
            ]
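
If you do substitute your own Cloud Object Storage details, a quick structural check can catch a missing field before the model is stored. The sketch below is an assumption-laden illustration: the list of expected keys is taken from the cell above rather than from an official schema, and `missing_keys` is a hypothetical helper.

```python
# Keys copied from the reference in the cell above (not an official schema).
EXPECTED_TOP_LEVEL = ("id", "type", "connection", "location")
EXPECTED_LOCATION = ("bucket", "path")


def missing_keys(reference):
    """Return the expected keys that are absent from one training data reference."""
    missing = [key for key in EXPECTED_TOP_LEVEL if key not in reference]
    location = reference.get("location", {})
    missing += ["location." + key for key in EXPECTED_LOCATION if key not in location]
    return missing


# Made-up reference with the file path left out.
sample = {
    "id": "attrition",
    "type": "s3",
    "connection": {},
    "location": {"bucket": "example-bucket"},
}
print(missing_keys(sample))
```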

The cell below will initialize IBM FactSheet monitoring for this model, and authenticate with the FactSheet service using credentials you have already supplied. Note that Python notebooks in Watson Studio have full support for `pip install`, which allows you to add whatever libraries you need to the notebook environment.

In [None]:
import os

try:
    from ibm_aigov_facts_client import AIGovFactsClient
except ImportError:
    # Install the client on first use, then retry the import
    !pip install -U ibm-aigov-facts-client
    from ibm_aigov_facts_client import AIGovFactsClient

# The project ID and platform host come from the notebook runtime environment
PROJECT_UID = os.environ['PROJECT_ID']
CPD_URL = os.environ['RUNTIME_ENV_APSX_URL'][len('https://api.'):]
CONTAINER_ID = PROJECT_ID
CONTAINER_TYPE = 'project'
EXPERIMENT_NAME = 'predictive_attrition'

# `project` is created by the project token code inserted in step 1
PROJECT_ACCESS_TOKEN = project.project_context.accessToken.replace('Bearer ', '')

facts_client = AIGovFactsClient(
    api_key=API_KEY,
    experiment_name=EXPERIMENT_NAME,
    container_type=CONTAINER_TYPE,
    container_id=CONTAINER_ID,
    set_as_current_experiment=True)
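
The `CPD_URL` line in the cell above removes the leading `https://api.` from an environment variable by slicing off a fixed number of characters. The sketch below shows that idea in isolation, with a guard so the string is left unchanged when the prefix is absent. The helper `strip_api_prefix` and the example URL are hypothetical; the real value comes from the notebook environment.

```python
def strip_api_prefix(url, prefix="https://api."):
    """Remove prefix from url if present; otherwise return url unchanged."""
    return url[len(prefix):] if url.startswith(prefix) else url


# Made-up example value for illustration.
example = "https://api.dataplatform.cloud.ibm.com"
print(strip_api_prefix(example))
```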

The next two cells construct metadata for the model, which will be saved with the model itself, and will appear on its FactSheet. If you get errors trying to save the model, they will most likely be from the metadata contained in the model props, specifically the `TYPE` and `SOFTWARE_SPEC_UID`, which frequently change as Watson Studio adds support for new versions of Python, and removes support for outdated versions.

In [None]:
fields=X_train.columns.tolist()
metadata_dict = {'target_col' : 'ATTRITION', 'fields':fields}

In [None]:
software_spec_uid = wml_client.software_specifications.get_id_by_name("runtime-22.1-py3.9")
print("Software Specification ID: {}".format(software_spec_uid))
model_props = {
    wml_client._models.ConfigurationMetaNames.NAME:"{}".format("attrition challenger - sklearn"),
    wml_client._models.ConfigurationMetaNames.TYPE: "scikit-learn_1.0",
    wml_client._models.ConfigurationMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    wml_client._models.ConfigurationMetaNames.TRAINING_DATA_REFERENCES: training_data_references,
    wml_client._models.ConfigurationMetaNames.LABEL_FIELD: "ATTRITION",
    wml_client._models.ConfigurationMetaNames.CUSTOM: metadata_dict
}

facts_client.export_facts.prepare_model_meta(wml_client=wml_client,meta_props=model_props)
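
Because the `TYPE` value has to agree with the scikit-learn version in the runtime, one way to reason about a mismatch is to compare it with `sklearn.__version__` printed earlier. The sketch below assumes the `TYPE` string has the form `scikit-learn_<major>.<minor>`, as in the cell above; that pattern and the helper `sklearn_model_type` are assumptions, so check the Watson Machine Learning documentation for the values your service supports.

```python
def sklearn_model_type(version):
    """Format a model TYPE string from a scikit-learn version string.

    Assumption: TYPE looks like 'scikit-learn_<major>.<minor>'.
    """
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError("expected a version like '1.0.2', got %r" % version)
    return "scikit-learn_{}.{}".format(parts[0], parts[1])


# Example version string; in the notebook this would be sklearn.__version__.
print(sklearn_model_type("1.0.2"))
```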

The next three cells fit the data using a random forest classifier, run predictions on the test data, and then print out the accuracy for how the model did on the test data. Finally, the notebook calculates and displays feature importance.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier with 100 trees
clf = RandomForestClassifier(n_estimators=100)

# Train the model on the training set
clf.fit(X_train, y_train)

# Predict labels for the held-out test set
y_pred = clf.predict(X_test)

In [None]:
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
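
Accuracy is simply the fraction of test rows whose predicted label matches the true label. The sketch below computes it by hand on made-up labels to show what the number means; it is an illustration, not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)


# Made-up labels: four of the five predictions are correct.
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
print(accuracy(toy_true, toy_pred))
```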

In [None]:
feature_imp = pd.Series(clf.feature_importances_,index=X.columns).sort_values(ascending=False)
feature_imp

The next three cells export data to the FactSheet. The first lists the runs that FactSheets has tracked for the experiment. The second writes some custom data, in this case the URL and other information for this notebook, and exports the latest run. The third exports the facts for the current run. Note that any data that might be helpful for model validators can be written to the FactSheet.

In [None]:
facts_client.runs.list_runs_by_experiment('1')

In [None]:
nb_name = "attrition model creation and deployment"
nb_asset_id = "tbd"
nb_asset_url = "https://" + CPD_URL + "/analytics/notebooks/v2/" + nb_asset_id + "?projectid=" + PROJECT_UID + "&context=cpdaas"

latestRunId = facts_client.runs.list_runs_by_experiment('1').sort_values('start_time').iloc[-1]['run_id']
facts_client.runs.set_tags(latestRunId, {"Notebook name": nb_name, "Notebook id": nb_asset_id, "Notebook URL" : nb_asset_url})
facts_client.export_facts.export_payload(latestRunId)
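
The cell above does two things worth seeing in isolation: it picks the most recent run by sorting on `start_time`, and it assembles the notebook URL from its parts. The sketch below reproduces both steps with plain Python and made-up values. The run records, host name, and IDs are hypothetical, and `notebook_url` is a helper introduced only for this illustration.

```python
from urllib.parse import urlencode

# Made-up run records; the real ones come from the FactSheets client.
runs = [
    {"run_id": "run-a", "start_time": 1650000000},
    {"run_id": "run-b", "start_time": 1650000500},
    {"run_id": "run-c", "start_time": 1650000200},
]

# The latest run is the one with the largest start_time.
latest = max(runs, key=lambda run: run["start_time"])
print(latest["run_id"])


def notebook_url(host, asset_id, project_id):
    """Assemble the notebook URL, letting urlencode escape the query values."""
    query = urlencode({"projectid": project_id, "context": "cpdaas"})
    return "https://" + host + "/analytics/notebooks/v2/" + asset_id + "?" + query


print(notebook_url("example.com", "nb-1", "proj-1"))
```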

In [None]:
RUN_ID=facts_client.runs.get_current_run_id()
facts_client.export_facts.export_payload(RUN_ID)

Finally, the model is stored to the project with all of the metadata defined above.

In [None]:
print("Storing model...")
published_model_details = wml_client.repository.store_model(
    model=clf, 
    meta_props=model_props,
    training_target=['ATTRITION'],
    training_data=X)
model_uid = wml_client.repository.get_model_id(published_model_details)

print("Done")
print("Model ID: {}".format(model_uid))

Next, the notebook will use Apache Spark to create a second model. Because you specified a Spark environment when you created this notebook, the `pyspark` runtime will be available without needing to be installed via `pip`.

In [None]:
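try:
    from pyspark.sql import SparkSession
except ImportError:
    print('Error: Spark runtime is missing. If you are using Watson Studio change the notebook runtime to Spark.')
    raise

# Reuse the active Spark session, or start one if none exists
spark = SparkSession.builder.getOrCreate()
spark.version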

# 3. --STOP-- Insert data to code below

Place your cursor in the empty cell below. Then click the **Find and add data** icon in the upper right corner of the screen. Locate the *modeling_records_2022.csv* file, and use the **Insert to code** dropdown beneath it to insert the data as a SparkSession DataFrame.

Verify that the data is imported into the `df_data_2` variable. If it is imported into a different variable, you will need to alter code in some of the cells below to reflect the correct variable.

The remainder of the notebook is very similar to the training of the `sklearn` model. It will enable FactSheets for the second model, train a Spark gradient-boosted tree classifier (`GBTClassifier`), and then save that model to the project. You may run the rest of the notebook to its conclusion.

As with the `sklearn` model, you need to specify metadata for the Spark model.

In [None]:
software_spec_uid = wml_client.software_specifications.get_id_by_name("spark-mllib_3.2")
print("Software Specification ID: {}".format(software_spec_uid))
model_props = {
    wml_client._models.ConfigurationMetaNames.NAME:"{}".format("attrition challenger - spark"),
    wml_client._models.ConfigurationMetaNames.TYPE: "mllib_3.2",
    wml_client._models.ConfigurationMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
    wml_client._models.ConfigurationMetaNames.TRAINING_DATA_REFERENCES: training_data_references,
    wml_client._models.ConfigurationMetaNames.LABEL_FIELD: "ATTRITION"
}

facts_client.export_facts.prepare_model_meta(wml_client=wml_client,meta_props=model_props)

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline, Model

In [None]:
# Cast every feature column to float and the label column to int
for field in fields:
    df_data_2 = df_data_2.withColumn(field, df_data_2[field].cast("float"))
df_data_2 = df_data_2.withColumn('ATTRITION', df_data_2['ATTRITION'].cast("int"))
df_data_2.take(5)

In [None]:
va = VectorAssembler(inputCols = fields, outputCol='features')
va_df = va.transform(df_data_2)
va_df = va_df.select(['features', 'ATTRITION'])
va_df.show(3)

In [None]:
gbtc = GBTClassifier(labelCol="ATTRITION", maxIter=20)

pipeline = Pipeline(stages=[va, gbtc])

In [None]:
split_data = df_data_2.randomSplit([0.8, 0.2], 24)
train_data = split_data[0]
test_data = split_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

In [None]:
spark_model = pipeline.fit(train_data)

pred = spark_model.transform(test_data)
pred.show(3) 

In [None]:
evaluator = BinaryClassificationEvaluator()
evaluator.setLabelCol("ATTRITION")
print("Test Area Under ROC: " + str(evaluator.evaluate(pred, {evaluator.metricName: "areaUnderROC"})))
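
The area under the ROC curve can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counted as half. The sketch below computes it by comparing every positive/negative pair on made-up scores. It illustrates the metric only and is not how Spark computes it.

```python
def pairwise_auc(labels, scores):
    """Return the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores: three of the four pairs are ranked correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))
```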

In [None]:
print("Storing spark model...")
published_model_details = wml_client.repository.store_model(
    model=spark_model, 
    meta_props=model_props,
    training_target=['ATTRITION'],
    training_data=train_data,
    pipeline=pipeline
)
model_uid = wml_client.repository.get_model_id(published_model_details)

print("Done")
print("Model ID: {}".format(model_uid))

# Congratulations!

You have completed this notebook. You can now return to the [Data and AI Live Demos lab page](https://cp4d-outcomes.techzone.ibm.com/data-fabric-lab/trusted-ai) and continue with the lab.