## Model Development Using Scikit-Learn
- Please open this notebook in edit mode.

#### Load the enhanced customer history data from the CSV into a pandas DataFrame
- This is the CSV file we stored as a data asset in the previous step.

In [1]:
import pandas as pd
from project_lib import Project

project = Project.access()
# Read the CSV data asset from the project
enhanced_customer_churn = pd.read_csv(project.get_file("enhanced_customer_history.csv"))
# Replace missing values with 0
enhanced_customer_churn = enhanced_customer_churn.fillna(0)
enhanced_customer_churn.head()

Unnamed: 0,LONGDISTANCE,INTERNATIONAL,LOCAL,DROPPED,PAYMETHOD,LOCALBILLTYPE,LONGDISTANCEBILLTYPE,USAGE,RATEPLAN,CHURN,GENDER,STATUS,CHILDREN,ESTINCOME,CAROWNER,AGE,Phase
0,23,0,206,0,CC,Budget,Intnl_discount,229,3,T,F,S,1,38000.0,N,24.393333,Adult
1,29,0,45,0,CH,FreeLocal,Standard,75,2,F,M,M,2,29616.0,N,49.426667,Adult
2,24,0,22,0,CC,FreeLocal,Standard,47,3,F,M,M,0,19732.8,N,50.673333,Adult
3,26,0,32,1,CC,Budget,Standard,59,1,F,M,S,2,96.33,N,56.473333,Adult
4,12,0,46,4,CC,FreeLocal,Standard,58,1,F,M,M,2,53010.8,N,18.84,Adult


### Data preprocessing
- The following cells show a few data preprocessing steps.

#### Select categorical columns

In [2]:
cat_cols = ["GENDER", "STATUS", "CAROWNER", "PAYMETHOD", "LOCALBILLTYPE", "LONGDISTANCEBILLTYPE","Phase"]

#### Select numerical columns

In [3]:
num_cols = ["CHILDREN", "ESTINCOME", "AGE", "LONGDISTANCE", "INTERNATIONAL", "LOCAL","DROPPED","USAGE","RATEPLAN"]
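
The two lists above are written by hand. As a cross-check, the same split can be inferred from the column dtypes. The sketch below uses a small hypothetical DataFrame (`sample`, with a handful of made-up rows) rather than the real data, and assumes that every non-numeric feature should be treated as categorical.

```python
import pandas as pd

# Hypothetical sample rows; column names mirror a few of the real ones.
sample = pd.DataFrame({
    "GENDER": ["F", "M"],
    "PAYMETHOD": ["CC", "CH"],
    "AGE": [24.4, 49.4],
    "CHILDREN": [1, 2],
    "CHURN": ["T", "F"],
})

# Drop the target so it is not mistaken for a feature.
features = sample.drop(columns=["CHURN"])

# Numeric dtypes become numerical columns; everything else is categorical.
inferred_num_cols = features.select_dtypes(include="number").columns.tolist()
inferred_cat_cols = features.select_dtypes(exclude="number").columns.tolist()

print(inferred_num_cols)
print(inferred_cat_cols)
```

Note that a numerically coded category such as `RATEPLAN` would land in the numerical list with this approach, which matches how the notebook treats it.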

#### Import required python modules

In [4]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer

#### Let's create a preprocessing step
- It one-hot encodes the categorical columns and standardizes the numerical columns.

In [5]:
ohenc = OneHotEncoder(handle_unknown='ignore')
scaler = StandardScaler() 
preprocessor = ColumnTransformer(transformers=[
    ('num', scaler, num_cols),
    ('cat', ohenc, cat_cols)
])
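
To see what this preprocessing step produces, here is a minimal sketch on hypothetical toy data (`toy`, `toy_preprocessor`, and `unseen` are illustrative names, not part of the notebook). It also shows the effect of `handle_unknown='ignore'`. `sparse_threshold=0` is added only so that the result is always a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not taken from the customer history file.
toy = pd.DataFrame({
    "AGE": [20.0, 40.0, 60.0],
    "GENDER": ["F", "M", "F"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["AGE"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["GENDER"]),
    ],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)

# One scaled AGE column plus one indicator column per GENDER category.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)

# A category not seen during fit is encoded as all zeros instead of raising.
unseen = pd.DataFrame({"AGE": [40.0], "GENDER": ["X"]})
unseen_row = toy_preprocessor.transform(unseen)
print(unseen_row)
```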

### Prepare training data and pipeline 
- The pipeline combines the preprocessing step with a tree-based classifier (a random forest).

In [6]:
fields = ["LONGDISTANCE", "INTERNATIONAL", "LOCAL", "DROPPED", "PAYMETHOD", "LOCALBILLTYPE", "LONGDISTANCEBILLTYPE",
          "USAGE", "RATEPLAN", "GENDER", "STATUS", "CHILDREN", "ESTINCOME", "CAROWNER", "AGE", "Phase"]

y = enhanced_customer_churn["CHURN"]
X = enhanced_customer_churn[fields]
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('modeling', RandomForestClassifier())])  

### Let's evaluate our pipeline with K-fold (K=10) cross validation

In [7]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X, y, cv=10, scoring="f1_micro")
scores.mean()

0.9709819198881231
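
A note on the metric: for a single-label target such as `CHURN`, micro-averaged F1 is the same number as plain accuracy. The sketch below checks this on synthetic data from `make_classification`. The dataset, the small forest, and the explicit `StratifiedKFold` splitter are all assumptions made so that both scoring runs see identical folds and identical models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (a stand-in for the churn data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Fixed seeds so both scoring runs train the same models on the same folds.
clf = RandomForestClassifier(n_estimators=20, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

f1_micro_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1_micro")
accuracy_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")

# Micro-averaged F1 over all classes coincides with accuracy, fold by fold.
print(np.allclose(f1_micro_scores, accuracy_scores))
```

If the churn classes turn out to be imbalanced, a per-class score such as `scoring="f1_macro"` may be a more informative complement to this number.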

### Fit the pipeline with data
- It generates a trained model, which we will store in the Watson ML repository.

In [8]:
# Use a random 70% sample of the data to fit the model
df = enhanced_customer_churn.sample(frac=0.7)

# fit() returns the fitted pipeline itself
model = pipeline.fit(df[num_cols + cat_cols], df["CHURN"])
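
The cell above leaves 30% of the rows unused. One possible way to use them, sketched here on hypothetical synthetic data (`data`, `train`, `holdout`, and the labeling rule are made up for illustration), is to recover the unsampled rows by index and score the fitted model on them.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the customer data: 100 rows, two features.
rng = np.random.RandomState(0)
data = pd.DataFrame({
    "USAGE": rng.randint(0, 300, size=100),
    "AGE": rng.uniform(18, 80, size=100),
})
# Made-up rule so the target is learnable.
data["CHURN"] = np.where(data["USAGE"] > 150, "T", "F")

# 70% random sample for training; the remaining rows form the hold-out set.
train = data.sample(frac=0.7, random_state=0)
holdout = data.drop(train.index)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(train[["USAGE", "AGE"]], train["CHURN"])

# Accuracy on rows the model never saw during fitting.
holdout_accuracy = clf.score(holdout[["USAGE", "AGE"]], holdout["CHURN"])
print(len(train), len(holdout), holdout_accuracy)
```

Passing `random_state` to `sample` makes the split reproducible; the notebook cell does not set it, so its 70% sample changes from run to run.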

## Store trained model
- We will store the trained model in the project. The model should appear in the `Assets` tab under the `Models` section.


##### Create a WML client
- It fetches required credentials from the environment. **Please don't replace anything.**

In [9]:
import os
from watson_machine_learning_client import WatsonMachineLearningAPIClient
token = os.environ['USER_ACCESS_TOKEN']
wml_credentials = {
   "token": token,
   "instance_id" : "wml_local",
   "url": os.environ['RUNTIME_ENV_APSX_URL'],
   "version": "3.0.0"
}
wml_client = WatsonMachineLearningAPIClient(wml_credentials)

### Prepare metadata for storing the model
- <font color='red'>Please provide a unique model name to `MODEL_NAME` before running the following cell.</font>
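- Note: `RUNTIME_UID` identifies the runtime for the model (here scikit-learn 0.22 on Python 3.6), and `TYPE` identifies the model framework. Both should match the environment in which the model was trained.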


In [10]:
MODEL_NAME="<model_name>"
project = Project.access()
project_id=project.get_metadata()['metadata']['guid']

metadata = {
    wml_client.repository.ModelMetaNames.NAME: MODEL_NAME,
    wml_client.repository.ModelMetaNames.RUNTIME_UID: "scikit-learn_0.22-py3.6",
    wml_client.repository.ModelMetaNames.TYPE: "scikit-learn_0.22"
}


### Store model in the WML repository
- The client derives the schema details from the training data and target.
- It stores the model in the project.

In [11]:
# Set the default project
wml_client.set.default_project(project_id)

# Store the model together with the training data and target
model_details = wml_client.repository.store_model(model, meta_props=metadata, training_data=X, training_target=y)
model_details

{'metadata': {'name': 'scikit-cusotmer-churn',
  'guid': 'dcf7df6f-df85-4cb1-b7d1-06fc78a520b9',
  'id': 'dcf7df6f-df85-4cb1-b7d1-06fc78a520b9',
  'project_id': '342adc21-c83c-47a3-b5d7-17283683b850',
  'modified_at': '2020-06-23T22:59:52.002Z',
  'created_at': '2020-06-23T22:59:50.002Z',
  'owner': '1000331005',
  'href': '/v4/models/dcf7df6f-df85-4cb1-b7d1-06fc78a520b9?project_id=342adc21-c83c-47a3-b5d7-17283683b850'},
 'entity': {'name': 'scikit-cusotmer-churn',
  'project': {'id': '342adc21-c83c-47a3-b5d7-17283683b850',
   'href': '/v2/projects/342adc21-c83c-47a3-b5d7-17283683b850'},
  'training_data_references': [{'location': {'bucket': 'not_applicable'},
    'type': 'fs',
    'connection': {'access_key_id': 'not_applicable',
     'secret_access_key': 'not_applicable',
     'endpoint_url': 'not_applicable'},
    'schema': {'id': '1',
     'type': 'DataFrame',
     'fields': [{'name': 'LONGDISTANCE', 'type': 'int64'},
      {'name': 'INTERNATIONAL', 'type': 'int64'},
      {'name':