# Map Business Terms to Data Headers
## Weather Company Data Limited Edition: Sales Prediction using The Weather Company Data

### Data Disclaimer

The weather and business input data provided in this Accelerator is simulated data, designed to illustrate how to solve a common business problem. You are not permitted to utilize the simulated data contained in the Accelerator outside of this Accelerator or the Sample Materials contained within it.

### Copyright

This project contains Sample Materials, provided under license. <br>
Licensed Materials - Property of IBM. <br>
© Copyright IBM Corp. 2020. All Rights Reserved. <br>
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

### Terms and Conditions

The terms under which you are licensing IBM Cloud Pak for Data also apply to your use of the Industry Accelerators. 
 
Before you use the Industry Accelerators, you must agree on these additional terms and conditions that are set forth here.
 
This information contains sample modules, exercises, and code samples (the code may be provided in source code form (“Source Code”)) (collectively “Sample Materials”).

### License
 
Subject to the terms herein, you may copy, modify, and distribute these Sample Materials within your enterprise only, for your internal use only; provided such use is within the limits of the license rights of the IBM agreement under which you are licensing IBM Cloud Pak for Data.
 
The Industry Accelerators might include applicable third-party licenses. Review the third-party licenses before you use any of the Industry Accelerators. You can find the third-party licenses that apply to each Sample Material in the notices.txt file that is included with each Sample Material.

### Code Security
 
Source Code may not be disclosed to any third parties for any reason without IBM’s prior written consent, and access must be limited to your employees who have a need to know. 
 
You have implemented and will maintain the technical and personnel focused security policies, procedures, and controls that are necessary to protect the Source Code against loss, alteration, unlawful forms of processing, unauthorized disclosure, and unauthorized access.
 
You will promptly (and in no event any later than 48 hours) notify IBM after becoming aware of any breach or other security incident that you know, or should reasonably suspect, affects or will affect the Source Code or IBM, and will provide IBM with reasonably requested information about such security incident and the status of any remediation and restoration activities.
 
You will not permit any Source Code to reside on servers located in the Russian Federation, the People’s Republic of China, or any territories worldwide in which the Russian Federation or People’s Republic of China claim sovereignty (collectively, “China or Russia”).  Company shall not permit anyone to access or use any such Source Code from or within China or Russia, and Company will not permit any development, testing, or other work to occur in China or Russia that would require such access or use.  Upon reasonable written notice, IBM may extend these restrictions to other countries that the United States government identifies as potential cyber security concerns.
IBM may request that you verify compliance with these Code Security obligations, and you agree to cooperate with IBM in that regard.

### General
 
Notwithstanding anything to the contrary, IBM PROVIDES THE SAMPLE MATERIALS ON AN "AS IS" BASIS AND IBM DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY IMPLIED WARRANTIES OR CONDITIONS OF MERCHANTABILITY, SATISFACTORY QUALITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY OR ECONOMIC CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR OPERATION OF THE SAMPLE MATERIALS. IBM SHALL NOT BE LIABLE FOR LOSS OF, OR DAMAGE TO, DATA, OR FOR LOST PROFITS, BUSINESS REVENUE, GOODWILL, OR ANTICIPATED SAVINGS. IBM HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS OR MODIFICATIONS TO THE SAMPLE MATERIALS.


## Introduction

In this notebook we programmatically publish a dataset into a catalog and map business terms to the dataset column headers. The business terms and their mappings are specified in a csv file included with the project. The user must first ensure that the catalog exists and the imported business terms have been published.

The user can also assign business terms to column headers manually or by using the Data Discovery capability within Cloud Pak for Data. 

This notebook is optional. The analytics project runs as expected even if this notebook is not used. 

**Note that as only Admin users can import terms, this notebook should be run by an Admin user only.**

In [1]:
# Imports for the REST API interactions with WKC
import json
import os
import re

import pandas as pd
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Requests in this notebook use verify=False, so silence the resulting certificate warnings
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

# Use this library for reading and saving data in CP4D
from project_lib import Project
project = Project()

## Create Catalog

The datasets must first be published into a catalog, and the catalog must be created manually. Under **Organize** in the navigation menu, select **All Catalogs**, then select **New Catalog**. Enter a name for the catalog and, optionally, a description, then create the catalog. If the catalog already exists, skip this step and specify the existing catalog name in the code cell below.


## User Inputs

The user must enter the following before running the rest of the notebook: 
1. **host :** Host URL of the cluster.
2. **uname :** Username of a user on this cluster.
3. **pword :** Password of that user.
4. **catalog_name :** Name of the catalog to publish the csv files to. This is either the catalog created by following the instructions above or an existing catalog.

In [2]:
# sample input and syntax
# host = 'https://xxxxx.com'
# uname = 'admin'
# pword = '******'
# catalog_name = 'Sales Prediction'

host = '******'
uname = '******'
pword = '******'
catalog_name = '******'
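
Every endpoint later in this notebook is built by concatenating `host` with a path, so a trailing slash or a placeholder that was never replaced only shows up as a confusing error several cells later. The following is a minimal sketch of a pre-flight check. `check_inputs` is a hypothetical helper and is not part of the accelerator.

```python
def check_inputs(host, uname, pword, catalog_name):
    """Return a cleaned host URL, or raise ValueError if an input looks unset."""
    values = {"host": host, "uname": uname, "pword": pword, "catalog_name": catalog_name}
    for name, value in values.items():
        # Treat empty strings and the '******' placeholder as "not filled in"
        if not value or set(value) == {"*"}:
            raise ValueError("Please set '" + name + "' before running the notebook")
    if not host.startswith(("http://", "https://")):
        raise ValueError("host must start with http:// or https://")
    # Endpoints are built as host + "/path", so drop any trailing slash
    return host.rstrip("/")
```

A possible use is `host = check_inputs(host, uname, pword, catalog_name)` directly after the inputs are entered.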

We also create additional variables. The user does not need to change the code cell below unless they change the business terms category name, the name of the csv file with mappings, or the files to publish.

1. **category_name :** Name of the business term category corresponding to the project.
2. **terms_file :** Name of the csv file containing the list of mappings between column headers and business terms.
3. **csv_file_to_publish :** List of the names of the csv files that are published into the catalog and for which we map business terms.

In [3]:

category_name = "Sales Prediction using The Weather Company Data"
terms_file = "sales-prediction-using-the-weather-company-data-map-terms.csv" 
csv_file_to_publish = ['fauxsales1.csv','fauxweather1.csv','fauxsales2.csv','fauxweather2.csv']

Create a requests session and use the same session throughout the notebook. 

In [4]:
# Creates requests session and stores in `s`
s = requests.Session()
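
A `requests.Session` can also carry default headers, which avoids passing the same `headers` argument on every call. The sketch below illustrates this under the assumption that the token has already been obtained. `set_bearer_token` is a hypothetical helper, and the notebook itself passes `headers` explicitly instead.

```python
import requests

s = requests.Session()
# Headers set on the session are sent with every request made through it
s.headers.update({"Content-Type": "application/json", "Cache-Control": "no-cache"})

def set_bearer_token(session, token):
    """Attach the access token so later calls need no explicit headers argument."""
    session.headers["Authorization"] = "Bearer " + token
```

Headers passed to an individual call are merged with the session headers for that call.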

## Authentication

Generate a token and validate the token on this cluster.

In [5]:
# Authenticate against the cluster with the specified username and password and store the access token for later requests
wkcURLauth=host+"/icp4d-api/v1/authorize"

# Payload with username and password
payload={
    "username": uname,
    "password": pword
}

# Header with json format
headers = {
    'Content-Type': "application/json",
    'cache-control': "no-cache"
    }

# Send a post request to the endpoint in wkcURLauth with the payload and headers.
# Catch request errors, which usually mean that the specified url is not correct
try:
    res = s.post(wkcURLauth,headers=headers,json=payload, verify=False)
except requests.exceptions.RequestException:
    print("The below error has occurred. Please check that the hostname entered is correct.")
    raise

# When authentication succeeds, store the token in accessToken
if res.status_code == 200:
    print("Authentication Successful")
    accessToken=json.loads(res.text)['token']
else:
    print('The below error has occurred. Please check entered username and password are correct.')
    raise ValueError(res.text)

Authentication Successful
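
The authentication step can also be wrapped in a function, which makes it easy to refresh the token if it expires during a long session. This is a sketch only: `get_token` is a hypothetical name, and the endpoint and payload are the ones used in the cell above.

```python
def get_token(session, host, uname, pword):
    """POST the credentials to the authorize endpoint and return the access token."""
    res = session.post(
        host + "/icp4d-api/v1/authorize",
        headers={"Content-Type": "application/json", "cache-control": "no-cache"},
        json={"username": uname, "password": pword},
        verify=False,
    )
    if res.status_code != 200:
        # Surface the server message instead of failing later with a KeyError
        raise ValueError("Authentication failed: " + res.text)
    return res.json()["token"]
```

With the variables from this notebook, the call would be `accessToken = get_token(s, host, uname, pword)`.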


## Map Business Terms to Headers

We complete the following steps to map the business terms to column headers:

1. Check if the  Category, `Sales Prediction using The Weather Company Data`, exists.
2. Load the business terms from the `Sales Prediction using The Weather Company Data` category into a dataframe.
3. Publish the specified datasets into the catalog.
4. Assign business terms to the dataset column headers.

### 1. Check for the Category
The cell below fetches all the categories present in the cluster and stores the category id of `Sales Prediction using The Weather Company Data`.

In [6]:
search_url=host+"/v3/search"
headers = {
    'Content-Type': "application/json",
    'Authorization': "Bearer "+accessToken,
    'Cache-Control': "no-cache",
    'Connection': "keep-alive"
    }

# Search for all artifacts of type category
search_body = {
    "size": 1000,
    "_source": ["artifact_id","metadata.name"],
    "query": {
        "match": {"metadata.artifact_type": "category"}
    }
}
parent_cat = s.post(search_url, verify=False,  json=search_body, headers=headers)
if parent_cat.status_code != 200:
    print("The below error has occurred while searching for categories.")
    raise ValueError(parent_cat.text)

# Check if the Industry Accelerator category exists and load its id into `category_id`
category_id=None
for i in json.loads(parent_cat.text)['rows']:
    if i['metadata']['name']== category_name:
        print("Category ",category_name,"exists")
        category_id=i['artifact_id']

# Stop here if the category was not found, because the following cells need its id
if category_id is None:
    raise ValueError("Please ensure that category, '" + category_name + "', exists.")

Category  Sales Prediction using The Weather Company Data exists
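
The lookup itself is a plain search through the `rows` list of the response. The sketch below isolates that logic so it can be tried without a cluster. `find_category_id` and the sample rows are hypothetical; the row shape mirrors the keys read by the cell above.

```python
def find_category_id(rows, category_name):
    """Return the artifact_id of the first row whose metadata.name matches, else None."""
    for row in rows:
        if row.get("metadata", {}).get("name") == category_name:
            return row["artifact_id"]
    return None

# Hypothetical rows shaped like the `rows` list of the search response
sample_rows = [
    {"artifact_id": "id-1", "metadata": {"name": "Some other category"}},
    {"artifact_id": "id-2", "metadata": {"name": "Sales Prediction using The Weather Company Data"}},
]
found_id = find_category_id(sample_rows, "Sales Prediction using The Weather Company Data")
```

Returning `None` for a missing category lets the caller decide whether to stop or to create the category first.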


### 2. Load Subcategory Business Terms into Dataframe 

Get all of the terms in the `Sales Prediction using The Weather Company Data` subcategory and store them in the `df_terms` dataframe.

In [7]:
# Payload for the post request: number of results, fields to return, and a filter on the category id
payload={
    "size":300,
    "from":0,
    "_source":["artifact_id","metadata.artifact_type","metadata.name","metadata.description","categories","entity.artifacts"],
    "query":{"bool":{"filter":{"bool":{
        "minimum_should_match":1,
        "should":[{"term":{"categories.primary_category_id":category_id}},
                  {"term":{"categories.secondary_category_ids":category_id}}],
        "must_not":{"terms":{"metadata.artifact_type":["category"]}}
    }}}}
}
# Send the post request with the above payload
wf=s.post(host+"/v3/search",headers=headers,json=payload,verify=False)
# The response contains all the matching terms; load these terms into a dataframe
wf_json=json.loads(wf.text)['rows']
df_terms=pd.json_normalize(wf_json)

# Keep only the term id and the term name
df_terms=df_terms[['entity.artifacts.global_id','metadata.name']]
# The terms dataframe looks as below
df_terms.head()

| | entity.artifacts.global_id | metadata.name |
|---|---|---|
| 0 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_7dd0fc97-... | Precipitation type |
| 1 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_39ba2c2d-... | Relative humidity |
| 2 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_20749e45-... | Alert instruction |
| 3 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_6824bd87-... | Forecasted trajectory |
| 4 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_52de759c-... | Almanac period |
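
The dotted column names come from `pd.json_normalize`, which flattens nested dictionaries and joins the keys with a dot. A small sketch with made-up rows (the ids below are placeholders, not real term ids):

```python
import pandas as pd

# Hypothetical rows with the same nesting as the search response
rows = [
    {"metadata": {"name": "Relative humidity"},
     "entity": {"artifacts": {"global_id": "cat-1_term-1"}}},
    {"metadata": {"name": "Air temperature"},
     "entity": {"artifacts": {"global_id": "cat-1_term-2"}}},
]

# Nested keys become dotted column names such as 'entity.artifacts.global_id'
df = pd.json_normalize(rows)
df = df[["entity.artifacts.global_id", "metadata.name"]]
```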


### 3. Publish Datasets into Catalog

Get the ID of the catalog that was specified in the user inputs at the beginning of this notebook.

In [8]:
# Get the id of the catalog whose name was entered in the user inputs above
catalog_endpoint=host+"/v2/catalogs"
# Create new header for the requests
headers = {
    'Content-Type': "application/json",
    'Authorization': "Bearer "+accessToken
}

# Get request that returns all the catalogs
get_catalog=s.get(catalog_endpoint,verify=False, headers=headers)

# Find the catalog with the specified name and store its id in catalog_id
try:
    get_catalog_json=json.loads(get_catalog.text)['catalogs']
except (ValueError, KeyError):
    print("The below error has occurred. Please ensure that catalog, '" + catalog_name + "', exists")
    raise
    
catalog_id = ''
for metadata in get_catalog_json:
    if metadata['entity']['name']==catalog_name:
        catalog_id=metadata['metadata']['guid']
        print("catalog_id for",catalog_name, catalog_id)

if catalog_id == '':
    print("The provided catalog name cannot be found. Please ensure that catalog, '" + catalog_name + "', exists")
    raise ValueError("Catalog cannot be found")

catalog_id for ind_acc 3e856bf5-46da-41fd-81ad-187a0905eda5
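
If two catalogs share a name, the loop above keeps the id of the last match without any warning. The sketch below shows a stricter lookup that fails on zero or multiple matches. `find_catalog_id` and the sample data are hypothetical; the keys follow those read by the cell above.

```python
def find_catalog_id(catalogs, catalog_name):
    """Return the guid of the single catalog with this name, or raise ValueError."""
    guids = [c["metadata"]["guid"] for c in catalogs if c["entity"]["name"] == catalog_name]
    if not guids:
        raise ValueError("Catalog '" + catalog_name + "' cannot be found")
    if len(guids) > 1:
        raise ValueError("More than one catalog is named '" + catalog_name + "'")
    return guids[0]

# Hypothetical entries shaped like the 'catalogs' list of the response
sample_catalogs = [
    {"metadata": {"guid": "guid-1"}, "entity": {"name": "other"}},
    {"metadata": {"guid": "guid-2"}, "entity": {"name": "ind_acc"}},
]
```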


Get the project id. All project assets can be accessed using this project id.

In [10]:
project_id=os.environ['PROJECT_ID']

Get all existing csv files in the project folder and store the names of these files. 

In [11]:
# Payload to query all project assets
payload={"query":"*:*","limit":200}
# Endpoint to search all the assets in the project folder
asset_url=host+"/v2/asset_types/asset/search?project_id="+project_id
# Pass the headers so that the request carries the access token
get_asset=s.post(asset_url,json=payload,verify=False, headers=headers)

Next we get the asset ids of the datasets to be published to the catalog.

In [12]:
# Get the asset ids of all csv files to be published into the catalog and store the asset ids in a list

project_asset_id=[]
# Payload to query all project assets
payload={"query":"*:*","limit":200}

get_asset=s.post(host+"/v2/asset_types/asset/search?project_id="+project_id,json=payload,verify=False, headers=headers)
get_asset_json=json.loads(get_asset.text)
for j in get_asset_json['results']:
    if j['metadata']['name'] in csv_file_to_publish:
        print("Asset id of",j['metadata']['name'],":",j['metadata']['asset_id'])
        project_asset_id.append(j['metadata']['asset_id'])

Asset id of fauxweather1.csv : bdf619ec-70e4-4720-b396-423089f44d7a
Asset id of fauxsales2.csv : 1589e824-c7de-43a1-bc25-32ff1585f668
Asset id of fauxsales1.csv : 59be9cbd-2ec7-4514-a5ee-f0f257a043c2
Asset id of fauxweather2.csv : 9c42cf2e-6c79-4da3-af2a-2eaec96b78f3
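
The cell above silently skips any file in `csv_file_to_publish` that is not found among the project assets. A minimal sketch of a variant that also reports the missing names follows. `collect_asset_ids` and the sample results are hypothetical; the keys match those read above.

```python
def collect_asset_ids(results, wanted_names):
    """Map each wanted file name to its asset_id and list the names not found."""
    found = {
        r["metadata"]["name"]: r["metadata"]["asset_id"]
        for r in results
        if r["metadata"]["name"] in wanted_names
    }
    missing = [name for name in wanted_names if name not in found]
    return found, missing

# Hypothetical entries shaped like the 'results' list of the asset search response
sample_results = [
    {"metadata": {"name": "fauxsales1.csv", "asset_id": "a-1"}},
    {"metadata": {"name": "notes.txt", "asset_id": "a-2"}},
]
found, missing = collect_asset_ids(sample_results, ["fauxsales1.csv", "fauxweather1.csv"])
```

Checking `missing` before publishing makes it obvious when a data file was not added to the project.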


Using the asset ID of each dataset, publish the dataset into the catalog using the post request below, and store the new asset ID of each newly published dataset.

In [13]:
print("ASSET ID's of the published assets")
# Create an empty dictionary
catalog_asset_ids={}
for asset_id in project_asset_id:
    # Publish each of the selected project assets into the catalog
    # Payload to publish the asset
    payload={"mode":0,"catalog_id":catalog_id,"metadata":{}}
    # Endpoint to publish the asset
    asset_publish_url=host+"/v2/assets/"+asset_id+"/publish?project_id="+project_id
    # Post request with endpoint, header and payload, sent through the shared session
    publishasset=s.post(asset_publish_url,json=payload,headers=headers,verify=False)
    # Parse the json text returned by the endpoint
    publishasset_json=json.loads(publishasset.text)
    # Extract the name of the published csv file and its new asset id and add them to the dictionary
    catalog_asset_ids[publishasset_json['metadata']['name']]=publishasset_json['asset_id']
    
print(catalog_asset_ids)

ASSET ID's of the published assets
{'fauxweather1.csv': '7697bf8c-2aaa-4ab5-85b4-41170cb0c438', 'fauxsales2.csv': '45d41c0f-8d37-411b-a330-7142d4572597', 'fauxsales1.csv': 'dfb26431-94a5-4f7b-8f4f-432d7297ab29', 'fauxweather2.csv': '0e4e25ed-4f5c-4f13-aacb-6b022b9ac2c2'}
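
The publish cell reads `metadata.name` and `asset_id` from the response without checking whether the request succeeded, so a failed publish surfaces as a `KeyError`. The sketch below adds a status check. `publish_asset` is a hypothetical helper; the endpoint and payload are copied from the cell above, and treating any status of 400 or higher as a failure is an assumption.

```python
def publish_asset(session, host, headers, project_id, catalog_id, asset_id):
    """Publish one project asset into the catalog and return (name, new_asset_id)."""
    url = host + "/v2/assets/" + asset_id + "/publish?project_id=" + project_id
    payload = {"mode": 0, "catalog_id": catalog_id, "metadata": {}}
    res = session.post(url, json=payload, headers=headers, verify=False)
    if res.status_code >= 400:
        # Assumption: error responses carry a readable message in the body
        raise ValueError("Publishing asset " + asset_id + " failed: " + res.text)
    body = res.json()
    return body["metadata"]["name"], body["asset_id"]
```

In the loop above this would be used as `name, new_id = publish_asset(s, host, headers, project_id, catalog_id, asset_id)`.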


### 4. Assign Business Terms to Column Headers

Read in the file with business terms and their associated column headers and view a sample of the data.

In [14]:
my_file = project.get_file(terms_file)
my_file.seek(0)
map_terms = pd.read_csv(my_file)

In [15]:
print(map_terms.shape)
map_terms.head()

(246, 4)


| | Business Terms | Column_header | Table | File |
|---|---|---|---|---|
| 0 | Air temperature | TemperatureLocalDayMin | WEATHER-SIGNALS | REALWEATHER |
| 1 | Air temperature | TemperatureLocalAfternoonMin | WEATHER-SIGNALS | REALWEATHER |
| 2 | Air temperature | TemperatureLocalEveningAvg | WEATHER-SIGNALS | REALWEATHER |
| 3 | Air temperature | TemperatureLocalDaytimeMax | WEATHER-SIGNALS | REALWEATHER |
| 4 | Air temperature | TemperatureLocalEveningMax | WEATHER-SIGNALS | REALWEATHER |


In [16]:
df_terms.head()

| | entity.artifacts.global_id | metadata.name |
|---|---|---|
| 0 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_7dd0fc97-... | Precipitation type |
| 1 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_39ba2c2d-... | Relative humidity |
| 2 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_20749e45-... | Alert instruction |
| 3 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_6824bd87-... | Forecasted trajectory |
| 4 | 5d2d5419-0032-4c64-90e2-ce68c6997bb5_52de759c-... | Almanac period |


Join the `df_terms` and `map_terms` dataframes and map each column header to a business term. The code below loops through each file published into the catalog (four files in our case) and performs the following tasks:

1. Create a dataframe with column headers in the catalog and associated business term and term ids.
2. Fetch catalog asset id for each csv in the catalog.
3. Create a column_info attribute for all the files in the catalog.
4. Map column header to the business terms. 
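
The join and the patch payload can be tried on a small example without a cluster. In the sketch below the two dataframes are made-up stand-ins for `map_terms` and `df_terms`, the term id is a placeholder, and `build_patch` is a hypothetical helper that mirrors the payload used in the next cell.

```python
import pandas as pd

# Made-up stand-ins for map_terms and df_terms
map_terms = pd.DataFrame({
    "Business Terms": ["Air temperature", "Air temperature", "Unpublished term"],
    "Column_header": ["TemperatureLocalDayMin", "TemperatureLocalDayMax ", "SomeColumn"],
    "File": ["REALWEATHER", "REALWEATHER", "REALWEATHER"],
})
df_terms = pd.DataFrame({
    "entity.artifacts.global_id": ["cat-1_term-2"],
    "metadata.name": ["Air temperature"],
})

# The inner join keeps only the headers whose business term was published
merged = pd.merge(map_terms, df_terms, left_on="Business Terms",
                  right_on="metadata.name", how="inner").drop_duplicates()

def build_patch(row):
    """Build one 'add' operation that attaches a business term to a column header."""
    return [{
        "op": "add",
        "path": "/" + row["Column_header"].strip(),
        "value": {"column_terms": [{
            "term_display_name": row["Business Terms"],
            "term_id": row["entity.artifacts.global_id"],
        }]},
        "attribute": "column_info",
    }]

patches = [build_patch(row) for _, row in merged.iterrows()]
```

Because the join is an inner join, a header whose business term has not been published is dropped without a message; a left join with `indicator=True` is one way to list such headers.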

In [18]:
# Join the mapping csv with the published terms to get the term id of each business term,
# and drop any duplicates to avoid multiple mappings for the same term
map_terms=map_terms.sort_values(by=['File','Column_header'])
Terms_Headers=pd.merge(map_terms,df_terms,left_on='Business Terms',right_on='metadata.name',how='inner')
Terms_Headers=Terms_Headers.drop_duplicates()

# For every file published into the catalog do the following
for file in catalog_asset_ids:
    # Derive the value used in the File column of the mapping csv from the file name,
    # for example fauxsales1.csv becomes FAUXSALES
    filetext=file.upper().replace(".CSV","")
    filetext=re.sub(r'\d', '', filetext)
    # The weather files are matched against the REALWEATHER rows of the mapping csv
    if filetext=="FAUXWEATHER":
        filetext="REALWEATHER"
    
    Terms_Headers_new=Terms_Headers[Terms_Headers.File==filetext].copy()
    # Catalog asset id of this csv
    catalog_asset_id=catalog_asset_ids[file]
    print(file,  catalog_asset_id)

    # Create an empty column_info attribute on the catalog asset.
    # This attribute is necessary to map the business terms to the column headers
    payload={"name": "column_info", "entity":{}}
    t=s.post(host+"/v2/assets/"+catalog_asset_id+"/attributes?catalog_id="+catalog_id,json=payload,headers=headers,verify=False)

    # Endpoint for the patch requests
    url=host+"/v2/assets/"+catalog_asset_id+"/attributes/column_info?catalog_id="+catalog_id

    # For each column header in the file, map its corresponding business term retrieved from the above join
    i=0
    for index, rows in Terms_Headers_new.iterrows(): 
        i+=1
        print(i,rows.Column_header.strip(), "is mapped to", rows['Business Terms'])
        # Payload for the patch request that maps the header to the business term using its term_id
        payload=[{"op":"add","path":"/"+rows.Column_header.strip(),"value":{"column_terms":[{"term_display_name":rows['Business Terms'],"term_id":rows["entity.artifacts.global_id"]}]},"attribute":"column_info"}]
        # Patch request sent through the shared session
        patch_attribute=s.patch(url,json=payload,headers=headers,verify=False)
        # Report a failed mapping instead of silently ignoring it
        if patch_attribute.status_code >= 400:
            print("Mapping failed for", rows.Column_header.strip(), ":", patch_attribute.text)

fauxweather1.csv 7697bf8c-2aaa-4ab5-85b4-41170cb0c438
1 date is mapped to Observation date time
2 postalcode is mapped to Postal Code
3 DewpointLocalAfternoonAvg is mapped to Dew point temperature
4 DewpointLocalAfternoonMax is mapped to Dew point temperature
5 DewpointLocalAfternoonMin is mapped to Dew point temperature
6 DewpointLocalDayAvg is mapped to Dew point temperature
7 DewpointLocalDayMax is mapped to Dew point temperature
8 DewpointLocalDayMin is mapped to Dew point temperature
9 DewpointLocalDaytimeAvg is mapped to Dew point temperature
10 DewpointLocalDaytimeMax is mapped to Dew point temperature
11 DewpointLocalDaytimeMin is mapped to Dew point temperature
12 DewpointLocalEveningAvg is mapped to Dew point temperature
13 DewpointLocalEveningMax is mapped to Dew point temperature
14 DewpointLocalEveningMin is mapped to Dew point temperature
15 DewpointLocalMorningAvg is mapped to Dew point temperature
16 DewpointLocalMorningMax is mapped to Dew point temperature
17 Dewpoint

134 SnowAmountLocalDayMin is mapped to Snow accumulation amount
135 SnowAmountLocalDaytimeAvg is mapped to Snow accumulation amount
136 SnowAmountLocalDaytimeMax is mapped to Snow accumulation amount
137 SnowAmountLocalDaytimeMin is mapped to Snow accumulation amount
138 SnowAmountLocalEveningAvg is mapped to Snow accumulation amount
139 SnowAmountLocalEveningMax is mapped to Snow accumulation amount
140 SnowAmountLocalEveningMin is mapped to Snow accumulation amount
141 SnowAmountLocalMorningAvg is mapped to Snow accumulation amount
142 SnowAmountLocalMorningMax is mapped to Snow accumulation amount
143 SnowAmountLocalMorningMin is mapped to Snow accumulation amount
144 SnowAmountLocalNighttimeAvg is mapped to Snow accumulation amount
145 SnowAmountLocalNighttimeMax is mapped to Snow accumulation amount
146 SnowAmountLocalNighttimeMin is mapped to Snow accumulation amount
147 SnowAmountLocalOvernightAvg is mapped to Snow accumulation amount
148 SnowAmountLocalOvernightMax is mapped to

21 DewpointLocalOvernightAvg is mapped to Dew point temperature
22 DewpointLocalOvernightMax is mapped to Dew point temperature
23 DewpointLocalOvernightMin is mapped to Dew point temperature
24 FeelsLikeLocalAfternoonAvg is mapped to Feels like temperature
25 FeelsLikeLocalAfternoonMax is mapped to Feels like temperature
26 FeelsLikeLocalAfternoonMin is mapped to Feels like temperature
27 FeelsLikeLocalDayAvg is mapped to Feels like temperature
28 FeelsLikeLocalDayMax is mapped to Feels like temperature
29 FeelsLikeLocalDayMin is mapped to Feels like temperature
30 FeelsLikeLocalDaytimeAvg is mapped to Feels like temperature
31 FeelsLikeLocalDaytimeMax is mapped to Feels like temperature
32 FeelsLikeLocalDaytimeMin is mapped to Feels like temperature
33 FeelsLikeLocalEveningAvg is mapped to Feels like temperature
34 FeelsLikeLocalEveningMax is mapped to Feels like temperature
35 FeelsLikeLocalEveningMin is mapped to Feels like temperature
36 FeelsLikeLocalMorningAvg is mapped to Feels

151 TemperatureLocalAfternoonMax is mapped to Air temperature
152 TemperatureLocalAfternoonMin is mapped to Air temperature
153 TemperatureLocalDayAvg is mapped to Air temperature
154 TemperatureLocalDayMax is mapped to Air temperature
155 TemperatureLocalDayMin is mapped to Air temperature
156 TemperatureLocalDaytimeAvg is mapped to Air temperature
157 TemperatureLocalDaytimeMax is mapped to Air temperature
158 TemperatureLocalDaytimeMin is mapped to Air temperature
159 TemperatureLocalEveningAvg is mapped to Air temperature
160 TemperatureLocalEveningMax is mapped to Air temperature
161 TemperatureLocalEveningMin is mapped to Air temperature
162 TemperatureLocalMorningAvg is mapped to Air temperature
163 TemperatureLocalMorningMax is mapped to Air temperature
164 TemperatureLocalMorningMin is mapped to Air temperature
165 TemperatureLocalNighttimeAvg is mapped to Air temperature
166 TemperatureLocalNighttimeMax is mapped to Air temperature
167 TemperatureLocalNighttimeMin is mapped t

The specified datasets are now published to the catalog and their column headers are mapped to their associated business terms. 

Navigate to the path below to verify the mappings created above. <br>

**All Catalogs --> New Catalog --> csv file --> any column header from the above list**.

The associated business term for the column header is displayed.

In [19]:
s.close()