# DDL - Data Pipeline Example - Create an asset 

This notebook explores how to create a data asset using the DDL Data Pipeline. It covers creating an asset with associated datasets as well as creating an asset without them. 

While this notebook includes most of the 'need to know' information on using the pipeline, explore the full documentation by running the following code. 

``` 
from DDL_Pipeline import explore_docs
explore_docs()
```

**Import package**

In [51]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append('C:/Users/alightner/Documents/API_Packages/DDL_Pipeline')

from DDL_Pipeline._create import create_asset

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Generate the create_asset object

The first step is to generate the `create_asset` object. After the object is created and named, we can add its various 'parts', such as attachments, metadata, and datasets. 

In [2]:
ddl = create_asset()

**A bit of exploration**

The object starts with a set of default attributes, which we will set throughout the process. 

In [3]:
ddl.asset_name

'asset name'

In [4]:
ddl.asset_description

'asset descripton'

Here is a list of all the attributes of the `create_asset()` object. These range from value attributes, such as names and descriptions, to methods such as `add_data()` or `read_asset_metadata()`. 



In [5]:
print(dir(ddl))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'add_consentfile', 'add_data', 'add_metadata', 'api_key', 'asset_description', 'asset_name', 'asset_uid', 'associated_datasets', 'client_close', 'create_parent', 'datasets', 'delete_asset', 'folder_path', 'isParent', 'link', 'login', 'prod', 'push_asset_metadata', 'push_attachments', 'push_datasets', 'read_asset_metadata', 'reset_environment', 'today']
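
The raw `dir()` output mixes Python's built-in dunder names with the pipeline's own attributes. As a minimal sketch, the helper below separates value attributes from methods. The `Example` class and the `public_members` helper are hypothetical stand-ins for illustration and are not part of DDL_Pipeline.

```python
class Example:
    """Hypothetical stand-in for the object returned by create_asset()."""
    asset_name = 'asset name'

    def add_data(self):
        return []


def public_members(obj):
    """Split the non-dunder names of obj into value attributes and callables."""
    names = [n for n in dir(obj) if not n.startswith('__')]
    values = [n for n in names if not callable(getattr(obj, n))]
    methods = [n for n in names if callable(getattr(obj, n))]
    return values, methods


values, methods = public_members(Example())
print(values)   # names of value attributes
print(methods)  # names of methods
```

Calling `public_members(ddl)` on the real object should work the same way, since it relies only on `dir()`, `getattr()`, and `callable()`.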


## 2. Login and set initial attributes

Many of these fields will be set automatically by helper functions. Some, however, require our input. 

**Login**

In [6]:
ddl.login()

Socrata user? (email address) mdaniels@devtechsys.com
Socrata password?  ············


The login function takes this information and generates an authentication object for the socrata-py package and a client connection via the sodapy package. 

In [7]:
ddl.auth_obj

<socrata.authorization.Authorization at 0x184902256d8>

In [8]:
ddl.client

<sodapy.Socrata at 0x18490225a58>

**Read the asset_metadata.csv file.** While we can set each metadata field separately, it is often easier to place all the metadata in a CSV file (following the naming conventions in the CSV file) and fill the fields from it. 

**Steps:**
1. Set the folder path to the asset. 
2. Place the asset_metadata.csv file at the root of that directory. 
3. Apply the `read_asset_metadata()` method to the `create_asset()` object. 

The method sets asset attributes such as the name and description, and also returns the metadata as a dictionary, which can be useful for exploration. 

In [9]:
# note that the ending '/' is necessary
ddl.folder_path = './Example_Asset/'
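
A trailing slash usually matters when file paths are built by plain string concatenation. Whether the pipeline does this internally is an assumption, since the source does not show it. The sketch below illustrates the failure mode and two standard-library alternatives that do not depend on the trailing slash.

```python
import os
from pathlib import Path

folder = './Example_Asset'  # note: no trailing slash

# Plain concatenation fuses the folder and file names together.
bad = folder + 'asset_metadata.csv'

# os.path.join and pathlib insert the separator when it is missing.
good = os.path.join(folder, 'asset_metadata.csv')
also_good = Path(folder) / 'asset_metadata.csv'

print(bad)
print(good)
print(also_good)
```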

In [97]:
metadata = ddl.read_asset_metadata()
#metadata

The metadata attributes are now filled in at the asset level.

In [70]:
ddl.asset_name

'Construction Survey'

In [71]:
ddl.asset_uid

'6ee23912-b343-4ba6-bca7-77a310292f10-000V01'

The method also puts the metadata in the format required to push it to Socrata. 

In [122]:
#ddl.asset_metadata
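
To give a rough idea of what reading a metadata CSV into a dictionary can look like, here is a minimal sketch using only the standard library. The two-column `field,value` layout, the sample rows, and the `read_metadata` helper are assumptions made for illustration. The real asset_metadata.csv and `read_asset_metadata()` may be organized differently.

```python
import csv
import io

# Hypothetical two-column layout; the real asset_metadata.csv may differ.
sample_file = io.StringIO(
    "field,value\n"
    "asset_name,Construction Survey\n"
    "asset_description,placeholder description\n"
)


def read_metadata(handle):
    """Read 'field,value' rows into a dictionary keyed by field name."""
    reader = csv.DictReader(handle)
    return {row['field']: row['value'] for row in reader}


sample_metadata = read_metadata(sample_file)
print(sample_metadata['asset_name'])
```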

## 2. Create the initial asset

The next step is to create an empty data asset (with only the name and description). We then add additional features as we go. 

In [14]:
ddl.asset_name = ddl.asset_name + ' Example'

In [15]:
ddl.create_parent()

'9e7p-y54g'

## 3. Add Attachments

The `push_attachments()` method takes each document in the 'Attachments' directory under `folder_path` and adds the file as an attachment to the asset. Windows records file paths slightly differently, so if you are using a Windows computer, set `windows=True`. 
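
To see why the operating system matters here, the sketch below extracts a file name from a path string using `pathlib`. The `file_name` helper and the example paths are hypothetical. How the pipeline itself handles the `windows` flag is not shown in this notebook.

```python
from pathlib import PurePosixPath, PureWindowsPath


def file_name(path, windows=False):
    """Return the final component of a path string for the given OS style."""
    cls = PureWindowsPath if windows else PurePosixPath
    return cls(path).name


win_path = r'Example_Asset\Attachments\PBAAB700.pdf'
posix_path = 'Example_Asset/Attachments/PBAAB700.pdf'

print(file_name(win_path, windows=True))  # file name only
print(file_name(posix_path))              # file name only
print(file_name(win_path))                # wrong style: the whole string comes back
```

Parsing a Windows path with POSIX rules treats the backslashes as ordinary characters, which is the kind of mismatch a flag like `windows` guards against.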

In [77]:
attachments = ddl.push_attachments(windows = True)

Attempting to change PB-AAB-700 to a DEC title.
USAID construction assessment


Although `push_attachments()` returns the values for convenience, the results are also saved automatically as the `attachments` attribute of the `create_asset()` object. 

In [78]:
ddl.attachments

[{'asset_id': '3a2c1d21-e31a-4e0c-b252-fe01af9feacb',
  'filename': 'Informed_Consent_Document.pdf',
  'name': 'Informed Consent Document'},
 {'asset_id': '31ad5f89-877c-4b1b-b27e-6bd18e35fc5f',
  'filename': 'PBAAB700.pdf',
  'name': 'USAID construction assessment'},
 {'asset_id': 'f7c6db10-7220-4cbf-8dec-5eef5facc981',
  'filename': 'USAID_Construction_Assessment_Questionnaire.pdf',
  'name': 'USAID_Construction_Assessment_Questionnaire'}]

## 4. Set/Generate the Informed Consent file 

In addition to the general attachments, we also identify the informed consent document, which may be one of the attachments or a page within a particular file. The file name and page number can be set in asset_metadata.csv or set manually in the program. 

In [18]:
ddl.informed_consent_file

''

For this demonstration, we will set the consent file name and page number manually. 

In [19]:
ddl.informed_consent_file = 'PBAAB700.pdf'
ddl.consent_pagenumber = 4

In [20]:
ddl.add_consentfile()

'https://usaid-ddl-dev.data.socrata.com/api/views/9e7p-y54g/files/a4131566-9d7f-4f0d-b409-ce795bc46e1a (Informed_Consent_Document.pdf)'

## 5. Prepare datasets to be pushed to socrata 

Several modifications are needed before the associated datasets can be pushed to Socrata. These include: 

1. Fix errors in the coding of missing values ('nan', '.', etc.). 
2. Divide datasets that have too many columns. 
3. Find column descriptions. 
4. If a dataset was divided, update the descriptions and GUIDs. 
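
The first two steps and the section naming can be sketched in plain Python as follows. This is an illustration of the idea, not the code that `add_data()` runs. The marker set, the column limit, and the helper names are assumptions.

```python
MISSING_MARKERS = {'nan', '.'}  # assumed markers, taken from the examples above
MAX_COLUMNS = 3                 # hypothetical limit, for illustration only


def recode_missing(rows):
    """Replace missing-value markers with None in a list of row dictionaries."""
    return [
        {k: (None if str(v).strip() in MISSING_MARKERS else v) for k, v in row.items()}
        for row in rows
    ]


def split_columns(columns, max_columns=MAX_COLUMNS):
    """Split a column list into consecutive chunks of at most max_columns."""
    return [columns[i:i + max_columns] for i in range(0, len(columns), max_columns)]


rows = [{'id': 1, 'a': 'nan', 'b': '.', 'c': 5, 'd': 'x'}]
clean = recode_missing(rows)
sections = split_columns(list(clean[0]))
names = [f'Example: Section {n}' for n, _ in enumerate(sections, start=1)]

print(clean)
print(sections)
print(names)
```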



In [21]:
datasets = ddl.add_data()

['Analysis', 'Awards', 'Subawards']
Analysis
Awards
Subawards


In [75]:
datasets[7]['description']

'In the process of migrating data to the current DDL platform, datasets with a large number of variables required splitting into multiple spreadsheets. They should be reassembled by the user to understand the data fully. This is the fourth spreadsheet in the USAID Construction Assessment, Subawards.'

In [23]:
[d['name'] for d in datasets]

['USAID Construction Assessment, Analysis',
 'USAID Construction Assessment, Primary Awards: Section 1',
 'USAID Construction Assessment, Primary Awards: Section 2',
 'USAID Construction Assessment, Primary Awards: Section 3',
 'USAID Construction Assessment, Subawards: Section 1',
 'USAID Construction Assessment, Subawards: Section 2',
 'USAID Construction Assessment, Subawards: Section 3',
 'USAID Construction Assessment, Subawards: Section 4',
 'USAID Construction Assessment, Subawards: Section 5',
 'USAID Construction Assessment, Subawards: Section 6',
 'USAID Construction Assessment, Subawards: Section 7',
 'USAID Construction Assessment, Subawards: Section 8',
 'USAID Construction Assessment, Subawards: Section 9',
 'USAID Construction Assessment, Subawards: Section 10',
 'USAID Construction Assessment, Subawards: Section 11',
 'USAID Construction Assessment, Subawards: Section 12',
 'USAID Construction Assessment, Subawards: Section 13']

In [24]:
datasets[0]['codebook_info']['column_matches']

129

In [36]:
ddl.datasets = datasets
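
Before pushing anything, it can help to run a quick sanity check over the prepared datasets. The sketch below assumes only that each dataset is a dictionary with `'name'` and `'description'` keys, as seen in the cells above. The `check_datasets` helper and the sample data are hypothetical.

```python
from collections import Counter


def check_datasets(datasets):
    """Return a list of problems found in a list of dataset dictionaries."""
    problems = []
    for i, d in enumerate(datasets):
        for key in ('name', 'description'):
            if not d.get(key):
                problems.append(f'dataset {i} is missing {key!r}')
    # Names should be unique so that sections are distinguishable.
    counts = Counter(d.get('name') for d in datasets if d.get('name'))
    problems += [f'duplicate name: {n}' for n, c in counts.items() if c > 1]
    return problems


sample = [
    {'name': 'Survey, Analysis', 'description': 'Analysis table'},
    {'name': 'Survey, Analysis', 'description': ''},
]
problems = check_datasets(sample)
print(problems)
```

An empty list means no problems were found by these particular checks. It does not guarantee that the push will succeed.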

## 6. Push the datasets to socrata 

After checking that the changes made to the datasets are appropriate, the next step is to push the datasets to Socrata. 

In [None]:
associated_datasets = ddl.push_datasets(quiet=False)

In [118]:
associated_datasets

[{},
 {'name': 'USAID Construction Assessment, Analysis',
  'description': 'This dataset contains data on the primary awards identified in the survey of USAID construction carried out between June 1, 2011 to June 20 to learn about the character, scope, value and management of USAID supported construction activities.',
  'uid': 'vxgm-jksz',
  'urls': {'dataset': 'https://data.usaid.gov/d/vxgm-jksz'},
  'title': 'USAID Construction Assessment, Analysis'}]
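
The returned list can contain empty dictionaries, as in the output above. A minimal sketch for skipping those and collecting a uid-to-URL lookup, using a trimmed copy of that output, might look like this:

```python
# Trimmed copy of the output shown above.
associated = [
    {},
    {'name': 'USAID Construction Assessment, Analysis',
     'uid': 'vxgm-jksz',
     'urls': {'dataset': 'https://data.usaid.gov/d/vxgm-jksz'}},
]

# Keep only entries that carry a uid, then map uid -> dataset URL.
links = {d['uid']: d['urls']['dataset'] for d in associated if d.get('uid')}
print(links)
```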

## 7. Add metadata to asset 

The `push_asset_metadata()` method takes the results from `ddl.attachments` and `ddl.associated_datasets`, adds them to the asset metadata, and pushes the updated metadata fields to Socrata. 

In [120]:
r = ddl.push_asset_metadata()

## 8. Delete if just a trial

The `delete_asset()` method will ask whether you want to delete the asset and the associated datasets. 

In [125]:
ddl.delete_asset()

## 9. Close the client connection

In [126]:
ddl.client.close()
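
If an earlier step raises an error, an explicit `close()` call at the end of the notebook is never reached. One way to guarantee cleanup is `contextlib.closing`, sketched below with a hypothetical `FakeClient` standing in for any object that exposes a `close()` method.

```python
from contextlib import closing


class FakeClient:
    """Hypothetical stand-in for a client object that exposes close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


client = FakeClient()
try:
    with closing(client):
        # Simulate a failure partway through the pipeline.
        raise RuntimeError('simulated failure mid-pipeline')
except RuntimeError:
    pass

print(client.closed)  # True: close() ran even though an error was raised
```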