# Custom Python Transforms
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

Sometimes the easiest way to express a transformation is to write some Python code. This SDK provides three extension points that you can use:

1. New Script Column
2. New Script Filter
3. Transform Partition

Each of these is supported in both the scale-up and the scale-out runtime. A key advantage of using these extension points is that you don't need to pull all of the data just to create a dataframe. Your custom Python code runs like any other transform: at scale, by partition, and typically in parallel.

## Initial data prep

We start by loading some data from Azure Blob.

In [1]:
import azureml.dataprep as dprep
col = dprep.col

df = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv', skip_rows=1)
df.head(5)

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
0,ALABAMA,1,101710,Hale County,10171002158,Greensboro Elem Sch,299,82,.,.,...,.,.,.,.,.,.,.,.,.,.
1,ALABAMA,1,101710,Hale County,10171002162,Greensboro High Sch,94,55-59,.,.,...,.,.,.,.,.,.,.,.,.,.
2,ALABAMA,1,101710,Hale County,10171002156,Greensboro Middle Sch,287,63,.,.,...,.,.,.,.,.,.,.,.,.,.
3,ALABAMA,1,101710,Hale County,10171000588,Hale Co High Sch,257,74,2,PS,...,.,.,.,.,.,.,.,.,.,.
4,ALABAMA,1,101710,Hale County,10171000589,Moundville Elem Sch,304,95,.,.,...,.,.,.,.,.,.,.,.,.,.


We trim the dataset down and do some basic transforms.

In [2]:
df = df.keep_columns(['stnam', 'leanm10', 'ncessch', 'MAM_MTH00numvalid_1011'])
df = df.replace_na(columns=['leanm10', 'MAM_MTH00numvalid_1011'], custom_na_list='.')
df = df.to_number(['ncessch', 'MAM_MTH00numvalid_1011'])
df.head(5)

Unnamed: 0,stnam,leanm10,ncessch,MAM_MTH00numvalid_1011
0,ALABAMA,Hale County,10171000000.0,
1,ALABAMA,Hale County,10171000000.0,
2,ALABAMA,Hale County,10171000000.0,
3,ALABAMA,Hale County,10171000000.0,2.0
4,ALABAMA,Hale County,10171000000.0,


We use a filter to look for null values. The filter returns rows, so there are missing values, and next we'll look at a way to fill them.

In [3]:
df.filter(col('MAM_MTH00numvalid_1011').is_null()).head(5)

Unnamed: 0,stnam,leanm10,ncessch,MAM_MTH00numvalid_1011
0,ALABAMA,Hale County,10171000000.0,
1,ALABAMA,Hale County,10171000000.0,
2,ALABAMA,Hale County,10171000000.0,
3,ALABAMA,Hale County,10171000000.0,
4,ALABAMA,Hale County,10171000000.0,


## Transform Partition

We want to replace all null values with 0, so we use a handy pandas function. The code runs on one partition at a time, not on the whole dataset at once. On a large dataset, it may therefore run in parallel as the runtime processes the data partition by partition.

In [4]:
pt_df = df
df = pt_df.transform_partition("""
def transform(df, index):
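    # Assign the result back; inplace=True on a column selection is unreliable in recent pandas.
    df['MAM_MTH00numvalid_1011'] = df['MAM_MTH00numvalid_1011'].fillna(0)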
    return df
""")
h = df.head(5)
h

Unnamed: 0,stnam,leanm10,ncessch,MAM_MTH00numvalid_1011
0,ALABAMA,Hale County,10171000000.0,0.0
1,ALABAMA,Hale County,10171000000.0,0.0
2,ALABAMA,Hale County,10171000000.0,0.0
3,ALABAMA,Hale County,10171000000.0,2.0
4,ALABAMA,Hale County,10171000000.0,0.0
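
To make "run by partition" more concrete, here is a minimal sketch that simulates it with plain pandas, without the DataPrep runtime. The helper `run_by_partition`, the sample data, and the choice of two partitions are hypothetical stand-ins for what the runtime does; only the `transform(df, index)` signature comes from the example above.

```python
import numpy as np
import pandas as pd


def transform(df, index):
    # Same idea as the cell above: fill missing counts with 0.
    # The result is assigned back instead of using inplace=True.
    df['MAM_MTH00numvalid_1011'] = df['MAM_MTH00numvalid_1011'].fillna(0)
    return df


def run_by_partition(data, n_partitions):
    # Hypothetical stand-in for the runtime: split the rows into chunks,
    # call transform on each chunk with its partition index, then recombine.
    bounds = np.linspace(0, len(data), n_partitions + 1, dtype=int)
    parts = [data.iloc[bounds[i]:bounds[i + 1]].copy() for i in range(n_partitions)]
    return pd.concat([transform(part, i) for i, part in enumerate(parts)])


sample = pd.DataFrame({
    'leanm10': ['Hale County'] * 4,
    'MAM_MTH00numvalid_1011': [np.nan, np.nan, 2.0, np.nan],
})
result = run_by_partition(sample, n_partitions=2)
print(result['MAM_MTH00numvalid_1011'].tolist())  # [0.0, 0.0, 2.0, 0.0]
```

Because each call sees only its own partition, a function like this should not rely on statistics of the whole dataset (for example, a global mean) unless they are computed beforehand.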


### Transform Partition With File

Being able to use any Python code to manipulate your data as a pandas DataFrame is extremely useful for complex and specific data operations that DataPrep doesn't handle natively. Unfortunately, code that sits inside a string isn't very testable.
To improve testability and make scripts easier to write, there is another transform_partition interface, `transform_partition_with_file`, that takes the path to a Python script. The script must contain a function matching the `transform` signature defined above.

The `script_path` argument should be a relative path to ensure Dataflow portability. Here `map_func.py` contains the same code as in the previous example.

In [5]:
df = pt_df.transform_partition_with_file('../data/map_func.py')
h = df.head(5)
h

Unnamed: 0,stnam,leanm10,ncessch,MAM_MTH00numvalid_1011
0,ALABAMA,Hale County,10171000000.0,0.0
1,ALABAMA,Hale County,10171000000.0,0.0
2,ALABAMA,Hale County,10171000000.0,0.0
3,ALABAMA,Hale County,10171000000.0,2.0
4,ALABAMA,Hale County,10171000000.0,0.0
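
Keeping the function in a file means it can be exercised like any other Python code. The following is a minimal sketch, assuming `map_func.py` holds the `transform` function from the Transform Partition example; the test function name and the sample data are illustrative and not part of the SDK.

```python
import pandas as pd


# Assumed contents of map_func.py: the transform from the earlier example,
# written here with assignment rather than inplace=True.
def transform(df, index):
    df['MAM_MTH00numvalid_1011'] = df['MAM_MTH00numvalid_1011'].fillna(0)
    return df


# A plain unit test that runs without the DataPrep runtime.
def test_transform_fills_missing_counts():
    partition = pd.DataFrame({'MAM_MTH00numvalid_1011': [None, 2.0]})
    out = transform(partition, 0)
    assert out['MAM_MTH00numvalid_1011'].tolist() == [0.0, 2.0]
    assert not out['MAM_MTH00numvalid_1011'].isna().any()


test_transform_fills_missing_counts()
```

In a real project the test would import `transform` from the script file instead of redefining it, so the tested code and the code passed to the dataflow stay identical.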


## New Script Column

We want to create a new column that has the county name and the state name. We also want the state name to be title cased. We can do this with Python code, using the `new_script_column()` method on the dataflow.

In [6]:
df = df.new_script_column(new_column_name='county_state', insert_after='leanm10', script="""
def newvalue(row):
    return row['leanm10'] + ', ' + row['stnam'].title()
""")
h = df.head(5)
h

Unnamed: 0,stnam,leanm10,county_state,ncessch,MAM_MTH00numvalid_1011
0,ALABAMA,Hale County,"Hale County, Alabama",10171000000.0,0.0
1,ALABAMA,Hale County,"Hale County, Alabama",10171000000.0,0.0
2,ALABAMA,Hale County,"Hale County, Alabama",10171000000.0,0.0
3,ALABAMA,Hale County,"Hale County, Alabama",10171000000.0,2.0
4,ALABAMA,Hale County,"Hale County, Alabama",10171000000.0,0.0
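
The `newvalue` function can be tried out on its own before it is handed to the dataflow. The sketch below assumes a row can be indexed by column name like a dictionary, as the script above does. The `newvalue_safe` variant is hypothetical: it additionally assumes that missing values arrive as `None`, which matters here because `leanm10` had its `.` placeholders replaced with nulls earlier.

```python
def newvalue(row):
    # Same function as in the cell above.
    return row['leanm10'] + ', ' + row['stnam'].title()


def newvalue_safe(row):
    # Hypothetical defensive variant: return None instead of raising
    # when either input is missing (assumed to be represented as None).
    county, state = row['leanm10'], row['stnam']
    if county is None or state is None:
        return None
    return county + ', ' + state.title()


row = {'stnam': 'ALABAMA', 'leanm10': 'Hale County'}
print(newvalue(row))  # Hale County, Alabama
print(newvalue_safe({'stnam': 'ALABAMA', 'leanm10': None}))  # None
```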


## New Script Filter

Now we want to filter the dataset down to only rows where 'Hale' is not in the new `county_state` column. We can write a Python function that returns True to keep a row and False to drop it.

In [7]:
df = df.new_script_filter("""
def includerow(row):
    val = row['county_state']
    return 'Hale' not in val
""")
h = df.head(5)
h

Unnamed: 0,stnam,leanm10,county_state,ncessch,MAM_MTH00numvalid_1011
0,ALABAMA,Jefferson County,"Jefferson County, Alabama",10192000000.0,1.0
1,ALABAMA,Jefferson County,"Jefferson County, Alabama",10192000000.0,0.0
2,ALABAMA,Jefferson County,"Jefferson County, Alabama",10192000000.0,0.0
3,ALABAMA,Jefferson County,"Jefferson County, Alabama",10192000000.0,0.0
4,ALABAMA,Jefferson County,"Jefferson County, Alabama",10192000000.0,0.0
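
As with the other script functions, the predicate can be checked in isolation. This is a minimal sketch that applies `includerow` to a few dictionary rows, assuming rows can be indexed by column name as in the script above; the list comprehension is only a stand-in for the filtering the runtime performs.

```python
def includerow(row):
    # Same predicate as in the cell above: keep rows without 'Hale'.
    val = row['county_state']
    return 'Hale' not in val


rows = [
    {'county_state': 'Hale County, Alabama'},
    {'county_state': 'Jefferson County, Alabama'},
]
kept = [r for r in rows if includerow(r)]
print(kept)  # [{'county_state': 'Jefferson County, Alabama'}]
```

Note that `in` performs a case-sensitive substring match, so the predicate drops any row whose value contains `Hale` anywhere in the string and keeps a value such as `hale county, alabama`.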
