In [1]:
from syft import Worker
import syft as sy
import numpy as np
import pandas as pd
worker = Worker.named("test-domain-1", processes=1, reset=False)
root_domain_client = worker.root_client



SQLite Store Path:
!open file:///var/folders/q1/ryq93kwj055dlbpngxv1c7z40000gn/T/7bca415d13ed1ec841f0d0aede098dbb.sqlite

> Starting Worker: test-domain-1 - 7bca415d13ed1ec841f0d0aede098dbb - NodeType.DOMAIN - [<class 'syft.core.node.new.user_service.UserService'>, <class 'syft.core.node.new.metadata_service.MetadataService'>, <class 'syft.core.node.new.action_service.ActionService'>, <class 'syft.core.node.new.test_service.TestService'>, <class 'syft.core.node.new.dataset_service.DatasetService'>, <class 'syft.core.node.new.user_code_service.UserCodeService'>, <class 'syft.core.node.new.request_service.RequestService'>, <class 'syft.core.node.new.data_subject_service.DataSubjectService'>, <class 'syft.core.node.new.network_service.NetworkService'>, <class 'syft.core.node.new.policy_service.PolicyService'>, <class 'syft.core.node.new.message_service.MessageService'>, <class 'syft.core.node.new.project_service.ProjectService'>, <class 'syft.core.node.new.data_subject_member_service.Data

# Summary
By the end of this chapter, we're going to have cleaned up the messy 'Incident Zip' column in the service request data.

We'll do this by exploring the mock data first, and then requesting execution of the cleaning code on the full dataset.

## Get mocks

In [2]:
guest_domain_client = worker.guest_client
guest_client = guest_domain_client.login(email="jane@caltech.edu", password="abc123")

In [3]:
ds = guest_domain_client.datasets[0]

In [4]:
asset = ds.assets[0]

In [5]:
requests = asset.mock.syft_action_data

# How do we know if it's messy?
We're going to look at a few columns here. I know already that there are some problems with the zip code, so let's look at that first.

To get a sense for whether a column has problems, I usually use .unique() to look at all its values. If it's a numeric column, I'll instead plot a histogram to get a sense of the distribution.

When we look at the unique values in "Incident Zip", it quickly becomes clear that this is a mess.

Some of the problems:

* Some have been parsed as strings, and some as floats
* There are nans
* Some of the zip codes are 29616-0759 or 83
* There are some N/A values that pandas didn't recognize, like 'N/A' and 'NO CLUE'

What we can do:

* Normalize 'N/A' and 'NO CLUE' into regular nan values
* Look at what's up with the 83, and decide what to do
* Make everything strings
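To make the list above concrete, here is a minimal sketch on a made-up column. The values in `toy_zips` and the helper `to_zip_string` are hypothetical and only illustrate the kinds of problems described; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column mixing the problems listed above:
# strings, floats, nan, a 9-digit zip, and unrecognized placeholders.
toy_zips = pd.Series(["10001", 10002.0, np.nan, "29616-0759", "N/A", "NO CLUE", 83.0])

# .unique() shows every distinct value, which makes the mess visible.
print(toy_zips.unique())

# Normalize the placeholder strings into regular nan values.
cleaned = toy_zips.replace(["N/A", "NO CLUE"], np.nan)


def to_zip_string(value):
    """Turn a single value into a string, leaving missing values as nan."""
    if pd.isna(value):
        return np.nan
    if isinstance(value, float):
        # 10002.0 -> "10002"
        return str(int(value))
    return str(value)


# Make everything that is not missing a string.
cleaned = cleaned.map(to_zip_string)
print(cleaned.unique())
```

The `83` survives as a string here; deciding what to do with it is a separate step.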

pandas provides vectorized string functions, to make it easy to operate on columns containing text. We'll use them below to find and truncate the long zip codes. There are some great examples in the documentation.
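As a rough illustration of those vectorized string functions, the sketch below applies `.str.contains` and `.str.slice` to a small made-up Series. The values in `toy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy values, including one missing entry.
toy = pd.Series(["10001", "29616-0759", None, "11201"])

# Which entries contain a dash? na=False treats missing values as "no".
has_dash = toy.str.contains("-", na=False)

# Keep only the first five characters of every entry; missing values stay missing.
first_five = toy.str.slice(0, 5)

print(has_dash)
print(first_five)
```

Both calls operate on the whole column at once, so no explicit loop is needed.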

In [6]:
requests['Incident Zip'].unique()

array(['10557', '10703', '10626', ..., '10040', '10488', '10562'],
      dtype=object)

# Fixing the nan values and string/float confusion
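If we were reading the CSV ourselves, we could pass a na_values option to pd.read_csv to clean this up a little bit, and specify that the type of Incident Zip is a string, not a float. Here the data is already loaded, so we'll use replace to turn those values into nan instead.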
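For reference, a minimal sketch of the `pd.read_csv` approach looks like this. The CSV text is made up and read from memory rather than from a file; only the `na_values` and `dtype` arguments are the point.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = (
    "Unique Key,Incident Zip\n"
    "1,10001\n"
    "2,N/A\n"
    "3,NO CLUE\n"
    "4,0\n"
    "5,00083\n"
)

na_values = ["NO CLUE", "N/A", "0"]

# na_values marks the placeholders as missing while parsing;
# dtype keeps zip codes as strings, so leading zeros are preserved.
toy = pd.read_csv(
    io.StringIO(csv_text),
    na_values=na_values,
    dtype={"Incident Zip": str},
)

print(toy["Incident Zip"].unique())
```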



In [7]:
na_values = ['NO CLUE', 'N/A', '0']
# replace returns a new DataFrame, so assign the result back
requests = requests.replace(na_values, np.nan)

In [8]:
requests['Incident Zip'].unique()

array(['10557', '10703', '10626', ..., '10040', '10488', '10562'],
      dtype=object)

# What's up with the dashes?

In [9]:
rows_with_dashes = requests['Incident Zip'].str.contains('-').fillna(False)
len(requests[rows_with_dashes])

1147

In [10]:
requests[rows_with_dashes]

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
113,663921,,,,,,,,10996-1234,,...,,,,,,,,85.405230,-35.355370,
178,691392,,,,,,,,11000-1234,,...,,,,,,,,51.249293,34.319935,
216,600216,,,,,,,,10991-1234,,...,,,,,,,,-86.816864,-54.904622,
267,427512,,,,,,,,10994-1234,,...,,,,,,,,-81.581028,35.786734,
401,535755,,,,,,,,11000-1234,,...,,,,,,,,48.671431,-39.485994,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110562,970136,,,,,,,,10996-1234,,...,,,,,,,,-88.944901,29.992659,
110642,417332,,,,,,,,10993-1234,,...,,,,,,,,43.930082,-21.462159,
110891,834496,,,,,,,,10993-1234,,...,,,,,,,,-33.978987,88.131189,
110929,148327,,,,,,,,11000-1234,,...,,,,,,,,34.782395,-29.094525,


I thought these were missing data and originally deleted them like this:

`requests['Incident Zip'][rows_with_dashes] = np.nan`

But then my friend Dave pointed out that 9-digit zip codes are normal. Let's look at all the zip codes with more than 5 digits, make sure they're okay, and then truncate them.

In [11]:
long_zip_codes = requests['Incident Zip'].str.len() > 5
requests['Incident Zip'][long_zip_codes].unique()

array(['10996-1234', '11000-1234', '10991-1234', '10994-1234',
       '10998-1234', '10995-1234', '10992-1234', '10997-1234',
       '10999-1234', '10993-1234'], dtype=object)

Those all look okay to truncate to me.
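One way to make "look okay" a bit more systematic is to check the long values against the ZIP+4 pattern (five digits, a dash, four digits) before truncating. This is only a sketch on made-up values, not something the notebook runs against the dataset.

```python
import pandas as pd

# Hypothetical toy values: two ZIP+4 codes, a plain zip, a malformed one, a missing one.
toy = pd.Series(["10996-1234", "11000-1234", "10001", "1234567", None])

# Everything longer than five characters is a candidate for truncation.
is_long = toy.str.len() > 5

# ZIP+4 codes are five digits, a dash, then four digits.
is_zip_plus_4 = toy.str.fullmatch(r"\d{5}-\d{4}", na=False)

# Long values that do not match the pattern deserve a closer look.
suspicious = toy[is_long & ~is_zip_plus_4]
print(suspicious)
```

If `suspicious` is empty, truncating to five characters should be safe.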

In [12]:
requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)

Done.

Earlier I thought 00083 was a broken zip code, but it turns out Central Park's zip code is 00083! Shows what I know. I'm still concerned about the 00000 zip codes, though: let's look at those.

In [13]:
requests[requests['Incident Zip'] == '00000'] 

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location


This looks bad to me. Let's set these to nan.

In [14]:
zero_zips = requests['Incident Zip'] == '00000'
requests.loc[zero_zips, 'Incident Zip'] = np.nan

Great. Let's see where we are now:

In [15]:
unique_zips = requests['Incident Zip'].unique()
unique_zips.sort()
unique_zips

array(['10000', '10001', '10002', ..., '10998', '10999', '11000'],
      dtype=object)

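Amazing! This is much cleaner. In the original data there was something a bit weird, though -- I looked up 77056 on Google maps, and that's in Texas. The sorted mock values above stop at 11000, but the check is still worth running.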

Let's take a closer look:

In [16]:
zips = requests['Incident Zip']
# Let's say the zips starting with '0' and '1' are okay, for now. (this isn't actually true -- 13221 is in Syracuse, and why?)
is_close = zips.str.startswith('0') | zips.str.startswith('1')
# There are a bunch of NaNs, but we're not interested in them right now, so we'll say they're False
is_far = ~(is_close) & zips.notnull()

In [17]:
zips[is_far]

Series([], Name: Incident Zip, dtype: object)

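On the mock data this comes back empty, so every zip code here starts with '0' or '1'. In the original data, there really were requests coming from LA and Houston! Good to know. Filtering by zip code is probably a bad way to handle this -- we should really be looking at the city instead.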
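A small sketch of what looking at the city for far-away zip codes might involve is shown below. The rows in `toy` are hypothetical and only stand in for the two relevant columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with both a zip code and a city.
toy = pd.DataFrame(
    {
        "Incident Zip": ["10001", "77056", np.nan, "90010", "00083"],
        "City": ["NEW YORK", "Houston", "BROOKLYN", "LOS ANGELES", "NEW YORK"],
    }
)

zips = toy["Incident Zip"]

# Same rule of thumb as above: zips starting with '0' or '1' count as close.
is_close = zips.str.startswith("0", na=False) | zips.str.startswith("1", na=False)
is_far = ~is_close & zips.notnull()

# Count the cities behind the far-away zip codes, ignoring case differences.
far_cities = toy.loc[is_far, "City"].str.upper().value_counts()
print(far_cities)
```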



In [18]:
requests['City'].str.upper().value_counts()

BROOKLYN    37185
NEW YORK    37110
BRONX       36774
Name: City, dtype: int64

It looks like these are legitimate complaints, so we'll just leave them alone.

## Putting it together

Now we want to request the full code execution.

Let's put all of the cleaning steps together into a single function that we can submit.

In [19]:
@sy.syft_function(input_policy=sy.ExactMatch(df=ds.assets[0]),
                  output_policy=sy.SingleExecutionExactOutput())
def zip_codes(df):
    import pandas as pd
    import numpy as np
    na_values = ['NO CLUE', 'N/A', '0']
    def fix_zip_codes(zips):
        # Truncate everything to length 5
        zips = zips.str.slice(0, 5)

        # Set 00000 zip codes to nan
        zero_zips = zips == '00000'
        zips[zero_zips] = np.nan

        return zips
    # Normalize the placeholder values to nan, as in the exploration above
    df = df.replace(na_values, np.nan)
    df['Incident Zip'] = fix_zip_codes(df['Incident Zip'])
    result = df['Incident Zip'].unique()
    # TODO: returning list(result) works around serialization errors
    return list(result)
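Before submitting, it can help to dry-run the same cleaning logic locally. The sketch below uses plain pandas on a made-up frame, without the `sy.syft_function` decorator. The frame `toy` is hypothetical, and `mask` replaces the in-place assignment purely as a stylistic choice.

```python
import numpy as np
import pandas as pd


def fix_zip_codes(zips):
    # Truncate everything to length 5.
    zips = zips.str.slice(0, 5)
    # Set 00000 zip codes to nan without modifying the input in place.
    return zips.mask(zips == "00000", np.nan)


# Hypothetical toy frame standing in for the mock asset.
toy = pd.DataFrame({"Incident Zip": ["10996-1234", "00000", "10001", np.nan]})

toy["Incident Zip"] = fix_zip_codes(toy["Incident Zip"])
result = list(toy["Incident Zip"].unique())
print(result)
```

If the local result looks right on toy data, the decorated function is more likely to behave as expected once it is approved and executed.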

### Request code execution

In [20]:
req = guest_domain_client.api.services.code.request_code_execution(zip_codes)

In [21]:
submitted_code = guest_domain_client.code[0]

In [22]:
assert guest_domain_client.api.services.code.get_all()

### Create and submit project

In [23]:
new_project = sy.Project(name="Pandas Chapter 7",
                         description="Hi, I would like to get some insights about the zip codes of the complaints")

In [24]:
new_project.add_request(obj=submitted_code, permission=sy.UserCodeStatus.EXECUTE)

In [25]:
guest_domain_client.submit_project(new_project)