<h1 style="text-align:center">Data Science and Machine Learning Capstone Project</h1>
<p style="text-align:center">IBM: DS0720EN</p>
<img style="text-align:center" src="https://prod-edxapp.edx-cdn.org/static/edx.org/images/logo.790c9a5340cb.png">
<p style="text-align:center">1 of 4</p>

# Table of Contents
1. [Problem](#problem)
    1. [Statement](#statement)
    2. [Questions](#questions)
2. [Question addressed by this Notebook](#questionheading)
    1. [Question](#question)
3. [Answer](#answerheading)
    1. [Answer](#answer)
4. [Approach](#approach)
5. [Analysis](#analysis)
    1. [Data Handling](#ingestion)
        1. [Source](#source)
        2. [IBM Cloud](#ingestcloud)
        3. [Local](#ingestlocal)
    2. [Wrangling](#wrangling)
    3. [Images](#images)
6. [Insights](#insights)
7. [Reasoning](#reasoning)

<a id="problem"></a>
## Problem
---

<a id="statement"></a>
### Statement

New Yorkers use the 311 system to report non-emergency problems to local authorities. These problems are assigned to various New York City agencies. The Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints related to housing and buildings.

In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, their large volume and sudden increase are impacting the overall efficiency of the agency's operations.

Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year.

The agency needs answers to several questions. The answers must be supported by data and analytics. These are their questions:

<a id="questions"></a>
### Questions

1.  Which type of complaint should the Department of Housing Preservation and Development of New York City focus on first?
2.  Should the Department of Housing Preservation and Development of New York City focus on any particular set of boroughs, ZIP codes, or streets (where the complaints are severe) for the specific type of complaints you identified in response to Question 1?
3.  Does the Complaint Type that you identified in response to question 1 have an obvious relationship with any particular characteristic or characteristics of the houses or buildings?
4.  Can a predictive model be built for a future prediction of the possibility of complaints of the type that you have identified in response to question 1?

<a id="questionheading"></a>
## Question addressed by this Notebook
---

<a id="question"></a>
### Which type of complaint should the Department of Housing Preservation and Development of New York City focus on first?

<a id="answerheading"></a>
## Answer
---

<a id="answer"></a>
### Answer

<a id="approach"></a>
## Approach
---

<a id="analysis"></a>
## Analysis
---

<a id="ingestion"></a>
### Data Handling

<a id="source"></a>
#### Data Source

The course provides these data sources: [New York City 311 SODA API Endpoint](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) and [Primary Land Use Tax Lot Output PLUTO housing dataset](https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_18v1.zip).

A little poking around yields the [Open Data Page schema](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) and [SODA documentation](https://dev.socrata.com/foundry/data.cityofnewyork.us/fhrw-4uyv).

The course provides a [SODA URL](https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status).  It lets us download a subset of the entire 311 data set: it selects only certain columns, sets a maximum number of rows, and limits the results to the Department of Housing Preservation and Development.

I will create two additional SODA URLs to use for this project.

The first quiz only asks about complaints through December 2018, so the first URL stops at the end of 2018:

[Through 2018 Only](https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status&$where=created_date%3C%3D%272018-12-31T23:59:59.999%27)

The data set is still very large, so I will break off just one year to use during initial examination when I am figuring out what approach I will take:

[2018 Only](https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status&$where=created_date%20between%20%272018-01-01T00:00:00.000%27%20and%20%272018-12-31T23:59:59.999%27)
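
The three URLs above differ only in their `$where` clause, so they can be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library. The helper name `build_soda_url` and the constant names are my own, not something the course or the SODA documentation provides.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"
COLUMNS = [
    "created_date", "unique_key", "complaint_type", "incident_zip",
    "incident_address", "street_name", "address_type", "city",
    "resolution_description", "borough", "latitude", "longitude",
    "closed_date", "location_type", "status",
]

def build_soda_url(where=None, limit=100000000):
    """Hypothetical helper: build the HPD-only CSV URL, optionally with a $where filter."""
    params = {"$limit": limit, "Agency": "HPD", "$select": ",".join(COLUMNS)}
    if where:
        params["$where"] = where
    # Leave "$", "," and ":" unescaped; spaces, quotes, "<" and "=" are percent-encoded.
    return BASE_URL + "?" + urlencode(params, safe="$,:", quote_via=quote)

# Everything created on or before the end of 2018.
through_2018 = build_soda_url("created_date<='2018-12-31T23:59:59.999'")

# Only complaints created during 2018.
only_2018 = build_soda_url(
    "created_date between '2018-01-01T00:00:00.000' and '2018-12-31T23:59:59.999'"
)
```

Calling `build_soda_url()` with no arguments should reproduce the URL provided by the course, and either result can be passed to `pd.read_csv`.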

The data sets will be read directly into a pandas DataFrame and then saved as "pickle" files, either locally or in the IBM Cloud, depending on where the notebook is being run.

In [8]:
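import pandas as pd

# Uncomment to download the "Through 2018 Only" data set directly from the SODA API (large download).
# The date columns are given by name so they are parsed regardless of column position.
#df = pd.read_csv("https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status&$where=created_date%3C%3D%272018-12-31T23:59:59.999%27", parse_dates=['created_date', 'closed_date'])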

In [9]:
df.head()

| Unnamed: 0 | created_date | unique_key | complaint_type | incident_zip | incident_address | street_name | address_type | city | resolution_description | borough | latitude | longitude | closed_date | location_type | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-09-08T17:33:44.000 | 34269448 | UNSANITARY CONDITION | 10460.0 | 1512 BEACH AVENUE | BEACH AVENUE | ADDRESS | BRONX | The Department of Housing Preservation and Dev... | BRONX | 40.837901 | -73.867293 | 2019-09-24T14:31:15.000 | RESIDENTIAL BUILDING | Closed |
| 1 | 2018-11-16T11:37:56.000 | 40956852 | WATER LEAK | 10457.0 | 230 EAST 173 STREET | EAST 173 STREET | ADDRESS | BRONX | The Department of Housing Preservation and Dev... | BRONX | 40.843718 | -73.908433 | 2019-09-11T08:31:39.000 | RESIDENTIAL BUILDING | Closed |
| 2 | 2018-10-01T16:42:56.000 | 40426596 | APPLIANCE | 10460.0 | 2133 DALY AVENUE | DALY AVENUE | ADDRESS | BRONX | The Department of Housing Preservation and Dev... | BRONX | 40.845532 | -73.88098 | 2019-09-06T13:13:13.000 | RESIDENTIAL BUILDING | Closed |
| 3 | 2018-11-10T06:39:53.000 | 40904132 | HEAT/HOT WATER | 10030.0 | 118 WEST 139 STREET | WEST 139 STREET | ADDRESS | NEW YORK | The complaint you filed is a duplicate of a co... | MANHATTAN | 40.817195 | -73.940298 | 2019-08-25T00:00:00.000 | RESIDENTIAL BUILDING | Closed |
| 4 | 2018-11-10T06:55:58.000 | 40899286 | HEAT/HOT WATER | 10030.0 | 118 WEST 139 STREET | WEST 139 STREET | ADDRESS | NEW YORK | The complaint you filed is a duplicate of a co... | MANHATTAN | 40.817195 | -73.940298 | 2019-08-25T00:00:00.000 | RESIDENTIAL BUILDING | Closed |


<a id="ingestcloud"></a>
#### IBM Cloud

Code snippets for uploading and downloading pickle files between a pandas DataFrame and the IBM Cloud.

In [11]:
# IBM-specific imports used by the cells below.
import ibm_boto3
from ibm_botocore.client import Config

In [None]:
# Create variables needed for IBM Cloud operations.

# The values needed to replace the "xxx" placeholders below should be kept secret.
# Do not keep them in any notebook that is uploaded anywhere publicly visible.
# They can be obtained by invoking the Watson Studio "insert to code" functionality on any uploaded file.
# That generates code with the required values.
# Copy the values from the generated code, then discard the generated code.

# Create a Cloud Object Storage client from the credentials.
client_cred = ibm_boto3.client(service_name='xxx',
                               ibm_api_key_id='xxx',
                               ibm_auth_endpoint='xxx',
                               config=Config(signature_version='xxx'),
                               endpoint_url='xxx')

# Create a bucket variable.
bucket = 'xxx'

In [None]:
# "Load"
# Download a pickle file from IBM Cloud Object Store:
client_cred.download_file(Bucket=bucket, Key='df_raw_cos.pkl', Filename='./df_raw_local.pkl')
# Fill a DataFrame from the pickle file:
df = pd.read_pickle('./df_raw_local.pkl')

In [None]:
# "Save"
#Create a pickle file from a dataframe:
df.to_pickle('./df_raw.pkl')
#Upload a pickle (PKL) file to the IBM Cloud Object Store:
client_cred.upload_file('./df_raw.pkl',bucket,'df_raw_cos.pkl')

<a id="ingestlocal"></a>
#### Local

To avoid using up monthly Capacity Unit Hours (CUH) while working on the notebook, the notebook can be hosted on a local machine running Jupyter Notebook.

Code snippets for saving and loading Pandas data frames as local files.

In [12]:
#set variables to make either operation easier
file_path_local = 'C:\\Users\\It_Co\\Documents\\DataScience\\Capstone\\'
filename_full = 'ny311full.pkl'
filename_small = 'ny311small.pkl'
filename_truncated = 'ny311truncated.pkl'

In [4]:
#Save
#Create a pickle file from a dataframe.
df.to_pickle(file_path_local + filename_truncated)

In [None]:
#Load
#Fill a dataframe from a pickle file.
#read_pickle is a pandas function (not a DataFrame method) and returns a new DataFrame.
df = pd.read_pickle(file_path_local + filename_truncated)
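
The save and load steps above can be checked with a small round trip. This is only a sketch: the frame `df_demo`, its `unique_key` values, and the file name `demo.pkl` are made up for illustration, and a temporary directory stands in for `file_path_local`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in frame; the unique_key values are invented for this example.
df_demo = pd.DataFrame({
    "unique_key": [1, 2, 3],
    "complaint_type": ["HEAT/HOT WATER", "WATER LEAK", "APPLIANCE"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "demo.pkl"  # hypothetical file name
    df_demo.to_pickle(pickle_path)            # "Save": a DataFrame method
    df_loaded = pd.read_pickle(pickle_path)   # "Load": a pandas function returning a new DataFrame

# True when the reloaded frame matches the original in values and dtypes.
round_trip_ok = df_loaded.equals(df_demo)
```

Building the path with `pathlib.Path` instead of string concatenation avoids having to end `file_path_local` with a path separator.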

<a id="wrangling"></a>
### Wrangling

<a id="saving"></a>
### Saving

<a id="savecloud"></a>
#### IBM Cloud

Set the credentials and bucket variables as described in the Data Handling [IBM Cloud](#ingestcloud) section.

<a id="savelocal"></a>
#### Local

Set the path and file name variables as described in the Data Handling [Local](#ingestlocal) section.

In [10]:
#Save the dataset as a pickle file.
df.to_pickle(file_path_local + filename_truncated)

<a id="images"></a>
### Images

<a id="insights"></a>
## Insights
---

<a id="reasoning"></a>
## Reasoning
---