<h1>edX Capstone Project</h1>

<h2>Problem Statement</h2>

The people of New York use the 311 system to report complaints about non-emergency problems to local authorities. These problems are assigned to various agencies in New York. The Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints related to housing and buildings.

In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, their large volume and sudden increase are impacting the overall efficiency of the agency's operations.

Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year.

The agency needs answers to several questions. The answers to those questions must be supported by data and analytics.

<h3>Data</h3>

Much of this first notebook describes how to create and use the data set.

<h4>SODA</h4>

The Socrata Open Data API (SODA) provides programmatic access to this dataset including the ability to filter, query, and aggregate data.

<h4>The data to use to solve the problem</h4>

API endpoint of the New York City 311 dataset.<br/>
https://data.cityofnewyork.us/resource/fhrw-4uyv.json<br/>
<br/>
Open Data page<br/>
https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9<br/>
<br/>
Primary Land Use Tax Lot Output "PLUTO" housing dataset.  Five spreadsheets, one for each borough.<br/>
https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_18v1.zip<br/>

<h4>Use SODA API to extract a relevant subset of the data for use in this project.</h4>

The SODA documentation is here:

https://dev.socrata.com/foundry/data.cityofnewyork.us/fhrw-4uyv

This is the SODA URL provided by the course materials. It selects only the relevant columns, sets a maximum number of rows, and limits the results to the Department of Housing Preservation and Development (HPD).

https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status

<h4>Further adjusted SODA URL that limits the data through the end of 2018</h4>
The quiz question asks only about complaints through December 2018.<br/>

https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status&$where=created_date%3C%3D%272018-12-31T23:59:59.999%27

<h4>Further adjusted SODA URL that limits the data to 2018 only</h4>
This gives a smaller subset to use while exploring the data.<br/>

https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status&$where=created_date%20between%20'2018-01-01T00:00:00.000'%20and%20'2018-12-31T23:59:59.999'
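The three URLs above differ only in their `$where` clause, so they can be generated from one place instead of being edited by hand. The following is a minimal sketch using only the Python standard library; the helper name `build_soda_url` and the constant names are hypothetical and not part of the course materials. It only builds the URL strings and does not download anything.

```python
from urllib.parse import quote, urlencode

# Endpoint and column list copied from the course-provided URL.
BASE_URL = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"
COLUMNS = [
    "created_date", "unique_key", "complaint_type", "incident_zip",
    "incident_address", "street_name", "address_type", "city",
    "resolution_description", "borough", "latitude", "longitude",
    "closed_date", "location_type", "status",
]


def build_soda_url(where=None, limit=100000000):
    """Build a SODA query URL for HPD complaints (hypothetical helper)."""
    params = {
        "$limit": limit,
        "Agency": "HPD",
        "$select": ",".join(COLUMNS),
    }
    if where:
        params["$where"] = where
    # quote_via=quote encodes spaces as %20; "$", "," and ":" stay readable.
    return BASE_URL + "?" + urlencode(params, safe="$,:", quote_via=quote)


all_rows_url = build_soda_url()
thru_2018_url = build_soda_url("created_date<='2018-12-31T23:59:59.999'")
only_2018_url = build_soda_url(
    "created_date between '2018-01-01T00:00:00.000' "
    "and '2018-12-31T23:59:59.999'"
)
print(thru_2018_url)
```

Quotes and comparison operators in the `$where` clause are percent-encoded, which is equivalent to the hand-written forms above.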

<h3>IBM Cloud Object Storage</h3>

<h4>From Watson Studio, invoking the "Insert to code" functionality on any uploaded file (any small text file will do) inserts a code snippet into the notebook that includes the bucket and credentials.</h4>
Fill them in below to use the various cloud functionalities.<br/>
Remove them before placing the notebook anywhere public-facing.<br/>

In [3]:
# The "Insert to code" ends up in here.
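One way to keep credentials out of a public-facing notebook is to read them from environment variables instead of pasting them into a cell. The sketch below assumes the values have been exported beforehand; the variable names (`COS_API_KEY`, `COS_AUTH_ENDPOINT`, `COS_ENDPOINT_URL`, `COS_BUCKET`) and the helper `load_cos_settings` are hypothetical and are not defined by Watson Studio.

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
COS_ENV_VARS = {
    "api_key": "COS_API_KEY",
    "auth_endpoint": "COS_AUTH_ENDPOINT",
    "endpoint_url": "COS_ENDPOINT_URL",
    "bucket": "COS_BUCKET",
}


def load_cos_settings(env=None):
    """Return the Cloud Object Storage settings found in the environment."""
    env = os.environ if env is None else env
    missing = [var for var in COS_ENV_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in COS_ENV_VARS.items()}


# Usage with a stand-in mapping; in the notebook, call load_cos_settings().
example_settings = load_cos_settings({
    "COS_API_KEY": "xxx",
    "COS_AUTH_ENDPOINT": "xxx",
    "COS_ENDPOINT_URL": "xxx",
    "COS_BUCKET": "xxx",
})
print(sorted(example_settings))
```

The returned values can then be passed to the client and bucket variables, so the notebook itself never contains the secrets.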

<h4>Create Credential and Bucket Variables</h4>

In [None]:
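# Imports needed to create the IBM Cloud Object Storage client.
import ibm_boto3
from ibm_botocore.client import Config

# Replace each placeholder with the values from the "Insert to code" snippet.
client_cred = ibm_boto3.client(service_name='xxx',
                               ibm_api_key_id='xxx',
                               ibm_auth_endpoint='xxx',
                               config=Config(signature_version='x'),
                               endpoint_url='xxx')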

bucket = 'xxx'

<h4>Upload a File to Cloud Object Store by Using the Credential and Bucket Variables</h4>

In [None]:
#Create a pickle (PKL) file out of the Dataframe:
df.to_pickle('./df_raw.pkl')
#Upload the pickle (PKL) file:
client_cred.upload_file('./df_raw.pkl',bucket,'df_raw_cos.pkl')

<h4>Download a File from Cloud Object Store by Using Credential and Bucket Variables</h4>

In [None]:
#Download the file from Cloud Object Store:
client_cred.download_file(Bucket=bucket,Key='df_raw_cos.pkl',Filename='./df_raw_local.pkl')
#Create a Dataframe out of the file:
df = pd.read_pickle('./df_raw_local.pkl')

<h3>When working outside of IBM cloud</h3>
To avoid using up monthly Capacity Unit Hours (CUH) while working on the notebook, the notebook can be hosted on any machine running Jupyter Notebook.<br/>
In that case, loading the file into a pandas DataFrame is more direct.

In [52]:
import pandas as pd
local_file_path = 'C:\\Users\\It_Co\\Documents\\DataScience\\Capstone\\'
full_dataset = 'fhrw-4uyv.csv'
small_dataset = 'fhrw-4uyv-only2018.csv'
truncated_dataset = 'fhrw-4uyv-thru2018.csv'

In [53]:
current_dataset = full_dataset

In [54]:
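# Parse the two timestamp columns by name so the result does not depend on column order.
df = pd.read_csv(local_file_path + current_dataset, parse_dates=['created_date', 'closed_date'])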

<h1 style="color:Blue;">Question 1 of 4</h1>

<p style="color:Blue;">Which type of complaint should the Department of Housing Preservation and Development of New York City focus on first?</p>

<h4>Explore the datasets and identify the key problem.</h4>

In [55]:
df.complaint_type.isnull().sum()

0

<p style="color:Red;">None of the complaint types are null.</p>

In [56]:
df['complaint_type'].describe()

count            5862383
unique                29
top       HEAT/HOT WATER
freq             1152592
Name: complaint_type, dtype: object

<p style="color:Red;">HEAT/HOT WATER is the most common of the 29 unique complaint types, but closer examination is necessary before giving a final answer to the question.</p>

In [57]:
unique_types = df['complaint_type'].unique()
unique_types.sort()
unique_types

array(['AGENCY', 'APPLIANCE', 'Appliance', 'CONSTRUCTION', 'DOOR/WINDOW',
       'ELECTRIC', 'ELEVATOR', 'FLOORING/STAIRS', 'GENERAL',
       'GENERAL CONSTRUCTION', 'General', 'HEAT/HOT WATER', 'HEATING',
       'HPD Literature Request', 'Mold', 'NONCONST', 'OUTSIDE BUILDING',
       'Outside Building', 'PAINT - PLASTER', 'PAINT/PLASTER', 'PLUMBING',
       'Plumbing', 'SAFETY', 'STRUCTURAL', 'Safety',
       'UNSANITARY CONDITION', 'Unsanitary Condition', 'VACANT APARTMENT',
       'WATER LEAK'], dtype=object)

<p style="color:Red;">Some of these appear to be duplicate ways to represent the same thing.</p>
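Likely duplicates can be flagged automatically by comparing labels after stripping case and punctuation. The sketch below is one possible check, using the 29 labels printed above and only the standard library; the `normalize` helper is hypothetical. It catches spelling variants only, so labels that differ in wording (such as HEATING versus HEAT/HOT WATER) still need a manual decision.

```python
import re
from collections import defaultdict

# The 29 labels returned by df['complaint_type'].unique() above.
labels = [
    'AGENCY', 'APPLIANCE', 'Appliance', 'CONSTRUCTION', 'DOOR/WINDOW',
    'ELECTRIC', 'ELEVATOR', 'FLOORING/STAIRS', 'GENERAL',
    'GENERAL CONSTRUCTION', 'General', 'HEAT/HOT WATER', 'HEATING',
    'HPD Literature Request', 'Mold', 'NONCONST', 'OUTSIDE BUILDING',
    'Outside Building', 'PAINT - PLASTER', 'PAINT/PLASTER', 'PLUMBING',
    'Plumbing', 'SAFETY', 'STRUCTURAL', 'Safety',
    'UNSANITARY CONDITION', 'Unsanitary Condition', 'VACANT APARTMENT',
    'WATER LEAK',
]


def normalize(label):
    """Uppercase the label and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", label.upper())


# Group the original labels by their normalized form.
groups = defaultdict(list)
for label in labels:
    groups[normalize(label)].append(label)

# Keep only the groups that contain more than one spelling.
duplicates = {key: names for key, names in groups.items() if len(names) > 1}
for key, names in sorted(duplicates.items()):
    print(key, names)
```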

In [60]:
# Although we may need dummy values for later questions, for now just normalize the labels.
# Assigning the result back avoids relying on in-place replacement through a column selection.
df['complaint_type'] = df['complaint_type'].replace({
    'Appliance': 'APPLIANCE',
    'GENERAL CONSTRUCTION': 'CONSTRUCTION',
    'General': 'GENERAL',
    'HEATING': 'HEAT/HOT WATER',
    'Outside Building': 'OUTSIDE BUILDING',
    'PAINT - PLASTER': 'PAINT/PLASTER',
    'Plumbing': 'PLUMBING',
    'Safety': 'SAFETY',
    'Unsanitary Condition': 'UNSANITARY CONDITION',
})

In [65]:
print(df['complaint_type'].describe())
unique_types = df['complaint_type'].unique()
unique_types.sort()
unique_types

count            5862383
unique                20
top       HEAT/HOT WATER
freq             2040461
Name: complaint_type, dtype: object


array(['AGENCY', 'APPLIANCE', 'CONSTRUCTION', 'DOOR/WINDOW', 'ELECTRIC',
       'ELEVATOR', 'FLOORING/STAIRS', 'GENERAL', 'HEAT/HOT WATER',
       'HPD Literature Request', 'Mold', 'NONCONST', 'OUTSIDE BUILDING',
       'PAINT/PLASTER', 'PLUMBING', 'SAFETY', 'STRUCTURAL',
       'UNSANITARY CONDITION', 'VACANT APARTMENT', 'WATER LEAK'],
      dtype=object)
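The two `describe()` outputs make it possible to quantify the effect of the merge. As a quick arithmetic check using only the counts printed above, the share of HEAT/HOT WATER rises from about 19.7% of all complaints before normalization to about 34.8% afterwards. Because HEATING is the only label merged into HEAT/HOT WATER, the difference in frequency equals the number of rows that were labelled HEATING.

```python
# Counts copied from the two describe() outputs above.
total_complaints = 5862383
heat_before_merge = 1152592   # 'HEAT/HOT WATER' only
heat_after_merge = 2040461    # 'HEAT/HOT WATER' plus former 'HEATING' rows

# Rows that carried the older 'HEATING' label.
heating_rows = heat_after_merge - heat_before_merge

share_before = heat_before_merge / total_complaints
share_after = heat_after_merge / total_complaints

print(f"Rows relabelled from HEATING: {heating_rows}")
print(f"Share before merge: {share_before:.1%}")
print(f"Share after merge:  {share_after:.1%}")
```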

<p style="color:Red;">The next step is to go back to the Open Data page and check whether it describes these values, in case, for example, water leaks should be grouped with plumbing.</p>

In [64]:
#Save the dataset as it is so far while breaking for a meal.
df.to_pickle(local_file_path + 'save.pkl')

<h4>Identify and create features to create machine learning models.</h4>

<h4>Explore various machine learning algorithms to arrive at the best possible machine learning model.</h4>