# IMT 573 - Lab 4 - Data Integration

### Instructions

Before beginning this assignment, please ensure you have access to a working instance of Jupyter Notebooks with Python 3.

1. First, replace the “YOUR NAME HERE” text in the next cell with your own full name. Any collaborators must also be listed in this cell.

2. Be sure to include well-documented (e.g. commented) code cells, figures, and clearly written text explanations as necessary. Any figures should be clearly labeled and appropriately referenced within the text. Be sure that each visualization adds value to your written explanation; avoid redundancy, as you do not need four different visualizations of the same pattern.

3. Collaboration on problem sets and labs is fun, useful, and encouraged. However, each student must turn in an individual write-up in their own words as well as code/work that is their own. Regardless of whether you work with others, what you turn in must be your own work; this includes code and interpretation of results. The names of all collaborators must be listed on each assignment. Do not copy-and-paste from other students’ responses or code - your code should never be on any other student's screen or machine.

4. All materials and resources that you use (with the exception of lecture slides) must be appropriately referenced within your assignment.

Name: Pratiibh Bassi

Collaborators: 

In this module, we have focused on integrating and cleaning data. In this lab, we'll look at integrating different data sources.

The data we will use comes from the City of Seattle. It consists of police beats in the Seattle area and provides information on their geographic locations. You can learn more about police precincts and beats [here](https://www.seattle.gov/police/about-us/about-policing/precinct-and-patrol-boundaries). We'll use this same dataset in a future problem set. 

The data can be found in the `Police_Beat_and_Precinct_Centerpoints.csv` file.

In [4]:
import pandas as pd
beats_data = pd.read_csv('../Downloads/Police_Beat_and_Precinct_Centerpoints.csv')

In [6]:
beats_data.head()

| | Name | Location 1 | Latitude | Longitude |
|---|---|---|---|---|
| 0 | B1 | (47.7097756394592, -122.370990523069) | 47.70978 | -122.37099 |
| 1 | B2 | (47.6790521901374, -122.391748391741) | 47.67905 | -122.39175 |
| 2 | B3 | (47.6812920482227, -122.364236159741) | 47.68129 | -122.36424 |
| 3 | C1 | (47.6342500180223, -122.315684762418) | 47.63425 | -122.31568 |
| 4 | C2 | (47.6192385752996, -122.313557430551) | 47.61924 | -122.31356 |


In [8]:
beats_data.tail()

| | Name | Location 1 | Latitude | Longitude |
|---|---|---|---|---|
| 52 | U3 | (47.6660083487855, -122.312204733721) | 47.66601 | -122.3122 |
| 53 | W | (47.6300237833357, -122.368053164444) | 47.63002 | -122.36805 |
| 54 | W1 | (47.5788164080083, -122.378814011668) | 47.57882 | -122.37881 |
| 55 | W2 | (47.5607068301888, -122.386946475037) | 47.56071 | -122.38695 |
| 56 | W3 | (47.5255479889804, -122.384581696918) | 47.52555 | -122.38458 |


In [10]:
beats_data

| | Name | Location 1 | Latitude | Longitude |
|---|---|---|---|---|
| 0 | B1 | (47.7097756394592, -122.370990523069) | 47.70978 | -122.37099 |
| 1 | B2 | (47.6790521901374, -122.391748391741) | 47.67905 | -122.39175 |
| 2 | B3 | (47.6812920482227, -122.364236159741) | 47.68129 | -122.36424 |
| 3 | C1 | (47.6342500180223, -122.315684762418) | 47.63425 | -122.31568 |
| 4 | C2 | (47.6192385752996, -122.313557430551) | 47.61924 | -122.31356 |
| 5 | C3 | (47.6300792887474, -122.292087128251) | 47.63008 | -122.29209 |
| 6 | CITYWIDE | (47.6210041048652, -122.332993498998) | 47.621 | -122.33299 |
| 7 | D1 | (47.6274421308028, -122.345705781837) | 47.62744 | -122.34571 |
| 8 | D2 | (47.6256548876049, -122.331370005506) | 47.62565 | -122.33137 |
| 9 | D3 | (47.6103493249325, -122.328653706199) | 47.61035 | -122.32865 |


In [12]:
beats_data.describe()

| | Latitude | Longitude |
|---|---|---|
| count | 57.0 | 57.0 |
| mean | 47.616469 | -122.329209 |
| std | 0.056895 | 0.032597 |
| min | 47.50935 | -122.4 |
| 25% | 47.57581 | -122.35187 |
| 50% | 47.61576 | -122.32996 |
| 75% | 47.65855 | -122.30659 |
| max | 47.72655 | -122.25954 |


### Problem 1: Inspection

Inspect the beats data. How many records are there? What are the variables? Is there any missing or seemingly anomalous data?

In [15]:
len(beats_data)

57

In [17]:
beats_data.shape

(57, 4)

In [21]:
beats_data.columns

Index(['Name', 'Location 1', 'Latitude', 'Longitude'], dtype='object')

In [19]:
beats_data.isna().sum()

Name          0
Location 1    0
Latitude      0
Longitude     0
dtype: int64

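There are 57 records, each with 4 variables: `Name`, `Location 1`, `Latitude`, and `Longitude`. No values are missing in any column. One record stands out because of its `Name`: record 6 is named `CITYWIDE`, while the other records shown have short names such as `B1` or `C2`. Record 53 (`W`) is also slightly unusual, since it lacks the trailing digit.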
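One way to make the check on `Name` reproducible is to test each name against a pattern. The sketch below assumes that a typical beat name is one uppercase letter followed by one digit (as in `B1` or `W3`). That pattern is inferred from the rows shown above, not from any official naming rule, and `find_unusual_names` is a hypothetical helper.

```python
import re

# Assumed pattern: one uppercase letter followed by one digit (e.g. "B1").
BEAT_NAME_PATTERN = re.compile(r"[A-Z][0-9]")


def find_unusual_names(names):
    """Return the names that do not fully match the assumed beat pattern."""
    return [name for name in names if not BEAT_NAME_PATTERN.fullmatch(str(name))]


# A few names copied from the outputs shown above.
sample_names = ["B1", "B2", "C1", "CITYWIDE", "D1", "U3", "W", "W1"]
print(find_unusual_names(sample_names))
```

On the full dataset, the same helper could be called as `find_unusual_names(beats_data['Name'])`.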

### Problem 2: Using an API

We're going to join census data to the beats dataset. To do so, we need to first get census tract information for the beats. 

We'll use the `censusgeocode` package to get census tract information for this task. We have seen how different websites and data sources can expose APIs and require API keys. Many Python packages wrap such APIs, and `censusgeocode` is one of them: it interacts with the US Census Bureau's geocoding API.

To start, import the `censusgeocode` package. As always, if the package does not import, you may need to install it first.

In [30]:
import censusgeocode as cg 

In [32]:
cg.coordinates(x=76, y=41)

{}

Now, use the [documentation](https://pypi.org/project/censusgeocode/) from the `censusgeocode` package to write a function with the following specifications: 

- the function should accept two arguments - one for longitude and one for latitude (in that order)
- the function should return the census tract number (often coded as `GEOID`) for the inputted latitude and longitude as a string
- the function should be named `get_census_tract`

You can find example outputs below to test your function.

In [36]:
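def get_census_tract(long_in, lat_in):
    """Return the census tract GEOID (a string) for a longitude/latitude pair."""
    # x is the longitude and y is the latitude, as in cg.coordinates(x=..., y=...)
    return cg.coordinates(x=long_in, y=lat_in)['Census Tracts'][0]['GEOID']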

In [40]:
get_census_tract(-77.036543, 38.898691) #should return '11001980000'

'11001980000'

In [42]:
get_census_tract(-73.985428, 40.748817) #should return '36061007600'

'36061007600'

In [44]:
get_census_tract(-118.321495, 34.134117) #should return '06037980009'

'06037980009'
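The empty dictionary returned by `cg.coordinates(x=76, y=41)` earlier suggests that `get_census_tract` would raise a `KeyError` for points the geocoder cannot place. A more defensive variant might look like the sketch below. `extract_geoid` and `get_census_tract_safe` are hypothetical names, and the `lookup` parameter is an assumption introduced so the logic can be tried without a network call; in the notebook it would be `cg.coordinates`.

```python
def extract_geoid(result):
    """Return the first tract GEOID in a geocoder result, or None if absent."""
    tracts = result.get("Census Tracts") or []
    if not tracts:
        return None
    return tracts[0].get("GEOID")


def get_census_tract_safe(long_in, lat_in, lookup):
    """Look up a tract GEOID; `lookup` is a callable such as cg.coordinates."""
    return extract_geoid(lookup(x=long_in, y=lat_in))


# Stand-in for the geocoder (hypothetical), mimicking the result shapes seen above.
def fake_lookup(x, y):
    if (x, y) == (-77.036543, 38.898691):
        return {"Census Tracts": [{"GEOID": "11001980000"}]}
    return {}


print(get_census_tract_safe(-77.036543, 38.898691, fake_lookup))
print(get_census_tract_safe(76, 41, fake_lookup))
```

Returning `None` departs from the "always a string" specification above, so this variant is best treated as an optional safeguard for rows that fail to geocode, not as a replacement for the graded function.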

### Problem 3: Get census tracts

Now, for each of the beats in the beats dataset, find the associated census tract. Keep this code as you'll use it in a future problem set.
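As a rough sketch of the per-beat lookup, the loop below maps each beat name to a tract. It assumes rows arrive as `(name, longitude, latitude)` tuples and that `tract_fn` behaves like `get_census_tract`; `tracts_for_beats` and `fake_tract_fn` are hypothetical names, and the all-zero code is a placeholder, not a real tract.

```python
def tracts_for_beats(beats, tract_fn):
    """Map each beat name to its tract, given (name, longitude, latitude) rows."""
    tracts = {}
    for name, longitude, latitude in beats:
        tracts[name] = tract_fn(longitude, latitude)
    return tracts


# Stand-in for get_census_tract (hypothetical) so the sketch runs offline.
def fake_tract_fn(longitude, latitude):
    return "00000000000"  # placeholder value, not a real tract code


# Two rows copied from the table above: (Name, Longitude, Latitude).
sample_beats = [("B1", -122.37099, 47.70978), ("B2", -122.39175, 47.67905)]
print(tracts_for_beats(sample_beats, fake_tract_fn))
```

With the real data, the rows could come from `beats_data[['Name', 'Longitude', 'Latitude']].itertuples(index=False)`, with `get_census_tract` passed as `tract_fn`. Each call makes a network request, so 57 lookups may take a little while.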

Census tracts are small geographic areas identified by numeric codes. A full census block code is composed of a state/territory code (2 digits), followed by a county code (3 digits), a tract code (6 digits), and a block code (4 digits). The 11-digit tract `GEOID` values returned above consist of the first three of these parts. You can learn more about this [here](https://transition.fcc.gov/form477/Geo/more_about_census_blocks.pdf). Confirm that each of the tracts for the beats data is from the state of Washington (code 53) and King County (the county that the city of Seattle is in - code 033).
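The state and county check can be done by slicing the tract code. The sketch below assumes the 11-digit layout of 2 state digits, 3 county digits, and 6 tract digits. `split_tract_geoid` and `is_king_county_wa` are hypothetical helper names, and `53033000100` is a made-up value used only to illustrate a King County code.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into state, county, and tract parts."""
    geoid = str(geoid)
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit tract GEOID, got {geoid!r}")
    return {"state": geoid[:2], "county": geoid[2:5], "tract": geoid[5:]}


def is_king_county_wa(geoid):
    """True if the GEOID starts with state 53 (WA) and county 033 (King)."""
    parts = split_tract_geoid(geoid)
    return parts["state"] == "53" and parts["county"] == "033"


print(split_tract_geoid("11001980000"))  # example value from Problem 2
print(is_king_county_wa("11001980000"))
print(is_king_county_wa("53033000100"))  # made-up illustrative value
```

Keeping the codes as strings matters here: converting them to integers would drop leading zeros, as in `'06037980009'`. If the tracts were stored in a hypothetical column `beats_data['tract']`, the whole check could be written as `beats_data['tract'].apply(is_king_county_wa).all()`.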