# Week 2

## Installing and importing a Python package/library

- Installing Pandas Library for Data Processing 
- What is Pandas? 
    - Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
    - Its main data format is the DataFrame, which is like a 2D array, but with column names and indexes for the rows.
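
To make the idea of a DataFrame concrete, here is a minimal sketch that builds one by hand. The values are illustrative only, and the `Crew` column is invented for this example; it does not exist in the dataset used later.

```python
import pandas as pd

# Illustrative values only; "Crew" is a made-up column for this sketch.
permits = pd.DataFrame(
    {
        "Borough": ["Manhattan", "Queens", "Brooklyn"],
        "Category": ["Television", "Film", "Theater"],
        "Crew": [12, 30, 8],
    }
)

print(permits.columns.tolist())  # the column names
print(permits.index.tolist())    # the default row labels: 0, 1, 2
print(permits.shape)             # (number of rows, number of columns)
```

Each dictionary key becomes a column name, and Pandas numbers the rows for us unless we ask for a different index.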




### Using pip to install Pandas 

In [None]:
# This only needs to be run once per environment. No need to run it again in every project.

import sys
!{sys.executable} -m pip install pandas

### Importing the pandas library

In [127]:
# Importing the pandas library as pd, so instead of writing pandas everywhere we can write pd, which is shorter
import pandas as pd

## Loading and exploring data with Pandas



First we load in the data. This particular dataset is from [NYC granted film permits](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p) and is in the **comma-separated values (.csv)** format.

``df = pd.read_csv("data/Film_Permits.csv")``

Pandas shows us the **head** (first rows) and the **tail** (last rows), as well as the **shape**. So we know there are 14 columns (separate pieces of info about each permit), and 67359 permits granted.

We can see it's a mix of categorical data (EventType, Borough etc...), unique IDs, and dates. The categories are mainly **nominal**, in that they have no intrinsic order. Even the ones which are numbers (like CommunityBoard(s) and PolicePrecinct(s)) are **nominal**, in that the numbers don't represent a ranking (one isn't best), and we would never want to do any maths with them.

The dates however, are **continuous**, in that they are numbers that could take any value, and that we could do arithmetic (e.g. subtract one from the other to find a time).

In [128]:
#load
#pd.options.display.max_rows = 100
df = pd.read_csv("data/Film_Permits.csv")

In [129]:
# Standard format is rows x columns
# Hence our dataset has 67359 rows and 14 columns
df.shape

(67359, 14)

In [130]:
df

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
0,446040,Shooting Permit,10/19/2018 02:00:00 PM,10/20/2018 04:00:00 AM,10/16/2018 11:57:27 AM,"Mayor's Office of Film, Theatre & Broadcasting",THOMPSON STREET between PRINCE STREET and SPRI...,Manhattan,2,1,Television,Cable-episodic,United States of America,10012
1,446168,Shooting Permit,10/19/2018 02:00:00 PM,10/20/2018 02:00:00 AM,10/16/2018 07:03:56 PM,"Mayor's Office of Film, Theatre & Broadcasting",MARBLE HILL AVENUE between WEST 227 STREET an...,Manhattan,"12, 8","34, 50",Film,Feature,United States of America,"10034, 10463"
2,186438,Shooting Permit,10/30/2014 07:00:00 AM,10/31/2014 02:00:00 AM,10/27/2014 12:14:15 PM,"Mayor's Office of Film, Theatre & Broadcasting",LAUREL HILL BLVD between REVIEW AVENUE and RUS...,Queens,"2, 5","104, 108",Television,Episodic series,United States of America,11378
3,445255,Shooting Permit,10/20/2018 07:00:00 AM,10/20/2018 06:00:00 PM,10/09/2018 09:34:58 PM,"Mayor's Office of Film, Theatre & Broadcasting",JORALEMON STREET between BOERUM PLACE and COUR...,Brooklyn,2,84,Still Photography,Not Applicable,United States of America,11201
4,128794,Theater Load in and Load Outs,11/16/2013 12:01:00 AM,11/17/2013 06:00:00 AM,11/07/2013 03:48:28 PM,"Mayor's Office of Film, Theatre & Broadcasting",WEST 31 STREET between 7 AVENUE and 8 AVENUE...,Manhattan,"4, 5",14,Theater,Theater,United States of America,"10001, 10121"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67354,511630,Shooting Permit,10/11/2019 10:00:00 AM,10/11/2019 11:00:00 PM,10/08/2019 02:05:06 PM,"Mayor's Office of Film, Theatre & Broadcasting","44 ROAD between 24 STREET and HUNTER STREET, ...",Queens,2,108,Television,Episodic series,United States of America,11101
67355,548906,Shooting Permit,10/18/2020 06:00:00 AM,10/19/2020 06:00:00 PM,10/14/2020 03:50:06 PM,"Mayor's Office of Film, Theatre & Broadcasting",MADISON AVENUE between EAST 45 STREET and EA...,Manhattan,5,"14, 18",Commercial,Commercial,United States of America,10017
67356,491040,Shooting Permit,06/12/2019 07:00:00 AM,06/12/2019 08:00:00 PM,06/10/2019 12:38:04 PM,"Mayor's Office of Film, Theatre & Broadcasting",SPOFFORD AVENUE between COSTER STREET and FAIL...,Bronx,"1, 2, 7","40, 41, 72",Television,Episodic series,United States of America,"10454, 10474, 11220"
67357,548583,Shooting Permit,10/19/2020 08:00:00 AM,10/19/2020 11:59:00 PM,10/08/2020 05:05:23 PM,"Mayor's Office of Film, Theatre & Broadcasting",QUEENS PLAZA SOUTH between 21 STREET and 22 ST...,Queens,2,108,Television,Episodic series,United States of America,11101


In [131]:
# Let's say we just want to peek at the first five rows of our dataset; we can use .head() for that
# Since our dataframe is in the variable df, we ask it to show us the first five rows using df.head(5)

df.head(5)

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
0,446040,Shooting Permit,10/19/2018 02:00:00 PM,10/20/2018 04:00:00 AM,10/16/2018 11:57:27 AM,"Mayor's Office of Film, Theatre & Broadcasting",THOMPSON STREET between PRINCE STREET and SPRI...,Manhattan,2,1,Television,Cable-episodic,United States of America,10012
1,446168,Shooting Permit,10/19/2018 02:00:00 PM,10/20/2018 02:00:00 AM,10/16/2018 07:03:56 PM,"Mayor's Office of Film, Theatre & Broadcasting",MARBLE HILL AVENUE between WEST 227 STREET an...,Manhattan,"12, 8","34, 50",Film,Feature,United States of America,"10034, 10463"
2,186438,Shooting Permit,10/30/2014 07:00:00 AM,10/31/2014 02:00:00 AM,10/27/2014 12:14:15 PM,"Mayor's Office of Film, Theatre & Broadcasting",LAUREL HILL BLVD between REVIEW AVENUE and RUS...,Queens,"2, 5","104, 108",Television,Episodic series,United States of America,11378
3,445255,Shooting Permit,10/20/2018 07:00:00 AM,10/20/2018 06:00:00 PM,10/09/2018 09:34:58 PM,"Mayor's Office of Film, Theatre & Broadcasting",JORALEMON STREET between BOERUM PLACE and COUR...,Brooklyn,2,84,Still Photography,Not Applicable,United States of America,11201
4,128794,Theater Load in and Load Outs,11/16/2013 12:01:00 AM,11/17/2013 06:00:00 AM,11/07/2013 03:48:28 PM,"Mayor's Office of Film, Theatre & Broadcasting",WEST 31 STREET between 7 AVENUE and 8 AVENUE...,Manhattan,"4, 5",14,Theater,Theater,United States of America,"10001, 10121"


## Data types

### Checking data types


We've considered what data is there, but some is numbers, some is text, some are dates. What does **Pandas** think each one is? This is important because it will determine what we can do with the data in each column, how it will be sorted, and filtered etc...

We can use 

``df.dtypes``

to see each column's data type. We see that most of them are **objects**, which is the Pandas type for strings or mixed values (only EventID is read as an integer). We can load the data in again and tell it which columns represent dates, and it will automatically parse them if they are in a consistent format. This means we can do things like compare them (e.g. which one is earlier?), which is really useful for sorting.

You will see the loading takes longer, as it has to parse the dates, and that afterwards the chosen columns are of the ``datetime64[ns]`` type.
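
As a small, hedged illustration of what date parsing changes, the sketch below reads two rows copied from the table above out of an in-memory string, then converts the date columns with `pd.to_datetime` and an explicit format string. This is an alternative to `parse_dates` rather than the notebook's own method; the format string is an assumption based on how the dates are printed above.

```python
import io
import pandas as pd

# Two rows taken from the table above, held in memory instead of a file.
csv_text = io.StringIO(
    "EventID,StartDateTime,EndDateTime\n"
    "446040,10/19/2018 02:00:00 PM,10/20/2018 04:00:00 AM\n"
    "186438,10/30/2014 07:00:00 AM,10/31/2014 02:00:00 AM\n"
)
sample = pd.read_csv(csv_text)

# Assumed format: month/day/year, 12-hour clock with AM/PM.
date_format = "%m/%d/%Y %I:%M:%S %p"
for column in ["StartDateTime", "EndDateTime"]:
    sample[column] = pd.to_datetime(sample[column], format=date_format)

print(sample.dtypes)
print(sample["StartDateTime"].min())  # the earliest start, now that dates can be compared
```

Giving the format explicitly means Pandas does not have to guess it, which can also make parsing faster on large files.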

In [132]:
# What are the types?
# Checking the data type of each column

df.dtypes

EventID               int64
EventType            object
StartDateTime        object
EndDateTime          object
EnteredOn            object
EventAgency          object
ParkingHeld          object
Borough              object
CommunityBoard(s)    object
PolicePrecinct(s)    object
Category             object
SubCategoryName      object
Country              object
ZipCode(s)           object
dtype: object

### Loading and formatting date types

- As we can see above, the columns "StartDateTime", "EndDateTime", and "EnteredOn" are being read as objects even though they contain dates and times, so they should be in the datetime format.
- Refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html for more information on other formats.

In [133]:
# Reloading the csv data and parsing the dates in the correct format.

df = pd.read_csv("data/Film_Permits.csv", parse_dates=["StartDateTime","EndDateTime","EnteredOn"])

In [134]:
# Checking if the data has been parsed in the correct format

df.dtypes

EventID                       int64
EventType                    object
StartDateTime        datetime64[ns]
EndDateTime          datetime64[ns]
EnteredOn            datetime64[ns]
EventAgency                  object
ParkingHeld                  object
Borough                      object
CommunityBoard(s)            object
PolicePrecinct(s)            object
Category                     object
SubCategoryName              object
Country                      object
ZipCode(s)                   object
dtype: object

## Summarising Data

**Pandas** can also give us a summary of our data (67359 is a lot to look at ourselves!). 

``df.describe(include = "all")``


In [135]:
df.describe()

Unnamed: 0,EventID
count,67359.0
mean,282618.005849
std,146069.216982
min,42069.0
25%,152266.0
50%,276047.0
75%,408197.5
max,549009.0


If we don't put ``include = "all"``, then we will only get summary statistics for **numeric** columns. 


In [136]:
df.describe(include = "all")


Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
count,67359.0,67359,67359,67359,67359,67359,67359,67359,67341.0,67341.0,67359,67359,67359,67341.0
unique,,4,26509,32433,66808,1,39417,5,836.0,2598.0,10,30,10,5058.0
top,,Shooting Permit,2018-11-13 06:00:00,2019-12-04 22:00:00,2018-01-30 12:43:07,"Mayor's Office of Film, Theatre & Broadcasting",WEST 48 STREET between 6 AVENUE and 7 AVENUE,Manhattan,1.0,94.0,Television,Episodic series,United States of America,11222.0
freq,,58638,24,16,6,67359,1444,32958,14828.0,7263.0,36985,21549,67291,6427.0
first,,,2012-01-01 06:00:00,2012-01-02 23:59:00,2011-12-07 16:38:54,,,,,,,,,
last,,,2020-10-19 08:00:00,2020-12-31 01:00:00,2020-10-15 16:07:28,,,,,,,,,
mean,282618.005849,,,,,,,,,,,,,
std,146069.216982,,,,,,,,,,,,,
min,42069.0,,,,,,,,,,,,,
25%,152266.0,,,,,,,,,,,,,


When we run the code, we get a table of stats describing our data. We can see that a lot of the number-based stats don't return values for our **nominal** columns, which is fine. But things such as the most common value and the number of unique entries are still interesting. 

For example, all 5 New York Boroughs are present, and Manhattan is the most common. We can also see the first filming started on **2012-01-01 06:00:00**, and this works because we formatted it as a date, so Pandas is able to order them. 
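
The difference between the two calls can be seen on a tiny frame. This is a sketch using the first four EventID and Borough values from the table above, not the full dataset.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "EventID": [446040, 446168, 186438, 445255],
        "Borough": ["Manhattan", "Manhattan", "Queens", "Brooklyn"],
    }
)

numeric_summary = sample.describe()            # numeric columns only
full_summary = sample.describe(include="all")  # every column

print(numeric_summary.index.tolist())
print(full_summary.loc[["unique", "top", "freq"], "Borough"])
```

The `unique`, `top` and `freq` rows only appear when non-numeric columns are included, and they are the rows that tell us about the categories.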

## Selecting Columns 


We can select a column using its name 

``df["ColumnName"]``

Or we can select a bunch of columns by passing a list of names 

``df[["ColumnName1","ColumnName2"]]``

Selecting a single column returns a **Series**, while selecting a list of columns returns a smaller **DataFrame**. If we just want the underlying values as an array, we can use 

``result.values``


In [137]:
#select column
df["SubCategoryName"]

0         Cable-episodic
1                Feature
2        Episodic series
3         Not Applicable
4                Theater
              ...       
67354    Episodic series
67355         Commercial
67356    Episodic series
67357    Episodic series
67358    Episodic series
Name: SubCategoryName, Length: 67359, dtype: object

In [138]:
#select columns
df[["Category","SubCategoryName"]]

Unnamed: 0,Category,SubCategoryName
0,Television,Cable-episodic
1,Film,Feature
2,Television,Episodic series
3,Still Photography,Not Applicable
4,Theater,Theater
...,...,...
67354,Television,Episodic series
67355,Commercial,Commercial
67356,Television,Episodic series
67357,Television,Episodic series


In [139]:
#select column and get it back as an array of values
df["SubCategoryName"].values

array(['Cable-episodic', 'Feature', 'Episodic series', ...,
       'Episodic series', 'Episodic series', 'Episodic series'],
      dtype=object)

For more information on the difference between lists and arrays check out https://www.geeksforgeeks.org/difference-between-list-and-array-in-python/
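
To check what each kind of selection gives back, here is a short sketch on a three-row frame with values taken from the table above. `to_numpy()` is used as the documented equivalent of `.values`, and `tolist()` gives a plain Python list.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Category": ["Television", "Film", "Television"],
        "SubCategoryName": ["Cable-episodic", "Feature", "Episodic series"],
    }
)

one_column = sample["SubCategoryName"]                 # single name -> Series
two_columns = sample[["Category", "SubCategoryName"]]  # list of names -> DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)

as_array = one_column.to_numpy()  # NumPy array, like .values
as_list = one_column.tolist()     # plain Python list
print(type(as_array).__name__, type(as_list).__name__)
```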

## Counting Columns 


We can get counts to see what the most prevalent combinations of categories are. 
- Here, for the type of thing being filmed, we can see **Feature Films** and **Episodic TV Series** are the most common. This is a good way for us to get a feel for all the different things that are filmed in New York and what you need a permit for. 

``df[["Category","SubCategoryName"]].value_counts()``

I wonder what **Television - Not Applicable** is?
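
Here is a small sketch of how `value_counts()` treats a pair of columns, using four made-up rows. The `normalize=True` option is an addition to what the notebook does; it reports shares instead of raw counts.

```python
import pandas as pd

# Four illustrative rows, not taken from the real data.
sample = pd.DataFrame(
    {
        "Category": ["Television", "Television", "Film", "Television"],
        "SubCategoryName": ["Episodic series", "Episodic series", "Feature", "Pilot"],
    }
)

# Each distinct (Category, SubCategoryName) pair is counted once per row it appears in.
pair_counts = sample[["Category", "SubCategoryName"]].value_counts()
print(pair_counts)

# normalize=True reports each combination as a share of all rows.
pair_shares = sample[["Category", "SubCategoryName"]].value_counts(normalize=True)
print(pair_shares)
```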

In [140]:
df[["Category","SubCategoryName"]].value_counts()

Category             SubCategoryName        
Television           Episodic series            21549
Film                 Feature                     9116
Television           Cable-episodic              7262
Theater              Theater                     6759
Commercial           Commercial                  4267
Still Photography    Not Applicable              4027
WEB                  Not Applicable              2460
Television           Pilot                       1691
                     News                        1519
Film                 Not Applicable              1036
Television           Cable-other                  765
Film                 Short                        727
Television           Not Applicable               702
                     Made for TV/mini-series      665
                     Reality                      646
                     Morning Show                 612
Commercial           Promo                        599
Television           Special/Awards S

## Filtering 

As well as picking whole columns, we can also pick the rows that fit certain conditions, using **filtering**. To do this we keep only the rows where a column equals a certain value

``df[df["SubCategoryName"]=="Independent Artist"]``
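
The expression inside the outer brackets is itself a column of True/False values, one per row. The sketch below, on four made-up rows, shows that mask on its own and two common extensions that the notebook does not use: combining conditions with `&` and matching several values with `isin()`.

```python
import pandas as pd

# Four illustrative rows, not taken from the real data.
sample = pd.DataFrame(
    {
        "Borough": ["Manhattan", "Brooklyn", "Queens", "Manhattan"],
        "Category": ["Music Video", "Music Video", "Television", "Film"],
    }
)

is_music_video = sample["Category"] == "Music Video"  # Series of True/False
print(is_music_video.tolist())

music_videos = sample[is_music_video]  # keeps only the rows marked True

# Combine conditions with & (and) or | (or); each condition needs its own brackets.
manhattan_music = sample[
    (sample["Category"] == "Music Video") & (sample["Borough"] == "Manhattan")
]

# isin() keeps rows whose value matches any entry in a list.
outer = sample[sample["Borough"].isin(["Brooklyn", "Queens"])]
print(len(music_videos), len(manhattan_music), len(outer))
```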



In [141]:
df[df["SubCategoryName"]=="Independent Artist"]

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
384,193044,Shooting Permit,2014-12-07 17:00:00,2014-12-08 06:00:00,2014-12-05 05:15:28,"Mayor's Office of Film, Theatre & Broadcasting",SOUTH STREET between RUTGERS SLIP and PIKE SLIP,Manhattan,3,7,Music Video,Independent Artist,United States of America,10002
863,446208,Shooting Permit,2018-10-25 06:00:00,2018-10-26 23:30:00,2018-10-17 12:08:31,"Mayor's Office of Film, Theatre & Broadcasting","53 STREET between 1 AVENUE and 2 AVENUE, 54 S...",Brooklyn,7,72,Music Video,Independent Artist,United States of America,"11220, 11232"
5776,181545,Shooting Permit,2014-09-23 05:30:00,2014-09-23 19:00:00,2014-09-16 09:08:59,"Mayor's Office of Film, Theatre & Broadcasting",BROADWAY between 29 STREET and CRESCENT STREET,Queens,1,114,Music Video,Independent Artist,United States of America,11106
7023,238243,Shooting Permit,2015-08-10 06:00:00,2015-08-10 20:00:00,2015-07-30 11:52:43,"Mayor's Office of Film, Theatre & Broadcasting",VARICK STREET between WEST HOUSTON STREET and ...,Manhattan,2,1,Music Video,Independent Artist,United States of America,10014
9285,123588,Shooting Permit,2013-10-06 19:00:00,2013-10-06 23:00:00,2013-09-25 15:31:48,"Mayor's Office of Film, Theatre & Broadcasting",WALKER STREET between CORTLANDT ALLEY and LAFA...,Manhattan,1,5,Music Video,Independent Artist,United States of America,10013
10441,84732,Shooting Permit,2013-01-26 06:00:00,2013-01-26 22:00:00,2013-01-24 14:16:50,"Mayor's Office of Film, Theatre & Broadcasting",WEST THIRD STREET between THOMPSON STREET and ...,Manhattan,2,6,Music Video,Independent Artist,United States of America,10012
10654,103560,Shooting Permit,2013-05-27 08:00:00,2013-05-27 20:00:00,2013-05-22 12:43:29,"Mayor's Office of Film, Theatre & Broadcasting",GANSEVOORT STREET between GREENWICH STREET and...,Manhattan,2,6,Music Video,Independent Artist,United States of America,10014
11741,164905,Shooting Permit,2014-06-11 04:00:00,2014-06-11 22:00:00,2014-06-09 15:32:49,"Mayor's Office of Film, Theatre & Broadcasting",CHARLES STREET between WASHINGTON STREET and W...,Manhattan,"2, 3","6, 9",Music Video,Independent Artist,United States of America,"10009, 10014"
16408,124768,Shooting Permit,2013-10-09 06:00:00,2013-10-09 19:00:00,2013-10-04 14:28:18,"Mayor's Office of Film, Theatre & Broadcasting",MONTROSE AVENUE between MANHATTAN AVENUE and G...,Brooklyn,1,90,Music Video,Independent Artist,United States of America,11206
16779,85561,Shooting Permit,2013-02-02 06:00:00,2013-02-02 22:00:00,2013-02-01 10:39:36,"Mayor's Office of Film, Theatre & Broadcasting",MADISON AVENUE between EAST 23 STREET and EA...,Manhattan,"1, 5","13, 90",Music Video,Independent Artist,United States of America,"10010, 11249"


## Sorting

Strangely, this dataset isn't actually sorted by date, but we can do that using ``sort_values``. We tell **Pandas** which column we want to sort by; this works best when the column holds values with a natural order, such as numbers or dates (text columns are sorted alphabetically). We also say which direction we want to sort the results in, and this is useful if we want to get the top or bottom slice.

**Most recent 20**

``df.sort_values(by='StartDateTime', ascending=False)[:20]``

**Earliest 20**

``df.sort_values(by='StartDateTime', ascending=True)[:20]``
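
A quick sketch of both directions on three rows, with EventID and start times taken from the table above (written in ISO format here for brevity):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "EventID": [446040, 186438, 128794],
        "StartDateTime": pd.to_datetime(
            ["2018-10-19 14:00:00", "2014-10-30 07:00:00", "2013-11-16 00:01:00"]
        ),
    }
)

# sort_values returns a new, sorted DataFrame; sample itself is left as it was.
earliest_first = sample.sort_values(by="StartDateTime", ascending=True)
latest_first = sample.sort_values(by="StartDateTime", ascending=False)

print(earliest_first.head(2))  # .head(2) is equivalent to [:2]
print(latest_first["EventID"].tolist())
```

Note that the rows keep their original index labels after sorting, which is why the index in the output below is out of order.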

In [142]:
df.sort_values(by='StartDateTime', ascending=True)[:20]

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
6313,43257,Theater Load in and Load Outs,2012-01-01 06:00:00,2012-01-31 20:00:00,2011-12-29 11:40:41,"Mayor's Office of Film, Theatre & Broadcasting",WEST 48 STREET between 6 AVENUE and 7 AVENUE,Manhattan,5,18,Theater,Theater,United States of America,10036
23616,43053,Shooting Permit,2012-01-02 00:01:00,2012-01-02 23:59:00,2011-12-21 12:13:45,"Mayor's Office of Film, Theatre & Broadcasting",WEST 33 STREET between 6 AVENUE and 8 AVENUE...,Manhattan,"4, 5",14,Theater,Theater,United States of America,"10001, 10121"
27209,43264,Theater Load in and Load Outs,2012-01-02 06:00:00,2012-01-03 18:00:00,2011-12-29 14:40:59,"Mayor's Office of Film, Theatre & Broadcasting",ASHLAND PLACE between DEKALB AVENUE and FULTON...,Brooklyn,2,88,Theater,Theater,United States of America,11217
8402,42926,Theater Load in and Load Outs,2012-01-02 06:00:00,2012-01-31 23:00:00,2011-12-19 14:05:39,"Mayor's Office of Film, Theatre & Broadcasting",WEST 46 STREET between 7 AVENUE and 6 AVENUE,Manhattan,5,18,Theater,Theater,United States of America,10036
10934,42737,Theater Load in and Load Outs,2012-01-02 06:00:00,2012-01-18 20:00:00,2011-12-15 14:06:52,"Mayor's Office of Film, Theatre & Broadcasting",WEST 46 STREET between 7 AVENUE and 8 AVENUE,Manhattan,5,18,Theater,Theater,United States of America,10036
14509,42069,Theater Load in and Load Outs,2012-01-02 07:00:00,2012-01-27 22:00:00,2011-12-07 16:38:54,"Mayor's Office of Film, Theatre & Broadcasting",WEST 44 STREET between 7 AVENUE and 8 AVENUE,Manhattan,5,14,Theater,Theater,United States of America,10036
3483,43070,Shooting Permit,2012-01-02 07:00:00,2012-01-06 23:00:00,2011-12-21 16:32:43,"Mayor's Office of Film, Theatre & Broadcasting",WEST 37 STREET between 10 AVENUE and 11 AVENUE,Manhattan,4,10,Television,Game show,United States of America,10018
19958,42780,Theater Load in and Load Outs,2012-01-02 18:00:00,2012-01-03 00:00:00,2011-12-15 17:47:56,"Mayor's Office of Film, Theatre & Broadcasting",WEST 56 STREET between 7 AVENUE and 6 AVENUE,Manhattan,5,18,Theater,Theater,United States of America,10019
10679,42781,Theater Load in and Load Outs,2012-01-03 00:00:00,2012-01-03 18:00:00,2011-12-15 17:52:18,"Mayor's Office of Film, Theatre & Broadcasting",WEST 56 STREET between 7 AVENUE and 6 AVENUE,Manhattan,5,18,Theater,Theater,United States of America,10019
12627,43081,Shooting Permit,2012-01-03 06:00:00,2012-01-03 21:00:00,2011-12-22 10:17:26,"Mayor's Office of Film, Theatre & Broadcasting",CITY ISLAND AVENUE between BEACH STREET and CR...,Bronx,"10, 12, 28","45, 47",Television,Episodic series,United States of America,"10461, 10464, 10465, 10469, 10475"


## New Column - Finding the Longest shoot 

Now we're going to look at making new values (and columns!) from the existing data. 

``df['LengthOfShoot'] = df['EndDateTime'] - df['StartDateTime']``

Because we formatted them as dates, we can subtract them from each other to get a **time difference**. We can then sort by the new column **LengthOfShoot** to find the longest and shortest shoots.

We see the longest shoot is **360 days 00:59:00** and the shortest is **0 days 00:01:00**, possibly indicating that these are the boundaries of permits you can receive.
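
The result of the subtraction is a timedelta column. If a plain number is easier to work with, one option is to convert it to hours, as in this sketch. The `LengthInHours` column name is made up for the example; the two rows use start and end times from the table above.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "StartDateTime": pd.to_datetime(["2018-10-19 14:00:00", "2018-10-20 07:00:00"]),
        "EndDateTime": pd.to_datetime(["2018-10-20 04:00:00", "2018-10-20 18:00:00"]),
    }
)

# Subtracting two datetime columns gives a timedelta column.
sample["LengthOfShoot"] = sample["EndDateTime"] - sample["StartDateTime"]

# .dt.total_seconds() turns each timedelta into a float number of seconds.
sample["LengthInHours"] = sample["LengthOfShoot"].dt.total_seconds() / 3600
print(sample[["LengthOfShoot", "LengthInHours"]])
```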

In [143]:
#Make a new column containing the end time minus the start time
df['LengthOfShoot'] = df['EndDateTime'] - df['StartDateTime']
#Sort values and select the first 20 values
df.sort_values(by='LengthOfShoot', ascending=False)[:20]

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s),LengthOfShoot
65909,525411,Shooting Permit,2020-01-06 00:01:00,2020-12-31 01:00:00,2020-01-02 16:58:34,"Mayor's Office of Film, Theatre & Broadcasting",WEST 53 STREET between BROADWAY and 8 AVENUE...,Manhattan,5,18,Television,Talk Show,United States of America,10019,360 days 00:59:00
9759,458022,Shooting Permit,2019-01-07 00:01:00,2019-12-31 19:00:00,2019-01-02 10:12:33,"Mayor's Office of Film, Theatre & Broadcasting",WEST 53 STREET between BROADWAY and 8 AVENUE...,Manhattan,5,18,Television,Talk Show,United States of America,10019,358 days 18:59:00
25839,74427,Theater Load in and Load Outs,2012-10-28 06:00:00,2013-03-24 06:00:00,2012-10-09 12:44:44,"Mayor's Office of Film, Theatre & Broadcasting",AMSTERDAM AVENUE between WEST 73 STREET and ...,Manhattan,7,20,Theater,Theater,United States of America,10023,147 days 00:00:00
6088,72410,Shooting Permit,2012-09-18 09:00:00,2012-12-31 01:00:00,2012-09-14 18:50:15,"Mayor's Office of Film, Theatre & Broadcasting",WEST 33 STREET between 10 AVENUE and 11 AVENUE,Manhattan,4,10,Television,Talk Show,United States of America,10001,103 days 16:00:00
26202,58198,Theater Load in and Load Outs,2012-05-21 07:00:00,2012-07-31 23:59:00,2012-05-15 17:38:45,"Mayor's Office of Film, Theatre & Broadcasting",WEST 51 STREET between 5 AVENUE and 6 AVENUE...,Manhattan,5,18,Theater,Theater,United States of America,"10019, 10020",71 days 16:59:00
15761,351665,Theater Load in and Load Outs,2017-06-01 00:01:00,2017-08-08 23:59:00,2017-05-15 13:19:18,"Mayor's Office of Film, Theatre & Broadcasting",WEST 62 STREET between COLUMBUS AVENUE and A...,Manhattan,7,20,Theater,Theater,United States of America,10023,68 days 23:58:00
29017,44756,Theater Load in and Load Outs,2012-01-23 07:00:00,2012-03-30 20:00:00,2012-01-20 13:56:43,"Mayor's Office of Film, Theatre & Broadcasting",WEST 44 STREET between 7 AVENUE and 8 AVENUE,Manhattan,5,14,Theater,Theater,United States of America,10036,67 days 13:00:00
57099,418275,Theater Load in and Load Outs,2018-06-04 00:01:00,2018-08-08 23:59:00,2018-05-15 16:34:08,"Mayor's Office of Film, Theatre & Broadcasting",WEST 62 STREET between COLUMBUS AVENUE and A...,Manhattan,7,20,Theater,Theater,United States of America,10023,65 days 23:58:00
20801,64937,Theater Load in and Load Outs,2012-07-09 06:00:00,2012-09-09 22:00:00,2012-07-06 13:52:34,"Mayor's Office of Film, Theatre & Broadcasting",WEST 47 STREET between 8 AVENUE and BROADWAY,Manhattan,5,18,Theater,Theater,United States of America,10036,62 days 16:00:00
28368,44604,Theater Load in and Load Outs,2012-01-27 05:00:00,2012-03-27 22:00:00,2012-01-19 11:00:08,"Mayor's Office of Film, Theatre & Broadcasting",WEST 55 STREET between 6 AVENUE and 7 AVENUE,Manhattan,5,18,Theater,Theater,United States of America,10019,60 days 17:00:00


## Missing Values

**NaN** in Pandas represents a missing value, or something that it was unable to parse. NaN values are excluded by **count()**.

This means we can divide the ``count()`` by the ``len`` and anything that is not 1.00 will have missing values!

We can see that "CommunityBoard(s)", "ZipCode(s)" and "PolicePrecinct(s)" all have missing values, so we can use **isna()** to pick out the affected rows and see if there's a reason why these are missing!

``df[df["CommunityBoard(s)"].isna()]``

We can look at the summary statistics to try and identify any trends 

``df[df["CommunityBoard(s)"].isna()].describe(include="all")``

And it looks like there is no trend based on date. Most are in Manhattan, although most films are in Manhattan regardless. 3 are at the same location exactly, and some look like they are for the same mini-series. 

We're still not clear why this data is missing really! If it was important, or a greater amount of data, we could investigate further with the source.
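
As a compact sketch of the same checks on a four-row frame with made-up values: `isna().sum()` counts the gaps directly, which is an alternative to dividing `count()` by `len`.

```python
import pandas as pd

# Illustrative rows; None stands in for a missing value.
sample = pd.DataFrame(
    {
        "Borough": ["Queens", "Manhattan", "Brooklyn", "Manhattan"],
        "ZipCode(s)": ["11101", None, "11201", None],
    }
)

missing_per_column = sample.isna().sum()       # number of missing values per column
share_present = sample.count() / len(sample)   # fraction of rows that have a value
print(missing_per_column)
print(share_present)

# Keep only the rows where the zip code is missing, then see which boroughs they are in.
rows_missing_zip = sample[sample["ZipCode(s)"].isna()]
print(rows_missing_zip["Borough"].value_counts())
```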

In [144]:
#Divide without NaN by total length revealing which columns have missing values
df.count() / len(df)

EventID              1.000000
EventType            1.000000
StartDateTime        1.000000
EndDateTime          1.000000
EnteredOn            1.000000
EventAgency          1.000000
ParkingHeld          1.000000
Borough              1.000000
CommunityBoard(s)    0.999733
PolicePrecinct(s)    0.999733
Category             1.000000
SubCategoryName      1.000000
Country              1.000000
ZipCode(s)           0.999733
LengthOfShoot        1.000000
dtype: float64

In [145]:
#Look at rows for which CommunityBoard(s) contains NaN
df[df["CommunityBoard(s)"].isna()]

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s),LengthOfShoot
1961,447520,Shooting Permit,2018-11-01 16:00:00,2018-11-02 03:00:00,2018-10-28 16:56:02,"Mayor's Office of Film, Theatre & Broadcasting","Center Blvd between 55th Ave and 57th Ave, 57...",Queens,,,Still Photography,Not Applicable,United States of America,,0 days 11:00:00
5458,162029,Shooting Permit,2014-05-28 07:00:00,2014-05-28 12:00:00,2014-05-27 13:22:10,"Mayor's Office of Film, Theatre & Broadcasting",West 67th Street between Columbus Ave and Cent...,Manhattan,,,Television,Talk Show,United States of America,,0 days 05:00:00
13467,464332,Shooting Permit,2019-02-14 15:00:00,2019-02-15 05:00:00,2019-02-12 15:50:53,"Mayor's Office of Film, Theatre & Broadcasting",Greenpoint Ave between Railroad Ave and Dead E...,Queens,,,Television,Cable-episodic,United States of America,,0 days 14:00:00
15460,128473,Shooting Permit,2013-11-11 08:00:00,2013-11-11 20:00:00,2013-11-06 10:22:50,"Mayor's Office of Film, Theatre & Broadcasting",AMSTERDAM AVENUE between WEST 62 STREET and ...,Manhattan,,,Television,Cable-other,United States of America,,0 days 12:00:00
20015,179034,Shooting Permit,2014-08-30 12:00:00,2014-08-30 22:00:00,2014-08-28 13:49:39,"Mayor's Office of Film, Theatre & Broadcasting",Victory Boulevard between Wild Avenue and Crab...,Staten Island,,,Film,Feature,United States of America,,0 days 10:00:00
36044,47343,Shooting Permit,2012-02-23 13:00:00,2012-02-23 17:00:00,2012-02-22 14:33:38,"Mayor's Office of Film, Theatre & Broadcasting",PEARL ST between HANOVER SQUARE and COENTIES A...,Manhattan,,,Television,Episodic series,United States of America,,0 days 04:00:00
38043,194942,Shooting Permit,2015-01-06 16:00:00,2015-01-07 01:00:00,2014-12-18 12:23:13,"Mayor's Office of Film, Theatre & Broadcasting",Withers St between Meeker Ave and Union Ave,Brooklyn,,,Television,Cable-episodic,United States of America,,0 days 09:00:00
44832,372043,Shooting Permit,2017-08-30 08:00:00,2017-08-30 21:00:00,2017-08-22 19:32:04,"Mayor's Office of Film, Theatre & Broadcasting",LAFAYETTE STREET between SPRING STREET and PRI...,Manhattan,,,Commercial,Commercial,United States of America,,0 days 13:00:00
46163,372040,Shooting Permit,2017-08-28 08:00:00,2017-08-28 21:00:00,2017-08-22 19:21:21,"Mayor's Office of Film, Theatre & Broadcasting",LAFAYETTE STREET between SPRING STREET and PRI...,Manhattan,,,Commercial,Commercial,United States of America,,0 days 13:00:00
55547,413934,Shooting Permit,2018-05-03 12:00:00,2018-05-03 22:00:00,2018-04-30 10:30:17,"Mayor's Office of Film, Theatre & Broadcasting",MANGIN STREET between DELANCY STREET and DELAN...,Manhattan,,,Television,Episodic series,United States of America,,0 days 10:00:00


In [146]:
#Look at the summary of rows for which CommunityBoard(s) contains NaN
df[df["CommunityBoard(s)"].isna()].describe(include="all")


Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s),LengthOfShoot
count,18.0,18,18,18,18,18,18,18,0.0,0.0,18,18,18,0.0,18
unique,,2,18,18,18,1,15,4,0.0,0.0,4,8,1,0.0,
top,,Shooting Permit,2018-11-01 16:00:00,2018-11-02 03:00:00,2018-10-28 16:56:02,"Mayor's Office of Film, Theatre & Broadcasting",LAFAYETTE STREET between SPRING STREET and PRI...,Manhattan,,,Television,Cable-episodic,United States of America,,
freq,,14,1,1,1,18,3,12,,,12,4,18,,
first,,,2012-02-23 13:00:00,2012-02-23 17:00:00,2012-02-22 14:33:38,,,,,,,,,,
last,,,2020-10-16 07:00:00,2020-10-16 20:00:00,2020-10-14 15:33:36,,,,,,,,,,
mean,343499.222222,,,,,,,,,,,,,,0 days 11:38:20
std,159036.161302,,,,,,,,,,,,,,0 days 03:05:44.030319624
min,47343.0,,,,,,,,,,,,,,0 days 04:00:00
25%,212747.75,,,,,,,,,,,,,,0 days 10:15:00


### Fixing missing values

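We can use the **fillna()** function to replace missing values with a string. Note that ``fillna()`` returns a new DataFrame and leaves ``df`` unchanged unless we assign the result back.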

``df[["CommunityBoard(s)","ZipCode(s)","PolicePrecinct(s)"]].fillna("unknown")``
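
Because `fillna()` returns a copy by default, the original frame keeps its gaps until the result is stored. The sketch below shows one way to keep the change, by assigning the filled columns back; the three rows are illustrative.

```python
import pandas as pd

# Illustrative rows; None stands in for a missing value.
sample = pd.DataFrame(
    {
        "CommunityBoard(s)": ["2", None, "4, 5"],
        "ZipCode(s)": ["10012", None, "10001, 10121"],
    }
)

columns_to_fill = ["CommunityBoard(s)", "ZipCode(s)"]

filled = sample[columns_to_fill].fillna("unknown")
print(sample.isna().sum().sum())  # the original still has its gaps

# Assigning the result back is what actually changes the DataFrame.
sample[columns_to_fill] = sample[columns_to_fill].fillna("unknown")
print(sample.isna().sum().sum())
```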

In [147]:
#Replace missing values with string ("unknown")
df[["CommunityBoard(s)","ZipCode(s)","PolicePrecinct(s)"]].fillna("unknown")

Unnamed: 0,CommunityBoard(s),ZipCode(s),PolicePrecinct(s)
0,2,10012,1
1,"12, 8","10034, 10463","34, 50"
2,"2, 5",11378,"104, 108"
3,2,11201,84
4,"4, 5","10001, 10121",14
...,...,...,...
67354,2,11101,108
67355,5,10017,"14, 18"
67356,"1, 2, 7","10454, 10474, 11220","40, 41, 72"
67357,2,11101,108


## Selection Bias 

Thinking about interesting machine learning models we could build from this, the first thing I thought of was something that could predict whether a permit would be granted or not. However, this is something that we can't model from this dataset as it **only shows granted permits**. 

This is an example of **selection bias** in a dataset and is something we want to be careful of. If we are trying to model something and the dataset doesn't include all occurrences, it will be biased towards the things that got recorded, or selected to be in the dataset. 



## Cooked data - What's missing?



We actually don't have data for who the permit was granted to, or how much they paid for it. Which I think would be interesting!!!



# Task 1 : Import a JSON file instead of a CSV using Pandas

You may use the JSON data provided by the exchange rate API.
- https://api.exchangerate-api.com/v4/latest/USD

Before importing the data, visit the above link to see the JSON in its raw format, and later compare it to what it looks like in a pandas DataFrame.

## Dictionaries 

- Dictionaries are used to store data values in key:value pairs.

- A dictionary is a collection which is ordered (as of Python 3.7), changeable, and does not allow duplicate keys.

- More on https://www.w3schools.com/python/python_dictionaries.asp

In [162]:
thisdict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
print(thisdict)

{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}


In [163]:
# Print the "brand" value of the dictionary:
print(thisdict["brand"])

Ford
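The three properties listed above (ordered, changeable, no duplicate keys) can be checked directly. A short sketch, using a fresh dictionary named `car` so it does not interfere with `thisdict`:

```python
car = {"brand": "Ford", "model": "Mustang", "year": 1964}

# Changeable: existing values can be updated and new keys added.
car["year"] = 2020
car["color"] = "red"

# No duplicate keys: a repeated key in a literal keeps only the last value.
duplicate = {"brand": "Ford", "brand": "Toyota"}

# Ordered: keys come back in the order they were inserted.
print(list(car.keys()))
print(duplicate)

# .get() returns a default instead of raising KeyError for a missing key.
print(car.get("owner", "unknown"))
```

Using `.get()` with a default is often handy when working with JSON data, where a key may be absent from some records.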





# Requesting Data from an API

- API is the acronym for Application Programming Interface, which is a software intermediary that allows two applications to talk to each other. Each time you use an app like Facebook, send an instant message, or check the weather on your phone, you’re using an API.


- Coronavirus Covid-19 API - DOCUMENTATION
https://documenter.getpostman.com/view/10808728/SzS8rjbc

Requests is an elegant and simple HTTP library for Python. HTTP itself “is an application-layer protocol for transmitting hypermedia documents, such as HTML. It was designed for communication between web browsers and web servers”
So, requests is a package that helps our Python code communicate with a web server somewhere that is storing data we are interested in. 

In [148]:
import requests

In [149]:
url = 'https://api.covid19api.com/summary'

In [150]:
# Using the requests package to make a GET request from this API endpoint.
r = requests.get(url)

What does <Response [200]> mean? That the request has succeeded: 200 is the HTTP status code for "OK".

In [151]:
r

<Response [200]>
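Other status codes follow the same scheme: codes from 200 to 299 signal success, while 4xx and 5xx codes signal client and server errors. A minimal sketch using only the standard library (`describe_status` is a hypothetical helper written for this example, not part of requests):

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase}: {outcome}"

for code in (200, 404, 500):
    print(describe_status(code))
```

With requests, the integer code is available as `r.status_code`, and calling `r.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient check before parsing the body.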

We will now use a method called json() to extract the JSON-structured data from the response 'r'

In [152]:
json = r.json()

In [153]:
# .keys() is a method of returning keys of a dictionary 
# More on https://www.w3schools.com/python/ref_dictionary_keys.asp

json.keys()

dict_keys(['ID', 'Message', 'Global', 'Countries', 'Date'])

In [154]:
json['Global']

{'NewConfirmed': 247389,
 'TotalConfirmed': 530357202,
 'NewDeaths': 617,
 'TotalDeaths': 6293184,
 'NewRecovered': 0,
 'TotalRecovered': 0,
 'Date': '2022-06-06T00:52:54.553Z'}

In [155]:
json['Countries']

[{'ID': '2dc7249a-503a-47ca-9901-09799587a982',
  'Country': 'Afghanistan',
  'CountryCode': 'AF',
  'Slug': 'afghanistan',
  'NewConfirmed': 31,
  'TotalConfirmed': 180615,
  'NewDeaths': 0,
  'TotalDeaths': 7708,
  'NewRecovered': 0,
  'TotalRecovered': 0,
  'Date': '2022-06-06T00:52:54.553Z',
  'Premium': {}},
 {'ID': '3b4e7a32-560e-4f75-a665-a9b90f8b938f',
  'Country': 'Albania',
  'CountryCode': 'AL',
  'Slug': 'albania',
  'NewConfirmed': 32,
  'TotalConfirmed': 276342,
  'NewDeaths': 0,
  'TotalDeaths': 3497,
  'NewRecovered': 0,
  'TotalRecovered': 0,
  'Date': '2022-06-06T00:52:54.553Z',
  'Premium': {}},
 {'ID': '6254e385-f97d-4574-ba0c-04dc9ef9cc21',
  'Country': 'Algeria',
  'CountryCode': 'DZ',
  'Slug': 'algeria',
  'NewConfirmed': 0,
  'TotalConfirmed': 265889,
  'NewDeaths': 0,
  'TotalDeaths': 6875,
  'NewRecovered': 0,
  'TotalRecovered': 0,
  'Date': '2022-06-06T00:52:54.553Z',
  'Premium': {}},
 {'ID': 'ff4fc1d2-280d-40da-a784-62e4fc3ad00c',
  'Country': 'Andorra',


In [156]:
type(json['Global'])

dict

In [157]:
type(json['Countries'])

list

In [158]:
type(json['Date'])

str

In [159]:
# Countries is a list of dictionaries, so let's explore further
json['Countries'][0]

{'ID': '2dc7249a-503a-47ca-9901-09799587a982',
 'Country': 'Afghanistan',
 'CountryCode': 'AF',
 'Slug': 'afghanistan',
 'NewConfirmed': 31,
 'TotalConfirmed': 180615,
 'NewDeaths': 0,
 'TotalDeaths': 7708,
 'NewRecovered': 0,
 'TotalRecovered': 0,
 'Date': '2022-06-06T00:52:54.553Z',
 'Premium': {}}

In [160]:
df_countries = pd.DataFrame(json['Countries'])

In [161]:
df_countries

Unnamed: 0,ID,Country,CountryCode,Slug,NewConfirmed,TotalConfirmed,NewDeaths,TotalDeaths,NewRecovered,TotalRecovered,Date,Premium
0,2dc7249a-503a-47ca-9901-09799587a982,Afghanistan,AF,afghanistan,31,180615,0,7708,0,0,2022-06-06T00:52:54.553Z,{}
1,3b4e7a32-560e-4f75-a665-a9b90f8b938f,Albania,AL,albania,32,276342,0,3497,0,0,2022-06-06T00:52:54.553Z,{}
2,6254e385-f97d-4574-ba0c-04dc9ef9cc21,Algeria,DZ,algeria,0,265889,0,6875,0,0,2022-06-06T00:52:54.553Z,{}
3,ff4fc1d2-280d-40da-a784-62e4fc3ad00c,Andorra,AD,andorra,0,43067,0,153,0,0,2022-06-06T00:52:54.553Z,{}
4,909d6a0b-00b3-4880-ab89-7a6cea152a13,Angola,AO,angola,0,99761,0,1900,0,0,2022-06-06T00:52:54.553Z,{}
...,...,...,...,...,...,...,...,...,...,...,...,...
190,dd283451-a0aa-4909-ae69-c9fd97e2773b,Venezuela (Bolivarian Republic),VE,venezuela,64,523833,0,5722,0,0,2022-06-06T00:52:54.553Z,{}
191,2930b2cc-b998-4b54-a514-d6c65cca54a2,Viet Nam,VN,vietnam,881,10724554,0,43080,0,0,2022-06-06T00:52:54.553Z,{}
192,fa78125c-a45f-42a3-afd5-619c009efbc6,Yemen,YE,yemen,0,11822,0,2149,0,0,2022-06-06T00:52:54.553Z,{}
193,486d405d-6f60-4274-8a66-1baa8f6b6047,Zambia,ZM,zambia,0,322207,0,3988,0,0,2022-06-06T00:52:54.553Z,{}
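The conversion works because `pd.DataFrame` accepts a list of dictionaries: each dictionary becomes a row and its keys become column names. A small offline sketch, using three records copied from the output above and trimmed to a few keys (the names `records`, `sample` and `top` are chosen for this example):

```python
import pandas as pd

# Three records copied from the API output above, trimmed to a few keys.
records = [
    {"Country": "Afghanistan", "CountryCode": "AF", "TotalConfirmed": 180615, "TotalDeaths": 7708},
    {"Country": "Albania", "CountryCode": "AL", "TotalConfirmed": 276342, "TotalDeaths": 3497},
    {"Country": "Algeria", "CountryCode": "DZ", "TotalConfirmed": 265889, "TotalDeaths": 6875},
]

# Each dictionary becomes a row; the keys become column names.
sample = pd.DataFrame(records)

# Once in a DataFrame, the usual pandas tools apply, e.g. sorting by a column.
top = sample.sort_values("TotalConfirmed", ascending=False)
print(top[["Country", "TotalConfirmed"]])
```

The same approach should work for any API response whose payload is a list of flat dictionaries.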


# Task 2: Get data of confirmed cases in the UK starting from the first recorded case

- Hint: Refer to the API documentation to understand how to query this data  

# Task 3: Create a pandas dataframe for the data obtained in task 2 