# Week 2

## Installing and importing a Python package/library

- Installing Pandas Library for Data Processing 
- What is Pandas? 
    - Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
    - Its main data format is the DataFrame, which is like a 2D array, but with column names and indexes for the rows.
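To make the DataFrame description concrete, here is a minimal sketch of a tiny DataFrame. The values are borrowed from the film-permit rows used later in this notebook, while the row labels ``a``, ``b``, ``c`` and the name ``df_small`` are made up for illustration.

```python
import pandas as pd

# A tiny DataFrame: two named columns and a custom row index.
# The row labels "a", "b", "c" are hypothetical, chosen only for this example.
df_small = pd.DataFrame(
    {
        "Borough": ["Manhattan", "Queens", "Brooklyn"],
        "Category": ["Television", "Television", "Still Photography"],
    },
    index=["a", "b", "c"],
)

print(df_small.columns.tolist())     # the column names
print(df_small.index.tolist())       # the row index labels
print(df_small.loc["b", "Borough"])  # look up one cell by row label and column name
```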




### Using pip to install Pandas 

In [None]:
# Only needs to be run once

import sys
!{sys.executable} -m pip install pandas

### Importing the pandas library

In [34]:
import pandas as pd
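A quick way to check that the install and import worked is to print the library version. This is only a small sanity check; the exact version string depends on your own installation.

```python
import pandas as pd

# If the import succeeds, pandas is installed for this Python environment.
# The version printed will differ from machine to machine.
print(pd.__version__)
```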

## Loading and exploring data with Pandas



First we load in the data. This particular dataset is from [NYC granted film permits](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p) and is in the **comma-separated values (.csv)** format.

``df = pd.read_csv("data/Film_Permits.csv")``

Pandas shows us the **head** (first rows) and the **tail** (last rows), as well as the **shape**. So we know there are 14 columns (separate pieces of info about each permit), and 67359 permits granted.
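The head, tail and shape can also be requested explicitly. The sketch below uses a small in-memory table built from a few values in the permit data rather than the full file, so it runs without the CSV; the variable name ``permits`` is just for this example.

```python
import pandas as pd

# A five-row stand-in for the full permits table (values copied from the first rows).
permits = pd.DataFrame(
    {
        "EventID": [446040, 446168, 186438, 445255, 128794],
        "Borough": ["Manhattan", "Manhattan", "Queens", "Brooklyn", "Manhattan"],
        "Category": ["Television", "Film", "Television", "Still Photography", "Theater"],
    }
)

print(permits.shape)    # (number of rows, number of columns)
print(permits.head(2))  # first two rows
print(permits.tail(2))  # last two rows
```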

We can see it's a mix of categorical data (EventType, Borough, etc.), unique IDs, and dates. The categories are mainly **nominal**, in that they have no intrinsic order. Even the ones which are numbers (like CommunityBoard(s) and PolicePrecinct(s)) are **nominal**, in that the numbers don't represent a ranking (one isn't best), and we would never want to do any maths with them.

The dates, however, are **continuous**, in that they are numbers that could take any value, and we can do arithmetic with them (e.g. subtract one from another to find a duration).
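As a small illustration of date arithmetic, the sketch below subtracts the start time of the first permit from its end time. The two timestamps are typed in by hand from that row, and the variable names are only for this example.

```python
import pandas as pd

# Start and end of the first permit in the table, written in ISO format.
start = pd.Timestamp("2018-10-19 14:00:00")
end = pd.Timestamp("2018-10-20 04:00:00")

# Subtracting two timestamps gives a Timedelta (a length of time).
duration = end - start
print(duration)
print(duration.total_seconds() / 3600)  # the same duration expressed in hours
```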

In [35]:
#load
#pd.options.display.max_rows = 100
df = pd.read_csv("data/Film_Permits.csv")

In [36]:
df

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
0,446040,Shooting Permit,10/19/2018 02:00:00 PM,10/20/2018 04:00:00 AM,10/16/2018 11:57:27 AM,"Mayor's Office of Film, Theatre & Broadcasting",THOMPSON STREET between PRINCE STREET and SPRI...,Manhattan,2,1,Television,Cable-episodic,United States of America,10012
1,446168,Shooting Permit,10/19/2018 02:00:00 PM,10/20/2018 02:00:00 AM,10/16/2018 07:03:56 PM,"Mayor's Office of Film, Theatre & Broadcasting",MARBLE HILL AVENUE between WEST 227 STREET an...,Manhattan,"12, 8","34, 50",Film,Feature,United States of America,"10034, 10463"
2,186438,Shooting Permit,10/30/2014 07:00:00 AM,10/31/2014 02:00:00 AM,10/27/2014 12:14:15 PM,"Mayor's Office of Film, Theatre & Broadcasting",LAUREL HILL BLVD between REVIEW AVENUE and RUS...,Queens,"2, 5","104, 108",Television,Episodic series,United States of America,11378
3,445255,Shooting Permit,10/20/2018 07:00:00 AM,10/20/2018 06:00:00 PM,10/09/2018 09:34:58 PM,"Mayor's Office of Film, Theatre & Broadcasting",JORALEMON STREET between BOERUM PLACE and COUR...,Brooklyn,2,84,Still Photography,Not Applicable,United States of America,11201
4,128794,Theater Load in and Load Outs,11/16/2013 12:01:00 AM,11/17/2013 06:00:00 AM,11/07/2013 03:48:28 PM,"Mayor's Office of Film, Theatre & Broadcasting",WEST 31 STREET between 7 AVENUE and 8 AVENUE...,Manhattan,"4, 5",14,Theater,Theater,United States of America,"10001, 10121"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67354,511630,Shooting Permit,10/11/2019 10:00:00 AM,10/11/2019 11:00:00 PM,10/08/2019 02:05:06 PM,"Mayor's Office of Film, Theatre & Broadcasting","44 ROAD between 24 STREET and HUNTER STREET, ...",Queens,2,108,Television,Episodic series,United States of America,11101
67355,548906,Shooting Permit,10/18/2020 06:00:00 AM,10/19/2020 06:00:00 PM,10/14/2020 03:50:06 PM,"Mayor's Office of Film, Theatre & Broadcasting",MADISON AVENUE between EAST 45 STREET and EA...,Manhattan,5,"14, 18",Commercial,Commercial,United States of America,10017
67356,491040,Shooting Permit,06/12/2019 07:00:00 AM,06/12/2019 08:00:00 PM,06/10/2019 12:38:04 PM,"Mayor's Office of Film, Theatre & Broadcasting",SPOFFORD AVENUE between COSTER STREET and FAIL...,Bronx,"1, 2, 7","40, 41, 72",Television,Episodic series,United States of America,"10454, 10474, 11220"
67357,548583,Shooting Permit,10/19/2020 08:00:00 AM,10/19/2020 11:59:00 PM,10/08/2020 05:05:23 PM,"Mayor's Office of Film, Theatre & Broadcasting",QUEENS PLAZA SOUTH between 21 STREET and 22 ST...,Queens,2,108,Television,Episodic series,United States of America,11101


##  Data types

### Checking data types


We've considered what data is there, but some are numbers, some are text, and some are dates. What does **Pandas** think each one is? This is important because it will determine what we can do with the data in each column, such as how it will be sorted and filtered.

We can use 

``df.dtypes``

to see each column's data type. We see that, apart from ``EventID`` (``int64``), they are all **objects**, which is the Pandas type for strings or mixed values. We can load the data in again and tell it which columns represent dates, and it will automatically parse them if they are in a consistent format. This means we can do things like compare them (e.g. which one is earlier?), which is really useful for sorting.

You will see the loading takes longer, as it has to parse the dates, and that afterwards the chosen columns are of the ``datetime64[ns]`` type.
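To see why the type matters for sorting, here is a short sketch comparing the same three start times sorted as text and sorted as dates. The format string is an assumption based on how the dates appear in the table (month/day/year with a 12-hour clock).

```python
import pandas as pd

# Three start times from the permits table, still as plain strings.
raw = pd.Series(
    [
        "10/30/2014 07:00:00 AM",
        "10/19/2018 02:00:00 PM",
        "11/16/2013 12:01:00 AM",
    ]
)

# Sorted as text, values are compared character by character.
as_text = raw.sort_values()

# Sorted as dates, values are compared chronologically.
# The format string is assumed from the look of the data.
as_dates = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p").sort_values()

print(as_text.iloc[0])   # "earliest" according to text order
print(as_dates.iloc[0])  # earliest according to the calendar
```

Sorted as text, the 2018 permit comes first because "10/19" sorts before "10/30" and "11/16"; sorted as dates, the 2013 permit comes first.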

In [37]:
# What are the types?
# Checking the data type of each column

df.dtypes

EventID               int64
EventType            object
StartDateTime        object
EndDateTime          object
EnteredOn            object
EventAgency          object
ParkingHeld          object
Borough              object
CommunityBoard(s)    object
PolicePrecinct(s)    object
Category             object
SubCategoryName      object
Country              object
ZipCode(s)           object
dtype: object

### Loading and formatting date types

- As we can see above, the columns "StartDateTime", "EndDateTime", and "EnteredOn" are being read as objects even though they contain dates and times, so they should be in the datetime format.
- Refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html for more information on other formats.

In [38]:
# Reloading the csv data and parsing the dates in the correct format.

df = pd.read_csv("data/Film_Permits.csv", parse_dates=["StartDateTime","EndDateTime","EnteredOn"])

In [39]:
# Checking if the data has been parsed in the correct format

df.dtypes

EventID                       int64
EventType                    object
StartDateTime        datetime64[ns]
EndDateTime          datetime64[ns]
EnteredOn            datetime64[ns]
EventAgency                  object
ParkingHeld                  object
Borough                      object
CommunityBoard(s)            object
PolicePrecinct(s)            object
Category                     object
SubCategoryName              object
Country                      object
ZipCode(s)                   object
dtype: object
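The same ``parse_dates`` idea can be tried without the full file. The sketch below reads a two-row CSV held in a string (rows copied from the table above, trimmed to three columns) and then uses the parsed column to find the earliest start. The names ``csv_text`` and ``sample`` are just for this example.

```python
import io

import pandas as pd

# A tiny CSV in memory, using two rows from the permits data.
csv_text = """EventID,StartDateTime,EndDateTime
446040,10/19/2018 02:00:00 PM,10/20/2018 04:00:00 AM
186438,10/30/2014 07:00:00 AM,10/31/2014 02:00:00 AM
"""

# parse_dates tells read_csv which columns to convert to datetimes.
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["StartDateTime", "EndDateTime"],
)

print(sample.dtypes)

# Because the column is now a datetime, min() means "earliest".
earliest = sample["StartDateTime"].min()
print(earliest)
```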

## Summarising Data

**Pandas** can also give us a summary of our data (67359 rows is a lot to look at ourselves!).

``df.describe(include = "all")``


In [40]:
df.describe()

Unnamed: 0,EventID
count,67359.0
mean,282618.005849
std,146069.216982
min,42069.0
25%,152266.0
50%,276047.0
75%,408197.5
max,549009.0


If we don't pass ``include = "all"``, then we will only get summary statistics for **numeric** columns.


In [41]:
df.describe(include = "all")



Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
count,67359.0,67359,67359,67359,67359,67359,67359,67359,67341.0,67341.0,67359,67359,67359,67341.0
unique,,4,26509,32433,66808,1,39417,5,836.0,2598.0,10,30,10,5058.0
top,,Shooting Permit,2018-11-13 06:00:00,2019-12-04 22:00:00,2018-01-30 12:43:07,"Mayor's Office of Film, Theatre & Broadcasting",WEST 48 STREET between 6 AVENUE and 7 AVENUE,Manhattan,1.0,94.0,Television,Episodic series,United States of America,11222.0
freq,,58638,24,16,6,67359,1444,32958,14828.0,7263.0,36985,21549,67291,6427.0
first,,,2012-01-01 06:00:00,2012-01-02 23:59:00,2011-12-07 16:38:54,,,,,,,,,
last,,,2020-10-19 08:00:00,2020-12-31 01:00:00,2020-10-15 16:07:28,,,,,,,,,
mean,282618.005849,,,,,,,,,,,,,
std,146069.216982,,,,,,,,,,,,,
min,42069.0,,,,,,,,,,,,,
25%,152266.0,,,,,,,,,,,,,


When we run the code, we get a table of stats describing our data. We can see that a lot of the number-based stats don't return values for our **nominal** columns, which is fine. But things such as the most common value and the number of unique entries are still interesting.

For example, all 5 New York boroughs are present, and Manhattan is the most common. We can also see the first filming started on **2012-01-01 06:00:00**. This works because we parsed the column as a date, so Pandas is able to order the values.
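The difference between the two calls can be seen on a much smaller table. This is a sketch using five rows copied from the permits data, with only two columns kept; the variable names are only for this example.

```python
import pandas as pd

# One numeric column and one nominal (text) column.
permits = pd.DataFrame(
    {
        "EventID": [446040, 446168, 186438, 445255, 128794],
        "Borough": ["Manhattan", "Manhattan", "Queens", "Brooklyn", "Manhattan"],
    }
)

# By default only the numeric column is summarised.
numeric_only = permits.describe()
print(numeric_only.columns.tolist())

# include="all" adds unique / top / freq for the text column.
everything = permits.describe(include="all")
print(everything.loc[["unique", "top", "freq"], "Borough"])
```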

## Selecting Columns 


We can select a column using its name 

``df["ColumnName"]``

Or we can select a bunch of columns by passing a list

``df[["ColumnName1","ColumnName2"]]``

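Selecting a single column returns a **Series** object, and selecting a list of columns returns a smaller **DataFrame**. To get the results back as an array of values, we can also use

``result.values``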
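A short sketch of what each kind of selection gives back, using a three-row table with values copied from the permits data (the variable names are only for this example):

```python
import pandas as pd

permits = pd.DataFrame(
    {
        "Category": ["Television", "Film", "Television"],
        "SubCategoryName": ["Cable-episodic", "Feature", "Episodic series"],
    }
)

one = permits["SubCategoryName"]                # single brackets, one name
two = permits[["Category", "SubCategoryName"]]  # double brackets, a list of names

print(type(one).__name__)  # the type returned for one column
print(type(two).__name__)  # the type returned for a list of columns
print(list(one.values))    # the raw values of the single column
```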


In [45]:
#select column
df["SubCategoryName"]

0         Cable-episodic
1                Feature
2        Episodic series
3         Not Applicable
4                Theater
              ...       
67354    Episodic series
67355         Commercial
67356    Episodic series
67357    Episodic series
67358    Episodic series
Name: SubCategoryName, Length: 67359, dtype: object

In [50]:
#select columns
df[["Category","SubCategoryName"]]

Unnamed: 0,Category,SubCategoryName
0,Television,Cable-episodic
1,Film,Feature
2,Television,Episodic series
3,Still Photography,Not Applicable
4,Theater,Theater
...,...,...
67354,Television,Episodic series
67355,Commercial,Commercial
67356,Television,Episodic series
67357,Television,Episodic series


In [48]:
#select column and get it back as an array of values
df["SubCategoryName"].values

array(['Cable-episodic', 'Feature', 'Episodic series', ...,
       'Episodic series', 'Episodic series', 'Episodic series'],
      dtype=object)

For more information on the difference between lists and arrays check out https://www.geeksforgeeks.org/difference-between-list-and-array-in-python/

## Counting Columns 


We can get counts to see what the most prevalent combinations of categories are. Here, for the type of thing being filmed, we can see **Feature Films** and **Episodic TV Series** are the most common. This is a good way for us to get a feel for all the different things that are filmed in New York and what you need a permit for.

``df[["Category","SubCategoryName"]].value_counts()``
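As a sketch of what this returns, the example below counts category pairs for just the nine rows visible in the table output earlier, so the numbers are for that small sample only and say nothing about the full dataset.

```python
import pandas as pd

# Category / SubCategoryName for the nine rows shown in the printed table.
permits = pd.DataFrame(
    {
        "Category": [
            "Television", "Film", "Television", "Still Photography", "Theater",
            "Television", "Commercial", "Television", "Television",
        ],
        "SubCategoryName": [
            "Cable-episodic", "Feature", "Episodic series", "Not Applicable", "Theater",
            "Episodic series", "Commercial", "Episodic series", "Episodic series",
        ],
    }
)

# value_counts on two columns counts each distinct pair, most frequent first.
counts = permits[["Category", "SubCategoryName"]].value_counts()
print(counts.head(3))
```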

I wonder what **Television - Not Applicable** is?