<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST3512/blob/main/CST3512_DataFrames_WK02CL03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CST3512 - Information and Data Management II**

**Week #02**


## Pandas Review

## Setup and preliminaries

To read and process data files, this notebook uses **pandas**, a powerful and widely used Python library. It imports the **pandas** library, along with the following libraries to use with pandas data:

* **matplotlib** - for generating plots    
* **numpy** - for calculations on the data    



In [None]:
import pandas as pd
import matplotlib 
import numpy as np

*Note: for additional plotting functionality, the **Seaborn** library may be imported as well.*

# Data Types and Conversions

## Loading Data

### From CSV Files

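In a Colab or Jupyter notebook, a line that begins with `!` is run as a shell command rather than as Python. The following example uses `curl` to download COVID-19 hospitalization data from a raw GitHub URL and save it as **hospitals.csv**: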

In [None]:
!curl 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/hospitalizations/covid-hospitalizations.csv' -o hospitals.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 7212k  100 7212k    0     0  26.0M      0 --:--:-- --:--:-- --:--:-- 26.0M


This notebook uses a dataset with [restaurant inspection results in NYC](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j) which is available online from the City of New York.

A file downloaded this way in a Google Colab notebook is saved in the current working directory of the active Colab session (by default `/content`, the folder that also contains '**sample_data/**'). That is a volatile copy which will not persist after the Colab session is closed.

This notebook fetches the data using the [Linux curl command](https://www.geeksforgeeks.org/curl-command-in-linux-with-examples/) by executing the following command, with the parameter **"-o"** followed by **"restaurant.csv"** to specify the file name under which to save the result:

In [None]:
# Fetches the most recent dataset
!curl 'https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD' -o restaurant.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  156M    0  156M    0     0  4473k      0 --:--:--  0:00:35 --:--:-- 6350k
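Where a shell is not available, the same download can be sketched in plain Python. The example below is a minimal illustration, not part of the original notebook: the helper name `download` is hypothetical, and it is demonstrated on a local `file://` URL so that it runs without network access. Passing the NYC Open Data URL shown above in place of the local one is assumed to behave the same way.

```python
import tempfile
import urllib.request
from pathlib import Path


def download(url: str, destination: str) -> Path:
    """Save the resource at `url` to `destination` and return its path.

    `download` is a hypothetical helper name used only for this sketch.
    """
    path, _headers = urllib.request.urlretrieve(url, destination)
    return Path(path)


# Demonstration with a local file:// URL (no network needed).
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "source.csv"
source.write_text("CAMIS,DBA\n1,EXAMPLE CAFE\n")

saved = download(source.as_uri(), str(work_dir / "copy.csv"))
print(saved.read_text())
```

As with `curl -o`, the second argument plays the role of the output file name.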


To read and process this file in Python, the pandas library provides a convenient function, `read_csv`, which reads the file and returns its contents as a DataFrame.  The following code creates a dataframe **restaurants** from the result of the Pandas **'.read_csv()'** function.

In [None]:
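restaurants = pd.read_csv(
    "restaurant.csv",
    encoding="utf_8",
    dtype=str,  # read every column as text; types are converted explicitly below
    parse_dates=True,
    # infer_datetime_format is omitted: it is deprecated as of pandas 2.0
    low_memory=False,
)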

Using **'.read_csv()'** to read a CSV file (or TSV file) results in an object called a DataFrame, which is made up of rows and columns. DataFrame columns are accessed the same way as elements of a dictionary, by name in square brackets. Using the **restaurants** DataFrame object, the first five rows of data can be displayed with the **'.head(5)'** method:

In [None]:
restaurants.head(5)

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA
0,50062685,POPINA,Brooklyn,127,COLUMBIA STREET,11231,7182221901,Italian,06/12/2017,Violations were cited in the following area(s).,05C,Food contact surface improperly constructed or...,Critical,64,,,02/06/2022,Pre-permit (Non-operational) / Initial Inspection,40.687231762007,-74.001622594583,306,39,4700,3003582,3003190032,BK33
1,50000741,CEMITAS PUEBLA RESTAURANT,Bronx,679,ALLERTON AVENUE,10467,7185477350,Spanish,01/19/2017,Violations were cited in the following area(s).,06C,Food not protected from potential source of co...,Critical,13,A,01/19/2017,02/06/2022,Cycle Inspection / Re-inspection,40.865416789114,-73.86805357738,211,15,33600,2053535,2045080005,BX07
2,50074677,SAKE JAPANESE CUISINE,Brooklyn,324,CHURCH AVENUE,11218,6462806305,Japanese,12/03/2019,Violations were cited in the following area(s).,10F,Non-food contact surface improperly constructe...,Not Critical,12,A,12/03/2019,02/06/2022,Cycle Inspection / Initial Inspection,40.64374123966,-73.977165212997,312,39,48800,3124630,3053360074,BK41
3,50074677,SAKE JAPANESE CUISINE,Brooklyn,324,CHURCH AVENUE,11218,6462806305,Japanese,12/03/2019,Violations were cited in the following area(s).,10F,Non-food contact surface improperly constructe...,Not Critical,12,A,12/03/2019,02/06/2022,Cycle Inspection / Initial Inspection,40.64374123966,-73.977165212997,312,39,48800,3124630,3053360074,BK41
4,40959012,CARVEL,Brooklyn,1652,86 STREET,11214,7182365928,Frozen Desserts,11/12/2019,Violations were cited in the following area(s).,10A,Toilet facility not maintained and provided wi...,Not Critical,13,,,02/06/2022,Cycle Inspection / Initial Inspection,40.609334746392,-74.006108215514,311,43,18000,3166388,3063640045,BK27


The **'.read_csv()'** method has many options.  See the [online documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) for details.
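As a small illustration of those options, the sketch below reads an in-memory CSV rather than the full download. The three data rows are copied from the `.head(5)` output above, reduced to four columns; the variable names `csv_text` and `sample` are made up for this example. It shows `usecols`, `dtype` given per column, and `nrows`.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for restaurant.csv.
csv_text = (
    "CAMIS,DBA,BORO,SCORE\n"
    "50062685,POPINA,Brooklyn,64\n"
    "50000741,CEMITAS PUEBLA RESTAURANT,Bronx,13\n"
    "40959012,CARVEL,Brooklyn,13\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["CAMIS", "DBA", "SCORE"],  # keep only these columns
    dtype={"CAMIS": str},               # keep the ID as text, let SCORE be inferred
    nrows=2,                            # read only the first two data rows
)
print(sample)
print(sample.dtypes)
```

Reading only the needed columns and a limited number of rows can be a quick way to inspect a large file before loading all of it.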



---



## Data Assessment and Transformations

### More DataFrame Exploration

The data types can be displayed for each column (variable) with **'.dtypes'**

In [None]:
restaurants.dtypes

CAMIS                    object
DBA                      object
BORO                     object
BUILDING                 object
STREET                   object
ZIPCODE                  object
PHONE                    object
CUISINE DESCRIPTION      object
INSPECTION DATE          object
ACTION                   object
VIOLATION CODE           object
VIOLATION DESCRIPTION    object
CRITICAL FLAG            object
SCORE                    object
GRADE                    object
GRADE DATE               object
RECORD DATE              object
INSPECTION TYPE          object
Latitude                 object
Longitude                object
Community Board          object
Council District         object
Census Tract             object
BIN                      object
BBL                      object
NTA                      object
dtype: object

The **'.describe()'** method yields a quick overview of the data in the dataframe.

In [None]:
restaurants.describe()

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA
count,373466,372330,373466,372847,373454,367961,373439,369693,373466,369694,365262,367610,373466,356599,188791,184560,373466,369694,373101,373101,367061,367069,367069,365324,372554,367061
unique,29937,23077,6,7681,2489,229,27366,87,1569,5,105,106,3,134,7,1402,1,31,23612,23612,69,51,1186,20408,20065,193
top,40400811,DUNKIN,Manhattan,1,BROADWAY,10003,7185958100,American,01/01/1900,Violations were cited in the following area(s).,10F,Non-food contact surface improperly constructe...,Critical,12,A,06/05/2019,02/06/2022,Cycle Inspection / Initial Inspection,0,0,105,3,87100,4000000,1,MN17
freq,99,3839,145645,2217,13607,9287,297,70701,3772,346454,63743,63860,196374,37340,147635,495,373466,212950,5493,5493,30008,32180,3245,1874,3371,22193


To list each variable as a row, with the count, number of unique values, top value, and frequency as columns, transpose the result from the **'.describe()'** method using **'.T'**.

In [None]:
# Same as above, but the .T command transposes the table
restaurants.describe().T

Unnamed: 0,count,unique,top,freq
CAMIS,373466,29937,40400811,99
DBA,372330,23077,DUNKIN,3839
BORO,373466,6,Manhattan,145645
BUILDING,372847,7681,1,2217
STREET,373454,2489,BROADWAY,13607
ZIPCODE,367961,229,10003,9287
PHONE,373439,27366,7185958100,297
CUISINE DESCRIPTION,369693,87,American,70701
INSPECTION DATE,373466,1569,01/01/1900,3772
ACTION,369694,5,Violations were cited in the following area(s).,346454




---



Because the CSV was read as text, every column has the `object` type, which here holds strings. Many of these columns are more useful when converted to other data types. The **`pd.to_numeric`** and **`pd.to_datetime`** functions are two ways to change dataframe columns to another type, as the following code demonstrates.

### Converting Data Types to Numeric

An `object` column here holds strings. To convert such a column to numeric, use the **`pd.to_numeric()`** function, as shown below:

In [None]:
restaurants["SCORE"] = pd.to_numeric(restaurants["SCORE"])
restaurants["Latitude"] = pd.to_numeric(restaurants["Latitude"])
restaurants["Longitude"] = pd.to_numeric(restaurants["Longitude"])
restaurants.dtypes

CAMIS                     object
DBA                       object
BORO                      object
BUILDING                  object
STREET                    object
ZIPCODE                   object
PHONE                     object
CUISINE DESCRIPTION       object
INSPECTION DATE           object
ACTION                    object
VIOLATION CODE            object
VIOLATION DESCRIPTION     object
CRITICAL FLAG             object
SCORE                    float64
GRADE                     object
GRADE DATE                object
RECORD DATE               object
INSPECTION TYPE           object
Latitude                 float64
Longitude                float64
Community Board           object
Council District          object
Census Tract              object
BIN                       object
BBL                       object
NTA                       object
dtype: object
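When a column contains text that cannot be parsed as a number, `pd.to_numeric()` raises an error by default. The sketch below, using made-up values rather than the real SCORE column, shows the `errors="coerce"` option, which turns unparseable entries into `NaN` instead.

```python
import pandas as pd

# Hypothetical raw values: two valid scores, one placeholder, one missing entry.
raw_scores = pd.Series(["64", "13", "N/A", None], dtype=object)

# errors="coerce" converts anything that is not a number into NaN.
scores = pd.to_numeric(raw_scores, errors="coerce")
print(scores)
print(scores.isna().sum(), "values could not be converted")
```

Counting the `NaN` values before and after conversion is a simple check on how much data was lost to coercion.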



---



###  Converting Data to Dates

Converting appropriate columns into the date data type follows.

* What do you recall about datetime values in Pandas?    
* What are two types of datetime value?



In [None]:
restaurants["GRADE DATE"] = pd.to_datetime(restaurants["GRADE DATE"])
restaurants["RECORD DATE"] = pd.to_datetime(restaurants["RECORD DATE"])
restaurants["INSPECTION DATE"] = pd.to_datetime(restaurants["INSPECTION DATE"])
restaurants.dtypes

#### Note


In tricky cases, there may be a need to pass the `format` parameter, specifying the formatting of the date. To use it, first review how to [parse dates using  Python conventions](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).
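As a minimal sketch of the `format` parameter, the example below parses a few date strings in the month/day/year layout seen in the raw INSPECTION DATE column. The series names are made up for this example, and the last input value is a deliberately invalid placeholder.

```python
import pandas as pd

# Date strings in MM/DD/YYYY layout, as in the raw CSV.
raw_dates = pd.Series(["06/12/2017", "01/19/2017", "12/03/2019"], dtype=object)
dates = pd.to_datetime(raw_dates, format="%m/%d/%Y")
print(dates.dt.year.tolist())
print(dates.dt.month.tolist())

# With errors="coerce", values that do not match the format become NaT.
messy = pd.Series(["06/12/2017", "not a date"], dtype=object)
parsed = pd.to_datetime(messy, format="%m/%d/%Y", errors="coerce")
print(parsed.isna().tolist())
```

Stating the format explicitly avoids ambiguity between month-first and day-first dates.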




---



### Converting Data to Categorical Variables

This is less important, but some variables are best stored as "Categorical". This is most commonly useful for variables that have an implicit order (e.g., the A/B/C grade of the restaurant).  Where this categorization is applicable, it can be helpful in charting (bar charts, etc.).  It is also important to recognize categorical variables if planning to conduct correlation analysis at some point.

In [None]:
restaurants["BORO"] = pd.Categorical(restaurants["BORO"], ordered=False)
restaurants["GRADE"] = pd.Categorical(
    restaurants["GRADE"], categories=["A", "B", "C"], ordered=True
)
restaurants["VIOLATION CODE"] = pd.Categorical(
    restaurants["VIOLATION CODE"], ordered=False
)
restaurants["CRITICAL FLAG"] = pd.Categorical(
    restaurants["CRITICAL FLAG"], ordered=False
)
restaurants["ACTION"] = pd.Categorical(restaurants["ACTION"], ordered=False)
restaurants["CUISINE DESCRIPTION"] = pd.Categorical(
    restaurants["CUISINE DESCRIPTION"], ordered=False
)

restaurants["INSPECTION TYPE"] = pd.Categorical(
    restaurants["INSPECTION TYPE"], ordered=False
)

restaurants.dtypes

CAMIS                            object
DBA                              object
BORO                           category
BUILDING                         object
STREET                           object
ZIPCODE                          object
PHONE                            object
CUISINE DESCRIPTION            category
INSPECTION DATE          datetime64[ns]
ACTION                         category
VIOLATION CODE                 category
VIOLATION DESCRIPTION            object
CRITICAL FLAG                  category
SCORE                           float64
GRADE                          category
GRADE DATE               datetime64[ns]
RECORD DATE              datetime64[ns]
INSPECTION TYPE                category
Latitude                        float64
Longitude                       float64
Community Board                  object
Council District                 object
Census Tract                     object
BIN                              object
BBL                              object
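The effect of `ordered=True` can be seen on a small made-up series. In this sketch (the values are illustrative, not taken from the dataset), the category list defines the order A < B < C, and any value outside the list, such as the "Z" here, becomes missing.

```python
import pandas as pd

grades = pd.Series(
    pd.Categorical(
        ["B", "A", "C", "A", "Z"],       # "Z" is not in the category list
        categories=["A", "B", "C"],
        ordered=True,
    )
)

print(grades.isna().sum())          # values outside the categories become NaN
print(grades.min(), grades.max())   # min/max follow the category order
print((grades > "A").sum())         # comparisons are allowed for ordered categories
```

This is worth keeping in mind for the GRADE column above: restricting the categories to A, B, and C means that any other grade value in the data is treated as missing.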




---



## Descriptive Statistics



### Descriptive Statistics for Numeric Variables


#### Basic descriptive statistics for numeric variables

Given that SCORE is a numeric variable, more detailed descriptive statistics for the variable are available using the **`.describe()`** method:

In [None]:
restaurants["SCORE"].describe()

count    356599.000000
mean         20.492029
std          15.017182
min           0.000000
25%          11.000000
50%          15.000000
75%          26.000000
max         164.000000
Name: SCORE, dtype: float64

### Descriptive Statistics for Dates


In [None]:
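# Note: datetime_is_numeric was removed in pandas 2.0, where dates are always
# summarized numerically; on pandas 2.x, call .describe() with no argument.
restaurants[["INSPECTION DATE", "GRADE DATE", "RECORD DATE"]].describe(datetime_is_numeric=True)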

Unnamed: 0,INSPECTION DATE,GRADE DATE,RECORD DATE
count,373466,184560,373466
mean,2017-12-16 02:41:18.375006208,2019-02-13 20:54:58.517555456,2022-02-06 00:00:00
min,1900-01-01 00:00:00,2013-06-07 00:00:00,2022-02-06 00:00:00
25%,2018-04-09 00:00:00,2018-04-17 00:00:00,2022-02-06 00:00:00
50%,2019-03-12 00:00:00,2019-03-08 00:00:00,2022-02-06 00:00:00
75%,2019-10-22 00:00:00,2019-10-09 00:00:00,2022-02-06 00:00:00
max,2022-02-04 00:00:00,2022-02-04 00:00:00,2022-02-06 00:00:00


In addition to describing a list of columns together, we can describe each column individually.

In [None]:
restaurants["INSPECTION DATE"].describe(datetime_is_numeric=True)

In [None]:
restaurants["GRADE DATE"].describe(datetime_is_numeric=True)

In [None]:
restaurants["RECORD DATE"].describe(datetime_is_numeric=True)

### Descriptive Statistics for Categorical/string columns

Quick statistics about the common values that appear in each column are available with the **`.value_counts()`** method:

In [None]:
restaurants["DBA"].value_counts()

DUNKIN                                   3839
SUBWAY                                   2547
STARBUCKS                                1829
MCDONALD'S                               1676
KENNEDY FRIED CHICKEN                    1191
                                         ... 
BULLEIR BOURBON BAR - BARCLAYS CENTER       1
CARO BAKERY & COFFEE SHOP                   1
FLAVA II LOUNGE                             1
CROP CIRCLE                                 1
CRIOLLAS                                    1
Name: DBA, Length: 23077, dtype: int64

In [None]:
restaurants["BORO"].value_counts()

Manhattan        145645
Brooklyn          94265
Queens            86592
Bronx             34848
Staten Island     12017
0                    99
Name: BORO, dtype: int64

In [None]:
restaurants["CUISINE DESCRIPTION"].value_counts()

American          70701
Chinese           38860
Pizza             22771
Coffee/Tea        18610
Latin American    16489
                  ...  
Czech                20
Basque               10
Armenian             10
Lebanese              8
New French            3
Name: CUISINE DESCRIPTION, Length: 87, dtype: int64
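Two options of `.value_counts()` are often useful alongside the plain counts. The sketch below uses a short made-up series (not the real BORO column) to show `normalize=True`, which returns proportions, and `dropna=False`, which includes missing values in the tally.

```python
import pandas as pd

# Hypothetical borough values, including one missing entry.
boro = pd.Series(["Manhattan", "Brooklyn", "Manhattan", "Queens", None], dtype=object)

counts = boro.value_counts()                    # missing values are excluded by default
shares = boro.value_counts(normalize=True)      # proportions of the non-missing values
with_missing = boro.value_counts(dropna=False)  # also counts the missing entry

print(counts)
print(shares)
print(with_missing)
```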



---



# Basic Data Manipulation Techniques

## Selecting a subset of the columns -- `filter()`

A dataframe can be reduced to a chosen subset of its columns, with the result returned as another dataframe, using
[the **`.filter()`** method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html).

In [None]:
restaurants

In [None]:
restaurants.filter( 
    items = ["DBA", "GRADE", "GRADE DATE"] 
)

Unnamed: 0,DBA,GRADE,GRADE DATE
0,POPINA,,NaT
1,CEMITAS PUEBLA RESTAURANT,A,2017-01-19
2,SAKE JAPANESE CUISINE,A,2019-12-03
3,SAKE JAPANESE CUISINE,A,2019-12-03
4,CARVEL,,NaT
...,...,...,...
373461,LOS CUENCANITOS,A,2017-06-17
373462,LA CANOA,A,2017-12-20
373463,BAGELS & BREW,,NaT
373464,LAVELLE'S ADMIRAL'S CLUB,A,2018-07-03


In [None]:
columns = ["GRADE DATE", "VIOLATION CODE", "DBA", "SCORE"]

# Notice the use of "chain notation" below
# Chain notation means putting parentheses around
# the command and then having each operation in its
# own line
(
  restaurants
  .filter( items = columns )
  .head(10)
)


Unnamed: 0,GRADE DATE,VIOLATION CODE,DBA,SCORE
0,NaT,05C,POPINA,64.0
1,2017-01-19,06C,CEMITAS PUEBLA RESTAURANT,13.0
2,2019-12-03,10F,SAKE JAPANESE CUISINE,12.0
3,2019-12-03,10F,SAKE JAPANESE CUISINE,12.0
4,NaT,10A,CARVEL,13.0
5,NaT,10B,MALA PROJECT,84.0
6,2019-07-03,04E,PAVILLION CATERERS,12.0
7,NaT,04L,CHECKERS,21.0
8,2019-07-24,10H,GOGI 37,12.0
9,2019-07-31,08A,SAVOY BAKERY,12.0


Use the **`like`** option in `filter()` to find all the column names that include a certain string. For example, to get all the columns that include the string `DATE`:

In [None]:
restaurants.filter(
    like = 'DATE'
)

Unnamed: 0,INSPECTION DATE,GRADE DATE,RECORD DATE
0,2017-06-12,NaT,2022-02-06
1,2017-01-19,2017-01-19,2022-02-06
2,2019-12-03,2019-12-03,2022-02-06
3,2019-12-03,2019-12-03,2022-02-06
4,2019-11-12,NaT,2022-02-06
...,...,...,...
373461,2017-06-17,2017-06-17,2022-02-06
373462,2017-12-20,2017-12-20,2022-02-06
373463,2018-01-17,NaT,2022-02-06
373464,2018-07-03,2018-07-03,2022-02-06


The functionality of `filter()` is greatly expanded with the use of **regular expressions**:

In [None]:
restaurants.filter(
    regex = r'^C' # all the columns that start with C
)

Unnamed: 0,CAMIS,CUISINE DESCRIPTION,CRITICAL FLAG,Community Board,Council District,Census Tract
0,50062685,Italian,Critical,306,39,004700
1,50000741,Spanish,Critical,211,15,033600
2,50074677,Japanese,Not Critical,312,39,048800
3,50074677,Japanese,Not Critical,312,39,048800
4,40959012,Frozen Desserts,Not Critical,311,43,018000
...,...,...,...,...,...,...
373461,40967394,Latin American,Critical,402,26,025100
373462,41022489,Latin American,Critical,405,34,054700
373463,41683685,American,Not Critical,401,26,015300
373464,40365844,American,Critical,401,26,015300
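The three selection styles of `filter()` can be compared side by side on an empty dataframe that has only column names. This is a sketch for illustration: the column names are a subset of those in the dataset, and the variable names are made up.

```python
import pandas as pd

# An empty dataframe is enough, since filter() here only looks at column names.
df = pd.DataFrame(
    columns=["CAMIS", "DBA", "INSPECTION DATE", "GRADE", "GRADE DATE", "Census Tract"]
)

by_like = df.filter(like="DATE").columns.tolist()          # name contains "DATE"
by_regex_end = df.filter(regex=r"DATE$").columns.tolist()  # name ends with "DATE"
by_regex_alt = df.filter(regex=r"^(C|G)").columns.tolist() # name starts with C or G

print(by_like)
print(by_regex_end)
print(by_regex_alt)
```

Note that the regular expression is case-sensitive, so `^C` matches both "CAMIS" and "Census Tract" but would not match a column starting with a lowercase "c".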




---



## Renaming Columns -- `rename()`

To do the equivalent of `SELECT attr AS alias` in Pandas, use the `rename` method and pass a dictionary that maps each existing column name to its new name:



In [None]:
restaurants.rename(
    columns = {
      "CAMIS": "RESTAURANT_ID",
      "DBA": "RESTAURANT_NAME",
      "BUILDING": "BUILDING_NUMBER",
      "BORO": "BOROUGH"
    }
)

Unnamed: 0,RESTAURANT_ID,RESTAURANT_NAME,BOROUGH,BUILDING_NUMBER,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA
0,50062685,POPINA,Brooklyn,127,COLUMBIA STREET,11231,7182221901,Italian,2017-06-12,Violations were cited in the following area(s).,05C,Food contact surface improperly constructed or...,Critical,64.0,,NaT,2022-02-06,Pre-permit (Non-operational) / Initial Inspection,40.687232,-74.001623,306,39,004700,3003582,3003190032,BK33
1,50000741,CEMITAS PUEBLA RESTAURANT,Bronx,679,ALLERTON AVENUE,10467,7185477350,Spanish,2017-01-19,Violations were cited in the following area(s).,06C,Food not protected from potential source of co...,Critical,13.0,A,2017-01-19,2022-02-06,Cycle Inspection / Re-inspection,40.865417,-73.868054,211,15,033600,2053535,2045080005,BX07
2,50074677,SAKE JAPANESE CUISINE,Brooklyn,324,CHURCH AVENUE,11218,6462806305,Japanese,2019-12-03,Violations were cited in the following area(s).,10F,Non-food contact surface improperly constructe...,Not Critical,12.0,A,2019-12-03,2022-02-06,Cycle Inspection / Initial Inspection,40.643741,-73.977165,312,39,048800,3124630,3053360074,BK41
3,50074677,SAKE JAPANESE CUISINE,Brooklyn,324,CHURCH AVENUE,11218,6462806305,Japanese,2019-12-03,Violations were cited in the following area(s).,10F,Non-food contact surface improperly constructe...,Not Critical,12.0,A,2019-12-03,2022-02-06,Cycle Inspection / Initial Inspection,40.643741,-73.977165,312,39,048800,3124630,3053360074,BK41
4,40959012,CARVEL,Brooklyn,1652,86 STREET,11214,7182365928,Frozen Desserts,2019-11-12,Violations were cited in the following area(s).,10A,Toilet facility not maintained and provided wi...,Not Critical,13.0,,NaT,2022-02-06,Cycle Inspection / Initial Inspection,40.609335,-74.006108,311,43,018000,3166388,3063640045,BK27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373461,40967394,LOS CUENCANITOS,Queens,5418,ROOSEVELT AVENUE,11377,7184261734,Latin American,2017-06-17,Violations were cited in the following area(s).,06E,"Sanitized equipment or utensil, including in-u...",Critical,8.0,A,2017-06-17,2022-02-06,Cycle Inspection / Re-inspection,40.744793,-73.910272,402,26,025100,4030938,4013230001,QN63
373462,41022489,LA CANOA,Queens,651,ONDERDONK AVENUE,11385,7184566011,Latin American,2017-12-20,Violations were cited in the following area(s).,04N,Filth flies or food/refuse/sewage-associated (...,Critical,12.0,A,2017-12-20,2022-02-06,Cycle Inspection / Initial Inspection,40.704525,-73.908126,405,34,054700,4082889,4034670013,QN20
373463,41683685,BAGELS & BREW,Queens,4305,BROADWAY,11103,7185454440,American,2018-01-17,Violations were cited in the following area(s).,08A,Facility not vermin proof. Harborage or condit...,Not Critical,23.0,,NaT,2022-02-06,Cycle Inspection / Initial Inspection,40.757780,-73.916456,401,26,015300,4011889,4006930106,QN70
373464,40365844,LAVELLE'S ADMIRAL'S CLUB,Queens,4515,BROADWAY,11103,7187212764,American,2018-07-03,Violations were cited in the following area(s).,06D,"Food contact surface not properly washed, rins...",Critical,7.0,A,2018-07-03,2022-02-06,Cycle Inspection / Re-inspection,40.757002,-73.914800,401,26,015300,4012527,4007110001,QN70
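One detail that is easy to miss: `rename()` returns a new dataframe and leaves the original unchanged, so the result has to be assigned to a variable to be kept. A minimal sketch, with a one-row dataframe and made-up variable names:

```python
import pandas as pd

df = pd.DataFrame({"CAMIS": ["50062685"], "DBA": ["POPINA"]})

# rename() does not modify df; it returns a renamed copy.
renamed = df.rename(columns={"CAMIS": "RESTAURANT_ID", "DBA": "RESTAURANT_NAME"})

print(list(df.columns))       # original names are still in place
print(list(renamed.columns))  # new names appear only in the returned dataframe
```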




---



## Selecting rows -- `query()`

Selecting rows works by evaluating a condition that yields one boolean value for each row of the dataframe, and then keeping only the rows where the value is true. The
[`.query()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) method does this from a condition written as a string.

In [None]:
# Find all violations for restaurants with DBA being Starbucks
restaurants.query(' DBA == "STARBUCKS" ')
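To make the boolean selection explicit, the sketch below builds the mask by hand and compares it with the equivalent `query()` call. The small dataframe and its SCORE values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DBA": ["STARBUCKS", "CARVEL", "STARBUCKS"],
        "SCORE": [12.0, 13.0, 21.0],  # hypothetical scores
    }
)

# One boolean per row: True where the condition holds.
mask = df["DBA"] == "STARBUCKS"
print(mask.tolist())

by_mask = df[mask]                         # keep rows where the mask is True
by_query = df.query('DBA == "STARBUCKS"')  # same selection, written as a string

print(by_mask.equals(by_query))
```

Both forms return the same rows with their original index labels; `query()` is mainly a more readable way to write the condition.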

*The following snippets use the accent grave or backquote character which is ASCII code 096.  It is found on English language QWERTY keyboards on the same key as the tilde (~) which is typically on the upper left side of the keyboard.  It can also be typed with the key combination [ALT]096.*

In [None]:
# Find all violations with code 04L (i.e., "has mice")
# Notice the use of backquotes for attribute names that have space
restaurants.query(' `VIOLATION CODE` == "04L" ')

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA
7,50045201,CHECKERS,Queens,12221,MERRICK BLVD,11434,7187122420,Hamburgers,2022-01-21,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,21.0,,NaT,2022-02-06,Cycle Inspection / Initial Inspection,40.683591,-73.769398,412,27,036800,4269742,4124800032,QN08
13,50013558,ROWE STUDIOS LOUNGE,Manhattan,410,WEST 42 STREET,10036,2125825472,American,2018-08-29,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,11.0,,NaT,2022-02-06,Cycle Inspection / Initial Inspection,40.758645,-73.993185,104,03,011500,1026330,1010510029,MN15
16,41616749,MINHUI SNACK,Brooklyn,5919,7 AVENUE,11220,7185677789,Chinese,2017-06-08,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,17.0,B,2017-06-08,2022-02-06,Cycle Inspection / Re-inspection,40.637230,-74.011317,307,38,010400,3016386,3008660001,BK34
50,41386038,773 LOUNGE,Brooklyn,773,CONEY ISLAND AVENUE,11218,9173328721,American,2020-02-28,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,5.0,A,2020-02-28,2022-02-06,Cycle Inspection / Re-inspection,40.638354,-73.968520,314,40,052600,3118598,3051530061,BK42
61,50018562,BURGER TIME,Bronx,1080,MORRIS PARK AVENUE,10461,7182396210,Hamburgers,2019-09-17,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,11.0,A,2019-09-17,2022-02-06,Cycle Inspection / Re-inspection,40.849290,-73.853584,211,13,025400,2045007,2041080003,BX37
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373417,40660748,ALTA,Manhattan,64,WEST 10 STREET,10011,2125057777,American,2019-10-24,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,19.0,B,2019-10-24,2022-02-06,Cycle Inspection / Re-inspection,40.734233,-73.997438,102,03,006300,1009457,1005730010,MN23
373430,40380517,BARNEY GREENGRASS,Manhattan,541,AMSTERDAM AVENUE,10024,2127244707,Jewish/Kosher,2019-03-21,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,12.0,,NaT,2022-02-06,Cycle Inspection / Initial Inspection,40.787825,-73.974895,107,06,017300,1032218,1012170064,MN12
373435,50042550,GLOBAL KITCHEN,Manhattan,1290,AVE AMERICAS,,2125813200,American,2022-02-03,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,38.0,,2022-02-03,2022-02-06,Cycle Inspection / Re-inspection,0.000000,0.000000,,,,,1,
373438,50065684,MEET CUISINE & BAR,Queens,3610,UNION ST,11354,7183588895,Chinese,2019-02-28,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,44.0,,NaT,2022-02-06,Pre-permit (Operational) / Compliance Inspection,40.763482,-73.828056,407,20,086900,4112354,4049770052,QN22


In [None]:
# Storing the result of a query for Violation Code "04L" in a dataframe called
# has_mice
has_mice = restaurants.query(' `VIOLATION CODE` == "04L" ')
has_mice

Unnamed: 0,CAMIS,DBA,BORO,BUILDING,STREET,ZIPCODE,PHONE,CUISINE DESCRIPTION,INSPECTION DATE,ACTION,VIOLATION CODE,VIOLATION DESCRIPTION,CRITICAL FLAG,SCORE,GRADE,GRADE DATE,RECORD DATE,INSPECTION TYPE,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA
7,50045201,CHECKERS,Queens,12221,MERRICK BLVD,11434,7187122420,Hamburgers,2022-01-21,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,21.0,,NaT,2022-02-06,Cycle Inspection / Initial Inspection,40.683591,-73.769398,412,27,036800,4269742,4124800032,QN08
13,50013558,ROWE STUDIOS LOUNGE,Manhattan,410,WEST 42 STREET,10036,2125825472,American,2018-08-29,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,11.0,,NaT,2022-02-06,Cycle Inspection / Initial Inspection,40.758645,-73.993185,104,03,011500,1026330,1010510029,MN15
16,41616749,MINHUI SNACK,Brooklyn,5919,7 AVENUE,11220,7185677789,Chinese,2017-06-08,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,17.0,B,2017-06-08,2022-02-06,Cycle Inspection / Re-inspection,40.637230,-74.011317,307,38,010400,3016386,3008660001,BK34
50,41386038,773 LOUNGE,Brooklyn,773,CONEY ISLAND AVENUE,11218,9173328721,American,2020-02-28,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,5.0,A,2020-02-28,2022-02-06,Cycle Inspection / Re-inspection,40.638354,-73.968520,314,40,052600,3118598,3051530061,BK42
61,50018562,BURGER TIME,Bronx,1080,MORRIS PARK AVENUE,10461,7182396210,Hamburgers,2019-09-17,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,11.0,A,2019-09-17,2022-02-06,Cycle Inspection / Re-inspection,40.849290,-73.853584,211,13,025400,2045007,2041080003,BX37
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373417,40660748,ALTA,Manhattan,64,WEST 10 STREET,10011,2125057777,American,2019-10-24,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,19.0,B,2019-10-24,2022-02-06,Cycle Inspection / Re-inspection,40.734233,-73.997438,102,03,006300,1009457,1005730010,MN23
373430,40380517,BARNEY GREENGRASS,Manhattan,541,AMSTERDAM AVENUE,10024,2127244707,Jewish/Kosher,2019-03-21,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,12.0,,NaT,2022-02-06,Cycle Inspection / Initial Inspection,40.787825,-73.974895,107,06,017300,1032218,1012170064,MN12
373435,50042550,GLOBAL KITCHEN,Manhattan,1290,AVE AMERICAS,,2125813200,American,2022-02-03,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,38.0,,2022-02-03,2022-02-06,Cycle Inspection / Re-inspection,0.000000,0.000000,,,,,1,
373438,50065684,MEET CUISINE & BAR,Queens,3610,UNION ST,11354,7183588895,Chinese,2019-02-28,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Critical,44.0,,NaT,2022-02-06,Pre-permit (Operational) / Compliance Inspection,40.763482,-73.828056,407,20,086900,4112354,4049770052,QN22


A slice of the result of the **`.value_counts()`** method gives the twenty most frequent DBA names.

In [None]:
# The most frequent DBA names overall
restaurants["DBA"].value_counts()[:20]

A slice of the result of the **`.value_counts()`** method on the **has_mice** dataframe gives the twenty DBA names, and then the ten restaurant IDs (CAMIS), with the most "Has Mice (04L)" violations.

In [None]:
# The most frequent DBA names among the "Has Mice (04L)" violations
has_mice["DBA"].value_counts()[:20]

In [None]:
has_mice["CAMIS"].value_counts()[:10]

The following query checks one restaurant ID to see whether it ever had a "Has Mice (04L)" violation.

In [None]:
has_mice.query( ' CAMIS == "50015263" ' )

### Set Operations

Conditions can also be combined to perform set operations, such as the intersection (AND) or union (OR) of the rows that each condition selects.

In [None]:
# AND in pandas is "&"
# OR in pandas is "|"
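The two operators can be tried on a small dataframe. In the sketch below the rows are made up for illustration; it also shows the same AND condition written with boolean masks, where each comparison needs its own parentheses because `&` binds more tightly than `==` in plain Python.

```python
import pandas as pd

# Hypothetical rows using codes and ZIP codes that appear in the dataset.
df = pd.DataFrame(
    {
        "VIOLATION CODE": ["04L", "04L", "10F", "08A"],
        "ZIPCODE": ["10012", "11218", "10012", "10003"],
    }
)

# Intersection: both conditions must hold.
both = df.query(' `VIOLATION CODE` == "04L" & ZIPCODE == "10012" ')

# Union: at least one condition must hold.
either = df.query(' `VIOLATION CODE` == "04L" | ZIPCODE == "10012" ')

# The same intersection with explicit boolean masks (note the parentheses).
both_mask = df[(df["VIOLATION CODE"] == "04L") & (df["ZIPCODE"] == "10012")]

print(len(both), len(either))
print(both.equals(both_mask))
```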

In [None]:
has_mice_10012 = (
    restaurants
    .query(' `VIOLATION CODE` == "04L" & ZIPCODE == "10012" ')
    .filter( items = ['DBA', 'BUILDING', 'STREET', 'INSPECTION DATE'])
)

has_mice_10012

In [None]:
has_mice_10012["DBA"].value_counts()[:30]

In [None]:
has_mice_10012["DBA"].value_counts()[30::-1].plot(kind="barh")



---



***Some SQL-like Manipulations***

## Selecting distinct values -- `drop_duplicates()`

We can do the equivalent of the SQL [SELECT DISTINCT](https://www.w3schools.com/sql/sql_distinct.asp) statement using the **'.drop_duplicates()'** method in Pandas as follows:

In [None]:
(
    has_mice_10012
    .filter( items = ['DBA', 'BUILDING', 'STREET'])
    .drop_duplicates()
)
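By default `drop_duplicates()` compares entire rows. The sketch below, on three rows adapted from the sample output earlier in this notebook, also shows the `subset` and `keep` options, which limit the comparison to chosen columns and control which of the duplicate rows survives.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "DBA": ["SAKE JAPANESE CUISINE", "SAKE JAPANESE CUISINE", "CARVEL"],
        "STREET": ["CHURCH AVENUE", "CHURCH AVENUE", "86 STREET"],
        "INSPECTION DATE": pd.to_datetime(["2019-12-03", "2019-12-03", "2019-11-12"]),
    }
)

# Whole-row comparison: the second row repeats the first and is dropped.
distinct_rows = df.drop_duplicates()

# Compare on DBA only, keeping the last occurrence of each name.
distinct_names = df.drop_duplicates(subset=["DBA"], keep="last")

print(distinct_rows.index.tolist())
print(distinct_names.index.tolist())
```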

## Sorting values -- `sort_values()`

We can do the equivalent of the SQL [ORDER BY](https://www.w3schools.com/sql/sql_orderby.asp) clause using the **'.sort_values()'** method in Pandas.

In [None]:
(
    has_mice_10012
    .sort_values("INSPECTION DATE", ascending=False)
    .head(15)
)

In [None]:
(
    has_mice_10012
    .sort_values(["INSPECTION DATE","DBA"], ascending=[False,True])
    .head(15)
)
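Missing values need some care when sorting. In the sketch below, built on made-up restaurant names and dates, rows with a missing date are placed last by default regardless of sort direction, and the `na_position` option moves them to the front.

```python
import pandas as pd

# Hypothetical rows; the second one has no inspection date.
df = pd.DataFrame(
    {
        "DBA": ["CAFE A", "CAFE B", "CAFE C"],
        "INSPECTION DATE": pd.to_datetime(["2019-11-12", None, "2019-10-24"]),
    }
)

# Newest first; the row with the missing date (NaT) still goes to the end.
newest_first = df.sort_values("INSPECTION DATE", ascending=False)

# Same ordering, but with missing dates listed first.
missing_first = df.sort_values(
    "INSPECTION DATE", ascending=False, na_position="first"
)

print(newest_first["DBA"].tolist())
print(missing_first["DBA"].tolist())
```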



---



## Defining New Columns -- `assign()` and `apply()`



### Using the `assign()` approach

The `assign` method takes a function, calls it with the dataframe as its argument, and returns a new dataframe with the new column(s) added. The original dataframe is left unchanged.
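
As a quick sketch of the mechanics, using a made-up dataframe and a hypothetical helper named `total`: the keyword passed to `assign` becomes the name of the new column, and the function receives the whole dataframe.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"price": [10.0, 20.0], "quantity": [3, 1]})

def total(df):
    # Receives the whole dataframe and returns one value per row
    return df.price * df.quantity

# The keyword (total_cost) becomes the new column name
with_total = toy.assign(total_cost=total)
print(with_total)
```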


In the following example we take the differences in latitude and longitude from CityTech, square both, and take the square root of their sum to get the distance 'as the crow flies' (the shortest distance, regardless of obstacles) from the restaurant to CityTech. This type of distance calculation can be used to find all sites within a certain radius of a location. It is not always a good measure of the travel distance between sites, especially when traveling by private automobile or public transportation.

In [None]:
import numpy as np

# We define a function that will take as input a dataframe df
# and returns back a new column. This function computes
# the distance (in miles) from CityTech, given the lat/lon of the 
# other location
def distance(df):
  CityTech_lon = -73.9861
  CityTech_lat = 40.6973
  # The calculation below is simply the Pythagorean theorem.
  # The normalizing values are just for converting lat/lon differences
  # to miles
  distance = ((df.Latitude-CityTech_lat)/0.0146)**2 + ((df.Longitude-CityTech_lon)/0.0196)**2
  return np.sqrt(distance)

# This function combines the BUILDING/STREET/ZIPCODE columns into one address
def combine_address(df):
  return (df.BUILDING + ' ' + df.STREET + ', '  + df.ZIPCODE).str.upper()
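
As a sanity check on the scaling constants, the sketch below applies the same approximation to three made-up points. The helper name `approx_miles` and the sample coordinates are hypothetical; the constants are the ones used above. A point 0.0146 degrees north of CityTech should come out about one mile away, and a point three 'latitude miles' north and four 'longitude miles' east should come out about five miles away (a 3-4-5 triangle).

```python
import numpy as np
import pandas as pd

CITYTECH_LAT, CITYTECH_LON = 40.6973, -73.9861

def approx_miles(df):
    # Same flat approximation as above: scale degree differences to miles
    d_lat = (df.Latitude - CITYTECH_LAT) / 0.0146
    d_lon = (df.Longitude - CITYTECH_LON) / 0.0196
    return np.sqrt(d_lat**2 + d_lon**2)

# Hypothetical points: CityTech itself, one step north, and a 3-4-5 triangle
toy = pd.DataFrame({
    "Latitude":  [40.6973, 40.6973 + 0.0146, 40.6973 + 3 * 0.0146],
    "Longitude": [-73.9861, -73.9861, -73.9861 + 4 * 0.0196],
})

miles = approx_miles(toy)
print(miles.round(3))
```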

In [None]:
# First, let's use the `assign` method to create two new columns
# using the logic in the functions above
(
  restaurants
  .assign(
      distance_from_CityTech = distance,
      address = combine_address
  )
  .filter(items = ['DBA','address','distance_from_CityTech'])
)

In [None]:
# And let's eliminate duplicates and sort by distance
(
  restaurants
  .assign(
      distance_from_CityTech = distance,
      address = combine_address
  )
  .filter(items = ['DBA','address','distance_from_CityTech'])
  .query('distance_from_CityTech > 0') # eliminates NaN values from distance_from_CityTech
  .drop_duplicates()
  .sort_values('distance_from_CityTech')
  .head(20)
)



---



### Using the `apply` approach

The `apply` method takes a function and calls it once for every row (`axis='columns'`) or once for every column (`axis='index'`) of a Pandas dataframe.
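
The difference between the two directions can be sketched on a tiny, made-up dataframe (the helper `area` is hypothetical). With `axis='columns'` the function receives one row at a time as a Series; with `axis='index'` it receives one column at a time.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"width": [2, 3], "height": [5, 4]})

def area(row):
    # `row` is a Series holding the values of a single row
    return row["width"] * row["height"]

# One result per row
areas = toy.apply(area, axis="columns")

# One result per column
col_max = toy.apply(lambda col: col.max(), axis="index")

print(areas)
print(col_max)
```

Row-wise `apply` runs a Python function call per row, so it is usually much slower than the vectorized `assign` approach on large dataframes.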

In [None]:
!sudo pip3 install -q -U geopy

# Import under an alias so that we do not overwrite our own distance() function
from geopy import distance as geopy_distance

# A more accurate (geodesic) distance calculation, which returns
# the distance in miles. However, we cannot pass a dataframe
# to the function, only the values of an individual row
def distance_from_CityTech_geodesic(row):
  CityTech_lon = -73.9861
  CityTech_lat = 40.6973
  CityTech = (CityTech_lat, CityTech_lon)
  rest = (row.Latitude, row.Longitude)
  #if pd.isnull(row.Latitude) or pd.isnull(row.Longitude):
  #  return None
  return geopy_distance.distance(CityTech, rest).miles


In [None]:
# We now create a smaller version of the dataset with just
# the names/address/lon/lat of the restaurants
rest_names_locations = (
    restaurants
    .assign(
      address = combine_address
    )
    .filter(items = ['CAMIS','DBA','address','Longitude', 'Latitude'])
    .query(' Longitude==Longitude ') # idiomatic expression for saying IS NOT NULL
    .query(' Latitude==Latitude ') # idiomatic expression for saying IS NOT NULL
    .drop_duplicates()
)

rest_names_locations
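
The `Longitude==Longitude` idiom works because a missing value (`NaN`) is never equal to itself, so the comparison is false exactly on the rows where the value is missing. A minimal sketch on hypothetical data, compared with the more explicit `dropna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: restaurant "B" has no latitude
toy = pd.DataFrame({"DBA": ["A", "B", "C"], "Latitude": [40.70, np.nan, 40.72]})

# NaN == NaN is False, so the row with the missing value is dropped
not_null_query = toy.query("Latitude == Latitude")

# The same filter, written explicitly
not_null_dropna = toy.dropna(subset=["Latitude"])

print(not_null_query)
```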

In [None]:
# We will now apply the function distance_from_CityTech_geodesic 
# to every row of the dataset:
rest_names_locations.apply(distance_from_CityTech_geodesic, axis='columns')


In [None]:
# We will now save the result into a new column
rest_names_locations['distance_from_CityTech']=rest_names_locations.apply(distance_from_CityTech_geodesic, axis='columns')

In [None]:
# Let's see how many restaurants are within half a mile of CityTech :)
(
    rest_names_locations
    .query('distance_from_CityTech < 0.5')
    .sort_values('distance_from_CityTech')
)



---



## Aggregation Function -- `agg()`

In [None]:
restaurants['SCORE'].agg('mean')

In [None]:
restaurants['SCORE'].agg(['mean','std','count','nunique'])

In [None]:
restaurants.agg(
    {
        'SCORE': ['mean','std','count','nunique'],
        'CAMIS':  ['nunique','count']
     }
    )

In [None]:
restaurants.agg(
        num_scored_violations = ('SCORE', 'count'),
        mean_score = ('SCORE', 'mean'),
        std_score  = ('SCORE', 'std'),
        num_entries = ('CAMIS',  'count'),
        num_restaurants = ('CAMIS',  'nunique'),
  )
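
To check what each aggregate reports, the sketch below runs three of them on a short, made-up Series of scores. Note that `count` and `nunique` both skip missing values, which is why `count` on `SCORE` can be smaller than the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; one inspection has no score
scores = pd.Series([10.0, 12.0, 12.0, np.nan])

summary = scores.agg(["mean", "count", "nunique"])
print(summary)
```

Here the mean is taken over the three non-missing scores, `count` is 3, and `nunique` is 2 (the values 10 and 12).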

## Calculating aggregates per group -- `groupby()`

In [None]:
restaurants.groupby('GRADE DATE').agg({'SCORE': 'mean'})

In [None]:
(
  restaurants
  .groupby('GRADE DATE')
  .agg(
      score_mean = ('SCORE', 'mean'), 
      graded_restaurants = ('CAMIS', 'nunique')
    )
  .tail(500)
  .head(20)
)
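
The named-aggregation pattern above can be traced by hand on a few rows. In this sketch (hypothetical data), each keyword becomes an output column, and each tuple names the source column and the function to apply within every group.

```python
import pandas as pd

# Hypothetical toy data: restaurant "1" has two entries on the same date
toy = pd.DataFrame({
    "GRADE DATE": pd.to_datetime(["2021-01-05", "2021-01-05", "2021-01-05", "2021-01-06"]),
    "CAMIS": ["1", "1", "2", "3"],
    "SCORE": [10, 12, 20, 7],
})

per_day = toy.groupby("GRADE DATE").agg(
    score_mean=("SCORE", "mean"),             # average score per date
    graded_restaurants=("CAMIS", "nunique"),  # distinct restaurants per date
)
print(per_day)
```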



---



## Pivot Tables

[Pivot tables](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) are one of the most commonly used exploratory tools, and in Pandas they are extremely flexible. 

For example, let's try to count the number of restaurants that are inspected every day. 

In [None]:
# Count the number of rows (CAMIS entries) that appear on each date

pivot = pd.pivot_table(
    data=restaurants,
    index="GRADE DATE",  # specifies the rows
    values="CAMIS",  # specifies the content of the cells
    aggfunc="count",  # counts the non-null CAMIS entries per date (repeats included)
)

In [None]:
pivot
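
Because the dataset has one row per violation, `aggfunc="count"` counts rows rather than distinct restaurants. The sketch below, on hypothetical data, contrasts it with `aggfunc="nunique"`, which counts each restaurant once per date.

```python
import pandas as pd

# Hypothetical toy data: restaurant "1" appears twice on the same date
toy = pd.DataFrame({
    "GRADE DATE": pd.to_datetime(["2021-01-05", "2021-01-05", "2021-01-06"]),
    "CAMIS": ["1", "1", "2"],
})

# Number of rows per date
rows_per_day = pd.pivot_table(
    data=toy, index="GRADE DATE", values="CAMIS", aggfunc="count"
)

# Number of distinct restaurants per date
restaurants_per_day = pd.pivot_table(
    data=toy, index="GRADE DATE", values="CAMIS", aggfunc="nunique"
)

print(rows_per_day)
print(restaurants_per_day)
```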

#### Changing date granularity 

We can also use the [resample](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html) method, which requires a datetime index, to change the frequency from one day to, say, one week. Then we can compute the average (`mean()`) over each period, or the total number (`sum()`) of inspections.

In [None]:
pivot.resample("1W").sum().tail(100)
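
A small sketch of what weekly resampling does, using made-up daily counts. With the `"1W"` frequency, pandas groups the dates into weeks that end on Sunday and labels each group with that Sunday.

```python
import pandas as pd

# Hypothetical daily counts: two dates in each of two consecutive weeks
daily = pd.DataFrame(
    {"CAMIS": [1, 2, 3, 4]},
    index=pd.to_datetime(["2021-01-04", "2021-01-06", "2021-01-11", "2021-01-12"]),
)

# Total per week; each row is labeled with the Sunday that ends the week
weekly = daily.resample("1W").sum()
print(weekly)
```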

#### Pivot Table with two (or more) variables

We would like to break down the results by borough, so we add the `columns` parameter.

In [None]:
pivot2 = pd.pivot_table(
    data=restaurants,  #
    index="INSPECTION DATE",
    columns="BORO",
    values="CAMIS",
    aggfunc="count",
)
pivot2.head(10)

##### Deleting rows and columns

Now, you will notice that there are a few columns and rows that are just noise. The first row with date *'1900-01-01'* is clearly noise, and the *'0'* column is also noise. We can use the `drop` command in Pandas to drop these.

In [None]:
# The axis='index' (or axis=0) means that we delete a row with that index value
pivot2 = pivot2.drop(pd.to_datetime("1900-01-01"), axis="index")

In [None]:
# The axis='columns' (or axis=1) means that we delete a column with that name
pivot2 = pivot2.drop("0", axis="columns")

In [None]:
pivot2.tail(5)
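
The two `drop` calls can be tried on a tiny stand-in for the pivot table. The values below are hypothetical; only the shape (a noisy first date and a noisy `"0"` column) mirrors `pivot2`.

```python
import pandas as pd

# Hypothetical stand-in for pivot2: one noisy row and one noisy column
toy = pd.DataFrame(
    {"0": [1, 0], "Brooklyn": [5, 7]},
    index=pd.to_datetime(["1900-01-01", "2021-01-05"]),
)

cleaned = (
    toy
    .drop(pd.to_datetime("1900-01-01"), axis="index")  # remove the row with that date
    .drop("0", axis="columns")                         # remove the column named "0"
)
print(cleaned)
```

`drop` returns a new dataframe, so `toy` itself still contains the noisy row and column unless the result is assigned back.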

## (Optional, FYI) Advanced Pivot Tables

We can also add multiple attributes in the index and columns. It is also possible to have multiple aggregation functions, and we can even define our own aggregation functions.

In [None]:
# We write a function that returns the
# number of unique items in a list x
def count_unique(x):
    return len(set(x))


# We break down by BORO and GRADE, and also calculate
# inspections in unique (unique restaurants)
# and non-unique entries (effectively, violations)
pivot_advanced = pd.pivot_table(
    data=restaurants,
    index="GRADE DATE",
    columns=["BORO", "GRADE"],
    values="CAMIS",
    aggfunc=["count", count_unique],
)

# Take the total number of inspections (unique and non-unique)
agg = pivot_advanced.resample("1M").sum()

# Show the last 5 entries and show the transpose (.T)
agg.tail().T



---



# Exercises

## Exercise 1 - Average Score by Inspector

Now let's do the same exercise, but instead of counting the number of inspections, we want to compute the average score assigned by the inspectors. Hint: We will need to change the `values` and the `aggfunc` parameters in the `pivot_table` function above.

In [None]:
# your code here

#### Solution 1 - Average Score by Inspector

In [None]:
pivot = pd.pivot_table(
    data=restaurants,
    index="INSPECTION DATE",  # specifies the rows
    values="SCORE",  # specifies the content of the cells
    aggfunc="mean",  # compute the average SCORE
)