### Pandas Lab -- Basic Selecting & Querying

This lab walks you through the core Pandas syntax for selecting data.

The lab has three parts, which we will complete over the course of the class.

1. Basic selectors with Pandas
2. Selecting based on conditions & boolean indexes
3. Special commands for selecting certain types of rows

### Section I: Selecting Data With Pandas

In [2]:
import numpy as np
import pandas as pd

In [3]:
# Reading in the CSV. The 'r' prefix makes the path a raw string:
# backslashes are taken literally instead of being read as escape sequences (this mainly matters for Windows paths)

df = pd.read_csv(r"/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurants.csv")
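
A brief illustration of what the `r` prefix changes, using a made-up Windows-style path (hypothetical, not the lab's file). The lab's own path uses forward slashes, so there the prefix is harmless rather than required:

```python
# Hypothetical path, chosen because \n and \t are escape sequences.
plain = "C:\new\table.csv"   # \n becomes a newline, \t becomes a tab
raw = r"C:\new\table.csv"    # backslashes are kept literally

print(len(plain), len(raw))
print("\\" in plain, "\\" in raw)
```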

In [4]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,


**1). What is the average number of visitors throughout the entire dataset?**

In [5]:
# your answer here
df['visitors'].mean()

20.973761245180636

**2). What are the median values of the visitors and holiday columns?**

In [7]:
# your answer here
df[['visitors','holiday']].median()

visitors    17.0
holiday      0.0
dtype: float64

**3). What was the lowest number of visitors among the first 5000 rows in the dataset?**

In [8]:
# your answer here
df['visitors'][:5000].min()

1

**4). What is the modal value of the last 4 columns in the dataset?**

In [25]:
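# your answer here
# note: -4 picks only the fourth-from-last column (area);
# the slice -4: would select all of the last four columns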
df.iloc[0:,-4].mode()


0    Fukuoka-ken Fukuoka-shi Daimyō
dtype: object

In [29]:
# or, since the leading 0 in 0: is optional:
df.iloc[:,-4].mode()

0    Fukuoka-ken Fukuoka-shi Daimyō
dtype: object

**5). What is the mean value of the first 250 rows of the first 3 columns in the dataset?**

In [27]:
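# your answer here
# only visitors is returned: id and visit_date are non-numeric, so this pandas version skips them
# (pandas 2.0 and later raise a TypeError here unless you pass numeric_only=True)
df.iloc[0:250,0:3].mean()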

visitors    24.912
dtype: float64

In [31]:
# or - cleaner:
df.iloc[:250,:3].mean()

visitors    24.912
dtype: float64
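
The positional slicing and the single-column result above can be reproduced on a small frame. This is only a sketch: the `toy` data is invented, and `numeric_only=True` is passed explicitly so the call behaves the same on pandas versions that no longer skip text columns silently.

```python
import pandas as pd

# Invented data with the same first three columns as the lab dataset.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "visit_date": ["2016-01-13", "2016-01-14", "2016-01-15", "2016-01-16"],
    "visitors": [10, 20, 30, 40],
})

# First 3 rows and first 3 columns, selected by position.
subset = toy.iloc[:3, :3]

# Only the numeric column contributes to the result.
means = subset.mean(numeric_only=True)
print(means)
```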

### Section II: Selecting Based on Conditions

**1). What was the average attendance on Monday?  On the weekend (Saturday & Sunday)?**

In [None]:
# your answer here

In [34]:
# Average for Monday
df[df['day_of_week']=='Monday']['visitors'].mean()

17.177009027207877

In [37]:
# Average for Saturday and Sunday
df[(df['day_of_week']=='Saturday')|(df['day_of_week']=='Sunday')]['visitors'].mean()

25.256869738495084
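
Two comparisons joined with `|` can also be written with `Series.isin()`, which tends to read more easily as the list of values grows. A minimal sketch on invented data (the `toy` frame and its numbers are not from the lab dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "day_of_week": ["Monday", "Saturday", "Sunday", "Friday"],
    "visitors": [10, 20, 40, 30],
})

# Boolean mask built from two comparisons combined with | (or).
mask_or = (toy["day_of_week"] == "Saturday") | (toy["day_of_week"] == "Sunday")

# The same mask built with isin().
mask_isin = toy["day_of_week"].isin(["Saturday", "Sunday"])

# Select the visitors column for the weekend rows and average it.
weekend_mean = toy.loc[mask_isin, "visitors"].mean()
print(weekend_mean)
```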

**2). Is attendance higher on average for holidays or non-holidays?**

In [None]:
# your answer here

In [40]:
df[(df['holiday']==0)]['visitors'].mean()

20.828063827386945

In [41]:
df[(df['holiday']==1)]['visitors'].mean()

23.703326810176126

*On average holiday attendance is higher*
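
The two separate filters can also be collapsed into one `groupby` call, which returns both averages side by side. A small sketch, assuming a 0/1 `holiday` column like the lab's; the `toy` values are invented:

```python
import pandas as pd

toy = pd.DataFrame({
    "holiday": [0, 0, 1, 1],
    "visitors": [10, 20, 30, 50],
})

# One mean per distinct value of the holiday column.
holiday_means = toy.groupby("holiday")["visitors"].mean()
print(holiday_means)
```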

**3). Which day had the highest attendance for dining bars?**

In [None]:
# your answer here -- notice the different way of selecting

In [45]:
df[(df['genre']=='Dining bar')]['visitors'].max()

348

In [47]:
df[(df['genre']=='Dining bar')&(df['visitors']==348)]['calendar_date']

245791    2017-01-23
Name: calendar_date, dtype: object

*Highest day of attendance at dining bars was 23-Jan-2017*

You can also access a column as an attribute: `df.column_name`

In [64]:
# e.g.
df.genre

0             Dining bar
1             Dining bar
2             Dining bar
3             Dining bar
4             Dining bar
               ...      
252103    Italian/French
252104    Italian/French
252105    Italian/French
252106    Italian/French
252107    Italian/French
Name: genre, Length: 252108, dtype: object

In [65]:
df[df.genre=='Dining bar']

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...,...
249732,air_16c4cfddeb2cf69b,2017-01-04,9,2017-01-04,Wednesday,0,Dining bar,Ōsaka-fu Ōsaka-shi Ōgimachi,34.705362,135.510025,10.0
249733,air_16c4cfddeb2cf69b,2017-01-09,13,2017-01-09,Monday,1,Dining bar,Ōsaka-fu Ōsaka-shi Ōgimachi,34.705362,135.510025,14.0
249734,air_16c4cfddeb2cf69b,2017-02-26,19,2017-02-26,Sunday,0,Dining bar,Ōsaka-fu Ōsaka-shi Ōgimachi,34.705362,135.510025,
249735,air_16c4cfddeb2cf69b,2017-03-20,6,2017-03-20,Monday,1,Dining bar,Ōsaka-fu Ōsaka-shi Ōgimachi,34.705362,135.510025,3.0


In [66]:
df[df.genre=='Dining bar']['visitors'].idxmax()

245791

In [68]:
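# idxmax() returns an index label, so .loc is the matching accessor
# (.iloc returns the same row here only because the index is the default 0..n-1)
df.loc[245791]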

id                           air_c6aa2efba0ffc8eb
visit_date                             2017-01-23
visitors                                      348
calendar_date                          2017-01-23
day_of_week                                Monday
holiday                                         0
genre                                  Dining bar
area                Tōkyō-to Adachi-ku Chūōhonchō
latitude                                  35.7757
longitude                                 139.804
reserve_visitors                               25
Name: 245791, dtype: object
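
`idxmax()` returns an index label rather than a position. With the default 0..n-1 index used in this lab the two coincide, but they can differ in general. The sketch below uses an invented frame with a non-default index to make the difference visible:

```python
import pandas as pd

# Invented data with a non-default index.
toy = pd.DataFrame(
    {"genre": ["Dining bar", "Izakaya", "Dining bar"], "visitors": [10, 50, 30]},
    index=[100, 200, 300],
)

# idxmax() returns the index label of the largest value in the filtered Series.
label = toy[toy.genre == "Dining bar"]["visitors"].idxmax()

# .loc looks rows up by label, so it matches what idxmax() returns.
busiest = toy.loc[label]
print(label, busiest["visitors"])

# toy.iloc[label] would raise an IndexError here, because position 300 does not exist.
```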

In [69]:
# An approach to do grouping by values in a particular column
# it's similar to SQL but has its own syntax rules
df[df.genre=='Dining bar'].groupby('day_of_week')['visitors'].max()

day_of_week
Friday       132
Monday       348
Saturday     176
Sunday       228
Thursday     105
Tuesday      103
Wednesday    162
Name: visitors, dtype: int64

**4). On which holiday date was the number of reservations highest?  Hint:  use the `idxmax()` function**

In [None]:
# your answer here

In [49]:

df[(df['holiday']==1)]['reserve_visitors'].max()
# highest number of reserve visitors on a holiday was 58

58.0

In [54]:
df[(df['holiday']==1)&(df['reserve_visitors']==58)]['visit_date'].max()

'2016-12-30'

In [55]:
df[(df['holiday']==1)&(df['reserve_visitors']==58)]['visit_date'].min()

'2016-12-30'

The highest number of reservations on a holiday was 58, on 30-Dec-2016

In [56]:
df[(df['holiday']==1)]['reserve_visitors'].idxmax()
# idxmax() returns the index label of the row holding the maximum within the filtered rows

1503

In [58]:
# .loc matches the label returned by idxmax() (same row as .iloc here, given the default index)
df.loc[1503]

id                          air_64d4491ad8cdb1c6
visit_date                            2016-12-30
visitors                                      23
calendar_date                         2016-12-30
day_of_week                               Friday
holiday                                        1
genre                                 Dining bar
area                Tōkyō-to Minato-ku Shibakōen
latitude                                 35.6581
longitude                                139.752
reserve_visitors                              58
Name: 1503, dtype: object

**Instructor approach**

In [70]:
df[df.holiday == 1]['visitors'].max()

205

In [71]:
df[df.holiday == 1]['visitors'].idxmax()

122871

In [72]:
# .loc matches the label returned by idxmax() (same row as .iloc here, given the default index)
df.loc[122871]

id                                     air_df554c4527a1cfe6
visit_date                                       2016-12-30
visitors                                                205
calendar_date                                    2016-12-30
day_of_week                                          Friday
holiday                                                   1
genre                                               Izakaya
area                Shizuoka-ken Hamamatsu-shi Motoshirochō
latitude                                            34.7109
longitude                                           137.726
reserve_visitors                                         58
Name: 122871, dtype: object

### Section III: Special Types of Selectors

To get some additional practice using common Pandas methods, we'll work through some typical data-selection scenarios.

*The methods used in this section have not been covered in class.*  Each question comes with a recommended method.  Put a `?` before the method name (for example, `?df.isnull`) to read its documentation and work out how to use it.

This section is designed as a bit of a treasure hunt, to familiarize you with many of the bread & butter pandas methods.

**1). Can you return the number of null values for each column?**

To use: `df.isnull()`.  **Hint:** `True` sums to 1, `False` to 0.
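
To see why the hint works, here is a minimal sketch on an invented frame (not the lab data): `isnull()` gives a frame of booleans, and summing booleans counts the `True` values column by column.

```python
import numpy as np
import pandas as pd

# Invented data with a few missing values.
toy = pd.DataFrame({
    "visitors": [10, np.nan, 30],
    "reserve_visitors": [np.nan, np.nan, 5.0],
})

# True where a value is missing; sum() treats True as 1 and False as 0.
null_counts = toy.isnull().sum()
print(null_counts)
```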

In [None]:
# your answer here

**2). Can you count how many times each unique value occurs within a column?**

To use: `pd.Series.value_counts()`.  **Hint:** Call it on a single column (a *Series*), not on the whole *DataFrame*.
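
A short sketch of the method on an invented Series (the genre names are only placeholders):

```python
import pandas as pd

genres = pd.Series(["Izakaya", "Cafe", "Izakaya", "Dining bar", "Izakaya"])

# One row per distinct value, sorted from most to least frequent.
counts = genres.value_counts()
print(counts)
```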

In [None]:
# your answer here

**3). Can you find the column with the highest number of unique values?  Can you sort columns by their number of unique values?**

To use: `df.nunique()`, and `sort_values()` on the resulting Series if you want to sort it.
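
`nunique()` on a DataFrame returns a Series with one count per column, so any sorting is done with the Series version of `sort_values()`. A sketch on invented data:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "genre": ["x", "x", "y", "y"],
    "holiday": [0, 0, 0, 0],
})

# Number of distinct values in each column.
unique_counts = toy.nunique()

# Column name with the most distinct values.
most_varied = unique_counts.idxmax()

# Columns ordered from most to fewest distinct values.
ordered = unique_counts.sort_values(ascending=False)
print(most_varied)
print(ordered)
```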

In [None]:
# your answer here

**4). Can you query your dataframe so that it only returns columns that have empty values?**

To use: `df.isnull()`, `df.loc`
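
One way to combine the two, sketched on invented data: `isnull().any()` yields one boolean per column, and `.loc[:, mask]` keeps only the columns where that boolean is `True`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "visitors": [1, 2],
    "reserve_visitors": [np.nan, 3.0],
    "genre": ["a", None],
})

# True for every column that holds at least one missing value.
has_nulls = toy.isnull().any()

# All rows, but only the columns flagged above.
cols_with_nulls = toy.loc[:, has_nulls]
print(cols_with_nulls.columns.tolist())
```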

In [None]:
# your answer here

**5).  Can you query the dataframe such that it only returns rows that have *no* missing values, in any of their columns?**

To use: `df.isnull()`, `df.any()`, or, conversely, `df.notnull()`, and `df.all()`

**Hint:** The `~` operator, if put in front of a query, selects for values that are **not** True.
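
The two routes named above give the same rows. A minimal sketch on invented data, where `axis=1` makes `any()` and `all()` work across each row instead of down each column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "visitors": [10, 20, 30],
    "reserve_visitors": [np.nan, 5.0, 7.0],
    "genre": ["a", "b", None],
})

# Route 1: flag rows with any missing value, then negate the mask with ~.
complete = toy[~toy.isnull().any(axis=1)]

# Route 2: keep rows where every value is present.
complete_alt = toy[toy.notnull().all(axis=1)]

print(complete)
```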

In [None]:
# your answer here

**6).  Can you find rows that contain duplicate values?**

To use:  `df.duplicated()`

In [None]:
# your answer here

**7). Can you find rows that contain duplicated values for the visitors and date columns?**  

To use: `df.duplicated()`
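
A sketch of `duplicated()` on invented data, first across whole rows and then restricted to two columns with the `subset` parameter. By default the first occurrence is not flagged; `keep=False` flags every member of a duplicate group.

```python
import pandas as pd

toy = pd.DataFrame({
    "visit_date": ["2016-01-13", "2016-01-13", "2016-01-14", "2016-01-13"],
    "visitors": [25, 25, 25, 25],
    "genre": ["Dining bar", "Izakaya", "Dining bar", "Dining bar"],
})

# Rows that repeat an earlier row in every column.
full_dupes = toy.duplicated()

# Rows that repeat an earlier row in just these two columns.
subset_dupes = toy.duplicated(subset=["visitors", "visit_date"])

# keep=False marks all rows that belong to a duplicate group.
all_in_group = toy.duplicated(subset=["visitors", "visit_date"], keep=False)

print(toy[subset_dupes])
```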

In [None]:
# your answer here

**8).  Can you only select columns that are text based?**

To use: `df.select_dtypes()`, and (optionally) the `columns` attribute.  **Note:** `columns` is NOT a method!
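
A sketch of `select_dtypes()` on invented data. It assumes text is stored with the `object` dtype, as the `dtype: object` outputs earlier in this lab suggest; the toy columns are given that dtype explicitly so the example does not depend on the pandas version's default string type.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre": pd.Series(["Izakaya", "Cafe"], dtype="object"),
    "area": pd.Series(["Tokyo", "Osaka"], dtype="object"),
    "visitors": [10, 20],
    "latitude": [35.6, 34.7],
})

# Columns stored as object (text in this toy frame).
text_cols = toy.select_dtypes(include="object")

# Columns with any numeric dtype (ints and floats).
numeric_cols = toy.select_dtypes(include="number")

# columns is an attribute, so there are no parentheses after it.
print(text_cols.columns.tolist())
print(numeric_cols.columns.tolist())
```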

In [None]:
# your answer here

**9).  Can you only select columns that are numeric?**

To use: `df.select_dtypes()`.  This question is very similar to the one above it, just for a different data type.

In [None]:
# your answer here

**10). Can you fill in the missing values of your numeric columns with their average value?**

To use: `df.fillna()`, to be used in conjunction with the suggested method from question 9.
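
One possible shape for this, sketched on invented data: pick out the numeric columns with `select_dtypes()`, compute their means, and pass those means to `fillna()`, which matches them to columns by name. Non-numeric columns are left untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "visitors": [10.0, np.nan, 30.0],
    "reserve_visitors": [np.nan, 4.0, 6.0],
    "genre": ["a", None, "c"],
})

# Names of the numeric columns only.
numeric_names = toy.select_dtypes(include="number").columns

# Work on a copy so the original frame is not modified.
filled = toy.copy()

# mean() returns one value per column; fillna() applies each to its own column.
filled[numeric_names] = filled[numeric_names].fillna(filled[numeric_names].mean())
print(filled)
```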

In [None]:
# your answer here

**11). Can you select all the rows between Jan. 1, 2016 & June 30, 2016?**

In [None]:
# your answer here
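
A sketch of one approach, on invented data. It assumes the dates are stored as text (the lab output shows `dtype: object` for `calendar_date`), so they are converted with `pd.to_datetime()` before comparing. `Series.between()` includes both endpoints by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "visit_date": ["2015-12-31", "2016-01-01", "2016-06-30", "2016-07-01"],
    "visitors": [1, 2, 3, 4],
})

# Convert the text column to real datetimes.
dates = pd.to_datetime(toy["visit_date"])

# True for dates from Jan. 1 through June 30, 2016, inclusive.
in_range = dates.between(pd.Timestamp("2016-01-01"), pd.Timestamp("2016-06-30"))

selected = toy[in_range]
print(selected)
```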

**12).  Can you determine the quarter of the year for each reservation?  The month?**

In [None]:
# we can get the quarters using the dt attribute
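
A minimal sketch of the `dt` accessor on an invented Series of dates. The accessor is only available on datetime data, so text dates are converted with `pd.to_datetime()` first.

```python
import pandas as pd

# Invented dates, stored as text and then converted.
dates = pd.to_datetime(pd.Series(["2016-01-13", "2016-05-02", "2016-12-30"]))

# Quarter of the year (1 to 4) and month (1 to 12) for each date.
quarters = dates.dt.quarter
months = dates.dt.month

print(quarters.tolist())
print(months.tolist())
```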