### Pandas Lab -- Basic Selecting & Querying

This lab walks you through various sections of Pandas syntax for grabbing & selecting data.

The lab is broken down into three parts, and will be completed throughout class.

 - 1. Basic selectors with Pandas
 - 2. Selecting based on conditions & boolean indexes
 - 3. Special commands for selecting certain types of rows

### Section I: Selecting Data With Pandas

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Reading in the CSV - make sure you include the 'r' at the front.
# r stands for raw i.e. interpret the text exactly as it is

df = pd.read_csv(r"/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurants.csv")
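To see what the `r` prefix actually changes, here is a minimal sketch using a made-up Windows-style path (the path is hypothetical and not part of the lab data). With forward slashes, as in the path used in this lab, the prefix is harmless but not strictly required.

```python
# Hypothetical path, used only to illustrate escape handling.
plain = "C:\new\table.csv"   # \n becomes a newline, \t becomes a tab
raw = r"C:\new\table.csv"    # backslashes are kept literally

print(len(plain))                   # 14 characters
print(len(raw))                     # 16 characters
print("\n" in plain, "\n" in raw)   # True False
```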

In [3]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,


**1). What is the average number of visitors throughout the entire dataset?**

In [5]:
# your answer here
df['visitors'].mean()

20.973761245180636

**2). What are the median values of the visitors and holiday columns?**

In [7]:
# your answer here
df[['visitors','holiday']].median()

visitors    17.0
holiday      0.0
dtype: float64

**3). What was the lowest number of visitors among the first 5000 rows in the dataset?**

In [8]:
# your answer here
df['visitors'][:5000].min()

1

**4). What is the modal value of the last 4 columns in the dataset?**

In [25]:
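# your answer here
# Note: -4 selects only the fourth-from-last column (area);
# use df.iloc[:, -4:] to select all of the last four columns.
df.iloc[0:,-4].mode()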


0    Fukuoka-ken Fukuoka-shi Daimyō
dtype: object

In [29]:
# or
df.iloc[:,-4].mode()

0    Fukuoka-ken Fukuoka-shi Daimyō
dtype: object

**5). What is the mean value of the first 250 rows of the first 3 columns in the dataset?**

In [27]:
# your answer here
# Only 'visitors' comes back because id and visit_date are non-numeric.
# Newer pandas versions may raise here instead; pass numeric_only=True if so.
df.iloc[0:250,0:3].mean()

visitors    24.912
dtype: float64

In [31]:
# or - cleaner:
df.iloc[:250,:3].mean()

visitors    24.912
dtype: float64
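The difference between selecting one column by position and selecting a slice of columns is easy to miss, so here is a small sketch on toy data (the frame and its values are made up for illustration). It also shows `numeric_only=True`, which asks `mean()` to skip text columns explicitly.

```python
import pandas as pd

# Toy data standing in for the restaurant file (values are made up).
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "visitors": [10, 20, 30],
    "holiday": [0, 1, 0],
    "latitude": [35.0, 35.5, 36.0],
})

one_col = toy.iloc[:, -2]     # a single position returns a Series
last_two = toy.iloc[:, -2:]   # a slice returns a DataFrame

print(type(one_col).__name__)   # Series
print(list(last_two.columns))   # ['holiday', 'latitude']

# First 2 rows of the first 3 columns; the text column 'id' is skipped.
means = toy.iloc[:2, :3].mean(numeric_only=True)
print(means)
```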

### Section II: Selecting Based on Conditions

**1). What was the average attendance on Monday?  On the weekend (Saturday & Sunday)?**

In [None]:
# your answer here

In [34]:
# Average for Monday
df[df['day_of_week']=='Monday']['visitors'].mean()

17.177009027207877

In [37]:
# Average for saturday and sunday
df[(df['day_of_week']=='Saturday')|(df['day_of_week']=='Sunday')]['visitors'].mean()

25.256869738495084
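When a condition checks one column against several values, `Series.isin` is a common alternative to chaining `|`. A minimal sketch on made-up data, assuming the same `day_of_week` and `visitors` column names as the lab:

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "day_of_week": ["Monday", "Saturday", "Sunday", "Saturday"],
    "visitors": [10, 20, 30, 40],
})

# Two equivalent ways to keep weekend rows.
weekend_or = toy[(toy["day_of_week"] == "Saturday") | (toy["day_of_week"] == "Sunday")]
weekend_isin = toy[toy["day_of_week"].isin(["Saturday", "Sunday"])]

weekend_mean = weekend_isin["visitors"].mean()
print(weekend_mean)  # 30.0
```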

**2). Is attendance higher on average for holidays or non-holidays?**

In [None]:
# your answer here

In [40]:
df[(df['holiday']==0)]['visitors'].mean()

20.828063827386945

In [41]:
df[(df['holiday']==1)]['visitors'].mean()

23.703326810176126

*On average holiday attendance is higher*
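The same comparison can be made in one step with `groupby`, which computes the mean for each value of `holiday` at once. This is a sketch on made-up data rather than the lab's file:

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "holiday": [0, 0, 1, 1],
    "visitors": [10, 20, 30, 50],
})

# One mean per distinct value of 'holiday'.
by_holiday = toy.groupby("holiday")["visitors"].mean()
print(by_holiday)

# idxmax then tells us which group has the higher average.
print(by_holiday.idxmax())  # 1
```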

**3). What was the highest day of attendance for Dining Bars?**

In [None]:
# your answer here -- notice the different way of selecting

In [45]:
df[(df['genre']=='Dining bar')]['visitors'].max()

348

In [47]:
df[(df['genre']=='Dining bar')&(df['visitors']==348)]['calendar_date']

245791    2017-01-23
Name: calendar_date, dtype: object

*Highest day of attendance at dining bars was 23-Jan-2017*

You can also access a column as an attribute, written as `df.column_name`.

In [64]:
# e.g.
df.genre

0             Dining bar
1             Dining bar
2             Dining bar
3             Dining bar
4             Dining bar
               ...      
252103    Italian/French
252104    Italian/French
252105    Italian/French
252106    Italian/French
252107    Italian/French
Name: genre, Length: 252108, dtype: object

In [65]:
df[df.genre=='Dining bar']

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...,...
249732,air_16c4cfddeb2cf69b,2017-01-04,9,2017-01-04,Wednesday,0,Dining bar,Ōsaka-fu Ōsaka-shi Ōgimachi,34.705362,135.510025,10.0
249733,air_16c4cfddeb2cf69b,2017-01-09,13,2017-01-09,Monday,1,Dining bar,Ōsaka-fu Ōsaka-shi Ōgimachi,34.705362,135.510025,14.0
249734,air_16c4cfddeb2cf69b,2017-02-26,19,2017-02-26,Sunday,0,Dining bar,Ōsaka-fu Ōsaka-shi Ōgimachi,34.705362,135.510025,
249735,air_16c4cfddeb2cf69b,2017-03-20,6,2017-03-20,Monday,1,Dining bar,Ōsaka-fu Ōsaka-shi Ōgimachi,34.705362,135.510025,3.0


In [66]:
df[df.genre=='Dining bar']['visitors'].idxmax()

245791

In [68]:
df.loc[245791] # idxmax returns an index label, so look it up with .loc

id                           air_c6aa2efba0ffc8eb
visit_date                             2017-01-23
visitors                                      348
calendar_date                          2017-01-23
day_of_week                                Monday
holiday                                         0
genre                                  Dining bar
area                Tōkyō-to Adachi-ku Chūōhonchō
latitude                                  35.7757
longitude                                 139.804
reserve_visitors                               25
Name: 245791, dtype: object
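`idxmax()` returns an index *label*, not a row position. With the default 0..n-1 index used in this lab the two coincide, but on a frame with any other index only the label-based `.loc` lookup is safe. A minimal sketch on made-up data with a deliberately non-default index:

```python
import pandas as pd

# Toy data with a non-default index (values are made up).
toy = pd.DataFrame(
    {"genre": ["Cafe", "Dining bar", "Dining bar"], "visitors": [99, 5, 8]},
    index=[10, 20, 30],
)

bars = toy[toy["genre"] == "Dining bar"]
label = bars["visitors"].idxmax()   # label of the busiest dining-bar row

print(label)                        # 30
print(toy.loc[label, "visitors"])   # 8

# toy.iloc[label] would raise IndexError here: the frame has only 3 rows,
# so there is no row at position 30.
```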

In [69]:
# An approach to grouping by the values in a particular column.
# It's similar to SQL but has its own rules from a syntax perspective.
df[df.genre=='Dining bar'].groupby('day_of_week')['visitors'].max()

day_of_week
Friday       132
Monday       348
Saturday     176
Sunday       228
Thursday     105
Tuesday      103
Wednesday    162
Name: visitors, dtype: int64

**4). What was the date that had the highest number of reservations that was a holiday?  Hint:  use the `idxmax()` function**

In [None]:
# your answer here

In [49]:

df[(df['holiday']==1)]['reserve_visitors'].max()
# highest number of reserve visitors on a holiday was 58

58.0

In [54]:
df[(df['holiday']==1)&(df['reserve_visitors']==58)]['visit_date'].max()

'2016-12-30'

In [55]:
df[(df['holiday']==1)&(df['reserve_visitors']==58)]['visit_date'].min()

'2016-12-30'

The highest number of reservations on a holiday was 58, on 30-Dec-2016.

In [56]:
df[(df['holiday']==1)]['reserve_visitors'].idxmax() 
# Returns the index label of the row holding the maximum within the filtered selection.

1503

In [58]:
df.loc[1503]

id                          air_64d4491ad8cdb1c6
visit_date                            2016-12-30
visitors                                      23
calendar_date                         2016-12-30
day_of_week                               Friday
holiday                                        1
genre                                 Dining bar
area                Tōkyō-to Minato-ku Shibakōen
latitude                                 35.6581
longitude                                139.752
reserve_visitors                              58
Name: 1503, dtype: object

**Instructor approach**

In [70]:
df[df.holiday == 1]['visitors'].max()

205

In [71]:
df[df.holiday == 1]['visitors'].idxmax()

122871

In [72]:
df.loc[122871]

id                                     air_df554c4527a1cfe6
visit_date                                       2016-12-30
visitors                                                205
calendar_date                                    2016-12-30
day_of_week                                          Friday
holiday                                                   1
genre                                               Izakaya
area                Shizuoka-ken Hamamatsu-shi Motoshirochō
latitude                                            34.7109
longitude                                           137.726
reserve_visitors                                         58
Name: 122871, dtype: object

### Section III: Special Types of Selectors

To get some additional practice using common Pandas methods, we'll go over some common scenarios you typically have to select data for. 

*The methods used in this section have not been covered in class.*  Each question will come with the recommended method to use.  It's best to use the `?` before the method to read how it works and figure out how to use it.  

It's designed to be a little bit of a treasure hunt to familiarize yourself with a lot of the bread & butter pandas methods.

**1). Can you return the number of null values for each column?**

To use: `df.isnull()`.  **Hint:** `True` sums to 1, `False` to 0.

In [None]:
# your answer here

In [19]:
df.isnull()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,False,False,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
252103,False,False,False,False,False,False,False,False,False,False,False
252104,False,False,False,False,False,False,False,False,False,False,False
252105,False,False,False,False,False,False,False,False,False,False,False
252106,False,False,False,False,False,False,False,False,False,False,False


In [8]:
df.isnull().sum()

id                       0
visit_date               0
visitors                 0
calendar_date            0
day_of_week              0
holiday                  0
genre                    0
area                     0
latitude                 0
longitude                0
reserve_visitors    143714
dtype: int64
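The hint that `True` sums to 1 and `False` to 0 is what makes `isnull().sum()` a count. A small sketch on made-up data shows the boolean mask and the two directions in which it can be summed:

```python
import numpy as np
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "visitors": [25, 32, 29],
    "reserve_visitors": [np.nan, 4.0, np.nan],
})

mask = toy.isnull()            # same shape as toy, True where a value is missing
per_column = mask.sum()        # default axis=0: missing values per column
per_row = mask.sum(axis=1)     # axis=1: missing values per row

print(per_column)
print(per_row.tolist())        # [1, 0, 1]
```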

In [22]:
# ? shows the documentation for an object.
# Keep the comment on its own line; a trailing comment is read as part of the object name.
?pd.isnull

In [23]:
df.isnull().sum(axis = 1) # Gives number of missing values in each row

0         1
1         1
2         1
3         1
4         1
         ..
252103    0
252104    0
252105    0
252106    0
252107    0
Length: 252108, dtype: int64

In [27]:
df.isnull().any() # Do any of our columns have at least one missing value

id                  False
visit_date          False
visitors            False
calendar_date       False
day_of_week         False
holiday             False
genre               False
area                False
latitude            False
longitude           False
reserve_visitors     True
dtype: bool

In [28]:
df.isnull().any(axis=1) # Forces the check across rows rather than columns

0          True
1          True
2          True
3          True
4          True
          ...  
252103    False
252104    False
252105    False
252106    False
252107    False
Length: 252108, dtype: bool

In [29]:
df[df.isnull().any(axis=1)] # Returns row with at least one missing value

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...,...
252087,air_a17f0778617c76e2,2017-04-04,10,2017-04-04,Tuesday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
252092,air_a17f0778617c76e2,2017-04-10,28,2017-04-10,Monday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
252099,air_a17f0778617c76e2,2017-04-17,19,2017-04-17,Monday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,
252100,air_a17f0778617c76e2,2017-04-18,11,2017-04-18,Tuesday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,


**2). Can you find the count values for every single unique value within a column?**

To use: `pd.Series.value_counts()`.  **Hint:** Call it on a *Series* (a single column), not on the whole *DataFrame*.  

In [None]:
# your answer here

In [30]:
df['genre'].value_counts()
# Returns count of unique value in column

Izakaya                         62052
Cafe/Sweets                     52764
Dining bar                      34192
Italian/French                  30011
Bar/Cocktail                    25135
Japanese food                   18789
Other                            8246
Yakiniku/Korean food             7025
Western food                     4897
Creative cuisine                 3868
Okonomiyaki/Monja/Teppanyaki     3706
Asian                             535
Karaoke/Party                     516
International cuisine             372
Name: genre, dtype: int64

In [33]:
?pd.Series.value_counts

In [37]:
?df.genre.value_counts

In [41]:
# ?? gives the source code as well as the description.
# Keep the comment on its own line; a trailing comment is read as part of the object name.
??pd.Series.value_counts


In [42]:
df['genre']

0             Dining bar
1             Dining bar
2             Dining bar
3             Dining bar
4             Dining bar
               ...      
252103    Italian/French
252104    Italian/French
252105    Italian/French
252106    Italian/French
252107    Italian/French
Name: genre, Length: 252108, dtype: object

In [44]:
type(df['genre']) # Series data type

pandas.core.series.Series

In [45]:
type(df)

pandas.core.frame.DataFrame

In [46]:
type(df[['genre']])

pandas.core.frame.DataFrame

In [43]:
df[['genre']] # Double brackets return a one-column DataFrame instead of a Series, so the display changes

Unnamed: 0,genre
0,Dining bar
1,Dining bar
2,Dining bar
3,Dining bar
4,Dining bar
...,...
252103,Italian/French
252104,Italian/French
252105,Italian/French
252106,Italian/French


In [47]:
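# Raised an error in the pandas version used here, where value_counts was a Series-only method.
# pandas 1.1 and later also provide DataFrame.value_counts, which counts unique rows.
df.value_counts()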

AttributeError: 'DataFrame' object has no attribute 'value_counts'

In [None]:
# Easiest way to get a one-column data frame instead of a series
# is to add a second pair of square brackets around the column name.
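The Series versus DataFrame distinction above can be checked directly. This sketch uses made-up data; `to_frame()` is shown as one more way to convert a Series and is not one of the methods suggested by the lab.

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({"genre": ["Izakaya", "Cafe/Sweets", "Izakaya"]})

as_series = toy["genre"]            # single brackets: Series
as_frame = toy[["genre"]]           # double brackets: one-column DataFrame
also_frame = as_series.to_frame()   # explicit conversion from Series

print(type(as_series).__name__)     # Series
print(type(as_frame).__name__)      # DataFrame

# value_counts on the Series counts each unique genre.
counts = as_series.value_counts()
print(counts["Izakaya"])            # 2
```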

**3). Can you find the column with the highest number of unique values?  Can you sort columns by their number of unique values?**

To use: `df.nunique()`, and `sort_values()` on the result if you want to sort it.

In [None]:
# your answer here

In [49]:
df.nunique()

id                  829
visit_date          478
visitors            204
calendar_date       478
day_of_week           7
holiday               2
genre                14
area                103
latitude            108
longitude           108
reserve_visitors     49
dtype: int64

In [50]:
df.nunique().sort_values()

holiday               2
day_of_week           7
genre                14
reserve_visitors     49
area                103
latitude            108
longitude           108
visitors            204
visit_date          478
calendar_date       478
id                  829
dtype: int64

In [56]:
?df.sort_values

In [55]:
??df.sort_values

In [53]:
df.nunique().sort_values(ascending=False)

id                  829
calendar_date       478
visit_date          478
visitors            204
longitude           108
latitude            108
area                103
reserve_visitors     49
genre                14
day_of_week           7
holiday               2
dtype: int64

In [60]:
df.sort_values(by='visitors', ascending=False)

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
35102,air_cfdeb326418194ff,2017-03-08,877,2017-03-08,Wednesday,0,Bar/Cocktail,Tōkyō-to Toshima-ku Minamiikebukuro,35.726118,139.716605,
225377,air_8c3175aa5e4fc569,2017-04-18,777,2017-04-18,Tuesday,0,Bar/Cocktail,Tōkyō-to Toshima-ku Nishiikebukuro,35.732286,139.710247,
158627,air_f2985de32bb792e0,2016-07-10,675,2016-07-10,Sunday,0,Izakaya,Tōkyō-to Ōta-ku Kamata,35.561257,139.716051,
206339,air_eca5e0064dc9314a,2016-08-30,627,2016-08-30,Tuesday,0,Cafe/Sweets,Fukuoka-ken Fukuoka-shi Daimyō,33.589216,130.392813,
61432,air_43d577e0c9460e64,2016-01-24,514,2016-01-24,Sunday,0,Creative cuisine,Hyōgo-ken Nishinomiya-shi Rokutanjichō,34.737597,135.341564,3.0
...,...,...,...,...,...,...,...,...,...,...,...
244631,air_0ead98dd07e7a82a,2017-02-28,1,2017-02-28,Tuesday,0,Cafe/Sweets,Fukuoka-ken Fukuoka-shi Daimyō,33.589216,130.392813,43.0
174055,air_aed3a8b49abe4a48,2017-02-17,1,2017-02-17,Friday,0,Cafe/Sweets,Ōsaka-fu Ōsaka-shi Ōgimachi,34.705362,135.510025,12.0
64282,air_f0c7272956e62f12,2016-02-05,1,2016-02-05,Friday,0,Izakaya,Tōkyō-to Itabashi-ku Itabashi,35.751165,139.709244,
149049,air_6873982b9e19c7ad,2017-03-03,1,2017-03-03,Friday,0,Cafe/Sweets,Hokkaidō Katō-gun Motomachi,42.994143,143.197959,7.0


In [61]:
df.sort_values(by=['id','visitors'], ascending=False)
# By passing a list we sort by ID first, then by visitors within each ID

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
216453,air_fff68b929994bfbd,2016-09-03,18,2016-09-03,Saturday,0,Bar/Cocktail,Tōkyō-to Nakano-ku Nakano,35.708146,139.666288,
216521,air_fff68b929994bfbd,2016-11-25,17,2016-11-25,Friday,0,Bar/Cocktail,Tōkyō-to Nakano-ku Nakano,35.708146,139.666288,10.0
216448,air_fff68b929994bfbd,2016-08-26,16,2016-08-26,Friday,0,Bar/Cocktail,Tōkyō-to Nakano-ku Nakano,35.708146,139.666288,
216522,air_fff68b929994bfbd,2016-11-26,15,2016-11-26,Saturday,0,Bar/Cocktail,Tōkyō-to Nakano-ku Nakano,35.708146,139.666288,12.0
216597,air_fff68b929994bfbd,2017-02-25,15,2017-02-25,Saturday,0,Bar/Cocktail,Tōkyō-to Nakano-ku Nakano,35.708146,139.666288,15.0
...,...,...,...,...,...,...,...,...,...,...,...
166889,air_00a91d42b08b08d9,2016-09-10,3,2016-09-10,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166938,air_00a91d42b08b08d9,2016-11-12,3,2016-11-12,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,4.0
166956,air_00a91d42b08b08d9,2016-12-04,2,2016-12-04,Sunday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,42.0
166922,air_00a91d42b08b08d9,2016-10-24,1,2016-10-24,Monday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,


In [62]:
df.sort_values(by=['id','visitors'], ascending=[True,False])
# This would give IDs in ascending order, and visitors in descending order within each ID.
# The key is passing `ascending` as a list, with one entry per sort column.

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
166973,air_00a91d42b08b08d9,2016-12-24,99,2016-12-24,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,21.0
166915,air_00a91d42b08b08d9,2016-10-14,57,2016-10-14,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
166905,air_00a91d42b08b08d9,2016-10-01,56,2016-10-01,Saturday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,
167061,air_00a91d42b08b08d9,2017-04-21,55,2017-04-21,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,6.0
166943,air_00a91d42b08b08d9,2016-11-18,54,2016-11-18,Friday,0,Italian/French,Tōkyō-to Chiyoda-ku Kudanminami,35.694003,139.753595,7.0
...,...,...,...,...,...,...,...,...,...,...,...
216663,air_fff68b929994bfbd,2017-02-12,1,2017-02-12,Sunday,0,Bar/Cocktail,Tōkyō-to Nakano-ku Nakano,35.708146,139.666288,30.0
216666,air_fff68b929994bfbd,2016-08-28,1,2016-08-28,Sunday,0,Bar/Cocktail,Tōkyō-to Nakano-ku Nakano,35.708146,139.666288,
216668,air_fff68b929994bfbd,2016-09-19,1,2016-09-19,Monday,1,Bar/Cocktail,Tōkyō-to Nakano-ku Nakano,35.708146,139.666288,
216671,air_fff68b929994bfbd,2016-12-30,1,2016-12-30,Friday,1,Bar/Cocktail,Tōkyō-to Nakano-ku Nakano,35.708146,139.666288,58.0
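Multi-column sorting is easier to follow on a frame small enough to check by eye. A minimal sketch on made-up data, using the same `by` and `ascending` arguments as the cells above:

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "id": ["b", "a", "a", "b"],
    "visitors": [3, 7, 9, 1],
})

# 'id' ascending, then 'visitors' descending within each id.
ordered = toy.sort_values(by=["id", "visitors"], ascending=[True, False])

print(ordered)
print(list(ordered.index))  # [2, 1, 0, 3]
```

The original index labels travel with their rows, which is why the printed index is no longer in order after sorting.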


**4). Can you query your dataframe so that it only returns columns that have empty values?**

To use: `df.isnull()`, `df.loc`

In [None]:
# your answer here

In [70]:
df.isnull().any() # Do these columns have any missing values


id                  False
visit_date          False
visitors            False
calendar_date       False
day_of_week         False
holiday             False
genre               False
area                False
latitude            False
longitude           False
reserve_visitors     True
dtype: bool

In [71]:
type(df.isnull().any() ) # So the above is a series output, not a printout.

pandas.core.series.Series

In [72]:
df.isnull().any().index
# The index returns as the column labels

Index(['id', 'visit_date', 'visitors', 'calendar_date', 'day_of_week',
       'holiday', 'genre', 'area', 'latitude', 'longitude',
       'reserve_visitors'],
      dtype='object')

In [80]:
df.loc[:, df.isnull().any()] 

# .loc selects by label or by boolean mask
# [:, ...] => keep all the rows
# [..., df.isnull().any()] => keep only the columns flagged True in the mask
# So this returns every row, but only the columns that contain missing values

Unnamed: 0,reserve_visitors
0,
1,
2,
3,
4,
...,...
252103,6.0
252104,37.0
252105,35.0
252106,3.0


In [81]:
df.loc[df.isnull().any(axis=1), df.isnull().any()]
# Same column filter, now restricted to rows with a missing value, so every value shown is NaN

Unnamed: 0,reserve_visitors
0,
1,
2,
3,
4,
...,...
252087,
252092,
252099,
252100,
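`.loc` accepts one boolean mask for rows and another for columns. The sketch below, on made-up data, builds both masks from `isnull()` and applies them together:

```python
import numpy as np
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "visitors": [25, 32, 29],
    "reserve_visitors": [np.nan, 4.0, np.nan],
})

cols_with_nan = toy.isnull().any()         # one True/False per column
rows_with_nan = toy.isnull().any(axis=1)   # one True/False per row

# Rows first, columns second.
subset = toy.loc[rows_with_nan, cols_with_nan]
print(subset)
```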


**5).  Can you query the dataframe such that it only returns rows that have *no* missing values, in any of their columns?**

To use: `df.isnull()`, `df.any()`, or, conversely, `df.notnull()`, and `df.all()`

**Hint:** The `~` operator, if put in front of a query, selects for values that are **not** True.

In [None]:
# your answer here

In [90]:
df[df.notnull().all(axis=1)]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
11,air_ba937bf13d40fb24,2016-01-26,11,2016-01-26,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
21,air_ba937bf13d40fb24,2016-02-09,15,2016-02-09,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,7.0
24,air_ba937bf13d40fb24,2016-02-12,26,2016-02-12,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,18.0
25,air_ba937bf13d40fb24,2016-02-13,8,2016-02-13,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
37,air_ba937bf13d40fb24,2016-02-27,23,2016-02-27,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252104,air_a17f0778617c76e2,2017-04-22,60,2017-04-22,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


In [89]:
df[~df.isnull().any(axis=1)]

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
11,air_ba937bf13d40fb24,2016-01-26,11,2016-01-26,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
21,air_ba937bf13d40fb24,2016-02-09,15,2016-02-09,Tuesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,7.0
24,air_ba937bf13d40fb24,2016-02-12,26,2016-02-12,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,18.0
25,air_ba937bf13d40fb24,2016-02-13,8,2016-02-13,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
37,air_ba937bf13d40fb24,2016-02-27,23,2016-02-27,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,2.0
...,...,...,...,...,...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,49,2017-04-21,Friday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,6.0
252104,air_a17f0778617c76e2,2017-04-22,60,2017-04-22,Saturday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,37.0
252105,air_a17f0778617c76e2,2017-03-26,69,2017-03-26,Sunday,0,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,35.0
252106,air_a17f0778617c76e2,2017-03-20,31,2017-03-20,Monday,1,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri,34.695124,135.197852,3.0


In [96]:
~(df.isnull().any(axis=1)) # ~ Flips to the reverse

0         False
1         False
2         False
3         False
4         False
          ...  
252103     True
252104     True
252105     True
252106     True
252107     True
Length: 252108, dtype: bool
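The two masks used above are logical mirrors of each other, and both match what `dropna()` does by default. `dropna()` is not among the methods suggested for this question; it is included here only as a point of comparison, on made-up data.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "visitors": [25, 32, 29],
    "reserve_visitors": [np.nan, 4.0, np.nan],
})

complete_a = toy[toy.notnull().all(axis=1)]   # every value present
complete_b = toy[~toy.isnull().any(axis=1)]   # NOT (any value missing)
complete_c = toy.dropna()                     # drops rows with any NaN

print(complete_a)
```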

**6).  Can you find rows that contain duplicate values?**

To use:  `df.duplicated()`

In [None]:
# your answer here

0         False
1          True
2          True
3          True
4          True
          ...  
252103     True
252104     True
252105     True
252106     True
252107     True
Length: 252108, dtype: bool

**7). Can you find rows that contain duplicated values for the visitors and date columns?**  

To use: `df.duplicated()`

In [None]:
# your answer here

In [99]:
df[df.duplicated()] 
# This checks across every single column for a duplicate

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors


In [100]:
df.duplicated(subset=['day_of_week','visitors'])
# use subset

0         False
1         False
2         False
3         False
4         False
          ...  
252103     True
252104     True
252105     True
252106     True
252107     True
Length: 252108, dtype: bool

In [102]:
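# Keep only the first occurrence of each (day_of_week, visitors) pair.
# (DataFrame.sort_values needs a `by` argument, so it is left out here.)
df[~df.duplicated(subset=['day_of_week','visitors'])]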

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,
...,...,...,...,...,...,...,...,...,...,...,...
252020,air_1c0b150f9e696a5f,2017-04-04,121,2017-04-04,Tuesday,0,Okonomiyaki/Monja/Teppanyaki,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,
252021,air_1c0b150f9e696a5f,2017-04-05,115,2017-04-05,Wednesday,0,Okonomiyaki/Monja/Teppanyaki,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,2.0
252038,air_1c0b150f9e696a5f,2017-03-26,166,2017-03-26,Sunday,0,Okonomiyaki/Monja/Teppanyaki,Tōkyō-to Shibuya-ku Shibuya,35.661777,139.704051,35.0
252050,air_900d755ebd2f7bbd,2017-04-12,114,2017-04-12,Wednesday,0,Italian/French,Tōkyō-to Chūō-ku Ginza,35.672114,139.770825,2.0
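By default `duplicated()` marks only the second and later occurrences, which can be surprising. The sketch below uses made-up data to contrast the default with `keep=False`, which flags every member of a duplicated group.

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "day_of_week": ["Mon", "Mon", "Tue", "Mon"],
    "visitors": [5, 5, 5, 9],
})

cols = ["day_of_week", "visitors"]
later_copies = toy.duplicated(subset=cols)             # first occurrence stays False
all_copies = toy.duplicated(subset=cols, keep=False)   # every repeated row is True

print(later_copies.tolist())  # [False, True, False, False]
print(all_copies.tolist())    # [True, True, False, False]
```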


**8).  Can you only select columns that are text based?**

To use: `df.select_dtypes()`, and (optionally) the `columns` attribute.  **Note:** `columns` is NOT a method!

In [None]:
# your answer here

**9).  Can you only select columns that are numeric?**

To use: `df.select_dtypes()`.  This question is very similar to the one above it, just for a different data type.

In [None]:
# your answer here

In [110]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                252108 non-null  object 
 1   visit_date        252108 non-null  object 
 2   visitors          252108 non-null  int64  
 3   calendar_date     252108 non-null  object 
 4   day_of_week       252108 non-null  object 
 5   holiday           252108 non-null  int64  
 6   genre             252108 non-null  object 
 7   area              252108 non-null  object 
 8   latitude          252108 non-null  float64
 9   longitude         252108 non-null  float64
 10  reserve_visitors  108394 non-null  float64
dtypes: float64(3), int64(2), object(6)
memory usage: 21.2+ MB


In [108]:
df.select_dtypes(include='object') # np.object was removed from recent NumPy versions

Unnamed: 0,id,visit_date,calendar_date,day_of_week,genre,area
0,air_ba937bf13d40fb24,2016-01-13,2016-01-13,Wednesday,Dining bar,Tōkyō-to Minato-ku Shibakōen
1,air_ba937bf13d40fb24,2016-01-14,2016-01-14,Thursday,Dining bar,Tōkyō-to Minato-ku Shibakōen
2,air_ba937bf13d40fb24,2016-01-15,2016-01-15,Friday,Dining bar,Tōkyō-to Minato-ku Shibakōen
3,air_ba937bf13d40fb24,2016-01-16,2016-01-16,Saturday,Dining bar,Tōkyō-to Minato-ku Shibakōen
4,air_ba937bf13d40fb24,2016-01-18,2016-01-18,Monday,Dining bar,Tōkyō-to Minato-ku Shibakōen
...,...,...,...,...,...,...
252103,air_a17f0778617c76e2,2017-04-21,2017-04-21,Friday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri
252104,air_a17f0778617c76e2,2017-04-22,2017-04-22,Saturday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri
252105,air_a17f0778617c76e2,2017-03-26,2017-03-26,Sunday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri
252106,air_a17f0778617c76e2,2017-03-20,2017-03-20,Monday,Italian/French,Hyōgo-ken Kōbe-shi Kumoidōri


In [109]:
df.select_dtypes(include=np.int64)

Unnamed: 0,visitors,holiday
0,25,0
1,32,0
2,29,0
3,22,0
4,6,0
...,...,...
252103,49,0
252104,60,0
252105,69,0
252106,31,1


In [111]:
df.select_dtypes(include=np.float64)

Unnamed: 0,latitude,longitude,reserve_visitors
0,35.658068,139.751599,
1,35.658068,139.751599,
2,35.658068,139.751599,
3,35.658068,139.751599,
4,35.658068,139.751599,
...,...,...,...
252103,34.695124,135.197852,6.0
252104,34.695124,135.197852,37.0
252105,34.695124,135.197852,35.0
252106,34.695124,135.197852,3.0


In [112]:
df.select_dtypes(include=np.number)

Unnamed: 0,visitors,holiday,latitude,longitude,reserve_visitors
0,25,0,35.658068,139.751599,
1,32,0,35.658068,139.751599,
2,29,0,35.658068,139.751599,
3,22,0,35.658068,139.751599,
4,6,0,35.658068,139.751599,
...,...,...,...,...,...
252103,49,0,34.695124,135.197852,6.0
252104,60,0,34.695124,135.197852,37.0
252105,69,0,34.695124,135.197852,35.0
252106,31,1,34.695124,135.197852,3.0
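`select_dtypes` also accepts an `exclude` argument, and the `columns` attribute turns the result into a list of names. This sketch on made-up data splits a frame into numeric and non-numeric columns; using `exclude="number"` for the text side is a choice made here so the example does not depend on how a given pandas version stores strings.

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "genre": ["Izakaya", "Asian"],
    "visitors": [25, 32],
    "latitude": [35.6, 34.7],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['visitors', 'latitude']
print(other_cols)    # ['genre']
```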


**10). Can you fill in the missing values of your numeric columns with their average value?**

To use: `df.fillna()`, to be used in conjunction with the suggested methods from question 9.

In [None]:
# your answer here

In [113]:
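# Note: this binds a second name to the same DataFrame rather than copying it,
# so the fill below also changes df. Use df.copy() for an independent copy.
df_fill_na = df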

In [118]:
numcols = df_fill_na.select_dtypes(include=np.number).columns.tolist()

In [120]:
df[numcols].fillna(4) # Fills NA with 4s.

Unnamed: 0,visitors,holiday,latitude,longitude,reserve_visitors
0,25,0,35.658068,139.751599,4.0
1,32,0,35.658068,139.751599,4.0
2,29,0,35.658068,139.751599,4.0
3,22,0,35.658068,139.751599,4.0
4,6,0,35.658068,139.751599,4.0
...,...,...,...,...,...
252103,49,0,34.695124,135.197852,6.0
252104,60,0,34.695124,135.197852,37.0
252105,69,0,34.695124,135.197852,35.0
252106,31,1,34.695124,135.197852,3.0


In [121]:
df[numcols].mean()

visitors             20.973761
holiday               0.050673
latitude             35.613121
longitude           137.357865
reserve_visitors     16.699808
dtype: float64

In [123]:
df_fill_na[numcols]=df[numcols].fillna(df[numcols].mean())

In [128]:
df_fill_na.isnull().any()

id                  False
visit_date          False
visitors            False
calendar_date       False
day_of_week         False
holiday             False
genre               False
area                False
latitude            False
longitude           False
reserve_visitors    False
dtype: bool
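
As a recap of the steps above, here is a minimal sketch of mean-filling on a small made-up frame. The `toy` data and the `filled` name are assumptions for illustration only. Unlike the cells above, it works on an explicit `.copy()`, so the original frame keeps its missing values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the restaurant data (values are made up).
toy = pd.DataFrame({
    "genre": ["Dining bar", "Cafe", "Izakaya", "Cafe"],
    "visitors": [25, 32, 29, 22],
    "reserve_visitors": [np.nan, 6.0, np.nan, 10.0],
})

# .copy() gives an independent frame, so `toy` keeps its NaNs.
filled = toy.copy()

# Pick the numeric columns, then fill each column's NaNs with that column's mean.
numcols = filled.select_dtypes(include=np.number).columns
filled[numcols] = filled[numcols].fillna(filled[numcols].mean())

print(filled)
print(toy["reserve_visitors"].isna().sum())  # missing values left in the original
```

Passing a Series of means to `fillna` matches on column names, so each column is filled with its own average rather than one global value.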

**11). Can you select all the rows between Jan. 1 2016 & June 30, 2016?**

In [None]:
# your answer here

In [130]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                252108 non-null  object 
 1   visit_date        252108 non-null  object 
 2   visitors          252108 non-null  int64  
 3   calendar_date     252108 non-null  object 
 4   day_of_week       252108 non-null  object 
 5   holiday           252108 non-null  int64  
 6   genre             252108 non-null  object 
 7   area              252108 non-null  object 
 8   latitude          252108 non-null  float64
 9   longitude         252108 non-null  float64
 10  reserve_visitors  252108 non-null  float64
dtypes: float64(3), int64(2), object(6)
memory usage: 21.2+ MB


In [131]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808


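There are a few ways to deal with dates.

One is to convert them on load with the `parse_dates` argument of `pd.read_csv`. Note that `parse_dates=True` only tries to parse the index, so the date columns stay `object` in the output below. To convert specific columns, pass a list of column names, such as `parse_dates=['visit_date']`.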

In [143]:
df_dates = pd.read_csv(r"/Users/PRSmb/OneDrive/General-Assembly/my-1019-repo/ClassMaterial/Unit2/data/restaurants.csv",parse_dates=True)

In [156]:
df_dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                252108 non-null  object 
 1   visit_date        252108 non-null  object 
 2   visitors          252108 non-null  int64  
 3   calendar_date     252108 non-null  object 
 4   day_of_week       252108 non-null  object 
 5   holiday           252108 non-null  int64  
 6   genre             252108 non-null  object 
 7   area              252108 non-null  object 
 8   latitude          252108 non-null  float64
 9   longitude         252108 non-null  float64
 10  reserve_visitors  108394 non-null  float64
dtypes: float64(3), int64(2), object(6)
memory usage: 21.2+ MB
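
The `info()` output above still lists `visit_date` as `object`. The sketch below compares `parse_dates=True` with a list of column names. It reads a two-row in-memory CSV (an assumption, so that it runs without the lab's file) built from rows shown earlier in `df.head()`.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for restaurants.csv.
csv_text = """id,visit_date,visitors
air_ba937bf13d40fb24,2016-01-13,25
air_ba937bf13d40fb24,2016-01-14,32
"""

# parse_dates=True only asks pandas to try parsing the index.
as_index_only = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# Naming the column converts it while the file is read.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["visit_date"])

print(as_index_only["visit_date"].dtype)
print(parsed["visit_date"].dtype)
```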


In [136]:
?pd.read_csv

In [135]:
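# Another way: convert an existing column after loading.
# pd.to_datetime is preferred; astype(np.datetime64) raises an error in recent pandas versions.
df['visit_date'] = pd.to_datetime(df['visit_date'])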

In [141]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252108 entries, 0 to 252107
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                252108 non-null  object        
 1   visit_date        252108 non-null  datetime64[ns]
 2   visitors          252108 non-null  int64         
 3   calendar_date     252108 non-null  object        
 4   day_of_week       252108 non-null  object        
 5   holiday           252108 non-null  int64         
 6   genre             252108 non-null  object        
 7   area              252108 non-null  object        
 8   latitude          252108 non-null  float64       
 9   longitude         252108 non-null  float64       
 10  reserve_visitors  252108 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(5)
memory usage: 21.2+ MB
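
Once `visit_date` is a `datetime64[ns]` column, question 11 comes down to a boolean mask. This is one possible sketch on a small made-up frame (`toy`, `mask` and `first_half` are illustrative names), not the only way to do it.

```python
import pandas as pd

# Made-up rows that straddle both ends of the target range.
toy = pd.DataFrame({
    "visit_date": ["2015-12-31", "2016-01-01", "2016-03-15", "2016-06-30", "2016-07-01"],
    "visitors": [10, 25, 32, 29, 22],
})
toy["visit_date"] = pd.to_datetime(toy["visit_date"])

# Series.between is inclusive on both ends by default.
mask = toy["visit_date"].between("2016-01-01", "2016-06-30")
first_half = toy[mask]

# The same selection spelled out with comparison operators.
same_rows = toy[(toy["visit_date"] >= "2016-01-01") & (toy["visit_date"] <= "2016-06-30")]

print(first_half)
```

On the lab data the same mask would be built from `df['visit_date']`. If a column also carried times of day, the upper bound `"2016-06-30"` would mean midnight at the start of that day, so later timestamps on June 30 would be excluded.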


In [146]:
df['visit_date'].dt.day 

0         13
1         14
2         15
3         16
4         18
          ..
252103    21
252104    22
252105    26
252106    20
252107     9
Name: visit_date, Length: 252108, dtype: int64

In [148]:
df['visit_date'].dt.month 

0         1
1         1
2         1
3         1
4         1
         ..
252103    4
252104    4
252105    3
252106    3
252107    4
Name: visit_date, Length: 252108, dtype: int64

In [147]:
df['visit_date'].dt.quarter 

0         1
1         1
2         1
3         1
4         1
         ..
252103    2
252104    2
252105    1
252106    1
252107    2
Name: visit_date, Length: 252108, dtype: int64

In [149]:
df['visit_date'].dt.year 

0         2016
1         2016
2         2016
3         2016
4         2016
          ... 
252103    2017
252104    2017
252105    2017
252106    2017
252107    2017
Name: visit_date, Length: 252108, dtype: int64

In [152]:
dir(df['visit_date'].dt) # Lists every attribute and method available on the .dt accessor

['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accessors',
 '_add_delegate_accessors',
 '_constructor',
 '_delegate_method',
 '_delegate_property_get',
 '_delegate_property_set',
 '_deprecations',
 '_dir_additions',
 '_dir_deletions',
 '_freeze',
 '_get_values',
 '_reset_cache',
 'ceil',
 'date',
 'day',
 'day_name',
 'dayofweek',
 'dayofyear',
 'days_in_month',
 'daysinmonth',
 'floor',
 'freq',
 'hour',
 'is_leap_year',
 'is_month_end',
 'is_month_start',
 'is_quarter_end',
 'is_quarter_start',
 'is_year_end',
 'is_year_start',
 'microsecond',
 'minute',
 'month',
 'month_name',
 'nanosecond',
 'normalize',
 'quarter',
 'round',
 'second',
 'st

In [153]:
df['month'] = df['visit_date'].dt.month

In [154]:
df['quarter'] = df['visit_date'].dt.quarter

These variables may help us capture seasonal effects, such as busier months or quarters.

In [155]:
df.head()

Unnamed: 0,id,visit_date,visitors,calendar_date,day_of_week,holiday,genre,area,latitude,longitude,reserve_visitors,month,quarter
0,air_ba937bf13d40fb24,2016-01-13,25,2016-01-13,Wednesday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
1,air_ba937bf13d40fb24,2016-01-14,32,2016-01-14,Thursday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
2,air_ba937bf13d40fb24,2016-01-15,29,2016-01-15,Friday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
3,air_ba937bf13d40fb24,2016-01-16,22,2016-01-16,Saturday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
4,air_ba937bf13d40fb24,2016-01-18,6,2016-01-18,Monday,0,Dining bar,Tōkyō-to Minato-ku Shibakōen,35.658068,139.751599,16.699808,1,1
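
To illustrate how the `month` and `quarter` columns might be used for seasonality, here is a hedged sketch that averages visitors per quarter. The data is made up and `avg_by_quarter` is an illustrative name.

```python
import pandas as pd

# Made-up visits spread across three quarters.
toy = pd.DataFrame({
    "visit_date": pd.to_datetime(["2016-01-13", "2016-02-20", "2016-04-05", "2016-08-09"]),
    "visitors": [20, 30, 40, 10],
})

# Derive the calendar features from the datetime column via the .dt accessor.
toy["month"] = toy["visit_date"].dt.month
toy["quarter"] = toy["visit_date"].dt.quarter

# Average visitors per quarter gives a rough view of seasonal differences.
avg_by_quarter = toy.groupby("quarter")["visitors"].mean()
print(avg_by_quarter)
```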


**12).  Can you determine the quarter of the year for each reservation?  The month?**

In [None]:
# we can get the quarters using the dt attribute