<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Core-questions" data-toc-modified-id="Core-questions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Core questions</a></span></li><li><span><a href="#Extension-questions" data-toc-modified-id="Extension-questions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Extension questions</a></span></li><li><span><a href="#Core-question-solutions" data-toc-modified-id="Core-question-solutions-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Core question solutions</a></span></li><li><span><a href="#Extension-question-solutions" data-toc-modified-id="Extension-question-solutions-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Extension question solutions</a></span></li></ul></div>

<p style="page-break-after:always;"></p>

# Setup

Let's load in the datasets we need for the practice session. They correspond to cuts of the tables from the `omni_pool` database we used earlier in the course. The idea in this session is to try to do in `pandas` some of the operations we coded in `SQL` (with a bit of variation to keep things fresh)!

We would recommend you try all of the Core Questions below, and then move on to the Extension Questions if time permits.

**Note** - some of the questions should be attempted only after you have covered the material on joining, melting and pivoting `DataFrame`s on day 3. We have marked these questions below.

In [1]:
import pandas as pd

employees = pd.read_csv('data/omni_employees.csv', parse_dates=['start_date'])
pay_details = pd.read_csv('data/omni_pay_details.csv')
teams = pd.read_csv('data/omni_teams.csv')
committees = pd.read_csv('data/omni_committees.csv')
employees_committees = pd.read_csv('data/omni_employees_committees.csv')

# Core questions

***Q1.*** Perform some basic data exploration of the `employees` `DataFrame`. Methods and attributes to consider using here include `.info()`, `.describe()` and `.shape`. In particular, answer the following:

- What data type is each column?
- How many unique values are there in `department` and `country`?
- What are the minimum and maximum of `salary`?
- How many rows are there in the `DataFrame`?
  
***  
  
***Q2.*** Find all the details of `employees` who work in the 'Legal' `department`.

***
  
***Q3.*** How many `employees` are based in Japan?  

**Hint** - Running `.count()` on the `id` column might help here, or you could use the `.shape` attribute 

***

***Q4.*** [**Harder**] In the question above, we suggested you `.count()` the `id` column (i.e. treating it like a `SQL` primary key). But we haven't yet shown that `id` satisfies the associated requirements. Confirm that the number of unique values in `id` equals the number of rows in `employees`.

**Hints**

* The `.shape` attribute of the `employees` `DataFrame` and the `.nunique()` method can be combined in one line of code to show this
* Remember that `.shape` returns a `tuple`: which element of the `tuple` do you need?

***

***Q5.*** How many `employees` have a missing `email` address?

***

***Q6.*** Calculate the mean `salary` of `employees` in the 'Legal' `department`.

***

***Q7.*** Obtain the `first_name`, `last_name` and `salary` of all `employees` sorted in descending order of `salary`.

***

***Q8.*** Obtain the `first_name`, `last_name` and `country` of `employees` ordered first alphabetically by `country` and then alphabetically by `last_name`.

***

***Q9.*** [**Harder**] Obtain the `first_name`, `last_name` and `email` address of all `employees` in the 'Engineering' `department` who work 0.5 `fte_hours` or greater.

***

***Q10.*** [**Harder**] Calculate the mean `salary` of all `employees` who are members either of the 'Legal' or the 'Accounting' `department`s.

**Hint** - The `.isin()` method lets you do this in one line of code, without having to manually combine means for the two `department`s
   
***
    
***Q11.*** [**Harder**] How many pension enrolled `employees` are based outside of France, Austria or Ireland?

**Hint** - Remember `~` is the negation operator in `pandas`

***
    
***Q12.*** Add a new column `effective_salary` to `employees` containing `salary` multiplied by `fte_hours`.

***

***Q13.*** Obtain the details of all those `employees` whose `last_name` starts with 'R'.

**Hint** - Perhaps one of the `StringMethods` can be used: you can access these via the `.str` accessor 
   
***
    
***Q14.*** [**Harder**] Are there any `employees` whose `last_name` does not start with a capital letter?

**Hints** 

* Assume there are no missing values in `last_name`
* The `.str.startswith()` method won't accept regex, unfortunately
* Regex anchors may help here: `^` means 'the position just before the start of a string', and `$` means 'the position just after the end of a string'
* Remember `~` is the negation operator in `pandas`
   
***

***Q15.*** Create a new column `start_month` in `employees`, containing the month in which an employee started with the corporation. Make `start_month` a string rather than a number, e.g. 'June' rather than '6'.

**Hint** - Remember you can access `DatetimeProperties` using the `.dt` accessor

***
    
***Q16.*** Obtain a table showing the number of `employees` in each `department`.

**Hint** - Think **split-apply-combine** 

***

***Q17.*** Obtain a count of the number of `employees` enrolled and not enrolled in the pension scheme. Ignore missing values in `pension_enrol` for now.

***

***Q18.*** [**Harder**] Repeat your analysis from Question 17 above, but this time fill any missing values in `pension_enrol` with the string 'Missing'.

**Hint** - The `.fillna()` method will help here

***

***Q19.*** Obtain a count by `department` of the number of `employees` enrolled and not enrolled in the pension scheme. Include missing values or otherwise fill them with the string 'Missing' as in the question above. [**Harder**] - Change the header of the column containing count values to 'num_employees'.

**Hints** 

* You need to `.groupby()` on multiple columns
* Is there an argument to `.groupby()` that lets you keep missing values?
* The `.agg()` method can take arguments that let you specify the output column name for an aggregator

***
    
***Q20.*** Obtain a table showing all `employees` details together with a column `team_name` containing the name of their team. 

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**

* This will require joining `employees` to `teams`
* You can `.drop()` unwanted columns from `teams` after the join
* The `.rename()` method lets you change column names

***
    
***Q21.*** [**Harder**] Obtain a table showing `team_name` together with a count of the `num_employees` in each team.

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**
    
* You will need to join `employees` to `teams`
* The `.rename()` method lets you rename a column
* The `.agg()` method can take arguments that let you specify the output column name for an aggregator    

***

***Q22.*** [**Harder**] Obtain a table showing the `id`, `first_name`, `last_name` and `department` of any `employees` lacking both a `local_account_no` and `local_sort_code` in their `pay_details`.

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**
    
* Filter `pay_details` down to those rows lacking both a `local_account_no` and `local_sort_code`, and then join that to `employees`
* The `.all()` method applied over columns can help here

# Extension questions

***Q1.*** Investigate the use of the `.where()` function in the `numpy` package to add a column `hours_class` to `employees` containing the value 'low' when `fte_hours` is 0.5 or less and 'high' otherwise.

**Hint** - Import `numpy` as `np`, and `.where()` will then be available as `np.where()`

***
    
***Q2.*** Are there any `employees` with invalid `email` addresses?

**Hint** - Count non-null `email` addresses that don't match a reasonable regex pattern for a valid email address

***

***Q3.*** How many of the `employees` serve on one or more `committees`?

**Hints**

* This requires only the `employees_committees` `DataFrame`
* Think about distinct `employees`
    
***
    
***Q4.*** Get the full employee details (including committee name) of any committee members based in Ukraine.

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**
    
* This is a many-to-many join, so three `DataFrame`s will be involved: `employees`, `committees` and `employees_committees`
* The `.query()` method lets you do all this in one line of code

***

***Q5.*** Obtain a table with column `start_month` containing the names of months and column `num_started` showing the number of `employees` who started in that month (in any year). The table should contain 12 rows, one for each month, and be ordered 'January', 'February', 'March' and so forth, in ascending calendar order.

**Hints**
    
* Extracting both the `.month_name()` and `.month` from `start_date` might help here
* We used `.assign()` to create the two new date feature columns before any subsequent operations

# Core question solutions

***Q1.*** Perform some basic data exploration of the `employees` `DataFrame`. Methods and attributes to consider using here include `.info()`, `.describe()` and `.shape`. In particular, answer the following:

- What data type is each column?
- How many unique values are there in `department` and `country`?
- What are the minimum and maximum of `salary`?
- How many rows are there in the `DataFrame`?

In [2]:
# the .info() method gives us the data type of each column
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             1000 non-null   int64         
 1   first_name     974 non-null    object        
 2   last_name      1000 non-null   object        
 3   email          878 non-null    object        
 4   department     1000 non-null   object        
 5   team_id        1000 non-null   int64         
 6   grade          980 non-null    float64       
 7   country        1000 non-null   object        
 8   fte_hours      1000 non-null   float64       
 9   pension_enrol  958 non-null    object        
 10  salary         935 non-null    float64       
 11  pay_detail_id  1000 non-null   int64         
 12  start_date     926 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(3), object(6)
memory usage: 101.7+ KB


In [3]:
# .describe() with  include='all' gives basic overview of values in columns
employees.describe(include='all')


Unnamed: 0,id,first_name,last_name,email,department,team_id,grade,country,fte_hours,pension_enrol,salary,pay_detail_id,start_date
count,1000.0,974,1000,878,1000,1000.0,980.0,1000,1000.0,958,935.0,1000.0,926
unique,,918,992,878,12,,,130,,2,,,894
top,,Claudio,Caldayrou,jikinslb@fotki.com,Legal,,,China,,Yes,,,1992-06-05 00:00:00
freq,,3,2,1,102,,,192,,488,,,3
first,,,,,,,,,,,,,1990-01-25 00:00:00
last,,,,,,,,,,,,,2019-04-24 00:00:00
mean,500.5,,,,,5.399,0.189796,,0.62775,,59929.568984,500.5,
std,288.819436,,,,,2.88342,0.39234,,0.278514,,22891.672558,288.819436,
min,1.0,,,,,1.0,0.0,,0.25,,20063.0,1.0,
25%,250.75,,,,,3.0,0.0,,0.5,,40103.5,250.75,


We see that we have 12 unique values of `department`, 130 unique values of `country`, the minimum `salary` is 20,063 and the maximum `salary` is 99,889.
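
These values can also be read off directly rather than picked out of the `.describe()` table. Here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not part of the course datasets):

```python
import pandas as pd

# Hypothetical stand-in for `employees`, for illustration only
toy = pd.DataFrame({
    'department': ['Legal', 'Legal', 'Sales', 'Training'],
    'country': ['Japan', 'China', 'China', 'China'],
    'salary': [30000.0, 45000.0, None, 52000.0],
})

# number of distinct non-missing values in each selected column
unique_counts = toy[['department', 'country']].nunique()

# min and max skip missing values by default
salary_range = toy['salary'].agg(['min', 'max'])

print(unique_counts)
print(salary_range)
```

The same two calls should work unchanged on `employees`.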

In [4]:
# .shape gives (num_rows, num_columns)
employees.shape

(1000, 13)

***Q2.*** Find all the details of `employees` who work in the 'Legal' `department`.

In [5]:
employees.loc[employees.department == 'Legal']
# or
employees.query("department == 'Legal'")

Unnamed: 0,id,first_name,last_name,email,department,team_id,grade,country,fte_hours,pension_enrol,salary,pay_detail_id,start_date
0,1,Ibbie,Roscrigg,iroscrigg0@google.fr,Legal,9,0.0,Nigeria,0.25,Yes,97667.0,1,2014-12-25
2,4,Osmund,Kittel,okittel3@bloomberg.com,Legal,10,0.0,United Kingdom,1.00,Yes,51200.0,4,2007-09-06
10,12,Thorstein,Garr,tgarrb@icio.us,Legal,1,0.0,China,0.75,No,39926.0,12,2012-03-23
17,19,Robby,Harragin,,Legal,10,0.0,South Korea,0.25,Yes,70830.0,19,1998-07-20
35,37,Franky,Idell,fidell10@economist.com,Legal,4,0.0,Sweden,1.00,Yes,,37,1990-06-06
...,...,...,...,...,...,...,...,...,...,...,...,...,...
942,944,Velvet,Mellodey,vmellodeyq7@huffingtonpost.com,Legal,2,0.0,Philippines,0.75,Yes,70961.0,944,NaT
958,960,Joela,McClenaghan,jmcclenaghanqn@aboutads.info,Legal,10,0.0,China,1.00,Yes,56952.0,960,2011-02-19
965,967,Elsinore,Stein,,Legal,2,0.0,Malaysia,1.00,Yes,,967,2013-01-25
980,982,Maurita,Sirkett,msirkettr9@webs.com,Legal,6,0.0,Ecuador,0.25,No,97989.0,982,2006-12-09


***Q3.*** How many `employees` are based in Japan?  

**Hint** - Running `.count()` on the `id` column might help here, or you could use the `.shape` attribute 

In [6]:
employees.loc[employees.country == 'Japan', 'id'].count()
# or
employees.loc[employees.country == 'Japan'].shape[0]

26

***Q4.*** [**Harder**] In the question above, we suggested you `.count()` the `id` column (i.e. treating it like a `SQL` primary key). But we haven't yet shown that `id` satisfies the associated requirements. Confirm that the number of unique values in `id` equals the number of rows in `employees`.

**Hints**

* The `.shape` attribute of the `employees` `DataFrame` and the `.nunique()` method can be combined in one line of code to show this
* Remember that `.shape` returns a `tuple`: which element of the `tuple` do you need?

In [7]:
employees.id.nunique() == employees.shape[0]

True
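
A primary key needs to be both unique and free of missing values. Because `.nunique()` skips missing values by default, the equality above already rules them out, but the two conditions can also be checked separately. A small sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame: `id` behaves like a key, `last_name` does not
toy = pd.DataFrame({
    'id': [1, 2, 4, 5],
    'last_name': ['Garr', 'Kittel', 'Stein', 'Stein'],
})

# .is_unique is True when no value is repeated
id_is_unique = toy['id'].is_unique

# a key column should also contain no missing values
id_has_no_missing = toy['id'].notna().all()

# 'Stein' appears twice, so last_name could not serve as a key
last_name_is_unique = toy['last_name'].is_unique

print(id_is_unique, id_has_no_missing, last_name_is_unique)
```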

***Q5.*** How many `employees` have a missing `email` address?

In [8]:
employees.loc[employees.email.isna(), 'id'].count()

# or
employees.loc[employees.email.isna()].shape[0]

122
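
This agrees with the `.info()` output in Q1: 1000 rows minus 878 non-null `email` values leaves 122 missing. Another common idiom is to sum the boolean mask directly, since `True` counts as 1. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame with two missing email addresses
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'email': ['a@x.com', None, 'c@y.org', None],
})

# True counts as 1, so summing the mask counts the missing values
num_missing = toy['email'].isna().sum()

# .count() gives the complementary number of non-missing values
num_present = toy['email'].count()

print(num_missing, num_present)
```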

***Q6.*** Calculate the mean `salary` of `employees` in the 'Legal' `department`.

In [9]:
employees.loc[employees.department=='Legal'].salary.mean()

56503.947916666664

***Q7.*** Obtain the `first_name`, `last_name` and `salary` of all `employees` sorted in descending order of `salary`.

In [10]:
employees.loc[:, ['first_name', 'last_name', 'salary']].sort_values('salary', ascending=False)

Unnamed: 0,first_name,last_name,salary
758,Gustave,Truwert,99889.0
945,Patrice,Chitty,99853.0
857,Corny,Yearn,99798.0
963,Brucie,Ceschini,99634.0
198,Katinka,Peffer,99565.0
...,...,...,...
922,Claus,Hadigate,
940,Isobel,McMillan,
953,Martainn,McCaughan,
965,Elsinore,Stein,


***Q8.*** Obtain the `first_name`, `last_name` and `country` of `employees` ordered first alphabetically by `country` and then alphabetically by `last_name`.

In [11]:
employees.loc[:, ['first_name', 'last_name', 'country']].sort_values(['country', 'last_name'])

Unnamed: 0,first_name,last_name,country
929,Abeu,Pawden,Afghanistan
176,Trixi,Pickvance,Afghanistan
255,Vance,Ratlee,Afghanistan
792,Beale,Raynard,Afghanistan
338,Bentlee,Toy,Afghanistan
...,...,...,...
691,Ebenezer,Roseby,Yemen
421,Gretta,Zealey,Yemen
358,Farrel,Clethro,Zambia
920,Conn,Robiot,Zambia


***Q9.*** [**Harder**] Obtain the `first_name`, `last_name` and `email` address of all `employees` in the 'Engineering'  `department` who work 0.5 `fte_hours` or greater.

In [12]:
employees.loc[
    (employees.department == 'Engineering') & (employees.fte_hours >= 0.5), 
    ['first_name', 'last_name', 'email']
]

Unnamed: 0,first_name,last_name,email
3,Feodora,Dumingos,fdumingos4@bandcamp.com
37,Sybilla,Lodewick,slodewick12@salon.com
55,Rheba,Booton,rbooton1k@bravesites.com
98,Launce,Feyer,
99,Shep,Loveday,sloveday2s@twitpic.com
...,...,...,...
948,Shell,Over,soverqd@icio.us
950,Manuel,Ferrarotti,mferrarottiqf@ovh.net
954,Terry,Sawforde,tsawfordeqj@mit.edu
977,Tam,Tsar,ttsarr6@smh.com.au


***Q10.*** [**Harder**] Calculate the mean `salary` of all `employees` who are members either of the 'Legal' or the 'Accounting' `department`s.

**Hint** - The `.isin()` method lets you do this in one line of code, without having to manually combine means for the two `department`s

In [13]:
employees[employees.department.isin(['Legal', 'Accounting'])].salary.mean()

58213.132530120485
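
To see why manually combining the two `department` means is awkward, note that a plain average of the two means is generally not the pooled mean unless the `department`s are the same size. A small sketch with made-up numbers (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical salaries: three Legal employees, one in Accounting
toy = pd.DataFrame({
    'department': ['Legal', 'Legal', 'Legal', 'Accounting', 'Sales'],
    'salary': [30000.0, 40000.0, 50000.0, 80000.0, 20000.0],
})

in_either = toy['department'].isin(['Legal', 'Accounting'])

# pooled mean over all four matching employees
pooled_mean = toy.loc[in_either, 'salary'].mean()

# unweighted mean of the two department means
dept_means = toy.loc[in_either].groupby('department')['salary'].mean()
mean_of_means = dept_means.mean()

print(pooled_mean, mean_of_means)
```

Here the pooled mean is 50000.0 while the mean of means is 60000.0, because the single 'Accounting' salary gets as much weight as the three 'Legal' salaries together.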

***Q11.*** [**Harder**] How many pension enrolled `employees` are based outside of France, Austria or Ireland?

**Hint** - Remember `~` is the negation operator in `pandas`

In [14]:
employees.loc[
    (employees.pension_enrol == 'Yes') &
    (~employees.country.isin(['France', 'Austria', 'Ireland'])),
    'id'
].count()

# or 
employees.loc[
    (employees.pension_enrol == 'Yes') &
    (~employees.country.isin(['France', 'Austria', 'Ireland']))
].shape[0]

473
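
It can help readability to name each condition before combining them. Note also that a missing `pension_enrol` compares as `False` against 'Yes', so those rows are excluded. A sketch on made-up data (the `toy` frame and the variable names are hypothetical):

```python
import pandas as pd

# Hypothetical frame, including one missing pension_enrol value
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'pension_enrol': ['Yes', 'Yes', 'No', None],
    'country': ['France', 'Japan', 'Japan', 'Japan'],
})

enrolled = toy['pension_enrol'] == 'Yes'
excluded_country = toy['country'].isin(['France', 'Austria', 'Ireland'])

# & combines the masks row by row, ~ negates a mask
num_matching = (enrolled & ~excluded_country).sum()

print(num_matching)
```

Only the employee with `id` 2 is both enrolled and outside the three listed countries.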

***Q12.*** Add a new column `effective_salary` to `employees` containing `salary` multiplied by `fte_hours`.

In [15]:
employees.loc[:, 'effective_salary'] = employees.salary * employees.fte_hours
employees

Unnamed: 0,id,first_name,last_name,email,department,team_id,grade,country,fte_hours,pension_enrol,salary,pay_detail_id,start_date,effective_salary
0,1,Ibbie,Roscrigg,iroscrigg0@google.fr,Legal,9,0.0,Nigeria,0.25,Yes,97667.0,1,2014-12-25,24416.75
1,2,Sylas,Smallcomb,,Training,3,0.0,Macedonia,0.75,No,48556.0,2,1991-08-01,36417.00
2,4,Osmund,Kittel,okittel3@bloomberg.com,Legal,10,0.0,United Kingdom,1.00,Yes,51200.0,4,2007-09-06,51200.00
3,5,Feodora,Dumingos,fdumingos4@bandcamp.com,Engineering,2,0.0,Indonesia,0.50,Yes,60460.0,5,1990-03-28,30230.00
4,6,Peter,de Vaen,pdevaen5@mail.ru,Product Management,2,1.0,Greece,0.50,No,32060.0,6,1992-06-30,16030.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,997,Carlina,Pirot,cpirotro@archive.org,Business Development,2,0.0,China,1.00,No,63189.0,997,2018-08-30,63189.00
996,998,Wileen,Skones,wskonesrp@hhs.gov,Accounting,4,0.0,Uganda,0.25,Yes,81734.0,998,2000-11-26,20433.50
997,999,Willy,Dulake,wdulakerq@webeden.co.uk,Business Development,9,0.0,Colombia,0.25,No,21590.0,999,2018-08-01,5397.50
998,1000,Maribelle,Rotge,mrotgerr@google.co.uk,Research and Development,5,0.0,Brazil,0.50,Yes,62531.0,1000,2019-03-09,31265.50


***Q13.*** Obtain the details of all those `employees` whose `last_name` starts with 'R'.

**Hint** - Perhaps one of the `StringMethods` can be used: you can access these via the `.str` accessor 

In [16]:
employees[employees.last_name.str.startswith('R')]

Unnamed: 0,id,first_name,last_name,email,department,team_id,grade,country,fte_hours,pension_enrol,salary,pay_detail_id,start_date,effective_salary
0,1,Ibbie,Roscrigg,iroscrigg0@google.fr,Legal,9,0.0,Nigeria,0.25,Yes,97667.0,1,2014-12-25,24416.75
20,22,Carie,Rimbault,crimbaultl@dyndns.org,Human Resources,4,0.0,Philippines,0.25,Yes,95805.0,22,2017-03-15,23951.25
70,72,Leopold,Rowlands,lrowlands1z@over-blog.com,Research and Development,5,0.0,China,0.5,No,88922.0,72,1997-04-28,44461.0
83,85,Iorgo,Rickaby,irickaby2c@auda.org.au,Legal,5,0.0,China,0.5,No,37548.0,85,2018-02-18,18774.0
93,95,Elissa,Rawsthorne,erawsthorne2m@geocities.com,Sales,7,0.0,China,0.5,Yes,95166.0,95,2013-07-03,47583.0
110,112,Spencer,Rollinshaw,srollinshaw33@paginegialle.it,Engineering,4,0.0,Brazil,0.75,Yes,83727.0,112,2008-12-20,62795.25
153,155,Angie,Rohlf,arohlf4a@cafepress.com,Research and Development,3,1.0,Indonesia,1.0,No,96722.0,155,1998-01-10,96722.0
154,156,Rodd,Rabl,rrabl4b@elpais.com,Marketing,1,1.0,Brazil,0.5,Yes,52506.0,156,1991-11-27,26253.0
158,160,Georgi,Rosenthal,grosenthal4f@dot.gov,Support,1,0.0,Russia,1.0,No,71750.0,160,1992-10-28,71750.0
196,198,Marv,Robjents,mrobjents5h@eventbrite.com,Sales,9,0.0,Indonesia,0.25,Yes,49588.0,198,2013-11-25,12397.0


***Q14.*** [**Harder**] Are there any `employees` whose `last_name` does not start with a capital letter?

**Hints** 

* Assume there are no missing values in `last_name`
* The `.str.startswith()` method won't accept regex, unfortunately
* Regex **anchors** may help here: `^` means 'the position just before the start of a string', and `$` means 'the position just after the end of a string'
* Remember `~` is the negation operator in `pandas`

In [17]:
# Don't need a capture group with .str.contains() 
employees.loc[~employees.last_name.str.contains(r'^[A-Z]'), 'id'].count()

# or
employees.loc[~employees.last_name.str.contains(r'^[A-Z]')].shape[0]

5
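
As an alternative to writing the `^` anchor yourself, `.str.match()` only matches at the start of each string. A minimal sketch on made-up surnames (the `toy` `Series` is hypothetical):

```python
import pandas as pd

# Hypothetical surnames, two of which start with a lowercase letter
toy = pd.Series(['Roscrigg', 'de Vaen', 'McArthur', 'van der Berg'],
                name='last_name')

# .str.match() anchors the pattern at the start of each string
starts_upper = toy.str.match(r'[A-Z]')

num_not_capitalised = (~starts_upper).sum()

print(toy[~starts_upper])
```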

***Q15.*** Create a new column `start_month` in `employees`, containing the month in which an employee started with the corporation. Make `start_month` a string rather than a number, e.g. 'June' rather than '6'.

**Hint** - Remember you can access `DatetimeProperties` using the `.dt` accessor

In [18]:
employees.loc[:, 'start_month'] = employees.start_date.dt.month_name()
employees.head()

Unnamed: 0,id,first_name,last_name,email,department,team_id,grade,country,fte_hours,pension_enrol,salary,pay_detail_id,start_date,effective_salary,start_month
0,1,Ibbie,Roscrigg,iroscrigg0@google.fr,Legal,9,0.0,Nigeria,0.25,Yes,97667.0,1,2014-12-25,24416.75,December
1,2,Sylas,Smallcomb,,Training,3,0.0,Macedonia,0.75,No,48556.0,2,1991-08-01,36417.0,August
2,4,Osmund,Kittel,okittel3@bloomberg.com,Legal,10,0.0,United Kingdom,1.0,Yes,51200.0,4,2007-09-06,51200.0,September
3,5,Feodora,Dumingos,fdumingos4@bandcamp.com,Engineering,2,0.0,Indonesia,0.5,Yes,60460.0,5,1990-03-28,30230.0,March
4,6,Peter,de Vaen,pdevaen5@mail.ru,Product Management,2,1.0,Greece,0.5,No,32060.0,6,1992-06-30,16030.0,June


***Q16.*** Obtain a table showing the number of `employees` in each `department`.

**Hint** - Think **split-apply-combine** 

In [19]:
employees.groupby('department', as_index=False).id.count()

Unnamed: 0,department,id
0,Accounting,72
1,Business Development,77
2,Engineering,87
3,Human Resources,90
4,Legal,102
5,Marketing,84
6,Product Management,79
7,Research and Development,94
8,Sales,80
9,Services,73


***Q17.*** Obtain a count of the number of `employees` enrolled and not enrolled in the pension scheme. Ignore missing values in `pension_enrol` for now.

In [20]:
employees.groupby('pension_enrol', as_index=False).id.count()

Unnamed: 0,pension_enrol,id
0,No,470
1,Yes,488


***Q18.*** [**Harder**] Repeat your analysis from Question 17 above, but this time fill any missing values in `pension_enrol` with the string 'Missing'.

**Hint** - The `.fillna()` method will help here

In [21]:
employees.fillna({'pension_enrol': 'Missing'}).groupby('pension_enrol', as_index=False).id.count()

Unnamed: 0,pension_enrol,id
0,Missing,42
1,No,470
2,Yes,488
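
For a single column, `.value_counts()` offers a shorter route to the same counts. A sketch on made-up data (the `toy` `Series` is hypothetical):

```python
import pandas as pd

# Hypothetical pension_enrol column with two missing values
toy = pd.Series(['Yes', 'No', None, 'Yes', None, 'Yes'],
                name='pension_enrol')

# recode missing values first, then count each distinct value
counts = toy.fillna('Missing').value_counts()

# or keep the missing values as their own (unlabelled) category
counts_with_nan = toy.value_counts(dropna=False)

print(counts)
```

Note that `.value_counts()` sorts by descending count by default, rather than alphabetically as `.groupby()` does.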


***Q19.*** Obtain a count by `department` of the number of `employees` enrolled and not enrolled in the pension scheme. Include missing values or otherwise fill them with the string 'Missing' as in the question above. [**Harder**] - Change the header of the column containing count values to 'num_employees'.

**Hints** 

* You need to `.groupby()` on multiple columns
* Is there an argument to `.groupby()` that lets you keep missing values?
* The `.agg()` method can take arguments that let you specify the output column name for an aggregator

In [22]:
# basic version
employees\
    .groupby(['department', 'pension_enrol'], dropna=False, as_index=False)\
    .id.count()

# version recoding 'Missing' and changing aggregator column header
employees\
    .fillna({'department': 'Missing', 'pension_enrol': 'Missing'})\
    .groupby(['department', 'pension_enrol'], as_index=False)\
    .agg(num_employees = ('id', 'count'))

Unnamed: 0,department,pension_enrol,num_employees
0,Accounting,Missing,3
1,Accounting,No,37
2,Accounting,Yes,32
3,Business Development,Missing,1
4,Business Development,No,43
5,Business Development,Yes,33
6,Engineering,Missing,5
7,Engineering,No,37
8,Engineering,Yes,45
9,Human Resources,Missing,5
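
The effect of `dropna=False` is easiest to see side by side with the default. A small sketch on made-up data (the `toy` frame is hypothetical, and this assumes a `pandas` version recent enough to support the `dropna` argument):

```python
import pandas as pd

# Hypothetical frame with two missing pension_enrol values
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'department': ['Legal', 'Legal', 'Legal', 'Sales', 'Sales'],
    'pension_enrol': ['Yes', None, 'Yes', 'No', None],
})

group_cols = ['department', 'pension_enrol']

# default: rows with a missing group key are silently dropped
default_counts = toy.groupby(group_cols, as_index=False)\
    .agg(num_employees=('id', 'count'))

# dropna=False keeps missing keys as groups of their own
kept_counts = toy.groupby(group_cols, dropna=False, as_index=False)\
    .agg(num_employees=('id', 'count'))

print(default_counts)
print(kept_counts)
```

With the default, the counts add up to 3 rather than 5, so it is worth checking totals against `.shape[0]` whenever a grouping column contains missing values.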


***Q20.*** Obtain a table showing all `employees` details together with a column `team_name` containing the name of their team. 

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**

* This will require joining `employees` to `teams`
* You can `.drop()` unwanted columns from `teams` after the join
* The `.rename()` method lets you change column names

In [23]:
employees.merge(teams, how='left', left_on='team_id', right_on='id')\
    .drop(['id_y', 'charge_cost', 'team_id'], axis='columns')\
    .rename(columns={'id_x': 'id', 'name': 'team_name'})

Unnamed: 0,id,first_name,last_name,email,department,grade,country,fte_hours,pension_enrol,salary,pay_detail_id,start_date,effective_salary,start_month,team_name
0,1,Ibbie,Roscrigg,iroscrigg0@google.fr,Legal,0.0,Nigeria,0.25,Yes,97667.0,1,2014-12-25,24416.75,December,Data Escalate
1,2,Sylas,Smallcomb,,Training,0.0,Macedonia,0.75,No,48556.0,2,1991-08-01,36417.00,August,Risk Team 1
2,4,Osmund,Kittel,okittel3@bloomberg.com,Legal,0.0,United Kingdom,1.00,Yes,51200.0,4,2007-09-06,51200.00,September,Corporate
3,5,Feodora,Dumingos,fdumingos4@bandcamp.com,Engineering,0.0,Indonesia,0.50,Yes,60460.0,5,1990-03-28,30230.00,March,Audit Team 2
4,6,Peter,de Vaen,pdevaen5@mail.ru,Product Management,1.0,Greece,0.50,No,32060.0,6,1992-06-30,16030.00,June,Audit Team 2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,997,Carlina,Pirot,cpirotro@archive.org,Business Development,0.0,China,1.00,No,63189.0,997,2018-08-30,63189.00,August,Audit Team 2
996,998,Wileen,Skones,wskonesrp@hhs.gov,Accounting,0.0,Uganda,0.25,Yes,81734.0,998,2000-11-26,20433.50,November,Risk Team 2
997,999,Willy,Dulake,wdulakerq@webeden.co.uk,Business Development,0.0,Colombia,0.25,No,21590.0,999,2018-08-01,5397.50,August,Data Escalate
998,1000,Maribelle,Rotge,mrotgerr@google.co.uk,Research and Development,0.0,Brazil,0.50,Yes,62531.0,1000,2019-03-09,31265.50,March,Audit Escalate


***Q21.*** [**Harder**] Obtain a table showing `team_name` together with a count of the `num_employees` in each team.

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**

* You will need to join `employees` to `teams`
* The `.rename()` method lets you rename a column
* The `.agg()` method can take arguments that let you specify the output column name for an aggregator

In [24]:
employees.merge(teams, how='right', left_on='team_id', right_on='id')\
    .rename(columns={'name': 'team_name'})\
    .groupby('team_name', as_index=False)\
    .agg(num_employees=('id_x', 'count'))

Unnamed: 0,team_name,num_employees
0,Audit Escalate,99
1,Audit Team 1,113
2,Audit Team 2,107
3,Corporate,92
4,Data Escalate,99
5,Data Team 1,99
6,Data Team 2,96
7,Risk Escalate,105
8,Risk Team 1,85
9,Risk Team 2,105
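
The right join keeps every team, and counting `id_x` (the employee `id`) rather than rows means a team without employees would show 0. A sketch on made-up tables (`employees_toy` and `teams_toy` are hypothetical):

```python
import pandas as pd

# Hypothetical tables: team 30 ('Data') has no employees
employees_toy = pd.DataFrame({'id': [1, 2, 3], 'team_id': [10, 10, 20]})
teams_toy = pd.DataFrame({'id': [10, 20, 30],
                          'name': ['Audit', 'Risk', 'Data']})

team_counts = employees_toy\
    .merge(teams_toy, how='right', left_on='team_id', right_on='id')\
    .rename(columns={'name': 'team_name'})\
    .groupby('team_name', as_index=False)\
    .agg(num_employees=('id_x', 'count'))  # count skips the missing id_x

print(team_counts)
```

With `how='inner'` the 'Data' team would disappear from the result altogether.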


***Q22.*** [**Harder**] Obtain a table showing the `id`, `first_name`, `last_name` and `department` of any `employees` lacking both a `local_account_no` and `local_sort_code` in their `pay_details`.

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**

* Filter `pay_details` down to those rows lacking both a `local_account_no` and `local_sort_code`, and then join that to `employees`
* The `.all()` method applied over columns can help here

In [25]:
employees\
    .merge(
        pay_details.loc[pay_details[['local_account_no', 'local_sort_code']].isna().all(axis='columns')],
        how='inner',
        left_on='pay_detail_id',
        right_on='id'
    )\
    .rename(columns={'id_x': 'id'})\
    .loc[:, ['id', 'first_name', 'last_name', 'department']]

Unnamed: 0,id,first_name,last_name,department
0,48,Barney,Yakovitch,Product Management
1,57,Rheba,Booton,Engineering
2,63,Reynard,Jonson,Legal
3,94,Darsey,Cescon,Accounting
4,147,Hubie,Butter,Engineering
5,150,Elsi,Norquay,Training
6,187,Brande,Crump,Marketing
7,208,Abigael,McArthur,Business Development
8,245,Stillman,Brislen,Human Resources
9,267,Bernarr,Nolan,Business Development
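
The filter inside the `.merge()` call is worth looking at on its own. `.isna()` gives a `DataFrame` of booleans, and `.all(axis='columns')` is `True` only for rows where every selected column is missing. A sketch on made-up data (the `pay_toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical pay details: only id 3 lacks both values
pay_toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'local_account_no': ['123', None, None, '456'],
    'local_sort_code': ['11-22-33', '44-55-66', None, None],
})

is_missing = pay_toy[['local_account_no', 'local_sort_code']].isna()

# True where both columns are missing
lacks_both = is_missing.all(axis='columns')

# for comparison: True where at least one column is missing
lacks_either = is_missing.any(axis='columns')

print(pay_toy.loc[lacks_both])
```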


# Extension question solutions

***Q1.*** Investigate the use of the `.where()` function in the `numpy` package to add a column `hours_class` to `employees` containing the value 'low' when `fte_hours` is 0.5 or less and 'high' otherwise.

**Hint** - Import `numpy` as `np`, and `.where()` will then be available as `np.where()`

In [26]:
import numpy as np

employees.loc[:, 'hours_class'] = np.where(employees.fte_hours <= 0.5, 'low', 'high')
# or
employees = employees.assign(hours_class = np.where(employees.fte_hours <= 0.5, 'low', 'high'))
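
`np.where(condition, x, y)` picks `x` where the condition is `True` and `y` elsewhere, element by element. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical fte_hours values either side of the 0.5 threshold
toy = pd.DataFrame({'fte_hours': [0.25, 0.5, 0.75, 1.0]})

# 'low' where the condition holds, 'high' otherwise
toy['hours_class'] = np.where(toy['fte_hours'] <= 0.5, 'low', 'high')

print(toy)
```

Note that 0.5 itself is classed as 'low' because the comparison is `<=`.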

***Q2.*** Are there any `employees` with invalid `email` addresses?

**Hint** - Look for non-null `email` addresses that don't match a reasonable regex pattern for a valid email address

In [27]:
employees.loc[
    ~employees.email.str.contains(r'[\w\.]+@[\w\.]+', na=False) & 
    employees.email.notnull(),
    'id'
].count()

# or 
employees.loc[
    ~employees.email.str.contains(r'[\w\.]+@[\w\.]+', na=False) & 
    employees.email.notnull()
].shape[0]

0
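
Bear in mind that `.str.contains()` only requires the pattern to match somewhere in the string, so this is quite a lenient check. A stricter option is `.str.fullmatch()`, which requires the whole string to match. A sketch on made-up addresses (the `toy` `Series` is hypothetical, and the pattern is still only a rough approximation of a valid email address):

```python
import pandas as pd

# Hypothetical email values, including a missing one
toy = pd.Series(['ok@example.com', 'no-at-sign.example.com', None,
                 'trailing@example.com junk'])

pattern = r'[\w.]+@[\w.]+'
non_null = toy.dropna()

# lenient: the pattern may match anywhere in the string
loose_invalid = non_null[~non_null.str.contains(pattern)]

# strict: the whole string must match the pattern
strict_invalid = non_null[~non_null.str.fullmatch(pattern)]

print(loose_invalid)
print(strict_invalid)
```

The address with trailing text passes the lenient check but fails the strict one.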

***Q3.*** How many of the `employees` serve on one or more `committees`?

**Hints**

* This requires only the `employees_committees` `DataFrame`
* Think about distinct `employees`

In [28]:
employees_committees.employee_id.nunique()

22

***Q4.*** Get the full employee details [including committee name(s)] of any committee members based in Ukraine.

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**

* This is a many-to-many join, so three `DataFrame`s will be involved: `employees`, `committees` and `employees_committees`
* The `.query()` method lets you do all this in one line of code

In [29]:
employees.merge(employees_committees, how='right', left_on='id', right_on='employee_id')\
    .merge(committees, how='left', left_on='committee_id', right_on='id')\
    .drop(columns=['employee_id', 'id_y', 'committee_id', 'id'])\
    .rename(columns={'id_x': 'id', 'name': 'committee_name'})\
    .query("country == 'Ukraine'")

Unnamed: 0,id,first_name,last_name,email,department,team_id,grade,country,fte_hours,pension_enrol,salary,pay_detail_id,start_date,effective_salary,start_month,hours_class,committee_name
2,434,Sax,Zmitrichenko,szmitrichenkoc1@slate.com,Legal,7,0.0,Ukraine,0.5,Yes,46928.0,434,2002-08-20,23464.0,August,low,Health and Safety
3,434,Sax,Zmitrichenko,szmitrichenkoc1@slate.com,Legal,7,0.0,Ukraine,0.5,Yes,46928.0,434,2002-08-20,23464.0,August,low,Social
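
The shape of a many-to-many join is easier to follow on tiny tables. In this sketch all three frames are hypothetical, and the linking table `link_toy` has no `id` column of its own, which may differ from the course data:

```python
import pandas as pd

# Hypothetical tables: employee 1 sits on two committees, employee 3 on none
employees_toy = pd.DataFrame({
    'id': [1, 2, 3],
    'last_name': ['Avery', 'Brook', 'Cole'],
    'country': ['Ukraine', 'Japan', 'Ukraine'],
})
committees_toy = pd.DataFrame({'id': [100, 200],
                               'name': ['Social', 'Health and Safety']})
link_toy = pd.DataFrame({'employee_id': [1, 1, 2],
                         'committee_id': [100, 200, 100]})

members = employees_toy\
    .merge(link_toy, how='inner', left_on='id', right_on='employee_id')\
    .merge(committees_toy, how='inner', left_on='committee_id',
           right_on='id', suffixes=('', '_committee'))\
    .rename(columns={'name': 'committee_name'})\
    .query("country == 'Ukraine'")

print(members[['id', 'last_name', 'committee_name']])
```

Employee 1 appears once per committee, while employee 3 is based in Ukraine but drops out because they serve on no committee.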


***Q5.*** Obtain a table with column `start_month` containing the names of months and column `num_started` showing the number of `employees` who started in that month (in any year). The table should contain 12 rows, one for each month, and be ordered 'January', 'February', 'March' and so forth, in ascending calendar order.

**Hints**

* Extracting both the `.month_name()` and `.month` from `start_date` might help here
* We used `.assign()` to create the two new date feature columns before any subsequent operations

In [30]:
employees.assign(
        start_month=employees.start_date.dt.month_name(),
        start_month_num=employees.start_date.dt.month)\
    .groupby(['start_month', 'start_month_num'], as_index=False).agg(num_started=('id', 'count'))\
    .sort_values('start_month_num')\
    .drop(columns=['start_month_num'])\
    .reset_index(drop=True)

Unnamed: 0,start_month,num_started
0,January,83
1,February,69
2,March,89
3,April,62
4,May,77
5,June,75
6,July,93
7,August,87
8,September,68
9,October,83
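
One thing to watch with a `.groupby()` approach is that a month in which nobody started would be absent from the result, giving fewer than 12 rows. An alternative sketch that guarantees all 12 months is to count and then `.reindex()` against a full list of month names. The `toy` data and the variable names below are hypothetical:

```python
import pandas as pd

# Hypothetical start dates, including one missing value
toy = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'start_date': pd.to_datetime(['2019-03-09', '1990-01-25', '2014-12-25',
                                  '1992-01-30', None]),
})

# month names in calendar order, built from twelve month-start dates
month_order = list(
    pd.date_range('2000-01-01', periods=12, freq='MS').month_name()
)

# count starters per month name (missing dates are skipped), then
# reindex so every month appears, in calendar order, with 0 if absent
num_started = toy['start_date'].dt.month_name()\
    .value_counts()\
    .reindex(month_order, fill_value=0)

print(num_started)
```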
