<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Core-questions" data-toc-modified-id="Core-questions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Core questions</a></span></li><li><span><a href="#Extension-questions" data-toc-modified-id="Extension-questions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Extension questions</a></span></li></ul></div>

# Setup

Let's load in the datasets we need for the practice session. They correspond to cuts of the tables from the `omni_pool` database we used earlier in the course. The idea in this session is to try to do in `pandas` some of the operations we coded in `SQL` (with a bit of variation to keep things fresh)!

We would recommend you try all of the Core Questions below, and then move on to the Extension Questions if time permits.

**Note** - some of the questions should be attempted only after you have covered the material on joining, melting and pivoting `DataFrame`s on day 3. We have marked these questions below.

```python
import pandas as pd

employees = pd.read_csv('data/omni_employees.csv', parse_dates=['start_date'])
pay_details = pd.read_csv('data/omni_pay_details.csv')
teams = pd.read_csv('data/omni_teams.csv')
committees = pd.read_csv('data/omni_committees.csv')
employees_committees = pd.read_csv('data/omni_employees_committees.csv')
```

# Core questions

***Q1.*** Perform some basic data exploration of the `employees` `DataFrame`. Methods and attributes to consider using here include `.info()`, `.describe()` and `.shape`. In particular, answer the following:

- What data type is each column?
- How many unique values are there in `department` and `country`?
- What are the minimum and maximum of `salary`?
- How many rows are there in the `DataFrame`?
  
***  
  
***Q2.*** Find all the details of `employees` who work in the 'Legal' `department`.

***
  
***Q3.*** How many `employees` are based in Japan?  

**Hint** - Running `.count()` on the `id` column might help here, or you could use the `.shape` attribute 

***

***Q4.*** [**Harder**] In the question above, we suggested you `.count()` the `id` column (i.e. treating it like a `SQL` primary key). But we haven't yet shown that `id` satisfies the associated requirements. Confirm that the number of unique values in `id` equals the number of rows in `employees`.

**Hints**

* The `.shape` attribute of the `employees` `DataFrame` and the `.nunique()` method can be combined in one line of code to show this
* Remember that `.shape` returns a `tuple`: which element of the `tuple` do you need?
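
The idea behind these hints can be sketched on a small made-up table. This is a minimal illustration only: the `toy` `DataFrame` and its values are hypothetical and are not the course data.

```python
import pandas as pd

# Toy data (hypothetical), not the course dataset
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["Japan", "France", "Japan", "Peru"],
})

# .shape is a (rows, columns) tuple, so element 0 is the row count
n_rows = toy.shape[0]

# .nunique() counts distinct non-missing values in the column
n_unique_ids = toy["id"].nunique()

# True only when every row has its own distinct id
ids_are_unique = n_unique_ids == n_rows
print(ids_are_unique)
```

Note that `.nunique()` ignores missing values by default, so a column with a missing `id` would fail this check too.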

***

***Q5.*** How many `employees` have a missing `email` address?

***

***Q6.*** Calculate the mean `salary` of `employees` in the 'Legal' `department`.

***

***Q7.*** Obtain the `first_name`, `last_name` and `salary` of all `employees` sorted in descending order of `salary`.

***

***Q8.*** Obtain the `first_name`, `last_name` and `country` of `employees` ordered first alphabetically by `country` and then alphabetically by `last_name`.

***

***Q9.*** [**Harder**] Obtain the `first_name`, `last_name` and `email` address of all `employees` in the 'Engineering' `department` who work 0.5 `fte_hours` or greater.

***

***Q10.*** [**Harder**] Calculate the mean `salary` of all `employees` who are members either of the 'Legal' or the 'Accounting' `department`s.

**Hint** - The `.isin()` method lets you do this in one line of code, without having to manually combine means for the two `department`s
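
As a rough sketch of how `.isin()` builds a filter, here is an example on made-up data. The `toy` table and its values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical), not the course dataset
toy = pd.DataFrame({
    "department": ["Legal", "Accounting", "Sales", "Legal"],
    "salary": [40000, 50000, 30000, 60000],
})

# Boolean mask: True where department is one of the listed values
in_either = toy["department"].isin(["Legal", "Accounting"])

# Mean over the matching rows only
mean_salary = toy.loc[in_either, "salary"].mean()
print(mean_salary)
```

The mask plays the same role as `WHERE department IN (...)` in `SQL`.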
   
***
    
***Q11.*** [**Harder**] How many pension-enrolled `employees` are based outside France, Austria and Ireland?

**Hint** - Remember `~` is the negation operator in `pandas`
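
A minimal sketch of negating a mask and combining it with a second condition, using made-up data. It assumes `pension_enrol` is a clean boolean column with no missing values, which may not hold for the real table.

```python
import pandas as pd

# Toy data (hypothetical); pension_enrol is assumed to be purely boolean here
toy = pd.DataFrame({
    "country": ["France", "Japan", "Ireland", "Peru", "Japan"],
    "pension_enrol": [True, True, False, False, True],
})

# Mask of rows in the countries we want to exclude
excluded = toy["country"].isin(["France", "Ireland"])

# ~ flips the mask; & combines it element-wise with the enrolment flag
n_matching = toy[~excluded & toy["pension_enrol"]].shape[0]
print(n_matching)
```

If the enrolment column contains missing values, you would need to decide how to treat them (for example with `.fillna()`) before using it as a mask.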

***
    
***Q12.*** Add a new column `effective_salary` to `employees` containing `salary` multiplied by `fte_hours`.

***

    
***Q13.*** Obtain a table showing the number of `employees` in each `department`.

**Hint** - Think **split-apply-combine** 

***

***Q14.*** Obtain a count of the number of `employees` enrolled and not enrolled in the pension scheme. Ignore missing values in `pension_enrol` for now.

***

***Q15.*** [**Harder**] Repeat your analysis from Question 14 above, but this time fill any missing values in `pension_enrol` with the string 'Missing'.

**Hint** - the `.fillna()` method will help here
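
As an illustration of the effect of `.fillna()` before counting, here is a sketch on a made-up column. The values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): None stands in for a missing enrolment flag
toy = pd.DataFrame({"pension_enrol": [True, False, None, True, None]})

# Replace missing values with a label, then count each distinct value
filled = toy["pension_enrol"].fillna("Missing")
counts = filled.value_counts()
print(counts)
```

Without the `.fillna()` step, `.value_counts()` would drop the missing rows by default.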

***

***Q16.*** Obtain a count by `department` of the number of `employees` enrolled and not enrolled in the pension scheme. Include missing values or otherwise fill them with the string 'Missing' as in the question above. [**Harder**] - Change the header of the column containing count values to 'num_employees'.

**Hints** 

* You need to `.groupby()` on multiple columns
* Is there an argument to `.groupby()` that lets you keep missing values?
* The `.agg()` method can take arguments that let you specify the output column name for an aggregator
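
The three hints above can be combined roughly as follows. This is a sketch on made-up data: the `toy` table is hypothetical, and `pension_enrol` uses 'Yes'/'No' strings purely for illustration.

```python
import pandas as pd

# Toy data (hypothetical), not the course dataset
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "department": ["Legal", "Legal", "Sales", "Sales", "Sales"],
    "pension_enrol": ["Yes", None, "No", "No", None],
})

counts = (
    toy
    # dropna=False keeps groups whose key is missing
    .groupby(["department", "pension_enrol"], dropna=False)
    # named aggregation: output_column=(input_column, aggregator)
    .agg(num_employees=("id", "count"))
    .reset_index()
)
print(counts)
```

The keyword on the left of `=` inside `.agg()` becomes the header of the output column.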

***
    
***Q17.*** Obtain a table showing all `employees` details together with a column `team_name` containing the name of their team. 

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**

* This will require joining `employees` to `teams`
* You can `.drop()` unwanted columns from `teams` after the join
* The `.rename()` method lets you change column names
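
A sketch of the join, drop and rename pattern on two made-up tables. The column names `team_id`, `name` and `charge_cost` are assumptions for illustration and may not match the real tables.

```python
import pandas as pd

# Toy tables (hypothetical column names and values)
toy_employees = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Ana", "Bo", "Cy"],
    "team_id": [10, 20, 10],
})
toy_teams = pd.DataFrame({
    "id": [10, 20],
    "name": ["Audit", "Data"],
    "charge_cost": [30, 50],
})

joined = (
    toy_employees
    # left join keeps every employee; the suffix disambiguates the two id columns
    .merge(toy_teams, how="left", left_on="team_id", right_on="id",
           suffixes=("", "_team"))
    # discard the columns from the teams table that are not needed
    .drop(columns=["id_team", "charge_cost"])
    .rename(columns={"name": "team_name"})
)
print(joined)
```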

***
    
***Q18.*** [**Harder**] Obtain a table showing `team_name` together with a count of the `num_employees` in each team.

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**
    
* You will need to join `employees` to `teams`
* The `.rename()` method lets you rename a column
* The `.agg()` method can take arguments that let you specify the output column name for an aggregator    

***

***Q19.*** [**Harder**] Obtain a table showing the `id`, `first_name`, `last_name` and `department` of any `employees` lacking both a `local_account_no` and `local_sort_code` in their `pay_details`.

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**
    
* Filter `pay_details` down to those rows lacking both a `local_account_no` and `local_sort_code`, and then join that to `employees`
* The `.all()` method applied over columns can help here
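
The "`.all()` over columns" idea can be sketched like this on a made-up table. The values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical), not the course dataset
toy_pay = pd.DataFrame({
    "id": [1, 2, 3],
    "local_account_no": ["111", None, None],
    "local_sort_code": ["22-33", "44-55", None],
})

# .isna() gives a True/False table; .all(axis=1) is True only for rows
# where every selected column is missing
missing_both = toy_pay[["local_account_no", "local_sort_code"]].isna().all(axis=1)

ids_missing_both = toy_pay.loc[missing_both, "id"].tolist()
print(ids_missing_both)
```

Swapping `.all(axis=1)` for `.any(axis=1)` would instead flag rows missing at least one of the two values.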



# Extension questions

***Q1.*** Investigate the use of the `.where()` function in the `numpy` package to add a column `hours_class` to `employees` containing the value 'low' when `fte_hours` is 0.5 or less and 'high' otherwise.

**Hint** - Import `numpy` as `np`, and `.where()` will then be available as `np.where()`
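
For orientation, `np.where(condition, value_if_true, value_if_false)` works element by element. A minimal sketch on made-up values (the `toy` table is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), not the course dataset
toy = pd.DataFrame({"fte_hours": [0.25, 0.5, 0.75, 1.0]})

# For each row: 'low' where the condition holds, 'high' otherwise
toy["hours_class"] = np.where(toy["fte_hours"] <= 0.5, "low", "high")
print(toy)
```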

***
    
***Q2.*** Are there any `employees` with invalid `email` addresses?

**Hint** - Count non-null `email` addresses that don't match a reasonable regex pattern for a valid email address
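
One possible shape for such a check is sketched below on made-up addresses. The regex is a deliberately simple assumption (something, then `@`, then something containing a dot), not a full validator.

```python
import pandas as pd

# Toy data (hypothetical), not the course dataset
toy = pd.DataFrame({"email": ["a@example.com", "not-an-email", None, "b@site"]})

# Simplistic pattern (an assumption): text, '@', text, '.', text, with no spaces
pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

# Drop missing values first so they are not counted as invalid
non_null = toy["email"].dropna()
n_invalid = int((~non_null.str.match(pattern)).sum())
print(n_invalid)
```

What counts as a "reasonable" pattern is a judgment call, and stricter or looser patterns will give different counts.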

***

***Q3.*** How many of the `employees` serve on one or more `committees`?

**Hints**

* This requires only the `employees_committees` `DataFrame`
* Think about distinct `employees`
    
***
    
***Q4.*** Get the full employee details (including committee name) of any committee members based in Ukraine.

**Attempt this question only after the lesson on joining, melting and pivoting `DataFrame`s on day 3**

**Hints**
    
* This is a many-to-many join, so three `DataFrame`s will be involved: `employees`, `committees` and `employees_committees`
* The `.query()` method lets you filter the joined result, so the whole operation can be written as one chained expression
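
A sketch of a many-to-many join through a link table, followed by a `.query()` filter, on made-up tables. The column names `employee_id`, `committee_id` and `name` are assumptions for illustration.

```python
import pandas as pd

# Toy tables (hypothetical column names and values)
toy_employees = pd.DataFrame({
    "id": [1, 2, 3],
    "first_name": ["Ana", "Bo", "Cy"],
    "country": ["Ukraine", "Japan", "Ukraine"],
})
toy_committees = pd.DataFrame({"id": [100, 200], "name": ["Equality", "Social"]})
toy_links = pd.DataFrame({
    "employee_id": [1, 2, 1],
    "committee_id": [100, 100, 200],
})

result = (
    toy_employees
    # inner join to the link table keeps only employees on a committee
    .merge(toy_links, left_on="id", right_on="employee_id")
    # second join brings in the committee details
    .merge(toy_committees, left_on="committee_id", right_on="id",
           suffixes=("", "_committee"))
    .rename(columns={"name": "committee_name"})
    # filter the joined rows with a string expression
    .query("country == 'Ukraine'")
)
print(result)
```

An employee who sits on several committees appears once per committee in the result.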

***

***Q5.*** Obtain a table with column `start_month` containing the names of months and column `num_started` showing the number of `employees` who started in that month (in any year). The table should contain 12 rows, one for each month, and be ordered 'January', 'February', 'March' and so forth, in ascending calendar order.

**Hints**
    
* Extracting both the `.month_name()` and `.month` from `start_date` might help here
* One approach is to use `.assign()` to create the two new date feature columns before any subsequent operations
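
The reason for extracting both features is that month names sort alphabetically, while month numbers sort in calendar order. A sketch on made-up dates follows. The `toy` table is hypothetical and covers only three months, and the helper column `month_num` is an illustrative name.

```python
import pandas as pd

# Toy data (hypothetical), not the course dataset
toy = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2015-03-10", "2017-01-05", "2019-03-22", "2020-01-30", "2021-02-14"]
    )
})

by_month = (
    toy
    .assign(
        start_month=toy["start_date"].dt.month_name(),  # e.g. 'March'
        month_num=toy["start_date"].dt.month,           # e.g. 3
    )
    # group on the number as well so the rows can be ordered by calendar month
    .groupby(["month_num", "start_month"])
    .agg(num_started=("start_date", "count"))
    .reset_index()
    .sort_values("month_num")
    .drop(columns="month_num")
)
print(by_month)
```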