# More Pandas

***Note***: this notebook contains cells with ***a*** solution. Remember, there is not only one solution to a problem!  
You will recognise these cells as they start with **# %**.  
If you would like to see the solution, you will have to remove the **#** (which can be done by using **Ctrl** and **?**) and run the cell. This loads the solution code into the cell; if you want to run the solution code, you will have to run the cell again.

>Import pandas (with the most commonly used convention, seen in notebook 2).  
>Import datetime.  
>Import matplotlib (as shown in notebook 3.1).  
>Add the magic command to show the plots in the Jupyter notebook.

In [0]:
# %load ../solutions/05_01.py

In this notebook, we are going to use data provided by the UK government.  
https://data.gov.uk/dataset/a59198d9-2e24-4816-be1b-c3a1efa02dda/better-training-for-safer-food    
  
It is provided under the Open Government Licence that you can find here: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

>Load the data for the year 2014 and assign it to df_2014 (path: '../data/food_training/training_2014.csv')

In [0]:
# %load ../solutions/05_02.py

>Have a look at the first rows of df_2014

In [0]:
# %load ../solutions/05_03.py

You can see that the header is not in the right place.  
If you have a look at the documentation, you will see that the *header* parameter of the read_csv function defaults to *infer*. In that case, pandas infers the header from the first row (which has the index *0*).
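
As an illustration of the *header* parameter, here is a minimal sketch on made-up data. The file content, the column names and the row number `1` are assumptions for this toy example only; the right value for the training files may be different.

```python
import io

import pandas as pd

# Hypothetical file content: a title line sits above the real column names.
raw = io.StringIO(
    "Training report\n"
    "Course,Attendees\n"
    "Hygiene,12\n"
    "Audit,7\n"
)

# With the default header='infer', the title line (row 0) would become the header.
# header=1 tells pandas that the column names are on the second line instead.
toy = pd.read_csv(raw, header=1)
print(toy.columns.tolist())  # ['Course', 'Attendees']
```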

>Load the data for the year 2014 using the header parameter and assign it to df_2014 (it will be overwritten)

In [0]:
# %load ../solutions/05_04.py

>Have a look at the first rows of df_2014. Looks better, doesn't it?

In [0]:
# %load ../solutions/05_.py

>Load the data for the years 2015 and 2016 and assign them to df_2015 and df_2016 (don't forget the parameter!)

In [0]:
# %load ../solutions/05_06.py

## Concatenate
Pandas has a good documentation page about merge, join and concatenation: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
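
To give an idea of what concatenation does, here is a small sketch with two made-up dataframes (the values are hypothetical and only stand in for the yearly files).

```python
import pandas as pd

# Toy frames with the same columns (hypothetical values).
a = pd.DataFrame({"Course": ["Hygiene", "Audit"], "Attendees": [12, 7]})
b = pd.DataFrame({"Course": ["Labelling"], "Attendees": [20]})

# pd.concat stacks the frames on top of each other by default (axis=0).
stacked = pd.concat([a, b])
print(stacked.shape)           # (3, 2)
print(stacked.index.tolist())  # [0, 1, 0]
```

Note that each frame keeps its original row labels, so the index of the result contains duplicates.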

>Using the documentation, concatenate the 3 dataframes into one, named df

In [0]:
# %load ../solutions/05_07.py

>Check the shape of df

In [0]:
# %load ../solutions/05_08.py

>Look at the indices of df

In [0]:
# %load ../solutions/05_09.py

> Reset the index to have it start at 0 and continue with a step of 1 until the end.

In [0]:
# %load ../solutions/05_10.py
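
For reference, one possible way to renumber an index is sketched here on a made-up dataframe. The `drop=True` argument is a choice made for this example: it discards the old labels instead of keeping them as a new column.

```python
import pandas as pd

# Toy frame whose index has duplicates, as after a concatenation.
toy = pd.DataFrame({"Attendees": [12, 7, 20]}, index=[0, 1, 0])

# reset_index returns a new frame with a fresh 0, 1, 2, ... index.
renumbered = toy.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```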

## Dropping columns

>Show the information relative to df's columns

In [0]:
# %load ../solutions/05_11.py

We can see that the last two columns are completely empty, so we will drop them.

>Get the help for the drop method (if you prefer, you can search for the documentation on the internet)

In [0]:
# %load ../solutions/05_12.py

You can see that the default value for *axis* is *0*. In that case, the method looks for the labels among the rows (i.e. *axis 0*) of the dataframe.  
In our case, we want to drop columns (i.e. *axis 1*).
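
The difference between the two axes can be seen on a small made-up dataframe (the column names `a`, `b` and `c` are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# axis=0 (the default): the labels are looked up in the row index.
without_row = toy.drop(labels=[0], axis=0)

# axis=1: the labels are looked up among the column names.
without_cols = toy.drop(labels=["b", "c"], axis=1)

print(without_row.shape)              # (1, 3)
print(without_cols.columns.tolist())  # ['a']
```

In both cases drop returns a new dataframe; `toy` itself is left unchanged unless the result is assigned back.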

>Drop the last two columns of df.  
>labels: provide a list with the column names.  
>Don't forget the axis!

In [0]:
# %load ../solutions/05_13.py

In [0]:
df.head()

## Text Data

To [work with text data](https://pandas.pydata.org/pandas-docs/version/0.23.4/text.html), we will have to use the ***str*** attribute to access the string methods (e.g. df['col'].str.replace('-', ' ')).
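
Here is a short sketch of the ***str*** attribute on a made-up series (the values are hypothetical). Each method is applied to every element and returns a new series.

```python
import pandas as pd

s = pd.Series(["  Rome", "rome ", "Stratford-upon-Avon"])

# Replace a literal character in every element.
replaced = s.str.replace("-", " ", regex=False)

# Remove the spaces at the beginning and the end of every element.
stripped = s.str.strip()

# Calls can be chained: strip first, then convert to lower case.
lowered = s.str.strip().str.lower()

print(replaced[2])        # Stratford upon Avon
print(lowered.nunique())  # 2
```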

>Display the unique values of the Location column

In [0]:
# %load ../solutions/05_14.py

You can see that the cities and countries are separated by a semicolon (even if some are missing). We are going to split the *Location* column into two.
>Try splitting the ***Location*** column on the semicolon.

In [0]:
# %load ../solutions/05_15.py

This returns, for each row, a list with the items which were before and after the semicolon.  
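
The two forms of the split can be compared on a made-up series (hypothetical values, one of them without a semicolon).

```python
import pandas as pd

s = pd.Series(["Rome; IT", "Lisbon; Portugal", "Online"])

# Default: each element becomes a list of the pieces.
as_lists = s.str.split(";")

# expand=True: the pieces are spread over separate columns,
# and missing pieces are filled with a missing value.
as_columns = s.str.split(";", expand=True)

print(list(as_lists[0]))  # ['Rome', ' IT']
print(as_columns.shape)   # (3, 2)
```

Note that the space after the semicolon is kept in the second piece.
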
>Try splitting the ***Location*** column again, but expand the results this time.

In [0]:
# %load ../solutions/05_16.py

>Create ***city*** and ***country*** columns using this split.

In [0]:
# %load ../solutions/05_17.py

>Drop the Location column


In [0]:
# %load ../solutions/05_18.py

In [0]:
df.head()

>Get the count of unique values in the ***country*** column.

In [0]:
# %load ../solutions/05_19.py

>Check the count of each unique value in the ***country*** column.

In [0]:
# %load ../solutions/05_20.py

>Use the strip method to remove the (potential) extra spaces at the beginning and the end of the cities and countries

In [0]:
# %load ../solutions/05_21.py

>Check again the number of unique values in the country column

In [0]:
# %load ../solutions/05_22.py

>Have a look at the rows where the ***country*** is Portugal.  
What do you notice?

In [0]:
# %load ../solutions/05_23.py

>Put the ***city*** column into lower case (don't forget to reassign it to the ***city*** column).

In [0]:
# %load ../solutions/05_24.py

>Have a look at the rows where the ***city*** contains ***/***

In [0]:
# %load ../solutions/05_25.py

## Map

In [0]:
df['country'].value_counts()

We saw that some countries were filled in with their country code, and others with their name. We are going to move everything to the same format.  
We can do that by mapping a dictionary to a column. It will then replace the ***key*** with the corresponding ***value***.  
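
To see why a mask is useful here, consider this sketch on a made-up series (the values and the small dictionary are hypothetical). Mapping a dictionary to the whole series turns every value that is not a key into a missing value.

```python
import pandas as pd

codes = pd.Series(["IT", "Portugal", "GR", "Italy"])
names = {"IT": "Italy", "GR": "Greece"}

# map looks up every element in the dictionary;
# elements that are not keys become NaN.
mapped_all = codes.map(names)
print(mapped_all.isna().sum())  # 2

# A boolean mask marks the rows whose value is one of the keys.
is_code = codes.isin(list(names))
print(is_code.tolist())  # [True, False, True, False]
```
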
>Complete the code below to replace the country codes with the country name.  
>Note that we use a mask as we want to replace only certain values of the series.

In [0]:
dict_countries = {'BG':'Bulgaria', 'CZ':'Czech Republic', 'IT':'Italy', 'GR':'Greece', 'SI':'Slovenia', 'UK':'United Kingdom'}
df.loc[df['country'].isin(dict_countries.keys()), 'country'] = 

In [0]:
# %load ../solutions/05_26.py

In [0]:
df['country'].value_counts()

## Apply

>Write a function that returns ***single*** if the value passed is 1, and ***multiple*** otherwise.

In [0]:
# %load ../solutions/05_27.py

>Apply this function to the ***Attendees*** column.

In [0]:
# %load ../solutions/05_28.py
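
As a further illustration of apply, here is a sketch with a different, made-up function (the threshold of 10 and the labels are hypothetical). The function is called once per element of the series.

```python
import pandas as pd


def size_label(n):
    """Return 'small' below 10 attendees and 'large' otherwise (hypothetical rule)."""
    return "small" if n < 10 else "large"


attendees = pd.Series([3, 25, 10])

# apply calls size_label on every element and collects the results.
labels = attendees.apply(size_label)
print(labels.tolist())  # ['small', 'large', 'large']
```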

## Date time

>Check the dtype of the ***DateFrom*** column.

In [0]:
# %load ../solutions/05_29.py

***DateFrom*** and ***DateTo*** are objects.  
>Convert these two columns to ***datetime*** format.

In [0]:
# %load ../solutions/05_30.py

Note: We could have used the *parse_dates* parameter when loading the data.  
We can now use these columns to filter or do some calculations.
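
The *parse_dates* alternative, followed by a filter and a calculation, could look like this on made-up data. The file content and the ISO date format are assumptions for this sketch; the real files may use another date format.

```python
import io

import pandas as pd

# Hypothetical file content with two date columns.
raw = io.StringIO(
    "Course,DateFrom,DateTo\n"
    "Hygiene,2016-01-11,2016-01-15\n"
    "Audit,2016-03-07,2016-03-08\n"
)

# parse_dates converts the listed columns to datetime while loading.
toy = pd.read_csv(raw, parse_dates=["DateFrom", "DateTo"])

# Datetime columns can be compared with a timestamp to filter rows...
late = toy[toy["DateFrom"] > pd.Timestamp("2016-02-01")]

# ...and subtracted from each other, which gives timedeltas.
span = toy["DateTo"] - toy["DateFrom"]

print(len(late))                # 1
print(span.dt.days.tolist())    # [4, 1]
```
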
>Display the rows where the training started after the first of February 2017.

In [0]:
# %load ../solutions/05_31.py

>Create a ***duration*** column with the duration of the training.

In [0]:
# %load ../solutions/05_32.py

>Create a ***month*** column indicating in which month the training started.  
>Plot a histogram of the ***month*** column.

In [0]:
# %load ../solutions/05_33.py

## Sorting values

>Sort df in alphabetical order of the ***city*** column.

In [0]:
# %load ../solutions/05_34.py

>Sort df by ascending ***duration*** and descending number of ***Attendees***.

In [0]:
# %load ../solutions/05_35.py

We can see that there were a couple of data entry issues with the dates...
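
Sorting on several columns with a different direction for each can be sketched on a made-up dataframe (the column names `group` and `score` are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({"group": ["b", "a", "a", "b"], "score": [1, 5, 9, 4]})

# 'by' lists the columns in order of priority;
# 'ascending' gives one direction per column.
ordered = toy.sort_values(by=["group", "score"], ascending=[True, False])

print(ordered["score"].tolist())  # [9, 5, 4, 1]
print(ordered.index.tolist())     # [2, 1, 3, 0]
```

The rows keep their original index labels, which shows how they were reordered.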

## Group By

>Group df by ***city***. Assign the grouped data to df_gr.

In [0]:
# %load ../solutions/05_36.py

>Get the mean number of ***Attendees*** for the grouped data.

In [0]:
# %load ../solutions/05_37.py
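
For comparison, here is a minimal sketch of a group by followed by an aggregation on a made-up dataframe (hypothetical values). Selecting the column before calling mean restricts the calculation to that column.

```python
import pandas as pd

toy = pd.DataFrame(
    {"city": ["rome", "rome", "lisbon"], "Attendees": [10, 20, 7]}
)

# groupby only defines the groups; nothing is computed yet.
grouped = toy.groupby("city")

# The aggregation returns one value per group, indexed by the group key.
means = grouped["Attendees"].mean()
print(means["rome"])    # 15.0
print(means["lisbon"])  # 7.0
```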