# Working with structured data in Python using Pandas


## Table of Contents

1. [Introduction](#introduction)<br>
2. [Series and DataFrames](#series)<br>
3. [Cleaning Data](#cleaning)<br>
4. [Selecting Data](#selection)<br>
5. [Merging Data](#merging)<br>
6. [Grouping Data](#grouping)<br>
7. [Visualising Data](#visualise)<br>

<a id="introduction"></a>
## 1. Introduction

A lot of data is **structured data**: data that is organized and formatted so it is easily readable, for example a table with variables as columns and records as rows, or key-value pairs in a NoSQL database. As long as the data is formatted consistently and has multiple records with numbers, text and dates, you can probably read it with [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html), an open-source Python package for high-performance data manipulation and analysis.

### Data

The data that you will explore in this notebook is about the boroughs in London. Within Greater London there are [32 boroughs](https://en.wikipedia.org/wiki/London_boroughs). You can download the data from [data.gov.uk](https://data.gov.uk/dataset/248f5f04-23cf-4470-9216-0d0be9b877a8/london-borough-profiles-and-atlas) where this description is given:

> The London Borough Profiles help paint a general picture of an area by presenting a range of headline indicator data to help show statistics covering demographic, economic, social and environmental datasets for each borough, alongside relevant comparator areas. 

**Let's start with loading the required Python packages and loading our data into the notebook.**

* To run the code, select the cell below by clicking on it, and then click on the `Run` button at the top of the notebook (or use `Shift+Enter`)
* The numbers in front of the cells tell you in which order you have run them, for instance `[1]`
* When you see a `[*]` the cell is currently running and `[]` means you have not run the cell yet. Make sure you run all of them!

In [None]:
import numpy as np
import pandas as pd

**Read data from a CSV file using the `read_csv` function. Load a file by running the next cell:**

This file is read directly from a URL, but you can replace this with a local path when running this notebook on a local system. When you are using IBM Watson Studio you can also [upload](https://dataplatform.cloud.ibm.com/docs/content/wsj/manage-data/add-data-project.html?linkInPage=true) a file to your Cloud Object Storage, and then [import](https://dataplatform.cloud.ibm.com/docs/content/wsj/manage-data/add-data-project.html?linkInPage=true#os) it by clicking on the file in the menu on the right of the notebook.  

In [None]:
df = pd.read_csv('https://data.london.gov.uk/download/london-borough-profiles/c1693b82-68b1-44ee-beb2-3decf17dc1f8/london-borough-profiles.csv',encoding = 'unicode_escape', sep=',', thousands=',')

Only keep the data from the 32 boroughs by removing the last 5 rows from the DataFrame: 

In [None]:
df = df.drop([33,34,35,36,37])
df.head(10)

**Let's take a first look at the data loaded into the notebook**

* With `df.head()` or `df.tail()` you can view the first five or last five lines from the data  
* Add a number between the brackets `()` to specify the number of lines you want to display, e.g. `df.head(2)`
* Use `df.dtypes` to check the different variables and their datatype
* `df.columns` gives a list of all column names
* `len(df)` gives the number of rows
* `df.shape` gives the number of rows and columns
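As a minimal sketch of these inspection commands, the cell below builds a small DataFrame and applies each of them. The name `toy` and its columns are made up for illustration and are not part of the borough data.

```python
import pandas as pd

# Hypothetical toy data, not the borough dataset
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "value": [1.5, 2.0, 3.5, 4.0, 5.5, 6.0],
    "count": [10, 20, 30, 40, 50, 60],
})

print(toy.head(2))          # first two rows
print(toy.tail(2))          # last two rows
print(toy.dtypes)           # one data type per column
print(list(toy.columns))    # column names as a plain list
print(len(toy), toy.shape)  # number of rows, then (rows, columns)
```

The same calls work unchanged on `df` once it has been loaded.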

> **Tip**: to add more cells to run additional commands, activate a cell by clicking on it and then click on the '+' button at the top of the notebook. This will add a new cell. Click on the buttons with the upwards and downwards arrows to move the cells up and down to change their order

<div class="alert alert-success">
 <b>EXERCISE</b> <br/> 
 Now let's have a look at the data that was loaded into the notebook. What are we actually looking at? 
    
  Explore some of the following commands:
  <ul>
  <li><font face="Courier">df.head()</font></li>
  <li><font face="Courier">df.tail()</font></li>
  <li><font face="Courier">df.columns</font></li>
  <li><font face="Courier">df.values</font></li>
  <li><font face="Courier">len(df)</font></li>
  <li><font face="Courier">list(df)</font></li>
  </ul>
</div>  


In [None]:
# try the commands here (add as many cells as you need):



<a id="series"></a>
## 2. Series and DataFrames 

A `Series` is a one-dimensional labelled array that can hold data of any type (integer, string, float, Python objects, etc.).

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
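The "labelled" part becomes clearer when the labels are set explicitly. The sketch below uses a made-up Series `s_labelled` with string labels to show access by label and by position.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with explicit string labels instead of the default 0, 1, 2, ...
s_labelled = pd.Series([1.0, 3.0, np.nan, 6.0], index=["a", "b", "c", "d"])

print(s_labelled["b"])          # access by label
print(s_labelled.iloc[0])       # access by position
print(s_labelled.isna().sum())  # number of missing values
```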

A `DataFrame` is a two-dimensional data structure made up of rows and columns. You can create one in many ways, for example by loading a file, or from a NumPy array with dates as the index.

<div class="alert alert-info" style="font-size:100%">
<a href="https://numpy.org"> NumPy</a> is a Python library for working with multi-dimensional arrays and matrices with a large collection of mathematical functions to operate on these arrays.
Have a look at this <a href="https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html"> NumPy tutorial</a> for an overview.
</div>



Create DataFrame `df1` with `dates` as the index, a 6 by 4 array of random `numbers` as values, and column names A, B, C and D (the index will be explained in the next section):  

In [None]:
dates = pd.date_range('20200101', periods=6)
dates

In [None]:
numbers = np.random.randn(6, 4)
numbers

In [None]:
df1 = pd.DataFrame(numbers, index=dates, columns=['A', 'B', 'C', 'D'])
df1

Or create a DataFrame in one command from a dictionary, where each key becomes a column name and single values are repeated for every row:

In [None]:
df2 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': np.array([3] * 4, dtype='int32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})

In [None]:
df2.head()

Use `type()` to check the data type of each variable. Below `print` is used to display the data type of all of them used so far:

In [None]:
print('Data type of s is '+str(type(s)))
print('Data type of dates is '+str(type(dates)))
print('Data type of numbers is '+str(type(numbers)))
print('Data type of df1 is '+str(type(df1)))

In [None]:
type(df)

<a id="cleaning"></a>
## 3. Cleaning Data

When exploring data there are always transformations needed to get it in the format you need for your analysis, visualisations or models. Below are only a few examples of the endless possibilities. The best way to learn is to find a dataset and try to answer questions with the data.

First, let's make a copy of the DataFrame loaded from the URL:

In [None]:
boroughs = df.copy()

### Adding an index

Indexing and selecting data is key to data analysis and creating visualizations. For more information on indexing have a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).

Set the area code (`Code`) as the index, which will change the table slightly:

In [None]:
boroughs = boroughs.set_index(['Code'])
boroughs.head()
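To see what setting an index changes, here is a small sketch on made-up data (the codes `X1` to `X3` and the name `toy` are illustrative only). The index column moves out of the regular columns and can then be used for label lookups; `reset_index()` reverses this.

```python
import pandas as pd

# Hypothetical toy data with a code column that can serve as the index
toy = pd.DataFrame({
    "Code": ["X1", "X2", "X3"],
    "Name": ["North", "South", "East"],
    "Area": [10, 20, 30],
})

indexed = toy.set_index("Code")    # "Code" becomes the row labels
print(indexed.loc["X2", "Name"])   # look up a value by its row label

restored = indexed.reset_index()   # "Code" is a regular column again
print(list(restored.columns))
```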

### Adding and deleting columns

Add a column by assigning a value to a new column name, here `new`. The column can be removed again with the `drop` function.

In [None]:
boroughs['new'] = 1
boroughs.head()

In [None]:
boroughs = boroughs.drop(columns='new')
boroughs.head()

As not all columns are needed, let's remove some. If you are interested in any of these, change the code and do not remove the columns.

In [None]:
boroughs = boroughs.drop(columns=['GLA_Household_Estimate_2017',
       'Proportion_of_population_aged_0-15,_2015',
       'Proportion_of_population_of_working-age,_2015',
       'Proportion_of_population_aged_65_and_over,_2015',
       'Net_internal_migration_(2015)', 'Net_international_migration_(2015)',
       'Net_natural_change_(2015)',
       '%_of_largest_migrant_population_(2011)',
       'Second_largest_migrant_population_by_country_of_birth_(2011)',
       '%_of_second_largest_migrant_population_(2011)',
       'Third_largest_migrant_population_by_country_of_birth_(2011)',
       '%_of_third_largest_migrant_population_(2011)',
       '%_of_population_from_BAME_groups_(2016)',
       '%_people_aged_3+_whose_main_language_is_not_English_(2011_Census)',
       'Overseas_nationals_entering_the_UK_(NINo),_(2015/16)',
       'Largest_migrant_population_arrived_during_2015/16',
       'Second_largest_migrant_population_arrived_during_2015/16',
       'Third_largest_migrant_population_arrived_during_2015/16',
       'Male_employment_rate_(2015)',
       'Female_employment_rate_(2015)', 'Unemployment_rate_(2015)',
       'Youth_Unemployment_(claimant)_rate_18-24_(Dec-15)',
       'Proportion_of_16-18_year_olds_who_are_NEET_(%)_(2014)',
       'Proportion_of_the_working-age_population_who_claim_out-of-work_benefits_(%)_(May-2016)',
       '%_working-age_with_a_disability_(2015)',
       'Proportion_of_working_age_people_with_no_qualifications_(%)_2015',
       'Proportion_of_working_age_with_degree_or_equivalent_and_above_(%)_2015',
       'Gross_Annual_Pay,_(2016)',
       'Modelled_Household_median_income_estimates_2012/13',
       '%_adults_that_volunteered_in_past_12_months_(2010/11_to_2012/13)',
       'Number_of_jobs_by_workplace_(2014)',
       '%_of_employment_that_is_in_public_sector_(2014)', 'Jobs_Density,_2015',
       'Number_of_active_businesses,_2015',
       'Two-year_business_survival_rates_(started_in_2013)',
       'Crime_rates_per_thousand_population_2014/15',
       'Fires_per_thousand_population_(2014)',
       'Ambulance_incidents_per_hundred_population_(2014)',
       'Average_Band_D_Council_Tax_charge_(£),_2015/16',
       'New_Homes_(net)_2015/16_(provisional)',
       'Homes_Owned_outright,_(2014)_%',
       'Being_bought_with_mortgage_or_loan,_(2014)_%',
       'Rented_from_Local_Authority_or_Housing_Association,_(2014)_%',
       'Rented_from_Private_landlord,_(2014)_%',
       'Total_carbon_emissions_(2014)',
       'Household_Waste_Recycling_Rate,_2014/15',
       'Number_of_cars,_(2011_Census)',
       'Number_of_cars_per_household,_(2011_Census)',
       '%_of_adults_who_cycle_at_least_once_per_month,_2014/15',
       'Average_Public_Transport_Accessibility_score,_2014',
       'Achievement_of_5_or_more_A*-_C_grades_at_GCSE_or_equivalent_including_English_and_Maths,_2013/14',
       'Rates_of_Children_Looked_After_(2016)',
       '%_of_pupils_whose_first_language_is_not_English_(2015)',
       '%_children_living_in_out-of-work_households_(2015)',
       'Male_life_expectancy,_(2012-14)', 'Female_life_expectancy,_(2012-14)',
       'Teenage_conception_rate_(2014)',
       'Life_satisfaction_score_2011-14_(out_of_10)',
       'Worthwhileness_score_2011-14_(out_of_10)',
       'Anxiety_score_2011-14_(out_of_10)',
       'Childhood_Obesity_Prevalance_(%)_2015/16',
       'People_aged_17+_with_diabetes_(%)',
       'Mortality_rate_from_causes_considered_preventable_2012/14',
       'Proportion_of_seats_won_by_Conservatives_in_2014_election',
       'Proportion_of_seats_won_by_Labour_in_2014_election',
       'Proportion_of_seats_won_by_Lib_Dems_in_2014_election'])

In [None]:
boroughs.columns

<a id="Renaming"></a>

You can change names of columns using `rename`:

In [None]:
boroughs.rename(columns={'Area_name':'Name',
                'Inner/_Outer_London':'Inner/Outer',
                'GLA_Population_Estimate_2017':'Population',
                'Inland_Area_(Hectares)':'Area (ha)',
                'Average_Age,_2017':'Average Age',
                'Political_control_in_council':'Political control',
                'Population_density_(per_hectare)_2017':'Population density (/ha)',
                'New_migrant_(NINo)_rates,_(2015/16)':'New migrant rates',
                'Happiness_score_2011-14_(out_of_10)':'Happiness score',
                '%_of_resident_population_born_abroad_(2015)':'Population born abroad (%)',
                'Employment_rate_(%)_(2015)':'Employment rate (%)',
                'Turnout_at_2014_local_elections':'Turnout at local elections',
                'Median_House_Price,_2015':'Median House Price',
                "Largest_migrant_population_by_country_of_birth_(2011)":'Largest migrant population',
                'Gross_Annual_Pay_-_Female_(2016)':'Gross Pay (Female)',
                'Gross_Annual_Pay_-_Male_(2016)':'Gross Pay (Male)',
                '%_of_area_that_is_Greenspace,_2005':'Greenspace (%)'},
                 inplace=True)

In [None]:
boroughs.columns

In [None]:
boroughs.head()

### Further Data Cleaning

**Things to check:**

* Is the data tidy: each variable forms a column, each observation forms a row and  each type of observational unit forms a table.
* Are all columns in the right data format?
* Are there missing values?
* Are there unrealistic outliers?
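A rough sketch of how some of these checks could look in code, using a made-up DataFrame `toy` (the column names and the 0 to 100 range for a rate are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value, a text column that should be
# numeric, and one implausible rate
toy = pd.DataFrame({
    "area": ["A", "B", "C", "D"],
    "rate": [70.1, np.nan, 68.4, 250.0],
    "price": ["300000", ".", "450000", "500000"],
})

# Missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Columns that are not stored as numbers
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Rows where a percentage falls outside the plausible 0-100 range
suspicious = toy[(toy["rate"] < 0) | (toy["rate"] > 100)]
print(suspicious)
```

Note that the `.` in `price` is not counted as missing until it has been replaced by a real missing value.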

Get a quick overview of the numeric data using the `.describe()` function. If any of the numeric columns are missing from the output, this is probably because they have the wrong data type.

In [None]:
boroughs.describe()

When looking at the `Turnout at local elections` column you can see a `.`, which needs to be replaced by a missing value. Change all of them with `replace`:

In [None]:
boroughs = boroughs.replace('.', float('NaN'))

Check if all datatypes are as you expect with `dtypes`:

In [None]:
boroughs.dtypes

Except for `Name`, `Inner/Outer`, `Largest migrant population` and `Political control`, these should all be numeric (`float64` or `int64`). Change the data type to numeric with `to_numeric`:

In [None]:
boroughs['Population density (/ha)'] = pd.to_numeric(boroughs['Population density (/ha)'])
boroughs['Population born abroad (%)'] = pd.to_numeric(boroughs['Population born abroad (%)'])
boroughs['Gross Pay (Male)'] = pd.to_numeric(boroughs['Gross Pay (Male)'])
boroughs['Gross Pay (Female)'] = pd.to_numeric(boroughs['Gross Pay (Female)'])
boroughs['Median House Price'] = pd.to_numeric(boroughs['Median House Price'])
boroughs['Greenspace (%)'] = pd.to_numeric(boroughs['Greenspace (%)'])
boroughs['Turnout at local elections'] = pd.to_numeric(boroughs['Turnout at local elections'])

boroughs['Area (ha)'] = boroughs['Area (ha)'].str.replace(',', '')
boroughs['Area (ha)'] = pd.to_numeric(boroughs['Area (ha)'])

boroughs.dtypes
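The conversions above repeat the same call for each column. One possible alternative, sketched here on made-up data, is to loop over a list of column names. Passing `errors="coerce"` to `pd.to_numeric` turns any value that cannot be parsed (such as a stray `.`) into `NaN` instead of raising an error. The DataFrame `toy` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as text, with a "." placeholder
# and a thousands separator
toy = pd.DataFrame({
    "turnout": ["38.2", ".", "41.0"],
    "area": ["1,737", "630", "598"],
})

# Remove the thousands separator before converting
toy["area"] = toy["area"].str.replace(",", "", regex=False)

# Convert several columns in one loop; unparseable values become NaN
for column in ["turnout", "area"]:
    toy[column] = pd.to_numeric(toy[column], errors="coerce")

print(toy.dtypes)
print(toy)
```

Coercing silently hides bad values, so it is worth counting the resulting `NaN`s with `isna().sum()` afterwards.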

<a id="selection"></a>
## 4. Selecting Data


Access single or groups of rows and columns by label using `.loc[]`. (Rows are looked up by their index label, so this only works with values from the column that was set as the index):

In [None]:
boroughs.loc['E09000001', 'Area (ha)']

In [None]:
boroughs.loc['E09000001':'E09000004', ['Area (ha)', 'Average Age']]

Or select by position with `.iloc[]`. Select a single row, multiple rows (or columns) at particular positions in the index. This function is integer based (from 0 to length-1 of the axis):

In [None]:
boroughs.iloc[0]

In [None]:
boroughs.iloc[:,1]

In [None]:
boroughs.iloc[:,0:2]

In [None]:
boroughs.iloc[2:4,0:2]
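One difference between the two is easy to miss: a label slice with `.loc[]` includes its end label, while a position slice with `.iloc[]` excludes its end position, like ordinary Python slicing. A small sketch on made-up data (`toy` and the labels `X1` to `X4` are illustrative):

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {"area": [10, 20, 30, 40], "age": [35.0, 36.5, 38.0, 40.5]},
    index=["X1", "X2", "X3", "X4"],
)

by_label = toy.loc["X1":"X3", ["area"]]  # end label "X3" is included
by_position = toy.iloc[0:2, 0:1]         # end position 2 is excluded

print(by_label)
print(by_position)
```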

All the above examples can be used to create a new DataFrame. Or create a new DataFrame from 2 columns:

In [None]:
boroughs2 = boroughs[['Area (ha)', 'Average Age']]
boroughs2.head()

### Filtering

Selecting rows based on a certain condition can be done with Boolean indexing. This uses the actual values of the data in the DataFrame as opposed to the row/column labels or index positions.

In [None]:
boroughs['Average Age'] > 39

When you want to select the rows and see all the data, wrap the condition in `boroughs[]`:

In [None]:
boroughs[boroughs['Average Age'] > 39]

You can combine conditions on different columns with the `&` (and) and `|` (or) operators. Put each condition in its own parentheses:

In [None]:
boroughs[(boroughs['Average Age'] > 39) & (boroughs['Political control'] == 'Cons')]

In [None]:
boroughs[(boroughs['Political control'] == 'Lab') | (boroughs['Political control'] == 'Lib Dem')]
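Two related tools are worth knowing, sketched below on made-up data: `isin()` is a shorter way to test a column against several values than chaining `|`, and `~` negates a condition. The DataFrame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [40.1, 35.0, 39.5, 33.2],
    "control": ["Cons", "Lab", "Lib Dem", "Cons"],
})

# Both conditions must hold
older_cons = toy[(toy["age"] > 39) & (toy["control"] == "Cons")]

# Equivalent to (control == 'Lab') | (control == 'Lib Dem')
lab_or_libdem = toy[toy["control"].isin(["Lab", "Lib Dem"])]

# Negate a condition with ~
not_lab = toy[~(toy["control"] == "Lab")]

print(older_cons)
print(lab_or_libdem)
print(not_lab)
```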

<div class="alert alert-success">
 <b>EXERCISE</b> <br/> 
 With the above commands you can now start exploring the data some more. Answer the following questions by writing a little code (add as many cells as you need):
  <ol>
  <li>Which borough has the largest population density per hectare? </li>  
  <li>What are the maximum and minimum number of new migrants? And for which boroughs?</li>   
  <li> Which borough is happiest? </li>
  
 </ol>  
</div>  


> *Tips*: 
- Find the maximum of a column with for instance `boroughs['Population'].max()` 
- Extract the value from a cell in a DataFrame with `.values[]`
- Print a value with `print()`, for instance `print(boroughs['Area (ha)'].iloc[0])` for the first row. If you calculate multiple values in one cell you will need this, otherwise only the last answer is displayed.
- To extract an entire row, use `idxmax()`, which returns the index label of the row with the maximum value, and `.loc[]` to return the row with that label
- To see the answer uncomment the line in the cell that contains `%load` (by deleting the `#`) and then run the cell, but try to find your own solution first in the cell above the solution!
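The pattern from the tips can be sketched on made-up data, so the exercise itself is left for you. The DataFrame `toy`, its labels and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, indexed by a made-up code
toy = pd.DataFrame(
    {"name": ["North", "South", "East"], "density": [45.2, 112.7, 80.3]},
    index=["X1", "X2", "X3"],
)

highest_label = toy["density"].idxmax()  # index label of the row with the maximum
highest_row = toy.loc[highest_label]     # the full row for that label

print(toy["density"].max())              # the maximum value itself
print(highest_row["name"])               # which row it belongs to
print(toy["density"].values[0])          # a single cell, by position
```

`idxmin()` and `min()` work the same way for the minimum.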

**Which borough has the largest population density per hectare?**

In [None]:
# your answer:


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-pandas-workshop/master/answers/answer1.py

**What are the maximum and minimum number of new migrants? And for which boroughs?**

In [None]:
# your answer:


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-pandas-workshop/master/answers/answer2.py

**Which borough is happiest?**

In [None]:
# your answer:


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-pandas-workshop/master/answers/answer3.py

<a id="merging"></a>
## 5. Merging Data

Pandas has several different options to combine or merge data. The [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) has lots of examples. 

Let's create two new DataFrames to explore this: `cities` and `cities2`

In [None]:
data = {'city':       ['London','Manchester','Birmingham','Leeds','Glasgow'],
        'population': [9787426,  2553379,     2440986,    1777934,1209143],
        'area':       [1737.9,   630.3,       598.9,      487.8,  368.5 ]}
cities = pd.DataFrame(data)

data2 = {'city':       ['Liverpool','Southampton'],
        'population': [864122,  855569],
        'area':       [199.6,   192.0]}
cities2 = pd.DataFrame(data2)

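Use `pd.concat()` to combine these DataFrames (the older `DataFrame.append()` method was removed in pandas 2.0). With `ignore_index=True` the rows are renumbered from 0:

In [None]:
cities = pd.concat([cities, cities2], ignore_index=True)
cities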

In [None]:
data = {'city': ['London','Manchester','Birmingham','Leeds','Glasgow'],
        'density': [5630,4051,4076,3645,3390]}
cities3 = pd.DataFrame(data)

In [None]:
cities3

An extra column can be added with `.merge()` with an outer join using the city names:

In [None]:
cities = pd.merge(cities, cities3, how='outer', sort=True,on='city')
cities
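The `how` argument decides which rows survive a merge. As a sketch on made-up frames `left` and `right` (the city figures are copied from the examples above, the frame names are illustrative), an outer join keeps every city from both sides and fills the gaps with `NaN`, while an inner join keeps only cities present in both. Setting `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Two small frames that only partly overlap on "city"
left = pd.DataFrame({
    "city": ["London", "Leeds", "Liverpool"],
    "population": [9787426, 1777934, 864122],
})
right = pd.DataFrame({
    "city": ["London", "Leeds", "Glasgow"],
    "density": [5630, 3645, 3390],
})

# Outer join: all cities from both frames, NaN where a value is missing
outer = pd.merge(left, right, how="outer", on="city", indicator=True)

# Inner join: only cities that appear in both frames
inner = pd.merge(left, right, how="inner", on="city")

print(outer)
print(inner)
```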

<a id="grouping"></a>
## 6. Grouping Data

Grouping data is a quick way to calculate values for classes in your DataFrame. 

In [None]:
# numeric_only=True skips text columns, which cannot be averaged
boroughs.groupby(['Inner/Outer']).mean(numeric_only=True)

When you have multiple categorical variables you can create a nested index:

In [None]:
# numeric_only=True skips text columns, which would otherwise be concatenated
boroughs.groupby(['Inner/Outer','Political control']).sum(numeric_only=True).head(8)
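Often a different summary is wanted for each column. One way to do that, sketched here on made-up data, is named aggregation with `.agg()`, where each output column is defined as a `(column, function)` pair. The DataFrame `toy` and the output names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "zone": ["Inner", "Inner", "Outer", "Outer"],
    "population": [100, 200, 300, 500],
    "age": [34.0, 36.0, 38.0, 40.0],
})

# Sum one column and average another, per group
summary = toy.groupby("zone").agg(
    total_population=("population", "sum"),
    mean_age=("age", "mean"),
)
print(summary)
```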

<a id="visualise"></a>
## 7. Visualising Data

Pandas uses [`Matplotlib`](https://matplotlib.org/users/index.html) as the default for visualisations. 

Import the package and also add the magic line starting with `%` to output the charts within the notebook:

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
boroughs = boroughs.reset_index()

The default plot is a line chart that uses the index for the x-axis:

In [None]:
boroughs['Employment rate (%)'].plot();

To create a plot that makes more sense for this data have a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) for all options. 

For the above example, a histogram might work better. You can change the number of `bins` to get the desired output:

In [None]:
boroughs['Employment rate (%)'].plot.hist(bins=10);

Change the size of the histogram with the `figsize` option:

In [None]:
boroughs['Employment rate (%)'].plot.hist(bins=15,figsize=(10,5));

Within the plot command you can select the data directly. The below histogram shows the Employment rate for Outer London only:

In [None]:
boroughs['Employment rate (%)'][boroughs['Inner/Outer']=='Outer London'].plot.hist(bins=15,figsize=(10,5));

To add the Employment rate for Inner London, repeat the plot command with a different selection of the data:

In [None]:
boroughs['Employment rate (%)'][boroughs['Inner/Outer']=='Outer London'].plot.hist(bins=15,figsize=(10,5));
boroughs['Employment rate (%)'][boroughs['Inner/Outer']=='Inner London'].plot.hist(bins=15,figsize=(10,5));

The above plot is difficult to read as the histograms overlap. You can fix this by changing the colours and making them transparent. 
    
To add a legend, each histogram needs to be assigned to an object `ax`. With `legend()` you can then add a legend. With `plt.xlabel()` you can also add a label for the x-axis (this works similarly for the y-axis):

In [None]:
ax = boroughs['Employment rate (%)'][boroughs['Inner/Outer']=='Outer London'].plot.hist(
    bins=15,figsize=(10,5),alpha=0.5,color='#1A4D3B');
ax = boroughs['Employment rate (%)'][boroughs['Inner/Outer']=='Inner London'].plot.hist(
    bins=15,figsize=(10,5),alpha=0.5,color='#4D1A39');
ax.legend(['Outer London','Inner London'])
plt.xlabel('Employment rate (%)');

There are various options available to change every aspect of your chart. Below are some examples to get you started.
        
**Go ahead and create new charts and customise the options.** 

Especially the next one can be improved on to make it look better:

In [None]:
boroughs['Population density (/ha)'].plot.hist(
    bins=15, 
    title="Population Density (/ha)",
    legend=False,
    fontsize=14,
    grid=False,
    linestyle='--',
    edgecolor='black',
    color='darkred',
    linewidth=3);

## Seaborn

Seaborn is a Python data visualization library based on matplotlib. It is an easy to use visualisation package that works well with Pandas DataFrames. 

Below are a few examples using Seaborn. 

Refer to this [documentation](https://seaborn.pydata.org/index.html) for information on lots of plots you can create.

In [None]:
import seaborn as sns

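Let's look at the distribution of the data using `histplot` with `kde=True`, which draws a histogram with a smoothed density curve. (This replaces `distplot`, which has been deprecated since seaborn 0.11.) 

Use the `dropna()` function to remove the missing (Null/NaN) values from the column:

In [None]:
sns.histplot(boroughs['Population density (/ha)'].dropna(), kde=True);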

You can create categorical plots with `catplot`. There are categorical scatter plots, distribution plots and estimate plots. The `kind` parameter selects the function to use, for instance box, violin, swarm, bar, strip and boxen.
    
The default representation in catplot() uses a scatter plot:

In [None]:
sns.catplot(x='Turnout at local elections', y='Political control', data=boroughs);

Also try `kind="swarm"`, `kind="box"` or `kind="violin"`:

In [None]:
sns.catplot(x='Median House Price', y='Name', kind='swarm', data=boroughs);

In [None]:
sns.catplot(x='Employment rate (%)', y='Largest migrant population', kind="box", data=boroughs);

In [None]:
sns.catplot(x='Turnout at local elections', y='Political control', kind="violin", data=boroughs);

<div class="alert alert-success">
 <b>EXERCISE</b>
 <ol>
  <li>Create two histograms that compare the Gross Annual pay for Male and Female Employees using <font face="Courier">.plot.hist()</font></li>
  <li>Create a bar plot comparing the median house prices for different boroughs</li>
  <li>Create a scatter plot comparing the Median House price and percentage of area that is greenspace </li>
 </ol> 
   </div> 
   
 
 > *Tips*:
-  To add two histograms to one plot you can repeat `.plot.hist()` in the same cell 
-  Add a legend by assigning each histogram to an object `ax`, which is used to create a legend
-  To customise the size of your plots, use the `figsize` option as in the examples above 

**Create two histograms that compare the Gross Annual pay for Male and Female Employees using `.plot.hist()`**

In [None]:
# your answer:


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-pandas-workshop/master/answers/answer4.py

**Create a bar plot comparing the median house prices for different boroughs**

In [None]:
# your answer:


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-pandas-workshop/master/answers/answer5.py

**Create a scatter plot comparing the Median House price and percentage of greenspace area** 

In [None]:
# your answer:


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/python-pandas-workshop/master/answers/answer6.py

Now that you have explored some real data with Python and Pandas, keep learning by exploring more of this dataset, or create a new notebook and start with your own data. The data used here is a clean dataset, which is definitely not always the case, so always check your data carefully. 

<div class="alert alert-info" style="font-size:100%">
<b>To learn more about Pandas start with this <a href="http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html">10 minute introduction</a></b><br>
</div>

## Optional Exercises and further learning

If you finish early:

1. Try to create other plots. Have a look at the [Pandas plot examples](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) or the [Seaborn gallery](https://seaborn.pydata.org/examples/index.html) for inspiration.  
2. Or load one of your own datasets into a new notebook and play around with the data to practice what you have learned. You can use the free account you created today for your own projects as well! 
3. Have a look at these Pandas workshops and book: <br>
3.1. [Pandas workshop by Alexander Hensdorf](https://github.com/alanderex/pydata-pandas-workshop) <br>
3.2. [Pandas tutorial by Joris van den Bossche](https://github.com/jorisvandenbossche/pandas-tutorial) <br>
3.3. [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) <br>

### Authors
    
Yamini Rao is a Developer Advocate for IBM. She compiles developer scenarios, workshops and training material based on IBM Cloud technologies. She works with various developer communities across the UK, collaborating with them to present/organise workshops and meetups. She has a background in computer science and has worked extensively as an Implementation Engineer for various IBM Analytical tools. 

Margriet Groenendijk is a Data & AI Developer Advocate for IBM. She develops and presents talks and workshops about data science and AI. She is active in the local developer communities through attending, presenting and organising meetups and conferences. She has a background in climate science where she explored large observational datasets of carbon uptake by forests during her PhD, and global scale weather and climate models as a postdoctoral fellow. 


Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.