# Data Science - Python and Pandas


## Table of Contents

1. [Introduction](#introduction)<br>
1.1. [Series and DataFrames](#series)<br>
1.2. [Indexing and Data Selection](#index)<br>
2. [Transform data](#transform)<br>
2.1. [Adding and deleting columns](#columns)<br>
2.2. [Cleaning Data](#cleaning)<br>
2.3. [Merging Data](#merging)<br>
2.4. [Grouping Data](#grouping)<br>
3. [Visualise data](#explore)<br>
4. [Optional Exercises](#extra)<br>



Let's start with loading the packages and a quick look at some data. Select the below cell by clicking on it, and then click on the `Run` button at the top of the notebook (or use `Shift+Enter`). This is how you can run all code cells in this notebook. The numbers in front of the cells tell you in which order you have run them, for instance `[1]`. When you see a `[*]` the cell is currently running and `[]` means you have not run the cell yet. 

In [None]:
#!pip install --upgrade seaborn

If you uncomment and run the cell above to upgrade seaborn, restart the notebook afterwards by clicking on the `Kernel` tab at the top of the notebook, and then `Restart`. You do not have to run that cell again after the restart, as the seaborn package only has to be updated once. Then run the next cell, which imports two other packages:

In [None]:
import numpy as np
import pandas as pd

Loading data from Cloud Object Storage (COS) is done by adding the `measurements.csv` file in the menu on the right of the notebook (if you see no menu, click the `1010` button at the top first).

- Activate the below cell, move the cursor to the empty line under `# add data`
- Click on `Insert to code` under the file from the right menu
- Select `Insert pandas DataFrame`
- Code to load the file will be inserted
- Change the default name of the data from `df_data_1` to `jeans` at the bottom two rows of the inserted code

In [None]:
# add data


In [None]:
# If you want to run this notebook locally use:
#jeans = pd.read_csv('measurements.csv')

<div class="alert alert-success">
 <b>EXERCISE</b> <br/> 
 Now let's have a look at the data that was loaded into the notebook. Use jeans, jeans.head() or jeans.tail() to see different parts of the table and jeans.dtypes to check which variables there are and what datatype they have. Add a number between the brackets () to specify how many lines you want to display.
    
  Explore some of the following commands:
  <ul>
  <li><font face="Courier">jeans</font></li>
  <li><font face="Courier">jeans.head()</font></li>
  <li><font face="Courier">jeans.tail()</font></li>
  <li><font face="Courier">jeans.dtypes</font></li>
  <li><font face="Courier">jeans.columns</font></li>
  <li><font face="Courier">jeans.values</font></li>
  <li><font face="Courier">jeans.shape</font></li>
  <li><font face="Courier">len(jeans)</font></li>
  </ul>
</div>  

> *Tip*: If you want to run these in separate cells, activate the below cell by clicking on it and then click on the + at the top of the notebook. This will add extra cells. Click on the upwards and downwards arrows to move the cells up and down to change their order.

In [None]:
# try the commands here (add as many cells as you need):


<a id="introduction"></a>
## 1. Introduction

The Python package you used to read this file and look at some of its properties is [Pandas](https://pandas.pydata.org/), which is an open source library with easy-to-use data structures and data analysis tools.

<div class="alert alert-info" style="font-size:100%">
<b>Read this <a href="http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html">10 minute introduction</a> for a quick overview of Pandas.</b><br>
</div>

<a id="series"></a>
### 1.1 Series and DataFrames 

Let's go through some of the basics of Pandas before going back to the Jeans dataset. Pandas has two main data structures: `Series` and `DataFrame`.

A `Series` is a one-dimensional list of values, each with a label in the index. When displayed, the first column is the index (by default integers starting at 0) and the second column holds the values.

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
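
The index does not have to be made of integers. As a small sketch (the name `s_labelled` and the labels `a` to `f` are made up for this illustration), the same values can be given string labels and then be selected by label or by position:

```python
import numpy as np
import pandas as pd

# Same values as the Series above, but with string labels instead of 0..5
s_labelled = pd.Series([1, 3, 5, np.nan, 6, 8], index=list('abcdef'))

print(s_labelled['c'])          # select by label
print(s_labelled.iloc[2])       # select by position (the same element)
print(s_labelled.isna().sum())  # count the missing values
```

Because one value is `np.nan`, Pandas stores the whole Series as floating point numbers.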

A `DataFrame` is similar, but has multiple columns. You can create one in many ways, for example by loading a file, or from a NumPy array with dates for the index. (We come back to the index and dates later.)


<div class="alert alert-info" style="font-size:100%">
<b>Read this <a href="https://docs.scipy.org/doc/numpy-1.15.0/user/quickstart.html">tutorial</a> for an overview of NumPy.</b><br>
</div>

Two examples:

In [None]:
dates = pd.date_range('20130101', periods=6)
dates

In [None]:
numbers = np.random.randn(6, 4)
numbers

In [None]:
df = pd.DataFrame(numbers, index=dates, columns=list('ABCD'))
df

In [None]:
df2 = pd.DataFrame({'A': 1.,
                     'B': pd.Timestamp('20130102'),
                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D': np.array([3] * 4, dtype='int32'),
                     'E': pd.Categorical(["test", "train", "test", "train"]),
                     'F': 'foo'})

In [None]:
df2.head()
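
Each column of a DataFrame has its own data type, and a single value is repeated to fill a column. The sketch below shows this on a smaller DataFrame built in the same way (the name `mixed` is made up for this example):

```python
import numpy as np
import pandas as pd

# A scalar is repeated for every row; the array and the categorical set the length to 4
mixed = pd.DataFrame({'A': 1.,
                      'B': pd.Timestamp('20130102'),
                      'D': np.array([3] * 4, dtype='int32'),
                      'E': pd.Categorical(["test", "train", "test", "train"])})

print(mixed.shape)   # (rows, columns)
print(mixed.dtypes)  # one data type per column
```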

To find out what the data type is of a variable use `type()`: 

In [None]:
print('Data type of s is '+str(type(s)))
print('Data type of dates is '+str(type(dates)))
print('Data type of numbers is '+str(type(numbers)))
print('Data type of df is '+str(type(df)))

In [None]:
type(jeans)

<a id="index"></a>
### 1.2 Indexing and Data Selection

For this we will create a new DataFrame with the population of the 5 largest cities in the UK ([source](https://en.wikipedia.org/wiki/List_of_urban_areas_in_the_United_Kingdom)). `data` is a [dictionary](https://realpython.com/python-dicts/).

In [None]:
data = {'city':       ['London','Manchester','Birmingham','Leeds','Glasgow'],
        'population': [9787426,  2553379,     2440986,    1777934,1209143],
        'area':       [1737.9,   630.3,       598.9,      487.8,  368.5 ]}
cities = pd.DataFrame(data)
cities

In [None]:
cities = cities.set_index('city')
cities

Select a single label or a range of labels with `.loc[]`. The row labels come from the index (here the `city` column that was set as the index) and the column labels are the column names:

In [None]:
cities.loc['London', 'area']

In [None]:
cities.loc['Manchester':'Leeds', ['area', 'population']]

Or select by position with `.iloc[]`. It only takes integers, and you can select a single row or multiple rows (or columns) at particular positions:

In [None]:
cities.iloc[0]

In [None]:
cities.iloc[:,1]

In [None]:
cities.iloc[:,0:2]

In [None]:
cities.iloc[2:4,0:2]
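
One difference that is easy to miss: a slice with `.loc[]` includes the end label, while a slice with `.iloc[]` excludes the end position, as in normal Python slicing. A minimal sketch, using a copy of the cities data under the made-up name `cities_demo`:

```python
import pandas as pd

cities_demo = pd.DataFrame(
    {'population': [9787426, 2553379, 2440986, 1777934, 1209143],
     'area': [1737.9, 630.3, 598.9, 487.8, 368.5]},
    index=['London', 'Manchester', 'Birmingham', 'Leeds', 'Glasgow'])

by_label = cities_demo.loc['Manchester':'Leeds']  # end label is included
by_position = cities_demo.iloc[1:3]               # end position 3 is excluded

print(list(by_label.index))
print(list(by_position.index))
```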

You can also select by column name. A single name returns a `Series`, and a list of names returns a new DataFrame.

In [None]:
cities['area']

In [None]:
cities2 = cities[['area','population']]
cities2

#### Filtering

Selecting rows based on a certain condition can be done with Boolean indexing:

In [None]:
cities['area'] > 500

If you want to select the data add `cities[]` around the above:

In [None]:
cities[cities['area'] > 500]

Combining conditions on different columns using `&`, `|` and `==` is also possible:

In [None]:
cities[(cities['area'] > 500) & (cities['population'] > 2500000)]

In [None]:
cities[(cities['area'] < 500) | (cities['population'] < 1000000)]

In [None]:
cities[cities['area'] == 487.8] 
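
A few more filtering patterns can be useful. The sketch below is only an illustration on a copy of the cities data (`cities_demo` is a made-up name): `~` negates a condition, `.isin()` tests against a list of values, and `np.isclose()` is a safer alternative to `==` when comparing decimal numbers that come out of a calculation.

```python
import numpy as np
import pandas as pd

cities_demo = pd.DataFrame(
    {'population': [9787426, 2553379, 2440986, 1777934, 1209143],
     'area': [1737.9, 630.3, 598.9, 487.8, 368.5]},
    index=['London', 'Manchester', 'Birmingham', 'Leeds', 'Glasgow'])

# Rows whose index label is in a list of names
in_list = cities_demo[cities_demo.index.isin(['Leeds', 'Glasgow'])]

# Rows where the condition is NOT true
not_large = cities_demo[~(cities_demo['area'] > 500)]

# Rows where area is (almost) equal to a decimal number
close = cities_demo[np.isclose(cities_demo['area'], 487.8)]

print(in_list)
print(not_large)
print(close)
```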

<div class="alert alert-success">
 <b>EXERCISE</b> <br/> 
 With the above commands we can now start exploring the jeans DataFrame. Answer the following questions by writing some code (add as many cells as you need):
  <ul>
  <li>Find the most expensive and cheapest jeans brands</li>  
  <li>Calculate the difference in price between the cheapest and most expensive jeans</li>    
 </ul>  
</div>  

> *Tips*: 
- Find the maximum of a column with for instance `jeans['price'].max()` 
- Print a value with `print()` for instance: `print(jeans['price'][0])` for the price from the first row. If you calculate multiple values in one cell you will need this, else the answers will not be displayed.
- To see the answer uncomment the line in the cell that contains `%load` (by deleting the `#`) and then run the cell, but try to find your own solution first in the cell above the solution!


- Extract the value from a cell in a DataFrame with `.values[]`, for instance `jeans['price'].values[0]`
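
To see how these tips fit together without giving away the answer, here is a sketch of the same idea on the cities data instead of the jeans data (`cities_demo`, `largest_name` and `spread` are made-up names):

```python
import pandas as pd

cities_demo = pd.DataFrame(
    {'city': ['London', 'Manchester', 'Birmingham', 'Leeds', 'Glasgow'],
     'population': [9787426, 2553379, 2440986, 1777934, 1209143]})

# Filter on the maximum, then pull the single value out of the 'city' column
largest = cities_demo[cities_demo['population'] == cities_demo['population'].max()]
largest_name = largest['city'].values[0]

# Difference between the largest and smallest value in a column
spread = cities_demo['population'].max() - cities_demo['population'].min()

print(largest_name)
print(spread)
```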


In [None]:
# your answer:


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/pandas-workshop/master/answers/dsa_answer1.py

<a id="transform"></a>
## 2. Transform Data

When looking at data there are almost always transformations needed to get it in the format you need for your analysis, visualisations or models. 

These are only a few examples of the endless possibilities. The best way to learn is to find a dataset and try to answer questions with the data. The [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/) is very good, and on [StackOverflow](https://stackoverflow.com/questions/tagged/pandas) there is almost always someone who asked the same question already.

<a id="columns"></a>
### 2.1 Adding and deleting columns
A column can be added by assigning values to a new column name, and dropped again with `drop()`.

In [None]:
jeans['new'] = 1
jeans = jeans.drop(columns='new')

In [None]:
jeans['avgHeightFront'] = (jeans.maxHeightFront + jeans.minHeightFront) / 2

In [None]:
jeans.head()
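
An alternative that does not change the original DataFrame is `assign()`, which returns a new DataFrame with the extra column. The sketch below uses a tiny made-up table (`pockets`, with invented measurements) rather than the real jeans data:

```python
import pandas as pd

# Made-up measurements, only to illustrate the calls
pockets = pd.DataFrame({'minHeightFront': [12.0, 14.5],
                        'maxHeightFront': [16.0, 15.5]})

# assign() returns a new DataFrame with the extra column
pockets = pockets.assign(
    avgHeightFront=(pockets['maxHeightFront'] + pockets['minHeightFront']) / 2)

# drop() accepts a list to remove several columns at once
trimmed = pockets.drop(columns=['minHeightFront', 'maxHeightFront'])

print(pockets)
print(trimmed)
```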

<a id="cleaning"></a>
### 2.2 Cleaning Data

Things to check:

- Is the data tidy: each variable forms a column, each observation forms a row and  each type of observational unit forms a table.
- Are all columns in the right data format?
- Are there missing values?
- Are there unrealistic outliers?

Get a quick overview of the numeric data with `.describe()`. If any of the numeric columns is missing, this is probably because of a wrong data type.


In [None]:
jeans.describe()
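
Two of the checks above can be sketched in a few lines. The example below uses a small made-up table (`raw`, with invented brands and prices) in which the price was read as text; `pd.to_numeric()` with `errors='coerce'` converts it to numbers and turns anything unreadable into a missing value, which `isna()` can then count:

```python
import pandas as pd

# Invented data: the 'n/a' entry forces the price column to be read as text
raw = pd.DataFrame({'brand': ['A', 'B', 'C'],
                    'price': ['49.95', 'n/a', '80']})

# Convert to numbers; values that cannot be converted become NaN
raw['price'] = pd.to_numeric(raw['price'], errors='coerce')

print(raw.dtypes)
print(raw.isna().sum())     # missing values per column
print(raw['price'].mean())  # NaN is skipped by default
```

Whether to drop or fill the missing values afterwards depends on the analysis you have in mind.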

It is not always ideal to have text in the table, especially if you want to create a model from the data. You could replace the values of `style` with numbers, but is one style really twice as large as another? It is better to transform the data with `get_dummies()`. The below will add 4 new columns to the DataFrame:

In [None]:
jeans.head()

In [None]:
jeans2 = jeans.copy()
style = pd.get_dummies(jeans2['style'], drop_first=True)
jeans2 = jeans2.join(style)
jeans2.head(2)

Or do this all in one line of code:

In [None]:
jeans = jeans.join(pd.get_dummies(jeans['style'], drop_first=True))
jeans.head(2)
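
To see what `drop_first=True` does, compare the result with and without it on a small made-up column (`styles` is a hypothetical name, and the values are only examples). With `drop_first=True` the first category is left out, because it is implied when all other columns are zero or `False`:

```python
import pandas as pd

styles = pd.DataFrame({'style': ['skinny', 'straight', 'slim', 'skinny']})

full = pd.get_dummies(styles['style'])                      # one column per category
reduced = pd.get_dummies(styles['style'], drop_first=True)  # first category dropped

print(full)
print(reduced)
```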

<a id="merging"></a>
### 2.3 Merging Data

There are several ways to combine data. The [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) has lots of examples. You can combine data with for instance `pd.concat()` or `.merge()` (the older `DataFrame.append()` method was removed in pandas 2.0):

In [None]:
data = {'city':       ['London','Manchester','Birmingham','Leeds','Glasgow'],
        'population': [9787426,  2553379,     2440986,    1777934,1209143],
        'area':       [1737.9,   630.3,       598.9,      487.8,  368.5 ]}
cities = pd.DataFrame(data)

data2 = {'city':       ['Liverpool','Southampton'],
        'population': [864122,  855569],
        'area':       [199.6,   192.0]}
cities2 = pd.DataFrame(data2)

The new cities in `cities2` can be added with `pd.concat()` (`DataFrame.append()` was removed in pandas 2.0):

In [None]:
cities = pd.concat([cities, cities2])
cities
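
When rows from two DataFrames are stacked with `pd.concat()`, the original row labels are kept, so the index can contain duplicates. A minimal sketch with two small made-up tables (`a` and `b`) shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

a = pd.DataFrame({'city': ['London', 'Leeds'], 'area': [1737.9, 487.8]})
b = pd.DataFrame({'city': ['Liverpool'], 'area': [199.6]})

kept = pd.concat([a, b])                      # keeps the labels 0, 1, 0
fresh = pd.concat([a, b], ignore_index=True)  # renumbers the rows 0, 1, 2

print(kept)
print(fresh)
```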

In [None]:
data = {'city': ['London','Manchester','Birmingham','Leeds','Glasgow'],
        'density': [5630,4051,4076,3645,3390]}
cities3 = pd.DataFrame(data)

In [None]:
cities3

An extra column can be added with `.merge()`, using an outer join on the city names:

In [None]:
cities = pd.merge(cities, cities3, how='outer', sort=True,on='city')
cities
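
The `how` argument decides which rows survive a merge. The sketch below compares an outer and an inner join on two small made-up tables (`left` and `right`); `indicator=True` adds a `_merge` column that shows where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'city': ['London', 'Leeds', 'Liverpool'],
                     'area': [1737.9, 487.8, 199.6]})
right = pd.DataFrame({'city': ['London', 'Leeds'],
                      'density': [5630, 3645]})

# Outer join: keep all cities, fill missing density with NaN
outer = pd.merge(left, right, how='outer', on='city', indicator=True)

# Inner join: keep only cities that appear in both tables
inner = pd.merge(left, right, how='inner', on='city')

print(outer)
print(inner)
```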

Data that does not quite fit can be combined as well. `cities4` has no `density` column, so those values are filled with `NaN`:

In [None]:
data = {'city':       ['Newcastle','Nottingham'],
        'population': [774891,  729977],
        'area':       [180.5,   176.4]}

cities4 = pd.DataFrame(data)
cities4

In [None]:
cities = pd.concat([cities, cities4])
cities

<a id="grouping"></a>
### 2.4 Grouping Data

Grouping data is a quick way to calculate values for classes in your DataFrame. The examples below give you the mean values of all numeric variables for the 2 `cutout` classes, and the maximum values for each combination of `cutout` and `style`.

In [None]:
jeans.columns

In [None]:
jeans.groupby(['cutout']).mean(numeric_only=True)

In [None]:
jeans.groupby(['cutout','style']).max().head(10)
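
Several statistics can be calculated in one go with `.agg()`. The sketch below uses a small made-up table (`demo`, with invented prices) that borrows the column names `menWomen` and `price` from the jeans data:

```python
import pandas as pd

# Invented prices, only to illustrate the call
demo = pd.DataFrame({'menWomen': ['men', 'men', 'women', 'women'],
                     'price': [50.0, 70.0, 40.0, 100.0]})

# One row per group, one column per statistic
summary = demo.groupby('menWomen')['price'].agg(['mean', 'max', 'count'])
print(summary)
```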

<div class="alert alert-success">
 <b>EXERCISES</b> <br/> 
 Using the jeans DataFrame:
  <ul>
  <li>Add a column `FrontArea` with the area of the front pocket (height X width) </li>        
  <li>Add a column `BackArea` with the area of the back pocket (height X width) </li>        
  <li>Add two columns `men` and `women` with `get_dummies()` and keep the original `menWomen`</li>        
  <li>Using `groupby().count()`: what is the number of men's and women's jeans measured?</li>        
  <li>What are the average front and back pocket sizes of men's and women's jeans?</li>   
 
 </ul>  
</div>  

> *Tips*: 
- To find out how many unique values there are in a column use `np.unique(df['a'])`
- You can use `mean()`, `max()`, `min()`, `count()` and more with `groupby()`

In [None]:
# Your answers:


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/pandas-workshop/master/answers/dsa_answer2.py

In [None]:
# Your answers:


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/pandas-workshop/master/answers/dsa_answer3.py


<a id="explore"></a>
## 3. Visualizing Data

In [None]:
# with this instruction plots will be included in the notebook
%matplotlib inline

import matplotlib.pyplot as plt

The default plot is a line chart:

In [None]:
jeans['price'].plot();

To create a plot that makes more sense for this data have a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) for all options. A histogram might work better. Go ahead and change the number of bins until you think the number of bins looks right:

In [None]:
jeans['price'].plot.hist(bins=5);
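
It can help to know what `bins` does behind the scenes: the range of the data is cut into that many equally wide intervals and the values in each interval are counted. A minimal sketch with NumPy on made-up prices (`prices` is not the jeans data):

```python
import numpy as np

# Invented prices, only to illustrate the binning
prices = np.array([20, 25, 30, 35, 40, 60, 80, 100, 150, 250], dtype=float)

# 5 bins give 6 edges, evenly spaced between the minimum and the maximum
counts, edges = np.histogram(prices, bins=5)

print(edges)
print(counts)
```

With few bins most values end up in the first bar. More bins show more detail, but too many leave mostly empty bars.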

Change the size of the plot with `figsize`:

In [None]:
jeans['price'].plot.hist(bins=15,figsize=(10,5));

You can select data as you learned earlier directly in a plot command. The plot below shows only the men's jeans:

In [None]:
jeans['price'][jeans['menWomen']=='men'].plot.hist(bins=15,figsize=(10,5));

To add the women's jeans, simply repeat the plot command with a different selection of the data:

In [None]:
jeans['price'][jeans['menWomen']=='men'].plot.hist(bins=15,figsize=(10,5));
jeans['price'][jeans['menWomen']=='women'].plot.hist(bins=15,figsize=(10,5));

The above plot is difficult to read as the histograms overlap. You can fix this by changing the colours and making them transparent. To add a legend, each histogram needs to be assigned to an object `ax` that is then used to create the legend:

In [None]:
ax = jeans['price'][jeans['menWomen']=='men'].plot.hist(
    bins=15,figsize=(10,5),alpha=0.5,color='#1A4D3B');
ax = jeans['price'][jeans['menWomen']=='women'].plot.hist(
    bins=15,figsize=(10,5),alpha=0.5,color='#4D1A39');
ax.legend(['men','women']);

It is easy to change pretty much everything as in the below code. This was the ugliest I could come up with. Can you make it worse?

In [None]:
jeans['price'].plot.hist(
    bins=15, 
    title="Jeans Price",
    legend=False,
    fontsize=14,
    grid=False,
    linestyle='--',
    edgecolor='black',
    color='darkred',
    linewidth=3);

You can use `groupby()` in combination with a bar plot to visualize the price by style:

In [None]:
style = jeans['price'].groupby(jeans['style']).mean()
ax=style.plot.bar();
ax.set_ylabel('Jeans Price');

## Seaborn

Seaborn is an easy to use visualisation package that works well with Pandas DataFrames. Below are a few examples, but have a look at the [documentation](https://seaborn.pydata.org/index.html) as there are many more plots you could make. 

In [None]:
import seaborn as sns

In [None]:
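# distplot() is deprecated in recent seaborn versions; histplot() with kde=True is the replacement
sns.histplot(jeans['price'], kde=True);

In [None]:
sns.histplot(np.array(jeans['price']), kde=True);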

In [None]:
sns.catplot(x='menWomen', y='price', data=jeans);

In [None]:
sns.catplot(x='menWomen', y='price', hue='style', kind='swarm', data=jeans);

In [None]:
sns.catplot(x="style", y="price", kind="box", data=jeans);

In [None]:
sns.catplot(x="style", y="price", hue="menWomen", kind="box", data=jeans);

In [None]:
ax=sns.scatterplot(y='BackArea', x='price', data=jeans)
ax=sns.scatterplot(y='FrontArea', x='price', data=jeans)
ax.set_ylabel('Pocket Size');
ax.legend(['Back pocket','Front pocket']);

<div class="alert alert-success">
 <b>EXERCISE</b>
 <ul>
  <li>Create two histograms that compare the sizes of pockets between men's and women's jeans with `.plot.hist()`</li>
  <li>Create a bar plot with the size of the front pockets for men and women with `.plot.bar()`</li>
  <li>Create a bar plot with the size of the front pockets for men and women with `seaborn`</li>
  <li>Customize one of the plots you have made so far or create a new one</li>
 </ul> 
</div>    

 
> Tip: to add two histograms to one plot you can repeat `.plot()` in the same cell 


In [None]:
# histogram front pockets


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/pandas-workshop/master/answers/answer9.py

In [None]:
# bar plot back pockets


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/pandas-workshop/master/answers/answer10.py


In [None]:
# bar plot back pockets (seaborn)


In [None]:
# %load https://raw.githubusercontent.com/IBMDeveloperUK/pandas-workshop/master/answers/answer11.py

<a id="extra"></a>
## 4. Optional Exercises

If you finish early:

1. Have a look at [Call for Code](https://callforcode.org/) which is running again this year. Notebooks and Pandas are tools you could use in the challenge.
2. Try to create other plots. Have a look at the [Pandas plot examples](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) or the [Seaborn gallery](https://seaborn.pydata.org/examples/index.html) for inspiration.  
3. Or load one of your own datasets into a notebook and play around with the data to practice what you have learned. 
4. Have a look at these Pandas workshops and book: <br>
4.1. [Pandas workshop by Alexander Hendorf](https://github.com/alanderex/pydata-pandas-workshop) <br>
4.2. [Pandas tutorial by Joris van den Bossche](https://github.com/jorisvandenbossche/pandas-tutorial) <br>
4.3. [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) <br>

### Author
Margriet Groenendijk works as a Data & AI Developer Advocate for IBM. She develops and presents talks and workshops about data science and AI. She is active in the local developer communities through attending, presenting and organising meetups. She has a background in climate science where she explored large observational datasets of carbon uptake by forests during her PhD, and global scale weather and climate models as a postdoctoral fellow. 

Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.