<p><font size="6"><b> 02 - Pandas: Basic operations on Series and DataFrames</b></font></p>



In [2]:
%matplotlib inline

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

As you play around with DataFrames, you'll notice that many operations that work on NumPy arrays also work on DataFrames.


In [4]:
# redefining the example objects

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [5]:
countries.head()

          country  population    area    capital
0         Belgium        11.3   30510   Brussels
1          France        64.3  671308      Paris
2         Germany        81.3  357050     Berlin
3     Netherlands        16.9   41526  Amsterdam
4  United Kingdom        64.9  244820     London


# The 'new' concepts

## Element-wise operations

Just like with NumPy arrays, many operations are element-wise:

In [6]:
population / 100

Germany           0.813
Belgium           0.113
France            0.643
United Kingdom    0.649
Netherlands       0.169
dtype: float64

In [7]:
countries['population'] / countries['area']

0    0.000370
1    0.000096
2    0.000228
3    0.000407
4    0.000265
dtype: float64

In [8]:
np.log(countries['population'])

0    2.424803
1    4.163560
2    4.398146
3    2.827314
4    4.172848
Name: population, dtype: float64

The result can be added as a new column as follows:

In [9]:
countries["log_population"] = np.log(countries['population'])

In [10]:
countries.columns

Index(['country', 'population', 'area', 'capital', 'log_population'], dtype='object')

In [11]:
countries['population'] > 40

0    False
1     True
2     True
3    False
4     True
Name: population, dtype: bool

<div class="alert alert-info">

<b>REMEMBER</b>:

 <ul>
  <li>When an operation does NOT work element-wise, or you do not know how to express it directly in pandas, use the <b>apply()</b> method</li>
  <li>A typical use case is a custom-written function or a <b>lambda</b> function</li>
</ul>
</div>

In [12]:
countries["population"].apply(np.log) # np.log already works element-wise, so apply is not required here

0    2.424803
1    4.163560
2    4.398146
3    2.827314
4    4.172848
Name: population, dtype: float64

In [13]:
countries["capital"].apply(lambda x: len(x)) # same result as countries["capital"].str.len()

0    8
1    5
2    6
3    9
4    6
Name: capital, dtype: int64

In [14]:
def population_annotater(population):
    """annotate as large or small"""
    if population > 50:
        return 'large'
    else:
        return 'small'

In [13]:
countries["population"].apply(population_annotater) # a custom user function

0    small
1    large
2    large
3    small
4    large
Name: population, dtype: object
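
For a simple two-way rule like this one, a vectorized expression can stand in for `apply`. The following is a minimal sketch, assuming NumPy's `np.where` is acceptable here; the column name `size_label` is made up for illustration and does not appear in the original notebook.

```python
import numpy as np
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
})

def population_annotater(population):
    """annotate as large or small"""
    return 'large' if population > 50 else 'small'

# the rule applied value by value ...
with_apply = countries['population'].apply(population_annotater)

# ... and the same rule evaluated on the whole column at once
# ('size_label' is a hypothetical column name)
countries['size_label'] = np.where(countries['population'] > 50, 'large', 'small')

print(countries[['country', 'size_label']])
```

`apply` remains the more flexible choice once the rule has more branches or needs arbitrary Python logic.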

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Calculate the population numbers relative to Belgium</li>
</ul>
</div>
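
One possible approach to this exercise, sketched on the `population` Series defined earlier: because its index holds the country names, the value for Belgium can be looked up by label and used as a scalar divisor. The variable name `relative_to_belgium` is only illustrative.

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# dividing a Series by a scalar is element-wise, so every country
# is expressed as a multiple of the Belgian population
relative_to_belgium = population / population['Belgium']

print(relative_to_belgium)
```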

In [15]:
countries["population"].mean()  # call the method; without the parentheses only the bound method object is displayed

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Calculate the population density for each country and add this as a new column to the dataframe.</li>
</ul>
</div>
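
A minimal sketch of one way to do this, assuming that `population` is given in millions and `area` in km² (the notebook does not state the units, so the scaling factor is an assumption). The column name `density` is likewise only a suggestion.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
})

# element-wise division of two columns; the factor 1e6 converts
# "millions of inhabitants" to "inhabitants" (assumed units)
countries['density'] = countries['population'] * 1e6 / countries['area']

print(countries[['country', 'density']])
```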

<div class="alert alert-danger">

**WARNING**: **Alignment!** (unlike NumPy)

 <ul>
  <li>Pay attention to <b>alignment</b>: operations between Series align on the index, not on position:</li>
</ul>

</div>

In [None]:
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

In [None]:
s1

In [None]:
s2

In [None]:
s1 + s2
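
To make the effect of alignment concrete, the sketch below repeats the addition and inspects the result. Labels that occur in only one of the two Series produce `NaN`. The `fill_value` variant is shown as one possible way to treat a missing label as zero; whether that is appropriate depends on the data. The names `total` and `filled` are illustrative.

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

# values are matched on their index label, not on their position;
# 'Belgium' and 'Germany' each occur in only one operand and become NaN
total = s1 + s2

# Series.add with fill_value=0 treats a missing label as 0 instead
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```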

## Aggregations (reductions)

Pandas provides a large set of **summary** functions that operate on different kinds of pandas objects (DataFrames, Series, Index) and produce a single value. When applied to a DataFrame, the result is returned as a pandas Series (one value for each column).

The average population number:

In [None]:
population.mean()

The minimum area:

In [None]:
countries['area'].min()

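For DataFrames, usually only the numeric columns are of interest. Older pandas versions dropped non-numeric columns silently; recent versions expect this to be requested explicitly with `numeric_only=True`:

In [None]:
countries.median(numeric_only=True)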
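
As a minimal sketch of the DataFrame case: selecting the numeric columns by name before aggregating makes the choice explicit, and the result is a Series with one value per selected column. The variable name `medians` is only illustrative.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
    'population': [11.3, 64.3, 81.3, 16.9, 64.9],
    'area': [30510, 671308, 357050, 41526, 244820],
})

# aggregate only the columns for which a median is meaningful
medians = countries[['population', 'area']].median()

# the result is a Series indexed by the column names
print(type(medians))
print(medians['population'])
```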

# Application on a real dataset

Reading in the titanic data set...

In [4]:
df = pd.read_csv("data/titanic.csv")

Quick exploration first...

In [None]:
df.head()

In [None]:
len(df)

The available metadata of the Titanic data set provides the following information:

VARIABLE   |  DESCRIPTION
------ | --------
survival       | Survival (0 = No; 1 = Yes)
pclass         | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name           | Name
sex            | Sex
age            | Age
sibsp          | Number of Siblings/Spouses Aboard
parch          | Number of Parents/Children Aboard
ticket         | Ticket Number
fare           | Passenger Fare
cabin          | Cabin
embarked       | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>What is the average age of the passengers?</li>
</ul>

</div>
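
One possible approach, sketched on a tiny made-up DataFrame rather than the real file: the values below are hypothetical, and the column name `Age` is an assumption about how the CSV is laid out. The point of the sketch is that `mean()` skips missing values by default.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the Titanic data (not the real values)
df = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0]})

# NaN entries are ignored, so the mean is taken over three values
average_age = df['Age'].mean()

print(average_age)
```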

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>Plot the age distribution of the Titanic passengers</li>
</ul>
</div>
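
A histogram is one common way to show a distribution. The sketch below uses made-up ages and assumes the column is called `Age`; the number of bins is an arbitrary choice. It relies on the pandas plotting interface, which needs matplotlib to be installed.

```python
import pandas as pd

# hypothetical stand-in for the Titanic data (not the real values)
df = pd.DataFrame({'Age': [2.0, 22.0, 24.0, 35.0, 38.0, 54.0, 71.0]})

# Series.plot.hist draws a histogram and returns the matplotlib Axes
ax = df['Age'].plot.hist(bins=7)
ax.set_xlabel('Age')
```

In a notebook with `%matplotlib inline`, the figure is displayed below the cell.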

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>What is the survival rate? (the relative number of people that survived)</li>
</ul>

Note: the 'Survived' column indicates whether someone survived (1) or not (0).
</div>
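
Because the `Survived` column holds only 0 and 1, its mean equals the fraction of passengers that survived. A minimal sketch on made-up values (the data below are hypothetical, not taken from the real file):

```python
import pandas as pd

# hypothetical stand-in for the Titanic data (not the real values)
df = pd.DataFrame({'Survived': [0, 1, 1, 0, 0]})

# the mean of a 0/1 column is the share of ones
survival_rate = df['Survived'].mean()

# the same number, written out as "survivors divided by passengers"
survival_rate_explicit = df['Survived'].sum() / len(df)

print(survival_rate)
```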

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>What is the maximum Fare? And the median?</li>
</ul>
</div>

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
  <li>Calculate the 75th percentile (<code>quantile</code>) of the Fare price (Tip: see the docstring for how to specify the percentile)</li>
</ul>
</div>
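
`Series.quantile` takes the quantile as a fraction between 0 and 1, so the 75th percentile is requested as `0.75`. A minimal sketch on made-up fares (hypothetical values, chosen so the result is easy to check by hand):

```python
import pandas as pd

# hypothetical stand-in for the Titanic data (not the real values)
df = pd.DataFrame({'Fare': [10.0, 20.0, 30.0, 40.0, 50.0]})

# q is a fraction, not a percentage
fare_q75 = df['Fare'].quantile(0.75)

print(fare_q75)
```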

<div class="alert alert-success">
<b>EXERCISE</b>:

 <ul>
  <li>Calculate the normalized Fares (relative to their mean)</li>
</ul>
</div>
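
One reading of "normalized" here is each fare divided by the mean fare, which combines an aggregation with an element-wise operation. A minimal sketch under that assumption, on made-up values (the variable name `normalized_fare` is illustrative):

```python
import pandas as pd

# hypothetical stand-in for the Titanic data (not the real values)
df = pd.DataFrame({'Fare': [10.0, 20.0, 30.0, 40.0, 50.0]})

# the mean is a single number, so the division is applied to every row
normalized_fare = df['Fare'] / df['Fare'].mean()

print(normalized_fare)
```

By construction, the normalized values average to 1.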