## **Pandas - Filtering, sorting, and modifying DataFrames**

> **Key skills demonstrated**
> - Defining functions
> - Creating new columns in a DataFrame
> - Filtering a DataFrame
> - **if** and **else** statements
>
>
> **Methods & Functions used**\
>\
>`.read_csv()` `.sample()` `.set_index()` `inplace=True` `.drop()` `.mean()` \
>`.head()` `.value_counts()` `.loc` `.dtypes` `.apply()` `.copy()` `round()` 


***

Import Pandas

In [1]:
import pandas as pd         #Library for analysing and manipulating data 

***

Importing the file `libraries.csv`, assigning the result to `df`:

In [3]:
df = pd.read_csv('/home/tmbillington/Portfolio/Data/libraries.csv')
df.sample(1)

Unnamed: 0,Library service,Library name,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Notes,Weekly hours open,Weekly hours staffed
637,Derbyshire,Bolsover,yes,yes,LAL,,,,,,49.0,49.0
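
As a minimal, self-contained sketch of the two calls above: the snippet reads a small inline CSV through `io.StringIO` and passes `random_state` to `.sample()` so that the same row comes back on every run. The rows and the names `csv_text` and `demo` are invented for illustration and are not taken from `libraries.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for libraries.csv; the rows are invented for illustration
csv_text = """Library service,Library name,In use 2016
Service A,North,yes
Service A,South,no
Service B,East,Yes
"""

demo = pd.read_csv(io.StringIO(csv_text))

# random_state fixes the random seed, so repeated runs pick the same row
row_a = demo.sample(1, random_state=0)
row_b = demo.sample(1, random_state=0)

print(row_a)
print(row_a.equals(row_b))
```

Without `random_state`, `.sample(1)` may return a different row each time the cell is run.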


Using the `.set_index()` method to make the `Library name` column the index of `df`. The parameter `inplace=True` modifies `df` directly instead of returning a new DataFrame:

In [4]:
df.set_index('Library name', inplace = True)

In [5]:
df.head(1)

Library name,Library service,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Notes,Weekly hours open,Weekly hours staffed
Barking,Barking and Dagenham,Yes,Yes,LAL,,,,,,72.0,72.0
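
For comparison, here is a small sketch of the same step without `inplace=True`, where the result is assigned to a new name. The toy data and the names `demo` and `indexed` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    "Library name": ["North", "South"],
    "In use 2016": ["yes", "no"],
})

# Without inplace=True, set_index() returns a new DataFrame and leaves demo unchanged
indexed = demo.set_index("Library name")

print(list(demo.columns))     # 'Library name' is still a column here
print(indexed.index.name)     # and is the index here
```

Either style works; assigning the result keeps the original DataFrame available if a later step needs it.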


***

Using the `.drop()` method to remove the `Notes` column:
- `axis=1` to specify that `Notes` is a column rather than a row label
- `inplace=True` to modify `df` directly, making the change permanent

In [6]:
df.drop('Notes', axis = 1, inplace = True)

In [7]:
df.head(1)

Library name,Library service,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Weekly hours open,Weekly hours staffed
Barking,Barking and Dagenham,Yes,Yes,LAL,,,,,72.0,72.0
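
One practical point: re-running an in-place `.drop()` cell raises a `KeyError`, because the column is already gone. The sketch below shows an alternative spelling using the `columns=` and `errors=` parameters of `.drop()`. The toy data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    "Library service": ["Service A", "Service B"],
    "Notes": ["", "temporary site"],
    "Weekly hours open": [40.0, 20.0],
})

# columns= names the axis explicitly, so axis=1 is not needed
trimmed = demo.drop(columns="Notes")

# errors="ignore" makes a repeated drop a no-op instead of raising KeyError
trimmed_again = trimmed.drop(columns="Notes", errors="ignore")

print(list(trimmed.columns))
```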


***

## **Creating a function**

The cell below defines a function called `open_2016`. This function takes two parameters called `data` and `name`. 

The function uses the `.loc` indexer to return the value in the `In use 2016` column for the row whose index is equal to `name`.

In [8]:
def open_2016(data, name):
    
    val = data.loc[name, 'In use 2016']
    
    return val 

The function looks up the **'In use 2016'** column and returns the stored value, which is a string such as `Yes` or `No` rather than a Boolean:

In [9]:
print(open_2016(df, 'Barking'))
print(open_2016(df, 'Fanshawe'))
print(open_2016(df, 'Marks Gate'))

Yes
No
Yes
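
A lookup with `.loc` raises a `KeyError` when the name is not in the index. The sketch below is one possible variation that returns a default value instead. The function name `open_2016_or_default`, its `default` parameter, and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data indexed by library name
demo = pd.DataFrame(
    {"In use 2016": ["Yes", "No"]},
    index=pd.Index(["North", "South"], name="Library name"),
)

def open_2016_or_default(data, name, default=None):
    """Return the 'In use 2016' value for `name`, or `default` if the name is not in the index."""
    if name not in data.index:
        return default
    return data.loc[name, "In use 2016"]

print(open_2016_or_default(demo, "North"))
print(open_2016_or_default(demo, "Nowhere", default="unknown"))
```

If a name appears more than once in the index, `.loc` returns a Series of all matching values rather than a single value.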


***

## **Working with the data**

Using the `.value_counts()` method on the `df['In use 2010']` column to see which distinct values it contains. Note that the same answer appears with inconsistent capitalisation (`yes`/`Yes` and `no`/`No`):

In [51]:
df['In use 2010'].value_counts()

In use 2010
yes    2807
Yes     256
no      163
No       49
Name: count, dtype: int64
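
One way to deal with the mixed capitalisation, sketched here on invented values, is to normalise the strings before counting. This is an alternative to the approach taken in this notebook, not a step that was run on the real data.

```python
import pandas as pd

# Hypothetical values mirroring the mixed capitalisation seen above
in_use = pd.Series(["yes", "Yes", "no", "No", "yes"], name="In use 2010")

# Strip surrounding whitespace and lower-case every entry before counting
normalised = in_use.str.strip().str.lower()
counts = normalised.value_counts()

print(counts)
```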

In the cell below, a function called `is_open()` has been defined, which takes a single value and returns a Boolean value as follows:
- `True` if the value equals `'yes'` or `'Yes'`
- `False` for any other value

In [53]:
def is_open(entry):
    
    if entry in ['yes', 'Yes']:
        return True
    else:
        return False

The following cell contains some examples that allow us to test the logic in our function:

In [55]:
is_open('No'), is_open('yes'), is_open('no'), is_open('Yes')

(False, True, False, True)

Using the `.apply()` method to run the `is_open()` function on each of the columns `In use 2010` and `In use 2016`, new columns called `open_2010` and `open_2016` are created, each containing the Boolean values returned by the function:

In [57]:
df['open_2010'] = df['In use 2010'].apply(is_open)
df['open_2016'] = df['In use 2016'].apply(is_open)

In [61]:
df.head(1)

Library name,Library service,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Weekly hours open,Weekly hours staffed,open_2010,open_2016
Barking,Barking and Dagenham,Yes,Yes,LAL,,,,,72.0,72.0,True,True


In [58]:
df[['open_2010', 'open_2016']].dtypes

open_2010    bool
open_2016    bool
dtype: object

In [62]:
df['open_2010'].value_counts(), df['open_2016'].value_counts()

(open_2010
 True     3063
 False     214
 Name: count, dtype: int64,
 open_2016
 True     2763
 False     514
 Name: count, dtype: int64)
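
For a simple membership test like this, pandas also offers the vectorised `.isin()` method. The sketch below compares it with `.apply()` on invented values, including a missing entry. The `is_open()` function is repeated so that the snippet runs on its own.

```python
import pandas as pd

def is_open(entry):
    # Same logic as the function defined earlier in the notebook
    if entry in ['yes', 'Yes']:
        return True
    else:
        return False

# Hypothetical values, including one missing entry
in_use = pd.Series(["yes", "Yes", "no", "No", None])

via_apply = in_use.apply(is_open)           # calls is_open() once per value
via_isin = in_use.isin(["yes", "Yes"])      # vectorised membership test

print(via_apply.equals(via_isin))
```

Both approaches treat the missing entry as `False`, since it is neither `'yes'` nor `'Yes'`.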

I have created a new column in `df` called `open_both`, which contains a Boolean `True` for entries where both `open_2010` and `open_2016` are `True`:

In [63]:
df['open_both'] = df['open_2010'] & df['open_2016']
df.head(3)

Library name,Library service,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Weekly hours open,Weekly hours staffed,open_2010,open_2016,open_both
Barking,Barking and Dagenham,Yes,Yes,LAL,,,,,72.0,72.0,True,True,True
Castle Green,Barking and Dagenham,Yes,No,,XL,Mar-13,,,,,True,False,False
Dagenham,Barking and Dagenham,No,Yes,LAL,,,Oct-10,Yes,56.0,56.0,False,True,False


In [64]:
df['open_both'].value_counts()

open_both
True     2637
False     640
Name: count, dtype: int64

I have assigned to `df_open` a DataFrame containing only entries where `open_both` is `True`:
- I used the `.copy()` method so that `df_open` can be subsequently modified without affecting `df`

In [67]:
df_open = df[df['open_both']].copy()   #A Boolean column can be used directly as the filter
df_open.head(2)

Library name,Library service,In use 2010,In use 2016,Type of library,Type of closed library,Closed,New building,Replace existing,Weekly hours open,Weekly hours staffed,open_2010,open_2016,open_both
Barking,Barking and Dagenham,Yes,Yes,LAL,,,,,72.0,72.0,True,True,True
Marks Gate,Barking and Dagenham,Yes,Yes,CRL,,,,,9.0,9.0,True,True,True


In [68]:
len(df_open) == df['open_both'].value_counts()[True]

True
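
The filtering pattern above can be sketched on a toy DataFrame to show what `&` and `.copy()` each contribute. The data and the `checked` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame(
    {"open_2010": [True, True, False], "open_2016": [True, False, True]},
    index=["North", "South", "East"],
)

# & compares the two Boolean columns row by row
both = demo["open_2010"] & demo["open_2016"]

# A Boolean Series can be used directly as a filter; .copy() gives an independent DataFrame
subset = demo[both].copy()
subset["checked"] = True      # modifies the copy only

print(both.tolist())
print(list(demo.columns))
```

Because `subset` is a copy, adding the `checked` column leaves `demo` unchanged.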

***

## **Working with data**

Calculating and rounding the `.mean()` of the values in the `Weekly hours open` column:

In [71]:
round(df_open['Weekly hours open'].mean(), 2)

36.99

Calculating the percentage of entries in `df_open` where `Weekly hours open` is `0`:

In [73]:
wklyhrs_zero = df_open['Weekly hours open'] == 0   #Create a mask

In [78]:
zerohours = df_open[wklyhrs_zero]                  #Apply the mask to the df
print(zerohours.shape)
print(df_open.shape)

(56, 13)
(2637, 13)


In [100]:
percentage = round(zerohours.shape[0] / df_open.shape[0] * 100, 2)  #56 / 2637 * 100 = 2.12%
print(f"{percentage}% of libraries where their 'weekly hours open' is 0")

2.12% of libraries where their 'weekly hours open' is 0
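
The same percentage can also be computed straight from the mask, because the mean of a Boolean Series is the fraction of `True` values. A minimal sketch on invented numbers:

```python
import pandas as pd

# Hypothetical weekly hours, two of the five entries being zero
hours = pd.Series([0.0, 40.0, 20.0, 0.0, 35.5], name="Weekly hours open")

zero_mask = hours == 0                         # True where the value is 0
pct_zero = round(zero_mask.mean() * 100, 2)    # fraction of True values, as a percentage

print(pct_zero)
```

Missing values compare as `False` in the mask, so they still count towards the denominator, just as they do when dividing by `df_open.shape[0]`.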


***

## **Converting values to NaN**

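`NaN` ('Not a Number') values represent missing data (of all types, not just numeric) in pandas. Calculations such as `.mean()` skip `NaN` values by default, whereas a `0` is counted like any other number.

`NaN` is provided by the `numpy` package as `np.nan`.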

In [86]:
import numpy as np

I have defined a function called `convert_zero_to_nan()`, which takes a single argument (`x`) and returns the following values:
- `np.nan` if `x` is equal (`==`) to `0`
- Otherwise, return `x` unchanged

In [87]:
def convert_zero_to_nan(x):
    if x == 0:
        return np.nan
    else:
        return x

I can then `.apply()` the `convert_zero_to_nan` function to modify the `Weekly hours open` column:

In [None]:
df_new = df_open.copy()       # Create a copy of the df_open DataFrame first
df_new['Weekly hours open'] = df_new['Weekly hours open'].apply(convert_zero_to_nan)

In [92]:
df_new['Weekly hours open'].value_counts()

Weekly hours open
31.00     62
40.00     61
20.00     60
15.00     53
27.00     52
          ..
131.75     1
168.00     1
83.50      1
218.50     1
73.50      1
Name: count, Length: 192, dtype: int64
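
pandas also has a built-in `.replace()` method that performs this kind of substitution without a custom function. The sketch below checks, on invented values, that it gives the same result as `.apply(convert_zero_to_nan)`. The function is repeated so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def convert_zero_to_nan(x):
    # Same logic as the function defined earlier in the notebook
    if x == 0:
        return np.nan
    else:
        return x

# Hypothetical weekly hours
hours = pd.Series([0.0, 40.0, 20.0, 0.0])

via_apply = hours.apply(convert_zero_to_nan)
via_replace = hours.replace(0, np.nan)      # built-in equivalent

print(via_apply.equals(via_replace))
```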

Rerunning my earlier code to calculate the `.mean()` of `Weekly hours open`, we can now see that the mean has risen from 36.99 to 37.84, because `.mean()` skips `NaN` values by default:

In [101]:
round(df_new['Weekly hours open'].mean(), 2)

37.84
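
The effect is easy to see on a tiny example with invented numbers: a zero pulls the mean down, while a `NaN` is left out of the calculation unless `skipna=False` is passed.

```python
import numpy as np
import pandas as pd

with_zeros = pd.Series([0.0, 10.0, 20.0])
with_nan = pd.Series([np.nan, 10.0, 20.0])

print(with_zeros.mean())               # 10.0: the zero is included in the average
print(with_nan.mean())                 # 15.0: the NaN is skipped by default
print(with_nan.mean(skipna=False))     # nan: the missing value propagates
```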