# Additional practice

Use this notebook to read through and execute cells, testing what you've learnt in class with the other notebook and experimenting yourself with the data in the imported dataframes.

## Initial set up

Import relevant libraries:

In [None]:
import numpy as np
import pandas as pd
import os
import random

Mount Drive content:

In [None]:
drive_loc = '/content/gdrive'
files_loc = os.path.join(drive_loc, 'MyDrive', 'pdsfiles')

from google.colab import drive
drive.mount(drive_loc)

In [None]:
!mkdir -p {files_loc}

# Using `iloc` and `loc` to select rows and columns in Pandas DataFrames

Remember, `ix` is deprecated as of Pandas 0.20, so we'll be using `loc` and `iloc`.

In [None]:
!wget https://bit.ly/ks-pds-csv4 -O {files_loc}/uk_data.csv
contents = !ls {files_loc}/*uk_data*
uk_data_file = contents[0]

In [None]:
# read the data from a CSV file.
data = pd.read_csv(uk_data_file)
# set a numeric id for use as an index for examples.
data['id'] = [random.randint(0,1000) for x in range(data.shape[0])]
 
data.head(5)

In [None]:
data.shape

## Using `iloc`

Let's do single selections using iloc for dataframes, starting with the rows.

This selects the first row of the data frame (note the Series data type output):

In [None]:
data.iloc[0]

Let's now select the second row of the data frame:

In [None]:
data.iloc[1]

We can do the last row as well, using the familiar Python syntax for it:

In [None]:
data.iloc[-1]

Let's do the same by columns:

In [None]:
data.iloc[:,0] # first column of data frame (first_name)

In [None]:
data.iloc[:,1] # second column of data frame (last_name)

In [None]:
data.iloc[:,-1] # last column of data frame (id)

Multiple columns and rows can be selected together using the .iloc indexer and Python slices syntax:

In [None]:
data.iloc[0:5] # first five rows of dataframe

In [None]:
data.iloc[:, 0:2] # first two columns of data frame with all rows

In [None]:
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.

In [None]:
data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame (county -> phone1).

There are two things to consider when using `iloc` like this:

- `.iloc` returns:
  - a Pandas Series when a single row or a single column is selected with an integer
  - a Pandas DataFrame when both the row and the column selectors are lists or slices.
  
  If you require DataFrame output, pass a single-valued list.

  When using `.loc`, or `.iloc`, you can then control the output format by passing *lists* or *single values* to the selectors.

- When selecting multiple columns or multiple rows with a slice, remember that your selection will go from the first number to the second number minus one, as is usual with Python slices.
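The two points above can be checked on a small frame. This is a minimal sketch: the `toy` frame and its column names are made up for illustration and are not taken from `uk_data.csv`.

```python
import pandas as pd

# Hypothetical toy data, only used to illustrate the return types.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [31, 45, 27],
    "city": ["Leeds", "York", "Bath"],
})

row_as_series = toy.iloc[0]      # single integer -> Series
row_as_frame = toy.iloc[[0]]     # single-element list -> one-row DataFrame
col_as_series = toy.iloc[:, 0]   # single integer column -> Series
first_two = toy.iloc[0:2]        # slice stops before position 2

print(type(row_as_series).__name__, type(row_as_frame).__name__)
print(first_two.shape)
```

Passing a list instead of a bare integer is what switches the output from a Series to a DataFrame.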

In practice, it is better to use `.loc`.


## Using `loc`

The Pandas `loc` indexer can be used with DataFrames for two different use cases:

1. Selecting rows by label/index
2. Selecting rows with a boolean/conditional lookup

The `loc` indexer is used with the same syntax as iloc: `data.loc[<row selection>, <column selection>]`.



### Label-based / Index-based indexing using `loc`

Selections using the loc method are based on the index of the data frame (if defined). Where the index is set on a DataFrame, using `df.set_index()`, the `.loc` method directly selects based on index values of any rows.

For example, setting the index of our test data frame to the person's "last_name":

In [None]:
data.set_index('last_name', inplace=True)
data.head()

Now with this new index, we can directly select rows for different "last_name" values using `.loc[<label>]`. For example:

In [None]:
data.loc['Andrade']

In [None]:
data.loc[['Andrade','Veness']]

Note that the first example returns a Series (as long as the label appears only once in the index), and the second returns a DataFrame. You can get a single-row DataFrame by passing a single-element list to the `.loc` operation.

Select columns with `.loc` using the names of the columns. In most data work the columns are named, so use these named selections.

In [None]:
data.loc[['Andrade','Veness'], ['first_name','address', 'city']]

You can select ranges of index labels: the selection `data.loc['Bruch':'Julio']` will return all rows in the data frame between the index entries for “Bruch” and “Julio”, both included.
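Label slices behave differently from positional slices at the end point. The sketch below uses a hypothetical four-row frame with unique last names as the index (an assumption; duplicated labels can make label slicing fail) to contrast the two.

```python
import pandas as pd

# Hypothetical frame indexed by unique last names.
toy = pd.DataFrame(
    {
        "first_name": ["Ana", "Ben", "Cal", "Dee"],
        "city": ["Leeds", "York", "Bath", "Hull"],
    },
    index=["Adams", "Bruch", "Julio", "Veness"],
)

by_label = toy.loc["Bruch":"Julio"]   # both end labels are included
by_position = toy.iloc[1:2]           # the stop position is excluded

print(list(by_label.index))
print(list(by_position.index))
```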

The following examples should now make sense:

In [None]:
# Select rows with index values 'Andrade' and 'Veness', with all columns between 'city' and 'email'
data.loc[['Andrade', 'Veness'], 'city':'email']

In [None]:
# Select all rows from 'Andrade' through 'Veness' (label slice), with just 'first_name', 'address' and 'city' columns
data.loc['Andrade':'Veness', ['first_name', 'address', 'city']]
 

Now, we reset the index before setting a new one. If we don't, `set_index` replaces the current index and the `last_name` values are dropped from the DataFrame:

In [None]:
data.reset_index(inplace=True)
data.head()

In [None]:
data.set_index('id', inplace=True)
data.head()

In [None]:
# select the row with 'id' = 487 (ids are random, so use a value that appears in data.index to avoid a KeyError)
data.loc[487]

Note that in the last example, `data.loc[487]` (the row with index value 487) **is not equal** to `data.iloc[487]` (the row at position 487, that is, the 488th row in the data). The index of the DataFrame can be out of numeric order, and/or a string or multi-value.

Remember that you can also recover the row in a DataFrame format if you specify a list instead of the raw element:

In [None]:
data.loc[[487]]
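To make the difference between label and position concrete, here is a small sketch with a hypothetical integer index that is deliberately out of order. Since the ids in `data` are drawn at random, a given value may be absent, so the sketch also shows a membership check before the lookup.

```python
import pandas as pd

# Hypothetical frame whose integer index is not in positional order.
toy = pd.DataFrame({"first_name": ["Ana", "Ben", "Cal"]}, index=[487, 2, 0])
toy.index.name = "id"

by_label = toy.loc[0]         # the row whose id is 0
by_position = toy.iloc[0]     # the first row, whatever its id
has_999 = 999 in toy.index    # check before calling .loc to avoid a KeyError

print(by_label["first_name"], by_position["first_name"], has_999)
```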

### Boolean indexing

Conditional selection with boolean arrays using `data.loc[<selection>]` is a common method to use with Pandas DataFrames. With boolean indexing or logical selection, you pass an array or Series of True/False values to the `.loc` indexer to select the rows where your Series has True values.

In most use cases, you will make selections based on the values of different columns in your data set.

For example, the statement `data['first_name'] == 'Antonio'` produces a Pandas Series with a True/False value for every row in the `data` DataFrame, with `True` for the rows where the first_name is 'Antonio'. This type of boolean array can be passed directly to the `.loc` indexer like so:

In [None]:
data.loc[data['first_name'] == 'Antonio']

As before, a second argument can be passed to .loc to select particular columns out of the data frame.

Again, columns are referred to by name for the loc indexer and can be a single string, a list of columns, or a slice “:” operation:

In [None]:
data.loc[data['first_name'] == 'Erasmo', ['company_name', 'email', 'phone1']]

You can see that when selecting columns, if one column only is selected, the `.loc` operator returns a Series. For a single column DataFrame, use a one-element list to keep the DataFrame format, for example:


In [None]:
data.loc[data['first_name'] == 'Antonio', 'email']

In [None]:
data.loc[data['first_name'] == 'Antonio', ['email']]

Select rows with first name 'Antonio' and all columns between 'city' and 'email':

In [None]:
data.loc[data['first_name'] == 'Antonio', 'city':'email']

Select rows where the email column ends with 'hotmail.com', include all columns:

In [None]:
data.loc[data['email'].str.endswith("hotmail.com")]

Select rows with first_name equal to some values, all columns:

In [None]:
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]

Select rows with first name 'Antonio' **AND** Gmail email addresses:

In [None]:
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')] 

Select rows with id column between 100 and 200, and just return 'postal' and 'web' columns. But let's first reset the index so `id` is a column:

In [None]:
data.reset_index(inplace=True)
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']] 

A lambda function that yields True/False values can also be used. Let's use it to select rows where the company name has 4 words in it:

In [None]:
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)] 

Selections can be achieved outside of the main `.loc` for clarity. First, form a separate variable with your selections:

In [None]:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
idx

Then, select only the True values in `idx` and only the 3 columns specified:

In [None]:
data.loc[idx, ['email', 'first_name', 'company_name']]
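As a compact recap of the boolean selections in this section, the sketch below rebuilds the main patterns on a hypothetical four-row frame (the values are invented; only the column names mirror the ones used above). Each comparison is wrapped in parentheses or stored in a variable before combining, because `&` and `|` bind more tightly than `==` or `>` in Python.

```python
import pandas as pd

# Hypothetical data, not taken from uk_data.csv.
toy = pd.DataFrame({
    "first_name": ["Antonio", "Erasmo", "Antonio", "France"],
    "email": ["a@gmail.com", "e@hotmail.com", "t@hotmail.com", "f@gmail.com"],
    "id": [150, 90, 200, 101],
})

is_antonio = toy["first_name"] == "Antonio"
is_hotmail = toy["email"].str.endswith("hotmail.com")

both = toy.loc[is_antonio & is_hotmail, ["email"]]    # AND, one-column DataFrame
either = toy.loc[is_antonio | is_hotmail]             # OR, all columns
in_range = toy.loc[(toy["id"] > 100) & (toy["id"] <= 200), "id"]  # Series

print(both)
print(len(either))
print(in_range.tolist())
```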

# Pandas `apply`, `applymap` and `map`

[Source](https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff)

Let's create a new Dataframe:

In [None]:
df = pd.DataFrame({
    'A': [1,2,3,4], 
    'B': [10,20,30,40],
    'C': [20,40,60,80]
    }, 
    index=['Row 1', 'Row 2', 'Row 3', 'Row 4'])
df

## Using Apply

The Pandas apply() is used to apply a function along an axis of the DataFrame or on values of Series.

Let’s begin with a simple example, to sum each row and save the result to a new column "D":

In [None]:
def custom_sum(row):
    return row.sum()
    
df['D'] = df.apply(custom_sum, axis=1)
df

Do you really understand what just happened?

Let’s take a look at `df.apply(custom_sum, axis=1)`:

- The first parameter custom_sum is a function.
- The second parameter axis is to specify which axis the function is applied to. `0` for applying the function to each column and `1` for applying the function to each row.

Let me explain this process in a more intuitive way. The second parameter `axis = 1` tells Pandas to use the row. So, the custom_sum is applied to each row and returns a new Series with the output of each row as value.
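The effect of the `axis` argument can be seen side by side on a tiny frame. This is a sketch on hypothetical data; it also compares the result with the built-in `DataFrame.sum`, which gives the same numbers here without a custom function.

```python
import pandas as pd

# Hypothetical two-by-two frame.
toy = pd.DataFrame({"A": [1, 2], "B": [10, 20]}, index=["Row 1", "Row 2"])

row_sums = toy.apply(lambda row: row.sum(), axis=1)   # one value per row
col_sums = toy.apply(lambda col: col.sum(), axis=0)   # one value per column

print(row_sums.tolist())   # indexed by 'Row 1', 'Row 2'
print(col_sums.tolist())   # indexed by 'A', 'B'
print(toy.sum(axis=1).tolist())
```

The result of `axis=1` is indexed by the row labels, which is why it can be assigned directly to a new column.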

Now that the sum of each row is clear, the sum of each column only requires `axis=0` instead. First, let's clear out what we just did:

In [None]:
df.drop('D', axis=1, inplace=True)
df

In [None]:
df.loc['Row 5'] = df.apply(custom_sum, axis=0)
df

So far, we have been talking about `apply()` on a DataFrame. Similarly, `apply()` can be used on the values of Series. For example, multiply the column **C** by 2 and save the result to a new column **D**:

In [None]:
df.drop('Row 5', inplace=True)
df

In [None]:
def multiply_by_2(val):
    return val * 2

df['D'] = df['C'].apply(multiply_by_2)
df

Notice that `df['C']` is used to select the column **C** and then call `apply()` with the only parameter `multiply_by_2`. We don’t need to specify axis anymore because a Series is a one-dimensional array. The return value is a Series and gets assigned to the new column **D** by `df['D']`.

Now, we could do exactly the same for the rows:

In [None]:
df.drop('D', axis=1, inplace=True)
df

In [None]:
df.loc['Row 5'] = df.loc['Row 4'].apply(multiply_by_2)
df

### Using lambdas

As we saw in class, you can use the Pandas `apply()` function with lambdas.

The lambda equivalent for the sum of each row of a DataFrame that we used above is:


In [None]:
df['D'] = df.apply(lambda x:x.sum(), axis=1)

Or, the lambda equivalent for the sum of each column of a DataFrame:


In [None]:
df.loc['Row 5'] = df.apply(lambda x:x.sum(), axis=0)  # .loc assigns a row; df['Row 5'] would create a column

And finally, the lambda equivalent for multiply by 2 on a Series:

In [None]:
df['D'] = df['C'].apply(lambda x:x*2)

### Using the `result_type` parameter
`result_type` is a parameter in apply() set to `expand`, `reduce`, or `broadcast` to get the desired type of result.

In what we've done previously, if result_type is set to `broadcast` then the output will be a DataFrame substituted by the custom_sum value:


In [None]:
df.apply(custom_sum, axis=1, result_type='broadcast')

You can see that the result is broadcast to the original shape of the frame, while the original index and columns are retained.



To understand `result_type`'s `expand` and `reduce`, you will first create a function that returns a list:

In [None]:
def cal_multi_col(row):
    return [row['A'] * 2, row['B'] * 3]

Let's apply this function on the dataframe's columns axis with result_type set as `expand`:

In [None]:
df.apply(cal_multi_col, axis=1, result_type='expand')

The output is a new DataFrame with column names 0 and 1.

To append this to the existing DataFrame, the result needs to be stored in a variable so the column names can be accessed by `resul.columns`:


In [None]:
resul = df.apply(cal_multi_col, axis=1, result_type='expand')
df[resul.columns] = resul

Finally, apply the function across axis 1 with `result_type='reduce'`. This is just the opposite of `expand` and returns a Series if possible rather than expanding list-like results:


In [None]:
df['New'] = df.apply(cal_multi_col, axis=1, result_type='reduce')
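To compare the three `result_type` options in one place, here is a minimal sketch on a hypothetical two-row frame. The function name `double_and_triple` is made up for this example and plays the role of `cal_multi_col` above.

```python
import pandas as pd

# Hypothetical two-by-two frame.
toy = pd.DataFrame({"A": [1, 2], "B": [10, 20]}, index=["Row 1", "Row 2"])

def double_and_triple(row):
    # Returns a list, so the shape of the result depends on result_type.
    return [row["A"] * 2, row["B"] * 3]

expanded = toy.apply(double_and_triple, axis=1, result_type="expand")   # DataFrame, columns 0 and 1
reduced = toy.apply(double_and_triple, axis=1, result_type="reduce")    # Series of lists
broadcast = toy.apply(lambda row: row.sum(), axis=1, result_type="broadcast")  # same shape as toy

print(expanded)
print(reduced)
print(broadcast)
```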

## Using `applymap()`

`applymap()` is used for element-wise operations across the whole DataFrame. It's an optimized method and in some particular cases it works much faster than `apply()` (but it’s always good to compare it with `apply()` for big operations). Note that pandas 2.1 and later deprecate `applymap()` in favour of the equivalent `DataFrame.map()`.

For example, to output a DataFrame with every number squared, `applymap()` would be used like this:

In [None]:
df.applymap(np.square)
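Depending on the installed pandas version, the element-wise method is called either `applymap()` or, from pandas 2.1, `DataFrame.map()`. The sketch below picks whichever one exists; the `elementwise` name and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame.
toy = pd.DataFrame({"A": [1, 2], "B": [10, 20]})

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
squared = elementwise(np.square)

print(squared)
```

For a purely numeric frame like this one, `toy ** 2` gives the same values without calling a function per element.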

## Using `map()`

`map()` is a Series method (before pandas 2.1 it was only available on Series) used for substituting each value in a Series with a new one.

To see how `map()` works, let's create a Series:

In [None]:
s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
s

`map()` accepts a dict or a Series as input. Values that are not found in the dict are converted to `NaN`, unless the dict has a default value (as is the case of `defaultdict`):

In [None]:
s.map({'cat': 'kitten', 'dog': 'puppy'})
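The `defaultdict` case mentioned above can be sketched as follows. The `'unknown'` default is an arbitrary choice for this example.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

s = pd.Series(["cat", "dog", np.nan, "rabbit"])

# Plain dict: keys that are not found become NaN.
plain = s.map({"cat": "kitten", "dog": "puppy"})

# defaultdict: keys that are not found take the default value instead.
lookup = defaultdict(lambda: "unknown", {"cat": "kitten", "dog": "puppy"})
with_default = s.map(lookup)

print(plain)
print(with_default)
```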

`map()` also accepts a function as input:

In [None]:
s.map('I am a {}'.format)

If you want to avoid applying the function to missing values (and therefore dragging `NaN` down your processing), you can use `na_action='ignore'`:

In [None]:
s.map('I am a {}'.format, na_action='ignore')

# Filling up missing Data

[Source](https://www.geeksforgeeks.org/python-pandas-dataframe-fillna-to-replace-null-values-in-dataframe/)

Sometimes our data has null values, which are later displayed as `NaN` in the DataFrame. Just as the pandas `dropna()` method removes null values from a data frame, `fillna()` lets the user replace `NaN` values with a value of their own.

[See the syntax in the Pandas Official Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html)

## Replacing NaN values with a static value

In [None]:
!wget https://bit.ly/ks-pds-csv5 -O {files_loc}/nba.csv
contents = !ls {files_loc}/*nba*
nba_file = contents[0]

In [None]:
nba = pd.read_csv(nba_file)
nba

Here, all the null values in the College column are replaced with the string “No College”. The data frame has already been read from CSV, so `fillna()` is called on it with a dictionary that maps the column name to the replacement value, and it returns a new DataFrame:

In [None]:
nba.fillna({'College':'No College'})

Working in place in the dataframe can be expressed somewhat differently as well:

In [None]:
nba['College'].fillna('No College', inplace=True)
nba

But look at what happens if we use the last syntax and we do not modify in place:

In [None]:
nba = pd.read_csv(os.path.join(files_loc, "nba.csv")) # Rereading, we modified nba df before
nba_1 = nba['College'].fillna('No College')
nba_1

In [None]:
type(nba['College'].fillna('No College'))

In [None]:
type(nba['College'].fillna('No College', inplace=True))
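The cells above show that `fillna()` returns a new object unless `inplace=True` is passed, in which case it returns `None`. Recent pandas releases may also warn when an in-place method is called on a column selected with `df[col]`, so a common alternative is to assign the result back. The sketch below uses a hypothetical three-row frame rather than `nba.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing college.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "College": ["Texas", np.nan, "Duke"],
})

filled_copy = toy.fillna({"College": "No College"})   # new frame, toy is unchanged
still_missing = int(toy["College"].isna().sum())      # counted before assigning back

# Assigning the result back updates the original frame explicitly.
toy["College"] = toy["College"].fillna("No College")

print(still_missing)
print(toy)
```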

## Using the `method` parameter
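Now, let's set the `method` to `ffill` (forward fill), so that each null value is replaced by the previous non-null value in the same column. In this case 'Georgia State' replaces the null values in the College column of rows 4 and 5.

Similarly, the `bfill`, `backfill` and `pad` methods can also be used. Note that pandas 2.1 and later deprecate the `method` argument of `fillna()` in favour of the dedicated `.ffill()` and `.bfill()` methods: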

In [None]:
nba = pd.read_csv(os.path.join(files_loc, "nba.csv")) # Reloading, we modified nba df before
nba['College'].fillna(method='ffill', inplace=True)
nba

And now, let's do the same without modifying the original `nba` dataframe. This involves creating a new Series the way we want it and using the [`assign`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html) method on the original dataframe, which returns a copy with the desired modifications:

In [None]:
nba = pd.read_csv(os.path.join(files_loc, "nba.csv")) # Reloading, we modified nba df before
new_college = nba['College'].fillna(method='ffill')
nba_2 = nba.assign(College=new_college)
nba_2

In [None]:
nba

## Using `limit`
Let's set a `limit` of 1 in the `fillna()` method to check whether it stops after filling one consecutive `NaN` value:

In [None]:
nba['College'].fillna(method='ffill', limit=1, inplace=True)
nba
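The effect of `limit` is easier to see on a short Series with a run of consecutive missing values. This sketch uses hypothetical values and the `Series.ffill()` method, which performs the same forward fill as `fillna(method='ffill')`.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two consecutive gaps and one trailing gap.
college = pd.Series(["Georgia State", np.nan, np.nan, "Duke", np.nan])

filled_all = college.ffill()           # every gap takes the previous value
filled_once = college.ffill(limit=1)   # at most one consecutive gap is filled

print(filled_all.tolist())
print(filled_once.isna().tolist())
```

With `limit=1`, the second `NaN` in the run stays missing, while the gap after 'Duke' is filled because it starts a new run.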