### Transforming the data
The original data may be inconsistently formatted: values might be in the wrong units or data types, or some entries might simply be incorrect.
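As a small illustration of fixing data types and units, here is a minimal sketch. The column names `height_cm` and `height_m` and the sample values are hypothetical and are not part of the dataset used in this section. It converts numbers stored as text into numeric values and then changes the unit from centimetres to metres.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, with one invalid entry
raw = pd.DataFrame({"height_cm": ["170", "165", "n/a"]})

# Convert text to numbers; entries that cannot be parsed become NaN
raw["height_cm"] = pd.to_numeric(raw["height_cm"], errors="coerce")

# Convert centimetres to metres, keeping the result in a new column
raw["height_m"] = raw["height_cm"] / 100

print(raw)
```

Using `errors="coerce"` marks invalid entries as missing instead of raising an error, so they can be inspected or dropped later.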

In [1]:
import pandas as pd
import numpy as np
data = {
    "class": [1, 1, 1, 2, 2],
    "score": [10, 21, 35, 11, 26],
    "result": [0, 1, 0, 1, 0],
    "performance": ["strong", "weak", "normal", "weak", "strong"],
}
df = pd.DataFrame(data)
df

Unnamed: 0,class,score,result,performance
0,1,10,0,strong
1,1,21,1,weak
2,1,35,0,normal
3,2,11,1,weak
4,2,26,0,strong


#### Mapping 

One of the basic tasks in data transformation is mapping one set of values to another. Pandas provides the Series method map() for this. Given a dictionary, Series.map({a_1: b_1, a_2: b_2, ...}) replaces the value a_1 with b_1, the value a_2 with b_2, and so on. Suppose we want to transform the column result so that 0 is mapped to fail and 1 to pass. We can do this with map() as follows:


In [2]:
df.result.map({0: "fail", 1: "pass"})

0    fail
1    pass
2    fail
3    pass
4    fail
Name: result, dtype: object

It is generally good practice to store the results in a new column rather than overwrite an existing one, so let’s do this:



In [4]:
df["pass or fail"] = df.result.map({0: "fail", 1: "pass"})
df

Unnamed: 0,class,score,result,performance,pass or fail
0,1,10,0,strong,fail
1,1,21,1,weak,pass
2,1,35,0,normal,fail
3,2,11,1,weak,pass
4,2,26,0,strong,fail
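Two details of map() are worth knowing. Values that have no key in the dictionary become NaN, and map() also accepts a function in place of a dictionary. The sketch below rebuilds the result column as a standalone Series so that it runs on its own. The variable names `partial` and `labels` are only illustrative.

```python
import pandas as pd

result = pd.Series([0, 1, 0, 1, 0], name="result")

# The dictionary has no key for 0, so every 0 is mapped to NaN
partial = result.map({1: "pass"})

# map() also accepts a function, which is called on each element
labels = result.map(lambda r: "pass" if r == 1 else "fail")

print(partial)
print(labels)
```

Checking for unexpected NaN values after a dictionary-based mapping is a quick way to spot values that the dictionary did not cover.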


#### Applying functions

There are situations where a direct mapping might not be sufficient. In such cases, we can apply a specific function to transform the data any way we want. There are several ways in which we might do this: to individual elements, to entire columns, to entire rows, or to the entire DataFrame. There are two main functions that we can use for this: apply() and applymap().

Let’s first take a look at apply(). This is both a Series method and a DataFrame method. When used on a Series, it applies a given function to each element of the Series. Suppose we want to standardize the scores. This is a common procedure that produces values with mean 0 and standard deviation 1. Standardization can be important when comparing measurements that have different units, and as we will see later in the program, it is also a general requirement for many machine learning algorithms. Let’s define our own standardizing function for the column score as follows:

In [5]:
def stand(x):
    mean = df.score.mean()
    std = df.score.std()
    y = (x - mean) / std
    return y

In [6]:
# We can now apply it to the Series df.score and store the output in a new column
df["score standard"] = df.score.apply(stand)
df

Unnamed: 0,class,score,result,performance,pass or fail,score standard
0,1,10,0,strong,fail,-1.009295
1,1,21,1,weak,pass,0.038087
2,1,35,0,normal,fail,1.371118
3,2,11,1,weak,pass,-0.914078
4,2,26,0,strong,fail,0.514169
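The stand() function above reads df.score directly and recomputes the mean and standard deviation for every element. Because pandas arithmetic is vectorized, the same result can be obtained without apply(). In formula form, assuming the sample standard deviation (divisor $n-1$), which is the default of the pandas std() method:

$$
z_i = \frac{x_i - \bar{x}}{s}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

The following is a minimal sketch of that idea. The function name `standardize` is hypothetical, and the scores are rebuilt as a standalone Series so the block runs on its own.

```python
import pandas as pd

scores = pd.Series([10, 21, 35, 11, 26], name="score")

def standardize(s):
    """Shift a Series to mean 0 and scale it to sample standard deviation 1."""
    return (s - s.mean()) / s.std()

z = standardize(scores)

print(z.round(6))
# The mean should be 0 and the standard deviation 1, up to rounding error
print(z.mean(), z.std())
```

Passing the Series in as an argument also makes the function reusable for other columns.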


Let’s now look at using apply() on an entire DataFrame. To do this, we will consider the DataFrame consisting of the numerical columns. We can select these by column label using the loc indexer as follows:

In [7]:
df_num = df.loc[:, ["class", "result", "score", "score standard"]]
df_num

Unnamed: 0,class,result,score,score standard
0,1,0,10,-1.009295
1,1,1,21,0.038087
2,1,0,35,1.371118
3,2,1,11,-0.914078
4,2,0,26,0.514169


In [8]:
# The maximum value in each column, using Python's built-in max() function
df_num.apply(max, axis=0)

class              2.000000
result             1.000000
score             35.000000
score standard     1.371118
dtype: float64

In [9]:
# The maximum value in each row
df_num.apply(max, axis=1)

0    10.0
1    21.0
2    35.0
3    11.0
4    26.0
dtype: float64
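For common reductions such as the maximum, pandas also offers built-in methods that take the same axis argument, while apply() remains useful for custom functions. The sketch below rebuilds a smaller version of df_num (without the standardized column) so that it runs on its own. The variable names `col_max`, `row_max`, and `col_range` are only illustrative.

```python
import pandas as pd

df_num = pd.DataFrame({
    "class": [1, 1, 1, 2, 2],
    "result": [0, 1, 0, 1, 0],
    "score": [10, 21, 35, 11, 26],
})

# Built-in reductions: axis=0 works down each column, axis=1 across each row
col_max = df_num.max(axis=0)
row_max = df_num.max(axis=1)

# A custom function with apply(): the range (max minus min) of each column
col_range = df_num.apply(lambda col: col.max() - col.min(), axis=0)

print(col_max)
print(row_max)
print(col_range)
```

With axis=0 the function receives one column at a time as a Series, and with axis=1 it receives one row at a time.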

Finally, let’s consider the applymap() function. This is a DataFrame method that applies a function to every element of a DataFrame, in contrast to apply(), which works on entire rows or columns. Note that in pandas 2.1 and later, applymap() is deprecated in favor of the equivalent DataFrame.map().

For example, suppose we want to change the formatting of our values by converting them to strings and prepending a $ symbol. We can do this as follows:



In [12]:
df_new = df_num.applymap(lambda x: "$" + str(x))
df_new

Unnamed: 0,class,result,score,score standard
0,$1,$0,$10,$-1.0092949703936966
1,$1,$1,$21,$0.03808660265636577
2,$1,$0,$35,$1.3711176956291724
3,$2,$1,$11,$-0.9140784637527819
4,$2,$0,$26,$0.5141691358609396
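Since newer pandas releases rename applymap() to DataFrame.map(), code that has to run on both old and new versions can pick whichever method is available. The following is a minimal sketch under that assumption. The helper `add_dollar` and the two-row DataFrame are hypothetical and only serve to keep the example self-contained.

```python
import pandas as pd

df_small = pd.DataFrame({"class": [1, 2], "score": [10, 26]})

def add_dollar(x):
    """Convert a value to a string with a leading dollar sign."""
    return "$" + str(x)

# Use DataFrame.map where it exists, otherwise fall back to applymap
if hasattr(pd.DataFrame, "map"):
    df_new = df_small.map(add_dollar)
else:
    df_new = df_small.applymap(add_dollar)

print(df_new)
```

Both methods call the function once per element, so every value in the resulting DataFrame is a string.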
