# Pandas
## Selection


In [2]:
import numpy as np
import pandas as pd
import time 
poker_data = pd.read_csv('Datasets/poker_hand.csv')
poker_data.head()

Unnamed: 0,S1,R1,S2,R2,S3,R3,S4,R4,S5,R5,Class
0,1,10,1,11,1,13,1,12,1,1,9
1,2,11,2,13,2,10,2,12,2,1,9
2,3,12,3,11,3,13,3,10,3,1,9
3,4,10,4,11,4,1,4,13,4,12,9
4,4,1,4,13,4,12,4,11,4,10,9


### Selecting Rows & Columns Efficiently using .iloc[] & .loc[]


In this section, we will see how to locate and select rows efficiently from DataFrames using the **.iloc[]** and **.loc[]** pandas indexers. **.iloc[]** selects by integer position, while **.loc[]** selects by index label.
In the example below we will select 1000 random rows of the poker dataset, first with **.loc[]** and then with **.iloc[]**.
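
To make the label versus position distinction concrete, here is a minimal sketch on a small hypothetical DataFrame (`toy` and its values are made up for illustration and are not part of the poker dataset). The index labels are chosen so that they differ from the row positions.

```python
import pandas as pd

# Hypothetical toy frame: index labels (10, 20, 30, 40) differ from positions (0..3).
toy = pd.DataFrame({"S1": [1, 2, 3, 4], "R1": [10, 11, 12, 13]},
                   index=[10, 20, 30, 40])

by_label = toy.loc[[20, 40]]     # rows whose index labels are 20 and 40
by_position = toy.iloc[[1, 3]]   # rows at integer positions 1 and 3

# Both selections pick the same two rows.
same_rows = by_label.equals(by_position)
```

In the poker dataset the default index runs from 0 upward, so labels and positions coincide, which is why the same `rows` array can be passed to both indexers.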

In [3]:
def loc_speed_test():
    # Draw 1000 random row labels to select
    rows = np.random.randint(0, len(poker_data), 1000)
    total_time = 0
    for i in range(1000):
        loc_start_time = time.time()
        poker_data.loc[rows]
        loc_end_time = time.time()
        total_time += loc_end_time - loc_start_time

    # Average over the 1000 timed selections
    avg_loc_time = total_time / 1000
    return avg_loc_time

In [4]:
def iloc_speed_test():
    # Draw 1000 random row positions to select
    rows = np.random.randint(0, len(poker_data), 1000)
    total_time = 0
    for i in range(1000):
        iloc_start_time = time.time()
        poker_data.iloc[rows]
        iloc_end_time = time.time()
        total_time += iloc_end_time - iloc_start_time

    # Average over the 1000 timed selections
    avg_iloc_time = total_time / 1000
    return avg_iloc_time

In [5]:
loc_time_list= []
iloc_time_list= []
for i in range(100):
    loc_time_list.append(loc_speed_test())
    iloc_time_list.append(iloc_speed_test()) 
    
loc_time = np.mean(loc_time_list)
iloc_time = np.mean(iloc_time_list)
print("avg Difference in percent: {} %".format(((1/iloc_time) / (1/loc_time)-1) *100))

avg Difference in percent: 188.1360470451509 %
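
The percentage printed above compares rates (selections per second) rather than raw times: `(1/iloc_time) / (1/loc_time) - 1` simplifies to `loc_time / iloc_time - 1`. The sketch below wraps that arithmetic in a hypothetical helper, `speedup_percent`, which is not part of the original notebook. The timings passed to it are invented purely to illustrate the calculation.

```python
def speedup_percent(baseline_time, candidate_time):
    """Percent by which the candidate's rate exceeds the baseline's rate."""
    return (baseline_time / candidate_time - 1) * 100

# Invented timings: a candidate that takes a third of the baseline time
# runs at three times the rate, i.e. 200% faster.
example = speedup_percent(baseline_time=3.0, candidate_time=1.0)
```

Read this way, a value of about 188% means the faster method completes roughly 2.9 times as many selections in the same time.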


While these two indexers have the same syntax, **.iloc[]** runs about 188% faster than **.loc[]** in this test. A likely reason is that **.iloc[]** addresses rows directly by integer position, whereas **.loc[]** must first translate index labels into positions.
We can also use them to select columns, not only rows. In the next example, we will select the first three columns, once by position and once by name.

In [6]:
def iloc_column_speed_test():
    total_time = 0
    for i in range(100):
        iloc_start_time = time.time()
        poker_data.iloc[:, :3]
        iloc_end_time = time.time()
        total_time += iloc_end_time - iloc_start_time
    avg_iloc_time = total_time / 100
    return avg_iloc_time



In [7]:
def name_column_speed_test():
    total_time = 0
    for i in range(100):
        names_start_time = time.time()
        poker_data[['S1', 'R1', 'S2']]
        names_end_time = time.time()
        total_time += names_end_time - names_start_time
    avg_name_time = total_time / 100
    return avg_name_time


In [8]:
name_time_list= []
iloc_time_list= []
for i in range(100):
    name_time_list.append(name_column_speed_test())
    iloc_time_list.append(iloc_column_speed_test()) 
    
name_time = np.mean(name_time_list)
iloc_time = np.mean(iloc_time_list)

print("avg Difference in percent: {} %".format(((1/iloc_time) / (1/name_time)-1) *100))

avg Difference in percent: 1126.2194888730917 %


Selecting columns by position with **.iloc[]** is again faster, by about 1126% in this run, than selecting them by name. So prefer **.iloc[]** when speed matters, and select by name (with `[]` or **.loc[]**) when naming the columns makes the code easier to read.
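
As a minimal sketch of how the two styles relate, the snippet below selects the same three columns by position, by name, and through **.loc[]**, using a small hypothetical frame (`toy` is made up for illustration). It also uses `Index.get_indexer` to translate column names into positions when only the names are known.

```python
import pandas as pd

# Hypothetical toy frame that borrows the poker column names.
toy = pd.DataFrame({"S1": [1, 2], "R1": [10, 11], "S2": [1, 2], "R2": [11, 13]})

by_position = toy.iloc[:, :3]               # first three columns by position
by_name = toy[["S1", "R1", "S2"]]           # same columns by name
by_loc = toy.loc[:, ["S1", "R1", "S2"]]     # same columns through .loc

# Translate names to integer positions, then select by position.
positions = toy.columns.get_indexer(["S1", "R1", "S2"])
by_lookup = toy.iloc[:, positions]
```

All four selections contain the same data, so the choice between them is about speed and readability rather than the result.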

### Replacing Values in a DataFrame Effectively


In [9]:
names = pd.read_csv('Datasets/Popular_Baby_Names.csv')
names.head()

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,SOPHIA,119,1
1,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,CHLOE,106,2
2,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMILY,93,3
3,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,OLIVIA,89,4
4,2011,FEMALE,ASIAN AND PACIFIC ISLANDER,EMMA,75,5


Let's have a closer look at the Gender feature and see which unique values it contains:

In [10]:
names['Gender'].unique()

array(['FEMALE', 'MALE'], dtype=object)

The examples below assume that the female gender can appear under two spellings, uppercase `FEMALE` and lowercase `female`, although the output above lists only uppercase values. Inconsistent spelling like this is very common in real data, and an easy fix is to replace one of the values with the other so the column is consistent throughout the dataset. There are two ways to do it. The first is to locate the values we want to replace with **.loc[]** and assign the new value to them, as shown in the code below:

In [11]:
def replace_loc_method(df):
    start_time = time.time()
    # Build the mask from df itself, not from the global names frame
    df.loc[df['Gender'] == 'female', 'Gender'] = 'FEMALE'
    end_time = time.time()

    pandas_time = end_time - start_time
    return pandas_time

The second method is to use the pandas built-in method **.replace()**, as shown in the code below:

In [12]:
def replace_method(df):
    start_time = time.time()
    df['Gender'].replace('female', 'FEMALE', inplace=True)
    end_time = time.time()
    replace_time = end_time - start_time
    return replace_time
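
To check that the two approaches produce the same result, here is a minimal sketch on a hypothetical toy column (the values are made up for illustration). It assigns the output of **.replace()** back to the column instead of passing `inplace=True` on a selected column, a pattern that recent pandas versions may warn about. That is an assumption about your pandas version, not something the timings in this section depend on.

```python
import pandas as pd

# Hypothetical toy data with mixed-case spellings.
toy = pd.DataFrame({"Gender": ["FEMALE", "female", "MALE", "female"]})

# Approach 1: boolean mask plus .loc assignment.
via_loc = toy.copy()
via_loc.loc[via_loc["Gender"] == "female", "Gender"] = "FEMALE"

# Approach 2: Series.replace, with the result assigned back to the column.
via_replace = toy.copy()
via_replace["Gender"] = via_replace["Gender"].replace("female", "FEMALE")

same_result = via_loc.equals(via_replace)
```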

In [13]:
replace_time_list = []
pandas_time_list = []
for i in range(10000):
    replace_time_list.append(replace_method(names.copy()))
    pandas_time_list.append(replace_loc_method(names.copy()))
    
replace_time = sum(replace_time_list)/10000
pandas_time = sum(pandas_time_list)/10000
print('The difference: {} %'.format(((1/replace_time )/(1/pandas_time)-1)*100))

The difference: 114.76267295119501 %


We can see a clear difference in running time: the built-in method is about 114% faster than using **.loc[]** to locate the matching rows and assign the new value.

We can also replace multiple values using lists. Our objective is to change all ethnicities classified as **WHITE NON HISPANIC** or **WHITE NON HISP** to **WNH**. Using **.loc[]**, we locate babies of the ethnicities we are looking for by combining two conditions with the element-wise OR operator (the pipe, `|`). We then assign the new value. As always, we also measure the time needed for this operation.

In [14]:
def replace_loc_method(df):
    start_time = time.time()
    df.loc[(df["Ethnicity"] == 'WHITE NON HISPANIC') |(df["Ethnicity"] == 'WHITE NON HISP'),'Ethnicity'] = 'WNH'
    end_time = time.time()

    pandas_time = end_time - start_time
    return pandas_time
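
The mask logic can be sketched on a small hypothetical column (the values below are made up for illustration). The sketch also shows `Series.isin`, which builds the same boolean mask from a list. This is offered as an alternative spelling, not as the notebook's method.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Ethnicity": ["WHITE NON HISPANIC", "WHITE NON HISP", "HISPANIC"]})
targets = ["WHITE NON HISPANIC", "WHITE NON HISP"]

# Element-wise OR of two comparisons; each side needs its own parentheses.
mask_or = (toy["Ethnicity"] == targets[0]) | (toy["Ethnicity"] == targets[1])

# Equivalent mask built from the list of target values.
mask_isin = toy["Ethnicity"].isin(targets)

# Assign the new value to the matching rows only.
toy.loc[mask_isin, "Ethnicity"] = "WNH"
```

With more than two values to match, `isin` keeps the condition short, while chained `|` comparisons grow with each extra value.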

We can do the same operation with the pandas built-in method **.replace()**, as follows:

In [15]:
def replace_method(df):
    start_time = time.time()
    df['Ethnicity'].replace(['WHITE NON HISPANIC','WHITE NON HISP'],
    'WNH', inplace=True)
    end_time = time.time()
    replace_time = end_time - start_time
    return replace_time

To find out whether the **.replace()** method is again faster than the **.loc[]** method, and by how much, let's run the code below:

In [16]:
replace_time_list = []
pandas_time_list = []
for i in range(10000):
    replace_time_list.append(replace_method(names.copy()))
    pandas_time_list.append(replace_loc_method(names.copy()))
    
replace_time = sum(replace_time_list)/10000
pandas_time = sum(pandas_time_list)/10000
print('The difference: {} %'.format(((1/replace_time )/(1/pandas_time)-1)*100))

The difference: 31.09173138936361 %


The **.replace()** method is 31% faster than the **.loc[]** method here. If your data is large and needs a lot of cleaning, this tip will reduce the computation time of your data cleaning and make your pandas code more efficient.
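
The hand-written `time.time()` loops used in this section can also be expressed with the standard library's `timeit` module. This is a minimal sketch on hypothetical toy data: the helper names `with_replace` and `with_loc` and the repeat counts are made up for illustration, and no particular outcome is implied.

```python
import timeit
import pandas as pd

# Hypothetical toy data: 3000 rows built by repeating three values.
toy = pd.DataFrame(
    {"Ethnicity": ["WHITE NON HISPANIC", "WHITE NON HISP", "HISPANIC"] * 1000}
)
targets = ["WHITE NON HISPANIC", "WHITE NON HISP"]

def with_replace(df):
    # Returns a new Series; the input frame is left untouched.
    return df["Ethnicity"].replace(targets, "WNH")

def with_loc(df):
    # Works on a copy so the input is untouched; the copy is part of the timing.
    out = df.copy()
    out.loc[out["Ethnicity"].isin(targets), "Ethnicity"] = "WNH"
    return out["Ethnicity"]

# Best of 5 repeats, each repeat timing 20 calls.
replace_best = min(timeit.repeat(lambda: with_replace(toy), number=20, repeat=5))
loc_best = min(timeit.repeat(lambda: with_loc(toy), number=20, repeat=5))
```

Taking the minimum over several repeats is a common way to reduce noise from other processes. The numbers depend on the machine, the pandas version, and the data.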

Finally, we can also use dictionaries to replace both single and multiple values in a DataFrame. This is very helpful when you would like to perform multiple replacements in one command.
We're going to use a dictionary to replace every male's gender with BOY and every female's gender with GIRL.

In [17]:

def mult_replace_dict(df):
    start_time = time.time()
    df['Gender'].replace({'MALE':'BOY', 'FEMALE':'GIRL', 'female': 'girl'}, inplace=True)
    end_time = time.time()
    dict_time = end_time - start_time
    return dict_time

def mult_replace(df):
    start_time = time.time()
    df['Gender'].replace('MALE', 'BOY', inplace=True)
    df['Gender'].replace('FEMALE', 'GIRL', inplace=True)
    df['Gender'].replace('female', 'girl', inplace=True)
    end_time = time.time()
    list_time = end_time - start_time
    return list_time
    

In [18]:
names = pd.read_csv('Datasets/Popular_Baby_Names.csv')
mult_replace_dict_list = []
mult_replace_list = []
for i in range(10000):
    mult_replace_dict_list.append(mult_replace_dict(names.copy()))
    mult_replace_list.append(mult_replace(names.copy()))
    
dict_time = sum(mult_replace_dict_list)/10000
list_time = sum(mult_replace_list)/10000

print('The difference: {} %'.format(((1/dict_time )/(1/list_time)-1)*100))

The difference: -14.321091231581761 %


We could get the same result with separate **.replace()** calls, one per value, but that is more verbose. Comparing both approaches in this run, the dictionary version is approximately 14% slower. In general, lookups in a Python dictionary are very efficient compared to lists: searching a list requires a pass over its elements, while a dictionary goes directly to the key that matches the entry. That general property does not translate into a speedup here, and the comparison is a little unfair since the two approaches are structured differently.
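
The equivalence of the two approaches can be sketched on a small hypothetical Series (the values are made up for illustration). One dictionary-based call is compared with a chain of single-value calls.

```python
import pandas as pd

# Hypothetical toy data.
gender = pd.Series(["MALE", "FEMALE", "female", "MALE"])
mapping = {"MALE": "BOY", "FEMALE": "GIRL", "female": "girl"}

# One call with a dictionary of old -> new values.
via_dict = gender.replace(mapping)

# The same replacements applied one value at a time.
via_calls = gender
for old, new in mapping.items():
    via_calls = via_calls.replace(old, new)

same_result = via_dict.equals(via_calls)
```

The results match here because no replacement value is itself a key of the mapping. If one were, the chained calls could transform a value twice.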

Dictionaries also let us name the column inside the replacement mapping itself, so a single call can cover several columns. In all the previous examples, we selected the column first and then replaced values in it. We are now going to replace several values in one column, Ethnicity, to classify all ethnicities into three big categories: Black, Asian and White. The syntax is again simple. We use nested dictionaries: the outer key is the column in which we want to replace values, and its value is another dictionary whose keys are the ethnicities to replace and whose values are the new ethnicity (Black, Asian or White).

In [19]:
start_time = time.time()
# The result is not assigned here; only the elapsed time is reported
names.replace({'Ethnicity': {'ASIAN AND PACI': 'ASIAN', 'ASIAN AND PACIFIC ISLANDER': 'ASIAN',
                             'BLACK NON HISPANIC': 'BLACK', 'BLACK NON HISP': 'BLACK',
                             'WHITE NON HISPANIC': 'WHITE', 'WHITE NON HISP': 'WHITE'}})
print("Time using .replace() with dictionary: {} sec".format(time.time() - start_time))

Time using .replace() with dictionary: 0.005876302719116211 sec
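
As a minimal sketch of the nested-dictionary form on hypothetical toy data (the frame and mapping below are made up for illustration), the example covers two columns in one call. It assigns the result to a new variable, since `DataFrame.replace` returns a new frame by default.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Ethnicity": ["ASIAN AND PACI", "BLACK NON HISP", "WHITE NON HISPANIC"],
    "Gender": ["FEMALE", "MALE", "FEMALE"],
})

# Outer keys are column names; inner dictionaries map old values to new ones.
nested = {
    "Ethnicity": {
        "ASIAN AND PACI": "ASIAN",
        "BLACK NON HISP": "BLACK",
        "WHITE NON HISPANIC": "WHITE",
    },
    "Gender": {"FEMALE": "GIRL", "MALE": "BOY"},
}

cleaned = toy.replace(nested)  # toy itself is left unchanged
```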
