# Data Manipulation
Data manipulation refers to adjusting data to make it organised and easier to read. This section covers several data manipulation techniques in pandas.

# Single element change
We will see how to change single elements of a dataframe. Let's create a dataframe containing some NaN values.

In [1]:
import pandas as pd
import numpy as np
A = ['a','b','c','d']
B = ['e',np.nan,'g','h']
C = ['i', 'j', np.nan, 'l']
D = ['a', 'e', 'i', 'o']
E = ['s', 'k', np.nan, 'g']
#create dataframe
df = pd.DataFrame(data=[A, B, C, D, E], columns=['one', 'two', 'three', 'four'])

#display dataframe
df

  one  two three four
0   a    b     c    d
1   e  NaN     g    h
2   i    j   NaN    l
3   a    e     i    o
4   s    k   NaN    g


In [3]:
df.isnull().sum() #check null values

one      0
two      1
three    2
four     0
dtype: int64
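
The cells above only build the dataframe, so here is a minimal sketch of changing one element at a time. It assumes the same data as above; the names `rows` and `demo` are illustrative and not part of the original notebook. `DataFrame.at` addresses a cell by row and column label, and `DataFrame.iat` addresses it by integer position.

```python
import numpy as np
import pandas as pd

# Same layout as the dataframe above: five rows, four columns, three NaN values.
rows = [
    ['a', 'b', 'c', 'd'],
    ['e', np.nan, 'g', 'h'],
    ['i', 'j', np.nan, 'l'],
    ['a', 'e', 'i', 'o'],
    ['s', 'k', np.nan, 'g'],
]
demo = pd.DataFrame(rows, columns=['one', 'two', 'three', 'four'])

# .at selects a single cell by row label and column label.
demo.at[1, 'two'] = 'f'

# .iat selects a single cell by integer position (row 2, column 2 is 'three').
demo.iat[2, 2] = 'k'

# Only the NaN in row 4, column 'three' is left.
print(demo.isnull().sum())
```

For changing many cells at once, `loc` with a boolean condition is usually a better fit than repeated single-cell assignments.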

# fillna()
fillna() replaces missing values with a value you specify. Let's see how to use it.

In [4]:
# inplace=True modifies the original dataframe instead of returning a new one

df.fillna(2, inplace= True)

In [5]:
df.isnull().sum() #check null values after filling 

one      0
two      0
three    0
four     0
dtype: int64

In [7]:
import pandas as pd
import numpy as np
A = ['a','b','c','d']
B = ['e',np.nan,'g','h']
C = ['i', 'j', np.nan, 'l']
D = ['a', 'e', 'i', 'o']
E = ['s', 'k', np.nan, 'g']
#create dataframe
df = pd.DataFrame(data=[A, B, C, D, E], columns=['one', 'two', 'three', 'four'])
df.head()

  one  two three four
0   a    b     c    d
1   e  NaN     g    h
2   i    j   NaN    l
3   a    e     i    o
4   s    k   NaN    g


In [9]:
df.fillna('Z', inplace=True) # Fill with desired item
print(df)

  one two three four
0   a   b     c    d
1   e   Z     g    h
2   i   j     Z    l
3   a   e     i    o
4   s   k     Z    g
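
Both calls above fill every column with the same value and modify the dataframe in place. As a hedged sketch of two variations, the example below passes a dictionary so that each column gets its own replacement, and omits `inplace` so that the original is left untouched. The names `demo` and `filled` and the replacement strings are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rows = [
    ['a', 'b', 'c', 'd'],
    ['e', np.nan, 'g', 'h'],
    ['i', 'j', np.nan, 'l'],
    ['a', 'e', 'i', 'o'],
    ['s', 'k', np.nan, 'g'],
]
demo = pd.DataFrame(rows, columns=['one', 'two', 'three', 'four'])

# A dict maps column name -> replacement value for that column.
# Without inplace=True, fillna returns a new dataframe.
filled = demo.fillna({'two': 'missing', 'three': 'unknown'})

print(filled)
print(demo.isnull().sum())  # the original still contains its NaN values
```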


# dropna()
To exclude rows or columns that contain missing data, use dropna(). The dropna() function drops the rows or columns of a dataset that contain null values. Its axis parameter selects which to drop: it takes 0 or 'index' for rows, and 1 or 'columns' for columns.

## Drop along the row
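To drop rows that contain missing values, set axis to zero, or omit the axis parameter since 0 is the default. The missing values in df were already replaced with 'Z' above, so no rows are dropped here and the output is unchanged.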

In [13]:
df.dropna(axis=0)

  one two three four
0   a   b     c    d
1   e   Z     g    h
2   i   j     Z    l
3   a   e     i    o
4   s   k     Z    g


## Drop along the column
To drop columns that contain missing values, set axis to one. Again, df no longer contains missing values, so every column is kept.

In [14]:
df.dropna(axis=1)

  one two three four
0   a   b     c    d
1   e   Z     g    h
2   i   j     Z    l
3   a   e     i    o
4   s   k     Z    g
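
Because the dataframe used above had already been filled, neither call removed anything. The sketch below rebuilds the data with its NaN values so the effect of each axis is visible. The names `demo`, `rows_kept` and `cols_kept` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rows = [
    ['a', 'b', 'c', 'd'],
    ['e', np.nan, 'g', 'h'],
    ['i', 'j', np.nan, 'l'],
    ['a', 'e', 'i', 'o'],
    ['s', 'k', np.nan, 'g'],
]
demo = pd.DataFrame(rows, columns=['one', 'two', 'three', 'four'])

# axis=0 drops every row that contains at least one NaN (rows 1, 2 and 4).
rows_kept = demo.dropna(axis=0)

# axis=1 drops every column that contains at least one NaN ('two' and 'three').
cols_kept = demo.dropna(axis=1)

print(rows_kept)
print(cols_kept)
```

Neither call changes `demo` itself, since dropna() returns a new dataframe unless `inplace=True` is passed.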


# Apply functions
Suppose you have some data and want to apply a function along the rows or the columns of the dataframe. With apply(), the function receives each column or each row in turn, according to your requirements. It can be a built-in function or a custom function that you write.

In [15]:
import pandas as pd
dic = {'A':[1,2,3,4], 'B':[5,6,7,8]}
df = pd.DataFrame(dic)
df

   A  B
0  1  5
1  2  6
2  3  7
3  4  8


In [16]:
df.apply(sum)

A    10
B    26
dtype: int64

Notice that we passed the function without parentheses. We received the sum of all the rows in A and in B, that is, one value per column. This is because `axis` defaults to 0, which passes each column to the function. With `axis=1`, each row is passed to the function instead, giving one value per row.

In [17]:
df.apply(sum, axis=1) # row-wise sum; other functions can be passed the same way

0     6
1     8
2    10
3    12
dtype: int64
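
The text mentions that a custom function can be applied as well. Here is a small sketch of that, assuming the same A/B data as above. The helper `value_range` and the variable names are made up for illustration.

```python
import pandas as pd

nums = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})

def value_range(values):
    """Return the difference between the largest and smallest value."""
    return values.max() - values.min()

# axis=0 (default): the function receives each column as a Series.
col_ranges = nums.apply(value_range)

# axis=1: the function receives each row as a Series.
row_ranges = nums.apply(value_range, axis=1)

# A function that returns a Series of the same length keeps the dataframe shape.
doubled = nums.apply(lambda col: col * 2)

print(col_ranges)
print(row_ranges)
print(doubled)
```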

## filter()
Now let's look at the pandas filter function. It subsets the rows or columns of a dataframe, or the entries of a series, according to the labels in the specified index. Note that the filter is applied to the index labels; it does not filter on the contents.

In [18]:
# importing pandas as pd
import pandas as pd

# Creating the Series
sr = pd.Series({'Coca Cola': 45, 'Coke': 40, 'Fanta': 40, 'Dew': 50, 'Thumbs Up':30})

# Print the series
sr

Coca Cola    45
Coke         40
Fanta        40
Dew          50
Thumbs Up    30
dtype: int64

In [19]:
sr.filter(items=['Coke', 'Fanta'])

Coke     40
Fanta    40
dtype: int64
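
Besides `items`, filter() also accepts `like` (substring match) and `regex` (regular expression match) on the labels. The sketch below shows both on the same series, and contrasts them with a boolean mask, which is what selects on the values themselves. The variable names are illustrative assumptions.

```python
import pandas as pd

drinks = pd.Series({'Coca Cola': 45, 'Coke': 40, 'Fanta': 40, 'Dew': 50, 'Thumbs Up': 30})

# like: keep labels that contain the substring 'Co'.
by_substring = drinks.filter(like='Co')

# regex: keep labels that match the pattern, here labels ending in 'a'.
by_regex = drinks.filter(regex='a$')

# Filtering on the values is done with a boolean mask, not with filter().
by_value = drinks[drinks > 40]

print(by_substring)
print(by_regex)
print(by_value)
```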

## pandas groupby()

groupby() splits the records of a data set into groups based on the values of one or more columns.
Let's first reproduce what it does by hand in the example below.

In [20]:
dic ={'gender': ['f', 'm', 'm', 'f', 'm', 'f', 'm'],
      'weight': [58, 60, 59, 55, 65, 52, 61]
      }
df =pd.DataFrame(dic)
df

  gender  weight
0      f      58
1      m      60
2      m      59
3      f      55
4      m      65
5      f      52
6      m      61


In [21]:
# First, we split the data by gender.
# We build a boolean filter for each gender and keep it in a separate variable:
# f_filter selects the female rows and m_filter selects the male rows.
# Indexing the dataframe with each filter splits it into two groups.

f_filter = df['gender']=='f'
print(df[f_filter])

m_filter = df['gender']=='m'
print(df[m_filter])

  gender  weight
0      f      58
3      f      55
5      f      52
  gender  weight
1      m      60
2      m      59
4      m      65
6      m      61


In [22]:
# After the split, we apply an aggregation, here the mean, to each group.
# The mean weight of the female group is stored in f_avg and that of the male group in m_avg.
# Running this cell prints the average weight of each gender group.

f_avg = df[f_filter]['weight'].mean()

m_avg = df[m_filter]['weight'].mean()

print(f_avg,m_avg)


55.0 61.25


In [23]:
# Finally, we combine the results from both groups into a single dataframe.
# We create a dictionary with keys Gender and weight, using the average weights from the previous cell.
# The resulting dataframe contains the mean weight for each gender.

pd.DataFrame({'Gender':['f','m'],'weight':[f_avg,m_avg]})


  Gender  weight
0      f   55.00
1      m   61.25
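
The manual split, apply and combine steps above need one filter per group. As a rough sketch of how the same three steps generalise to any number of groups, one can iterate over a groupby object, which yields each group label together with its sub-dataframe. The names `people`, `group_means` and `combined` are assumptions for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'gender': ['f', 'm', 'm', 'f', 'm', 'f', 'm'],
    'weight': [58, 60, 59, 55, 65, 52, 61],
})

group_means = {}
# Split: each iteration yields the group label and the rows of that group.
for gender, group in people.groupby('gender'):
    # Apply: aggregate the weights of this group.
    group_means[gender] = group['weight'].mean()

# Combine: collect the per-group results into one dataframe.
combined = pd.DataFrame({
    'Gender': list(group_means.keys()),
    'weight': list(group_means.values()),
})
print(combined)
```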


In [24]:
# The cells above show what groupby does internally. The same result can be obtained directly:
# first, call groupby with the column name 'gender' as the argument;
# then call agg with a list of statistical operations,
# here just 'mean', to get the mean weight for each gender at once.
# The result is a dataframe containing the mean weight for each gender.

df.groupby('gender').agg(['mean'])

       weight
         mean
gender
f       55.00
m       61.25


In [25]:
# Similarly, to compute several statistics at once (minimum, maximum, mean and sum of weight for each gender),
# pass a list of those operations to agg.
# The result is a dataframe with the minimum, maximum, mean and sum of the weight for each gender group.

df.groupby('gender').agg(['min','max','mean','sum'])


       weight
          min max   mean  sum
gender
f          52  58  55.00  165
m          59  65  61.25  245
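
Passing a list to agg produces two-level column headers, as in the output above. One alternative, sketched here under the assumption that flat column names are preferred, is named aggregation: each keyword becomes an output column and maps to a (column, function) pair. The output column names below are made up for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    'gender': ['f', 'm', 'm', 'f', 'm', 'f', 'm'],
    'weight': [58, 60, 59, 55, 65, 52, 61],
})

# Named aggregation: output_name=(input_column, aggregation_function).
summary = people.groupby('gender').agg(
    min_weight=('weight', 'min'),
    max_weight=('weight', 'max'),
    mean_weight=('weight', 'mean'),
)
print(summary)
```

The result has a single level of column names, which is easier to reference later, for example as `summary['mean_weight']`.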


## multi-column groupby: aggregation

In [26]:
# To demonstrate multi-column groupby aggregation, assume we have the following dataframe named df, with 3 columns and 7 rows.

import pandas as pd
dic ={'gender': ['f', 'm', 'm', 'f', 'm', 'f', 'm'],
      'weight': [58, 60, 59, 55, 65, 52, 61],
      'location': ['LA', 'LA', 'NY', 'NY', 'LA', 'NY', 'NY']
      }
df =pd.DataFrame(dic)
df

  gender  weight location
0      f      58       LA
1      m      60       LA
2      m      59       NY
3      f      55       NY
4      m      65       LA
5      f      52       NY
6      m      61       NY


In [27]:
# Multi-column groupby aggregation takes the following steps.
# First, group by several columns by passing a list of column names, here gender and location.
# Then apply an aggregation, here the mean.
# The result is a dataframe grouped by gender and location, with the mean weight of each group.


df.groupby(['gender', 'location']).agg(['mean'])


                weight
                  mean
gender location
f      LA         58.0
       NY         53.5
m      LA         62.5
       NY         60.0
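
A multi-column groupby returns its result indexed by both grouping columns. Two common ways to reshape it are sketched below, with illustrative variable names: `reset_index()` turns the group labels back into ordinary columns, and `unstack()` pivots one grouping level into columns.

```python
import pandas as pd

people = pd.DataFrame({
    'gender': ['f', 'm', 'm', 'f', 'm', 'f', 'm'],
    'weight': [58, 60, 59, 55, 65, 52, 61],
    'location': ['LA', 'LA', 'NY', 'NY', 'LA', 'NY', 'NY'],
})

# Selecting 'weight' first gives a Series indexed by (gender, location).
mean_weight = people.groupby(['gender', 'location'])['weight'].mean()

# Long format: one row per (gender, location) pair, with plain columns.
flat = mean_weight.reset_index()

# Wide format: one row per gender, one column per location.
wide = mean_weight.unstack('location')

print(flat)
print(wide)
```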
