# Data Frames and Data Cleaning Tasks



The specific tasks that you need to perform depend on the structure and contents of a dataset. In general, you will perform a workflow with the following steps, not necessarily always in this order (and some might be optional). All of the following steps can be performed with a Pandas data frame:

- Read data into a data frame
- Display top of a data frame
- Display column data types
- Display missing values
- Replace NA with a value
- Iterate through the columns
- Provide statistics for each column
- Find missing values
- Total missing values
- Percentage of missing values
- Sort table values
- Print summary information
- Identify columns with > 50% of the data missing
- Rename columns
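
The steps above can be sketched on a tiny dataset as follows. This is a minimal illustration, not a fixed recipe: the inline CSV text, its column names, and the choice of the median as a fill value are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk.
csv_text = """id,age,fare,cabin
1,22,7.25,
2,38,71.28,C85
3,,7.92,
4,35,53.10,
"""

# Read data into a data frame and display the top rows.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# Display column data types.
print(df.dtypes)

# Total and percentage of missing values per column, sorted.
missing_total = df.isna().sum()
missing_percent = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_total)
print(missing_percent)

# Identify columns with more than 50% of the data missing.
mostly_missing = missing_percent[missing_percent > 50].index.tolist()
print(mostly_missing)

# Replace NA with a value (here the column median) and rename a column.
df["age"] = df["age"].fillna(df["age"].median())
df = df.rename(columns={"id": "passenger_id"})

# Provide statistics for each column and print summary information.
print(df.describe())
df.info()
```

In this sketch only the cabin column crosses the 50% threshold, because three of its four values are missing.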

# Reading Data into the Pandas Dataframe

In [3]:

import pandas as pd
titanic_data = pd.read_csv(r"C:\Users\jaiku\PycharmProjects\Pandas\titanic_data.csv")
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# Filtering Rows

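To filter rows, first build a Boolean series that is True for every row you want to keep and False for the rest. You then pass that series inside the opening and closing square brackets that follow the Pandas dataframe name. The following script creates such a series for the rows where the Pclass column has a value of 1.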

In [4]:
titanic_pclass1= (titanic_data.Pclass == 1) 
titanic_pclass1

The script above returns a series of True and False values. True is returned for the indexes where the Pclass column has a value of 1.

0      False
1       True
2      False
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Name: Pclass, Length: 891, dtype: bool

Now the titanic_pclass1 series, which contains True or False, can be passed inside the opening and closing square brackets that follow the titanic_data dataframe. The result will be the Titanic dataset, containing only those records where the Pclass column contains 1.

In [5]:
titanic_pclass1= (titanic_data.Pclass == 1) 
titanic_pclass1_data = titanic_data[titanic_pclass1]
titanic_pclass1_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S


The comparison between the column values and filtering of rows can be made in a single line, as shown below.

In [6]:
titanic_pclass_data = titanic_data[titanic_data.Pclass == 1]
titanic_pclass_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S


# IsIn Operator

The isin() method takes a list of values and returns True for the rows where the column used for comparison contains one of the values in that list. Passing the result inside the square brackets keeps only those rows.

In [8]:
ages = [20,21,22]
age_dataset = titanic_data[titanic_data["Age"].isin(ages)]
age_dataset.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S
37,38,0,3,"Cann, Mr. Ernest Charles",male,21.0,0,0,A./5. 2152,8.05,,S
51,52,0,3,"Nosworthy, Mr. Richard Cater",male,21.0,0,0,A/4. 39886,7.8,,S
56,57,1,2,"Rugg, Miss. Emily",female,21.0,0,0,C.A. 31026,10.5,,S


You can filter rows in a Pandas dataframe based on multiple conditions using the logical and (&) and or (|) operators. Wrap each comparison in parentheses, because & and | bind more tightly than ==. The following script returns those rows from the Pandas dataframe where the passenger class is 1 and the passenger age is 20, 21, or 22.

In [9]:
ages = [20,21,22]
ageclass_dataset = titanic_data[titanic_data["Age"].isin(ages) & (titanic_data["Pclass"] == 1)]
ageclass_dataset.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
102,103,0,1,"White, Mr. Richard Frasar",male,21.0,0,1,35281,77.2875,D26,S
151,152,1,1,"Pears, Mrs. Thomas (Edith Wearne)",female,22.0,1,0,113776,66.6,C2,S
356,357,1,1,"Bowerman, Miss. Elsie Edith",female,22.0,0,1,113505,55.0,E33,S
373,374,0,1,"Ringhini, Mr. Sante",male,22.0,0,0,PC 17760,135.6333,,C
539,540,1,1,"Frolicher, Miss. Hedwig Margaritha",female,22.0,0,2,13568,49.5,B39,C
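
The or (|) operator works the same way as &, and a condition can be negated with ~. The sketch below uses a hypothetical five-row dataframe whose column names mirror the Titanic data, so that the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical miniature dataset; column names mirror the Titanic data.
passengers = pd.DataFrame({
    "Pclass": [1, 3, 2, 1, 3],
    "Age": [22.0, 21.0, 40.0, 58.0, 20.0],
})

# OR: first-class passengers, or anyone aged 20, 21, or 22.
either = passengers[(passengers["Pclass"] == 1) | passengers["Age"].isin([20, 21, 22])]

# NOT: everyone who is not in first class.
not_first = passengers[~(passengers["Pclass"] == 1)]

print(either)
print(not_first)
```

Only the row with class 2 and age 40 fails both conditions of the or filter, so it is the only row missing from the first result.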


# Filtering Columns

To filter columns from a Pandas dataframe, you can use the filter() method. The list of columns that you want to keep is passed to the filter() method.

The following script keeps the Name, Sex, and Age columns from the Titanic dataset and ignores all the other columns.

In [10]:
titanic_data_filter  = titanic_data.filter(["Name", "Sex", "Age"]) 
titanic_data_filter.head()

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0


You can also drop columns that you don’t want in the dataset. To do so, you need to call the drop() method and pass it the list of columns that you want to drop, along with axis = 1 so that the labels are looked up among the columns rather than the rows.

For instance, the following script drops the Name, Age, and Sex columns from the Titanic dataset and returns the remaining columns.

In [11]:
titanic_data_filter  = titanic_data.drop(["Name", "Sex", "Age"], axis = 1) 
titanic_data_filter.head()

Unnamed: 0,PassengerId,Survived,Pclass,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,1,0,A/5 21171,7.25,,S
1,2,1,1,1,0,PC 17599,71.2833,C85,C
2,3,1,3,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,1,0,113803,53.1,C123,S
4,5,0,3,0,0,373450,8.05,,S
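
One practical difference between these approaches is how they treat a column name that does not exist. The sketch below shows this on a hypothetical two-row dataframe; the Nationality column is deliberately absent.

```python
import pandas as pd

# Hypothetical two-row dataframe with no "Nationality" column.
people = pd.DataFrame({
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
})

# filter() keeps the listed columns and silently skips names that do not exist.
kept = people.filter(["Name", "Nationality"])
print(kept.columns.tolist())

# Bracket selection raises KeyError for an unknown column name.
try:
    people[["Name", "Nationality"]]
    raised = False
except KeyError:
    raised = True
print(raised)

# drop() also raises by default; errors="ignore" makes it skip unknown names.
remaining = people.drop(columns=["Age", "Nationality"], errors="ignore")
print(remaining.columns.tolist())
```

The silent behavior of filter() is convenient in exploratory work, but it can hide a misspelled column name, so bracket selection may be the safer choice when a missing column should be treated as an error.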


# Concatenating Dataframes

Let’s first see how to concatenate or join Pandas dataframes vertically. We will first create two Pandas dataframes using Titanic data. The first dataframe contains rows where the passenger class is 1, while the second dataframe contains rows where the passenger class is 2. 

In [19]:
titanic_pclass1_data = titanic_data[titanic_data.Pclass == 1] 
print(titanic_pclass1_data.shape)
titanic_pclass2_data = titanic_data[titanic_data.Pclass == 2] 
print(titanic_pclass2_data.shape)

(216, 12)
(184, 12)


The output shows that both the newly created dataframes have 12 columns.

When concatenating data vertically, both dataframes should have the same columns. Where a column exists in only one dataframe, concat() fills the missing cells with NaN.

To concatenate dataframes vertically, call the pd.concat() function and pass it a list of the dataframes that you want to join.

In [21]:
final_data = pd.concat([titanic_pclass1_data, titanic_pclass2_data ])
print(final_data.shape)

(400, 12)


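To concatenate dataframes horizontally, you can use the concat() function as well, passing 1 as the value for the axis parameter. Horizontal concatenation matches rows by their index labels, so make sure that the dataframes have an equal number of rows and the same index labels.

Passing True as the value for the ignore_index parameter relabels the axis being concatenated, which here means that the columns are numbered from 0 to 23. In the following script, df1 and df2 keep the row labels they had in final_data, and those labels do not overlap. The result therefore has 400 rows rather than 200, with NaN wherever a row has no match in the other dataframe.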

In [22]:
df1 = final_data[:200]
print(df1.shape)
df2 = final_data[200:]
print(df2.shape)
final_data2 = pd.concat([df1, df2], axis = 1, ignore_index = True)
print(final_data2.shape)

(200, 12)
(200, 12)
(400, 24)
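
If the goal is to place two dataframes side by side row for row, one option is to reset the row index of each before concatenating. The sketch below illustrates this with two hypothetical two-row dataframes; the column names a and b are placeholders.

```python
import pandas as pd

# Two hypothetical dataframes whose row labels do not overlap.
left = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
right = pd.DataFrame({"b": [3, 4]}, index=[2, 3])

# Disjoint row labels: rows do not line up, so the result has 4 rows and NaN gaps.
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)

# Resetting the row index first lets the rows pair up by position.
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
print(aligned.shape)
```

The drop=True argument discards the old row labels instead of keeping them as an extra column.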


# Sorting Dataframes

To sort a Pandas dataframe, you can use the sort_values() method of the dataframe.

The list of columns used for sorting needs to be passed to the by parameter of the sort_values() method.

The following script sorts the Titanic dataset by ascending order of the passengers’ ages.

In [24]:
age_sorted_data = titanic_data.sort_values(by=['Age'])
age_sorted_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5,,S
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,2666,19.2583,,C
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0,,S


To sort by descending order, you need to pass False as the value for the ascending parameter of the sort_values() method. The following script sorts the dataset by descending order of age.

In [25]:
age_sorted_data = titanic_data.sort_values(by=['Age'], ascending = False)
age_sorted_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q


You can also pass multiple columns to the by parameter of the sort_values() method.

In such a case, the dataset will be sorted by the first column, and in case of equal values for two or more records, the dataset will be sorted by the second column, and so on. The following script first sorts the data by Age and then by Fare, both in descending order.

In [26]:
age_sorted_data = titanic_data.sort_values(by=['Age', 'Fare'], ascending=False)
age_sorted_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
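
The ascending parameter also accepts a list with one value per sort column, which allows a different direction for each. The sketch below uses a hypothetical four-row dataframe to sort by Age in descending order and break ties by Fare in ascending order.

```python
import pandas as pd

# Hypothetical sample; two rows share the same age, and one age is missing.
fares = pd.DataFrame({
    "Age": [71.0, 80.0, 71.0, None],
    "Fare": [49.50, 30.00, 34.65, 8.05],
})

# Age descending, then Fare ascending among rows with equal ages.
ordered = fares.sort_values(by=["Age", "Fare"], ascending=[False, True])
print(ordered)
```

Rows with a missing sort value are placed last by default, regardless of the sort direction.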


# Apply Function

The apply() method is used to apply a function to every value in a particular column, or to every row or column of a dataframe. You pass it either a lambda expression or a named function, which specifies the operation that apply() performs.

For instance, in the following script, the apply() function adds 2 to all the values in the Pclass column of the Titanic dataset.

In [27]:
updated_class = titanic_data.Pclass.apply(lambda x : x + 2)
updated_class.head()

0    5
1    3
2    5
3    3
4    5
Name: Pclass, dtype: int64

In [28]:
def mult(x):
    return x * 2
updated_class = titanic_data.Pclass.apply(mult)
updated_class.head()

0    6
1    2
2    6
3    2
4    6
Name: Pclass, dtype: int64
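
Calling apply() on a whole dataframe with axis=1 passes each row to the function. The sketch below uses a hypothetical three-row dataframe, and FamilySize is an illustrative derived column rather than part of the original dataset.

```python
import pandas as pd

# Hypothetical sample with the SibSp and Parch column names from the Titanic data.
family = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# axis=1 passes each row to the function as a Series.
family["FamilySize"] = family.apply(
    lambda row: row["SibSp"] + row["Parch"] + 1, axis=1
)

# The same result with plain column arithmetic, which is usually faster.
vectorized = family["SibSp"] + family["Parch"] + 1

print(family)
```

For simple arithmetic like this, column arithmetic is generally preferable; apply() is more useful when the per-row logic cannot be expressed with built-in column operations.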

# Pivot & Crosstab

In [29]:
import matplotlib.pyplot as plt
import seaborn as sns
flights_data = sns.load_dataset('flights') 
flights_data.head()

Unnamed: 0,year,month,passengers
0,1949,Jan,112
1,1949,Feb,118
2,1949,Mar,132
3,1949,Apr,129
4,1949,May,121


In [31]:
flights_data_pivot = flights_data.pivot_table(index='month', columns='year', values='passengers', observed=False)
flights_data_pivot.head()

year,1949,1950,1951,1952,1953,1954,1955,1956,1957,1958,1959,1960
month,,,,,,,,,,,,
Jan,112.0,115.0,145.0,171.0,196.0,204.0,242.0,284.0,315.0,340.0,360.0,417.0
Feb,118.0,126.0,150.0,180.0,196.0,188.0,233.0,277.0,301.0,318.0,342.0,391.0
Mar,132.0,141.0,178.0,193.0,236.0,235.0,267.0,317.0,356.0,362.0,406.0,419.0
Apr,129.0,135.0,163.0,181.0,235.0,227.0,269.0,313.0,348.0,348.0,396.0,461.0
May,121.0,125.0,172.0,183.0,229.0,234.0,270.0,318.0,355.0,363.0,420.0,472.0
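
The passenger counts appear as floating-point numbers because pivot_table() aggregates the values in each cell, and its default aggregation is the mean. The sketch below shows this on a hypothetical five-row dataframe in which one month and year pair occurs twice.

```python
import pandas as pd

# Hypothetical sample; the (Jan, 1949) pair appears twice.
sales = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Jan"],
    "year": [1949, 1950, 1949, 1950, 1949],
    "passengers": [100, 120, 110, 130, 140],
})

# Duplicate (month, year) pairs are aggregated; the default aggfunc is the mean.
mean_table = sales.pivot_table(index="month", columns="year", values="passengers")

# A different aggregation can be requested through the aggfunc parameter.
sum_table = sales.pivot_table(
    index="month", columns="year", values="passengers", aggfunc="sum"
)

print(mean_table)
print(sum_table)
```

In the flights dataset each month and year pair occurs once, so the mean of each cell is simply the original value.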


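The crosstab() function is used to compute the cross-tabulation between two columns, that is, a table that counts how often each combination of values occurs.

Let’s compute a cross tab matrix between the passenger class and age columns for the Titanic dataset. Passing margins=True adds an All row and an All column that hold the totals.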

In [33]:
pd.crosstab(titanic_data.Pclass, titanic_data.Age, margins=True)

Age,0.42,0.67,0.75,0.83,0.92,1.0,2.0,3.0,4.0,5.0,...,63.0,64.0,65.0,66.0,70.0,70.5,71.0,74.0,80.0,All
Pclass,,,,,,,,,,,,,,,,,,,,,
1,0,0,0,0,1,0,1,0,1,0,...,1,2,2,0,1,0,2,0,1,186
2,0,1,0,2,0,2,2,3,2,1,...,0,0,0,1,1,0,0,0,0,173
3,1,0,2,0,0,5,7,3,7,3,...,1,0,1,0,0,1,0,1,0,355
All,1,1,2,2,1,7,10,6,10,4,...,2,2,3,1,2,1,2,1,1,714
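
A cross-tabulation is often easier to read when both columns have only a few distinct values. The sketch below uses a hypothetical six-row dataframe and also shows the normalize parameter, which turns counts into proportions.

```python
import pandas as pd

# Hypothetical sample with two low-cardinality columns.
sample = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["female", "male", "female", "male", "male", "female"],
})

# Counts for each (Pclass, Sex) combination, with totals in the "All" margins.
counts = pd.crosstab(sample["Pclass"], sample["Sex"], margins=True)

# normalize="index" converts each row to proportions that sum to 1.
shares = pd.crosstab(sample["Pclass"], sample["Sex"], normalize="index")

print(counts)
print(shares)
```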


# Arithmetic Operations with Where

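The where() function from the NumPy library can also be used to select records and perform arithmetic operations on a Pandas dataframe. It takes a condition, the value to use where the condition is True, and the value to use where it is False.

For instance, in the following script, the where() function is used to add 5 to the values in the Fare column for the rows where the passenger’s age is greater than 20. All other fares are left unchanged.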

In [34]:
import numpy as np
titanic_data.Fare = np.where( titanic_data.Age > 20, titanic_data.Fare +5 , titanic_data.Fare)
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,12.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,76.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,12.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,58.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,13.05,,S
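
The behavior of the condition is easiest to verify on a few rows. The sketch below uses a hypothetical three-row dataframe that includes a missing age, and it also shows Series.mask() as a Pandas alternative that gives the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the third passenger has no recorded age.
tickets = pd.DataFrame({
    "Age": [22.0, 18.0, np.nan],
    "Fare": [7.25, 8.05, 10.0],
})

# np.where returns a NumPy array: Fare + 5 where Age > 20, else the original Fare.
with_numpy = np.where(tickets["Age"] > 20, tickets["Fare"] + 5, tickets["Fare"])

# Series.mask replaces values where the condition is True and keeps the index.
with_mask = tickets["Fare"].mask(tickets["Age"] > 20, tickets["Fare"] + 5)

print(with_numpy)
print(with_mask)
```

A missing age compares as False against 20, so that fare stays unchanged. Because the script in the text assigns the result back to the Fare column, running the same cell twice adds 5 twice.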


In [35]:
def subtract_ten(x):
    return x - 10

titanic_data['Fare'] = titanic_data['Fare'].apply(subtract_ten)
titanic_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,2.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,66.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,2.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,48.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,3.05,,S
