**Table of contents**<a id='toc0_'></a>    
- [Cleaning Data](#toc1_)    
  - [Null values](#toc1_1_)    
    - [What are the different scenarios of missing data?](#toc1_1_1_)    
      - [MCAR (Missing Completely at Random)](#toc1_1_1_1_)    
      - [MAR (Missing at Random)](#toc1_1_1_2_)    
      - [Missing Not At Random (MNAR)](#toc1_1_1_3_)    
    - [Why are null values relevant?](#toc1_1_2_)    
    - [Cleaning Null Values](#toc1_1_3_)    
    - [Checking for Null Values](#toc1_1_4_)    
    - [Dropping Null Values](#toc1_1_5_)    
    - [Filling Null Values](#toc1_1_6_)    
    - [💡 Check for understanding](#toc1_1_7_)    
  - [Dealing with Duplicates](#toc1_2_)    
    - [Identifying Duplicates](#toc1_2_1_)    
    - [Removing Duplicates](#toc1_2_2_)    
    - [Removing Duplicates Based on Specific Columns](#toc1_2_3_)    
    - [Resetting the Index](#toc1_2_4_)    
  - [Formatting Data (Recap)](#toc1_3_)    
    - [Formatting Numeric Values (Recap)](#toc1_3_1_)    
    - [Formatting Strings (Recap)](#toc1_3_2_)    
    - [Formatting Dates](#toc1_3_3_)
  - [Changing column datatypes](#toc1_4_)    
  - [Cleaning Column Names](#toc1_5_)    
- [Using `apply()`, `map()`, and `applymap()`](#toc2_)    
    - [More examples](#toc2_1_1_)    
      - [Comparing Map and Apply](#toc2_1_1_1_)    
      - [Calculating the length of the name](#toc2_1_1_2_)    
      - [Converting to float some columns with applymap()](#toc2_1_1_3_)    
      - [Modifying columns names with apply()](#toc2_1_1_4_)    
    - [💡 Check for understanding](#toc2_1_2_)    
    - [💡 Check for understanding](#toc2_1_3_)    
- [Filtering Data](#toc3_)    
    - [Creating a condition](#toc3_1_1_)    
    - [Filtering df](#toc3_1_2_)    
    - [Using multiple conditions](#toc3_1_3_)    
- [More Data Manipulation](#toc4_)    
  - [Setting the index](#toc4_1_)    
  - [Adding/removing rows and/or columns](#toc4_2_)    
  - [💡 Check for understanding](#toc4_3_)    
- [Summary](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Cleaning Data](#toc0_)

## <a id='toc1_1_'></a>[Null values](#toc0_)

Null values (also known as missing values) are common in datasets and can hinder data analysis and modeling. It is essential to handle null values appropriately to ensure accurate and reliable results. Pandas provides various methods to clean and handle null values in datasets.

In Python, `None` is a special constant that represents the absence of a value. It is commonly used to indicate that a variable or function has no value or hasn't been assigned any value. For example, if a function does not explicitly return a value, it implicitly returns `None`.

On the other hand, `NaN` stands for "Not a Number" and is a special value used to represent missing or undefined numerical data. `NaN` is part of the floating-point representation and is commonly used in numeric data structures like Pandas DataFrames and Series to indicate missing or invalid numerical values.
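
To make the distinction concrete, here is a minimal sketch of how Pandas treats `None` and `NaN` when building a Series. The variable names `numeric` and `text` are made up for illustration.

```python
import numpy as np
import pandas as pd

# None placed in a numeric Series is converted to NaN, so the dtype becomes float64
numeric = pd.Series([1, None, 3])

# In a Series of strings, None is also treated as a missing value
text = pd.Series(['a', None, 'c'])

print(numeric.dtype)            # float64
print(numeric.isna().tolist())  # [False, True, False]
print(text.isna().tolist())     # [False, True, False]

# NaN is never equal to itself, so use isna() rather than == to find it
print(np.nan == np.nan)         # False
```

In practice this means `isna()` / `isnull()` is the reliable way to detect both kinds of missing value.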

### <a id='toc1_1_1_'></a>[What are the different scenarios of missing data?](#toc0_)

#### <a id='toc1_1_1_1_'></a>[MCAR (Missing Completely at Random)](#toc0_)

MCAR occurs when data is missing for reasons that have nothing to do with the information being recorded: the missingness has no connection to any values in the dataset, whether observed or unobserved. This typically happens when there's **a glitch in the system collecting the data** rather than something about the data itself, so there's no way to predict where values will be missing.

For example:
- **Lost Surveys**: Say a surveyor accidentally dropped completed questionnaires into a lake. The fact that there's information missing from these specific questionnaires doesn’t depend on who filled them out or what they contained. The loss of data is completely random.  

- **Sensor Failure in a Weather Station**: A weather station collects temperature data hourly, but the sensor occasionally fails for no particular reason. Because these failures don’t depend on the time of day, season, or temperature, the missing values occur randomly due to a technical issue. However, if the failures depended on **another recorded variable**, such as the time of day or rainfall, the data would no longer be MCAR but **MAR**. If they depended on the unrecorded temperature itself (for example, the sensor failing in extreme heat), the data would be **MNAR**.  

#### <a id='toc1_1_1_2_'></a>[MAR (Missing at Random)](#toc0_)

> Despite its name, MAR occurs when the absence of data is not random. The probability of missing data is not equal for all measurements. They’re more likely for some observations than others. However, measurements of observed variables predict the unequal probability of missing values occurring. ([Statistics by Jim](https://statisticsbyjim.com/basics/missing-data/))

This means that we can figure out where a value might be missing based on other information in the dataset.

For example: 

- **Medical Study Missing Data on a Drug’s Side Effects**: In a medical study, older participants might be more likely to drop out or miss follow-up appointments. As a result, we may be missing some follow-up data on side effects for older participants. The missing data is linked to an observable factor (age), but it’s not related to the unobserved side effect outcomes themselves.

-  **Student Performance Data**: In a school, students may be more likely to skip reporting their test scores in optional evaluations. For instance, students with lower attendance rates might be more likely to miss submitting their scores. The missing test scores are related to an observed factor (attendance rate), not the missing test scores themselves. 

- **Workplace Wellness Programs and Job Satisfaction**: Employees who are part-time might be less likely to participate in wellness programs, leading to missing data on wellness program engagement for part-time employees. The missing engagement data is related to observed information (employment status: part-time vs. full-time) rather than how much the part-time employees actually benefit from the program.

#### <a id='toc1_1_1_3_'></a>[Missing Not At Random (MNAR)](#toc0_)

Data is MNAR when the reason a value is missing is related to the missing value itself. It is therefore not possible to predict from the observed data alone where values will be missing, because the information needed to do so is exactly what is missing.

For example:
- **Sensitive Information in Surveys**: Suppose a survey asks people about their income, and higher-income individuals are more likely to skip the question due to privacy concerns. Here, the likelihood of a person not responding is directly related to the value of their income (higher income = more likely to skip). This is MNAR because the income is missing specifically because of the value it would have had, not because of any other observed variable like age or occupation, although those could also be correlated with a high income.    

- **Health Studies and Sensitive Symptoms**: In a medical study, participants with more severe symptoms might skip reporting certain symptoms or drop out of the study because they’re uncomfortable sharing the extent of their condition. This missing data would be MNAR since people are missing from the dataset specifically because of their symptom severity, which we can’t observe directly once they’ve dropped out.   

- **Employee Feedback and Job Satisfaction**: If a company survey asks employees about job satisfaction, those who are very dissatisfied may skip questions or avoid taking the survey altogether to avoid potential consequences. This situation would be MNAR because the likelihood of missing data (not answering the job satisfaction questions) is directly related to the unobserved values (job dissatisfaction). However, it could be linked to other factors, such as compensation.  

- **Credit Scores and Loan Applications**: In a financial study, people with very low credit scores might be less likely to report their scores or apply for loans, creating a gap in data. Since the missing data (credit scores) is related to the unobserved values (the actual low scores), this scenario would be classified as MNAR.
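
The three mechanisms can be compared with a small simulation. The sketch below is only an illustration on synthetic data: the `age` and `income` columns, the missingness probabilities, and the 60-year and median cut-offs are all invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Synthetic population in which income rises with age (made-up numbers)
age = rng.integers(18, 80, size=n)
income = 20_000 + 500 * age + rng.normal(0, 5_000, size=n)
demo = pd.DataFrame({'age': age, 'income': income})

# Probability that income is missing under each mechanism
p_mcar = np.full(n, 0.3)                                                # same for everyone
p_mar = np.where(demo['age'] > 60, 0.6, 0.1)                            # depends on observed age
p_mnar = np.where(demo['income'] > demo['income'].median(), 0.6, 0.1)   # depends on income itself

true_mean = demo['income'].mean()
observed_means = {}
for name, p in [('MCAR', p_mcar), ('MAR', p_mar), ('MNAR', p_mnar)]:
    missing = rng.random(n) < p
    # mask() replaces the selected values with NaN; mean() then skips them
    observed_means[name] = demo['income'].mask(missing).mean()

print(f"true mean: {true_mean:.0f}")
for name, value in observed_means.items():
    print(f"{name}: {value:.0f}")
```

Under MCAR the mean of the observed incomes stays close to the true mean. Under MAR and MNAR it shifts, but only in the MAR case is the cause (age) still available in the data to correct for it.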

### <a id='toc1_1_2_'></a>[Why are null values relevant?](#toc0_)

- **Biased Analysis**: Except in the MCAR case, missing values can bias the conclusions we draw from data, mostly by ignoring a subset of the population that is potentially very different from the population whose data we collected.

- **Lower Statistical Power / Precision**: Having missing data reduces the sample size of our dataset, which in turn reduces the precision and power of statistical tests used for association and hypothesis testing. 



### <a id='toc1_1_3_'></a>[Cleaning Null Values](#toc0_)

1. Checking for Null Values:
   - Use `isnull()` method to check for null values in a DataFrame or Series.
   - Use `notnull()` method to check for non-null values in a DataFrame or Series.

2. Dropping Null Values:  
   _Typically useful for MCAR but can be harmful for MAR since it will likely bias the overall analysis._
   - Use `dropna()` method to remove rows with null values from a DataFrame.
   - Use `dropna(axis=1)` to remove columns with null values.

3. Filling Null Values:
   - Use `fillna(value)` method to replace null values with a specific value.
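   - Use `ffill()` to forward-fill null values with the previous non-null value (the older `fillna(method='ffill')` form is deprecated since pandas 2.1).
   - Use `bfill()` to backward-fill null values with the next non-null value (the older `fillna(method='bfill')` form is deprecated as well).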
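
As a quick reference, the methods listed above can be tried on a tiny DataFrame. This is a minimal sketch; `toy` and its columns are invented for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', None]})

# 1. Checking: count nulls per column
print(toy.isnull().sum())

# 2. Dropping: rows with any null, then columns with any null
print(toy.dropna())          # only row 0 survives
print(toy.dropna(axis=1))    # no columns left, both contain a null

# 3. Filling: fixed value, forward fill, backward fill
print(toy['a'].fillna(0).tolist())   # [1.0, 0.0, 3.0]
print(toy['a'].ffill().tolist())     # [1.0, 1.0, 3.0]
print(toy['a'].bfill().tolist())     # [1.0, 3.0, 3.0]
```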

### <a id='toc1_1_4_'></a>[Checking for Null Values](#toc0_)

In [2]:
#!pip install pandas
#!pip install numpy

import pandas as pd
import numpy as np


# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
df = pd.read_csv(url)

In [3]:
# Review dataframe and df columns
print(df.columns)
display(df.head())

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Checking for Null Values
df.isnull()  # Returns a DataFrame with True where values are null
# isnull is an alias of isna

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


When working with large datasets, using `isna()` or `isnull()` along with `any()`, `all()` and `sum()` in Pandas becomes essential for quick and efficient data quality assessment.

In [5]:
# Check which cols have any null values
df.isna().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In [6]:
# Check if any column has only null values
df.isna().all()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool

By default, `sum()` adds up the values in each column, counting `True` as 1 and `False` as 0.

In [7]:
# Count the number of null values in each column
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

If we add the parameter `axis=1` with the `sum()` function, we can calculate the sum of each row (along the columns) of the DataFrame `df`. This results in a Series that contains the count of null values in each row.

In [8]:
# Get null values per row
df.isna().sum(axis=1)

0      1
1      0
2      1
3      0
4      1
      ..
886    1
887    0
888    2
889    0
890    1
Length: 891, dtype: int64

In [9]:
# Get % of null values per row
df.isna().sum(axis=1) * 100 / df.shape[1]

0       8.333333
1       0.000000
2       8.333333
3       0.000000
4       8.333333
         ...    
886     8.333333
887     0.000000
888    16.666667
889     0.000000
890     8.333333
Length: 891, dtype: float64

In [10]:
# Sort descending
(df.isna().sum(axis=1) * 100 / df.shape[1]).sort_values(ascending=False)

502    16.666667
773    16.666667
517    16.666667
783    16.666667
359    16.666667
         ...    
659     0.000000
662     0.000000
438     0.000000
215     0.000000
445     0.000000
Length: 891, dtype: float64

In [11]:
# What about the null values in each column?
round(df.isna().sum() * 100 / df.shape[0], 2)

PassengerId     0.00
Survived        0.00
Pclass          0.00
Name            0.00
Sex             0.00
Age            19.87
SibSp           0.00
Parch           0.00
Ticket          0.00
Fare            0.00
Cabin          77.10
Embarked        0.22
dtype: float64
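
A shorter way to get the same percentages is to take the mean of the boolean mask, since the mean of `True`/`False` values is the fraction of `True`. A minimal sketch on an invented `toy` DataFrame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'x': [1.0, np.nan, np.nan, 4.0],
    'y': [1.0, 2.0, 3.0, np.nan],
})

# Explicit version: count the nulls, then divide by the number of rows
pct_by_sum = toy.isna().sum() * 100 / toy.shape[0]

# Shortcut: the mean of a boolean mask is the share of True values
pct_by_mean = toy.isna().mean() * 100

print(pct_by_mean)   # x: 50.0, y: 25.0
```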

### <a id='toc1_1_5_'></a>[Dropping Null Values](#toc0_)

In [12]:
# Dropping rows with any Null Values
df.dropna() 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


How many rows did we remove from our dataframe?

In [13]:
# Check rows removed
df.shape[0] - df.dropna().shape[0]

Note that `dropna()` returns a new DataFrame. As the next cell shows, the original `df` still contains its rows with NaN values. To keep the change, either assign the result back, as in `df = df.dropna()`, or pass `inplace=True`, as in `df.dropna(inplace=True)`.

In [14]:
# Check original dataframe
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
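
The difference between the returned copy and an in-place change can be checked on a small example. This is a sketch with an invented `toy` DataFrame, so it leaves `df` untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', 'z']})

# dropna() returns a new DataFrame and leaves toy untouched
cleaned = toy.dropna()
rows_removed = len(toy) - len(cleaned)
print(len(toy), len(cleaned), rows_removed)   # 3 2 1

# With inplace=True the DataFrame itself is modified and the call returns None
result = toy.dropna(inplace=True)
print(result, len(toy))                       # None 2
```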


In [15]:
# Dropping columns with Null Values
df.dropna(axis=1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.0500
...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,0,0,211536,13.0000
887,888,1,1,"Graham, Miss. Margaret Edith",female,0,0,112053,30.0000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,1,2,W./C. 6607,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,0,0,111369,30.0000


In the `dropna()` method of Pandas DataFrame, the `subset`, `how`, and `thresh` parameters are used to control the behavior of dropping rows or columns containing NaN (null) values, when we don't want to drop them just because they have *one* null value:

- `subset`: It allows you to specify a subset of columns on which to apply the `dropna()` operation. Only the rows containing NaN values in the specified subset of columns will be dropped.

In [16]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [17]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [18]:
df.sample(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
818,819,0,3,"Holm, Mr. John Fredrik Alexander",male,43.0,0,0,C 7075,6.45,,S
544,545,0,1,"Douglas, Mr. Walter Donald",male,50.0,1,0,PC 17761,106.425,C86,C
76,77,0,3,"Staneff, Mr. Ivan",male,,0,0,349208,7.8958,,S
595,596,0,3,"Van Impe, Mr. Jean Baptiste",male,36.0,1,1,345773,24.15,,S
705,706,0,2,"Morley, Mr. Henry Samuel (""Mr Henry Marshall"")",male,39.0,0,0,250655,26.0,,S
502,503,0,3,"O'Sullivan, Miss. Bridget Mary",female,,0,0,330909,7.6292,,Q
388,389,0,3,"Sadlier, Mr. Matthew",male,,0,0,367655,7.7292,,Q
717,718,1,2,"Troutt, Miss. Edwina Celia ""Winnie""",female,27.0,0,0,34218,10.5,E101,S
763,764,1,1,"Carter, Mrs. William Ernest (Lucile Polk)",female,36.0,1,2,113760,120.0,B96 B98,S
594,595,0,2,"Chapman, Mr. John Henry",male,37.0,1,0,SC/AH 29037,26.0,,S


In [19]:
# Drop cabin nulls
df.dropna(subset=['Cabin']).tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C


- `how`: It specifies the condition for dropping rows. `'any'` (the default) drops rows that contain at least one NaN in the `subset`, while `'all'` drops only rows in which every value in the `subset` is NaN.

In [20]:
# Drop rows only if ALL values are null
# (no Titanic row is entirely null, so nothing is dropped here)
df.dropna(how='all').tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


- `thresh`: It sets a minimum threshold for the number of non-null values that a row must have in the `subset` in order to be kept. Rows with fewer non-null values than the specified threshold will be dropped.

In [21]:
# Test different thresh values
df.dropna(thresh=3).tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
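
Because the Titanic rows always have at least three non-null values, `how='all'` and `thresh=3` drop nothing there. The effect of the three parameters is easier to see on a small DataFrame; the sketch below uses an invented `toy` frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, np.nan, 3.0, np.nan],
    'c': [1.0, np.nan, np.nan, 4.0],
})

# how='all': drop only the row where every value is null (row 1)
print(toy.dropna(how='all').index.tolist())                         # [0, 2, 3]

# thresh=2: keep rows with at least 2 non-null values
print(toy.dropna(thresh=2).index.tolist())                          # [0, 3]

# subset: only look at columns a and c when deciding what to drop
print(toy.dropna(subset=['a', 'c']).index.tolist())                 # [0, 3]

# subset combined with how='all': drop rows where a and b are both null
print(toy.dropna(subset=['a', 'b'], how='all').index.tolist())      # [0, 2, 3]
```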


### <a id='toc1_1_6_'></a>[Filling Null Values](#toc0_)

`fillna()` is a Pandas method used to replace NaN (null) values in a DataFrame or Series with specified values.
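- You can use `inplace=True` to modify the DataFrame directly when calling `fillna()` on the DataFrame itself. When filling a single column, assign the result back instead, as in `df['col'] = df['col'].fillna(value)`.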

In [22]:
# Filling Null Values
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [23]:
# boolean mask
df['Age'].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool

In [24]:
df['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [25]:
# First we define the condition
age_condition = df['Age'].isnull()

df[age_condition]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


In [26]:
df['Age'].fillna(-1)

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    -1.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

Be careful when filling with a value of a different data type, since Pandas will change the data type of the whole column. For example:

In [27]:
# Check dtypes
df.dtypes


In [28]:
# Fill with "N/A" instead of -1 
df.fillna('N/A')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,N/A,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,N/A,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,N/A,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,N/A,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,N/A,1,2,W./C. 6607,23.4500,N/A,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [29]:
# Check dtypes after filling with 'N/A'
df.fillna('N/A').dtypes


To avoid this, we can choose which columns to apply `fillna()` to:

In [30]:
# Fill Cabin only
df['Cabin'].fillna('N/A')

0       N/A
1       C85
2       N/A
3      C123
4       N/A
       ... 
886     N/A
887     B42
888     N/A
889    C148
890     N/A
Name: Cabin, Length: 891, dtype: object
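
Another option is to pass a dictionary to `fillna()`, mapping each column name to its own fill value, so every column keeps a value of the right type. A minimal sketch on an invented `toy` DataFrame that mimics the `Age` and `Cabin` columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': ['C85', None, None],
})

# One fill value per column: a number for Age, a string for Cabin
filled = toy.fillna({'Age': -1, 'Cabin': 'N/A'})

print(filled)
print(filled.dtypes)   # Age is still float64
```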

In [31]:
# Fill a subset of columns
df[ ['Cabin', 'Age'] ].fillna(0)

Unnamed: 0,Cabin,Age
0,0,22.0
1,C85,38.0
2,0,26.0
3,C123,35.0
4,0,35.0
...,...,...
886,0,27.0
887,B42,19.0
888,0,0.0
889,C148,26.0


In [32]:
df['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

We can also fill the null values with a summary statistic of the column, such as its `mean()` or `median()`.

In [33]:
# Compute the mean age, then preview the column filled with it
mean_age = df['Age'].mean()


df['Age'].fillna(mean_age)

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64
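
The choice between mean and median matters when the column has outliers. The sketch below uses an invented `ages` Series with one extreme value to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up ages with one outlier (90) and one missing value
ages = pd.Series([20.0, 22.0, 24.0, np.nan, 90.0])

filled_mean = ages.fillna(ages.mean())       # mean of the known values is 39.0
filled_median = ages.fillna(ages.median())   # median of the known values is 23.0

print(filled_mean.tolist())     # [20.0, 22.0, 24.0, 39.0, 90.0]
print(filled_median.tolist())   # [20.0, 22.0, 24.0, 23.0, 90.0]
```

The outlier pulls the mean upward, while the median stays close to the typical values.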

In [34]:
# Fill with age mean


- Two common methods for filling NaN values are `ffill`, which forward fills using the last valid value, and `bfill`, which backward fills using the next valid value.

In [35]:
# Forward-fill null values in the Age column
# (ffill() replaces the deprecated fillna(method='ffill'))
df['Age'].ffill()


0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    19.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [36]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [37]:
# Backward-fill null values in the Age column
# (bfill() replaces the deprecated fillna(method='bfill'))
df['Age'].bfill()


0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    26.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
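
Forward and backward fill each leave a gap at one end of the data, because there is no earlier (or later) value to copy. A minimal sketch on an invented Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()           # leading NaN stays: nothing before it to copy
backward = s.bfill()          # trailing NaN stays: nothing after it to copy
limited = s.ffill(limit=1)    # fill at most one consecutive NaN
both = s.ffill().bfill()      # chain both to close the remaining gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
print(both.tolist())          # [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]
```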

### <a id='toc1_1_7_'></a>[💡 Check for understanding](#toc0_)

Consider the following DataFrame containing information about students:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Cathy', None, 'Eva'],
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Female', None, 'Female', 'Male', None],
    'Score': [90, None, 78, None, 85]
}

df_students = pd.DataFrame(data)
```

Your task is to perform the following data cleaning tasks:

1. Check for null values in the DataFrame using `isna()` or `isnull()`. How many nulls are there in each column?

2. Replace the null values in the 'Age' column with the average age of the students.

3. Replace the null values in the 'Gender' column with "Female".

4. Drop any rows that have null values in the 'Name' column.

5. Forward fill (ffill) the null values in the 'Score' column with the previous valid value.

6. After performing all the cleaning steps, print the cleaned DataFrame.


In [38]:
# Your code goes here

data = {
    'Name': ['Alice', 'Bob', 'Cathy', None, 'Eva'],
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Female', None, 'Female', 'Male', None],
    'Score': [90, None, 78, None, 85]
}

df_students = pd.DataFrame(data)

df_students


Unnamed: 0,Name,Age,Gender,Score
0,Alice,25.0,Female,90.0
1,Bob,30.0,,
2,Cathy,,Female,78.0
3,,22.0,Male,
4,Eva,28.0,,85.0


In [39]:
# 1 - Check for nulls in dataframe
df_students.isnull()

Unnamed: 0,Name,Age,Gender,Score
0,False,False,False,False
1,False,False,True,True
2,False,True,False,False
3,True,False,False,True
4,False,False,True,False


In [40]:
# 1.1 Total nulls in the columns

df_students.isnull().sum()

# 1.2 total nulls in the dataframe

df_students.isnull().sum().sum()

np.int64(6)

In [41]:
int(df_students.isnull().sum().sum())

6

In [42]:
# 2 - Fill nulls of the Age column with the column mean

mean_age = df_students['Age'].mean()

# Assign the result back. Calling fillna(..., inplace=True) on df_students['Age']
# acts on an intermediate object: it raises a FutureWarning and will stop working in pandas 3.0.
df_students['Age'] = df_students['Age'].fillna(mean_age)


In [43]:
df_students.isnull().sum()

Name      1
Age       0
Gender    2
Score     2
dtype: int64

In [44]:
# 3 - Fill null entries of Gender with 'Female'

# Assign the result back instead of using inplace=True on a single column
df_students['Gender'] = df_students['Gender'].fillna('Female')


In [45]:
# 4 - Remove any row that has a null value in the Name column
# (inplace=True on the DataFrame itself, so the change is kept)

df_students.dropna(subset=['Name'], inplace=True)
df_students

Unnamed: 0,Name,Age,Gender,Score
0,Alice,25.0,Female,90.0
1,Bob,30.0,Female,
2,Cathy,26.25,Female,78.0
4,Eva,28.0,Female,85.0


In [47]:
# 5 - Forward-fill the Score column with the previous existing value

df_students['Score'] = df_students['Score'].ffill()

# Previously: one method with three modes (the method argument is now deprecated)
# fillna(value), fillna(method='ffill'), fillna(method='bfill')

# Now: three separate methods
# fillna()
# ffill()
# bfill()

In [48]:
# 6 - Print the cleaned DataFrame

display(df_students)

Unnamed: 0,Name,Age,Gender,Score
0,Alice,25.0,Female,90.0
1,Bob,30.0,Female,90.0
2,Cathy,26.25,Female,78.0
4,Eva,28.0,Female,85.0


## <a id='toc1_2_'></a>[Dealing with Duplicates](#toc0_)

In data analysis, it's common to encounter duplicate values in datasets. Duplicates can distort our analysis and lead to incorrect conclusions. Fortunately, pandas provides efficient methods to handle duplicates.


### <a id='toc1_2_1_'></a>[Identifying Duplicates](#toc0_)

To identify duplicate rows in a DataFrame, we can use the `duplicated()` method, which returns a boolean Series indicating whether each row is a duplicate or not. We can then use the `sum()` method to count the total number of duplicates.
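As a small self-contained illustration (the toy DataFrame `df_toy` below is made up for this sketch and is not part of the Titanic data), `duplicated()` flags every row that repeats an earlier one, and `sum()` counts those flags:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly
df_toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Ann", "Cid"],
    "age": [30, 25, 30, 25],
})

dup_mask = df_toy.duplicated()      # True for rows that repeat an earlier row
n_duplicates = int(dup_mask.sum())  # True counts as 1, so this is the number of duplicates

print(dup_mask.tolist())  # [False, False, True, False]
print(n_duplicates)       # 1
```

Note that the first occurrence is not flagged; only the later repeats are.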


In [51]:
# Check total # of duplicates

df.duplicated().sum()

np.int64(0)

In [52]:
# Check if there are any duplicates

df.duplicated().any()

np.False_

To check for duplicates in specific columns, we can use the `duplicated()` method with the `subset` parameter, or select the column first and call `duplicated()` on the resulting Series.


In [53]:
# Check for duplicates in Age column

df.duplicated(subset= ['Age'])


0      False
1      False
2      False
3      False
4       True
       ...  
886     True
887     True
888     True
889     True
890     True
Length: 891, dtype: bool

### <a id='toc1_2_2_'></a>[Removing Duplicates](#toc0_)

To remove duplicates from a DataFrame, we can use the `drop_duplicates()` method. By default, this method keeps the first occurrence of each duplicated row and removes the rest.


In [54]:
# Check the shape first: df has no fully duplicated rows, so drop_duplicates() would leave it unchanged

df.shape

(891, 12)
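Since the Titanic data has no fully duplicated rows, here is a minimal sketch on a made-up DataFrame (`df_toy` is hypothetical) that shows what `drop_duplicates()` does when duplicates exist:

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly
df_toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Ann"],
    "age": [30, 25, 30],
})

# drop_duplicates() returns a new DataFrame; df_toy itself is not modified
deduped = df_toy.drop_duplicates()

print(deduped.index.tolist())  # [0, 1]
print(len(df_toy))             # 3
```

To keep the result, assign it back, for example `df_toy = df_toy.drop_duplicates()`.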

### <a id='toc1_2_3_'></a>[Removing Duplicates Based on Specific Columns](#toc0_)

Sometimes, we may want to remove duplicates based on specific columns. We can pass a subset of column names to the `drop_duplicates()` method to achieve this.


In [55]:
# Remove duplicates based on specific columns

df.drop_duplicates(subset=['Age'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.1000,C123,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5000,,S
767,768,0,3,"Mangan, Miss. Mary",female,30.50,0,0,364850,7.7500,,Q
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
843,844,0,3,"Lemberopolous, Mr. Peter L",male,34.50,0,0,2683,6.4375,,C


By default, `drop_duplicates()` keeps the first occurrence of each duplicated row. If we want to keep the last occurrence instead, we can set the `keep` parameter to `'last'`.


In [56]:
# Build a DataFrame with known duplicates by appending 20 randomly sampled rows to df

df_sampled = df.sample(20)

df_concatenated = pd.concat([df, df_sampled])

In [57]:
df_concatenated.shape

(911, 12)

In [58]:
df_concatenated.drop_duplicates()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [59]:
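# keep='last' keeps the last occurrence of each duplicated row
# (df itself has no duplicates, so nothing is dropped here)
df.drop_duplicates(keep='last')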

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [60]:
df_concatenated[df_concatenated['PassengerId']==191]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
190,191,1,2,"Pinsky, Mrs. (Rosa)",female,32.0,0,0,234604,13.0,,S
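The effect of the `keep` parameter is easier to see on a tiny example. The sketch below uses a made-up DataFrame (`df_toy`, with hypothetical `ticket` and `fare` columns) and compares which index labels survive for each option:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
df_toy = pd.DataFrame({
    "ticket": ["A1", "B2", "A1", "C3"],
    "fare": [10.0, 20.0, 10.0, 30.0],
})

kept_first = df_toy.drop_duplicates(keep="first")  # default: keep row 0, drop row 2
kept_last = df_toy.drop_duplicates(keep="last")    # keep row 2, drop row 0
kept_none = df_toy.drop_duplicates(keep=False)     # drop every row that has a duplicate

print(kept_first.index.tolist())  # [0, 1, 3]
print(kept_last.index.tolist())   # [1, 2, 3]
print(kept_none.index.tolist())   # [1, 3]
```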


### <a id='toc1_2_4_'></a>[Resetting the Index](#toc0_)

When removing duplicates, the DataFrame index may have gaps due to removed rows. To reset the index after removing duplicates, we can use the `reset_index()` method with the `drop=True` parameter.


In [61]:
# Inspect the DataFrame before resetting the index (the sampled rows at the end keep their old labels)

df_concatenated.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [62]:
df_concatenated.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
322,323,1,2,"Slayter, Miss. Hilda Mary",female,30.0,0,0,234818,12.35,,Q
191,192,0,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S
390,391,1,1,"Carter, Mr. William Ernest",male,36.0,1,2,113760,120.0,B96 B98,S
327,328,1,2,"Ball, Mrs. (Ada E Hall)",female,36.0,0,0,28551,13.0,D,S
517,518,0,3,"Ryan, Mr. Patrick",male,,0,0,371110,24.15,,Q


In [63]:
df_concatenated.reset_index(drop=True, inplace=True)

In [64]:
df_concatenated.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
906,323,1,2,"Slayter, Miss. Hilda Mary",female,30.0,0,0,234818,12.35,,Q
907,192,0,2,"Carbines, Mr. William",male,19.0,0,0,28424,13.0,,S
908,391,1,1,"Carter, Mr. William Ernest",male,36.0,1,2,113760,120.0,B96 B98,S
909,328,1,2,"Ball, Mrs. (Ada E Hall)",female,36.0,0,0,28551,13.0,D,S
910,518,0,3,"Ryan, Mr. Patrick",male,,0,0,371110,24.15,,Q
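The cells above reset the index without dropping the duplicates first. A minimal sketch of the combined pattern, on a made-up DataFrame (`df_toy` is hypothetical), could look like this:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
df_toy = pd.DataFrame({
    "ticket": ["A1", "B2", "A1", "C3"],
    "fare": [10.0, 20.0, 10.0, 30.0],
})

deduped = df_toy.drop_duplicates(keep="last")  # leaves a gap in the index
clean = deduped.reset_index(drop=True)         # drop=True discards the old labels

print(deduped.index.tolist())  # [1, 2, 3]
print(clean.index.tolist())    # [0, 1, 2]
```

Without `drop=True`, the old index would be kept as an extra column named `index`.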


## <a id='toc1_3_'></a>[Formatting Data (Recap)](#toc0_)

### <a id='toc1_3_1_'></a>[Formatting Numeric Values (Recap)](#toc0_)


1. `round()` Method:
   - Rounds numeric values to a specified number of decimal places.

2. `format()` Method:
   - Formats numeric values as strings for better representation.

In [65]:
num = 123456.78910
num

123456.7891

In [66]:
# Apply round

int(num)        # int() truncates the decimals instead of rounding

round(num, 3)   # round to 3 decimal places

123456.789

In [67]:
# Format to one decimal
format(num, '.1f')

'123456.8'

In [68]:
# Apply round to Fare


In [69]:
# We could also use a lambda function:


### <a id='toc1_3_2_'></a>[Formatting Strings (Recap)](#toc0_)

We can apply all of the string methods we learnt about in the data structures lesson, such as `len`, `lower`, `upper`, `split`, `replace`, etc.:

In [70]:
str_example = "This is my second Pandas Lesson"


In [71]:
# string length
len(str_example)

31

In [72]:
# name length - direct method
df['Name'].len()

AttributeError: 'Series' object has no attribute 'len'

In [73]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [74]:
df['PassengerId'].str.len()

AttributeError: Can only use .str accessor with string values!

In [None]:
# name length - str method


In [148]:
# Upper/lowercase

str_example.upper()


'THIS IS MY SECOND PANDAS LESSON'

In [75]:
# Upper/lowercase


### <a id='toc1_3_3_'></a>[Changing string values](#toc0_)

In [76]:
# I want to upper case all of my Name column values

df['Name'].str.upper()

0                                BRAUND, MR. OWEN HARRIS
1      CUMINGS, MRS. JOHN BRADLEY (FLORENCE BRIGGS TH...
2                                 HEIKKINEN, MISS. LAINA
3           FUTRELLE, MRS. JACQUES HEATH (LILY MAY PEEL)
4                               ALLEN, MR. WILLIAM HENRY
                             ...                        
886                                MONTVILA, REV. JUOZAS
887                         GRAHAM, MISS. MARGARET EDITH
888             JOHNSTON, MISS. CATHERINE HELEN "CARRIE"
889                                BEHR, MR. KARL HOWELL
890                                  DOOLEY, MR. PATRICK
Name: Name, Length: 891, dtype: object

In [77]:


# I want to change 'W./C.' -> 'w/c'
df['Ticket']

0             A/5 21171
1              PC 17599
2      STON/O2. 3101282
3                113803
4                373450
             ...       
886              211536
887              112053
888          W./C. 6607
889              111369
890              370376
Name: Ticket, Length: 891, dtype: object

In [78]:
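# regex=False treats 'W./C.' as literal text (as a regex, '.' would match any character)
df['Ticket'].str.replace('W./C.', 'w/c', regex=False)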

0             A/5 21171
1              PC 17599
2      STON/O2. 3101282
3                113803
4                373450
             ...       
886              211536
887              112053
888            w/c 6607
889              111369
890              370376
Name: Ticket, Length: 891, dtype: object

### <a id='toc1_3_4_'></a>[Formatting Dates](#toc0_)

We will study this in another Notebook.

## <a id='toc1_4_'></a>[Changing data types](#toc1_4_)

In [79]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### <a id='toc1_4_2_'></a>[Convert columns datatypes](#toc1_4_2_)

In [81]:
# Converting passenger id

df['PassengerId'].astype(float)

0        1.0
1        2.0
2        3.0
3        4.0
4        5.0
       ...  
886    887.0
887    888.0
888    889.0
889    890.0
890    891.0
Name: PassengerId, Length: 891, dtype: float64

In [82]:
# Convert to boolean

df['Survived'].astype(bool)

0      False
1       True
2       True
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Name: Survived, Length: 891, dtype: bool

In [83]:
# First convert Age to string (note: NaN becomes the literal string 'nan', so it no longer counts as null)

df['Age'] = df['Age'].astype(str)

In [84]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [85]:
# error with:
df['Age'].astype(int)

ValueError: invalid literal for int() with base 10: '22.0'

In [176]:
# When unsure whether a column should be float or integer, pd.to_numeric is a safe choice:
# errors='coerce' turns values that cannot be parsed into NaN

pd.to_numeric(df['Age'], errors='coerce')

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [86]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age             object
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
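Note that the `pd.to_numeric` call above was not assigned back, which is why `Age` is still `object` in `df.dtypes`. The sketch below, on a made-up Series (`ages` is hypothetical), shows the conversion being stored and, as an optional extra step, a cast to pandas' nullable `Int64` type, which can hold integers alongside missing values:

```python
import pandas as pd

# Hypothetical string column: valid numbers, the text 'nan', and an unparseable value
ages = pd.Series(["22.0", "38.0", "nan", "unknown"], name="Age")

# errors='coerce' converts what it can and turns everything else into NaN
ages_numeric = pd.to_numeric(ages, errors="coerce")

# Plain int cannot hold NaN; the nullable 'Int64' dtype can
ages_int = ages_numeric.astype("Int64")

print(ages_numeric.dtype)          # float64
print(ages_numeric.isna().sum())   # 2
print(ages_int.dtype)              # Int64
```

On the Titanic DataFrame the first step would be `df['Age'] = pd.to_numeric(df['Age'], errors='coerce')`.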

## <a id='toc1_5_'></a>[Cleaning Column Names](#toc1_5_)

We can access the columns using `df.columns`

In [87]:
# Review columns


titanic_columns = df.columns.tolist()
titanic_columns

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In order to modify them, we can assign new column names to `df.columns` by doing `df.columns = [list_of_new_column_names]` or we can use the `rename()` method to just modify a few of them.

In [88]:
# How can I create a list of columns that is lowercase?

titanic_columns_lower = [col_.lower() for col_ in titanic_columns]
titanic_columns_lower

['passengerid',
 'survived',
 'pclass',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked']

In [89]:
# Can also use a custom list...
df.columns = titanic_columns_lower

In [90]:
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [91]:
# Rename using a dictionary

renaming_dict = {
    'parch': 'parents_children',
    'sibsp': 'siblings_spouses'
}



df.rename(columns= renaming_dict, inplace= True)

In [93]:
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# <a id='toc2_'></a>[Using `apply()`, `map()`, and `applymap()`](#toc0_)

- `apply()`
    - Apply a custom function to each element of a Series (or, on a DataFrame, to each row or column via `axis`).
    - Useful for element-wise transformations.
    - Example: `df['squared_numbers'] = df['numbers'].apply(lambda x: x ** 2)`

- `map()`
    - Transform Series elements based on a dictionary (a function also works).
    - Replaces elements with the corresponding dictionary values; values missing from the dictionary become `NaN`.
    - Example: `df['gender_mapped'] = df['gender'].map({'M': 'Male', 'F': 'Female'})`

- `applymap()`
    - Apply a custom function to every element in a DataFrame.
    - Useful for element-wise transformations on entire DataFrames. Since pandas 2.1 it is deprecated in favour of `DataFrame.map()`.
    - Example: `df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)`
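The three methods can be compared side by side on a tiny made-up DataFrame (`df_toy`, with hypothetical `numbers` and `gender` columns). This is only a sketch; the `hasattr` check is there because `DataFrame.map()` exists only in newer pandas versions, while `applymap()` is the older name:

```python
import pandas as pd

df_toy = pd.DataFrame({"numbers": [1, 2, 3], "gender": ["M", "F", "M"]})

# apply(): run a function on each element of one column
squared = df_toy["numbers"].apply(lambda x: x ** 2)

# map(): look each element up in a dictionary
gender_full = df_toy["gender"].map({"M": "Male", "F": "Female"})

# Element-wise over the whole DataFrame: DataFrame.map() on newer pandas,
# applymap() on older versions
elementwise = df_toy.map if hasattr(df_toy, "map") else df_toy.applymap
as_text = elementwise(str)

print(squared.tolist())             # [1, 4, 9]
print(gender_full.tolist())         # ['Male', 'Female', 'Male']
print(as_text["numbers"].tolist())  # ['1', '2', '3']
```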


### `apply()`

In [96]:
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [99]:
df['age'] = df['age'].astype(float)

In [101]:
# Applying a custom function using apply()


def get_yob(age):
    """Receive an age and return the year of birth (the Titanic sailed in 1912)."""
    return 1912 - age

df['yob'] = df['age'].apply(get_yob)

In [102]:
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1890.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1874.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1886.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1877.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1877.0


In [103]:
# Test out using lambda

exchange_rate = 0.9877

df['fare'].apply( lambda col_: col_ * exchange_rate)

0       7.160825
1      70.406515
2       7.827522
3      52.446870
4       7.950985
         ...    
886    12.840100
887    29.631000
888    23.161565
889    29.631000
890     7.654675
Name: fare, Length: 891, dtype: float64

In [104]:
# The same transformation without a function: arithmetic applied directly to the column

df['euro_fare'] = df['fare'] * exchange_rate
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob,euro_fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1890.0,7.160825
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1874.0,70.406515
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1886.0,7.827522
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1877.0,52.44687
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1877.0,7.950985


In [110]:
df['sex'] + '_' + df['sex']

0          male_male
1      female_female
2      female_female
3      female_female
4          male_male
           ...      
886        male_male
887    female_female
888    female_female
889        male_male
890        male_male
Name: sex, Length: 891, dtype: object

In [109]:
(
    df['sex']
    .apply(lambda col_: col_ + '_' + col_)
    .str.replace('male_', 'nonfemale')
)

0          nonfemalemale
1      fenonfemalefemale
2      fenonfemalefemale
3      fenonfemalefemale
4          nonfemalemale
             ...        
886        nonfemalemale
887    fenonfemalefemale
888    fenonfemalefemale
889        nonfemalemale
890        nonfemalemale
Name: sex, Length: 891, dtype: object

In [111]:
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob,euro_fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1890.0,7.160825
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1874.0,70.406515
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1886.0,7.827522
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1877.0,52.44687
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1877.0,7.950985


In [112]:
df['pclass'].unique()


array([3, 1, 2])

In [113]:
def converts_class_into_description(col):
    """Take the ticket class and return a more descriptive label."""
    if col == 1:
        return 'first class'
    elif col == 2:
        return 'second class'
    else:
        return 'third class'

In [114]:
df['pclass'].apply(converts_class_into_description)

0       third class
1       first class
2       third class
3       first class
4       third class
           ...     
886    second class
887     first class
888     third class
889     first class
890     third class
Name: pclass, Length: 891, dtype: object

In the example above, we can see that to create a new column in pandas, we can simply assign a new Series or list to a new column name within the DataFrame.

To edit the information in a whole column in pandas, you can simply assign a new list or array of values to the column you want to modify.

In [None]:
# Convert the fare from US Dollars to EUR


In [117]:
# Apply is awesome to place a logic inside of a function and then apply it to our dataframe:


In [None]:
# We can even use apply in the whole dataframe to perform some specific logic


### `map()`

In [115]:
# Using map() to transform the 'Sex' column to an integer

# male -> 0 
# female -> 1

def convert_to_numerical ( col ):
    if col == 'male':
        return 0
    else:
        return 1
    
    
df['sex'].apply(convert_to_numerical)

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    1
889    0
890    0
Name: sex, Length: 891, dtype: int64

In [116]:
gender_map = { 'male': 0, 'female': 1}

df['sex'].map(gender_map)

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    1
889    0
890    0
Name: sex, Length: 891, dtype: int64

In [117]:
# Switch to lambda
# if confusing, disregard this one
df['sex'].apply(lambda col_ : 0  if col_ == 'male' else 1)

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    1
889    0
890    0
Name: sex, Length: 891, dtype: int64

### `applymap()`

In [121]:
# Uppercase the string columns one at a time with apply()
# (the result is not assigned back, so df itself stays unchanged)

def upper_casing(col_):
    return col_.upper()

for cols in ['name', 'sex', 'ticket']:
    df[cols].apply(upper_casing)



In [120]:
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob,euro_fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1890.0,7.160825
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1874.0,70.406515
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1886.0,7.827522
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1877.0,52.44687
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1877.0,7.950985


In [124]:
# Are we able to apply the str.upper to the whole dataframe?

# Note: since pandas 2.1, applymap() is deprecated in favour of DataFrame.map()
df.applymap(lambda col_: col_.upper() if isinstance(col_, str) else col_)


Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob,euro_fare
0,1,0,3,"BRAUND, MR. OWEN HARRIS",MALE,22.0,1,0,A/5 21171,7.2500,,S,1890.0,7.160825
1,2,1,1,"CUMINGS, MRS. JOHN BRADLEY (FLORENCE BRIGGS TH...",FEMALE,38.0,1,0,PC 17599,71.2833,C85,C,1874.0,70.406515
2,3,1,3,"HEIKKINEN, MISS. LAINA",FEMALE,26.0,0,0,STON/O2. 3101282,7.9250,,S,1886.0,7.827522
3,4,1,1,"FUTRELLE, MRS. JACQUES HEATH (LILY MAY PEEL)",FEMALE,35.0,1,0,113803,53.1000,C123,S,1877.0,52.446870
4,5,0,3,"ALLEN, MR. WILLIAM HENRY",MALE,35.0,0,0,373450,8.0500,,S,1877.0,7.950985
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"MONTVILA, REV. JUOZAS",MALE,27.0,0,0,211536,13.0000,,S,1885.0,12.840100
887,888,1,1,"GRAHAM, MISS. MARGARET EDITH",FEMALE,19.0,0,0,112053,30.0000,B42,S,1893.0,29.631000
888,889,0,3,"JOHNSTON, MISS. CATHERINE HELEN ""CARRIE""",FEMALE,,1,2,W./C. 6607,23.4500,,S,,23.161565
889,890,1,1,"BEHR, MR. KARL HOWELL",MALE,26.0,0,0,111369,30.0000,C148,C,1886.0,29.631000


#### Let's revisit the data-cleaning example from before and wrap the logic in a function

Consider the following DataFrame containing information about students:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Cathy', None, 'Eva'],
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Female', None, 'Female', 'Male', None],
    'Score': [90, None, 78, None, 85]
}

df_students = pd.DataFrame(data)
```

Your task is to perform the following data cleaning task:

1. Take the name, join it with the gender, and append the age in parentheses.


In [127]:
data = {
    'Name': ['Alice', 'Bob', 'Cathy', None, 'Eva'],
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Female', None, 'Female', 'Male', None],
    'Score': [90, None, 78, None, 85]
}

df_students = pd.DataFrame(data)

In [143]:
df_students.fillna('non_information', inplace=True)


In [147]:
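# This function builds a label from the Name, Gender and Age values of one row
def turn_alice_to_wonderland(row):
    # Double quotes on the outside so the inner single quotes also work before Python 3.12
    return str(row['Name']) + '_' + str(row['Gender']) + f" ({row['Age']})"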



In [151]:
df_students.apply(turn_alice_to_wonderland, axis=1)

0               Alice_Female (25.0)
1        Bob_non_information (30.0)
2    Cathy_Female (non_information)
3       non_information_Male (22.0)
4        Eva_non_information (28.0)
dtype: object

In [152]:
df_students.head()

# On the first call, pandas passes the first row (a Series) to the function; it behaves much like this dict:

first_row_dict = { 'Name': 'Alice',
   'Age': 25,
   'Gender': 'Female',
   'Score': 90
   }

In [155]:
first_row_dict['Score']  

90

### <a id='toc2_1_1_'></a>[More examples](#toc0_)

#### <a id='toc2_1_1_1_'></a>[Comparing Map and Apply](#toc0_)

We have a column called "Embarked" containing three possible values: 'C', 'Q', and 'S'. We want to map these values to 0, 1 and 2. In this case, `apply()` with a lambda function would be complex due to the if-elif-else conditions, but `map()` can handle it more easily.

In [None]:
# Mapping 'embarked' values to numbers (0, 1, 2) using map()


# Display the first few rows of the updated DataFrame


Why is it a float?

In [None]:
# Check null values


In [None]:
# Mapping 'Embarked' values to numbers (0, 1, 2) using apply() and a lambda function


# Note that here it doesn't convert it to float

In [None]:
# What happened with the null values?
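
The questions raised in the empty cells above can be explored on a small stand-in Series (`embarked` and `port_codes` are made up for this sketch). It suggests why `map()` produces floats when nulls are present, and what the if/else version of `apply()` does with them:

```python
import pandas as pd

# Hypothetical sample of 'embarked' values, including one missing entry
embarked = pd.Series(["S", "C", None, "Q"], name="embarked")

port_codes = {"C": 0, "Q": 1, "S": 2}

# map(): values without a dictionary entry (here the null) become NaN,
# and NaN forces the whole column to float
mapped = embarked.map(port_codes)

# apply() with if/else: the null falls through to the final else branch,
# so every result is an int and the null is silently labelled 2
applied = embarked.apply(lambda x: 0 if x == "C" else (1 if x == "Q" else 2))

print(mapped.dtype)      # float64
print(applied.tolist())  # [2, 0, 2, 1]
```

The `map()` result keeps the missing value visible, whereas this `apply()` version hides it inside the last category, which is rarely what we want.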


#### <a id='toc2_1_1_2_'></a>[Calculating the length of the name](#toc0_)

What if we wanted to create a new column with the length of the name?

In [None]:
df.columns

#### <a id='toc2_1_1_3_'></a>[Converting to float some columns with applymap()](#toc0_)

As an example, let's see how to convert all of the following columns to float: "PassengerId", "Survived", "Pclass".
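
A minimal sketch of this conversion, on a made-up DataFrame with the same column names (`df_toy` and its values are hypothetical). The `hasattr` check picks `DataFrame.map()` on newer pandas and falls back to `applymap()` on older versions:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["a", "b", "c"],
})

cols = ["PassengerId", "Survived", "Pclass"]
subset = df_toy[cols]

# Call float() on every element of the selected columns
elementwise = subset.map if hasattr(subset, "map") else subset.applymap
df_toy[cols] = elementwise(float)

print(df_toy.dtypes)
```

For a plain type conversion like this one, `df_toy[cols].astype(float)` gives the same result and is usually faster; the element-wise version becomes useful when the function is more than a type cast.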

#### <a id='toc2_1_1_4_'></a>[Modifying columns names with apply()](#toc0_)

In [None]:
# I could also use the apply function by converting df.columns to Series
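
One way to do what the comment above describes is sketched here on an empty made-up DataFrame (`df_toy` is hypothetical): wrap `df.columns` in a Series, apply a function to each name, and assign the result back.

```python
import pandas as pd

df_toy = pd.DataFrame(columns=["PassengerId", "Pclass", "Name"])

# Wrap the column labels in a Series so that apply() is available
new_names = pd.Series(df_toy.columns).apply(lambda col: col.lower())
df_toy.columns = new_names.tolist()

print(list(df_toy.columns))  # ['passengerid', 'pclass', 'name']
```

For simple string operations, `df_toy.columns.str.lower()` is a shorter alternative.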


### <a id='toc2_1_2_'></a>[💡 Check for understanding](#toc0_)

Convert the column `Embarked_nr` to an integer type.

- If you get an error, read the error message and think about how you should proceed.
- If you decide to fill the null values, use `mode()`, since it's a categorical variable.
- If you get another error, look at what `mode()` returns in order to fix the error and convert the `Embarked_nr` column to integer.

In [None]:
# Your code goes here

### <a id='toc2_1_3_'></a>[💡 Check for understanding](#toc0_)

You are given a dataset of students' exam scores here: https://raw.githubusercontent.com/data-bootcamp-v4/data/main/student_performance.csv. Your task is to perform the following operations using pandas:

1. Read the CSV file into a DataFrame.
2. Create a new column "total_score" that calculates the total score for each student by summing their "math score," "reading score," and "writing score."
3. Create a new column "grade" that assigns a grade to each student based on the following criteria:
   - If the total score is >= 90, the grade is "A."
   - If the total score is >= 80 and < 90, the grade is "B."
   - If the total score is >= 70 and < 80, the grade is "C."
   - If the total score is >= 60 and < 70, the grade is "D."
   - If the total score is < 60, the grade is "F."
4. Convert all values in the "gender" column to uppercase.
5. Create a new column "is_passed" that indicates whether each student has passed the exam or not. If the total score is >= 60, the student has passed; otherwise, they have failed.


In [156]:
df_students = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/student_performance.csv")

In [157]:
# Your code goes here
df_students.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [162]:
# goal is to get total score -> math score + reading score + writing score

# harder but re-usable
def gets_total_score(row):
    """Sum the math, reading and writing scores of one row."""
    return row['math score'] + row['reading score'] + row['writing score']



df_students['total_score'] = df_students.apply(gets_total_score, axis=1)

In [166]:
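# simpler but not re-usable; dividing by 3 turns the sum into an average on a 0-100 scale,
# which is what the grade thresholds expect (this overwrites the sum computed above)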
df_students['total_score'] = (df_students['math score'] +  df_students['reading score'] + df_students['writing score']) / 3
df_students['total_score']

0      72.666667
1      82.333333
2      92.666667
3      49.333333
4      76.333333
         ...    
995    94.000000
996    57.333333
997    65.000000
998    74.333333
999    83.000000
Name: total_score, Length: 1000, dtype: float64

In [167]:
## Function to calculate the grade:

def get_grade(score):
    # Conditions are checked top to bottom, so each branch only needs its lower bound
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'

In [168]:
df_students['total_score'].apply(get_grade)

0      C
1      B
2      A
3      F
4      C
      ..
995    A
996    F
997    D
998    C
999    B
Name: total_score, Length: 1000, dtype: object
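
Steps 4 and 5 of the exercise are not worked out above. A possible approach is sketched below on a made-up DataFrame (`df_toy` and its values are hypothetical; only the column names `gender` and `total_score` come from the exercise):

```python
import pandas as pd

df_toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "total_score": [72.7, 49.3, 60.0],
})

# Step 4: uppercase the values of the 'gender' column
df_toy["gender"] = df_toy["gender"].str.upper()

# Step 5: a comparison on a column already yields a boolean Series
df_toy["is_passed"] = df_toy["total_score"] >= 60

print(df_toy["gender"].tolist())     # ['FEMALE', 'MALE', 'FEMALE']
print(df_toy["is_passed"].tolist())  # [True, False, True]
```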

# <a id='toc3_'></a>[Filtering Data](#toc0_)

One of the primary tasks in dataset analysis is filtering rows.

When filtering DataFrames in Pandas, you can use boolean indexing to select specific rows based on certain conditions. Here's a step-by-step explanation:

1. Identify the column(s) you want to use as a filter condition. For example, in `housing_df` the column named 'SalePrice'.

2. Create a condition using a comparison operator (e.g., `>`, `<`, `==`, etc.) and the column(s) you want to filter. For instance, to filter all rows where the 'SalePrice' is greater than 10000, you would use `condition = housing_df['SalePrice'] > 10000`.

3. Use the condition to filter the DataFrame. You can do this by passing the condition inside square brackets to the DataFrame. For example, `filtered_df = housing_df[condition]` will create a new DataFrame `filtered_df` containing only the rows where the 'SalePrice' is greater than 10000.

Keep in mind that the condition should evaluate to a boolean Series with the same length as the DataFrame, indicating which rows to include (True) or exclude (False).

You can also combine multiple conditions using logical operators like `&` for 'and' and `|` for 'or'. Wrap each condition in parentheses, because `&` and `|` take precedence over comparison operators such as `>`. For instance, to filter rows where the 'SalePrice' is greater than 10000 and the 'FullBath' is more than 1, you can use `condition = (housing_df['SalePrice'] > 10000) & (housing_df['FullBath'] > 1)`.

Filtering allows you to extract specific subsets of data from your DataFrame, making it easier to analyze and work with the data that meets your criteria.
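
The steps above can be sketched on a tiny example. The `housing_df` below is a made-up stand-in with invented values (an assumption for illustration, not the real housing dataset); only the column names 'SalePrice' and 'FullBath' come from the text.

```python
import pandas as pd

# Hypothetical stand-in for housing_df; the values are invented for illustration
housing_df = pd.DataFrame({
    "SalePrice": [8000, 12000, 9500, 250000, 10000],
    "FullBath": [1, 2, 1, 3, 2],
})

# Step 2: the comparison returns a boolean Series with one value per row
condition = housing_df["SalePrice"] > 10000

# Step 3: passing the boolean Series in square brackets keeps only the True rows
filtered_df = housing_df[condition]

# Combining conditions: each comparison gets its own parentheses
both = housing_df[(housing_df["SalePrice"] > 10000) & (housing_df["FullBath"] > 1)]
either = housing_df[(housing_df["SalePrice"] > 10000) | (housing_df["FullBath"] > 1)]

print(condition)
print(filtered_df)
print(len(both), len(either))
```

Note that the filtered DataFrame keeps the original row labels, so its index is no longer consecutive.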


### <a id='toc3_1_1_'></a>[Creating a condition](#toc0_)

In [169]:
# Check fare col and mean

df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob,euro_fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1890.0,7.160825
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1874.0,70.406515
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1886.0,7.827522
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1877.0,52.44687
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1877.0,7.950985


In [171]:
# check mean value
df['fare'].mean()

np.float64(32.204207968574636)

In [172]:
# Create filter condition - fares higher than the mean
fare_avg = df['fare'].mean()

# checking this condition
df['fare'] > fare_avg

0      False
1       True
2      False
3       True
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: fare, Length: 891, dtype: bool

In [175]:
high_fare_tickets_df =  df[ df['fare'] > fare_avg ]
high_fare_tickets_df

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob,euro_fare
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1874.0,70.406515
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1877.0,52.446870
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,1858.0,51.224591
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5000,A6,S,1884.0,35.063350
27,28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0000,C23 C25 C27,S,1893.0,259.765100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
856,857,1,1,"Wick, Mrs. George Dennick (Mary Hitchcock)",female,45.0,1,1,36928,164.8667,,S,1867.0,162.838840
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S,,68.694535
867,868,0,1,"Roebling, Mr. Washington Augustus II",male,31.0,0,0,PC 17590,50.4958,A24,S,1881.0,49.874702
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S,1865.0,51.907783


### <a id='toc3_1_2_'></a>[Filtering df](#toc0_)

In [None]:
# Get filtered df


In [None]:
# Do it all in one go


### <a id='toc3_1_3_'></a>[Using multiple conditions](#toc0_)

In [178]:
# We can combine filters with boolean operators to add conditions
# boolean operators: 'and' is &, 'or' is |

# Goal: fare higher than the mean but still lower than 50

# condition_1 -> passenger paid above the average fare
condition_1 = df['fare'] > df['fare'].mean()

# condition_2 -> passenger paid less than 50
condition_2 = df['fare'] < 50

# Display condition_2 to check it
condition_2


0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887     True
888     True
889     True
890     True
Name: fare, Length: 891, dtype: bool

In [187]:
# When we want both conditions to hold, combine them with &:
df[condition_1 & condition_2].head()

# This is the same filter written as a single expression
# (only the result of the last line of a cell is displayed):

df[(df['fare'] > df['fare'].mean()) & (df['fare'] < 50)]

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob,euro_fare
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S,1884.0,35.06335
43,44,1,2,"Laroche, Miss. Simonne Marie Anne Andree",female,3.0,1,2,SC/Paris 2123,41.5792,,C,1909.0,41.067776
50,51,0,3,"Panula, Master. Juha Niilo",male,7.0,4,1,3101295,39.6875,,S,1905.0,39.199344
55,56,1,1,"Woolner, Mr. Hugh",male,,0,0,19947,35.5,C52,S,,35.06335
59,60,0,3,"Goodwin, Master. William Frederick",male,11.0,5,2,CA 2144,46.9,,S,1901.0,46.32313
71,72,0,3,"Goodwin, Miss. Lillian Amy",female,16.0,5,2,CA 2144,46.9,,S,1896.0,46.32313
83,84,0,1,"Carrau, Mr. Francisco M",male,28.0,0,0,113059,47.1,,S,1884.0,46.52067
86,87,0,3,"Ford, Mr. William Neal",male,16.0,1,3,W./C. 6608,34.375,,S,1896.0,33.952188
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C,1841.0,34.227953
145,146,0,2,"Nicholls, Mr. Joseph Charles",male,19.0,1,1,C.A. 33112,36.75,,S,1893.0,36.297975


In [186]:
# When we want to use one OR the other condition:

df[ condition_1 | condition_2]

Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob,euro_fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,1890.0,7.160825
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1874.0,70.406515
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1886.0,7.827522
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1877.0,52.446870
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1877.0,7.950985
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1885.0,12.840100
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1893.0,29.631000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,,23.161565
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,1886.0,29.631000


In [192]:
# More than two conditions can be chained with &
# Here: male passengers who paid above the average fare and survived
# (for categorical columns with several accepted values, .isin() is an alternative)



df[condition_1 & (df['sex'] == 'male') & (df['survived'] == 1)]


Unnamed: 0,passengerid,survived,pclass,name,sex,age,siblings_spouses,parents_children,ticket,fare,cabin,embarked,yob,euro_fare
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S,1884.0,35.06335
55,56,1,1,"Woolner, Mr. Hugh",male,,0,0,19947,35.5,C52,S,,35.06335
74,75,1,3,"Bing, Mr. Lee",male,32.0,0,0,1601,56.4958,,S,1880.0,55.800902
97,98,1,1,"Greenfield, Mr. William Bertram",male,23.0,0,1,PC 17759,63.3583,D10 D12,C,1889.0,62.578993
183,184,1,2,"Becker, Master. Richard F",male,1.0,2,1,230136,39.0,F4,S,1911.0,38.5203
224,225,1,1,"Hoyt, Mr. Frederick Maxfield",male,38.0,1,0,19943,90.0,C93,S,1874.0,88.893
248,249,1,1,"Beckwith, Mr. Richard Leonard",male,37.0,1,1,11751,52.5542,D35,S,1875.0,51.907783
305,306,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,1911.08,149.685935
370,371,1,1,"Harder, Mr. George Achilles",male,25.0,1,0,11765,55.4417,E50,C,1887.0,54.759767
390,391,1,1,"Carter, Mr. William Ernest",male,36.0,1,2,113760,120.0,B96 B98,S,1876.0,118.524


In [None]:
# We can also use between() for numerical data
# Get fares between 90-100
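
As a hedged sketch of the two helpers mentioned in this section, the example below uses `isin()` for categorical values and `between()` for a numeric range. The `sample` DataFrame is hypothetical (invented rows that reuse the `pclass` and `fare` column names), so the results will differ on the real `df`.

```python
import pandas as pd

# Hypothetical sample reusing the Titanic column names; values are invented
sample = pd.DataFrame({
    "pclass": [1, 2, 3, 3, 1],
    "fare": [95.0, 13.0, 7.25, 90.0, 100.0],
})

# between() keeps values inside the range; both endpoints are included by default
fares_90_100 = sample[sample["fare"].between(90, 100)]

# isin() checks membership in a list of accepted values
lower_classes = sample[sample["pclass"].isin([2, 3])]

# Both helpers return boolean Series, so they combine with & and | like any condition
expensive_lower_class = sample[sample["pclass"].isin([2, 3]) & (sample["fare"] > 50)]

print(fares_90_100)
print(lower_classes)
print(expensive_lower_class)
```

On the real data the same pattern would be `df[df['fare'].between(90, 100)]`.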


# <a id='toc4_'></a>[More Data Manipulation](#toc0_)




## <a id='toc4_1_'></a>[Setting the index](#toc0_)

To set an index in pandas, you can use the `set_index()` method of the DataFrame. This method allows you to specify which column you want to use as the index for the DataFrame.
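
A minimal sketch of `set_index()`, assuming a small hypothetical `sample` DataFrame built from three rows of the Titanic preview shown earlier. It also shows `reset_index()`, which undoes the operation.

```python
import pandas as pd

# Hypothetical three-row sample taken from the Titanic preview above
sample = pd.DataFrame({
    "passengerid": [1, 3, 5],
    "name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "fare": [7.25, 7.925, 8.05],
})

# set_index() returns a new DataFrame; sample itself is left unchanged
indexed = sample.set_index("passengerid")

# Rows can now be looked up by passenger id with .loc
print(indexed.loc[3, "name"])

# reset_index() moves the index back into a regular column
restored = indexed.reset_index()
print(restored.columns.tolist())
```

Because `set_index()` does not modify the original by default, the result has to be assigned to a variable to be kept.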

In [None]:
# Basically renaming the rows in our df


## <a id='toc4_2_'></a>[Adding/removing rows and/or columns](#toc0_)

To add or remove rows and/or columns from a pandas DataFrame, you can use the following methods:

1. Adding rows:
   - Use the `pd.concat()` function to add rows (for example, the rows of another DataFrame) to the DataFrame.

2. Removing rows:
   - Use the `drop()` method with the row index or label to remove specific rows.

3. Adding columns:
   - Assign a list, Series, or scalar value to a new column name with bracket notation, e.g. `df['new_column'] = values`.
   - Use the `assign()` method to get a new DataFrame with the column added.

4. Removing columns:
   - Use the `drop()` method with the column name and `axis=1` to remove specific columns.
   - Alternatively, you can use the `del` keyword to remove a column in-place.
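
The four operations listed above could look like the sketch below. The `sample` DataFrame, its values, and the `fare_doubled` column are hypothetical and only serve as illustration; `survived_bool` follows the naming used in the exercise cells.

```python
import pandas as pd

# Hypothetical sample; names and values are invented for illustration
sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "survived": [0, 1, 1],
    "fare": [7.25, 71.2833, 7.925],
})

# Adding rows: pd.concat() takes a list of DataFrames;
# iloc[[0]] (double brackets) keeps the first row as a one-row DataFrame
with_extra_row = pd.concat([sample, sample.iloc[[0]]], ignore_index=True)

# Removing rows: drop() by index label returns a new DataFrame
without_extra_row = with_extra_row.drop(index=3)

# Adding a column with bracket notation modifies the DataFrame in place
sample_with_bool = sample.copy()
sample_with_bool["survived_bool"] = sample_with_bool["survived"].astype(bool)

# Adding a column with assign() returns a new DataFrame instead
with_doubled = sample.assign(fare_doubled=sample["fare"] * 2)

# Removing a column: drop() with axis=1 returns a new DataFrame
without_fare = sample.drop("fare", axis=1)

print(with_extra_row)
print(sample_with_bool)
print(without_fare.columns.tolist())
```

`ignore_index=True` renumbers the rows after concatenation; without it, the appended row would keep its original label 0 and the index would contain a duplicate.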

In [None]:
# Add the first row of the df at the end


In [None]:
# Remove the row from the new df


In [None]:
# Remove a column from the dataframe


In [None]:
# Create a survived_bool col


## <a id='toc4_3_'></a>[💡 Check for understanding](#toc0_)

Use the `supermarket_sales.csv` file for this task.

1. **Load the Data**: Use pandas to load the `supermarket_sales.csv` file into a DataFrame.

2. **Null Values**: Check if the DataFrame has any null values. If there are any, count the number of null values in each column.

3. **Formatting Data**: Round any floating point numbers in the DataFrame to two decimal places.

4. **Cleaning Column Names**: Ensure all column names are in lowercase and replace any spaces in the column names with underscores.

5. **Using `apply()`, `map()`, and `applymap()`**: Create a new column called 'total_cost' which is the product of the 'quantity' and 'unit_price' columns (assuming these columns exist in your dataset). Use the `apply()` function for this.

6. **Filtering Data**: Filter the DataFrame to only include rows where 'total_cost' is greater than the average 'total_cost'.

7. **Setting the Index**: Set the 'invoice_id' column (or any other unique identifier) as the index of the DataFrame.



In [None]:
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/supermarket_sales.csv'

In [None]:
# Your code goes here

# <a id='toc5_'></a>[Summary](#toc0_)

1. Null Values:
   - Null values (also known as missing values) can hinder data analysis and modeling.
   - Use `isnull()` or `isna()` to check for null values in a DataFrame or Series.
   - Use `any()` and `sum()` to efficiently assess data quality.
   - Use `dropna()` to remove rows or columns with null values from a DataFrame.
   - Parameters like `subset`, `how`, and `thresh` control which rows or columns `dropna()` removes.
   - Use `fillna()` to replace null values with specific values, such as the `mean()` or `median()`, or use forward/backward fill.

2. Formatting Data:
   - Use `round()` and `format()` to format numeric values.
   - Use string methods like `lower()`, `upper()`, `title()`, `strip()`, `split()`, and `replace()`.

3. Cleaning Column Names:
   - Use `df.columns` to access column names.
   - Modify column names by assigning to `df.columns` or with `rename()`.

4. Using `apply()`, `map()`, and `applymap()`:
   - `apply()`: Applies a function to each element of a Series, or to each row or column of a DataFrame.
   - `map()`: Transforms each element of a Series using a dictionary or a function.
   - `applymap()`: Applies a function to every element of a DataFrame (deprecated since pandas 2.1 in favor of `DataFrame.map()`).

5. Filtering Data:
   - Filter rows in a DataFrame using boolean indexing.
   - Use comparison operators (`<`, `>`, `==`) to create conditions.
   - Combine multiple conditions using logical operators (`&` for 'and', `|` for 'or'), with each condition in parentheses.

6. Setting the Index:
   - Use `set_index()` to set an index for the DataFrame.

7. Adding/Removing Rows and Columns:
   - Use `pd.concat()` to add rows to the DataFrame.
   - Use `drop()` with the row index/label to remove specific rows.
   - Use bracket notation or `assign()` to add columns, and `drop()` with `axis=1` or `del` to remove them.