**Table of contents**<a id='toc0_'></a>    
- [Cleaning Data](#toc1_)    
  - [Null values](#toc1_1_)    
    - [What are the different scenarios of missing data?](#toc1_1_1_)    
      - [MCAR (Missing Completely at Random)](#toc1_1_1_1_)    
      - [MAR (Missing at Random)](#toc1_1_1_2_)    
      - [Missing Not At Random (MNAR)](#toc1_1_1_3_)    
    - [Why are null values relevant?](#toc1_1_2_)    
    - [Cleaning Null Values](#toc1_1_3_)    
    - [Checking for Null Values](#toc1_1_4_)    
    - [Dropping Null Values](#toc1_1_5_)    
    - [Filling Null Values](#toc1_1_6_)    
    - [💡 Check for understanding](#toc1_1_7_)    
  - [Dealing with Duplicates](#toc1_2_)    
    - [Identifying Duplicates](#toc1_2_1_)    
    - [Removing Duplicates](#toc1_2_2_)    
    - [Removing Duplicates Based on Specific Columns](#toc1_2_3_)    
    - [Resetting the Index](#toc1_2_4_)    
  - [Formatting Data (Recap)](#toc1_3_)    
    - [Formatting Numeric Values (Recap)](#toc1_3_1_)    
    - [Formatting Strings (Recap)](#toc1_3_2_)    
    - [Formatting Dates](#toc1_3_3_)    
  - [Cleaning Column Names](#toc1_4_)    
- [Using `apply()`, `map()`, and `applymap()`](#toc2_)    
    - [More examples](#toc2_1_1_)    
      - [Comparing Map and Apply](#toc2_1_1_1_)    
      - [Calculating the length of the name](#toc2_1_1_2_)    
      - [Converting to float some columns with applymap()](#toc2_1_1_3_)    
      - [Modifying columns names with apply()](#toc2_1_1_4_)    
    - [💡 Check for understanding](#toc2_1_2_)    
    - [💡 Check for understanding](#toc2_1_3_)    
- [Filtering Data](#toc3_)    
    - [Creating a condition](#toc3_1_1_)    
    - [Filtering df](#toc3_1_2_)    
    - [Using multiple conditions](#toc3_1_3_)    
- [More Data Manipulation](#toc4_)    
  - [Setting the index](#toc4_1_)    
  - [Adding/removing rows and/or columns](#toc4_2_)    
  - [💡 Check for understanding](#toc4_3_)    
- [Summary](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Cleaning Data](#toc0_)

## <a id='toc1_1_'></a>[Null values](#toc0_)

Null values (also known as missing values) are common in datasets and can hinder data analysis and modeling. It is essential to handle null values appropriately to ensure accurate and reliable results. Pandas provides various methods to clean and handle null values in datasets.

In Python, `None` is a special constant that represents the absence of a value. It is commonly used to indicate that a variable or function has no value or hasn't been assigned any value. For example, if a function does not explicitly return a value, it implicitly returns `None`.

On the other hand, `NaN` stands for "Not a Number" and is a special value used to represent missing or undefined numerical data. `NaN` is a floating-point value defined by the IEEE 754 standard, and it is commonly used in numeric data structures like Pandas DataFrames and Series to indicate missing or invalid numerical values.
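
The difference between `None` and `NaN` is easier to see in a few lines of code. The snippet below is a minimal sketch on made-up values (the `numbers` and `names` Series are illustrative and not part of the dataset used later).

```python
import numpy as np
import pandas as pd

# None is a Python object; NaN is a float value.
print(type(None))      # <class 'NoneType'>
print(type(np.nan))    # <class 'float'>

# NaN is never equal to anything, including itself,
# so use isna() rather than == to detect it.
print(np.nan == np.nan)  # False

# In a numeric Series, pandas converts None to NaN and stores floats.
numbers = pd.Series([1, None, 3])
print(numbers.dtype)            # float64
print(numbers.isna().tolist())  # [False, True, False]

# None in a column of strings is also treated as missing.
names = pd.Series(["Alice", None, "Eva"])
print(names.isna().tolist())    # [False, True, False]
```

This is why the rest of this section relies on `isna()` / `isnull()` instead of equality comparisons.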

### <a id='toc1_1_1_'></a>[What are the different scenarios of missing data?](#toc0_)

#### <a id='toc1_1_1_1_'></a>[MCAR (Missing Completely at Random)](#toc0_)

Data is MCAR when it is missing for reasons that have nothing to do with the information being recorded: the missingness has no connection to any value in the dataset, whether observed or unobserved. This typically happens when there's **a glitch in the system collecting the data**, rather than something about the data itself, so there's no way to predict where values will be missing.

For example:
- **Lost Surveys**: Say a surveyor accidentally dropped completed questionnaires into a lake. The fact that there's information missing from these specific questionnaires doesn’t depend on who filled them out or what they contained. The loss of data is completely random.  

- **Sensor Failure in a Weather Station**: A weather station collects temperature data hourly, but the sensor occasionally fails for no particular reason. Since these failures don't depend on the time of day, season, or temperature, the missing values occur randomly due to a technical issue. However, if the failures depended on another recorded variable, such as rainfall, the data would no longer be MCAR but **MAR**; if they depended on the temperature itself, the data would be **MNAR**.

#### <a id='toc1_1_1_2_'></a>[MAR (Missing at Random)](#toc0_)

> Despite its name, MAR occurs when the absence of data is not random. The probability of missing data is not equal for all measurements. They’re more likely for some observations than others. However, measurements of observed variables predict the unequal probability of missing values occurring. ([Statistics by Jim](https://statisticsbyjim.com/basics/missing-data/))

This means that we can figure out where a value might be missing based on other information in the dataset.

For example: 

- **Medical Study Missing Data on a Drug’s Side Effects**: In a medical study, older participants might be more likely to drop out or miss follow-up appointments. As a result, we may be missing some follow-up data on side effects for older participants. The missing data is linked to an observable factor (age), but it’s not related to the unobserved side effect outcomes themselves.

- **Student Performance Data**: In a school, some students skip reporting their test scores in optional evaluations. For instance, students with lower attendance rates might be more likely to miss submitting their scores. The missing test scores are related to an observed factor (attendance rate), not to the missing test scores themselves.

- **Workplace Wellness Programs and Job Satisfaction**: Employees who are part-time might be less likely to participate in wellness programs, leading to missing data on wellness program engagement for part-time employees. The missing engagement data is related to observed information (employment status: part-time vs. full-time) rather than how much the part-time employees actually benefit from the program.

#### <a id='toc1_1_1_3_'></a>[Missing Not At Random (MNAR)](#toc0_)

MNAR means that the reason a value is missing is related to the missing value itself. Because the information needed to explain the missingness is exactly what is absent, we can't predict where values will be missing from the observed data alone.

For example:
- **Sensitive Information in Surveys**: Suppose a survey asks people about their income, and higher-income individuals are more likely to skip the question due to privacy concerns. Here, the likelihood of a person not responding is directly related to the value of their income (higher income = more likely to skip). This is MNAR because the income is missing specifically because of the value it would have had, not because of any other observed variable like age or occupation, although those variables could also be correlated with a high income.

- **Health Studies and Sensitive Symptoms**: In a medical study, participants with more severe symptoms might skip reporting certain symptoms or drop out of the study because they’re uncomfortable sharing the extent of their condition. This missing data would be MNAR since people are missing from the dataset specifically because of their symptom severity, which we can’t observe directly once they’ve dropped out.   

- **Employee Feedback and Job Satisfaction**: If a company survey asks employees about job satisfaction, those who are very dissatisfied may skip questions or avoid taking the survey altogether to avoid potential consequences. This situation would be MNAR because the likelihood of missing data (not answering the job satisfaction questions) is directly related to the unobserved values (job dissatisfaction). However, it could be linked to other factors, such as compensation.  

- **Credit Scores and Loan Applications**: In a financial study, people with very low credit scores might be less likely to report their scores or apply for loans, creating a gap in data. Since the missing data (credit scores) is related to the unobserved values (the actual low scores), this scenario would be classified as MNAR.
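
To make the three mechanisms concrete, the sketch below simulates them on synthetic data. Everything here is an assumption made for illustration: the `people` DataFrame, the link between age and income, and the missingness probabilities are invented, not taken from any real study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 10_000

# Synthetic population: income rises with age (an assumption of this toy example).
age = rng.uniform(20, 80, size=n)
income = 20_000 + 500 * age + rng.normal(0, 5_000, size=n)
people = pd.DataFrame({"age": age, "income": income})

# MCAR: every income has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (age).
mar_mask = rng.random(n) < np.where(people["age"] > 60, 0.5, 0.1)

# MNAR: missingness depends on the value that goes missing (high incomes are hidden).
high_income = people["income"] > people["income"].median()
mnar_mask = rng.random(n) < np.where(high_income, 0.5, 0.1)

true_mean = people["income"].mean()
observed_means = {
    "MCAR": people["income"].mask(mcar_mask).mean(),
    "MAR": people["income"].mask(mar_mask).mean(),
    "MNAR": people["income"].mask(mnar_mask).mean(),
}

print(f"true mean: {true_mean:.0f}")
for mechanism, mean in observed_means.items():
    print(f"{mechanism}: observed mean {mean:.0f}")
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means come out lower. The MAR bias can in principle be corrected using the observed `age` column; the MNAR bias cannot be detected from the observed data alone.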

### <a id='toc1_1_2_'></a>[Why are null values relevant?](#toc0_)

- **Biased Analysis**: Except for the MCAR case, missing values can bias the conclusions we draw from data, mostly by ignoring a subset of the population that is potentially very different from the population whose data we collected.

- **Lower Statistical Power / Precision**: Having missing data reduces the sample size of our dataset, which in turn reduces the precision and power of statistical tests used for association and hypothesis testing. 



### <a id='toc1_1_3_'></a>[Cleaning Null Values](#toc0_)

1. Checking for Null Values:
   - Use `isnull()` method to check for null values in a DataFrame or Series.
   - Use `notnull()` method to check for non-null values in a DataFrame or Series.

2. Dropping Null Values:  
   _Typically safe for MCAR data, but can be harmful for MAR or MNAR data since it will likely bias the analysis._
   - Use `dropna()` method to remove rows with null values from a DataFrame.
   - Use `dropna(axis=1)` to remove columns with null values.

3. Filling Null Values:
   - Use `fillna(value)` method to replace null values with a specific value.
   - Use `ffill()` to forward-fill null values with the previous non-null value (older code uses `fillna(method='ffill')`, which is deprecated in recent pandas versions).
   - Use `bfill()` to backward-fill null values with the next non-null value (formerly `fillna(method='bfill')`).
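
The three steps above can be sketched on a tiny DataFrame before applying them to a real dataset. The `toy` frame and its values are hypothetical, and the sketch uses the dedicated `ffill()` / `bfill()` methods because recent pandas versions deprecate `fillna(method=...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic dataset used below.
toy = pd.DataFrame({
    "city": ["Lisbon", "Porto", None, "Faro"],
    "temp": [21.0, np.nan, 19.5, np.nan],
    "humidity": [60, 65, 70, 75],
})

# 1. Checking: count the nulls in each column.
print(toy.isna().sum())

# 2. Dropping: rows with any null, then columns with any null.
print(toy.dropna())          # only the first row is complete
print(toy.dropna(axis=1))    # only "humidity" has no nulls

# 3. Filling: a constant, the previous value, or the next value.
print(toy["temp"].fillna(0).tolist())   # [21.0, 0.0, 19.5, 0.0]
print(toy["temp"].ffill().tolist())     # [21.0, 21.0, 19.5, 19.5]
print(toy["temp"].bfill().tolist())     # [21.0, 19.5, 19.5, nan]
```

Note that the last `bfill()` value stays `NaN` because there is no later value to copy from.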

### <a id='toc1_1_4_'></a>[Checking for Null Values](#toc0_)

In [None]:
!pip install pandas

In [158]:

import pandas as pd
import numpy as np
# Load Titanic dataset from an online source
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv'
df = pd.read_csv(url)

In [2]:
# Review dataframe and df columns
print(df.columns)
display(df.head())

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df[df.Age.isna()]  # rows where Age is NaN

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


In [4]:
df.groupby(by='Pclass')['Age'].count()

Pclass
1    186
2    173
3    355
Name: Age, dtype: int64

In [5]:
# Checking for Null Values
df.isnull()  # Returns a DataFrame with True where values are null
# isnull is an alias of isna

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


When working with large datasets, using `isna()` or `isnull()` along with `any()`, `all()` and `sum()` in Pandas becomes essential for quick and efficient data quality assessment.

In [8]:
# Check which cols have any null values
df.isna().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In [9]:
# Check if any column has only null values
df.isna().all()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool

`sum()` adds up the values in each column, counting `True` as 1 and `False` as 0.

In [10]:
# Count the number of null values in each column
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

If we add the parameter `axis=1` with the `sum()` function, we can calculate the sum of each row (along the columns) of the DataFrame `df`. This results in a Series that contains the count of null values in each row.

In [11]:
# Get null values per row
df.isna().sum(axis=1)

0      1
1      0
2      1
3      0
4      1
      ..
886    1
887    0
888    2
889    0
890    1
Length: 891, dtype: int64

In [12]:
# Get the fraction of null values per row
df.isna().sum(axis=1) / df.shape[1]

0      0.083333
1      0.000000
2      0.083333
3      0.000000
4      0.083333
         ...   
886    0.083333
887    0.000000
888    0.166667
889    0.000000
890    0.083333
Length: 891, dtype: float64

In [13]:
# Sort descending
(df.isna().sum(axis=1) / df.shape[1]).sort_values(ascending=False)

502    0.166667
773    0.166667
517    0.166667
783    0.166667
359    0.166667
         ...   
659    0.000000
662    0.000000
438    0.000000
215    0.000000
445    0.000000
Length: 891, dtype: float64

In [None]:
# What about the percentage of null values in each column?
round(df.isna().sum() * 100 / df.shape[0], 2)

PassengerId     0.00
Survived        0.00
Pclass          0.00
Name            0.00
Sex             0.00
Age            19.87
SibSp           0.00
Parch           0.00
Ticket          0.00
Fare            0.00
Cabin          77.10
Embarked        0.22
dtype: float64

### <a id='toc1_1_5_'></a>[Dropping Null Values](#toc0_)

In [15]:
# Dropping rows with any Null Values
df.dropna() 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [18]:
df.dropna(how='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


How many rows did we remove from our dataframe?

In [2]:
df.dropna(thresh=11)  # keep only rows with at least 11 non-null values

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
# Check rows removed
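
One way to check how many rows each option removes is to compare row counts before and after `dropna()`. The sketch below uses a hypothetical `toy` frame and a hypothetical helper named `rows_removed`; the same comparison can be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `df`.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, None],
    "c": [10, 20, 30, 40],
})

def rows_removed(frame, **dropna_kwargs):
    """Return how many rows dropna() would remove with the given options."""
    return len(frame) - len(frame.dropna(**dropna_kwargs))

print(rows_removed(toy))              # 3: rows 1, 2 and 3 each have a null
print(rows_removed(toy, how="all"))   # 0: no row is entirely null
print(rows_removed(toy, thresh=2))    # 1: row 3 has only one non-null value
```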

However, as the original DataFrame below shows, the rows with NaN values are still there: `dropna()` returns a new DataFrame and leaves `df` untouched. To keep the change, assign the result back, e.g. `df = df.dropna()`, or pass `inplace=True`: `df.dropna(inplace=True)`.

In [17]:
# Check original dataframe
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [4]:
# Dropping columns with null values
df.dropna(axis=1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.0500
...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,0,0,211536,13.0000
887,888,1,1,"Graham, Miss. Margaret Edith",female,0,0,112053,30.0000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,1,2,W./C. 6607,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,0,0,111369,30.0000


In the `dropna()` method of Pandas DataFrame, the `subset`, `how`, and `thresh` parameters are used to control the behavior of dropping rows or columns containing NaN (null) values, when we don't want to drop them just because they have *one* null value:

- `subset`: It allows you to specify a subset of columns on which to apply the `dropna()` operation. Only the rows containing NaN values in the specified subset of columns will be dropped.

In [None]:
df.tail()

In [None]:
# Drop cabin nulls
df.dropna(subset=['Cabin']).tail()

- `how`: It specifies the condition for dropping rows. It can take the values 'any', which means to drop rows containing any NaN values in the `subset`, or 'all', which means to drop rows containing all NaN values in the `subset`.

In [None]:
# Drop rows only if ALL values are null
df.dropna(how='all').tail()

- `thresh`: It sets a minimum threshold for the number of non-null values that a row must have in the `subset` in order to be kept. Rows with fewer non-null values than the specified threshold will be dropped.

In [None]:
# Test different thresh values
df.dropna(thresh=3).tail()

In [18]:
df.dropna(subset='Age', how='all')
#age_sum = df_cleaned['Age'].sum()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [6]:
df_no_age_nulls = df.dropna(subset=['Age', 'Cabin'], how='all')  # remove rows where both Age and Cabin are null
df_no_age_nulls.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             19
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          529
Embarked         2
dtype: int64
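
The effect of `subset`, `how`, and `thresh` is easier to compare side by side on a small frame. This is a minimal sketch on hypothetical data; the column names only mimic the Titanic columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names mimic the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Cabin": ["C85", None, "E46", None],
    "Fare": [7.25, 8.05, 51.86, 53.10],
})

# subset: only look at Age when deciding what to drop.
print(toy.dropna(subset=["Age"]).index.tolist())                       # [0, 3]

# subset + how="any": drop if Age OR Cabin is null.
print(toy.dropna(subset=["Age", "Cabin"], how="any").index.tolist())   # [0]

# subset + how="all": drop only if Age AND Cabin are both null.
print(toy.dropna(subset=["Age", "Cabin"], how="all").index.tolist())   # [0, 2, 3]

# thresh: keep rows with at least 2 non-null values across all columns.
print(toy.dropna(thresh=2).index.tolist())                             # [0, 2, 3]
```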

### <a id='toc1_1_6_'></a>[Filling Null Values](#toc0_)

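`fillna()` is a Pandas method used to replace NaN (null) values in a DataFrame or Series with specified values.
- By default it returns a new object, so assign the result back (for example `df['Age'] = df['Age'].fillna(0)`) to keep the change.
- `inplace=True` modifies a DataFrame directly, but calling it on a selected column such as `df['Age']` triggers a warning in recent pandas versions and may not update `df`.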

In [6]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [None]:
df['Age'].isnull()
# Boolean mask

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool

In [10]:
df['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [12]:
#first we define the condition
age_condition = df['Age'].isnull()
df[age_condition]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


In [13]:
df['Age'].fillna(-1)

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    -1.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [None]:
df['Cabin'].fillna('N/A')
# df['Cabin'] = df['Cabin'].fillna('N/A')  # assign back to replace the values in the DataFrame

0       N/A
1       C85
2       N/A
3      C123
4       N/A
       ... 
886     N/A
887     B42
888     N/A
889    C148
890     N/A
Name: Cabin, Length: 891, dtype: object

In [16]:
df['Cabin']

0       NaN
1       C85
2       NaN
3      C123
4       NaN
       ... 
886     NaN
887     B42
888     NaN
889    C148
890     NaN
Name: Cabin, Length: 891, dtype: object

In [7]:
# Filling Null Values
df.fillna(-1).tail()  # Replaces null values with -1

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,-1,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,-1.0,1,2,W./C. 6607,23.45,-1,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,-1,Q


In [8]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Be careful when the fill value has a different data type from the column: Pandas will change the data type of the whole column. For example:

In [None]:
# Check dtypes
df.dtypes # age is a float

In [None]:
# Fill with "N/A" instead of -1 
df_na = df.fillna("N/A")
df_na.tail()

In [None]:
# Check dtypes
df_na.dtypes # age is not a float anymore, it's an object now

To avoid this, we can choose which column to apply `fillna()` to:

In [17]:
# Fill Cabin only
df.Cabin.fillna("N/A").tail()

886     N/A
887     B42
888     N/A
889    C148
890     N/A
Name: Cabin, dtype: object

In [None]:
# Fill a subset of columns
# Double brackets select a DataFrame rather than a Series
df[['Age', 'Cabin']].fillna(0)

Unnamed: 0,Age,Cabin
0,22.0,0
1,38.0,C85
2,26.0,0
3,35.0,C123
4,35.0,0
...,...,...
886,27.0,0
887,19.0,B42
888,0.0,0
889,26.0,C148


In [19]:
type(df[['Age', 'Cabin']])

pandas.core.frame.DataFrame
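
When different columns need different fill values, `fillna()` also accepts a dictionary that maps column names to values. The sketch below uses a hypothetical `toy` frame to show that each column then keeps a sensible dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one text column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": ["C85", None, None],
})

# A dict maps each column name to its own fill value.
filled = toy.fillna({"Age": -1, "Cabin": "N/A"})

print(filled)
print(filled.dtypes)   # Age stays float64 because its fill value is numeric
```

The original `toy` frame is left unchanged; only `filled` contains the replacements.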

We can also fill the null values with a statistic of the column, such as its `mean()` or `median()`.

In [21]:
# Check Age in last rows
df.tail() 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [24]:
# Fill with age mean
df['Age'].fillna(df['Age'].mean()).tail()  # preview how the column looks after filling with the mean

886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, dtype: float64
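
The choice between mean and median matters when the column contains extreme values. Here is a small sketch on a hypothetical `ages` Series with one outlier; the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme value and two gaps.
ages = pd.Series([20.0, 22.0, np.nan, 24.0, 80.0, np.nan])

mean_age = ages.mean()      # 36.5, pulled up by the outlier
median_age = ages.median()  # 23.0

print(ages.fillna(mean_age).tolist())    # [20.0, 22.0, 36.5, 24.0, 80.0, 36.5]
print(ages.fillna(median_age).tolist())  # [20.0, 22.0, 23.0, 24.0, 80.0, 23.0]
```

Both statistics skip the missing values when they are computed, so the gaps do not affect the fill value itself.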

- Two common methods for filling NaN values are `ffill`, which forward fills using the last valid value, and `bfill`, which backward fills using the next valid value.

In [25]:
# Forward-fill null values in the Age column
df['Age'].ffill().tail()

886    27.0
887    19.0
888    19.0
889    26.0
890    32.0
Name: Age, dtype: float64

In [26]:
# Backward-fill null values in the Age column
df['Age'].bfill().tail()

886    27.0
887    19.0
888    26.0
889    26.0
890    32.0
Name: Age, dtype: float64
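
Forward and backward filling behave differently at the edges of a column. The following sketch uses a hypothetical `readings` Series with gaps at the start, in the middle, and at the end to show which values can and cannot be filled.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps at the start, middle and end.
readings = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = readings.ffill()    # the leading NaN has no previous value to copy
backward = readings.bfill()   # the trailing NaN has no next value to copy

print(forward.tolist())    # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())   # [1.0, 1.0, 4.0, 4.0, 4.0, nan]

# limit caps how many consecutive NaNs are filled.
print(readings.ffill(limit=1).tolist())   # [nan, 1.0, 1.0, nan, 4.0, 4.0]
```

These methods assume that row order is meaningful (for example, a time series), so they are usually a poor fit for unordered data such as passenger records.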

### <a id='toc1_1_7_'></a>[💡 Check for understanding](#toc0_)

Consider the following DataFrame containing information about students:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Cathy', None, 'Eva'],
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Female', None, 'Female', 'Male', None],
    'Score': [90, None, 78, None, 85]
}

df_students = pd.DataFrame(data)
```

Your task is to perform the following data cleaning tasks:

1. Check for null values in the DataFrame using `isna()` or `isnull()`.

2. Replace the null values in the 'Age' column with the average age of the students.

3. Replace the null values in the 'Gender' column with "Female".

4. Drop any rows that have null values in the 'Name' column.

5. Forward fill (ffill) the null values in the 'Score' column with the previous valid value.

6. After performing all the cleaning steps, print the cleaned DataFrame.


In [42]:
# Your code goes here
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Cathy', None, 'Eva'],
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Female', None, 'Female', 'Male', None],
    'Score': [90, None, 78, None, 85]
}

df_students = pd.DataFrame(data)

In [29]:
df_students

Unnamed: 0,Name,Age,Gender,Score
0,Alice,25.0,Female,90.0
1,Bob,30.0,,
2,Cathy,,Female,78.0
3,,22.0,Male,
4,Eva,28.0,,85.0


In [43]:
df_students.isnull()

Unnamed: 0,Name,Age,Gender,Score
0,False,False,False,False
1,False,False,True,True
2,False,True,False,False
3,True,False,False,True
4,False,False,True,False


In [45]:
# Total nulls per column
df_students.isnull().sum()

# Total nulls in the whole DataFrame
df_students.isnull().sum().sum()

6

In [48]:
int(df_students.isnull().sum().sum())

6

In [51]:
df_students['Age'] = df_students['Age'].fillna(df_students['Age'].mean())

In [33]:
df_students

Unnamed: 0,Name,Age,Gender,Score
0,Alice,25.0,Female,90.0
1,Bob,30.0,,
2,Cathy,26.25,Female,78.0
3,,22.0,Male,
4,Eva,28.0,,85.0


In [52]:
df_students['Gender'] = df_students['Gender'].fillna('Female')
df_students

Unnamed: 0,Name,Age,Gender,Score
0,Alice,25.0,Female,90.0
1,Bob,30.0,Female,
2,Cathy,26.25,Female,78.0
3,,22.0,Male,
4,Eva,28.0,Female,85.0


In [None]:
# 4. Drop rows with a null value in the Name column
# subset restricts the null check to the Name column
df_students.dropna(subset = ['Name'], inplace = True)
df_students

Unnamed: 0,Name,Age,Gender,Score
0,Alice,25.0,Female,90.0
1,Bob,30.0,Female,
2,Cathy,26.25,Female,78.0
4,Eva,28.0,Female,85.0


In [60]:
# 5. Forward-fill the Score column
# Older pandas: df_students['Score'].fillna(method='ffill', inplace=True)
# Newer pandas deprecates fillna(method=...) in favor of three separate methods:
# fillna(), ffill() and bfill()
df_students['Score'] = df_students['Score'].ffill()

In [61]:
df_students

Unnamed: 0,Name,Age,Gender,Score
0,Alice,25.0,Female,90.0
1,Bob,30.0,Female,90.0
2,Cathy,26.25,Female,78.0
4,Eva,28.0,Female,85.0
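
For reference, the steps can also be written as one script run from top to bottom. This is one possible solution rather than the only correct one; it reassigns columns instead of using `inplace=True` on a selected column, which avoids the warning about chained in-place operations.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Cathy', None, 'Eva'],
    'Age': [25, 30, None, 22, 28],
    'Gender': ['Female', None, 'Female', 'Male', None],
    'Score': [90, None, 78, None, 85]
}
df_students = pd.DataFrame(data)

# 1. Count the nulls per column.
print(df_students.isna().sum())

# 2-3. Fill Age with the mean and Gender with "Female", reassigning each column.
df_students['Age'] = df_students['Age'].fillna(df_students['Age'].mean())
df_students['Gender'] = df_students['Gender'].fillna('Female')

# 4. Drop rows without a Name.
df_students = df_students.dropna(subset=['Name'])

# 5. Forward-fill Score with the previous valid value.
df_students['Score'] = df_students['Score'].ffill()

# 6. Show the cleaned DataFrame.
print(df_students)
```

The order of steps 4 and 5 matters in general: forward-filling before dropping rows could copy a value from a row that is later removed.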


## <a id='toc1_2_'></a>[Dealing with Duplicates](#toc0_)

In data analysis, it's common to encounter duplicate values in datasets. Duplicates can distort our analysis and lead to incorrect conclusions. Fortunately, pandas provides efficient methods to handle duplicates.


### <a id='toc1_2_1_'></a>[Identifying Duplicates](#toc0_)

To identify duplicate rows in a DataFrame, we can use the `duplicated()` method, which returns a boolean Series indicating whether each row is a duplicate or not. We can then use the `sum()` method to count the total number of duplicates.


In [None]:
# Boolean mask: True marks a row that repeats an earlier one
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Length: 891, dtype: bool

In [62]:
# Check total # of duplicates
df.duplicated().sum()

0

In [64]:
# Check if there are any duplicates
df.duplicated().any()

False

To check for duplicates in specific columns, we can call `duplicated()` with the `subset` parameter, or select the column first and then call `duplicated()` on the resulting Series.


In [67]:
df.duplicated(subset=['Age'])

0      False
1      False
2      False
3      False
4       True
       ...  
886     True
887     True
888     True
889     True
890     True
Length: 891, dtype: bool

In [65]:
# Check for duplicates in Age column
df.duplicated(subset=['Age']).sum()

802

In [66]:
df.Age.duplicated().any()

True

### <a id='toc1_2_2_'></a>[Removing Duplicates](#toc0_)

To remove duplicates from a DataFrame, we can use the `drop_duplicates()` method. By default, this method keeps the first occurrence of each duplicated row and removes the rest.


In [68]:
df.shape

(891, 12)

In [None]:
# Remove duplicates and update the DataFrame
df.drop_duplicates(inplace=True) # we know there are none but this is how we would do it

### <a id='toc1_2_3_'></a>[Removing Duplicates Based on Specific Columns](#toc0_)

Sometimes, we may want to remove duplicates based on specific columns. We can pass a subset of column names to the `drop_duplicates()` method to achieve this.


In [69]:
# Remove duplicates based on specific columns
df.drop_duplicates(subset=['Sex', 'Age']) #lets look at the number of rows if we do this

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
819,820,0,3,"Skoog, Master. Karl Thorsten",male,10.0,3,2,347088,27.9000,,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0000,B28,
843,844,0,3,"Lemberopolous, Mr. Peter L",male,34.5,0,0,2683,6.4375,,C
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S


By default, `drop_duplicates()` keeps the first occurrence of each duplicated row. If we want to keep the last occurrence instead, we can set the `keep` parameter to `'last'`.
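
The effect of `keep` is easier to see on a tiny table. The sketch below uses a hypothetical four-row DataFrame (names and cities are made up) and compares `keep='first'`, `keep='last'`, and `keep=False`, which drops every copy of a duplicated row.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "name": ["Ann", "Ben", "Ann", "Cy"],
    "city": ["Oslo", "Rome", "Oslo", "Rome"],
})

kept_first = toy.drop_duplicates(keep="first")  # keeps row 0, drops row 2
kept_last = toy.drop_duplicates(keep="last")    # keeps row 2, drops row 0
kept_none = toy.drop_duplicates(keep=False)     # drops both row 0 and row 2

print(kept_first.index.tolist())  # [0, 1, 3]
print(kept_last.index.tolist())   # [1, 2, 3]
print(kept_none.index.tolist())   # [1, 3]
```

The surviving index labels show which copy was kept in each case.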


In [78]:
df_sampled =df.sample(20)
df_concatenated = pd.concat([df,df_sampled])
df_concatenated.shape

(911, 12)

In [None]:
df_concatenated.drop_duplicates() # keeps the first appearance of each duplicated row

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
# Keep the last occurrence of each duplicate,
# e.g. when the most recent entry is the one that was updated last
df_concatenated.drop_duplicates(keep='last') # the sampled rows appended at the end are the copies that remain

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
151,152,1,1,"Pears, Mrs. Thomas (Edith Wearne)",female,22.0,1,0,113776,66.6000,C2,S
325,326,1,1,"Young, Miss. Marie Grice",female,36.0,0,0,PC 17760,135.6333,C32,C
655,656,0,2,"Hickman, Mr. Leonard Mark",male,24.0,2,0,S.O.C. 14879,73.5000,,S
763,764,1,1,"Carter, Mrs. William Ernest (Lucile Polk)",female,36.0,1,2,113760,120.0000,B96 B98,S


### <a id='toc1_2_4_'></a>[Resetting the Index](#toc0_)

When removing duplicates, the DataFrame index may have gaps due to removed rows. To reset the index after removing duplicates, we can use the `reset_index()` method with the `drop=True` parameter.


In [None]:
# Remove duplicates and reset the index
# Skip the reset if the original index is needed to trace rows back to the source data
df_without_duplicates = df.drop_duplicates(subset=['Sex', 'Age']).copy()
df_without_duplicates # look at the gaps in the index

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
819,820,0,3,"Skoog, Master. Karl Thorsten",male,10.0,3,2,347088,27.9000,,S
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0000,B28,
843,844,0,3,"Lemberopolous, Mr. Peter L",male,34.5,0,0,2683,6.4375,,C
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S


In [82]:
df_without_duplicates.reset_index(drop=True, inplace=True)
df_without_duplicates

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
142,820,0,3,"Skoog, Master. Karl Thorsten",male,10.0,3,2,347088,27.9000,,S
143,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0000,B28,
144,844,0,3,"Lemberopolous, Mr. Peter L",male,34.5,0,0,2683,6.4375,,C
145,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S


## <a id='toc1_3_'></a>[Formatting Data (Recap)](#toc0_)

### <a id='toc1_3_1_'></a>[Formatting Numeric Values (Recap)](#toc0_)


1. `round()` Method:
   - Rounds numeric values to a specified number of decimal places.

2. `format()` Method:
   - Formats numeric values as strings for better representation.

In [83]:
num = 123456.78910

In [92]:
# Apply round
round(num, 2)
print(round(num, 2))
type(round(num, 2))

123456.79


float

In [91]:
# Format to one decimal
format(num, '.1f') # .1f
print(format(num, '.1f'))
type(format(num, '.1f'))

123456.8


str

In [86]:
# Apply round to Fare
df['Fare'].round(2)

0       7.25
1      71.28
2       7.92
3      53.10
4       8.05
       ...  
886    13.00
887    30.00
888    23.45
889    30.00
890     7.75
Name: Fare, Length: 891, dtype: float64

In [87]:
# Apply format to Fare - doesn't work the same way
df['Fare'].format('.1f')

AttributeError: 'Series' object has no attribute 'format'

In [None]:
# Apply format to Fare the right way, through the Styler. Select the rows with tail() first, then style them
df[['Fare']].tail().style.format('{:.1f}')

Unnamed: 0,Fare
886,13.0
887,30.0
888,23.4
889,30.0
890,7.8


### <a id='toc1_3_2_'></a>[Formatting Strings (Recap)](#toc0_)

We can apply all the string methods we learnt about in the data structures lesson, such as `len`, `lower`, `upper`, `split`, and `replace`. On a Series, they are available through the `.str` accessor:

In [94]:
str_example = "This is my second Pandas Lesson"

In [95]:
# string length
len(str_example)

31

In [99]:
# name length - direct method
df['Name'].len()

AttributeError: 'Series' object has no attribute 'len'

In [100]:
# name length - str method
df['Name'].str.len()

0      23
1      51
2      22
3      44
4      24
       ..
886    21
887    28
888    40
889    21
890    19
Name: Name, Length: 891, dtype: int64

In [None]:
# astype(int) casts every value to integer; astype(str) would cast every value to string
df['PassengerID'] = df['PassengerId'].astype(int)

In [107]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
PassengerID      int32
dtype: object

In [108]:
# Upper/lowercase
print(str_example.lower())
print(str_example.upper())

this is my second pandas lesson
THIS IS MY SECOND PANDAS LESSON


In [111]:
# Upper/lowercase
display(df['Name'].str.lower())
display(df['Name'].str.upper())

0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                         graham, miss. margaret edith
888             johnston, miss. catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object

0                                BRAUND, MR. OWEN HARRIS
1      CUMINGS, MRS. JOHN BRADLEY (FLORENCE BRIGGS TH...
2                                 HEIKKINEN, MISS. LAINA
3           FUTRELLE, MRS. JACQUES HEATH (LILY MAY PEEL)
4                               ALLEN, MR. WILLIAM HENRY
                             ...                        
886                                MONTVILA, REV. JUOZAS
887                         GRAHAM, MISS. MARGARET EDITH
888             JOHNSTON, MISS. CATHERINE HELEN "CARRIE"
889                                BEHR, MR. KARL HOWELL
890                                  DOOLEY, MR. PATRICK
Name: Name, Length: 891, dtype: object

In [112]:
df['Ticket']

0             A/5 21171
1              PC 17599
2      STON/O2. 3101282
3                113803
4                373450
             ...       
886              211536
887              112053
888          W./C. 6607
889              111369
890              370376
Name: Ticket, Length: 891, dtype: object

In [113]:
df['Ticket'].str.replace('W./C.', 'w/c')

0             A/5 21171
1              PC 17599
2      STON/O2. 3101282
3                113803
4                373450
             ...       
886              211536
887              112053
888            w/c 6607
889              111369
890              370376
Name: Ticket, Length: 891, dtype: object

### <a id='toc1_3_3_'></a>[Formatting Dates](#toc0_)

We will study this in another Notebook.

Changing datatype

In [114]:
df['Survived'].astype('boolean')

0      False
1       True
2       True
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Name: Survived, Length: 891, dtype: boolean

In [None]:
df['Age'].astype(int)
# This fails because the column contains null entries.
# NaN is a float, so a column with nulls cannot be stored as a plain int.

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
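
One way around this error, sketched below on a toy Series with whole-number ages (illustrative values, not the Titanic column), is either pandas' nullable integer dtype `"Int64"` (capital I), which can hold missing values, or filling the nulls before casting. Using the median as the fill value is an assumption made for the example only.

```python
import pandas as pd

# Toy ages with one missing value (illustrative values only)
ages = pd.Series([22.0, 38.0, None, 35.0])

# Option 1: nullable integer dtype keeps the missing value as <NA>
ages_nullable = ages.astype("Int64")

# Option 2: fill the null first (median chosen only for illustration), then cast
ages_filled = ages.fillna(ages.median()).astype(int)

print(ages_nullable.dtype)          # Int64
print(ages_nullable.isna().sum())   # 1
print(ages_filled.tolist())         # [22, 38, 35, 35]
```

A column with fractional values, such as the 0.92 visible in the Titanic `Age` counts, would need rounding before either cast.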

In [116]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
PassengerID      0
dtype: int64

In [None]:
pd.to_numeric(df['Age'], errors ='coerce')
# errors='coerce' turns values that cannot be parsed as numbers into NaN instead of raising an error

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [121]:
df['Age'].value_counts()

Age
24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1
Name: count, Length: 88, dtype: int64

In [122]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
PassengerID      int32
dtype: object

## <a id='toc1_4_'></a>[Cleaning Column Names](#toc0_)

We can access the columns using `df.columns`.

In [129]:
type(df.columns)

pandas.core.indexes.base.Index

In [None]:
# Review columns
titanic_columns = df.columns.tolist()
titanic_columns

['passengerid',
 'survived',
 'pclass',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked',
 'passengerid']

In order to modify them, we can assign new column names to `df.columns` by doing `df.columns = [list_of_new_column_names]` or we can use the `rename()` method to just modify a few of them.

In [None]:
# Alternative: build the lowercase list with a list comprehension
titanic_columns_lower = [col_.lower() for col_ in titanic_columns]
titanic_columns_lower 

['passengerid',
 'survived',
 'pclass',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked',
 'passengerid']

In [130]:
# How can I create a list of columns that is lowercase?
df.columns = df.columns.str.lower()
df.columns

Index(['passengerid', 'survived', 'pclass', 'name', 'sex', 'age', 'sibsp',
       'parch', 'ticket', 'fare', 'cabin', 'embarked', 'passengerid'],
      dtype='object')

In [None]:
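# Can also assign a custom list. It needs exactly one name per column, in order,
# otherwise pandas raises a ValueError (length mismatch).
# This list assumes the 12 original Titanic columns.
df.columns = ['passenger_id', 'survived', 'pclass', 'name', 'sex', 'age', 'sib_sp', 'par_ch', 'ticket', 'fare',
       'cabin', 'embarked']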
df.columns

In [None]:
# Rename using a dictionary
renaming_dict = {
    'parch': 'parents_children',
    "sibsp": 'siblings_spouses'
}
df.rename(columns=renaming_dict, inplace=True)
# Alternative
# df.rename(renaming_dict, axis=1, inplace=True)
print(df.columns)

# rename() can also rename index labels (pass index=...)

Index(['passengerid', 'survived', 'pclass', 'name', 'sex', 'age',
       'siblings_spouses', 'parents_children', 'ticket', 'fare', 'cabin',
       'embarked', 'passengerid'],
      dtype='object')


# <a id='toc2_'></a>[Using `apply()`, `map()`, and `applymap()`](#toc0_)

- `apply()`
    - Applies a function to each element of a Series (or, on a DataFrame, to each column or row depending on `axis`).
    - Useful for custom transformations.
    - Example: `df['squared_numbers'] = df['numbers'].apply(lambda x: x ** 2)`

- `map()`
    - Transforms each element of a Series using a dictionary, a Series, or a function.
    - With a dictionary, values that have no matching key become `NaN`.
    - Example: `df['gender_mapped'] = df['gender'].map({'M': 'Male', 'F': 'Female'})`

- `applymap()`
    - Applies a function to every element of a DataFrame.
    - Deprecated since pandas 2.1 in favour of `DataFrame.map()`, which does the same thing; calling `applymap()` on recent versions emits a `FutureWarning`.
    - Example: `df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)`
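
The three methods can be compared side by side on a toy DataFrame. This is a minimal sketch: the column names `numbers` and `gender` come from the examples in the list, the values are made up, and the `hasattr` check is only there so the snippet runs on both older and newer pandas versions.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({"numbers": [1, 2, 3], "gender": ["M", "F", "M"]})

# apply(): run a function on each element of a Series
squared = toy["numbers"].apply(lambda x: x ** 2)

# map(): look each element up in a dictionary
gender_full = toy["gender"].map({"M": "Male", "F": "Female"})

# Element-wise on the whole DataFrame: DataFrame.map in pandas 2.1+, applymap before that
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
as_text = elementwise(lambda x: str(x).lower())

print(squared.tolist())          # [1, 4, 9]
print(gender_full.tolist())      # ['Male', 'Female', 'Male']
print(as_text.iloc[0].tolist())  # ['1', 'm']
```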


### `apply()`

In [137]:
# Applying a custom function using apply()
#yob = year of birth
def get_yob(age):
    return 1912 - age #titanic sank in 1912, we will assume is when Age was recorded

# Add YOB to dataset
df['yob'] = df['age'].apply(get_yob)
df['yob'].head(3)

0    1890.0
1    1874.0
2    1886.0
Name: yob, dtype: float64

In [None]:
# Test out using lambda
df['yob'] = df['age'].apply(lambda age: 1912 - age)

In the example above, we can see that to create a new column in pandas, we can simply assign a new Series or list to a new column name within the DataFrame.

To edit the information in a whole column in pandas, you can simply assign a new list or array of values to the column you want to modify.

In [138]:
# Convert the fare from US Dollars to EUR
exchange_rate = 0.9877
df['fare_eur'] = df['fare'].apply(lambda num: exchange_rate * num)
df['fare_eur']

0       7.160825
1      70.406515
2       7.827522
3      52.446870
4       7.950985
         ...    
886    12.840100
887    29.631000
888    23.161565
889    29.631000
890     7.654675
Name: fare_eur, Length: 891, dtype: float64

In [None]:
# This works as well: plain arithmetic on a column is vectorized
df['euro_fare'] = df['fare'] * exchange_rate

In [144]:
df['sex'] = '_' + df['sex']
df['sex']


0        ___male
1      ___female
2      ___female
3      ___female
4        ___male
         ...    
886      ___male
887    ___female
888    ___female
889      ___male
890      ___male
Name: sex, Length: 891, dtype: object

In [146]:
# apply() returns a new Series; df['sex'] stays unchanged because the result is not assigned
df['sex'].apply(lambda col_: col_ + '_' + col_)
df['sex']

0        ___male
1      ___female
2      ___female
3      ___female
4        ___male
         ...    
886      ___male
887    ___female
888    ___female
889      ___male
890      ___male
Name: sex, Length: 891, dtype: object

In [153]:
df['sex'].apply(lambda col_: col_ + '_' + col_).str.replace('_male', 'nonfemale')

0      __nonfemale___nonfemale
1          ___female____female
2          ___female____female
3          ___female____female
4      __nonfemale___nonfemale
                ...           
886    __nonfemale___nonfemale
887        ___female____female
888        ___female____female
889    __nonfemale___nonfemale
890    __nonfemale___nonfemale
Name: sex, Length: 891, dtype: object

In [154]:
def convert_class_into_description(col):
    if col == 1:
        return 'first class'
    elif col == 2:
        return 'second class'
    else:
        return 'third class'

In [156]:
df['pclass'].apply(convert_class_into_description)

0       third class
1       first class
2       third class
3       first class
4       third class
           ...     
886    second class
887     first class
888     third class
889     first class
890     third class
Name: pclass, Length: 891, dtype: object

### `map()`

In [None]:
# Using map() to transform the 'Sex' column to an integer + add a column
gender_mapping = {'male': 0, 'female': 1}
df['sex_mapped'] = df['Sex'].map(gender_mapping)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_mapped
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


In [162]:
# Switch to lambda
df.Sex.apply(lambda x: 0 if x=="male" else 1)

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    1
889    0
890    0
Name: Sex, Length: 891, dtype: int64

### `applymap()`

In [None]:
# With a named function
def upper_case(col):
    return col.upper()

for cols in ['Name', 'Sex', 'Ticket']:
    # apply() returns a new Series, so assign it back to keep the change
    df[cols] = df[cols].apply(upper_case)

In [168]:
#Same with lambda
# Using applymap() to convert all string columns to uppercase - 
df = df.applymap(lambda x: x.upper() if isinstance(x, str) else x)

# Displaying the modified DataFrame
df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_mapped
0,1,0,3,"BRAUND, MR. OWEN HARRIS",MALE,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"CUMINGS, MRS. JOHN BRADLEY (FLORENCE BRIGGS TH...",FEMALE,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"HEIKKINEN, MISS. LAINA",FEMALE,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"FUTRELLE, MRS. JACQUES HEATH (LILY MAY PEEL)",FEMALE,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"ALLEN, MR. WILLIAM HENRY",MALE,35.0,0,0,373450,8.05,,S,0


In [166]:
# Are we able to apply the str.upper to the whole dataframe?
df.str.upper()

AttributeError: 'DataFrame' object has no attribute 'str'

#### Let's pick the practical case from before and put it in a function

In [169]:
import pandas as pd

data = {

    'Name': ['Alice', 'Bob', 'Cathy', None, 'Eva'],

    'Age': [25, 30, None, 22, 28],

    'Gender': ['Female', None, 'Female', 'Male', None],

    'Score': [90, None, 78, None, 85]

}

df_students = pd.DataFrame(data)

In [192]:
df_students.fillna(0)

Unnamed: 0,Name,Age,Gender,Score
0,Alice,25.0,Female,90.0
1,Bob,30.0,0,0.0
2,Cathy,0.0,Female,78.0
3,0,22.0,Male,0.0
4,Eva,28.0,0,85.0


In [180]:
# 1 - Build a label such as 'Alice_in_Wonderlands female(25)' from name, gender and age
def turn_alice_to_wonderlands(name, gender, age):
    return name + '_in_Wonderlands ' + gender + f'({age})'



In [177]:
turn_alice_to_wonderlands('Alice', 'female', 25)

'Alice_in_Wonderlands female(25)'

In [204]:
def turn_alice_to_wonderlands(row):
    return str(row['Name']) + '_' + str(row['Gender']) + f'({row["Age"]})'

In [None]:
# axis=1 passes each row to the function, so it can read several columns at once
df_students.apply(turn_alice_to_wonderlands, axis = 1)

0    Alice_Female(25.0)
1        Bob_None(30.0)
2     Cathy_Female(nan)
3       None_Male(22.0)
4        Eva_None(28.0)
dtype: object

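`axis=0` applies the function to each column (moving down the rows).

`axis=1` applies the function to each row (moving across the columns).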
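
To make the two `axis` values concrete, here is a small sketch on a hypothetical two-row DataFrame (the column names `math` and `reading` and their values are made up for the illustration).

```python
import pandas as pd

# Hypothetical scores (illustrative values only)
toy = pd.DataFrame({"math": [70, 80], "reading": [90, 60]})

# axis=0: the function receives each column, so we get one result per column
per_column = toy.apply(lambda col: col.max(), axis=0)

# axis=1: the function receives each row, so we get one result per row
per_row = toy.apply(lambda row: row["math"] + row["reading"], axis=1)

print(per_column.tolist())  # [80, 90]
print(per_row.tolist())     # [160, 140]
```

With `axis=0` the result is indexed by column name, and with `axis=1` it is indexed like the original rows.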

### <a id='toc2_1_1_'></a>[More examples](#toc0_)

#### <a id='toc2_1_1_1_'></a>[Comparing Map and Apply](#toc0_)

We have a column called "Embarked" containing three possible values: 'C', 'Q', and 'S'. We want to map these values to 0, 1 and 2. In this case, `apply()` with a lambda function would be complex due to the if-elif-else conditions, but `map()` can handle it more easily.

In [None]:
# Mapping 'embarked' values to numbers using map()
embarked_mapping = {'C': 0, 'Q': 1, 'S': 2}
df['embarked_nr'] = df['embarked'].map(embarked_mapping)

# Display the first few rows of the updated DataFrame
df[['name', 'embarked_nr']].head()

Why is it a float?

In [None]:
# Check null values
df.isna().sum() # 'embarked' has null values; map() leaves them as NaN, and NaN forces the column to float

In [None]:
# Mapping 'embarked' values to numbers using apply() and a lambda function
df['embarked'].apply(lambda x: 0 if x == 'C' else (1 if x == 'Q' else 2))

# Note that here it doesn't convert it to float

In [None]:
# What happened with the null values?
df.isna().sum()
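
The difference in how the two approaches treat nulls can be checked on a toy Series. This is a sketch with made-up values, assuming the same mapping as above: `map()` leaves the missing entry as `NaN`, while the lambda sends it to the final `else` branch.

```python
import pandas as pd

# Toy port codes with one missing value (illustrative values only)
embarked = pd.Series(["C", "Q", None, "S"])
mapping = {"C": 0, "Q": 1, "S": 2}

with_map = embarked.map(mapping)
with_apply = embarked.apply(lambda x: 0 if x == "C" else (1 if x == "Q" else 2))

print(with_map.dtype)          # float64
print(with_map.isna().sum())   # 1
print(with_apply.tolist())     # [0, 1, 2, 2]
```

The `apply()` result keeps an integer dtype only because the missing value was silently relabelled as 2, which may not be what we want.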

#### <a id='toc2_1_1_2_'></a>[Calculating the length of the name](#toc0_)

What if we wanted to create a new column with the length of the name?

In [None]:
df.columns

In [None]:
df['name_length'] = df.name.apply(len)
df.head()

#### <a id='toc2_1_1_3_'></a>[Converting to float some columns with applymap()](#toc0_)

As an example, let's see how to convert the following columns to float: "PassengerId", "Survived", "Pclass".

In [None]:
df[["passenger_id", "survived", "pclass"]].applymap(float)

#### <a id='toc2_1_1_4_'></a>[Modifying columns names with apply()](#toc0_)

In [None]:
# I could also use the apply function by converting df.columns to Series
pd.Series(df.columns).apply(lambda col: col.lower())

### <a id='toc2_1_2_'></a>[💡 Check for understanding](#toc0_)

Convert the `embarked_nr` column to an integer type.

- If you get an error, read it and think about how you should proceed.
- If you decide to fill the null values, use `mode()`, since it's a categorical variable.
- If you get another error, look at what `mode()` returns in order to fix the error and convert the `embarked_nr` column to integer.

In [None]:
# Your code goes here

### <a id='toc2_1_3_'></a>[💡 Check for understanding](#toc0_)

You are given a dataset of students' exam scores here: https://raw.githubusercontent.com/data-bootcamp-v4/data/main/student_performance.csv. Your task is to perform the following operations using pandas:

1. Read the CSV file into a DataFrame.
2. Create a new column "total_score" that calculates the total score for each student by summing their "math score," "reading score," and "writing score."
3. Create a new column "grade" that assigns a grade to each student based on the following criteria:
   - If the total score is >= 90, the grade is "A."
   - If the total score is >= 80 and < 90, the grade is "B."
   - If the total score is >= 70 and < 80, the grade is "C."
   - If the total score is >= 60 and < 70, the grade is "D."
   - If the total score is < 60, the grade is "F."
4. Convert all values in the "gender" column to uppercase.
5. Create a new column "is_passed" that indicates whether each student has passed the exam or not. If the total score is >= 60, the student has passed; otherwise, they have failed.


In [206]:
df_students = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/student_performance.csv")

In [207]:
df_students

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


In [215]:
# Your code goes here
def gets_total_score(row):
    return row['math score'] + row['reading score'] + row['writing score']

df_students['total_score'] = df_students.apply(gets_total_score, axis=1)


In [217]:
# Simpler and vectorized, though not wrapped in a reusable function
df_students['total_score'] = df_students['math score'] + df_students['reading score'] + df_students['writing score']
df_students

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total_score
0,female,group B,bachelor's degree,standard,none,72,72,74,218
1,female,group C,some college,standard,completed,69,90,88,247
2,female,group B,master's degree,standard,none,90,95,93,278
3,male,group A,associate's degree,free/reduced,none,47,57,44,148
4,male,group C,some college,standard,none,76,78,75,229
...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,282
996,male,group C,high school,free/reduced,none,62,55,55,172
997,female,group C,high school,free/reduced,completed,59,71,65,195
998,female,group D,some college,standard,completed,68,78,77,223


In [226]:
def get_grade(score):
    # Thresholds are 90/80/70/60 % of the 300-point maximum total
    if score >= 270:
        return 'A'
    elif score >= 240:
        return 'B'
    elif score >= 210:
        return 'C'
    elif score >= 180:
        return 'D'
    else:
        return 'F'

In [230]:
df_students['total grade'] = df_students['total_score'].apply(get_grade)

df_students

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total_score,total grade
0,female,group B,bachelor's degree,standard,none,72,72,74,218,C
1,female,group C,some college,standard,completed,69,90,88,247,B
2,female,group B,master's degree,standard,none,90,95,93,278,A
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,F
4,male,group C,some college,standard,none,76,78,75,229,C
...,...,...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95,282,A
996,male,group C,high school,free/reduced,none,62,55,55,172,F
997,female,group C,high school,free/reduced,completed,59,71,65,195,D
998,female,group D,some college,standard,completed,68,78,77,223,C


In [229]:
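# Drop the 'grade' column if it exists (errors='ignore' avoids a KeyError when it does not)
df_students.drop(columns='grade', errors='ignore', inplace=True)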
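
As an alternative to an if/elif function, grades can also be assigned with `pd.cut`, which bins numeric values into labelled intervals. This is a different technique from the one used in the cells above, shown as a sketch only: the cut-offs assume 60/70/80/90 % of a 300-point maximum, and the sample totals are made up.

```python
import pandas as pd

# Illustrative total scores (not read from the CSV)
totals = pd.Series([148, 180, 229, 247, 278, 300])

# Assumed cut-offs: 60/70/80/90 % of the 300-point maximum
bins = [float("-inf"), 180, 210, 240, 270, float("inf")]
labels = ["F", "D", "C", "B", "A"]

# right=False makes each bin include its lower edge, e.g. [180, 210)
grades = pd.cut(totals, bins=bins, labels=labels, right=False).astype(str)

print(grades.tolist())  # ['F', 'D', 'C', 'B', 'A', 'A']
```

Keeping the thresholds in one list makes them easier to adjust than a chain of conditions.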

# <a id='toc3_'></a>[Filtering Data](#toc0_)

One of the primary tasks in dataset analysis is filtering rows.

When filtering DataFrames in Pandas, you can use boolean indexing to select specific rows based on certain conditions. Here's a step-by-step explanation:

1. Identify the column(s) you want to use as a filter condition. For example, in `housing_df` the column named 'SalePrice'.

2. Create a condition using a comparison operator (e.g., `>`, `<`, `==`, etc.) and the column(s) you want to filter. For instance, to filter all rows where the 'SalePrice' is greater than 10000, you would use `condition = housing_df['SalePrice'] > 10000`.

3. Use the condition to filter the DataFrame. You can do this by passing the condition inside square brackets to the DataFrame. For example, `filtered_df = housing_df[condition]` will create a new DataFrame `filtered_df` containing only the rows where the 'SalePrice' is greater than 10000.

Keep in mind that the condition should evaluate to a boolean Series with the same length as the DataFrame, indicating which rows to include (True) or exclude (False).

You can also combine multiple conditions using logical operators like `&` for 'and' and `|` for 'or'. For instance, to filter rows where the 'SalePrice' is greater than 10000 and the 'FullBath' is more than 1, you can use `condition = (housing_df['SalePrice'] > 10000) & (housing_df['FullBath'] > 1)`.

Filtering allows you to extract specific subsets of data from your DataFrame, making it easier to analyze and work with the data that meets your criteria.
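
The steps above can be sketched on a small stand-in table. `housing_df` here is hypothetical data built for the illustration (four made-up rows with the `SalePrice` and `FullBath` columns mentioned in the text), not a real dataset.

```python
import pandas as pd

# Hypothetical data standing in for housing_df
housing_df = pd.DataFrame({
    "SalePrice": [9000, 15000, 22000, 12000],
    "FullBath": [1, 2, 1, 3],
})

# Steps 1-3: build a boolean condition, then pass it inside square brackets
condition = housing_df["SalePrice"] > 10000
filtered_df = housing_df[condition]

# Combining conditions: wrap each one in parentheses, use & for 'and', | for 'or'
both = housing_df[(housing_df["SalePrice"] > 10000) & (housing_df["FullBath"] > 1)]
either = housing_df[(housing_df["SalePrice"] > 20000) | (housing_df["FullBath"] > 2)]

print(filtered_df.index.tolist())  # [1, 2, 3]
print(both.index.tolist())         # [1, 3]
print(either.index.tolist())       # [2, 3]
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.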


### <a id='toc3_1_1_'></a>[Creating a condition](#toc0_)

In [223]:
# Check fare col and mean
df.Fare

0       7.2500
1      71.2833
2       7.9250
3      53.1000
4       8.0500
        ...   
886    13.0000
887    30.0000
888    23.4500
889    30.0000
890     7.7500
Name: Fare, Length: 891, dtype: float64

In [224]:
df.Fare.mean()

32.204207968574636

In [225]:
# Create filter condition - fares higher than the mean
condition = df.Fare > df.Fare.mean()
condition

0      False
1       True
2      False
3       True
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Fare, Length: 891, dtype: bool

### <a id='toc3_1_2_'></a>[Filtering df](#toc0_)

In [231]:
# Get filtered df
filtered_df = df[condition]
filtered_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_mapped
1,2,1,1,"CUMINGS, MRS. JOHN BRADLEY (FLORENCE BRIGGS TH...",FEMALE,38.0,1,0,PC 17599,71.2833,C85,C,1
3,4,1,1,"FUTRELLE, MRS. JACQUES HEATH (LILY MAY PEEL)",FEMALE,35.0,1,0,113803,53.1000,C123,S,1
6,7,0,1,"MCCARTHY, MR. TIMOTHY J",MALE,54.0,0,0,17463,51.8625,E46,S,0
23,24,1,1,"SLOPER, MR. WILLIAM THOMPSON",MALE,28.0,0,0,113788,35.5000,A6,S,0
27,28,0,1,"FORTUNE, MR. CHARLES ALEXANDER",MALE,19.0,3,2,19950,263.0000,C23 C25 C27,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
856,857,1,1,"WICK, MRS. GEORGE DENNICK (MARY HITCHCOCK)",FEMALE,45.0,1,1,36928,164.8667,,S,1
863,864,0,3,"SAGE, MISS. DOROTHY EDITH ""DOLLY""",FEMALE,,8,2,CA. 2343,69.5500,,S,1
867,868,0,1,"ROEBLING, MR. WASHINGTON AUGUSTUS II",MALE,31.0,0,0,PC 17590,50.4958,A24,S,0
871,872,1,1,"BECKWITH, MRS. RICHARD LEONARD (SALLIE MONYPENY)",FEMALE,47.0,1,1,11751,52.5542,D35,S,1


In [233]:
# Do it all in one go
df[df.Fare > df.Fare.mean()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_mapped
1,2,1,1,"CUMINGS, MRS. JOHN BRADLEY (FLORENCE BRIGGS TH...",FEMALE,38.0,1,0,PC 17599,71.2833,C85,C,1
3,4,1,1,"FUTRELLE, MRS. JACQUES HEATH (LILY MAY PEEL)",FEMALE,35.0,1,0,113803,53.1000,C123,S,1
6,7,0,1,"MCCARTHY, MR. TIMOTHY J",MALE,54.0,0,0,17463,51.8625,E46,S,0
23,24,1,1,"SLOPER, MR. WILLIAM THOMPSON",MALE,28.0,0,0,113788,35.5000,A6,S,0
27,28,0,1,"FORTUNE, MR. CHARLES ALEXANDER",MALE,19.0,3,2,19950,263.0000,C23 C25 C27,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
856,857,1,1,"WICK, MRS. GEORGE DENNICK (MARY HITCHCOCK)",FEMALE,45.0,1,1,36928,164.8667,,S,1
863,864,0,3,"SAGE, MISS. DOROTHY EDITH ""DOLLY""",FEMALE,,8,2,CA. 2343,69.5500,,S,1
867,868,0,1,"ROEBLING, MR. WASHINGTON AUGUSTUS II",MALE,31.0,0,0,PC 17590,50.4958,A24,S,0
871,872,1,1,"BECKWITH, MRS. RICHARD LEONARD (SALLIE MONYPENY)",FEMALE,47.0,1,1,11751,52.5542,D35,S,1
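
Note that the filtered rows keep their original index labels (1, 3, 6, ... in the output above). The sketch below, on a hypothetical four-row frame, shows this behaviour, plus two related options: `.loc` to filter rows and pick columns in one step, and `reset_index(drop=True)` to renumber the result. The `toy` data is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Fare": [10.0, 80.0, 20.0, 50.0],
})

mask = toy["Fare"] > toy["Fare"].mean()  # the mean is 40.0

# Boolean indexing keeps the original row labels, it does not renumber them
filtered = toy[mask]
print(filtered.index.tolist())

# .loc takes the mask for the rows and a list of columns in one step
names_only = toy.loc[mask, ["Name"]]
print(names_only)

# reset_index(drop=True) renumbers from 0 when a clean index is needed
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())
```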


### <a id='toc3_1_3_'></a>[Using multiple conditions](#toc0_)

In [None]:
# Combine conditions with the element-wise boolean operators:
# & means "and", | means "or". Wrap each condition in parentheses.

# Fare higher than the mean but still lower than 50
df[(df.Fare > df.Fare.mean()) & (df.Fare < 50)]

In [None]:
# Same filter, but with each condition on its own line
condition_1 = df['Fare'] > df['Fare'].mean()
condition_2 = df['Fare'] < 50
df[condition_1 & condition_2].tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_mapped
817,818,0,2,"MALLET, MR. ALBERT",MALE,31.0,1,1,S.C./PARIS 2079,37.0042,,C,0
824,825,0,3,"PANULA, MASTER. URHO ABRAHAM",MALE,2.0,4,1,3101295,39.6875,,S,0
827,828,1,2,"MALLET, MASTER. ANDRE",MALE,1.0,0,2,S.C./PARIS 2079,37.0042,,C,0
848,849,0,2,"HARPER, REV. JOHN",MALE,28.0,0,1,248727,33.0,,S,0
853,854,1,1,"LINES, MISS. MARY CONOVER",FEMALE,16.0,0,1,PC 17592,39.4,D28,S,1


In [251]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_mapped
0,1,0,3,"BRAUND, MR. OWEN HARRIS",MALE,22.0,1,0,A/5 21171,7.2500,,S,0
1,2,1,1,"CUMINGS, MRS. JOHN BRADLEY (FLORENCE BRIGGS TH...",FEMALE,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"HEIKKINEN, MISS. LAINA",FEMALE,26.0,0,0,STON/O2. 3101282,7.9250,,S,1
3,4,1,1,"FUTRELLE, MRS. JACQUES HEATH (LILY MAY PEEL)",FEMALE,35.0,1,0,113803,53.1000,C123,S,1
4,5,0,3,"ALLEN, MR. WILLIAM HENRY",MALE,35.0,0,0,373450,8.0500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"MONTVILA, REV. JUOZAS",MALE,27.0,0,0,211536,13.0000,,S,0
887,888,1,1,"GRAHAM, MISS. MARGARET EDITH",FEMALE,19.0,0,0,112053,30.0000,B42,S,1
888,889,0,3,"JOHNSTON, MISS. CATHERINE HELEN ""CARRIE""",FEMALE,,1,2,W./C. 6607,23.4500,,S,1
889,890,1,1,"BEHR, MR. KARL HOWELL",MALE,26.0,0,0,111369,30.0000,C148,C,0


In [None]:
# Combination of three conditions
df[condition_1 & (df['Sex'] == 'MALE') & (df['Survived'] == 1)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_mapped
23,24,1,1,"SLOPER, MR. WILLIAM THOMPSON",MALE,28.0,0,0,113788,35.5,A6,S,0
55,56,1,1,"WOOLNER, MR. HUGH",MALE,,0,0,19947,35.5,C52,S,0
74,75,1,3,"BING, MR. LEE",MALE,32.0,0,0,1601,56.4958,,S,0
97,98,1,1,"GREENFIELD, MR. WILLIAM BERTRAM",MALE,23.0,0,1,PC 17759,63.3583,D10 D12,C,0
183,184,1,2,"BECKER, MASTER. RICHARD F",MALE,1.0,2,1,230136,39.0,F4,S,0
224,225,1,1,"HOYT, MR. FREDERICK MAXFIELD",MALE,38.0,1,0,19943,90.0,C93,S,0
248,249,1,1,"BECKWITH, MR. RICHARD LEONARD",MALE,37.0,1,1,11751,52.5542,D35,S,0
305,306,1,1,"ALLISON, MASTER. HUDSON TREVOR",MALE,0.92,1,2,113781,151.55,C22 C26,S,0
370,371,1,1,"HARDER, MR. GEORGE ACHILLES",MALE,25.0,1,0,11765,55.4417,E50,C,0
390,391,1,1,"CARTER, MR. WILLIAM ERNEST",MALE,36.0,1,2,113760,120.0,B96 B98,S,0


In [239]:
# Keep rows that satisfy at least one of the two conditions
filtered_df = df[(condition_1) | (condition_2)]
filtered_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_mapped
0,1,0,3,"BRAUND, MR. OWEN HARRIS",MALE,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"CUMINGS, MRS. JOHN BRADLEY (FLORENCE BRIGGS TH...",FEMALE,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"HEIKKINEN, MISS. LAINA",FEMALE,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"FUTRELLE, MRS. JACQUES HEATH (LILY MAY PEEL)",FEMALE,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"ALLEN, MR. WILLIAM HENRY",MALE,35.0,0,0,373450,8.05,,S,0
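
The parentheses around each comparison matter: `&` and `|` bind more tightly than `>` or `==`, so leaving them out changes what gets compared. The Python keywords `and` and `or` do not work element-wise on a Series and raise a `ValueError`. The sketch below, on hypothetical toy data, shows `&`, `|`, the negation operator `~`, and that error.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Fare": [10.0, 80.0, 20.0, 50.0],
    "Sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

expensive = toy["Fare"] > 30
male = toy["Sex"] == "MALE"

both = toy[expensive & male]     # AND: rows where both conditions hold
either = toy[expensive | male]   # OR: rows where at least one holds
not_male = toy[~male]            # NOT: rows where the condition is False

# The keyword `and` tries to reduce each Series to a single bool and fails
try:
    toy[expensive and male]
    keyword_error = None
except ValueError as err:
    keyword_error = err

print(both.index.tolist(), either.index.tolist(), not_male.index.tolist())
print(type(keyword_error).__name__)
```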


In [None]:
# To filter on categorical data we can also use .isin()
# Get expensive tickets in the lower classes
df[(df['Fare'] > 60) & (df['Pclass'].isin([2, 3]))]

# Alternatively, get cheap tickets in the higher classes
df[(df['Fare'] < 30) & (df['Pclass'].isin([1, 2]))]

In [None]:
# We can also use between() for numerical data
# Get fares between 90 and 100 (both endpoints included)
df[df['Fare'].between(90, 100)]
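
Both helpers return an ordinary boolean Series, so they combine with `&`, `|`, and `~` like any other condition. The sketch below uses hypothetical toy data to show that `between()` includes both endpoints by default and that `~` turns `isin()` into a "not in" test.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3, 1],
    "Fare": [90.0, 95.5, 100.0, 12.0, 101.0],
})

# between() includes both endpoints by default
in_range = toy["Fare"].between(90, 100)

# The same condition written with two comparisons
same_as = (toy["Fare"] >= 90) & (toy["Fare"] <= 100)

# isin() checks membership in a list of allowed values
lower_classes = toy["Pclass"].isin([2, 3])

# ~ negates the mask, giving "not in [2, 3]"
not_lower = ~toy["Pclass"].isin([2, 3])

print(in_range.tolist())
print(lower_classes.tolist(), not_lower.tolist())
```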

# <a id='toc4_'></a>[More Data Manipulation](#toc0_)




## <a id='toc4_1_'></a>[Setting the index](#toc0_)

Use the DataFrame's `set_index()` method to make a column the index. The values of that column then become the row labels.

In [None]:
# Use the passenger ID column as the row labels of our df
df.set_index('PassengerId', inplace=True)
df.head()
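
As an alternative to `inplace=True`, `set_index()` can be used for its return value, which leaves the original DataFrame unchanged. The sketch below assumes a hypothetical three-row frame. It also shows a label lookup with `.loc` and how `reset_index()` moves the index back into a regular column.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "PassengerId": [101, 102, 103],
    "Fare": [7.25, 71.28, 7.93],
})

# Without inplace=True, set_index() returns a new DataFrame; toy is unchanged
indexed = toy.set_index("PassengerId")

# Rows can now be looked up by their label
fare_102 = indexed.loc[102, "Fare"]

# reset_index() turns the index back into a regular column
restored = indexed.reset_index()

print(indexed)
print(fare_102)
print(list(restored.columns))
```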

## <a id='toc4_2_'></a>[Adding/removing rows and/or columns](#toc0_)

To add or remove rows and/or columns from a pandas DataFrame, you can use the following:

1. Adding rows:
   - Use the `pd.concat()` function to add rows to the DataFrame.

2. Removing rows:
   - Use the `drop()` method with the row index or label to remove specific rows.

3. Adding columns:
   - Assign a list, Series, or scalar value to a new column name with bracket notation, `df[new_column] = ...`.
   - Alternatively, use the `assign()` method, which returns a new DataFrame with the added column.

4. Removing columns:
   - Use the `drop()` method with the column name and `axis=1` to remove specific columns.
   - Alternatively, you can use the `del` keyword to remove a column in-place.

In [None]:
# Add the first row of the df at the end
# df.iloc[[0]] (double brackets) returns a one-row DataFrame and keeps the column dtypes
new_df = pd.concat([df, df.iloc[[0]]], axis=0)
new_df.tail()

In [None]:
# Remove a row from the new df
new_df.drop(1) # Returns a copy without the row(s) labelled 1; new_df itself is unchanged

In [None]:
# Remove a column from the dataframe
df.drop('Name', axis=1, inplace=True)
df.head()

In [None]:
# Create a survived_bool col
df["survived_bool"] = df['Survived'].map({0: False, 1: True})
df
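
The cells above use bracket notation and `drop()`. The other options from the list, `assign()` and `del`, can be sketched the same way, along with `ignore_index=True`, which gives an appended row its own label. The `toy` frame below is a hypothetical stand-in for the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Survived": [0, 1],
})

# assign() returns a new DataFrame with the extra column; toy is untouched
with_flag = toy.assign(survived_bool=toy["Survived"] == 1)

# ignore_index=True renumbers the rows, so the appended row gets label 2
extended = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)

# drop() returns a new DataFrame; extended itself still has three rows
shorter = extended.drop(2)

# del removes a column in place
del with_flag["Survived"]

print(with_flag)
print(extended.index.tolist(), shorter.index.tolist())
```

Without `ignore_index=True` the appended row keeps its original label, so a later `drop()` by that label would remove both copies.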

## <a id='toc4_3_'></a>[💡 Check for understanding](#toc0_)

Use the `supermarket_sales.csv` file for this task.

1. **Load the Data**: Use pandas to load the `supermarket_sales.csv` file into a DataFrame.

2. **Null Values**: Check if the DataFrame has any null values. If there are any, count the number of null values in each column.

3. **Formatting Data**: Round any floating point numbers in the DataFrame to two decimal places.

4. **Cleaning Column Names**: Ensure all column names are in lowercase and replace any spaces in the column names with underscores.

5. **Using `apply()`, `map()`, and `applymap()`**: Create a new column called 'total_cost' which is the product of the 'quantity' and 'unit_price' columns (assuming these columns exist in your dataset). Use the `apply()` function for this.

6. **Filtering Data**: Filter the DataFrame to only include rows where 'total_cost' is greater than the average 'total_cost'.

7. **Setting the Index**: Set the 'invoice_id' column (or any other unique identifier) as the index of the DataFrame.



In [None]:
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/supermarket_sales.csv'

In [None]:
# Your code goes here

# <a id='toc5_'></a>[Summary](#toc0_)

1. Null Values:
   - Null values (also known as missing values) can hinder data analysis and modeling.
   - Use `isnull()` or `isna()` to check for null values in a DataFrame or Series.
   - Chain `any()` or `sum()` onto the result to see whether, and how many, null values each column has.
   - Use `dropna()` to remove rows or columns with null values from a DataFrame.
   - The `subset`, `how`, and `thresh` parameters control which rows or columns are dropped.
   - Use `fillna()` to replace null values with specific values, such as the column's `mean()` or `median()`, or use forward/backward fill.

2. Formatting Data:
   - Use `round()` and `format()` to format numeric values.
   - Use string methods like `lower()`, `upper()`, `title()`, `strip()`, `split()`, and `replace()`.

3. Cleaning Column Names:
   - Use `df.columns` to access column names.
   - Modify column names by assigning to `df.columns` or by using `rename()`.

4. Using `apply()`, `map()`, and `applymap()`:
   - `apply()`: Applies a custom function to each element of a Series, or to each row or column of a DataFrame.
   - `map()`: Transforms Series elements based on a dictionary or a function.
   - `applymap()`: Applies a custom function to every element in a DataFrame (deprecated since pandas 2.1 in favor of `DataFrame.map()`).

5. Filtering Data:
   - Filter rows in a DataFrame using boolean indexing.
   - Use comparison operators (`<`, `>`, `==`) to create conditions.
   - Combine multiple conditions using logical operators (`&` for 'and', `|` for 'or'), wrapping each condition in parentheses.

6. Setting the Index:
   - Use `set_index()` to set an index for the DataFrame.

7. Adding/Removing Rows and Columns:
   - Use `pd.concat()` to add rows to the DataFrame.
   - Use `drop()` with the row index/label to remove specific rows.
   - Use bracket notation to add columns and `drop()` with `axis=1` to remove them.