# Pandas: A Complete Tutorial, Part 1

This notebook is a comprehensive, hands-on guide to Pandas, one of the most essential libraries for data analysis and manipulation in Python. It's designed for students who have a basic understanding of Python and NumPy and are ready to dive into the world of data science.

**Our Learning Journey:**
1.  **Introduction to Pandas**: What are Series and DataFrames?
2.  **Data Loading and Saving**: Reading from files.
3.  **Data Inspection and Exploration**: Getting to know our data.
4.  **Data Selection and Indexing**: Grabbing the exact data you need.
5.  **Data Cleaning and Transformation**: The art of tidying up data.
6.  **Handling Missing Data**: Dealing with `NaN` values.


## 1. Introduction to Pandas

Pandas is a fast, powerful, and easy-to-use open-source data analysis and manipulation tool, built on top of Python. It provides two primary data structures that are the building blocks of most data science work.

First, let's import the necessary libraries. We import `pandas` with the standard alias `pd` and `numpy` as `np`.

In [1]:
import pandas as pd
import numpy as np

### Core Data Structure: Series
A **Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It's like a single column in a spreadsheet.

In [3]:
# Creating a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])

print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
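The "labeled" part of the definition is easier to see with a non-default index. The following is a minimal sketch using made-up temperature data; the names `temps` and `temp_c` are purely illustrative.

```python
import pandas as pd

# Hypothetical example data: temperatures keyed by a weekday label
temps = pd.Series([21.5, 23.0, 19.8], index=["mon", "tue", "wed"], name="temp_c")

# Label-based access uses the index, much like a dictionary key
tue_temp = temps["tue"]

# Vectorized arithmetic keeps each label attached to its value
temps_f = temps * 9 / 5 + 32

print(tue_temp)
print(temps_f)
```

When no `index` is passed, as in the cell above, pandas falls back to the default integer labels 0, 1, 2, and so on.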


### Core Data Structure: DataFrame

A **DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects. It is the most commonly used pandas object.

In [4]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df_people = pd.DataFrame(data)

df_people

Unnamed: 0,Name,Age,City
0,Alice,25,New York
1,Bob,30,Los Angeles
2,Charlie,35,Chicago
3,David,40,Houston


In [5]:
# Creating the same DataFrame, this time with custom row labels passed via `index`
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df_people = pd.DataFrame(data, index = ["x", "y", "z", 'w'])

df_people

Unnamed: 0,Name,Age,City
x,Alice,25,New York
y,Bob,30,Los Angeles
z,Charlie,35,Chicago
w,David,40,Houston


In [6]:
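# Use the 'Name' column as the index; set_index returns a new DataFrame
df_people = df_people.set_index("Name")

# Equivalent in-place form:
# df_people.set_index("Name", inplace=True)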

df_people

Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
David,40,Houston
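As a small illustration of what changing the index gives you, the sketch below (using a hypothetical two-row `people` frame) looks up a row by name and then moves the index back into a regular column with `reset_index()`.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
people = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

indexed = people.set_index("Name")    # returns a new DataFrame; `people` is unchanged
bob_age = indexed.loc["Bob", "Age"]   # rows can now be looked up by name
restored = indexed.reset_index()      # moves the index back into a regular column

print(bob_age)
print(restored)
```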


---

## 2. Data Loading and Saving 

A crucial part of data science is loading data from external sources. We will explore two common scenarios:
1. Loading a dataset that comes with another library (like Seaborn).
2. Loading a dataset from a CSV (Comma-Separated Values) file.

### Loading a Built-in Dataset (The Titanic Dataset)

Seaborn is a visualization library that includes some classic datasets. We'll use it to load the famous Titanic dataset. If you don't have seaborn, you can install it by running `pip install seaborn` in your terminal.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

titanic_df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


### Loading Data from a CSV File

Reading a CSV file is the most common way to get data into pandas. We'll load a video game sales file, `vgsales.csv`, into a DataFrame using `pd.read_csv()`.

In [14]:
# Read the CSV file into a DataFrame
vgsales_df = pd.read_csv('vgsales.csv')

# Display the DataFrame; for long frames pandas shows only the first and last rows
vgsales_df

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.00
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.00,31.37
...,...,...,...,...,...,...,...,...,...,...,...
16593,16596,Woody Woodpecker in Crazy Castle 5,GBA,2002.0,Platform,Kemco,0.01,0.00,0.00,0.00,0.01
16594,16597,Men in Black II: Alien Escape,GC,2003.0,Shooter,Infogrames,0.01,0.00,0.00,0.00,0.01
16595,16598,SCORE International Baja 1000: The Official Game,PS2,2008.0,Racing,Activision,0.00,0.00,0.00,0.00,0.01
16596,16599,Know How 2,DS,2010.0,Puzzle,7G//AMES,0.00,0.01,0.00,0.00,0.01
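`pd.read_csv()` accepts many optional parameters. The sketch below reads a small, made-up CSV from an in-memory string (standing in for a file on disk) and uses `usecols` and `nrows` to load only part of it. It also shows one likely reason a column such as `Year` can appear as `2006.0`: a missing value forces an integer-looking column to `float64`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Rank,Name,Year,Global_Sales
1,Game A,2006,82.74
2,Game B,,40.24
3,Game C,2008,35.82
"""

# usecols keeps only the listed columns; nrows stops after the first 2 data rows
games = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Name", "Year", "Global_Sales"],
    nrows=2,
)

print(games)
print(games.dtypes)
```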


### Saving Data
You can easily save your DataFrame to a file. Let's save the first 5 rows of our video game sales data to a new CSV file.

In [16]:
top_5_games = vgsales_df.head(5)

# Save to a new CSV file. `index=False` prevents pandas from writing the row index to the file.
top_5_games.to_csv('top_5_games.csv', index=False)

print("Saved 'top_5_games.csv' successfully!")

Saved 'top_5_games.csv' successfully!
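To see why `index=False` is usually what you want, the following sketch writes a tiny, made-up DataFrame both ways. Calling `to_csv()` without a path returns the CSV text instead of writing a file, which makes the difference easy to inspect.

```python
import io
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"Name": ["Game A", "Game B"], "Sales": [82.74, 40.24]})

with_index = df.to_csv()                # row index written as an unnamed first column
without_index = df.to_csv(index=False)  # only the real columns are written

# Reading the first version back produces an extra 'Unnamed: 0' column
reloaded = pd.read_csv(io.StringIO(with_index))

print(without_index)
print(reloaded.columns.tolist())
```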


---

## 3. Data Inspection and Exploration

Once your data is loaded, the first step is always to explore it. Pandas provides several functions for a quick overview. We'll use the Titanic dataset for this section.

### Initial Exploration

In [20]:
# Row labels (the index) of the DataFrame
titanic_df.index

RangeIndex(start=0, stop=891, step=1)

In [21]:
titanic_df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [19]:
# View the first 5 rows
titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [22]:
# View the last 3 rows
titanic_df.tail(3)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


### DataFrame Information

In [23]:
# Get a concise summary of the DataFrame
# This is extremely useful for seeing data types and missing values
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [24]:
# Generate descriptive statistics for numerical columns
titanic_df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [3]:
# Get the dimensions of the DataFrame (rows, columns)
titanic_df.shape

(891, 15)

In [6]:
# The 5 smallest fares (nsmallest defaults to n=5)
titanic_df['fare'].nsmallest()

179    0.0
263    0.0
271    0.0
277    0.0
302    0.0
Name: fare, dtype: float64

In [5]:
# The 10 rows with the smallest fare; ties on fare are ordered by pclass
titanic_df.nsmallest(10, ['fare', 'pclass'])

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
263,0,1,male,40.0,0,0,0.0,S,First,man,True,B,Southampton,no,True
633,0,1,male,,0,0,0.0,S,First,man,True,,Southampton,no,True
806,0,1,male,39.0,0,0,0.0,S,First,man,True,A,Southampton,no,True
815,0,1,male,,0,0,0.0,S,First,man,True,B,Southampton,no,True
822,0,1,male,38.0,0,0,0.0,S,First,man,True,,Southampton,no,True
277,0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
413,0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
466,0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
481,0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
674,0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
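One way to think about `nsmallest` with several columns is as a shortcut for sorting by those columns and taking the first rows. The sketch below checks that on a small, made-up frame; it is an illustration of the idea, not a statement about how pandas implements it internally.

```python
import pandas as pd

# Hypothetical fares with a tie on 0.0
fares = pd.DataFrame({"fare": [7.25, 0.0, 53.1, 0.0], "pclass": [3, 2, 1, 1]})

via_nsmallest = fares.nsmallest(2, ["fare", "pclass"])
via_sort = fares.sort_values(by=["fare", "pclass"]).head(2)

print(via_nsmallest)
print(via_nsmallest.equals(via_sort))
```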


### Unique Values

Often, you want to know what unique values a column contains, or how many times each value appears.

In [None]:
# Get the unique values in the 'class' column
titanic_df['class'].unique()

In [None]:
# Get the number of unique values in the 'class' column
titanic_df['class'].nunique()

In [None]:
# Get the counts of each unique value in the 'class' column
titanic_df['class'].value_counts()
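The three methods treat missing values differently, which is easy to miss. Here is a minimal sketch on a made-up Series with one missing entry:

```python
import pandas as pd

# Hypothetical values with one missing entry
classes = pd.Series(["Third", "First", "Third", "Second", "Third", None])

distinct = classes.unique()                     # distinct values, missing value included
n_distinct = classes.nunique()                  # number of distinct values, missing excluded by default
counts = classes.value_counts()                 # frequency of each value, most common first
shares = classes.value_counts(normalize=True)   # proportions instead of counts

print(distinct)
print(n_distinct)
print(counts)
print(shares)
```

Passing `dropna=False` to `nunique()` or `value_counts()` includes the missing value in the result.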

### Exercises: Data Inspection

**Question 1:** Using the `vgsales_df` DataFrame, display its summary information using `.info()`.

**Question 2:** How many unique game genres are present in the `vgsales_df` DataFrame? What are they?

In [14]:
# Now, read the CSV file into a DataFrame
vgsales_df = pd.read_csv('vgsales.csv')

# Display the first few rows to confirm it loaded correctly
# print(vgsales_df.info())
print(vgsales_df.head())


   Rank                      Name Platform    Year         Genre Publisher  \
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo   
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo   
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo   
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo   
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     41.49     29.02      3.77         8.46         82.74  
1     29.08      3.58      6.81         0.77         40.24  
2     15.85     12.88      3.79         3.31         35.82  
3     15.75     11.01      3.28         2.96         33.00  
4     11.27      8.89     10.22         1.00         31.37  


In [13]:
print(vgsales_df['Genre'].unique())

['Sports' 'Platform' 'Racing' 'Role-Playing' 'Puzzle' 'Misc' 'Shooter'
 'Simulation' 'Action' 'Fighting' 'Adventure' 'Strategy']


---

## 4. Data Selection and Indexing 

Selecting specific subsets of your data is a fundamental skill. Pandas offers powerful, flexible, and intuitive ways to do this.

### Selecting Columns

In [None]:
# Select a single column (returns a Series)
titanic_df['age'].head()

In [None]:
# Select multiple columns (returns a DataFrame)
titanic_df[['pclass', 'sex', 'age', 'survived']].head()

### Selecting Rows & Columns with `.loc` and `.iloc`

This is the most precise way to select data. Remember:
- **`.loc`** is for **label-based** selection (uses the index names and column names).
- **`.iloc`** is for **integer-position-based** selection (uses the numerical position, starting from 0).
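With the default `0, 1, 2, ...` index, labels and positions coincide, so the two accessors can look interchangeable. The sketch below uses a small, made-up frame whose labels are deliberately in reverse order to make the difference visible, including the different slicing rules.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "score": [10, 20, 30, 40]},
    index=[3, 2, 1, 0],  # labels deliberately in reverse order
)

by_label = df.loc[3, "name"]     # row whose label is 3, which is the first row
by_position = df.iloc[3, 0]      # row at position 3, which is the last row

label_slice = df.loc[3:1]        # label slices include both endpoints
position_slice = df.iloc[0:2]    # position slices exclude the stop, like Python lists

print(by_label, by_position)
print(label_slice)
print(position_slice)
```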

In [15]:
# .loc: Select the row with index label 3
titanic_df.loc[3]

survived                 1
pclass                   1
sex                 female
age                   35.0
sibsp                    1
parch                    0
fare                  53.1
embarked                 S
class                First
who                  woman
adult_male           False
deck                     C
embark_town    Southampton
alive                  yes
alone                False
Name: 3, dtype: object

In [16]:
titanic_df.loc[[3]]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False


In [17]:
titanic_df.loc[[3, 5, 7]]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False


In [18]:
# .iloc: Select the row at integer position 3 (which is the 4th row)
titanic_df.iloc[3]

survived                 1
pclass                   1
sex                 female
age                   35.0
sibsp                    1
parch                    0
fare                  53.1
embarked                 S
class                First
who                  woman
adult_male           False
deck                     C
embark_town    Southampton
alive                  yes
alone                False
Name: 3, dtype: object

In [19]:
# .loc: Select rows with index labels 0 through 3, and columns 'sex' through 'fare'
titanic_df.loc[0:3, 'sex':'fare']

Unnamed: 0,sex,age,sibsp,parch,fare
0,male,22.0,1,0,7.25
1,female,38.0,1,0,71.2833
2,female,26.0,0,0,7.925
3,female,35.0,1,0,53.1


In [21]:
# .iloc: Select the first 4 rows (0,1,2,3) and columns at position 2,3,4
titanic_df.iloc[0:4, 2:5]

Unnamed: 0,sex,age,sibsp
0,male,22.0,1
1,female,38.0,1
2,female,26.0,0
3,female,35.0,1


In [None]:
# .iloc: all rows from position 6 onward, and the columns at positions 3 and 4
vgsales_df.iloc[6:, 3:5]

### Conditional Selection (Boolean Indexing)

This is extremely powerful. You can filter your data based on conditions.

In [20]:
# Select all rows where the passenger was female
titanic_df[titanic_df['sex'] == 'female'].head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


In [None]:
# Select all rows where the passenger was female AND over the age of 50
# Note the parentheses around each condition
titanic_df[(titanic_df['sex'] == 'female') & (titanic_df['age'] > 50)].head()
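Conditions are combined with `&` (and), `|` (or), and `~` (not) rather than Python's `and`, `or`, and `not`. The following sketch uses a small, made-up frame and stores each mask in a variable, which can make longer filters easier to read. Note that a comparison against a missing value evaluates to `False`.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers, one with a missing age
df = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "age": [62.0, 8.0, np.nan, 45.0],
})

is_female = df["sex"] == "female"
over_50 = df["age"] > 50                  # the NaN row compares as False

both = df[is_female & over_50]            # AND
either = df[is_female | over_50]          # OR
neither = df[~(is_female | over_50)]      # NOT

print(both)
print(either)
print(neither)
```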

In [None]:
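# Rows where age is between 1 and 10 (both endpoints included)
titanic_df[titanic_df['age'].between(1, 10)]

In [None]:
# Rows where the port of embarkation is 'S' or 'C'
titanic_df[titanic_df['embarked'].isin(['S', 'C'])]

In [None]:
# ~ negates the mask: rows where embarked is neither 'S' nor 'C' (including missing values)
titanic_df[~titanic_df['embarked'].isin(['S', 'C'])]

In [None]:
# String methods are available through the .str accessor
titanic_df[titanic_df['class'].str.startswith('Th')]

In [None]:
# Lowercase every value in the column
titanic_df['class'].str.lower()

In [None]:
# Split each string on the letter 'e' (returns a list per row)
titanic_df["sex"].str.split('e')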

### Updating Data

You can use `.loc` or `.iloc` to target specific data and update it.

In [None]:
# Let's create a copy to avoid changing the original DataFrame
vgsales_copy = vgsales_df.copy()

# Update the Publisher for the game at Rank 13 to be 'Rockstar Games'
vgsales_copy.loc[vgsales_copy['Rank'] == 13, 'Publisher'] = 'Rockstar Games'

# Verify the change
vgsales_copy[vgsales_copy['Rank'] == 13]
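The same pattern scales from one row to many: a single `.loc` call with a boolean mask and a column label updates every matching row. The sketch below uses a small, made-up frame (the publisher names are placeholders) and also shows a positional update with `.iloc`.

```python
import pandas as pd

# Hypothetical data for illustration
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Publisher": ["Nintendo", "Unknown", "Unknown"],
    "Sales": [5.0, 1.2, 0.4],
})
updated = games.copy()  # work on a copy so `games` stays untouched

# Boolean mask + column label: every matching row is updated at once
mask = updated["Publisher"] == "Unknown"
updated.loc[mask, "Publisher"] = "Indie"

# .iloc targets a cell by position: row 0, column 2 ('Sales')
updated.iloc[0, 2] = 5.5

print(updated)
```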

### Exercises: Selection and Indexing

**Question 1:** From the `vgsales_df`, select the `Name`, `Platform`, and `Genre` for all games published by 'Nintendo'.

**Question 2:** Select all games from `vgsales_df` that were released after the year 2008 and are of the 'Action' genre.

---

## 5. Data Cleaning and Transformation 

Real-world data is rarely clean. You'll often need to add or remove columns, handle duplicates, apply transformations, and sort your data.

### Adding a New Column

Creating new columns from existing data is a very common operation.

In [None]:
# 'Global_Sales' already exists in vgsales_df; assigning to it recomputes the column as the sum of the regional sales
vgsales_df['Global_Sales'] = vgsales_df['NA_Sales'] + vgsales_df['EU_Sales'] + vgsales_df['JP_Sales'] + vgsales_df['Other_Sales']

vgsales_df.head()

In [None]:
# Assigning a scalar fills every row of the new column with that value
vgsales_df['new_col'] = 1
vgsales_df

### Applying Functions

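You can apply your own functions to the data with `.apply()`, which calls the function once for each value in a column. We'll start with a short `lambda`, then write a named function that categorizes game sales as 'High' or 'Low' and use it to create a new column.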

In [None]:
# Flag games whose NA_Sales exceed 10; the comparison already yields True/False
vgsales_df['target_growth_product'] = vgsales_df['NA_Sales'].apply(lambda x: x > 10)
vgsales_df

In [None]:
def sales_category(sales):
    if sales > 20:
        return 'High'
    else:
        return 'Low'

# Apply this function to the 'Global_Sales' column
vgsales_df['Sales_Category'] = vgsales_df['Global_Sales'].apply(sales_category)

vgsales_df.head()
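For simple two-way rules like this one, a vectorized expression is a common alternative to `.apply()` and is often faster on large frames. The sketch below compares the two on a small, made-up frame; `np.where` is one option among several, not the only way to do this.

```python
import numpy as np
import pandas as pd

# Hypothetical sales figures; 20.0 sits exactly on the threshold
sales = pd.DataFrame({"Global_Sales": [82.74, 20.0, 35.82, 0.01]})

def sales_category(value):
    return "High" if value > 20 else "Low"

# Element by element: the function is called once per value
via_apply = sales["Global_Sales"].apply(sales_category)

# Vectorized: the comparison and the choice run on the whole column at once
via_where = pd.Series(
    np.where(sales["Global_Sales"] > 20, "High", "Low"),
    index=sales.index,
)

print(via_apply.tolist())
print(via_where.tolist())
```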

### Sorting Data

You can sort your DataFrame by one or more columns.

In [None]:
# Sort vgsales_df by Year in descending order
vgsales_df.sort_values(by='Year', ascending=False).head()

In [None]:
# Sort by Genre (ascending) and then by Global_Sales (descending)
vgsales_df.sort_values(by=['Genre', 'Global_Sales'], ascending=[True, False]).head()

In [None]:
vgsales_df.sort_index(ascending=False)
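Two details of `sort_values` are worth seeing on a small example: missing values are placed last by default regardless of sort direction, and `ascending` can be set per column. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing year
df = pd.DataFrame({
    "Genre": ["Sports", "Puzzle", "Sports", "Action"],
    "Year": [2006.0, np.nan, 2009.0, 2001.0],
})

# Descending by Year: the NaN row still ends up at the bottom
newest_first = df.sort_values(by="Year", ascending=False)

# Genre A-to-Z, and within each genre the most recent year first
by_genre_then_year = df.sort_values(by=["Genre", "Year"], ascending=[True, False])

print(newest_first)
print(by_genre_then_year)
```

Use `na_position='first'` to put missing values at the top instead.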

### Renaming Columns

In [None]:
# Rename some columns in titanic_df
# rename returns a new DataFrame; titanic_df itself is unchanged unless you assign the result
titanic_df.rename(columns={'pclass': 'Passenger_Class', 'sex': 'Gender'}).head()

### Handling Duplicates

In [None]:
# Let's create a dummy DataFrame with duplicates
dup_data = {'col1': ['A', 'B', 'A', 'A'], 'col2': [1, 2, 1, 2]}
dup_df = pd.DataFrame(dup_data)
print("Original DataFrame with duplicates:")
print(dup_df)

# Drop the duplicates
print("\nDataFrame after dropping duplicates:")
print(dup_df.drop_duplicates())

In [None]:
dup_df.drop_duplicates(subset=['col1'])
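Before dropping anything, it can help to see which rows pandas considers duplicates. The sketch below reuses the same dummy data with `duplicated()`, and shows the `keep` parameter, which controls which occurrence survives.

```python
import pandas as pd

dup_df = pd.DataFrame({"col1": ["A", "B", "A", "A"], "col2": [1, 2, 1, 2]})

# True for every row that repeats an earlier row across all columns
flags = dup_df.duplicated()

# Consider only col1, and keep the last occurrence of each value instead of the first
keep_last = dup_df.drop_duplicates(subset=["col1"], keep="last")

print(flags.tolist())
print(keep_last)
```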

### Exercises: Cleaning and Transformation

**Question 1:** In the `titanic_df`, create a new column called `family_size` which is the sum of the `sibsp` (siblings/spouses) and `parch` (parents/children) columns, plus 1 for the passenger themselves.

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')


In [2]:
titanic_df['family_size'] = titanic_df['sibsp'] + titanic_df['parch'] + 1

titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,family_size
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,2
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,2
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,1
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,2
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,1


**Question 2:** Find the 5 passengers in the Titanic dataset who paid the highest fare.

In [3]:
titanic_df.sort_values(by='fare', ascending=False).head(5)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,family_size
679,1,1,male,36.0,0,1,512.3292,C,First,man,True,B,Cherbourg,yes,False,2
258,1,1,female,35.0,0,0,512.3292,C,First,woman,False,,Cherbourg,yes,True,1
737,1,1,male,35.0,0,0,512.3292,C,First,man,True,B,Cherbourg,yes,True,1
88,1,1,female,23.0,3,2,263.0,S,First,woman,False,C,Southampton,yes,False,6
438,0,1,male,64.0,1,4,263.0,S,First,man,True,C,Southampton,no,False,6


---

## 6. Handling Missing Data (NaN)

Missing data is a common problem. Pandas uses the NumPy value `np.nan` to represent missing data. Let's see how to find and handle it.

### Identifying Missing Data

In [6]:
# Check for missing values in the titanic_df
# .isnull() returns a boolean DataFrame, and .sum() counts the 'True' values per column
titanic_df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
family_size      0
dtype: int64

In [5]:
# .notna() is the inverse of .isnull(): True where a value is present
titanic_df.notna()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,family_size
0,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,True,True,True,True,True,True,True,True,True,True,True,False,True,True,True,True
887,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True
888,True,True,True,False,True,True,True,True,True,True,True,False,True,True,True,True
889,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True,True


We can see that `age` (177 missing) and `deck` (688 missing) have many missing values, while `embarked` and `embark_town` are each missing only 2.

### Treating Missing Data

You have two main options: drop the missing values or fill them in.

#### Dropping Missing Values
You can drop rows or columns with missing data. This is simple but can lead to loss of valuable information.

In [None]:
# Drop rows that have any missing values
# Note the change in shape
print(f"Original shape: {titanic_df.shape}")
df_dropped = titanic_df.dropna()
print(f"Shape after dropping NaNs: {df_dropped.shape}")

In [None]:
titanic_df.dropna(how='all')
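`dropna()` has a few parameters that make it less drastic than dropping every incomplete row. With `how='all'`, a row is dropped only if every value in it is missing. The sketch below, on a small made-up frame, also shows `subset` (only look at certain columns) and `axis=1` with `thresh` (drop sparse columns instead of rows).

```python
import numpy as np
import pandas as pd

# Hypothetical frame where every row has at least one missing value
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "deck": [np.nan, "C", np.nan],
    "fare": [7.25, 71.28, 7.92],
})

any_missing = df.dropna()                      # drops every row here
age_only = df.dropna(subset=["age"])           # only rows missing 'age' are dropped
dense_columns = df.dropna(axis=1, thresh=2)    # keep columns with at least 2 non-missing values

print(len(any_missing))
print(age_only)
print(dense_columns.columns.tolist())
```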

#### Filling Missing Values (`fillna`)
A better approach is often to fill the missing values, a process called **imputation**. You can fill with a constant value, or with a calculated value like the mean or median.

In [None]:
# Let's fill the missing 'age' values with the mean age
mean_age = titanic_df['age'].mean()
print(f"Mean age: {mean_age:.2f}")

# Create a copy to work on
titanic_filled = titanic_df.copy()

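# Assign the result back: fillna(..., inplace=True) on a selected column is chained assignment and may not update the DataFrame in newer pandas versions
titanic_filled['age'] = titanic_filled['age'].fillna(mean_age)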

# Verify that there are no more missing ages
titanic_filled.isnull().sum()
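Different columns often call for different fill values. As a minimal sketch on made-up data, `fillna` also accepts a dictionary mapping column names to fill values, here the median for a numeric column and the most frequent value for a text column. Whether the mean, median, or mode is appropriate depends on the data; this only illustrates the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
df = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 80.0],
    "embark_town": ["Southampton", None, "Southampton", "Cherbourg"],
})

fill_values = {
    "age": df["age"].median(),                  # less sensitive to outliers than the mean
    "embark_town": df["embark_town"].mode()[0], # mode() returns a Series, so take the first entry
}

filled = df.fillna(fill_values)   # returns a new DataFrame; `df` is unchanged

print(filled)
```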

### Exercises: Handling Missing Data

**Question 1:** The `embark_town` column in `titanic_df` is also missing some values. Find the most frequent embarkation town (the mode) and use it to fill the missing values in that column.

In [12]:
import seaborn as sns
import pandas as pd
import numpy as np

# Load the Titanic dataset
titanic_df = sns.load_dataset('titanic')

titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
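# .mode() returns a Series (there can be ties), so take the first entry
embark_town_mode = titanic_df['embark_town'].mode()[0]

# Create a copy to work on
titanic_filled = titanic_df.copy()

# fillna returns a new Series, so assign it back to the column
titanic_filled['embark_town'] = titanic_filled['embark_town'].fillna(embark_town_mode)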

**Question 2:** Filter the Titanic dataset to the rows in which `age` is not null and `sex` is female.

In [22]:
titanic_df[(titanic_df['sex'] == 'female') & (titanic_df['age'].notna())].head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


---