# Chapter 4 Type Casting and Handling NA Values

In [None]:
import pandas as pd

# Read titanic dataset
tnc = pd.read_csv("./datasets/titanic.csv")

# Print dataframe
tnc.head()

## Datatype of a column in a Dataframe

When a dataset is loaded into a Dataframe, Pandas infers the datatype of each column from the values present in that column.

To check which datatypes were assigned, you can use the following:

1) **Dataframe.column.dtype**: This attribute gives the datatype of a specific column.

2) **Dataframe.dtypes**: This attribute gives the datatypes assigned to all columns in a Dataframe.

3) **Dataframe.info()**: This method reports the count of non-null values per column, the datatypes assigned, and the total memory occupied by the Dataframe.
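The three inspection tools above can be sketched on a small, self-contained Dataframe. The column names and values below are hypothetical and are not taken from the titanic dataset; they only mimic a numeric-looking column that contains a '?' placeholder.

```python
import io
import pandas as pd

# Hypothetical toy data: "age" holds strings because of the "?" placeholder
demo = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": ["29", "?", "41"],
    "fare": [7.25, 71.28, 8.05],
})

# Datatype of a single column (an attribute, so no parentheses)
age_dtype = demo["age"].dtype

# Datatypes of all columns, returned as a Series indexed by column name
all_dtypes = demo.dtypes

# info() prints its summary; passing buf= captures the text instead
buffer = io.StringIO()
demo.info(buf=buffer)
summary = buffer.getvalue()

print(age_dtype)
print(all_dtypes)
print(summary)
```

The exact label printed for a string column depends on the installed pandas version, so the sketch makes no claim about it beyond the column being non-numeric.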

In [None]:
# Print the datatype of age
tnc.age.dtype

# dtype('O') is the generic object datatype, used here because the column holds Python 'str' values

In [None]:
# Extract datatypes of titanic dataframe
tnc.dtypes

In [None]:
# Extract info about non-null values, datatypes assigned and total memory occupied by the dataframe
tnc.info()

## Converting datatypes (or) Type casting

We can convert the datatype of a column assigned by pandas to a different datatype of our choice by making use of the method:

**Dataframe.column.astype(datatype)**

Ex: df["age"].astype("float")
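As a minimal sketch of astype() on clean data, the hypothetical Series below holds numeric strings only, so the conversion succeeds. The variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical column of numeric strings, stored with the object datatype
ages = pd.Series(["22", "38.5", "26"], dtype="object")

# astype() returns a new Series; the original Series is left unchanged
ages_float = ages.astype("float")

print(ages.dtype)
print(ages_float.dtype)
```

Because astype() returns a new object, the result has to be assigned to a variable (or back to the column) to be kept.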

In [None]:
# Current datatype of age
tnc.age.dtype

# dtype('O') is the generic object datatype, used here because the column holds Python 'str' values

In [None]:
# Convert datatype of age column from object to float in dataframe
# tnc.age.astype("float")

# The above call fails because the str value '?' cannot be converted into a float
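The failure can be reproduced safely on a small hypothetical Series by wrapping the conversion in try/except. This is only a sketch; the data is made up and the flag name `conversion_failed` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical column where "?" stands in for a missing age
ages = pd.Series(["22", "?", "26"], dtype="object")

try:
    ages.astype("float")
    conversion_failed = False
except ValueError as err:
    # "?" has no float representation, so the whole conversion is rejected
    conversion_failed = True
    print(f"astype failed: {err}")
```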

In [None]:
# Replace the value of '?' in age with None using the Series.replace() method and assign the result back
tnc["age"] = tnc["age"].replace(['?'], [None])

In [None]:
# Convert datatype of age column from object to float in dataframe
tnc.age.astype("float")

# None values are converted to NaN
# Note that it returns converted values of age but doesn't update the age column

In [None]:
# Current datatype of age
tnc.age.dtype

# The datatype of age is still Object

In [None]:
# Set datatype of age as float using type casting
tnc.age = tnc.age.astype("float")

# Current datatype of age
tnc.age.dtype

# Now the datatype of age is converted to float64.
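As an alternative to replacing the placeholder and then casting, pandas also provides `pd.to_numeric`, which can turn unparseable entries into NaN in one step through `errors="coerce"`. The sketch below uses a hypothetical Series rather than the titanic data.

```python
import pandas as pd

# Hypothetical column where "?" marks a missing age
ages = pd.Series(["22", "?", "26"], dtype="object")

# errors="coerce" converts every value that cannot be parsed into NaN
ages_numeric = pd.to_numeric(ages, errors="coerce")

print(ages_numeric)
```

Note that coercion silently converts every unparseable value, not just '?', so it is worth checking the number of NaN values afterwards.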

In [None]:
# Now that age is numeric, describe() includes it in the summary statistics used for EDA
tnc.describe()

## Handling missing values

We can handle the missing values in a Dataframe using the following methods:

1) **Dataframe.column.isna()**: Returns a boolean Series indicating True for NaN (missing) values and False otherwise. This can be used as a mask to select the records with missing values in the specified column.

2) **Dataframe.isna()**: Returns a Dataframe indicating True for NaN (missing) values and False otherwise.

3) **Dataframe.column.notna()**: Returns a boolean Series indicating True for non-NaN values and False for NaN values. This can be used as a mask to select the records that **do not** have missing values in the specified column.

4) **Dataframe.notna()**: Returns a Dataframe indicating True for non-NaN values and False for NaN values.

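5) **Dataframe.column.dropna()**: Returns the values of the specified column with the NaN entries removed. Calling dropna() on the Dataframe itself drops the records (rows) that contain NaN values.

6) **Dataframe.column.fillna()**: Fills all the NaN values in the specified column with the values passed to the method.

**Note that none of these methods modify the Dataframe by default: isna() and notna() always return new objects, and dropna() and fillna() have inplace set to False.**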

These methods provide flexibility in how missing values are handled in a Dataframe, allowing users to select, drop, or replace missing values based on their requirements.
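The methods listed above can be sketched on a small hypothetical Dataframe. The column names and values are made up and are not taken from the game_stats dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing points value
demo = pd.DataFrame({
    "player": ["A", "B", "C"],
    "points": [10.0, np.nan, 7.0],
})

# Boolean mask: True where points is missing
mask = demo["points"].isna()

# Select rows with and without a missing points value
missing_rows = demo[mask]
present_rows = demo[demo["points"].notna()]

# Both calls return new Series; demo itself is not modified
dropped = demo["points"].dropna()
filled = demo["points"].fillna(0)

print(missing_rows)
print(present_rows)
print(dropped)
print(filled)
```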

In [None]:
# Read games dataset
stats = pd.read_csv("./datasets/game_stats.csv")

# Print dataframe
stats

## Filtering missing values

In [None]:
# Generate a dataframe indicating the presence of NaN values in stats dataframe
stats.isna()

In [None]:
# Generate a dataframe indicating the presence of non-NaN values in stats dataframe
stats.notna()

In [None]:
# Select the records that have NaN values in league column of stats dataframe
league_is_na = stats.league.isna()

stats[league_is_na]

In [None]:
# Select the records that do not have NaN values in assists column of stats dataframe
assists_not_na = stats.assists.notna()

stats[assists_not_na]

## Dropping missing values

In [None]:
# Drop the NaN values from the rebounds column
stats.rebounds.dropna()

# Returns a series of rebounds values that are not NaN

In [None]:
# Drop the records that have at least one NaN value
stats.dropna()

# Returns a dataframe containing only the records that have no NaN in any column

In [None]:
# Drop the records that have all values as NaN
stats.dropna(how="all")

# Returns a dataframe containing the records that have at least one column filled

## Drop the records that have certain columns as NaN

To drop records that have NaN in certain columns, we can make use of the **subset argument of the dropna() method.**

Ex: df.dropna(subset=["age", "country"]), drops all the records that have age or country as NaN.
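The difference between the default behaviour, `how="all"`, and `subset` can be sketched on a hypothetical Dataframe. The data is made up so that each call keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data:
# row 0 is complete, row 1 lacks age, row 2 is entirely NaN, row 3 lacks score
demo = pd.DataFrame({
    "age": [30.0, np.nan, np.nan, 25.0],
    "country": ["IN", "US", None, "FR"],
    "score": [1.0, 2.0, np.nan, np.nan],
})

# Default (how="any"): drop rows with at least one NaN
complete_rows = demo.dropna()

# how="all": drop only rows in which every column is NaN
not_empty_rows = demo.dropna(how="all")

# subset: only age and country are checked, so a missing score is tolerated
age_country_rows = demo.dropna(subset=["age", "country"])

print(complete_rows.index.tolist())
print(not_empty_rows.index.tolist())
print(age_country_rows.index.tolist())
```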

In [None]:
# Drop the records that have league or points as NaN
stats.dropna(subset=["league", "points"])

# Returns a dataframe containing the records in which both league and points are filled

## Filling missing values

In [None]:
# Fill all the missing values in the dataframe with 0
stats.fillna(0)

# Returns a dataframe with all the missing values replaced with 0

In [None]:
# Fill all the missing values in league column with "isl"
stats.league.fillna("isl")

# Returns a series of league values with NaN values replaced with "isl"

## Filling certain missing values

To fill certain missing values in a Dataframe we need to pass a dictionary of columns to be filled along with their values.
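A minimal sketch of the dictionary form, using a hypothetical Dataframe: columns that are not named in the dictionary keep their NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in all three columns
demo = pd.DataFrame({
    "name": ["Ann", None, "Cy"],
    "assists": [3.0, np.nan, np.nan],
    "points": [np.nan, 12.0, 8.0],
})

# Keys are column names, values are the fill values for those columns
filled = demo.fillna({"name": "Unknown", "assists": 0})

print(filled)
```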

In [None]:
# Fill the missing values of name with "Unknown" and assists with 0
stats.fillna({"name":"Unknown", "assists": 0})

# Returns a dataframe with missing names filled with "Unknown" and missing assists with 0

## Filling missing values with same record-different column values

To fill missing values of a Dataframe with values of some other column in the same record we can pass the column as argument to fillna() method.
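A minimal sketch of this row-wise fill on a hypothetical Dataframe: fillna() aligns the passed column on the index, so each missing value is taken from the same record. The zip codes are made up for illustration.

```python
import pandas as pd

# Hypothetical orders; two records have no shipping_zip
orders = pd.DataFrame({
    "billing_zip": ["10001", "60614", "94105"],
    "shipping_zip": ["10002", None, None],
})

# Missing shipping_zip values are taken from billing_zip in the same row;
# the result is assigned back to the column to keep it
orders["shipping_zip"] = orders["shipping_zip"].fillna(orders["billing_zip"])

print(orders)
```

Values that were already present (here "10002") are left untouched.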

In [None]:
# Read sales dataset
sales = pd.read_csv("./datasets/sales.csv")

# Print dataframe
sales

In [None]:
# Replace any missing shipping_zip value with corresponding billing_zip value and assign the result back
sales["shipping_zip"] = sales["shipping_zip"].fillna(sales["billing_zip"])

# Print dataframe
sales

# Records that had NaN as shipping_zip now hold the corresponding billing_zip values