# Intro to Pandas
by Ryan Orsinger

## Module 3: DataFrames Continued

### Pandas DataFrames Continued - Identifying Missing Values
- Identifying and counting missing values
- Removing rows with missing information
- Dropping columns from a DataFrame

In [1]:
import pandas as pd

In [None]:
# Let's generate some data with missing values.
# Real-world data often has missing values.
df = pd.DataFrame([
    {
        "item": "crackers",
        "serving_size": "4 crackers",
        "calories": 10,
        "fat": "1.1g",
        "sodium": "125mg",
        "price": 2.99,
        "discount": None
    },
    {
        "item": "club soda",
        "serving_size": "8 oz",
        "calories": None,
        "fat": None,
        "sodium": "75mg",
        "price": 2.25,
        "discount": None
    },
    {
        "item": "apple",
        "serving_size": 2,
        "calories": 95,
        "fat": None,
        "sodium": None,
        "price": 1.99,
        "discount": None
    },
    {
        "item": "banana",
        "serving_size": 3,
        "calories": 105,
        "fat": "0.4g",
        "sodium": "1mg",
        "price": None,
        "discount": None
    },
    {
        "item": "spam",
        "serving_size": "1 tin",
        "calories": None,
        "fat": None,
        "sodium": None,
        "price": None,
        "discount": None
    }
])

# Set the index to be the item name
df.set_index("item", inplace=True)
df

In [None]:
# The .info method outputs data types and non-null value count
df.info()

In [None]:
# Notice that missing values in a numeric column show as NaN, which means "not a number"
# For more on NaN, see https://en.wikipedia.org/wiki/NaN
df.calories
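
One practical consequence of NaN is that it never compares equal to anything, including itself, which is why `.isna()` is used to find missing values rather than `==`. A minimal sketch, assuming a small stand-in series (the name `calories` below is illustrative and separate from `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a numeric column with one missing value
calories = pd.Series([10, np.nan, 95])

# NaN is not equal to itself, so an equality check never finds it
print(np.nan == np.nan)  # False
equal_to_nan = calories == np.nan  # every entry is False

# .isna() is the reliable way to flag missing values
found_by_isna = calories.isna()
print(found_by_isna.tolist())  # [False, True, False]
```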

In [None]:
# NaN lets us do math on a column with missing values without raising errors
# Many pandas aggregations, including .mean, skip NaNs by default
df.calories.mean()
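
To see what skipping NaNs means in practice, the sketch below compares the default behavior with `skipna=False`. It assumes a standalone series that mirrors the calories values above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical standalone series mirroring the calories column above
calories = pd.Series([10, np.nan, 95, 105, np.nan])

# Default: NaNs are skipped, so the mean is taken over the 3 known values
mean_skipping_nans = calories.mean()
print(mean_skipping_nans)  # 70.0

# skipna=False propagates the NaN instead of ignoring it
mean_with_nans = calories.mean(skipna=False)
print(mean_with_nans)  # nan
```

Note that the default mean divides by the number of known values (3), not by the length of the series (5).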

In [None]:
# By default, .value_counts ignores NaNs, too
df.sodium.value_counts()

In [None]:
# Use dropna=False to count missing values
df.sodium.value_counts(dropna=False)

In [None]:
# Notice that missing values in a string/object column show as None
df.fat

In [None]:
# .isna() can operate on a column, returning a boolean series
df.sodium.isna()

In [None]:
# .isna() can also operate on the entire dataframe
df.isna()
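
The boolean series returned by `.isna()` can also serve as a mask to select the rows that are missing a value, and `.notna()` gives the inverse. A small sketch, assuming a stand-in `sodium` series rather than the `df` above:

```python
import pandas as pd

# Hypothetical stand-in for the sodium column, indexed by item name
sodium = pd.Series(
    ["125mg", "75mg", None, "1mg", None],
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

# Keep only the entries where sodium is missing
missing_sodium = sodium[sodium.isna()]
print(list(missing_sodium.index))  # ['apple', 'spam']

# .notna() is the inverse of .isna()
known_sodium = sodium[sodium.notna()]
print(list(known_sodium.index))  # ['crackers', 'club soda', 'banana']
```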

In [None]:
# Counting the number of nulls by column
print("Number of nulls by column")
df.isna().sum()

In [None]:
print("Proportion of nulls by column")
df.isna().mean()
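
The proportion works because `True` counts as 1 and `False` as 0, so the mean of a boolean column is the fraction of `True` values. The sketch below puts the count and the proportion side by side, assuming a small illustrative frame (`df_small` and the column names of `summary` are made up for this example):

```python
import pandas as pd

# Hypothetical frame with 4 rows and a different number of nulls per column
df_small = pd.DataFrame({
    "calories": [10, None, 95, 105],
    "fat": ["1.1g", None, None, "0.4g"],
    "price": [2.99, 2.25, 1.99, 3.50],
})

is_missing = df_small.isna()

# sum counts the True values; mean divides that count by the number of rows
summary = pd.DataFrame({
    "n_missing": is_missing.sum(),
    "proportion_missing": is_missing.mean(),
})
print(summary)
```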

In [None]:
# Counting the number of nulls by row
# Recall that .sum works down each column by default; axis=1 sums across each row
print("Number of nulls by row")
df.isna().sum(axis=1)

In [None]:
# Proportion of nulls by row
# Like .sum, .mean works down each column by default; axis=1 averages across each row
print("Proportion of nulls by row")
df.isna().mean(axis=1)

### Handling Missing Values
- There's no one right answer for all cases. 
- "It depends" is a common answer in data science. Context matters.
- Sometimes missing values might mean zero, depending on the context, so we can fill in zero.
- Sometimes, dropping entire rows or columns is appropriate.
- Sometimes, filling missing values makes sense so that the rest of the row's or column's data can be kept.
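
As one illustration of filling, the sketch below uses `.fillna` with a dictionary so that only one column is filled. It assumes, for the sake of the example, that a missing calorie count means zero while a missing price does not; `df_fill` is a hypothetical frame, and whether zero is the right fill value always depends on context.

```python
import pandas as pd

# Hypothetical data: club soda has no calorie count, apple has no price
df_fill = pd.DataFrame(
    {"calories": [10, None, 95], "price": [2.99, 2.25, None]},
    index=["crackers", "club soda", "apple"],
)

# Fill only the calories column with 0 (an assumption for this example);
# price is left alone, since a missing price is not a price of zero
filled = df_fill.fillna({"calories": 0})
print(filled)
```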

In [None]:
# Example of removing null values
# By default, dropna drops every row that has at least one null value
# (the default is axis=0, which operates on rows)
# Since there is missing data in every row, this is quite destructive...
df.dropna()

In [None]:
# dropna(axis=1) drops all columns with any missing values
# This is also too destructive to be helpful
df.dropna(axis=1)
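
When the default is too destructive, `dropna` accepts the `how`, `subset`, and `thresh` parameters to narrow what gets dropped. A hedged sketch on a small hypothetical frame (`df_drop` is illustrative, not the `df` above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: spam is entirely empty, discount is entirely empty
df_drop = pd.DataFrame(
    {
        "calories": [10, np.nan, np.nan],
        "price": [2.99, 2.25, np.nan],
        "discount": [np.nan, np.nan, np.nan],
    },
    index=["crackers", "club soda", "spam"],
)

# how="all" drops only rows where every value is missing (spam)
no_empty_rows = df_drop.dropna(how="all")

# With axis=1, how="all" drops only columns that are entirely missing (discount)
no_empty_cols = df_drop.dropna(axis=1, how="all")

# subset= looks only at the listed columns when deciding which rows to drop
has_price = df_drop.dropna(subset=["price"])

# thresh= keeps rows with at least that many non-missing values
at_least_two = df_drop.dropna(thresh=2)

print(list(no_empty_rows.index))    # ['crackers', 'club soda']
print(list(no_empty_cols.columns))  # ['calories', 'price']
print(list(has_price.index))        # ['crackers', 'club soda']
print(list(at_least_two.index))     # ['crackers']
```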

In [None]:
# Let's review the dataframe
df

In [None]:
# The discount column is adding no information here, so we can drop it
df.drop(columns="discount", inplace=True)
df

In [None]:
# Remove the spam row by keeping every other row and reassigning df
# df.drop(index=["spam"], inplace=True) would produce the same result
df = df[df.index != "spam"]
df
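
The two ways of removing a row mentioned above can be compared directly. This is a minimal sketch on a hypothetical frame (`df_rows`), using the non-inplace form of `.drop` so both results can be kept:

```python
import pandas as pd

# Hypothetical frame indexed by item name
df_rows = pd.DataFrame(
    {"calories": [10, 95, None]},
    index=pd.Index(["crackers", "apple", "spam"], name="item"),
)

# Option 1: keep every row whose index label is not "spam"
by_mask = df_rows[df_rows.index != "spam"]

# Option 2: drop the row by its index label (returns a new dataframe)
by_drop = df_rows.drop(index=["spam"])

print(by_mask.equals(by_drop))  # True
```

One difference worth knowing: `.drop` raises a `KeyError` if the label is not in the index, while the boolean mask simply keeps every row.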

## Additional Resources
- [.isnull](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html) is an alias for `isna`.
- The [.value_counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) documentation
- [Pandas .isna documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html)

## Exercises
- Use `pd.read_csv` to read `"penguins.csv"` into a dataframe variable named `penguins`
- Write the pandas code to count the number of missing values by column
- Write the pandas code necessary to get the proportion of missing values by row. Store this in a variable named `percent_missing_by_row`
- Sort the `percent_missing_by_row` series in descending order. How many of the rows are mostly empty?

In [2]:
# Use `pd.read_csv` to read `"penguins.csv"` into a dataframe variable named `penguins`
penguins = pd.read_csv("../datasets/penguins.csv")

In [4]:
# Use .isna to count the number of missing values by column
penguins.isna().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

In [8]:
# Write the pandas code necessary to get the proportion of missing values by row.
# Store this in a variable named `percent_missing_by_row`
percent_missing_by_row = penguins.isna().mean(axis=1)

In [9]:
# Sort the `percent_missing_by_row` series in descending order
percent_missing_by_row.sort_values(ascending=False)

271    0.625
3      0.625
8      0.125
268    0.125
218    0.125
       ...  
117    0.000
116    0.000
115    0.000
114    0.000
343    0.000
Length: 344, dtype: float64

In [10]:
# How many of the rows are mostly empty?
# Rows with a proportion above 0.5 are mostly empty
percent_missing_by_row.value_counts(dropna=False)

0.000    333
0.125      9
0.625      2
Name: count, dtype: int64
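
The counts above show 2 rows at 0.625, so 2 rows are mostly empty. As an alternative to reading the answer off `value_counts`, a comparison followed by `.sum()` gives the number directly. The sketch below assumes a short hypothetical series (`proportions`) standing in for `percent_missing_by_row`, with 0.5 as the cutoff for "mostly empty":

```python
import pandas as pd

# Hypothetical stand-in for percent_missing_by_row
proportions = pd.Series([0.625, 0.0, 0.125, 0.625, 0.0])

# The comparison yields a boolean series; summing it counts the True values
mostly_empty_count = (proportions > 0.5).sum()
print(mostly_empty_count)  # 2
```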