# Intro to Pandas
by Ryan Orsinger

## Module 3: DataFrames Continued

### Pandas DataFrames Continued - Filling Missing Values
- Filling missing values
- Using `.fillna`
- Using `.loc` with DataFrames (similar to `.loc` on Series, but two-dimensional, with rows and columns)

### Handling Missing Values is a Case in Creative Problem Solving
- There's no single right answer for all cases. 
- "It depends" is a common answer in data science. Context matters.

- Sometimes a missing value means zero in context, so we can fill in zero.
- Sometimes analysts drop entire rows or columns that have too many missing values.
- Other times, missing values are filled with a reasonable estimate, like the mean, the median, the mode, or another likely value.
- Filling too many missing values can skew the original data.
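
The strategies in the list above map onto a few pandas calls. The sketch below uses a small made-up frame (the `toy` name and its values are hypothetical, not part of the lesson data) to show dropping sparse rows or columns with `dropna(thresh=...)` and filling a numeric column with its median.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": [10.0, None, 30.0, 40.0],
    "c": [None, None, 5.0, 6.0],
})

# Keep only the rows that have at least 2 non-missing values
rows_kept = toy.dropna(thresh=2)

# Keep only the columns that have at least 3 non-missing values
cols_kept = toy.dropna(axis="columns", thresh=3)

# Fill the missing values in one column with that column's median
a_filled = toy["a"].fillna(toy["a"].median())
```

Which threshold counts as "too many" missing values is a judgment call that depends on the dataset.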

In [1]:
import pandas as pd

In [2]:
# Let's generate some data with missing values. 
# Real world data often has missing values
df = pd.DataFrame([
    {
        "item": "crackers",
        "serving_size": "4 crackers",
        "calories": 10,
        "fat": "1.1g",
        "sodium": "125mg",
        "price": 2.99,
    },
    {
        "item": "club soda",
        "serving_size": "8 oz",
        "calories": None,
        "fat": None,
        "sodium": "75mg",
        "price": 2.25,

    },
    {
        "item": "apple",
        "serving_size": 2,
        "calories": 95,
        "fat": None,
        "sodium": None,
        "price": 1.99,
    },
    {
        "item": "banana",
        "serving_size": 3,
        "calories": 105,
        "fat": "0.4g",
        "sodium": "1mg",
        "price": None,
    },
    {
        "item": "spam",
        "serving_size": "1 tin",
        "calories": None,
        "fat": None,
        "sodium": None,
        "price": None,
    }
])

# Set the index to be the item name
df.set_index("item", inplace=True)
df

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,,,75mg,2.25
apple,2,95.0,,,1.99
banana,3,105.0,0.4g,1mg,
spam,1 tin,,,,


In [3]:
# Example of filling null values with a reasonable value
# Apples and club soda don't have fat, so these missing values can be 0
# Note: this also sets spam's missing fat to 0; the real value is filled in later
df.fat = df.fat.fillna(0)
df

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,
spam,1 tin,,0,,


### An Aside About Pandas Warnings
- Pandas warnings are not errors. A warning is a notice; it does not halt execution, so the code still runs.
- Depending on your version of pandas, the above code might produce the following warning.
```
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
```
- Since this may affect some users, we'll move on to working with `.loc`.
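
One way to avoid the ambiguity the warning describes is to be explicit about whether you are changing the original DataFrame or an independent copy. This is a minimal sketch, assuming a hypothetical `demo` frame: edits to an explicit `.copy()` leave the original untouched, while assigning through `.loc` on the original changes it.

```python
import pandas as pd

# Hypothetical frame, used only for illustration
demo = pd.DataFrame(
    {"fat": ["1.1g", None, None]},
    index=["crackers", "club soda", "apple"],
)

# An explicit copy is independent of the original
subset = demo.loc[["club soda", "apple"]].copy()
subset["fat"] = "0g"

# The original still has its two missing values
still_missing = int(demo.fat.isna().sum())

# Assigning through .loc on the original modifies the original
demo.loc[demo.fat.isna(), "fat"] = "0g"
```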


In [4]:
# Example of .loc's row indexing and column indexing
# .loc takes [row_indexer, column_indexer], for example
# [start_row:end_row, start_column:end_column]
# [:,] returns all rows and all columns
df.loc[:,]

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,
spam,1 tin,,0,,


In [5]:
# Notice how we're getting the range of rows from club soda through banana (inclusive)
# Equivalent to df.loc["club soda":"banana", :]
df.loc["club soda":"banana"]

item,serving_size,calories,fat,sodium,price
club soda,8 oz,,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,
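
Label-based slices with `.loc` include both endpoints, unlike ordinary Python slicing. The sketch below rebuilds only the index and price column of the lesson's DataFrame (the `foods` name is an assumption for illustration) and compares `.loc` with the position-based `.iloc`, which excludes the stop position.

```python
import pandas as pd

# Stand-in for the lesson's DataFrame: same index, one column
foods = pd.DataFrame(
    {"price": [2.99, 2.25, 1.99, None, None]},
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

# Label slice: both "club soda" and "banana" are included
by_label = foods.loc["club soda":"banana"]

# Position slice: position 1 is included, position 3 is not
by_position = foods.iloc[1:3]
```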


In [6]:
# .loc also accepts a boolean mask: keep the rows where the index equals "apple"
df.loc[df.index == "apple"]

item,serving_size,calories,fat,sodium,price
apple,2,95.0,0,,1.99


In [7]:
# A boolean mask built from a column's values works the same way
df.loc[df.serving_size == 3]

item,serving_size,calories,fat,sodium,price
banana,3,105.0,0.4g,1mg,


In [8]:
# Combine a boolean row mask with a column slice (serving_size through fat, inclusive)
df.loc[df.index == "apple", "serving_size":"fat"]

item,serving_size,calories,fat
apple,2,95.0,0


In [9]:
# All rows, show only calories as the column
df.loc[:, "calories"]

item
crackers      10.0
club soda      NaN
apple         95.0
banana       105.0
spam           NaN
Name: calories, dtype: float64

In [10]:
# Notice how : for rows returns all rows
# Show all the columns from calories through price (inclusive)
df.loc[:, "calories":"price"]

item,calories,fat,sodium,price
crackers,10.0,1.1g,125mg,2.99
club soda,,0,75mg,2.25
apple,95.0,0,,1.99
banana,105.0,0.4g,1mg,
spam,,0,,


In [13]:
# Some pandas operations may raise a SettingWithCopyWarning
# Recommend reading the documentation carefully
# Pandas developers designed this warning because effects can be difficult to predict
# The operation still runs when the warning appears, even though it can feel disruptive.
# Assigning through .loc[row_indexer, column_indexer], as below, is what the warning suggests.
df.loc[df.calories.isna(), "calories"] = 0
df

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,0.0,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,
spam,1 tin,0.0,0,,


In [14]:
# An average price might be reasonable here, since we don't have other information
df.loc[df.price.isna(), "price"] = df.price.mean()
df

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,0.0,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,2.41
spam,1 tin,0.0,0,,2.41
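
The 2.41 in the output comes from the three known prices, because `.mean()` skips missing values by default. The sketch below repeats the calculation on a standalone Series (the `prices` name is an assumption for illustration) and shows that `.fillna` gives the same result as the `.loc` assignment.

```python
import pandas as pd

# The price column from the lesson, before filling
prices = pd.Series(
    [2.99, 2.25, 1.99, None, None],
    index=["crackers", "club soda", "apple", "banana", "spam"],
    name="price",
)

# NaN values are skipped, so this is (2.99 + 2.25 + 1.99) / 3
mean_price = prices.mean()

# Replace every missing price with that mean
filled = prices.fillna(mean_price)
```

Filling with the mean leaves the column's mean unchanged, which is one reason it is a common default.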


In [15]:
# Actual Spam information
spam_calories = 1080
spam_fat = "96g"
spam_sodium = "4740mg" 
spam_price = 3.25

# Where the index is equal to 'spam', insert these values into the columns from calories through price
df.loc[df.index == "spam", "calories":"price"] = [spam_calories, spam_fat, spam_sodium, spam_price]
df

item,serving_size,calories,fat,sodium,price
crackers,4 crackers,10.0,1.1g,125mg,2.99
club soda,8 oz,0.0,0,75mg,2.25
apple,2,95.0,0,,1.99
banana,3,105.0,0.4g,1mg,2.41
spam,1 tin,1080.0,96g,4740mg,3.25
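
When the row is known by its label, `.loc` can also take the label directly instead of a boolean mask, plus a list of column names instead of a slice. This is a minimal sketch on a hypothetical two-row frame named `nutrition`, reusing the spam calorie and price values from the cell above.

```python
import pandas as pd

# Hypothetical two-row frame with only the numeric columns
nutrition = pd.DataFrame(
    {"calories": [95.0, None], "price": [1.99, None]},
    index=["apple", "spam"],
)

# One row label, a list of column labels, and one value per column
nutrition.loc["spam", ["calories", "price"]] = [1080, 3.25]
```

The list of values must have the same length as the list of columns being assigned.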


In [16]:
# Let's say we received some new information about discounts
# The business manager says that we'll use discounts in the future and the existing values should be 0.
# We'll create the column and assign zero to every row
df["discount"] = 0

In [17]:
df

item,serving_size,calories,fat,sodium,price,discount
crackers,4 crackers,10.0,1.1g,125mg,2.99,0
club soda,8 oz,0.0,0,75mg,2.25,0
apple,2,95.0,0,,1.99,0
banana,3,105.0,0.4g,1mg,2.41,0
spam,1 tin,1080.0,96g,4740mg,3.25,0


## Additional Resources
- Using [.fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)
- [Returning-a-view-versus-a-copy in the pandas docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy)
- [pandas .loc documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
- [pandas .iloc documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)

## Exercises
- Run the cells above to remove or fill most of the missing values from the `df` variable.
- Fill the missing sodium value with a logical choice.
- Use `pd.read_csv` to read `"penguins.csv"` into a dataframe variable named `penguins`
- Fill the missing values of the `bill_length_mm` with its average
- Fill in the missing values for `bill_depth_mm` with its average
- Fill in the missing values for `body_mass_g` with its average
- Run `.value_counts` on the `sex` column
- Fill the missing values in the `sex` column with the `mode` (Follow .mode() with [0] to access the string value)
- Run `.value_counts` on the `sex` column again, after filling the missing values

In [18]:
df

item,serving_size,calories,fat,sodium,price,discount
crackers,4 crackers,10.0,1.1g,125mg,2.99,0
club soda,8 oz,0.0,0,75mg,2.25,0
apple,2,95.0,0,,1.99,0
banana,3,105.0,0.4g,1mg,2.41,0
spam,1 tin,1080.0,96g,4740mg,3.25,0


In [22]:
# Fill the missing sodium value with a logical choice.
# df.loc[row_indexer, column_indexer] = value

# The apple is the only item still missing sodium. Used the banana's value ("1mg"),
# since another fruit seemed the most sensible stand-in
# without changing the data type for calculation or googling it
df.loc[df.sodium.isna(), "sodium"] = "1mg"
df

item,serving_size,calories,fat,sodium,price,discount
crackers,4 crackers,10.0,1.1g,125mg,2.99,0
club soda,8 oz,0.0,0,75mg,2.25,0
apple,2,95.0,0,1mg,1.99,0
banana,3,105.0,0.4g,1mg,2.41,0
spam,1 tin,1080.0,96g,4740mg,3.25,0
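
The sodium values are strings such as `"125mg"`, so a mean or median cannot be computed on them directly. If a calculated fill were preferred, one option is to strip the unit and convert to numbers first. The sketch below works on a standalone copy of the sodium values (the `sodium` and `sodium_mg` names are assumptions for illustration) and fills the gap with the median.

```python
import pandas as pd

# Sodium values from the lesson, before the apple was filled
sodium = pd.Series(
    ["125mg", "75mg", None, "1mg", "4740mg"],
    index=["crackers", "club soda", "apple", "banana", "spam"],
)

# Remove the "mg" suffix and convert to numbers; missing stays missing
sodium_mg = pd.to_numeric(sodium.str.replace("mg", "", regex=False))

# Fill the remaining gap with the median of the known values
sodium_mg = sodium_mg.fillna(sodium_mg.median())
```

Whether the median of such different foods is a sensible value for an apple is still a judgment call; the conversion only makes the calculation possible.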


In [24]:
# Use `pd.read_csv` to read `"penguins.csv"` into a dataframe variable named `penguins`
penguins = pd.read_csv("../datasets/penguins.csv")
penguins.head(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,female,2007
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,male,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007


In [28]:
# Fill the missing values of the `bill_length_mm` with its average
penguins.loc[penguins.bill_length_mm.isna(), "bill_length_mm"] = penguins.bill_length_mm.mean()
penguins.head(10)

# I noticed here that the display added a bunch of zeros after every number in
# the column that I just filled with the average...

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,43.92193,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,female,2007
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,male,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007
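
The extra zeros mentioned in the cell comment most likely come from how pandas displays floats: a column is shown with one shared number of decimal places, so a single long mean value pads every other value. The stored numbers do not change. A minimal sketch, using a hypothetical three-value Series, rounds the mean before filling so the display stays compact.

```python
import pandas as pd

# Hypothetical stand-in for a measurement column with one gap
lengths = pd.Series([39.1, None, 40.3])

# Round the mean to the precision of the measurements
rounded_mean = round(lengths.mean(), 1)

# Fill the gap with the rounded mean
filled = lengths.fillna(rounded_mean)
```

Rounding trades a little precision for readability, so it is optional.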


In [29]:
# Fill in the missing values for `bill_depth_mm` with its average
penguins.loc[penguins.bill_depth_mm.isna(), "bill_depth_mm"] = penguins.bill_depth_mm.mean()
penguins.head(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,43.92193,17.15117,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,female,2007
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,male,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007


In [30]:
# Fill in the missing values for `body_mass_g` with its average
penguins.loc[penguins.body_mass_g.isna(), "body_mass_g"] = penguins.body_mass_g.mean()
penguins.head(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,43.92193,17.15117,,4201.754386,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,female,2007
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,male,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007
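
The three mean fills above follow the same pattern, so they can also be done in one step. This sketch assumes a small hypothetical frame named `birds` with the same column names as the penguins data. It relies on `DataFrame.mean()` returning one mean per column and on `fillna` matching those means to columns by name.

```python
import pandas as pd

# Hypothetical stand-in for a few penguins columns
birds = pd.DataFrame({
    "bill_length_mm": [39.0, None, 41.0],
    "bill_depth_mm": [18.0, 20.0, None],
    "body_mass_g": [3000.0, None, 4000.0],
})

numeric_cols = ["bill_length_mm", "bill_depth_mm", "body_mass_g"]

# Each column's gaps are filled with that column's own mean
birds[numeric_cols] = birds[numeric_cols].fillna(birds[numeric_cols].mean())
```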


In [31]:
# Run `.value_counts` on the `sex` column
penguins.sex.value_counts()

sex
male      168
female    165
Name: count, dtype: int64

In [32]:
# Fill the missing values in the `sex` column with the `mode` (Follow .mode() with [0] to access the string value)
penguins.loc[penguins.sex.isna(), "sex"] = penguins.sex.mode()[0]
penguins.head(10)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,43.92193,17.15117,,4201.754386,male,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,female,2007
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,male,2007
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,male,2007
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,male,2007


In [33]:
# Run `.value_counts` on the `sex` column again, after filling the missing values
penguins.sex.value_counts()

sex
male      179
female    165
Name: count, dtype: int64
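
By default `.value_counts` leaves missing values out, which is why the first count above does not show them. Passing `dropna=False` includes them. The sketch below uses a hypothetical five-value Series named `sex` to count the gaps, fill them with the mode, and count again.

```python
import pandas as pd

# Hypothetical stand-in for the penguins "sex" column
sex = pd.Series(["male", "female", "male", None, None])

# dropna=False adds a row for the missing values
before = sex.value_counts(dropna=False)

# .mode() returns a Series, so [0] picks out the most common value
most_common = sex.mode()[0]

filled = sex.fillna(most_common)
after = filled.value_counts()
```

Filling with the mode increases the count of the most common category, as the jump from 168 to 179 for `male` shows in the output above.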