<a href="https://colab.research.google.com/github/JonathanSosa-py/pandas_notebooks/blob/main/9_Cleaning_Data__Casting_Datatypes_and_Handling_Missing_Values.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
import pandas as pd
import numpy as np

In [9]:
people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'],
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'],
    'email': ['CoreyMSchafer@email.com', 'JaneDoe@email.com', 'JohnDoe@email.com', None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing']
}

In [10]:
df = pd.DataFrame(people)
df

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@email.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


# Drop missing values (Remove missing data)

In [11]:
# dropna()

df.dropna()

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@email.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
6,NA,Missing,NA,Missing


We get these 4 rows because they are the only ones without missing values. The bottom row is still there: it holds our custom missing values (the strings 'NA' and 'Missing'), which pandas does not recognize as missing. We'll see how to deal with those shortly.
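To see why the bottom row survives, it can help to look at what pandas actually counts as missing. The sketch below rebuilds the same DataFrame and uses `isna()`; the variable names `missing_mask` and `kept` are only illustrative.

```python
import numpy as np
import pandas as pd

people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'],
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'],
    'email': ['CoreyMSchafer@email.com', 'JaneDoe@email.com', 'JohnDoe@email.com',
              None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing'],
}
df = pd.DataFrame(people)

# True wherever pandas sees a missing value (None or np.nan).
missing_mask = df.isna()

# The strings 'NA' and 'Missing' are ordinary values, so row 6 has no True entries.
print(missing_mask.loc[6].any())  # False

# Rows with no missing values at all: the same rows that dropna() keeps.
kept = df.index[~missing_mask.any(axis=1)].tolist()
print(kept)  # [0, 1, 2, 6]
```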

In the background ***.dropna()*** is using some default arguments:

In [12]:
# Default arguments (what this is doing in the background)

df.dropna(axis='index', how='any')

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@email.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
6,NA,Missing,NA,Missing




*   ***axis:*** This can be set to either *index* or *columns*. With *index*, pandas **drops** ***rows*** that contain missing values. With *columns*, it drops columns that contain missing values instead.
*   ***how:*** The criterion pandas uses to decide whether to drop a row or a column. By default this is set to "**any**".
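The axis can be given either by name or by number: `'index'` is the same as `0`, and `'columns'` is the same as `1`. Here is a small check on the same data, assuming nothing beyond the DataFrame built earlier; the result variable names are just for illustration.

```python
import numpy as np
import pandas as pd

people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'],
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'],
    'email': ['CoreyMSchafer@email.com', 'JaneDoe@email.com', 'JohnDoe@email.com',
              None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing'],
}
df = pd.DataFrame(people)

# Dropping rows: axis='index' and axis=0 mean the same thing.
rows_by_name = df.dropna(axis='index', how='any')
rows_by_number = df.dropna(axis=0, how='any')
print(rows_by_name.equals(rows_by_number))  # True

# Dropping columns: axis='columns' and axis=1 mean the same thing.
cols_by_name = df.dropna(axis='columns', how='all')
cols_by_number = df.dropna(axis=1, how='all')
print(cols_by_name.equals(cols_by_number))  # True
```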

Since axis is set to index and how is set to "any", pandas looks across our rows and drops every row that has at least one missing value. That might not be what you want. For the kind of analysis we're doing, it may be fine for a row to be missing an email or a last name, as long as there is something in it and the row isn't entirely missing. If that's the case, we can change the how argument to ***all***:


```python
df.dropna(axis='index', how='all')
```
This will then only drop rows when all of the values in that row are missing.
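One way to compare the two settings side by side is to list which row labels each one removes. This is only a sketch; `dropped_any` and `dropped_all` are names made up for the example.

```python
import numpy as np
import pandas as pd

people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'],
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'],
    'email': ['CoreyMSchafer@email.com', 'JaneDoe@email.com', 'JohnDoe@email.com',
              None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing'],
}
df = pd.DataFrame(people)

# Labels present in the original index but absent after dropping.
dropped_any = df.index.difference(df.dropna(axis='index', how='any').index).tolist()
dropped_all = df.index.difference(df.dropna(axis='index', how='all').index).tolist()

print(dropped_any)  # [3, 4, 5] -> rows with at least one missing value
print(dropped_all)  # [4]       -> only the row that is missing everything
```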





In [13]:
# All of the values in a row have to be missing for that row to be dropped.
# Only the row with index 4 is dropped.

df.dropna(axis='index', how='all')

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@email.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


In [16]:
# Drop columns instead of rows by changing the axis to columns (axis=1).
# With how='all', only columns whose values are all missing are dropped.
# None of our columns is missing all the way down, so this should just return our original DataFrame.

df.dropna(axis=1, how='all')

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@email.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
3,Chris,Schafer,,36
4,,,,
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing


In [17]:
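# Every column has at least one missing value, so how='any' drops all of them.
# What comes back is an empty DataFrame with only the index left.

df.dropna(axis=1, how='any')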

0
1
2
3
4
5
6
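To check why every column disappeared here, we can count the missing values per column. This is a quick sketch using `isna().sum()`; `missing_per_column` is just an illustrative name.

```python
import numpy as np
import pandas as pd

people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'],
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'],
    'email': ['CoreyMSchafer@email.com', 'JaneDoe@email.com', 'JohnDoe@email.com',
              None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing'],
}
df = pd.DataFrame(people)

# Number of missing values in each column (each column has 2 here).
missing_per_column = df.isna().sum()
print(missing_per_column)

# No column is fully missing, so how='all' keeps all 4 columns...
print(df.dropna(axis=1, how='all').shape)  # (7, 4)

# ...but every column has some missing values, so how='any' keeps none.
print(df.dropna(axis=1, how='any').shape)  # (7, 0)
```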


At this point you might be wondering: "Okay, my data is a bit more complicated than this. I'm doing some analysis where I want to drop missing values, but I only want to drop rows that are missing values in a specific column." For example, let's say that it's fine if a row doesn't have a first name or a last name, but we really need the email address, and rows without an email address should be dropped.

So in order to do this we can pass in a **subset** argument:

In [23]:
# The last row only has custom missing values ('NA' is a string), so it is not dropped.
df.dropna(axis=0, how='any', subset=['email'])

Unnamed: 0,first,last,email,age
0,Corey,Schafer,CoreyMSchafer@email.com,33
1,Jane,Doe,JaneDoe@email.com,55
2,John,Doe,JohnDoe@email.com,63
5,,,Anonymous@email.com,
6,NA,Missing,NA,Missing
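The subset argument also accepts more than one column, and it combines with how. As a sketch, suppose (this scenario is an assumption for illustration) we only need a last name *or* an email address: passing both columns with `how='all'` drops a row only when both are missing.

```python
import numpy as np
import pandas as pd

people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'],
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'],
    'email': ['CoreyMSchafer@email.com', 'JaneDoe@email.com', 'JohnDoe@email.com',
              None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing'],
}
df = pd.DataFrame(people)

# Drop a row only if BOTH 'last' and 'email' are missing.
need_either = df.dropna(axis='index', how='all', subset=['last', 'email'])
print(need_either.index.tolist())  # [0, 1, 2, 3, 5, 6]

# Drop a row if EITHER 'last' or 'email' is missing.
need_both = df.dropna(axis='index', how='any', subset=['last', 'email'])
print(need_both.index.tolist())  # [0, 1, 2, 6]
```

Row 3 survives the first call because it has a last name, and row 5 survives because it has an email. Only row 4 is missing both.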
