<a href="https://colab.research.google.com/github/MonkeyWrenchGang/PythonBootcamp/blob/main/day_3/3_8_Dealing_with_Nulls.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pandas and Null Values

- Pandas provides built-in tools for detecting, removing, and filling missing (null) values in datasets.
- Null values can occur for various reasons, such as incomplete data, data corruption, or data extraction issues.
- Pandas most often represents null values as `NaN` (Not a Number) or `None`; datetime columns use `NaT`.

### Common Operations to Handle Nulls

1. Checking for Null Values:
   - `df.isna()` or `df.isnull()`: Returns a DataFrame of the same shape as the input DataFrame, with `True` where a value is null and `False` otherwise.
   - `df.notna()` or `df.notnull()`: Returns a DataFrame of the same shape as the input DataFrame, with `True` where a value is not null and `False` otherwise.

   NOTE: `isnull()` is an alias of `isna()` (and `notnull()` of `notna()`), so the two behave identically.

2. Counting Null Values:
   - `df.isnull().sum()`: Returns the count of null values in each column of the DataFrame.

3. Dropping Null Values:
   - `df.dropna()`: Removes rows containing any null value from the DataFrame.
   - `df.dropna(axis=1)`: Removes columns containing any null value from the DataFrame.

4. Filling Null Values:
   - `df.fillna(value)`: Fills null values with a specified value.
   - `df.ffill()` / `df.bfill()`: Fill null values by propagating the previous valid value (forward fill) or the next valid value (backward fill).
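
As a quick preview, the four operations above can be sketched on a small made-up DataFrame. The column names and values here are illustrative assumptions and are not taken from the churn dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not part of the churn dataset
df = pd.DataFrame({
    "plan": ["basic", None, "premium", "basic"],
    "minutes": [120.0, np.nan, 300.0, np.nan],
})

# 1. Check: boolean mask, True where a value is missing
mask = df.isna()

# 2. Count: number of missing values per column
null_counts = df.isna().sum()

# 3. Drop: keep only rows with no missing values
complete_rows = df.dropna()

# 4. Fill: a constant per column, or the previous / next valid value
filled_constant = df.fillna({"plan": "unknown", "minutes": 0})
filled_forward = df.ffill()
filled_backward = df.bfill()

print(null_counts)
print(complete_rows)
```

Note that backward fill leaves the last `minutes` value missing, because there is no later valid value to copy from.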

Let's dive into these functions!


In [2]:
# import libraries
import warnings
warnings.filterwarnings("ignore", category=FutureWarning) # Suppress the FutureWarning

import pandas as pd
import requests
import pickle
import pprint
import io

In [15]:
churn = pd.read_csv("https://raw.githubusercontent.com/MonkeyWrenchGang/PythonBootcamp/main/day_3/data/churn_90k.csv")
churn.head()

| Unnamed: 0 | monthly_minutes | customerServiceCalls | streaming_minutes | TotalBilled | PrevBalance | latePayments | ip_address_asn | phone_area_code | customer_reg_date | email_domain | ... | maling_code | paperlessBilling | paymentMethod | EVENT_TIMESTAMP | customerId | billing_address | gender | networkSpeed | senior_citizen | churn_ind |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 16559.0 | 3.0 | 26131.0 | 144.0 | 48.0 | 5.0 | 40603.0 | 268.0 | 2022-07-09 | gmail.com | ... | A | No | Bank Transfer | 2022-09-27T11:01:40Z | 0-7643-1091-7 | 15458 Palmer Port | Male | 4Glte | 1.0 | 0 |
| 1 | 21097.0 | 5.0 | 14182.0 | 236.0 | 80.0 | 7.0 | 15625.0 | 256.0 | 2022-06-24 | gmail.com | ... | M | Yes | Bank Transfer | 2022-11-26T05:09:21Z | 1-58537-872-0 | 8524 Alyssa Lodge | Male | 5G | 0.0 | 0 |
| 2 | 33705.0 | 4.0 | 23645.0 | 292.0 | 87.0 | 6.0 | 43940.0 | 231.0 | 2022-06-16 | gmail.com | ... | F | No | Mailed Check | 2023-03-28T20:37:15Z | 1-07-393412-8 | 3931 Melissa Mountains | Female | 5G | 1.0 | 0 |
| 3 | 20093.0 | 4.0 | 16552.0 | 237.0 | 61.0 | 6.0 | 8103.0 | 249.0 | 2023-01-29 | gmail.com | ... | F | Yes | Mailed Check | 2022-06-16T15:18:46Z | 1-4831-5317-7 | 89644 Ford Village | Male | 4Glte | 1.0 | 0 |
| 4 | 27930.0 | 4.0 | 19265.0 | 289.0 | 49.0 | 4.0 | 46280.0 | 234.0 | 2023-03-16 | gmail.com | ... | A | No | Bank Transfer | 2023-05-04T00:07:20Z | 0-413-17163-9 | 525 Sean Mission | Female | 4Glte | 0.0 | 1 |


# Import Data


---

Here we have a typical real-world churn dataset.

```
"https://raw.githubusercontent.com/MonkeyWrenchGang/PythonBootcamp/main/day_3/data/churn_90k.csv"
```
1. Import the dataset with `read_csv()`.
2. Eyeball the first 5 records with `head()`.
3. Use `.info()` to describe the state of the dataset.

## Summary of `.info()` in Pandas

The `.info()` method in pandas prints a concise summary of a DataFrame (the method itself returns `None`). The summary includes:

- Total number of rows (observations) in the DataFrame.
- Total number of columns in the DataFrame.
- Name and data type of each column.
- Number of non-null values in each column.
- Memory usage of the DataFrame.

The `.info()` method is useful for initial data exploration and quality assessment. It quickly reveals missing values and helps identify columns that may require further attention. The memory usage information allows users to optimize memory usage for large datasets.

Usage: `df.info()`
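
A minimal sketch of `.info()` on a small made-up DataFrame (the data is hypothetical). Because `.info()` prints its report rather than returning it, the example also uses the `buf` parameter to write the report into a text buffer so it can be kept as a string.

```python
import io
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["John", "Emma", None],
    "Age": [25, None, 32],
})

# Prints the summary to the screen; the return value is None
result = df.info()

# Capture the same report as a string via the buf parameter
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()

# Each column line in the report carries a non-null count
print("non-null" in report)
```

Comparing each column's non-null count with the total number of entries is the quickest way to spot which columns contain missing values.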






## Summary of `isna()` in Pandas

The `isna()` method in pandas is used to identify missing or null values in a DataFrame. It returns a DataFrame with boolean values indicating the presence of null values.

Key points about `isna()`:

- It is a powerful method to detect missing values in a DataFrame or Series.
- Returns `True` for each element in the DataFrame or Series that is a missing or null value, and `False` otherwise.
- Missing or null values can be represented as `NaN` or `None` in pandas.
- `isna()` is a useful tool for initial data exploration and data cleaning processes.

Usage:

- `df.isna()`: Returns a DataFrame of the same shape as the input DataFrame, with `True` where null values are present and `False` where non-null values are present.


Example:

```python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['John', 'Emma', None],
    'Age': [25, None, 32],
    'City': ['New York', 'London', 'Paris']
}

df = pd.DataFrame(data)

# Check for null values
null_values = df.isna()
print(null_values)

```

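1. Use `churn.head(100).isna()`.
  - Kind of useless, right? A grid of `True`/`False` values is hard to scan by eye, which is why the next section counts them instead.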


## Summary of `isna()`, `.sum()`, and `sum(axis=1)` in Pandas

Chaining `isna()` with `.sum()` or `.sum(axis=1)` is a compact way to count the missing values in a DataFrame, either by column or by row.

- `isna()` identifies missing or null values. It returns a DataFrame or Series of the same shape as the original object, with boolean values indicating the presence of null values.

- `.sum()` applied to the result of `isna()` counts the missing values, because `True` counts as 1 and `False` as 0. On a DataFrame it returns a Series with the number of nulls in each column; on a Series it returns a single number.

- `sum(axis=1)` sums across each row of a DataFrame instead, which gives the number of missing values per row.

Example:

```python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['John', 'Emma', None],
    'Age': [25, None, 32],
    'City': ['New York', 'London', 'Paris']
}

df = pd.DataFrame(data)

# Check for null values in each column
null_values_column = df.isna().sum()
print(null_values_column)

# Check for total null values in each row
null_values_row = df.isna().sum(axis=1)
print(null_values_row)
```

1. Now use `churn.isna().sum()`. Do you get a nice per-column summary?
2. Next, use `churn.isna().sum(axis=1)`. This gives you a per-row count of nulls.
3. Create a new column named `dq` to hold that per-row count:
  ```
  churn["dq"] = churn.isna().sum(axis=1)
  ```
  - Next, filter for rows where `dq >= 1`.
4. `churn["dq"].sum()` gives you the total number of nulls in the dataset.
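
The exercise steps above can be sketched on a small stand-in DataFrame. The two columns borrow names from the churn dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the churn data
df = pd.DataFrame({
    "monthly_minutes": [100.0, np.nan, 250.0, np.nan],
    "gender": ["Male", None, "Female", "Male"],
})

# Per-row count of missing values, stored as a new column
df["dq"] = df.isna().sum(axis=1)

# Rows with at least one missing value
rows_with_nulls = df[df["dq"] >= 1]

# Total number of missing values in the whole DataFrame
total_nulls = df["dq"].sum()

print(rows_with_nulls)
print(total_nulls)
```

The `dq` column itself never contains nulls, so adding it does not change the counts if the cell is run again.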


## Dropping Null Values
   - `df.dropna()`: Removes rows containing any null value from the DataFrame.
   - `df.dropna(axis=1)`: Removes columns containing any null value from the DataFrame.

  1. Create a clean churn dataset by removing rows with null values:
  ```
  churn_clean = churn.dropna()

  ```
  2. How many rows did you drop? Compare the two DataFrames with `.shape`.
  3. Can you print the difference `churn.shape[0] - churn_clean.shape[0]` using `print()` and `.format()`?
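
A minimal sketch of the same steps on a small made-up DataFrame (the column names echo the churn dataset; the values are hypothetical). It shows both row-wise and column-wise dropping, then reports the number of dropped rows with `.format()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the churn data
df = pd.DataFrame({
    "monthly_minutes": [100.0, np.nan, 250.0, 80.0],
    "gender": ["Male", "Female", None, "Male"],
    "churn_ind": [0, 1, 0, 1],
})

# Drop rows that contain at least one missing value
df_rows = df.dropna()

# Drop columns that contain at least one missing value
df_cols = df.dropna(axis=1)

# shape is (rows, columns), so shape[0] is the row count
dropped = df.shape[0] - df_rows.shape[0]
print("Dropped {} of {} rows".format(dropped, df.shape[0]))
```

Here `dropna(axis=1)` keeps only `churn_ind`, the one column with no missing values, which shows how much data column-wise dropping can discard.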

# Filling Null Values


---
## Filling Null Values with Mean or Median

Instead of dropping null values from a DataFrame, an alternative approach is to fill them in with the mean or median of the column. This allows for the retention of valuable data and avoids losing information.

The mean represents the average value of a numerical column, while the median represents the middle value when the data is sorted. Filling null values with one of these measures keeps the center of each column roughly where it was, although it reduces the column's variance because every filled value is identical.

Pandas provides convenient methods to achieve this:

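1. Filling with Mean:
   - `df.fillna(df.mean(numeric_only=True))`: Fills null values in numeric columns with the mean of each column.

2. Filling with Median:
   - `df.fillna(df.median(numeric_only=True))`: Fills null values in numeric columns with the median of each column.

Passing `numeric_only=True` skips non-numeric columns; without it, pandas 2.0 and later raise a `TypeError` when the DataFrame contains text columns.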

It is important to note that filling null values with central tendency measures assumes that the mean or median is **representative of the missing values**. When that assumption fails, the fill can introduce bias or distortion, especially if the data has outliers or a skewed distribution.

Considerations:
- It is advisable to apply mean or median filling on numerical columns rather than categorical columns.
- Be cautious when using mean or median filling in the presence of outliers or data with significant deviations from normality.
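
The difference between the two fills is easiest to see with an outlier present. The sketch below uses a small made-up DataFrame (column names borrowed from the churn dataset, values hypothetical) and fills only the numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 900.0 is a deliberate outlier
df = pd.DataFrame({
    "TotalBilled": [100.0, 120.0, np.nan, 110.0, 900.0],
    "paymentMethod": ["Bank Transfer", None, "Mailed Check",
                      "Bank Transfer", "Mailed Check"],
})

# numeric_only=True skips the text column
means = df.mean(numeric_only=True)
medians = df.median(numeric_only=True)

# fillna with a Series fills each column named in the Series index
filled_mean = df.fillna(means)
filled_median = df.fillna(medians)

print(filled_mean.loc[2, "TotalBilled"])    # 307.5
print(filled_median.loc[2, "TotalBilled"])  # 115.0
```

The mean is pulled toward the outlier while the median is not, and the missing `paymentMethod` value is left untouched in both results because only numeric columns are filled.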

1. Create a new churn dataset, `churn_clean2`, using one of the methods above. Use `describe()` to compare the mean and median (the 50th percentile) before and after filling. Do you see a difference?







## Using `.query()` to Find Null Values in Pandas

In pandas, the `.query()` method is primarily used to filter a DataFrame based on specific conditions. While it cannot directly identify null values, you can combine it with the `.isnull()` method to filter for null values effectively.

Here's an example of using `.query()` to find null values in a DataFrame:

```python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['John', 'Emma', None],
    'Age': [25, None, 32],
    'City': ['New York', 'London', 'Paris']
}

df = pd.DataFrame(data)

# Query to filter for null values
null_values = df.query('Name.isnull() or Age.isnull()')

# Print the filtered DataFrame
print(null_values)
```

1. Question: using `.query()` and `isnull()`, find all of the records where `customerServiceCalls` is null.
  - How many records is that?
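
One way to approach a question like this, sketched on a small made-up DataFrame (the `customerId` values are hypothetical). Passing `engine="python"` is an assumption made here for safety: it asks pandas to evaluate the expression itself, which avoids problems some setups have with method calls inside a query string.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the churn data
df = pd.DataFrame({
    "customerId": ["a1", "b2", "c3", "d4"],
    "customerServiceCalls": [3.0, np.nan, 5.0, np.nan],
})

# Keep only the rows where customerServiceCalls is missing
missing_calls = df.query("customerServiceCalls.isnull()", engine="python")

# Number of matching records
print(len(missing_calls))

# The same count without .query(), as a cross-check
print(df["customerServiceCalls"].isna().sum())
```

`len()` on the filtered DataFrame answers "how many records", and the `isna().sum()` line is a quick way to confirm the result.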