# **Guided Lab 343.3.14 - Handling Missing Data**


## **Lab Overview**

Missing data is a common problem in real-world datasets. It can occur for various reasons, such as data entry errors, sensor malfunctions, or simply incomplete data collection.

In this lab, we will demonstrate how to handle missing data (mostly in the form of NaN values) in Pandas using the **DataFrame.dropna()**, **DataFrame.fillna()**, and **DataFrame.replace()** functions.

## **Lab Objective:**
By the end of this lab, learners will be able to:

- Describe the need for handling missing data.
- Explore the role of np.nan in representing missing data.
- Utilize the **DataFrame.dropna()**, **DataFrame.fillna()**, and **DataFrame.replace()** functions to handle missing data.
- Identify and understand the implications of missing data in datasets.

## **Introduction**

In this lab, we will demonstrate techniques for handling missing data using the Pandas library in Python. We will focus on identifying, understanding, and applying different strategies for dealing with missing values. By the end of this lab, you will be equipped with the knowledge and tools to effectively manage missing data in your own datasets.
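Before choosing a strategy, it helps to see where the missing values are. The following is a minimal sketch of that identification step. The `sample` DataFrame and its values are hypothetical and are not part of the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset, used only for illustration
sample = pd.DataFrame({
    'Name': ['Ann', np.nan, 'Cruz'],
    'Age': [29, 41, np.nan],
    'Salary': [52000, np.nan, np.nan],
})

# np.nan never compares equal to itself, so use isna() instead of ==
nan_equals_itself = (np.nan == np.nan)

# Boolean mask: True wherever a value is missing
missing_mask = sample.isna()

# Number of missing values in each column
missing_per_column = missing_mask.sum()

# Rows that contain at least one missing value
rows_with_missing = sample[missing_mask.any(axis=1)]

print(nan_equals_itself)
print(missing_per_column)
print(rows_with_missing)
```

Counting missing values per column first makes it easier to decide later whether dropping or filling is the more reasonable choice.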

---

# **Begin:**

**Example 1: Dropping or Cleaning the Missing Data Using df.dropna() method.**

The following example initializes a pandas DataFrame with a dataset representing information about employees. It then demonstrates how to handle missing data by dropping rows containing NaN (Not a Number) values using the **dropna()** function. By default, the **dropna()** function drops every row that contains at least one missing value.



**Note:** The *np.nan* values in the dataset represent missing or undefined data, and np refers to the NumPy library. Make sure to import NumPy (import numpy as np) before using *np.nan*.

In [None]:

import numpy as np
import pandas as pd
# Initializing the nested list with Data set
employee_list = [['James', 36, 75, 5428000],
               ['Villers', 38, 74, 3428000],
               ['VKole', 31, 70, 8428000],
               ['Smith', 34, 80, 4428000],
               ['Gayle', 40, 100, 4528000],
               ['Adam', 40, np.nan, 4528000],
               ['Rooter', 33, 72, 7028000],
               ['Peterson', 42, 85, 2528000],
               ['lynda', 42, 85, np.nan],
               [np.nan, 42, 85, np.nan],
               ['Jenny', np.nan, 100, 25632],
               ['Kenn', np.nan, 110, 25632],
               ['Aly', np.nan, 90, 25582],
               ['John', 41, 85, 1528000]]

# creating a pandas dataframe
df = pd.DataFrame(employee_list, columns=['Name', 'Age', 'Weight', 'Salary'])

print(df)
print('---- after dropping or cleaning the missing data ---')
df = df.dropna()
print(df)


Like many other approaches, dropna() also has some pros and cons.

**Pros**

- Straightforward and simple to use.
- Useful when the rows with missing values are not important to the analysis.

**Cons**
- Dropping rows discards information, which can introduce bias into the final dataset.
- It is not appropriate when the data is not missing completely at random.
- A dataset with a large proportion of missing values can shrink significantly, which can affect the results of any statistical analysis on it.
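One way to limit the information loss described above is to make dropna() more selective. The sketch below uses the `how`, `subset`, and `thresh` parameters of dropna(); the small DataFrame is hypothetical and only mirrors the shape of the employee data.

```python
import numpy as np
import pandas as pd

# Hypothetical data in the same shape as the employee example
df = pd.DataFrame({
    'Name': ['James', 'Adam', 'lynda', np.nan],
    'Age': [36, 40, np.nan, np.nan],
    'Weight': [75, np.nan, 85, np.nan],
    'Salary': [5428000, 4528000, np.nan, np.nan],
})

# Default behavior: drop every row that has any missing value
any_missing_dropped = df.dropna()

# Drop a row only when every value in it is missing
all_missing_dropped = df.dropna(how='all')

# Drop a row only when 'Name' or 'Age' is missing
key_columns_checked = df.dropna(subset=['Name', 'Age'])

# Keep rows that have at least 3 non-missing values
mostly_complete = df.dropna(thresh=3)

print(len(df), len(any_missing_dropped), len(all_missing_dropped),
      len(key_columns_checked), len(mostly_complete))
```

Comparing the row counts shows how much data each option keeps, which can guide the choice between them.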


**Example 2: Filling Missing Data Using df.fillna() method**

Pandas has several options for filling or replacing missing values with other values. One of the most convenient methods is the **.fillna()** method; you can use it to replace missing values with:

- Specified values.

- The values above the missing value.

- The values below the missing value.

**Syntax:**

`DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)`

**Parameters:**
- **value:** a scalar, dictionary, Series, or DataFrame of values to use in place of NaN.
- **method:** used when no value is passed. 'ffill' (or 'pad') propagates the last valid value forward into the gap, while 'bfill' (or 'backfill') fills the gap with the next valid value. Recent pandas releases deprecate this parameter in favor of the DataFrame.ffill() and DataFrame.bfill() methods.
- **axis:** the axis along which to fill. Accepts 0 or 'index' for rows and 1 or 'columns' for columns.
- **inplace:** a boolean; if True, the changes are made in the DataFrame itself.
- **limit:** an integer that specifies the maximum number of consecutive NaN values to forward/backward fill.
- **downcast:** a dict that specifies which dtype to downcast to, such as float64 to int64.
- **kwargs:** any other keyword arguments.
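Filling from the value above or below a gap can be sketched with the ffill() and bfill() methods, which behave like the forward and backward fill options described in the parameter list. The `weights` Series here is a hypothetical example, not part of the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with gaps in the middle and at the end
weights = pd.Series([75.0, np.nan, np.nan, 80.0, np.nan])

# Forward fill: each gap takes the last valid value above it
forward_filled = weights.ffill()

# Backward fill: each gap takes the next valid value below it;
# the final NaN stays because nothing follows it
backward_filled = weights.bfill()

# limit=1 fills at most one consecutive NaN per gap
limited = weights.ffill(limit=1)

print(forward_filled.tolist())
print(backward_filled.tolist())
print(limited.tolist())
```

Forward and backward fills are most meaningful when row order carries information, such as time-ordered readings.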



Consider the following DataFrame.

In [None]:
import pandas as pd
import numpy as np
# Initializing the nested list with Data set
employee_list = [['James', 36, 75, 5428000],
               ['Villers', 38, 74, 3428000],
               ['VKole', 31, 70, 8428000],
               ['Smith', 34, 80, 4428000],
               ['Gayle', 40, 100, 4528000],
               ['Adam', 40, np.nan, 4528000],
               ['Rooter', 33, 72, 7028000],
               ['Peterson', 42, 85, 2528000],
               ['lynda', 42, 85, np.nan],
               [np.nan, 42, 85, np.nan],
               ['John', 41, 85, 1528000],]

# creating a pandas dataframe
df = pd.DataFrame(employee_list, columns=['Name', 'Age', 'Weight', 'Salary'])

print(df)



## **Example 2.1: fillna() on all columns**

In [None]:
import pandas as pd
import numpy as np

# fillna() on all columns
# Initializing the nested list with Data set
employee_list = [['James', 36, 75, 5428000],
               ['Villers', 38, 74, 3428000],
               ['Kole', 31, 70, 8428000],
               ['Smith', 34, 80, 4428000],
               ['Gayle', 40, 100, 4528000],
               ['Adam', 40, np.nan, 4528000],
               ['Rooter', 33, 72, 7028000],
               ['Peterson', 42, 85, 2528000],
               ['lynda', 42, 85, np.nan],
               [np.nan, 42, 85, np.nan],
               ['Jenny', np.nan, 100, 25632],
               ['Kenn', np.nan, 110, 25632],
               ['Aly', np.nan, 90, 25582],
               ['John', 41, 85, 1528000],]

# creating a pandas dataframe
df = pd.DataFrame(employee_list, columns=['Name', 'Age', 'Weight', 'Salary'])
print("---------------Before Cleaning NaN values--------------")
print(df)
print('---------------after filling all columns------------')
print(df.fillna('None'))



## **Example 2.2:  Fill same value on multiple columns from NaN values**

The example below fills NaN values in the 'Weight' and 'Age' columns with 'pending':






In [None]:
import pandas as pd
import numpy as np
employee_list = [['James', 36, 75, 5428000],
               ['Villers', 38, 74, 3428000],
               ['VKole', 31, 70, 8428000],
               ['Smith', 34, 80, 4428000],
               ['Gayle', 40, 100, 4528000],
               ['Adam', 40, np.nan, 4528000],
               ['Rooter', 33, 72, 7028000],
               ['Peterson', 42, 85, 2528000],
               ['lynda', 42, 85, np.nan],
               [np.nan, 42, 85, np.nan],
               ['Jenny', np.nan, 100, 25632],
               ['Kenn', np.nan, 110, 25632],
               ['Aly', np.nan, 90, 25582],
               ['John', 41, 85, 1528000],]

# creating a pandas dataframe
df = pd.DataFrame(employee_list, columns=['Name', 'Age', 'Weight', 'Salary'])
print("---------------Before Cleaning NaN values--------------")
print(df)

print("---------------after Cleaning NaN values from multiple columns--------------")
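# Fill NaN only in 'Weight' and 'Age'; the other columns stay in the DataFrame unchanged
df[['Weight', 'Age']] = df[['Weight', 'Age']].fillna('pending')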
print(df)


## **Example 2.3:  Fill different value for each column**
Now, let’s see how to fill a different value for each column. The example below fills NaN values in:

- ‘Name’ with ‘Verification pending’,

- ‘Weight’ with ‘Pending’,

- ‘Age’ with ‘Unknown’,

- ‘Salary’ with ‘0.0’.








In [None]:
import pandas as pd
import numpy as np
employee_list = [['James', 36, 75, 5428000],
               ['Villers', 38, 74, 3428000],
               ['VKole', 31, 70, 8428000],
               ['Smith', 34, 80, 4428000],
               ['Gayle', 40, 100, 4528000],
               ['Adam', 40, np.nan, 4528000],
               ['Rooter', 33, 72, 7028000],
               ['Peterson', 42, 85, 2528000],
               ['lynda', 42, 85, np.nan],
               [np.nan, 42, 85, np.nan],
               ['Jenny', np.nan, 100, 25632],
               ['Kenn', np.nan, 110, 25632],
               ['Aly', np.nan, 90, 25582],
               ['John', 41, 85, 1528000],]

# creating a pandas dataframe
df = pd.DataFrame(employee_list, columns=['Name', 'Age', 'Weight', 'Salary'])
print("---------------Before Cleaning NaN values--------------")
print(df)
print("-----after Cleaning NaN values from multiple columns--------------")

df2 = df.fillna(value={'Weight':'Pending','Age':"Unknown", 'Salary': "0.0", 'Name': "Verification pending"})

print(df2)
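Filling a numeric column with text changes the column's data type, which matters if you still need to do arithmetic on it. The sketch below illustrates this with a hypothetical two-column DataFrame; it is an aside on the technique above, not part of the lab solution.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with gaps
df = pd.DataFrame({
    'Age': [36.0, np.nan, 42.0],
    'Salary': [5428000.0, 4528000.0, np.nan],
})

# Filling with a string forces the column to the generic object dtype
age_as_text = df['Age'].fillna('Unknown')

# Filling with a number keeps the column numeric
salary_numeric = df['Salary'].fillna(0.0)

print(age_as_text.dtype)
print(salary_numeric.dtype)
print(salary_numeric.sum())
```

If a column will be summed or averaged later, a numeric fill value is usually the safer choice; a text label suits columns that are only displayed.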

# **Pandas Replace NaN with Blank/Empty String**

Using the replace() or fillna() method, you can replace NaN values with a blank/empty string in a Pandas DataFrame.

Now, let’s create a DataFrame with a few rows and columns, run some examples, and validate the results. Our DataFrame contains the columns Courses, Fee, Duration, and Discount.

In [None]:
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark",np.nan,"Hadoop","Python","pandas",np.nan,"Java"],
    'Fee' :[20000,25000, np.nan,22000,24000,np.nan,22000],
    'Duration':[np.nan,'40days','35days', np.nan,'60days','50days','55days'],
    'Discount':[1000,np.nan,1500,np.nan,2500,2100,np.nan]
              }
df = pd.DataFrame(technologies)
print(df)


**Example 3: Convert NaN to Empty String in Pandas**

Use the **df.replace(np.nan,'',regex=True)** method to replace all NaN values with an empty string in every column of the Pandas DataFrame.


In [None]:
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark",np.nan,"Hadoop","Python","pandas",np.nan,"Java"],
    'Fee' :[20000,25000, np.nan,22000,24000,np.nan,22000],
    'Duration':[np.nan,'40days','35days', np.nan,'60days','50days','55days'],
    'Discount':[1000,np.nan,1500,np.nan,2500,2100,np.nan]
              }
df = pd.DataFrame(technologies)
df2 = df.replace(np.nan, '', regex=True)
print(df2)
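The replace() method can also work in the opposite direction: turning text placeholders into NaN so that pandas recognizes them as missing. The following is a minimal sketch under the assumption that missing entries were typed as 'N/A' or 'unknown'; the `courses` DataFrame and both placeholder strings are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical course data where missing entries were typed as text placeholders
courses = pd.DataFrame({
    'Courses': ['Spark', 'N/A', 'Hadoop'],
    'Duration': ['30days', '40days', 'unknown'],
})

# Before the conversion, pandas sees no missing values at all
missing_before = int(courses.isna().sum().sum())

# Convert both placeholder strings to NaN in one call
cleaned = courses.replace(['N/A', 'unknown'], np.nan)
missing_after = int(cleaned.isna().sum().sum())

print(missing_before, missing_after)
```

Once the placeholders are real NaN values, dropna() and fillna() treat them like any other missing data.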





**Example 4: Replace NaN with an Empty String in Multiple Columns**

To replace NaN values with blank strings in a selected list of columns, use **df[['Courses','Fee']] = df[['Courses','Fee']].fillna('')**. This replaces NaN values in the Courses and Fee columns only.


In [None]:
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark",np.nan,"Hadoop","Python","pandas",np.nan,"Java"],
    'Fee' :[20000,25000, np.nan,22000,24000,np.nan,22000],
    'Duration':[np.nan,'40days','35days', np.nan,'60days','50days','55days'],
    'Discount':[1000,np.nan,1500,np.nan,2500,2100,np.nan]
              }
df = pd.DataFrame(technologies)

# Replace NaN with an empty string in the selected columns only
df[['Courses', 'Fee']] = df[['Courses', 'Fee']].fillna('')
print(df)



**Example 5: fillna() with inplace=True**

Notice that fillna() in the examples above returns a new DataFrame and leaves the original unchanged. To update the existing DataFrame in place, use df.fillna('', inplace=True). When used this way, fillna() returns None, so its result should not be assigned to a variable.


In [None]:
#fillna() with inplace=True
import pandas as pd
import numpy as np

# initializing the nested list with Data set and NaN values
technologies = {
    'Courses':["Spark",np.nan,"Hadoop","Python","pandas",np.nan,"Java"],
    'Fee' :[20000,25000, np.nan,22000,24000,np.nan,22000],
    'Duration':[np.nan,'40days','35days', np.nan,'60days','50days','55days'],
    'Discount':[1000,np.nan,1500,np.nan,2500,2100,np.nan]
              }
df = pd.DataFrame(technologies)
# Replace NaN with an empty string, modifying df itself
df.fillna('', inplace=True)
print(df)
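The effect of inplace=True can be checked by keeping a second reference to the same DataFrame. This is a small sketch with a hypothetical `Fee` column; it fills with the number 0 rather than an empty string so that the column stays numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column DataFrame with one gap
df = pd.DataFrame({'Fee': [20000.0, np.nan, 22000.0]})

# A second name pointing at the same DataFrame object
same_object = df

# Modify df itself; the return value is deliberately not assigned
df.fillna(0, inplace=True)

# Both names show the filled data because no new DataFrame was created
print(same_object)
```

Because the original object is changed, there is no way to get the unfilled data back unless a copy was made beforehand, for example with df.copy().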





**Example 6: Handling Missing Data from CSV file**

In this example, we will utilize the dummy employee dataset.

[Click here to download employee dataset (employee.csv)](https://drive.google.com/file/d/14RV1xKIRzWS166LtGqnPC1Wg7eTlI_y1/view)

As we learned in the previous examples, the **fillna()** function can be used to deal with NaN values. The CSV file contains some NaN (missing) values, so let's handle them using the **fillna()** function.


In [None]:
import pandas as pd
df = pd.read_csv('employee.csv', dtype={'Name':'string' })


print('============Before Handling missing values=====')
print(df)


print('===========After Handling missing values using fillna()=====')
df2 = df.fillna(value={'Name':'Verification Pending','Age':"Unknown", 'Weight': "pending", 'Salary': 0.0})
print(df2)
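If the CSV file is not at hand, the same workflow can be tried on in-memory text. The sketch below assumes a hypothetical three-row stand-in for employee.csv (the real file's contents may differ) and reads it through io.StringIO. Empty fields are parsed as missing values.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for employee.csv
csv_text = """Name,Age,Weight,Salary
James,36,75,5428000
Adam,40,,4528000
,42,85,
"""

df = pd.read_csv(io.StringIO(csv_text), dtype={'Name': 'string'})

# Count missing values per column before filling
missing_before = df.isna().sum()

# Fill each column with a value that suits its type
df2 = df.fillna(value={'Name': 'Verification Pending', 'Weight': 0.0, 'Salary': 0.0})
missing_after = df2.isna().sum()

print(missing_before)
print(missing_after)
```

Checking the counts before and after filling confirms that every gap was handled.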

### **Example 7: Identify the first non-empty row in a Pandas Series or column**

To identify the first non-empty row in a Pandas Series or column, you can use the `first_valid_index()` method. This method returns the index label of the first non-null (non-empty) value in the Series. Here's how you can use it:

Example 7.1

In [None]:
import pandas as pd

# Example Pandas Series
data = pd.Series([None, None, 5, 10, None, 20])

# Find the index label of the first non-empty row
first_non_empty_index = data.first_valid_index()

print("Index of the first non-empty row:", first_non_empty_index)
print("Value of the first non-empty row:", data[first_non_empty_index])


Index of the first non-empty row: 2
Value of the first non-empty row: 5.0


Example 7.2

In [None]:
import pandas as pd
import numpy as np
# Example Pandas Series with NaN values
data = pd.Series([np.nan, np.nan, 5, 10, np.nan, 20])

# Find the index label of the first non-empty row
first_non_empty_index = data.first_valid_index()

print("Index of the first non-empty row:", first_non_empty_index)
print("Value of the first non-empty row:", data[first_non_empty_index])

In the above examples, the first non-empty row in the Series occurs at index label 2, and the corresponding value is 5.0. You can use this method to identify the first non-empty row in any Pandas Series or column.
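The same idea carries over to a DataFrame column with a labeled index. The sketch below uses a hypothetical `Weight` column indexed by name. It also shows last_valid_index(), the counterpart for the last non-empty row, and the None result for a Series with no valid values.

```python
import numpy as np
import pandas as pd

# Hypothetical column with names as index labels
df = pd.DataFrame(
    {'Weight': [np.nan, 74.0, 70.0, np.nan]},
    index=['James', 'Villers', 'VKole', 'Smith'],
)

# The result is an index label, not a position
first_label = df['Weight'].first_valid_index()
last_label = df['Weight'].last_valid_index()

# A Series with no valid values has no first valid index
no_valid = pd.Series([np.nan, np.nan]).first_valid_index()

print(first_label, last_label, no_valid)
```

Since the result is a label, use it with df.loc rather than df.iloc when looking up the row.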

## **Submission Instructions**
- Submit your completed lab using the Start Assignment button on the assignment page in Canvas.
- Your submission can include:
  - If you are using a notebook, all tasks should be written and submitted in a single notebook file, for example: **(your_name_labname.ipynb)**.
  - If you are using a Python script file, all tasks should be written and submitted in a single Python script file, for example: **(your_name_labname.py)**.
- Add appropriate comments and any additional instructions if required.
