# Overview

What should happen during data cleaning, and how do we do it? This notebook works through the main steps with NumPy.

The simplest approach is to drop the missing data. Missing data can occur for various reasons, such as human error, data corruption, or data collection issues. Handling missing data is important because it can affect the accuracy and reliability of the data analysis.

There are several ways to handle missing data:

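* Deletion: removing the rows or columns that contain missing values.
* Imputation: replacing missing data with estimated values.
* Interpolation: estimating missing data based on the values of nearby data points.
* Prediction: estimating missing values with a model fitted on the observed data.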
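
The interpolation option can be sketched with `numpy.interp`, which performs linear interpolation between known points. This is a minimal sketch that assumes the samples are evenly spaced, so the array index can stand in for the x-coordinate; the variable names are illustrative.

```python
import numpy as np

# Series with gaps; assumes evenly spaced samples so the index acts as x
series = np.array([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

missing = np.isnan(series)
idx = np.arange(series.size)

# Linearly interpolate the missing positions from the known ones
filled = series.copy()
filled[missing] = np.interp(idx[missing], idx[~missing], series[~missing])

print(filled)  # [1. 2. 3. 4. 5. 6.]
```

Gaps at the very start or end of the series would be filled with the nearest known value, because `numpy.interp` does not extrapolate.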


# Handling Missing Data

Here we use `numpy.nan`, a special floating-point value that represents missing or undefined data. NumPy also provides helper functions such as:
* `numpy.nan_to_num()` replaces NaN values with zero by default (and infinities with large finite numbers)
* `numpy.delete()` removes rows or columns at the given indices, which lets us drop those with missing data
* `numpy.isnan()` checks if a value is NaN
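
`numpy.isnan()` is needed because NaN does not compare equal to anything, including itself, so an `==` test never finds it. A short illustration follows; the variable names and the fill value of -1.0 are chosen only for this example.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

# An equality comparison never matches NaN, even against np.nan itself
eq_mask = values == np.nan

# np.isnan is the reliable element-wise test
nan_mask = np.isnan(values)

print(eq_mask.any())    # False: == found nothing
print(nan_mask.sum())   # 1 missing value

# The nan keyword of nan_to_num selects a fill value other than zero
print(np.nan_to_num(values, nan=-1.0))
```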

In [1]:
import numpy as np

# Creating an array with missing values
data = np.array([1, 2, np.nan, 4, 5])

# Checking for missing values
has_missing = np.isnan(data)

# Filling missing values with a specific value or the mean
data[has_missing] = 0  # Replace with a specific value
# Alternatively:
# data[has_missing] = np.nanmean(data)

print(data)

[1. 2. 0. 4. 5.]


In [2]:
# Filling missing data
arr = np.array([1, 2, np.nan, 4, 5])

# fill missing data with zero
arr_filled = np.nan_to_num(arr)

print(arr_filled)

[1. 2. 0. 4. 5.]


In [3]:
# Dropping missing data
arr = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

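# delete rows with missing data; np.where returns a tuple, so take its first element (the row indices)
rows_with_nan = np.where(np.isnan(arr).any(axis=1))[0]
arr_no_missing = np.delete(arr, rows_with_nan, axis=0)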

print(arr_no_missing)

[[7. 8. 9.]]
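
The same idea works for columns. As a sketch, a boolean mask built with `any(axis=0)` keeps only the columns without NaN. Boolean indexing is used here in place of `numpy.delete()`; that is a stylistic choice for this example rather than a requirement.

```python
import numpy as np

arr = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# True for every column that contains at least one NaN
col_has_nan = np.isnan(arr).any(axis=0)

# Keep only the complete columns
arr_complete_cols = arr[:, ~col_has_nan]

print(arr_complete_cols)
```

Only the first column survives here, since the other two each contain a NaN.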


In [4]:
# Imputing missing data
arr = np.array([1, 2, np.nan, 4, 5])

# impute missing data with the mean of the non-missing values
arr_imputed = np.where(np.isnan(arr), np.nanmean(arr), arr)

print(arr_imputed)

[1. 2. 3. 4. 5.]
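
For a 2-D array it usually makes more sense to impute each column with its own mean. The sketch below assumes that columns are features and rows are observations; `table` and its values are made up for illustration.

```python
import numpy as np

table = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

# Mean of each column, ignoring NaN
col_means = np.nanmean(table, axis=0)

# Broadcasting fills each NaN with the mean of its own column
table_imputed = np.where(np.isnan(table), col_means, table)

print(table_imputed)
```

Here the column means are 2 and 15, so those are the values written into the two gaps.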


# Addressing Outliers

Outliers can skew analytical results and impact the accuracy of models.
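
A quick sketch of that effect, using a small array in which 100 is the outlier: the mean is pulled far away from the bulk of the data, while the median barely notices.

```python
import numpy as np

data = np.array([1, 2, 8, 4, 5, 15, 100])

mean_all = np.mean(data)
median_all = np.median(data)
mean_without_outlier = np.mean(data[:-1])  # same data with the value 100 left out

print(mean_all)              # about 19.29
print(median_all)            # 5.0
print(mean_without_outlier)  # about 5.83
```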

In [5]:
# Creating an array with outliers
data = np.array([1, 2, 8, 4, 5, 15, 100])

# Identifying outliers based on z-score
z_scores = np.abs((data - np.mean(data)) / np.std(data))
print(z_scores)

is_outlier = z_scores > 2  # Adjust the threshold as needed

# Removing outliers
cleaned_data = data[~is_outlier]

print(cleaned_data)

[0.55021329 0.5201235  0.33958476 0.45994392 0.42985413 0.12895624
 2.42867585]
[ 1  2  8  4  5 15]


## Percentile-based clipping

We can use the `numpy.percentile()` function to calculate percentiles, and the `numpy.clip()` function to clip values outside a certain range.

In [6]:
# create an array with outliers
arr = np.array([1, 2, 3, 100, 5, 6])

# clip values outside the range of the 1st and 80th percentiles
arr_clipped = np.clip(arr, np.percentile(arr, 1), np.percentile(arr, 80))

print(arr_clipped)

[1.05 2.   3.   6.   5.   6.  ]
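
If the same clipping is needed in several places, it can be wrapped in a small helper. `clip_to_percentiles` below is a hypothetical name, and its 5th/95th defaults are an assumption chosen for illustration, not a recommendation from the source.

```python
import numpy as np

def clip_to_percentiles(values, lower=5, upper=95):
    """Clip values to the [lower, upper] percentile range (hypothetical helper)."""
    lo, hi = np.percentile(values, [lower, upper])
    return np.clip(values, lo, hi)

arr = np.array([1, 2, 3, 100, 5, 6])

# Same bounds as the cell above: 1st and 80th percentiles
arr_clipped = clip_to_percentiles(arr, lower=1, upper=80)
print(arr_clipped)
```

Clipping keeps the array length unchanged, which is convenient when the values must stay aligned with other columns.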


## Z-score-based filtering

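We can use the `numpy.mean()` and `numpy.std()` functions to calculate the mean and standard deviation of the data, and the `numpy.abs()` function to take the absolute value of each z-score, that is, each value's distance from the mean in units of standard deviation. Values whose absolute z-score exceeds a chosen threshold are treated as outliers.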

In [7]:
# create an array with outliers
arr = np.array([1, 2, 3, 100, 5, 6])

# calculate mean and standard deviation
mean = np.mean(arr)
std = np.std(arr)

# calculate z-scores
z_scores = np.abs((arr - mean) / std)
print(z_scores)

# remove values with z-scores greater than 2
arr_filtered = arr[z_scores <= 2]

print(arr_filtered)

[0.51331161 0.48556503 0.45781846 2.23359915 0.40232531 0.37457874]
[1 2 3 5 6]
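
The threshold has to match the sample size. With NumPy's default population standard deviation (`ddof=0`), no single point in a sample of n values can have an absolute z-score above sqrt(n - 1). That is about 2.24 for the six values here, so a cutoff of 3 could never flag anything. The sketch below checks this; the variable names are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 100, 5, 6])

# np.std uses ddof=0 (population standard deviation) by default
z_scores = np.abs((arr - np.mean(arr)) / np.std(arr))

# Upper bound on any single |z| in a sample of this size
max_possible = np.sqrt(arr.size - 1)

print(z_scores.max() <= max_possible)  # True
print((z_scores > 3).any())            # False: a cutoff of 3 flags nothing
print((z_scores > 2).sum())            # 1: only the value 100 is flagged
```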


## Median-based filtering

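We can use the `numpy.median()` function to calculate the median of the data, and the `numpy.percentile()` function to calculate the 1st and 99th percentiles. Values outside that range are replaced with the median. Because the percentile bounds are interpolated between data points, both the smallest value (1) and the largest value (100) fall outside them in this example, so both are replaced.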

In [8]:
# create an array with outliers
arr = np.array([1, 2, 3, 100, 5, 6])

# calculate median
median = np.median(arr)

# replace values outside the range of the 1st and 99th percentiles with the median
arr_filtered = np.where(np.logical_or(arr < np.percentile(arr, 1), arr > np.percentile(arr, 99)), median, arr)

print(arr_filtered)

[4. 2. 3. 4. 5. 6.]


# Scikit-learn integrates with NumPy

In [9]:
from sklearn.linear_model import LinearRegression

# create a NumPy array
arr = np.array([[1], [2], [3], [4], [5]])

# create a linear regression model
model = LinearRegression()

# train the model
model.fit(arr, np.array([2, 4, 6, 8, 10]))

# make a prediction
prediction = model.predict(np.array([[6]]))

print(prediction)

[12.]
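
As a cross-check that needs only NumPy, the same straight-line fit can be sketched with `numpy.polyfit`. This is not part of scikit-learn; it is shown on the assumption that an ordinary least-squares line with an intercept is what the model above fits to this data.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# A degree-1 least-squares fit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

# Predict for x = 6, the same input used with the scikit-learn model
prediction = slope * 6 + intercept
print(round(prediction, 6))
```

Since the data lie exactly on y = 2x, the fitted slope is 2 and the intercept is 0 up to floating-point error, which matches the prediction of 12 above.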


# Credit

* https://blog.stackademic.com/numpy-in-real-world-data-science-projects-abfb517507e1