#### NumPy Data Preprocessing

##### Why Data Preprocessing
1. Data Quality - Data preprocessing lets us clean the data and improve its quality, which makes the analysis more reliable.
2. Feature Engineering - It helps create meaningful features, which are the building blocks of ML algorithms.
3. Scaling and Normalization - It brings data into a consistent range, preventing certain features from dominating others.
4. Handling Categorical Data - It enables the transformation of categorical data into a format that ML algorithms can understand.

##### Why NumPy is Indispensable for Data Preprocessing

1. Efficient Array Operations - It performs mathematical and logical operations on whole arrays efficiently, and that speed is critical for large datasets.
2. Handling Missing Data - It offers tools for detecting and filling missing values, closing gaps in the data that could otherwise compromise the analysis.
3. Statistical Calculations - It provides functions to calculate statistics, which are essential for data cleaning and for understanding your data's distribution.
4. Array Slicing and Indexing - These make it easy to extract specific portions of the data, a crucial skill for selecting and transforming features.
5. Numerical Encoding - It helps transform categorical data into a numerical format through techniques like one-hot encoding.

##### Handling Missing Values
- NumPy represents a missing value as `np.nan`, detects missing values with `np.isnan()`, and computes the mean of the non-missing values with `np.nanmean()`.
- Example:

In [1]:
# Imports
import numpy as np
import pandas as pd

In [2]:
# Create an array with a missing value

data = np.array([1, 2, np.nan, 4, 5])

# Calculate the mean of non-missing values
mean_val = np.nanmean(data)

# Assign the mean to the missing value
data[np.isnan(data)] = mean_val

print(data)


[1. 2. 3. 4. 5.]
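
The same idea extends to a two-dimensional feature matrix, where each column is usually filled with its own mean. The following is a minimal sketch under that assumption; the array `X` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Mean of each column, ignoring missing values
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column
# (col_means broadcasts across the rows of X)
X_filled = np.where(np.isnan(X), col_means, X)

print(X_filled)
```

Using `np.where` returns a new array and leaves `X` unchanged, which can be convenient when the original missing-value pattern is still needed later.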


##### Detecting Outliers
- `z_score`: the Z-score measures how many standard deviations each data point lies from the mean.

In [3]:
# Create an array with outliers
data = np.array([1, 2, 3, 4, 100, 200])

# Calculate the z_scores
z_score = (data - np.mean(data)) / np.std(data)
#print(z_score)

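# Identify outliers (where absolute Z-score > 3) and keep everything else
# Note: with only six values the largest |Z-score| here is about 1.97,
# so nothing is flagged at this threshold
is_outlier = np.abs(z_score) > 3
filtered_data = data[~is_outlier]
#print(filtered_data)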
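
As a minimal sketch of Z-score filtering with a boolean mask, the example below uses a slightly larger, made-up sample (`values`) in which one point really does lie more than 3 standard deviations from the mean. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical sample: twenty typical readings and one extreme value
values = np.concatenate([np.full(20, 10.0), [100.0]])

# Z-score of every point (np.std uses the population standard deviation by default)
z = (values - np.mean(values)) / np.std(values)

# Boolean mask: True where a point lies more than 3 standard deviations from the mean
is_outlier = np.abs(z) > 3

outliers = values[is_outlier]    # the flagged points
filtered = values[~is_outlier]   # everything else

print(outliers)
print(filtered.size)
```

The threshold of 3 is a common convention rather than a fixed rule. With the population standard deviation, the largest possible absolute Z-score in a sample of n points is sqrt(n - 1), so in very small samples a threshold of 3 can never be reached.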

##### Scaling and Normalization
- This ensures that your data lies within a consistent range.
- Example:

In [4]:
# Create an array
data = np.array([10,20,30,40,50, 60])

# Perform min-max scaling (rescale to the range [0, 1])
scaled_data = (data - np.min(data)) / (np.max(data) - np.min(data))
print(scaled_data)

[0.  0.2 0.4 0.6 0.8 1. ]


**Explanation:** In this example, we perform min-max scaling to rescale the data values to the range [0, 1]. We achieve this by subtracting the minimum value from each data point and dividing by the range (the difference between the maximum and minimum values). Min-max scaling is useful when different features have different scales and we want to bring them to a consistent range for modeling.
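
When the data has several features, the scaling is typically applied to each column separately. The sketch below assumes a small, made-up matrix `X` whose rows are samples and whose columns are features, and it assumes no column is constant (otherwise the range would be zero).

```python
import numpy as np

# Hypothetical feature matrix with two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Per-column minimum and maximum
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Rescale every column to [0, 1]; assumes col_max > col_min for each column
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```

After scaling, both columns span the same range, so neither feature dominates purely because of its units.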

##### Encoding Categorical Data
- NumPy can convert categorical variables into numerical representations using techniques like _one-hot_ encoding.
- Example: One-hot Encoding

In [5]:
# Create an array with categorical data
colors = np.array(['red', 'yellow', 'green', 'yellow', 'blue', 'black','yellow', 'red', 'violet'])

# Perform one-hot encoding
encoded_colors = np.eye(len(np.unique(colors)))[np.searchsorted(np.unique(colors), colors)]

print(encoded_colors)

[[0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]]


**Explanation:** Here `colors` holds categorical data. We first identify the unique categories with `np.unique()`, which returns them in sorted order, and then use `np.searchsorted()` to map each value to the index of its category. Finally, indexing the identity matrix from `np.eye()` with those indices yields the one-hot encoding: each value becomes a binary vector with a single 1 in the column of its category. The columns therefore follow the sorted order black, blue, green, red, violet, yellow. This makes categorical data usable in machine learning models.
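
As an alternative sketch, `np.unique()` can return the integer codes directly through its `return_inverse` argument, which avoids calling `np.unique()` twice. The shortened `colors` array below is made up for illustration.

```python
import numpy as np

colors = np.array(['red', 'yellow', 'green', 'yellow', 'blue'])

# categories: sorted unique labels
# codes: for each element, the index of its label in categories
categories, codes = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(categories.size, dtype=int)[codes]

print(categories)
print(one_hot)
```

Keeping `categories` around is useful because it records which column corresponds to which label.
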
- Example 2: Extracting the day of the week from dates, using pandas

In [8]:
# Create an array of date strings
dates = np.array(['2026-08-01', '2026-09-01', '2026-10-01'])

# Convert to date objects
date_objs = pd.to_datetime(dates)

# Extract day of the week
day_of_week = date_objs.dayofweek

print(day_of_week)

Index([5, 1, 3], dtype='int32')


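**Explanation:** Here we demonstrate how to work with date and time data. We first convert the date strings into datetime objects and then extract the day of the week. pandas numbers the days from Monday = 0 to Sunday = 6, so the output above corresponds to Saturday, Tuesday, and Thursday. This can be useful for time series analysis.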
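
The same result can be obtained with NumPy alone. The following is a minimal sketch, not the approach used above: it relies on the fact that `datetime64[D]` values count whole days from 1970-01-01, which was a Thursday.

```python
import numpy as np

# Parse the date strings directly into NumPy's day-resolution datetime type
dates = np.array(['2026-08-01', '2026-09-01', '2026-10-01'], dtype='datetime64[D]')

# Whole days elapsed since 1970-01-01 (a Thursday, i.e. weekday 3 when Monday = 0)
days_since_epoch = dates.astype(np.int64)

# Shift by 3 so that Monday = 0 ... Sunday = 6, matching the pandas convention
day_of_week = (days_since_epoch + 3) % 7

print(day_of_week)
```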