## 🌾 Question 02: NumPy Data Cleaning and Normalization

**Goal:** Clean and normalize a large, messy CSV file of soil moisture sensor readings using the high-performance capabilities of NumPy.
**Topics:** File I/O, NumPy Arrays, Vectorized Operations, Boolean Indexing, Axis Operations, Broadcasting, Min-Max Normalization.

### Data Overview
* The `sensor_data.csv` file has **100 columns** (one per sensor) and **720 rows** (one per hour).
* The values are stored as strings and include a missing-value sentinel (`"-999"`) as well as impossible readings (negative or greater than 100).

### Your Task:


1.  **Load Data:** Use **standard Python file I/O** (no Pandas) to read `sensor_data.csv` line-by-line, split each line on commas, and load the values into a Python list of lists.


In [None]:
with open('sensor_data.csv', 'r') as csvObject:
    # readlines() returns a list with one str per line of the file
    raw_data = csvObject.readlines()
    # each element is still a single comma-separated str,
    # e.g. raw_data[n] = '...12,23...'

# convert 
# raw_data = [
#   ...,
#   '...12,23...',
#   '...12,33...',
#   ...
# ]
#  to
# sensor_data = [
#   ...,
#   [...12,23...],
#   [...12,33...],
#   ...
# ]
# split each line on commas to get a list of str (e.g. "12,34" -> ["12", "34"]),
# then cast every element to float, giving a 2D list of floats
sensor_data = [
    [float(element) for element in line.split(',')]
    for line in raw_data
]
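
The cell above assumes every line in the file holds data. As a minimal sketch of the same split-and-cast step with blank lines skipped, the helper below (a hypothetical `parse_sensor_lines`, not part of the original notebook) is run on an in-memory three-sensor sample instead of the real `sensor_data.csv`:

```python
import io


def parse_sensor_lines(lines):
    """Convert an iterable of comma-separated text lines into a list of float lists."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines, e.g. an empty line at the end of the file
            continue
        rows.append([float(value) for value in line.split(",")])
    return rows


# Hypothetical sample standing in for sensor_data.csv
sample = io.StringIO("21.21,66.86,47.73\n45.06,-999,20.37\n\n")
rows = parse_sensor_lines(sample)
print(rows)
```

Because a file object opened with `open()` is also an iterable of lines, the same helper could be called as `parse_sensor_lines(csvObject)` inside the `with` block.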


2.  **NumPy Conversion:** Convert this nested list into a **2D NumPy array**.


In [None]:
%pip install numpy

In [17]:
import numpy as np

sensor_data_np = np.array(sensor_data)
print(sensor_data_np)

[[  21.21   66.86   47.73 ...   60.63   10.51   83.58]
 [  47.24   71.7    68.44 ...   68.91   29.68   38.72]
 [  39.07   21.85   31.86 ...   12.77   12.38   16.5 ]
 ...
 [  18.6    64.89   41.9  ...   16.49   32.04   36.97]
 [  22.03   43.26   55.61 ...   13.56   83.69   61.68]
 [  45.06 -999.     20.37 ...   86.32   35.03   34.74]]
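
After the conversion it can help to confirm the array has the expected shape and a floating-point dtype, since `np.nan` can only be stored in a float array. A small sketch on a hypothetical two-row sample (the real array should report `(720, 100)` given the data overview):

```python
import numpy as np

# Hypothetical sample standing in for the full nested list
rows = [[21.21, 66.86, 47.73],
        [45.06, -999.0, 20.37]]

# dtype=float makes the intended element type explicit
arr = np.array(rows, dtype=float)

print(arr.shape)  # (rows, columns) -> (hours, sensors)
print(arr.dtype)  # floating-point dtype, so NaN can be stored later
```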


3.  **Data Cleaning (Vectorized):** Use NumPy's vectorized operations (no for loops) to:
    * Replace all occurrences of `"-999"` with **`np.nan`**.
    * Use **Boolean indexing** to find and replace all values **less than 0** and all values **greater than 100** with **`np.nan`**.


In [None]:
# mask = (sensor_data_np == -999) builds a boolean mask
# arr[mask] selects the elements where the mask is True
# arr[mask] = np.nan replaces those selected elements with NaN
sensor_data_np[sensor_data_np == -999] = np.nan

print(sensor_data_np)

[[21.21 66.86 47.73 ... 60.63 10.51 83.58]
 [47.24 71.7  68.44 ... 68.91 29.68 38.72]
 [39.07 21.85 31.86 ... 12.77 12.38 16.5 ]
 ...
 [18.6  64.89 41.9  ... 16.49 32.04 36.97]
 [22.03 43.26 55.61 ... 13.56 83.69 61.68]
 [45.06   nan 20.37 ... 86.32 35.03 34.74]]
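
The cell above replaces only the `-999` sentinel. The second bullet of the task (values below 0 or above 100) could be handled with a combined boolean mask. The `X_min: -19.93` and `X_max: 159.93` printed in step 5 suggest that such out-of-range readings were still present in `sensor_data_np` when normalization ran. The sketch below uses a small hypothetical array, not the real file:

```python
import numpy as np

# Hypothetical readings: one sentinel, one negative value, one value above 100
readings = np.array([[21.21, -999.0, 47.73],
                     [-19.93, 71.70, 159.93]])

# Step 1: replace the missing-value sentinel
readings[readings == -999] = np.nan

# Step 2: combine two element-wise comparisons with | (logical OR);
# comparisons against NaN evaluate to False, so existing NaNs are left alone
invalid = (readings < 0) | (readings > 100)
readings[invalid] = np.nan

print(readings)
```

Since `-999` is itself below 0, the range mask alone would also catch the sentinel; keeping the two steps separate simply mirrors the two bullets of the task.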


4.  **Data Analysis (Axis Operations):**
    * Calculate the **average (mean) moisture** for each sensor (**column-wise**), **ignoring `np.nan` values**.
    * Calculate the **median moisture** for each hour (**row-wise**), **ignoring `np.nan` values**.
    * Identify the sensor (**column index**) with the **highest number of invalid readings** (the `np.nan` values you just set).


In [None]:
######################################
# axis = 0: reduce down the rows, giving one value per column (sensor)
mean_moisture_per_sensor = np.nanmean(sensor_data_np, axis = 0)

print(mean_moisture_per_sensor.shape)

print("----------")
######################################
# axis = 1: reduce across the columns, giving one value per row (hour)
median_moisture_per_hour = np.nanmedian(sensor_data_np, axis = 1)
print(median_moisture_per_hour.shape)

######################################
# sensor (column index) with the highest number of invalid (NaN) readings
sensor_errors = np.isnan(sensor_data_np).sum(axis=0)
greatest_error = np.argmax(sensor_errors)

print("sensor with greatest error:",greatest_error)

(100,)
----------
(720,)
sensor with greatest error: 8
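
The shapes above confirm the direction of each reduction, but not the values. As a sanity check, the same three calls can be run on a tiny hypothetical array (3 hours by 3 sensors) where the results are easy to verify by hand:

```python
import numpy as np

# Hypothetical 3x3 array: rows are hours, columns are sensors
data = np.array([[10.0, np.nan, 30.0],
                 [20.0, np.nan, 50.0],
                 [np.nan, 40.0, 70.0]])

col_means = np.nanmean(data, axis=0)      # one mean per sensor, NaNs ignored
row_medians = np.nanmedian(data, axis=1)  # one median per hour, NaNs ignored

nan_counts = np.isnan(data).sum(axis=0)   # number of NaNs in each column
worst_sensor = np.argmax(nan_counts)      # index of the first column with the most NaNs

print(col_means)     # [15. 40. 50.]
print(row_medians)   # [20. 35. 55.]
print(nan_counts)    # [1 2 0]
print(worst_sensor)  # 1
```

Note that `np.argmax` returns the first index when several columns tie for the highest count.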


5.  **Normalization (Broadcasting):**
    * Find the minimum ($X_{min}$) and maximum ($X_{max}$) valid reading across the **entire dataset**.
    * Use **NumPy broadcasting** to perform **Min-Max Normalization** on the entire array, scaling all data to be between 0 and 1. The formula is:
        $$X_{norm}=\frac{X-X_{min}}{X_{max}-X_{min}}$$


In [None]:
X_min = np.nanmin(sensor_data_np)
print("X_min:",X_min)

X_max = np.nanmax(sensor_data_np)
print("X_max:",X_max)

X_norm = (sensor_data_np - X_min)/(X_max - X_min)
print("X_norm:",X_norm)


X_min: -19.93
X_max: 159.93
X_norm: [[0.22873346 0.48254198 0.37618147 ... 0.44790393 0.16924274 0.57550317]
 [0.37345713 0.5094518  0.49132659 ... 0.49393973 0.27582564 0.32608696]
 [0.32803291 0.23229178 0.28794618 ... 0.18180807 0.17963972 0.20254642]
 ...
 [0.21422217 0.47158901 0.34376737 ... 0.20249083 0.28894696 0.31635717]
 [0.23329256 0.35132881 0.41999333 ... 0.18620038 0.57611476 0.4537418 ]
 [0.3613366         nan 0.22406316 ... 0.59073724 0.305571   0.30395863]]
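
To see the broadcasting at work, the formula can be applied to a small hypothetical array. The scalars `x_min` and `x_max` are stretched across every element of the 2D array, and NaN entries stay NaN:

```python
import numpy as np

# Hypothetical cleaned readings (all valid values lie within 0..100)
data = np.array([[20.0, 35.0, np.nan],
                 [50.0, 80.0, 65.0]])

x_min = np.nanmin(data)  # 20.0, NaN ignored
x_max = np.nanmax(data)  # 80.0, NaN ignored

# scalar - array and array / scalar broadcast over the whole 2D array
norm = (data - x_min) / (x_max - x_min)

print(norm)
```

With these values the smallest reading maps to 0, the largest to 1, and 35, 50 and 65 map to 0.25, 0.5 and 0.75.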


6.  **Save Output:** Write the final, cleaned, and normalized NumPy array to a new CSV file named **`sensor_data_normalized.csv`**.

In [29]:
file = "sensor_data_normalized.csv"
np.savetxt(file,X_norm,delimiter=",",fmt='%.4f')
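
A quick way to check the saved file is to read it back. The sketch below assumes a small hypothetical array and writes to a temporary directory so it does not touch the real output. `np.savetxt` writes NaN as the text `nan`, which `np.loadtxt` parses back into NaN:

```python
import os
import tempfile

import numpy as np

# Hypothetical normalized array containing one NaN
norm = np.array([[0.0, 0.25, np.nan],
                 [0.5, 1.0, 0.75]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sensor_data_normalized.csv")
    # fmt='%.4f' keeps four decimal places, so values are rounded on disk
    np.savetxt(path, norm, delimiter=",", fmt="%.4f")
    reloaded = np.loadtxt(path, delimiter=",")

print(reloaded.shape)
print(np.isnan(reloaded).sum())
```

Because of the four-decimal format, a round trip on real data would match the in-memory array only to about `1e-4`.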