# August 18th Stage 6

## Stage 6: Comprehensive Data Preprocessing 

- Handle missing values
- Handle values outside the range we are looking to work in
- Data types (type corrections)
- Scaling/normalization (MinMaxScaler, StandardScaler) - if one feature is in the thousands and another is a small decimal, the distance between two points is dominated by the larger feature, so you scale the features first
- Reusable functions 
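
The scaling point in the list above can be sketched with a tiny made-up example. The numbers are assumptions for illustration only, not course data. The two formulas are written out by hand with NumPy; they correspond to what min-max scaling and standardization compute by default.

```python
import numpy as np

# Hypothetical data: column 0 is in the thousands, column 1 is a small decimal.
X = np.array([
    [1000.0, 0.10],
    [2000.0, 0.90],
    [3000.0, 0.50],
])

def euclidean(a, b):
    """Straight-line distance between two rows."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Unscaled: the distance is almost entirely the gap in column 0.
raw_dist = euclidean(X[0], X[1])

# Min-max scaling: (x - min) / (max - min), so each column lands in [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, so each column has mean 0 and std 1.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

scaled_dist = euclidean(X_minmax[0], X_minmax[1])

print(raw_dist)     # dominated by the 1000-unit gap in column 0
print(scaled_dist)  # both columns now contribute
```

After scaling, the second column is no longer drowned out by the first when distances are computed.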


In [1]:
import os
import pandas as pd 

In [3]:
csv_path = 'data/instructor_dirty.csv'
os.makedirs('data', exist_ok=True)

if not os.path.exists(csv_path):
    df_demo = pd.DataFrame({
        'numeric_col': [10, None, 40, 55, 70],
        'category_col': ['A', 'B', 'A', 'B', 'C'],
        'price': ['$100', '$200', '$150', None, '$250'],
        'date_str': ['2025-08-01','2025-08-02',None,'2025-08-04','2025-08-05'],
        'category': ['Electronics','Furniture','Toys','Clothing',None]
    })
    df_demo.to_csv(csv_path, index=False)
    print(f"Demo CSV created at {csv_path}")
else:
    # Load the existing file so df_demo is defined on every run
    df_demo = pd.read_csv(csv_path)
    print(f"CSV already exists at {csv_path}")

Demo CSV created at data/instructor_dirty.csv


Missing Data Handling:
- MCAR: Missing Completely At Random - missing with no particular pattern or reason. If you want to fill it, use the average of the data points, or drop it
- MAR: Missing At Random - the missingness depends on some other column, so it is not completely random
- MNAR: Missing Not At Random - the missingness depends on the missing value itself

The first question is: is there a pattern? Is there a reason for the gaps, or are they completely at random? Look at the other columns to see whether one of them explains the missingness, and consider what happens if we drop a column.
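
One rough way to start asking "is there a pattern?" is to count the gaps per column and then compare the missing rate across the values of another column. The sketch below uses a made-up DataFrame (the `group` and `score` columns are hypothetical); it is a quick check, not a formal test of MCAR vs. MAR.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "score": [1.0, 2.0, 3.0, None, None, 6.0],
})

# How many values are missing in each column?
missing_counts = df.isna().sum()

# Does the missing rate of `score` differ across `group`?
# A large difference hints that the gaps relate to another column (MAR-like)
# instead of being scattered completely at random.
missing_rate_by_group = df["score"].isna().groupby(df["group"]).mean()

print(missing_counts)
print(missing_rate_by_group)
```

Here every gap falls in group B, which suggests the missingness is tied to `group`.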

In [8]:
import pandas as pd  # data manipulation and analysis

import numpy as np  # numerical operations and array handling

import seaborn as sns  # statistical data visualizations

import matplotlib.pyplot as plt  # static plots

from sklearn.preprocessing import MinMaxScaler, StandardScaler  # scale features to a range / standardize them

import missingno as msno  # visualize missing-data patterns and relationships

Calling an imported function will not throw an error just because the function relies on a library you have not imported yourself.
If a function you import from module_A internally uses another library, say module_B,
you don't need to import module_B explicitly in your own script, as long as module_B is installed in your environment.

This is because module_A's own import statements run when module_A is first imported, and the function looks up
module_B in module_A's namespace, not in your notebook's. You only need to import the objects you directly reference
in your own code.
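
A small standard-library sketch of the same idea: `statistics` imports `fractions` for its own use, and this script never imports `fractions` itself. The choice of `statistics` is just a convenient example.

```python
import sys
from statistics import mean  # import only the name we use directly

# `statistics` imported `fractions` for itself when it was loaded,
# so `mean` works even though this script never imports `fractions`.
result = mean([10, 40, 55, 70])
print(result)

# The dependency is loaded in the interpreter...
print("fractions" in sys.modules)

# ...but whether the name `fractions` is usable here depends only on
# this script's own imports, which do not include it.
print("fractions" in globals())
```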

In [14]:
df_demo["numeric_col"]  # this is a Series, not a DataFrame

0    10.0
1     NaN
2    40.0
3    55.0
4    70.0
Name: numeric_col, dtype: float64

In [18]:
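df_demo.shape[0]  # number of rows
df_demo.shape[1]  # number of columns; df_demo is a DataFrame, so its shape is (5, 5)
# A Series such as df_demo["numeric_col"] has shape (5,), so shape[1] would raise IndexError
# To convert a Series to a DataFrame: df_demo["numeric_col"].to_frame()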

5
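
To make the Series vs. DataFrame shapes concrete, here is a minimal sketch. It uses a small stand-in frame (`df`, an assumed name) with two of the demo columns so that it runs on its own.

```python
import pandas as pd

# Small stand-in for df_demo, reusing two of its columns.
df = pd.DataFrame({
    "numeric_col": [10, None, 40, 55, 70],
    "category_col": ["A", "B", "A", "B", "C"],
})

col = df["numeric_col"]           # single brackets -> Series
print(df.shape)                   # (rows, columns)
print(col.shape)                  # one-element tuple: (rows,)

# Two ways to get a one-column DataFrame instead of a Series.
as_frame = col.to_frame()
also_frame = df[["numeric_col"]]  # double brackets -> DataFrame
print(as_frame.shape)
```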

In [25]:
r = np.random.rand(len(df_demo)) < 1  # rand() draws from [0, 1), so "< 1" is always True
r

array([ True,  True,  True,  True,  True])

This is called MASKING: a boolean array marks which values to replace.

Mask function - where the condition is True, the value becomes NaN
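
A minimal sketch of using a random boolean array as a mask to inject missing values. The threshold `0.3` and the seed are assumptions for illustration; the notebook cell above uses `< 1`, which marks every position.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator; the seed is arbitrary

s = pd.Series([10.0, 20.0, 40.0, 55.0, 70.0])

# Each position is True with probability of about 0.3 (an assumed rate).
r = rng.random(len(s)) < 0.3

# Where r is True the value becomes NaN; where r is False it is kept.
s_with_gaps = s.mask(r)

print(r)
print(s_with_gaps)
```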

The `mask()` function in pandas is used to **conditionally replace values** in a DataFrame. It works by applying a boolean condition and replacing the values where that condition is `True`.

### How it Works

  * **Condition**: You provide a boolean DataFrame or Series of the same shape as your data. This acts like a filter.
  * **Replacement**: For every cell where the condition is `True`, `mask()` replaces the original value with a new one. By default, this replacement value is `NaN` (Not a Number), but you can specify any value or even a callable function.
  * **Result**: The function returns a new DataFrame with the replaced values, leaving the original DataFrame unchanged unless you set the `inplace=True` parameter.

### Mask() vs. Where()

The `mask()` function is the **inverse** of the `where()` function.

  * **`mask()`** replaces values where the condition is **`True`**.
  * **`where()`** replaces values where the condition is **`False`**.

Think of it this way: `mask()` "hides" or "covers up" the data that meets your criteria, while `where()` "keeps" or "selects" the data that meets your criteria.
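
The inverse relationship can be checked directly: masking on a condition gives the same result as `where()` on the negated condition. A short sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [15, 25, 35]})
cond = df > 20

masked = df.mask(cond, 0)   # replace where cond is True
kept = df.where(~cond, 0)   # keep where ~cond is True, replace elsewhere

print(masked.equals(kept))
```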

### Simple Example

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [15, 25, 35]})

# Mask values greater than 20
masked_df = df.mask(df > 20, 0)

print(df)
print("\n")
print(masked_df)
```

**Output:**

```
    A   B
0  10  15
1  20  25
2  30  35


    A   B
0  10  15
1  20   0
2   0   0
```

Every value greater than 20 was replaced with 0.

`fillna()` - fills in the missing values with a value you choose
`ffill()` is also available - forward fill, which copies the last valid value forward into the gap
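
A minimal sketch of both fills, reusing the `numeric_col` values from the demo data. Filling with the mean is just one possible choice.

```python
import pandas as pd

s = pd.Series([10, None, 40, 55, 70], name="numeric_col")

filled_mean = s.fillna(s.mean())  # mean() skips NaN by default
filled_ffill = s.ffill()          # copy the previous valid value forward

print(filled_mean.tolist())
print(filled_ffill.tolist())
```

Note that forward fill cannot fill a gap in the very first row, because there is no earlier value to copy.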