# Clean and Sample Data with DataFrames

1. Clean and enrich data using Snowpark for Python

    1. Handle missing values

    2. Sample data

2. Perform aggregate and set-based operations on DataFrames

    1. Functions

For more information, see the following links:

1. [Handling missing values with Snowpark for Python — Part 1](https://medium.com/snowflake/handling-missing-values-with-snowpark-for-python-part-1-4af4285d24e6)

2. [snowflake.snowpark.DataFrame.dropna](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/snowpark/api/snowflake.snowpark.DataFrame.dropna)

3. [snowflake.snowpark.DataFrame.fillna](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/snowpark/api/snowflake.snowpark.DataFrame.fillna)

In [None]:
from snowflake.snowpark.context import get_active_session

session = get_active_session()
df = session.create_dataframe([
    [1.0, 1, "SE"],
    [float("nan"), 2, None],
    [float("nan"), 3, "DK"],
    [4.0, None, "SE"],
    [float("nan"), None, None]]
    ).to_df("a", "b", "c")
df

In [None]:
# Exclude all rows with a NULL/NaN in ANY column (how="any" is the default)
df.dropna()

In [None]:
# Exclude only rows with a NULL/NaN in EVERY column
df.dropna(how="all")

In [None]:
# Keep only rows with AT LEAST 2 non-missing values among columns A and C,
# i.e. exclude rows where either A or C is NULL/NaN (thresh overrides how)
df.dropna(subset=["a", "c"], thresh=2)
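
The three `dropna` calls above differ only in how many non-missing values a row needs in order to be kept. The plain-Python sketch below models that rule on the same five rows. It is an illustration of the documented behavior, not Snowpark's implementation, and `is_missing` and `keep_row` are hypothetical helper names introduced here.

```python
import math

# Same rows as the DataFrame built at the top of the notebook.
rows = [
    (1.0, 1, "SE"),
    (float("nan"), 2, None),
    (float("nan"), 3, "DK"),
    (4.0, None, "SE"),
    (float("nan"), None, None),
]
COLUMNS = ("a", "b", "c")


def is_missing(value):
    """Treat None (NULL) and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def keep_row(row, how="any", thresh=None, subset=COLUMNS):
    """Return True if the row survives the drop rule (hypothetical helper)."""
    values = [row[COLUMNS.index(name)] for name in subset]
    present = sum(not is_missing(v) for v in values)
    if thresh is None:
        # how="any" drops a row with any missing value, so all must be present;
        # how="all" drops a row only if everything is missing, so one suffices.
        thresh = len(values) if how == "any" else 1
    return present >= thresh


kept_any = [r for r in rows if keep_row(r)]
kept_all = [r for r in rows if keep_row(r, how="all")]
kept_thresh = [r for r in rows if keep_row(r, subset=("a", "c"), thresh=2)]
```

Under this model, `kept_any` holds only the first row, `kept_all` drops only the last row, and `kept_thresh` holds the first and fourth rows.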

In [None]:
# Replace missing values with 3; only columns whose data type is compatible with the value are filled
df.fillna(3)

In [None]:
# Replace missing values in specific columns
df.fillna({"a": 3.14, "c": "XY"})

In [None]:
# Replace a value everywhere it occurs, or only in the columns listed in subset
# (the replacement applies only to columns whose data type matches the value)
df.replace("SE", "NW", subset=["C"])

In [None]:
# Replace multiple values at once using a mapping of old value to new value
df.replace({1: 111, "SE": "NW"})

In [None]:
import snowflake.snowpark.functions as F

# with_column replaces the existing column A: the old A is dropped
# and the new A is appended as the last column
df.with_column("A", F.replace(F.col("A"), 1, 111))

In [None]:
# Replace NaN and NULL values in A with the mean of the non-missing values.
# NaN is mapped to NULL inside the average so that AVG skips it.
df.with_column("A",
    F.iff((F.col("A") == F.lit('NaN')) | (F.col("A").is_null()),
        F.avg(F.iff(F.col("A") == F.lit('NaN'), F.lit(None), F.col("A"))).over(),
        F.col("A")))
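
To make the expected result of the cell above concrete, here is a plain-Python sketch of the same mean imputation on column A's values. It only mirrors the logic (skip missing values when averaging, then fill them with that average) and is not how Snowflake evaluates the window function. `is_missing` is a hypothetical helper name.

```python
import math

# Values of column A from the example DataFrame.
a_values = [1.0, float("nan"), float("nan"), 4.0, float("nan")]


def is_missing(value):
    """Treat None (NULL) and NaN as missing."""
    return value is None or math.isnan(value)


# Average only the values that are present, like AVG over non-NULL inputs.
present = [v for v in a_values if not is_missing(v)]
mean_a = sum(present) / len(present)

# Fill every missing entry with that average and keep the rest unchanged.
filled = [mean_a if is_missing(v) else v for v in a_values]
```

With the sample data, the two present values 1.0 and 4.0 average to 2.5, so `filled` is `[1.0, 2.5, 2.5, 4.0, 2.5]`.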

In [None]:
# Replace NULLs in B with the column average by calling the built-in IFNULL function
df.with_column("B",
    F.call_builtin("ifnull", F.col("B"), F.avg(F.col("B")).over()))

In [None]:
# Replace NULLs in C with the most frequent value (mode)
df.with_column("C",
    F.call_builtin("ifnull", F.col("C"), F.mode(F.col("C")).over()))
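
The mode imputation above can be sketched in plain Python as follows. This is a minimal illustration of the idea on column C's values, assuming NULLs are ignored when counting. It does not reproduce how ties would be resolved in Snowflake.

```python
from collections import Counter

# Values of column C from the example DataFrame.
c_values = ["SE", None, "DK", "SE", None]

# Count only the non-missing values, then take the most common one.
counts = Counter(v for v in c_values if v is not None)
most_frequent, _ = counts.most_common(1)[0]

# Fill every None with the most frequent value.
filled_c = [most_frequent if v is None else v for v in c_values]
```

Here "SE" occurs twice and "DK" once, so both missing entries become "SE".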