<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://drive.google.com/uc?export=view&id=1LqkEgbpZj8A99Y9T59eBp9fEIZbpg6P2" alt="Snowflake Snowpark Classroom" style="width: 800px">
</div>

# Classroom 1.7 - Understanding Snowflake Snowpark Dataframes and their Methods

In this notebook, you will learn how to work with data using Snowpark DataFrames, with detailed hands-on practice covering their most commonly used methods.

## Learning Objectives

By the end of this classroom, you should be able to:
- Explain what a Snowflake Snowpark DataFrame is
- Distinguish between the types of operations available on Snowpark DataFrames
- Use the most frequently used Snowpark DataFrame methods

In [None]:
# Let's get started with Snowpark for Python
from assets.config import connection_builder
session = connection_builder()

## Section 1 - Getting Started with Snowflake Snowpark Dataframes

snowflake.snowpark.DataFrame represents a lazily-evaluated relational dataset that contains a collection of Row objects with columns defined by a schema (column name and type).

- A DataFrame is considered lazy because it encapsulates the computation or query required to produce a relational dataset
- The computation is not performed until you call a method that performs an action, such as `show()`, `collect()` or `count()`
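
To make the lazy behaviour concrete, here is a minimal pure-Python sketch. `LazyFrame` is a hypothetical toy class written for this classroom, not part of the Snowpark API; it only mimics the idea that transformations record steps and an action finally runs them.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not Snowpark)."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source data, never modified
        self._steps = tuple(steps)   # recorded transformations, not yet applied

    def filter(self, predicate):
        # Transformation: return a NEW frame with one more recorded step.
        return LazyFrame(self._rows, self._steps + (("filter", predicate),))

    def select(self, *names):
        # Transformation: again only recorded, nothing is computed here.
        return LazyFrame(self._rows, self._steps + (("select", names),))

    def collect(self):
        # Action: only now is the recorded plan applied to the data.
        rows = list(self._rows)
        for kind, arg in self._steps:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                rows = [{n: r[n] for n in arg} for r in rows]
        return rows


evaluated = []  # records every row the predicate actually inspects


def older_than_25(row):
    evaluated.append(row["name"])
    return row["age"] > 25


people = [{"name": "divyansh", "age": 22}, {"name": "archana", "age": 29}]
lazy = LazyFrame(people).filter(older_than_25).select("name")
calls_before_action = len(evaluated)  # the predicate has not been called yet
result = lazy.collect()               # the action triggers the whole plan
print(calls_before_action, result)
```

In this sketch the predicate is not called while the pipeline is being built; it runs only when `collect()` is invoked. Snowpark follows the same idea, except that the deferred work is a SQL query evaluated on the server.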

There are multiple ways to create a DataFrame using Snowpark:

1. Using `session.table()` to create a DataFrame from an existing table or view
2. Using the `session.read` property, which returns a DataFrameReader, to create a DataFrame from staged files
3. Creating a new DataFrame by applying transformations to existing DataFrames


In [None]:
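from snowflake.snowpark.types import StructField,StructType,StringType,IntegerType

# Uploading the local CSV file to the stage so that session.read can load it (the stage TEST_STAGE must already exist)
session.file.put(local_file_name='assets/resources/csv_dataset.csv',stage_location='@TEST_STAGE',auto_compress=False,overwrite=True)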

# Creating Dataframe using session.table
df_via_table = session.table('SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY')
df_via_table.show()

# Creating Dataframe using session.read property
df_via_stage_file = session.read.schema(StructType([StructField('Email No.',StringType()),
                                                    StructField('the',StringType()),
                                                    StructField('to',StringType())])).csv('@TEST_STAGE/csv_dataset.csv')
df_via_stage_file.show()

# Creating Dataframe by applying transformation on existing dataframe
# try_cast returns NULL for values that cannot be converted, so those rows are filtered out
df_via_transformation = df_via_stage_file.filter(df_via_stage_file.to.try_cast(IntegerType()) > 10)
df_via_transformation.show()


## Section 2 - Types of Operations on a Dataframe 

The operations on DataFrame can be divided into two types:

- **Transformations** produce a new DataFrame from one or more existing DataFrames. Note that transformations are lazy and don’t cause the DataFrame to be evaluated.

- **Actions** cause the DataFrame to be evaluated. When you call a method that performs an action, Snowpark sends the SQL query for the DataFrame to the server for evaluation.
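
As a rough illustration of this split, the toy below builds a SQL string step by step and only "sends" it when an action is called. `QueryFrame`, `FakeServer` and the `employees` table are hypothetical names, and the SQL shown is a simplification, not the exact text Snowpark generates.

```python
class FakeServer:
    """Hypothetical stand-in for Snowflake; it just records the SQL it receives."""

    def __init__(self):
        self.received = []

    def run(self, sql):
        self.received.append(sql)
        return []  # a real server would return rows here


class QueryFrame:
    """Toy frame that holds a SQL string instead of data (not the Snowpark API)."""

    def __init__(self, server, sql):
        self.server = server
        self.sql = sql

    def filter(self, condition):
        # Transformation: wrap the current query in a new one, contact nobody.
        return QueryFrame(self.server, f"SELECT * FROM ({self.sql}) WHERE {condition}")

    def select(self, *cols):
        # Transformation: same pattern, the original frame is left unchanged.
        return QueryFrame(self.server, f"SELECT {', '.join(cols)} FROM ({self.sql})")

    def collect(self):
        # Action: the accumulated SQL is sent to the server once.
        return self.server.run(self.sql)


server = FakeServer()
base = QueryFrame(server, "SELECT * FROM employees")
query = base.filter("age > 25").select("name", "salary")
sent_before = len(server.received)  # no query has been sent so far
query.collect()
print(sent_before, server.received[0])
```

Each transformation returns a new object and leaves `base` untouched, which is why the same starting DataFrame can safely feed several different pipelines.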

See the Snowpark for Python documentation for use cases covering transformations and actions on a DataFrame: [Snowflake.Snowpark.DataFrame](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/api/snowflake.snowpark.DataFrame)

---

Complete List of All DataFrame Methods - [Here](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/api/snowflake.snowpark.DataFrame)
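
One of the methods exercised in the next cell, `dropna`, takes `how`, `thresh` and `subset` arguments whose interplay is easy to misread. The plain-Python sketch below mirrors the rule that a row is kept only when it has at least a given number of non-null, non-NaN values in the chosen columns. `is_missing` and `keep_row` are hypothetical helpers written for illustration, not Snowpark functions.

```python
import math


def is_missing(value):
    # Treat both None and float NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


def keep_row(row, subset, how="any", thresh=None):
    """Hypothetical helper: True if a dropna-style rule would keep the row."""
    present = sum(not is_missing(row[c]) for c in subset)
    if thresh is None:
        # how='any' needs every column present; how='all' needs at least one.
        thresh = len(subset) if how == "any" else 1
    return present >= thresh


columns = ["name", "age", "salary"]
rows = [
    {"name": "divyansh", "age": 22, "salary": 150.0},
    {"name": "archana", "age": 29, "salary": None},
    {"name": None, "age": None, "salary": 190.25},
    {"name": None, "age": None, "salary": float("nan")},
]

kept_any = [r["name"] for r in rows if keep_row(r, columns, how="any")]
kept_all = [r["name"] for r in rows if keep_row(r, columns, how="all")]
kept_thresh2 = [r["name"] for r in rows if keep_row(r, columns, thresh=2)]
print(kept_any, kept_all, kept_thresh2)
```

Under this rule, `how='any'` behaves like a threshold equal to the number of columns checked, and `how='all'` behaves like a threshold of 1.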


In [None]:
from snowflake.snowpark.types import StructType, StructField, IntegerType,StringType,FloatType
schema = StructType([StructField('name',StringType()),
                     StructField('age',IntegerType()),
                     StructField('salary',FloatType())
                     ])
dataset = [['divyansh',22,150.00],
           ['piyush',22,130.5],
           ['piyush',28,160.85],
           ['archana',29,190.25],
           ['archana',29,190.25],
           ['archana',29,None],
           [None,None,190.25],
           [None,None,float('nan')]]
df = session.createDataFrame(data=dataset,schema=schema)


# Hands-on for Aggregate Functions
print('----------------- Aggregate Functions on DataFrame using dataFrame.agg() -----------------')
df.group_by(df.age).agg((df.age,'max'),(df.salary,'median')).show()

# Hands-on for Caching the DataFrame Result
print('----------------- Caching DataFrame using dataframe.cache_result() and dataframe.drop_table()  -----------------')
df_qh = session.table('SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY')
df_qh_cache = df_qh.cache_result() # Materializes the result into a temporary table
df_qh_cache.delete(df_qh_cache.user_name=='DIVYANSH') # Deletes rows from the cached copy only, not from the source view
print(df_qh_cache.count()) # Row count of the cached copy after the delete
print(df_qh.count()) # Row count of the source, which is unaffected
df_qh_cache2 = df_qh_cache.cache_result() # A second, independent cached copy
df_qh_cache.drop_table() # Dropping the first cached result
df_qh_cache2.show() # Still works because it reads from its own cached copy

# DataFrame.count() To get the number of rows in the result 
print('----------------- Generating Number of Rows in Dataframe Result using dataframe.count()  -----------------')
print(df_qh_cache2.count())

# DataFrame.createOrReplaceTempView() To create a temporary view of the dataframe result 
print('----------------- Creating a temp View from Dataframe Result using dataframe.createOrReplaceTempView()  -----------------')
df_qh_cache2.createOrReplaceTempView('df_qh_vw')
session.sql('select * from df_qh_vw').show()

# Joins Using Snowpark DataFrame
print('----------------- Joins Example Using Snowpark Dataframe -----------------')
df.crossJoin(df).show(100)

df_test = df.fillna({'name':'test'})
df.join(df_test,df.name == df_test.name,'inner').show()

# Computing basic statistics for numeric columns, which includes count, mean, stddev, min, and max Using DataFrame.describe()
print('----------------- Computing Basic Statistics using Dataframe.describe() -----------------')
df.describe().show()

# Generating Dataframe with distinct values from the current DataFrame using dataframe.distinct()
print('----------------- Distinct Dataframe results using Dataframe.distinct() -----------------')
df.distinct().show()

# Dropping Columns from a dataframe using dataframe.drop()
print('----------------- Dropping Age Column from DataFrame using Dataframe.drop() -----------------')
df.drop('age').show()

# Dropping Duplicates on given subset of columns from a dataframe using dataframe.dropDuplicates().
# The result is non-deterministic when removing duplicated rows from the subset of columns but not all columns.
print('----------------- Dropping Duplicates from DataFrame using Dataframe.dropDuplicates() -----------------')
df.dropDuplicates('name','age').show()

# Dropping Nulls/NaNs using DataFrame.dropna(), which drops rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.
print('----------------- Dropping Nulls/NaNs from DataFrame using Dataframe.dropna() -----------------')
df.dropna(how='any').show() # Drops rows that have a null/NaN in any column
df.dropna(thresh=1).show() # Keeps rows that have at least 1 non-null, non-NaN value
df.dropna(subset=['name']).show() # Only checks the name column

# Subtracting dataframes using dataframe.minus() or dataframe.subtract()
print('----------------- Generating Difference in the dataframes using Dataframe.minus() or Dataframe.subtract() -----------------')
df2 = df.dropna(how='any')
df.minus(df2).show() 

# Printing the list of queries that will be executed to evaluate this DataFrame using dataframe.explain()
print('----------------- Printing the list of queries that will be executed to evaluate dataframes using Dataframe.explain() -----------------')
df_qh.explain()

# Replacing all null and NaN values in the specified columns with the values provided using Dataframe.fillna()
print('----------------- Replacing all null and NaN values in the specified columns with the values provided using Dataframe.fillna() -----------------')
df.fillna({'name':'test','age':10,'salary':100.0}).show()

# Filter rows based on the specified conditional expression using dataframe.filter() or dataframe.where()
print('----------------- Filter rows based on the specified conditional expression -----------------')
df.where((df.name.isin('divyansh','piyush') | (df.salary > 170.0))).show()

# Generate Intersection of rows from 2 dataframe using dataframe.intersect()
print('----------------- Intersect 2 Dataframes using dataframe.intersect() -----------------')
df.intersect(df2).show()

# Executing the query representing a DataFrame and returning an iterator of Row objects using dataframe.toLocalIterator().
print('----------------- Returning an iterator of Row objects using dataframe.toLocalIterator() -----------------')
for row in df.toLocalIterator():
    print('name = ',row[0])
    print(' -age = ',row[1])
    print(' -salary = ',row[2])

# Snowpark Dataframe to Pandas DataFrame
# Unlike to_pandas(), the to_pandas_batches() method returns an iterator of pandas DataFrames and does not load all data into memory at once
print('----------------- Converting Snowpark Dataframe to Pandas using dataframe.to_pandas() / dataframe.to_pandas_batches() -----------------')
df_pd = df.toPandas()
display(df_pd)
for pandas_df in df.to_pandas_batches():
    print(pandas_df)

   
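# Hands-on for Union / UnionAll / UnionByName / UnionAllByName
print('----------------- Union Examples using Snowpark DataFrame -----------------')
df.union(df2).show() # Matches columns by position and removes duplicate rows
df.unionAll(df2).show() # Matches columns by position and keeps duplicate rows
df.unionByName(df2).show() # Matches columns by name and removes duplicate rows
df.unionAllByName(df2).show() # Matches columns by name and keeps duplicate rows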

# Introducing additional columns with the specified names using dataframe.withColumn() and dataframe.with_columns()
print('----------------- Introducing additional columns with the specified names  -----------------')
df.withColumn('sum_age_sal',df.salary + df.age).show()
df.with_columns(['Salary By Age', '10pct Promotion'],[df.salary/df.age,df.salary * 1.1]).show()

# Recap of the drop methods on the dataframe
print('----------------- Dropping rows with a null value in any column -----------------')
df.dropna(how='any').show()
print('----------------- Dropping rows with null values in all columns -----------------')
df.dropna(how='all').show()
print('----------------- Dropping duplicate rows -----------------')
df.dropDuplicates().show()
