![Join us](https://github.com/Resheras/WoS/blob/a69a0322c359c6f4aaee7cedcd788c13a839a14e/Welcome_WoS.png?raw=true)

Isa: Placeholder for first slide: Welcome from Women on Snowflake
![Join us](https://github.com/Resheras/WoS/blob/main/Join_WoS.png?raw=true)

Isa: End data disparity (link to the whitepaper). What data disparity is: the salary gap and economic inequality, and bias in data and decision-making. The role of data in uncovering inequalities. How about we show you how to analyze data to discover the bias?
https://www.snowflake.com/en/company/overview/end-data-disparity/
![End data disparity](https://raw.githubusercontent.com/Resheras/WoS/refs/heads/main/end_dd.png)


## Instead of %pip or 🍺brew or 🟢conda - add package in dropdown

Notebooks come with pre-installed common Python libraries, such as numpy, pandas, matplotlib, and more! 

For the purpose of this demo, let's add the `matplotlib` and `scipy` packages from the package picker⬆️⬆️⬆️.

In [None]:
# Import Python packages used in this notebook
import streamlit as st
import altair as alt

# Pre-installed libraries that come with the notebook
import pandas as pd
import numpy as np

# Package that we just added
import matplotlib.pyplot as plt

## Let's start with a SQL cell 

Snowflake Notebooks allow us to switch between different languages. 

In [None]:
SELECT * 
FROM SALARY_DATA

If you are interested in the data we used, you can find it in Kaggle: 
* https://www.kaggle.com/datasets/nilimajauhari/glassdoor-analyze-gender-pay-gap

## Back to Python 🐍

You can give each cell a name and refer to its output in subsequent cells.

We can access the SQL results directly in Python and convert the results to a pandas dataframe. 🐼

```python
# Access the SQL cell output as a Snowpark dataframe
my_snowpark_df = Select_star.to_df()
``` 

```python
# Convert a SQL cell output into a pandas dataframe
my_df = Select_star.to_pandas()
``` 

In [None]:
df = Select_star.to_pandas()
df["TOTAL_SALARY"]=df["BASEPAY"]+df["BONUS"]

In [None]:
df.describe()

## Working with data using Snowpark 🛠️

In addition to using your favorite Python data science libraries, you can also use the [Snowpark API](https://docs.snowflake.com/en/developer-guide/snowpark/index) to query and process your data at scale within the Notebook. 

First, you can get your session variable directly through the active notebook session. The session variable is the entrypoint that gives you access to using Snowflake's Python API.

In [None]:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Add a query tag to the session. This helps with troubleshooting and performance monitoring.
session.query_tag = {"origin":"sf_sit-is", 
                     "name":"notebook_demo_pack", 
                     "version":{"major":1, "minor":0},
                     "attributes":{"is_quickstart":1, "source":"notebook", "vignette":"my_first_notebook"}}

For example, we can use Snowpark to save our pandas dataframe back to a table in Snowflake. 💾

In [None]:
session.write_pandas(df,"SALARY_NEW_TABLE",auto_create_table=True, table_type="temp")

Now that the `SALARY_NEW_TABLE` table has been created, we can go the other way around 🔄 and load it back as a Snowpark dataframe:

```python
df = session.table("<DATABASE_NAME>.<SCHEMA_NAME>.<TABLE_NAME>")
```

If your session is already set to the database and schema for the table you want to access, then you can reference the table name directly.

In [None]:
new_df = session.table("SALARY_NEW_TABLE")

Once we have loaded the table, we can call Snowpark's [`describe`](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/snowpark/api/snowflake.snowpark.DataFrame.describe) to compute basic descriptive statistics. 

In [None]:
new_df.describe()

## 📊 Visualize your data

We can use [Altair](https://altair-viz.github.io/) to easily visualize our data distribution as a histogram.

Let's say that you want to customize your chart and plot the kernel density estimate (KDE) and median. We can use matplotlib to plot the total salary distribution. Note that `.plot(kind="kde")` uses `scipy` under the hood to compute the KDE profile, which is why we added it as a package earlier in this tutorial.

In [None]:
fig, ax = plt.subplots(figsize = (6,3))
plt.tick_params(left = False, right = False , labelleft = False) 

min_salary = min(df['TOTAL_SALARY'])
max_salary = max(df['TOTAL_SALARY'])
bin_size = 10000

salaries = df["TOTAL_SALARY"]
salaries.plot(kind = "hist", density = True, bins = range(min_salary, max_salary + bin_size, bin_size))
salaries.plot(kind="kde", color='#c44e52')


# Mark the median with a dashed vertical line
median = salaries.median()
ax.axvline(median,0, color='#dd8452', ls='--')
ax.text(median,0.8, f'Median: {median:.2f}  ',
        ha='right', va='center', color='#dd8452', transform=ax.get_xaxis_transform())

# Make our chart pretty
plt.style.use("bmh")
plt.title("Total Salary Distribution")
plt.xlabel("Total Salary (binned)")
_, right = plt.xlim()
plt.xlim((0, right))
# Remove ticks and spines
ax.tick_params(left = False, bottom = False)
# Iterate over the spine objects only, so that `ax` is not overwritten by the loop
for spine in ax.spines.values():
    spine.set_visible(False)

plt.show()
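
The `range(min_salary, max_salary + bin_size, bin_size)` call in the cell above only works when the salary values are integers. Here is a minimal sketch of a fallback, assuming the salaries could also come back as floats. `salary_bin_edges` is a hypothetical helper that is not part of the notebook, and unlike the original it aligns the edges to multiples of `bin_size` rather than starting at the minimum salary.

```python
import math


def salary_bin_edges(min_salary, max_salary, bin_size=10000):
    """Return histogram bin edges aligned to multiples of bin_size.

    Hypothetical helper for illustration; works for int or float salaries.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be a positive integer")
    # Round the lower bound down and the upper bound up to the nearest edge
    start = math.floor(min_salary / bin_size) * bin_size
    stop = math.ceil(max_salary / bin_size) * bin_size
    # Guarantee at least one bin when all salaries sit on the same edge
    if stop <= start:
        stop = start + bin_size
    # range() excludes its end point, so extend by one bin to include `stop`
    return list(range(int(start), int(stop) + bin_size, bin_size))


edges = salary_bin_edges(36000.5, 61000.0)
print(edges)
```

The resulting list could be passed as `bins=` to the histogram call in place of the `range(...)` expression.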

Good start, but we want to compare the data by gender. Let's overlay one histogram per gender using `matplotlib.pyplot` (imported as `plt`):

In [None]:
# Parameters
#bin_size = 10000

# Split the DataFrame by GENDER
male_salaries = df[df['GENDER'] == 'Male']['TOTAL_SALARY']
female_salaries = df[df['GENDER'] == 'Female']['TOTAL_SALARY']

# Determine the range of bins
# (reuses min_salary, max_salary and bin_size from the previous cell)
bins = range(min_salary, max_salary + bin_size, bin_size)

# Plot histograms
plt.figure(figsize=(6, 4))

plt.hist(male_salaries, bins=bins, alpha=0.5, label='Male', color='green', edgecolor='black')
plt.hist(female_salaries, bins=bins, alpha=0.5, label='Female', color='blue', edgecolor='black')

# Add labels and legend
plt.title('Salary Distribution by Gender')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.legend(bbox_to_anchor=(1, 1))
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.tight_layout()
plt.show()
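
Overlaid histograms show the shape of each distribution, but a single number per group is easier to compare. Here is a small sketch that computes the median total salary per gender. It uses only the standard library and a few made-up rows (not the Glassdoor data), and `median_salary_by_gender` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median


def median_salary_by_gender(rows):
    """Group rows by GENDER and return the median TOTAL_SALARY per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["GENDER"]].append(row["TOTAL_SALARY"])
    return {gender: median(values) for gender, values in groups.items()}


# Made-up illustrative rows, not taken from the dataset
sample_rows = [
    {"GENDER": "Female", "TOTAL_SALARY": 90000},
    {"GENDER": "Female", "TOTAL_SALARY": 100000},
    {"GENDER": "Female", "TOTAL_SALARY": 110000},
    {"GENDER": "Male", "TOTAL_SALARY": 95000},
    {"GENDER": "Male", "TOTAL_SALARY": 105000},
    {"GENDER": "Male", "TOTAL_SALARY": 120000},
]

medians = median_salary_by_gender(sample_rows)
print(medians)
```

Inside the notebook, the same comparison on the pandas dataframe would be `df.groupby("GENDER")["TOTAL_SALARY"].median()`.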

I wonder: how does it look at different levels of seniority?

## Using Python variables in SQL cells 🔖

You can use the Jinja syntax `{{..}}` to refer to Python variables within your SQL queries as follows. 

```python
threshold = 5
```

```sql
-- Reference Python variable in SQL
SELECT * FROM SALARY_NEW_TABLE where SENIORITY > {{threshold}}
```

Let's put this into practice by filtering the salary table with a seniority threshold that we set in Python.

In [None]:
seniority = 3 

In [None]:
SELECT * FROM SALARY_NEW_TABLE where SENIORITY > {{seniority}}
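
To make the substitution concrete, here is a rough sketch of what filling in `{{seniority}}` amounts to. It uses a simple regular expression rather than Jinja itself, and `render_sql` is a hypothetical helper: this is only an illustration of the idea, not how Snowflake Notebooks implement it.

```python
import re


def render_sql(template, variables):
    """Replace each {{name}} placeholder with the value of variables[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable in SQL template: {name}")
        return str(variables[name])

    # Match {{name}} with optional whitespace inside the braces
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


seniority = 3
query = render_sql(
    "SELECT * FROM SALARY_NEW_TABLE WHERE SENIORITY > {{seniority}}",
    {"seniority": seniority},
)
print(query)
```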

## Creating an interactive app with Streamlit 🪄

Putting this all together, let's build a Streamlit app to explore how the seniority level affects the shape of the salary distribution histogram.

In [None]:
import streamlit as st
st.markdown("# Move the slider to adjust and watch the results update! 👇")
col1, col2 = st.columns(2)
with col1:
    seniority = st.slider('SENIORITY level',1,5,1) 

# Read table from Snowpark and plot the results
df = session.sql(
    f"""
    SELECT * FROM SALARY_NEW_TABLE where SENIORITY = {seniority};
    """
    ).to_pandas()

# Split the DataFrame by GENDER
male_salaries = df[df['GENDER'] == 'Male']['TOTAL_SALARY']
female_salaries = df[df['GENDER'] == 'Female']['TOTAL_SALARY']

# Determine the range of bins
min_salary = min(df['TOTAL_SALARY'])
max_salary = max(df['TOTAL_SALARY'])
bins = range(min_salary, max_salary + bin_size, bin_size)

# Plot histograms
plt.figure(figsize=(6, 4))

plt.hist(male_salaries, bins=bins, alpha=0.5, label='Male', color='green', edgecolor='black')
plt.hist(female_salaries, bins=bins, alpha=0.5, label='Female', color='blue', edgecolor='black')

# Add labels and legend
plt.title('Salary Distribution by Gender')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.legend(bbox_to_anchor=(1, 1))
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.tight_layout()
plt.show()

df.describe()
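
The app above interpolates the slider value straight into the SQL string. That is fine for a slider that only yields integers, but if the value ever came from free-form input it would be worth validating first. Here is a minimal sketch of that check, assuming the same 1 to 5 seniority range as the slider. `build_seniority_query` is a hypothetical helper, not part of Snowpark or Streamlit.

```python
def build_seniority_query(seniority, table="SALARY_NEW_TABLE"):
    """Return the filter query after checking that seniority is an integer in 1..5."""
    level = int(seniority)  # raises ValueError for non-numeric input
    if not 1 <= level <= 5:
        raise ValueError(f"seniority must be between 1 and 5, got {level}")
    return f"SELECT * FROM {table} WHERE SENIORITY = {level}"


query = build_seniority_query(3)
print(query)
```

In the notebook, the returned string could be passed to `session.sql(...)` in place of the inline f-string.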


In [None]:
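-- Average total salary (base pay + bonus) per job title and gender
WITH agg AS (
    SELECT JOBTITLE, GENDER, AVG(BASEPAY + BONUS) AS TOTAL_SALARY
    FROM SALARY_DATA
    GROUP BY 1, 2
    ORDER BY 1
),
-- One row per job title, with the genders pivoted into columns F and M
pvt AS (
    SELECT *
    FROM agg
    PIVOT (SUM(TOTAL_SALARY) FOR GENDER IN (ANY ORDER BY GENDER)) AS G (JOB, F, M)
)
-- Job titles where men earn more on average, largest gap first
SELECT *, M - F AS gap
FROM pvt
WHERE F < M
ORDER BY gap DESC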
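
For readers who prefer Python, the same logic can be sketched without SQL: average the total salary per job title and gender, then keep the job titles where the male average exceeds the female average, largest gap first. The rows below are made-up illustrative values (not the Glassdoor data) and `pay_gap_by_job` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def pay_gap_by_job(rows):
    """Return (job, female_avg, male_avg, gap) tuples where female_avg < male_avg."""
    totals = defaultdict(list)
    for row in rows:
        key = (row["JOBTITLE"], row["GENDER"])
        totals[key].append(row["BASEPAY"] + row["BONUS"])
    averages = {key: mean(values) for key, values in totals.items()}

    gaps = []
    for job in sorted({job for job, _ in averages}):
        female = averages.get((job, "Female"))
        male = averages.get((job, "Male"))
        # Skip job titles that lack one of the two groups, as F < M would in SQL
        if female is not None and male is not None and female < male:
            gaps.append((job, female, male, male - female))
    # Largest gap first, matching ORDER BY gap DESC
    return sorted(gaps, key=lambda item: item[3], reverse=True)


# Made-up illustrative rows
sample_rows = [
    {"JOBTITLE": "Data Scientist", "GENDER": "Female", "BASEPAY": 90000, "BONUS": 5000},
    {"JOBTITLE": "Data Scientist", "GENDER": "Male", "BASEPAY": 100000, "BONUS": 6000},
    {"JOBTITLE": "Driver", "GENDER": "Female", "BASEPAY": 60000, "BONUS": 2000},
    {"JOBTITLE": "Driver", "GENDER": "Female", "BASEPAY": 64000, "BONUS": 2000},
    {"JOBTITLE": "Driver", "GENDER": "Male", "BASEPAY": 70000, "BONUS": 3000},
    {"JOBTITLE": "Manager", "GENDER": "Female", "BASEPAY": 120000, "BONUS": 10000},
    {"JOBTITLE": "Manager", "GENDER": "Male", "BASEPAY": 115000, "BONUS": 9000},
]

gaps = pay_gap_by_job(sample_rows)
for job, female_avg, male_avg, gap in gaps:
    print(job, female_avg, male_avg, gap)
```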