# Basic data wrangling
In this notebook I decided to use polars instead of pandas. You'll have to trust me that polars is faster than pandas when it comes to data wrangling, especially on larger datasets!

In [None]:
import pyodbc
import polars as pl
from dotenv import load_dotenv # loads environment variables from a .env file (only needed if you keep your credentials in one)
import os

In [None]:
# Only needed when using the .env file
load_dotenv()

In [None]:
f_server = os.getenv("SERVER_Fabric")
f_database = os.getenv("DB_Fabric")
f_uid = os.getenv("UID_Fabric")
f_pwd = os.getenv("PWD_Fabric")
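
One thing to keep in mind: `os.getenv` returns `None` when a variable is missing, so a typo in the `.env` file only shows up later as a confusing connection error. Below is a minimal sketch of a helper that fails early instead. The name `require_env` and the `DEMO_SERVER` variable are made up for this illustration and are not part of the notebook's setup.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty string
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demo with a made-up variable so the snippet runs without real credentials
os.environ["DEMO_SERVER"] = "example-server"
print(require_env("DEMO_SERVER"))
```

The same helper could be used in place of the plain `os.getenv` calls above, for example `require_env("SERVER_Fabric")`.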

In [None]:
fabric_conn = pyodbc.connect(f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={f_server};DATABASE={f_database};Authentication=ActiveDirectoryPassword;UID={f_uid};PWD={f_pwd};ConnectionTimeout=60')

In [None]:
tableresult = pl.read_database("SELECT * FROM Training_DF_Youri", fabric_conn)

Each 'package' has its own syntax. You can check the documentation to see how it works. For polars, it's https://docs.pola.rs/.

Let's start with some basic operations, such as adding, removing and filtering.

In [None]:
tableresult.head()

In [None]:
added_column = tableresult.with_columns(
    (pl.col("Price") * 0.5).alias("BlackFridayPrice"),  # This adds a new column with the price divided in half
    Country=pl.lit("Netherlands"))                      # This also adds a new column, using different syntax, as Country

In [None]:
added_column.head()

As you can see, we have two new columns: one created by multiplying an existing column and one by adding a literal value. It's good to realise that our original dataframe `tableresult` is still the same. This is because `with_columns` does not change `tableresult` in place; it returns a new dataframe, which we stored as `added_column`.

In [None]:
tableresult.head()

We can also create functions to perform basic operations for us. When you're just starting out, it might feel challenging to identify when to use a function. A good starting point is to look for repetitive code in your program. Whenever you find yourself writing the same or very similar code multiple times, consider putting it into a function. This way, you can reuse the function without rewriting the code, making your programs more efficient and easier to maintain.

In [None]:
# Without a function: repetitive code
toys_sales = tableresult.filter(pl.col("Category") == "Toys").select(pl.col("TotalAmount").sum()).to_numpy()[0][0]
clothing_sales = tableresult.filter(pl.col("Category") == "Clothing").select(pl.col("TotalAmount").sum()).to_numpy()[0][0]

In [None]:
print(f"Total toy sales are: {toys_sales}")
print(f"Total clothing sales are: {clothing_sales}")

In [None]:
# With a function: reusable and concise
def calculate_sales(dataframe, category):
    """
    Filters the DataFrame by the given category and calculates the total sales.
    """
    return (
        dataframe
        .filter(pl.col("Category") == category)
        .select(pl.col("TotalAmount").sum())
        .to_numpy()[0][0]  # Extract the scalar result
    )

In [None]:
# Using the function
toys_sales_function = calculate_sales(tableresult, "Toys")
clothing_sales_function = calculate_sales(tableresult, "Clothing")

In [None]:
print(f"Total toy sales are: {toys_sales_function}")
print(f"Total clothing sales are: {clothing_sales_function}")