## Day 5 project tasks, learning:

- Pandas Filtering
- Pandas Sorting
- Pandas Grouping
- Pandas Aggregation

The goal is to recreate, in pandas, the queries that I would have written in SQL.

Bonus items today: I wanted a better way to see the results than just a subset in the terminal. First I learned to export data to an HTML file so I could look at it in a web browser. Then I switched to VS Code's Jupyter extension (notebooks), which makes it easy to log my work while seeing the results.

Get sample data for the testing process:

In [None]:
import pandas as pd

# Load Sales Sample Data from Kaggle
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')
salesdf.info()

SQL to Convert: SELECT TOP 10 * FROM table ORDER BY column DESC

Getting the 10 rows with the highest PRICEEACH, and then the 10 with the lowest

In [None]:
import pandas as pd

# Load Sales Sample Data
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')

# Top 10 rows by PRICEEACH (the equivalent of ORDER BY PRICEEACH DESC with TOP 10)
largest = salesdf.nlargest(10, 'PRICEEACH')
print(largest[['CUSTOMERNAME', 'PRICEEACH']].reset_index(drop=True))

smallest = salesdf.nsmallest(10, 'PRICEEACH')
print(smallest[['CUSTOMERNAME', 'PRICEEACH']].reset_index(drop=True))
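
A possible alternative that reads closer to the SQL is `sort_values` followed by `head`. The sketch below uses a tiny made-up table (the names and prices are illustrative, not taken from the Kaggle file) and takes the top 3 instead of the top 10 to keep the output short.

```python
import pandas as pd

# Made-up stand-in for the sales data; values are illustrative only
demo = pd.DataFrame({
    'CUSTOMERNAME': ['A', 'B', 'C', 'D', 'E'],
    'PRICEEACH': [95.7, 100.0, 26.9, 100.0, 48.1],
})

# ORDER BY PRICEEACH DESC, then keep the first 3 rows (TOP 3)
top3_sorted = demo.sort_values('PRICEEACH', ascending=False).head(3)

# nlargest does the sort and the limit in one call
top3_nlargest = demo.nlargest(3, 'PRICEEACH')

print(top3_sorted.reset_index(drop=True))
print(top3_nlargest.reset_index(drop=True))
```

Both approaches return the same price values here. `sort_values` is handy when the SQL orders by several columns, since it accepts a list of column names.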

SQL to Convert: Complex conditions

SQL: WHERE col1 = 'MA' AND col2 < 1000 OR col4 = 'Disputed'

In [None]:
import pandas as pd

# Load Sales Sample Data
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')

salesdf['total_sale'] = salesdf['PRICEEACH'] * salesdf['QUANTITYORDERED']
# Each comparison needs its own parentheses because & and | bind tighter than == and <
condition = ((salesdf['STATE'] == 'MA') & (salesdf['total_sale'] < 1000)) | (salesdf['STATUS'] == 'Disputed')
filtered_salesdf = salesdf[condition]
# Display filtered DataFrame
print(filtered_salesdf[['STATE','total_sale', 'CUSTOMERNAME', 'STATUS']].reset_index(drop=True))
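
The same WHERE clause can probably be written with `DataFrame.query`, which accepts `and`/`or` and looks closer to SQL. This is only a sketch on a small made-up table (the rows are invented for illustration); it compares the boolean-mask version with the `query` version.

```python
import pandas as pd

# Made-up rows; only the column names mirror the sales data
demo = pd.DataFrame({
    'STATE': ['MA', 'MA', 'CA', 'NY'],
    'total_sale': [500.0, 2500.0, 800.0, 3000.0],
    'STATUS': ['Shipped', 'Shipped', 'Disputed', 'Shipped'],
})

# Boolean-mask version: (STATE = 'MA' AND total_sale < 1000) OR STATUS = 'Disputed'
mask = ((demo['STATE'] == 'MA') & (demo['total_sale'] < 1000)) | (demo['STATUS'] == 'Disputed')
by_mask = demo[mask]

# query() version of the same condition
by_query = demo.query("(STATE == 'MA' and total_sale < 1000) or STATUS == 'Disputed'")

print(by_mask)
print(by_query)
```

As in SQL, AND is evaluated before OR, so the explicit parentheses in the `query` string are there for readability rather than necessity.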


SQL - Working with Dates:

In [None]:
import pandas as pd

# Load Sales Sample Data
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')

# Convert ORDERDATE to datetime
salesdf['ORDERDATE'] = pd.to_datetime(salesdf['ORDERDATE'])

# Filter: Year 2004 AND Month 12 (December)
sales_date = salesdf[
    (salesdf['ORDERDATE'].dt.year == 2004) &
    (salesdf['ORDERDATE'].dt.month == 12)]

# Display the ORDERDATE and CUSTOMERNAME columns, with clean index
print(sales_date[['ORDERDATE','CUSTOMERNAME']].reset_index(drop=True))
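
Another way to express the December 2004 filter is as a date range, similar to `WHERE ORDERDATE >= '2004-12-01' AND ORDERDATE < '2005-01-01'` in SQL. The sketch below assumes the dates are stored as text like `12/1/2004 0:00`; the rows themselves are made up for illustration.

```python
import pandas as pd

# Made-up rows; the date text format is an assumption about the source file
demo = pd.DataFrame({
    'ORDERDATE': ['11/30/2004 0:00', '12/1/2004 0:00', '12/17/2004 0:00', '1/6/2005 0:00'],
    'CUSTOMERNAME': ['A', 'B', 'C', 'D'],
})

# Passing an explicit format avoids ambiguity between month/day and day/month
demo['ORDERDATE'] = pd.to_datetime(demo['ORDERDATE'], format='%m/%d/%Y %H:%M')

# Half-open range: from Dec 1 (inclusive) up to Jan 1 (exclusive)
december = demo[(demo['ORDERDATE'] >= '2004-12-01') & (demo['ORDERDATE'] < '2005-01-01')]

print(december.reset_index(drop=True))
```

The half-open range keeps any order placed later in the day on December 31, which a closed range ending at `'2004-12-31'` would miss.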

SQL - Working with Strings:

In [None]:
import pandas as pd

# Load Sales Sample Data
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')

# Find strings that start with 'A'
print(salesdf[salesdf['CUSTOMERNAME'].str.startswith('A')][['CUSTOMERNAME', 'TERRITORY']].reset_index(drop=True))

# Find strings that end with 'Inc.'
filter_Inc = salesdf[salesdf['CUSTOMERNAME'].str.endswith('Inc.')]
print(filter_Inc[['CUSTOMERNAME']].reset_index(drop=True))

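# Convert strings to uppercase (print the result; the original column is not modified)
print(salesdf['CUSTOMERNAME'].str.upper().head())

# Replace the literal string 'Corp.' with 'Corporation'
# (the variable is not named `filter`, to avoid shadowing the Python built-in)
replaced_names = salesdf['CUSTOMERNAME'].str.replace('Corp.', 'Corporation', regex=False)
print(replaced_names[replaced_names.str.contains('Corporation', na=False)].reset_index(drop=True))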


# Check if 'Inc' is in CUSTOMERNAME, case insensitive
print(salesdf[salesdf['CUSTOMERNAME'].str.contains('Inc', case=False)][['CUSTOMERNAME']].reset_index(drop=True))


# Bonus: Save the DataFrame to an HTML file - this is one way for me to see more of the results of the filtering
salesdf.to_html(r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_output.html')
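
The `regex=False` argument in the replace step matters because `.` is a wildcard in a regular expression. A small sketch of the difference, using made-up customer names (not taken from the data file):

```python
import pandas as pd

# Made-up names chosen to show the difference
names = pd.Series(['Mini Wheels Corp.', 'Corporate Gift Ideas Co.', 'Toys4GrownUps.com'])

# Literal replacement: only the exact text 'Corp.' is replaced
literal = names.str.replace('Corp.', 'Corporation', regex=False)

# Regex replacement: '.' matches any character, so 'Corpo' is replaced too
as_regex = names.str.replace('Corp.', 'Corporation', regex=True)

print(literal.tolist())
print(as_regex.tolist())
```

With `regex=True`, the second name becomes `Corporationrate Gift Ideas Co.`, which is probably not what a SQL `REPLACE` user would expect.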


SQL: SELECT region, COUNT(*) FROM table GROUP BY region

In [None]:
import pandas as pd

# Load Sales Sample Data
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')

# Grouping and Count - Count of Sales by State
print("---Sales Count by State---")
print(salesdf.groupby('STATE').size())
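
One difference from SQL worth checking: `groupby` drops rows whose key is missing by default, while `GROUP BY` keeps NULL as its own group. The sketch below uses a made-up STATE column with one missing value to show both behaviours, and also turns the result into a regular table with a named count column.

```python
import pandas as pd

# Made-up column with one missing state
demo = pd.DataFrame({'STATE': ['MA', 'CA', 'MA', None, 'CA', 'MA']})

# Default: the row with a missing STATE is left out of the counts
counts = demo.groupby('STATE').size().reset_index(name='row_count')

# dropna=False keeps the missing key as its own group, like SQL's NULL group
counts_with_null = demo.groupby('STATE', dropna=False).size()

print(counts)
print(counts_with_null)
```

If the real STATE column has missing values, the default counts will not add up to the total number of rows.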


Multiple grouping levels

SQL: SELECT country, status, SUM(amount) FROM table GROUP BY country, status



In [None]:
import pandas as pd

# Load Sales Sample Data
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')

# Grouping and Aggregation - Sales Amount by Country and Status
salesdf['total_amount'] = salesdf['PRICEEACH'] * salesdf['QUANTITYORDERED']
salesdf.groupby(['COUNTRY', 'STATUS'])['total_amount'].sum()
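
Grouping by two columns returns a Series with a two-level index. Two ways to reshape it, sketched on made-up rows: `reset_index` gives a flat table like the SQL result set, and `unstack` pivots the second level into columns.

```python
import pandas as pd

# Made-up rows; only the column names mirror the sales data
demo = pd.DataFrame({
    'COUNTRY': ['USA', 'USA', 'France', 'France', 'USA'],
    'STATUS': ['Shipped', 'Disputed', 'Shipped', 'Shipped', 'Shipped'],
    'total_amount': [100.0, 50.0, 75.0, 25.0, 200.0],
})

totals = demo.groupby(['COUNTRY', 'STATUS'])['total_amount'].sum()

# Flat table: one row per (COUNTRY, STATUS) pair, like the SQL output
flat = totals.reset_index()

# Wide table: one row per COUNTRY, one column per STATUS, 0 where no rows exist
wide = totals.unstack(fill_value=0)

print(flat)
print(wide)
```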



SQL: SELECT country, SUM(total_amount) FROM table GROUP BY country HAVING SUM(total_amount) > 10000

In [None]:
import pandas as pd

# Load Sales Sample Data
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')

# Grouping and Aggregation - Total Sales Amount by Country, keeping only totals above 10000 (HAVING)
salesdf['total_amount'] = salesdf['PRICEEACH'] * salesdf['QUANTITYORDERED']
print("---Sales Amount by Country Where Sales Amount is > 10000---")
grouped = salesdf.groupby('COUNTRY')['total_amount'].sum()
grouped[grouped > 10000]
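
The HAVING step can also be chained onto the aggregation without an intermediate variable by passing a function to `.loc`. A minimal sketch on made-up totals:

```python
import pandas as pd

# Made-up rows chosen so that one country falls below the threshold
demo = pd.DataFrame({
    'COUNTRY': ['USA', 'USA', 'France', 'Spain'],
    'total_amount': [8000.0, 4000.0, 9000.0, 15000.0],
})

# GROUP BY COUNTRY ... HAVING SUM(total_amount) > 10000
big = (
    demo.groupby('COUNTRY')['total_amount']
    .sum()
    .loc[lambda s: s > 10000]  # filter on the aggregated values
)

print(big)
```

The key point is the same in both versions: the filter is applied after the aggregation, which is what separates HAVING from WHERE.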


SQL: SELECT region, SUM(amount), AVG(amount) FROM table GROUP BY region

In [None]:
import pandas as pd

# Load Sales Sample Data
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')

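# Grouping and Aggregation - Total Sales Amount, Average Sales Amount, Count of Rows, and Latest Order Date by Country
salesdf['total_amount'] = salesdf['PRICEEACH'] * salesdf['QUANTITYORDERED']
# Convert ORDERDATE to datetime so 'max' returns the latest date, not the largest string
salesdf['ORDERDATE'] = pd.to_datetime(salesdf['ORDERDATE'])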
salesdf.groupby('COUNTRY').agg({
    'total_amount': ['sum', 'mean', 'count'],
    'ORDERDATE': 'max'
})
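
Passing a dictionary with a list to `agg` produces two-level column names such as `('total_amount', 'sum')`. Named aggregation is one way to get flat, SQL-style aliases instead. The sketch below uses made-up rows, and the output column names (`total_sales`, `avg_sale`, `row_count`, `latest_order`) are my own choices.

```python
import pandas as pd

# Made-up rows; ORDERDATE is already a datetime column here
demo = pd.DataFrame({
    'COUNTRY': ['USA', 'USA', 'France'],
    'total_amount': [100.0, 300.0, 50.0],
    'ORDERDATE': pd.to_datetime(['2004-01-05', '2004-12-17', '2003-06-30']),
})

# Each keyword is the output column name: alias=(source column, aggregation)
summary = demo.groupby('COUNTRY').agg(
    total_sales=('total_amount', 'sum'),
    avg_sale=('total_amount', 'mean'),
    row_count=('total_amount', 'count'),
    latest_order=('ORDERDATE', 'max'),
)

print(summary)
```

This is roughly `SELECT country, SUM(amount) AS total_sales, AVG(amount) AS avg_sale, ...` in SQL terms.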


Custom aggregation functions

In [None]:
import pandas as pd

# Load Sales Sample Data
filenm = r'C:\Users\melis\OneDrive\Desktop\PracticeData\sales_data_sample.csv'
salesdf = pd.read_csv(filenm, encoding='latin1')

# Custom aggregation functions
# Example: Custom aggregation to find the range of total sales amount by country
def custom_agg(series):
    return series.max() - series.min()
salesdf['total_amount'] = salesdf['PRICEEACH'] * salesdf['QUANTITYORDERED']
salesdf.groupby('COUNTRY')['total_amount'].agg(custom_agg)
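
A custom function can also be mixed with the built-in aggregations in a single `agg` call, which makes it easy to check the range against the min and max it was computed from. A sketch on made-up rows, with the function renamed to `sales_range` (a hypothetical name) so the output column is self-describing:

```python
import pandas as pd

def sales_range(series):
    """Difference between the largest and smallest value in the group."""
    return series.max() - series.min()

# Made-up rows; only the column names mirror the sales data
demo = pd.DataFrame({
    'COUNTRY': ['USA', 'USA', 'France', 'France'],
    'total_amount': [100.0, 300.0, 50.0, 80.0],
})

# Built-in names and a custom function side by side;
# the custom column is named after the function
result = demo.groupby('COUNTRY')['total_amount'].agg(['min', 'max', sales_range])

print(result)
```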