# Working With Data

Previously, we demonstrated how to import and export data using Python Pandas. In this section, we'll address some of the basics of working with in-memory datasets.

## The DataFrame

The **DataFrame** is an essential concept carried over from the **R Programming Language**. A DataFrame stores data as a labeled, two-dimensional table of rows and columns. Pandas builds on the NumPy library so that arithmetic and other operations apply to entire columns (arrays) at once.
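
As a small illustration of the idea, here is a sketch that builds a DataFrame from a plain dictionary and operates on a whole column at once. The names and pay rates are invented for the example and are not taken from the project database.

```python
import pandas as pd

# hypothetical sample data; these values are made up for illustration
people = pd.DataFrame(
    {
        "fname": ["Ada", "Grace", "Alan"],
        "payrate": [30.0, 42.5, 28.0],
    }
)

# column arithmetic is vectorized: no explicit loop is needed
people["weekly_pay"] = people["payrate"] * 40

# a numeric column can be handed to NumPy as an array
payrate_array = people["payrate"].to_numpy()

print(people)
print(type(payrate_array))
```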

## First Things First

Until these materials have been reviewed and updated, I will intentionally repeat myself. I dislike typing more than required unless it:
- reinforces learning
- highlights perceived importance
- holds some other significance

So far we've noticed the beginning of a trend: we commonly employ the same libraries and functions, and reuse many of the same variables. Instead of continuing to manually import everything in each notebook, we'll author a script which ensures that `notebooks.app.utils` is on Python's module search path (`sys.path`) and available for import.

Simply stated, this code:
```python
import pandas as pd
from pathlib import Path
from sqlalchemy import create_engine

# instantiate the required database connection
connection_string = "sqlite:///../data/sqlite.db"
engine = create_engine(connection_string)

# declare SQL statement
sql_statement = """SELECT * 
                    FROM persons p 
                    JOIN contactdetails c
                        ON p.id == c.id 
                    JOIN labordetails l 
                        ON p.id == l.person_id;"""

# instantiate the DataFrame object
df = pd.read_sql(sql_statement, con=engine)

# ... and so on and so forth
```

will be refactored to use the imports from the `notebooks.app.utils` module:

```python
from os import path
import sys

# determine current working directory location
project_directory = path.abspath('')
project_path = path.dirname(project_directory)

# add the project root (the parent of the working directory) to sys.path
if project_path not in sys.path:
    sys.path.append(project_path)

# import the defined utils which contains a dictionary CONSTANT
from notebooks.app.utils import *

# get our variables from the CONSTANT dictionary
sql_statement = CONSTANT['sql_query']
sql_kwargs = CONSTANT['sql_kwargs']

# instantiate the DataFrame object
df = pd.read_sql(sql_statement, **sql_kwargs)

# ... and so on and so forth
```

The new opening cell is admittedly longer than the original. To see where the savings come from, please look at the contents of the `notebooks.app.utils.globals` file. The greater efficiency pays dividends over time: the various imports and variable declarations no longer have to be retyped in every notebook.
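
The contents of that file are not reproduced here, so the following is only a guess at its general shape: shared imports performed once, plus a `CONSTANT` dictionary holding the query and the keyword arguments for `pd.read_sql`. The in-memory SQLite database, the sample row, and the exact keys inside `sql_kwargs` are assumptions made so the sketch runs on its own.

```python
import sqlite3

import pandas as pd

# assumption: a throwaway in-memory database stands in for ../data/sqlite.db
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE persons (id INTEGER, fname TEXT, lname TEXT)")
connection.execute("INSERT INTO persons VALUES (1, 'Ada', 'Lovelace')")
connection.commit()

# assumption: the real CONSTANT may hold different keys and values
CONSTANT = {
    "sql_query": "SELECT * FROM persons;",
    "sql_kwargs": {"con": connection},
}

# ** unpacks the dictionary into keyword arguments,
# so this call is equivalent to pd.read_sql(query, con=connection)
demo_df = pd.read_sql(CONSTANT["sql_query"], **CONSTANT["sql_kwargs"])

print(demo_df)
```

Keeping the query and connection settings in one dictionary means a notebook only needs the import line and the `read_sql` call.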

### Notes on Syntax and an Internal Guide

<!-- This section seems a bit personal, eh?
how you gonna paraphrase bob ross properly? do the dude respect.
how do one properly paraphrase until one performs proper research?
why has 'one'(you) been subjectively limited in use either/or in 'liberal studies'/english? 
-->
Humans are inconsistent. The author is human, despite repeated attempts to transcend. Humans are flawed, and that's okay. In the wise and loosely paraphrased words of Bob Ross: "We don't make mistakes, just happy little accidents."

A list of notable items:

- CAPITAL CASE IS A COMMON CONVENTION FOR CONSTANT VARIABLE DEFINITIONS IN MANY PROGRAMMING LANGUAGES.
- snake_case is the convention in Python, as recommended by [PEP 8](https://peps.python.org/pep-0008/). PascalCase and camelCase are common conventions in other languages; in Python, PascalCase is used mainly for class names.
    - a name preceded by one underscore (`_name`) or two (`__name`) signals that it is intended to be 'private' in the OOP sense. This is a convention in Python and (generally) in other interpreted and/or web languages. Such names can still be accessed directly, but *shouldn't* be.
    - duck-typed languages and polymorphism offer convenience with implied limitations (usually involving performance and optimization). On the positive side, they're great for prototyping since there's no compile step!
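
The conventions in the list above can be seen side by side in a short sketch. The class, attribute, and constant names here are hypothetical and exist only to illustrate the naming styles.

```python
# module-level constant: CAPITAL_CASE signals "do not reassign"
MAX_PAYRATE = 100.0


class PayrollRecord:  # PascalCase is used for class names
    def __init__(self, first_name, pay_rate):
        self.first_name = first_name     # snake_case public attribute
        self._pay_rate = pay_rate        # single underscore: internal by convention
        self.__audit_tag = "unreviewed"  # double underscore: name-mangled by Python

    def capped_rate(self):
        # snake_case method name
        return min(self._pay_rate, MAX_PAYRATE)


record = PayrollRecord("Ada", 120.0)

print(record.capped_rate())
# the 'private' attributes are still reachable, which is why this is only a convention
print(record._pay_rate)
print(record._PayrollRecord__audit_tag)
```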

In [None]:
# Later, we'll define a script to be used in the first block of each notebook to ensure that app.utils will be added to the project PATH
# Once app.utils is on project path we will be able to import boilerplate Python code & variables

## TODO: SECTION
# insert boilerplate workaround courtesy of stack overflow
# https://stackoverflow.com/questions/61058798/python-relative-import-in-jupyter-notebook
from os import path
import sys

project_directory = path.abspath('')
project_path = path.dirname(project_directory)

if project_path not in sys.path:
    sys.path.append(project_path)

from notebooks.app.utils import *
## TODO: END SECTION

# construct our working DataFrame
# NOTE: the ** operator unpacks a dictionary into keyword arguments (**kwargs), whereas the * prefix unpacks a list or tuple into positional arguments (*args)
# this means that each dictionary entry is passed to the function as key=value
df = pd.read_sql(CONSTANT['sql_query'], **CONSTANT['sql_kwargs'])

df.columns

## Selecting Data, Viewing Data, and Other General Information

It's common that not all columns are required for analysis. Rather than clutter our workspace with unnecessary data, it's best to limit the scope of what is available for inspection.

### Overview

In the following block, columns will be removed, the data set reindexed, and entries assigned to a new DataFrame object which omits the payroll entries. Then we'll demonstrate some of the more common Pandas methods for viewing data and gathering summary statistics.
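
Before running this against the real data, the same steps can be tried on a toy table. This is a minimal sketch with invented rows; it assumes a joined result in which each person repeats once per labor entry.

```python
import pandas as pd

# hypothetical joined result: one row per labor entry, so people repeat
raw = pd.DataFrame(
    {
        "fname": ["Ada", "Ada", "Grace"],
        "lname": ["Lovelace", "Lovelace", "Hopper"],
        "payrate": [30.0, 30.0, 42.5],
        "time_in": ["08:00", "09:00", "08:30"],
    }
)

# keep only the columns of interest, then copy so later edits do not touch `raw`
subset = raw[["lname", "fname", "payrate"]].copy()

# without the labor column the repeated people become exact duplicates;
# remove them and renumber the index from zero
subset = subset.drop_duplicates().reset_index(drop=True)

print(subset)
print(subset.nunique())  # number of distinct values in each column
```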

In [None]:
# BASICS: selecting data, viewing data, and general information statistics about a dataset

# not ALL of the columns are necessary, so let's SELECT a few columns to work with from the original DataFrame
# NOTE: feel free to experiment! customizing the columns section is essential to proper understanding!!! Live a little!!! c'mon!!!!
select_columns = [
    'lname',
    'fname',
    'email',
    'address',
    'city',
    'state',
    'zipcode',
    'dob',
    # 'ssn',
    'payrate',
    # 'current_date',
    # 'time_in',
    # 'time_out',
]

# let's also define the columns from each specific data table source (in case we want to persist any changes to the database)
person_columns = ["id", "fname", "lname", "job", "payrate", "ssn", "dob"]

contact_columns = ["id", "phone", "email", "address", "city", "state", "zipcode"]

labor_columns = ["id", "person_id", "current_date", "time_in", "time_out"]


# create a copy using the previously defined column name filter
ndf = df[select_columns].copy()

# remove duplicate entries aka SELECT DISTINCT in SQL
ndf.drop_duplicates(inplace=True)

# re-index the dataset
ndf.reset_index(inplace=True, drop=True)

# print the top 5 rows
ndf.head()

# print the bottom 10 rows
# NOTE: df.head() also accepts an integer as an argument
ndf.tail(10)

# print the number of unique values in each column
ndf.nunique()

# print general statistics (for numeric datatypes)
ndf.describe()

# # similar and/or related methods
# ndf.min()
# ndf.max()
# ndf.mean()
# ndf.median()
# ndf.mode()
# ndf.std()
# ndf.sum()
# ndf.cumsum()
# ndf.cumprod()


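# check for nulls; each call returns a boolean DataFrame with the same shape as ndf
ndf.isna()
ndf.notna()

# isnull() and notnull() are aliases of isna() and notna()
ndf.isnull()
ndf.notnull()

# replace null values (returns a new DataFrame; ndf itself is unchanged)
ndf.fillna(0)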

# count the number of non-null entries in a column
ndf['payrate'].count()

# print current column datatypes
ndf.dtypes

# print the column names
ndf.columns

# verify dataframe memory usage
# ndf.memory_usage()

## Applying Filters, Sorting, and Operators

To review common operators, refer to the operation table in `001-Python-101-For-The-Uninitiated.ipynb#operation-table`.
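
With Pandas, comparisons produce a boolean Series (one `True`/`False` per row), and those Series are combined with `&`, `|`, and `~` rather than Python's `and`, `or`, and `not`. The sketch below uses an invented three-row table to show each operator.

```python
import pandas as pd

# hypothetical sample data for illustration
staff = pd.DataFrame(
    {
        "lname": ["Lovelace", "Hopper", "Turing"],
        "state": ["Arizona", "Ohio", "Arizona"],
        "payrate": [35.0, 42.5, 28.0],
    }
)

# each comparison yields a boolean Series with one value per row
high_pay = staff["payrate"] > 30
in_arizona = staff["state"] == "Arizona"

# & (and), | (or), ~ (not) combine boolean Series element by element
both = high_pay & in_arizona
either = high_pay | in_arizona
neither = ~(high_pay | in_arizona)

print(staff[both])
print(staff[either])
print(staff[neither])
```

Naming each condition before combining it avoids operator-precedence surprises, because every comparison has already been evaluated by the time `&` or `|` is applied.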

<!-- TODO: Link to the file section -->

<!-- [Link](./notebooks/001-Python-101-For-The-Uninitiated#operation-table) -->

In [None]:
# applying filters and sorting!

# create a filter expression to check for entries which match a condition
payrate_filter_expression = ndf['payrate'] > 30

# create a slightly more complex filter expression
# NOTE: the parentheses are required; & binds more tightly than > and ==, so without them Python evaluates 30 & ndf['state'] first and raises a TypeError
complex_filter_expression = (ndf['payrate'] > 30) & (ndf['state'] == 'Arizona')

# apply the filter to our data set
ndf[payrate_filter_expression]

# apply a slightly more complex filter
ndf[complex_filter_expression]

# or apply the inverse filter using the unary operator symbol '~' (tilde)
# NOTE: ~ inverts an already-built boolean Series; when inverting a compound condition written inline, wrap the whole condition in parentheses first
ndf[~payrate_filter_expression]

# count the number of entries resulting from the filter
ndf['payrate'][payrate_filter_expression].count()

# sort the DataFrame on any given column(s); sort_values returns a new, sorted DataFrame
# this is written across multiple lines enclosed in brackets/parentheses so that the sort columns can easily be commented out or reordered
# TODO: have fun! live a little and experiment!
ndf.sort_values(
    by=[
        # 'state',
        'lname',
        # 'payrate',
        'dob',
    ],
    ascending=True,
)

# ndf[ndf.notna()]


## A Dilemma

If it's decided that the current column names are undesirable, then it's possible to manipulate them.

### Overview

In this section, we'll create a new DataFrame named `odf` in which columns will be renamed and existing column data will be used to produce new columns.
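
The same operations can be sketched on a toy table first. The rows below are invented; the column names mirror the ones used in this notebook, and the final step shows one way to update only the rows that match a condition, using a single `.loc` call with a boolean mask.

```python
import pandas as pd

# hypothetical sample data for illustration
team = pd.DataFrame(
    {
        "fname": ["Ada", "Grace", "Alan"],
        "lname": ["Lovelace", "Hopper", "Turing"],
        "state": ["Arizona", "Ohio", "New York"],
        "payrate": [30.0, 42.5, 28.0],
    }
)

# rename returns a new DataFrame; columns missing from the mapping keep their names
renamed = team.rename(
    columns={"fname": "first_name", "lname": "last_name", "payrate": "rate_of_pay"}
)

# derive a new column from existing ones
renamed["full_name"] = (
    renamed["last_name"].str.upper() + ", " + renamed["first_name"].str.upper()
)

# update only the matching rows: a boolean mask selects rows, the label selects the column
in_selected_states = renamed["state"].isin(["Arizona", "New York"])
renamed.loc[in_selected_states, "rate_of_pay"] += 0.5

print(renamed[["full_name", "state", "rate_of_pay"]])
```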

In [None]:
# renaming columns

# define a dictionary (or mapping) of key value pairs which represent the current column name and the anticipated column name
new_column_names = {
    "lname": "last_name",
    "fname": "first_name",
    "email": "email_address",
    "zipcode": "postal_code",
    "dob": "date_of_birth",
    "payrate": "rate_of_pay",
}

odf = ndf.rename(columns=new_column_names)

odf.columns

In [None]:
# creating new columns and using existing data in calculations

# we want a column named 'full_name' which displays the employee's name in the form of "last_name, first_name" in all caps
odf['full_name'] = odf['last_name'].str.upper() + ", " + odf['first_name'].str.upper()

# we decide to give the employees a 50 cent raise because of a raise in the cost of living, to share profits, incentivize, or because 'nobody wants to work'

# define a value to represent the raise amount
raise_amt = .5


# apply the amount to the entire workforce!
odf['rate_of_pay'] = odf['rate_of_pay'] + raise_amt


# uh-oh.... management isn't happy, they only wanted raises to be given to employees in specific regions to 'enhance competitiveness' (whatever that means)
# first, we'll undo our changes (in-memory)
odf['rate_of_pay'] = odf['rate_of_pay'] - raise_amt

# now we'll try again with a list curated by our benevolent dictators including the specific competitive states/regional employees that should be incentivized
select_states = [
    "Arizona",
    "California",
    "New York",
]

# approach A) chained indexing
# it triggers a warning, and depending on the pandas version it may modify a temporary copy instead of odf
# odf['rate_of_pay'][odf['state'].isin(select_states)] = odf['rate_of_pay'] + raise_amt

# approach B) a single .loc call
# build a boolean mask (one True/False per row) for the selected states
mask = odf['state'].isin(select_states)

# select the rows with the mask and the column by label in ONE indexing operation;
# this updates odf itself and does not trigger the warning
odf.loc[mask, 'rate_of_pay'] = odf.loc[mask, 'rate_of_pay'] + raise_amt

# odf.loc[[x for x in mask.index]]

# mask = odf['state'].isin(select_states)
# odf.where(mask)

# odf[odf['state'].isin(select_states)]
# mask

# odf.loc[:, odf['state'].isin(select_states)]

# odf.loc[:, mask['rate_of_pay']]

# odf.loc[:, mask]

# mask[['rate_of_pay', 'full_name', 'state']]

# odf.iloc[mask['rate_of_pay']]

# odf.loc[mask['rate_of_pay']]



# odf.loc[mask['rate_of_pay'], :]
# odf.loc[:, mask['rate_of_pay']]

# odf[odf['state'].isin(select_states)]['rate_of_pay'] = odf['rate_of_pay'] + raise_amt

# odf['rate_of_pay'][odf['state'].isin(select_states)] = odf['rate_of_pay'] - raise_amt


# odf['rate_of_pay'][odf['state'].isin(select_states)]

# since o
#  


# odf.nunique()
# odf['state'].filter(items=select_states)
# odf['state'].filter(items=select_states)

In [None]:
# then it's realized that the employee email addresses are managed by a corporate account in a specific format
# so let's update the employee email addresses!

odf['email_address'] = odf['first_name'].str.lower() + "." + odf['last_name'].str.lower() + '@example.org'


# it's now decided that the date_of_birth is unnecessary, but for compliance reasons the individual's age in years is needed

# the date_of_birth column is cast to a datetime datatype further below
# import date from the datetime library to get today's date
# NOTE: age is awkward to calculate by hand because of leap years
# the third-party dateutil module handles this; relativedelta is aliased as rd for brevity
from datetime import date
# NOTE: an example of aliasing an import; python-dateutil is the actual package name
from dateutil.relativedelta import relativedelta as rd


# let's be explicit (not implicit) by specifying and converting the data types

# cast each column to its intended datatype, using NumPy datatypes where a suitable one exists
# NOTE: remember, be explicit rather than implicit
# https://stackoverflow.com/questions/15891038/change-column-type-in-pandas
# NOTE: np may already be provided by the notebooks.app.utils star import; importing it again is harmless
import numpy as np
column_data_types = {
    "last_name": np.str_,
    "first_name": np.str_,
    "email_address": np.str_,
    "address": np.str_,
    "city": np.str_,
    "state": np.str_,
    "postal_code": np.str_,
    "date_of_birth": 'datetime64[ns]',
    "rate_of_pay": np.double,
    "full_name": np.str_,
}

odf = odf.astype(column_data_types)

# what day is it?
today = date.today()

# NOTE: apply calls the given function once per value in the column; the lambda is a small anonymous function
# here it returns the number of whole years between each date of birth and today
odf['age'] = odf['date_of_birth'].apply(lambda x: rd(today, x).years)

odf[['full_name','age']]
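
To see why leap years make age awkward to compute by hand, here is a small sketch using fixed dates so the result does not depend on the day it is run. The helper name `age_in_years` is hypothetical; `relativedelta` is the same dateutil class used above.

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def age_in_years(born, on_date):
    """Return the number of whole years between two dates."""
    return relativedelta(on_date, born).years


# someone born on a leap day
leap_birthday = date(2000, 2, 29)

# the day before the 24th birthday versus the birthday itself
day_before = age_in_years(leap_birthday, date(2024, 2, 28))
birthday = age_in_years(leap_birthday, date(2024, 2, 29))

print(day_before, birthday)
```

Dividing a day count by 365 would drift over time, whereas comparing calendar fields handles the extra day in leap years.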

In [None]:
# common spreadsheet (e.g. Excel) error: leading zeroes are stripped from postal codes
# let's restore them by left-padding every code with zeroes to five characters
# NOTE: this assumes five-digit postal codes stored as strings
odf['postal_code'] = odf['postal_code'].str.zfill(5)

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
# pd.pivot_table(odf, values=, index=[], columns=[])

# pd.pivot_table(odf, values)
# odf.head()
# table = pd.pivot_table(odf, values=['age'], index=['full_name', 'date_of_birth'], aggfunc={'age': 'sum'})
table = pd.pivot_table(odf, values=['age'], index=['full_name', 'date_of_birth'], aggfunc={'age': 'sum'})
table
# table = pd.pivot_table(odf, values=['age'], index=['full_name', 'date_of_birth'], columns=[], aggfunc={'age': 'mean'})

In [None]:
# resampling data
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html
# df.dtypes
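# resample needs datetime values, so convert the column first
df['current_date'] = pd.to_datetime(df['current_date'])

# cumsum is not an aggregation, so total each day with sum() first, then accumulate across days
dfA = df[["current_date", "payrate"]].resample('1D', on="current_date").sum().cumsum()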

dfA

# index = pd.date_range('1/1/2023', periods=52, freq='1W')

# series = pd.Series(range(52), index=index)

# series

# series.sum()
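
As a self-contained illustration of resampling, the sketch below groups a handful of invented timestamped entries into daily bins, totals each bin, and then builds a running total. Days with no entries appear in the result with a total of zero.

```python
import pandas as pd

# hypothetical labor entries; the timestamps and rates are made up
entries = pd.DataFrame(
    {
        "current_date": pd.to_datetime(
            [
                "2023-01-01 08:00",
                "2023-01-01 13:00",
                "2023-01-02 09:00",
                "2023-01-04 10:00",
            ]
        ),
        "payrate": [30.0, 30.0, 42.5, 28.0],
    }
)

# group rows into one-day bins based on the current_date column and total each bin
daily_total = entries.resample("1D", on="current_date")["payrate"].sum()

# accumulate the daily totals into a running total
running_total = daily_total.cumsum()

print(daily_total)
print(running_total)
```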