# Comprehensive Database Tutorial and Demonstration

This notebook walks through our custom `Database` class: initializing it, inspecting its tables, creating and merging views, querying the merged data, and handling lookup errors.

## Table of Contents
1. [Setup and Initialization](#1-setup-and-initialization)
2. [Exploring Database Structure](#2-exploring-database-structure)
3. [Accessing Tables and Columns](#3-accessing-tables-and-columns)
4. [Creating and Managing Views](#4-creating-and-managing-views)
5. [Merging Data](#5-merging-data)
6. [Querying the Database](#6-querying-the-database)
7. [Advanced Analysis](#7-advanced-analysis)
8. [Error Handling and Best Practices](#8-error-handling-and-best-practices)
9. [Performance Considerations](#9-performance-considerations)
10. [Conclusion and Next Steps](#10-conclusion-and-next-steps)

## 1. Setup and Initialization

In [1]:
import os

import database_functions as func
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Create a Database instance from the cleaned-data directory
# (os.path.join keeps the path portable across operating systems)
path = os.path.join('003_data', '002_clean-data')
db = func.Database(path)
print("Database initialized successfully.")

Database initialized successfully.


In [2]:
db.tables.keys()

dict_keys(['ballot', 'census', 'demo', 'fascility', 'medicare', 'voter'])

## 2. Exploring Database Structure

In [3]:
# List all tables in the database
print("Available tables:")
for table_name in db.tables.keys():
    print(f"- {table_name}")

# Display basic information about each table
for table_name, table in db.tables.items():
    print(f"\nTable: {table_name}")
    print(f"  Rows: {table().shape[0]}")
    print(f"  Columns: {table.get_columns()}")
    print(f"  Available views: {', '.join(table.list_views())}")

Available tables:
- ballot
- census
- demo
- fascility
- medicare
- voter

Table: ballot
  Rows: 174
  Columns: ['year', 'county_name', 'yes_count', 'no_count', 'total_count', 'yes_perc', 'no_perc']
  Available views: 

Table: census
  Rows: 58
  Columns: ['year', 'county_name', 'under_5_years', 'age_5_9', 'age_10_14', 'age_15_19', 'age_20_24', 'age_25_29', 'age_30_34', 'age_35_39', 'age_40_44', 'age_45_49', 'age_50_54', 'age_55_59', 'age_60_64', 'age_65_69', 'age_70_74', 'age_75_79', 'age_80_84', 'age_85_plus', 'age_16_plus', 'age_18_plus', 'age_21_plus', 'age_62_plus', 'age_65_plus', 'male_population', 'under_5_years_1', 'age_5_9_1', 'age_10_14_1', 'age_15_19_1', 'age_20_24_1', 'age_25_29_1', 'age_30_34_1', 'age_35_39_1', 'age_40_44_1', 'age_45_49_1', 'age_50_54_1', 'age_55_59_1', 'age_60_64_1', 'age_65_69_1', 'age_70_74_1', 'age_75_79_1', 'age_80_84_1', 'age_85_plus_1', 'age_16_plus_1', 'age_18_plus_1', 'age_21_plus_1', 'age_62_plus_1', 'age_65_plus_1', 'female_population', 'under_5
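
For wide tables such as `census`, printing every column name becomes hard to read. A compact alternative is to tabulate row and column counts per table. The sketch below works on a plain dictionary of DataFrames rather than on `db.tables`; the `summarize` helper and the tiny example frames are illustrative assumptions, not part of `database_functions`.

```python
import pandas as pd


def summarize(frames):
    """Return one row per table with its row count and column count."""
    rows = [
        {"table": name, "rows": len(frame), "columns": frame.shape[1]}
        for name, frame in frames.items()
    ]
    return pd.DataFrame(rows, columns=["table", "rows", "columns"])


# Tiny illustrative frames (hypothetical stand-ins for the real tables)
frames = {
    "voter": pd.DataFrame({
        "year": [2018, 2018],
        "county_name": ["Alameda", "Alpine"],
        "eligible": [1089154, 939],
    }),
    "ballot": pd.DataFrame({
        "year": [2018],
        "county_name": ["Alameda"],
        "total_count": [556285.0],
    }),
}

summary = summarize(frames)
print(summary)
```

If each table object returns its DataFrame when called, as `table()` does in the cell above, the same helper could presumably be fed `{name: table() for name, table in db.tables.items()}`.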

## 3. Accessing Tables and Columns

In [7]:
# Access voter table
print(db.voter().head())
print(db.voter().shape)
print(db.voter().columns)

   year county_name  eligible  total_registered  democratic_perc  \
0  2018     Alameda   1089154            881491         0.556651   
1  2018      Alpine       939               758         0.411609   
2  2018      Amador     27117             22305         0.287962   
3  2018       Butte    171771            122741         0.349052   
4  2018   Calaveras     36101             29591         0.273698   

   republican_perc  american_independent_perc  green_perc  libertarian_perc  \
0         0.110469                   0.018620    0.007602          0.005286   
1         0.270449                   0.032982    0.006596          0.007916   
2         0.439901                   0.042233    0.004573          0.013226   
3         0.341817                   0.034552    0.007626          0.011259   
4         0.414450                   0.045453    0.006353          0.015106   

   peace_and_freedom_perc  unknown_perc  other_perc  no_party_preference_perc  
0                0.003049      0.000

## 4. Creating and Managing Views

In [26]:
# Create a registration_rate view on the voter table
db.voter.add_view("registration_rate", ['year', 'county_name', 'eligible', 'total_registered'])

# Work on an explicit copy of the view so the derived column is not written
# to a slice of the underlying table (avoids pandas' SettingWithCopyWarning)
registration_rate_df = db.voter.registration_rate.copy()
registration_rate_df['registration_rate'] = registration_rate_df['total_registered'] / registration_rate_df['eligible']
print('voter/registration_rate view: \n')
print(registration_rate_df.head())

# Create a casted_votes view on the ballot table
db.ballot.add_view("casted_votes", ['year', 'county_name', 'total_count'])

print('\n ballot/casted_votes view: \n')
print(db.ballot.casted_votes.head())


voter/registration_rate view: 

   year county_name  eligible  total_registered  registration_rate
0  2018     Alameda   1089154            881491           0.809336
1  2018      Alpine       939               758           0.807242
2  2018      Amador     27117             22305           0.822547
3  2018       Butte    171771            122741           0.714562
4  2018   Calaveras     36101             29591           0.819673

 ballot/casted_votes view: 

   year county_name  total_count
0  2018     Alameda     556285.0
1  2018      Alpine        579.0
2  2018      Amador      17031.0
3  2018       Butte      86302.0
4  2018   Calaveras      20912.0
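
Adding a derived column to a column subset of a DataFrame is a common trigger for pandas' `SettingWithCopyWarning`. The copy-first pattern can be sketched in plain pandas as follows. It uses the first two rows of the voter table printed above and does not depend on the `Database` class; how `add_view` builds its subset internally is not shown in this notebook, so treat this only as an analogy.

```python
import pandas as pd

# First two rows of the voter table, as printed earlier in this notebook
voter = pd.DataFrame({
    "year": [2018, 2018],
    "county_name": ["Alameda", "Alpine"],
    "eligible": [1089154, 939],
    "total_registered": [881491, 758],
})

# Take an explicit copy of the column subset so the new column is written
# to an independent DataFrame rather than to a slice of `voter`
subset = voter[["year", "county_name", "eligible", "total_registered"]].copy()
subset["registration_rate"] = subset["total_registered"] / subset["eligible"]

print(subset)
# The source frame is left untouched by the assignment above
print("registration_rate" in voter.columns)
```

Because `subset` is an independent copy, the derived column never reaches the source table, which matches the merged output in the next section, where no `registration_rate` column appears.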


## 5. Merging Data

In [28]:
# Merge multiple views
# db.merge_views([(table_name, view_name), (table_name, view_name), ...])
# Assumes every view has 'county_name' and 'year' columns; if one is missing,
# an empty column is created. The merge key is ['year', 'county_name'].
turnout_df = db.merge_views([('voter', 'registration_rate'), ('ballot', 'casted_votes')])
turnout_df['turnout'] = turnout_df['total_count'] / turnout_df['total_registered']
print('\n turnout_df: \n')
print(turnout_df.head())



 turnout_df: 

  county_name  year  eligible  total_registered  total_count   turnout
0     Alameda  2018   1089154            881491     556285.0  0.631073
1      Alpine  2018       939               758        579.0  0.763852
2      Amador  2018     27117             22305      17031.0  0.763551
3       Butte  2018    171771            122741      86302.0  0.703123
4   Calaveras  2018     36101             29591      20912.0  0.706701
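
For readers without access to the `Database` class, the same turnout calculation can be approximated with a plain pandas merge on the shared keys. This is a sketch under an assumption: it uses an inner join on `['year', 'county_name']`, whereas the join type used inside `db.merge_views` is not shown here and may differ. The rows are taken from the outputs above.

```python
import pandas as pd

# Rows taken from the registration_rate and casted_votes outputs above
registration = pd.DataFrame({
    "year": [2018, 2018],
    "county_name": ["Alameda", "Alpine"],
    "eligible": [1089154, 939],
    "total_registered": [881491, 758],
})
votes = pd.DataFrame({
    "year": [2018, 2018],
    "county_name": ["Alameda", "Alpine"],
    "total_count": [556285.0, 579.0],
})

# Assumption: inner join on the shared keys; db.merge_views may behave differently
merged = registration.merge(votes, on=["year", "county_name"], how="inner")

# Turnout = ballots cast divided by registered voters
merged["turnout"] = merged["total_count"] / merged["total_registered"]
print(merged)
```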


## 6. Querying the Database

In [36]:
# Query the merged database
# db.query(conditions, columns)
#   conditions: dict mapping column_name -> predicate function
#   columns: list of columns to return

# Query for counties with a population over 1 million
#   conditions: population > 1000000
#   columns: ['county_name', 'population']
highest_population = db.query({'population': lambda x: x > 1000000},
                              ['county_name', 'population'])
print('\n highest_population: \n')
print(highest_population.head())

# Query for multiple conditions using a dictionary
conditions = {'population': lambda x: x > 1000000,  # population over 1 million
              'age_0_5': lambda x: x > 0.05}        # more than 5% of the population is aged 0-5
highest_population_high_children = db.query(
    conditions,
    ['county_name', 'population', 'age_0_5']
)

print('\n highest_population_high_children: \n')
print(highest_population_high_children.head())



 highest_population: 

      county_name  population
232       Alameda   1636194.0
238  Contra Costa   1147653.0
241        Fresno   1011499.0
250   Los Angeles   9761210.0
261        Orange   3137164.0

 highest_population_high_children: 

      county_name  population   age_0_5
232       Alameda   1636194.0  0.064737
238  Contra Costa   1147653.0  0.065946
241        Fresno   1011499.0  0.083688
250   Los Angeles   9761210.0  0.063692
261        Orange   3137164.0  0.065930
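
The dictionary-of-predicates style used above can be illustrated in plain pandas. The sketch below assumes that every condition must hold for a row to be kept (a logical AND) and that each predicate is applied value by value; neither detail of `db.query` is confirmed by this notebook. The `filter_frame` helper is hypothetical, and the "Smallville" row is made up purely to show a non-matching case.

```python
import pandas as pd


def filter_frame(df, conditions, columns):
    """Keep rows where every per-column predicate holds, then select columns."""
    mask = pd.Series(True, index=df.index)
    for column, predicate in conditions.items():
        # Apply the predicate element by element and AND it into the mask
        mask &= df[column].map(predicate).astype(bool)
    return df.loc[mask, columns]


# Alameda and Fresno values come from the output above;
# "Smallville" is a made-up row added only to show a non-match
counties = pd.DataFrame({
    "county_name": ["Alameda", "Fresno", "Smallville"],
    "population": [1636194.0, 1011499.0, 5000.0],
    "age_0_5": [0.064737, 0.083688, 0.070000],
})

result = filter_frame(
    counties,
    {"population": lambda x: x > 1000000, "age_0_5": lambda x: x > 0.05},
    ["county_name", "population", "age_0_5"],
)
print(result)
```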


## 7. Error Handling and Best Practices

In [37]:
# Demonstrate error handling
try:
    db.get_view('non_existent_table', 'some_view')
except ValueError as e:
    print("Error:", str(e))

try:
    db.get_view('voter', 'non_existent_view')
except ValueError as e:
    print("Error:", str(e))

# Best practice: Check if a view exists before trying to access it
def safe_get_view(db, table_name, view_name):
    if table_name in db.tables:
        table = db.tables[table_name]
        if view_name in table.list_views():
            return table.get_view(view_name)
        else:
            print(f"View '{view_name}' not found in table '{table_name}'")
    else:
        print(f"Table '{table_name}' not found in the database")
    return None

# Example usage of safe_get_view
safe_view = safe_get_view(db, 'voter', '2022')
if safe_view is not None:
    print("Successfully retrieved the '2022' view from the 'voter' table")
    print(safe_view.head())

safe_get_view(db, 'non_existent_table', 'some_view')
safe_get_view(db, 'voter', 'non_existent_view')

Error: Table 'non_existent_table' not found
Error: View 'non_existent_view' not found
View '2022' not found in table 'voter'
Table 'non_existent_table' not found in the database
View 'non_existent_view' not found in table 'voter'
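
An alternative to checking before access is to attempt the lookup and catch the `ValueError` that `db.get_view` raises in the cell above. The sketch below shows that pattern with a small stand-in object so it can run on its own. `StubDatabase` and `get_view_or_default` are hypothetical names introduced for illustration; the stub only mimics the error messages printed above and is not the real `Database` implementation.

```python
class StubDatabase:
    """Hypothetical stand-in that mimics the error messages shown above."""

    def __init__(self, views):
        self._views = views  # {table_name: {view_name: data}}

    def get_view(self, table_name, view_name):
        if table_name not in self._views:
            raise ValueError(f"Table '{table_name}' not found")
        if view_name not in self._views[table_name]:
            raise ValueError(f"View '{view_name}' not found")
        return self._views[table_name][view_name]


def get_view_or_default(db, table_name, view_name, default=None):
    """Return the view, or `default` if the lookup raises ValueError."""
    try:
        return db.get_view(table_name, view_name)
    except ValueError as error:
        print("Error:", error)
        return default


stub = StubDatabase({"voter": {"registration_rate": ["placeholder rows"]}})
found = get_view_or_default(stub, "voter", "registration_rate")
missing = get_view_or_default(stub, "voter", "non_existent_view")
print(found, missing)
```

Compared with `safe_get_view`, this version relies on the error raised by the lookup itself, so the existence checks do not have to be duplicated outside the class.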
