#### Clarusway Python

* [Instructor Landing Page](landing_page.ipynb)
* <a href="https://colab.research.google.com/github/4dsolutions/clarusway_data_analysis/blob/main/Kirby%20Notebooks/DAwPy_sandbox.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
* [![nbviewer](https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.org/github/4dsolutions/clarusway_data_analysis/blob/main/Kirby%20Notebooks/DAwPy_sandbox.ipynb)

<a id="toc"></a>

<a data-flickr-embed="true" href="https://www.flickr.com/photos/kirbyurner/52136642608/in/photolist-2n4sSUz-2nr8Vrb-2oADYNY" title="Clarusway Banner"><img src="https://live.staticflickr.com/65535/52136642608_bd45cb00a9_b.jpg" width="1024" height="334" alt="Clarusway Banner"/></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>

## <p style="background-color:#0D8D99; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Looking Back: The pandas DataFrame<br>Looking Ahead: to SQL</p>

Highly relevant at this juncture, when our focus is on managing tables in pandas, including combining them on columns they have in common, are the conceptual similarities with SQL (Structured Query Language). The vocabulary (shoptalk) of inner, outer, left and right joins, inherited in turn from Set Theory, spans both technologies, pandas and SQL.
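
To make the join vocabulary concrete, here is a minimal sketch using two made-up tables (`left` and `right`, with a shared `key` column, are hypothetical and not part of the airports data). It shows which keys survive each kind of join in pandas.

```python
import pandas as pd

# Hypothetical toy tables sharing a "key" column
left = pd.DataFrame({"key": ["A", "B", "C"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "right_val": [20, 30, 40]})

# Collect the keys that survive each kind of join
join_keys = {
    how: sorted(pd.merge(left, right, on="key", how=how)["key"])
    for how in ("inner", "outer", "left", "right")
}

for how, keys in join_keys.items():
    print(f"{how:>5}: {keys}")
```

In set terms, the inner join keeps the intersection of the keys, the outer join keeps the union, and the left and right joins keep every key from one side.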

In [None]:
import pandas as pd
import numpy as np
from os import path

Pythonistas enjoy the good fortune of having SQLite in the Standard Library. SQLite is a free open source tool that has a role in production, in the office setting, and as an onramp into RDBMS (relational database management systems) more generally.

In [None]:
import sqlite3 as sql  # part of Python Standard Library

Connecting to a database through a context manager has advantages. Connecting to a DB is akin to opening a file: closure is automatic once the code block exits, with or without unhandled exceptions.
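
As a point of comparison, the Standard Library already offers one way to get automatic closure: `contextlib.closing`. The sketch below assumes a throwaway in-memory database and an invented two-column table. Note that a `sqlite3` connection used directly in a `with` statement commits or rolls back the transaction but does not close the connection, which is one reason to wrap it.

```python
import sqlite3
from contextlib import closing

# ":memory:" gives a temporary database that lives only in RAM
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE Airports (iata TEXT, name TEXT)")  # invented schema
    conn.execute("INSERT INTO Airports VALUES (?, ?)",
                 ("PDX", "Portland International"))
    row = conn.execute("SELECT name FROM Airports WHERE iata = ?",
                       ("PDX",)).fetchone()

print(row)

# The connection was closed on exit, so further use should fail
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print("still open?", still_open)
```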

We looked at the context manager pattern in Basic Python. As with iterators, we recognize context managers by the presence of signature magic methods (also known as special names).

In the case of the Iterator, we look for `__next__` and `__iter__`, where the latter typically returns the object itself, qualifying it for the office of iterator. In the case of a Context Manager, we expect to find `__enter__` and `__exit__`.
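
As a reminder of the Iterator side of that comparison, here is a minimal sketch. The `Countdown` class is hypothetical, written only to show `__iter__` returning the object itself and `__next__` signaling exhaustion.

```python
class Countdown:
    """Hypothetical iterator counting down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # the object is its own iterator
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # tells for loops and list() to stop
        value = self.current
        self.current -= 1
        return value

counts = list(Countdown(3))
print(counts)
```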

We learned how these two methods get triggered: not by directly calling them, but by the "occasions" of entering and exiting code suites set off by the `with` statement, `with` being one of Python's keywords.
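
A small sketch can make those occasions visible. The `Tracer` class and the `events` list are hypothetical names used only to record the order in which things happen.

```python
events = []

class Tracer:
    """Hypothetical context manager that records when it is triggered."""

    def __enter__(self):
        events.append("enter")
        return self  # becomes the object named after `as`

    def __exit__(self, exc_type, exc_value, traceback):
        events.append("exit")
        return False  # do not suppress exceptions

with Tracer() as cm:
    events.append("body")

print(events)
```

Neither special method is called by name; the `with` statement triggers both.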

Where we most likely encounter `with` in basic Python is in connection with file objects, which open and close upon entering and exiting the suite. The keyword `as` gives access to the Context Manager itself as a presiding object (e.g. `cm` below).

We say:

```python
    with open("the_file.txt") as cm:
        content = cm.read()
```    

Likewise, our Connector class below wraps a database connection and cursor inside the instance, once `__enter__` has established them as attributes of the presiding object.

```python
    with Connector("airports.db") as db:
        db.list_tables()
``` 

Upon exiting the with suite, the connection closes, and any exceptions get handled or reraised.
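
Whether an exception gets handled or reraised comes down to what `__exit__` returns. The sketch below uses a hypothetical `Guard` class, unrelated to any database, to show both outcomes: a true return value suppresses the exception, a false one lets it propagate.

```python
class Guard:
    """Hypothetical context manager with a configurable __exit__ result."""

    def __init__(self, suppress):
        self.suppress = suppress
        self.seen = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.seen = exc_type   # None if the suite finished cleanly
        return self.suppress   # True swallows, False re-raises

# The exception is swallowed; execution continues after the suite
with Guard(suppress=True) as quiet:
    raise ValueError("swallowed")

# The exception propagates out of the with statement
try:
    with Guard(suppress=False) as loud:
        raise ValueError("re-raised")
    propagated = False
except ValueError:
    propagated = True

print(quiet.seen, propagated)
```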

The context manager object may optionally be equipped with additional DB-related methods, such as returning a listing of tables and/or performing a record lookup.

In [None]:
class Connector:

    def __init__(self, conn_name : str):
        """Run when class is called"""
        self.cn_name = conn_name # what file?
        
    def __enter__(self):
        """Run when the context is entered"""
        try:
            self.conn = sql.connect(self.cn_name)
            print("Connection: ", self.conn)
            self.curs = self.conn.cursor()
            # self.list_tables() # optional
        except sql.Error:
            print("No connection")
            raise

        return self
    
    def lookup(self, table, column, code):
        """
        return the data for column = code condition
        """
        self.curs.execute(f"SELECT * FROM {table} WHERE {column} = ?", (code, ))
        return self.curs.fetchone() # could be None, could be a tuple
    
    def list_tables(self):
        """
        print a listing of all the tables in this db
        https://www.sqlitetutorial.net/sqlite-show-tables/
        """
        self.curs.execute("""SELECT name FROM sqlite_schema  
                            WHERE type ='table' AND name 
                            NOT LIKE 'sqlite_%';
                            """)    
        # loop through whatever table names were found 
        # and filtered and print them out.
        for nm in self.curs.fetchall():
            print(nm)
         
    def __exit__(self, *oops):
        """
        Process exception info, passed in as *oops,
        a 3-tuple, we hope filled with Nones because 
        all went well.  Otherwise, exception info.
        Return True to suppress an exception raised
        inside the with suite, False to let it propagate.
        """
        self.conn.close()
        if oops[0]:
            print("An error occurred")
            return False  # let the exception propagate
        return True       # all good

The `airports.db` file contains only one table, Airports. This is a flat file with some information about airports around the world, including their unique IATA code.

A copy of `airports.db` used here [may be found](https://github.com/4dsolutions/clarusway_data_analysis/blob/main/DVwPY_S6/airports.db) in this GitHub repo. Download the raw file.

Our purpose here is to bring the data into pandas using `sqlite3` and our Connector, and then review our powers to merge and purge, ending up with some new database files as output, such as a relational `big_airports.db` with lat/long coordinates stored separately, linked by IATA code. We create this table more as a test of pandas than to produce output of much practical value.

In [None]:
path.isfile("airports.db")

In [None]:
with Connector("airports.db") as db:
    db.list_tables()

In [None]:
with Connector("./airports.db") as db:
    df = pd.read_sql("SELECT * FROM Airports", con = db.conn)
    print(db.lookup("Airports", "iata", "SFO"))
    print(db.lookup("Airports", "iata", "PDX"))

In [None]:
df

In [None]:
df.info()

The description of numeric columns is hardly useful, as these consist of either categorical values or latitude / longitude, which it makes little sense to average.

In [None]:
df.describe().T

However, remember that `describe` may be directed to attend to non-numeric columns as well.
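
One way to see the difference, sketched on a made-up table rather than the airports data (the `toy` frame and its values are hypothetical), is to pass `include="all"`, which reports count, unique, top and freq for text columns alongside the usual numeric statistics.

```python
import pandas as pd

# Hypothetical toy data with text and numeric columns
toy = pd.DataFrame({
    "iata": ["AAA", "BBB", "CCC", "DDD"],
    "size": ["large", "small", "small", None],
    "lat": [45.6, 37.6, 10.0, -3.2],
})

summary = toy.describe(include="all")
print(summary)
```

Statistics that do not apply to a column show up as NaN, and missing values such as the `None` above are left out of the count.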

In [None]:
df.describe(include=['O','int64'])

In [None]:
df.type.nunique()

In [None]:
df.type.unique()

In [None]:
df.groupby(["type"]).agg("count")

In [None]:
df.status.nunique()

In [None]:
df.status.unique()

In [None]:
df["size"].nunique()

In [None]:
df["size"].unique()

In [None]:
df["size"].value_counts(dropna=False) # show the Nones

In [None]:
df.dropna(axis=0, how="any", inplace=False)

In [None]:
df2 = df.dropna(axis=0, how="any", inplace=False)

In [None]:
df2.info()

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [None]:
big = df2[(df2["type"] == "airport") & (df2["size"] == "large")].reset_index(drop=True)

In [None]:
medium = df2[(df2["type"] == "airport") & (df2["size"] == "medium")].reset_index(drop=True)

In [None]:
small = df2[(df2["type"] == "airport") & (df2["size"] == "small")].reset_index(drop=True)

In [None]:
df2.loc[:, ["iata", "iso", "name"]]

In [None]:
big = big.loc[:, ["iata", "iso", "name"]]
medium = medium.loc[:, ["iata", "iso", "name"]]
small = small.loc[:, ["iata", "iso", "name"]]
latlong = df2.loc[: , ["iata", "continent", "lat", "lon"]]

In [None]:
big.info()

In [None]:
medium.info()

In [None]:
small.info()

In [None]:
latlong.info()

In [None]:
big.join(latlong.set_index("iata"), on="iata", how="inner", sort=True) # right index set to iata

In [None]:
pd.merge(big, latlong, how='left', on='iata', sort=True)

In [None]:
big[big.duplicated('iata')]

In [None]:
big[big.iata == "HYD"]

In [None]:
big.info()

In [None]:
big = big.drop(index=421)

In [None]:
big.info()

In [None]:
latlong.size

In [None]:
~latlong.duplicated('iata')

In [None]:
df3 = latlong[~latlong.duplicated()]

In [None]:
df3

In [None]:
df3.duplicated().value_counts()

In [None]:
df3[df3.iata == 'YAX']

In [None]:
big_airports = pd.merge(big, df3, how='left', on='iata', sort=True)
big_airports

In [None]:
big

In [None]:
df3

In [None]:
big_airports.loc[:, ['iata', 'iso', 'name']]

Per [the documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html) for `pandas.DataFrame.to_sql`, this method requires an already-open connection to the database in question, which may come from [SQLAlchemy](https://docs.sqlalchemy.org/en/20/) and/or SQLite. The former is a 3rd party Python database toolkit, and the latter is what we're using here, direct from the Standard Library.

The "tree" or "river delta" diagram below suggests two major user communities, that of website development and that of data science, both of which have their roots in talking to databases.

<a data-flickr-embed="true" href="https://www.flickr.com/photos/kirbyurner/24749338009/in/album-72177720296706479" title="Pythonic Ecosystem"><img src="https://live.staticflickr.com/1624/24749338009_537ab57eb1_w.jpg" width="300" height="400" alt="Pythonic Ecosystem"/></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>

In addition to a connection object (`db.conn` below), the `to_sql` method expects a table name. A database may contain any number of individual tables.
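
Here is a minimal sketch of that idea on a throwaway in-memory database. The `toy` frame and its coordinate values are illustrative only, and `index=False` is a choice made here to keep the DataFrame index out of the stored tables.

```python
import sqlite3
from contextlib import closing
import pandas as pd

# Hypothetical two-row table; coordinates are illustrative values
toy = pd.DataFrame({
    "iata": ["PDX", "SFO"],
    "name": ["Portland International", "San Francisco International"],
    "lat": [45.59, 37.62],
    "lon": [-122.60, -122.38],
})

with closing(sqlite3.connect(":memory:")) as conn:
    # one connection, two table names, two tables
    toy[["iata", "name"]].to_sql("Airports", conn, index=False, if_exists="replace")
    toy[["iata", "lat", "lon"]].to_sql("Coords", conn, index=False, if_exists="replace")

    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    coords = pd.read_sql("SELECT * FROM Coords", con=conn)

print(tables)
print(coords)
```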

In the code cell below, we might be creating `big_airports.db` for the first time, or it might be an existing file. Either way, we take our flat table, `big_airports`, and write it out in two tables, Airports and Coords.

In [None]:
with Connector('big_airports.db') as db:
    big_airports.loc[:, ['iata', 'iso', 'name']].to_sql('Airports', db.conn, if_exists='replace')
    big_airports.loc[:, ['iata', 'continent', 'lat', 'lon']].to_sql('Coords', db.conn, if_exists='replace')

As a check, let's reconstitute a flat file pairing airports with corresponding coordinates based on IATA code. An [SQLite inner join](https://www.sqlitetutorial.net/sqlite-inner-join/) will accomplish this.
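
In miniature, and on invented data, the sketch below suggests how an SQL inner join lines up with `pd.merge(..., how="inner")`. The two toy tables are hypothetical and reuse the table names Airports and Coords only for familiarity.

```python
import sqlite3
from contextlib import closing
import pandas as pd

# Hypothetical toy tables: only BBB and CCC appear in both
airports = pd.DataFrame({"iata": ["AAA", "BBB", "CCC"],
                         "name": ["Alpha", "Bravo", "Charlie"]})
coords = pd.DataFrame({"iata": ["BBB", "CCC", "DDD"],
                       "lat": [1.0, 2.0, 3.0]})

query = """
SELECT Airports.iata, name, Coords.lat
FROM Airports
INNER JOIN Coords ON Coords.iata = Airports.iata
ORDER BY Airports.iata
"""

with closing(sqlite3.connect(":memory:")) as conn:
    airports.to_sql("Airports", conn, index=False)
    coords.to_sql("Coords", conn, index=False)
    via_sql = pd.read_sql(query, con=conn)

# The pandas counterpart of the same join
via_pandas = pd.merge(airports, coords, on="iata", how="inner", sort=True)

print(via_sql)
print(via_pandas)
```

Both routes keep only the rows whose IATA code appears in both tables.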

In [None]:
sql_stmnt = """
SELECT 
    Airports.iata,
    iso,
    name,
    Coords.lat,
    Coords.lon,
    Coords.continent
FROM 
    Airports
INNER JOIN Coords ON 
    Coords.iata = Airports.iata
"""

with Connector("big_airports.db") as db:
    airports = pd.read_sql(sql_stmnt, con = db.conn)
    db.list_tables()
    print(db.lookup("Airports", "iata", "SFO"))
    print(db.lookup("Airports", "iata", "PDX"))

In [None]:
airports

## EXPLORING UNICODE

<a data-flickr-embed="true" href="https://www.flickr.com/photos/kirbyurner/29832307687/in/album-72177720296706479" title="Unicode on Windows"><img src="https://live.staticflickr.com/1847/29832307687_0aee594ec5_w.jpg" width="400" height="276" alt="Unicode on Windows"/></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>

Through Unicode we may access the emoji, which in turn may be used to craft practice dataframes, for learning purposes. Unicode itself, as a topic, permeates our technology, especially when it comes to natural language processing, a major area of Machine Learning (ML), for example in the form of LLMs (large language models, used to drive chatbots).

In [None]:
import unicodedata as ud

In [None]:
# help(ud)

In [None]:
smiley = ud.lookup("Smiling Face with Smiling Eyes")

In [None]:
smiley

In [None]:
ord(smiley)

In [None]:
ud.name(smiley)

In [None]:
"\N{SMILING FACE WITH SMILING EYES}"

In [None]:
"\N{HOT DOG}"

In [None]:
ord("\N{HOT DOG}")

In [None]:
start = hex(ord("\N{HOT DOG}")) # base 16 as a string
start

In [None]:
dec_start = int(start, base=16) # going back and forth between bases
dec_start

Our first range of emoji starts with hot dog (🌭) and ends with popcorn (🍿).
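
The same range can be sketched with only built-ins, assuming a Python whose Unicode database includes these emoji (any recent version should). The variable names `first`, `last` and `glyphs` are ours.

```python
import unicodedata as ud

first = ord("\N{HOT DOG}")   # starting code point
last = ord("\N{POPCORN}")    # ending code point, inclusive

# chr turns each code point back into a one-character string
glyphs = [chr(cp) for cp in range(first, last + 1)]

print(hex(first), hex(last), len(glyphs))
print(ud.name(glyphs[0]), "...", ud.name(glyphs[-1]))
```

Here `range` stands in for `np.arange`; either yields the consecutive integers that `chr` needs.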

A great resource for studying the emoji is [at Wikipedia](https://en.wikipedia.org/wiki/List_of_emojis).

In [None]:
"\N{POPCORN}" # the Unicode escape symbol

In [None]:
'🍿'.encode('utf-8')

In [None]:
b'\xf0\x9f\x8d\xbf'.decode()

In [None]:
stop = hex(ord("\N{POPCORN}"))
stop

In [None]:
dec_stop = int(stop, base=16)
dec_stop

In [None]:
code_range = np.arange(dec_start, dec_stop+1)

In [None]:
foods = [chr(codepoint) 
         for codepoint in 
         code_range]

In [None]:
print(foods)

In [None]:
code_range2 = np.arange(0x1f950, 0x1f96f+1)
foods2 = [chr(codepoint) 
         for codepoint in 
         code_range2]
print(foods2)

In [None]:
all_foods = foods + foods2

In [None]:
df_foods = pd.DataFrame({"NAME": [ud.name(food) for food in all_foods],
              "GLYPH": all_foods,
              "CODEPOINT": [ord(food) for food in all_foods]})

In [None]:
df_foods.sort_values("CODEPOINT")

In [None]:
df_foods = df_foods.set_index("GLYPH")

In [None]:
df_foods

In [None]:
df_foods.loc['🍯':'🍵',:]

*Note*:

You may also embed YouTube videos in markdown cells. Notebooks in this repo almost exclusively use the code cell method instead.

Example:

[![Less Than Jake — Scott Farcas Takes It On The Chin](https://img.youtube.com/vi/PYCxct2e0zI/0.jpg)](https://www.youtube.com/watch?v=PYCxct2e0zI)

[Markdown Guide](https://www.markdownguide.org/hacks/)
