# Handling Nulls in Data

These are notes on what I learned about handling nulls while building a DataFrame library.

## What is `NULL`?

Before we start comparing various DataFrame Frameworks and Databases, we should first define what `NULL` is.

The answer actually depends on who you ask.

If you asked:
 * A Mathematician: They would probably point you to the [Null Set](https://en.wikipedia.org/wiki/Null_set)
 * A Programmer: A null pointer or null reference is a value reserved to indicate that the pointer or reference does not refer to a valid object. This holds for statically typed languages like C/C++ as well as dynamically typed ones like Python, where it is called `None`. [Wikipedia](https://en.wikipedia.org/wiki/Null_pointer)
 * A Database Geek: `NULL` is a special marker used to indicate that a data value does not exist. [Wikipedia](https://en.wikipedia.org/wiki/Null_(SQL))
 
 
The latter two definitions are the ones relevant to us.

At first glance, the Programmer's definition and the Database Geek's seem like the same thing, and for many years I thought they were. However, using them interchangeably has some deep implications, especially now that we use Python DataFrame libraries such as Pandas, Spark and Daft where databases would historically have been used.


I think Kenneth Baclawski summarized the difference beautifully in his [write-up](https://arxiv.org/html/1606.00740v1) on this topic:

*The main distinction between the relational null and the programming language null is the following: The relational null represents the absence of a value in a field of a record; whereas the programming language null represents one of the possible values of a variable. Succinctly,*

***The programming language null is a value but database null is not a value.*** 

*Consequently, the phrase "null value" is an oxymoron for databases.*
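To make the distinction concrete, here is a minimal sketch that asks the same questions of Python's `None` and of SQL's `NULL`. It assumes only the standard-library `sqlite3` module with an in-memory database; the variable names are illustrative.

```python
import sqlite3

# Python: None is an ordinary value (a singleton object), so comparisons
# involving it give definite True/False answers.
python_results = (None == None, None != 7.5)  # noqa: E711

# SQL: NULL marks the absence of a value, so comparing against it is
# neither true nor false. The result is NULL, which sqlite3 hands back
# to Python as None. Only IS NULL gives a definite answer.
con = sqlite3.connect(":memory:")
sql_results = con.execute(
    "SELECT NULL = NULL, NULL != 7.5, NULL IS NULL"
).fetchone()
con.close()

print(python_results)  # (True, True)
print(sql_results)     # (None, None, 1)
```

Note the small irony in the last line: because the database result has to cross into Python, the "non-value" `NULL` is reported using the value `None`.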

## Who is affected by this?

It matters if you want to deal with missing data! Missing data is fairly common and can arise for a variety of reasons. The following example uses a dataset of movies and their review scores, where a movie that just came out is unreviewed and does not have a score yet.

The problem, however, is that DataFrames sit in an awkward spot between databases and programming languages. So which null should they use, and what should the behavior be?


## Where does it matter?

In [1]:
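import sqlite3
from tempfile import NamedTemporaryFile


# Use a named temporary file so SQLite gets a real filesystem path
# (formatting a file object into an f-string yields its repr, not a path).
tfile = NamedTemporaryFile(suffix=".db")
con = sqlite3.connect(tfile.name)
cur = con.cursor()
cur.execute("CREATE TABLE movie(title, year, score)")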

data = [
    ("Monty Python Live at the Hollywood Bowl", 1982, 7.9),
    ("Monty Python's The Meaning of Life", 1983, 7.5),
    ("Monty Python's Life of Brian", 1979, 8.0),

]
cur.executemany("INSERT INTO movie VALUES(?, ?, ?)", data)
con.commit()


In [9]:
cur.execute("SELECT score != 7.5 FROM movie;").fetchall()

[(1,), (0,), (1,)]

In [10]:

new_movie_without_a_rating = ("Monty Python's Really cool NULL adventure", 2022, None)
cur.execute("INSERT INTO movie VALUES(?, ?, ?)", new_movie_without_a_rating)
con.commit()

In [11]:
cur.execute("SELECT score != 7.5 FROM movie;").fetchall()

[(1,), (0,), (1,), (None,)]

In [20]:
import pandas as pd
df = pd.read_sql_query('SELECT * from movie;', con, dtype={'title': str, 'year': 'Int64'})

In [18]:
df[df['score'] != 7.5]

| | title | year | score |
|---|---|---|---|
| 0 | Monty Python Live at the Hollywood Bowl | 1982 | 7.9 |
| 2 | Monty Python's Life of Brian | 1979 | 8.0 |
| 3 | Monty Python's Really cool NULL adventure | 2022 | NaN |


In [None]:
df['score'] != 7.5
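The comparison above depends on how the column stores its missing entry. The following sketch contrasts a default `float64` column with pandas' nullable `Float64` extension dtype. It assumes a pandas version that ships `Float64` (1.2 or later), and the series names are illustrative rather than taken from the notebook.

```python
import pandas as pd

scores = [7.9, 7.5, 8.0, None]

# Default float64 column: the missing score is stored as NaN, a float value.
numpy_backed = pd.Series(scores, dtype="float64")

# Nullable extension dtype: the missing score is stored as pd.NA.
nullable = pd.Series(scores, dtype="Float64")

# NaN != 7.5 evaluates to True, so the mask marks the missing row as a match.
numpy_mask = numpy_backed != 7.5

# pd.NA != 7.5 evaluates to pd.NA, which mirrors SQL's three-valued logic.
nullable_mask = nullable != 7.5

print(numpy_mask.tolist())            # [True, False, True, True]
print(nullable_mask.isna().tolist())  # [False, False, False, True]
```

With the nullable dtype the mask itself carries a missing entry, so the caller has to decide explicitly what to do with it, for example via `nullable_mask.fillna(False)` before filtering.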

In [None]:
tfile.close()
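The cells above can be condensed into one self-contained comparison. This sketch rebuilds the same movie table in an in-memory SQLite database (an assumption made here to avoid the temporary file) and applies the same `score != 7.5` predicate as a SQL `WHERE` clause and as a pandas boolean mask.

```python
import sqlite3

import pandas as pd

rows = [
    ("Monty Python Live at the Hollywood Bowl", 1982, 7.9),
    ("Monty Python's The Meaning of Life", 1983, 7.5),
    ("Monty Python's Life of Brian", 1979, 8.0),
    ("Monty Python's Really cool NULL adventure", 2022, None),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movie(title, year, score)")
con.executemany("INSERT INTO movie VALUES(?, ?, ?)", rows)

# SQL keeps only rows where the predicate is true. NULL != 7.5 is NULL,
# not true, so the unreviewed movie is filtered out.
sql_titles = [
    title
    for (title,) in con.execute("SELECT title FROM movie WHERE score != 7.5")
]

# pandas stores the missing score as NaN, and NaN != 7.5 is True,
# so the unreviewed movie is kept.
df = pd.read_sql_query("SELECT * FROM movie", con)
pandas_titles = df[df["score"] != 7.5]["title"].tolist()
con.close()

print(len(sql_titles), len(pandas_titles))  # 2 3
```

The same data and the same predicate give two different answers, and the only difference is which notion of null the engine uses.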

In [None]:
x = pd.Series([float('inf')]) 

In [None]:
x * 0
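One way to read the cell above: `inf * 0` yields NaN, so NaN is a floating-point value that arithmetic can produce with no data missing at all. The sketch below uses plain Python floats instead of a pandas Series (a simplification assumed here) to show how that value behaves in comparisons.

```python
import math

# IEEE 754 arithmetic: infinity times zero is undefined, so the result is NaN.
nan_from_arithmetic = float("inf") * 0

print(math.isnan(nan_from_arithmetic))             # True
# NaN compares unequal to everything, including itself.
print(nan_from_arithmetic != 7.9)                  # True
print(nan_from_arithmetic == nan_from_arithmetic)  # False
```

A column that uses NaN as its missing marker therefore cannot tell "no score recorded" apart from "the computation had no defined result".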

In [None]:
None != 7.9

In [None]:
a: int

In [None]:
a