# 01_05: Data classes

Let us look at Python data structures from the perspective of a data scientist or a data analyst. What are the options for storing tabular data, such as a table of famous people with their names and birthdays?

In [1]:
import math
import collections
import dataclasses
import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as pp

<table>
<tr><th>name</th><th>lastname</th><th>birthday</th></tr>
<tr><td>Michele</td><td>Vallisneri</td><td>July 15</td></tr>
<tr><td>Albert</td><td>Einstein</td><td>March 14</td></tr>
<tr><td>John</td><td>Lennon</td><td>October 9</td></tr>
<tr><td>Jocelyn</td><td>Bell Burnell</td><td>July 15</td></tr>
</table>

A list of Python dicts is certainly a possibility. It is easy to access the columns by key and to query the rows using comprehensions, for instance to find the famous people with a birthday on July 15:

In [None]:
#store each person and their info in an individual dict
peopledict = [{"name": "Michele", "lastname": "Vallisneri",   "birthday": "July 15"},
              {"name": "Albert",  "lastname": "Einstein",     "birthday": "March 14"},
              {"name": "John",    "lastname": "Lennon",       "birthday": "October 9"},
              {"name": "Jocelyn", "lastname": "Bell Burnell", "birthday": "July 15"}]

In [3]:
# iterate through the entries and keep only those whose birthday is July 15
[person for person in peopledict if person["birthday"] == "July 15"]

[{'name': 'Michele', 'lastname': 'Vallisneri', 'birthday': 'July 15'},
 {'name': 'Jocelyn', 'lastname': 'Bell Burnell', 'birthday': 'July 15'}]
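
The same comprehension style can also pull out a single column or group rows by a key. The cell below is only a small sketch: it redefines the list so that it runs on its own, and the names `lastnames` and `by_birthday` are made up for the illustration.

```python
import collections

peopledict = [{"name": "Michele", "lastname": "Vallisneri",   "birthday": "July 15"},
              {"name": "Albert",  "lastname": "Einstein",     "birthday": "March 14"},
              {"name": "John",    "lastname": "Lennon",       "birthday": "October 9"},
              {"name": "Jocelyn", "lastname": "Bell Burnell", "birthday": "July 15"}]

# extract one "column" by its key
lastnames = [person["lastname"] for person in peopledict]

# group first names by birthday (by_birthday is a hypothetical helper name)
by_birthday = collections.defaultdict(list)
for person in peopledict:
    by_birthday[person["birthday"]].append(person["name"])

print(lastnames)
print(dict(by_birthday))
```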

In [4]:
# Another possibility lies in tuples or, even better, the namedtuples from the collections module in the Python standard library.
# With these we can create a specialized tuple that associates labels with columns
Person = collections.namedtuple("Person", ["name", "lastname", "birthday"])

In [None]:
# The syntax to create a Person like this is intuitive. We can also omit the labels
Person(name='Michele', lastname='Vallisneri', birthday='July 15')

In [6]:
peopletuples = [Person("Michele", "Vallisneri", "July 15"),
                Person("Albert", "Einstein", "March 14"),
                Person("John", "Lennon", "October 9"),
                Person("Jocelyn", "Bell Burnell", "July 15")]

In [7]:
# The columns can be accessed with the dot notation of Python object attributes
[person for person in peopletuples if person.lastname == "Lennon"]

[Person(name='John', lastname='Lennon', birthday='October 9')]
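
Because a namedtuple is still a tuple, positional access and unpacking keep working, and the fields cannot be reassigned. A minimal sketch, redefining `Person` so the cell is self-contained (the flag `is_immutable` is just an illustrative name):

```python
import collections

Person = collections.namedtuple("Person", ["name", "lastname", "birthday"])
john = Person("John", "Lennon", "October 9")

# positional access and unpacking work as for any tuple
print(john[1] == john.lastname)
name, lastname, birthday = john

# namedtuples are immutable: assigning to a field raises AttributeError
try:
    john.lastname = "Ono"
    is_immutable = False
except AttributeError:
    is_immutable = True

print(is_immutable)
```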

In [10]:
# Normal tuple indices (such as person[1]) would also work for accessing fields.
# Here we build a namedtuple from a dict using ** unpacking
Person(**peopledict[3])

Person(name='Jocelyn', lastname='Bell Burnell', birthday='July 15')

In [12]:
# In the other direction, the namedtuple method _asdict converts a tuple to a dictionary
peopletuples[3]._asdict()
# The leading underscore avoids a clash in case you want to use asdict as a field label; _asdict is specific to namedtuple

{'name': 'Jocelyn', 'lastname': 'Bell Burnell', 'birthday': 'July 15'}
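
Putting the two conversions together gives a round trip. The sketch below also uses `_fields` and `_replace`, two other underscore-prefixed namedtuple helpers from the standard library that this notebook does not otherwise cover; the variable names are illustrative.

```python
import collections

Person = collections.namedtuple("Person", ["name", "lastname", "birthday"])
record = {"name": "Jocelyn", "lastname": "Bell Burnell", "birthday": "July 15"}

jocelyn = Person(**record)      # dict -> namedtuple
roundtrip = jocelyn._asdict()   # namedtuple -> dict

print(roundtrip == record)
print(Person._fields)           # the field labels, in order

# _replace returns a modified copy and leaves the original untouched
updated = jocelyn._replace(birthday="unknown")
print(updated)
```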

In [15]:
# Python 3.7 introduced an alternative to tuples and dicts for data storage in the form of dataclasses.
# In a dataclass, we list the fields in order and specify their type, such as int or str. We can also provide a default value.
@dataclasses.dataclass
class Persondata:
    name: str
    lastname: str
    birthday: str = "unknown"


In [17]:
# The syntax here is again intuitive and we can either use or omit the labels
peopledata = [Persondata(name="Michele", lastname="Vallisneri", birthday="July 15"),
              Persondata("Albert", "Einstein", "March 14"),
              Persondata("John", "Lennon", "October 9"),
              Persondata("Jocelyn", "Bell Burnell", "July 15")]

In [18]:
# As with namedtuples, we access fields by name
[person for person in peopledata if person.birthday != "July 15"]

[Persondata(name='Albert', lastname='Einstein', birthday='March 14'),
 Persondata(name='John', lastname='Lennon', birthday='October 9')]
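
Two differences from namedtuple are worth a quick look: the default value is used when a field is omitted, and fields can be reassigned afterwards. A small sketch, with the class repeated so the cell stands alone (`default_birthday` is an illustrative name):

```python
import dataclasses

@dataclasses.dataclass
class Persondata:
    name: str
    lastname: str
    birthday: str = "unknown"

# birthday is omitted, so the default value is used
albert = Persondata("Albert", "Einstein")
default_birthday = albert.birthday
print(albert)

# unlike a namedtuple, a regular dataclass instance can be modified in place
albert.birthday = "March 14"
print(albert)
```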

So far this is very similar to namedtuple. However, a dataclass is a full Python class, so we can define methods that operate on the fields, just as we do in any Python class.

**Python dataclass**

`dataclass` is a decorator introduced in Python 3.7. It streamlines the writing of classes intended mainly for storing data by automatically generating `__init__()`, `__repr__()`, `__eq__()`, and other methods.

from dataclasses import dataclass

@dataclass
class \<classname\>: 
  
&emsp;  \<field1\>: \<type1\>

&emsp;  \<field2>: \<type2\> = \<defaultvalue\>
  
&emsp;  def \<method>(self, ...):

&emsp;&emsp;    [method body]
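
To see what the generated methods buy us, we can compare a hand-written class with its dataclass counterpart. Both class names below are hypothetical and exist only for this sketch.

```python
import dataclasses

class PlainPerson:  # hypothetical hand-written class, no generated methods
    def __init__(self, name, lastname):
        self.name = name
        self.lastname = lastname

@dataclasses.dataclass
class DataPerson:  # hypothetical dataclass with the same two fields
    name: str
    lastname: str

# without __eq__, two distinct objects are never equal, even with equal fields
print(PlainPerson("John", "Lennon") == PlainPerson("John", "Lennon"))

# the generated __eq__ compares field by field
print(DataPerson("John", "Lennon") == DataPerson("John", "Lennon"))

# the generated __repr__ lists the fields and their values
print(repr(DataPerson("John", "Lennon")))
```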

In [20]:
# The first argument of a method, conventionally called self, refers to the particular instance of the class,
# for instance a specific person
@dataclasses.dataclass
class Persondata:
    name: str
    lastname: str
    birthday: str = "unknown"
    
    # Let's do methods that provide a prettier printout

    # when writing class methods, "self" refers to instances
    def fullname(self):
        return self.name + " " + self.lastname

    # the special method __str__ overrides the standard printout
    def __str__(self):
        return self.lastname + ", " + self.name + ", born " + self.birthday

In [21]:
# Here we create a Persondata instance
michele = Persondata('Michele', 'Vallisneri', 'July 15')

In [22]:
#here's our prettier fullname method
michele.fullname()

'Michele Vallisneri'

In [23]:
# Here's how our new __str__ method makes Persondata objects print
print(michele)

Vallisneri, Michele, born July 15


In [None]:
# Dataclasses have a number of other useful features, such as freezing,
# which makes fields read-only once the instance has been created
@dataclasses.dataclass(frozen=True)
class Persondata_frozen:
    name: str
    lastname: str
    birthday: str = "unknown"

# order=True generates the comparison operators <, <=, >, and >=, which compare fields in definition order (== is available by default)
@dataclasses.dataclass(order=True)
class Persondata_ordered:
    name: str
    lastname: str
    birthday: str = "unknown"

# Alternatively, we can define ourselves how objects are compared by a given operator
@dataclasses.dataclass
class Persondata_customorder:
    name: str
    lastname: str
    birthday: str = "unknown"

    # custom "less than" comparison
    def __lt__(self, other):
        # tuples compare element by element, so this orders by last name, then first name, then birthday
        return (self.lastname, self.name, self.birthday) < (other.lastname, other.name, other.birthday)


# A field can also be computed from the others once initialization is done
@dataclasses.dataclass
class Persondata_computed:
    name: str
    lastname: str
    birthday: str = "unknown"
    fullname: str = dataclasses.field(init=False) # will compute it below, after initialization
    
    #this runs after initialization
    def __post_init__(self):
        self.fullname = self.name + " " + self.lastname
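
None of the classes above is exercised in this notebook, so here is a short sketch of how we might try them out. The class definitions are repeated so the cell runs on its own; the variable names (`assignment_blocked`, `sorted_lastnames`, and so on) are illustrative.

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Persondata_frozen:
    name: str
    lastname: str
    birthday: str = "unknown"

@dataclasses.dataclass(order=True)
class Persondata_ordered:
    name: str
    lastname: str
    birthday: str = "unknown"

@dataclasses.dataclass
class Persondata_customorder:
    name: str
    lastname: str
    birthday: str = "unknown"

    def __lt__(self, other):
        return (self.lastname, self.name, self.birthday) < (other.lastname, other.name, other.birthday)

@dataclasses.dataclass
class Persondata_computed:
    name: str
    lastname: str
    birthday: str = "unknown"
    fullname: str = dataclasses.field(init=False)

    def __post_init__(self):
        self.fullname = self.name + " " + self.lastname

# assigning to a field of a frozen instance raises FrozenInstanceError
frozen = Persondata_frozen("Albert", "Einstein", "March 14")
try:
    frozen.birthday = "unknown"
    assignment_blocked = False
except dataclasses.FrozenInstanceError:
    assignment_blocked = True

# order=True compares fields in definition order, so the first name decides here
albert_first = Persondata_ordered("Albert", "Einstein") < Persondata_ordered("John", "Lennon")

# sorted() relies on __lt__, so the custom comparison orders people by last name
people = [Persondata_customorder("John", "Lennon", "October 9"),
          Persondata_customorder("Albert", "Einstein", "March 14"),
          Persondata_customorder("Jocelyn", "Bell Burnell", "July 15")]
sorted_lastnames = [person.lastname for person in sorted(people)]

# fullname is filled in by __post_init__, not passed to the constructor
computed = Persondata_computed("Jocelyn", "Bell Burnell", "July 15")

print(assignment_blocked, albert_first, sorted_lastnames, computed.fullname)
```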

In [27]:
# Fields of a regular (non-frozen) dataclass can be reassigned after creation
peep = Persondata("Tom", "Hanks")
peep.lastname = "Henks"

In [28]:
# One thing we have not seen is how the type of a field, such as str, is used with dataclasses. In fact, by default, it is not.
# It is made available to third-party packages for certain applications, such as validating data entry.
# An excellent package for this purpose is Pydantic
import pydantic
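
To check the claim that the standard decorator ignores the declared types, we can pass an int where a str is annotated. This is a sketch using only the standard library; `odd` and `declared` are illustrative names.

```python
import dataclasses

@dataclasses.dataclass
class Persondata:
    name: str
    lastname: str
    birthday: str = "unknown"

# the standard decorator accepts an int where a str is annotated
odd = Persondata("Michele", 15, "July 15")
print(type(odd.lastname))

# the annotations are still recorded, and tools can inspect them
declared = {field.name: field.type for field in dataclasses.fields(Persondata)}
print(declared)
```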

In [None]:
# To use it, we replace the standard dataclass decorator with the equivalent from Pydantic
@pydantic.dataclasses.dataclass
class Persondata_pydantic:
    name: str
    lastname: str
    birthday: str = "unknown"

    # We also write a custom validator for the birthday field:
    # we try to parse it as a Python datetime object, which raises an exception if that's not possible
    @pydantic.field_validator("birthday")
    @classmethod
    def validate_date(cls, value):  # a class method, so the first argument is the class
        # raises ValueError if the date is not "MONTHNAME DAYNUMBER"
        datetime.datetime.strptime(value, "%B %d")
        return value

Here are two bad examples where the validation fails:

In [None]:
# fails because lastname is an int rather than a str
Persondata_pydantic("Michele", 15, "July 15")

In [None]:
# fails because the birthday is not in "MONTHNAME DAYNUMBER" format
Persondata_pydantic('Michele', "Vallisneri", "7/15")
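
For comparison, similar checks can be written by hand with the standard library alone. The sketch below is not how Pydantic works internally; it is a hypothetical `Persondata_checked` class that validates in `__post_init__`, assuming English month names and skipping the "unknown" default.

```python
import dataclasses
import datetime

@dataclasses.dataclass
class Persondata_checked:  # hypothetical name, standard library only
    name: str
    lastname: str
    birthday: str = "unknown"

    def __post_init__(self):
        # check by hand that every field holds a str
        for field in dataclasses.fields(self):
            if not isinstance(getattr(self, field.name), str):
                raise TypeError(f"{field.name} must be a str")
        # a leap year is appended so that "February 29" parses as well
        if self.birthday != "unknown":
            datetime.datetime.strptime(self.birthday + " 2000", "%B %d %Y")

# the same two bad inputs as above, plus one good one
errors = []
for args in [("Michele", 15, "July 15"), ("Michele", "Vallisneri", "7/15")]:
    try:
        Persondata_checked(*args)
    except (TypeError, ValueError) as error:
        errors.append(type(error).__name__)

valid = Persondata_checked("Michele", "Vallisneri", "July 15")
print(errors)
print(valid)
```

Writing such checks by hand quickly becomes repetitive, which is where a dedicated validation package earns its keep.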

Pydantic is a very sophisticated and powerful package with many features. It's also compatible with many data analysis and data science packages. If your package requires substantial data validation, it'll pay to dig into Pydantic.

This concludes the overview of basic data structures in Python.