Following on from my blog post Understanding NULLs in Pandas, this post discusses data types and, by extension, the data structures used by the Python Pandas library.

What are data types, I hear you ask? Well, they are how we categorise data.

If we lump all data together, how do we know whether we can do mathematical calculations on it?

While each software package has its own data types, there are really four basic types:
- Numerical
- String
- Date
- Boolean

Numerical and date are as they sound. String is text, but it can also hold numbers and symbols; no mathematical operations can be done on strings. Boolean is binary, e.g. True/False.
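
To make the difference concrete, here is a small plain-Python sketch. The variable names are made up for illustration. It shows that the same operator does arithmetic on a number but only repeats the characters of a string that happens to look like a number.

```python
# A number and a string that merely looks like a number
price_as_number = 10
price_as_text = "10"

# On a number, * is multiplication
doubled_number = price_as_number * 2

# On a string, * repeats the characters instead of doing maths
doubled_text = price_as_text * 2

print(doubled_number)  # 20
print(doubled_text)    # 1010
```

This is why it matters that a column of figures is stored with a numerical type rather than as text.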

## Pandas data types

### Numerical

So now to the Pandas data types, which are sometimes called primitive types.

Starting with numerical, Pandas offers two main numerical data types:
- Integer (int8, int16, int32, int64)
- Float (float16, float32, float64)

Integers, as you might remember from maths class, are whole numbers.

Floats are numbers with decimal points.
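
The number in each type name is its size in bits, so a smaller type uses less memory but can hold a smaller range of values. As a rough sketch (the example values are my own), we can compare an int64 Series with the same data stored as int8:

```python
import numpy as np
import pandas as pd

# Whole numbers stored explicitly as 64-bit integers
whole = pd.Series([1, 2, 3], dtype="int64")

# Numbers with decimal points are stored as floats
decimal = pd.Series([1.5, 2.0, 3.25])

# The same whole numbers squeezed into 8-bit integers
small = whole.astype("int8")

print(whole.dtype, decimal.dtype, small.dtype)

# Bytes used by the values only (index excluded): 8 bytes vs 1 byte per value
print(whole.memory_usage(index=False), small.memory_usage(index=False))

# The range an int8 can hold
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)
```

The trade-off is range: an int8 only covers -128 to 127, so downcasting is only safe when you know the values fit.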

### String

Pandas has historically only offered the Object data type for storing strings.

But since Pandas 1.0 we have also had a dedicated string type, StringDtype, which is a true string data type.

- Object (mixed type)
- String (StringDtype)
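
A minimal sketch of the difference, using made-up values: a Series that mixes text and numbers falls back to object, while asking for the "string" dtype gives a StringDtype Series.

```python
import pandas as pd

# Mixed values (text and a number) can only be stored as object
as_object = pd.Series(["Adelie", "Gentoo", 42])

# Explicitly requesting the dedicated string type
as_string = pd.Series(["Adelie", "Gentoo"], dtype="string")

print(as_object.dtype)
print(as_string.dtype)

# The .str accessor gives vectorised string methods
upper = as_string.str.upper()
print(upper.tolist())
```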

### Date

With dates there are a few data types that can be used, but the main ones come from NumPy.

- datetime64
- timedelta64[ns]

Datetime stores a single date and time in the format YYYY-MM-DD HH:MM:SS, for example 2012-05-01 00:00:00

Timedeltas are the duration between two dates, e.g. the two days between "2011-12-29" and "2011-12-31"
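
As a quick sketch of how the two relate, subtracting one datetime from another gives a timedelta. The dates are the two from the example above and the variable names are my own.

```python
import pandas as pd

# Parse text into a datetime Series
dates = pd.to_datetime(pd.Series(["2011-12-29", "2011-12-31"]))

# Subtracting two datetimes gives a Timedelta
gap = dates[1] - dates[0]
print(gap.days)

# diff() does the same for each row against the previous one,
# producing a timedelta Series (the first row has nothing before it, so it is NaT)
diffs = dates.diff()
print(diffs)
```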

### Boolean

In Pandas the boolean type is important because bools are used for filtering data.

To convert a column (Series) to boolean, it should hold only two distinct values, e.g. 0 and 1, or yes and no.
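
Here is a minimal sketch of both ideas, assuming a hypothetical "member" column holding yes/no text. The two values are mapped to True/False explicitly, and the resulting bool column is then used as a filter.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "member": ["yes", "no", "yes"],
})

# Map the two text values to real booleans
df["member"] = df["member"].map({"yes": True, "no": False}).astype("bool")
print(df["member"].dtype)

# A bool Series can be used directly as a filter (a boolean mask)
members_only = df[df["member"]]
print(members_only)
```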

### Category

Pandas also offers a category data type, which is great for reducing the size of a column (Series) that holds only a few distinct values.
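
To get a feel for the saving, here is a rough sketch with made-up data: three species names repeated a thousand times, stored first as plain text and then as category.

```python
import pandas as pd

# 3,000 rows but only three distinct values
species = pd.Series(["Adelie", "Gentoo", "Chinstrap"] * 1000)
as_category = species.astype("category")

# deep=True counts the memory used by the text itself
text_bytes = species.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(text_bytes, category_bytes)

# Each distinct value is stored once; rows hold small integer codes
print(as_category.cat.categories.tolist())
```

The exact byte counts depend on your Pandas version, but the category version should come out much smaller.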

## Let's see how these data types work

In [21]:
import pandas as pd
import numpy as np
import seaborn as sns

Let's import the penguins dataset from the seaborn library for convenience.

In [22]:
penguins = sns.load_dataset('penguins')
penguins.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


Let's drop the nulls to start with.

In [23]:
penguins = penguins.dropna()
penguins.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


Let's use the dtypes attribute to see the data types.

In [24]:
penguins.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

- You can see that 'species' and 'island' have been imported as object, which is fine for these.
- 'bill_length_mm' and 'bill_depth_mm' have been correctly imported as floats.
- 'flipper_length_mm' and 'body_mass_g' have been imported as floats due to missing data, but really we want them as ints.
- 'sex' has been imported as object because it holds text, but with only two values left after dropping the nulls we want it as bool.

### Convert data types

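In [25]:
# The whole-number measurements can become ints now that the nulls are gone
penguins = penguins.astype({'flipper_length_mm': 'int64', 'body_mass_g': 'int64'})
# astype('bool') would turn every non-empty string into True, so map the two values explicitly
penguins['sex'] = penguins['sex'].map({'Male': True, 'Female': False}).astype('bool')
penguins.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm      int64
body_mass_g            int64
sex                     bool
dtype: object

Now look at the data again. Note that 'sex' is now True for Male and False for Female.

In [26]:
penguins.head()

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | True |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | False |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | False |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | False |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | True |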
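
Dropping the nulls first is what makes the plain int conversion possible, because a regular int64 column cannot hold a missing value. If you would rather keep the rows, one option is the nullable integer type "Int64" (note the capital I). The sketch below uses a small made-up Series rather than the penguins data.

```python
import numpy as np
import pandas as pd

# Whole-number measurements with one missing value, stored as floats
flipper = pd.Series([181.0, 186.0, np.nan, 193.0])
print(flipper.dtype)

# The nullable integer type keeps the missing value as <NA>
nullable = flipper.astype("Int64")
print(nullable.dtype)
print(nullable)

# The missing value is still there
print(nullable.isna().sum())
```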


## Data Structures (complex data types)

Data structures are more complex data types.

Pandas only really has two of its own:
- Series
- DataFrames

But it also uses data structures from NumPy and Python, including:
- NumPy arrays
- Python lists
- Python dictionaries

Pandas data structures can be put into two categories: 1d (Series) and 2d (DataFrame).
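
As a small sketch of the 1d/2d split, here both structures are built from the same NumPy array and their dimensions checked. The array values and column names are made up.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

# A Series is one-dimensional: a single labelled column of values
s = pd.Series(values)

# A DataFrame is two-dimensional: rows and columns
df = pd.DataFrame({"x": values, "y": values * 2})

print(s.ndim, df.ndim)
print(s.shape, df.shape)
```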

### Pandas Series

Creating a series using a Python dictionary

In [32]:
data = {1:'a', 2:'b', 3:'c'}
s = pd.Series(data)
s

1    a
2    b
3    c
dtype: object

Creating a series using Python lists. Data and index must be the same length.

In [33]:
data = ['a','b','c']
index = [1,2,3]
s = pd.Series(data, index=index)
s

1    a
2    b
3    c
dtype: object

A few things to note about the Series data structure
- It has a data type; if mixed data is in the Series it will be object dtype
- It has an index
- It can be used like a dictionary to get and set values by index labels
- It can be vectorised, meaning looping is not necessary
- It can have a name attribute
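
The points above can be sketched in a few lines. The labels, values and the name "counts" are just for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="counts")

# Get and set values by index label, like a dictionary
first = s["a"]
s["b"] = 20

# Vectorised: the operation applies to every value, no loop needed
doubled = s * 2

print(first)
print(doubled)
print(s.dtype, s.name)
```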

### Pandas DataFrame

Creating a DataFrame from a dict of Series

In [35]:
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
df

|   | one | two |
|---|---|---|
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | 3.0 |
| d | NaN | 4.0 |


Creating a DataFrame from a dict of lists

In [36]:
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}

pd.DataFrame(d)

|   | one | two |
|---|---|---|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |


Creating a DataFrame from a list of dictionaries

In [39]:
d = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]

pd.DataFrame(d)

|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20.0 |


A few things to note about the DataFrame data structure
- DataFrames are 2d data structures, much like spreadsheets or SQL tables.
- DataFrame columns can have different data types
- Columns can be added to a DataFrame
- Pandas has methods for selecting rows and/or columns
- DataFrames align on both column and row indexes
- DataFrames have an index attribute which gives access to the row index
- DataFrames have the columns attribute for accessing column headers
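
Here is a short sketch that touches each of these points. The frame is a cut-down version of the examples above, and the second frame, "other", is made up to show alignment.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 3.0, 2.0]},
    index=["a", "b", "c"],
)

# Add a new column built from existing ones
df["three"] = df["one"] + df["two"]

# Select rows with a condition and columns by name
subset = df.loc[df["one"] > 1.0, ["one", "three"]]

# Arithmetic aligns on row and column labels;
# labels present in only one frame come out as NaN
other = pd.DataFrame({"one": [10.0, 20.0]}, index=["b", "d"])
aligned = df[["one"]] + other

print(subset)
print(aligned)

# Row labels and column headers
print(df.index.tolist(), df.columns.tolist())
```

Only row "b" exists in both frames, so it is the only row in the aligned result with a value.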

That's all for now.