# 5. Data types and missing data
---
This section is based on the official tutorials [Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html) and [Working with Missing Data](https://pandas.pydata.org/pandas-docs/stable/missing_data.html).

In [1]:
import pandas as pd
battles_got = pd.read_csv('datasets/battles.csv')
pd.set_option('display.max_rows', 5)
battles_got

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
0,Battle of the Golden Tooth,298,1,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,1.0,0.0,15000.0,4000.0,Jaime Lannister,"Clement Piper, Vance",1.0,Golden Tooth,The Westerlands,
1,Battle at the Mummer's Ford,298,2,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Baratheon,...,1.0,0.0,,120.0,Gregor Clegane,Beric Dondarrion,1.0,Mummer's Ford,The Riverlands,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36,Siege of Raventree,300,37,Joffrey/Tommen Baratheon,Robb Stark,Bracken,Lannister,,,Blackwood,...,0.0,1.0,1500.0,,"Jonos Bracken, Jaime Lannister",Tytos Blackwood,0.0,Raventree,The Riverlands,
37,Siege of Winterfell,300,38,Stannis Baratheon,Joffrey/Tommen Baratheon,Baratheon,Karstark,Mormont,Glover,Bolton,...,,,5000.0,8000.0,Stannis Baratheon,Roose Bolton,0.0,Winterfell,The North,


### 5.1. Data types
---
To find the data type of a `Series`, or of a single column in a `DataFrame`, use the `dtype` property:

In [2]:
battles_got.battle_number.dtype

dtype('int64')

If you want to know the `dtype` of every column in the dataset, you can use the `dtypes` property:

In [3]:
battles_got.dtypes

name      object
year       int64
           ...  
region    object
note      object
Length: 25, dtype: object

`int64` means that the column holds 64-bit integers, just as `float64` means that it holds 64-bit floating-point numbers.

One thing to keep in mind is that columns consisting entirely of strings don't get their own type; they are reported with the `object` type instead.

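As a minimal sketch of how these types are assigned, the following builds a small, made-up `DataFrame` (the column names and values are illustrative and are not read from `battles.csv`) and inspects its `dtypes`. The type of the string-only column is printed rather than stated, since it may depend on the installed pandas version.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "year": [298, 299, 300],                     # whole numbers
    "attacker_size": [15000.0, 1500.5, 5000.0],  # decimal numbers
    "name": ["Golden Tooth", "Raventree", "Winterfell"],  # strings only
    "mixed": [1, "two", 3.0],                    # mixed Python objects
})

# One dtype per column
print(toy.dtypes)
```
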
You can convert a column from one type to another, wherever it makes sense, by using the `astype` method. For example, we can convert a column from its existing `int64` data type into a `float64` data type:

In [4]:
battles_got.year.astype('float64')

0     298.0
1     298.0
      ...  
36    300.0
37    300.0
Name: year, Length: 38, dtype: float64

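The sketch below, which uses made-up values rather than the battles dataset, illustrates two points about `astype`: it returns a new `Series` and leaves the original untouched, and a conversion that does not make sense (here, place names to integers) raises an error instead of producing a result.

```python
import pandas as pd

# Hypothetical values, only for illustration
years = pd.Series([298, 299, 300], name="year")

as_float = years.astype("float64")  # returns a new Series
print(as_float.dtype)
print(years.dtype)                  # the original keeps its dtype

# A conversion that does not make sense raises an error
names = pd.Series(["Golden Tooth", "Raventree"], name="name")
try:
    names.astype("int64")
    conversion_failed = False
except (ValueError, TypeError) as err:
    conversion_failed = True
    print("Cannot convert:", err)
```
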
The index of a `DataFrame` or `Series` also has its own `dtype`:

In [5]:
battles_got.index.dtype

dtype('int64')

### 5.2. Missing data
---
Missing values are shown as `NaN`, short for "Not a Number". `NaN` is itself a floating-point value, so a numeric column that contains missing entries is stored as `float64`. `pandas` provides several tools for working with missing data, such as `pd.isnull` to select `NaN` entries (and its companion `pd.notnull`). Both are also available as `Series` methods, which is the form used here:

In [6]:
battles_got[battles_got.attacker_2.isnull()]

Unnamed: 0,name,year,battle_number,attacker_king,defender_king,attacker_1,attacker_2,attacker_3,attacker_4,defender_1,...,major_death,major_capture,attacker_size,defender_size,attacker_commander,defender_commander,summer,location,region,note
0,Battle of the Golden Tooth,298,1,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Tully,...,1.0,0.0,15000.0,4000.0,Jaime Lannister,"Clement Piper, Vance",1.0,Golden Tooth,The Westerlands,
1,Battle at the Mummer's Ford,298,2,Joffrey/Tommen Baratheon,Robb Stark,Lannister,,,,Baratheon,...,1.0,0.0,,120.0,Gregor Clegane,Beric Dondarrion,1.0,Mummer's Ford,The Riverlands,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33,Second Seige of Storm's End,300,34,Joffrey/Tommen Baratheon,Stannis Baratheon,Baratheon,,,,Baratheon,...,0.0,0.0,,200.0,"Mace Tyrell, Mathis Rowan",Gilbert Farring,0.0,Storm's End,The Stormlands,
34,Siege of Dragonstone,300,35,Joffrey/Tommen Baratheon,Stannis Baratheon,Baratheon,,,,Baratheon,...,0.0,0.0,2000.0,,"Loras Tyrell, Raxter Redwyne",Rolland Storm,0.0,Dragonstone,The Stormlands,


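To make the behaviour of `isnull` and `notnull` concrete, here is a small sketch on a made-up `Series` of army sizes (the values are assumptions for illustration). It shows that the presence of `NaN` makes the column `float64`, that the boolean mask can be summed to count missing entries, and that `pd.isnull` gives the same mask as the method form.

```python
import numpy as np
import pandas as pd

# Hypothetical army sizes; np.nan marks a missing entry
sizes = pd.Series([15000, np.nan, 1500, np.nan, 5000], name="attacker_size")

print(sizes.dtype)             # NaN forces a floating-point dtype
mask = sizes.isnull()          # boolean Series, True where a value is missing
print(mask.sum())              # number of missing entries
print(sizes[sizes.notnull()])  # keep only the rows that have a value

# pd.isnull is the function form and produces the same mask
same = pd.isnull(sizes).equals(mask)
print(same)
```
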
To replace missing values, use the `fillna` method, which provides a few different strategies for mitigating such data. For example, we can replace each `NaN` value with the string `"None"`:

In [7]:
battles_got.attacker_2.fillna("None")

0          None
1          None
        ...    
36    Lannister
37     Karstark
Name: attacker_2, Length: 38, dtype: object

For more information about the `fillna` method, read the [official function documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html).

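A few common ways of filling gaps can be sketched as follows. The toy data is made up for illustration, and the choice of fill values (a constant, the column mean, the previous valid value) is an assumption about what suits the data, not a recommendation. Forward filling is done here with the related `ffill` method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a gap in each column
toy = pd.DataFrame({
    "attacker_2": ["Lannister", np.nan, "Karstark"],
    "attacker_size": [1000.0, np.nan, 3000.0],
})

# 1. One constant per column, passed as a dict
constant = toy.fillna({"attacker_2": "None", "attacker_size": 0})

# 2. A value computed from the column itself (mean skips NaN by default)
mean_size = toy["attacker_size"].mean()
by_mean = toy["attacker_size"].fillna(mean_size)

# 3. Carry the previous valid value forward
carried = toy["attacker_size"].ffill()

print(constant)
print(by_mean.tolist())
print(carried.tolist())
```
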
To replace any other value, not just missing ones, use the `replace` method:

In [8]:
battles_got.attacker_king.replace("Joffrey/Tommen Baratheon", "Joffrey and Tommen Baratheon")

0     Joffrey and Tommen Baratheon
1     Joffrey and Tommen Baratheon
                  ...             
36    Joffrey and Tommen Baratheon
37               Stannis Baratheon
Name: attacker_king, Length: 38, dtype: object
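
As a further sketch, `replace` also accepts a dict of `{old: new}` pairs, which is convenient for several substitutions at once. The values below are a made-up subset used only for illustration. Note that `replace` matches whole values, so passing a partial string leaves the data unchanged, and that it returns a new `Series` rather than modifying the original.

```python
import pandas as pd

# Hypothetical subset of values, only for illustration
kings = pd.Series(
    ["Joffrey/Tommen Baratheon", "Robb Stark", "Stannis Baratheon"],
    name="attacker_king",
)

# Several replacements at once, given as an {old: new} dict
renamed = kings.replace({
    "Joffrey/Tommen Baratheon": "Joffrey and Tommen Baratheon",
    "Robb Stark": "R. Stark",
})

# Whole values are matched, so a partial string changes nothing
unchanged = kings.replace("Baratheon", "B.")

print(renamed.tolist())
print(unchanged.equals(kings))
print(kings.tolist())  # the original Series is untouched
```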