# Data Types and Missing Values

In [34]:
import pandas as pd

def currency_to_int(curr):
    curr_list = curr[1:].split(",")
    price = int("".join(curr_list))
    return price

def trans_rooms_to_int(entry):
    return int(entry.split(" ")[0])

# load the listings, then convert prices and room counts to integers
atl_addr = pd.read_csv("./atl-address-1.csv")
for i in range(len(atl_addr)):
    atl_addr.loc[i, "Price"] = currency_to_int(atl_addr.loc[i, "Price"])
    atl_addr.loc[i, "Beds"] = trans_rooms_to_int(atl_addr.loc[i, "Beds"])
    atl_addr.loc[i, "Baths"] = trans_rooms_to_int(atl_addr.loc[i, "Baths"])
       
atl_addr.head()

Unnamed: 0,Title,Price,Beds,Baths,Area
0,"34 The Prado NE, Atlanta, GA 30309",1495000,4,4,"3,644 sqft"
1,"2060 Shirley St SW, Atlanta, GA 30311",225000,3,2,"1,300 sqft"
2,"300 Peachtree St NE APT 11G, Atlanta, GA 30308",259000,2,2,890 sqft
3,"1690 Memorial Dr SE, Atlanta, GA 30317",320000,2,1,"1,163 sqft"
4,"6253 Old Kingston Dr, South Fulton, GA",349275,5,3,-- sqft
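
The row-by-row loop above can also be written with vectorized string methods. The sketch below is one possible version, not the notebook's own code: the `sample` frame is a hypothetical stand-in for `atl-address-1.csv`, and it assumes every price looks like `$1,495,000` and every room count starts with an integer followed by a space.

```python
import pandas as pd

# Hypothetical sample mimicking the raw CSV columns (values are assumptions).
sample = pd.DataFrame({
    "Price": ["$1,495,000", "$225,000"],
    "Beds": ["4 bds", "3 bds"],
    "Baths": ["4 ba", "2 ba"],
})

# Strip the leading currency symbol and the thousands separators, then cast.
sample["Price"] = (
    sample["Price"].str[1:].str.replace(",", "", regex=False).astype("int64")
)

# Keep the text before the first space and cast it to an integer.
for col in ["Beds", "Baths"]:
    sample[col] = sample[col].str.split(" ").str[0].astype("int64")

print(sample.dtypes)
```

Because each column is replaced as a whole, the converted columns end up with an integer dtype rather than staying generic.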


## Data types (dtype)
Use the **dtype** property to get the type of a specific column, and the **dtypes** property to get the type of every column in a DataFrame.

In [35]:
# dtypes property returns the dtype of every column in the DataFrame:
atl_addr.dtypes

Title    object
Price    object
Beds     object
Baths    object
Area     object
dtype: object

In [36]:
# dtype returns the type of a specific column
atl_addr.Baths.dtype

dtype('O')

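Columns consisting entirely of strings do not get their own type; they are instead given the **object** type. Price, Beds and Baths also show up as **object** here, even though they now hold integers, because the loop above wrote the values one at a time into columns that were read as strings.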
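
Since **object** only says that a column holds arbitrary Python objects, it does not reveal what is actually stored. A minimal sketch for checking, using a hypothetical `prices` series built with an explicit object dtype:

```python
import pandas as pd

# Hypothetical column: integers stored under the generic object dtype.
prices = pd.Series([1495000, 225000, 259000], dtype=object)

# The dtype only reports "object" ...
print(prices.dtype)

# ... so look at the Python type of each element instead.
element_types = prices.map(lambda value: type(value).__name__)
print(element_types.unique())

# When every element is an integer, the column can be cast to a numeric dtype.
prices_int = prices.astype("int64")
print(prices_int.dtype)
```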

In [37]:
# convert a column of one type into another via astype()
atl_addr.Price.astype("float64").head(5)

0    1495000.0
1     225000.0
2     259000.0
3     320000.0
4     349275.0
Name: Price, dtype: float64
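
`astype()` raises an error when a value cannot be converted, which would happen with entries such as `-- sqft` in the Area column. One possible approach, sketched here on a hypothetical `area` series copied from the rows shown earlier, is to clean the strings first and then call `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN.

```python
import pandas as pd

# Hypothetical sample taken from the Area values displayed earlier.
area = pd.Series(["3,644 sqft", "890 sqft", "-- sqft"])

# Drop the unit and the thousands separators, leaving "3644", "890", "--".
cleaned = (
    area.str.replace(" sqft", "", regex=False)
        .str.replace(",", "", regex=False)
)

# errors="coerce" converts anything that is not a number into NaN.
area_sqft = pd.to_numeric(cleaned, errors="coerce")
print(area_sqft)
```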

In [38]:
# dtype of index
atl_addr.index.dtype

dtype('int64')

## Missing Data
Entries with missing values are given the value **NaN**, short for "Not a Number". For technical reasons these NaN values are always of the **float64** dtype.<br>
To select NaN entries you can use **pd.isnull()** (or its companion **pd.notnull()**).
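
Both points can be seen on a small example. In this sketch the `complete` and `beds` series are hypothetical; a single missing entry is enough to turn an integer column into float64, and `pd.notnull()` can then filter the missing rows out.

```python
import pandas as pd

# Without missing values the column is stored as integers.
complete = pd.Series([4, 3, 2])
print(complete.dtype)

# Hypothetical column with one missing entry: None is stored as NaN.
beds = pd.Series([4, None, 2])
print(beds.dtype)

# Boolean mask marking the missing entries.
print(pd.isnull(beds).tolist())

# Keep only the rows that have a value.
known_beds = beds[pd.notnull(beds)]
print(known_beds.tolist())
```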

In [39]:
# check if there is a NaN entry
atl_addr.Area.isnull().head(5)

0    False
1    False
2    False
3    False
4    False
Name: Area, dtype: bool

In [40]:
# select rows with NaN entries
atl_addr[pd.isnull(atl_addr.Baths)]

Unnamed: 0,Title,Price,Beds,Baths,Area


In [43]:
# replace NaN values via fillna()
atl_addr.loc[20] = ["2550 Akers Mill Rd SE", None, None, None, None]  # create a row with NaN data
atl_addr.Price.fillna("Unknown").tail(5)  # replace NaN with Unknown

16     250000
17     199999
18     260000
19     272900
20    Unknown
Name: Price, dtype: object

Or we could fill each missing value with the first non-null value that appears sometime after the given record in the database. This is known as the **backfill strategy**.
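
A minimal sketch of the backfill strategy, using a hypothetical `beds` series and the `bfill()` method:

```python
import pandas as pd

# Hypothetical column with gaps (None becomes NaN).
beds = pd.Series([4, None, None, 2, None])

# Each NaN takes the next non-null value that follows it.
backfilled = beds.bfill()
print(backfilled.tolist())
```

The last entry stays NaN because no later value exists to copy from.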

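Alternatively, we may have a non-null value that we would like to replace. The **replace()** method handles this case; note that it is a plain value substitution, not a backfill.

In [51]:
# replace one value with another via replace()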
atl_addr.Beds.replace(5, "TOO MANY").head(10)

0           4
1           3
2           2
3           2
4    TOO MANY
5           2
6    TOO MANY
7           1
8           4
9    TOO MANY
Name: Beds, dtype: object