# Introduction

In this tutorial, you'll learn how to investigate data types within a DataFrame or Series.  You'll also learn how to find and replace entries.


# Dtypes

The data type for a column in a DataFrame or a Series is known as the **dtype**.

You can use the `dtype` property to grab the type of a specific column.  For instance, we can get the dtype of the `price` column in the `reviews` DataFrame:

In [1]:

import pandas as pd
reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)
#pd.set_option('max_rows', 5)

In [3]:
reviews.head()

id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,dupe?
94355,Austria,"""Chremisa,"" the ancient name of Krems, is comm...",Edition Chremisa Sandgrube 13,85,24.0,Niederösterreich,,,Roger Voss,@vossroger,Winzer Krems 2011 Edition Chremisa Sandgrube 1...,Grüner Veltliner,Winzer Krems,
126883,US,$10 for this very drinkable Cab? That's crazy....,,87,10.0,California,North Coast,North Coast,Virginie Boone,@vboone,Line 39 2009 Cabernet Sauvignon (North Coast),Cabernet Sauvignon,Line 39,
119493,US,$14 is a pretty good price for a Chardonnay th...,Whiplash,86,14.0,California,California,California Other,,,Jamieson Ranch 2011 Whiplash Chardonnay (Calif...,Chardonnay,Jamieson Ranch,
126909,Spain,"). Earth, cola and leather aromas are good, ho...",Finca Resalso,86,15.0,Northern Spain,Ribera del Duero,,Michael Schachner,@wineschach,Emilio Moro 2009 Finca Resalso (Ribera del Du...,Tinto Fino,Emilio Moro,
119752,Spain,). Light and lemony on the nose. The palate ha...,,87,17.0,Galicia,Rías Baixas,,Michael Schachner,@wineschach,La Caña 2010 Albariño (Rías Baixas),Albariño,La Caña,


In [5]:
reviews.price.dtype

dtype('float64')

Alternatively, the `dtypes` property returns the `dtype` of _every_ column in the DataFrame:

In [7]:
reviews.dtypes

country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dupe?                    float64
dtype: object

Data types tell us something about how pandas is storing the data internally. `float64` means that it's using a 64-bit floating point number; `int64` means a similarly sized integer instead, and so on.
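As a rough illustration of what "64-bit" means here, the sketch below builds a tiny stand-in table (the `toy` DataFrame and its values are hypothetical, not taken from the wine data) and checks how many bytes each value occupies:

```python
import pandas as pd

# Hypothetical miniature table, used only for illustration.
toy = pd.DataFrame({
    "points": [85, 87, 86],       # whole numbers
    "price": [24.0, 10.0, 14.5],  # decimal numbers
})

points_dtype = toy.points.dtype
price_dtype = toy.price.dtype
print(toy.dtypes)

# itemsize is the number of bytes per value: 8 bytes = 64 bits
print(points_dtype.itemsize, price_dtype.itemsize)
```

Both columns use 8 bytes per value; the dtype only determines whether those bytes are interpreted as an integer or as a floating point number.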

One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the `object` type.

It's possible to convert a column of one type into another, wherever such a conversion makes sense, by using the `astype()` method. For example, we may transform the `points` column from its existing `int64` data type into a `float64` data type:

In [9]:
reviews.points.astype('float64')  # e.g. 87 becomes 87.0

id
94355     85.0
126883    87.0
119493    86.0
126909    86.0
119752    87.0
          ... 
80210     90.0
76487     88.0
86953     90.0
18824     87.0
88999     90.0
Name: points, Length: 119988, dtype: float64
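To make "wherever such a conversion makes sense" a little more concrete, here is a small sketch on made-up Series (`scores` and `messy` are hypothetical names and values). Text that looks like numbers converts cleanly, while text that does not is expected to raise an error:

```python
import pandas as pd

scores = pd.Series(["85", "87", "90"])  # numbers stored as text
as_int = scores.astype("int64")         # text -> integers
as_float = as_int.astype("float64")     # integers -> floats
print(as_float.tolist())

# A value that is not a number cannot be converted to an integer.
messy = pd.Series(["85", "n/a"])
try:
    messy.astype("int64")
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)
```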

A DataFrame or Series index has its own `dtype`, too:

In [11]:
reviews.index.dtype

dtype('int64')

In [15]:
price = "100$"
price = price.replace("$", "")  # drop the currency symbol
price = int(price)              # convert the cleaned string to an integer
print(price)

100
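The same clean-up can be applied to a whole column at once with the `.str` accessor followed by `astype()`. This is only a sketch: `raw_prices` is a hypothetical Series, since the `price` column in `reviews` is already numeric.

```python
import pandas as pd

# Hypothetical prices stored as text with a trailing currency symbol.
raw_prices = pd.Series(["100$", "24$", "15$"])

# regex=False treats "$" as a literal character rather than a regex anchor.
clean_prices = raw_prices.str.replace("$", "", regex=False).astype("int64")
print(clean_prices.tolist())
```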


Pandas also supports more exotic data types, such as categorical data and time series data. Because these data types are less commonly used, we will omit them until a much later section of this tutorial.

# Missing data

Entries with missing values are given the value `NaN`, short for "Not a Number". For technical reasons these `NaN` values are always of the `float64` dtype.
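One visible consequence is that a numeric column holding only whole numbers is stored as `float64` as soon as a value is missing. A minimal sketch with made-up values (`complete` and `with_gap` are hypothetical names):

```python
import numpy as np
import pandas as pd

complete = pd.Series([85, 87, 90])      # no missing values
with_gap = pd.Series([85, np.nan, 90])  # one missing value

print(complete.dtype)
print(with_gap.dtype)

# NaN itself is an ordinary Python float.
nan_is_float = isinstance(np.nan, float)
```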

Pandas provides some methods specific to missing data. To select `NaN` entries you can use `pd.isnull()` (or its companion `pd.notnull()`). They are meant to be used like this:

In [19]:
# Both lines select the same rows; the notebook displays only the last expression.
reviews[pd.isnull(reviews.country)]
reviews[reviews.country.isnull()]

id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,dupe?
129590,,"A blend of 60% Syrah, 30% Cabernet Sauvignon a...",Shah,90,30.0,,,,Mike DeSimone,@worldwineguys,Büyülübağ 2012 Shah Red,Red Blend,Büyülübağ,
11150,,"A blend of 85% Melnik, 10% Grenache Noir and 5...",,89,20.0,,,,Jeff Jenssen,@worldwineguys,Orbelus 2013 Melnik,Melnik,Orbelus,
46352,,"A dark color and rich, jammy, baked aromas of ...",Catalina,91,50.0,,,,Michael Schachner,@wineschach,Santa Ema 2014 Catalina Red,Red Blend,Santa Ema,
105167,,"A lively whiff of white blossom, Muscat grapes...",Steirische Klassik,92,21.0,,,,Anne Krebiehl MW,@AnneInVino,Neumeister 2016 Steirische Klassik Gelber Musk...,Gelber Muskateller,Neumeister,
85821,,A neutral nose leads onto an apple-fresh palat...,Steirische Klassik,90,22.0,,,,Anne Krebiehl MW,@AnneInVino,Neumeister 2015 Steirische Klassik Morillon,Morillon,Neumeister,
54222,,"Almost caramel in color, this wine offers arom...",Babaneuri Valley,87,30.0,,,,Mike DeSimone,@worldwineguys,Gotsa Family Wines 2013 Babaneuri Valley Mtsvane,Mtsvane,Gotsa Family Wines,
913,,"Amber in color, this wine has aromas of peach ...",Asureti Valley,87,30.0,,,,Mike DeSimone,@worldwineguys,Gotsa Family Wines 2014 Asureti Valley Chinuri,Chinuri,Gotsa Family Wines,
66524,,"Amber to the eye, this wine features an entici...",Dry Unfiltered Rkatsiteli Mtsvane,88,25.0,,,,Mike DeSimone,@worldwineguys,Gotsa Family Wines 2015 Dry Unfiltered Rkatsit...,White Blend,Gotsa Family Wines,
36112,,An interesting blend of indigenous Bulgarian a...,Hrumki Melnik 55 Mourvèdre Marselan,89,25.0,,,,Jeff Jenssen,@worldwineguys,Orbelus 2012 Hrumki Melnik 55 Mourvèdre Marsel...,Red Blend,Orbelus,
105715,,"Aromas of blackberry, strawberry and chocolate...",Peak,92,50.0,,,,Mike DeSimone,@worldwineguys,Psagot 2014 Peak Red,Rhône-style Red Blend,Psagot,
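The selection above works because `pd.isnull()` returns a boolean mask that is then used to filter rows. A small sketch on a hypothetical `toy` DataFrame shows the mask, its complement, and a quick count of missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing countries.
toy = pd.DataFrame({
    "country": ["Italy", np.nan, "US", None],
    "points": [90, 89, 87, 91],
})

missing = toy[pd.isnull(toy.country)]    # rows where country is missing
present = toy[pd.notnull(toy.country)]   # rows where country is known
n_missing = toy.country.isnull().sum()   # True counts as 1 when summed

print(missing.index.tolist(), present.index.tolist(), n_missing)
```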


Replacing missing values is a common operation. Pandas provides a really handy method for this problem: `fillna()`. `fillna()` provides a few different strategies for handling such data. For example, we can simply replace each `NaN` with `"Unknown"`:

In [21]:
reviews.region_2.fillna("Unknown")

id
94355              Unknown
126883         North Coast
119493    California Other
126909             Unknown
119752             Unknown
                ...       
80210              Unknown
76487              Unknown
86953              Unknown
18824     Sierra Foothills
88999              Unknown
Name: region_2, Length: 119988, dtype: object

Or we could fill each missing value with the first non-null value that appears after the given record in the dataset. This is known as the backfill strategy.
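A minimal sketch of the backfill strategy, using the `bfill()` method on a hypothetical `region` Series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

region = pd.Series(["Napa", np.nan, np.nan, "Sonoma", np.nan])

# Each NaN takes the next non-null value further down the Series.
backfilled = region.bfill()
print(backfilled.tolist())
```

Note that the final entry stays `NaN`, because no later value exists to copy from.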

Alternatively, we may have a non-null value that we would like to replace. For example, suppose that since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from `@kerinokeefe` to `@kerino`. One way to reflect this in the dataset is using the `replace()` method:

In [23]:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

id
94355      @vossroger
126883        @vboone
119493            NaN
126909    @wineschach
119752    @wineschach
             ...     
80210             NaN
76487             NaN
86953             NaN
18824         @vboone
88999     @AnneInVino
Name: taster_twitter_handle, Length: 119988, dtype: object

The `replace()` method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like `"Unknown"`, `"Undisclosed"`, `"Invalid"`, and so on.
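For example, sentinel strings can be turned into real `NaN` values so that `isnull()` and `fillna()` treat them as missing. This is a sketch on a hypothetical `origin` Series; the wine dataset may not contain these sentinels at all:

```python
import numpy as np
import pandas as pd

origin = pd.Series(["US", "Unknown", "Italy", "Undisclosed"])

# Passing a list replaces every listed sentinel with NaN in one call.
cleaned = origin.replace(["Unknown", "Undisclosed"], np.nan)
n_missing = cleaned.isnull().sum()
print(cleaned.tolist(), n_missing)
```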

# Your turn



In [27]:
reviews['taster_twitter_handle'] = reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
reviews

id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,dupe?
94355,Austria,"""Chremisa,"" the ancient name of Krems, is comm...",Edition Chremisa Sandgrube 13,85,24.0,Niederösterreich,,,Roger Voss,@vossroger,Winzer Krems 2011 Edition Chremisa Sandgrube 1...,Grüner Veltliner,Winzer Krems,
126883,US,$10 for this very drinkable Cab? That's crazy....,,87,10.0,California,North Coast,North Coast,Virginie Boone,@vboone,Line 39 2009 Cabernet Sauvignon (North Coast),Cabernet Sauvignon,Line 39,
119493,US,$14 is a pretty good price for a Chardonnay th...,Whiplash,86,14.0,California,California,California Other,,,Jamieson Ranch 2011 Whiplash Chardonnay (Calif...,Chardonnay,Jamieson Ranch,
126909,Spain,"). Earth, cola and leather aromas are good, ho...",Finca Resalso,86,15.0,Northern Spain,Ribera del Duero,,Michael Schachner,@wineschach,Emilio Moro 2009 Finca Resalso (Ribera del Du...,Tinto Fino,Emilio Moro,
119752,Spain,). Light and lemony on the nose. The palate ha...,,87,17.0,Galicia,Rías Baixas,,Michael Schachner,@wineschach,La Caña 2010 Albariño (Rías Baixas),Albariño,La Caña,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80210,Italy,Zonchera is Ceretto's more affordable base Bar...,Zonchera,90,48.0,Piedmont,Barolo,,,,Ceretto 2004 Zonchera (Barolo),Nebbiolo,Ceretto,
76487,Italy,Zonin's 2006 Amarone opens with very ripe arom...,,88,70.0,Veneto,Amarone della Valpolicella,,,,Zonin 2006 Amarone della Valpolicella,"Corvina, Rondinella, Molinara",Zonin,
86953,Italy,Zorzettig's precious Picolit dessert wine deli...,,90,,Northeastern Italy,Colli Orientali del Friuli,,,,Zorzettig 2006 Picolit (Colli Orientali del Fr...,Picolit,Zorzettig,
18824,US,Zucca has made a fragrant and floral Sangioves...,Sangiovese Rosato,87,18.0,California,Amador County,Sierra Foothills,Virginie Boone,@vboone,Zucca 2010 Sangiovese Rosato Rosé (Amador County),Rosé,Zucca,


In [45]:
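# Re-running this cell appends the suffix again, which is likely why it appears twice in the output below.
reviews['taster_twitter_handle'] = reviews.taster_twitter_handle.str.replace("@", "$", regex=False) + "  Ahmed"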
reviews

id,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,dupe?
94355,Austria,"""Chremisa,"" the ancient name of Krems, is comm...",Edition Chremisa Sandgrube 13,85,24.0,Niederösterreich,,,Roger Voss,$vossroger Ahmed Ahmed,Winzer Krems 2011 Edition Chremisa Sandgrube 1...,Grüner Veltliner,Winzer Krems,
126883,US,$10 for this very drinkable Cab? That's crazy....,,87,10.0,California,North Coast,North Coast,Virginie Boone,$vboone Ahmed Ahmed,Line 39 2009 Cabernet Sauvignon (North Coast),Cabernet Sauvignon,Line 39,
119493,US,$14 is a pretty good price for a Chardonnay th...,Whiplash,86,14.0,California,California,California Other,,,Jamieson Ranch 2011 Whiplash Chardonnay (Calif...,Chardonnay,Jamieson Ranch,
126909,Spain,"). Earth, cola and leather aromas are good, ho...",Finca Resalso,86,15.0,Northern Spain,Ribera del Duero,,Michael Schachner,$wineschach Ahmed Ahmed,Emilio Moro 2009 Finca Resalso (Ribera del Du...,Tinto Fino,Emilio Moro,
119752,Spain,). Light and lemony on the nose. The palate ha...,,87,17.0,Galicia,Rías Baixas,,Michael Schachner,$wineschach Ahmed Ahmed,La Caña 2010 Albariño (Rías Baixas),Albariño,La Caña,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80210,Italy,Zonchera is Ceretto's more affordable base Bar...,Zonchera,90,48.0,Piedmont,Barolo,,,,Ceretto 2004 Zonchera (Barolo),Nebbiolo,Ceretto,
76487,Italy,Zonin's 2006 Amarone opens with very ripe arom...,,88,70.0,Veneto,Amarone della Valpolicella,,,,Zonin 2006 Amarone della Valpolicella,"Corvina, Rondinella, Molinara",Zonin,
86953,Italy,Zorzettig's precious Picolit dessert wine deli...,,90,,Northeastern Italy,Colli Orientali del Friuli,,,,Zorzettig 2006 Picolit (Colli Orientali del Fr...,Picolit,Zorzettig,
18824,US,Zucca has made a fragrant and floral Sangioves...,Sangiovese Rosato,87,18.0,California,Amador County,Sierra Foothills,Virginie Boone,$vboone Ahmed Ahmed,Zucca 2010 Sangiovese Rosato Rosé (Amador County),Rosé,Zucca,


In [49]:
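# str.strip() removes only leading and trailing whitespace, so the spaces inside each value remain.
reviews.taster_twitter_handle.str.strip()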

id
94355      $vossroger  Ahmed  Ahmed
126883        $vboone  Ahmed  Ahmed
119493                          NaN
126909    $wineschach  Ahmed  Ahmed
119752    $wineschach  Ahmed  Ahmed
                    ...            
80210                           NaN
76487                           NaN
86953                           NaN
18824         $vboone  Ahmed  Ahmed
88999     $AnneInVino  Ahmed  Ahmed
Name: taster_twitter_handle, Length: 119988, dtype: object
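The output above is unchanged because `str.strip()` only trims the two ends of each string. A small sketch on a hypothetical `handles` Series makes the behaviour easier to see:

```python
import pandas as pd

handles = pd.Series(["  @vboone  ", "@wine  schach", None])

stripped = handles.str.strip()        # trims the ends, keeps interior spaces
no_prefix = stripped.str.lstrip("@")  # strips a leading "@" character

print(stripped.tolist())
print(no_prefix.tolist())
```

Missing values pass through the `.str` methods untouched, so the last entry is still missing afterwards.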