**DATA TYPES AND MISSING VALUES**

We'll see how to investigate data types within a DataFrame or Series, and also learn how to find and replace entries.

**Dtypes**

The data type of a column in a DataFrame, or of a Series, is known as its dtype.

You can use the dtype property to grab the type of a specific column. For instance, we can get the dtype of the price column in the reviews DataFrame:

In [11]:
reviews.price.dtype

dtype('float64')

Alternatively, the dtypes **property** returns the dtype of every column in the DataFrame:

In [13]:
reviews.dtypes

country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

Data types tell us something about how pandas is storing the data internally. float64 means that it's using a 64-bit floating point number; int64 means a similarly sized integer instead, and so on.

One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the object type.

It's possible to convert a column of one type into another wherever such a conversion makes sense by using the **astype()** method. For example, we may transform the points column from its existing int64 data type into a float64 data type:



In [9]:
reviews.points.astype('float64')


0      87.0
1      87.0
2     100.0
3      90.0
4      89.0
5      88.0
6      92.0
7      86.0
8      91.0
9      85.0
10     87.0
Name: points, dtype: float64
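
As a small sketch of the conversion above (using a made-up Series rather than the reviews data), note that astype() returns a new object and leaves the original untouched. The names `points` and `as_float` are only illustrative:

```python
import pandas as pd

# Hypothetical toy data standing in for reviews.points
points = pd.Series([87, 100, 90], name="points")

# astype() builds a converted copy; it does not modify `points`
as_float = points.astype("float64")

print(points.dtype)    # the original Series keeps its integer dtype
print(as_float.dtype)  # the converted copy is float64
```

To keep the converted values in a DataFrame, the result has to be assigned back to the column.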

A DataFrame or Series index has its own dtype, too:


In [7]:
reviews.index.dtype


dtype('int64')

Pandas also supports more exotic data types, such as categorical data and timeseries data. Because these data types are more rarely used, we will omit them until a much later section of this tutorial.


**Missing data**

Entries with missing values are given the value NaN, short for "Not a Number". For technical reasons these NaN values are always of the float64 dtype.
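
One visible consequence, sketched here with made-up values (the names `complete` and `with_gap` are assumptions for illustration): a column of integers that contains a single missing entry ends up stored as float64.

```python
import pandas as pd

# Hypothetical toy values: the same numbers with and without a gap
complete = pd.Series([10, 20, 30])
with_gap = pd.Series([10, None, 30])

print(complete.dtype)              # integer dtype, no missing values
print(with_gap.dtype)              # float64, because NaN is a float
print(with_gap.isnull().tolist())  # marks which entry is missing
```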

Pandas provides some methods specific to missing data. To select NaN entries you can use **pd.isnull()** (or its companion **pd.notnull()**). This is meant to be used thusly:

In [17]:
reviews[pd.isnull(reviews.country)]


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
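
Because the selection above comes back empty for this dataset, here is a minimal sketch on a made-up table (the `toy` DataFrame is hypothetical) showing both masks returning rows:

```python
import pandas as pd

# Hypothetical miniature of the reviews table with one missing country
toy = pd.DataFrame({
    "country": ["Italy", None, "France"],
    "points": [87, 90, 100],
})

missing_country = toy[pd.isnull(toy.country)]  # rows where country is missing
known_country = toy[pd.notnull(toy.country)]   # rows where country is present

print(len(missing_country), len(known_country))
```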


Replacing missing values is a common operation. Pandas provides a really handy method for this problem: fillna(). **fillna()** provides a few different strategies for mitigating such data. For example, we can simply replace each NaN with an "Unknown":

In [23]:
reviews[pd.isnull(reviews.region_2)]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
6,Argentina,"Bold and structured, offering black fruit and ...",Gran Reserva,92,30.0,Mendoza Province,Uco Valley,,Alejandro Iglesias,@aliglesiaswine,Trapiche 2015 Gran Reserva Malbec (Uco Valley),Malbec,Trapiche


In [35]:
reviews.region_2.fillna("Unknown")
# fillna() returns a new Series; the original df is unchanged unless I assign the result back
# --> reviews['region_2'] = reviews['region_2'].fillna("Unknown")

0        Eastern Sicily
1     Northern Portugal
2      Northeast France
3      Northeast France
4        Northern Spain
5           North Coast
6               Unknown
7        Central Valley
8       Western Germany
9        Coastal Region
10       Coastal Region
Name: region_2, dtype: object

In [45]:
reviews['region_2'] = reviews['region_2'].fillna("Unknown")
reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, and mine...",Vulkà Bianco,87,20.0,Sicily & Sardinia,Etna,Eastern Sicily,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,Douro,Northern Portugal,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,France,"A dry style of Pinot Gris, crisp with acidity ...",Classic,100,32.0,Alsace,Alsace,Northeast France,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
3,France,"Big, rich and off-dry, with intensity and flor...",Lieu-dit Harth Cuvée Caroline,90,21.0,Berona,Alsace,Northeast France,Roger Voss,@vossroger,Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...,Gewürztraminer,Domaine Schoffit
4,Spain,"Dark cherry, spice and leather aromas dominate...",Reserva,89,18.0,Rioja,Rioja Alta,Northern Spain,Michael Schachner,@wineschach,Marqués de Cáceres 2011 Reserva (Rioja),Tempranillo,Marqués de Cáceres
5,US,"Fruity and soft, with hints of raspberry and v...",Estate,88,25.0,California,Napa Valley,North Coast,Jim Gordon,@jimgordonwine,Robert Mondavi 2014 Cabernet Sauvignon (Napa V...,Cabernet Sauvignon,Robert Mondavi
6,Argentina,"Bold and structured, offering black fruit and ...",Gran Reserva,92,30.0,Mendoza Province,Uco Valley,Unknown,Alejandro Iglesias,@aliglesiaswine,Trapiche 2015 Gran Reserva Malbec (Uco Valley),Malbec,Trapiche
7,Chile,"Smooth, with red berries and a touch of herbs.",Reserva Especial,86,12.0,Maipo Valley,Maipo Valley,Central Valley,Patricio Tapia,@ptapiawine,Concha y Toro 2016 Carmenere (Maipo Valley),Carmenere,Concha y Toro
8,Germany,"Lively and fresh, with notes of green apple an...",Kabinett,91,22.0,Mosel,Mosel,Western Germany,Anne Krebiehl,@annewine,Dr. Loosen 2015 Riesling Kabinett (Mosel),Riesling,Dr. Loosen
9,South Africa,"Aromas of citrus and melon, fresh and vibrant.",Signature,85,10.0,Western Cape,Stellenbosch,Coastal Region,Lauren Buzzeo,@laurenbuzzeo,Spier 2016 Chenin Blanc (Western Cape),Chenin Blanc,Spier


Or we could fill each missing value with the first non-null value that appears sometime after the given record in the dataset. This is known as the **backfill strategy**.
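
A minimal sketch of backfilling, assuming a made-up column (the `region` Series is hypothetical). It uses the `bfill()` method; the older `fillna(method="bfill")` spelling is deprecated in recent pandas versions.

```python
import pandas as pd

# Hypothetical toy column with gaps
region = pd.Series(["Etna", None, None, "Douro", None])

# Each gap takes the next non-null value that appears below it
backfilled = region.bfill()

print(backfilled.tolist())
```

Note that the final entry stays missing, since there is no later value to copy from.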

Alternatively, we may have a non-null value that we would like to **replace**. For example, suppose that since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. One way to reflect this in the dataset is using the **replace()** method:


In [47]:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")


0             @kerino
1          @vossroger
2          @vossroger
3          @vossroger
4         @wineschach
5      @jimgordonwine
6     @aliglesiaswine
7         @ptapiawine
8           @annewine
9       @laurenbuzzeo
10      @laurenbuzzeo
Name: taster_twitter_handle, dtype: object

The replace() method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.


In [49]:
reviews.region_2.replace('Unknown','Badajoz')

0        Eastern Sicily
1     Northern Portugal
2      Northeast France
3      Northeast France
4        Northern Spain
5           North Coast
6               Badajoz
7        Central Valley
8       Western Germany
9        Coastal Region
10       Coastal Region
Name: region_2, dtype: object
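
Going the other way can also be useful: turning sentinel strings back into real NaN values so that isnull() and fillna() see them. A sketch under the assumption that "Unknown" and "Undisclosed" are the sentinels in use (the `region` Series is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column that uses sentinel strings for missing data
region = pd.Series(["Etna", "Unknown", "Douro", "Undisclosed"])

# replace() accepts a list of values to swap for a single replacement
cleaned = region.replace(["Unknown", "Undisclosed"], np.nan)

print(cleaned.isnull().sum())  # number of entries now treated as missing
```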

**EXERCISES**

In [3]:
import pandas as pd

In [137]:
reviews = pd.DataFrame([
    {
        'country': 'Italy',
        'description': 'Aromas include tropical fruit, broom, and minerals.',
        'designation': 'Vulkà Bianco',
        'points': 87,
        'price': 20.0,
        'province': 'Sicily & Sardinia',
        'region_1': 'Etna',
        'region_2': 'Eastern Sicily',
        'taster_name': 'Kerin O’Keefe',
        'taster_twitter_handle': '@kerinokeefe',
        'title': 'Nicosia 2013 Vulkà Bianco (Etna)',
        'variety': 'White Blend',
        'winery': 'Nicosia'
    },
    {
        'country': 'Portugal',
        'description': 'This is ripe and fruity, a wine that is smooth and balanced.',
        'designation': 'Avidagos',
        'points': 87,
        'price': 15.0,
        'province': 'Douro',
        'region_1': 'Douro',
        'region_2': 'Northern Portugal',
        'taster_name': 'Roger Voss',
        'taster_twitter_handle': '@vossroger',
        'title': 'Quinta dos Avidagos 2011 Avidagos Red (Douro)',
        'variety': 'Portuguese Red',
        'winery': 'Quinta dos Avidagos'
    },
    {
        'country': 'France',
        'description': 'A dry style of Pinot Gris, crisp with acidity and minerality.',
        'designation': 'Classic',
        'points': 100,
        'price': 32.0,
        'province': 'Alsace',
        'region_1': 'Alsace',
        'region_2': 'Northeast France',
        'taster_name': 'Roger Voss',
        'taster_twitter_handle': '@vossroger',
        'title': 'Domaine Marcel Deiss 2012 Pinot Gris (Alsace)',
        'variety': 'Pinot Gris',
        'winery': 'Domaine Marcel Deiss'
    },
    {
        'country': 'France',
        'description': 'Big, rich and off-dry, with intensity and floral notes.',
        'designation': 'Lieu-dit Harth Cuvée Caroline',
        'points': 90,
        'price': 21.0,
        'province': 'Berona',
        'region_1': 'Alsace',
        'region_2': 'Northeast France',
        'taster_name': 'Roger Voss',
        'taster_twitter_handle': '@vossroger',
        'title': 'Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline (Alsace)',
        'variety': 'Gewürztraminer',
        'winery': 'Domaine Schoffit'
    },
    {
        'country': 'Spain',
        'description': 'Dark cherry, spice and leather aromas dominate this classic Rioja.',
        'designation': 'Reserva',
        'points': 89,
        'price': 18.0,
        'province': 'Rioja',
        'region_1': 'Rioja Alta',
        'region_2': 'Northern Spain',
        'taster_name': 'Michael Schachner',
        'taster_twitter_handle': '@wineschach',
        'title': 'Marqués de Cáceres 2011 Reserva (Rioja)',
        'variety': 'Tempranillo',
        'winery': 'Marqués de Cáceres'
    },
    {
        'country': 'US',
        'description': 'Fruity and soft, with hints of raspberry and vanilla.',
        'designation': 'Estate',
        'points': 88,
        'price': 25.0,
        'province': 'California',
        'region_1': 'Napa Valley',
        'region_2': 'North Coast',
        'taster_name': 'Jim Gordon',
        'taster_twitter_handle': '@jimgordonwine',
        'title': 'Robert Mondavi 2014 Cabernet Sauvignon (Napa Valley)',
        'variety': 'Cabernet Sauvignon',
        'winery': 'Robert Mondavi'
    },
    {
        'country': 'Argentina',
        'description': 'Bold and structured, offering black fruit and mocha.',
        'designation': 'Gran Reserva',
        'points': 92,
        'price': 30.0,
        'province': 'Mendoza Province',
        'region_1': 'Uco Valley',
        'taster_name': 'Alejandro Iglesias',
        'taster_twitter_handle': '@aliglesiaswine',
        'title': 'Trapiche 2015 Gran Reserva Malbec (Uco Valley)',
        'variety': 'Malbec',
        'winery': 'Trapiche'
    },
    {
        'country': 'Chile',
        'description': 'Smooth, with red berries and a touch of herbs.',
        'designation': 'Reserva Especial',
        'points': 86,
        'price': 12.0,
        'province': 'Maipo Valley',
        'region_2': 'Central Valley',
        'taster_name': 'Patricio Tapia',
        'taster_twitter_handle': '@ptapiawine',
        'title': 'Concha y Toro 2016 Carmenere (Maipo Valley)',
        'variety': 'Carmenere',
        'winery': 'Concha y Toro'
    },
    {
        'country': 'Germany',
        'description': 'Lively and fresh, with notes of green apple and lime.',
        'designation': 'Kabinett',
        'points': 91,
        'price': 22.0,
        'province': 'Mosel',
        'region_1': 'Mosel',
        'region_2': 'Western Germany',
        'taster_name': 'Anne Krebiehl',
        'taster_twitter_handle': '@annewine',
        'title': 'Dr. Loosen 2015 Riesling Kabinett (Mosel)',
        'variety': 'Riesling',
        'winery': 'Dr. Loosen'
    },
    {
        'country': 'South Africa',
        'description': 'Aromas of citrus and melon, fresh and vibrant.',
        'designation': 'Signature',
        'points': 85,
        'price': 10.0,
        'province': 'Western Cape',
        'region_1': 'Stellenbosch',
        'region_2': 'Coastal Region',
        'taster_name': 'Lauren Buzzeo',
        'taster_twitter_handle': '@laurenbuzzeo',
        'title': 'Spier 2016 Chenin Blanc (Western Cape)',
        'variety': 'Chenin Blanc',
        'winery': 'Spier'
    },
    {
        'country': 'South Africa',
        'description': 'Aromas of citrus and melon, fresh and vibrant.',
        'designation': 'Signature',
        'points': 87,
        'price': 10.0,
        'province': 'Western Cape',
        'region_1': 'Stellenbosch',
        'region_2': 'Coastal Region',
        'taster_name': 'Lauren Buzzeo',
        'taster_twitter_handle': '@laurenbuzzeo',
        'title': 'Spier 2016 Chenin Blanc (Western Cape)',
        'variety': 'Chenin Blanc',
        'winery': 'Spier'
    }
])

print(reviews.head())

    country                                        description  \
0     Italy  Aromas include tropical fruit, broom, and mine...   
1  Portugal  This is ripe and fruity, a wine that is smooth...   
2    France  A dry style of Pinot Gris, crisp with acidity ...   
3    France  Big, rich and off-dry, with intensity and flor...   
4     Spain  Dark cherry, spice and leather aromas dominate...   

                     designation  points  price           province  \
0                   Vulkà Bianco      87   20.0  Sicily & Sardinia   
1                       Avidagos      87   15.0              Douro   
2                        Classic     100   32.0             Alsace   
3  Lieu-dit Harth Cuvée Caroline      90   21.0             Berona   
4                        Reserva      89   18.0              Rioja   

     region_1           region_2        taster_name taster_twitter_handle  \
0        Etna     Eastern Sicily      Kerin O’Keefe          @kerinokeefe   
1       Douro  Northern Port

**Exercise 1:**

What is the data type of the points column in the dataset?

In [59]:
dtype = reviews.points.dtype #attribute
dtype

dtype('int64')

**Exercise 2:**

Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.

In [67]:
point_strings = reviews.points.astype('str')
reviews.dtypes
# points still shows int64 here because astype() returns a new Series;
# to see the change in the df, I'd have to assign it back: reviews['points'] = ...

country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

**Exercise 3:**

Sometimes the price column is null. *How many* reviews in the dataset are missing a price? (Read carefully: the question asks for a count, not for the rows themselves.)

In [105]:
reviews.price.isnull().sum()
# .count() counts the number of non-null entries!
# Here I want sum() so that it adds up the True values in the boolean Series.

0

In [113]:
# Another option:
missing_price_reviews = reviews[reviews.price.isnull()]
n_missing_prices = len(missing_price_reviews)
missing_price_reviews
n_missing_prices

0

In [119]:
reviews.price.isnull()
# Taking len() of this boolean Series would count every entry, not just the nulls,
# so I index with it (reviews[...]) to keep only the rows with a null price and count those.
len(reviews[reviews.price.isnull()])

0

In [None]:
# More options:
# Cute alternative solution: if we sum a boolean series, True is treated as 1 and False as 0
n_missing_prices = reviews.price.isnull().sum()
# or equivalently:
n_missing_prices = pd.isnull(reviews.price).sum()
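
Since no prices are missing in this small dataset, every approach above returns 0. A sketch on made-up prices (the `price` Series is hypothetical) makes the difference between count() and isnull().sum() visible:

```python
import pandas as pd

# Hypothetical prices with two missing entries
price = pd.Series([20.0, None, 15.0, None, 32.0])

n_present = price.count()                 # counts non-null entries only
n_missing = price.isnull().sum()          # True counts as 1, False as 0
n_missing_filtered = len(price[price.isnull()])  # same count via filtering

print(n_present, n_missing, n_missing_filtered)
```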

**Exercise 4:**

What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:

In [143]:
reviews['region_1'] = reviews.region_1.fillna('Unknown')
reviews.groupby('region_1').title.count().sort_values(ascending = False)

region_1
Alsace          2
Stellenbosch    2
Douro           1
Etna            1
Mosel           1
Napa Valley     1
Rioja Alta      1
Uco Valley      1
Unknown         1
Name: title, dtype: int64

In [None]:
#Solution:
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)
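
A short sketch on made-up data (the `region_1` Series here is hypothetical, not the reviews column) of how the value_counts() solution behaves. value_counts() already sorts in descending order by default, so the explicit sort_values() call in the solution mainly documents the intent:

```python
import pandas as pd

# Hypothetical toy region column with two missing entries
region_1 = pd.Series(["Alsace", "Etna", None, "Alsace", "Alsace", None])

# Fill the gaps first so that missing entries are counted under "Unknown"
counts = region_1.fillna("Unknown").value_counts()

print(counts)
```

Without the fillna() step, value_counts() would leave the missing entries out of the tally unless `dropna=False` is passed.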
