## Introduction
In this tutorial, you'll learn how to investigate data types within a DataFrame or Series. You'll also learn how to find and replace entries.

## Dtypes
The data type for a column in a DataFrame or a Series is known as the dtype.

You can use the dtype property to grab the type of a specific column. For instance, we can get the dtype of the price column in the reviews DataFrame:

In [1]:
import pandas as pd
reviews = pd.read_csv("winemag-data-130k-v2.csv", index_col=0)

In [3]:
reviews.price.dtype

dtype('float64')

Alternatively, the **dtypes** property returns the dtype of every column in the DataFrame:

In [4]:
reviews.dtypes

country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

Data types tell us something about how pandas is storing the data internally. float64 means that it's using a 64-bit floating point number; int64 means a similarly sized integer instead, and so on.

One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the object type.
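
As a rough illustration of how these dtypes arise, here is a minimal sketch on made-up data. The `toy` DataFrame and its values are hypothetical and are not taken from the wine dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine reviews dataset.
toy = pd.DataFrame({
    "points": [87, 90, 92],                 # whole numbers
    "price": [15.0, 22.5, 40.0],            # decimals
    "country": ["Italy", "US", "France"],   # text
})

print(toy.dtypes)
# points is stored as a 64-bit integer and price as a 64-bit float.
# The text column is reported as object in the pandas version used in this
# tutorial; newer pandas releases may report a dedicated string dtype instead.
```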

It's possible to *convert a column of one type into another* wherever such a conversion makes sense by using the **astype()** method. For example, we may transform the points column from its existing int64 data type into a float64 data type:

In [5]:
reviews.points.dtype

dtype('int64')

In [6]:
reviews.points.astype('float64')

0         87.0
1         87.0
2         87.0
3         87.0
4         87.0
5         87.0
6         87.0
7         87.0
8         87.0
9         87.0
10        87.0
11        87.0
12        87.0
13        87.0
14        87.0
15        87.0
16        87.0
17        87.0
18        87.0
19        87.0
20        87.0
21        87.0
22        87.0
23        87.0
24        87.0
25        87.0
26        87.0
27        87.0
28        87.0
29        86.0
          ... 
129941    90.0
129942    90.0
129943    90.0
129944    90.0
129945    90.0
129946    90.0
129947    90.0
129948    90.0
129949    90.0
129950    90.0
129951    90.0
129952    90.0
129953    90.0
129954    90.0
129955    90.0
129956    90.0
129957    90.0
129958    90.0
129959    90.0
129960    90.0
129961    90.0
129962    90.0
129963    90.0
129964    90.0
129965    90.0
129966    90.0
129967    90.0
129968    90.0
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64
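
To make "wherever such a conversion makes sense" a bit more concrete, the sketch below shows one conversion that works and one that does not. The `points` and `labels` Series are hypothetical stand-ins, and `labels` is created with an explicit object dtype only to keep the example independent of the pandas version.

```python
import pandas as pd

# Hypothetical toy Series standing in for a column such as reviews.points.
points = pd.Series([87, 90, 92], name="points")

# astype() returns a new Series; the original is left unchanged.
as_float = points.astype("float64")
print(as_float.dtype)   # float64
print(points.dtype)     # int64

# A conversion that does not make sense raises an error instead.
labels = pd.Series(["87", "ninety"], dtype=object)
try:
    labels.astype("int64")
except ValueError as err:
    print("conversion failed:", err)
```

Because astype() does not modify the column in place, the result has to be assigned back if we want to keep it.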

A DataFrame or Series index has its own dtype, too:

In [7]:
reviews.index.dtype

dtype('int64')

Pandas also supports more exotic data types, such as categorical data and timeseries data. Because these data types are more rarely used, we will omit them until a much later section of this tutorial.

## Missing data
Entries missing values are given the value NaN, short for "Not a Number". For technical reasons these NaN values are always of the float64 dtype.
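
One visible consequence is that a column of whole numbers is stored as float64 as soon as it contains a missing entry. The sketch below uses two hypothetical toy Series to show this.

```python
import pandas as pd

# Hypothetical toy data: whole-number scores, one of them missing.
complete = pd.Series([87, 90, 92])
with_gap = pd.Series([87, None, 92])

print(complete.dtype)            # int64
print(with_gap.dtype)            # float64, because the gap is stored as NaN
print(with_gap.isnull().sum())   # 1
```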

Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()), like so:

In [9]:
reviews[pd.isnull(reviews.country)]

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 913 | NaN | Amber in color, this wine has aromas of peach ... | Asureti Valley | 87 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Gotsa Family Wines 2014 Asureti Valley Chinuri | Chinuri | Gotsa Family Wines |
| 3131 | NaN | Soft, fruity and juicy, this is a pleasant, si... | Partager | 83 | NaN | NaN | NaN | NaN | Roger Voss | @vossroger | Barton & Guestier NV Partager Red | Red Blend | Barton & Guestier |
| 4243 | NaN | Violet-red in color, this semisweet wine has a... | Red Naturally Semi-Sweet | 88 | 18.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Kakhetia Traditional Winemaking 2012 Red Natur... | Ojaleshi | Kakhetia Traditional Winemaking |
| 9509 | NaN | This mouthwatering blend starts with a nose of... | Theopetra Malagouzia-Assyrtiko | 92 | 28.0 | NaN | NaN | NaN | Susan Kostrzewa | @suskostrzewa | Tsililis 2015 Theopetra Malagouzia-Assyrtiko W... | White Blend | Tsililis |
| 9750 | NaN | This orange-style wine has a cloudy yellow-gol... | Orange Nikolaevo Vineyard | 89 | 28.0 | NaN | NaN | NaN | Jeff Jenssen | @worldwineguys | Ross-idi 2015 Orange Nikolaevo Vineyard Chardo... | Chardonnay | Ross-idi |
| 11150 | NaN | A blend of 85% Melnik, 10% Grenache Noir and 5... | NaN | 89 | 20.0 | NaN | NaN | NaN | Jeff Jenssen | @worldwineguys | Orbelus 2013 Melnik | Melnik | Orbelus |
| 11348 | NaN | Light and fruity, this is a wine that has some... | Partager | 82 | NaN | NaN | NaN | NaN | Roger Voss | @vossroger | Barton & Guestier NV Partager White | White Blend | Barton & Guestier |
| 14030 | NaN | This Furmint, grown in marl soils, has aromas ... | Márga | 88 | 25.0 | NaN | NaN | NaN | Jeff Jenssen | @worldwineguys | St. Donat 2013 Márga White | White Blend | St. Donat |
| 16000 | NaN | Jumpy, jammy aromas of foxy black fruits are s... | Valle de los Manantiales Vineyard | 86 | 40.0 | NaN | NaN | NaN | Michael Schachner | @wineschach | Familia Deicas 2015 Valle de los Manantiales V... | Tannat | Familia Deicas |
| 16749 | NaN | Winemaker: Bartho Eksteen. This wooded Sauvy s... | Cape Winemakers Guild Vloekskoot Wooded | 91 | NaN | NaN | NaN | NaN | Lauren Buzzeo | @laurbuzz | Bartho Eksteen 2016 Cape Winemakers Guild Vloe... | Sauvignon Blanc | Bartho Eksteen |
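
The same selection pattern can be tried on a much smaller frame. The sketch below uses a hypothetical `toy` DataFrame whose column names mirror the wine dataset; the values are invented.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the wine dataset.
toy = pd.DataFrame({
    "country": ["Italy", None, "US", None],
    "points": [87, 83, 90, 88],
})

missing_mask = pd.isnull(toy.country)   # True where country is missing
print(toy[missing_mask])                # rows 1 and 3
print(toy[pd.notnull(toy.country)])     # rows 0 and 2
```

pd.isnull() returns a boolean Series, so it can be used directly as a row filter, just like any other condition.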


Replacing missing values is a common operation. Pandas provides a handy method for this: **fillna()**. fillna() offers a few different strategies for filling in missing values. For example, we can simply replace each NaN with "Unknown":

In [10]:
reviews.region_2.fillna("Unknown")

0                   Unknown
1                   Unknown
2         Willamette Valley
3                   Unknown
4         Willamette Valley
5                   Unknown
6                   Unknown
7                   Unknown
8                   Unknown
9                   Unknown
10                     Napa
11                  Unknown
12                   Sonoma
13                  Unknown
14            Central Coast
15                  Unknown
16                  Unknown
17                  Unknown
18                  Unknown
19                  Unknown
20                  Unknown
21             Oregon Other
22                  Unknown
23            Central Coast
24                  Unknown
25                   Sonoma
26                  Unknown
27                  Unknown
28                  Unknown
29           Central Valley
                ...        
129941              Unknown
129942               Sonoma
129943              Unknown
129944              Unknown
129945        Centra

Or we could fill each missing value with the first non-null value that appears after the given record in the dataset. This is known as the backfill strategy.
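
A minimal sketch of the backfill strategy, assuming a pandas version that provides the `bfill()` method on Series. The `regions` Series is a hypothetical stand-in for a column such as region_2.

```python
import pandas as pd

# Hypothetical toy Series standing in for a column such as reviews.region_2.
regions = pd.Series(["Sonoma", None, None, "Napa", None])

# bfill() copies the next non-null value backwards into each gap.
filled = regions.bfill()
print(filled)
# The trailing gap stays missing because no later value exists to copy from.
```

Note that the last entry is still missing after the backfill, so a follow-up fillna("Unknown") may still be needed.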

Alternatively, we may have a non-null value that we would like to replace. For example, suppose that since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. One way to reflect this in the dataset is using the **replace()** method:

In [11]:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")

0                @kerino
1             @vossroger
2            @paulgwine 
3                    NaN
4            @paulgwine 
5            @wineschach
6                @kerino
7             @vossroger
8                    NaN
9             @vossroger
10               @vboone
11            @vossroger
12               @vboone
13               @kerino
14         @mattkettmann
15                   NaN
16           @wineschach
17           @wineschach
18           @wineschach
19                   NaN
20                   NaN
21           @paulgwine 
22               @kerino
23         @mattkettmann
24               @kerino
25               @vboone
26               @kerino
27               @kerino
28               @kerino
29               @vboone
               ...      
129941           @vboone
129942           @vboone
129943           @kerino
129944    @worldwineguys
129945     @mattkettmann
129946               NaN
129947           @kerino
129948       @wineschach
129949           @vboone


The replace() method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.
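
As a sketch of that idea, the example below turns sentinel strings into real NaN values so that the missing-data methods can see them. The `handles` Series and the choice of sentinels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series using sentinel strings in place of real missing values.
handles = pd.Series(["@vossroger", "Unknown", "@vboone", "Undisclosed"])

# Map every sentinel to NaN so that isnull() and fillna() can see the gaps.
cleaned = handles.replace(["Unknown", "Undisclosed"], np.nan)
print(cleaned.isnull().sum())
```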

## Exercises

1. What is the data type of the points column in the dataset?

In [12]:
dtype = reviews.points.dtype
dtype

dtype('int64')

2. Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.

In [17]:
point_strings = reviews.points.astype('str')
point_strings


0         87
1         87
2         87
3         87
4         87
5         87
6         87
7         87
8         87
9         87
10        87
11        87
12        87
13        87
14        87
15        87
16        87
17        87
18        87
19        87
20        87
21        87
22        87
23        87
24        87
25        87
26        87
27        87
28        87
29        86
          ..
129941    90
129942    90
129943    90
129944    90
129945    90
129946    90
129947    90
129948    90
129949    90
129950    90
129951    90
129952    90
129953    90
129954    90
129955    90
129956    90
129957    90
129958    90
129959    90
129960    90
129961    90
129962    90
129963    90
129964    90
129965    90
129966    90
129967    90
129968    90
129969    90
129970    90
Name: points, Length: 129971, dtype: object

3. Sometimes the price column is null. How many reviews in the dataset are missing a price?

In [23]:
n_missing_prices = reviews.price.isnull().sum()
n_missing_prices


8996

In [None]:
# Solutions from Kaggle:

missing_price_reviews = reviews[reviews.price.isnull()]
n_missing_prices = len(missing_price_reviews)

In [None]:
# Cute alternative solution: if we sum a boolean series, True is treated as 1 and False as 0
n_missing_prices = reviews.price.isnull().sum()

In [None]:
# or equivalently:
n_missing_prices = pd.isnull(reviews.price).sum()

4. What are the most common wine-producing regions? 

Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:

In [3]:
reviews_per_region = reviews.region_1.fillna("Unknown").value_counts().sort_values(ascending=False)
reviews_per_region

Unknown                               21247
Napa Valley                            4480
Columbia Valley (WA)                   4124
Russian River Valley                   3091
California                             2629
Paso Robles                            2350
Mendoza                                2301
Willamette Valley                      2301
Alsace                                 2163
Champagne                              1613
Barolo                                 1599
Finger Lakes                           1565
Sonoma Coast                           1474
Brunello di Montalcino                 1470
Rioja                                  1469
Sonoma County                          1246
Toscana                                1197
Walla Walla Valley (WA)                1066
Chianti Classico                       1062
Santa Barbara County                   1000
Carneros                                999
Yakima Valley                           997
Sta. Rita Hills                 