<a href="https://colab.research.google.com/github/ElnazDi/colab/blob/master/Panda_DataTypes_MissingData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[Intro to Data Structures](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html)

[Working with Missing Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)

In [1]:
from google.colab import files
file = files.upload()

Saving winemag-data-130k-v2.csv to winemag-data-130k-v2.csv


In [0]:
import pandas as pd
reviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
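# Use the full option name; the short form 'max_rows' is ambiguous in recent pandas versions
pd.set_option('display.max_rows', 5)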


In [3]:
reviews.columns

Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')

## Data Types

The **dtypes** property returns the **dtype** of **every column** in the dataset:



In [6]:
reviews.dtypes

country        object
description    object
                ...  
variety        object
winery         object
Length: 13, dtype: object



You can use the **dtype** property to grab the type of a **specific column**:



In [7]:
reviews.price.dtype

dtype('float64')

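You can **convert** a column to another **type** with the **astype** method. It returns a converted copy and leaves the original column unchanged. For example, the `points` column is stored as `int64` and can be converted to `float64`: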



In [9]:
reviews.points.dtype

dtype('int64')

In [10]:
reviews.points.astype('float64')

0         87.0
1         87.0
          ... 
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64
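Because `astype` returns a new object, the conversion is lost unless the result is stored. The following is a minimal sketch on a small hypothetical DataFrame (the names `df`, `converted`, and `df_float` are illustrative and not part of the wine dataset):

```python
import pandas as pd

# Hypothetical stand-in for the reviews DataFrame
df = pd.DataFrame({"points": [87, 87, 90]})

# astype returns a converted copy of the column
converted = df["points"].astype("float64")

print(df["points"].dtype)   # the original column keeps its integer dtype
print(converted.dtype)      # float64

# To keep the conversion, store the result, here in a new DataFrame
df_float = df.assign(points=converted)
print(df_float.dtypes)
```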

A DataFrame or Series index has its own dtype, too:



In [11]:
reviews.index.dtype

dtype('int64')

## Missing Data

Missing values are shown as **NaN**, short for "Not a Number". For technical reasons, NaN is itself a floating-point value, so these NaN values are always of the **float64** dtype.
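One practical consequence is that a column of whole numbers is stored as `float64` by default as soon as it contains a NaN. A small sketch with made-up values (the Series names are hypothetical):

```python
import numpy as np
import pandas as pd

# Two hypothetical columns: one complete, one with a gap
whole = pd.Series([1, 2, 3])
with_gap = pd.Series([1, np.nan, 3])

print(whole.dtype)     # integer dtype
print(with_gap.dtype)  # float64, because NaN is a float
print(type(np.nan))    # <class 'float'>
```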

pandas provides some methods specific to missing data. To select NaN entries you can use **pd.isnull** (or its companion **pd.notnull**). Both return a boolean mask that can be used to filter rows.

In [15]:
reviews.shape

(129971, 13)

In [20]:
reviews.country.isnull()

0         False
1         False
          ...  
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

In [16]:
reviews[reviews.country.isnull()]

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 913 | NaN | Amber in color, this wine has aromas of peach ... | Asureti Valley | 87 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Gotsa Family Wines 2014 Asureti Valley Chinuri | Chinuri | Gotsa Family Wines |
| 3131 | NaN | Soft, fruity and juicy, this is a pleasant, si... | Partager | 83 | NaN | NaN | NaN | NaN | Roger Voss | @vossroger | Barton & Guestier NV Partager Red | Red Blend | Barton & Guestier |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129590 | NaN | A blend of 60% Syrah, 30% Cabernet Sauvignon a... | Shah | 90 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Büyülübağ 2012 Shah Red | Red Blend | Büyülübağ |
| 129900 | NaN | This wine offers a delightful bouquet of black... | NaN | 91 | 32.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Psagot 2014 Merlot | Merlot | Psagot |
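The companion `pd.notnull` works the same way but selects the rows where a value is present. A minimal sketch on a small hypothetical DataFrame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the reviews table
df = pd.DataFrame({
    "country": ["Italy", np.nan, "US"],
    "points": [87, 83, 90],
})

# Keep only the rows where country is present
has_country = df[pd.notnull(df["country"])]

# The method form on the column gives the same mask
same_mask = df["country"].notnull()

print(has_country)
```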


**Replacing missing values** is a common operation.  

pandas provides a handy method for this: **fillna**. It offers a few different strategies for filling in missing values. For example, we can simply replace each NaN with the string "Unknown":



In [21]:
reviews.region_2.fillna('Unknown')

0         Unknown
1         Unknown
           ...   
129969    Unknown
129970    Unknown
Name: region_2, Length: 129971, dtype: object

Alternatively, you can fill each missing value with the **first non-null value that appears after it in the column**. This is known as the **backfill strategy**.
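A minimal sketch of backfilling, using the `bfill` method on a small hypothetical Series (recent pandas releases deprecate the older `fillna(method='bfill')` spelling in favor of `bfill`). Note that a missing value at the very end stays NaN because nothing follows it:

```python
import numpy as np
import pandas as pd

# Hypothetical region column with gaps
region = pd.Series(["Napa", np.nan, np.nan, "Sonoma", np.nan])

# Each NaN takes the next non-null value below it
backfilled = region.bfill()

print(backfilled)
# The first four entries become Napa, Sonoma, Sonoma, Sonoma;
# the last entry has no later value to copy, so it remains NaN
```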

For the other strategies that **fillna** supports for **imputing** missing values, see the [official function documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html).

**Non-null values** can also be **replaced**.

For example, suppose that since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. One way to reflect this in the dataset is to use the **replace** method:

In [23]:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")


0            @kerino
1         @vossroger
             ...    
129969    @vossroger
129970    @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

The replace method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like "Unknown", "Undisclosed", "Invalid", and so on.
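As a rough sketch of that idea, sentinel strings can be turned into real NaN values so that `isnull` and `fillna` then treat them as missing. The Series below and the list of sentinels are hypothetical; the actual markers depend on the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical column where missing data was recorded as text
region = pd.Series(["Napa Valley", "Unknown", "Undisclosed", "Etna"])

# Map every sentinel string to NaN in one call
sentinels = ["Unknown", "Undisclosed", "Invalid"]
cleaned = region.replace(sentinels, np.nan)

# The sentinels now count as missing values
n_missing = cleaned.isnull().sum()
print(cleaned)
print(n_missing)
```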



## 3.
Sometimes the price column is null. How many reviews in the dataset are missing a price?

In [0]:
n_missing_prices = reviews.price.isnull().sum()


#Solution:

#missing_price_reviews = reviews[reviews.price.isnull()]
#n_missing_prices = len(missing_price_reviews)
# Cute alternative solution: if we sum a boolean series, True is treated as 1 and False as 0
#n_missing_prices = reviews.price.isnull().sum()
# or equivalently:
#n_missing_prices = pd.isnull(reviews.price).sum()

## 4.
What are the most common wine-producing regions? Create a `Series` counting the number of times each value occurs in the `region_1` field. This field is often missing data, so replace missing values with `Unknown`. Sort in descending order.  Your output should look something like this:

```
Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64
```

In [25]:
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)
reviews_per_region







Unknown                               21247
Napa Valley                            4480
                                      ...  
Sonoma County-Santa Barbara County        1
Martina                                   1
Name: region_1, Length: 1230, dtype: int64
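As a side note, `value_counts` already returns its counts sorted in descending order, so the explicit `sort_values` call above is optional. It can also count missing values directly with `dropna=False`, in which case they are labeled NaN rather than "Unknown". A small sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical region column with missing entries
region = pd.Series(
    ["Napa Valley", np.nan, "Etna", np.nan, "Napa Valley", np.nan]
)

# Approach used above: label the gaps, then count
filled_counts = region.fillna("Unknown").value_counts()

# Alternative: keep NaN as its own category while counting
nan_counts = region.value_counts(dropna=False)

print(filled_counts)
print(nan_counts)
```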