
# Data Type
The data type for a column in a DataFrame or a Series is known as the dtype.

You can use the dtype property to grab the type of a specific column.

In [None]:
reviews.price.dtype

#output:-
dtype('float64')

1) Data types tell us something about how pandas is storing the data internally. float64 means that it's using a 64-bit floating point number; int64 means a similarly sized integer instead, and so on.


2) One peculiarity to keep in mind is that columns consisting entirely of strings do not get their own type; they are instead given the object type.

3) It's possible to convert a column of one type into another, wherever such a conversion makes sense, by using the astype() method. For example, we may transform the points column from its existing int64 data type into a float64 data type:

In [None]:
reviews.points.astype('float64')

#output:- 
0         87.0
1         87.0
          ... 
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64
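
As a rough, self-contained sketch of the points above, the snippet below builds a tiny stand-in for the reviews data and inspects the dtypes before and after astype(). The reviews_demo name and its three rows are made up for illustration and are not taken from the dataset. The dtype reported for the string column may depend on the pandas version in use.

```python
import pandas as pd

# Hypothetical three-row stand-in for the reviews DataFrame (illustrative values only).
reviews_demo = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "price": [15.0, None, 14.0],
})

# One dtype per column: integers, floats, and the string column.
print(reviews_demo.dtypes)

# astype() returns a new Series; the original column is left unchanged.
points_as_float = reviews_demo.points.astype("float64")
print(points_as_float.dtype)
print(reviews_demo.points.dtype)
```

Because astype() returns a new object, the result has to be assigned back to the column if the conversion is meant to persist.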

In [None]:
# A DataFrame or Series index has its own dtype, too:

reviews.index.dtype

#output:-
dtype('int64')

## Missing data
Entries that are missing a value are given the value NaN, short for "Not a Number". For technical reasons, these NaN values are always of the float64 dtype.



Pandas provides some methods specific to missing data. To select NaN entries you can use pd.isnull() (or its companion pd.notnull()).

In [None]:
reviews[pd.isnull(reviews.country)]
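
To make the selection concrete, here is a minimal sketch on a made-up three-row frame (reviews_demo and its values are assumptions for illustration, not rows from the dataset). It also shows the float64 behaviour described above: an integer column that contains a missing value ends up stored as floats.

```python
import pandas as pd

# Hypothetical three-row stand-in for the reviews DataFrame.
reviews_demo = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "points": [87, 87, 90],
})

# Rows where country is missing, and rows where it is present.
missing_country = reviews_demo[pd.isnull(reviews_demo.country)]
known_country = reviews_demo[pd.notnull(reviews_demo.country)]
print(missing_country)
print(known_country)

# Integers mixed with a missing value are stored as float64,
# because NaN is itself a floating-point value.
points_with_gap = pd.Series([87, None, 90])
print(points_with_gap.dtype)
```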

### Note:-
Replacing missing values is a common operation. Pandas provides a really handy method for this problem: fillna(). fillna() provides a few different strategies for mitigating such data. For example, we can simply replace each NaN with "Unknown":

In [None]:
reviews.region_2.fillna("Unknown")

#output:-
0         Unknown
1         Unknown
           ...   
129969    Unknown
129970    Unknown
Name: region_2, Length: 129971, dtype: object

Or we could fill each missing value with the first non-null value that appears sometime after the given record in the database. This is known as the backfill strategy.
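
A minimal sketch of the backfill strategy, assuming the Series.bfill() method is used for it (the notebook does not show which call it has in mind). The regions Series and its values are hypothetical.

```python
import pandas as pd

# Hypothetical region values with gaps; not taken from the real dataset.
regions = pd.Series([None, "Napa", None, None, "Sonoma", None])

# Fill each gap with the next non-null value further down the Series.
backfilled = regions.bfill()
print(backfilled)
```

The last entry stays missing because there is no later non-null value to copy from; a follow-up fillna("Unknown") could cover that case.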

Alternatively, we may have a non-null value that we would like to replace. For example, suppose that since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. One way to reflect this in the dataset is using the replace() method:

In [None]:
reviews.taster_twitter_handle.replace("@kerinokeefe", "@kerino")
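
The same call can be tried on a small made-up Series. In this sketch the handles Series is a stand-in for the real column and "@someone_else" is a hypothetical handle. It shows that replace() matches whole values and returns a new Series rather than editing the original.

```python
import pandas as pd

# Hypothetical stand-in for reviews.taster_twitter_handle.
handles = pd.Series(["@kerinokeefe", "@someone_else", "@kerinokeefe"])

# Swap every exact match of the old handle for the new one.
updated = handles.replace("@kerinokeefe", "@kerino")
print(updated)

# The original Series still holds the old handle.
print(handles)
```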



---



In [None]:
# ques-1 - What is the data type of the points column in the dataset?

dtype = reviews.points.dtype

In [None]:
# ques-2 - Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.

point_strings = reviews.points.astype(str)

In [None]:
# ques-3 - Sometimes the price column is null. How many reviews in the dataset are missing a price?

missing_price_reviews = reviews[reviews.price.isnull()]
n_missing_prices = len(missing_price_reviews)
# Cute alternative solution: if we sum a boolean series, True is treated as 1 and False as 0
n_missing_prices = reviews.price.isnull().sum()
# or equivalently:
n_missing_prices = pd.isnull(reviews.price).sum()

In [None]:
# ques-4 - What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with "Unknown". Sort in descending order.

reviews_per_region = reviews.region_1.fillna('Unknown').value_counts().sort_values(ascending=False)
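
As a quick sanity check of the counting answers above, the same expressions can be run on a small made-up Series where the result is easy to verify by hand. The region_1 values below are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical region_1 column with three missing entries.
region_1 = pd.Series(["Napa Valley", None, "Napa Valley", None, None, "Etna"])

# Summing a boolean mask counts the True values, i.e. the missing entries.
n_missing = region_1.isnull().sum()
print(n_missing)

# Label the gaps, count each value, and sort from most to least common.
reviews_per_region = region_1.fillna("Unknown").value_counts().sort_values(ascending=False)
print(reviews_per_region)
```

value_counts() already returns counts in descending order by default, so the explicit sort_values(ascending=False) mainly documents the intent.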