In [1]:
import pandas as pd

In [3]:
titanic_df = pd.read_csv("Titanic_data.csv")
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Introduction
In this tutorial, you'll learn how to investigate data types within a `DataFrame` or `Series`.
You'll also learn how to find and replace entries.



## Dtypes
The data type for a column in a `DataFrame` or a `Series` is known as the dtype.

You can use the `dtype` property to grab the type of a specific column. For instance, we can get the `dtype` of the `Fare` column in the `titanic_df` `DataFrame`:

In [5]:
titanic_df.Fare.dtype

dtype('float64')

Alternatively, the `dtypes` property returns the `dtype` of every column in the `DataFrame`:

In [6]:
titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Data types tell us something about how pandas is storing the data internally. `float64` means that it's using a 64-bit floating point number; `int64` means a similarly sized integer instead, and so on.
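
To make the "64-bit" part concrete, here is a minimal sketch on a made-up two-column frame (the values are copied from the first rows shown earlier, but `df` itself is only an illustration, not `titanic_df`). It assumes the columns use NumPy-backed dtypes, which expose an `itemsize` in bytes:

```python
import pandas as pd

# Hypothetical miniature frame; Pclass is forced to int64 for clarity
df = pd.DataFrame({
    "Pclass": pd.Series([3, 1, 3], dtype="int64"),
    "Fare": [7.25, 71.2833, 7.925],
})

print(df.dtypes)                    # dtype of every column
print(df["Fare"].dtype.itemsize)    # 8 bytes per value, i.e. 64 bits
print(df["Pclass"].dtype.itemsize)  # 8 bytes per value, i.e. 64 bits
```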

One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the `object` dtype.

It's possible to convert a column of one type into another, wherever such a conversion makes sense, by using the `astype()` method. For example, we may transform the `Survived` column from its existing `int64` data type into a `float64` data type:



In [9]:
titanic_df.Survived.astype('float64')

0      0.0
1      1.0
2      1.0
3      1.0
4      0.0
      ... 
886    0.0
887    1.0
888    0.0
889    1.0
890    0.0
Name: Survived, Length: 891, dtype: float64
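
As a small self-contained sketch of the same idea, the `survived` Series below is made up for illustration rather than read from `Titanic_data.csv`. It also shows that `astype()` returns a converted copy and leaves the original untouched:

```python
import pandas as pd

# Hypothetical stand-in for titanic_df.Survived
survived = pd.Series([0, 1, 1, 1, 0], name="Survived")

# astype() returns a new Series; `survived` itself is not modified
survived_float = survived.astype("float64")
survived_text = survived.astype(str)

print(survived_float.dtype)    # float64
print(survived_text.iloc[1])   # prints 1, which is now the string "1"
print(survived.dtype)          # still an integer dtype
```

To keep a conversion, assign the result back, for example `survived = survived.astype("float64")`.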

A `DataFrame` or `Series` **index** has its own dtype, too:

In [10]:
titanic_df.index.dtype

dtype('int64')

Pandas also supports more exotic data types, such as categorical data and time series data.
Because these data types are used more rarely, we will omit them until a much later section of this tutorial.

## Missing data
Entries with missing values are given the value `NaN`, short for "Not a Number".
For technical reasons, these `NaN` values are always of the `float64` dtype.
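
One visible consequence is that an otherwise integer column becomes `float64` as soon as it contains a `NaN`, which would explain why `Age` is `float64` in the listing above. A minimal sketch with made-up ages (both Series are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical ages; in the second Series one age is unknown
ages_complete = pd.Series([22, 38, 26])
ages_missing = pd.Series([22, np.nan, 26])

print(type(np.nan))          # <class 'float'>
print(ages_complete.dtype)   # an integer dtype
print(ages_missing.dtype)    # float64, because NaN is a float
```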

### isnull()
Pandas provides some methods specific to missing data. To select `NaN` entries you can use `pd.isnull()`
(or its companion `pd.notnull()`). For example, the following selects every row whose `Cabin` is missing:

In [13]:
titanic_df[pd.isnull(titanic_df.Cabin)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S


In [14]:
titanic_df['Cabin'].isnull()

0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Name: Cabin, Length: 891, dtype: bool
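
Because the result is a boolean `Series`, it can be summed to count missing entries or used as a filter. The sketch below uses a made-up four-row frame (`df` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the PassengerId and Cabin columns
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

missing_mask = df["Cabin"].isnull()      # True where Cabin is NaN
n_missing = missing_mask.sum()           # each True counts as 1
with_cabin = df[df["Cabin"].notnull()]   # rows that do have a cabin

print(n_missing)                           # 2
print(with_cabin["PassengerId"].tolist())  # [2, 4]
```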

### fillna()
Replacing missing values is a common operation. Pandas provides a handy method for this: `fillna()`.
It offers a few different strategies for filling in such data. For example,
we can simply replace each `NaN` with `"Unknown"`:

In [15]:
titanic_df['Cabin'].fillna('Unknown')

0      Unknown
1          C85
2      Unknown
3         C123
4      Unknown
        ...   
886    Unknown
887        B42
888    Unknown
889       C148
890    Unknown
Name: Cabin, Length: 891, dtype: object

In [17]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
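
The `head()` output above still shows empty `Cabin` entries because `fillna()` returns a new `Series` rather than changing `titanic_df`. A minimal sketch of keeping the result, using a made-up one-column frame (`df` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Cabin column
df = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan]})

df["Cabin"].fillna("Unknown")                # result is discarded
n_before = df["Cabin"].isnull().sum()        # still 2 missing values

df["Cabin"] = df["Cabin"].fillna("Unknown")  # assign back to keep the change
n_after = df["Cabin"].isnull().sum()         # 0 missing values

print(n_before, n_after)                     # 2 0
```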


Or we could fill each missing value with the first non-null value that appears
sometime after the given record in the dataset. This is known as the backfill strategy.
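
A minimal sketch of backfilling with `Series.bfill()`, using made-up ages (the `ages` Series is hypothetical). Note that a missing value at the very end has nothing after it to copy from, so it stays `NaN`:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps at the start, middle, and end
ages = pd.Series([np.nan, 22.0, np.nan, 35.0, np.nan])

# Each NaN takes the next non-null value that follows it
backfilled = ages.bfill()

print(backfilled.tolist())  # [22.0, 22.0, 35.0, 35.0, nan]
```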

### replace()
Alternatively, we may have a non-null value that we would like to replace.
For example, suppose that we need to change the `Name` `Braund, Mr. Owen Harris` to
`Mr. Owen Harris`. One way to reflect
this in the dataset is using the `replace()` method:

In [20]:
titanic_df['Name'] = titanic_df['Name'].replace('Braund, Mr. Owen Harris', 'Mr. Owen Harris')

In [21]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Mr. Owen Harris,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


This replaces every entry that is exactly `'Braund, Mr. Owen Harris'` with `'Mr. Owen Harris'`.
If you need to make several exact replacements at once,
you can pass a dictionary to `replace()` that maps each old value to its new value:

In [23]:
titanic_df['Name'] = titanic_df['Name'].replace({
    'Heikkinen, Miss. Laina': 'Miss. Laina Heikkinen',
    'Allen, Mr. William Henry': 'Mr. William Henry Allen'
})


In [25]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Mr. Owen Harris,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,Miss. Laina Heikkinen,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,Mr. William Henry Allen,male,35.0,0,0,373450,8.05,,S
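
Both calls above match whole values only. For a genuine pattern across many names, one option (not used in the cells above, so treat it as an assumption about what you want) is the string accessor's `str.replace()` with a regular expression. The sketch below works on a made-up `names` Series and moves the surname, taken to be the text before the first comma, to the end:

```python
import pandas as pd

# Hypothetical sample of the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
])

# replace() without regex compares whole values, so a partial string matches nothing
unchanged = names.replace("Braund", "B.")

# Capture "surname" and "rest", then emit them in swapped order
reordered = names.str.replace(r"^([^,]+), (.+)$", r"\2 \1", regex=True)

print(reordered.tolist())
# ['Mr. Owen Harris Braund', 'Miss. Laina Heikkinen', 'Mr. William Henry Allen']
```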


## Kaggle Exercise (Questions)

In [None]:
# 1. What is the data type of the points column in the dataset?

# Your code here
dtype = reviews.points.dtype

# Check your answer
q1.check()

In [None]:
# 2. Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.

point_strings = reviews.points.astype(str)

# Check your answer
q2.check()


In [None]:
# 3. Sometimes the price column is null. How many reviews in the dataset are missing a price?

n_missing_prices = reviews.price.isnull().sum()

# Check your answer
q3.check()

In [None]:
# 4. What are the most common wine-producing regions? Create a Series counting the number of
# times each value occurs in the region_1 field. This field is often missing data, so replace missing
# values with Unknown. Sort in descending order. Your output should look something like this:

# Unknown                    21247
# Napa Valley                 4480
#                            ...  
# Bardolino Superiore            1
# Primitivo del Tarantino        1
# Name: region_1, Length: 1230, dtype: int64

# value_counts() already sorts counts in descending order by default
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts()

# Check your answer
q4.check()