> **Copyright (c) 2020 Skymind Holdings Berhad**<br><br>
> **Copyright (c) 2021 Skymind Education Group Sdn. Bhd.**<br>
<br>
Licensed under the Apache License, Version 2.0 (the "License");
<br>you may not use this file except in compliance with the License.
<br>You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0/
<br>
<br>Unless required by applicable law or agreed to in writing, software
<br>distributed under the License is distributed on an "AS IS" BASIS,
<br>WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
<br>See the License for the specific language governing permissions and
<br>limitations under the License.
<br>
<br>
**SPDX-License-Identifier: Apache-2.0**
<br>

# Introduction

In this tutorial, you'll learn how to investigate data types within a DataFrame or Series.  You'll also learn how to find and replace entries.

# Notebook Content

* [Dtypes](#Dtypes)
* [Missing data](#Missing-data)

# Dtypes

The data type for a column in a DataFrame or a Series is known as the **dtype**.

You can use the `dtype` property to get the type of a specific column. For instance, we can get the dtypes of the `Game` and `Avg_viewer_ratio` columns in the `games` DataFrame:

In [1]:
import pandas as pd

games = pd.read_csv("../../../resources/day_01/twitch_game_data.csv", index_col=0)

In [2]:
games

Rank,Game,Month,Year,Hours_watched,Hours_Streamed,Peak_viewers,Peak_channels,Streamers,Avg_viewers,Avg_channels,Avg_viewer_ratio
1,League of Legends,1,2016,94377226,1362044 hours,530270,2903,129172,127021,1833,69.29
2,Counter-Strike: Global Offensive,1,2016,47832863,830105 hours,372654,2197,120849,64378,1117,57.62
3,Dota 2,1,2016,45185893,433397 hours,315083,1100,44074,60815,583,104.26
4,Hearthstone,1,2016,39936159,235903 hours,131357,517,36170,53749,317,169.29
5,Call of Duty: Black Ops III,1,2016,16153057,1151578 hours,71639,3620,214054,21740,1549,14.03
...,...,...,...,...,...,...,...,...,...,...,...
196,War Thunder,6,2021,704459,73613 hours,8812,223,7035,979,102,9.57
197,Muck,6,2021,701456,31741 hours,60091,112,8591,975,44,22.10
198,Trials Rising,6,2021,698899,4626 hours,217333,26,581,972,6,151.08
199,Little Nightmares II,6,2021,695130,27581 hours,43518,105,6128,966,38,25.20


In [3]:
games.Game.dtype

dtype('O')

In [4]:
games.Avg_viewer_ratio.dtype

dtype('float64')

Alternatively, the `dtypes` property returns the `dtype` of _every_ column in the DataFrame:

In [5]:
games.dtypes

Game                 object
Month                 int64
Year                  int64
Hours_watched         int64
Hours_Streamed       object
Peak_viewers          int64
Peak_channels         int64
Streamers             int64
Avg_viewers           int64
Avg_channels          int64
Avg_viewer_ratio    float64
dtype: object

Data types tell us something about how pandas is storing the data internally. `float64` means that it's using a 64-bit floating point number; `int64` means a similarly sized integer instead, and so on.

One peculiarity to keep in mind (and on display very clearly here) is that columns consisting entirely of strings do not get their own type; they are instead given the `object` type.
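As a small illustration of the points above, the sketch below builds a toy DataFrame from a few values copied from the table earlier in this section (it does not load the real CSV file) and uses `select_dtypes()` to keep only the numeric columns. The variable names are chosen for illustration only.

```python
import pandas as pd

# Toy frame built from two rows of the table above; not the full dataset.
toy = pd.DataFrame({
    "Game": ["Dota 2", "Hearthstone"],
    "Peak_viewers": [315083, 131357],
    "Avg_viewer_ratio": [104.26, 169.29],
})

# One dtype per column.
print(toy.dtypes)

# Keep only the columns that pandas stores as numbers.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

The string column `Game` is left out of `numeric_cols` because it is not stored as a numeric type.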

It's possible to convert a column of one type into another, wherever such a conversion makes sense, by using the `astype()` method. For example, we may transform the `Avg_viewers` column from its existing `int64` data type into a `float64` data type:

In [6]:
games.Avg_viewers.astype('float64')

Rank
1      127021.0
2       64378.0
3       60815.0
4       53749.0
5       21740.0
         ...   
196       979.0
197       975.0
198       972.0
199       966.0
200       957.0
Name: Avg_viewers, Length: 13200, dtype: float64
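Not every column can be converted directly. `Hours_Streamed` was read as `object` because each value carries a ` hours` suffix, so calling `astype('int64')` on it as-is would raise an error. One possible approach, sketched here on a few values copied from the table above, is to strip the suffix first and convert afterwards.

```python
import pandas as pd

# A few Hours_Streamed values copied from the table above.
hours = pd.Series(
    ["1362044 hours", "830105 hours", "433397 hours"],
    name="Hours_Streamed",
)

# Remove the literal " hours" suffix, then convert the remaining digits.
hours_int = hours.str.replace(" hours", "", regex=False).astype("int64")
print(hours_int.dtype)
```

Passing `regex=False` makes the replacement a plain substring match, which is all that is needed here.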

A DataFrame or Series index has its own `dtype`, too:

In [7]:
games.index.dtype

dtype('int64')

Pandas also supports more exotic data types, such as categorical data and timeseries data. Because these data types are used less often, they are not covered in this tutorial.

# Missing data

Entries with missing values are given the value `NaN`, short for "Not a Number". For technical reasons these `NaN` values are always of the `float64` dtype.
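A quick way to see this effect: an integer Series becomes `float64` as soon as it contains a missing value. The sketch below uses made-up values and assumes pandas' default NumPy-backed dtypes.

```python
import pandas as pd

complete = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, None])  # None is stored as NaN

print(complete.dtype)  # integer dtype
print(with_gap.dtype)  # float dtype, because NaN is a float value
```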

Pandas provides some methods specific to missing data. To select `NaN` entries you can use `pd.isnull()` (or its companion `pd.notnull()`). They are used like this:

In [8]:
potability = pd.read_csv("../../../resources/day_01/water_potability.csv", index_col=0)

In [9]:
potability

ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.990970,2.963135,0
3.716080,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075,0
...,...,...,...,...,...,...,...,...,...
4.668102,193.681735,47580.991603,7.166639,359.948574,526.424171,13.894419,66.687695,4.435821,1
7.808856,193.553212,17329.802160,8.061362,,392.449580,19.903225,,2.798243,1
9.419510,175.762646,33155.578218,7.350233,,432.044783,11.039070,69.845400,3.298875,1
5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658,1


In [10]:
potability[pd.isnull(potability.Sulfate)]

ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
3.716080,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
7.974522,218.693300,18767.656682,8.110385,,364.098230,14.525746,76.485911,4.011718,0
7.496232,205.344982,28388.004887,5.072558,,444.645352,13.228311,70.300213,4.777382,0
7.051786,211.049406,30980.600787,10.094796,,315.141267,20.397022,56.651604,4.268429,0
...,...,...,...,...,...,...,...,...,...
8.372910,169.087052,14622.745494,7.547984,,464.525552,11.083027,38.435151,4.906358,1
7.808856,193.553212,17329.802160,8.061362,,392.449580,19.903225,,2.798243,1
9.419510,175.762646,33155.578218,7.350233,,432.044783,11.039070,69.845400,3.298875,1
5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658,1
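The same checks are also available as methods on a DataFrame or Series, which makes it easy to count missing values per column. Here is a minimal sketch on a toy frame whose values are copied from the first rows of the table above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First four Sulfate / Trihalomethanes values from the table above.
sample = pd.DataFrame({
    "Sulfate": [368.516441, np.nan, np.nan, 356.886136],
    "Trihalomethanes": [86.990970, 56.329076, 66.420093, 100.341674],
})

# Boolean mask: True where Sulfate is missing.
missing_mask = sample["Sulfate"].isnull()
rows_missing = sample[missing_mask]

# Number of missing values in each column.
missing_counts = sample.isnull().sum()
print(missing_counts)
```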


Replacing missing values is a common operation. Pandas provides a handy method for this: `fillna()`, which offers a few different strategies for handling missing data. For example, we can first compute the mean of the `Trihalomethanes` column for later use, and then simply replace each `NaN` in that column with the string `"unknown"`:

In [11]:
trihalo_mean = potability.mean(axis=0, numeric_only=True).Trihalomethanes
print("Trihalomethanes:", trihalo_mean)

Trihalomethanes: 66.39629294676803


In [12]:
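# Assign the result back to the column instead of calling fillna(..., inplace=True)
# on the attribute, which may not update the DataFrame in recent pandas versions.
potability["Trihalomethanes"] = potability["Trihalomethanes"].fillna("unknown")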

In [13]:
potability

ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
,204.890455,20791.318981,7.300212,368.516441,564.308654,10.379783,86.99097,2.963135,0
3.716080,129.422921,18630.057858,6.635246,,592.885359,15.180013,56.329076,4.500656,0
8.099124,224.236259,19909.541732,9.275884,,418.606213,16.868637,66.420093,3.055934,0
8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
9.092223,181.101509,17978.986339,6.546600,310.135738,398.410813,11.558279,31.997993,4.075075,0
...,...,...,...,...,...,...,...,...,...
4.668102,193.681735,47580.991603,7.166639,359.948574,526.424171,13.894419,66.687695,4.435821,1
7.808856,193.553212,17329.802160,8.061362,,392.449580,19.903225,unknown,2.798243,1
9.419510,175.762646,33155.578218,7.350233,,432.044783,11.039070,69.8454,3.298875,1
5.126763,230.603758,11983.869376,6.303357,,402.883113,11.168946,77.488213,4.708658,1


Or we could fill each missing value with the mean of its column. Here, the `"unknown"` placeholder is replaced with the `Trihalomethanes` mean computed earlier:

In [14]:
potability.Trihalomethanes.replace("unknown", trihalo_mean)

ph
NaN          86.990970
3.716080     56.329076
8.099124     66.420093
8.316766    100.341674
9.092223     31.997993
               ...    
4.668102     66.687695
7.808856     66.396293
9.419510     69.845400
5.126763     77.488213
7.874671     78.698446
Name: Trihalomethanes, Length: 3276, dtype: float64
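If the placeholder step is not needed, `fillna()` can take the column mean directly. This is a sketch on made-up values rather than the water potability data:

```python
import numpy as np
import pandas as pd

values = pd.Series([60.0, np.nan, 80.0], name="Trihalomethanes")

# mean() skips NaN by default, so the mean here is (60 + 80) / 2 = 70.
filled = values.fillna(values.mean())
print(filled.tolist())  # [60.0, 70.0, 80.0]
```

Because the fill value is a float, the column keeps its `float64` dtype instead of turning into `object`.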

The `replace()` method is worth mentioning here because it's handy for replacing missing data which is given some kind of sentinel value in the dataset: things like `"Unknown"`, `"Undisclosed"`, `"Invalid"`, and so on.
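For instance, sentinel strings can be mapped to `NaN` so that the missing-data tools see them. The Series below is hypothetical and only illustrates the idea:

```python
import numpy as np
import pandas as pd

# Hypothetical column in which missing entries were recorded as text.
status = pd.Series(["Safe", "Unknown", "Unsafe", "Undisclosed"])

# Map both sentinel strings to NaN so that isnull() and fillna() can see them.
cleaned = status.replace(["Unknown", "Undisclosed"], np.nan)
print(cleaned.isnull().sum())
```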

# Contributors

**Author**
<br>Chee Lam

# References

1. [Learning Pandas](https://www.kaggle.com/learn/pandas)
2. [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html)