# Data Types

## 1. Data Types
Different data types need to be handled in different ways. For example, the number 1 and the string "1" are not the same.
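
As a minimal sketch of that difference in plain Python (the variable names are only illustrative), adding two numbers and "adding" two strings give very different results:

```python
# The integer 1 and the string "1" look alike but behave differently.
number = 1
text = "1"

number_sum = number + number   # arithmetic addition
text_sum = text + text         # string concatenation

print(number_sum)      # 2
print(text_sum)        # 11
print(number == text)  # False
```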

Let's bring in data from seaborn to play with data types.

In [2]:
import pandas as pd
import seaborn as sns

# Load the tips example dataset. It is not one of our own local files:
# seaborn fetches it from its online example-data repository
tips = sns.load_dataset("tips")

In [4]:
# Let's see the type of each column
print(tips.dtypes)

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object


In [6]:
# Let's see the actual data
tips.iloc[0]

total_bill     16.99
tip             1.01
sex           Female
smoker            No
day              Sun
time          Dinner
size               2
Name: 0, dtype: object

In pandas, these are the main data types:
* float
* int
* datetime
* string
* bool
* object

_We will discuss category later._
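
To see these types side by side, here is a small sketch that builds a hand-made DataFrame with one column per type. The data and column names are made up for illustration. Text columns usually show up as object, although the exact label can depend on the pandas version.

```python
import pandas as pd

# A tiny hand-made frame (hypothetical data) with one column per main dtype.
example = pd.DataFrame({
    "price": [1.5, 2.0],                                   # float
    "count": [1, 2],                                       # int
    "when": pd.to_datetime(["2021-01-01", "2021-01-02"]),  # datetime
    "name": ["a", "b"],                                    # text
    "flag": [True, False],                                 # bool
})

print(example.dtypes)
```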


## 2. Converting Types
### 2.1 Converting to Numerical Values
#### astype()
We can use the **astype** method that every pandas Series provides (NumPy arrays have a method of the same name).

In [8]:
print(tips.dtypes)

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object


In [9]:
# Let's convert to string object
tips['total_bill']=tips['total_bill'].astype(str)
print(tips.dtypes)

total_bill      object
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object


In [10]:
# Let's convert it back to a number
tips['total_bill']=tips['total_bill'].astype(float)
print(tips.dtypes)

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object
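
astype can also be called on a whole DataFrame with a `{column: dtype}` mapping, which converts several columns in one step. The sketch below uses a small made-up `orders` frame, not the tips data:

```python
import pandas as pd

# Hypothetical data: numbers that arrived as text.
orders = pd.DataFrame({
    "quantity": ["1", "2", "3"],
    "price": ["9.5", "3.25", "4.0"],
})

# Convert both columns at once with a {column: dtype} mapping.
converted = orders.astype({"quantity": int, "price": float})

print(converted.dtypes)
print(converted["quantity"].sum())  # 6
```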


#### to_numeric()
We can also use **to_numeric**, a function that comes with pandas.
This function lets us decide what happens when a value cannot be converted.

We can set the **errors** parameter to:
* raise : (the default value) it raises an error
* coerce : it sets every value that cannot be converted to **NaN**
* ignore : it does not convert at all and returns the input unchanged, without an error message (deprecated since pandas 2.2)

In [15]:
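# Let's take a small piece of the data to play with.
# .copy() gives us an independent DataFrame, so editing it later
# does not trigger a SettingWithCopyWarning or touch tips.
tips_subset = tips.head(10).copy()
tips_subset.dtypes

# Notice that the type of total_bill is float64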

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

In [16]:
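# Now, we set some of the values to strings.
# Cast the column to object first so it can hold mixed values
# (recent pandas versions warn when a string is written into a float column).
tips_subset['total_bill'] = tips_subset['total_bill'].astype(object)
tips_subset.loc[[1,3,5,7],'total_bill'] = 'I_am_a_string'
tips_subset.dtypes
# The type of total_bill should be object now (it was float64 before)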

total_bill      object
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

In [17]:
# Let's see the data
tips_subset.head()

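| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | I_am_a_string | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | I_am_a_string | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |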


#### First, let's try calling astype

In [23]:
# Calling astype(float) fails here:
# pandas does not know how to convert "I_am_a_string" to a number.

tips_subset['total_bill'].astype(float)

# Don't panic at the error message below; it is expected.

ValueError: could not convert string to float: 'I_am_a_string'

#### Second, let's call to_numeric with the default setting

In [24]:
# Calling to_numeric without any option is the same as calling it with errors='raise'.
# If a value cannot be converted, pandas stops and raises an error.

pd.to_numeric(tips_subset['total_bill'])
# same as
# pd.to_numeric(tips_subset['total_bill'], errors='raise')

# Don't panic at the error message below; it is expected.

ValueError: Unable to parse string "I_am_a_string" at position 1

#### Third, we call with errors='ignore'

With **ignore**, pandas does not complain if it cannot perform the conversion.

It just returns the original column (hence, no change).

In [29]:
# Now, we tell pandas to ignore values it cannot convert,
# and we store the result back so we can inspect it
tips_subset['total_bill'] = pd.to_numeric(tips_subset['total_bill'], errors='ignore')

# Let's take a look: nothing has changed
tips_subset.head()

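| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | I_am_a_string | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | I_am_a_string | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |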


In [27]:
# Let's see the types
tips_subset.dtypes

total_bill      object
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object
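
Because pandas 2.2 and later deprecate `errors='ignore'`, the same "convert if possible, otherwise leave it alone" behavior can be sketched with an explicit try/except. The helper name `to_numeric_or_original` is hypothetical and not part of pandas, and the two Series are made-up data:

```python
import pandas as pd

def to_numeric_or_original(column):
    """Hypothetical helper: return a numeric Series, or the input unchanged on failure."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

mixed = pd.Series([16.99, "I_am_a_string", 21.01], dtype=object)
clean = pd.Series(["16.99", "21.01"])

print(to_numeric_or_original(mixed).dtype)  # object (conversion failed, input returned)
print(to_numeric_or_original(clean).dtype)  # float64
```

Catching the exception ourselves makes it explicit that a failed conversion leaves the whole column untouched.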

#### Finally, we call with errors='coerce'

This is the forceful way. If a value can be converted, it is converted.

Otherwise, it becomes NaN.

In [34]:
pd.to_numeric(tips_subset['total_bill'], errors='coerce')

0    16.99
1      NaN
2    21.01
3      NaN
4    24.59
5      NaN
6     8.77
7      NaN
8    15.04
9    14.78
Name: total_bill, dtype: float64
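
After coercing, it is often useful to know which rows failed. One possible approach, sketched here on a small made-up Series rather than the tips data, is to look for values that were present in the input but are NaN in the result:

```python
import pandas as pd

# Hypothetical column mixing numbers and unparseable text.
raw = pd.Series([16.99, "I_am_a_string", 21.01, "I_am_a_string"], dtype=object)
converted = pd.to_numeric(raw, errors="coerce")

# Rows that were present in the input but became NaN failed to convert.
failed = converted.isna() & raw.notna()

print(raw[failed])        # the offending values, with their index labels
print(int(failed.sum()))  # 2
print(converted.mean())   # NaN values are skipped by default
```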

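## 3. Category Type
When a column contains only a few distinct values that repeat many times, storing it as category type is more efficient: pandas keeps each distinct value once and stores a small integer code for every row.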

In [35]:
# Notice type of sex
tips.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

In [37]:
# Now, we convert it to string
tips['sex'] = tips['sex'].astype('str')
tips.dtypes # Notice the type of sex: it should be object now (object is less efficient here)

total_bill     float64
tip            float64
sex             object
smoker        category
day           category
time          category
size             int64
dtype: object

In [38]:
# Convert it back to category
tips['sex'] = tips['sex'].astype('category')
tips.dtypes # Notice the type of sex: it should be category again (more efficient)

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object
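
The efficiency difference can be checked by comparing memory usage. This is a rough sketch on a made-up column with two labels repeated many times. The exact byte counts depend on the platform and pandas version, so they are printed without an expected value:

```python
import pandas as pd

# Hypothetical column: two distinct labels repeated many times.
as_object = pd.Series(["Female", "Male"] * 50_000, dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the memory of the Python strings themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(as_category.cat.categories.tolist())     # ['Female', 'Male']
print(as_category.cat.codes.head(4).tolist())  # [0, 1, 0, 1]
```

The `cat.categories` and `cat.codes` accessors show the two parts of a categorical column: the distinct values and the per-row integer codes.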