This notebook shows different ways to change the data type of a pandas Series, so that incorrect data types can be fixed.

In [3]:
import pandas as pd

In [3]:
drinks = pd.read_csv('https://bit.ly/drinksbycountry')

In [4]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [5]:
drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

The DataFrame has three integer columns, one floating-point column, and two object columns.
The object columns, 'country' and 'continent', hold strings.

So let's convert the 'beer_servings' column from integer to floating point:

In [7]:
drinks.beer_servings.astype(float)
# Series.astype() casts a pandas object to the specified dtype and returns a new Series.

0        0.0
1       89.0
2       25.0
3      245.0
4      217.0
       ...  
188    333.0
189    111.0
190      6.0
191     32.0
192     64.0
Name: beer_servings, Length: 193, dtype: float64

In [10]:
drinks['beer_servings'] = drinks.beer_servings.astype(float) # assign the result back to keep the change

In [12]:
drinks.dtypes # check the dtype of each Series in the DataFrame

country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

The dtype of 'beer_servings' has now changed from integer to floating point.

The most common use case is a CSV file in which a numeric column has been read into pandas as the object (string) dtype. In that case, the astype() method converts the strings to a numeric dtype such as integer.
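As a minimal sketch of that use case, the example below builds a small DataFrame whose numeric values are stored as text and then converts them. The column names and values are hypothetical and are not taken from the drinks dataset.

```python
import pandas as pd

# Hypothetical data: the 'quantity' values arrive as strings, not numbers.
df = pd.DataFrame({"item": ["a", "b", "c"], "quantity": ["1", "2", "3"]})
quantity_is_text = not pd.api.types.is_numeric_dtype(df["quantity"])

# Convert the strings to integers and assign the result back.
df["quantity"] = df["quantity"].astype(int)

# Math on the column now works as expected.
total = df["quantity"].sum()
```

Note that astype(int) raises an error if any value cannot be parsed as an integer, so this only works when every string in the column is a clean number.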

###### How to define the dtype of a column while reading the CSV file

Instead of changing the data types after the CSV has been read, we can set them while reading it:

In [14]:
# Add the 'dtype' parameter to the previous 'read_csv' command
drinks = pd.read_csv('https://bit.ly/drinksbycountry', dtype={'beer_servings': float})

In [17]:
drinks.dtypes

country                          object
beer_servings                   float64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

Here 'beer_servings' has been read as floating point.

The only difference between the two approaches is timing: this one sets the data type while the file is being read, whereas the earlier one converts the column after the DataFrame has already been created.
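The dtype dictionary can list more than one column. The sketch below uses a short, hypothetical CSV string in place of the real file so that it runs without a network connection; the column names mirror the drinks data, but the rows are only illustrative.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file or URL.
csv_text = "country,beer_servings,wine_servings\nAlbania,89,54\nAlgeria,25,14\n"

# Without dtype, pandas infers integers for both numeric columns.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, each listed column is parsed directly into the requested type.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"beer_servings": float, "wine_servings": float},
)
```

Columns that are not listed in the dictionary keep the dtype that pandas infers for them.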

Here is one more example:

In [6]:
orders = pd.read_table('https://bit.ly/chiporders')

In [7]:
orders.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


Let's look at the 'item_price' column. From the displayed data alone, it is not clear whether it is stored as a float or as some kind of currency type.

In [8]:
orders.dtypes

order_id               int64
quantity               int64
item_name             object
choice_description    object
item_price            object
dtype: object

The dtypes output shows that 'item_price' is stored with the 'object' dtype, which means its values are strings.

To do math with the 'item_price' column, we first need to convert its data type.

In [9]:
orders.item_price.str.replace('$', '', regex=False) # Removes the literal '$'; regex=False keeps '$' from being read as a regex anchor

0        2.39 
1        3.39 
2        3.39 
3        2.39 
4       16.98 
         ...  
4617    11.75 
4618    11.75 
4619    11.25 
4620     8.75 
4621     8.75 
Name: item_price, Length: 4622, dtype: object

In [11]:
orders.item_price.str.replace('$', '', regex=False).astype(float).mean()
# Without astype(float), calling mean() on this Series would raise an error: even after the
# $ (dollar) sign is removed, the remaining values are still strings. Casting to float is
# what makes math on the column possible.

7.464335785374397
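The same cleaning steps can be tried on a small, self-contained Series. This is a sketch with hypothetical prices, not the full orders data. It passes regex=False explicitly because '$' is a special character in regular expressions, and the default for that parameter has differed between pandas versions.

```python
import pandas as pd

# Hypothetical prices stored as text, like the 'item_price' column.
prices = pd.Series(["$2.39", "$3.39", "$16.98"])

# regex=False treats '$' as a literal character, not as an end-of-string anchor.
cleaned = prices.str.replace("$", "", regex=False)

# The cleaned values are still strings until they are cast to float.
as_float = cleaned.astype(float)
mean_price = as_float.mean()
```

Assigning as_float back to the original column would make the conversion permanent, in the same way as the 'beer_servings' example earlier.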

###### Useful tip: working with the 'item_name' column

In [13]:
orders.item_name.str.contains('Chicken') # contains() checks whether each value in 'item_name' includes the substring.

0       False
1       False
2       False
3       False
4        True
        ...  
4617    False
4618    False
4619     True
4620     True
4621     True
Name: item_name, Length: 4622, dtype: bool

Suppose you want numeric 1s and 0s instead of the boolean values True and False. That takes just one extra step:

In [15]:
orders.item_name.str.contains('Chicken').astype(int) # Appending astype(int) to the code above casts the booleans to integers.

0       0
1       0
2       0
3       0
4       1
       ..
4617    0
4618    0
4619    1
4620    1
4621    1
Name: item_name, Length: 4622, dtype: int32
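One reason the 1s and 0s are handy is that they can be summed or averaged directly. The sketch below shows this on a few hypothetical item names; the values are illustrative and are not rows from the orders data.

```python
import pandas as pd

# Hypothetical item names standing in for the 'item_name' column.
items = pd.Series(["Chicken Bowl", "Izze", "Chicken Burrito", "Chips"])

is_chicken = items.str.contains("Chicken")  # boolean Series
flags = is_chicken.astype(int)              # 1 for True, 0 for False

chicken_count = flags.sum()    # number of matching rows
chicken_share = flags.mean()   # fraction of matching rows
```

The integer dtype shown in the output (int32 or int64) can vary with the platform and library versions, so it is best not to rely on a specific width.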