# Specifying Data Types
When you create a new DataFrame, either by calling a constructor or reading a CSV file, Pandas assigns a data type to each column based on its values. While it does a pretty good job, it’s not perfect. If you choose the right data type for your columns upfront, then you can significantly improve your code’s performance.
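As a minimal sketch of choosing types upfront, `read_csv` accepts a `dtype` mapping and a `parse_dates` list. The inline CSV below is a two-row stand-in that uses a subset of the dataset's columns, not the real file, and the specific target types (`int16`, `float32`) are illustrative assumptions rather than recommendations for this dataset.

```python
import io

import pandas as pd

# Two-row stand-in for the movies file; only a few columns are included.
csv_text = """name,rating,released,runtime,score
Stand by Me,R,1986-08-22,89,8.1
Top Gun,PG,1986-05-16,110,6.9
"""

typed = pd.read_csv(
    io.StringIO(csv_text),
    # Assumed target types, chosen only to illustrate the dtype parameter.
    dtype={"rating": "category", "runtime": "int16", "score": "float32"},
    # Parse this column as dates while reading instead of converting later.
    parse_dates=["released"],
)

print(typed.dtypes)
```

Declaring the types at load time avoids a second pass over the data to convert columns afterwards.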


In [8]:
import pandas as pd
df = pd.read_csv("Desktop/movies.csv", encoding='latin-1')

In [9]:
df.head()

Unnamed: 0,budget,company,country,director,genre,gross,name,rating,released,runtime,score,star,votes,writer,year
0,8000000.0,Columbia Pictures Corporation,USA,Rob Reiner,Adventure,52287414.0,Stand by Me,R,1986-08-22,89,8.1,Wil Wheaton,299174,Stephen King,1986
1,6000000.0,Paramount Pictures,USA,John Hughes,Comedy,70136369.0,Ferris Bueller's Day Off,PG-13,1986-06-11,103,7.8,Matthew Broderick,264740,John Hughes,1986
2,15000000.0,Paramount Pictures,USA,Tony Scott,Action,179800601.0,Top Gun,PG,1986-05-16,110,6.9,Tom Cruise,236909,Jim Cash,1986
3,18500000.0,Twentieth Century Fox Film Corporation,USA,James Cameron,Action,85160248.0,Aliens,R,1986-07-18,137,8.4,Sigourney Weaver,540152,James Cameron,1986
4,9000000.0,Walt Disney Pictures,USA,Randal Kleiser,Adventure,18564613.0,Flight of the Navigator,PG,1986-08-01,90,6.9,Joey Cramer,36636,Mark H. Baker,1986


# Take another look at the columns of the dataset

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6820 entries, 0 to 6819
Data columns (total 15 columns):
budget      6820 non-null float64
company     6820 non-null object
country     6820 non-null object
director    6820 non-null object
genre       6820 non-null object
gross       6820 non-null float64
name        6820 non-null object
rating      6820 non-null object
released    6820 non-null object
runtime     6820 non-null int64
score       6820 non-null float64
star        6820 non-null object
votes       6820 non-null int64
writer      6820 non-null object
year        6820 non-null int64
dtypes: float64(3), int64(3), object(9)
memory usage: 799.3+ KB
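The `+` in the memory figure signals that the total is a lower bound: for `object` columns, pandas counts only the pointers unless asked to inspect the strings themselves. A small sketch of the difference, using a five-row stand-in frame (the name `sample` is hypothetical) rather than the full dataset:

```python
import pandas as pd

# Five-row stand-in; dtype=object mirrors the object columns listed by info().
sample = pd.DataFrame({
    "rating": pd.Series(["R", "PG-13", "PG", "R", "PG"], dtype=object),
    "runtime": [89, 103, 110, 137, 90],
})

shallow = sample.memory_usage(deep=False)  # object columns: pointer sizes only
deep = sample.memory_usage(deep=True)      # also measures the Python strings

print(shallow)
print(deep)
```

On the real DataFrame, `df.info(memory_usage='deep')` reports the same deeper measurement in the summary line.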


# Take a look at the released column

In [13]:
df['released'] = pd.to_datetime(df['released'])
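When the date layout is known, passing it explicitly can make the conversion stricter and more predictable. The sketch below assumes the `YYYY-MM-DD` layout seen in the preview and adds one deliberately malformed entry (hypothetical, not from the dataset) to show how `errors="coerce"` behaves:

```python
import pandas as pd

# Two dates from the preview plus one made-up malformed entry.
raw = pd.Series(["1986-08-22", "1986-06-11", "not a date"])

# An explicit format skips format inference; errors="coerce" turns
# unparseable entries into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna().sum(), "value(s) could not be parsed")
```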

In [22]:
# nunique is a method: call it with parentheses to get the number of distinct release dates
df['released'].nunique()

In [23]:
df['released'].value_counts()

1991-10-04    10
1988-11-18     9
1988-10-21     9
2008-09-26     9
2011-03-18     8
2002-10-11     8
1991-11-01     8
1988-03-04     8
2003-11-14     8
2008-12-25     8
1993-02-12     8
2012-04-20     8
1986-11-07     8
1990-03-09     8
1986-08-22     8
2016-10-21     8
1987-11-06     8
2000-10-20     8
1989-01-13     8
2006-10-13     8
2007-10-19     8
2010-12-17     7
1988-05-06     7
1993-10-01     7
1991-01-18     7
1996-03-22     7
1998-02-20     7
1998-11-20     7
1987-03-13     7
2015-04-10     7
              ..
1995-03-08     1
2004-09-01     1
1998-02-28     1
1988-08-10     1
1992-10-26     1
1993-01-21     1
2010-12-11     1
2011-06-25     1
1994-05-18     1
1999-01-13     1
2014-01-01     1
2012-08-15     1
2000-04-20     1
2001-02-01     1
1990-07-01     1
1986-09-05     1
1998-05-21     1
2015-08-13     1
1996-08-07     1
2012-01-05     1
2001-09-12     1
2009-05-15     1
2014-08-26     1
2007-08-30     1
2009-09-09     1
2003-03-27     1
1992-12-16     1
2017-02-24    
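A quick way to judge whether a column repeats a small set of values is to compare its number of distinct values with its length. The sketch below uses a five-row stand-in built from the preview rows; the name `cardinality` is hypothetical, and what counts as a "low" ratio is a judgment call.

```python
import pandas as pd

# Five-row stand-in built from the rows shown in the preview.
sample = pd.DataFrame({
    "rating": ["R", "PG-13", "PG", "R", "PG"],
    "released": ["1986-08-22", "1986-06-11", "1986-05-16",
                 "1986-07-18", "1986-08-01"],
})

# Fraction of distinct values per column: the lower the ratio,
# the more the column repeats a small set of values.
cardinality = sample.nunique() / len(sample)

print(cardinality)
```

On the full dataset, a column such as `rating` would be expected to show a far lower ratio than `released`, whose dates rarely repeat.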

Which data type would you use in a relational database for a column that holds only a handful of distinct values, such as rating? You would probably not use a varchar type, but rather an enum. Pandas provides the categorical data type for the same purpose:



In [27]:
df['rating'] = pd.Categorical(df['rating'])
# Displays a CategoricalDtype listing the distinct ratings found in the column.
# That repr is output, not code: typing it into a cell raises a NameError.
df['rating'].dtype
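To see why the categorical type helps, the following sketch compares the memory footprint of the same labels stored as Python objects and as a category. The series is a synthetic stand-in (three ratings repeated many times), so the byte counts are illustrative only.

```python
import pandas as pd

# Synthetic stand-in: a few rating labels repeated 1000 times, stored as objects.
ratings = pd.Series(["R", "PG-13", "PG", "R", "PG"] * 1000, dtype=object)

# A category stores each distinct label once plus a small integer code per row.
as_category = ratings.astype("category")

object_bytes = ratings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("object:  ", object_bytes, "bytes")
print("category:", category_bytes, "bytes")
print("labels:  ", list(as_category.cat.categories))
print("codes:   ", as_category.cat.codes.dtype)
```

The saving grows with the number of rows and shrinks as the number of distinct labels rises, which is why the approach suits columns like `rating` better than columns with mostly unique values.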