We'll work with data on Academy Award nominations, which can be downloaded [here](https://www.aggdata.com/awards/oscar).

In [1]:
import sqlite3
import pandas as pd

In [2]:
df = pd.read_csv("academy_awards.csv", encoding="ISO-8859-1")

In [3]:
df.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010 (83rd),Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010 (83rd),Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,


In [4]:
df.describe()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
count,10137,10137,10137,9011,10137,11,12,3,2,1,1
unique,83,40,6001,6424,16,5,4,3,2,1,1
top,1941 (14th),Writing,Meryl Streep,Metro-Goldwyn-Mayer,NO,*,*,while requiring no dangerous solvents. [Syste...,"understanding comedy genius - Mack Sennett.""""",*,*
freq,192,888,16,60,7168,7,9,1,1,1,1


In [5]:
df.columns

Index(['Year', 'Category', 'Nominee', 'Additional Info', 'Won?', 'Unnamed: 5',
       'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10'],
      dtype='object')

In [6]:
df['Category'].unique()

array(['Actor -- Leading Role', 'Actor -- Supporting Role',
       'Actress -- Leading Role', 'Actress -- Supporting Role',
       'Animated Feature Film', 'Art Direction', 'Cinematography',
       'Costume Design', 'Directing', 'Documentary (Feature)',
       'Documentary (Short Subject)', 'Film Editing',
       'Foreign Language Film', 'Makeup', 'Music (Scoring)',
       'Music (Song)', 'Best Picture', 'Short Film (Animated)',
       'Short Film (Live Action)', 'Sound', 'Sound Editing',
       'Visual Effects', 'Writing', 'Honorary Award',
       'Irving G. Thalberg Memorial Award',
       'Scientific and Technical (Scientific and Engineering Award)',
       'Scientific and Technical (Technical Achievement Award)',
       'Scientific and Technical (Bonner Medal)',
       'Jean Hersholt Humanitarian Award',
       'Scientific and Technical (Gordon E. Sawyer Award)',
       'Scientific and Technical (Academy Award of Merit)',
       'Scientific and Technical (Special Awards)',
       '

Clean up the **Year** column.

Use [Pandas vectorized string methods](http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods) to select just the first four characters of each string.

In [7]:
df['Year'].dtype

dtype('O')

In [8]:
df['Year'] = df['Year'].str[:4]

Convert the **Year** column to the **int64** data type using [astype](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html).

In [9]:
df['Year'] = df['Year'].astype('int64')

In [10]:
df['Year'].dtype

dtype('int64')
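The two steps above can be sketched on a small made-up sample. The `sample` frame and its values are assumptions for illustration, chosen to mimic the raw **Year** format.

```python
import pandas as pd

# Hypothetical sample mimicking the raw Year format, e.g. "2010 (83rd)".
sample = pd.DataFrame({"Year": ["2010 (83rd)", "1941 (14th)", "2001 (74th)"]})

# .str[:4] slices every string element-wise, keeping the four-digit year.
sample["Year"] = sample["Year"].str[:4]

# The values are still strings at this point, so convert them to integers.
sample["Year"] = sample["Year"].astype("int64")

print(sample["Year"].tolist())
print(sample["Year"].dtype)
```

Slicing before converting matters here: calling `astype('int64')` on the raw strings would raise an error, because text such as "(83rd)" cannot be parsed as an integer.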

Use conditional filtering to select only the rows from the DataFrame where the **Year** column is greater than 2000.

In [11]:
later_than_2000 = df[df['Year']>2000]

later_than_2000.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,


Use conditional filtering to select only the rows from **later_than_2000** where the **Category** column matches one of the four acting awards we're interested in.

In [12]:
award_categories = [
    'Actor -- Leading Role',
    'Actor -- Supporting Role',
    'Actress -- Leading Role',
    'Actress -- Supporting Role',
]

nominations = later_than_2000[later_than_2000['Category'].isin(award_categories)]

nominations.head(10)

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,
5,2010,Actor -- Supporting Role,Christian Bale,The Fighter {'Dicky Eklund'},YES,,,,,,
6,2010,Actor -- Supporting Role,John Hawkes,Winter's Bone {'Teardrop'},NO,,,,,,
7,2010,Actor -- Supporting Role,Jeremy Renner,The Town {'James Coughlin'},NO,,,,,,
8,2010,Actor -- Supporting Role,Mark Ruffalo,The Kids Are All Right {'Paul'},NO,,,,,,
9,2010,Actor -- Supporting Role,Geoffrey Rush,The King's Speech {'Lionel Logue'},NO,,,,,,
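The year filter and the category filter can also be combined into a single boolean mask. The sketch below uses a small made-up frame; `sample`, `wanted`, `mask`, and `filtered` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the full nominations data.
sample = pd.DataFrame({
    "Year": [1999, 2005, 2005, 2010],
    "Category": [
        "Directing",
        "Actor -- Leading Role",
        "Writing",
        "Actress -- Supporting Role",
    ],
})

wanted = ["Actor -- Leading Role", "Actress -- Supporting Role"]

# Each comparison yields a boolean Series; & combines them row by row.
# The parentheses are required because & binds tighter than >.
mask = (sample["Year"] > 2000) & sample["Category"].isin(wanted)

filtered = sample[mask]
print(filtered)
```

Filtering in two steps, as the notebook does, gives the same rows. The single mask simply avoids the intermediate DataFrame.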


In [13]:
nominations.describe(include="all")

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
count,200.0,200,200,200,200,0.0,0.0,0.0,0.0,0.0,0.0
unique,,4,150,200,2,0.0,0.0,0.0,0.0,0.0,0.0
top,,Actress -- Supporting Role,Cate Blanchett,Crazy Heart {'Bad Blake'},NO,,,,,,
freq,,50,4,1,160,,,,,,
mean,2005.5,,,,,,,,,,
std,2.879489,,,,,,,,,,
min,2001.0,,,,,,,,,,
25%,2003.0,,,,,,,,,,
50%,2005.5,,,,,,,,,,
75%,2008.0,,,,,,,,,,


Use the Series method [map](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) to replace all NO values with 0 and all YES values with 1.
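As a minimal sketch of this step on a small made-up frame (the `sample` and `subset` names are illustrative, not from the notebook), the mapping might look like the following. It takes an explicit `.copy()` of the filtered rows first, which is one common way to avoid a `SettingWithCopyWarning` when assigning to a DataFrame that was produced by filtering another one. Whether the warning appears at all depends on your pandas version.

```python
import pandas as pd

# Hypothetical data; the real notebook works on the filtered nominations.
sample = pd.DataFrame({
    "Year": [1999, 2010, 2010],
    "Won?": ["NO", "YES", "NO"],
})

# .copy() makes subset an independent DataFrame, so later assignments
# are unambiguous and do not touch sample.
subset = sample[sample["Year"] > 2000].copy()

# map looks up each value in the dict; values missing from the dict
# would become NaN.
subset["Won"] = subset["Won?"].map({"YES": 1, "NO": 0})

# Drop the original text column now that the numeric one exists.
subset = subset.drop(columns=["Won?"])
print(subset)
```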

In [14]:
replace_dict = { 'YES': 1, 'NO': 0 }

nominations['Won?'] = nominations['Won?'].map(replace_dict)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [15]:
nominations['Won'] = nominations['Won?']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [16]:
final_nominations = nominations.drop([
    'Won?', 'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7',
    'Unnamed: 8', 'Unnamed: 9', 'Unnamed: 10',
], axis=1)

final_nominations.head(10)

Unnamed: 0,Year,Category,Nominee,Additional Info,Won
0,2010,Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},0
1,2010,Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},0
2,2010,Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},0
3,2010,Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},1
4,2010,Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},0
5,2010,Actor -- Supporting Role,Christian Bale,The Fighter {'Dicky Eklund'},1
6,2010,Actor -- Supporting Role,John Hawkes,Winter's Bone {'Teardrop'},0
7,2010,Actor -- Supporting Role,Jeremy Renner,The Town {'James Coughlin'},0
8,2010,Actor -- Supporting Role,Mark Ruffalo,The Kids Are All Right {'Paul'},0
9,2010,Actor -- Supporting Role,Geoffrey Rush,The King's Speech {'Lionel Logue'},0


In [17]:
final_nominations.describe(include="all")

Unnamed: 0,Year,Category,Nominee,Additional Info,Won
count,200.0,200,200,200,200.0
unique,,4,150,200,
top,,Actress -- Supporting Role,Cate Blanchett,Crazy Heart {'Bad Blake'},
freq,,50,4,1,
mean,2005.5,,,,0.2
std,2.879489,,,,0.401004
min,2001.0,,,,0.0
25%,2003.0,,,,0.0
50%,2005.5,,,,0.0
75%,2008.0,,,,0.0


Use [vectorized string methods](http://pandas.pydata.org/pandas-docs/stable/basics.html#vectorized-string-methods) to split the **Additional Info** column into separate **Movie** and **Character** columns.

In [18]:
additional_info_one = final_nominations['Additional Info'].str.rstrip("'}")
additional_info_two = additional_info_one.str.split(" {'")
movie_names = additional_info_two.str[0]
characters = additional_info_two.str[1]

final_nominations['Movie'] = movie_names
final_nominations['Character'] = characters

final_nominations.drop(['Additional Info'], axis=1, inplace=True)

final_nominations.head(10)

Unnamed: 0,Year,Category,Nominee,Won,Movie,Character
0,2010,Actor -- Leading Role,Javier Bardem,0,Biutiful,Uxbal
1,2010,Actor -- Leading Role,Jeff Bridges,0,True Grit,Rooster Cogburn
2,2010,Actor -- Leading Role,Jesse Eisenberg,0,The Social Network,Mark Zuckerberg
3,2010,Actor -- Leading Role,Colin Firth,1,The King's Speech,King George VI
4,2010,Actor -- Leading Role,James Franco,0,127 Hours,Aron Ralston
5,2010,Actor -- Supporting Role,Christian Bale,1,The Fighter,Dicky Eklund
6,2010,Actor -- Supporting Role,John Hawkes,0,Winter's Bone,Teardrop
7,2010,Actor -- Supporting Role,Jeremy Renner,0,The Town,James Coughlin
8,2010,Actor -- Supporting Role,Mark Ruffalo,0,The Kids Are All Right,Paul
9,2010,Actor -- Supporting Role,Geoffrey Rush,0,The King's Speech,Lionel Logue
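The same split can be written more compactly with `expand=True`, which returns the pieces as DataFrame columns instead of lists. This is a sketch on two values taken from the output above; `info` and `parts` are illustrative names.

```python
import pandas as pd

# Two Additional Info values in the format shown in the notebook output.
info = pd.Series([
    "Biutiful {'Uxbal'}",
    "The King's Speech {'King George VI'}",
])

# rstrip("'}") removes any trailing ' or } characters (it treats its
# argument as a set of characters, not as a fixed suffix).
# split(" {'", expand=True) then yields one column per piece.
parts = info.str.rstrip("'}").str.split(" {'", expand=True)
parts.columns = ["Movie", "Character"]
print(parts)
```

Because `rstrip` strips a set of characters, a character name that itself ended in an apostrophe or a closing brace would lose that character too. That is worth checking if you apply this to other rows.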


In [19]:
final_nominations.dtypes

Year          int64
Category     object
Nominee      object
Won           int64
Movie        object
Character    object
dtype: object

Create the SQLite database **nominations.db** and connect to it.

In [20]:
# If nominations.db doesn't exist in the current directory, sqlite3 creates it automatically.
conn = sqlite3.connect("nominations.db")

Use the DataFrame method **to_sql** to export final_nominations to **nominations.db**.

In [21]:
final_nominations.to_sql("nominations", conn, index=False)

Explore the database to make sure the **nominations** table matches our DataFrame.

In [22]:
pd.read_sql_query("SELECT * FROM nominations LIMIT 10;", conn)

Unnamed: 0,Year,Category,Nominee,Won,Movie,Character
0,2010,Actor -- Leading Role,Javier Bardem,0,Biutiful,Uxbal
1,2010,Actor -- Leading Role,Jeff Bridges,0,True Grit,Rooster Cogburn
2,2010,Actor -- Leading Role,Jesse Eisenberg,0,The Social Network,Mark Zuckerberg
3,2010,Actor -- Leading Role,Colin Firth,1,The King's Speech,King George VI
4,2010,Actor -- Leading Role,James Franco,0,127 Hours,Aron Ralston
5,2010,Actor -- Supporting Role,Christian Bale,1,The Fighter,Dicky Eklund
6,2010,Actor -- Supporting Role,John Hawkes,0,Winter's Bone,Teardrop
7,2010,Actor -- Supporting Role,Jeremy Renner,0,The Town,James Coughlin
8,2010,Actor -- Supporting Role,Mark Ruffalo,0,The Kids Are All Right,Paul
9,2010,Actor -- Supporting Role,Geoffrey Rush,0,The King's Speech,Lionel Logue


In [23]:
pd.read_sql_query('PRAGMA table_info(nominations);', conn)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,Year,INTEGER,0,,0
1,1,Category,TEXT,0,,0
2,2,Nominee,TEXT,0,,0
3,3,Won,INTEGER,0,,0
4,4,Movie,TEXT,0,,0
5,5,Character,TEXT,0,,0
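The export and the checks above can be tried without touching any file by using an in-memory database. The sketch below assumes a tiny made-up frame (`sample`). It also passes `if_exists="replace"`, which is not in the notebook: by default **to_sql** raises an error if the table already exists, which happens when the export cell is run a second time.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for final_nominations.
sample = pd.DataFrame({
    "Year": [2010, 2010],
    "Nominee": ["Colin Firth", "James Franco"],
    "Won": [1, 0],
})

# ":memory:" creates a temporary database that vanishes on close.
conn = sqlite3.connect(":memory:")

# if_exists="replace" drops and recreates the table on repeated runs.
sample.to_sql("nominations", conn, index=False, if_exists="replace")

# Verify the row count, the schema, and the round-tripped data.
row_count = conn.execute("SELECT COUNT(*) FROM nominations;").fetchone()[0]
schema = pd.read_sql_query("PRAGMA table_info(nominations);", conn)
round_trip = pd.read_sql_query("SELECT * FROM nominations;", conn)

conn.close()
print(row_count)
print(schema[["name", "type"]])
```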


Once you're done, use the Connection method **close** to close the connection to the database.

In [24]:
conn.close()
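If you would rather not call **close** by hand, one option is `contextlib.closing` from the standard library, which closes the connection even when an error occurs inside the block. This is a sketch using an in-memory database and a hypothetical `demo` table.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the with-block exits.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE demo (id INTEGER);")
    conn.execute("INSERT INTO demo VALUES (1);")
    rows = conn.execute("SELECT id FROM demo;").fetchall()

# Any use of the connection after the block fails, confirming it is closed.
try:
    conn.execute("SELECT 1;")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(rows, still_open)
```

Note that using a sqlite3 connection directly in a `with` statement manages transactions (commit or rollback) but does not close the connection, which is why `closing` is used here.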

In this guided project, you used Pandas to clean a CSV dataset and export it to a SQLite database.

As a data scientist, it's important to learn many tools and how to combine them to get a task done.