## Loading the Data

In [109]:
import pandas as pd
import numpy as np
import sqlite3 as sqlite

In [110]:
aca = pd.read_csv("academy_awards.csv", encoding="ISO-8859-1")

In [111]:
aca.head()

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010 (83rd),Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010 (83rd),Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,


## Filtering and Cleaning the Data

In [112]:
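# Keep only the four-digit year, e.g. "2010 (83rd)" -> 2010
aca["Year"] = aca["Year"].str.strip().str[:4].astype(int)
# Keep rows with a valid outcome; .copy() avoids SettingWithCopyWarning below
aca = aca[aca["Won?"].isin(["NO", "YES"])].copy()
# Encode the outcome as 0 (lost) or 1 (won)
aca["Won"] = aca["Won?"].map({"NO": 0, "YES": 1}).astype(int)
# Drop the original outcome column and the stray "Unnamed" columns
aca = aca.drop(["Won?",
         "Unnamed: 5",
         "Unnamed: 6",
         "Unnamed: 7",
         "Unnamed: 8",
         "Unnamed: 9",
         "Unnamed: 10"], axis=1)
# Split values like "Biutiful {'Uxbal'}" into movie and character
cleaning_movies = aca["Additional Info"].str.rstrip("'}").str.split(" {'")
aca["Movie"] = cleaning_movies.str[0]
aca["Character"] = cleaning_movies.str[1]
aca = aca.drop("Additional Info", axis=1)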
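
To see what the `rstrip` and `split` steps above do to a single value, here is a minimal plain-Python sketch. It assumes the pandas `.str` methods apply the equivalent Python string operations to each element; the helper name `split_additional_info` and the third sample value are hypothetical and only serve as illustration.

```python
def split_additional_info(value):
    """Mimic the rstrip/split cleaning step for one string (hypothetical helper)."""
    # rstrip("'}") removes any trailing run of the characters ' and },
    # not the literal suffix "'}".
    parts = value.rstrip("'}").split(" {'")
    movie = parts[0]
    # A value without the " {'" separator yields a single part, so there is
    # no character name to extract.
    character = parts[1] if len(parts) > 1 else None
    return movie, character


examples = [
    "Biutiful {'Uxbal'}",
    "The King's Speech {'King George VI'}",
    "Toy Story 3",  # hypothetical value with no character name
]

results = [split_additional_info(value) for value in examples]
for value, (movie, character) in zip(examples, results):
    print(f"{value!r} -> movie={movie!r}, character={character!r}")
```

Because `rstrip` strips a set of characters rather than a fixed suffix, a title that itself ends in an apostrophe or a closing brace would lose that character too, which is worth keeping in mind when the cleaning is extended to other categories.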

In [113]:
# Keep ceremonies after 2000
later_than_2000 = aca[aca.Year > 2000]
# Keep only the four acting categories
award_categories = ['Actor -- Leading Role',
                   'Actor -- Supporting Role',
                   'Actress -- Leading Role',
                   'Actress -- Supporting Role']
nominations = later_than_2000[later_than_2000.Category.isin(award_categories)]

In [114]:
nominations.head()

Unnamed: 0,Year,Category,Nominee,Won,Movie,Character
0,2010,Actor -- Leading Role,Javier Bardem,0,Biutiful,Uxbal
1,2010,Actor -- Leading Role,Jeff Bridges,0,True Grit,Rooster Cogburn
2,2010,Actor -- Leading Role,Jesse Eisenberg,0,The Social Network,Mark Zuckerberg
3,2010,Actor -- Leading Role,Colin Firth,1,The King's Speech,King George VI
4,2010,Actor -- Leading Role,James Franco,0,127 Hours,Aron Ralston


In [115]:
nominations.dtypes

Year          int64
Category     object
Nominee      object
Won           int64
Movie        object
Character    object
dtype: object

## Exporting Data to SQLite

In [116]:
con = sqlite.connect("nominations.db")
nominations.to_sql("nominations", con, index=False, if_exists='replace')

## Verifying in SQL

In [117]:
query_one = pd.read_sql_query("pragma table_info(nominations);", con)
query_one

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,Year,INTEGER,0,,0
1,1,Category,TEXT,0,,0
2,2,Nominee,TEXT,0,,0
3,3,Won,INTEGER,0,,0
4,4,Movie,TEXT,0,,0
5,5,Character,TEXT,0,,0


In [118]:
query_two = pd.read_sql_query("select * from nominations limit 10;", con)
query_two

Unnamed: 0,Year,Category,Nominee,Won,Movie,Character
0,2010,Actor -- Leading Role,Javier Bardem,0,Biutiful,Uxbal
1,2010,Actor -- Leading Role,Jeff Bridges,0,True Grit,Rooster Cogburn
2,2010,Actor -- Leading Role,Jesse Eisenberg,0,The Social Network,Mark Zuckerberg
3,2010,Actor -- Leading Role,Colin Firth,1,The King's Speech,King George VI
4,2010,Actor -- Leading Role,James Franco,0,127 Hours,Aron Ralston
5,2010,Actor -- Supporting Role,Christian Bale,1,The Fighter,Dicky Eklund
6,2010,Actor -- Supporting Role,John Hawkes,0,Winter's Bone,Teardrop
7,2010,Actor -- Supporting Role,Jeremy Renner,0,The Town,James Coughlin
8,2010,Actor -- Supporting Role,Mark Ruffalo,0,The Kids Are All Right,Paul
9,2010,Actor -- Supporting Role,Geoffrey Rush,0,The King's Speech,Lionel Logue


In [119]:
con.close()
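
The same kind of verification can also be done with the `sqlite3` module alone. The sketch below is an illustration rather than part of the original notebook: it builds a throwaway in-memory database with two rows copied from the output above, creates the table by hand instead of through `to_sql`, and uses `contextlib.closing` so the connection is closed even if a query fails.

```python
import sqlite3
from contextlib import closing

# Two rows copied from the notebook output above.
rows = [
    (2010, "Actor -- Leading Role", "Javier Bardem", 0, "Biutiful", "Uxbal"),
    (2010, "Actor -- Leading Role", "Colin Firth", 1,
     "The King's Speech", "King George VI"),
]

# closing() guarantees con.close() runs when the block exits.
with closing(sqlite3.connect(":memory:")) as con:
    # Hand-written schema mirroring the table_info output shown above.
    con.execute(
        'create table nominations ('
        '"Year" INTEGER, "Category" TEXT, "Nominee" TEXT, '
        '"Won" INTEGER, "Movie" TEXT, "Character" TEXT)'
    )
    con.executemany("insert into nominations values (?, ?, ?, ?, ?, ?)", rows)
    con.commit()

    # table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = [
        (name, col_type)
        for _, name, col_type, *_ in con.execute("pragma table_info(nominations);")
    ]
    row_count = con.execute("select count(*) from nominations;").fetchone()[0]
    winners = [
        row[0]
        for row in con.execute("select Nominee from nominations where Won = 1;")
    ]

print(columns)
print(row_count, winners)
```

Comparing a row count from SQL against `len(nominations)` in pandas is a cheap way to confirm that an export wrote every row.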

## Next Steps

In this guided project, you used Pandas to clean a CSV dataset and export it to a SQLite database. As a data scientist, it's important to learn many tools and how to combine them to get the job done. As you do more guided projects, you'll become more familiar with the strengths and weaknesses of each tool. For example, you have probably noticed that data cleaning is much easier in Pandas than in SQL.

For next steps, explore the rest of the original dataset, academy_awards.csv, and brainstorm how to clean it:

- The award categories in older ceremonies were different from the ones we have today. What relevant information should we keep from older ceremonies?
- What are all the different formatting styles that the Additional Info column contains? Can we use tools like regular expressions to capture these patterns and clean them up?
  - The nominations for the Art Direction category have lengthy values for Additional Info. What information is useful and how do we extract it?
  - Many values in Additional Info don't contain the name of the character the actor or actress played. Should we toss out the character name altogether as we expand our data? What tradeoffs do we make by doing so?
- What's the best way to handle award ceremonies that included movies from two years?
  - E.g. see 1927/28 (1st) in the Year column.
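
As a starting point for the regular-expression questions above, here is a rough sketch. It covers only the two formats that appear in this notebook: Year values like `2010 (83rd)` or `1927/28 (1st)`, and Additional Info values like `Biutiful {'Uxbal'}`. The function names, the patterns, and the choice to return a start year and an end year are all assumptions made for illustration; the full dataset likely contains formats these patterns do not handle.

```python
import re

# Matches "2010 (83rd)" and "1927/28 (1st)": a four-digit year, an optional
# two-digit second year, and the ceremony number in parentheses.
YEAR_PATTERN = re.compile(r"^\s*(\d{4})(?:/(\d{2}))?\s*\((\d+)(?:st|nd|rd|th)\)\s*$")

# Matches "Movie {'Character'}".
INFO_PATTERN = re.compile(r"^(?P<movie>.*?)\s*\{'(?P<character>.*)'\}\s*$")


def parse_year(value):
    """Return (start_year, end_year, ceremony_number), or None if unrecognized."""
    match = YEAR_PATTERN.match(value)
    if match is None:
        return None
    first, second_suffix, ceremony = match.groups()
    start_year = int(first)
    if second_suffix is None:
        end_year = start_year
    else:
        # Assumption: the two-digit suffix shares the century of the first year.
        end_year = int(first[:2] + second_suffix)
        if end_year < start_year:
            # Assumption: a smaller suffix means the span crossed a century.
            end_year += 100
    return start_year, end_year, int(ceremony)


def parse_additional_info(value):
    """Return (movie, character); character is None when no {'...'} part exists."""
    match = INFO_PATTERN.match(value)
    if match is None:
        return value.strip(), None
    return match.group("movie"), match.group("character")


print(parse_year("2010 (83rd)"))
print(parse_year("1927/28 (1st)"))
print(parse_additional_info("The King's Speech {'King George VI'}"))
```

Returning `None` for unrecognized values, rather than raising, makes it easy to count how many rows a pattern misses before deciding which formats to handle next.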

Next up is a guided project where we'll continue down the path we started and explore how to normalize our data into multiple tables using relations.