# Introduction to the Data

In this notebook we are going to clean a CSV dataset and add it to a SQLite database. The data contains information on Academy Awards nominations (the Oscars), and it is broken down into the following columns:

· __Year__: the year of the awards ceremony.

· __Category__: the category of award the nominee was nominated for.

· __Nominee__: the person nominated for the award.

· __Additional Info__: additional details, such as the movie the nominee took part in or the name of the character played.

· __Won?__: YES or NO, depending on whether the nominee won the award.
    
The dataset we will use can be found here: https://www.aggdata.com/awards/oscar

In [23]:
# Start by importing pandas and reading the file into a DataFrame.

import pandas as pd
academy_awards = pd.read_csv('academy_awards.csv', encoding='ISO-8859-1')
academy_awards.head(5)

Unnamed: 0,Year,Category,Nominee,Additional Info,Won?,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10
0,2010 (83rd),Actor -- Leading Role,Javier Bardem,Biutiful {'Uxbal'},NO,,,,,,
1,2010 (83rd),Actor -- Leading Role,Jeff Bridges,True Grit {'Rooster Cogburn'},NO,,,,,,
2,2010 (83rd),Actor -- Leading Role,Jesse Eisenberg,The Social Network {'Mark Zuckerberg'},NO,,,,,,
3,2010 (83rd),Actor -- Leading Role,Colin Firth,The King's Speech {'King George VI'},YES,,,,,,
4,2010 (83rd),Actor -- Leading Role,James Franco,127 Hours {'Aron Ralston'},NO,,,,,,


The first thing we notice when exploring the first rows of the DataFrame is that there are six *unnamed columns* (`Unnamed: 5` through `Unnamed: 10`) that appear to hold only NaNs. To check whether they contain any other values, we will call the *value_counts* method on each of them.

In [24]:
academy_awards['Unnamed: 5'].value_counts(normalize = True, dropna = False)

NaN                                                                                                             0.998915
*                                                                                                               0.000691
 discoverer of stars                                                                                            0.000099
 error-prone measurements on sets. [Digital Imaging Technology]"                                                0.000099
 resilience                                                                                                     0.000099
 D.B. "Don" Keele and Mark E. Engebretson has resulted in the over 20-year dominance of constant-directivity    0.000099
Name: Unnamed: 5, dtype: float64

In [25]:
academy_awards['Unnamed: 6'].value_counts(normalize = True, dropna = False)

NaN                                                                 0.998816
*                                                                   0.000888
 sympathetic                                                        0.000099
 direct radiator bass style cinema loudspeaker systems. [Sound]"    0.000099
 flexibility and water resistance                                   0.000099
Name: Unnamed: 6, dtype: float64

In [26]:
academy_awards['Unnamed: 7'].value_counts(normalize = True, dropna = False)

NaN                                                   0.999704
*                                                     0.000099
 while requiring no dangerous solvents. [Systems]"    0.000099
 kindly                                               0.000099
Name: Unnamed: 7, dtype: float64

In [27]:
academy_awards['Unnamed: 8'].value_counts(normalize = True, dropna = False)

NaN                                               0.999803
*                                                 0.000099
 understanding comedy genius - Mack Sennett.""    0.000099
Name: Unnamed: 8, dtype: float64

In [28]:
academy_awards['Unnamed: 9'].value_counts(normalize = True, dropna = False)

NaN    0.999901
*      0.000099
Name: Unnamed: 9, dtype: float64

In [29]:
academy_awards['Unnamed: 10'].value_counts(normalize = True, dropna = False)

NaN    0.999901
*      0.000099
Name: Unnamed: 10, dtype: float64
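
The six cells above repeat the same call, so they could be condensed into a single loop. The following is only a minimal sketch: it runs on a small stand-in DataFrame (`sample`, with made-up values) instead of the real file, and the helper name `unnamed_value_shares` is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in data with made-up values; the real notebook would use academy_awards.
sample = pd.DataFrame({
    "Year": ["2010 (83rd)", "2010 (83rd)", "2009 (82nd)", "2009 (82nd)"],
    "Unnamed: 5": [np.nan, "*", np.nan, np.nan],
    "Unnamed: 6": [np.nan, np.nan, np.nan, np.nan],
})

def unnamed_value_shares(df):
    """Return {column: normalized value counts} for every 'Unnamed: N' column."""
    unnamed_cols = [col for col in df.columns if col.startswith("Unnamed:")]
    return {
        col: df[col].value_counts(normalize=True, dropna=False)
        for col in unnamed_cols
    }

shares = unnamed_value_shares(sample)
for col, counts in shares.items():
    print(col)
    print(counts)
```

On the real data, the same helper would pick up every column whose name starts with `Unnamed:`, so no column names need to be typed by hand.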

# Filtering the Data

As the exploratory analysis showed, this dataset doesn't have a consistent format: some rows spill stray text into the unnamed columns. A consistent format is extremely important for querying the data later on with SQL.

To make our dataset easier to work with, we are going to filter it down to a smaller subset that only contains the rows for the following categories:

· __Actor -- Leading Role__

· __Actor -- Supporting Role__

· __Actress -- Leading Role__

· __Actress -- Supporting Role__
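
One way to select these rows is a boolean mask built with `Series.isin`. This is a minimal sketch rather than the notebook's own code: it uses a stand-in DataFrame (`sample`) with placeholder nominees, and the names `award_categories` and `acting_nominees` are assumptions made for illustration.

```python
import pandas as pd

# Stand-in rows with placeholder nominees; the real notebook would filter academy_awards.
sample = pd.DataFrame({
    "Category": [
        "Actor -- Leading Role",
        "Actress -- Supporting Role",
        "Sound",
        "Directing",
    ],
    "Nominee": ["Nominee A", "Nominee B", "Nominee C", "Nominee D"],
})

# The four acting categories listed above.
award_categories = [
    "Actor -- Leading Role",
    "Actor -- Supporting Role",
    "Actress -- Leading Role",
    "Actress -- Supporting Role",
]

# isin() returns True for rows whose Category is one of the four values.
acting_nominees = sample[sample["Category"].isin(award_categories)]
print(acting_nominees)
```

The mask matches on the exact string, so the category names must be spelled exactly as they appear in the `Category` column.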

First, we are going to clean up the "Year" column: we only want the four-digit year that precedes the parentheses, converted to an integer so it is easier to work with.

In [37]:
# Cast to str first: the .str accessor only works on string columns, so this
# cell would fail if "Year" were already numeric (e.g. after a previous run).
academy_awards["Year"] = academy_awards["Year"].astype(str).str[0:4].astype('int64')
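
The `.str` accessor is only available on string columns, so the slice raises `AttributeError` when `Year` is already numeric. One likely cause, assumed here, is that the cell was run a second time after the conversion had already happened. The sketch below tries the step on a stand-in Series; the helper name `extract_year` is hypothetical, and casting to `str` first is one possible way to make the conversion safe to repeat.

```python
import pandas as pd

def extract_year(year_column):
    """Keep the first four characters of each value and convert them to int64."""
    # astype(str) makes the .str accessor usable even if the column is already numeric.
    return year_column.astype(str).str[0:4].astype("int64")

# Stand-in values in the same "YYYY (Nth)" shape as the Year column.
raw_years = pd.Series(["2010 (83rd)", "2009 (82nd)"])

years = extract_year(raw_years)
# Applying the helper again to the converted column leaves it unchanged.
years_again = extract_year(years)

print(years.tolist())
print(years_again.tolist())
```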