# Exercises

### Tidy Data from Excel

An Excel spreadsheet with some brief information on awards given to movies is available at:

> https://www.gnosis.cx/cleaning/Film_Awards.xlsx

In a more fleshed-out case, we might have data for many more years, more types of awards, more associations that grant awards, and so on.  While the organization of this spreadsheet is much like a great many you will encounter "in the wild," it is very little like the tidy data we would rather work with.  In this simple example, only 63 data values occur, and you could probably enter them into the desired structure by hand as quickly as you could code the transformations.  However, the point of this exercise is to write code that generalizes to larger data sets of similar structure.

<img src="img/Film_Awards.png" alt="Film Awards"/>

__Image: Film Awards Spreadsheet__

Your task in this exercise is to read this data into a single well-normalized data frame, using whichever language and library you are most comfortable with.  Along the way, you will need to remediate whatever data integrity problems you detect.  As examples of issues to look out for:

* The film _1917_ was stored as a number not a string when naïvely entered into a cell.
* The spelling of some values is inconsistent.  Olivia Colman's name is incorrectly transcribed as 'Coleman' in one occurrence.  There is a spacing issue in one value you will need to identify.
* Structurally, the apparent parallelism among the tabular areas is not real.  Person names are sometimes listed under the name of the association, but elsewhere under another column.  Film names are sometimes listed under the association, and other times elsewhere.
* Some column names occur multiple times in the same tabular area.
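
The value-level problems in this list can be handled by a small normalizing function applied to every cell before any reshaping.  The following is a minimal sketch rather than a complete solution; the function name `clean_cell` and the `CORRECTIONS` map are hypothetical, and only the Colman correction is taken from the text above.

```python
import math
import re

# Known misspellings mapped to corrected forms.  Only this entry comes from
# the exercise text; extend the map as you discover other problems.
CORRECTIONS = {"Olivia Coleman": "Olivia Colman"}

def clean_cell(value):
    """Return a cell value as a normalized string, or None if it is empty."""
    if value is None:
        return None
    # Some readers represent empty cells as a floating-point NaN.
    if isinstance(value, float) and math.isnan(value):
        return None
    # Numeric cells may arrive as int or float, so a title such as 1917
    # can show up as 1917 or 1917.0 rather than as the string "1917".
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", str(value)).strip()
    if not text:
        return None
    return CORRECTIONS.get(text, text)
```

Applying one function uniformly keeps the cleanup rules in a single place, which matters more as the number of spreadsheets grows.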

In thinking about a good data frame organization, consider what the independent and dependent variables are.  In each year, each association grants an award in each category; these are independent dimensions.  A person name and a film name are slightly tricky since they are not exactly independent, but at the same time some awards go to a film and others to a person.  Moreover, one actor might appear in multiple films in a year (not in this sample data, but do not rule it out).  Likewise, multiple films have at times shared the same name over film history.  Some persons are both director and actor (in either the same or different films).
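
One possible shape for a tidy record, following the reasoning above, has one row per award, with the year, association, and category as keys and the person and film as values that may each be missing.  This is only a sketch of one reasonable design; the names `Award` and `tidy_block` are hypothetical, and the sample values are placeholders rather than the spreadsheet's contents.

```python
from typing import NamedTuple, Optional

class Award(NamedTuple):
    year: int
    association: str
    category: str
    person: Optional[str]  # None when the award goes to a film as a whole
    film: Optional[str]    # None when the source does not name a film

def tidy_block(year, association, block):
    """Turn one wide block {category: (person, film)} into tidy records."""
    return [Award(year, association, category, person, film)
            for category, (person, film) in block.items()]

# Placeholder values, not the spreadsheet's actual contents.
wide = {
    "Best Picture": (None, "Film A"),
    "Best Actor": ("Person X", "Film B"),
}
records = tidy_block(2019, "Some Association", wide)
```

Keeping `person` and `film` as separate nullable fields avoids forcing film awards and acting awards into a single "winner" column.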

Once you have a useful data frame, use it to answer these questions in summary reports:

* For each film involved in multiple awards, list the award and year it is associated with.
* For each actor/actress winning multiple awards, list the film and award they are associated with.
* While not occurring in this small data set, sometimes actors/actresses win awards for multiple films (usually in different years).  Make sure your code will handle that situation.
* It is manual work, but you may want to research and add awards given in other years; in particular, adding some data will show actors with awards for multiple films.  Do your other reports correctly summarize the larger data set?
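
The reports above reduce to grouping the tidy records by one field and keeping the groups with more than one member.  The sketch below assumes records are plain tuples of `(year, association, category, person, film)`; the helper `awards_by` is hypothetical, and the data are placeholders chosen so that one person wins for two different films.

```python
from collections import defaultdict

# Placeholder records, not the spreadsheet's actual contents.
records = [
    (2018, "Org 1", "Best Picture", None, "Film A"),
    (2018, "Org 2", "Best Director", "Person Z", "Film A"),
    (2018, "Org 1", "Best Actress", "Person X", "Film B"),
    (2019, "Org 2", "Best Actress", "Person X", "Film C"),
    (2019, "Org 1", "Best Actor", "Person Y", "Film D"),
]
PERSON, FILM = 3, 4  # positions of those fields within each tuple

def awards_by(records, field):
    """Group records by one field; keep only keys with two or more awards."""
    groups = defaultdict(list)
    for rec in records:
        if rec[field] is not None:
            groups[rec[field]].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}

by_film = awards_by(records, FILM)
by_person = awards_by(records, PERSON)

for person, recs in by_person.items():
    # A set handles a person who wins for several different films.
    films = sorted({rec[FILM] for rec in recs})
    print(person, films)
```

Grouping on the film title alone treats two films that share a title as one; a fuller solution would include a release year in the key.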

### Tidy Data from SQL

An SQLite database with roughly the same brief information as in the prior spreadsheet is available at:

> https://www.gnosis.cx/cleaning/Film_Awards.sqlite

However, the information in the database version is relatively well normalized and typed.  Some additional information has also been added about several of the entities that appear in the spreadsheet.  This schema holds only slightly more information than the spreadsheet does, but it should be able to accommodate a large amount of data on films, actors, directors, and awards, and the relationships among those data.

```sql
sqlite> .tables
actor     award     director  org_name
```

As was mentioned in the prior exercise, the same name for a film can be used more than once, even by the same director.  For example, Abel Gance used the title _J'accuse!_ for both his 1919 and 1938 films with connected subject matter.

```sql
sqlite> SELECT * FROM director WHERE year < 1950;
Abel Gance|J'accuse!|1919
Abel Gance|J'accuse!|1938
```
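
Because a title alone does not identify a film, it can help to check which titles are reused before joining tables on them.  The sketch below rebuilds the two rows shown above in an in-memory database; the column names `name`, `film`, and `year` are assumptions, since the schema of `director` is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column names; check `.schema director` in the real database.
conn.execute("CREATE TABLE director (name TEXT, film TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO director VALUES (?, ?, ?)",
    [("Abel Gance", "J'accuse!", 1919), ("Abel Gance", "J'accuse!", 1938)],
)

# Titles that occur with more than one distinct year.
reused = conn.execute("""
    SELECT film, COUNT(DISTINCT year) AS versions
    FROM director
    GROUP BY film
    HAVING COUNT(DISTINCT year) > 1
""").fetchall()
print(reused)
```

Any title this query returns needs the year as part of its key in later joins.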

Let us look at a selection from the `actor` table, for example.  In this table we have a column `gender` to differentiate beyond name. As of this writing, no transgender actor has been nominated for a major award both before and after a change in gender identity, but this schema allows for that possibility.  In any case, we can use this field to differentiate the "actor" versus "actress" awards that many organizations grant.

```sql
sqlite> .schema actor
CREATE TABLE actor (name TEXT, film TEXT, year INTEGER, gender CHAR(1));

sqlite> SELECT * FROM actor WHERE name='Joaquin Phoenix';
Joaquin Phoenix|Joker|2019|M
Joaquin Phoenix|Walk the Line|2006|M
Joaquin Phoenix|Hotel Rwanda|2004|M
Joaquin Phoenix|Her|2013|M
Joaquin Phoenix|The Master|2013|M
```
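
The rows above already show one situation the prior exercise asked you to allow for: two films for the same actor in the same year.  As a sketch, the queries below derive an "Actor" or "Actress" label from `gender` and find the actor-years with more than one film.  The table is rebuilt in memory from the rows shown above, and the code `'F'` is an assumption, since only `'M'` appears in this output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actor (name TEXT, film TEXT, year INTEGER, gender CHAR(1))"
)
conn.executemany("INSERT INTO actor VALUES (?, ?, ?, ?)", [
    ("Joaquin Phoenix", "Joker", 2019, "M"),
    ("Joaquin Phoenix", "Walk the Line", 2006, "M"),
    ("Joaquin Phoenix", "Hotel Rwanda", 2004, "M"),
    ("Joaquin Phoenix", "Her", 2013, "M"),
    ("Joaquin Phoenix", "The Master", 2013, "M"),
])

# Map the gender code to an award label; 'F' is an assumed code.
labels = conn.execute("""
    SELECT DISTINCT name,
           CASE gender WHEN 'M' THEN 'Actor'
                       WHEN 'F' THEN 'Actress' END AS label
    FROM actor
""").fetchall()

# Actor-years with several films, where name and year do not pick one film.
busy_years = conn.execute("""
    SELECT name, year, COUNT(*) AS films
    FROM actor
    GROUP BY name, year
    HAVING COUNT(*) > 1
""").fetchall()
print(labels, busy_years)
```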

The goal in this exercise is to create the same tidy data frame that you created in the prior exercise, and answer the same questions that were asked there.  If some questions can be answered directly with SQL, feel free to take that approach instead.  For this exercise, only consider awards for the years 2017, 2018, and 2019.  Some others are included in an incomplete way, but your reports are for those years.

```sql
sqlite> SELECT * FROM award WHERE winner='Frances McDormand';
Oscar|Best Actress|2017|Frances McDormand
GG|Actress/Drama|2017|Frances McDormand
Oscar|Best Actress|1997|Frances McDormand
```
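
One way to start on the tidy data frame is to join `award` to `actor` and restrict the years.  The following is a sketch under several assumptions: the `award` column names other than `winner` are guesses, the single `actor` row is an illustrative stand-in for whatever the real table holds, and joining on name and year presumes that the two tables use the same year convention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names other than `winner` are assumed; check `.schema award`.
conn.execute(
    "CREATE TABLE award (org TEXT, category TEXT, year INTEGER, winner TEXT)"
)
conn.execute(
    "CREATE TABLE actor (name TEXT, film TEXT, year INTEGER, gender CHAR(1))"
)
conn.executemany("INSERT INTO award VALUES (?, ?, ?, ?)", [
    ("Oscar", "Best Actress", 2017, "Frances McDormand"),
    ("GG", "Actress/Drama", 2017, "Frances McDormand"),
    ("Oscar", "Best Actress", 1997, "Frances McDormand"),
])
# Illustrative row only; the real table may differ.
conn.execute(
    "INSERT INTO actor VALUES (?, ?, ?, ?)",
    ("Frances McDormand", "Three Billboards Outside Ebbing, Missouri",
     2017, "F"),
)

# LEFT JOIN keeps awards with no matching actor row, such as film awards.
tidy = conn.execute("""
    SELECT a.org, a.category, a.year, a.winner, p.film
    FROM award AS a
    LEFT JOIN actor AS p ON p.name = a.winner AND p.year = a.year
    WHERE a.year BETWEEN 2017 AND 2019
    ORDER BY a.year, a.org
""").fetchall()
for row in tidy:
    print(row)
```

If an actor has several films in one year, this join produces one row per film for each award, so the result needs a further rule to pick the film the award was actually for.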