# In-Class Exercises 4 (solution)

The data that you will use come from the [replication package]() to Ricardo Duque
Gabriel, Mathias Klein, and Ana Sofia Pessoa: *The Political Costs of Austerity*,
forthcoming at the Review of Economics and Statistics.

> **Note:**
> 
> Please commit every time you solve one of the exercises. An example commit message
> could be `"Solution to question 1"`. Feel free to commit more than once per
> exercise if solving it requires multiple complicated steps.
>
> Push every now and then and switch to somebody else's machine.

## Using the `pathlib` library

---
### Question 1

Assign the path of the current directory to a variable `this_dir`. 

Verify that the type of the variable is `pathlib.PosixPath` (`PosixPath` objects are used on Unix-like systems such as Linux and macOS) or `pathlib.WindowsPath`.

Display the absolute path of the directory.
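
One possible approach, as a minimal sketch: it assumes the notebook runs from the directory you want to refer to, and it uses a relative `Path()` rather than `Path.cwd()` so that later paths stay relative.

```python
from pathlib import Path

# Path() without arguments is the relative path ".", i.e. the current directory.
this_dir = Path()

# On Linux/macOS this prints PosixPath, on Windows WindowsPath.
print(type(this_dir))

# resolve() turns the relative path into an absolute one.
print(this_dir.resolve())
```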

---
### Question 2

In the `original_data` directory, there is a file called `Data_Elections.dta`. Assign
the path of this file to a variable `data_file`. The type of the variable should be 
`pathlib.PosixPath` or `pathlib.WindowsPath`. 

- When creating the `Path` object, do not use absolute paths.
- Display the absolute path of the `data_file`.
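
A minimal sketch of one way to do this, assuming `this_dir` is the relative path to the directory that contains `original_data`. The `/` operator joins path components in a platform-independent way.

```python
from pathlib import Path

this_dir = Path()

# Build the path from relative components only.
data_file = this_dir / "original_data" / "Data_Elections.dta"

# The object itself stays relative; resolve() shows the absolute location.
print(data_file)
print(data_file.resolve())
```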

---
### Bonus: Question 3 

This question is optional and meant to sharpen your theoretical understanding of absolute and relative paths. Skip it if you want.

Using only the objects `this_dir` and `data_file` along with their methods, get the relative path to `data_file` as seen from `this_dir`. Display the relative path.

> Note: We have not seen this in the screencast; you are on your own with your favourite
> search engine.
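
One way this could look is sketched below, using `Path.relative_to()`. Resolving both paths first is a choice made here so that the comparison happens between two absolute paths; other solutions are possible.

```python
from pathlib import Path

this_dir = Path()
data_file = this_dir / "original_data" / "Data_Elections.dta"

# relative_to() strips the leading part that both paths share.
relative_path = data_file.resolve().relative_to(this_dir.resolve())
print(relative_path)
```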

## pandas

---
### Question 4

- Import the `pandas` library as `pd`. 

Then set the following options:
- The first two make pandas behave in the "modern" way (copy-on-write semantics and a dedicated string data type for text columns).
- The third sets the plotting backend to `plotly`.

```python
pd.options.mode.copy_on_write = True
pd.options.future.infer_string = True
pd.options.plotting.backend = "plotly"
```


---
### Question 5

Read the file `Data_Elections.dta` into a `pd.DataFrame` object called `data`. Use the
`data_file` object for doing so.

You are likely to get an error when doing so. Find out
how to fix it. *(Hint: The error message is even more explicit than usual in Python;
you may want to have a careful look at it despite its length. If working in VS Code,
make sure that you can see the entire message by selecting from the view options at the
bottom of the cell output.)*
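
A sketch of a likely fix, assuming the error is the one pandas raises when a Stata file has value labels that are not unique: in that case, `pd.read_stata` can be told not to convert labelled columns to categoricals. The example writes a small toy file (its contents are made up for illustration) so that it runs without the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data standing in for Data_Elections.dta.
toy = pd.DataFrame({"cid": [1, 2], "year": [2000, 2004]})

with tempfile.TemporaryDirectory() as tmp:
    toy_file = Path(tmp) / "toy.dta"
    toy.to_stata(toy_file, write_index=False)

    # convert_categoricals=False keeps the raw codes instead of
    # turning value labels into categories.
    data = pd.read_stata(toy_file, convert_categoricals=False)

print(data)
```

With the real file, the call would be `pd.read_stata(data_file, convert_categoricals=False)`.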

---
### Question 6

Familiarise yourself with the dataset. E.g., you may want to look at the column names, the
shape, some rows, data types ...

---
### Question 7

We had to discard some information when reading the dataset. Luckily, we can access all
of it using a low-level `pd.io.stata.StataReader` object. This will allow us to look at all
meta information that is stored with the dataset in Stata format.

Create `data_info` that contains the `pd.io.stata.StataReader` object. Use the `data_file` object for
doing so.

Look at the various label-related methods of the `StataReader` object (e.g.,
`variable_labels()` and `value_labels()`). Explain what
they do. Can you explain now why the error occurred when reading the dataset?
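
The sketch below shows how the reader could be used. It works on a toy file written on the fly; the column contents and the label `"Country id"` are made up for illustration and are not taken from the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data; the categorical column gets value labels in Stata.
toy = pd.DataFrame(
    {"cid": [1, 2], "Country": pd.Categorical(["Austria", "Belgium"])}
)

with tempfile.TemporaryDirectory() as tmp:
    toy_file = Path(tmp) / "toy.dta"
    toy.to_stata(
        toy_file, write_index=False, variable_labels={"cid": "Country id"}
    )

    with pd.io.stata.StataReader(toy_file) as data_info:
        # Maps column names to their descriptive labels.
        variable_labels = data_info.variable_labels()
        # Maps label names to {code: label} dictionaries.
        value_labels = data_info.value_labels()

print(variable_labels)
print(value_labels)
```

If two codes of the same variable map to the same label text, pandas cannot build a categorical with unique categories from them, which is one plausible cause of the error when reading the file.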

---
### Question 8

Look at the structure of `data` again. Would you keep all of

- `Country` and `cid`?
- `Nuts_id`,  `Name`, and `id`?

Why or why not?


---
### Bonus: Question 9 (only if you have some time left)

This question is optional and is meant to justify why we can drop `cid` and `id`.

Make sure that we can safely drop `cid` and `id` from the dataset by finding out the unique combinations of the sets of variables from the previous question.

> Note: The `.unique()` method only works on Series; for several columns at once, you'll need to use `.drop_duplicates()`.
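
A minimal sketch of such a check on made-up toy data: if the number of unique pairs equals the number of unique values in each column, the two columns carry the same information.

```python
import pandas as pd

# Hypothetical toy data with the column names from the dataset.
toy = pd.DataFrame(
    {
        "Country": ["Austria", "Austria", "Belgium", "Belgium"],
        "cid": [1, 1, 2, 2],
    }
)

# One row per distinct (Country, cid) combination.
unique_pairs = toy[["Country", "cid"]].drop_duplicates()

# A one-to-one mapping has as many pairs as distinct values on either side.
is_one_to_one = (
    len(unique_pairs) == toy["Country"].nunique() == toy["cid"].nunique()
)
print(unique_pairs)
print(is_one_to_one)
```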

---
### Question 10

Drop `cid` and `id` from the dataset, which you should continue to store in the variable
`data`. Verify that the columns are gone.

To do so, you can either use the `drop(columns=[...])` method or select those columns
that you want to keep (there are even more options). Can you think of a reason for each
strategy, particularly in an interactive setting like a notebook?
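
Both strategies are sketched below on a one-row toy frame (the values are made up; only the column names come from the dataset).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Country": ["Austria"],
        "cid": [1],
        "Nuts_id": ["AT11"],
        "Name": ["Burgenland"],
        "id": [1],
    }
)

# Strategy 1: name the columns to remove.
dropped = toy.drop(columns=["cid", "id"])

# Strategy 2: name the columns to keep.
selected = toy[["Country", "Nuts_id", "Name"]]

print(list(dropped.columns))
```

One difference worth noting in a notebook: running the `drop` cell a second time on an already reduced `data` raises a `KeyError`, while re-running the selection does not.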

---
### Question 11

Give the columns (more) sensible names using the `lowercase_with_underscores`
convention. You should continue to store the dataset in the variable `data`. 

> Note: While trying things out, do not assign to `data` yet: once you change
> the column names, you cannot access the old ones anymore. That will be fine
> eventually, but not while you are still experimenting interactively.

> Tip: You can just copy the output of the `variable_labels()` call above to get
> started.
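
A minimal sketch of the renaming step with a dictionary passed to `rename`. Only three columns are shown; the mapping for the remaining columns is left out here because their original names depend on the data file.

```python
import pandas as pd

# Toy frame with three of the original column names.
toy = pd.DataFrame(
    {"Country": ["Austria"], "Nuts_id": ["AT11"], "Name": ["Burgenland"]}
)

# Old name -> new name, following lowercase_with_underscores.
new_names = {"Country": "country", "Nuts_id": "nuts_id", "Name": "nuts_name"}

# rename() returns a new DataFrame, so toy keeps its old names
# until the result is assigned back.
renamed = toy.rename(columns=new_names)
print(list(renamed.columns))
```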

---
### Question 12

Convert all variables to sensible data types. In some cases, you may want to keep the column names; in other cases, you may want to generate new ones. Briefly explain all of your choices.

#### `"country", "nuts_id", "nuts_name"` to `pd.CategoricalDtype()`

This can be done in one step because the variables already have sensible labels.
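
The one-step conversion could look like the sketch below, which passes a dictionary of column names to `astype`. The toy values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "country": ["Austria", "Austria", "Belgium"],
        "nuts_id": ["AT11", "AT12", "BE10"],
        "nuts_name": ["Region A", "Region B", "Region C"],
        "year": [2000, 2000, 2004],
    }
)

cat_cols = ["country", "nuts_id", "nuts_name"]

# astype() accepts a {column: dtype} mapping; other columns are untouched.
converted = toy.astype({col: pd.CategoricalDtype() for col in cat_cols})
print(converted.dtypes)
```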

#### `year` to a `pd.Int16Dtype()`

#### `election_type` to a `pd.CategoricalDtype()`

The values in the column are not meaningful, but we can use the `value_labels()` method of the
`pd.io.stata.StataReader` object to get the labels.

1. Create `election_cats`, a dictionary with the labels from `value_labels()`.
2. Replace the duplicated label with `National + Regional A` and `National + Regional B`.
3. Convert `election_type` first to a `pd.Int8Dtype()` and then to a `pd.CategoricalDtype()`.
4. Finally, use the `election_cats` dictionary to rename the categories to the
   meaningful labels.
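
The four steps can be sketched as follows. The codes and the labels `"National"` and `"Regional"` in `election_cats` are hypothetical placeholders; with the real data, the dictionary comes from the reader's `value_labels()` output.

```python
import pandas as pd

# Step 1 (hypothetical stand-in for the dictionary read from the file).
election_cats = {
    1: "National",
    2: "Regional",
    3: "National + Regional",
    4: "National + Regional",
}

# Step 2: make the duplicated label unique.
election_cats[3] = "National + Regional A"
election_cats[4] = "National + Regional B"

# Toy column as it might look after reading without categoricals.
election_type = pd.Series([1.0, 2.0, 3.0, 4.0, None], name="election_type")

election_type = (
    election_type.astype(pd.Int8Dtype())  # Step 3a: nullable small integer
    .astype(pd.CategoricalDtype())  # Step 3b: categories are the codes
    .cat.rename_categories(election_cats)  # Step 4: codes -> labels
)
print(election_type)
```

Categories must be unique, which is why the duplicated label has to be disambiguated before renaming.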

#### Round and convert all columns starting with `number` to `pd.Int32Dtype()` (except for `number_parties_effective`)
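
A minimal sketch of this conversion on toy data. The column `number_voters` is a hypothetical name used for illustration; `number_parties_effective` is excluded because it is not a count.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "number_voters": [1000.0, 2500.4, None],
        "number_parties_effective": [3.7, 4.2, 5.1],
    }
)

number_cols = [
    col
    for col in toy.columns
    if col.startswith("number") and col != "number_parties_effective"
]

# Round first: casting non-integer floats to a nullable integer dtype fails.
converted = toy.assign(
    **{col: toy[col].round().astype(pd.Int32Dtype()) for col in number_cols}
)
print(converted.dtypes)
```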

#### `pm_party_left` to `pd.CategoricalDtype()`
- Convert it first to a `pd.Int8Dtype()` and then to a `pd.CategoricalDtype()`
- You can then rename the categories to meaningful labels

---
### Question 13

Summarise the cleaned data in a similar way as above. Now also look at summary
statistics, including value counts of categorical variables. Do you notice anything
when doing the latter? If so, will you have to be careful when interpreting descriptive
statistics?

---
### Question 14

Make three plots of mean vote shares by year, across all NUTS regions and elections.
- far right share
- far left share
- far share (any)

Don't worry about whether these plots make a lot of sense (i.e., any weighting with
electorate size or the like).
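
The aggregation behind these plots could be sketched as below. The column names `share_far_right`, `share_far_left`, and `share_far` are assumptions made for illustration; use whatever names you chose when renaming. The toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2004, 2004],
        "share_far_right": [0.10, 0.20, 0.15, 0.25],
        "share_far_left": [0.05, 0.15, 0.10, 0.10],
        "share_far": [0.15, 0.35, 0.25, 0.35],
    }
)

share_cols = ["share_far_right", "share_far_left", "share_far"]

# Unweighted mean per year across all rows (regions and elections).
mean_shares = toy.groupby("year")[share_cols].mean()
print(mean_shares)
```

Calling `.plot()` on one column of `mean_shares` then draws the line for that share, using whichever plotting backend is configured.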

---
### Question 15

Make three scatterplots of vote shares by year, across all NUTS regions and elections.

- far right share
- far left share
- far share (any)

Colour the dots using the country.