# In-Class Exercises 4 (solution)

The data that you will use come from the [replication package]() to Ricardo Duque
Gabriel, Mathias Klein, and Ana Sofia Pessoa: *The Political Costs of Austerity*,
forthcoming at the Review of Economics and Statistics.

> **Note:**
> 
> Please commit every time you solve one of the exercises. An example commit message
> could be `"Solution to question 1"`. Feel free to commit more than once per
> exercise if solving it requires multiple complicated steps.
>
> Push every now and then and switch to somebody else's machine.

## Using the `pathlib` library

---
### Question 1

Assign the path of the current directory to a variable `this_dir`. 

Verify that the type of the variable is `pathlib.PosixPath` (`PosixPath` is used on Unix-like systems such as Linux and macOS) or `pathlib.WindowsPath`.

Display the absolute path of the directory.

In [1]:
from pathlib import Path

this_dir = Path()
type(this_dir)

pathlib.PosixPath

In [2]:
this_dir.resolve()

PosixPath('/Users/malka/Dropbox/teaching-uliege/ HSG-8276-Spring-2025/m4-exercises')

---
### Question 2

In the `original_data` directory, there is a file called `Data_Elections.dta`. Assign
the path of this file to a variable `data_file`. The type of the variable should be 
`pathlib.PosixPath` or `pathlib.WindowsPath`. 

- When creating the `Path` object, do not use absolute paths.
- Display the absolute path of the `data_file`.

In [3]:
data_file = this_dir / "original_data" / "Data_Elections.dta"
data_file.resolve()

PosixPath('/Users/malka/Dropbox/teaching-uliege/ HSG-8276-Spring-2025/m4-exercises/original_data/Data_Elections.dta')
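As a small aside (not part of the original solution), the following sketch illustrates that a `Path` built from relative pieces stays relative until `.resolve()` is called. It mirrors the cell above, and nothing needs to exist on disk for it to run.

```python
from pathlib import Path

# Build the path from relative pieces only, as in the exercise.
this_dir = Path()
data_file = this_dir / "original_data" / "Data_Elections.dta"

# The object itself is still a relative path ...
print(data_file.is_absolute())
# ... and only resolve() anchors it at the current working directory.
print(data_file.resolve().is_absolute())

# File name and extension are available without touching the file system.
print(data_file.name, data_file.suffix)
```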

---
### Bonus: Question 3 

This question is optional and meant to sharpen your theoretical understanding of absolute and relative paths. Skip if you want.

Using only the objects `this_dir` and `data_file` along with their methods, get the relative path to `data_file` as seen from `this_dir`. Display the relative path.

> Note: We have not seen this in the screencast; you are on your own with your favourite
> search engine.

In [4]:
data_file.resolve().relative_to(this_dir.resolve())

PosixPath('original_data/Data_Elections.dta')
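A minimal sketch of how `relative_to()` behaves, using `PurePosixPath` so that it runs identically on every operating system. The directory `/home/user/project` and the file `/tmp/other.dta` are hypothetical and only serve as an illustration.

```python
from pathlib import PurePosixPath

# Hypothetical base directory; pure paths never touch the file system.
base = PurePosixPath("/home/user/project")
target = base / "original_data" / "Data_Elections.dta"

# relative_to() strips the common prefix.
relative = target.relative_to(base)
print(relative)

# A path that does not lie below `base` cannot be expressed this way.
try:
    PurePosixPath("/tmp/other.dta").relative_to(base)
    outside_ok = True
except ValueError:
    outside_ok = False
print(outside_ok)
```

This is why the solution resolves both paths first: both sides must share a common prefix for `relative_to()` to work.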

## pandas

---
### Question 4

- Import the `pandas` library as `pd`. 

In [5]:
import pandas as pd

The following options are useful:
- The first two opt in to "modern" pandas behaviour (copy-on-write and a dedicated string data type).
- The third sets the plotting backend to `plotly`.

In [6]:
pd.options.mode.copy_on_write = True
pd.options.future.infer_string = True
pd.options.plotting.backend = "plotly"


---
### Question 5

Read the file `Data_Elections.dta` into a `pd.DataFrame` object called `data`. Use the
`data_file` object for doing so.

You are likely to get an error when doing so. Find out
how to fix it. *(Hint: The error message is even more explicit than usual in Python;
you may want to have a careful look at it despite its length. If working in VS Code,
make sure that you can see the entire message by selecting from the view options at the
bottom of the cell output.)*

In [7]:
data = pd.read_stata(data_file, convert_categoricals=False)

---
### Question 6

Familiarise yourself with the dataset. E.g., you may want to look at the column names, the
shape, some rows, data types ...

In [8]:
data.columns

Index(['Country', 'cid', 'Nuts_id', 'Name', 'Year', 'ElectionType',
       'EligibleVoters', 'Valid', 'HHI', 'Far_Right', 'Far_Left',
       'Far_Right_share', 'Far_Left_share', 'Far_share', 'Turnout',
       'F0Far_Incumbent', 'left', 'id'],
      dtype='string')

In [9]:
data.shape

(4536, 18)

In [10]:
data

Unnamed: 0,Country,cid,Nuts_id,Name,Year,ElectionType,EligibleVoters,Valid,HHI,Far_Right,Far_Left,Far_Right_share,Far_Left_share,Far_share,Turnout,F0Far_Incumbent,left,id
0,Austria,1.0,AT11,Burgenland (AT),1980.0,,192225.0,180717.0,0.474132,4941.0,694.0,0.027341,0.003840,0.031181,0.949138,0.0,1.0,1.0
1,Austria,1.0,AT11,Burgenland (AT),1981.0,,192225.0,180717.0,0.474132,4941.0,694.0,0.027341,0.003840,0.031181,0.949138,0.0,1.0,1.0
2,Austria,1.0,AT11,Burgenland (AT),1982.0,3.0,198000.0,173691.0,0.469377,5559.0,942.0,0.032005,0.005423,0.037429,0.900672,0.0,1.0,1.0
3,Austria,1.0,AT11,Burgenland (AT),1983.0,1.0,197459.0,184704.0,0.460471,4090.0,543.0,0.022144,0.002940,0.025083,0.947716,4090.0,1.0,1.0
4,Austria,1.0,AT11,Burgenland (AT),1984.0,,197459.0,184704.0,0.460471,4090.0,543.0,0.022144,0.002940,0.025083,0.947716,4090.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4531,Sweden,8.0,SE33,Övre Norrland,2011.0,,803062.0,660666.0,0.255184,16251.0,68424.0,0.024598,0.103568,0.128166,0.835183,0.0,0.0,126.0
4532,Sweden,8.0,SE33,Övre Norrland,2012.0,,803062.0,660666.0,0.255184,16251.0,68424.0,0.024598,0.103568,0.128166,0.835183,0.0,0.0,126.0
4533,Sweden,8.0,SE33,Övre Norrland,2013.0,,803062.0,660666.0,0.255184,16251.0,68424.0,0.024598,0.103568,0.128166,0.835183,0.0,0.0,126.0
4534,Sweden,8.0,SE33,Övre Norrland,2014.0,7.0,1211834.0,877980.0,0.213000,59626.0,94202.0,0.067913,0.107294,0.175207,0.734514,0.0,1.0,126.0


In [11]:
data.dtypes

Country            string[pyarrow_numpy]
cid                              float32
Nuts_id            string[pyarrow_numpy]
Name               string[pyarrow_numpy]
Year                             float64
ElectionType                     float32
EligibleVoters                   float64
Valid                            float64
HHI                              float32
Far_Right                        float64
Far_Left                         float64
Far_Right_share                  float32
Far_Left_share                   float32
Far_share                        float32
Turnout                          float32
F0Far_Incumbent                  float32
left                             float32
id                               float32
dtype: object

---
### Question 7

We had to discard some information when reading the dataset. Luckily, we can access all
of it using a low-level `pd.io.stata.StataReader` object. This will allow us to look at all
meta information that is stored with the dataset in Stata format.

Create `data_info` that contains the `pd.io.stata.StataReader` object. Use the `data_file` object for
doing so.

In [12]:
data_info = pd.io.stata.StataReader(data_file)
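To experiment with the reader without the replication data, a toy round trip may help. This is only a sketch: the data frame, the file name `toy.dta`, and the labels are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data; values and labels are invented for this illustration.
toy = pd.DataFrame({"cid": [1, 1, 4], "Year": [1980, 1981, 1980]})

with tempfile.TemporaryDirectory() as tmp:
    toy_file = Path(tmp) / "toy.dta"
    # Write a Stata file with a dataset label and one variable label.
    toy.to_stata(
        toy_file,
        write_index=False,
        data_label="Toy elections",
        variable_labels={"cid": "group(Country)"},
    )
    # Using the reader as a context manager closes the file afterwards.
    with pd.io.stata.StataReader(toy_file) as reader:
        toy_variable_labels = reader.variable_labels()
        toy_data_label = reader.data_label

print(toy_variable_labels)
print(toy_data_label)
```

Columns without a label come back with an empty string, which matches what we see for `Country` and `Year` in the real dataset.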

Look at the various label-related attributes and methods of the `pd.io.stata.StataReader`
object. Explain what they do. Can you now explain why the error occurred when reading the dataset?

In [13]:
data_info.variable_labels()

{'Country': '',
 'cid': 'group(Country)',
 'Nuts_id': 'Nuts ID',
 'Name': 'Nuts Name',
 'Year': '',
 'ElectionType': 'Type of Election(s) occurring in a specific year',
 'EligibleVoters': 'Number of total voters',
 'Valid': 'Number of total valid votes',
 'HHI': 'Effective Number of Parties',
 'Far_Right': 'Total number of votes in far right parties',
 'Far_Left': 'Total number of votes in far left parties',
 'Far_Right_share': '',
 'Far_Left_share': '',
 'Far_share': '',
 'Turnout': 'Share of Total Votes over Eligible Voters',
 'F0Far_Incumbent': 'Number of votes of far parties part of the government',
 'left': "1 if the incumbent government's prime-minister belongs to a left-leaning party",
 'id': 'group(Nuts_id)'}

In [14]:
data_info.value_labels()

{'ElectionType': {1: 'National',
  2: 'European',
  3: 'Regional',
  4: 'National + European',
  5: 'National + Regional',
  6: 'European + Regional + European',
  7: 'National + Regional',
  8: '2 National',
  9: '2 Regional',
  10: 'European + 2 Regional'}}

In [15]:
data_info.data_label

''


- `data_label` is a string describing the dataset. In this case, it is empty.

- `variable_labels` is a dictionary mapping the column names we have in our dataset to verbose descriptions.

- `value_labels` is a nested dictionary with column names as keys on the outer level.
  The inner dictionaries map the numeric values in the dataset to verbose descriptions.
  It is filled only for `ElectionType`. This is very much how `pd.Categorical` is
  stored internally, although we cannot see the numerical values in that case (which is
  a good thing! Far too easy to run numerical calculations with unordered categorical
  variables in Stata accidentally).

- The error occurred because the `National + Regional` label is repeated for two
  different values of `ElectionType` (5 and 7).
    - While `value_labels` are supposed to be a bijection, Stata does not enforce this.
    - Pandas checks it when converting the `value_labels` dictionary to a
      `pd.Categorical` object with the value labels as categories. Since categories
      need to be unique, 5 and 7 would be merged and information would be lost. Hence
      the error.
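The uniqueness requirement can be reproduced in isolation. The sketch below uses a hand-copied subset of the value labels shown above; it does not read the Stata file and is not the code path `pd.read_stata` takes internally, only an illustration of the same constraint.

```python
from collections import Counter

import pandas as pd

# Subset of the value labels from the output above, typed in by hand.
value_labels = {
    1: "National",
    5: "National + Regional",
    7: "National + Regional",
}

# Find labels that are attached to more than one numeric code.
label_counts = Counter(value_labels.values())
duplicated = [label for label, n in label_counts.items() if n > 1]
print(duplicated)

# Categories must be unique, so pandas refuses to build the dtype.
try:
    pd.CategoricalDtype(categories=list(value_labels.values()))
    unique_ok = True
except ValueError as err:
    unique_ok = False
    print(err)
```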


---
### Question 8

Look at the structure of `data` again. Would you keep all of

- `Country` and `cid`?
- `Nuts_id`,  `Name`, and `id`?

Why or why not?



- No reason to keep `cid`. Countries are very few and we have the country names. Most
  certainly, these are no official country codes, so we won't need them for merging.

- `Nuts_id` is a unique identifier for the regions, and the codes look quite official.
  - While we do have the names, they may easily differ across datasets (unicode characters,
  parentheses, ...) or not be present in all datasets.

- `id` looks like a numerical code for `Nuts_id`, but we had better verify.

---
### Bonus: Question 9 (only if you have some time left)

This question is optional and is meant to justify why we can drop `cid` and `id`.

Make sure that we can safely drop `cid` and `id` from the dataset by finding out the unique combinations of the sets of variables from the previous question.

> Note: The `.unique()` method only works on Series, you'll need to use `.drop_duplicates()`.

In [16]:

data[["Country", "cid"]].drop_duplicates()

Unnamed: 0,Country,cid
0,Austria,1.0
324,Germany,4.0
1692,Spain,7.0
2304,Finland,2.0
2448,France,3.0
3240,Italy,5.0
3996,Portugal,6.0
4248,Sweden,8.0


In [17]:
(
    len(data[["Country", "cid"]].drop_duplicates()) # unique country/cid pairs
    == len(data[["Country"]].drop_duplicates()) # unique countries
    == len(data[["cid"]].drop_duplicates()) # unique cids
)

True

In [18]:
# Unique Nuts_id/Name/id combinations: the Nuts_id and id belonging to each region
data[["Nuts_id", "Name", "id"]].drop_duplicates()

Unnamed: 0,Nuts_id,Name,id
0,AT11,Burgenland (AT),1.0
36,AT12,Niederösterreich,2.0
72,AT13,Wien,3.0
108,AT21,Kärnten,4.0
144,AT22,Steiermark,5.0
...,...,...,...
4356,SE22,Sydsverige,122.0
4392,SE23,Västsverige,123.0
4428,SE31,Norra Mellansverige,124.0
4464,SE32,Mellersta Norrland,125.0


In [19]:
(
    len(data[["Nuts_id", "Name", "id"]].drop_duplicates()) # unique nuts_id/name/id combinations
    == len(data["Nuts_id"].drop_duplicates()) # unique nuts_id
    == len(data[["Name"]].drop_duplicates()) # unique names
    == len(data[["id"]].drop_duplicates()) # unique ids
)

True
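The comparison of lengths used above can be wrapped in a small helper. The function name `is_one_to_one` and the toy data are hypothetical; this is a sketch of the same idea, not a replacement for looking at the data.

```python
import pandas as pd


def is_one_to_one(df: pd.DataFrame, columns: list[str]) -> bool:
    """Check that each column has as many unique values as there are unique rows."""
    n_combinations = len(df[columns].drop_duplicates())
    return all(df[col].nunique(dropna=False) == n_combinations for col in columns)


# Toy data in which each country has exactly one code.
toy = pd.DataFrame(
    {
        "Country": ["Austria", "Austria", "Germany", "Germany"],
        "cid": [1.0, 1.0, 4.0, 4.0],
    }
)
toy_ok = is_one_to_one(toy, ["Country", "cid"])

# Same data, but Austria now appears with two different codes.
broken = toy.assign(cid=[1.0, 2.0, 4.0, 4.0])
broken_ok = is_one_to_one(broken, ["Country", "cid"])

print(toy_ok, broken_ok)
```

If the mapping is one-to-one, the number of unique combinations equals the number of unique values in every single column; any mismatch means one code maps to several names or vice versa.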

---
### Question 10

Drop `cid` and `id` from the dataset, which you should continue to store in the variable
`data`. Verify that the columns are gone.

To do so, you can either use the `drop(columns=[...])` method or select those columns
that you want to keep (there are even more options). Can you think of reasons for each
strategy, particularly in an interactive setting like a notebook?

In [20]:
data = data.drop(columns=["cid", "id"])
data.shape

(4536, 16)


- `drop()` is somewhat more explicit, particularly when the list of columns to be kept
  is long
- Selecting the columns to be kept makes it possible to execute the cell again without
  error (though non-linear execution is somewhat dangerous)

No clear winner here.
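The difference in re-execution behaviour can be seen on a toy frame. The data below are made up; the point is only that dropping the same columns twice raises a `KeyError`, whereas selecting the columns to keep can be repeated.

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["Austria"], "cid": [1.0], "id": [1.0]})

# First execution: works.
dropped = toy.drop(columns=["cid", "id"])

# Second execution on the already-reduced frame: the columns are gone.
try:
    dropped.drop(columns=["cid", "id"])
    second_drop_ok = True
except KeyError:
    second_drop_ok = False

# Selecting the columns to keep gives the same result every time.
selected = toy[["Country"]]
selected_again = selected[["Country"]]

print(second_drop_ok, list(selected_again.columns))
```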

---
### Question 11

Give the columns (more) sensible names using the `lowercase_with_underscores`
convention. You should continue to store the dataset in the variable `data`. 

> Note: While trying out, you don't want to assign to `data` yet because once you change
> the column names, you cannot access the old ones anymore. That will eventually be 
> fine, but not in the interactive setting yet.

> Tip: You can just copy the output of the `variable_labels()` call above to get
> started.

In [21]:
data = data.rename(
    columns={
        "Country": "country",
        "Nuts_id": "nuts_id",
        "Name": "nuts_name",
        "Year": "year",
        "ElectionType": "election_type",
        "EligibleVoters": "number_eligible_voters",
        "Valid": "number_valid_votes",
        "HHI": "number_parties_effective",
        "Far_Right": "number_votes_far_right",
        "Far_Left": "number_votes_far_left",
        "Far_Right_share": "share_votes_far_right",
        "Far_Left_share": "share_votes_far_left",
        "Far_share": "share_votes_far_any",
        "Turnout": "share_voter_turnout",
        "F0Far_Incumbent": "number_votes_far_any_incumbent",
        "left": "pm_party_left",
    },
)
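One thing to watch out for with long renaming dictionaries: by default, `rename` silently ignores keys that do not match any column. The sketch below, on made-up toy data with a deliberate typo (`"Nuts_ID"`), shows how `errors="raise"` surfaces such mistakes.

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["Austria"], "Nuts_id": ["AT11"]})
mapping_with_typo = {"Country": "country", "Nuts_ID": "nuts_id"}

# Default behaviour: the misspelled key is ignored, the column keeps its name.
silently = toy.rename(columns=mapping_with_typo)
print(list(silently.columns))

# With errors="raise", pandas complains about the unknown key.
try:
    toy.rename(columns=mapping_with_typo, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)
```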

---
### Question 12

Convert all variables to sensible data types. In some cases, you may want to keep the column names, in some cases you may want to generate new ones. Briefly explain all of your choices.

#### `"country", "nuts_id", "nuts_name"` to `pd.CategoricalDtype()`

This can be done in one step because the variables already have sensible labels.

In [22]:
for col in "country", "nuts_id", "nuts_name":
    data[col] = data[col].astype(pd.CategoricalDtype())

data["nuts_id"]

0       AT11
1       AT11
2       AT11
3       AT11
4       AT11
        ... 
4531    SE33
4532    SE33
4533    SE33
4534    SE33
4535    SE33
Name: nuts_id, Length: 4536, dtype: category
Categories (126, string): [AT11, AT12, AT13, AT21, ..., SE23, SE31, SE32, SE33]


- All three columns already have sensible names and sensible values
- We just need to convert to `pd.CategoricalDtype()`
- We verify for one of the three

#### `year` to a `pd.Int16Dtype()`

In [23]:
data["year"] = data["year"].astype(pd.Int16Dtype())
data["year"]

0       1980
1       1981
2       1982
3       1983
4       1984
        ... 
4531    2011
4532    2012
4533    2013
4534    2014
4535    2015
Name: year, Length: 4536, dtype: Int16

#### `election_type` to a `pd.CategoricalDtype()`

The numeric codes are not meaningful by themselves, but we can use the `value_labels()`
method of the `pd.io.stata.StataReader` object to get the labels.

1. Create `election_cats`, a dictionary with the labels from `value_labels()`.
2. Replace the duplicated label with `National + Regional A` and `National + Regional B`.
3. Convert `election_type` first to a `pd.Int8Dtype()` and then to a `pd.CategoricalDtype()`.
4. Finally, use the `election_cats` dictionary to rename the categories to the
   labels.

In [24]:
election_cats = data_info.value_labels()["ElectionType"].copy()
for i, to_append in (5, " A"), (7, " B"):
    election_cats[i] += to_append
election_cats

{1: 'National',
 2: 'European',
 3: 'Regional',
 4: 'National + European',
 5: 'National + Regional A',
 6: 'European + Regional + European',
 7: 'National + Regional B',
 8: '2 National',
 9: '2 Regional',
 10: 'European + 2 Regional'}

In [25]:
data["election_type"]

0       NaN
1       NaN
2       3.0
3       1.0
4       NaN
       ... 
4531    NaN
4532    NaN
4533    NaN
4534    7.0
4535    NaN
Name: election_type, Length: 4536, dtype: float32

In [26]:
election_type = (
    data["election_type"].astype(pd.Int8Dtype()).astype(pd.CategoricalDtype())
)
election_type = election_type.cat.rename_categories(election_cats)
data["election_type"] = election_type
data["election_type"]

0                         NaN
1                         NaN
2                    Regional
3                    National
4                         NaN
                ...          
4531                      NaN
4532                      NaN
4533                      NaN
4534    National + Regional B
4535                      NaN
Name: election_type, Length: 4536, dtype: category
Categories (7, string): [National, European, Regional, National + European, National + Regional A, European + Regional + European, National + Regional B]


- First get Stata's `value_labels` and fix the duplicate. The copy is important, else
  we would modify the original dictionary. Note that it is important to make the copy
  of the inner dictionary, not the outer one.
- Election type first needs to be converted to a nullable integer data type to match
  the contents. Which length we use does not matter because this is only temporary.
- Then we can convert to `category`. At that point, we can verify that the categories
  are {1, 2, ..., 7}.
- Then we call `.rename_categories()` with the fixed dictionary and verify the result.
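The point about copying the inner dictionary can be illustrated with plain Python. The nested dictionary below is a hand-typed stand-in for the reader's value labels, not the real object.

```python
# Hand-typed stand-in for the nested value labels dictionary.
labels = {"ElectionType": {5: "National + Regional", 7: "National + Regional"}}

# A copy of the OUTER dictionary still shares the inner one ...
outer_copy = labels.copy()
outer_copy["ElectionType"][5] += " A"
print(labels["ElectionType"][5])  # the original has changed as well

# ... whereas a copy of the INNER dictionary is independent.
inner_copy = labels["ElectionType"].copy()
inner_copy[7] += " B"
print(labels["ElectionType"][7])  # the original is untouched
```

`dict.copy()` is shallow, so only the level that is copied explicitly becomes independent of the original.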

#### Round and convert all columns starting with `number` to `pd.UInt32Dtype()` (except for `number_parties_effective`)

In [27]:
for col in data.columns:
    if col.startswith("number_") and col != "number_parties_effective":
        data[col] = data[col].round().astype(pd.UInt32Dtype())
data["number_eligible_voters"]

0        192225
1        192225
2        198000
3        197459
4        197459
         ...   
4531     803062
4532     803062
4533     803062
4534    1211834
4535    1211834
Name: number_eligible_voters, Length: 4536, dtype: UInt32


- We can loop over the columns since we have given them sensible names, just need to be
  careful with the Herfindahl-Hirschman index (could debate that choice of column name)
- Apparently some of the values are numerically too far away from integers so that
  pandas complains. We need to explicitly round first. In cases where it would be
  crucial to get the correct integer, we may want to investigate deeper if that happens
  (e.g., we expect integers 1 to 7, but may have a 1.4 in there). Here, it does not
  matter given the size of the electorates and likely measurement error.
- We use `UInt32`, but it really does not matter as long as we can represent all
  numbers. Seems like a good idea to disallow negative numbers as a safety measure.
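Why the explicit `round()` is needed can be sketched on a toy series. The numbers are made up: one value is deliberately not an integer, and one is missing.

```python
import numpy as np
import pandas as pd

# Made-up vote counts: the second value is not a whole number.
votes = pd.Series([192225.0, 197459.4, np.nan])

# A direct cast is refused because 197459.4 has no exact integer equivalent.
try:
    votes.astype(pd.UInt32Dtype())
    direct_cast_ok = True
except (TypeError, ValueError):
    direct_cast_ok = False

# Rounding first makes the cast safe; the missing value becomes <NA>.
rounded = votes.round().astype(pd.UInt32Dtype())

print(direct_cast_ok)
print(rounded)
```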

#### `pm_party_left` to `pd.CategoricalDtype()`
- Convert it first to a `pd.Int8Dtype()` and then to a `pd.CategoricalDtype()`
- You can then rename the categories to meaningful labels

In [28]:
pm_party_orientation = (
    data["pm_party_left"].astype(pd.Int8Dtype()).astype(pd.CategoricalDtype())
)
pm_party_orientation = pm_party_orientation.cat.rename_categories(
    {0: "Other", 1: "Left-leaning"},
)
data["pm_party_orientation"] = pm_party_orientation
data = data.drop(columns="pm_party_left")
pm_party_orientation

0       Left-leaning
1       Left-leaning
2       Left-leaning
3       Left-leaning
4       Left-leaning
            ...     
4531           Other
4532           Other
4533           Other
4534    Left-leaning
4535    Left-leaning
Name: pm_party_left, Length: 4536, dtype: category
Categories (2, string): [Other, Left-leaning]


- Even though we often say "X is a dummy variable for whatever", a categorical
  variable is usually the better representation.
- This will make plotting etc. easier and more consistent.
- Decent statistical packages handle these out-of-the-box, too.
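One practical consequence of the categorical representation, sketched on a made-up dummy series: arithmetic that would silently "work" on 0/1 floats is refused once the variable is categorical.

```python
import numpy as np
import pandas as pd

# Made-up dummy variable with one missing value.
dummy = pd.Series([1.0, 0.0, np.nan, 1.0])

# As a float, the dummy happily takes part in numerical calculations.
print(dummy.mean())

# Same conversion steps as in the solution above.
orientation = (
    dummy.astype(pd.Int8Dtype())
    .astype(pd.CategoricalDtype())
    .cat.rename_categories({0: "Other", 1: "Left-leaning"})
)
print(orientation.value_counts())

# A mean of labels is not meaningful, and pandas raises accordingly.
try:
    orientation.mean()
    mean_allowed = True
except TypeError:
    mean_allowed = False
print(mean_allowed)
```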

---
### Question 13

Summarise the cleaned data in a similar way as above. Now also look at summary
statistics, including value counts of categorical variables. Do you notice anything
when doing the latter? If so, will you have to be careful when interpreting descriptive
statistics?

In [29]:
data.columns

Index(['country', 'nuts_id', 'nuts_name', 'year', 'election_type',
       'number_eligible_voters', 'number_valid_votes',
       'number_parties_effective', 'number_votes_far_right',
       'number_votes_far_left', 'share_votes_far_right',
       'share_votes_far_left', 'share_votes_far_any', 'share_voter_turnout',
       'number_votes_far_any_incumbent', 'pm_party_orientation'],
      dtype='string')

In [30]:
data.shape

(4536, 16)

In [31]:
data

Unnamed: 0,country,nuts_id,nuts_name,year,election_type,number_eligible_voters,number_valid_votes,number_parties_effective,number_votes_far_right,number_votes_far_left,share_votes_far_right,share_votes_far_left,share_votes_far_any,share_voter_turnout,number_votes_far_any_incumbent,pm_party_orientation
0,Austria,AT11,Burgenland (AT),1980,,192225,180717,0.474132,4941,694,0.027341,0.003840,0.031181,0.949138,0,Left-leaning
1,Austria,AT11,Burgenland (AT),1981,,192225,180717,0.474132,4941,694,0.027341,0.003840,0.031181,0.949138,0,Left-leaning
2,Austria,AT11,Burgenland (AT),1982,Regional,198000,173691,0.469377,5559,942,0.032005,0.005423,0.037429,0.900672,0,Left-leaning
3,Austria,AT11,Burgenland (AT),1983,National,197459,184704,0.460471,4090,543,0.022144,0.002940,0.025083,0.947716,4090,Left-leaning
4,Austria,AT11,Burgenland (AT),1984,,197459,184704,0.460471,4090,543,0.022144,0.002940,0.025083,0.947716,4090,Left-leaning
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4531,Sweden,SE33,Övre Norrland,2011,,803062,660666,0.255184,16251,68424,0.024598,0.103568,0.128166,0.835183,0,Other
4532,Sweden,SE33,Övre Norrland,2012,,803062,660666,0.255184,16251,68424,0.024598,0.103568,0.128166,0.835183,0,Other
4533,Sweden,SE33,Övre Norrland,2013,,803062,660666,0.255184,16251,68424,0.024598,0.103568,0.128166,0.835183,0,Other
4534,Sweden,SE33,Övre Norrland,2014,National + Regional B,1211834,877980,0.213000,59626,94202,0.067913,0.107294,0.175207,0.734514,0,Left-leaning


In [32]:
data.dtypes

country                           category
nuts_id                           category
nuts_name                         category
year                                 Int16
election_type                     category
number_eligible_voters              UInt32
number_valid_votes                  UInt32
number_parties_effective           float32
number_votes_far_right              UInt32
number_votes_far_left               UInt32
share_votes_far_right              float32
share_votes_far_left               float32
share_votes_far_any                float32
share_voter_turnout                float32
number_votes_far_any_incumbent      UInt32
pm_party_orientation              category
dtype: object

In [33]:
data.describe()

Unnamed: 0,year,number_eligible_voters,number_valid_votes,number_parties_effective,number_votes_far_right,number_votes_far_left,share_votes_far_right,share_votes_far_left,share_votes_far_any,share_voter_turnout,number_votes_far_any_incumbent
count,4536.0,4456.0,4456.0,4456.0,4456.0,4456.0,4456.0,4456.0,4456.0,4445.0,4536.0
mean,1997.5,2141666.067101,1410316.829892,0.275324,81547.823833,108467.507406,0.064704,0.076852,0.141555,0.677067,21154.655423
std,10.38944,1945254.6358,1299094.071993,0.086081,131807.248451,170331.202597,0.07829,0.076002,0.107653,0.143562,78392.97134
min,1980.0,0.0,40957.0,0.089534,0.0,0.0,0.0,0.0,0.0,0.180519,0.0
25%,1988.75,909974.0,562222.0,0.208878,5784.0,9675.0,0.006395,0.012246,0.057848,0.575033,0.0
50%,1997.5,1525082.0,1006852.0,0.267532,32631.5,51129.0,0.032525,0.062298,0.111554,0.684145,0.0
75%,2006.25,2824193.0,1899729.0,0.341251,99195.0,126114.0,0.100929,0.107574,0.212819,0.791886,0.0
max,2015.0,18187513.0,12315054.0,0.555216,1656040.0,2232531.0,0.461377,0.476447,0.711271,0.962936,1651320.0


In [34]:
data["country"].value_counts()

country
Germany     1368
France       792
Italy        756
Spain        612
Austria      324
Sweden       288
Portugal     252
Finland      144
Name: count, dtype: int64

In [35]:
data["nuts_id"].value_counts().unique()

array([36])

In [36]:
data["election_type"].value_counts()

election_type
National                          763
Regional                          537
European                          528
National + Regional A             252
National + European               116
European + Regional + European    116
National + Regional B              50
Name: count, dtype: int64

In [37]:
data["pm_party_orientation"].value_counts()

pm_party_orientation
Left-leaning    2238
Other           2234
Name: count, dtype: int64


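- Value counts for NUTS ids show that we have a balanced panel (36 observations per region).
- Apparently, the results of the last election are carried forward to the years without
  an election; see the query below.
- We need to be extremely careful when computing descriptive statistics:
  `election_type` is missing in non-election years, but all other variables are filled.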

In [38]:
data.query("nuts_id == 'SE33'")

Unnamed: 0,country,nuts_id,nuts_name,year,election_type,number_eligible_voters,number_valid_votes,number_parties_effective,number_votes_far_right,number_votes_far_left,share_votes_far_right,share_votes_far_left,share_votes_far_any,share_voter_turnout,number_votes_far_any_incumbent,pm_party_orientation
4500,Sweden,SE33,Övre Norrland,1980,,749207,664915,0.313727,0,57649,0.0,0.086701,0.086701,0.898924,0,Other
4501,Sweden,SE33,Övre Norrland,1981,,749207,664915,0.313727,0,57649,0.0,0.086701,0.086701,0.898924,0,Other
4502,Sweden,SE33,Övre Norrland,1982,National + Regional A,763039,679257,0.339119,0,53172,0.0,0.07828,0.07828,0.901354,0,Left-leaning
4503,Sweden,SE33,Övre Norrland,1983,,763039,679257,0.339119,0,53172,0.0,0.07828,0.07828,0.901354,0,Left-leaning
4504,Sweden,SE33,Övre Norrland,1984,,763039,679257,0.339119,0,53172,0.0,0.07828,0.07828,0.901354,0,Left-leaning
4505,Sweden,SE33,Övre Norrland,1985,National + Regional A,771411,674180,0.32775,0,53542,0.0,0.079418,0.079418,0.886132,0,Left-leaning
4506,Sweden,SE33,Övre Norrland,1986,,771411,674180,0.32775,0,53542,0.0,0.079418,0.079418,0.886132,0,Left-leaning
4507,Sweden,SE33,Övre Norrland,1987,,771411,674180,0.32775,0,53542,0.0,0.079418,0.079418,0.886132,0,Left-leaning
4508,Sweden,SE33,Övre Norrland,1988,National + Regional A,773096,647397,0.319311,0,51009,0.0,0.078791,0.078791,0.850112,0,Left-leaning
4509,Sweden,SE33,Övre Norrland,1989,,773096,647397,0.319311,0,51009,0.0,0.078791,0.078791,0.850112,0,Left-leaning


---
### Question 14

Make three plots of mean vote shares by year, across all NUTS regions and elections.
- far right share
- far left share
- far share (any)

Don't worry about whether these plots make a lot of sense (i.e., any weighting with
electorate size or the like).

In [39]:
only_elections = data[data["election_type"].notna()]
only_elections.shape

(2362, 16)

In [40]:
only_elections.groupby("year")["share_votes_far_right"].mean().plot()

In [41]:
only_elections.groupby("year")["share_votes_far_left"].mean().plot()

In [42]:

only_elections.groupby("year")["share_votes_far_any"].mean().plot()
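For reference, here is a sketch of what the unweighted yearly mean computes and how a version weighted by valid votes would differ. The exercise does not ask for the weighted version; the toy numbers are made up, and only the column names follow the cleaned dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "year": [1980, 1980, 1984, 1984],
        "share_votes_far_any": [0.10, 0.30, 0.20, 0.40],
        "number_valid_votes": [100, 300, 100, 100],
    }
)

# Unweighted: every region counts the same within a year.
unweighted = toy.groupby("year")["share_votes_far_any"].mean()

# Weighted: total far votes divided by total valid votes within a year.
far_votes = toy["share_votes_far_any"] * toy["number_valid_votes"]
weighted = (
    far_votes.groupby(toy["year"]).sum()
    / toy.groupby("year")["number_valid_votes"].sum()
)

print(unweighted)
print(weighted)
```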

---
### Question 15

Make three scatterplots of vote shares by year, across all NUTS regions and elections.

- far right share
- far left share
- far share (any)

Colour the dots using the country.

In [43]:
only_elections.plot.scatter(x="year", y="share_votes_far_right", color="country")

In [44]:
only_elections.plot.scatter(x="year", y="share_votes_far_left", color="country")

In [45]:
only_elections.plot.scatter(x="year", y="share_votes_far_any", color="country")