In [None]:
%matplotlib inline

import numpy as np
import pandas as pd

# Dataset: Studienanfänger: Bundesländer, Semester, Nationalität,Geschlecht

Source: Statistisches Bundesamt 

License: [Data licence Germany – attribution – Version 2.0](http://www.govdata.de/dl-de/by-2-0)

URL: https://www-genesis.destatis.de/genesis/downloads/00/tables/21311-0014_00.csv

URI: https://www.govdata.de/web/guest/suchen/-/details/studienanfanger-bundeslander-semester-nationalitatgeschlecht

## Information

### What is contained in the dataset?

The dataset contains the number of first-year students per winter semester in all 16 federal states from 1998 to 2021. The counts are split by gender (male and female) and by nationality (German and other).

### Encoding

The file is encoded in *ISO-8859-1* (sometimes referred to as *Latin 1*). Refer to this [list of Python standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings).
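
Why the encoding matters can be seen with a small sketch. It uses only the standard library and part of the dataset title as sample text; the variable names are illustrative.

```python
# The dataset title contains umlauts, which ISO-8859-1 stores as single bytes.
title = "Studienanfänger: Bundesländer"
raw = title.encode("iso-8859-1")

# Decoding with the matching codec restores the original text.
decoded = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails: the byte for "ä" (0xE4) is not
# followed by valid UTF-8 continuation bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, utf8_ok)
```

This is why the encoding should be passed explicitly when reading the file.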

### Format

From the URL we infer that the dataset is provided as a CSV file.

### File header

This is the header of the CSV file:

```
GENESIS-Tabelle: 21311-0014
Studienanfänger: Bundesländer, Semester, Nationalität,;;;;;;;;;;
Geschlecht;;;;;;;;;;
Statistik der Studenten;;;;;;;;;;
Studienanfänger (Anzahl);;;;;;;;;;
;;Deutsche;Deutsche;Deutsche;Ausländer;Ausländer;Ausländer;Insgesamt;Insgesamt;Insgesamt
;;männlich;weiblich;Insgesamt;männlich;weiblich;Insgesamt;männlich;weiblich;Insgesamt
```

## Tasks

### `(A)` Read in the dataframe

The CSV file does not provide usable column names. We suggest using the following ones:

|                Content of column(s)                 |             Column title in DataFrame              |
|:---------------------------------------------------:|:--------------------------------------------------:|
| Federal state                                       | `"federal_state"`                                  |
| Winter semester                                     | `"winter_semester"`                                |
| Count for females and males with German nationality | `"count_german_female"` and `"count_german_male"`  |
| Count for females and males with other nationality  | `"count_other_female"` and `"count_other_male"`    |

*Hint*: Have a look at the *end* of the dataframe. There might be some rows that have to be removed before you can work with the data.
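
The following is a minimal sketch of one possible import, run on a small in-memory stand-in for the file. The seven header lines are copied from the excerpt above. The two data rows, the two footer lines, and the `total_*` column names are made up for illustration, so the real file may need different `skiprows` or footer handling.

```python
import io

import pandas as pd

# Toy stand-in: real header excerpt, invented data rows and footer lines.
toy_csv = "\n".join([
    "GENESIS-Tabelle: 21311-0014",
    "Studienanfänger: Bundesländer, Semester, Nationalität,;;;;;;;;;;",
    "Geschlecht;;;;;;;;;;",
    "Statistik der Studenten;;;;;;;;;;",
    "Studienanfänger (Anzahl);;;;;;;;;;",
    ";;Deutsche;Deutsche;Deutsche;Ausländer;Ausländer;Ausländer;Insgesamt;Insgesamt;Insgesamt",
    ";;männlich;weiblich;Insgesamt;männlich;weiblich;Insgesamt;männlich;weiblich;Insgesamt",
    "State A;WS 1998/99;10;12;22;3;4;7;13;16;29",
    "State B;WS 1998/99;20;18;38;5;6;11;25;24;49",
    "__________",
    "(hypothetical footer line)",
])

# The header lists male before female; the total_* names are hypothetical.
names = [
    "federal_state", "winter_semester",
    "count_german_male", "count_german_female", "total_german",
    "count_other_male", "count_other_female", "total_other",
    "total_male", "total_female", "total",
]
keep = [
    "federal_state", "winter_semester",
    "count_german_female", "count_german_male",
    "count_other_female", "count_other_male",
]

raw_df = pd.read_csv(
    io.BytesIO(toy_csv.encode("iso-8859-1")),
    sep=";",
    encoding="iso-8859-1",
    skiprows=7,      # skip the seven header lines of the excerpt
    header=None,
    names=names,
)

# Footer lines have no semester value, so they can be dropped via NaN.
toy_df = (
    raw_df.dropna(subset=["winter_semester"])
    .loc[:, keep]
    .astype({col: "int64" for col in keep[2:]})
    .reset_index(drop=True)
)
print(toy_df)
```

Dropping rows with a missing semester is only one way to remove trailing non-data rows. Inspect `df.tail()` on the real file before deciding how to remove them.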

In [None]:
...

The following cells will help you check if the file has been imported as intended.

In [None]:
df.tail(10)

In [None]:
df.head(10)

### `(T)` Convert to long format

Convert the `DataFrame` to long format (*Hint*: have a look at the documentation of the [`pd.melt`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) function). The result should have the following column labels:
* `"federal_state"` and `"winter_semester"`. These are the same columns as in the original `DataFrame`.
* `"count"`: Number of first-year students of a certain nationality (German or other) and gender (female or male).
* `"nationality_and_gender"`: Column with values being one of `"count_[german|other]_[female|male]"`.
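
As an illustration of the wide-to-long reshaping, here is a small sketch using `DataFrame.melt` on made-up numbers with only two count columns. It is not the full solution for the dataset.

```python
import pandas as pd

# Toy wide table with invented values.
wide = pd.DataFrame({
    "federal_state": ["State A", "State B"],
    "winter_semester": ["WS 1998/99", "WS 1998/99"],
    "count_german_female": [12, 18],
    "count_german_male": [10, 20],
})

# id_vars stay as they are; all remaining columns are stacked into two
# columns: the former column name and the corresponding value.
long = wide.melt(
    id_vars=["federal_state", "winter_semester"],
    var_name="nationality_and_gender",
    value_name="count",
)
print(long)
```

Each original row contributes one long-format row per count column, so two rows and two count columns yield four rows.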

In [None]:
...

In [None]:
df.head()

### `(A)` Add new columns 

Use the content of the `"nationality_and_gender"` column to generate columns: 
* `"nationality"`: Column with values either being "german" or "other".
* `"gender"`: Column with values either being "female" or "male".
   
Finally, the column `"nationality_and_gender"` should be removed from the `DataFrame`. 
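
One possible approach, sketched on made-up rows, is to split the strings at the underscore with `Series.str.split`. The sketch assumes the values follow the pattern `count_<nationality>_<gender>`.

```python
import pandas as pd

# Toy long-format rows with invented values.
toy = pd.DataFrame({
    "nationality_and_gender": ["count_german_female", "count_other_male"],
    "count": [12, 3],
})

# expand=True returns one column per piece: 0 -> "count",
# 1 -> nationality, 2 -> gender.
parts = toy["nationality_and_gender"].str.split("_", expand=True)
toy["nationality"] = parts[1]
toy["gender"] = parts[2]

# The combined column is no longer needed.
toy = toy.drop(columns="nationality_and_gender")
print(toy)
```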

In [None]:
...

In [None]:
df.head()

### `(R)` Data types

Change the data types of the columns as follows:
* `"federal_state"`: category
* `"winter_semester"`: category
* `"count"`: unsigned 16-bit integer
* `"nationality"`: category
* `"gender"`: category
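
A short sketch of such a conversion with `DataFrame.astype` on made-up rows follows. Note that an unsigned 16-bit integer holds values from 0 to 65535, so this type only fits if no count exceeds that range.

```python
import pandas as pd

# Toy rows with invented values.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "count": [12, 10, 18],
})

# A dict maps each column name to its target dtype.
toy = toy.astype({"gender": "category", "count": "uint16"})
print(toy.dtypes)
```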

In [None]:
...

In [None]:
df.info()

### `(A)` Visualization

Plot the count of female and male students with German nationality as well as the count of female and male students with other nationality for each winter semester (aggregated over all federal states). Choose a type of plot that makes it easy to compare the counts of female and male students.

Things to consider when making the plot:
* Use Pandas' plotting capabilities.
* Use axis labels, a legend, and a title.


*Hint*: You can either use the [`pivot_table`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html) or the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method and perform the necessary aggregations.
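
To illustrate the aggregation step, here is a sketch with `pivot_table` on made-up long-format data. It sums over the federal states and, for brevity, distinguishes only gender, not nationality.

```python
import pandas as pd

# Toy long-format data with invented values.
toy = pd.DataFrame({
    "federal_state": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "winter_semester": ["WS 1998/99"] * 4 + ["WS 1999/00"] * 4,
    "gender": ["female", "male"] * 4,
    "count": [12, 10, 18, 20, 14, 11, 19, 21],
})

# One row per semester, one column per gender, summed over all states.
per_semester = toy.pivot_table(
    index="winter_semester",
    columns="gender",
    values="count",
    aggfunc="sum",
)
print(per_semester)
```

A table shaped like this can be passed on to Pandas' plotting interface, for example `per_semester.plot(kind="bar")`, which draws one group of bars per semester.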


In [None]:
...

### `(A)` Aggregation

What is the total percentage of female students in each year?
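
The idea can be sketched on made-up data as follows: sum the counts per semester once for all students and once for female students only, then divide. The column names follow the long format used above.

```python
import pandas as pd

# Toy long-format data with invented values.
toy = pd.DataFrame({
    "winter_semester": ["WS 1998/99"] * 2 + ["WS 1999/00"] * 2,
    "gender": ["female", "male"] * 2,
    "count": [30, 30, 30, 70],
})

# Total and female-only counts per semester share the same index,
# so the division aligns row by row.
totals = toy.groupby("winter_semester")["count"].sum()
female = toy[toy["gender"] == "female"].groupby("winter_semester")["count"].sum()
pct_female = 100 * female / totals
print(pct_female)
```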

In [None]:
...

In [None]:
...

Which German federal state has the largest count of female students, and which has the smallest?
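
A sketch of one way to answer this kind of question, again on made-up data: aggregate the female counts per state, then use `idxmax` and `idxmin` to get the index labels of the extremes.

```python
import pandas as pd

# Toy long-format data with invented values.
toy = pd.DataFrame({
    "federal_state": ["A", "A", "B", "B", "C", "C"],
    "gender": ["female", "male"] * 3,
    "count": [12, 10, 18, 20, 7, 9],
})

# Sum the female counts per state.
female_by_state = (
    toy[toy["gender"] == "female"].groupby("federal_state")["count"].sum()
)

# idxmax / idxmin return the index label (here: the state) of the extreme.
largest = female_by_state.idxmax()
smallest = female_by_state.idxmin()
print(largest, smallest)
```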

In [None]:
...

In [None]:
...