In [None]:
!ls datasets/kontali

In [13]:
import pandas as pd
from pathlib import Path

dirpath = Path("datasets/kontali")

# Currency

### Read the file `currency.csv` into a pandas DataFrame

Hints:
1. the function to use here is `pd.read_csv`
1. we can specify the path to `currency.csv` with `os.path.join` or `pathlib.Path` from the standard library
1. we must use the keyword argument `encoding="latin-1"`
1. we should also specify two other keywords

In [None]:
df_currency = pd.read_csv(dirpath / "currency.csv", encoding="latin-1", delimiter=";", decimal=",")
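The effect of the `delimiter` and `decimal` keywords can be checked on a small in-memory sample. The sketch below uses made-up data (the column labels and values are assumptions, not taken from `currency.csv`) and only illustrates that decimal commas are parsed as floats:

```python
import io

import pandas as pd

# Hypothetical sample: semicolon-separated fields with decimal commas.
sample = io.StringIO(
    "Dato;1 USD;100 SEK\n"
    "2020-01-02;8,79;94,12\n"
    "2020-01-03;8,82;94,05\n"
)

# delimiter=";" splits the fields, decimal="," turns "8,79" into 8.79
df_sample = pd.read_csv(sample, delimiter=";", decimal=",")
print(df_sample.dtypes)
```

Without `decimal=","` the rate columns would be read as strings (`object` dtype) rather than floats.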

### What is the most recent date for which we have foreign exchange rates?

In [None]:
df_currency.tail()

### Change `Dato` dtype to Datetime, and move `Dato` to the index.

In [None]:
df_currency["Dato"] = pd.to_datetime(df_currency["Dato"])
df_currency.set_index("Dato", inplace=True)
df_currency.head()

### Reshape Currency dataframe

Make a new dataframe with the following criteria:
1. a column called `Ccy` (short for Currency)
2. a column called `FX` (short for Foreign eXchange)

In [None]:
df_currency.dtypes

In [None]:
df_currency.head()

### Extract the unit (1 or 100) and the currency name from the column labels

Hint: You can use the `.str` accessor on `df_currency.columns`. Here is a useful [link](https://pandas.pydata.org/docs/user_guide/text.html#extracting-substrings) on extracting substrings from a string.

The following regular expression pattern can be used to get the unit and currency:

`pattern = r"(\d+) (\w+)"`


#### Explanation of the regular expression pattern (can be skipped)

The `r` prefix before the string indicates that it is a [raw string literal](https://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-prefixes-do-and-what-are-raw-string-literals), which in our case means that the backslash is just treated as a backslash.

`\d+` matches Unicode decimal digits (the plus means "at least one", so we match on "at least one digit").

`\w+` matches Unicode word characters.

The parentheses indicate a "capture group". Each capture group will become a column in the returned dataframe.

So to sum it all up, the regular expression first matches one or more digits, then a single space, then one or more word characters.
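To see the two capture groups in isolation, the pattern can be tried on a single string with the standard `re` module. The label `"100 SEK"` is a hypothetical example of what a column label might look like:

```python
import re

pattern = r"(\d+) (\w+)"

# Hypothetical column label of the form "<unit> <currency>".
match = re.match(pattern, "100 SEK")

# Each pair of parentheses in the pattern yields one captured string.
unit, ccy = match.groups()
print(unit, ccy)
```

Both captured values are strings, which is why the unit needs an explicit conversion to an integer later on.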

In [None]:
pattern = r"(\d+) (\w+)"

df_ccy = df_currency.columns.str.extract(pattern)
df_ccy

Did it work? Great!

Let's add some names on the two captured groups. This will give us **correctly named columns** for our returned dataframe.

Now the pattern is looking really complex, but we don't really have to concern ourselves with that. Having `?P<unit>` inside the capture group just means: *the captured group should be named `unit`*.

`pattern = r"(?P<unit>\d+) (?P<Ccy>\w+)"`

In [None]:
pattern = r"(?P<unit>\d+) (?P<Ccy>\w+)"
df_ccy = df_currency.columns.str.extract(pattern, expand=True)
df_ccy

In [None]:
df_ccy = df_ccy.astype({"unit": "int64"})
df_ccy

In [None]:
df_cur = df_currency.copy()

df_cur.columns = df_ccy.Ccy
df_cur

In [None]:
df_ccy.set_index('Ccy', inplace=True)

In [None]:
df_ccy

In [None]:
df_cur / df_ccy["unit"]

In [None]:
df_cur.div(df_ccy["unit"], axis=1)

In [None]:
df_cur /= df_ccy["unit"]

In [None]:
df_cur
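The division above works because pandas aligns the index of the Series with the column labels of the DataFrame. A minimal sketch with made-up rates (all values here are assumptions) shows that the order of the labels does not matter:

```python
import pandas as pd

# Hypothetical rates, quoted per 1 USD and per 100 SEK.
rates = pd.DataFrame({"USD": [8.0, 9.0], "SEK": [90.0, 95.0]})

# Units per currency, deliberately listed in a different order.
units = pd.Series({"SEK": 100, "USD": 1})

# The Series index is matched against the DataFrame columns by label.
per_unit = rates / units
print(per_unit)
```

A label that appears on only one side would produce a column of `NaN`, so it is worth checking that the labels match exactly.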

# Country

### Read the file `country.csv` into a pandas DataFrame

Hints:
1. the function to use here is `pd.read_csv`
1. we can specify the path to `country.csv` with `os.path.join` or `pathlib.Path` from the standard library
1. we must use the keyword argument `encoding="latin-1"`
1. we should also specify two other keywords

In [14]:
df_country = pd.read_csv(dirpath / "country.csv",
                         encoding="latin-1",
                         delimiter=";",
                         index_col="country_code"
                        )

df_country.head()

country_code,Country,Land,Market,Market_SBSB,Market_SLX
AD,Andorra,Andorra,Other Europe,All others,Other
AE,United Arab Emirates,De forente Arabiske Emirater,Asia,Asia,Other
AF,Afghanistan,Afghanistan,Asia,Asia,Other
AG,Antigua and Barbuda,Antigua og Barbuda,North-America,North America,Other
AI,Anguilla,Anguilla,North-America,North America,Other


In [None]:
!pip3 install --isolated lxml

---
# Web scraping

In [32]:
import pandas as pd

table_continents = pd.read_html('https://statisticstimes.com/geography/countries-by-continents.php', match="Countries or Areas")

df_cont = table_continents[0]
df_cont = df_cont.set_index("Country or Area")[["Continent"]]
df_cont.head()

Country or Area,Continent
Afghanistan,Asia
Åland Islands,Europe
Albania,Europe
Algeria,Africa
American Samoa,Oceania


In [28]:
df_country["Country"].head()

country_code
AD                 Andorra
AE    United Arab Emirates
AF             Afghanistan
AG     Antigua and Barbuda
AI                Anguilla
Name: Country, dtype: object

### Using `DataFrame.join()`, add `Continent` column to `df_country`

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

In [34]:
df_country = df_country.join(df_cont, on="Country", how="left")
df_country.head()

country_code,Country,Land,Market,Market_SBSB,Market_SLX,Continent
AD,Andorra,Andorra,Other Europe,All others,Other,Europe
AE,United Arab Emirates,De forente Arabiske Emirater,Asia,Asia,Other,Asia
AF,Afghanistan,Afghanistan,Asia,Asia,Other,Asia
AG,Antigua and Barbuda,Antigua og Barbuda,North-America,North America,Other,North America
AI,Anguilla,Anguilla,North-America,North America,Other,North America


In [35]:
df_country.shape

(282, 6)
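A left join keeps every row of `df_country`, so the row count alone does not show whether all countries found a continent. Country names from two different sources may be spelled differently, and unmatched rows get `NaN`. The sketch below uses two tiny hypothetical tables (the name mismatch is invented for illustration) to show one way of listing unmatched rows:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
countries = pd.DataFrame(
    {"Country": ["Andorra", "Russia"]},
    index=pd.Index(["AD", "RU"], name="country_code"),
)
continents = pd.DataFrame(
    {"Continent": ["Europe", "Europe"]},
    index=pd.Index(["Andorra", "Russian Federation"], name="Country or Area"),
)

# on="Country" matches the column in `countries` against the index of `continents`.
joined = countries.join(continents, on="Country", how="left")

# Rows whose name was not found in `continents` have a missing Continent.
unmatched = joined[joined["Continent"].isna()]
print(unmatched)
```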

# Product

### Read the file `product.csv` into a pandas DataFrame

Hints:
1. the function to use here is `pd.read_csv`
1. we can specify the path to `product.csv` with `os.path.join` or `pathlib.Path` from the standard library
1. we must use the keyword argument `encoding="latin-1"`
1. we want to set the index to be the `Product_Code` column
1. we also want to set the `delimiter` keyword argument

In [None]:
df_product = pd.read_csv(dirpath / "product.csv",
                         encoding="latin-1", 
                         delimiter=";", 
                         index_col="Product_Code",
                        )
df_product.head(100)

### How many product categories are there?

In [None]:
len(df_product), df_product.shape

### Make a selection dataframe that only contains trout products

In [None]:
df_product.loc[df_product["Species_Code"] == "TRR"]

### Sort dataframe by product code

In [None]:
df_product.sort_index()

### Sort dataframe by species code (lexicographically)

In [None]:
df_product.sort_values("Species_Code")

### Sort dataframe by species code, then presentation, then preservation (lexicographically)

In [None]:
df_product.sort_values(["Species_Code", "Presentation", "Preservation"])

### List all the trout product categories using only the "Product_Description_KA" column

Do not use the `Species_Code` column for this task.

Hint: it should be sufficient to check if "trout" is mentioned in the description field.

In [None]:
trout_idx = df_product["Product_Description_KA"].str.contains("trout")

df_product[trout_idx]

### Make a new column named "Head" with a category dtype. Possible values should be YES, NO and UNKNOWN.

In [None]:
df = df_product.copy()
head_on = df.Product_Description_KA.str.contains("head on")
head_off = df.Product_Description_KA.str.contains("head off")

# Start with every row as UNKNOWN, then overwrite the rows where the description is explicit
df["Head"] = pd.Categorical(["UNKNOWN"]*len(df), categories=["YES", "NO", "UNKNOWN"])
df.loc[head_on, "Head"] = "YES"
df.loc[head_off, "Head"] = "NO"
df


### [Challenging] Can you recreate the "Preservation" column by using the "Product_Description_KA" column and the below dict named `keywords`? 

```python
keywords = {
    "PRS": ["brine", "canned", "smoked", "airtight"],
    "FRO": ["frozen"],
    "FRE": ["fresh"],
    "ALI": ["live"],
}
```


In [None]:
keywords = {
    "PRS": ["brine", "canned", "smoked", "airtight"],
    "FRO": ["frozen"],
    "FRE": ["fresh"],
    "ALI": ["live"],
}
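One possible approach, sketched here on hypothetical descriptions rather than the real `Product_Description_KA` values, is to loop over the dict and assign a code wherever one of its keywords occurs as a whole word. The first matching code wins, and the word boundaries are an assumption made to avoid matching "live" inside words such as "liver":

```python
import pandas as pd

keywords = {
    "PRS": ["brine", "canned", "smoked", "airtight"],
    "FRO": ["frozen"],
    "FRE": ["fresh"],
    "ALI": ["live"],
}

# Hypothetical descriptions standing in for Product_Description_KA.
desc = pd.Series([
    "Salmon, fresh, head on",
    "Trout, frozen fillets",
    "Salmon, smoked",
    "Eel, live",
    "Salmon roe",
])

# Start with no code assigned anywhere.
preservation = pd.Series([None] * len(desc), index=desc.index, dtype="object")

for code, words in keywords.items():
    # Match any of the keywords as a whole word, ignoring case.
    pattern = r"\b(?:" + "|".join(words) + r")\b"
    hit = desc.str.contains(pattern, case=False, regex=True)
    # Only fill rows that have not been assigned a code yet.
    preservation.loc[hit & preservation.isna()] = code

print(preservation)
```

Comparing the result with the existing `Preservation` column would show how many rows the keywords fail to reproduce.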

# SSB

### Read the file `ssb_export.csv` into a pandas DataFrame

Hints:
1. the function to use here is `pd.read_csv`
1. we can specify the path to `ssb_export.csv` with `os.path.join` or `pathlib.Path` from the standard library
1. we must use the keyword argument `encoding="latin-1"`
1. we want to set the index to be the `ID` column
1. we also want to set the `delimiter` keyword argument

In [None]:
df_ssb = pd.read_csv(dirpath / "ssb_export.csv",
                     encoding="latin-1",
                     delimiter=";",
                     index_col="ID"
                    )
df_ssb.tail()

In [None]:
df_ssb.describe()

### How many transactions are there in total?

In [None]:
len(df_ssb), df_ssb.shape[0] # use either of the two

### How many columns are there?

In [None]:
len(df_ssb.columns), df_ssb.shape[1] # use either of the two

### What years do the transactions cover?

In [None]:
df_ssb.År.unique().tolist()

### How many transactions were there in 2020?

In [None]:
len(df_ssb[df_ssb["År"] == 2020])

### What was the largest single transaction in terms of value?

In [None]:
df_ssb.Verdi.max()

### What is the ID of this transaction?

In [None]:
df_ssb.Verdi.idxmax()

### What year was this transaction?

Try to not directly use the ID from previous answer.

In [None]:
df_ssb.loc[df_ssb.Verdi.idxmax()].År

### Does the dataframe contain both import and export transactions?

In [None]:
df_ssb.Vareflyt.unique()

### Make a selection of Canadian transactions only

In [None]:
df_ssb_ca = df_ssb[df_ssb["Landkode"] == "CA"]

### Make a selection of Canadian transactions only, for year 2022

In [None]:
df_ssb_ca = df_ssb[(df_ssb["År"] == 2022) & (df_ssb["Landkode"] == "CA")]

### Calculate the total weight and value of the above selection for each product number ("Varenr")

Hint: group the above selection by product number ("Varenr"). You can use the `sum()` aggregator on the resulting `GroupBy` object.

In [None]:
# Select the numeric columns before aggregating so that text columns are not summed
df = df_ssb_ca.groupby("Varenr")[["Mengde", "Verdi"]].sum()
df

In [None]:
df["Varebeskrivelse"] = df_product.loc[df.index, "Product_Description_KA"]

In [None]:
df.set_index("Varebeskrivelse").plot(kind="pie", y="Verdi");

In [None]:
df.set_index("Varebeskrivelse").plot(kind="bar", y="Verdi");

### What was the total export in kg for Smoked salmon in 2020? (Are you able to find it with a single line of code?)

In [None]:
df_ssb[(df_ssb["Varenr"] == 3054100) & (df_ssb["År"] == 2020)].Mengde.sum()

### Calculate the average price (NOK/kg) for Fresh Pacific Salmon in 2019 

In [None]:
df = df_ssb[(df_ssb["Varenr"] == 3044100) & (df_ssb["År"] == 2019)]
(df["Verdi"] / df["Mengde"]).mean()
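The cell above averages the price of each transaction, so a small shipment counts as much as a large one. An alternative reading of "average price" is total value divided by total weight. The sketch below uses two made-up transactions to show that the two definitions can differ noticeably. Which one is wanted depends on the question being asked:

```python
import pandas as pd

# Hypothetical transactions: one small expensive shipment, one large cheap one.
tx = pd.DataFrame({"Mengde": [10, 1000], "Verdi": [1000, 50000]})

# Mean of per-transaction prices: (100 + 50) / 2
unweighted = (tx["Verdi"] / tx["Mengde"]).mean()

# Volume-weighted price: 51000 / 1010
weighted = tx["Verdi"].sum() / tx["Mengde"].sum()

print(unweighted, weighted)
```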

### Bonus: Make a bar chart of the average price for Fresh Pacific Salmon by year

In [None]:
df = df_ssb[df_ssb["Varenr"] == 3044100].copy()
df["price/kg"] = df["Verdi"] / df["Mengde"]
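# Select the price column before averaging; mean() over text columns fails in recent pandas versions
df.groupby("År")["price/kg"].mean().plot(kind="bar");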

---
# Memory usage

https://pandas.pydata.org/docs/user_guide/gotchas.html#dataframe-memory-usage

### How much memory does this dataframe use?

For this question, you can use `DataFrame.info()` and read the memory usage from there. Alternatively you can use `DataFrame.memory_usage(deep=True).sum()`.

Try `DataFrame.memory_usage().sum()` (without `deep=True`). Why is the reported memory usage lower now?

Hint: look at the official pandas documentation for `DataFrame.memory_usage()`

In [None]:
df_ssb.memory_usage(deep=True).sum()
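For `object` columns the shallow figure covers only the array of references, while `deep=True` also inspects the Python objects those references point to. A minimal sketch on a made-up Series of country codes illustrates the gap:

```python
import pandas as pd

# Hypothetical column of short strings stored with the object dtype.
s = pd.Series(["NO", "SE", "DK"] * 1000, dtype="object")

shallow = s.memory_usage(deep=False)  # array of references plus the index
deep = s.memory_usage(deep=True)      # additionally counts the string objects

print(shallow, deep)
```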

### The `Landkode` column has the `object` dtype. How much is the memory usage reduced (in percent) if `Landkode` dtype is changed to `Categorical`? 

In [None]:
mem_obj = df_ssb["Landkode"].memory_usage(deep=True)
mem_cat = df_ssb["Landkode"].astype('category').memory_usage(deep=True)

# reduction in percent for "Landkode" column
100*(1-mem_cat/mem_obj)

### The `Vareflyt` column has the `object` dtype. How much is the memory usage reduced (in percent) if `Vareflyt` dtype is changed to `StringDtype`? 

In [None]:
mem_obj = df_ssb["Vareflyt"].memory_usage(deep=True)
mem_str = df_ssb["Vareflyt"].astype("string").memory_usage(deep=True)

# reduction in percent for "Vareflyt" column
100*(1-mem_str/mem_obj)

---

### CHALLENGING: Assuming that `df_ssb["Verdi"]` is in NOK and all exports were traded in the local currency of the recipient country, what was the total export value, in the various local currencies, for Smoked salmon in 2020?

In [None]:
df = df_ssb[(df_ssb["Varenr"] == 3054100) & (df_ssb["År"] == 2020)]
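One way to continue from this selection is sketched below on toy data. It rests on several assumptions that are not part of the datasets: a hypothetical `country_to_ccy` mapping from country code to currency, and a single rate per currency (for example a yearly mean) expressed as NOK per one unit of foreign currency. All names and numbers are illustrative only:

```python
import pandas as pd

# Hypothetical stand-in for the filtered transactions.
exports = pd.DataFrame(
    {"Landkode": ["SE", "US", "SE"], "Verdi": [1000.0, 2000.0, 500.0]}
)

# Assumption: this mapping is not in the datasets and must be supplied separately.
country_to_ccy = {"SE": "SEK", "US": "USD"}

# Assumption: one representative rate per currency, in NOK per 1 unit.
nok_per_unit = pd.Series({"SEK": 1.0, "USD": 10.0})

exports["Ccy"] = exports["Landkode"].map(country_to_ccy)
exports["FX"] = exports["Ccy"].map(nok_per_unit)

# Value in NOK divided by NOK-per-unit gives the value in local currency.
exports["Local"] = exports["Verdi"] / exports["FX"]

total_local = exports.groupby("Ccy")["Local"].sum()
print(total_local)
```

With real data, countries missing from the mapping or currencies missing from the rate table would show up as `NaN` and should be checked before summing.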