In [1]:
import polars as pl
print(pl.__version__)

0.20.31


### Big Countries

#### Question

DataFrame: World

| Column Name | Type    |
|:-----------:|:-------:|
| name        | varchar |
| continent   | varchar |
| area        | int     |
| population  | int     |
| gdp         | bigint  |


name is the primary key (column with unique values) for this table.<br>
Each row of this table gives information about the name of a country, the continent to which it belongs, its area, the population, and its GDP value.


A country is big if:

- it has an area of at least three million (i.e., 3000000 km²), or
- it has a population of at least twenty-five million (i.e., 25000000).

Write a solution to find the name, population, and area of the big countries.

Return the result table in any order. The result format is in the following example.

Input:<br>

world DataFrame:

| name        | continent | area    | population | gdp          |
|:-----------:|:---------:|:-------:|:----------:|:------------:|
| Afghanistan | Asia      | 652230  | 25500100   | 20343000000  |
| Albania     | Europe    | 28748   | 2831741    | 12960000000  |
| Algeria     | Africa    | 2381741 | 37100000   | 188681000000 |
| Andorra     | Europe    | 468     | 78115      | 3712000000   |
| Angola      | Africa    | 1246700 | 20609294   | 100990000000 |

Output:

| name        | population | area    |
|:-----------:|:----------:|:-------:|
| Afghanistan | 25500100   | 652230  |
| Algeria     | 37100000   | 2381741 |


#### Testcase

In [2]:
# Test data
data = [['Afghanistan', 'Asia', 652230, 25500100, 20343000000], 
        ['Albania', 'Europe', 28748, 2831741, 12960000000], 
        ['Algeria', 'Africa', 2381741, 37100000, 188681000000], 
        ['Andorra', 'Europe', 468, 78115, 3712000000], 
        ['Angola', 'Africa', 1246700, 20609294, 100990000000]]

# Create the DataFrame
world = pl.DataFrame(
    data,
    schema=['name', 'continent', 'area', 'population', 'gdp'],
    orient='row'  # each inner list is one row; do not rely on orientation inference
)

# Display the DataFrame
print(world)

shape: (5, 5)
┌─────────────┬───────────┬─────────┬────────────┬──────────────┐
│ name        ┆ continent ┆ area    ┆ population ┆ gdp          │
│ ---         ┆ ---       ┆ ---     ┆ ---        ┆ ---          │
│ str         ┆ str       ┆ i64     ┆ i64        ┆ i64          │
╞═════════════╪═══════════╪═════════╪════════════╪══════════════╡
│ Afghanistan ┆ Asia      ┆ 652230  ┆ 25500100   ┆ 20343000000  │
│ Albania     ┆ Europe    ┆ 28748   ┆ 2831741    ┆ 12960000000  │
│ Algeria     ┆ Africa    ┆ 2381741 ┆ 37100000   ┆ 188681000000 │
│ Andorra     ┆ Europe    ┆ 468     ┆ 78115      ┆ 3712000000   │
│ Angola      ┆ Africa    ┆ 1246700 ┆ 20609294   ┆ 100990000000 │
└─────────────┴───────────┴─────────┴────────────┴──────────────┘


#### Solution

In [3]:
def big_countries(world: pl.DataFrame, area: int, population: int) -> pl.DataFrame:
    # Keep rows that meet either threshold, then project the requested columns.
    df = world.filter(
        (pl.col('area') >= area) | (pl.col('population') >= population)
    )
    return df.select(['name', 'population', 'area'])

# Display the result
print(big_countries(world=world, area=3000000, population=25000000))

shape: (2, 3)
┌─────────────┬────────────┬─────────┐
│ name        ┆ population ┆ area    │
│ ---         ┆ ---        ┆ ---     │
│ str         ┆ i64        ┆ i64     │
╞═════════════╪════════════╪═════════╡
│ Afghanistan ┆ 25500100   ┆ 652230  │
│ Algeria     ┆ 37100000   ┆ 2381741 │
└─────────────┴────────────┴─────────┘


### Recyclable and Low Fat Products

#### Question

DataFrame: Products

| Column Name | Type    |
|:-----------:|:-------:|
| product_id  | int     |
| low_fats    | enum    |
| recyclable  | enum    |

product_id is the primary key (column with unique values) for this table.<br>
low_fats is an ENUM (category) of type ('Y', 'N') where 'Y' means this product is low fat and 'N' means it is not.<br>
recyclable is an ENUM (category) of type ('Y', 'N') where 'Y' means this product is recyclable and 'N' means it is not.
 

Write a solution to find the ids of products that are both low fat and recyclable.

Return the result table in any order. The result format is in the following example.

Input:<br>

products DataFrame:

| product_id  | low_fats | recyclable |
|:-----------:|:--------:|:----------:|
| 0           | Y        | N          |
| 1           | Y        | Y          |
| 2           | N        | Y          |
| 3           | Y        | Y          |
| 4           | N        | N          |

Output: 

| product_id  |
|:-----------:|
| 1           |
| 3           |

Explanation: Only products 1 and 3 are both low fat and recyclable.

#### Testcase

In [4]:
# Test data
data = [[0, 'Y', 'N'], [1, 'Y', 'Y'], [2, 'N', 'Y'], [3, 'Y', 'Y'], [4, 'N', 'N']]

# Create the DataFrame
products = pl.DataFrame(
    data,
    schema=['product_id', 'low_fats', 'recyclable'],
    orient='row'  # each inner list is one row
)

# Display the DataFrame
print(products)

shape: (5, 3)
┌────────────┬──────────┬────────────┐
│ product_id ┆ low_fats ┆ recyclable │
│ ---        ┆ ---      ┆ ---        │
│ i64        ┆ str      ┆ str        │
╞════════════╪══════════╪════════════╡
│ 0          ┆ Y        ┆ N          │
│ 1          ┆ Y        ┆ Y          │
│ 2          ┆ N        ┆ Y          │
│ 3          ┆ Y        ┆ Y          │
│ 4          ┆ N        ┆ N          │
└────────────┴──────────┴────────────┘


#### Solution

In [5]:
def find_products(products: pl.DataFrame, low_fats: str, recyclable: str) -> pl.DataFrame:
    # Both conditions must hold, so the predicates are combined with &.
    df = products.filter(
        (pl.col('low_fats') == low_fats) & (pl.col('recyclable') == recyclable)
    )
    return df.select(['product_id'])


# Display the result
print(find_products(products=products, low_fats="Y", recyclable="Y"))

shape: (2, 1)
┌────────────┐
│ product_id │
│ ---        │
│ i64        │
╞════════════╡
│ 1          │
│ 3          │
└────────────┘


### Customers Who Never Order

#### Question

DataFrame: customers

| Column Name | Type    |
|:-----------:|:-------:|
| id          | int     |
| name        | varchar |

id is the primary key (column with unique values) for this table.<br>
Each row of this DataFrame indicates the ID and name of a customer.
 

DataFrame: orders

| Column Name | Type |
|:-----------:|:----:|
| id          | int  |
| customerId  | int  |

id is the primary key (column with unique values) for this table.<br>
customerId is a foreign key (reference column) to id in the customers DataFrame.<br>
Each row of this table indicates the ID of an order and the ID of the customer who ordered it.


Write a solution to find all customers who never order anything.

Return the result table in any order.

The result format is in the following example.

Input:<br>

customers DataFrame:

| id | name  |
|:--:|:-----:|
| 1  | Joe   |
| 2  | Henry |
| 3  | Sam   |
| 4  | Max   |

orders DataFrame:

| id | customerId |
|:--:|:----------:|
| 1  | 3          |
| 2  | 1          |

Output: 

| Customers |
|:---------:|
| Henry     |
| Max       |

#### Testcase

In [7]:
# Test data
customers_data = [[1, 'Joe'], [2, 'Henry'], [3, 'Sam'], [4, 'Max']]
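orders_data = [[1, 3], [2, 1]]  # rows of (id, customerId), as in the question's orders table

# Create the DataFrames
customers = pl.DataFrame(
    customers_data,
    schema=['id', 'name'],
    orient='row'
)

orders = pl.DataFrame(
    orders_data,
    schema=['id', 'customerId'],
    orient='row'  # without this, a 2x2 list of ints may be read column by column
)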

# Display the DataFrame
print('customers df:',customers)
print('orders df', orders)

customers df: shape: (4, 2)
┌─────┬───────┐
│ id  ┆ name  │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ Joe   │
│ 2   ┆ Henry │
│ 3   ┆ Sam   │
│ 4   ┆ Max   │
└─────┴───────┘
orders df shape: (2, 2)
┌─────┬────────────┐
│ id  ┆ customerId │
│ --- ┆ ---        │
│ i64 ┆ i64        │
╞═════╪════════════╡
│ 1   ┆ 3          │
│ 2   ┆ 1          │
└─────┴────────────┘


#### Solution

In [8]:
def find_customers(customers: pl.DataFrame, orders: pl.DataFrame) -> pl.DataFrame:
    # Pass the customerId column as a Series rather than a one-column DataFrame.
    df = customers.filter(
        ~pl.col('id').is_in(orders.get_column('customerId'))
    )
    return df.select(['name']).rename({'name': 'Customers'})


# Display the result
print(find_customers(customers=customers, orders=orders))

shape: (2, 1)
┌───────────┐
│ Customers │
│ ---       │
│ str       │
╞═══════════╡
│ Henry     │
│ Max       │
└───────────┘


### Article Views I

#### Question

DataFrame: views

| Column Name   | Type    |
|:-------------:|:-------:|
| article_id    | int     |
| author_id     | int     |
| viewer_id     | int     |
| view_date     | date    |

There is no primary key (column with unique values) for this table; it may contain duplicate rows.<br>
Each row of this table indicates that some viewer viewed an article (written by some author) on some date.<br>
Note that equal author_id and viewer_id indicate the same person.
 

Write a solution to find all the authors that viewed at least one of their own articles.

Return the result table sorted by `id` in ascending order.

The result format is in the following example.

Input:<br>

views DataFrame:

| article_id | author_id | viewer_id | view_date  |
|:----------:|:---------:|:---------:|:----------:|
| 1          | 3         | 5         | 2019-08-01 |
| 1          | 3         | 6         | 2019-08-02 |
| 2          | 7         | 7         | 2019-08-01 |
| 2          | 7         | 6         | 2019-08-02 |
| 4          | 7         | 1         | 2019-07-22 |
| 3          | 4         | 4         | 2019-07-21 |
| 3          | 4         | 4         | 2019-07-21 |

Output: 

| id   |
|:----:|
| 4    |
| 7    |

#### Testcase

In [9]:
# Test data
data = [[1, 3, 5, '2019-08-01'], 
        [1, 3, 6, '2019-08-02'], 
        [2, 7, 7, '2019-08-01'], 
        [2, 7, 6, '2019-08-02'], 
        [4, 7, 1, '2019-07-22'], 
        [3, 4, 4, '2019-07-21'], 
        [3, 4, 4, '2019-07-21']]

# Create the DataFrame
views = pl.DataFrame(
    data,
    schema=['article_id', 'author_id', 'viewer_id', 'view_date'],
    orient='row'  # each inner list is one row
)

# Display the DataFrame
print(views)

shape: (7, 4)
┌────────────┬───────────┬───────────┬────────────┐
│ article_id ┆ author_id ┆ viewer_id ┆ view_date  │
│ ---        ┆ ---       ┆ ---       ┆ ---        │
│ i64        ┆ i64       ┆ i64       ┆ str        │
╞════════════╪═══════════╪═══════════╪════════════╡
│ 1          ┆ 3         ┆ 5         ┆ 2019-08-01 │
│ 1          ┆ 3         ┆ 6         ┆ 2019-08-02 │
│ 2          ┆ 7         ┆ 7         ┆ 2019-08-01 │
│ 2          ┆ 7         ┆ 6         ┆ 2019-08-02 │
│ 4          ┆ 7         ┆ 1         ┆ 2019-07-22 │
│ 3          ┆ 4         ┆ 4         ┆ 2019-07-21 │
│ 3          ┆ 4         ┆ 4         ┆ 2019-07-21 │
└────────────┴───────────┴───────────┴────────────┘


#### Solution

In [11]:
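def article_views(views: pl.DataFrame) -> pl.DataFrame:
    # An author viewed their own article when author_id equals viewer_id.
    df = views.filter(pl.col('author_id') == pl.col('viewer_id'))
    # Deduplicate, sort ascending, and rename to match the expected output.
    return df.select(['author_id']).unique().sort('author_id').rename({'author_id': 'id'})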


# Display the result
print(article_views(views=views))

shape: (2, 1)
┌─────┐
│ id  │
│ --- │
│ i64 │
╞═════╡
│ 4   │
│ 7   │
└─────┘
