### Chapter 2 Exercises

In [37]:
import polars as pl

pl.Config.set_fmt_str_lengths(50)

polars.config.Config

1. Use Polars to read the CSV file `reviews_videogames_350_simplified.csv` from the `datasets` folder. Use the option `try_parse_dates` to ensure that Polars recognizes `review_date` as a date. Use `schema` to verify that all columns have the right data type.

In [39]:
videogames = pl.read_csv('../datasets/reviews_videogames_350_simplified.csv', try_parse_dates=True)

In [40]:
videogames.schema

{'title': Utf8, 'rating': Float64, 'review_date': Date, 'review_text': Utf8}

2. Filter the same DataFrame to keep only the reviews written on or after 1 January 2018. Then show the first 5 rows. (Hint: use `from datetime import date` to import Python's `date` class, then create a `date` object and use it in the comparison.)

In [41]:
from datetime import date

videogames.filter(
    pl.col('review_date') >= date(2018,1,1)
).head(5)

title,rating,review_date,review_text
str,f64,date,str
"""Call of Duty: Ghosts - PlayStation 4""",5.0,2018-06-30,"""excelente"""
"""Until Dawn - PlayStation 4""",5.0,2018-01-07,"""A really fun game, that I recommend trying."""
"""Halo 4 - Xbox 360 (Standard Game)""",5.0,2018-05-09,"""no"""
"""UNCHARTED: The Nathan Drake Collection - PlayStat…",5.0,2018-01-12,"""Just got a PS4 (I'm pretty late to the game...lit…"
"""DualShock 4 Wireless Controller for PlayStation 4…",4.0,2018-02-24,"""For two player games, it's a must!"""


3. Repeat the operations above, but create a LazyFrame and filter it. (Hint: you create a LazyFrame by reading the file with a `scan_` function such as `scan_csv`, and you call `collect` when you want to see the result.)

In [42]:
videogames_lf = pl.scan_csv('../datasets/reviews_videogames_350_simplified.csv', try_parse_dates=True)

In [43]:
videogames_lf = videogames_lf.filter(
    pl.col('review_date') >= date(2018,1,1)
)

In [44]:
videogames_lf.head(5).collect()

title,rating,review_date,review_text
str,f64,date,str
"""Call of Duty: Ghosts - PlayStation 4""",5.0,2018-06-30,"""excelente"""
"""Until Dawn - PlayStation 4""",5.0,2018-01-07,"""A really fun game, that I recommend trying."""
"""Halo 4 - Xbox 360 (Standard Game)""",5.0,2018-05-09,"""no"""
"""UNCHARTED: The Nathan Drake Collection - PlayStat…",5.0,2018-01-12,"""Just got a PS4 (I'm pretty late to the game...lit…"
"""DualShock 4 Wireless Controller for PlayStation 4…",4.0,2018-02-24,"""For two player games, it's a must!"""


4. Print the LazyFrame from question 3 without using `collect` and visualize the query plan.

In [30]:
videogames_lf

5. Sort the LazyFrame based on title and review date in descending order. Show the first 5 rows.

In [45]:
videogames_lf.sort(['title', 'review_date'], descending=True).head(5).collect()

title,rating,review_date,review_text
str,f64,date,str
,5.0,2018-04-29,"""Can't rate the game because it wasn't for me but …"
"""inFAMOUS: Second Son Standard Edition (PlayStatio…",4.0,2018-06-23,"""game case got squeezed and the cd wasn't secured …"
"""inFAMOUS: Second Son Standard Edition (PlayStatio…",4.0,2018-06-11,"""The boy played it multiple times so..."""
"""inFAMOUS: Second Son Standard Edition (PlayStatio…",4.0,2018-06-01,"""Good!"""
"""inFAMOUS: Second Son Standard Edition (PlayStatio…",1.0,2018-04-27,"""Rubbish... Good only for a 5 year old maybe... No…"


6. Calculate the minimum value for the rating column and for the review_date column.

In [46]:
videogames_lf.select(
    pl.col('rating', 'review_date').min()
).collect()

rating,review_date
f64,date
1.0,2018-01-01


7. Calculate the deciles for the rating column.

In [49]:
videogames_lf.select(
    pl.col('rating').quantile(i/100).alias(f'perc_{i}') for i in range(0, 101, 10)
).collect()

perc_0,perc_10,perc_20,perc_30,perc_40,perc_50,perc_60,perc_70,perc_80,perc_90,perc_100
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
1.0,3.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


8. Count the number of missing values for the title column.

In [55]:
videogames_lf.select(
    pl.col('title').null_count()
).collect()

title
u32
1


9. Calculate the minimum, maximum, average, median, and standard deviation for the rating column.

In [58]:
videogames_lf.select(
    min_rating = pl.min('rating'),
    max_rating = pl.max('rating'),
    avg_rating = pl.mean('rating'),
    median_rating = pl.median('rating'),
    std_rating = pl.std('rating'),
).collect()

min_rating,max_rating,avg_rating,median_rating,std_rating
f64,f64,f64,f64,f64
1.0,5.0,4.414088,5.0,1.133849


10. Show the rows where the combination of title, review_date, and review_text appears more than once (duplicated rows). Hint: you can use `pl.struct()` to combine the three columns and then check which combinations are duplicated. Finally, sort the table by the same columns.

In [64]:
videogames_lf.filter(
    pl.struct('title', 'review_date', 'review_text').is_duplicated()
).sort('title', 'review_date', 'review_text').collect()

title,rating,review_date,review_text
str,f64,date,str
"""Batman: Arkham Knight - PlayStation 4 [Digital Co…",5.0,2018-02-08,"""ok"""
"""Batman: Arkham Knight - PlayStation 4 [Digital Co…",5.0,2018-02-08,"""ok"""
"""DualShock 4 Wireless Controller for PlayStation 4…",5.0,2018-03-07,"""works as described"""
"""DualShock 4 Wireless Controller for PlayStation 4…",5.0,2018-03-07,"""works as described"""
"""Final Fantasy X""",5.0,2018-02-27,"""Most favorite game of the series"""
"""Final Fantasy X""",5.0,2018-02-27,"""Most favorite game of the series"""
"""Final Fantasy X""",5.0,2018-04-20,"""Played this game long ago, can't wait to try it o…"
"""Final Fantasy X""",5.0,2018-04-20,"""Played this game long ago, can't wait to try it o…"
"""Halo - Xbox""",5.0,2018-03-02,"""awesome game."""
"""Halo - Xbox""",5.0,2018-03-02,"""awesome game."""
