- #### Download data from here:<br>
https://catalog.data.gov/dataset/crimes-2001-to-present/resource/31b027d7-b633-4e82-ad2e-cfa5caaf5837


### Install Python Libraries


#### Run this command in your Python (virtual) environment


#### <i> `pip install duckdb magic_duckdb polars plotly_express nbformat --quiet --user` </i>


#### 1. duckdb:

DuckDB is an in-process (embedded) analytical SQL database with a Python client.
It is designed for fast query execution and low memory usage.
Developers often use it for data analysis, data manipulation, and reporting.
You can find more information about DuckDB on the official website: <br> https://duckdb.org/

#### 2. magic_duckdb:

magic_duckdb is a Python package that provides Jupyter Notebook magic commands for interacting with DuckDB.
It allows you to run SQL queries against DuckDB directly within a Jupyter Notebook.
You can explore its usage and documentation in the GitHub repository: <br> https://github.com/iqmo-org/magic_duckdb, or here:<br>https://pypi.org/project/magic-duckdb/

#### 3. Polars:

Polars is a fast DataFrame library for Python and Rust.
It is designed for big data processing and provides a similar interface to Pandas.
Polars is particularly useful for working with large datasets efficiently.
To learn more about Polars visit the official website:<br> https://pola.rs/

#### 4. Plotly Express:

Plotly Express is a high-level Python visualization library built on top of Plotly.
It simplifies the creation of interactive plots, charts, and graphs.
With Plotly Express, you can quickly generate visualizations without writing extensive code.
Explore its capabilities in the official documentation: <br>https://plotly.com/python/plotly-express/

#### 5. nbformat:

nbformat is a Python library for working with Jupyter Notebook file formats.
It allows you to read, write, and manipulate Jupyter Notebook files programmatically.
Developers often use it for tasks like converting notebooks to different formats or extracting cell content.
You can find more details in the official documentation: <br>https://nbformat.readthedocs.io/en/latest/ and here: <br>https://pypi.org/project/nbformat/

Source:  <font color="orange"><i>Microsoft Copilot. (2024). Python libraries. Retrieved from the OpenAI ChatGPT platform.</i></font>

#### The `--quiet` flag suppresses output during installation, and the `--user` flag installs the packages into your per-user site-packages directory. Omit `--user` when you are inside a virtual environment, where pip rejects it.


In [None]:
# .torenv\Scripts\Activate.ps1

In [None]:
import duckdb
import pandas as pd

%load_ext magic_duckdb

#### We already installed and loaded the duckdb magic in our notebook. Let us take advantage of it <br>so that we don't repeat `duckdb.sql` every time. <br>Instead we can use: <br>

- `%dql` for single-line queries, and<br>
- `%%dql` for multi-line queries


##### Because we are using the magic_duckdb extension, our queries return a Pandas DataFrame by default, <br> bringing the entire query result into memory.

##### We can change this by passing `-t` followed by the return type, choosing from "df", "arrow", "pl", "describe", "show" and "relation".


#### Show pandas dataframe without cutting out some rows in the display

In [None]:
# Set Pandas to display all rows without truncation
pd.set_option('display.max_rows', None)
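If changing the option for the whole notebook is more than you need, one possible alternative is `pd.option_context`, which applies a setting only inside a `with` block. The sketch below uses a throwaway 100-row DataFrame (hypothetical data, not the crime dataset).

```python
import pandas as pd

df = pd.DataFrame({"n": range(100)})  # hypothetical sample data
before = pd.get_option("display.max_rows")

# Inside the block every row is rendered; the option is restored afterwards.
with pd.option_context("display.max_rows", None):
    full = repr(df)

after = pd.get_option("display.max_rows")
print(len(full.splitlines()), "lines rendered inside the block")
```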

#### `duckdb_settings()` is a Table Function: This function returns a table with information about all configuration <br>options and their current values. ####

#### Get duckdb metadata parameters

In [None]:
%dql -t df SELECT * FROM duckdb_settings();

In [None]:
%dql SELECT * EXCLUDE input_type FROM duckdb_settings() WHERE name = 'memory_limit';

### Parquet File Format

#### - The Parquet file format is widely recognized as one of the most efficient storage <br>options in today’s data landscape. Here are some reasons why it’s considered a de-facto standard: 

- <b><font color="#B0FC38">Data Compression: </font></b> Parquet files apply various encoding and compression algorithms, resulting in a smaller storage footprint.
- <b><font color="#B0FC38">Columnar Storage: </font></b> In analytic workloads, where fast data read operations are crucial, Parquet’s column-based storage shines. It stores values from each column together, enabling efficient query processing.
- <b><font color="#B0FC38">Language Agnostic: </font></b> Developers can manipulate Parquet data using different programming languages, making it versatile for diverse data teams.
- <b><font color="#B0FC38">Open-Source Format: </font></b> Parquet is not tied to a specific vendor, ensuring flexibility and compatibility.

Now, let’s explore the differences between row-based and column-based storage:

- <b><font color="#B0FC38">Row-Based Storage: </font></b> In traditional row-based storage, data is stored as a sequence of rows. Imagine a table with rows representing individual records. This approach is often not optimal for OLAP scenarios where specific questions need quick answers (e.g., sales inquiries), because whole rows must be read even when a query touches only a few columns.
- <b><font color="#B0FC38">Column-Based Storage (Parquet): </font></b> Parquet stores data in a column-oriented manner. Each column is independently accessible, making encoding, compression, and optimization possible. This design significantly improves performance for analytical queries.<br><br>
Source:  <font color="orange"><i>Microsoft Copilot. (2024). Parquet File Format. Retrieved from the OpenAI ChatGPT platform.</i></font>
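To make the row-versus-column distinction concrete, here is a toy illustration in pure Python. It is not how Parquet is implemented; the records are invented, and the run-length encoding at the end only hints at why values stored together by column tend to compress well.

```python
from itertools import groupby

# Row-oriented layout: each record's fields are stored together.
rows = [
    {"id": 1, "primary_type": "THEFT", "year": 2019},
    {"id": 2, "primary_type": "THEFT", "year": 2019},
    {"id": 3, "primary_type": "ROBBERY", "year": 2020},
    {"id": 4, "primary_type": "THEFT", "year": 2020},
]

# Column-oriented layout: each column's values are stored together.
columns = {name: [record[name] for record in rows] for name in rows[0]}

# An analytical question such as "how many thefts?" reads one column only.
theft_count = columns["primary_type"].count("THEFT")

# Sorted, repetitive column values collapse into (value, run length) pairs.
run_lengths = [
    (value, len(list(group)))
    for value, group in groupby(sorted(columns["primary_type"]))
]
print(theft_count, run_lengths)
```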

#### For an excellent, in-depth look at the Parquet file structure, <br>check out this simplified yet thorough explanation from Data-Mozart:<br><i> https://data-mozart.com/parquet-file-format-everything-you-need-to-know/<br><br>https://www.youtube.com/watch?v=5NA57Pfpdr4&t=1s</i>

#### Let us look at the datatypes in the parquet file - <i>Note the duckdb sql syntax </i>

In [None]:
%%dql
DESCRIBE FROM 'Crimes_2001_to_Present.parquet';

#### Get RowCount from Parquet file

In [None]:
%%dql -t df
select format('{:,}', count(*)) as count from 'Crimes_2001_to_Present.parquet';

#### Let us look at the parquet metadata

In [None]:
%%dql -t df
select * from parquet_metadata('Crimes_2001_to_Present.parquet') LIMIT 20;

In [None]:
%%dql -t df
select * from PARQUET_SCHEMA('Crimes_2001_to_Present.parquet') LIMIT 20

In [None]:
%dql -t df SELECT * FROM 'Crimes_2001_to_Present.parquet' LIMIT 3;

#### Let us take a look at the file's columns and datatypes

In [None]:
%dql DESCRIBE FROM 'Crimes_2001_to_Present.parquet';

#### Let us group the data in the parquet file using the Primary Type column


In [None]:
%%dql
SELECT "Primary Type", FORMAT('{:,}', COUNT(*)) AS RowCount
FROM 'Crimes_2001_to_Present.parquet'
GROUP BY "Primary Type"
ORDER BY COUNT(*) DESC;

#### Let us add the year to the grouping and keep only the data from 2019 onward

In [None]:
%%dql -o df1
SELECT  date_part('year', DATE) AS year, 
"Primary Type", COUNT(*) AS RowCount
FROM 'Crimes_2001_to_Present.parquet'
WHERE "Primary Type" IN('THEFT', 'MOTOR VEHICLE THEFT','ROBBERY','HOMICIDE','BURGLARY')
AND date_part('year', DATE) > 2018
GROUP BY date_part('year', DATE), "Primary Type"
ORDER BY year DESC, COUNT(*) DESC;

#### Let us style our newly created pandas DataFrame (note that `background_gradient` requires matplotlib to be installed)

In [None]:
df2 = df1.copy()
df2[['year', 'Primary Type', 'RowCount']].style.background_gradient(cmap='PuBu', axis=0)

In [None]:
df3 = df1.copy()
cols=['Primary Type', 'year','RowCount']
(df3[cols]  #.head(10)
   .style.background_gradient(axis=0).highlight_min(color='lightgreen')
)

#### Find current working directory path

In [None]:
%pwd 

#### END OF FILE