## Colab Prep

Execute the following code cells whenever you open or restart the notebook in Google Colab.

In [None]:
!pip install "polars[all]" #execute each time you start/restart a Colab session

In [None]:
!wget https://github.com/WSU-DataScience/dsci_325_module6_basic_data_management_in_python/raw/main/sample_data.zip

In [None]:
!unzip ./sample_data.zip

# Module 6.1 - Reading Data in `polars`

## Dataframes in Python

Here is a summary of some of the important data management libraries in Python.

* `pandas` was the first (and still most popular) data frame library.  It was based on `R` data frames, but is starting to show its age.
* `polars` is a new library similar to `pandas`, but has new features that make it easier to work with and more efficient for large data and multi-core machines.
* `pyspark` is used for managing very large data on a distributed network of machines.
* `koalas` is an interface to `pyspark` that is based on the `pandas` interface.

**Note.** We will be primarily focusing on `polars`, but will occasionally need to convert to `pandas` to work with other libraries.

## Polars provides next-generation data frames for Python

* **Expressive.** Queries are familiar, readable, and composable.
* **Parallel.** Can use all cores/threads.
* **Fast.** Among the fastest in-memory data frames.
* **Lazy.** Allows lazy evaluation for
    * Efficient memory usage
    * Query optimization
    * Filter pushdown
* **Eager.** Allows eager evaluation for convenience on small data sets.

In [None]:
import polars as pl

## Our first dataframe

In [None]:
df = pl.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10, 5, 1.0],
                   "Love_of_R": [2, 5, 11],
                   "years_at_wsu": [4, 17, 5]})
df.head()

## Reading from a data file

* Most data sets will be read in from a CSV or JSON data file
* `polars` provides `read_csv` and `read_json`

### Open a CSV file from a local file w/ relative path

In [None]:
artists = pl.read_csv('./sample_data/Artists.csv')
artists.head()

### Open a CSV using a web address

In [None]:
url = "https://github.com/MuseumofModernArt/collection/raw/master/Artists.csv"
artists =  pl.read_csv(url)
artists.head()

### What is a JSON data file?

* Another (more modern) storage format
* Here the data is stored as a list of rows, where each row is like a Python `dict`

```{json}
[
{
  "ConstituentID": 1,
  "DisplayName": "Robert Arneson",
  "ArtistBio": "American, 1930–1992",
  "Nationality": "American",
  "Gender": "Male",
  "BeginDate": 1930,
  "EndDate": 1992,
  "Wiki QID": null,
  "ULAN": null
},
{
  "ConstituentID": 2,
  "DisplayName": "Doroteo Arnaiz",
  "ArtistBio": "Spanish, born 1936",
  "Nationality": "Spanish",
  "Gender": "Male",
  "BeginDate": 1936,
  "EndDate": 0,
  "Wiki QID": null,
  "ULAN": null
},
...
```

### `polars` can read `json` data

In [None]:
artists =  pl.read_json('./sample_data/Artists.json')
artists.head()

## <font color="red"> Exercise 6.1.1 </font>
    
Use tab-completion and `help` to discover and explore two more methods of reading a file into a `polars` dataframe.


In [None]:
pl.read_ #<-- Tab here

> Discuss what you found here

## <font color="red"> Exercise 6.1.2 </font>
    
Read in the `./sample_data/Artwork.csv` from [https://github.com/MuseumofModernArt/collection](https://github.com/MuseumofModernArt/collection) and display the head of the resulting dataframe.


In [None]:
# Your code here

## Working with other character encodings

Data stored in a text file 

* Is encoded using some [character encoding](https://en.wikipedia.org/wiki/Character_encoding) and 
* Is commonly stored using [UTF-8](https://en.wikipedia.org/wiki/UTF-8), but
* Needs to be read and converted when using another encoding.
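
A small sketch using only built-in Python shows why the encoding matters. The name "Miró" is just an illustrative string; the same text produces different bytes under ISO-8859-1 and UTF-8, and bytes written with one encoding may fail to decode with the other.

```python
text = "Miró"  # illustrative string containing a non-ASCII character

# The same text is stored as different bytes under each encoding
latin1_bytes = text.encode("ISO-8859-1")
utf8_bytes = text.encode("utf-8")
print(latin1_bytes)
print(utf8_bytes)

# Reading ISO-8859-1 bytes as if they were UTF-8 raises an error
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)

# Decoding with the right encoding and re-encoding gives valid UTF-8
recovered = latin1_bytes.decode("ISO-8859-1").encode("utf-8")
print(recovered == utf8_bytes)
```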

### Example - MoMA exhibitions

An example of a data set that is stored with a non-standard encoding is the `./sample_data/MoMAExhibitions1929to1989.csv` provided by the [Museum of Modern Art (MoMA)](https://github.com/MuseumofModernArt/collection).

### The exhibition file gives encoding errors by default

When trying to read this file, we get an error about the encoding.

In [None]:
exhibitions = pl.read_csv('./sample_data/MoMAExhibitions1929to1989.csv')

## Switching encodings fixes the problem

* This file uses ISO-8859-1 encoding, see [this Stack Overflow question](https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python)
* More details on [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
* How to read non-utf8 encodings
    * Use Python's tools (`with` statement and `open`) to read the file.
    * Encode as `utf-8` and pass to `polars`

In [None]:
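# Decode the file as ISO-8859-1, then re-encode the text as UTF-8 bytes for polars
with open('./sample_data/MoMAExhibitions1929to1989.csv', 'r', encoding='ISO-8859-1') as fh:
    converted_file = fh.read().encode('utf-8')
    exhibitions = pl.read_csv(converted_file,
                              ignore_errors=True,
                              try_parse_dates=True)  # named `parse_dates` in older polars releases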
    
exhibitions.head(2)

## So what is a `DataFrame`?

* Like R, `polars` focuses on columns
* Think `dict` of `(str, Series)` pairs 
* A series is a typed list-like structure

In [None]:
# This is how I imagine a dataframe
df = pl.DataFrame({"Names": ["Iverson", "Malone", "Bergen"],
                   "Python_mastery": [10, 5, 1.0],
                   "years_at_wsu": [4.5, 17.5, 5.5]})

In [None]:
type(df)

In [None]:
df

## Two ways to access a column

* **Method 1:** Actual data series
    * `df["column_name"]`
* **Method 2:** Lazy column expression, used inside another context such as `select` or `filter`
    * `pl.col('column_name')`
    * Only for proper names!

In [None]:
artists['BeginDate'].head(2)

In [None]:
pl.col('BeginDate') # Lazy - Nothing (yet)

## Columns are type `Series` and hold one type of data

In [None]:
type(artists['BeginDate'])

In [None]:
type(artists['DisplayName'])

In [None]:
artists['BeginDate'].dtype

In [None]:
artists['DisplayName'].dtype

## More on data types

* A list of all `polars` data types is available in `pl.datatypes`
    * Look for names starting with a capital letter.
* Use `df.dtypes` to see the column types in a dataframe named `df`

#### A list of all `polars` data types

In [None]:
[m for m in dir(pl.datatypes) if m[0].isupper()] # keep names starting with a capital letter (istitle would skip names such as UInt64)

#### Inspecting the data types for a data frame

In [None]:
artists.dtypes

## Setting `dtypes` with `read_csv`

We can pass a `dict` of types to the `dtypes` keyword (renamed `schema_overrides` in recent `polars` releases).

In [None]:
artist_types = {'ConstituentID': pl.Int64,
                'DisplayName': pl.Utf8,
                'ArtistBio': pl.Utf8,
                'Nationality': pl.Utf8,
                'Gender':pl.Utf8,
                'BeginDate': pl.Int64,
                'EndDate': pl.Int64,
                'Wiki QID': pl.Utf8,
                'ULAN':pl.Int64} 

artists2 = pl.read_csv('./sample_data/Artists.csv', dtypes = artist_types)
artists2.head()

## More on `None` and `NaN`

`polars` distinguishes between missing values and `NaN`.

* `None`/`null` is a missing value.
* `NaN` represents the result of an undefined operation
* `NaN` is **not** missing
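
The same distinction already exists in plain Python, which the following sketch illustrates with built-in tools only. The variable names are illustrative; `inf - inf` is used as one example of an undefined floating-point operation.

```python
import math

# None marks a value that is absent
missing = None

# NaN is an ordinary float produced by an undefined operation
undefined = float("inf") - float("inf")

is_missing = missing is None
is_nan = math.isnan(undefined)
nan_is_float = isinstance(undefined, float)

# NaN is never equal to anything, including itself
nan_equals_itself = undefined == undefined

print(is_missing, is_nan, nan_is_float, nan_equals_itself)
```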

In [None]:
df = pl.DataFrame({'a': [-1, 0 , 1, None],
                   'b': [1, 2, None, 4],
                   'c': [1, 2, float('nan'), 4]})
df

### `NaN` values are a result of undefined operations

Note that computing the square root of a negative number returns `NaN`, not `None`/`null`.

In [None]:
df_w_sqrt = (df
             .select([pl.col('a'),
                      pl.col('a').sqrt().alias('sqrt_a'),
                     ])
)
df_w_sqrt

### `NaN` values are not `None`

In [None]:
(df_w_sqrt
 .select([
          pl.col('sqrt_a'),
          pl.col('sqrt_a').is_null().alias('Is null'),
          pl.col('sqrt_a').is_nan().alias('Is nan'),
             ])
)

### `NaN` and `None` affect aggregation differently.

We will discuss the effects of these values on aggregation in a future lecture.

## Getting to know your data

To get to know your data, use the following data frame methods.

* `df.head()`        first five rows
* `df.tail()`        last five rows
* `df.sample(5)`     random sample of rows
* `df.shape`         number of rows/columns in a tuple
* `df.describe()`    calculates summary statistics (such as count, mean, and min/max) for each column

#### Getting the number of rows and columns using `shape`

In [None]:
df.shape

#### Getting summary statistics for each column with `describe`

In [None]:
df.describe()

## <font color="red"> Exercise 6.1.3</font>

**Tasks.**

* Use various methods to inspect the `./sample_data/Artwork.csv` data from MoMA 
* Write a short summary of what you learn.

In [None]:
# Your code here (open new code cells for each method)

> Your thoughts here (open new markdown cells for each method)