![logo](../../img/license_header_logo.png)
> **Copyright &copy; 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program and the accompanying materials are made available under the
terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). <br>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License. <br>
<br>**SPDX-License-Identifier: Apache-2.0**

# <a name="top">01 - Fundamentals of `pandas`</a>
Authored by: Scotrraaj Gopal - scotrraaj.gopal@certifai.ai

## <a name="description">Notebook Description</a>

This series of notebooks is an introduction to the `pandas` module, the most popular Python library for analyzing tabular data. This notebook, in particular, focuses on creating the two core data objects of `pandas`: `Series` and `DataFrame`.

By the end of this tutorial, you will be able to:

1. Explain the `Series` and `DataFrame` objects
2. Create `Series` and `DataFrame` objects
3. Read data from CSV and JSON
4. Write data to CSV and JSON

## Notebook Outline
Here's the outline for this tutorial:
1. [Notebook Description](#description)
2. [Notebook Configurations](#configuration)
3. [Core Components](#core)
    - [Series](#series)
    - [DataFrame](#dataframe)
4. [Reading Data Files](#read)
    - [Read from CSV](#read_csv)
    - [Read from JSON](#read_json)
5. [Writing Data Files](#write)
    - [Write to CSV](#write_csv)
    - [Write to JSON](#write_json)
6. [Summary](#summary)
7. [Reference](#reference)

## <a name="configuration">Notebook Configurations</a>

First of all, the `pandas` module has to be imported. Since we'll be using it so much, let's use the common alias `pd`.

In [2]:
### BEGIN SOLUTION
import pandas as pd
### END SOLUTION

## <a name="core">Core Components</a>

The core components of `pandas` are the `Series` and `DataFrame` objects.

|  | cats |
|-|:-:|
| 0 | 3 |
| 1 | 7 |
| 2 | 1 |
| 3 | 0 |

<center><i>Series</i></center>

|  | cats | dogs |
|-|:-:|:-:|
| 0 | 3 | 4 |
| 1 | 7 | 1 |
| 2 | 1 | 6 |
| 3 | 0 | 9 |

<center><i>DataFrame</i></center>

But before we get ahead of ourselves, let's visualize some of the attributes of the data objects so that you will be able to relate to them when creating your data objects.

![01-00](../../img/pandas/01-00.png)

You should also take note of the data types, or `dtypes`, in `pandas`. When doing data analysis, it is important to use the correct data types; otherwise you may get unexpected results or errors. Most of the time we let `pandas` infer the data type automatically from the data that we input, but at some point in your data processing you will likely need to convert data explicitly from one type to another.

| `pandas` dtypes | Python | Usage |
|:-:|:-:|:-:|
| object | str or mixed | Text or mixed numeric and <br>non-numeric values |
| int64 | int | Integer numbers |
| float64 | float | Floating point numbers |
| bool | bool | True/False values |
| datetime64[ns] | datetime | Date and time values |
| timedelta64[ns] | NA | Differences between <br>two datetimes |
| category | NA | Finite list of text <br>values |
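
As a minimal sketch of the inference and conversion described above, the cell below builds a few small `Series` and inspects their `dtype`. The variable names and sample values are made up for illustration; `astype()` is one common way to convert explicitly, not the only one.

```python
import pandas as pd

# pandas infers a dtype from the values it is given
ints = pd.Series([3, 7, 1, 0])           # whole numbers
floats = pd.Series([3.0, 7.5, 1.2])      # decimal numbers
flags = pd.Series([True, False, True])   # True/False values

print(ints.dtype, floats.dtype, flags.dtype)

# Numbers stored as text are not turned into integers automatically;
# astype() performs the explicit conversion
text_numbers = pd.Series(["3", "7", "1", "0"])
converted = text_numbers.astype("int64")
print(converted.dtype)
```

Checking `dtype` right after loading or creating data is a cheap way to catch numbers that were accidentally read as text.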


### <a name="series">Series</a>

Essentially, a `Series` is a single column of 1-dimensional, homogeneous data. Homogeneous means that every value in the column has the same data type. For example, the following `Series` is a collection of integers:

|  | cats |
|-|:-:|
| 0 | 3 |
| 1 | 7 |
| 2 | 1 |
| 3 | 0 |

You may create the object with the `pd.Series(data, index, dtype, name)` constructor.

In [None]:
# In Jupyter, the Shift+Tab+Tab shortcut shows hints for a method.
# Press Tab to autocomplete or get suggestions.
# IMPORTANT: Check the default values of a method's parameters! Some default to None while others are inferred from the input.

### BEGIN SOLUTION
series = pd.Series(data=[3,7,1,0], name='cats')
### END SOLUTION


series

In the previous example, you created a `Series` from a Python list. Also observe that the index for the series was generated automatically, since nothing was passed to the `index` parameter. The `name` of the series becomes the column label when multiple series are concatenated into a `DataFrame`.
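
For comparison, here is a small sketch that passes the `index` parameter explicitly. The day labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Same data as before, but with custom row labels instead of 0, 1, 2, 3
labelled = pd.Series(
    data=[3, 7, 1, 0],
    index=["mon", "tue", "wed", "thu"],  # hypothetical labels
    name="cats",
)

print(labelled.index.tolist())  # the labels we supplied
print(labelled.name)            # the series name
print(labelled["tue"])          # look up a value by its label
```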

Another way to create a `Series` is with a Python `dict`. Keep in mind that the keys of the `dict` become the `index` labels of the `Series`. This is not the case when using a `dict` to create a `DataFrame`. If the `index` parameter is passed along with the `dict`, the resulting `Series` follows the labels given in that parameter.

> *Note: Good indexing makes data easier to locate. Basic operations such as locating data by index are covered in the coming notebooks.*

In [None]:
### BEGIN SOLUTION
# Named `data` rather than `dict` to avoid shadowing Python's built-in dict type
data = {'a': 5, 'b': 3, 'c': 8, 'd': 9}
series = pd.Series(data)
### END SOLUTION


series
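
The sketch below illustrates what happens when `index` is passed together with a `dict`. The labels are made up, and `"z"` is deliberately absent from the dictionary.

```python
import pandas as pd

counts = {"a": 5, "b": 3, "c": 8, "d": 9}

# Only the requested labels are kept, in the requested order.
# A label with no matching key ("z") gets a missing value (NaN).
subset = pd.Series(counts, index=["d", "a", "z"])

print(subset)
```

Because `NaN` is a floating point value, the resulting `Series` is stored as `float64` rather than `int64`.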

### <a name="dataframe">DataFrame</a>

A `DataFrame` is a 2-dimensional table that can hold heterogeneous data, meaning each column may have a different data type. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a *row* and a *column*:

|  | animals | quantity |
|-|:-:|:-:|
| 0 | cat | 4 |
| 1 | dog | 1 |
| 2 | fish | 6 |
| 3 | hamster | 9 |

You may create the object with the `pd.DataFrame(data, index, columns, dtype)` constructor.

In [None]:
### BEGIN SOLUTION
# Named `data` rather than `dict` to avoid shadowing Python's built-in dict type
data = {
    "animals": ["cat", "dog", "fish", "frog"],
    "quantity": [4, 1, 6, 9]
}
df = pd.DataFrame(data=data)
### END SOLUTION


df

In the example above, we used a `dict` object to create the `DataFrame`. Unlike with `Series`, when a `dict` is passed to the `DataFrame` constructor, its keys become the column labels.
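
The `index` and `columns` parameters of the constructor can be sketched with a list of rows instead of a `dict`. The row labels `w` to `z` below are hypothetical.

```python
import pandas as pd

# Each inner list is one row; column labels are supplied separately
rows = [["cat", 4], ["dog", 1], ["fish", 6], ["frog", 9]]

df_rows = pd.DataFrame(
    data=rows,
    index=["w", "x", "y", "z"],        # hypothetical row labels
    columns=["animals", "quantity"],   # column labels
)

print(df_rows.shape)
print(df_rows)
```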

You can also use `pd.concat` to concatenate two `Series` objects into a `DataFrame`.

In [None]:
column1 = pd.Series(["cat","dog","fish","frog"], name= "animals")
column2 = pd.Series([4,1,6,9], name = "quantity")

### BEGIN SOLUTION
df = pd.concat([column1,column2], axis=1)
### END SOLUTION


print(type(df))
print(df.shape)
df

Note that you have to set the `axis` parameter to state explicitly **which direction** you want the data to be concatenated in. It is advisable to master these constructors and understand how their parameters affect the object before you continue your learning journey.
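
To make the effect of `axis` concrete, here is a small sketch that concatenates the same two `Series` in both directions. The shortened data is for illustration only.

```python
import pandas as pd

column1 = pd.Series(["cat", "dog"], name="animals")
column2 = pd.Series([4, 1], name="quantity")

# axis=1 places the series side by side, giving a DataFrame with two columns
side_by_side = pd.concat([column1, column2], axis=1)
print(side_by_side.shape)

# axis=0 (the default) stacks them end to end, giving one longer Series
stacked = pd.concat([column1, column2], axis=0)
print(type(stacked))
print(stacked.index.tolist())
```

With `axis=0` the original index labels are kept, so the stacked result contains repeated labels.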

## <a name="read">Reading Data Files</a>
Being able to create a `Series` or `DataFrame` manually is handy but, most of the time, we won't be doing that. Instead, we will be working with data that already exists. Data can be stored in various formats. By far the most basic of these is the CSV file.

### <a name="read_csv">Read from CSV</a>
CSV stands for comma-separated values. So, a CSV file is a table of values separated by commas.

Let's use the `pd.read_csv()` function to read a sample file at `../../Datasets/pandas/sample.csv`.

In [None]:
### BEGIN SOLUTION
sample = pd.read_csv("../../Datasets/pandas/sample.csv", index_col=[0])
### END SOLUTION


sample

Great! As you may have noticed, `pd.read_csv` has many parameters, which makes it a very powerful and flexible function. The `sep` parameter can be set to whichever delimiter your data file uses. Other common delimiters include tabs (`\t`), spaces (` `) and colons (`:`). This function can also be used for plain text files in the `.txt` format.

Let's try to import data from one of those at `../../Datasets/pandas/sample.txt`

In [None]:
### BEGIN SOLUTION
sample = pd.read_csv("../../Datasets/pandas/sample.txt", index_col = 0)
### END SOLUTION


sample
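
Since the contents of the sample files are not shown here, the following sketch uses made-up in-memory text (wrapped in `io.StringIO`) to illustrate how the `sep` parameter handles different delimiters.

```python
import io

import pandas as pd

# Made-up tab-separated text standing in for a file on disk
tab_text = "animals\tquantity\ncat\t4\ndog\t1\n"
tab_df = pd.read_csv(io.StringIO(tab_text), sep="\t")

# The same data, but delimited with colons
colon_text = "animals:quantity\ncat:4\ndog:1\n"
colon_df = pd.read_csv(io.StringIO(colon_text), sep=":")

print(tab_df)
print(colon_df)
```

Both calls produce the same table; only the `sep` value changes to match the delimiter.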

Observe that `pandas` provides a great number of options for reading data files in various formats. Another common format for storing data is the `.json` file.

### <a name="read_json">Read from JSON</a>
Big data sets in database systems are often extracted to or stored in JSON files. JSON is a plain-text, human-readable format and a popular standard for exchanging data between clients and servers.

Let's import the same data but from a file with `.json` extension from `../../Datasets/pandas/sample.json`

In [None]:
### BEGIN SOLUTION
sample = pd.read_json("../../Datasets/pandas/sample.json")
### END SOLUTION


sample

Note that when importing a well-formatted `.json` file, the index is inferred correctly without the need to specify any parameters. You may also pass URLs to the read functions.
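
As a small illustration that does not depend on the sample file, the sketch below reads a made-up JSON string laid out as a list of records. The `orient="records"` argument describes that layout; the actual layout of `sample.json` may differ.

```python
import io

import pandas as pd

# Made-up JSON text: a list with one object per row
json_text = '[{"animals": "cat", "quantity": 4}, {"animals": "dog", "quantity": 1}]'

from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_json)
print(from_json.dtypes)
```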

`pandas` also offers functions to read binary files that are not human-readable, such as `.pkl`. Each read function has its own parameters for dealing with the import issues that are typical of its file format. Hence, always use the read function that matches the given file format.

## <a name="write">Writing Data Files</a>

Lastly, it is of course important to be able to save the data that you have analyzed to your computer. This is straightforward in `pandas`, as it supports many different data formats by default.

### <a name="write_csv">Write to CSV</a>
The most typical output format is the CSV file. You can easily save the data in a `DataFrame` to a CSV file using the `to_csv()` method.

In [3]:
data = {
    "animals": ["cat","dog","fish","frog"],
    "quantity": [4,1,6,9]
}
df = pd.DataFrame(data)

### BEGIN SOLUTION
df.to_csv("output.csv")
### END SOLUTION


df
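
By default, `to_csv()` also writes the row index as the first column. The sketch below shows one way to leave it out with `index=False` and then read the file back. It writes to a temporary directory, and the file name `animals.csv` is made up for illustration.

```python
import os
import tempfile

import pandas as pd

df_out = pd.DataFrame({"animals": ["cat", "dog"], "quantity": [4, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "animals.csv")  # hypothetical file name

    # index=False leaves the row labels out of the file
    df_out.to_csv(path, index=False)

    # Inspect the raw text, then load it back into a DataFrame
    with open(path) as f:
        csv_text = f.read()
    df_back = pd.read_csv(path)

print(csv_text)
print(df_back)
```

Dropping the index on write means `index_col` is not needed when the file is read again.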


### <a name="write_json">Write to JSON</a>
We can write our data to a JSON file with the `to_json()` method.

In [None]:
data = {
    "animals": ["cat","dog","fish","frog"],
    "quantity": [4,1,6,9]
}
df = pd.DataFrame(data)

### BEGIN SOLUTION
df.to_json("output.json")
### END SOLUTION


df
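
The layout of the JSON output can be changed with the `orient` parameter. As a rough sketch, calling `to_json()` without a path returns the JSON text as a string, which makes the difference between the default layout and `orient="records"` easy to inspect.

```python
import json

import pandas as pd

df_out = pd.DataFrame({"animals": ["cat", "dog"], "quantity": [4, 1]})

# Default layout: one object per column, keyed by the row index
default_json = df_out.to_json()

# "records" layout: a list with one object per row
records_json = df_out.to_json(orient="records")

print(json.loads(default_json))
print(json.loads(records_json))
```

The records layout is often easier for other programs to consume, while the default layout preserves the row index.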

##  <a name="summary">Summary</a>
To conclude, you should now be able to:
1. Explain the `Series` and `DataFrame` objects
2. Create `Series` and `DataFrame` objects
3. Read data from CSV and JSON
4. Write data to CSV and JSON

Congratulations, that concludes this lesson. In the next lesson, we will explore the must-know basic functions for data analysis with `pandas`.

See you!

## <a name="reference">Reference</a>
* [Attribute Visualization Source](https://geo-python.github.io/site/notebooks/L5/exploring-data-using-pandas.html)
* [Dataset Source](https://www.kaggle.com/zynicide/wine-reviews)

<font size=2>[Back to Top](#top)</font>