## ![logo](../../img/license_header_logo.png)
> **Copyright &copy; 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program and the accompanying materials are made available under the
terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). <br>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License. <br>
<br>**SPDX-License-Identifier: Apache-2.0**

# <a name="top">02 - Basic Functions</a>
Authored by: Scotrraaj Gopal - scotrraaj.gopal@certifai.ai

## <a name="description">Notebook Description</a>

`pandas` has many basic functions that you will need from time to time to quickly get reacquainted with your data. This notebook covers some of the functions you'll use every time you have a data-related task to do with `pandas`.

By the end of this tutorial, you will be able to:

1. View your data.
2. Obtain general info from your data.
3. Obtain special info from your data.

## Notebook Outline
Here's the outline for this tutorial:
1. [Notebook Description](#description)
2. [Notebook Configurations](#configuration)
3. [View your data](#view)
4. [Gain info from data](#gain)
    - [General](#general)
    - [Special](#special)
5. [Summary](#summary)
6. [Reference](#reference)

## <a name="configuration">Notebook Configurations</a>

First of all, the `pandas` module has to be imported. We'll also use a real dataset containing reviews of wine from Wine Magazine. 

Import the module and the dataset from `../../Datasets/pandas/winemag-data-130k-v2.csv`. Name the `DataFrame` object `reviews`.

In [None]:
# YOUR CODE HERE


reviews
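
As a rough illustration of the loading step, the sketch below reads a tiny inline CSV rather than the real file, so it runs without the dataset on disk. The three rows, the `country` column, the placeholder taster names, and the use of `index_col=0` are assumptions made for the example, not a description of the actual file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the wine reviews CSV.
# The leading comma in the header marks an unnamed first column.
csv_text = """,country,points,taster_name
0,Italy,87,Taster A
1,Portugal,87,Roger Voss
2,US,86,Taster B
"""

# index_col=0 uses the first column as the row index
# (an assumption about how the file was saved).
sample_reviews = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample_reviews)
```

With a file on disk, the same call would take the file path as its first argument instead of the `io.StringIO` object.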

## <a name="view">View your data</a>

We can examine the contents of our `DataFrame` using the `.head()` method. By default, it returns the **first five rows only**. Use it whenever you want a quick glance at the structure of the data without printing every row, which can be slow when the dataset is large.

In [None]:
# YOUR CODE HERE


The method accepts a single integer argument that sets the number of rows to return. Similarly, `.tail()` returns the last five rows by default.

Try to create a visual reference of the **last 10 rows** of data with `.tail()`.

In [None]:
# YOUR CODE HERE


When we load a dataset, we usually look at the first five or six rows to see what's going on. Here we can see the titles of each column, the index, and samples of values in each row.
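
To make the defaults concrete, here is a minimal sketch on a hypothetical 12-row `DataFrame` named `toy` (not the wine dataset). It shows `.head()` and `.tail()` with and without an integer argument.

```python
import pandas as pd

# Hypothetical toy data: 12 rows, so the default of five rows is visible.
toy = pd.DataFrame({"points": range(80, 92)})

first_five = toy.head()     # no argument: first 5 rows
first_three = toy.head(3)   # integer argument: first 3 rows
last_five = toy.tail()      # no argument: last 5 rows

print(first_three)
print(last_five)
```

Both methods return a new `DataFrame`, so the result can be stored in a variable or inspected directly.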

## <a name="gain">Gain info from data</a>
The `pandas` module has many built-in methods that make analyzing your data much easier by supplying insights quickly.

### <a name="general">General info</a>
The `DataFrame` object has a method called `.info()` that gives you more information about the dataset.

In [None]:
# YOUR CODE HERE


The result tells us a few essential details about the dataset:
1. The number of rows and columns.
2. The name of each column and its data type.
3. The number of non-null values in each column.
4. How much memory your `DataFrame` is using.
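
The sketch below shows where those details come from, using a hypothetical three-row `DataFrame` with a few missing values. `.info()` prints its report and returns `None`, so the example passes a `buf` argument to capture the text in a string. The column names and values are assumptions made for the illustration.

```python
import io

import pandas as pd

# Hypothetical toy data with some missing values.
toy = pd.DataFrame({
    "points": [87, 90, 85],
    "price": [15.0, None, 14.0],
    "taster_name": ["Roger Voss", None, "Roger Voss"],
})

# Capture the report in a buffer instead of printing it directly.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```

Calling `toy.info()` with no arguments prints the same report straight to the notebook output.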

> *Seeing the `dtypes` quickly is quite useful. Imagine you have just imported some JSON in which the integers were recorded as strings. You try a calculation and hit an "unsupported operand" exception because you can't do math with strings. Calling `.info()` will quickly reveal that the column you thought held integers actually holds strings.*
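
A small sketch of that situation, assuming a hypothetical `qty` column whose integers arrived as strings. It checks the `dtypes` and then converts the column with `pd.to_numeric`, which is one common fix and not the only one.

```python
import pandas as pd

# Hypothetical data: the integers were recorded as strings.
raw = pd.DataFrame({"qty": ["1", "2", "3"]})
print(raw.dtypes)  # reports a string/object dtype, not an integer dtype

# Convert the strings to numbers before doing arithmetic.
converted = pd.to_numeric(raw["qty"])
total = converted.sum()

print(converted.dtype, total)
```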

Another fast and useful attribute is `.shape`, which returns a tuple of (rows, columns).

> *Note: `.shape` is an attribute, not a method, so it is written without parentheses.*

In [None]:
# YOUR CODE HERE
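
For illustration, a minimal sketch on a hypothetical three-row, two-column `DataFrame`. Because `.shape` is a plain tuple, it can be unpacked into separate row and column counts.

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns.
toy = pd.DataFrame({"points": [87, 90, 85], "price": [15.0, 20.0, 14.0]})

print(toy.shape)  # (3, 2)

# Unpack the tuple into row and column counts.
n_rows, n_cols = toy.shape
```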


### <a name="special">Special info</a>

To show a quick statistical summary of your data, you can use the `.describe()` method.

In [None]:
# YOUR CODE HERE


This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above only makes sense for the columns with numerical data; for string data (`dtype = "object"`) here's what we get:

In [None]:
# YOUR CODE HERE


The descriptive information for our `taster_name` column shows that it holds 103,727 non-null values, 19 of them unique. The most frequent value is `"Roger Voss"`, which appears 25,514 times in the column.
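
The type-aware behaviour can be seen side by side in the sketch below. It uses a hypothetical three-row `DataFrame`, not the wine dataset, and calls `.describe()` once on a numeric column and once on a string column.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one string column.
toy = pd.DataFrame({
    "points": [80, 90, 100],
    "taster_name": ["Roger Voss", "Taster A", "Roger Voss"],
})

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = toy["points"].describe()

# String column: count, unique, top (most frequent value), freq.
text_summary = toy["taster_name"].describe()

print(numeric_summary)
print(text_summary)
```

Each summary is a `Series`, so individual entries can be read by label, for example `text_summary["top"]`.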

To get a full list of unique values, one could use the `.unique()` method.

In [None]:
# YOUR CODE HERE


The unique values are returned in order of appearance; they are not sorted.
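
A minimal sketch of that ordering, using a hypothetical `Series` of names. Wrapping the result in Python's built-in `sorted` is one way to get an alphabetical list when that is needed.

```python
import pandas as pd

# Hypothetical names; "Roger Voss" appears twice.
names = pd.Series(["Roger Voss", "Taster B", "Roger Voss", "Taster A"])

# Order of first appearance, duplicates dropped.
unique_names = names.unique()
print(unique_names)

# Sort explicitly if an alphabetical list is needed.
sorted_names = sorted(unique_names)
print(sorted_names)
```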

To see a list of unique values **and** how often they repeat themselves in the column, we can use the `.value_counts()` method:

In [None]:
# YOUR CODE HERE
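
As an illustration on made-up data, the sketch below counts a hypothetical `Series` of names. By default `.value_counts()` sorts from most to least frequent, and `normalize=True` returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical names with different frequencies.
names = pd.Series([
    "Roger Voss", "Taster A", "Roger Voss",
    "Roger Voss", "Taster A", "Taster B",
])

# Counts per unique value, most frequent first.
counts = names.value_counts()
print(counts)

# Proportions instead of raw counts.
shares = names.value_counts(normalize=True)
print(shares)
```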


Finally, if you would like to get some particular simple summary statistic about a column in a `DataFrame` or a `Series`, there is usually a helpful `pandas` function to make that happen.

As an example, to see the mean of the `points` column, we can use the `.mean()` method.

In [None]:
# YOUR CODE HERE
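
A short sketch of a few such summary methods, run on a hypothetical three-value `points` column rather than the real data. `.median()`, `.min()`, and `.max()` follow the same pattern as `.mean()`.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"points": [80, 90, 100]})

mean_points = toy["points"].mean()
median_points = toy["points"].median()
lowest = toy["points"].min()
highest = toy["points"].max()

print(mean_points, median_points, lowest, highest)
```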


##  <a name="summary">Summary</a>
To conclude, you should now be able to:

1. View your data.
2. Obtain general info from your data.
3. Obtain special info from your data.

Congratulations, that concludes this lesson. In the next lesson, we will explore common operations such as slicing, selecting, and extracting data from a `DataFrame` in `pandas`.

See you!

## <a name="reference">Reference</a>
* [Dataset Source](https://www.kaggle.com/zynicide/wine-reviews)

<font size=2>[Back to Top](#top)</font>