## ![logo](../../img/license_header_logo.png)
> **Copyright &copy; 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful,
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.

# <a name="top">02 - Basic Functions</a>
Authored by: Scotrraaj Gopal - scotrraaj.gopal@certifai.ai

## <a name="description">Notebook Description</a>

There are many basic functions in `pandas` that you will need to know in order to get reacquainted with your data quickly from time to time. This notebook covers the functions you will use every time you have a data-related task to do with `pandas`.

By the end of this tutorial, you will be able to:

1. View your data.
2. Obtain general info from your data.
3. Obtain special info from your data.

## Notebook Outline
Here's the outline for this tutorial:
1. [Notebook Description](#description)
2. [Notebook Configurations](#configuration)
3. [View your data](#view)
4. [Gain info from data](#gain)
    - [General](#general)
    - [Special](#special)
5. [Summary](#summary)
6. [Reference](#reference)

## <a name="configuration">Notebook Configurations</a>

First of all, the `pandas` module has to be imported. We'll also use a real dataset containing reviews of wine from Wine Magazine. 

Import the module and the dataset from `../../Datasets/pandas/winemag-data-130k-v2.csv`. Name the `DataFrame` object `reviews`.

In [1]:
### BEGIN SOLUTION
import pandas as pd
reviews = pd.read_csv("../../Datasets/pandas/winemag-data-130k-v2.csv", index_col=0)
### END SOLUTION


reviews

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,90,28.0,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,Citation is given as much as a decade of bottl...,,90,75.0,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,Well-drained gravel soil gives this wine its c...,Kritt,90,30.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,90,32.0,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss
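
The `index_col=0` argument in the loading cell tells `read_csv` to use the first column of the file as the row index rather than as a regular data column. The sketch below illustrates the difference on a tiny made-up CSV string (the `csv_text` contents are an assumption for illustration, not part of the wine dataset).

```python
import io

import pandas as pd

# A tiny, made-up CSV whose first (unnamed) column holds row labels,
# mimicking a file that was saved together with its index.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n2,US,90\n"

# Without index_col, the label column is read as ordinary data.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```

Comparing the two column lists shows that the extra label column disappears from the data once it is used as the index.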


## <a name="view">View your data</a>

We can examine the contents of our `DataFrame` using the `.head()` method. By default, it returns the **first five rows only**. You may use it any time you'd like a quick glance at the structure of the data without printing every row, which can be slow when the dataset is big.

In [None]:
### BEGIN SOLUTION
reviews.head()
### END SOLUTION


The method accepts an integer argument specifying how many rows to return. Similarly, `.tail()` returns the last five rows by default.

Try to create a visual reference of the **last 10 rows** of data with `.tail()`.

In [None]:
### BEGIN SOLUTION
reviews.tail(10)
### END SOLUTION


When we load a dataset, we usually look at the first five or six rows to see what's going on. Here we can see the titles of each column, the index, and samples of values in each row.
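
As a self-contained illustration of how the row-count argument behaves, here is a minimal sketch on a small made-up `DataFrame` (the name `toy` and its values are assumptions for illustration only).

```python
import pandas as pd

# A made-up DataFrame with 20 rows, labelled 0 to 19.
toy = pd.DataFrame({"points": list(range(80, 100))})

default_head = toy.head()   # no argument: first five rows
first_three = toy.head(3)   # first three rows
last_ten = toy.tail(10)     # last ten rows

print(len(default_head), len(first_three), len(last_ten))
print(last_ten.index.tolist())
```

Both methods return a new `DataFrame`, so the result can be stored in a variable or chained with other methods.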

## <a name="gain">Gain info from data</a>
The `pandas` module has built-in methods that make analyzing your data much easier by supplying insights about it quickly.

### <a name="general">General info</a>
The `DataFrame` object has a method called `.info()` that gives you more information about the dataset.

In [None]:
### BEGIN SOLUTION
reviews.info()
### END SOLUTION


The result tells us a few essential details about the dataset:
1. The number of rows and columns.
2. The name of each column and its data type.
3. The number of non-null values in each column.
4. How much memory your `DataFrame` is using.

> *Seeing the `dtypes` quickly is actually quite useful. Imagine you just imported some JSON and the integers were recorded as strings. You go to do some calculation and hit an "unsupported operand" exception because you can't do math with strings. Calling `.info()` will quickly point out that the column you thought was all integers actually contains strings.*
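
The situation described in the note can be sketched as follows. This is a minimal example on made-up data (the `raw` frame is hypothetical), and `pd.to_numeric` is just one possible way to repair such a column, not necessarily the one you would choose for real data.

```python
import pandas as pd

# Made-up data where the scores were accidentally stored as strings.
raw = pd.DataFrame({"points": ["87", "90", "88"]})

# .info() reports the column as non-numeric, which is the warning sign.
raw.info()
was_numeric = pd.api.types.is_numeric_dtype(raw["points"])

# Convert the strings to numbers so that arithmetic works as expected.
raw["points"] = pd.to_numeric(raw["points"])
is_numeric = pd.api.types.is_numeric_dtype(raw["points"])
total = raw["points"].sum()

print(was_numeric, is_numeric, total)
```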

Another fast and useful attribute is `.shape`, which returns just a tuple of (rows, columns).

> *Note: Observe that there are no parentheses after* `.shape`, *because it is an attribute rather than a method.*

In [None]:
### BEGIN SOLUTION
reviews.shape
### END SOLUTION
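
Because `.shape` is a plain tuple, it can be unpacked directly into two variables. The sketch below uses a made-up three-row `DataFrame` (the name `toy` is an assumption for illustration) and also shows what happens if parentheses are added by mistake.

```python
import pandas as pd

# A made-up DataFrame with 3 rows and 2 columns.
toy = pd.DataFrame({"country": ["Italy", "US", "France"], "points": [87, 90, 88]})

# Unpack the (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# Calling .shape like a method fails, because a tuple is not callable.
try:
    toy.shape()
except TypeError as err:
    call_error = str(err)
    print("TypeError:", call_error)
```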


### <a name="special">Special info</a>

To show a quick statistical summary of your data, you may use the `.describe()` method.

In [None]:
### BEGIN SOLUTION
reviews.describe(include="all")
### END SOLUTION


This method generates a high-level summary of each column. It is type-aware, meaning that its output changes based on the data type of the input. In the output above, statistics such as the mean and the quartiles are only meaningful for the columns with numerical data; for string data (`dtype = "object"`), here's what we get:

In [None]:
### BEGIN SOLUTION
reviews.taster_name.describe()
### END SOLUTION


This descriptive information about our `taster_name` column shows that it has 103,727 non-null entries and 19 unique values. The most frequent value (`top`) is `"Roger Voss"`, and this name appears 25,514 times in the column.
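
To make the type-aware behaviour concrete, here is a minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration). It calls `.describe()` on one numeric column and one text column and compares the statistics each summary contains.

```python
import pandas as pd

# Made-up data: one text column with a missing value, one numeric column.
toy = pd.DataFrame(
    {
        "taster": ["Ann", "Bob", "Ann", None],
        "points": [87, 90, 88, 91],
    }
)

numeric_summary = toy["points"].describe()  # count, mean, std, min, quartiles, max
text_summary = toy["taster"].describe()     # count, unique, top, freq

print(numeric_summary.index.tolist())
print(text_summary.index.tolist())
```

In the text summary, `count` ignores the missing entry, `top` is the most frequent value and `freq` is how many times it occurs.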

To get a full list of unique values, one could use the `.unique()` method.

In [None]:
### BEGIN SOLUTION
reviews.taster_name.unique()
### END SOLUTION


The unique values are returned in order of appearance and are not sorted.

To see a list of unique values **and** how often they repeat themselves in the column, we can use the `.value_counts()` method:

In [None]:
### BEGIN SOLUTION
reviews.taster_name.value_counts()
### END SOLUTION
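
The difference between `.unique()` and `.value_counts()` is easiest to see on a short made-up `Series` that contains a missing value (the `tasters` data below is an assumption for illustration). Note that `.value_counts()` sorts by frequency and skips missing values by default, while `dropna=False` keeps them.

```python
import pandas as pd

# Made-up names with one missing entry.
tasters = pd.Series(["Roger", "Kerin", "Roger", None, "Paul", "Roger"])

uniques = tasters.unique()                           # order of first appearance, missing value included
counts = tasters.value_counts()                      # most frequent first, missing values skipped
counts_with_na = tasters.value_counts(dropna=False)  # missing values counted too

print(len(uniques))
print(counts.index[0], counts.iloc[0])
print(counts.sum(), counts_with_na.sum())
```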


Finally, if you would like to get a particular simple summary statistic about a column in a `DataFrame` or a `Series`, there is usually a helpful `pandas` method to make that happen.

As an example, to see the mean of the `points` column, we can use the `.mean()` method.

In [None]:
### BEGIN SOLUTION
reviews.points.mean()
### END SOLUTION
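
Other single-value summaries follow the same pattern as `.mean()`. The sketch below applies a few of them to a small made-up `Series` (the `points` values are assumptions for illustration, not taken from the wine dataset).

```python
import pandas as pd

# Made-up scores.
points = pd.Series([87, 90, 88, 91])

mean_points = points.mean()      # arithmetic mean
median_points = points.median()  # middle value of the sorted scores
lowest = points.min()
highest = points.max()

print(mean_points, median_points, lowest, highest)
```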


## <a name="summary">Summary</a>
To conclude, you should now be able to:

1. View your data.
2. Obtain general info from your data.
3. Obtain special info from your data.

Congratulations, that concludes this lesson. In the next lesson, we will explore common operations such as slicing, selecting and extracting data from a `DataFrame` in `pandas`.

See you!

## <a name="reference">Reference</a>
* [Dataset Source](https://www.kaggle.com/zynicide/wine-reviews)

<font size=2>[Back to Top](#top)</font>