## ![logo](../../img/license_header_logo.png)
> **Copyright &copy; 2021 CertifAI Sdn. Bhd.**<br>
 <br>
This program and the accompanying materials are made available under the
terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). <br>
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License. <br>
<br>**SPDX-License-Identifier: Apache-2.0**

# <a name="top">06 - Exercise for Common Operations & Data Cleanup and Missing Data</a>
Authored by: Scotrraaj Gopal - scotrraaj.gopal@certifai.ai

## <a name="description">Notebook Description</a>

Applying knowledge in practice helps to consolidate it. This notebook contains exercises that test your understanding and help you identify which topics need more reading and practice.

By the end of this exercise, you will be able to:

1. Relate and apply the common operations in `pandas`.
2. Explain and apply data cleanup methods in `pandas`.
3. Understand, detect and deal with missing data with `pandas`.

## Notebook Outline
Here's the outline for this notebook:
1. [Notebook Description](#description)
2. [Notebook Configurations](#configuration)
3. [Section I - Common Operations](#common)
4. [Section II - Data Cleanup and Missing Data](#missing)
5. [Summary](#summary)
6. [Reference](#reference)

## <a name="configuration">Notebook Configurations</a>

**Task 0:** Import `pandas` with alias `pd`.

**Expected output:**
>![05-00](../../img/pandas/05-00.png)

In [1]:
### BEGIN SOLUTION
import pandas as pd
### END SOLUTION


dir(pd)

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Flags',
 'Float32Dtype',
 'Float64Dtype',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseDtype',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_dev',
 '_lib',
 '_libs',
 '_np_version_under1p17',
 '_np_version_under1p18',
 '_testing'

## <a name="common">Section I - Common Operations</a>

**Task I-1:** Load the CSV file at `../../Datasets/pandas/winemag-data-130k-v2.csv` into the `wine` variable and show only the first three rows.

**Expected Output:**
>![05-04](../../img/pandas/05-04.png)

In [2]:
### BEGIN SOLUTION
wine = pd.read_csv("../../Datasets/pandas/winemag-data-130k-v2.csv", index_col=0)
wine.head(3)
### END SOLUTION


Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
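
The solution above passes `index_col=0` so that the first CSV column becomes the row index instead of a regular column. The sketch below illustrates the difference on a small, made-up CSV string (`csv_text` and its contents are hypothetical and only mimic the layout of the wine file).

```python
import io

import pandas as pd

# Hypothetical two-row CSV whose first column is an unnamed row index.
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

# Without index_col, the unnamed first column is kept as a data column.
with_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_default.columns.tolist())
print(with_index.columns.tolist())
```

Without `index_col`, `pandas` labels the unnamed column `Unnamed: 0`, which is usually not what you want when the column only holds row numbers.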


**Task I-2:** Extract the `description` column and assign it to the variable `desc`.

**Expected Output:**
>![06-01](../../img/pandas/06-01.png)

In [None]:
### BEGIN SOLUTION
desc = wine.description
# desc = wine['description']
# desc = wine.loc[:,'description']
### END SOLUTION


desc
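
The three alternatives in the solution above select the same column. As a minimal sketch on a made-up `DataFrame` (`toy` and its values are hypothetical), you can confirm that they return equal `Series` objects:

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine dataset.
toy = pd.DataFrame({"description": ["crisp", "smoky"], "points": [87, 90]})

by_attribute = toy.description          # attribute access
by_key = toy["description"]             # dictionary-style access
by_loc = toy.loc[:, "description"]      # label-based access: all rows, one column

# All three approaches yield the same Series.
print(by_attribute.equals(by_key) and by_key.equals(by_loc))
print(type(by_key).__name__)
```

Attribute access only works when the column name is a valid Python identifier and does not clash with an existing `DataFrame` attribute, so the bracket form is the safer general choice.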

**Task I-3:** Create a `DataFrame` object called `df` made up of rows with index labels `15`,`18`,`45`,`700` and column labels `country`, `province`, and `title`.

**Expected Output:**
>![06-02](../../img/pandas/06-02.png)

In [None]:
### BEGIN SOLUTION
df = wine.loc[[15,18,45,700],['country','province','title']]
### END SOLUTION


df
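
As a small illustration of label-based selection, the sketch below uses a made-up `DataFrame` (`toy`, with hypothetical values) whose index labels are not consecutive. `loc` accepts one list of row labels and one list of column labels.

```python
import pandas as pd

# Hypothetical toy data with non-consecutive index labels.
toy = pd.DataFrame(
    {
        "country": ["Italy", "US", "Spain", "France"],
        "province": ["Sicily", "Oregon", "Rioja", "Alsace"],
        "points": [87, 87, 86, 90],
    },
    index=[15, 18, 45, 700],
)

# Select rows labelled 18 and 700, and two of the three columns.
subset = toy.loc[[18, 700], ["country", "province"]]
print(subset)
```

Note that `loc` matches index labels, not row positions: `700` here refers to the label, even though the frame has only four rows.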

**Task I-4:** Generate a `DataFrame`, `df`, containing the rows at positions 60 to 66 and the columns at positions 3 to 7 (zero-based positions, both ends inclusive).

**Expected Output:**
>![06-03](../../img/pandas/06-03.png)

In [None]:
### BEGIN SOLUTION
df = wine.iloc[60:67,3:8]
### END SOLUTION


df
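
A common stumbling block here is that `iloc` slices exclude the end position, while `loc` slices include the end label. The following sketch shows the difference on a made-up `DataFrame` (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..9.
toy = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": range(20, 30)})

# Position-based: rows at positions 2, 3, 4 and columns at positions 0, 1.
by_position = toy.iloc[2:5, 0:2]

# Label-based: rows labelled 2 through 5 and columns "a" through "b", ends included.
by_label = toy.loc[2:5, "a":"b"]

print(by_position.shape)
print(by_label.shape)
```

This is why selecting positions 60 to 66 with `iloc` needs the slice `60:67`.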

**Task I-5:** Generate a `DataFrame`, `top_MexicoGermany_wines`, that contains all reviews from **Mexico** and **Germany** with **at least 85 points** that are **priced between 45 and 50** (inclusive).

**Expected Output:**
>![06-04](../../img/pandas/06-04.png)

In [3]:
### BEGIN SOLUTION
# Combine the three conditions; between() includes both endpoints by default.
mask = wine.country.isin(['Mexico', 'Germany']) & (wine.points >= 85) & wine.price.between(45, 50)
top_MexicoGermany_wines = wine[mask].reset_index(drop=True)
### END SOLUTION


top_MexicoGermany_wines

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Germany,"Verging on orange in color, this intensely bot...",Beerenauslese,90,47.0,Pfalz,,,Joe Czerwinski,@JoeCz,Fitz-Ritter 2006 Beerenauslese Rieslaner (Pfalz),Rieslaner,Fitz-Ritter
1,Germany,Shows good varietal character in a somewhat sw...,Neipperger Schlossberg Auslese,88,48.0,Württemberg,,,Joe Czerwinski,@JoeCz,Grafen Neipperg 2006 Neipperger Schlossberg Au...,Traminer,Grafen Neipperg
2,Germany,Dr. Loosen offers a solid value in this remark...,Eiswein,90,48.0,Mosel,,,Anna Lee C. Iijima,,Dr. Loosen 2009 Eiswein Riesling (Mosel),Riesling,Dr. Loosen
3,Germany,While richly textured and boldly perfumed with...,Hochheimer Domdechaney Trocken Gold Cap,90,47.0,Rheingau,,,Anna Lee C. Iijima,,Domdechant Werner 2013 Hochheimer Domdechaney ...,Riesling,Domdechant Werner
4,Germany,"This is a somewhat soft and friendly wine, fil...",Von Rotliegenden Spätlese,89,45.0,Pfalz,,,Joe Czerwinski,@JoeCz,Ökonomierat Rebholz 2007 Von Rotliegenden Spät...,Riesling,Ökonomierat Rebholz
...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,Germany,This wine boasts a lovely nose redolent of dri...,Trabener Würzgarten Beerenauslese 375 ml,90,49.0,Mosel,,,Joe Czerwinski,@JoeCz,Max Wagner 2006 Trabener Würzgarten Beerenausl...,Riesling,Max Wagner
91,Germany,Puckering lime zest and lemon notes are soften...,Alte Reben Trarbacher Burgberg,89,45.0,Mosel,,,Anna Lee C. Iijima,,Richard Böcking 2013 Alte Reben Trarbacher Bur...,Riesling,Richard Böcking
92,Germany,Bottling a TBA in this format makes great sens...,Wehlener Sonnenuhr Trockenbeerenauslese Goldkap,96,49.0,Mosel-Saar-Ruwer,,,Joe Czerwinski,@JoeCz,Dr. Loosen 2006 Wehlener Sonnenuhr Trockenbeer...,Riesling,Dr. Loosen
93,Germany,The von Kesselstatt estate released four GG bo...,Kaseler Nies'chen Trocken GG,88,49.0,Mosel,,,Joe Czerwinski,@JoeCz,Reichsgraf von Kesselstatt 2009 Kaseler Nies'c...,Riesling,Reichsgraf von Kesselstatt
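
To see how the combined conditions behave, here is a minimal sketch on made-up data (`toy` and its values are hypothetical). Each comparison produces a boolean `Series`, and `&` combines them element-wise; comparisons against `NaN` evaluate to `False`, so rows with a missing price are excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including one missing price.
toy = pd.DataFrame(
    {
        "country": ["Germany", "Mexico", "Germany", "US", "Mexico"],
        "points": [90, 84, 88, 95, 86],
        "price": [47.0, 46.0, np.nan, 48.0, 50.0],
    }
)

# Each condition needs its own parentheses because & binds tighter than >=.
mask = (
    toy.country.isin(["Mexico", "Germany"])
    & (toy.points >= 85)
    & (toy.price >= 45)
    & (toy.price <= 50)
)

# reset_index(drop=True) renumbers the surviving rows from 0.
selected = toy[mask].reset_index(drop=True)
print(selected)
```

Only the first and last rows pass all four conditions: the second row has too few points, the third has no price, and the fourth is from a different country.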


## <a name="missing">Section II - Data Cleanup and Missing Data</a>
**Task II-1:** The `price` column contains `NaN` values. Find the total number of rows with a `NaN` value in the `price` column.

**Expected Output:**
>8996

In [None]:
### BEGIN SOLUTION
wine.price.isna().sum()
### END SOLUTION
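
Counting missing values this way works because `isna()` returns a boolean `Series`, and summing booleans counts the `True` entries. A minimal sketch on a made-up `Series` (`prices` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing entries.
prices = pd.Series([15.0, np.nan, 14.0, np.nan, 65.0])

missing_count = prices.isna().sum()    # True counts as 1, False as 0
present_count = prices.notna().sum()   # complement of isna()

print(missing_count, present_count)
```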


**Task II-2:** Use the string "Unknown" to replace all `NaN` values in the `DataFrame` and save the result as a new object, `replaced`.

**Expected Output:**

> ![06-05](../../img/pandas/06-05.png)

In [None]:
### BEGIN SOLUTION
replaced = wine.fillna("Unknown")
### END SOLUTION


replaced
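
One side effect worth knowing: `fillna` returns a new object and leaves the original untouched, and filling a numeric column with a string changes that column's dtype. The sketch below demonstrates both on made-up data (`toy` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({"price": [15.0, np.nan], "region": ["Etna", None]})

# fillna returns a new DataFrame; toy itself is not modified.
filled = toy.fillna("Unknown")

print(toy.price.dtype)     # numeric dtype of the original column
print(filled.price.dtype)  # dtype after mixing numbers and strings
print(filled)
```

Because the filled `price` column now mixes numbers and strings, numeric operations such as `mean()` are no longer straightforward on it. In practice, a numeric placeholder or a per-column fill value is often a better fit for numeric columns.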

**Task II-3:** Drop all rows with `NaN` values and all duplicated rows, save the result as the `DataFrame` `processed`, and obtain its shape.

**Expected Output:**

>(20493,13)

In [None]:
### BEGIN SOLUTION
processed = wine.dropna().drop_duplicates()
processed.shape
### END SOLUTION
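
To make the effect of each step visible, here is a minimal sketch on made-up data (`toy` is hypothetical). `dropna()` removes every row that contains at least one missing value, and `drop_duplicates()` keeps only the first occurrence of each identical row.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, row 3 has a missing country.
toy = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "US", None],
        "points": [87, 87, 90, 85],
    }
)

without_nan = toy.dropna()                  # drops row 3
processed = without_nan.drop_duplicates()   # drops row 1, the repeat of row 0

print(without_nan.shape)
print(processed.shape)
```

Both methods return new objects, so they can be chained as in the solution above. The original index labels are preserved unless you call `reset_index`.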


##  <a name="summary">Summary</a>
To conclude, you should now be able to:

1. Relate and apply the common operations in `pandas`.
2. Explain and apply data cleanup methods in `pandas`.
3. Understand, detect and deal with missing data with `pandas`.

Congratulations, you have completed this exercise!

## <a name="reference">Reference</a>
* [Dataset Source](https://www.kaggle.com/zynicide/wine-reviews)
* [Question Reference](https://www.kaggle.com/learn/pandas)

<font size=2>[Back to Top](#top)</font>