![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

In this notebook, you'll learn all about **[pandas](https://pandas.pydata.org)**, the most popular Python library for data analysis.

# Notebook Content

* [Getting Started](#Getting-started)
* [Creating data](#Creating-data)
    * [DataFrame](#DataFrame)
    * [Series](#Series)
* [Reading Data File](#Reading-Data-File)
* [Writing Data File](#Writing-Data-File)

# Getting started

We first import the `pandas` library, which is conventionally aliased as `pd`:

In [1]:
import pandas as pd

# Creating data

There are two core objects in pandas: the **DataFrame** and the **Series**.

### DataFrame

A DataFrame is a table. It contains an array of individual *entries*, each of which has a certain *value*. Each entry corresponds to a row (or *record*) and a *column*.

For example, consider the following simple DataFrame:

In [2]:
pd.DataFrame({'Exam 1': [50, 21], 'Exam 2': [89, 60]})

|   | Exam 1 | Exam 2 |
|---|--------|--------|
| 0 | 50 | 89 |
| 1 | 21 | 60 |


In this example, the "0, Exam 1" entry has the value 50. The "0, Exam 2" entry has the value 89, and so on.
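
As a minimal sketch of how a single entry can be read back, the snippet below rebuilds the same DataFrame and looks up values by row label and column label with `.loc`. The variable names (`exams`, `first_row_exam_1`, `first_row_exam_2`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same toy data as the example above
exams = pd.DataFrame({'Exam 1': [50, 21], 'Exam 2': [89, 60]})

# .loc[row_label, column_label] returns the single value stored at that position
first_row_exam_1 = exams.loc[0, 'Exam 1']
first_row_exam_2 = exams.loc[0, 'Exam 2']

print(first_row_exam_1, first_row_exam_2)  # 50 89
```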

DataFrame entries are not limited to integers. For instance, here's a DataFrame with a `Remark` column whose values are strings:

In [3]:
pd.DataFrame({'Exam 1': [50, 69], 'Exam 2': [89, 60], 'Remark': ['Performance increase', 'Performance decrease']})

|   | Exam 1 | Exam 2 | Remark |
|---|--------|--------|--------|
| 0 | 50 | 89 | Performance increase |
| 1 | 69 | 60 | Performance decrease |


We are using the `pd.DataFrame()` constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (`Exam 1`, `Exam 2` and `Remark` in this example), and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

The dictionary-list constructor assigns values to the *column labels*, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the *row labels*. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an **Index**. We can assign values to it by using an `index` parameter in our constructor:

In [4]:
pd.DataFrame({'Exam 1': [50, 69], 'Exam 2': [89, 60], 'Remark': ['Performance increase', 'Performance decrease']},
             index=['Student 1', 'Student 2'])

|   | Exam 1 | Exam 2 | Remark |
|---|--------|--------|--------|
| Student 1 | 50 | 89 | Performance increase |
| Student 2 | 69 | 60 | Performance decrease |
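
The sketch below shows one way to check which row labels a DataFrame ended up with, and how a custom label can then be used for lookups. It assumes a trimmed-down version of the example above; the names `results`, `student_2_exam_2` and `default_index` are illustrative only.

```python
import pandas as pd

# A DataFrame with custom row labels supplied through the index parameter
results = pd.DataFrame(
    {'Exam 1': [50, 69], 'Exam 2': [89, 60]},
    index=['Student 1', 'Student 2'],
)
print(list(results.index))  # ['Student 1', 'Student 2']

# Custom labels can be used directly for lookups
student_2_exam_2 = results.loc['Student 2', 'Exam 2']
print(student_2_exam_2)  # 60

# Without the index parameter, rows are labelled 0, 1, 2, ...
default_index = pd.DataFrame({'Exam 1': [50, 69]}).index
print(list(default_index))  # [0, 1]
```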


### Series

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

In [5]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an `index` parameter. However, a Series does not have a column name; it only has one overall `name`:

In [6]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64

The Series and the DataFrame are intimately related. It's helpful to think of a DataFrame as actually being just a bunch of Series "glued together". We'll see more of this in the next section of this tutorial.
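
To make the "glued together" picture concrete, here is a small sketch that builds two named Series, joins them side by side with `pd.concat`, and then pulls one column back out as a Series. The data and variable names are illustrative, reusing the exam scores from earlier.

```python
import pandas as pd

# Two Series sharing the same row labels; each has its own name
exam_1 = pd.Series([50, 69], index=['Student 1', 'Student 2'], name='Exam 1')
exam_2 = pd.Series([89, 60], index=['Student 1', 'Student 2'], name='Exam 2')

# Gluing them side by side: each Series name becomes a column label
glued = pd.concat([exam_1, exam_2], axis=1)
print(glued.columns.tolist())  # ['Exam 1', 'Exam 2']

# Selecting a single column gives back a Series
column = glued['Exam 2']
print(type(column).__name__)  # Series
print(column.name)            # Exam 2
```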

# Reading Data File

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

```
Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11
```

So a CSV file is a table of values separated by commas. Hence the name: "Comma-Separated Values", or CSV.
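
As a rough illustration of how such text maps onto a table, the sketch below parses a small CSV string in memory with `pd.read_csv` and `io.StringIO`. Note the assumption: the sample here is a cleaned-up variant of the text above with the trailing commas removed, so that each line has exactly three fields.

```python
import io
import pandas as pd

# Cleaned-up variant of the sample above (trailing commas removed)
csv_text = """Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
"""

# io.StringIO lets read_csv treat the string like a file
products = pd.read_csv(io.StringIO(csv_text))

print(products.columns.tolist())  # ['Product A', 'Product B', 'Product C']
print(products.shape)             # (3, 3)
```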

Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the `pd.read_csv()` function to read the data into a DataFrame. This goes thusly:

In [7]:
water_potability = pd.read_csv("../../../resources/day_01/water_potability.csv")

We can use the `shape` attribute to check how large the resulting DataFrame is:

In [8]:
# Return rows and columns

water_potability.shape

(3276, 10)

We can use the `size` attribute to get the total number of cells in the DataFrame:

In [9]:
water_potability.size

32760

So our new DataFrame has 3,276 records split across 10 different columns. That's 32,760 entries in total.
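
The relationship between the two attributes can be checked on a toy DataFrame: `size` is the product of the two numbers in `shape` (for the outputs above, 3276 × 10 = 32760). The DataFrame `toy` below is a made-up example, not the water potability dataset.

```python
import pandas as pd

# A made-up DataFrame with 3 rows and 2 columns
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # 3 2
print(toy.size)        # 6
print(len(toy))        # 3 (len counts rows only)
```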

We can examine the contents of the resultant DataFrame using the `head()` method, which grabs the first five rows by default:

In [10]:
water_potability.head()

|   | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|----|----------|--------|-------------|---------|--------------|----------------|-----------------|-----------|------------|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
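
`head()` also accepts the number of rows to return, and `tail()` does the same from the end of the DataFrame. A quick sketch on a made-up ten-row DataFrame (`numbers` is an illustrative name, not the dataset above):

```python
import pandas as pd

# A made-up DataFrame with ten rows: 0, 1, ..., 9
numbers = pd.DataFrame({'value': range(10)})

first_three = numbers.head(3)  # first three rows
last_two = numbers.tail(2)     # last two rows

print(first_three['value'].tolist())  # [0, 1, 2]
print(last_two['value'].tolist())     # [8, 9]
print(len(numbers.head()))            # 5 (the default)
```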


# Writing Data File

We use the `pandas.DataFrame.to_csv` method to write a pandas DataFrame to a CSV file.

For more detailed documentation, please refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

In [11]:
exam_result = pd.DataFrame({'Student_name':["John", "Jason", "Jasmine", "Nik"], 
                            'Exam 1': [66, 76, 59, 80], 
                            'Exam 2': [75, 48, 94, 78]})

In [12]:
exam_result

|   | Student_name | Exam 1 | Exam 2 |
|---|--------------|--------|--------|
| 0 | John | 66 | 75 |
| 1 | Jason | 76 | 48 |
| 2 | Jasmine | 59 | 94 |
| 3 | Nik | 80 | 78 |


In [13]:
exam_result.to_csv("../../../resources/day_01/exam_result.csv", header=True)
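
By default, `to_csv` also writes the row labels as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `pd.read_csv`. The sketch below compares the default with `index=False`. It writes to a temporary directory rather than the notebook's resources folder, and the file names and shortened data are illustrative assumptions.

```python
import os
import tempfile
import pandas as pd

# Shortened version of the exam_result data, for illustration
exam_result = pd.DataFrame({'Student_name': ['John', 'Jason'],
                            'Exam 1': [66, 76],
                            'Exam 2': [75, 48]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, 'with_index.csv')
    without_index = os.path.join(tmp_dir, 'without_index.csv')

    exam_result.to_csv(with_index)                  # default: index=True
    exam_result.to_csv(without_index, index=False)  # skip the row labels

    # Read both files back before the temporary directory is removed
    reloaded_with_index = pd.read_csv(with_index)
    reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
# ['Unnamed: 0', 'Student_name', 'Exam 1', 'Exam 2']
print(reloaded_without_index.columns.tolist())
# ['Student_name', 'Exam 1', 'Exam 2']
```

Passing `index=False` is usually what you want when the row labels are just the default 0, 1, 2, ... count.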

# Contributors

**Author**
<br>Chee Lam

# References

1. [Learning Pandas](https://www.kaggle.com/learn/pandas)
2. [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html)