<div align="center"><img src="../images/LKYCIC_Header.jpg"></div>

**Table of contents**<a id='toc0_'></a>    
- [1-02: DataFrame and GeoDataFrame](#toc1_)    
  - [DataFrame](#toc1_1_)    
    - [Dictionary to DataFrame](#toc1_1_1_)    
    - [List to DataFrame](#toc1_1_2_)    
    - [Read CSV files](#toc1_1_3_)    
  - [Method Call / Function / Attribute](#toc1_2_)    
      - [Method Call: `object.function()`](#toc1_2_1_1_)    
      - [Function: `function(object)`](#toc1_2_1_2_)    
      - [Attribute: `object.attribute`](#toc1_2_1_3_)    
  - [DataFrame](#toc1_3_)    
    - [Select one specific column](#toc1_3_1_)    
    - [Select one specific row](#toc1_3_2_)    
    - [Modify the contents of DataFrames](#toc1_3_3_)    
  - [GeoDataFrame](#toc1_4_)    
    - [GeoDataFrame](#toc1_4_1_)    
  - [Function enquiry (Example: gpd.points_from_xy):](#toc1_5_)    
  - [Next Step](#toc1_6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[1-02: DataFrame and GeoDataFrame](#toc0_)

Python offers a robust way to manage **tabular data** through the **Pandas** library, which includes the versatile **DataFrame** structure and a variety of associated functions. 

For geospatial data, the **GeoDataFrame** (provided by the GeoPandas library) serves as the equivalent, enabling spatial analysis. 

## <a id='toc1_1_'></a>[DataFrame](#toc0_)

In R, tabular data is stored in a `data.frame`. In Python, the equivalent structure is the `DataFrame` provided by the Pandas package:

| R                                                            | Python                                                       |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| Data Frames                                                  | Pandas DataFrames                                            |
| A data frame is a two-dimensional structure where columns can have different types. | The Pandas `DataFrame` is the equivalent, widely used for tabular data manipulation. |
| df <- data.frame(A = c(1, 2, 3), B = c("x", "y", "z"))       | import pandas as pd<br/>df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]}) |

Operations like slicing (`df[1:3]`), selecting columns (`df["A"]`), and filtering rows are conceptually similar.
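
As a minimal sketch of these three operations, the cell below reuses the two-column frame from the table above. The variable names `df_demo`, `sliced`, `col_a`, and `filtered` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# Same small frame as in the R / Python comparison table
df_demo = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# Slicing with integers selects rows by position (the end index is exclusive)
sliced = df_demo[1:3]

# Selecting a column by name returns a pandas Series
col_a = df_demo["A"]

# Filtering keeps only the rows where the condition is True
filtered = df_demo[df_demo["A"] > 1]

print(sliced)
print(col_a.tolist())
print(filtered)
```

One difference worth keeping in mind: in pandas, `df[1:3]` slices rows by position, whereas in R `df[1:3]` selects columns.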

**Import package**

To use a package, you have to **import it**.

To import a package, you first have to **install it**.

As mentioned in the [environment setting](../0-02_environment.ipynb), you can install a package by running the command `!pip install package_name` directly within a Jupyter cell.

In [None]:
# uncomment the following line to install pandas
#!pip install pandas

**Syntax of importing a package**

Before we can use <u>a package like Pandas</u>, we have to import it into the current session. 

For many packages, like pandas, we use <u>an alias</u>, or nickname, when importing them. 

This is just done to save some typing when we refer to the package in the following code.

`Task:` Let's import the `pandas` module, and add the alias `pd`.

In [None]:
import pandas as pd

### <a id='toc1_1_1_'></a>[Dictionary to DataFrame](#toc0_)

| country           | population | area    |
|-------------------|------------|---------|
| Brunei Darussalam | 434000     | 5765    |
| Cambodia          | 16250000   | 181035  |
| Indonesia         | 273523621  | 1904569 |
| Lao PDR           | 7123205    | 236800  |
| Malaysia          | 32365999   | 329847  |
| Myanmar           | 54339766   | 676578  |
| Philippines       | 108116615  | 300000  |
| Singapore         | 5870750    | 710     |
| Thailand          | 69950807   | 513120  |
| Vietnam           | 104256076  | 331212  |

We have learnt how to transform lists into a dictionary. We can also transform a dictionary into a DataFrame:

In [None]:
asean_country_list = ['Brunei Darussalam', 'Cambodia', 'Indonesia', 'Lao PDR', 'Malaysia', 'Myanmar', 'Philippines', 'Singapore', 'Thailand', 'Vietnam']

population = [434000, 16250000, 273523621, 7123205, 32365999, 54339766, 108116615, 5870750, 69950807, 104256076]

area = [5765, 181035, 1904569, 236800, 329847, 676578, 300000, 710, 513120, 331212]

To create a DataFrame from multiple lists, treat each list as one column, like a column in an Excel sheet.

The rule is similar: every column (list) must have the **same length**.

In [None]:
# build a dictionary: each key is a column name, each value is the list for that column
asean_country_info = {
    'country': asean_country_list,
    'population': population,
    'area': area
}

In [None]:
type(asean_country_info)

`pd.DataFrame.head()` returns the first `n` rows. The default is 5.

In [None]:
asean_country_df = pd.DataFrame(asean_country_info)
asean_country_df.head()

In [None]:
type(asean_country_df)
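
To illustrate the same-length rule mentioned above, here is a small sketch of what happens when one list is shorter than the others. The lists are deliberately shortened copies of the data above, and `lengths_ok` is a hypothetical flag used only for this demonstration.

```python
import pandas as pd

countries = ["Singapore", "Malaysia", "Thailand"]
population = [5870750, 32365999]  # deliberately one value short

try:
    pd.DataFrame({"country": countries, "population": population})
    lengths_ok = True
except ValueError as err:
    # pandas refuses to build a DataFrame from columns of unequal length
    lengths_ok = False
    print("Could not build the DataFrame:", err)
```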

### <a id='toc1_1_2_'></a>[List to DataFrame](#toc0_)

You can also directly transform lists into a DataFrame using the `zip()` function.

In [None]:
# from pprint import pprint

asean_country_df = pd.DataFrame(zip(asean_country_list, population, area), columns=['country', 'population', 'area'])

print(asean_country_df.head())
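
To see why `zip()` works here, it may help to look at what it produces on its own. The sketch below uses two of the countries from the table above; `rows` is an illustrative name.

```python
names = ["Singapore", "Malaysia"]
population = [5870750, 32365999]
area = [710, 329847]

# zip() pairs up the i-th element of every list, giving one tuple per row
rows = list(zip(names, population, area))

for row in rows:
    print(row)
```

Each tuple becomes one row of the DataFrame, which is why the column names have to be supplied separately through `columns=`.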

### <a id='toc1_1_3_'></a>[Read CSV files](#toc0_)

Relative path and absolute path:

1. `pwd`: Shows the path of the current working directory

2. `../`: Goes up one folder from where the current file is

3. `./`: Refers to the same folder where the current file is

A relative path lets you reuse the same code on another computer, because it describes **the relationship** between your target file and **the current working directory**.

In [None]:
# uncomment to check the path of current working directory
#%pwd
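
As a rough sketch of the difference between the two kinds of path, the standard-library `pathlib` module can turn a relative path into an absolute one. The relative path below is the one used in this notebook. The absolute result depends on where the notebook is run, and the file does not need to exist for this to work.

```python
from pathlib import Path

# The current working directory (what %pwd reports in Jupyter)
cwd = Path.cwd()

# A relative path: meaningful only in relation to the working directory
relative = Path("../data/raw/part_i/mrt_sg_dt.csv")

# Joining it to the working directory and resolving "../" gives an absolute path
absolute = (cwd / relative).resolve()

print("Working directory:", cwd)
print("Relative path is absolute?", relative.is_absolute())
print("Resolved path:", absolute)
```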

In [None]:
df = pd.read_csv('../data/raw/part_i/mrt_sg_dt.csv')

df.head()

In [None]:
df.tail(3)

The above is an example of a **method call**.

## <a id='toc1_2_'></a>[Method Call / Function / Attribute](#toc0_)

#### <a id='toc1_2_1_1_'></a>[Method Call: `object.function()`](#toc0_)

- **Definition**: A method is a function that is defined within a class and operates on instances of that class.

- **Usage**: Called on an instance of the class using the dot notation.

It calls a function owned by the object, for example `df.head()`.

To use an external function on the object instead:

#### <a id='toc1_2_1_2_'></a>[Function: `function(object)`](#toc0_)

- **Definition**: A function that takes an instance of a class as an argument and operates on it.

- **Usage**: Called with the instance passed as an argument.


In [None]:
len(df)

#### <a id='toc1_2_1_3_'></a>[Attribute: `object.attribute`](#toc0_)

- **Definition**: An attribute is a variable that is bound to an instance of a class.

- **Usage**: Accessed directly using the dot notation.

`.shape` is an attribute of the dataframe.

In [None]:
df.shape

Similarly, we can get the names of all the columns in the DataFrame by using `.columns`:

In [None]:
df.columns

- **Method Call (`object.function()`)**: Invokes a method **predefined** within the class, operating on the instance.

- **Function (`function(object)`)**: A **standalone** function that takes an instance as an argument and operates on it.

- **Attribute (`object.attribute`)**: Accesses **a variable** bound to the instance.
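
The three forms can be put side by side in one short sketch. The DataFrame `toy` and its values are made up for illustration.

```python
import pandas as pd

# A made-up three-row table, for illustration only
toy = pd.DataFrame({"Name": ["A", "B", "C"], "value": [10, 20, 30]})

first_two = toy.head(2)  # method call: note the parentheses
n_rows = len(toy)        # function: the object is passed as an argument
dims = toy.shape         # attribute: no parentheses

print(first_two)
print(n_rows)
print(dims)
```

A quick way to tell them apart: a method call and a function both need parentheses, while an attribute does not.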

## <a id='toc1_3_'></a>[DataFrame](#toc0_)

### <a id='toc1_3_1_'></a>[Select one specific column](#toc0_)

In [None]:
df['Name']

- `.unique()`: The unique values in the 'Name' column

- `.nunique()`: The number of unique values in the 'Name' column

In [None]:
df['Name'].unique()

In [None]:
df['Name'].nunique()

`Challenge 1:` 

Is the number returned by `.nunique()` equal to the length of the DataFrame?

In [None]:
#——————————————————————————————————————————————————————————————————————————————————————————————#


#——————————————————————————————————————————————————————————————————————————————————————————————#

If not, what does the difference in number mean?

#——————————————————————————————————————————————————————————————————————————————————————————————#


#——————————————————————————————————————————————————————————————————————————————————————————————#

`Task:` Find the MRT station(s) that appear more than once and remove the duplicates

The `value_counts()` method in pandas returns a Series containing the counts of unique values in a DataFrame column.

It is often used to get the frequency of each unique value in a column.

So we can see the MRT station(s) _______ has duplicate entries in the table

In [None]:
df['Name'].value_counts()

Remove the duplicated Bedok Reservoir:

In [None]:
# Keep the first 'Bedok Reservoir' row and drop any later rows with the same name
df = df.drop(
    df[df['Name'] == 'Bedok Reservoir'].index[1:]
    )

In [None]:
df['Name'].value_counts()

The `drop_duplicates()` method of a pandas DataFrame removes duplicate rows. To try it, we first reload the original data.

In [None]:
df = pd.read_csv('../data/raw/part_i/mrt_sg_dt.csv')

In [None]:
df['Name'].value_counts()

In [None]:
df.drop_duplicates(subset='Name', keep='first', inplace=True)

In [None]:
df['Name'].value_counts()
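
The same workflow can be sketched on a tiny made-up table, which makes it easier to see which row survives. The names `toy`, `repeated`, and `deduped` are illustrative only.

```python
import pandas as pd

# Made-up data: the name "A" appears twice
toy = pd.DataFrame({"Name": ["A", "B", "A", "C"], "value": [1, 2, 3, 4]})

# Count how often each name occurs, then keep the names seen more than once
counts = toy["Name"].value_counts()
repeated = counts[counts > 1].index.tolist()
print(repeated)

# keep='first' retains the first occurrence of each name
deduped = toy.drop_duplicates(subset="Name", keep="first")
print(deduped)
```

Without `inplace=True`, `drop_duplicates()` returns a new DataFrame and leaves the original unchanged, so the result has to be assigned to a variable.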

### <a id='toc1_3_2_'></a>[Select one specific row](#toc0_)

To select a specific row, what we need is a <u>conditional statement</u>.

In [None]:
df['Name'] == 'Upper Changi'

In [None]:
(df['Name'] == 'Upper Changi').sum()

**Boolean mask**

It's a Series of **True/False labels** used to filter our DataFrame for a certain condition.

In [None]:
df[df['Name'] == 'Upper Changi']

In [None]:
upper_changi = df[df['Name'] == 'Upper Changi']

In [None]:
upper_changi
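
The steps above (build a condition, count the matches, filter) can be sketched on a small made-up table. `toy`, `mask`, and `selected` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["A", "B", "C"], "value": [10, 20, 30]})

# The comparison is evaluated row by row and yields one True/False per row
mask = toy["Name"] == "B"
print(mask.tolist())  # [False, True, False]

# True counts as 1, so the sum is the number of matching rows
print(mask.sum())

# Passing the mask inside df[...] keeps only the rows marked True
selected = toy[mask]
print(selected)
```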

Transpose: `transpose()` or `T`

This is useful when you want to switch the orientation of your data.

In [None]:
upper_changi.transpose()

In [None]:
upper_changi.T

### <a id='toc1_3_3_'></a>[Modify the contents of DataFrames](#toc0_)

In [None]:
len(df['Name'])

In [None]:
type(df['Name'])

It is a pandas `Series` object.

The `.str` accessor in pandas is used to apply string functions to each element in a Series that contains string data.

Then you can apply the string operations in [1-01_data.ipynb](./1-01_data.ipynb).

In [None]:
df['Name'].str

In [None]:
df['Name'].str.replace('a', 'e')

In [None]:
# split each name into words at the spaces
df['Name'].str.split(' ')

Get the names of MRT stations with names longer than 15 characters.

`.str.len()` is a pandas string method that returns the length of each string in the Series. It is the element-wise counterpart of Python's built-in `len()` function.

In [None]:
df['Name'].str.len()

In [None]:
df['Name'].str.len() > 15

In [None]:
df[df['Name'].str.len() > 15]
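
A compact sketch of the `.str` accessor on a two-element Series, using two station names that appear earlier in this notebook. The threshold of 12 is chosen only so that the small example returns something.

```python
import pandas as pd

names = pd.Series(["Bedok Reservoir", "Upper Changi"])

# .str applies the string operation to every element of the Series
lengths = names.str.len()
words = names.str.split(" ")

# The comparison gives a Boolean mask, which then filters the Series
long_names = names[lengths > 12]

print(lengths.tolist())
print(words.tolist())
print(long_names.tolist())
```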

## <a id='toc1_4_'></a>[GeoDataFrame](#toc0_)

### <a id='toc1_4_1_'></a>[GeoDataFrame](#toc0_)

> A GeoDataFrame object is a pandas.DataFrame that has one or more columns containing geometry.

<div align="center">
    <img src="../images/gpd_structure.webp">
    <br><b>GeoDataFrame Structure</b>
    <br>Source: <u>https://geopandas.org/en/stable/docs.html</u>
</div>

References:

1. https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html#geopandas-geodataframe

In [None]:
type(df)

In [None]:
import geopandas as gpd

In [None]:
# x is the longitude (lng) and y is the latitude (lat)
gpd.points_from_xy(df.lng, df.lat)

## <a id='toc1_5_'></a>[Function enquiry (Example: gpd.points_from_xy):](#toc0_)

In addition to ChatGPT, there are two more traditional ways to look up how a function is used within the Jupyter Notebook:

1. `help()`:
    
    This calls the **built-in Python** help() function.

    It displays the documentation (docstring) for the function.

2. `function?`:

    This is a feature specific to **IPython and Jupyter notebooks**.
    
    It displays the documentation (docstring) for the specified function in a more interactive and user-friendly manner. It is used in **Jupyter notebooks**.
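
Both options read the function's docstring. As a small sketch of where that text comes from, the cell below defines a hypothetical helper `add_one` (not part of any library) and retrieves its documentation and signature with the standard-library `inspect` module.

```python
import inspect


def add_one(x):
    """Return x plus one (hypothetical helper for illustration)."""
    return x + 1


# The docstring is the text that help() and `?` display
doc = inspect.getdoc(add_one)
print(doc)

# The signature lists the parameters the function accepts
sig = str(inspect.signature(add_one))
print(sig)

# help() prints the same information in a formatted layout
help(add_one)
```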


In [None]:
help(gpd.points_from_xy)

In [None]:
gpd.points_from_xy?

In [None]:
gpd.GeoDataFrame?

In [None]:
# points_from_xy expects x (longitude) first, then y (latitude)
df_gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lng, df.lat))

df_gdf

In [None]:
type(df_gdf)

In [None]:
df_gdf.plot()
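
The following is a minimal sketch of the same conversion on a made-up two-point table. The coordinates are arbitrary, and the coordinate reference system `EPSG:4326` (WGS 84 longitude and latitude) is an assumption about how such data is usually stored. It is not something stated for the MRT file.

```python
import pandas as pd
import geopandas as gpd

# Made-up points with the same column names as the MRT table
pts = pd.DataFrame({
    "Name": ["P1", "P2"],
    "lat": [1.30, 1.35],
    "lng": [103.80, 103.90],
})

# points_from_xy takes x first, then y: longitude is x, latitude is y
geometry = gpd.points_from_xy(pts["lng"], pts["lat"])

# crs is assumed here; check the data source before relying on it
pts_gdf = gpd.GeoDataFrame(pts, geometry=geometry, crs="EPSG:4326")

print(type(pts_gdf))
print(pts_gdf.geometry.x.tolist())
print(pts_gdf.geometry.y.tolist())
```

Passing the coordinates in the wrong order still runs without an error, but the points end up in the wrong place on a map, so it is worth checking `.geometry.x` and `.geometry.y` after the conversion.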

## <a id='toc1_6_'></a>[Next Step](#toc0_)

Go to [1-03: Questionnaires and Survey Data Cleaning](./1-03_questionnaires.ipynb)