# **From Numbers to Plots: Managing DataFrames and Plotting Floor Plans**

Last week, we learned how to deal with tuples, lists, and arrays. We also got familiar with modifying simple shapes and finally plotting them! In other words, we stored the properties of shapes in an array of polygons. One step further, we want to store ***the properties of real buildings*** (not necessarily simple shapes) in another data storage option called a ***DataFrame***. So, in this tutorial, we are going to learn how to create DataFrames from BIM files, going through the intermediate steps of IFC (Industry Foundation Classes) and CSV (comma-separated values) files. To get from the IFC to the CSV file format, we use the results of a solution called *BatchPlan*! You can read more about BatchPlan [here](https://repository.tudelft.nl/record/uuid:2899ea54-d769-4154-bb04-3c95b018a194) if you're curious.

Today, we will mostly focus on the last two arrows (i.e., from CSV to DataFrame, and from DataFrame to plots). Once we have the DataFrame, we can do all the fun stuff, including *pretty visualizations*.✨


<center>
<img src="https://drive.google.com/uc?export=view&id=1ZOWpjfmL6Ed82hJ88pxBFpwekfjbTesd" alt="floor-layout" class="center" width="750px">
</center>


## 📌 **Overview and learning objectives**

This tutorial is about the transition from the conventional way of storing building data (BIM and IFC files) to more machine-readable formats (CSV files and DataFrames). The ultimate aim is to build a ***(Geo)DataFrame*** of a number of residential building projects from KAAN Architecten and create some informative visualizations.

### 🧠 **Learning objectives**
*   Making use of CSV files to create a DataFrame
*   Cleaning Data (reading, sorting, and selecting)
*   Plotting FloorPlans

### 🐍 **New in Python**
- Mounting Google Drive in Google Colaboratory
- Creating a (Geo)DataFrame
- Plotting MultiPolygons
- Libraries: pandas, geopandas

### ⛩ [**Access Data**](#t1)
- [1.1 Introduction](#t1.1)
- [1.2 KAAN Residential Projects](#t1.2)
- [1.3 Loading Data](#t1.3)

### 🌐 [**Visualize Data**](#t2)
- [2.1 Pandas DataFrame](#t2.1)
- [2.2 Towards Neat Data](#t2.2)
- [2.3 GeoDataFrame Creation](#t2.3)
- [2.4 Floor plan Visualization](#t2.4)

### 🟡 [**Additional Geopandas (optional)**](#t3)
- [3.1 Maps and Plots](#t3.1)
- [3.2 Geometric Manipulation](#t3.2)
- [3.3 Operations with Overlay](#t3.3)

### 📊 [**Exercise**](#t4)


<a name="t1"></a>
## ⛩ **Data Access**

<a name="t1.1"></a>
### 1.1 **Introduction**

After getting familiar with the KAAN residential projects (which are our material to work with), we learn how to load the related files from Google Drive.

Just so we know where we stand, we will use the CSV files derived from the IFC files of these projects. There are two types of CSV files we will work with: one coming from the BIM files, including materials, IFC classes, etc., and the other coming from the results of BatchPlan, including the geometrical properties of the spaces in the projects.

Refresh: What was ***IFC*** again?

The Industry Foundation Classes (IFC) format is an open file format used by ***BIM programs***. It describes the spatial elements, materials, and shapes of a building. The majority of BIM modelling software, including Revit, ArchiCAD, and Rebro, currently supports the import and export of IFC files. IFC files are intended to be *platform-independent*, and therefore act as an information exchange format. This allows interoperability between various BIM programs, as well as between BIM and other tools such as LCA (Life Cycle Assessment).

<a name="t1.2"></a>
### 1.2 **KAAN Residential Projects**

For the material in this tutorial, we will use data from KAAN Architecten projects. Several residential projects have been selected to encompass various design stages, ranging from early design to project construction.
The projects are titled ***The Stack (Overhoeks), Lumiere, SPOT, and Strijp S- Match box***. More information on the projects can be found in the slides of today's lecture.


Here is the Strijp S - Matchbox Project located in Eindhoven:

<center>
<img src="https://drive.google.com/uc?export=view&id=1j0bJWwaRcaP8DFXcwhYAoFcCrwaDYMLW" alt="isometric" class="center" width="750px">
</center>


As an example, we take the third floor of the ***Strijp S - Match Box*** project, whose isometric view looks like this:


<center>
<img src="https://drive.google.com/uc?export=view&id=1Qo9jf2_jWEwU6YsaF4eqXKeDEmQ-Sd6o" alt="isometric" class="center" width="750px">
</center>

and its floor plan:

<center>
<img src="https://drive.google.com/uc?export=view&id=1jfIhODIOg7i5fr6VfW093rv8NKKypMo2" alt="floor-plan" class="center" width="750px">
</center>

Both images above are ***the output of BatchPlan***. But these are not the only outputs! We will see other output types later on in this tutorial.


<a name="t1.3"></a>
### 1.3 **Loading Data**

We aim to use the available data files to create our very first DataFrame. Datasets are commonly released as comma-separated values (.csv) files, so it is practical to know how to load and read this file format in a programming environment. We first need to load the raw data from our data storage (in this case Google Drive) to the compute platform (in this case Google Colab). Here we mount Google Drive to use the stored files in this notebook. Long story short, we mount the drive to make a ***connection between Drive and Colab***.
<center>
<img src="https://drive.google.com/uc?export=view&id=1L5A992d1NolfYRO3lo-NjGBmO0b4rhCH" alt="isometric" class="center" width="600px">
</center>


We use the **os** module (short form for Operating System) to open the desired folder in the drive and ***navigate through the directories*** (i.e., paths, addresses, where the files are stored). The OS module in Python provides functions for interacting with the operating system.

In [None]:
#importing drive and os libraries
from google.colab import drive
import os

Here we go, mounting (setting up) Google Drive:

In [None]:
#mounting google drive
drive.mount('/content/drive', force_remount=True)

It's always nice to know where we stand (in life in general and also while programming). For that we use a practical function called **getcwd** (short form for "get current working directory"). This function is handy when we are lost in the directories (or just curious) and want to know our location.

In [None]:
#checking the current working directory
os.getcwd()

So there we are. Next, we can copy the path of the desired folder from the left side bar of this notebook (the folder icon): ***Files/drive/MyDrive*** and then whichever folder we are looking for. Copy the files related to your project from [this shared folder](https://drive.google.com/drive/folders/1vnRCW3_-q63WfpBkD3e2Nqx1SauMH9yS?usp=sharing) to somewhere in your own Drive, and then paste the path of that folder into the code cell below as the "path" variable.

In [None]:
#giving the directory of the files
path = "/content/drive/MyDrive/PhD_Education/AI in Architecture Course/KAAN_Projects/Strijp_S/general_csv"
# using chdir for changing the directory to the path we look for
os.chdir(path)

Care to check the current working directory again to make sure we are in the right place?

In [None]:
#checking the current working directory
os.getcwd()

-The place you wanted to end up? Yes? -Nice. -No? -Then use **.chdir** again.
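
If the path contains a typo, **.chdir** stops with an error. A minimal sketch of a more forgiving version is shown below. The helper name `change_directory` is made up for this example and is not part of the os module; it only wraps `os.path.isdir` and `os.chdir`.

```python
import os

def change_directory(path):
    """Change the working directory only if `path` is an existing folder.

    `change_directory` is a hypothetical helper written for this example.
    It returns True when the change succeeded and False otherwise.
    """
    if os.path.isdir(path):
        os.chdir(path)
        return True
    print(f"Folder not found: {path}")
    return False
```

Calling `change_directory(path)` with the "path" variable from the cell above would either move us there or tell us that the folder could not be found.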

<a name="t2"></a>
## 🌐 **Visualize Data**
Now that we have landed in a folder containing our building project files, it's time to take our first baby steps towards creating a DataFrame ^^ Here we learn how to create a ***Pandas*** DataFrame, how to read, sort, and select the intended data, how to create a ***GeoDataFrame***, and how to visualize floor plans.


<a name="t2.1"></a>
### 2.1 **Pandas DataFrame**

In order to make an efficient analysis on big data of several buildings, it is beneficial to create a so-called clean DataFrame. One of the most practical libraries to deal with DataFrames is **pandas**, an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. A DataFrame contains labeled axes (rows and columns).
<center>
<img src="https://runcode-app-public.s3.amazonaws.com/images/pandas-online-editor-compiler.original.png" alt="isometric" class="center" width="550px">
</center>
The highlights of pandas are as follows:


*   A fast and efficient DataFrame object for **data manipulation**
*   Tools for reading and writing data between **different formats**: CSV and text files, Microsoft Excel, etc
*   Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
*   **Size mutability** by inserting and deleting columns
*   High performance **merging and joining** of data sets

We will go through a number of pandas features in this tutorial, but if you are still curious to know more and want to play around, [here is a nice pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf). *Wait what is a cheat sheet?* A cheat sheet is a concise set of notes used for quick reference.

With the points above in mind, we can understand the difference between ***storing big data*** of building projects in DataFrames and storing it in the arrays or lists that we already learned about. Of course, each has its own applications, but let's *feel* the difference in the following cells! First, we need to import the library.



In [None]:
#importing pandas library (but we will later call it with the nickname we chose for it, here: pd)
import pandas as pd
#importing numpy library
import numpy as np

We already got familiar with the array data type. With pd.DataFrame, we can create **a DataFrame object** from arrays. Normally, specifying only the **data** and **columns** parameters is enough to quickly define a DataFrame. However, there are more parameters we can define. Curious? Then check [the pandas.DataFrame documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [None]:
#constructing DataFrame from numpy ndarray
df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
#taking a look at the DataFrame just by typing its name
df

In [None]:
#getting the number of rows
len(df)

What if we want to observe a ***subset*** (i.e., a portion, a part) of the DataFrame? There are multiple ways to select the part of the DataFrame that we are mostly interested in. Let's try out some of them:

In [None]:
#simply looking into a particular column using the name of the column
df[['a']]

Interested in more than only one column? That's also straightforward!

In [None]:
#observing a subset of the initial dataframe by specifying two column names
df[['b','c']]
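
The double brackets in the cells above are worth a second look. As a small sketch on the same toy DataFrame: single brackets with one column name return a one-dimensional Series, while double brackets (a list of names) return a DataFrame, even when the list contains only one name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

single = df['a']      # single brackets: a one-dimensional Series
double = df[['a']]    # double brackets: a DataFrame with one column

print(type(single).__name__)        # Series
print(type(double).__name__)        # DataFrame
print(single.shape, double.shape)   # (3,) (3, 1)
```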

Here we make use of the function ***query*** to apply some conditions to the subset of interest. Check it out:

In [None]:
#observing the subset of df which complies with the defined conditions
#here: suppose we want the rows in which the value of column "a" is more than 1 and the value of column "c" is less than 9
df.query('a > 1 and c < 9')

What if we want to observe a subset of the data, but we either don't know the names of the columns or are merely interested in certain rows? In this case, the ***iloc*** indexer comes in handy: it selects rows (and columns) by their integer position.

In [None]:
#observing a subset of data using the index of rows
#pay attention to the range: the lower bound (0) is included and the upper bound (2) is excluded
df.iloc[0:2]

⭐ ***Question!*** Can you guess what the cell below does?

In [None]:
df.iloc[0:2, [0, 1]]
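
As a side note, pandas also offers ***loc***, which selects by label instead of by position. The sketch below uses a made-up table of rooms (the names and numbers are invented for illustration) to show the one difference that trips people up: a position slice excludes its end, a label slice includes it.

```python
import pandas as pd

# Made-up rooms, indexed by name instead of by 0, 1, 2
rooms = pd.DataFrame({'area': [12.5, 20.0, 8.0], 'level': [1, 1, 2]},
                     index=['kitchen', 'living', 'bath'])

by_position = rooms.iloc[0:2]               # positions 0 and 1 (end excluded)
by_label = rooms.loc['kitchen':'living']    # labels kitchen to living (end included)

print(by_position.equals(by_label))         # True
```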

So we saw that a DataFrame is not a big deal (***for now :)***). It looks like a table, and indeed the DataFrame object is a "tabular data" format. But the story does not end here. Now we want to create a DataFrame from the CSV files that we have stored in Google Drive. Since the Drive has already been mounted, the only thing left to do is to ***read*** the intended file. If the files we want to load follow a naming convention, it is easier to keep the repetitive part of the names fixed and use only the varying part as a variable. In the following example, we assume the naming convention "filename" + ".csv". After storing the project's name in the "project_name" variable, we can use the ***read_csv*** function of pandas to load the desired file.

In [None]:
#name of the file as the variable
project_name = "STRIJP-KAAN-ZZ-ZZ-M-A-0101"
#reading a comma-separated values (csv) file into DataFrame
StrijpS_df = pd.read_csv(project_name+".csv")

Here we have created a DataFrame with the name "StrijpS_df". ***Tip:*** if we hover the cursor over an object in the code, its type is shown (e.g., StrijpS_df is a DataFrame, as we wanted it to be, luckily). This comes in handy when we are in doubt about the data type of a certain variable.
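
The same naming idea extends to several files at once. The sketch below is only an illustration: it first writes two tiny CSV files into a temporary folder (the names `project_A` and `project_B` and their contents are made up), and then reads them back in a loop, keeping the fixed ".csv" part in the code and looping over the changing part.

```python
import os
import tempfile
import pandas as pd

# Toy stand-ins for real project files; names and contents are made up
folder = tempfile.mkdtemp()
project_names = ["project_A", "project_B"]
for name in project_names:
    toy = pd.DataFrame({"Class": ["IfcWall", "IfcSpace"], "Name": ["w1", "s1"]})
    toy.to_csv(os.path.join(folder, name + ".csv"), index=False)

# Read every file into a dictionary: project name -> DataFrame
frames = {}
for project_name in project_names:
    file_path = os.path.join(folder, project_name + ".csv")
    frames[project_name] = pd.read_csv(file_path)

print(list(frames))
print(frames["project_A"].shape)
```

A dictionary is used here so that each DataFrame can be looked up by its project name afterwards.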

Next, we will get familiarized with the data and learn how to clean the DataFrame.

<a name="t2.2"></a>
### 2.2 **Towards Neat Data**


Data cleaning is performed for ***data consistency***. In the data cleaning process, redundant data is removed, and incorrect, incomplete, irrelevant, or improperly formatted data is modified. Data cleaning extends to fixing spelling and syntax errors and to standardizing data. It is also tailored towards selecting, manipulating, and formatting the data for the downstream task.

But wait a second, how do we know what to clean and to what extent we need to clean? (In plain English: is it simply *dusting* or is it *deep cleaning*?) The preliminary step in cleaning the data is ***to get to know the raw data***: for instance, the data types that the file contains, the features (the columns), the first and last few rows of the DataFrame, and the number of unique items per feature. So basically, by ***cleaning*** we mean ***reading, sorting, and selecting data***. The following cells will help us get familiar with the DataFrame and make it ready for the subsequent steps.


In [None]:
#printing a concise summary of a DataFrame
StrijpS_df.info()

In [None]:
len(StrijpS_df)

Based on the results of the last function, we know that the DataFrame has a certain number of columns. Mostly, we are interested in knowing ***the features*** a DataFrame affords. Therefore, it is also useful to get a list of column names. Next, we can decide which columns to keep and which to ignore based on our desired task.

In [None]:
#printing the list of features
list(StrijpS_df.columns)

Alright, there are many columns, apparently. But let's move on to the DataFrame itself, because it contains rows ***and*** columns. We can already imagine that this is a (relatively) huge DataFrame due to the number of columns. Therefore, it's always practical to check a part of the DataFrame before observing the whole. Here we use two handy functions to take a quick look at ***the first and last few rows of the DataFrame***: .head() and .tail(), respectively. As we saw earlier, the DataFrame ultimately looks like a table (a fancy one, in this case).

In [None]:
#getting the first 5 rows of the DataFrame
StrijpS_df.head()

Note that the middle columns are not shown because there are too many of them. However, this doesn't mean that we lost them. They are there, but we don't see them. Now let's see the last few rows.

In [None]:
#getting the last 5 rows of the DataFrame
StrijpS_df.tail()
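
If we do want to see the hidden middle columns, pandas has a display option for that. The sketch below uses a made-up DataFrame with 30 columns (not the project data) and lifts the column limit only inside the `with` block, so the normal display settings are restored afterwards.

```python
import pandas as pd

# A toy DataFrame with 30 columns, wide enough to be truncated when displayed
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

# 'display.max_columns' set to None means "no limit" inside this block only
with pd.option_context('display.max_columns', None):
    print(wide.head())
```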

After grasping an overall idea of the content, we can clean the DataFrame based on the requirements of the task at hand. One step of the cleaning process is **to get rid of unwanted columns**. As an example, a few columns of the raw DataFrame are selected to keep in the following cell. Then the DataFrame is shown (note that when we display a large DataFrame without asking for the head or the tail, both the first and the last rows are shown).

In [None]:
#defining a new DataFrame which is a sub-DataFrame of the main one. So, no worries, we will never lose the data we first loaded.
selected_df = StrijpS_df[['Class', 'Name', 'PredefinedType', 'Level', 'x_coordinate', 'y_coordinate','z_coordinate', 'Material']]
selected_df

How about creating a new DataFrame (let's call it Spaces), which is a ***selection*** of rows based on the column 'Class' in the selected_df DataFrame? What kind of selection are we talking about? Suppose we only care about IfcSpace elements at this point. Alright, let's get into it.

In [None]:
Spaces = selected_df[selected_df['Class'] == "IfcSpace"]
Spaces

Sometimes *(maybe more than sometimes)* we are interested in the unique values in a certain column to get an impression of **the variety of data in a certain feature**. Let's say we are curious to know which unique Names there are in the Spaces DataFrame.

In [None]:
#printing the unique values in a certain column (here: Name)
print(Spaces.Name.unique())

Did you notice that ***the size of the DataFrame*** has changed? Let's keep track of this change later on as well.

Sometimes we are interested in the number of ***unique values*** in certain columns. Also, we might need a list of unique values in a certain column (and not necessarily the number of them). The following cells will go through the abovementioned applications.

In [None]:
#printing the number of unique values per column in the selected DataFrame using the .nunique function
print(selected_df.nunique())

We can also refer to a certain column in the pandas DataFrame simply by putting a dot after the DataFrame's name. Let's assume we are interested to know what kind of materials have been used in the third floor of the Strijp S project. So once we talk about ***what kind***, it would be translated into ***the unique values***.

In [None]:
#printing the list of unique values in a certain column (here: Material) using the .unique function
print(selected_df.Material.unique())

What is the first material? Have you ever worked with that? I bet no.

In computing, NaN, standing for Not a Number, is a particular value of a numeric data type which is undefined as a number. In pandas, it marks missing values. If we are interested in keeping only the rows with actual values (in other words, ***rows without NaN values***), we can apply the **dropna()** function to the DataFrame. By default, it drops every row that contains a NaN in any of its columns.

In [None]:
#dropping the rows including NaN values
clean_df= selected_df.dropna()

#let's see what happened
clean_df
#pay attention to the reduced number of rows after filtering data
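
Dropping every row with a NaN anywhere can remove more than intended. As a small sketch on made-up data (the values below are invented, not taken from the project), the `subset` parameter of dropna restricts the check to the columns we actually care about.

```python
import numpy as np
import pandas as pd

# Made-up rows: each one is missing a value in some column
toy = pd.DataFrame({
    'Name': ['wall-1', 'wall-2', 'space-1'],
    'Material': ['concrete', np.nan, np.nan],
    'PredefinedType': [np.nan, 'STANDARD', 'INTERNAL'],
})

all_columns = toy.dropna()                        # drops a row if ANY column is NaN
only_material = toy.dropna(subset=['Material'])   # checks the 'Material' column only

print(len(toy), len(all_columns), len(only_material))   # 3 0 1
```

In this toy case the default call removes everything, while the restricted call keeps the one row that has a material.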

We would much rather see the index (the first column) as a neat sequence, right? Then let's reset the index.

In [None]:
clean_df.reset_index(drop=True, inplace=True)
clean_df

How else can we think of ***digging into a big DataFrame of a building*** with so many features and so many rows of data? Let's see some other handy functions from the pandas library, namely ***filter*** and ***query***. They are both used when we want to focus on a specific part of the DataFrame, either to have a closer look or to partially manipulate the data.

In [None]:
# Filter columns - Filter
organized_df = clean_df.filter(['Level','z_coordinate'])
organized_df

In [None]:
# Filter columns whose names contain a given piece of text (here: 'coordinate')
organized_df = clean_df.filter(like = 'coordinate')
organized_df

In [None]:
# Filter rows based on a condition - using query
organized_df= clean_df.query('x_coordinate > 12500 and z_coordinate > 15000')
organized_df
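
The thresholds inside a query string do not have to be typed in by hand. A minimal sketch on made-up coordinates: prefixing a name with `@` makes query look up an ordinary Python variable instead of a column.

```python
import pandas as pd

# Made-up coordinates for illustration
toy = pd.DataFrame({'x_coordinate': [10000, 13000, 14000],
                    'z_coordinate': [16000, 12000, 18000]})

# Thresholds kept in ordinary Python variables
x_min = 12500
z_min = 15000

# '@' tells query to use the Python variable, not a column name
subset = toy.query('x_coordinate > @x_min and z_coordinate > @z_min')
print(subset)
```

Only the last row satisfies both conditions, so changing `x_min` or `z_min` is enough to explore other parts of the data.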

You think a column is useless for further analysis? Then ***drop*** it!

In [None]:
organized_df = organized_df.drop(columns=['PredefinedType'])
organized_df

After cleaning so far, let's say we are curious to know which types of materials are used in the remaining dataframe (or here: in the fifth and sixth floors!)

In [None]:
#selecting the column of interest and then check the unique values of that column
organized_df.Material.unique()

In [None]:
sixth_floor = organized_df[organized_df['Level'] == "06 zesde verdieping"]
sixth_floor

Once we have a clean and organized DataFrame, we can play around with some basic plots to improve our perception of the data, this time ***VISUALLY!*** For this purpose, we benefit from a nice library for coming up with some plots, called ***Seaborn***. You can explore the library yourself [here](https://seaborn.pydata.org/index.html). Seaborn is great for statistical data visualization.

In [None]:
#let's import it and give it a nickname
import seaborn as sns

In [None]:
# Visual summary - pairplot
sns.pairplot(sixth_floor)

Many of the plots above are most likely useless for the task you initially had in mind, but they can be inspiring if you take some time to look into them. You never know, it might help prevent fixation!

Let's say we want to know ***what material is used where***. In other words, we want to check the x and y coordinates of the floor plan of interest (here, after selecting and organizing, the sixth floor). Take a look at the plot and see what you can conclude from the visualization.

In [None]:
sns.relplot(data = sixth_floor, #where do we read data from?
           x = 'x_coordinate',
           y = 'y_coordinate',
           kind = 'scatter', #this defines the type of the plot, here, scatter
           col = 'Material', #one subplot per Material
           col_wrap = 4,   #at most 4 subplots per row before wrapping to a new row
           hue = 'Class' #a different color is assigned to each IFC Class
);

<a name="t2.3"></a>
### 2.3 **GeoDataFrame Creation**

GeoPandas is an open source project that adds support for geographic data to pandas objects. The goal of GeoPandas is to make working with ***geospatial data*** in Python easier. It combines the capabilities of the ***pandas*** and ***shapely*** libraries (shapely is a library for manipulating geometric objects, get more information [here](https://shapely.readthedocs.io/en/stable/manual.html#introduction)), providing geospatial operations in pandas and a high-level interface to multiple geometries in shapely. Have you noticed the difference between the logos of the ***Pandas*** and ***GeoPandas*** libraries?
<center>
<img src="https://geopandas.org/en/stable/_images/geopandas_logo.png" alt="isometric" class="center" width="750px">
</center>


All the following cells about GeoPandas are a summary of the library's documentation, which can be accessed [here](https://geopandas.org/en/stable/docs.html). But let's see what GeoPandas is about by first importing it.

In [None]:
#importing the geopandas library
import geopandas as gpd


GeoPandas implements two main data structures, a ***GeoSeries*** and a ***GeoDataFrame***.
A GeoSeries is essentially a vector where each entry in the vector may consist of only one shape (like a single polygon) or multiple shapes (like the many polygons that make up the State of Hawaii or a country like Indonesia).

GeoPandas has three basic classes of geometric objects:

*   Points / Multi-Points
*   Lines / Multi-Lines
*   ***Polygons / Multi-Polygons*** (this is more relevant to us)

A GeoDataFrame is a ***tabular data structure*** that contains a GeoSeries. The most important property of a GeoDataFrame is that it always has one GeoSeries column that holds a special status. This GeoSeries is referred to as the GeoDataFrame’s ***“geometry”***. *Moral of the story:* When a spatial method is applied to a GeoDataFrame, the command will always act on the “geometry” column.
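
To make this concrete, here is a minimal sketch with two made-up rectangular rooms (the names and coordinates are invented, and no CRS is assigned). It shows that the "geometry" column is a GeoSeries and that a spatial property such as `area` is computed from it.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two made-up rectangular rooms: 4 x 3 and 3 x 3 units
rooms = gpd.GeoDataFrame(
    {'name': ['room_a', 'room_b']},
    geometry=[Polygon([(0, 0), (4, 0), (4, 3), (0, 3)]),
              Polygon([(4, 0), (7, 0), (7, 3), (4, 3)])],
)

print(type(rooms.geometry).__name__)   # GeoSeries
print(rooms.area.tolist())             # [12.0, 9.0]
```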


But what is so *special* about GeoPandas? A feature called CRS. What is that?
***The coordinate reference system (CRS)*** tells Python how those coordinates relate to places on the Earth. CRS is important because the geometric shapes in a GeoSeries or GeoDataFrame object are simply a collection of coordinates in an arbitrary space.


![alt text](https://pygis.io/_images/d_crs_assigned.png)


For reference codes of the most commonly used projections, see [here](https://spatialreference.org/) (if you are *highly* curious, otherwise, no need).

One of the outputs of the BatchPlan solution is a csv file containing a column of geometry for all the elements present in a certain floor plan. Here we will first ***load*** the csv file and then try to build a GeoDataFrame based on that. But to *access* this file, we need to change the directory path that we are in.

In [None]:
# giving the directory of the files
path = "/content/drive/MyDrive/PhD_Education/AI in Architecture Course/KAAN_Projects/Strijp_S/BatchPlan_output"
# using chdir for changing the directory to the path we look for
os.chdir(path)

So we changed the path and landed in a different folder. Curious to see what we have in this folder? We can use a function called ***listdir*** to get a list of the files that we have in this directory.

In [None]:
os.listdir()

So there are plenty of files inside this folder apparently. We choose to move forward with the csv file of ***the third floor*** with the name "03 derde verdieping.csv". As we can see among the list, except for the ground floor, all the other csv files ***share the same part*** of " verdieping.csv". We can keep this fixed part in the parameters of read_csv function and keep the changing part of the name as a variable. Of course we can insert the whole name as the parameter of the .read_csv. Up to you!

In [None]:
#name of the file as the variable
floor_name = "03 derde"
#reading a comma-separated values (csv) file into DataFrame
StrijpS03_df = pd.read_csv(floor_name+" verdieping.csv")

So StrijpS03_df is basically ***a DataFrame***, (not yet a GeoDataFrame). Now that we have created the Pandas DataFrame, we can get familiar with the content using the previously mentioned methods in section 2.2.

In [None]:
#printing the list of features
list(StrijpS03_df.columns)

In [None]:
#printing the number of unique values per column in the DataFrame
print(StrijpS03_df.nunique())

In [None]:
#printing the list of unique values in a certain column
print(StrijpS03_df.type.unique())

In [None]:
#getting the first 5 rows
StrijpS03_df.head()

✴ *Important* ✴ So here we can see the "geometry" column, which will become the GeoSeries we talked about earlier and which holds the geometrical features of the DataFrame. The values in this column are represented in ***WKT (Well-known Text) format***, a compact ***machine- and human-readable representation*** of geometric objects.
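
For a feel of what WKT looks like, here is a small sketch with a made-up rectangle (not taken from the project files). The shapely function `wkt.loads` turns the text into a geometry object, and the `.wkt` attribute turns the object back into text.

```python
from shapely import wkt

# A made-up 4 x 3 rectangle written as WKT text
text = "POLYGON ((0 0, 4 0, 4 3, 0 3, 0 0))"
polygon = wkt.loads(text)     # text -> shapely geometry object

print(polygon.geom_type)      # Polygon
print(polygon.area)           # 12.0
print(polygon.wkt)            # geometry -> text again
```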

Let's say we are interested to see only the rows with "IfcSpace" type. In the following cell, we define a new DataFrame (called Spaces, for the sake of *making sense*) which is a sub-DataFrame of the main one, conditioning on the type column (in this case, IfcSpace).

In [None]:
#conditioning on one column in the DataFrame
Spaces = StrijpS03_df[StrijpS03_df['type'] == "IfcSpace"]

#resetting the index, because it's more organized!
Spaces.reset_index(drop=True, inplace=True)
Spaces

So now we have a portion of the DataFrame in which we are interested. Taking a look at the "name" column, we can see some patterns in naming of the spaces (ex: 03.B02.01). We will play around with this discovery later on. But for now, we want to convert the main DataFrame (StrijpS03_df) with a geometry column to a ***GeoDataFrame***. So, GeoDataFrame is a class of geopandas, which takes ***data, geometry, and crs*** as the main parameters.

Since we know we are working with projects in the Netherlands, we might want to use the related Coordinate Reference System, which we can refer to by its EPSG code. EPSG codes are four- to five-digit numbers, each representing a particular CRS definition. The acronym EPSG comes from European Petroleum Survey Group. The code for the Netherlands is "EPSG:28992" (Amersfoort / RD New).

Here we go creating our first GeoDataFrame:

In [None]:
#first defining the geoseries (geometrical data presented in a series) from the WKT format
gs = gpd.GeoSeries.from_wkt(StrijpS03_df['geometry'])
#defining the GeoDataFrame using the data in the main DataFrame, geometry, and crs
gdf_floor03 = gpd.GeoDataFrame(StrijpS03_df, geometry=gs, crs="EPSG:28992")

Doubts about whether the CRS has been assigned to the GeoDataFrame? Do a reality check by typing ".crs is None" after the name of the GeoDataFrame. If it returns True, we know that no CRS has been assigned.

In [None]:
gdf_floor03.crs is None
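
Beyond the None check, the CRS object can also report its EPSG code, and a GeoDataFrame can be reprojected with `to_crs`. The sketch below is an illustration with a single made-up point placed at the Amersfoort reference coordinates of the Dutch grid; it is not part of the project data.

```python
import geopandas as gpd
from shapely.geometry import Point

# One illustrative point in Dutch RD coordinates (x, y in metres)
points = gpd.GeoDataFrame({'label': ['reference']},
                          geometry=[Point(155000, 463000)],
                          crs="EPSG:28992")

print(points.crs.to_epsg())               # 28992

# Reproject to longitude/latitude in degrees (EPSG:4326)
in_degrees = points.to_crs("EPSG:4326")
print(in_degrees.geometry.iloc[0])
```

The reprojected point should land in the middle of the Netherlands, which is a quick sanity check that the right CRS was assigned.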

Done! We have the GeoDataFrame called gdf_floor03. Now it's time to "see" what we created. In other words, let's ***plot*** the GeoDataFrame. We can conveniently use the plot function and set the color map (cmap) and the figure size (figsize) as parameters. Check [this](https://matplotlib.org/stable/users/explain/colors/colormaps.html) link if you are interested in pretty color maps. ✨

In [None]:
#plotting the created geodataframe
gdf_floor03.plot(cmap="Set3", figsize=(10,10))

This is apparently not what we were exactly looking for, right? It looks *strange*. How to fix/improve it? We will see in the next section (2.4).

<a name="t2.4"></a>
### 2.4 **Floorplan Visualization**

In the previous plot, all of the types (IfcSpace, IfcWall, etc.) were plotted on top of each other. Therefore, the result was maybe not what we were looking for. Here we can select a part of the DataFrame using the ***loc function***. Then we can repeat the process to get the desired plot.

In [None]:
#selecting the rows of one IFC type
selected = StrijpS03_df.loc[StrijpS03_df['type'] == "IfcWall"]

#first defining the geoseries (geometrical data presented in a series) from the WKT format, using only the selected rows
gs = gpd.GeoSeries.from_wkt(selected['geometry'])
#defining the GeoDataFrame using the data in the main DataFrame, geometry, and crs
gdf_test_floor = gpd.GeoDataFrame(selected, geometry=gs, crs="EPSG:28992")

#plotting the created geodataframe
gdf_test_floor.plot(cmap="tab20", figsize=(10,10))

Note that the colors here do not encode any property of the elements; they are simply taken ***one after another*** from the color map. Try out different ***color maps*** to see the changes! In the same manner, we can make plots for different layers to get a better idea of the floor plan.

In [None]:
#selecting the rows of one IFC type
selected = StrijpS03_df.loc[StrijpS03_df['type'] == "IfcWindow"]

#first defining the geoseries (geometrical data presented in a series) from the WKT format, using only the selected rows
gs = gpd.GeoSeries.from_wkt(selected['geometry'])
#defining the GeoDataFrame using the data in the main DataFrame, geometry, and crs
gdf_test_floor = gpd.GeoDataFrame(selected, geometry=gs, crs="EPSG:28992")

#plotting the created geodataframe
gdf_test_floor.plot(cmap="tab20", figsize=(10,10))

In [None]:
#selecting the rows of one IFC type
selected = StrijpS03_df.loc[StrijpS03_df['type'] == "IfcSpace"]

#first defining the geoseries (geometrical data presented in a series) from the WKT format, using only the selected rows
gs = gpd.GeoSeries.from_wkt(selected['geometry'])
#defining the GeoDataFrame using the data in the main DataFrame, geometry, and crs
gdf_test_floor = gpd.GeoDataFrame(selected, geometry=gs, crs="EPSG:28992")

#plotting the created geodataframe
gdf_test_floor.plot(cmap="Set3", figsize=(10,10))

In [None]:
#selecting the rows of one IFC type
selected = StrijpS03_df.loc[StrijpS03_df['type'] == "IfcOpeningElement"]

#first defining the geoseries (geometrical data presented in a series) from the WKT format, using only the selected rows
gs = gpd.GeoSeries.from_wkt(selected['geometry'])
#defining the GeoDataFrame using the data in the main DataFrame, geometry, and crs
gdf_test_floor = gpd.GeoDataFrame(selected, geometry=gs, crs="EPSG:28992")

#plotting the created geodataframe
gdf_test_floor.plot(cmap="Set3", figsize=(10,10))

In [None]:
#selecting the rows of one IFC type
selected = StrijpS03_df.loc[StrijpS03_df['type'] == "IfcBuildingElementProxy"]

#first defining the geoseries (geometrical data presented in a series) from the WKT format, using only the selected rows
gs = gpd.GeoSeries.from_wkt(selected['geometry'])
#defining the GeoDataFrame using the data in the main DataFrame, geometry, and crs
gdf_test_floor = gpd.GeoDataFrame(selected, geometry=gs, crs="EPSG:28992")

#plotting the created geodataframe
gdf_test_floor.plot(cmap="Set3", figsize=(10,10))

In [None]:
#selecting the rows of one IFC type
selected = StrijpS03_df.loc[StrijpS03_df['type'] == "IfcDoor"]

#first defining the geoseries (geometrical data presented in a series) from the WKT format, using only the selected rows
gs = gpd.GeoSeries.from_wkt(selected['geometry'])
#defining the GeoDataFrame using the data in the main DataFrame, geometry, and crs
gdf_test_floor = gpd.GeoDataFrame(selected, geometry=gs, crs="EPSG:28992")

#plotting the created geodataframe
gdf_test_floor.plot(cmap="Dark2", figsize=(10,10))
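
The cells above repeat the same three steps (select by type, parse the WKT geometry, build the GeoDataFrame) for every element type. As a rough sketch, these steps could be wrapped in a small helper. The function name `gdf_for_type` and the toy DataFrame are hypothetical and made up for illustration; only the "type" and "geometry" column names and the CRS are taken from the cells above.

```python
import pandas as pd
import geopandas as gpd


def gdf_for_type(df, ifc_type, crs="EPSG:28992"):
    """Hypothetical helper: select one IFC type and return it as a GeoDataFrame."""
    # keep only the rows of the requested type
    selected = df.loc[df["type"] == ifc_type]
    # parse the WKT strings of the selected rows only
    geometry = gpd.GeoSeries.from_wkt(selected["geometry"])
    return gpd.GeoDataFrame(selected, geometry=geometry, crs=crs)


# toy data standing in for StrijpS03_df (values are invented)
toy_df = pd.DataFrame({
    "type": ["IfcSpace", "IfcDoor", "IfcSpace"],
    "geometry": [
        "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
        "POLYGON ((2 0, 3 0, 3 1, 2 1, 2 0))",
        "POLYGON ((0 2, 1 2, 1 3, 0 3, 0 2))",
    ],
})

toy_spaces = gdf_for_type(toy_df, "IfcSpace")
toy_doors = gdf_for_type(toy_df, "IfcDoor")
```

With such a helper, each of the plots above would reduce to a single line, for example `gdf_for_type(StrijpS03_df, "IfcDoor").plot(cmap="Dark2", figsize=(10,10))`.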

Sometimes, we are interested in selecting the part of the DataFrame that does ***NOT*** have a certain feature. In this case we use the same function (loc), but instead of == (which means equal to), we use != (which means not equal to). Here, for example, we will plot everything ***except for*** the IfcSpace type:

In [None]:
#selecting the rows whose type is not IfcSpace
selected = StrijpS03_df.loc[StrijpS03_df['type'] != "IfcSpace"]

#first defining the geoseries (geometrical data presented in a series) from the WKT format
gs = gpd.GeoSeries.from_wkt(selected['geometry'])
#defining the GeoDataFrame using the data in the main DataFrame, geometry, and crs
gdf_test_floor = gpd.GeoDataFrame(selected, geometry=gs, crs="EPSG:28992")

#plotting the created geodataframe
gdf_test_floor.plot(cmap="tab20", figsize=(10,10))
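
To see what the == and != conditions do on their own, here is a minimal sketch on a toy DataFrame (the rows are invented for illustration). Each comparison produces a boolean mask with one True/False value per row, and loc keeps the rows where the mask is True. The ~ operator flips a mask, so it gives the same result as !=.

```python
import pandas as pd

# toy stand-in for the "type" column of StrijpS03_df (values are invented)
toy = pd.DataFrame({"type": ["IfcSpace", "IfcDoor", "IfcWall", "IfcSpace"]})

# boolean mask: True where the type equals "IfcSpace"
is_space = toy["type"] == "IfcSpace"

spaces = toy.loc[is_space]                        # rows that ARE IfcSpace
not_spaces = toy.loc[toy["type"] != "IfcSpace"]   # rows that are NOT IfcSpace
also_not_spaces = toy.loc[~is_space]              # same selection, by negating the mask
```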

Remember the pattern we spotted in the ***naming convention*** of the "name" column in section 2.3? Let's revisit it.

In [None]:
#conditioning on one column in the DataFrame and resetting the index
#(reset_index returns a new DataFrame, so later column assignments do not trigger a SettingWithCopyWarning)
Spaces = StrijpS03_df[StrijpS03_df['type'] == "IfcSpace"].reset_index(drop=True)
Spaces

Architectural firms usually have a specific ***annotation style*** for identifying spaces. In our example of KAAN Architecten residential buildings, spaces are denoted by the number of the floor, the intersection axes, and the sequence of spaces, as follows:

<center>
<img src="https://drive.google.com/uc?export=view&id=1QmFF9bVm1nSkuf4UisVSPiQcPIdycd8J" alt="floor-plan" class="center" width="px">
</center>


Let's assume we are interested in having a semantic ***color-coded map*** in which all the spaces with the same grid intersection are shown with the same color. For this purpose, we first need to categorize the data in the "name" column based on the naming convention, and then plot the DataFrame in the desired way.

In [None]:
#adding ax_id column to the dataframe with the same values as "name" column
Spaces['ax_id'] = Spaces['name']
#let's take a look
Spaces

Here we observe two different naming patterns in the "ax_id" column. Names with only numerical values represent units, and names that follow the style described above represent spaces. One way to keep the spaces is to remove all the numerical parts from the "ax_id" column and then only pay attention to the axis name.

In [None]:
#replacing the ax_id values with a newly defined name specific to each axis
#(.loc with a boolean mask avoids chained assignment, which newer pandas versions warn about or ignore)
Spaces.loc[Spaces["ax_id"].str.contains('A'), "ax_id"] = "A_ax"
Spaces.loc[Spaces["ax_id"].str.contains('B'), "ax_id"] = "B_ax"
Spaces.loc[Spaces["ax_id"].str.contains('C'), "ax_id"] = "C_ax"
Spaces.loc[Spaces["ax_id"].str.contains('D'), "ax_id"] = "D_ax"
Spaces.loc[Spaces["ax_id"].str.contains('E'), "ax_id"] = "E_ax"
Spaces
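
As an alternative sketch, the axis letter could be pulled out in one step with a regular expression instead of one line per axis. The sample names below are hypothetical: the exact naming format in the KAAN data may differ, so the pattern would need to be adapted to the real "name" column. Names without an axis letter (the purely numerical unit names) end up as missing values.

```python
import pandas as pd

# hypothetical names: floor number, axis letter, sequence number, plus one numeric unit name
toy_spaces = pd.DataFrame({"name": ["03.A.01", "03.B.02", "112", "03.E.10"]})

# capture the first capital letter between A and E; rows without a match become NaN
axis_letter = toy_spaces["name"].str.extract(r"([A-E])", expand=False)

# build the same labels as in the cell above ("A_ax", "B_ax", ...)
toy_spaces["ax_id"] = axis_letter + "_ax"
```

Rows with a missing ax_id could then be dropped with `dropna(subset=["ax_id"])` if only the spaces that follow the convention are of interest.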

Now that we have categorized the IfcSpace elements based on the naming convention, we can make a plot based on the unique values in the selected column (in this case, ax_id). This type of map is called a ***choropleth*** (a map where the color of each shape is based on the value of an associated variable). Simply use the plot command with the column argument set to the column whose values you want to use for assigning colors.


In [None]:
#first defining the geoseries (geometrical data presented in a series) from the WKT format
gs = gpd.GeoSeries.from_wkt(Spaces['geometry'])
#defining the GeoDataFrame using the data in the main DataFrame, geometry, and crs
gdf_test_floor = gpd.GeoDataFrame(Spaces, geometry=gs, crs="EPSG:28992")

#plotting the created geodataframe
gdf_test_floor.plot(column="ax_id", legend = True, cmap="Set3", figsize=(10,10))

We reached our aim of having a color-coded map of the floor plan based on the naming convention. This plot is now not only ***human-interpretable***, but also ***machine-readable***! Nice, right? Of course, this was only one example of how working with DataFrames can lead to useful plots. In general, we can modify the pandas DataFrame based on our problem formulation, and when it comes to plotting, we can convert it to a GeoDataFrame.

<a name="t3"></a>
## 🟡 **Additional GeoPandas (optional)**

In this section, we play around with some other functions offered by GeoPandas, using the ready-made datasets from the geodatasets package.

In [None]:
!pip install geodatasets

In [None]:
#importing the geodatasets module
import geodatasets

GeoPandas inherits the standard pandas methods for indexing/selecting data. This includes label based indexing with loc and integer position based indexing with iloc, which apply to both GeoSeries and GeoDataFrame objects.

In addition to the standard pandas methods, GeoPandas also provides coordinate based indexing with the ***cx*** indexer, which slices using a bounding box. Geometries in the GeoSeries or GeoDataFrame that intersect the bounding box will be returned.

In [None]:
#using one of geo datasets for this example
chile = gpd.read_file(geodatasets.get_path('geoda.chile_labor'))
chile.plot(figsize=(8, 8));

In [None]:
#selecting parts of Chile whose boundaries extend south of the -50 degrees latitude

southern_chile = chile.cx[:, :-50]

southern_chile.plot(figsize=(8, 8), cmap ='tab10');
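
To make the slicing syntax of the cx indexer more concrete, here is a minimal sketch on four made-up points. The first slice is the x-range and the second is the y-range, and leaving a bound empty means "no limit" on that side, just like in the Chile example above.

```python
import geopandas as gpd
from shapely.geometry import Point

# four made-up points along the diagonal
points = gpd.GeoSeries([Point(0, 0), Point(1, 1), Point(2, 2), Point(3, 3)])

# cx[xmin:xmax, ymin:ymax] keeps the geometries that intersect the bounding box
inside_box = points.cx[0.5:2.5, 0.5:2.5]

# open-ended slices: any x, and y up to 1.5
south = points.cx[:, :1.5]
```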

<a name="t3.1"></a>
### 3.1 **Maps and Plots**

GeoPandas provides a high-level interface to the matplotlib library for making maps. Mapping shapes is as easy as using the plot() method on a GeoSeries or GeoDataFrame.



In [None]:
#plotting example on Chicago
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))
chicago.plot(cmap='tab10');


GeoPandas makes it easy to create ***Choropleth maps*** (maps where the color of each shape is based on the value of an associated variable). Simply use the plot command with the column argument set to the column whose values you want used to assign colors.

To create the choropleth of Chicago, it is practical to first check which columns exist in the dataset. Then, we can plot the choropleth map based on the desired column.

In [None]:
list(chicago.columns)

In [None]:
chicago.plot(column="POP2010", figsize=(8, 8), legend = True, legend_kwds={"label": "Population in 2010", "orientation": "horizontal"});

<a name="t3.2"></a>
### 3.2 **Geometric Manipulation**

GeoPandas makes available all the tools for geometric manipulations in the Shapely library. Here we can see a number of useful functions. It is always possible to check for other functions in the documentation.

![alt text](https://geopandas.org/en/stable/_images/binary_geo-intersection.svg)


*   GeoSeries.buffer: Returns a GeoSeries of geometries representing all points within a given distance of each geometric object.
*   GeoSeries.boundary: Returns a GeoSeries of lower dimensional objects representing each geometry’s set-theoretic boundary.
*   GeoSeries.centroid: Returns a GeoSeries of points for each geometric centroid.
*   GeoSeries.envelope: Returns a GeoSeries of geometries representing the point or smallest rectangular polygon (with sides parallel to the coordinate axes) that contains each object.
*   GeoSeries.rotate: Rotate the coordinates of the GeoSeries.
*   GeoSeries.scale: Scale the geometries of the GeoSeries along each (x, y, z) dimension.

In [None]:
#example of some geometric manipulations
from geopandas import GeoSeries
from shapely.geometry import Polygon

#let's create some polygons
p1 = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
p2 = Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])
p3 = Polygon([(0, 0), (1, 0), (1, 1)])

#put them together to make a GeoSeries
g = GeoSeries([p1, p2, p3])

#it's time for plot!
g.plot(cmap='Set3')

Looks nice, right? Let's try a couple of functions on the GeoSeries that we created. Here we apply the rotate, scale, and buffer functions. Note that these three functions ***return a GeoSeries***, and in order to *see* the result, we need to plot it.

In [None]:
#rotate the GeoSeries using the angle of rotation and the origin point
g_rotated = g.rotate(60, origin=(0, 0))
g_rotated.plot(cmap='Set3')

What is the ***reference axis*** for the rotation? And is the rotation ***clockwise or counter-clockwise***? What do you think we should do to ***change the direction of the rotation***? Too many questions for a single function. But for now, let's try another handy function, ***.scale***. To apply this function, we set the parameters as the scale factors in the x-direction and y-direction, respectively.

In [None]:
#scaling the geoseries by a factor of 2 in the x direction (and 1, i.e. unchanged, in the y direction)
g_scale = g.scale(2, 1)
g_scale.plot(cmap='Set3')

Wait a second, where is the origin of the scaling in the example above?

But what if we want ***the origin of the scale operation*** to be a certain point? Then we can specify it in the input parameters of the .scale function.

In [None]:
#scaling by 2 in the x direction and 3.6 in the y direction, relative to the point (0, 0)
g_scale = g.scale(2, 3.6, origin=(0, 0))
g_scale.plot(cmap='Set3')

As the last example here, we have the .buffer function, which returns a GeoSeries of geometries representing all points within a given distance of each geometric object.

In [None]:
g_buffer = g.buffer(0.8)
g_buffer.plot(cmap='Set3')
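
The questions raised above about the rotation direction and the default origin can also be checked numerically instead of visually. The sketch below uses a single made-up point and a unit square. It relies on the documented GeoPandas behavior that positive angles rotate counter-clockwise and that the default origin is the center of each geometry's bounding box.

```python
from geopandas import GeoSeries
from shapely.geometry import Point, Polygon

# a single point on the positive x-axis
p = GeoSeries([Point(1, 0)])

# positive angle: counter-clockwise, so the point moves towards the positive y-axis
ccw = p.rotate(90, origin=(0, 0))
# negative angle: clockwise, so the point moves towards the negative y-axis
cw = p.rotate(-90, origin=(0, 0))

ccw_xy = (ccw.x.iloc[0], ccw.y.iloc[0])
cw_xy = (cw.x.iloc[0], cw.y.iloc[0])

# a unit square to inspect the origin used by scale
square = GeoSeries([Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])])

# default origin: the square grows symmetrically around its own center
default_bounds = square.scale(2, 1).total_bounds
# explicit origin: the square grows away from the point (0, 0)
origin_bounds = square.scale(2, 1, origin=(0, 0)).total_bounds
```

Printing `ccw_xy`, `cw_xy`, `default_bounds`, and `origin_bounds` shows where the geometries end up in each case.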

<a name="t3.3"></a>
### 3.3 **Operations with Overlay**

When working with multiple spatial datasets (especially multiple polygon or line datasets), users often wish to create new shapes based on places where those datasets overlap (or don’t overlap). These manipulations are often referred to using the language of sets: intersections, unions, and differences. These types of operations are made available in the GeoPandas library through the overlay() method.

The basic idea is demonstrated by the graphic below but keep in mind that overlays operate at the DataFrame level, not on individual geometries, and the properties from both are retained. In effect, for every shape in the left GeoDataFrame, this operation is executed against every other shape in the right GeoDataFrame:


![alt text](https://geopandas.org/en/stable/_images/overlay_operations.png)


In [None]:
# Example of overlay with the Chicago and Groceries geodatasets
chicago = gpd.read_file(geodatasets.get_path("geoda.chicago_commpop"))
groceries = gpd.read_file(geodatasets.get_path("geoda.groceries"))

# Project to crs that uses meters as distance measure
chicago = chicago.to_crs("ESRI:102003")

groceries = groceries.to_crs("ESRI:102003")

In [None]:
# Look at Chicago:
chicago.plot(cmap='tab10');

# Now buffer groceries to find area within 1km.
# Check CRS -- USA Contiguous Albers Equal Area, units of meters.
groceries.crs

# make 1km buffer
groceries['geometry']= groceries.buffer(1000)

groceries.plot(cmap='tab10');

In [None]:
#To select only the portion of community areas within 1km of a grocery, set the how option to "intersection", which creates a new set of polygons where these two layers overlap:

chicago_cores = chicago.overlay(groceries, how='intersection')

chicago_cores.plot(alpha=0.5, edgecolor='k', cmap='tab10');

In [None]:
#Changing the how option allows for different types of overlay operations. For example, if you were interested in the portions of Chicago far from groceries (the peripheries), you would compute the difference of the two.

chicago_peripheries = chicago.overlay(groceries, how='difference')

chicago_peripheries.plot(alpha=0.5, edgecolor='k', cmap='tab10');

You can try out the other two overlay methods (union and symmetric difference) for yourself to see what they look like! Also, you can think of meaningful applications of these overlay methods in the architectural design process.
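
As a starting point for experimenting, here is a minimal sketch that runs all four overlay types on two made-up overlapping squares and compares the resulting total areas. The column names `left_id` and `right_id` are invented for illustration. Note that the value of the how option for the symmetric difference is spelled `symmetric_difference`.

```python
import geopandas as gpd
from shapely.geometry import box

# two 2x2 squares that overlap in a 1x1 square
left = gpd.GeoDataFrame({"left_id": [1]}, geometry=[box(0, 0, 2, 2)])
right = gpd.GeoDataFrame({"right_id": [1]}, geometry=[box(1, 1, 3, 3)])

# total area of the result for each overlay type
areas = {
    how: left.overlay(right, how=how).area.sum()
    for how in ("intersection", "union", "symmetric_difference", "difference")
}
```

Comparing the values in `areas` with the graphic above is a quick way to check your understanding of each operation before applying it to real data.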

<a name="t4"></a>
## 📊 **Exercise**

We have already created a GeoPandas DataFrame for a single floor of one KAAN building project. Now, the aim is to pick a whole project (containing all the floors) and create a GeoDataFrame. Your team has already chosen one project. You can conveniently divide the floors among your group members. Make sure you cover the whole building. Note that you will ***create the GeoDataFrame for the whole project***, but up to 5 plots for each member of a group would suffice.

The ***shared folder*** in Drive containing the files required for all projects can be found [here](https://drive.google.com/drive/folders/1vnRCW3_-q63WfpBkD3e2Nqx1SauMH9yS?usp=sharing). You can copy the folders required for your project to your own Drive. The cleaning process will be done on the files inside the "general_csv" folder, whereas the GeoDataFrame creation will be done on the files inside the "BatchPlan_output" folder.

Keep in mind that ***every project is unique*** and might have its own characteristics, and consequently require a special way of cleaning or plotting. It is recommended to coordinate among your group members to find a unified way of creating the DataFrame.


### **Evaluation**
- **Code quality (1)**
    - 1p: writing clear code and using proper comments where needed

- **Loading Data (1)**
    - 1p: using a proper directory to have access to the files of the project

- **GeoDataFrame (4)**
    - 1p: creating the pandas DataFrame
    - 2p: cleaning, sorting, and selecting data
    - 1p: creating the GeoPandas DataFrame
- **Plotting (4)**
    - 1p: plotting up to 5 (depending on the project) floor plans non-semantically
    - 3p: defining a plotting rule and plotting up to 5 (depending on the project) floor plans semantically

Your grade is equal to the number of points you receive (out of 10).



### **Output**
**Write your findings and interpretation in a new notebook** and name it **"A2_from_numbers_to_plots - \<name\>.ipynb"**.