# 2D Design Template

# Overview

The purpose of this project is for you to apply what you have learnt in this course. This includes working with data and visualizing it, creating a linear regression model, and using metrics to measure the accuracy of your model.

Please find the project handout description in the following [link](https://edimension.sutd.edu.sg/webapps/blackboard/content/listContent.jsp?course_id=_5582_1&content_id=_200537_1).


## Deliverables

You need to submit this Jupyter notebook together with the dataset into Vocareum. Use the template in this notebook to work on this project. You are free to edit or add more cells if needed.

## Students Submission
*Include a short sentence summarizing each member’s contribution.*

Students' Names:
- Teo Li Zhong
  - Documented the code and Streamlit
- Otniel Steven Krisanto
  - Scoured data for parameters and documented code
- Chew Wei-Han
  - Regression Model
- Aishaani Pal
  - Scoured data for parameters
- Tan Rui Anh
  - Cleaned and processed raw data

# Airplain

## How might we predict how much air pollution a city in the Netherlands will create, and use that prediction to help city planners design more sustainable cities that reduce air pollution?

<strong>User Persona</strong>: Dutch Urban Planners

The Netherlands was chosen because it is well known for its urban design. In future, the predictions could be extended to the rest of Europe.

<strong>The Problem</strong>: As countries urbanise, urban planners are tasked with designing and planning cities that are liveable and sustainable. However, it can be difficult to predict how different aspects and factors can affect air pollution in a city, hence Airplain was conceived.

Airplain is a linear regression model designed to predict the levels of air pollution (via PM2.5) a city might produce based on a set of parameters.

More details on the project structure can be found in the [readme](README.md).
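As a rough sketch of the model form, a linear regression on several parameters can be written as below. The symbols are notation introduced here for illustration and are not taken from the project handout: $\hat{y}$ is the predicted PM2.5 value, $x_1, \dots, x_n$ are the input features (for example population density or road area), and $\beta_0, \dots, \beta_n$ are the coefficients learnt from the training data.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
```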

## Dataset
The dataset used for training the regression model was taken from various official statistics databases from the Netherlands. 

### Data sources:
1. [Netherlands Population density data](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=70072ned&_theme=246)
2. [Netherlands region code](https://opendata.cbs.nl/statline/#/CBS/nl/dataset/84929NED/table?dl=343E)
3. [Netherlands PM2.5 data](https://www.luchtmeetnet.nl/rapportages)
4. [Netherlands proximity to facilities](https://opendata.cbs.nl/statline/#/CBS/en/dataset/85560ENG/table?ts=1754288993424)
5. [Land use by municipality](https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS&tableId=70262ENG&_theme=1182)

### Processed Data
All data were processed and converted to CSV format for ease of import into `pandas`. The processed data is laid out as follows:
<table>
    <tr>
        <td></td>
        <td>Year 1</td>
        <td>Year 2</td>
        <td>Year 3</td>
        <td>...</td>
    </tr>
    <tr>
        <td>Region 1</td>
        <td>Datapoint 11</td>
        <td>Datapoint 12</td>
        <td>Datapoint 13</td>
        <td>...</td>
    </tr>
    <tr>
        <td>Region 2</td>
        <td>Datapoint 21</td>
        <td>Datapoint 22</td>
        <td>Datapoint 23</td>
        <td>...</td>
    </tr>
    <tr>
        <td>...</td>
        <td>...</td>
        <td>...</td>
        <td>...</td>
        <td>...</td>
    </tr>
</table>
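
A table in this region-by-year layout can be reshaped into one row per (region, year) pair, which is often easier to merge with other data. The snippet below is a minimal sketch on a made-up table. The values and the column names `Region`, `Year` and `Value` are assumptions for illustration, not the project's actual files.

```python
import pandas as pd

# Hypothetical wide table in the layout described above:
# one row per region, one column per year (values are made up).
wide_df = pd.DataFrame(
    {"2019": [10.0, 20.0], "2020": [11.0, 21.0]},
    index=pd.Index(["Region 1", "Region 2"], name="Region"),
)

# Reshape to one row per (Region, Year) pair.
long_df = (
    wide_df.reset_index()
    .melt(id_vars="Region", var_name="Year", value_name="Value")
    .astype({"Year": int})
    .sort_values(["Region", "Year"])
    .reset_index(drop=True)
)
print(long_df)
```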

For datasets that have more than one parameter per datapoint, the values are stored in a tuple.
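
When such a tuple is read back from a CSV file it arrives as a string. One possible way to recover the numbers is `ast.literal_eval`, sketched below. The helper name `parse_pair` and the exact cell format are assumptions, so adjust them to match the real files.

```python
import ast


def parse_pair(cell: str) -> tuple[float, float]:
    """Parse a cell such as "('12.5', '3.0')" or "(12.5, 3.0)" into two floats.

    Hypothetical helper: assumes every cell holds exactly two values.
    """
    first, second = ast.literal_eval(cell)
    return float(first), float(second)


print(parse_pair("('12.5', '3.0')"))  # (12.5, 3.0)
```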

The independent variables chosen by the team are:
1. Land use (Roads vs Parks)
2. Population density
3. Proximity to jobs (by km)

<sub>Each independent variable is stored in its own CSV file</sub>

The target is:
1. PM2.5 value of the municipality

More details on the dataset can be found in the [readme](/datafiles/README.md)

- Put Python code for loading the data into pandas dataframe(s). The data should be the raw data downloaded from the source. No pre-processing using any software (excel, python, etc) yet. Include this dataset in your submission
- Explain each column of your dataset (can use comment or markdown)

In [None]:
import linearRegression
import cleanNether
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as axes


### Clean & Analyze your data
Use python code to:
- Clean your data
- Calculate Descriptive Statistics and other statistical analysis
- Visualization with meaningful analysis description
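
As an illustration of the descriptive statistics step, a five-number summary can be computed with `numpy.percentile`. This is a sketch under assumptions and is not the implementation of `cleanNether.getFiveNumberSummary()`. The helper name and the `missing_marker` parameter are hypothetical, chosen because the cells below mark missing values with `-1`.

```python
import numpy as np
import pandas as pd


def five_number_summary(values: pd.Series, missing_marker: float = -1.0) -> dict[str, float]:
    """Return min, Q1, median, Q3 and max, ignoring missing values.

    Hypothetical helper: treats both NaN and `missing_marker` as missing.
    """
    clean = values.replace(missing_marker, np.nan).dropna()
    quantiles = np.percentile(clean, [0, 25, 50, 75, 100])
    labels = ["min", "q1", "median", "q3", "max"]
    return dict(zip(labels, map(float, quantiles)))


# Made-up sample with one missing entry marked as -1.
sample = pd.Series([1.0, 2.0, -1.0, 3.0, 4.0, 5.0])
summary = five_number_summary(sample)
print(summary)
```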

In [None]:
## Process all the data and output to a csv format
print("Population Density Data Processed:", cleanNether.popDensity())
print("Proximity Data Processed:", cleanNether.trimProximity())
print("Land Use Data Processed:", cleanNether.landUse())
print("Travel Data Processed:", cleanNether.carTravel())

In [None]:
# descriptive statistics
## Five number summary
print("Five number summary:", cleanNether.getFiveNumberSummary())

popDensityDF: pd.DataFrame = pd.read_csv('datafiles/processedPopDensity.csv',index_col=0).fillna(-1)
# pmDF: pd.DataFrame = pd.read_csv('datafiles/processedPm2.5.csv',index_col=0).fillna(-1)
proximityDF: pd.DataFrame = pd.read_csv('datafiles/processedProximity.csv')
landUseDF: pd.DataFrame = pd.read_csv('datafiles/processedLandUse.csv',sep=";",index_col=0).fillna(-1) 
transportPublicDF: pd.DataFrame = pd.read_csv('datafiles/processedCarTravelPublic.csv')
transportPrivateDF: pd.DataFrame = pd.read_csv('datafiles/processedCarTravelPrivate.csv')

def getDataFromDF(df: pd.DataFrame, region: str, year: str, targetCol: str) -> str:
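    '''
    Look up a single value in a dataframe that has one row per region and year.

    Args:
        df (pd.DataFrame): Dataframe with "Region" and "Year" columns and the
            default integer index.
        region (str): Region name to match against the "Region" column.
        year (str): Year to match; converted to int before comparing.
        targetCol (str): Name of the column to read the value from.

    Returns:
        Any: The value of targetCol in the single matching row.

    Raises:
        AssertionError: If the region and year do not match exactly one row.
        KeyError: If targetCol is not a column of df.
    '''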
    #print(f'{region=}, {type(region)=}, {year=}, {type(year)=}')
    #print(year, type(year))
    year = int(year)
    regionMatches: pd.Series = df["Region"] == region
    yearMatches: pd.Series = df["Year"] == year
    regionMatches: set = set(df.index[regionMatches].tolist())
    yearMatches: set = set(df.index[yearMatches].tolist())
    desiredIndices: set = regionMatches & yearMatches
    #print(f'{len(regionMatches)=}, {len(yearMatches)=}')
    #print(f'{len(desiredIndices)=}')
    assert len(desiredIndices) == 1
    desiredIndex: int = list(desiredIndices)[0]
    targetIndex: int = df.columns.get_loc(targetCol)

    output = df.iloc[desiredIndex,targetIndex]
    return output
  

In [None]:
# visualization with analysis
features = ['Population Density', 'Total Road Area', 'Total Greenery Area','Public Transport Travel','Private Transport Travel']

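# Requires dataFeaturesTest, dataTarget_test and pred to be defined before this cell runs
for feature in features:
    # One figure per feature: actual vs predicted PM2.5 on the test set
    plt.scatter(dataFeaturesTest[feature], dataTarget_test, label='Actual')
    plt.scatter(dataFeaturesTest[feature], pred, label='Predicted')
    plt.xlabel(feature)
    plt.ylabel('PM2.5')
    plt.legend()
    plt.show()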

### Features and Target Preparation

Prepare features and target for model training.
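
Once features and target are in one dataframe, they are typically separated and split into training and test sets. Below is a minimal sketch that uses only `numpy` and `pandas`. The function name `split_data`, the `PM2.5` column name, the 30% test size and the toy values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def split_data(features: pd.DataFrame, target: pd.Series,
               test_size: float = 0.3, seed: int = 100):
    """Shuffle row positions reproducibly and split into train and test sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(features))
    n_test = int(round(test_size * len(features)))
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return (features.iloc[train_idx], features.iloc[test_idx],
            target.iloc[train_idx], target.iloc[test_idx])


# Toy data: the column names mirror this notebook, the values are made up.
demo = pd.DataFrame({
    "Population Density": [100.0, 250.0, 400.0, 800.0, 1200.0,
                           1500.0, 2000.0, 2600.0, 3100.0, 4000.0],
    "PM2.5": [8.0, 8.5, 9.0, 9.8, 10.5, 11.0, 11.9, 12.4, 13.0, 14.2],
})
X_train, X_test, y_train, y_test = split_data(demo[["Population Density"]], demo["PM2.5"])
print(len(X_train), len(X_test))
```

Fixing the seed keeps the test set identical between runs, so later model variants can be compared on the same rows.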

In [0]:
# put Python code to prepare your features and target
#Merge the cleaned data into 1 dataframe
newColumns: pd.DataFrame = pd.DataFrame(columns=['Population Density', 'Total Road Area', 'Total Greenery Area','Public Transport Travel','Private Transport Travel'])
for index, row in proximityDF.iterrows():
    region: str = row['Region']
    year: str = str(row['Year'])

    #Process population density (-1.0 marks a missing value)
    try:
        popDensity: float = popDensityDF.loc[region,year]
    except Exception:
        popDensity: float = -1.0

    #Process land use, stored as a "(road, green)" tuple string
    try:
        value = landUseDF.loc[region, year]
        assert value != -1.0
        value = value.replace("(","").replace(")","").replace('\'','')
        roadArea, greenArea = value.split(",")
        roadArea: float = float(roadArea)
        greenArea: float = float(greenArea)
    except Exception:
        roadArea: float = -1.0
        greenArea: float = -1.0

    #Process travel data
    try:
        publicTransport: float = float(getDataFromDF(transportPublicDF,region,year, 'Public Transport in km'))
    except Exception:
        publicTransport: float = -1.0
    try:
        privateTransport: float = float(getDataFromDF(transportPrivateDF,region,year, 'Private Transport in km'))
    except Exception:
        privateTransport: float = -1.0

    #Store this row's values under proximityDF's index so the later concat lines up
    newColumns.loc[index] = [popDensity, roadArea, greenArea, publicTransport, privateTransport]


### Building Model

Use python code to build your model. Give explanation on this process.
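
To illustrate the model-building step, here is a minimal sketch of multiple linear regression fitted by batch gradient descent using only `numpy`. It is an assumption about one reasonable approach and does not describe what the imported `linearRegression` module actually does. The function names, learning rate and iteration count are chosen for the toy example only.

```python
import numpy as np


def add_intercept(X: np.ndarray) -> np.ndarray:
    """Prepend a column of ones so the first coefficient acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_gradient_descent(X: np.ndarray, y: np.ndarray,
                         alpha: float = 0.1, iterations: int = 2000) -> np.ndarray:
    """Minimise the mean squared error with batch gradient descent."""
    Xb = add_intercept(X)
    m = Xb.shape[0]
    beta = np.zeros((Xb.shape[1], 1))
    for _ in range(iterations):
        error = Xb @ beta - y                   # residuals, shape (m, 1)
        beta -= (alpha / m) * (Xb.T @ error)    # gradient step
    return beta


def predict(X: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Return predictions for X given fitted coefficients."""
    return add_intercept(X) @ beta


# Toy data generated from y = 1 + 2x, so the fit should recover [1, 2].
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * X
beta = fit_gradient_descent(X, y)
print(beta.ravel())
```

On real features with very different scales, the learning rate usually needs the inputs to be normalised first, otherwise the updates can diverge.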

In [0]:
# put Python code to build your model
mergedDF: pd.DataFrame = pd.concat([proximityDF,newColumns],axis=1)
print(mergedDF)
mergedDF.to_csv('compiledData.csv',index=False)

print('-'*150)
#Remove incomplete rows from the data to get the final compiled dataset
incompleteRows: list = []
for index, row in mergedDF.iterrows():
    complete: bool = True
    # '0 to 10 km','>10 to 20 km','>20 to 50km', 'Population Density','Total Road Area','Total Greenery Area', 'Public Transport Travel', 'Private Transport Travel' For testing
    exclusionList : list[str] = [ 'Public Transport Travel'] #For testing
    for key, value in row.items():
        if key in exclusionList: #For testing
            continue #For testing
        if value == -1.0:
            complete: bool = False
            break
    if not complete:
        incompleteRows.append(index)
data: pd.DataFrame = mergedDF.drop(incompleteRows, axis=0).reset_index()

### Evaluating the Model

- Describe the metrics of your choice
- Evaluate your model performance
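
Two common metrics for a regression model are the mean squared error (MSE) and the coefficient of determination ($R^2$). The sketch below computes both with `numpy`. The choice of these two metrics and the function names are assumptions, since this notebook does not yet state which metrics the team uses.

```python
import numpy as np


def mean_squared_error(y_true, y_pred) -> float:
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))


def r2_score(y_true, y_pred) -> float:
    """1 - SS_res / SS_tot: the share of variance explained by the model."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
mse = mean_squared_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(mse, r2)
```

MSE is in squared units of the target, while $R^2$ is unitless, so reporting both gives a sense of absolute error and of how much better the model is than always predicting the mean.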

In [None]:
# put Python code to test & evaluate the model

### Improving the Model

- Improve the models by performing any data processing techniques or hyperparameter tuning.
- You can repeat the steps above to show the improvement as compared to the previous performance

Note:
- You should not change or add to the dataset at this step
- You are allowed to use libraries such as sklearn for data processing (NOT for building the model)
- Make sure to have the same test dataset so the results are comparable with the previous model 
- If you perform hyperparameter tuning, it will require you to split your training data further into train and validation dataset
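
One data processing technique that often helps gradient-based linear regression is z-normalization of the features. The sketch below computes the scaling statistics on the training data only and reuses them for validation or test data, so no information leaks from the held-out rows. The function names and sample values are assumptions for illustration.

```python
import numpy as np


def fit_scaler(X_train: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Compute per-feature mean and standard deviation from the training set."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return mean, std


def apply_scaler(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Scale any dataset with statistics learnt from the training set."""
    return (X - mean) / std


# Made-up training and validation rows with two features each.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_val = np.array([[2.0, 20.0]])

mean, std = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, mean, std)
X_val_scaled = apply_scaler(X_val, mean, std)
print(X_train_scaled)
print(X_val_scaled)
```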

In [0]:
# Re-iterate the steps above with improvement

### Discussion and Analysis

- Analyze the results of your metrics.
- Explain how your analysis and machine learning help to solve your problem statement.
- Conclusion