# ***DSAA5020 Assignment 1: Is Shanshui Needed at City Roadside Sites?***


---
### **50015627 JIANG Zhuoyang**
### Oct.11 2023



# 1. Task Description

## (1) Motivation:
"Shanshui City" was first proposed by Qian Xuesen in 1990. It is a futuristic urban concept based on traditional Chinese views of natural landscapes and the unity of heaven and humanity. However, compared with other theories of future cities from that period, Shanshui City remained largely conceptual: research and exploration were limited, and a comprehensive set of ideas and feasible solutions for modern urban issues was lacking.

With the development of urban planning as a discipline and China's economic growth, Chinese cities have gradually shifted from incremental expansion to stock optimization, which has become a focus of attention. I therefore combine mature street-view image big data with geographic information system data, which together offer massive scale, rich information, and both subjective and objective views of a place, to reconsider Qian Xuesen's "Shanshui City" concept. My task is to use machine learning to assess whether fine-grained urban plots require "Shanshui-like" development.

## (2) Task:
The nature of the task is to train a binary classification model based on machine learning. The model takes input in the form of feature vectors constructed from various types of information pertaining to specific urban roadside locations. The output is a binary label indicating whether the location is suitable for "Shanshui" development. A label of 1 signifies suitability, while a label of 0 indicates unsuitability.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import metrics
from sklearn.utils import shuffle
from io import StringIO
import sys
sns.set_style('whitegrid')
%matplotlib inline

# 2. Dataset (code in folder './DatasetConstruct')
## (1) What Kind of Data Do We Need?
I believe that evaluating the potential for "Shanshui-like" development of a place requires consideration of two types of information:

1. Subjective perceptual information centered around human experiences.
2. Objective geographical information based on the overall urban development context.

The data sources for these two types of information need to correspond to their respective characteristics:

1. Street-level image data: This data can provide a fine-grained description of people's experiential perceptions in a given environment, making it human-centric. Additionally, due to the vast amount of data in street-level image datasets, batched street-level image data can also reflect the overall perceptual information of an entire location from a macro perspective.

2. Geographic information data (POI - Points of Interest): Some commonly used geographical information data have the potential to provide background information on urban development from the perspective of city services, population, and other aspects. This is also a crucial factor in determining whether an urban area can undergo "Shanshui-like" development.

## (2) How Was the Data Collected?
Regarding these two types of data, I need to first obtain the coordinates of all street spaces in a specific research location, and then proceed with data collection in different ways, which can be broken down into the following steps:

1. Coordinate Point Information Retrieval:
   I obtained a table of geographic coordinates for all the streets in the city (Nanjing) from a colleague in urban planning. The approach to obtaining this information is as follows:
   - Use the Baidu Maps Snapshot Tool to capture the trajectories of Baidu Street View cars.
   - Vectorize and georeference this map in ArcGIS, then discretize it to obtain the coordinates of one point every 50 meters along the street view trajectory.

2. Street View Image Information Retrieval:
   - Utilize the API key provided by the Baidu Maps Open Platform to batch retrieve street view images. Construct URLs using the latitude and longitude coordinates, API key, and other street view capture parameters.
   - Retrieve street view images at three different angles for each coordinate point: 45°, 90°, and 135° to ensure comprehensive information for each point. To ensure that the capture angle parameters are fixed on the right side of the street, it is necessary to label whether each geographic coordinate point is located on a one-way or two-way road. There are three types: one-way, two-way forward (check if it's northbound first and then eastbound), and two-way backward (check if it's southbound first and then westbound).
   - Batch retrieve street view images at these three angles.
   - Use a pre-trained ResnetPSP semantic segmentation model from the MXNet Model Zoo, trained on the Cityscapes dataset, to perform semantic segmentation on all street view images at the three angles and calculate the weighted average of the semantic class proportions. Each semantic class serves as a dimension in the feature vector (a total of 19 dimensions, located in feature vector positions 11-29).
   - Calculate six advanced semantic features based on professional formulas to extract advanced visual information for the coordinate points: GreenSpaceCompetitiveness, Publication, SkyViewFactor, GreenLookingRatio, Enclosure, PavementFeasibility. Each advanced semantic feature serves as a dimension in the feature vector (a total of 6 dimensions, located in feature vector positions 5-10).

3. Geographic Information Data Retrieval:
   From the POI data previously collected by urban planning colleagues, select the types of information we need. Use the latitude and longitude of the coordinate points to combine the geographic information and street view image information into a single feature vector. The selected geographic information mainly includes five types: BusStationND, SubwayStationND, PubToiletND, CateringND, ConvenienceStoreND. Each geographic information feature serves as a dimension in the feature vector (a total of 5 dimensions, located in feature vector positions 0-4).
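
The semantic-proportion step in the list above can be sketched as follows. This is a minimal illustration, not the code in './DatasetConstruct': the function names `class_proportions` and `point_features`, the toy masks, and the equal weights are all assumptions, since the weighting used for the three angles is not stated.

```python
import numpy as np

# The 19 Cityscapes classes, in the order used by the feature vector.
CITYSCAPES_CLASSES = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train", "motorcycle", "bicycle",
]

def class_proportions(mask, n_classes=19):
    """Fraction of pixels assigned to each class id in one segmentation mask."""
    mask = np.asarray(mask)
    # Keep only valid class ids; this drops ignore labels such as 255.
    valid = mask[(mask >= 0) & (mask < n_classes)]
    counts = np.bincount(valid.ravel(), minlength=n_classes)
    return counts / max(valid.size, 1)

def point_features(masks, weights=None):
    """Weighted average of per-image class proportions for one coordinate point."""
    props = np.stack([class_proportions(m) for m in masks])  # (n_images, 19)
    return np.average(props, axis=0, weights=weights)

# Toy masks standing in for the 45, 90 and 135 degree views (hypothetical data).
rng = np.random.default_rng(0)
masks = [rng.integers(0, 19, size=(4, 6)) for _ in range(3)]
features = point_features(masks, weights=[1.0, 1.0, 1.0])  # equal weights assumed
print(dict(zip(CITYSCAPES_CLASSES, features.round(3))))
```

Because each per-image vector sums to 1, any weighted average of them also sums to 1, which is a cheap sanity check on the 19 segmentation columns.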

## (3) Feature Summary:
- [0-4] BusStationND, SubwayStationND, PubToiletND, CateringND, ConvenienceStoreND
- [5-10] GreenSpaceCompetitiveness, Publication, SkyViewFactor, GreenLookingRatio, Enclosure, PavementFeasibility
- [11-29] 'road', 'sidewalk', 'building', 'wall', 'fence', 'pole', 'traffic light', 'traffic sign', 'vegetation', 'terrain', 'sky', 'person', 'rider', 'car', 'truck', 'bus', 'train', 'motorcycle', 'bicycle'
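
As a quick check of the layout, the sketch below rebuilds the feature order from the three groups and derives each group's zero-based positions. The helper `group_ranges` is a hypothetical name introduced only for this illustration.

```python
GEO_FEATURES = [
    "BusStationND", "SubwayStationND", "PubToiletND", "CateringND",
    "ConvenienceStoreND",
]
ADVANCED_FEATURES = [
    "GreenSpaceCompetitiveness", "Publication", "SkyViewFactor",
    "GreenLookingRatio", "Enclosure", "PavementFeasibility",
]
SEGMENTATION_FEATURES = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train", "motorcycle", "bicycle",
]
FEATURE_NAMES = GEO_FEATURES + ADVANCED_FEATURES + SEGMENTATION_FEATURES

def group_ranges(groups):
    """Map each group name to its inclusive (first, last) zero-based positions."""
    ranges, start = {}, 0
    for name, cols in groups.items():
        ranges[name] = (start, start + len(cols) - 1)
        start += len(cols)
    return ranges

ranges = group_ranges({
    "geographic": GEO_FEATURES,
    "advanced": ADVANCED_FEATURES,
    "segmentation": SEGMENTATION_FEATURES,
})
print(len(FEATURE_NAMES), ranges)
```

With 5 + 6 + 19 = 30 features, the 19 segmentation classes end at zero-based position 29.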
## (4) Sample Selection and Data Annotation:
I use equal-distance sampling to select 987 samples to construct the training set.
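
One way to realize equal-distance sampling is to pick evenly spaced positions along the ordered list of coordinate points. The sketch below is an assumption about the procedure, not the original selection code: `equal_interval_indices` is a hypothetical helper, and the total of 20001 points is only inferred from the `id` range (0 to 20000) in the data report.

```python
import numpy as np

def equal_interval_indices(n_points, n_samples):
    """Return n_samples indices spread evenly over range(n_points)."""
    if not 0 < n_samples <= n_points:
        raise ValueError("n_samples must be between 1 and n_points")
    # linspace spaces the positions evenly; rounding maps them to valid indices.
    return np.round(np.linspace(0, n_points - 1, n_samples)).astype(int)

# Assumed sizes: 20001 candidate points, 987 samples to label.
idx = equal_interval_indices(20001, 987)
print(idx[:5], idx[-1], len(np.unique(idx)))
```

Since the spacing between consecutive positions is larger than 1, rounding cannot map two positions to the same index, so all 987 selected points are distinct.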

Then, I manually annotate the coordinate points for their suitability for "Shanshui-like" development, using the following labeling scheme:

1. Suitable (labeled as 1) - This includes the following categories:
 - Areas suitable for adding aesthetic value - e.g., residential and working areas (residential buildings, schools, office buildings).
 - Areas that contribute to social value - e.g., public service facilities (hospitals).
 - Areas with the potential for economic value enhancement - e.g., commercial complexes (shopping malls, supermarkets, farmer's markets).

2. Not suitable (labeled as 0) - This includes several categories:
 - Facilities that are not easily reconstructed - e.g., transportation facilities (elevated highways, interchanges, tunnels).
 - Areas with significant cultural or historical value - e.g., urban park areas (historical preservation sites, centralized parks, pocket parks).

In [None]:
# Specify the Excel file path to be read
file_path = "/content/IFShanshuiNeeded-Dataset.xls"
# Use the read_excel function from pandas to read the Excel file
df = pd.read_excel(file_path)


## (5) Data Preparation:
I use dataprep for visualization, which simplifies the creation of informative data visualizations. It offers a wide range of plot types, including histograms, scatter plots, and bar charts, which makes it easier to explore the data and communicate insights. Its functions are simple and intuitive, so it is accessible to both data scientists and analysts with varying levels of expertise.

In [None]:
# Use Dataprep to describe data in a comprehensive way
! pip install dataprep
from dataprep.eda import create_report

Collecting dataprep
  Downloading dataprep-0.4.5-py3-none-any.whl (9.9 MB)
Collecting bokeh<3,>=2 (from dataprep)
  Downloading bokeh-2.4.3-py3-none-any.whl (18.5 MB)
Collecting flask_cors<4.0.0,>=3.0.10 (from dataprep)
  Downloading Flask_Cors-3.0.10-py2.py3-none-any.whl (14 kB)
Collecting jinja2<3.1,>=3.0 (from dataprep)
  Downloading Jinja2-3.0.3-py3-none-any.whl (133 kB)
Collecting jsonpath-ng<2.0,>=1.5 (from dataprep)
  Downloading jsonpath_ng-1.6.0-py3-none-any.whl (29 kB)
Collecting metaphone<0.7,>=0.6 (from dataprep)
  Downloading Metaphone-0.6.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
...

In [None]:
create_report(df)



Dataset overview:

Statistic,Value
Number of Variables,34
Number of Rows,987
Missing Cells,0
Missing Cells (%),0.0%
Duplicate Rows,0
Duplicate Rows (%),0.0%
Total Size in Memory,262.3 KB
Average Row Size in Memory,272.1 B
Variable Types,Numerical: 33  Categorical: 1

Insights:

Insight,Variables
Uniform,id
Similar Distribution,SkyViewFactor and sky; GreenLookingRatio and vegetation; truck and bus
Skewed,"id, BusStationND, SubwayStationND, PubToiletND, CateringND, ConvenienceStoreND, GreenSpaceCompetitiveness, Publication, Enclosure, sidewalk, wall, fence, pole, traffic light, traffic sign, terrain, person, rider, car, truck, bus, train, motorcycle, bicycle"
Constant Length,label (length 1)

Zeros:

Variable,Zeros,Zeros (%)
BusStationND,179,18.14%
SubwayStationND,731,74.06%
PubToiletND,387,39.21%
CateringND,179,18.14%
ConvenienceStoreND,318,32.22%
Publication,126,12.77%
wall,141,14.29%
fence,59,5.98%
pole,81,8.21%
traffic light,803,81.36%
traffic sign,187,18.95%
terrain,103,10.44%
person,124,12.56%
rider,661,66.97%
truck,871,88.25%
bus,836,84.7%
train,969,98.18%
motorcycle,511,51.77%
bicycle,259,26.24%

Numerical variables (every one has 0 missing values, 0 infinite values, and a memory size of 15792). The export does not preserve the per-variable headings, so rows follow the report's column order; the two coordinate columns are named in parentheses by their value ranges.

Variable,Distinct,Unique (%),Mean,Std,Min,P5,Q1,Median,Q3,P95,Max,Skewness,Kurtosis
id,987,100.0%,9988.8349,5806.9249,0,986,4950,9980,15070,19014,20000,0.005944,-1.209
(longitude),970,98.3%,118.7468,0.03582,118.6658,118.6884,118.7222,118.7392,118.777,118.8055,118.835,0.1433,-0.7058
(latitude),967,98.0%,32.0113,0.03776,31.9507,31.9645,31.9815,32.0031,32.0352,32.0866,32.098,0.7157,-0.5732
BusStationND,796,80.7%,0.1838,0.1651,0,0,0.01569,0.1725,0.2832,0.4815,0.9059,0.8319,0.5251
SubwayStationND,253,25.6%,0.059,0.1357,0,0,0,0,0.001716,0.4425,0.8284,2.3653,4.5832
PubToiletND,594,60.2%,0.03615,0.05362,0,0,0,0.009739,0.05257,0.1491,0.3513,2.1073,5.413
CateringND,797,80.8%,0.05114,0.07502,0,0,0.001037,0.01736,0.07442,0.2018,0.4782,2.3031,6.3109
ConvenienceStoreND,661,67.0%,0.04546,0.06667,0,0,0,0.01499,0.06617,0.1925,0.4561,2.1554,5.6972
GreenSpaceCompetitiveness,402,40.7%,0.8516,0.1828,0.06133,0.5125,0.7442,0.928,1.0,1.0,1,-1.3365,1.5859
Publication,296,30.0%,2623.9632,4699.4673,0,0,433.832,1584.968,3454.4,6558.705,49660.588,7.3416,65.3843
SkyViewFactor,944,95.6%,0.1431,0.09065,0,0.001886,0.06874,0.1391,0.2114,0.2998,0.44,0.2515,-0.7701
GreenLookingRatio,959,97.2%,0.1655,0.1149,0,0.01098,0.07362,0.1489,0.2409,0.3736,0.5795,0.6803,-0.01604
Enclosure,971,98.4%,1.2434,9.0685,0,0.2018,0.4138,0.65,0.9429,1.9247,263.9774,25.8135,722.1072
PavementFeasibility,972,98.5%,0.5343,0.09625,0,0.4294,0.4714,0.5135,0.5779,0.7178,0.965,0.7603,3.8783
road,970,98.3%,0.4076,0.1052,0,0.1979,0.354,0.4283,0.4805,0.5387,0.5883,-1.1376,1.6123
sidewalk,892,90.4%,0.01573,0.02188,0,7.462e-06,0.002334,0.007953,0.01946,0.06278,0.1818,2.7899,10.7799
building,966,97.9%,0.1234,0.1133,0,0.008075,0.0388,0.09187,0.1701,0.3505,0.6786,1.6157,3.1411
wall,795,80.5%,0.01129,0.01877,0,0,0.00062153,0.003634,0.0133,0.0484,0.1378,2.9551,10.3052
fence,890,90.2%,0.02105,0.02734,0,0,0.003424,0.01135,0.02736,0.07429,0.206,2.592,9.4285
pole,786,79.6%,0.002511,0.002786,0,0,0.00042361,0.001691,0.003551,0.008006,0.02284,1.9829,6.108
traffic light,136,13.8%,0.00011152,0.0004529,0,0,0,0,0,0.00077257,0.004648,6.5048,50.9295
traffic sign,585,59.3%,0.001382,0.004097,0,0,2.43e-05,0.00035764,0.001144,0.005792,0.07957,10.4813,159.0711
vegetation,959,97.2%,0.1655,0.1149,0,0.01098,0.07362,0.1489,0.2409,0.3736,0.5795,0.6803,-0.01604
terrain,850,86.1%,0.02457,0.03374,0,0,0.001888,0.01243,0.03609,0.0865,0.2729,2.8969,12.3663
sky,944,95.6%,0.1431,0.09065,0,0.001886,0.06874,0.1391,0.2114,0.2998,0.44,0.2515,-0.7701
person,716,72.5%,0.003359,0.008185,0,0,0.00018663,0.00098438,0.003204,0.01294,0.1193,6.9024,67.2098
rider,249,25.2%,0.00044526,0.001902,0,0,0,0,7.555e-05,0.002264,0.02559,8.5556,88.6527
car,962,97.5%,0.07205,0.08455,0,0.002498,0.01832,0.05021,0.09988,0.2051,1,4.4562,36.6238
truck,112,11.3%,0.001935,0.01257,0,0,0,0,0,0.005764,0.2453,12.1503,187.4484
bus,144,14.6%,0.001786,0.01378,0,0,0,0,0,0.005192,0.2761,14.2977,235.5788
train,18,1.8%,9.6167e-05,0.001641,0,0,0,0,0,0,0.04402,23.0312,570.1791
motorcycle,396,40.1%,0.00253,0.008613,0,0,0,0,0.001053,0.01293,0.1284,7.5646,77.6603
bicycle,577,58.5%,0.001526,0.003579,0,0,0,0.00036806,0.001575,0.007146,0.05112,6.6156,64.8694

Categorical variable (label):

Statistic,Value
Approximate Distinct Count,2
Approximate Unique (%),0.2%
Missing,0
Missing (%),0.0%
Memory Size,65142
Length Mean,1
Length Standard Deviation,0
Length Median,1
Length Minimum,1
Length Maximum,1
1st row,1
2nd row,0
3rd row,1
4th row,1
5th row,1
Count,0
Lowercase Letter,0
Space Separator,0
Uppercase Letter,0
Dash Punctuation,0
Decimal Number,987

### Feature characteristics:
Starting from image data, I first used semantic segmentation to extract semantic elements from the images as hand-crafted basic features. I then added high-level semantic features, each computed from a few related semantic elements. Finally, I added objective two-dimensional geographic information to supplement the subjective three-dimensional visual scene information in the feature vector. This is a compromise: however many features I summarize, they cannot fully match the features that people extract unconsciously when perceiving an environment.

Based on the feature analysis results generated using Dataprep, I can summarize the following key points:

1. The proportion of zeros in the geographic information data is relatively high, which has both advantages and disadvantages. On the positive side, samples with non-zero values for these features stand out more, which increases the distance between samples in the feature space and makes them easier to separate in classification. However, if a feature is overly sparse, it may carry little information.

2. The correlation between features is decent, but the high-level features do not link the basic features effectively. A deep learning approach may be needed to explore deeper features.
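
Both observations can be checked directly with pandas. The sketch below is only an illustration on toy data: `zero_fraction`, `highly_correlated_pairs`, the 0.95 threshold, and the toy columns are assumptions. The duplicated toy column mirrors the report, which lists identical summary statistics for SkyViewFactor and sky.

```python
import numpy as np
import pandas as pd

def zero_fraction(frame):
    """Share of exact zeros per column, sparsest column first."""
    return (frame == 0).mean().sort_values(ascending=False)

def highly_correlated_pairs(frame, threshold=0.95):
    """List column pairs whose absolute Pearson correlation reaches threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:  # NaN compares as False
                pairs.append((a, b, float(corr.loc[a, b])))
    return pairs

# Toy data (hypothetical): one duplicated column and one very sparse column.
rng = np.random.default_rng(0)
sky = rng.random(200)
toy = pd.DataFrame({
    "sky": sky,
    "SkyViewFactor": sky,  # exact copy of sky
    "train": np.where(rng.random(200) < 0.98, 0.0, rng.random(200)),
    "road": rng.random(200),
})
print(zero_fraction(toy))
print(highly_correlated_pairs(toy))
```

Running the same two helpers on `X` would show which columns are nearly all zeros and which pairs carry almost the same information.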

# 3. Data Processing

In [None]:
# Build the ML dataset
# Target variable:
Y = df['label']
# Feature vector: 5 geographic, 6 advanced semantic, and 19 segmentation features
feature_cols = [
    'BusStationND', 'SubwayStationND', 'PubToiletND', 'CateringND', 'ConvenienceStoreND',
    'GreenSpaceCompetitiveness', 'Publication', 'SkyViewFactor', 'GreenLookingRatio',
    'Enclosure', 'PavementFeasibility',
    'road', 'sidewalk', 'building', 'wall', 'fence', 'pole', 'traffic light',
    'traffic sign', 'vegetation', 'terrain', 'sky', 'person', 'rider', 'car',
    'truck', 'bus', 'train', 'motorcycle', 'bicycle',
]
X = df.loc[:, feature_cols]

# Check that X and Y contain the same number of samples
assert X.shape[0] == Y.size


In [None]:
# Check the number of rows and columns in the dataset
print('Number of samples in the dataset:', X.shape[0])
print('Number of features per sample:', X.shape[1])

# View the first 10 samples in the dataset
X.head(10)

Number of samples in the dataset: 987
Number of features per sample: 30


Unnamed: 0,BusStationND,SubwayStationND,PubToiletND,CateringND,ConvenienceStoreND,GreenSpaceCompetitiveness,Publication,SkyViewFactor,GreenLookingRatio,Enclosure,...,terrain,sky,person,rider,car,truck,bus,train,motorcycle,bicycle
0,0.012946,0.125749,0.040578,0.005614,0.023221,0.56,6448.78,0.261667,0.119444,0.25008,...,0.032231,0.261667,0.003661,0.0,0.039207,0.0,0.000792,0.0,0.025762,8.5e-05
1,0.14596,0.150409,0.011038,0.007192,0.02828,0.567082,6448.78,0.171665,0.212198,0.488257,...,0.036488,0.171665,0.008674,0.000174,0.003925,0.0,0.001142,0.0,0.0,0.034521
2,0.698298,0.0,0.017723,0.156722,0.068155,1.0,3168.096,0.055741,0.153368,1.409242,...,0.052937,0.055741,0.009141,0.0,0.048082,0.0,0.0,0.0,0.0,0.001644
3,0.147145,0.0,0.124481,0.002207,0.0,0.522361,5219.02,0.207717,0.10542,0.245801,...,0.0,0.207717,0.000257,0.0,0.016179,0.0,0.0,0.0,0.0,0.0
4,0.368432,0.0,0.0,0.070646,0.0,0.868,2932.148,0.062434,0.295856,0.618237,...,0.016497,0.062434,0.006486,0.0,0.01504,0.0,0.0,0.0,0.0,0.006649
5,0.461468,0.0,0.033757,0.038161,0.006607,0.796747,3475.292,0.173113,0.1238,0.689167,...,0.002295,0.173113,0.0,0.0,0.101611,0.0,0.0,0.0,0.0,0.0
6,0.470553,0.0,0.037766,0.111428,0.088404,0.904,49660.588,0.112545,0.180856,0.622038,...,0.0,0.112545,0.00245,0.0,0.068851,0.0,0.017351,0.0,0.0,0.009556
7,0.410133,0.0,0.122118,0.055117,0.009266,0.694569,3006.08,0.066161,0.280014,0.782734,...,0.074674,0.066161,0.000153,0.0,0.057642,0.0,0.000137,0.0,0.000144,0.000174
8,0.405619,0.0,0.067551,0.063022,0.128148,0.92926,3168.096,0.039394,0.120523,1.184263,...,0.004377,0.039394,0.018852,0.001646,0.065818,0.017955,0.0,0.0,0.016033,0.02174
9,0.29232,0.0,0.0,0.030734,0.080152,0.818186,2250.9,0.101571,0.186094,0.585337,...,0.007398,0.101571,0.000259,0.000896,0.05026,0.012406,0.0,0.0,0.012031,0.0


In [None]:
# Standardize X
X_scaled = preprocessing.scale(X)
# Convert the data type to a DataFrame
X_scaled = pd.DataFrame(X_scaled)
# Print the summary statistics of the standardized data
X_scaled.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
count,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,...,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0
mean,1.619778e-16,4.319409e-17,-2.879606e-17,-1.151842e-16,-7.199015000000001e-17,6.479113000000001e-17,7.199015000000001e-17,-1.0798520000000001e-17,1.223832e-16,1.259828e-17,...,8.638817e-17,-1.0798520000000001e-17,5.399261e-17,-2.6996300000000002e-17,-2.879606e-17,0.0,-5.0393100000000007e-17,1.439803e-17,1.799754e-18,-3.959458e-17
std,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507,...,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507,1.000507
min,-1.113892,-0.4348712,-0.6745781,-0.6820946,-0.6821985,-4.325005,-0.5586364,-1.579261,-1.440798,-0.1371863,...,-0.7284871,-1.579261,-0.4106531,-0.2341743,-0.8526456,-0.154019,-0.1296375,-0.05862397,-0.2939395,-0.4267247
25%,-1.018796,-0.4348712,-0.6745781,-0.6682674,-0.6821985,-0.5879523,-0.4662744,-0.8206446,-0.7999796,-0.09153618,...,-0.6725052,-0.8206446,-0.3878386,-0.2341743,-0.6358736,-0.154019,-0.1296375,-0.05862397,-0.2939395,-0.4267247
50%,-0.06870659,-0.4348712,-0.4928511,-0.4505819,-0.4573025,0.4181555,-0.2211999,-0.04432321,-0.1449407,-0.06547347,...,-0.3600621,-0.04432321,-0.2903201,-0.2341743,-0.2585282,-0.154019,-0.1296375,-0.05862397,-0.2939395,-0.3238333
75%,0.6020173,-0.4222252,0.3063532,0.310516,0.310807,0.8122028,0.1767983,0.7543494,0.6561805,-0.03315316,...,0.3414744,0.7543494,-0.01898729,-0.1944402,0.3292515,-0.154019,-0.1296375,-0.05862397,-0.1716267,0.01347522
max,4.375048,5.670945,5.881183,5.695423,6.162207,0.8122028,10.014,3.27635,3.603916,28.9869,...,7.364355,3.27635,14.17574,13.22454,10.98081,19.372883,19.91145,26.77887,14.62661,13.86303
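
The `std` row reads 1.000507 rather than exactly 1 because `preprocessing.scale` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With 987 rows the ratio between the two is sqrt(987/986) ≈ 1.000507. The check below is a minimal sketch on synthetic data; `X_demo` is a random stand-in, not the assignment's `X`.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Synthetic stand-in for X (assumption: any numeric matrix with 987 rows)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=3.0, scale=2.0, size=(987, 5))

X_demo_scaled = pd.DataFrame(preprocessing.scale(X_demo))

pop_std = X_demo_scaled.std(ddof=0)      # what scale() normalizes to 1
sample_std = X_demo_scaled.std(ddof=1)   # what describe() reports
expected_ratio = np.sqrt(987 / 986)      # sample std / population std

print("population std:", pop_std.round(6).tolist())
print("sample std:    ", sample_std.round(6).tolist())
print("sqrt(987/986): ", expected_ratio)
```

The standardization is therefore behaving as intended; the small offset is only a difference in convention between the two libraries.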


In [None]:
# Import the necessary module for train-test splitting
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# Split the dataset into training and testing sets
# x_train: Training feature data
# x_test: Testing feature data
# y_train: Training target data
# y_test: Testing target data
x_train, x_test, y_train, y_test = train_test_split(X_scaled, Y, random_state=40, test_size=0.1)

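# Create a PCA (Principal Component Analysis) object that keeps 30 components and whitens them.
# X has 30 columns, so this decorrelates and rescales the features without reducing dimensionality.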
pca = PCA(n_components=30, whiten=True, random_state=42)

# 4. Model Training
I chose a support vector machine (SVM) for the classification task.

In [None]:
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, recall_score

In [None]:
# Create a Support Vector Classifier (SVC) object with the 'rbf' kernel and balanced class weights
svc = SVC(kernel='rbf', class_weight='balanced')

# Create a pipeline that first applies PCA and then uses SVC for classification
model = make_pipeline(pca, svc)

# Train the model
model.fit(x_train,y_train)
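
The notebook standardizes `X` before splitting, so the scaler also sees the test rows. One possible variant, sketched here on synthetic data, moves the scaler into the pipeline so that it is fitted on the training split only. The sketch also inspects how much variance each principal component explains. The data from `make_classification` and the names `X_demo`, `leak_free_model`, and `n_for_95` are assumptions for illustration, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 987 samples, 30 features, imbalanced binary labels
X_demo, y_demo = make_classification(
    n_samples=987, n_features=30, n_informative=10, n_redundant=0,
    weights=[0.3, 0.7], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=40, stratify=y_demo
)

# Scaler, PCA, and SVC are all fitted on the training split only
leak_free_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=30, whiten=True, random_state=42),
    SVC(kernel="rbf", class_weight="balanced"),
)
leak_free_model.fit(X_tr, y_tr)

# How many components are needed to reach 95% of the variance?
fitted_pca = leak_free_model.named_steps["pca"]
cumulative = np.cumsum(fitted_pca.explained_variance_ratio_)
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print("components kept:", fitted_pca.n_components_)
print("components needed for 95% variance:", n_for_95)
print("test accuracy:", leak_free_model.score(X_te, y_te))
```

If far fewer than 30 components cover most of the variance on the real data, lowering `n_components` would be a natural next experiment.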

In [None]:
from sklearn.metrics import classification_report
# Calculate the accuracy score on the training set
acu_train = model.score(x_train, y_train)
# Calculate the accuracy score on the testing set
acu_test = model.score(x_test, y_test)
# Print the accuracy
print("Training Accuracy: {:.2f}%".format(100 * acu_train))
print("Testing Accuracy: {:.2f}%".format(100 * acu_test))

# Predict on the testing set and compute the macro-averaged recall
y_pred = model.predict(x_test)
recall = recall_score(y_test, y_pred, average="macro")

# Print per-class precision, recall, and F1-score on the testing set
print(classification_report(y_test, y_pred))

Training Accuracy: 75.79%
Testing Accuracy: 73.74%
              precision    recall  f1-score   support

           0       0.50      0.35      0.41        26
           1       0.79      0.88      0.83        73

    accuracy                           0.74        99
   macro avg       0.65      0.61      0.62        99
weighted avg       0.71      0.74      0.72        99
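
The report above can be read more concretely by reconstructing the confusion matrix from it. The counts below are an inference from the rounded figures (support 26 and 73, recall 0.35 and 0.88), not logged output. They are the only integer counts consistent with those roundings.

```python
import numpy as np

# Reconstructed counts (an inference from the rounded report, not logged output).
# Rows = true class (0, 1); columns = predicted class (0, 1).
cm = np.array([[9, 17],
               [9, 64]])

support = cm.sum(axis=1)                    # samples per true class
recall = np.diag(cm) / support              # per-class recall
precision = np.diag(cm) / cm.sum(axis=0)    # per-class precision
accuracy = np.trace(cm) / cm.sum()
macro_recall = recall.mean()

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = support.max() / support.sum()

print("support:", support)
print("recall:", recall.round(2), "precision:", precision.round(2))
print("accuracy: {:.2f}%".format(100 * accuracy))
print("majority baseline: {:.2f}%".format(100 * majority_baseline))
print("macro recall: {:.2f}".format(macro_recall))
```

Under this reconstruction the test accuracy equals what always predicting class 1 would achieve, so the macro-averaged recall and the class 0 row are the more informative figures for judging the model.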



# 5.Conclusion:
From the ML task above, I have drawn some task-related lessons and some ideas for follow-up work, in particular:
### (1) From Bi-Shanshui to Multi-Shanshui:
In my binary classification task, I assessed only whether a location needs "Shanshui-like" development. I would like to go further and identify the direction that development should take, which can be done in two steps:

1. From binary classification to multi-classification. This requires me to spend more time annotating data for multi-class classification based on domain knowledge. I have already established the following categories based on domain knowledge:

  Task: Classification of the development direction of individual coordinate points in the city streets.
  Output: Five categories:

  - 1.Restore Cultural Value - Examples: Urban park sites (cultural heritage sites, centralized parks, pocket parks).

  - 2.Deepen Economic Value - Examples: Commercial complexes (malls, supermarkets, farmers' markets).

  - 3.Enhance Ecological Value - Examples: Transportation facilities (elevated bridges, overpasses, tunnels).

  - 4.Maintain Social Value - Examples: Public service facilities (hospitals).

  - 5.Add Aesthetic Value - Examples: Residential and work areas (residential buildings, schools, office buildings).

  Input: Feature vectors obtained from various POI data for each location.

2. From multi-classification to parallel task evaluation or multi-task evaluation. This requires me to restructure the model architecture, either training five parallel models or training a complex model based on multi-task learning methods.
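
The two steps above could be prototyped as follows. This is a rough sketch on synthetic data: the five label names are shorthand for the categories listed above, and `X_demo`, `multi_svc`, and `parallel_svcs` are hypothetical names. No annotated five-class data exists yet in this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Shorthand names for the five proposed development directions (assumption)
CATEGORIES = ["cultural", "economic", "ecological", "social", "aesthetic"]

# Synthetic five-class stand-in with 30 features per location
X_demo, y_demo = make_classification(
    n_samples=600, n_features=30, n_informative=12, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Step 1: a single multi-class SVC
multi_svc = SVC(kernel="rbf", class_weight="balanced").fit(X_tr, y_tr)

# Step 2: five parallel binary models, one per category (one-vs-rest)
parallel_svcs = OneVsRestClassifier(
    SVC(kernel="rbf", class_weight="balanced")
).fit(X_tr, y_tr)

predicted = [CATEGORIES[i] for i in parallel_svcs.predict(X_te[:5])]
print("binary models trained:", len(parallel_svcs.estimators_))
print("multi-class accuracy:", multi_svc.score(X_te, y_te))
print("one-vs-rest accuracy:", parallel_svcs.score(X_te, y_te))
print("first predictions:", predicted)
```

One-vs-rest is only one way to realize "five parallel models". A multi-task network with a shared backbone would need a different framework.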

### (2) Subjective or Objective?
Next, I consider the subjective and objective elements that the "Shanshui-like or not" task requires, and analyze the advantages and limitations of the input data in these terms. This is a crucial factor behind my less-than-optimal accuracy.

I used the semantic segmentation results of street view images, specifically the proportions of semantic elements, as several dimensions of the feature vector. From these dimensions, I selected certain semantic elements for the computation of advanced semantic features, creating additional feature dimensions, thereby linking some of the semantic elements to introduce subjective visual information into the features. Additionally, I introduced POI (Point of Interest) geographic information to construct several feature dimensions, thus incorporating objective geographic information into the features. This approach has its advantages and disadvantages:

**Advantages:**

1. **Integration of Subjective and Objective Information:** Building part of the feature vector from semantic segmentation results introduces subjective visual information into the features. This can help the model capture the semantic content of the street view images.

2. **Diverse Features:** Using multiple semantic elements and POI geographic information as features provides more information about the images. This enriches the feature space, allowing the model to better capture different types of image information.

3. **Reduced Overfitting:** Working with a compact set of higher-level semantic and geographic features, rather than raw pixel-level information, can reduce the risk of overfitting, which matters for a dataset of only 987 samples.

4. **Improved Localization:** Using POI geographic information can help the model more accurately locate objects or scenes in the images. This is valuable for tasks requiring spatial awareness, such as navigation or location.

**Disadvantages:**

1. **Complexity:** Introducing more feature dimensions increases the complexity of the model, which may require more computational resources and data for training and tuning. Complexity raises development and maintenance costs.

2. **Feature Selection and Maintenance:** Careful selection of which semantic elements and geographic information to use as features is necessary. Inappropriate choices may introduce noise or redundant information, affecting model performance.

3. **Feature Generation Costs:** Semantic segmentation and POI annotation take time, which can limit how much data is available and how often it can be refreshed.

4. **Generalization Challenges:** The model's generalization capability may be limited by the specific semantic elements and geographic information selected. For tasks unrelated to the chosen elements, the model may perform poorly.

5. **Data Consistency:** Semantic segmentation results and POI information for street view images may vary with time and location. This can lead to differences in model performance in different environments.
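
The feature construction discussed in this section (semantic-element proportions, a derived composite feature, and POI information) can be sketched as below. The class names, the `green_view` composite, and the POI counts are all hypothetical placeholders; the actual segmentation classes and advanced features used in the assignment are defined earlier in the notebook.

```python
import numpy as np

# Hypothetical segmentation classes; index = class id in the mask
CLASS_NAMES = ["road", "building", "vegetation", "sky", "water"]

def semantic_proportions(mask: np.ndarray, n_classes: int) -> np.ndarray:
    """Return the fraction of pixels assigned to each class id."""
    counts = np.bincount(mask.ravel(), minlength=n_classes)
    return counts / mask.size

# Toy 4x6 segmentation mask standing in for one street view image
mask = np.array([
    [3, 3, 3, 3, 3, 3],
    [1, 1, 3, 3, 2, 2],
    [1, 1, 2, 2, 2, 2],
    [0, 0, 0, 0, 4, 4],
])

props = semantic_proportions(mask, len(CLASS_NAMES))

# Hypothetical composite ("advanced") feature: share of natural elements
green_view = props[CLASS_NAMES.index("vegetation")] + props[CLASS_NAMES.index("water")]

# Hypothetical POI counts near the location (e.g. parks, malls, hospitals)
poi_counts = np.array([3, 0, 1])

feature_vector = np.concatenate([props, [green_view], poi_counts])
print(dict(zip(CLASS_NAMES, props.round(3))))
print("green_view:", round(green_view, 3))
print("feature vector length:", feature_vector.shape[0])
```

The proportions carry the subjective visual side, while the POI counts carry the objective geographic side. Both end up in a single vector that the SVM consumes.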

### (3) Deep Learning or Classical Machine Learning?
In summary, this approach reflects the limitations of feature engineering in classical machine learning. Given the available image data, I first used semantic segmentation to extract features from the images and then added further features to enrich the feature vectors. This is a compromise rather than an ideal solution: no matter how many features I define, they cannot fully approximate what humans unconsciously capture when they perceive an environment subjectively. Moreover, feature engineering is time-consuming and labor-intensive. To use my model on new raw data (street-view images), I would still need to preprocess the images into semantic data first.

The advantage of deep learning, on the other hand, lies in the following:

1. **No Need for Manual Feature Engineering**: Deep learning extracts abstract features automatically from the data. Interpretability may decrease, but performance on image tasks is often better when enough labeled data is available.

2. **Direct Input of Raw Data**: Key operations like image convolution allow direct input of raw data to extract high-level features, saving data preprocessing costs.

3. **Maturity of Pre-trained Models**: There are now many highly mature pre-trained models for image tasks, especially the EfficientNet series, known for their excellent generalization capabilities and manageable size. These models can serve as a solid backbone for tasks like the one I'm working on.


**All in all, data is always a critical component in the realm of machine learning for application-based tasks. Once I've established a stable data pipeline, I can further complicate the tasks. However, this introduces two types of challenges: one concerning the model's capacity and the other related to the difficulty of data annotation. The former can be addressed by introducing the backbone network of pre-trained deep learning models, while the latter requires substantial domain knowledge and interaction with experts from various domains.**
