This is the term project for COMP4434. It uses several models to classify each day's crowd inflow/outflow data as either a weekend or a weekday.
- Install JetBrains PyCharm Professional and Anaconda3.
- Open PyCharm, clone this project and open it.
- [IMPORTANT] Create a folder called `data` and move the dataset files to this folder.
- Enter `File > Settings > Project: TrafficAnalysis > Project Interpreter`.
- Click the "gear" button on the top-right corner of the dialog, and then click `Add...`.
- Select `Conda Environment` on the left, and click the `OK` button.
- Click the `OK` button again to close the `Settings` window.
- Wait until PyCharm is ready.
- If a message pops up in the bottom-right corner of the screen asking whether you want to turn on `Scientific Mode`, turn it on.
- PyCharm may suggest installing missing packages. Allow it to do so.
- You are ready to go.
Currently, we do not use the weather data to train the models.
Originally, `date` is a simple 1d array of strings like `"2015110101"`. After the preprocessing, `date` becomes a 2d array with three columns.
Column 1: Since the dates in this dataset start from 1 Nov 2015, the date strings are converted to the number of days since 1 Nov 2015. For example, `"2015111524"` will be converted to `14` since `15 - 01 = 14`.
Column 2: The last 2 characters of the string `"2015110101"` give the number of the time slot within the day. `"01"` means the first time slot, which is 00:00 - 00:30. It is converted to an integer that starts from zero, so `"01"` will be converted to `0`.
Column 3: If this day is a Saturday or Sunday, this value will be `1`. Otherwise, it will be `0`. This is the label data for future training and testing.
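The three-column conversion described above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual preprocessing code: the function name `parse_timestamp` and the constant `START` are hypothetical, and the sketch assumes every string has the form `YYYYMMDDSS` with a two-digit slot number.

```python
from datetime import date

# First day in the dataset, as stated in the text (assumed constant name).
START = date(2015, 11, 1)

def parse_timestamp(stamp):
    """Turn a string such as "2015110101" into (day_index, slot_index, is_weekend)."""
    day = date(int(stamp[0:4]), int(stamp[4:6]), int(stamp[6:8]))
    day_index = (day - START).days               # column 1: days since 1 Nov 2015
    slot_index = int(stamp[8:10]) - 1            # column 2: "01" -> 0, "48" -> 47
    is_weekend = 1 if day.weekday() >= 5 else 0  # column 3: Saturday=5, Sunday=6
    return day_index, slot_index, is_weekend

rows = [parse_timestamp(s) for s in ["2015110101", "2015111524", "2015110248"]]
```

Using real calendar arithmetic instead of subtracting the day-of-month digits keeps column 1 correct once the dates cross into the next month.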
The inflow/outflow data is normalized using the `preprocessing.scale()` function from `scikit-learn`.
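As a rough illustration of this normalization step, the sketch below applies `preprocessing.scale()` to a randomly generated stand-in for the flow tensor. The array shape, the variable names, and the choice to flatten each time slot into one row before scaling are all assumptions; the original does not say along which axis the project scales.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
# Hypothetical stand-in for the raw flows: (time slots, 2 flows, 32, 32).
raw = rng.integers(0, 500, size=(96, 2, 32, 32)).astype(np.float64)

# scale() rejects arrays with more than 2 dimensions,
# so flatten each time slot into one row first (an assumed layout).
flat = raw.reshape(raw.shape[0], -1)
scaled_flat = preprocessing.scale(flat)   # zero mean, unit variance per column
scaled = scaled_flat.reshape(raw.shape)   # restore the original 4D shape
```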
In the given data, we found that some time slots are missing for some of the dates. So we iterate over the `date` array to find those "dirty" dates and remove their data entries.
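One way to find incomplete days is to count how many entries each day index has and keep only the days with all 48 slots. The sketch below is an assumption about how this could be done, not the project's code: `complete_days` and `SLOTS_PER_DAY` are hypothetical names, and it assumes no time slot appears twice for the same day.

```python
from collections import Counter

SLOTS_PER_DAY = 48  # half-hour slots per day, as described in the text

def complete_days(day_indices):
    """Return the set of day indices that have all 48 time slots."""
    counts = Counter(day_indices)
    return {day for day, n in counts.items() if n == SLOTS_PER_DAY}

# Toy example: day 0 is complete, day 1 is missing one slot.
day_col = [0] * 48 + [1] * 47
keep = complete_days(day_col)
mask = [d in keep for d in day_col]  # True for entries that survive the cleanup
```

The resulting `mask` can then be used to drop the matching rows from both the `date` array and the flow data.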
We plan to train and test the models using the whole inflow/outflow data of a single day (with 48 time slots) as a single input entry. Therefore, there are 138 valid days (as input entries), and each input entry has 48 * 2 * 32 * 32 = 98,304 floating-point elements. They are stored in the `data` array.
There are only 2 possible output values for each input. `1` means that this day is a Saturday or Sunday. `0` means that this day is a weekday. They are stored in the `label` array.
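The per-day layout of `data` and `label` described above could be built roughly as follows. This sketch assumes the cleaned flows are already sorted by day and then by slot, and it uses 3 placeholder days instead of the real 138; the names `flows` and `weekend` are hypothetical.

```python
import numpy as np

n_days, slots = 3, 48  # the real dataset has 138 valid days

# Hypothetical cleaned tensor, ordered by day then slot: (n_days * 48, 2, 32, 32).
flows = np.zeros((n_days * slots, 2, 32, 32), dtype=np.float32)
# Column 3 of `date`, one value per time slot (placeholder values).
weekend = np.repeat([1, 0, 0], slots)

data = flows.reshape(n_days, slots * 2 * 32 * 32)  # one row of 98,304 values per day
label = weekend.reshape(n_days, slots)[:, 0]       # one label per day
```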
Various classification models would be applied.
We use the following classifiers to train our model:
- SVM classifier
- SGD classifier
- k-NN classifier
- Bagging classifier
According to our experiment results, the 4-nearest-neighbor classifier performs best among these models, with an accuracy of 96%.
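For readers who want to try the best-performing setup, the sketch below fits a k-NN classifier with k = 4 using scikit-learn. It runs on synthetic data and does not reproduce the 96% figure: the feature count, the 75/25 split, the stratification, and the random seed are all assumptions, since the evaluation protocol is not described here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for `data` and `label`; real rows have 48*2*32*32 features.
label = np.array([1, 1, 0, 0, 0, 0, 0] * 20)[:138]
data = rng.normal(size=(138, 64)) + label[:, None] * 2.0

# Assumed evaluation protocol: a single stratified 75/25 split.
X_train, X_test, y_train, y_test = train_test_split(
    data, label, test_size=0.25, stratify=label, random_state=0)

clf = KNeighborsClassifier(n_neighbors=4)  # the "4-nearest-neighbor" setting
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)       # fraction of correct test predictions
```

The other classifiers in the list can be swapped in through the same `fit` / `score` interface.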
- @EririSawamura, @EvilCharles, @113741090a: data model training and evaluation.
- @CrabAss, @liao-victor: data preprocessing.
- @CrabAss: project coordinator.
The dataset used in this experiment is provided in the following paper:
Junbo Zhang, Yu Zheng, Dekang Qi. Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In AAAI 2017.
This paper is available on arXiv. The dataset is available in lucktroy/DeepST.