Make sure to download the dataset here before running my code; it is too large to be uploaded to this repository.
This model predicts jammed road segments with an F1 score of 65.21% and ranked in the top 32% of all contestants.
- Train (70k+ rows)
- Irregularities (350k+ rows) -> 190k+ rows after deduplication
- Alerts (7M+ rows) -> 80k+ rows after deduplication
- Test
Irregularities contains information such as the speed gap, the time gap between jammed and normal conditions, jam level, etc.
- Join train and Irregularities on the s2 location
- Check the correlation between the features and the labels
- Keep the important features and drop the unimportant ones
Alerts contains information such as weather, accidents, street type, etc.
- Join train and Alerts on the s2 location
- Check the correlation between the features and the labels
- Keep the important features and drop the unimportant ones
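The join-and-correlate steps above (for both Irregularities and Alerts) can be sketched as follows. The toy frames and column names (`s2_location`, `label`, `alerts_count`) are assumptions standing in for the real tables, and the 0.1 correlation cutoff is illustrative:

```python
import pandas as pd

# Toy stand-ins for the train and Alerts tables; column names are assumed.
train = pd.DataFrame({
    "s2_location": ["a", "a", "b", "c"],
    "label": [1, 0, 1, 0],
})
alerts = pd.DataFrame({
    "s2_location": ["a", "b", "c"],
    "alerts_count": [5, 2, 0],
})

# Join train and Alerts on the s2 location key.
merged = train.merge(alerts, on="s2_location", how="left")

# Correlate each numeric feature with the label, then keep only the
# features whose absolute correlation clears an (illustrative) cutoff.
corr = merged.corr(numeric_only=True)["label"].drop("label")
keep = corr[corr.abs() >= 0.1].index.tolist()
print(keep)
```

The same pattern applies to the train/Irregularities merge; only the table and feature names change.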
- Join Alerts and Irregularities on 'supergabungan', a composite key of the s2 location, time, and hour
- Now that we know the important features in each table, we can build new features and drop the unimportant ones. The new features, based on my observations, are:
- Road Type (Whether the road is main street or not)
- Condition Type (Accident, Bad Weather, etc)
- Reliability
- Rating Rate
- Jam Trend
- Jam Level
- Alerts Count
- Let's call the new dataset joined from Alerts and Irregularities 'combination'
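Building 'combination' via the 'supergabungan' key can be sketched like this. The toy data and column names (`s2_location`, `date`, `hour`, `alerts_count`, `jam_level`) are assumptions; only the composite-key idea comes from the write-up:

```python
import pandas as pd

# Toy stand-ins for the deduplicated Alerts and Irregularities tables.
alerts = pd.DataFrame({
    "s2_location": ["a", "b"],
    "date": ["2020-01-01", "2020-01-01"],
    "hour": [8, 9],
    "alerts_count": [3, 1],
})
irregularities = pd.DataFrame({
    "s2_location": ["a", "b"],
    "date": ["2020-01-01", "2020-01-01"],
    "hour": [8, 9],
    "jam_level": [4, 2],
})

# Build the composite 'supergabungan' key on both tables.
for df in (alerts, irregularities):
    df["supergabungan"] = (
        df["s2_location"] + "_" + df["date"] + "_" + df["hour"].astype(str)
    )

# Join the two tables on the composite key to form 'combination'.
combination = alerts.merge(
    irregularities[["supergabungan", "jam_level"]], on="supergabungan"
)
print(combination[["supergabungan", "alerts_count", "jam_level"]])
```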
We also construct new features, isweekend and isbusyhour, from the date and time.
- Join train and combination on day_hour, taking the day and hour from the s2idtoken
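A minimal sketch of deriving isweekend and isbusyhour from a datetime column; the specific busy-hour ranges below are my assumption, not the write-up's definition:

```python
import pandas as pd

# Two sample timestamps: a Saturday morning and a Monday afternoon.
df = pd.DataFrame({"datetime": pd.to_datetime([
    "2020-01-04 08:00",  # Saturday, within the assumed morning rush
    "2020-01-06 13:00",  # Monday, outside both assumed rush windows
])})

# Saturday/Sunday have dayofweek 5 and 6.
df["isweekend"] = (df["datetime"].dt.dayofweek >= 5).astype(int)

# Assumed busy hours: 06-09 morning rush and 16-19 evening rush.
busy_hours = list(range(6, 10)) + list(range(16, 20))
df["isbusyhour"] = df["datetime"].dt.hour.isin(busy_hours).astype(int)
print(df[["isweekend", "isbusyhour"]])
```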
There are several ways to clean the data. Here are the methods I tried:
- fillna with bfill, ffill, pad, and backfill
- Impute with the mode
- Fill the NaN values by searching for the most similar location (very labor-intensive)

After all that, the best training score came from fillna with ffill.
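The winning forward-fill cleaning can be sketched as below; the column names are illustrative, and the trailing back-fill (to catch NaNs at the top of a column, which ffill cannot reach) is my addition:

```python
import pandas as pd
import numpy as np

# Toy frame with gaps, standing in for the merged feature table.
df = pd.DataFrame({
    "jam_level": [4.0, np.nan, np.nan, 2.0],
    "alerts_count": [np.nan, 3.0, np.nan, 1.0],
})

# Forward-fill propagates the last valid value down each column;
# a back-fill pass then handles any leading NaNs ffill left behind.
cleaned = df.ffill().bfill()
print(cleaned.isna().sum().sum())
```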
We compare five models:
- Random Forest Classifier
- XGBoost Classifier
- Logistic Regression
- Decision Tree Classifier
- Naive Bayes
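A sketch of the model comparison on one train/test split, using synthetic data in place of the engineered features. The split and random seeds are assumptions; `XGBClassifier` from the `xgboost` package would slot into the dict the same way:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic binary-classification data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Fit each model and score it with F1, the competition metric.
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```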
F1 scores before tuning:
F1 scores after tuning:
- Random Forest (tuned): 0.757 (increased by 0.0002 XD)
- Decision Tree: 0.756 (increased by 0.001)
- Naive Bayes: 0.797 (increased by 0.01)
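One way to sketch the tuning step is a grid search scored on F1; the parameter grid and the Random Forest choice here are illustrative assumptions, not the exact grids used for each model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the engineered features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Cross-validated grid search, selecting by F1 (the competition metric).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="f1",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

As the tiny gains above suggest, tree ensembles with sensible defaults often leave little room for tuning to improve F1.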
- Join test and combination on day_hour
- Generate predictions with the fitted model
- Make sure the format matches sample_submission
- Submit on Kaggle and try your luck!
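The submission steps above can be sketched as follows; the column names (`id`, `jammed`) are assumptions based on a typical Kaggle sample_submission layout:

```python
import pandas as pd

# Hypothetical test ids and model outputs; in practice `preds` comes from
# model.predict on the test set joined with 'combination'.
test_ids = ["t1", "t2", "t3"]
preds = [1, 0, 1]

# Shape the output to mirror sample_submission, then write it without
# the index so the file has exactly the expected columns.
submission = pd.DataFrame({"id": test_ids, "jammed": preds})
submission.to_csv("submission.csv", index=False)
print(submission.shape)
```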
If you want to try this problem yourself, you can get the dataset here: Kaggle