In this project I tried to find the best model for an imbalanced dataset with 36 million rows. The task on this dataset is classification. Unfortunately, the link to the dataset is not accessible at the time of writing. The main challenges were:
- The data was imbalanced.
- The data wasn't representative of what we were asked to predict.
For preprocessing, I made a lot of visualizations, and based on some of them I removed values that looked like outliers according to their z-score or IQR. After removing the outliers, I tackled the problems caused by the class imbalance: I tried several remedies and searched for a good decision threshold for the predictions. With the final dataset, I then compared different models to see which one performs best.
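The two ideas above (outlier filtering and threshold search) can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project: the helper names `iqr_mask`, `zscore_mask` and `best_f1_threshold`, the cutoffs `k=1.5` and `limit=3.0`, the choice of F1 as the metric to maximize, and all toy values are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def iqr_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.between(q1 - k * iqr, q3 + k * iqr)


def zscore_mask(values: pd.Series, limit: float = 3.0) -> pd.Series:
    """True for values whose absolute z-score is at most `limit`."""
    std = values.std(ddof=0)
    if std == 0:  # constant column: nothing can be an outlier
        return pd.Series(True, index=values.index)
    return ((values - values.mean()) / std).abs() <= limit


def best_f1_threshold(y_true, scores, candidates=None):
    """Scan candidate thresholds and return (threshold, F1) with the highest F1."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)  # assumed grid, step 0.05
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        pred = scores >= t
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Toy data, made up for illustration (not taken from the real dataset).
toy = pd.DataFrame({"n_adults": [1, 2, 2, 3, 1, 2, 4, 2, 3, 2, 1, 40]})
kept_iqr = toy[iqr_mask(toy["n_adults"])]
kept_z = toy[zscore_mask(toy["n_adults"])]

# Imbalanced toy labels (2 positives out of 10) with made-up model scores.
y_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
s_toy = [0.02, 0.11, 0.12, 0.17, 0.21, 0.22, 0.27, 0.41, 0.33, 0.47]
threshold, f1 = best_f1_threshold(y_toy, s_toy)
print(len(kept_iqr), len(kept_z), threshold, f1)
```

On imbalanced data the default threshold of 0.5 often predicts almost no positives, which is why scanning thresholds on a validation set can help. Whether F1 is the right metric to optimize depends on the cost of false positives versus false negatives.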
Let's look at the dataset:
| | user | search_date | channel | is_mobile | is_package | destination | checkIn_date | checkOut_date | n_adults | n_children | n_rooms | hotel_category | is_booking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | u461899 | 2019-01-07 00:00:02 | c9 | False | False | d669 | 2019-03-14 | 2019-03-15 | 2 | 1 | 1 | g41 | False |
| 1 | u13796 | 2019-01-07 00:00:06 | c9 | False | False | d8821 | 2019-01-19 | 2019-01-26 | 1 | 0 | 1 | g58 | False |
| 2 | u1128575 | 2019-01-07 00:00:06 | c9 | False | False | d25064 | 2019-01-19 | 2019-01-22 | 1 | 0 | 1 | g91 | False |
| 3 | u1080476 | 2019-01-07 00:00:09 | c9 | False | True | d7635 | 2019-05-29 | 2019-06-05 | 2 | 0 | 1 | g10 | False |
| 4 | u1080476 | 2019-01-07 00:00:17 | c9 | False | True | d7635 | 2019-05-29 | 2019-06-05 | 2 | 0 | 1 | g10 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34742970 | u553256 | 2020-11-30 23:59:48 | c2 | True | True | d45532 | 2020-12-07 | 2020-12-08 | 2 | 0 | 1 | g48 | False |
| 34742971 | u529472 | 2020-11-30 23:59:49 | c9 | False | False | d8279 | 2020-12-27 | 2021-01-02 | 2 | 2 | 1 | g18 | False |
| 34742972 | u18236 | 2020-11-30 23:59:53 | c4 | False | False | d20275 | 2021-04-22 | 2021-04-25 | 1 | 0 | 1 | g5 | False |
| 34742973 | u10888 | 2020-11-30 23:59:54 | c9 | False | False | d19371 | 2020-12-29 | 2020-12-30 | 2 | 0 | 1 | g17 | False |
| 34742974 | u233344 | 2020-11-30 23:59:55 | c9 | False | False | d22862 | 2021-08-16 | 2021-08-18 | 2 | 0 | 1 | g44 | False |
34742975 rows × 13 columns