### OneVsRestClassifier(XGBClassifier)_Example1

### Load the packages

In [1]:
import json
import pandas as pd
import numpy as np

from xgboost import XGBClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### DATA

In [2]:
# Read the JSON Lines file (one record per line, hence lines=True)
bedrooms_df = pd.read_json("street_group_data_science_bedrooms_test.json", lines=True)
bedrooms_df.head()

| | property_type | total_floor_area | number_habitable_rooms | number_heated_rooms | estimated_min_price | estimated_max_price | latitude | longitude | bedrooms |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Flats/Maisonettes | 39.0 | 1 | 1 | 103000 | 126000 | 52.164661 | -1.856154 | 0 |
| 1 | Flats/Maisonettes | 24.0 | 1 | 1 | 36000 | 44000 | 52.523281 | -2.054445 | 0 |
| 2 | Flats/Maisonettes | 25.0 | 1 | 1 | 187000 | 229000 | 51.386343 | -0.108323 | 0 |
| 3 | Flats/Maisonettes | 27.0 | 1 | 1 | 234000 | 350000 | 51.416946 | -0.151787 | 2 |
| 4 | Flats/Maisonettes | 29.0 | 1 | 1 | 185000 | 277000 | 52.915728 | -1.475258 | 1 |


In [3]:
# The classes are heavily imbalanced; the effect shows up later in classification_report and confusion_matrix
bedrooms_df.bedrooms.value_counts()

3    449179
2    281689
4    159928
1     57049
5     38372
6      8427
0      2725
7      1864
8       595
9       172
Name: bedrooms, dtype: int64
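
To put numbers on the imbalance, the counts above can be turned into class shares and into "balanced" class weights. This is only a sketch: the formula `n_samples / (n_classes * count)` is the heuristic behind scikit-learn's `class_weight='balanced'` option, and whether such weighting would actually improve this model is an assumption that the notebook does not test.

```python
# Class counts copied from the value_counts() output above
counts = {3: 449179, 2: 281689, 4: 159928, 1: 57049, 5: 38372,
          6: 8427, 0: 2725, 7: 1864, 8: 595, 9: 172}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the data
shares = {k: c / n_samples for k, c in counts.items()}

# "Balanced" weight: rare classes get large weights, common ones small weights
weights = {k: n_samples / (n_classes * c) for k, c in counts.items()}

for k in sorted(counts):
    print(f"bedrooms={k}: share={shares[k]:.4%}  weight={weights[k]:.2f}")
```

The rarest classes (8 and 9 bedrooms) together make up well under 0.1% of the rows, which is consistent with their weak scores in the evaluation section.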

In [4]:
# Exclude rows with unrealistic values; these thresholds are rough and could be refined
bedrooms_df2 = bedrooms_df[(bedrooms_df['total_floor_area'] > 3) & (bedrooms_df['number_habitable_rooms'] < 30)]
bedrooms_df2.shape

(999936, 9)

In [5]:
# Feature selection based on exploratory analysis
bedrooms_df3 = bedrooms_df2[['total_floor_area', 'number_habitable_rooms', 'estimated_min_price', 'bedrooms']]
bedrooms_df3.head()

| | total_floor_area | number_habitable_rooms | estimated_min_price | bedrooms |
|---|---|---|---|---|
| 0 | 39.0 | 1 | 103000 | 0 |
| 1 | 24.0 | 1 | 36000 | 0 |
| 2 | 25.0 | 1 | 187000 | 0 |
| 3 | 27.0 | 1 | 234000 | 2 |
| 4 | 29.0 | 1 | 185000 | 1 |


In [6]:
# The data after initial preparation
X = bedrooms_df3.drop('bedrooms', axis=1)  # DataFrame with the predictor values
y = bedrooms_df3['bedrooms']               # pandas Series with the class labels

In [7]:
X.sample(5)

| | total_floor_area | number_habitable_rooms | estimated_min_price |
|---|---|---|---|
| 24232 | 40.0 | 2 | 133000 |
| 479574 | 91.0 | 5 | 298000 |
| 38482 | 44.53 | 2 | 150000 |
| 314516 | 56.0 | 4 | 161000 |
| 406926 | 67.74 | 4 | 107000 |


In [8]:
y.sample(5)

110659    2
337670    2
185059    2
788852    3
449802    3
Name: bedrooms, dtype: int64

### Split into training and test datasets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

### MODEL

In [10]:
# Model definition: OneVsRestClassifier fits one binary XGBClassifier per bedroom class
xgb_classifier = OneVsRestClassifier(XGBClassifier(use_label_encoder=False, eval_metric='mlogloss'))
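
What the wrapper does can be sketched by hand: fit one binary model per class (that class against all others), score every sample with every model, and predict the class with the highest score. The sketch below is an illustration under stated assumptions, not the scikit-learn implementation: a tiny NumPy logistic regression stands in for `XGBClassifier` so that it runs without extra dependencies, and the synthetic three-class data, the helper name `fit_binary_logreg`, and its hyperparameters are all hypothetical.

```python
import numpy as np

def fit_binary_logreg(X, y, lr=0.1, n_iter=500):
    """Fit a plain logistic regression by gradient descent (stand-in base model)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of the positive class
        grad = p - y                            # gradient of the log loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Synthetic data (assumption): three well-separated clusters, 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
y = np.repeat(np.arange(3), 50)

# One-vs-rest: one binary problem per class, "this class" (1) vs "any other" (0)
classes = np.unique(y)
models = [fit_binary_logreg(X, (y == k).astype(float)) for k in classes]

# Each column holds one binary model's score; the prediction is the best-scoring class
scores = np.column_stack([X @ w + b for w, b in models])
y_pred = classes[scores.argmax(axis=1)]

accuracy = (y_pred == y).mean()
print(f"training accuracy: {accuracy:.3f}")
```

With ten bedroom classes, the wrapper in the cell above therefore trains ten separate binary boosters, so fitting takes roughly ten times as long as fitting one binary model.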

### Fit the model

In [11]:
xgb_classifier.fit(X_train,y_train)

OneVsRestClassifier(estimator=XGBClassifier(base_score=None, booster=None,
                                            colsample_bylevel=None,
                                            colsample_bynode=None,
                                            colsample_bytree=None,
                                            enable_categorical=False,
                                            eval_metric='mlogloss', gamma=None,
                                            gpu_id=None, importance_type=None,
                                            interaction_constraints=None,
                                            learning_rate=None,
                                            max_delta_step=None, max_depth=None,
                                            min_child_weight=None, missing=nan,
                                            monotone_constraints=None,
                                            n_estimators=100, n_jobs=None,
                                            num_p

### Generate predictions

In [12]:
y_pred = xgb_classifier.predict(X_test)

In [13]:
y_pred

array([3, 4, 4, ..., 3, 3, 4], dtype=int64)

### Evaluate model performance

In [14]:
accuracy = accuracy_score(y_test, y_pred)
accuracy

0.7478473636663655

In [15]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.47      0.23      0.31       800
           1       0.88      0.80      0.84     17217
           2       0.84      0.73      0.78     84674
           3       0.76      0.84      0.80    134667
           4       0.61      0.66      0.63     47742
           5       0.48      0.32      0.39     11518
           6       0.32      0.09      0.14      2582
           7       0.20      0.05      0.08       562
           8       0.13      0.05      0.07       178
           9       0.08      0.02      0.04        41

    accuracy                           0.75    299981
   macro avg       0.48      0.38      0.41    299981
weighted avg       0.75      0.75      0.74    299981



In [16]:
print(confusion_matrix(y_test,y_pred))

[[   185    442    105     50     14      4      0      0      0      0]
 [   176  13709   2460    639    190     35      4      1      2      1]
 [    14   1082  61672  21154    691     50     10      0      1      0]
 [     9    210   8725 113240  12150    296     24      6      5      2]
 [     4     55    358  13622  31528   2088     71      7      9      0]
 [     1      8     63    923   6511   3739    232     26     14      1]
 [     1      8     28    139    831   1289    230     39     13      4]
 [     0      3      3     28    121    258    108     27     12      2]
 [     0      2      2     10     32     63     40     19      9      1]
 [     0      1      1      1      8     11      7      7      4      1]]
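
The confusion matrix can be read more easily once it is normalised. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes) and derives per-class recall, per-class precision, overall accuracy, and the share of predictions that land within one bedroom of the true value. The last quantity is an extra diagnostic introduced here for illustration; it is not something the notebook computes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [185,   442,   105,     50,    14,    4,   0,  0,  0, 0],
    [176, 13709,  2460,    639,   190,   35,   4,  1,  2, 1],
    [ 14,  1082, 61672,  21154,   691,   50,  10,  0,  1, 0],
    [  9,   210,  8725, 113240, 12150,  296,  24,  6,  5, 2],
    [  4,    55,   358,  13622, 31528, 2088,  71,  7,  9, 0],
    [  1,     8,    63,    923,  6511, 3739, 232, 26, 14, 1],
    [  1,     8,    28,    139,   831, 1289, 230, 39, 13, 4],
    [  0,     3,     3,     28,   121,  258, 108, 27, 12, 2],
    [  0,     2,     2,     10,    32,   63,  40, 19,  9, 1],
    [  0,     1,     1,      1,     8,   11,   7,  7,  4, 1],
])

recall = np.diag(cm) / cm.sum(axis=1)     # correct predictions / true count per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / predicted count per class
accuracy = np.trace(cm) / cm.sum()

# Share of test rows whose prediction is at most one bedroom away from the truth
within_one = sum(np.trace(cm, offset=k) for k in (-1, 0, 1)) / cm.sum()

for k in range(cm.shape[0]):
    print(f"bedrooms={k}: precision={precision[k]:.2f}  recall={recall[k]:.2f}")
print(f"accuracy={accuracy:.4f}  within one bedroom={within_one:.4f}")
```

The recall values reproduce the recall column of the classification report (for example, 185 / 800 ≈ 0.23 for class 0), and the off-diagonal mass sits mostly next to the diagonal, so most errors are off by a single bedroom.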


### END

* Ensemble of binary classifiers (one-vs-rest).
Even with balanced data and a well-tuned model, many classifiers separate only a handful of classes well and start to struggle as the number of classes grows. With many classes, an alternative to training a single multiclass classifier is to train one binary classifier per class (one-vs-rest), which gives each classifier a simpler problem to learn. The binary outputs are then combined into a multiclass prediction, typically by choosing the class whose classifier reports the highest score.

* One-vs-rest XGBoost classifiers.
Gradient boosting resembles a random forest in that it combines the results of many decision trees. In a random forest, the trees are grown independently of one another (so they can be built in parallel), each with its own randomisation. Each tree is grown deep enough to overfit part of the training data, but because different trees overfit in different ways, voting or averaging over the forest largely cancels those errors out. Boosting instead grows shallow trees that capture general patterns and adds them one at a time, each new tree correcting the errors of the ensemble trained so far.
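
The "one shallow tree at a time" idea can be illustrated with a minimal boosting loop. This is a sketch under simplifying assumptions, not how XGBoost is implemented: it uses depth-1 trees (stumps), a squared-error regression target, plain residual fitting without regularisation, and synthetic data. The helper names `fit_stump` and `predict_stump` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that best fits the residual (least squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # drop the largest value so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Synthetic 1-D regression problem (assumption): noisy sine curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # Each new stump is fitted to what the current ensemble still gets wrong
    stump = fit_stump(x, y - prediction)
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE with 0 stumps: {mse_history[0]:.4f}, with 50 stumps: {mse_history[-1]:.4f}")
```

Each stump alone is a very weak model, yet the training error can only go down as stumps are added, because every stump is fitted to the residuals left by its predecessors. A random forest would instead fit its trees independently and average them.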
