## Gradient boosted decision tree regressor analysis 

- **Purpose**: The purpose of this notebook is to test the performance of XGBRegressor 
- **Author**: Crystal Geng 

- **Data**:
    - Uniformly sampled data where the same number of data was sampled in each track momentum bin, with bin size = 1 GeV/C 
- **Model type**: Regressor 
- **Engineered features**
    - `x_realigned_min`: minimum realigned hit position x in each event
    - `x_realigned_max`: maximum realigned hit position x in each event
    - `x_realigned_mean`: mean realigned hit position x in each event
    - `x_realigned_median`: median realigned hit position x in each event 
    - `y_reaglined_min`: minimum realigned hit position y in each event
    - `y_realigned_max`: maximum realigned hit position y in each event
    - `y_realigned_mean`: mean realigned hit position y in each event
    - `y_realigned_median`: median reaglined hit position y in each event
    - `min(x+y)`: minimum reaglined hit position x + y in each event
    - `max(x+y)`: maximum realigned hit position x + y in each event
    - `min(x-y)`: minimum realigned hit position x - y in each event
    - `max(x-y)`: maximum realigned hit position x - y in each event 
    - `inner_ring_radius`: radius of the largest circle that ejects all the points to the outside (inner circle)
    - `outer_ring_radius`: radius of the smallest circle that includes all the points inside (outer circle)
    - `max_x-min_x`: maximum realigned hit position x minus minimum realigned hit position x in each event
    - `max_y-min_y`: maximum realigned hit position y minus minimum realigned hit position y in each event 
- **Preprocessing**:
    - Only the in-time hits where delta <= 0.5 are included 
    - Only muons are included 
    - Events with ring centred position greater than 1000 are excluded 
    - Examples with NAs are dropped
- **Training**:
    - XGBRegressor for training with default values 

In [1]:
import numpy as np 
import pandas as pd
import os
import matplotlib.pyplot as plt
import sys
import glob
import warnings
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

### 1. Read the data

#### Read the concatenated hits and events data 

In [2]:
hit_data_uniform = pd.read_csv('../hit_data_uniform.csv')

#### Read the uniformly sampled events data (for retrieving the radius)

In [3]:
sampled_data = pd.read_csv('../sampled_data_uniform.csv')

In [4]:
hit_data_uniform = hit_data_uniform.drop(hit_data_uniform.columns[0], axis=1)

In [5]:
sampled_data = sampled_data[["event", "ring_radius", "ring_centre_pos_x", "ring_centre_pos_y"]]

In [6]:
merged_df = sampled_data.merge(hit_data_uniform, left_on='event', right_on='event')

In [7]:
merged_df

Unnamed: 0,event,ring_radius,ring_centre_pos_x,ring_centre_pos_y,x,y,mirror,x_realigned,y_realigned,hit_time,chod_time,chod_delta,class,momentum
0,80852,169.37271,-238.677630,-163.017230,-36.0,77.94,0.0,55.877628,221.157230,53.321484,19.203901,34.117584,muon,20.45610
1,80852,169.37271,-238.677630,-163.017230,-45.0,62.35,0.0,46.877628,205.567226,53.034840,19.203901,33.830940,muon,20.45610
2,80852,169.37271,-238.677630,-163.017230,108.0,140.30,0.0,199.877628,283.517230,53.126720,19.203901,33.922820,muon,20.45610
3,80852,169.37271,-238.677630,-163.017230,234.0,109.12,0.0,325.877628,252.337230,52.964043,19.203901,33.760140,muon,20.45610
4,80852,169.37271,-238.677630,-163.017230,207.0,124.71,0.0,298.877628,267.927226,54.722860,19.203901,35.518960,muon,20.45610
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9366531,2178107,183.84737,-11.867355,34.194088,144.0,202.65,0.0,9.067355,148.655906,10.620909,10.989535,-0.368627,pion,44.02878
9366532,2178107,183.84737,-11.867355,34.194088,234.0,-77.94,0.0,99.067355,-131.934090,10.970553,10.989535,-0.018982,pion,44.02878
9366533,2178107,183.84737,-11.867355,34.194088,-9.0,155.88,0.0,-143.932645,101.885917,10.578908,10.989535,-0.410627,pion,44.02878
9366534,2178107,183.84737,-11.867355,34.194088,-63.0,124.71,0.0,-197.932645,70.715911,11.381413,10.989535,0.391878,pion,44.02878


### 2. Preprocessing

#### Compute the distance between each hit point and the fitted ring centre 

In [8]:
merged_df["distance"] = (
    (merged_df["ring_centre_pos_x"] - merged_df["x_realigned"]) ** 2 +
    (merged_df["ring_centre_pos_y"] - merged_df["y_realigned"]) ** 2
) ** 0.5

merged_df

Unnamed: 0,event,ring_radius,ring_centre_pos_x,ring_centre_pos_y,x,y,mirror,x_realigned,y_realigned,hit_time,chod_time,chod_delta,class,momentum,distance
0,80852,169.37271,-238.677630,-163.017230,-36.0,77.94,0.0,55.877628,221.157230,53.321484,19.203901,34.117584,muon,20.45610,484.100005
1,80852,169.37271,-238.677630,-163.017230,-45.0,62.35,0.0,46.877628,205.567226,53.034840,19.203901,33.830940,muon,20.45610,466.257768
2,80852,169.37271,-238.677630,-163.017230,108.0,140.30,0.0,199.877628,283.517230,53.126720,19.203901,33.922820,muon,20.45610,625.878373
3,80852,169.37271,-238.677630,-163.017230,234.0,109.12,0.0,325.877628,252.337230,52.964043,19.203901,33.760140,muon,20.45610,700.886557
4,80852,169.37271,-238.677630,-163.017230,207.0,124.71,0.0,298.877628,267.927226,54.722860,19.203901,35.518960,muon,20.45610,688.969360
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9366531,2178107,183.84737,-11.867355,34.194088,144.0,202.65,0.0,9.067355,148.655906,10.620909,10.989535,-0.368627,pion,44.02878,116.360517
9366532,2178107,183.84737,-11.867355,34.194088,234.0,-77.94,0.0,99.067355,-131.934090,10.970553,10.989535,-0.018982,pion,44.02878,199.762563
9366533,2178107,183.84737,-11.867355,34.194088,-9.0,155.88,0.0,-143.932645,101.885917,10.578908,10.989535,-0.410627,pion,44.02878,148.402912
9366534,2178107,183.84737,-11.867355,34.194088,-63.0,124.71,0.0,-197.932645,70.715911,11.381413,10.989535,0.391878,pion,44.02878,189.615758


#### Filter out the out-of-time hits 

In [9]:
hit_data_uniform_df_intime = merged_df[['event', 'ring_radius', 'x_realigned', 'y_realigned', 'chod_delta', 'class', 'distance']]

In [10]:
hit_data_uniform_df_intime = hit_data_uniform_df_intime.query('abs(chod_delta) <= 0.5')

#### Filter the hits data to muons only 

In [11]:
hit_data_uniform_df_intime_muons = hit_data_uniform_df_intime.query('`class` == "muon"')

#### Compute `x+y` and `x-y` columns 

In [12]:
hit_data_uniform_df_intime_muons['x+y'] = hit_data_uniform_df_intime_muons['x_realigned'] + hit_data_uniform_df_intime_muons['y_realigned']
hit_data_uniform_df_intime_muons['x-y'] = hit_data_uniform_df_intime_muons['x_realigned'] - hit_data_uniform_df_intime_muons['y_realigned']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  hit_data_uniform_df_intime_muons['x+y'] = hit_data_uniform_df_intime_muons['x_realigned'] + hit_data_uniform_df_intime_muons['y_realigned']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  hit_data_uniform_df_intime_muons['x-y'] = hit_data_uniform_df_intime_muons['x_realigned'] - hit_data_uniform_df_intime_muons['y_realigned']


#### Compute `max_x`, `min_x`, `median_x`, `mean_x`, `max_y`, `min_y`, `median_y`, `mean_y`, `min(x+y)`, `max(x+y)`, `min(x-y)`, `max(x-y)`

In [13]:
grouped_hit_data = hit_data_uniform_df_intime_muons.groupby('event').aggregate(
    {'x_realigned':['min', 'max', 'mean', 'median'], 
    'y_realigned':['min','max', 'mean', 'median'],
    'x+y': ['min', 'max'],
    'x-y': ['min', 'max'],
    'distance': ['min', 'max']})
grouped_hit_data.columns = ['x_realigned_min', 'x_realigned_max', 'x_realigned_mean', 'x_realigned_median', 'y_realigned_min', 'y_realigned_max', 'y_realigned_mean', 'y_realigned_median', 'min(x+y)', 'max(x+y)', 'min(x-y)', 'max(x-y)', 'inner_ring_radius', 'outer_ring_radius']
grouped_hit_data= grouped_hit_data.reset_index()
grouped_hit_data

Unnamed: 0,event,x_realigned_min,x_realigned_max,x_realigned_mean,x_realigned_median,y_realigned_min,y_realigned_max,y_realigned_mean,y_realigned_median,min(x+y),max(x+y),min(x-y),max(x-y),inner_ring_radius,outer_ring_radius
0,28,-209.671128,132.328872,-82.852946,-74.671128,-182.288785,35.951213,-74.584240,-57.578786,-271.429914,-3.189916,-230.032345,267.847660,62.018204,225.144402
1,31,-165.334827,181.565173,9.248507,19.565173,-208.168851,129.491150,-90.406626,-130.228848,-297.253675,176.046324,-254.645984,257.794022,39.827731,338.355757
2,59,-203.601382,174.398618,-55.101382,-46.101382,-199.463150,190.246849,49.951853,88.921852,-231.594531,239.935475,-288.908229,214.271764,15.701972,338.843124
3,61,-181.854565,169.145435,-17.154565,-15.354565,-178.290837,180.239162,-11.496839,-22.405839,-282.205400,253.444594,-257.153725,242.496269,26.912363,330.094827
4,68,-223.286990,172.713010,-19.943240,6.213010,-196.079175,178.050822,20.702385,61.135823,-305.416168,274.583832,-257.977812,254.842188,99.614349,271.584330
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107017,2160095,-210.240756,185.759244,48.616387,86.759244,-207.641846,150.898155,-75.508036,-82.931847,-324.352603,255.657399,-133.898910,290.221089,125.189626,391.951444
107018,2160116,-217.829153,124.170847,-72.972010,-145.829153,-197.921506,160.608493,-46.495787,-42.041501,-281.400653,239.779340,-259.667657,259.742347,137.607344,224.544513
107019,2160117,-197.448833,158.451167,-52.048833,-125.448833,-201.482132,99.987872,-61.903561,-76.772133,-300.580966,258.439040,-133.846698,220.033299,174.905138,315.788637
107020,2160186,-181.840504,169.159496,-20.532812,-60.340504,-205.775258,168.344737,-10.322569,-3.125260,-277.855768,248.154234,-247.005241,247.814758,49.800621,287.626844


#### Compute `max_x-min_x` and `max_y-min_y`

In [14]:
grouped_hit_data['max_x-min_x'] = grouped_hit_data['x_realigned_max']-grouped_hit_data['x_realigned_min']
grouped_hit_data['max_y-min_y'] = grouped_hit_data['y_realigned_max']-grouped_hit_data['y_realigned_min']

#### Extracting radius from `sampled_data` (sampled Events data)

In [15]:
merged_df = pd.merge(sampled_data, grouped_hit_data, left_on='event', right_on='event')

In [16]:
merged_df

Unnamed: 0,event,ring_radius,ring_centre_pos_x,ring_centre_pos_y,x_realigned_min,x_realigned_max,x_realigned_mean,x_realigned_median,y_realigned_min,y_realigned_max,y_realigned_mean,y_realigned_median,min(x+y),max(x+y),min(x-y),max(x-y),inner_ring_radius,outer_ring_radius,max_x-min_x,max_y-min_y
0,80852,169.37271,-238.677630,-163.017230,-183.022372,167.977628,29.120485,131.977628,-127.072769,106.747227,-17.954202,-25.747772,-247.745151,238.724854,-238.189601,268.050397,112.957102,458.430128,351.0,233.819996
1,151070,173.66950,-181.111510,162.464000,-213.588489,164.411511,-67.133943,-159.588489,-203.144005,61.865997,-55.760367,-16.084000,-263.732494,101.567509,-221.454486,233.845517,102.874674,413.859388,378.0,265.010002
2,1947832,168.91771,-56.966620,73.462840,-179.833379,144.166621,-53.083379,-121.333379,-140.032838,140.567165,4.164663,8.062164,-288.686216,228.963782,-259.810547,241.609459,80.478051,263.607744,324.0,280.600002
3,780013,169.10721,-80.256260,99.883410,-165.543736,167.456264,23.456264,122.456264,-135.273408,114.146594,-8.165716,-10.563405,-98.167146,189.832854,-270.690331,266.729671,42.737796,316.418902,333.0,249.420002
4,1258686,169.08809,-158.903850,149.289110,-190.796146,169.203854,-65.439003,-127.796146,-174.379108,106.210892,-32.969107,-26.289104,-302.175254,166.294751,-234.007037,247.642964,53.135884,395.662920,360.0,280.590000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107017,524583,186.89357,-16.883510,60.280310,-183.916489,14.083511,-94.666489,-93.916489,-220.380315,169.339687,-19.024479,-25.520310,-312.526800,102.423197,-290.256176,216.463825,110.550521,280.960015,198.0,389.720001
107018,186952,181.78853,-79.519230,82.514370,-198.180766,-63.180766,-131.180766,-117.180766,-169.954376,157.405624,-52.176597,-107.604374,-285.375140,67.224858,-265.586391,106.773610,75.646337,252.996864,135.0,327.360001
107019,592331,188.78964,-165.157290,28.893960,-220.542712,184.457288,13.457288,44.957288,-209.863962,179.846045,-23.918961,-46.188961,-290.116674,270.123324,-272.618753,273.151252,50.553489,359.303636,405.0,389.710007
107020,2080943,188.58577,2.156836,-38.680283,-202.956836,103.043164,-82.456836,-85.956836,-214.949719,174.760287,-27.019717,-43.474719,-288.376557,262.223449,-270.367118,232.812875,145.497266,224.661533,306.0,389.710007


In [18]:
merged_df

Unnamed: 0,event,ring_radius,ring_centre_pos_x,ring_centre_pos_y,x_realigned_min,x_realigned_max,x_realigned_mean,x_realigned_median,y_realigned_min,y_realigned_max,y_realigned_mean,y_realigned_median,min(x+y),max(x+y),min(x-y),max(x-y),inner_ring_radius,outer_ring_radius,max_x-min_x,max_y-min_y
0,80852,169.37271,-238.677630,-163.017230,-183.022372,167.977628,29.120485,131.977628,-127.072769,106.747227,-17.954202,-25.747772,-247.745151,238.724854,-238.189601,268.050397,112.957102,458.430128,351.0,233.819996
1,151070,173.66950,-181.111510,162.464000,-213.588489,164.411511,-67.133943,-159.588489,-203.144005,61.865997,-55.760367,-16.084000,-263.732494,101.567509,-221.454486,233.845517,102.874674,413.859388,378.0,265.010002
2,1947832,168.91771,-56.966620,73.462840,-179.833379,144.166621,-53.083379,-121.333379,-140.032838,140.567165,4.164663,8.062164,-288.686216,228.963782,-259.810547,241.609459,80.478051,263.607744,324.0,280.600002
3,780013,169.10721,-80.256260,99.883410,-165.543736,167.456264,23.456264,122.456264,-135.273408,114.146594,-8.165716,-10.563405,-98.167146,189.832854,-270.690331,266.729671,42.737796,316.418902,333.0,249.420002
4,1258686,169.08809,-158.903850,149.289110,-190.796146,169.203854,-65.439003,-127.796146,-174.379108,106.210892,-32.969107,-26.289104,-302.175254,166.294751,-234.007037,247.642964,53.135884,395.662920,360.0,280.590000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107017,524583,186.89357,-16.883510,60.280310,-183.916489,14.083511,-94.666489,-93.916489,-220.380315,169.339687,-19.024479,-25.520310,-312.526800,102.423197,-290.256176,216.463825,110.550521,280.960015,198.0,389.720001
107018,186952,181.78853,-79.519230,82.514370,-198.180766,-63.180766,-131.180766,-117.180766,-169.954376,157.405624,-52.176597,-107.604374,-285.375140,67.224858,-265.586391,106.773610,75.646337,252.996864,135.0,327.360001
107019,592331,188.78964,-165.157290,28.893960,-220.542712,184.457288,13.457288,44.957288,-209.863962,179.846045,-23.918961,-46.188961,-290.116674,270.123324,-272.618753,273.151252,50.553489,359.303636,405.0,389.710007
107020,2080943,188.58577,2.156836,-38.680283,-202.956836,103.043164,-82.456836,-85.956836,-214.949719,174.760287,-27.019717,-43.474719,-288.376557,262.223449,-270.367118,232.812875,145.497266,224.661533,306.0,389.710007


In [19]:
grouped_hit_data['ring_radius'] = merged_df['ring_radius']

#### Drop all the NAs 

In [20]:
grouped_hit_data = grouped_hit_data.dropna()

#### Filter outliers for radius 

In [21]:
grouped_hit_data = grouped_hit_data.query('ring_radius <= 1000')
grouped_hit_data

Unnamed: 0,event,x_realigned_min,x_realigned_max,x_realigned_mean,x_realigned_median,y_realigned_min,y_realigned_max,y_realigned_mean,y_realigned_median,min(x+y),max(x+y),min(x-y),max(x-y),inner_ring_radius,outer_ring_radius,max_x-min_x,max_y-min_y,ring_radius
0,28,-209.671128,132.328872,-82.852946,-74.671128,-182.288785,35.951213,-74.584240,-57.578786,-271.429914,-3.189916,-230.032345,267.847660,62.018204,225.144402,342.0,218.239998,169.37271
1,31,-165.334827,181.565173,9.248507,19.565173,-208.168851,129.491150,-90.406626,-130.228848,-297.253675,176.046324,-254.645984,257.794022,39.827731,338.355757,346.9,337.660001,173.66950
2,59,-203.601382,174.398618,-55.101382,-46.101382,-199.463150,190.246849,49.951853,88.921852,-231.594531,239.935475,-288.908229,214.271764,15.701972,338.843124,378.0,389.709999,168.91771
3,61,-181.854565,169.145435,-17.154565,-15.354565,-178.290837,180.239162,-11.496839,-22.405839,-282.205400,253.444594,-257.153725,242.496269,26.912363,330.094827,351.0,358.529999,169.10721
4,68,-223.286990,172.713010,-19.943240,6.213010,-196.079175,178.050822,20.702385,61.135823,-305.416168,274.583832,-257.977812,254.842188,99.614349,271.584330,396.0,374.129997,169.08809
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107017,2160095,-210.240756,185.759244,48.616387,86.759244,-207.641846,150.898155,-75.508036,-82.931847,-324.352603,255.657399,-133.898910,290.221089,125.189626,391.951444,396.0,358.540001,186.89357
107018,2160116,-217.829153,124.170847,-72.972010,-145.829153,-197.921506,160.608493,-46.495787,-42.041501,-281.400653,239.779340,-259.667657,259.742347,137.607344,224.544513,342.0,358.529999,181.78853
107019,2160117,-197.448833,158.451167,-52.048833,-125.448833,-201.482132,99.987872,-61.903561,-76.772133,-300.580966,258.439040,-133.846698,220.033299,174.905138,315.788637,355.9,301.470004,188.78964
107020,2160186,-181.840504,169.159496,-20.532812,-60.340504,-205.775258,168.344737,-10.322569,-3.125260,-277.855768,248.154234,-247.005241,247.814758,49.800621,287.626844,351.0,374.119995,188.58577


### 3. Training 

#### Train test split 

In [22]:
train_df, test_df = train_test_split(grouped_hit_data, random_state=42)

In [23]:
X_train = train_df.drop(columns=['ring_radius','event'])
y_train = train_df['ring_radius']
X_test = test_df.drop(columns=['ring_radius','event'])
y_test = test_df['ring_radius']

#### Fitting the `XGBRegressor` model

In [24]:
xgb_reg = xgb.XGBRegressor()

In [25]:
xgb_reg.fit(X_train, y_train)

In [26]:
reg_metrics = ['neg_root_mean_squared_error', 'r2']

In [27]:
from sklearn.model_selection import cross_val_score, cross_validate

#### Cross validate

In [28]:
xgb_cv = cross_validate(xgb_reg, X_train, y_train, cv=5, return_train_score=True, scoring=reg_metrics)

In [29]:
pd.DataFrame(xgb_cv)

Unnamed: 0,fit_time,score_time,test_neg_root_mean_squared_error,train_neg_root_mean_squared_error,test_r2,train_r2
0,2.115165,0.007499,-6.528172,-5.523952,-0.048264,0.286256
1,2.112879,0.005526,-6.486445,-5.599201,-0.046617,0.268656
2,2.082465,0.006098,-6.561469,-5.530895,-0.034622,0.280472
3,2.133665,0.005091,-7.044046,-5.498533,-0.035871,0.26156
4,2.055981,0.005125,-6.562134,-5.539866,-0.042956,0.279478


#### Predict and score on the test set

In [30]:
ypred = xgb_reg.predict(X_test)
mse = mean_squared_error(y_test, ypred)

In [31]:
pred_df = pd.DataFrame(ypred, y_test).reset_index()
pred_df.columns=['theoretical_radius', 'predcited_radius']

In [32]:
pred_df

Unnamed: 0,theoretical_radius,predcited_radius
0,186.94850,182.040985
1,186.24666,181.639954
2,181.21791,182.020264
3,189.69760,181.293900
4,184.09232,181.371185
...,...,...
26596,185.21375,182.367035
26597,185.27415,181.875717
26598,173.03697,181.225769
26599,170.28674,181.238708


In [33]:
mse

44.15870856503324

In [34]:
rmse = (mse)**0.5
rmse

6.645201920561424