Task 1: The first task is to download and parse a gene annotation from the GENCODE database. The code should be written in R. The following are the specifications:

The gene annotation can be found here: https://www.gencodegenes.org/human/. You will use the GTF file for the comprehensive gene annotation for the primary assembly (chromosomes and scaffolds). Please use release 45.
The R code needs to have the following:
a function to download the gtf file
creation of a TxDb object from the gtf file
build an S4 object containing all genes and all transcript IDs corresponding to each gene
From the S4 object compute the mean, minimum and maximum number of transcripts that each gene has
Make a histogram of the number of transcripts each gene has
Save the S4 object as a .rds file

Task 2: You will download a dataset and build a boosted decision tree using xgboost in Python.

Download the data file from http://129.10.224.71/~apaul/data/tests/dataset.csv.gz. It has a 4D input (x1, x2, x3, x4) and a 2D output (y1, y2), with 1M samples (you do not need to use all of them).
You will have to construct a boosted decision tree which takes a 4D input and gives a 1D output for a regression task. You can choose either y1 or y2.
Please make 3 splits of the data: training, validation and testing. Use the training and the validation sets for training the boosted decision tree and for any hyperparameter tuning. Use the test set for evaluating the model post training. Please specify what split you have chosen to take and why.

Clearly explain what hyperparameters you choose and why
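
The download step in the task statement can be sketched as below. This is a minimal sketch using only the standard library; the helper name `fetch` and the default file name are assumptions for illustration, and only the URL comes from the task statement. The download is skipped when the file is already on disk.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the task statement.
DATA_URL = "http://129.10.224.71/~apaul/data/tests/dataset.csv.gz"


def fetch(url: str, dest: str = "dataset.csv.gz") -> Path:
    """Download `url` to `dest` unless the file already exists; return the path.

    `fetch` is a hypothetical helper, not part of the original notebook.
    """
    path = Path(dest)
    if not path.exists():
        urlretrieve(url, path)
    return path
```

The returned path can be passed straight to `pd.read_csv(..., compression="gzip")`. Since the task does not require all 1M samples, the `nrows` argument of `pd.read_csv` is one way to read only a subset.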

In [13]:
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [2]:
def x_scale(x, p=7.5):
    '''
    function for scaling x1
    argument:
        x: the input variable
        p: the scaling factor (default: 7.5)
    returns:
        the scaled variable
    '''
    return 1/p * np.log(1 + x * (np.exp(p) - 1))

def y_scale(y):
    '''
    function for scaling y1 and y2
    argument:
        y: the input variable
    returns:
        the scaled variable
    '''
    return np.log(1 + y) if y >= 0 else -np.log(1 - y)


In [3]:
df = pd.read_csv("dataset.csv.gz", compression="gzip")
df

Unnamed: 0,x1,x2,x3,x4,y1,y2
0,0.000001,0.531085,0.855767,0.611623,4.970868,-6.548192
1,0.000002,0.495374,0.961166,0.213009,70.588434,-79.824819
2,0.000002,0.045435,0.109773,0.292582,75.105905,-77.123068
3,0.000002,0.064068,0.774908,0.820672,93.196487,-96.298052
4,0.000003,0.606808,0.861252,0.604051,-27.099713,26.709271
...,...,...,...,...,...,...
999995,0.112098,0.017806,0.594506,0.500338,387.199202,-64.342897
999996,0.112162,0.100856,0.503775,0.872723,90.173227,-27.617936
999997,0.112235,0.154529,0.147156,0.431693,57.011681,-21.132118
999998,0.112255,0.282299,0.888741,0.684521,24.573689,-9.313387


In [4]:
# Scale x1 and y1

df['x1'] = df['x1'].apply(x_scale)
df['y1'] = df['y1'].apply(y_scale)
df

Unnamed: 0,x1,x2,x3,x4,y1,y2
0,0.000355,0.531085,0.855767,0.611623,1.786892,-6.548192
1,0.000469,0.495374,0.961166,0.213009,4.270934,-79.824819
2,0.000546,0.045435,0.109773,0.292582,4.332126,-77.123068
3,0.000568,0.064068,0.774908,0.820672,4.545383,-96.298052
4,0.000698,0.606808,0.861252,0.604051,-3.335759,26.709271
...,...,...,...,...,...,...
999995,0.708798,0.017806,0.594506,0.500338,5.961519,-64.342897
999996,0.708874,0.100856,0.503775,0.872723,4.512761,-27.617936
999997,0.708960,0.154529,0.147156,0.431693,4.060644,-21.132118
999998,0.708984,0.282299,0.888741,0.684521,3.241564,-9.313387
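
Because the model is trained on the scaled `y1`, its predictions and error metrics are in scaled units. A minimal sketch of the inverse transforms follows, assuming the same `x_scale` and `y_scale` definitions as above; the names `y_scale_vec`, `y_unscale` and `x_unscale` are hypothetical and not part of the original notebook. The vectorized forward transform is included only so the round trip can be checked on whole arrays.

```python
import numpy as np


def y_scale_vec(y):
    """Vectorized equivalent of y_scale: sign(y) * log(1 + |y|)."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.log1p(np.abs(y))


def y_unscale(z):
    """Inverse of y_scale: sign(z) * (exp(|z|) - 1)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.expm1(np.abs(z))


def x_unscale(s, p=7.5):
    """Inverse of x_scale: x = (exp(p * s) - 1) / (exp(p) - 1)."""
    s = np.asarray(s, dtype=float)
    return np.expm1(p * s) / np.expm1(p)


# Round trip on two y1 values from the table above.
raw = np.array([4.970868, -27.099713])
scaled = y_scale_vec(raw)
print(scaled)
print(y_unscale(scaled))
```

Applying `y_unscale` to the model's predictions would give values comparable to the raw `y1` column.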


In [5]:
# Define features and target
features = ['x1', 'x2', 'x3', 'x4']
target = 'y1'

X = df[features]
y = df[target]

In [6]:
# Split data into training, validation and test sets (70% / 15% / 15%)

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

print("Shape of Training Data: ", X_train.shape)
print("Shape of Validation Data: ", X_val.shape)
print("Shape of Testing Data: ", X_test.shape)

Shape of Training Data:  (700000, 4)
Shape of Validation Data:  (150000, 4)
Shape of Testing Data:  (150000, 4)
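
The shapes above correspond to a 70% / 15% / 15% split. The same proportions can be obtained with a single shuffle of the row indices, sketched below with NumPy only. The function name `split_indices` is an assumption for illustration, and the rows it selects will not match those chosen by `train_test_split` with `random_state=42`.

```python
import numpy as np


def split_indices(n, frac_train=0.70, frac_val=0.15, seed=42):
    """Shuffle range(n) once and cut it into train / validation / test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(frac_train * n))
    n_val = int(round(frac_val * n))
    # Whatever is left after train and validation becomes the test set.
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(1_000_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700000 150000 150000
```

With a pandas DataFrame, `X.iloc[train_idx]` and `y.iloc[train_idx]` would select the training rows, and likewise for the other two index arrays.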


In [7]:
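# Train the XGBoost model
model = XGBRegressor(n_estimators=1000,       # number of boosting rounds (trees)
                     max_depth=7,             # maximum depth of each tree
                     learning_rate=0.1,       # step-size shrinkage (xgboost's eta)
                     subsample=0.7,           # fraction of training rows sampled per tree
                     colsample_bytree=1.0,    # use all four features for every tree
                     eval_metric=mean_absolute_error)  # extra metric logged on eval_set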

model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Predict on validation set
y_pred = model.predict(X_val)

# Calculate RMSE on validation set
val_rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print("Validation RMSE:", val_rmse)


[0]	validation_0-rmse:3.28042	validation_0-mean_absolute_error:3.08684
[1]	validation_0-rmse:2.96903	validation_0-mean_absolute_error:2.79584
[2]	validation_0-rmse:2.68963	validation_0-mean_absolute_error:2.53357
[3]	validation_0-rmse:2.43820	validation_0-mean_absolute_error:2.29678
[4]	validation_0-rmse:2.21257	validation_0-mean_absolute_error:2.08340
[5]	validation_0-rmse:2.00968	validation_0-mean_absolute_error:1.89036
[6]	validation_0-rmse:1.82831	validation_0-mean_absolute_error:1.71691
[7]	validation_0-rmse:1.66697	validation_0-mean_absolute_error:1.56173
[8]	validation_0-rmse:1.52280	validation_0-mean_absolute_error:1.42195
[9]	validation_0-rmse:1.39203	validation_0-mean_absolute_error:1.29445
[10]	validation_0-rmse:1.27703	validation_0-mean_absolute_error:1.18117
[11]	validation_0-rmse:1.17309	validation_0-mean_absolute_error:1.07808
[12]	validation_0-rmse:1.08045	validation_0-mean_absolute_error:0.98522
[13]	validation_0-rmse:0.99781	validation_0-mean_absolute_error:0.90144
...

[114]	validation_0-rmse:0.22334	validation_0-mean_absolute_error:0.10118
[115]	validation_0-rmse:0.22231	validation_0-mean_absolute_error:0.10058
[116]	validation_0-rmse:0.22175	validation_0-mean_absolute_error:0.10025
[117]	validation_0-rmse:0.22136	validation_0-mean_absolute_error:0.09999
[118]	validation_0-rmse:0.22064	validation_0-mean_absolute_error:0.09948
[119]	validation_0-rmse:0.22062	validation_0-mean_absolute_error:0.09946
[120]	validation_0-rmse:0.22006	validation_0-mean_absolute_error:0.09906
[121]	validation_0-rmse:0.21997	validation_0-mean_absolute_error:0.09903
[122]	validation_0-rmse:0.21967	validation_0-mean_absolute_error:0.09887
[123]	validation_0-rmse:0.21965	validation_0-mean_absolute_error:0.09885
[124]	validation_0-rmse:0.21920	validation_0-mean_absolute_error:0.09856
[125]	validation_0-rmse:0.21852	validation_0-mean_absolute_error:0.09814
[126]	validation_0-rmse:0.21821	validation_0-mean_absolute_error:0.09795
[127]	validation_0-rmse:0.21799
...

[227]	validation_0-rmse:0.19914	validation_0-mean_absolute_error:0.08648
[228]	validation_0-rmse:0.19889	validation_0-mean_absolute_error:0.08635
[229]	validation_0-rmse:0.19851	validation_0-mean_absolute_error:0.08618
[230]	validation_0-rmse:0.19850	validation_0-mean_absolute_error:0.08616
[231]	validation_0-rmse:0.19846	validation_0-mean_absolute_error:0.08612
[232]	validation_0-rmse:0.19842	validation_0-mean_absolute_error:0.08609
[233]	validation_0-rmse:0.19792	validation_0-mean_absolute_error:0.08591
[234]	validation_0-rmse:0.19786	validation_0-mean_absolute_error:0.08588
[235]	validation_0-rmse:0.19784	validation_0-mean_absolute_error:0.08586
[236]	validation_0-rmse:0.19778	validation_0-mean_absolute_error:0.08582
[237]	validation_0-rmse:0.19773	validation_0-mean_absolute_error:0.08581
[238]	validation_0-rmse:0.19770	validation_0-mean_absolute_error:0.08580
[239]	validation_0-rmse:0.19767	validation_0-mean_absolute_error:0.08575
[240]	validation_0-rmse:0.19733
...

[340]	validation_0-rmse:0.18552	validation_0-mean_absolute_error:0.07823
[341]	validation_0-rmse:0.18549	validation_0-mean_absolute_error:0.07822
[342]	validation_0-rmse:0.18546	validation_0-mean_absolute_error:0.07821
[343]	validation_0-rmse:0.18524	validation_0-mean_absolute_error:0.07808
[344]	validation_0-rmse:0.18518	validation_0-mean_absolute_error:0.07805
[345]	validation_0-rmse:0.18513	validation_0-mean_absolute_error:0.07803
[346]	validation_0-rmse:0.18508	validation_0-mean_absolute_error:0.07799
[347]	validation_0-rmse:0.18500	validation_0-mean_absolute_error:0.07795
[348]	validation_0-rmse:0.18495	validation_0-mean_absolute_error:0.07791
[349]	validation_0-rmse:0.18493	validation_0-mean_absolute_error:0.07789
[350]	validation_0-rmse:0.18485	validation_0-mean_absolute_error:0.07785
[351]	validation_0-rmse:0.18486	validation_0-mean_absolute_error:0.07784
[352]	validation_0-rmse:0.18475	validation_0-mean_absolute_error:0.07778
[353]	validation_0-rmse:0.18472
...

[453]	validation_0-rmse:0.17763	validation_0-mean_absolute_error:0.07349
[454]	validation_0-rmse:0.17761	validation_0-mean_absolute_error:0.07347
[455]	validation_0-rmse:0.17759	validation_0-mean_absolute_error:0.07345
[456]	validation_0-rmse:0.17746	validation_0-mean_absolute_error:0.07337
[457]	validation_0-rmse:0.17731	validation_0-mean_absolute_error:0.07327
[458]	validation_0-rmse:0.17724	validation_0-mean_absolute_error:0.07322
[459]	validation_0-rmse:0.17720	validation_0-mean_absolute_error:0.07319
[460]	validation_0-rmse:0.17720	validation_0-mean_absolute_error:0.07319
[461]	validation_0-rmse:0.17708	validation_0-mean_absolute_error:0.07311
[462]	validation_0-rmse:0.17694	validation_0-mean_absolute_error:0.07301
[463]	validation_0-rmse:0.17686	validation_0-mean_absolute_error:0.07297
[464]	validation_0-rmse:0.17682	validation_0-mean_absolute_error:0.07295
[465]	validation_0-rmse:0.17671	validation_0-mean_absolute_error:0.07291
[466]	validation_0-rmse:0.17668
...

[566]	validation_0-rmse:0.17155	validation_0-mean_absolute_error:0.06977
[567]	validation_0-rmse:0.17153	validation_0-mean_absolute_error:0.06976
[568]	validation_0-rmse:0.17129	validation_0-mean_absolute_error:0.06962
[569]	validation_0-rmse:0.17111	validation_0-mean_absolute_error:0.06957
[570]	validation_0-rmse:0.17105	validation_0-mean_absolute_error:0.06955
[571]	validation_0-rmse:0.17094	validation_0-mean_absolute_error:0.06948
[572]	validation_0-rmse:0.17092	validation_0-mean_absolute_error:0.06946
[573]	validation_0-rmse:0.17088	validation_0-mean_absolute_error:0.06945
[574]	validation_0-rmse:0.17082	validation_0-mean_absolute_error:0.06944
[575]	validation_0-rmse:0.17081	validation_0-mean_absolute_error:0.06943
[576]	validation_0-rmse:0.17072	validation_0-mean_absolute_error:0.06939
[577]	validation_0-rmse:0.17072	validation_0-mean_absolute_error:0.06938
[578]	validation_0-rmse:0.17068	validation_0-mean_absolute_error:0.06936
[579]	validation_0-rmse:0.17064
...

[679]	validation_0-rmse:0.16702	validation_0-mean_absolute_error:0.06705
[680]	validation_0-rmse:0.16695	validation_0-mean_absolute_error:0.06703
[681]	validation_0-rmse:0.16695	validation_0-mean_absolute_error:0.06702
[682]	validation_0-rmse:0.16677	validation_0-mean_absolute_error:0.06690
[683]	validation_0-rmse:0.16676	validation_0-mean_absolute_error:0.06689
[684]	validation_0-rmse:0.16668	validation_0-mean_absolute_error:0.06686
[685]	validation_0-rmse:0.16663	validation_0-mean_absolute_error:0.06684
[686]	validation_0-rmse:0.16662	validation_0-mean_absolute_error:0.06682
[687]	validation_0-rmse:0.16660	validation_0-mean_absolute_error:0.06680
[688]	validation_0-rmse:0.16652	validation_0-mean_absolute_error:0.06677
[689]	validation_0-rmse:0.16646	validation_0-mean_absolute_error:0.06671
[690]	validation_0-rmse:0.16639	validation_0-mean_absolute_error:0.06665
[691]	validation_0-rmse:0.16633	validation_0-mean_absolute_error:0.06663
[692]	validation_0-rmse:0.16636
...

[792]	validation_0-rmse:0.16329	validation_0-mean_absolute_error:0.06467
[793]	validation_0-rmse:0.16327	validation_0-mean_absolute_error:0.06464
[794]	validation_0-rmse:0.16327	validation_0-mean_absolute_error:0.06464
[795]	validation_0-rmse:0.16326	validation_0-mean_absolute_error:0.06463
[796]	validation_0-rmse:0.16314	validation_0-mean_absolute_error:0.06456
[797]	validation_0-rmse:0.16308	validation_0-mean_absolute_error:0.06451
[798]	validation_0-rmse:0.16302	validation_0-mean_absolute_error:0.06445
[799]	validation_0-rmse:0.16300	validation_0-mean_absolute_error:0.06444
[800]	validation_0-rmse:0.16300	validation_0-mean_absolute_error:0.06444
[801]	validation_0-rmse:0.16300	validation_0-mean_absolute_error:0.06444
[802]	validation_0-rmse:0.16298	validation_0-mean_absolute_error:0.06443
[803]	validation_0-rmse:0.16298	validation_0-mean_absolute_error:0.06442
[804]	validation_0-rmse:0.16297	validation_0-mean_absolute_error:0.06440
[805]	validation_0-rmse:0.16294
...

[905]	validation_0-rmse:0.16090	validation_0-mean_absolute_error:0.06297
[906]	validation_0-rmse:0.16090	validation_0-mean_absolute_error:0.06296
[907]	validation_0-rmse:0.16087	validation_0-mean_absolute_error:0.06294
[908]	validation_0-rmse:0.16086	validation_0-mean_absolute_error:0.06294
[909]	validation_0-rmse:0.16084	validation_0-mean_absolute_error:0.06293
[910]	validation_0-rmse:0.16083	validation_0-mean_absolute_error:0.06292
[911]	validation_0-rmse:0.16080	validation_0-mean_absolute_error:0.06290
[912]	validation_0-rmse:0.16079	validation_0-mean_absolute_error:0.06289
[913]	validation_0-rmse:0.16079	validation_0-mean_absolute_error:0.06289
[914]	validation_0-rmse:0.16078	validation_0-mean_absolute_error:0.06288
[915]	validation_0-rmse:0.16068	validation_0-mean_absolute_error:0.06284
[916]	validation_0-rmse:0.16066	validation_0-mean_absolute_error:0.06283
[917]	validation_0-rmse:0.16064	validation_0-mean_absolute_error:0.06282
[918]	validation_0-rmse:0.16063
...
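
The log shows the validation RMSE still falling late in training, but much more slowly than at the start. A small sketch for quantifying this follows: it parses complete log lines and computes the average RMSE decrease per boosting round between consecutive logged points. The sample lines are copied from the log above; the function names and the regular expression are assumptions for illustration.

```python
import re

# Complete lines copied from the training log (tab-separated, as printed by xgboost).
LOG = "\n".join([
    "[13]\tvalidation_0-rmse:0.99781\tvalidation_0-mean_absolute_error:0.90144",
    "[114]\tvalidation_0-rmse:0.22334\tvalidation_0-mean_absolute_error:0.10118",
    "[227]\tvalidation_0-rmse:0.19914\tvalidation_0-mean_absolute_error:0.08648",
    "[340]\tvalidation_0-rmse:0.18552\tvalidation_0-mean_absolute_error:0.07823",
    "[453]\tvalidation_0-rmse:0.17763\tvalidation_0-mean_absolute_error:0.07349",
    "[566]\tvalidation_0-rmse:0.17155\tvalidation_0-mean_absolute_error:0.06977",
    "[679]\tvalidation_0-rmse:0.16702\tvalidation_0-mean_absolute_error:0.06705",
    "[792]\tvalidation_0-rmse:0.16329\tvalidation_0-mean_absolute_error:0.06467",
    "[905]\tvalidation_0-rmse:0.16090\tvalidation_0-mean_absolute_error:0.06297",
])

LINE_RE = re.compile(r"\[(\d+)\]\s+validation_0-rmse:(\d+\.\d+)")


def parse_rmse(log_text):
    """Return a list of (round, rmse) pairs, one per matching log line."""
    return [(int(r), float(v)) for r, v in LINE_RE.findall(log_text)]


def drop_per_round(points):
    """Average RMSE decrease per round between consecutive logged points."""
    return [
        (r1, (v0 - v1) / (r1 - r0))
        for (r0, v0), (r1, v1) in zip(points, points[1:])
    ]


points = parse_rmse(LOG)
for upto, rate in drop_per_round(points):
    print(f"up to round {upto}: {rate:.6f} RMSE per round")
```

When the per-round gain becomes this small, one option worth considering is xgboost's `early_stopping_rounds` setting, which halts training once the validation metric stops improving for a given number of rounds.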

In [15]:
# Evaluate the model on the test set
y_test_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_test_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_test_pred)
r_squared = r2_score(y_test, y_test_pred)

print("Mean Squared Error (MSE):", round(mse, 3))
print("Root Mean Squared Error (RMSE):", round(rmse, 3))
print("Mean Absolute Error (MAE):", round(mae, 3))
print("R-squared (R2) Score:", round(r_squared, 3))


Mean Squared Error (MSE): 0.023
Root Mean Squared Error (RMSE): 0.152
Mean Absolute Error (MAE): 0.062
R-squared (R2) Score: 0.998
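
For reference, the four reported metrics can be written out directly from their definitions. The sketch below uses NumPy only; the function name `regression_metrics` is an assumption, and the notebook itself relies on the scikit-learn functions imported at the top. All values are in the scaled `y1` units used for training.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)                          # mean of squared errors
    mae = np.mean(np.abs(err))                       # mean of absolute errors
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": mae,
        "r2": 1.0 - ss_res / ss_tot,
    }


print(regression_metrics([1, 2, 3, 4], [1, 2, 3, 6]))
```

As a consistency check on the output above, the reported RMSE of 0.152 squared is about 0.023, which matches the reported MSE.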
