### Loading Dataset

In [1]:
import pandas as pd
import scipy.stats as stats 

In [2]:
test_1 = pd.read_csv('drive-download-20250414T115948Z-001/test1.csv')
test_2 = pd.read_csv('drive-download-20250414T115948Z-001/test2.csv')
train = pd.read_csv('drive-download-20250414T115948Z-001/train.csv')

In [3]:
test_1.head()

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,3123,18/07/2004,21.00.00,12,1067.0,-200.0,90,938.0,102.0,825.0,99.0,1520.0,912.0,297,248,10160,,
1,877,16/04/2004,07.00.00,45,1657.0,523.0,232,1384.0,352.0,579.0,109.0,2176.0,1600.0,128,710,10428,,
2,3457,01/08/2004,19.00.00,14,1037.0,-200.0,80,900.0,75.0,817.0,95.0,1584.0,619.0,331,327,16200,,
3,1494,12/05/2004,00.00.00,17,1122.0,-200.0,87,926.0,105.0,805.0,88.0,1619.0,1174.0,169,588,11250,,
4,713,09/04/2004,11.00.00,26,-200.0,262.0,-2000,-200.0,219.0,-200.0,121.0,-200.0,-200.0,-200,-200,-200,,


In [4]:
test_2.head()

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,8500,27/02/2005,22.00.00,10,875.0,-200.0,21,594.0,128.0,1079.0,105.0,793.0,451.0,45,480,4085,,
1,8501,27/02/2005,23.00.00,13,943.0,-200.0,39,703.0,169.0,950.0,119.0,870.0,581.0,43,486,4069,,
2,8502,28/02/2005,00.00.00,16,947.0,-200.0,38,697.0,215.0,913.0,150.0,878.0,698.0,40,500,4115,,
3,8503,28/02/2005,01.00.00,10,865.0,-200.0,18,566.0,111.0,1119.0,94.0,797.0,423.0,40,529,4338,,
4,8504,28/02/2005,02.00.00,6,823.0,-200.0,10,503.0,60.0,1268.0,56.0,755.0,332.0,40,510,4200,,


In [5]:
train.head()

Unnamed: 0.1,Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
0,1849,26/05/2004,19.00.00,-200,1130.0,-200.0,227,1368.0,-200.0,933.0,-200.0,1709.0,1269.0,267,195,6754,,
1,2533,24/06/2004,07.00.00,12,1030.0,-200.0,69,851.0,102.0,824.0,68.0,1700.0,983.0,219,570,14742,,
2,3047,15/07/2004,17.00.00,32,1164.0,-200.0,203,1306.0,259.0,648.0,198.0,1886.0,1218.0,355,191,10888,,
3,805,13/04/2004,07.00.00,39,1496.0,524.0,191,1272.0,328.0,667.0,130.0,2011.0,1399.0,110,642,8398,,
4,2962,12/07/2004,04.00.00,-200,780.0,-200.0,18,568.0,24.0,1200.0,34.0,1331.0,501.0,199,513,11803,,
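
The previews above show two recurring patterns: the trailing `Unnamed: 15` and `Unnamed: 16` columns appear to be empty, and many cells hold `-200`, which looks like a placeholder for a missing reading. The sketch below counts both per column. It runs on a small made-up frame rather than the real files, and `summarize_quality` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame, sentinel: float = -200) -> pd.DataFrame:
    """Count NaN cells and sentinel-valued cells in every column."""
    # Coerce each column to numeric so string cells such as "-200" are also caught.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    return pd.DataFrame({
        "n_nan": df.isna().sum(),
        "n_sentinel": (numeric == sentinel).sum(),
    })

# Toy data that mimics the shape of the previews (values are illustrative only).
toy = pd.DataFrame({
    "NO2(GT)": [99.0, -200.0, 95.0, 88.0],
    "T": ["297", "128", "-200", "169"],
    "Unnamed: 15": [None, None, None, None],
})
report = summarize_quality(toy)
print(report)
```

Calling the same helper on `train`, `test_1`, and `test_2` would show how many `NO2(GT)` readings are lost once negative values are turned into NaN.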


#### Cleaning (converting to numeric) and removing negative values (as instructed by the TA)

In [6]:
# #Converting NO2(GT) col to numeric values, replacing invalid entries with NaN
# train['NO2(GT)'] = pd.to_numeric(train['NO2(GT)'], errors='coerce')
# test_1['NO2(GT)'] = pd.to_numeric(test_1['NO2(GT)'], errors='coerce')
# test_2['NO2(GT)'] = pd.to_numeric(test_2['NO2(GT)'], errors='coerce')

In [7]:
for df in (train, test_1, test_2):
    df['NO2(GT)'] = (
        pd.to_numeric(df['NO2(GT)'], errors='coerce')      #strings → numeric / NaN
          .where(lambda s: s >= 0)                        #keep only non‑negatives, else NaN
    )
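
To make the effect of this cell concrete, here is a minimal sketch of the same two steps on a toy Series. The values are invented for illustration: unparseable strings become NaN through `errors='coerce'`, and negative numbers become NaN through `.where`.

```python
import pandas as pd

# Illustrative raw values: a valid reading, a negative placeholder,
# an unparseable string, another valid reading, and a zero.
raw = pd.Series(["99.0", "-200.0", "n/a", "121.0", "0"])

cleaned = (
    pd.to_numeric(raw, errors="coerce")   # "n/a" -> NaN
      .where(lambda s: s >= 0)            # -200.0 -> NaN, 0 is kept
)
print(cleaned)
```

Zero is kept because the condition is `>= 0`; only strictly negative values are masked.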

### Kolmogorov-Smirnov test

In [8]:
ks_test_1_stat, ks_test_1_p = stats.ks_2samp(train['NO2(GT)'].dropna(), test_1['NO2(GT)'].dropna())#dropping NaN values
ks_test_2_stat, ks_test_2_p = stats.ks_2samp(train['NO2(GT)'].dropna(), test_2['NO2(GT)'].dropna())

In [9]:
print("KS Test for test_1 vs train:")
print(f"KS Statistic: {ks_test_1_stat}, P-value: {ks_test_1_p}")

print("\nKS Test for test_2 vs train:")
print(f"KS Statistic: {ks_test_2_stat}, P-value: {ks_test_2_p}")

KS Test for test_1 vs train:
KS Statistic: 0.017062220028073977, P-value: 0.9971378232852736

KS Test for test_2 vs train:
KS Statistic: 0.3688536442438679, P-value: 2.53172387531317e-74
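
The KS statistic reported above is the largest vertical gap between the two empirical cumulative distribution functions (ECDFs). The sketch below computes that gap by hand on synthetic data and compares it with `stats.ks_2samp`. The samples are randomly generated stand-ins, not the `NO2(GT)` columns, and `ks_statistic` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

def ks_statistic(a, b) -> float:
    """Largest absolute difference between the ECDFs of two samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the ECDFs only change at observed points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
reference = rng.normal(100, 30, size=500)   # stand-in for the training sample
similar = rng.normal(100, 30, size=500)     # drawn from the same distribution
shifted = rng.normal(140, 30, size=500)     # mean moved upward

d_similar = ks_statistic(reference, similar)
d_shifted = ks_statistic(reference, shifted)
print(f"similar: {d_similar:.3f}, shifted: {d_shifted:.3f}")
print("matches scipy:", np.isclose(d_shifted, stats.ks_2samp(reference, shifted).statistic))
```

A small statistic means the two ECDFs nearly overlap; a large one means that at some value the two samples have accumulated very different fractions of their data.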


### Comment on covariate shift 

In [10]:
if ks_test_1_p < 0.05:
    print("\nCovariate shift detected in test_1.")
else:
    print("\nNo covariate shift detected in test_1.")

if ks_test_2_p < 0.05:
    print("\nCovariate shift detected in test_2.")
else:
    print("\nNo covariate shift detected in test_2.")


No covariate shift detected in test_1.

Covariate shift detected in test_2.


### Explanation

For the **KS test between test_1 and train**, the **KS statistic** is **0.017** and the **p-value** is **0.997**, far above the common threshold of **0.05**. We therefore cannot reject the hypothesis that `NO2(GT)` follows the same distribution in `test_1` and `train`: the test finds **no significant difference**, so **no covariate shift** is detected in test_1.

On the other hand, for the **KS test between test_2 and train**, the **KS statistic** is **0.369** and the **p-value** is extremely small (**2.5e-74**), effectively zero and **much less than 0.05**. This indicates a **significant difference** in the distribution of `NO2(GT)` between `test_2` and `train`, suggesting the presence of a **covariate shift** in test_2.

In summary, **test_1** does not exhibit a covariate shift, while **test_2** shows significant covariate shift compared to the training dataset.
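
The p-value comparison can also be read as a threshold on the KS statistic itself. A common large-sample approximation rejects equality at level α when the statistic exceeds `c(α) * sqrt((n + m) / (n * m))`, with `c(α) = sqrt(-ln(α / 2) / 2)`. The sketch below evaluates that threshold. The sample sizes are assumptions chosen for illustration, since the notebook does not report how many non-missing `NO2(GT)` values each set has, and `ks_critical_value` is a hypothetical helper name.

```python
import math

def ks_critical_value(n: int, m: int, alpha: float = 0.05) -> float:
    """Approximate two-sample KS rejection threshold for large samples."""
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)   # about 1.358 for alpha = 0.05
    return c_alpha * math.sqrt((n + m) / (n * m))

# Hypothetical sample sizes; replace with the real non-NaN counts.
n_train, n_test = 1000, 1000
threshold = ks_critical_value(n_train, n_test)
print(f"reject equality when the KS statistic exceeds {threshold:.4f}")
```

Under these assumed sizes the threshold is roughly 0.061, so a statistic near 0.017 falls well below it and one near 0.369 falls far above it, which is consistent with the two p-values reported above.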

### Conclusion 

Covariate shift refers to a situation where the distribution of the input features (covariates) in the test data differs from their distribution in the training data, even though the relationship between the features and the target variable (NO2(GT) in our case) remains the same. The KS test used here compares only the marginal distribution of `NO2(GT)` across the datasets.

test_1 shows no covariate shift, meaning its `NO2(GT)` values are distributed like those the model saw during training.

test_2 exhibits covariate shift, meaning the distribution of NO2(GT) in test_2 is quite different from what was seen in the training data, which could impact model performance.

Hence, based on the results of the Kolmogorov-Smirnov (KS) test, test_1 is the more suitable set for evaluating our model, since it does not show covariate shift relative to the training dataset.