# Hypothesis Testing
The purpose of hypothesis testing is to decide whether the data provide enough evidence to reject the null hypothesis (there is no difference, no effect). If the null hypothesis is rejected, the evidence supports the research (alternative) hypothesis. If it is not rejected, we have failed to find evidence for the research hypothesis, which is not the same as proving the null hypothesis true.  
In our particular case, we use hypothesis testing to find out whether the quality of the dataset has been compromised during downsampling and feature extraction.

In [29]:
# Import the necessary libraries
import pandas as pd
from scipy import stats
from statsmodels.stats.weightstats import ztest

In [30]:
#importing dataset from csv file into a dataframe
reduced_data = pd.read_csv(r"D:\AISSMS IOIT - AI&DS (628299510)\General\Hackathons\Prasunethon\2. FeatureExtraction-Dataset.csv")

In [31]:
reduced_data.head()

Unnamed: 0,year_month,average_temperature_median,maximum_temperature_median,minimum_temperature_median,precipitation_lag_median,snow_depth_lag_median,wind_speed_lag_median,maximum_sustained_wind_speed_lag_median,wind_gust_lag_median,dew_point_lag_median,fog_lag_mean,thunder_lag_mean,lat_lag_median,lon_lag_median
0,2013-01,55.614471,64.960743,44.885953,0.011162,999.9,7.991097,13.838489,956.09265,41.105964,0,0,32.669667,-6.103159
1,2013-02,54.778911,66.249974,42.401767,1.201047,999.899998,7.141677,20.923666,920.017515,52.711859,0,0,32.641867,-6.160246
2,2013-03,59.054068,68.776137,48.846444,1.082615,999.900009,9.067693,16.035876,822.054271,46.756065,0,0,32.383879,-6.408526
3,2013-04,65.017165,75.533435,53.436463,0.502933,999.900011,7.43501,15.745283,892.987485,49.527429,1,0,32.52487,-6.464255
4,2013-05,65.352257,76.215955,53.52306,0.252309,999.900019,8.49357,17.927379,865.032257,59.992219,0,0,32.973982,-5.748396


In [32]:
#importing dataset from parquet file into a dataframe
actual_data = pd.read_parquet(r"D:\AISSMS IOIT - AI&DS (628299510)\General\Hackathons\Prasunethon\dataset.parquet")

In [33]:
actual_data.head()

Unnamed: 0,acq_date,latitude,longitude,is_holiday,day_of_week,day_of_year,is_weekend,NDVI,SoilMoisture,sea_distance,...,wind_gust_quarterly_mean,dew_point_quarterly_mean,average_temperature_yearly_mean,maximum_temperature_yearly_mean,minimum_temperature_yearly_mean,precipitation_yearly_mean,snow_depth_yearly_mean,wind_gust_yearly_mean,dew_point_yearly_mean,is_fire
0,2015-05-28,31.390602,-4.254445,0.0,3.0,148.0,0.0,1139.0,7.0,464731.9375,...,882.085571,27.641111,71.703011,82.031784,58.668766,2.21326,999.900024,857.071777,32.760273,1.0
1,2017-12-05,33.832943,-5.188356,0.0,1.0,339.0,0.0,3223.0,31.0,186799.984375,...,936.817383,52.452175,65.621719,80.128555,53.295628,2.256421,999.900024,864.04834,47.819126,1.0
2,2021-11-19,35.385689,-5.684218,0.0,4.0,323.0,0.0,4987.0,30.0,44937.300781,...,884.289124,65.556519,66.108742,74.353828,58.815575,0.050738,999.900024,806.456848,56.298634,1.0
3,2014-04-19,30.122351,-7.498038,0.0,5.0,109.0,1.0,991.0,12.5,231336.125,...,839.094421,21.176111,69.008766,82.65918,54.666027,0.005205,999.900024,771.132629,26.421097,0.0
4,2014-04-11,30.221554,-9.154314,0.0,4.0,101.0,0.0,2171.0,18.0,51333.945312,...,945.704468,46.40889,66.195618,79.167122,54.768494,0.032082,999.900024,951.709595,52.842464,1.0


In [34]:
print("Reduced Data Shape: ", reduced_data.shape)
print("Actual Data Shape: ", actual_data.shape)

Reduced Data Shape:  (120, 14)
Actual Data Shape:  (934586, 278)


In [35]:
print("Comparing Reduced Dataset and Actual Dataset")
reduced_data.info()
actual_data.info()

Comparing Reduced Dataset and Actual Dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 14 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   year_month                               120 non-null    object 
 1   average_temperature_median               120 non-null    float64
 2   maximum_temperature_median               120 non-null    float64
 3   minimum_temperature_median               120 non-null    float64
 4   precipitation_lag_median                 120 non-null    float64
 5   snow_depth_lag_median                    120 non-null    float64
 6   wind_speed_lag_median                    120 non-null    float64
 7   maximum_sustained_wind_speed_lag_median  120 non-null    float64
 8   wind_gust_lag_median                     120 non-null    float64
 9   dew_point_lag_median                     120 non-null    float64
 10  fog_l

In [36]:
print("Reduced Data Describe:")
reduced_data.describe().T

Reduced Data Describe:


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
average_temperature_median,120.0,67.094705,9.761947,50.546622,58.208873,66.021485,75.632634,84.000238
maximum_temperature_median,120.0,78.437687,10.609209,60.599082,68.434892,77.619218,86.800631,112.397668
minimum_temperature_median,120.0,58.836999,14.636721,39.22485,47.180134,56.527563,66.380742,112.374741
precipitation_lag_median,120.0,1.033148,0.947844,0.003546,0.409405,0.742399,1.430888,5.215956
snow_depth_lag_median,120.0,999.859217,0.274642,997.567419,999.900005,999.900009,999.900013,999.90003
wind_speed_lag_median,120.0,8.031676,5.476815,5.061358,6.543611,7.146042,7.84165,52.233926
maximum_sustained_wind_speed_lag_median,120.0,18.308724,7.643256,11.399268,14.766496,16.16067,18.363745,63.151377
wind_gust_lag_median,120.0,867.825837,60.353581,640.94936,822.901827,877.415119,915.797053,981.604077
dew_point_lag_median,120.0,75.879491,117.260965,36.212799,45.35386,52.410611,56.991478,953.012955
fog_lag_mean,120.0,0.083333,0.277544,0.0,0.0,0.0,0.0,1.0


In [37]:
print("Actual Data Describe:")
actual_data.describe()

Actual Data Describe:


Unnamed: 0,acq_date,latitude,longitude,is_holiday,day_of_week,day_of_year,is_weekend,NDVI,SoilMoisture,sea_distance,...,wind_gust_quarterly_mean,dew_point_quarterly_mean,average_temperature_yearly_mean,maximum_temperature_yearly_mean,minimum_temperature_yearly_mean,precipitation_yearly_mean,snow_depth_yearly_mean,wind_gust_yearly_mean,dew_point_yearly_mean,is_fire
count,934586,934586.0,934586.0,934586.0,934586.0,934586.0,934586.0,934586.0,934586.0,934586.0,...,934586.0,934586.0,934586.0,934586.0,934586.0,934586.0,934586.0,934586.0,934586.0,934586.0
mean,2018-02-11 14:49:51.211081472,32.79335,-6.055579,0.062526,2.976634,194.607834,0.28215,2639.589844,20.082907,140767.546875,...,872.539551,67.082825,66.840477,77.929955,59.79364,1.218641,999.864441,879.436035,85.134514,0.5
min,2013-01-31 00:00:00,27.074949,-12.044235,0.0,0.0,1.0,0.0,-765.0,0.0,0.0,...,13.169967,17.941111,51.708469,63.495628,41.978142,0.0,997.16272,55.930412,26.421097,0.0
25%,2015-09-16 00:00:00,31.421812,-7.749783,0.0,1.0,121.0,0.0,1308.0,11.0,49483.511719,...,827.22998,42.710869,64.644806,74.977325,52.58548,0.05209,999.900024,848.476135,47.557259,0.0
50%,2018-03-10 00:00:00,33.368835,-5.626343,0.0,3.0,206.0,0.0,2240.0,17.5,79821.320312,...,924.950562,50.806595,66.522057,77.053696,55.05452,0.554986,999.900024,903.814453,50.807671,0.5
75%,2020-08-18 00:00:00,34.670555,-4.479708,0.0,5.0,268.0,1.0,3625.0,27.0,210482.03125,...,999.900024,55.703297,69.07486,80.613937,58.807652,1.667041,999.900024,989.327332,54.635067,1.0
max,2022-12-23 00:00:00,35.915291,-1.656559,1.0,6.0,366.0,1.0,8432.0,99.0,553629.9375,...,999.900024,9890.538086,78.635201,133.978638,213.562561,20.853014,999.900024,999.900024,9536.286133,1.0
std,,2.302324,2.398227,0.242109,2.011039,94.253609,0.449208,1554.052734,12.918498,124798.367188,...,169.851944,277.602966,2.979686,4.958928,15.354151,1.811131,6.784895,153.50563,277.867645,0.5


### Comparing Means

In [38]:
reduced_data.select_dtypes(include = "number").mean()

average_temperature_median                  67.094705
maximum_temperature_median                  78.437687
minimum_temperature_median                  58.836999
precipitation_lag_median                     1.033148
snow_depth_lag_median                      999.859217
wind_speed_lag_median                        8.031676
maximum_sustained_wind_speed_lag_median     18.308724
wind_gust_lag_median                       867.825837
dew_point_lag_median                        75.879491
fog_lag_mean                                 0.083333
thunder_lag_mean                             0.066667
lat_lag_median                              32.665196
lon_lag_median                              -6.187990
dtype: float64

In [39]:
actual_data[["average_temperature_lag_1", "maximum_temperature_lag_1", "minimum_temperature_lag_1",
             "precipitation_lag_1", "snow_depth_lag_1", "wind_speed_lag_1",
             "maximum_sustained_wind_speed_lag_1", "wind_gust_lag_1", "dew_point_lag_1",
             "fog_lag_1", "thunder_lag_1", "lat_lag_1", "lon_lag_1"]].mean()

average_temperature_lag_1              70.183754
maximum_temperature_lag_1              81.901993
minimum_temperature_lag_1              61.261856
precipitation_lag_1                     0.901875
snow_depth_lag_1                      999.872681
wind_speed_lag_1                        7.229457
maximum_sustained_wind_speed_lag_1     17.946209
wind_gust_lag_1                       872.638428
dew_point_lag_1                       249.158188
fog_lag_1                               0.046173
thunder_lag_1                           0.035310
lat_lag_1                              32.958057
lon_lag_1                              -6.102798
dtype: float32

### Comparing Standard Deviations

In [40]:
reduced_data.select_dtypes(include = "number").std()

average_temperature_median                   9.761947
maximum_temperature_median                  10.609209
minimum_temperature_median                  14.636721
precipitation_lag_median                     0.947844
snow_depth_lag_median                        0.274642
wind_speed_lag_median                        5.476815
maximum_sustained_wind_speed_lag_median      7.643256
wind_gust_lag_median                        60.353581
dew_point_lag_median                       117.260965
fog_lag_mean                                 0.277544
thunder_lag_mean                             0.250490
lat_lag_median                               0.380048
lon_lag_median                               0.457120
dtype: float64

In [41]:
actual_data[["average_temperature_lag_1", "maximum_temperature_lag_1", "minimum_temperature_lag_1",
             "precipitation_lag_1", "snow_depth_lag_1", "wind_speed_lag_1",
             "maximum_sustained_wind_speed_lag_1", "wind_gust_lag_1", "dew_point_lag_1",
             "fog_lag_1", "thunder_lag_1", "lat_lag_1", "lon_lag_1"]].std()

average_temperature_lag_1               11.863132
maximum_temperature_lag_1               41.870552
minimum_temperature_lag_1              192.461395
precipitation_lag_1                      9.337241
snow_depth_lag_1                         8.472099
wind_speed_lag_1                        18.894102
maximum_sustained_wind_speed_lag_1      69.490189
wind_gust_lag_1                        328.882477
dew_point_lag_1                       1394.581177
fog_lag_1                                0.210411
thunder_lag_1                            0.185292
lat_lag_1                                2.155823
lon_lag_1                                2.537551
dtype: float32

### Comparing Variances

In [42]:
reduced_data.select_dtypes(include = "number").var()

average_temperature_median                    95.295601
maximum_temperature_median                   112.555312
minimum_temperature_median                   214.233611
precipitation_lag_median                       0.898409
snow_depth_lag_median                          0.075428
wind_speed_lag_median                         29.995503
maximum_sustained_wind_speed_lag_median       58.419355
wind_gust_lag_median                        3642.554788
dew_point_lag_median                       13750.133799
fog_lag_mean                                   0.077031
thunder_lag_mean                               0.062745
lat_lag_median                                 0.144436
lon_lag_median                                 0.208958
dtype: float64

In [43]:
actual_data[["average_temperature_lag_1", "maximum_temperature_lag_1", "minimum_temperature_lag_1",
             "precipitation_lag_1", "snow_depth_lag_1", "wind_speed_lag_1",
             "maximum_sustained_wind_speed_lag_1", "wind_gust_lag_1", "dew_point_lag_1",
             "fog_lag_1", "thunder_lag_1", "lat_lag_1", "lon_lag_1"]].var()

average_temperature_lag_1             1.407339e+02
maximum_temperature_lag_1             1.753143e+03
minimum_temperature_lag_1             3.704139e+04
precipitation_lag_1                   8.718407e+01
snow_depth_lag_1                      7.177646e+01
wind_speed_lag_1                      3.569871e+02
maximum_sustained_wind_speed_lag_1    4.828886e+03
wind_gust_lag_1                       1.081637e+05
dew_point_lag_1                       1.944856e+06
fog_lag_1                             4.427278e-02
thunder_lag_1                         3.433326e-02
lat_lag_1                             4.647573e+00
lon_lag_1                             6.439165e+00
dtype: float32
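The three comparisons above are easier to read side by side. The following is a minimal sketch of a helper that does this; `compare_statistic`, `column_map` and the toy frames are hypothetical names introduced here for illustration, not part of the original notebook. It assumes each reduced column is paired with one original column, as the cells above pair `lat_lag_median` with `lat_lag_1`.

```python
import pandas as pd


def compare_statistic(reduced, actual, column_map, stat="mean"):
    """Tabulate one summary statistic for paired columns of two DataFrames.

    column_map maps a column name in `reduced` to its counterpart in `actual`.
    stat is any aggregation name pandas understands, e.g. "mean", "std", "var".
    """
    rows = []
    for reduced_col, actual_col in column_map.items():
        reduced_value = reduced[reduced_col].agg(stat)
        actual_value = actual[actual_col].agg(stat)
        rows.append({
            "feature": reduced_col,
            "reduced": reduced_value,
            "actual": actual_value,
            # Ratio near 1 means the statistic is similar in both datasets.
            "ratio": reduced_value / actual_value,
        })
    return pd.DataFrame(rows).set_index("feature")


# Tiny made-up frames standing in for reduced_data and actual_data.
toy_reduced = pd.DataFrame({"lat_lag_median": [32.0, 33.0, 34.0]})
toy_actual = pd.DataFrame({"lat_lag_1": [30.0, 32.0, 34.0, 36.0]})
toy_map = {"lat_lag_median": "lat_lag_1"}

means = compare_statistic(toy_reduced, toy_actual, toy_map, stat="mean")
print(means)
```

In the notebook itself, the same call could be made with `reduced_data`, `actual_data` and a mapping that covers all thirteen feature pairs, once for each of `"mean"`, `"std"` and `"var"`.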

# Performing Z-Test
We begin by collecting 50 samples from each dataset and then perform the z-test on them.
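The cells shown in this notebook pass the full columns to `ztest`, so the sampling step is not visible. One way it could look is sketched below. The helper name `sampled_ztest`, the seed and the synthetic series are assumptions made for illustration; only `Series.sample` and `ztest` are real library calls.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ztest


def sampled_ztest(reduced_col, actual_col, n=50, seed=0):
    """Draw n values from each Series (without replacement) and run a two-sample z-test."""
    reduced_sample = reduced_col.dropna().sample(n=n, random_state=seed)
    actual_sample = actual_col.dropna().sample(n=n, random_state=seed)
    # ztest returns a (z statistic, p-value) tuple.
    return ztest(reduced_sample, actual_sample, alternative="two-sided")


# Synthetic stand-ins for reduced_data["lat_lag_median"] and actual_data["lat_lag_1"].
rng = np.random.default_rng(42)
toy_reduced = pd.Series(rng.normal(32.7, 0.4, size=120))
toy_actual = pd.Series(rng.normal(32.7, 0.4, size=5000))

z_stat, p_val = sampled_ztest(toy_reduced, toy_actual)
print("z:", z_stat, "p:", p_val)
```

Fixing `random_state` keeps the draw reproducible, which matters here because a different sample of 50 rows can move the p-value noticeably.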

## Null and Alternate Hypothesis

In [44]:
null_hypothesis = "The reduced dataset has not compromised data quality."
alternate_hypothesis = "The reduced dataset has compromised data quality."

## Significance Level
We'll set the significance level (α) to 0.05, indicating a 5% chance of rejecting the null hypothesis when it's actually true.

In [45]:
alpha = 0.05
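To make the meaning of α concrete, the simulation below repeatedly tests two samples drawn from the same distribution, so the null hypothesis is true by construction, and counts how often it is rejected anyway. This is a rough sketch using only the standard library; `two_sample_z_pvalue` and `false_positive_rate` are hypothetical helpers written for this illustration, and the hand-rolled z-test is a stand-in for `ztest`, not the statsmodels implementation.

```python
import math
import random
import statistics


def two_sample_z_pvalue(a, b):
    """Two-sided p-value for a difference in means, using the normal approximation."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    z = (statistics.fmean(a) - statistics.fmean(b)) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(alpha=0.05, trials=2000, n=50, seed=0):
    """Fraction of trials that reject the null even though both samples share one distribution."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sample_z_pvalue(a, b) <= alpha:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print("Observed false-positive rate:", rate)
```

The observed rate should land near 0.05, which is the Type I error rate that choosing α = 0.05 accepts.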

In [49]:
z_statistic, p_value_ztest = ztest(reduced_data["lat_lag_median"], actual_data["lat_lag_1"], alternative='two-sided')

decision_ztest = "Reject" if p_value_ztest <= alpha else "Fail to reject"

if decision_ztest == "Reject":
    conclusion_ztest = "Reduced Data set is not an accurate representation of the original dataset."
else:
    conclusion_ztest = "Reduced Data set is an accurate representation of the original dataset."

In [50]:
print("\nZ-statistic (Z-test):", z_statistic)
print("P-value (Z-test):", p_value_ztest)
print("Decision (Z-test):", decision_ztest)
print("Conclusion(Z-test):", conclusion_ztest)


Z-statistic (Z-test): -1.4875612298642822
P-value (Z-test): 0.13686664433286086
Decision (Z-test): Fail to reject
Conclusion(Z-test): Reduced Data set is an accurate representation of the original dataset.
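As a sanity check, the reported z-statistic can be approximately reproduced by hand from the summary statistics printed earlier. The working below assumes the default `usevar='pooled'` setting of `ztest`, and it uses the rounded means and variances shown in the outputs above, so the last digits differ slightly.

$$
z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

With $\bar{x}_1 = 32.665196$, $s_1^2 = 0.144436$, $n_1 = 120$ for `lat_lag_median` and $\bar{x}_2 = 32.958057$, $s_2^2 = 4.647573$, $n_2 = 934586$ for `lat_lag_1`:

$$
s_p^2 \approx 4.6470,
\qquad
z \approx \frac{32.665196 - 32.958057}{\sqrt{4.6470 \times 0.0083344}} \approx \frac{-0.2929}{0.1968} \approx -1.49
$$

The pooled variance is dominated by the much larger original dataset, while the standard error is dominated by the small reduced sample ($1/n_1$), which is why a mean difference of about 0.29 degrees of latitude is not significant here.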


# Conclusion
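Comparing the mean, standard deviation and variance, we can see that the means of the reduced data do not deviate greatly from those of the original data for most features (dew point is the main exception), while the spread of the reduced data is smaller for most features, which is consistent with aggregating the data into monthly medians.  
Moreover, the z-test on the latitude feature fails to reject the null hypothesis at α = 0.05 (p ≈ 0.137), so we find no statistically significant evidence that the reduced dataset misrepresents the original dataset.  
Hence, we have decided to move forward with the reduced dataset for model building.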
  
Thank You!!!