In [1]:
#import necessary packages
import pandas as pd
from field_data_processor import FieldDataProcessor
from weather_data_processor import WeatherDataProcessor
import logging 
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

In [2]:
#import the configuration parameters

from data_ingestion import config_params

In [3]:
#use the .process method from each module to transform ingested data

field_processor = FieldDataProcessor(config_params)
field_processor.process()
field_df = field_processor.df

weather_processor = WeatherDataProcessor(config_params)
weather_processor.process()
weather_df = weather_processor.weather_df

2024-02-28 17:37:39,588 - data_ingestion - INFO - Database engine created successfully.
2024-02-28 17:37:39,688 - data_ingestion - INFO - Query executed successfully.
2024-02-28 17:37:39,688 - field_data_processor.FieldDataProcessor - INFO - Sucessfully loaded data.
2024-02-28 17:37:39,688 - field_data_processor.FieldDataProcessor - INFO - Swapped columns: Annual_yield with Crop_type
2024-02-28 17:37:40,678 - data_ingestion - INFO - CSV file read successfully from the web.
2024-02-28 17:37:41,768 - data_ingestion - INFO - CSV file read successfully from the web.
2024-02-28 17:37:41,768 - weather_data_processor.WeatherDataProcessor - INFO - Successfully loaded weather station data from the web.
2024-02-28 17:37:41,788 - weather_data_processor.WeatherDataProcessor - INFO - Messages processed and measurements extracted.
2024-02-28 17:37:41,788 - weather_data_processor.WeatherDataProcessor - INFO - Data processing completed.


#### Here's the plan
1. Create a null hypothesis.
2. Import the field dataset and clean it up.
3. Import the weather data.
4. Map the weather data to the field data.
5. Calculate the means of the weather station dataset and the means of the main dataset.
6. Calculate all the parameters we need to do a t-test.
7. Interpret our results.


In [4]:
# Rename 'Ave_temps' in field_df to 'Temperature' to match weather_df
field_df.rename(columns={'Ave_temps': 'Temperature'}, inplace=True)

# Validating the dataset

#### Hypothesis

So what are we testing with our null hypothesis $H_0$? Well, we want to know if our field data is representing the reality in Maji Ndogo by looking at an independent set of data. If our field data (means) are the same as the weather data (means), then it indicates no significant difference between the datasets. We're essentially saying that any difference we see between these means is because of randomness. However, if the means differ significantly, we'll know there is a reason for it, and that it is not just a random fluctuation in the data. 

<br>

Given a significance level $\alpha$ of 0.05 for a two-tailed test, we have the following conditions for our hypothesis test at the 95% confidence level:

- $H_0$: There is no significant difference between the means of the two datasets. This is expressed as $\mu_{field} = \mu_{weather}$.

- $H_a$: There is a significant difference between the means of the two datasets. This is expressed as $\mu_{field} \neq \mu_{weather}$.

<br>

If the p-value obtained from the test:
- is less than or equal to the significance level, so $p \leq \alpha$, we reject the null hypothesis.
- is larger than the significance level, so $p > \alpha$, we cannot reject the null hypothesis, as we cannot find a statistically significant difference between the datasets at the 95% confidence level.

First, we're going to import the packages we need and define a few variables. You might notice we're importing a new function, ttest_ind(). It takes in two data columns, calculates their means and variances, and returns the t-statistic and the p-value. So our t-test is reduced to one line. Since our alternative hypothesis does not make a claim of greater or less than, we will use the two-sided t-test. This is the default in ttest_ind(), so we don't need to pass alternative='two-sided' explicitly.
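
To see the call in isolation, here is a minimal sketch on synthetic data. The samples, the seed, and the variable names are made up for illustration and are not part of the Maji Ndogo datasets. It also checks that alternative='two-sided' is scipy's default, so passing it explicitly does not change the result.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples, invented purely for illustration (not project data)
rng = np.random.default_rng(seed=42)
sample_a = rng.normal(loc=13.0, scale=2.0, size=200)
sample_b = rng.normal(loc=13.0, scale=2.5, size=150)

# equal_var=False selects Welch's t-test, which does not assume equal variances
t_default, p_default = ttest_ind(sample_a, sample_b, equal_var=False)

# Passing alternative='two-sided' explicitly gives the same result as the default
t_explicit, p_explicit = ttest_ind(
    sample_a, sample_b, equal_var=False, alternative='two-sided'
)

print(f"t = {t_default:.3f}, p = {p_default:.3f}")
```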

In [5]:
from scipy.stats import ttest_ind
import numpy as np

# Now, the measurements_to_compare can directly use 'Temperature', 'Rainfall', and 'Pollution_level'
measurements_to_compare = ['Temperature', 'Rainfall', 'Pollution_level']

We want to compare the means of the temperature, rainfall, and pollution data for the fields assigned to a specific weather station. So for both datasets, we need to isolate the measurement type and the weather station, so that we're comparing the correct means.

Let's break down what we need to do:

- Filter both field_df and weather_df by the given station ID and measurement, using filter_field_data(df, station_id, measurement) and filter_weather_data(df, station_id, measurement).

- Run the t-test on the filtered data, using ttest_ind(data_col1, data_col2, equal_var=False) from scipy.stats. Setting equal_var=False runs Welch's t-test, which does not assume the two samples have equal variances.

- Interpret and print the results of the t-test, using print_ttest_results(station_id, measurement, p_val, alpha).

We'll first define these functions, focusing on Temperature for station ID = 0. Then, we'll integrate these functions into a loop that iterates over each station ID and measurement type.

I'll create a filter_field_data function that takes in the field_df DataFrame, the station_id, and the measurement type, and returns a single column (Series) of data filtered by station_id and measurement.

In [6]:
### START FUNCTION
def filter_field_data(df, station_id, measurement):
    # Check that the measurement is one we intend to compare
    if measurement not in measurements_to_compare:
        raise ValueError(f"Invalid measurement. Supported columns: {measurements_to_compare}")

    # Check that the column actually exists in the DataFrame
    if measurement not in df.columns:
        raise ValueError(f"Column {measurement} does not exist in the DataFrame.")

    # Keep rows for this station where the measurement is not missing
    filtered_data = df[(df['Weather_station'] == station_id) & (df[measurement].notna())]

    # Return the single column (Series) of filtered data
    return filtered_data[measurement]
    
### END FUNCTION
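
As a quick check of the filtering logic, here is a minimal sketch on a toy DataFrame. The values and the name toy_field_df are invented for illustration; only the column names follow field_df. It applies the same mask idea directly (match the station, drop missing values) rather than calling the function, so it runs on its own.

```python
import pandas as pd

# Toy field data, invented for illustration; column names follow field_df
toy_field_df = pd.DataFrame({
    'Weather_station': [0, 0, 1, 0],
    'Temperature': [12.5, None, 14.0, 13.1],
})

# Same idea as filter_field_data: match the station and drop missing values
mask = (toy_field_df['Weather_station'] == 0) & toy_field_df['Temperature'].notna()
station_0_temps = toy_field_df.loc[mask, 'Temperature']

print(station_0_temps.tolist())  # [12.5, 13.1]
```

The row with the missing temperature and the row belonging to station 1 are both dropped, leaving a single Series that can be passed to the t-test.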

Create a data filter function that takes in the weather_df DataFrame, the station_id, and measurement type, and returns a single column (series) of data filtered by the station_id, and measurement.

In [7]:
### START FUNCTION

def filter_weather_data(df, station_id, measurement):
    # Check if measurement is in the valid measurements
    if measurement not in measurements_to_compare:
        raise ValueError(f"Invalid measurement type. Supported types: {measurements_to_compare}")

    # Filter data based on station_id and measurement_type
    filtered_data = df[(df['Weather_station_ID'] == station_id) & (df['Measurement'] == measurement)]['Value']

    # Return the single column (Series) of filtered data
    return filtered_data
### END FUNCTION
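
Unlike the field data, the weather data is in long format: one row per reading, with the measurement type in one column and the reading in another. The sketch below shows the same filter on a toy DataFrame. The values and the name toy_weather_df are invented for illustration; only the column names follow weather_df.

```python
import pandas as pd

# Toy weather data in long format, invented for illustration
toy_weather_df = pd.DataFrame({
    'Weather_station_ID': [0, 0, 0, 1],
    'Measurement': ['Temperature', 'Rainfall', 'Temperature', 'Temperature'],
    'Value': [12.8, 1100.0, 13.4, 15.2],
})

# Same idea as filter_weather_data: match both the station and the measurement type
mask = (
    (toy_weather_df['Weather_station_ID'] == 0)
    & (toy_weather_df['Measurement'] == 'Temperature')
)
station_0_temps = toy_weather_df.loc[mask, 'Value']

print(station_0_temps.tolist())  # [12.8, 13.4]
```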

I'll create a function that calculates the t-statistic and p-value. The function should accept two single columns of data and return a tuple of the t-statistic and p-value.

In [8]:
### START FUNCTION

def run_ttest(Column_A, Column_B):
    t_statistic, p_value = ttest_ind(Column_A, Column_B, equal_var=False)
    return t_statistic, p_value    
    
### END FUNCTION
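
To make it clearer what ttest_ind does with equal_var=False, the sketch below computes Welch's t-statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with scipy's result. The samples are synthetic and the variable names are chosen just for this illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples with different sizes and variances (illustration only)
rng = np.random.default_rng(seed=0)
a = rng.normal(loc=10.0, scale=1.0, size=40)
b = rng.normal(loc=10.5, scale=2.0, size=60)

# Sample variance (ddof=1) of each group divided by its sample size
var_a = a.var(ddof=1) / a.size
var_b = b.var(ddof=1) / b.size

# Welch's t-statistic: difference in means over the combined standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (var_a + var_b) ** 2 / (var_a ** 2 / (a.size - 1) + var_b ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# scipy's Welch t-test should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```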

I'll create a function to print out the t-test result.

In [9]:
### START FUNCTION

def print_ttest_results(station_id, measurement, p_val, alpha):
    # Reject H0 when p <= alpha, matching the decision rule stated in the hypothesis section
    if p_val <= alpha:
        print(f"   Significant difference in {measurement} detected at Station  {station_id}, (P-Value: {p_val:.5f} <= {alpha}). Null hypothesis rejected.")
    else:
        print(f"   No significant difference in {measurement} detected at Station  {station_id}, (P-Value: {p_val:.5f} > {alpha}). Null hypothesis not rejected.")

### END FUNCTION

I'll create a function that loops over measurements_to_compare and every station_id, performs a t-test, and prints the results. The function should accept field_df, weather_df, list_measurements_to_compare, and alpha, with alpha defaulting to 0.05.

In [10]:
### START FUNCTION
def hypothesis_results(field_df, weather_df, list_measurements_to_compare, alpha=0.05):
    for station_id in sorted(field_df['Weather_station'].unique()):
        for measurement in list_measurements_to_compare:
            
            # Filter data for the specific station and measurement
            field_values = filter_field_data(field_df, station_id, measurement)
            weather_values = filter_weather_data(weather_df, station_id, measurement)

            # Perform t-test
            t_stat, p_val = run_ttest(field_values, weather_values)

            # Print t-test results
            print_ttest_results(station_id, measurement, p_val, alpha)
            
            
### END FUNCTION
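
Printing is fine for a quick look, but it can be handy to keep the results for sorting or filtering later. The sketch below is one possible variant, not part of the original workflow: collect_ttest_results is a hypothetical helper that runs the same loop and returns a DataFrame. The toy data and its values are invented so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)

# Toy stand-ins for field_df (wide) and weather_df (long); values are invented
toy_field_df = pd.DataFrame({
    'Weather_station': np.repeat([0, 1], 50),
    'Temperature': rng.normal(13.0, 1.5, 100),
    'Rainfall': rng.normal(1200.0, 200.0, 100),
})
toy_weather_df = pd.DataFrame({
    'Weather_station_ID': np.tile(np.repeat([0, 1], 30), 2),
    'Measurement': np.repeat(['Temperature', 'Rainfall'], 60),
    'Value': np.concatenate([rng.normal(13.0, 1.5, 60), rng.normal(1200.0, 200.0, 60)]),
})

def collect_ttest_results(field_df, weather_df, measurements, alpha=0.05):
    """Hypothetical helper: same loop as above, but returns one row per test."""
    rows = []
    for station_id in sorted(field_df['Weather_station'].unique()):
        for measurement in measurements:
            # Field data is wide: pick the measurement column for this station
            field_mask = field_df['Weather_station'] == station_id
            field_values = field_df.loc[field_mask, measurement].dropna()

            # Weather data is long: match the station and the measurement type
            weather_mask = (
                (weather_df['Weather_station_ID'] == station_id)
                & (weather_df['Measurement'] == measurement)
            )
            weather_values = weather_df.loc[weather_mask, 'Value']

            # Welch's t-test, as in run_ttest
            t_stat, p_val = ttest_ind(field_values, weather_values, equal_var=False)
            rows.append({
                'station_id': station_id,
                'measurement': measurement,
                't_stat': t_stat,
                'p_value': p_val,
                'reject_h0': p_val <= alpha,
            })
    return pd.DataFrame(rows)

results_df = collect_ttest_results(toy_field_df, toy_weather_df, ['Temperature', 'Rainfall'])
print(results_df)
```

With the results in a DataFrame, something like results_df[results_df['reject_h0']] would list only the station and measurement pairs where the null hypothesis was rejected.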

In [11]:
alpha = 0.05
hypothesis_results(field_df, weather_df, measurements_to_compare, alpha)

   No significant difference in Temperature detected at Station  0, (P-Value: 0.90761 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station  0, (P-Value: 0.21621 > 0.05). Null hypothesis not rejected.
   No significant difference in Pollution_level detected at Station  0, (P-Value: 0.56418 > 0.05). Null hypothesis not rejected.
   No significant difference in Temperature detected at Station  1, (P-Value: 0.47241 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station  1, (P-Value: 0.54499 > 0.05). Null hypothesis not rejected.
   No significant difference in Pollution_level detected at Station  1, (P-Value: 0.24410 > 0.05). Null hypothesis not rejected.
   No significant difference in Temperature detected at Station  2, (P-Value: 0.88671 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station  2, (P-Value: 0.36466 > 0.05). Null hypothesis not rejected.
 