### ANOVA test

A one-way ANOVA tests whether the mean number of dengue cases differs significantly between regions. The result is useful when building prediction models: if the regional means differ, a single pooled model may fit poorly, and different regions may need different models to predict the number of dengue cases accurately.
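To make the test concrete, here is a minimal sketch of what a one-way ANOVA computes, written in plain Python on a small made-up dataset. The function name `one_way_anova_f` and the toy numbers are illustrative assumptions and are not part of the analysis in this notebook.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group -> list of values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical case counts for three made-up regions
toy = {"A": [1, 2, 3], "B": [2, 3, 4], "C": [5, 6, 7]}
f_stat, df_b, df_w = one_way_anova_f(toy)
print(f_stat, df_b, df_w)
```

A large F means the group means are spread out relative to the scatter inside each group, which is the evidence ANOVA uses against the hypothesis of equal means.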

For example, a significant result may indicate that different underlying factors drive dengue transmission in each region. Building a separate prediction model for each region can account for these differences and may improve prediction accuracy.
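As a rough illustration of the per-region idea, the sketch below fits one ordinary least squares model per region with `groupby`. The helper name `fit_region_models`, the choice of a linear model, and the predictors `mean_temp` and `total_daily_rainfall` are assumptions made for the example; the data is synthetic so the block runs on its own.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

def fit_region_models(df, formula="no_cases ~ mean_temp + total_daily_rainfall"):
    """Fit one OLS model per region and return them in a dict keyed by region."""
    return {
        region: ols(formula, data=group).fit()
        for region, group in df.groupby("region")
    }

# Synthetic stand-in data (not the real dataset)
rng = np.random.default_rng(0)
n = 50
demo = pd.DataFrame({
    "region": np.repeat(["West", "East"], n),
    "mean_temp": rng.normal(28, 1, 2 * n),
    "total_daily_rainfall": rng.gamma(2.0, 50.0, 2 * n),
})
# Give each made-up region a different temperature effect
slope = np.where(demo["region"] == "West", 10.0, 30.0)
demo["no_cases"] = 50 + slope * (demo["mean_temp"] - 28) + rng.normal(0, 5, 2 * n)

models = fit_region_models(demo)
for region, fitted in models.items():
    print(region, round(fitted.params["mean_temp"], 1))
```

Each fitted model recovers its own region's temperature effect, which a single pooled model would average away.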

In summary, the ANOVA test indicates whether mean dengue case counts differ between regions, which informs how the prediction models should be structured.

In [9]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from tabulate import tabulate

In [3]:
# Read each CSV file into a Pandas DataFrame
df_west = pd.read_csv('../data/df_west_merge.csv')
df_northeast = pd.read_csv('../data/df_northeast_merge.csv')
df_north = pd.read_csv('../data/df_north_merge.csv')
df_east = pd.read_csv('../data/df_east_merge.csv')
df_central = pd.read_csv('../data/df_central_merge.csv')


In [7]:
# Combine the DataFrames into a single DataFrame
df = pd.concat([df_west, df_northeast, df_north, df_east, df_central])
df


Unnamed: 0.1,Unnamed: 0,yr,week,region,no_cases,total_daily_rainfall,max_wind_sp,max_temp,rainy_day,mean_temp,...,mosquito,insect_repellent,dengue_fever_diff,dengue_fever_2nd_diff,dengue_diff,dengue_2nd_diff,mosquito_diff,mosquito_2nd_diff,insect_repellent_diff,insect_repellent_2nd_diff
0,0,2013,21,West,90,153.8,53.3,34.3,1,28.3,...,10.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,2013,22,West,93,91.9,72.7,34.7,1,28.4,...,9.0,3.0,-4.0,0.0,-2.0,0.0,-1.0,0.0,1.0,0.0
2,2,2013,23,West,120,562.1,63.4,35.4,1,28.5,...,18.0,4.0,27.0,31.0,24.0,26.0,9.0,10.0,1.0,0.0
3,3,2013,24,West,239,51.5,67.3,34.7,1,30.0,...,14.0,2.0,-14.0,-41.0,-13.0,-37.0,-4.0,-13.0,-2.0,-3.0
4,4,2013,25,West,286,0.0,56.9,35.0,0,30.1,...,8.0,2.0,-17.0,-3.0,-16.0,-3.0,-6.0,-2.0,0.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
387,387,2020,41,Central,193,176.0,58.3,33.3,1,28.2,...,10.0,3.0,1.0,3.0,2.0,4.0,0.0,-1.0,0.0,1.0
388,388,2020,42,Central,193,44.4,53.0,33.8,1,29.0,...,8.0,3.0,-2.0,-3.0,-1.0,-3.0,-2.0,-2.0,0.0,0.0
389,389,2020,43,Central,193,100.6,45.4,33.7,1,28.6,...,10.0,2.0,-1.0,1.0,-2.0,-1.0,2.0,4.0,-1.0,-1.0
390,390,2020,44,Central,193,410.7,40.7,34.6,1,28.4,...,6.0,1.0,-1.0,0.0,-1.0,1.0,-4.0,-6.0,-1.0,0.0


In [6]:
# Perform ANOVA using statsmodels
model = ols('no_cases ~ C(region)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)



In [8]:
# Print the ANOVA table
print(anova_table)

                 sum_sq      df          F        PR(>F)
C(region)  3.576117e+06     4.0  16.478509  2.753627e-13
Residual   1.060671e+08  1955.0        NaN           NaN


In [10]:
# Create a DataFrame from the ANOVA table
anova_df = pd.DataFrame(anova_table)

In [11]:
# Reset the index to create a column for the index values
anova_df.reset_index(inplace=True)

In [12]:
# Rename the columns
anova_df.columns = ['Source', 'Sum of Squares', 'Degrees of Freedom', 'F Value', 'P Value']


In [13]:
# Print the table with lines using tabulate
print(tabulate(anova_df, headers='keys', tablefmt='grid', floatfmt='.3e', showindex=False))

+-----------+------------------+----------------------+-------------+-------------+
| Source    |   Sum of Squares |   Degrees of Freedom |     F Value |     P Value |
+===========+==================+======================+=============+=============+
| C(region) |        3.576e+06 |            4.000e+00 |   1.648e+01 |   2.754e-13 |
+-----------+------------------+----------------------+-------------+-------------+
| Residual  |        1.061e+08 |            1.955e+03 | nan         | nan         |
+-----------+------------------+----------------------+-------------+-------------+


The ANOVA shows a statistically significant difference in the mean number of dengue cases between regions: F(4, 1955) = 16.48, with a p-value of 2.75e-13, far below the 0.05 threshold.
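As a sanity check, the F statistic can be reproduced from the ANOVA table itself: each sum of squares is divided by its degrees of freedom to give a mean square, and F is the ratio of the two mean squares. The sketch below uses only the values printed in the table; the variable names are illustrative.

```python
# Values copied from the printed ANOVA table
ss_region, df_region = 3.576117e06, 4
ss_resid, df_resid = 1.060671e08, 1955

ms_region = ss_region / df_region   # mean square between regions
ms_resid = ss_resid / df_resid      # mean square within regions (residual)
f_value = ms_region / ms_resid

print(f"F({df_region}, {df_resid}) = {f_value:.2f}")  # F(4, 1955) = 16.48
```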

The sum of squares for the region variable is about 3.58 million, which measures how far the regional means lie from the overall mean. The residual sum of squares is about 106 million, which measures how far the weekly case counts lie from their own region's mean. The residual term is much larger: region accounts for only about 3.3% of the total sum of squares (3.58 / (3.58 + 106.07)). Most of the variability in case counts therefore lies within regions and may be due to other factors, such as weather conditions.
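One way to quantify this comparison is eta squared, the share of the total sum of squares attributable to region. It is a standard effect-size measure for one-way ANOVA, though it is not computed in the original notebook; the sketch below derives it from the table values.

```python
# Sums of squares copied from the printed ANOVA table
ss_region = 3.576117e06
ss_resid = 1.060671e08

# Eta squared: fraction of total variation explained by region
eta_squared = ss_region / (ss_region + ss_resid)

print(f"eta squared = {eta_squared:.3f}")  # eta squared = 0.033
```

A value this small is compatible with a highly significant p-value because the sample is large: with nearly 2,000 observations, even modest differences between regional means are detectable.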

Overall, the ANOVA provides evidence that mean dengue case counts differ between regions. It does not identify which regions differ, and further analysis is needed to understand the factors that contribute to the variability within each region.
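A simple first step for that follow-up is a per-region summary of case counts, which shows both where the means differ and how much spread each region has. The helper name `summarize_by_region` is hypothetical, and the small frame below is made up so the block runs on its own; in the notebook the function would be applied to `df`.

```python
import pandas as pd

def summarize_by_region(df):
    """Return count, mean, and standard deviation of no_cases for each region."""
    return df.groupby("region")["no_cases"].agg(["count", "mean", "std"])

# Tiny made-up frame (not the real dataset)
demo = pd.DataFrame({
    "region": ["West", "West", "West", "East", "East", "East"],
    "no_cases": [90, 93, 120, 10, 20, 30],
})
summary = summarize_by_region(demo)
print(summary)
```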