# 📊 Comprehensive Statistical Analysis of Indian Rainfall

This notebook walks through a core statistics curriculum, applied to real-world rainfall data.
We use the **Rainfall in India (1901-2015)** dataset to demonstrate:

1.  **Probability Theory & Random Variables** (Univariate/Bivariate)
2.  **Probability Distributions** (Discrete & Continuous)
3.  **Sampling Distributions** (CLT)
4.  **Hypothesis Testing** (One & Two Populations)
5.  **Analysis of Variance** (ANOVA)
6.  **Chi-Square Testing**

---

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# Settings
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Load Data
# Some copies of this CSV carry an extra filename row above the real header.
# Read normally first; if the expected column is missing, skip that first row.
csv_path = 'rainfall in india 1901-2015.csv'
df = pd.read_csv(csv_path)
if 'ANNUAL' not in df.columns:
    df = pd.read_csv(csv_path, header=1)

# Clean Data: Remove rows with missing Annual rainfall
df_clean = df.dropna(subset=['ANNUAL'])
print(f"Data Loaded. Shape: {df_clean.shape}")
df_clean.head()

## 1️⃣ Univariate & Bivariate Random Variables
### Univariate: Annual Rainfall
A single random variable $X$ representing the total annual rainfall.

In [None]:
# Univariate Analysis
annual_rain = df_clean['ANNUAL']

plt.figure(figsize=(10, 5))
sns.histplot(annual_rain, kde=True, color='skyblue')
plt.title('Univariate Distribution: Annual Rainfall')
plt.xlabel('Rainfall (mm)')
plt.show()

print(f"Mean: {annual_rain.mean():.2f}, Std Dev: {annual_rain.std():.2f}")

### Bivariate: June vs July Rainfall
Two random variables $X$ (June) and $Y$ (July). We calculate their correlation to see if they move together.

In [None]:
# Filter data to ensure no NaNs in June/July for comparison
df_bivar = df_clean.dropna(subset=['JUN', 'JUL'])

plt.figure(figsize=(8, 6))
sns.scatterplot(x='JUN', y='JUL', data=df_bivar, alpha=0.5)
plt.title('Bivariate Analysis: June vs July Rainfall')
plt.xlabel('June Rainfall (mm)')
plt.ylabel('July Rainfall (mm)')
plt.show()

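# Covariance and Pearson correlation
covariance = df_bivar['JUN'].cov(df_bivar['JUL'])
correlation = df_bivar['JUN'].corr(df_bivar['JUL'])
print(f"Covariance (Jun vs Jul): {covariance:.2f}")
print(f"Correlation Coefficient (Jun vs Jul): {correlation:.3f}")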
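
The coefficient printed above is the Pearson correlation, i.e. the covariance rescaled by both standard deviations: $r = \mathrm{cov}(X, Y) / (s_X s_Y)$. Here is a minimal sketch of that calculation in plain Python. The June/July values and the helper names `sample_cov` and `pearson_r` are illustrative assumptions, not part of the dataset or of pandas.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance with the n-1 denominator (the pandas default)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Hypothetical monthly totals in mm, for illustration only
june = [100.0, 150.0, 200.0, 250.0, 300.0]
july = [220.0, 260.0, 330.0, 350.0, 440.0]

print(f"cov = {sample_cov(june, july):.1f}, r = {pearson_r(june, july):.3f}")
```

Because $r$ is unit-free, it stays the same if rainfall is recorded in cm instead of mm, whereas the covariance does not.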

## 2️⃣ Distributions: Continuous & Discrete
### Continuous: Normal Distribution Fit
We assume Annual Rainfall follows a Normal Distribution $\mathcal{N}(\mu, \sigma^2)$.

In [None]:
mu, std = stats.norm.fit(annual_rain)

plt.figure(figsize=(10, 5))
sns.histplot(annual_rain, stat="density", alpha=0.4, label="Observed")

# Plot Theoretical PDF
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)
plt.plot(x, p, 'r', linewidth=2, label="Normal Theoretical")
plt.title('Continuous Distribution Fit')
plt.legend()
plt.show()
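
A note on what the fit computes: `stats.norm.fit` returns maximum-likelihood estimates, namely the sample mean and the standard deviation with an $n$ denominator. That is slightly smaller than the $n-1$ value that pandas' `.std()` reports. The sketch below shows the difference on a toy list (the values and the `normal_pdf` helper are assumptions for illustration only).

```python
import math
import statistics

# Toy values, not rainfall data
toy = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu_mle = statistics.fmean(toy)           # MLE of the mean
sigma_mle = statistics.pstdev(toy)       # n denominator (what an MLE fit uses)
sigma_unbiased = statistics.stdev(toy)   # n-1 denominator (pandas default)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

print(f"mu = {mu_mle:.3f}, sigma (n) = {sigma_mle:.3f}, sigma (n-1) = {sigma_unbiased:.3f}")
print(f"peak density at mu = {normal_pdf(mu_mle, mu_mle, sigma_mle):.4f}")
```

With over a century of records per subdivision the two estimates are nearly identical, so the choice matters mostly for small samples.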

### Discrete: Binomial Distribution
Let's define a **"Flood Year"** as one where rainfall > 2500mm.
We sample $n=10$ random years. What is the probability of exactly $k$ flood years?
This models a **Binomial process** $B(n, p)$.

In [None]:
threshold = 2500
# Calculate probability of success (flood) 'p'
p_flood = (annual_rain > threshold).mean()
n_trials = 10

# Generate Binomial PMF
x_binom = np.arange(0, n_trials + 1)
pmf_binom = stats.binom.pmf(x_binom, n_trials, p_flood)

plt.figure(figsize=(10, 5))
plt.bar(x_binom, pmf_binom, color='orange', alpha=0.7)
plt.title(f'Discrete Distribution (Binomial): Probability of Flood Years in a Decade\n(p={p_flood:.2f}, n={n_trials})')
plt.xlabel('Number of Flood Years')
plt.ylabel('Probability')
plt.xticks(x_binom)
plt.show()
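
The bars above come from the binomial formula $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, which assumes the $n$ years are independent and share the same flood probability $p$. Here is a small sketch of the formula using only the standard library. The value `p_hypothetical = 0.1` is an assumed placeholder, not the proportion estimated from the dataset.

```python
import math

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success prob p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n_years = 10
p_hypothetical = 0.1  # assumed flood probability, for illustration only

pmf = [binom_pmf(k, n_years, p_hypothetical) for k in range(n_years + 1)]
p_at_least_one = 1 - pmf[0]  # complement of "no flood years at all"

print(f"P(0 floods) = {pmf[0]:.4f}")
print(f"P(at least 1 flood) = {p_at_least_one:.4f}")
```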

## 3️⃣ Sampling Distribution
We demonstrate that the **distribution of the sample mean** tends towards Normal (the Central Limit Theorem), even though the underlying data (see the Annual Rainfall histogram above) is not perfectly normal.

In [None]:
sample_means = []
sample_size = 50
num_samples = 1000

for _ in range(num_samples):
    # Take a random sample of 50 and calculate mean
    sample = np.random.choice(annual_rain, size=sample_size, replace=True)
    sample_means.append(sample.mean())

plt.figure(figsize=(10, 5))
sns.histplot(sample_means, kde=True, color='purple')
plt.title(f'Sampling Distribution of the Mean (n={sample_size})')

# Central 95% range of the simulated sample means (percentile method)
ci_lower = np.percentile(sample_means, 2.5)
ci_upper = np.percentile(sample_means, 97.5)
plt.axvline(ci_lower, color='red', linestyle='--', label='95% CI')
plt.axvline(ci_upper, color='red', linestyle='--')
plt.legend()
plt.show()
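
The Central Limit Theorem also predicts the *width* of this histogram: the standard deviation of the sample means (the standard error) should be close to $\sigma / \sqrt{n}$. The sketch below checks this on a deliberately skewed synthetic population. The exponential population, the seed, and the sample counts are assumptions chosen for illustration, not properties of the rainfall data.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Skewed synthetic population: exponential with mean 1 and std 1
population = [rng.expovariate(1.0) for _ in range(20_000)]
sigma = statistics.pstdev(population)

sample_size = 50
num_samples = 2_000

# Draw samples with replacement and record each sample mean
sample_means = [
    statistics.fmean(rng.choices(population, k=sample_size))
    for _ in range(num_samples)
]

observed_se = statistics.stdev(sample_means)
theoretical_se = sigma / math.sqrt(sample_size)

print(f"Observed SE:    {observed_se:.4f}")
print(f"sigma/sqrt(n):  {theoretical_se:.4f}")
```

The same comparison can be made in the notebook by setting `np.std(sample_means)` against `annual_rain.std() / np.sqrt(sample_size)`.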

## 4️⃣ Hypothesis Testing: One Population
**Scenario**: The national average rainfall is historically stated to be **1500 mm**.
Has it changed significantly in this dataset?

*   $H_0: \mu = 1500$
*   $H_1: \mu \neq 1500$

We use a **1-sample t-test**.

In [None]:
hypothesized_mean = 1500
t_stat, p_val = stats.ttest_1samp(annual_rain, hypothesized_mean)

print(f"Hypothesized Mean: {hypothesized_mean}")
print(f"Sample Mean: {annual_rain.mean():.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.4e}")

if p_val < 0.05:
    print("Result: Reject Null Hypothesis (Significant Difference)")
else:
    print("Result: Fail to Reject Null Hypothesis (No Significant Difference)")
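
For reference, the statistic returned by `stats.ttest_1samp` is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ with $n - 1$ degrees of freedom. Below is a minimal hand computation on toy values. The five numbers, the reference mean of 1450, and the helper name `one_sample_t` are assumptions for illustration only.

```python
import math
import statistics

def one_sample_t(values, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(values)
    sample_mean = statistics.fmean(values)
    s = statistics.stdev(values)          # sample std, n-1 denominator
    standard_error = s / math.sqrt(n)
    return (sample_mean - mu0) / standard_error, n - 1

# Hypothetical annual totals in mm
toy_rain = [1400.0, 1450.0, 1500.0, 1550.0, 1600.0]
t_toy, dof_toy = one_sample_t(toy_rain, mu0=1450.0)

print(f"t = {t_toy:.4f} with {dof_toy} degrees of freedom")
```

The p-value is then the two-sided tail area of the Student t distribution at that statistic, which is the part SciPy computes for us.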

## 5️⃣ Hypothesis Testing: Two Populations
**Scenario**: Is there a significant difference in annual rainfall between **Kerala** and **Assam & Meghalaya**?

*   $H_0: \mu_{Kerala} = \mu_{Assam}$
*   $H_1: \mu_{Kerala} \neq \mu_{Assam}$

In [None]:
# Filter data for two regions
rain_kerala = df_clean[df_clean['SUBDIVISION'] == 'KERALA']['ANNUAL']
rain_assam = df_clean[df_clean['SUBDIVISION'] == 'ASSAM & MEGHALAYA']['ANNUAL']

# Welch's 2-sample t-test (independent samples; equal_var=False drops the equal-variance assumption)
t_stat_2, p_val_2 = stats.ttest_ind(rain_kerala, rain_assam, equal_var=False)

print(f"Mean Kerala: {rain_kerala.mean():.2f}")
print(f"Mean Assam & Meghalaya: {rain_assam.mean():.2f}")
print(f"P-value: {p_val_2:.4f}")

if p_val_2 < 0.05:
    print("Result: Significant difference found between regions.")
else:
    print("Result: No significant difference.")
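
With `equal_var=False`, `stats.ttest_ind` performs Welch's t-test, which does not pool the two variances. Its statistic is $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}$, and the degrees of freedom come from the Welch-Satterthwaite approximation. A sketch on toy samples follows (the values and the helper name `welch_t` are illustrative assumptions).

```python
import math
import statistics

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for Welch's test."""
    se2_a = statistics.variance(a) / len(a)   # squared standard error, group a
    se2_b = statistics.variance(b) / len(b)   # squared standard error, group b
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation
    dof = (se2_a + se2_b) ** 2 / (
        se2_a**2 / (len(a) - 1) + se2_b**2 / (len(b) - 1)
    )
    return t, dof

# Toy samples with clearly different spreads
region_a = [1.0, 2.0, 3.0, 4.0, 5.0]
region_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_welch, dof_welch = welch_t(region_a, region_b)

print(f"t = {t_welch:.4f}, dof = {dof_welch:.2f}")
```

The degrees of freedom are generally not an integer and fall below $n_1 + n_2 - 2$ when the variances differ.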

## 6️⃣ Analysis of Variance (ANOVA)
**Scenario**: Compare means across **three** regions: `WEST RAJASTHAN`, `KERALA`, and `GANGETIC WEST BENGAL`.

*   $H_0$: All means are equal.
*   $H_1$: At least one mean is different.

In [None]:
group1 = df_clean[df_clean['SUBDIVISION'] == 'WEST RAJASTHAN']['ANNUAL']
group2 = df_clean[df_clean['SUBDIVISION'] == 'KERALA']['ANNUAL']
group3 = df_clean[df_clean['SUBDIVISION'] == 'GANGETIC WEST BENGAL']['ANNUAL']

f_stat, p_val_anova = stats.f_oneway(group1, group2, group3)

print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_val_anova:.4e}")

sns.boxplot(data=[group1, group2, group3])
plt.xticks([0, 1, 2], ['W. Rajasthan', 'Kerala', 'W. Bengal'])
plt.title('ANOVA: Rainfall Distribution by Region')
plt.ylabel('Rainfall (mm)')
plt.show()
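
The F-statistic above is the ratio of between-group variability to within-group variability: $F = \dfrac{SS_{between} / (k - 1)}{SS_{within} / (N - k)}$ for $k$ groups and $N$ observations in total. Here is a minimal sketch of the same calculation on toy groups. The numbers and the helper name `one_way_f` are assumptions for illustration.

```python
import statistics

def one_way_f(*groups):
    """One-way ANOVA F statistic computed from sums of squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)

    # Variability of the group means around the grand mean
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    # Variability of observations around their own group mean
    ss_within = sum(
        (v - statistics.fmean(g)) ** 2 for g in groups for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy groups, not rainfall data
f_toy = one_way_f([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0])
print(f"F = {f_toy:.4f}")
```

A large F indicates only that at least one group mean differs. Identifying which one requires a follow-up comparison.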

## 7️⃣ Chi-Square Test
**Scenario**: Is "High Rain" status independent of "Region"?
We define a **High Rain** year as one with annual rainfall > 1500 mm. We check whether the proportion of High Rain years differs between **Orissa** and **Jharkhand**.

*   $H_0$: Rain Category is independent of Region.
*   $H_1$: They are dependent.

In [None]:
# Select Data
subset_df = df_clean[df_clean['SUBDIVISION'].isin(['ORISSA', 'JHARKHAND'])].copy()

# Create Categorical Variable: 'High' vs 'Normal' Rain
subset_df['Rain_Category'] = subset_df['ANNUAL'].apply(lambda x: 'High (>1500)' if x > 1500 else 'Normal')

# Create Contingency Table
contingency_table = pd.crosstab(subset_df['SUBDIVISION'], subset_df['Rain_Category'])
print("Contingency Table:\n", contingency_table)

# Chi-Square Test
chi2, p_val_chi2, dof, expected = stats.chi2_contingency(contingency_table)

print(f"\nChi2 Statistic: {chi2:.4f}")
print(f"P-value: {p_val_chi2:.4f}")

if p_val_chi2 < 0.05:
    print("Result: Reject H0; Rain Category and Region are associated (dependent).")
else:
    print("Result: Fail to reject H0; no significant evidence of dependence.")
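
Under $H_0$ each expected count is (row total × column total) / grand total, and the statistic sums $(O - E)^2 / E$ over the cells. Note that `stats.chi2_contingency` applies Yates' continuity correction by default on a 2×2 table (`correction=True`), so its statistic is somewhat smaller than the textbook value. The sketch below computes both versions on a hypothetical table. The counts and the helper name `chi_square_2x2` are assumptions, not results from the dataset.

```python
def chi_square_2x2(table, yates=False):
    """Return (chi-square statistic, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, observed in enumerate(row):
            e = row_totals[i] * col_totals[j] / total
            diff = abs(observed - e)
            if yates:
                # Continuity correction: shrink each deviation by 0.5
                diff = max(diff - 0.5, 0.0)
            stat += diff**2 / e
            expected_row.append(e)
        expected.append(expected_row)
    return stat, expected

# Hypothetical counts: rows = regions, columns = (High, Normal)
toy_table = [[10, 20], [20, 10]]
chi2_plain, expected_counts = chi_square_2x2(toy_table)
chi2_yates, _ = chi_square_2x2(toy_table, yates=True)

print(f"Uncorrected: {chi2_plain:.4f}, with Yates: {chi2_yates:.4f}")
```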