# Prediction of daily monsoon rainfall

Two files are attached which contain daily rainfall data over India for 2010 and 2011. Both of them contain a 357x122 matrix (XR1 and XR) an a binary vector (ZR1 and ZR). The matrices contain rainfall amounts at 357 locations over India, on each day during the monsoon seasons of 2010 and 2011 (122 days from 1 June to 30 September). ZR1 and ZR are binary vectors which classify every day as 'rainy" (1) or non-rainy (0) based on the rainfall across the landmass.

1) Read the .mat files in Python and access the variables

2) Use a linear regression model to predict the rainfall XR(s,t) at any location 's' on day 't', using as predictor the rainfall at all other locations on the same day, and also rainfall at the same location on the previous 2 days [XR(1,t)....XR(s-1,t), XR(s+1,t),....XR(357,t), XR(s,t-1), XR(s,t-2)]. Use 2010 data for training.

Build such a model for s=42 (Mumbai), s=158 (Delhi), s= 299 (Kharagpur)

3) Use the same model to predict the rainfall at these 3 locations on each day of 2011.  Use values in XR as predictors. Compare the results with the true values and compute error for 3 locations separately.

4) Repeat the same process using LASSO linear regression. Using the coefficients, identify the top 5 predictors for each of the 3 locations.

5) Use Decision Tree on 2010 data to classify each day as 1 or 0. For each day, use the 357-dimensional rainfall vector as feature vector. Report the 10 most discriminative features (i.e. locations)

6) Use this Decision Tree to classify each day of 2011 as 1 or 0. Report accuracy.

In [252]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.io
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_squared_error, accuracy_score
from math import sqrt
from sklearn.tree import DecisionTreeClassifier

In [253]:
# Loading data
train = scipy.io.loadmat("2010rainfall.mat")
test = scipy.io.loadmat("2011rainfall.mat")
# row 158 is Delhi
# Kgp 299
# row 122 mumbai
# 122 days monsoon

In [254]:
train_data = train["XR1"]
test_data = test["XR"]

In [255]:
# temporal coherence, spatial coherence
test_data.shape

(357, 122)

In [256]:
x_train[:, 1].shape

(122,)

In [257]:
def get_data(x, column, row):
    rows_include = list(range(0, row)) + list(range(row+1, x.shape[0]))
    data = x[rows_include, column]
    data = np.append(data, [x[row-2][column], x[row-1][column]])
    return data

In [258]:
def feature_vector(x, row):
    x_train = []
    y_train = []
    for i in range(2, x.shape[1]):
        x_train.append(get_data(x, i, row))
        y_train.append(x[row][i])
    return np.array(x_train), np.array(y_train)

In [259]:
x_train, y_train = feature_vector(train_data, 42)

In [260]:
print(f"x_train shape is {x_train.shape}")
print(f"y_train shape is {y_train.shape}")

x_train shape is (120, 358)
y_train shape is (120,)


## Linear Regression

### Mumbai

In [261]:
row = 42
x_train, y_train = feature_vector(train_data, row)
x_test, y_test = feature_vector(test_data, row)

In [262]:
m = LinearRegression()
m.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [263]:
error = sqrt(mean_squared_error(m.predict(x_test), y_test))
print(f"Error is {error}")

Error is 25.779489035763323


### Delhi

In [264]:
row = 158
x_train, y_train = feature_vector(train_data, row)
x_test, y_test = feature_vector(test_data, row)

In [265]:
m = LinearRegression()
m.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [266]:
error = sqrt(mean_squared_error(m.predict(x_test), y_test))
print(f"Error is {error}")

Error is 13.399315105760785


### Kharagpur

In [267]:
row = 299
x_train, y_train = feature_vector(train_data, row)
x_test, y_test = feature_vector(test_data, row)

In [268]:
m = LinearRegression()
m.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [269]:
error = sqrt(mean_squared_error(m.predict(x_test), y_test))
print(f"Error is {error}")

Error is 16.061575334154583


## Lasso

### Mumbai

In [270]:
row = 42
x_train, y_train = feature_vector(train_data, row)
x_test, y_test = feature_vector(test_data, row)

In [271]:
m = Lasso(max_iter=10000)
m.fit(x_train, y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=10000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [272]:
error = sqrt(mean_squared_error(m.predict(x_test), y_test))
print(f"Error is {error}")

Error is 25.3835019883162


In [273]:
coefficients = m.coef_

In [274]:
coefficients.shape

(358,)

In [275]:
# getting top 5 elments from coefficeints
index = sorted(range(len(coefficients)), key=lambda i: coefficients[i], reverse=True)[:5]
print(f"Most important 5 indices are {index}")

Most important 5 indices are [26, 25, 47, 90, 92]


In [276]:
coefficients[26]

0.5790623596033783

In [277]:
coefficients[25]

0.3533835636785544

### Delhi

In [278]:
row = 158
x_train, y_train = feature_vector(train_data, row)
x_test, y_test = feature_vector(test_data, row)

In [279]:
m = Lasso(max_iter=10000)
m.fit(x_train, y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=10000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [280]:
error = sqrt(mean_squared_error(m.predict(x_test), y_test))
print(f"Error is {error}")

Error is 12.208246968097699


In [281]:
coefficients = m.coef_

In [282]:
coefficients.shape

(358,)

In [283]:
# getting top 5 elments from coefficeints
index = sorted(range(len(coefficients)), key=lambda i: coefficients[i], reverse=True)[:5]
print(f"Most important 5 indices are {index}")

Most important 5 indices are [131, 202, 153, 74, 73]


### Kharagpur

In [284]:
row = 42
x_train, y_train = feature_vector(train_data, row)
x_test, y_test = feature_vector(test_data, row)

In [285]:
m = Lasso(max_iter=10000)
m.fit(x_train, y_train)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=10000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [286]:
error = sqrt(mean_squared_error(m.predict(x_test), y_test))
print(f"Error is {error}")

Error is 25.3835019883162


In [287]:
coefficients = m.coef_

In [288]:
coefficients.shape

(358,)

In [289]:
# getting top 5 elments from coefficeints
index = sorted(range(len(coefficients)), key=lambda i: coefficients[i], reverse=True)[:5]
print(f"Most important 5 indices are {index}")

Most important 5 indices are [26, 25, 47, 90, 92]


## Decision Tree Classifier

In [290]:
x_train = train_data.T
y_train = train["ZR1"].T
x_test = test_data.T
y_test = test["ZR"].T

In [291]:
print(f"x_train shape is {x_train.shape}")
print(f"y_train shape is {y_train.shape}")
print(f"x_test shape is {x_test.shape}")
print(f"y_test shape is {y_test.shape}")

x_train shape is (122, 357)
y_train shape is (122, 1)
x_test shape is (122, 357)
y_test shape is (122, 1)


In [292]:
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

In [293]:
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy is {acc}")

Accuracy is 0.6885245901639344


In [294]:
feature_importance = clf.feature_importances_

In [295]:
# Picking the values for feature importance
index = sorted(range(len(feature_importance)), key=lambda i: feature_importance[i], reverse=True)[:5]
print(f"Most important 5 indices are {index}")

Most important 5 indices are [184, 252, 163, 26, 204]
