### Q6b)

In [1]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.linear_model import LinearRegression

In [2]:
saar_data = pd.read_csv('../../datasets/SAAR_elevation.csv')

In [3]:
saar_data.head(5)

Unnamed: 0,Name,elevation_m,SAAR_mm,E,N
0,ASK,186,1136,351500,523900
1,OAS,165,889,357910,527500
2,SYH,120,850,358200,522000
3,SLE,212,1166,359300,519200
4,BLA,330,1425,363100,511700


### Check for:
* any missing data
* any erroneous data (e.g. negative, or large values which are clearly incorrect)
* any samples which are too small or not matching in size if they are to be compared against other samples

In [4]:
# using isnull() function
saar_data.isnull().head(5)

Unnamed: 0,Name,elevation_m,SAAR_mm,E,N
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False


In [5]:
# Any missing values?
saar_data.isnull().values.any()

False
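
The cells above cover only the first item in the checklist (missing data). A minimal sketch of the second item (erroneous values) could look like the following. It rebuilds only the five rows shown by `saar_data.head(5)`, and the rainfall upper bound `MAX_PLAUSIBLE_SAAR_MM` is a hypothetical threshold chosen for illustration, not a value from the dataset.

```python
import pandas as pd

# The five rows shown by saar_data.head(5); the full dataset is not reproduced here.
sample = pd.DataFrame({
    "Name": ["ASK", "OAS", "SYH", "SLE", "BLA"],
    "elevation_m": [186, 165, 120, 212, 330],
    "SAAR_mm": [1136, 889, 850, 1166, 1425],
    "E": [351500, 357910, 358200, 359300, 363100],
    "N": [523900, 527500, 522000, 519200, 511700],
})

numeric_cols = ["elevation_m", "SAAR_mm", "E", "N"]

# Count negative entries per column; any non-zero count deserves a closer look.
negative_counts = (sample[numeric_cols] < 0).sum()

# Hypothetical upper bound for annual rainfall in mm; adjust to the study region.
MAX_PLAUSIBLE_SAAR_MM = 10_000
too_large = sample[sample["SAAR_mm"] > MAX_PLAUSIBLE_SAAR_MM]

# Duplicate station names would suggest repeated records.
duplicate_names = sample["Name"].duplicated().sum()

print(negative_counts)
print(f"rows above rainfall bound: {len(too_large)}")
print(f"duplicate station names: {duplicate_names}")

# Summary statistics make outliers (min/max far from the quartiles) easy to spot.
print(sample[numeric_cols].describe())
```

On the real data, `sample` would be replaced by `saar_data`.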

### Function for calculating estimated coefficient and plotting

In [6]:
# Function for calculating estimated coefficient and plotting

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
               marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('elevation')
    plt.ylabel('rainfall')

    # function to show plot
    plt.show()
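
As a quick sanity check of the closed-form formulas in `estimate_coef`, the sketch below applies them to a single predictor (elevation) and compares the result with `np.polyfit`. It uses only the five stations shown by `head(5)`, so the printed coefficients are illustrative and will differ from a fit on the full dataset.

```python
import numpy as np

def estimate_coef(x, y):
    # Same closed-form least-squares formulas as the function above.
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

# Elevation (m) and SAAR (mm) for the five stations shown by head(5).
elevation = np.array([186, 165, 120, 212, 330], dtype=float)
rainfall = np.array([1136, 889, 850, 1166, 1425], dtype=float)

b_0, b_1 = estimate_coef(elevation, rainfall)

# np.polyfit with deg=1 returns [slope, intercept] for the same least-squares fit.
slope, intercept = np.polyfit(elevation, rainfall, deg=1)

print(f"estimate_coef: b_0 = {b_0:.3f}, b_1 = {b_1:.3f}")
print(f"np.polyfit:    b_0 = {intercept:.3f}, b_1 = {slope:.3f}")
```

Both approaches solve the same one-predictor least-squares problem, so the two printed lines should agree up to floating-point error.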


### Linear regression relation between rainfall and elevation, easting and northing

In [7]:
# observations / data
rainfall = saar_data.SAAR_mm
elevation_easting_northing = saar_data[['elevation_m', 'E', 'N']]

# estimate_coef and plot_regression_line expect a single predictor, so the
# calls below stay disabled for this three-column input

# estimating coefficients
# b = estimate_coef(elevation_easting_northing, rainfall)
# print("Estimated coefficients:\nb_0 = {}  \
#      \nb_1 = {}".format(b[0], b[1]))

# plotting regression line
# plot_regression_line(elevation_easting_northing, rainfall, b)

## Comparing baseline method results with the scikit-learn library function

In [8]:
rainfall = rainfall.values
elevation_easting_northing = elevation_easting_northing.values.reshape(-1, 3)

In [9]:
model = LinearRegression().fit(elevation_easting_northing, rainfall)

In [10]:
r_sq = model.score(elevation_easting_northing, rainfall)
print(f"coefficient of determination: {r_sq}")

coefficient of determination: 0.8150923030070694
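
For reference, `model.score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below computes it by hand for a one-predictor fit on the five stations shown by `head(5)` (an illustrative subset, not the full dataset), and checks it against the squared Pearson correlation, which it must equal for a simple least-squares line with an intercept.

```python
import numpy as np

# Elevation (m) and SAAR (mm) for the five stations shown by head(5).
elevation = np.array([186, 165, 120, 212, 330], dtype=float)
rainfall = np.array([1136, 889, 850, 1166, 1425], dtype=float)

# Least-squares line; np.polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(elevation, rainfall, deg=1)
predicted = intercept + slope * elevation

# Residual and total sums of squares.
ss_res = np.sum((rainfall - predicted) ** 2)
ss_tot = np.sum((rainfall - rainfall.mean()) ** 2)

r_sq_manual = 1.0 - ss_res / ss_tot

# For a one-predictor fit with intercept, R^2 equals the squared Pearson r.
pearson_r = np.corrcoef(elevation, rainfall)[0, 1]

print(f"R^2 from sums of squares: {r_sq_manual:.4f}")
print(f"squared Pearson r:        {pearson_r ** 2:.4f}")
```

With several predictors, as in the model fitted above, the sums-of-squares definition still applies, but the shortcut through a single Pearson correlation does not.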


In [11]:
print(f"intercept: {model.intercept_}")
print(f"slope: {model.coef_}")

intercept: 12860.230479130896
slope: [ 2.12417658 -0.00843012 -0.01766988]


In [12]:
# plotting regression line
# plot_regression_line(elevation_easting_northing, rainfall, [model.intercept_, model.coef_])

### Q6c) For a new ungauged location at (E,N) (380000,500000) and elevation of 400 m, use your best regression relation from 2b to estimate the SAAR.

In [13]:
estimate_saar_input_vales = np.array([400, 380000, 500000]).reshape(-1, 3)

In [15]:
estimate_saar_prediction = model.predict(estimate_saar_input_vales)
print(f"The estimated value of the SAAR is:\n{estimate_saar_prediction}")

The estimated value of the SAAR is:
[1671.51616078]
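
The prediction can be reproduced by hand as intercept + coefficients · inputs. The sketch below does this with the intercept and coefficients copied from the printed output above; because the printed coefficients are rounded to eight decimals, the result matches `model.predict` only to within a small rounding error.

```python
import numpy as np

# Values copied from the printed model output (coefficients are rounded).
intercept = 12860.230479130896
coef = np.array([2.12417658, -0.00843012, -0.01766988])  # elevation_m, E, N

# New location in the same column order used for fitting:
# elevation 400 m, easting 380000, northing 500000.
new_site = np.array([400, 380000, 500000])

# Linear model: y = b_0 + b_1*elevation + b_2*E + b_3*N
saar_manual = intercept + coef @ new_site

print(f"manual SAAR estimate: {saar_manual:.2f} mm")
```

The column order of `new_site` matters: it has to match the order `['elevation_m', 'E', 'N']` used when the model was fitted.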


### Comment:
Linear regression has limitations and rests on several requirements:
1. The number of observations is limited.
2. Errors in X (the regressor values) are assumed to be minor. This is the exogeneity assumption: the regressors are treated as error-free.
3. Regressors should be constants or specified random variables.
4. A model fits the data better when the variance between the data points and the regression line is smaller.
5. The maximum likelihood estimator maximizes the agreement between model and data.

Despite these limitations, linear regression is computationally cheap, and its coefficients give insight into what the model has captured. Our data also meets all the requirements. The correlation between rainfall and elevation is positive, and the R-squared value of 0.82 falls in the "Very good" band of Table 1, so the model is rated very good. The slope and the intercept are also consistent with a good relation. In general, our relation is reliable.

Table 1: General performance ratings for evaluation criteria (monthly).

| **NSE**   | **r, R2** | **PBIAS** | **RSR**   | **Performance Rating** |
|-----------|-----------|-----------|-----------|------------------------|
| >0.90     | >0.90     | ≤±5       | 0.00      | Excellent              |
| 0.80-0.90 | 0.80-0.90 | ≤±5-±10   | 0-0.50    | Very good              |
| 0.70-0.80 | 0.70-0.80 | ±10-±15   | 0.50-0.60 | Good                   |
| 0.50-0.70 | 0.50-0.70 | ±15-±25   | 0.60-0.70 | Satisfactory           |
| <0.50     | <0.50     | ≥±25      | >0.70     | Unsatisfactory         |

The multiple regression fitted here (elevation, easting and northing) improves on the simple linear regression: R-squared rises from 0.72 to 0.82, although the elevation slope falls from 2.58 to 2.12. The intercept changes from 532.9 to 12860. In general, with reference to Table 1, the model is very good.
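
The Table 1 lookup for R-squared can be written as a small helper. This is a sketch only: the function name `rate_r_squared` is hypothetical, and the handling of values that sit exactly on a band boundary (assigned here to the band that starts at that value) is an assumption, since the table does not specify it.

```python
def rate_r_squared(r_sq):
    """Map an R-squared value to the Table 1 performance rating.

    Boundary values are assigned to the band that starts at that value
    (e.g. exactly 0.80 is rated "Very good"); this is an assumption.
    """
    if r_sq > 0.90:
        return "Excellent"
    if r_sq >= 0.80:
        return "Very good"
    if r_sq >= 0.70:
        return "Good"
    if r_sq >= 0.50:
        return "Satisfactory"
    return "Unsatisfactory"


# R-squared values quoted in the text for the two regressions.
for label, value in [("three-predictor model", 0.8150923030070694),
                     ("other regression", 0.72)]:
    print(f"{label}: R^2 = {value:.2f} -> {rate_r_squared(value)}")
```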
