#### Importing Libraries...

In [None]:
#pip install missingno

In [None]:
import numpy as np                                       # Numpy for arrays...
import pandas as pd                                      # Pandas for the datasets...
import seaborn as sns                                    # Seaborn for visualization...
import matplotlib.pyplot as plt                          # Matplotlib for visualization...
from sklearn.base import BaseEstimator, TransformerMixin      # Transformer classes for Pipelining... 
from sklearn.impute import SimpleImputer                    # SimpleImputer to compute the missing values...
import missingno as missing                               # Visualization of Missing values...

In [None]:
#dataset = pd.read_csv("E:/Downloads/train_data.csv")
dataset = pd.read_csv("train_data.csv")

In [None]:
dataset.head(3)

#### Geographical Concepts:- 
1. NMME - North America Multi-Model Ensemble.
2. Temperature is affected by wind, and wind in turn is affected by atmospheric pressure and topography (relative height above sea level). contest-wind-h10-14d__wind-hgt-10 means that the values are provided for varying pressure and height respectively. Wind can make an area warmer or cooler, depending on the region it arrives from.
3. contest-pevpr-sfc-gauss-14d__pevpr denotes the evaporation. The rate of evaporation generally rises with temperature.
4. contest-rhum-sig995-14d__rhum denotes the relative humidity. For a fixed amount of water vapour, relative humidity falls as temperature rises.
5. Suffixes such as 56w and 34w denote forecast windows of 5-6 weeks and 3-4 weeks.
6. Total precipitable water is the amount of precipitation that could occur over a region. Precipitation is the amount that actually occurred in the region.
7. MEI stands for Multivariate ENSO Index (ENSO: El Nino-Southern Oscillation).
8. The Sea Surface Temperature (SST) defines the Sea Temperature at the surface and is a prime factor for Temperature determination.

#### The explanation below is my own reading of the dataset, supplemented by what I found on the internet. Feel free to suggest corrections or updates. We will now group the columns. Here is a description of the column names for everybody's reference:-
1. index for the index.
2. lat for the latitudes, lon for the longitudes.
3. startdate is the start date column.
4. contest-pevpr-sfc-gauss-14d__pevpr column for evaporation.
5. From what I found online, NMME stands for North America Multi-Model Ensemble, and the 10 columns after the contest-pevpr-sfc-gauss-14d__pevpr column deal with the evaporation data. The last column of the evaporation data holds the mean values of the weather forecast stations.
6. contest-wind-h10-14d__wind-hgt-10 column deals with the wind. Similarly, the 10 columns after it deal with the height data, and the last height column holds the mean.
7. contest-rhum-sig995-14d__rhum is the relative humidity data. The next 10 columns deal with the relative humidity, and the last column holds the mean.
8. contest-wind-h100-14d__wind-hgt-100 is another column dealing with the wind. We have a total of 20 columns for this particular data. The first 10 columns deal with p-rate 56w and the next 10 columns deal with p-rate 34w.
9. contest-tmp2m-14d__tmp2m is the temperature (used as Temperature in the analysis below) and contest-slp-14d__slp is the sea level pressure.
10. contest-wind-vwnd-925-14d__wind-vwnd-925 is the data for the meridional (north-south) wind, which this notebook calls the longitudinal wind. It also has 10 columns of data succeeding it, with the last column as the mean. contest-wind-uwnd-250-14d__wind-uwnd-250 deals with the zonal (east-west) wind under different atmospheric factors and time durations.
11. contest-prwtr-eatm-14d__prwtr is the precipitable water for the entire atmosphere.
12. contest-precip-14d__precip deals with the precipitation.
13. climateregions__climateregion and elevation__elevation give the climate region and the elevation.
14. mjo1d__phase and mjo1d__amplitude are the MJO phase and amplitude.
15. mei__mei, mei__meirank and mei__nip belong to the MEI system.
16. sst-2010-1, sst-2010-2, sst-2010-3, sst-2010-4, sst-2010-5, sst-2010-6, sst-2010-7, sst-2010-8, sst-2010-9, sst-2010-10 are the 10 columns of the Sea Surface Temperature.
17. The other columns are the different values of wind at different atmospheric pressures and come in groups of 10 or 20 columns each. Thus we can create a separate column for the mean of each group of wind values under unique conditions.
18. The wind-hgt columns come in groups of 10 columns each, one group per level: 10, 100, 500 and 850 (the plots later label these levels in millibars).

In [None]:
lst = []
for column in dataset.columns:
    if dataset[column].isnull().any():
        lst.append(column)
lst

#### Later, we will find that all of these columns are dropped because they are insignificant for computing the temperature, so I simply used the mean strategy to impute them.

#### Let us check whether the missing values are a case of MNAR (Missing Not at Random), MCAR (Missing Completely at Random) or MAR (Missing at Random).

In [None]:
missing.matrix(dataset)       # Visualizing the missing values...

#### Looking at the graph, it is clear that this is a case of MNAR: the missing values all lie within a small range of time, probably about a month. They also fall into two groups along the time axis. My intuition is that the two groups are a year apart and lie in the same month, so this is a case of MNAR (Missing Not at Random).
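
The visual impression from the matrix plot can be cross-checked numerically. The cell below is only a sketch on toy data: `toy` and `feature` are made-up names, and the gaps are planted in March on purpose. It computes the share of missing values per year and month, so that clustered gaps show up as a few cells near 1.0.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: one row per day over two years, with a
# hypothetical feature that is missing only in March of each year.
toy = pd.DataFrame({"startdate": pd.date_range("2014-01-01", "2015-12-31", freq="D")})
toy["feature"] = np.arange(len(toy), dtype=float)
toy.loc[toy["startdate"].dt.month == 3, "feature"] = np.nan

# Share of missing values for every (Year, Month) pair.
year = toy["startdate"].dt.year.rename("Year")
month = toy["startdate"].dt.month.rename("Month")
missing_share = toy["feature"].isnull().groupby([year, month]).mean()

# Only the months that actually contain gaps.
clustered = missing_share[missing_share > 0]
print(clustered)
```

On the real dataset the same grouping could be applied to any column returned by the null check above, after converting startdate with pd.to_datetime.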

#### First, let us break the startdate column into month and year.

In [None]:
class DateToMonthAndYears(BaseEstimator, TransformerMixin):     # Transformer class so that this step can be chained in a Pipeline...
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['startdate'] = pd.to_datetime(X['startdate'])         # Converting to date and time...
        X['date'] = pd.DatetimeIndex(X['startdate']).date
        X['Month'] = pd.DatetimeIndex(X['startdate']).month
        X['Year'] = pd.DatetimeIndex(X['startdate']).year
        return X.drop(columns="startdate")     # Dropping the startdate column...

#### Now Let us Remove the Index Column

In [None]:
class RemoveIndex(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns="index")    # We drop the Index Column...

#### Now, we will find all the columns with null values and then impute the null values using the mean strategy of the SimpleImputer from the sklearn library.

In [None]:
null_cols = []
class ObtainingNullValues(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        missing_values = []
        for col in X.columns:    # Scanning the frame passed through the Pipeline rather than the global dataset...
            if(X[col].isnull().any()):    # All columns have missing values as float values...
                null_cols.append(col)
                missing_values.append(X[col].isnull().sum())   # The maximum missing value is 15,934 values in a single column out of 3,75,734 rows...
        # Thus if we evaluate the percentage of maximum missing values in a column, it accounts to 4.24 % only... Thus we can take the missing values as the mean of the given values in the column...
        Imputer = SimpleImputer(strategy="mean")
        for col in null_cols:
            X[col] = Imputer.fit_transform(X[[col]])     # Now we set the null values as the mean of the dataset column...
        return X
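
The percentage quoted in the comments above can be checked directly, and the imputation can also be done for all affected columns in one call. This is a rough sketch on a toy frame, where `toy` and `cols_with_nulls` are made-up names; it is not a replacement for the transformer above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Figures quoted in the comments above: worst column vs. total number of rows.
max_missing, n_rows = 15_934, 375_734
pct_missing = 100 * max_missing / n_rows
print(f"{pct_missing:.2f} %")

# Toy frame (hypothetical columns) with gaps in two of its three columns.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan], "c": [1, 2, 3]})

# Columns that contain at least one null value.
cols_with_nulls = toy.columns[toy.isnull().any()].tolist()

# Mean imputation for all of them at once; each column gets its own mean.
toy[cols_with_nulls] = SimpleImputer(strategy="mean").fit_transform(toy[cols_with_nulls])
print(toy)
```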

#### Evaluating the row-wise mean of each group of 20 wind columns for the El Nino phenomenon for the years 2010-2020.

In [None]:
class TwentyColumnSet(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        wind, wind1, wind2, wind3 = [], [], [], []
        X['wind-vwnd-250-2010-mean'] = 0     # Creating an empty column for MEAN...
        X['wind-uwnd-250-2010-mean'] = 0     
        X['wind-uwnd-925-2010-mean'] = 0      
        X['wind-vwnd-925-2010-mean'] = 0
        for i in range(1, 21):
            wind.append('wind-vwnd-250-2010-{a}'.format(a=i))    # Using format string to store the column names...
            wind1.append('wind-uwnd-250-2010-{a}'.format(a=i))
            wind2.append('wind-uwnd-925-2010-{a}'.format(a=i))
            wind3.append('wind-vwnd-925-2010-{a}'.format(a=i))    # Using format string...
        for j in range(0, len(wind)):
            X['wind-vwnd-250-2010-mean'] = X['wind-vwnd-250-2010-mean'] + X[wind[j]]
            X['wind-uwnd-250-2010-mean'] = X['wind-uwnd-250-2010-mean'] + X[wind1[j]]
            X['wind-uwnd-925-2010-mean'] = X['wind-uwnd-925-2010-mean'] + X[wind2[j]]
            X['wind-vwnd-925-2010-mean'] = X['wind-vwnd-925-2010-mean'] + X[wind3[j]]
        X['wind-vwnd-250-2010-mean'] = X['wind-vwnd-250-2010-mean'] / 20    # Storing the mean...
        X['wind-uwnd-250-2010-mean'] = X['wind-uwnd-250-2010-mean'] / 20
        X['wind-uwnd-925-2010-mean'] = X['wind-uwnd-925-2010-mean'] / 20
        X['wind-vwnd-925-2010-mean'] = X['wind-vwnd-925-2010-mean'] / 20
        for j in range(0, len(wind)):
            X = X.drop(columns="wind-vwnd-250-2010-{a}".format(a=j+1))    # Dropping the irrelevant columns...
            X = X.drop(columns="wind-uwnd-250-2010-{a}".format(a=j+1))
            X = X.drop(columns="wind-uwnd-925-2010-{a}".format(a=j+1))
            X = X.drop(columns="wind-vwnd-925-2010-{a}".format(a=j+1))
        return X

#### Finding the row-wise mean of each group of 10 columns for the El Nino parameters.

In [None]:
class TenColumnSet(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        set1, set2, set3, set4, set5, set6 = [], [], [], [], [], []
        X['icec-2010-mean'] = 0
        X['wind-hgt-850-2010-mean'] = 0    # Creating columns for storing the mean of the values...
        X['wind-hgt-10-2010-mean'] = 0
        X['wind-hgt-100-2010-mean'] = 0
        X['wind-hgt-500-2010-mean'] = 0
        X['sst-2010-mean'] = 0
        for i in range(1,11):
            set1.append('icec-2010-{a}'.format(a=i))          # Using format string to store the column name...
            set2.append('wind-hgt-850-2010-{a}'.format(a=i))
            set3.append('wind-hgt-10-2010-{a}'.format(a=i))
            set4.append('wind-hgt-100-2010-{a}'.format(a=i))
            set5.append('wind-hgt-500-2010-{a}'.format(a=i))
            set6.append('sst-2010-{a}'.format(a=i))
        for j in range(0, len(set1)):
            X['icec-2010-mean'] = X['icec-2010-mean'] + X[set1[j]]    # Evaluating the sum separately...
            X['wind-hgt-850-2010-mean'] = X['wind-hgt-850-2010-mean'] + X[set2[j]]
            X['wind-hgt-10-2010-mean'] = X['wind-hgt-10-2010-mean'] + X[set3[j]]
            X['wind-hgt-100-2010-mean'] = X['wind-hgt-100-2010-mean'] + X[set4[j]]
            X['wind-hgt-500-2010-mean'] = X['wind-hgt-500-2010-mean'] + X[set5[j]]
            X['sst-2010-mean'] = X['sst-2010-mean'] + X[set6[j]]
        X['icec-2010-mean'] = X['icec-2010-mean'] / 10
        X['wind-hgt-850-2010-mean'] = X['wind-hgt-850-2010-mean'] / 10
        X['wind-hgt-10-2010-mean'] = X['wind-hgt-10-2010-mean'] / 10     # Calculating the mean...
        X['wind-hgt-100-2010-mean'] = X['wind-hgt-100-2010-mean'] / 10
        X['wind-hgt-500-2010-mean'] = X['wind-hgt-500-2010-mean'] / 10
        X['sst-2010-mean'] = X['sst-2010-mean'] / 10
        for j in range(0, len(set1)):    # Removing the unnecessary columns...
            X = X.drop(columns="icec-2010-{a}".format(a=j+1))
            X = X.drop(columns="wind-hgt-850-2010-{a}".format(a=j+1))
            X = X.drop(columns="wind-hgt-10-2010-{a}".format(a=j+1))
            X = X.drop(columns="wind-hgt-100-2010-{a}".format(a=j+1))
            X = X.drop(columns="wind-hgt-500-2010-{a}".format(a=j+1))
            X = X.drop(columns="sst-2010-{a}".format(a=j+1))
        return X
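
Both column-set classes above follow the same pattern: add up a numbered group of columns, divide by the group size, and drop the originals. A compact way to express that pattern is sketched below. `collapse_group` is a hypothetical helper, not part of this notebook, and the toy frame uses only three sst columns.

```python
import pandas as pd

def collapse_group(X, prefix, n):
    """Replace columns prefix-1 ... prefix-n with a single prefix-mean column."""
    cols = [f"{prefix}-{i}" for i in range(1, n + 1)]
    X = X.copy()                      # leave the caller's frame untouched
    # skipna=False mirrors the sum-then-divide loops above: a NaN in any
    # column of the group gives a NaN mean for that row.
    X[f"{prefix}-mean"] = X[cols].mean(axis=1, skipna=False)
    return X.drop(columns=cols)

# Toy frame with a three-column group and one unrelated column.
toy = pd.DataFrame({
    "sst-2010-1": [1.0, 2.0],
    "sst-2010-2": [3.0, 4.0],
    "sst-2010-3": [5.0, 9.0],
    "lat": [0.0, 1.0],
})
out = collapse_group(toy, "sst-2010", 3)
print(out)
```

With the real data the call would be repeated per group, for example with prefix 'wind-vwnd-250-2010' and n=20.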

#### Now we chain the transformer classes above into a single Pipeline.

In [None]:
from sklearn.pipeline import Pipeline          # Pipeline for preprocessing called...
Pipe = Pipeline([                            # Using the Pipeline classes...
    ("Date", DateToMonthAndYears()),
    ("Index", RemoveIndex()),
    ("NullValues", ObtainingNullValues()),     # Every class in Pipeline is called in a Sequential manner...
    ("TwentySet", TwentyColumnSet()),
    ("TenSet", TenColumnSet())
])
dataset = Pipe.fit_transform(dataset)     # Passing the Original dataset into the Pipeline...
dataset.shape

In [None]:
dataset.head(5)

#### The number of columns has been reduced from 246 to 117.

#### Now we first look at the distribution of Temperature in the given dataset.

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(dataset['contest-tmp2m-14d__tmp2m'], bins=30)
plt.xlabel("Temperature Bins", c="Red", size=18)
plt.show()

#### Now we will find the relation between various environmental values and the temperature.

In [None]:
data = pd.DataFrame()    # creating a dataset of environmental factors...
data['Evaporation'] = dataset['contest-pevpr-sfc-gauss-14d__pevpr']
data['Wind'] = dataset['contest-wind-h10-14d__wind-hgt-10']
data['Humidity'] = dataset['contest-rhum-sig995-14d__rhum']
data['Temperature'] = dataset['contest-tmp2m-14d__tmp2m']
data['SeaLevelPressure'] = dataset['contest-slp-14d__slp']
data['Pressure'] = dataset['contest-pres-sfc-gauss-14d__pres']
data['Elevation'] = dataset['elevation__elevation']
data['Precipitation'] = dataset['contest-precip-14d__precip']
data['Month'] = dataset['Month']

In [None]:
def PlottingChart():
    plt.figure(figsize=(10, 6))
    sns.heatmap(data.corr(), cmap="Spectral", annot=True)
    plt.title("Correlation Heatmap of Various Factors", size=18, c="red")
    plt.show()

PlottingChart()    # Visualizing data...

#### Counting the rows in each climatic region. Since one climatic region has far more rows than the others, we have to train our model carefully to avoid under-fitting and over-fitting. For cross-validation we can use LeavePOut cross-validation or rolling cross-validation.

In [None]:
dataset['climateregions__climateregion'].value_counts().sort_values().plot(kind='bar', figsize=(10,4), rot=0)
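
For the rolling cross-validation mentioned above, one option is scikit-learn's TimeSeriesSplit, which always trains on earlier rows and tests on later ones. The cell below is a minimal sketch on twelve toy samples. It assumes the rows are already sorted by date, and `X_toy` is a made-up name.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered samples stand in for rows sorted by date (toy data).
X_toy = np.arange(12).reshape(-1, 1)

# Three forward-chaining folds: the training window grows, the test window
# always lies after it in time.
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_toy))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Note that TimeSeriesSplit uses an expanding training window by default; a fixed-length rolling window can be approximated with its max_train_size parameter.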

In [None]:
data['Year'] = dataset['Year']

#### Since the dataset contains many weather forecast stations, we can first check which forecast stations resemble the Temperature. The simplest way is to compare their means and standard deviations. If the mean or standard deviation of a station is too far from that of the Temperature, we should not use that weather forecast station for training or checking the temperature.

In [None]:
temp_mean = data['Temperature'].mean()
X = [temp_mean] * 20
Y = []
stations = ['nmme0-tmp2m-34w__cancm30', 'nmme0-tmp2m-34w__cancm40', 'nmme0-tmp2m-34w__ccsm30', 'nmme0-tmp2m-34w__ccsm40', 'nmme0-tmp2m-34w__cfsv20', 'nmme0-tmp2m-34w__gfdlflora0', 'nmme0-tmp2m-34w__gfdlflorb0', 'nmme0-tmp2m-34w__gfdl0', 'nmme0-tmp2m-34w__nasa0', 'nmme0-tmp2m-34w__nmme0mean']     # Checking for 3-4 weeks...
for station in stations:
    Y.append(dataset[station].mean())
    Y.append(dataset[station].std())
Z = []
for values in range(0, int(len(X)/2)):
    Z.append("Mean")
    Z.append("Standard Deviation")
NMME = ['Cancm30', 'Cancm30', 'Cancm40', 'Cancm40', 'Ccsm30', 'Ccsm30', 'Ccsm40', 'Ccsm40', 'cfsv20', 'cfsv20','gfd_a', 'gfd_a', 'gfd_b', 'gfd_b', 'gfdl', 'gfdl', 'nasa', 'nasa', 'mean', 'mean']

#### The Pressure vs Temperature contour map is widely spread, and most of the data lies within the 600 to 900 contour lines.

In [None]:
import plotly.express as px
df = data
fig = px.density_contour(df, x="Temperature", y="Pressure", title="Pressure X Temperature")
fig.update_traces(contours_coloring="fill", contours_showlabels = True, colorscale="Thermal")
fig.update_layout(width=1000, height=750, font_family="Courier New", font_color="green",     title_font_family="Times New Roman", title_font_color="red", title_font_size=24, font_size=16)
fig.show()

#### The Elevation vs Temperature contour graph is less widespread and is distinctly classified into two groups. Most of the values are within the range of 500 to 1000.

In [None]:
df = data
fig = px.density_contour(df, x="Temperature", y="Elevation", title="Elevation X Temperature")
fig.update_traces(contours_coloring="fill", contours_showlabels = True, colorscale="Thermal")
fig.update_layout(width=1000, height=750, font_family="Courier New", font_color="green",     title_font_family="Times New Roman", title_font_color="red", title_font_size=24, font_size=16)
fig.show()

In [None]:
# Column names assumed to follow the nmme-tmp2m-56w__* columns that are dropped later in this notebook...
stations1 = ['nmme-tmp2m-56w__cancm3', 'nmme-tmp2m-56w__cancm4', 'nmme-tmp2m-56w__ccsm3', 'nmme-tmp2m-56w__ccsm4', 'nmme-tmp2m-56w__cfsv2', 'nmme-tmp2m-56w__gfdlflora', 'nmme-tmp2m-56w__gfdlflorb', 'nmme-tmp2m-56w__gfdl', 'nmme-tmp2m-56w__nasa', 'nmme-tmp2m-56w__nmmemean']      # Checking for 5-6 weeks...
Y1 = []
for station in stations1:     # Looping over the 5-6 week list (not the 3-4 week stations list)...
    Y1.append(dataset[station].mean())
    Y1.append(dataset[station].std())

In [None]:
stations34 = ['nmme-tmp2m-34w__cancm3', 'nmme-tmp2m-34w__cancm4', 'nmme-tmp2m-34w__ccsm3', 'nmme-tmp2m-34w__ccsm4', 'nmme-tmp2m-34w__cfsv2', 'nmme-tmp2m-34w__gfdlflora', 'nmme-tmp2m-34w__gfdlflorb', 'nmme-tmp2m-34w__gfdl', 'nmme-tmp2m-34w__nasa', 'nmme-tmp2m-34w__nmmemean'] # Checking for 3-4 weeks (same order as the NMME labels)...
Y34 = []
for station in stations34:
    Y34.append(dataset[station].mean())
    Y34.append(dataset[station].std())

In [None]:
df1 = pd.DataFrame(X, columns=["Temperature"])
df1['StationData'] = Y      # 3-4 week period data for NMME0...
df1['Station1Data'] = Y1    # 5-6 week period data...
df1['Station2Data'] = Y34   # 3-4 week period data...
df1['Stations'] = NMME
df1['Values'] = Z
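
The mean and standard deviation table above is assembled from several parallel lists, which is easy to get out of order. As an alternative sketch, the same long-format table can be produced with `agg` and `melt`. The frame `toy` and its `model_a` and `model_b` columns are hypothetical stand-ins for the forecast columns.

```python
import pandas as pd

# Hypothetical forecast columns standing in for the station columns.
toy = pd.DataFrame({"model_a": [1.0, 2.0, 3.0], "model_b": [2.0, 4.0, 6.0]})

# One row per column, with its mean and (sample) standard deviation.
summary = toy.agg(["mean", "std"]).T

# Long format: one row per (station, statistic), ready for a grouped bar plot.
long = (
    summary.rename_axis("Stations")
    .reset_index()
    .melt(id_vars="Stations", var_name="Values", value_name="StationData")
)
print(long)
```

Because the station name travels with each value, the labels cannot drift away from the numbers they describe.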

#### Since the mean and standard deviation of the temperature are almost constant within each month, we can split the data by month and then perform the EDA on each month specifically.

In [None]:
fig = px.bar(df1, x='Stations', y='StationData', color="Values", title="Values for Various Forecast Stations for 3 to 4 Weeks Period for Most Recent Monthly Forecast", text_auto=True)
fig.update_layout(title_font_color="purple", title_font_size=24, font_color="green", font_size=18)
fig.show()

#### The 5 to 6 week values look very similar to the 3 to 4 week ones, so we will later check whether the two groups are highly correlated. Even if two distributions have the same mean and variance, they need not be strongly correlated, because in time series data the date and time are a very influential factor.

In [None]:
fig = px.bar(df1, x='Stations', y='Station1Data', color="Values", title="Values for Various Forecast Stations for 5 to 6 Weeks Period", text_auto=True)
fig.update_layout(title_font_color="purple", title_font_size=24, font_color="green", font_size=18)
fig.show()
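
The caveat about matching means and variances can be illustrated with a small synthetic example: a series and a shuffled copy of it share the same mean and standard deviation but are almost uncorrelated, because the ordering in time is lost. The names `a` and `b` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

a = rng.normal(size=1000)      # toy time series
b = rng.permutation(a)         # same values, time order destroyed

# Identical moments (up to floating point rounding)...
print(a.mean(), b.mean())
print(a.std(), b.std())

# ...but the Pearson correlation between the two series is close to zero.
corr_ab = np.corrcoef(a, b)[0, 1]
print(corr_ab)
```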

In [None]:
fig = px.bar(df1, x='Stations', y='Station2Data', color="Values", title="Values for Various Forecast Stations for 3 to 4 Weeks Period", text_auto=True)
fig.update_layout(title_font_color="purple", title_font_size=24, font_color="green", font_size=18)
fig.show()

#### Intuition from the Bar Plots:-
1. Since the means and variances of all the weather forecast stations are almost the same, we can exclude many of them and keep only one as a parameter while training. Also, a simple bias term with a value of 1 will absorb these small discrepancies in the mean while training the model.
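
One way to read the remark about the bias term: if two forecasts differ only by a constant offset, a model with an intercept gives the same predictions from either of them. The sketch below uses synthetic data and plain linear regression as an assumed model; `truth`, `forecast_a` and `forecast_b` are made-up names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

truth = rng.normal(loc=10.0, scale=5.0, size=100)   # toy observed temperature
forecast_a = truth.copy()                           # hypothetical unbiased forecast
forecast_b = truth - 1.5                            # same forecast, shifted by a constant

fit_a = LinearRegression().fit(forecast_a.reshape(-1, 1), truth)
fit_b = LinearRegression().fit(forecast_b.reshape(-1, 1), truth)

# The slope stays the same; the intercept (bias) absorbs the offset.
print(fit_a.coef_[0], fit_a.intercept_)
print(fit_b.coef_[0], fit_b.intercept_)

pred_a = fit_a.predict(forecast_a.reshape(-1, 1))
pred_b = fit_b.predict(forecast_b.reshape(-1, 1))
```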

In [None]:
data1 = pd.DataFrame()
data1['Temperature'] = data['Temperature']
data1['MeanPrecipitation34'] = dataset['nmme-prate-34w__nmmemean']
data1['MeanPrecipitation56'] = dataset['nmme-prate-56w__nmmemean']
data1['MeanPrecipitation0of34'] = dataset['nmme0-prate-34w__nmme0mean']
data1['MeanPrecipitation0of56'] = dataset['nmme0-prate-56w__nmme0mean']
data1['MeanTemperature34'] = dataset['nmme-tmp2m-34w__nmmemean']
data1['MeanTemperature56'] = dataset['nmme-tmp2m-56w__nmmemean']
data1['MeanTemperature0'] = dataset['nmme0-tmp2m-34w__nmme0mean']
data1['MostRecent'] = dataset['nmme0mean']
data1['MJOphase'] = dataset['mjo1d__phase']
data1['MJOamplitude'] = dataset['mjo1d__amplitude']
data1['Region'] = dataset['climateregions__climateregion']
data1['MEIrank'] = dataset['mei__meirank']

def PlottingChart1():
    plt.figure(figsize=(10, 6))
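    sns.heatmap(data1.corr(numeric_only=True), cmap="magma", annot=True)    # numeric_only=True skips the text-valued Region column (needs pandas 1.5 or newer)...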
    plt.title("Correlation Heatmap of Computational Factors", size=18, c="red")
    plt.show()

PlottingChart1()

#### Intuition from the Heat Map:- 
1. Since the correlation coefficients of the 3-4 week temperature mean and the 5-6 week temperature mean with Temperature are exactly the same (0.95), we can drop either group of columns. Here I will be dropping the 5-6 week columns for all weather forecast stations.
2. Also, the correlation between MeanTemperature0 and MeanTemperature56 is 0.91, and between MeanTemperature0 and MeanTemperature34 it is 0.9, so we will remove the entire 5-6 week group of columns. I will also drop the individual MeanTemperature0 (NMME0) columns and keep only the mean column as a parameter.
3. Similarly, we will do the same for the Precipitation, since the three Precipitation columns are highly correlated as well (correlation coefficients of 0.99 and 0.84).
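
Instead of reading the pairs off the heatmap by eye, the highly correlated pairs can also be listed programmatically. This is a sketch on synthetic data with an assumed threshold of 0.9; the column names `f34`, `f56` and `other` are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "f34": base,
    "f56": base + rng.normal(scale=0.05, size=200),   # near-duplicate of f34
    "other": rng.normal(size=200),                    # unrelated column
})

# Absolute correlations, keeping only the upper triangle so that every
# pair appears once and the diagonal is ignored.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above the (assumed) threshold; one column of each pair is a
# candidate for dropping.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```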

In [None]:
class DroppingWeatherStationsI(BaseEstimator, TransformerMixin):    # Removing Weather Stations...
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):      # Removing the 5-6 week Temperature columns...
        X = X.drop(columns='nmme-tmp2m-56w__cancm3', axis=1)    # The name of each weather station
        X = X.drop(columns='nmme-tmp2m-56w__cancm4', axis=1)    # removed is separately provided...
        X = X.drop(columns='nmme-tmp2m-56w__ccsm3', axis=1)
        X = X.drop(columns='nmme-tmp2m-56w__ccsm4', axis=1)
        X = X.drop(columns='nmme-tmp2m-56w__cfsv2', axis=1)
        X = X.drop(columns='nmme-tmp2m-56w__gfdlflora', axis=1)
        X = X.drop(columns='nmme-tmp2m-56w__gfdlflorb', axis=1)
        X = X.drop(columns='nmme-tmp2m-56w__gfdl', axis=1)
        X = X.drop(columns='nmme-tmp2m-56w__nasa', axis=1)
        X = X.drop(columns='nmme-tmp2m-56w__nmmemean', axis=1)
        return X

In [None]:
class DroppingWeatherStationsII(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):      # Removing the 3-4 week temperature as well and keeping only the mean column in the dataset...
        X = X.drop(columns='nmme-tmp2m-34w__cancm3', axis=1)     
        X = X.drop(columns='nmme-tmp2m-34w__cancm4', axis=1)     # The name of each Weather Station removed
        X = X.drop(columns='nmme-tmp2m-34w__ccsm3', axis=1)      # is separately provided...
        X = X.drop(columns='nmme-tmp2m-34w__ccsm4', axis=1)
        X = X.drop(columns='nmme-tmp2m-34w__cfsv2', axis=1)
        X = X.drop(columns='nmme-tmp2m-34w__gfdlflora', axis=1)
        X = X.drop(columns='nmme-tmp2m-34w__gfdlflorb', axis=1)
        X = X.drop(columns='nmme-tmp2m-34w__gfdl', axis=1)
        X = X.drop(columns='nmme-tmp2m-34w__nasa', axis=1)    # The Mean column is not removed...
        return X

In [None]:
class DroppingWeatherStationsIII(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):     # Removing the NMME0 Temperature 3-4 week columns...
        X = X.drop(columns="nmme0-tmp2m-34w__cancm30", axis=1)
        X = X.drop(columns="nmme0-tmp2m-34w__cancm40", axis=1)
        X = X.drop(columns='nmme0-tmp2m-34w__ccsm30', axis=1)    # Name of each removed Weather Station is 
        X = X.drop(columns='nmme0-tmp2m-34w__ccsm40', axis=1)    # separately provided...
        X = X.drop(columns='nmme0-tmp2m-34w__cfsv20', axis=1)
        X = X.drop(columns='nmme0-tmp2m-34w__gfdlflora0', axis=1)
        X = X.drop(columns='nmme0-tmp2m-34w__gfdlflorb0', axis=1)
        X = X.drop(columns='nmme0-tmp2m-34w__gfdl0', axis=1)
        X = X.drop(columns='nmme0-tmp2m-34w__nasa0', axis=1)     # Keeping the Mean column as parameter...
        return X

In [None]:
class DroppingWeatherStationsIV(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):     # Removing the most recent Temperature forecast columns...
        X = X.drop(columns="cancm30", axis=1)
        X = X.drop(columns="cancm40", axis=1)
        X = X.drop(columns='ccsm30', axis=1)    # Name of each removed Weather Station is 
        X = X.drop(columns='ccsm40', axis=1)    # separately provided...
        X = X.drop(columns='cfsv20', axis=1)
        X = X.drop(columns='gfdlflora0', axis=1)
        X = X.drop(columns='gfdlflorb0', axis=1)
        X = X.drop(columns='gfdl0', axis=1)
        X = X.drop(columns='nasa0', axis=1)     # Keeping the Mean column as parameter...
        return X

#### We use a Pipeline to drop the weather stations that we do not need for training and testing our model.

In [None]:
Pipe1 = Pipeline([       # Pipeline for removing the Weather Stations created...
    ("DropStationsI", DroppingWeatherStationsI()),
    ("DropStationsII", DroppingWeatherStationsII()),
    ("DropStationsIII", DroppingWeatherStationsIII()),     # Integrating the required classes...
    ("DropStationsIV", DroppingWeatherStationsIV())
])
dataset = Pipe1.fit_transform(dataset)
dataset.shape
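
The four dropping classes above differ only in the list of column names. A single parameterised transformer could cover all of them. The sketch below uses a hypothetical `DropColumns` class, which is not part of scikit-learn or of this notebook, on a toy frame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class DropColumns(BaseEstimator, TransformerMixin):
    """Hypothetical helper: drop a fixed list of columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Raises KeyError if a listed column is missing, like the classes above.
        return X.drop(columns=self.columns)

# Toy frame with two forecast columns and one column to keep.
toy = pd.DataFrame({
    "nmme-tmp2m-56w__cancm3": [1.0],
    "nmme-tmp2m-56w__cancm4": [2.0],
    "lat": [0.5],
})
to_drop = [f"nmme-tmp2m-56w__{m}" for m in ["cancm3", "cancm4"]]

pipe = Pipeline([("Drop56w", DropColumns(to_drop))])
out = pipe.fit_transform(toy)
print(out.columns.tolist())
```

Each group of stations would then become one Pipeline step with its own list of names.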

#### The dataset is now reduced from 117 to 80 columns (the four classes above drop 10 + 9 + 9 + 9 = 37 columns).

In [None]:
data2 = pd.DataFrame()
data2['Temperature'] = data1['Temperature']
data2['ZonalWind250'] = dataset['contest-wind-uwnd-250-14d__wind-uwnd-250']
data2['ZonalWind925'] = dataset['contest-wind-uwnd-925-14d__wind-uwnd-925']
data2['LongitudeWind250'] = dataset['contest-wind-vwnd-250-14d__wind-vwnd-250']
data2['LongitudeWind925'] = dataset['contest-wind-vwnd-925-14d__wind-vwnd-925']
data2['Height10'] = dataset['contest-wind-h10-14d__wind-hgt-10']
data2['Height100'] = dataset['contest-wind-h100-14d__wind-hgt-100']
data2['Height500'] = dataset['contest-wind-h500-14d__wind-hgt-500']
data2['Height850'] = dataset['contest-wind-h850-14d__wind-hgt-850']
data2['Evaporation'] = data['Evaporation']

def PlottingChart2():
    plt.figure(figsize=(10, 6))
    sns.heatmap(data2.corr(), cmap="icefire", annot=True)
    plt.title("Correlation Heatmap of Computational Factors", size=18, c="red")
    plt.show()

PlottingChart2()

#### The contour lines form two dense regions, so the data mostly lies between heights of 16.2k to 16.4k (first set) and 16.6k to 16.7k (second set).

In [None]:
import plotly.express as px
df = data2
fig = px.density_contour(df, x="Height100", y="ZonalWind250", title="Height100 Millibars X Zonal Wind 250")
fig.update_traces(contours_coloring="fill", contours_showlabels = True, colorscale="ylgnbu_r")
fig.update_layout(width=900, height=550, font_family="Courier New", font_color="green",     title_font_family="Times New Roman", title_font_color="red", title_font_size=24, font_size=16)
fig.show()

#### The Contour lines are quite linear in shape and the distribution is pretty constant and enclosed within a single large group with Zonal Wind range (-1 to 2).

In [None]:
import plotly.express as px
df = data2
fig = px.density_contour(df, x="Height500", y="ZonalWind925", title="Height 500 Millibars X ZonalWind925")
fig.update_traces(contours_coloring="fill", contours_showlabels = True, colorscale="ylgnbu_r")
fig.update_layout(width=900, height=550, font_family="Courier New", font_color="green",     title_font_family="Times New Roman", title_font_color="red", title_font_size=24, font_size=16)
fig.show()

In [None]:
ElNino = pd.DataFrame()
# Creating the Mean of El Nino dataset of the 2010-2020 when it prevailed...
ElNino['Temperature'] = data1['Temperature']
ElNino['ElNino-Mean-LongitudeWind250'] = dataset['wind-vwnd-250-2010-mean']
ElNino['ElNino-Mean-LongitudeWind925'] = dataset['wind-vwnd-925-2010-mean']
ElNino['ElNino-Mean-ZonalWind250'] = dataset['wind-uwnd-250-2010-mean']
ElNino['ElNino-Mean-ZonalWind925'] = dataset['wind-uwnd-925-2010-mean']
ElNino['Mean-GlacierFactor'] = dataset['icec-2010-mean']
ElNino['ElNino-Mean-Height10'] = dataset['wind-hgt-10-2010-mean']
ElNino['ElNino-Mean-Height100'] = dataset['wind-hgt-100-2010-mean']
ElNino['ElNino-Mean-Height500'] = dataset['wind-hgt-500-2010-mean']
ElNino['ElNino-Mean-Height850'] = dataset['wind-hgt-850-2010-mean']
ElNino['Mean-SeaTemperature'] = dataset['sst-2010-mean']

def PlottingChartElNino():          # Plotting the Heatmap...
    plt.figure(figsize=(10, 6))
    sns.heatmap(ElNino.corr(), cmap="icefire", annot=True)
    plt.title("Correlation Heatmap of ElNino (2010-2020) to Environment Factors", size=22, c="red")
    plt.show()

ElNino['Month'] = dataset['Month']
PlottingChartElNino()

#### Now we will reduce the number of Precipitation rate columns, just as we did for the Temperature columns.

In [None]:
Y = []
precip = ['nmme-prate-34w__cancm3', 'nmme-prate-34w__cancm4','nmme-prate-34w__ccsm3','nmme-prate-34w__ccsm4','nmme-prate-34w__cfsv2', 'nmme-prate-34w__gfdlflora','nmme-prate-34w__gfdlflorb','nmme-prate-34w__gfdl','nmme-prate-34w__nasa', 'nmme-prate-34w__nmmemean']     # Checking for 3-4 weeks (same order as the NMME labels)...
for station in precip:
    Y.append(dataset[station].mean())
    Y.append(dataset[station].std())

In [None]:
Y1 = []
precip1 = ['nmme-prate-56w__cancm3', 'nmme-prate-56w__cancm4','nmme-prate-56w__ccsm3','nmme-prate-56w__ccsm4','nmme-prate-56w__cfsv2', 'nmme-prate-56w__gfdlflora','nmme-prate-56w__gfdlflorb','nmme-prate-56w__gfdl','nmme-prate-56w__nasa', 'nmme-prate-56w__nmmemean']     # Checking for 5-6 weeks (same order as the NMME labels)...
for station in precip1:
    Y1.append(dataset[station].mean())
    Y1.append(dataset[station].std())

In [None]:
Y2 = []
precip2 = ['nmme0-prate-34w__cancm30', 'nmme0-prate-34w__cancm40','nmme0-prate-34w__ccsm30','nmme0-prate-34w__ccsm40','nmme0-prate-34w__cfsv20', 'nmme0-prate-34w__gfdlflora0','nmme0-prate-34w__gfdlflorb0','nmme0-prate-34w__gfdl0','nmme0-prate-34w__nasa0', 'nmme0-prate-34w__nmme0mean']     # Checking for 3-4 weeks of most recent (NMME0), same order as the NMME labels...
for station in precip2:
    Y2.append(dataset[station].mean())
    Y2.append(dataset[station].std())

In [None]:
Y3 = []
precip3 = ['nmme0-prate-56w__cancm30', 'nmme0-prate-56w__cancm40','nmme0-prate-56w__ccsm30','nmme0-prate-56w__ccsm40','nmme0-prate-56w__cfsv20', 'nmme0-prate-56w__gfdlflora0','nmme0-prate-56w__gfdlflorb0','nmme0-prate-56w__gfdl0','nmme0-prate-56w__nasa0', 'nmme0-prate-56w__nmme0mean']     # Checking for 5-6 weeks of most recent (NMME0), same order as the NMME labels...
for station in precip3:
    Y3.append(dataset[station].mean())
    Y3.append(dataset[station].std())

In [None]:
precipitate = pd.DataFrame(Z, columns=["values"])
precipitate["Stations"] = NMME            # Stations list...
precipitate["Precip1"] = Y               # Column for 3-4 week mean precipitation...
precipitate['Precip2'] = Y1              # Column for 5-6 week mean precipitation...
precipitate["Precip3"] = Y2              # Column for 3-4 week most recent mean precipitation NMME0...
precipitate["Precip4"] = Y3              # Column for 5-6 week most recent mean precipitation NMME0...
precipitate.head(4)

In [None]:
fig = px.bar(precipitate, x='Stations', y='Precip1', color="values", title="Values for Various Forecast Stations for 3 to 4 Weeks Precipitation Rate", text_auto=True)
fig.update_layout(title_font_color="purple", title_font_size=24, font_color="green", font_size=18)
fig.show()

In [None]:
fig = px.bar(precipitate, x='Stations', y='Precip2', color="values", title="Values for Various Forecast Stations for 5 to 6 Weeks Precipitation Rate", text_auto=True)
fig.update_layout(title_font_color="purple", title_font_size=24, font_color="green", font_size=18)
fig.show()

#### The above two bar plots show nearly the same means and standard deviations, and the two groups are very highly correlated (0.99). Thus, we can remove either one of the above groups for Precipitation Rate.

In [None]:
fig = px.bar(precipitate, x='Stations', y='Precip3', color="values", title="Values for Various Forecast Stations for 3 to 4 Weeks Most Recent Precipitation Rate", text_auto=True)
fig.update_layout(title_font_color="purple", title_font_size=24, font_color="green", font_size=18)
fig.show()

In [None]:
fig = px.bar(precipitate, x='Stations', y='Precip4', color="values", title="Values for Various Forecast Stations for 5 to 6 Weeks Most Recent Precipitation Rate", text_auto=True)
fig.update_layout(title_font_color="purple", title_font_size=24, font_color="green", font_size=18)
fig.show()

#### The two bar plots above show almost the same means and standard deviations, and the two groups are highly correlated (0.92). Thus, we can drop either one of these most recent (NMME0) Precipitation Rate groups.

In [None]:
class PrecipitationDehydratedI(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):     # Removing the entire column set of 3-4 Week Precipitation...
        return X.drop(columns=[
            "nmme-prate-34w__cancm3",
            "nmme-prate-34w__cancm4",
            "nmme-prate-34w__ccsm3",
            "nmme-prate-34w__ccsm4",
            "nmme-prate-34w__cfsv2",
            "nmme-prate-34w__gfdl",
            "nmme-prate-34w__gfdlflora",
            "nmme-prate-34w__gfdlflorb",
            "nmme-prate-34w__nasa",
            "nmme-prate-34w__nmmemean",
        ])

#### For the 5-6 week group I am keeping two columns as features (cancm4 and ccsm3), one from the low end and one from the high end, since the means and standard deviations vary widely across the models in this case.

In [None]:
class PrecipitationDehydratedII(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):     # Keeping cancm4 and ccsm3 columns for parameters...
        return X.drop(columns=[
            "nmme-prate-56w__cancm3",
            "nmme-prate-56w__ccsm4",
            "nmme-prate-56w__cfsv2",
            "nmme-prate-56w__gfdl",
            "nmme-prate-56w__gfdlflora",
            "nmme-prate-56w__gfdlflorb",
            "nmme-prate-56w__nasa",
            "nmme-prate-56w__nmmemean",
        ])

In [None]:
class PrecipitationDehydratedIII(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):     # Removing the entire column set of 3-4 Week most recent Precipitation...
        return X.drop(columns=[
            "nmme0-prate-34w__cancm30",
            "nmme0-prate-34w__cancm40",
            "nmme0-prate-34w__ccsm30",
            "nmme0-prate-34w__ccsm40",
            "nmme0-prate-34w__cfsv20",
            "nmme0-prate-34w__gfdl0",
            "nmme0-prate-34w__gfdlflora0",
            "nmme0-prate-34w__gfdlflorb0",
            "nmme0-prate-34w__nasa0",
            "nmme0-prate-34w__nmme0mean",
        ])

#### For the most recent (NMME0) 5-6 week group I am again keeping two columns as features (cancm40 and ccsm30), one from the low end and one from the high end, since the means and standard deviations vary widely across the models in this case.

In [None]:
class PrecipitationDehydratedIV(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):     # Keeping cancm40 and ccsm30 columns for parameters...
        return X.drop(columns=[
            "nmme0-prate-56w__cancm30",
            "nmme0-prate-56w__ccsm40",
            "nmme0-prate-56w__cfsv20",
            "nmme0-prate-56w__gfdl0",
            "nmme0-prate-56w__gfdlflora0",
            "nmme0-prate-56w__gfdlflorb0",
            "nmme0-prate-56w__nasa0",
            "nmme0-prate-56w__nmme0mean",
        ])

#### We use another Pipeline to remove the insignificant or repetitive precipitation features.

In [None]:
Precipitation = Pipeline([     # Using the Pipeline to call for specific classes... 
    ("Dehydrate1", PrecipitationDehydratedI()),
    ("Dehydrate2", PrecipitationDehydratedII()),
    ("Dehydrate3", PrecipitationDehydratedIII()),
    ("Dehydrate4", PrecipitationDehydratedIV()),
])
dataset = Precipitation.fit_transform(dataset)
dataset.shape
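
#### The four transformers above differ only in the list of columns they drop, so the same pattern can be written once with the list passed as a parameter. The cell below is a sketch, not a replacement for the classes above: `ColumnDropper` is a hypothetical helper, `demo` is a tiny made-up frame with a hypothetical `target` column, and only four model names are used.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

# Mixin listed first, as the scikit-learn developer guide recommends.
class ColumnDropper(TransformerMixin, BaseEstimator):
    """Hypothetical helper: drop a fixed list of columns from a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self          # Nothing to learn, the column list is fixed.

    def transform(self, X):
        # errors="ignore" skips names that are already absent from X.
        return X.drop(columns=self.columns, errors="ignore")

# Tiny made-up frame with the same naming scheme as the dataset.
models = ["cancm3", "cancm4", "ccsm3", "ccsm4"]
demo = pd.DataFrame({
    f"nmme-prate-{horizon}__{m}": [0.1, 0.2]
    for horizon in ("34w", "56w") for m in models
})
demo["target"] = [1.0, 2.0]

keep = {"cancm4", "ccsm3"}   # Columns retained from the 5-6 week group.
pipe = Pipeline([
    ("drop_34w", ColumnDropper([f"nmme-prate-34w__{m}" for m in models])),
    ("drop_56w", ColumnDropper([f"nmme-prate-56w__{m}" for m in models if m not in keep])),
])
reduced = pipe.fit_transform(demo)
print(reduced.columns.tolist())
```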

In [None]:
for i in dataset.columns:
    print(i)

In [None]:
# Uncomment the line below only when you want to save the processed dataset...
#dataset.to_csv("E:/Downloads/updated_train_data.csv", index=False)