**Xtreme Weather Forecasting**

The task is to predict the arithmetic mean of the maximum and minimum temperature over the next 14 days, for each location and start date.
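
To make the target concrete, here is a minimal sketch of one reading of that definition: take the midpoint of each day's maximum and minimum, then average over the 14-day window. The daily readings are made up for illustration, and the names `tmax` and `tmin` are assumptions, not columns of the dataset.

```python
import numpy as np

# Hypothetical daily readings for one location over a 14-day window (degrees C).
rng = np.random.default_rng(0)
tmax = 20 + rng.normal(0, 2, size=14)   # daily maximum temperature
tmin = 10 + rng.normal(0, 2, size=14)   # daily minimum temperature

# Daily midpoint of max and min, averaged over the 14-day window.
target = ((tmax + tmin) / 2).mean()
print(round(target, 2))
```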

Dataset : https://www.kaggle.com/competitions/widsdatathon2023/data


In [None]:
"""
# mounting GDrive to connect csv file
from google.colab import drive
drive.mount('/content/gdrive')

"""

In [None]:
# import libraries
import numpy as np
import pandas as pd  
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
import math
from sklearn.impute import SimpleImputer
import missingno as msno  
#ignore warnings
import warnings
warnings.filterwarnings('ignore')


In [None]:
#import test dataset
#test_df = pd.read_csv('/content/gdrive/My Drive/Omdena project/test_data.csv')
test_df = pd.read_csv('test_data.csv')
test_df.head()

In [None]:
#import train dataset
#train_df = pd.read_csv('/content/gdrive/My Drive/Omdena project/train_data.csv')
train_df = pd.read_csv('train_data.csv')
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.shape

In [None]:
train_df.columns

**Description of Column Namings**

lat: latitude of location (anonymized)

lon: longitude of location (anonymized)

startdate: startdate of the 14 day period

sst: sea surface temperature

icec: sea ice concentration

cancm30, cancm40, ccsm30, ccsm40, cfsv20, gfdlflora0, gfdlflorb0, gfdl0, nasa0, nmme0mean: most recent forecasts from weather models

contest-slp-14d: file containing sea level pressure (slp)

nmme0-tmp2m-34w: file containing most recent monthly NMME model forecasts for tmp2m (cancm30,cancm40, ccsm30, ccsm40, cfsv20, gfdlflora0, gfdlflorb0, gfdl0, nasa0,nmme0mean) and average forecast across those models (nmme0mean)

contest-pres-sfc-gauss-14d: pressure

mjo1d: MJO phase and amplitude

contest-pevpr-sfc-gauss-14d: potential evaporation

contest-wind-h850-14d: geopotential height at 850 millibars

contest-wind-h500-14d: geopotential height at 500 millibars

contest-wind-h100-14d: geopotential height at 100 millibars

contest-wind-h10-14d: geopotential height at 10 millibars

contest-wind-vwnd-925-14d: longitudinal wind at 925 millibars

contest-wind-vwnd-250-14d: longitudinal wind at 250 millibars

contest-wind-uwnd-250-14d: zonal wind at 250 millibars

contest-wind-uwnd-925-14d: zonal wind at 925 millibars

contest-rhum-sig995-14d: relative humidity

contest-prwtr-eatm-14d: precipitable water for entire atmosphere

nmme-prate-34w: weeks 3-4 weighted average of monthly NMME model forecasts for precipitation

nmme-prate-56w: weeks 5-6 weighted average of monthly NMME model forecasts for precipitation

nmme0-prate-56w: weeks 5-6 weighted average of most recent monthly NMME model forecasts for precipitation

nmme0-prate-34w: weeks 3-4 weighted average of most recent monthly NMME model forecasts for precipitation

nmme-tmp2m-34w: weeks 3-4 weighted average of most recent monthly NMME model forecasts for target label, contest-tmp2m-14d__tmp2m

nmme-tmp2m-56w: weeks 5-6 weighted average of monthly NMME model forecasts for target label, contest-tmp2m-14d__tmp2m

mei: MEI (mei), MEI rank (rank), and Niño Index Phase (nip)

elevation: elevation

contest-precip-14d: measured precipitation

climateregions: Köppen-Geiger climate classifications



**Target Variable**

In [None]:
target = [c for c in train_df.columns if c not in test_df.columns][0]
print(target)

**CHECKING MISSING VALUES**

In [None]:
def filter_na_cols(df):
    count_na_df = df.isna().sum() 
    if count_na_df[count_na_df > 0].tolist():
        return count_na_df[count_na_df > 0]
    else:
        return 'Clean dataset'

In [None]:
filter_na_cols(train_df)

There are 8 columns with missing data.

In [None]:
filter_na_cols(train_df).sort_values().plot(kind='barh');

In [None]:
null_values = train_df.isnull()
sns.heatmap(null_values,cbar=False)
plt.show()

In [None]:
train_df['startdate'] = pd.to_datetime(train_df['startdate'])
train_df['month_year'] = train_df['startdate'].dt.to_period("M").astype(str)
# Plot the CCSM30-related columns over time to see where values are missing
ccsm30_cols = ['nmme0-prate-34w__ccsm30','nmme0-prate-56w__ccsm30','nmme0-tmp2m-34w__ccsm30','ccsm30']
# set_index returns a new frame, which avoids modifying a slice of train_df in place
ccsm30_missing_df = train_df[ccsm30_cols + ['startdate']].set_index('startdate')
ax1 = ccsm30_missing_df.plot(figsize=(16,5),linewidth=2, fontsize=12)
ax1.set_title('CCSM30 Data')

In [None]:
# Sort the date index first so that the date-range slice is well defined
ax1 = ccsm30_missing_df.sort_index().loc['2015-09-01':'2015-09-30'].plot(figsize=(16,5),linewidth=2, fontsize=12)
ax1.set_title('CCSM30 Data')

The missing values come mainly from the NMME CCSM30 model columns, and most of them fall within the same period. The graphs above show that there is no data from 5 September to 25 September 2015.
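
The gap can also be located without reading it off a plot. The sketch below uses a toy daily series with a deliberately blanked-out window (the values are made up, and `toy` is a hypothetical frame, not the competition data) and reports the first and last dates on which the column is null.

```python
import numpy as np
import pandas as pd

# Toy daily series with a deliberately blanked-out window (values are made up).
toy = pd.DataFrame({
    "startdate": pd.date_range("2015-09-01", "2015-09-30", freq="D"),
    "ccsm30": np.linspace(10.0, 12.9, 30),
})
toy.loc[toy["startdate"].between("2015-09-05", "2015-09-25"), "ccsm30"] = np.nan

# Dates on which the column is null, and the span they cover.
missing_dates = toy.loc[toy["ccsm30"].isna(), "startdate"]
gap_start, gap_end = missing_dates.min(), missing_dates.max()
print(gap_start.date(), gap_end.date(), len(missing_dates))
```

The same three lines at the end should work on `train_df` with any of the columns that contain nulls.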

In [None]:
filter_na_cols(test_df)


There are no missing values in the test dataset.

**HANDLING THE MISSING VALUES**

In [None]:
def impute_number_col(df):
    null_col = ['nmme0-tmp2m-34w__ccsm30',
    'nmme-tmp2m-56w__ccsm3',
    'nmme-prate-34w__ccsm3',
    'nmme0-prate-56w__ccsm30',
    'nmme0-prate-34w__ccsm30',
    'nmme-prate-56w__ccsm3',
    'nmme-tmp2m-34w__ccsm3',
    'ccsm30']
    number_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
    fixed_column_df = number_imputer.fit_transform(df[null_col])
    df[null_col] = fixed_column_df
    return df

Here we know that the missing values are a case of NMAR (Not Missing At Random), and only a small amount of data is missing, so we replace the null values with the median of each column.
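
As a small illustration of what median imputation does, the sketch below fills the gaps in a toy frame using plain pandas rather than `SimpleImputer`; for numeric columns the result should match the `strategy='median'` setting. The frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column; 100.0 is an intentional outlier.
toy = pd.DataFrame({
    "ccsm30": [1.0, np.nan, 3.0, 100.0],
    "other": [4.0, 5.0, np.nan, 7.0],
})

medians = toy.median()          # per-column median, NaNs are skipped
imputed = toy.fillna(medians)   # each NaN is replaced by its own column's median
print(imputed)
```

The gap in `ccsm30` is filled with 3.0, the median of the observed values, whereas the column mean would be pulled up to about 34.7 by the outlier.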

**CONVERTING STARTDATE INTO DATETIME FORMAT**

In [None]:
train_df['startdate'] = pd.to_datetime(train_df['startdate'])


In [None]:
train_df['year'] = pd.DatetimeIndex(train_df['startdate']).year
print (train_df['year'].unique())

In [None]:
train_df['startdate'] = pd.to_datetime(train_df['startdate']) 
train_df['month_year'] = train_df['startdate'].dt.to_period("M").astype(str)
print (train_df['month_year'])

**COMBINING LATITUDE AND LONGITUDE TO GET LOCATION**

In [None]:
train_df['location'] = train_df['lat'].map(str) + ' , ' + train_df['lon'].map(str)
print (train_df['location'])

**UNIQUE LOCATIONS**

In [None]:
train_df['location'].unique()



In [None]:
print(train_df.location.value_counts())

There are 514 unique locations in the dataset.

**LOCATION SCATTER PLOT**

In [None]:
plt.scatter(train_df['lon'], train_df['lat'])
plt.show()

Since we know the locations are in the US, the coordinates above resemble the Gulf Coast and the Southeast of the US.

**LOCATIONS IN TEST DATA**

In [None]:
test_df['location'] = test_df['lat'].map(str) + ' , ' + test_df['lon'].map(str)
print (test_df['location'])

In [None]:
test_df['location'].unique()


In [None]:
print(test_df.location.value_counts())

There are 514 unique locations in the test data.

In [None]:
scale=14
train_df.loc[:,'lat']=round(train_df.lat,scale)
train_df.loc[:,'lon']=round(train_df.lon,scale)

test_df.loc[:,'lat']=round(test_df.lat,scale)
test_df.loc[:,'lon']=round(test_df.lon,scale)

all_df = pd.concat([train_df, test_df], axis=0)

all_df['loc_group'] = all_df.groupby(['lat','lon']).ngroup()
print(f'{all_df.loc_group.nunique()} unique locations')



In [None]:
train_df = all_df.iloc[:len(train_df)]
test_df = all_df.iloc[len(train_df):]

locations_train=list(train_df.loc_group.unique())
locations_test=list(test_df.loc_group.unique())
result_1 = list(set(locations_train).difference(locations_test))
print(result_1)

result_2=list(set(locations_test).difference(locations_train))
print(result_2)

All the locations in test data are in train data.

**TEMPERATURE TIME SERIES**

In [None]:
train_df.plot(x="startdate", y="contest-tmp2m-14d__tmp2m")
plt.show()

This shows a typical seasonal temperature pattern.

In [None]:
fig, ax = plt.subplots(figsize = (4,4))

sns.boxplot(data=train_df, x='year', y='contest-tmp2m-14d__tmp2m')

There is an increase in mean temperature values each year.

In [None]:
df= pd.DataFrame()   
df['Temperature'] = train_df.groupby('year')['contest-tmp2m-14d__tmp2m'].mean()
df.plot(y='Temperature', kind='line',figsize=(10, 5))

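Each year the temperature rises by 2-4 degrees. This comparison only holds if every year covers the same months; a year that contains only part of the seasonal cycle will have a shifted mean.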
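
A yearly mean is only comparable across years when each year contributes the same months. A quick way to check is to count the distinct months per calendar year. The sketch below does this on a toy frame; its date range is an assumption chosen for illustration and should be replaced by the real `startdate` column.

```python
import pandas as pd

# Toy frame: a made-up date range that starts and ends mid-year.
toy = pd.DataFrame({"startdate": pd.date_range("2014-09-01", "2016-08-31", freq="D")})
toy["year"] = toy["startdate"].dt.year
toy["month"] = toy["startdate"].dt.month

# Number of distinct months observed in each calendar year.
months_per_year = toy.groupby("year")["month"].nunique()
print(months_per_year)
```

If the counts differ between years, comparing the same months across years (for example, September to December only) gives a fairer picture than the raw yearly mean.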

**CLIMATE REGION**

Abbreviations of Climate regions 

BSh: Hot semi-arid climate

BSk: Cold semi-arid climate

BWh: Hot desert climate

BWk: Cold desert climate

Cfa: Humid subtropical climate

Cfb: Temperate oceanic climate or subtropical highland climate

Csa: Hot-summer Mediterranean climate

Csb: Warm-summer Mediterranean climate

Dfa: Hot-summer humid continental climate

Dfb: Warm-summer humid continental climate

Dfc: Subarctic climate

Dsb: Mediterranean-influenced warm-summer humid continental climate

Dsc: Mediterranean-influenced subarctic climate

Dwa: Monsoon-influenced hot-summer humid continental climate

Dwb: Monsoon-influenced warm-summer humid continental climate

**PLOTTING CLIMATE REGIONS**

In [None]:
plt.figure(figsize=(10,8))
plt.xlabel("Longitude")
plt.ylabel("Latitude")
sns.scatterplot(x='lon', y='lat',hue ="climateregions__climateregion",data=train_df).set(title ="Train_Location")



**Climate Region Counts** 

In [None]:
climate_regions = train_df.groupby('climateregions__climateregion').size().reset_index(name='counts')
climate_regions['percentage'] = climate_regions['counts'] / climate_regions['counts'].sum() * 100
print(climate_regions)

In [None]:
train_df['climateregions__climateregion'].value_counts().sort_values().plot(kind='bar', figsize=(10,4), rot=0)


In [None]:
test_df['climateregions__climateregion'].value_counts().sort_values().plot(kind='bar', figsize=(10,4), rot=0)

The distributions of climate regions in the train data and the test data are very similar.
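
The visual comparison can be backed by numbers by placing the normalized counts side by side. This is a minimal sketch on two hypothetical label series; with the real data, `train_regions` and `test_regions` would be the `climateregions__climateregion` columns of the two frames.

```python
import pandas as pd

# Hypothetical region labels standing in for the train and test columns.
train_regions = pd.Series(["BSk", "BSk", "Cfa", "Dfb", "BSk", "Cfa"])
test_regions = pd.Series(["BSk", "Cfa", "Dfb", "BSk"])

# Share of rows per region in each set, aligned on the region label.
comparison = pd.DataFrame({
    "train": train_regions.value_counts(normalize=True),
    "test": test_regions.value_counts(normalize=True),
}).fillna(0.0)
comparison["abs_diff"] = (comparison["train"] - comparison["test"]).abs()
print(comparison.sort_values("abs_diff", ascending=False))
```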

**RELATIONSHIP BETWEEN THE VARIABLES BASED ON CLIMATIC REGIONS**

**TEMPERATURE VS HUMIDITY**

In [None]:
sample_df=train_df.sample(int(0.1*len(train_df)))
plt.figure(figsize=(10,8))
plt.xlabel("Temperature")
plt.ylabel("Humidity")
sns.scatterplot(x='contest-tmp2m-14d__tmp2m', y='contest-rhum-sig995-14d__rhum',hue ="climateregions__climateregion",data=sample_df).set(title ="Temperature Vs Humidity")

**PRECIPITATION VS HUMIDITY**

In [None]:
sample_df=train_df.sample(int(0.1*len(train_df)))
plt.figure(figsize=(10,8))
plt.xlabel("Humidity")
plt.ylabel("Precipitation")
sns.scatterplot( x='contest-rhum-sig995-14d__rhum',y='contest-precip-14d__precip',hue ="climateregions__climateregion",data = sample_df).set(title ="PRECIPITATION VS HUMIDITY")

Findings:

1. The lower the humidity, the higher the temperature.
2. The higher the humidity, the higher the precipitation.

In other words, lower temperature goes with higher humidity and higher precipitation, and vice versa.
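
Relationships like these can be quantified per climate region with a Pearson correlation. The sketch below uses synthetic data in which temperature is constructed to fall as humidity rises, so the numbers only illustrate the method and say nothing about the real dataset. The column names are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "region": np.repeat(["BSk", "Cfa"], n // 2),
    "humidity": rng.uniform(20, 90, size=n),
})
# Synthetic temperature that falls as humidity rises, plus noise.
toy["temperature"] = 30 - 0.2 * toy["humidity"] + rng.normal(0, 1.5, size=n)

# Pearson correlation between temperature and humidity within each region.
per_region_corr = pd.Series({
    region: group["temperature"].corr(group["humidity"])
    for region, group in toy.groupby("region")
})
print(per_region_corr)
```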

In [None]:

sample_df=train_df.sample(int(0.01*len(train_df)))
# select columns for pair plot
cols = ['contest-tmp2m-14d__tmp2m', 'contest-rhum-sig995-14d__rhum', 'contest-precip-14d__precip','contest-wind-h10-14d__wind-hgt-10']

# create pair plot with selected columns
sns.pairplot(sample_df[cols])

# show plot
plt.show()

The pairplot above shows how each environmental factor relates to the others. The factors considered in the plot are temperature, humidity, geopotential height at 10 millibars (the wind-hgt-10 column), and precipitation.

**CORRELATION BETWEEN THE VARIABLES**

In [None]:
data = pd.DataFrame()    # creating a dataset of environmental factors...
data['Temperature'] = train_df['contest-tmp2m-14d__tmp2m']
data['evaporation'] = train_df['contest-pevpr-sfc-gauss-14d__pevpr']
data['humidity'] = train_df['contest-rhum-sig995-14d__rhum']
data['sealevel_pressure'] = train_df['contest-slp-14d__slp']
data['pressure'] = train_df['contest-pres-sfc-gauss-14d__pres']
data['wind'] = train_df['contest-wind-h10-14d__wind-hgt-10']   # geopotential height at 10 millibars
data['elevation'] = train_df['elevation__elevation']
data['precipitation'] = train_df['contest-precip-14d__precip']




In [None]:
def Plotting():          # Plotting the Heatmap...
    plt.figure(figsize=(10, 6))
    sns.heatmap(data.corr(), cmap='inferno', annot=True)
    plt.title("Correlation of Target Variable with Main Features ", size=18, c="red")
    plt.show()

Plotting()

The heatmap above shows the correlations between the various factors and their relationship with temperature (the target variable).


In [None]:
data1 = pd.DataFrame()    # creating a dataset of temperature and precipitation forecasts at 3-4 weeks and 5-6 weeks...
data1['MostrecentMeanTemperature34'] = train_df['nmme0-tmp2m-34w__nmme0mean']
data1['MeanPrecipitation34'] = train_df['nmme-prate-34w__nmmemean']
data1['MeanPrecipitation56'] = train_df['nmme-prate-56w__nmmemean']
data1['MostrecentPrecipitation56'] = train_df['nmme0-prate-56w__nmme0mean']
data1['MostrecentPrecipitation34'] = train_df['nmme0-prate-34w__nmme0mean']
data1['MeanTemperature34'] = train_df['nmme-tmp2m-34w__nmmemean']
data1['MeanTemperature56'] = train_df['nmme-tmp2m-56w__nmmemean']
data1['Mostrecentforecast'] = train_df['nmme0mean']
data1['Targetvariable'] = train_df['contest-tmp2m-14d__tmp2m']


In [None]:
def Plotting1():          # Plotting the Heatmap...
    plt.figure(figsize=(12,5))
    sns.heatmap(data1.corr(), cmap="YlGnBu", annot=True)
    plt.title("Correlation of Temperature with computational factors", size=18, c="blue")
    plt.show()

Plotting1()

In the plot above, we can see that the NMME forecasts of the target variable for weeks 3-4 and weeks 5-6 are almost identical, and the same applies to the precipitation forecasts. Since the weeks 3-4 and weeks 5-6 columns are highly correlated with each other, we can drop one of each pair and thus reduce the number of columns.
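
One common way to act on this is to scan the upper triangle of the absolute correlation matrix and drop the second column of every pair above a cut-off. The sketch below shows the idea on synthetic columns. The column names and the 0.95 threshold are assumptions for illustration, not values taken from this analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "tmp_34w": base,
    "tmp_56w": base + rng.normal(scale=0.05, size=300),  # near-duplicate of tmp_34w
    "prate_34w": rng.normal(size=300),                   # unrelated column
})

threshold = 0.95  # arbitrary cut-off chosen for illustration
corr = toy.corr().abs()
# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Which member of a pair to keep is a judgment call. This sketch keeps whichever column appears first in the frame.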

In [None]:
# Drop the 'index' column from the training data
updated_df = train_df.drop(columns='index')