# Will it rain?
Predict whether it will rain, based on meteorological measurements.

![rozmarne_leto.jpg](attachment:b1dbc785-f1ea-42f9-adfe-637791fffdf6.jpg)

## Evaluation
Submissions are evaluated on area under the [ROC curve](https://www.youtube.com/watch?v=4jRBRDbJemM) between the predicted probability and the observed target.
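
To make the metric concrete, here is a minimal sketch of what the area under the ROC curve measures: the fraction of (rainy, dry) pairs in which the rainy row receives the higher predicted probability, with ties counted as half. The helper `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """Pairwise definition of ROC AUC (hypothetical helper, fine for small inputs)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]  # scores of rows where it rained
    neg = y_score[y_true == 0]  # scores of rows where it did not
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum()
    ties = (diff == 0).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 3 of the 4 (rainy, dry) pairs are ranked correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```

Only the ranking of the predictions matters, not their absolute scale. For real use, `sklearn.metrics.roc_auc_score` computes the same quantity without building the full pairwise matrix.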

## Dataset

**train.csv** - the training dataset; rainfall is the binary target\
**test.csv** - the test dataset; the objective is to predict the probability of rainfall for each row\
**sample_submission.csv** - a sample submission file in the correct format\
**rainfall.csv** - the original data used to train the deep learning model that generated the competition datasets


## Initialize Notebook and Load Data

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np

import warnings
warnings.filterwarnings("ignore")

# Keep important settings/configuration in one place
class Config:
    
    # datasets
    train_path = '/kaggle/input/playground-series-s5e3/train.csv'
    test_path = '/kaggle/input/playground-series-s5e3/test.csv'
    sample_submission_path = '/kaggle/input/playground-series-s5e3/sample_submission.csv'
    original_data_path = '/kaggle/input/rainfall-prediction-using-machine-learning/Rainfall.csv'  

    target = 'rainfall'
    random_state = 2

In [None]:
# Load Data
train = pd.read_csv(Config.train_path, index_col = 'id')
test = pd.read_csv(Config.test_path, index_col = 'id')
submission = pd.read_csv(Config.sample_submission_path, index_col = 'id')
original_data = pd.read_csv(Config.original_data_path)

print('train:',train.shape)
print('test:',test.shape)
print('submission:', submission.shape)
print('original_data:',original_data.shape)

# # combine train and original dataset
# train = pd.concat([train,original_data], ignore_index=True)
# print('combined train:',train.shape)

display(train.head())

We have the following columns:
- **day**: The specific day of observation.
- **pressure**: Atmospheric pressure measured in hPa.
- **maxtemp**: Maximum recorded temperature for the day (°C).
- **temparature**: Average temperature for the day (°C).
- **mintemp**: Minimum recorded temperature for the day (°C).
- **dewpoint**: Temperature at which air becomes saturated with moisture (°C).
- **humidity**: Relative humidity percentage.
- **cloud**: Cloud cover percentage.
- **sunshine**: Total hours of sunshine for the day.
- **winddirection**: Direction of the wind in degrees.
- **windspeed**: Wind speed measured in km/h.

Target:
- **rainfall**: Did it rain? Boolean.

## Data Completeness

In [None]:
# Count of NaN values per column
nans_train = train.isna().sum()
nans_test = test.isna().sum()

# Only show columns with at least one NaN
display(nans_train[nans_train > 0])
display(nans_test[nans_test > 0])

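There is only one missing value in the test dataset. Every test row needs a prediction in the submission, so instead of dropping the row we fill the gap with the column median.

In [None]:
# Fill the single missing test value with its column median (keeps all rows for the submission)
test = test.fillna(test.median(numeric_only=True))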

## Distribution and Correlation

In [None]:
sns.pairplot(train, hue=Config.target, corner=True);
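
The pair plot shows the pairwise relationships visually; a numeric complement is the correlation of each feature with the target. The sketch below uses a small synthetic frame (`toy`, with made-up relationships between humidity, cloud, sunshine, and rainfall) as a stand-in for `train`, since the real values are not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `train` (assumption: invented relationships, not real data)
rng = np.random.default_rng(0)
n = 500
humidity = rng.uniform(40, 100, n)
cloud = np.clip(humidity + rng.normal(0, 10, n), 0, 100)
sunshine = np.clip(12 - cloud / 10 + rng.normal(0, 1, n), 0, None)
rainfall = (cloud + rng.normal(0, 15, n) > 75).astype(int)

toy = pd.DataFrame({
    'humidity': humidity,
    'cloud': cloud,
    'sunshine': sunshine,
    'rainfall': rainfall,
})

# Pearson correlation of every feature with the binary target, sorted
target_corr = toy.corr()['rainfall'].drop('rainfall').sort_values()
print(target_corr)
```

On the real data the same line would read `train.corr()[Config.target].drop(Config.target)`, assuming every column is numeric.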

In [None]:
# compare distributions of train and test datasets
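
One simple way to compare train and test is to put per-feature summary statistics side by side and express the gap between the means in units of the train standard deviation. This is only a sketch under stated assumptions: `toy_train` and `toy_test` are synthetic stand-ins for `train` and `test`, and the `windspeed` shift is injected on purpose to show what a mismatch looks like.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for `train` and `test` (assumption: invented values)
rng = np.random.default_rng(0)
toy_train = pd.DataFrame({
    'humidity': rng.normal(80, 8, 1000),
    'windspeed': rng.normal(20, 5, 1000),
})
toy_test = pd.DataFrame({
    'humidity': rng.normal(80, 8, 600),
    'windspeed': rng.normal(25, 5, 600),  # deliberately shifted
})

# Only compare columns present in both frames (the target is absent from test)
features = [c for c in toy_test.columns if c in toy_train.columns]

summary = pd.DataFrame({
    'train_mean': toy_train[features].mean(),
    'test_mean': toy_test[features].mean(),
    'train_std': toy_train[features].std(),
    'test_std': toy_test[features].std(),
})
# Difference of means measured in train standard deviations
summary['std_mean_diff'] = (summary['test_mean'] - summary['train_mean']) / summary['train_std']
print(summary.round(2))
```

Values of `std_mean_diff` near zero suggest the two sets are similarly centred for that feature; overlaid histograms or KDE plots per feature would give a fuller picture of the shapes.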
