# 1. Understanding the Data Set: A Historical Overview

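Consider a weather dataset with 7750 instances and 22 descriptive features. It covers the weather station number, date, observed air temperatures, and next-day forecasts of humidity, air temperature, wind speed, latent heat flux, cloud cover, and precipitation, along with geographical information. The forecasts come from the Local Data Assimilation and Prediction System (LDAPS) model. The CSV file loaded in Section 4 has 7752 rows and 25 columns: station and Date are each missing in two rows, and three columns (Solar radiation, Next_Tmax, Next_Tmin) are not covered in the feature description that follows.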


# 2. Dataset - Features (Columns) Description
    • station: Weather station number (1 to 25)
    • Date: Present day in the format yyyy-mm-dd ('2013-06-30' to '2017-08-30')
    • Present_Tmax: Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6
    • Present_Tmin: Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9
    • LDAPS_RHmin: LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5
    • LDAPS_RHmax: LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100
    • LDAPS_Tmax_lapse: LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5
    • LDAPS_Tmin_lapse: LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6
    • LDAPS_WS: LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9
    • LDAPS_LH: LDAPS model forecast of next-day average latent heat flux (W/m²): -13.6 to 213.4
    • LDAPS_CC1 to LDAPS_CC4: LDAPS model forecast of next-day cloud cover for different 6-hour split intervals (0-5 h to 18-23 h): 0 to 0.98
    • LDAPS_PPT1 to LDAPS_PPT4: LDAPS model forecast of next-day precipitation for different 6-hour split intervals (0-5 h to 18-23 h): 0 to 23.7
    • lat: Latitude (°): 37.456 to 37.645
    • lon: Longitude (°): 126.826 to 127.135
    • DEM: Elevation (m): 12.4 to 212.3
    • Slope: Slope (°): 0.1 to 5.2


# 3. Importing the Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# 4. Exploring the Dataset

Load the dataset

In [2]:
df = pd.read_csv('../Lab 1/Bias_correction_ucl.csv')
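
If the `Date` column is needed as a real datetime later on, `pd.read_csv` can parse it while loading. The following is only a minimal sketch: `sample_csv` is a hypothetical in-memory stand-in for the real file, holding two rows copied from the head of the dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: two rows copied from the dataset head.
sample_csv = io.StringIO(
    "station,Date,Present_Tmax,Present_Tmin\n"
    "1.0,2013-06-30,28.7,21.4\n"
    "2.0,2013-06-30,31.9,21.6\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
sample = pd.read_csv(sample_csv, parse_dates=["Date"])
print(sample.dtypes)
```

With the real file, the same `parse_dates=["Date"]` argument could be passed alongside the path used in In [2].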

Display the dimension, shape, size and attributes type

    • Dimensions: df.ndim gives the number of axes (2 for a DataFrame)
    • Shape: df.shape gives a tuple with the number of rows and columns
    • Size: df.size gives the total number of elements (rows × columns)
    • Attributes type: type(df) returns the class of the object itself; the data type of each column is available from df.dtypes

In [3]:
print("\nDimension: ", df.ndim)
print("Shape: ", df.shape)
print("Size: ", df.size)
print("Attributes Type: ", type(df))


Dimension:  2
Shape:  (7752, 25)
Size:  193800
Attributes Type:  <class 'pandas.core.frame.DataFrame'>
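
To see the data type of each column rather than of the frame as a whole, `dtypes` is the usual tool. A small sketch on a hypothetical two-row frame (`demo`) that reuses three column names from the dataset:

```python
import pandas as pd

# Hypothetical two-row frame reusing column names from the dataset.
demo = pd.DataFrame({
    "station": [1.0, 2.0],
    "Date": ["2013-06-30", "2013-06-30"],
    "Present_Tmax": [28.7, 31.9],
})

print(type(demo))   # class of the container
print(demo.dtypes)  # one dtype per column
```

`demo.info()` prints the same column types together with non-null counts, which is handy before checking for missing values.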


Display the first few rows.

In [4]:
df.head(10)

Unnamed: 0,station,Date,Present_Tmax,Present_Tmin,LDAPS_RHmin,LDAPS_RHmax,LDAPS_Tmax_lapse,LDAPS_Tmin_lapse,LDAPS_WS,LDAPS_LH,...,LDAPS_PPT2,LDAPS_PPT3,LDAPS_PPT4,lat,lon,DEM,Slope,Solar radiation,Next_Tmax,Next_Tmin
0,1.0,2013-06-30,28.7,21.4,58.255688,91.116364,28.074101,23.006936,6.818887,69.451805,...,0.0,0.0,0.0,37.6046,126.991,212.335,2.785,5992.895996,29.1,21.2
1,2.0,2013-06-30,31.9,21.6,52.263397,90.604721,29.850689,24.035009,5.69189,51.937448,...,0.0,0.0,0.0,37.6046,127.032,44.7624,0.5141,5869.3125,30.5,22.5
2,3.0,2013-06-30,31.6,23.3,48.690479,83.973587,30.091292,24.565633,6.138224,20.57305,...,0.0,0.0,0.0,37.5776,127.058,33.3068,0.2661,5863.555664,31.1,23.9
3,4.0,2013-06-30,32.0,23.4,58.239788,96.483688,29.704629,23.326177,5.65005,65.727144,...,0.0,0.0,0.0,37.645,127.022,45.716,2.5348,5856.964844,31.7,24.3
4,5.0,2013-06-30,31.4,21.9,56.174095,90.155128,29.113934,23.48648,5.735004,107.965535,...,0.0,0.0,0.0,37.5507,127.135,35.038,0.5055,5859.552246,31.2,22.5
5,6.0,2013-06-30,31.9,23.5,52.437126,85.307251,29.219342,23.822613,6.182295,50.231389,...,0.0,0.0,0.0,37.5102,127.042,54.6384,0.1457,5873.780762,31.5,24.0
6,7.0,2013-06-30,31.4,24.4,56.287189,81.01976,28.551859,24.238467,5.587135,125.110007,...,0.0,0.0,0.0,37.5776,126.838,12.37,0.0985,5849.233398,30.9,23.4
7,8.0,2013-06-30,32.1,23.6,52.326218,78.004539,28.851982,23.819054,6.104417,42.011547,...,0.0,0.0,0.0,37.4697,126.91,52.518,1.5629,5863.992188,31.1,22.9
8,9.0,2013-06-30,31.4,22.0,55.338791,80.784607,28.426975,23.332373,6.017135,85.110971,...,0.0,0.0,0.0,37.4967,126.826,50.9312,0.4125,5876.901367,31.3,21.6
9,10.0,2013-06-30,31.6,20.5,56.651203,86.849632,27.576705,22.527018,6.518841,63.006075,...,0.0,0.0,0.0,37.4562,126.955,208.507,5.1782,5893.608398,30.5,21.0


Provide summary statistics for key features

In [5]:
df.describe()

Unnamed: 0,station,Present_Tmax,Present_Tmin,LDAPS_RHmin,LDAPS_RHmax,LDAPS_Tmax_lapse,LDAPS_Tmin_lapse,LDAPS_WS,LDAPS_LH,LDAPS_CC1,...,LDAPS_PPT2,LDAPS_PPT3,LDAPS_PPT4,lat,lon,DEM,Slope,Solar radiation,Next_Tmax,Next_Tmin
count,7750.0,7682.0,7682.0,7677.0,7677.0,7677.0,7677.0,7677.0,7677.0,7677.0,...,7677.0,7677.0,7677.0,7752.0,7752.0,7752.0,7752.0,7752.0,7725.0,7725.0
mean,13.0,29.768211,23.225059,56.759372,88.374804,29.613447,23.512589,7.097875,62.505019,0.368774,...,0.485003,0.2782,0.269407,37.544722,126.991397,61.867972,1.257048,5341.502803,30.274887,22.93222
std,7.211568,2.969999,2.413961,14.668111,7.192004,2.947191,2.345347,2.183836,33.730589,0.262458,...,1.762807,1.161809,1.206214,0.050352,0.079435,54.27978,1.370444,429.158867,3.12801,2.487613
min,1.0,20.0,11.3,19.794666,58.936283,17.624954,14.272646,2.88258,-13.603212,0.0,...,0.0,0.0,0.0,37.4562,126.826,12.37,0.098475,4329.520508,17.4,11.3
25%,7.0,27.8,21.7,45.963543,84.222862,27.673499,22.089739,5.678705,37.266753,0.146654,...,0.0,0.0,0.0,37.5102,126.937,28.7,0.2713,4999.018555,28.2,21.3
50%,13.0,29.9,23.4,55.039024,89.79348,29.703426,23.760199,6.54747,56.865482,0.315697,...,0.0,0.0,0.0,37.5507,126.995,45.716,0.618,5436.345215,30.5,23.1
75%,19.0,32.0,24.9,67.190056,93.743629,31.71045,25.152909,8.032276,84.223616,0.575489,...,0.018364,0.007896,4.1e-05,37.5776,127.042,59.8324,1.7678,5728.316406,32.6,24.6
max,25.0,37.6,29.9,98.524734,100.000153,38.542255,29.619342,21.857621,213.414006,0.967277,...,21.621661,15.841235,16.655469,37.645,127.135,212.335,5.17823,5992.895996,38.9,29.8
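
`df.describe()` summarises every numeric column. To focus on a handful of key features, one option is to select them first. The sketch below uses a hypothetical three-row sample copied from the dataset head, and the choice of which columns count as "key" is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical sample: three rows copied from the dataset head.
demo = pd.DataFrame({
    "station": [1.0, 2.0, 3.0],
    "Present_Tmax": [28.7, 31.9, 31.6],
    "Present_Tmin": [21.4, 21.6, 23.3],
    "Next_Tmax": [29.1, 30.5, 31.1],
})

# Which columns are "key" is an assumption for this example.
key_features = ["Present_Tmax", "Present_Tmin", "Next_Tmax"]

# Transpose so that each feature becomes one row of the summary.
summary = demo[key_features].describe().T
print(summary[["mean", "min", "max"]])
```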


Identify and handle any missing values or outliers.

In [6]:
df.isnull()

Unnamed: 0,station,Date,Present_Tmax,Present_Tmin,LDAPS_RHmin,LDAPS_RHmax,LDAPS_Tmax_lapse,LDAPS_Tmin_lapse,LDAPS_WS,LDAPS_LH,...,LDAPS_PPT2,LDAPS_PPT3,LDAPS_PPT4,lat,lon,DEM,Slope,Solar radiation,Next_Tmax,Next_Tmin
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7747,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7748,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7749,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7750,True,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [7]:
df.isnull().sum()

station              2
Date                 2
Present_Tmax        70
Present_Tmin        70
LDAPS_RHmin         75
LDAPS_RHmax         75
LDAPS_Tmax_lapse    75
LDAPS_Tmin_lapse    75
LDAPS_WS            75
LDAPS_LH            75
LDAPS_CC1           75
LDAPS_CC2           75
LDAPS_CC3           75
LDAPS_CC4           75
LDAPS_PPT1          75
LDAPS_PPT2          75
LDAPS_PPT3          75
LDAPS_PPT4          75
lat                  0
lon                  0
DEM                  0
Slope                0
Solar radiation      0
Next_Tmax           27
Next_Tmin           27
dtype: int64
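
Raw counts are easier to judge as a share of the rows. The sketch below, on a hypothetical four-row frame with a few gaps, computes the percentage of missing values per column and the number of rows that contain at least one missing value, which is the number of rows `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with a few gaps.
demo = pd.DataFrame({
    "Present_Tmax": [28.7, np.nan, 31.6, 32.0],
    "LDAPS_WS": [6.8, 5.7, np.nan, 5.6],
    "lat": [37.6046, 37.6046, 37.5776, 37.645],
})

# Percentage of missing values per column.
missing_pct = demo.isnull().mean() * 100

# Rows with at least one missing value (these are what dropna() removes).
rows_with_gap = int(demo.isnull().any(axis=1).sum())

print(missing_pct)
print("Rows with at least one missing value:", rows_with_gap)
```

In the counts above, the largest value is 75 missing entries out of 7752 rows, which is just under 1% of the rows.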

In [8]:
df = df.dropna()

In [9]:
df  # rows with missing values were already dropped in In [8]; display the cleaned frame

Unnamed: 0,station,Date,Present_Tmax,Present_Tmin,LDAPS_RHmin,LDAPS_RHmax,LDAPS_Tmax_lapse,LDAPS_Tmin_lapse,LDAPS_WS,LDAPS_LH,...,LDAPS_PPT2,LDAPS_PPT3,LDAPS_PPT4,lat,lon,DEM,Slope,Solar radiation,Next_Tmax,Next_Tmin
0,1.0,2013-06-30,28.7,21.4,58.255688,91.116364,28.074101,23.006936,6.818887,69.451805,...,0.0,0.0,0.0,37.6046,126.991,212.3350,2.7850,5992.895996,29.1,21.2
1,2.0,2013-06-30,31.9,21.6,52.263397,90.604721,29.850689,24.035009,5.691890,51.937448,...,0.0,0.0,0.0,37.6046,127.032,44.7624,0.5141,5869.312500,30.5,22.5
2,3.0,2013-06-30,31.6,23.3,48.690479,83.973587,30.091292,24.565633,6.138224,20.573050,...,0.0,0.0,0.0,37.5776,127.058,33.3068,0.2661,5863.555664,31.1,23.9
3,4.0,2013-06-30,32.0,23.4,58.239788,96.483688,29.704629,23.326177,5.650050,65.727144,...,0.0,0.0,0.0,37.6450,127.022,45.7160,2.5348,5856.964844,31.7,24.3
4,5.0,2013-06-30,31.4,21.9,56.174095,90.155128,29.113934,23.486480,5.735004,107.965535,...,0.0,0.0,0.0,37.5507,127.135,35.0380,0.5055,5859.552246,31.2,22.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7745,21.0,2017-08-30,23.1,17.8,24.688997,78.261383,27.812697,18.303014,6.603253,9.614074,...,0.0,0.0,0.0,37.5507,127.040,26.2980,0.5721,4456.024414,27.6,17.7
7746,22.0,2017-08-30,22.5,17.4,30.094858,83.690018,26.704905,17.814038,5.768083,82.146707,...,0.0,0.0,0.0,37.5102,127.086,21.9668,0.1332,4441.803711,28.0,17.1
7747,23.0,2017-08-30,23.3,17.1,26.741310,78.869858,26.352081,18.775678,6.148918,72.058294,...,0.0,0.0,0.0,37.5372,126.891,15.5876,0.1554,4443.313965,28.3,18.1
7748,24.0,2017-08-30,23.3,17.7,24.040634,77.294975,27.010193,18.733519,6.542819,47.241457,...,0.0,0.0,0.0,37.5237,126.909,17.2956,0.2223,4438.373535,28.6,18.8


In [10]:
df.shape

(7588, 25)
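
The cells above deal with missing values only (7752 rows before `dropna()`, 7588 after); outliers are not addressed. One common option, not necessarily the one intended for this exercise, is the 1.5 × IQR rule. A minimal sketch on hypothetical wind-speed values, where the last value is deliberately extreme:

```python
import pandas as pd

# Hypothetical wind-speed values (m/s); the last one is deliberately extreme.
ws = pd.Series([5.7, 6.1, 6.5, 6.8, 7.1, 8.0, 21.9], name="LDAPS_WS")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = ws.quantile(0.25), ws.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = ws[(ws < lower) | (ws > upper)]
print(outliers)
```

Whether flagged values should be removed, capped, or kept depends on the feature: an extreme forecast may be a genuine weather event rather than an error.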