# MODULES AND CONSTANTS

### MODULES, IMPORTS AND LIBRARIES USED

In [71]:
import numpy as np
import pandas as pd



### CONSTANTS USED

In [72]:
DATA_PATH = 'SummaryofWeather.csv'

# 1 - READ FILE

Load the Second World War Weather dataset, specifically the data contained in
`SummaryofWeather.csv`.

In [73]:
def loadDataFrame(path:str)->pd.DataFrame:
    # Read columns 7, 8, 18 and 25 as strings, then turn '#VALUE!' cells into NaN
    return pd.read_csv(filepath_or_buffer=path, header=0, dtype={7:str, 8:str, 18:str, 25:str}).replace({'#VALUE!':np.nan})
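To see what the `replace` step does, here is a minimal sketch on an in-memory CSV. The sample rows and the `Snowfall` column are assumptions made up for illustration, and the text column is selected by name rather than by position; the real file may differ.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample; real column names and values may differ.
sample_csv = (
    "STA,Date,MeanTemp,Snowfall\n"
    "10001,1942-07-01,23.9,0\n"
    "10001,1942-07-02,25.0,#VALUE!\n"
    "10001,1942-07-03,24.4,1.2\n"
)

# Read the mixed column as text, then turn the spreadsheet error marker into NaN.
sample = pd.read_csv(io.StringIO(sample_csv), header=0, dtype={"Snowfall": str})
sample = sample.replace({"#VALUE!": np.nan})

# The marker is now counted as missing; the other Snowfall cells are still text.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Replacing the marker with `np.nan` lets `isna()` see those cells as missing, which the missing-value analysis in the next section relies on.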

# 2 - MISSING VALUES

Inspect the dataset to identify whether any sensors have missing values. Then, for the 10 most
complete sensors (in terms of collected data), check the distribution of the recorded mean
temperatures (`MeanTemp` column).
- Can you tell whether these sensors are located in parts of the world with similar weather conditions?
- Is it necessary to normalize the data in this case?
- Which pre-processing step can be useful to solve the forecasting task?

In [101]:
def removeMissingColumns(df:pd.DataFrame)->pd.DataFrame:
    # Count NaN values per column and keep only columns with at least one non-missing value
    temp = df.isna().sum()
    return df[[k for k in temp.index if temp[k] < df.shape[0]]]

def displayNMostCompleteSensors(df: pd.DataFrame, n:int, printOnScreen:bool=False) -> list[str]:
    res = df.groupby(['STA']).agg('count').sum(axis=1).sort_values(ascending=False).head(n)
    if printOnScreen:
        print(f"Top {n} most active sensors")
        display(res)
        
    return list(res.index)
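The inspection described above can be tried on a small frame first. The following is only a sketch: the toy values and the `Precip` column are invented for illustration, and `describe()` is one possible way to summarise the `MeanTemp` distribution, not necessarily the intended one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing the STA and MeanTemp column names.
toy = pd.DataFrame({
    "STA": [1, 1, 1, 2, 2, 3],
    "MeanTemp": [20.0, 22.0, np.nan, -5.0, -3.0, 30.0],
    "Precip": [0.0, np.nan, np.nan, 1.0, 2.0, np.nan],
})

# Fraction of missing entries per column.
missing_fraction = toy.isna().mean()

# Non-missing cells per station: the same ranking criterion used above.
completeness = toy.groupby("STA").count().sum(axis=1).sort_values(ascending=False)
top = list(completeness.head(2).index)

# Summary statistics of MeanTemp for the selected stations.
summary = toy[toy["STA"].isin(top)].groupby("STA")["MeanTemp"].describe()
print(missing_fraction, completeness, summary, sep="\n\n")
```

Comparing the `mean`, `min` and `max` columns of `summary` across stations gives a first hint of whether the sensors sit in similar climates, and whether their value ranges differ enough for normalization to matter.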

# 3 - SENSOR SELECTION

For simplicity, we will use the data collected by a single sensor. Filter the data by STA (Station)
and extract the mean temperature measurements of the sensor with id 22508.
<br><br>
Info: Remember to also keep track of the date on which each measurement was taken,
and make sure that each date is properly converted to a datetime data type such as the datetime64
type provided by numpy.

In [129]:
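def filterDataFrame(df:pd.DataFrame, sensor:int=22508)->pd.DataFrame:
    # Copy the selection so that assigning 'Date' does not write to a view of the original frame
    df = df[df['STA']==sensor][['MeanTemp', 'Date']].copy()
    df['Date'] = pd.to_datetime(df['Date'])
    return df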
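The date conversion can be checked in isolation. This is a minimal sketch on invented rows; indexing by date and sorting are extra steps assumed here for illustration, since most time-series tools expect chronological order.

```python
import pandas as pd

# Hypothetical sample mimicking the STA, Date and MeanTemp columns.
raw = pd.DataFrame({
    "STA": [22508, 22508, 10001],
    "Date": ["1940-01-02", "1940-01-01", "1940-01-01"],
    "MeanTemp": [19.4, 20.0, 25.0],
})

# Copy the selection before assigning, so the original frame is left untouched.
series = raw.loc[raw["STA"] == 22508, ["MeanTemp", "Date"]].copy()
series["Date"] = pd.to_datetime(series["Date"])

# Index by date and sort chronologically.
series = series.set_index("Date").sort_index()

# Confirm that the index really holds datetime64 values.
print(pd.api.types.is_datetime64_any_dtype(series.index))
```

Checking the dtype right after the conversion catches dates that were silently left as strings.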

# MAIN FUNCTION

This is the main function of the file. It:
<ol>
<li>Loads the data frame and removes the columns that contain only np.nan values</li>
<li>Looks for the most complete sensors</li>
<li>Filters the data frame for a specific sensor</li>
</ol>

In [130]:
def main()->None:
    df = removeMissingColumns(loadDataFrame(DATA_PATH)) # 1,2
    # print(df[df['STA'].isin(displayNMostCompleteSensors(df, 10))][['STA','MeanTemp']].groupby('STA').agg('mean')) # 2
    df = filterDataFrame(df, 22508)
    display(df)
    
    
main()

| | MeanTemp | Date |
|---|---|---|
| 57877 | 20.000000 | 1940-01-01 |
| 57878 | 19.444444 | 1940-01-02 |
| 57879 | 20.000000 | 1940-01-03 |
| 57880 | 21.111111 | 1940-01-04 |
| 57881 | 18.333333 | 1940-01-05 |
| ... | ... | ... |
| 60064 | 20.555556 | 1945-12-27 |
| 60065 | 21.111111 | 1945-12-28 |
| 60066 | 20.000000 | 1945-12-29 |
| 60067 | 21.111111 | 1945-12-30 |
