## Detection of missing or unreliable values

#### Get the data from [link](https://raw.githubusercontent.com/tnavarrete-iedib/bigdata-24-25/refs/heads/main/mesuraments-estacio-control-qualitat-aire-illes-balears-estacio-sant-antoni-de-portmany-eivissa.csv)

#### 1.1: Set the SO₂, NO, and NO₂ values to _np.nan_ wherever their flag is not ‘V’. Then calculate the mean of each of the three variables, ignoring the _np.nan_ values.

I get the following error when loading the data directly from the URL (it also happens in other subjects):

> URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)> 

So I'll use the cURL command to download the data instead.
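
As an alternative to shelling out to cURL, the download can stay in Python by handing `urllib` an explicit CA bundle. The sketch below is one possible approach, not a confirmed fix for this machine: it assumes the error comes from Python not finding a local CA bundle, it uses the third-party `certifi` package when it is installed, and `make_ssl_context` and `download` are hypothetical helper names.

```python
import ssl
import urllib.request


def make_ssl_context() -> ssl.SSLContext:
    """Build a verifying SSL context, preferring certifi's CA bundle if present."""
    try:
        import certifi  # third-party; assumed installed with `pip install certifi`
        return ssl.create_default_context(cafile=certifi.where())
    except ImportError:
        # Fall back to the system default trust store
        return ssl.create_default_context()


def download(url: str, dest: str) -> None:
    """Download `url` to `dest`, verifying certificates with the context above."""
    context = make_ssl_context()
    with urllib.request.urlopen(url, context=context) as response, open(dest, "wb") as out:
        out.write(response.read())
```

Calling `download(url, 'section1.csv')` with the CSV URL given above would then take the place of the `!curl` cell. Certificate verification stays enabled, which is preferable to switching it off.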

In [23]:
!curl -o section1.csv https://raw.githubusercontent.com/tnavarrete-iedib/bigdata-24-25/refs/heads/main/mesuraments-estacio-control-qualitat-aire-illes-balears-estacio-sant-antoni-de-portmany-eivissa.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9591k  100 9591k    0     0  8407k      0  0:00:01  0:00:01 --:--:-- 8406k


In [24]:
import numpy as np
import pandas as pd

df = pd.read_csv('section1.csv', sep=';')
df.head()

Unnamed: 0,DATA_HI,PERÍODE_HI,SO2_HI,FL_SO2,NO_HI,FL_NO,NO2_HI,FL_NO2,O3_HI,FL_O3,...,TMP_HI,FL_TMP,HR_HI,FL_HR,PRB_HI,FL_PRB,LL_HI,FL_LL,RS_HI,FL_RS
0,01/22/2007 12:00:00 AM,4,,N,1.38,V,1.58,V,44.3,V,...,,,,,,,,,,
1,01/22/2007 12:00:00 AM,6,,N,1.35,V,0.7,V,25.4,V,...,,,,,,,,,,
2,01/22/2007 12:00:00 AM,7,,N,1.02,V,0.5,V,26.4,V,...,,,,,,,,,,
3,01/24/2007 12:00:00 AM,14,,N,0.33,V,5.3,V,59.6,V,...,,,,,,,,,,
4,01/24/2007 12:00:00 AM,18,,N,0.38,V,5.55,V,61.1,V,...,,,,,,,,,,


In [25]:
clean_df = df.copy() # Make a copy of the original dataframe

# Replace values whose flag is not 'V' with np.nan
clean_df.loc[clean_df['FL_SO2'] != 'V', 'SO2_HI'] = np.nan
clean_df.loc[clean_df['FL_NO'] != 'V', 'NO_HI'] = np.nan
clean_df.loc[clean_df['FL_NO2'] != 'V', 'NO2_HI'] = np.nan

clean_df.head()

Unnamed: 0,DATA_HI,PERÍODE_HI,SO2_HI,FL_SO2,NO_HI,FL_NO,NO2_HI,FL_NO2,O3_HI,FL_O3,...,TMP_HI,FL_TMP,HR_HI,FL_HR,PRB_HI,FL_PRB,LL_HI,FL_LL,RS_HI,FL_RS
0,01/22/2007 12:00:00 AM,4,,N,1.38,V,1.58,V,44.3,V,...,,,,,,,,,,
1,01/22/2007 12:00:00 AM,6,,N,1.35,V,0.7,V,25.4,V,...,,,,,,,,,,
2,01/22/2007 12:00:00 AM,7,,N,1.02,V,0.5,V,26.4,V,...,,,,,,,,,,
3,01/24/2007 12:00:00 AM,14,,N,0.33,V,5.3,V,59.6,V,...,,,,,,,,,,
4,01/24/2007 12:00:00 AM,18,,N,0.38,V,5.55,V,61.1,V,...,,,,,,,,,,


In [26]:
# Calculate the mean of the 3 columns, ignoring NaN values -> use pandas mean(skipna=True)
average_so2 = clean_df['SO2_HI'].mean(skipna=True)
average_no = clean_df['NO_HI'].mean(skipna=True)
average_no2 = clean_df['NO2_HI'].mean(skipna=True)

print('Average SO2:', average_so2)
print('Average NO:', average_no)
print('Average NO2:', average_no2)

Average SO2: 1.8539720923615335
Average NO: 1.954006150344058
Average NO2: 4.636526841216506
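
The masking step can also be written as a loop over (value column, flag column) pairs using `Series.where`. The following is a minimal sketch on made-up toy data, not the real CSV; the `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration, not taken from the CSV)
toy = pd.DataFrame({
    "SO2_HI": [1.0, 2.0, 3.0],
    "FL_SO2": ["V", "N", "V"],
    "NO_HI": [4.0, 5.0, 6.0],
    "FL_NO": ["V", "V", "T"],
})

# Pair each value column with its flag column and mask them in one loop
for value_col, flag_col in [("SO2_HI", "FL_SO2"), ("NO_HI", "FL_NO")]:
    # Series.where keeps values where the condition holds and puts NaN elsewhere
    toy[value_col] = toy[value_col].where(toy[flag_col] == "V")

# DataFrame.mean skips NaN by default (skipna=True)
means = toy[["SO2_HI", "NO_HI"]].mean()
print(means)
```

Because `skipna=True` is the default for `mean`, passing it explicitly as in the cell above only documents the intent; it does not change the result.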


#### 1.2: Replace all np.nan values as follows

- ##### For SO₂, replace with the previous value
- ##### For NO, replace with the next value
- ##### For NO₂, replace with the mean of the non-null values

#### Recalculate the average and compare it with the one you obtained in the previous section.
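
The three strategies listed above behave as follows on a small toy Series. This is a sketch with made-up values, chosen so that the first element is missing and the edge case of a forward fill is visible.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0])  # toy values, made up

forward = s.ffill()               # previous value; a leading NaN has no predecessor
backward = s.bfill()              # next value; a trailing NaN would have no successor
mean_filled = s.fillna(s.mean())  # mean of the non-null values

print(forward.tolist())
print(backward.tolist())
print(mean_filled.tolist())
```

A forward fill cannot fill missing values at the start of a column. The first rows shown by `head()` have no valid SO₂ value, so it may be worth checking `clean_df['SO2_HI'].isna().sum()` after filling.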

In [27]:
# Replace NaN values
clean_df['SO2_HI'] = clean_df['SO2_HI'].ffill()  # previous value (fillna(method='ffill') is deprecated)
clean_df['NO_HI'] = clean_df['NO_HI'].bfill()  # next value (fillna(method='bfill') is deprecated)
clean_df['NO2_HI'] = clean_df['NO2_HI'].fillna(clean_df['NO2_HI'].mean(skipna=True)) # Mean of the non-null values, recalculated here

# Recalculate the average
new_average_so2 = clean_df['SO2_HI'].mean(skipna=True)
new_average_no = clean_df['NO_HI'].mean(skipna=True)
new_average_no2 = clean_df['NO2_HI'].mean(skipna=True)

print('New average SO2:', new_average_so2)
print('New average NO:', new_average_no)
print('New average NO2:', new_average_no2)

New average SO2: 2.512784619989274
New average NO: 2.3080602253489957
New average NO2: 4.636526841216505



In [28]:
"""
	Filling NO2 with its own mean leaves the mean unchanged, whether or not it had NaN values;
	the difference in the last digit is floating-point rounding.
	On the other hand, SO2 and NO were filled with neighbouring values, so their means change.
"""
print('- FULL COMPARISON -')
print(f'SO2:\t{average_so2}\t->\t{new_average_so2}')
print(f'NO:\t{average_no}\t->\t{new_average_no}')
print(f'NO2:\t{average_no2}\t->\t{new_average_no2}')

- FULL COMPARISON -
SO2:	1.8539720923615335	->	2.512784619989274
NO:	1.954006150344058	->	2.3080602253489957
NO2:	4.636526841216506	->	4.636526841216505
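
The NO₂ mean barely moves, which is what mean imputation predicts: adding copies of the mean to a sample does not shift its mean, apart from floating-point rounding. A minimal sketch on made-up values (the `no2` Series below is an assumption, not the real column) that also counts how many values were imputed:

```python
import numpy as np
import pandas as pd

no2 = pd.Series([4.0, np.nan, 6.0, np.nan, 5.0])  # toy values, made up

n_missing = int(no2.isna().sum())            # how many values will be imputed
mean_before = no2.mean()                     # NaN values are skipped by default
mean_after = no2.fillna(mean_before).mean()  # mean after imputing with the mean

print(n_missing, mean_before, mean_after)
# Compare floats with a tolerance rather than with ==
print(np.isclose(mean_before, mean_after))
```

Running `isna().sum()` on a column before filling it is a direct way to find out whether it contained missing values, rather than inferring it from the means.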
