# Data Cleaning & Locating Multivalued and Duplicate Records (Single CSV File)
### Saksham Gakhar, DA - DKSF

Swap in a different input CSV file to look for duplicate and multivalued records, and to list the devices that generally misbehave.

In [911]:
import numpy as np 
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from collections import defaultdict
import datetime
# without mpld3
%matplotlib notebook 

In [912]:
df = pd.read_csv('2020-07-06-DataKind/output-2020-04-01T00_00_00+00_00.csv')
df.when_captured = pd.to_datetime(df.when_captured)

The `service_uploaded` timestamp of every measurement in the raw data is stored as text in a non-standard format, so it needs to be parsed into a timezone-aware datetime.

In [913]:
df.service_uploaded =  df.service_uploaded.apply(lambda x: datetime.datetime.strptime(x, '%b %d, %Y @ %H:%M:%S.%f')\
                                                 .replace(tzinfo=datetime.timezone.utc))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56799 entries, 0 to 56798
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   service_uploaded  56799 non-null  datetime64[ns, UTC]
 1   when_captured     56799 non-null  datetime64[ns, UTC]
 2   device_urn        56799 non-null  object             
 3   device_sn         56799 non-null  object             
 4   device            56799 non-null  int64              
 5   loc_lat           56799 non-null  float64            
 6   loc_lon           56799 non-null  float64            
 7   env_temp          16830 non-null  float64            
 8   env_humid         16830 non-null  float64            
 9   pms_pm01_0        23026 non-null  float64            
 10  pms_pm02_5        23030 non-null  float64            
 11  pms_pm10_0        23026 non-null  float64            
 12  lnd_7318c         33293 non-null  float64            
 13  l
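The same conversion can also be done in one vectorized call. The sketch below assumes the raw strings look like `Apr 30, 2020 @ 23:59:23.000` (inferred from the format string above; the sample values are made up) and uses `pd.to_datetime` with `utc=True` instead of a per-row `strptime`.

```python
import pandas as pd

# Hypothetical raw values that match the '%b %d, %Y @ %H:%M:%S.%f' format.
raw = pd.Series([
    "Apr 30, 2020 @ 23:59:23.000",
    "Apr 30, 2020 @ 23:57:54.000",
])

# Parse the whole column at once and mark the result as UTC.
parsed = pd.to_datetime(raw, format="%b %d, %Y @ %H:%M:%S.%f", utc=True)
print(parsed)
```

On a column of tens of thousands of rows this is usually faster than `apply` with a lambda, and it yields the same timezone-aware values.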

In [914]:
df[0:5]

| | service_uploaded | when_captured | device_urn | device_sn | device | loc_lat | loc_lon | env_temp | env_humid | pms_pm01_0 | pms_pm02_5 | pms_pm10_0 | lnd_7318c | lnd_7318u | bat_voltage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-04-30 23:59:23+00:00 | 2020-04-30 23:59:23+00:00 | safecast:3714913954 | Solarcast #30027 | 3714913954 | 52.395 | 4.875 | NaN | NaN | NaN | NaN | NaN | 23.0 | 25.0 | NaN |
| 1 | 2020-04-30 23:57:54+00:00 | 2020-04-30 23:57:54+00:00 | safecast:678194983 | Solarcast #30030 | 678194983 | 46.554 | 15.635 | NaN | NaN | NaN | NaN | NaN | 24.0 | 29.0 | 3.876 |
| 2 | 2020-04-30 23:57:37+00:00 | 2020-04-30 23:57:39+00:00 | safecast:678194983 | Solarcast #30030 | 678194983 | 46.554 | 15.635 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2020-04-30 23:57:20+00:00 | 2020-04-30 23:57:15+00:00 | safecast:678194983 | Solarcast #30030 | 678194983 | 46.554 | 15.635 | NaN | NaN | 1.0 | 2.0 | 2.0 | NaN | NaN | NaN |
| 4 | 2020-04-30 23:45:01+00:00 | 2020-04-30 23:45:01+00:00 | safecast:3714913954 | Solarcast #30027 | 3714913954 | 52.395 | 4.875 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


Using (`device`, `when_captured`) as the key for the table above, let's see what the multiple values recorded for a single timestamp correspond to. Some readings show a negative RH, and some show 0.0 PM (which means very clean air).

In [915]:
def findBadData(df):
    """Print every (device, timestamp) key that has more than one row, then the devices involved."""
    temp_df = (df.groupby(['device_urn', 'device_sn', 'when_captured'])
                 .size().to_frame('size')
                 .reset_index().sort_values('size', ascending=False))
    print("bad device data counts: ")
    print(temp_df[(temp_df['size']>1)])
    
    print("all bad device list: ")
    # Devices that have misbehaved at some point: more than one row per time stamp
    print(np.unique(temp_df[temp_df['size']>1]['device_sn'].values))

In [916]:
findBadData(df)

bad device data counts: 
                device_urn         device_sn             when_captured  size
15012  safecast:2152053642  Solarcast #30025 2020-04-07 15:59:56+00:00    24
1678   safecast:1094924990  Solarcast #30024 2020-04-15 22:36:25+00:00    19
32529  safecast:3768313999  Solarcast #30026 2020-04-20 05:44:22+00:00    19
15341  safecast:2152053642  Solarcast #30025 2020-04-20 03:24:01+00:00    17
32057  safecast:3768313999  Solarcast #30026 2020-04-06 06:24:45+00:00    17
...                    ...               ...                       ...   ...
31863  safecast:3714913954  Solarcast #30027 2020-04-30 16:14:23+00:00     2
31166  safecast:3714913954  Solarcast #30027 2020-04-27 08:44:24+00:00     2
28394  safecast:3714913954  Solarcast #30027 2020-04-13 18:22:34+00:00     2
28269  safecast:3714913954  Solarcast #30027 2020-04-13 03:52:46+00:00     2
28524  safecast:3714913954  Solarcast #30027 2020-04-14 08:52:34+00:00     2

[1080 rows x 4 columns]
all bad device list: 
['So
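To make the idea of a multivalued key concrete, here is a minimal sketch on a made-up frame (the column values are invented for illustration). It counts rows per (`device_sn`, `when_captured`) key with `groupby(...).size()` and, as an alternative view, flags the affected rows with `duplicated(keep=False)`.

```python
import pandas as pd

# Toy readings; the values are made up for illustration.
toy = pd.DataFrame({
    "device_sn": ["A", "A", "A", "B"],
    "when_captured": pd.to_datetime(
        ["2020-04-01 00:00:00", "2020-04-01 00:00:00",
         "2020-04-01 00:05:00", "2020-04-01 00:00:00"], utc=True),
    "pms_pm02_5": [1.0, 3.0, 2.0, 2.0],
})
keys = ["device_sn", "when_captured"]

# Rows per key; a key with more than one row is multivalued or duplicated.
sizes = toy.groupby(keys).size()
multi = sizes[sizes > 1]

# Row-level view: keep=False flags every member of a repeated key.
mask = toy.duplicated(subset=keys, keep=False)

print(multi)
print(toy[mask])
```

Device `A` reports two different PM2.5 values for the same timestamp, so both of its first two rows are flagged, while the later `A` row and the `B` row are not.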

#### Add a column for the year

In [917]:
df['year'] = pd.DatetimeIndex(df['when_captured']).year

## Data Cleansing based on [Protocol](https://github.com/DataKind-SF/safecast/blob/master/Solarcast_data_cleansing.md)

In [918]:
print(df['when_captured'].isna().sum())
df = df[df['when_captured'].notna()]

df.shape

0


(56799, 16)

In [919]:
# Quote the date: the unquoted 2000/1/19 is arithmetic (about 105.26), not a date
cutoff = pd.Timestamp('2000-01-19', tz='UTC')
boolean_condition = df['when_captured'] > cutoff
print(boolean_condition.sum())
df = df[boolean_condition]

df.shape

56799


(56799, 16)
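One detail worth checking in this step is how the cutoff date is written. In Python, `2000/1/19` without quotes is two divisions that produce a float, so pandas receives a number rather than a calendar date. The sketch below, on made-up timestamps, shows the float and one way to build a timezone-aware cutoff with `pd.Timestamp`; the variable names are illustrative only.

```python
import pandas as pd

# Without quotes this is plain division, evaluated left to right.
unquoted = 2000/1/19
print(type(unquoted), round(unquoted, 2))

# A timezone-aware cutoff that can be compared with a UTC datetime column.
cutoff = pd.Timestamp("2000-01-19", tz="UTC")

# Made-up timestamps: one implausibly old, one from the study period.
stamps = pd.Series(pd.to_datetime(
    ["1970-01-01 00:00:10", "2020-04-30 23:59:23"], utc=True))
keep = stamps > cutoff
print(keep.tolist())
```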

In [920]:
boolean_condition = (df['env_humid']<0) | (df['env_humid']>100)
print(boolean_condition.sum())
column_name = 'env_humid'
new_value = np.nan
df.loc[boolean_condition, column_name] = new_value

df.shape

2821


(56799, 16)
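The humidity step replaces impossible values with `NaN` instead of dropping rows, which is why the shape stays at 56799. A minimal sketch of the same idea on invented values, using `Series.mask` as an equivalent to the `.loc` assignment:

```python
import numpy as np
import pandas as pd

# Made-up relative-humidity readings, including out-of-range values.
humid = pd.Series([45.0, -3.2, 101.5, np.nan, 100.0])

# Physically valid RH lies in [0, 100].
out_of_range = (humid < 0) | (humid > 100)

# mask() swaps flagged entries for NaN and leaves the length untouched.
cleaned = humid.mask(out_of_range)

print(out_of_range.sum())
print(cleaned.tolist())
```

Comparisons against `NaN` evaluate to `False`, so readings that were already missing are not counted as out of range.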

In [921]:
# Keep rows whose capture and upload times are less than 7 days apart
boolean_condition = (df['when_captured'] - df['service_uploaded']).abs() < pd.Timedelta(days=7)
print(df.shape[0] - boolean_condition.sum())
df = df[boolean_condition]

df.shape

2414


(54385, 16)
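The 7-day rule compares the gap between capture time and upload time, in either direction. A small sketch on made-up timestamps, assuming both columns are timezone-aware, that expresses the threshold as a `pd.Timedelta`:

```python
import pandas as pd

# Made-up capture/upload pairs: 1 minute, 9 days, and exactly 7 days apart.
toy = pd.DataFrame({
    "when_captured": pd.to_datetime(
        ["2020-04-10 12:00", "2020-04-01 12:00", "2020-04-10 12:00"], utc=True),
    "service_uploaded": pd.to_datetime(
        ["2020-04-10 12:01", "2020-04-10 12:00", "2020-04-17 12:00"], utc=True),
})

# Absolute lag, regardless of which timestamp comes first.
lag = (toy["when_captured"] - toy["service_uploaded"]).abs()

# Strictly less than 7 days, so a lag of exactly 7 days is rejected.
within_week = lag < pd.Timedelta(days=7)
print(within_week.tolist())
```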

The `service_uploaded` column is no longer needed.

In [922]:
df.drop('service_uploaded', axis=1, inplace=True)
df.shape

(54385, 15)

Make sure all duplicates are dropped

In [923]:
incoming = df.shape[0]
df = df.drop_duplicates(keep='first')  # default subset is every column, so only fully identical rows are dropped
print(incoming - df.shape[0])

266
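Only rows that match in every column count as duplicates here. The sketch below, on invented data, shows that `duplicated` gives the number of rows `drop_duplicates` will remove, and that missing values in the same positions are treated as matching.

```python
import numpy as np
import pandas as pd

# Made-up rows: one repeated reading and one repeated all-NaN row.
toy = pd.DataFrame({
    "device_sn": ["A", "A", "A", "A", "B"],
    "env_temp": [20.0, 20.0, np.nan, np.nan, np.nan],
    "env_humid": [50.0, 50.0, np.nan, np.nan, np.nan],
})

# Count the rows that repeat an earlier row in every column.
n_dupes = int(toy.duplicated(keep="first").sum())

# Drop them, keeping the first occurrence of each.
deduped = toy.drop_duplicates(keep="first")

print(n_dupes, len(toy) - len(deduped))
```

The `B` row survives even though its measurements are all missing, because its `device_sn` differs from the `A` rows.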


### Filtering bad row records

In [924]:
temp_df = df.groupby(['device_sn','when_captured']).agg(['count','nunique'])
# temp_df.info()
num_groups = temp_df.shape[0]
print(num_groups)

50076


Merge the counts and the distinct counts to check for duplicate records and multiplicities.

In [925]:
# temp_df holds a (count, nunique) column pair for each of the 13 non-key columns:
# even positions are the counts, odd positions are the distinct counts
even = list(range(0,26,2))
odd = list(range(1,26,2))
tmp_df1 = temp_df.iloc[:,even].max(axis=1).to_frame('COUNTS').reset_index()
tmp_df2 = temp_df.iloc[:,odd].max(axis=1).to_frame('DISTINCTS').reset_index()
print(tmp_df1.shape, tmp_df2.shape)
merged = pd.merge(tmp_df1, tmp_df2, on=['device_sn', 'when_captured'])
merged

(50076, 3) (50076, 3)


| | device_sn | when_captured | COUNTS | DISTINCTS |
|---|---|---|---|---|
| 0 | Solarcast #30000 | 2020-04-17 14:55:22+00:00 | 6 | 6 |
| 1 | Solarcast #30000 | 2020-04-17 15:29:47+00:00 | 6 | 5 |
| 2 | Solarcast #30000 | 2020-04-17 16:39:11+00:00 | 6 | 5 |
| 3 | Solarcast #30000 | 2020-04-17 17:38:13+00:00 | 6 | 6 |
| 4 | Solarcast #30000 | 2020-04-17 18:07:07+00:00 | 6 | 6 |
| ... | ... | ... | ... | ... |
| 50071 | Solarcast #33009 | 2020-04-28 22:34:19+00:00 | 1 | 1 |
| 50072 | Solarcast #33009 | 2020-04-29 10:44:17+00:00 | 1 | 1 |
| 50073 | Solarcast #33009 | 2020-04-29 22:54:14+00:00 | 1 | 1 |
| 50074 | Solarcast #33009 | 2020-04-30 11:04:13+00:00 | 1 | 1 |
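The `COUNTS` and `DISTINCTS` columns are the per-key maxima, across all measurement columns, of the non-null count and the distinct-value count. The sketch below reproduces that on a made-up frame; it selects the `count` and `nunique` sub-columns by name with `xs` rather than by position, which is an alternative chosen here for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up readings: A conflicts with itself, B is a single row, C repeats one value.
toy = pd.DataFrame({
    "device_sn": ["A", "A", "B", "C", "C"],
    "when_captured": pd.to_datetime(
        ["2020-04-01", "2020-04-01", "2020-04-01",
         "2020-04-02", "2020-04-02"], utc=True),
    "env_temp": [20.0, 21.0, 19.0, np.nan, np.nan],
    "pms_pm02_5": [1.0, 1.0, np.nan, 2.0, 2.0],
})
keys = ["device_sn", "when_captured"]

# One (count, nunique) column pair per measurement column.
stats = toy.groupby(keys).agg(["count", "nunique"])

# Pick each statistic by name, then take the row-wise maximum.
counts = stats.xs("count", axis=1, level=1).max(axis=1).rename("COUNTS")
distincts = stats.xs("nunique", axis=1, level=1).max(axis=1).rename("DISTINCTS")

summary = pd.concat([counts, distincts], axis=1).reset_index()
print(summary)
```

Both `count` and `nunique` ignore `NaN`, so `COUNTS` equals the group size only when at least one column is fully populated within the group.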


#### Calculating hits: Impose mutually exclusive conditions for filtering

Actionable: Records of useless data with all NaNs

In [926]:
bool1 = (merged.COUNTS >1) & (merged.DISTINCTS==0)
sum1 = bool1.sum()
print(sum1)
toDiscard1 = merged.loc[:,['device_sn', 'when_captured']][bool1]
toDiscard1.shape

0


(0, 2)

Actionable: Records that are a mix of duplicate and non-duplicate rows for a given (`device_sn`, `when_captured`) [all of them must be discarded]

In [927]:
bool3 = (merged.COUNTS >1) & (merged.DISTINCTS>1)
sum3 = bool3.sum()
print(sum3)
toDiscard3 = merged.loc[:,['device_sn', 'when_captured']][bool3]
toDiscard3.shape

923


(923, 2)

NOT Actionable as duplicates were dropped: Records where all rows are purely duplicates [preserve only 1 later]

In [928]:
bool2 = (merged.COUNTS >1) & (merged.DISTINCTS==1)
sum2 = bool2.sum()
print(sum2)
print("get rid of : " ,merged.COUNTS[bool2].sum() - merged.DISTINCTS[bool2].sum())

0
get rid of :  0


Records that are good

In [929]:
bool4 = (merged.COUNTS ==1) & (merged.DISTINCTS==1)
sum4 = bool4.sum()
print(sum4)

49153


In [930]:
#ensure you have all records covered by 1 of the 4 conditions
assert(num_groups == sum1+sum2+sum3+sum4)
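The assertion works because the four conditions are meant to be mutually exclusive and, for this data, exhaustive. A minimal sketch on an invented `COUNTS`/`DISTINCTS` table that labels each key with `np.select` and checks that every key matches exactly one condition; the label names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up summary with one key per category.
summary = pd.DataFrame({"COUNTS": [1, 3, 2, 2], "DISTINCTS": [1, 2, 1, 0]})

conditions = [
    (summary.COUNTS > 1) & (summary.DISTINCTS == 0),   # only NaNs
    (summary.COUNTS > 1) & (summary.DISTINCTS == 1),   # pure duplicates
    (summary.COUNTS > 1) & (summary.DISTINCTS > 1),    # conflicting rows
    (summary.COUNTS == 1) & (summary.DISTINCTS == 1),  # good
]
labels = ["all_nan", "pure_duplicates", "conflicting", "good"]

# Anything matching none of the conditions is labelled "uncovered".
summary["category"] = np.select(conditions, labels, default="uncovered")

# Number of conditions each key satisfies; it should be exactly 1.
hits = pd.concat(conditions, axis=1).sum(axis=1)
print(summary)
```

A key with `COUNTS == 1` and `DISTINCTS == 0` would match none of the four conditions, so the sum check would fail and point to a case that still needs handling.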

#### Filter now from the main dataframe


In [931]:
discard = pd.concat([toDiscard1, toDiscard3], ignore_index=True)
discard['KEY_DevSN_WhenCapt'] = list(zip(discard.device_sn, discard.when_captured))
print(df.shape, discard.shape)

(54119, 15) (923, 3)


In [932]:
df['KEY_DevSN_WhenCapt'] = list(zip(df.device_sn, df.when_captured))
df.shape

(54119, 16)

In [933]:
rows_to_discard = df['KEY_DevSN_WhenCapt'].isin(discard['KEY_DevSN_WhenCapt'])
print("these many rows to discard: ", rows_to_discard.sum())

these many rows to discard:  4966


In [934]:
incoming = df.shape[0]
df = df[~rows_to_discard]
print(incoming - df.shape[0])

4966
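Building a tuple key and calling `isin` is one way to remove every row that belongs to a flagged key. An alternative sketch, on made-up data, is an anti-join: a left `merge` with `indicator=True`, keeping only the rows that have no match in the discard list. It assumes the discard list has one row per key, so the merge does not multiply rows.

```python
import pandas as pd

# Made-up readings; device A has two conflicting rows at one timestamp.
toy = pd.DataFrame({
    "device_sn": ["A", "A", "B", "C"],
    "when_captured": pd.to_datetime(
        ["2020-04-01", "2020-04-01", "2020-04-01", "2020-04-02"], utc=True),
    "pms_pm02_5": [1.0, 3.0, 2.0, 5.0],
})

# Keys flagged for removal (one row per key).
discard = pd.DataFrame({
    "device_sn": ["A"],
    "when_captured": pd.to_datetime(["2020-04-01"], utc=True),
})
keys = ["device_sn", "when_captured"]

# "_merge" records whether each row found a partner in the discard list.
flagged = toy.merge(discard, on=keys, how="left", indicator=True)
kept = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(kept)
```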


### Now check to make sure no garbage data is left

In [935]:
findBadData(df)

bad device data counts: 
Empty DataFrame
Columns: [device_urn, device_sn, when_captured, size]
Index: []
all bad device list: 
[]
