## Missing Data

In this notebook we will look at a few datasets in which some column values are missing.
Dealing with these gaps is crucial in data science and machine learning, as most algorithms are not able to handle records with missing information.

In Python, we will be using the pandas library to handle our datasets.

In [1]:
import pandas as pd

### Kamyr digester

The first dataset we will be looking at is taken from a physical device equipped with numerous sensors. At each timepoint (every hour) these sensors are read out and the data is collected. Let's have a look at the general structure.

In [2]:
kamyr_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c2_data_preparation/data/kamyr-digester.csv')
kamyr_df.head()

Unnamed: 0,Observation,Y-Kappa,ChipRate,BF-CMratio,BlowFlow,ChipLevel4,T-upperExt-2,T-lowerExt-2,UCZAA,WhiteFlow-4,...,SteamFlow-4,Lower-HeatT-3,Upper-HeatT-3,ChipMass-4,WeakLiquorF,BlackFlow-2,WeakWashF,SteamHeatF-3,T-Top-Chips-4,SulphidityL-4
0,31-00:00,23.1,16.52,121.717,1177.607,169.805,358.282,329.545,1.443,599.253,...,67.122,329.432,303.099,175.964,1127.197,1319.039,257.325,54.612,252.077,
1,31-01:00,27.6,16.81,79.022,1328.36,341.327,351.05,329.067,1.549,537.201,...,60.012,330.823,304.879,163.202,665.975,1297.317,241.182,46.603,251.406,29.11
2,31-02:00,23.19,16.709,79.562,1329.407,239.161,350.022,329.26,1.6,549.611,...,61.304,329.14,303.383,164.013,677.534,1327.072,237.272,51.795,251.335,
3,31-03:00,23.6,16.478,81.011,1334.877,213.527,350.938,331.142,1.604,623.362,...,68.496,328.875,302.254,181.487,767.853,1324.461,239.478,54.846,250.312,29.02
4,31-04:00,22.9,15.618,93.244,1334.168,243.131,351.64,332.709,,638.672,...,70.022,328.352,300.954,183.929,888.448,1343.424,215.372,54.186,249.916,29.01


Interesting, there seem to be 22 sensor values and 1 timestamp for each record. As mechanical devices are prone to noise and sensor dropouts, we would be foolish to assume that no missing values are present.

In [3]:
kamyr_df.isna().sum().divide(len(kamyr_df)).round(4)*100

Observation         0.00
Y-Kappa             0.00
ChipRate            1.33
BF-CMratio          4.65
BlowFlow            4.32
ChipLevel4          0.33
T-upperExt-2        0.33
T-lowerExt-2        0.33
UCZAA               7.97
WhiteFlow-4         0.33
AAWhiteSt-4        46.84
AA-Wood-4           0.33
ChipMoisture-4      0.33
SteamFlow-4         0.33
Lower-HeatT-3       0.33
Upper-HeatT-3       0.33
ChipMass-4          0.33
WeakLiquorF         0.33
BlackFlow-2         0.33
WeakWashF           0.33
SteamHeatF-3        0.33
T-Top-Chips-4       0.33
SulphidityL-4      46.84
dtype: float64
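
The same percentages can be computed a little more directly, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up frame (the column names `a` and `b` are hypothetical, not taken from the Kamyr data):

```python
import numpy as np
import pandas as pd

# Toy frame, not the Kamyr data: 'a' misses 1 of 4 values, 'b' misses 2 of 4.
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.5, 2.5],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
missing_pct = toy_df.isna().mean().mul(100).round(2)
print(missing_pct)
```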

As expected, values are missing: the columns 'AAWhiteSt-4' and 'SulphidityL-4' each even have nearly 47% of their data missing!
We only have about 300 records, and as these missing values presumably occur in different records, our dataset would be decimated if we just dropped all rows with missing values.

In [4]:
kamyr_df.shape

(301, 23)

In [5]:
kamyr_df.dropna().shape

(131, 23)

As we drop all rows with missing values, we are left with only 131 records.
Whilst this might be good enough for some purposes, there are more viable options.

Perhaps we can first remove the two columns with the most missing values and then drop all remaining rows that still contain missing values.

In [6]:
kamyr_df.drop(columns=['AAWhiteSt-4 ','SulphidityL-4 ']).dropna().shape

(263, 21)

Significantly better: although we lost the information of 2 sensors, we now have a complete dataset with 263 records. For purposes where those 2 sensors are irrelevant this is a viable option. Keep in mind that this dataset is still 100% truthful, as we have not imputed any values.
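
Instead of typing the column names by hand, the sparse columns can also be selected with a threshold on the missing fraction. The sketch below uses a made-up frame, and the 40% cut-off is an arbitrary assumption that would need tuning for a real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical sensor names; 'sensor_b' is mostly missing.
toy_df = pd.DataFrame({
    "sensor_a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "sensor_b": [np.nan, 1.0, np.nan, np.nan, 2.0],
    "sensor_c": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Rows left when every incomplete row is dropped.
rows_all = toy_df.dropna().shape[0]

# Assumed cut-off: treat columns with more than 40% missing values as too sparse.
sparse_cols = toy_df.columns[toy_df.isna().mean() > 0.4]

# Drop the sparse columns first, then the rows that are still incomplete.
rows_after = toy_df.drop(columns=sparse_cols).dropna().shape[0]

print(list(sparse_cols), rows_all, rows_after)
```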

Another option, which retains all our records, is to use the temporal nature of our dataset: each record is a measurement taken at an interval of 1 hour. I have no knowledge of this dataset, but one might assume that the 1-hour interval was chosen because the state of the machine does not change much within an hour. Therefore we could do what is called a forward fill, where we fill each missing value with the value of the same sensor from the previous measurement.

This would solve nearly all NaN values; a problem remains when the first value of a column is missing, as there is no previous value to copy. This is shown below.

In [7]:
kamyr_df.ffill()['SulphidityL-4 ']

0        NaN
1      29.11
2      29.11
3      29.02
4      29.01
       ...  
296    30.43
297    30.29
298    30.47
299    30.47
300    30.46
Name: SulphidityL-4 , Length: 301, dtype: float64

Although our dataset no longer consists purely of measured values, we can see that the sensor changes little from one measurement to the next, so a forward fill is arguably the most suitable option here.
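
One possible way to deal with the leading NaN is to follow the forward fill with a backward fill, so the first record copies the next known value. This is a sketch on a toy series that mimics the first five values shown above, not a recommendation for every case: a backward fill uses information from the future, which may be unacceptable when the data feeds a forecasting model.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical name) with a missing first value and a gap in the middle.
s = pd.Series([np.nan, 29.11, np.nan, 29.02, 29.01], name="toy_sensor")

# Forward fill copies the previous value; the leading NaN has no predecessor,
# so a backward fill is chained to copy the next known value into it.
filled = s.ffill().bfill()
print(filled)
```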

### Travel times

Another dataset from the same source contains a collection of recorded travel times along with specific information about each trip, e.g. the day of the week, the destination, ...

In [8]:
travel_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c2_data_preparation/data/travel-times.csv')
travel_df

Unnamed: 0,Date,StartTime,DayOfWeek,GoingTo,Distance,MaxSpeed,AvgSpeed,AvgMovingSpeed,FuelEconomy,TotalTime,MovingTime,Take407All,Comments
0,1/6/2012,16:37,Friday,Home,51.29,127.4,78.3,84.8,,39.3,36.3,No,
1,1/6/2012,08:20,Friday,GSK,51.63,130.3,81.8,88.9,,37.9,34.9,No,
2,1/4/2012,16:17,Wednesday,Home,51.27,127.4,82.0,85.8,,37.5,35.9,No,
3,1/4/2012,07:53,Wednesday,GSK,49.17,132.3,74.2,82.9,,39.8,35.6,No,
4,1/3/2012,18:57,Tuesday,Home,51.15,136.2,83.4,88.1,,36.8,34.8,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,7/18/2011,08:09,Monday,GSK,54.52,125.6,49.9,82.4,7.89,65.5,39.7,No,
201,7/14/2011,08:03,Thursday,GSK,50.90,123.7,76.2,95.1,7.89,40.1,32.1,Yes,
202,7/13/2011,17:08,Wednesday,Home,51.96,132.6,57.5,76.7,,54.2,40.6,Yes,
203,7/12/2011,17:51,Tuesday,Home,53.28,125.8,61.6,87.6,,51.9,36.5,Yes,


We have a total of 205 records and we can already see that the FuelEconomy column seems pretty bad. Let's quantify that.

In [9]:
travel_df.isna().sum().divide(len(travel_df)).round(4)*100

Date               0.00
StartTime          0.00
DayOfWeek          0.00
GoingTo            0.00
Distance           0.00
MaxSpeed           0.00
AvgSpeed           0.00
AvgMovingSpeed     0.00
FuelEconomy        8.29
TotalTime          0.00
MovingTime         0.00
Take407All         0.00
Comments          88.29
dtype: float64

In the end, FuelEconomy doesn't seem that bad, but there is also a Comments column in which nearly no values are filled in, which in perspective is understandable. Let's see what the comments look like.

In [10]:
travel_df[~travel_df.Comments.isna()].Comments

15                                  Put snow tires on
39                                         Heavy rain
49                                Huge traffic backup
50      Pumped tires up: check fuel economy improved?
52                                Backed up at Bronte
54                                Backed up at Bronte
60                                              Rainy
78                                   Rain, rain, rain
91                                   Rain, rain, rain
92         Accident: backup from Hamilton to 407 ramp
110                                           Raining
132                           Back to school traffic?
133                Took 407 all the way (to McMaster)
150                             Heavy volume on Derry
156                        Start early to run a batch
158    Accident at 403/highway 6; detour along Dundas
165                                      Detour taken
166                                    Must be Friday
172                         

As you would expect, these comments are text based. Now imagine we would like to run some Natural Language Processing (NLP) on them: it would be a pain to perform string operations on a column that is riddled with missing values.

Here is a simple example where we try to select all records containing the word 'rain', to no avail.

In [11]:
travel_df[travel_df.Comments.str.lower().str.contains('rain')]

ValueError: Cannot mask with non-boolean array containing NA / NaN values

The last line of the Python error traceback gives us the reason it failed: there were NaN values present.

Luckily, strings more or less have their own 'null' value, namely the empty string. By filling the missing comments with it, these operations become possible; most of the comments will just contain nothing.

In [12]:
travel_df.Comments = travel_df.Comments.fillna('')

In [13]:
travel_df[travel_df.Comments.str.lower().str.contains('rain')]

Unnamed: 0,Date,StartTime,DayOfWeek,GoingTo,Distance,MaxSpeed,AvgSpeed,AvgMovingSpeed,FuelEconomy,TotalTime,MovingTime,Take407All,Comments
39,11/29/2011,07:23,Tuesday,GSK,51.74,112.2,55.3,61.0,,56.2,50.9,No,Heavy rain
60,11/9/2011,16:15,Wednesday,Home,51.28,121.4,65.9,71.8,9.35,46.7,42.1,No,Rainy
78,10/25/2011,17:24,Tuesday,Home,52.87,123.5,65.1,72.4,8.97,48.7,43.8,No,"Rain, rain, rain"
91,10/12/2011,17:47,Wednesday,Home,51.4,114.4,59.7,65.8,8.75,51.7,46.9,No,"Rain, rain, rain"
110,9/27/2011,07:36,Tuesday,GSK,50.65,128.1,86.3,88.6,8.31,35.2,34.3,Yes,Raining
172,8/9/2011,08:15,Tuesday,GSK,49.08,134.8,60.5,67.2,8.54,48.7,43.8,No,Medium amount of rain


Fixed! Now we can use the comments for analysis.

We still have to fix the FuelEconomy column. Let us take a look at the non-NaN values.
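
An alternative that leaves the NaN values in place is to tell `str.contains` how to treat them via its `na` parameter. A minimal sketch on a made-up series of comments:

```python
import numpy as np
import pandas as pd

# Toy comments (made up), including missing entries.
comments = pd.Series(["Heavy rain", np.nan, "Rainy", "Detour taken", np.nan])

# case=False makes the match case-insensitive; na=False counts missing
# comments as "no match", so the result is a clean boolean mask.
mask = comments.str.contains("rain", case=False, na=False)
print(comments[mask])
```

Whether to fill with empty strings or to use `na=False` is mostly a matter of taste; the latter keeps the distinction between "no comment" and "empty comment".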

In [14]:
travel_df[~travel_df.FuelEconomy.isna()]

Unnamed: 0,Date,StartTime,DayOfWeek,GoingTo,Distance,MaxSpeed,AvgSpeed,AvgMovingSpeed,FuelEconomy,TotalTime,MovingTime,Take407All,Comments
6,1/2/2012,17:31,Monday,Home,51.37,123.2,82.9,87.3,-,37.2,35.3,No,
7,1/2/2012,07:34,Monday,GSK,49.01,128.3,77.5,85.9,-,37.9,34.3,No,
8,12/23/2011,08:01,Friday,GSK,52.91,130.3,80.9,88.3,8.89,39.3,36.0,No,
9,12/22/2011,17:19,Thursday,Home,51.17,122.3,70.6,78.1,8.89,43.5,39.3,No,
10,12/22/2011,08:16,Thursday,GSK,49.15,129.4,74.0,81.4,8.89,39.8,36.2,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
197,7/20/2011,08:24,Wednesday,GSK,48.50,125.8,75.7,87.3,7.89,38.5,33.3,Yes,
198,7/19/2011,17:17,Tuesday,Home,51.16,126.7,92.2,102.6,7.89,33.3,29.9,Yes,
199,7/19/2011,08:11,Tuesday,GSK,50.96,124.3,82.3,96.4,7.89,37.2,31.7,Yes,
200,7/18/2011,08:09,Monday,GSK,54.52,125.6,49.9,82.4,7.89,65.5,39.7,No,


It seems that aside from NaN values there are also other intruders, such as the '-' entries. A quick check on the data type (Dtype) reveals the column is not recognised as a number!

In [15]:
travel_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            205 non-null    object 
 1   StartTime       205 non-null    object 
 2   DayOfWeek       205 non-null    object 
 3   GoingTo         205 non-null    object 
 4   Distance        205 non-null    float64
 5   MaxSpeed        205 non-null    float64
 6   AvgSpeed        205 non-null    float64
 7   AvgMovingSpeed  205 non-null    float64
 8   FuelEconomy     188 non-null    object 
 9   TotalTime       205 non-null    float64
 10  MovingTime      205 non-null    float64
 11  Take407All      205 non-null    object 
 12  Comments        205 non-null    object 
dtypes: float64(6), object(7)
memory usage: 20.9+ KB


The column is typed as object (i.e. string), meaning that these numbers are stored as '9.24' instead of 9.24 and numerical operations are not possible.
We can cast them to numeric, but we have to tell pandas to coerce errors, meaning values that cannot be parsed will be converted to NaN.
We'll handle the NaNs later.
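
To see in isolation what coercion does, here is a small sketch on made-up values that imitate the column (numeric strings, a '-' placeholder and a missing entry):

```python
import pandas as pd

# Made-up raw values: numbers stored as strings, a '-' placeholder, a missing entry.
raw = pd.Series(["8.89", "-", "7.89", None])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
# instead of raising an exception.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
```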

In [16]:
travel_df.FuelEconomy = pd.to_numeric(travel_df.FuelEconomy, errors='coerce')
travel_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            205 non-null    object 
 1   StartTime       205 non-null    object 
 2   DayOfWeek       205 non-null    object 
 3   GoingTo         205 non-null    object 
 4   Distance        205 non-null    float64
 5   MaxSpeed        205 non-null    float64
 6   AvgSpeed        205 non-null    float64
 7   AvgMovingSpeed  205 non-null    float64
 8   FuelEconomy     186 non-null    float64
 9   TotalTime       205 non-null    float64
 10  MovingTime      205 non-null    float64
 11  Take407All      205 non-null    object 
 12  Comments        205 non-null    object 
dtypes: float64(7), object(6)
memory usage: 20.9+ KB


Wonderful, now the column is numerical, and we can see that 2 more missing values have popped up (the '-' entries)!
We could easily drop these 19 records and have a complete dataset.

In [17]:
travel_df.dropna()

Unnamed: 0,Date,StartTime,DayOfWeek,GoingTo,Distance,MaxSpeed,AvgSpeed,AvgMovingSpeed,FuelEconomy,TotalTime,MovingTime,Take407All,Comments
8,12/23/2011,08:01,Friday,GSK,52.91,130.3,80.9,88.3,8.89,39.3,36.0,No,
9,12/22/2011,17:19,Thursday,Home,51.17,122.3,70.6,78.1,8.89,43.5,39.3,No,
10,12/22/2011,08:16,Thursday,GSK,49.15,129.4,74.0,81.4,8.89,39.8,36.2,No,
11,12/21/2011,07:45,Wednesday,GSK,51.77,124.8,71.7,78.9,8.89,43.3,39.4,No,
12,12/20/2011,16:05,Tuesday,Home,51.45,130.1,75.2,82.7,8.89,41.1,37.3,No,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
197,7/20/2011,08:24,Wednesday,GSK,48.50,125.8,75.7,87.3,7.89,38.5,33.3,Yes,
198,7/19/2011,17:17,Tuesday,Home,51.16,126.7,92.2,102.6,7.89,33.3,29.9,Yes,
199,7/19/2011,08:11,Tuesday,GSK,50.96,124.3,82.3,96.4,7.89,37.2,31.7,Yes,
200,7/18/2011,08:09,Monday,GSK,54.52,125.6,49.9,82.4,7.89,65.5,39.7,No,


However, I'm leaving them in as an exercise for you to apply a technique we will see in the next part.

### Material properties

Another dataset from the same source contains the material properties of 36 samples. This time there is no timestamp, as the samples are not related to each other in time.

In [18]:
material_df = pd.read_csv('https://raw.githubusercontent.com/LorenzF/data-science-practical-approach/main/src/c2_data_preparation/data/raw-material-properties.csv')
material_df

Unnamed: 0,Sample,size1,size2,size3,density1,density2,density3
0,X12558,0.696,2.69,6.38,41.8,17.18,3.9
1,X14728,0.636,2.3,5.14,38.1,12.73,3.89
2,X15468,0.841,2.85,5.2,37.6,13.58,3.98
3,X21364,0.609,2.13,4.62,34.2,11.12,4.02
4,X23671,0.684,2.16,4.87,36.4,12.24,3.92
5,X24055,0.762,2.81,6.36,38.1,13.28,3.89
6,X24905,0.552,2.34,5.03,41.3,16.71,3.86
7,X25917,0.501,2.17,5.09,,,
8,X27871,0.619,2.11,5.13,,,
9,X28690,0.61,2.1,4.18,35.0,12.15,3.86


Let us quantify the amount of missing data.

In [19]:
material_df.isna().sum().divide(len(material_df)).round(4)*100

Sample       0.00
size1        2.78
size2        2.78
size3        2.78
density1    27.78
density2    27.78
density3    27.78
dtype: float64

Unfortunately that is a lot of missing data, spread over many records. Dropping rows seems almost impossible here if we want to keep a healthy number of records.

Here it would be wise to go for a more elaborate method of imputation. I opted for the K-nearest neighbours method, which looks at the K most similar records in the dataset to make an educated guess on what the missing value could be. The underlying assumption is that records that are similar in the columns we know are also similar in the other properties (columns).

I'm using the sklearn library for this, which offers more imputation techniques, such as MICE-style iterative imputation.
More info can be found [here](https://scikit-learn.org/stable/modules/impute.html)

In [20]:
from sklearn.impute import KNNImputer

I'm creating an imputer object and specifying that I want to use the 5 most similar records and weigh them by their distance to the record being imputed, meaning closer neighbours are more important.

In [21]:
imputer = KNNImputer(n_neighbors=5, weights="distance")

As the imputer only takes numerical values, I had to do some pandas magic and drop the first column ('Sample') before imputing; it can be added back afterwards. The result is a fully filled dataset, and you can recognise the new values as they are not rounded.

In [22]:
pd.DataFrame(
    imputer.fit_transform(material_df.drop(columns=['Sample'])), 
    columns=material_df.columns.drop('Sample')
)

Unnamed: 0,size1,size2,size3,density1,density2,density3
0,0.696,2.69,6.38,41.8,17.18,3.9
1,0.636,2.3,5.14,38.1,12.73,3.89
2,0.841,2.85,5.2,37.6,13.58,3.98
3,0.609,2.13,4.62,34.2,11.12,4.02
4,0.684,2.16,4.87,36.4,12.24,3.92
5,0.762,2.81,6.36,38.1,13.28,3.89
6,0.552,2.34,5.03,41.3,16.71,3.86
7,0.501,2.17,5.09,38.495282,14.029399,3.93118
8,0.619,2.11,5.13,37.405275,13.157346,3.943667
9,0.61,2.1,4.18,35.0,12.15,3.86
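
If the identifier column is needed afterwards, it can be re-attached to the imputed frame. The following is a sketch on a made-up four-row frame (the values and the choice of `n_neighbors=2` are assumptions for illustration), not the notebook's actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (made-up values) with an identifier column and one missing density.
toy_df = pd.DataFrame({
    "Sample": ["A", "B", "C", "D"],
    "size1": [0.70, 0.64, 0.84, 0.61],
    "density1": [41.8, 38.1, np.nan, 34.2],
})

# Only the numerical columns are passed to the imputer.
numeric_cols = toy_df.columns.drop("Sample")
imputer = KNNImputer(n_neighbors=2, weights="distance")

# fit_transform returns a plain array, so column names and index are restored.
imputed = pd.DataFrame(
    imputer.fit_transform(toy_df[numeric_cols]),
    columns=numeric_cols,
    index=toy_df.index,
)

# Put the identifier back as the first column.
imputed.insert(0, "Sample", toy_df["Sample"])
print(imputed)
```

Passing `index=toy_df.index` keeps the rows aligned, which matters when the original frame does not have a default 0..n-1 index.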


This concludes the part on missing values. Perhaps you can try it yourself and impute the missing values of FuelEconomy using the SimpleImputer or even the IterativeImputer.
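
As a possible starting point, here is a minimal sketch of the SimpleImputer on a made-up column. The values and the choice of the median strategy are assumptions; on the real travel data you would apply it to `travel_df[['FuelEconomy']]` after the numeric conversion.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (made-up values) standing in for FuelEconomy.
toy_df = pd.DataFrame({"FuelEconomy": [8.89, np.nan, 7.89, 9.35, np.nan]})

# The median strategy replaces every NaN with the median of the observed values.
imputer = SimpleImputer(strategy="median")

# The imputer expects a 2D input, hence the double brackets; ravel() flattens
# the (n, 1) result back into a single column.
toy_df["FuelEconomy_imputed"] = imputer.fit_transform(toy_df[["FuelEconomy"]]).ravel()
print(toy_df)
```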