In [3]:
import numpy as np
import pandas as pd
df = pd.read_csv('tips.csv')

Real-world data will often have missing elements, and many machine learning models are not capable of coping with missing data.

pandas displays missing data as NaN, and uses pd.NaT instead when the missing value should be a timestamp.

e.g. if you have a stock ticker with data showing the stock price at given times, and you are missing a time value, that element will show as NaT.
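A minimal sketch of the two markers, using made-up prices and timestamps (the values and the names `prices` and `times` are assumptions for illustration, not data from the lecture):

```python
import numpy as np
import pandas as pd

# Hypothetical stock prices; the second price is missing.
prices = pd.Series([101.5, np.nan, 103.2])

# Hypothetical timestamps; the third one is missing.
times = pd.to_datetime(pd.Series(['2021-01-04 09:30', '2021-01-04 09:31', None]))

print(prices)         # the missing price is displayed as NaN
print(times)          # the missing timestamp is displayed as NaT
print(prices.isna())  # isna() flags NaN ...
print(times.isna())   # ... and NaT in the same way
```

Either way, `isna()` treats both markers as missing, so the same checks work on numeric and datetime columns.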

Options for missing data:

Keep it

Remove it

Replace it

The correct approach will depend on the specific circumstance; there is no approach which is correct in 100% of cases.

Before you drop a row or column, you should ask whether the missing data is concentrated in particular rows or in particular columns.

If you have the following table:

In [4]:
lst = {'Year': (1776,1867,1821),
      'Pop': (328,38,126),
      'GDP' : (20.5, 1.7, 1.22),
      'Area' : ['NaN' , 'NaN', .76]}

myindex = ('USA','CANADA','MEXICO')

Note, I manually entered NaN as a string because I'm not sure how to actually input NaN.

In [5]:
df = pd.DataFrame(lst, index = myindex)

In [6]:
df

Unnamed: 0,Year,Pop,GDP,Area
USA,1776,328,20.5,
CANADA,1867,38,1.7,
MEXICO,1821,126,1.22,0.76
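One thing worth checking with the string approach: as far as I can tell, pandas treats the text 'NaN' as ordinary data rather than as a missing value. A small sketch comparing the two (the Series names here are just for illustration):

```python
import numpy as np
import pandas as pd

as_string = pd.Series(['NaN', 'NaN', 0.76])      # text that merely looks like NaN
as_missing = pd.Series([np.nan, np.nan, 0.76])   # real missing values

print(as_string.isna().sum())    # 0: the strings are not counted as missing
print(as_missing.isna().sum())   # 2
print(as_string.dtype, as_missing.dtype)  # object float64
```

So a column built from strings will not respond to the missing-data tools, and it also stops being a numeric column.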


Knowing what I know now, let's try this again, using np.nan for the missing values:

In [7]:
mydata = np.array([[1776, 328, 20.5, np.nan],[1867, 38, 1.7,np.nan],[1821,126,1.22,.76]])

In [8]:
mydata
myindex = ['USA','CANADA','MEXICO']
mycolumns = ['Year','Pop','GDP','Area']

In [9]:
df = pd.DataFrame(data = mydata, columns = mycolumns, index = myindex)

In [10]:
df.head()

Unnamed: 0,Year,Pop,GDP,Area
USA,1776.0,328.0,20.5,
CANADA,1867.0,38.0,1.7,
MEXICO,1821.0,126.0,1.22,0.76


In this case ^ it makes sense to drop the Area column rather than the USA or CANADA rows, since all of the missing data is in Area. Dropping those two rows would throw away two of the three countries.
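A sketch of how that comparison could be checked in code, rebuilding the same table so the cell runs on its own. The names `by_column` and `by_row` are mine, and this is one way to do it rather than the way shown in the lecture:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1776, 328, 20.5, np.nan],
     [1867, 38, 1.7, np.nan],
     [1821, 126, 1.22, 0.76]],
    columns=['Year', 'Pop', 'GDP', 'Area'],
    index=['USA', 'CANADA', 'MEXICO'],
)

print(df.isna().sum())        # missing values per column
print(df.isna().sum(axis=1))  # missing values per row

by_column = df.drop(columns='Area')  # same result here as df.dropna(axis=1)
by_row = df.dropna(axis=0)           # drops every row with any missing value

print(by_column.shape)  # (3, 3): all three countries kept, one column lost
print(by_row.shape)     # (1, 4): only MEXICO survives
```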

---------------------------------

Option 1: Keep the data

Pros: Easiest, and remains faithful to the true data, i.e. it reflects the fact that some data was missing, which may itself influence statistics.

Cons: Many methods, including machine learning, do not support NaN. Also, missing data is often the result of incorrect data entry. For example, a temperature of 0 degrees C may have been recorded as nonexistent, which would show up as NaN and be incorrect. Or, if a store is doing a promotion and giving away a product for $0, the price may have been entered as NaN, which would lead to incorrect conclusions in the model since the product was still being transferred out. In both of these cases, changing NaN to 0 would be the best option.
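A minimal sketch of the $0 promotion case, with an invented `sales` table (the items and prices are hypothetical). It fills only the one column where a missing value is known to mean zero:

```python
import numpy as np
import pandas as pd

# Hypothetical sales log: the free promotional item was recorded as missing.
sales = pd.DataFrame({
    'item': ['mug', 'poster', 'sticker'],
    'price': [8.0, 12.5, np.nan],
})

# Replace NaN with 0 in the price column only.
sales['price'] = sales['price'].fillna(0)
print(sales)
```

Filling a single column rather than the whole DataFrame avoids turning unrelated missing values into zeros by accident.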

---------------------------------

Option 2: Remove it

Pros: This is also easy, and can be based on rules which you set, for example dropping every row which has fewer than 4 data points present.

Cons: Potential to lose a lot of data if there are many inputs and any one of them is missing. For example, if you have a climate model which takes into account temperature, humidity, air pressure, UV index and many other inputs, dropping every row where one of these is missing could discard a large amount of data, because it is likely that at least one of the many inputs will be missing at any given time. If you build a model on a data set from which a large share of the rows was removed, it could become irrelevant.
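To see how much data a drop rule costs, here is a sketch with invented climate readings (the `readings` table and its values are hypothetical). It compares dropping any incomplete row against a rule based on `dropna`'s `thresh` parameter:

```python
import numpy as np
import pandas as pd

# Hypothetical readings: three of the four rows are missing exactly one input.
readings = pd.DataFrame({
    'temperature': [21.0, np.nan, 19.5, 22.1],
    'humidity':    [0.40, 0.45, np.nan, 0.38],
    'pressure':    [1012, 1009, 1011, np.nan],
    'uv_index':    [5, 6, 4, 7],
})

strict = readings.dropna()           # drop any row with a missing value
lenient = readings.dropna(thresh=3)  # keep rows with at least 3 non-missing values

print(len(readings), len(strict), len(lenient))  # 4 1 4
```

With the strict rule, three quarters of the rows disappear even though each one is missing only a single input.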

---------------------------------

Option 3: Filling in the missing data

Pros: Potential to save a lot of data which will help train a machine learning model

Cons: The hardest to do, and somewhat arbitrary, since what you fill in is up to you, which can lead to false conclusions. The validity of the model will therefore depend on how reasonable the fill-in values were.

Now I want to figure out how to delete the Area column and create a new Carriers column as is now shown in the lecture. 

-----------------
Moving along and not actually doing that ^ :

The interpolation method of filling in missing data means that you will have to use a reasonable guess for NaN values. You will need some justification for the numbers you replace NaN with.
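Two common ways to produce such a guess are filling with the column mean and linear interpolation between neighbouring values. A minimal sketch on a made-up temperature series (the data is hypothetical, and which method is justified depends on the data):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperatures with one gap.
temps = pd.Series([10.0, 12.0, np.nan, 16.0])

filled_mean = temps.fillna(temps.mean())  # mean of the known values
filled_linear = temps.interpolate()       # linear interpolation between neighbours

print(filled_mean.tolist())
print(filled_linear.tolist())  # [10.0, 12.0, 14.0, 16.0]
```

The two methods give different answers for the same gap, which is exactly why the choice needs a justification.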

Pt 2: The syntax for replacing NaN - How do you actually do it?

In [11]:
df

Unnamed: 0,Year,Pop,GDP,Area
USA,1776.0,328.0,20.5,
CANADA,1867.0,38.0,1.7,
MEXICO,1821.0,126.0,1.22,0.76
