### Missing Data

In real-world data, missing values show up in many forms and for different reasons:
- the data may contain errors or be corrupted
- missing entries may be encoded as zeros, question marks, negative values, or placeholder codes such as '99'

You need to understand the data in order to recognize what counts as a "missing value". This is not always straightforward.

Missing values in a DataFrame are indicated by NaN (Not a Number).
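Since placeholder codes are not NaN, pandas will not count them as missing until they are converted. A minimal sketch of that conversion follows; the toy DataFrame, its column names, and the choice of sentinel values are all made up for illustration, and the right sentinels for a real dataset depend on how it was collected.

```python
import pandas as pd

# Hypothetical toy data: "?", -1 and 99 stand in for missing entries
raw = pd.DataFrame({
    "age": [34, 99, 51, -1],
    "city": ["Leeds", "?", "York", "Hull"],
})

# Assumed per-column sentinel values (illustrative, not from any real dataset)
sentinels = {"age": [99, -1], "city": ["?"]}

clean = raw.copy()
for col, values in sentinels.items():
    # mask() replaces entries where the condition is True with NaN
    clean[col] = clean[col].mask(clean[col].isin(values))

# Count the missing values per column after the conversion
missing_per_column = clean.isna().sum()
print(missing_per_column)
```

Keeping the sentinels per column avoids wiping out legitimate values, for example a real 99 in a column where 99 is not a placeholder.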

In [None]:
#get a sense of the missing values:
#returns True for each column that contains at least one missing value
df.isna().any()

In [None]:
import matplotlib.pyplot as plt

#plotting the number of missing values per column
df.isna().sum().plot(kind="bar")
plt.show()

##### 1 - dropping the rows with NaN values

In [None]:
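import numpy as np

#may want to replace zeros with NaNs first (here in a column named "column")
df["column"] = df["column"].replace(0, np.nan)

#dropna() returns a new DataFrame without the rows containing NaN; df itself is unchanged
df.dropna()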

##### 2 - replace with zeros

In [None]:
#fillna() returns a new DataFrame with every NaN replaced by 0
df.fillna(0)

##### 3 - imputation with Preprocessing Methods

- Imputation makes an "educated guess": each missing value is replaced with a reasonable estimate, such as the mean of its column.
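To make the "educated guess" concrete, here is a small sketch of mean imputation done by hand in pandas. The toy DataFrame and its column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "height": [150.0, np.nan, 170.0],
    "weight": [50.0, 60.0, np.nan],
})

# Column means; mean() skips NaN by default
col_means = toy.mean()

# fillna() with a Series fills each column with its own mean
imputed = toy.fillna(col_means)
print(imputed)
```

Here the missing height becomes 160.0 (the mean of 150 and 170) and the missing weight becomes 55.0 (the mean of 50 and 60).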

In [None]:
#imputing the NaN with the mean of each column
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

imp.fit(X)
X = imp.transform(X)
#imputers are known as transformers because they implement transform()

In [None]:
#OR can use imputing within a pipeline
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
logreg = LogisticRegression()

steps = [('imputation', imp),
        ('logistic_regression', logreg)]

#pipeline constructor
pipeline = Pipeline(steps)

#the train/test split (e.g. with train_test_split) must be done before this point
#fit the pipeline to the training set
pipeline.fit(X_train, y_train)

#predict on the test set
y_pred = pipeline.predict(X_test)

#compute accuracy
pipeline.score(X_test, y_test)
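The pipeline cell above assumes `X_train`, `X_test`, `y_train` and `y_test` already exist. The sketch below runs the same steps end to end on synthetic data. The data, the variable names, and the 10% missing rate are all assumptions made for illustration. It also checks that the imputer's means are learned from the training split only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data (illustrative only): 200 rows, 3 features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Knock out roughly 10% of the entries to simulate missing values
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

demo_pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("logistic_regression", LogisticRegression()),
])
demo_pipeline.fit(X_tr, y_tr)

# The fitted imputer stores one mean per column, computed from X_tr only
learned_means = demo_pipeline.named_steps["imputation"].statistics_
accuracy = demo_pipeline.score(X_te, y_te)
print(learned_means, accuracy)
```

Because the imputer sits inside the pipeline, the test rows are filled with the training means at prediction time, so no information from the test set leaks into the fit.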

##### extra example

In [None]:
cols_with_missing = ["small_sold", "large_sold", "xl_sold"]
avocados_2016[cols_with_missing].hist()
plt.show()

# Fill in missing values with 0
avocados_filled = avocados_2016.fillna(0)

# Create histograms of the filled columns
avocados_filled[cols_with_missing].hist()

# Show the plot
plt.show()