# Day2 Wrap-Up

* **Missing Values**: Only 6436 of 12795 rows (50.3%) are "complete cases"; each of the remaining rows has at least one missing value. Eight of the fourteen predictor variables contain missing values.

* **STARS Category**: More than 26% of STARS values are missing. Is that necessarily a cause for concern? Not necessarily. Based on domain knowledge and common sense, one plausible explanation is that those wines simply have not been rated yet.

* __Negative Values__: Nine variables representing chemical composition measures within wines (Chlorides, ResidualSugar, FreeSulfurDioxide, CitricAcid, VolatileAcidity, TotalSulfurDioxide, Sulphates, FixedAcidity and Alcohol) contain large numbers of negative values. Further analysis shows that the negative values are pervasive throughout the data set and appear to occur randomly within the 12795 observations. A cursory survey of typically reported values for each of these nine attributes indicates that they __should be positive__ and __bounded below by zero__. In other words, __none of these variables could plausibly contain a valid negative value within the context of our data set__.
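
The checks summarized above can be reproduced with a few lines of pandas. This is a minimal sketch on a small made-up frame: the values and the `summarize_quality` helper are hypothetical and only stand in for the real wine data, while the column names are borrowed from the variables discussed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the wine data; the values are made up.
df = pd.DataFrame({
    "FixedAcidity": [7.1, -3.2, 6.8, np.nan, 5.5],
    "Alcohol": [9.9, 10.4, np.nan, -1.2, 11.0],
    "STARS": [2.0, np.nan, 3.0, np.nan, 1.0],
})


def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column percentage of missing values and count of negative values."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "pct_missing": frame.isna().mean() * 100,
        "n_negative": (numeric < 0).sum(),  # NaN < 0 evaluates to False
    })


# A "complete case" is a row with no missing value in any column.
complete_cases = int(df.notna().all(axis=1).sum())
summary = summarize_quality(df)

print(f"{complete_cases} of {len(df)} rows are complete cases")
print(summary)
```

Running the same two checks on the full data set is how figures such as the share of complete cases or the share of missing STARS values would be obtained.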


## Reasonable Assumptions and Approaches

### __Negative Values__

Since we see that negative values are implausible, here are some possible approaches: 
- Add a constant (e.g., the absolute value of the minimum value for a given variable) to each negative value to ensure that all values within an attribute are positive.

- Use the absolute value of each negative data value in place of the negative value.

- Simply delete the attributes containing the negative data values.

- Delete any observations containing negative data values.
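
The four options above could look roughly like this in pandas. This is a sketch on hypothetical toy values, not the wine data. One assumption to flag: for the "add a constant" option the shift is applied to the whole column rather than only to the negative entries, so that the ordering of values is preserved.

```python
import pandas as pd

# Hypothetical toy values; not taken from the wine data.
df = pd.DataFrame({
    "Chlorides": [0.05, -0.03, 0.08, -0.10],
    "Alcohol": [9.5, 10.1, -2.0, 11.3],
})
cols = ["Chlorides", "Alcohol"]

# Option 1: shift each column by |min| so that its smallest value becomes 0.
# Assumption: the whole column is shifted, which keeps the ordering intact.
# clip(upper=0) leaves columns without negative values unshifted.
shifted = df[cols] - df[cols].min().clip(upper=0)

# Option 2: replace each negative value with its absolute value.
absolute = df[cols].abs()

# Option 3: drop every attribute that contains at least one negative value.
neg_cols = [c for c in cols if (df[c] < 0).any()]
without_cols = df.drop(columns=neg_cols)

# Option 4: drop every observation that contains at least one negative value.
without_rows = df[~(df[cols] < 0).any(axis=1)]

print(shifted, absolute, without_cols.shape, without_rows, sep="\n\n")
```

On this toy frame options 3 and 4 discard almost everything, which hints at why deletion is costly when negative values are pervasive.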

### __Missing Data Values__

Since dropping roughly half of the data is not a practical option, the missing values have to be handled in some other way. Here are some approaches:

- Imputation using the mean, median or mode of each variable

- Imputation via a K-Nearest Neighbors imputation (e.g., https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html)

- Imputation via backfilling and forward filling

- Imputation via an iterative imputer (i.e., each feature is imputed sequentially, one after the other, allowing prior imputed values to be used as part of a model in predicting subsequent features. For an example see: https://machinelearningmastery.com/iterative-imputation-for-missing-values-in-machine-learning/ )
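
As a rough illustration, the approaches listed above map onto scikit-learn and pandas as follows. The data here is synthetic and the parameter choices (`n_neighbors=5`, `max_iter=10`) are arbitrary assumptions, not recommendations for the wine data.

```python
import numpy as np
import pandas as pd
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Synthetic data with a deterministic missingness pattern (illustration only).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(loc=10, scale=2, size=(50, 3)), columns=["a", "b", "c"])
X.loc[::5, "a"] = np.nan   # every 5th row of "a" is missing
X.loc[3::7, "b"] = np.nan  # every 7th row of "b", starting at row 3, is missing

imputers = {
    "median": SimpleImputer(strategy="median"),  # or "mean" / "most_frequent"
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

imputed = {
    name: pd.DataFrame(imp.fit_transform(X), columns=X.columns, index=X.index)
    for name, imp in imputers.items()
}

# Forward fill, then backward fill to cover a missing value in the first row.
imputed["ffill_bfill"] = X.ffill().bfill()

for name, frame in imputed.items():
    print(name, "remaining NaNs:", int(frame.isna().sum().sum()))
```

Forward and backward filling copy values from neighboring rows, so they make the most sense when the row order carries meaning.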

The choice of imputation method can depend on computational complexity and on the reliability and performance of each candidate approach. For example, for a very small number of missing values it might be appropriate to rely on a median value, since imputing the same value for a small number of instances is relatively unlikely to introduce bias within your data (i.e., __alter the probability density function derived from the variable's known data__).

For situations in which we have more than just a small number of missing values, use of a more complex imputation approach is generally warranted to increase the likelihood that we will maintain the shape of the probability density function we derived from the variable's known data.
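
The effect of constant-value imputation on the shape of a distribution can be checked numerically. The sketch below uses synthetic standard-normal data and a hypothetical helper, `std_ratio_after_median_fill`, to compare the standard deviation after median imputation with that of the observed values. The 26% case is chosen to mirror the share of missing STARS values.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=0.0, scale=1.0, size=2000))


def std_ratio_after_median_fill(series: pd.Series, frac_missing: float, seed: int = 0) -> float:
    """Std of the median-imputed series divided by the std of the observed values."""
    # Pick a random subset of rows to treat as missing.
    missing_idx = series.sample(frac=frac_missing, random_state=seed).index
    observed = series.drop(missing_idx)
    filled = series.copy()
    filled.loc[missing_idx] = observed.median()
    return filled.std() / observed.std()


small = std_ratio_after_median_fill(values, 0.01)  # 1% missing
large = std_ratio_after_median_fill(values, 0.26)  # 26% missing

print(f"std ratio with  1% missing: {small:.3f}")
print(f"std ratio with 26% missing: {large:.3f}")
```

A ratio close to 1 means the spread of the variable is essentially preserved. The ratio drops noticeably as the missing share grows, because a large block of identical values piles up at the median and narrows the distribution.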