[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/SmilodonCub/DS4VS/blob/master/Week6/DS4VS_week6_cleaningup.ipynb)

# Week 6: Cleaning Up

more data wrangling & evaluating missingness

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQMkTZ7gZ1I-XuSUQGdDTaBnSYCrr44R-OOHQwfR_7X6ivRmy5Gb5HV5hF74LZyqUAhJ-I&usqp=CAU" width="30%" style="margin-left:auto; margin-right:auto">


### Exploring Diabetes Data

* applying functions to DataFrame columns
* transforming data in certain DataFrame rows
* visualizing data missingness

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Recoding Values

applying functions to change DataFrame columns  

Our first dataset:

In [None]:
path = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/diabetes_data_upload.csv'
diabetes_earlystage = pd.read_csv( path )
diabetes_earlystage.head()

In [None]:
# .info() prints its summary itself and returns None, so no print() is needed
diabetes_earlystage.info()

In [None]:
print( diabetes_earlystage.describe() )
#print( diabetes_earlystage.describe( include='object') )

### Apply a function to recode yes/no column data

**Method 1**: create our own function

In [None]:
def recode_YesNo( value ):
    if value == 'Yes':
        return 1
    else:
        return 0

In [None]:
# use the .apply() method to apply recode_YesNo
diabetes_earlystage[ 'Polyuria' ] = diabetes_earlystage[ 'Polyuria' ].apply( recode_YesNo )
diabetes_earlystage.head()

**Method 2**: a lambda function (slightly more 'Pythonic' 🕶️ )

In [None]:
# lambda function
diabetes_earlystage[ 'Polydipsia' ] = diabetes_earlystage[ 'Polydipsia' ].apply( lambda x : 1 if x == 'Yes' else 0 )
diabetes_earlystage.head()
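
The same recoding can also be written as a column-wide comparison, which avoids calling a Python function once per row. The sketch below is only an illustration: `symptom` is a made-up Series standing in for a Yes/No column such as `Polydipsia`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for a Yes/No column such as 'Polydipsia'
symptom = pd.Series(['Yes', 'No', 'Yes', 'No'])

# Same recoding as the lambda: 1 where the value is 'Yes', 0 otherwise
via_apply = symptom.apply(lambda x: 1 if x == 'Yes' else 0)

# Vectorized alternative: compare the whole column, then cast True/False to 1/0
via_compare = (symptom == 'Yes').astype(int)

print(via_compare.tolist())
```

Note that both versions code anything that is not exactly `'Yes'` as 0, which includes typos and missing values.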

**Method 3**: lists with the .replace() method

In [None]:
columnsleft2decode = ['Gender', 'sudden weight loss',
       'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
       'Itching', 'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity', 'class']

value = [ 1, 0, 1, 0, 1, 0 ]
stringval = [ 'Yes', 'No', 'Male', 'Female', 'Positive', 'Negative' ]

for column in columnsleft2decode:
    diabetes_earlystage[ column ] = diabetes_earlystage[ column ].replace( stringval, value )
    
diabetes_earlystage.head()
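
Two parallel lists work, but they are easy to get out of step. One possible alternative, sketched here on a small made-up table (`toy` and `codes` are hypothetical names), is to keep each string and its code together in a dictionary and use `.map()`.

```python
import pandas as pd

# Made-up rows that mimic the layout of the early-stage diabetes table
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Obesity': ['Yes', 'No', 'Yes'],
    'class': ['Positive', 'Negative', 'Positive'],
})

# One dictionary keeps each string next to the number that replaces it
codes = {'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0, 'Positive': 1, 'Negative': 0}

# .map() looks up each value in the dictionary; unlisted values become NaN
recoded = toy.apply(lambda col: col.map(codes))
print(recoded)
```

Unlike the earlier methods, `.map()` leaves a `NaN` wherever a value is not in the dictionary, so unexpected strings show up as missing instead of silently becoming 0.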

## Recoding Values

changing certain values in columns

In [None]:
path = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/diabetes.csv'
diabetes_pima = pd.read_csv( path )
diabetes_pima.head()

In [None]:
diabetes_pima.info()

In [None]:
diabetes_pima.describe()

### Problematic Data Encoding

We see that for many columns, the minimum value == 0  
For some features (`Pregnancies`) this is an anticipated value.  
However, can a patient have a `SkinThickness` == 0?
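
To see how widespread the suspicious zeros are, we can count them per column. This is a minimal sketch on made-up numbers (`toy_pima` is a hypothetical stand-in that only borrows the Pima column names).

```python
import pandas as pd

# Made-up values; the column names follow the Pima dataset
toy_pima = pd.DataFrame({
    'Pregnancies': [0, 2, 1, 0],
    'SkinThickness': [35, 0, 0, 23],
    'BMI': [33.6, 26.6, 0.0, 28.1],
})

# Comparing to 0 gives True/False; summing counts the True values per column
zero_counts = (toy_pima == 0).sum()
print(zero_counts)
```

The count alone cannot tell a plausible zero (`Pregnancies`) from an implausible one (`SkinThickness`); that judgment comes from knowing what each feature measures.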

### The Solution

Here we will use the .replace() method to selectively replace 0 values with `NaN` (`np.nan`)

In [None]:
# display the mean values of features before transforming the data
print( diabetes_pima.describe(include='all').loc['mean'] )

In [None]:
pimacolumns_2change = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Outcome']
diabetes_pima[ pimacolumns_2change ] = diabetes_pima[ pimacolumns_2change ].replace(0, np.nan )

# display the mean values of features after transforming the data
print( diabetes_pima.describe(include='all').loc['mean'] )

### Filling Missing Values 

**Oh No!** - we accidentally replaced all the 0 values in our `Outcome` column.  
This is terrible, because this column contains the categorical labels.  
**Binary (0/1) encoding** - Encoding a two-level category with 1s and 0s in this style is very common (it is closely related to one-hot encoding), and we will see it throughout our machine learning course material  

In [None]:
diabetes_pima.head()

### Let's fix this mistake using the .fillna() method:

In [None]:
diabetes_pima[ 'Outcome' ] = diabetes_pima[ 'Outcome' ].fillna(0)#.astype('Int64')
diabetes_pima.head()
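
One side effect worth noticing is the column's data type. The sketch below uses a made-up Series (`outcome` is a hypothetical stand-in for the `Outcome` column) to show why the labels come back as floats and how the commented-out `.astype('Int64')` would restore integers.

```python
import numpy as np
import pandas as pd

# Made-up labels standing in for the 'Outcome' column
outcome = pd.Series([1, 0, 0, 1])
print(outcome.dtype)            # integer to start with

# Replacing 0 with NaN forces the column to float, because NaN is a float
damaged = outcome.replace(0, np.nan)
print(damaged.dtype)

# fillna(0) restores the values, but the column stays float
repaired = damaged.fillna(0)
print(repaired.tolist())

# Casting to pandas' nullable 'Int64' turns the labels back into integers
repaired_int = repaired.astype('Int64')
print(repaired_int.tolist())
```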

### What the `NaN`, `None`?!?

**NaN** stands for Not a Number, although in Python `NaN` is a floating-point value (`float`), so it is used in numerical arrays.  

**None** is a Python keyword that defines a null value | absence of value | empty value, and it is used in object arrays. 

[Confused?](https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b) ....you haven't seen the half of it

In [None]:
type( np.nan )
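
A few quick checks make the difference concrete. This is a small sketch; `numbers` and `labels` are made-up Series used only for illustration.

```python
import numpy as np
import pandas as pd

# NaN is a float, and it is never equal to anything, including itself
print(type(np.nan))
print(np.nan == np.nan)

# None is its own type and is compared with 'is'
print(type(None))
print(None is None)

# pandas' isna() detects both kinds of missing value
numbers = pd.Series([1.5, np.nan])      # numeric column with a NaN
labels = pd.Series(['Yes', None])       # column of strings with a None
print(numbers.isna().tolist())
print(labels.isna().tolist())
```

Because `NaN` never equals itself, test for missing values with `.isna()` and not with `== np.nan`.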

## Viewing Missingness

In [None]:
diabetes_pima.isna().sum()
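
Raw counts are easier to compare across columns when expressed as fractions. A minimal sketch on made-up numbers (`toy_pima` is a hypothetical stand-in that borrows the Pima column names):

```python
import numpy as np
import pandas as pd

# Made-up values; the column names follow the Pima dataset
toy_pima = pd.DataFrame({
    'Glucose': [148.0, 85.0, np.nan, 89.0],
    'Insulin': [np.nan, np.nan, np.nan, 94.0],
})

# isna() gives True/False, so the mean of each column is the fraction missing
fraction_missing = toy_pima.isna().mean()
print(fraction_missing)
```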

### No, not viewing, let's Visualize Missingness

We will explore visual summaries of missing data to learn more about the data's structure

In [None]:
#pip install missingno
import missingno as msno

### `missingno` Matrix Style plot

Every row is an observation.  
Each `NaN` appears as a white gap in the bar for its column

In [None]:
%matplotlib inline
msno.matrix(diabetes_pima)

### Hunch: what if the missingness is related to the categorical `Outcome` variable?

In [None]:
#reorder the dataframe by outcome
sorted_pima = diabetes_pima.sort_values( by=['Outcome'] )
# visualize a missingno matrix plot
msno.matrix(sorted_pima)
plt.show()
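
The same hunch can be checked with numbers by comparing the fraction of missing values within each `Outcome` group. This is a sketch on made-up data (`toy_pima` and `features` are hypothetical names), not a result from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up values; the column names follow the Pima dataset
toy_pima = pd.DataFrame({
    'Insulin': [np.nan, 94.0, np.nan, 168.0],
    'BMI': [33.6, np.nan, 23.3, 28.1],
    'Outcome': [1, 0, 1, 0],
})

# Flag missing values, then average the flags within each Outcome group
features = ['Insulin', 'BMI']
missing_by_outcome = toy_pima[features].isna().groupby(toy_pima['Outcome']).mean()
print(missing_by_outcome)
```

If the rows of this table look alike for a feature, its missingness probably does not depend much on `Outcome`.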

### `missingno` Bar Plot

gives an overview of how complete each feature is: the taller the bar, the fewer values are missing

In [None]:
msno.bar(diabetes_pima)
plt.show()

### `missingno` Heatmap

This will identify correlated missingness (where a missing value in one column is predictive of a missing value in the other)

In [None]:
msno.heatmap(diabetes_pima)
plt.show()
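
The idea behind the heatmap can be approximated by hand: turn every cell into a missing / not-missing flag and correlate the flags between columns. The sketch below is a rough illustration on made-up data (`toy_pima` is hypothetical) and is not meant to reproduce `missingno`'s exact output.

```python
import numpy as np
import pandas as pd

# Made-up values; SkinThickness and Insulin are missing in the same rows
toy_pima = pd.DataFrame({
    'SkinThickness': [35.0, np.nan, np.nan, 23.0, np.nan],
    'Insulin': [94.0, np.nan, np.nan, 168.0, np.nan],
    'BMI': [33.6, 26.6, np.nan, 28.1, 43.1],
})

# Turn the table into missing / not-missing flags, then correlate the flags
nullity_corr = toy_pima.isna().corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing in the same rows; a value near 0 means their missingness is unrelated.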

### Interesting! Let's try our ordered matrix plot again....

In [None]:
#reorder the dataframe by Insulin
sorted_pima = diabetes_pima.sort_values( by=['Insulin'] )
# visualize a missingno matrix plot
msno.matrix(sorted_pima)
plt.show()

### `missingno` Dendrogram

hierarchical clustering of shared missingness

In [None]:
msno.dendrogram(diabetes_pima)
plt.show()

## We looked at some visualizations of our data's missingness. Next week we will focus on visualizing our data!
<img src="https://content.techgig.com/photo/80071467/pros-and-cons-of-python-programming-language-that-every-learner-must-know.jpg?132269" width="100%" style="margin-left:auto; margin-right:auto">