## Data Cleaning

* Data cleaning is data set specific, but some problems are common:
    * Missing values
    * Duplicates
    * Reporting or collection bias or drift
    * Reporting or collection errors
* Data cleaning is more effective when it is combined with visualization and some preliminary modeling
* Describe all the steps you take to load and clean the data so that you and others can repeat the process
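
A quick first pass over the common problems listed above can be sketched as follows. The helper name `audit` and the toy data are assumptions made for illustration; they are not part of the course code or the course data set.

```python
import numpy as np
import pandas as pd

def audit(df):
    """Summarize each column: dtype, missing count, and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })

# Hypothetical toy data, not the course data set
toy = pd.DataFrame({
    "speed": [10.0, np.nan, 25.0, 25.0, 40.0],
    "cause": ["H", "T", "M", "M", None],
})

summary = audit(toy)

# Rows that exactly repeat an earlier row
n_duplicate_rows = int(toy.duplicated().sum())

print(summary)
print("duplicate rows:", n_duplicate_rows)
```

Saving a summary like this before and after each cleaning step is one simple way to document what the step changed.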

### In Class Exercise

Use your functions or the ones below to load the train data into a dataframe and combine all the narrative fields into one variable: Narrative. Optionally, remove the previous narrative columns.

In [1]:
# Packages for loading, cleaning, visualization, and analysis

# Data
import pandas as pd
import numpy as np
import scipy as sp
import os
import string as st


In [2]:
# Function to get the files from a directory

def getallfiles(directory, extension = ".txt"):
    '''Get all files in directory with the specified extension
        and put them into a list.
        The default extension is txt. The directory parameter is the path to 
        the directory containing the files.'''
    filenames = os.listdir(directory)
    myfiles = []
    for e in filenames:
        if e.endswith(extension):
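            # Join with the directory so the path is correct regardless of
            # the current working directory
            myfiles.append(os.path.realpath(os.path.join(directory, e)))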
    return myfiles




In [3]:
def createlist(directory, extension = ".txt"):
    '''Read all files in the specified directory
    with the chosen extension (txt is the default) 
    into a list of dataframes, one per file'''
    files = getallfiles(directory, extension)
    filelist = []
    for file in files:
        # Build the path from the directory and the file name so that
        # no change of working directory is needed
        path = os.path.join(directory, os.path.basename(file))
        filelist.append(pd.read_csv(path, low_memory = False, encoding = "ISO-8859-1"))
    return filelist


In [None]:
# Join the narrative and put them in a list

def join_narratives(DF):
    '''With the input of the accident dataframe
    merge the narrative columns into a single narrative
    and return a list of these single narratives for each
    accident report in the dataframe. '''
    # Column names NARR1 through NARR15
    narrlist = ['NARR' + str(i) for i in range(1, 16)]
    RailNarr = DF.loc[:, narrlist]
    Narratives = []
    for i in range(len(RailNarr)):
        NarrativeList = RailNarr.iloc[i]
        Anarrative = ""
        for narr in NarrativeList:
            # Stop at the first empty narrative field
            if pd.isnull(narr):
                break
            Anarrative += str(narr)
        # Append inside the row loop so every report gets an entry
        Narratives.append(Anarrative)
    return Narratives

## Duplicates

- Why should we remove duplicates?
- How should we remove them?
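
As a starting point for both questions: duplicated rows count the same observation more than once, which inflates counts and totals and gives that observation extra weight in a model. A minimal sketch with pandas follows. The column names and values are hypothetical, and which columns define a duplicate is a judgment that depends on the data set.

```python
import pandas as pd

# Hypothetical toy data: the column names are illustrative only
reports = pd.DataFrame({
    "incident_id": ["A1", "A1", "B2", "C3", "C3"],
    "railroad":    ["X",  "X",  "Y",  "Z",  "W"],
    "damage":      [100,  100,  250,  75,   75],
})

# Rows that repeat an earlier row in every column
exact_dupes = reports[reports.duplicated()]

# Drop exact duplicates, keeping the first occurrence
deduped = reports.drop_duplicates()

# Treat rows as duplicates whenever a chosen key repeats,
# even if other columns differ
deduped_by_key = reports.drop_duplicates(subset=["incident_id"], keep="first")

print(len(exact_dupes), len(deduped), len(deduped_by_key))
```

Inspect the rows flagged by `duplicated()` before dropping them. Two rows that share a key but differ elsewhere may be separate reports of the same event, and which one to keep is a modeling decision.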

## Missing Values

There are essentially 3 ways to handle missing values:

    1. Remove the columns (variables) with missing values
    2. Remove the rows (observations) with missing values
    3. Impute the missing values

The choice among these depends on the problem and the data. If a variable does not seem important to the problem, or if it has many missing values, then eliminating it is reasonable. Similarly, if an observation does not appear to represent the data, or if it has many missing values, then eliminating it is reasonable.
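
The first two options can be sketched with pandas as below. The toy data and the 0.5 threshold are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", "y", "z", None],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Option 1: drop columns whose missing share exceeds a threshold
# (0.5 is an arbitrary choice for this sketch)
fewer_columns = toy.loc[:, missing_share <= 0.5]

# Option 2: drop rows that contain any missing value
complete_rows = fewer_columns.dropna(axis=0, how="any")

print(missing_share)
print(len(toy.dropna()), len(complete_rows))
```

In this toy example, dropping incomplete rows from the original frame leaves no rows at all, while removing the mostly empty column first leaves two. The order of the steps matters, so record it.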

### Imputation

Imputation can be done in many ways. The most common are the following:

    1. Replace the missing value with the mean
    2. Replace the missing value with the median
    3. Replace the missing value with the mode
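    4. Use k-nn (k-nearest neighbors) to find the observations most similar to the one with the missing value, then
        1. Replace the missing value with the mean of the neighbors' values
        2. Replace the missing value with the median of the neighbors' values
        3. Replace the missing value with the mode of the neighbors' values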
 
Options 1 and 2 can only be used with numeric or quantitative variables.
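
The options above can be sketched as follows, using pandas for the mean, median, and mode and scikit-learn's `KNNImputer` for the k-nn case. The toy data and column names are hypothetical. Note that `KNNImputer` fills with the mean of the neighbors' values; the median and mode variants would need custom code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data
toy = pd.DataFrame({
    "speed":  [1.0, 1.1, 5.0, 1.2],
    "damage": [2.0, np.nan, 10.0, 2.2],
    "cause":  ["H", "H", None, "T"],
})

# Options 1 and 2: mean or median of the observed values (numeric only)
damage_mean = toy["damage"].fillna(toy["damage"].mean())
damage_median = toy["damage"].fillna(toy["damage"].median())

# Option 3: mode, which also works for categorical variables
cause_mode = toy["cause"].fillna(toy["cause"].mode().iloc[0])

# Option 4.1: mean of the k nearest neighbors, on the numeric columns.
# Distances use only the coordinates observed in both rows.
knn = KNNImputer(n_neighbors=2)
numeric_knn = pd.DataFrame(
    knn.fit_transform(toy[["speed", "damage"]]),
    columns=["speed", "damage"],
    index=toy.index,
)

print(damage_mean.iloc[1], damage_median.iloc[1], numeric_knn.loc[1, "damage"])
```

Here the overall mean is pulled up by the one large `damage` value, while the k-nn estimate uses only the two rows with a similar `speed`. Because k-nn relies on distances, variables on very different scales usually need rescaling first.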

### In Class Exercise

1. Find out how many missing values we have by column (variable).
2. What technique or techniques should we use to handle them?
3. What methods from pandas or sklearn can we use for imputation? Why or how would we use them?

In [None]:
# This class imputes the missing values as
# (1) most frequent if the variable is categorical 
# (2) mean if the variable is real (floating point)
# (3) median if the variable is an integer 

# Here is a class that will provide imputation
# This is an extension by D.Brown to sveitser, 2014 https://stackoverflow.com/users/469992/sveitser

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.
        
        Columns of dtype floating point are imputed with the mean.

        Columns of other types are imputed with median of the column.

        """
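    def fit(self, X, y=None):

        # Pick one fill value per column, based on the column dtype.
        # is_float_dtype matches float32 and float64 alike, so the usual
        # float64 columns are filled with the mean as documented.
        self.fill = pd.Series(
            [X[c].value_counts().index[0] if X[c].dtype == np.dtype('O')
             else X[c].mean() if pd.api.types.is_float_dtype(X[c])
             else X[c].median()
             for c in X],
            index=X.columns)

        return self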

    def transform(self, X, y=None):
        return X.fillna(self.fill)


## Categorical Variables

Categorical variables must be treated differently from quantitative variables. Their effects on the modeling
and machine learning depend on correctly coding them for the analysis. 

Categorical variables can be entered in many different ways. Data scientists should inspect the coding 
of these variables to ensure correctness, appropriateness for modeling, and easy interpretation.
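
Recoding an integer-coded categorical variable might look like the sketch below. The codes and labels in `type_labels` are placeholders, not the real coding; take the actual mapping from the data dictionary.

```python
import pandas as pd

# Hypothetical codes and labels: check the data dictionary for the real ones
type_labels = {1: "type_a", 2: "type_b", 3: "type_c"}

# Toy column standing in for an integer-coded categorical variable
toy = pd.DataFrame({"TYPE": [1, 3, 2, 1, 1]})

# Integer dtype hides the fact that the variable is categorical
print(toy.dtypes)

# Replace the codes with text labels and store the result as a categorical
toy["TYPE"] = toy["TYPE"].map(type_labels).astype("category")

counts = toy["TYPE"].value_counts()
print(counts)
```

Codes that are absent from the mapping become missing values under `map`, so compare the missing count before and after recoding to catch unexpected codes.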

### In Class Exercise

1. Look at the data types for the variables.
2. Which variables are categorical? Look at some of their value_counts()
3. Which variables are categorical but are coded as integers? Which variables are integers but coded as objects?
4. Replace integer values for TYPE with the text labels. Repeat for one other variable.