## Project: Data Mining Project for a non-profit organisation `UCOUNT`

----
## The CRISP-DM Framework

The CRISP-DM methodology provides a structured approach to planning a data mining project. It is a robust and well-proven methodology. 

- Business understanding (BU): Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan

- Data understanding (DU): Collect Initial Data, Describe Data, Explore Data, Verify Data Quality

- Data preparation (DP): Select Data, Clean Data, Construct Data, Integrate Data

- Modeling (M): Select modeling technique, Generate Test Design, Build Model, Assess Model

- Evaluation (E): Evaluate Results, Review Process, Determine Next Steps

- Deployment (D): Plan Deployment, Plan Monitoring and Maintenance, Produce Final Report, Review Project

References: 
1. [What is the CRISP-DM methodology?](https://www.sv-europe.com/crisp-dm-methodology/)

2. [Introduction to CRISP DM Framework for Data Science and Machine Learning](https://www.linkedin.com/pulse/chapter-1-introduction-crisp-dm-framework-data-science-anshul-roy/)

### Your Role
All of the tasks in this course assume that you are leading a data science group at a fictional non-profit organisation called UCOUNT (yoUr CONTribution). 

As a non-profit organisation, UCOUNT's main income is donations. For some of its smaller projects, UCOUNT also uses Kickstarter and other crowdfunding platforms to raise funds. 

As the data science group leader at UCOUNT, you are tasked with advising the marketing department so that it runs effective donation and Kickstarter campaigns. 

## Data Sets
You have two sets of data at your disposal. 

1. The first is a collection of past crowdfunding campaigns run by different people and organisations on the [Kickstarter website](https://www.kickstarter.com/). The data was obtained using web scraper robots run by [webrobots](https://webrobots.io/kickstarter-datasets/). The bots crawl all Kickstarter projects and collect data, which is then dumped in CSV format.

   You can download the Kickstarter data set here: [kickstarter-cleaned.csv](https://github.com/flavianh/kickstarter-dash/blob/master/kickstarter-cleaned.csv)
   
   You will use this data to understand the important features of successful crowdfunding campaigns. For example, what are the common characteristics of successful campaigns?

2. The second is the Census Income Data Set, which comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Census+Income). The dataset was donated by Ron Kohavi and Barry Becker after being published in the article _"Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid"_. You can find the article by Ron Kohavi [online](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf).

   You can download the Census Income Data Set here: [census-income-1996.csv](s3://10ac-courses-data/csv/census-income-1996.csv)
    
   You will use this data set to predict the annual income of a target donor. Understanding an individual's income can help a non-profit company like UCOUNT better understand how large of a donation to request, or whether or not they should reach out to begin with.  


## Data Understanding

----
In this phase of CRISP-DM, you will explore the data to understand the feature space, the data attributes present at training and testing time, and the target label, which is used to train and validate a model.  

Good code is reusable: you write functions once and use them repeatedly across data analysis projects. A good example is the **sklearn** Python package, which implements most of the algorithms and supporting processes needed for machine learning so that millions of users don't have to rewrite them.

We also highly recommend that you write Python functions for your customised plots, data loading, data cleaning, and any other activity you repeat. That will save you a lot of time in the future and make your code more readable. 

>**Note:** you can save a set of functions defined in a cell to a file for later use with the Jupyter notebook magic

    %%file /path/filename.py 

>You can then load a specific function from the Python file using 

    sys.path.append('/path')
    from filename import function  
>To learn more about the different notebook magic functions and other Jupyter notebook tricks, see  
<a href=https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/> 28 Jupyter Notebook Tips, Tricks, and Shortcuts</a>  or <a href=https://towardsdatascience.com/jupyter-notebook-hints-1f26b08429ad>Boosting Your Jupyter Notebook Productivity</a>
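Outside a notebook, the same save-then-import idea can be sketched in plain Python. The module name `ucount_demo_helpers` and the function `add_one` are hypothetical and exist only for this illustration; the file is written to a temporary folder rather than to your real project directory.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module and function names, used only for illustration.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "ucount_demo_helpers.py").write_text(
    "def add_one(x):\n"
    "    return x + 1\n"
)

# Make the folder importable, then import the function by name.
if str(module_dir) not in sys.path:
    sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

from ucount_demo_helpers import add_one

result = add_one(41)
print(result)
```

Writing the file is what `%%file` does for a cell; adding the folder to `sys.path` is what makes the later `from ... import ...` line work.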

To motivate you in that direction, a few functions are defined below for downloading data from a remote URL, checking for existing data, creating folders if they are not in the local file system, and so on. There are also a couple of functions for plotting histogram distributions from a DataFrame and printing results. An example of how to save notebook cells to a file is also given below.

Since the notes in this lecture series are brief, we strongly recommend that you refer to online lectures, blogs, and data science textbooks. Some good references are listed below. Moreover, if you don't understand something, be proactive and ask questions in the 10 Academy and/or Stack Overflow forums. 

With every exercise, make sure you are building a strong mix of data science knowledge, which encompasses mathematical and statistical concepts and algorithms, and data science hard skills: writing good code and getting familiar with important data science packages such as scipy, pandas, sklearn, Keras, etc. 

In [None]:
#this allows jupyter to automatically reload modules that are modified after being loaded
%load_ext autoreload
%autoreload 2

In [None]:
import requests
import os, sys
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from time import time
from sklearn.metrics import f1_score, accuracy_score

# Pretty display for notebooks
%matplotlib inline


home = os.path.expanduser("~")
mypyrootdir = os.path.join(home, 'mypy')

#add mypydir in system path. This will allow us to import files easily
if not mypyrootdir in sys.path:
    sys.path.insert(0, mypyrootdir)

#folder to save python files for later use as modules
def create_subfolder_in_mypydir(path,
                                rootdir=os.path.expanduser("~"), 
                                ismodule=True):
    '''
    Create a folder for python files under rootdir (for example ~/mypy/tenx).
    If ismodule is True, the folder is turned into a python package
    by adding an empty __init__.py file to it
    '''
    dirpath = os.path.join(rootdir,path)
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
        
    if ismodule:
        with open(os.path.join(dirpath, '__init__.py'), 'w') as f:
            f.write('')
        
    return dirpath



#Create the tenx directory inside mypydir to save python functions for use in many projects
tenxdir = create_subfolder_in_mypydir('tenx', rootdir=mypyrootdir)
                                      
#create util directory inside tenxdir
tenxutil = create_subfolder_in_mypydir('util', rootdir=tenxdir)

#import the python modules in the tenx folder
import tenx

In [None]:
#this is to save the next cell, which defines some python functions, to a file
ofilename = os.path.join(tenxutil,'mycurl.py')
ofilename

In [None]:
%%file $ofilename

import requests
import os, sys
import re  #needed by get_filename_from_cd
                    
def get_filename_from_cd(cd):
    """
    Get filename from content-disposition
    """
    if not cd:
        return None
    fname = re.findall('filename=(.+)', cd)
    if len(fname) == 0:
        return None
    return fname[0]

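def is_fsize_larger(header, size=200, unit=1e6): 
    # Despite its name, this returns True when the file is small enough
    # to download and False when it is larger than size*unit bytes.
    # size is in units of `unit`; the default 1e6 means megabytes (MB)
    
    content_length = header.get('content-length', None)
    
    # header values are strings, so convert to int before comparing
    if content_length and int(content_length) > size*unit:
        return False
    else:
        return True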

def is_downloadable(url, size=None, unit=1e6):
    """
    Does the url contain a downloadable resource?
    To answer this question, we first fetch the headers 
    of a url before actually downloading it.
    This allows us to skip downloading files which 
    weren't meant to be downloaded    
    """    
    h = requests.head(url, allow_redirects=True)
    header = h.headers
    
    unitd = {1e3:'Kb', 1e6:'Mb', 1e9:'Gb', 1e12:'Tb', 1e15:'Pb'}[unit]
    
    dload = True 
    if size: #if size is given (not None)        
        dload = is_fsize_larger(header, size=size, unit=unit)
        
    if not dload: #file size condition
        print('file size of %s is more than %s %s'% (url, size, unitd) )
        return False
    
    #print('url=%s HEADER'%url)
    #print(header)
    
    content_type = header.get('content-type', '')  #empty string if the header is missing
    #if 'text' in content_type.lower(): #text condition
    #    print('%s is a link to plain text - not data'%url)
    #    return False
    if 'html' in content_type.lower(): #html condition
        print('%s is a link to plain html - not data'%url)
        return False
              
    return True

def make_dir_ifnew(filename, isdir=False, rfile=True, rtest=True):
    if os.path.exists(filename):
        if rfile and rtest:
            return filename, True
        elif rfile and not rtest:
            return filename
        else:
            return True
    
    if isdir: #if filename is directory path
        dirname = filename
    else: #filename is path+filename.ext
        dirname = os.path.dirname(filename)
        
    if not os.path.exists(dirname):
        os.makedirs(dirname)
        print('created directory: %s'%dirname)
    else:
        print('directory %s already exists.'%dirname)
        
    #the file itself does not exist yet, so report False
    if rfile and rtest:
        return filename, False
    elif rfile and not rtest:
        return filename    
    else:
        return False
        
def download_data(url, filename, replace=False):
        
    #make directory if it does not exist
    file_exists = make_dir_ifnew(filename,rfile=False)
    
    # only download if file doesn't exist or replace=True
    if replace or not file_exists: 
        dload = is_downloadable(url)              
        print('Is the url=%s downloadable: %s' % (url,dload) )

        if dload:
            #request HTTP GET
            r = requests.get(url)

            #write the data obtained to file and close it afterwards
            with open(filename, 'wb') as f:
                f.write(r.content)

            print('url successfully downloaded to file: %s'%filename)
        else:
            print('url can not be downloaded')
    else:
        print('%s already exists - downloading not necessary'%filename)
              
    return filename

In [None]:
def feature_plot(importances, X_train, y_train):
    
    # Display the five most important features
    indices = np.argsort(importances)[::-1]
    columns = X_train.columns.values[indices[:5]]
    values = importances[indices][:5]

    # Create the plot
    fig = plt.figure(figsize = (9,5))
    plt.title("Normalized Weights for First Five Most Predictive Features", fontsize = 16)
    plt.bar(np.arange(5), values, width = 0.6, align="center", color = '#00A000', \
          label = "Feature Weight")
    plt.bar(np.arange(5) - 0.3, np.cumsum(values), width = 0.2, align = "center", color = '#00A0A0', \
          label = "Cumulative Feature Weight")
    plt.xticks(np.arange(5), columns)
    plt.xlim((-0.5, 4.5))
    plt.ylabel("Weight", fontsize = 12)
    plt.xlabel("Feature", fontsize = 12)
    
    plt.legend(loc = 'upper center')
    plt.tight_layout()
    plt.show()
    
def evaluate(results, accuracy, f1):
    """
    Visualization code to display results of various learners.
    
    inputs:
      - results: a dictionary keyed by learner name; each value holds the
        statistic results from 'train_predict()' for the three training set sizes
      - accuracy: The accuracy score for the naive predictor
      - f1: The F-score for the naive predictor
    """
  
    # Create figure
    fig, ax = plt.subplots(2, 3, figsize = (11,7))

    # Constants
    bar_width = 0.3
    colors = ['#A00000','#00A0A0','#00A000']
    
    # Super loop to plot six panels of data
    for k, learner in enumerate(results.keys()):
        for j, metric in enumerate(['train_time', 'acc_train', 'f_train', 'pred_time', 'acc_test', 'f_test']):
            for i in np.arange(3):
                
                # Creative plot code
                ax[j//3, j%3].bar(i+k*bar_width, results[learner][i][metric], width = bar_width, color = colors[k])
                ax[j//3, j%3].set_xticks([0.45, 1.45, 2.45])
                ax[j//3, j%3].set_xticklabels(["1%", "10%", "100%"])
                ax[j//3, j%3].set_xlabel("Training Set Size")
                ax[j//3, j%3].set_xlim((-0.1, 3.0))
    
    # Add unique y-labels
    ax[0, 0].set_ylabel("Time (in seconds)")
    ax[0, 1].set_ylabel("Accuracy Score")
    ax[0, 2].set_ylabel("F-score")
    ax[1, 0].set_ylabel("Time (in seconds)")
    ax[1, 1].set_ylabel("Accuracy Score")
    ax[1, 2].set_ylabel("F-score")
    
    # Add titles
    ax[0, 0].set_title("Model Training")
    ax[0, 1].set_title("Accuracy Score on Training Subset")
    ax[0, 2].set_title("F-score on Training Subset")
    ax[1, 0].set_title("Model Predicting")
    ax[1, 1].set_title("Accuracy Score on Testing Set")
    ax[1, 2].set_title("F-score on Testing Set")
    
    # Add horizontal lines for naive predictors
    ax[0, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
    ax[1, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
    ax[0, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
    ax[1, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
    
    # Set y-limits for score panels
    ax[0, 1].set_ylim((0, 1))
    ax[0, 2].set_ylim((0, 1))
    ax[1, 1].set_ylim((0, 1))
    ax[1, 2].set_ylim((0, 1))

    # Create patches for the legend
    patches = []
    for i, learner in enumerate(results.keys()):
        patches.append(mpatches.Patch(color = colors[i], label = learner))
    plt.legend(handles = patches, bbox_to_anchor = (-.80, 2.53), \
               loc = 'upper center', borderaxespad = 0., ncol = 3, fontsize = 'x-large')
    
    # Aesthetics
    plt.suptitle("Performance Metrics for Three Supervised Learning Models", fontsize = 16, y = 1.10)
    plt.tight_layout()
    plt.show()
    
def distribution(data, col_names, 
                 transformed = False,
                 figsize = None,
                 xra=None,yra=None, 
                 yticks=None, 
                 yticklabels=None
                ):
    """
    Visualization code for displaying skewed distributions of features
    """
    #
    plotInfo = {'nrow':0, 'ncol':0,
                'isfigsize':False,                
                'isxra':False, 'isyra':False,
                'isyticks':False, 
                'isyticklabels':False,
                'istransformed':False}  
    
    slevel = 0
    
    # Create figure
    if figsize is None:
        figsize = (11,5)
    else:
        plotInfo['isfigsize'] = True
        slevel += 2 
        
    fig = plt.figure(figsize = figsize);
    
    #depending on col_names size, adjust figure rows and columns
    nrow = 1 if len(col_names) <= 3 else int( np.ceil(len(col_names)/3) ) 
    ncol = len(col_names) if len(col_names) <= 3 else 3
    plotInfo['nrow'] = nrow
    plotInfo['ncol'] = ncol
    slevel += nrow*ncol #more plots means advanced use
    
    # Skewed feature plotting
    for i, feature in enumerate(col_names):
        ax = fig.add_subplot(nrow, ncol, i+1)
        ax.hist(data[feature], bins = 25, color = '#00A0A0')
        ax.set_title("'%s' Feature Distribution"%(feature), fontsize = 14)
        ax.set_xlabel("Value")
        ax.set_ylabel("Number of Records")
        
        if not xra is None: 
            if type(xra[0]) is not list: 
                xra = [xra for _ in col_names]
            ax.set_xlim(xra[i])
            plotInfo['isxra'] = True 
            slevel += 2    
            
        if not yra is None: 
            if type(yra[0]) is not list: 
                yra = [yra for _ in col_names]
            ax.set_ylim(yra[i])
            plotInfo['isyra'] = True 
            slevel += 2
            
        if not yticks is None: 
            if type(yticks[0]) is not list: 
                yticks = [yticks for _ in col_names]
            ax.set_yticks(yticks[i])
            plotInfo['isyticks'] = True 
            slevel += 2
            
        if not yticklabels is None: 
            if type(yticklabels[0]) is not list: 
                yticklabels = [yticklabels for _ in col_names]
            ax.set_yticklabels(yticklabels[i])
            plotInfo['isyticklabels'] = True 
            slevel += 2

    # Plot aesthetics
    if transformed:
        plotInfo['istransformed'] = True
        fig.suptitle("Log-transformed Distributions of Continuous Census Data Features", \
            fontsize = 16, y = 1.03)
    else:
        fig.suptitle("Skewed Distributions of Continuous Census Data Features", \
            fontsize = 16, y = 1.03)
        
    fig.tight_layout()

    plotInfo['slevel'] = slevel
    return plotInfo

### Loading the Data and Understanding the Structure
A cursory investigation of the dataset will determine how many individuals fit into either group, and will tell us about the percentage of these individuals making more than \$50,000. In the code cell below, you will need to compute the following:
- The total number of records, `'n_records'`
- The number of individuals making more than \$50,000 annually, `'n_greater_50k'`.
- The number of individuals making at most \$50,000 annually, `'n_at_most_50k'`.
- The percentage of individuals making more than \$50,000 annually, `'greater_percent'`.
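
The kind of counting described above can be sketched on a toy table. The six records below are made up and are not taken from the census file, and the variable names are deliberately different from the ones you are asked to define.

```python
import pandas as pd

# Toy data, not the census file: six made-up records.
toy = pd.DataFrame({
    "age": [25, 38, 52, 41, 29, 60],
    "income": ["<=50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# shape gives (number of rows, number of columns)
n_rows, n_cols = toy.shape

# value_counts() counts how often each label occurs
counts = toy["income"].value_counts()
share_high = 100 * counts[">50K"] / n_rows

print(n_rows, n_cols)
print(counts.to_dict())
print(round(share_high, 1))
```

A boolean mask such as `toy[toy.income == '>50K']` gives the same count through `len()`, which is the style used in the cell further down.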

In [None]:
ks_filename='../data/kickstarter-cleaned.csv'
census_filename='../data/census-income-1996.csv' 

#download data if not done already
f=download_data('https://raw.githubusercontent.com/flavianh/kickstarter-dash/master/kickstarter-cleaned.csv', 
              ks_filename)
f=download_data('https://raw.githubusercontent.com/ramsaran-vuppuluri/finding_donors/master/census.csv', 
              census_filename)

In [None]:
#TODO: read census_filename data into pandas data frame as
#csdata = 
# YOUR CODE HERE
raise NotImplementedError()

# Show the first five records of the raw data
display(csdata.head(n = 5))

In [None]:
# Show the first five records of the raw data
display(csdata.head(n = 5))

**HINT:** You may need to add a cell above, and look at the DataFrame to understand how the `'income'` entries are formatted. 

**Featureset Exploration**

* **age**: continuous. 
* **workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
* **education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
* **education-num**: continuous. 
* **marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
* **occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
* **relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
* **race**: Black, White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other. 
* **sex**: Female, Male. 
* **capital-gain**: continuous. 
* **capital-loss**: continuous. 
* **hours-per-week**: continuous. 
* **native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [None]:
# TODO: get total number of records (rows) and number of features (columns) as
#n_records = 
#n_features = 
# YOUR CODE HERE
raise NotImplementedError()

# Number of records where individual's income is more than $50,000
n_greater_50k = len(csdata[csdata.income =='>50K'])

# TODO: Number of records where individual's income is at most $50,000
#n_at_most_50k = 
# YOUR CODE HERE
raise NotImplementedError()

# TODO: Percentage of individuals whose income is more than $50,000
#greater_percent = 
# YOUR CODE HERE
raise NotImplementedError()

# Print the results
print("Total number of records: {}".format(n_records))
print("Total number of features: {}".format(n_features))
print("Individuals making more than $50,000: {}".format(n_greater_50k))
print("Individuals making at most $50,000: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50,000: {}%".format(greater_percent))

In [None]:
assert n_records==45222
assert n_features==14


### Understanding the Data Types of the different Features (Dimensions)
The first task of any machine learning project is to understand the data types, distributions, and other important descriptive statistics of the features. Before proceeding further, explore the feature space and make sure you are familiar with the nature of the data. To help you do so, by the end of your exploration you should define the following variables, which will be used below.
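
As a starting point for this exploration, here is a minimal sketch on a made-up three-row table (not the census data). It assumes that "numerical" simply means a numeric pandas dtype; for real data you should still check whether a numeric column is in fact a coded category.

```python
import pandas as pd

# Made-up records with two numeric columns and one text column.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "hours-per-week": [40.0, 50.0, 20.0],
    "workclass": ["Private", "State-gov", "Private"],
})

# dtypes shows the storage type of every column
print(toy.dtypes)

# select_dtypes separates numeric columns from the rest
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)

# describe() summarises the numeric columns (count, mean, std, quartiles)
print(toy.describe())
```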

In [None]:
# TODO: define variables that contain the lists of numerical and categorical column names
#numerical_col_names = 
#categorial_col_names = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert numerical_col_names == ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

----
## Preparing the Data for ML
Before data can be used as input for machine learning algorithms, it often must be cleaned, formatted, and restructured; this is typically known as **preprocessing**. Fortunately, this dataset has no invalid or missing entries to deal with. However, some qualities of certain features must be adjusted. This preprocessing can help tremendously with the outcome and predictive power of nearly all learning algorithms.

### Split Data to Features and Target Label 
Before proceeding further, let us be clear that the main task of the income data project is to predict whether an individual described by a given combination of features has a yearly income of USD '>50K' or '<=50K'. These values are given in the `income` column of the training data.

This means the 'income' column is the target label, while all the other columns can be considered features that describe a sample (you can think of them as coordinates, so the number of independent features is the number of dimensions).

What is the number of dimensions of the income data? 

In [None]:
# TODO: Split the data into features and target label. Use the variable names below
#income_raw =  #for target labels
#features_raw =  #for features

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(income_raw.shape, features_raw.shape)

### Transforming Skewed Continuous Features
A dataset may sometimes contain features whose values tend to lie near a single number, but will also have a non-trivial number of vastly larger or smaller values than that single number.  Algorithms can be sensitive to such distributions of values and can underperform if the range is not properly normalized. 

Your next task is to explore the feature space and find out which features are highly skewed. In particular, list the two most skewed numerical features in the census dataset.  
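
One way to quantify skew, shown here on made-up numbers rather than the census data, is the sample skewness returned by `DataFrame.skew()`. This is only one possible criterion; looking at histograms is an equally valid approach.

```python
import pandas as pd

# Two made-up columns: one symmetric, one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [0, 0, 0, 0, 1, 2, 100],
})

# skew() is close to 0 for symmetric data and large and positive
# when a few values are far above the bulk of the data.
skewness = toy.skew().sort_values(ascending=False)
print(skewness)
```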

In [None]:
#list the name of two numerical features that are most skewed by distribution
#top2_skewed = []

# YOUR CODE HERE
raise NotImplementedError()

# What are the appropriate data ranges (min, max) to use for
# the top2_skewed variable above?   
# top2_skew_ranges = [[],[]]
# YOUR CODE HERE
raise NotImplementedError()

# Testing your python skill
# TODO: what are the non-optional and optional 
# parameters in the function 'distribution()' defined above?


# using the distribution function given above, plot histogram distributions
# of the top two skewed continuous features. You should pass at least one 
# optional parameter to the function to complete this task. 
# The output of distribution() should be assigned to a variable, for example
# pinfo = distribution()

# YOUR CODE HERE
raise NotImplementedError()

#### Logarithmic Transformations

For highly-skewed feature distributions, it is common practice to apply a <a href="https://en.wikipedia.org/wiki/Data_transformation_(statistics)">logarithmic transformation</a> on the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. 

>**Deep Thinking:** 

>What does a logarithmic transformation do to a distribution? How does the distance between two points change once the data is log transformed?

> As you may guess, log transformation is not always a good idea even if a feature is skewed. Can you think of examples where log transformation hurts machine learning? Hint: if two features are correlated, and one of them is highly skewed while the other is not, what happens if you apply a log transformation to the skewed feature?

Using a logarithmic transformation significantly reduces the range of values caused by outliers. Care must be taken when applying this transformation, however: the logarithm of `0` or of a negative value is undefined, so we must shift the values by a small amount above `0` to apply the logarithm successfully.
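
To make the effect concrete, here is a small sketch on made-up values. It uses `np.log1p`, which computes `log(x + 1)`, so a shift of `1` is the "small amount" assumed here.

```python
import numpy as np

# Made-up values spanning several orders of magnitude, including 0.
values = np.array([0.0, 10.0, 100.0, 10_000.0])

# log1p(x) = log(x + 1), so 0 maps to 0 instead of being undefined
logged = np.log1p(values)

# Gaps between neighbouring points before and after the transform
print(np.diff(values))
print(np.diff(logged))
```

The raw gaps differ by orders of magnitude, while the gaps after the transform are all of similar size. That is what "reducing the range caused by outliers" means in practice.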

Run the code cell below to perform a transformation on the data and visualize the results. Again, note the range of values and how they are distributed. 

In [None]:
# Log-transform the skewed features
features_log_transformed = features_raw.copy()  # copy so that features_raw stays unchanged
features_log_transformed[top2_skewed] = features_raw[top2_skewed].apply(lambda x: np.log(x + 1))

# Visualize the new log distributions
tpinfo = distribution(features_log_transformed,top2_skewed, transformed = True,
                      xra = [0.1,0.3], yra = [0,4000])

### Normalizing Numerical Features
In addition to performing transformations on features that are highly skewed, it is often good practice to perform some type of scaling on numerical features. Applying a scaling to the data does not change the shape of each feature's distribution; however, normalization ensures that each feature is treated equally when applying supervised learners. 

>**Deep Thinking:** 

>What does normalization do to the "units" of features? For example what is the unit of age once it is normalized to be in the range (0, 1)?

>In which scenario do you think normalization helps? and when is it not a good strategy? 

Note that once scaling is applied, the data no longer carries its original meaning when observed in its raw form, as the example below shows.
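
Min-max scaling to the range (0, 1) can be written out by hand, which may help with the questions above. The ages are made up; `MinMaxScaler` with its default `feature_range` applies the same formula to each column separately.

```python
import numpy as np

# Made-up ages in years.
ages = np.array([20.0, 35.0, 50.0, 80.0])

# Subtract the minimum, then divide by the range (max - min).
scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(scaled)
```

The numerator and denominator are both in years, so the result is a unitless fraction of the observed range.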

Run the code cell below to normalize each numerical feature. We will use [`sklearn.preprocessing.MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) for this.

In [None]:
# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)

features_log_minmax_transform = features_log_transformed.copy()  # copy so the input frame stays unchanged
features_log_minmax_transform[numerical_col_names] = scaler.fit_transform(
    features_log_transformed[numerical_col_names])

# Show an example of a record with scaling applied
display(features_log_minmax_transform.head(n = 5))

### One-hot encoding of categorical variables

From the **tables** above, we can see there are several features for each record that are non-numeric. Typically, learning algorithms expect input to be numeric, which requires that non-numeric features (called *categorical variables*) be converted. One popular way to convert categorical variables is by using the **one-hot encoding** scheme. One-hot encoding creates a _"dummy"_ variable for each possible category of each non-numeric feature. 

For example, assume a feature with a header `Name` has three possible entries: `A`, `B`, or `C`. We then encode this feature into `Name_A`, `Name_B` and `Name_C`.

     | Name  |             ----> one-hot encoding ---->   | Name_A | Name_B | Name_C | 
     |  B    |             ----> one-hot encoding ---->   | 0      | 1      | 0      |  
     |  C    |             ----> one-hot encoding ---->   | 0      | 0      | 1      |  
     |  A    |             ----> one-hot encoding ---->   | 1      | 0      | 0      |  

Additionally, as with the non-numeric features, we need to convert the non-numeric target label, `'income'` to numerical values for the learning algorithm to work. Since there are only two possible categories for this label ("<=50K" and ">50K"), we can avoid using one-hot encoding and simply encode these two categories as `0` and `1`, respectively. 
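
The `Name` example and the label encoding can be reproduced on toy data as a quick check of the idea. The `dtype=int` argument is a choice made here so the dummy columns show as `0`/`1`, and mapping the labels with a dictionary is one possible approach, not necessarily the one used in the cell below.

```python
import pandas as pd

# The three-row 'Name' example from the table above.
toy = pd.DataFrame({"Name": ["B", "C", "A"]})

# One dummy column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)

# A two-category label needs only a single 0/1 column.
labels = pd.Series(["<=50K", ">50K", "<=50K"]).map({"<=50K": 0, ">50K": 1})
print(labels.tolist())
```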

In the code cell below, we will do the following:
 - Use [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) to perform one-hot encoding on the `'features_log_minmax_transform'` data.
 - Convert the target label `'income_raw'` to numerical entries.
   - Set records with "<=50K" to `0` and records with ">50K" to `1`.

In [None]:
features_final = pd.get_dummies(features_log_minmax_transform)

income = pd.Series(data=income_raw)
income = income.replace('<=50K',0)
income = income.replace('>50K',1)

# Print the number of features after one-hot encoding
encoded = list(features_final.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

# Uncomment the following line to see the encoded feature names
# print(encoded)

### Shuffle and Split Data
Now all _categorical variables_ have been converted into numerical features, and all numerical features have been normalized. As always, we will now split the data (both features and their labels) into training and test sets. 80% of the data will be used for training and 20% for testing.

Run the code cell below to perform this split.

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the 'features' and 'income' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_final, 
                                                    income, 
                                                    test_size = 0.2, 
                                                    random_state = 0)
 

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

### Save the Train and Test sets for future analysis
Save the train and test sets to files so that you can read them from another notebook.

In the code cell below, you are required to implement the missing code in the save_train_test_split() function.
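
Before you implement it, it may help to see a CSV round trip on toy data. This is a sketch under assumptions: the file names and the temporary folder are chosen for illustration, and the toy frames stand in for the real train and test sets. The point is that writing the index and reading it back with `index_col=0` preserves the shuffled row labels.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy features and labels with a shuffled, non-default index.
toy_X = pd.DataFrame({"a": [0.1, 0.2], "b": [1, 0]}, index=[7, 3])
toy_y = pd.Series([0, 1], index=[7, 3], name="income")

folder = tempfile.mkdtemp()
x_path = os.path.join(folder, "features.csv")
y_path = os.path.join(folder, "labels.csv")

# to_csv writes the index as the first column by default
toy_X.to_csv(x_path)
toy_y.to_frame().to_csv(y_path)

# index_col=0 turns that first column back into the index
X_back = pd.read_csv(x_path, index_col=0)
y_back = pd.read_csv(y_path, index_col=0)

print(X_back)
print(y_back)
```

Note that the labels come back as a one-column DataFrame rather than a Series.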

In [None]:
def load_split_samples(train=True, test=True, validate=False,
                        path='./'):
    '''
        This function loads the training, testing and/or validating
        samples if the parameters train, test and validate are True.
        The order of the output is as follows:
        
        if train=True and test=False, validate=False 
        samples=[X_train, y_train], 
        
        if train=True and test=True, validate=False 
        samples=[X_train, y_train, X_test, y_test], 
        
        if train=True and test=True, validate=True
        samples=[X_train, y_train, X_test, y_test, X_validate, y_validate]
        
        All samples are read as pandas DataFrame.        
    '''
    join = os.path.join
    samples = []
    if train:
        fname_X_train,fexist1 = make_dir_ifnew(join(path,'train/features.csv'))        
        #file names must match the ones written by save_train_test_split()
        fname_y_train,fexist2 = make_dir_ifnew(join(path,'train/labels.csv'))
        if fexist1 and fexist2:
            samples.append(pd.read_csv(fname_X_train,index_col=0))
            samples.append(pd.read_csv(fname_y_train,index_col=0))
        else:
            print('Error: Training sample can not be loaded!')            

    if test:
        fname_X_test,fexist1 = make_dir_ifnew(join(path,'test/features.csv'))
        fname_y_test,fexist2 = make_dir_ifnew(join(path,'test/labels.csv'))
        if fexist1 and fexist2:
            samples.append(pd.read_csv(fname_X_test,index_col=0))
            samples.append(pd.read_csv(fname_y_test,index_col=0))    
        else:
            print('Error: Testing sample can not be loaded!')            

    if validate:
        fname_X_validate,fexist1 = make_dir_ifnew(join(path,'validate/features.csv'))
        fname_y_validate,fexist2 = make_dir_ifnew(join(path,'validate/labels.csv'))
        if fexist1 and fexist2:
            samples.append(pd.read_csv(fname_X_validate,index_col=0))
            samples.append(pd.read_csv(fname_y_validate,index_col=0))
        else:
            print('Error: Validation sample can not be loaded!')
                  
    return samples
    
    
def save_train_test_split(train_sample, test_sample, 
                        validate_sample=None,
                          path='./'):
    '''
        This function saves training features and labels passed as
        train_samples=[X_train, y_train], 
        in the path/train folder in separate CSV files.
        
        Similarly if the test_samples = [X_test, y_test] are given,  
        it saves it in the path/test folder in CSV format.
        
        If the optional validate=[X_validate, y_validate] is passed, 
        it saves it in the path/validate folder in CSV format.
    '''
    join = os.path.join
    fname_X_train = make_dir_ifnew(join(path,'train/features.csv'), rtest=False)
    fname_y_train = make_dir_ifnew(join(path,'train/labels.csv'), rtest=False)

    fname_X_test = make_dir_ifnew(join(path,'test/features.csv'), rtest=False)
    fname_y_test = make_dir_ifnew(join(path,'test/labels.csv'), rtest=False)

    if not validate_sample is None:
        fname_X_validate = make_dir_ifnew(join(path,'validate/features.csv'), rtest=False)
        fname_y_validate = make_dir_ifnew(join(path,'validate/labels.csv'), rtest=False)    

    # YOUR CODE HERE
    raise NotImplementedError()
    
    return 'train test split successfully saved to %s'%path


savepath = './'

#use the save_train_test_split function to save the training and testing samples in the current directory
status = save_train_test_split([X_train, y_train], 
                      test_sample = [X_test, y_test], 
                      path=savepath)
print(status)

In [None]:
#let us check if the dataframes you saved to a file and read are the same as the original

#read saved files
X_train2, y_train2,X_test2, y_test2 =  load_split_samples(train=True, test=True, path=savepath)


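#all(DataFrame) only iterates over the column names, so compare the values instead
assert np.allclose(X_train.astype(float).values, X_train2.astype(float).values)
assert np.allclose(X_test.astype(float).values, X_test2.astype(float).values)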

