#  3  DATA PREPARATION FOR IN-DEPTH ANALYSES

***

##  3.1    Importing foundational libraries


### Access to the system's parameters (https://docs.python.org/3/library/sys.html)

In [1]:
import sys 
print("Python version: {}".format(sys.version))
print('-'*100)

Python version: 3.6.6 | packaged by conda-forge | (default, Jul 26 2018, 11:48:23) [MSC v.1900 64 bit (AMD64)]
----------------------------------------------------------------------------------------------------



### Pandas provides dataframes with SQL-like features for data processing and analysis

In [2]:
import pandas as pd 
print("pandas version: {}".format(pd.__version__))

pandas version: 0.20.1



### Matplotlib gives us a collection of functions for scientific and publication-ready visualization

In [3]:
import matplotlib 
print("matplotlib version: {}".format(matplotlib.__version__))

matplotlib version: 2.0.2



### NumPy is a good starting package for mathematical and scientific computing

In [4]:
import numpy as np 
print("NumPy version: {}".format(np.__version__))

NumPy version: 1.14.2



### SciPy also has a pretty good collection of functions for scientific computing and advanced mathematics

In [5]:
import scipy as sp 
print("SciPy version: {}".format(sp.__version__))

SciPy version: 1.2.1



### Scikit-learn (sklearn) provides a range of dazzling machine learning algorithms which are quite effective for data analyses

In [6]:
import sklearn 
print("scikit-learn version: {}".format(sklearn.__version__))

scikit-learn version: 0.18.2



### Let me add IPython as well, mainly for its display function, which beautifies the pandas dataframes in the Jupyter notebook

In [7]:
import IPython
from IPython import display 
print("IPython version: {}".format(IPython.__version__))

IPython version: 5.3.0
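
The per-library version checks above could also be collected in a single loop. This is only a minimal sketch: `report_versions` is a hypothetical helper introduced for illustration (it is not part of any library), and it assumes each package exposes a `__version__` attribute, falling back to `"unknown"` when it does not.

```python
import importlib


def report_versions(module_names):
    """Return a dict mapping each module name to its version string.

    Hypothetical helper: a module that cannot be imported maps to None,
    and a module without a __version__ attribute maps to "unknown".
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


# The libraries imported so far in this section
libraries = ["pandas", "matplotlib", "numpy", "scipy", "sklearn", "IPython"]
for name, version in report_versions(libraries).items():
    print("{} version: {}".format(name, version))
```

A missing package shows up as `None` instead of stopping the notebook, which makes it easier to spot an incomplete environment before the modeling starts.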



### Who doesn't need a few miscellaneous functions? The two most important ones: time and randomization.
>### Let me also ignore warnings such as deprecation alerts, etc. for a smooth guide.

In [8]:
import random
import time
import warnings
warnings.filterwarnings('ignore')
print('-'*100)

----------------------------------------------------------------------------------------------------
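
Ignoring every warning keeps the guide smooth, but it can also hide alerts that matter. As a hedged alternative, the sketch below silences only `DeprecationWarning` and keeps everything else visible. `legacy_call` is a hypothetical stand-in for a library function that emits a deprecation alert; it is not taken from this notebook.

```python
import warnings


def legacy_call():
    """Hypothetical stand-in for a library call that emits a deprecation alert."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return "done"


# catch_warnings restores the previous filter state when the block exits
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from a state where every warning is reported
    warnings.filterwarnings("ignore", category=DeprecationWarning)  # silence only this class
    result = legacy_call()
    warnings.warn("check your input", UserWarning)

# Only the non-deprecation warning was recorded
remaining = [str(w.message) for w in caught]
print(remaining)
```

The trade-off is a slightly noisier notebook in exchange for still seeing warnings about the data or the model.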


***
##  3.2    Importing the data modeling libraries

### Importing models / algorithms for the analyses

In [9]:
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier

### Importing some model helpers from the libraries in section 3.1

In [10]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

### Loading other functions for a good visualization of the entire modeling process and analysis

In [11]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix  # pandas.tools.plotting is deprecated since pandas 0.20
from plotly import tools
import plotly.plotly as py
import plotly.figure_factory as ff
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

***
##  3.3 Knowing the Data We Will Deal With

### In this section, we make a first attempt to understand what the data is made up of. Fortunately, ANZ and DataCastle gave an elaborate description of what to expect.

***
### INPUT VARIABLES
***

### Bank client data:
>#### 1. Age (numeric)
>#### 2. Job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
>#### 3. Marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
>#### 4. Education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
>#### 5. Default: has credit in default? (categorical: 'no','yes','unknown')
>#### 6. Housing: has home loan? (categorical: 'no','yes','unknown')
>#### 7. Loan: has personal loan? (categorical: 'no','yes','unknown')
 
### Related to the last contact of the current campaign:
>#### 8. Contact: contact communication type (categorical: 'cellular','telephone') 
>#### 9. Month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
>#### 10. Day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
>#### 11. Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
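
The note on Duration suggests keeping two feature sets: one with the attribute for benchmarking and one without it for a realistic model. A minimal sketch of that split follows. It assumes the column is named `duration` and the target `y` (the real column names in the file may differ), and `split_features` is a hypothetical helper run here on toy data.

```python
import pandas as pd


def split_features(frame, target="y", leaky=("duration",)):
    """Return (X_benchmark, X_realistic, y).

    X_benchmark keeps every input; X_realistic drops the columns that are
    only known after the call has taken place (assumed names).
    """
    y = frame[target]
    X_benchmark = frame.drop(target, axis=1)
    X_realistic = X_benchmark.drop(list(leaky), axis=1)
    return X_benchmark, X_realistic, y


# Toy data for illustration only, not taken from the real file
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [0, 210, 640],
    "y": ["no", "no", "yes"],
})

X_benchmark, X_realistic, y = split_features(toy)
print(list(X_benchmark.columns))
print(list(X_realistic.columns))
```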

###  Other attributes:
>#### 12. Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
>#### 13. Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
>#### 14. Previous: number of contacts performed before this campaign and for this client (numeric)
>#### 15. Poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
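
Because Pdays uses 999 as a code for "not previously contacted", treating it as a real number of days would distort averages and distance-based models. One possible treatment, sketched on toy data and assuming the column is named `pdays`, is to split it into a flag and a cleaned numeric column. The new column names are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; 999 is the "not previously contacted" code
toy = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# Flag clients who were contacted in a previous campaign
toy["previously_contacted"] = toy["pdays"] != 999

# Keep real day counts and turn the 999 code into a missing value
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != 999)

print(toy)
print("Mean days since last contact:", toy["pdays_clean"].mean())
```

Whether the missing values are then imputed or left to the model is a separate decision that depends on the algorithm used.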

###  Social and economic context attributes:
>#### 16. Emp.var.rate: employment variation rate, quarterly indicator (numeric)
>#### 17. Cons.price.idx: consumer price index, monthly indicator (numeric)
>#### 18. Cons.conf.idx: consumer confidence index, monthly indicator (numeric)
>#### 19. Euribor3m: euribor 3 month rate, daily indicator (numeric)
>#### 20. Nr.employed: number of employees, quarterly indicator (numeric)

### OUTPUT VARIABLE (desired target):
>#### 21. y - has the client subscribed to a term deposit? (binary: 'yes','no')
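
Most classifiers expect a numeric target, so the 'yes'/'no' labels are usually mapped to 1/0 at some point. The sketch below shows one way to do that on toy data. The variable names are hypothetical and the real target column may need different handling.

```python
import pandas as pd

# Toy target for illustration only
y_raw = pd.Series(["no", "yes", "no", "no"], name="y")

# Map the two documented labels to integers; any other label becomes NaN
y_binary = y_raw.map({"no": 0, "yes": 1})

# Count labels that were not covered by the mapping
unexpected = int(y_binary.isnull().sum())

print(y_binary.tolist())
print("Unexpected labels:", unexpected)
print("Share of subscribers:", y_binary.mean())
```

Counting the unmapped labels is a cheap way to catch typos or extra categories in the target before training.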

### Despite this elaborate information, we at least want to confirm that the data content is what we expect, since it is what we'll be working with throughout this work


***
### Declaring path to the file:
>#### You should have about 37069 rows with 21 columns

In [None]:
location =".../"
file = "DataCastleData.csv"

df = pd.read_csv(location + file)

# Keep an untouched copy of the raw data
actual_df = df.copy()

# Report the shape of our dataset (rows, columns)
print("We have " + str(df.shape[0]) + " rows with " + str(df.shape[1]) + " columns.")
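
Beyond the shape, the documented category levels can be checked against what the file actually contains. The sketch below is only an illustration on toy data: `EXPECTED_LEVELS` covers three of the variables listed above, the lower-case column names are an assumption about the file, and `unexpected_levels` is a hypothetical helper.

```python
import pandas as pd

# Documented levels for a few categorical variables (column names are assumed)
EXPECTED_LEVELS = {
    "contact": {"cellular", "telephone"},
    "poutcome": {"failure", "nonexistent", "success"},
    "y": {"yes", "no"},
}


def unexpected_levels(frame, expected=EXPECTED_LEVELS):
    """Return columns that are missing or contain undocumented levels."""
    problems = {}
    for column, allowed in expected.items():
        if column not in frame.columns:
            problems[column] = "missing column"
            continue
        extra = set(frame[column].dropna().unique()) - allowed
        if extra:
            problems[column] = sorted(extra)
    return problems


# Toy data for illustration only; "fax" is deliberately undocumented
toy = pd.DataFrame({
    "contact": ["cellular", "telephone", "fax"],
    "poutcome": ["failure", "success", "nonexistent"],
    "y": ["no", "yes", "no"],
})

print(unexpected_levels(toy))
```

An empty result would mean the checked columns match the description; anything else points to the columns worth inspecting first.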