# Exercise 1
## Loading the dataset and creating the target

The goal of this exercise is to:
- Obtain the csv dataset from the UCI machine learning repository
- Load the dataset into memory as a pandas dataframe
- Identify an appropriate target to predict, and create a binary numerical column of the target

#### Download the dataset
Visit https://archive.ics.uci.edu/ml/datasets/bank+marketing and download the zip file. The commands below expect it to be saved as `bank.zip` in the working directory.

Note that the `!` before each command tells the notebook to run it as a shell command.

In [1]:
!mkdir data
!unzip bank.zip -d data/

mkdir: data: File exists
Archive:  bank.zip
replace data/bank-full.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
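
Re-running the cell above is not idempotent: as its output shows, `mkdir` complains that `data` already exists and `unzip` stops to ask whether to overwrite files. One possible alternative is Python's standard `zipfile` module, sketched below. The helper name `extract_archive` and the throwaway demo archive are illustrative assumptions, not part of the original exercise.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # existing files are overwritten without a prompt
        return sorted(archive.namelist())


# Demonstration on a throwaway archive so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("bank.csv", '"age";"y"\n30;"no"\n')

    extracted = extract_archive(demo_zip, Path(tmp) / "data")
    rerun = extract_archive(demo_zip, Path(tmp) / "data")  # a second run is safe
    print(extracted)
```

In the notebook itself the equivalent call would be `extract_archive('bank.zip', 'data')`.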


Let's take a sneak peek at the data.

In [2]:
!head data/bank.csv

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
30;"unemployed";"married";"primary";"no";1787;"no";"no";"cellular";19;"oct";79;1;-1;0;"unknown";"no"
33;"services";"married";"secondary";"no";4789;"yes";"yes";"cellular";11;"may";220;1;339;4;"failure";"no"
35;"management";"single";"tertiary";"no";1350;"yes";"no";"cellular";16;"apr";185;1;330;1;"failure";"no"
30;"management";"married";"tertiary";"no";1476;"yes";"yes";"unknown";3;"jun";199;4;-1;0;"unknown";"no"
59;"blue-collar";"married";"secondary";"no";0;"yes";"no";"unknown";5;"may";226;1;-1;0;"unknown";"no"
35;"management";"single";"tertiary";"no";747;"no";"no";"cellular";23;"feb";141;2;176;3;"failure";"no"
36;"self-employed";"married";"tertiary";"no";307;"yes";"no";"cellular";14;"may";341;1;330;2;"other";"no"
39;"technician";"married";"secondary";"no";147;"yes";"no";"cellular";6;"may";151;2;-1;0;"unknown";"no"
41;"entrepreneur

Here we can see that the separator between fields is a ";" rather than the "," that is usual for CSVs. We'll make sure to pass that separator when we load the data with pandas.

In [3]:
import pandas as pd
bank_data = pd.read_csv('data/bank.csv', sep=';')
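
To see why the `sep` argument matters, the sketch below parses a small in-memory excerpt twice: once with the default separator and once with `sep=';'`. The excerpt is a trimmed-down stand-in (four of the columns, two of the rows from the `head` output above), and the variable names are illustrative only.

```python
import io

import pandas as pd

# Trimmed excerpt of the file: four columns, two rows, same quoting and separator.
sample = (
    '"age";"job";"marital";"y"\n'
    '30;"unemployed";"married";"no"\n'
    '33;"services";"married";"no"\n'
)

# With the default "," separator each line is read as a single field.
default_parse = pd.read_csv(io.StringIO(sample))

# With ";" the fields are split into separate columns.
semicolon_parse = pd.read_csv(io.StringIO(sample), sep=";")

print(default_parse.shape)
print(semicolon_parse.shape)
print(list(semicolon_parse.columns))
```

A single-column result after loading is usually the first hint that the separator is wrong.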

In [4]:
bank_data.head(n=20)

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
5,35,management,single,tertiary,no,747,no,no,cellular,23,feb,141,2,176,3,failure,no
6,36,self-employed,married,tertiary,no,307,yes,no,cellular,14,may,341,1,330,2,other,no
7,39,technician,married,secondary,no,147,yes,no,cellular,6,may,151,2,-1,0,unknown,no
8,41,entrepreneur,married,tertiary,no,221,yes,no,unknown,14,may,57,2,-1,0,unknown,no
9,43,services,married,primary,no,-88,yes,yes,cellular,17,apr,313,1,147,2,failure,no


Looks good. Let's see how many rows and columns we have.

In [5]:
print(f'There are {bank_data.shape[0]} rows and {bank_data.shape[1]} columns')

There are 4521 rows and 17 columns


In [6]:
# Print the dataset description that ships with the archive.
# The `with` block closes the file even if reading fails.
with open('data/bank-names.txt', 'r') as f:
    print(f.read())

Citation Request:
  This dataset is public available for research. The details are described in [Moro et al., 2011]. 
  Please include this citation if you plan to use this database:

  [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 
  In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

  Available at: [pdf] http://hdl.handle.net/1822/14838
                [bib] http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt

1. Title: Bank Marketing

2. Sources
   Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012
   
3. Past Usage:

  The full dataset was described and analyzed in:

  S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. 
  In P. Novais et al. (Eds.), Proceedings of the European S

It looks like the dataset comes with an output variable that we can use as our target: the 17th column, `y`. This is an intuitive and, most importantly, useful target to predict. If we were working for a bank, this would be a great variable to predict!

In [7]:
feats = bank_data.drop('y', axis=1)
target = bank_data['y']
print(f'Features table has {feats.shape[0]} rows and {feats.shape[1]} columns')
print(f'Target table has {target.shape[0]} rows')

Features table has 4521 rows and 16 columns
Target table has 4521 rows


Looks good. Let's take a quick look at the target and then save both tables as CSVs for later.

In [8]:
target.head()

0    no
1    no
2    no
3    no
4    no
Name: y, dtype: object
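
The target values are strings (`dtype: object`), while the goals of this exercise mention a binary numerical column. A minimal sketch of one way to build it follows, assuming the only labels in `y` are `"yes"` and `"no"`. The names `target_sample` and `target_binary` are illustrative, and the short Series stands in for the real 4521-row `target`.

```python
import pandas as pd

# Stand-in for the `target` Series above (assumption: labels are "yes"/"no").
target_sample = pd.Series(["no", "yes", "no", "no", "yes"], name="y")

# Map the two labels explicitly; any label missing from the mapping becomes NaN
# rather than being silently encoded.
target_binary = target_sample.map({"no": 0, "yes": 1})

# Fail fast if a label was not covered by the mapping.
if target_binary.isna().any():
    raise ValueError("unexpected label in target")

# Safe to cast to integers now that no NaN values remain.
target_binary = target_binary.astype("int64")
print(target_binary.tolist())
```

An explicit mapping is used here instead of a comparison such as `target == 'yes'` so that a misspelled or missing label raises an error instead of quietly becoming 0.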

In [9]:
feats.to_csv('data/bank_data_feats.csv')
# header=True writes the Series name ('y') as the column header
target.to_csv('data/bank_data_target.csv', header=True)
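
Because `to_csv` writes the row index by default, the saved files gain an extra unnamed first column. The sketch below shows the effect on reload and one way to handle it with `index_col=0`. It uses an in-memory buffer and a two-row stand-in table (`feats_sample`, a hypothetical name) rather than the real files.

```python
import io

import pandas as pd

# Two-row stand-in for the features table.
feats_sample = pd.DataFrame({"age": [30, 33], "job": ["unemployed", "services"]})

buffer = io.StringIO()
feats_sample.to_csv(buffer)  # the index is written as an unnamed first column

# Reading without index_col turns the saved index into a regular column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading with index_col=0 restores it as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` when saving would avoid the extra column altogether. Whether that is preferable depends on how later steps read the files back.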