# Exercise 1
## Loading the dataset and creating the target

The goal of this exercise is to:
- Obtain the csv dataset from the UCI machine learning repository
- Load the dataset into memory as a pandas dataframe
- Identify an appropriate target to predict, and create a binary numerical column of the target

#### Download the dataset
Visit https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/ and download the files named `breast-cancer.data` and `breast-cancer.names`. Create a folder named `data` and deposit the files in this folder.

The data is located in the `breast-cancer.data` file and information about the data is provided in the `breast-cancer.names` file.
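
If you would rather fetch the files from a script than through the browser, the download step can be sketched as below. This is only a minimal sketch: the helper names `build_downloads` and `download_dataset` are hypothetical, and it assumes the URL given above is still reachable from your machine.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL and file names are taken from the instructions above.
BASE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/"
FILENAMES = ["breast-cancer.data", "breast-cancer.names"]


def build_downloads(base_url=BASE_URL, filenames=FILENAMES, data_dir="data"):
    """Return (url, local_path) pairs for each file to fetch (hypothetical helper)."""
    return [(base_url + name, Path(data_dir) / name) for name in filenames]


def download_dataset(data_dir="data"):
    """Create the data folder and download any file that is not already there."""
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    for url, path in build_downloads(data_dir=data_dir):
        if not path.exists():  # skip files that were already downloaded
            urlretrieve(url, str(path))
```

Calling `download_dataset()` once should leave both files in the `data` folder; running it again does nothing if the files already exist.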

Let's take a sneak peek at the data:

In [2]:
!head data/breast-cancer.data

no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no
no-recurrence-events,60-69,ge40,15-19,0-2,no,2,left,left_low,no
no-recurrence-events,50-59,premeno,25-29,0-2,no,2,left,left_low,no
no-recurrence-events,60-69,ge40,20-24,0-2,no,1,left,left_low,no
no-recurrence-events,40-49,premeno,50-54,0-2,no,2,left,left_low,no
no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,left_up,no


Here we can see the first 10 rows of the data. We note that there are no column names, so we'll make sure to supply them when we load the data with pandas.

In [7]:
import pandas as pd
colnames = ['Class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
data = pd.read_csv('data/breast-cancer.data', names=colnames)

In [8]:
data.head(n=10)

Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no
5,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,left,left_low,no
6,no-recurrence-events,50-59,premeno,25-29,0-2,no,2,left,left_low,no
7,no-recurrence-events,60-69,ge40,20-24,0-2,no,1,left,left_low,no
8,no-recurrence-events,40-49,premeno,50-54,0-2,no,2,left,left_low,no
9,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,left_up,no


Looks good. Let's see how many rows and columns we have:

In [9]:
print(f'There are {data.shape[0]} rows and {data.shape[1]} columns')

There are 286 rows and 10 columns
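
Before moving on, it can be worth checking for missing values. The sketch below assumes that missing entries are encoded as `?`, which is a common convention in UCI datasets but is not confirmed by anything shown so far. It uses two in-memory rows rather than the real file: the first is copied from the output above and the second is made up to contain a `?`.

```python
import io

import pandas as pd

colnames = ['Class', 'age', 'menopause', 'tumor-size', 'inv-nodes',
            'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']

# Two rows in the same layout as the file; the '?' in the second row is
# an invented example of a missing value (assumed encoding).
sample = io.StringIO(
    "no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no\n"
    "no-recurrence-events,40-49,premeno,20-24,0-2,?,2,right,right_up,no\n"
)

# na_values tells pandas to treat '?' as NaN while parsing.
demo = pd.read_csv(sample, names=colnames, na_values='?')

# Count missing entries per column and show only the affected columns.
missing_per_column = demo.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

To apply the same check to the real data, pass `na_values='?'` to the `pd.read_csv` call used earlier and inspect `data.isna().sum()`.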


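Now let's read the accompanying `breast-cancer.names` file to learn more about the data.

In [12]:
# Use a context manager so the file is closed even if reading fails
with open('data/breast-cancer.names', 'r') as f:
    file_contents = f.read()
print(file_contents)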

Citation Request:
   This breast cancer domain was obtained from the University Medical Centre,
   Institute of Oncology, Ljubljana, Yugoslavia.  Thanks go to M. Zwitter and 
   M. Soklic for providing the data.  Please include this citation if you plan
   to use this database.

1. Title: Breast cancer data (Michalski has used this)

2. Sources: 
   -- Matjaz Zwitter & Milan Soklic (physicians)
      Institute of Oncology 
      University Medical Center
      Ljubljana, Yugoslavia
   -- Donors: Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
   -- Date: 11 July 1988

3. Past Usage: (Several: here are some)
     -- Michalski,R.S., Mozetic,I., Hong,J., & Lavrac,N. (1986). The 
        Multi-Purpose Incremental Learning System AQ15 and its Testing 
        Application to Three Medical Domains.  In Proceedings of the 
        Fifth National Conference on Artificial Intelligence, 1041-1045,
        Philadelphia, PA: Morgan Kaufmann.
        -- accuracy range: 66%-72%
     -

Looks like we have a given output variable that we can use as our target: the first column, "Class". This seems like an intuitive and, most importantly, useful target to predict: whether the patient has a recurrence event. In a clinical setting this would be a valuable variable to predict!

In [7]:
feats = data.drop('Class', axis=1)
target = data['Class']
print(f'Features table has {feats.shape[0]} rows and {feats.shape[1]} columns')
print(f'Target table has {target.shape[0]} rows')

Features table has 286 rows and 9 columns
Target table has 286 rows


Looks good. Let's take a quick look at the target and then save both tables as CSVs for later.

In [8]:
target.head()

0    no-recurrence-events
1    no-recurrence-events
2    no-recurrence-events
3    no-recurrence-events
4    no-recurrence-events
Name: Class, dtype: object
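
The exercise goals call for a binary numerical target, while the target column still holds string labels. One possible way to convert it is sketched below on a small stand-in Series. The positive label `recurrence-events` is an assumption, since it does not appear in the rows printed so far; check `target.unique()` on the real data before relying on the mapping.

```python
import pandas as pd

# Small stand-in for the real `target` Series.
target_demo = pd.Series(
    ['no-recurrence-events', 'recurrence-events', 'no-recurrence-events'],
    name='Class',
)

# Assumed label set: 'recurrence-events' is not visible in the output above.
label_map = {'no-recurrence-events': 0, 'recurrence-events': 1}
target_binary = target_demo.map(label_map)

# Labels missing from the map become NaN, so fail loudly if that happens.
assert target_binary.notna().all(), 'unexpected class label'
target_binary = target_binary.astype(int)

print(target_binary.tolist())  # [0, 1, 0]
```

Coding recurrence as 1 makes it the positive class, which is usually the convenient choice when the event of interest is the rarer outcome.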

In [9]:
feats.to_csv('data/breast_cancer_feats.csv')
# header=True writes the Series name as the column header
target.to_csv('data/breast_cancer_target.csv', header=True)
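
Note that `to_csv` writes the row index as an unnamed first column by default. The sketch below, which uses a throwaway frame and a temporary folder rather than the real files, shows what that means when the CSV is read back in: without `index_col=0` the old index reappears as a column named `Unnamed: 0`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Throwaway frame standing in for `feats`.
feats_demo = pd.DataFrame({'age': ['30-39', '40-49'], 'deg-malig': [3, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'feats_demo.csv'
    feats_demo.to_csv(path)                     # index is written as the first column
    naive = pd.read_csv(path)                   # index comes back as a data column
    restored = pd.read_csv(path, index_col=0)   # first column is used as the index

print(list(naive.columns))
print(list(restored.columns))
```

An alternative is to pass `index=False` to `to_csv` so that the index is never written in the first place.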