# XGBoost mini-project for Yazabi/SharpestMinds
- Author: Chris Hodapp
- Date: 2017-11-09
- Dataset: [Wisconsin Diagnostic Breast Cancer (WDBC)](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names)
- To read:
  - [A Guide to Gradient Boosted Trees with XGBoost in Python](https://jessesw.com/XG-Boost/)
  - [XGBoost: A Scalable Tree Boosting System](https://arxiv.org/pdf/1603.02754v1.pdf)

In [1]:
import pandas as pd
import train_and_test

In [2]:
# Just for testing: reload the module so edits to train_and_test are picked up
# without restarting the kernel.
import importlib
train_and_test = importlib.reload(train_and_test)

In [3]:
train_raw = train_and_test.read_data("data/train_data.txt")
test_raw = train_and_test.read_data("data/test_data.txt")
both = pd.concat((train_raw, test_raw))
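One thing to keep in mind about the `pd.concat` call: by default it keeps each frame's own row labels, so the combined frame has repeated index values. The sketch below uses two small made-up frames (`train_toy` and `test_toy` are hypothetical stand-ins, not the WDBC data) to show the difference `ignore_index=True` makes.

```python
import pandas as pd

# Toy stand-ins for train_raw and test_raw (hypothetical values, not the WDBC data).
train_toy = pd.DataFrame({"ID": [101, 102, 103], "diag": ["B", "M", "B"]})
test_toy = pd.DataFrame({"ID": [201, 202], "diag": ["M", "B"]})

# Default behaviour: each frame's row labels are kept, so 0 and 1 appear twice.
kept = pd.concat((train_toy, test_toy))

# ignore_index=True relabels the combined rows 0..n-1.
renumbered = pd.concat((train_toy, test_toy), ignore_index=True)

print(kept.index.tolist())        # [0, 1, 2, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

Repeated labels are harmless for whole-frame summaries, but label-based lookups such as `kept.loc[0]` return more than one row.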

- The dataset description says that no values are missing, and `info()` agrees: each listed column has 569 non-null entries.

In [4]:
both.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 569 entries, 0 to 113
Data columns (total 32 columns):
ID                    569 non-null int64
diag                  569 non-null object
radius                569 non-null float64
texture               569 non-null float64
perimeter             569 non-null float64
area                  569 non-null float64
smoothness            569 non-null float64
compactness           569 non-null float64
concavity             569 non-null float64
concave_points        569 non-null float64
symmetry              569 non-null float64
fractal_dim           569 non-null float64
radius_std            569 non-null float64
texture_std           569 non-null float64
perimeter_std         569 non-null float64
area_std              569 non-null float64
smoothness_std        569 non-null float64
compactness_std       569 non-null float64
concavity_std         569 non-null float64
concave_points_std    569 non-null float64
symmetry_std          569 non-null flo
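The non-null counts reported by `info()` can also be reduced to a single number with `isnull()`. This is a minimal sketch on a made-up frame (`toy` is hypothetical and has one missing value planted on purpose), not a run against the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (hypothetical data).
toy = pd.DataFrame({
    "radius": [14.58, 12.34, np.nan],
    "texture": [13.66, 14.95, 25.22],
})

# Per-column count of missing values, then the grand total.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```

Applied to `both`, the same two lines would be expected to give a total of 0 if the dataset description holds.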

In [5]:
both[["diag"]].groupby("diag").size()

diag
B    357
M    212
dtype: int64
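The counts above (B is benign, M is malignant) are easier to compare as fractions. A small sketch, using only the two numbers from the output:

```python
import pandas as pd

# Class counts copied from the groupby output above.
counts = pd.Series({"B": 357, "M": 212}, name="diag")

# Share of each class in the combined train + test data.
fractions = counts / counts.sum()

print(fractions.round(3))
```

Roughly 63% of the samples are benign, so a classifier that always predicts B would already reach about 63% accuracy on the combined data. That is a useful floor to compare any model against.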

In [6]:
train_raw.iloc[:20,:]

| | ID | diag | radius | texture | perimeter | area | smoothness | compactness | concavity | concave_points | ... | radius_w | texture_w | perimeter_w | area_w | smoothness_w | compactness_w | concavity_w | concave_points_w | symmetry_w | fractal_dim_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 915940 | B | 14.58 | 13.66 | 94.29 | 658.8 | 0.09832 | 0.08918 | 0.08222 | 0.04349 | ... | 16.76 | 17.24 | 108.5 | 862.0 | 0.1223 | 0.1928 | 0.2492 | 0.09186 | 0.2626 | 0.07048 |
| 1 | 904969 | B | 12.34 | 14.95 | 78.29 | 469.1 | 0.08682 | 0.04571 | 0.02109 | 0.02054 | ... | 13.18 | 16.85 | 84.11 | 533.1 | 0.1048 | 0.06744 | 0.04921 | 0.04793 | 0.2298 | 0.05974 |
| 2 | 88466802 | B | 10.65 | 25.22 | 68.01 | 347.0 | 0.09657 | 0.07234 | 0.02379 | 0.01615 | ... | 12.25 | 35.19 | 77.98 | 455.7 | 0.1499 | 0.1398 | 0.1125 | 0.06136 | 0.3409 | 0.08147 |
| 3 | 843786 | M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | ... | 15.47 | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 |
| 4 | 903811 | B | 14.06 | 17.18 | 89.75 | 609.1 | 0.08045 | 0.05361 | 0.02681 | 0.03251 | ... | 14.92 | 25.34 | 96.42 | 684.5 | 0.1066 | 0.1231 | 0.0846 | 0.07911 | 0.2523 | 0.06609 |
| 5 | 88350402 | B | 13.64 | 15.6 | 87.38 | 575.3 | 0.09423 | 0.0663 | 0.04705 | 0.03731 | ... | 14.85 | 19.05 | 94.11 | 683.4 | 0.1278 | 0.1291 | 0.1533 | 0.09222 | 0.253 | 0.0651 |
| 6 | 891703 | B | 11.85 | 17.46 | 75.54 | 432.7 | 0.08372 | 0.05642 | 0.02688 | 0.0228 | ... | 13.06 | 25.75 | 84.35 | 517.8 | 0.1369 | 0.1758 | 0.1316 | 0.0914 | 0.3101 | 0.07007 |
| 7 | 871642 | B | 10.66 | 15.15 | 67.49 | 349.6 | 0.08792 | 0.04302 | 0.0 | 0.0 | ... | 11.54 | 19.2 | 73.2 | 408.3 | 0.1076 | 0.06791 | 0.0 | 0.0 | 0.271 | 0.06164 |
| 8 | 8911230 | B | 11.33 | 14.16 | 71.79 | 396.6 | 0.09379 | 0.03872 | 0.001487 | 0.003333 | ... | 12.2 | 18.99 | 77.37 | 458.0 | 0.1259 | 0.07348 | 0.004955 | 0.01111 | 0.2758 | 0.06386 |
| 9 | 8912049 | M | 19.16 | 26.6 | 126.2 | 1138.0 | 0.102 | 0.1453 | 0.1921 | 0.09664 | ... | 23.72 | 35.9 | 159.8 | 1724.0 | 0.1782 | 0.3841 | 0.5754 | 0.1872 | 0.3258 | 0.0972 |


In [7]:
# Verify that all IDs are unique (and we can just ignore them):
len(both.ID), len(both.ID.unique())

(569, 569)
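Since every ID is distinct, the column works only as a row identifier and can be left out of the model inputs. The sketch below shows one way to do that split. It uses four rows copied from the table above with only the `radius` feature; the names `features` and `labels` and the choice to encode M as 1 are assumptions for illustration, not necessarily what `train_and_test` does.

```python
import pandas as pd

# Four rows taken from the training-data preview, radius column only.
toy = pd.DataFrame({
    "ID": [915940, 904969, 88466802, 843786],
    "diag": ["B", "B", "B", "M"],
    "radius": [14.58, 12.34, 10.65, 12.45],
})

# Same check as the length comparison, expressed as a single boolean.
assert toy["ID"].is_unique

# Drop the identifier and the label to get the feature matrix.
features = toy.drop(columns=["ID", "diag"])

# Assumed encoding: malignant (M) -> 1, benign (B) -> 0.
labels = (toy["diag"] == "M").astype(int)

print(features.columns.tolist())
print(labels.tolist())
```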