
We want to find the best model for predicting whether a person's tumour is benign or malignant, based on a set of predictors.

In this notebook we focus on loading the data, basic exploration, and preparation.

After reviewing and studying this material you should be able to:
1. Import and install Python libraries
    * Anaconda has most of the libraries you'll need preloaded, but there are times you may need to install a new package.
2. Set the random seed (this ensures your work is repeatable)
    * For this course, always use 1 as your random seed. If you do not, your results will differ from the ones on the marking key and you will lose marks.
3. Load data
    * This can be from a database, website, file, or other source. In this example we will load data from a CSV (comma-separated values) file.
4. Conduct basic evaluation of the data
    * We want to get to know the data in the context of our problem.
    * What is our target variable?
    * What types of data do we have?
    * How many features and observations?
    * Do we have missing data?
    * Do we see evidence of corrupt data?
    * For any categorical variables: are they nominal, ordinal without equal distance, or ordinal that can be represented as an interval?
5. Process the data
    * Conduct pre-split data cleaning
    * Split data into training and test sets
    * Conduct post-split data cleaning
6. Save the data (we'll start modeling it later)
    * Save the cleaned data to a CSV file.

## 1.0 Import and install python libraries

Here we import the Python libraries that we plan to use. Any library that we import must be installed on your computer. NumPy and pandas should be installed as part of Anaconda, but if a library is missing you can install it with the conda command from a terminal:

conda install -c conda-forge <package/library name you want to install>

For example:
conda install -c conda-forge numpy
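
Before installing anything, it can help to check which libraries are already available. The snippet below is a minimal sketch of one way to do that from Python; the helper name `is_installed` is made up for this illustration and is not part of any library.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Check the main libraries used in this notebook.
for name in ["numpy", "pandas", "sklearn"]:
    status = "installed" if is_installed(name) else "missing (install it with conda)"
    print(f"{name}: {status}")
```

Note that the import name can differ from the package name: scikit-learn is imported as `sklearn` but installed as `scikit-learn`.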

In [40]:
# import numpy, pandas, matplotlib, and the scikit-learn tools
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

## 2.0 Set Random Seed

It's *very* important that you set this! In this course we will use the random seed value of 1.

In [9]:
# set random seed to ensure that results are repeatable
np.random.seed(1)
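
To see what the seed buys us, here is a small self-contained sketch: resetting the seed to the same value makes NumPy produce the same "random" numbers again. The variable names are only for illustration.

```python
import numpy as np

# Draw three random numbers after seeding with 1.
np.random.seed(1)
first_draw = np.random.rand(3)

# Re-seed with the same value and draw again.
np.random.seed(1)
second_draw = np.random.rand(3)

# The two draws are identical because the generator restarted from the same state.
print(np.array_equal(first_draw, second_draw))  # True
```

Because the seed fixes a sequence, results also depend on running the cells in the same order each time.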

## 3.0 Load data 

In [10]:
# load data
cancer = pd.read_csv("breast-cancer-wisconsin.csv")

## 4.0 Conduct initial exploration of the data

We have a number of input variables and one target variable. For this analysis, the target variable is the tumour class (benign or malignant), stored in the last column of the file (labelled `2.1` here).

First, our initial exploration of the data should answer the following questions:
1. How many rows and columns are there?
2. How much of a problem do we have with NA's?
3. What types of data are stored in the columns?
    1. Identify which variables are numeric and may need to be standardized later.
    2. Identify which variables are categorical and may need to be transformed using an encoder such as a one-hot encoder.
4. Are there errors in the data? This is a common problem with categorical variables, where a category is misspelled or spelled differently in some instances.
 

In [11]:
# look at the data
cancer.head(3) # note that we don't want to dump all the data to the screen

Unnamed: 0,1000025,5,1,1.1,1.2,2,1.3,3,1.4,1.5,2.1
0,1002945,5,4,4,5,7,10,3,2,1,2
1,1015425,3,1,1,1,2,2,3,1,1,2
2,1016277,6,8,8,1,3,4,3,7,1,2
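
The column labels in the output above (`1000025`, `5`, `1`, `1.1`, ...) look like data values rather than names, which suggests the file has no header row and pandas used the first record as the header. The sketch below shows one way to load such a file so that the first record is kept as data. It uses an in-memory sample in place of the real file, and the column names are assumptions based on the commonly documented attributes of the Wisconsin breast cancer data; they do not appear in this notebook.

```python
import io

import pandas as pd

# In-memory stand-in for the first rows of "breast-cancer-wisconsin.csv",
# reconstructed from the head() output above.
sample_csv = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1015425,3,1,1,1,2,2,3,1,1,2\n"
    "1016277,6,8,8,1,3,4,3,7,1,2\n"
)

# Assumed column names (not taken from this notebook).
column_names = [
    "id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
    "marginal_adhesion", "epithelial_cell_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

# header=None tells pandas that the first line is data, not column labels.
cancer_named = pd.read_csv(sample_csv, header=None, names=column_names)
print(cancer_named.shape)  # (4, 11)
```

The rest of this notebook keeps the original load and the numeric-looking labels, so the outputs shown here correspond to that version.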


In [12]:
# generate a basic summary of the data
cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   1000025  682 non-null    int64
 1   5        682 non-null    int64
 2   1        682 non-null    int64
 3   1.1      682 non-null    int64
 4   1.2      682 non-null    int64
 5   2        682 non-null    int64
 6   1.3      682 non-null    int64
 7   3        682 non-null    int64
 8   1.4      682 non-null    int64
 9   1.5      682 non-null    int64
 10  2.1      682 non-null    int64
dtypes: int64(11)
memory usage: 58.7 KB


In [13]:
# generate a statistical summary of the numeric values in the data
cancer.describe()

Unnamed: 0,1000025,5,1,1.1,1.2,2,1.3,3,1.4,1.5,2.1
count,682.0,682.0,682.0,682.0,682.0,682.0,682.0,682.0,682.0,682.0,682.0
mean,1076833.0,4.441349,3.153959,3.218475,2.832845,3.23607,3.548387,3.445748,2.872434,1.604106,2.70088
std,621092.6,2.822751,3.066285,2.989568,2.865805,2.224214,3.645226,2.451435,3.054065,1.733792,0.954916
min,63375.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,877454.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,2.0
50%,1171820.0,4.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0,2.0
75%,1238741.0,6.0,5.0,5.0,4.0,4.0,6.0,5.0,4.0,1.0,4.0
max,13454350.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


In [16]:
# Check the missing values by summing the total na's for each variable
cancer.isna().sum()

1000025    0
5          0
1          0
1.1        0
1.2        0
2          0
1.3        0
3          0
1.4        0
1.5        0
2.1        0
dtype: int64

In [17]:
# create a list of the categorical (object dtype) variables
category_var_list = list(cancer.select_dtypes(include='object').columns)
category_var_list

[]

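### Summary of the findings from our initial evaluation of the data

* There are 682 rows and 11 columns, all of type int64, with no missing values.
* The column labels (`1000025`, `5`, `1`, ...) look like data values, which suggests the file has no header row and its first record was read as the header.
* There are no categorical (object) variables, so no encoding is needed and we can proceed.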
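
A few extra checks can help with the "corrupt data" question from the list at the top. The sketch below runs on a small toy frame rather than the real data. The 1 to 10 valid range is an assumption taken from the min and max values in the `describe()` output, and the column labels mimic the ones in this notebook.

```python
import pandas as pd

# Toy stand-in for the cancer data: two predictors and the class column "2.1".
toy = pd.DataFrame({
    "5": [5, 3, 6, 4],
    "1": [4, 1, 8, 1],
    "2.1": [2, 2, 2, 4],
})
target_col = "2.1"

# How many observations fall in each class?
class_counts = toy[target_col].value_counts()

# Count predictor values outside the assumed valid range of 1 to 10.
features = toy.drop(columns=[target_col])
out_of_range = ((features < 1) | (features > 10)).sum()

# Count fully duplicated rows.
n_duplicates = toy.duplicated().sum()

print(class_counts)
print(out_of_range)
print(n_duplicates)
```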

## 5.0 Process the data

* Conduct any data preparation that should be done *BEFORE* the data split.
* Split the data.
* Conduct any data preparation that should be done *AFTER* the data split.

### 5.1  Conduct any data preparation that should be done *BEFORE* the data split

Tasks at this stage include:
1. Drop any columns/features 
2. Decide if you wish to exclude any observations (rows) due to missing values (NA's).
3. Conduct proper encoding of categorical variables
    1. You can transform them using dummy variable encoding, one-hot-encoding, or label encoding. 

#### Drop any columns/variables we will not be using

In [20]:
# The first column (1000025) looks like a sample ID rather than a predictor, so we drop it.
# Note: this drops by position, so re-running the cell removes one more column each time
# (the outputs further down list 9 columns, which suggests the cell was run twice).
cancer.drop(cancer.columns[0], axis=1, inplace=True)

#### Drop observations with too many NA's

The check in section 4.0 found no missing values, so the `dropna` call below leaves the data unchanged. It is included to show how rows with NA's would be removed.

In [24]:
# If we want to remove rows with NA's use the following code:
cancer.dropna(axis=0, inplace=True)

In [25]:
# verify that there are now no missing values
cancer.isna().sum()

1      0
1.1    0
1.2    0
2      0
1.3    0
3      0
1.4    0
1.5    0
2.1    0
dtype: int64

In [26]:
# investigate how many rows remain
cancer.shape

(682, 9)

#### Encode our categorical variables

Since the class variable (column `2.1`) only has two unique values (benign or malignant), there is no need to perform one-hot encoding on the target variable. None of the predictors are categorical either, so no encoding is required at this step.
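
If a 0/1 target is preferred later on, a simple mapping is enough for a two-class variable. This is a minimal sketch on a toy series. The meaning of the codes (2 = benign, 4 = malignant) is an assumption based on the usual coding of this dataset and is not stated in this notebook.

```python
import pandas as pd

# Toy stand-in for the class column, which holds the values 2 and 4.
toy_target = pd.Series([2, 2, 4, 2, 4], name="2.1")

# Assumption: 2 = benign, 4 = malignant.
class_to_label = {2: 0, 4: 1}
encoded = toy_target.map(class_to_label)

print(encoded.tolist())  # [0, 0, 1, 0, 1]
```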

### 5.2 Split data (train/test)

In [79]:
train_df, test_df = train_test_split(cancer, test_size=0.3)

# to reduce repetition in later code, create variables to represent the columns
# that are our predictors and target
target = '2.1'
predictors = list(cancer.columns)
predictors.remove(target)
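
The split above relies on the global NumPy seed set earlier. As an alternative sketch (not what this notebook does), `train_test_split` also accepts `random_state` to fix the shuffle for that call alone, and `stratify` to keep the class proportions the same in both sets. The toy data and its feature names below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 70 of class 2 and 30 of class 4.
rng = np.random.RandomState(1)
toy = pd.DataFrame({
    "feature_a": rng.randint(1, 11, size=100),
    "feature_b": rng.randint(1, 11, size=100),
    "2.1": [2] * 70 + [4] * 30,
})

# random_state fixes the shuffle; stratify preserves the 70/30 class mix.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=1, stratify=toy["2.1"]
)

print(toy_train.shape, toy_test.shape)
print(toy_test["2.1"].value_counts(normalize=True))
```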


### 5.3  Conduct any data preparation that should be done *AFTER* the data split

We will look at the following:
1) impute any missing numeric values using the mean of the variable/column
2) remove differences of scale by standardizing the numeric variables

#### Standardize numeric values

Now, let's create a common scale between the numeric columns by standardizing each numeric column
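
This dataset has no missing values, so no imputation is needed here. For reference, mean imputation after the split can be sketched as below on toy data. The key point is that the imputer is fitted on the training set only, and the same training means are then used to fill gaps in the test set. The frames and column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train and test sets with some missing values.
toy_train = pd.DataFrame({"a": [1.0, 3.0, np.nan], "b": [2.0, np.nan, 6.0]})
toy_test = pd.DataFrame({"a": [np.nan, 5.0], "b": [1.0, np.nan]})

# Learn the column means from the training set only.
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)

# Fill missing values in both sets using the training means (a: 2.0, b: 4.0).
toy_train_imputed = pd.DataFrame(imputer.transform(toy_train), columns=toy_train.columns)
toy_test_imputed = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)

print(toy_test_imputed)
```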

In [109]:
# create a standard scaler and fit it to the training set of predictors only
scaler = preprocessing.StandardScaler()
scaler.fit(train_df[predictors])

# transform the predictors of the training and test sets, wrapping the results
# in DataFrames so that column names are kept and to_csv() is available later
train_X = pd.DataFrame(scaler.transform(train_df[predictors]),
                       columns=predictors, index=train_df.index)
train_y = train_df[target]

test_X = pd.DataFrame(scaler.transform(test_df[predictors]),
                      columns=predictors, index=test_df.index)
test_y = test_df[target]


## 6.0 Save the data

In [112]:
train_df.to_csv('cancer_train_df.csv', index=False)
train_X.to_csv('cancer_train_X.csv', index=False)
train_y.to_csv('cancer_train_y.csv', index=False)
test_df.to_csv('cancer_test_df.csv', index=False)
test_X.to_csv('cancer_test_X.csv', index=False)
test_y.to_csv('cancer_test_y.csv', index=False)
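
A quick way to confirm the files were written as expected is to read them back and compare. The sketch below does this with toy data in a temporary directory so that it does not touch the real files; the file names are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy predictors and target using the same style of column labels as above.
toy_X = pd.DataFrame({"1": [0.5, -0.5], "1.1": [1.0, -1.0]})
toy_y = pd.Series([2, 4], name="2.1")

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "toy_X.csv")
    y_path = os.path.join(tmp_dir, "toy_y.csv")

    # index=False keeps the row index out of the file.
    toy_X.to_csv(x_path, index=False)
    toy_y.to_csv(y_path, index=False)

    # Read the files back in; the target comes back as a one-column frame.
    reloaded_X = pd.read_csv(x_path)
    reloaded_y = pd.read_csv(y_path)["2.1"]

print(reloaded_X.shape, reloaded_y.tolist())
```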