# Final Project Part 1 - Proposal 1: Credit Default Prediction

### The target variable you want to predict

The target variable that I want to predict is whether the customer will default next month.

### How predicting that target variable could help with some kind of decision

Predicting this variable will help the bank decide whether to increase or decrease a customer's credit line, either to maximize revenue or to minimize loss. My assumption is that a customer who pays consistently every month is very unlikely to default on the next payment.
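
The assumption above can be checked directly against the data. The sketch below is one possible way to do it, not a settled method: it groups customers by a repayment status column and reports how many fall in each group and what share of them defaulted. It assumes that `PAY_0` holds the most recent month's repayment status and that the target column is named `default payment next month`, as in the `head()` output; the function name `default_rate_by_status` is hypothetical.

```python
import pandas as pd

# Target column name as it appears in the head() output of the credit data.
TARGET = "default payment next month"


def default_rate_by_status(df: pd.DataFrame, status_col: str = "PAY_0") -> pd.DataFrame:
    """Count customers and compute the share that defaulted for each repayment status.

    Assumption: `status_col` is a repayment status column such as PAY_0,
    and the target is coded 1 for default and 0 otherwise.
    """
    summary = (
        df.groupby(status_col)[TARGET]
        .agg(customers="count", default_rate="mean")  # mean of a 0/1 column is a rate
        .sort_index()
    )
    return summary
```

Calling `default_rate_by_status(credit_card_default)` after loading the CSV would return one row per status value. If the assumption holds, the default rate should be lower for the status values that indicate on-time payment.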

### The features you want to use to predict that target variable

The features used to predict this variable from the data are:
- Amount of given credit 
- Gender
- Education
- Marital Status
- Age 
- History of Repayment Status from April – September 2015 (each month is a separate column)
- Amount of bill statement from April – September 2015
- Amount of previous payment from April – September 2015
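
The feature list above covers every column except the customer identifier and the target, so one simple way to build the model inputs is to drop those two columns. This is a minimal sketch under that assumption; the helper name `split_features_target` is hypothetical, and the column names `ID` and `default payment next month` are taken from the `head()` output.

```python
import pandas as pd

# Target column name as it appears in the head() output of the credit data.
TARGET = "default payment next month"


def split_features_target(df: pd.DataFrame, id_col: str = "ID", target_col: str = TARGET):
    """Split the table into a feature matrix X and a target vector y."""
    # Drop the identifier (it carries no predictive meaning) and the target itself.
    X = df.drop(columns=[id_col, target_col])
    y = df[target_col]
    return X, y
```

With the loaded data this would be called as `X, y = split_features_target(credit_card_default)`.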


### Goals and success metrics

The goal is to predict whether the customer will default in the next month (October 2015).
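
One way to make this goal measurable, sketched here as an assumption rather than a settled choice of metric, is to compare a model's accuracy against the accuracy of always predicting the most frequent class. The function name `majority_baseline_accuracy` is hypothetical.

```python
import pandas as pd


def majority_baseline_accuracy(y: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class in y."""
    # value_counts(normalize=True) gives the share of each class; the largest
    # share is the accuracy of a constant "most frequent class" prediction.
    return float(y.value_counts(normalize=True).max())
```

A model would then count as useful only if it beats `majority_baseline_accuracy(credit_card_default["default payment next month"])` on held-out data.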

### Risks or limitations

The main limitation is that the data covers only April – September 2015, which may not be sufficient to predict whether a customer is going to default. Six months of history is limiting because the entire life cycle of the card could provide better insights.

In [21]:
import pandas as pd
credit_card_default = pd.read_csv('./data/credit_card_default.csv')
credit_card_default.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [22]:
credit_card_default.shape

#22,500 rows and 25 columns

(22500, 25)

In [12]:
credit_card_default.isnull().sum()

#no missing data 

Variable    0
X1          0
X2          0
X3          0
X4          0
X5          0
X6          0
X7          0
X8          0
X9          0
X10         0
X11         0
X12         0
X13         0
X14         0
X15         0
X16         0
X17         0
X18         0
X19         0
X20         0
X21         0
X22         0
X23         0
Y           0
dtype: int64

# Final Project Part 1 - Proposal 2: Breast Cancer Severity

### The target variable you want to predict

The target variable that I want to predict is whether a patient's breast mass is malignant or benign (M = malignant, B = benign). The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei.

### How predicting that target variable could help with some kind of decision

Predicting this target variable could help doctors assess the severity of the breast cancer and, for patients with similar characteristics, judge whether the cancer is still at a treatable stage. My assumption is that the cell nucleus radius is shorter when the breast cancer is malignant than when it is benign. Adding to the complexity, the texture of the cell nuclei may also help determine the tumor’s invasiveness.
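
The radius assumption above can be tested before any modeling by comparing the average of a feature across the two diagnosis labels. The following is a minimal sketch; the function name `mean_by_diagnosis` is hypothetical, and the column names `diagnosis` and `radius_mean` are taken from the `head()` output.

```python
import pandas as pd


def mean_by_diagnosis(df: pd.DataFrame, feature: str = "radius_mean") -> pd.Series:
    """Average of one feature for each diagnosis label."""
    # One value per label found in the diagnosis column (for example "M").
    return df.groupby("diagnosis")[feature].mean()
```

Running `mean_by_diagnosis(breast_cancer)` shows whether the malignant group really has the smaller mean radius, and `mean_by_diagnosis(breast_cancer, "texture_mean")` does the same check for texture.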

### The features you want to use to predict that target variable

The features used to predict this variable from the data are: 
- Radius (mean of distances from center to points on the perimeter of the nucleus)
- Texture (standard deviation of gray-scale values)
- Perimeter
- Area
- Smoothness 
- Compactness 
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Symmetry
- Fractal Dimension
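
The data stores several columns for each characteristic listed above (suffixes `_mean`, `_se`, and `_worst`), plus an empty `Unnamed: 32` column. The sketch below shows one way to prepare the inputs, assuming the `_mean` columns are the ones intended and that the diagnosis is encoded as 1 for malignant and 0 for benign. Both choices are assumptions, and the function name `prepare_breast_cancer` is hypothetical.

```python
import pandas as pd


def prepare_breast_cancer(df: pd.DataFrame):
    """Return (X, y): the *_mean feature columns and a 0/1 diagnosis target."""
    # Drop empty columns whose names start with "Unnamed" (for example "Unnamed: 32").
    cleaned = df.loc[:, ~df.columns.str.startswith("Unnamed")]
    # Assumption: keep only the mean of each characteristic (columns ending in "_mean").
    feature_cols = [c for c in cleaned.columns if c.endswith("_mean")]
    X = cleaned[feature_cols]
    # Assumption: encode malignant (M) as 1 and every other label as 0.
    y = (cleaned["diagnosis"] == "M").astype(int)
    return X, y
```

With the loaded data this would be called as `X, y = prepare_breast_cancer(breast_cancer)`.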


### Goals and success metrics

The goal is to identify whether a breast cancer diagnosis is malignant or benign.

### Risks or limitations

One limitation of using this dataset is that the data was collected in 1995 and may be outdated. However, the way breast cancer severity is detected has not changed significantly in 24 years, so the data should still be sufficient to predict the severity of the breast cancer.

In [13]:
import pandas as pd
breast_cancer = pd.read_csv('./data/breast_cancer.csv')
breast_cancer.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [14]:
breast_cancer.shape

#569 rows and 33 columns

(569, 33)

In [15]:
breast_cancer.isnull().sum()

#no missing data except unnamed columns which will be dropped

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed: