
Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is a reasonable metric. Outside that range, accuracy can be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [1]:
from google.colab import files
uploaded = files.upload()


Saving credit-card-approval-prediction.zip to credit-card-approval-prediction.zip


In [2]:
!unzip /content/credit-card-approval-prediction.zip

Archive:  /content/credit-card-approval-prediction.zip
  inflating: application_record.csv  
  inflating: credit_record.csv       


In [4]:
import pandas as pd
import numpy as np
application_df = pd.read_csv('application_record.csv')
credit_df = pd.read_csv('credit_record.csv')
application_df.shape, credit_df.shape

((438557, 18), (1048575, 3))

In [5]:
credit_df.head()
# STATUS codes:
# 0: 1-29 days past due
# 1: 30-59 days past due
# 2: 60-89 days overdue
# 3: 90-119 days overdue
# 4: 120-149 days overdue
# 5: Overdue or bad debts, write-offs for more than 150 days
# C: paid off that month
# X: No loan for the month

Unnamed: 0,ID,MONTHS_BALANCE,STATUS
0,5001711,0,X
1,5001711,-1,0
2,5001711,-2,0
3,5001711,-3,0
4,5001712,0,C


In [6]:
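# ID is the only column the two tables share, so make the join key explicit.
# how='inner' (the default) keeps only IDs present in both tables.
credit_approval_df = pd.merge(credit_df, application_df, on='ID', how='inner')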
credit_approval_df.shape

(777715, 20)
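The inner join above silently drops any ID that appears in only one of the two files. The following is a minimal sketch, on made-up toy data rather than the real CSVs, of how `indicator=True` and `validate=` in `pd.merge` could be used to audit such a join. Whether the `many_to_one` relationship actually holds for the real application file is an assumption that would need checking.

```python
import pandas as pd

# Toy stand-ins for credit_record.csv and application_record.csv (made-up values).
credit = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "MONTHS_BALANCE": [0, -1, 0, 0],
    "STATUS": ["C", "0", "X", "1"],
})
applications = pd.DataFrame({
    "ID": [1, 2, 4],
    "AMT_INCOME_TOTAL": [427500.0, 112500.0, 270000.0],
})

# An outer join with indicator=True shows which IDs appear in only one table.
# validate="many_to_one" raises MergeError if an ID repeats in `applications`.
audit = pd.merge(credit, applications, on="ID", how="outer",
                 indicator=True, validate="many_to_one")
print(audit["_merge"].value_counts())

# The inner join keeps only IDs present in both tables.
merged = pd.merge(credit, applications, on="ID", how="inner",
                  validate="many_to_one")
print(merged.shape)
```

On the toy data, the audit frame marks ID 3 as `left_only` and ID 4 as `right_only`; those are the rows an inner join discards.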

In [0]:
credit_approval_df = credit_approval_df.drop_duplicates()

In [8]:
credit_approval_df.head()

Unnamed: 0,ID,MONTHS_BALANCE,STATUS,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,0,C,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008804,-1,C,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008804,-2,C,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
3,5008804,-3,C,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
4,5008804,-4,C,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0


In [9]:
credit_approval_df.isnull().sum()

ID                          0
MONTHS_BALANCE              0
STATUS                      0
CODE_GENDER                 0
FLAG_OWN_CAR                0
FLAG_OWN_REALTY             0
CNT_CHILDREN                0
AMT_INCOME_TOTAL            0
NAME_INCOME_TYPE            0
NAME_EDUCATION_TYPE         0
NAME_FAMILY_STATUS          0
NAME_HOUSING_TYPE           0
DAYS_BIRTH                  0
DAYS_EMPLOYED               0
FLAG_MOBIL                  0
FLAG_WORK_PHONE             0
FLAG_PHONE                  0
FLAG_EMAIL                  0
OCCUPATION_TYPE        240048
CNT_FAM_MEMBERS             0
dtype: int64

In [10]:
credit_approval_df['STATUS'].value_counts()

C    329536
0    290654
X    145950
1      8747
5      1527
2       801
3       286
4       214
Name: STATUS, dtype: int64

In [0]:
# STATUS codes:
# 0: 1-29 days past due
# 1: 30-59 days past due
# 2: 60-89 days overdue
# 3: 90-119 days overdue
# 4: 120-149 days overdue
# 5: Overdue or bad debts, write-offs for more than 150 days
# C: paid off that month
# X: No loan for the month

In [12]:
credit_approval_df['STATUS'].describe()

count     777715
unique         8
top            C
freq      329536
Name: STATUS, dtype: object

In [13]:
credit_approval_df['STATUS'].value_counts(normalize=True)

C    0.423723
0    0.373728
X    0.187665
1    0.011247
5    0.001963
2    0.001030
3    0.000368
4    0.000275
Name: STATUS, dtype: float64
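The assignment's rule of thumb says accuracy alone is reasonable only when the majority class frequency is at least 50% and below 70%. As a rough sketch, the check can be applied to the class counts printed above (the counts are copied from that output; the variable names are illustrative only):

```python
# Class counts copied from the value_counts() output above.
status_counts = {"C": 329536, "0": 290654, "X": 145950, "1": 8747,
                 "5": 1527, "2": 801, "3": 286, "4": 214}

total = sum(status_counts.values())
majority_class, majority_count = max(status_counts.items(), key=lambda kv: kv[1])
majority_freq = majority_count / total

# Rule of thumb from the assignment: accuracy alone is acceptable only when
# the majority class frequency is >= 50% and < 70%.
accuracy_alone_ok = 0.50 <= majority_freq < 0.70

print(f"Majority class: {majority_class} ({majority_freq:.4f})")
print(f"Accuracy alone is reasonable: {accuracy_alone_ok}")
```

Because the majority class `C` covers only about 42% of rows, the check comes out false, which suggests reporting at least one metric besides accuracy. The same 42% is also the accuracy of a baseline that always predicts `C`.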

In [14]:
cardinality = credit_approval_df.select_dtypes(exclude='number').nunique()
high_cardinality_feat = cardinality[cardinality > 30].index.tolist()
high_cardinality_feat

[]

In [15]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(credit_approval_df, train_size=0.80, test_size=0.20, 
                              stratify=credit_approval_df['STATUS'], random_state=42)

train.shape, test.shape

((622172, 20), (155543, 20))

In [16]:
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['STATUS'], random_state=42)

train.shape, val.shape, test.shape

((497737, 20), (124435, 20), (155543, 20))
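Each applicant `ID` appears on many rows (one per month), so a random row-level split like the one above can place rows from the same applicant in both train and validation. The sketch below shows one alternative, not the approach used in this notebook: splitting by applicant with scikit-learn's `GroupShuffleSplit`, on made-up toy data. It assumes that keeping each `ID` in a single split is what you want, and it gives up the stratification on `STATUS`.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: several monthly rows per applicant ID (made-up values).
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1, 0, -1, -2, 0, -1, 0, -1],
    "STATUS": ["C", "C", "0", "X", "X", "0", "1", "0", "C", "C", "0", "0"],
})

# Hold out roughly 20% of applicants (not rows) for validation.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(splitter.split(df, groups=df["ID"]))
train_part, val_part = df.iloc[train_idx], df.iloc[val_idx]

# No applicant ID appears on both sides of the split.
shared_ids = set(train_part["ID"]) & set(val_part["ID"])
print(len(train_part), len(val_part), shared_ids)
```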

In [0]:
def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    # Work on a copy to prevent SettingWithCopyWarning
    X = X.copy()
    # drop_duplicates() returns a new DataFrame, so the result must be assigned
    X = X.drop_duplicates()
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

In [18]:
credit_approval_df.head(1)

Unnamed: 0,ID,MONTHS_BALANCE,STATUS,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,0,C,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0


In [19]:
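# Note: replace() returns a new Series. Without assignment, credit_approval_df
# is unchanged (the dtypes output below still lists both columns as object),
# and train/val/test were already split off, so they are unaffected as well.
credit_approval_df['FLAG_OWN_CAR'].replace({'N': 0, 'Y': 1})
credit_approval_df['FLAG_OWN_REALTY'].replace({'N': 0, 'Y': 1})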

0         1
1         1
2         1
3         1
4         1
         ..
777710    0
777711    0
777712    0
777713    0
777714    0
Name: FLAG_OWN_REALTY, Length: 777715, dtype: int64
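The output above is only the returned Series from the last expression; nothing is stored. A minimal sketch, on a made-up toy frame, of one way to persist the Y/N encoding by assigning the result back. It uses `Series.map` rather than `replace`, which is a substitution on my part and not what the notebook does:

```python
import pandas as pd

# Toy frame with the same Y/N flag columns as the notebook (made-up rows).
df = pd.DataFrame({
    "FLAG_OWN_CAR": ["Y", "N", "Y"],
    "FLAG_OWN_REALTY": ["Y", "Y", "N"],
})

flag_columns = ["FLAG_OWN_CAR", "FLAG_OWN_REALTY"]
for column in flag_columns:
    # map() returns a new Series; assigning it back persists the change.
    df[column] = df[column].map({"N": 0, "Y": 1})

print(df.dtypes)
```

If the encoding is needed for modeling, placing a loop like this inside `wrangle` would apply it to train, validation, and test alike.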

In [20]:
credit_approval_df.dtypes

ID                       int64
MONTHS_BALANCE           int64
STATUS                  object
CODE_GENDER             object
FLAG_OWN_CAR            object
FLAG_OWN_REALTY         object
CNT_CHILDREN             int64
AMT_INCOME_TOTAL       float64
NAME_INCOME_TYPE        object
NAME_EDUCATION_TYPE     object
NAME_FAMILY_STATUS      object
NAME_HOUSING_TYPE       object
DAYS_BIRTH               int64
DAYS_EMPLOYED            int64
FLAG_MOBIL               int64
FLAG_WORK_PHONE          int64
FLAG_PHONE               int64
FLAG_EMAIL               int64
OCCUPATION_TYPE         object
CNT_FAM_MEMBERS        float64
dtype: object

In [21]:
!pip install category_encoders==2.*
!pip install pandas-profiling==2.*

Collecting category_encoders==2.*
  Downloading https://files.pythonhosted.org/packages/a0/52/c54191ad3782de633ea3d6ee3bb2837bda0cf3bc97644bb6375cf14150a0/category_encoders-2.1.0-py2.py3-none-any.whl (100kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.1.0
Collecting panda

In [23]:
%%time
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

CPU times: user 18 µs, sys: 2 µs, total: 20 µs
Wall time: 22.4 µs


In [25]:
target = 'STATUS'
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
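# Drop the target from the test features too, and keep the labels for final scoring
X_test = test.drop(columns=target)
y_test = test[target]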

pipeline = make_pipeline(
    ce.ordinal.OrdinalEncoder(),
    SimpleImputer(),
    DecisionTreeClassifier(max_depth=65)
)

pipeline.fit(X_train, y_train)
print(f'Validation accuracy: {pipeline.score(X_val, y_val)}')

Validation accuracy: 0.8480491823040142


In [26]:
pipeline = make_pipeline(
    ce.ordinal.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
)

pipeline.fit(X_train, y_train)
print(f'Validation accuracy: {pipeline.score(X_val, y_val)}')

Validation accuracy: 0.8752119580503878
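With a majority class of roughly 42% and several classes below 1%, a single accuracy figure says little about the rare overdue statuses. The following sketch, on made-up labels rather than this notebook's predictions, compares accuracy with two scikit-learn metrics that weight every class equally: `balanced_accuracy_score` and macro-averaged `f1_score`.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up labels: a model that always predicts the common class "C".
y_true = ["C", "C", "C", "C", "C", "C", "C", "C", "1", "5"]
y_pred = ["C"] * len(y_true)

accuracy = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.3f} balanced={balanced:.3f} macro_f1={macro_f1:.3f}")
```

In this toy case accuracy is 0.8 even though the two rare classes are never predicted, while balanced accuracy and macro F1 both fall to about 0.3. The same calls could be applied to `y_val` and `pipeline.predict(X_val)`.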


In [0]:
# import graphviz
# from sklearn.tree import export_graphviz

# tree = pipeline.named_steps['decisiontreeclassifier']

# dot_data = export_graphviz(
#     tree, 
#     out_file=None, 
#     feature_names=X_train.columns, 
#     class_names=y_train.unique().astype(str), 
#     filled=True, 
#     impurity=False,
#     proportion=True
# )

# graphviz.Source(dot_data)