#HW2 (584) Credit Risk Prediction
""" 
The objectives of this assignment are the following:

Experiment with various classification models.
Think about dealing with data with different attribute types: categorical and numerical (ratio).
Think about dealing with potentially sensitive or protected attributes such as gender, race, and age.
Think about dealing with imbalanced data, i.e., class labels with a skewed distribution.
Use the F1 score as the evaluation metric.

AIM: Develop predictive models that can determine someone's credit risk (0 = high risk, 1 = low risk).
--

The goal of this competition is to develop predictive models that determine whether a given individual's credit risk is high (denoted by 0) or low (denoted by 1). As such, the goal is to develop the best binary classification model.

Since the dataset is imbalanced, the scoring function will be the F1-score instead of accuracy.

Caveats:

+ Remember that not all features will be good for predicting credit risk. Think of feature selection, engineering, and reduction.

+ The dataset has an imbalanced distribution: within the training set there are 24720 records labeled 0 and 7841 records labeled 1. No information is provided about the distribution in the test set.

+ Use the data mining knowledge you have gained so far wisely to optimize your results.
--------------------------------

Data Description:

The training dataset consists of 32561 records and the test dataset consists of 13305 records. We provide you with the training class labels and the test labels are held out.

In the training file there are 13 attributes, with the 13th attribute (or column) being the class label of interest. In the testing file there are 12 attributes.

Column descriptions for train.csv (feature - description - column name used in the code):
id - unique identifier - UID
F1 - Continuous value describing number of years since last degree was completed - YEARS_TO_LAST_DEGREE
F2 - Continuous value indicating hours worked per week - WORK_HOURS_PER_WEEK
F3 - Categorical value - CAT_VALUE
F4 - Categorical value indicating type of occupation - CAT_VALUE_OCCUPATION
F5 - Continuous value denoting gains - GAINS
F6 - Continuous value denoting loss - LOSS
F7 - Categorical value denoting marital status - MARITAL_STATUS
F8 - Categorical value denoting type of employment (e.g., Self) - TYPE_EMPLOYMENT
F9 - Categorical value denoting education type - TYPE_EDUCATION
F10 - Categorical value denoting race - TYPE_RACE
F11 - Categorical, Female/Male - TYPE_GENDER
credit - 0: Bad, 1: Good - CREDIT_RISK
"""

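To see why accuracy is misleading at the stated class counts (24720 zeros, 7841 ones), the sketch below scores two trivial predictors by hand. The helpers `accuracy` and `f1_for_positive` are illustrative and not part of the assignment; treating label 1 as the positive class is an assumption (it matches the default of scikit-learn's `f1_score`).

```python
# Class counts taken from the assignment text.
N_ZEROS, N_ONES = 24720, 7841
y_true = [0] * N_ZEROS + [1] * N_ONES


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_positive(y_true, y_pred, positive=1):
    """F1 score of the `positive` class (assumed here to be label 1)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: F1 is reported as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two trivial predictors: always the majority class, always the minority class.
always_zero = [0] * len(y_true)
always_one = [1] * len(y_true)

acc_zero = accuracy(y_true, always_zero)
f1_zero = f1_for_positive(y_true, always_zero)
acc_one = accuracy(y_true, always_one)
f1_one = f1_for_positive(y_true, always_one)

print("always 0: accuracy=%.4f  F1=%.4f" % (acc_zero, f1_zero))
print("always 1: accuracy=%.4f  F1=%.4f" % (acc_one, f1_one))
```

Under these assumptions, always predicting 0 reaches roughly 76% accuracy while its F1 score is 0, which is why accuracy alone says little about a model on this data.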
In [29]:
# Import all the Libraries
import re
import timeit
import unicodedata
import matplotlib.pylab as plt
import nltk
import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTENC

In [19]:
# Column names for the record id and features F1-F11, shared by both readers
FEATURE_NAMES = ['UID', 'YEARS_TO_LAST_DEGREE', 'WORK_HOURS_PER_WEEK', 'CAT_VALUE',
                 'CAT_VALUE_OCCUPATION', 'GAINS', 'LOSS', 'MARITAL_STATUS',
                 'TYPE_EMPLOYMENT', 'TYPE_EDUCATION', 'TYPE_RACE', 'TYPE_GENDER']


# Read the training file; the last column is the class label
def readTrainfile(filepath):
    # header=0 replaces the file's header row with the names above
    # (assumes a single header row, as readtestfile does)
    read_data = pd.read_csv(filepath, names=FEATURE_NAMES + ['CREDIT_RISK'], sep=',', header=0)
    return read_data


# Read the test file, which has no class label column
def readtestfile(filepath):
    read_data = pd.read_csv(filepath, names=FEATURE_NAMES, sep=',', header=0)
    return read_data


# Save the predictions to the output file
def saveOutput(filePath, data):
    # writing to .txt
    np.savetxt(filePath, data, fmt='%s')
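
When `names` is passed to `pd.read_csv`, the `header` argument decides which line of the file is discarded as the header row. The sketch below shows the difference between `header=0` and `header=1` on a made-up two-record file held in memory; the sample content and the variable names are assumptions for illustration only.

```python
import io

import pandas as pd

# A made-up file with one header row and two records, standing in for the real CSV.
sample_csv = "id,F1\n0,7\n1,12\n"
names = ["UID", "YEARS_TO_LAST_DEGREE"]

# header=0: the header row is replaced by `names`; every record is kept.
df_header0 = pd.read_csv(io.StringIO(sample_csv), names=names, header=0)

# header=1: line 0 is skipped and line 1 (the first record) is consumed as the header.
df_header1 = pd.read_csv(io.StringIO(sample_csv), names=names, header=1)

print(len(df_header0), len(df_header1))
```

On this sample the first call keeps both records and the second keeps only one, so comparing `len(trainingData)` with the 32561 records stated in the data description is a quick sanity check after loading.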

In [20]:
trainingData=readTrainfile("credit_train.csv")
testingData=readtestfile("credit_test.csv")

In [21]:
# Drop the null columns where all values are null
trainingData = trainingData.dropna(axis='columns', how='all')
#testingData = testingData.dropna(axis='columns', how='all')

# Drop the null rows
trainingData = trainingData.dropna()
#testingData = testingData.dropna()


In [22]:
testingData.describe  # without parentheses this shows the bound method and the frame's repr, not summary statistics

<bound method NDFrame.describe of          UID  YEARS_TO_LAST_DEGREE  WORK_HOURS_PER_WEEK  CAT_VALUE  \
0          0                     7                   40          3   
1          1                    12                   40          0   
2          2                    10                   40          0   
3          3                    10                   30          3   
4          4                     6                   30          1   
...      ...                   ...                  ...        ...   
13300  13300                    13                   40          3   
13301  13301                    13                   36          1   
13302  13302                     9                   40          2   
13303  13303                    13                   50          0   
13304  13304                    13                   40          3   

       CAT_VALUE_OCCUPATION  GAINS  LOSS  MARITAL_STATUS  TYPE_EMPLOYMENT  \
0                         7      0     0        

"""
Categorical - processed with one-hot encoding:
F3 - Categorical value - CAT_VALUE
F4 - Categorical value indicating type of occupation - CAT_VALUE_OCCUPATION
F7 - Categorical value denoting marital status - MARITAL_STATUS
F8 - Categorical value denoting type of employment (e.g., Self) - TYPE_EMPLOYMENT
F9 - Categorical value denoting education type - TYPE_EDUCATION
F10 - Categorical value denoting race - TYPE_RACE

Categorical - processed as boolean:
F11 - Categorical, Female/Male - TYPE_GENDER

Continuous:
F1 - Continuous value describing number of years since last degree was completed - YEARS_TO_LAST_DEGREE
F2 - Continuous value indicating hours worked per week - WORK_HOURS_PER_WEEK
F5 - Continuous value denoting gains - GAINS
F6 - Continuous value denoting loss - LOSS

Class:
credit - 0: Bad, 1: Good - CREDIT_RISK
"""

In [23]:
# One-hot encode column_Name; optionally drop the original column
def ONE_HOT_ENCODING(dataset, column_Name, prefix, remove_Original):
    new_columns = pd.get_dummies(dataset[column_Name], prefix=prefix)
    if remove_Original:
        dataset = dataset.drop(columns=[column_Name])
    # Append the new indicator columns at the end of the frame
    dataset = pd.concat([dataset, new_columns], axis=1)
    return dataset

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    # Drop the null rows
    dataset = dataset.dropna()
    dataset = dataset.reset_index(drop=True)
    X, y = dataset.drop(['CREDIT_RISK'],axis=1), dataset['CREDIT_RISK']
    skf = StratifiedKFold(n_splits=n_folds)
    #print(X,y)
    dataset_folds = list()
    for training_set_idx, test_set_idx in skf.split(X, y):
        fold = {"training_set_X": X.iloc[training_set_idx], "test_set_X": X.iloc[test_set_idx],
                "training_set_Y": y.iloc[training_set_idx], "test_set_Y": y.iloc[test_set_idx],
                "test_idx": test_set_idx}
        dataset_folds.append(fold)
    return dataset_folds

# One-hot encode the categorical columns, convert gender to binary, and normalize the continuous columns
def clean_data(processed_data):
    # CONVERTING CATEGORICAL COLUMNS VIA ONE-HOT ENCODING
    CATEGORICAL_COLUMNS = ["CAT_VALUE", "CAT_VALUE_OCCUPATION", "MARITAL_STATUS", "TYPE_EMPLOYMENT", "TYPE_EDUCATION", "TYPE_RACE"]
    for col in CATEGORICAL_COLUMNS:
        processed_data = ONE_HOT_ENCODING(processed_data, col, col, True)
    # CONVERTING GENDER TO BINARY (the keys carry the leading space found in the raw values)
    IS_FEMALE = {' Male': 0, ' Female': 1}
    processed_data.TYPE_GENDER = [IS_FEMALE[item] for item in processed_data.TYPE_GENDER]
    processed_data = processed_data.rename(columns={'TYPE_GENDER': 'IS_FEMALE'})
    # DROPPING UID COLUMN
    processed_data = processed_data.drop(columns=['UID'])
    # NORMALIZING ALL THE CONTINUOUS COLUMNS
    cols_to_norm = ['YEARS_TO_LAST_DEGREE', 'WORK_HOURS_PER_WEEK', 'GAINS', 'LOSS']
    processed_data[cols_to_norm] = RobustScaler().fit_transform(processed_data[cols_to_norm])
    return processed_data
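
Because `clean_data` encodes and scales whichever frame it receives, the one-hot columns and the scaling statistics can differ between the training and test data when it is applied to each separately. One way to keep them consistent, sketched below on made-up frames, is to align the test columns to the training columns with `DataFrame.reindex` and to fit `RobustScaler` on the training data only. The helper name `align_and_scale` and the toy values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler


def align_and_scale(train, test, continuous_cols):
    """Give `test` the columns of `train` and scale both with training statistics."""
    train = train.copy()
    # Categories missing from test become all-zero columns; test-only categories are dropped.
    test = test.reindex(columns=train.columns, fill_value=0)
    scaler = RobustScaler().fit(train[continuous_cols])
    train[continuous_cols] = scaler.transform(train[continuous_cols])
    test[continuous_cols] = scaler.transform(test[continuous_cols])
    return train, test


# Toy frames: category "c" of TYPE_RACE appears in train but never in test.
toy_train = pd.get_dummies(
    pd.DataFrame({"GAINS": [0.0, 10.0, 20.0, 30.0], "TYPE_RACE": ["a", "b", "c", "a"]}),
    columns=["TYPE_RACE"],
)
toy_test = pd.get_dummies(
    pd.DataFrame({"GAINS": [10.0, 40.0], "TYPE_RACE": ["a", "b"]}),
    columns=["TYPE_RACE"],
)

scaled_train, scaled_test = align_and_scale(toy_train, toy_test, ["GAINS"])
print(list(scaled_test.columns))
```

In this notebook the label column `CREDIT_RISK` would have to be separated from the training frame before aligning, since the test frame has no such column.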

In [24]:
training_processing_data=trainingData.copy()
# Drop the null columns where all values are null
training_processing_data = training_processing_data.dropna(axis='columns', how='all')
training_processing_data=clean_data(training_processing_data)
test_processing_data=testingData.copy()
test_processing_data=clean_data(test_processing_data)
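# Positional indices of the one-hot encoded columns, passed to SMOTENC as categorical_features
# (hard-coded for the column layout produced by clean_data)
index_catagorical_columns = list(range(6, 63))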
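
The hard-coded `range(6, 63)` depends on the exact column order and count after encoding. A less brittle option, sketched here on a made-up frame, is to derive the positions from the column names. The helper `categorical_positions` is hypothetical, and treating every column outside the four continuous ones as categorical (which includes `IS_FEMALE`) is an assumption that may not match the original range.

```python
import pandas as pd

CONTINUOUS_COLS = ["YEARS_TO_LAST_DEGREE", "WORK_HOURS_PER_WEEK", "GAINS", "LOSS"]


def categorical_positions(features, continuous_cols=CONTINUOUS_COLS):
    """Return the positional indices of all columns that are not continuous."""
    return [i for i, col in enumerate(features.columns) if col not in continuous_cols]


# Toy feature frame laid out like the output of clean_data, without the label column.
toy_features = pd.DataFrame(
    {
        "YEARS_TO_LAST_DEGREE": [7, 12],
        "WORK_HOURS_PER_WEEK": [40, 40],
        "GAINS": [0, 0],
        "LOSS": [0, 0],
        "IS_FEMALE": [0, 1],
        "TYPE_RACE_a": [1, 0],
        "TYPE_RACE_b": [0, 1],
    }
)

positions = categorical_positions(toy_features)
print(positions)
```

The indices need to be computed on the feature matrix that is actually passed to `fit_resample`, that is, after `CREDIT_RISK` has been dropped.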

In [33]:
# Bundle the full training set and the test set in the same dict layout as a cross-validation fold
def getTestingData(trainingData, testData):
    TrainX, Trainy = trainingData.drop(['CREDIT_RISK'], axis=1), trainingData['CREDIT_RISK']
    fold = {"training_set_X": TrainX, "test_set_X": testData,
            "training_set_Y": Trainy}
    return fold


# Oversample the training data with SMOTENC, fit the classifier, and predict the test set
def Classification_Algo_Test(algo, Dataset, name):
    X_test = Dataset["test_set_X"]
    X_train = Dataset["training_set_X"]
    y_train = Dataset["training_set_Y"]
    predictedOutput = dict()
    # Balance the classes; the listed column positions are treated as categorical
    sm = SMOTENC(random_state=42, categorical_features=index_catagorical_columns)
    X_train, y_train = sm.fit_resample(X_train, y_train)
    algo.fit(X_train, y_train)
    y_predicted = algo.predict(X_test)
    print(y_predicted)
    print("%s:" % name)
    predictedOutput[name] = y_predicted
    return predictedOutput

#LRegressionD=Classification_Algo_Test(LogisticRegression(),getTestingData(training_processing_data,test_processing_data),"LogisticRegression")
#saveOutput("LRegressionDSMOTEENN.txt", LRegressionD['LogisticRegression'])

#GaussianNBD=Classification_Algo_Test(GaussianNB(),getTestingData(training_processing_data,test_processing_data),"GaussianNB")
#saveOutput("GaussianNBSMOTEENN.txt", GaussianNBD['GaussianNB'])

#LinearSVCD=Classification_Algo_Test(LinearSVC(random_state=0),getTestingData(training_processing_data,test_processing_data),"LinearSVC")
#saveOutput("LinearSVCSMOTEENN.txt", LinearSVCD['LinearSVC'])

#RandomForestClassifierD=Classification_Algo_Test(RandomForestClassifier(),getTestingData(training_processing_data,test_processing_data),"RandomForestClassifier")
#saveOutput("RandomForestClassifierSMOTEENN.txt", RandomForestClassifierD['RandomForestClassifier'])

GradientBoostingClassifierD = Classification_Algo_Test(
    GradientBoostingClassifier(n_estimators=50, max_depth=10, max_features=None),
    getTestingData(training_processing_data, test_processing_data),
    "GradientBoostingClassifier")
saveOutput("GradientBoostingClassifier10MaxNone4.txt", GradientBoostingClassifierD['GradientBoostingClassifier'])


[0 1 1 ... 0 1 0]
GradientBoostingClassifier:
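
Since the test labels are held out, a local F1 estimate has to come from the training data. The sketch below runs stratified k-fold cross-validation and resamples only the training part of each fold, so the validation part keeps its original class balance. It uses synthetic data from `make_classification` and a plain random oversampler as a dependency-light stand-in for SMOTENC; `random_oversample` and `cross_validated_f1` are hypothetical names, and the hyperparameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def random_oversample(X, y, seed=42):
    """Duplicate minority-class rows at random until both classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


def cross_validated_f1(make_model, X, y, resample, n_folds=5):
    """Return per-fold F1 scores; `resample` is applied to the training folds only."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = []
    for train_idx, valid_idx in skf.split(X, y):
        X_res, y_res = resample(X[train_idx], y[train_idx])
        model = make_model()  # fresh, unfitted model for every fold
        model.fit(X_res, y_res)
        scores.append(f1_score(y[valid_idx], model.predict(X[valid_idx])))
    return scores


# Synthetic, roughly 3:1 imbalanced data standing in for the credit data.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.76, 0.24], random_state=0)
scores = cross_validated_f1(
    lambda: GradientBoostingClassifier(n_estimators=50, random_state=0),
    X, y, random_oversample,
)
print("mean F1 over folds: %.3f" % np.mean(scores))
```

With imbalanced-learn installed, the bound method `SMOTENC(...).fit_resample` could be passed as `resample` in the same way, since it also takes `(X, y)` and returns the resampled pair.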
