# Random Forest Problem 1

Use a random forest ensemble to build a model on the fraud data,
treating those with taxable_income <= 30000 as "Risky" and all others as "Good".

### Data Description:
- Undergrad : whether the person has an undergraduate degree
- Marital.Status : marital status of the person
- Taxable.Income : the person's taxable income, i.e. the income on which tax is owed to the government
- City.Population : population of the person's city
- Work.Experience : work experience of the person
- Urban : whether the person lives in an urban area

## Steps:

1. Import new data set
    - understand the dataset, look into it. 
    - add the new column fraud
    - perform EDA.
    - check data info and null values.
2. Visualisation EDA
    - make pairplot graphs to better understand the data.
3. Feature engineering
    - understand all features involved.
    - list the features that need to be considered in the model.
    - get dummies if required
    - Train | test splitting
4. Random Forest Classifier
    - Default Parameters
    - Evaluation
5. Visualisation of Trees estimators
    - Single tree estimator
    - Multiple tree estimators
6. Experimenting with hyperparameters - max features and number of trees
7. Conclusion



In [5]:
#load the libraries
import pandas as pd
import numpy as np
import pandas_profiling as pp
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

## Import New Dataset

In [6]:
raw_data = pd.read_csv("C://Users//IRFAN//Documents//ExcelR//Assignment//15//15Q1//Fraud_check.csv")
# Rename the columns to short snake_case names
raw_data.rename(columns={
    'Undergrad': 'undergrad',
    'Marital.Status': 'marital_status',
    'Taxable.Income': 'tax_income',
    'City.Population': 'city_pop',
    'Work.Experience': 'work_exp',
    'Urban': 'urban',
}, inplace=True)
df = raw_data.copy()
df.head()

|   | undergrad | marital_status | tax_income | city_pop | work_exp | urban |
|---|---|---|---|---|---|---|
| 0 | NO | Single | 68833 | 50047 | 10 | YES |
| 1 | YES | Divorced | 33700 | 134075 | 18 | YES |
| 2 | NO | Married | 36925 | 160205 | 30 | YES |
| 3 | YES | Single | 50190 | 193264 | 15 | YES |
| 4 | NO | Married | 81002 | 27533 | 28 | NO |


In [None]:
# Label a taxable income as 'Risky' (<= 30000) or 'Good' (otherwise)

def filt(x):
    if x<=30000:
        return 'Risky'
    else:
        return 'Good'

In [None]:
# Derive the target column directly from tax_income
df['fraud'] = df['tax_income'].apply(filt)
df.head()
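The same labelling rule can be written without a Python-level function. The following is a minimal sketch using `np.where` on a toy frame; the frame name `toy` and two of the income values are made up for illustration, and only the 30000 threshold comes from the problem statement.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; it is not the Fraud_check data.
toy = pd.DataFrame({"tax_income": [68833, 30000, 29999, 50190]})

# Same rule as filt(): income <= 30000 is "Risky", anything else is "Good".
toy["fraud"] = np.where(toy["tax_income"] <= 30000, "Risky", "Good")
print(toy)
```

Note that the boundary value 30000 itself is labelled "Risky", matching the `<=` in `filt`.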

In [None]:
df.fraud.value_counts()

In [None]:
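# LabelEncoder assigns integer codes in sorted order of the labels,
# so for fraud: Good -> 0, Risky -> 1.
# The same encoder object is refitted for each column, so afterwards
# it only remembers the mapping of the last column (fraud).
label_encoder = preprocessing.LabelEncoder()
df['undergrad'] = label_encoder.fit_transform(df['undergrad'])
df['urban'] = label_encoder.fit_transform(df['urban'])
df['marital_status'] = label_encoder.fit_transform(df['marital_status'])
df['fraud'] = label_encoder.fit_transform(df['fraud'])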

df.head()
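To read the encoded columns correctly it helps to know which integer stands for which label. A small sketch of how the mapping can be recovered from a fitted `LabelEncoder` follows; the names `enc`, `codes` and `mapping` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["Good", "Risky", "Good", "Risky"])

# classes_ is sorted, and a label's position in it is its integer code.
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)  # {'Good': 0, 'Risky': 1}
```

Keeping one encoder per column (rather than refitting a single one) would make it possible to call `inverse_transform` on any column later.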

In [None]:
df.info() # No null values

In [None]:
df.describe()

## Visualization

In [None]:
sns.pairplot(df,hue='fraud',palette='Dark2')

In [None]:
sns.catplot(x='fraud',y='tax_income',data=df,kind='box',palette='Dark2')

## Feature Engineering

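All columns are kept as features for model training, including tax_income, the column from which the fraud label was derived. Because fraud is a direct function of tax_income, the model can recover the label almost exactly from that one feature.
The categorical columns were label-encoded above, so no dummy variables are created.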

### Train and Test Split data

In [None]:
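# Features: every column except the last one (fraud); this includes tax_income
X = df.iloc[:, :-1]
# Target: the encoded fraud label
y = df.fraud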

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=50)

## Random Forest Classification
### Default Parameters

In [None]:
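# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3.
# Note: n_estimators=10 is lower than the library default of 100.
model = RandomForestClassifier(n_estimators=10, max_features='sqrt', random_state=101)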

In [None]:
model.fit(X_train,y_train)

In [None]:
base_pred = model.predict(X_test)

### Evaluation

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

In [None]:
confusion_matrix(y_test, base_pred) # Very good results, but the model may be overfitted

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

In [None]:
print(classification_report(y_test,base_pred)) #Near perfect values
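One way to probe the overfitting concern is k-fold cross-validation; `KFold` and `cross_val_score` are imported at the top of the notebook but never used. The sketch below runs on a synthetic stand-in frame (`demo`, with randomly generated values and assumed value ranges), not on the Fraud_check data, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the encoded frame; every value is randomly generated.
rng = np.random.default_rng(0)
n = 600
demo = pd.DataFrame({
    "undergrad": rng.integers(0, 2, n),
    "marital_status": rng.integers(0, 3, n),
    "tax_income": rng.integers(10_000, 100_000, n),
    "city_pop": rng.integers(25_000, 200_000, n),
    "work_exp": rng.integers(0, 31, n),
    "urban": rng.integers(0, 2, n),
})
demo["fraud"] = (demo["tax_income"] <= 30_000).astype(int)  # 1 = Risky

X_demo = demo.drop(columns="fraud")
y_demo = demo["fraud"]

# Five shuffled folds; each fold is held out once while the rest trains the forest.
cv = KFold(n_splits=5, shuffle=True, random_state=101)
clf = RandomForestClassifier(n_estimators=10, random_state=101)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

A small spread across folds suggests the single train/test split was not a lucky one. With the real data, `X` and `y` from the split section would take the place of `X_demo` and `y_demo`.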

In [None]:
model.feature_importances_

In [None]:
pd.DataFrame(index=X.columns, data=model.feature_importances_, columns=['Feature Importance'])
# tax_income dominates because the fraud label is derived directly from it.
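Since the label is computed from tax_income, a natural check is how the forest behaves when that column is withheld. The sketch below compares hold-out accuracy with and without it on synthetic data; the frame `demo` and the helper `holdout_accuracy` are hypothetical, and the other features are pure noise by construction, so the gap shown here is not a result about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded frame; every value is randomly generated.
rng = np.random.default_rng(1)
n = 600
demo = pd.DataFrame({
    "undergrad": rng.integers(0, 2, n),
    "marital_status": rng.integers(0, 3, n),
    "tax_income": rng.integers(10_000, 100_000, n),
    "city_pop": rng.integers(25_000, 200_000, n),
    "work_exp": rng.integers(0, 31, n),
    "urban": rng.integers(0, 2, n),
})
demo["fraud"] = (demo["tax_income"] <= 30_000).astype(int)  # 1 = Risky


def holdout_accuracy(feature_cols):
    """Fit a small forest on the given columns and return test-set accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        demo[feature_cols], demo["fraud"], test_size=0.3, random_state=50
    )
    clf = RandomForestClassifier(n_estimators=10, random_state=101)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


all_cols = [c for c in demo.columns if c != "fraud"]
no_income = [c for c in all_cols if c != "tax_income"]

acc_with = holdout_accuracy(all_cols)
acc_without = holdout_accuracy(no_income)
print(f"with tax_income: {acc_with:.3f}  without: {acc_without:.3f}")
```

On the real data, the accuracy without tax_income would indicate how much the remaining columns actually say about the label.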

## Visualization of Tree estimators

### Visualization of a single tree estimator

In [None]:
from sklearn import tree
fn = list(X.columns)      # feature names, in the order the model was trained on
cn = ['Good', 'Risky']    # class names in encoded order (Good = 0, Risky = 1)
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
tree.plot_tree(model.estimators_[0],
               feature_names=fn,
               class_names=cn,
               filled=True);

### Visualization of multiple trees used as estimators

In [None]:
fig, axes = plt.subplots(nrows = 1,ncols = 5,figsize = (10,2), dpi=900) 
for index in range(0, 5):
    tree.plot_tree(model.estimators_[index],
                   feature_names = fn, 
                   class_names=cn,
                   filled = True,
                   ax = axes[index]);

    axes[index].set_title('Estimator: ' + str(index), fontsize = 11)

## Experimenting with HyperParameters

Little is gained from tuning in this case: the baseline forest already scores near-perfectly, so changing max features or the number of trees leaves almost no room for improvement.
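The outline names max features and the number of trees as the hyperparameters to explore. If such an experiment were run, it might look like the following grid search. This is a sketch on synthetic data: the frame `demo`, the grid values and the fold count are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded frame; every value is randomly generated.
rng = np.random.default_rng(2)
n = 600
demo = pd.DataFrame({
    "undergrad": rng.integers(0, 2, n),
    "marital_status": rng.integers(0, 3, n),
    "tax_income": rng.integers(10_000, 100_000, n),
    "city_pop": rng.integers(25_000, 200_000, n),
    "work_exp": rng.integers(0, 31, n),
    "urban": rng.integers(0, 2, n),
})
demo["fraud"] = (demo["tax_income"] <= 30_000).astype(int)  # 1 = Risky

X_demo = demo.drop(columns="fraud")
y_demo = demo["fraud"]

# Illustrative grid: number of trees and features considered per split.
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_features": [1, 2, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=101),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

`search.cv_results_` holds the mean score for every combination, which can be loaded into a DataFrame to see how flat or steep the accuracy surface is.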

## Conclusion

The random forest ensemble technique was performed on this dataset. The dataset was quite straightforward: the label is determined by a single threshold on tax_income, and there was no other apparent bias. As a result, the model performed well.