# <center>        **Introduction to Data Science**</center>

# 1. Business Understanding

You are expected to identify an analytical problem of your choice. Detail the Business Understanding part of your problem under this heading by addressing the following questions.

   1. What is the business problem that you are trying to solve?
   2. What data do you need to answer the above problem?
   3. What are the different sources of data?
   4. What kind of analytics task are you performing?

Score: 1 Mark in total (0.25 mark each)

<div class="alert alert-block alert-info">

**1. What is the business problem that you are trying to solve?**

> _As part of expanding its business, a commercial bank is planning to introduce a range of credit card offers to its customers. The bank wants to market these offers and services to targeted customers based on their financial capacity. It has collected various types of data on potential customers that help predict each customer's financial capacity. The predicted income would then be used to map customers to the right credit card packages and to set card features such as credit limit, reward schemes, etc._

**2. What data do you need to answer the above problem?**

> _We need data on individuals, including gender, age, education details, employment status, income details, and dependents._

**3. What are the different sources of data?**

> _The bank has a dataset of existing customers and a feature set that can be used for training. For demonstration purposes, we will use a publicly available dataset._

**4. What kind of analytics task are you performing?**

> _We will use predictive analytics to predict the income of an individual based on the dataset provided._

</div>

# 2. Data Acquisition

For the problem identified, find an appropriate data set (your data set must
be unique, with a minimum of **20 features and 10k rows**) from any public data source.

---



## 2.1 Download the data directly



In [None]:
import os

from kaggle.api.kaggle_api_extended import KaggleApi

dataset = 'kamaumunyori/income-prediction-dataset-us-20th-century-data'
dirpath = os.getcwd()

# Setup kaggle authentication configuration
help_msg = """
Setup kaggle Username and token

    i.   login to https://www.kaggle.com/
    ii.  go to profile (top right corner) -> settings
    iii. click on "Create New Token"
    iv.  kaggle.json will be downloaded with your username and key
    v.   Copy to ~/.kaggle/ directory
    vi.  chmod 600 ~/.kaggle/kaggle.json
"""    

# 2. Authentication    
api = KaggleApi()
try:
    api.authenticate()
except Exception as error: 
    print('Download_Dataset: Authentication failure', error)
    print(help_msg)
    raise

# 3. Download all files from given dataset URL to dirpath if not present.
try:
    api.dataset_download_files(dataset=dataset, path=dirpath, unzip=True, quiet=False)
    print('Download Completed !!!')
except Exception as error:
    print('Download_Dataset: failed to download files:', error)
    raise

## 2.2 Code for converting the above downloaded data into a dataframe

In [None]:
import pandas as pds
import numpy as np
import platform

# Convert the dataset in CSV to dataframe
if platform.system() == "Windows":
    # On Windows the extracted folder is named without the trailing dot
    dataset_dir = os.path.join(dirpath, "Income prediction")
else:
    # On macOS the extracted folder keeps the trailing dot
    # (assumed to be the same on other POSIX systems such as Linux)
    dataset_dir = os.path.join(dirpath, "Income Prediction.")
Datasetfile = os.path.join(dataset_dir, "Train.csv")
Test_Datasetfile = os.path.join(dataset_dir, "Test.csv")

try:
    Dataframe = pds.read_csv(Datasetfile, skipinitialspace=True)
    Test_Dataframe = pds.read_csv(Test_Datasetfile, skipinitialspace=True)
    print("Data Frame conversion: successful")
except Exception as error:
    print('Data Frame conversion: failed:', error)
    # Re-raise so that later cells do not run against an undefined dataframe
    raise

## 2.3 Confirm the data has been loaded correctly by displaying the first 5 and last 5 records.

In [None]:
# display all columns without truncation
pds.set_option('display.max_columns', None)

# Some pandas functions (replace, fillna, etc.) emit a FutureWarning about silent downcasting; calling infer_objects() retains the behaviour of older pandas versions.
Dataframe = Dataframe.infer_objects(copy=False)
Test_Dataframe = Test_Dataframe.infer_objects(copy=False)

print("First 5 records in data set")
display(Dataframe.head(5))

print("Last 5 records in data set")
display(Dataframe.tail(5))

## 2.4 Display the column headings, statistical information, description and statistical summary of the data.

In [None]:
# Displaying column headings
print("Displaying column headings:")
display(Dataframe.columns)

# 5 point statistical summary of Dataset
print("Statistical Information, description and statistical summary of the dataset:")
display(Dataframe.describe())
Dataframe.info(verbose=True)

## 2.5 Write your observations from the above.
1. Size of the dataset
2. What type of data attributes are there?
3. Is there any null data that has to be cleaned?

Score: 2 Marks in total (0.25 marks for 2.1, 0.25 marks for 2.2, 0.5 marks for 2.3, 0.25 marks for 2.4, 0.75 marks for 2.5)
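
The three observations asked for above can be read off a dataframe with a few pandas calls. The following is a minimal sketch on a small made-up frame; the `toy` data and the `summarize` helper are illustrative assumptions and are not part of the assignment dataset.

```python
import pandas as pd

def summarize(df):
    """Return size, dtype counts and per-column null counts of a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # Number of columns per dtype
        "dtype_counts": {str(k): int(v) for k, v in df.dtypes.value_counts().items()},
        # Only the columns that actually contain nulls
        "null_counts": {c: int(n) for c, n in df.isnull().sum().items() if n > 0},
    }

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "age": [25, 40, 31],
    "wage": [10.5, None, 7.0],
    "gender": ["Male", None, "Female"],
})
summary = summarize(toy)
print(summary)
```

The same helper could be called on `Dataframe` and `Test_Dataframe` to fill in the observations for both sets.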

In [None]:
# For 2.5 (3): Null values in each column which needs to be cleaned:
print('Columns Containing Null Entries for Train Data')
display(Dataframe.isnull().sum(axis=0))
print('\nColumns Containing Null Entries for Test Data')
display(Test_Dataframe.isnull().sum(axis=0))

In [None]:
# Grouped based on Attribute types
attribute_types = {
    'NOMINAL': ['ID', 'gender', 'class', 'marital_status', 'race', 'is_hispanic', 'is_labor_union', 'industry_code',
                'industry_code_main', 'occupation_code', 'occupation_code_main', 'citizenship', 'country_of_birth_own',
                'country_of_birth_father', 'country_of_birth_mother', 'migration_code_change_in_msa',
                'migration_code_move_within_reg', 'migration_code_change_in_reg', 'old_residence_reg',
                'old_residence_state','unemployment_reason'
                ],
    'ORDINAL': ['education', 'education_institute', 'employment_commitment', 'employment_stat', 'household_stat',
                'household_summary', 'under_18_family', 'veterans_admin_questionnaire', 'vet_benefit', 'tax_status',
                'migration_prev_sunbelt', 'residence_1_year_ago', 'income_above_limit'],
    'INTERVAL': ['mig_year'],
    'RATIO': ['age', 'wage_per_hour', 'working_week_per_year', 'total_employed', 'gains', 'losses', 'stocks_status',
              'importance_of_record']
}

<div class="alert alert-block alert-info">

**Observations:**

1. Number of Rows = 209499, Number of attributes/columns = 43
2. Type of Data Attributes out of 43:
   <ul>
   <li> Integer   :  12 </li>
   <li> Float     :  1 </li>
   <li> Object    :  30 </li>
   </ul>
   <b><i>Please refer to the types of data attributes identified above in {{ attribute_types }}</i></b>

3. There are 14 columns/attributes containing entries with NULL data, which need data cleaning.
   All such columns and their numbers of null entries are captured in the 2.4 result.

</div>

# 3. Data Preparation

If input data is numerical or categorical, do 3.1, 3.2 and 3.4
If input data is text, do 3.3 and 3.4

## 3.1 Check for

* duplicate data
* missing data
* data inconsistencies
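
As a rough illustration of the three checks listed above, the sketch below runs them on a small made-up frame. The column values and the choice of `?` as the placeholder token are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Male"],
    "class": ["Private", "?", "Private", "Private"],
    "age": [25, 40, 25, 31],
})

# Duplicate data: rows identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# Missing data: null entries per column
n_missing = toy.isnull().sum().to_dict()

# Data inconsistencies: placeholder tokens such as '?' in text columns
text_cols = ["gender", "class"]
n_placeholders = {c: int((toy[c] == "?").sum()) for c in text_cols}

print(n_duplicates, n_missing, n_placeholders)
```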


In [None]:
# Printing the unique values in each column
for colname, coltype in Dataframe.dtypes.items():
    if coltype == object:
        if Dataframe[colname].nunique() < 51:
            print("Column name: {} Unique Values count: {}\n".format(colname, Dataframe[colname].nunique()))
            print("Unique Values:{}\n".format(Dataframe[colname].unique()))
        else:
            print("Column name: {} Unique Values count: {}\n".format(colname, Dataframe[colname].nunique()))

In [None]:
# Identify Duplicate records
try:
    if Dataframe.duplicated().sum():
        print("Train: Total Duplicate Rows Identified:\n", Dataframe.duplicated().sum())
    else:
        print("Train: No Duplicate Rows Identified\n")
except Exception as error:
    print('Train: DataFrame Failed to Identify Duplicate Rows', error)

# Identify Duplicate records for Test Data
try:
    if Test_Dataframe.duplicated().sum():
        print("Test: Total Duplicate Rows Identified:\n", Test_Dataframe.duplicated().sum())
    else:
        print("Test: No Duplicate Rows Identified\n")
except Exception as error:
    print('Test: DataFrame Failed to Identify Duplicate Rows', error)

In [None]:
# Identify the columns/attributes with Missing data
try:
    na_column_list = Dataframe.columns[Dataframe.isnull().any()].to_list()
    na_column_dict = dict()
    for column in na_column_list:
        na_column_dict[column] = Dataframe[column].isnull().sum()
    print("Train: Following columns Identified with Missing entries:\n", na_column_dict)
except Exception as error:
    print('Train: DataFrame Failed to Identify Missing Data', error)

# Identify the columns/attributes with Missing data for Test Data
try:
    na_column_list = Test_Dataframe.columns[Test_Dataframe.isnull().any()].to_list()
    na_column_dict = dict()
    for column in na_column_list:
        na_column_dict[column] = Test_Dataframe[column].isnull().sum()
    print("Test: Following columns Identified with Missing entries:\n", na_column_dict)
except Exception as error:
    print('Test: DataFrame Failed to Identify Missing Data', error)

In [None]:
# Identify Inconsistent data
try:
    import re
    numeric_consistent_flag = True # Assuming numeric data is consistent
    for colname, coltype in Dataframe.dtypes.items():
        if colname in attribute_types['NOMINAL'] + attribute_types['ORDINAL']:
            if coltype == object and colname != 'ID':
                inspect = Dataframe[colname].unique()
                display(inspect)
                # 1. check for special characters.
                pattern=re.compile('[@_!#$%^&*()<>?/\\|}{~:]')
                df = Dataframe[colname].str.contains(pattern)
                print("colname: {} contain special character :{}\n".format(colname,df.unique()))
        # All the Numeric attributes must be in integer or float datatype
        if colname in attribute_types['INTERVAL'] + attribute_types['RATIO']:
            if not pds.api.types.is_numeric_dtype(coltype):
                print("Inconsistent Column: ", colname, " => number of entries:", Dataframe[colname].str.isalnum().sum() - Dataframe[colname].str.isdigit().sum())
                numeric_consistent_flag = False # Found numeric data is inconsistent
except Exception as error:
    print('DataFrame failed to identify Inconsistent data', error)

if (numeric_consistent_flag):
    print("Numeric data attribute is consistent")

## 3.2 Apply techniques
* to remove duplicate data
* to impute or remove missing data
* to remove data inconsistencies
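
When removing sparse columns with `dropna`, note that `thresh` counts the minimum number of non-null values a column must have in order to be kept, not the number of missing values. The sketch below shows this on made-up data (the column names are hypothetical), together with duplicate removal.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 rows, columns with 0, 2 and 3 missing entries
toy = pd.DataFrame({
    "full":  [1, 2, 3, 4, 5, 6],
    "third": [1, np.nan, 3, np.nan, 5, 6],
    "half":  [1, np.nan, 3, np.nan, 5, np.nan],
})

# thresh is the minimum number of NON-null values a column needs to survive
min_non_null = int(len(toy) * 2 / 3)   # 4 of 6 rows
kept = toy.dropna(axis=1, thresh=min_non_null)
print(list(kept.columns))

# Duplicate rows are removed with drop_duplicates()
deduped = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]}).drop_duplicates()
print(len(deduped))
```

With a threshold of two-thirds of the rows, the `half` column is dropped even though only 50% of its entries are missing.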


In [None]:
# Drop the ID column: a unique identifier does not help prediction
Dataframe.drop(['ID'], axis=1, inplace=True)
Test_Dataframe.drop(['ID'], axis=1, inplace=True)

In [None]:
# Duplicate Rows Technique: Remove
# Removing duplicates in Original dataset.
try:
    if Dataframe.duplicated().sum():
        Dataframe.drop_duplicates(inplace=True)
        print("Duplicates removed")
    else:
        print("No duplicates found. No action needed")
except Exception as error:
    print('DataFrame duplicate Removal Failed', error)

# Test dataset
try:
    if Test_Dataframe.duplicated().sum():
        Test_Dataframe.drop_duplicates(inplace=True)
        print("TestDataframe: Duplicates removed")
    else:
        print("TestDataframe: No duplicates found. No action needed")
except Exception as error:
    print('TestDataframe: DataFrame duplicate Removal Failed', error)

In [None]:
# Missing Data Technique: Drop (Optional to use)
try:
    # Keep only columns with at least 2/3 non-null entries, i.e. drop columns with more than about 33% missing entries
    Dataframe.dropna(axis=1, thresh=((len(Dataframe)*2) / 3), inplace=True)
    drop = False # Whether or not to drop the entire row if the attribute value is empty.
    if drop:
        # Drop missing Rows
        Dataframe.dropna(axis=0, subset=attribute_types['INTERVAL'] + attribute_types['RATIO'], inplace=True)
    print("Dropped missing data successfully")
    Dataframe.info(verbose=True)
except Exception as error:
    print("Missing data drop failure", error)

# Test Dataset
try:
    # Keep only columns with at least 2/3 non-null entries, i.e. drop columns with more than about 33% missing entries
    Test_Dataframe.dropna(axis=1, thresh=((len(Test_Dataframe)*2) / 3), inplace=True)
    drop = False # Whether or not to drop the entire row if the attribute value is empty.
    if drop:
        # Drop missing Rows
        Test_Dataframe.dropna(axis=0, subset=attribute_types['INTERVAL'] + attribute_types['RATIO'], inplace=True)
    print("TestDataframe: Dropped missing data successfully")
    Test_Dataframe.info(verbose=True)
except Exception as error:
    print("TestDataframe: Missing data drop failure", error)

In [None]:
# Inconsistent data Technique: Fix the format/Delete the inappropriate data.
# Inappropriate data that is deleted will be filled by the missing-value imputation technique below.

def is_outlier_present(df, column):
    df_temp = df.sort_values(by=[column])
    Q1 = df_temp[column].quantile(0.25)
    Q3 = df_temp[column].quantile(0.75)
    IQR = Q3 - Q1
    minimum = Q1 - 1.5 * IQR
    maximum = Q3 + 1.5 * IQR
    outliers = df_temp[(df_temp[column] < minimum) | (df_temp[column] > maximum)]
    # If outliers is empty => No outliers
    # If outliers is non-empty => Outliers present
    return len(outliers) != 0

unknown_pattern='Unknown'
# Train Dataframe
try:
    for colname, coltype in Dataframe.dtypes.items():
        # Categorical Attribute
        if colname in attribute_types['NOMINAL'] + attribute_types['ORDINAL']:
            # Remove Data Inconsistency from each column.
            # Remove '?' and replace with 'Unknown'. This is a valid data
            try:
                Dataframe.replace(['?'], unknown_pattern, inplace=True)
                print("column name {}: ? substituted with {} times".format(colname, Dataframe[colname].value_counts()[unknown_pattern]))
            except Exception as error:
                print("column name {}: no match for ? {}".format(colname, error))
            #
            # Drop the column if unknown entries are more than 45%.
            # 45% considered because of some columns are just short of 50%.
            #
            try:
                unknown_count = Dataframe[colname].value_counts()[unknown_pattern]
                if (unknown_count > (len(Dataframe)*0.45)):
                    print("{} dropped as more than 45% of the values are unknown".format(colname))
                    Dataframe.drop([colname], axis=1, inplace=True)
            except Exception as error:
                print("column name {}: no match for Unknown".format(colname))
        # Numerical Attribute
        if colname in attribute_types['INTERVAL'] + attribute_types['RATIO']:
            if coltype == object:
                # This will remove all inconsistent data.
                Dataframe[colname] = pds.to_numeric(pds.Series(Dataframe[colname]), errors='coerce', downcast=None)
            if coltype != object:
                # After converting to numeric, check for outliers
                if is_outlier_present(Dataframe, colname):
                    if colname in ['mig_year', 'age', 'working_week_per_year', 'total_employed']:
                        # These attributes are expected to be free of outliers, so fail loudly if one appears
                        raise ValueError("Outlier exists in column: {}".format(colname))
                    else:
                        # Outliers drop technique cannot be used for below attributes:
                        # Since by applying IQR technique, we are getting outliers for below attributes with larger frequencies.
                        # Removing this would negatively affect the ML technique. Hence we are not applying IQR technique to drop below outliers.
                        # wage_per_hour: 11856
                        # gains: 7830
                        # losses: 4062
                        # stocks_status: 22032
                        # importance_of_record: 6759
                        pass
    # Dataframe.info(verbose=True)
    print("Train: Successfully handled inconsistency")
except Exception as error:
    print("Train: Inconsistent fix failure", error)

# Test Dataframe
try:
    for colname, coltype in Test_Dataframe.dtypes.items():
        # Categorical Attribute
        if colname in attribute_types['NOMINAL'] + attribute_types['ORDINAL']:
            # Remove Data Inconsistency from each column.
            # Remove '?' and replace with 'Unknown'. This is a valid data
            try:
                Test_Dataframe.replace(['?'], unknown_pattern, inplace=True)
                print("TestDataframe: column name {}: ? substituted with {} times".format(colname, Test_Dataframe[colname].value_counts()[unknown_pattern]))
            except Exception as error:
                print("TestDataframe: column name {}: no match for ? {}".format(colname, error))
            #
            # Drop the column if unknown entries are more than 45%.
            # 45% considered because of some columns are just short of 50%.
            #
            try:
                unknown_count = Test_Dataframe[colname].value_counts()[unknown_pattern]
                if (unknown_count > (len(Test_Dataframe)*0.45)):
                    print("TestDataframe: {} dropped as more than 45% of the values are unknown".format(colname))
                    Test_Dataframe.drop([colname], axis=1, inplace=True)
            except Exception as error:
                print("TestDataframe: column name {}: no match for Unknown".format(colname))
        # Numerical Attribute
        if colname in attribute_types['INTERVAL'] + attribute_types['RATIO']:
            if coltype == object:
                # This will remove all inconsistent data.
                Test_Dataframe[colname] = pds.to_numeric(pds.Series(Test_Dataframe[colname]), errors='coerce', downcast=None)
            if coltype != object:
                # After converting to numeric, check for outliers
                if is_outlier_present(Test_Dataframe, colname):
                    if colname in ['mig_year', 'age', 'working_week_per_year', 'total_employed']:
                        # These attributes are expected to be free of outliers, so fail loudly if one appears
                        raise ValueError("TestDataframe: Outlier exists in column: {}".format(colname))
                    else:
                        # Outliers drop technique cannot be used for below attributes:
                        # Since by applying IQR technique, we are getting outliers for below attributes with larger frequencies.
                        # Removing this would negatively affect the ML technique. Hence we are not applying IQR technique to drop below outliers.
                        # wage_per_hour: 11856
                        # gains: 7830
                        # losses: 4062
                        # stocks_status: 22032
                        # importance_of_record: 6759
                        pass
    # Test_Dataframe.info(verbose=True)
    print("TestDataframe: Successfully handled inconsistency")
except Exception as error:
    print("TestDataframe: Inconsistent fix failure", error)

In [None]:
# Missing Data Technique: Impute with measure of central tendency
try:
    Dataframe.info(verbose=True)
    for colname, coltype in Dataframe.dtypes.items():
        # Categorical Attribute
        if colname in attribute_types['NOMINAL'] + attribute_types['ORDINAL']:
            # Impute missing data with mode
            if coltype == object and colname != 'ID':
                try:
                    # Dataframe[colname].fillna(Dataframe[colname].mode(dropna=True)[0], inplace=True)
                    Dataframe.fillna({colname: Dataframe[colname].mode(dropna=True)[0]}, inplace=True)
                except Exception as error:
                    print("Categorical mode imputing failed for {}, error {} ".format(colname, error))
                #debug log to verify we dont have unknown
                print("column name: {} mode: {}\n".format(colname, Dataframe[colname].mode(dropna=True)[0]))
        # Numerical Attribute
        if colname in attribute_types['INTERVAL'] + attribute_types['RATIO']:
            # Impute missing data with mean
            Dataframe[colname] = Dataframe[colname].fillna(Dataframe[colname].mean())
    print('Imputed missing and inconsistent data successfully')
except Exception as error:
    print("Missing data impute failure", error)

# Test Dataframe
try:
    Test_Dataframe.info(verbose=True)
    for colname, coltype in Test_Dataframe.dtypes.items():
        # Categorical Attribute
        if colname in attribute_types['NOMINAL'] + attribute_types['ORDINAL']:
            # Impute missing data with mode
            if coltype == object and colname != 'ID':
                try:
                    # Test_Dataframe[colname].fillna(Test_Dataframe[colname].mode(dropna=True)[0], inplace=True)
                    Test_Dataframe.fillna({colname: Test_Dataframe[colname].mode(dropna=True)[0]}, inplace=True)
                except Exception as error:
                    print("TestDataframe: Categorical mode imputing failed for {}, error {} ".format(colname, error))
                #debug log to verify we dont have unknown
                print("TestDataframe: column name: {} mode: {}\n".format(colname, Test_Dataframe[colname].mode(dropna=True)[0]))
        # Numerical Attribute
        if colname in attribute_types['INTERVAL'] + attribute_types['RATIO']:
            # Impute missing data with mean
            Test_Dataframe[colname] = Test_Dataframe[colname].fillna(Test_Dataframe[colname].mean())
    print('TestDataframe: Imputed missing and inconsistent data successfully')
except Exception as error:
    print("TestDataframe: Missing data impute failure", error)

In [None]:
# Duplicate Rows Technique: Remove
# Removing duplicates once the missing and inconsistent data is filled.
try:
    if Dataframe.duplicated().sum():
        Dataframe.drop_duplicates(inplace=True)
        print("Duplicates removed")
    else:
        print("No Duplicate found. No action needed")
except Exception as error:
    print('DataFrame duplicate Removal Failed', error)

# Removing duplicates once the missing and inconsistent data is filled in Test Dataframe.
try:
    if Test_Dataframe.duplicated().sum():
        Test_Dataframe.drop_duplicates(inplace=True)
        print("TestDataframe: Duplicates removed")
    else:
        print("TestDataframe: No Duplicate found. No action needed")
except Exception as error:
    print('TestDataframe: DataFrame duplicate Removal Failed', error)

## 3.3 Encode categorical data
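
When encoding categories with a dictionary, any category that is missing from the mapping turns into `NaN` under `Series.map`. A small sketch, using a made-up column and a deliberately incomplete mapping, shows how to list such categories before encoding:

```python
import pandas as pd

# Hypothetical toy column, not the assignment dataset
status = pd.Series(["Single", "Nonfiler", "Widowed"])
codes = {"Nonfiler": 0, "Single": 1}  # deliberately incomplete mapping

# map(): categories missing from the dictionary become NaN
mapped = status.map(codes)

# A quick check that lists categories the mapping does not cover
unmapped = sorted(set(status.dropna()) - set(codes))
print(mapped.tolist(), unmapped)
```

Running a check like this against each mapping dictionary before applying it may help catch spelling or whitespace differences between the dictionary keys and the raw values.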

In [None]:
# Aggregation and Label encoding
edu_mapping = {'Children':0, 'Less than 1st grade':0, '1st 2nd 3rd or 4th grade':0, '5th or 6th grade':0,
               '7th and 8th grade':0, '9th grade':0, '10th grade':0, '11th grade':0, '12th grade no diploma':0,
               'High school graduate':1, 'Some college but no degree':2,
               'Associates degree-academic program':2, 'Associates degree-occup /vocational':2,
               'Bachelors degree(BA AB BS)':2, 'Prof school degree (MD DDS DVM LLB JD)':3,
               'Masters degree(MA MS MEng MEd MSW MBA)':3, 'Doctorate degree(PhD EdD)':3}
tax = {'Head of household':2, 'Single':1, 'Nonfiler':0, 'Joint both 65+':5,
       'Joint both under 65':3, 'Joint one under 65 & one 65+':4}
hh_sum = {'Householder':2, 'Child 18 or older':1, 'Child under 18 never married':0,
       'Spouse of householder':3, 'Nonrelative of householder':6,
       'Other relative of householder':4,
       'Group Quarters- Secondary individual':5,
       'Child under 18 ever married':0}
income_map = {'Below limit':0, 'Above limit':1}
hh_stat = {'Householder':1, 'Nonfamily householder':0,
       'Child 18+ never marr Not in a subfamily':5,
       'Child <18 never marr not in subfamily':4, 'Spouse of householder':2,
       'Child 18+ spouse of subfamily RP':5, 'Secondary individual':3,
       'Child 18+ never marr RP of subfamily':5,
       'Other Rel 18+ spouse of subfamily RP':5,
       'Grandchild <18 never marr not in subfamily':4,
       'Other Rel <18 never marr child of subfamily RP':4,
       'Other Rel 18+ ever marr RP of subfamily':5,
       'Other Rel 18+ ever marr not in subfamily':5,
       'Child 18+ ever marr Not in a subfamily':5,
       'RP of unrelated subfamily':0, 'Child 18+ ever marr RP of subfamily':5,
       'Other Rel 18+ never marr not in subfamily':5,
       'Child under 18 of RP of unrel subfamily':4,
       'Grandchild <18 never marr child of subfamily RP':4,
       'Grandchild 18+ never marr not in subfamily':5,
       'Other Rel <18 never marr not in subfamily':4, 'In group quarters':6,
       'Grandchild 18+ ever marr not in subfamily':5,
       'Other Rel 18+ never marr RP of subfamily':5,
       'Child <18 never marr RP of subfamily':4,
       'Grandchild 18+ never marr RP of subfamily':5,
       'Spouse of RP of unrelated subfamily':0,
       'Grandchild 18+ ever marr RP of subfamily':5,
       'Child <18 ever marr not in subfamily':4,
       'Child <18 ever marr RP of subfamily':4,
       'Other Rel <18 ever marr RP of subfamily':4,
       'Grandchild 18+ spouse of subfamily RP':5,
       'Child <18 spouse of subfamily RP':4,
       'Other Rel <18 ever marr not in subfamily':4,
       'Other Rel <18 never married RP of subfamily':4,
       'Other Rel <18 spouse of subfamily RP':4,
       'Grandchild <18 ever marr not in subfamily':4,
       'Grandchild <18 never marr RP of subfamily':4}
em_com = {'Not in labor force':0, 'Children or Armed Forces':6,
       'Full-time schedules':5, 'PT for econ reasons usually PT':3,
       'Unemployed full-time':1, 'PT for non-econ reasons usually FT':4,
       'PT for econ reasons usually FT':4, 'Unemployed part- time':2}

Dataframe = Dataframe.infer_objects(copy=False)
Test_Dataframe = Test_Dataframe.infer_objects(copy=False)

Dataframe['education']=Dataframe['education'].replace(edu_mapping)
Dataframe['tax_status']=Dataframe['tax_status'].replace(tax)
Dataframe['household_summary']=Dataframe['household_summary'].replace(hh_sum)
Dataframe['income_above_limit']=Dataframe['income_above_limit'].replace(income_map)
Dataframe['household_stat']=Dataframe['household_stat'].replace(hh_stat)
Dataframe['employment_commitment']=Dataframe['employment_commitment'].replace(em_com)

Test_Dataframe['education']=Test_Dataframe['education'].replace(edu_mapping)
Test_Dataframe['tax_status']=Test_Dataframe['tax_status'].replace(tax)
Test_Dataframe['household_summary']=Test_Dataframe['household_summary'].replace(hh_sum)
Test_Dataframe['household_stat']=Test_Dataframe['household_stat'].replace(hh_stat)
Test_Dataframe['employment_commitment']=Test_Dataframe['employment_commitment'].replace(em_com)

In [None]:
# Aggregation and Label encoding continued
#
# gender mapping.
#
Dataframe['gender'] = Dataframe['gender'].map({"Male": 0, "Female": 1})
Test_Dataframe['gender'] = Test_Dataframe['gender'].map({"Male": 0, "Female": 1})


#
# married status mapping rules.
#
marriage_mapping = {"Married-civilian spouse present": "Married",
                    "Married-spouse absent": "Married",
                    "Married-A F spouse present": "Married",
                    "Widowed": "Single", "Divorced": "Single",
                    "Separated": "Single", "Never married": "Single",
}

## Apply the mapping.
Dataframe['marital_status'] = Dataframe['marital_status'].map(marriage_mapping)
Test_Dataframe['marital_status'] = Test_Dataframe['marital_status'].map(marriage_mapping)

## Further map to binary values.
Dataframe["marital_status"] = Dataframe["marital_status"].map({"Married": 0, "Single": 1})
Test_Dataframe["marital_status"] = Test_Dataframe["marital_status"].map({"Married": 0, "Single": 1})


#
# 'citizenship' mapping rules
#
citizenship_mapping = {"Native": "Citizen",
                    "Native- Born abroad of American Parent(s)": "Citizen",
                    "Native- Born in Puerto Rico or U S Outlying": "Citizen",
                    "Foreign born- U S citizen by naturalization": "Citizen",
                    "Foreign born- Not a citizen of U S ": "Non-citizen"
}
## Apply the mapping.
Dataframe['citizenship'] = Dataframe['citizenship'].map(citizenship_mapping)
Test_Dataframe['citizenship'] = Test_Dataframe['citizenship'].map(citizenship_mapping)
## Further map to binary values.
Dataframe["citizenship"] = Dataframe["citizenship"].map({"Citizen": 0, "Non-citizen": 1})
Test_Dataframe["citizenship"] = Test_Dataframe["citizenship"].map({"Citizen": 0, "Non-citizen": 1})


## 3.4 Text data

1. Remove special characters
2. Change the case (up-casing and down-casing).
3. Tokenization: the process of discretizing words within a document.
4. Filter Stop Words.

<div class="alert alert-block alert-info">
There is no text data in our dataset.
</div>

## 3.4 Report

Mention and justify the method adopted
* to remove duplicate data, if present
* to impute or remove missing data, if present
* to remove data inconsistencies, if present

OR for text data
* How many tokens after step 3?
* How many tokens after stop words filtering?

If any of the above are not present, then also state that in the report below.

Score: 2 Marks (based on the dataset you have, the data preparation you had to do and the report typed, marks will be distributed between 3.1, 3.2, 3.3 and 3.4)

<div class="alert alert-block alert-info">

**Deletion technique:**
<li> Our dataset has a unique ID for every record, so the 'ID' column does not help prediction and we dropped it. </li>
<li> We delete the columns in which more than about one-third of the entries are missing (a column is kept only if at least two-thirds of its entries are non-null). </li>

**Inconsistent data:**
<li> We removed inconsistencies by replacing "?" with "Unknown". </li>
<li> If "Unknown" makes up more than 45% of a column, we drop the whole column. </li>
<li> We check for outliers in the appropriate attributes. </li>

**Missing data:**
<li> We impute missing values of categorical attributes with a measure of central tendency, the mode; "Unknown" is kept as a valid value. </li>
<li> We impute missing values of numerical attributes, if any, with a measure of central tendency, the mean. </li>

**Deletion technique:**
<li> After all the above steps, we drop duplicate records, if any.</li>

</div>
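
The cleaning steps reported above can be sketched end to end on a small made-up frame. The column names, values and the `categorical`/`numerical` lists below are assumptions for illustration and do not reproduce the notebook's actual results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "class":  ["Private", "?", None, "Private", "Federal"],
    "region": ["?", "?", "?", "West", "South"],
    "age":    [20.0, np.nan, 40.0, 30.0, 50.0],
})
categorical, numerical = ["class", "region"], ["age"]

# 1. Inconsistency: treat '?' as an explicit 'Unknown' category
toy[categorical] = toy[categorical].replace("?", "Unknown")

# 2. Drop categorical columns where more than 45% of the entries are 'Unknown'
too_unknown = [c for c in categorical if (toy[c] == "Unknown").mean() > 0.45]
toy = toy.drop(columns=too_unknown)

# 3. Impute remaining nulls: mode for categorical, mean for numerical columns
for col in toy.columns:
    if col in categorical:
        toy[col] = toy[col].fillna(toy[col].mode(dropna=True)[0])
    elif col in numerical:
        toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```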


<div class="alert alert-block alert-info">
There is no text data in our dataset.
</div>

## 3.5 Identify the target variables.

* Separate the data from the target such that the dataset is in the form of (X,y) or (Features, Label)

* Discretize / Encode the target variable or perform one-hot encoding on the target or any other as and if required.

* Report the observations

Score: 1 Mark
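
A minimal sketch of the (X, y) separation and of binning with `pandas.cut` follows, on made-up values. One detail worth noting: with the default right-closed bins, a value equal to the first bin edge is excluded unless `include_lowest=True` is passed. The frame and bin edges below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "age": [0, 10, 11, 35],
    "gains": [0, 0, 15000, 99999],
    "income_above_limit": ["Below limit", "Below limit", "Above limit", "Above limit"],
})

# Separate features (X) from the encoded label (y)
y = toy["income_above_limit"].map({"Below limit": 0, "Above limit": 1})
X = toy.drop(columns=["income_above_limit"])

edges, labels = [0, 10, 20, 40], ["0-10", "11-20", "21-40"]

# Default: the lowest edge is excluded, so a value equal to it becomes NaN
default_bins = pd.cut(X["age"], bins=edges, labels=labels)

# include_lowest=True keeps the first edge inside the first bin
closed_bins = pd.cut(X["age"], bins=edges, labels=labels, include_lowest=True)

print(default_bins.isna().sum(), closed_bins.isna().sum())
```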

In [None]:
#discretize numerical features

#
# age
#

## Define bin edges
bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

## Create labels for the bins
bin_labels = ["0-10", "11-20", "21-30", "31-40", "41-50",
            "51-60", "61-70", "71-80", "81-90", "91-100"]

## Apply the binning using pds.cut(); include_lowest=True keeps values equal to the first edge (0) in the first bin
Dataframe["age"] = pds.cut(Dataframe["age"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Train: {}\n".format(Dataframe['age'].value_counts()))
Test_Dataframe["age"] = pds.cut(Test_Dataframe["age"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Test: {}\n".format(Test_Dataframe['age'].value_counts()))

#
# Hourly wage encoding.
#

## Define bin edges
bin_edges = [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]

## Create labels for the bins
bin_labels = ["0-1000", "1001-2000", "2001-3000", "3001-4000", "4001-5000",
            "5001-6000", "6001-7000", "7001-8000", "8001-9000", "9001-9999"]

## Apply the binning using pds.cut(); include_lowest=True keeps values equal to the first edge (0) in the first bin
Dataframe["wage_per_hour"] = pds.cut(Dataframe["wage_per_hour"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Train: {}\n".format(Dataframe['wage_per_hour'].value_counts()))
Test_Dataframe["wage_per_hour"] = pds.cut(Test_Dataframe["wage_per_hour"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Test: {}\n".format(Test_Dataframe['wage_per_hour'].value_counts()))

#
# Investment/capital market gains encoding.
#

# Define bin edges
bin_edges = [0, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000]

## Create labels for the bins
bin_labels = ["0-10000", "10001-20000", "20001-30000", "30001-40000", "40001-50000",
            "50001-60000", "60001-70000", "70001-80000", "80001-90000", "90001-99999"]

## Apply the binning using pds.cut(); include_lowest=True keeps gains of 0 in the first bin
Dataframe["gains"] = pds.cut(Dataframe["gains"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Train: {}\n".format(Dataframe['gains'].value_counts()))
Test_Dataframe["gains"] = pds.cut(Test_Dataframe["gains"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Test: {}\n".format(Test_Dataframe['gains'].value_counts()))

#
# Investment/capital market stocks encoding.
#

# Define bin edges
bin_edges = [0, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000]

## Create labels for the bins
bin_labels = ["0-10000", "10001-20000", "20001-30000", "30001-40000", "40001-50000",
            "50001-60000", "60001-70000", "70001-80000", "80001-90000", "90001-99999"]

## Apply the binning using pds.cut(); include_lowest=True keeps a value of 0 in the first bin
Dataframe["stocks_status"] = pds.cut(Dataframe["stocks_status"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Train: {}\n".format(Dataframe['stocks_status'].value_counts()))
Test_Dataframe["stocks_status"] = pds.cut(Test_Dataframe["stocks_status"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Test: {}\n".format(Test_Dataframe['stocks_status'].value_counts()))

#
# Investment/capital market losses encoding.
#

# Define bin edges
bin_edges = [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]

## Create labels for the bins
bin_labels = ["0-1000", "1001-2000", "2001-3000", "3001-4000", "4001-5000",
            "5001-6000", "6001-7000", "7001-8000", "8001-9000", "9001-9999"]

## Apply the binning using pds.cut(); include_lowest=True keeps losses of 0 in the first bin
Dataframe["losses"] = pds.cut(Dataframe["losses"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Train: {}\n".format(Dataframe['losses'].value_counts()))
Test_Dataframe["losses"] = pds.cut(Test_Dataframe["losses"], bins=bin_edges, labels=bin_labels, include_lowest=True)
print("Test: {}\n".format(Test_Dataframe['losses'].value_counts()))


In [None]:
# Final Coding Procedure.

## List the columns whose values will be converted into integer category codes
categorical_columns = ['age', 'race', 'education', 'is_hispanic', 
                       'employment_commitment', 'working_week_per_year',
                       'industry_code_main', 'household_stat', 'stocks_status',
                       'household_summary', 'tax_status', 'citizenship', 'gains', 'losses',
                       'country_of_birth_own', 'country_of_birth_father', 'country_of_birth_mother',
                       'wage_per_hour',
                      ]

## Define function that codes the values into categorical codes

def consistent_coding(dataframes, categorical_columns):
    """Applies consistent category coding across multiple DataFrames (in place)."""

    # Global mapping per column: {category value: integer code}
    category_codes = {}

    for col in categorical_columns:
        # Collect categories from every DataFrame, in order of first appearance,
        # so that values seen only in the test data still receive a code
        values = pds.concat([df[col].astype(object) for df in dataframes]).dropna().unique()
        category_codes[col] = {value: i for i, value in enumerate(values)}

    for df in dataframes:
        for col in categorical_columns:
            print("{} {} -> ".format(col, df[col].unique()))

            # Apply mapping to DataFrame; missing values are coded as -1
            df[col] = df[col].astype('category')
            df[col] = df[col].cat.set_categories(list(category_codes[col].keys()))
            df[col] = df[col].cat.codes  # Assign codes based on global mapping
            print(" {}\n".format(df[col].unique()))

    return dataframes


## Apply consistent coding to training & testing data.
#train_df_coded = consistent_coding([train_df, test_df], categorical_columns)
train_df_coded = consistent_coding([Dataframe, Test_Dataframe], categorical_columns)
# Train Dataframe
display(train_df_coded[0].head(5))
# Test Dataframe
display(train_df_coded[1].head(5))


In [None]:
# Data target in (X, y)
y = ['income_above_limit']
# Keep the original column order; a set difference would shuffle the feature names
x = [col for col in Dataframe.columns if col not in y]
# Dataframe was coded in place above, so it is the same object as train_df_coded[0]
print("(X_Features: {}, y_target: {}, target_label:{})\n".format(x, y, Dataframe[y[0]].unique()))

<div class="alert alert-block alert-info">
<b>Report:</b>

1. The 'ID' attribute is removed from further feature processing.
2. y: The target variable 'income_above_limit' is a categorical attribute, and its ordinal value was already encoded in Section 3.3.
3. X: Of the 43 initial features, 28 attributes remain after data cleanup and are carried forward to the Data Exploration stage.

</div>
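One detail of the discretization in this section is worth checking on a small example: by default `pds.cut` builds intervals that are open on the left, so a value equal to the first bin edge falls outside every bin. The sketch below uses invented ages (not taken from the dataset) with the same age edges and labels as above to compare the default behaviour with `include_lowest=True`.

```python
import pandas as pds

# Hypothetical ages chosen to sit on and around the bin edges
ages = pds.Series([0, 10, 11, 35, 100])
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ["0-10", "11-20", "21-30", "31-40", "41-50",
          "51-60", "61-70", "71-80", "81-90", "91-100"]

# Default: intervals are (0, 10], (10, 20], ... so the value 0 is not binned
default_bins = pds.cut(ages, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 10]
closed_bins = pds.cut(ages, bins=edges, labels=labels, include_lowest=True)

print(pds.DataFrame({"age": ages, "default": default_bins, "include_lowest": closed_bins}))
```

This matters most for attributes such as wages, gains and losses, where a value of exactly 0 can be very common.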

# 4. Data Exploration using various plots



## 4.1 Scatter plot of each quantitative attribute with the target.

Score: 1 Mark

In [None]:
train_df = train_df_coded[0]

import matplotlib.pyplot as plt
import seaborn as sns

ignore_plot = ['ID','importance_of_record','mig_year', 'country_of_birth_father', 'occupation_code',
               'country_of_birth_own','country_of_birth_mother','industry_code','industry_code_main',
               'is_hispanic','losses', 'household_summary', 'wage_per_hour', 'employment_stat','household_stat',
               'race']
try:
    # random sampling from large dataset.
    df = train_df.sample(1000)
    for entry in x:
        if entry not in ignore_plot:
            fig = plt.figure(figsize=(25,4))
            sns.scatterplot(data=df, x=entry, y ='income_above_limit',hue = 'income_above_limit')
            plt.pause(1)
            plt.close()
            fig = plt.figure(figsize=(25,8))
            sns.countplot(data=df, x=entry,hue = 'income_above_limit')
            plt.pause(1)
            plt.close() 
except Exception as error:
    print('Failed to plot graph',error)

<div class="alert alert-block alert-info">
<b>Observations from above plots:</b>

1. age: individuals aged between 30 and 65 are more likely to have income_above_limit than other age groups.
2. stocks_status: records with higher gains and lower losses are more likely to have income_above_limit than records without stocks.
3. education: degree, masters and doctorate holders are more likely to have income_above_limit than others.
4. total_employed: values > 6 have the highest probability of income_above_limit.
5. marital_status: married civilians with spouse present have the highest probability of income_above_limit.
6. citizenship/race: native/white has the highest probability of income_above_limit.
7. employment_commitment: full-time and armed forces entries are more likely to have income_above_limit than others.
8. vet_benefit: values > 1 are more likely to have income_above_limit than 0.
9. gender: more males than females fall in income_above_limit.
10. working_week_per_year: values > 50 are more likely to have income_above_limit than values < 50.
11. tax_status: husband/wife under 65, working and paying tax, are more likely to have income_above_limit than others.

</div>
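Observations like those above are read off the plots by eye; they can also be backed by a per-group rate of the target. The sketch below assumes a binary 0/1 target and uses invented records, with a hypothetical `age_group` column standing in for any categorical attribute.

```python
import pandas as pds

# Invented records: one categorical attribute and a 0/1 target
toy = pds.DataFrame({
    "age_group": ["21-30", "21-30", "31-40", "31-40", "31-40", "41-50", "41-50", "41-50"],
    "income_above_limit": [0, 0, 0, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 target within each group is the share of above-limit records
rate = toy.groupby("age_group")["income_above_limit"].mean()
print(rate.sort_values(ascending=False))
```

Reporting the rate alongside the group size helps avoid over-reading groups that contain only a handful of records.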

## 4.2 EDA using visuals
* Use (minimum) 2 plots (pair plot, heat map, correlation plot, regression plot...) to identify the optimal set of attributes that can be used for classification.
* Name them, explain why you think they can be helpful in the task and perform the plot as well. Unless proper justification for the choice of plots given, no credit will be awarded.

Score: 2 Marks

In [None]:
#Heat Map: To describe correlation (strength and direction) among several numerical/ordinal variables/attributes.
corr_matrix=Dataframe.corr()
#Creating a seaborn heatmap
plt.figure(figsize=(20,10))
sns.heatmap(corr_matrix,cmap='BrBG', center=0, annot=True, linewidths=0.5, linecolor='red')
plt.pause(1)
plt.close()
#Heat Map: Interpretation:
# working_week_per_year, total_employed, occupation_code and industry_code are positively correlated with each other.
# tax_status and vet_benefit are positively correlated with age, education, working_week_per_year, total_employed and household_summary.
# migration_prev_sunbelt and residence_1_year_ago are positively correlated.
# income_above_limit is positively correlated with: age, education, working_week_per_year, industry_code, total_employed, vet_benefit, tax_status, gains, losses, stocks_status.
# household_stat is negatively correlated with most of the attributes.
# employment_stat is positively correlated with total_employed, working_week_per_year, industry_code and occupation_code.
print("########################################################################################################\n")


x_list = ['age', 'wage_per_hour', 'working_week_per_year', 'gains','losses', 'stocks_status', 'importance_of_record']
y_list = [i for i in x if i not in x_list]
# Pair plot: describes the distribution of the data and the relationships between attributes.
# Here we explore how the nominal attributes relate to the ratio/ordinal attributes.
# sns.pairplot creates its own figure, so no separate plt.figure() call is needed.
sns.pairplot(hue='income_above_limit', data=df, y_vars=y_list, x_vars=x_list)
plt.pause(1)
plt.close()

#Pair plot: Interpretation:
# The following attribute pairs show a relationship:
# age and marital status.
# education and job class.
# race/citizenship with wage_per_hour and working_week_per_year.
# employment commitment with wage_per_hour, working_week_per_year and total_employed.
# All of the attributes identified above are related and influence the prediction of the target.

# 5. Data Wrangling



## 5.1 Univariate Filters

#### Numerical and Categorical Data
* Identify top 5 significant features by evaluating each feature independently with respect to the target variable by exploring
1. Mutual Information (Information Gain)
2. Gini index
3. Gain Ratio
4. Chi-Squared test
5. Fisher Score
(From the above 5 you are required to use only any <b>two</b>)

#### For Text data

1. Stemming / Lemmatization.
2. Forming n-grams and storing them in the document vector.
3. TF-IDF
(From the above 3 you are required to use only any <b>two</b>)


Score: 3 Marks

In [None]:
# IG calculation

from sklearn.feature_selection import mutual_info_classif

train_df1 = train_df_coded[0].copy()

display(train_df1.head())
train_df1.info(verbose=True)
for colname, coltype in train_df1.dtypes.items():
    print("Column name: {} Unique Values count: {}\n".format(colname, train_df1[colname].nunique()))
    
x = ['age', 'gender', 'education', 'marital_status', 'race', 'is_hispanic', 'employment_commitment', 'employment_stat', 
     'wage_per_hour', 'working_week_per_year', 'industry_code', 'industry_code_main', 'occupation_code', 'total_employed', 
     'household_stat', 'household_summary', 'vet_benefit', 'tax_status', 'gains', 'losses', 'stocks_status', 'citizenship', 
     'mig_year', 'country_of_birth_own', 'country_of_birth_father', 'country_of_birth_mother', 'importance_of_record']

x_discrete = [True, True, True, True, True, True, True, True, 
             True, False, False, False, False, True, 
             True, True, True, True, True, True, True, True, 
             False, False, False, False, False]

Y = train_df_coded[0]['income_above_limit']
#remove target
train_df1.pop('income_above_limit')
print("Set X and Y. Computing Information Gain...\n")

# Select the columns in the order of x so that the x_discrete mask lines up with them
mi = mutual_info_classif(train_df1[x], Y, discrete_features=x_discrete)
print("mi {}".format(mi))

mi = pds.Series(mi, index=x)
mi.sort_values(ascending=False).plot.bar(figsize=(10, 5))
plt.ylabel('Mutual information')
plt.title("Mutual information between predictors and target")
plt.show()


In [None]:
#Gini index Calculation: weighted Gini impurity of the target within each feature value
gi = {}
for feature in train_df1.columns:
    feature_income = train_df_coded[0][[feature, "income_above_limit"]]

    gini = 0

    for value in feature_income[feature].unique():
        # Records that share this feature value
        subset = feature_income[feature_income[feature] == value]

        # Class proportions of the target inside the subset
        proportion = subset["income_above_limit"].value_counts() / len(subset)

        gini_subset = 1 - sum(proportion ** 2)

        # Weight the subset impurity by the share of records it covers
        gini += len(subset) / len(feature_income) * gini_subset
    gi[feature] = gini

    print("Gini Index for {} feature and target as income_above_limit is {}".format(feature, gini))

# Sort the scores together with their feature names so that bars and labels stay matched.
# A lower Gini index means the feature separates the target classes better.
gi = pds.Series(gi).sort_values(ascending=False)
gi.plot.bar(figsize=(10, 5))
plt.ylabel('Weighted Gini index')
plt.title("Gini Index between predictors and target")
plt.show()


## 5.2 Report observations

Write your observations from the results of each method. Clearly justify your choice of the method.

Score 1 mark

<div class="alert alert-block alert-info">
<b>Observations:</b>


**1. Information Gain:**

Information Gain is applied to the training dataframe, which has 28 features.
The target attribute 'income_above_limit' is categorical.
We separated the discrete and continuous features and calculated the IG of each attribute against the target 'income_above_limit'.
The IG plot is given above. We set a threshold of 0.02, and all features scoring above it are selected for the ML techniques below.

**2. Gini Index:**

The Gini index is applied to the training dataframe, which has 28 features.
We evaluated this method for every feature against the target variable.

</div>

<div class="alert alert-block alert-info">
<b>Justification:</b>
We have selected Information Gain as the univariate filter for the ML techniques used below.
</div>
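To make the Gini index scores easier to interpret, the weighted Gini impurity can be worked through on a tiny example. The helper name `weighted_gini` and the toy columns are illustrative assumptions, not part of the notebook. A feature that separates the target perfectly scores 0, while an uninformative one scores close to the impurity of the target itself.

```python
import pandas as pds

def weighted_gini(feature, target):
    """Weighted Gini impurity of `target` after grouping by `feature` (hypothetical helper)."""
    total = len(feature)
    score = 0.0
    for value in feature.unique():
        subset = target[feature == value]
        # Class proportions of the target inside this group
        p = subset.value_counts(normalize=True)
        score += len(subset) / total * (1 - (p ** 2).sum())
    return score

toy = pds.DataFrame({
    "perfect": [0, 0, 0, 0, 1, 1, 1, 1],   # mirrors the target exactly
    "noise":   [0, 1, 0, 1, 0, 1, 0, 1],   # carries no information about the target
    "target":  [0, 0, 0, 0, 1, 1, 1, 1],
})

gini_perfect = weighted_gini(toy["perfect"], toy["target"])
gini_noise = weighted_gini(toy["noise"], toy["target"])
print(gini_perfect, gini_noise)
```

Lower is better for this score, which is the opposite direction to Information Gain, so the two rankings should be compared with that in mind.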

In [None]:
# Short listed features after Data wrangling techniques

feature_cols = ['occupation_code',
                'education',
                'working_week_per_year',
                'industry_code', 
                'industry_code_main', 
                'age',
                'tax_status', 
                'household_stat',
                'total_employed', 
                'household_summary',
                'stocks_status',
                'gains',
                'mig_year']


# 6. Implement Machine Learning Techniques

Use any 2 ML algorithms
1. Classification -- Decision Tree classifier

2. Clustering -- kmeans

3. Association Analysis

4. Anomaly detection

5. Textual data -- Naive Bayes classifier (not taught in this course)

A clear justification has to be given for why a certain algorithm was chosen to address your problem.

Score: 4 Marks (2 marks each for each algorithm)

## 6.1 ML technique 1 + Justification

In [None]:
# Copy the dataframe to build the decision tree, so that the second ML algorithm is not affected
df = Dataframe.copy()

In [None]:
# Load libraries
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation
from sklearn import tree

X = df[feature_cols] # Features
y = df.income_above_limit # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

In [None]:
# Create Decision Tree classifier object
# clf = DecisionTreeClassifier(splitter='best', criterion='gini', max_depth=15, min_samples_split=2, min_samples_leaf=1000, max_features=None)
clf = DecisionTreeClassifier(splitter='best', criterion='entropy', max_depth=15, min_samples_split=2, min_samples_leaf=1000, max_features=None)

# Train Decision Tree classifier on the 70% training split only, so the 30% test split stays unseen
clf = clf.fit(X_train, y_train)

# Predict the response for 30% test dataset
y_pred = clf.predict(X_test)

# Predicting the target for Actual testdata
y_pred_actual_test_data = clf.predict(Test_Dataframe[feature_cols])

In [None]:
# Detailed metrics can be found in Section 7:

# Returns the mean accuracy on the given data and labels.
print("Model Accuracy (whole dataset): ", clf.score(X, y))
print("Model Accuracy (30% test split): ", clf.score(X_test, y_test))

# Model Accuracy, how often is the classifier correct?
# print("Model Accuracy: ", metrics.accuracy_score(y_test, y_pred, normalize = True))
print("Model Accuracy (30% test split, accuracy_score): ", metrics.accuracy_score(y_test, y_pred))

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO # Standard library replacement for sklearn.externals.six
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, max_depth=15,
                feature_names=feature_cols, class_names=['Below','Above'],
                label='all', filled=True, rounded=True,
                special_characters=True, proportion=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('income_dataset_decision_tree.png')
Image(graph.create_png())

<div class="alert alert-block alert-info">
<b>ML Technique 1: Justification for Decision Tree Classification:</b>

The target of our supervised dataset is a categorical attribute, for which classification is one of the most suitable approaches. Hence, we used the Decision Tree classification method as our first option.

</div>
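When judging a decision tree, it helps to fit on the training split only and compare training accuracy with accuracy on the held-out split. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the income dataset, and the hyperparameters are arbitrary small values chosen for the toy sample size.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the income data (roughly 10% positives)
X_toy, y_toy = make_classification(n_samples=1000, n_features=8,
                                   weights=[0.9, 0.1], random_state=1)

# stratify keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=1, stratify=y_toy)

toy_clf = DecisionTreeClassifier(criterion="entropy", max_depth=5,
                                 min_samples_leaf=20, random_state=1)
toy_clf.fit(X_tr, y_tr)  # the test split is never seen during training

train_accuracy = toy_clf.score(X_tr, y_tr)
test_accuracy = toy_clf.score(X_te, y_te)
print("train accuracy:", train_accuracy, "test accuracy:", test_accuracy)
```

A large gap between the two numbers would point to overfitting; constraints such as `max_depth` and `min_samples_leaf` are the usual levers for narrowing it.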

## 6.2 ML technique 2 + Justification

In [None]:
# Apriori: find frequent set of attributes pattern which implies income_above/below limit.
#pip install mlxtend
#Importing Libraries
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt

f_list = list()
# Pick features with < 30 unique values to keep space complexity within limits.
for entry in feature_cols:
    if Dataframe[entry].nunique() < 30:
        print(entry)
        f_list.append(entry)
f_list.append('income_above_limit')

df1 = Dataframe[f_list].copy()
display(df1)

# one hot encoding
df2 = pds.get_dummies(data=df1, columns=f_list)
display(df2)


freq_items = apriori(df2, min_support=0.7, use_colnames=True, verbose=1)
#pds.set_option('display.max_colwidth', -1) 
display(freq_items)

# Creating association rules
rules = association_rules(freq_items, metric="confidence", min_threshold=0.7)
#pds.set_option('display.max_colwidth', -1) 
display(rules)


#visualization
plt.scatter(rules['support'], rules['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()

<div class="alert alert-block alert-info">
<b>ML Technique 2: Justification/Observation for Association Analysis:</b>

1. The algorithm was executed with min_support = 0.7 and min_confidence = 0.7.
2. Not a single rule shows a pattern with target = 'income_above_limit', which is the business use case requirement.
3. All rules find patterns with target = 'income_below_limit'.
4. The dataset has very sparse entries (less than 5%) with target label 'income_above_limit', so the frequent pattern algorithm does not yield a meaningful set of rules for that label.
5. Based on the above observations, we are dropping the frequent pattern algorithm from further analysis.

</div>
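The reason no rule can involve the rare label follows from the definition of support: an itemset can never be more frequent than its rarest item. The sketch below works through this on 100 invented records with 5% positives; the column names and the `support` helper are hypothetical and are not part of mlxtend.

```python
import pandas as pds

# 100 invented records: 5 above the limit, 80 in full-time employment
toy = pds.DataFrame({
    "income_above": [True] * 5 + [False] * 95,
    "full_time": [True] * 5 + [True] * 75 + [False] * 20,
})
toy["income_below"] = ~toy["income_above"]

def support(df, items):
    """Fraction of records in which every listed item is present (hypothetical helper)."""
    return df[list(items)].all(axis=1).mean()

support_above = support(toy, ["income_above"])                    # bounded by the 5% class share
support_above_rule = support(toy, ["full_time", "income_above"])  # cannot exceed support_above
support_below_rule = support(toy, ["full_time", "income_below"])

# Confidence of the rule full_time -> income_above
confidence_above = support_above_rule / support(toy, ["full_time"])

print(support_above, support_above_rule, support_below_rule, confidence_above)
```

With a minimum support of 0.7, only itemsets built from the majority label can qualify; surfacing rules for the rare label would need a much lower support threshold or mining within the positive records only.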

# 7. Conclusion

Compare the performance of the ML techniques used.

Derive values for performance study metrics like accuracy, precision, recall, F1 Score, AUC-ROC etc. to compare the ML algorithms and plot them. A proper comparison based on different metrics should be done and not just accuracy alone; only then does the comparison become authentic. You may use a confusion matrix, classification report, word cloud etc. as per the requirement of your application/problem.

Score 1 Mark
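To see why accuracy alone is not enough on an imbalanced target, the metrics can be derived by hand from the confusion matrix counts. The sketch below uses invented labels with 5% positives; `classification_metrics` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from raw confusion counts (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# 95 negatives followed by 5 positives
y_true_toy = [0] * 95 + [1] * 5

# Model A always predicts the majority class
majority = classification_metrics(y_true_toy, [0] * 100)

# Model B raises 6 false alarms but finds 4 of the 5 positives
better = classification_metrics(y_true_toy, [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1)

print(majority)
print(better)
```

Model A wins on accuracy while finding none of the positives, which is why precision, recall and F1 need to be reported alongside it.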

In [None]:
##Calculation for accuracy, precision, recall, F1 Score, AUC-ROC, confusion matrix, classification report and plotting them
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report)

#Calculating the Accuracy Decision tree classifier
Accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model is: ",Accuracy)

#Calculating the precision for Decision tree classifier
Precision = precision_score(y_test, y_pred)
print("Precision of the model is: ",Precision)

#Calculating the recall score for Decision tree classifier
Recall_score = recall_score(y_test, y_pred)
print("Recall score of the model is: ",Recall_score)    

#Calculating the F1 score for Decision tree classifier
F1_score = f1_score(y_test, y_pred)
print("F1 score of the model is: ",F1_score)

#Calculating the ROC_AUC for Decision tree classifier
y_score = clf.predict_proba(X_test)[:, 1]
ROC_AUC = roc_auc_score(y_test, y_score)
print("ROC_AUC of the model is: ",ROC_AUC)

#plot the metrics
plt.bar(['Accuracy', 'Precision', 'Recall', 'F1 Score', 'AUC-ROC'], [Accuracy, Precision, Recall_score, F1_score, ROC_AUC])
plt.xlabel('Performance Metrics')
plt.ylabel('Score')
plt.title('Performance Metrics for Decision Tree Classifier')
plt.show()

#Computing the confusion matrix for Decision tree classifier
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion matrix of the model is \n",conf_matrix)

#Computing the classification report for Decision tree classifier
Class_report = classification_report(y_test, y_pred)
print("Classification report of the model is \n",Class_report)



# 8. Solution

What is the solution that is proposed to solve the business problem discussed in Section 1? Also share your learnings while working through the problem in terms of challenges, observations, decisions made etc.

Score 2 Marks

<div class="alert alert-block alert-info">
<b>Solution:</b>

We can use this trained model to classify our bank customers by their financial capacity and accordingly design/prescribe credit card programs in the future.
The attributes that help to classify the customers are derived through the data preparation process and the feature selection/ML techniques we evaluated and applied above.

</div>

<div class="alert alert-block alert-info">
<b>Learnings:</b>

* Python version, API and usage issues.
* Mac and Windows compatibility issues.
* A few ML topics, such as Anomaly Detection, were still in progress in the course, so we had to cover those topics in advance.

</div>

## NOTE
All late submissions will incur a penalty of -2 marks. Do ensure on-time submission to avoid the penalty.

Good Luck!!!