# Python: Introduction to Machine Learning

## Import Libraries 

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

#from sklearn.preprocessing import LabelEncoder  
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

pd.set_option('display.max_columns', None) # Display all columns, even when the dataframe has many
# Display Matplotlib graphs within the notebook (and not as separate pop-up windows)
%matplotlib inline


## Import Data 

In [None]:
# How many columns & rows? 


## Exploratory Data Analysis (EDA)

- EDA is an important step in the ML/Data Science pipeline 
- Gain a high-level understanding of the data and its characteristics (data types, rows, columns, missing values, etc.)  
- This step helps provide guidance on how to pre-process the data to prep it for model building 
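
A minimal sketch of this first look at the data, assuming a pandas DataFrame named `df`. The column names and values below are a made-up stand-in for illustration, not the actual loan data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loan data (assumed column names, invented values).
df = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003", "LP004"],
    "Gender": ["Male", "Female", np.nan, "Male"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

print(df.shape)          # (number of rows, number of columns)
df.info()                # data type and non-null count for each column
summary = df.describe()  # summary statistics for the numeric columns
print(summary)
```

`df.info()` and `df.describe()` cover most of the high-level questions: how many rows and columns there are, which columns are numeric, and roughly how the numeric values are distributed.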

In [None]:
# Display data about the data (nulls, data types, rows/columns, etc.)


In [None]:
# Check for missing Values


In [None]:
# Display statistical summary for the data 


In [None]:
# List all the columns


In [None]:
# List of Unique Values in all of the categorical columns 


In [None]:
# Checking for any repeated records with regards to Loan ID


#### Let's Summarize! 
- Loan ID is the primary key in the data - it uniquely identifies each record 
- There are 614 rows, 13 columns
- The .describe() function can be used to quickly gauge some statistics about the data 
    - In some cases it can also help identify incorrect data (in a biometric dataset with heart rate, a minimum heart rate of 0 would call for investigation!)
- 7/13 columns have missing values 
- Credit History has the highest number of missing values! 
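
The checks behind these points can be sketched as follows. The DataFrame is a small invented example, and the column names `Loan_ID`, `Credit_History`, and `LoanAmount` are assumptions about how the fields are spelled in the data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample (assumed column names, invented values).
df = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003", "LP004"],
    "Credit_History": [1.0, np.nan, np.nan, 0.0],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
})

# A primary key must not contain repeated values.
has_duplicate_ids = df["Loan_ID"].duplicated().any()

# Keep only the columns with missing values, sorted from most to fewest.
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

print(has_duplicate_ids)
print(missing)
```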

#### Key Remarks 
- Understanding the data you are working with is very important! 
- Always strive to work with Subject Matter Experts (SMEs) to get insight into the data 
- In a real-world application, you may need to individually evaluate each column and its values to learn the context behind the data 

## Data Analysis / Data Visualization
- Investigate to find relationships and trends within the data 
- Certain features may be more prominent in determining whether the applicant's loan will be approved or not 
- Data Visualization can help reveal key information in the data 
    - Knowing which graphs to use is a key skill that comes with practice and experience! 
- A good starting point is to compare different features against the label (Loan Status) to see if there are any easily distinguishable relationships
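
One way to compare a categorical feature against the label is a normalized cross-tabulation. The sketch below uses invented data; the helper name `approval_rate_by` and the column names `Gender` and `Loan_Status` are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical sample (assumed column names, invented values).
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

def approval_rate_by(data, feature, label="Loan_Status"):
    """Return the percentage of each label value within each category of `feature`."""
    return pd.crosstab(data[feature], data[label], normalize="index") * 100

rates = approval_rate_by(df, "Gender")
print(rates.round(1))
```

Calling `rates.plot(kind="bar", stacked=True)` on the result draws the percentages as a stacked bar chart, and the same helper can be reused for any other categorical column.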

In [None]:
# Number of Approved & Not Approved (Y/N) records 


#### Gender vs Approval

In [None]:
# Let's understand how different

In [None]:
# Percent of Y/N for the genders in the dataset


In [None]:
# Let's plot the percentage

In [2]:
# Let's write this as a function to make it easy to check the loan status against all the parameters

In [3]:
# Test the function 

#### All Categorical Features vs Approval

In [4]:
#Select the data we want to test


#### Continuous Features 

In [5]:
# Check Continuous variables


In [6]:
# ApplicantIncome
# CoapplicantIncome
# LoanAmount
# Loan_Amount_Term


In [7]:
# Histogram


In [8]:
# Let's look at correlation next
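
As a rough sketch of the correlation step, the four continuous columns named above can be passed to `DataFrame.corr()`. The values here are invented for illustration:

```python
import pandas as pd

# Hypothetical sample values for the four continuous columns.
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0, 0.0],
    "LoanAmount": [130.0, 128.0, 66.0, 120.0, 141.0],
    "Loan_Amount_Term": [360.0, 360.0, 360.0, 360.0, 180.0],
})

continuous_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term"]

# Pairwise Pearson correlation between the continuous columns.
corr = df[continuous_cols].corr()
print(corr.round(2))
```

The resulting matrix can be drawn with `sns.heatmap(corr, annot=True)` to make strong positive or negative relationships easier to spot.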

## Model Development

### Data Preparation

#### Null Values
- There are many ways to deal with NULL values, and the choice can have a significant impact on how your model performs
    - Deleting rows
    - Replacing with Mean, Median, Mode
    - Imputing values (KNN, ML algorithms, etc.) 
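
Two of these options, mean replacement and row deletion, can be sketched on a small invented example. The column names are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical sample (assumed column names, invented values).
df = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 200.0, 150.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# Option 1: replace missing values in one column with that column's mean.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

# Option 2: delete any rows that still contain a missing value.
df = df.dropna().reset_index(drop=True)

print(df)
```

For model-based imputation, scikit-learn provides `sklearn.impute.KNNImputer`, which fills each gap from the most similar rows instead of a single column-wide statistic.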

In [9]:
# Let's re-check columns with null values 


In [10]:
# Let's investigate the Loan Amount 


In [11]:
# Let's investigate the Loan Amount field 


In [12]:
# Replace Loan Amount NULL values with Mean


In [13]:
# Remove remaining records with Null values


In [14]:
#Confirm it worked

In [15]:
#Check for duplicate entries


In [16]:
#What's the shape of the new data?


#### Encoding Categorical Values
- Most ML models can only work with numerical values 
- Categorical data has to be encoded as numbers before it can be used in a model 
- Common techniques: Ordinal Encoding & One-Hot Encoding
    - We will use the **get_dummies()** function in Pandas to do this; however, when building ML for projects, the **LabelEncoder & OneHotEncoder** classes in Sklearn are recommended 
    - get_dummies() produces the same one-hot result and makes the concept quicker to visualize
- When dealing with categorical data in production, additional solutions/algorithms may be required to deal with unseen categorical values
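
A small sketch of both encoding steps, using invented rows. The column names `Gender`, `Property_Area`, and `Loan_Status` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample (assumed column names, invented values).
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Encode the binary label as integers.
df["Loan_Status"] = df["Loan_Status"].map({"Y": 1, "N": 0})

# One-hot encode the categorical features; each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["Gender", "Property_Area"], dtype=int)
print(encoded.columns.tolist())
```

When `columns=` is passed, `get_dummies()` replaces the listed columns with their encoded versions. If it is instead applied to one column at a time and the result is joined back, the original text column still has to be dropped afterwards.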

In [17]:
# We need to replace string data (Y,N), with numbers


In [18]:
# One-Hot Encode the features using get_dummies() function in Pandas 


#### Feature Selection
- After analyzing the data, select the features you will use to help build the model 
- You do not always need to use every single feature. With lots of data, removing unnecessary features can save processing time, save costs, and even improve model performance
- Since the categorical features have been encoded, drop the respective non-encoded categorical columns 
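
Dropping columns can be sketched like this. The DataFrame is invented, and it assumes `Gender` was already encoded into a `Gender_Male` column:

```python
import pandas as pd

# Hypothetical sample (assumed column names, invented values).
df = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002"],
    "Gender": ["Male", "Female"],
    "Gender_Male": [1, 0],
    "ApplicantIncome": [5849, 4583],
})

# Loan_ID only identifies a record, and Gender is already represented by Gender_Male.
selected = df.drop(columns=["Loan_ID", "Gender"])
print(selected.columns.tolist())
```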

In [19]:
# Obvious parameter to drop


### Model Development

In [20]:
# Separate data into target and features


In [21]:
# Create model instances 


#### Model Training, Testing, & Evaluation

In [22]:
#Model evaluation tools
# RocCurveDisplay replaces plot_roc_curve, which was removed in scikit-learn 1.2
from sklearn.metrics import classification_report, roc_auc_score, RocCurveDisplay
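
A hedged sketch of the train, test, and evaluate loop. Because the prepared loan data is not shown here, a synthetic dataset from `make_classification` stands in for the features and label. The choice of Logistic Regression and K-Nearest Neighbors is an assumption based on the classifiers imported at the top of the notebook, as are the split size and hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared loan features (X) and label (y).
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# Hold out 20% of the rows for testing, keeping the class balance the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # learn from the training split
    y_pred = model.predict(X_test)              # hard class predictions
    y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    scores[name] = roc_auc_score(y_test, y_prob)
    print(name)
    print(classification_report(y_test, y_pred))
    print(f"ROC AUC: {scores[name]:.3f}")
```

Evaluating on a held-out test split, not on the training rows, gives a more honest estimate of how each model would behave on new applications.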

#### So What? 

- From the results above we can see that Logistic Regression performs better 
- Going forward, this will be the model we select to make predictions on whether someone will be given or denied a loan

#### Remarks

- The average score is not always a true representation of how good a model is, especially for classification
- What if the model has to distinguish between apples & oranges, given there are 90 apples & 10 oranges? 
    - If the model correctly classifies all 90 apples, but only 5/10 oranges are correctly classified, the model would still have a high accuracy (95%) even though it clearly cannot be trusted to properly classify oranges 
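
The apples & oranges scenario can be worked through directly. This sketch builds the true and predicted labels described above and compares overall accuracy with recall on the orange class:

```python
from sklearn.metrics import accuracy_score, recall_score

# 90 apples and 10 oranges; all apples are classified correctly, but only 5 oranges are.
y_true = ["apple"] * 90 + ["orange"] * 10
y_pred = ["apple"] * 90 + ["orange"] * 5 + ["apple"] * 5

accuracy = accuracy_score(y_true, y_pred)                         # 95 correct out of 100
orange_recall = recall_score(y_true, y_pred, pos_label="orange")  # 5 found out of 10

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall (orange): {orange_recall:.2f}")
```

Accuracy comes out at 0.95 while recall on oranges is only 0.50, which is why per-class metrics such as those in `classification_report` are worth reading alongside the overall score.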

## Considerations 
 - Test out different algorithms -> Support Vector Machine
 - Iterate over the feature selection process
 - Feature Engineering: Develop your own features from the available data 
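
As an illustration of feature engineering, the sketch below derives two new columns from existing ones. `TotalIncome` and `LoanToIncome` are hypothetical features chosen for this example, and the values are invented:

```python
import pandas as pd

# Hypothetical sample values for three of the continuous columns.
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583],
    "CoapplicantIncome": [0.0, 1508.0],
    "LoanAmount": [130.0, 128.0],
})

# Hypothetical engineered features: combined income, and loan size relative to it.
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["LoanToIncome"] = df["LoanAmount"] / df["TotalIncome"]

print(df)
```

Whether such features actually help is an empirical question, so retrain and re-evaluate the models after adding them.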
