<a href="https://colab.research.google.com/github/TessM2/MLTutorial/blob/main/WashUmachinelearningtutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


**Machine Learning: Building a Model**

Today we're going to walk through the process of training a machine learning model to make some predictions.

We're going to train a model to predict whether people will (or "should," from the perspective of a bank or lender) receive a loan or not, based on some data we already have about people who did and did not receive loans.

But first, before we can "do" Machine Learning, we have to understand a bit about Python....


**The world's briefest introduction to Python**

First of all, the thing we're writing in right now is called a **notebook**; it's the place where we write code. This notebook is run by Google Colab.

Each panel in this notebook, like the one right below this note, is called a "cell"; we write code in cells. To add a new cell, press the +Code button. Go ahead and add a new cell under the blank cell below.

We can write two types of things in cells. First, we can write code, which runs when we press the play button next to a cell. Here's a sample of a very simple line of code that prints out a word in quotes. The word in quotes is called a string.

In [None]:
print("Hello")

Write a line of code to print a string of your name below

Second, we can write comments, which come after a hashtag (#) and look like this. They will not run as code.

In [None]:
#Hello I'm writing something after a hashtag. What do you think this is for?

Besides strings, the words in quotes that we just saw, there are also some other important types of objects in Python. There are numbers, which come in the form of integers (whole numbers) and floats (numbers with a decimal place). And there are also variables. Variables are words that you set equal to, or make stand for, something else. Just like how in math a variable X stands for something else, a variable we set equal to something in Python comes to stand for it. We do it like this:

In [None]:
#create variable Tess that equals 2
Tess = 2
#now print Tess, to show it is just a placeholder for 2
print(Tess)
#now add one to Tess, to make her 3
Tess = Tess + 1
#now print Tess to show she's 3
print(Tess)
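
The object types mentioned above (integers, floats, and strings) can be checked with Python's built-in `type()` function. This is just a small sketch; the variable names and values are made up for illustration.

```python
# Hypothetical example values, chosen only for illustration
whole_number = 2          # an integer
decimal_number = 2.5      # a float
word = "Hello"            # a string

# type() reports what kind of object each variable stands for
print(type(whole_number))
print(type(decimal_number))
print(type(word))
```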

We can do a lot of things with the basic Python language. As I just showed you, we can do math with numbers, and we can set variables equal to integers (or to floats or strings).

One thing we often like to do is create a "loop" to do something over and over again, which looks like this:

In [None]:
#This loop will add 1 to Tess 5 times
for i in range(5):
  Tess = Tess + 1

#let's print Tess, who should now be 5 bigger
print(Tess)
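
If you're wondering what the `i` in the loop is doing, here is a small sketch that prints it on every pass. It uses a separate, made-up variable called `counter` so that Tess is left alone.

```python
# A separate counter (hypothetical name) so Tess is not changed
counter = 0
for i in range(5):
    # i takes the values 0, 1, 2, 3, 4, one per pass through the loop
    print("pass number", i)
    counter = counter + 1

# the loop ran 5 times, so counter went up by 5
print(counter)
```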

But the basic coding operations of Python can't do everything, and often when we code we want to do really specific, complicated tasks. For that, we import what are called packages into Python; they're like little applications, each predesigned to do a different kind of thing.

One very popular package for machine learning is called scikit-learn (its name in code is `sklearn`). Here's how we would "import" that package into Python, so we can use it:

In [None]:
import sklearn

**Machine Learning with Scikit-learn**

OK! Now we're ready to go. We know everything about Python (joke) and we've imported scikit-learn, so it's time to do some machine learning. Don't worry, you're not meant to understand this code fully (unless you already know Python), but let's see if we can at least understand the steps of the process, and maybe even a bit of the code, too.

We're going to build a model, as we said, to predict people's loan status (Yes or No, for a loan)

In [None]:
#First, let's import some packages we'll need

#for tabular data manipulation
import pandas as pd
#for numerical computations
import numpy as np
#for plotting and visualizing data
import matplotlib.pyplot as plt
#it's complicated; seaborn is built on top of matplotlib (a "wrapper") and works nicely with pandas
import seaborn as sns
  

**Step One: Getting the Data**

The first thing we always need to do is get the data.

In [None]:
#Now, let's download the dataset we're going to use to train the model as a zipped archive
import urllib.request
urllib.request.urlretrieve("https://github.com/TessM2/loanpredictiondata/archive/refs/heads/main.zip", "dataset.zip")
import zipfile
with zipfile.ZipFile("dataset.zip", 'r') as zf:
    zf.extractall()
#Clean up after ourselves
import os
os.remove("dataset.zip")

In [None]:
#let's get our file and take a look at it
data = pd.read_csv('loanpredictiondata-main/LoanApprovalPrediction.csv')
data

**Step Two: Cleaning and Examining the Data**

After we get data, we have to "clean" it; that means putting it into the form we need.

We also probably want to explore or look around at our data a bit, to see what to expect

In [None]:
#let's get rid of a column that won't help us
data.drop(['Loan_ID'],axis=1,inplace=True)

In [None]:
#let's also get rid of rows that have missing values
data.dropna(inplace=True)
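
Before dropping rows, it can help to see how many values are actually missing. The sketch below uses a tiny made-up table (not the loan dataset, and with hypothetical column values) to show how `isna().sum()` counts missing values and what `dropna()` leaves behind.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, not the loan dataset
toy = pd.DataFrame({
    "Income": [5000, 3000, np.nan, 4000],
    "Married": ["Yes", None, "No", "Yes"],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows with no missing values
toy_clean = toy.dropna()
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

The same two calls should work on `data` itself, if you want to check how many rows the cell above removed.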

In [None]:
#let's make a list of our categorical variables
categorical_variables = ["Gender", "Married", "Education", "Self_Employed", "Property_Area", "Loan_Status"]

In [None]:
#let's visualize each of our categorical columns as a bar plot of counts

plt.figure(figsize=(18,36))

for index,variable in enumerate(categorical_variables):
  plt.subplot(11,4,index+1)
  sns.countplot(data=data, x = variable)

We're focusing on the categorical variables for a reason. Let's talk about why. 

In [None]:
#change the categorical (or binary) columns into integer codes (0, 1, and so on)

# Import label encoder
from sklearn import preprocessing
    
# label_encoder object knows how 
# to understand word labels.
label_encoder = preprocessing.LabelEncoder()
for variable in categorical_variables:
  data[variable] = label_encoder.fit_transform(data[variable])
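
To see which word gets which number, here is a small sketch on a made-up column of loan decisions. It assumes the labels are the letters "Y" and "N"; the variable names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column of loan decisions
toy_status = ["Y", "N", "Y", "Y", "N"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_status)

# classes_ lists the original labels in sorted order;
# each label's position in that list is the integer it was given
print(list(encoder.classes_))
print(list(encoded))
```

Since the labels are sorted alphabetically, "N" becomes 0 and "Y" becomes 1 here. Note that the cell above reuses a single encoder for every column, so afterwards its `classes_` only describes the last column it was fit on.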

In [None]:
data

In [None]:
#let's keep exploring our data by making a heatmap. Can you figure out what this chart means, and why it might be helpful?

plt.figure(figsize=(12,6))

sns.heatmap(data.corr(), annot=True)

In [None]:
#what do we do about the fact that there's some correlation (.53) between two of our variables?
#tl;dr: we don't know. You might try it either way (keeping both variables, or dropping one)
#correlated variables can be problematic, but we're not going to worry about it now (and it's not much of a correlation here, anyway)
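
If the numbers in the heatmap feel mysterious, this sketch shows where they come from. It uses a made-up table (hypothetical columns and values, not the loan data) and the same `corr()` call that feeds the heatmap.

```python
import pandas as pd

# Hypothetical toy table, not the loan dataset
toy = pd.DataFrame({
    "income": [1, 2, 3, 4, 5],
    "loan_amount": [2, 4, 6, 8, 10],   # always exactly double the income
    "term": [5, 3, 4, 1, 2],           # tends to go down as income goes up
})

# corr() computes the correlation between every pair of columns
toy_corr = toy.corr()
print(toy_corr)
```

Two columns that rise together perfectly get a correlation of 1, columns that move in opposite directions get a negative number, and every column has a correlation of 1 with itself.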

The time has come! It's time to train our model. First, we have to put the data into the types of groupings we need it in, to get ready

We divide data up into a training set, and a testing set; let's talk about why...

In [None]:
from sklearn.model_selection import train_test_split
  
X = data.drop(['Loan_Status'],axis=1)
Y = data['Loan_Status']
  
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.4,
                                                    random_state=1)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
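
To make the split easier to picture, here is the same function on a tiny made-up dataset of 10 rows. The `toy_` names and values are hypothetical; only `train_test_split` and its `test_size` and `random_state` settings come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows with 2 features each, plus 10 labels
toy_X = np.arange(20).reshape(10, 2)
toy_Y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.4 sets aside 40% of the rows for testing;
# random_state fixes the shuffle so the split is the same on every run
toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = train_test_split(
    toy_X, toy_Y, test_size=0.4, random_state=1
)

print(toy_X_train.shape, toy_X_test.shape)
```

With 10 rows, 6 end up in the training set and 4 in the testing set.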

Now that our data is all split up, it's time to train the model on our training sets. But wait! What kind of model should we use?

This is a big issue, and it's called model selection

A lot goes into model selection, but often we just test a lot of models and see which performs best (has the highest accuracy, or, gets most of the labels right)
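
Accuracy itself is simple enough to compute by hand. Here is a sketch with made-up labels and predictions (hypothetical values, not from any real model):

```python
# Hypothetical true labels and model predictions (1 = loan, 0 = no loan)
true_labels = [1, 0, 1, 1, 0, 1, 0, 1]
predictions = [1, 0, 0, 1, 0, 1, 1, 1]

# Count how many predictions match the true label
correct = sum(1 for truth, guess in zip(true_labels, predictions) if truth == guess)

# Accuracy is the share of labels the model got right
accuracy = correct / len(true_labels)
print(accuracy)
```

Here the model gets 6 of the 8 labels right, so the accuracy is 0.75, or 75%.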

Let's try the training with a few models, and measure accuracy...

But first, let's talk about what these models are. We're going to use three that are (sort of?) intuitive: a k-nearest neighbors model, a random forest model, and a decision tree. Let's talk about what they are, and then see which model performs best...

In [None]:
#First we instantiate the models. This is basically just getting them ready to go/taking them off the shelf


from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
  
from sklearn import metrics
  
knn = KNeighborsClassifier(n_neighbors=3)
rfc = RandomForestClassifier(n_estimators = 7,
                             criterion = 'entropy',
                             random_state =7)
dt = DecisionTreeClassifier()
  
#in the real world you'd be tuning what are called hyperparameters. These are settings of the models that you control
#for k-nearest neighbors, e.g., the number of neighbors (n_neighbors)
#for random forest, the number of estimators (n_estimators); a random forest is a whole lot of decision trees, and each estimator is one tree
#for decision tree, we can decide the "maximum depth" (max_depth)
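
As a rough sketch of what changing one of these settings does, the cell below trains two decision trees that differ only in `max_depth`. It uses synthetic practice data from `make_classification` (an assumption for illustration, not the loan data), and all the `toy_` and tree names are made up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic practice data (an assumption for illustration, not the loan data)
toy_X, toy_Y = make_classification(n_samples=300, n_features=6, random_state=0)
toy_X_train, toy_X_test, toy_Y_train, toy_Y_test = train_test_split(
    toy_X, toy_Y, test_size=0.4, random_state=1
)

# Same model, two settings of the max_depth hyperparameter
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)  # no depth limit

for name, tree in (("max_depth=3", shallow_tree), ("no limit", deep_tree)):
    tree.fit(toy_X_train, toy_Y_train)
    # score() returns accuracy as a fraction between 0 and 1
    print(name,
          "| depth:", tree.get_depth(),
          "| training accuracy:", tree.score(toy_X_train, toy_Y_train),
          "| testing accuracy:", tree.score(toy_X_test, toy_Y_test))
```

The tree with no depth limit keeps splitting until it fits its training rows perfectly, but that does not guarantee it does better on the testing rows. That gap is what tuning is meant to explore.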

In [None]:
#now we train each model and get its training score and testing score
#the training score is the "in-sample" accuracy: after we train a model on the training set, we test its accuracy on the same set it was trained on
#which is why the decision tree, as we'll see, does so well there; it builds a tree that fits the training set perfectly (or almost perfectly)
#the testing score is the accuracy score on the new testing set, which the model has not seen

#Which model works best???

for clf in (rfc, knn, dt):
    clf.fit(X_train, Y_train)
    Y_pred = clf.predict(X_train)
    print("Training accuracy score of ",
          clf.__class__.__name__,
          "=",100*metrics.accuracy_score(Y_train, 
                                         Y_pred))
    Y_pred = clf.predict(X_test)
    print("Testing accuracy score of ",
          clf.__class__.__name__,"=",
          100*metrics.accuracy_score(Y_test,
                                    Y_pred))

But what on earth are these models doing? Let's take a look at the decision tree model to see a sample of one process

In [None]:
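#import the function that plots a decision tree
#dt is now the trained decision tree from the past cell (even though it was called clf in the loop)
from sklearn.tree import plot_tree
plt.figure(figsize=(60,40))
#class_names must follow the encoded order; LabelEncoder sorts labels alphabetically, so (assuming N/Y labels) 0 is "no" and 1 is "yes"
#feature_names is passed as a plain list, which every scikit-learn version accepts
plot_tree(dt, feature_names=list(X_train.columns), class_names=["no", "yes"], fontsize=12);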


