<a href="https://colab.research.google.com/github/roitraining/techtrek-python/blob/main/Module04-Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning with Python

### The following is a list of some of the most popular packages used in ML.
* Sometimes we might build a machine with different combinations, but for our examples we'll just install them all.
* Note these are already pre-installed, so don't run the following. It's just there for reference.
  * If you do this on your own in a Colab Notebook or one of your own, you'd have to make sure these packages are installed.

In [None]:
! pip install pandas scikit-learn tensorflow tensorboard matplotlib seaborn theano bokeh keras nltk joblib pyspark torch


## Cluster Analysis
* Cluster Analysis is an unsupervised model to help identify natural patterns that may exist in data.
* It is often a preliminary step used in Exploratory Data Analysis (EDA) to assist ML Practitioners in understanding their data.

### Let's create a random data set with two features that is clustered around three center points.
* Normally we'd read a real data set from a file, but for learning purposes this random set will be clearer.
* The <font color='blue' face="Courier New" size="+1">x</font> data represents two features, such as Age and Income, Temperature and Humidity.
* The <font color='blue' face="Courier New" size="+1">y</font> data indicates which group or cluster it belongs to.

In [None]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Creating a sample dataset with 3 clusters
x, y = make_blobs(n_samples=400, n_features=2, centers=3)

display(x[:10]) # features
display(y[:5]) # cluster member


### The raw numbers are hard to understand, so let's plot them first to visualize the data
* Doing so we can clearly see the data tends to have natural groupings.
* With real data and with many more dimensions, this is much harder to see but this is meant to help us understand the concept. From here, we can do our best to imagine it in multiple dimensions.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
plt.plot(x[:,0],x[:,1],'o')
plt.show()


### With real data we would not have the <font color='blue' face="Courier New" size="+1">y</font> data that indicates which cluster each point belongs to, so we would run a cluster model to help figure that out.
* The algorithm does this by calculating the distance from each point to each cluster's center point, assigning the point to the closest center, and then recalculating the centers, repeating until the groups stop changing. Sort of like a "birds of a feather flock together" concept.
* To do this, we tell it we want to use a <font color='blue'>cluster algorithm</font>, in this case <font color='blue'>KMeans</font>.
  * It's one of several different algorithms that could do this kind of analysis using different math.
* The <font color='blue' face="Courier New" size="+1">fit</font> method tells it to start doing the necessary calculations. This is often known as the <font color='blue'>training phase</font>.
* When all the math is done, it returns which cluster it thinks each point belongs to and what is the center point of each cluster.
* We do need to have an idea of how many clusters we think there should be. In this case, it is three. Changing this value will change the outcome and it just takes experimenting to get the right value.
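
For intuition, the core loop behind k-means can be sketched in a few lines of NumPy. This is a simplified illustration and not scikit-learn's actual implementation; the function name `simple_kmeans` and the toy points are made up for this example.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=20, seed=0):
    """Toy k-means: alternate between assigning points and moving centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center: shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Each point joins the cluster of its nearest center
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if it has none)
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# Two well-separated toy groups (made-up data)
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centers = simple_kmeans(toy, k=2)
print(toy_labels)
print(toy_centers)
```

The real `KMeans` class adds smarter starting points, a stopping rule, and several restarts, but the assign-then-average idea is the same.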

In [None]:
from sklearn import cluster
CLUSTERS = 3
k_means = cluster.KMeans(n_clusters=CLUSTERS, random_state = 12)
print('labels_' in dir(k_means))
k_means.fit(x)
print('labels_' in dir(k_means))
print(k_means.labels_)
print(k_means.cluster_centers_)

### Let's combine this new information about clusters to make the plot clearer.
* We'll use the cluster number the model gave us to change the point's color.
* We'll use the center points of each cluster the model calculated to mark an X.

In [None]:
%matplotlib inline
def plot_cluster(model, data, clusters):
    labels = model.labels_
    centroids = model.cluster_centers_

    for i in range(clusters):
        ds = data[np.where(labels==i)]
        # plot the data observations
        plt.plot(ds[:,0],ds[:,1],'o')
        # plot the centroids
        lines = plt.plot(centroids[i,0],centroids[i,1],'kx')
    plt.show()

plot_cluster(k_means, x, CLUSTERS)


### Guessing how many clusters is appropriate is tricky, but we can use an <font color='blue'>elbow chart</font> to help figure out how many clusters might be the most natural and combine that with our skills as an analyst and knowledge of the business use case.

In [None]:
def plot_elbow(data, cluster_cnt = 6):
   CLUSTERS = range(1, cluster_cnt + 1)
   kmeans = [cluster.KMeans(n_clusters=i) for i in CLUSTERS]

   score = [kmeans[i].fit(data).score(data) for i in range(len(kmeans))]
   #print(score)
   plt.plot(CLUSTERS ,score)
   plt.xlabel('Number of Clusters')
   plt.ylabel('Score')
   plt.title('Elbow Curve')
   plt.xticks(np.arange(1, cluster_cnt + 1, 1))
   plt.show()

plot_elbow(x)

* In this case, the elbow indicates that three is probably the right choice.
* There are other techniques that can do this in more detail.
* Below is an example of a <font color='blue'>silhouette chart</font>.
* You can expand this and explore it on your own later.
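
If you only want the numbers and not the charts, a compact comparison might look like the sketch below. It assumes a fresh synthetic data set (the centers and `random_state` values are arbitrary choices for illustration) and prints the inertia and the average silhouette score for each cluster count.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (assumed settings)
demo_x, _ = make_blobs(n_samples=300, n_features=2,
                       centers=[[-8, -8], [0, 0], [8, 8]],
                       cluster_std=1.0, random_state=42)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(demo_x)
    # inertia_ is the sum of squared distances to the nearest center;
    # KMeans.score() returns the negative of this quantity
    sil = silhouette_score(demo_x, km.labels_)
    results[k] = (km.inertia_, sil)
    print(k, round(km.inertia_, 1), round(sil, 3))

# Higher silhouette means tighter, better separated clusters
best_k = max(results, key=lambda k: results[k][1])
print('Best k by silhouette:', best_k)
```

Inertia keeps shrinking as clusters are added, which is why we look for the elbow, while the silhouette score tends to peak near the most natural cluster count.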

#### Just copy this code into the empty code block below and run it.
<br>

<details><summary>Click for <b>code</b></summary>
<p>

```python
%matplotlib inline

def silhouette_plot(data, count = 6):
   from sklearn.datasets import make_blobs
   from sklearn.cluster import KMeans
   from sklearn.metrics import silhouette_samples, silhouette_score

   import matplotlib.pyplot as plt
   import matplotlib.cm as cm
   import numpy as np

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
#X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1, center_box=(-10.0, 10.0), shuffle=True, random_state=1)  # For reproducibility

   range_n_clusters = range(2, count + 1)

   for n_clusters in range_n_clusters:
       # Create a subplot with 1 row and 2 columns
       fig, (ax1, ax2) = plt.subplots(1, 2)
       fig.set_size_inches(18, 7)

       # The 1st subplot is the silhouette plot
       # The silhouette coefficient can range from -1, 1 but in this example all
       # lie within [-0.1, 1]
       ax1.set_xlim([-0.1, 1])
       # The (n_clusters+1)*10 is for inserting blank space between silhouette
       # plots of individual clusters, to demarcate them clearly.
       ax1.set_ylim([0, len(data) + (n_clusters + 1) * 10])

       # Initialize the clusterer with n_clusters value and a random generator
       # seed of 10 for reproducibility.
       clusterer = KMeans(n_clusters=n_clusters, random_state=10)
       cluster_labels = clusterer.fit_predict(data)

       # The silhouette_score gives the average value for all the samples.
       # This gives a perspective into the density and separation of the formed
       # clusters
       silhouette_avg = silhouette_score(data, cluster_labels)
       print("For n_clusters =", n_clusters, "The average silhouette_score is :", silhouette_avg)

       # Compute the silhouette scores for each sample
       sample_silhouette_values = silhouette_samples(data, cluster_labels)

       y_lower = 10
       for i in range(n_clusters):
           # Aggregate the silhouette scores for samples belonging to
           # cluster i, and sort them
           ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

           ith_cluster_silhouette_values.sort()

           size_cluster_i = ith_cluster_silhouette_values.shape[0]
           y_upper = y_lower + size_cluster_i

           color = cm.nipy_spectral(float(i) / n_clusters)
           ax1.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.7)

           # Label the silhouette plots with their cluster numbers at the middle
           ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

           # Compute the new y_lower for next plot
           y_lower = y_upper + 10  # 10 for the 0 samples

       ax1.set_title("The silhouette plot for the various clusters.")
       ax1.set_xlabel("The silhouette coefficient values")
       ax1.set_ylabel("Cluster label")

       # The vertical line for average silhouette score of all the values
       ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

       ax1.set_yticks([])  # Clear the yaxis labels / ticks
       ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

       # 2nd Plot showing the actual clusters formed
       colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
       ax2.scatter(data[:, 0], data[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')

       # Labeling the clusters
       centers = clusterer.cluster_centers_
       # Draw white circles at cluster centers
       ax2.scatter(centers[:, 0], centers[:, 1], marker='o', c="white", alpha=1, s=200, edgecolor='k')

       for i, c in enumerate(centers):
           ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')

       ax2.set_title("The visualization of the clustered data.")
       ax2.set_xlabel("Feature space for the 1st feature")
       ax2.set_ylabel("Feature space for the 2nd feature")

       plt.suptitle(("Silhouette analysis for KMeans clustering on sample data with n_clusters = %d" % n_clusters), fontsize=14, fontweight='bold')

   plt.show()

silhouette_plot(x, 6)
```
</p>
</details>

## Regression Analysis
* Regression Analysis is a supervised model, meaning we must first train the model on data where the answer is already known, and then we can use it to make predictions.
* It is used whenever we want to predict a numerical value.
* Let's import a CSV with some housing sales data.
* We're also applying some formatting to make it easier to understand the numbers.

In [None]:
import pandas as pd
pd.options.display.float_format = '${:,.2f}'.format

USAhousing = pd.read_csv('USA_Housing.csv')
print(USAhousing.columns)
C = USAhousing.iloc[:20].style.format({'Avg. Area Income': '${:,.2f}'
                             , 'Avg. Area House Age': '{:,.1f}'
                             , 'Avg. Area Number of Rooms': '{:,.1f}'
                             , 'Avg. Area Number of Bedrooms': '{:,.1f}'
                             , 'Area Population': '{:,.0f}'
                             , 'Price': '${:,.2f}'})

display(C)

#display(USAhousing)




* From all these columns, we could come up with the hypothesis that income, house age, number of rooms and bedrooms, and population might affect the price. So we will keep those columns as features.
* The whole address is too unique to be a good feature.
  * We could parse it to extract state or town and that might influence the price.
  * But for simplicity, we will just leave it out.
  * There's usually some data we have that we don't include as a feature.
* So let's keep the columns we want as features, indicate that price is the target and then split this into two sets.
  * The training set is used to do the calculations for the regression.
  * The testing set is used to see how good of a job the training did by comparing the actual price to the predicted price.
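
To see what the split does on its own, here is a tiny sketch with made-up arrays in place of the housing data. It shows the 60/40 division and that each feature row stays paired with its target value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 feature columns, and one target value per row
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

train_f, test_f, train_t, test_t = train_test_split(
    features, target, test_size=0.4, random_state=101)

print(train_f.shape, test_f.shape)   # (6, 2) (4, 2)
# Row i of the features was [2*i, 2*i + 1], so this checks the pairing survived
print((train_f[:, 0] == 2 * train_t).all())
```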

In [None]:
from sklearn.model_selection import train_test_split

x = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']

trainX, testX, trainY, testY = train_test_split(x, y, test_size = 0.4, random_state = 101)


* Now we can train it with the feature and target training set.
* Then we can use the reserved testing set to get values for what that model would predict the price should be if it had to guess.

In [None]:
from pandas import DataFrame
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(trainX, trainY)
predictions = lm.predict(testX)

df = DataFrame({'Actual': testY[:10], 'Predicted': predictions[:10]})


display(df)

* Just by eyeballing it, we can see it did an OK job overall; some are pretty close and some are way off.
* Maybe our model needs more or better data, either more features or more rows.
* Sometimes we might try using different algorithms or changing <font color='blue'>hyperparameters</font> that affect how an algorithm might work.
* Either way, we first need to quantify how well this model did with some metrics.
  * There are many metrics and each has its pros and cons.
  * Let's explore the basics here of <font color='blue'>MSE</font> and <font color='blue'>r-squared</font>.
* The coefficients indicate how much to multiply each feature value by, then add them up and you get the predicted price.
* In this case, the r-squared of 92% means it's not bad, it could maybe be better, but it's certainly better than a random guess.
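
The arithmetic behind these numbers is small enough to work through by hand. The sketch below uses made-up data that follows an exact linear rule, so the variable names and values are illustrative only. It rebuilds a prediction from the coefficients and computes MSE and r-squared directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Made-up data that follows y = 3*a + 2*b + 5 exactly
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + 5

demo_lm = LinearRegression().fit(X_demo, y_demo)

# A prediction is the intercept plus each feature times its coefficient
manual_pred = X_demo @ demo_lm.coef_ + demo_lm.intercept_
print(np.allclose(manual_pred, demo_lm.predict(X_demo)))

# Score some deliberately imperfect guesses against the true values
guesses = y_demo + np.array([1.0, -1.0, 2.0, -2.0, 0.0])
# MSE: the average of the squared errors
mse_manual = np.mean((y_demo - guesses) ** 2)
# r-squared: 1 minus (squared error of the guesses / squared error of always guessing the mean)
r2_manual = 1 - np.sum((y_demo - guesses) ** 2) / np.sum((y_demo - y_demo.mean()) ** 2)

print(mse_manual, mean_squared_error(y_demo, guesses))
print(r2_manual, r2_score(y_demo, guesses))
```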

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
print("Mean squared error: %.2f" % mean_squared_error(testY, predictions))
print('r-squared: %.2f' % r2_score(testY, predictions))
print(lm.coef_, lm.intercept_)

#### You can do a deeper dive into Regression on your own. There are many more algorithms, hyperparameters, metrics, and packages that can be used and compared to one another on the same data set. Our job is to try as many as we can to find the best overall for our business use case.

## Classification Analysis
* This is used to predict a category (a class) instead of a numeric value.
* It could be a yes/no, true/false outcome, which we typically call <font color='blue'>binary</font> (or binomial) classification.
* But it could also be predicting one of three or more outcomes which we call <font color='blue'>multiclass</font> classification.
    * Is a potential borrower low risk, medium risk, high risk or toxic?
    * Does a patient have a risk of Type 1, Type 2, or no diabetes?
* Let's load a dataset and try to predict whether or not a transaction is possibly fraudulent.

In [None]:
! wget "https://storage.googleapis.com/joey-public-bucket/python_deloitte/CreditCardFraud_1M.csv"


In [None]:
import pandas as pd
df = pd.read_csv('CreditCardFraud_1M.csv')
display(df)


* We will try to predict if it isFraud, so that will be our <font color='blue'>target</font> or <font color='blue'>label</font> when we are doing classification.
* Let's keep some columns and ignore some we think won't influence the outcome.
* The thing is, features all need to be numeric, so non-numeric values like type are often called <font color='blue'>categorical</font> and can be converted into a numeric representation.
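
As a quick illustration of that conversion, the sketch below encodes a handful of made-up transaction types. The values are invented for this example and are not taken from the fraud file.

```python
import pandas as pd

# Made-up transaction types for illustration
types = pd.Series(['PAYMENT', 'TRANSFER', 'CASH_OUT', 'PAYMENT', 'DEBIT'])
cat = pd.Categorical(types)

# Categories are sorted alphabetically
print(cat.categories.tolist())  # ['CASH_OUT', 'DEBIT', 'PAYMENT', 'TRANSFER']
# Each value is replaced by the position of its category
print(cat.codes.tolist())       # [2, 3, 0, 2, 1]
```

Keep in mind the codes are just labels. A model may still read them as ordered numbers, which is one reason other encodings exist.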


In [None]:
columns = ['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest', 'isFraud']
df = df[columns].copy()  # copy so the assignment below works on its own frame, not a view
df['type'] = pd.Categorical(df['type']).codes
print(df.shape, df.columns)
display(df.head())


* Similar to regression, we will split the set up into a training and testing set.
* Because of limitations as to how much we can load into memory on this machine, we will just take a small sample of 30% for training and 10% for testing.
* We also want to make sure that the proportion of fraud and non-fraud records is roughly the same in the two sets.
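
With a large random sample the proportions usually come out close on their own. If you want them matched deliberately, one option is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels with 5% fraud to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 950 non-fraud (0) and 50 fraud (1), i.e. 5% fraud
labels = np.array([0] * 950 + [1] * 50)
rows = np.arange(len(labels)).reshape(-1, 1)

tr_x, te_x, tr_y, te_y = train_test_split(
    rows, labels, train_size=0.3, test_size=0.1,
    stratify=labels, random_state=1)

# The mean of a 0/1 label is the fraction of fraud rows in that set
print(len(tr_y), tr_y.mean())
print(len(te_y), te_y.mean())
```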

In [None]:
train_size = .3
test_size = .1
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as pp
dfNB = df
trainNB_X, testNB_X, trainNB_Y, testNB_Y = train_test_split(dfNB[dfNB.columns[:-1]], dfNB.isFraud, \
                                        train_size = train_size, test_size = test_size, random_state = 1)

print('Train Set Percentages', trainNB_Y.value_counts()/trainNB_Y.count())
print('Test Set Percentages', testNB_Y.value_counts()/testNB_Y.count())
display(trainNB_X.head(10))

* There are so many algorithms that can be used to do classification, so let's just start with a simple one.

In [None]:
from sklearn.naive_bayes import GaussianNB
modelNB = GaussianNB()
modelNB.fit(trainNB_X, trainNB_Y)

* The following is a helper function to display the results in a prettier format than the default.
* It displays what's known as a <font color='blue'>confusion matrix</font> which shows how many true and false negatives and positives we have.
* By looking at these numbers and comparing different models, we can determine which model does the best job for our use case.
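
Before reading the helper, it may help to see the raw matrix on a few made-up labels. In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so with 1 meaning fraud the cells read TN, FP on the first row and FN, TP on the second.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = fraud, 0 = not fraud
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]

cm_demo = confusion_matrix(actual, predicted, labels=[0, 1])
# ravel() flattens the 2x2 matrix row by row
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print('TN', tn, 'FP', fp, 'FN', fn, 'TP', tp)

# Of the rows flagged as fraud, how many really were fraud?
precision = tp / (tp + fp)
# Of the real fraud rows, how many did we catch?
recall = tp / (tp + fn)
print(precision, recall)
```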


In [None]:
def evaluate_predictions(test, pred, show_percent = True, show_details = False):
    from sklearn.metrics import confusion_matrix
    length = len(test)
    # Rows are actual classes and columns are predicted classes, with
    # 0 = not fraud (negative) and 1 = fraud (positive):
    #   [[TN, FP],
    #    [FN, TP]]
    cm = confusion_matrix(test, pred, labels = [0, 1])
    # reindex keeps both classes present even if one never occurs,
    # e.g. when a model predicts no fraud at all
    test_vc = pd.Series(test).value_counts().reindex([0, 1], fill_value = 0)
    pred_vc = pd.Series(pred).value_counts().reindex([0, 1], fill_value = 0)
    if show_details:
        print(f'Test length = {length}')
        print('\nTest Values')
        print(test_vc)
        print('\nPredicted Values')
        print(pred_vc)
        print('\n TN FP\n FN TP')
        print(cm)

    # Last column holds the actual totals; last row holds the predicted totals
    print(f'''
A |\t\tPredicted
c |\tTN/FN\t|\tFP/TP\t|\tAN/AP
t +---------------------------------------------
u |\t{cm[0,0]:>7}\t|\t{cm[0, 1]:>7}\t|\t{test_vc[0]:>7}
a |\t{cm[1,0]:>7}\t|\t{cm[1, 1]:>7}\t|\t{test_vc[1]:>7}
l |\t{pred_vc[0]:>7}\t|\t{pred_vc[1]:>7}\t|\t{length:>7}
''')

    if show_percent:
        import numpy as np
        # PC = percent correct, PW = percent wrong
        print('\n PC FP\n FN PW')
        print(np.array([[round(100 * (cm[0][0] + cm[1][1]) / length, 1), round(100 * cm[0][1] / length, 1)],
                        [round(100 * cm[1][0] / length, 1), round(100 * (cm[1][0] + cm[0][1]) / length, 1)]]))



predNB_Y = modelNB.predict(testNB_X)
evaluate_predictions(testNB_Y, predNB_Y, show_details = False)


* Once we've trained a model and decide it's useful, we usually save it to a standard format.

In [None]:
from joblib import dump, load
dump(modelNB, 'modelNB.joblib')


* Once we have a saved model, we can use it in production to make batch or individual predictions by loading that model into something like a web server and exposing it as a web service.
  * Remember <font color='blue'>Flask</font>?


In [None]:
modelNB2 = load('modelNB.joblib')
predNB_Y = modelNB2.predict(testNB_X)

evaluate_predictions(testNB_Y, predNB_Y)


* Some algorithms require a special preprocessing of categorical columns into something called <font color='blue'>One Hot Encoding</font> or <font color='blue'>Dummy encoding</font>.
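
The difference between the two encodings is easiest to see on a tiny made-up frame. One hot encoding creates one column per category, while dummy encoding drops the first category because it can be inferred when all the others are 0.

```python
import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame({'type': ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'PAYMENT'],
                     'amount': [10.0, 250.0, 75.0, 5.0]})

# One column per category
one_hot = pd.get_dummies(demo, columns=['type'], dtype=int)
# Same, but the first category (CASH_OUT) is dropped
dummy = pd.get_dummies(demo, columns=['type'], drop_first=True, dtype=int)

print(one_hot.columns.tolist())
print(dummy.columns.tolist())
print(one_hot)
```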


In [None]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as pp

def dummy_code(data, columns, drop_first = True):
    for c in columns:
        dummies = pd.get_dummies(data[c], prefix = c, drop_first = drop_first)
        i = list(data.columns).index(c)
        data = pd.concat([data.iloc[:,:i], dummies, data.iloc[:,i+1:]], axis = 1)
    return data

dfLR = dummy_code(df, ['type'], drop_first = False)
trainLR_X, testLR_X, trainLR_Y, testLR_Y = train_test_split(dfLR.iloc[:,dfLR.columns != 'isFraud'], dfLR.isFraud, train_size = train_size, test_size = test_size, random_state = 1)

print(testLR_X.columns)
display(testLR_X.head())

* From here there are many more algorithms that can be used and some have many hyperparameters we can experiment with.
* Our job is to try as many combinations as we can to find the best model and productionize that.

In [None]:
from sklearn.linear_model import LogisticRegression
# multi_class is omitted: 'auto' was its default, and the parameter is deprecated in recent scikit-learn versions
modelLR = LogisticRegression(solver='lbfgs', max_iter=1000)
modelLR.fit(trainLR_X, trainLR_Y)
print(modelLR.coef_)

In [None]:
import numpy as np
predLR_Y = modelLR.predict(testLR_X)

score = modelLR.score(testLR_X, testLR_Y)
mse = np.mean((predLR_Y - testLR_Y)**2)
print(score, mse, '\n')

evaluate_predictions(testLR_Y, predLR_Y)


### Something to explore on your own.
* Sometimes you can influence the false positive/false negative values by changing the probability threshold used to decide between the two classes.
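
Here is a small self-contained sketch of the idea on synthetic data, with arbitrary `make_classification` settings chosen for illustration. Unlike the cell that follows, it applies the threshold to the probability of the positive class (column 1), so a higher threshold means fewer rows get flagged.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic imbalanced data: roughly 10% of rows in the positive class
X_thr, y_thr = make_classification(n_samples=2000, n_features=5,
                                   weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_thr, y_thr)

# Probability that each row belongs to class 1
prob_pos = clf.predict_proba(X_thr)[:, 1]

counts = {}
for thr in (0.2, 0.5, 0.8):
    # Flag a row as positive only if its probability reaches the threshold
    pred = (prob_pos >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_thr, pred, labels=[0, 1]).ravel()
    counts[thr] = (int(fp), int(fn))
    print(f'threshold={thr}: FP={fp} FN={fn}')
```

Raising the threshold can only reduce or hold the false positives and can only raise or hold the false negatives, so the right value depends on which mistake costs more for the business use case.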

In [None]:
predLR_Y = modelLR.predict_proba(testLR_X)
print(predLR_Y[:10])
print('Score', modelLR.score(testLR_X, testLR_Y))

for threshold in range(10, 91, 10):
    predLR_Y1 = np.where(predLR_Y[:,0] >= threshold/100, 0, 1)
    mse = np.mean((predLR_Y1 - testLR_Y)**2)
    print ('\nTHRESHOLD', threshold, 'MSE', mse)

    evaluate_predictions(testLR_Y, predLR_Y1, show_percent = False)

