# Outline of Sample Models with Blockchain

This project presents sample AI models and blockchain algorithms to guide further research and modeling work. We set two goals: find the sources of wasted financial and medical resources, and build AI models that reduce this waste in order to lower healthcare costs and insurance premiums. We have already identified the following sources of waste:

- **Insured persons' part**
  - ***Medical waste actions***. Because the insurance company pays part or even most of each bill, insured people tend to seek unnecessarily expensive or entirely unnecessary treatment, which wastes financial and medical resources.
  - ***Fraud actions***. More seriously, some insured persons fake medical records to defraud insurance companies. All insured people, honest or not, share the same insurance fund, and heavy use of the fund raises premiums for everyone, so fraud has to be controlled to stop this kind of financial waste.
  - ***Impersonation actions***. Not everyone is insured, but everyone needs medical care to some extent. Some uninsured people therefore impersonate insured persons to lower their own medical costs. This spends insurance funds that were never intended for them and raises the premiums that policyholders pay.

- **Doctors' part**
  - ***Substance abuse***. When an insurance company is involved, each bill is split between the patient and the insurer. Because neither party bears the full cost, the total budget available for treatment grows, and a few doctors are tempted to commit substance abuse (for example, over-prescribing) to increase their income. Substance abuse can still occur without any insurer involvement. Either way, it is one of the sources of wasted financial and medical resources that keeps insurance premiums high.


- **Insurance companies' part**
  - ***Inaccurate premium pricing model***. Insurance companies use their own pricing models to set premiums. These models are designed to leave the company a profit after it pays all medical fees for its insured clients. To guarantee that, insurers set as wide a margin as possible between the premiums collected from policyholders and the payments made to doctors. A more accurate pricing model allows a smaller margin, so less money is collected unnecessarily and those financial resources can flow to the parts of society that need them more.


We have also listed the data, algorithms, inputs, and outputs for each model:

 - ***A. AI models to monitor medical waste actions***
    - **a. Data needed**: \
      Insured people's basic numeric information collected when signing insurance contract like age(0~150,Integer), income(Numeric, Unit:k), insured people size(Integer), medical test scores(Numeric), credit scores(Numeric), etc.
    - **b. AI algorithm candidates**:\
      Principal component analysis(PCA), KMeans, Support Vector Classifier (SVC), K-Nearest Neighbors(KNN), Naive Bayes, XGBoost, RandomForest
    - **c. Input**: \
      Insured people's basic numeric information collected when signing insurance contract (Numeric)
    - **d. Output**:\
      0~1, the probability of taking medical waste action
   
      
  - ***B. AI models to monitor fraud actions***
    - **a. Data needed**:\
      Database of medical records tagged by doctors or experts as honest or fraudulent (data series with tag 0 or 1)
    - **b. AI algorithm candidates**:\
      Bidirectional Encoder Representations from Transformers (BERT), Convolutional Neural Network(CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM)
    - **c. Input**:\
      Insured people's historical medical records and new medical records (Images, text, and numbers data series without tag)
    - **d. Output**:\
      0~1, the probability of defrauding


  - ***C. AI models to monitor impersonation actions***
    - **a. Data needed**:\
      Insured people's biometric information such as a photo or fingerprint (Image), and a database containing a large number of comparable biometric photos labeled 0 or 1 (Image data with tag 0 or 1).
    - **b. AI algorithm candidates**:\
      Principal component analysis(PCA), Convolutional Neural Network(CNN)
    - **c. Input**:\
      Real-time face recognition video taken when entering insurance information (Image)
    - **d. Output**:\
      0 or 1, indicating whether the person is the actual insured person

  - ***D. AI models to monitor substance abuse***
    - **a. Data needed**: \
      Database of medical records tagged by doctors or experts as abuse or not (text data series with tag 0 or 1).
    - **b. AI algorithm candidates**:\
      Bidirectional Encoder Representations from Transformers (BERT), Convolutional Neural Network(CNN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM)
    - **c. Input**:\
      Doctors' medical records on one certain insured person (text data series without tag)
    - **d. Output**:\
      0~1, the probability of substance abusing


  - ***E. AI models to build more accurate premium pricing model***
    - **a. Data needed**:\
      Historical database of insured people's basic information (Yearly, Numeric), all scores generated by the AI models above (Numeric), usage of the insurance fund (Yearly, Numeric), and economic data such as the inflation rate (Numeric).
    - **b. AI algorithm candidates**:\
      Principal component analysis(PCA), K-Nearest Neighbors(KNN), Support Vector Regression (SVR), RidgeRegression, XGBoost, RandomForest
    - **c. Input**:\
      Insured people's basic information (Numeric), scores generated from above AI models (Numeric), and economic data (Numeric)
    - **d. Output**:\
      Expected cost of a given type of insured person (Yearly, Numeric)
      
      
We also stressed the importance of applying blockchain in the medical industry. Blockchain could give patients and other data subjects full ownership of and usage rights over their data, greatly improve data privacy and security, and provide safe, unimpeded communication between data owners and data recipients, which benefits all participants.


This file shows the details of the sample AI models and blockchain algorithms as a starting point for all other participants.

## AI models to monitor medical waste actions

### Data Generated

Insured people's basic numeric information collected when signing the insurance contract, such as age (0~150, Integer), income (Numeric, Unit: k), insured people size (Integer), medical test scores (Numeric), credit scores (Numeric), etc. Because all of the data are numeric and the features have different ranges, we scale them inside a pipeline so that the algorithms work better.

In [None]:
# Standard Import

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Split Data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# X holds the numeric information collected for all insured people
# y is the target: whether the insured person has taken medical waste actions before,
# encoded as 0 or 1

# Scale Data

pipe = Pipeline([('scaler', StandardScaler())])
# Further steps (PCA, then a model) are added to the pipeline in the cells that follow
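
To make the scaling step concrete, here is a minimal pure-Python sketch of what standardization does to a single feature column: subtract the mean and divide by the standard deviation. The function name `standardize` and the sample values are illustrative assumptions, not part of the project data. `StandardScaler` applies the same idea to every column, using the population standard deviation.

```python
import statistics


def standardize(column):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


ages = [20, 30, 40, 50, 60]                 # hypothetical ages
incomes = [30.0, 45.0, 60.0, 75.0, 90.0]    # hypothetical incomes, unit: k

scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)
```

After scaling, both columns have mean 0 and standard deviation 1, so a feature measured in large units no longer dominates distance-based algorithms such as KMeans, SVC, or KNN.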

After scaling, the models may still have too many parameters because there are many X variables, so we can use principal component analysis (PCA) to reduce the variables to a smaller set of components.

In [None]:
# Standard Import

from sklearn.decomposition import PCA

# Fit PCA

pipe = Pipeline([('scaler', StandardScaler()),('PCA',PCA(n_components=10, svd_solver='full'))])

# n_components is the number of principal components to keep;
# it cannot exceed the number of features or the number of samples

The data have now been split and preprocessed, with a suitable range and a suitable number of dimensions, and are ready to be fitted by the algorithms below.

### Algorithms 1 - KMeans

The target y is not always easy to obtain, and the values the insurance company collects are not always accurate. We can therefore run KMeans first to get a rough view of the insured population and check whether the target y is accurate enough. If most members of each main cluster share the same target value, the collected data and the assigned tags are credible enough for the models that follow.

In [None]:
# Standard Import

from sklearn.cluster import KMeans
import numpy as np

# Fit KMeans

kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(X)
# y only contains 0 and 1, so fit 2 clusters.

# Check if the data and target values are accurate

y_bar = kmeans.labels_
agreement = (np.asarray(y) == y_bar).mean()
p = max(agreement, 1 - agreement)
# Cluster ids are arbitrary (cluster 0 may correspond to y == 1),
# so take the better of the two possible labelings.
# Higher p means the clusters agree more closely with the tags.
# As a rule of thumb, if p is higher than 0.8, continue with the modeling work.
# Otherwise, improve the accuracy of the data and the target first.

### Algorithms 2 - Support Vector Classifier (SVC)

After feature engineering and the accuracy check, we can move on to the machine learning (AI) work: predicting whether an insured person will take medical waste actions.

Support Vector Classifier (SVC) is a widely used classification model. Samples are data points located in a P-dimensional space in which each axis represents one feature. The idea of SVC is to find the "hyperplane" that separates the samples into 2 classes in this P-dimensional space. In a 2-D space, where we have only 2 features, SVC looks for a straight line that separates the samples.

Of course, many different straight lines can separate the two classes. However the line is drawn, each class has samples that lie closest to it. SVC chooses the line that maximizes the distance (the margin) to these closest samples, which are called the SUPPORT VECTORS.
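
The geometry described above can be sketched in a few lines of pure Python. The example measures the distance from each sample to a hand-picked candidate line and reports the closest sample of each class. The points, the line `x1 + x2 = 5`, and the helper names are illustrative assumptions; a fitted SVC would search for the line itself rather than take it as given.

```python
import math


def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm


def closest_per_class(samples, labels, w, b):
    """For each class, return (sample, distance) for the sample nearest the hyperplane."""
    closest = {}
    for point, label in zip(samples, labels):
        dist = abs(signed_distance(point, w, b))
        if label not in closest or dist < closest[label][1]:
            closest[label] = (point, dist)
    return closest


samples = [(1.0, 2.0), (2.0, 2.0), (3.0, 3.0), (4.0, 3.0)]  # hypothetical 2-feature samples
labels = [0, 0, 1, 1]
w, b = (1.0, 1.0), -5.0   # candidate separating line: x1 + x2 = 5

nearest = closest_per_class(samples, labels, w, b)
margin = min(dist for _, dist in nearest.values())
```

Here `nearest` holds the candidate support vector of each class and `margin` is the smallest distance to the line, the quantity that SVC tries to make as large as possible.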

In [None]:
# Standard Import

from sklearn.svm import SVC

pipe_1 = Pipeline([('scaler', StandardScaler()),
                  ('PCA',PCA(n_components=10, svd_solver='full')),
                  ('svc', SVC(gamma='auto'))])
pipe_1.fit(X_train, y_train).score(X_test, y_test)
# Test the accuracy score to see if the model works well

# Using GridSearchCV to Find the Best Parameters

from sklearn.model_selection import GridSearchCV
# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
pipe_1_gridsearch=GridSearchCV(estimator = pipe_1,
                        param_grid = {'svc__C': [1, 10], 'svc__kernel': ('linear', 'rbf')},
                        scoring='accuracy',
                        cv=10,
                        n_jobs=-1)
pipe_1_gridsearch.fit(X_train, y_train).score(X_test, y_test)
# The tuned model often, but not always, scores higher on the test set

# Using Cross Validation to Evaluate the Classifier More Reliably

from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe_1_gridsearch, X, y, cv=5)
# Nested cross-validation: each of the 5 folds reruns the grid search,
# which gives a more comprehensive set of scores for reference

### Algorithms 3 - K-Nearest Neighbors(KNN)

Like SVC, KNN is a supervised learning algorithm. For classification with a given value of K, the algorithm finds the K training points nearest to an unseen data point and assigns it the class that is most common among those K neighbors. For the distance metric we use the Euclidean metric. In other words, the input x is assigned to the class with the largest share of votes among its neighbors.
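
The voting rule can be written out directly. The following is a minimal sketch in pure Python, assuming a tiny hand-made training set; `knn_predict` is an illustrative name, not a library function.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already scaled samples: two well separated groups.
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_points, train_labels, (0.5, 0.5))
near_far_group = knn_predict(train_points, train_labels, (5.5, 5.5))
```

Because the prediction depends only on distances, features should be scaled first, which is why the scaler comes before the classifier in the pipeline.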


In [None]:
# Standard KNeighborsClassifier Import

from sklearn.neighbors import KNeighborsClassifier
pipe_2 = Pipeline([('scaler', StandardScaler()),
                  ('PCA',PCA(n_components=10, svd_solver='full')),
                  ('KNN', KNeighborsClassifier(n_neighbors=3))])
pipe_2.fit(X_train, y_train).score(X_test, y_test)
# Test the accuracy score to see if the model works well

# Using GridSearchCV to Find the Best Parameters

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
pipe_2_gridsearch=GridSearchCV(estimator = pipe_2,
                        param_grid = {'KNN__n_neighbors': [1, 10],
                        'KNN__algorithm': ('auto','ball_tree', 'kd_tree', 'brute')},
                        scoring='accuracy',
                        cv=10,
                        n_jobs=-1)
pipe_2_gridsearch.fit(X_train, y_train).score(X_test, y_test)
# The tuned model often, but not always, scores higher on the test set

# Using Cross Validation to Evaluate the Classifier More Reliably

scores = cross_val_score(pipe_2_gridsearch, X, y, cv=5)
# Nested cross-validation gives a more comprehensive set of scores for reference.

### Algorithms 4 - Naive Bayes

Naive Bayes is another classification technique. It is based on Bayes’ Theorem and assumes that all the features used to predict the target value are independent of each other given the class. It calculates the probability of each class and then picks the one with the highest probability.
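
As a rough illustration of that calculation, the sketch below implements a small Gaussian Naive Bayes classifier in pure Python: each feature is modeled by a per-class normal distribution, and the class with the highest log posterior wins. The function names and the sample rows are assumptions made for this example; `GaussianNB` in scikit-learn is the production-ready counterpart.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the prior, feature means, and feature variances of each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        # The small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(members) / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        # Independence assumption: per-feature log likelihoods simply add up.
        for value, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


rows = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (4.0, 5.0), (4.2, 4.8), (3.8, 5.2)]  # hypothetical
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
prediction_low = predict_gaussian_nb(model, (1.1, 2.1))
prediction_high = predict_gaussian_nb(model, (4.1, 4.9))
```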


In [None]:
# Standard Naive Bayes Classifier Import

from sklearn.naive_bayes import GaussianNB
pipe_3 = Pipeline([('scaler', StandardScaler()),
                  ('PCA',PCA(n_components=10, svd_solver='full')),
                   ('NB', GaussianNB())])
pipe_3.fit(X_train, y_train).score(X_test, y_test)
# Test the accuracy score to see if the model works well

# Using Cross Validation to Evaluate the Classifier More Reliably

scores = cross_val_score(pipe_3, X, y, cv=5)
# This gives a more comprehensive set of scores for reference.

### Algorithms 5 - XGBoost

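XGBoost is a gradient boosting algorithm: it builds many weak learners (shallow decision trees) one after another, each correcting the errors of the ones before it, and combines them into a strong learner. It is designed to run fast with low memory use, and its built-in regularization helps limit overfitting, although hyperparameters such as tree depth and learning rate still need tuning. In practice the algorithm is stable and often achieves high accuracy.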
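
The boosting idea itself, fitting each weak learner to the errors left by the previous ones, can be shown with a short pure-Python sketch. It uses one-split regression stumps and squared error on a single feature to keep the arithmetic simple. This illustrates plain gradient boosting under those assumptions and is not XGBoost's actual implementation, which adds regularization and second-order gradient information. All names and data are illustrative.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return a predictor built from stumps fitted to successive residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]                 # hypothetical feature values
ys = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]     # hypothetical targets
mean_y = sum(ys) / len(ys)
baseline_error = mean_squared_error(lambda x: mean_y, xs, ys)
boosted_error = mean_squared_error(boost(xs, ys), xs, ys)
```

Each round can only lower the training error, which is also why boosting can overfit if the number of rounds, the tree depth, or the learning rate is set too high.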

In [None]:
# Standard Gradient Boosting Import
# This sample uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost;
# XGBClassifier from the separate xgboost package can be used in the same pipeline step.

from sklearn.ensemble import GradientBoostingClassifier
pipe_4 = Pipeline([('scaler', StandardScaler()),
                  ('PCA',PCA(n_components=10, svd_solver='full')),
                   ('XGBoost', GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
     max_depth=1, random_state=0))])

pipe_4.fit(X_train, y_train).score(X_test, y_test)
# Test the accuracy score to see if the model works well

# Using GridSearchCV to Find the Best Parameters

from sklearn.model_selection import GridSearchCV
# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
pipe_4_gridsearch=GridSearchCV(estimator = pipe_4,
                        param_grid = {'XGBoost__n_estimators': [100,200,300,500,1000],
                        'XGBoost__max_depth': [1,3,4,5,10,15,20],
                        'XGBoost__learning_rate':[0.01,0.05,0.1,0.3,0.5,1.0,5.0]},
                        scoring='accuracy',
                        cv=10,
                        n_jobs=-1)
pipe_4_gridsearch.fit(X_train, y_train).score(X_test, y_test)
# The tuned model often, but not always, scores higher on the test set

# Using Cross Validation to Evaluate the Classifier More Reliably

from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe_4_gridsearch, X, y, cv=5)
# Nested cross-validation gives a more comprehensive set of scores for reference


### Algorithms 6 - RandomForest

RandomForest is a tree-based ensemble algorithm: a collection of different decision trees, each trained on a random sample of the data. At each node a yes-or-no split is chosen using the Gini index. Splitting is repeated until the leaf nodes, which hold the final decision of each tree, and the forest combines the votes of all trees. Because the path from sample data to result can be traced through the trees, this algorithm can help explain patients’ behavior.
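
To show how a single node picks its yes-or-no question, the sketch below computes the Gini impurity of a set of labels and searches for the threshold that gives the purest split on one feature. The helper names and the sample ages and flags are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on `value <= threshold`."""
    left = [label for value, label in zip(values, labels) if value <= threshold]
    right = [label for value, label in zip(values, labels) if value > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


def best_threshold(values, labels):
    """Return the threshold with the lowest weighted impurity."""
    candidates = sorted(set(values))[:-1]   # the largest value would leave one side empty
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


ages = [25, 32, 41, 58, 63, 70]      # hypothetical feature
flags = [0, 0, 0, 1, 1, 1]           # hypothetical medical-waste tags

root_impurity = gini_impurity(flags)
threshold = best_threshold(ages, flags)
```

A decision tree repeats this search over all features at every node; a random forest additionally restricts each tree to a random subset of samples and features.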

In [None]:
# Standard RandomForest Classifier Import

from sklearn.ensemble import RandomForestClassifier

pipe_5 = Pipeline([('scaler', StandardScaler()),
                  ('PCA',PCA(n_components=10, svd_solver='full')),
                   ('RandomForest', RandomForestClassifier(max_depth=10, random_state=0))])

pipe_5.fit(X_train, y_train).score(X_test, y_test)
# Test the accuracy score to see if the model works well

# RandomForest trains each tree on a bootstrap sample, so setting oob_score=True
# gives an out-of-bag accuracy estimate without a separate cross-validation run.
# RandomForest can be time-consuming; if the test score is much lower than XGBoost, stop fine tuning.

## AI models to monitor fraud actions



### Data Generated

As discussed above, in more serious cases some insured persons fake medical records to defraud insurance companies and obtain money that does not belong to them and is not in the insurance company's budget. We therefore need a database of medical records that doctors or experts have tagged as honest or fraudulent (data series with tag 0 or 1). The records in the database are images, text, or numeric data with tags, which can teach AI models to classify untagged data.

Before doing any modeling, we should first unify the data.

#### Numeric Data Engineering

This processing is very similar to the previous numeric data engineering: the data only need to be scaled, sometimes followed by PCA to reduce the number of dimensions.

In [None]:
# Standard Import

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# Split Data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# X holds all numeric data stored in the database
# y is the target tagged by doctors and experts: whether the record is fraudulent,
# encoded as 0 or 1

# Scale Data

pipe = Pipeline([('scaler', StandardScaler())])

# Fit PCA

pipe = Pipeline([('scaler', StandardScaler()),('PCA',PCA(n_components=10, svd_solver='full'))])
# n_components is the number of principal components to keep

#### Image Data Engineering

Image data are fundamentally numeric: each image is an array with three dimensions (height, width, and depth). We could therefore use the image data directly. To make the models easier to fit, however, we should also scale the pixel values to a common range and apply PCA to compress the image data, which are likely to be very large.

In [None]:
# Standard Import

import numpy as np
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# Scale data

# data: array of raw images with pixel values in 0..255 (assumed to be loaded already)
# labels: array of 0/1 tags, one per image (assumed to be loaded already)
X = np.asarray(data) / 255   # the array of features, scaled to 0..1
y = np.asarray(labels)       # the array of labels

# Split Data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# X holds all image data stored in the database
# y is the target tagged by doctors and experts: whether the record is fraudulent,
# encoded as 0 or 1

# Compress Image Data

# PCA expects one flat row per image, so flatten each image first
X_train_flat = X_train.reshape(len(X_train), -1)
pipe = Pipeline([('PCA', PCA(n_components=20))])
# n_components is the number of principal components to keep
# The smaller n_components is, the blurrier the reconstructed images.

#### Text Data Engineering

Text data are the most difficult to handle, because text is fundamentally not numeric and must be converted by more advanced algorithms into numeric data that AI models can 'read'. Fortunately, the BERT model and its pretrained weights let us convert text data quickly and accurately.
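
As a rough picture of what "converting text to numbers" means, the toy sketch below maps words to integer ids and pads every record to a fixed length, returning the ids together with a mask that marks the real tokens. This is only an illustration under simplifying assumptions: it splits on whitespace, whereas the real BERT preprocessing uses WordPiece sub-word tokens and special tokens. The function names and the sample records are hypothetical.

```python
def build_vocab(texts):
    """Map each distinct lowercase word to an integer id; 0 is padding, 1 is unknown."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_len=6):
    """Return (ids, mask): ids padded or truncated to max_len, mask marks real tokens."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()][:max_len]
    n_padding = max_len - len(ids)
    mask = [1] * len(ids) + [0] * n_padding
    return ids + [vocab["[PAD]"]] * n_padding, mask


records = ["patient reports chest pain", "chest x-ray normal"]   # hypothetical records
vocab = build_vocab(records)
ids, mask = encode("chest pain severe", vocab)
```

The BERT preprocessing model returns the same kind of structure (token ids plus a mask) as tensors, which the encoder then turns into dense vectors.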

In [None]:
# Install Tensorflow Packages (run once, before the imports)

%pip install -U "tensorflow-text==2.13.*"
%pip install "tf-models-official==2.13.*"

# Standard Import

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops used by the preprocessing model

# Choose Bert Model

bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8'

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

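# Text Data Engineering

bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
# text_preprocessed: the medical record strings (assumed to be loaded already)
X = bert_preprocess_model(text_preprocessed)
# X is a dict of integer tensors: input_word_ids, input_mask and input_type_ids
# So far, the text data has been transformed to numeric data
# and is ready for the following AI modeling work.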

### Algorithms for Originally Numeric Data

This part is similar to the AI models for monitoring medical waste actions and can use the KMeans, Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), Naive Bayes, XGBoost and RandomForest algorithms directly, so this sample does not repeat the details here.

### Algorithms for Originally Image Data

Convolutional Neural Network (CNN) is one of the most effective algorithms for image data. In a CNN model, the input data pass through a series of layers designed to extract increasingly abstract features, which are then used for classification. The basic building blocks of a CNN are convolutional layers, which use filters to extract features from the input data, and pooling layers, which downsample the output of the convolutional layers to reduce the dimensionality of the data. After several convolutional and pooling layers, the output is flattened and fed into a series of fully connected layers, which perform classification or regression on the extracted features.
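
The two building blocks can be written out by hand for a tiny image. The sketch below applies one filter without padding and then 2x2 max pooling, using plain Python lists. The 5x5 image and the vertical-edge filter are made up for the example; in a trained CNN the filter values are learned.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows or columns that do not fill a window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 5x5 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]   # responds where brightness increases left to right

features = conv2d_valid(image, vertical_edge)   # 4x4 feature map
pooled = max_pool2d(features)                   # 2x2 after pooling
```

The feature map is non-zero only along the dark-to-bright edge, and pooling keeps that response while shrinking the map from 4x4 to 2x2.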

In [None]:
# Standard Import

import tensorflow as tf
from tensorflow.keras import layers, models

# Create Models

def build_cnn(optimizer='adam', input_shape=(32, 32, 3)):
    # input_shape must match the image size in X_train; 32x32 RGB is a placeholder
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    # Convolution and pooling layers extract image features
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dropout(0.2))
    # Dropout before the output layer helps prevent overfitting
    model.add(layers.Dense(1, activation='sigmoid'))
    # One sigmoid output: the probability of fraud (0~1)
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Fit Model

# A Keras model is not a scikit-learn estimator, so it is fitted directly
# on the scaled images instead of inside a scikit-learn Pipeline.
# PCA is skipped here because the CNN needs the original image shape.
model = build_cnn()
model.fit(X_train, y_train, epochs=10, validation_split=0.1)
loss, accuracy = model.evaluate(X_test, y_test)

# Searching for Better Parameters

# GridSearchCV needs a scikit-learn style estimator, so this sample
# compares the optimizers with a simple loop instead.
results = {}
for optimizer in ('adam', 'sgd', 'rmsprop'):
    candidate = build_cnn(optimizer=optimizer)
    history = candidate.fit(X_train, y_train, epochs=10,
                            validation_split=0.1, verbose=0)
    results[optimizer] = max(history.history['val_accuracy'])
best_optimizer = max(results, key=results.get)
# Pick the optimizer with the highest validation accuracy;
# validating on held-out data also guards against overfitting.

### Algorithms for Originally Text Data

After the text is preprocessed with the Bidirectional Encoder Representations from Transformers (BERT) model, a Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) network is a strong candidate for fitting the transformed text, because both are well suited to temporal and sequential data.
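
To illustrate why an LSTM suits sequential data, the sketch below runs one LSTM cell with scalar inputs and states in pure Python. The forget, input, and output gates decide how much of the old memory to keep, how much new information to store, and how much to expose at each step. The parameter layout (`params` mapping gate names to an input weight, a recurrent weight, and a bias) is an assumption made for this example, not the Keras layer's internal format.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; returns the new hidden state and cell state."""
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))        # share of old memory to keep
    input_gate = sigmoid(pre_activation("input"))     # share of new candidate to store
    candidate = math.tanh(pre_activation("candidate"))
    output = sigmoid(pre_activation("output"))        # share of memory to expose
    c = forget * c_prev + input_gate * candidate
    h = output * math.tanh(c)
    return h, c


def run_sequence(xs, params):
    """Feed a sequence through the cell, carrying the states from step to step."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, params)
    return h, c


# With all weights and biases at zero, every gate is 0.5 and the candidate is 0.
zero_params = {gate: (0.0, 0.0, 0.0) for gate in ("forget", "input", "candidate", "output")}
h, c = lstm_step(1.0, 0.0, 2.0, zero_params)
```

In this zero-weight case the cell simply halves its previous memory (`c` goes from 2.0 to 1.0); trained weights let the gates open or close depending on the input, which is how the network keeps or discards information across a long record.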

In [None]:
# Fit BERT Model

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(text_input, net)
# Define model builder function

model_1 = build_classifier_model()
result = model_1(tf.constant(X))
# X is the original and raw text data
# Next two models use the BERT-transformed X data gained above

# Fit RNN & LSTM Model
# An LSTM is an advanced kind of RNN that adds LSTM layers
# To avoid duplication, only the LSTM model is sketched here

from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping
# Assumption: the scikeras package is installed. A bare Keras model has no
# score() or get_params(), so scikit-learn tools cannot tune it directly.
from scikeras.wrappers import KerasClassifier

# max_words (vocabulary size) and max_len (sequence length) are assumed to be
# set during text preprocessing
def build_lstm_model():
    # Renamed from LSTM() so the function no longer shadows the LSTM layer
    inputs = Input(name='inputs',shape=[max_len])
    layer = Embedding(max_words,50,input_length=max_len)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256,name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(1,name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

# The wrapper compiles the model with the given loss and optimizer.
# EarlyStopping saves time once further epochs bring little improvement;
# validation_split provides the val_loss that it monitors.
model_2 = KerasClassifier(model=build_lstm_model,
                          loss='binary_crossentropy', optimizer='rmsprop',
                          batch_size=128, epochs=10, validation_split=0.1,
                          callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])

# Fit Pipeline and Model

# PCA is left out of this pipeline: it outputs floats, while the Embedding
# layer expects integer token ids
model_2_pip = Pipeline([('lstm', model_2)])
model_2_pip.fit(X_train, y_train).score(X_test, y_test)

# Using GridSearchCV to Find the Best Parameters

from sklearn.model_selection import GridSearchCV
# Pipeline parameters are addressed as <step name>__<parameter>
# Metrics only report results and do not change training, so they are not searched
model_2_pip_gridsearch=GridSearchCV(estimator = model_2_pip,
                        param_grid = {'lstm__optimizer': ('adam','sgd','rmsprop'),
                                      'lstm__loss': ('binary_crossentropy', 'squared_hinge')},
                        scoring='accuracy',
                        cv=10,
                        n_jobs=-1)
model_2_pip_gridsearch.fit(X_train, y_train).score(X_test, y_test)
# Search the grid for the parameter combination with the highest score

# Use cross-validation for a more reliable estimate of classifier performance

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model_2_pip_gridsearch, X, y, cv=5)
# Averaging over 5 folds gives a more representative score than a single split
# and helps reveal overfitting
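
The early-stopping rule used above can be sketched in plain Python. This is a simplified illustration of the idea, not Keras's implementation: training stops once the monitored validation loss fails to improve by more than `min_delta` for more than `patience` consecutive epochs. The function name `early_stopping_epoch` and the loss values are made up for the example, and Keras may count patience slightly differently.

```python
def early_stopping_epoch(val_losses, min_delta=0.0001, patience=0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to beat the best
    value seen so far by more than `min_delta` for more than `patience`
    consecutive epochs. If that never happens, the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > min_delta:
            # Meaningful improvement: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses, one per epoch
val_losses = [0.70, 0.55, 0.50, 0.49995, 0.49, 0.48]
stop_strict = early_stopping_epoch(val_losses, patience=0)    # stops at epoch 4
stop_patient = early_stopping_epoch(val_losses, patience=1)   # runs all 6 epochs
```

With `patience=0` the tiny gain at epoch 4 is not enough to continue, while `patience=1` tolerates that one flat epoch and benefits from the later improvements.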


## AI models to monitor impersonation actions

To prevent impersonation, the insured person's biometric information must be checked whenever medical care is provided. Whether the source is a video of the face or a fingerprint, the input is image data, so a Convolutional Neural Network (CNN) can be trained to decide whether a newly collected image belongs to a given insured person.

### Data Collected

The database should hold comparable biometric information for each insured person, such as video frames of the face or fingerprints taken from several positions. Before the data is fed to the algorithms, it is scaled and its dimensionality is reduced to shorten processing time.

In [None]:
# Standard Import

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# Scale data

import numpy as np

# data: raw pixel values (0-255), assumed already loaded from the database
# labels: the insured person each image belongs to
X = np.asarray(data) / 255   # the array of features, scaled to [0, 1]
y = np.asarray(labels)       # the array of labels

# Split Data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# X is all the image data stored in the database
# y is the target: the person each image belongs to

# Compress Image Data

pipe = Pipeline([('PCA', PCA(n_components=20)),...])
# PCA expects one flattened row of pixels per image
# n_components is the number of components we want to keep
# The smaller n_components is, the blurrier the reconstructed images
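
The trade-off between `n_components` and image quality can be illustrated with a small NumPy sketch. It assumes flattened grayscale images and uses random numbers as a stand-in for real data; the helper name `reconstruction_error` is hypothetical. PCA is carried out here through a singular value decomposition of the centered data, which is one standard way to compute it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for flattened grayscale images: 200 samples, 64 pixels
images = rng.random((200, 64))

mean = images.mean(axis=0)
centered = images - mean
# Rows of vt are the principal directions, ordered by explained variance
_, _, vt = np.linalg.svd(centered, full_matrices=False)


def reconstruction_error(n_components):
    """Mean squared error after keeping only the first n_components."""
    components = vt[:n_components]
    compressed = centered @ components.T        # project to fewer dimensions
    restored = compressed @ components + mean   # map back to pixel space
    return float(np.mean((images - restored) ** 2))


errors = {k: reconstruction_error(k) for k in (5, 20, 64)}
```

Keeping fewer components leaves a larger reconstruction error, which is what appears as blur in a reconstructed image. Keeping all 64 components restores the data up to floating-point precision.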

### Convolutional Neural Network (CNN) Algorithm

A CNN is trained to decide whether an image comes from a given insured person.
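
The sample network in this section stacks three 3×3 convolutions, each followed by 2×2 max pooling, on a 28×28 input. The sketch below works through how the spatial size shrinks at each layer, assuming the Keras defaults of stride 1 and "valid" (no) padding for the convolutions. The helper names are illustrative only.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution ('valid' padding when padding=0)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, pool):
    """Spatial size after non-overlapping max pooling (stride == pool size)."""
    return size // pool


size = 28
trace = [size]
for _ in range(3):  # three Conv2D + MaxPooling2D pairs
    size = conv_output_size(size, kernel=3)
    trace.append(size)
    size = pool_output_size(size, pool=2)
    trace.append(size)

print(trace)  # [28, 26, 13, 11, 5, 3, 1]
```

After the third pooling layer each image is reduced to a 1×1 grid, so flattening passes one value per filter to the dense layers. Larger input images would leave a bigger grid at this point.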

In [None]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

# Create Models

# Assumption: the scikeras package is installed; it wraps a Keras model so
# scikit-learn tools (score, GridSearchCV, cross_val_score) can use it
from scikeras.wrappers import KerasClassifier

def build_cnn_model():
    return models.Sequential([
        layers.Conv2D(filters=25, kernel_size=(3, 3), activation='relu', input_shape=(28,28,1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        # Convolve and pool the images
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.2),
        # Dropout sits before the output layer to help prevent overfitting
        layers.Dense(10, activation='softmax'),
        # One probability per person (10 people in this sample)
    ])

# The wrapper compiles the model. from_logits is left at its default (False)
# because the softmax output layer already produces probabilities.
cnn_model = KerasClassifier(model=build_cnn_model,
                            optimizer='adam',
                            loss='sparse_categorical_crossentropy',
                            metrics=['accuracy'])

# Fit Pipeline and Model

# PCA is left out of this pipeline: the CNN needs the 28x28x1 image grid,
# which PCA would flatten away
cnn_model_pip = Pipeline([('cnn', cnn_model)])
cnn_model_pip.fit(X_train, y_train).score(X_test, y_test)

# Using GridSearchCV to Find the Best Parameters

from sklearn.model_selection import GridSearchCV
# Pipeline parameters are addressed as <step name>__<parameter>.
# Only the optimizer is searched: integer labels with a softmax output call
# for sparse categorical cross-entropy, and metrics do not affect training.
cnn_model_pip_gridsearch=GridSearchCV(estimator = cnn_model_pip,
                        param_grid = {'cnn__optimizer': ('adam','sgd','rmsprop')},
                        scoring='accuracy',
                        cv=10,
                        n_jobs=-1)
cnn_model_pip_gridsearch.fit(X_train, y_train).score(X_test, y_test)
# Search the grid for the parameter combination with the highest score

# Use cross-validation for a more reliable estimate of classifier performance

from sklearn.model_selection import cross_val_score
scores = cross_val_score(cnn_model_pip_gridsearch, X, y, cv=5)
# Averaging over 5 folds gives a more representative score than a single split

# After all the modeling work, the algorithm has learned to recognize image data
# and compare it with the existing database of insured people's biometric information

## AI models to monitor substance abuse

This part closely follows the earlier AI models for monitoring fraud, but it uses only text data and concentrates on the prescriptions doctors write.

### Data Generated

The data is a database of medical records that doctors or experts have tagged as abuse or not (a series of text records labeled 0 or 1). Because it is text, the BERT algorithm can be used to preprocess it into numeric data.

In [None]:
# The model map was created above, so the BERT preprocessing model
# can be looked up directly by name

bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8'

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

# Text Data Engineering

bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
X = bert_preprocess_model(text_preprocessed)
# X is the prescription database after BERT preprocessing
# The text data has now been transformed to numeric data
# and is ready for the AI modeling work that follows

### Long Short-Term Memory (LSTM) Algorithms

After preprocessing with the Bidirectional Encoder Representations from Transformers (BERT) model, the transformed text data can be fed to a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network. Both are designed for temporal and sequential data, which makes them natural candidates here.

In [None]:
# The following code closely follows the text-data part of the AI models to monitor fraud actions
# The differences are the database and the exact parameters used in fine-tuning

# Fit BERT Model

# Define the model builder function
def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']  # one fixed-size vector per input text
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)  # raw logit
  return tf.keras.Model(text_input, net)

model_bert = build_classifier_model()
result = model_bert(tf.constant(text_preprocessed))
# The classifier contains its own preprocessing layer, so it takes the raw text
# (text_preprocessed) rather than the already transformed X
# The next model uses the BERT-transformed X data obtained above

# Fit RNN & LSTM Model
# An LSTM is an advanced kind of RNN that adds LSTM layers
# To avoid duplication, only the LSTM model is sketched here

from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping
# Assumption: the scikeras package is installed. A bare Keras model has no
# score() or get_params(), so scikit-learn tools cannot tune it directly.
from scikeras.wrappers import KerasClassifier

# max_words (vocabulary size) and max_len (sequence length) are assumed to be
# set during text preprocessing
def build_lstm_model():
    # Renamed from LSTM() so the function no longer shadows the LSTM layer
    inputs = Input(name='inputs',shape=[max_len])
    layer = Embedding(max_words,50,input_length=max_len)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256,name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(1,name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

# The wrapper compiles the model with the given loss and optimizer.
# EarlyStopping saves time once further epochs bring little improvement;
# validation_split provides the val_loss that it monitors.
model_lstm = KerasClassifier(model=build_lstm_model,
                             loss='binary_crossentropy', optimizer='rmsprop',
                             batch_size=128, epochs=10, validation_split=0.1,
                             callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])

# Fit Pipeline and Model

# PCA is left out of this pipeline: it outputs floats, while the Embedding
# layer expects integer token ids
model_lstm_pip = Pipeline([('lstm', model_lstm)])
model_lstm_pip.fit(X_train, y_train).score(X_test, y_test)

# Using GridSearchCV to Find the Best Parameters

from sklearn.model_selection import GridSearchCV
# Pipeline parameters are addressed as <step name>__<parameter>
# Metrics only report results and do not change training, so they are not searched
model_lstm_pip_gridsearch=GridSearchCV(estimator = model_lstm_pip,
                        param_grid = {'lstm__optimizer': ('adam','sgd','rmsprop'),
                                      'lstm__loss': ('binary_crossentropy', 'squared_hinge')},
                        scoring='accuracy',
                        cv=10,
                        n_jobs=-1)
model_lstm_pip_gridsearch.fit(X_train, y_train).score(X_test, y_test)
# Search the grid for the parameter combination with the highest score

# Use cross-validation for a more reliable estimate of classifier performance

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model_lstm_pip_gridsearch, X, y, cv=5)
# Averaging over 5 folds gives a more representative score than a single split
# and helps reveal overfitting


## AI models to build more accurate premium pricing model

The more accurate the premium pricing model is, the lower the risk the insurance company faces. In finance, lower risk means lower cost, which in turn allows lower premiums for insured people. AI models are well suited to this task.

### Data Generated

The inputs are the historical database of insured people's basic information (yearly, numeric), all scores generated by the AI models above (numeric), usage of the insurance fund (yearly, numeric), and economic data such as the inflation rate (numeric). The earlier results, such as the probability of fraud or waste and the probability of abuse, are valuable inputs when pricing insurance products. Because all the data is numeric, the only preprocessing needed is scaling and PCA for dimension reduction.

In [None]:
# Standard Import

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

# Split Data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# X is all the numeric information collected about insured people
# y is the yearly medical cost each insured person incurred

# Scale Data

pipe = Pipeline([('scaler', StandardScaler()),...])

# Fit PCA to Reduce Dimensions

pipe = Pipeline([('scaler', StandardScaler()),('PCA',PCA(n_components=20, svd_solver='full'))])

# n_components is the number of components we want to keep

### Grouping Insured People

Each insurance company has its own pool of insured people, with many layers and groups, and different groups get different pricing models. Groups built only from simple background information are not accurate, however, so the KMeans algorithm can be used to group insured people from the data itself.
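
When the number of groups is not fixed in advance, one common option is to compare candidate values of `n_clusters` by silhouette score. The sketch below is an illustration under assumptions, not part of the original workflow: it uses synthetic data from `make_blobs` as a stand-in for scaled policyholder features, and the variable names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled policyholder features (three true groups)
X_demo, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Silhouette is near 1 when clusters are compact and well separated
    silhouettes[k] = silhouette_score(X_demo, labels)

best_k = max(silhouettes, key=silhouettes.get)
```

On this well-separated toy data the highest silhouette score is reached at three clusters. Real policyholder data is rarely this clean, so the score is best read as one input alongside the company's existing group definitions.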

In [None]:
# Standard Import

from sklearn.cluster import KMeans

# Fit KMeans to Group People

kmeans = KMeans(n_clusters=5, random_state=0, n_init="auto").fit(X)
# X is the database of insured people's information
# This sample model uses 5 clusters
# Each insurance company can change n_clusters to the exact number
# of groups it has settled on

### Algorithms 1 - RidgeRegression(RR) for Different Groups

Our goal is to predict each insured person's yearly medical cost from all the factors discussed above. The more factors we include, the more likely multicollinearity becomes. Ridge regression handles multicollinearity well: its L2 penalty shrinks the coefficients, which reduces model complexity, lowers variance, and helps prevent overfitting. We therefore try ridge regression first for pricing the cost.
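
The effect of the ridge penalty under multicollinearity can be seen in a small NumPy sketch. It uses the closed-form ridge solution without an intercept on synthetic data in which the second feature is almost a copy of the first. The data and the helper name `ridge_coefficients` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=n)


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (no intercept): (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


ols = ridge_coefficients(X_demo, y_demo, alpha=0.0)     # ordinary least squares
ridge = ridge_coefficients(X_demo, y_demo, alpha=1.0)   # with L2 penalty
```

With `alpha=0` the two nearly identical features can receive large coefficients of opposite sign that are very sensitive to noise. The penalty shrinks the coefficient vector and spreads the effect across both features, while their sum stays close to the true value of 3.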

In [None]:
# Standard Import

from sklearn.linear_model import Ridge

pipe_RR = Pipeline([('scaler', StandardScaler()),
                  ('PCA',PCA(n_components=10, svd_solver='full')),
                  ('ridge', Ridge(alpha=1.0))])
pipe_RR.fit(X_train, y_train).score(X_test, y_test)
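# For a regressor, score() returns R^2, which shows whether the model fits well

# Using GridSearchCV to Find the Best Parameters

from sklearn.model_selection import GridSearchCV
# Pipeline parameters are addressed as <step name>__<parameter>
# 'accuracy' is a classification metric, so R^2 is used for scoring instead
# alpha=0 is not advised for Ridge, so the grid starts at 0.1
# 'lbfgs' is omitted because Ridge only accepts it together with positive=True
pipe_RR_gridsearch=GridSearchCV(estimator = pipe_RR,
                        param_grid = {'ridge__alpha': [0.1, 1, 5, 10],
                        'ridge__solver': ('auto', 'svd', 'cholesky', 'lsqr', 'saga')},
                        scoring='r2',
                        cv=10,
                        n_jobs=-1)
pipe_RR_gridsearch.fit(X_train, y_train).score(X_test, y_test)
# The tuned model should produce a higher test score

# Use cross-validation for a more reliable estimate of regressor performance

from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe_RR_gridsearch, X, y, cv=5)
# Averaging over 5 folds gives a more representative score than a single split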

### Algorithms 2 - Support Vector Regression (SVR) for Different Groups

Like the Support Vector Classifier (SVC), which looks for the separating hyperplane with the largest margin, SVR fits a line (or a hyperplane in higher dimensions) to the data. Unlike SVC, it predicts real numbers, and it lets us define how much error is acceptable in the model.

In contrast to OLS, the objective of SVR is to minimize the coefficients, more specifically the l2-norm of the coefficient vector, rather than the squared error. The error is handled in the constraints instead, where the absolute error must be less than or equal to a specified margin, the maximum error ϵ (epsilon). In practice, slack variables weighted by C allow some points to fall outside this margin.
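
The role of ϵ can be made concrete with the epsilon-insensitive loss that the soft-margin form of SVR applies to each prediction: errors inside the ϵ tube cost nothing, and errors outside it are penalized linearly. The sketch below is a plain-Python illustration with made-up numbers; the function name is hypothetical.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.2):
    """Per-sample SVR loss: zero inside the epsilon tube, linear outside."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Hypothetical targets and predictions
y_true = [10.0, 10.0, 10.0]
y_pred = [10.1, 10.5, 8.0]
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.2)
# First prediction is within 0.2 of the target, so it contributes no loss;
# the other two contribute only the part of the error beyond 0.2
```

A larger `epsilon` tolerates more error and tends to give a simpler fit, while `C` controls how strongly the errors outside the tube are penalized.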

In [None]:
# Standard SVR Regressor

from sklearn.svm import SVR

pipe_SVR = Pipeline([('scaler', StandardScaler()),
                  ('PCA',PCA(n_components=10, svd_solver='full')),
                  ('svr', SVR(C=1.0, epsilon=0.2))])
pipe_SVR.fit(X_train, y_train).score(X_test, y_test)
# For a regressor, score() returns R^2, which shows whether the model fits well

# Using GridSearchCV to Find the Best Parameters

from sklearn.model_selection import GridSearchCV
# Pipeline parameters are addressed as <step name>__<parameter>
# 'accuracy' is a classification metric, so R^2 is used for scoring instead
# 'precomputed' is omitted: it expects a kernel matrix as input, not features
pipe_SVR_gridsearch=GridSearchCV(estimator = pipe_SVR,
                        param_grid = {'svr__C': [1,3,5,10],
                        'svr__kernel': ('linear', 'poly', 'rbf', 'sigmoid')},
                        scoring='r2',
                        cv=10,
                        n_jobs=-1)
pipe_SVR_gridsearch.fit(X_train, y_train).score(X_test, y_test)
# The tuned model should produce a higher test score

# Use cross-validation for a more reliable estimate of regressor performance

from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe_SVR_gridsearch, X, y, cv=5)
# Averaging over 5 folds gives a more representative score than a single split

### Algorithms 3 - XGBoost for Different Groups

XGBoost combines many weak learners into one strong learner through boosting, and it is designed to run fast with low memory use. With suitable regularization and tuning it is usually stable and accurate, although it can still overfit if left unchecked. This part is similar to the XGBoost model mentioned above, but it is applied to a regression task.
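
How weak learners add up to a strong one can be sketched with a bare-bones gradient boosting loop for squared error. This is a teaching sketch of the general boosting idea, not XGBoost itself: each weak learner is a one-split decision stump fitted to the current residuals, and the data and function names are assumptions for illustration.

```python
import numpy as np


def fit_stump(x, residual):
    """Best single-threshold split of 1-D x, minimizing squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:   # keep both sides non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_estimators=100, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the residuals."""
    prediction = np.full(len(y), y.mean())
    stumps = []
    for _ in range(n_estimators):
        stump = fit_stump(x, y - prediction)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, prediction


rng = np.random.default_rng(0)
x_demo = np.sort(rng.uniform(0, 6, size=200))
y_demo = np.sin(x_demo) + rng.normal(scale=0.1, size=200)

# One stump on its own versus 100 boosted stumps
single = y_demo.mean() + predict_stump(fit_stump(x_demo, y_demo - y_demo.mean()), x_demo)
stumps, boosted = boost(x_demo, y_demo)

mse_single = float(np.mean((y_demo - single) ** 2))
mse_boosted = float(np.mean((y_demo - boosted) ** 2))
```

A single stump can only produce two output levels, so it fits the curve poorly. The boosted ensemble reaches a much lower training error, and the `learning_rate` and `n_estimators` values control how quickly it gets there, which is why they appear in the tuning grid.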

In [None]:
# Standard Import

from sklearn import ensemble
from sklearn.metrics import mean_squared_error

# Preset Parameters to build model

params = {
    "n_estimators": 500,
    "max_depth": 4,
    "learning_rate": 0.01,
    "loss": "squared_error",
}

# Generate the gradient boosting regression model
# Note: this sample uses scikit-learn's GradientBoostingRegressor;
# the xgboost package's XGBRegressor could be swapped in for this step

pipe_XG = Pipeline([('scaler', StandardScaler()),
                  ('PCA',PCA(n_components=10, svd_solver='full')),
                  ('XG', ensemble.GradientBoostingRegressor(**params))])
pipe_XG.fit(X_train, y_train).score(X_test, y_test)
# For a regressor, score() returns R^2, which shows whether the model fits well

# Using GridSearchCV to Find the Best Parameters

from sklearn.model_selection import GridSearchCV
# Pipeline parameters are addressed as <step name>__<parameter>
# 'accuracy' is a classification metric, so R^2 is used for scoring instead
pipe_XG_gridsearch=GridSearchCV(estimator = pipe_XG,
                        param_grid = {'XG__n_estimators': [100,200,300,500,1000],
                        'XG__max_depth': [3,4,5,10,15,20],
                        'XG__learning_rate':[0.01,0.05,0.1,0.3]},
                        scoring='r2',
                        cv=10,
                        n_jobs=-1)
pipe_XG_gridsearch.fit(X_train, y_train).score(X_test, y_test)
# The tuned model should produce a higher test score

# Use cross-validation for a more reliable estimate of regressor performance

from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe_XG_gridsearch, X, y, cv=5)
# Averaging over 5 folds gives a more representative score than a single split

### Algorithms 4 - RandomForest for Different Groups

Random forest is an ensemble of decision trees, each trained on a bootstrap sample of the data. Each node makes a yes-or-no split: classification trees choose the split by the Gini index, while regression trees, as used here, choose it by the reduction in squared error. The leaf nodes hold the final results, and the forest averages the predictions of its trees. Individual trees show the path from sample data to result, which can help explain patients' behavior. This part is similar to the random forest model mentioned above, but it is applied to a regression task.
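
The two kinds of split criterion can be compared in a few lines of plain Python. The sketch below computes Gini impurity for class labels and variance (the squared-error criterion) for numeric targets, before and after a candidate split. The labels and yearly costs are invented for the example, and the helper names are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity of a list of class labels (classification trees)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def variance(values):
    """Population variance (squared-error criterion of regression trees)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def weighted_impurity(left, right, impurity):
    """Impurity of a split, weighted by the size of each child node."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Classification: a perfect split of fraud (1) / non-fraud (0) labels
parent_labels = [0, 0, 0, 1, 1, 1]
gini_before = gini_impurity(parent_labels)                           # 0.5
gini_after = weighted_impurity([0, 0, 0], [1, 1, 1], gini_impurity)  # 0.0

# Regression: splitting hypothetical yearly costs into low and high spenders
costs = [100.0, 120.0, 110.0, 900.0, 950.0, 1000.0]
var_before = variance(costs)
var_after = weighted_impurity(costs[:3], costs[3:], variance)
```

In both cases the tree prefers the split that lowers the weighted impurity the most. For the cost data, separating low from high spenders removes most of the variance, which is the kind of split a regression forest looks for.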

In [None]:
# Standard Import
from sklearn.ensemble import RandomForestRegressor

pipe_RF = Pipeline([('scaler', StandardScaler()),
                  ('PCA',PCA(n_components=10, svd_solver='full')),
                   ('RandomForest', RandomForestRegressor(max_depth=10, random_state=0))])

pipe_RF.fit(X_train, y_train).score(X_test, y_test)
# For a regressor, score() returns R^2, which shows whether the model fits well

# With oob_score=True, a random forest can estimate its generalization score from
# out-of-bag samples, which plays a role similar to cross-validation
# Random forests are time-consuming to train; if the test score is much lower than XGBoost's, stop fine-tuning

## Blockchain Interior Codes

A blockchain's internal structure has three parts: store the data uploaded to the block, read the hash value of the previous block, and generate a new hash value for the current block to pass on to the next block. Once these three steps are done, the data is stored permanently and is tamper-resistant.
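
Why this linking makes stored data tamper-evident can be shown with a minimal sketch that uses only the standard library. It leaves out proof of work and timestamps, and the function names and the `data` field are assumptions made for illustration rather than part of the sample class in this section.

```python
import hashlib
import json


def block_hash(block):
    """SHA-256 of the block's JSON form (keys sorted for a stable result)."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()


def add_block(chain, data):
    """Append a block that stores `data` and the hash of the previous block."""
    previous_hash = block_hash(chain[-1]) if chain else "0"
    chain.append({"index": len(chain) + 1, "data": data,
                  "previous_hash": previous_hash})


def is_valid(chain):
    """Check that every block points at the hash of the block before it."""
    return all(chain[i]["previous_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))


chain = []
for record in ("claim A", "claim B", "claim C"):
    add_block(chain, record)

valid_before = is_valid(chain)          # True
chain[1]["data"] = "claim B (edited)"   # tamper with the second block
valid_after = is_valid(chain)           # False: block 3 no longer matches block 2
```

Changing any earlier block changes its hash, so the `previous_hash` stored in the next block stops matching and the check fails. An attacker would have to rewrite every later block as well, and proof of work is what makes that expensive.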

In [None]:
# Python program to create Blockchain
# This code's reference:https://www.geeksforgeeks.org/create-simple-blockchain-using-python/

# For timestamp
import datetime

# Calculating the hash
# in order to add digital
# fingerprints to the blocks
import hashlib

# To store data
# in our blockchain
import json


class Blockchain:

    # This function is created
    # to create the very first
    # block and set its hash to "0"
    def __init__(self):
        self.chain = []
        self.create_block(proof=1, previous_hash='0')

    # This function is created
    # to add further blocks
    # into the chain
    def create_block(self, proof, previous_hash):
        block = {'index': len(self.chain) + 1,
                'timestamp': str(datetime.datetime.now()),
                'proof': proof,
                'previous_hash': previous_hash}
        self.chain.append(block)
        return block

    # This function is created
    # to display the previous block
    def print_previous_block(self):
        return self.chain[-1]

    # This is the function for proof of work
    # and used to successfully mine the block
    def proof_of_work(self, previous_proof):
        new_proof = 1
        check_proof = False

        while check_proof is False:
            hash_operation = hashlib.sha256(
                str(new_proof**2 - previous_proof**2).encode()).hexdigest()
            if hash_operation[:5] == '00000':
                check_proof = True
            else:
                new_proof += 1

        return new_proof

    # This function hashes a block with SHA-256;
    # keys are sorted so the same block always gives the same hash
    def hash(self, block):
        encoded_block = json.dumps(block, sort_keys=True).encode()
        return hashlib.sha256(encoded_block).hexdigest()

    # This function checks that every block stores the hash of the block
    # before it and that every proof satisfies the proof-of-work rule
    def chain_valid(self, chain):
        previous_block = chain[0]
        block_index = 1

        while block_index < len(chain):
            block = chain[block_index]
            if block['previous_hash'] != self.hash(previous_block):
                return False

            previous_proof = previous_block['proof']
            proof = block['proof']
            hash_operation = hashlib.sha256(
                str(proof**2 - previous_proof**2).encode()).hexdigest()

            if hash_operation[:5] != '00000':
                return False
            previous_block = block
            block_index += 1

        return True