### Summary
Machine learning deployment is the process of integrating a trained machine learning model into a production environment, enabling it to generate predictions and insights from real-world data. This crucial step transforms theoretical models into practical applications, allowing organizations to leverage the predictive capabilities of machine learning to drive decision-making and enhance operational efficiency. As industries increasingly adopt machine learning technologies, understanding the deployment process has become essential for maximizing the value derived from these advanced algorithms.[1][2]

The deployment process encompasses several key stages, including model selection and training, evaluation, infrastructure setup, integration, and ongoing monitoring and improvement. Each of these phases plays a critical role in ensuring that the deployed model functions effectively in its intended context. Notably, challenges such as scalability, monitoring, version control, and security must be navigated to maintain the accuracy and reliability of machine learning applications over time.[1][3][4] Organizations are also encouraged to adhere to best practices, including documentation, error resolution, and continuous training, to ensure sustainable deployment outcomes.[5]

The topic of machine learning deployment has gained prominence not only due to its technical complexities but also because of the ethical considerations associated with its use. Issues such as algorithmic bias, data privacy, and the environmental impact of model training raise significant concerns that organizations must address to foster trust and accountability in their AI systems.[6][7] Recent developments in regulatory frameworks and ethical guidelines further highlight the importance of responsible deployment practices in the rapidly evolving landscape of machine learning technologies.[8][9]

In summary, effective machine learning deployment is a multi-faceted process that requires careful planning, execution, and continuous evaluation. As organizations navigate the complexities of deploying machine learning models, they must balance technological innovation with ethical responsibility to maximize the benefits of their AI initiatives.[10][11]

Process of Machine Learning Deployment
The process of machine learning deployment involves several key steps that transform a trained model into a functional application in a real-world environment. This process is critical for organizations to leverage the predictive power of machine learning models effectively.

Steps in the Deployment Process
1. Model Selection and Training
The first step in deployment is selecting an appropriate machine learning model tailored to the specific problem at hand. This involves evaluating the characteristics of the data and the desired outcomes, which may pertain to classification, regression, or clustering tasks. After selecting a model, it is trained using preprocessed data, where the model learns patterns and relationships to make accurate predictions[1]. During this phase, the data is typically split into a training set and a validation set to facilitate effective learning and performance evaluation[1][2].
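As a rough illustration of the training/validation split described above, the following sketch uses only the Python standard library. The function name, the 80/20 ratio, the seed, and the toy records are assumptions made for this example; in a scikit-learn workflow, `train_test_split` plays the same role.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train_rows, validation_rows)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

# Hypothetical toy records: (principal, term_days, label)
records = [(1000, 30, 1), (800, 15, 0), (1000, 15, 1), (300, 7, 0), (1000, 30, 1),
           (500, 7, 0), (1000, 30, 1), (800, 30, 0), (900, 15, 1), (1000, 7, 1)]

train_rows, validation_rows = train_validation_split(records)
print(len(train_rows), len(validation_rows))  # 8 2
```

Fixing the seed keeps the split identical between runs, which makes later model comparisons easier to reproduce.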

2. Model Evaluation
Once the model is trained, it must be evaluated using a separate validation or test dataset. This step assesses the model's performance on unseen data to ensure it generalizes well and maintains accuracy in real-world applications. Various evaluation metrics, such as accuracy, precision, recall, and mean squared error, are employed to gauge the model's effectiveness[1]. If the model does not meet the desired performance thresholds, it may require further refinement, which could include adjusting the model's architecture or hyperparameters[1].
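The metrics named above can be computed directly from labels and predictions. The sketch below is a minimal plain-Python version under the assumption of binary labels with `1` as the positive class; the sample values are invented for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def mean_squared_error(y_true, y_pred):
    """Average squared difference, used for regression tasks."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented example values
labels      = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(labels, predictions)
print(metrics)  # accuracy, precision, and recall are all 0.75 here

mse = mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 3.0])
print(round(mse, 4))  # 0.4167
```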

3. Infrastructure and Environment Setup
Deployment necessitates the establishment of the appropriate infrastructure and environment to support the model's operation. This may involve configuring servers, utilizing cloud platforms, or setting up other computational resources to handle the model's processing requirements. It is vital that this infrastructure is scalable to accommodate growing demands and provide reliable performance[1][2].

4. Integration
The model must then be integrated into existing systems or applications, which is a critical aspect of the deployment process. This integration could involve embedding the model within a web application, interfacing it with an API, or incorporating it into current software systems. A well-planned integration strategy ensures seamless communication between the model and other components of the infrastructure, allowing for efficient operation[1].
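One way to picture the API style of integration is a small HTTP endpoint that accepts JSON features and returns a prediction. The sketch below uses only the standard library; the `/predict` path, the `predict` rule, and the `paid_off_probability` field are hypothetical placeholders standing in for a real trained model, not part of any particular framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Hypothetical placeholder rule standing in for a trained model."""
    score = 0.8 if features.get("principal", 0) <= 1000 else 0.4
    return {"paid_off_probability": score}

def handle_prediction_request(raw_body):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return 400, {"error": "request body must be valid JSON"}
    if not isinstance(payload, dict):
        return 400, {"error": "request body must be a JSON object"}
    return 200, predict(payload)

class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_prediction_request(self.rfile.read(length))
        encoded = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)

def serve(host="127.0.0.1", port=8000):
    """Start the blocking server; deliberately not called in this sketch."""
    HTTPServer((host, port), PredictionHandler).serve_forever()

# The request logic can be exercised without starting a server.
print(handle_prediction_request(b'{"principal": 800}'))
```

Keeping the request handling in a plain function, separate from the HTTP class, makes the integration logic easy to unit test before it is wired into a larger application.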

5. Continuous Evaluation and Improvement
Deployment is not a one-time process; it requires ongoing evaluation and improvement. Organizations should regularly assess the deployed model's performance and impact, collecting feedback from end-users to identify areas for enhancement. This continuous improvement cycle ensures that the models remain effective and aligned with evolving business needs, adapting to new data inputs and changing environments[1].

By following these steps, organizations can successfully deploy machine learning models that not only meet their operational requirements but also provide substantial business value through data-driven insights[2].

Definition of Deployment
In the context of machine learning, deployment refers to making a trained model accessible for use in production environments. This integration allows the model to receive input data, make predictions, and generate actionable insights in real time[1][2].

Tools and Frameworks for Deployment
The deployment of machine learning models is supported by a variety of tools and frameworks that streamline the process, enhance scalability, and ensure reliability. These tools cater to different deployment strategies, including containerization, orchestration, and API integration.

Seldon Core
Seldon Core is an open-source framework that accelerates the deployment of machine learning models while simplifying the process.[2] It supports models created with any open-source machine learning framework and is built on Kubernetes, allowing for the use of advanced Kubernetes features. This includes managing model graphs and scaling resources as necessary. Seldon also facilitates connections to continuous integration and deployment (CI/CD) solutions, and provides alerts for issues in production, making it suitable for both on-premises and cloud deployments.

AWS SageMaker
Amazon SageMaker is a comprehensive service that enables developers and data scientists to build, train, and deploy machine learning models efficiently.[2] It includes integrated Jupyter notebooks for data analysis, so users do not need to manage servers themselves. SageMaker offers various modules that can be used independently or together, providing optimized machine learning methods for large datasets.

TensorFlow Serving
TensorFlow Serving is designed for high-performance serving of machine learning models, allowing trained models to be exposed as gRPC or REST API endpoints.[2][3] This enables real-time predictions across a variety of data types. Its architecture supports multiple users and efficiently manages high request volumes with load balancing. TensorFlow Serving is widely utilized by major companies, including Google, as a central solution for model serving.

Containerization and Deployment Orchestration
Containerization technologies like Docker play a crucial role in the deployment of machine learning models. They package models, dependencies, and configurations into portable containers, ensuring consistent deployment across different environments.[1][3] Kubernetes, as a container orchestration tool, automates the management of these containers, providing features for scaling, load balancing, and fault tolerance.

Deployment Strategies
Different deployment methods cater to varying needs:

- Web API Deployment: Models are deployed as web services accessed via APIs, facilitating real-time predictions.
- Cloud-based Deployment: Models are hosted on cloud platforms, allowing for dynamic scaling and cost efficiency.
- Container Deployment: Models are packaged in containers and orchestrated with Kubernetes, integrating seamlessly with existing infrastructures.
- Offline Deployment: Models run on batches of data, suitable for applications that can handle periodic updates.[2]

Best Practices for Deployment
Successful deployment of machine learning models involves several best practices that ensure models are reliable, maintainable, and scalable in real-world applications.

Version Control
Version control is essential for managing machine learning models, code, and configurations. By implementing systems like Git, organizations can track changes, document enhancements, and manage different versions of models throughout their lifecycle.[1] This practice allows for easier collaboration among team members and simplifies the process of rolling back to previous versions if needed.
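Git tracks code and configuration well, but serialized model artifacts are often stored separately. The sketch below shows one possible way to version artifacts by content hash so that a previous version can be reloaded and verified; the directory layout, file names, and metadata fields are assumptions for illustration rather than the convention of any specific registry tool.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def register_model(registry_dir, model_bytes, metadata):
    """Store an artifact under a version derived from its SHA-256 digest."""
    digest = hashlib.sha256(model_bytes).hexdigest()
    version = digest[:12]                      # short, content-derived version id
    target = Path(registry_dir) / version
    target.mkdir(parents=True, exist_ok=True)
    (target / "model.bin").write_bytes(model_bytes)
    record = {**metadata, "sha256": digest}
    (target / "metadata.json").write_text(json.dumps(record, indent=2))
    return version

def load_model(registry_dir, version):
    """Load an artifact and confirm it still matches its recorded checksum."""
    target = Path(registry_dir) / version
    model_bytes = (target / "model.bin").read_bytes()
    metadata = json.loads((target / "metadata.json").read_text())
    if hashlib.sha256(model_bytes).hexdigest() != metadata["sha256"]:
        raise ValueError("artifact does not match its recorded checksum")
    return model_bytes, metadata

with tempfile.TemporaryDirectory() as registry:
    # Byte strings stand in for serialized models in this example.
    v1 = register_model(registry, b"weights-v1", {"validation_accuracy": 0.81})
    v2 = register_model(registry, b"weights-v2", {"validation_accuracy": 0.84})
    # Rolling back is just loading the earlier version id.
    rolled_back_bytes, rolled_back_meta = load_model(registry, v1)
    print(v1 != v2, rolled_back_meta["validation_accuracy"])  # True 0.81
```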

Documentation
Proper documentation is crucial during the deployment process. It should encompass all steps taken, decisions made, and the reasoning behind those decisions. This documentation should include details about the model pipeline, data preprocessing, infrastructure specifics, performance metrics, and unique considerations specific to the deployment.[1][3] Comprehensive documentation facilitates easier reproduction of the deployment and aids future team members or stakeholders in understanding the process.

Monitoring and Performance Evaluation
Monitoring deployed models is vital for tracking key performance metrics and assessing their effectiveness. This includes evaluating inputs, outputs, and intermediate data throughout the model pipeline. Organizations should define specific performance metrics based on the task at hand, such as accuracy for classification tasks or mean squared error for regression tasks.[1] Regular monitoring helps identify anomalies and areas for improvement, ensuring that models remain aligned with business objectives.
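A simple form of the monitoring described above is to track accuracy over a sliding window of recent predictions and raise a flag when it drops below a threshold. The class name, window size, and threshold in this sketch are illustrative assumptions, and it presumes that ground-truth labels eventually become available for deployed predictions.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the most recent predictions."""

    def __init__(self, window_size=100, alert_threshold=0.8):
        self.outcomes = deque(maxlen=window_size)  # old entries fall off automatically
        self.alert_threshold = alert_threshold

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        # Wait for a full window so a few early errors do not trigger alerts.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.alert_threshold

monitor = RollingAccuracyMonitor(window_size=5, alert_threshold=0.8)
for prediction, actual in [(1, 1), (0, 0), (1, 0), (1, 1), (0, 1)]:
    monitor.record(prediction, actual)

print(monitor.accuracy(), monitor.should_alert())  # 0.6 True
```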

Continuous Integration and Delivery (CI/CD)
Implementing CI/CD pipelines automates the integration, testing, and deployment processes for machine learning models. Automation increases efficiency and reduces the time from development to deployment by ensuring models are regularly updated and deployed without manual intervention.[1] CI/CD practices support a more agile deployment process, enabling quicker responses to changes in requirements or data.
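In a CI/CD pipeline for models, an automated quality gate commonly decides whether a newly trained candidate may replace the production model. The following is a minimal sketch of such a gate; the function name, the metric dictionary shape, and the thresholds are hypothetical choices for this example, not a standard interface.

```python
def promotion_gate(candidate_metrics, production_metrics,
                   min_accuracy=0.80, max_regression=0.01):
    """Return (approved, reasons) for promoting a candidate model."""
    reasons = []
    candidate = candidate_metrics["accuracy"]
    production = production_metrics["accuracy"]
    if candidate < min_accuracy:
        reasons.append(f"accuracy {candidate:.3f} is below the minimum {min_accuracy:.3f}")
    if candidate < production - max_regression:
        reasons.append(f"accuracy {candidate:.3f} regresses from production {production:.3f}")
    return (not reasons, reasons)

approved, reasons = promotion_gate({"accuracy": 0.86}, {"accuracy": 0.84})
print(approved, reasons)  # True []

rejected, why = promotion_gate({"accuracy": 0.82}, {"accuracy": 0.84})
print(rejected, why)
```

A pipeline step could call this function after evaluation and exit with a non-zero status when the candidate is not approved, which stops the deployment stage from running.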

Containerization and Orchestration
Containerization is a popular method for deploying machine learning models, allowing developers to package models with their dependencies into portable containers.[3] Tools like Docker and Kubernetes facilitate consistent deployment across different environments and support efficient scaling and management of containers. This orchestration ensures high availability and reliability, especially during fluctuating workloads.

Ongoing Training and Adaptation
To maintain accuracy and relevance, machine learning models must be periodically retrained with new data. This ongoing training process may involve incorporating new labeled data or leveraging transfer learning techniques to adapt to changing data patterns.[1] Regular updates help ensure models continue to perform effectively as the environment evolves.

Error Resolution and Maintenance
Timely diagnosis and resolution of errors are critical for maintaining deployed models. Organizations should implement logging and error tracking mechanisms to quickly identify and address issues related to data, code, or infrastructure.[1] Establishing a robust process for error resolution ensures minimal downtime and maintains user trust in the deployed models.

By following these best practices, organizations can enhance the effectiveness and sustainability of their machine learning deployments, ultimately leading to improved performance and user satisfaction.
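For the logging and error tracking mentioned under error resolution, one lightweight pattern is to wrap each prediction call so that failures are logged with a traceback and a fallback value is returned instead of crashing the service. The wrapper name, logger name, and toy model below are assumptions for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_service")  # hypothetical logger name

def safe_predict(model_fn, features, fallback=None):
    """Call the model, logging any failure and returning a fallback value."""
    try:
        prediction = model_fn(features)
    except Exception:
        # Log field names only, so raw (possibly sensitive) values stay out of logs.
        logger.exception("prediction failed; input fields=%s", sorted(features))
        return fallback
    logger.info("prediction succeeded")
    return prediction

def toy_model(features):
    """Stand-in model that fails when 'terms' is zero."""
    return features["principal"] / features["terms"]

ok_result = safe_predict(toy_model, {"principal": 900, "terms": 30})
failed_result = safe_predict(toy_model, {"principal": 1000, "terms": 0}, fallback=-1.0)
print(ok_result, failed_result)  # 30.0 -1.0
```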

Challenges in Machine Learning Deployment
Machine learning deployment presents several challenges that organizations must navigate to ensure the successful integration of their models into real-world applications. These challenges encompass various aspects, including scalability, monitoring, maintenance, and security.

Monitoring and Maintenance
After deployment, continuous monitoring and maintenance are essential to ensure ongoing model effectiveness. This involves tracking key performance metrics, monitoring input data quality, and setting up alerts to detect anomalies or performance degradation[1]. Without robust monitoring mechanisms, organizations risk overlooking issues that could lead to decreased accuracy or reliability of their machine learning models. Regular maintenance procedures, including model updates and retraining on new data, are also vital to address any changes in data patterns or user requirements[1].
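Changes in input data patterns can be checked numerically before accuracy visibly degrades. The sketch below computes the population stability index (PSI), one commonly used drift measure, between a training-time sample and a live sample of a single numeric feature. The binning scheme, the `epsilon` floor, and the sample data are assumptions for this example; the 0.25 level mentioned in the comment is a widely quoted rule of thumb, not a guarantee.

```python
import math

def population_stability_index(expected, actual, n_bins=5, epsilon=1e-6):
    """PSI between two samples, using equal-width bins over the expected range."""
    low, high = min(expected), max(expected)
    width = (high - low) / n_bins or 1.0   # avoid zero width for constant data

    def proportions(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1   # clamp out-of-range values
        # Floor empty bins at epsilon so the logarithm stays defined.
        return [max(count / len(values), epsilon) for count in counts]

    expected_share = proportions(expected)
    actual_share = proportions(actual)
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_share, actual_share))

training_feature = list(range(100))                 # invented training-time values
live_feature = [v + 40 for v in training_feature]   # same shape, shifted upward

psi_same = population_stability_index(training_feature, training_feature)
psi_shifted = population_stability_index(training_feature, live_feature)
# Rule of thumb: values above roughly 0.25 are often read as a significant shift.
print(psi_same, psi_shifted > 0.25)
```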

Scalability Issues
One of the primary challenges in machine learning deployment is scalability. As the volume of data and user demands increases, it is crucial for deployed models to handle growing workloads without compromising performance. Organizations must carefully consider their deployment infrastructure, including computational resources, storage capacity, and network bandwidth, to accommodate these demands effectively[1][5]. Cloud-based solutions such as AWS, Google Cloud, and Azure offer scalable infrastructures that can automatically adjust resources based on real-time demand, thus enhancing performance during peak loads[5].

Version Control
Effective version control is another critical challenge in machine learning deployment. Organizations must manage different versions of models and associated code throughout the deployment process. This includes saving snapshots of models at various stages of development and maintaining records of modifications[1]. Proper version control ensures reproducibility and facilitates performance evaluation, allowing teams to roll back changes if necessary and compare model versions against specific features or datasets[1].

Security and Privacy Concerns
Security and privacy are paramount when deploying machine learning models, especially when handling sensitive data. Organizations must implement measures to protect against unauthorized access and data breaches, which could have severe implications for both privacy and compliance with regulations like GDPR, CCPA, and HIPAA[5]. Techniques such as data anonymization and encryption play a crucial role in safeguarding sensitive information during the deployment process[5].
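As a small illustration of the data-protection techniques mentioned above, the sketch below replaces an identifier with a keyed hash and drops a sensitive attribute before a record is logged or stored. Strictly speaking this is pseudonymization rather than full anonymization, and the choice of which fields to hash or drop, as well as the example key, are assumptions for this example; a real key would come from a secrets manager rather than source code.

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Replace a value with a keyed SHA-256 digest (stable, but not reversible without the key)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def scrub_record(record, secret_key, id_fields=("Loan_ID",), drop_fields=("Gender",)):
    """Hash identifier fields and drop sensitive fields from a record."""
    cleaned = {}
    for field, value in record.items():
        if field in drop_fields:
            continue
        if field in id_fields:
            cleaned[field] = pseudonymize(str(value), secret_key)
        else:
            cleaned[field] = value
    return cleaned

example_key = b"example-only-key"   # placeholder; never hard-code real keys
raw = {"Loan_ID": "xqd20166231", "Principal": 1000, "Gender": "male"}
scrubbed = scrub_record(raw, example_key)
print(sorted(scrubbed), scrubbed["Loan_ID"] != raw["Loan_ID"])
```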

Additional Considerations
Deploying machine learning models is inherently complex, requiring careful planning and execution to address these challenges. Organizations must consider various factors, such as infrastructure scalability, monitoring systems, maintenance routines, version control, and security protocols, to ensure that their models remain effective and reliable over time[2][6]. By tackling these challenges head-on, organizations can leverage the full potential of their machine learning models in real-world applications.

Ethical Dilemmas in Deployment
As the deployment of machine learning (ML) technologies continues to expand, it brings with it a host of ethical dilemmas that organizations must navigate. Ethical AI deployment is not just a technical challenge but a profound social imperative, necessitating a careful balance between innovation and responsibility[7][8].

Key Ethical Considerations
Algorithmic Bias and Data Integrity
One of the primary concerns in ML deployment is algorithmic bias, which occurs when AI systems are trained on unrepresentative or incomplete datasets. Historical biases embedded in training data can perpetuate inequalities, leading to biased outcomes in critical areas such as hiring, criminal justice, and healthcare[8]. Notably, the experiences of researchers like Dr. Joy Buolamwini illustrate how insufficient diversity in datasets can skew AI performance, particularly affecting marginalized groups[8]. Organizations must prioritize data integrity and ensure that the datasets used in training are representative and free from historical biases.

Environmental Impact
The environmental implications of ML model training are increasingly coming under scrutiny. Training large language models, for instance, consumes significant energy, with a carbon footprint comparable to that of multiple long-haul flights[8]. As the demand for AI continues to rise, organizations are urged to consider the environmental impact of their technologies and strive for sustainable practices in their deployment strategies.

Transparency and Accountability
Transparency is critical in building trust in AI systems. Stakeholders benefit from clear communication about how AI models make decisions, which is especially crucial in sectors like finance and healthcare where decisions can have far-reaching consequences[9]. Providing detailed reports on AI performance and maintaining open channels for dialogue can help foster a culture of accountability. This transparency also extends to adhering to regulatory standards, such as GDPR, ensuring that data privacy and ethical considerations are integral to AI operations[9].

Ethical Decision-Making in Autonomous Systems
The deployment of autonomous systems, such as self-driving cars, raises unique ethical challenges. These systems must be programmed to make split-second decisions in scenarios where harm is unavoidable, forcing developers to confront difficult moral questions regarding the prioritization of lives[9]. The development of ethical frameworks for such decision-making processes is vital to ensure that the technology aligns with societal values and ethical principles.

Privacy and Data Protection
With the increasing use of personal data in training ML models, privacy concerns are paramount. Organizations must implement robust data protection measures, such as data anonymization and secure storage protocols, to safeguard individual privacy[9]. Adhering to privacy regulations not only protects users but also enhances trust in AI technologies.

Case Studies
The exploration of machine learning deployment is significantly enhanced by case studies that highlight both the successes and challenges encountered in real-world applications. These case studies serve as valuable resources for understanding how machine learning models are implemented and the ethical considerations that arise during deployment.

Princeton Dialogues on AI and Ethics
One notable collection of case studies is the Princeton Dialogues on AI and Ethics, which includes a series of fictional case studies designed to prompt reflection and discussion on the ethical dilemmas at the intersection of AI and societal impacts. These studies, developed through an interdisciplinary workshop series at Princeton University, are underpinned by five guiding principles: empirical foundations, broad accessibility, interactiveness, multiple viewpoints, and depth over brevity[10][11].

Example Case Studies
- Automated Healthcare App: This case study examines issues of legitimacy, paternalism, transparency, censorship, and inequality[10]. It reflects on how AI can influence healthcare delivery and the ethical implications of automating patient interactions.
- Dynamic Sound Identification: Focused on rights, representational harms, neutrality, and downstream responsibility, this study illustrates the complexities involved in sound recognition technologies and their societal ramifications[10].
- Optimizing Schools: This case addresses privacy, autonomy, consequentialism, and rhetoric, exploring the ethical dimensions of using AI in educational settings to optimize learning experiences and administrative processes[10].
- Law Enforcement Chatbots: Highlighting the ethical concerns of automation, research ethics, and sovereignty, this case study delves into the use of AI chatbots within law enforcement agencies and the implications for justice and civil rights[10].
- Hiring By Machine: This case emphasizes the ethical challenges in employing AI for hiring processes, including issues of bias and fairness, and the responsibility of organizations to ensure equitable outcomes[10].

Real-World Applications and Ethical Considerations
The importance of ethical considerations in machine learning deployment is underscored by practical examples, such as the unintentional biases that can arise from flawed data collection methods. For instance, a facial recognition model trained predominantly on data from a specific demographic may perform poorly on individuals from different backgrounds, highlighting the critical need for diverse and representative datasets in the training phase[12].
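The kind of uneven performance described above can be surfaced by reporting accuracy separately for each group instead of as a single overall number. The sketch below assumes a group label is available for every evaluation example; the group names and data are invented for illustration.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return a mapping of group label to accuracy within that group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}

# Invented evaluation data with an over-represented group "A"
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 1]
groups = ["A", "A", "A", "A", "A", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, round(gap, 3))  # overall accuracy of 0.75 hides the gap between A and B
```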

Framework for Responsible Deployment
To ensure responsible deployment, organizations are encouraged to implement frameworks that include error analysis, bias mitigation strategies, assigned responsibility for monitoring models, and transparent reporting practices[12][3]. Establishing a feedback loop for continuous improvement based on user behavior and model performance is also essential, as it allows organizations to adapt and enhance their models post-deployment[3].

By learning from these case studies and applying best practices in ethical oversight, organizations can navigate the complexities of machine learning deployment while fostering trust and accountability within the AI ecosystem.

Frameworks and Guidelines for Ethical Deployment
Importance of Ethical Considerations in Machine Learning
The deployment of machine learning (ML) models necessitates careful attention to ethical considerations, as unethical practices can lead to biased outcomes, which in turn can damage an organization's reputation and disrupt projects[13]. Ethical frameworks play a crucial role in guiding the responsible development and implementation of AI technologies. This is particularly vital given the potential for AI to reflect and amplify existing societal biases found in training data[14][15].

Established Guidelines and Frameworks
Various organizations and research communities are actively working on establishing ethical guidelines for AI and ML deployment. For example, the Organisation for Economic Co-operation and Development (OECD) has issued guidelines that promote human-centric values, inclusivity, and sustainability in AI[16]. Moreover, initiatives by organizations like OpenAI and the IEEE are focused on promoting responsible AI development by encouraging researchers to consider the social and ethical implications of their work[16][17].

Key Principles of Responsible AI
Responsible AI deployment is anchored in several key principles that organizations must adhere to in order to ensure ethical compliance:

- Fairness and Bias Mitigation: AI systems should be designed to avoid discrimination against any group. This requires identifying biases in training data and implementing techniques to mitigate them, thus ensuring fairness in decisions made by AI systems[18][19].
- Transparency: Organizations must maintain transparency in how AI systems operate, making it easier for users to understand decision-making processes[20].
- Accountability: It is essential to establish clear lines of accountability for AI outcomes. When a machine learning model results in harm or error, the organization behind it must take responsibility and take corrective action[14][16].
- Diverse Perspectives: Including diverse perspectives in the design and testing phases can help minimize harmful biases and improve the ethical integrity of the AI system[19][20].

Regulatory and Legislative Initiatives
As the use of AI and ML technologies expands, governments worldwide are starting to introduce legislation aimed at regulating these systems. In the U.S., the Algorithmic Accountability Act, a bill introduced in Congress, focuses on addressing biases and ensuring ethical development in AI[16]. Legal frameworks of this kind are essential for ensuring that organizations are held to ethical standards during the deployment of machine learning solutions.

Moving Forward with Responsible AI
To effectively address potential biases and ethical challenges in AI deployment, organizations should establish internal review boards, conduct regular bias assessments, and provide training on AI ethics for their employees[17][18]. These practices will help ensure that AI systems not only enhance operational efficiency but also align with societal values and ethical standards, fostering a more responsible approach to machine learning deployment.

Future Trends in Machine Learning Deployment
As machine learning deployment continues to evolve, several trends are shaping its future, enhancing how organizations utilize and implement machine learning models in real-world applications. These trends focus on improving scalability, efficiency, and responsiveness to emerging technologies and market demands.

Serverless Architectures
One significant trend is the increasing adoption of serverless computing frameworks, such as AWS Lambda, Google Cloud Functions, and Azure Functions. These platforms enable businesses to run code in response to events without the need for managing servers, allowing for greater scalability and cost-efficiency in deployment processes. By leveraging serverless architectures, organizations can facilitate real-time predictions and streamline their deployment workflows, thus optimizing resource utilization and operational expenses[5][3].
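To make the event-driven style concrete, the sketch below follows the general shape of a Python function handler on AWS Lambda behind an HTTP proxy integration: it receives an `event` dictionary and a `context` object and returns a dictionary with a status code and a JSON body. The scoring rule and the `risk_band` field are hypothetical placeholders, and the exact event layout depends on how the function is invoked, so the `body` key is an assumption here.

```python
import json

def score(features):
    """Hypothetical placeholder standing in for a model loaded at cold start."""
    return {"risk_band": "low" if features.get("past_due_days", 0) == 0 else "high"}

def lambda_handler(event, context):
    """Handle one invocation: parse the JSON body, score it, return a JSON response."""
    try:
        features = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
    if not isinstance(features, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected a JSON object"})}
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(score(features)),
    }

# Local invocation with a hand-built event; no cloud account is needed for this check.
response = lambda_handler({"body": '{"past_due_days": 12}'}, None)
print(response["statusCode"], response["body"])
```

Because the platform creates and discards function instances on demand, heavier setup such as loading model weights is usually done once at module level rather than inside the handler.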

Edge Computing
Edge computing is gaining traction as a deployment strategy that involves executing machine learning models closer to the data source, such as on IoT devices or edge servers. This approach significantly reduces latency and bandwidth usage, enabling real-time predictions and decision-making, especially in environments with limited connectivity. Industries such as healthcare, manufacturing, and autonomous systems are particularly benefiting from edge deployment, which allows for immediate responses critical to operations[5][21].

MLOps Integration
The integration of MLOps (Machine Learning Operations) practices into deployment processes is becoming essential for organizations seeking to leverage machine learning effectively. MLOps encompasses the automation of workflows related to data processing, model training, deployment, and monitoring. By implementing MLOps, companies can reduce manual intervention, enhance model performance, and ensure consistent deployments, thereby maximizing the return on their investment in machine learning technologies[21].

Explainable AI
As machine learning systems become more integrated into decision-making processes, the need for explainability and transparency in models is increasingly critical. Organizations are adopting tools and practices that promote model interpretability, ensuring that machine learning outcomes are understandable and unbiased. This trend not only fosters stakeholder trust but also helps organizations comply with regulatory requirements, thereby promoting ethical standards in AI deployment[21].

Enhanced Automation
Automation is set to play a pivotal role in future deployment strategies. Automating the deployment process reduces human error, accelerates deployment timelines, and ensures consistency across different environments. By utilizing infrastructure-as-code tools and continuous integration/continuous deployment (CI/CD) pipelines, businesses can create reliable and scalable deployment environments, facilitating easier updates and modifications to their models[1][6].

These emerging trends indicate that machine learning deployment is transitioning towards more agile, efficient, and responsible practices, allowing organizations to harness the full potential of their data-driven insights and innovations.

### Model Deployment in 2024


### Hugging Face Spaces


https://www.youtube.com/watch?app=desktop&v=ZBXNyOPv6mM

https://www.youtube.com/watch?v=I9eK7idNCGw

### Streamlit Community Cloud

### Deployment with MLflow and GitHub Actions

### Content
- Loan_ID: a unique loan number assigned to each loan customer

- loan_status: whether a loan is paid off, in collection, held by a new customer yet to pay off, or paid off after collection efforts

- Principal: the basic principal loan amount at origination

- terms: the payoff schedule, which can be weekly (7 days), biweekly, or monthly

- effective_date: when the loan was originated and took effect

- due_date: the single due date of the loan, since each loan has a one-time payoff schedule

- paid_off_time: the actual time the customer paid off the loan

- past_due_days: how many days the loan has been past due

- age, education, Gender: the customer's basic demographic information

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder,LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer

dir_loc = "C:/Users/nboateng/OneDrive - Nice Systems Ltd/Documents/Deployment_Dashboards/modeldeployment2024/"
#data location on kaggle https://www.kaggle.com/datasets/zhijinzhai/loandata
#df = pd.read_csv("C:/Users/nboateng/OneDrive - Nice Systems Ltd/Documents/Deployment_Dashboards/Model_Deployment_2024/Loan payments data.csv", parse_dates = ['effective_date', 'paid_off_time','due_date'])
df = pd.read_csv( dir_loc + "Loan payments data.csv")

df.head()

Unnamed: 0,Loan_ID,loan_status,Principal,terms,effective_date,due_date,paid_off_time,past_due_days,age,education,Gender
0,xqd20166231,PAIDOFF,1000,30,9/8/2016,10/7/2016,9/14/2016 19:31,,45,High School or Below,male
1,xqd20168902,PAIDOFF,1000,30,9/8/2016,10/7/2016,10/7/2016 9:00,,50,Bechalor,female
2,xqd20160003,PAIDOFF,1000,30,9/8/2016,10/7/2016,9/25/2016 16:58,,33,Bechalor,female
3,xqd20160004,PAIDOFF,1000,15,9/8/2016,9/22/2016,9/22/2016 20:00,,27,college,male
4,xqd20160005,PAIDOFF,1000,30,9/9/2016,10/8/2016,9/23/2016 21:36,,28,college,female
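
The first rows suggest a relationship between the date columns and terms. As a small sanity check, the sketch below re-enters three of the rows shown above by hand (so it runs without the CSV file) and computes the number of days between effective_date and due_date. In these rows the gap is terms minus one; whether that holds for the full dataset is not verified here.

```python
from datetime import datetime

# Three rows copied by hand from the df.head() output: (terms, effective_date, due_date).
sample_rows = [
    (30, "9/8/2016", "10/7/2016"),
    (15, "9/8/2016", "9/22/2016"),
    (30, "9/9/2016", "10/8/2016"),
]


def days_between(start, end, fmt="%m/%d/%Y"):
    """Number of days from start to end, both given as month/day/year strings."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days


for terms, effective, due in sample_rows:
    gap = days_between(effective, due)
    print(f"terms={terms:>2}  effective={effective}  due={due}  gap_days={gap}")

gaps = [days_between(effective, due) for _, effective, due in sample_rows]
```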


In [5]:
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer




def month_date(df):
    """Return the month (1-12) of every date column in df."""
    return df.apply(lambda x: pd.to_datetime(x).dt.month)


def quarter_date(df):
    """Return the calendar quarter (1-4) of every date column in df."""
    return df.apply(lambda x: pd.to_datetime(x).dt.quarter)


def cast_float(df):
    """Cast every column in df to float."""
    return df.astype(float)


def to_object(x):
    """Convert the input to a DataFrame of object dtype."""
    return pd.DataFrame(x).astype(object)


def to_numeric(x):
    """Convert the input to a DataFrame of float dtype."""
    return pd.DataFrame(x).astype(float)


# Wrap the helper functions so they can be used as pipeline steps
fun_tr = FunctionTransformer(to_object)
num_tr = FunctionTransformer(to_numeric)
get_month_date = FunctionTransformer(month_date)
get_quarter_date = FunctionTransformer(quarter_date)
cast_col_float = FunctionTransformer(to_numeric)




ct = make_column_transformer(
    (get_month_date, ['due_date', 'paid_off_time']),
    (get_quarter_date, ['effective_date'])
)


cast_num = make_column_transformer(
    (cast_float, ['Principal', 'terms', 'past_due_days', 'age'])
)

import datetime as dt
from datetime import datetime


ct.fit_transform(df)

#cast_num.fit_transform(df)

#y = fun_tr.fit_transform(pd.DataFrame({'a':[1,2,3]}))
#y = num_tr.fit_transform(pd.DataFrame({'a':[1,2,3]}))
y = num_tr.fit_transform(df[['Principal','terms']])
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Principal  500 non-null    float64
 1   terms      500 non-null    float64
dtypes: float64(2)
memory usage: 7.9 KB
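
To make the role of FunctionTransformer concrete, the following self-contained sketch wraps a quarter-extraction function in the same way as above and applies it to a two-row toy frame. The toy dates, the function name quarter_of, and the explicit format string are assumptions added for illustration; the notebook's own functions call pd.to_datetime without a format.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def quarter_of(frame):
    """Replace every date-string column with the calendar quarter (1-4) of each date."""
    return frame.apply(lambda col: pd.to_datetime(col, format="%m/%d/%Y").dt.quarter)


# FunctionTransformer gives a plain function the fit/transform interface,
# so it can be used as a step in a Pipeline or ColumnTransformer.
to_quarter = FunctionTransformer(quarter_of)

toy = pd.DataFrame({
    "effective_date": ["9/8/2016", "12/30/2016"],
    "due_date": ["10/7/2016", "1/28/2017"],
})

quarters = to_quarter.fit_transform(toy)
print(quarters)
```

Because the wrapped function is stateless, fit does nothing here and the same transformation is applied at training and at prediction time.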


from sklearn.compose import make_column_selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder,StandardScaler,OneHotEncoder
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
from xgboost import XGBClassifier,XGBRegressor


# Features and target variable
X = df.drop(columns=['Loan_ID',	'loan_status'])
y = df['loan_status']

# Numeric columns to be imputed
numeric_features = ['Principal', 'terms', 'past_due_days', 'age']

# Categorical columns to be ordinal encoded
categorical_features = ['education', 'Gender']

# Create transformers
numeric_transformer = Pipeline(steps=[
    #('imputer', SimpleImputer(strategy='median')),
    ('imputer', SimpleImputer(fill_value= -9999, strategy='constant'))
    #('scaler', MinMaxScaler())
])
categorical_transformer = Pipeline(steps=[
    # KBinsDiscretizer accepts only numeric input, so it is not applied to these string columns
    #('bin', KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='quantile')),
    #('encoder', OneHotEncoder(handle_unknown='ignore'))
    ('encoder', OrdinalEncoder(handle_unknown= 'use_encoded_value', unknown_value=-1)),
])

# Combine all transformers into a preprocessor using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create and evaluate the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', LogisticRegression())])
pipeline

In [9]:
from sklearn.compose import make_column_selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder,StandardScaler,OneHotEncoder
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
from xgboost import XGBClassifier,XGBRegressor
from sklearn import *
from sklearn.metrics import roc_curve, roc_auc_score,balanced_accuracy_score,f1_score,confusion_matrix

# Numeric columns to be imputed and cast to float
numeric_features = ['Principal', 'terms', 'past_due_days', 'age']

# Categorical columns to be ordinal encoded
categorical_features = ['education', 'Gender']

date_cols =  ['due_date','paid_off_time','effective_date']



target_column  =  ['loan_status']

feat_list  =  numeric_features  +  categorical_features +   date_cols  



categorical_transformer = Pipeline(steps=[
    #('encoding',LabelEncoder()),
    ('ordinal', OrdinalEncoder(handle_unknown= 'use_encoded_value', unknown_value=-1)),
     #('label_encoding',MyLabelEncoder()),
    #('imputer', SimpleImputer(fill_value= -9999, strategy='constant'))
     ('imputer', SimpleImputer( strategy='median')),
    #('encoding',OrdinalEncoder(categories='auto'))
   

])


numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(fill_value=-9999, strategy='constant', add_indicator=True)),
    #('scaler', StandardScaler()) 
     ('cast_float',cast_col_float)
    ])


date_transformer = Pipeline(steps=[
    ('date', get_quarter_date)
    ])




xgb_classifier = xgb.XGBClassifier(
    seed=1,
    n_jobs=-1,
    max_depth=5,
    learning_rate=0.1,
    min_child_weight=2,
    n_estimators=50,
    # 'verbose' is not an XGBClassifier parameter; it triggers the "not used" warning on fit
    # (verbosity below is the supported option)
    verbose=1,
    gamma=0.05,
    objective='multi:softmax',
    num_class=3,
    # built-in multiclass metric; sklearn's metrics.auc is not an XGBoost evaluation metric
    eval_metric='mlogloss',
    subsample=0.7,
    colsample_bytree=0.8,
    max_delta_step=1,
    verbosity=1,
    tree_method='hist')

# Combine all transformers into a preprocessor using ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric_transform', numeric_transformer, numeric_features),
        ('date_eng_feat',date_transformer,date_cols),
        ('categorical_encoding', categorical_transformer, categorical_features),
        #('label_encoder', label_transformer, target_column)
       
        ])


pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      #('dropfeature',UniqueDropper()),
                      #('anova', SelectPercentile(chi2)),
                      # ('feature_selection', SelectFromModel(LinearSVC(penalty="l1"))),
                      ('classifier',xgb_classifier )])


# Define the label encoder
label_encoder = LabelEncoder()


from sklearn.model_selection import train_test_split

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(df[feat_list], df[target_column], test_size=0.3, random_state=42)


pipeline.fit(X=X_train, y= label_encoder.fit_transform(y_train.values.ravel()))


Parameters: { "verbose" } are not used.



In [11]:
# print train and test set shape
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

# generate predictions
y_pred = pipeline.predict(X_test)
#y_pred_prob = pipeline.predict_proba(X_test)[:, 1]
y_pred_prob = pipeline.predict_proba(X_test)
y_pred_label = pipeline.predict(X_test)
# Reuse the encoder fitted on the training labels instead of refitting it on the test labels
y_test_label = label_encoder.transform(y_test.values.ravel())


auc = roc_auc_score(y_test_label,y_pred_prob,multi_class='ovr')
print('auc :{}'.format(auc))

balanced_accuracy = balanced_accuracy_score(y_test_label,y_pred_label)
print('balanced_accuracy  :{}'.format(balanced_accuracy ))
#f1 = f1_score(y_test_label,y_pred_label,'micro')
f1 = f1_score(y_test_label, y_pred_label, average='macro')
print('f1  :{}'.format(f1 ))


import joblib

joblib.dump(pipeline, dir_loc + 'pipeline.pkl')
y_pred_label


X_train: (350, 9)
X_test: (150, 9)
y_train: (350, 1)
y_test: (150, 1)
auc :1.0
balanced_accuracy  :1.0
f1  :1.0


array([0, 2, 0, 2, 2, 0, 0, 2, 2, 1, 2, 2, 1, 2, 0, 0, 1, 2, 0, 1, 1, 1,
       2, 0, 2, 1, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 1, 2, 2, 2, 0, 2, 1, 2,
       2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 0, 1, 2, 2,
       1, 2, 2, 0, 0, 0, 2, 2, 0, 0, 2, 2, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1,
       0, 2, 0, 1, 0, 0, 1, 0, 2, 2, 1, 2, 2, 1, 2, 0, 0, 2, 0, 2, 0, 2,
       2, 1, 0, 2, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 0, 2, 2,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1])
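
Scores of exactly 1.0 on a held-out set are worth a second look, because they often mean that a feature reveals the target. Here past_due_days and paid_off_time are plausible candidates: in the five rows shown earlier, every PAIDOFF loan has an empty past_due_days. The sketch below shows one hedged way to check for this, by measuring how often a column is missing within each class. The toy records are hypothetical and only imitate that pattern; running the same function on the real data would be needed to confirm it.

```python
from collections import defaultdict


def missing_rate_by_class(records, label_key, column):
    """Share of records in each class whose value for `column` is None."""
    totals, missing = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec[label_key]] += 1
        if rec[column] is None:
            missing[rec[label_key]] += 1
    return {label: missing[label] / totals[label] for label in totals}


# Hypothetical records imitating the suspected pattern, not rows from the dataset.
toy_records = [
    {"loan_status": "PAIDOFF", "past_due_days": None, "paid_off_time": "9/14/2016"},
    {"loan_status": "PAIDOFF", "past_due_days": None, "paid_off_time": "10/7/2016"},
    {"loan_status": "COLLECTION", "past_due_days": 36, "paid_off_time": None},
    {"loan_status": "COLLECTION", "past_due_days": 60, "paid_off_time": None},
    {"loan_status": "COLLECTION_PAIDOFF", "past_due_days": 5, "paid_off_time": "10/12/2016"},
]

for column in ("past_due_days", "paid_off_time"):
    rates = missing_rate_by_class(toy_records, "loan_status", column)
    # A column that is always missing in one class and never in another leaks the label.
    print(column, rates)
```

If the real data shows the same pattern, these columns are known only after the loan outcome is known, and a model meant to predict the outcome in advance should be evaluated without them.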

In [None]:
target_map = {'COLLECTION':0,
  'PAIDOFF':2,
  'COLLECTION_PAIDOFF':1}

# Reverse the encoding
original_labels = label_encoder.inverse_transform(y_test_label)
print(original_labels)

target_map['COLLECTION']

['COLLECTION' 'PAIDOFF' 'COLLECTION' 'PAIDOFF' 'PAIDOFF' 'COLLECTION'
 'COLLECTION' 'PAIDOFF' 'PAIDOFF' 'COLLECTION_PAIDOFF' 'PAIDOFF' 'PAIDOFF'
 'COLLECTION_PAIDOFF' 'PAIDOFF' 'COLLECTION' 'COLLECTION'
 'COLLECTION_PAIDOFF' 'PAIDOFF' 'COLLECTION' 'COLLECTION_PAIDOFF'
 'COLLECTION_PAIDOFF' 'COLLECTION_PAIDOFF' 'PAIDOFF' 'COLLECTION'
 'PAIDOFF' 'COLLECTION_PAIDOFF' 'COLLECTION_PAIDOFF' 'PAIDOFF' 'PAIDOFF'
 'COLLECTION' 'COLLECTION_PAIDOFF' 'COLLECTION' 'COLLECTION_PAIDOFF'
 'PAIDOFF' 'PAIDOFF' 'COLLECTION' 'COLLECTION_PAIDOFF' 'PAIDOFF' 'PAIDOFF'
 'PAIDOFF' 'COLLECTION' 'PAIDOFF' 'COLLECTION_PAIDOFF' 'PAIDOFF' 'PAIDOFF'
 'PAIDOFF' 'PAIDOFF' 'PAIDOFF' 'COLLECTION_PAIDOFF' 'PAIDOFF' 'PAIDOFF'
 'PAIDOFF' 'PAIDOFF' 'COLLECTION_PAIDOFF' 'PAIDOFF' 'PAIDOFF' 'PAIDOFF'
 'PAIDOFF' 'PAIDOFF' 'PAIDOFF' 'COLLECTION_PAIDOFF' 'PAIDOFF' 'COLLECTION'
 'COLLECTION_PAIDOFF' 'PAIDOFF' 'PAIDOFF' 'COLLECTION_PAIDOFF' 'PAIDOFF'
 'PAIDOFF' 'COLLECTION' 'COLLECTION' 'COLLECTION' 'PAIDOFF' 'PAIDOFF'
 'COLLECTIO

KeyError: 0
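
A KeyError: 0 of this kind is what Python raises when a dictionary keyed by label names, such as target_map, is indexed with an integer code instead of a name. Assuming that is what happened in an earlier run of the cell, the sketch below builds the reverse mapping so that integer predictions can be translated back to names. The name code_to_label is hypothetical.

```python
# Same mapping as in the cell above: label name -> integer code.
target_map = {'COLLECTION': 0,
              'PAIDOFF': 2,
              'COLLECTION_PAIDOFF': 1}

# Invert it so that integer codes can be looked up: code -> label name.
code_to_label = {code: label for label, code in target_map.items()}

# First few predicted codes from the output shown earlier.
predicted_codes = [0, 2, 0, 2, 2, 0, 0, 2, 2, 1]
predicted_labels = [code_to_label[code] for code in predicted_codes]
print(predicted_labels)
```

The fitted label_encoder offers the same translation through inverse_transform, as the cell above does for y_test_label.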

In [13]:
y_test_label

array([0, 2, 0, 2, 2, 0, 0, 2, 2, 1, 2, 2, 1, 2, 0, 0, 1, 2, 0, 1, 1, 1,
       2, 0, 2, 1, 1, 2, 2, 0, 1, 0, 1, 2, 2, 0, 1, 2, 2, 2, 0, 2, 1, 2,
       2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 0, 1, 2, 2,
       1, 2, 2, 0, 0, 0, 2, 2, 0, 0, 2, 2, 1, 1, 2, 2, 1, 2, 0, 2, 2, 1,
       0, 2, 0, 1, 0, 0, 1, 0, 2, 2, 1, 2, 2, 1, 2, 0, 0, 2, 0, 2, 0, 2,
       2, 1, 0, 2, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 0, 2, 2,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1])

In [14]:
#pip install --upgrade modelbit --user

In [46]:

# install modelbit
#!pip install modelbit

# run on top of your notebook
import modelbit
mb = modelbit.login()

### Creating an Inference function
The first step is to create a Python inference function that calls the predict or predict_proba method of the fitted scikit-learn pipeline.

In [42]:
import pandas as pd
import numpy as np

# first define function
def predict_loan_default(Principal=0, terms=0,past_due_days=0,
                        age=0,education = 'High School' ,Gender = 'male',
                        due_date= '9/25/2016',paid_off_time='9/25/2016' ,effective_date = '9/25/2016' ):
#def predict_loan_default(Principal:float, terms:float,past_due_days:float,age:float,education:str ,Gender:str,due_date:str ,paid_off_time:str ,effective_date:str)-> float:
   
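  """
  Predict loan-status class probabilities using the fitted machine learning pipeline.

  Args:
      Principal (int): loan principal
      terms (int): payoff schedule in days
      past_due_days (int): number of days the loan has been past due
      age (int): customer age
      education (str): customer education level
      Gender (str): customer gender
      due_date (str): loan due date, e.g. '9/25/2016'
      paid_off_time (str): time the loan was paid off
      effective_date (str): date the loan took effect

  Returns:
      dict: probability of each loan status, keyed by class name.
  """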
  data = pd.DataFrame([[Principal, terms,past_due_days,
                        age,education ,Gender,
                        due_date,paid_off_time ,effective_date ]],
                                             columns = [ 'Principal', 'terms','past_due_days','age','education' ,'Gender','due_date' ,'paid_off_time' ,'effective_date'])
  
  for col in [ 'Principal', 'terms','past_due_days','age']:
    data[col] = data[col].astype(float)
  pred_prob = pipeline.predict_proba(data) 
  

  #pred_label = pipeline.predict(pd.DataFrame([[Principal, terms,past_due_days,age,education ,Gender,due_date ,paid_off_time ,effective_date ]],
  #                                           columns = [ 'Principal', 'terms','past_due_days','education' ,'Gender','due_date' ,'paid_off_time' ,'effective_date'])) 
  
  #return data
  # predict_proba columns follow label_encoder.classes_, which LabelEncoder sorts alphabetically
  return  dict(zip(['COLLECTION','COLLECTION_PAIDOFF','PAIDOFF'],pred_prob[0]))

In [44]:
predict_loan_default(800, 15,74,29,'High School or Below','male','9/25/2016','9/12/2016' ,'9/11/2016' )
#print(len([800, 15,74,29,'High School or Below','male','9/25/2016','9/12/2016' ,'9/11/2016' ]))
#print(len([ 'Principal', 'terms','past_due_days','age','education' ,'Gender','due_date' ,'paid_off_time' ,'effective_date']))
#print(len([Principal, terms,past_due_days,age,education ,Gender,due_date ,paid_off_time ,effective_date ]))

{'COLLECTION': 0.052470516,
 'COLLECTION_PAIDOFF': 0.91186875,
 'PAIDOFF': 0.03566072}

In [33]:
d= pd.DataFrame([[800, 15,74,29,'High School or Below','male','9/25/2016','9/12/2016' ,'9/11/2016' ]],
                                             columns = [ 'Principal', 'terms','past_due_days','age','education' ,'Gender','due_date' ,'paid_off_time' ,'effective_date'])

d['past_due_days']  = d['past_due_days'].astype(float)

print(pipeline.predict_proba(d).argmax() )
print(pipeline.predict_proba(d)[0] )

#X_test.info()
#pipeline.predict_proba()

#pipeline.predict_proba(X_test)
#print(d)
#X_test.head()
d.info()

#dict(zip(['COLLECTION','PAIDOFF','COLLECTION_PAIDOFF'],pipeline.predict_proba(d)[0]))

#predict_loan_default([800, 15,74,29,'High School or Below','male','9/25/2016','9/12/2016' ,'9/11/2016' ])

1
[0.05247052 0.91186875 0.03566072]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Principal       1 non-null      int64  
 1   terms           1 non-null      int64  
 2   past_due_days   1 non-null      float64
 3   age             1 non-null      int64  
 4   education       1 non-null      object 
 5   Gender          1 non-null      object 
 6   due_date        1 non-null      object 
 7   paid_off_time   1 non-null      object 
 8   effective_date  1 non-null      object 
dtypes: float64(1), int64(3), object(5)
memory usage: 204.0+ bytes


In [17]:
#!pip install fast_ml --upgrade
#predict_loan_default(29, 800, 15,74,'High School or Below','male','9/25/2016','' ,'9/11/2016')
#X_test.head()
drug_map = {0: "DrugY", 3: "drugC", 4: "drugX", 1: "drugA", 2: "drugB"}
drug_map[1]

'drugA'

### Deploy Machine Learning Pipeline on the cloud using Docker Container

### Build Web Application
This tutorial does not focus on building a Flask application; it is covered here only for completeness. Now that the machine learning pipeline is ready, we need a web application that can connect to the trained pipeline and generate predictions on new data points in real time. The web application is built with the Flask framework in Python and has two parts:

- Front-end (designed using HTML)
- Back-end (developed using Flask)
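
As a rough sketch of the back-end part, a Flask route can accept a JSON payload and return the model's output as JSON. Everything specific below is an assumption for illustration: the /predict route, the payload fields, and the stub predict function that stands in for the trained pipeline so that the example runs on its own.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_stub(features):
    """Hypothetical stand-in for the trained pipeline; returns fixed class probabilities."""
    return {"COLLECTION": 0.1, "COLLECTION_PAIDOFF": 0.2, "PAIDOFF": 0.7}


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        # Reject requests whose body is not a JSON object.
        return jsonify({"error": "expected a JSON object"}), 400
    return jsonify(predict_stub(payload))


# Exercise the route without starting a server, using Flask's built-in test client.
client = app.test_client()
response = client.post("/predict", json={"Principal": 800, "terms": 15, "age": 29})
print(response.status_code, response.get_json())
```

In a real application the stub would be replaced by a call into the pipeline loaded from pipeline.pkl, and the front-end would post its form values to this route.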