# Machine Learning Lifecycle I

The machine learning lifecycle is the process of developing, deploying, and maintaining a machine learning model for a specific application. It serves as our guideline for approaching the solutions required for our use case.

Below are the first three steps of the machine learning lifecycle.

## Problem Identification and Understanding

This is the first step in the lifecycle. It is crucial because it sets the direction for the project, and it involves defining the problem and setting the objectives needed to resolve it.

**Define the Problem**: The first task in this stage is to clearly define the problem. This involves understanding what we want to achieve and how a machine learning model can assist in achieving that goal. In this instance, the module guides us toward creating a model that recognizes American Sign Language hand signs shown to a camera.

**Determine the Machine Learning Task**: Once the problem is defined, it's necessary to define the machine learning task based on that problem. This could involve deciding whether the problem is a classification problem, a regression problem, a clustering problem, etc. For our case, the task at hand would be a classification problem.

**Define the Optimization Objective**: The next step would be to determine the key business performance metrics that the machine learning model should aim to improve. These metrics should align with the overall objectives. For our use case, if the goal is to ensure an accurate recognition of the hand sign shown in the picture, the optimization objective might be to reduce the error rate of the model.
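As a rough illustration of the error-rate objective, the sketch below computes the fraction of wrong predictions for a classifier. The function name `error_rate` and the sample labels are hypothetical and chosen only for this example; they are not results from a real model.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that do not match the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Hypothetical labels for five hand-sign images (illustrative only)
y_true = ["A", "B", "C", "A", "B"]
y_pred = ["A", "B", "A", "A", "C"]

rate = error_rate(y_true, y_pred)
print(f"error rate: {rate:.2f}, accuracy: {1 - rate:.2f}")
```

Accuracy is simply one minus the error rate, so minimizing one is equivalent to maximizing the other.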

**Review Data Requirements**: Reviewing data requirements involves determining what data is needed to solve the problem and whether the necessary data is available. This might also involve considering the cost of data acquisition and whether external data sources might improve model performance. For our case, we will need to collect images of hand signs from American Sign Language.

## Data Collection and Exploration

This step in the lifecycle involves gathering the data needed to train the model and understanding its characteristics. Depending on the data needed, this stage is often the most expensive and time-consuming in the lifecycle. In our case, American Sign Language datasets are common, and free versions are publicly available on the Internet, so we should not encounter many problems at this stage.

**Data Collection**: During data collection, data relevant to the problem at hand is gathered. The data may come from various sources, such as databases, APIs, web scraping, third-party providers, or even experiments or surveys. The collected data should contain the features necessary to solve the problem and, for supervised learning tasks, the target variable.

**Data Exploration**: Once the data is collected, it needs to be explored to understand its characteristics. This is often done using descriptive statistics and data visualization techniques. This stage helps identify patterns, relationships, or anomalies in the data. During this stage, it's also important to understand the context of the data: where it comes from, how it was collected, what each feature represents, etc. This can provide valuable insights that can guide the next stages of the lifecycle. 

**Importance of Data Exploration**: Data exploration can help identify potential issues with the data, such as missing or inconsistent values, outliers, or imbalanced classes. It can also help in understanding the distribution of the data, the relationship between variables, or the presence of subgroups in the data. The insights gained from the data exploration stage can inform the subsequent Data Processing and Feature Engineering stage, where the raw data is cleaned and transformed into a form suitable for machine learning.
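One simple exploration check is to look at how the class labels are distributed, since imbalanced classes are among the issues mentioned above. The following is a minimal sketch using only the standard library; the helper names and the label counts are made up for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return the share of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio between the largest and the smallest class size."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical label column: class "C" is heavily under-represented
labels = ["A"] * 50 + ["B"] * 45 + ["C"] * 5

dist = class_distribution(labels)
ratio = imbalance_ratio(labels)
print(dist)
print(f"largest class is {ratio:.0f}x the smallest")
```

A large ratio suggests collecting more samples for the rare class or accounting for the imbalance during training.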

## Data Processing/Feature Engineering

Data processing or feature engineering is an essential step in the machine learning lifecycle as it involves transforming the raw data into a format that is suitable for training machine learning models and extracting meaningful features from the data.

Feature engineering deserves special attention because it involves creating new features or transforming existing features to better represent the underlying patterns in the data. It requires domain knowledge and a deep understanding of the problem at hand. The goal is to extract relevant information from the raw data that can improve the model's performance.

### Data Processing

**Data Cleaning**: This step involves handling missing values, outliers, and inconsistencies in the data. Missing values can be imputed using techniques such as mean imputation, median imputation, or using advanced imputation methods like K-nearest neighbors or regression imputation. Outliers can be detected and treated by methods like Z-score, IQR, or using domain knowledge to determine if they are valid or erroneous. Inconsistent data can be resolved by standardizing units, resolving conflicting values, or correcting errors.
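The sketch below illustrates two of the techniques named above, median imputation and IQR-based outlier detection, with the standard library only. It assumes missing values are represented as `None` and uses the common 1.5 × IQR rule; the function names and the sample values are illustrative.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with two missing entries and one extreme value
raw = [10, 12, None, 11, 13, 12, 95, None, 11]

clean = impute_median(raw)
outliers = iqr_outliers(clean)
print(clean)
print("flagged as outliers:", outliers)
```

Whether a flagged value is an error or a valid rare observation still has to be decided with domain knowledge.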

**Data Splitting**: The collected data is typically split into training, validation, and test sets. The training set is used to train the model, the validation set is used to tune hyperparameters and evaluate different model variations, and the test set is used to assess the final model's performance. The splitting ratio depends on the size of the dataset and the specific use case.
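A three-way split can be sketched as follows. The 70/15/15 ratio, the fixed seed, and the function name `train_val_test_split` are assumptions made for this example, not a recommendation for every dataset.

```python
import random


def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the items and split them into train, validation, and test lists."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical dataset of 100 sample ids
samples = list(range(100))
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))
```

Shuffling before splitting avoids accidentally placing all samples of one class, or all samples from one recording session, into the same subset.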

**Data Normalization and Scaling**: It is common to normalize or scale the data to ensure that different features have similar scales. This helps prevent certain features from dominating the learning process due to their larger magnitude. Techniques like min-max scaling, z-score standardization, or logarithmic scaling can be used depending on the distribution and characteristics of the data.
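The two most common techniques above can be written in a few lines. This is a minimal sketch with hypothetical pixel intensities; it uses the population standard deviation for the z-score, which is one of several reasonable conventions.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def z_score_standardize(values):
    """Shift and scale values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical 8-bit pixel intensities
pixels = [0, 64, 128, 255]

scaled = min_max_scale(pixels)
standardized = z_score_standardize(pixels)
print(scaled)
print(standardized)
```

To avoid leaking information, the minimum, maximum, mean, and standard deviation should be computed on the training set and then reused for the validation and test sets.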

### Feature Engineering

**Feature Extraction**: This involves extracting relevant information from the raw data. For example, extracting the day of the week from a timestamp, extracting keywords from text data, or extracting color histograms from images. Feature extraction can be done using methods like Principal Component Analysis (PCA), Fourier Transform, or using domain-specific algorithms.
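Two of the examples above, the day of the week from a timestamp and a histogram from image pixels, can be sketched with the standard library. The sketch assumes ISO 8601 timestamp strings and a flat list of grayscale intensities from 0 to 255 (a simplification of a color histogram); the function names are illustrative.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def day_of_week(timestamp):
    """Return the weekday name for an ISO 8601 timestamp string."""
    return WEEKDAYS[datetime.fromisoformat(timestamp).weekday()]


def intensity_histogram(pixels, bins=4, max_value=255):
    """Count how many pixel intensities fall into each of `bins` equal ranges."""
    counts = [0] * bins
    bin_width = (max_value + 1) / bins
    for p in pixels:
        index = min(int(p / bin_width), bins - 1)
        counts[index] += 1
    return counts


print(day_of_week("2024-01-01T09:30:00"))

# Hypothetical grayscale pixel values
histogram = intensity_histogram([0, 10, 70, 130, 200, 255])
print(histogram)
```

For a color image, the same histogram could be computed once per channel and the results concatenated into a single feature vector.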

**Feature Transformation**: This involves transforming the data to meet certain assumptions or improve its distribution. Common transformations include logarithmic transformations, square root transformations, or Box-Cox transformations. These transformations can help normalize the data, reduce skewness, or make it more suitable for linear models.
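The effect of a logarithmic transformation on skewness can be checked directly. The sketch below uses `log1p` (the logarithm of 1 + x) so that zero values are handled, and a simple moment-based skewness estimate; the sample values are invented to be strongly right-skewed.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def log_transform(values):
    """Apply log(1 + x) to each non-negative value."""
    return [math.log1p(v) for v in values]


# Hypothetical right-skewed feature with one very large value
values = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

before = skewness(values)
after = skewness(log_transform(values))
print(f"skewness before: {before:.2f}, after: {after:.2f}")
```

The transformed feature is still right-skewed in this example, only less so; Box-Cox is an option when a stronger or tunable correction is needed.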

**Feature Selection**: Feature selection aims to identify the features that contribute the most to the model's predictive power. This helps reduce dimensionality and improve model interpretability. Techniques like correlation analysis, forward/backward selection, or regularization methods such as L1 (Lasso) regularization, which can shrink the coefficients of uninformative features to exactly zero, can be used for feature selection. L2 regularization shrinks coefficients as well but does not remove features on its own.
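Correlation analysis is the simplest of these techniques to sketch. The example below keeps the features whose absolute Pearson correlation with a numeric target reaches a threshold. The threshold of 0.5, the feature names, and the data are assumptions for illustration; a categorical target such as a hand-sign label would need a different relevance measure.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(features, target, threshold=0.5):
    """Keep feature names whose |correlation| with the target meets the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]


# Hypothetical feature columns and numeric target
features = {
    "f_signal": [1, 2, 3, 4, 5],
    "f_noise": [3, 1, 3, 1, 3],
    "f_constant": [7, 7, 7, 7, 7],
}
target = [2, 4, 6, 8, 10]

selected = select_by_correlation(features, target)
print(selected)
```

Correlation only captures linear relationships, so a feature dropped by this filter may still be useful to a non-linear model.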

If no pre-built model exists that solves our problem, you can build your own customized model from scratch. Once you have collected the dataset required for training, make sure before you start that you have hardware capable of supporting the computationally intensive processes that training requires. Generic laptops and desktop computers often cannot handle the load needed to train your model.

To solve this, you can use the GPU resources of cloud services such as **Amazon Web Services** and **Google Colab**.

You can develop your own customized model in Python using Jupyter Notebook. The following libraries are useful during development:

**Pandas and NumPy libraries** for accessing and modifying tabular data structures and n-dimensional arrays, and for performing exploratory data analysis. Pandas also lets you read CSV, JSON, and TSV data files.

**Matplotlib and Seaborn libraries** for the data visualization phase, which requires plotting charts and graphs.

**Scikit-learn, TensorFlow, MXNet, PyTorch,** and **Keras** framework libraries for the actual training of the model.