# Main Report Notebook - Aidan Stack - 2021

This notebook details the entire CRISP-DM process for this project. This report is a technical summary that breaks down the techniques and results for every step in this project. The code for each section can be found in the relevant notebooks, all of which reside in the 'notebooks' folder in this repository. 

##  1. Business Context

Terms like ‘AI’, ‘Machine Learning’ and ‘Neural Networks’ usually bring to mind images of IBM’s [Watson](https://www.ibm.com/watson) beating some of humankind’s most knowledgeable representatives at Jeopardy, or Google’s [AlphaGo](https://deepmind.com/research/case-studies/alphago-the-story-so-far) beating the world’s top ranked player at Go, a game widely respected as one of the most complex strategy games of all time. What probably doesn’t spring to mind is the deployment of these technologies in an industry like steel manufacturing. However, in much the same way that computers have become ubiquitous in all businesses over the last 70 years, organizations of all kinds are realizing the power of machine learning solutions. In a shift sometimes called [‘Industry 4.0’](https://www.n-ix.com/computer-vision-manufacturing/), manufacturers are adopting AI, deep learning, and computer vision to improve product quality, reduce costs, and increase efficiency. AI has made its way into every step of the manufacturing process, from supply chain to inventory management, but the focus of this project is computer vision for defect detection.

The dataset for this project comes from a [Kaggle competition](https://www.kaggle.com/c/severstal-steel-defect-detection) put out by Russian steel production company [Severstal](https://www.severstal.com/eng/about/), who are looking to utilize machine learning to ‘improve automation, increase efficiency, and maintain quality’ throughout their production process. They are part of the global movement of manufacturing companies towards increased use of AI, and understand the value unlocked by applied machine learning. The specific AI application they are looking to improve with this Kaggle competition is the detection of defects in steel using images of sheet steel from their production process. Detection of defects using computer vision could be integrated into the manufacturing pipeline to help reduce costs and material waste. Steel is the world’s [second largest commodity item](https://secure.fia.org/files/css/magazineArticles/article-1410.pdf) behind crude oil, with [1.8 billion](https://www.weforum.org/agenda/2021/06/global-steel-production/#:~:text=Consumption%20and%20Production-,Steel%20is%20the%20foundation%20of%20our%20buildings%2C%20vehicles%2C%20and%20industries,scaling%20down%20their%20domestic%20production.) metric tons of crude steel produced in 2020 alone. The sheer size of this industry means that even small efficiency gains from AI could be well worth the investment. Additionally, [how appropriate AI is for various real world applications](https://www.ml.cmu.edu/research/joint_phd_dissertations/dissertation_dearteaga.pdf) is often judged in the context of low volume, high stakes vs high volume, low stakes. For example, looking at an X-ray and determining whether a patient has pneumonia is a classic low volume, high stakes image classification task; here the accountability and expert eye of human labor often make the most sense.
AI is often a better fit for high volume, low stakes contexts where an error is much more tolerable, but the volume is too high for human labor to make sense. The business context of this task supplied by Severstal is a clear example of this high volume, low stakes paradigm. Computer vision is an appropriate choice, and could be leveraged to improve the production of a commodity item with huge annual volume and a relatively high tolerance for errors.

In terms of the technology used for this task, neural networks are an obvious choice for a computer vision problem, and they are the tool used for this project. Even simple artificial neural networks can perform well in image classification tasks due to their ability to cope with unstructured data such as images. The specific business use cases for the different network architectures are laid out below, under Data Understanding (EDA). Neural networks do, however, require significant computational resources, so this entire project was done on the cloud, leveraging the power and scalability of AWS.


## 2. Data Preparation and Environment Set Up 



The number of images in the dataset and their relatively high resolution meant that this project was going to be computationally intensive, so all steps were performed on the cloud, using AWS. The images and CSVs were downloaded locally from Kaggle, then uploaded to an Amazon S3 bucket. Using Amazon SageMaker, an ml.m5.2xlarge notebook instance was created, with a volume size of 100GB. Multiple separate notebooks were created inside this instance, all with AWS's built-in conda_tensorflow2_p36 environment.

In order for the neural networks to be able to interpret the training images, the images must first undergo several transformations. The first is to convert the .jpg files into raw arrays. The dataset was fairly clean; all of the images have the same dimensions of 256 by 1600 pixels, and have only 1 color channel, meaning they are grayscale. To reduce compute time, the images were resized to 256 by 256 upon import. The image arrays were then scaled from a range of 0 to 255 to a range of 0 to 1. The arrays were also flattened to be 1-dimensional for use in the artificial neural networks. In the convolutional networks notebook, the arrays are reshaped again to a format of 256 x 256 x 1.
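
The transformations described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `preprocess` helper is hypothetical, the input is a synthetic array standing in for a decoded .jpg, and nearest-neighbour column sampling is used as a simple stand-in for whatever resize method the notebooks applied.

```python
import numpy as np

def preprocess(image, target_width=256):
    """Shrink a grayscale image's width, scale it to [0, 1], and flatten it."""
    height, width = image.shape
    # Assumption: nearest-neighbour column sampling stands in for a library resize.
    cols = np.arange(target_width) * width // target_width
    resized = image[:, cols]
    # Scale pixel values from 0-255 down to 0-1.
    scaled = resized.astype(np.float32) / 255.0
    # Flatten to one dimension for the dense (artificial) networks.
    return scaled.reshape(-1)

# Synthetic stand-in for one 256 x 1600 grayscale steel image.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(256, 1600), dtype=np.uint8)

flat = preprocess(raw)                  # input format for the dense networks
cnn_input = flat.reshape(256, 256, 1)   # input format for the convolutional networks
print(flat.shape, cnn_input.shape)
```

Flattening and reshaping do not change the pixel values, only how they are arranged, which is why the same saved arrays can serve both kinds of network.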

## 3. Data Understanding (EDA)

The original dataset includes 18,074 total images, 12,568 in the training set and 5,506 in the test set. All the images are grayscale and come in a resolution of 1600 pixels wide by 256 pixels tall. Included with the images is a CSV that lists all the images that contain defects. Each row is indexed by the name of a .jpg file and states what class of defect is in the image, along with data on which pixels in the image make up that defect. Images that are listed more than once in this CSV contain more than one class of defect, and images from the training set not listed in this CSV at all are images where no defect is present. Because the original dataset comes from a Kaggle competition, ground truth labels are not available for the test set, so going forward the 12,568 images in the training set will be treated as the entire dataset.

Out of the 12,568 total training images, 5,902 (47%) do not exhibit defects, while 6,666 (53%) exhibit at least one class of defect. While there are images with multiple classes of defect present, 97% of the images have either no defect or only one class of defect.

There are four classes of defect, labelled in the CSV provided simply as 1 through 4, not to be confused with the number of classes of defect present in each image. 
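
As a rough illustration of how the CSV structure described above translates into these counts, the sketch below groups defect classes by image. The column names `ImageId` and `ClassId`, the filenames, and the tiny inline CSV are all assumptions made for the example; they are not taken from the real file.

```python
import csv
import io
from collections import defaultdict

# Tiny synthetic stand-in for the defect CSV (column names are assumed).
sample_csv = """ImageId,ClassId
a.jpg,1
a.jpg,3
b.jpg,3
c.jpg,4
"""
# Hypothetical list of every image in the training folder.
all_training_images = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]

# Collect the set of defect classes listed for each image.
classes_per_image = defaultdict(set)
for row in csv.DictReader(io.StringIO(sample_csv)):
    classes_per_image[row["ImageId"]].add(int(row["ClassId"]))

# Images never listed in the CSV have no defect.
no_defect = [name for name in all_training_images if name not in classes_per_image]
# Images listed with more than one class have multiple defect types.
multi_defect = [name for name, found in classes_per_image.items() if len(found) > 1]

print(len(no_defect), "without defects;", len(multi_defect), "with multiple classes")
```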

### Binary Classification

With a 53% - 47% split between images with defects and without defects respectively, one way to frame this problem is as a binary classification problem. In this application, as steel flows through the production process, pieces with defects are identified and removed from the production line and dealt with accordingly. 

### Multiclass Classification

Since the defect types are not mutually exclusive, this task is technically a multilabel classification problem, not a multiclass one. However, it will be treated as a multiclass classification problem for three reasons: 97% of the images contain either no defects or a single class of defect, the task already requires significant training time, and there is a significant class imbalance among the images with defects.

Multiclass classification could be deployed in production similarly to binary classification, where defects of certain classes are redirected from the main production line into separate production streams.

## 4. Binary Classification with an Artificial Neural Network

Once the image processing steps had been completed, and the resulting image arrays were flattened into one dimension, the images were ready for input into the artificial neural networks. The image arrays are not all that is required, however. This is a supervised learning project, so the networks need ground truth labels to train on. For the purposes of binary classification, the images were sorted into two classes: '0', meaning no defect present, or '1', meaning one or more classes of defect present. These labels were generated using 'train.csv', a CSV file provided with the dataset. Once the labels were organized into a dictionary mapping each .jpg filename to a class, the labels were one-hot encoded, the format required by Keras.
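
A minimal sketch of this labelling step is shown below. The filenames and the `defective` set are hypothetical stand-ins for what would be read from 'train.csv', and NumPy indexing is used here instead of a Keras utility to keep the example dependency-light.

```python
import numpy as np

# Hypothetical filenames; `defective` stands in for the names listed in train.csv.
image_names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
defective = {"a.jpg", "c.jpg"}

# Dictionary mapping each filename to 0 (no defect) or 1 (one or more defects).
binary_labels = {name: int(name in defective) for name in image_names}

# One-hot encode: class 0 becomes [1, 0] and class 1 becomes [0, 1].
label_vector = np.array([binary_labels[name] for name in image_names])
one_hot = np.eye(2, dtype=np.float32)[label_vector]
print(one_hot)
```

Keras also provides `to_categorical` in `tensorflow.keras.utils`, which produces the same layout from an integer label vector.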

The next step was to perform two train-test splits. The first split would separate the 12,568 images into training and holdout sets, of 90% and 10% respectively. The holdout set would be ignored until the end of the model iteration and training process, for final performance evaluation. The training set was then split again into train and validation sets, so that the networks could be validated after each epoch.
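
The two-stage split can be sketched as below. The 90/10 holdout split comes from the report; the 20% validation fraction, the seed value, and the `split_indices` helper are assumptions for illustration only.

```python
import numpy as np

def split_indices(n_items, test_fraction, seed):
    """Shuffle the positions 0..n_items-1 and return (train, test) index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_items)
    n_test = int(round(n_items * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

n_images = 12568

# First split: 90% for training and validation, 10% held out for final evaluation.
train_val_idx, holdout_idx = split_indices(n_images, 0.10, seed=42)

# Second split: assumed 80/20 train/validation (the report does not state this fraction).
train_pos, val_pos = split_indices(len(train_val_idx), 0.20, seed=42)
train_idx, val_idx = train_val_idx[train_pos], train_val_idx[val_pos]

print(len(train_idx), len(val_idx), len(holdout_idx))
```

In practice, scikit-learn's `train_test_split` with a fixed `random_state` is a common way to get the same effect. Fixing the seed is what makes results from different notebooks comparable on identical holdout images.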

The first network had the simplest architecture possible, with subsequent networks increasing in complexity. The aim was to find the minimum level of network complexity that could still learn the task. The accuracy curves of the first network did trend upwards, but it was clear that the model was struggling to capture the complexity of the problem.

The second network had additional hidden layers, and each layer had more nodes, and it was trained for more epochs. This second network had smoother accuracy curves, and the validation accuracy thrashed significantly less. 

In order to reduce the thrashing of the validation accuracy curve, the third network iteration was identical to the second, except regularization was added to each hidden layer. Surprisingly, this worsened the overall accuracy, without improving the thrashing of the validation accuracy. 

The second network iteration was chosen for final evaluation as it performed the best out of the three. This network architecture achieved 77.3% accuracy on the holdout data. This is a strong result for a simple neural network tackling a problem of this difficulty. The following notebook implements convolutional networks in both binary and multiclass classification contexts.


## Binary Classification with a Convolutional Neural Network

The image arrays created in the previous notebook were imported to save compute time. These arrays, each one representing one training image, were in the flattened format required by artificial networks. In order to be understood by the convolutional networks, the image arrays were reshaped using NumPy's .reshape() so that they were in the 256x256x1 format. The binary class labels from the earlier notebook did not require any changes, so they were simply imported. Then two train-test splits were performed, using the same techniques and with the same motivations as in the previous notebook. Random seeds and test set sizes were kept identical to the splits in the A.N.N. notebook, to ensure that the results of the artificial networks would be comparable, and could serve as a baseline performance to judge the convolutional networks against.

The first network was instantiated with five convolutional layers, each with a corresponding max pooling layer, followed by a flattening layer and two 64 node dense hidden layers, culminating in a 2 node output layer. This final output layer was set with an activation of 'softmax', to serve as an indicator of which of the two classes the input was more likely to belong to. The network was trained for 10 epochs and peaked in epoch 9 with a validation accuracy of 85%.
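
To make the role of the 2 node softmax output concrete, here is a small worked sketch in plain Python. The raw output values are hypothetical; only the softmax formula itself is standard.

```python
import math

def softmax(logits):
    """Convert raw output values into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical raw outputs of the two output nodes for one image:
# index 0 = no defect, index 1 = defect present.
probs = softmax([0.3, 1.7])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The predicted class is simply the node with the larger probability, which is how a two-node softmax layer indicates the more likely class.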

The second network was created with the same number of convolutional and max pooling layers as the first, but had double the nodes in the two dense hidden layers. It was also trained for 115 total epochs in the hope that more training time, and additional complexity would improve performance. The network peaked in epoch 12 with an 88% accuracy. 

Since the second network performed better than the first, the second was used in the final evaluation, achieving 85% accuracy, a large improvement over the 77% accuracy of the artificial neural networks. This jump in performance is due to the convolutional and pooling layers stacked before the dense layers, the attribute that makes C.N.N.s specialized for image classification tasks. While this result is not surprising, this process has shown that between A.N.N.s and C.N.N.s, the convolutional networks are the way to go.

## Multiclass Classification with a Convolutional Network

After proving their aptitude for image processing tasks, convolutional networks were the obvious choice for multiclass classification. The preprocessed image arrays from the previous binary task were used for this task as well. The labels, however, had to be remade, as the binary labels were useless in this context. 'train.csv' was imported, and Pandas methods were used to transform the dataframe into one-hot encoded arrays, each corresponding to a training image. Since there was a new set of labels, the train-test splits were performed again, using the same methodology as the last two times.
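
A rough sketch of building five-class labels is given below. The `defects` mapping and filenames are hypothetical, and the rule for images with more than one defect class (keep the lowest class id) is purely an assumption, since the report does not say how those images were resolved.

```python
import numpy as np

# Hypothetical mapping from filename to the defect classes listed for it.
defects = {"a.jpg": [3], "b.jpg": [1, 3], "c.jpg": [4]}
image_names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]

def multiclass_label(name):
    """Return 0 for no defect, otherwise a single defect class from 1 to 4."""
    classes = defects.get(name, [])
    # Assumption: multi-defect images keep their lowest class id.
    return min(classes) if classes else 0

labels = np.array([multiclass_label(name) for name in image_names])

# One-hot encode into 5 columns: no defect plus defect classes 1-4.
one_hot = np.eye(5, dtype=np.float32)[labels]
print(labels, one_hot.shape)
```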

The first network was given seven convolutional layers, each with a corresponding max pooling layer. A flattening layer was then added before the two dense hidden layers, each with 256 nodes and regularization. The premise was that the larger dense layers would help the network learn the complex task, while the regularization would prevent overfitting. Finally, a dense layer with 5 nodes and an activation of 'softmax' was added, with one node each for no defect and defect classes 1-4. Loss was switched from binary_crossentropy to categorical_crossentropy in accordance with the switch from a binary task to a multiclass one. This network was trained for 15 epochs and managed an accuracy of 79.5%, with learning curves that showed no obvious signs of overfitting or thrashing.
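
Stacking seven convolution and pooling blocks shrinks a 256 x 256 input considerably, and the arithmetic can be sketched as below. The 3x3 kernel, stride of 1, 2x2 pooling, and padding modes are assumptions, as the report does not list these hyperparameters; `feature_map_size` is a hypothetical helper.

```python
def feature_map_size(input_size, n_blocks, kernel=3, padding="same", pool=2):
    """Side length of a square feature map after n_blocks of conv + max pooling."""
    size = input_size
    for block in range(1, n_blocks + 1):
        if padding == "valid":
            size = size - kernel + 1  # an unpadded stride-1 convolution trims the border
        size = size // pool           # non-overlapping pooling drops any remainder
        if size < 1:
            raise ValueError(f"feature map vanishes at block {block}")
    return size

print(feature_map_size(256, 7, padding="same"))   # seven blocks with padding
print(feature_map_size(256, 5, padding="valid"))  # five blocks without padding
```

Under these assumed settings, seven unpadded blocks would shrink the feature map to nothing before the last block, so a seven-block stack would need padded convolutions or smaller kernels. Which of these the notebooks used is not stated in this report.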

The second network iteration was largely the same, however it was given an additional dense layer of 256 nodes, and stronger regularization on each of the three dense layers. This network was also trained for 15 epochs and achieved an accuracy of 82.5%. Since this second network performed better, it was used in the final evaluation. 

The second network architecture achieved an accuracy of 78.5% in the final evaluation for the multiclass context. This score, higher than the artificial network's performance on the easier binary task, clearly shows the power of convolutional networks in image classification tasks.

## Conclusions

### Binary Classification
When compared to a model-less baseline of 53%, and the 77% accuracy of the A.N.N., the 85% accuracy achieved by the convolutional network shows that it is the superior approach out of the three. The A.N.N.s simply don't hold up to the specialization of the C.N.N.s, and if binary classification is the deployment mode specified by the stakeholders, C.N.N.s are certainly a better choice of network architecture.
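
The 53% baseline corresponds to always predicting the majority class. A short worked calculation using the image counts reported in the EDA section:

```python
# Image counts from the EDA section of this report.
defect_images = 6666
clean_images = 5902
total = defect_images + clean_images

# Always guessing the larger class is right this fraction of the time.
baseline_accuracy = max(defect_images, clean_images) / total

# Holdout accuracies reported above, compared against that baseline.
for name, accuracy in [("A.N.N.", 0.773), ("C.N.N.", 0.85)]:
    gain = (accuracy - baseline_accuracy) * 100
    print(f"{name}: {gain:.1f} percentage points above baseline")
```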

### Multiclass Classification
The deployment of a network capable of multiclass classification would largely be up to the manufacturing engineers, as domain knowledge about the various types of defects would determine how each piece was treated. It is easy to imagine, however, the new options that multiclass classification would offer over the simple defect or no defect output of binary classification. With an accuracy of roughly 79%, the network could even be tuned for recall on certain classes of defect, if false negatives of that class are less tolerable. Severstal, the company behind this Kaggle competition, have a clear interest in using machine learning to improve efficiency and lower costs, and this analysis shows that convolutional neural networks are a viable and effective approach to achieving those goals.
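
Recall for a given class is the fraction of images truly in that class that the network labels correctly, so a missed defect (a false negative) lowers it. A minimal sketch with made-up labels, where 0 means no defect and 1-4 are the defect classes:

```python
def per_class_recall(y_true, y_pred, n_classes):
    """Recall for each class; None when a class has no true examples."""
    recalls = {}
    for cls in range(n_classes):
        # Predictions made for the images whose true label is this class.
        predictions = [pred for true, pred in zip(y_true, y_pred) if true == cls]
        if predictions:
            recalls[cls] = sum(pred == cls for pred in predictions) / len(predictions)
        else:
            recalls[cls] = None
    return recalls

# Made-up ground truth and predictions, for illustration only.
y_true = [0, 0, 1, 1, 3, 3, 3, 4]
y_pred = [0, 1, 1, 1, 3, 0, 3, 4]
recalls = per_class_recall(y_true, y_pred, n_classes=5)
print(recalls)
```

Tracking this per class, rather than overall accuracy alone, is what would let engineers prioritize the defect types whose misses are most costly.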


## Possible Next Steps

### 1. Early Stopping
While overall the networks benefitted from additional training epochs, some of them trended towards overfitting. The implementation of Keras' early stopping functionality could further improve network performance, and reduce the computational resources required.
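
The idea behind early stopping can be sketched in plain Python: stop once validation accuracy has failed to improve for a set number of epochs. The accuracy history and patience value below are made up, and `early_stopping_epoch` is a hypothetical helper rather than a Keras function.

```python
def early_stopping_epoch(val_accuracy, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a patience rule."""
    best_accuracy = float("-inf")
    best_epoch = 0
    for epoch, accuracy in enumerate(val_accuracy, start=1):
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop training here.
            return epoch, best_epoch
    return len(val_accuracy), best_epoch

# Made-up validation accuracy per epoch.
history = [0.70, 0.78, 0.83, 0.85, 0.84, 0.85, 0.83, 0.82]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)
```

In Keras this behaviour is available through the `EarlyStopping` callback in `tf.keras.callbacks`, whose `monitor`, `patience`, and `restore_best_weights` arguments control the same ideas.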

### 2. Additional Computing Power
This task is a difficult one, even for the specialized convolutional networks. Performance could likely be improved by moving to a more expensive and more powerful AWS notebook instance, allowing more complex models to be trained in the same time. 

### 3. Additional Network Complexity
In accordance with the recommendation above, further network iterations could possibly accrue additional performance benefits with more complex architectures. This change could be especially effective if combined with the early stopping protocols mentioned above.

### 4. Instance Segmentation
The CSV provided with the original dataset by Severstal detailed not only which class of defect was present in each image, but also which pixels made up that defect. This opens up the opportunity to approach this task in the context of instance segmentation, where a network is trained to identify not only the presence and class of defect, but also which pixels make up that defect. This would give the manufacturing engineers even more freedom in deployment, as instead of classifying entire pieces of steel, regions of each piece could be classified and dealt with accordingly. For example, if a large piece of sheet steel is going to be processed into multiple finished products, the sections the network identifies as defect free could be used, while other sections of the same piece of steel are discarded.
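
Training a segmentation network would first require turning the pixel data into per-image masks. The sketch below assumes the pixels are stored as run-length pairs of "start length", 1-indexed and counted down each column before moving right, which is a common convention for Kaggle segmentation competitions but is not confirmed in this report. `decode_runs` is a hypothetical helper.

```python
import numpy as np

def decode_runs(encoded, height, width):
    """Turn a 'start length start length ...' string into a binary mask."""
    numbers = [int(token) for token in encoded.split()]
    starts, lengths = numbers[0::2], numbers[1::2]
    mask = np.zeros(height * width, dtype=np.uint8)
    for start, length in zip(starts, lengths):
        # Assumption: starts are 1-indexed, so shift by one for NumPy.
        mask[start - 1:start - 1 + length] = 1
    # Assumption: pixels are numbered down each column first (column-major).
    return mask.reshape((height, width), order="F")

# Tiny made-up example on a 4 x 3 image.
mask = decode_runs("1 2 7 3", height=4, width=3)
print(mask)
```

For the real images, the mask would be decoded at the original 256 x 1600 size and then resized in the same way as the image itself.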

### 5. Test Deployment
Finally, these networks could be tested under a mock deployment, where dozens of images are continuously fed as inputs, to test functionality under real world conditions. 