# AWS Machine Learning Engineer Nanodegree Capstone Project

## Plant Disease Detection Using Deep Learning

### Project Domain

Plant diseases are one of the major causes of substantial yield losses in crops,
which in turn lead to large economic losses. According to a study by the Associated Chambers
of Commerce and Industry of India, annual crop losses due to diseases and pests amount to
Rs. 50,000 crore in India alone, which is significant in a country where
farmers are responsible for feeding a population of close to 1.3 billion people. The value of
plant science is therefore huge.

Accurate identification and diagnosis of plant diseases are very important for food security
in an era of climate change and globalization. Accurate and early identification of plant
diseases could help prevent the spread of invasive pests and pathogens. In addition,
efficient and economical management of plant diseases requires accurate, sensitive and specific
diagnosis.

The growth of GPUs (Graphics Processing Units) has helped academia and industry
advance Deep Learning methods, allowing them to explore deeper and more
sophisticated Neural Networks. Using the concepts of Image Classification and Transfer
Learning, we could train a Deep Learning model to classify images of plant leaves and predict
whether a plant is healthy or diseased. This could help detect plant diseases early
and allow preventive measures to be taken before they cause large crop losses.

### Data Preparation

#### Installing Libraries
* We will be using **split-folders** to split our dataset into train, val and test sets.
* **tqdm** will help give us a visual status of the progress while copying folders.

In [1]:
!pip install split-folders tqdm

Collecting split-folders
  Downloading split_folders-0.4.3-py3-none-any.whl (7.4 kB)
Installing collected packages: split-folders
Successfully installed split-folders-0.4.3


#### Using Git Large File Storage (LFS)
* For this project, the dataset we will be using has been uploaded as a zip file to Git LFS (https://docs.github.com/en/repositories/working-with-files/managing-large-files).
* We therefore need to install Git LFS first and then use it to pull the full contents of the zip file, because the Git repository itself stores only a pointer to the actual file held in Git LFS.
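
Because the repository holds only a small pointer file until `git lfs pull` has run, it can be useful to check which of the two is on disk. The following is a minimal sketch, not part of the original notebook: `is_lfs_pointer` is a hypothetical helper, and it relies on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`. The archive name is the one used later in this notebook.

```python
from pathlib import Path

# First line of a Git LFS pointer file, as defined by the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like a Git LFS pointer rather than real content.

    Hypothetical helper for illustration: it only inspects the first bytes.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


# Archive name taken from this notebook; the check is skipped if the file is absent.
archive = Path("CapstoneRawDatasetAndProposal.zip")
if archive.exists():
    if is_lfs_pointer(archive):
        print("Still a pointer file: run `git lfs pull` first.")
    else:
        print("Real archive is present.")
```

If the check reports a pointer file, unzipping would fail, since the pointer is a short text file and not a zip archive.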

In [2]:
# Commands to enable Git LFS and use it to pull the zip file
# Reference: https://stackoverflow.com/questions/70513398/i-can-not-install-git-lfs-on-aws-sagemaker-notebook-instance
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash

!sudo yum install git-lfs -y

!git lfs install

Detected operating system as amzn/2018.
Checking for curl...
Detected curl...
Downloading repository file: https://packagecloud.io/install/repositories/github/git-lfs/config_file.repo?os=amzn&dist=2018&source=script
done.
Installing pygpgme to verify GPG signatures...
Loaded plugins: dkms-build-requires, priorities, update-motd, upgrade-helper,
              : versionlock
amzn-main                                                | 2.1 kB     00:00     
amzn-updates                                             | 3.8 kB     00:00     
github_git-lfs-source/signature                          |  833 B     00:00     
Retrieving key from https://packagecloud.io/github/git-lfs/gpgkey
Importing GPG key 0xDC282033:
 Userid     : "https://packagecloud.io/github/git-lfs (https://packagecloud.io/docs#gpg_signing) <support@packagecloud.io>"
 Fingerprint: 6d39 8dbd 30dd 7894 1e2c 4797 fe2a 5f8b dc28 2033
 From       : https://packagecloud.io/github/git-lfs/gpgkey
github_git-lfs-source/signature       

To fetch the complete zip file from Git LFS, we need to run `git lfs pull`.

In [3]:
!git lfs pull

Downloading LFS objects: 100% (1/1), 170 MB | 47 MB/s                           

In [4]:
# Let's write a small utility function to unzip our archive's contents.
import zipfile

# The function below extracts the archive into the current working directory.
def unzip_data(input_data_path):
    with zipfile.ZipFile(input_data_path, 'r') as input_data_zip:
        input_data_zip.extractall('.')

In [8]:
zipped_filename = "CapstoneRawDatasetAndProposal.zip"
unzip_data(zipped_filename)
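
Before or after extracting, it can help to confirm what the archive actually contains, since the dataset path used in the next cell depends on the folder names inside it. This is a small sketch using only the standard `zipfile` module; `top_level_entries` is a hypothetical helper name and is not part of the original notebook.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names (files or folders) inside a zip archive."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        # Member names use "/" as separator; keep only the first path component.
        return sorted({name.split("/", 1)[0] for name in zf.namelist()})


# Archive name taken from this notebook; skipped if it is missing or not a zip.
zipped_filename = "CapstoneRawDatasetAndProposal.zip"
if zipfile.is_zipfile(zipped_filename):
    print(top_level_entries(zipped_filename))
```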

In [11]:
input_folder_path = "./CapstoneProposal/dataset/Plant_leave_diseases_dataset_with_augmentation"

##### Quick Overview of the Plant Disease Dataset
The plant disease dataset used for this project consists of **9644 images**. The images are all in colour, but they
vary in dimensions and are not standardized. For our use case, the model
will be trained on the 9 plant image classes in this dataset.
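
The image and class counts can be checked directly from the extracted folder. The sketch below is an illustration only: it assumes one subfolder per class (the layout that split-folders expects) and that images use `.jpg`, `.jpeg` or `.png` extensions. `count_images_per_class` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed image extensions; adjust if the dataset uses others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(dataset_dir):
    """Map each class subfolder name to the number of image files it contains."""
    counts = {}
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts


# Dataset path taken from this notebook; skipped if it has not been extracted yet.
dataset_dir = Path("./CapstoneProposal/dataset/Plant_leave_diseases_dataset_with_augmentation")
if dataset_dir.is_dir():
    counts = count_images_per_class(dataset_dir)
    for name, n in counts.items():
        print(f"{name}: {n}")
    print("classes:", len(counts), "total images:", sum(counts.values()))
```

Comparing the per-class counts is also a quick way to judge how balanced the classes are.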

#### Splitting Dataset into Train, Validation and Test Sets

The dataset consists of **9 classes** and is more or less **balanced**. We can therefore split it into train, validation and test sets in the ratio
**80:10:10**, meaning 80% training data, 10% validation data and 10% test data.

In [13]:
import splitfolders  # or import split_folders

# Split with a ratio.
splitfolders.ratio(input_folder_path, output="plant_disease_dataset", seed=1357, ratio=(.8, .1, .1), group_prefix=None)  # 80/10/10 split with a fixed seed

Copying files: 9644 files [00:01, 6660.96 files/s]
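
To confirm that the split came out close to 80:10:10, the files in each output folder can be counted. This is a minimal sketch, assuming the `train`, `val` and `test` subfolders that split-folders creates under the output directory; `split_sizes` and `split_fractions` are hypothetical helper names. Because the ratio is applied class by class, the overall fractions may differ slightly from the nominal values.

```python
from pathlib import Path


def split_sizes(output_dir, splits=("train", "val", "test")):
    """Count files under each split folder; a missing folder counts as 0."""
    sizes = {}
    for split in splits:
        split_dir = Path(output_dir) / split
        if split_dir.is_dir():
            sizes[split] = sum(1 for f in split_dir.rglob("*") if f.is_file())
        else:
            sizes[split] = 0
    return sizes


def split_fractions(sizes):
    """Convert absolute split sizes into fractions of the total."""
    total = sum(sizes.values())
    if total == 0:
        return {name: 0.0 for name in sizes}
    return {name: n / total for name, n in sizes.items()}


# Output folder name taken from this notebook; skipped if the split has not run.
output_dir = Path("plant_disease_dataset")
if output_dir.is_dir():
    sizes = split_sizes(output_dir)
    for name, frac in split_fractions(sizes).items():
        print(f"{name}: {sizes[name]} files ({frac:.1%})")
```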
