![dphi banner](https://dphi-courses.s3.ap-south-1.amazonaws.com/Datathons/dphi_banner.png)

# **Getting Started Code For [Data Sprint #62](https://dphi.tech/challenges/datathon/) on DPhi**

## Loading Libraries
Not all Python capabilities are loaded into the working environment by default (even if they are already installed on your system). So we import every library that we want to use.

We choose alias names for our libraries for convenience (numpy --> np, pandas --> pd, tensorflow --> tf).

Note: You can import all the libraries you think you will need up front, or import them as you go along.

In [40]:
import pandas as pd                                     # Data analysis and manipulation tool
import numpy as np                                      # Fundamental package for linear algebra and multidimensional arrays
import tensorflow as tf                                 # Deep Learning Tool
from tensorflow import keras
import os                                               # OS module in Python provides a way of using operating system dependent functionality
import cv2                                              # Library for image processing
from sklearn.model_selection import train_test_split    # For splitting the data into train and validation set

## Loading and preparing training data
The train and test images are given in two different folders, 'train' and 'test'. The labels of the train images are given in a CSV file, 'Training_set.csv', alongside the respective image id (i.e. image file name).

#### Getting the labels of the images

In [41]:
labels = pd.read_csv("./dataset/Training_set.csv")   # loading the labels
labels.head()           # will display the first five rows in labels dataframe

Unnamed: 0,filename,label
0,Image_1.jpg,sunrise
1,Image_2.jpg,shine
2,Image_3.jpg,cloudy
3,Image_4.jpg,shine
4,Image_5.jpg,sunrise


In [42]:
labels.tail()            # will display the last five rows in labels dataframe

Unnamed: 0,filename,label
1043,Image_1044.jpg,foggy
1044,Image_1045.jpg,sunrise
1045,Image_1046.jpg,cloudy
1046,Image_1047.jpg,rainy
1047,Image_1048.jpg,sunrise


#### Getting images file path

In [43]:
file_paths = [[fname, './dataset/train/' + fname] for fname in labels['filename']]

#### Confirming if no. of labels is equal to no. of images

In [44]:
# Confirm if number of images is same as number of labels given
if len(labels) == len(file_paths):
    print('Number of labels i.e. ', len(labels), 'matches the number of filenames i.e. ', len(file_paths))
else:
    print('Number of labels does not match the number of filenames')

Number of labels i.e.  1048 matches the number of filenames i.e.  1048
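
The check above compares two lists that were both built from the same CSV column, so it always passes. A stricter check is to confirm that each listed file actually exists on disk. The sketch below is one way to do that; `find_missing_files` is a hypothetical helper (not part of the original notebook), and the demo uses a temporary folder in place of `./dataset/train`.

```python
import os
import tempfile


def find_missing_files(filenames, folder):
    """Return the filenames that do not exist inside `folder` (hypothetical helper)."""
    return [name for name in filenames
            if not os.path.isfile(os.path.join(folder, name))]


# Self-contained demo: a temporary folder stands in for ./dataset/train
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["Image_1.jpg", "Image_2.jpg"]:
        open(os.path.join(demo_folder, name), "w").close()   # create empty files

    missing = find_missing_files(
        ["Image_1.jpg", "Image_2.jpg", "Image_3.jpg"], demo_folder
    )

print("Missing files:", missing)
```

In the notebook this would presumably be called as `find_missing_files(labels['filename'], './dataset/train')`, with an empty list meaning every label has an image.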


#### Converting the file_paths to dataframe

In [45]:
images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
images.head()

Unnamed: 0,filename,filepaths
0,Image_1.jpg,./dataset/train/Image_1.jpg
1,Image_2.jpg,./dataset/train/Image_2.jpg
2,Image_3.jpg,./dataset/train/Image_3.jpg
3,Image_4.jpg,./dataset/train/Image_4.jpg
4,Image_5.jpg,./dataset/train/Image_5.jpg


#### Combining the labels with the images

In [46]:
train_data = pd.merge(images, labels, how = 'inner', on = 'filename')
train_data.head()       

Unnamed: 0,filename,filepaths,label
0,Image_1.jpg,./dataset/train/Image_1.jpg,sunrise
1,Image_2.jpg,./dataset/train/Image_2.jpg,shine
2,Image_3.jpg,./dataset/train/Image_3.jpg,cloudy
3,Image_4.jpg,./dataset/train/Image_4.jpg,shine
4,Image_5.jpg,./dataset/train/Image_5.jpg,sunrise


In [47]:
train_data['label'].unique()

array(['sunrise', 'shine', 'cloudy', 'foggy', 'rainy'], dtype=object)

In [48]:
for label in train_data['label'].unique():
    print(len(train_data[train_data.label == label]))

245
174
210
210
209


The 'train_data' dataframe contains all the image ids, their locations and their respective labels. Now the training data is ready.
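
The bare counts printed above are easier to read with the label next to each number. The sketch below uses only the standard library and a small made-up label list (the values are illustrative, not the dataset's). With pandas, `train_data['label'].value_counts()` gives a similar summary.

```python
from collections import Counter

# Made-up labels for illustration; in the notebook this would be train_data['label']
demo_labels = ["sunrise", "shine", "cloudy", "shine",
               "sunrise", "foggy", "rainy", "sunrise"]

counts = Counter(demo_labels)
total = sum(counts.values())

# Print each class with its count and share of the data, most frequent first
for label, count in counts.most_common():
    print(f"{label:<8} {count:>3}  ({count / total:.1%})")

# Ratio between the largest and smallest class, a rough imbalance indicator
imbalance_ratio = max(counts.values()) / min(counts.values())
print("Imbalance ratio:", imbalance_ratio)
```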

## Data Pre-processing
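All images must be brought to the same shape and size and converted to their pixel values, because machine learning and deep learning models accept only numerical data. The labels must also be converted from categorical to numerical values. The Keras image generators used later in this notebook handle the resizing and the label encoding, so here we only arrange the images into one folder per label and split them into train, validation and test sets.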

In [59]:
import os
import shutil

# Create one sub-folder per class (exist_ok avoids an error if the folder already exists)
for label in train_data['label'].unique():
    os.makedirs('./dataset/train/{}'.format(label), exist_ok=True)

# Move every image into the folder named after its label
for row in train_data.itertuples():
    shutil.move(row.filepaths, './dataset/train/{}/{}'.format(row.label, row.filename))
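
With one folder per label, the categorical-to-numerical conversion comes down to a mapping from folder name to class index. To my understanding, Keras' `flow_from_directory` assigns these indices in alphanumeric order of the folder names and exposes the mapping as `class_indices`. The sketch below reproduces that idea in plain Python; `one_hot` is an illustrative helper, not a Keras function.

```python
# The five labels found in the training data
labels_seen = ["sunrise", "shine", "cloudy", "foggy", "rainy"]

# Sort the class names so the mapping is reproducible
class_names = sorted(set(labels_seen))
label_to_index = {name: index for index, name in enumerate(class_names)}
index_to_label = {index: name for name, index in label_to_index.items()}


def one_hot(label):
    """Encode a label as a one-hot vector with one position per class."""
    vector = [0.0] * len(class_names)
    vector[label_to_index[label]] = 1.0
    return vector


print(label_to_index)
print(one_hot("shine"))
```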

In [62]:
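import splitfolders   # third-party package (pip install split-folders)
# Copy the class folders into ./dataset/data/{train,val,test} using an 80/10/10 split
splitfolders.ratio('./dataset/train', output="./dataset/data", seed=1337, ratio=(0.8, 0.1, 0.1))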
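
To make the 80/10/10 split less of a black box, here is a minimal pure-Python sketch of a per-class (stratified) split. It illustrates the idea only and is not the `splitfolders` implementation: `stratified_split` is a hypothetical helper, the demo filenames are made up, and the rounding may differ from what the library does.

```python
import random
from collections import defaultdict


def stratified_split(items, ratios=(0.8, 0.1, 0.1), seed=1337):
    """Split (filename, label) pairs into train/val/test, class by class."""
    rng = random.Random(seed)

    # Group the filenames by their label
    by_label = defaultdict(list)
    for filename, label in items:
        by_label[label].append(filename)

    splits = {"train": [], "val": [], "test": []}
    for label, names in sorted(by_label.items()):
        rng.shuffle(names)
        n_train = int(len(names) * ratios[0])
        n_val = int(len(names) * ratios[1])
        # Whatever is left after train and val goes to test
        splits["train"] += [(n, label) for n in names[:n_train]]
        splits["val"] += [(n, label) for n in names[n_train:n_train + n_val]]
        splits["test"] += [(n, label) for n in names[n_train + n_val:]]
    return splits


# Made-up demo data: 20 'shine' and 20 'rainy' images
demo = [(f"Image_{i}.jpg", "shine" if i % 2 else "rainy") for i in range(1, 41)]
splits = stratified_split(demo)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting class by class keeps the label proportions roughly equal across the three sets, which matters when the classes are not perfectly balanced.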

## Building Model
Now we are finally ready to train the model.

There are many machine learning and deep learning models, such as Random Forest, Decision Tree, Multi-Layer Perceptron (MLP) and Convolutional Neural Network (CNN), to name a few.

We then feed the model both the data (the images) and the answers for that data (the labels). In this notebook both come from the directory generators defined below rather than from separate X_train and y_train arrays.
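
With the default settings of `flow_from_directory`, the answers arrive as one-hot vectors with one position per class. For a single-label problem like this one, the usual pairing is a softmax over the five model outputs with categorical cross-entropy. The sketch below spells that computation out in plain Python. The function names are illustrative, not Keras APIs, and the logits are made up.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, logits):
    """Negative log-probability assigned to the true class."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)


target = [0.0, 0.0, 0.0, 1.0, 0.0]            # true class is index 3

confident_right = [0.0, 0.0, 0.0, 5.0, 0.0]   # high score on the true class
confident_wrong = [5.0, 0.0, 0.0, 0.0, 0.0]   # high score on a wrong class

loss_right = categorical_cross_entropy(target, confident_right)
loss_wrong = categorical_cross_entropy(target, confident_wrong)
print(f"loss when right: {loss_right:.3f}, loss when wrong: {loss_wrong:.3f}")
```

The loss is small when the model puts most of its probability on the true class and large when it is confidently wrong, and this is the signal training minimizes.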

In [84]:
shape = (350, 350, 3)
classes = 5

In [79]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.xception import Xception
from tensorflow.keras.applications.xception import preprocess_input

train_gen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

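# Note: './dataset/train' still holds every training image, including those copied to val and test.
# Point this at './dataset/data/train' to keep the validation and test images out of training.
train_ds = train_gen.flow_from_directory(
    './dataset/train',
    target_size=(350, 350),
    batch_size=32
)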

val_gen = ImageDataGenerator(preprocessing_function=preprocess_input)

val_ds = val_gen.flow_from_directory(
    './dataset/data/val',
    target_size=(350, 350),
    batch_size=32
)

Found 1048 images belonging to 5 classes.
Found 103 images belonging to 5 classes.


In [81]:
base_model = Xception(weights='imagenet', include_top=False, input_shape=shape)

base_model.trainable = False

inputs = keras.Input(shape=shape)

base = base_model(inputs, training=False)

vectors = keras.layers.GlobalAveragePooling2D()(base)

outputs = keras.layers.Dense(classes)(vectors)

xception_model = keras.Model(inputs, outputs)
learning_rate = 0.01
optimizer = keras.optimizers.Adam(learning_rate=learning_rate)

loss = keras.losses.CategoricalCrossentropy(from_logits=True)   # one-hot labels over 5 classes, raw logits from the Dense layer

xception_model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
xception_history = xception_model.fit(train_ds, epochs=10, validation_data=val_ds)

Epoch 1/10


2022-06-19 18:25:23.294293: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-06-19 18:25:45.314696: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [82]:
test_gen = ImageDataGenerator(preprocessing_function=preprocess_input)

test_ds = test_gen.flow_from_directory(
    './dataset/data/test',
    target_size=(350, 350),
    batch_size=32
)

xception_score = xception_model.evaluate(test_ds)[1]

Found 107 images belonging to 5 classes.


In [89]:
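from tensorflow.keras.applications.vgg16 import VGG16
# Note: this import shadows the Xception preprocess_input. train_ds and val_ds were built
# earlier and keep the Xception preprocessing unless the generators are rebuilt.
from tensorflow.keras.applications.vgg16 import preprocess_input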

base_model = VGG16(
    include_top=False,
    input_shape=shape
)

base_model.trainable = False

inputs = keras.Input(shape=shape)

base = base_model(inputs, training=False)

vectors = keras.layers.GlobalAveragePooling2D()(base)

outputs = keras.layers.Dense(classes)(vectors)

vgg_model = keras.Model(inputs, outputs)
learning_rate = 0.01
optimizer = keras.optimizers.Adam(learning_rate=learning_rate)

loss = keras.losses.CategoricalCrossentropy(from_logits=True)   # one-hot labels over 5 classes, raw logits from the Dense layer

vgg_model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
vgg_history = vgg_model.fit(train_ds, epochs=10, validation_data=val_ds)

Epoch 1/10


2022-06-19 18:42:02.534280: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-06-19 18:42:24.349177: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [90]:
test_gen = ImageDataGenerator(preprocessing_function=preprocess_input)

test_ds = test_gen.flow_from_directory(
    './dataset/data/test',
    target_size=(350, 350),
    batch_size=32
)

vgg_score = vgg_model.evaluate(test_ds)[1]

Found 107 images belonging to 5 classes.


In [91]:
from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input

base_model = InceptionV3(
    include_top=False,
    input_shape=shape
)

base_model.trainable = False

inputs = keras.Input(shape=shape)

base = base_model(inputs, training=False)

vectors = keras.layers.GlobalAveragePooling2D()(base)

outputs = keras.layers.Dense(classes)(vectors)

inception_model = keras.Model(inputs, outputs)
learning_rate = 0.01
optimizer = keras.optimizers.Adam(learning_rate=learning_rate)

loss = keras.losses.CategoricalCrossentropy(from_logits=True)   # one-hot labels over 5 classes, raw logits from the Dense layer

inception_model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
inception_history = inception_model.fit(train_ds, epochs=5, validation_data=val_ds)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
Epoch 1/5


2022-06-19 18:49:22.268150: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.




2022-06-19 18:49:42.893342: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [92]:
test_gen = ImageDataGenerator(preprocessing_function=preprocess_input)

test_ds = test_gen.flow_from_directory(
    './dataset/data/test',
    target_size=(350, 350),
    batch_size=32
)

inception_score = inception_model.evaluate(test_ds)[1]

Found 107 images belonging to 5 classes.
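
Each of the three models now has a score (`xception_score`, `vgg_score`, `inception_score`), so a natural next step is to compare them. The sketch below shows one way to do that. The numbers are placeholders for illustration only, not results from this notebook, and `pick_best` is a hypothetical helper.

```python
# Placeholder accuracies for illustration only, NOT results from this notebook.
# In the notebook these would be xception_score, vgg_score and inception_score.
placeholder_scores = {
    "xception": 0.90,
    "vgg16": 0.85,
    "inception_v3": 0.88,
}


def pick_best(scores):
    """Return the (name, score) pair with the highest score."""
    return max(scores.items(), key=lambda pair: pair[1])


# List the models from best to worst
for name, score in sorted(placeholder_scores.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{name:<14} {score:.3f}")

best_name, best_score = pick_best(placeholder_scores)
print("Best model:", best_name)
```

Choosing a model by its test score makes that score slightly optimistic. Comparing on the validation split and keeping the test split for a final check is the cleaner practice.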


## Predict The Output For Testing Dataset 😅
We have trained and evaluated our models; now we will predict the output/target for the testing data (i.e. the images listed in Testing_set.csv).

#### Load Test Set
Load the test data on which final submission is to be made.

In [103]:
# Loading the order of the image's name that has been provided
test_image_order = pd.read_csv("./dataset/Testing_set.csv")
test_image_order.head()

Unnamed: 0,filename
0,Image_1.jpg
1,Image_2.jpg
2,Image_3.jpg
3,Image_4.jpg
4,Image_5.jpg


#### Getting images file path

In [104]:
file_paths = [[fname, './dataset/test/' + fname] for fname in test_image_order['filename']]

#### Confirm if number of images in test folder is same as number of image names in 'Testing_set.csv'

In [105]:
# Confirm if number of image names is same as number of file paths
if len(test_image_order) == len(file_paths):
    print('Number of image names i.e. ', len(test_image_order), 'matches the number of file paths i.e. ', len(file_paths))
else:
    print('Number of image names does not match the number of filepaths')

Number of image names i.e.  450 matches the number of file paths i.e.  450


#### Converting the file_paths to dataframe

In [106]:
test_images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
test_images.head()

Unnamed: 0,filename,filepaths
0,Image_1.jpg,./dataset/test/Image_1.jpg
1,Image_2.jpg,./dataset/test/Image_2.jpg
2,Image_3.jpg,./dataset/test/Image_3.jpg
3,Image_4.jpg,./dataset/test/Image_4.jpg
4,Image_5.jpg,./dataset/test/Image_5.jpg


## Data Pre-processing on test_data


### Make Prediction on Test Dataset
Time to make a submission!!!

In [130]:
!rm -f ./dataset/test/.DS_Store

In [131]:
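from tensorflow.keras.preprocessing import image

def classify(img_path):
    # Load the image and resize it to the (height, width) the model expects
    img = image.load_img(img_path, target_size=shape[:2])
    img_array = image.img_to_array(img)

    # Add a batch dimension: (350, 350, 3) -> (1, 350, 350, 3)
    img_batch = np.expand_dims(img_array, axis=0)

    # Apply the preprocessing of the most recently imported model (InceptionV3)
    img_preprocessed = preprocess_input(img_batch)

    model = inception_model
    prediction = model.predict(img_preprocessed)   # raw class scores, shape (1, 5)

    return prediction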

In [137]:
pred =  []

for filename in test_images['filename']:
    print(filename)
    pred.append(classify('./dataset/test/' + filename))

Image_1.jpg
Image_2.jpg
Image_3.jpg
Image_4.jpg
Image_5.jpg
...

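Each entry in `pred` holds the model's raw scores for the five classes (logits, since the final Dense layer has no activation). We need to convert them into the respective classes; np.argmax gives the index of the highest-scoring class.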

In [133]:
prediction = []
for value in pred:
  prediction.append(np.argmax(value))
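
Each element of `pred` comes from predicting a single image, so it holds a batch of one row of class scores. The pure-Python sketch below shows how such nested outputs can be decoded into label names. It assumes the alphabetical class order that Keras derives from the folder names. `argmax` and `decode` are illustrative helpers, and the scores are made up.

```python
# Assumed class order (alphabetical, as derived from the folder names)
CLASS_NAMES = ["cloudy", "foggy", "rainy", "shine", "sunrise"]


def argmax(values):
    """Index of the largest value, like np.argmax on a flat sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def decode(raw_outputs, class_names=CLASS_NAMES):
    """Turn a list of (1, n_classes) score batches into label names."""
    labels = []
    for batch in raw_outputs:
        scores = batch[0]                 # the single row in this batch
        labels.append(class_names[argmax(scores)])
    return labels


# Made-up scores for two images
demo_pred = [
    [[0.1, -2.0, 0.3, 4.2, 1.0]],
    [[3.3, 0.2, -1.0, 0.0, 0.5]],
]
decoded = decode(demo_pred)
print(decoded)
```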

In [134]:
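index_to_label = {index: name for name, index in train_ds.class_indices.items()}   # class index -> folder (label) name
predictions = [index_to_label[index] for index in prediction]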

## **How to save prediction results locally via Jupyter notebook?**
If you are working in a Jupyter notebook, execute the code block below. A file named 'submission.csv' will be created in your current working directory.

In [135]:
res = pd.DataFrame({'filename': test_images['filename'], 'label': predictions})  # predictions holds the model's final labels for the unseen test data
res.to_csv("submission.csv", index = False)      # the csv file is saved in the current working directory
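
Before uploading, it can help to sanity-check the file. The sketch below validates CSV text against a few assumptions: the `filename` and `label` columns used above, the five labels seen in training, and an expected row count. The platform's exact requirements are not stated in this notebook, so treat these checks as assumptions. `validate_submission` is a hypothetical helper.

```python
import csv
import io

# The five labels seen in the training data
ALLOWED_LABELS = {"cloudy", "foggy", "rainy", "shine", "sunrise"}


def validate_submission(csv_text, expected_rows):
    """Return a list of problems found in the submission (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["filename", "label"]:
        return [f"unexpected columns: {reader.fieldnames}"]

    rows = list(reader)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    bad = [row["filename"] for row in rows if row["label"] not in ALLOWED_LABELS]
    if bad:
        problems.append(f"unknown labels for: {bad}")
    return problems


# Made-up demo content; in practice read the text of submission.csv instead
good_csv = "filename,label\nImage_1.jpg,shine\nImage_2.jpg,foggy\n"
bad_csv = "filename,label\nImage_1.jpg,sunny\n"

good_problems = validate_submission(good_csv, expected_rows=2)
bad_problems = validate_submission(bad_csv, expected_rows=2)
print(good_problems)
print(bad_problems)
```

For the real file, the text could be read with `open("submission.csv").read()` and `expected_rows` set to the number of test images (450 in this notebook).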

# **Well Done! 👍**
You are all set to make a submission. Let's head to the **[challenge page](https://dphi.tech/challenges/data-sprint-41/142/submit)** to make the submission.