# Data Visualization Notebook

## Objectives

*   Answer business requirement 1: 
    * As a customer, I am interested in understanding the patterns in my customer base so that I can better manage churn levels.


## Inputs

* outputs/datasets/collection/TelcoCustomerChurn.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build a Streamlit app


## Additional Comments | Insights | Conclusions




---

# Install Packages

In [None]:
! pip install tensorflow==2.6.0

In [None]:
# ! pip install pandas-profiling==2.11.0
# ! pip install plotly==4.14.0
# ! pip install feature-engine==1.0.2

# Restart the runtime (this restarts the Colab session).
# It is good practice to do this after installing a package in a Colab session.
import os
os.kill(os.getpid(), 9)

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: selecting a different option (GPU, TPU or None) switches to a different kernel/session

---
* How do I know if I am using the GPU?
  * Run the code below. If the output is a non-empty string, you are using a GPU in this session
  * Typically the output will be /device:GPU:0


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
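# GitHub no longer accepts account passwords for Git operations over HTTPS; use a personal access token
os.environ['UserPwd'] = getpass('GitHub Personal Access Token: ')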
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove the content/sample_data folder, since we don't need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is: {os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")

---

### **Connect** this Colab session to your GitHub Repo

* So if you need, you can push files generated in this session to your Repo.

In [None]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

# The logic is: create a temporary file in the session and push it to the repo, then delete the file and push again.
# If both pushes work, it is a sign that the session is connected to the repo.
import uuid
file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
with open(f"{file_name}.txt", "w") as file: file.write("text")
print("=== Testing Session Connectivity to the Repo === \n")
! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
print("\n\n")
os.remove(f"{file_name}.txt")
! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# delete your Credentials (username and password)
os.environ['UserName'] = os.environ['UserPwd'] = os.environ['UserEmail'] = ""

* If the output above indicates an **authentication failure**, please re-enter your credentials.

---

# Load Data

Quick Data Exploration

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.image import imread

In [None]:
# my_data_dir = '/content/WalkthroughProject01/inputs/datasets/cell_images/cell_images'
# my_data_dir = '/content/WalkthroughProject01/inputs/little_train_set/cell_images'
my_data_dir = f"/content/{os.environ['RepoName']}/inputs/malaria_dataset/cell_images"

labels_train = os.listdir(my_data_dir+ '/train')
labels_val = os.listdir(my_data_dir+ '/validation')
labels_test = os.listdir(my_data_dir+ '/test')
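# sorted() gives a stable label order, so labels[0] and labels[1] are reproducible across sessions
labels = sorted(set(labels_train + labels_val + labels_test))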

print(
    f"Labels on train set: {labels_train}\n"
    f"Labels on validation set: {labels_val}\n"
    f"Labels on test set: {labels_test}\n"
    f"Project Labels: {labels}"
    )

In [None]:
train_path = my_data_dir + '/train'
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'
train_path

---

## Check labels frequencies

In [None]:
rows = []  # DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
for folder in ['train', 'validation', 'test']:
  for label in labels:
    n_images = len(os.listdir(my_data_dir + '/' + folder + '/' + label))
    rows.append({'Set': folder, 'Label': label, 'Frequency': n_images})
    print(f"* {folder} - {label}: {n_images} images")

df_freq = pd.DataFrame(rows)

sns.set_style("whitegrid")
plt.figure(figsize=(8,5))
sns.barplot(data=df_freq, x='Set', y='Frequency', hue='Label')
plt.show()
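
The counting step above can be sketched without pandas or seaborn. The snippet below is a minimal illustration, not the notebook's own code: `count_images_per_label` and the throwaway directory tree are hypothetical, and it assumes the same `<split>/<label>/<image>` folder layout used in this notebook.

```python
import os
import tempfile


def count_images_per_label(data_dir, splits=("train", "validation", "test")):
    """Return one {'Set', 'Label', 'Frequency'} row per split/label folder."""
    rows = []
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        for label in sorted(os.listdir(split_dir)):
            label_dir = os.path.join(split_dir, label)
            if os.path.isdir(label_dir):
                rows.append({"Set": split,
                             "Label": label,
                             "Frequency": len(os.listdir(label_dir))})
    return rows


# Build a tiny throwaway directory tree (made-up file names) to exercise the function
demo_dir = tempfile.mkdtemp()
for split, n_files in [("train", 3), ("validation", 1), ("test", 2)]:
    for label in ["Parasitized", "Uninfected"]:
        os.makedirs(os.path.join(demo_dir, split, label))
        for i in range(n_files):
            open(os.path.join(demo_dir, split, label, f"img_{i}.png"), "w").close()

rows = count_images_per_label(demo_dir)
for row in rows:
    print(row)
```

A list of rows like this can be passed straight to `pd.DataFrame(rows)` and then to `sns.barplot`.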

---

## Image montage

In [None]:
import itertools
import random

# logic
# if label exists in the folder
  # check that the montage space is not greater than the subset size
  # create list of axes indices based on nrows and ncols
  # create a Figure and display images
    # in this loop, load and plot given image


def image_montage(dir_path, label_to_display, nrows, ncols, figsize=(15,10)):
  sns.set_style("white")

  labels = os.listdir(dir_path)

  # subset the class you are interested to display
  if label_to_display in labels:

    # check that the montage has no more spaces than there are images in the subset
    images_list = os.listdir(dir_path+'/'+ label_to_display)
    if nrows * ncols <= len(images_list):
      img_idx = random.sample(images_list, nrows * ncols)
    else:
      print(
          f"Decrease nrows or ncols to create your montage. \n"
          f"There are {len(images_list)} images in your subset. "
          f"You requested a montage with {nrows * ncols} spaces")
      return
    

    # create list of axes indices based on nrows and ncols
    list_rows= range(0,nrows)
    list_cols= range(0,ncols)
    plot_idx = list(itertools.product(list_rows,list_cols))


    # create a Figure and display images
    fig, axes = plt.subplots(nrows=nrows,ncols=ncols, figsize=figsize)
    for x in range(0,nrows*ncols):
      img = imread(dir_path + '/' + label_to_display + '/' + img_idx[x])
      img_shape = img.shape
      axes[plot_idx[x][0], plot_idx[x][1]].imshow(img)
      axes[plot_idx[x][0], plot_idx[x][1]].set_title(f"Width {img_shape[1]}px x Height {img_shape[0]}px")
      axes[plot_idx[x][0], plot_idx[x][1]].set_xticks([])
      axes[plot_idx[x][0], plot_idx[x][1]].set_yticks([])
    plt.tight_layout()
    plt.show()


  else:
    print("The label you selected doesn't exist.")
    print(f"The existing options are: {labels}")

In [None]:
for label in labels:
  print(label)
  image_montage(dir_path= train_path,
                label_to_display= label,
                nrows=6, ncols=3, figsize=(10,15))
  print("\n")
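
The sampling and grid-indexing logic inside `image_montage` can be looked at in isolation. This is a small sketch under assumptions: `montage_layout` is a hypothetical helper, the file names are made up, and no images are loaded or plotted.

```python
import itertools
import random


def montage_layout(filenames, nrows, ncols, seed=None):
    """Map each (row, col) grid position to a randomly sampled file name."""
    n_cells = nrows * ncols
    if n_cells > len(filenames):
        raise ValueError(
            f"Montage has {n_cells} spaces but only {len(filenames)} images are available")
    rng = random.Random(seed)              # seeded so the sample is repeatable
    chosen = rng.sample(filenames, n_cells)
    positions = itertools.product(range(nrows), range(ncols))  # (0,0), (0,1), ...
    return dict(zip(positions, chosen))


files = [f"cell_{i}.png" for i in range(10)]  # made-up file names
layout = montage_layout(files, nrows=2, ncols=3, seed=0)
for position, name in layout.items():
    print(position, name)
```

Each `(row, col)` key corresponds to one `axes[row, col]` subplot in the montage.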

---

Quick visualization of one image from the train set, for one of the labels

In [None]:
labels

In [None]:
pointer = 44
para_img = imread(train_path + '/'+ labels[0]+ '/'+ os.listdir(train_path+'/'+labels[0])[pointer])
print(para_img.shape)
sns.set_style("white")
plt.imshow(para_img)
plt.show()

In [None]:
para_img.max()

---

## Avg and Std images

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/eda/{version}'

# exist_ok=True avoids an error when the folder already exists
os.makedirs(name=file_path, exist_ok=True)

image sizes on train set

In [None]:
dim1, dim2 = [], []
for label in labels:
  for image_filename in os.listdir(train_path + '/'+ label):
    img = imread(train_path + '/'+ label + '/'+ image_filename)
    d1, d2, colors = img.shape
    dim1.append(d1) # image height
    dim2.append(d2) # image width

sns.set_style("whitegrid")
fig, axes = plt.subplots()
sns.scatterplot(x=dim2, y=dim1, alpha=0.2)
axes.set_xlabel("Width (pixels)")
axes.set_ylabel("Height (pixels)")
dim1_mean = int(np.array(dim1).mean())
dim2_mean = int(np.array(dim2).mean())
axes.axvline(x=dim2_mean,color='r', linestyle='--')  # vertical line at the mean width (x-axis)
axes.axhline(y=dim1_mean,color='r', linestyle='--')  # horizontal line at the mean height (y-axis)
plt.show()
print(f"Width average: {dim2_mean} \nHeight average: {dim1_mean}")
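
The height/width bookkeeping above is easy to mix up, because array shapes are ordered (height, width, channels) while the scatter plot puts width on the x-axis. A minimal sketch with made-up shapes (the numbers are illustrative only, not taken from the dataset):

```python
import numpy as np

# Hypothetical image shapes in (height, width, channels) order
shapes = [(100, 120, 3), (140, 130, 3), (120, 140, 3)]

heights = np.array([s[0] for s in shapes])  # first axis of an image array
widths = np.array([s[1] for s in shapes])   # second axis of an image array

height_mean = int(heights.mean())
width_mean = int(widths.mean())

# The model input shape keeps the array order: (height, width, channels)
image_shape = (height_mean, width_mean, 3)
print(f"Width average: {width_mean}, Height average: {height_mean}")
print(image_shape)
```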

Shape for resized image

In [None]:
image_shape = (dim1_mean, dim2_mean, 3)
image_shape

avg/std image from each label, avg/std diff between labels

In [None]:
import cv2
sns.set_style("white")

def resize_images(my_data_dir, new_size=(50,50)):
  # new_size is (width, height), which is the order cv2.resize expects for dsize
  X, y = [], []
  labels = os.listdir(my_data_dir)

  for label in labels:
    # use only the first 20 images per label so this step stays fast
    for image_filename in os.listdir(my_data_dir + '/' + label)[:20]:
      img = imread(my_data_dir + '/' + label + '/' + image_filename)
      img_resized = cv2.resize(img,(new_size[0], new_size[1]))
      X.append(img_resized)  # each resized image has shape (height, width, channels)
      y.append(label)

  # stack into a single array of shape (n_images, height, width, channels)
  return np.array(X), np.array(y, dtype='object')


X, y = resize_images(my_data_dir=train_path, new_size=(dim2_mean,dim1_mean))
print(X.shape, y.shape)

In [None]:
# X holds at most 20 images per label, so step through the images that actually exist
for x in range(0, len(X), 5):
  print(x, y[x])
  plt.imshow(X[x])
  plt.show()

image_average_and_variability

In [None]:
def image_average_and_variability(X, y, figsize=(12,5)):
  sns.set_style("white")

  for class_to_display in np.unique(y):
    y = y.reshape(-1,1,1)
    boolean_mask = np.any(y==class_to_display,axis=1).reshape(-1)
    df = X[boolean_mask]
    avg_img = np.mean(df, axis = 0)
    std_img = np.std(df, axis = 0)
    print(f"==== Class {class_to_display} ====")
    print(avg_img.shape)
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=figsize)
    axes[0].set_title(f"Average Image for class {class_to_display}")
    axes[0].imshow(avg_img)
    axes[1].set_title(f"Standard Deviation for class {class_to_display}")
    axes[1].imshow(std_img)
    # plt.savefig(f'{file_path}/avg_std_img_{class_to_display}.png', bbox_inches='tight', dpi=150)
    plt.show()

    print("\n")
  

# for standard deviation, the lighter area indicates higher variability in that class
image_average_and_variability(X=X, y=y, figsize=(12,5))
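
The per-class average and standard deviation come down to a boolean mask followed by a reduction over the image axis. The sketch below uses tiny synthetic arrays; `class_mean_and_std` and the labels "A" and "B" are hypothetical and only stand in for the real classes.

```python
import numpy as np


def class_mean_and_std(X, y):
    """Return {label: (mean_image, std_image)} computed pixel-wise per class."""
    stats = {}
    for label in np.unique(y):
        images = X[y == label]                     # boolean mask selects one class
        stats[label] = (images.mean(axis=0),       # average over the image axis
                        images.std(axis=0))        # pixel-wise variability
    return stats


# Four synthetic 2x2 RGB "images": class A varies, class B is constant
X_demo = np.stack([np.full((2, 2, 3), v) for v in (0.0, 1.0, 0.5, 0.5)])
y_demo = np.array(["A", "A", "B", "B"])

stats = class_mean_and_std(X_demo, y_demo)
for label, (mean_img, std_img) in stats.items():
    print(label, mean_img.shape, float(mean_img.mean()), float(std_img.mean()))
```

A class whose images differ a lot gives a brighter standard deviation image, and a class with near-identical images gives a dark one.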

contrast_between_2_classes

In [None]:
def subset_image_class(X,y,class_to_display):
  y = y.reshape(-1,1,1)
  boolean_mask = np.any(y==class_to_display,axis=1).reshape(-1)
  df = X[boolean_mask]
  return df

def contrast_between_2_classes(X, y, class_1, class_2, figsize=(12,5)):
  sns.set_style("white")

  # what if images have different sizes?

  if (class_1 not in np.unique(y)) or (class_2 not in np.unique(y)):
    print(f"Either class {class_1} or class {class_2}, are not in {np.unique(y)} ")
    return

  # calculate mean from class1
  images_class_1 = subset_image_class(X, y, class_1)
  class1_avg = np.mean(images_class_1, axis = 0)

  # calculate mean from class2
  images_class_2 = subset_image_class(X, y, class_2)
  class2_avg = np.mean(images_class_2, axis = 0)

  # calculate difference and plot all
  contrast_mean = class1_avg - class2_avg
  fig, axes = plt.subplots(nrows=1, ncols=3,figsize=figsize)
  axes[0].imshow(contrast_mean)
  axes[0].set_title(f'Difference Between Avg: {class_1} & {class_2}')
  axes[1].imshow(class1_avg)
  axes[1].set_title(f'Average Class {class_1}')
  axes[2].imshow(class2_avg)
  axes[2].set_title(f'Average Class {class_2}')
  # plt.savefig(f"{file_path}/avg_diff.png", bbox_inches='tight', dpi=150)
  plt.show()


contrast_between_2_classes(X=X, y=y, class_1='Parasitized', class_2='Uninfected', figsize=(12,10))
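
The difference between two average images contains negative values wherever the second class is brighter, and `imshow` clips float RGB data outside [0, 1]. One possible way to keep the sign visible is to rescale the difference before plotting. The sketch below is an assumption layered on top of the notebook, not part of it: `mean_difference` and `rescale_for_display` are hypothetical helpers and the data is synthetic.

```python
import numpy as np


def mean_difference(X, y, class_1, class_2):
    """Pixel-wise difference between the average images of two classes."""
    return X[y == class_1].mean(axis=0) - X[y == class_2].mean(axis=0)


def rescale_for_display(diff):
    """Map a signed difference into [0, 1]; 0.5 means 'no difference'."""
    max_abs = np.abs(diff).max()
    if max_abs == 0:
        return np.full_like(diff, 0.5)
    return 0.5 + diff / (2 * max_abs)


# Synthetic data: class A is uniformly 0.8, class B is darker in the left column
img_a = np.full((2, 2, 3), 0.8)
img_b = np.full((2, 2, 3), 0.8)
img_b[:, 0, :] = 0.2
X_demo = np.stack([img_a, img_a, img_b, img_b])
y_demo = np.array(["A", "A", "B", "B"])

diff = mean_difference(X_demo, y_demo, "A", "B")
display = rescale_for_display(diff)
print(diff[:, :, 0])
print(display[:, :, 0])
```

With this mapping, values above 0.5 mark pixels where the first class is brighter on average and values below 0.5 mark pixels where the second class is brighter.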

---

In [None]:
# References
# https://towardsdatascience.com/tagged/resize-images?p=3e0f29b992be
# https://towardsdatascience.com/what-is-the-best-input-pipeline-to-train-image-classification-models-with-tf-keras-eb3fe26d3cc5

---

## Modelling

model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, Conv2D, MaxPooling2D
model = Sequential()

model.add(Conv2D(filters=32, kernel_size=(3,3), input_shape=image_shape, activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# input_shape is only needed on the first layer; later layers infer their input shape
model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(filters=64, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())

model.add(Dense(128))
model.add(Activation('relu'))

# Dropouts help reduce overfitting by randomly turning neurons off during training.
# Here we say randomly turn off 50% of neurons.
model.add(Dropout(0.5))

# Last layer: the task is binary classification, so we use a sigmoid activation
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=2)
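
To make the effect of `patience=2` concrete, here is a simplified pure-Python model of the stopping rule. It is only a sketch of the idea (stop once the monitored value has failed to improve for `patience` consecutive epochs) and not the Keras implementation; `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


history = [0.60, 0.45, 0.47, 0.46, 0.40]  # hypothetical val_loss per epoch
stopped = early_stopping_epoch(history, patience=2)
print(f"Training would stop at epoch index: {stopped}")
```

In this made-up history the later improvement to 0.40 is never reached, which is the trade-off of a small `patience` value.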

---

train set

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [None]:
train_datagen = ImageDataGenerator(rotation_range=20,
                                   width_shift_range=0.10, 
                                   height_shift_range=0.10,
                                   shear_range=0.1,
                                   zoom_range=0.1,
                                   horizontal_flip=True,
                                   fill_mode='nearest',     
                              )


In [None]:
batch_size = 16

train_set = train_datagen.flow_from_directory(train_path,
                                              target_size=image_shape[:2],
                                              color_mode='rgb',
                                              batch_size=batch_size,
                                              class_mode='binary',
                                              shuffle=True
                                              )

train_set.class_indices

validation set

In [None]:
validation_set = ImageDataGenerator().flow_from_directory(val_path,
                                                          target_size=image_shape[:2],
                                                          color_mode='rgb',
                                                          batch_size=batch_size,
                                                          class_mode='binary',
                                                          shuffle=False
                                                          )

validation_set.class_indices

test set

In [None]:
test_set = ImageDataGenerator().flow_from_directory(test_path,
                                                    target_size=image_shape[:2],
                                                    color_mode='rgb',
                                                    batch_size=batch_size,
                                                    class_mode='binary',
                                                    shuffle=False
                                                    )

test_set.class_indices

model

In [None]:
model.fit(train_set,
          epochs=20,
          steps_per_epoch = len(train_set.classes) // batch_size,
          validation_data=validation_set,
          callbacks=[early_stop])



from google.colab import output
output.eval_js('new Audio("https://upload.wikimedia.org/wikipedia/commons/0/05/Beep-09.ogg").play()')

# model.save('/content/WalkthroughProject01/outputs/model/model_wt01_full_train_set.h5')

# CommitMsg = "update"
# !git add .
# !git commit -m {CommitMsg}
# !git push origin main


model training history

In [None]:
losses = pd.DataFrame(model.history.history)

sns.set_style("whitegrid")
losses[['loss','val_loss']].plot(style='.-')
plt.title("Loss")
plt.show()

print("\n")
losses[['accuracy','val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.show()

---

evaluation

In [None]:
model.metrics_names

In [None]:
model.evaluate(train_set)

In [None]:
model.evaluate(validation_set)

In [None]:
model.evaluate(test_set)

In [None]:
# from tensorflow.keras.models import load_model
# model1 = load_model('outputs/model/malaria_detector.h5')

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
def model_evaluation(model, image_generator):
  pred_probabilities = model.predict(image_generator)
  # threshold the sigmoid outputs and flatten to a 1-D array of 0/1 labels
  predictions = (pred_probabilities > 0.5).astype(int).flatten()

  pd.Series(pred_probabilities.flatten()).hist()
  plt.show()
  print("\n")

  print(classification_report(image_generator.classes,predictions))
  print("\n")

  # rows are predictions and columns are actual labels, because predictions are passed first
  Map = image_generator.class_indices
  print(pd.DataFrame(confusion_matrix(predictions,image_generator.classes),
        columns=["Actual " + sub for sub in Map],
        index=["Prediction " + sub for sub in Map]))



In [None]:
train_set_shuffle_false = ImageDataGenerator().flow_from_directory(train_path,
                                                          target_size=image_shape[:2],
                                                          color_mode='rgb',
                                                          batch_size=batch_size,
                                                          class_mode='binary',
                                                          shuffle=False
                                                          )

In [None]:
import datetime
for image_set in [train_set_shuffle_false, validation_set, test_set]:
  
  print(f'{image_set.directory.split("/")[-1]} set')
  a = datetime.datetime.now()

  model_evaluation(model=model, image_generator=image_set)
  print(f'\nTime for evaluation: {datetime.datetime.now() - a}')
  print("\n\n")
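
The thresholding and confusion-matrix layout used in `model_evaluation` can be checked by hand on a few values. The sketch below does not use scikit-learn; `binary_confusion_counts` is a hypothetical helper and the labels and probabilities are made up. It follows the same layout as above: rows are predictions, columns are actual labels.

```python
import numpy as np


def binary_confusion_counts(y_true, probabilities, threshold=0.5):
    """Return a 2x2 count matrix indexed as [predicted, actual]."""
    y_true = np.asarray(y_true).astype(int).ravel()
    y_pred = (np.asarray(probabilities).ravel() > threshold).astype(int)
    matrix = np.zeros((2, 2), dtype=int)
    for pred, actual in zip(y_pred, y_true):
        matrix[pred, actual] += 1
    return matrix


y_true = [0, 0, 0, 1, 1]                              # made-up actual classes
probabilities = [[0.1], [0.2], [0.7], [0.8], [0.9]]   # made-up sigmoid outputs, shape (n, 1)

matrix = binary_confusion_counts(y_true, probabilities)
print(matrix)
```

Here `matrix[1, 0]` counts images predicted as class 1 whose actual class is 0.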

---

In [None]:
# pred_probabilities = model.predict(validation_set)
# pred_probabilities

In [None]:
# pd.Series(pred_probabilities.flatten()).hist()

In [None]:
# predictions = pred_probabilities > 0.5
# predictions

In [None]:
# from sklearn.metrics import classification_report,confusion_matrix

In [None]:
# print(classification_report(test_image_gen.classes,predictions))

In [None]:
# Map = train_image_gen.class_indices
# print(pd.DataFrame(confusion_matrix(predictions,test_image_gen.classes),
#         columns=[ ["Actual " + sub for sub in Map] ], 
#         index = [ ["Prediction " + sub for sub in Map ]]))


In [None]:
# confusion_matrix(predictions,test_image_gen.classes)

---

predict on new data

In [None]:
# from tensorflow.keras.models import load_model
# model = load_model('outputs/model/malaria_detector.h5')

In [None]:
# model.summary()
labels

In [None]:
pointer = 66
label = labels[1]
para_img = imread(test_path + '/'+ label + '/'+ os.listdir(test_path+'/'+ label)[pointer])
print(para_img.shape)
sns.set_style("white")
plt.imshow(para_img)
plt.show()

---

In [None]:
from tensorflow.keras.preprocessing import image

pointer = 66
label = labels[1]
my_image = image.load_img(test_path + '/'+ label + '/'+ os.listdir(test_path+'/'+ label)[pointer],
                          target_size=image_shape[:2],  # (height, width) only, without the channel count
                          color_mode='rgb')
print(my_image.size, my_image.mode)
my_image

In [None]:
sns.set_style("white")

my_image = image.img_to_array(my_image)
print(my_image.shape,my_image.max())
plt.imshow(my_image / 255)  # img_to_array returns floats in [0, 255]; imshow expects floats in [0, 1]
plt.show()

# my_image = np.expand_dims(my_image, axis=0)


# import cv2
# img_resized = cv2.resize(para_img,(image_shape[1], image_shape[0]))   # change for dynamic
# print(img_resized.shape)
# sns.set_style("white")
# plt.imshow(img_resized)

In [None]:
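# the model expects a batch, so add a leading batch dimension: (1, height, width, 3)
pred_proba = model.predict(np.expand_dims(my_image, axis=0))[0,0]

target_map = {v: k for k, v in train_set.class_indices.items()}
pred_class = target_map[int(pred_proba > 0.5)]

# the sigmoid output is the probability of class 1; report the probability of the predicted class
if pred_class == target_map[0]: pred_proba = 1 - pred_proba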

print(pred_proba)
print(pred_class)


In [None]:
# prob_from_predicted_class = round(pred_proba[0,0], 2)
# prob_from_other_class = round(1 - pred_proba[0,0], 2)
# print(prob_from_predicted_class,prob_from_other_class)

In [None]:
prob_per_class= pd.DataFrame(data=[0.0,0.0],index=train_set.class_indices.keys(), columns=['Probability'])

prob_per_class.loc[pred_class] = pred_proba

for x in prob_per_class.index.to_list():
  if x != pred_class: prob_per_class.loc[x] = 1 - pred_proba

prob_per_class = prob_per_class.round(3)
print(prob_per_class)
import plotly.express as px
fig = px.bar(prob_per_class, x = prob_per_class.index, y = prob_per_class['Probability'],range_y=[0,1],
             labels=dict(x="Diagnosis"), width=400, height=500)
fig.show()
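
The mapping from a single sigmoid output to one probability per class can be written more directly. This is a minimal sketch: `probabilities_per_class` is a hypothetical helper, the value 0.8 is made up, and the `class_indices` mapping shown is an assumption about what `train_set.class_indices` contains.

```python
def probabilities_per_class(sigmoid_output, class_indices):
    """Turn one sigmoid output (probability of class 1) into a {label: probability} dict."""
    index_to_label = {index: label for label, index in class_indices.items()}
    return {
        index_to_label[0]: 1.0 - sigmoid_output,  # probability of class 0
        index_to_label[1]: sigmoid_output,        # probability of class 1
    }


class_indices = {"Parasitized": 0, "Uninfected": 1}  # assumed mapping
probs = probabilities_per_class(0.8, class_indices)
predicted = max(probs, key=probs.get)  # label with the highest probability

print(probs)
print(predicted)
```

Because both probabilities come from one number, they always sum to 1, and the predicted class is simply the one with the larger value.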

save model

In [None]:
model.save('/content/WalkthroughProject01/outputs/model/model_full_train_set.h5')  # saves the model as an HDF5 file

In [None]:
# save model history - 2 plots
# save classification report? confusion matrix? or predictions on train, validation, test sets?
# train_set.class_indices
# image_shape 
# save avg image, std image,
# labels frequency on train, validation, test sets

# **Push** generated/new files from this Session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "update"
!git add .
!git commit -m {CommitMsg}

* Git Push

In [None]:
!git push origin main

---