# Chest X-Ray Images (Pneumonia)
**Authors:** Carlos McCrum, Jared Mitchell, Andrew Bernklau
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve and the data questions you plan to answer in order to solve it.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import os
import warnings
from random import shuffle

# Silence library warnings before the heavier imports below
warnings.filterwarnings('ignore')

import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from scipy import ndimage
from tqdm import tqdm

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Import Keras through TensorFlow throughout, rather than mixing keras and tensorflow.keras
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow import keras
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

### Preprocessing for Logistic Regression and Random Forest

In [None]:
train_path = 'data/chest_xray/train/'
test_path = 'data/chest_xray/test/'
val_path = 'data/chest_xray/val/'



In [None]:
# Stream chest_xray/test (624 images) in batches of 32, rescaled to [0, 1] and resized to 256x256
test_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        test_path,
        target_size=(256, 256),
        batch_size=32)

# Stream chest_xray/val (16 images) the same way
val_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        val_path,
        target_size=(256, 256),
        batch_size=32)

# Stream chest_xray/train (5216 images) the same way
train_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        train_path,
        target_size=(256, 256),
        batch_size=32)

In [None]:
# create the data sets
train_images, train_labels = next(train_generator)
test_images, test_labels = next(test_generator)
val_images, val_labels = next(val_generator)
print(train_images.shape)
print(test_images.shape)
print(val_images.shape)
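
Note that `next()` on a directory iterator returns a single batch, so with `batch_size=32` the arrays above hold at most 32 images rather than a whole split. One possible way to gather more is to concatenate several batches. The sketch below uses a hypothetical helper, `collect_batches`, and a stand-in generator (`fake_generator`, also hypothetical) so that it runs without the image files:

```python
import numpy as np

def collect_batches(batch_iter, n_batches):
    """Concatenate the first n_batches (images, labels) pairs from an iterator."""
    images, labels = [], []
    for _ in range(n_batches):
        batch_images, batch_labels = next(batch_iter)
        images.append(batch_images)
        labels.append(batch_labels)
    return np.concatenate(images), np.concatenate(labels)

def fake_generator(batch_size=32, img_size=8):
    """Stand-in for a directory iterator: yields random images and one-hot labels."""
    rng = np.random.default_rng(0)
    while True:
        imgs = rng.random((batch_size, img_size, img_size, 3), dtype=np.float32)
        onehot = np.eye(2, dtype=np.float32)[rng.integers(0, 2, size=batch_size)]
        yield imgs, onehot

all_images, all_labels = collect_batches(fake_generator(), n_batches=3)
print(all_images.shape)  # (96, 8, 8, 3)
print(all_labels.shape)  # (96, 2)
```

With the real `train_generator`, passing `n_batches=len(train_generator)` should cover one full pass over the folder, at the cost of holding every image in memory at once.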

In [None]:
X_train = train_images.reshape(train_images.shape[0], -1)
X_test = test_images.reshape(test_images.shape[0], -1)
X_val = val_images.reshape(val_images.shape[0], -1)

print(X_train.shape)
print(X_test.shape)
print(X_val.shape)
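
The `reshape(n, -1)` calls flatten each image into one long feature row, which is the two-dimensional layout scikit-learn estimators expect. A small sketch, with a zero-filled array standing in for a real batch, shows the arithmetic: 256 × 256 × 3 = 196,608 features per image.

```python
import numpy as np

# Stand-in for one batch of 32 RGB images at 256x256 (zeros instead of real pixels)
batch = np.zeros((32, 256, 256, 3), dtype=np.float32)

# Keep the first axis (one row per image) and merge the remaining axes into one
flat = batch.reshape(batch.shape[0], -1)

print(flat.shape)  # (32, 196608)
```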

In [None]:
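# Labels are one-hot with columns in alphanumeric class order, so column 0 is 1 for NORMAL.
# reshape(-1, 1) infers the row count instead of hard-coding the batch size.
y_train = train_labels[:, 0].reshape(-1, 1)
y_test = test_labels[:, 0].reshape(-1, 1)
y_val = val_labels[:, 0].reshape(-1, 1)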
print(y_train.shape)
print(y_test.shape)
print(y_val.shape)
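
`flow_from_directory` defaults to `class_mode='categorical'`, which yields one-hot labels whose columns follow the alphanumeric order of the class folder names. Assuming the folders are named `NORMAL` and `PNEUMONIA`, taking column 0 encodes NORMAL as 1. If pneumonia should be the positive class instead, `argmax` is one alternative, sketched here on made-up labels:

```python
import numpy as np

# Made-up one-hot labels; columns assumed to be [NORMAL, PNEUMONIA]
onehot = np.array([[1., 0.],
                   [0., 1.],
                   [0., 1.],
                   [1., 0.]])

normal_is_one = onehot[:, 0].reshape(-1, 1)                   # 1 = NORMAL
pneumonia_is_one = np.argmax(onehot, axis=1).reshape(-1, 1)   # 1 = PNEUMONIA

print(normal_is_one.ravel())     # [1. 0. 0. 1.]
print(pneumonia_is_one.ravel())  # [0 1 1 0]
```

Whichever encoding is used, it needs to match the integer labels produced elsewhere in the notebook before scores from different models are compared.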

### Preprocessing for Neural Networks

In [None]:
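img_height = 256
img_width = 256

# Use 80% of the training folder for training. batch_size exceeds the number of
# images, so the whole subset is delivered as a single batch.
train_ds = tf.keras.utils.image_dataset_from_directory(
  train_path,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=5217)

# Validation data comes from the test folder: the 20% "validation" subset of its images.
val_ds = tf.keras.utils.image_dataset_from_directory(
  test_path,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=624)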

In [None]:
for image_batch, labels_batch in train_ds:
    print(image_batch.shape)
    print(labels_batch.shape)
    break

In [None]:
class_names = train_ds.class_names
print(class_names)

In [None]:
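# Count how many training images fall into each class.
# Each element of train_ds is an (images, labels) pair with integer labels.
mask = []
for _, labels in train_ds:
    for label in labels.numpy():
        if class_names[label] == "PNEUMONIA":
            mask.append("Pneumonia")
        else:
            mask.append("Normal")
sns.countplot(x=mask)
print(pd.Series(mask).value_counts(normalize=True))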

In [None]:
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
log = LogisticRegression(penalty='l2')
# ravel() passes the labels as the 1-D array scikit-learn expects
log.fit(X_train, y_train.ravel())
print(f'Test Score: {log.score(X_test, y_test)}')

In [None]:
model = RandomForestClassifier(criterion='entropy', max_depth=15, min_samples_split=5,
                               n_estimators=700, random_state=777, max_features='log2')
model.fit(X_train, y_train.ravel())
# Cross-validate on the training data so the test set stays untouched until the final score
print(cross_val_score(model, X_train, y_train.ravel(), cv=5))
print(f'Train Score: {model.score(X_train, y_train)}')
print(f'Test Score: {model.score(X_test, y_test)}')

In [None]:
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', kernel_regularizer=regularizers.l1_l2(.005),
                        input_shape=(256, 256, 3)))
model.add(layers.MaxPooling2D((2, 2)))

model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer="sgd",
              metrics=['acc'])

# train_ds is already batched, so batch_size and steps_per_epoch are left to the dataset
history = model.fit(train_ds,
                    epochs=20,
                    validation_data=val_ds)

print("Model training finished.")
# evaluate() returns [loss, accuracy] here because 'acc' is the only compiled metric
_, train_acc = model.evaluate(train_ds, verbose=0)
print(f"Train accuracy: {round(train_acc, 3)}")

print("Evaluating model performance...")
_, val_acc = model.evaluate(val_ds, verbose=0)
print(f"Validation accuracy: {round(val_acc, 3)}")




In [None]:

model = tf.keras.Sequential([
  # The input shape is declared on the first layer; Rescaling maps 0-255 pixels to [0, 1]
  tf.keras.layers.Rescaling(1./255, input_shape=(256, 256, 3)),
  tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(64, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(2)
])

model.compile(
  optimizer="adam",
  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
  metrics=['accuracy'])
print("Start training the model...")
# Train and validate on the tf.data datasets: both hold raw 0-255 pixels and integer
# class labels, whereas train_images is already rescaled and y_train marks NORMAL as 1
model.fit(train_ds, epochs=5, validation_data=val_ds)
print("Model training finished.")
# evaluate() returns [loss, accuracy] here because 'accuracy' is the only compiled metric
_, train_acc = model.evaluate(train_ds, verbose=0)
print(f"Train accuracy: {round(train_acc, 3)}")

print("Evaluating model performance...")
_, val_acc = model.evaluate(val_ds, verbose=0)
print(f"Validation accuracy: {round(val_acc, 3)}")
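
Because the final `Dense(2)` layer has no activation and the loss is built with `from_logits=True`, `model.predict` would return raw logits rather than probabilities. A minimal NumPy sketch of converting such logits into class probabilities with a softmax, using invented logit values and an assumed column order of `[NORMAL, PNEUMONIA]`:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Invented logits for three images; columns assumed to be [NORMAL, PNEUMONIA]
logits = np.array([[2.0, -1.0],
                   [0.0, 0.0],
                   [-0.5, 3.0]])

probs = softmax(logits)
pred = probs.argmax(axis=1)  # index of the most probable class per image

print(probs.round(3))
print(pred)  # [0 0 1]
```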

## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***
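
When interpreting results, accuracy alone can hide how a model treats each class, particularly when the classes are imbalanced. As a hedged sketch on made-up labels and predictions (1 is assumed to mean pneumonia, and `binary_metrics` is a hypothetical helper), precision and recall can be computed from the confusion-matrix counts:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall), treating label 1 as the positive class."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up ground truth and predictions for eight images
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.75 precision=0.80 recall=0.80
```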

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***