# Lung Cancer Detection Using a Convolutional Neural Network and OpenAI

**Importing Libraries:**

- numpy (as np): A library for numerical operations and array manipulation.
- pandas (as pd): A library for data manipulation and analysis.
- matplotlib.pyplot (as plt): Used for data visualization, including plots and charts.
- PIL (Pillow, the maintained fork of the Python Imaging Library): A library for working with image files.
- glob: Finds file paths that match a wildcard pattern.
- sklearn (scikit-learn): A machine learning library; it is used here for the train/validation split and the evaluation metrics.
- cv2 (OpenCV): An open-source computer vision library used for image processing.
- gc: Python's built-in interface to the garbage collector.
- os: Python's built-in module for operating system services; it is used here to list the class directories.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from glob import glob

from sklearn.model_selection import train_test_split
from sklearn import metrics

import cv2
import gc
import os

**Importing TensorFlow and Keras:**

- tensorflow (as tf): An open-source machine learning framework developed by Google.
- keras: TensorFlow's high-level API for building and training neural networks.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

import warnings
warnings.filterwarnings('ignore')

**Data Path and Directory Structure:**

This cell defines the path to the dataset's ZIP archive and extracts it into the current working directory with ZipFile. The extracted files are organized into one subdirectory per class.

In [2]:
from zipfile import ZipFile

# Extract the dataset from the ZIP file
data_path = r"C:\Users\chowd\Downloads\archive.zip"

with ZipFile(data_path, 'r') as zf:  # 'zf' avoids shadowing the built-in zip()
    zf.extractall()
    print('The dataset has been extracted.')

The dataset has been extracted.


In [3]:
# Define paths and classes
path = 'lung_colon_image_set/lung_image_sets'
classes = sorted(os.listdir(path))  # sorted so the class-to-index mapping is reproducible

In [19]:
IMG_SIZE = 256
SPLIT = 0.2
EPOCHS = 10  # Upper bound; early stopping may end training sooner
BATCH_SIZE = 64

**Data Preprocessing and Augmentation:**

This cell configures an ImageDataGenerator that rescales pixel values to the range [0, 1] and applies random augmentation to increase the diversity of the training data: rotations of up to 20 degrees, horizontal and vertical shifts of up to 20% of the image size, and horizontal flips.

In [20]:
# Data preprocessing and augmentation
datagen = ImageDataGenerator(
    rescale=1.0/255.0,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)
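
To make two of these settings concrete, the sketch below reproduces the effect of `rescale` and `horizontal_flip` on a tiny NumPy array. It is only an illustration with made-up pixel values: it does not call ImageDataGenerator, and the real generator applies the flip randomly rather than to every image.

```python
import numpy as np

# Toy 2x3 single-channel "image" with 8-bit pixel values (illustrative only)
toy_img = np.array([[0, 128, 255],
                    [64, 32, 16]], dtype=np.uint8)

# rescale=1.0/255.0 multiplies every pixel by 1/255, mapping [0, 255] to [0, 1]
scaled = toy_img.astype("float32") * (1.0 / 255.0)

# A horizontal flip mirrors the image left to right (reverses the width axis)
flipped = scaled[:, ::-1]

print(scaled.min(), scaled.max())
print(flipped[0])
```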

**Loading and Preprocessing Images:**

The code iterates through the class subdirectories of the dataset directory and reads every .jpeg image with OpenCV (cv2). Each image is resized to IMG_SIZE × IMG_SIZE pixels and appended to the X list. The corresponding label (the class index) is appended to the Y list.

**One-Hot Encoding:**

The labels (Y) are one-hot encoded using pd.get_dummies. This converts categorical labels into binary vectors.
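
As a small illustration of this step, the sketch below applies pd.get_dummies to a toy label list (the labels are made up for the example). Recent pandas releases return boolean columns, so the sketch casts the result to float32; that cast is an optional extra that the notebook cell does not perform.

```python
import numpy as np
import pandas as pd

# Toy integer labels for three classes (illustrative values, not from the dataset)
toy_labels = [0, 2, 1, 0]

# One column per distinct label; the cast makes the dtype independent of the pandas version
toy_one_hot = pd.get_dummies(toy_labels).values.astype("float32")

print(toy_one_hot)

# np.argmax along axis 1 recovers the original integer labels
recovered = np.argmax(toy_one_hot, axis=1)
print(recovered)
```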

**Train-Test Split:**

The dataset is split into training and validation sets using train_test_split from scikit-learn.
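
The call in this notebook shuffles the data but does not stratify it, so class proportions in the two subsets can differ slightly. The sketch below shows the optional `stratify` argument of train_test_split on toy data (the arrays are invented for the example); whether stratification is needed here is a judgment call, not something the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 30 samples, 3 classes with 10 samples each (illustrative only)
toy_X = np.arange(30).reshape(30, 1)
toy_y = np.repeat([0, 1, 2], 10)

# stratify keeps the class proportions the same in both subsets
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=2022, stratify=toy_y
)

# Per-class sample counts in the training and validation subsets
print(np.bincount(y_tr), np.bincount(y_va))
```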

In [21]:
X = []
Y = []

for i, cat in enumerate(classes):
    images = glob(f'{path}/{cat}/*.jpeg')

    for image in images:
        img = cv2.imread(image)
        X.append(cv2.resize(img, (IMG_SIZE, IMG_SIZE)))
        Y.append(i)


X = np.asarray(X)
one_hot_encoded_Y = pd.get_dummies(Y).values

X_train, X_val, Y_train, Y_val = train_test_split(X, one_hot_encoded_Y,
                                                  test_size=SPLIT,
                                                  random_state=2022)
print(X_train.shape, X_val.shape)

(12000, 256, 256, 3) (3000, 256, 256, 3)
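
Holding every image in one array has a noticeable memory cost. The arithmetic below estimates it from the shapes printed above, assuming 8-bit pixels as returned by cv2.imread, and shows what a conversion of the whole array to float32 would cost.

```python
# Memory needed to hold every image in one array, using the shapes printed above
n_images = 12000 + 3000
bytes_uint8 = n_images * 256 * 256 * 3   # one byte per channel value
bytes_float32 = bytes_uint8 * 4          # four bytes per value after a float32 cast

print(f"uint8:   {bytes_uint8 / 2**30:.2f} GiB")
print(f"float32: {bytes_float32 / 2**30:.2f} GiB")
```

Rescaling batch by batch inside a generator avoids materializing the larger float array all at once.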


**Building a Convolutional Neural Network (CNN):**

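A CNN is defined with the Keras Sequential API. Three convolution and max-pooling stages are followed by two fully connected layers with batch normalization and dropout, and by a softmax output layer with one neuron per class. The model is compiled with the Adam optimizer, categorical cross-entropy loss, and accuracy as the evaluation metric.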

In [22]:
# Build the CNN: three conv/pool stages followed by a dense classifier
model = keras.models.Sequential([
    layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu', input_shape=(IMG_SIZE, IMG_SIZE, 3), padding='same'),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D(2, 2),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(len(classes), activation='softmax')  # Output neurons equal to the number of classes
])
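
The shape arithmetic for this layer stack can be checked by hand. The sketch below assumes IMG_SIZE = 256 as set earlier and uses plain Python rather than Keras; `model.summary()` would report the same figures for the first dense layer.

```python
# Shape arithmetic for the layer stack above, assuming IMG_SIZE = 256
img_size = 256
conv_filters = [32, 64, 128]

side = img_size
for _ in conv_filters:
    # padding='same' keeps the spatial size; each 2x2 max-pool halves it
    side //= 2

# Flatten turns the final (side, side, 128) feature map into a vector
flatten_units = side * side * conv_filters[-1]

# Weights plus biases of the first Dense(256) layer
dense_params = flatten_units * 256 + 256

print(side, flatten_units, dense_params)  # 32 131072 33554688
```

Almost all of the model's parameters sit in that first dense layer, which follows directly from flattening a 32 × 32 × 128 feature map.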

In [23]:
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

**Model Training and Callbacks:**

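The model is trained on augmented batches of the training data. Two callbacks control training: EarlyStopping halts training when validation accuracy has not improved for 3 consecutive epochs and restores the best weights, and ReduceLROnPlateau halves the learning rate when validation loss has not improved for 2 consecutive epochs.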

In [24]:
# Early stopping and learning-rate reduction callbacks
es = keras.callbacks.EarlyStopping(patience=3, monitor='val_accuracy', restore_best_weights=True)
lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', patience=2, factor=0.5, verbose=1)
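
The `patience` argument is easiest to understand as a counter. The function below is a simplified model of that counter in plain Python, not the Keras implementation: it ignores options such as `min_delta` and `restore_best_weights`, and both the function name and the accuracy values are invented for the illustration.

```python
def early_stop_epoch(val_accuracy, patience=3):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(val_accuracy):
        if value > best:
            best, wait = value, 0   # an improvement resets the counter
        else:
            wait += 1               # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies, for illustration only
toy_history = [0.80, 0.85, 0.84, 0.85, 0.83, 0.86]
print(early_stop_epoch(toy_history))  # 4
```

In this toy run the best value is reached at epoch 1, the next three epochs do not exceed it, and training stops at epoch 4 before the later improvement is ever seen.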

In [None]:
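# Train the model on augmented batches.
# The validation images need the same 1/255 rescaling as the training images.
val_datagen = ImageDataGenerator(rescale=1.0/255.0)
history = model.fit(datagen.flow(X_train, Y_train, batch_size=BATCH_SIZE),
                    validation_data=val_datagen.flow(X_val, Y_val, batch_size=BATCH_SIZE, shuffle=False),
                    epochs=EPOCHS,
                    verbose=1,
                    callbacks=[es, lr])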

In [None]:
# Visualize training history
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot()
history_df.loc[:, ['accuracy', 'val_accuracy']].plot()
plt.show()

In [None]:
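# Evaluate the model on the validation set
# Rescale the validation images the same way as the training images
val_flow = ImageDataGenerator(rescale=1.0/255.0).flow(X_val, batch_size=BATCH_SIZE, shuffle=False)
Y_true = np.argmax(Y_val, axis=1)  # keep Y_val intact so the cell can be re-run
Y_pred = np.argmax(model.predict(val_flow), axis=1)

In [None]:
print(metrics.confusion_matrix(Y_true, Y_pred))
print(metrics.classification_report(Y_true, Y_pred, target_names=classes))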
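
The figures in the classification report can be derived from the confusion matrix. The sketch below does this with NumPy on a hypothetical 3-class matrix (the counts are invented, not results from this model), using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class)
toy_cm = np.array([[90, 5, 5],
                   [10, 80, 10],
                   [0, 20, 80]])

correct = np.diag(toy_cm)
recall = correct / toy_cm.sum(axis=1)      # share of each true class that was found
precision = correct / toy_cm.sum(axis=0)   # share of each predicted class that was right
accuracy = correct.sum() / toy_cm.sum()

print(recall, precision, accuracy)
```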

In [18]:
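# GPT-3 integration
# Note: this cell uses the legacy Completion interface of the openai package (releases before 1.0)
import openai

# Read the API key from the environment instead of hard-coding it
openai.api_key = os.environ.get('OPENAI_API_KEY')

# Example of using GPT-3 to generate text from image file names.
# Only the file name is sent as text; the model never receives the pixel data,
# so the returned descriptions are not based on the image content.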
def generate_image_descriptions(images, max_tokens=50):
    descriptions = []
    for image in images:
        prompt = f"Describe the image: '{image}'"
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=prompt,
            max_tokens=max_tokens
        )
        generated_description = response.choices[0].text
        descriptions.append(generated_description)
    return descriptions

# Example usage
image_files = ['lungscc92.jpg', 'lungscc91.jpg', 'lungscc93.jpg']
descriptions = generate_image_descriptions(image_files)

for i, description in enumerate(descriptions):
    print(f"Description for {image_files[i]}: {description}")

Description for lungscc92.jpg: 



The image shows two lungs, one on the left and one on the right. They are surrounded by blood vessels and nerves. The left lung is smaller than the right lung.
Description for lungscc91.jpg: 

The image is a closeup of a human lungs with cancerous growths.
Description for lungscc93.jpg: 

This image is of a pair of lungs, viewed from the front. The left lung is mostly obscured by the heart, which is located in the center of the chest. The right lung is visible and appears to be healthy. There are numerous
