#CANCER PREDICTION

#Objective

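The objective of cancer prediction is to develop a reliable and accurate system for identifying the presence, risk, and progression of cancer in individuals. In this notebook, that means classifying a tumor as malignant or benign from its recorded features using logistic regression.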

#Data Source

Data sources for cancer prediction typically come from a variety of clinical, genomic, and environmental databases. These data sources include:

Medical Records: Electronic Health Records (EHRs) containing patient history, diagnoses, treatments, and outcomes. Example: MIMIC-III (Medical Information Mart for Intensive Care).

Genomic Data: Sequencing data from patients, including whole genome, exome, or targeted gene panels. Example: The Cancer Genome Atlas (TCGA).

Imaging Data: Medical imaging such as X-rays, MRIs, CT scans, and histopathology slides. Example: The Cancer Imaging Archive (TCIA).

Clinical Trials Data: Information from clinical studies on cancer treatments, patient responses, and outcomes. Example: ClinicalTrials.gov.

Biobank Data: Biological samples and associated data from biobanks. Example: UK Biobank.

Public Health Databases: Epidemiological data on cancer incidence, prevalence, and risk factors. Example: Surveillance, Epidemiology, and End Results (SEER) Program.

Lifestyle and Environmental Data: Information on patients' lifestyle choices, environmental exposures, and occupational hazards. Example: National Health and Nutrition Examination Survey (NHANES).

Patient-Reported Outcomes: Data from patient surveys on quality of life, symptoms, and treatment side effects. Example: PRO-CTCAE (Patient-Reported Outcomes version of the Common Terminology Criteria for Adverse Events).

By integrating and analyzing data from these diverse sources, researchers and clinicians can develop more accurate and personalized cancer prediction models.

#IMPORT LIBRARY

In [None]:
import pandas as pd

#IMPORT DATA

In [None]:
cancer = pd.read_csv('https://github.com/YBIFoundation/Dataset/raw/main/Cancer.csv')

In [None]:
cancer.head()

In [None]:
cancer.info()

#Describe Data

In [None]:
cancer.describe()

#Define target (y) and features (X)

In [None]:
cancer.columns

In [None]:
y = cancer['diagnosis']

In [None]:
X = cancer.drop(['id','diagnosis','Unnamed: 32'],axis=1)


#Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7, random_state=2529)


In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

#MODELING

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=5000)


#Model Training

In [None]:
model.fit(X_train,y_train)

In [None]:
model.intercept_

In [None]:
model.coef_

#Prediction

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred

# Model accuracy

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [None]:
confusion_matrix(y_test,y_pred)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
print(classification_report(y_test,y_pred))

#EXPLANATION

The cancer prediction model in this notebook uses Logistic Regression to classify cancer diagnoses from the features in the dataset. Each part of the code is explained briefly below:

1. Import Libraries and Data

    import pandas as pd

    # Load the dataset
    cancer = pd.read_csv('https://github.com/YBIFoundation/Dataset/raw/main/Cancer.csv')

- pandas is used to load and handle the data.
- The dataset is loaded from a URL into a DataFrame named cancer.

2. Examine the Data

    cancer.head()
    cancer.info()
    cancer.describe()

- cancer.head(): Shows the first five rows of the dataset to reveal its structure.
- cancer.info(): Reports the column names, data types, and non-null counts of the DataFrame.
- cancer.describe(): Gives summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for the numerical features.

3. Define Target and Features

    y = cancer['diagnosis']
    X = cancer.drop(['id','diagnosis','Unnamed: 32'], axis=1)

- y: Target variable, the cancer diagnosis (e.g., malignant or benign).
- X: Features used for prediction. Three columns are dropped: id (an identifier with no predictive meaning), diagnosis (the target itself), and Unnamed: 32 (which appears to be an artifact column with no usable values).

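
As a rough illustration of this step, the sketch below applies the same column selection to a tiny made-up frame. The three rows and the single feature column radius_mean are placeholders, not values read from Cancer.csv. The 'M'/'B' label encoding is an assumption, and mapping it to 1/0 is an optional extra that the notebook itself does not perform.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame that mimics the column layout described above.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "M"],            # assumed label encoding
    "radius_mean": [17.9, 11.4, 20.1],       # made-up measurements
    "Unnamed: 32": [np.nan, np.nan, np.nan], # column with no values
})

# Columns that contain no values at all are candidates for removal.
empty_columns = [col for col in toy.columns if toy[col].isna().all()]

# Same selection as in the notebook: the target, then the remaining features.
y = toy["diagnosis"]
X = toy.drop(["id", "diagnosis", "Unnamed: 32"], axis=1)

# Optional: encode the target numerically (1 = malignant, 0 = benign).
y_encoded = y.map({"M": 1, "B": 0})

print(empty_columns)
print(list(X.columns))
print(y_encoded.tolist())
```

Checking for all-empty columns before dropping them is one way to confirm that a column such as Unnamed: 32 really carries no information.
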
4. Train-Test Split

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=2529)

- train_test_split: Splits the data into training and testing sets.
- X_train, X_test: Features for training and testing.
- y_train, y_test: Target variable for training and testing.
- train_size=0.7: Uses 70% of the data for training and leaves the remaining 30% for testing.
- random_state=2529: Fixes the random seed so the split is reproducible.

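
To make the role of random_state concrete, here is a small pure-Python sketch of a seeded shuffle-and-split. The helper seeded_split is hypothetical; it is not how scikit-learn implements train_test_split and will not select the same rows. It only shows that fixing the seed fixes the partition.

```python
import random

def seeded_split(items, train_size, seed):
    """Shuffle a copy of items with a fixed seed and cut it into two parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for row indices

# Two calls with the same seed produce the same partition.
train_a, test_a = seeded_split(rows, train_size=0.7, seed=2529)
train_b, test_b = seeded_split(rows, train_size=0.7, seed=2529)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

Without a fixed seed, each run would produce a different split, and metrics measured on the test set would vary from run to run.
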
5. Modeling

    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(max_iter=5000)

- LogisticRegression: A linear classification algorithm that models the probability of a binary outcome.
- max_iter=5000: Raises the maximum number of solver iterations above the scikit-learn default of 100, giving the solver room to converge.

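
A large max_iter is often needed when features are on very different scales, because unscaled inputs slow solver convergence. The sketch below is an optional variation, not the notebook's method: it standardizes the features inside a pipeline before fitting. It assumes scikit-learn's bundled load_breast_cancer data as an offline stand-in for Cancer.csv, and all variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data set (assumption): scikit-learn's bundled breast cancer data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=2529
)

# Scale each feature to zero mean and unit variance, then fit the classifier.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
pipeline.fit(X_train, y_train)

# Accuracy on the held-out data and the iterations the solver actually used.
test_accuracy = pipeline.score(X_test, y_test)
iterations_used = int(pipeline.named_steps["logisticregression"].n_iter_[0])

print(test_accuracy, iterations_used)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into preprocessing.
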
6. Model Training

    model.fit(X_train, y_train)

- model.fit(): Trains the logistic regression model on the training data.

7. Inspect Model Parameters

    model.intercept_
    model.coef_

- model.intercept_: Retrieves the learned intercept (bias) term of the model.
- model.coef_: Retrieves the learned coefficients, one per feature.

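
To show how these two attributes combine into a prediction, the sketch below computes a logistic-regression probability by hand. The intercept, coefficients, and feature values are made-up numbers, not the fitted values from this notebook, and predict_probability is a hypothetical helper.

```python
import math

def predict_probability(intercept, coefficients, features):
    """Return sigmoid(intercept + sum(coef_i * x_i))."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up parameters and one made-up sample with two features.
intercept = -1.0
coefficients = [0.5, -0.25]
features = [4.0, 2.0]

# Linear score: -1.0 + 0.5 * 4.0 - 0.25 * 2.0 = 0.5
probability = predict_probability(intercept, coefficients, features)
print(round(probability, 4))
```

For a binary problem, scikit-learn reports this quantity as the probability of the second entry in model.classes_. A positive coefficient pushes the probability up as its feature grows; a negative one pushes it down.
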
8. Prediction

    y_pred = model.predict(X_test)

- model.predict(): Uses the trained model to predict a diagnosis for each sample in the test set.

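
For a binary logistic regression, predict effectively assigns the positive class when the estimated probability exceeds 0.5. The sketch below makes that threshold explicit and shows the effect of lowering it. The probabilities, the 'M'/'B' labels, and the classify helper are all assumptions for illustration.

```python
def classify(probabilities, threshold=0.5, positive="M", negative="B"):
    """Label each probability as positive when it exceeds the threshold."""
    return [positive if p > threshold else negative for p in probabilities]

# Made-up probabilities of the positive (malignant) class for four samples.
probabilities = [0.92, 0.48, 0.07, 0.55]

default_labels = classify(probabilities)                  # threshold 0.5
cautious_labels = classify(probabilities, threshold=0.3)  # flags more cases

print(default_labels)
print(cautious_labels)
```

Lowering the threshold trades more false positives for fewer missed malignant cases, which can be a reasonable choice in a screening setting.
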
9. Model Accuracy

    from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

    confusion_matrix(y_test, y_pred)
    accuracy_score(y_test, y_pred)
    print(classification_report(y_test, y_pred))

- confusion_matrix: Shows the number of true positives, true negatives, false positives, and false negatives.
- accuracy_score: Computes the overall accuracy, the fraction of test samples predicted correctly.
- classification_report: Provides detailed metrics, including precision, recall, and F1-score for each class.

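
The metrics above all derive from the four confusion-matrix counts. The sketch below computes them by hand for made-up counts (not results from this notebook); summarize is a hypothetical helper.

```python
def summarize(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many are right
    recall = tp / (tp + fn)      # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for 100 test samples.
metrics = summarize(tp=40, fp=5, fn=10, tn=45)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When reading scikit-learn's confusion_matrix output, rows correspond to true classes and columns to predicted classes, so for two classes the layout is [[tn, fp], [fn, tp]] with the classes in sorted label order.
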
Summary

The cancer prediction model uses Logistic Regression to classify whether a tumor is malignant or benign. The steps are:

1. Loading and examining the dataset.
2. Preparing the data by defining the features and the target variable.
3. Splitting the data into training and testing sets.
4. Training the model on the training data.
5. Predicting diagnoses for the test set.
6. Evaluating the model's performance with the confusion matrix, accuracy score, and classification report.

This process predicts a diagnosis from the measured features, with the aim of assisting early detection and diagnosis.