#**INTERNSHIP** **TASK-3**

#**End-to-end** **data** **science** **project**

## Task: Develop a full data science project, from data collection and preprocessing to model deployment, using Flask or FastAPI.

### In this project, I use the **Breast Cancer Wisconsin (Diagnostic) dataset** from sklearn.datasets to develop a complete end-to-end data science pipeline, including data collection, preprocessing, model building, and deployment using FastAPI.

##**Problem Overview: Breast Cancer Prediction**

###**Problem statement**

Breast cancer is one of the most common cancers affecting women worldwide. Early detection is crucial because it can significantly increase the chances of successful treatment.

The goal of this project is:

“To develop a machine learning model that can predict whether a tumor is malignant (cancerous) or benign (non-cancerous) based on certain medical features of breast tissue.”

This prediction can assist doctors in diagnosing and planning treatment more effectively.

##**Data description**

The dataset contains:

**Features**: **30** numerical measurements from digitized images of breast tumors (e.g., radius, texture, perimeter, area, smoothness).

**Target**: Binary class:

0 = Malignant (cancerous)

1 = Benign (non-cancerous)

There are 569 samples in total, with 212 malignant and 357 benign tumors.
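The sample and class counts quoted above can be checked directly against the loaded dataset. The snippet below is a minimal sketch; the names `counts` and `class_counts` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is ordered by label: index 0 -> 'malignant', index 1 -> 'benign'
counts = np.bincount(data.target)
class_counts = {str(name): int(n) for name, n in zip(data.target_names, counts)}

print(data.data.shape)  # (n_samples, n_features)
print(class_counts)
```

Reading the label order from `target_names` instead of hard-coding it protects against mixing up which integer means malignant.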

###**Objectives**

* **Data Preprocessing:** Clean the data, handle missing values, and scale features.

* **Model Training**: Train a classifier (Logistic Regression, Random Forest, etc.) to predict tumor type.

* **Evaluation:** Measure model performance using accuracy, precision, recall, and F1-score.

* **Deployment:** Make the model accessible via an API so it can predict new cases in real-time.

##**Load dataset**

In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [None]:
#display first 5 rows
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [None]:
#no of rows and columns
df.shape

(569, 31)

In [None]:
#information about dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [None]:
#summary statistics for numerical columns
df.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


##**Data preprocessing**

###**Checking missing values**

In [None]:
df.isnull().sum()

Unnamed: 0,0
mean radius,0
mean texture,0
mean perimeter,0
mean area,0
mean smoothness,0
mean compactness,0
mean concavity,0
mean concave points,0
mean symmetry,0
mean fractal dimension,0


### There are no missing values in this dataset.

###**Checking duplicates**

In [None]:
df.duplicated().sum()

np.int64(0)

### There are no duplicate rows in this dataset.

### Since all the columns in this dataset are numerical, there is no need for encoding.

##**Feature selection**



Here I apply the SelectKBest method for feature selection.

SelectKBest is a feature selection method in Scikit-learn that selects the top k features based on a chosen statistical test. It helps reduce dimensionality by keeping only the most relevant features.



In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Split data into features and target
X = df.drop('target', axis=1)
y = df['target']

# Select top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Top 10 features:", selected_features)

# Use only selected features
X = df[selected_features]


Top 10 features: Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
       'mean concave points', 'worst radius', 'worst perimeter', 'worst area',
       'worst concavity', 'worst concave points'],
      dtype='object')
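To see why these ten features were chosen, it can help to look at the scores behind the selection. The sketch below refits the same selector on a fresh copy of the data and ranks the features by their ANOVA F-statistic; `f_scores` and `top10` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# scores_ holds one ANOVA F-statistic per feature; a higher value means the
# class means differ more relative to the within-class variance.
f_scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
top10 = f_scores.head(10)

print(top10.round(1))
```

The ten highest-scoring names are the same features that `get_support()` marks as selected.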


##**Splitting data**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


train_test_split separates data into training and testing sets to evaluate the model.
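With `test_size=0.2`, the 569 samples are divided into 455 training rows and 114 test rows. The sketch below reproduces the split and also shows an optional `stratify=y` variant, which the notebook does not use and which is included here only as an illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same settings as the notebook: 20% held out, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Optional variant (an assumption, not used in the notebook): stratify=y
# keeps the benign/malignant ratio nearly identical in both subsets.
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(round(float(y_train_s.mean()), 3), round(float(y_test_s.mean()), 3))
```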

##**Scaling data**

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

StandardScaler standardizes features so that each has a mean of 0 and a variance of 1. The scaler is fitted on the training set only and then applied unchanged to the test set. Scaling can improve model performance, especially for algorithms sensitive to feature scale.
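Standardization replaces each value x with z = (x - mean) / std, computed per column. As a minimal sketch, the snippet below checks that a manual z-score matches the output of `StandardScaler`. For brevity it fits on the full feature matrix rather than on the training split; `X_manual` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Manual z-score: subtract the column mean, divide by the column standard
# deviation (population form, ddof=0, which is what StandardScaler uses).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6)[:3], X_scaled.std(axis=0).round(6)[:3])
```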

##**Train the model**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)


Here I chose **Logistic Regression** for model building because it is well-suited for binary classification, which matches our target of predicting malignant or benign tumors.
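In this notebook, SelectKBest is fitted on the full dataset before the train/test split, so the test rows also influence which features are kept. One common way to avoid that is to bundle selection, scaling, and the classifier into a single scikit-learn `Pipeline` and evaluate it with cross-validation. The following is a sketch of that alternative, not the method used above; the choice of 5 folds is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Each step is re-fitted on the training part of every fold, so feature
# selection and scaling never see the held-out samples.
pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=10),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(scores.round(3), round(float(scores.mean()), 3))
```

A fitted pipeline can also be saved as a single object, which would replace the three separate artifacts with one file.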

##**Evaluate the model**

In [None]:
# Predictions
y_pred = model.predict(X_test_scaled)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.9736842105263158
              precision    recall  f1-score   support

           0       0.95      0.98      0.97        43
           1       0.99      0.97      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



The Logistic Regression model reached an accuracy of 97.4% on the 114-sample test set, which corresponds to 111 correct predictions and 3 errors. For malignant tumors (class 0), precision was 0.95 and recall was 0.98: 42 of the 43 malignant cases were detected, and 2 benign cases were flagged as malignant. For benign tumors (class 1), precision was 0.99 and recall was 0.97. The F1-scores were 0.97 and 0.98 respectively, reflecting a good balance between precision and recall. These figures come from a single 80/20 split of 569 samples, so they show that the model performs well on this dataset.
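A confusion matrix makes the individual error types behind these metrics visible. The sketch below repeats the notebook's steps in compact form and tabulates true against predicted labels; `cm` and `cm_df` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Same steps as the notebook: top-10 features, 80/20 split, scaling, fit
X = X.loc[:, SelectKBest(f_classif, k=10).fit(X, y).get_support()]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Rows are true labels, columns are predicted labels, in the order [0, 1]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
cm_df = pd.DataFrame(
    cm,
    index=["true malignant", "true benign"],
    columns=["pred malignant", "pred benign"],
)
print(cm_df)
```

In a screening context the top-right cell (malignant cases predicted as benign) is usually the most costly error, so it is worth inspecting separately from overall accuracy.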

##**Save model & scaler**

In [None]:
import joblib

joblib.dump(model, 'breast_cancer_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(selected_features, 'selected_features.pkl')  # Save feature names


['selected_features.pkl']

The model, the scaler, and the selected feature names are saved to disk for deployment.
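Before deploying, it is worth confirming that the saved artifacts reload and give the same predictions. The sketch below does a round trip through a temporary directory. It trains a small stand-in model on all 30 features for brevity, so it illustrates the check, not the exact notebook model.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "breast_cancer_model.pkl"
    scaler_path = Path(tmp) / "scaler.pkl"
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)

    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# The reloaded objects should reproduce the original predictions exactly
same = np.array_equal(
    model.predict(scaler.transform(X)),
    loaded_model.predict(loaded_scaler.transform(X)),
)
print(same)
```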

##**Deployment using FastAPI**

##**Create FastAPI app**

In [None]:
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd
import numpy as np

# Load saved artifacts
model = joblib.load("breast_cancer_model.pkl")
scaler = joblib.load("scaler.pkl")
selected_features = joblib.load("selected_features.pkl")  # top 10 features

app = FastAPI(title="Breast Cancer Prediction API")

# Define input schema: one field per selected feature, with underscores
# in place of the spaces used in the training column names
class CancerData(BaseModel):
    mean_radius: float
    mean_perimeter: float
    mean_area: float
    mean_concavity: float
    mean_concave_points: float
    worst_radius: float
    worst_perimeter: float
    worst_area: float
    worst_concavity: float
    worst_concave_points: float

@app.get("/")
def home():
    return {"message": "Welcome to Breast Cancer Prediction API"}

@app.post("/predict")
def predict(data: CancerData):
    # Map field names back to the training column names (underscores -> spaces)
    input_dict = {name.replace("_", " "): value for name, value in data.dict().items()}
    # Build a one-row DataFrame with columns in the order used during training
    input_df = pd.DataFrame([input_dict], columns=selected_features)

    # Scale input
    input_scaled = scaler.transform(input_df)

    # Make prediction
    prediction = model.predict(input_scaled)[0]
    result = "Benign" if prediction == 1 else "Malignant"

    return {"prediction": result}


This code turns our trained machine learning model into an API, so anyone can send input data and get predictions over the web.

Instead of running the model manually in Python, now it can receive requests and return predictions in JSON format.

This allows easy integration with web apps, dashboards, or other software.
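One detail to check when building the input DataFrame is that the request fields use underscores (Python identifiers cannot contain spaces), while the training columns contain spaces. The sketch below uses three of the selected features and values from the first dataset row to show what happens when the keys are and are not mapped; `unmapped`, `renamed`, and `mapped` are illustrative names.

```python
import pandas as pd

# Column names as they appear in the training data (with spaces)
selected_features = pd.Index(["mean radius", "mean concave points", "worst area"])

# What a request model hands back: keys with underscores
payload = {"mean_radius": 17.99, "mean_concave_points": 0.1471, "worst_area": 2019.0}

# Keys do not match the requested columns, so every cell is NaN
unmapped = pd.DataFrame([payload], columns=selected_features)

# Renaming the keys first lines them up with the training columns
renamed = {key.replace("_", " "): value for key, value in payload.items()}
mapped = pd.DataFrame([renamed], columns=selected_features)

print(unmapped.isna().values.all())
print(mapped.iloc[0].tolist())
```

Passing `columns=selected_features` also fixes the column order, which matters because the scaler and model expect features in the order seen during training.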

###**Create the deployment script**

In [None]:
!pip install pyngrok --quiet


pyngrok is a Python wrapper for ngrok, which allows you to expose a local server (like FastAPI running on port 8000) to the internet.

In [None]:
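import os
from pyngrok import ngrok

# Read the authtoken from an environment variable instead of hard-coding it;
# NGROK_AUTHTOKEN is an assumed variable name that must be set beforehand.
ngrok.set_auth_token(os.environ["NGROK_AUTHTOKEN"])

# Start a tunnel to the local server on port 8000
public_url = ngrok.connect(8000)
print("Your public URL:", public_url)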


Your public URL: NgrokTunnel: "https://posthumously-nonethnic-susie.ngrok-free.dev" -> "http://localhost:8000"


This code creates a public URL for the FastAPI app running in Colab using ngrok:

* The import loads the ngrok Python library so we can create a tunnel to our local server.

* `set_auth_token` authenticates our ngrok account. This is required because free ngrok accounts need a verified authtoken to start tunnels.

* `ngrok.connect(8000)` creates a public URL that points to the local FastAPI server running on port 8000. This URL can be accessed from any browser, Postman, or Python requests.

* The `print` call shows the URL you can use to test your API.

Important: ngrok URLs are temporary for free accounts, so the URL changes each time you restart the tunnel.

###**Run FastAPI**

In [None]:
import uvicorn
uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)


INFO:     Will watch for changes in these directories: ['/content']
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     Started reloader process [353] using StatReload
INFO:     Stopping reloader process [353]


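This code starts the FastAPI server with Uvicorn so that the ngrok tunnel has a local server to forward requests to. The string `"app:app"` tells Uvicorn to import the object named `app` from a module named `app`, so the FastAPI code above must be saved as `app.py` in the working directory. In the log above, the reloader process stops right after it starts, which suggests the server did not stay up in this run.

Without a running Uvicorn server, the API cannot respond to requests.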

###**Using python requests library**

In [None]:
import requests

# GET the root path as a health check; /predict only accepts POST requests
url = "https://posthumously-nonethnic-susie.ngrok-free.dev/"
response = requests.get(url)
print(response.status_code, response.text)


404 <!DOCTYPE html>
<html class="h-full" lang="en-US" dir="ltr">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="preload" href="https://assets.ngrok.com/fonts/euclid-square/EuclidSquare-Regular-WebS.woff" as="font" type="font/woff" crossorigin="anonymous" />
    <link rel="preload" href="https://assets.ngrok.com/fonts/euclid-square/EuclidSquare-RegularItalic-WebS.woff" as="font" type="font/woff" crossorigin="anonymous" />
    <link rel="preload" href="https://assets.ngrok.com/fonts/euclid-square/EuclidSquare-Medium-WebS.woff" as="font" type="font/woff" crossorigin="anonymous" />
    <link rel="preload" href="https://assets.ngrok.com/fonts/euclid-square/EuclidSquare-MediumItalic-WebS.woff" as="font" type="font/woff" crossorigin="anonymous" />
    <link rel="preload" href="https://assets.ngrok.com/fonts/ibm-plex-mono/IBMPlexMono-Text.woff" as="font" type="font/woff" crossorigin="anonymous" />
    <link rel="prelo

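This code checks whether our FastAPI application is reachable publicly through ngrok. A 404 response whose body is an ngrok HTML page, as in the output above, comes from ngrok itself and not from FastAPI (which returns JSON error bodies), so the request did not reach the application.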
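Once the server is reachable, a prediction can be requested by sending a POST with a JSON body to `/predict`. The sketch below builds the body from the first dataset row. It assumes the request schema exposes the ten selected features with underscores in place of spaces. `BASE_URL`, `build_payload`, and `call_predict` are hypothetical names, and the URL must be replaced with the one printed by ngrok.

```python
import requests
from sklearn.datasets import load_breast_cancer

# Assumption: replace with the public ngrok URL printed earlier
BASE_URL = "http://localhost:8000"

# The ten feature names reported by SelectKBest in this notebook
SELECTED = [
    "mean radius", "mean perimeter", "mean area", "mean concavity",
    "mean concave points", "worst radius", "worst perimeter", "worst area",
    "worst concavity", "worst concave points",
]


def build_payload(row_index: int = 0) -> dict:
    """Turn one dataset row into a JSON body with underscore-style keys."""
    frame = load_breast_cancer(as_frame=True).frame
    row = frame.loc[row_index, SELECTED]
    return {name.replace(" ", "_"): float(value) for name, value in row.items()}


def call_predict(payload: dict) -> dict:
    """POST the payload to /predict and return the decoded JSON response."""
    response = requests.post(f"{BASE_URL}/predict", json=payload, timeout=10)
    response.raise_for_status()
    return response.json()


payload = build_payload()
print(payload)
# call_predict(payload)  # uncomment once the server is running
```

`raise_for_status()` turns 4xx/5xx responses into exceptions, so a validation error or an offline tunnel is reported immediately instead of being parsed as a prediction.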