<h1 style="color:navy; text-align:center;">Uncovering Cancer Trends in England: A Decade of NHS Insights</h1>


## 📖 Introduction

Cancer remains one of the most significant health challenges in England, affecting thousands of people each year. The NHS collects comprehensive data on cancer diagnoses, which makes it possible to study patterns and trends across the population.

This project analyzes NHS cancer statistics from 2013 to 2022, focusing on cancer type, stage at diagnosis, gender, and the distribution of cases among population groups. The aim is to understand how cancer incidence varies across the population and how it has changed over the decade. Although the primary goal is descriptive, the findings could also help inform NHS strategies for early detection, prevention, and improving patient outcomes.

## 🎯 Aim 

To analyze NHS cancer statistics from 2013 to 2022 to understand how cancer incidence is distributed across population groups and how it has changed over time, considering cancer type, stage at diagnosis, and gender, and to generate insights into how cancer affects the population.

## 🏥 Objectives

1. To explore and describe trends in cancer incidence over the past decade.

2. To examine the distribution of different cancer types among various population groups.

3. To analyze differences in cancer incidence between genders.

4. To identify population groups that may be disproportionately affected by cancer.

5. To provide visualizations and data-driven insights that help understand the overall cancer landscape in England.

## 💡 Data Snapshot: Overview of the Data

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv(r"C:\Users\Lenovo\OneDrive\Desktop\Project\NHS_CancerData.csv")

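The absolute Windows path above only resolves on the original machine. A minimal sketch of a more portable load is shown here, assuming the CSV is kept in a `data` folder next to the notebook; the folder name, the `DATA_PATH` constant, and the `load_cancer_data` helper are illustrative assumptions, not part of the original project.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: a "data" folder next to the notebook (an assumption).
DATA_PATH = Path("data") / "NHS_CancerData.csv"


def load_cancer_data(path: Path = DATA_PATH) -> pd.DataFrame:
    """Read the NHS cancer CSV, failing early with a clear message if it is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Expected the dataset at {path.resolve()}")
    return pd.read_csv(path)


# Usage inside the notebook (left commented because the file location is assumed):
# df = load_cancer_data()
```

Keeping the path in a single constant means only one line changes when the project folder moves.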


In [4]:
df.head()

| | diagnosisyear | geography_type | geography_code | geography_name | ndrs_main_group | ndrs_detailed_group | stage_at_diagnosis | imd_quintile | hormone_receptor | hormone_receptor_status | gender | age_at_diagnosis | count | Population | type_of_rate | rate | lower_confidence_interval | upper_confidence_interval | flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | Country | E92000001 | England | Anus | All Anus | All stages | All quintiles | | | Persons | All ages | 1057 | 53918686 | Age-standardised | 2.1 | 2.0 | 2.3 | |
| 1 | 2013 | Country | E92000001 | England | Anus | All Anus | All stages | All quintiles | | | Females | All ages | 668 | 27410034 | Age-standardised | 2.6 | 2.4 | 2.8 | |
| 2 | 2013 | Country | E92000001 | England | Anus | All Anus | All stages | All quintiles | | | Males | All ages | 389 | 26508652 | Age-standardised | 1.7 | 1.5 | 1.9 | |
| 3 | 2013 | Country | E92000001 | England | Anus | All Anus | All stages | 1 - most deprived | | | Persons | All ages | 232 | 10887630 | Age-standardised | 2.8 | 2.5 | 3.2 | |
| 4 | 2013 | Country | E92000001 | England | Anus | All Anus | All stages | 2 | | | Persons | All ages | 219 | 10952016 | Age-standardised | 2.5 | 2.2 | 2.8 | |


*✅ The first few rows show cancer diagnosis records, including details on year, geography, cancer type, stage, demographics, and statistical indicators such as counts and rates.*
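The preview shows that each row is one combination of breakdowns, with labels such as "All stages", "All quintiles", "Persons", and "All ages" marking totals. Totals and subgroups therefore sit side by side, and summing the whole `count` column would count the same cases more than once. The sketch below rebuilds the five preview rows and isolates the overall total. The `headline_rows` helper is a hypothetical name, and it assumes these "All ..." labels are used consistently throughout the file, which has not been checked here.

```python
import pandas as pd

# The five rows from the df.head() preview, restricted to the columns used here.
sample = pd.DataFrame({
    "diagnosisyear": [2013] * 5,
    "ndrs_main_group": ["Anus"] * 5,
    "stage_at_diagnosis": ["All stages"] * 5,
    "imd_quintile": ["All quintiles", "All quintiles", "All quintiles",
                     "1 - most deprived", "2"],
    "gender": ["Persons", "Females", "Males", "Persons", "Persons"],
    "age_at_diagnosis": ["All ages"] * 5,
    "count": [1057, 668, 389, 232, 219],
})


def headline_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that are totals across stage, deprivation, gender and age."""
    mask = (
        (frame["stage_at_diagnosis"] == "All stages")
        & (frame["imd_quintile"] == "All quintiles")
        & (frame["gender"] == "Persons")
        & (frame["age_at_diagnosis"] == "All ages")
    )
    return frame[mask]


headline = headline_rows(sample)
print(headline[["diagnosisyear", "ndrs_main_group", "count"]])

# Consistency check on the preview: the Females and Males rows add up to Persons.
by_gender = sample[sample["imd_quintile"] == "All quintiles"].set_index("gender")["count"]
print(by_gender["Females"] + by_gender["Males"] == by_gender["Persons"])
```

Filtering to one aggregation level like this before grouping by year or cancer type keeps later totals from double counting.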

In [5]:
df.shape

(35030, 19)

✅ *The dataset comprises 35,030 rows and 19 columns.*

In [7]:
df.columns

Index(['diagnosisyear', 'geography_type', 'geography_code', 'geography_name',
       'ndrs_main_group', 'ndrs_detailed_group', 'stage_at_diagnosis',
       'imd_quintile', 'hormone_receptor', 'hormone_receptor_status', 'gender',
       'age_at_diagnosis', 'count', 'Population', 'type_of_rate', 'rate',
       'lower_confidence_interval', 'upper_confidence_interval', 'flag'],
      dtype='object')

✅ *The dataset includes variables capturing diagnosis year, geography, cancer type, stage, demographics, incidence counts, population, and rate estimates.*

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35030 entries, 0 to 35029
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   diagnosisyear              35030 non-null  int64  
 1   geography_type             35030 non-null  object 
 2   geography_code             35030 non-null  object 
 3   geography_name             35030 non-null  object 
 4   ndrs_main_group            35030 non-null  object 
 5   ndrs_detailed_group        35030 non-null  object 
 6   stage_at_diagnosis         35030 non-null  object 
 7   imd_quintile               35030 non-null  object 
 8   hormone_receptor           690 non-null    object 
 9   hormone_receptor_status    690 non-null    object 
 10  gender                     35030 non-null  object 
 11  age_at_diagnosis           35030 non-null  object 
 12  count                      35030 non-null  int64  
 13  Population                 35030 non-null  int64  

✅ *The dataset contains 35,030 rows and 19 columns. Of these, 13 are text (`object`) columns, 3 are integers (diagnosisyear, count, Population), and 3 are floating-point numbers (rate, lower_confidence_interval, upper_confidence_interval).*

*Some fields are quite sparse:*

*1) The hormone receptor fields (hormone_receptor, hormone_receptor_status) have only 690 non-null values, so about 98% of rows are missing.*

*2) The flag column is missing in nearly 80% of rows.*

*The age_at_diagnosis column is stored as text, suggesting it represents age bands rather than exact ages.*
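The sparsity figures above can be computed directly rather than read off `df.info()`. The following is a minimal sketch: `column_summary` is a hypothetical helper, and the small `sample` frame is an illustrative stand-in built from the preview rows, not the full dataset. In the notebook the same function could be called on `df`.

```python
import pandas as pd


def column_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, non-null count and percentage missing for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
    })


# Illustrative stand-in for df, based on the first three preview rows.
sample = pd.DataFrame({
    "diagnosisyear": [2013, 2013, 2013],
    "gender": ["Persons", "Females", "Males"],
    "hormone_receptor": pd.Series([None, None, None], dtype="object"),
    "rate": [2.1, 2.6, 1.7],
})
print(column_summary(sample))

# Cross-check of the hormone receptor figure from the counts reported by df.info():
# 690 non-null values out of 35,030 rows.
hormone_pct_missing = round((1 - 690 / 35030) * 100, 1)
print(hormone_pct_missing)
```

Sorting the result by `pct_missing` is a quick way to list the sparsest columns first.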


In [9]:
df.describe()

| | diagnosisyear | count | Population | rate | lower_confidence_interval | upper_confidence_interval |
|---|---|---|---|---|---|---|
| count | 35030.0 | 35030.0 | 35030.0 | 29792.0 | 29792.0 | 29792.0 |
| mean | 2017.5 | 1497.867713 | 21248060.0 | 12.719364 | 12.260382 | 13.210795 |
| std | 2.872322 | 6697.796599 | 17570730.0 | 43.901143 | 43.332517 | 44.474403 |
| min | 2013.0 | 0.0 | 5201813.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2015.0 | 32.0 | 5650682.0 | 0.5 | 0.4 | 0.7 |
| 50% | 2017.5 | 166.0 | 11373090.0 | 1.9 | 1.7 | 2.2 |
| 75% | 2020.0 | 740.0 | 28307800.0 | 7.7 | 7.1 | 8.3 |
| max | 2022.0 | 230968.0 | 57112540.0 | 673.6 | 666.7 | 680.5 |


✅ *The dataset covers cancer diagnoses from 2013 to 2022.*

*Case counts range from 0 to 230,968, but most records are small: the median is 166 and three quarters of records have 740 cases or fewer.*

*Population figures range from about 5.2 million to 57.1 million, with a median of about 11.4 million.*

*The rate column shows how common cancer is in the population (cases per 100,000). It is populated for 29,792 of the 35,030 rows. Most rates are low, with a median of 1.9, but a few records are much higher, reaching 673.6.*

*The confidence interval columns give the likely range for each rate. The middle half of the lower bounds lies between 0.4 and 7.1 and the middle half of the upper bounds between 0.7 and 8.3, while the largest bounds are 666.7 and 680.5.*
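To make "cases per 100,000" concrete, the sketch below computes a crude rate from `count` and `Population` for the three England-level 2013 rows shown in the preview and places it next to the published rate. The `crude_rate_per_100k` helper is a hypothetical name. The two figures are not expected to match exactly, because the `type_of_rate` column marks the published values as age-standardised, meaning they are adjusted to a standard age structure, whereas the crude rate is a plain ratio.

```python
import pandas as pd

# Three England-level rows for 2013, copied from the df.head() preview.
rows = pd.DataFrame({
    "gender": ["Persons", "Females", "Males"],
    "count": [1057, 668, 389],
    "Population": [53918686, 27410034, 26508652],
    "rate": [2.1, 2.6, 1.7],  # published age-standardised rates
})


def crude_rate_per_100k(count, population):
    """Crude incidence: cases divided by population, scaled to 100,000 people."""
    return count / population * 100_000


rows["crude_rate"] = crude_rate_per_100k(rows["count"], rows["Population"]).round(2)
print(rows[["gender", "rate", "crude_rate"]])
```

For these rows the crude rates come out slightly below the age-standardised ones, which is consistent with the rate being expressed per 100,000 people.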