# <span style="color: rgb(204, 219, 219); font-weight: bold;">Project Summary</span>

![My Image](Images/istockphoto-485716154-612x612.jpg)

## <span style="color: rgb(119, 209, 209);font-weight: bold;">The Northampton Council Project:</span>


There has been a significant increase in sickness and hospital admissions in Northampton. In response to this concerning trend, the town council has decided to conduct diagnostic research to identify the underlying causes and develop effective mitigation strategies.  

### <span style="color: rgb(126, 199, 189); font-weight: bold;">Objectives:</span>


- What are the most common medical conditions in the town?  
- How can these medical conditions be mitigated across the town?  
  - Which gender experiences the most common medical conditions?  
  - Which age groups within this gender are most affected?  
  - Are certain blood types more prone to specific medical conditions?  
  - What factors contribute to **inconclusive** test results, and how can they be minimized in future diagnoses?  
  - Which medical conditions are classified as urgent or emergency admissions, so that the council can address the issues at their root?  
- Which medical conditions require the longest hospital stays?  
- How are different hospitals specialized in treating specific medical conditions, and how can this information be used to implement effective mitigation strategies?  
- What are the test results associated with these medical conditions to better understand their nature?  



---

### <span style="color: rgb(119, 209, 209); font-weight: bold;">Digging Into the Data</span>

<img src="Images/premium_photo-1711238064361-7843b5b9bd9c.avif" alt="My Image" width="500" height="400">


#### <span style="color: rgb(82, 206, 206); font-weight: bold;">Data Preprocessing</span>


In [140]:
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from ipywidgets import interact
from ipyleaflet import Map

In [141]:
df = pd.read_csv(r'Data\healthcare_dataset.csv')

In [142]:
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal


In [143]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55500 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                55500 non-null  object 
 1   Age                 55500 non-null  int64  
 2   Gender              55500 non-null  object 
 3   Blood Type          55500 non-null  object 
 4   Medical Condition   55500 non-null  object 
 5   Date of Admission   55500 non-null  object 
 6   Doctor              55500 non-null  object 
 7   Hospital            55500 non-null  object 
 8   Insurance Provider  55500 non-null  object 
 9   Billing Amount      55500 non-null  float64
 10  Room Number         55500 non-null  int64  
 11  Admission Type      55500 non-null  object 
 12  Discharge Date      55500 non-null  object 
 13  Medication          55500 non-null  object 
 14  Test Results        55500 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 6.4

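From the above:

- The dataset has no null values: every column reports 55500 non-null entries.
- The dataset contains 55500 rows.
- The dataset comprises 15 columns.
- The column datatypes are already assigned: 1 `float64`, 2 `int64`, and 12 `object` columns. The two date columns are stored as `object` (strings).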
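The observations above can also be confirmed programmatically rather than read off the `df.info()` printout. The following is a minimal sketch, assuming a small stand-in frame named `sample` (a hypothetical name) built from three rows shown in `df.head()`; it counts nulls, reads the shape, and parses the date columns, which load as strings.

```python
import pandas as pd

# Small stand-in for the healthcare data, using three rows from df.head().
sample = pd.DataFrame({
    "Age": [30, 62, 76],
    "Date of Admission": ["2024-01-31", "2019-08-20", "2022-09-22"],
    "Discharge Date": ["2024-02-02", "2019-08-26", "2022-10-07"],
})

# Count missing values per column and read the row/column totals directly.
null_counts = sample.isnull().sum()
n_rows, n_cols = sample.shape
print(null_counts.to_dict(), n_rows, n_cols)

# The date columns load as object (string); parse them before date arithmetic.
date_cols = ["Date of Admission", "Discharge Date"]
parsed = sample.copy()
for col in date_cols:
    parsed[col] = pd.to_datetime(parsed[col])
print(parsed.dtypes)
```

On the full dataset the same calls would be applied to `df` instead of `sample`.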

In [144]:
df.columns

Index(['Name', 'Age', 'Gender', 'Blood Type', 'Medical Condition',
       'Date of Admission', 'Doctor', 'Hospital', 'Insurance Provider',
       'Billing Amount', 'Room Number', 'Admission Type', 'Discharge Date',
       'Medication', 'Test Results'],
      dtype='object')

Drop Columns...

In [146]:
col_to_drop = ['Name', 'Doctor','Room Number']

df = df.drop(col_to_drop, axis=1)

Check for Unique Values

In [147]:
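def unique_values(df, columns):
    """Return a dict mapping each requested column to its list of unique values."""
    result = {}  # Avoid shadowing the function name with a local variable
    for col in columns:
        if col in df.columns:
            result[col] = df[col].unique().tolist()
        else:
            # Flag requested columns that are missing from the DataFrame
            result[col] = "Column not found"
    return result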

cat_columns = ['Gender', 'Hospital', 'Blood Type','Medical Condition', 'Admission Type', 'Medication', 'Test Results', 'Insurance Provider']
cat_columns_1 = ['Gender', 'Hospital', 'Blood Type']
cat_columns_2 = ['Medical Condition', 'Admission Type', 'Medication', 'Test Results']

unique = unique_values(df, cat_columns_1)

print(f'The unique values for each columns are: {unique}')

The unique values for each columns are: {'Gender': ['Male', 'Female'], 'Hospital': ['Sons and Miller', 'Kim Inc', 'Cook PLC', 'Hernandez Rogers and Vang,', 'White-White', 'Nunez-Humphrey', 'Group Middleton', 'Powell Robinson and Valdez,', 'Sons Rich and', 'Padilla-Walker', 'Schaefer-Porter', 'Lyons-Blair', 'Powers Miller, and Flores', 'Rivera-Gutierrez', 'Morris-Arellano', 'Cline-Williams', 'Cervantes-Wells', 'Torres, and Harrison Jones', 'Houston PLC', 'Hammond Ltd', 'Jones LLC', 'Williams-Davis', 'Clark-Mayo', 'and Sons Smith', 'Wilson Group', 'Garner-Bowman', 'Brown, and Jones Weaver', 'Serrano-Dixon', 'Gardner-Miller', 'Guerrero-Boone', 'Hart Ltd', 'Cruz-Santiago', 'Group Duncan', 'Lopez-Phillips', 'Poole Inc', 'Sons and Cox', 'LLC Martin', 'Espinoza-Stone', 'and Garcia Morris Cunningham,', 'Walton-Meyer', 'PLC Young', 'Meadows Group', 'and Howell Brooks, Rogers', 'and Mcclure White Boone,', 'Gates Brown, and Stuart', 'Group Armstrong', 'Ltd Schwartz', 'Nelson-Phillips', 'Knight an
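Printing every unique value is hard to read for a column like `Hospital`, which has a very large number of distinct entries. A more compact alternative is to report how many distinct values each column has, plus a short preview. This is only a sketch: `summarize_cardinality`, its `preview` parameter, and the `sample` frame are hypothetical names introduced for illustration, and the rows are the five shown in `df.head()`.

```python
import pandas as pd

def summarize_cardinality(df, columns, preview=3):
    """Return distinct-value counts plus a short preview for each column."""
    rows = []
    for col in columns:
        if col not in df.columns:
            continue  # Skip requested columns that are not in the frame
        values = df[col].unique().tolist()
        rows.append({
            "column": col,
            "n_unique": len(values),
            "preview": values[:preview],
        })
    return pd.DataFrame(rows)

# Stand-in frame using the five rows shown in df.head().
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Hospital": ["Sons and Miller", "Kim Inc", "Cook PLC",
                 "Hernandez Rogers and Vang,", "White-White"],
    "Blood Type": ["B-", "A+", "A-", "O+", "AB+"],
})

summary = summarize_cardinality(sample, ["Gender", "Hospital", "Blood Type"])
print(summary)
```

A high `n_unique` relative to the row count is a hint that a column may behave more like an identifier than a category.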

In [148]:
cat_columns_1

['Gender', 'Hospital', 'Blood Type']

In [149]:
cat_columns_2

['Medical Condition', 'Admission Type', 'Medication', 'Test Results']

In [150]:
from sklearn.preprocessing import LabelEncoder

def label_encode(df, columns):
    le_dict = {}  # Store mappings for reference
    for col in columns:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))
        
        # Create a dictionary that maps each original label to its encoded number
        le_dict[col] = dict(zip(le.classes_, range(len(le.classes_))))  # Label -> Number mapping
    
    return df, le_dict

# Apply label encoding
df, label_mappings = label_encode(df, cat_columns)

# Print the mappings
for col, mapping in label_mappings.items():
    print(f"Label Mapping for {col}: {mapping}")


Label Mapping for Gender: {'Female': 0, 'Male': 1}
Label Mapping for Hospital: {'Abbott Inc': 0, 'Abbott Ltd': 1, 'Abbott Moore and Williams,': 2, 'Abbott and Thompson, Sullivan': 3, 'Abbott, Peters and Hoffman': 4, 'Abbott, Vazquez Bautista and': 5, 'Abbott-Castillo': 6, 'Abbott-Coleman': 7, 'Abbott-Ferrell': 8, 'Abbott-Hill': 9, 'Abbott-Jones': 10, 'Abbott-Martin': 11, 'Abbott-Rios': 12, 'Abbott-Wilson': 13, 'Acevedo Group': 14, 'Acevedo Holmes and Rangel,': 15, 'Acevedo LLC': 16, 'Acevedo Ltd': 17, 'Acevedo PLC': 18, 'Acevedo Phillips Steele, and': 19, 'Acevedo and Ellis, Snyder': 20, 'Acevedo and Hart, Hernandez': 21, 'Acevedo and Howard Burke,': 22, 'Acevedo and Larson Andrews,': 23, 'Acevedo and Lewis Barker,': 24, 'Acevedo, Jordan and Diaz': 25, 'Acevedo, Martin and Price': 26, 'Acevedo, Riddle Payne and': 27, 'Acevedo-Diaz': 28, 'Acevedo-Goodwin': 29, 'Acevedo-Henderson': 30, 'Acevedo-Lawson': 31, 'Acosta Group': 32, 'Acosta Inc': 33, 'Acosta LLC': 34, 'Acosta Ltd': 35, 'Acosta
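The mappings printed above go from label to code, so reading the encoded frame back requires the reverse lookup. A minimal sketch, assuming the `Gender` mapping shown in the output; `gender_decoder` is a hypothetical name, and the encoded values are the `Gender` codes of the first five rows.

```python
# Label -> code mapping, in the same shape that label_encode returns.
gender_mapping = {"Female": 0, "Male": 1}

# Invert it to get code -> label for reading encoded columns back.
gender_decoder = {code: label for label, code in gender_mapping.items()}

# Gender codes of the first five encoded rows.
encoded = [1, 1, 0, 0, 0]
decoded = [gender_decoder[code] for code in encoded]
print(decoded)  # ['Male', 'Male', 'Female', 'Female', 'Female']
```

For a whole column, the same dictionary can be passed to `Series.map`, for example `df["Gender"].map(gender_decoder)`.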

In [151]:
df

Unnamed: 0,Age,Gender,Blood Type,Medical Condition,Date of Admission,Hospital,Insurance Provider,Billing Amount,Admission Type,Discharge Date,Medication,Test Results
0,30,1,5,2,2024-01-31,29933,1,18856.281306,2,2024-02-02,3,2
1,62,1,0,5,2019-08-20,16012,3,33643.327287,1,2019-08-26,1,1
2,76,0,1,5,2022-09-22,5473,0,27955.096079,1,2022-10-07,0,2
3,28,0,6,3,2020-11-18,12317,3,37909.782410,0,2020-12-18,1,0
4,43,0,2,2,2022-09-19,33598,0,14238.317814,2,2022-10-09,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...
55495,42,0,6,1,2020-08-16,15553,1,2650.714952,0,2020-09-15,4,0
55496,61,0,3,5,2020-01-23,31722,2,31457.797307,0,2020-02-01,0,2
55497,38,0,4,4,2020-07-13,37408,4,27620.764717,2,2020-08-10,1,0
55498,43,1,7,0,2019-05-25,14329,3,32451.092358,0,2019-05-31,1,0


In [152]:
# @interact(hue=df['RiskLevel'])
# def plot(hue):
#     _=sns.countplot(df, hue=hue)



Pair plot of the categorical columns, to assess whether the categories are balanced or not

In [153]:
# Create subplots
# fig, axes = plt.subplots(nrows=len(cat_columns), ncols=1, figsize=(8, 6))

# Iterate through selected columns and create subplots
# for i, col in enumerate(cat_columns):
#     axes[i].plot(df['Test Results'], df[col], marker='o', linestyle='-', label=col)
#     axes[i].set_title(col)
#     axes[i].set_xlabel('Test Results')
#     axes[i].set_ylabel(col)
#     axes[i].legend()

# # Adjust layout
# plt.tight_layout()
# plt.show()

We are dealing with an imbalanced dataset.

- This could simply be a case where, of the 1014 women, the _majority_ had a low risk level.
    - This interpretation rests on the fact that the data was collected in 2024 within Northampton. Knowing that there are other hospitals (research shows 13 hospitals in general, private and public), this data represents 1/12 hospitals, which is a fair representation of the women in Northampton, so it may be reasonable to generalize that most women there have low risk levels during maternity.
OR
- The representation of clients could also mean that the low-risk individuals all came to the Invincible Hospital, while the high-risk individuals went to the other 11 hospitals. This may be location-based: Northampton is big, hospitals are scattered around, and perhaps the Invincible Hospital is located where women's natural risk level is low (probably rural, serene, or quiet areas, with less stress or traffic that could have affected their test results).
OR
- Time factor: if the majority of the women gave birth during one period, or their test results were mostly collected in December (a month before their due delivery), a period when most women get to relax (perhaps a break from work, summer, a festive period, or travel with their partners), this is another causal factor for why most women's risk levels are low.
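Whether a column is imbalanced can be checked numerically before looking for explanations. The sketch below uses `value_counts` to get the count and percentage share of each class. The `class_balance` helper, the `risk` labels, and the max/min ratio are all assumptions made for illustration; the labels are made up and not taken from any dataset.

```python
import pandas as pd

def class_balance(series):
    """Return counts and percentage share for each category in a column."""
    counts = series.value_counts()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": share})

# Made-up labels purely for illustration.
labels = pd.Series(["low"] * 7 + ["mid"] * 2 + ["high"] * 1, name="risk")
balance = class_balance(labels)
print(balance)

# Ratio of the largest class to the smallest; a large ratio suggests imbalance.
# Any cut-off applied to this ratio is a judgment call, not a fixed rule.
imbalance_ratio = balance["count"].max() / balance["count"].min()
print(f"max/min class ratio: {imbalance_ratio:.1f}")
```

The same helper could be applied to any categorical column, for example `class_balance(df["Test Results"])`.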

----

Question 1


- What are the most common medical conditions in the town?  

In [154]:
df

Unnamed: 0,Age,Gender,Blood Type,Medical Condition,Date of Admission,Hospital,Insurance Provider,Billing Amount,Admission Type,Discharge Date,Medication,Test Results
0,30,1,5,2,2024-01-31,29933,1,18856.281306,2,2024-02-02,3,2
1,62,1,0,5,2019-08-20,16012,3,33643.327287,1,2019-08-26,1,1
2,76,0,1,5,2022-09-22,5473,0,27955.096079,1,2022-10-07,0,2
3,28,0,6,3,2020-11-18,12317,3,37909.782410,0,2020-12-18,1,0
4,43,0,2,2,2022-09-19,33598,0,14238.317814,2,2022-10-09,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...
55495,42,0,6,1,2020-08-16,15553,1,2650.714952,0,2020-09-15,4,0
55496,61,0,3,5,2020-01-23,31722,2,31457.797307,0,2020-02-01,0,2
55497,38,0,4,4,2020-07-13,37408,4,27620.764717,2,2020-08-10,1,0
55498,43,1,7,0,2019-05-25,14329,3,32451.092358,0,2019-05-31,1,0
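One way to approach Question 1 is to count how often each condition appears and translate the codes back to names. This is a sketch under stated assumptions: `condition_mapping` is a partial mapping read off the five rows shown before and after encoding (the complete one is in `label_mappings["Medical Condition"]`), and `encoded` holds only those five rows, so the counts are not results for the town.

```python
import pandas as pd

# Partial label -> code mapping, inferred from the five rows displayed
# before and after encoding. It is incomplete by design.
condition_mapping = {"Cancer": 2, "Diabetes": 3, "Obesity": 5}
decoder = {code: label for label, code in condition_mapping.items()}

# Encoded "Medical Condition" values of the first five rows of df.
encoded = pd.Series([2, 5, 5, 3, 2], name="Medical Condition")

# Decode to readable names, then count each condition (most frequent first).
condition_counts = encoded.map(decoder).value_counts()
print(condition_counts)
```

On the full frame, `encoded` would be replaced by `df["Medical Condition"]` and `condition_mapping` by the full mapping, after which `condition_counts` could be passed to a bar plot.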


_Image_

Question 2

Question 3


- How can these medical conditions be mitigated across the town?  
  - Which gender experiences the most common medical conditions?  
  - Which age groups within this gender are most affected?  
  - Are certain blood types more prone to specific medical conditions?  
  - What factors contribute to **inconclusive** test results, and how can they be minimized in future diagnoses?  
  - Which medical conditions are classified as urgent or emergency admissions, so that the council can address the issues at their root?  
- Which medical conditions require the longest hospital stays?  
- How are different hospitals specialized in treating specific medical conditions, and how can this information be used to implement effective mitigation strategies?  
- What are the test results associated with these medical conditions to better understand their nature? 
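For the question on hospital stays, the dataset has no duration column, but one can be derived from `Discharge Date` minus `Date of Admission`. The sketch below assumes a new column named `Length of Stay` (a hypothetical name) and uses only the five rows shown in `df.head()`, so the averages it prints illustrate the method rather than answer the question.

```python
import pandas as pd

# First five rows as displayed by df.head(), before label encoding.
sample = pd.DataFrame({
    "Medical Condition": ["Cancer", "Obesity", "Obesity", "Diabetes", "Cancer"],
    "Date of Admission": ["2024-01-31", "2019-08-20", "2022-09-22",
                          "2020-11-18", "2022-09-19"],
    "Discharge Date": ["2024-02-02", "2019-08-26", "2022-10-07",
                       "2020-12-18", "2022-10-09"],
})

# Parse the string dates, then take the difference in whole days.
admitted = pd.to_datetime(sample["Date of Admission"])
discharged = pd.to_datetime(sample["Discharge Date"])
sample["Length of Stay"] = (discharged - admitted).dt.days

# Average stay per condition, longest first.
mean_stay = (
    sample.groupby("Medical Condition")["Length of Stay"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_stay)
```

Because the date columns are not label-encoded in the notebook, the same subtraction should work on the encoded `df` as well, with the condition codes decoded afterwards for readability.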