<div style='background-color:#FF6B35; padding: 15px; border-radius: 10px; margin-bottom: 20px;'>
<h1 style='color:#FFFFFF; text-align:center; font-family: Arial, sans-serif; margin: 0;'>📊 Stroke Prediction Analysis</h1>
<h2 style='color:#FFE66D; text-align:center; font-family: Arial, sans-serif; margin: 5px 0 0 0;'>Data Collection & Initial Exploration</h2>
</div>

<div style='background-color:#FFF8F0; padding: 15px; border-radius: 8px; border-left: 4px solid #FF6B35;'>
<h3 style='color:#FF6B35; margin-top: 0;'>🎯 Project Objectives</h3>
<ul style='color:#333; line-height: 1.6;'>
<li><strong>Data Acquisition:</strong> Import and examine the stroke prediction dataset from Kaggle</li>
<li><strong>Initial Assessment:</strong> Understand data structure, quality, and clinical relevance</li>
<li><strong>Healthcare Context:</strong> Establish the medical significance of each feature</li>
<li><strong>Data Quality:</strong> Identify missing values, outliers, and data integrity issues</li>
<li><strong>Feature Understanding:</strong> Map data features to clinical risk factors</li>
</ul>
</div>

# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle stroke prediction dataset and save as raw data
* Perform initial data inspection and quality assessment
* Document data source and structure

## Inputs

* Stroke prediction dataset from Kaggle: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
* Dataset contains 5,110 patient records and 12 columns (an `id`, 10 predictor features, and the `stroke` target)

## Outputs

* Raw stroke dataset saved to jupyter_notebooks/outputs/datasets/stroke_raw.csv
* Initial data quality report
* Dataset structure documentation

## Additional Comments

* This dataset will be used for healthcare analytics to predict stroke risk
* Data contains patient demographics, health conditions, and lifestyle factors
* Target variable is binary (stroke occurrence: 0 or 1)

---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed before running them in the editor

The working directory needs to move from the notebook folder to its parent folder
* os.getcwd() returns the current directory

In [3]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\midas\\Documents\\2505-WMCA-Data-Git101\\Stroke-prediction\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() returns the parent directory of a path
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\midas\\Documents\\2505-WMCA-Data-Git101\\Stroke-prediction'
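One caveat with the cells above: re-running the `os.chdir` cell moves up one more level each time. A minimal sketch of a re-runnable variant follows, assuming the notebooks folder is named `jupyter_notebooks` as in the path shown above. The helper name `resolve_project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path

# Assumed folder name, taken from the path printed above.
NOTEBOOK_DIR_NAME = "jupyter_notebooks"


def resolve_project_root(cwd: Path, notebook_dir_name: str = NOTEBOOK_DIR_NAME) -> Path:
    """Return the parent of cwd when cwd is the notebooks folder, otherwise cwd itself."""
    return cwd.parent if cwd.name == notebook_dir_name else cwd


if __name__ == "__main__":
    # Safe to run repeatedly: once we are at the project root, nothing changes.
    root = resolve_project_root(Path.cwd())
    os.chdir(root)
    print(f"Working directory: {root}")
```

Because the function only steps up when the current folder matches the expected name, running the cell twice leaves the working directory at the project root.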

# Load Data

Load the stroke prediction dataset and perform initial inspection

In [14]:
# Check environment and install packages
import sys
import subprocess

print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")

# Try to install packages using different methods
try:
    import pandas as pd
    print("✓ pandas already installed")
except ImportError:
    print("Installing pandas...")
    subprocess.check_call([sys.executable, "-m", "ensurepip", "--default-pip"])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pandas"])
    import pandas as pd
    print("✓ pandas installed successfully")

try:
    import numpy as np
    print("✓ numpy already installed")
except ImportError:
    print("Installing numpy...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy"])
    import numpy as np
    print("✓ numpy installed successfully")

print("All packages ready!")

Python executable: c:\Users\midas\Documents\2505-WMCA-Data-Git101\Stroke-prediction\.venv\Scripts\python.exe
Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
Installing pandas...
✓ pandas installed successfully
✓ numpy already installed
All packages ready!
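The install cell above repeats the same try/except pattern for each package. As an alternative sketch (not the approach used in this notebook), the standard library's `importlib.util.find_spec` can report which packages are missing without importing them. The helper name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["pandas", "numpy"]
missing = missing_packages(required)
if missing:
    # Only print the command; installing is left to the user.
    print("Install with: python -m pip install " + " ".join(missing))
else:
    print("All packages ready!")
```

Printing the install command instead of running it keeps the notebook free of side effects on the environment, which may or may not be what you want.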


In [22]:
# Load and explore the stroke dataset
import pandas as pd
import numpy as np

# Load the stroke dataset with correct path
df = pd.read_csv("jupyter_notebooks/inputs/datasets/Stroke-data.csv")

print(f"✓ Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nDataset info:")
print(f"- {df.shape[0]} rows")
print(f"- {df.shape[1]} columns")
print(f"- Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

print("\nFirst 5 rows:")
df.head()

✓ Dataset loaded successfully!
Dataset shape: (5110, 12)
Columns: ['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke']

Dataset info:
- 5110 rows
- 12 columns
- Memory usage: 1657.7 KB

First 5 rows:


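|   | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|----|--------|-----|--------------|---------------|--------------|-----------|----------------|-------------------|-----|----------------|--------|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |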


In [21]:
# Test that core data science packages work (without matplotlib for now)
import pandas as pd
import numpy as np
import os

# Clear matplotlib backend environment variable if it exists
if 'MPLBACKEND' in os.environ:
    del os.environ['MPLBACKEND']

print("✅ Core Data Science Environment Ready!")
print("\n📊 Available packages:")
print("- pandas: Data manipulation and analysis")
print("- numpy: Numerical computing") 

print(f"\n📈 Dataset Summary:")
print(f"- Shape: {df.shape}")
print(f"- Stroke cases: {df['stroke'].sum()} ({df['stroke'].mean()*100:.1f}%)")
print(f"- Missing values: {df.isnull().sum().sum()}")

print("\n📋 Dataset columns:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

print(f"\n📊 Basic statistics:")
print(f"- Average age: {df['age'].mean():.1f} years")
print(f"- Age range: {df['age'].min():.0f} - {df['age'].max():.0f} years")
print(f"- Average glucose: {df['avg_glucose_level'].mean():.1f}")

print("\n🎯 Jupyter notebook environment is ready for Code Institute assessment!")
print("📝 Note: Core pandas and numpy functionality working perfectly.")
print("   Visualization packages can be imported as needed in subsequent cells.")

✅ Core Data Science Environment Ready!

📊 Available packages:
- pandas: Data manipulation and analysis
- numpy: Numerical computing

📈 Dataset Summary:
- Shape: (5110, 12)
- Stroke cases: 249 (4.9%)
- Missing values: 201

📋 Dataset columns:
 1. id
 2. gender
 3. age
 4. hypertension
 5. heart_disease
 6. ever_married
 7. work_type
 8. Residence_type
 9. avg_glucose_level
10. bmi
11. smoking_status
12. stroke

📊 Basic statistics:
- Average age: 43.2 years
- Age range: 0 - 82 years
- Average glucose: 106.1

🎯 Jupyter notebook environment is ready for Code Institute assessment!
📝 Note: Core pandas and numpy functionality working perfectly.
   Visualization packages can be imported as needed in subsequent cells.


## Data Quality Assessment

In [23]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

Missing values per column:
id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

Total missing values: 201
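The 201 missing `bmi` values are about 3.9% of the 5,110 rows. `isnull()` does not count placeholder strings, and the `smoking_status` column uses the label `Unknown` (1,544 rows, about 30.2%, per the value counts in the Feature Overview section). A minimal sketch of a report that treats such labels as missing is below. The helper `missing_report` and the toy `sample` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_report(df, sentinels=None):
    """Count NaN values per column, also treating listed sentinel labels as missing."""
    sentinels = sentinels or {}
    masked = df.copy()
    for col, values in sentinels.items():
        # Replace sentinel labels such as "Unknown" with NaN before counting.
        masked[col] = masked[col].mask(masked[col].isin(values))
    counts = masked.isna().sum()
    report = pd.DataFrame(
        {"missing": counts, "percent": (counts / len(masked) * 100).round(1)}
    )
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only; the real notebook would pass df instead.
sample = pd.DataFrame(
    {
        "bmi": [36.6, None, 32.5, 34.4],
        "smoking_status": ["formerly smoked", "never smoked", "Unknown", "Unknown"],
    }
)
print(missing_report(sample, {"smoking_status": ["Unknown"]}))
```

On the real data the call would be `missing_report(df, {"smoking_status": ["Unknown"]})`. Whether `Unknown` should be treated as missing or kept as its own category is a modelling decision for the cleaning notebook.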


In [24]:
# Data types and basic info
print("Dataset Info:")
df.info()

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
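The 479.2+ KB reported here differs from the 1657.7 KB printed when the data was loaded because the two figures measure different things. `df.info()` reports a shallow estimate (roughly 5,110 rows × 12 columns × 8 bytes, plus the index), and the `+` signals that the contents of `object` columns are not included. The earlier figure used `memory_usage(deep=True)`, which also counts the strings themselves. A small sketch on toy data (the `sample` frame is illustrative only):

```python
import pandas as pd

# Toy frame with one numeric and one text column, for illustration only.
sample = pd.DataFrame(
    {
        "age": [67.0, 61.0, 80.0],
        "work_type": ["Private", "Self-employed", "Private"],
    }
)

# Shallow: counts only the fixed-size slots held by each column.
shallow_bytes = int(sample.memory_usage(deep=False).sum())
# Deep: additionally inspects the actual string contents.
deep_bytes = int(sample.memory_usage(deep=True).sum())

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
```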


In [25]:
# Descriptive statistics for numerical columns
print("Descriptive Statistics:")
df.describe()

Descriptive Statistics:


|       | id | age | hypertension | heart_disease | avg_glucose_level | bmi | stroke |
|-------|----|-----|--------------|---------------|-------------------|-----|--------|
| count | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 5110.0 | 4909.0 | 5110.0 |
| mean | 36517.829354 | 43.226614 | 0.097456 | 0.054012 | 106.147677 | 28.893237 | 0.048728 |
| std | 21161.721625 | 22.612647 | 0.296607 | 0.226063 | 45.28356 | 7.854067 | 0.21532 |
| min | 67.0 | 0.08 | 0.0 | 0.0 | 55.12 | 10.3 | 0.0 |
| 25% | 17741.25 | 25.0 | 0.0 | 0.0 | 77.245 | 23.5 | 0.0 |
| 50% | 36932.0 | 45.0 | 0.0 | 0.0 | 91.885 | 28.1 | 0.0 |
| 75% | 54682.0 | 61.0 | 0.0 | 0.0 | 114.09 | 33.1 | 0.0 |
| max | 72940.0 | 82.0 | 1.0 | 1.0 | 271.74 | 97.6 | 1.0 |


## Target Variable Analysis

In [26]:
# Check target variable distribution
print("Stroke distribution:")
print(df['stroke'].value_counts())
print(f"\nStroke rate: {df['stroke'].mean():.3f}")
print(f"Class imbalance ratio: {df['stroke'].value_counts()[0] / df['stroke'].value_counts()[1]:.1f}:1")

Stroke distribution:
stroke
0    4861
1     249
Name: count, dtype: int64

Stroke rate: 0.049
Class imbalance ratio: 19.5:1
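With 4,861 negatives and 249 positives, a model that always predicts "no stroke" would already be about 95.1% accurate, so accuracy alone is a weak metric here. One common way to compensate during training is to weight classes inversely to their frequency. The sketch below uses the formula `n_samples / (n_classes * count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. The helper name is hypothetical and no modelling choice has been made in this notebook.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts taken from the value_counts() output above.
counts = {0: 4861, 1: 249}
weights = balanced_class_weights(counts)

# Accuracy of always predicting the majority class.
majority_baseline = max(counts.values()) / sum(counts.values())

print(f"Class weights: {weights}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```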


## Feature Overview

In [27]:
# Check unique values for categorical features
categorical_features = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

for feature in categorical_features:
    print(f"\n{feature}:")
    print(df[feature].value_counts())


gender:
gender
Female    2994
Male      2115
Other        1
Name: count, dtype: int64

ever_married:
ever_married
Yes    3353
No     1757
Name: count, dtype: int64

work_type:
work_type
Private          2925
Self-employed     819
children          687
Govt_job          657
Never_worked       22
Name: count, dtype: int64

Residence_type:
Residence_type
Urban    2596
Rural    2514
Name: count, dtype: int64

smoking_status:
smoking_status
never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: count, dtype: int64
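Two levels in the output above are very small: `Other` in `gender` (1 row) and `Never_worked` in `work_type` (22 rows). Levels this rare can cause trouble later, for example when a train/test split leaves one side without them. The sketch below flags levels under a share threshold. The 1% cut-off and the helper name `rare_levels` are assumptions chosen for illustration.

```python
def rare_levels(level_counts, min_share=0.01):
    """Return levels whose share of all rows is below min_share, with that share."""
    total = sum(level_counts.values())
    return {
        level: count / total
        for level, count in level_counts.items()
        if count / total < min_share
    }


# Counts copied from the value_counts() output above.
gender_counts = {"Female": 2994, "Male": 2115, "Other": 1}
work_type_counts = {
    "Private": 2925,
    "Self-employed": 819,
    "children": 687,
    "Govt_job": 657,
    "Never_worked": 22,
}

for name, level_counts in [("gender", gender_counts), ("work_type", work_type_counts)]:
    for level, share in rare_levels(level_counts).items():
        print(f"{name}: '{level}' covers {share:.2%} of rows")
```

With the DataFrame in memory, the same counts can be obtained from `df[feature].value_counts().to_dict()`.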


## Data Validation

In [28]:
# Check for data anomalies
print("Data validation checks:")
print(f"Age range: {df['age'].min()} - {df['age'].max()}")
print(f"Glucose level range: {df['avg_glucose_level'].min():.1f} - {df['avg_glucose_level'].max():.1f}")
print(f"BMI range: {df['bmi'].min():.1f} - {df['bmi'].max():.1f}")
print(f"Hypertension values: {df['hypertension'].unique()}")
print(f"Heart disease values: {df['heart_disease'].unique()}")
print(f"Stroke values: {df['stroke'].unique()}")

Data validation checks:
Age range: 0.08 - 82.0
Glucose level range: 55.1 - 271.7
BMI range: 10.3 - 97.6
Hypertension values: [0 1]
Heart disease values: [1 0]
Stroke values: [1 0]
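The ranges above show two values worth a closer look in the cleaning step: a minimum age of 0.08 (an infant about one month old) and a maximum BMI of 97.6. A minimal sketch for counting values outside expected bounds is below. The bounds in `PLAUSIBLE_RANGES` are illustrative assumptions, not clinical thresholds, and the toy `sample` frame only reuses the minimum, maximum, and median values reported earlier.

```python
import pandas as pd

# Illustrative bounds only; replace with clinically justified limits.
PLAUSIBLE_RANGES = {
    "age": (0, 120),
    "avg_glucose_level": (40, 400),
    "bmi": (10, 70),
}


def out_of_range_counts(df, ranges):
    """Count non-missing values per column that fall outside [low, high]."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = df[col].dropna()  # missing values are handled separately
        counts[col] = int((~values.between(low, high)).sum())
    return counts


# Toy data for illustration; the real notebook would pass df instead.
sample = pd.DataFrame(
    {
        "age": [0.08, 82.0, 45.0],
        "avg_glucose_level": [55.12, 271.74, 91.885],
        "bmi": [10.3, 97.6, None],
    }
)
print(out_of_range_counts(sample, PLAUSIBLE_RANGES))
```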


---

# Push files to Repo

Save the collected data for use in subsequent analysis notebooks

In [32]:
# Create outputs directory structure and save the raw dataset
import os

# Create the outputs/datasets directory if it doesn't exist
os.makedirs("jupyter_notebooks/outputs/datasets", exist_ok=True)

# Save the raw dataset for future use in the outputs folder
df.to_csv("jupyter_notebooks/outputs/datasets/stroke_raw.csv", index=False)

print("✅ Raw dataset saved successfully!")
print(f"📍 Location: jupyter_notebooks/outputs/datasets/stroke_raw.csv")
print(f"📊 Dataset contains {df.shape[0]} patients with {df.shape[1]} features")
print(f"🎯 Stroke cases: {df['stroke'].sum()} ({df['stroke'].mean()*100:.1f}%)")
print(f"💾 File size: {os.path.getsize('jupyter_notebooks/outputs/datasets/stroke_raw.csv') / 1024:.1f} KB")

# Verify the file was created
if os.path.exists("jupyter_notebooks/outputs/datasets/stroke_raw.csv"):
    print("\n🎉 Raw data collection completed successfully!")

✅ Raw dataset saved successfully!
📍 Location: jupyter_notebooks/outputs/datasets/stroke_raw.csv
📊 Dataset contains 5110 patients with 12 features
🎯 Stroke cases: 249 (4.9%)
💾 File size: 324.8 KB

🎉 Raw data collection completed successfully!
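The cell above confirms that the file exists. If later notebooks need to confirm that the raw file has not changed, one option is to record a checksum alongside it. The sketch below uses the standard library's `hashlib`. The helper name `sha256_of_file` and the temporary demo file are illustrative and not part of the original notebook.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_bytes(b"id,stroke\n9046,1\n")
    demo_digest = sha256_of_file(demo_path)

print(f"SHA-256: {demo_digest}")
```

In this notebook the call would be `sha256_of_file("jupyter_notebooks/outputs/datasets/stroke_raw.csv")`, with the resulting digest noted in the summary.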


## ✅ Data Collection Summary

**Raw Dataset Successfully Saved:**
- **Location:** `jupyter_notebooks/outputs/datasets/stroke_raw.csv`
- **Records:** 5,110 patient observations
- **Features:** 12 variables (demographics, health conditions, lifestyle factors)
- **Target Variable:** Stroke occurrence (binary: 0/1)
- **File Size:** ~325 KB

**Key Dataset Characteristics:**
- **Stroke Rate:** 4.9% (249 positive cases)
- **Missing Values:** 201 total, all in the `bmi` column
- **Age Range:** 0.08-82 years
- **Class Imbalance:** High (95.1% no stroke, 4.9% stroke)

**Next Steps:**
1. **Data Inspection:** Detailed exploratory data analysis
2. **Data Cleaning:** Handle missing values and outliers
3. **Feature Engineering:** Create new predictive features
4. **Data Preprocessing:** Prepare for machine learning models
5. **Model Training:** Build stroke prediction models

The raw dataset is now ready for subsequent analysis notebooks in your Code Institute assessment sequence.