### Uploading Files in Google Colab

To work with local files in Google Colab, we use the `files.upload()` function from the `google.colab` package. It opens a file picker in the notebook, saves each selected file into the Colab runtime's working directory, and returns a dictionary that maps each filename to the file's contents as bytes.


In [1]:
from google.colab import files
uploaded = files.upload()

Saving healthcare-dataset-stroke-data.csv to healthcare-dataset-stroke-data.csv
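Because `files.upload()` returns the file contents as bytes, the data can also be read straight from memory. The sketch below is only an illustration: `google.colab` is importable only inside Colab, so the `uploaded` dictionary is simulated here with a tiny made-up CSV rather than the real file.

```python
import io

import pandas as pd

# Stand-in for the dict returned by files.upload() in Colab:
# {filename: file contents as bytes}. The rows below are hypothetical.
uploaded = {
    "healthcare-dataset-stroke-data.csv": b"id,age,bmi\n1,67.0,36.6\n2,61.0,28.1\n"
}

# Read every uploaded file from memory instead of from disk
frames = {
    name: pd.read_csv(io.BytesIO(content)) for name, content in uploaded.items()
}

preview = frames["healthcare-dataset-stroke-data.csv"]
print(preview)
```

Reading from disk with `pd.read_csv('<filename>')`, as done later in this notebook, works equally well because Colab also saves the file to the working directory.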


### Importing Necessary Libraries

Before we begin exploring and analyzing the dataset, we need to import the necessary Python libraries. In this project, we will use the following libraries:

1. **Pandas (`pd`)**:
   - Pandas is a powerful library for data manipulation and analysis.
   - It allows us to work with data structures like DataFrames, making it easy to clean, explore, and preprocess the dataset.

2. **NumPy (`np`)**:
   - NumPy is a fundamental package for scientific computing in Python.
   - It provides support for arrays and matrices, along with a collection of mathematical functions to operate on these data structures.
   
By importing these libraries, we ensure that we have the necessary tools to handle the data efficiently throughout the notebook.


In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np

### Loading the Dataset

We use the `pandas` library to load the dataset into a DataFrame. The `pd.read_csv()` function reads the CSV file and stores it in the variable `df`, allowing us to easily manipulate and analyze the data.

In [4]:
# Load the dataset (update the path as needed)
df = pd.read_csv('healthcare-dataset-stroke-data.csv')

### 1. Basic Data Exploration

To gain an initial understanding of the dataset, we display the first few rows using the `head()` function. This allows us to quickly inspect the data and see its structure, including the features and their values.

In [5]:
# 1. Basic Data Exploration
# Display first few rows
print("First few rows of the dataset:")
print(df.head())

First few rows of the dataset:
      id  gender   age  hypertension  heart_disease ever_married  \
0   9046    Male  67.0             0              1          Yes   
1  51676  Female  61.0             0              0          Yes   
2  31112    Male  80.0             0              1          Yes   
3  60182  Female  49.0             0              0          Yes   
4   1665  Female  79.0             1              0          Yes   

       work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0        Private          Urban             228.69  36.6  formerly smoked   
1  Self-employed          Rural             202.21   NaN     never smoked   
2        Private          Rural             105.92  32.5     never smoked   
3        Private          Urban             171.23  34.4           smokes   
4  Self-employed          Rural             174.12  24.0     never smoked   

   stroke  
0       1  
1       1  
2       1  
3       1  
4       1  


## Dataset Overview

In this section, we take a closer look at the dataset's structure. Its shape tells us how many records and columns are present.

### Shape of the Dataset

The dataset consists of 5110 rows and 12 columns, so we have 5110 instances to work with. Not all 12 columns are input features: `id` is a record identifier, and `stroke` is the outcome column if the data is used to predict stroke.

In [6]:
# Shape of the dataset
print(f"\nShape of the dataset: {df.shape}")


Shape of the dataset: (5110, 12)


In [7]:
# Data information
print("\nInformation about the dataset:")
df.info()


Information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


### Key Insights

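The dataset contains the following information:
- **Total Entries**: 5110
- **Columns**: 12 in total, including `age`, `gender`, `hypertension`, `heart_disease`, `bmi`, and `stroke`
- **Data Types**: float64(3), int64(4), object(5)
- **Missing Values**: only `bmi` is incomplete, with 4909 non-null entries out of 5110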
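The dtype breakdown can be turned into explicit column lists, which is handy when numerical and categorical columns are summarized separately. This is a minimal sketch on a small hypothetical `sample` DataFrame (not the real dataset), using `select_dtypes`.

```python
import pandas as pd

# Small hypothetical sample with the same kinds of columns as the dataset
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0],
    "stroke": [1, 0, 0],
})

# Numeric columns (int64 and float64) versus everything else
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
```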

## Summary Statistics for Numerical Columns

This section summarizes the numerical columns in the dataset. Note that `describe()` also reports `id` and the 0/1 columns (`hypertension`, `heart_disease`, `stroke`), whose means and percentiles need to be read differently from those of continuous features. The statistics describe the center and spread of each column:

- **Count**: Number of non-null entries for each feature.
- **Mean**: The average value.
- **Standard Deviation (std)**: Measure of how spread out the values are from the mean.
- **Minimum (min)**: The smallest value.
- **25th, 50th (Median), and 75th Percentiles**: These indicate the distribution of the data, helping to understand the spread.
- **Maximum (max)**: The largest value.

In [8]:
# Summary statistics for numerical columns
print("\nSummary statistics for numerical columns:")
print(df.describe())


Summary statistics for numerical columns:
                 id          age  hypertension  heart_disease  \
count   5110.000000  5110.000000   5110.000000    5110.000000   
mean   36517.829354    43.226614      0.097456       0.054012   
std    21161.721625    22.612647      0.296607       0.226063   
min       67.000000     0.080000      0.000000       0.000000   
25%    17741.250000    25.000000      0.000000       0.000000   
50%    36932.000000    45.000000      0.000000       0.000000   
75%    54682.000000    61.000000      0.000000       0.000000   
max    72940.000000    82.000000      1.000000       1.000000   

       avg_glucose_level          bmi       stroke  
count        5110.000000  4909.000000  5110.000000  
mean          106.147677    28.893237     0.048728  
std            45.283560     7.854067     0.215320  
min            55.120000    10.300000     0.000000  
25%            77.245000    23.500000     0.000000  
50%            91.885000    28.100000     0.000000  
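For a 0/1 column, the mean is simply the share of rows equal to 1, so the `stroke` mean of 0.048728 above corresponds to roughly 4.87% positive rows. The sketch below shows this on a small hypothetical `sample` DataFrame, and restricts `describe()` to the continuous columns, where percentiles are more meaningful. The column list `continuous_cols` is our own choice, not something defined by the dataset.

```python
import pandas as pd

# Hypothetical sample; the values are for illustration only
sample = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "age": [67.0, 61.0, 80.0, 49.0],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23],
    "stroke": [1, 0, 0, 0],
})

# Summarize only the continuous features
continuous_cols = ["age", "avg_glucose_level"]
summary = sample[continuous_cols].describe()
print(summary)

# For a binary 0/1 column, the mean is the proportion of ones
stroke_rate = sample["stroke"].mean()
print(f"Share of rows with stroke == 1: {stroke_rate:.2%}")
```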


## Summary Statistics for Categorical Columns

This section provides a summary of the categorical features in the dataset. The statistics for categorical columns give us an understanding of the distribution of categories within each feature. The following key metrics are included:

- **Count**: Number of non-null entries for each feature.
- **Unique**: The number of unique categories or values in each column.
- **Top**: The most frequent category (mode) in each column.
- **Frequency (freq)**: The number of occurrences of the top category.

In [9]:
# Summary statistics for categorical columns
print("\nSummary statistics for categorical columns:")
print(df.describe(include=object))


Summary statistics for categorical columns:
        gender ever_married work_type Residence_type smoking_status
count     5110         5110      5110           5110           5110
unique       3            2         5              2              4
top     Female          Yes   Private          Urban   never smoked
freq      2994         3353      2925           2596           1892
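`describe(include=object)` reports only the most frequent category per column. To see how the remaining categories are distributed (for example, what the third `gender` value is), `value_counts()` can be applied per column. The sketch below uses a small hypothetical `sample` DataFrame; its category labels are invented for illustration.

```python
import pandas as pd

# Hypothetical sample of two categorical columns
sample = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other", "Female"],
    "smoking_status": [
        "smokes", "never smoked", "Unknown", "never smoked", "never smoked",
    ],
})

# Count and percentage of every category in each column
for col in sample.columns:
    counts = sample[col].value_counts()
    percents = sample[col].value_counts(normalize=True).mul(100).round(1)
    print(f"--- {col} ---")
    print(pd.DataFrame({"count": counts, "percent": percents}))

gender_counts = sample["gender"].value_counts()
```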


## Unique Values per Column

In this section, we analyze the number of unique values present in each column. Understanding the uniqueness of the data in each feature helps in identifying categorical variables, potential key identifiers, or redundant features with low variability.

### Key Insights:

- **Unique Value Count**: This metric shows the number of distinct values in each column.
  - For categorical columns, it provides insight into the diversity of categories.
  - For numerical columns, it helps determine whether the feature might be discrete or continuous.
  
- **Features with Low or High Uniqueness**: Columns with very few unique values might indicate categorical data or binary features, whereas those with a high count could be numerical or ID-like columns.

This information helps in selecting appropriate data preprocessing techniques, such as encoding for categorical variables or handling ID columns.


In [10]:
# 2. Finding unique values, null values, and percentage of null values

# Unique values per column
print("\nUnique values in each column:")
print(df.nunique())


Unique values in each column:
id                   5110
gender                  3
age                   104
hypertension            2
heart_disease           2
ever_married            2
work_type               5
Residence_type          2
avg_glucose_level    3979
bmi                   418
smoking_status          4
stroke                  2
dtype: int64
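In the output above, `id` has as many unique values as there are rows (5110), which marks it as an identifier, while several columns have exactly two values. The sketch below shows one way to flag such columns automatically. It runs on a small hypothetical `sample` DataFrame, and the rules used (unique count equal to row count, unique count equal to 2) are simple heuristics rather than a definitive classification.

```python
import pandas as pd

# Hypothetical sample; values are for illustration only
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "hypertension": [0, 0, 0, 1, 1],
    "work_type": ["Private", "Self-employed", "Private", "Govt_job", "Private"],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23, 105.92],
})

n_unique = sample.nunique()

# Every value distinct: likely an identifier, not a predictive feature
id_like = n_unique[n_unique == len(sample)].index.tolist()

# Exactly two distinct values: binary flag or two-level category
binary = n_unique[n_unique == 2].index.tolist()

print("ID-like columns:", id_like)
print("Binary columns:", binary)
```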


## Null Values per Column

In this section, we identify the number of missing or null values in each column. Missing data is important to address as it can impact the quality and performance of our models.

### Key Insights:

- **Null Value Count**: This shows how many entries are missing in each column.
- **Data Imputation**: Columns with missing values may require imputation or other techniques to handle the missing data before analysis.


In [11]:
# Null values per column
print("\nNull values in each column:")
print(df.isnull().sum())


Null values in each column:
id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64


## Percentage of Null Values per Column

This section shows the percentage of missing or null values in each column. Understanding the proportion of missing data helps in deciding the best strategy for handling it.

### Key Insights:

- **High Null Percentage**: Columns with a high percentage of missing values may need special treatment, such as removal or advanced imputation methods.
- **Low Null Percentage**: Columns with a low percentage of missing values might require simpler imputation techniques (e.g., mean, median, or mode).

Handling missing data appropriately ensures better model performance and more accurate insights.


In [12]:
# Percentage of null values per column
print("\nPercentage of null values in each column:")
null_percentage = (df.isnull().sum() / len(df)) * 100
print(null_percentage)


Percentage of null values in each column:
id                   0.000000
gender               0.000000
age                  0.000000
hypertension         0.000000
heart_disease        0.000000
ever_married         0.000000
work_type            0.000000
Residence_type       0.000000
avg_glucose_level    0.000000
bmi                  3.933464
smoking_status       0.000000
stroke               0.000000
dtype: float64
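The count and percentage views can be combined into one table that lists only the affected columns. This is a minimal sketch on a hypothetical `sample` DataFrame; it relies on the fact that the mean of a boolean null mask is the fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in one column
sample = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Missing count and percentage side by side
missing_summary = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": sample.isnull().mean() * 100,
})

# Keep only columns that actually have missing values, worst first
missing_summary = (
    missing_summary[missing_summary["missing"] > 0]
    .sort_values("percent", ascending=False)
)
print(missing_summary)
```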


## Handling Missing Values in Numerical Columns

For numerical columns like `bmi`, we can handle missing values by replacing them with the column's mean. This keeps all 5110 rows and leaves the column mean unchanged, but it is not free of bias: it shrinks the column's variance and can weaken its relationships with other features. It is a reasonable choice here because only about 3.9% of the `bmi` values are missing.

### Example:
We replace the missing values in the `bmi` column with the mean value using the following code:

In [13]:
# 3. Handling missing values
# Filling missing values with mean (for numerical columns like bmi)
# Assign the result back instead of calling fillna(..., inplace=True) on df['bmi']:
# the inplace form acts on an intermediate object and triggers a pandas FutureWarning
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
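Mean imputation is only one option. The sketch below compares it with median imputation on a small hypothetical `bmi` series that contains one large value, and shows how mean filling lowers the standard deviation. The numbers are invented for illustration and say nothing about which method suits the real dataset better.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values with one outlier and two missing entries
bmi = pd.Series([22.0, 24.0, 26.0, 60.0, np.nan, np.nan], name="bmi")

# fillna returns a new Series, so the original stays untouched
mean_filled = bmi.fillna(bmi.mean())
median_filled = bmi.fillna(bmi.median())

print(f"Mean used for filling:   {bmi.mean():.1f}")
print(f"Median used for filling: {bmi.median():.1f}")
print(f"Std of observed values:  {bmi.std():.2f}")
print(f"Std after mean filling:  {mean_filled.std():.2f}")
```

The median is less sensitive to extreme values, which can matter for skewed columns. In both cases the filled column has less spread than the observed values.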


In [14]:
# Document observations
# The 'bmi' nulls were already filled above, so recover the original count
# from null_percentage, which was computed before imputation
bmi_missing_before = int(round(null_percentage['bmi'] * len(df) / 100))
print("\nObservations:")
print(f"1. The dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")
print(f"2. The 'bmi' column had {bmi_missing_before} missing values before imputation, accounting for {null_percentage['bmi']:.2f}% of the data.")
print("3. We chose to fill missing values in the 'bmi' column using the mean.")


Observations:
1. The dataset contains 5110 rows and 12 columns.
2. The 'bmi' column had 201 missing values before imputation, accounting for 3.93% of the data.
3. We chose to fill missing values in the 'bmi' column using the mean.


In [15]:
# Save the cleaned dataset if needed
df.to_csv('cleaned_heart_stroke_data.csv', index=False)
print("\nCleaned dataset saved as 'cleaned_heart_stroke_data.csv'.")


Cleaned dataset saved as 'cleaned_heart_stroke_data.csv'.
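After saving, it can be worth reloading the file to confirm that the cleaned data round-trips as expected. The sketch below does this for a tiny hypothetical `cleaned` DataFrame written to a temporary directory. The file name `cleaned_sample.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data; values are for illustration only
cleaned = pd.DataFrame({"id": [1, 2], "bmi": [36.6, 28.9]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_sample.csv")
    # index=False keeps the row index out of the file
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and no missing values
same_shape = reloaded.shape == cleaned.shape
no_missing = int(reloaded.isnull().sum().sum()) == 0
print(f"Same shape: {same_shape}, no missing values: {no_missing}")
```

In Colab, `files.download('cleaned_heart_stroke_data.csv')` from `google.colab` can then be used to copy the saved file to the local machine.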
