![image](https://analyticsindiamag.com/wp-content/uploads/2020/04/Screenshot-2020-04-15-at-10.08.12-AM.png)

## Business Problem Understanding

- According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.

- This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status.

- Each row in the data provides relevant information about one patient.

## Attribute Information
- 1) id: unique identifier
- 2) gender: "Male", "Female" or "Other"
- 3) age: age of the patient
- 4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- 5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
- 6) ever_married: "No" or "Yes"
- 7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
- 8) Residence_type: "Rural" or "Urban"
- 9) avg_glucose_level: average glucose level in blood
- 10) bmi: body mass index
- 11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
- 12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient



## Data Collection/Data Import

**The dataset is available on Google Drive at the link in the command below.**
**Use gdown to download it directly in the Colab environment.**

!gdown https://drive.google.com/uc?id=1vs0cmeKYeht_d07C1HvHFUIxD-IfEACL


In [1]:
## write your code here


## Importing Necessary libraries

In [2]:
import os
import gdown
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Importing Data From CSV File Using Pandas

In [3]:
## write your code here
file_path = "stroke_data.csv"
df = pd.read_csv(file_path)

## Data Understanding:

### Print the first five rows of the pandas dataframe

In [4]:
## write your code here
df.head()

| | id | gender | age | height_in_m | weight_in_kg | avg_glucose_level | bmi | smoking_status | hypertension | heart_disease | ever_married | work_type | Residence_type | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 1.8288 | 122.409046 | 169.35 | 36.6 | formerly smoked | 0 | 1 | Yes | Private | Urban | 1 |
| 1 | 51676 | Female | 61.0 | 1.64592 | NaN | 169.35 | NaN | never smoked | 0 | 0 | Yes | Self-employed | Rural | 1 |
| 2 | 31112 | Male | 80.0 | 1.79832 | 105.103532 | 105.92 | 32.5 | never smoked | 0 | 1 | Yes | Private | Rural | 1 |
| 3 | 60182 | Female | 49.0 | 1.92024 | 126.843865 | 169.35 | 34.4 | smokes | 0 | 0 | Yes | Private | Urban | 1 |
| 4 | 1665 | female | 79.0 | 1.85928 | 82.966131 | 169.35 | 24.0 | never smoked | 1 | 0 | Yes | Self-employed | Rural | 1 |


## Print the last five rows of the pandas dataframe

In [5]:
## write your code here
df.tail()

| | id | gender | age | height_in_m | weight_in_kg | avg_glucose_level | bmi | smoking_status | hypertension | heart_disease | ever_married | work_type | Residence_type | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5130 | 46373 | Female | 57.0 | 1.6764 | 72.506178 | 169.97 | 25.8 | never smoked | 0 | 0 | Yes | Private | Rural | 0 |
| 5131 | 40112 | Female | 37.0 | 1.92024 | 92.551774 | 118.41 | 25.1 | never smoked | 0 | 0 | No | Private | Urban | 0 |
| 5132 | 32240 | Female | 27.0 | 1.8288 | 139.131593 | 93.55 | 41.6 | never smoked | 0 | 0 | No | Private | Urban | 0 |
| 5133 | 69312 | Female | 48.0 | 1.58496 | 78.377464 | 99.29 | 31.2 | never smoked | 0 | 0 | Yes | Self-employed | Urban | 0 |
| 5134 | 25763 | Female | 23.0 | 1.64592 | 76.66619 | 98.66 | 28.3 | Unknown | 0 | 0 | No | Private | Urban | 0 |


## What is the shape of the dataset?

In [6]:
## write your code here
df.shape

(5135, 14)

## What are the names of the columns in the dataframe?

In [7]:
## write your code here
df.columns

Index(['id', 'gender', 'age', 'height_in_m', 'weight_in_kg',
       'avg_glucose_level', 'bmi', 'smoking_status', 'hypertension',
       'heart_disease', 'ever_married', 'work_type', 'Residence_type',
       'stroke'],
      dtype='object')

### What are the datatypes of each feature in the dataset?

In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5135 entries, 0 to 5134
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5135 non-null   int64  
 1   gender             5135 non-null   object 
 2   age                5135 non-null   float64
 3   height_in_m        5135 non-null   float64
 4   weight_in_kg       4934 non-null   float64
 5   avg_glucose_level  5135 non-null   float64
 6   bmi                4934 non-null   float64
 7   smoking_status     5135 non-null   object 
 8   hypertension       5135 non-null   int64  
 9   heart_disease      5135 non-null   int64  
 10  ever_married       4933 non-null   object 
 11  work_type          5135 non-null   object 
 12  Residence_type     5135 non-null   object 
 13  stroke             5135 non-null   int64  
dtypes: float64(5), int64(4), object(5)
memory usage: 561.8+ KB


In [8]:
## write your code here
df.dtypes

id                     int64
gender                object
age                  float64
height_in_m          float64
weight_in_kg         float64
avg_glucose_level    float64
bmi                  float64
smoking_status        object
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
stroke                 int64
dtype: object

## Descriptive Statistics

Descriptive statistics involve a set of summary measures that provide a snapshot of the dataset's characteristics. These measures help us understand the distribution, central tendency, and variability within the data.

- Mean: The average value of the data.
- Median: The middle value when the data is sorted.
- Mode: The most frequently occurring value.
- Range: The difference between the maximum and minimum values.
- Standard Deviation: A measure of how far values typically lie from the mean, expressed in the same units as the data.

These statistics provide a preliminary understanding of the dataset, which is valuable for subsequent analysis and decision-making.



### How to see the descriptive statistics of a dataset?

In [9]:
## write your code here
df.describe()

| | id | age | height_in_m | weight_in_kg | avg_glucose_level | bmi | hypertension | heart_disease | stroke |
|---|---|---|---|---|---|---|---|---|---|
| count | 5135.0 | 5135.0 | 5135.0 | 4934.0 | 5135.0 | 4934.0 | 5135.0 | 5135.0 | 5135.0 |
| mean | 36510.30594 | 43.23739 | 1.751992 | 89.172256 | 101.38318 | 28.899959 | 0.097176 | 0.053944 | 0.048491 |
| std | 21153.824243 | 22.601553 | 0.132236 | 27.910274 | 34.605155 | 7.847094 | 0.296226 | 0.225928 | 0.214822 |
| min | 67.0 | 0.08 | 1.524 | 32.516064 | 55.12 | 10.3 | 0.0 | 0.0 | 0.0 |
| 25% | 17757.0 | 25.0 | 1.64592 | 69.416594 | 77.285 | 23.5 | 0.0 | 0.0 | 0.0 |
| 50% | 36896.0 | 45.0 | 1.73736 | 85.357641 | 91.85 | 28.1 | 0.0 | 0.0 | 0.0 |
| 75% | 54631.5 | 61.0 | 1.85928 | 104.671347 | 113.985 | 33.1 | 0.0 | 0.0 | 0.0 |
| max | 72940.0 | 82.0 | 1.9812 | 348.548423 | 271.74 | 97.6 | 1.0 | 1.0 | 1.0 |


## How to select gender column from the pandas dataframe?

In [10]:
## write your code here


### How to select multiple columns : age, gender and bmi?

In [11]:
## write your code here
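
One possible way to select columns is sketched below. The frame `toy` is made-up illustration data (not the real stroke dataset); only its column names are borrowed from the dataset. Single brackets with one name return a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the stroke dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 49.0],
    "bmi": [36.6, None, 34.4],
})

gender_col = toy["gender"]              # one column name -> pandas Series
subset = toy[["age", "gender", "bmi"]]  # list of names -> DataFrame, in the order given

print(type(gender_col).__name__)
print(list(subset.columns))
```

The same two expressions should work on `df` once the CSV has been loaded.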


## How to select the 7th row of the pandas dataframe?


In [12]:
## write your code here


## How to select the 4th column from the pandas dataframe?


In [13]:
## write your code here
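
A minimal sketch of position-based selection with `iloc`, covering both the 7th row and the 4th column. The frame `toy` is hypothetical. Positions are zero-based, so the 7th row is position 6 and the 4th column is position 3.

```python
import pandas as pd

# Hypothetical toy frame: 8 rows, 5 columns (names borrowed from the dataset).
toy = pd.DataFrame({
    "id": range(100, 108),
    "gender": ["Male", "Female"] * 4,
    "age": [67.0, 61.0, 80.0, 49.0, 79.0, 81.0, 74.0, 69.0],
    "height_in_m": [1.83, 1.65, 1.80, 1.92, 1.86, 1.70, 1.75, 1.60],
    "stroke": [1, 1, 1, 1, 1, 0, 0, 0],
})

seventh_row = toy.iloc[6]    # 7th row -> Series indexed by column name
fourth_col = toy.iloc[:, 3]  # all rows of the 4th column -> Series

print(seventh_row)
print(fourth_col.name)
```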


## How to select the 20th to 30th rows and the 3rd to 7th columns in a pandas dataframe?

In [14]:
## write your code here


## How to select the 3rd and 100th rows and the 4th and 10th columns in a pandas dataframe?

In [15]:
## write your code here
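
The two range and list selections above can be sketched with `iloc` as follows. The 100 x 10 frame and its column names `c1` to `c10` are placeholders invented for this example. Note that `iloc` slices exclude the end position, so rows 20 to 30 are `19:30`.

```python
import numpy as np
import pandas as pd

# Hypothetical 100 x 10 frame with placeholder column names c1..c10.
# Each cell holds row_position * 10 + column_position, which makes checks easy.
toy = pd.DataFrame(
    np.arange(1000).reshape(100, 10),
    columns=[f"c{i}" for i in range(1, 11)],
)

# 20th to 30th rows and 3rd to 7th columns (end positions are exclusive)
block = toy.iloc[19:30, 2:7]

# 3rd and 100th rows, 4th and 10th columns: pass lists of positions
picked = toy.iloc[[2, 99], [3, 9]]

print(block.shape)
print(picked)
```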


## Select only those rows with gender 'Male'

In [16]:
## write your code here


## Select the rows with avg_glucose_level greater than 100, keeping only the columns gender, age, bmi and avg_glucose_level

In [17]:
## write your code here


## Select all females who are older than 50

In [18]:
## write your code here
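
The three filtering questions above all rely on boolean masks. Here is a hedged sketch on a made-up frame `toy` (values are illustrative, column names are from the dataset). Conditions are combined with `&`, and each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy rows; column names come from the stroke dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "avg_glucose_level": [169.35, 95.10, 105.92, 88.00, 120.50],
    "bmi": [36.6, 28.0, 32.5, 34.4, 24.0],
})

# Rows where gender is exactly "Male"
males = toy[toy["gender"] == "Male"]

# .loc accepts a row mask and a list of column labels together
high_glucose = toy.loc[
    toy["avg_glucose_level"] > 100,
    ["gender", "age", "bmi", "avg_glucose_level"],
]

# Two conditions combined with & (each wrapped in parentheses)
older_females = toy[(toy["gender"] == "Female") & (toy["age"] > 50)]

print(len(males), len(high_glucose), len(older_females))
```

In the real data, the `df.head()` output shows a lowercase `female` in row 4, so an exact match on `"Female"` would likely miss such rows unless the casing is standardized first.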


## Data Wrangling

- Data Inspection
  - Checking Duplicate Entries
  - Checking Missing Values
  - Checking standard format
  - Checking data entry typos and errors
- Data Cleaning
  - Removing Duplicates
  - Handling Missing Values
  - Standardizing Formats
  - Correcting Errors
- Data Transformation
  - Feature Engineering
  - Normalization/Scaling
  - One-Hot Encoding
- Data Integration
- Data Reduction
- Data Formatting
- Data Enrichment
- Data Validation
- Documentation
- Exploratory Data Analysis (EDA)


### Checking Duplicate Entries
- Check whether duplicate entries are present.
- If they are, find how many duplicate entries there are.

In [19]:
## write your code here


In [20]:
## write your code here


## Remove Duplicate Entries

- Remove all rows that are duplicates of an earlier row

In [21]:
## write your code here


In [22]:
## write your code here
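
A small sketch of counting and then removing duplicates, assuming that "duplicate" means a row identical to an earlier row in every column. The frame `toy` is invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the last row repeats the first one exactly.
toy = pd.DataFrame({
    "id": [9046, 51676, 31112, 9046],
    "gender": ["Male", "Female", "Male", "Male"],
    "age": [67.0, 61.0, 80.0, 67.0],
})

# duplicated() marks rows identical to an earlier row; sum() counts them
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence; reset_index tidies the row labels
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduped.shape)
```

If duplicates should be judged on a subset of columns (for example `id` only), both methods accept a `subset=` argument.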


## Checking Missing Values
- Find the missing (NaN) values in the dataset
- Find the columns that have missing values, along with how many each one has

In [23]:
## write your code here
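
One way to count missing values per column is sketched here on a hypothetical frame `toy`. `isna()` flags both `NaN` and `None`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with a few gaps; column names come from the dataset.
toy = pd.DataFrame({
    "weight_in_kg": [122.4, np.nan, 105.1, 126.8],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "ever_married": ["Yes", "Yes", None, "Yes"],
    "stroke": [1, 1, 1, 1],
})

missing_counts = toy.isna().sum()                       # missing count per column
cols_with_missing = missing_counts[missing_counts > 0]  # keep only affected columns

print(cols_with_missing)
```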


## Visualize missing values using heatmaps

In [24]:
## write your code here


## Handling Missing Values

- Handle missing values in the ever_married, avg_glucose_level and weight_in_kg columns

In [25]:
## write your code here


In [26]:
## write your code here


In [27]:
## write your code here



In [28]:
## write your code here
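
A hedged sketch of one common imputation strategy: the mode for a categorical column and the median for a numeric one. This is an assumption for illustration, not necessarily the intended solution, and the frame `toy` is made up. Whether imputing or dropping rows is more appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names come from the stroke dataset.
toy = pd.DataFrame({
    "ever_married": ["Yes", None, "No", "Yes"],
    "weight_in_kg": [122.4, np.nan, 105.1, 83.0],
})

filled = toy.copy()  # work on a copy so the original stays untouched

# Categorical column: fill with the most frequent value (mode)
filled["ever_married"] = filled["ever_married"].fillna(
    filled["ever_married"].mode()[0]
)

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
filled["weight_in_kg"] = filled["weight_in_kg"].fillna(
    filled["weight_in_kg"].median()
)

print(filled)
```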



## Checking missing values for weight_in_kg and bmi columns

In [29]:
## write your code here


In [30]:
## write your code here


## Check the relationship between bmi and height_in_m to see whether it can be used to fill missing values in bmi (use a scatterplot to inspect the relationship visually)


In [31]:
## write your code here


In [32]:
## write your code here


In [33]:
## write your code here


In [34]:
## write your code here
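
As a numeric complement to the scatterplot, the sketch below uses the four complete rows shown in the `df.head()` output. It checks the standard definition BMI = weight (kg) / height (m)² against the recorded `bmi` values and computes the Pearson correlation between `height_in_m` and `bmi`. The column name `bmi_from_formula` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The four complete rows from the df.head() output (row 1 is omitted: bmi and weight are missing)
rows = pd.DataFrame({
    "height_in_m": [1.8288, 1.79832, 1.92024, 1.85928],
    "weight_in_kg": [122.409046, 105.103532, 126.843865, 82.966131],
    "bmi": [36.6, 32.5, 34.4, 24.0],
})

# Standard definition: BMI = weight in kg divided by height in metres squared
rows["bmi_from_formula"] = rows["weight_in_kg"] / rows["height_in_m"] ** 2

# Pearson correlation between height alone and bmi
height_bmi_corr = rows["height_in_m"].corr(rows["bmi"])

print(rows[["bmi", "bmi_from_formula"]].round(1))
print("corr(height_in_m, bmi):", round(height_bmi_corr, 3))
```

Height on its own is only one input to BMI, so the weight is needed as well. In the sample above, the row with a missing `bmi` is also missing `weight_in_kg`, and `df.info()` reports the same non-null count (4934) for both columns, so it is worth checking whether the formula can actually be applied to the incomplete rows.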


## Exploratory Data Analysis

- Univariate Analysis: Studying one variable at a time
- Bivariate Analysis: Studying two variables at a time
- Multivariate Analysis: Studying multiple variables at a time
- We need to investigate each feature properly

In [35]:
## write your code here


## id feature

In [36]:
## write your code here


## gender

In [37]:
## write your code here (check dtypes first)


In [38]:
## write your code here


In [39]:
## write your code here for calculating frequency count of gender column


In [40]:
## write your code here
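
A minimal sketch of a frequency count with `value_counts()`, on a hypothetical Series. The lowercase `female` mirrors row 4 of the `df.head()` output and shows why casing may need to be standardized before counting.

```python
import pandas as pd

# Hypothetical toy values; the lowercase "female" mirrors row 4 of the df.head() output.
gender = pd.Series(
    ["Male", "Female", "female", "Female", "Male", "Other"], name="gender"
)

# Raw counts: "female" and "Female" are treated as different categories
raw_counts = gender.value_counts()

# Strip whitespace and capitalize the first letter before counting
clean_counts = gender.str.strip().str.capitalize().value_counts()

print(raw_counts)
print(clean_counts)
```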


## Create Piechart Or Bargraph For Univariate Analysis Of Categorical Feature

In [41]:
## write your code here


## smoking status

In [42]:
## write your code here


In [43]:
## write your code here


In [44]:
## write your code here


In [45]:
## write your code here


## Plot figure (Barchart)

In [46]:
## write your code here (use seaborn)


## hypertension

In [47]:
## write your code here


In [48]:
## write your code here (show graph)


## stroke feature

In [49]:
## write your code here


In [50]:
## write your code here(piechart)


# Bivariate Analysis
## Are patients with hypertension more likely to have a stroke? (use the pd.crosstab function)

In [51]:
## write your code here
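
A sketch of a contingency table with `pd.crosstab`, using ten invented patients. With `normalize="index"`, each row is converted to proportions, which makes the stroke rate for each hypertension group directly comparable.

```python
import pandas as pd

# Hypothetical toy data: 6 patients without hypertension, 4 with.
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":       [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Raw counts: rows are hypertension groups, columns are stroke outcomes
counts = pd.crosstab(toy["hypertension"], toy["stroke"])

# Row-wise proportions: the share of each hypertension group that had a stroke
rates = pd.crosstab(toy["hypertension"], toy["stroke"], normalize="index")

print(counts)
print(rates)
```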


## Hypothesis Testing (Chi-square Test for Independence)


`chi2, p, dof, expected = chi2_contingency(stroke_hypertension_df)`

In [52]:
from scipy.stats import chi2_contingency

In [53]:
# Perform Chi-square test
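
A hedged sketch of the chi-square test of independence. The counts in `stroke_hypertension_df` below are invented; in the notebook this table would come from the crosstab of `hypertension` against `stroke`. For a 2 x 2 table, `chi2_contingency` applies Yates' continuity correction by default.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table (counts are NOT taken from the stroke data)
stroke_hypertension_df = pd.DataFrame(
    [[450, 20], [40, 10]],
    index=pd.Index([0, 1], name="hypertension"),
    columns=pd.Index([0, 1], name="stroke"),
)

# H0: hypertension and stroke are independent
chi2, p, dof, expected = chi2_contingency(stroke_hypertension_df)

alpha = 0.05  # significance level
print("chi2:", chi2, "p-value:", p, "dof:", dof)
if p < alpha:
    print("Reject H0: the two variables appear to be associated")
else:
    print("Fail to reject H0: no evidence of association")
```

The `expected` array holds the counts that would be expected under independence, which is useful for checking the usual rule of thumb that expected cell counts should not be too small.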



## Group Barplot

In [54]:
## write your code here


In [55]:
# Plot using Seaborn



## heart disease

In [56]:
## write your code here


In [57]:
## write your code here


## Hypothesis Testing (Chi-square Test for Independence)


In [58]:
# Perform Chi-square test



## Group Bar plot

In [59]:
## write your code here with long format table


In [60]:
## write your code here


## Numerical Features

In [61]:
# select numerical features
## write your code here
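
One way to pick out numerical features is `select_dtypes`, sketched here on a made-up two-row frame.

```python
import pandas as pd

# Hypothetical toy rows; column names come from the stroke dataset.
toy = pd.DataFrame({
    "id": [9046, 51676],
    "gender": ["Male", "Female"],
    "age": [67.0, 61.0],
    "hypertension": [0, 0],
    "smoking_status": ["formerly smoked", "never smoked"],
})

# include="number" keeps integer and float columns
numeric_df = toy.select_dtypes(include="number")
numeric_cols = numeric_df.columns.tolist()

print(numeric_cols)
```

Because `df.info()` reports `id`, `hypertension`, `heart_disease` and `stroke` as `int64`, they would be selected too, even though `id` is an identifier and the other three are binary flags. It may make sense to drop them from the list before plotting distributions.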


## age column

In [62]:
## write your code here (for histogram)


In [63]:
## write your code here (for kde plot)


In [64]:
## write your code here (for outlier analysis using boxplot)


## Bmi column

In [65]:
## write your code here (for histogram)


In [66]:
## write your code here (for kde plot)

In [67]:
## write your code here (for boxplot outlier analysis)
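
A boxplot with default settings draws whiskers out to 1.5 times the interquartile range (IQR) and marks points beyond them as outliers. The same rule can be sketched numerically, here on hypothetical bmi values (97.6 is the maximum reported by `df.describe()`; the other values are illustrative).

```python
import pandas as pd

# Hypothetical bmi values; only 97.6 is taken from the df.describe() output.
bmi = pd.Series([23.5, 24.0, 25.1, 25.8, 28.1, 28.3, 31.2, 33.1, 97.6], name="bmi")

q1, q3 = bmi.quantile(0.25), bmi.quantile(0.75)
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = bmi[(bmi < lower) | (bmi > upper)]

print("fences:", lower, upper)
print(outliers)
```

A flagged value is only a candidate for review, since an extreme but genuine measurement is not necessarily an error.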


### Hypothesis Test For Normality


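from scipy.stats import kstest

# Drop missing values so that NaNs do not distort the test
bmi = final_df['bmi'].dropna()

# Perform Kolmogorov-Smirnov test against a normal distribution that uses the
# sample's own mean and standard deviation. Passing 'norm' alone would compare
# against the standard normal N(0, 1), which raw bmi values (mean near 29)
# would almost certainly fail.
# Note: estimating the parameters from the same sample makes the p-value approximate.
statistic, pvalue = kstest(bmi, 'norm', args=(bmi.mean(), bmi.std()))

# Print the result
print("Kolmogorov-Smirnov Test Statistic:", statistic)
print("p-value:", pvalue)

# Interpret the results
alpha = 0.05  # Significance level

if pvalue > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")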


In [68]:
from scipy.stats import kstest, shapiro

## Scatterplots

In [69]:
## write your code here


## Correlation Plots and Heatmaps

In [70]:
## write your code here
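
A minimal sketch of a correlation matrix on invented values. Non-numeric columns are dropped first with `select_dtypes`, and `corr()` then computes pairwise Pearson correlations by default.

```python
import pandas as pd

# Hypothetical toy rows; column names come from the stroke dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "avg_glucose_level": [169.35, 95.10, 105.92, 88.00, 120.50],
    "bmi": [36.6, 28.0, 32.5, 34.4, 24.0],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()

print(corr.round(2))
```

The resulting matrix can be passed to `sns.heatmap(corr, annot=True)` to draw the heatmap.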
