# Assignment Details
In this assignment, you will conduct a guided exploration of the healthcare stroke dataset. You will learn and use some of the most common exploration, aggregation, and descriptive operations, which should also help you learn many of the key features of R.

You will also learn how to use visualization libraries to identify patterns in the data that will guide your further analysis. You will explore the most popular chart types and see how different libraries and styles can make your visualizations more attractive.


## Reading Dataset


#### Note
To run this notebook, the dataset file 'healthcare_stroke_dataset.csv' must be in the same folder as this ipynb file.

The code below reads the CSV file into an R data frame named "block".


In [14]:
# loading the dataset
block <- read.csv('healthcare_stroke_dataset.csv', stringsAsFactors = FALSE)
head(block, 5)


| | id | date | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<int>` |
| 1 | 9046 | 12/30/2020 | Male | 67 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 2 | 51676 | 8/18/2020 | Female | 61 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NA | never smoked | 1 |
| 3 | 31112 | 3/5/2020 | Male | 80 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 4 | 60182 | 7/8/2020 | Female | 49 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 5 | 1665 | 6/5/2020 | Female | 79 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
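
Because the notebook depends on the CSV sitting next to it, a small guard can make a missing file fail with a clearer message. The following is a minimal sketch and not part of the original assignment: `load_stroke_data` and `csv_path` are hypothetical names, and the demonstration uses a throwaway temporary file rather than the real dataset.

```r
# Hypothetical helper: stop early with a readable message if the CSV is missing.
load_stroke_data <- function(csv_path) {
  if (!file.exists(csv_path)) {
    stop("Dataset not found: ", csv_path,
         " (working directory: ", getwd(), ")")
  }
  read.csv(csv_path, stringsAsFactors = FALSE)
}

# Demonstration on a temporary file with made-up values, so the sketch runs anywhere.
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c(1L, 2L), bmi = c(36.6, NA)),
          demo_path, row.names = FALSE)

demo <- load_stroke_data(demo_path)
print(dim(demo))   # rows, then columns
```

In the notebook itself the call would presumably be `block <- load_stroke_data('healthcare_stroke_dataset.csv')`.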



# Preprocessing the Data




In [16]:
# looking at missing values, the number of rows and columns, and an overall summary
row.has.na <- apply(block,1,function(x){any(is.na(x))})
cat("Actual empty values in the given dataset\n",sum(row.has.na))
cat("\nNumber of rows and columns in the dataset are -")
dim(block)
cat("\nsummary of the given dataset")
summary(block)

Actual empty values in the given dataset
 201
Number of rows and columns in the dataset are -


summary of the given dataset

       id            date              gender               age       
 Min.   :   67   Length:5110        Length:5110        Min.   : 0.08  
 1st Qu.:17741   Class :character   Class :character   1st Qu.:25.00  
 Median :36932   Mode  :character   Mode  :character   Median :45.00  
 Mean   :36518                                         Mean   :43.23  
 3rd Qu.:54682                                         3rd Qu.:61.00  
 Max.   :72940                                         Max.   :82.00  
                                                                      
  hypertension     heart_disease     ever_married        work_type        
 Min.   :0.00000   Min.   :0.00000   Length:5110        Length:5110       
 1st Qu.:0.00000   1st Qu.:0.00000   Class :character   Class :character  
 Median :0.00000   Median :0.00000   Mode  :character   Mode  :character  
 Mean   :0.09746   Mean   :0.05401                                        
 3rd Qu.:0.00000   3rd Qu.:0.00000                       
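
The figure printed above (201) is the number of rows that contain at least one missing value, which is not necessarily the number of missing cells. As a minimal sketch on a made-up data frame (`toy` and its values are illustrative, not taken from the dataset), the two quantities can be compared like this:

```r
# Toy data frame with made-up values; two rows contain NA, three cells are NA.
toy <- data.frame(
  age = c(67, 61, 80, 49),
  bmi = c(36.6, NA, 32.5, NA),
  smoking_status = c("smokes", NA, "never smoked", "smokes"),
  stringsAsFactors = FALSE
)

# Missing cells per column.
na_per_column <- colSums(is.na(toy))

# Rows with at least one missing value (what the cell above counts).
rows_with_na <- sum(!complete.cases(toy))

print(na_per_column)
print(rows_with_na)
```

Running `colSums(is.na(block))` on the real data would show which columns are responsible for the missing values.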

In [3]:
# removing the rows that contain missing values
block <- block[!row.has.na,]
row.has.na <- apply(block,1,function(x){any(is.na(x))})
cat("\nTotal Empty values in the dataset after modification\n",sum(row.has.na))



Total Empty values in the dataset after modification
 0
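
The cell above drops incomplete rows with a logical mask built by `apply()`. Base R also offers `complete.cases()` and `na.omit()` for the same purpose. The sketch below, on a made-up data frame, suggests the three approaches keep the same rows; it is an illustration, not the notebook's own method.

```r
# Toy data frame with made-up values; rows 2 and 3 each contain an NA.
toy <- data.frame(
  id = 1:4,
  bmi = c(36.6, NA, 32.5, 34.4),
  gender = c("Male", "Female", NA, "Female"),
  stringsAsFactors = FALSE
)

# Approach used in the notebook: row-wise mask built with apply().
kept_by_mask <- toy[!apply(toy, 1, function(x) any(is.na(x))), ]

# Vectorised alternatives from base R.
kept_by_complete_cases <- toy[complete.cases(toy), ]
kept_by_na_omit <- na.omit(toy)

print(kept_by_mask$id)
print(kept_by_complete_cases$id)
print(kept_by_na_omit$id)
```

`complete.cases()` avoids converting each row to a character vector, which `apply()` does on a data frame with mixed column types.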

In [4]:
# factorizing the heart_disease & hypertension variables
block$heart_disease <- factor(block$heart_disease, levels = c(0,1), labels =c('No','Yes'))
block$hypertension <- factor(block$hypertension,levels = c(0,1), labels = c('No','Yes'))
summary(block)

       id            date              gender               age       
 Min.   :   77   Length:4909        Length:4909        Min.   : 0.08  
 1st Qu.:18605   Class :character   Class :character   1st Qu.:25.00  
 Median :37608   Mode  :character   Mode  :character   Median :44.00  
 Mean   :37064                                         Mean   :42.87  
 3rd Qu.:55220                                         3rd Qu.:60.00  
 Max.   :72940                                         Max.   :82.00  
 hypertension heart_disease ever_married        work_type        
 No :4458     No :4666      Length:4909        Length:4909       
 Yes: 451     Yes: 243      Class :character   Class :character  
                            Mode  :character   Mode  :character  
                                                                 
                                                                 
                                                                 
 Residence_type     avg_glucose_level    
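
The effect of `factor()` with `levels` and `labels` can be seen on a small vector. This is a minimal sketch with made-up 0/1 flags (`flags` is an illustrative name):

```r
# Made-up 0/1 indicator values.
flags <- c(0, 1, 0, 0, 1)

# Map 0 -> "No" and 1 -> "Yes", as done for heart_disease and hypertension.
flags_factor <- factor(flags, levels = c(0, 1), labels = c("No", "Yes"))

# summary() and table() now report counts per label instead of quartiles.
print(table(flags_factor))

# The internal integer codes are 1 and 2, not the original 0 and 1.
print(as.integer(flags_factor))

# One way to recover the original 0/1 coding if it is needed later.
flags_restored <- as.integer(flags_factor == "Yes")
print(flags_restored)
```

This explains why the summary above shows `No`/`Yes` counts for the two converted columns rather than a mean.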


# Task 1: Statistical Exploratory Data Analysis
Let us start by getting to know the dataset. Your first task is to gather some basic information using R's built-in functions.


In [5]:
#For each task below, look for an R function to do the task.
#Replace the placeholder in each task with your code.


#Task 1-a: Find the number of rows and columns in the block data frame.

details <- dim(block)  # dim() returns the number of rows followed by the number of columns
cat("\n-->Task 1-a: Number of rows and columns  of block data frame are: \n",details)



#Task 1-b: Find and print the descriptive details (min, max, mean, median, quartiles, counts for factors, etc.) of the given dataset
details <- summary(block)   # summary() returns per-column descriptive statistics
cat("\n-->Task 1-b: Descriptive details of the dataset are\n")
details


#Task 1-c: Print ALL the unique values of work_type,smoking_status and Residence_type columns.
cat("\n-->Task 1-c: Print ALL the unique values of work_type,smoking_status and Residence_type columns:\n")
unique(block$work_type)
unique(block$smoking_status)
unique(block$Residence_type)


-->Task 1-a: Number of rows and columns  of block data frame are: 
 4909 13
-->Task 1-b: Descriptive details of the dataset are


       id            date              gender               age       
 Min.   :   77   Length:4909        Length:4909        Min.   : 0.08  
 1st Qu.:18605   Class :character   Class :character   1st Qu.:25.00  
 Median :37608   Mode  :character   Mode  :character   Median :44.00  
 Mean   :37064                                         Mean   :42.87  
 3rd Qu.:55220                                         3rd Qu.:60.00  
 Max.   :72940                                         Max.   :82.00  
 hypertension heart_disease ever_married        work_type        
 No :4458     No :4666      Length:4909        Length:4909       
 Yes: 451     Yes: 243      Class :character   Class :character  
                            Mode  :character   Mode  :character  
                                                                 
                                                                 
                                                                 
 Residence_type     avg_glucose_level    


-->Task 1-c: Print ALL the unique values of work_type,smoking_status and Residence_type columns:
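
The three `unique()` calls in Task 1-c can also be written as a single `lapply()` over the column names, which returns a named list. The sketch below uses a made-up data frame whose column names mirror the dataset; the values are illustrative only.

```r
# Toy data frame with made-up rows; column names mirror the dataset.
toy <- data.frame(
  work_type = c("Private", "Self-employed", "Private", "children"),
  smoking_status = c("smokes", "never smoked", "-", "-"),
  Residence_type = c("Urban", "Rural", "Rural", "Urban"),
  stringsAsFactors = FALSE
)

cols <- c("work_type", "smoking_status", "Residence_type")

# One unique() call per selected column; the result is a list named by column.
unique_values <- lapply(toy[cols], unique)
print(unique_values)
```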




# Task 2: Aggregation & Filtering & Rank
In this task, we will perform some high-level aggregation and filtering operations.
Then, we will rank the results for some of the tasks.
R provides convenient functions for aggregation, filtering, and ranking, such as aggregate(), subset(), and order().
DO NOT write a for loop.



In [12]:


#Task 2-a: Ascertain the highest avg_glucose_level achieved by different ages.
cat("\n-->Task 2-a:Highest avg_glucose_level  for every age is listed below\n")

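# Note: because id is also listed in `by`, the maximum is taken per (age, id) pair, not per age alone.
aggregate(block$avg_glucose_level, by = list("different ages"=block$age,"id Num"=block$id),max)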




#Task 2-b: Rank the patients by bmi within the age that occurs least often and within the age that occurs most often.
cat("\n>-Task 2-b:Listed below are id ordered according to their bmi with the LEAST age \n")
uniqv <- unique(block$age)
# tabulate(match(...)) counts how often each unique age occurs; "LEAST age" is the least frequent age, not the youngest
min_age<-uniqv[which.min(tabulate(match(block$age, uniqv)))]
cat("\n LEAST age is",min_age)
cat("\n")
newdata <- subset(block, age == min_age)
rankeddata <-newdata[order(newdata$bmi),]
rankeddata

cat("\n>-Task 2-b:Listed below are id ordered according to their bmi with the HIGHEST age \n")
uniqv <- unique(block$age)
# "HIGHEST age" is the most frequent age (the mode), not the oldest
max_age<-uniqv[which.max(tabulate(match(block$age, uniqv)))]
cat("\n HIGHEST age is",max_age)
cat("\n")
newdata <- subset(block, age == max_age)
rankeddata <-newdata[order(newdata$bmi),]
rankeddata




-->Task 2-a:Highest avg_glucose_level  for every age is listed below


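| different ages | id Num | x |
|---|---|---|
| `<dbl>` | `<int>` | `<dbl>` |
| 13 | 77 | 85.81 |
| 55 | 84 | 89.17 |
| 42 | 91 | 98.53 |
| 31 | 99 | 108.89 |
| 24 | 129 | 97.55 |
| 33 | 156 | 86.97 |
| 20 | 163 | 94.67 |
| 20 | 187 | 84.07 |
| 43 | 205 | 88.23 |
| 81 | 210 | 91.54 |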
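
In the output above, age 20 appears twice (ids 163 and 187), because the grouping includes `id` as well as `age`. If the goal is a single maximum per age, grouping by `age` alone should suffice. The following is a minimal sketch on made-up rows (`toy` and its values are illustrative), using the formula interface of `aggregate()`:

```r
# Toy data frame with made-up values; ages 20 and 55 each occur twice.
toy <- data.frame(
  id = 1:5,
  age = c(20, 20, 13, 55, 55),
  avg_glucose_level = c(94.67, 84.07, 85.81, 89.17, 98.53)
)

# One row per age, holding the highest avg_glucose_level seen at that age.
max_by_age <- aggregate(avg_glucose_level ~ age, data = toy, FUN = max)
print(max_by_age)
# age 13 -> 85.81, age 20 -> 94.67, age 55 -> 98.53
```

On the real data, the equivalent call would presumably be `aggregate(avg_glucose_level ~ age, data = block, FUN = max)`.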



>-Task 2-b:Listed below are id ordered according to their bmi with the LEAST age 

 LEAST age is 0.4


| | id | date | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<fct>` | `<fct>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<int>` |
| 1601 | 40544 | 12/16/2020 | Male | 0.4 | No | No | No | children | Urban | 109.56 | 14.3 | - | 0 |
| 4582 | 15728 | 3/8/2020 | Female | 0.4 | No | No | No | children | Rural | 85.65 | 17.4 | - | 0 |



>-Task 2-b:Listed below are id ordered according to their bmi with the HIGHEST age 

 HIGHEST age is 78


| | id | date | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<dbl>` | `<fct>` | `<fct>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<int>` |
| 1814 | 42465 | 10/31/2020 | Female | 78 | Yes | No | Yes | Private | Rural | 58.66 | 16.4 | never smoked | 0 |
| 2011 | 26247 | 3/13/2020 | Female | 78 | No | No | Yes | Private | Rural | 95.37 | 17.3 | - | 0 |
| 1449 | 49341 | 6/22/2020 | Female | 78 | No | No | Yes | Private | Rural | 154.75 | 17.6 | never smoked | 0 |
| 249 | 43424 | 8/20/2020 | Female | 78 | No | No | Yes | Private | Rural | 78.81 | 19.6 | - | 1 |
| 132 | 16817 | 10/31/2020 | Female | 78 | Yes | No | No | Private | Urban | 130.54 | 20.1 | never smoked | 1 |
| 2604 | 32445 | 10/24/2020 | Female | 78 | No | No | Yes | Self-employed | Urban | 79.55 | 21.1 | formerly smoked | 0 |
| 4449 | 69010 | 5/1/2020 | Male | 78 | No | No | Yes | Private | Rural | 83.20 | 21.2 | formerly smoked | 0 |
| 2488 | 44325 | 12/24/2020 | Male | 78 | No | No | Yes | Self-employed | Rural | 126.39 | 21.3 | smokes | 0 |
| 1972 | 48775 | 2/27/2020 | Female | 78 | Yes | No | Yes | Self-employed | Rural | 201.07 | 21.8 | - | 0 |
| 2180 | 66677 | 10/20/2020 | Male | 78 | No | No | Yes | Private | Rural | 80.09 | 21.8 | never smoked | 0 |
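
The `tabulate(match(...))` idiom in Task 2-b picks an age by how often it occurs. A perhaps more readable way to express the same idea is `table()`, followed by `order()` for sorting and `rank()` for an explicit rank column. The sketch below runs on made-up rows; `toy`, `age_counts`, and `bmi_rank` are illustrative names. One difference to keep in mind: `table()` sorts its entries by value, so ties in frequency resolve to the smallest age, whereas the notebook's version resolves ties to the age that appears first in the data.

```r
# Toy data frame with made-up values: age 78 occurs three times, 61 twice, 0.4 once.
toy <- data.frame(
  id = 1:6,
  age = c(78, 0.4, 78, 61, 78, 61),
  bmi = c(21.2, 14.3, 16.4, 30.1, 19.6, 27.5)
)

# Frequency of each distinct age.
age_counts <- table(toy$age)
most_common_age <- as.numeric(names(age_counts)[which.max(age_counts)])
least_common_age <- as.numeric(names(age_counts)[which.min(age_counts)])

# Filter to the most common age, sort ascending by bmi, and add a rank column.
ranked <- toy[toy$age == most_common_age, ]
ranked <- ranked[order(ranked$bmi), ]
ranked$bmi_rank <- rank(ranked$bmi, ties.method = "min")

print(most_common_age)
print(least_common_age)
print(ranked)
```

With `ties.method = "min"`, rows sharing the same bmi receive the same rank, which may be useful for cases like the two rows with bmi 21.8 in the output above.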
