## Aim:
#### Introduction to Pandas Library

### What is Pandas?
Pandas is a fast, flexible, open-source data analysis and manipulation library built on top of the Python programming language. In particular, it offers data structures and operations for manipulating numerical tables and time series.  
Pandas simplifies many of the time-consuming, repetitive tasks associated with working with data, including:

- Data cleansing
- Data fill
- Data normalization
- Merges and joins
- Data visualization
- Statistical analysis
- Data inspection
- Loading and saving data

And much more.


### DataFrame
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It has three principal components: the data, the row labels (the index), and the column labels.
![image.png](attachment:a50c7eae-586d-4aa6-8f9b-82c7d588d713.png)
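
The three components described above can be inspected directly. The following is a minimal sketch on a small, made-up frame (the names `toy`, `p1`, `p2`, `p3` and the values are illustrative assumptions, not taken from the stroke dataset):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three components
toy = pd.DataFrame(
    {"name": ["Asha", "Ben", "Chen"], "age": [34, 51, 29], "smoker": [False, True, False]},
    index=["p1", "p2", "p3"],
)

print(toy.index.tolist())    # row labels
print(toy.columns.tolist())  # column labels
print(toy.to_numpy())        # the underlying data as a NumPy array
print(toy.dtypes)            # each column keeps its own dtype (heterogeneous)
```

Adding a column or a row to `toy` changes its shape in place of creating a new structure, which is what "size-mutable" refers to.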

### Importing the library:

In [1]:
import pandas as pd

In [2]:
file = r"C:\Users\super\Software\Mega\Academics\5th sem labs\DS Lab\stroke.csv"

#### Reading the CSV:

In [3]:
df = pd.read_csv(file)
# second, independent DataFrame read from the same file; used later to demonstrate filling null values
df3 = pd.read_csv(file)

In [4]:
df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


#### `df.head()`
Returns the first n rows of the object, based on position. It is useful for quickly checking that the object holds the kind of data you expect.
>The default value of n is 5.

In [5]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [6]:
df.head(10)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
8,27419,Female,59.0,0,0,Yes,Private,Rural,76.15,,Unknown,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1


#### `df.tail()`
Works like `head()`, but returns the last n rows instead (n also defaults to 5).

In [7]:
df.tail()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.2,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0
5109,44679,Female,44.0,0,0,Yes,Govt_job,Urban,85.28,26.2,Unknown,0


#### We can also select individual columns of a DataFrame (a single column is a `Series`)

In [8]:
# method 1: attribute access (works only when the column name is a valid Python identifier)
print(df.id)
print(type(df.id))

0        9046
1       51676
2       31112
3       60182
4        1665
        ...  
5105    18234
5106    44873
5107    19723
5108    37544
5109    44679
Name: id, Length: 5110, dtype: int64
<class 'pandas.core.series.Series'>


In [9]:
# method 2: bracket indexing (works for any column name)
df['id']

0        9046
1       51676
2       31112
3       60182
4        1665
        ...  
5105    18234
5106    44873
5107    19723
5108    37544
5109    44679
Name: id, Length: 5110, dtype: int64

#### Passing a list of column names returns a DataFrame instead of a Series:

In [10]:
print(df[['id']])
print(type(df[['id']]))

         id
0      9046
1     51676
2     31112
3     60182
4      1665
...     ...
5105  18234
5106  44873
5107  19723
5108  37544
5109  44679

[5110 rows x 1 columns]
<class 'pandas.core.frame.DataFrame'>


#### `df.shape`
Returns a tuple representing the dimensionality of the DataFrame.

In [11]:
df.shape

(5110, 12)

#### To get the datatypes of each column in the dataframe:

In [12]:
df.dtypes

id                     int64
gender                object
age                  float64
hypertension           int64
heart_disease          int64
ever_married          object
work_type             object
Residence_type        object
avg_glucose_level    float64
bmi                  float64
smoking_status        object
stroke                 int64
dtype: object

#### `DataFrame.describe()`
Generates descriptive statistics.  
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.  
It works on both numeric and object Series, as well as on DataFrames with columns of mixed data types; the output varies depending on what is provided. By default, only the numeric columns of a mixed DataFrame are summarized.

In [13]:
df.describe().T
# .T transposes the result, so each row of the output describes one column of df

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,5110.0,36517.829354,21161.721625,67.0,17741.25,36932.0,54682.0,72940.0
age,5110.0,43.226614,22.612647,0.08,25.0,45.0,61.0,82.0
hypertension,5110.0,0.097456,0.296607,0.0,0.0,0.0,0.0,1.0
heart_disease,5110.0,0.054012,0.226063,0.0,0.0,0.0,0.0,1.0
avg_glucose_level,5110.0,106.147677,45.28356,55.12,77.245,91.885,114.09,271.74
bmi,4909.0,28.893237,7.854067,10.3,23.5,28.1,33.1,97.6
stroke,5110.0,0.048728,0.21532,0.0,0.0,0.0,0.0,1.0


In [14]:
df[df.age>60].describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,1304.0,1304.0,1304.0,1304.0,1304.0,1217.0,1304.0
mean,36323.394172,71.506135,0.216258,0.161043,122.045222,29.873377,0.135736
std,21178.602549,6.721973,0.41185,0.367712,58.084227,6.020397,0.34264
min,132.0,61.0,0.0,0.0,55.23,11.3,0.0
25%,17303.75,65.0,0.0,0.0,79.1925,26.0,0.0
50%,36603.5,71.0,0.0,0.0,97.07,29.1,0.0
75%,54504.25,78.0,0.0,0.0,174.1975,33.4,0.0
max,72823.0,82.0,1.0,1.0,271.74,60.2,1.0
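
The outputs above contain only numeric columns, because that is what `describe()` summarizes by default on a mixed DataFrame. As a small sketch of the non-numeric case, the frame below reuses the `gender` and `age` values of the first five rows shown earlier (the name `toy` is an assumption) and passes `include="all"`:

```python
import pandas as pd

# gender and age of the first five rows displayed earlier
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
})

numeric_summary = toy.describe()           # numeric columns only
all_summary = toy.describe(include="all")  # adds unique, top and freq for text columns

print(numeric_summary)
print(all_summary)
```

For a text column, `top` is the most frequent value and `freq` is how often it occurs; the numeric statistics are left as NaN for that column.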


#### `isnull()`
`DataFrame.isnull` is an alias for `DataFrame.isna`.  
It detects missing values.  
- Returns a boolean object of the same shape, with `True` wherever the value is NA


In [15]:
df.isnull()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,False,False,False,False,False,False,False,False,False,True,False,False
5106,False,False,False,False,False,False,False,False,False,False,False,False
5107,False,False,False,False,False,False,False,False,False,False,False,False
5108,False,False,False,False,False,False,False,False,False,False,False,False


#### To get the number of null values in each column:

In [16]:
df.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

### Dealing with null values
Two common approaches are:

#### 1. Dropping the null values:

In [17]:
df.dropna(axis=0, inplace=True)

Resetting the index after some rows have been dropped:

In [18]:
df.reset_index(drop=True,inplace=True)

In [19]:
df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
2,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
3,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
4,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
4904,14180,Female,13.0,0,0,No,children,Rural,103.08,18.6,Unknown,0
4905,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
4906,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
4907,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


#### 2. Replacing the null values
For example, with the mean of the column:

In [20]:
df3.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [21]:
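# assign the result back instead of calling fillna(inplace=True) on a selected column,
# which may not update df3 in recent pandas versions
df3['bmi'] = df3['bmi'].fillna(df3['bmi'].mean())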

In [22]:
df3.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
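
To compare the two approaches side by side, here is a minimal sketch on a tiny frame holding the `bmi` values of the first five rows shown earlier (the names `toy`, `dropped` and `filled` are assumptions made for this example):

```python
import numpy as np
import pandas as pd

# bmi values of the first five rows displayed earlier; the second one is missing
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

# Approach 1: drop the incomplete rows, then renumber the index
dropped = toy.dropna().reset_index(drop=True)

# Approach 2: keep every row and replace NaN with the column mean
filled = toy.fillna({"bmi": toy["bmi"].mean()})

print(len(dropped), len(filled))
print(filled.loc[1, "bmi"])
```

Dropping loses one row here, while filling keeps all five; neither call modifies `toy` because no `inplace=True` is used.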

### `loc` and `iloc`
> `loc` is a label-based indexer: we pass the labels (names) of the rows or columns we want to select. It is used with square brackets, not called like a function. A slice passed to `loc` includes its end label, unlike `iloc`. `loc` also accepts a boolean Series or array as a filter.

> `iloc` is an integer position-based indexer: we pass integer positions to select specific rows or columns. A slice passed to `iloc` excludes its end position, unlike `loc`. `iloc` accepts a boolean array, but not a boolean Series.

In [23]:
df.loc[3,'age']

79.0

Getting a range of rows and selected columns using `loc` (the end label 6 is included):

In [24]:
df.loc[0:6, ['age', 'gender','heart_disease']]

Unnamed: 0,age,gender,heart_disease
0,67.0,Male,1
1,80.0,Male,1
2,49.0,Female,0
3,79.0,Female,0
4,81.0,Male,0
5,74.0,Male,1
6,69.0,Female,0


In [25]:
df.iloc[4, 5]

'Yes'
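
The difference between the two indexers is easiest to see when the row labels are not 0, 1, 2, .... The following sketch uses a made-up frame with labels 10 to 40 (all names and values are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [67.0, 80.0, 49.0, 79.0], "gender": ["Male", "Male", "Female", "Female"]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[10:30, "age"]   # labels 10, 20 and 30: the end label is included
by_position = toy.iloc[0:2, 0]     # positions 0 and 1: the end position is excluded

mask = toy["age"] > 60
older_loc = toy.loc[mask]                # loc accepts a boolean Series
older_iloc = toy.iloc[mask.to_numpy()]   # iloc needs a plain boolean array

print(by_label.tolist())
print(by_position.tolist())
print(older_loc.index.tolist())
```

Converting the mask with `to_numpy()` is what makes it usable with `iloc`; both filtered results contain the same rows.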

#### Getting rows based on some conditions:

In [26]:
df.loc[df.age>60]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
4,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
5,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
4890,22190,Female,64.0,1,0,Yes,Self-employed,Urban,76.89,30.2,Unknown,0
4894,56799,Male,76.0,0,0,Yes,Govt_job,Urban,82.35,38.9,never smoked,0
4898,64520,Male,68.0,0,0,Yes,Self-employed,Urban,91.68,40.8,Unknown,0
4900,68398,Male,82.0,1,0,Yes,Self-employed,Rural,71.97,28.3,never smoked,0


In [27]:
# problem: find all rows with age greater than 60 and bmi less than 25
df.loc[(df['age']>60) & (df['bmi']<25)]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
3,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
6,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,never smoked,1
7,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,Unknown,1
16,70630,Female,71.0,0,0,Yes,Govt_job,Rural,193.94,22.4,smokes,1
21,70822,Male,80.0,0,0,Yes,Self-employed,Rural,104.12,23.5,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
4812,48109,Female,79.0,0,1,Yes,Private,Rural,88.51,24.5,never smoked,0
4822,19826,Female,81.0,0,0,Yes,Self-employed,Rural,86.05,20.1,formerly smoked,0
4861,64420,Female,61.0,0,0,Yes,Govt_job,Rural,120.23,22.7,Unknown,0
4866,66684,Male,70.0,0,0,Yes,Self-employed,Rural,193.88,24.3,Unknown,0


Some summary operations on a single column:

In [28]:
df['age'].max()

82.0

In [29]:
df['age'].min()

0.08

In [30]:
df['age'].mode()

0    57.0
1    78.0
dtype: float64

In [31]:
df['age'].mean()

42.865373803218574

#### Renaming the columns:

In [32]:
df.rename(columns = {'Residence_type':'residence_type', 'work_type' : 'type_of_work'}, inplace = True)

In [33]:
df.type_of_work.mode()

0    Private
dtype: object

In [34]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,type_of_work,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
2,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
3,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
4,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1


#### Dropping a column:

In [35]:
df.drop("id", axis=1, inplace=True)

In [36]:
df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,type_of_work,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
2,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
3,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
4,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1


#### Sorting the rows by a column:

In [37]:
df.sort_values('age').head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,type_of_work,residence_type,avg_glucose_level,bmi,smoking_status,stroke
3140,Male,0.08,0,0,No,children,Rural,70.33,16.9,Unknown,0
1531,Female,0.08,0,0,No,children,Urban,139.67,14.1,Unknown,0
3793,Male,0.16,0,0,No,children,Rural,69.79,13.0,Unknown,0
3456,Male,0.16,0,0,No,children,Urban,114.71,17.4,Unknown,0
3846,Male,0.16,0,0,No,children,Urban,109.52,13.9,Unknown,0


In decreasing order:

In [38]:
df.sort_values('age', ascending=False).head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,type_of_work,residence_type,avg_glucose_level,bmi,smoking_status,stroke
2201,Male,82.0,0,0,Yes,Private,Urban,89.83,24.7,smokes,0
623,Male,82.0,0,0,Yes,Self-employed,Rural,56.75,21.0,never smoked,0
4627,Female,82.0,0,0,Yes,Self-employed,Urban,113.45,30.3,never smoked,0
2225,Female,82.0,0,0,Yes,Private,Urban,80.0,33.6,never smoked,0
3442,Male,82.0,0,0,No,Self-employed,Urban,101.57,24.3,smokes,0


### `groupby`
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

In [39]:
smokers = df.groupby("smoking_status")

In [40]:
#number of values in each group:
smokers.size()

smoking_status
Unknown            1483
formerly smoked     837
never smoked       1852
smokes              737
dtype: int64
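
`size()` only counts rows. To illustrate the full split, apply and combine pattern, here is a small sketch that computes several statistics per group on a made-up frame (the values and the output column names `n`, `mean_age` and `stroke_rate` are assumptions for this example):

```python
import pandas as pd

# Hypothetical rows, grouped by smoking status
toy = pd.DataFrame({
    "smoking_status": ["smokes", "never smoked", "smokes", "never smoked", "Unknown"],
    "age": [49.0, 61.0, 55.0, 80.0, 13.0],
    "stroke": [1, 1, 0, 1, 0],
})

# split by smoking_status, apply one function per output column, combine into one table
summary = toy.groupby("smoking_status").agg(
    n=("age", "count"),
    mean_age=("age", "mean"),
    stroke_rate=("stroke", "mean"),
)
print(summary)
```

Because `stroke` is coded as 0 or 1, its mean within a group is the fraction of rows in that group with a stroke.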

### Binning:
Binning, also known as bucketing or discretization, is a common data pre-processing technique that groups intervals of continuous data into “bins” or “buckets”.

In [41]:
bins = [0, 18.5, 25, 30, 100]  
labels = ['UnderWeight', 'Healthy', 'OverWeight', 'Obese']
bmi_category = pd.cut(x = df['bmi'], bins=bins, labels=labels)

In [42]:
bmi_category.value_counts()

Obese          1893
OverWeight     1409
Healthy        1258
UnderWeight     349
Name: bmi, dtype: int64
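
It helps to know how `pd.cut` treats values that fall exactly on a bin edge. By default each bin is open on the left and closed on the right, so 25.0 belongs to `(18.5, 25]`. The sketch below reuses the same `bins` and `labels` on a few hand-picked values (the Series `sample` is an assumption for illustration):

```python
import pandas as pd

bins = [0, 18.5, 25, 30, 100]
labels = ['UnderWeight', 'Healthy', 'OverWeight', 'Obese']

# values chosen to sit on or just past the bin edges, plus two outside the range
sample = pd.Series([18.5, 18.6, 25.0, 25.1, 30.0, 0.0, 120.0])
cats = pd.cut(sample, bins=bins, labels=labels)

print(cats.tolist())
```

The last two values become NaN: 0.0 because the first bin excludes its left edge, and 120.0 because it lies beyond the last edge.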

In [43]:
df2 = df  # note: df2 is a second name for the same DataFrame, not a copy; changes to df2 also appear in df

In [44]:
df2['bmi_category'] = pd.cut(x = df['bmi'], bins=bins, labels=labels)

In [45]:
df2.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,type_of_work,residence_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_category
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1,Obese
1,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1,Obese
2,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1,Obese
3,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1,Healthy
4,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1,OverWeight
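
Note that `df2 = df` does not create a new DataFrame, which is why later outputs of `df` also show the `bmi_category` column. A minimal sketch of the difference between an alias and a copy, on a made-up frame (the names `toy`, `alias` and `independent` are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({"bmi": [36.6, 24.0]})

alias = toy               # same object under a second name
independent = toy.copy()  # separate DataFrame with its own data

alias["flag"] = True      # also visible through toy
independent["other"] = 0  # not visible through toy

print(alias is toy)
print(toy.columns.tolist())
print(independent.columns.tolist())
```

Use `.copy()` whenever the original frame should stay unchanged.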


We can also perform mathematical operations on individual columns:

In [46]:
df2['age/10'] = df2['age']/10

In [47]:
df2.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,type_of_work,residence_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_category,age/10
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1,Obese,6.7
1,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1,Obese,8.0
2,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1,Obese,4.9
3,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1,Healthy,7.9
4,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1,OverWeight,8.1


#### `Series.map()`
Maps the values of a Series according to an input mapping or function.  
Used for substituting each value in a Series with another value, which may be derived from a function, a dict or a Series.  

In [51]:
df['ever_married'] = df['ever_married'].map({'Yes' : 1, 'No' :0})

In [52]:
df

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,type_of_work,residence_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_category,age/10
0,Male,67.0,0,1,1,Private,Urban,228.69,36.6,formerly smoked,1,Obese,6.7
1,Male,80.0,0,1,1,Private,Rural,105.92,32.5,never smoked,1,Obese,8.0
2,Female,49.0,0,0,1,Private,Urban,171.23,34.4,smokes,1,Obese,4.9
3,Female,79.0,1,0,1,Self-employed,Rural,174.12,24.0,never smoked,1,Healthy,7.9
4,Male,81.0,0,0,1,Private,Urban,186.21,29.0,formerly smoked,1,OverWeight,8.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4904,Female,13.0,0,0,0,children,Rural,103.08,18.6,Unknown,0,Healthy,1.3
4905,Female,81.0,0,0,1,Self-employed,Urban,125.20,40.0,never smoked,0,Obese,8.1
4906,Female,35.0,0,0,1,Self-employed,Rural,82.99,30.6,never smoked,0,Obese,3.5
4907,Male,51.0,0,0,1,Private,Rural,166.29,25.6,formerly smoked,0,OverWeight,5.1
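
One detail worth knowing when mapping with a dict: values that have no matching key become NaN. The following sketch shows this, along with mapping by a function, on a made-up Series (`status` and the value `'Maybe'` are assumptions for illustration):

```python
import pandas as pd

status = pd.Series(["Yes", "No", "Yes", "Maybe"])

as_flag = status.map({"Yes": 1, "No": 0})  # 'Maybe' has no key, so it becomes NaN
as_upper = status.map(str.upper)           # a function is applied element by element

print(as_flag.tolist())
print(as_upper.tolist())
```

For the same reason, running cell `In [51]` a second time would turn the whole `ever_married` column into NaN, since the values 1 and 0 are not keys of the mapping.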


### [Masking Operator](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html)
`DataFrame.mask(cond, other=nan, *, inplace=False, axis=None, level=None, errors='raise', try_cast=_NoDefault.no_default)`  
- Replace values where the condition is True.

Parameters
> **cond**: bool Series/DataFrame, array-like, or callable  
Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

> **other**: scalar, Series/DataFrame, or callable  
Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it).


![image.png](attachment:434aa99e-b85f-4d8a-a1d0-c3ab49aec622.png)
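
As a small illustration of the parameters described above, the sketch below applies `mask` to a made-up frame, once with the default `other` (NaN) and once with a scalar replacement; `where` is included for contrast because it does the opposite. All names and values are assumptions for this example:

```python
import pandas as pd

toy = pd.DataFrame({"age": [67.0, 0.08, 49.0], "bmi": [36.6, 16.9, 97.6]})

# cells where the condition is True are replaced by NaN (the default `other`)
masked_nan = toy.mask(toy > 60)

# on a single column, replace values above 60 with the scalar 60.0
masked_cap = toy["bmi"].mask(toy["bmi"] > 60, 60.0)

# where() is the inverse: it keeps the cells where the condition is True
kept = toy.where(toy > 60)

print(masked_nan)
print(masked_cap.tolist())
print(kept)
```

None of these calls modifies `toy`, since `inplace` defaults to `False`.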

#### Making the index start from 1 instead of 0
>Makes use of NumPy's `arange` function

In [53]:
#earlier
df2.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,type_of_work,residence_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_category,age/10
0,Male,67.0,0,1,1,Private,Urban,228.69,36.6,formerly smoked,1,Obese,6.7
1,Male,80.0,0,1,1,Private,Rural,105.92,32.5,never smoked,1,Obese,8.0
2,Female,49.0,0,0,1,Private,Urban,171.23,34.4,smokes,1,Obese,4.9
3,Female,79.0,1,0,1,Self-employed,Rural,174.12,24.0,never smoked,1,Healthy,7.9
4,Male,81.0,0,0,1,Private,Urban,186.21,29.0,formerly smoked,1,OverWeight,8.1


In [54]:
import numpy as np
df2.index = np.arange(1,df2.shape[0]+1,1)

In [55]:
#now
df2

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,type_of_work,residence_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_category,age/10
1,Male,67.0,0,1,1,Private,Urban,228.69,36.6,formerly smoked,1,Obese,6.7
2,Male,80.0,0,1,1,Private,Rural,105.92,32.5,never smoked,1,Obese,8.0
3,Female,49.0,0,0,1,Private,Urban,171.23,34.4,smokes,1,Obese,4.9
4,Female,79.0,1,0,1,Self-employed,Rural,174.12,24.0,never smoked,1,Healthy,7.9
5,Male,81.0,0,0,1,Private,Urban,186.21,29.0,formerly smoked,1,OverWeight,8.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4905,Female,13.0,0,0,0,children,Rural,103.08,18.6,Unknown,0,Healthy,1.3
4906,Female,81.0,0,0,1,Self-employed,Urban,125.20,40.0,never smoked,0,Obese,8.1
4907,Female,35.0,0,0,1,Self-employed,Rural,82.99,30.6,never smoked,0,Obese,3.5
4908,Male,51.0,0,0,1,Private,Rural,166.29,25.6,formerly smoked,0,OverWeight,5.1
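
The same result can be obtained without NumPy. As an alternative sketch (not the approach used in the cell above), pandas' own `RangeIndex` can be assigned directly; the frame `toy` is a made-up example:

```python
import pandas as pd

toy = pd.DataFrame({"age": [67.0, 80.0, 49.0]})

# stop is exclusive, so len(toy) + 1 yields the labels 1..len(toy)
toy.index = pd.RangeIndex(start=1, stop=len(toy) + 1)

print(toy.index.tolist())
print(toy.loc[1, "age"])  # label 1 is now the first row
```

After this change, `loc` uses the new labels while `iloc` still counts positions from 0.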


### Apply
`DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)`  
- Apply a function along an axis of the DataFrame.

In [56]:
output = df2[['age', 'bmi', 'avg_glucose_level']].apply(np.mean)

In [57]:
output

age                   42.865374
bmi                   28.893237
avg_glucose_level    105.305150
dtype: float64
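
The example above uses the default `axis=0`, which passes each column to the function. The following sketch contrasts this with `axis=1` (one call per row) and with a custom function, on a made-up frame (names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"age": [67.0, 80.0, 49.0], "bmi": [36.6, 32.5, 34.4]})

col_means = toy.apply(np.mean)                             # axis=0: one value per column
row_max = toy.apply(np.max, axis=1)                        # axis=1: one value per row
col_range = toy.apply(lambda col: col.max() - col.min())   # any function of a Series works

print(col_means)
print(row_max.tolist())
print(col_range)
```

For common statistics such as the mean, calling `toy.mean()` directly gives the same result and is usually faster than `apply`.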