# Data Wrangling (part 2)

Data wrangling is the process of cleaning, organizing, and transforming raw data to make it more suitable and valuable for downstream purposes such as analytics.

In this notebook we will look at:

- [x] Importing data
- [x] Handling missing values
- [x] Handling incorrect data
- [x] Removing duplicated data
- [x] Detecting outliers
- [ ] Data Transformation
    - .apply() and lambda functions
    - .groupby() function

Consider the _transformation_dataset.xlsx_ dataset.

| Variable | Meaning                                                      |
|----------|--------------------------------------------------------------|
| gender   | Gender of the individual                                     |
| age      | Age of the individual                                        |
| height   | Height of the individual (in cm)                             |
| weight   | Weight of the individual (in kg)                             |
| smoke    | Indicates if the individual smokes (1 = yes, 0 = no)         |
| alcohol  | Indicates if the individual drinks alcohol (1 = yes, 0 = no) |
| active   | Indicates if the individual is active (1 = yes, 0 = no)      |

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
original_df = pd.read_excel('transformation_dataset.xlsx')
df = original_df.copy()

## Data Transformation

**Create a new column called 'BMI' containing the BMI of each person.**

$$ \text{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^2} $$



In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
# calculate BMI directly:
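
One possible way to compute BMI directly is with vectorized column arithmetic, which operates on whole columns at once. The sketch below assumes the same `height` (cm) and `weight` (kg) columns as the dataset, but uses a small made-up frame (`toy_df` is a hypothetical name and its values are illustrative only).

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (values are made up).
toy_df = pd.DataFrame({
    'height': [180, 160, 175],     # cm
    'weight': [81.0, 64.0, 70.0],  # kg
})

# Convert height to metres, then divide weight by height squared, column-wise.
toy_df['BMI'] = toy_df['weight'] / (toy_df['height'] / 100) ** 2
print(toy_df)
```

The same expression should work on `df` once the Excel file has been loaded.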


### 1. .apply() function and Lambda function

In [None]:
# this function converts cm to m:

def cm_to_m_converter(x):
    result = x/100
    return result

In [None]:
df['height'].apply(cm_to_m_converter)

In [None]:
df['height_m'] = df['height'].apply(cm_to_m_converter)
df

**Use the .apply() function to calculate the BMI scores.**

In [None]:
def BMI_calculator(row):
    return row['weight'] / (row['height_m'] ** 2)

df['BMI'] = df.apply(BMI_calculator, axis=1)

# the axis argument specifies the axis along which the function is applied:
# 0 : apply the function to each column (default)
# 1 : apply the function to each row

df
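
The effect of the `axis` argument may be easier to see on a tiny frame. The following is a minimal sketch with made-up values; `toy_df` and `total` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy_df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def total(series):
    # Receives a whole column (axis=0) or a whole row (axis=1) as a Series.
    return series.sum()

col_totals = toy_df.apply(total, axis=0)  # one result per column
row_totals = toy_df.apply(total, axis=1)  # one result per row

print(col_totals)
print(row_totals)
```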

Consider the following function:
    
```   
def BMI_calculator_2args(w, h):
    BMI = w / h**2
    return BMI
```

For more complicated functions (e.g. those requiring more than one argument), you can wrap the call in a lambda function.

#### Lambda Function

A lambda function is a small anonymous function. 
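
As a quick plain-Python illustration (independent of pandas), the sketch below compares a named function with an equivalent lambda. The names `halve`, `halve_lambda`, and `words` are hypothetical.

```python
# A named function and an equivalent lambda expression.
def halve(x):
    return x / 2

halve_lambda = lambda x: x / 2  # same behaviour, written as an expression

print(halve(10), halve_lambda(10))

# Lambdas are usually passed inline to another function, e.g. as a sort key.
words = ['pear', 'fig', 'banana']
by_length = sorted(words, key=lambda w: len(w))
print(by_length)
```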

Remember this function:
    
```
def cm_to_m_converter(x):
    result = x/100
    return result 
```

Instead of passing the named function, as we did before:

```
df['height'].apply(cm_to_m_converter)
```

we can use the .apply() and lambda functions as follows:

In [None]:
df['height'].apply(lambda x: x/100)

In [None]:
df['height'].apply(lambda x: cm_to_m_converter(x))

**What if the function requires 2 or more arguments?**
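
One common pattern, sketched here under the assumption that the columns are named `weight` and `height_m` as earlier in this notebook, is to apply a lambda row by row (`axis=1`) and have it forward the values the function needs. The frame `toy_df` and its values are made up.

```python
import pandas as pd

def BMI_calculator_2args(w, h):
    # w: weight in kg, h: height in metres
    return w / h**2

# Hypothetical toy data (values are made up).
toy_df = pd.DataFrame({'weight': [81.0, 64.0], 'height_m': [1.8, 1.6]})

# The lambda receives one row at a time and passes on the two arguments.
toy_df['BMI'] = toy_df.apply(
    lambda row: BMI_calculator_2args(row['weight'], row['height_m']),
    axis=1,
)
print(toy_df)
```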

**Create a new column called `BMI_class` that indicates whether a male is:**

- Underweight (BMI < 18.5),
- Normal (18.5 <= BMI < 25),
- Overweight (25 <= BMI < 30), or
- Obese (BMI >= 30)

**based on his BMI score. If the person is female, output the string 'n/a'.**

In [None]:
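def BMI_classifier(gender, BMI):

    # the classification is only requested for males in this exercise
    if gender == 'Female':
        category = 'n/a'

    else:
        # upper bounds are checked in order, so the categories leave no gaps
        # (e.g. a BMI of 24.95 is 'Normal' rather than falling through to 'Obese')
        if BMI < 18.5:
            category = 'Underweight'
        elif BMI < 25:
            category = 'Normal'
        elif BMI < 30:
            category = 'Overweight'
        else:
            category = 'Obese'

    return category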

In [None]:
# use the .apply() and lambda functions


In [None]:
# create the new column: BMI_class 

# view updated dataframe
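
A possible approach to this exercise is sketched below on made-up data. It uses a compact stand-in for the classifier with assumed cut-offs of 18.5, 25, and 30, and it assumes `gender` holds the strings 'Male' and 'Female'. The frame `toy_df` is hypothetical.

```python
import pandas as pd

def BMI_classifier(gender, BMI):
    # Compact stand-in for the classifier; cut-offs 18.5 / 25 / 30 are assumed.
    if gender == 'Female':
        return 'n/a'
    if BMI < 18.5:
        return 'Underweight'
    elif BMI < 25:
        return 'Normal'
    elif BMI < 30:
        return 'Overweight'
    return 'Obese'

# Hypothetical toy data (values are made up).
toy_df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Male'],
    'BMI': [17.0, 22.0, 24.95, 31.0],
})

# Row-wise apply: the lambda forwards both columns to the classifier.
toy_df['BMI_class'] = toy_df.apply(
    lambda row: BMI_classifier(row['gender'], row['BMI']), axis=1
)
print(toy_df)
```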


### 2. .groupby() function

The groupby method allows you to group rows of data together and call aggregate functions.
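
As a minimal sketch of the idea, the example below groups a made-up frame by `BMI_class` and averages `weight` within each group. `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (values are made up).
toy_df = pd.DataFrame({
    'BMI_class': ['Normal', 'Obese', 'Normal', 'Obese'],
    'weight': [70.0, 100.0, 74.0, 110.0],
})

# Split the rows by BMI_class, then average the weight within each group.
mean_weight = toy_df.groupby('BMI_class')['weight'].mean()
print(mean_weight)
```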

In [None]:
df

**What is the average weight of the people in each BMI class?**

Use the .groupby() method to group rows together based on the values in a column.

This will create a DataFrameGroupBy object:

In [None]:
df.groupby('BMI_class')

You can save this object as a new variable:

In [None]:
BMI_groups = df.groupby('BMI_class')

And then call aggregate methods of the object:

In [None]:
# numeric_only=True skips text columns such as 'gender', which cannot be averaged
BMI_groups.mean(numeric_only=True)

In [None]:
# select the column first, then aggregate only that column
BMI_groups['weight'].mean()

Other operations:

In [None]:
BMI_groups.median(numeric_only=True)

In [None]:
BMI_groups.std(numeric_only=True)

In [None]:
BMI_groups.min()

In [None]:
# numeric_only=True avoids concatenating the strings in text columns
BMI_groups.sum(numeric_only=True)

In [None]:
BMI_groups.size()
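
Several of the statistics above can also be computed in one call with `.agg()`, which accepts a list of aggregation names. The following is a sketch on made-up data; `toy_df` and `summary` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data (values are made up).
toy_df = pd.DataFrame({
    'BMI_class': ['Normal', 'Obese', 'Normal', 'Obese'],
    'weight': [70.0, 100.0, 74.0, 110.0],
})

# One row per group, one column per aggregation.
summary = toy_df.groupby('BMI_class')['weight'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```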

To view groups:

In [None]:
BMI_groups.groups

In [None]:
# look up specific rows by their index labels, such as labels listed for a group above
df.loc[[2, 8]]

**Group by with multiple columns:**

**Calculate the average weights of individuals based on their BMI categories and smoking habits.**

In [None]:
df.groupby(['BMI_class','smoke'])

In [None]:
# numeric_only=True skips text columns such as 'gender'
df.groupby(['BMI_class','smoke']).mean(numeric_only=True)

You can also use other aggregation functions. For example:

In [None]:
df.groupby(['BMI_class','smoke']).size()
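
Grouping by two columns produces a result indexed by both keys (a MultiIndex). The sketch below, on made-up data, shows one way to read a single combination and to pivot the second key into columns with `.unstack()`. `toy_df`, `mean_weight`, and `table` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data (values are made up).
toy_df = pd.DataFrame({
    'BMI_class': ['Normal', 'Normal', 'Obese', 'Obese', 'Normal'],
    'smoke': [0, 1, 0, 1, 0],
    'weight': [70.0, 72.0, 100.0, 110.0, 74.0],
})

# Average weight for every (BMI_class, smoke) combination.
mean_weight = toy_df.groupby(['BMI_class', 'smoke'])['weight'].mean()
print(mean_weight)

# Look up one combination with a tuple of keys.
print(mean_weight.loc[('Normal', 0)])

# Move the 'smoke' level into columns for a compact two-way table.
table = mean_weight.unstack('smoke')
print(table)
```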

<h4><center>The End!</center></h4>