<img src="https://i.ibb.co/hcrKx44/Weekly-Challenge-Banner.png" >

# Weekly Challenge 3
## Description

Hello everyone and welcome to the third challenge! In this week's challenge, you will learn how to handle a common problem in data science: _missing data_. 


## The dataset
For this task, we produced a _corrupted_ version of the classic <a href="https://archive.ics.uci.edu/ml/datasets/iris">iris</a> dataset.

This dataset describes the characteristics of iris flowers through four features:
 * Sepal length
 * Sepal width
 * Petal length
 * Petal width

Since these are physical measurements in centimetres, all values have to be **strictly** positive.

However, you will observe that some values are missing (NaN) or incoherent (zero or negative). We will assume here that the values are <a href="https://en.wikipedia.org/wiki/Missing_data#Missing_completely_at_random">missing completely at random</a>.
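
Before cleaning, it can help to count both kinds of problem per column. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the real file) and assumes that zero or negative entries are the incoherent ones.

```python
import pandas as pd

# Hypothetical toy data standing in for the corrupted dataset
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, -1.0, float("nan"), 6.3],
    "petal length (cm)": [1.4, 4.4, 0.0, float("nan")],
})

# Entries that are truly missing (NaN), counted per column
n_missing = toy.isna().sum()

# Entries that are present but incoherent (zero or negative);
# comparisons with NaN evaluate to False, so NaNs are not counted here
n_nonpositive = (toy <= 0).sum()

print(n_missing)
print(n_nonpositive)
```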

## The task
Your task is to clean the dataset by performing *median imputation*, i.e., replacing each missing value with the median of its feature. While there are numerous techniques for dealing with missing data, median and mean imputation are among the most frequently used.
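
As a minimal illustration of the idea on a single feature, the sketch below imputes one missing entry in a made-up series (the values are illustrative only). It relies on `Series.median()` skipping NaNs by default.

```python
import pandas as pd

# Hypothetical feature with one missing entry
s = pd.Series([1.0, float("nan"), 3.0, 10.0])

# median() ignores NaN by default, so this is the median of [1, 3, 10]
median_value = s.median()

# Replace the missing entry with that median
imputed = s.fillna(median_value)

print(median_value)
print(imputed.tolist())
```

Mean imputation would be the same pattern with `s.mean()` in place of `s.median()`; the median is less sensitive to outliers such as the 10.0 here.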

After replacing the missing values, submit the median **sepal length** of flowers whose **petal length** is greater than or equal to 5.5 cm.

In [1]:
import pandas as pd

# Load data
df = pd.read_csv('data/iris_corrupted.csv')

In [2]:
df.tail(10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
140,9.7,,4.4,1.4
141,-1.0,2.6,4.4,1.2
142,6.8,2.8,4.8,1.4
143,4.8,3.0,1.4,0.3
144,7.7,3.0,6.1,2.3
145,7.7,3.8,,2.2
146,,3.1,,2.4
147,5.0,2.3,3.3,1.0
148,7.0,3.3,1.4,
149,-1.0,2.9,5.6,1.8


### Step-by-step solution

In [3]:
import numpy as np

# Replace all zero and negative entries by NaNs
# (Note: all missing values are already represented by NaNs)
df[df <= 0] = np.nan
df.tail()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
145,7.7,3.8,,2.2
146,,3.1,,2.4
147,5.0,2.3,3.3,1.0
148,7.0,3.3,1.4,
149,,2.9,5.6,1.8


In [4]:
# Replace each NaN with the median of its column (median() skips NaNs by default)
df = df.fillna(df.median())
df.tail()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
145,7.7,3.8,4.3,2.2
146,5.9,3.1,4.3,2.4
147,5.0,2.3,3.3,1.0
148,7.0,3.3,1.4,1.3
149,5.9,2.9,5.6,1.8


In [5]:
# Find the median sepal length of flowers whose petal length is >= 5.5
df[df['petal length (cm)'] >= 5.5].median()

sepal length (cm)    6.9
sepal width (cm)     3.0
petal length (cm)    5.8
petal width (cm)     2.1
dtype: float64

In [6]:
df[df['petal length (cm)'] >= 5.5].median()['sepal length (cm)']

6.9

### Or as a one-liner
The code below requires Python 3.8 or later because it uses the assignment expression ("walrus") operator `:=`. To run it, you may need to create a new environment with `conda create --name py38 python=3.8 pandas jupyter` and activate it with `conda activate py38`.
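
For readers new to the operator, here is a small pandas-free sketch of what `:=` does: it binds a name and returns the value within a single expression. The variable names are illustrative only.

```python
values = [4.0, 7.0, 1.0]

# `total` is bound inside the expression and its value is used immediately
average = (total := sum(values)) / len(values)

# The object being indexed is evaluated before the index expression,
# so `xs` is already bound when `len(xs)` runs
last = (xs := [3, 1, 2])[len(xs) - 1]

print(total, average, last)
```

The second pattern is what lets a name bound on the left of a `[...]` be reused inside the brackets.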

In [7]:
# Reload data
df = pd.read_csv('data/iris_corrupted.csv')

(df_imputed := (dfnan := df.mask(df <= 0)).fillna(dfnan.median()))[df_imputed['petal length (cm)'] >= 5.5].median()['sepal length (cm)']


6.9
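
If Python 3.8 is not available, one possible alternative is a method chain using `DataFrame.pipe` and a callable passed to `.loc`, which avoids the walrus operator. This is a sketch on a made-up frame (`toy` and its values are hypothetical), not a replacement for the solution above.

```python
import pandas as pd

# Hypothetical toy data standing in for the corrupted dataset
toy = pd.DataFrame({
    "sepal length (cm)": [5.0, -1.0, 7.0, 6.0, float("nan")],
    "petal length (cm)": [1.4, 5.6, float("nan"), 6.0, 5.5],
})

result = (
    toy.mask(toy <= 0)                          # non-positive entries -> NaN
       .pipe(lambda d: d.fillna(d.median()))    # median imputation per column
       .loc[lambda d: d["petal length (cm)"] >= 5.5, "sepal length (cm)"]
       .median()                                # median sepal length of the selection
)

print(result)
```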

## Congratulations to everyone who found the solution!