## Data cleaning
Data cleaning is the process of removing or correcting null values and inconsistent or otherwise dirty data.
We will use the iris dataset for this example. Note that pandas can read a CSV file directly from a URL.

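This dataset contains the sepal and petal lengths and widths of three species of iris (Setosa, Versicolour, and Virginica). In scikit-learn it is stored as a 150x4 numpy.ndarray; the CSV file we load here has the same 150 rows plus a fifth column, `species`.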

The rows are the samples and the measurement columns are Sepal Length, Sepal Width, Petal Length and Petal Width.
For more information on the dataset, follow [this link](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).

In [4]:
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
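
Since data cleaning starts with null values, it can help to count them before dropping anything. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration and are not part of the iris data); the same calls should apply to `iris`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that reuses some iris column names
# and contains one deliberately missing value.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan, 4.6],
    "sepal_width": [3.5, 3.0, 3.2, 3.1],
    "species": ["setosa", "setosa", "setosa", "setosa"],
})

# Count missing values per column.
null_counts = toy.isna().sum()

# Drop every row that contains at least one null value.
cleaned = toy.dropna()

print(null_counts)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Dropping rows is only one option; depending on the analysis, filling the gaps with `fillna` may be preferable.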


*Max and min values*

In [5]:
iris.max()

sepal_length          7.9
sepal_width           4.4
petal_length          6.9
petal_width           2.5
species         virginica
dtype: object

In [6]:
iris['sepal_length'].max()

7.9

In [7]:
iris['sepal_length'].min()

4.3
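
The minimum and maximum can also be computed in one call, which makes it easier to spot values outside a plausible range. This is a sketch on hypothetical toy values, not the iris data itself; `agg` is assumed to be applied to numeric columns only.

```python
import pandas as pd

# Hypothetical numeric columns, named after two iris columns.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "petal_width": [0.2, 0.2, 0.4],
})

# One row per statistic, one column per measurement.
ranges = toy.agg(["min", "max"])
print(ranges)
```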

In [9]:
# Look for duplicate values
iris['sepal_length'].duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149     True
Name: sepal_length, Length: 150, dtype: bool
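
Note that `duplicated()` on a single column flags every repeated measurement value, which is common in real data and does not mean the sample itself is a duplicate. To look for duplicate samples, one approach is to compare whole rows. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy frame: the last row repeats the second row exactly.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 4.9],
    "sepal_width": [3.5, 3.0, 3.8, 3.0],
})

# Repeated values within one column (rows 2 and 3 are flagged).
value_repeats = toy["sepal_length"].duplicated()

# Rows that repeat an earlier row in every column (only row 3 is flagged).
row_repeats = toy.duplicated()

# Keep the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

print(value_repeats.sum(), row_repeats.sum(), len(deduped))
```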

We can group our data by a column we specify. Let's group by species.

In [12]:
iris_group = iris.groupby(by='species')

In [13]:
iris_group.mean()

| species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| setosa | 5.006 | 3.428 | 1.462 | 0.246 |
| versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 |


Or calculate the mean of a specific column:

In [14]:
iris_group['petal_width'].mean()

species
setosa        0.246
versicolor    1.326
virginica     2.026
Name: petal_width, dtype: float64
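
A grouped column can also be summarized with several statistics at once through `agg`. The sketch below uses a hypothetical toy frame with made-up values rather than the iris data.

```python
import pandas as pd

# Hypothetical toy frame with two species and made-up petal widths.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_width": [0.2, 0.4, 2.0, 2.5],
})

# One row per species, one column per statistic.
summary = toy.groupby("species")["petal_width"].agg(["mean", "min", "max"])
print(summary)
```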

Show the rows whose petal width is between 1.3 and 1.5 (inclusive):

In [17]:
iris[(iris['petal_width'] >= 1.3) & (iris['petal_width'] <= 1.5)]

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 50 | 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
| 53 | 5.5 | 2.3 | 4.0 | 1.3 | versicolor |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
| 55 | 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
| 58 | 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
| 59 | 5.2 | 2.7 | 3.9 | 1.4 | versicolor |
| 61 | 5.9 | 3.0 | 4.2 | 1.5 | versicolor |
| 63 | 6.1 | 2.9 | 4.7 | 1.4 | versicolor |
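
The two-sided comparison used in the filter can also be written with `Series.between`, which includes both endpoints by default. A short sketch on hypothetical toy values, showing that the two masks agree:

```python
import pandas as pd

# Hypothetical petal widths, two of them outside the 1.3 to 1.5 range.
toy = pd.DataFrame({"petal_width": [0.2, 1.3, 1.4, 1.5, 1.8]})

# Explicit form: combine two comparisons with &.
mask_explicit = (toy["petal_width"] >= 1.3) & (toy["petal_width"] <= 1.5)

# Equivalent form: between() is inclusive on both ends by default.
mask_between = toy["petal_width"].between(1.3, 1.5)

selected = toy[mask_between]
print(mask_explicit.equals(mask_between))
print(selected)
```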
