## Exercise on wine dataset


Let's start off by following the general workflow that we use when moving data into a DataFrame: 

* Importing pandas
* Reading data into the DataFrame
* Getting a general sense of the data
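The three workflow steps can be sketched on a tiny in-memory sample. This is a minimal illustration, not the exercise solution: the name `sample`, the `csv_text` string, and the three rows in it are assumptions made up for the example, and only the semicolon-separated layout mirrors the wine file.

```python
import io

import pandas as pd  # step 1: importing pandas

# Hypothetical three-row sample in a semicolon-separated layout;
# the values are for illustration only.
csv_text = """fixed acidity;chlorides;density;pH
7.4;0.076;0.9978;3.51
7.8;0.098;0.9968;3.20
11.2;0.075;0.9980;3.16
"""

# Step 2: reading data into the DataFrame
sample = pd.read_csv(io.StringIO(csv_text), delimiter=";")

# Step 3: getting a general sense of the data
print(sample.shape)       # (number of rows, number of columns)
print(sample.dtypes)      # data type of each column
print(sample.head())      # first rows
print(sample.describe())  # summary statistics for the numeric columns
```

With the real file, `io.StringIO(csv_text)` would simply be replaced by the path to the CSV.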

For this part, complete the following tasks:

1. Select the first 10 rows of the `chlorides` column. 
2. Select the last 10 rows of the `chlorides` column. 
3. Grab indices 264-282 of the `chlorides` **and** `density` columns.  
4. Select all rows where the `chlorides` value is less than 0.10. 
5. Now select all the rows where the `chlorides` value is greater than the column's mean (try **not** to use a hard-coded value for the mean, but instead a method).
6. Select all those rows where the `pH` is greater than 3.0 and less than 3.5.
7. Further filter the results from 6 to grab only those rows that have a `residual sugar` less than 2.0.

If you'd like some extra practice, try answering each of the questions in more than one way, since we can often select data in a couple of different ways. In this context, "selecting" rows simply means displaying them on the screen.
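As a rough illustration of "more than one way", the sketch below writes the same filter three times. The DataFrame `toy` and its values are made up for the example and are not taken from the wine data.

```python
import pandas as pd

# Hypothetical toy data; values are for illustration only.
toy = pd.DataFrame({
    "chlorides": [0.076, 0.098, 0.104, 0.062],
    "pH": [3.51, 3.20, 3.16, 3.52],
})

# Three spellings of "rows where chlorides is below 0.10"
by_mask = toy[toy["chlorides"] < 0.10]      # boolean mask
by_loc = toy.loc[toy["chlorides"] < 0.10]   # the same mask inside .loc
by_query = toy.query("chlorides < 0.10")    # query string

# All three select the same rows
print(by_mask.equals(by_loc) and by_mask.equals(by_query))
```

Which form to prefer is largely a matter of taste: masks are explicit, while `query` strings tend to stay readable when several conditions are combined.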

In [1]:
# import pandas
# read data in dataframe
import pandas as pd
df = pd.read_csv('../data/winequality-red.csv', delimiter=';')

In [2]:
#1. Select the first 10 rows of the `chlorides` column. 
#2. Select the last 10 rows of the `chlorides` column. 
# Note: a notebook cell echoes only its last expression, so only the tail is shown below.
df.chlorides.head(10)
df.chlorides.tail(10)


1589    0.073
1590    0.077
1591    0.089
1592    0.076
1593    0.068
1594    0.090
1595    0.062
1596    0.076
1597    0.075
1598    0.067
Name: chlorides, dtype: float64

In [3]:
#3. Grab indices 264-282 of the `chlorides` and `density` columns.
df.loc[264:282, ['chlorides', 'density']]

Unnamed: 0,chlorides,density
264,0.064,0.9999
265,0.071,0.9968
266,0.096,1.00025
267,0.078,0.9973
268,0.077,0.9987
269,0.104,0.9996
270,0.087,0.9965
271,0.104,0.9996
272,0.071,0.99935
273,0.076,0.99735
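One detail worth keeping in mind for this task is that `.loc` slices by label and includes the end label, whereas `.iloc` slices by position and excludes the end position. A minimal sketch, using a made-up five-row frame called `toy` (an assumption for illustration, indexed from 0):

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..4.
toy = pd.DataFrame({
    "chlorides": [0.064, 0.071, 0.096, 0.078, 0.077],
    "density": [0.9999, 0.9968, 1.00025, 0.9973, 0.9987],
})

by_label = toy.loc[1:3, ["chlorides", "density"]]  # labels 1, 2 and 3
by_position = toy.iloc[1:3]                        # positions 1 and 2 only

print(len(by_label), len(by_position))
```

By the same rule, `df.loc[264:282, ...]` covers both end labels, so it spans 19 rows when the index is the default integer range.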


In [4]:
#4. Select all rows where the `chlorides` value is less than 0.10.
df[df.chlorides < 0.1]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [5]:
#5. Now select all the rows where the `chlorides` value is greater than the column's mean
#(try not to use a hard-coded value for the mean, but instead a method.)
df.query('chlorides > chlorides.mean()')

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
10,6.7,0.580,0.08,1.8,0.097,15.0,65.0,0.99590,3.28,0.54,9.2,5
12,5.6,0.615,0.00,1.6,0.089,16.0,59.0,0.99430,3.58,0.52,9.9,5
13,7.8,0.610,0.29,1.6,0.114,9.0,29.0,0.99740,3.26,1.56,9.1,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1558,6.9,0.630,0.33,6.7,0.235,66.0,115.0,0.99787,3.22,0.56,9.5,5
1570,6.4,0.360,0.53,2.2,0.230,19.0,35.0,0.99340,3.37,0.93,12.4,6
1578,6.8,0.670,0.15,1.8,0.118,13.0,20.0,0.99540,3.42,0.67,11.3,6
1591,5.4,0.740,0.09,1.7,0.089,16.0,26.0,0.99402,3.67,0.56,11.6,6
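The same comparison against the mean can also be written with a boolean mask, or with a Python variable referenced inside `query` via the `@` prefix. The sketch below assumes a made-up frame `toy` and a helper variable `mean_chlorides`, neither of which is part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data; values are for illustration only.
toy = pd.DataFrame({"chlorides": [0.05, 0.10, 0.12, 0.30]})

# Compute the mean with a method instead of hard-coding it
mean_chlorides = toy["chlorides"].mean()

# Boolean mask
above_mask = toy[toy["chlorides"] > mean_chlorides]

# query string; @ refers to a variable in the surrounding Python scope
above_query = toy.query("chlorides > @mean_chlorides")

print(above_mask.equals(above_query))
```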


In [6]:
#6. Select all those rows where the `pH` is greater than 3.0 and less than 3.5. 
df.query('pH > 3.0 and pH < 3.5')

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
6,7.9,0.60,0.06,1.6,0.069,15.0,59.0,0.99640,3.30,0.46,9.4,5
7,7.3,0.65,0.00,1.2,0.065,15.0,21.0,0.99460,3.39,0.47,10.0,7
...,...,...,...,...,...,...,...,...,...,...,...,...
1592,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1593,6.8,0.62,0.08,1.9,0.068,28.0,38.0,0.99651,3.42,0.82,9.5,6
1594,6.2,0.60,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1596,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6


In [7]:
#7. Further filter the results from 6 to grab only those rows that have a `residual sugar` less than 2.0.
#Tip: Use backticks (``) to mask column names with spaces in query string.
df.query('pH > 3.0 and pH < 3.5 and `residual sugar` < 2.0')

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
6,7.9,0.60,0.06,1.6,0.069,15.0,59.0,0.99640,3.30,0.46,9.4,5
7,7.3,0.65,0.00,1.2,0.065,15.0,21.0,0.99460,3.39,0.47,10.0,7
10,6.7,0.58,0.08,1.8,0.097,15.0,65.0,0.99590,3.28,0.54,9.2,5
13,7.8,0.61,0.29,1.6,0.114,9.0,29.0,0.99740,3.26,1.56,9.1,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1569,6.2,0.51,0.14,1.9,0.056,15.0,34.0,0.99396,3.48,0.57,11.5,6
1576,8.0,0.30,0.63,1.6,0.081,16.0,29.0,0.99588,3.30,0.78,10.8,6
1578,6.8,0.67,0.15,1.8,0.118,13.0,20.0,0.99540,3.42,0.67,11.3,6
1590,6.3,0.55,0.15,1.8,0.077,26.0,35.0,0.99314,3.32,0.82,11.6,6
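As an alternative to the query string, the combined filter can be built from boolean masks joined with `&`. This sketch uses `Series.between` with `inclusive="neither"` for the strict pH bounds, which assumes pandas 1.3 or newer; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are for illustration only.
toy = pd.DataFrame({
    "pH": [3.0, 3.2, 3.4, 3.5, 3.3],
    "residual sugar": [1.8, 2.6, 1.9, 1.2, 1.6],
})

# Strictly between 3.0 and 3.5 (both bounds excluded)
ph_ok = toy["pH"].between(3.0, 3.5, inclusive="neither")

# Bracket notation handles the space in the column name
low_sugar = toy["residual sugar"] < 2.0

# Combine the masks element-wise with &
result = toy[ph_ok & low_sugar]

# Equivalent query string with backticks around the spaced column name
same = toy.query("pH > 3.0 and pH < 3.5 and `residual sugar` < 2.0")
print(result.equals(same))
```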


## Exercise on iris dataset

![IRIS, https://github.com/simonava5/fishers-iris-data](../images/iris.png)

After a notebook with a lot of new input, it's time to apply what you've learned on your own. 
For this purpose we will use one of the most widely used real-life datasets: the Iris dataset, which contains measurements of iris flowers. Let's learn a little bit more about the dataset by taking a closer look at it. 

In [8]:
# import pandas
import pandas as pd

In [9]:
# load the data
iris = pd.read_csv('../../Data/iris.csv')

1. Let's first have a look at the head of the table, and also at the last 10 rows.

In [10]:
display(iris.head())
display(iris.tail(10))


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
140,6.7,3.1,5.6,2.4,Iris-virginica
141,6.9,3.1,5.1,2.3,Iris-virginica
142,5.8,2.7,5.1,1.9,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


2. How many irises are in the data set?


In [11]:
display(iris.shape)

(150, 5)

8. Calculate the basic descriptive statistics for all columns of the iris data set using a single command.

In [12]:
iris.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


12. Add the sum of the sepal width and length as a new column to your data frame.

In [13]:
iris.eval("sepal_sum = sepal_length + sepal_width", inplace=True)
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_sum
0,5.1,3.5,1.4,0.2,Iris-setosa,8.6
1,4.9,3.0,1.4,0.2,Iris-setosa,7.9
2,4.7,3.2,1.3,0.2,Iris-setosa,7.9
3,4.6,3.1,1.5,0.2,Iris-setosa,7.7
4,5.0,3.6,1.4,0.2,Iris-setosa,8.6
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,9.7
146,6.3,2.5,5.0,1.9,Iris-virginica,8.8
147,6.5,3.0,5.2,2.0,Iris-virginica,9.5
148,6.2,3.4,5.4,2.3,Iris-virginica,9.6
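Besides `eval`, a derived column can be added with `assign` or with plain column assignment. The sketch below is one possible alternative, not the notebook's method; `toy` and `with_sum` are hypothetical names and the two rows are for illustration only.

```python
import pandas as pd

# Hypothetical two-row sample; values are for illustration only.
toy = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

# assign returns a new DataFrame and leaves toy unchanged
with_sum = toy.assign(sepal_sum=toy["sepal_length"] + toy["sepal_width"])
print("sepal_sum" in toy.columns)  # False: the original was not modified

# Direct column assignment modifies toy in place, like eval(..., inplace=True)
toy["sepal_sum"] = toy["sepal_length"] + toy["sepal_width"]
print(toy)
```

`assign` is handy when the original frame should stay untouched, for example in a method chain.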


18. Create a new column with a rough estimate of petal area by multiplying petal length and width together.

In [14]:
iris.eval("petal_area = petal_length * petal_width", inplace=True)
display(iris)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_sum,petal_area
0,5.1,3.5,1.4,0.2,Iris-setosa,8.6,0.28
1,4.9,3.0,1.4,0.2,Iris-setosa,7.9,0.28
2,4.7,3.2,1.3,0.2,Iris-setosa,7.9,0.26
3,4.6,3.1,1.5,0.2,Iris-setosa,7.7,0.30
4,5.0,3.6,1.4,0.2,Iris-setosa,8.6,0.28
...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,9.7,11.96
146,6.3,2.5,5.0,1.9,Iris-virginica,8.8,9.50
147,6.5,3.0,5.2,2.0,Iris-virginica,9.5,10.40
148,6.2,3.4,5.4,2.3,Iris-virginica,9.6,12.42


19. Create a new dataframe with petal areas greater than $1\,\text{cm}^2$.

In [15]:
iris_big = iris[iris.petal_area > 1]
iris_big

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_sum,petal_area
50,7.0,3.2,4.7,1.4,Iris-versicolor,10.2,6.58
51,6.4,3.2,4.5,1.5,Iris-versicolor,9.6,6.75
52,6.9,3.1,4.9,1.5,Iris-versicolor,10.0,7.35
53,5.5,2.3,4.0,1.3,Iris-versicolor,7.8,5.20
54,6.5,2.8,4.6,1.5,Iris-versicolor,9.3,6.90
...,...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica,9.7,11.96
146,6.3,2.5,5.0,1.9,Iris-virginica,8.8,9.50
147,6.5,3.0,5.2,2.0,Iris-virginica,9.5,10.40
148,6.2,3.4,5.4,2.3,Iris-virginica,9.6,12.42


20. Using the original unfiltered dataframe, create 3 new dataframes, each containing only irises of a single species: 'Iris-setosa', 'Iris-versicolor' or 'Iris-virginica'.

In [16]:
iris_setosa = iris[iris["species"]=="Iris-setosa"]
iris_setosa.head()


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_sum,petal_area
0,5.1,3.5,1.4,0.2,Iris-setosa,8.6,0.28
1,4.9,3.0,1.4,0.2,Iris-setosa,7.9,0.28
2,4.7,3.2,1.3,0.2,Iris-setosa,7.9,0.26
3,4.6,3.1,1.5,0.2,Iris-setosa,7.7,0.3
4,5.0,3.6,1.4,0.2,Iris-setosa,8.6,0.28


In [17]:
iris_versicolor = iris[iris["species"]=="Iris-versicolor"]
iris_versicolor.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_sum,petal_area
50,7.0,3.2,4.7,1.4,Iris-versicolor,10.2,6.58
51,6.4,3.2,4.5,1.5,Iris-versicolor,9.6,6.75
52,6.9,3.1,4.9,1.5,Iris-versicolor,10.0,7.35
53,5.5,2.3,4.0,1.3,Iris-versicolor,7.8,5.2
54,6.5,2.8,4.6,1.5,Iris-versicolor,9.3,6.9


In [18]:
iris_virginica = iris[iris["species"]=="Iris-virginica"]
iris_virginica.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_sum,petal_area
100,6.3,3.3,6.0,2.5,Iris-virginica,9.6,15.0
101,5.8,2.7,5.1,1.9,Iris-virginica,8.5,9.69
102,7.1,3.0,5.9,2.1,Iris-virginica,10.1,12.39
103,6.3,2.9,5.6,1.8,Iris-virginica,9.2,10.08
104,6.5,3.0,5.8,2.2,Iris-virginica,9.5,12.76
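Instead of writing one filter per species, the split can also be sketched with `groupby`, collecting one DataFrame per species in a dictionary. The names `toy` and `by_species` and the four rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are for illustration only.
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0, 1.3],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
})

# Iterating over a groupby yields (group name, sub-DataFrame) pairs
by_species = {name: group for name, group in toy.groupby("species")}

print(sorted(by_species))
print(by_species["Iris-setosa"])
```

This scales to any number of categories without repeating the filter expression.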
