## Exercises

Let's try out these data wrangling operations with the Iris dataset.

You should already have the `iris_csv.csv` dataset in your working directory from the previous exercise.

In [1]:
# We should always start with the import, although it may have been run above
import pandas as pd

In [4]:
# read the data into a dataframe called irisdf
irisdf = pd.read_csv('iris_csv.csv')

**Q1. Missing Values**

Check if there are any missing values in the `irisdf` data set.

In [6]:
# Q1 Answer
missing_values = irisdf.isnull().sum()
print(missing_values)

sepallength    0
sepalwidth     0
petallength    0
petalwidth     0
class          0
dtype: int64
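
As a small follow-up, the per-column counts can be collapsed into a single yes/no answer or a total. The sketch below uses a hypothetical toy frame (`toy`, not the Iris data) with one deliberately missing value so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy frame (not the Iris data) with one missing value
toy = pd.DataFrame({
    "sepallength": [5.1, 4.9, None],
    "sepalwidth": [3.5, 3.0, 3.2],
})

# Per-column count of missing values, as in the Q1 answer
per_column = toy.isnull().sum()

# Single boolean: is anything missing anywhere in the frame?
has_missing = bool(toy.isnull().any().any())

# Total number of missing cells across all columns
total_missing = int(toy.isnull().sum().sum())

print(per_column)
print(has_missing, total_missing)
```

For `irisdf`, where every count is 0, the same two expressions would be expected to give `False` and `0`.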


**Q2. Find Duplicates**

There are three duplicate rows. Display them.

In [7]:
# Q2 Answer
duplicate_rows = irisdf[irisdf.duplicated()]
print(duplicate_rows)

     sepallength  sepalwidth  petallength  petalwidth           class
34           4.9         3.1          1.5         0.1     Iris-setosa
37           4.9         3.1          1.5         0.1     Iris-setosa
142          5.8         2.7          5.1         1.9  Iris-virginica
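
Note that `duplicated()` by default flags only the later copies of a row and leaves the first occurrence unflagged, so the rows shown above each have an earlier twin that is not displayed. A minimal sketch on a hypothetical toy frame (not the Iris data) shows the difference between the default and `keep=False`:

```python
import pandas as pd

# Hypothetical toy frame: rows 0, 2 and 3 are identical
toy = pd.DataFrame({
    "sepallength": [4.9, 5.8, 4.9, 4.9],
    "sepalwidth": [3.1, 2.7, 3.1, 3.1],
})

# Default keep="first": the first occurrence (row 0) is not flagged
later_copies = toy[toy.duplicated()]

# keep=False: every member of a duplicate group is flagged
all_copies = toy[toy.duplicated(keep=False)]

print(list(later_copies.index))
print(list(all_copies.index))
```

Running `irisdf[irisdf.duplicated(keep=False)]` on the exercise data would list the first occurrences alongside the three rows above.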


**Q3. Calculate Mean**

Find the mean `sepallength` and store it in a variable called `mean_sepallength`.

In [9]:
# Q3 Answer
mean_sepallength = irisdf['sepallength'].mean()
print("Mean Sepal Length", mean_sepallength)

Mean Sepal Length 5.843333333333334


**Q4. Set Value**

Set the `sepallength` of the row with index 34 to the mean value found in Q3.

In [13]:
# Q4 Answer
irisdf.at[34, 'sepallength'] = mean_sepallength
print(irisdf.at[34, 'sepallength'])

5.843333333333334
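
The `.at` accessor used here reads or writes a single cell by row label and column name; `.loc` accepts the same pair of labels and also supports slices. The sketch below, on a hypothetical toy frame with non-default index labels, illustrates that both address the row by label rather than by position.

```python
import pandas as pd

# Hypothetical toy frame whose index labels are 10, 20, 30 (not 0, 1, 2)
toy = pd.DataFrame({"sepallength": [5.0, 6.0, 7.0]}, index=[10, 20, 30])

# Compute the mean before modifying anything: (5 + 6 + 7) / 3
mean_value = toy["sepallength"].mean()

# Write to the row labelled 10 (the first row), not to position 10
toy.at[10, "sepallength"] = mean_value

# Both accessors read back the same cell
via_at = toy.at[10, "sepallength"]
via_loc = toy.loc[10, "sepallength"]
print(via_at, via_loc)
```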


Check whether the number of duplicate rows has decreased by displaying the duplicated rows again. (You can execute your answer to Q2 again.)

In [14]:
duplicate_rows_after_update = irisdf[irisdf.duplicated()]
print(duplicate_rows_after_update)


     sepallength  sepalwidth  petallength  petalwidth           class
37           4.9         3.1          1.5         0.1     Iris-setosa
142          5.8         2.7          5.1         1.9  Iris-virginica


**Q5. Drop Duplicates**

Drop the duplicate rows.

In [17]:
# Q5 answer
irisdf = irisdf.drop_duplicates()
print(irisdf)


     sepallength  sepalwidth  petallength  petalwidth           class
0            5.1         3.5          1.4         0.2     Iris-setosa
1            4.9         3.0          1.4         0.2     Iris-setosa
2            4.7         3.2          1.3         0.2     Iris-setosa
3            4.6         3.1          1.5         0.2     Iris-setosa
4            5.0         3.6          1.4         0.2     Iris-setosa
..           ...         ...          ...         ...             ...
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

[148 rows x 5 columns]


Now check if there are any more duplicate values by running your answer to Q2 again.

In [18]:
duplicate_rows_after_drop = irisdf[irisdf.duplicated()]
print(duplicate_rows_after_drop)


Empty DataFrame
Columns: [sepallength, sepalwidth, petallength, petalwidth, class]
Index: []
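
After `drop_duplicates()` the remaining rows keep their original index labels, so the index has gaps where rows were removed. The exercises below rely on the original labels (for example index 34), so no renumbering is done here. If a contiguous index is wanted, one option is `reset_index(drop=True)`, sketched on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame: row 1 is a copy of row 0
toy = pd.DataFrame({
    "sepallength": [4.9, 4.9, 5.8],
    "sepalwidth": [3.1, 3.1, 2.7],
})

# Dropping duplicates keeps the original labels, leaving a gap at 1
deduped = toy.drop_duplicates()

# Optional: renumber the rows 0..n-1 and discard the old labels
renumbered = deduped.reset_index(drop=True)

print(list(deduped.index))
print(list(renumbered.index))
```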


**Q6. Set as NA**

Set the `sepalwidth` of the row with index 34 to `pd.NA` (the pandas marker for a missing value, comparable to NULL).

In [23]:
# Q6 Answer
irisdf.loc[34, 'sepalwidth'] = pd.NA
print(irisdf.loc[30:40])

    sepallength  sepalwidth  petallength  petalwidth        class
30     4.800000         3.1          1.6         0.2  Iris-setosa
31     5.400000         3.4          1.5         0.4  Iris-setosa
32     5.200000         4.1          1.5         0.1  Iris-setosa
33     5.500000         4.2          1.4         0.2  Iris-setosa
34     5.843333         NaN          1.5         0.1  Iris-setosa
35     5.000000         3.2          1.2         0.2  Iris-setosa
36     5.500000         3.5          1.3         0.2  Iris-setosa
38     4.400000         3.0          1.3         0.2  Iris-setosa
39     5.100000         3.4          1.5         0.2  Iris-setosa
40     5.000000         3.5          1.3         0.3  Iris-setosa


We should be able to view the contents of the row with index 34 using the `loc` attribute:

In [24]:
irisdf.loc[34]

sepallength       5.843333
sepalwidth             NaN
petallength            1.5
petalwidth             0.1
class          Iris-setosa
Name: 34, dtype: object
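
In the outputs above the value is displayed as `NaN` rather than `<NA>`, which suggests that the floating-point column stores the assigned `pd.NA` as NaN. Either way, `pd.isna` and `Series.isna` treat the cell as missing. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with a single float column
toy = pd.DataFrame({"sepalwidth": [3.1, 3.4, 4.1]})

# Mark the cell in the row labelled 1 as missing
toy.loc[1, "sepalwidth"] = pd.NA

# Both checks recognise the cell as missing, however it is displayed
is_missing = bool(pd.isna(toy.loc[1, "sepalwidth"]))
missing_count = int(toy["sepalwidth"].isna().sum())

print(toy)
print(is_missing, missing_count)
```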

**Q7. Find Rows with Missing Values**

Find the rows with *any* NA values.

In [25]:
# Q7 Answer
# Find the rows with any NA values (axis=1 checks across the columns of each row)
row_with_na = irisdf[irisdf.isna().any(axis=1)]
print(row_with_na)


    sepallength  sepalwidth  petallength  petalwidth        class
34     5.843333         NaN          1.5         0.1  Iris-setosa


**Q8. Drop Rows with Missing Values**

Now drop the rows with missing values, passing the argument `inplace=True` so that `irisdf` itself is modified.

In [26]:
# Q8 answer
irisdf.dropna(inplace=True)
print(irisdf)

     sepallength  sepalwidth  petallength  petalwidth           class
0            5.1         3.5          1.4         0.2     Iris-setosa
1            4.9         3.0          1.4         0.2     Iris-setosa
2            4.7         3.2          1.3         0.2     Iris-setosa
3            4.6         3.1          1.5         0.2     Iris-setosa
4            5.0         3.6          1.4         0.2     Iris-setosa
..           ...         ...          ...         ...             ...
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

[147 rows x 5 columns]


Check whether there are any more rows with NA values by running your answer to Q7 again.
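
The check can be sketched as follows on a hypothetical toy frame (not the Iris data). The sketch also illustrates that `dropna(inplace=True)` modifies the frame and returns `None`, so its result should not be assigned back to the variable.

```python
import pandas as pd

# Hypothetical toy frame with one missing value in row 1
toy = pd.DataFrame({
    "sepallength": [5.1, 4.9, 4.7],
    "sepalwidth": [3.5, None, 3.2],
})

# inplace=True changes toy directly; the call itself returns None
result = toy.dropna(inplace=True)

# Re-run the Q7-style check: no rows should be left with NA values
rows_with_na = toy[toy.isna().any(axis=1)]

print(result)
print(rows_with_na)
```

An empty DataFrame here means no missing values remain.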