# Workshop 9 - Pandas

In today's workshop, we will work with a reduced version of the [Algerian Forest Fires](https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++) dataset (please download the file from BlackBoard) [[1]](#References). The dataset describes the weather conditions in the Bejaia region of Algeria, and the presence or absence of forest fires.

To fully analyse the dataset, try and complete all the following steps:
- [Loading the dataset](#Loading-the-dataset)
- [Indexing the DataFrame](#Indexing-the-DataFrame)
    - [Exercise 1a](#Exercise-1a)
    - [Exercise 1b](#Exercise-1b)
- [Cleaning the dataset](#Cleaning-the-dataset)
    - [Exercise 2](#Exercise-2)
    - [Exercise 3](#Exercise-3)
    - [Exercise 4](#Exercise-4)
    - [Exercise 5](#Exercise-5)
- [Ensuring data type consistency](#Ensuring-data-type-consistency)
    - [Exercise 6](#Exercise-6)
- [Dataset statistics](#Dataset-statistics)
    - [Exercise 7a](#Exercise-7a)
    - [Exercise 7b](#Exercise-7b)
    - [Exercise 7c](#Exercise-7c)
- [Save your changes](#Save-your-changes)
    - [Exercise 8](#Exercise-8)
- [(References)](#References)

In [None]:
import pandas as pd
import numpy as np

### Loading the dataset

The file `fires_bejaia.csv` contains measurements from 122 days and notes whether there was a fire or not. Let us load the dataset and examine the first 5 samples:

In [None]:
fires_df = pd.read_csv('fires_bejaia.csv', index_col = 'ID')
print(fires_df.shape)
print(fires_df.columns)
fires_df.head()

The columns in this dataset signify, in order:
- **day** - day of the month 
- **month** - month of the year
- **year** - calendar year
- **Temperature** - temperature in degrees Celsius
- **RH** - Relative Humidity between 0 and 100
- **Ws** - Wind speed in km/h
- **Rain** - total rain in mm
- **FFMC** - Fine Fuel Moisture Code (FFMC) index
- **DMC** - Duff Moisture Code (DMC) index
- **DC** - Drought Code (DC) index
- **ISI** - Initial Spread Index (ISI)
- **BUI** - Buildup Index
- **FWI** - Fire Weather Index
- **Classes** - two classes, `fire` and `not fire`

### Indexing the DataFrame

#### Exercise 1a

Try and access the data with ID $6$. Remember, the `ID` column is unique and is used to index the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
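As a reminder of how label-based lookup works, here is a minimal sketch on a small made-up table (the name `toy_df` and its values are hypothetical, not taken from the workshop file). It assumes the index holds the `ID` values, as in the loading step above.

```python
import pandas as pd

# Hypothetical toy data, NOT the workshop dataset
toy_df = pd.DataFrame(
    {"ID": [4, 5, 6], "day": [4, 5, 6], "Temperature": [25, 27, 31]}
).set_index("ID")

# .loc selects by index *label*, so 6 means "the row whose ID is 6",
# not "the row at position 6"
row = toy_df.loc[6]
print(row)
```

Note that `.iloc` selects by position instead, so `toy_df.iloc[2]` would return the same row in this toy table.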

#### Exercise 1b

Now, try and access all the entries which happened on the second of the month. This can be done by checking when `day` equals 2.
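Filtering of this kind is usually done with a boolean mask. The sketch below illustrates the idea on a hypothetical toy table (`toy_df` and its values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data, NOT the workshop dataset
toy_df = pd.DataFrame(
    {"day": [1, 2, 2, 3], "month": [6, 6, 7, 7], "Rain": [0.0, 1.3, 0.0, 13.1]},
    index=pd.Index([1, 2, 32, 33], name="ID"),
)

# Comparing a column with a value gives a boolean Series (the mask)
mask = toy_df["day"] == 2

# Indexing with the mask keeps only the rows where it is True
second_days = toy_df[mask]
print(second_days)
```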

### Cleaning the dataset

Let us start cleaning up the dataset and preparing it for further use.

#### Exercise 2

If you look at the list of column names above, you will notice that some of them have additional whitespace (sometimes on both ends). _Rename_ all the columns in the [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) by assigning to [`DataFrame.columns`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html). You can use Python's [`str.strip()`](https://docs.python.org/3/library/stdtypes.html#str.strip) method to remove the whitespace.
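One possible way to do this is sketched below on a hypothetical one-row table whose column names were invented to contain stray spaces. It is an illustration of the technique, not the only approach.

```python
import pandas as pd

# Hypothetical toy data with whitespace around some column names
toy_df = pd.DataFrame([[29, 57, 0.0]], columns=["Temperature", " RH", "Rain "])

# Build a cleaned list of names and assign it back to .columns
toy_df.columns = [name.strip() for name in toy_df.columns]
print(list(toy_df.columns))
```

The vectorised form `toy_df.columns.str.strip()` should give the same result.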

### Missing values

#### Exercise 3

Are there any missing values in the dataset? Get a summary telling you how many null/undefined values there are for each feature (column) of the dataset.
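A common pattern for this kind of summary is to combine `isna()` with `sum()`. The sketch below uses a hypothetical toy table with a few deliberately missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries
toy_df = pd.DataFrame(
    {
        "Temperature": [29.0, np.nan, 26.0],
        "RH": [57.0, 61.0, np.nan],
        "Ws": [18.0, 13.0, 22.0],
    }
)

# isna() marks each missing cell as True; sum() then counts the
# True values column by column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```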

### Removal of missing values

#### Exercise 4

First, use [`DataFrame.dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) to remove all the rows which contain 2 or more missing values. Update your `fires_df` [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to reflect the changes you've made.

How many samples were removed from the dataset?
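The `thresh` parameter of `dropna()` gives the minimum number of *non-missing* values a row needs in order to be kept, so "drop rows with 2 or more missing values" translates to "keep rows with at least (number of columns - 1) values present". The sketch below shows this on a hypothetical three-column toy table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 3 have two or more missing values
toy_df = pd.DataFrame(
    {
        "Temperature": [29.0, np.nan, 26.0, np.nan],
        "RH": [57.0, np.nan, np.nan, np.nan],
        "Ws": [18.0, 13.0, 22.0, np.nan],
    }
)

rows_before = len(toy_df)

# A row may have at most one missing value, so it needs at least
# (number of columns - 1) non-missing values to be kept
min_present = toy_df.shape[1] - 1
cleaned = toy_df.dropna(thresh=min_present)

rows_removed = rows_before - len(cleaned)
print(rows_removed)
```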

### Imputing missing values

#### Exercise 5

All features between `Temperature` and `FWI` are numerical features. Use [`DataFrame.fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) to replace the missing values in each of these ten features with that feature's [`mean()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html).

On the other hand, all features between `day` and `year`, while numerical, are expressed as whole (integer) numbers. (A measurement is never taken on day 21.7 of month 3.2.) Replace any missing values of these features with the [`mode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html) of each feature. (Remember, since the [`mode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html) function returns a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) or a [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html), you need to access the first row of its output with `.loc[0]`.)

Update your `fires_df` [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to reflect the changes you've made.
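The two imputation strategies can be sketched as follows on a hypothetical toy table. The column lists `float_cols` and `int_like_cols` are illustrative names. On the real data you would select the corresponding column ranges instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the workshop dataset
toy_df = pd.DataFrame(
    {
        "day": [1.0, 2.0, 2.0, np.nan],
        "Temperature": [30.0, np.nan, 26.0, 28.0],
        "Rain": [0.0, 1.0, np.nan, 2.0],
    }
)

float_cols = ["Temperature", "Rain"]  # illustrative choice of columns
int_like_cols = ["day"]               # illustrative choice of columns

# mean() returns one value per column; fillna() matches them by column name
toy_df[float_cols] = toy_df[float_cols].fillna(toy_df[float_cols].mean())

# mode() returns a DataFrame (there can be ties), so take its first row
first_mode = toy_df[int_like_cols].mode().loc[0]
toy_df[int_like_cols] = toy_df[int_like_cols].fillna(first_mode)

print(toy_df)
```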

### Ensuring data type consistency

#### Exercise 6

Let us specify the types of some of our features/columns. Using [`DataFrame.astype()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html), make the following changes:
- change `'day'` first into `'int'` data type (to get rid of decimal places) and then into `'category'` data type
- change `'month'` first into `'int'` data type (to get rid of decimal places) and then into `'category'` data type
- change `'year'` first into `'int'` data type (to get rid of decimal places) and then into `'category'` data type
- change `'Temperature'` into `'int'` data type (all the original temperatures were recorded as integers)
- change `'Classes'` directly into the `'category'` data type
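The conversions listed above can be chained, as in this sketch on a hypothetical toy table. It applies `astype()` column by column. Passing a dictionary of column names to `DataFrame.astype()` is another option.

```python
import pandas as pd

# Hypothetical toy data stored as floats and plain strings
toy_df = pd.DataFrame(
    {
        "day": [1.0, 2.0, 2.0],
        "Temperature": [29.0, 26.0, 31.0],
        "Classes": ["not fire", "fire", "fire"],
    }
)

# float -> int drops the decimal part, then int -> category
toy_df["day"] = toy_df["day"].astype("int").astype("category")

# float -> int only
toy_df["Temperature"] = toy_df["Temperature"].astype("int")

# string labels -> category
toy_df["Classes"] = toy_df["Classes"].astype("category")

print(toy_df.dtypes)
```

Keep in mind that converting floats to `'int'` truncates rather than rounds, which matters for any value that was imputed with a non-integer mean.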

### Dataset statistics

#### Exercise 7a

Find the [`max`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html) value of wind speed (column `'Ws'`) across the whole dataset.

#### Exercise 7b

Find the [`mean`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html) of the `'Temperature'` feature for the `'not fire'` class.
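Both kinds of statistic (over a whole column, and over a filtered subset) can be sketched on a hypothetical toy table as follows. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, NOT the workshop dataset
toy_df = pd.DataFrame(
    {
        "Ws": [18, 13, 22, 15],
        "Temperature": [29, 26, 33, 31],
        "Classes": ["not fire", "not fire", "fire", "fire"],
    }
)

# Statistic over a whole column
max_ws = toy_df["Ws"].max()

# Statistic over a subset: select rows with a boolean mask and the
# column of interest in one .loc call, then aggregate
not_fire_mean = toy_df.loc[toy_df["Classes"] == "not fire", "Temperature"].mean()

print(max_ws, not_fire_mean)
```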

#### Exercise 7c

Using the [`DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method, find the [`mean`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html) of _all_ the features, per class (i.e. grouped by the `'Classes'` column).

As an additional task, can you display the mean for only the `'Temperature'` and `'Rain'` features?
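A minimal sketch of grouped means on a hypothetical toy table is given below. It passes `numeric_only=True` on the assumption that some columns (such as a categorical `month`) cannot be averaged. Recent pandas versions raise an error for such columns otherwise.

```python
import pandas as pd

# Hypothetical toy data, NOT the workshop dataset
toy_df = pd.DataFrame(
    {
        "month": pd.Series([6, 6, 7, 7]).astype("category"),
        "Temperature": [29, 26, 33, 31],
        "Rain": [0.0, 2.0, 0.0, 1.0],
        "Classes": ["not fire", "not fire", "fire", "fire"],
    }
)

# Mean of every numeric feature, one row per class;
# numeric_only=True skips columns that cannot be averaged
class_means = toy_df.groupby("Classes").mean(numeric_only=True)

# Select a list of columns after groupby() to restrict the output
subset_means = toy_df.groupby("Classes")[["Temperature", "Rain"]].mean()

print(class_means)
print(subset_means)
```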

### Save your changes

In the next workshop, we will use the cleaned-up version of the dataset that we have prepared today. Therefore, you should save your work into an updated dataset file.

#### Exercise 8

Finally, save the modified [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) into a new file called `fires_cleaned.csv`, using [`DataFrame.to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).
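The round trip of writing and re-reading a CSV file can be sketched as below. The example uses a hypothetical toy table and a temporary directory with an invented file name (`toy_cleaned.csv`) so that it does not touch your workshop files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data indexed by ID, NOT the workshop dataset
toy_df = pd.DataFrame(
    {"Temperature": [29, 26], "Classes": ["not fire", "fire"]},
    index=pd.Index([1, 2], name="ID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_cleaned.csv")

    # By default the index is written as the first column ("ID" here)
    toy_df.to_csv(out_path)

    # Reading it back with index_col restores the ID index
    reloaded = pd.read_csv(out_path, index_col="ID")

print(reloaded)
```

A CSV file stores plain text only, so `'category'` data types are not preserved and would need to be set again after loading.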

## References

[1] _Abid, Faroudja, and Nouma Izeboudjen. "Predicting forest fire in Algeria using data mining techniques: Case study of the decision tree algorithm." International Conference on Advanced Intelligent Systems for Sustainable Development. Springer, Cham, 2020._