## Coding Assignment 3: Forest Fires [SOLUTIONS]

### Introduction to Data Science
#### Last Updated: August 20, 2022

---  

### Skills Assessed

You will demonstrate these skills in the HW:
- pandas operations: subsetting
- pandas operations: creating new columns
- pandas operations: summarizing data
- pandas operations: aggregating data
- pandas operations: lambda functions
---

### Instructions

You will show off your pandas skills in this assignment.

The dataset records the burned area of forest fires in the northeast region of Portugal, along with meteorological and other data.

NOTE: All questions are independent. For example, if question 2 asks you to filter a dataframe, the filter is only applied for question 2.

**Data source:** https://archive.ics.uci.edu/ml/machine-learning-databases/forest-fires/

### About the data columns

   1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
   2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
   3. month - month of the year: "jan" to "dec" 
   4. day - day of the week: "mon" to "sun"
   5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
   6. DMC - DMC index from the FWI system: 1.1 to 291.3 
   7. DC - DC index from the FWI system: 7.9 to 860.6 
   8. ISI - ISI index from the FWI system: 0.0 to 56.10
   9. temp - temperature in Celsius degrees: 2.2 to 33.30
   10. RH - relative humidity in %: 15 to 100
   11. wind - wind speed in km/h: 0.40 to 9.40 
   12. rain - outside rain in mm/m2 : 0.0 to 6.4 
   13. area - the burned area of the forest (in ha): 0.00 to 1090.84 
   (this output variable is very skewed towards 0.0, thus it may make
    sense to model with the logarithm transform). 
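
The note on item 13 suggests a logarithm transform for the skewed `area` column. A minimal sketch of one way to do this follows. It assumes `np.log1p` (log of 1 + x) as the transform, since a plain logarithm is undefined for the many zero areas; the `toy` frame and its values are made up for illustration and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy burned-area values (hypothetical, chosen to mimic the skew toward 0.0)
toy = pd.DataFrame({"area": [0.0, 0.0, 0.36, 10.01, 1090.84]})

# log1p computes log(1 + x), so a zero area maps to 0 instead of -inf
toy["log_area"] = np.log1p(toy["area"])

# expm1 inverts log1p and recovers the original scale
toy["area_back"] = np.expm1(toy["log_area"])

print(toy)
```

Other choices, such as adding a different constant before taking the log, are equally possible; the dataset description does not specify one.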

**TOTAL POINTS: 10**

---

You'll work with pandas dataframes, so import the module:

In [2]:
import pandas as pd

1) (1 PT) Read the fire data into a dataframe and print the first 4 rows of data (not including the header).

In [3]:
df = pd.read_csv('../datasets/forestfires.csv')
df.head(4)

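| | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 |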


2) (1 PT) Determine the unique month values. Next, sort them and print them.

In [4]:
months = df.month.unique()
months.sort()
print(months)

['apr' 'aug' 'dec' 'feb' 'jan' 'jul' 'jun' 'mar' 'may' 'nov' 'oct' 'sep']
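
Sorting the month abbreviations as strings gives alphabetical order, not calendar order. If calendar order is wanted, one possible approach is to sort with an explicit reference list. The sketch below uses a small made-up series (`toy_months`) and a hypothetical `calendar` list; neither name comes from the assignment.

```python
import pandas as pd

# Toy month values (hypothetical sample using the dataset's abbreviations)
toy_months = pd.Series(["mar", "oct", "oct", "aug", "sep", "jan"])

# Reference list that defines calendar order
calendar = ["jan", "feb", "mar", "apr", "may", "jun",
            "jul", "aug", "sep", "oct", "nov", "dec"]

# Sort the unique values by their position in the reference list
ordered = sorted(toy_months.unique(), key=calendar.index)

print(ordered)
```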


3) (1 PT) Create a new column called *temp_f* and append it to the dataframe.   
It will be the *temp* column converted to Fahrenheit (look up the formula).

Print the first four rows of data.  
HINT: Use a lambda function.

In [5]:
df['temp_f'] = df.temp.apply(lambda x: x * 1.8 + 32)
df.head(4)

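| | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area | temp_f |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 | 46.76 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 | 64.4 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 | 58.28 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 | 46.94 |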
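
The question asks for a lambda, but the same conversion can also be written as plain column arithmetic, which pandas applies element-wise. The sketch below compares the two on a toy frame holding the four `temp` values shown above; the frame name `toy` and the column names `temp_f_apply` and `temp_f_vec` are illustrative only.

```python
import pandas as pd

# The first four temp values from the output above, in Celsius
toy = pd.DataFrame({"temp": [8.2, 18.0, 14.6, 8.3]})

# Version 1: apply a lambda to each element (as the hint suggests)
toy["temp_f_apply"] = toy["temp"].apply(lambda c: c * 1.8 + 32)

# Version 2: arithmetic on the whole column at once
toy["temp_f_vec"] = toy["temp"] * 1.8 + 32

print(toy)
```

Both columns hold the same values; the arithmetic form is usually shorter and avoids a Python-level function call per row.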


4) (1 PT) Select the following columns (together) from the dataframe: `day`, `month`,`area`.  
Print the rows from these columns where the area is greater than 10 and less than 10.5.

In [6]:
df.loc[(df.area > 10) & (df.area < 10.5), ['day', 'month', 'area']]

| | day | month | area |
|---|---|---|---|
| 194 | tue | aug | 10.01 |
| 195 | fri | aug | 10.02 |
| 242 | sun | aug | 10.13 |
| 254 | thu | aug | 10.34 |
| 474 | thu | jun | 10.08 |
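
The row filter can be expressed in more than one way. As a sketch, the example below builds a small made-up frame (`toy`, not dataset rows) and shows that a boolean mask with `.loc` and a `DataFrame.query` string select the same rows. Note that both bounds are strict, so a value of exactly 10.5 is excluded.

```python
import pandas as pd

# Toy rows (hypothetical values) covering both sides of each bound
toy = pd.DataFrame({
    "day": ["tue", "fri", "sun", "mon"],
    "month": ["aug", "aug", "aug", "sep"],
    "area": [10.01, 10.5, 10.13, 9.9],
})

# Option 1: boolean mask combined with & and passed to .loc
mask = (toy["area"] > 10) & (toy["area"] < 10.5)
by_loc = toy.loc[mask, ["day", "month", "area"]]

# Option 2: the same condition written as a query string
by_query = toy.query("area > 10 and area < 10.5")[["day", "month", "area"]]

print(by_loc)
```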


5) (1 PT) Compute the mean `area`, grouped (aggregated) by `month`.  

Print the result; it should only contain the columns: `area` and `month`

In [7]:
df[['area','month']].groupby(by='month').mean()

| month | area |
|---|---|
| apr | 8.891111 |
| aug | 12.489076 |
| dec | 13.33 |
| feb | 6.275 |
| jan | 0.0 |
| jul | 14.369687 |
| jun | 5.841176 |
| mar | 4.356667 |
| may | 19.24 |
| nov | 0.0 |
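
In the result above, `month` is the index of the grouped output, not an ordinary column. If a flat frame with `month` and `area` as regular columns is preferred, one option is `as_index=False`. The sketch below uses made-up values in a frame named `toy`; it is an alternative to the solution, not a replacement for it.

```python
import pandas as pd

# Toy data (hypothetical values) with two months that repeat
toy = pd.DataFrame({
    "month": ["aug", "aug", "sep", "sep", "mar"],
    "area": [1.0, 3.0, 0.0, 10.0, 4.0],
})

# as_index=False keeps the grouping key as a regular column
by_month = toy.groupby("month", as_index=False)["area"].mean()

print(by_month)
```

Calling `.reset_index()` on the indexed result would give the same layout.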


6) (1 PT) Print the datatype of each column in the dataframe. Hint: there is a function to do this.

In [8]:
df.dtypes

X           int64
Y           int64
month      object
day        object
FFMC      float64
DMC       float64
DC        float64
ISI       float64
temp      float64
RH          int64
wind      float64
rain      float64
area      float64
temp_f    float64
dtype: object

7) (3 PTS) For this part, you will break the columns into separate lists by data type. This is often useful for preprocessing predictors in machine learning.

- Save the categorical column names in a list named `cats`
- Save the continuous variable column names in a list named `con`. These variables have float (decimal) values.
- Save the discrete variable column names in a list called `disc`. These variables have integer values.

Hint: Use what you learned from task 6. You can also use the `value_counts()` function to explore the values of a column like this:

In [9]:
df.X.value_counts()

4    91
6    86
2    73
8    61
7    60
3    55
1    48
5    30
9    13
Name: X, dtype: int64

In [10]:
cats = ['month','day']
con = ['FFMC','DMC','DC','ISI','temp','wind','rain','area','temp_f']
disc = ['X','Y','RH']
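
The lists above are typed by hand from the `dtypes` output. As a sketch of how the same split could be derived programmatically, the example below uses the `is_float_dtype` and `is_integer_dtype` helpers from `pandas.api.types` on a small made-up frame (`toy`). It assumes that every column that is neither float nor integer should count as categorical, which holds for this dataset but not in general.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Toy frame (hypothetical rows) with one column of each kind plus an extra integer
toy = pd.DataFrame({
    "X": [7, 7],
    "month": ["mar", "oct"],
    "temp": [8.2, 18.0],
    "RH": [51, 33],
})

# Continuous: columns stored as floats
con = [c for c in toy.columns if is_float_dtype(toy[c])]

# Discrete: columns stored as integers
disc = [c for c in toy.columns if is_integer_dtype(toy[c])]

# Categorical: everything else (assumption: no other dtypes are present)
cats = [c for c in toy.columns if c not in con and c not in disc]

print(cats, con, disc)
```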

---

8) (1 PT) Using the list `disc`, subset those columns in the dataframe and print the first four rows of those columns. 

In [11]:
df[disc].head(4)

| | X | Y | RH |
|---|---|---|---|
| 0 | 7 | 5 | 51 |
| 1 | 7 | 4 | 33 |
| 2 | 7 | 4 | 33 |
| 3 | 8 | 6 | 97 |


---