# Workshop 11 - SciPy

In today's workshop, we will continue working with the reduced dataset of [Algerian Forest Fires](https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++) [[1]](#References).

To solve today's workshop, you will need `fires_cleaned.csv`, which is the final output of the _Pandas_ workshop. If you have successfully solved that workshop, or worked through the solution notebook, you will already have the file in your working folder. It has also been uploaded to Blackboard for your convenience.

Today, we are going to be analysing the relationship between different features in this dataset:
- [Exercise 1](#Exercise-1)
- [Exercise 2](#Exercise-2)
- [Exercise 3a](#Exercise-3a)
- [Exercise 3b](#Exercise-3b)
- [Exercise 4](#Exercise-4)
- [Exercise 5](#Exercise-5)
- [References](#References)

In [1]:
import pandas as pd
from scipy import stats

Let us load the dataset and remind ourselves of the meaning of the various features in it:

In [2]:
fires_df = pd.read_csv('fires_cleaned.csv')

The dataset contains the following features.

| Feature | Type | Description |
|:--- | :--| :---|
| **day** | int | day of the month  |
| **month** | int | month of the year |
| **year** | int | calendar year |
| **Temperature** | int | temperature in degrees Celsius |
| **RH** | float | Relative Humidity between 0 and 100 |
| **Ws** | float | Wind speed in km/h |
| **Rain** | float | total rain in mm | 
| **FFMC** | float | Fine Fuel Moisture Code (FFMC) index |
| **DMC** | float | Duff Moisture Code (DMC) index |
| **DC** | float | Drought Code (DC) index |
| **ISI** | float | Initial Spread Index (ISI) |
| **BUI** | float | Buildup Index |
| **FWI** | float | Fire Weather Index |
| **Classes** | categorical | two classes, `fire` and `not fire` |

For the purposes of this workshop, we will treat **day, month and year** as **categorical** features, while all other features (except "Classes") will be treated as **numerical** features.

## Exercise 1

Get a feel for the dataset again:
- How many samples does it have?
- How many features?
- Can you get the names of the columns?
- Can you count how many samples with `fire` and `not fire`?
- Can you print the first 5 samples?

In [3]:
print(fires_df.shape)
print(fires_df.columns)
print(fires_df.Classes.value_counts())
fires_df.head()

(117, 15)
Index(['ID', 'day', 'month', 'year', 'Temperature', 'RH', 'Ws', 'Rain', 'FFMC',
       'DMC', 'DC', 'ISI', 'BUI', 'FWI', 'Classes'],
      dtype='object')
Classes
not fire    59
fire        58
Name: count, dtype: int64


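| | ID | day | month | year | Temperature | RH | Ws | Rain | FFMC | DMC | DC | ISI | BUI | FWI | Classes |
|:--- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 0 | 1 | 6 | 2012 | 29 | 57.0 | 18.0 | 0.0 | 65.7 | 3.4 | 7.6 | 1.3 | 3.4 | 0.5 | not fire |
| 1 | 1 | 2 | 6 | 2012 | 29 | 61.0 | 13.0 | 1.3 | 64.4 | 4.1 | 7.6 | 1.0 | 3.9 | 0.4 | not fire |
| 2 | 2 | 3 | 6 | 2012 | 26 | 82.0 | 22.0 | 13.1 | 47.1 | 2.5 | 7.1 | 0.3 | 2.7 | 0.1 | not fire |
| 3 | 3 | 4 | 6 | 2012 | 25 | 89.0 | 13.0 | 2.5 | 28.6 | 1.3 | 6.9 | 0.0 | 1.7 | 0.0 | not fire |
| 4 | 4 | 5 | 6 | 2012 | 27 | 77.0 | 16.0 | 0.0 | 74.926087 | 3.0 | 14.2 | 1.2 | 3.9 | 0.5 | not fire |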


## Exercise 2

Let us first check the distribution of our numerical features (remember: all the features from `'Temperature'` to `'FWI'`).

Do any of them come from a normal distribution?

Note: you can check the distribution of each individual column with [`scipy.stats.shapiro()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html) or you could try and apply this SciPy function to the whole dataframe using [`pandas.DataFrame.apply()`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) to calculate it for all the columns at once.

In [4]:
fires_df.loc[:, 'Temperature':'FWI'].apply(stats.shapiro)
# fires_df.loc[:, 'Temperature':'FWI'].apply(stats.shapiro).loc[1] # --> gets only the p-values
# none of the features have a normal distribution since all the p-values < 0.05

| | Temperature | RH | Ws | Rain | FFMC | DMC | DC | ISI | BUI | FWI |
|:--- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 0.963628 | 0.97332 | 0.960364 | 0.377147 | 0.8397259 | 0.8290474 | 0.8363994 | 0.915311 | 0.8264819 | 0.8321198 |
| 1 | 0.002934 | 0.019617 | 0.001603 | 4.107902e-20 | 6.267603e-10 | 2.524821e-10 | 4.70221e-10 | 2e-06 | 2.04071e-10 | 3.266987e-10 |
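As a complement to the cell above, here is a minimal sketch of how the p-values could be turned into a per-column verdict. It runs on hypothetical stand-in data (`demo_df`, with made-up column names), not on the fires dataset, so that it is self-contained; the 5% threshold is the same one used in the comments above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in data (NOT the fires dataset): one roughly normal
# column and one strongly right-skewed column.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'normal_like': rng.normal(loc=30, scale=3, size=200),
    'skewed': rng.exponential(scale=1.0, size=200),
})

# apply() calls stats.shapiro once per column; keep only the p-value
p_values = demo_df.apply(lambda col: stats.shapiro(col).pvalue)

# True where the hypothesis of normality is rejected at the 5% level
rejected = p_values < 0.05

print(pd.DataFrame({'p_value': p_values, 'normality_rejected': rejected}))
```

With the workshop data, the same two steps could be applied to `fires_df.loc[:, 'Temperature':'FWI']` in place of `demo_df`.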


## Exercise 3a

Now that you have determined whether the features have a normal distribution or not, you can calculate the correlation between `'Temperature'` and `'Rain'` as well as `'Temperature'` and `'DMC'`.

Which is the correct correlation function to use between [`scipy.stats.pearsonr()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) and [`scipy.stats.spearmanr()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html)?

What can you say about the correlation between temperature, rain and the DMC index?

In [5]:
print(stats.spearmanr(fires_df.Temperature, fires_df.Rain))
# negative correlation between temperature and rain

print(stats.spearmanr(fires_df.Temperature, fires_df.DMC))
# positive correlation between temperature and DMC

SignificanceResult(statistic=-0.42184278198055325, pvalue=2.1695379088826672e-06)
SignificanceResult(statistic=0.62831333066368, pvalue=3.36560171791536e-14)
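To illustrate why the choice of coefficient matters, the following sketch compares the two functions on hypothetical data (the arrays `x` and `y` are invented for this example). Pearson's coefficient measures linear association, while Spearman's works on ranks and therefore captures any monotonic relationship.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y increases monotonically, but not linearly, with x
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 3)

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

# Spearman sees a perfect monotonic relationship; Pearson is pulled
# below 1 because the relationship is not a straight line
print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```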


## Exercise 3b

The [`pandas.DataFrame.corr()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) method also has a `method` argument, with which you can choose the correlation coefficient used to build the correlation table.

Calculate the correlation table with the correct choice of correlation coefficient for all numerical features (all the features from `'Temperature'` to `'FWI'`).

In [6]:
# using spearman correlation as none of the features are normally distributed
fires_df.loc[:, 'Temperature':'FWI'].corr(method='spearman')

| | Temperature | RH | Ws | Rain | FFMC | DMC | DC | ISI | BUI | FWI |
|:--- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| Temperature | 1.0 | -0.668241 | -0.138058 | -0.421843 | 0.74107 | 0.628313 | 0.622496 | 0.720339 | 0.655657 | 0.700823 |
| RH | -0.668241 | 1.0 | 0.214009 | 0.371166 | -0.662768 | -0.472881 | -0.451026 | -0.657525 | -0.49462 | -0.587679 |
| Ws | -0.138058 | 0.214009 | 1.0 | 0.191259 | -0.088878 | 0.071255 | 0.11185 | -0.006769 | 0.071476 | -0.012904 |
| Rain | -0.421843 | 0.371166 | 0.191259 | 1.0 | -0.772376 | -0.573234 | -0.551779 | -0.750137 | -0.587719 | -0.727784 |
| FFMC | 0.74107 | -0.662768 | -0.088878 | -0.772376 | 1.0 | 0.809545 | 0.769714 | 0.972936 | 0.82279 | 0.944665 |
| DMC | 0.628313 | -0.472881 | 0.071255 | -0.573234 | 0.809545 | 1.0 | 0.917563 | 0.824083 | 0.970926 | 0.898263 |
| DC | 0.622496 | -0.451026 | 0.11185 | -0.551779 | 0.769714 | 0.917563 | 1.0 | 0.79362 | 0.954105 | 0.870695 |
| ISI | 0.720339 | -0.657525 | -0.006769 | -0.750137 | 0.972936 | 0.824083 | 0.79362 | 1.0 | 0.837409 | 0.961299 |
| BUI | 0.655657 | -0.49462 | 0.071476 | -0.587719 | 0.82279 | 0.970926 | 0.954105 | 0.837409 | 1.0 | 0.909922 |
| FWI | 0.700823 | -0.587679 | -0.012904 | -0.727784 | 0.944665 | 0.898263 | 0.870695 | 0.961299 | 0.909922 | 1.0 |
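Note that `DataFrame.corr()` returns only the coefficients, not the p-values. The sketch below, on hypothetical toy data (`demo_df` and its columns are invented for illustration), checks that an entry of the pandas table agrees with the coefficient reported by `scipy.stats.spearmanr()`, which also returns a p-value when one is needed.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data (NOT the fires dataset)
demo_df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'b': [2.1, 1.9, 3.5, 3.3, 6.0, 5.2],
    'c': [9.0, 7.5, 8.0, 4.0, 3.0, 1.0],
})

# Full table of Spearman coefficients (symmetric, ones on the diagonal)
corr_table = demo_df.corr(method='spearman')

# Coefficient and p-value for a single pair of columns
rho_ab, p_ab = stats.spearmanr(demo_df['a'], demo_df['b'])

print(corr_table)
print("table entry matches SciPy:", abs(corr_table.loc['a', 'b'] - rho_ab) < 1e-12)
```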


## Exercise 4

Now, try to check whether the fires are **independent** of the **month** in the year. For this, you will need to calculate the _cross tabulation_ between the `'month'` and `'Classes'` columns using [`pandas.crosstab()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html).

To check for independence of categorical variables, you can use the chi-squared statistic on the cross tabulation, also called a _contingency table_, with [`scipy.stats.chi2_contingency()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html).

Also check whether the fires are **independent of the day** in the month.

**Do the results confirm your intuition?**

In [7]:
crosstab = pd.crosstab(fires_df.month, fires_df.Classes)
print(crosstab)
_, p, _, _ = stats.chi2_contingency(crosstab)
display(p)
# since p < 0.05, we reject the hypothesis of independence. Presence of fire depends on the month of the year

Classes  fire  not fire
month                  
6          13        16
7          14        14
8          24         8
9           7        21


0.0015752584228439868
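The cell above covers the month only. For the day of the month, the same two steps apply, and one possible way to reuse them is to wrap them in a small function. The sketch below is an assumption-laden illustration: `independence_check` is a hypothetical helper (not part of pandas or SciPy), and it is demonstrated on invented toy data rather than on the fires dataset. It also returns the smallest expected cell count, since the chi-squared approximation is generally considered unreliable when expected counts are small (a common rule of thumb is at least 5 per cell).

```python
import pandas as pd
from scipy import stats


def independence_check(df, row_col, col_col):
    """Chi-squared test of independence between two categorical columns.

    Hypothetical helper, not part of pandas or SciPy.
    Returns the p-value and the smallest expected cell count.
    """
    table = pd.crosstab(df[row_col], df[col_col])
    _, p_value, _, expected = stats.chi2_contingency(table)
    return p_value, expected.min()


# Hypothetical toy data (NOT the fires dataset)
toy_df = pd.DataFrame({
    'group': ['a'] * 20 + ['b'] * 20,
    'outcome': ['yes'] * 18 + ['no'] * 2 + ['yes'] * 4 + ['no'] * 16,
})

p_value, min_expected = independence_check(toy_df, 'group', 'outcome')
print(f"p-value = {p_value:.2g}, smallest expected count = {min_expected:.1f}")
```

With the workshop data, the call would be `independence_check(fires_df, 'day', 'Classes')`. Because 117 samples are spread over many distinct days, the expected counts per cell are likely to be well below 5, so the resulting p-value should be interpreted with caution.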

## Exercise 5

Finally, check whether the amount of `'Rain'` comes from the same distribution on the days with fire (`'fire'`) and on the days with no fire (`'not fire'`) -- is the amount of rain different on days with fire and days with no fire?

Similarly, check whether there is a difference in wind speed (`'Ws'`) between days with fire and days with no fire.

**Do the results confirm your intuition?**

In [8]:
# select the info about the amount of rain on the days with a fire
fire_rain = fires_df[fires_df.Classes=='fire'].Rain
# select the info about the amount of rain on the days without a fire
nofire_rain = fires_df[fires_df.Classes=='not fire'].Rain

# check once again if the amount of rain is normally distributed
# in the case of the days with a fire
print(stats.shapiro(fire_rain))
# and in case of the days without a fire
print(stats.shapiro(nofire_rain))
# neither is normally distributed, as p < 0.05

# (we use kruskal rather than anova, as the amount of rain is not normally distributed in either case)
# check if the medians are the same
print(stats.kruskal(fire_rain, nofire_rain))
# since p < 0.05, we conclude that medians are not the same

# you can repeat the above steps for Ws, or as an alternative do the following:
# check if the wind speed is normally distributed on days with fire/no fire
print(fires_df.groupby('Classes').Ws.apply(stats.shapiro)) # since p < 0.05 in both cases, the answer is no
# check if median wind speed is the same on the days with fire and not fire
print(stats.kruskal(*fires_df.groupby('Classes').Ws.apply(list).values)) # since p > 0.05, medians may be the same

ShapiroResult(statistic=0.2937174439430237, pvalue=3.3761619429230772e-15)
ShapiroResult(statistic=0.5192348957061768, pvalue=1.2990044153185498e-12)
KruskalResult(statistic=55.90732524914059, pvalue=7.596897330775139e-14)
Classes
fire        (0.9464011192321777, 0.012477301061153412)
not fire    (0.9540866613388062, 0.026161661371588707)
Name: Ws, dtype: object
KruskalResult(statistic=0.1413468681274714, pvalue=0.7069456976280015)
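If the same comparison is needed for several features, the group-and-test pattern can be wrapped in a function. This is a minimal sketch under stated assumptions: `compare_by_class` is a hypothetical helper (not part of pandas or SciPy), and the toy data, including the column names `label`, `shifted` and `similar`, is invented so that the example is self-contained.

```python
import pandas as pd
from scipy import stats


def compare_by_class(df, value_col, class_col):
    """Kruskal-Wallis test of value_col across the groups in class_col.

    Hypothetical helper, not part of pandas or SciPy.
    """
    # One array of values per group, in the order groupby yields them
    samples = [group.to_numpy() for _, group in df.groupby(class_col)[value_col]]
    return stats.kruskal(*samples)


# Hypothetical toy data (NOT the fires dataset)
toy_df = pd.DataFrame({
    'label': ['x'] * 10 + ['y'] * 10,
    # clearly different between the two groups
    'shifted': [0.0, 0.0, 0.1, 0.0, 0.2, 0.0, 0.0, 0.3, 0.0, 0.1,
                1.2, 0.8, 2.5, 0.9, 3.1, 1.7, 0.6, 4.0, 1.1, 2.2],
    # nearly identical between the two groups
    'similar': [5.0, 6.1, 5.5, 7.2, 6.8, 5.9, 6.3, 7.0, 5.2, 6.6,
                5.1, 6.0, 5.6, 7.1, 6.9, 5.8, 6.4, 7.3, 5.3, 6.5],
})

shifted_result = compare_by_class(toy_df, 'shifted', 'label')
similar_result = compare_by_class(toy_df, 'similar', 'label')

print("shifted:", shifted_result.pvalue < 0.05)  # difference detected
print("similar:", similar_result.pvalue < 0.05)  # no difference detected
```

With the workshop data, the calls would be `compare_by_class(fires_df, 'Rain', 'Classes')` and `compare_by_class(fires_df, 'Ws', 'Classes')`. Keep in mind that a p-value above 0.05 means only that no difference was detected; it does not prove that the two distributions are the same.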


## References

[1] _Abid, Faroudja, and Nouma Izeboudjen. "Predicting forest fire in Algeria using data mining techniques: Case study of the decision tree algorithm." International Conference on Advanced Intelligent Systems for Sustainable Development. Springer, Cham, 2020._