## 数据集介绍  

X - x轴坐标：1 - 9。  
Y - y轴坐标：1 - 9。  
Month - 月：一月到十二月  
day - 周几：周一到周日  
FFMC - FWI系统的Fine Fuel Moisture Code指数：18.7 - 96.20  
DMC - 来自FWI系统的Duff moisture code指数：1.1 - 291.3  
DC - 来自FWI系统的Drought code指数：7.9 - 860.6  
ISI - 来自FWI系统的 Initial spread index指数：0.0 - 56.10  
Temp- 温度 - 摄氏度：2.2 - 33.30  
RH - 相对湿度％：15.0 - 100  
wind - 风速km.h：0.40 - 9.40  
rain- 雨，mm / m2为单位：0.0 - 6.4  
area - 森林的烧毁区域：0.00 - 1090.84

## 数据可视化

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mpl_toolkits 

In [2]:
forestfire = pd.read_csv("/home/forestfires.csv")
forestfire.head()

Unnamed: 0,X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
0,7,5,mar,fri,86.2,26.2,94.3,5.1,8.2,51,6.7,0.0,0.0
1,7,4,oct,tue,90.6,35.4,669.1,6.7,18.0,33,0.9,0.0,0.0
2,7,4,oct,sat,90.6,43.7,686.9,6.7,14.6,33,1.3,0.0,0.0
3,8,6,mar,fri,91.7,33.3,77.5,9.0,8.3,97,4.0,0.2,0.0
4,8,6,mar,sun,89.3,51.3,102.2,9.6,11.4,99,1.8,0.0,0.0


In [3]:
forestfire.shape

(517, 13)

In [4]:
forestfire.describe()

Unnamed: 0,X,Y,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
count,517.0,517.0,517.0,517.0,517.0,517.0,517.0,517.0,517.0,517.0,517.0
mean,4.669246,4.299807,90.644681,110.87234,547.940039,9.021663,18.889168,44.288201,4.017602,0.021663,12.847292
std,2.313778,1.2299,5.520111,64.046482,248.066192,4.559477,5.806625,16.317469,1.791653,0.295959,63.655818
min,1.0,2.0,18.7,1.1,7.9,0.0,2.2,15.0,0.4,0.0,0.0
25%,3.0,4.0,90.2,68.6,437.7,6.5,15.5,33.0,2.7,0.0,0.0
50%,4.0,4.0,91.6,108.3,664.2,8.4,19.3,42.0,4.0,0.0,0.52
75%,7.0,5.0,92.9,142.4,713.9,10.8,22.8,53.0,4.9,0.0,6.57
max,9.0,9.0,96.2,291.3,860.6,56.1,33.3,100.0,9.4,6.4,1090.84


###哪个月份更有可能发生火灾

In [5]:
forestfire['month'].value_counts().plot(kind='bar')
plt.title('month vs forest fire')
plt.xlabel('month')
plt.ylabel('Count')

Text(0, 0.5, 'Count')

八月， 九月， 和三月为Montesinho公园容易发生火灾的月份

###一个星期的哪天更有可能发生火灾

In [6]:
forestfire['day'].value_counts().plot(kind='bar')
plt.title('day vs forest fire')
plt.xlabel('days of a week')
plt.ylabel('Count')
sns.despine

<function seaborn.utils.despine(fig=None, ax=None, top=True, right=True, left=False, bottom=False, offset=None, trim=False)>

In [7]:
def simple_rela(i):
    forestfire[i].value_counts().plot(kind='bar')
    plt.title(i + " "+"vs forest fire")
    plt.xlabel(i)
    plt.ylabel('Count')
simple_rela("rain")

In [8]:
simple_rela("wind")

周天最容易发生火灾，周三发生的次数最少，但其它天之间的差距并不远显著

以下这个地图中出现的点为Montesinho公园发生过火灾的地点的坐标

In [9]:
plt.figure(figsize=(10,10))
sns.jointplot(x=forestfire.X, y=forestfire.Y, size=10)
plt.ylabel('Y', fontsize=12)
plt.xlabel('X', fontsize=12)
plt.show()
sns.despine



<Figure size 720x720 with 0 Axes>

<function seaborn.utils.despine(fig=None, ax=None, top=True, right=True, left=False, bottom=False, offset=None, trim=False)>

###绘制热度图来表现火灾面积与不同因素作用之间的关系

In [10]:
corrMatrix = forestfire.corr()
plt.figure(figsize=(16,5))
sns.heatmap(corrMatrix, annot=True)
plt.show()

区块颜色越浅则表示相关性越大， 由图可见， 火灾面积与以上这些因素的相关性并不大， 都小于0.1

In [18]:
import plotly.figure_factory as ff
figure = ff.create_scatterplotmatrix(
    forestfire[['FFMC',      
        'DMC','DC', 'ISI', 'temp', 'RH', 'wind', 'rain']],
    diag='histogram',
    index='rain')

可以看出dc 与 dmc， ffmc 与 temp 有比较明显的正相关性

### 不同月份的湿度，干燥度，降水， 温度， 风速， 雨等的分布情况（violin plots)

In [19]:
fire_columns = ['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']
for i in fire_columns:
    sns.catplot(x='month', y=i, kind='violin', data=forestfire)

## 使用gradient boost model

In [20]:
from sklearnex import patch_sklearn
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

labels = forestfire['area']

train1 = forestfire.drop(['day', 'month'],axis=1)



### 把数据集拆分成训练集和测试集

In [23]:

from sklearn.model_selection import train_test_split

x_train , x_test , y_train , y_test = train_test_split(train1 , labels , test_size = 0.10,random_state =2)

reg.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

reg.score(x_test,y_test)

1.0

In [24]:
from sklearn import ensemble
clf = ensemble.GradientBoostingRegressor(n_estimators = 400, max_depth = 5, min_samples_split = 2,
          learning_rate = 0.1, loss = 'ls')

clf.fit(x_train, y_train)
clf.score(x_test,y_test)

0.9281627583699141

### 对测试集的预测结果

In [17]:
a = clf.predict(x_test)
print(a)

[-1.39512270e-02  2.16516572e+00  5.99812548e-01  2.66028686e-03
  1.44513401e-02  1.82274866e-03 -3.91301204e-03 -2.32228788e-05
  8.68108406e-06  2.20170932e-03  4.96028358e+01 -3.25228255e-03
  2.02009782e+02  1.57013741e+00  3.67939298e+00  3.34431366e+00
 -7.73263662e-03 -1.18549636e-02  1.51114514e-03  3.84251279e+01
 -3.99814763e-02  2.81533050e+01  1.11908034e+01  1.96983177e+00
  4.29819460e+00 -5.80232231e-03  5.48672868e-01  1.88101692e-01
  2.77137991e+00 -1.62273687e-02  1.45512591e-03  1.71247459e+00
  8.17356825e-04  6.79586797e-01  1.53746876e+00  1.15698183e+00
  2.82074938e+01  1.47375886e+01  5.52154646e-03  3.25991351e+00
  1.57485817e-03  1.53197348e+00 -8.50345700e-03 -1.13152693e-02
  7.19323097e-04  2.92779157e-04 -3.24989610e-03  6.22717235e-03
  2.23683797e+00 -6.27467059e-03  3.68495943e+01  2.82074938e+01]
