## Baseline test

- Input features:
    - Hour field extracted from Dates
    - DayOfWeek
    - PdDistrict
    - Whether Address contains the word "Block"
    - Longitude (X)
    - Latitude (Y)

### Load the dataset

In [1]:
import pandas as pd
import numpy as np

origin_train_data = pd.read_csv('../datasets/train.csv')
origin_train_data.head()

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541


### 1. Data preprocessing

In [2]:
# Drop the label column (Category) along with Descript and Resolution, which are not input features
train_data = origin_train_data.drop(['Category', 'Descript', 'Resolution'], axis=1)
train_data.head()

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,Wednesday,NORTHERN,OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,Wednesday,NORTHERN,VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,Wednesday,NORTHERN,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,Wednesday,PARK,100 Block of BRODERICK ST,-122.438738,37.771541


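### 2. Logistic regression

#### 2.1 Data preprocessing  
2.1.1 Convert the Address column into a 0/1 indicator of whether it contains "Block", then drop Address and add the HasBlock column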

In [4]:
# Position of 'block' in each lower-cased address; -1 means it was not found
find_block = np.char.find(np.char.lower(np.array(origin_train_data['Address'], dtype=str)), 'block')
# 1 if the address contains 'block' anywhere, otherwise 0
addresses = (find_block >= 0).astype(int)
train_data = train_data.drop(['Address'], axis=1)
train_data['HasBlock'] = addresses

train_data.head()

Unnamed: 0,Dates,DayOfWeek,PdDistrict,X,Y,HasBlock
0,2015-05-13 23:53:00,Wednesday,NORTHERN,-122.425892,37.774599,0
1,2015-05-13 23:53:00,Wednesday,NORTHERN,-122.425892,37.774599,0
2,2015-05-13 23:33:00,Wednesday,NORTHERN,-122.424363,37.800414,0
3,2015-05-13 23:30:00,Wednesday,NORTHERN,-122.426995,37.800873,1
4,2015-05-13 23:30:00,Wednesday,PARK,-122.438738,37.771541,1
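
The same indicator can also be built with pandas string methods. The following is a minimal sketch on made-up addresses (the `toy` frame is illustrative, not part of the notebook), assuming a case-insensitive substring match is all that is needed:

```python
import pandas as pd

# Toy addresses in the same style as the dataset (illustrative values only)
toy = pd.DataFrame({'Address': ['OAK ST / LAGUNA ST',
                                '1500 Block of LOMBARD ST',
                                '100 block of BRODERICK ST']})

# Case-insensitive match; na=False treats a missing address as "no Block"
toy['HasBlock'] = toy['Address'].str.contains('block', case=False, na=False).astype(int)
print(toy['HasBlock'].tolist())  # [0, 1, 1]
```

Staying inside pandas avoids the round trip through a NumPy string array and makes the handling of missing values explicit.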


2.1.2 Keep only the hour from the Dates column as a feature

In [5]:
hours = pd.DatetimeIndex(train_data['Dates']).hour
train_data['Hours'] = hours
train_data = train_data.drop(['Dates'], axis=1)
train_data.head()

Unnamed: 0,DayOfWeek,PdDistrict,X,Y,HasBlock,Hours
0,Wednesday,NORTHERN,-122.425892,37.774599,0,23
1,Wednesday,NORTHERN,-122.425892,37.774599,0,23
2,Wednesday,NORTHERN,-122.424363,37.800414,0,23
3,Wednesday,NORTHERN,-122.426995,37.800873,1,23
4,Wednesday,PARK,-122.438738,37.771541,1,23
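
An equivalent way to pull out the hour is the `.dt` accessor on a parsed datetime column. A small sketch on two made-up timestamps (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Two illustrative timestamps in the dataset's format
toy = pd.DataFrame({'Dates': ['2015-05-13 23:53:00', '2015-05-13 08:05:00']})

# Parse the strings, then take the hour component (0-23)
toy['Hours'] = pd.to_datetime(toy['Dates']).dt.hour
print(toy['Hours'].tolist())  # [23, 8]
```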


2.1.3 Min-max normalize the X, Y and Hours columns

In [6]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
features = ['X', 'Y', 'Hours']
train_data[features] = scaler.fit_transform(train_data[features])
train_data.head()

Unnamed: 0,DayOfWeek,PdDistrict,X,Y,HasBlock,Hours
0,Wednesday,NORTHERN,0.043578,0.001276,0,1.0
1,Wednesday,NORTHERN,0.043578,0.001276,0,1.0
2,Wednesday,NORTHERN,0.044337,0.00177,0,1.0
3,Wednesday,NORTHERN,0.04303,0.001778,1,1.0
4,Wednesday,PARK,0.037198,0.001217,1,1.0
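
Min-max scaling maps each column to [0, 1] via (x - min) / (max - min), which is why hour 23 appears as 1.0 in the output. A minimal sketch on made-up hour values, comparing `MinMaxScaler` with the formula written out by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative hour values; 0 is the minimum and 23 the maximum
hours = np.array([[0.0], [6.0], [23.0]])

scaled = MinMaxScaler().fit_transform(hours)

# The same transformation written out explicitly
col_min, col_max = hours.min(axis=0), hours.max(axis=0)
manual = (hours - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))
```

The notebook fits the scaler on the full data before splitting; a stricter setup would likely fit it on the training split only and reuse it on the test split.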


2.1.4 One-hot encode DayOfWeek and PdDistrict

In [7]:
train_data = pd.get_dummies(train_data)
train_data.head()

Unnamed: 0,X,Y,HasBlock,Hours,DayOfWeek_Friday,DayOfWeek_Monday,DayOfWeek_Saturday,DayOfWeek_Sunday,DayOfWeek_Thursday,DayOfWeek_Tuesday,...,PdDistrict_BAYVIEW,PdDistrict_CENTRAL,PdDistrict_INGLESIDE,PdDistrict_MISSION,PdDistrict_NORTHERN,PdDistrict_PARK,PdDistrict_RICHMOND,PdDistrict_SOUTHERN,PdDistrict_TARAVAL,PdDistrict_TENDERLOIN
0,0.043578,0.001276,0,1.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0.043578,0.001276,0,1.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0.044337,0.00177,0,1.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0.04303,0.001778,1,1.0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0.037198,0.001217,1,1.0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
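
To see what `pd.get_dummies` does to the categorical columns, here is a small sketch on a made-up two-row frame. Passing `dtype=int` is an assumption made here to get 0/1 values like the output above; recent pandas versions return booleans by default.

```python
import pandas as pd

# Illustrative frame with two categorical columns and one numeric column
toy = pd.DataFrame({'DayOfWeek': ['Wednesday', 'Friday'],
                    'PdDistrict': ['NORTHERN', 'PARK'],
                    'Hours': [1.0, 0.5]})

# Numeric columns pass through; each category becomes its own 0/1 column
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
```

Each string column is replaced by one indicator column per distinct value, named `<column>_<value>`.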


2.1.5 Split the data into training and test sets

In [8]:
from sklearn.model_selection import train_test_split
y_label = origin_train_data['Category']
X_train, X_test, y_train, y_test = train_test_split(train_data, y_label, test_size=0.2, random_state=42)

print('X_train has {} samples.'.format(X_train.shape[0]))
print('X_test has {} samples.'.format(X_test.shape[0]))

X_train has 702439 samples.
X_test has 175610 samples.
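
The two counts add up to 878,049 rows, and 0.2 × 878,049 = 175,609.8, which `train_test_split` rounds up to 175,610 test rows. A toy sketch of the same call on ten made-up rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten illustrative samples with two features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 puts ceil(0.2 * 10) = 2 rows in the test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape[0], X_te.shape[0])  # 8 2
```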


#### 2.2 Model creation and training
2.2.1 Train the model

In [9]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42, solver='lbfgs', multi_class='multinomial', max_iter=1000).fit(X_train, y_train)
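
A multinomial logistic regression turns one linear score per class into probabilities with a softmax. The sketch below checks this on synthetic data; the data and variable names are illustrative, and it assumes a scikit-learn version in which the `lbfgs` solver fits a multinomial model by default for more than two classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem, purely illustrative
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

model = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42).fit(X, y)

# One linear score per class, shape (n_samples, n_classes)
scores = model.decision_function(X)

# Softmax over the class axis; subtracting the row maximum keeps exp() stable
shifted = scores - scores.max(axis=1, keepdims=True)
softmax = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(np.allclose(softmax, model.predict_proba(X)))
```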

2.2.2 Compute accuracy

In [26]:
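from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
# accuracy_score expects (y_true, y_pred); the value is the same in either order
print('Accuracy: ', accuracy_score(y_test, y_pred))

Accuracy:  0.22801093331814817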
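
An accuracy figure is easier to judge next to a feature-free reference point. One common choice, not used in the notebook, is to always predict the most frequent class. A minimal sketch on made-up labels using scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Illustrative labels: the majority class covers 3 of 5 samples
y_true = np.array(['LARCENY/THEFT', 'LARCENY/THEFT', 'LARCENY/THEFT',
                   'WARRANTS', 'OTHER OFFENSES'])
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this strategy

# Always predicts the most frequent training label
majority = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, majority.predict(X_dummy))
print(baseline_acc)  # 0.6
```

On the real data, fitting the same classifier on `y_train` and scoring it on `y_test` would show how much of the accuracy comes from the class imbalance alone.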


2.2.3 Compute log loss

In [27]:
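from sklearn.metrics import log_loss

y_pred_prob = clf.predict_proba(X_test)
# Pass the class order explicitly so the probability columns line up with the labels
print('Multiclass log loss: ', log_loss(y_test, y_pred_prob, labels=clf.classes_))

Multiclass log loss:  2.556361302239199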
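
Multiclass log loss is the mean negative log of the probability assigned to the true class. The sketch below recomputes it by hand on three made-up predictions and compares the result with `log_loss`; the labels and probabilities are illustrative only.

```python
import numpy as np
from sklearn.metrics import log_loss

labels = ['A', 'B', 'C']            # column order of the probability matrix
y_true = ['A', 'C', 'B']
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

# Pick the probability given to the true class in each row
true_idx = [labels.index(t) for t in y_true]
p_true = proba[np.arange(len(y_true)), true_idx]

manual = -np.mean(np.log(p_true))
from_sklearn = log_loss(y_true, proba, labels=labels)
print(manual, from_sklearn)
```

As a reference point, predicting a uniform distribution over K classes gives a log loss of ln K (about 3.66 for K = 39), so lower values indicate the model is more informative than uniform guessing.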
