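## Code Overview
**1. The customer-type prediction code has two parts. The first is stored in 'C:\Users\Z0044A8U\Desktop\experiment10.12 公司名称经营范围项目' under the name:** 
- Customer+Role+Cleansing-for+Ben.ipynb  

This is the notebook shown here. It covers:
- Preprocessing the sample data  
- Model training

**2. The second is stored in 'C:\Users\Z0044A8U\Desktop\GUI' under the name:**  
- scratch.py  

It covers:
- Applying the same preprocessing to the data to be predicted  
- Calling the trained models to predict, and exporting the results as a DataFrame/Excel file  
- Building the GUI with the tkinter package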

## Table of Contents
### *----Preprocessing*
#### [Import Data](#Import_Data)  
#### [Define functions](#Define_functions)
#### [Remove useless features/words](#Remove_useless_features/words)  
#### [Construct Tf-idf Model](#Construct_Tf-idf_Model)  
#### [Define evaluation metrics under cross-validation](#Define_evaluation_metrics_under_cross-validation)  
#### [Preprocessing for 公司名称 (company names)](#Preprocessing_for_公司名称)   

### *----Model Training*
#### [1.EndCustomer](#EndCustomer)
#### [2.PaB](#PaB)
#### [3.OEM](#OEM)
#### [4.SI](#SI)
#### [5.DISTRIBUTOR](#DISTRIBUTOR)

In [1]:
import pickle
import random
import re
import timeit
from collections import Counter

import jieba
import numpy as np
import pandas as pd
from jieba import analyse, lcut
from nltk.util import ngrams
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_recall_fscore_support, precision_score,
                             recall_score)
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC  # "Support vector classifier"

## Import_Data

In [2]:
'''
    Import Data
'''
# df_full holds all customers (about 40,000 rows)
df_full = pd.read_excel(r'..\01 Raw Data\01_1st batch_QCC result(cleaned).xlsx')  # 40185
# Drop rows whose 经营范围 (business scope) is empty, reset the index, and store it in an 'Index' column
df_full = df_full[pd.notnull(df_full['经营范围'])]  # 40177
df_full = df_full.reset_index(drop = True)
df_full['Index'] = df_full.index

'''
    Locate the samples within df_full (all customers) and attach their labels
'''
# skiprows = 4 because the first four rows contain no data
# Load the samples of each of the five customer types; they are concatenated row-wise below
df_sample_oem = pd.read_excel(r'..\01 Raw Data\Customer Sample with right role\FY19 Company Detail & CDP Report_OEM sample.xlsx', skiprows = 4)
df_sample_oem['Label'] = 'OEM'  # 263

df_sample_PaB = pd.read_excel(r'..\01 Raw Data\Customer Sample with right role\FY19 Company Detail & CDP Report_PaB sample.xlsx', skiprows = 4)
df_sample_PaB['Label'] = 'PaB'  #179

df_sample_dist = pd.read_excel(r'..\01 Raw Data\Customer Sample with right role\FY19 Company Detail & CDP Report_Distributor Sample.xlsx', skiprows = 4) 
df_sample_dist['Label'] = 'Distributor'  # 204

df_sample_SI = pd.read_excel(r'..\01 Raw Data\Customer Sample with right role\FY19 Company Detail & CDP Report_SI Sample.xlsx', skiprows = 4) 
df_sample_SI['Label'] = 'SI'  # 487

df_sample_EC = pd.read_excel(r'..\01 Raw Data\Customer Sample with right role\FY19 Company Detail & CDP Report_End Customer sample.xlsx', skiprows = 4) 
df_sample_EC['Label'] = 'End-customer'  # 226

# Inner-merge the labelled samples with the full data on CNOC
df_samples = df_full.merge(pd.concat([df_sample_oem, df_sample_PaB, df_sample_dist, df_sample_SI, df_sample_EC], axis = 0), on = 'CNOC', how = 'inner') 
# keep df_full's index to identify samples
df_samples


# Data loading is done; next, define the data-processing functions

Unnamed: 0,Query.ID,Query.Name,Result.Name,Result.BelongOrg,公司注册时间,统一社会信用代码,CNOC,Company Role in Trust IT,注册资本,企业类型,...,Sales E-Mail,Sales GID,Sales Name,Sales Office,Sales Territory,Sales Type_1,Sales Type_2,Set-Category-Initiative,Total-CY Siemens OR(till P10),Vbez
0,10033,北京永创达科技发展有限公司,北京永创达科技发展有限公司,北京市工商行政管理局海淀分局,2002-03-13,91110108736452296Q,'736452296,Panel Builder/Cabinet Builder,1800万元人民币,有限责任公司(自然人投资或控股),...,xingzhong.guo@siemens.com,Z001RWNP,Guo Xing Zhong,RE_PaB,S_Region East,Pab,Individual,,,37001001
1,10012,北京利顺和科技有限公司,北京利顺和科技有限公司,北京市工商行政管理局海淀分局,2012-07-19,91110108599623158J,'599623158,Panel Builder/Cabinet Builder,1000万元人民币,有限责任公司(自然人投资或控股),...,xingzhong.guo@siemens.com,Z001RWNP,Guo Xing Zhong,RE_PaB,S_Region East,Pab,Individual,,,37001001
2,10024,北京北开电气股份有限公司,北京北开电气股份有限公司,北京市工商行政管理局朝阳分局,1999-12-28,91110105700239397Q,'700239397,Panel Builder/Cabinet Builder,26998万元人民币,其他股份有限公司(非上市),...,xingzhong.guo@siemens.com,Z001RWNP,Guo Xing Zhong,RE_PaB,S_Region East,Pab,Individual,DF Country Strategy MC > GMC_CC new measures >...,,37001001
3,10002,北京燕化中天电控设备有限责任公司,北京燕化中天电控设备有限责任公司,北京市工商行政管理局燕山分局,1996-12-24,91110304102772503U,'102772503,Panel Builder/Cabinet Builder,8000万元人民币,有限责任公司(法人独资),...,,,RE Pab Public,RE_PaB,S_Region East,Pab,Public,,,37002000
4,10035,北京京润海自动化技术研究所,北京京润海自动化技术研究所,北京市工商行政管理局大兴分局,2002-04-10,91110115737687839Q,'737687839,Panel Builder/Cabinet Builder,900万元人民币,集体所有制（股份合作）,...,xingzhong.guo@siemens.com,Z001RWNP,Guo Xing Zhong,RE_PaB,S_Region East,Pab,Individual,DF Country Strategy MC > GMC_CC > New customer...,,37001001
5,10019,盛隆电气（北京）有限公司,盛隆电气（北京）有限公司,北京市工商行政管理局密云分局,2009-02-04,91110228684354992N,'684354992,Panel Builder/Cabinet Builder,10000万元人民币,有限责任公司(法人独资),...,xingzhong.guo@siemens.com,Z001RWNP,Guo Xing Zhong,RE_PaB,S_Region East,Pab,Individual,MC GMC > MC GMC BD INF > MCC (motor control ce...,,37001001
6,10016,欧伏电气股份有限公司,欧伏电气股份有限公司,廊坊市市场监督管理局,2007-10-31,91131000669055476N,'669055476,Panel Builder/Cabinet Builder,8746.4万元人民币,其他股份有限公司(非上市),...,xingzhong.guo@siemens.com,Z001RWNP,Guo Xing Zhong,RE_PaB,S_Region East,Pab,Individual,DF Sales Funnel FY19 > Growing/new Customer > ...,,37001001
7,10055,北京广盟电气有限公司,北京广盟电气有限公司,北京市工商行政管理局顺义分局,2006-03-23,91110113786879343X,'786879343,Panel Builder/Cabinet Builder,5000万元人民币,有限责任公司(自然人投资或控股),...,xingzhong.guo@siemens.com,Z001RWNP,Guo Xing Zhong,RE_PaB,S_Region East,Pab,Individual,,,37001001
8,10064,天津首钢电气设备有限公司,天津首钢电气设备有限公司,天津市滨海新区市场和质量监督管理局,1991-12-14,911201161043161140,'104316114,Panel Builder/Cabinet Builder,1000万元人民币,有限责任公司,...,xingzhong.guo@siemens.com,Z001RWNP,Guo Xing Zhong,RE_PaB,S_Region East,Pab,Individual,DF Country Strategy MC > GMC_CC new measures >...,,37001001
9,10093,上海蓝箭电控设备成套有限公司,上海蓝箭电控设备成套有限公司,宝山区市场监督管理局,1991-03-15,913101131331305382,'133130538,Panel Builder/Cabinet Builder,12000万元人民币,有限责任公司(自然人投资或控股),...,daodong.li@siemens.com,Z0029MMV,Li Dao Dong,RE_PaB,S_Region East,Pab,Individual,DF Country Strategy MC > GMC_CC > New customer...,,37002003


## Define_functions

In [3]:
# Build the stopword list used to drop uninformative words
def stopwordslist(filepath):
    with open(filepath, 'r', encoding = 'utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

# Further strip unwanted characters such as English letters or digits
def proprecess_chinese(sentence, letters = False, digital = False):
    """
    Keeps only Chinese characters by default. Pass the flags below to also keep
    English letters and/or digits; extend the function if more rules are needed.
    :param sentence: input string
    :param letters: keep English letters; defaults to False (dropped)
    :param digital: keep digits; defaults to False (dropped)
    :return: the cleaned string
    """
    # A-Za-z          letters
    # 0-9             digits
    # \u4e00-\u9fa5   Chinese characters
    # Only the leading ^ negates the class, so the ranges are listed back to back
    if letters and digital:
        pro_str = re.compile('[^A-Za-z0-9\u4e00-\u9fa5]')
    elif digital:
        pro_str = re.compile('[^0-9\u4e00-\u9fa5]')
    elif letters:
        pro_str = re.compile('[^A-Za-z\u4e00-\u9fa5]')
    else:
        pro_str = re.compile('[^\u4e00-\u9fa5]')
    return pro_str.sub('', sentence)

# Remove brackets together with the text inside them
def remove_bracket(substr: str):
    def strip_pair(text: str, left: str, right: str):
        # index_l / index_r: where the search for the next left / right bracket starts
        index_l = 0
        index_r = 0
        while True:
            index_l = text.find(left, index_l)   # first left bracket from the start position
            index_r = text.find(right, index_r)  # first right bracket from the start position
            if index_l < 0:
                break  # no left bracket remains
            if index_r > index_l:
                # 'xxx(xxx)xxx': keep what precedes the left bracket and what follows the right one
                text = text[:index_l] + text[index_r + 1:]
            else:
                # 'xxx)xxx(xxx' or no right bracket found: look for the next right bracket once
                index_r = text.find(right, index_r + 1)
                # Still not behind the left bracket: give up
                if index_r < index_l:
                    break
        return text

    substr = strip_pair(substr, '(', ')')    # English (half-width) parentheses
    substr = strip_pair(substr, '【', '】')  # Chinese square brackets
    return substr

# The functions are defined; next, call them to process the data
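
The text-cleaning function relies on a negated character class. As a minimal sketch (the sample string is made up for illustration and is not taken from the data), the snippet below shows that only a `^` in first position negates the class, while a `^` placed later is treated as a literal caret and therefore survives the substitution:

```python
import re

sample = "低压开关ABC^123（测试）"  # made-up input, for illustration only

# '^' negates the class only when it is the first character;
# the later '^' characters are literal carets, so carets are kept too.
with_inner_carets = re.compile('[^A-Z^a-z^\u4e00-\u9fa5]')
# Ranges listed back to back: keeps letters and Chinese characters only.
without_inner_carets = re.compile('[^A-Za-z\u4e00-\u9fa5]')

kept_caret = with_inner_carets.sub('', sample)
dropped_caret = without_inner_carets.sub('', sample)
print(kept_caret)
print(dropped_caret)
```

The full-width parentheses `（）` fall outside the `\u4e00-\u9fa5` range, so both patterns drop them.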

## Remove_useless_features/words

In [4]:
# construct list for stopwords
# strip() removes leading and trailing whitespace
stopwords_list = stopwordslist(r'..\01 Raw Data\中文停用词表.txt')
stopwords_list = [stopword.strip() for stopword in stopwords_list]  # 1894
stopwords_set = set(stopwords_list)  # a set makes the membership test below fast

words_list  = []
# Loop over 经营范围 (business scope), one record at a time
for content in df_full['经营范围']:
    
    # Apply the functions defined above
    content = remove_bracket(content)  # drop brackets and the text inside them
    content = proprecess_chinese(content, letters = False)  # drop English letters and digits
    
    # Tokenize, then keep only the tokens that are not stopwords
    words = lcut(content, HMM = True)
    words_final = [word for word in words if word not in stopwords_set]
    words_list.append(words_final)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Z00490NT\AppData\Local\Temp\jieba.cache
Loading model cost 1.091 seconds.
Prefix dict has been built successfully.


## Construct_Tf-idf_Model

In [5]:
# Join each document's tokens with spaces
document = [" ".join(word) for word in words_list]
# The corpus is ready; next, fit the tf-idf model

# Steps:
# 1. Fit tf-idf on all documents (tfidf "all")
# 2. Fit a second tf-idf on the training documents only (tfidf "train")
# 3. Look up each word's position in tfidf "all"
# 4. Look up each training word's position in tfidf "train"
# 5. Merge the two to get the training words' positions in tfidf "all"
# 6. Use those positions to extract the training words' tf-idf values
# 7. Build the model inputs
# Note: the intermediate tfidf "train" model exists to speed things up

# tfidf for all items (about 19,000 words)
tfidf_model = TfidfVectorizer(token_pattern = '\\b\\w+\\b').fit(document)  # \\b\\w+\\b -> keeps single-character tokens
sparse_result = tfidf_model.transform(document)
# One row per document; the columns are all the words
print("Original Matrix of all sentences", sparse_result.shape)  # (40177, 19742)

'''
    A side question: shouldn't an NLP model be built only from the words in the training set?
        Answer: only the training-set words are used as features, but tf-idf is unsupervised and counts
            term frequencies over all the text. Only these 3236 words are usable, yet their tf and idf
            are most accurate when computed over all companies.
            At first all 19742 words were fed into model training. That worked, but once moved into
            the software it made the software extremely slow.
            The improved strategy is as follows:
            the steps below pick, out of the original 19742 dimensions, the words that occur in the
            training set, which leaves 3236 dimensions.
'''

# tfidf for training items (about 3,000 words)
document_train = pd.Series(document)[df_samples['Index'].values].tolist()
tfidf_model_train = TfidfVectorizer(token_pattern = '\\b\\w+\\b').fit(document_train)  # \\b\\w+\\b -> keeps single-character tokens

# The ~3,000 words, ordered by their column position in the tfidf matrix (train)
words_train = pd.Series(tfidf_model_train.vocabulary_).sort_values()
words_train = pd.DataFrame(words_train).reset_index().drop([0], axis = 1)
print(words_train.shape)  # (3236, 1)

# The ~19,000 words and their column positions in the tfidf matrix (all)
words_all = pd.Series(tfidf_model.vocabulary_).sort_values()
words_all = pd.DataFrame(words_all).reset_index()
print(words_all.shape)  # (19742, 2)

# Merge the two tables to find the ~3,000 words' column positions in the tfidf matrix (all);
# the word column keeps its name 'index'
words_combined = words_train.merge(words_all, on = 'index', how = 'left')
words_combined.set_index([0], inplace = True)

# Extract the tf-idf columns of the ~3,000 words from the tfidf matrix (all)
sparse_result = sparse_result[:, list(words_combined.index)]
print("After transformation", sparse_result.shape) 

# Build X and y
X = sparse_result[df_samples['Index'], :].toarray()
# astype casts the dtype; the labels are the five customer types
df_samples['Label'] = df_samples['Label'].astype('category')  
# .cat.codes: Return Series of codes as well as the index.
df_samples['Label_Code'] = df_samples['Label'].cat.codes
y = df_samples['Label_Code'].values

# Save the tf-idf model and the vocabulary table
# pickle.dump serializes the object into the file
with open(r'01 Script generated models\tfidf_model.pkl', 'wb') as f:
    pickle.dump((tfidf_model, words_combined), f)

Original Matrix of all sentences (40177, 19742)
(3236, 1)
(19742, 2)
After transformation (40177, 3236)
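
The column-selection idea in steps 3 to 6 can be sketched without scikit-learn. The vocabularies and the small dense matrix below are toy values invented for illustration; the notebook itself works on `tfidf_model.vocabulary_` and a sparse matrix.

```python
# Toy vocabularies: word -> column index (hypothetical values)
vocab_all = {"开关": 0, "配电": 1, "销售": 2, "软件": 3, "咨询": 4}
vocab_train = {"软件": 0, "配电": 1, "销售": 2}

# Order the training words by their column in the training model,
# then look up where each one sits in the full model.
train_words = sorted(vocab_train, key=vocab_train.get)
columns_in_all = [vocab_all[word] for word in train_words]

# Toy dense tf-idf matrix over the full vocabulary (2 documents x 5 words)
matrix_all = [
    [0.0, 0.7, 0.0, 0.7, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.7],
]
# Keep only the training-vocabulary columns, in training order
matrix_reduced = [[row[c] for c in columns_in_all] for row in matrix_all]
print(columns_in_all)
print(matrix_reduced)
```

The values stay those of the model fitted on all documents; only the set of columns shrinks, which is what keeps the idf statistics corpus-wide while reducing the feature count.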


## Define_evaluation_metrics_under_cross-validation

In [6]:
# The tf-idf features are ready; next, evaluate the classifiers
# Precision: of the samples the classifier labels positive, the share that are truly positive
# Recall: of all truly positive samples, the share the classifier picks out
# F1: the harmonic mean of precision and recall
def eval_model(y_true, y_pred, labels):
    # Per-class precision, recall, F1 and support
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    
    # Support-weighted average precision, recall and F1, plus the total support
    tot_p = np.average(p, weights = s)
    tot_r = np.average(r, weights = s)
    tot_f1 = np.average(f1, weights = s)
    tot_s = np.sum(s)
    
    res1 = pd.DataFrame({
        u'Label': labels,
        u'Precision': p,
        u'Recall': r,
        u'F1': f1,
        u'Support': s
    })
    res2 = pd.DataFrame({
        u'Label': ['总体'],  # 总体 = overall
        u'Precision': [tot_p],
        u'Recall': [tot_r],
        u'F1': [tot_f1],
        u'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[['Label', 'Precision', 'Recall', 'F1', 'Support']]

categoty = ['Distributor', 'End-customer', 'OEM', 'PaB', 'SI']
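
To make the per-class metrics and their support-weighted average concrete, here is a minimal pure-Python sketch. The helper names and the toy labels are invented for illustration; the notebook itself relies on scikit-learn's `precision_recall_fscore_support`.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each label in y_true."""
    results = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results[label] = (precision, recall, f1, tp + fn)
    return results


def weighted_average(results, position):
    """Support-weighted average of one metric (0=precision, 1=recall, 2=f1)."""
    total = sum(r[3] for r in results.values())
    return sum(r[position] * r[3] for r in results.values()) / total


# Toy labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = per_class_metrics(y_true, y_pred)
overall_recall = weighted_average(metrics, 1)
print(metrics)
print(overall_recall)
```

With support weights, the overall recall equals plain accuracy (6 of 8 correct here), which is a handy sanity check on the "overall" row.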

## Preprocessing_for_公司名称

In [7]:
def remove_words(item, word_list):
    filtered_item = item
    for word in word_list:
        filtered_item = filtered_item.replace(word,'')
    return filtered_item

def extract_word(splited_item, word_list):
    l = []
    for i in splited_item:
        if i in word_list:
            l.append(i)
    return l 
'''
keywords in customer names
'''
geo_list = pd.read_excel(r'..\01 Raw Data\Geo Names.xlsx')['GEO'].values.tolist() # place-name lexicon
industry_list = pd.read_excel(r'..\01 Raw Data\Industry Names.xlsx')['Industry'].values.tolist() # industry lexicon

#prepare sample data for keyword extraction
df_kw_samples = pd.concat([df_sample_oem, df_sample_PaB, df_sample_dist, df_sample_SI, df_sample_EC],axis = 0)
df_kw_samples = df_kw_samples[['Company Name (CN)','Label']].copy()

# Stopwords (province, city, county, China, punctuation)
stopwords_2 = ['省', '市', '县','中国',\
                '（', '）', '(', ')', '-', '*',' ','·','.','?']
# Marker words that start the company-form suffix (limited, shares, group, branch, company, factory, bureau)
suffixes = ['有限','股份','集团','分公司','公司','厂','局']


'''
    Extract an industry keyword from the company name:
        step1: remove place names and the company suffix, tokenize, and take the last token.
               E.g. for xx科技有限公司, 科技 is extracted
        step2: if the token has exactly two or three characters, keep it as is; otherwise go to step3
        step3: if it has four or more characters, keep its last two characters
        step4: match against the industries in industry_list one by one
               (not adopted in the end; the code was moved to bkp)
'''
inds_1 = []
companies_list = df_kw_samples['Company Name (CN)'].values
for i in range(len(companies_list)):
    company_name_1_splitted = lcut(companies_list[i], HMM=True)
    geo_extracted_1 = extract_word(company_name_1_splitted, geo_list)
    company_name_1 = remove_words(companies_list[i],geo_extracted_1)
    company_name_2 = remove_words(company_name_1,stopwords_2)
    # Cut the name at the earliest position where any suffix marker starts
    min_idx = len(company_name_2)
    for suffix in suffixes:
        suffix_idx = company_name_2.find(suffix)
        if 0 <= suffix_idx < min_idx:
            min_idx = suffix_idx
    company_name_3 = company_name_2[:min_idx]
    company_name_3_splitted = lcut(company_name_3, HMM=True)
    # An empty remainder yields no tokens; record an empty keyword in that case
    inds_1.append(company_name_3_splitted[-1] if company_name_3_splitted else '')
    
df_kw_samples['Industry2'] = inds_1

# Industry3 is the final extraction result
df_kw_samples['Industry3'] = np.where(df_kw_samples['Industry2'].apply(len)>=4, df_kw_samples['Industry2'].str[-2:],df_kw_samples['Industry2'] )

In [34]:
df_kw_samples

Unnamed: 0,Company Name (CN),Label,Industry2,Industry3
0,济南二机床集团有限公司,OEM,机床,机床
1,中车青岛四方车辆研究所有限公司,OEM,研究所,研究所
2,青岛软控机电工程有限公司,OEM,机电工程,工程
3,山东新华医疗器械股份有限公司,OEM,医疗器械,器械
4,荏原机械（中国）有限公司烟台分公司,OEM,机械,机械
5,景津环保股份有限公司,OEM,环保,环保
6,青岛大牧人机械股份有限公司,OEM,机械,机械
7,烟台杰瑞石油装备技术有限公司,OEM,技术,技术
8,青岛高测科技股份有限公司,OEM,科技,科技
9,临沂成盛精机制造有限公司,OEM,制造,制造
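
The suffix-trimming and length rules can be tried in isolation. The sketch below is a simplified stand-in that skips the jieba tokenization and the place-name removal; the helper names `trim_at_first_suffix` and `shorten_keyword` are hypothetical and do not appear in the notebook.

```python
SUFFIXES = ['有限', '股份', '集团', '分公司', '公司', '厂', '局']


def trim_at_first_suffix(name, suffixes=SUFFIXES):
    """Cut the name at the earliest position where any suffix marker starts."""
    cut = len(name)
    for suffix in suffixes:
        idx = name.find(suffix)
        if 0 <= idx < cut:
            cut = idx
    return name[:cut]


def shorten_keyword(keyword):
    """Keywords of four or more characters are reduced to their last two."""
    return keyword[-2:] if len(keyword) >= 4 else keyword


trimmed = trim_at_first_suffix('新华医疗器械股份有限公司')
long_keyword = shorten_keyword('医疗器械')
short_keyword = shorten_keyword('研究所')
print(trimmed, long_keyword, short_keyword)
```

In the notebook the trimmed name is tokenized first and only its last token goes through the length rule, which is how 医疗器械 ends up as 器械 in the table above.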


# 训练模型——目录  

### [1.EndCustomer](#EndCustomer)
### [2.PaB](#PaB)
### [3.OEM](#OEM)
### [4.SI](#SI)
### [5.DISTRIBUTOR](#DISTRIBUTOR)

In [8]:
'''
    Extract company-name keywords for each customer type.
    'consolidated_5_tables.xlsx' holds the keywords of the five types, selected by Gini impurity.
    The selection thresholds are passed to the function as parameters.

'''
# Sheet name -> column holding the keyword count within that customer type
COUNT_COLUMNS = {'ec': 'EC_Count', 'pab': 'PaB_Count', 'oem': 'OEM_Count',
                 'si': 'SI_Count', 'dist': 'Dist_Count'}

def get_names(sheetName: str, gini_in: float, gini_other: float):
    sheet = sheetName.lower()
    if sheet not in COUNT_COLUMNS:
        raise ValueError('sheet_name not valid, change **get_names** function')
    count_col = COUNT_COLUMNS[sheet]

    df = pd.read_excel(r'..\01 Raw Data\consolidated_5_tables.xlsx', sheet_name = sheet)
    df = df[df['If in Industry_List']==1]
    df_in = df[df['gini']<=gini_in]
    df_other = df[df['gini']<=gini_other]
    # Keywords seen more often inside the type than outside it, and the reverse
    kw_in = df_in[df_in[count_col]>df_in['others']]['KW']
    kw_others = df_other[df_other[count_col]<df_other['others']]['KW']

    def industry_chars(keywords):
        # For keywords ending in 业 (e.g. 糖业, 药业), also keep the character before 业,
        # unless it is one of the generic 工 / 产 / 实
        chars = []
        for kw in keywords:
            if kw.endswith('业') and len(kw) >= 2 and kw[-2] not in ['工','产','实']:
                chars.append(kw[-2])
        return chars

    kw_in = list(kw_in) + industry_chars(kw_in)
    kw_others = list(kw_others) + industry_chars(kw_others)
    return kw_in,kw_others
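
The notebook does not show how the `gini` column of the spreadsheet was computed. As a sketch under the assumption that it is the standard two-class Gini impurity over "inside the type" versus "others" counts, a keyword's score could be derived as follows (the function name and the counts are made up for illustration):

```python
def gini_impurity(count_in, count_others):
    """Two-class Gini impurity: 1 - p^2 - (1 - p)^2, with p the in-type share."""
    total = count_in + count_others
    if total == 0:
        raise ValueError("keyword never occurs")
    p = count_in / total
    return 1.0 - p ** 2 - (1.0 - p) ** 2


pure = gini_impurity(12, 0)    # keyword seen only inside the type
mixed = gini_impurity(5, 5)    # keyword split evenly between type and others
skewed = gini_impurity(9, 1)   # keyword mostly inside the type
print(pure, mixed, skewed)
```

Under this assumption a lower value means the keyword separates the type more cleanly, so a threshold such as `gini <= 0.2` would keep the 9-to-1 keyword and drop the evenly split one.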

## EndCustomer
### [Back to Catalogue](#训练模型——目录)

In [10]:
'''
    Repeat the earlier steps: preprocess the data, extract company-name keywords,
    and apply the tf-idf model to 经营范围 (business scope)
'''
df_kw_full = df_full[['Result.Name']].copy()  # copy, so the new columns are not written to a slice of df_full

inds_2 = []
companies_list_full = df_kw_full['Result.Name'].values
for i in range(len(companies_list_full)):
    company_name_1_splitted = lcut(companies_list_full[i], HMM=True)
    geo_extracted_1 = extract_word(company_name_1_splitted, geo_list)
    company_name_1 = remove_words(companies_list_full[i],geo_extracted_1)
    company_name_2 = remove_words(company_name_1,stopwords_2)
    # Cut the name at the earliest position where any suffix marker starts
    min_idx = len(company_name_2)
    for suffix in suffixes:
        suffix_idx = company_name_2.find(suffix)
        if 0 <= suffix_idx < min_idx:
            min_idx = suffix_idx
    company_name_3 = company_name_2[:min_idx]
    company_name_3_splitted = lcut(company_name_3, HMM=True)
    # An empty remainder yields no tokens; record an empty keyword in that case
    inds_2.append(company_name_3_splitted[-1] if company_name_3_splitted else '')
    
df_kw_full['Industry2'] = inds_2
df_kw_full['Industry3'] = np.where(df_kw_full['Industry2'].apply(len)>=4, df_kw_full['Industry2'].str[-2:],df_kw_full['Industry2'] )



#EC Names
name_EC, name_EC_excl=get_names('ec',0.2,0.1) # extract the keywords

# Flag (1/0) whether each company's extracted keyword contains any keyword from name_EC or name_EC_excl
EC_name_count_list = []
EC_name_excl_count_list = []
for company_name in df_kw_full.Industry3.values:
    EC_name_count_list.append(int(any(kw in company_name for kw in name_EC)))
    EC_name_excl_count_list.append(int(any(kw in company_name for kw in name_EC_excl)))

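
The 1/0 keyword features boil down to a substring test against a keyword list. A minimal sketch follows; the keyword lists are hypothetical and only stand in for what `get_names` returns.

```python
def keyword_flag(text, keywords):
    """Return 1 if any keyword occurs in text as a substring, else 0."""
    return int(any(kw in text for kw in keywords))


# Hypothetical keyword lists, for illustration only
include_keywords = ['钢铁', '药']
exclude_keywords = ['电气', '科技']

extracted = ['电气', '药业', '工程']
incl_flags = [keyword_flag(word, include_keywords) for word in extracted]
excl_flags = [keyword_flag(word, exclude_keywords) for word in extracted]
print(incl_flags, excl_flags)
```

Because the test is a substring match, a one-character keyword such as 药 also fires on 药业; this is why the single characters taken from keywords ending in 业 widen the match.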


In [11]:
df_kw_full

Unnamed: 0,Result.Name,Industry2,Industry3
0,北京晓通网络科技有限公司,网络科技,科技
1,天津市百利电气有限公司,电气,电气
2,北京永隆新立自控工程有限公司,工程,工程
3,北京奥之通科技发展有限公司,发展,发展
4,北京华兴恒通电器有限公司,电器,电器
5,北京北整新研自控成套设备有限公司,成套设备,设备
6,北京永创达科技发展有限公司,发展,发展
7,北京易艾斯德科技有限公司,科技,科技
8,新兴重工集团有限公司,重工,重工
9,北京利顺和科技有限公司,科技,科技


In [12]:
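'''
    Balance the training set so that positives and negatives are comparable in size.
    Treat the task as a binary classification problem.
'''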
df_samples_forEC = df_full.merge(pd.concat([df_sample_oem[:50],df_sample_PaB[:50],df_sample_dist[:50],df_sample_SI[:50],df_sample_EC], axis = 0), on = 'CNOC', how = 'inner') #keep df_full's index to identified samples
X_EC = sparse_result[df_samples_forEC['Index'],:].toarray()


df_samples_forEC['Label'] = df_samples_forEC['Label'].astype('category')
df_samples_forEC['Label_Code'] = df_samples_forEC['Label'].cat.codes

y_EC = np.where(df_samples_forEC.Label == 'End-customer', 1,0)
categoty_EC = list(set(y_EC))

In [13]:
'''
    Feed X_EC and y_EC (the matrix derived from 经营范围, business scope) to the first model
    (before stacking) and report the cross-validation results.
'''
use_data = X_EC
df_result_from_model1 = pd.DataFrame(index = [i for i in range(len(use_data))], columns = ['Model1_Output'])
result_svm_EC = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(use_data, y_EC)
for train_index, test_index in skf.split(use_data, y_EC):
    X_train, X_test = use_data[train_index], use_data[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]

    clf = SVC(kernel='linear', C=5)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_svm_EC.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.mean(result_svm_EC,axis = 0)[:,1],
        u'F1': np.mean(result_svm_EC,axis = 0)[:,2],
        u'Support': np.mean(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.std(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.std(result_svm_EC,axis = 0)[:,1],
        u'F1': np.std(result_svm_EC,axis = 0)[:,2],
        u'Support': np.std(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

  Label  Precision    Recall        F1  Support
0     0   0.917203  0.823158  0.855448     19.2
1     1   0.863692  0.922381  0.886751     20.8
2    总体   0.889273  0.875000  0.871838     40.0
  Label  Precision    Recall        F1  Support
0     0   0.071932  0.163059  0.105757      0.4
1     1   0.102595  0.075707  0.066065      0.4
2    总体   0.063256  0.078262  0.082874      0.0
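
The first-stage loop stores, for every sample, the prediction made by a model that never saw that sample during fitting (out-of-fold predictions). The sketch below illustrates the idea in plain Python. It is a simplification: folds are dealt out round-robin per class rather than with scikit-learn's `StratifiedKFold`, and the "model" is a stand-in that always predicts the majority class of its training folds.

```python
from collections import Counter, defaultdict


def stratified_folds(y, n_splits):
    """Assign each sample to a fold, dealing class members out round-robin."""
    fold_of = [0] * len(y)
    seen = defaultdict(int)
    for i, label in enumerate(y):
        fold_of[i] = seen[label] % n_splits
        seen[label] += 1
    return fold_of


def out_of_fold_predictions(y, n_splits, fit, predict):
    """Predict every sample with a model fitted on the other folds only."""
    fold_of = stratified_folds(y, n_splits)
    oof = [None] * len(y)
    for fold in range(n_splits):
        train_idx = [i for i in range(len(y)) if fold_of[i] != fold]
        test_idx = [i for i in range(len(y)) if fold_of[i] == fold]
        model = fit(train_idx)
        for i in test_idx:
            oof[i] = predict(model, i)
    return oof


# Toy labels and a stand-in model (majority class of the training folds)
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def fit(train_idx):
    return Counter(y[i] for i in train_idx).most_common(1)[0][0]


def predict(model, i):
    return model


oof = out_of_fold_predictions(y, n_splits=2, fit=fit, predict=predict)
print(oof)
```

Every position in `oof` is filled exactly once, so the vector can serve as an input feature for a second-stage model without that feature having been fitted on the sample's own label.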


In [14]:
'''
    Model stacking
    Add the features derived from the company name
    Feed the first model's output into the second model
'''
df_name_features = pd.DataFrame({'EC_Name':EC_name_count_list,'EC_Name_exclude':EC_name_excl_count_list})
df_industry_features = pd.get_dummies(df_full['Result.Industry.IndustryCode'])
df_input_model2 = pd.concat([df_result_from_model1
                             ,  df_name_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                            ], axis = 1).values

In [50]:
df_input_model2

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 1],
       ...,
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]], dtype=int64)

In [15]:
'''
    Predictions of the second model (after stacking); these are the final results
'''
result_lr_w2v = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(df_input_model2, y_EC)
for train_index, test_index in skf.split(df_input_model2, y_EC):
    X_train, X_test = df_input_model2[train_index], df_input_model2[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]
    clf = SVC(kernel='rbf', C=5).fit(X_train, y_train)
    y_pred =clf.predict(X_test)
    result_lr_w2v.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_lr_w2v,axis = 0)[:,0],
        u'Recall': np.mean(result_lr_w2v,axis = 0)[:,1],
        u'F1': np.mean(result_lr_w2v,axis = 0)[:,2],
        u'Support': np.mean(result_lr_w2v,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

  Label  Precision    Recall        F1  Support
0     0   0.925947  0.937368  0.927103     19.2
1     1   0.948635  0.922857  0.931486     20.8
2    总体   0.937631  0.930000  0.929417     40.0


### model training - EC

In [16]:
'''
    Train on all the sample data, i.e. without cross-validation, to obtain the final models
'''
clf_EC = SVC(kernel='linear', C=5,verbose=True)
clf_EC.fit(X_EC, y_EC)
clf_EC_stack = SVC(kernel='linear', C=5,verbose=True, probability = True).fit(df_input_model2, y_EC)

# Rows of df_full currently recorded as End-Customer in Trust IT
is_EC = df_full['Company Role in Trust IT']=='End-Customer'
X_full_EC = sparse_result[df_full[is_EC]['Index'],:].toarray()
df_EC = df_full[is_EC].copy()  # copy, so the new columns are not written to a slice of df_full
df_EC['Prediction'] = clf_EC.predict(X_full_EC)
df_EC_stack = pd.concat([df_EC[['Prediction']].reset_index(drop=True)
                             ,  df_name_features.iloc[df_full[is_EC]['Index'],:].reset_index(drop = True)
                            ], axis = 1).values
df_EC['Prediction_2'] = clf_EC_stack.predict(df_EC_stack)
proba = clf_EC_stack.predict_proba(df_EC_stack)  # columns follow clf_EC_stack.classes_
df_EC['Prediction_2_Prob_other'] = proba[:,0]
df_EC['Prediction_2_Prob1_EC'] = proba[:,1]
df_EC_final = pd.concat([df_EC.reset_index(drop = True),pd.DataFrame(df_EC_stack, columns = ['Pred_Model1','KW_incl','KW_excl'])], axis = 1).copy(deep = True)
# If neither keyword flag is set, force the final prediction to 1; otherwise keep the stacked model's prediction
df_EC_final['Final_Pred'] = np.where(df_EC_final['KW_incl']+df_EC_final['KW_excl']==0, 1,df_EC_final['Prediction_2']) 
df_EC_final.Final_Pred.value_counts()

[LibSVM][LibSVM]


1    9763
0    1507
Name: Final_Pred, dtype: int64

In [17]:
df_EC_final[['KW_incl','KW_excl','Prediction','Prediction_2', 'Prediction_2_Prob_other', 'Prediction_2_Prob1_EC']]

Unnamed: 0,KW_incl,KW_excl,Prediction,Prediction_2,Prediction_2_Prob_other,Prediction_2_Prob1_EC
0,0,1,0,0,0.988593,0.011407
1,0,0,1,1,0.157537,0.842463
2,1,0,1,1,0.008579,0.991421
3,1,0,0,1,0.157537,0.842463
4,0,0,0,0,0.801862,0.198138
5,0,1,0,0,0.988593,0.011407
6,0,0,0,0,0.801862,0.198138
7,0,0,0,0,0.801862,0.198138
8,0,0,0,0,0.801862,0.198138
9,0,1,0,0,0.988593,0.011407
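
The `Final_Pred` rule computed with `np.where` can be restated row by row. A minimal sketch, with a hypothetical helper name and made-up rows:

```python
def final_prediction(kw_incl, kw_excl, stacked_pred):
    """Force class 1 when neither keyword flag is set; otherwise keep the stacked prediction."""
    if kw_incl + kw_excl == 0:
        return 1
    return stacked_pred


# (KW_incl, KW_excl, Prediction_2), made-up rows for illustration
rows = [
    (0, 1, 0),  # exclusion keyword present -> stacked prediction kept
    (0, 0, 0),  # no keyword information -> forced to 1
    (1, 0, 1),  # inclusion keyword present -> stacked prediction kept
]
finals = [final_prediction(*row) for row in rows]
print(finals)
```

Since the rows scored here are already recorded as End-Customer, the rule leaves that role in place unless a keyword gives the stacked model a reason to disagree.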


In [18]:
'''
    Save the models
'''
with open(r'01 Script generated models\ec_model.pkl', 'wb') as f:
    pickle.dump(((name_EC, name_EC_excl),(clf_EC,clf_EC_stack)), f)
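
On the prediction side, the saved file has to be unpacked in the same nested order it was dumped in. The round trip below is a sketch with stand-in objects (plain strings instead of fitted classifiers, a temporary directory instead of the project folder); how scratch.py actually loads the file is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Stand-ins for the real objects: two keyword lists and two fitted classifiers
payload = ((['钢铁'], ['电气']), ('first_model', 'stacked_model'))

path = os.path.join(tempfile.mkdtemp(), 'ec_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(payload, f)

# Loading unpacks in the same nested order that was dumped
with open(path, 'rb') as f:
    (kw_incl, kw_excl), (clf_first, clf_stack) = pickle.load(f)

print(kw_incl, kw_excl, clf_first, clf_stack)
```

Unpickling real scikit-learn estimators generally needs a scikit-learn version compatible with the one used for training, so the training and GUI environments are best kept in sync.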

## PaB
### [Back to Catalogue](#训练模型——目录)

In [19]:
'''
    Repeat the earlier steps: preprocess the data, extract company-name keywords,
    and apply the tf-idf model to 经营范围 (business scope)
'''
df_kw_full = df_full[['Result.Name']].copy()  # copy, so the new columns are not written to a slice of df_full

inds_2 = []
companies_list_full = df_kw_full['Result.Name'].values
for i in range(len(companies_list_full)):
    company_name_1_splitted = lcut(companies_list_full[i], HMM=True)
    geo_extracted_1 = extract_word(company_name_1_splitted, geo_list)
    company_name_1 = remove_words(companies_list_full[i],geo_extracted_1)
    company_name_2 = remove_words(company_name_1,stopwords_2)
    # Cut the name at the earliest position where any suffix marker starts
    min_idx = len(company_name_2)
    for suffix in suffixes:
        suffix_idx = company_name_2.find(suffix)
        if 0 <= suffix_idx < min_idx:
            min_idx = suffix_idx
    company_name_3 = company_name_2[:min_idx]
    company_name_3_splitted = lcut(company_name_3, HMM=True)
    # An empty remainder yields no tokens; record an empty keyword in that case
    inds_2.append(company_name_3_splitted[-1] if company_name_3_splitted else '')
    
df_kw_full['Industry2'] = inds_2
df_kw_full['Industry3'] = np.where(df_kw_full['Industry2'].apply(len)>=4, df_kw_full['Industry2'].str[-2:],df_kw_full['Industry2'] )



# PaB names (the EC_* variable names are reused from the End-customer section)
name_EC, name_EC_excl=get_names('pab',0.2,0)

EC_name_count_list = []
EC_name_excl_count_list = []

for company_name in df_kw_full.Industry3.values:
    EC_name_count_list.append(int(any(kw in company_name for kw in name_EC)))
    EC_name_excl_count_list.append(int(any(kw in company_name for kw in name_EC_excl)))
    
# Business-scope (经营范围) features newly added by Ben
'''
经营范围 (business scope) range in all
'''
df_kw_full = df_full[['经营范围']]
# Adding business-scope features improves accuracy. The selected words
# (low voltage, switch, power distribution) are passed to the stacked model.
range_EC=['低压','开关','配电']
EC_range_count_list=[]

for scope in df_kw_full.经营范围.values:
    EC_range_count_list.append(int(any(kw in scope for kw in range_EC)))



In [20]:
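'''
    Balance the training set so that positives and negatives are comparable in size.
    Treat the task as a binary classification problem.
'''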
df_samples_forEC = df_full.merge(pd.concat([df_sample_oem[:50],df_sample_PaB,df_sample_dist[:50],df_sample_SI[:50],df_sample_EC[:50]], axis = 0), on = 'CNOC', how = 'inner') #keep df_full's index to identified samples
X_EC = sparse_result[df_samples_forEC['Index'],:].toarray()


df_samples_forEC['Label'] = df_samples_forEC['Label'].astype('category')
df_samples_forEC['Label_Code'] = df_samples_forEC['Label'].cat.codes

y_EC = np.where(df_samples_forEC.Label == 'PaB', 1,0)
categoty_EC = list(set(y_EC))

In [21]:
'''
    Feed X_EC and y_EC (the matrix derived from 经营范围, business scope) to the first model
    (before stacking) and report the cross-validation results.
'''
use_data = X_EC
df_result_from_model1 = pd.DataFrame(index = [i for i in range(len(use_data))], columns = ['Model1_Output'])
result_svm_EC = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(use_data, y_EC)
for train_index, test_index in skf.split(use_data, y_EC):
    X_train, X_test = use_data[train_index], use_data[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]

    clf = SVC(kernel='linear', C=1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_svm_EC.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.mean(result_svm_EC,axis = 0)[:,1],
        u'F1': np.mean(result_svm_EC,axis = 0)[:,2],
        u'Support': np.mean(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.std(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.std(result_svm_EC,axis = 0)[:,1],
        u'F1': np.std(result_svm_EC,axis = 0)[:,2],
        u'Support': np.std(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

  Label  Precision    Recall        F1  Support
0     0   0.759773  0.835380  0.792043     18.7
1     1   0.794125  0.689706  0.729338     16.5
2    总体   0.776498  0.766508  0.762585     35.2
  Label  Precision    Recall        F1   Support
0     0   0.123779  0.139673  0.120817  0.458258
1     1   0.155836  0.173934  0.154437  0.500000
2    总体   0.130280  0.132682  0.134555  0.400000
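
The two tables printed above are the mean and the standard deviation of each metric over the ten folds (`np.mean` and `np.std` with `axis = 0`); the row labelled 总体 is the overall row. A small pure-Python sketch of the same aggregation for a single label. The helper `prf` and the fold data are made up for illustration, and `statistics.pstdev` is used because `np.std` defaults to the population standard deviation:

```python
from statistics import mean, pstdev

def prf(y_true, y_pred, label):
    """Precision, recall and F1 for one label, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical (y_true, y_pred) pairs standing in for two CV folds.
folds = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 0, 0], [1, 1, 1, 0]),
]
per_fold = [prf(t, p, label=1) for t, p in folds]

# Column-wise mean and population std across folds, like axis=0 in NumPy.
fold_mean = [mean(col) for col in zip(*per_fold)]
fold_std = [pstdev(col) for col in zip(*per_fold)]
print(fold_mean, fold_std)
```

Each fold in the run above holds roughly 35 samples, so the reported spread of about 0.12 to 0.17 is worth keeping in view when comparing mean scores.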


In [22]:
'''
    Model stacking
    Add features derived from the company name
    Feed the first model's output into the second model
'''
df_name_features = pd.DataFrame({'EC_Name':EC_name_count_list,'EC_Name_exclude':EC_name_excl_count_list})

df_industry_features = pd.get_dummies(df_full['Result.Industry.IndustryCode'])
df_range_features=pd.DataFrame({'EC_range':EC_range_count_list}) ##business-scope feature newly added by Ben
df_input_model2 = pd.concat([df_result_from_model1
                             ,  df_name_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                             ,df_range_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                            ], axis = 1).values
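
The stacking step above feeds the first model's out-of-fold predictions, together with the keyword flags, into a second model. A minimal sketch on synthetic data, assuming scikit-learn and NumPy are available. It uses `cross_val_predict` as a shorthand for the manual fold loop, and the random features and flags are stand-ins rather than the notebook's data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 60
y = np.array([0, 1] * (n // 2))
# Stand-in for the TF-IDF matrix: noisy features that partly follow the label.
X_text = rng.normal(size=(n, 5)) + y[:, None]
# Stand-in for the 0/1 keyword flags (name include, name exclude, scope).
flags = rng.integers(0, 2, size=(n, 3))

skf = StratifiedKFold(n_splits=5)
# Out-of-fold predictions: each row is predicted by a model that never saw it.
oof_pred = cross_val_predict(SVC(kernel="linear", C=1), X_text, y, cv=skf)

# Second-stage input: [model-1 prediction, keyword flags].
X_stack = np.column_stack([oof_pred, flags])
stack_model = SVC(kernel="linear", C=5).fit(X_stack, y)
print(X_stack.shape)
```

Using out-of-fold predictions keeps the second model from being trained on outputs that the first model produced for rows it had already seen.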

In [23]:
'''
    Prediction results of the second (post-stacking) model (final).
'''
result_lr_w2v = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(df_input_model2, y_EC)
for train_index, test_index in skf.split(df_input_model2, y_EC):
    X_train, X_test = df_input_model2[train_index], df_input_model2[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]
    clf = SVC(kernel='linear', C=5).fit(X_train, y_train)
    y_pred =clf.predict(X_test)
    result_lr_w2v.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_lr_w2v,axis = 0)[:,0],
        u'Recall': np.mean(result_lr_w2v,axis = 0)[:,1],
        u'F1': np.mean(result_lr_w2v,axis = 0)[:,2],
        u'Support': np.mean(result_lr_w2v,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])
#Method 2

### model training - PaB

In [None]:
clf_PAB = SVC(kernel='linear', C=5,verbose=True)
clf_PAB.fit(X_EC, y_EC)
clf_PAB_stack = SVC(kernel='linear', C=5,verbose=True, probability = True).fit(df_input_model2, y_EC)
X_full_EC = sparse_result[df_full[df_full['Company Role in Trust IT']=='Panel Builder/Cabinet Builder']['Index'],:].toarray()
df_EC = df_full[df_full['Company Role in Trust IT']=='Panel Builder/Cabinet Builder']
df_EC['Prediction'] = clf_PAB.predict(X_full_EC)
df_EC_stack = pd.concat([df_EC[['Prediction']].reset_index(drop=True)
                             ,  df_name_features.iloc[df_full[df_full['Company Role in Trust IT']=='Panel Builder/Cabinet Builder']['Index'],:].reset_index(drop = True)
                             ,df_range_features.iloc[df_full[df_full['Company Role in Trust IT']=='Panel Builder/Cabinet Builder']['Index'],:].reset_index(drop = True)
                            ], axis = 1).values
df_EC['Prediction_2'] = clf_PAB_stack.predict(df_EC_stack)
df_EC['Prediction_2_Prob_other'] = clf_PAB_stack.predict_proba(df_EC_stack)[:,0]
df_EC['Prediction_2_Prob1_PaB'] = clf_PAB_stack.predict_proba(df_EC_stack)[:,1]
df_EC_final = pd.concat([df_EC.reset_index(drop = True),pd.DataFrame(df_EC_stack, columns = ['Pred_Model1','KW_incl','KW_excl','Range_KW'])], axis = 1).copy(deep = True)
df_EC_final[['KW_incl','KW_excl','Prediction','Prediction_2', 'Prediction_2_Prob_other', 'Prediction_2_Prob1_PaB']]
df_EC_final['Final_Pred'] = df_EC_final['Prediction_2']
df_EC_final.Final_Pred.value_counts()
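
The two probability columns above are taken from `predict_proba(...)[:,0]` and `[:,1]`. In scikit-learn the columns follow the order of `clf.classes_`, so with 0/1 labels column 0 is the "other" class and column 1 is the target class. A small check on synthetic data (the data is made up). Note that with `probability = True` the probabilities come from an internal cross-validated calibration, and the scikit-learn documentation warns that they can disagree with `predict` on some rows:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = np.array([0, 1] * 20)
X = rng.normal(size=(40, 3)) + 2 * y[:, None]

clf = SVC(kernel="linear", C=5, probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba(X)

# Column i of predict_proba corresponds to clf.classes_[i].
print(clf.classes_)
prob_other, prob_target = proba[:, 0], proba[:, 1]
```

Calling `predict_proba` once and slicing the result, as here, also avoids computing the same probabilities twice.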

In [None]:
'''
    Save the model
'''
with open(r'01 Script generated models\pab_model.pkl', 'wb') as f:
    pickle.dump(((name_EC, name_EC_excl),(clf_PAB,clf_PAB_stack)), f)
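
The cell above pickles a nested tuple `((name_EC, name_EC_excl), (clf, clf_stack))`. A sketch of writing and reading back that structure, with plain strings standing in for the fitted classifiers and a temporary file standing in for the real path (both are placeholders):

```python
import os
import pickle
import tempfile

name_incl, name_excl = ["kw_a", "kw_b"], ["kw_c"]    # toy keyword lists
model_1, model_2 = "stage-1 model", "stage-2 model"  # placeholders for classifiers

path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(((name_incl, name_excl), (model_1, model_2)), f)

# Unpack in the same nested shape that was written.
with open(path, "rb") as f:
    (loaded_incl, loaded_excl), (loaded_m1, loaded_m2) = pickle.load(f)

print(loaded_incl, loaded_m2)
```

Pickled scikit-learn models are only reliable when loaded with the library version that produced them, and pickle files should only be loaded from trusted sources.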

## OEM
### [Back to Catalogue](#训练模型——目录)

In [None]:
'''
    Repeat the earlier steps: preprocess the data, extract company-name keywords, and apply the tf-idf model to the business scope (经营范围).
'''
df_kw_full = df_full[['Result.Name']]

inds_2 = []
companies_list_full = df_kw_full['Result.Name'].values
for i in range(len(companies_list_full)):
    company_name_1_splitted = [i for i in lcut(companies_list_full[i], HMM=True)]
    geo_extracted_1 = extract_word(company_name_1_splitted, geo_list)
    company_name_1 = remove_words(companies_list_full[i],geo_extracted_1)
    company_name_2 = remove_words(company_name_1,stopwords_2)
    min_idx = len(company_name_2)
    for suffix in suffixes:
        try:
            suffix_idx = company_name_2.index(suffix)
        except:
            suffix_idx = min_idx
        if  suffix_idx < min_idx:
            min_idx = company_name_2.index(suffix) 
    company_name_3 = company_name_2[:min_idx]
    company_name_3_splitted = [i for i in lcut(company_name_3, HMM=True)]
    try:
        inds_2.append(company_name_3_splitted[-1])
    except:
        inds_2.append('')
    
df_kw_full['Industry2'] = inds_2
df_kw_full['Industry3'] = np.where(df_kw_full['Industry2'].apply(len)>=4, df_kw_full['Industry2'].str[-2:],df_kw_full['Industry2'] )


#OEM names (the variables keep the EC prefix)
name_EC, name_EC_excl=get_names('oem',0.2,0.1)

EC_name_count=0
EC_name_excl_count=0
EC_name_count_list = []
EC_name_excl_count_list = []

for i in range(len(df_kw_full.Industry3)):
    EC_name_count = 0
    EC_name_excl_count = 0
    company_name = df_kw_full.Industry3.values[i]
    for kw in name_EC:
        if kw in company_name:
            EC_name_count = 1
            break
    for kw in name_EC_excl:
        if kw in company_name:
            EC_name_excl_count = 1
            break
    EC_name_count_list.append(EC_name_count)
    EC_name_excl_count_list.append(EC_name_excl_count)
    
#The business-scope (经营范围) feature below was newly added by Ben
'''
Business scope (经营范围): range keywords over all companies
'''
#('oem',0.2,0.1)
df_kw_full = df_full[['经营范围']]
range_EC=['机器人','铸造','智能']
EC_range_count=0
EC_range_count_list=[]

for i in range(len(df_kw_full.经营范围)):
    EC_range_count = 0
    company_name = df_kw_full.经营范围.values[i]
    for kw in range_EC:
        if kw in company_name:
            EC_range_count = 1
            break
    EC_range_count_list.append(EC_range_count)
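
The first loop in the cell above cuts each cleaned company name at the earliest legal-form suffix and keeps the last segmented word as an industry keyword. A sketch of the cutting step and the length rule, using `str.find` instead of try/except. The jieba segmentation is omitted, and the suffix list and company name are invented for illustration:

```python
def cut_at_first_suffix(name, suffixes):
    """Return the part of `name` before the earliest-occurring suffix."""
    cut = len(name)
    for suffix in suffixes:
        idx = name.find(suffix)      # -1 when the suffix is absent
        if idx != -1 and idx < cut:
            cut = idx
    return name[:cut]

def shorten_keyword(token):
    """Mirror the Industry3 rule: tokens of 4+ characters keep only their last 2."""
    return token[-2:] if len(token) >= 4 else token

# Hypothetical suffix list and company name, for illustration only.
suffixes = ["有限公司", "股份", "集团"]
core = cut_at_first_suffix("华东自动化设备集团股份有限公司", suffixes)
print(core, shorten_keyword("自动化设备"))
```

Because `str.find` returns -1 for a missing suffix, no exception handling is needed and each suffix is searched only once.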


In [None]:
'''
    Balance the model: keep the positive and negative training sets balanced. Treat this as a binary classification problem.
'''
df_samples_forEC = df_full.merge(pd.concat([df_sample_oem,df_sample_PaB[:50],df_sample_dist[:50],df_sample_SI[:50],df_sample_EC[:50]], axis = 0), on = 'CNOC', how = 'inner') #keep df_full's index to identify the samples
X_EC = sparse_result[df_samples_forEC['Index'],:].toarray()


df_samples_forEC['Label'] = df_samples_forEC['Label'].astype('category')
df_samples_forEC['Label_Code'] = df_samples_forEC['Label'].cat.codes

y_EC = np.where(df_samples_forEC.Label == 'OEM', 1,0)
categoty_EC = list(set(y_EC))

In [None]:
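'''
    Apply X_EC and y_EC (the matrix obtained by transforming the business-scope text) to the first (pre-stacking) model and report the cross-validation results.
'''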
use_data = X_EC
df_result_from_model1 = pd.DataFrame(index = [i for i in range(len(use_data))], columns = ['Model1_Output'])
result_svm_EC = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(use_data, y_EC)
for train_index, test_index in skf.split(use_data, y_EC):
    X_train, X_test = use_data[train_index], use_data[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]

    clf = SVC(kernel='linear', C=1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_svm_EC.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.mean(result_svm_EC,axis = 0)[:,1],
        u'F1': np.mean(result_svm_EC,axis = 0)[:,2],
        u'Support': np.mean(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.std(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.std(result_svm_EC,axis = 0)[:,1],
        u'F1': np.std(result_svm_EC,axis = 0)[:,2],
        u'Support': np.std(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

In [None]:
'''
    Model stacking
    Add features derived from the company name
    Feed the first model's output into the second model
'''
df_name_features = pd.DataFrame({'EC_Name':EC_name_count_list,'EC_Name_exclude':EC_name_excl_count_list})
df_range_features=pd.DataFrame({'EC_range':EC_range_count_list}) ##business-scope feature newly added by Ben
df_industry_features = pd.get_dummies(df_full['Result.Industry.IndustryCode'])
df_input_model2 = pd.concat([df_result_from_model1
                            # ,  df_cust_val_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                             ,  df_name_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                           # ,  df_kw_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                            ,df_range_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)##business-scope feature newly added by Ben
                            ], axis = 1).values

In [None]:
'''
    Prediction results of the second (post-stacking) model (final).
'''
result_lr_w2v = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(df_input_model2, y_EC)
for train_index, test_index in skf.split(df_input_model2, y_EC):
    X_train, X_test = df_input_model2[train_index], df_input_model2[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]
    
    #clf = LogisticRegression(random_state=0, solver='lbfgs',class_weight ='balanced').fit(X_train, y_train)
    clf = SVC(kernel='linear', C=5).fit(X_train, y_train)
    y_pred =clf.predict(X_test)
    #df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_lr_w2v.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_lr_w2v,axis = 0)[:,0],
        u'Recall': np.mean(result_lr_w2v,axis = 0)[:,1],
        u'F1': np.mean(result_lr_w2v,axis = 0)[:,2],
        u'Support': np.mean(result_lr_w2v,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])
#Method 2

### model training - OEM

In [24]:
clf_OEM = SVC(kernel='linear', C=5,verbose=True)
clf_OEM.fit(X_EC, y_EC)
clf_OEM_stack = SVC(kernel='linear', C=5,verbose=True, probability = True).fit(df_input_model2, y_EC)
X_full_EC = sparse_result[df_full[df_full['Company Role in Trust IT']=='OEM']['Index'],:].toarray()
df_EC = df_full[df_full['Company Role in Trust IT']=='OEM']
df_EC['Prediction'] = clf_OEM.predict(X_full_EC)
df_EC_stack = pd.concat([df_EC[['Prediction']].reset_index(drop=True)
                             ,  df_name_features.iloc[df_full[df_full['Company Role in Trust IT']=='OEM']['Index'],:].reset_index(drop = True)
                             ,df_range_features.iloc[df_full[df_full['Company Role in Trust IT']=='OEM']['Index'],:].reset_index(drop = True)
                            ], axis = 1).values
df_EC['Prediction_2'] = clf_OEM_stack.predict(df_EC_stack)
df_EC['Prediction_2_Prob_other'] = clf_OEM_stack.predict_proba(df_EC_stack)[:,0]
df_EC['Prediction_2_Prob1_OEM'] = clf_OEM_stack.predict_proba(df_EC_stack)[:,1]
df_EC_final = pd.concat([df_EC.reset_index(drop = True),pd.DataFrame(df_EC_stack, columns = ['Pred_Model1','KW_incl','KW_excl','Range_KW'])], axis = 1).copy(deep = True)
df_EC_final[['KW_incl','KW_excl','Prediction','Prediction_2', 'Prediction_2_Prob_other', 'Prediction_2_Prob1_OEM']]
df_EC_final['Final_Pred'] = df_EC_final['Prediction_2']
df_EC_final.Final_Pred.value_counts()

[LibSVM][LibSVM]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

0    15585
1     1448
Name: Final_Pred, dtype: int64
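
The SettingWithCopyWarning shown above is raised because `df_EC` is a filtered selection of `df_full` and new columns are then assigned to it. One common way to avoid it, sketched here on a toy frame (the values are made up), is to take an explicit copy of the filtered rows before adding columns:

```python
import pandas as pd

df_full_demo = pd.DataFrame({
    "Company Role in Trust IT": ["OEM", "Other", "OEM"],
    "Index": [0, 1, 2],
})

mask = df_full_demo["Company Role in Trust IT"] == "OEM"
# .copy() makes df_role an independent frame, so assignments are unambiguous.
df_role = df_full_demo.loc[mask].copy()
df_role["Prediction"] = [1, 0]

print(df_role)
```

Computing the mask once also avoids repeating the same filter expression for every feature table.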

In [25]:
'''
    Save the model
'''
with open(r'01 Script generated models\oem_model.pkl', 'wb') as f:
    pickle.dump(((name_EC, name_EC_excl),(clf_OEM,clf_OEM_stack)), f)

## SI
### [Back to Catalogue](#训练模型——目录)

In [26]:
'''
    Repeat the earlier steps: preprocess the data, extract company-name keywords, and apply the tf-idf model to the business scope (经营范围).
'''
df_kw_full = df_full[['Result.Name']]

inds_2 = []
companies_list_full = df_kw_full['Result.Name'].values
for i in range(len(companies_list_full)):
    company_name_1_splitted = [i for i in lcut(companies_list_full[i], HMM=True)]
    geo_extracted_1 = extract_word(company_name_1_splitted, geo_list)
    company_name_1 = remove_words(companies_list_full[i],geo_extracted_1)
    company_name_2 = remove_words(company_name_1,stopwords_2)
    min_idx = len(company_name_2)
    for suffix in suffixes:
        try:
            suffix_idx = company_name_2.index(suffix)
        except:
            suffix_idx = min_idx
        if  suffix_idx < min_idx:
            min_idx = company_name_2.index(suffix) 
    company_name_3 = company_name_2[:min_idx]
    company_name_3_splitted = [i for i in lcut(company_name_3, HMM=True)]
    try:
        inds_2.append(company_name_3_splitted[-1])
    except:
        inds_2.append('')
    
df_kw_full['Industry2'] = inds_2
df_kw_full['Industry3'] = np.where(df_kw_full['Industry2'].apply(len)>=4, df_kw_full['Industry2'].str[-2:],df_kw_full['Industry2'] )



#SI names (the variables keep the EC prefix)
name_EC, name_EC_excl=get_names('si',0.1,0.1)

EC_name_count=0
EC_name_excl_count=0
EC_name_count_list = []
EC_name_excl_count_list = []

for i in range(len(df_kw_full.Industry3)):
    EC_name_count = 0
    EC_name_excl_count = 0
    company_name = df_kw_full.Industry3.values[i]
    for kw in name_EC:
        if kw in company_name:
            EC_name_count = 1
            break
    for kw in name_EC_excl:
        if kw in company_name:
            EC_name_excl_count = 1
            break
    EC_name_count_list.append(EC_name_count)
    EC_name_excl_count_list.append(EC_name_excl_count)
    
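#The business-scope (经营范围) feature below was newly added by Ben
'''
Business scope (经营范围): range keywords over all companies
'''
df_kw_full = df_full[['经营范围']]
range_EC=[]#['编程','开发','计算机','承包','系统集成','软件','服务','外包','系统']#the model is slightly better without the range keywords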
EC_range_count=0
EC_range_count_list=[]

for i in range(len(df_kw_full.经营范围)):
    EC_range_count = 0
    company_name = df_kw_full.经营范围.values[i]
    for kw in range_EC:
        if kw in company_name:
            EC_range_count = 1
            break
    EC_range_count_list.append(EC_range_count)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
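
Each feature built in the cell above is a 0/1 flag that records whether any keyword from a list occurs in a string. A compact sketch with `any`; `keyword_flag` is a hypothetical helper and the sample strings are invented. With an empty keyword list, as with `range_EC=[]` here, the flag is 0 for every row and so adds no information to the second model:

```python
def keyword_flag(text, keywords):
    """Return 1 if any keyword is a substring of text, else 0."""
    return int(any(kw in text for kw in keywords))

texts = ["工业机器人研发", "软件开发与系统集成", "五金批发"]  # invented samples

flags_scope = [keyword_flag(t, ["机器人", "铸造", "智能"]) for t in texts]
flags_empty = [keyword_flag(t, []) for t in texts]  # any([]) is False

print(flags_scope, flags_empty)
```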


In [27]:
'''
    Balance the model: keep the positive and negative training sets balanced. Treat this as a binary classification problem.
'''
df_samples_forEC = df_full.merge(pd.concat([df_sample_oem[:100],df_sample_PaB[:100],df_sample_dist[:100],df_sample_SI,df_sample_EC[:100]], axis = 0), on = 'CNOC', how = 'inner') #keep df_full's index to identify the samples
X_EC = sparse_result[df_samples_forEC['Index'],:].toarray()


df_samples_forEC['Label'] = df_samples_forEC['Label'].astype('category')
df_samples_forEC['Label_Code'] = df_samples_forEC['Label'].cat.codes

y_EC = np.where(df_samples_forEC.Label == 'SI', 1,0)
categoty_EC = list(set(y_EC))

In [28]:
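'''
    Apply X_EC and y_EC (the matrix obtained by transforming the business-scope text) to the first (pre-stacking) model and report the cross-validation results.
'''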
use_data = X_EC
df_result_from_model1 = pd.DataFrame(index = [i for i in range(len(use_data))], columns = ['Model1_Output'])
result_svm_EC = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(use_data, y_EC)
for train_index, test_index in skf.split(use_data, y_EC):
    X_train, X_test = use_data[train_index], use_data[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]

    clf = SVC(kernel='linear', C=1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_svm_EC.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.mean(result_svm_EC,axis = 0)[:,1],
        u'F1': np.mean(result_svm_EC,axis = 0)[:,2],
        u'Support': np.mean(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.std(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.std(result_svm_EC,axis = 0)[:,1],
        u'F1': np.std(result_svm_EC,axis = 0)[:,2],
        u'Support': np.std(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

  Label  Precision    Recall        F1  Support
0     0   0.748917  0.668563  0.686809     37.1
1     1   0.784899  0.823512  0.794419     48.1
2    总体   0.769319  0.755882  0.747527     85.2
  Label  Precision    Recall        F1  Support
0     0   0.033971  0.207769  0.122485      0.3
1     1   0.109855  0.070166  0.031985      0.3
2    总体   0.060795  0.058231  0.069118      0.4


In [29]:
'''
    Model stacking
    Add features derived from the company name
    Feed the first model's output into the second model
'''
df_name_features = pd.DataFrame({'EC_Name':EC_name_count_list,'EC_Name_exclude':EC_name_excl_count_list})
df_range_features=pd.DataFrame({'EC_range':EC_range_count_list}) ##business-scope feature newly added by Ben
df_industry_features = pd.get_dummies(df_full['Result.Industry.IndustryCode'])
df_input_model2 = pd.concat([df_result_from_model1
                             ,  df_name_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                            ,df_range_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)##business-scope feature newly added by Ben
                            ], axis = 1).values

In [30]:
'''
    Prediction results of the second (post-stacking) model (final).
'''
result_lr_w2v = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(df_input_model2, y_EC)
for train_index, test_index in skf.split(df_input_model2, y_EC):
    X_train, X_test = df_input_model2[train_index], df_input_model2[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]
    
    #clf = LogisticRegression(random_state=0, solver='lbfgs',class_weight ='balanced').fit(X_train, y_train)
    clf = SVC(kernel='linear', C=5).fit(X_train, y_train)
    y_pred =clf.predict(X_test)
    #df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_lr_w2v.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_lr_w2v,axis = 0)[:,0],
        u'Recall': np.mean(result_lr_w2v,axis = 0)[:,1],
        u'F1': np.mean(result_lr_w2v,axis = 0)[:,2],
        u'Support': np.mean(result_lr_w2v,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])
#Method 2

  Label  Precision    Recall        F1  Support
0     0   0.765277  0.700853  0.712708     37.1
1     1   0.804888  0.831803  0.809250     48.1
2    总体   0.787702  0.774679  0.767192     85.2


### model training - SI

In [31]:
clf_SI = SVC(kernel='linear', C=5,verbose=True)
clf_SI.fit(X_EC, y_EC)
clf_SI_stack = SVC(kernel='linear', C=5,verbose=True, probability = True).fit(df_input_model2, y_EC)
X_full_EC = sparse_result[df_full[df_full['Company Role in Trust IT']=='System Integrator/VA Partner']['Index'],:].toarray()
df_EC = df_full[df_full['Company Role in Trust IT']=='System Integrator/VA Partner']
df_EC['Prediction'] = clf_SI.predict(X_full_EC)
df_EC_stack = pd.concat([df_EC[['Prediction']].reset_index(drop=True)
                             ,  df_name_features.iloc[df_full[df_full['Company Role in Trust IT']=='System Integrator/VA Partner']['Index'],:].reset_index(drop = True)
                             ,df_range_features.iloc[df_full[df_full['Company Role in Trust IT']=='System Integrator/VA Partner']['Index'],:].reset_index(drop = True)
                            ], axis = 1).values
df_EC['Prediction_2'] = clf_SI_stack.predict(df_EC_stack)
df_EC['Prediction_2_Prob_other'] = clf_SI_stack.predict_proba(df_EC_stack)[:,0]
df_EC['Prediction_2_Prob1_SI'] = clf_SI_stack.predict_proba(df_EC_stack)[:,1]
df_EC_final = pd.concat([df_EC.reset_index(drop = True),pd.DataFrame(df_EC_stack, columns = ['Pred_Model1','KW_incl','KW_excl','Range_KW'])], axis = 1).copy(deep = True)
df_EC_final[['KW_incl','KW_excl','Prediction','Prediction_2', 'Prediction_2_Prob_other', 'Prediction_2_Prob1_SI']]
df_EC_final['Final_Pred'] = df_EC_final['Prediction_2']
df_EC_final.Final_Pred.value_counts()

[LibSVM][LibSVM]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

1    3322
0    2102
Name: Final_Pred, dtype: int64

In [32]:
'''
    Save the model
'''
with open(r'01 Script generated models\si_model.pkl', 'wb') as f:
    pickle.dump(((name_EC, name_EC_excl),(clf_SI,clf_SI_stack)), f)

## DISTRIBUTOR
### [Back to Catalogue](#训练模型——目录)

In [33]:
'''
    Repeat the earlier steps: preprocess the data, extract company-name keywords, and apply the tf-idf model to the business scope (经营范围).
'''
df_kw_full = df_full[['Result.Name']]

inds_2 = []
companies_list_full = df_kw_full['Result.Name'].values
for i in range(len(companies_list_full)):
    company_name_1_splitted = [i for i in lcut(companies_list_full[i], HMM=True)]
    geo_extracted_1 = extract_word(company_name_1_splitted, geo_list)
    company_name_1 = remove_words(companies_list_full[i],geo_extracted_1)
    company_name_2 = remove_words(company_name_1,stopwords_2)
    min_idx = len(company_name_2)
    for suffix in suffixes:
        try:
            suffix_idx = company_name_2.index(suffix)
        except:
            suffix_idx = min_idx
        if  suffix_idx < min_idx:
            min_idx = company_name_2.index(suffix) 
    company_name_3 = company_name_2[:min_idx]
    company_name_3_splitted = [i for i in lcut(company_name_3, HMM=True)]
    try:
        inds_2.append(company_name_3_splitted[-1])
    except:
        inds_2.append('')
    
df_kw_full['Industry2'] = inds_2
df_kw_full['Industry3'] = np.where(df_kw_full['Industry2'].apply(len)>=4, df_kw_full['Industry2'].str[-2:],df_kw_full['Industry2'] )



#Distributor names (the variables keep the EC prefix)
name_EC, name_EC_excl=get_names('dist',0.13, 0.1)

EC_name_count=0
EC_name_excl_count=0
EC_name_count_list = []
EC_name_excl_count_list = []

for i in range(len(df_kw_full.Industry3)):
    EC_name_count = 0
    EC_name_excl_count = 0
    company_name = df_kw_full.Industry3.values[i]
    for kw in name_EC:
        if kw in company_name:
            EC_name_count = 1
            break
    for kw in name_EC_excl:
        if kw in company_name:
            EC_name_excl_count = 1
            break
    EC_name_count_list.append(EC_name_count)
    EC_name_excl_count_list.append(EC_name_excl_count)
    
#The business-scope (经营范围) feature below was newly added by Ben
'''
Business scope (经营范围): range keywords over all companies
'''
df_kw_full = df_full[['经营范围']]
range_EC=[]
EC_range_count=0
EC_range_count_list=[]

for i in range(len(df_kw_full.经营范围)):
    EC_range_count = 0
    company_name = df_kw_full.经营范围.values[i]
    for kw in range_EC:
        if kw in company_name:
            EC_range_count = 1
            break
    EC_range_count_list.append(EC_range_count)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [34]:
'''
    Balance the model: keep the positive and negative training sets balanced. Treat this as a binary classification problem.
'''
df_samples_forEC = df_full.merge(pd.concat([df_sample_oem[:50],df_sample_PaB[:50],df_sample_dist,df_sample_SI[:50],df_sample_EC[:50]], axis = 0), on = 'CNOC', how = 'inner') #keep df_full's index to identify the samples
X_EC = sparse_result[df_samples_forEC['Index'],:].toarray()


df_samples_forEC['Label'] = df_samples_forEC['Label'].astype('category')
df_samples_forEC['Label_Code'] = df_samples_forEC['Label'].cat.codes

y_EC = np.where(df_samples_forEC.Label == 'Distributor', 1,0)
categoty_EC = list(set(y_EC))

In [35]:
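'''
    Apply X_EC and y_EC (the matrix obtained by transforming the business-scope text) to the first (pre-stacking) model and report the cross-validation results.
'''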
use_data = X_EC
df_result_from_model1 = pd.DataFrame(index = [i for i in range(len(use_data))], columns = ['Model1_Output'])
result_svm_EC = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(use_data, y_EC)
for train_index, test_index in skf.split(use_data, y_EC):
    X_train, X_test = use_data[train_index], use_data[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]

    clf = SVC(kernel='linear', C=1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_svm_EC.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.mean(result_svm_EC,axis = 0)[:,1],
        u'F1': np.mean(result_svm_EC,axis = 0)[:,2],
        u'Support': np.mean(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.std(result_svm_EC,axis = 0)[:,0],
        u'Recall': np.std(result_svm_EC,axis = 0)[:,1],
        u'F1': np.std(result_svm_EC,axis = 0)[:,2],
        u'Support': np.std(result_svm_EC,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

  Label  Precision    Recall        F1  Support
0     0   0.769918  0.693860  0.718224     18.4
1     1   0.769947  0.815952  0.783678     20.2
2    总体   0.769731  0.758570  0.752854     38.6
  Label  Precision    Recall        F1   Support
0     0   0.137145  0.223316  0.170759  0.489898
1     1   0.151901  0.108123  0.104075  0.400000
2    总体   0.132217  0.128293  0.133049  0.489898


In [36]:
'''
    Model stacking
    Add features derived from the company name
    Feed the first model's output into the second model
'''
df_name_features = pd.DataFrame({'EC_Name':EC_name_count_list,'EC_Name_exclude':EC_name_excl_count_list})
df_range_features=pd.DataFrame({'EC_range':EC_range_count_list}) ##business-scope feature newly added by Ben
df_industry_features = pd.get_dummies(df_full['Result.Industry.IndustryCode'])
df_input_model2 = pd.concat([df_result_from_model1
                             ,  df_name_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                            ,df_range_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)##business-scope feature newly added by Ben
                            ], axis = 1).values

In [37]:
'''
    Prediction results of the second (post-stacking) model (final).
'''
result_lr_w2v = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(df_input_model2, y_EC)
for train_index, test_index in skf.split(df_input_model2, y_EC):
    X_train, X_test = df_input_model2[train_index], df_input_model2[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]
    clf = SVC(kernel='linear', C=5).fit(X_train, y_train)
    y_pred =clf.predict(X_test)
    result_lr_w2v.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['总体'],
        u'Precision': np.mean(result_lr_w2v,axis = 0)[:,0],
        u'Recall': np.mean(result_lr_w2v,axis = 0)[:,1],
        u'F1': np.mean(result_lr_w2v,axis = 0)[:,2],
        u'Support': np.mean(result_lr_w2v,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])

  Label  Precision    Recall        F1  Support
0     0   0.775044  0.731871  0.741806     18.4
1     1   0.796817  0.815952  0.798101     20.2
2    总体   0.786208  0.776788  0.771685     38.6


### model training - Distributor

In [38]:
clf_DIST = SVC(kernel='linear', C=5,verbose=True)
clf_DIST.fit(X_EC, y_EC)
clf_DIST_stack = SVC(kernel='linear', C=5,verbose=True, probability = True).fit(df_input_model2, y_EC)
X_full_EC = sparse_result[df_full[df_full['Company Role in Trust IT']=='Wholesaler/Distributor(EWT)']['Index'],:].toarray()
df_EC = df_full[df_full['Company Role in Trust IT']=='Wholesaler/Distributor(EWT)']
df_EC['Prediction'] = clf_DIST.predict(X_full_EC)
df_EC_stack = pd.concat([df_EC[['Prediction']].reset_index(drop=True)
                             ,  df_name_features.iloc[df_full[df_full['Company Role in Trust IT']=='Wholesaler/Distributor(EWT)']['Index'],:].reset_index(drop = True)
                             ,df_range_features.iloc[df_full[df_full['Company Role in Trust IT']=='Wholesaler/Distributor(EWT)']['Index'],:].reset_index(drop = True)
                            ], axis = 1).values
df_EC['Prediction_2'] = clf_DIST_stack.predict(df_EC_stack)
df_EC['Prediction_2_Prob_other'] = clf_DIST_stack.predict_proba(df_EC_stack)[:,0]
df_EC['Prediction_2_Prob1_DIST'] = clf_DIST_stack.predict_proba(df_EC_stack)[:,1]
df_EC_final = pd.concat([df_EC.reset_index(drop = True),pd.DataFrame(df_EC_stack, columns = ['Pred_Model1','KW_incl','KW_excl','Range_KW'])], axis = 1).copy(deep = True)
df_EC_final[['KW_incl','KW_excl','Prediction','Prediction_2', 'Prediction_2_Prob_other', 'Prediction_2_Prob1_DIST']]
df_EC_final['Final_Pred'] = df_EC_final['Prediction_2']
df_EC_final.Final_Pred.value_counts()

[LibSVM][LibSVM]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  

1    771
0    308
Name: Final_Pred, dtype: int64

In [39]:
'''
    Save the model
'''
with open(r'01 Script generated models\dist_model.pkl', 'wb') as f:
    pickle.dump(((name_EC, name_EC_excl),(clf_DIST,clf_DIST_stack)), f)

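Training of all models ends here.  
See 'C:\Users\Z0044A8U\Desktop\GUI\scratch.py' for how predictions are made on new data.  
In addition, scratch.py must be updated accordingly if either of the following happens:  
- A model is updated (TF-IDF, the per-class sub-model for any of the five classes......)
- The data preprocessing strategy is updated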
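
To make the dependency on scratch.py concrete, the sketch below round-trips a tuple with the same layout as the one written to dist_model.pkl above: ((include keywords, exclude keywords), (base classifier, stacked classifier)). The keyword lists, the string stand-ins for the classifiers, and the temporary file path are all placeholders chosen for illustration. How scratch.py actually loads the file is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Placeholder contents; in the notebook these are name_EC, name_EC_excl,
# clf_DIST and clf_DIST_stack (plain strings stand in for the fitted classifiers).
keywords_incl = ['药业', '水泥']
keywords_excl = ['贸易', '销售']
payload = ((keywords_incl, keywords_excl),
           ('clf_DIST placeholder', 'clf_DIST_stack placeholder'))

# Hypothetical path inside a temp directory, so no real model file is overwritten.
model_path = os.path.join(tempfile.mkdtemp(), 'dist_model.pkl')

with open(model_path, 'wb') as f:
    pickle.dump(payload, f)

# The loading side has to unpack with exactly the same nesting.
with open(model_path, 'rb') as f:
    (loaded_incl, loaded_excl), (loaded_clf, loaded_clf_stack) = pickle.load(f)

print(loaded_incl, loaded_excl)
```

If the tuple layout changes on the training side, the unpacking line on the prediction side has to change with it, which is one reason the two scripts need to be kept in sync.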

In [57]:
'''
Extract the keywords of each class:
Method: multinomial logistic regression, using the coefficients as importance
Input:  train_corpus list: len=1304
        X: (1304, 3236)
        y: len=1304
'''
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

tfidf_model = TfidfVectorizer(token_pattern='\\b\\w+\\b')
tfidf_model.fit(train_corpus)
sparse_result = tfidf_model.transform(train_corpus)
print(sparse_result.shape)

X = sparse_result.toarray()
y=list(y)


logit=LogisticRegression(multi_class="multinomial",solver="newton-cg")
logit.fit(X,y)
predicted = logit.predict(X)
print(classification_report(y, predicted))
#tfidf_model.get_feature_names()  # get the position (column index) of each word

NameError: name 'train_corpus' is not defined
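
The cell above stops at fitting the model; the coefficient-ranking step described in its docstring could look roughly like the sketch below. It is written in plain Python on a toy coefficient matrix so that it runs on its own. The helper name `top_keywords_per_class`, the toy numbers, and the class labels are assumptions for illustration, not part of the original notebook. With the fitted objects one would presumably pass `logit.coef_`, the vectorizer's feature names (`get_feature_names_out()` in recent scikit-learn versions, `get_feature_names()` in older ones), and `logit.classes_`.

```python
def top_keywords_per_class(coef, feature_names, classes, k=3):
    """Return {class: [k feature names with the largest coefficients]}."""
    result = {}
    for cls, row in zip(classes, coef):
        # Sort feature indices by coefficient value, largest first.
        order = sorted(range(len(feature_names)), key=lambda i: row[i], reverse=True)
        result[cls] = [feature_names[i] for i in order[:k]]
    return result


# Toy stand-ins (one row of coefficients per class, one column per word).
toy_coef = [
    [0.9, -0.2, 0.1, -0.5],
    [-0.4, 0.8, -0.1, 0.3],
    [-0.5, -0.6, 0.0, 0.2],
]
toy_features = ['药业', '贸易', '工程', '设备']
toy_classes = ['EC', 'DIST', 'SI']

top = top_keywords_per_class(toy_coef, toy_features, toy_classes, k=2)
for cls, words in top.items():
    print(cls, words)
```

This relies on the coefficient matrix having one row per class, which holds for a multinomial fit with more than two classes; a binary fit yields a single row and would need separate handling.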

### Appendix: company-name keywords for each class

In [608]:
def print_keywords(title, keywords):
    # Print the title on its own line, then the keywords separated by spaces
    print(title)
    for kw in keywords:
        print(kw, end=' ')
    print()

# (display label, class key, and the two numeric arguments passed to get_names unchanged)
for label, key, a, b in [('EC', 'ec', 0.2, 0.1),
                         ('PaB', 'pab', 0.2, 0),
                         ('OEM', 'oem', 0.2, 0.1),
                         ('SI', 'si', 0.1, 0.1),
                         ('Dist', 'dist', 0.13, 0.1)]:
    incl, excl = get_names(key, a, b)
    print_keywords(label + '(inc):', incl)
    print_keywords(label + '(exc):', excl)

EC(inc):
生物 轴承 昆药 糖业 化工 焊接 纤维 机场 磷化 动力 学校 白药 交通 热电 水电 产业 建设 钛业 圆锗业 铝业 热电站 发动机 锂业 铸造 硅业 太阳能 橡胶 郎酒 藏药 食品 半导体 铜业 车灯 锅炉 化学 水务 水泥 排水 零部件 药业 自来水 汽车 矿业 制药 纸业 发电 新能源 建材 产品 不锈钢 铅锌 天然气 学院 制水 处理 酒业 硝化棉 糖 钛 锗 铝 锂 硅 铜 药 矿 纸 酒 
EC(exc):
设备 工程 技术 电气 机械 机床 医药 信息 配套 软件 服务 龙科技 自控 高科技 研究院 设计院 科贸 工贸 仪器 电控 供应链 管理 通科技 物资 控系统 商贸 电机 环境 智能 仪表 空调 销售 控制 机器人 物流 机器 机电 鼓风机 集成 自动化 器械 研究所 华科技 电力 成套 电子 液压 泵业 重工 开关 风机 金属 贸易 器材 车辆 泵 
PaB(inc):
电控 供应链 管理 电力 开关 
PaB(exc):
鼓风机 华科技 机器 机器人 集成 控制 销售 空调 物流 器材 器械 机床 车灯 硅业 铸造 锂业 热电站 铝业 大学 圆锗业 钛业 热电 金属 太阳能 风机 自控 龙科技 服务 软件 高科技 配套 医药 仪器 电机 信息 研究院 设计院 车辆 泵业 液压 仪表 智能 环境 通科技 科贸 重工 橡胶 郎酒 化学 纸业 水泥 水务 制品 酒业 处理 材料 制药 排水 机械 零部件 新能源 投资 制水 发电 能源 天然气 药业 锅炉 开发 工业 汽车 自来水 硝化棉 学院 铅锌 不锈钢 环保 化工 糖业 昆药 轴承 生物 机场 建设 发动机 半导体 食品 藏药 焊接 控系统 纤维 水电 产品 建材 钢铁 矿业 产业 铜业 交通 白药 学校 动力 磷化 物资 硅 锂 铝 锗 钛 泵 纸 酒 药 糖 矿 铜 
OEM(inc):
液压 机器人 空调 华科技 物流 鼓风机 器材 器械 金属 机床 风机 车辆 泵业 机械 泵 
OEM(exc):
电气 电器 科贸 通科技 环境 智能 车灯 热电 钛业 圆锗业 大学 设计院 铝业 热电站 锂业 铸造 硅业 太阳能 橡胶 郎酒 藏药 食品 半导体 研究院 高科技 电控 供应链 管理 物资 控系统 商贸 电机 仪器 工贸 医药 信息 配套 软件 服务 龙科技 自控 电

### BKP -- discarded code

'''
    Additional Features - keywords
'''


    
'''
    Additional Features - Company Name
'''
name_SI=['工程', '控制', '系统', '研究', '设计']
name_EC=['天然气', '制水', '再生能源', '液化', '药业', '太阳能', '锅炉', '电厂', '发电厂', '航空', '自来水'
         , '汽车集团', '商用车', '零部件', '部件', '动力机械', '排水', '水泥', '化学', '纸业', '生物制药', '特种'
         , '酒业', '酒', '污水处理', '污水', '焊接', '硝化棉', '学院', '职业', '铅锌', '不锈钢', '气体', '矿业', '产业'
         , '铜业', '锌铟', '学校', '磷', '建设', '建设工程', '纤维', '糖业', '铁路', '燃气', '显示', '集成电路', '机场'
         , '发动机', '地铁', '半导体', '橡胶', '锂', '金属材料', '硅业', '热电', '电站', '安装', '技师', '铝业', '锗'
         , '钛业', '经济', '灯', '汽车', '发电', '水务', '新能源', '建材', '药', '投资', '动力', '生物科技', '钢铁', '食品']
        # ,'生物', '钢铁集团', '光电', '轨道交通', '交通', '轨道', '水电', '制药', '材料', '能源', '玻璃', '制品'
               # ,'产品', '资源', '科学', '通用', '轴承', '集成', '轮胎', '金属', '精工', '管理'
               # , '开发', '处理', '实业'
               # , '国际', '铸造']

name_EC_excl = ['自动', '自动化', '机电', '控制', '机电设备', '工程技术', '电子', '自控', '电气设备', '研究', '信息'
                , '电子科技', '成套', '控制技术', '控制系统', '开关', '系统工程', '机器', '重工', '研究所', '电工'
                , '工控', '信息技术', '机电工程', '电力设备', '仪表', '研究院', '机器人', '塑料机械', '塑料', '销售'
                , '贸易', '机械工程', '自动控制', '电控', '电器设备', '电气工程', '通科技', '设计', '机床', '数控'
                , '控制工程', '包装', '供水', '工贸', '机械制造', '城市', '船舶', '开关厂', '伟业', '仪器', '科技开发'
                , '高科', '高科技', '软件', '风机', '锻压', '物流', '空调', '数控机械', '制冷', "电器", '商贸', '科贸'
                , '传动', '技术开发', '网络', '成套设备', '节能', '油田', '系统集成', '医药', '分析仪器', '车辆', '医疗'
                , '医疗器械', '起重', '重型机械', '纺织机械', '纺织', '测控', '液压', '流体', '热工', '设备厂', '环境工程'
                , '供应链', '通讯', '设计院', '智慧', '数据', '建筑', '建筑材料', '中控', '高新', '生物医药', '联合', '软控'
                #, '石油'
                , '精机', '鼓风机', '橡塑', '塑机', '人工', '印染机械', '印刷', '清洗', '木业', '风电', '造纸', '器材'
                , '生化', '电子设备', '精密机械', '精密', '人造板', '机械厂', '聚氨酯', '风能', '粉体', '微波', '机车'
                , '机车车辆', '食品机械', '净化', '电热', '木工机械', '创新', '空调设备', '玻璃机械', '矿山机械', '防爆'
                , '生物工程', '泵业', '环球', '磁电', '热能', '发电设备', '安全', '电器厂', '高压', '真空', '感应'
                , '电子电器', '数码科技', '数码', '电源', '五金', '众业', '物资', '测试', '试验', '试验设备', '交通设备'
                , '能源技术', '核工业', '仪器仪表', '在线', '楼宇', '运输', '服务', '洁净', '科普', '配套', '高新技术'
                , '电气传动', '医药集团', '高新区', '仪器厂', '智控'
                ,'装备', '系统', '机械', '技术', '智能', '电气', '设备', '工程'
                
               ]
name_PAB=['电气成套', '开关']

SI_name_count_list = []
EC_name_count_list = []
EC_name_excl_count_list = []
PaB_name_count_list = []
for i in range(len(df_full)):
    SI_name_count = 0
    EC_name_count = 0
    EC_name_excl_count = 0
    PaB_name_count = 0
    company_name = df_full['Result.Name'].values[i]
    for kw in name_SI:
        if kw in company_name:
            SI_name_count = 1
    for kw in name_EC:
        if kw in company_name:
            EC_name_count = 1
    for kw in name_EC_excl:
        if kw in company_name:
            EC_name_excl_count = 1
    for kw in name_PAB:
        if kw in company_name:
            PaB_name_count = 1
    SI_name_count_list.append(SI_name_count)
    EC_name_count_list.append(EC_name_count)
    EC_name_excl_count_list.append(EC_name_excl_count)
    PaB_name_count_list.append(PaB_name_count)
    
    
    
'''
    Extract the industry

'''
flag_list2 = [] 
for i in range(len(df_kw_full)):
    industry3 = df_kw_full['Industry3'].values[i]
    flag = 0
    for j in industry_list:
        if j in industry3:
            flag = 1
    flag_list2.append(flag)

df_kw_full['Flag'] = flag_list2



'''
    Models tried while training the EC model

'''

result_lr_w2v = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(df_input_model2, y_EC)
for train_index, test_index in skf.split(df_input_model2, y_EC):
    X_train, X_test = df_input_model2[train_index], df_input_model2[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]
    
    #clf = LogisticRegression(random_state=0, solver='lbfgs',class_weight ='balanced').fit(X_train, y_train)
    clf = SVC(kernel='linear', C=5).fit(X_train, y_train)
    y_pred =clf.predict(X_test)
    #df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_lr_w2v.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['Overall'],
        u'Precision': np.mean(result_lr_w2v,axis = 0)[:,0],
        u'Recall': np.mean(result_lr_w2v,axis = 0)[:,1],
        u'F1': np.mean(result_lr_w2v,axis = 0)[:,2],
        u'Support': np.mean(result_lr_w2v,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])
# Amber's method

result_lr_w2v = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(df_input_model2, y_EC)
for train_index, test_index in skf.split(df_input_model2, y_EC):
    X_train, X_test = df_input_model2[train_index], df_input_model2[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]
    
    #clf = LogisticRegression(random_state=0, solver='lbfgs',class_weight ='balanced').fit(X_train, y_train)
    clf = SVC(kernel='linear', C=5).fit(X_train, y_train)
    y_pred =clf.predict(X_test)
    #df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_lr_w2v.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['Overall'],
        u'Precision': np.mean(result_lr_w2v,axis = 0)[:,0],
        u'Recall': np.mean(result_lr_w2v,axis = 0)[:,1],
        u'F1': np.mean(result_lr_w2v,axis = 0)[:,2],
        u'Support': np.mean(result_lr_w2v,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])
# Method 1

In [90]:
#EC Names
name_EC, name_EC_excl=get_names('ec',0.2,0.1)

EC_name_count=0
EC_name_excl_count=0
EC_name_count_list = []
EC_name_excl_count_list = []

for i in range(len(df_full)):
    EC_name_count = 0
    EC_name_excl_count = 0
    company_name = df_full['Result.Name'].values[i]
    for kw in name_EC:
        if kw in company_name:
            EC_name_count = 1
            break
    for kw in name_EC_excl:
        if kw in company_name:
            EC_name_excl_count = 1
            break
    EC_name_count_list.append(EC_name_count)
    EC_name_excl_count_list.append(EC_name_excl_count)
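
The include/exclude flags built in the cell above amount to a substring test per company name. A compact equivalent is sketched below. The helper name `keyword_flag`, the sample company names, and the shortened keyword lists are hypothetical and only there to make the snippet runnable on its own. In the notebook the inputs would be `df_full['Result.Name']`, `name_EC` and `name_EC_excl`.

```python
def keyword_flag(company_name, keywords):
    """Return 1 if any keyword is a substring of company_name, else 0."""
    # any() stops at the first match, like the `break` in the explicit loop.
    return int(any(kw in company_name for kw in keywords))


# Hypothetical sample names and shortened keyword lists, for illustration only.
sample_names = ['某某药业有限公司', '某某机电设备销售有限公司', '某某水泥贸易有限公司']
incl_keywords = ['药业', '水泥']
excl_keywords = ['销售', '贸易', '机电']

incl_flags = [keyword_flag(name, incl_keywords) for name in sample_names]
excl_flags = [keyword_flag(name, excl_keywords) for name in sample_names]
print(incl_flags)  # [1, 0, 1]
print(excl_flags)  # [0, 1, 1]
```

The third name shows why both flags are kept as separate features: a name can match an include keyword and an exclude keyword at the same time, and the stacked model is left to weigh the two.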

In [94]:
# Stacking
# Add additional keyword features
#df_kw_features = pd.DataFrame({'KW_Count_SI': [x / len(kw_biz_scope_SI_2) for x in SI_kw_count_list],'KW_Count_OEM': [x / len(kw_biz_scope_OEM_2) for x in OEM_kw_count_list],'KW_Count_PaB': [x / len(kw_biz_scope_PAB_2) for x in PaB_kw_count_list]})
df_name_features = pd.DataFrame({'EC_Name':EC_name_count_list,'EC_Name_exclude':EC_name_excl_count_list})
df_industry_features = pd.get_dummies(df_full['Result.Industry.IndustryCode'])
#df_cust_val_features = df_full[['Q4']]
#df_cust_val_features = df_full[['Direct Val']]
#from sklearn.preprocessing import StandardScaler
#scaler = StandardScaler()
#df_cust_val_features  = pd.DataFrame(scaler.fit_transform(df_cust_val_features))

#df_input_model2 = pd.concat([df_result_from_model1,  df_kw_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True), df_name_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)], axis = 1).values
df_input_model2 = pd.concat([df_result_from_model1
                            # ,  df_cust_val_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                             ,  df_name_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                           # ,  df_kw_features.iloc[df_samples_forEC['Index'],:].reset_index(drop = True)
                            ], axis = 1).values

In [24]:
result_lr_w2v = []
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(df_input_model2, y_EC)
for train_index, test_index in skf.split(df_input_model2, y_EC):
    X_train, X_test = df_input_model2[train_index], df_input_model2[test_index]
    y_train, y_test = y_EC[train_index], y_EC[test_index]
    
    #clf = LogisticRegression(random_state=0, solver='lbfgs',class_weight ='balanced').fit(X_train, y_train)
    clf = SVC(kernel='linear', C=5).fit(X_train, y_train)
    y_pred =clf.predict(X_test)
    #df_result_from_model1.loc[test_index,'Model1_Output'] = y_pred
    result_lr_w2v.append(eval_model(y_test, y_pred, categoty_EC).iloc[:,1:].values)

print(pd.DataFrame({
        u'Label': categoty_EC+['Overall'],
        u'Precision': np.mean(result_lr_w2v,axis = 0)[:,0],
        u'Recall': np.mean(result_lr_w2v,axis = 0)[:,1],
        u'F1': np.mean(result_lr_w2v,axis = 0)[:,2],
        u'Support': np.mean(result_lr_w2v,axis = 0)[:,3]
    })[['Label','Precision','Recall','F1','Support']])
# Method 1

     Label  Precision    Recall        F1  Support
0        0   0.775431  0.837719  0.800681     18.7
1        1   0.797204  0.707353  0.740680     16.5
2  Overall   0.785834  0.776513  0.772603     35.2
