# Data Preprocessing: Black Friday Shopping Data


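**Dataset description**

Black Friday in the United States is roughly the counterpart of China's Double 11 shopping festival, and purchase volume on that day is very high.
The original dataset contains more than 500,000 purchase records. To keep the program fast, this case study uses only the first 50,000 records.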

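**Field descriptions**

'User_ID': string type; a unique ID for each customer. The IDs show that there are 6,039 customers in total.

'Product_ID': string type; a unique ID for each product.

'Gender': string type; the customer's gender; 2 values, F: female, M: male.

'Age': string type; the customer's age group; 7 values: 0-17, 18-25, 26-35, 36-45, 46-50, 51-55, and 55+.

'Occupation': string type; the customer's occupation; 21 values, the numbers 0-20, each standing for a different occupation.

'City_Category': string type; the customer's city; 3 values, the letters A, B, and C, representing three different cities.

'Stay_In_Current_City_Years': string type; how long the customer has lived in the current city; 4 values: 1, 2, 3, and 4+, in years (note: this field is converted to a numeric type during preprocessing).

'Marital_Status': string type; the customer's marital status; 2 values, 0: unmarried, 1: married.

'Product_Category_1': string type; the product category; 20 values, the numbers 1-20, standing for 20 different kinds of product such as food, books, and coffee.

'Purchase': numeric type; the amount the customer spent.

Note: the fields described as string type are categorical in meaning. pandas loads the numerically coded ones ('User_ID', 'Occupation', 'Marital_Status', 'Product_Category_1') with numeric dtypes, which is why some of them are converted to object type during feature engineering.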

In [10]:
# Import the required packages
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd 

In [2]:
# Load the data
data=pd.read_csv("./data/BlackFriday.csv")
print("Raw data info:")
data.info() 


Raw data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User_ID                     50000 non-null  int64  
 1   Product_ID                  50000 non-null  object 
 2   Gender                      50000 non-null  object 
 3   Age                         50000 non-null  object 
 4   Occupation                  50000 non-null  int64  
 5   City_Category               50000 non-null  object 
 6   Stay_In_Current_City_Years  50000 non-null  object 
 7   Marital_Status              49997 non-null  float64
 8   Product_Category_1          50000 non-null  int64  
 9   Product_Category_2          34279 non-null  float64
 10  Product_Category_3          15183 non-null  float64
 11  Purchase                    50000 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 4.6+ MB
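
In the output above, Product_Category_2 has 34,279 non-null values out of 50,000 (about 31% missing) and Product_Category_3 has 15,183 (about 70% missing), while Marital_Status is missing only 3 values. A per-column missing count and ratio makes this easier to read than the non-null counts. The sketch below is a minimal illustration on a made-up toy frame; the `toy` values are assumptions and are not rows from BlackFriday.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration
toy = pd.DataFrame({
    "Marital_Status": [0.0, 1.0, np.nan, 0.0],
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Purchase": [8370, 15200, 1422, 1057],
})

# Number of missing values and fraction of missing values per column
missing_count = toy.isna().sum()
missing_ratio = toy.isna().mean()

print(pd.DataFrame({"missing": missing_count, "ratio": missing_ratio}))
```

On the real data, the same two lines could be applied to `data` right after loading it.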


In [3]:
# Handle missing data

# data.info() shows that Marital_Status has missing values
# (Disabled) Convert the categorical fields to object type
# toObjFields=["Product_Category_1","Product_Category_2","Product_Category_3","Occupation","Marital_Status"]
# data[toObjFields]=data[toObjFields].astype("object")
# data.describe(include="all")

# Marital_Status has only 3 missing values; fill them with 0 (unmarried)
# Assign the result back instead of using inplace=True on a column selection
data['Marital_Status'] = data['Marital_Status'].fillna(0)
print("\nFilled the three missing Marital_Status values")
data.info() 

# Product_Category_2 and Product_Category_3 have too many missing values, and User_ID and Product_ID are not useful for training, so drop them
data.drop(["Product_Category_2","Product_Category_3","User_ID","Product_ID"],axis=1,inplace=True)
print("\nDropped the unused columns")
data.info() 


Filled the three missing Marital_Status values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   User_ID                     50000 non-null  int64  
 1   Product_ID                  50000 non-null  object 
 2   Gender                      50000 non-null  object 
 3   Age                         50000 non-null  object 
 4   Occupation                  50000 non-null  int64  
 5   City_Category               50000 non-null  object 
 6   Stay_In_Current_City_Years  50000 non-null  object 
 7   Marital_Status              50000 non-null  float64
 8   Product_Category_1          50000 non-null  int64  
 9   Product_Category_2          34279 non-null  float64
 10  Product_Category_3          15183 non-null  float64
 11  Purchase                    50000 non-null  int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 4.6+ MB

Dropped the unused columns
<class 'pa
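
Filling the missing Marital_Status values with 0 hard-codes "unmarried". One alternative, which the notebook does not use, is to fill with the most frequent value of the column. The sketch below shows that variant on a made-up series; the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up Marital_Status values with one missing entry
status = pd.Series([1.0, 1.0, np.nan, 0.0], name="Marital_Status")

# mode() ignores NaN by default; take the first mode if there are ties
most_frequent = status.mode().iloc[0]

# Assign the filled result to a new variable rather than modifying in place
filled = status.fillna(most_frequent)

print(most_frequent)
print(filled.tolist())
```

With only 3 missing values out of 50,000, either choice should have very little effect on the model.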

In [4]:
# Feature engineering
data.describe(include="all")

# Occupation, Marital_Status and Product_Category_1 should be categorical data
toObjFields=["Occupation","Marital_Status","Product_Category_1"]
data[toObjFields]=data[toObjFields].astype("object")
print("Converted Occupation, Marital_Status and Product_Category_1 to object type")
data.info() 

# Stay_In_Current_City_Years should be numeric data
# Map "4+" to "4", then cast the whole column to int (assigned back instead of inplace=True)
data["Stay_In_Current_City_Years"]=data["Stay_In_Current_City_Years"].replace("4+","4").astype("int")
print("\nConverted Stay_In_Current_City_Years to a numeric type")
data["Stay_In_Current_City_Years"].value_counts()

# Build dummy variables
# Some categories, such as ratings A, B, C, are ordered (A ranks above B), but for gender and similar fields the categories are unordered, so they need to be converted to dummy variables
# Creating too many dummy variables can make the feature dimension too high for later model training
data=pd.get_dummies(data,drop_first=True)
# One-hot encoding is used here. One-hot columns are linearly dependent: a customer who is not F must be M, so one column is enough instead of two. drop_first=True removes the redundant column of each feature
print("\nBuilt dummy variables with one-hot encoding")
data.head()


Converted Occupation, Marital_Status and Product_Category_1 to object type
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Gender                      50000 non-null  object
 1   Age                         50000 non-null  object
 2   Occupation                  50000 non-null  object
 3   City_Category               50000 non-null  object
 4   Stay_In_Current_City_Years  50000 non-null  object
 5   Marital_Status              50000 non-null  object
 6   Product_Category_1          50000 non-null  object
 7   Purchase                    50000 non-null  int64 
dtypes: int64(1), object(7)
memory usage: 3.1+ MB

Converted Stay_In_Current_City_Years to a numeric type

Built dummy variables with one-hot encoding


Unnamed: 0,Stay_In_Current_City_Years,Purchase,Gender_M,Age_18-25,Age_26-35,Age_36-45,Age_46-50,Age_51-55,Age_55+,Occupation_1,...,Product_Category_1_9,Product_Category_1_10,Product_Category_1_11,Product_Category_1_12,Product_Category_1_13,Product_Category_1_14,Product_Category_1_15,Product_Category_1_16,Product_Category_1_17,Product_Category_1_18
0,2,8370,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,15200,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,1422,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,2,1057,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,4,7969,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
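
To make the effect of `drop_first=True` concrete, the sketch below compares full one-hot encoding with the reduced encoding on a made-up three-row frame. The `toy` values are assumptions for illustration, and `dtype=int` is passed only so that the columns hold 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Toy frame with two unordered categorical columns; the values are made up
toy = pd.DataFrame({
    "Gender": ["F", "M", "M"],
    "City_Category": ["A", "B", "C"],
    "Purchase": [8370, 15200, 1422],
})

# Full one-hot encoding: one column per category level
full = pd.get_dummies(toy, dtype=int)

# Reduced encoding: the first level of each feature is dropped
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding the Gender columns always sum to 1,
# which is the linear dependence that drop_first removes
print((full["Gender_F"] + full["Gender_M"]).tolist())
```

The dropped level (here F and A) becomes the baseline: a row with all dummy columns of a feature equal to 0 belongs to that level.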


In [7]:
# Separate the independent variables (X) from the dependent variable (Y)
Y=data["Purchase"]
Y.head()
X=data.drop(["Purchase"],axis=1)
X.head()

Unnamed: 0,Stay_In_Current_City_Years,Gender_M,Age_18-25,Age_26-35,Age_36-45,Age_46-50,Age_51-55,Age_55+,Occupation_1,Occupation_2,...,Product_Category_1_9,Product_Category_1_10,Product_Category_1_11,Product_Category_1_12,Product_Category_1_13,Product_Category_1_14,Product_Category_1_15,Product_Category_1_16,Product_Category_1_17,Product_Category_1_18
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,2,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,4,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# Split into training and test sets (80% train, 20% test)
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.2)
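
The split above is random, so each run produces a different partition and a different score. If reproducible results are wanted, `train_test_split` accepts a `random_state` argument. The sketch below checks this on a small made-up array; the seed value 0 is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small made-up dataset: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two splits with the same seed should give identical partitions
first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same, first[1].shape)
```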

In [21]:
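# Feature scaling
scX=StandardScaler()
scY=StandardScaler()
# Fit the scaler on the training set only, then apply the same transform to the test set
x_train=scX.fit_transform(x_train)
x_test=scX.transform(x_test)
# Only y_train is standardized here; y_test stays in the original Purchase units
y_train=np.ravel(scY.fit_transform(y_train.values.reshape(-1,1)))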

### Trying a multiple linear regression model
 The results are terrible 😪

In [33]:

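# Instantiate the model and train it
reg=LinearRegression()
reg.fit(x_train,y_train)
print(reg.coef_)
# Generate predictions for the test set
y_pre=reg.predict(x_test)
# Compute the R² score
# Note: y_pre is on the standardized scale (the model was fit on the scaled y_train),
# while y_test is in raw Purchase units, so this score largely reflects the scale mismatch
r2_score(y_test,y_pre)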


[-4.25334717e-04 -6.56993789e-04 -3.17991869e-02 -2.72052322e-02
 -1.25319139e-02 -9.87180581e-03  1.13892483e-02  2.47435855e-03
 -1.29004791e-03  4.22358615e-03  1.30542998e-02  1.18047123e-02
  5.34704871e-03  1.10727628e-02  6.66216006e-03 -2.71764910e-04
  3.93187061e-03 -7.69506878e-03  9.29352113e-03  1.32143517e-02
  3.96607466e-03 -1.16254200e-04  1.11903486e-02 -7.10536069e-03
  9.38404779e-03  7.36723202e-04 -1.29751081e-02 -1.31437948e-02
  1.28449813e-02  5.07500935e-02 -3.49873595e-03 -9.91720586e-02
 -1.33645343e-01 -3.27625922e-01 -6.64147180e-01  9.29649342e-02
  4.41985496e-02 -5.03150020e-01  1.21557323e-02  1.13569755e-01
 -3.77780552e-01 -2.08441036e-01 -2.50683431e-01 -7.63104803e-03
  2.16106908e-02  2.55255465e-02 -2.71391429e-02 -1.53844285e-01]


-3.5508490048066914
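
A strongly negative R² like this is what one would expect when standardized predictions are scored against unscaled targets, as happens in the cell above. One way to address it is to map the predictions back to the original units with `scY.inverse_transform` before scoring. The sketch below shows the difference on synthetic data; the data-generating formula and the seed are assumptions, so it does not predict what score the real Black Friday data would give.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: a target with a large mean relative to its spread
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 9000 + 2000 * X[:, 0] - 1000 * X[:, 1] + rng.normal(scale=500, size=500)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same scaling scheme as the notebook: scale X, and scale only the training target
scX, scY = StandardScaler(), StandardScaler()
x_train_s = scX.fit_transform(x_train)
x_test_s = scX.transform(x_test)
y_train_s = scY.fit_transform(y_train.reshape(-1, 1)).ravel()

reg = LinearRegression().fit(x_train_s, y_train_s)
pred_scaled = reg.predict(x_test_s)

# Mismatched scales: standardized predictions scored against raw targets
r2_mismatched = r2_score(y_test, pred_scaled)

# Map the predictions back to the original units before scoring
pred = scY.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
r2_fixed = r2_score(y_test, pred)

print(r2_mismatched, r2_fixed)
```

An equivalent option is to skip scaling the target altogether, since ordinary least squares does not need a standardized target.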