# Research and Innovation Practice: Midterm In-Class Quiz (Pandas Data Analysis)

## Customer Personality Analysis (Marketing and Customer Data)
### Context
Problem Statement

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business better understand its customers and makes it easier to tailor products to the specific needs, behaviors, and concerns of different types of customers.

Customer personality analysis helps a business adapt its product to the target customers in each segment. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.

### Data Attributes

#### People

* ID: Customer's unique identifier
* Year_Birth: Customer's birth year
* Education: Customer's education level
* Marital_Status: Customer's marital status
* Income: Customer's yearly household income
* Kidhome: Number of children in customer's household
* Teenhome: Number of teenagers in customer's household
* Dt_Customer: Date of customer's enrollment with the company
* Recency: Number of days since customer's last purchase
* Complain: 1 if the customer complained in the last 2 years, 0 otherwise
#### Products

* MntWines: Amount spent on wine in the last 2 years
* MntFruits: Amount spent on fruits in the last 2 years
* MntMeatProducts: Amount spent on meat in the last 2 years
* MntFishProducts: Amount spent on fish in the last 2 years
* MntSweetProducts: Amount spent on sweets in the last 2 years
* MntGoldProds: Amount spent on gold in the last 2 years
#### Promotion

* NumDealsPurchases: Number of purchases made with a discount
* AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
* AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
* AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
* AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
* AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
* Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
#### Place

* NumWebPurchases: Number of purchases made through the company’s website
* NumCatalogPurchases: Number of purchases made using a catalogue
* NumStorePurchases: Number of purchases made directly in stores
* NumWebVisitsMonth: Number of visits to the company’s website in the last month

### Question 1: Import pandas, read the data file (../data/marketing_campaign.csv), and print the number of rows and columns

<div style='color: red;'>Note: the file is tab-delimited, so pass sep='\t' to pandas when reading it.</div>

In [1]:
import pandas as pd

In [5]:
path = '../data/marketing_campaign.csv'
# The file is tab-separated despite its .csv extension
data = pd.read_csv(path, sep='\t')

In [7]:
data.head()

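| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |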


In [8]:
print(data.shape)

(2240, 29)
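Why the separator matters can be seen on a small in-memory sample. The sketch below is only an illustration: it reuses the first three rows of the preview above, trimmed to four columns, and reads them from a string through `io.StringIO` rather than from the real file.

```python
import io

import pandas as pd

# Three rows from the preview, trimmed to four columns (not the full file).
sample_tsv = (
    "ID\tYear_Birth\tEducation\tIncome\n"
    "5524\t1957\tGraduation\t58138\n"
    "2174\t1954\tGraduation\t46344\n"
    "4141\t1965\tGraduation\t71613\n"
)

# With the default sep=',' every line lands in a single column.
wrong = pd.read_csv(io.StringIO(sample_tsv))

# With sep='\t' the fields are split as intended.
right = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

n_rows, n_cols = right.shape
print("default separator:", wrong.shape)
print(f"tab separator: {n_rows} rows, {n_cols} columns")
```

A single-column result after `read_csv` is usually the first hint that the separator is wrong.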


### Question 2: Add a column named age holding each customer's age, compute the mean age, and count the customers in the brackets $age \leq 30$, $30 < age \leq 40$, and $age \geq 40$

In [15]:
# Age is measured relative to a fixed reference year, 2022
data['age'] = 2022 - data['Year_Birth']
data['age'].mean()

53.19419642857143

In [11]:
len(data[data['age']<=30])

28

In [12]:
len(data[data['age']>=40])

1939

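In [14]:
# Remainder after removing age < 30 and age > 40, i.e. 30 <= age <= 40.
# Both endpoints are included here, so customers aged exactly 30 or 40
# are also counted in the two brackets above.
len(data) - len(data[data['age'] < 30]) - len(data[data['age'] > 40])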

331
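The brackets as stated share the boundary at 40, so one way to get counts that add up to the number of customers is to bin with `pd.cut`. The following is a minimal sketch on hypothetical ages, and it assumes the last bracket is meant as `age > 40`; the labels and sample values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bracket boundaries.
ages = pd.Series([25, 30, 31, 40, 41, 65], name="age")

# right=True (the default) closes each bin on the right:
# (-inf, 30], (30, 40], (40, inf]
brackets = pd.cut(
    ages,
    bins=[float("-inf"), 30, 40, float("inf")],
    labels=["age <= 30", "30 < age <= 40", "age > 40"],
)

# Each age falls into exactly one bracket, so the counts sum to len(ages).
counts = brackets.value_counts(sort=False)
print(counts)
```

On the real table the same call could be applied to `data['age']`.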

### Question 3: Distribution of customer education levels

In [16]:
data['Education'].value_counts()

Graduation    1127
PhD            486
Master         370
2n Cycle       203
Basic           54
Name: Education, dtype: int64
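A distribution is often easier to read as shares than as raw counts. As a rough sketch, the counts printed above can be turned into proportions; on the raw column, `data['Education'].value_counts(normalize=True)` would give the same proportions before rounding.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"Graduation": 1127, "PhD": 486, "Master": 370, "2n Cycle": 203, "Basic": 54},
    name="Education",
)

# Share of customers at each education level, rounded to four decimals.
shares = (counts / counts.sum()).round(4)
print(shares)
```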

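### Question 4: For each combination of children (Kidhome) and teenagers (Teenhome) in the household, compute the total household spending on fruits (MntFruits), meat (MntMeatProducts), and fish (MntFishProducts); arrange the result as a DataFrame and sort it by total spending.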

In [46]:
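# Sum the three spending columns within each (Kidhome, Teenhome) group,
# then add the three column totals into one figure per group
spend_cols = ['MntFruits', 'MntMeatProducts', 'MntFishProducts']
a1 = data.groupby(['Kidhome', 'Teenhome'])[spend_cols].sum().sum(axis=1)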

In [47]:
a2 = a1.to_frame('SUM').reset_index()
a2.head()

| | Kidhome | Teenhome | SUM |
|---|---|---|---|
| 0 | 0 | 0 | 319966 |
| 1 | 0 | 1 | 126368 |
| 2 | 0 | 2 | 5636 |
| 3 | 1 | 0 | 36766 |
| 4 | 1 | 1 | 22880 |


In [48]:
a2.sort_values('SUM',ascending=False)

| | Kidhome | Teenhome | SUM |
|---|---|---|---|
| 0 | 0 | 0 | 319966 |
| 1 | 0 | 1 | 126368 |
| 3 | 1 | 0 | 36766 |
| 4 | 1 | 1 | 22880 |
| 2 | 0 | 2 | 5636 |
| 5 | 1 | 2 | 3006 |
| 6 | 2 | 0 | 1192 |
| 7 | 2 | 1 | 1128 |
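If the per-category breakdown is also of interest, the grouping, the total, and the sort can be chained in one expression. This is a sketch on hypothetical toy rows: the column names follow the dataset, but the values and the `toy`, `spend_cols`, and `summary` names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "Kidhome":         [0, 0, 1, 1, 0],
    "Teenhome":        [0, 0, 0, 1, 1],
    "MntFruits":       [10, 20, 5, 1, 7],
    "MntMeatProducts": [100, 50, 20, 10, 30],
    "MntFishProducts": [30, 10, 5, 2, 8],
})

spend_cols = ["MntFruits", "MntMeatProducts", "MntFishProducts"]

summary = (
    toy.groupby(["Kidhome", "Teenhome"], as_index=False)[spend_cols]
    .sum()  # per-category totals for each household type
    .assign(SUM=lambda df: df[spend_cols].sum(axis=1))  # combined total
    .sort_values("SUM", ascending=False, ignore_index=True)
)
print(summary)
```

Keeping the three category columns next to `SUM` shows which category drives each group's total.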


### Question 5: What is the average number of discounted purchases (NumDealsPurchases) for each education level (Education)?

In [36]:
s1 = data.groupby(['Education']).apply(lambda x: x.NumDealsPurchases.mean())

In [38]:
s1.sort_values()

Education
Basic         1.796296
2n Cycle      2.246305
Graduation    2.309672
PhD           2.374486
Master        2.427027
dtype: float64
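Group means are easier to judge when the group sizes are visible, since the education levels differ widely in size. A minimal sketch using named aggregation on hypothetical rows follows; the `mean_deals` and `n_customers` column names are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "Education": ["Basic", "Basic", "PhD", "PhD", "PhD", "Master"],
    "NumDealsPurchases": [1, 3, 2, 4, 6, 5],
})

stats = (
    toy.groupby("Education")["NumDealsPurchases"]
    .agg(mean_deals="mean", n_customers="size")  # mean and group size together
    .sort_values("mean_deals")
)
print(stats)
```

Selecting the column before aggregating also avoids passing whole sub-frames through `apply`.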

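### Question 6: What share of all purchases was made through the website (NumWebPurchases), the catalog (NumCatalogPurchases), and stores (NumStorePurchases)?
<div style='color: red;'>Note: this question tests table iteration; collect the data above into a Python list, then compute the shares.</div>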

In [44]:
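# Running totals of purchases per channel
result6 = {'Web': 0, 'Catalog': 0, 'Store': 0}
for _, r in data.iterrows():  # iterate over the table row by row
    result6['Web'] += r.NumWebPurchases
    result6['Catalog'] += r.NumCatalogPurchases
    result6['Store'] += r.NumStorePurchases

# Convert the totals into each channel's share of all purchases
total = sum(result6.values())
[(k, result6[k]/total) for k in result6]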

[('Web', 0.3258198910372823),
 ('Catalog', 0.21233486450877756),
 ('Store', 0.4618452444539401)]
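The row-by-row loop is what the question asks for, but a column-wise sum makes a quick cross-check. The sketch below runs both on hypothetical toy rows and compares them; the `channels` mapping and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative, not from the dataset.
toy = pd.DataFrame({
    "NumWebPurchases":     [4, 1, 5],
    "NumCatalogPurchases": [2, 0, 3],
    "NumStorePurchases":   [6, 2, 7],
})
channels = {
    "Web": "NumWebPurchases",
    "Catalog": "NumCatalogPurchases",
    "Store": "NumStorePurchases",
}

# Iterative version: accumulate per-channel totals row by row.
loop_totals = {name: 0 for name in channels}
for _, row in toy.iterrows():
    for name, col in channels.items():
        loop_totals[name] += row[col]
grand_total = sum(loop_totals.values())
loop_shares = [(name, loop_totals[name] / grand_total) for name in channels]

# Vectorized version: sum each column, then divide by the grand total.
col_totals = toy[list(channels.values())].sum()
vec_shares = col_totals / col_totals.sum()

print(loop_shares)
print(vec_shares)
```

Both versions should agree, and the shares should sum to 1.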