### Description of the data in transactions.csv

* customer_id - customer identifier
* tr_datetime - day and time of the transaction (days are numbered from the start of the data)
* mcc_code - MCC code of the transaction
* tr_type - transaction type
* amount - transaction amount in arbitrary units; "+" means funds credited to the customer (incoming transaction), "-" means funds debited (outgoing transaction)
* term_id - terminal identifier

### Task description

The goal is to complete all exercises in order. Both the correctness of the code and the final result are graded: after all cells have run, the transformed dataset must be written to features.csv.

Note that the exercises can be solved in different ways and no particular solution is imposed, but the code should be as readable and concise as possible.

**Note**: I will try to do everything in two versions, in pandas and in Spark.

**13.04** pandas done.

**13.04** spark wait.

### 1. Create an SQL context

https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#starting-point-sqlcontext

In [1]:
#Initializing PySpark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

In [2]:
# Spark config
conf = SparkConf().setAppName("sample_app")
sc = SparkContext(conf=conf)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/15 12:08:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
sqlContext = SQLContext(sc)

df = sqlContext.read.csv('transactions.csv',
                         header=True)

# Displays the content of the DataFrame to stdout
df.show()



+-----------+------------+--------+-------+----------+-------+
|customer_id| tr_datetime|mcc_code|tr_type|    amount|term_id|
+-----------+------------+--------+-------+----------+-------+
|   79780256| 37 13:36:14|    4814|   1030|  -3144.28|   NULL|
|   79780256| 39 10:16:49|    4814|   1030|  -5614.79|   NULL|
|   79780256| 44 09:41:33|    6011|   2010|-112295.79|   NULL|
|   79780256| 44 09:42:44|    6011|   2010| -67377.47|   NULL|
|   79780256| 51 08:53:56|    4814|   1030|  -1122.96|   NULL|
|   79780256| 51 08:55:09|    4814|   1030|  -2245.92|   NULL|
|   79780256| 58 11:18:31|    6011|   2010| -67377.47|   NULL|
|   79780256| 59 12:29:60|    6011|   2010| -22459.16|   NULL|
|   79780256| 62 15:44:60|    4814|   1030|  -3368.87|   NULL|
|   79780256| 62 15:46:24|    4814|   1030|  -2245.92|   NULL|
|   79780256| 65 06:20:50|    6011|   2010| -44918.32|   NULL|
|   79780256| 71 11:18:04|    6011|   2010| -89836.63|   NULL|
|   79780256| 78 10:38:15|    6011|   2010| -78607.05| 

### 2. Create a DataFrame from transactions.csv

We already created a DataFrame above, but let's create it in two more ways:
* with Spark
* with pandas

In [4]:
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession

For Spark, I decided to try one more approach:

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Create the DataFrame:
example = spark.read.csv('transactions.csv', header=True)
example.show()

+-----------+------------+--------+-------+----------+-------+
|customer_id| tr_datetime|mcc_code|tr_type|    amount|term_id|
+-----------+------------+--------+-------+----------+-------+
|   79780256| 37 13:36:14|    4814|   1030|  -3144.28|   NULL|
|   79780256| 39 10:16:49|    4814|   1030|  -5614.79|   NULL|
|   79780256| 44 09:41:33|    6011|   2010|-112295.79|   NULL|
|   79780256| 44 09:42:44|    6011|   2010| -67377.47|   NULL|
|   79780256| 51 08:53:56|    4814|   1030|  -1122.96|   NULL|
|   79780256| 51 08:55:09|    4814|   1030|  -2245.92|   NULL|
|   79780256| 58 11:18:31|    6011|   2010| -67377.47|   NULL|
|   79780256| 59 12:29:60|    6011|   2010| -22459.16|   NULL|
|   79780256| 62 15:44:60|    4814|   1030|  -3368.87|   NULL|
|   79780256| 62 15:46:24|    4814|   1030|  -2245.92|   NULL|
|   79780256| 65 06:20:50|    6011|   2010| -44918.32|   NULL|
|   79780256| 71 11:18:04|    6011|   2010| -89836.63|   NULL|
|   79780256| 78 10:38:15|    6011|   2010| -78607.05| 

The pandas version is below.

In [6]:
data = pd.read_csv('transactions.csv')

In [7]:
data.head()

Unnamed: 0,customer_id,tr_datetime,mcc_code,tr_type,amount,term_id
0,79780256,37 13:36:14,4814,1030,-3144.28,
1,79780256,39 10:16:49,4814,1030,-5614.79,
2,79780256,44 09:41:33,6011,2010,-112295.79,
3,79780256,44 09:42:44,6011,2010,-67377.47,
4,79780256,51 08:53:56,4814,1030,-1122.96,


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028188 entries, 0 to 1028187
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   customer_id  1028188 non-null  int64  
 1   tr_datetime  1028188 non-null  object 
 2   mcc_code     1028188 non-null  int64  
 3   tr_type      1028188 non-null  int64  
 4   amount       1028188 non-null  float64
 5   term_id      615129 non-null   object 
dtypes: float64(1), int64(3), object(2)
memory usage: 47.1+ MB


### 3. Print the schema

In [9]:
df.printSchema()

root
 |-- customer_id: string (nullable = true)
 |-- tr_datetime: string (nullable = true)
 |-- mcc_code: string (nullable = true)
 |-- tr_type: string (nullable = true)
 |-- amount: string (nullable = true)
 |-- term_id: string (nullable = true)



In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028188 entries, 0 to 1028187
Data columns (total 6 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   customer_id  1028188 non-null  int64  
 1   tr_datetime  1028188 non-null  object 
 2   mcc_code     1028188 non-null  int64  
 3   tr_type      1028188 non-null  int64  
 4   amount       1028188 non-null  float64
 5   term_id      615129 non-null   object 
dtypes: float64(1), int64(3), object(2)
memory usage: 47.1+ MB


Spark does not infer column types by default (inferSchema=True was not passed to the reader), so every column is a string. pandas did infer the types, but some of them are worth changing:
* tr_datetime should become a datetime
* customer_id is not really a numeric value; it is categorical, because arithmetic on it is meaningless
* mcc_code, tr_type and term_id should likewise be treated as object (categorical) columns
* amount is fine as is

For changing column types in Spark, see https://www.geeksforgeeks.org/update-pyspark-dataframe-metadata/
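As a minimal pandas sketch of the type changes listed above: identifier-like columns are cast to strings so that they are not mistaken for quantities. The three-row `sample` frame and the `id_columns` name are made up for illustration; the real notebook would apply the same cast to `data`.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as transactions.csv.
sample = pd.DataFrame({
    "customer_id": [79780256, 79780256, 46437397],
    "tr_datetime": ["37 13:36:14", "39 10:16:49", "442 08:14:54"],
    "mcc_code": [4814, 4814, 6011],
    "tr_type": [1030, 1030, 7010],
    "amount": [-3144.28, -5614.79, 89836.63],
    "term_id": [None, None, None],
})

# Identifiers and codes are labels, not quantities, so store them as strings.
id_columns = ["customer_id", "mcc_code", "tr_type"]
sample = sample.astype({column: str for column in id_columns})

print(sample.dtypes)
```

Casting to strings keeps the example simple; `astype("category")` would be a reasonable alternative when memory matters.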

### 4. Show the first 20 rows of the DataFrame

In [11]:
data.head(20)

Unnamed: 0,customer_id,tr_datetime,mcc_code,tr_type,amount,term_id
0,79780256,37 13:36:14,4814,1030,-3144.28,
1,79780256,39 10:16:49,4814,1030,-5614.79,
2,79780256,44 09:41:33,6011,2010,-112295.79,
3,79780256,44 09:42:44,6011,2010,-67377.47,
4,79780256,51 08:53:56,4814,1030,-1122.96,
5,79780256,51 08:55:09,4814,1030,-2245.92,
6,79780256,58 11:18:31,6011,2010,-67377.47,
7,79780256,59 12:29:60,6011,2010,-22459.16,
8,79780256,62 15:44:60,4814,1030,-3368.87,
9,79780256,62 15:46:24,4814,1030,-2245.92,


In [12]:
df.show(20)

+-----------+------------+--------+-------+----------+-------+
|customer_id| tr_datetime|mcc_code|tr_type|    amount|term_id|
+-----------+------------+--------+-------+----------+-------+
|   79780256| 37 13:36:14|    4814|   1030|  -3144.28|   NULL|
|   79780256| 39 10:16:49|    4814|   1030|  -5614.79|   NULL|
|   79780256| 44 09:41:33|    6011|   2010|-112295.79|   NULL|
|   79780256| 44 09:42:44|    6011|   2010| -67377.47|   NULL|
|   79780256| 51 08:53:56|    4814|   1030|  -1122.96|   NULL|
|   79780256| 51 08:55:09|    4814|   1030|  -2245.92|   NULL|
|   79780256| 58 11:18:31|    6011|   2010| -67377.47|   NULL|
|   79780256| 59 12:29:60|    6011|   2010| -22459.16|   NULL|
|   79780256| 62 15:44:60|    4814|   1030|  -3368.87|   NULL|
|   79780256| 62 15:46:24|    4814|   1030|  -2245.92|   NULL|
|   79780256| 65 06:20:50|    6011|   2010| -44918.32|   NULL|
|   79780256| 71 11:18:04|    6011|   2010| -89836.63|   NULL|
|   79780256| 78 10:38:15|    6011|   2010| -78607.05| 

### 5. Count the unique customer_id values

In [13]:
data['customer_id'].nunique()

2000

In [14]:
df.count()

1028188

In [15]:
df.select('customer_id').distinct().count()

                                                                                

2000

### 6. Count the unique term_id values

In [16]:
data['term_id'].nunique()

110871

### 7. Compute the average number of transactions per customer_id

Since the number of transactions is an integer, the result needs to be rounded.

In [17]:
round(data.groupby(['customer_id']).agg({'tr_datetime':'count'}).mean().iloc[0])

514

In [18]:
total_count = int(len(data)/data['customer_id'].nunique())

In [19]:
print(f'Average number of transactions per customer: {total_count}')

Average number of transactions per customer: 514
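The two calculations above agree because the mean of the per-customer counts is the total row count divided by the number of customers. A small sketch on made-up data (the `toy` frame is hypothetical) shows the equivalence. Note that `round` rounds to the nearest integer while `int` truncates; for 1028188 / 2000 = 514.094 both give 514.

```python
import pandas as pd

# Hypothetical toy data: customer 1 has 3 transactions, customer 2 has 1.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "amount": [-10.0, 5.0, -2.5, 7.0],
})

# Way 1: count rows per customer, then average the counts.
per_customer = toy.groupby("customer_id").size()
mean_via_groupby = per_customer.mean()

# Way 2: total number of rows divided by the number of distinct customers.
mean_via_ratio = len(toy) / toy["customer_id"].nunique()

print(mean_via_groupby, mean_via_ratio)
```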


### 8. Compute the average transaction amount per customer_id

* based on the convention that '-' is a debit and '+' is a credit to the customer.

If the task means the total over all transactions, we take the absolute value of the signed sum of credits and debits and divide it by the number of customer ids.

In [20]:
mean_transaction = abs(data['amount'].sum())/data['customer_id'].nunique()
print(f'Average amount in arbitrary units per customer: {mean_transaction}')

Average amount in arbitrary units per customer: 10725317.261949996
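The wording of the task allows more than one reading. The sketch below, on made-up numbers, contrasts the net figure computed above (absolute value of the signed sum) with a gross turnover figure (sum of absolute values). The `toy` frame and the names `net_per_customer` and `gross_per_customer` are illustrative assumptions, not part of the original solution.

```python
import pandas as pd

# Hypothetical toy data with both debits (negative) and credits (positive).
toy = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "amount": [-100.0, 40.0, -30.0, 10.0],
})
n_customers = toy["customer_id"].nunique()

# Net reading: debits and credits cancel each other before taking the modulus.
net_per_customer = abs(toy["amount"].sum()) / n_customers

# Gross reading: every transaction contributes its absolute size.
gross_per_customer = toy["amount"].abs().sum() / n_customers

print(net_per_customer, gross_per_customer)
```

The two readings can differ a lot, so it is worth stating which one a reported number refers to.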


### 9. Drop the term_id column

In [21]:
data = data.drop('term_id', axis=1)

In [22]:
data

Unnamed: 0,customer_id,tr_datetime,mcc_code,tr_type,amount
0,79780256,37 13:36:14,4814,1030,-3144.28
1,79780256,39 10:16:49,4814,1030,-5614.79
2,79780256,44 09:41:33,6011,2010,-112295.79
3,79780256,44 09:42:44,6011,2010,-67377.47
4,79780256,51 08:53:56,4814,1030,-1122.96
...,...,...,...,...,...
1028183,46437397,440 13:01:25,4814,1030,-8983.66
1028184,46437397,442 08:14:54,6011,7010,89836.63
1028185,46437397,446 19:26:27,6010,7071,103424.42
1028186,46437397,449 14:07:39,5411,1010,-26456.89


### 10. Add a direction column indicating the transaction "direction": D if amount is negative, C if it is positive

In [23]:
data['direction'] = data['amount'].apply(lambda x: 'C' if x>0 else 'D') 
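A vectorized alternative to the `apply` call above can be sketched with NumPy. Like the lambda, `np.where` labels a zero amount as D; the second variant, using a hypothetical `direction_strict` column, leaves zero amounts unlabelled, which follows the task wording more literally. The `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one debit, one credit, one zero amount.
toy = pd.DataFrame({"amount": [-3144.28, 89836.63, 0.0]})

# Same rule as the lambda: 'C' for positive amounts, 'D' for everything else.
toy["direction"] = np.where(toy["amount"] > 0, "C", "D")

# Stricter variant: zero amounts get an empty label instead of 'D'.
toy["direction_strict"] = np.select(
    [toy["amount"] > 0, toy["amount"] < 0],
    ["C", "D"],
    default="",
)

print(toy)
```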

### 11. Convert the amount column to its absolute value

In [24]:
data['amount'] = abs(data['amount'])

### 12. Compute the average transaction amount per customer_id separately for each direction

In [25]:
data_direction = data.groupby(['direction'])\
.agg({'amount':'sum', 
      'customer_id':'nunique'})

data_direction['mean'] = data_direction['amount']/data_direction['customer_id']

In [26]:
data_direction

Unnamed: 0_level_0,amount,customer_id,mean
direction,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
C,31664710000.0,1847,17143860.0
D,53115340000.0,1998,26584260.0


In [27]:
data['customer_id'].nunique()

2000

In [28]:
data_direction2 =pd.pivot_table(data, 
               index= 'direction', 
               values =['amount','customer_id'], 
               aggfunc={'amount':'sum', 
                        'customer_id':'nunique'})


data_direction2['mean'] = data_direction2['amount']/data_direction2['customer_id']

In [29]:
c_dir = data_direction['mean'].loc['C']
d_dir = data_direction['mean'].loc['D']

In [30]:
print(f"""Average amount in arbitrary units per customer for {data_direction.index[0]}: {c_dir}
Average amount in arbitrary units per customer for {data_direction.index[1]}: {d_dir}
"""
)

Average amount in arbitrary units per customer for C: 17143860.163670816
Average amount in arbitrary units per customer for D: 26584256.37947948



### Tasks 13 and 14 were removed because they were poorly worded

ok.

### 15. Build a pivot with customer_id as rows, mcc codes as columns, and sums of amount in the cells

In [31]:
data_mcc = data.pivot_table(index='customer_id', 
                 columns='mcc_code',
                 values='amount', 
                 aggfunc='sum')\
.fillna(0)

data_mcc

mcc_code,742,1711,1731,1799,2741,3000,3351,3501,4111,4112,...,8299,8398,8641,8699,8999,9211,9222,9311,9399,9402
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
6815,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
27914,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
53395,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
104032,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
218079,0.0,0.0,0.0,0.0,0.0,318268.72,0.0,0.0,7636.11,124452.93,...,0.0,0.0,0.0,0.0,3468.59,0.0,0.0,11229.58,96811.76,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99717689,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,...,0.0,0.0,0.0,0.0,6737.75,0.0,0.0,0.00,8017.92,0.0
99770379,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,...,0.0,0.0,0.0,0.0,6737.76,0.0,0.0,0.00,0.00,0.0
99787257,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
99915912,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0


In [32]:
data_mcc.reset_index(inplace=True)

In [33]:
data_mcc

mcc_code,customer_id,742,1711,1731,1799,2741,3000,3351,3501,4111,...,8299,8398,8641,8699,8999,9211,9222,9311,9399,9402
0,6815,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
1,27914,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
2,53395,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
3,104032,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
4,218079,0.0,0.0,0.0,0.0,0.0,318268.72,0.0,0.0,7636.11,...,0.0,0.0,0.0,0.0,3468.59,0.0,0.0,11229.58,96811.76,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,99717689,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,6737.75,0.0,0.0,0.00,8017.92,0.0
1996,99770379,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,6737.76,0.0,0.0,0.00,0.00,0.0
1997,99787257,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0
1998,99915912,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.00,0.0


### 16. Build a pivot with customer_id as rows, mcc codes as columns, and the means and standard deviations of amount in the cells
i.e. each mcc_code should yield up to 2 columns, one with the mean and one with the standard deviation

In [34]:
data_mcc_2 = data.pivot_table(index='customer_id', 
                 columns='mcc_code',
                 values=['amount'], 
                 aggfunc=['mean','std'])\
.fillna(0)

data_mcc_2

Unnamed: 0_level_0,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean,...,std,std,std,std,std,std,std,std,std,std
Unnamed: 0_level_1,amount,amount,amount,amount,amount,amount,amount,amount,amount,amount,...,amount,amount,amount,amount,amount,amount,amount,amount,amount,amount
mcc_code,742,1711,1731,1799,2741,3000,3351,3501,4111,4112,...,8299,8398,8641,8699,8999,9211,9222,9311,9399,9402
customer_id,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
6815,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
27914,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
53395,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
104032,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
218079,0.0,0.0,0.0,0.0,0.0,318268.72,0.0,0.0,7636.11,31113.2325,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,27155.064184,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99717689,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
99770379,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0000,...,0.0,0.0,0.0,0.0,648.341258,0.0,0.0,0.0,0.000000,0.0
99787257,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
99915912,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,0.0000,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0


In [35]:
# Find the mcc codes that have a mean column but no std column
set(data_mcc_2['mean'].columns) - set(data_mcc_2['std'].columns)

{('amount', 2741),
 ('amount', 5085),
 ('amount', 6513),
 ('amount', 7829),
 ('amount', 8220),
 ('amount', 8244)}

In [36]:
# '742_mcc_avg'
data_mcc_2.columns = [f"{i[2]}_mcc_{i[0]}" for i in data_mcc_2.columns]
data_mcc_2.reset_index(inplace=True)

In [37]:
data_mcc_2

Unnamed: 0,customer_id,742_mcc_mean,1711_mcc_mean,1731_mcc_mean,1799_mcc_mean,2741_mcc_mean,3000_mcc_mean,3351_mcc_mean,3501_mcc_mean,4111_mcc_mean,...,8299_mcc_std,8398_mcc_std,8641_mcc_std,8699_mcc_std,8999_mcc_std,9211_mcc_std,9222_mcc_std,9311_mcc_std,9399_mcc_std,9402_mcc_std
0,6815,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
1,27914,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
2,53395,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
3,104032,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
4,218079,0.0,0.0,0.0,0.0,0.0,318268.72,0.0,0.0,7636.11,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,27155.064184,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,99717689,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
1996,99770379,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,648.341258,0.0,0.0,0.0,0.000000,0.0
1997,99787257,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
1998,99915912,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0


I wanted to sort the columns while keeping customer_id as the first column, so:

In [38]:
mcc = pd.concat([data_mcc_2['customer_id'], 
           data_mcc_2.iloc[:,1:].reindex(sorted(data_mcc_2.iloc[:,1:].columns), axis=1)], 
          axis=1)
mcc

Unnamed: 0,customer_id,1711_mcc_mean,1711_mcc_std,1731_mcc_mean,1731_mcc_std,1799_mcc_mean,1799_mcc_std,2741_mcc_mean,3000_mcc_mean,3000_mcc_std,...,9211_mcc_mean,9211_mcc_std,9222_mcc_mean,9222_mcc_std,9311_mcc_mean,9311_mcc_std,9399_mcc_mean,9399_mcc_std,9402_mcc_mean,9402_mcc_std
0,6815,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.000,0.000000,0.0,0.0
1,27914,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.000,0.000000,0.0,0.0
2,53395,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.000,0.000000,0.0,0.0
3,104032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.000,0.000000,0.0,0.0
4,218079,0.0,0.0,0.0,0.0,0.0,0.0,0.0,318268.72,0.0,...,0.0,0.0,0.0,0.0,11229.58,0.0,19362.352,27155.064184,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,99717689,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,8017.920,0.000000,0.0,0.0
1996,99770379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.000,0.000000,0.0,0.0
1997,99787257,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.000,0.000000,0.0,0.0
1998,99915912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.0,0.0,0.0,0.0,0.00,0.0,0.000,0.000000,0.0,0.0
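Note that `sorted` on column names compares strings, which is why 1711_mcc_mean comes before 742_mcc_mean in the output above. If numeric order of the codes is preferred, a sort key can parse the code first. This is a small sketch; the `numeric_key` helper and the sample names are hypothetical.

```python
# Hypothetical column names in the same "<code>_mcc_<stat>" format as above.
columns = ["742_mcc_mean", "1711_mcc_std", "742_mcc_std", "1711_mcc_mean"]


def numeric_key(name: str) -> tuple[int, str]:
    """Sort by the numeric mcc code first, then by the statistic name."""
    code, _, stat = name.split("_")
    return int(code), stat


# Plain string sorting: '1711...' sorts before '742...'.
lexicographic = sorted(columns)

# Numeric-aware sorting: 742 comes before 1711.
numeric = sorted(columns, key=numeric_key)

print(lexicographic)
print(numeric)
```

The same key could be passed to `sorted(...)` inside the `reindex` call if numeric ordering is wanted there.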


This gives 361 columns, although 367 were expected. Let's see why.

In [39]:
data_mcc_2

Unnamed: 0,customer_id,742_mcc_mean,1711_mcc_mean,1731_mcc_mean,1799_mcc_mean,2741_mcc_mean,3000_mcc_mean,3351_mcc_mean,3501_mcc_mean,4111_mcc_mean,...,8299_mcc_std,8398_mcc_std,8641_mcc_std,8699_mcc_std,8999_mcc_std,9211_mcc_std,9222_mcc_std,9311_mcc_std,9399_mcc_std,9402_mcc_std
0,6815,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
1,27914,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
2,53395,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
3,104032,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
4,218079,0.0,0.0,0.0,0.0,0.0,318268.72,0.0,0.0,7636.11,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,27155.064184,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,99717689,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
1996,99770379,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,648.341258,0.0,0.0,0.0,0.000000,0.0
1997,99787257,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0
1998,99915912,0.0,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0


In [40]:
# For 6 mcc_code values no standard deviation column is produced.
# Most likely this is how pivot_table works in pandas: each customer has a single transaction with such a code,
# the sample std (n-1 in the denominator) of one value is NaN, and pivot_table drops all-NaN columns.
# The call below returns 0.0 only because it runs NumPy's std (denominator n) on a single scalar.
data[data['mcc_code'] == 6513].iloc[0]['amount'].std()

0.0

In [41]:
data[data['mcc_code'] == 6513]

Unnamed: 0,customer_id,tr_datetime,mcc_code,tr_type,amount,direction
539068,581281,401 11:21:11,6513,1110,15272.23,D
833110,87094958,310 22:57:59,6513,1100,5390.2,D


In [42]:
import numpy as np

In [43]:
data[data['mcc_code'] == 6513].pivot_table(index='customer_id', 
                 columns='mcc_code',
                 values=['amount'], 
                 aggfunc=['std'])\
.fillna(0)

customer_id
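The empty result above can be reproduced on a tiny made-up frame. The sketch assumes, as the pandas documentation describes, that `Series.std` uses `ddof=1` by default and that `pivot_table` drops all-NaN entries when `dropna=True` (the default). The `toy` data and the `full` name are hypothetical; reindexing to the full grid is one possible way to bring the missing columns back as zeros.

```python
import pandas as pd

# Hypothetical toy data: code 6513 occurs once per customer, like in the real data.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "mcc_code": [5411, 5411, 6513, 5411, 6513],
    "amount": [10.0, 20.0, 15.0, 5.0, 7.0],
})

# Sample std (ddof=1) of one observation is NaN; population std (ddof=0) is 0.0.
single = pd.Series([15.0])
print(single.std(), single.std(ddof=0))

# Every (customer, 6513) group holds one row, so code 6513 gets no std column.
std_pivot = toy.pivot_table(index="customer_id", columns="mcc_code",
                            values="amount", aggfunc="std")

# Restore the full customer x code grid and fill the gaps with zeros.
full = (std_pivot
        .reindex(index=sorted(toy["customer_id"].unique()),
                 columns=sorted(toy["mcc_code"].unique()))
        .fillna(0))

print(full)
```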


### 17. Build a pivot with customer_id as rows, transaction types as columns, and the means and standard deviations of amount in the cells, split by direction
i.e. each tr_type should yield up to 4 columns: the mean and the standard deviation for each direction

**Hint:** You can do the calculations separately for each payment direction and then join the results to a prepared list of unique customer_id values. This is simpler, clearer, and less error-prone.
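The approach from the hint could look roughly like the sketch below. It is not the solution used in this notebook (which pivots on both tr_type and direction at once); the `toy` data and the `features` frame are made up for illustration, and the column naming only imitates the '1000_D_type_mean' pattern.

```python
import pandas as pd

# Hypothetical toy data: customers 1-3, two transaction types, two directions.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 3],
    "tr_type": [1010, 1010, 7070, 7070, 1010, 7070],
    "direction": ["D", "D", "C", "C", "D", "C"],
    "amount": [10.0, 30.0, 100.0, 200.0, 5.0, 50.0],
})

# Start from the full list of customers so nobody is lost in the joins.
customers = pd.Index(sorted(toy["customer_id"].unique()), name="customer_id")
features = pd.DataFrame(index=customers)

for direction, part in toy.groupby("direction"):
    stats = part.pivot_table(index="customer_id", columns="tr_type",
                             values="amount", aggfunc=["mean", "std"])
    # Flatten (stat, tr_type) into names like '1010_D_type_mean'.
    stats.columns = [f"{tr_type}_{direction}_type_{stat}"
                     for stat, tr_type in stats.columns]
    features = features.join(stats, how="left")

# Customers without transactions in a given cell get zeros.
features = features.fillna(0).reset_index()
print(features)
```

Joining each direction onto a fixed customer index keeps one row per customer even when a customer has no transactions in that direction.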

In [44]:
data_type = data.pivot_table(index='customer_id', 
                 columns=['tr_type','direction'],
                 values=['amount'], 
                 aggfunc=['mean','std'])\
.fillna(0)

data_type

Unnamed: 0_level_0,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean,...,std,std,std,std,std,std,std,std,std,std
Unnamed: 0_level_1,amount,amount,amount,amount,amount,amount,amount,amount,amount,amount,...,amount,amount,amount,amount,amount,amount,amount,amount,amount,amount
tr_type,1000,1010,1010,1030,1030,1100,1100,1110,1110,1200,...,7035,7040,7041,7070,7070,7071,7071,7074,7075,8100
direction,D,C,D,C,D,C,D,C,D,C,...,C,C,C,C,D,C,D,C,C,C
customer_id,Unnamed: 1_level_4,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4
6815,0.0,0.00,7627.253810,0.0,2770.465889,0.0,0.000000,0.0,9703.427778,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
27914,0.0,0.00,0.000000,0.0,2896.459828,0.0,0.000000,0.0,6731.460000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
53395,0.0,0.00,0.000000,0.0,0.000000,0.0,8240.685000,0.0,0.000000,0.0,...,0.0,0.0,0.0,117240.245672,0.0,0.0,0.0,0.0,0.0,0.0
104032,0.0,0.00,0.000000,0.0,5512.117714,0.0,11229.580000,0.0,0.000000,0.0,...,0.0,0.0,0.0,648.335485,0.0,0.0,0.0,0.0,0.0,0.0
218079,0.0,0.00,18258.671368,0.0,11628.229167,0.0,17207.319048,0.0,19760.517391,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99717689,0.0,0.00,59588.510500,0.0,7126.126341,0.0,6266.103333,0.0,31217.138571,0.0,...,0.0,0.0,0.0,368737.376866,0.0,0.0,0.0,0.0,0.0,0.0
99770379,0.0,0.00,1497.280000,0.0,5201.069474,0.0,2205.889750,0.0,4490.930000,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
99787257,0.0,0.00,0.000000,0.0,2146.097778,0.0,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,3176.210524,0.0,0.0,0.0,0.0,0.0,0.0
99915912,0.0,24592.78,5559.074604,0.0,1726.428435,0.0,0.000000,0.0,11008.993442,0.0,...,0.0,0.0,0.0,12007.680882,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
# '1000_d_type_avg'
data_type.columns = [f"{i[2]}_{i[-1]}_{'type'}_{i[0]}" for i in data_type.columns]
data_type.reset_index(inplace=True)
data_type

Unnamed: 0,customer_id,1000_D_type_mean,1010_C_type_mean,1010_D_type_mean,1030_C_type_mean,1030_D_type_mean,1100_C_type_mean,1100_D_type_mean,1110_C_type_mean,1110_D_type_mean,...,7035_C_type_std,7040_C_type_std,7041_C_type_std,7070_C_type_std,7070_D_type_std,7071_C_type_std,7071_D_type_std,7074_C_type_std,7075_C_type_std,8100_C_type_std
0,6815,0.0,0.00,7627.253810,0.0,2770.465889,0.0,0.000000,0.0,9703.427778,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1,27914,0.0,0.00,0.000000,0.0,2896.459828,0.0,0.000000,0.0,6731.460000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,53395,0.0,0.00,0.000000,0.0,0.000000,0.0,8240.685000,0.0,0.000000,...,0.0,0.0,0.0,117240.245672,0.0,0.0,0.0,0.0,0.0,0.0
3,104032,0.0,0.00,0.000000,0.0,5512.117714,0.0,11229.580000,0.0,0.000000,...,0.0,0.0,0.0,648.335485,0.0,0.0,0.0,0.0,0.0,0.0
4,218079,0.0,0.00,18258.671368,0.0,11628.229167,0.0,17207.319048,0.0,19760.517391,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,99717689,0.0,0.00,59588.510500,0.0,7126.126341,0.0,6266.103333,0.0,31217.138571,...,0.0,0.0,0.0,368737.376866,0.0,0.0,0.0,0.0,0.0,0.0
1996,99770379,0.0,0.00,1497.280000,0.0,5201.069474,0.0,2205.889750,0.0,4490.930000,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1997,99787257,0.0,0.00,0.000000,0.0,2146.097778,0.0,0.000000,0.0,0.000000,...,0.0,0.0,0.0,3176.210524,0.0,0.0,0.0,0.0,0.0,0.0
1998,99915912,0.0,24592.78,5559.074604,0.0,1726.428435,0.0,0.000000,0.0,11008.993442,...,0.0,0.0,0.0,12007.680882,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
type = pd.concat([data_type['customer_id'], 
           data_type.iloc[:,1:].reindex(sorted(data_type.iloc[:,1:].columns), axis=1)], 
          axis=1)
type

Unnamed: 0,customer_id,1000_D_type_mean,1000_D_type_std,1010_C_type_mean,1010_D_type_mean,1010_D_type_std,1030_C_type_mean,1030_D_type_mean,1030_D_type_std,1100_C_type_mean,...,7071_C_type_std,7071_D_type_mean,7071_D_type_std,7074_C_type_mean,7074_C_type_std,7075_C_type_mean,7075_C_type_std,8100_C_type_mean,8100_C_type_std,8146_C_type_mean
0,6815,0.0,0.0,0.00,7627.253810,4788.538032,0.0,2770.465889,3118.134751,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,27914,0.0,0.0,0.00,0.000000,0.000000,0.0,2896.459828,1584.580907,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,53395,0.0,0.0,0.00,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,104032,0.0,0.0,0.00,0.000000,0.000000,0.0,5512.117714,5214.584303,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,218079,0.0,0.0,0.00,18258.671368,60711.295610,0.0,11628.229167,9528.297800,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,99717689,0.0,0.0,0.00,59588.510500,269974.682665,0.0,7126.126341,6798.228074,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1996,99770379,0.0,0.0,0.00,1497.280000,648.341258,0.0,5201.069474,6187.153899,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1997,99787257,0.0,0.0,0.00,0.000000,0.000000,0.0,2146.097778,1572.587374,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1998,99915912,0.0,0.0,24592.78,5559.074604,14412.337420,0.0,1726.428435,785.572420,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 18. Extract the hours from the tr_datetime column and drop tr_datetime

There are several ways to solve this:
1) convert to datetime and take the hours from there. However, pd.to_datetime fails with 'DateParseError: second must be in 0..59: 59 12:29:60, at position 7', so an extra function would be needed to get rid of the 60-second values.
2) take the characters by index with a function :)
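A third option, sketched here under the assumption that every value has the form "<day> HH:MM:SS", is to split the string with a regular expression. The hour comes out directly, and if a time axis is needed, the parts can be summed into seconds so that a value of 60 seconds simply rolls over into the next minute. The `toy` frame and the `elapsed` column are hypothetical.

```python
import pandas as pd

# Hypothetical sample, including one value with 60 in the seconds field.
toy = pd.DataFrame({"tr_datetime": ["37 13:36:14", "59 12:29:60", "440 08:14:54"]})

# Split "<day> <HH>:<MM>:<SS>" into integer parts with one regular expression.
pattern = r"^(?P<day>\d+) (?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})$"
parts = toy["tr_datetime"].str.extract(pattern).astype(int)

toy["hours"] = parts["hour"]

# Optional: time elapsed since the start of the data; 60 seconds rolls over.
total_seconds = (parts["day"] * 86400 + parts["hour"] * 3600
                 + parts["minute"] * 60 + parts["second"])
toy["elapsed"] = pd.to_timedelta(total_seconds, unit="s")

print(toy)
```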

In [47]:
data['hours'] = data['tr_datetime'].apply(lambda x: x[-8:-6])

In [48]:
data['hours'] = data['hours'].astype('int')

In [49]:
# check that there are no out-of-range values, as there were with the seconds
data['hours'].value_counts().sort_index()

hours
0     115099
1       9398
2      11833
3      13467
4      15346
5      19187
6      25278
7      31569
8      41681
9      51119
10     60304
11     62489
12     68026
13     69217
14     66513
15     66798
16     63740
17     59671
18     52524
19     43907
20     33131
21     23140
22     15343
23      9408
Name: count, dtype: int64

In [50]:
data = data.drop('tr_datetime', axis=1)

In [51]:
data

Unnamed: 0,customer_id,mcc_code,tr_type,amount,direction,hours
0,79780256,4814,1030,3144.28,D,13
1,79780256,4814,1030,5614.79,D,10
2,79780256,6011,2010,112295.79,D,9
3,79780256,6011,2010,67377.47,D,9
4,79780256,4814,1030,1122.96,D,8
...,...,...,...,...,...,...
1028183,46437397,4814,1030,8983.66,D,13
1028184,46437397,6011,7010,89836.63,C,8
1028185,46437397,6010,7071,103424.42,C,19
1028186,46437397,5411,1010,26456.89,D,14


### 19. Build a pivot with customer_id as rows, the hours obtained in the previous step as columns, and the means and standard deviations of amount in the cells, split by direction

**Hint:** You can do the calculations separately for each payment direction and then join the results to a prepared list of unique customer_id values. This is simpler, clearer, and less error-prone.

In [52]:
data

Unnamed: 0,customer_id,mcc_code,tr_type,amount,direction,hours
0,79780256,4814,1030,3144.28,D,13
1,79780256,4814,1030,5614.79,D,10
2,79780256,6011,2010,112295.79,D,9
3,79780256,6011,2010,67377.47,D,9
4,79780256,4814,1030,1122.96,D,8
...,...,...,...,...,...,...
1028183,46437397,4814,1030,8983.66,D,13
1028184,46437397,6011,7010,89836.63,C,8
1028185,46437397,6010,7071,103424.42,C,19
1028186,46437397,5411,1010,26456.89,D,14


In [53]:
data_hours = data.pivot_table(index='customer_id', 
                 columns=['hours','direction'],
                 values=['amount'], 
                 aggfunc=['mean','std'])\
.fillna(0)

data_hours

Unnamed: 0_level_0,mean,mean,mean,mean,mean,mean,mean,mean,mean,mean,...,std,std,std,std,std,std,std,std,std,std
Unnamed: 0_level_1,amount,amount,amount,amount,amount,amount,amount,amount,amount,amount,...,amount,amount,amount,amount,amount,amount,amount,amount,amount,amount
hours,0,0,1,1,2,2,3,3,4,4,...,19,19,20,20,21,21,22,22,23,23
direction,C,D,C,D,C,D,C,D,C,D,...,C,D,C,D,C,D,C,D,C,D
customer_id,Unnamed: 1_level_4,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4,Unnamed: 9_level_4,Unnamed: 10_level_4,Unnamed: 11_level_4,Unnamed: 12_level_4,Unnamed: 13_level_4,Unnamed: 14_level_4,Unnamed: 15_level_4,Unnamed: 16_level_4,Unnamed: 17_level_4,Unnamed: 18_level_4,Unnamed: 19_level_4,Unnamed: 20_level_4,Unnamed: 21_level_4
6815,0.000,0.000000,0.0,2245.92,0.0,2245.920,0.0,0.000000,0.0,0.000000,...,0.000000,9338.674926,0.000000,24529.440017,0.000000,0.000000,0.0,0.000000,0.0,0.000000
27914,0.000,22158.090000,0.0,0.00,0.0,0.000,0.0,2245.920000,0.0,422232.170000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
53395,8690.078,8075.413125,0.0,0.00,0.0,0.000,0.0,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,149281.618857,0.000000,0.0,0.000000,0.0,0.000000
104032,0.000,11229.580000,0.0,3368.87,0.0,3368.870,0.0,3368.870000,0.0,3368.870000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,794.052631,0.0,0.000000
218079,0.000,57402.685000,0.0,0.00,0.0,0.000,0.0,0.000000,0.0,0.000000,...,85757.521514,53959.132073,0.000000,63858.671390,0.000000,14199.043480,0.0,31958.001197,0.0,212655.606096
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99717689,0.000,36225.514775,0.0,23207.80,0.0,20213.240,0.0,11229.580000,0.0,26950.990000,...,0.000000,16124.811782,127048.187615,11317.939731,0.000000,61793.362844,0.0,265534.904662,0.0,0.000000
99770379,0.000,2441.591148,0.0,0.00,0.0,0.000,0.0,0.000000,0.0,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
99787257,0.000,0.000000,0.0,0.00,0.0,0.000,0.0,449.180000,0.0,145984.525000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
99915912,24592.780,19913.457000,0.0,0.00,0.0,0.000,0.0,0.000000,0.0,0.000000,...,0.000000,6575.062507,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000


In [54]:
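# Flatten the 4-level columns (aggfunc, 'amount', hours, direction)
# into names such as '0_hour_C_mean' (the hint's '0_hour_c_avg' pattern)
data_hours.columns = [f"{hour}_hour_{direction}_{func}"
                      for func, _, hour, direction in data_hours.columns]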
data_hours.reset_index(inplace=True)
data_hours

Unnamed: 0,customer_id,0_hour_C_mean,0_hour_D_mean,1_hour_C_mean,1_hour_D_mean,2_hour_C_mean,2_hour_D_mean,3_hour_C_mean,3_hour_D_mean,4_hour_C_mean,...,19_hour_C_std,19_hour_D_std,20_hour_C_std,20_hour_D_std,21_hour_C_std,21_hour_D_std,22_hour_C_std,22_hour_D_std,23_hour_C_std,23_hour_D_std
0,6815,0.000,0.000000,0.0,2245.92,0.0,2245.920,0.0,0.000000,0.0,...,0.000000,9338.674926,0.000000,24529.440017,0.000000,0.000000,0.0,0.000000,0.0,0.000000
1,27914,0.000,22158.090000,0.0,0.00,0.0,0.000,0.0,2245.920000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
2,53395,8690.078,8075.413125,0.0,0.00,0.0,0.000,0.0,0.000000,0.0,...,0.000000,0.000000,0.000000,0.000000,149281.618857,0.000000,0.0,0.000000,0.0,0.000000
3,104032,0.000,11229.580000,0.0,3368.87,0.0,3368.870,0.0,3368.870000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,794.052631,0.0,0.000000
4,218079,0.000,57402.685000,0.0,0.00,0.0,0.000,0.0,0.000000,0.0,...,85757.521514,53959.132073,0.000000,63858.671390,0.000000,14199.043480,0.0,31958.001197,0.0,212655.606096
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,99717689,0.000,36225.514775,0.0,23207.80,0.0,20213.240,0.0,11229.580000,0.0,...,0.000000,16124.811782,127048.187615,11317.939731,0.000000,61793.362844,0.0,265534.904662,0.0,0.000000
1996,99770379,0.000,2441.591148,0.0,0.00,0.0,0.000,0.0,0.000000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
1997,99787257,0.000,0.000000,0.0,0.00,0.0,0.000,0.0,449.180000,0.0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000
1998,99915912,24592.780,19913.457000,0.0,0.00,0.0,0.000,0.0,0.000000,0.0,...,0.000000,6575.062507,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000


In [55]:
hours = pd.concat([data_hours['customer_id'], 
           data_hours.iloc[:,1:].reindex(sorted(data_hours.iloc[:,1:].columns), axis=1)], 
          axis=1)
hours

Unnamed: 0,customer_id,0_hour_C_mean,0_hour_C_std,0_hour_D_mean,0_hour_D_std,10_hour_C_mean,10_hour_C_std,10_hour_D_mean,10_hour_D_std,11_hour_C_mean,...,7_hour_D_mean,7_hour_D_std,8_hour_C_mean,8_hour_C_std,8_hour_D_mean,8_hour_D_std,9_hour_C_mean,9_hour_C_std,9_hour_D_mean,9_hour_D_std
0,6815,0.000,0.000000,0.000000,0.000000,2.470507e+06,0.000000,18108.685250,22593.184705,224.590000,...,34811.695000,46054.960674,0.0000,0.000000,29758.387500,55024.935000,112295.7900,0.000000,22747.228421,27349.850285
1,27914,0.000,0.000000,22158.090000,10246.910640,6.737750e+03,4491.830000,3743.195556,3849.302856,37057.610000,...,15347.093333,11248.279631,3743.1950,1690.661842,3144.285000,1478.440813,6737.7500,6352.406906,2245.920000,0.000000
2,53395,8690.078,14272.273285,8075.413125,8556.307444,2.245916e+04,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000
3,104032,0.000,0.000000,11229.580000,6254.196851,0.000000e+00,0.000000,141867.013333,152328.179515,0.000000,...,112295.790000,95286.131872,8983.6600,0.000000,114990.888000,124130.282452,0.0000,0.000000,116563.028000,150175.202395
4,218079,0.000,0.000000,57402.685000,77178.573161,0.000000e+00,0.000000,33467.850833,75100.847330,0.000000,...,12333.727500,16683.511662,0.0000,0.000000,8382.656000,14278.368338,0.0000,0.000000,77567.515714,115185.873872
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,99717689,0.000,0.000000,36225.514775,94897.296917,6.737747e+05,317620.458431,221391.146667,535648.397677,179673.260000,...,7018.486250,3487.115077,898366.3100,0.000000,33706.917619,120968.437457,477257.1025,589775.654682,121443.500870,465755.315557
1996,99770379,0.000,0.000000,2441.591148,2851.240546,1.067371e+05,112037.267384,43138.428000,75271.808695,157214.106667,...,0.000000,0.000000,86842.0800,120326.041633,36372.607000,61299.934814,235821.1550,15881.017265,14321.457333,34044.941482
1997,99787257,0.000,0.000000,0.000000,0.000000,0.000000e+00,0.000000,673.770000,0.000000,0.000000,...,91521.067500,51651.992404,0.0000,0.000000,29196.905000,28230.722285,0.0000,0.000000,8983.660000,0.000000
1998,99915912,24592.780,0.000000,19913.457000,63383.895888,3.431226e+04,40052.274730,11267.979695,37128.036114,24368.186667,...,16144.581343,31496.416398,50252.3675,41770.158927,7223.982174,17413.270131,51656.0650,48941.238996,15448.331518,40936.572001
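
Because `sorted` compares the column names as strings, the result above runs `0_hour…`, `10_hour…`, `11_hour…`, and the `9_hour…` columns land at the end. If numeric hour order is preferred, a sort key that parses the hour can be used. The sketch below assumes the `<hour>_hour_<direction>_<func>` naming scheme; `hour_key` and `toy` are hypothetical names introduced for the example.

```python
import pandas as pd

# Hypothetical empty frame that uses the same naming scheme as `hours`.
toy = pd.DataFrame(columns=["customer_id", "10_hour_C_mean", "2_hour_D_std",
                            "2_hour_C_mean", "0_hour_D_mean"])

def hour_key(name):
    # "10_hour_C_mean" -> (10, "C", "mean"), so hours compare as integers.
    hour, _, direction, func = name.split("_")
    return int(hour), direction, func

feature_cols = sorted(toy.columns.drop("customer_id"), key=hour_key)
toy = toy[["customer_id"] + feature_cols]
print(list(toy.columns))
```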


### 20. Join the resulting DataFrame with the pivots by mcc code and by hour

**Note:** The point here is that we are building a dataset in which every customer_id has features computed from its transactions: the arithmetic mean and standard deviation of transaction amounts for each mcc code, for each mcc with the transaction direction taken into account, and for each hour of the day without mcc codes but with the transaction direction taken into account.

**Hint:** The list of fields in the resulting dataset (… stands for other similar fields):

        ['customer_id',
         '742_mcc_avg',
         '742_mcc_std',
         '1711_mcc_avg',
         '1711_mcc_std',
         '1731_mcc_avg',
         '1731_mcc_std',
         ...
         ...
         ...
         '1010_c_type_avg',
         '1010_c_type_std',
         '1030_c_type_avg',
         '1030_c_type_std',
         '1100_c_type_avg',
         ...
         ...
         '1000_d_type_avg',
         '1000_d_type_std',
         '1010_d_type_avg',
         '1010_d_type_std',
         '1030_d_type_avg',
         ...
         ...
         '0_hour_c_avg',
         '0_hour_c_std',
         '1_hour_c_avg',
         '1_hour_c_std',
         '2_hour_c_avg',
         '2_hour_c_std',
         ...
         ...
         '23_hour_c_avg',
         '23_hour_c_std',
         '0_hour_d_avg',
         '0_hour_d_std',
         '1_hour_d_avg',
         '1_hour_d_std',
         ...
         ...
         '22_hour_d_avg',
         '22_hour_d_std',
         '23_hour_d_avg',
         '23_hour_d_std']
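
The hour columns built in task 19 are named like `0_hour_C_mean`, while the hint uses `0_hour_c_avg`. If an exact match with the hint is required, the names can be mapped with a small helper. This is a sketch only: `to_hint_name` is a hypothetical function, and it covers just the hour columns.

```python
import re

def to_hint_name(name):
    # "0_hour_C_mean" -> "0_hour_c_avg"; names that do not match are returned unchanged.
    match = re.fullmatch(r"(\d+)_hour_([CD])_(mean|std)", name)
    if match is None:
        return name
    hour, direction, func = match.groups()
    func = "avg" if func == "mean" else func
    return f"{hour}_hour_{direction.lower()}_{func}"

print(to_hint_name("0_hour_C_mean"))
```

A callable like this can be passed to `DataFrame.rename(columns=to_hint_name)`.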

We ended up with the following tables:

* hours
* type
* mcc

In [56]:
mcc.shape

(2000, 361)

In [57]:
type.shape

(2000, 173)

In [58]:
hours.shape

(2000, 97)

In [59]:
361+173+97 -2

629

In [60]:
# keep the default join type (inner)
answer = mcc.merge(type, on='customer_id')\
.merge(hours, on='customer_id')

In [61]:
answer

Unnamed: 0,customer_id,1711_mcc_mean,1711_mcc_std,1731_mcc_mean,1731_mcc_std,1799_mcc_mean,1799_mcc_std,2741_mcc_mean,3000_mcc_mean,3000_mcc_std,...,7_hour_D_mean,7_hour_D_std,8_hour_C_mean,8_hour_C_std,8_hour_D_mean,8_hour_D_std,9_hour_C_mean,9_hour_C_std,9_hour_D_mean,9_hour_D_std
0,6815,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,34811.695000,46054.960674,0.0000,0.000000,29758.387500,55024.935000,112295.7900,0.000000,22747.228421,27349.850285
1,27914,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,15347.093333,11248.279631,3743.1950,1690.661842,3144.285000,1478.440813,6737.7500,6352.406906,2245.920000,0.000000
2,53395,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000,0.0000,0.000000,0.000000,0.000000
3,104032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,112295.790000,95286.131872,8983.6600,0.000000,114990.888000,124130.282452,0.0000,0.000000,116563.028000,150175.202395
4,218079,0.0,0.0,0.0,0.0,0.0,0.0,0.0,318268.72,0.0,...,12333.727500,16683.511662,0.0000,0.000000,8382.656000,14278.368338,0.0000,0.000000,77567.515714,115185.873872
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,99717689,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,7018.486250,3487.115077,898366.3100,0.000000,33706.917619,120968.437457,477257.1025,589775.654682,121443.500870,465755.315557
1996,99770379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,0.000000,0.000000,86842.0800,120326.041633,36372.607000,61299.934814,235821.1550,15881.017265,14321.457333,34044.941482
1997,99787257,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,91521.067500,51651.992404,0.0000,0.000000,29196.905000,28230.722285,0.0000,0.000000,8983.660000,0.000000
1998,99915912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.0,...,16144.581343,31496.416398,50252.3675,41770.158927,7223.982174,17413.270131,51656.0650,48941.238996,15448.331518,40936.572001


In [62]:
answer.shape

(2000, 629)
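
The expected width follows from the fact that `customer_id` is shared: each of the two merges adds all columns of the right-hand table except the key, hence 361 + 173 + 97 - 2 = 629. A minimal sketch of the same check on toy frames is shown below; the frames `a`, `b`, `c` are invented, and `validate="one_to_one"` is an optional safeguard that is not used in the cell above.

```python
import pandas as pd

a = pd.DataFrame({"customer_id": [1, 2], "a1": [0.1, 0.2], "a2": [1.0, 2.0]})
b = pd.DataFrame({"customer_id": [1, 2], "b1": [3.0, 4.0]})
c = pd.DataFrame({"customer_id": [1, 2], "c1": [5.0, 6.0]})

# validate raises MergeError if customer_id is duplicated on either side.
merged = (a.merge(b, on="customer_id", validate="one_to_one")
           .merge(c, on="customer_id", validate="one_to_one"))

# The key column is counted once, so subtract it for each of the two merges.
expected_cols = a.shape[1] + b.shape[1] + c.shape[1] - 2
assert merged.shape == (len(a), expected_cols)
```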

### 21. How many columns are in the final DataFrame

Note that, because of how pivot_table computes its result, std is missing for some columns: those where every unique combination of id and mcc had only a single value. Such columns could be added back, but they would contain only 0 and many NaN, which we would replace with 0. Task 16 shows this with an example.
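
The effect can be reproduced on a toy frame (invented values; `toy`, `pivot`, and `restored` are names assumed for the example). The sample standard deviation of a single value is NaN, and `pivot_table` with its default `dropna=True` leaves out columns whose entries are all NaN, so the `std` column for such an mcc code never appears.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "mcc_code": [5411, 5411, 5411, 2741],
    "amount": [10.0, 30.0, 7.0, 4.0],
})

# std uses ddof=1 by default, so a single observation gives NaN.
single_std = pd.Series([4.0]).std()

pivot = toy.pivot_table(index="customer_id", columns="mcc_code",
                        values="amount", aggfunc=["mean", "std"])
print(pivot.columns.tolist())  # no ("std", 2741) entry is expected here

# Optional: add the missing std columns back and fill them with 0.
full_columns = pd.MultiIndex.from_product(
    [["mean", "std"], sorted(toy["mcc_code"].unique())])
restored = pivot.reindex(columns=full_columns).fillna(0)
```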

In [63]:
display(f'Number of columns in the new DF: {answer.shape[1]}')

'Number of columns in the new DF: 629'

### 22. Save the resulting dataset to the csv file features.csv

In [69]:
answer.to_csv('features.csv', index=False)
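
A quick way to confirm that the file was written as intended is to read it back and compare it with the frame in memory. The sketch below does this on a toy frame in a temporary directory so that it does not touch the real features.csv; `toy` and `restored` are names assumed for the example.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"customer_id": [6815, 27914],
                    "0_hour_C_mean": [0.0, 8690.078]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.csv")
    toy.to_csv(path, index=False)   # index=False keeps the row index out of the file
    restored = pd.read_csv(path)

# Compares column names, dtypes, and values (floats within a small tolerance).
pd.testing.assert_frame_equal(toy, restored)
```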