# Capstone 3 - Step 2: Data Wrangling

**The Data Science Method**  


1.   Problem Identification 


2.   **Data Wrangling**
    
  * Data Collection
    
      - Locating the data
    
      - Data loading
    
      - Data joining
    
  * Data Organization
    
      - File structure
    
      - Git & GitHub
    
  * Data Definition
    
      - Column names
    
      - Data types (numeric, categorical, timestamp, etc.)
    
      - Description of the columns
    
      - Count or percent per unique values or codes (including NA)
    
      - The range of values or codes
    
  * Data Cleaning
    
      - NA or missing data
    
      - Duplicates

3.   Exploratory Data Analysis
  * Build data profile tables and plots
      - Outliers & Anomalies
  * Explore data relationships
  * Identification and creation of features

4.   Pre-processing and Training Data Development
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set

5.   Modeling 
  * Fit Models with Training Data Set
  * Review Model Outcomes — Iterate over additional models as needed.
  * Identify the Final Model

6.   Documentation
  * Review the Results
  * Present and share your findings - storytelling
  * Finalize Code 
  * Finalize Documentation

## Data Collection

In [1]:
#load python packages
import os
import pandas as pd

In [2]:
# read in csv files
df_acct01= pd.read_csv('data/balance_history_acct01.csv')

In [3]:
df_acct02= pd.read_csv('data/balance_history_acct02.csv')


In [4]:
# combine account files
df = pd.concat([df_acct01, df_acct02],ignore_index=True)
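The load-and-combine step above can be sketched as one helper that picks up every account file in a folder. This is only an illustration: the function name `load_balance_history` and the glob pattern are assumptions, and `low_memory=False` is an optional choice that makes pandas infer each column's type from the whole file, which can avoid mixed-type warnings on sparse columns. The demo runs on two tiny made-up files rather than the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_balance_history(data_dir, pattern="balance_history_acct*.csv"):
    """Read every CSV matching `pattern` in `data_dir` and stack them into one frame.

    Hypothetical helper: the name and the default pattern are illustrative only.
    """
    paths = sorted(Path(data_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir}")
    # low_memory=False: infer each column's dtype from the full file, not per chunk
    frames = [pd.read_csv(p, low_memory=False) for p in paths]
    # ignore_index=True renumbers the rows 0..n-1 across all files
    return pd.concat(frames, ignore_index=True)


# Demo on two small made-up files so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "balance_history_acct01.csv").write_text("id,Amount\ntxn_a,5.0\ntxn_b,1.0\n")
    Path(tmp, "balance_history_acct02.csv").write_text("id,Amount\ntxn_c,10.0\n")
    combined = load_balance_history(tmp)

print(combined)
```

With the real files, the call would presumably be `load_balance_history('data')`.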

In [5]:
# examine first five rows
df.head()

| | id | Type | Source | Amount | Fee | Destination Platform Fee | Net | Currency | Created (UTC) | Available On (UTC) | ... | Customer Facing Currency | Transfer | Transfer Date (UTC) | Transfer Group | receipt (metadata) | products (metadata) | name (metadata) | email (metadata) | phone (metadata) | order (metadata) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | txn_1H4yaiAEZE0DjbbjyUL3LA0l | charge | ch_1H4yahAEZE0DjbbjNylKh8nq | 5.0 | 0.45 | | 4.55 | usd | 2020-07-15 00:35 | 2020-07-17 00:00 | ... | usd | | | | True | ["ISO-5.1.6"] | | | | |
| 1 | txn_1H4wdJAEZE0DjbbjpseDpWHX | charge | ch_1H4wdIAEZE0Djbbjwb6rph5h | 1.0 | 0.34 | | 0.66 | usd | 2020-07-14 22:29 | 2020-07-16 00:00 | ... | usd | | | | True | ["ISO-5.1.6"] | | | | |
| 2 | txn_1H4wMhAEZE0Djbbja2ayZO55 | charge | ch_1H4wMfAEZE0DjbbjKWLNY2vc | 1.0 | 0.34 | | 0.66 | usd | 2020-07-14 22:12 | 2020-07-16 00:00 | ... | usd | | | | True | ["ISO-5.1.6"] | | | | |
| 3 | txn_1H4v2WAEZE0Djbbjif8lMl9m | charge | ch_1H4v2UAEZE0DjbbjGT6iHaTh | 10.0 | 0.59 | | 9.41 | usd | 2020-07-14 20:47 | 2020-07-16 00:00 | ... | usd | | | | True | ["ISO-5.1.6"] | | | | |
| 4 | txn_1H4tnzAEZE0DjbbjWcwXA4zN | charge | ch_1H4tnyAEZE0DjbbjEE794to5 | 1.0 | 0.33 | | 0.67 | usd | 2020-07-14 19:28 | 2020-07-16 00:00 | ... | usd | | | | True | ["ISO-5.1.6"] | | | | |


## Data Definition

### Column Names 

In [6]:
df.columns

Index(['id', 'Type', 'Source', 'Amount', 'Fee', 'Destination Platform Fee',
       'Net', 'Currency', 'Created (UTC)', 'Available On (UTC)', 'Description',
       'Customer Facing Amount', 'Customer Facing Currency', 'Transfer',
       'Transfer Date (UTC)', 'Transfer Group', 'receipt (metadata)',
       'products (metadata)', 'name (metadata)', 'email (metadata)',
       'phone (metadata)', 'order (metadata)'],
      dtype='object')

### Data Types 

In [7]:
df.dtypes

id                           object
Type                         object
Source                       object
Amount                      float64
Fee                         float64
Destination Platform Fee    float64
Net                         float64
Currency                     object
Created (UTC)                object
Available On (UTC)           object
Description                  object
Customer Facing Amount      float64
Customer Facing Currency     object
Transfer                     object
Transfer Date (UTC)          object
Transfer Group               object
receipt (metadata)           object
products (metadata)          object
name (metadata)              object
email (metadata)             object
phone (metadata)            float64
order (metadata)            float64
dtype: object

In [8]:
# convert the three date columns from strings to datetime objects
df['Created (UTC)']=pd.to_datetime(df['Created (UTC)'])
df['Available On (UTC)']=pd.to_datetime(df['Available On (UTC)'])
df['Transfer Date (UTC)']=pd.to_datetime(df['Transfer Date (UTC)'])
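A slightly more defensive variant of the conversion above might loop over the date columns, pass an explicit format, and coerce unparseable values to `NaT`. The format string `%Y-%m-%d %H:%M` is an assumption based on the values visible in the first rows (for example `2020-07-15 00:35`). The sample frame below is made up for illustration.

```python
import pandas as pd

# Made-up sample mimicking the raw string columns
sample = pd.DataFrame({
    "Created (UTC)": ["2020-07-15 00:35", "2020-07-14 22:29"],
    "Transfer Date (UTC)": ["2020-07-17 00:00", None],
})

date_cols = ["Created (UTC)", "Transfer Date (UTC)"]
for col in date_cols:
    # errors="coerce" turns unparseable strings into NaT instead of raising;
    # missing values also become NaT
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d %H:%M", errors="coerce")

print(sample.dtypes)
print(sample["Transfer Date (UTC)"].isna().sum(), "missing transfer dates")
```

Counting the `NaT` values before and after conversion is a quick way to confirm that coercion did not silently discard valid timestamps.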

In [9]:
df.dtypes

id                                  object
Type                                object
Source                              object
Amount                             float64
Fee                                float64
Destination Platform Fee           float64
Net                                float64
Currency                            object
Created (UTC)               datetime64[ns]
Available On (UTC)          datetime64[ns]
Description                         object
Customer Facing Amount             float64
Customer Facing Currency            object
Transfer                            object
Transfer Date (UTC)         datetime64[ns]
Transfer Group                      object
receipt (metadata)                  object
products (metadata)                 object
name (metadata)                     object
email (metadata)                    object
phone (metadata)                   float64
order (metadata)                   float64
dtype: object

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84386 entries, 0 to 84385
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   id                        84386 non-null  object        
 1   Type                      84386 non-null  object        
 2   Source                    84386 non-null  object        
 3   Amount                    84386 non-null  float64       
 4   Fee                       84386 non-null  float64       
 5   Destination Platform Fee  2 non-null      float64       
 6   Net                       84386 non-null  float64       
 7   Currency                  84386 non-null  object        
 8   Created (UTC)             84386 non-null  datetime64[ns]
 9   Available On (UTC)        84386 non-null  datetime64[ns]
 10  Description               84324 non-null  object        
 11  Customer Facing Amount    77255 non-null  float64       
 12  Customer Facing Cu

### Count of unique values or codes

In [11]:
for col in df.columns: print(col, df[col].nunique())

id 84386
Type 8
Source 84250
Amount 2335
Fee 228
Destination Platform Fee 1
Net 2407
Currency 2
Created (UTC) 82098
Available On (UTC) 1654
Description 210
Customer Facing Amount 628
Customer Facing Currency 1
Transfer 2027
Transfer Date (UTC) 1365
Transfer Group 2
receipt (metadata) 2
products (metadata) 9
name (metadata) 373
email (metadata) 376
phone (metadata) 342
order (metadata) 410


In [12]:
# print the percent of unique values per column
for col in df.columns: print(col, 100*df[col].nunique()/df.shape[0])

id 100.0
Type 0.009480245538359444
Source 99.8388358258479
Amount 2.7670466665086626
Fee 0.2701869978432441
Destination Platform Fee 0.0011850306922949305
Net 2.8523688763538977
Currency 0.002370061384589861
Created (UTC) 97.2886497760292
Available On (UTC) 1.960040765055815
Description 0.24885644538193538
Customer Facing Amount 0.7441992747612163
Customer Facing Currency 0.0011850306922949305
Transfer 2.402057213281824
Transfer Date (UTC) 1.61756689498258
Transfer Group 0.002370061384589861
receipt (metadata) 0.002370061384589861
products (metadata) 0.010665276230654374
name (metadata) 0.44201644822600905
email (metadata) 0.44557154030289386
phone (metadata) 0.4052804967648662
order (metadata) 0.48586258384092146
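The two loops above could also be written with vectorized calls, which return one tidy table instead of printed lines. The sketch below uses a made-up sample frame. The column names `n_unique`, `pct_unique` and `pct_missing` are illustrative choices. `value_counts(dropna=False)` is included because `nunique()` skips missing values by default, so NA counts would otherwise go unreported.

```python
import numpy as np
import pandas as pd

# Made-up sample with one complete and one partly missing column
sample = pd.DataFrame({
    "Type": ["charge", "charge", "refund", "payout"],
    "receipt (metadata)": [True, np.nan, True, np.nan],
})

# One row per column: distinct values, percent distinct, percent missing
summary = pd.DataFrame({
    "n_unique": sample.nunique(),
    "pct_unique": 100 * sample.nunique() / len(sample),
    "pct_missing": 100 * sample.isna().mean(),
})
print(summary)

# Counts per code, keeping NA as its own category
type_counts = sample["Type"].value_counts(dropna=False)
receipt_counts = sample["receipt (metadata)"].value_counts(dropna=False)
print(type_counts)
print(receipt_counts)
```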


#### Show values of a categorical column

In [13]:
pd.unique(df['products (metadata)'])

array(['["ISO-5.1.6"]', nan, '["ISO-5.1.5"]', '["ISO-5.1.4"]',
       '["ISO-5.1.3"]', '["ISO-5.1.2"]', '["ISO-5.1"]', '["ISO-5.0"]',
       '["ISO-0.4.1"]', '["ISO-0.4"]'], dtype=object)
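The values above look like JSON-encoded lists holding a single release code. If a plain string is easier to work with later, something along these lines could unpack them. This is a sketch that assumes every non-missing cell is valid JSON. The helper name `first_product` is hypothetical.

```python
import json

import numpy as np
import pandas as pd

# A few of the values shown above, plus a missing cell
products = pd.Series(['["ISO-5.1.6"]', np.nan, '["ISO-5.1"]', '["ISO-0.4.1"]'])


def first_product(raw):
    """Return the first entry of a JSON-encoded list, or NaN for an empty cell."""
    if pd.isna(raw):
        return np.nan
    items = json.loads(raw)
    return items[0] if items else np.nan


release = products.map(first_product)
print(release)
```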

### Summary Statistics for Numeric Columns

In [14]:
# review summary statistics for the numeric columns
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Amount | 84386.0 | 0.4753149 | 63.60006 | -4880.23 | 1.0 | 5.0 | 10.0 | 1000.0 |
| Fee | 84386.0 | 0.471606 | 0.6377782 | -29.3 | 0.33 | 0.45 | 0.59 | 29.3 |
| Destination Platform Fee | 2.0 | 0.15 | 0.0 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 |
| Net | 84386.0 | 0.003708909 | 63.53174 | -4880.23 | 0.67 | 4.55 | 9.31 | 970.7 |
| Customer Facing Amount | 77255.0 | 6.256969 | 9.168511 | -1000.0 | 1.0 | 5.0 | 10.0 | 1000.0 |
| phone (metadata) | 371.0 | 377487000000.0 | 2953657000000.0 | 4321243.0 | 4821565000.0 | 8145583000.0 | 34676360000.0 | 54351160000000.0 |
| order (metadata) | 410.0 | 10168980.0 | 8727048.0 | 1074951.0 | 2956036.0 | 7833630.0 | 14036340.0 | 39255520.0 |
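One possible sanity check on the amount columns is whether `Net` equals `Amount` minus `Fee`. That relationship is an assumption: it holds for the five rows displayed earlier, which are reused below, but whether it also holds for refunds, payouts and other transaction types has not been verified here.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head() earlier
sample = pd.DataFrame({
    "Amount": [5.0, 1.0, 1.0, 10.0, 1.0],
    "Fee": [0.45, 0.34, 0.34, 0.59, 0.33],
    "Net": [4.55, 0.66, 0.66, 9.41, 0.67],
})

# Compare with a half-cent tolerance to absorb floating-point rounding
consistent = np.isclose(sample["Amount"] - sample["Fee"], sample["Net"], atol=0.005)
mismatches = sample[~consistent]

print(f"{consistent.sum()} of {len(sample)} rows satisfy Net = Amount - Fee")
print(mismatches)
```

On the full dataset, any rows landing in `mismatches` would be worth inspecting by `Type` before relying on either column.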


## Data Cleaning

### Handle the missing and NA values

In [15]:
# fraction (0 to 1) of missing values per column, largest first; the column is labelled 'percent'
nas=pd.DataFrame(df.isnull().sum().sort_values(ascending=False)/len(df),columns = ['percent'])
pos = nas['percent'] > 0
nas[pos]

| | percent |
|---|---|
| Destination Platform Fee | 0.999976 |
| Transfer Group | 0.999953 |
| phone (metadata) | 0.995604 |
| order (metadata) | 0.995141 |
| email (metadata) | 0.99513 |
| name (metadata) | 0.99513 |
| products (metadata) | 0.35167 |
| receipt (metadata) | 0.35167 |
| Customer Facing Currency | 0.084505 |
| Customer Facing Amount | 0.084505 |


_All missing values fall in fields that are not of interest for this analysis, so they are ignored for now._
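If the sparse fields were to be dropped rather than ignored, a sketch like the following could list the missing share per column and remove those above a cutoff. The 99% threshold is a hypothetical value chosen for illustration, and the sample frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up sample: one fully empty column, one partly empty, two complete
sample = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "Amount": [5.0, 1.0, 1.0, 10.0],
    "Transfer Group": [np.nan, np.nan, np.nan, np.nan],
    "receipt (metadata)": [True, np.nan, True, True],
})

# isna().mean() gives the missing fraction directly; scale to a true percentage
missing_pct = (100 * sample.isna().mean()).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)

threshold = 99.0  # hypothetical cutoff, in percent
sparse_cols = missing_pct[missing_pct >= threshold].index.tolist()
trimmed = sample.drop(columns=sparse_cols)
print(trimmed.columns.tolist())
```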

### Look for duplicate rows

In [16]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

Unnamed: 0,id,Type,Source,Amount,Fee,Destination Platform Fee,Net,Currency,Created (UTC),Available On (UTC),...,Customer Facing Currency,Transfer,Transfer Date (UTC),Transfer Group,receipt (metadata),products (metadata),name (metadata),email (metadata),phone (metadata),order (metadata)
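`df.duplicated()` only flags rows that match on every column. A complementary check, sketched below on made-up data, looks for repeats of the transaction `id` alone. In this dataset the unique count of `id` equals the row count, so the check would be expected to come back empty.

```python
import pandas as pd

# Made-up sample: two rows share an id but differ in Fee
sample = pd.DataFrame({
    "id": ["txn_a", "txn_b", "txn_b"],
    "Amount": [5.0, 1.0, 1.0],
    "Fee": [0.45, 0.34, 0.33],
})

# Rows identical across all columns (none here, because Fee differs)
full_dupes = sample[sample.duplicated()]

# Rows sharing an id; keep=False marks every member of a duplicated group
id_dupes = sample[sample.duplicated(subset="id", keep=False)]

print(len(full_dupes), "fully duplicated rows")
print(id_dupes)
```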


In [17]:
# get counts and totals for each description group
df.groupby('Description').agg({'Amount':['sum','count'], 
                         'Created (UTC)':['min','max'] })

| Description | Amount (sum) | Amount (count) | Created (UTC) (min) | Created (UTC) (max) |
|---|---|---|---|---|
| Adjustment of lost dispute for ch_AgawCvBPmfSk05 | -2.71 | 1 | 2017-08-12 19:13:00 | 2017-08-12 19:13:00 |
| Application fee from application elementary AppCenter for alcinnz@eml.cc (acct_1BWv48LDw3ntuSqb) | 11.07 | 36 | 2018-01-13 14:43:00 | 2020-05-24 01:51:00 |
| Application fee from application elementary AppCenter for artem@anufrij.de (acct_1BqoBQF8oLkaj8EK) | 351.82 | 1237 | 2018-02-05 17:35:00 | 2020-07-15 00:52:00 |
| Application fee from application elementary AppCenter for awedeven+github@gmail.com (acct_1Fh9HbDV3QDXCd71) | 0.17 | 1 | 2020-03-24 06:11:00 | 2020-03-24 06:11:00 |
| Application fee from application elementary AppCenter for bablu.boy@gmail.com (acct_1A4BZiDKLihkFfTN) | 102.80 | 312 | 2017-04-28 21:47:00 | 2020-07-09 12:27:00 |
| ... | ... | ... | ... | ... |
| REFUND FOR PAYMENT | -32.32 | 21 | 2019-05-02 08:16:00 | 2020-07-02 21:10:00 |
| STRIPE PAYOUT | -443233.24 | 2030 | 2015-02-03 02:38:00 | 2020-07-15 01:10:00 |
| elementary OS Freya 0.3.2 | 2677.50 | 506 | 2016-02-12 06:26:00 | 2016-02-29 17:04:00 |
| elementary OS download | 59878.39 | 8631 | 2015-02-02 20:25:00 | 2016-02-12 04:33:00 |


### Relabel Loki 0.4 as Loki 0.4.1

In [18]:
df.loc[df.Description == 'Loki 0.4', 'Description'] = 'Loki 0.4.1'
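The same relabeling could be expressed with a mapping passed to `Series.replace`, which scales more easily if further labels need merging. The sketch below uses a made-up sample. The `relabel` dictionary mirrors the assignment above, and the before/after counts are one way to confirm that no rows were lost.

```python
import pandas as pd

# Made-up sample of Description labels
sample = pd.DataFrame({
    "Description": ["Loki 0.4", "Loki 0.4.1", "Juno 5.0", "Loki 0.4"],
})

# old label -> new label; replace() matches whole values, not substrings
relabel = {"Loki 0.4": "Loki 0.4.1"}

before = sample["Description"].value_counts()
sample["Description"] = sample["Description"].replace(relabel)
after = sample["Description"].value_counts()

print(before)
print(after)
```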

### Select only download revenue rows

In [19]:
# make a new dataframe with just the download revenues
download_descriptions = ['Freya 0.3.2', 'Hera 5.1', 'Hera 5.1.2', 'Hera 5.1.3', 'Hera 5.1.4', 'Hera 5.1.5',
                         'Hera 5.1.6', 'Juno 5.0', 'Loki 0.4', 'Loki 0.4.0', 'Loki 0.4.1']
df2 = df.loc[df['Description'].isin(download_descriptions)]

In [20]:
# get counts and totals for each description group
df2.groupby('Description').agg({'Amount':['sum','count'], 
                         'Created (UTC)':['min','max'] })

| Description | Amount (sum) | Amount (count) | Created (UTC) (min) | Created (UTC) (max) |
|---|---|---|---|---|
| Freya 0.3.2 | 24345.74 | 4717 | 2016-03-01 23:48:00 | 2016-09-09 19:29:00 |
| Hera 5.1 | 43309.9 | 5948 | 2019-12-03 14:52:00 | 2020-02-07 02:14:00 |
| Hera 5.1.2 | 27999.07 | 4249 | 2020-02-06 03:04:00 | 2020-04-06 17:11:00 |
| Hera 5.1.3 | 12683.57 | 1725 | 2020-04-06 17:19:00 | 2020-05-04 18:11:00 |
| Hera 5.1.4 | 13812.91 | 2002 | 2020-05-01 20:41:00 | 2020-06-05 23:00:00 |
| Hera 5.1.5 | 11527.81 | 1788 | 2020-06-04 16:17:00 | 2020-07-08 04:29:00 |
| Hera 5.1.6 | 2104.2 | 281 | 2020-07-08 04:48:00 | 2020-07-15 00:35:00 |
| Juno 5.0 | 109089.17 | 15764 | 2018-10-16 18:58:00 | 2019-12-03 14:26:00 |
| Loki 0.4.0 | 7038.46 | 896 | 2016-09-09 17:01:00 | 2016-09-16 21:40:00 |
| Loki 0.4.1 | 153791.64 | 30141 | 2016-09-13 21:10:00 | 2018-10-18 10:32:00 |
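The dictionary form of `agg` used above produces two-level column headers. If flat, self-describing column names are preferred, pandas' named aggregation is one alternative. The output names below (`total_amount`, `n_transactions`, `first_seen`, `last_seen`) are illustrative choices, and the sample data is made up.

```python
import pandas as pd

# Made-up sample with two release labels
sample = pd.DataFrame({
    "Description": ["Juno 5.0", "Juno 5.0", "Hera 5.1"],
    "Amount": [5.0, 1.0, 10.0],
    "Created (UTC)": pd.to_datetime(
        ["2018-10-16 18:58", "2019-12-03 14:26", "2019-12-03 14:52"]
    ),
})

# Named aggregation: new_column=(source_column, function)
summary = sample.groupby("Description").agg(
    total_amount=("Amount", "sum"),
    n_transactions=("Amount", "count"),
    first_seen=("Created (UTC)", "min"),
    last_seen=("Created (UTC)", "max"),
)
print(summary)
```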


## Export data to a new CSV file

In [21]:
df2.to_csv('data/step2_output.csv', index = False)
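A portable way to write the export is to build the path with `pathlib`, create the folder if it is missing, and read the file back as a quick round-trip check. The sketch below writes to a temporary directory with made-up data so it can run anywhere. In the notebook, the target would presumably be `Path('data') / 'step2_output.csv'`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sample standing in for the filtered download revenue rows
sample = pd.DataFrame({
    "Description": ["Juno 5.0", "Hera 5.1"],
    "Amount": [5.0, 10.0],
    "Created (UTC)": pd.to_datetime(["2018-10-16 18:58", "2019-12-03 14:52"]),
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "step2_output.csv"
    # Create the output folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(out_path, index=False)

    # CSV stores dates as text, so ask pandas to parse them again on reload
    reloaded = pd.read_csv(out_path, parse_dates=["Created (UTC)"])

print(reloaded.dtypes)
```

Because CSV does not preserve dtypes, the next notebook step would need the same `parse_dates` argument (or another call to `pd.to_datetime`) to get datetime columns back.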