# Sourcing raw data and saving processed data

<ol>
    <li> Only columns <b>'Consumer complaint narrative'</b> and <b>'Product'</b> are needed. </li>
    <li> All observations with a missing value in the variable <b>'Consumer complaint narrative'</b> need to be removed. </li>
    <li> All duplicate observations in the dataframe need to be removed. </li>
    <li> The target variable <b>'Product'</b> needs to be remapped based on the analysis done. </li>
    <li> The data needs to be split into training, testing and validation sets, and each set saved to a file. </li>
</ol>

# Importing Modules

In [8]:
%load_ext autotime
import wget
import pandas as pd
import preprocessorRawdata as pp
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from feature_engine.imputation import DropMissingData

time: 0 ns (started: 2022-01-26 18:07:37 +05:30)


# Download and Load the Latest Data

In [2]:
# wget appends a suffix such as ' (1)' when the target file already exists,
# so keep the returned path and read that exact file in the next cell.
zip_path = wget.download('https://files.consumerfinance.gov/ccdb/complaints.csv.zip')
zip_path

100% [......................................................................] 398362492 / 398362492

'complaints.csv (1).zip'

time: 49.6 s (started: 2022-01-26 18:06:06 +05:30)


In [3]:
# Reading only the required columns from the file that was just downloaded
con_com = pd.read_csv(zip_path, compression='zip',
                      usecols=['Product', 'Consumer complaint narrative'])

time: 15.7 s (started: 2022-01-26 18:06:56 +05:30)


# Configuration

In [4]:
# variable mappings
PRODUCT_MAPPING = {'Credit card': 'Credit card or prepaid card',
                   'Prepaid card': 'Credit card or prepaid card',
                   'Credit reporting':'Credit reporting, credit repair services, or other personal consumer reports',
                   'Money transfers':'Money transfer, virtual currency, or money service',
                   'Virtual currency':'Money transfer, virtual currency, or money service',
                   'Payday loan':'Consumer loan, Vehicle loan or lease, Payday loan, title loan, or personal loan',
                   'Other financial service': 'Money transfer, virtual currency, or money service',
                   'Consumer Loan':'Consumer loan, Vehicle loan or lease, Payday loan, title loan, or personal loan',
                   'Vehicle loan or lease':'Consumer loan, Vehicle loan or lease, Payday loan, title loan, or personal loan',
                   'Payday loan, title loan, or personal loan':'Consumer loan, Vehicle loan or lease, Payday loan, title loan, or personal loan',
                   'Bank account or service':'Bank account or service, Savings account',
                   'Checking or savings account':'Bank account or service, Savings account'}

# Independent variables
INDEPENDENT_FEATURES = ['Consumer complaint narrative']

# Dependent variable
DEPENDENT_FEATURES = ['Product']

time: 0 ns (started: 2022-01-26 18:07:11 +05:30)


# Pipeline

In [5]:
# set up the pipeline
price_pipe = Pipeline([
    
    # ===== DROP MISSING DATA ===== #
    ('drop_missing_observation', DropMissingData(
        variables=INDEPENDENT_FEATURES)),
    
    # ===== DROP DUPLICATE DATA ===== #
    ('drop_duplicate_observations', pp.DropDuplicateData()),
    
    # ===== REMAPPING TARGET VARIABLE ===== #
    ('target_variable_mapping', pp.Mapper(DEPENDENT_FEATURES, PRODUCT_MAPPING)),
    
])

time: 15 ms (started: 2022-01-26 18:07:11 +05:30)
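
The module `preprocessorRawdata` is not shown in this notebook, so the behaviour of `pp.DropDuplicateData` and `pp.Mapper` can only be inferred from their names and arguments. The block below is a minimal sketch of what such scikit-learn compatible transformers might look like; both classes, the toy dataframe and the one-entry mapping are illustrative assumptions, not the actual implementation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropDuplicateData(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.DropDuplicateData: drops fully duplicated rows."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.drop_duplicates().reset_index(drop=True)


class Mapper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.Mapper: relabels values in the given columns."""

    def __init__(self, variables, mappings):
        self.variables = variables
        self.mappings = mappings

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = X.copy()
        for col in self.variables:
            # replace() leaves labels that are absent from the mapping unchanged,
            # whereas Series.map() would turn them into NaN.
            X[col] = X[col].replace(self.mappings)
        return X


# Toy data (made up for illustration): one duplicate row and one missing narrative.
toy = pd.DataFrame({
    'Consumer complaint narrative': ['late fee', 'late fee', None, 'wrong balance'],
    'Product': ['Credit card', 'Credit card', 'Mortgage', 'Mortgage'],
})
toy_pipe = Pipeline([
    ('drop_duplicate_observations', DropDuplicateData()),
    ('target_variable_mapping',
     Mapper(['Product'], {'Credit card': 'Credit card or prepaid card'})),
])
# Missing narratives are dropped with plain pandas here to keep the sketch short.
result = toy_pipe.fit_transform(toy.dropna(subset=['Consumer complaint narrative']))
print(result)
```

Using `replace` rather than `map` in the sketch means that products without an entry in the mapping, such as 'Mortgage', keep their original label.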


# Saving Train, Test and Validation Splits

In [6]:
# con_com = price_pipe.fit_transform(con_com)

# trainX, testX, valX, trainY, testY, valY = pp.trainTestValid_split(con_com['Consumer complaint narrative'],
#                                                                    con_com['Product'],
#                                                                    trainsize=70000,
#                                                                    testsize=30000)

# train = pd.DataFrame({'consumer_complaint':trainX, 'product':trainY})
# test = pd.DataFrame({'consumer_complaint':testX, 'product':testY})
# valid = pd.DataFrame({'consumer_complaint':valX, 'product':valY})

# # Saving train, test and validation data
# train.to_csv('train.csv', index=False)
# test.to_csv('test.csv', index=False)
# valid.to_csv('valid.csv', index=False)

time: 0 ns (started: 2022-01-26 18:07:11 +05:30)
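
The helper `pp.trainTestValid_split` used in the commented-out cell is not defined in this notebook. One plausible reading of its signature is a three-way split with fixed train and test sizes and the remainder used for validation. The sketch below implements that reading with two calls to `train_test_split`; the function name `train_test_valid_split`, the `random_state` argument and the stratification are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def train_test_valid_split(X, y, trainsize, testsize, random_state=0):
    """Hypothetical three-way split: fixed-size train and test sets, remainder as validation."""
    # First carve out the training set, stratified on the labels.
    trainX, restX, trainY, restY = train_test_split(
        X, y, train_size=trainsize, random_state=random_state, stratify=y)
    # Then split what is left into a test set and a validation set.
    testX, valX, testY, valY = train_test_split(
        restX, restY, train_size=testsize, random_state=random_state, stratify=restY)
    return trainX, testX, valX, trainY, testY, valY


# Toy data (made up for illustration): 100 rows, two balanced classes.
X = pd.Series([f'complaint {i}' for i in range(100)])
y = pd.Series(['A', 'B'] * 50)
trainX, testX, valX, trainY, testY, valY = train_test_valid_split(
    X, y, trainsize=70, testsize=20)
print(len(trainX), len(testX), len(valX))
```

The return order mirrors the unpacking used in the commented-out call, so the three dataframes could be built the same way as in that cell.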


In [9]:
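# Note: the pipeline call in the previous cell is commented out, so this split uses
# con_com as it currently stands. Run price_pipe.fit_transform(con_com) first if the
# cleaning and remapping steps have not been applied yet.
trainX, testX, trainY, testY = train_test_split(con_com['Consumer complaint narrative'],
                                                con_com['Product'],
                                                train_size=100000,
                                                random_state=0,
                                                stratify=con_com['Product'])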

train = pd.DataFrame({'consumer_complaint':trainX, 'product':trainY})
test = pd.DataFrame({'consumer_complaint':testX, 'product':testY})

# Saving train and test data
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)

time: 22.5 s (started: 2022-01-26 18:07:41 +05:30)
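
Because only `train_size` is given in the split above, `train_test_split` assigns every remaining row to the test set, and `stratify` keeps the class proportions roughly equal in both parts. The sketch below illustrates this on a small, made-up imbalanced dataset; the labels and sizes are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (made up for illustration): 80% 'A', 20% 'B'.
labels = pd.Series(['A'] * 80 + ['B'] * 20)
texts = pd.Series([f'complaint {i}' for i in range(100)])

# Only train_size is set, so the test set receives the remaining 50 rows.
trainX, testX, trainY, testY = train_test_split(
    texts, labels, train_size=50, random_state=0, stratify=labels)

# Share of each class in the two parts; stratification keeps them aligned.
train_share = trainY.value_counts(normalize=True)
test_share = testY.value_counts(normalize=True)
print(train_share['B'], test_share['B'])
```

The same `value_counts(normalize=True)` check can be run on `train['product']` and `test['product']` to confirm that the saved files preserve the class balance.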


# End of Notebook