## You have been asked to work with a bank to clean and store data collected during a recent marketing campaign that aimed to get customers to take out a personal loan. The bank plans to run more marketing campaigns, so it would like you to set up a PostgreSQL database to store this campaign's data, with a schema designed so that data from future campaigns can be imported easily.

## They have supplied you with a CSV file called "bank_marketing.csv", which you will need to clean, reformat, and split so that you can save a separate file for each table you will create.

# Project Task

### Use your data cleaning and database design skills to author a script that sets up tables in a PostgreSQL database for bank marketing campaigns

### Work with csv data in Python before producing tables in a PostgreSQL database to hold information about bank marketing campaigns.

In [76]:
import numpy as np
import pandas as pd

### Read in bank_marketing.csv as a pandas DataFrame.

In [77]:
bank_marketing_df = pd.read_csv('bank_marketing.csv')
bank_marketing_df.columns

Index(['client_id', 'age', 'job', 'marital', 'education', 'credit_default',
       'housing', 'loan', 'contact', 'month', 'day', 'duration', 'campaign',
       'pdays', 'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
       'cons_conf_idx', 'euribor3m', 'nr_employed', 'y'],
      dtype='object')

### Split the data into three DataFrames using information provided about the desired tables as your guide: one with information about the client, another containing campaign data, and a third to store information about economics at the time of the campaign.

In [78]:
client = bank_marketing_df.iloc[:, 0:8]
campaign = bank_marketing_df.iloc[:, [0, 12, 11, 13, 14, 15, 21, 9, 10]]
economics = bank_marketing_df.iloc[:, [0, 16, 17, 19, 20]]
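Positional `iloc` indices depend on the column order of the source file staying fixed. As a hedged alternative, the same kind of split can be written with column names, which may be easier to reuse for future campaign files. The sketch below uses a small made-up frame (`toy`, with only a few of the real column names) rather than the actual data:

```python
import pandas as pd

# Toy frame with a subset of the real column names (values are made up).
toy = pd.DataFrame({
    "client_id": [0, 1],
    "age": [56, 57],
    "job": ["housemaid", "services"],
    "emp_var_rate": [1.1, 1.1],
    "euribor3m": [4.857, 4.857],
})

client_cols = ["client_id", "age", "job"]
economics_cols = ["client_id", "emp_var_rate", "euribor3m"]

# .copy() gives each subset its own data, so later column assignments
# on the subset do not raise SettingWithCopyWarning.
toy_client = toy[client_cols].copy()
toy_economics = toy[economics_cols].copy()
```

Selecting by name also fails loudly with a `KeyError` if an expected column is missing from a new file, instead of silently picking the wrong column.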

### Rename the column "client_id" to "id" in client (leave it as-is in the other subsets).

In [79]:
client = client.rename(columns={"client_id":"id"})
client.head(3)

Unnamed: 0,id,age,job,marital,education,credit_default,housing,loan
0,0,56,housemaid,married,basic.4y,no,no,no
1,1,57,services,married,high.school,unknown,no,no
2,2,37,services,married,high.school,no,yes,no


### Rename "duration" to "contact_duration", "previous" to "previous_campaign_contacts", "y" to "campaign_outcome", "poutcome" to "previous_outcome", and "campaign" to "number_contacts" in campaign.


In [80]:
campaign = campaign.rename(columns={
    "duration": "contact_duration",
    "previous": "previous_campaign_contacts",
    "y": "campaign_outcome",
    "poutcome": "previous_outcome",
    "campaign": "number_contacts"
})
campaign.head(3)

Unnamed: 0,client_id,number_contacts,contact_duration,pdays,previous_campaign_contacts,previous_outcome,campaign_outcome,month,day
0,0,1,261,999,0,nonexistent,no,may,13
1,1,1,149,999,0,nonexistent,no,may,19
2,2,1,226,999,0,nonexistent,no,may,23


### Rename "euribor3m" to "euribor_three_months" and "nr_employed" to "number_employed" in economics.

In [81]:
economics = economics.rename(columns={
    "euribor3m": "euribor_three_months",
    "nr_employed": "number_employed"
})
economics.head(3)

Unnamed: 0,client_id,emp_var_rate,cons_price_idx,euribor_three_months,number_employed
0,0,1.1,93.994,4.857,5191.0
1,1,1.1,93.994,4.857,5191.0
2,2,1.1,93.994,4.857,5191.0


### Clean the "education" column, changing "." to "_" and "unknown" to NumPy's null values.

In [82]:
client["education"].value_counts(dropna=False)

university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
unknown                 1731
illiterate                18
Name: education, dtype: int64

In [83]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0.
client["education"] = client["education"].replace(["unknown"], np.nan)
client["education"] = client["education"].str.replace(".", "_", regex=False)
client["education"].value_counts(dropna=False)

university_degree      12168
high_school             9515
basic_9y                6045
professional_course     5243
basic_4y                4176
basic_6y                2292
NaN                     1731
illiterate                18
Name: education, dtype: int64
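The two cleaning steps can be checked on a tiny example before running them on the full column. This is a minimal sketch on a made-up Series (`toy_education` is a hypothetical name, not part of the notebook), chaining the period replacement and the null conversion:

```python
import numpy as np
import pandas as pd

# Three made-up values covering both cleaning rules.
toy_education = pd.Series(["basic.4y", "unknown", "high.school"])

# regex=False treats "." as a literal period rather than "any character".
cleaned = (
    toy_education
    .str.replace(".", "_", regex=False)
    .replace("unknown", np.nan)
)
```

Without `regex=False`, older pandas versions that default to regular expressions would treat `"."` as a wildcard and replace every character.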

### Remove periods from the "job" column.

In [84]:
client["job"].value_counts(dropna=False)

admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: job, dtype: int64

In [85]:
client["job"] = client["job"].str.replace(".", "", regex=False)
client["job"].value_counts(dropna=False)

admin            10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: job, dtype: int64

### Convert the outcome columns to binary (1 or 0): "success" and "failure" in "previous_outcome", and "yes" and "no" in "campaign_outcome". Also change "nonexistent" to NumPy's null value in "previous_outcome".

In [86]:
campaign["previous_outcome"].value_counts(dropna=False)

nonexistent    35563
failure         4252
success         1373
Name: previous_outcome, dtype: int64

In [87]:
campaign["campaign_outcome"].value_counts(dropna=False)

no     36548
yes     4640
Name: campaign_outcome, dtype: int64

In [88]:
# Map outcome labels to 1/0; "nonexistent" becomes a null value.
campaign = campaign.replace(
    {"previous_outcome": {"success": 1, "failure": 0, "nonexistent": np.nan},
     "campaign_outcome": {"yes": 1, "no": 0}})
campaign["previous_outcome"].value_counts(dropna=False)

NaN    35563
0.0     4252
1.0     1373
Name: previous_outcome, dtype: int64

In [89]:
campaign["campaign_outcome"].value_counts(dropna=False)

0    36548
1     4640
Name: campaign_outcome, dtype: int64

### Add a column called campaign_id in campaign, where all rows have a value of 1.

In [90]:
# A scalar value is broadcast to every row, so no helper array is needed.
campaign.insert(loc=0, column="campaign_id", value=1)
campaign["campaign_id"].value_counts()

1    41188
Name: campaign_id, dtype: int64

### Create a datetime column called last_contact_date, in the format of "year-month-day", where the year is 2022, and the month and day values are taken from the "month" and "day" columns.

In [91]:
campaign = campaign.replace(
    {"month": {
        "may": 5,
        "jul": 7,
        "aug": 8,
        "jun": 6,
        "nov": 11,
        "apr": 4,
        "oct": 10,
        "sep": 9,
        "mar": 3,
        "dec": 12}
    }
)

In [92]:
campaign["last_contact_date"] = pd.to_datetime(
    dict(
        year=2022, 
        month=campaign["month"].astype(str),
        day=campaign["day"]
    )
)
campaign.dtypes

campaign_id                            int64
client_id                              int64
number_contacts                        int64
contact_duration                       int64
pdays                                  int64
previous_campaign_contacts             int64
previous_outcome                     float64
campaign_outcome                       int64
month                                  int64
day                                    int64
last_contact_date             datetime64[ns]
dtype: object
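The date assembly can be tried on a small example first. This is a minimal sketch on made-up rows (`toy`, `month_numbers`, and `parts` are hypothetical names), relying on the fact that `pd.to_datetime` accepts a DataFrame with "year", "month", and "day" columns:

```python
import pandas as pd

# Two made-up rows in the same shape as the "month" and "day" columns.
toy = pd.DataFrame({"month": ["may", "nov"], "day": [13, 30]})
month_numbers = {"may": 5, "nov": 11}

# The scalar year is broadcast to every row when the frame is built.
parts = pd.DataFrame({
    "year": 2022,
    "month": toy["month"].map(month_numbers),
    "day": toy["day"],
})
toy["last_contact_date"] = pd.to_datetime(parts)
```

Any month abbreviation missing from the mapping becomes a null under `map`, so it is worth checking for nulls in the mapped column before building the dates.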

### Remove any redundant data that might have been used to create new columns.


In [96]:
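# drop() returns a new DataFrame, so assign the result back;
# otherwise "month" and "day" would still be written to campaign.csv.
campaign = campaign.drop(columns=["month", "day"])
campaign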

Unnamed: 0,campaign_id,client_id,number_contacts,contact_duration,pdays,previous_campaign_contacts,previous_outcome,campaign_outcome,last_contact_date
0,1,0,1,261,999,0,,0,2022-05-13
1,1,1,1,149,999,0,,0,2022-05-19
2,1,2,1,226,999,0,,0,2022-05-23
3,1,3,1,151,999,0,,0,2022-05-27
4,1,4,1,307,999,0,,0,2022-05-03
...,...,...,...,...,...,...,...,...,...
41183,1,41183,1,334,999,0,,1,2022-11-30
41184,1,41184,1,383,999,0,,0,2022-11-06
41185,1,41185,2,189,999,0,,0,2022-11-24
41186,1,41186,1,442,999,0,,1,2022-11-17


### Save the three DataFrames to csv files without an index as client.csv, campaign.csv, and economics.csv respectively.

In [101]:
client.to_csv("client.csv", index=False)
economics.to_csv("economics.csv", index=False)
campaign.to_csv("campaign.csv", index=False)

### Create a variable called client_table, containing SQL code as a string to create a table called client using values from client.csv.
### Create a variable called campaign_table, containing SQL code as a string to create a table called campaign using values from campaign.csv.
### Create a variable called economics_table, containing SQL code as a string to create a table called economics using values from economics.csv.

### In client_table, campaign_table, and economics_table, ensure the final line copies the data from the respective csv file using the following code snippet:

`\copy table_name from 'file_name.csv' DELIMITER ',' CSV HEADER`
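The notebook stops before these variables are defined, so here is a minimal sketch of what they could look like. The column names follow the renamed DataFrames, with campaign.csv assumed to hold the columns left after "month" and "day" are removed. The column types, the primary key, and the foreign keys are assumptions inferred from the cleaned data, not requirements stated in the task.

```python
# Assumed types throughout; adjust to the bank's actual requirements.
client_table = """CREATE TABLE client
(
    id INTEGER PRIMARY KEY,
    age INTEGER,
    job TEXT,
    marital TEXT,
    education TEXT,
    credit_default TEXT,
    housing TEXT,
    loan TEXT
);
\\copy client from 'client.csv' DELIMITER ',' CSV HEADER
"""

# previous_outcome is NUMERIC because pandas writes the float column
# as 0.0 / 1.0, which an INTEGER or BOOLEAN column would reject.
campaign_table = """CREATE TABLE campaign
(
    campaign_id INTEGER,
    client_id INTEGER REFERENCES client (id),
    number_contacts INTEGER,
    contact_duration INTEGER,
    pdays INTEGER,
    previous_campaign_contacts INTEGER,
    previous_outcome NUMERIC,
    campaign_outcome INTEGER,
    last_contact_date DATE
);
\\copy campaign from 'campaign.csv' DELIMITER ',' CSV HEADER
"""

economics_table = """CREATE TABLE economics
(
    client_id INTEGER REFERENCES client (id),
    emp_var_rate NUMERIC,
    cons_price_idx NUMERIC,
    euribor_three_months NUMERIC,
    number_employed NUMERIC
);
\\copy economics from 'economics.csv' DELIMITER ',' CSV HEADER
"""
```

The backslash is doubled in the Python source so that each string contains a literal `\copy`. Because campaign and economics reference client, the client table would need to be created and loaded first.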