![openclassrooms](https://s3.eu-west-1.amazonaws.com/course.oc-static.com/courses/6204541/1+HnqdJ-5ofxiPP9HIxdNdpw.jpeg)
# Merge Data Using Pandas
This time, your task will be to build a more comprehensive dataset. You’ll need to use all of the datasets we’ve provided (the two customer files and the loans file) and merge them using all of the Pandas methods we’ve covered.


In [1]:
import numpy as np
import pandas as pd

In [2]:
# previous processing
loans = pd.read_csv('https://raw.githubusercontent.com/OpenClassrooms-Student-Center/en-8253136-Use-Python-Libraries-for-Data-Science/main/data/loans.csv')

# calculate the debt-to-income ratio
loans['debt_to_income'] = round(loans['repayment'] * 100 / loans['income'], 2)

# rename rate to interest_rate
loans.rename(columns={'rate':'interest_rate'}, inplace=True)

# calculate the total cost of the loan
loans['total_cost'] = loans['repayment'] * loans['term']

# calculate monthly profits generated
loans['profit'] = round((loans['total_cost'] * loans['interest_rate']/100)/(24), 2)

# create the risk variable
loans['risk'] = 'No'
loans.loc[loans['debt_to_income'] > 35, 'risk'] = 'Yes'

# customer profile DataFrame: one row per customer, summed across all of their loans
customer_profile = loans.groupby('identifier')[['repayment','debt_to_income','total_cost','profit']].sum()
customer_profile.reset_index(inplace=True)

loans.head()


Unnamed: 0,identifier,city,zip code,income,repayment,term,type,interest_rate,debt_to_income,total_cost,profit,risk
0,0,CHICAGO,60100,3669.0,1130.05,240,real estate,1.168,30.8,271212.0,131.99,No
1,1,DETROIT,48009,5310.0,240.0,64,automobile,3.701,4.52,15360.0,23.69,No
2,1,DETROIT,48009,5310.0,1247.85,300,real estate,1.173,23.5,374355.0,182.97,No
3,2,SAN FRANCISCO,94010,1873.0,552.54,240,real estate,0.972,29.5,132609.6,53.71,No
4,3,SAN FRANCISCO,94010,1684.0,586.03,180,real estate,1.014,34.8,105485.4,44.57,No
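
As a quick sanity check on the derived columns, the sketch below recomputes them by hand for the first two loans shown above. The small DataFrame is retyped from the displayed output (`sample` is a hypothetical name used only here), so it stands in for the full `loans` file rather than reading it.

```python
import pandas as pd

# Two rows retyped from the loans.head() output (illustrative stand-in for the real file)
sample = pd.DataFrame({
    'identifier': [0, 1],
    'income': [3669.0, 5310.0],
    'repayment': [1130.05, 240.0],
    'term': [240, 64],
    'interest_rate': [1.168, 3.701],
})

# same formulas as the processing cell
sample['debt_to_income'] = round(sample['repayment'] * 100 / sample['income'], 2)
sample['total_cost'] = sample['repayment'] * sample['term']
sample['profit'] = round((sample['total_cost'] * sample['interest_rate'] / 100) / 24, 2)

# flag loans whose debt-to-income ratio exceeds 35
sample['risk'] = 'No'
sample.loc[sample['debt_to_income'] > 35, 'risk'] = 'Yes'

print(sample[['identifier', 'debt_to_income', 'total_cost', 'profit', 'risk']])
```

The printed values should line up with the first two rows of the table above, which is a handy way to confirm that you understand each formula before moving on.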


Firstly, let’s import the two customer files:

In [3]:
customers_1 = pd.read_csv('https://raw.githubusercontent.com/OpenClassrooms-Student-Center/en-8253136-Use-Python-Libraries-for-Data-Science/main/data/customers.csv')
customers_1.head()


Unnamed: 0,identifier,email,name,gender
0,0,JohnSmith@rhyta.com,John Smith,M
1,1,MaryJohnson@fleckens.hu,Mary Johnson,F
2,2,WilliamBrown@einrot.com,William Brown,M
3,3,JamesLee@armyspy.com,James Lee,M
4,4,PatriciaGarcia@rhyta.com,Patricia Garcia,F


In [4]:
customers_2 = pd.read_csv('https://raw.githubusercontent.com/OpenClassrooms-Student-Center/en-8253136-Use-Python-Libraries-for-Data-Science/main/data/customers_cont.csv')
customers_2.head()


Unnamed: 0,identifier,email,name,gender
0,150,EricHayes@teleworm.us,Eric Hayes,M
1,151,MonaMoreno@armyspy.com,Mona Moreno,F
2,152,VincentCraig@einrot.com,Vincent Craig,M
3,153,GlendaParsons@cuvox.de,Glenda Parsons,F
4,154,RogerWatkins@dayrep.com,Roger Watkins,M


Your first task will be to bring together these two DataFrames, `customers_1` and `customers_2`, into one big DataFrame called `customers` which will contain all of our customer data.
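
One way to stack two tables that share the same columns is `pd.concat`. The sketch below uses two tiny hand-made frames (`part_a` and `part_b` are hypothetical stand-ins for `customers_1` and `customers_2`, with rows retyped from the outputs above). Passing `ignore_index=True` is a choice made here to get a fresh index instead of repeated row labels.

```python
import pandas as pd

# hypothetical stand-ins for the two customer files
part_a = pd.DataFrame({
    'identifier': [0, 1],
    'email': ['JohnSmith@rhyta.com', 'MaryJohnson@fleckens.hu'],
    'name': ['John Smith', 'Mary Johnson'],
    'gender': ['M', 'F'],
})
part_b = pd.DataFrame({
    'identifier': [150, 151],
    'email': ['EricHayes@teleworm.us', 'MonaMoreno@armyspy.com'],
    'name': ['Eric Hayes', 'Mona Moreno'],
    'gender': ['M', 'F'],
})

# stack the rows of part_b under part_a and renumber the index
stacked = pd.concat([part_a, part_b], ignore_index=True)
print(stacked)
```

Applied to the exercise, the same call on `[customers_1, customers_2]` would be one way to produce `customers`.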

In [7]:
customer_profile = loans.groupby('identifier')[['repayment','debt_to_income','total_cost','profit']].sum()
customer_profile.reset_index(inplace=True)
customer_profile.head()

Unnamed: 0,identifier,repayment,debt_to_income,total_cost,profit
0,0,1130.05,30.8,271212.0,131.99
1,1,1487.85,28.02,389715.0,206.66
2,2,552.54,29.5,132609.6,53.71
3,3,586.03,34.8,105485.4,44.57
4,4,423.61,28.7,101666.4,51.21
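
To see what the `groupby(...).sum()` step does, here is a minimal sketch on three loans retyped from the `loans.head()` output: one loan for customer 0 and two for customer 1. The names `loans_sample` and `profile_sample` are hypothetical and used only for this illustration.

```python
import pandas as pd

# three loans retyped from the loans table: customer 1 holds two of them
loans_sample = pd.DataFrame({
    'identifier': [0, 1, 1],
    'repayment': [1130.05, 240.0, 1247.85],
    'debt_to_income': [30.8, 4.52, 23.5],
    'total_cost': [271212.0, 15360.0, 374355.0],
    'profit': [131.99, 23.69, 182.97],
})

# one row per customer: each numeric column is summed across that customer's loans
profile_sample = (
    loans_sample
    .groupby('identifier')[['repayment', 'debt_to_income', 'total_cost', 'profit']]
    .sum()
    .reset_index()
)
print(profile_sample)
```

Customer 1's two loans collapse into a single row whose values are the sums of both loans, which is why their row in `customer_profile` shows larger figures than either individual loan.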


Now merge the customer data with the customer profiles. These profiles are in the `customer_profile` DataFrame, which we first built in chapter 4 and recreated in the cell above. Call the final DataFrame `data`:

In [8]:

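# combine both customer files first, so customers from the second file are not dropped
customers = pd.concat([customers_1, customers_2], ignore_index=True)
data = pd.merge(customer_profile, customers, on='identifier')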
data.head()

Unnamed: 0,identifier,repayment,debt_to_income,total_cost,profit,email,name,gender
0,0,1130.05,30.8,271212.0,131.99,JohnSmith@rhyta.com,John Smith,M
1,1,1487.85,28.02,389715.0,206.66,MaryJohnson@fleckens.hu,Mary Johnson,F
2,2,552.54,29.5,132609.6,53.71,WilliamBrown@einrot.com,William Brown,M
3,3,586.03,34.8,105485.4,44.57,JamesLee@armyspy.com,James Lee,M
4,4,423.61,28.7,101666.4,51.21,PatriciaGarcia@rhyta.com,Patricia Garcia,F
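
Keep in mind that `pd.merge` performs an inner join by default, so only identifiers found in both tables survive. The sketch below illustrates this with hypothetical frames (`profiles` and `people`); the repayment for identifier 150 is a made-up value, included only to show a row without a match.

```python
import pandas as pd

# hypothetical profiles: identifier 150 has no matching row in `people`
profiles = pd.DataFrame({
    'identifier': [0, 1, 150],
    'repayment': [1130.05, 1487.85, 500.0],  # 500.0 is invented for illustration
})
people = pd.DataFrame({
    'identifier': [0, 1],
    'name': ['John Smith', 'Mary Johnson'],
})

# default how='inner': rows are kept only when the key exists on both sides
inner = pd.merge(profiles, people, on='identifier')
print(inner)
print(len(profiles), '->', len(inner))
```

Comparing row counts before and after a merge, as in the last line, is a cheap way to notice rows that were silently dropped.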


The bank’s marketing department has provided us with a file containing our customers' ages:

In [9]:
customers_age = pd.read_csv('https://raw.githubusercontent.com/OpenClassrooms-Student-Center/en-8253136-Use-Python-Libraries-for-Data-Science/main/data/customers_age.csv')
customers_age.head()


Unnamed: 0,identifier,age
0,0,54
1,1,23
2,2,3
3,3,42
4,4,47


Add the age information to the `data` DataFrame. Note that some customers who took out a loan don’t appear in this file. Every row already in `data` must be retained, so choose the `how` argument of your merge with care!

In [10]:
data = pd.merge(data, customers_age, on='identifier', how='left')
data.head()

Unnamed: 0,identifier,repayment,debt_to_income,total_cost,profit,email,name,gender,age
0,0,1130.05,30.8,271212.0,131.99,JohnSmith@rhyta.com,John Smith,M,54.0
1,1,1487.85,28.02,389715.0,206.66,MaryJohnson@fleckens.hu,Mary Johnson,F,23.0
2,2,552.54,29.5,132609.6,53.71,WilliamBrown@einrot.com,William Brown,M,3.0
3,3,586.03,34.8,105485.4,44.57,JamesLee@armyspy.com,James Lee,M,42.0
4,4,423.61,28.7,101666.4,51.21,PatriciaGarcia@rhyta.com,Patricia Garcia,F,47.0
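
You may have noticed that the ages now display as `54.0` rather than `54`. With `how='left'`, customers missing from the age file get `NaN`, and an integer column that has to hold `NaN` is converted to floats. The sketch below reproduces this on hypothetical frames (`left` and `ages`), and adds `indicator=True`, an optional `pd.merge` argument that labels where each row came from.

```python
import pandas as pd

left = pd.DataFrame({
    'identifier': [0, 1, 2],
    'name': ['John Smith', 'Mary Johnson', 'William Brown'],
})
# identifier 2 is deliberately left out to mimic a customer missing from the age file
ages = pd.DataFrame({'identifier': [0, 1], 'age': [54, 23]})

# how='left' keeps every row of `left`; indicator=True adds a `_merge` column
merged = pd.merge(left, ages, on='identifier', how='left', indicator=True)
print(merged)
print(merged['age'].dtype)
```

Rows marked `left_only` in the `_merge` column are the customers with no age on file, so filtering on that column is one way to list them.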


Well done! Why don’t you compare your answers with the [solution](https://colab.research.google.com/github/OpenClassrooms-Student-Center/en-8253136-Use-Python-Libraries-for-Data-Science/blob/main/notebooks/P2/P2C5%20-%20Merge%20Data%20Using%20Pandas%20-%20CORRECTION.ipynb) now?