# Lab 07: Data Cleaning & Data Transformation

In this lab, you will:

- Import "raw" data and clean the imported data
- Transform the data

**Deadline: 11:59 PM Tuesday 10/26/2021**

***

## 1. Importing Data

In this lab, we will investigate the U.S. Small Business Administration (SBA) loan dataset. The dataset details can be found on [Kaggle](https://www.kaggle.com/mirbektoktogaraev/should-this-loan-be-approved-or-denied).

- Make sure you take a close look at the list of variables!
- You will need to download the dataset to your local computer, then use the local file to load the data into Python.
- **DO NOT STORE THE DATA ANYWHERE INSIDE YOUR NETID FOLDER!!!**
    - This dataset is too big for GitHub and you won't be able to push to the remote repo.

Import the data to a DataFrame named `sba_loan`.

In [5]:
import pandas as pd
sba_loan = pd.read_csv('SBAnational.csv')
sba_loan.head(3)


Unnamed: 0,LoanNr_ChkDgt,Name,City,State,Zip,Bank,BankState,NAICS,ApprovalDate,ApprovalFY,...,RevLineCr,LowDoc,ChgOffDate,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,1000014003,ABC HOBBYCRAFT,EVANSVILLE,IN,47711,FIFTH THIRD BANK,OH,451120,28-Feb-97,1997,...,N,Y,,28-Feb-99,"$60,000.00",$0.00,P I F,$0.00,"$60,000.00","$48,000.00"
1,1000024006,LANDMARK BAR & GRILLE (THE),NEW PARIS,IN,46526,1ST SOURCE BANK,IN,722410,28-Feb-97,1997,...,N,Y,,31-May-97,"$40,000.00",$0.00,P I F,$0.00,"$40,000.00","$32,000.00"
2,1000034009,"WHITLOCK DDS, TODD M.",BLOOMINGTON,IN,47401,GRANT COUNTY STATE BANK,IN,621210,28-Feb-97,1997,...,N,N,,31-Dec-97,"$287,000.00",$0.00,P I F,$0.00,"$287,000.00","$215,250.00"
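When a column mixes numbers and text, pandas may warn about mixed types while it infers the data type. One possible way around this, sketched below, is to declare the data type of such a column up front. The miniature CSV is made up for illustration (including the `1976A` value), and treating `ApprovalFY` as the mixed column is an assumption.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for SBAnational.csv; all values are made up
raw_csv = io.StringIO(
    "LoanNr_ChkDgt,ApprovalFY,GrAppv\n"
    '1,1997,"$60,000.00"\n'
    '2,1976A,"$600,000.00"\n'
)

# Declaring the dtype keeps every ApprovalFY entry as text instead of
# letting pandas guess between integers and strings
sample = pd.read_csv(raw_csv, dtype={"ApprovalFY": str})
print(sample.dtypes)
```

With the real file, the same `dtype` argument could be passed to the `pd.read_csv('SBAnational.csv')` call.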


### Exercise 1.1

Take a look at the data type of each column and compare it with the corresponding variable description. For each column, decide whether you need to modify the data type or keep it as it is. Include your reasoning.

*Note: There are many possible correct answers here. Your reasoning is what matters.*

#### Your Answer:

*Please use bullet points!*


In [11]:
sba_loan.dtypes

LoanNr_ChkDgt          int64
Name                  object
City                  object
State                 object
Zip                    int64
Bank                  object
BankState             object
NAICS                  int64
ApprovalDate          object
ApprovalFY            object
Term                   int64
NoEmp                  int64
NewExist             float64
CreateJob              int64
RetainedJob            int64
FranchiseCode          int64
UrbanRural             int64
RevLineCr             object
LowDoc                object
ChgOffDate            object
DisbursementDate      object
DisbursementGross     object
BalanceGross          object
MIS_Status            object
ChgOffPrinGr          object
GrAppv                object
SBA_Appv              object
dtype: object

- LoanNr_ChkDgt: Keep it as int64, since it contains only digits and serves as the primary key.
- Name: Keep it as object, since it is free text with no clear set of categories.
- City: Keep it as object, since it is free text with no clear set of categories.
- State: Keep it as object, since it is text with no clear set of categories.
- Zip: Keep it as int64, since it contains only integers.
- Bank: Keep it as object, since it is free text with no clear set of categories.
- BankState: Keep it as object, since it is text with no clear set of categories.
- NAICS: Keep it as int64, since it contains only integers.
- ApprovalDate: Modify it to datetime, since it contains dates.
- ApprovalFY: Keep it as object, since pandas read it as text rather than as integers and it has no clear set of categories.
- Term: Keep it as int64, since it contains only integers.
- NoEmp: Keep it as int64, since it contains only integers.
- NewExist: Keep it as float64. The values are whole-number codes, but the column has missing values, which an int64 column cannot hold.
- CreateJob: Keep it as int64, since it contains only integers.
- RetainedJob: Keep it as int64, since it contains only integers.
- FranchiseCode: Keep it as int64, since it contains only integers.
- UrbanRural: Keep it as int64, since it contains only integers.
- RevLineCr: Modify it to category, since the variable description defines only Y (Yes) and N (No), which is a clear set of categories.
- LowDoc: Modify it to category, since the variable description defines only Y (Yes) and N (No), which is a clear set of categories.
- ChgOffDate: Modify it to datetime, since it contains dates.
- DisbursementDate: Modify it to datetime, since it contains dates.
- DisbursementGross: Modify it to float64, since numeric values are easier to compare and every entry can be converted to a float by removing the `$` and the thousands separators.
- BalanceGross: Modify it to float64, since numeric values are easier to compare and every entry can be converted to a float by removing the `$` and the thousands separators.
- MIS_Status: Modify it to category, since it contains only PIF (paid in full) and CHGOFF (charged off), which is a clear set of categories.
- ChgOffPrinGr: Modify it to float64, since numeric values are easier to compare and every entry can be converted to a float by removing the `$` and the thousands separators.
- GrAppv: Modify it to float64, since numeric values are easier to compare and every entry can be converted to a float by removing the `$` and the thousands separators.
- SBA_Appv: Modify it to float64, since numeric values are easier to compare and every entry can be converted to a float by removing the `$` and the thousands separators.
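One way to back up these judgments is to count the distinct values in each column: few distinct values relative to the number of rows hints at a categorical column. The sketch below uses a made-up toy frame (the column names mirror the lab data, the `Name` values are invented). It also illustrates one alternative view on `Zip`: stored as integers, zip codes lose their leading zeros, so text is another defensible choice.

```python
import pandas as pd

# Hypothetical toy frame; only the column names come from the lab data
toy = pd.DataFrame({
    "LowDoc": ["Y", "N", "N", "Y"],
    "Name": ["A CO", "B INC", "C LLC", "D LTD"],
    "Zip": [47711, 46526, 6062, 7083],
})

# Share of distinct values per column: a low share suggests a category
distinct_ratio = toy.nunique() / len(toy)
print(distinct_ratio)

# Zip codes are identifiers, not quantities; formatting them as
# five-character text restores the leading zeros dropped by int64
zip_as_text = toy["Zip"].map("{:05d}".format)
print(zip_as_text)
```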


### Exercise 1.2

Based on your answer to Exercise 1.1, perform any needed modifications.

In [24]:
# column ApprovalDate
sba_loan['ApprovalDate'] = pd.to_datetime(sba_loan['ApprovalDate'])

In [29]:
# column RevLineCr
sba_loan['RevLineCr'] = sba_loan['RevLineCr'].astype('category')

In [31]:
# column LowDoc
sba_loan['LowDoc'] = sba_loan['LowDoc'].astype('category')

In [32]:
# column ChgOffDate
sba_loan['ChgOffDate'] = pd.to_datetime(sba_loan['ChgOffDate'])

In [33]:
# column DisbursementDate
sba_loan['DisbursementDate'] = pd.to_datetime(sba_loan['DisbursementDate'])

In [37]:
# column DisbursementGross
sba_loan['DisbursementGross'] = sba_loan['DisbursementGross'].replace(r'[\$,]', '', regex=True).astype('float64')

In [38]:
# column BalanceGross
sba_loan['BalanceGross'] = sba_loan['BalanceGross'].replace(r'[\$,]', '', regex=True).astype('float64')

In [39]:
# column MIS_Status
sba_loan['MIS_Status'] = sba_loan['MIS_Status'].astype('category')

In [40]:
# column ChgOffPrinGr
sba_loan['ChgOffPrinGr'] = sba_loan['ChgOffPrinGr'].replace(r'[\$,]', '', regex=True).astype('float64')

In [41]:
# column GrAppv
sba_loan['GrAppv'] = sba_loan['GrAppv'].replace(r'[\$,]', '', regex=True).astype('float64')

In [45]:
# column SBA_Appv
sba_loan['SBA_Appv'] = sba_loan['SBA_Appv'].replace(r'[\$,]', '', regex=True).astype('float64')

In [46]:
sba_loan.dtypes

LoanNr_ChkDgt                 int64
Name                         object
City                         object
State                        object
Zip                           int64
Bank                         object
BankState                    object
NAICS                         int64
ApprovalDate         datetime64[ns]
ApprovalFY                   object
Term                          int64
NoEmp                         int64
NewExist                    float64
CreateJob                     int64
RetainedJob                   int64
FranchiseCode                 int64
UrbanRural                    int64
RevLineCr                  category
LowDoc                     category
ChgOffDate           datetime64[ns]
DisbursementDate     datetime64[ns]
DisbursementGross           float64
BalanceGross                float64
MIS_Status                 category
ChgOffPrinGr                float64
GrAppv                      float64
SBA_Appv                    float64
dtype: object

### Exercise 1.3

Identify the columns with missing data.

In [52]:
sba_loan.isnull().any(axis=0)

LoanNr_ChkDgt        False
Name                  True
City                  True
State                 True
Zip                  False
Bank                  True
BankState             True
NAICS                False
ApprovalDate         False
ApprovalFY           False
Term                 False
NoEmp                False
NewExist              True
CreateJob            False
RetainedJob          False
FranchiseCode        False
UrbanRural           False
RevLineCr             True
LowDoc                True
ChgOffDate            True
DisbursementDate      True
DisbursementGross    False
BalanceGross         False
MIS_Status            True
ChgOffPrinGr         False
GrAppv               False
SBA_Appv             False
dtype: bool

In [53]:
sba_loan.columns[sba_loan.isnull().any()]

Index(['Name', 'City', 'State', 'Bank', 'BankState', 'NewExist', 'RevLineCr',
       'LowDoc', 'ChgOffDate', 'DisbursementDate', 'MIS_Status'],
      dtype='object')
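Beyond a yes/no answer per column, it can help to see how much is missing. The sketch below counts missing values per column and computes their share, using a small made-up frame (the column names come from the lab data, the rows are illustrative).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few deliberately missing entries
toy = pd.DataFrame({
    "Name": ["ABC HOBBYCRAFT", None, "WEAVER PRODUCTS"],
    "NewExist": [2.0, np.nan, 1.0],
    "Term": [84, 60, 180],
})

# Number of missing values per column, keeping only affected columns
missing_counts = toy.isnull().sum()
missing_only = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_only)

# Fraction of rows that are missing in each column
missing_share = toy.isnull().mean()
print(missing_share)
```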

***

## 2. Data Transformation

### Exercise 2.1

If you have not already removed the dollar sign and changed the data type to `int64` or `float64` for the following columns in Exercise 1.2, please do so here!

- `DisbursementGross`
- `BalanceGross`
- `ChgOffPrinGr`
- `SBA_Appv`

In [None]:
# your code here
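As one possible approach, the conversion can be wrapped in a small helper so the same steps apply to each currency column. The helper name `currency_to_float` is hypothetical, and the sample values are illustrative strings in the same format as the lab data.

```python
import pandas as pd

def currency_to_float(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' from currency strings and convert to float64."""
    # Raw string so that the backslash reaches the regex engine unchanged
    cleaned = series.replace(r"[\$,]", "", regex=True)
    return cleaned.astype("float64")

# Illustrative values in the same "$60,000.00" format as the raw columns
amounts = pd.Series(["$60,000.00", "$0.00", "$287,000.00"], name="DisbursementGross")
converted = currency_to_float(amounts)
print(converted)
```

On the real data, the helper could be applied to each listed column in turn, for example `sba_loan['SBA_Appv'] = currency_to_float(sba_loan['SBA_Appv'])`.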


### Exercise 2.2

Create a new column named `Industry`. Use the first 2 digits of NAICS (if provided) to identify the industry for each observation.

For example, the first observation's NAICS is 451120. The first two digits are 45, so the value for this observation in the `Industry` column is `Retail trade` (see the table provided on Kaggle).

**Note: The use of loops is not allowed in this exercise. Vectorized string functions such as `str.func_name()` are not allowed either (they have not been covered in this course yet; we will learn them in future weeks).**

In [84]:
string = "11 Agriculture, forestry, fishing and hunting.\
21 Mining, quarrying, and oil and gas extraction.\
22 Utilities.\
23 Construction.\
31 Manufacturing.\
32 Manufacturing.\
33 Manufacturing.\
42 Wholesale trade.\
44 Retail trade.\
45 Retail trade.\
48 Transportation and warehousing.\
49 Transportation and warehousing.\
51 Information.\
52 Finance and insurance.\
53 Real estate and rental and leasing.\
54 Professional, scientific, and technical services.\
55 Management of companies and enterprises.\
56 Administrative and support and waste management and remediation services.\
61 Educational services.\
62 Health care and social assistance.\
71 Arts, entertainment, and recreation.\
72 Accommodation and food services.\
81 Other services (except public administration).\
92 Public administration"

In [94]:
dic = dict([item.split(' ', 1) for item in string.split('.')])

In [95]:
dic

{'11': 'Agriculture, forestry, fishing and hunting',
 '21': 'Mining, quarrying, and oil and gas extraction',
 '22': 'Utilities',
 '23': 'Construction',
 '31': 'Manufacturing',
 '32': 'Manufacturing',
 '33': 'Manufacturing',
 '42': 'Wholesale trade',
 '44': 'Retail trade',
 '45': 'Retail trade',
 '48': 'Transportation and warehousing',
 '49': 'Transportation and warehousing',
 '51': 'Information',
 '52': 'Finance and insurance',
 '53': 'Real estate and rental and leasing',
 '54': 'Professional, scientific, and technical services',
 '55': 'Management of companies and enterprises',
 '56': 'Administrative and support and waste management and remediation services',
 '61': 'Educational services',
 '62': 'Health care and social assistance',
 '71': 'Arts, entertainment, and recreation',
 '72': 'Accommodation and food services',
 '81': 'Other services (except public administration)',
 '92': 'Public administration'}

In [91]:
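# your code here
columns = list(sba_loan.columns)

def f(x):
    # Look up the industry from the first two digits of the NAICS code
    try:
        x['Industry'] = dic[str(x['NAICS'])[:2]]
    except KeyError:
        # No matching industry (e.g. NAICS is 0): leave Industry unset (NaN)
        pass
    return x

sba_loan = sba_loan.apply(f, axis=1)
# Restore the original column order, with Industry as the last column
sba_loan = sba_loan[columns + ['Industry']]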

In [92]:
sba_loan

Unnamed: 0,LoanNr_ChkDgt,Name,City,State,Zip,Bank,BankState,NAICS,ApprovalDate,ApprovalFY,...,LowDoc,ChgOffDate,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv,Industry
0,1000014003,ABC HOBBYCRAFT,EVANSVILLE,IN,47711,FIFTH THIRD BANK,OH,451120,1997-02-28,1997,...,Y,NaT,1999-02-28,60000.0,0.0,P I F,0.0,60000.0,48000.0,Retail trade
1,1000024006,LANDMARK BAR & GRILLE (THE),NEW PARIS,IN,46526,1ST SOURCE BANK,IN,722410,1997-02-28,1997,...,Y,NaT,1997-05-31,40000.0,0.0,P I F,0.0,40000.0,32000.0,Accommodation and food services
2,1000034009,"WHITLOCK DDS, TODD M.",BLOOMINGTON,IN,47401,GRANT COUNTY STATE BANK,IN,621210,1997-02-28,1997,...,N,NaT,1997-12-31,287000.0,0.0,P I F,0.0,287000.0,215250.0,Health care and social assistance
3,1000044001,"BIG BUCKS PAWN & JEWELRY, LLC",BROKEN ARROW,OK,74012,1ST NATL BK & TR CO OF BROKEN,OK,0,1997-02-28,1997,...,Y,NaT,1997-06-30,35000.0,0.0,P I F,0.0,35000.0,28000.0,
4,1000054004,"ANASTASIA CONFECTIONS, INC.",ORLANDO,FL,32801,FLORIDA BUS. DEVEL CORP,FL,0,1997-02-28,1997,...,N,NaT,1997-05-14,229000.0,0.0,P I F,0.0,229000.0,229000.0,
5,1000084002,"B&T SCREW MACHINE COMPANY, INC",PLAINVILLE,CT,6062,"TD BANK, NATIONAL ASSOCIATION",DE,332721,1997-02-28,1997,...,N,NaT,1997-06-30,517000.0,0.0,P I F,0.0,517000.0,387750.0,Manufacturing
6,1000093009,MIDDLE ATLANTIC SPORTS CO INC,UNION,NJ,7083,WELLS FARGO BANK NATL ASSOC,SD,0,1980-06-02,1980,...,N,1991-06-24,1980-07-22,600000.0,0.0,CHGOFF,208959.0,600000.0,499998.0,
7,1000094005,WEAVER PRODUCTS,SUMMERFIELD,FL,34491,REGIONS BANK,AL,811118,1997-02-28,1997,...,Y,NaT,1998-06-30,45000.0,0.0,P I F,0.0,45000.0,36000.0,Other services (except public administration)
8,1000104006,TURTLE BEACH INN,PORT SAINT JOE,FL,32456,CENTENNIAL BANK,FL,721310,1997-02-28,1997,...,N,NaT,1997-07-31,305000.0,0.0,P I F,0.0,305000.0,228750.0,Accommodation and food services
9,1000124001,INTEXT BUILDING SYS LLC,GLASTONBURY,CT,6073,WEBSTER BANK NATL ASSOC,CT,0,1997-02-28,1997,...,Y,NaT,1997-04-30,70000.0,0.0,P I F,0.0,70000.0,56000.0,
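Row-wise `apply` calls a Python function once per row, which can be slow on a dataset of this size. A possible alternative that avoids both row-level loops and `str` accessors is sketched below: integer division extracts the first two digits, and `map` looks them up. This assumes every non-zero NAICS code has exactly six digits. `dic_sample` is an abbreviated, hypothetical stand-in for the full `dic`, and the toy frame is made up.

```python
import pandas as pd

# Abbreviated stand-in for the full `dic` built above (string keys)
dic_sample = {
    "33": "Manufacturing",
    "45": "Retail trade",
    "72": "Accommodation and food services",
}
# Convert the keys to integers so they match the integer prefixes below
industry_by_prefix = {int(code): name for code, name in dic_sample.items()}

toy = pd.DataFrame({"NAICS": [451120, 722410, 0, 332721]})

# Assumption: non-zero codes have six digits, so // 10_000 keeps the
# first two; prefixes without a match (such as 0) become NaN
toy["Industry"] = (toy["NAICS"] // 10_000).map(industry_by_prefix)
print(toy)
```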


### Exercise 2.3 (bonus)

If you have not changed the data type of the `ApprovalDate` and the `ChgOffDate` columns to `datetime64` in Exercise 1.2, then please do so! The information provided [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) might be helpful!

In [93]:
# your code here
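A minimal sketch of the conversion with an explicit format string follows. It assumes the dates look like `28-Feb-97` (day, abbreviated English month name, two-digit year) and uses made-up sample values. Stating the format avoids guessing, but note that two-digit years 00 to 68 are read as 2000 to 2068, so any dates before 1969 would need separate handling.

```python
import pandas as pd

# Illustrative values in the assumed "DD-Mon-YY" layout, plus a missing entry
dates = pd.Series(["28-Feb-97", "02-Jun-80", None], name="ApprovalDate")

# %d = day of month, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(dates, format="%d-%b-%y")
print(parsed)
```

Missing entries come out as `NaT`, which matches the `ChgOffDate` values shown for loans that were never charged off.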

***

## Submit Your Work

You're almost done -- congratulations!

You need to do a few more things:

1. Save your work. To do this, create a **notebook checkpoint** by using the menu within the notebook to go **File -> Save and Checkpoint**.

2. Click **File** -> **Close and Halt** to close this notebook.

3. Click **Logout** on Jupyter to return your terminal to the command prompt.

4. Follow the assignment instructions to submit this assignment.