# A mini Machine Learning Project exploring feature and model selection

This notebook:
- Loads the data
- Performs the necessary data cleaning and wrangling
- Exports the clean dataset to a new file named "data_clean.csv" in the data folder.

## 1. Preprocessing:

- The data can be downloaded from the **UCI** Machine Learning Repository via this [link](http://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data#).
- After downloading and unzipping the archive, use only the 3year.arff file as the data file.
- This file contains financial rates from the 3rd year of the forecasting period and a corresponding class label that indicates bankruptcy status after 3 years. It contains 10503 instances (financial statements): 495 represent bankrupted companies and 10008 represent firms that did not go bankrupt in the forecasting period.
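As a quick sanity check on the counts quoted above, the class balance can be worked out directly. This is a minimal sketch that uses only the numbers stated in the text; the variable names are illustrative.

```python
# Counts taken from the dataset description above
n_bankrupt = 495
n_not_bankrupt = 10008

total = n_bankrupt + n_not_bankrupt
bankrupt_share = n_bankrupt / total

# The two classes should add up to the stated number of instances
print(total)                      # 10503
print(f"{bankrupt_share:.1%}")    # 4.7%
```

Fewer than 5% of the instances are bankrupt companies, so the label is strongly imbalanced.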

In [1]:
# Load all required libraries
from scipy.io import arff
import numpy as np
import pandas as pd


In [2]:
# Importing data
data = arff.loadarff('../data/3year.arff')
df = pd.DataFrame(data[0])
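To illustrate what `arff.loadarff` returns, here is a small sketch that parses a made-up, in-memory ARFF document instead of the real file. The toy relation, its two attributes, and its two rows are assumptions for illustration only. The loader returns a `(records, metadata)` tuple, which is why the cell above passes `data[0]` to `pd.DataFrame`.

```python
from io import StringIO

import numpy as np
import pandas as pd
from scipy.io import arff

# Hypothetical two-row ARFF content, not taken from 3year.arff
toy_arff = """@relation toy
@attribute Attr1 numeric
@attribute class {0,1}
@data
0.5,0
?,1
"""

# loadarff accepts a file-like object and returns (records, metadata)
records, meta = arff.loadarff(StringIO(toy_arff))
toy_df = pd.DataFrame(records)

# Nominal attributes come back as bytes, and '?' becomes NaN
print(toy_df["class"].tolist())
print(toy_df["Attr1"].isna().sum())
```

The nominal `class` values arrive as byte strings such as `b'0'`, and missing numeric entries marked `?` are read as NaN.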

## 2. Data Wrangling

### Replacing the *class* feature with a binary label
> The last column contains the label of each company, indicating whether it went bankrupt in the forecasting period or not. I want to transform it into a binary integer label, which is easier to handle than character labels.

The new label will be:
- 0: represents firms that did not go bankrupt in the forecasting period
- 1: represents bankrupted companies

In [3]:
df['bankrupt'] = df['class'].map({b'0':0, b'1':1})
# After replacing, remove the unused feature
df.drop('class',axis=1,inplace=True)
df.head()

| | Attr1 | Attr2 | Attr3 | Attr4 | Attr5 | Attr6 | Attr7 | Attr8 | Attr9 | Attr10 | ... | Attr56 | Attr57 | Attr58 | Attr59 | Attr60 | Attr61 | Attr62 | Attr63 | Attr64 | bankrupt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.17419 | 0.41299 | 0.14371 | 1.348 | -28.982 | 0.60383 | 0.21946 | 1.1225 | 1.1961 | 0.46359 | ... | 0.16396 | 0.37574 | 0.83604 | 7e-06 | 9.7145 | 6.2813 | 84.291 | 4.3303 | 4.0341 | 0 |
| 1 | 0.14624 | 0.46038 | 0.2823 | 1.6294 | 2.5952 | 0.0 | 0.17185 | 1.1721 | 1.6018 | 0.53962 | ... | 0.027516 | 0.271 | 0.90108 | 0.0 | 5.9882 | 4.1103 | 102.19 | 3.5716 | 5.95 | 0 |
| 2 | 0.000595 | 0.22612 | 0.48839 | 3.1599 | 84.874 | 0.19114 | 0.004572 | 2.9881 | 1.0077 | 0.67566 | ... | 0.007639 | 0.000881 | 0.99236 | 0.0 | 6.7742 | 3.7922 | 64.846 | 5.6287 | 4.4581 | 0 |
| 3 | 0.024526 | 0.43236 | 0.27546 | 1.7833 | -10.105 | 0.56944 | 0.024526 | 1.3057 | 1.0509 | 0.56453 | ... | 0.048398 | 0.043445 | 0.9516 | 0.14298 | 4.2286 | 5.0528 | 98.783 | 3.695 | 3.4844 | 0 |
| 4 | 0.18829 | 0.41504 | 0.34231 | 1.9279 | -58.274 | 0.0 | 0.23358 | 1.4094 | 1.3393 | 0.58496 | ... | 0.17648 | 0.32188 | 0.82635 | 0.073039 | 2.5912 | 7.0756 | 100.54 | 3.6303 | 4.6375 | 0 |
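The label mapping in cell `In [3]` uses byte-string keys because the ARFF loader returns nominal values as bytes. The sketch below repeats that mapping on a small made-up Series and adds a check that no label was left unmapped. The toy values and the names `raw_labels` and `label_map` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical byte labels, shaped like the 'class' column after loading
raw_labels = pd.Series([b"0", b"1", b"0", b"0"], name="class")

# Same mapping as in the notebook: bytes -> integer label
label_map = {b"0": 0, b"1": 1}
bankrupt = raw_labels.map(label_map)

# Any value missing from label_map would become NaN, so check before casting
n_unmapped = int(bankrupt.isna().sum())
bankrupt = bankrupt.astype(int)

print(bankrupt.tolist())  # [0, 1, 0, 0]
print(n_unmapped)         # 0
```

Checking for NaN after `map` catches labels missing from the dictionary, for example plain strings like `'0'` where bytes were expected.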


In [4]:
# Excluding Attr37, which has 4736 missing values (about 45% of the 10503 rows)
df.drop('Attr37',axis=1,inplace=True)
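One way to find columns such as `Attr37` is to compute the share of missing values per column and drop those above a cutoff. The following is a minimal sketch on made-up data. The 0.4 threshold and the toy values are assumptions, not part of the original analysis, which drops `Attr37` by name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names echo the real dataset
toy = pd.DataFrame({
    "Attr1": [0.1, 0.2, np.nan, 0.4],
    "Attr37": [np.nan, np.nan, 1.5, np.nan],
    "bankrupt": [0, 0, 1, 0],
})

# Fraction of missing values in each column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Assumed cutoff: drop columns with more than 40% missing values
threshold = 0.4
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)

print(missing_share)
print(to_drop)  # ['Attr37']
```

Dropping by threshold keeps the rule explicit and reusable if the other yearly files are processed the same way.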

In [10]:
# Writing the new dataset into a new file
df.to_csv("../data/data_clean.csv",index=False)