
# Introduction

This notebook aims to give good insight into the data for the Porto Seguro competition. Besides that, it gives some tips and tricks to prepare your data for modeling.

1. Visual inspection of your data
2. Defining the metadata
3. Descriptive statistics
4. Handling imbalanced classes
5. Data quality checks
6. Exploratory data visualization
7. Feature engineering
8. Feature selection
9. Feature scaling

# Loading packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns', 100)

# Loading data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
train = pd.read_csv('/content/drive/My Drive/porto/train.csv')
test = pd.read_csv('/content/drive/My Drive/porto/test.csv')

# Data at first sight

Here is an excerpt of the data description for the competition:

* Features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc).
* Feature names include the postfix bin to indicate binary features and cat to indicate categorical features.
* Features without these designations are either continuous or ordinal.
* Values of -1 indicate that the feature was missing from the observation.
* The target column signifies whether or not a claim was filed for that policy holder.
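
The naming convention above can be used programmatically. The following is a minimal sketch, assuming a short list of made-up column names that follow the same pattern; `describe_feature` is a hypothetical helper, not part of the competition material.

```python
import pandas as pd

# Hypothetical column names that follow the competition's naming pattern
cols = ['id', 'target', 'ps_ind_01', 'ps_ind_02_cat', 'ps_ind_06_bin',
        'ps_reg_03', 'ps_car_01_cat', 'ps_calc_15_bin']

def describe_feature(name):
    """Split a feature name into its group tag and its type postfix."""
    parts = name.split('_')
    if len(parts) < 3:                # 'id' and 'target' carry no group tag
        return None, None
    group = parts[1]                  # ind, reg, car or calc
    kind = parts[-1] if parts[-1] in ('bin', 'cat') else 'continuous/ordinal'
    return group, kind

# One row per column with its group and type
summary = pd.DataFrame(
    [(c, *describe_feature(c)) for c in cols],
    columns=['varname', 'group', 'kind'],
)
print(summary)
```

The same function could be applied to `train.columns` to count features per group and type.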

In [None]:
train.head()

In [None]:
train.tail()

We indeed see the following:

* binary variables
* categorical variables of which the category values are integers
* other variables with integer or float values
* variables with -1 representing missing values
* the target variable and an ID variable

In [None]:
train.shape

In [None]:
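# drop_duplicates() returns a new DataFrame, so assign the result back;
# an unchanged shape means there were no duplicate rows
train = train.drop_duplicates()
train.shape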

In [None]:
test.shape

So later on we can create dummy variables for the 14 categorical variables. The *bin* variables are already binary and do not need dummification.
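
As an illustration of what dummification could look like, here is a minimal sketch on a toy DataFrame with made-up values. Using `pd.get_dummies` with `drop_first=True` is one possible choice, assumed here rather than taken from the notebook.

```python
import pandas as pd

# Toy data (made-up values) with two categorical columns and one binary column
toy = pd.DataFrame({
    'ps_car_02_cat': [0, 1, 1, 0],
    'ps_car_09_cat': [0, 2, 1, 2],
    'ps_ind_06_bin': [1, 0, 0, 1],
})

# Select the categorical columns by their postfix
cat_cols = [c for c in toy.columns if c.endswith('_cat')]

# One dummy column per category level; drop_first avoids a redundant column
dummies = pd.get_dummies(toy, columns=cat_cols, drop_first=True)
print(dummies.columns.tolist())
```

The binary column passes through unchanged, while each categorical column is replaced by one indicator column per remaining level.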

In [None]:
train.info()

Again, with the info() method we see that the data type is integer or float. No null values are present in the data set. That's normal because missing values are replaced by -1. We'll look into that later.
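
As a quick preview of that check, the sketch below shows why `info()` reports no nulls and how the -1 sentinel can be counted instead. It assumes a toy DataFrame with made-up values; the same expressions could be run on `train`.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values); -1 marks a missing value as in the competition files
toy = pd.DataFrame({
    'ps_reg_03': [0.71, -1.0, 0.58, -1.0],
    'ps_car_11': [2, 3, -1, 1],
    'ps_ind_06_bin': [0, 1, 0, 0],
})

# isnull() finds nothing because -1 is an ordinary number to pandas
nulls = toy.isnull().sum()

# Counting the -1 sentinel reveals the real number of missing values
missing = (toy == -1).sum()
print(missing[missing > 0])

# Optionally convert the sentinel to NaN so pandas treats it as missing
toy_nan = toy.replace(-1, np.nan)
```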

# Metadata

To facilitate the data management, we'll store meta-information about the variables in a DataFrame. This will be helpful when we want to select specific variables for analysis, visualization, modeling, and so on.

Concretely we will store:
* role : input, ID, target
* level : nominal, interval, ordinal, binary
* keep : True, False
* dtype : int, float, str

In [None]:
data = []

for f in train.columns:
  # Defining the role
  if f == 'target':
    role = 'target'
  elif f == 'id':
    role = 'id'
  else:
    role = 'input'

  # Defining the level
  if 'bin' in f or f == 'target':
    level = 'binary'
  elif 'cat' in f or f == 'id':
    level = 'nominal'
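  # Use the pandas dtype helpers; comparing against the built-in int can
  # fail on platforms where it does not map to int64
  elif pd.api.types.is_float_dtype(train[f]):
    level = 'interval'
  elif pd.api.types.is_integer_dtype(train[f]):
    level = 'ordinal'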
  
  # Initialize keep to True for all variables except for id
  keep = True
  if f == 'id':
    keep = False
  
  # Defining the data type
  dtype = train[f].dtype

  # Creating a Dict that contains all the metadata for the variable
  f_dict = {
      'varname' : f,
      'role' : role,
      'level' : level,
      'keep' : keep,
      'dtype' : dtype
  }
  data.append(f_dict)

meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)

In [None]:
meta

Example to extract all nominal variables that are not dropped:

In [None]:
meta[(meta.level == 'nominal') & (meta.keep)].index
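
If this kind of selection is needed often, it could be wrapped in a small function. The sketch below assumes a toy metadata table with the same layout as `meta` (illustrative values only); `vars_by_level` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy metadata table with the same layout as `meta` (values are illustrative)
toy_meta = pd.DataFrame({
    'varname': ['id', 'ps_ind_02_cat', 'ps_reg_03', 'ps_car_11'],
    'level': ['nominal', 'nominal', 'interval', 'ordinal'],
    'keep': [False, True, True, True],
}).set_index('varname')

def vars_by_level(meta_df, level):
    """Return the names of kept variables with the given level (hypothetical helper)."""
    mask = (meta_df['level'] == level) & (meta_df['keep'])
    return meta_df[mask].index.tolist()

print(vars_by_level(toy_meta, 'nominal'))
```

Because `id` has `keep` set to False, it is excluded even though its level is nominal.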

Below, the number of variables per role and level is displayed.

In [None]:
pd.DataFrame({'count':meta.groupby(['role', 'level'])['role'].size()}).reset_index()

# Descriptive statistics

We can also apply the describe method on the dataframe. However, it doesn't make much sense to calculate the mean, std, ... on categorical variables and the id variable. We'll explore the categorical variables visually later.

Thanks to our meta file we can easily select the variables on which we want to compute the descriptive statistics. To keep things clear, we'll do this per data type.

## Interval variables

In [None]:
v = meta[(meta.level == 'interval') & (meta.keep)].index
train[v].describe()

### reg variables
* only ps_reg_03 has missing values
* the range (min to max) differs between the variables. We could apply scaling (e.g. StandardScaler), but it depends on the classifier we will want to use.

### car variables
* ps_car_12 and ps_car_14 have missing values
* again, the range differs and we could apply scaling.

### calc variables
* no missing values
* this seems to be some kind of ratio as the maximum is 0.9
* all three _calc variables have very similar distributions

**Overall,** we can see that the range of the interval variables is rather small. Perhaps some transformation (e.g. log) is already applied in order to anonymize the data?
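
To make the scaling remark concrete, here is a minimal sketch of `StandardScaler` applied to two interval-like columns. The toy values are made up, and whether scaling is worthwhile still depends on the classifier chosen later.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy interval data (made-up values) with clearly different ranges
toy = pd.DataFrame({
    'ps_reg_01': [0.1, 0.5, 0.9, 0.3],
    'ps_car_13': [0.6, 1.2, 3.1, 0.8],
})

# StandardScaler subtracts each column's mean and divides by its standard deviation
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

print(scaled.describe().loc[['mean', 'std']])
```

After scaling, both columns are centered at zero with unit variance, so their ranges become comparable. Note that `describe()` reports the sample standard deviation, which differs slightly from the population value the scaler uses.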

## Ordinal variables

In [None]:
v = meta[(meta.level == 'ordinal') & (meta.keep)].index
train[v].describe()

* Only one variable has missing values: ps_car_11
* We could apply scaling to deal with the different ranges

## Binary variables

In [None]:
v = meta[(meta.level == 'binary') & (meta.keep)].index
train[v].describe()

* The a priori probability of a claim (the mean of target) in the train data is 3.645%, so the classes are strongly imbalanced.
* From the means we can conclude that for most variables the value is zero in most cases.
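
The a priori rate can be read off directly as the mean of a 0/1 target. A minimal sketch, assuming a made-up target series with 4 positives out of 100 rows (not the real competition data):

```python
import pandas as pd

# Made-up binary target: 96 negatives and 4 positives
toy_target = pd.Series([0] * 96 + [1] * 4, name='target')

# For a 0/1 variable the mean equals the share of ones
prior = toy_target.mean()

# Class counts and the number of negatives per positive
counts = toy_target.value_counts()
ratio = counts[0] / counts[1]

print(f'prior = {prior:.2%}, negatives per positive = {ratio:.0f}')
```

On the real data the same calculation would be `train['target'].mean()`.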