In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Science, my $0.02 💰

For someone looking to **start** getting involved with data science, jumping straight in can be a bit daunting. 

Most of what you find in articles and blog posts are explanations of very complicated machine learning models, neural networks, deep learning, and many other things I have yet to understand. 

Browsing through the experts' notebooks can be a very enriching experience, but **very overwhelming** at the same time. 

I hope to bring a refreshing (and hopefully insightful) view to my fellow beginners with this attempt at Exploratory Data Analysis, and to show you that many others are in the same boat as you are.

<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana;">
    📌 I would greatly appreciate ANY feedback you might have regarding my code, thought process, approach, etc. 
</div>

# 1. Problem - What are we even trying to do? 🎯

I am not going to lie, I had to read through several discussion posts to understand what data we are dealing with and what we are trying to achieve. Long story short:

**We are trying to predict HOW MUCH loss is associated with a loan default**

In plain English, we want to know how much money a lending entity (bank, credit union, etc.) will lose if one of its customers stops paying their loan. 

Now, I usually like to start with a couple of **hypotheses or predictions** about the data on hand and try to prove them right or wrong. For that I usually reason about the relationships between the features using prior knowledge and common sense (e.g., the amount of exercise is positively correlated with calories burned). In this case, since the features are anonymized, we will have to skip that part and look for insight in the numbers themselves. 

Also, notice that we are trying to predict **how much loss** is associated with a person defaulting. Personally, I more commonly see this kind of financial data paired with a classification question: *Will this person default on their loan?*

Since we are predicting a continuous value and not a category, this is a **Regression** problem.

# 2. Getting Started 👩‍💻

### Let's load our Libraries and Data

The first step in exploring our data is to bring in our "tools". We will use the basic stack:

In [None]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas global settings
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [None]:
# Data
train = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')

# 3. Exploring 🔍
## Now that we have the data...

We want to do some simple exploration of our two main data sets. How many columns and rows do they have? What kind of values do they contain? Are there any missing values? 

That sort of thing...

In [None]:
# Let's start with rows and columns
print('Train: ' , train.shape)
print('Test: ', test.shape)

#### Checkpoint

The train data set has one more column than our test data set. This makes sense since we will be using the test data set to create our competition submission. In other words, we are predicting that missing column. 

Also, A HUNDRED COLUMNS! Wow 
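
If you would rather not eyeball which column is the extra one, a set difference on the column names will tell you. This is just a minimal sketch: `train_toy` and `test_toy` are tiny made-up frames standing in for the real data, so the values mean nothing.

```python
import pandas as pd

# Toy stand-ins for the competition frames (hypothetical data, not the real files)
train_toy = pd.DataFrame({"id": [0, 1], "f0": [0.1, 0.2], "f1": [3, 4], "loss": [5, 0]})
test_toy = pd.DataFrame({"id": [2, 3], "f0": [0.3, 0.4], "f1": [1, 2]})

# Columns present in train but absent from test: this is what we have to predict
extra_columns = sorted(set(train_toy.columns) - set(test_toy.columns))
print(extra_columns)
```

Running the same two lines against the real `train` and `test` frames should point at the target column.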

In [None]:
# Let's take a look at the first few rows in each of our data sets
train.head()

In [None]:
test.head()

#### Perfect, we found something to fix! The id column is just a row identifier, so let's get rid of it in both of the data sets:

In [None]:
# Dropping id column
train.drop(['id'], axis=1, inplace=True)
test.drop(['id'], axis=1, inplace=True)
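
One thing to keep in mind when dropping `id`: the submission file needs those ids again at the very end. A possible alternative, sketched here on a made-up frame (`test_toy` and its values are hypothetical), is to use `pop`, which removes the column and hands it back so you can keep it around.

```python
import pandas as pd

# Toy test frame (hypothetical values; the real ids come from test.csv)
test_toy = pd.DataFrame({"id": [250000, 250001, 250002], "f0": [0.1, 0.2, 0.3]})

# pop() removes the column from the frame and returns it as a Series
test_ids = test_toy.pop("id")

print(test_toy.columns.tolist())  # only the feature columns remain
print(test_ids.tolist())          # the ids are kept for later use
```

With the ids stored like this, there is no need to rebuild them from the row index when writing the submission.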

#### Now, let's check how many missing values we have

In [None]:
# Number of missing values
print(f'Number of Missing Values in Train: {sum(train.isnull().sum())}')
print(f'Number of Missing Values in Test: {sum(test.isnull().sum())}')
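
A single total is enough when the answer is zero, but if it is not, you will want to know where the gaps are. Here is a small sketch on a made-up frame (`toy` is hypothetical data) that breaks the count down per column and also shows the share of missing rows.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in f0 (hypothetical data)
toy = pd.DataFrame({"f0": [1.0, np.nan, 3.0, 4.0], "f1": [1, 2, 3, 4]})

# Count of missing values per column
missing_per_column = toy.isnull().sum()

# Share of missing values per column (the mean of a boolean column is a proportion)
missing_share = toy.isnull().mean()

# Grand total across the whole frame
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(missing_share)
print(total_missing)
```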

#### Finally, let's look at the datatypes

In [None]:
train.dtypes

In [None]:
test.dtypes

#### Datatype summary

We have a combination of integers and floats for our features. It would be interesting to explore the number of unique values in each column and see how it compares to our total number of rows. Maybe this would allow us to treat some of our integer columns as categorical?

Also, our integer columns are f1, f16, f27, f55, f60 and f86.
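
I picked those integer columns out by reading the `dtypes` output, which gets tedious with a hundred features. A minimal sketch of doing it programmatically, using a made-up frame (`toy` and its column values are hypothetical) and the `is_integer_dtype` helper from `pandas.api.types`:

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Toy frame mixing float and integer columns (hypothetical data)
toy = pd.DataFrame({"f0": [0.5, 1.5], "f1": [10, 20], "f2": [0.1, 0.2], "f3": [7, 8]})

# Keep only the columns whose dtype is an integer type
integer_columns = [col for col in toy.columns if is_integer_dtype(toy[col])]
print(integer_columns)
```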

### What about unique values?

Let's take a look at the unique values of our different fields

In [None]:
train.nunique().sort_values(ascending=True)

In [None]:
test.nunique().sort_values(ascending=True)

In both of our data sets, the integer fields have significantly fewer unique values than the float fields. Even so, the smallest count is 284 unique values, which is still too many to treat the field as categorical. 

### Let's compare the number of unique values in our integer fields.

In [None]:
# Creating our list of integer fields 
integer_fields = ['f1', 'f16', 'f27', 'f55', 'f60', 'f86']

# We now create a dataframe with the count of our unique values per field
train_unique = pd.DataFrame(train[integer_fields].nunique())
train_unique = train_unique.reset_index(drop=False)  # turns the index (our field names) into a regular column
train_unique.columns = ['Features', 'Count']

# Do the same for our test data
test_unique = pd.DataFrame(test[integer_fields].nunique())
test_unique = test_unique.reset_index(drop=False)  # turns the index (our field names) into a regular column
test_unique.columns = ['Features', 'Count']


In [None]:
# Creating our plot
sns.set_style("dark")
plt.figure(figsize=(20, 20))  # set the size before drawing; calling this afterwards opens a new, empty figure
plot = sns.barplot(x=train_unique.Features, y=train_unique.Count)

plt.title('Count of Unique Values - Train')
plt.xlabel('Features')
plt.ylabel('Count')
plt.xticks(rotation=30, horizontalalignment="center")

# Annotate each bar with its height
for bar in plot.patches:
    plot.annotate(format(bar.get_height(), '.0f'), 
                  (bar.get_x() + bar.get_width()/2, bar.get_height()), 
                  ha='center', va='center', size=10, xytext=(0, 8), 
                  textcoords='offset points')
plt.show()

In [None]:
# Plot
sns.set_style("dark")
plt.figure(figsize=(20, 20))  # set the size before drawing; calling this afterwards opens a new, empty figure
plot2 = sns.barplot(x=test_unique.Features, y=test_unique.Count)

plt.title('Count of Unique Values - Test')
plt.xlabel('Features')
plt.ylabel('Count')
plt.xticks(rotation=30, horizontalalignment="center")

# Annotate each bar with its height
for bar in plot2.patches:
    plot2.annotate(format(bar.get_height(), '.0f'), 
                  (bar.get_x() + bar.get_width()/2, bar.get_height()), 
                  ha='center', va='center', size=10, xytext=(0, 8), 
                  textcoords='offset points')
plt.show()

Here we can clearly see that we cannot treat these integer features as categoricals because of how many unique values they have. The number of unique values in a feature is referred to as its **Cardinality**. Features with high cardinality don't tend to make the best categorical variables.
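
To put a number on "high cardinality", one option is to divide the unique count by the number of rows. This is only a sketch on a made-up frame (`toy`, `low_card` and `high_card` are hypothetical names), and where you draw the line between categorical and numerical is a judgment call.

```python
import pandas as pd

# Toy frame: one column with few distinct values, one where every value is distinct
toy = pd.DataFrame({
    "low_card": [0, 1, 0, 1, 0, 1, 0, 1],
    "high_card": [10, 11, 12, 13, 14, 15, 16, 17],
})

# Cardinality: number of unique values per column
cardinality = toy.nunique()

# Ratio of unique values to rows; values close to 1 mean almost every row is distinct
cardinality_ratio = cardinality / len(toy)

print(cardinality)
print(cardinality_ratio)
```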

## So, what magic spell will we use to predict? 

As I mentioned earlier, this is a **Regression** problem. That means we will be predicting a number. There are a variety of models that we can apply for this purpose, but some of them require the data to meet specific criteria. 

For example, Linear Regression is often said to require normally distributed data. Strictly speaking, that assumption applies to the model's residuals rather than to the features themselves, but heavily skewed features can still hurt a linear model, so it is worth checking what our distributions look like:

In [None]:
fig, axes = plt.subplots(10,10,figsize=(12, 12))
axes = axes.flatten()

for idx, ax in enumerate(axes):
    sns.kdeplot(data=train, x=f'f{idx}', 
                fill=True, 
                ax=ax)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.spines['left'].set_visible(False)
    ax.set_title(f'f{idx}', loc='right', weight='bold', fontsize=10)

fig.supxlabel('Distribution by feature', ha='center', fontweight='bold')

fig.tight_layout()
plt.show()


Even though the plots are very small, you can see the distribution of each feature:
* We have some **Normally Distributed** features
* Some of the features are **bimodal or trimodal**
* A lot of our features are **skewed**

**Normal Distribution** - When the distribution of the values is a symmetrical, bell-shaped curve

**Multimodal (bimodal, trimodal, etc.)** - When the distribution shows several "peaks" (local maxima)

**Skewed** - When the distribution deviates from the ideal symmetrical bell curve, with a longer tail on the left or on the right. 

So, if we are planning on applying any linear models, we will have to address the distribution issue.
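
Squinting at a hundred tiny plots only gets you so far, so here is a sketch of how skew could be measured and reduced. It uses randomly generated data (`toy`, `symmetric` and `right_skewed` are hypothetical names), and the log transform is just one common option that I am assuming here, not something this notebook actually applies.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data: one symmetric column, one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "right_skewed": rng.exponential(size=5000),
})

# Sample skewness per column: near 0 is symmetric, positive means a long right tail
skew_before = toy.skew()

# log1p computes log(1 + x), which compresses the long right tail of non-negative data
skew_after = np.log1p(toy["right_skewed"]).skew()

print(skew_before)
print(skew_after)
```

The same `skew()` call on the real `train` frame would give one number per feature, which is easier to sort and compare than the grid of plots.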

## What about our prediction value?

I wonder what the distribution of our **loss** field looks like

In [None]:
# Plot
sns.set_style("dark")
plt.figure(figsize=(20, 20))  # create the figure first so the histogram is drawn on it
sns.histplot(data=train.loss)

plt.title('Distribution of Loss')
plt.show()

Like many of our features, our target's distribution is also skewed.

# 4. Conclusion 🔚

To summarize our findings:
* This is a **large** data set, with 100 features
* All of the **features are numerical**, and will be treated as such in our prediction models
    * Our integer features, even though they had a smaller amount of unique values, will not be treated as categoricals
* Most of our features **don't have a normal distribution**
* Our target (`loss`) is **positively skewed**

# EDA Closing Thoughts 💭

Thank you so much for taking the time to read this notebook. I hope you found it somewhat informative and helpful. Please leave a comment if you have any recommendations or thoughts!

# 5. Predictions

## Random Forest Regressor
We saw earlier, when we dove into the features, that most of the data is not normally distributed. Tree-based models split on thresholds and make no assumptions about how the features are distributed, so we will start with them, see their performance, and go from there. 
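
To actually "see their performance" before submitting, one option is to hold back part of the training data and score the model on it. The sketch below runs on randomly generated data (`X_toy` and `y_toy` are hypothetical), and I am assuming RMSE as the metric since it is a common choice for regression; check the competition's evaluation page for the official one.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 5))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=500)

# Hold back 20% of the rows for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# Same model settings as in this notebook
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# RMSE on the held-out rows
rmse = np.sqrt(mean_squared_error(y_valid, model.predict(X_valid)))

# Baseline for comparison: always predict the mean of the training target
baseline_preds = np.full_like(y_valid, y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_valid, baseline_preds))

print(f"Validation RMSE: {rmse:.3f} (baseline: {baseline_rmse:.3f})")
```

Comparing against the mean-prediction baseline is a quick sanity check: if the model cannot beat it, something is off.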

In [None]:
# Separating our matrix of features and prediction vector
X = train.iloc[:, 0:-1].values
y = train.iloc[:, -1].values

In [None]:
# Importing the library and fitting the model
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)

In [None]:
# Creating predictions (the model was fit on a NumPy array, so we pass one here as well)
preds = regressor.predict(test.values)

In [None]:
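# We dropped the id column earlier, so we rebuild the ids from the row index
# (this assumes the test ids are consecutive and start at 250000)
test.index += 250000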

In [None]:
#Creating Submission File
submission = pd.DataFrame({"id": test.index, "loss": preds})
submission.to_csv("submission.csv", index=False)