# Introduction

## Credit Risk Prediction with Financial Data 

This Jupyter Notebook demonstrates how to preprocess a financial dataset from Kaggle and predict credit risk scores. Two machine learning algorithms, Logistic Regression and Decision Trees, are applied to predict the credit risk class. This project is part of my learning process as a novice in ML/Data Science.
Note that the dataset used here must be downloaded separately from Kaggle.

#### Import all necessary libraries

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('Financial_Dataset.csv')  ## load data
df

Sort the dataset by Company_ID and Quarter, because the entries are unordered.

In [None]:
df.sort_values(by=['Company_ID', 'Quarter'], axis=0, ascending=True, inplace=True)

In [None]:
df

Since more than one row can contain the same company for the same quarter, it is advisable to check for duplicates with respect to the columns Company_ID and Quarter.

In [None]:
df.duplicated(['Company_ID', 'Quarter'], keep= False).sum()

In [None]:
df.drop_duplicates(subset=['Company_ID', 'Quarter'], keep='last', inplace = True) ## Now all redundant entries are reduced to the last respective one
df
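To make the two duplicate-handling calls above concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the Kaggle dataset. It shows that `keep=False` counts every row of a duplicate group, while `keep='last'` retains only the final row of each group.

```python
import pandas as pd

# Toy data (hypothetical values): company 1 appears twice for Q1.
toy = pd.DataFrame({
    "Company_ID": [1, 1, 1, 2],
    "Quarter": ["Q1", "Q1", "Q2", "Q1"],
    "Revenue": [100, 110, 120, 90],
})

# keep=False flags every member of a duplicate group, not only the repeats.
n_flagged = toy.duplicated(["Company_ID", "Quarter"], keep=False).sum()

# keep="last" drops all but the last row of each (Company_ID, Quarter) group.
deduped = toy.drop_duplicates(subset=["Company_ID", "Quarter"], keep="last")

print(n_flagged)
print(deduped)
```

Which row counts as "last" depends on the current row order, so sorting before dropping duplicates matters.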

The next few cells give an overview of the data, in particular the number of companies and whether all four quarters are recorded for every company.
We also add the column Quarters_per_Company, which later makes it easier to remove all companies that have only one quarter recorded, since a single entry is not time series data.

In [None]:
number_of_coms = len(df['Company_ID'].unique())
attribute_values = df['Company_ID'].unique()
print(number_of_coms)
print(attribute_values)

In [None]:
df.groupby('Company_ID')['Quarter'].unique()

In [None]:
type(df.groupby('Company_ID')['Quarter'].unique())

In [None]:
n_quarter_per_com = df.groupby('Company_ID')['Quarter'].unique().apply(lambda x: x.size)
n_quarter_per_com

In [None]:
n_quarter_per_com.value_counts()

In [None]:
counts = df.groupby('Company_ID')['Quarter'].count()

In [None]:
df['Quarters_per_Company'] = df['Company_ID'].map(counts)
df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Companies at the maximum debt-to-equity ratio of 4.99 still have different credit risk scores,
# which suggests that Debt_to_Equity alone does not determine the risk score.
# np.isclose avoids an exact floating-point comparison.
df[np.isclose(df['Debt_to_Equity'], 4.99)]['Credit_Risk_Score']

In [None]:
# Convert the object-typed columns to category, the appropriate dtype for variables with a finite set of values
df['Quarter'] = pd.Categorical(df['Quarter'])
df['Stock_Trend'] = pd.Categorical(df['Stock_Trend'])
df['Credit_Risk_Score'] = pd.Categorical(df['Credit_Risk_Score'])
df.info()

In [None]:
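# Label-encode Stock_Trend and the target Credit_Risk_Score as integer codes 0, 1, 2 (this is label encoding, not one-hot encoding).
# cat.codes numbers the categories in sorted (alphabetical) order of their labels, not necessarily in order of risk.
df['Stock_Trend_Code'] = df['Stock_Trend'].cat.codes
df['Credit_Risk_Score_Code'] = df['Credit_Risk_Score'].cat.codes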

In [None]:
df
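The integer codes produced by `cat.codes` follow the order of the categories, which pandas sorts lexically by default. The sketch below assumes the labels are spelled `Low`, `Medium` and `High`; the actual spellings in the dataset may differ. It compares the default codes with codes from an explicitly ordered categorical.

```python
import pandas as pd

# Hypothetical label spellings, for illustration only.
labels = pd.Series(["Low", "High", "Medium", "Low"])

# Default: categories are sorted lexically, i.e. High, Low, Medium.
default_cat = pd.Categorical(labels)
default_codes = default_cat.codes

# Explicit order, so that the codes increase with risk: Low < Medium < High.
ordered_cat = pd.Categorical(labels, categories=["Low", "Medium", "High"], ordered=True)
ordered_codes = ordered_cat.codes

print(list(default_cat.categories), list(default_codes))
print(list(ordered_cat.categories), list(ordered_codes))
```

An explicit order is mainly useful when the codes are interpreted numerically, for instance in a Pearson correlation.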

In the next few steps we collect all numerical variables in one list and compute their correlation with the target variable to decide which of them are of interest for further analysis. We also compute moving averages, moving standard deviations and quarter-to-quarter differences of these variables and add them as columns to the dataframe.

In [None]:
ex_vars = ['Company_ID', 'Quarter', 'Credit_Risk_Score', 'Stock_Trend']
corr_vars = [item for item in df.columns if item not in ex_vars]
corr_vars

In [None]:
df[corr_vars].corr(method='pearson')['Credit_Risk_Score_Code']

In [None]:
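# corr_vars[:8] is assumed to hold the eight original numeric columns (this depends on the column order of the CSV).
# All features are computed per company; the rows were sorted by Company_ID and Quarter above.
# The first std and diff value of each company is NaN and is replaced by 0.
for item in corr_vars[:8]:
    df[f'{item}_moving_average'] = df.groupby('Company_ID')[item].transform(lambda x: x.rolling(window=2, min_periods = 1).mean())
    df[f'{item}_moving_std'] = df.groupby('Company_ID')[item].transform(lambda x: x.rolling(window=3, min_periods = 1).std()).fillna(0)
    df[f'{item}_diff'] = df.groupby('Company_ID')[item].transform(lambda x: x.diff()).fillna(0)
df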
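The per-company rolling features can be checked by hand on a small example. The sketch below uses made-up revenue values for two companies and reproduces the moving average (window of 2) and the difference. It illustrates that the window never reaches across company boundaries.

```python
import pandas as pd

# Toy data (hypothetical values), already sorted by company and time.
toy = pd.DataFrame({
    "Company_ID": [1, 1, 1, 2, 2],
    "Revenue": [10.0, 20.0, 40.0, 5.0, 7.0],
})

grouped = toy.groupby("Company_ID")["Revenue"]

# Mean of the current and the previous quarter, restarted for each company.
toy["Revenue_moving_average"] = grouped.transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)

# Change versus the previous quarter; the first quarter of a company has no predecessor.
toy["Revenue_diff"] = grouped.transform(lambda s: s.diff()).fillna(0)

print(toy)
```

Filling the first difference with 0 treats "no previous quarter" the same as "no change", which is a simplification.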

In [None]:
df.isna().sum()

In [None]:
df.isnull().sum()

In [None]:
# Plot the value distributions of all numeric columns
vars = ['Revenue', 'Net_Profit', 'Debt_to_Equity', 'Current_Ratio', 'EPS', 'Stock_Volatility', 'Market_Cap', 'Credit_Score']
fig, axs = plt.subplots(2,4, figsize=(16,9), tight_layout = True)
axs = axs.flatten()
for index, item in enumerate(vars):
    axs[index].hist(df[item], bins = 10, color='lightblue', edgecolor = 'black')
    axs[index].set_xlabel(f'{item} Value Range')
    axs[index].set_ylabel('Frequency')
    axs[index].set_title(f'Distribution of {item}')  # one title per variable instead of the same title on every panel
plt.show()

In [None]:
number_quarter_per_companys = n_quarter_per_com.to_dict()
number_quarter_per_companys

In the next few steps, all company IDs with all four quarters recorded are collected in the list companies, and the dataframe is filtered to those IDs. For our purpose, only the first 12 IDs are used as representatives. We then plot, for each of those companies, the original numeric values such as revenue across all four quarters to see how the values change.

Finally, we store the mean and the standard deviation of those numeric columns (the list vars) in separate dictionaries, replace outliers in those columns with the column mean, and check the dictionaries again. Outliers are identified here as values more than three standard deviations away from the mean. As the distribution plots show, the values are spread very evenly, so it is no surprise that the means and standard deviations barely change: no significant outliers were detected.

In [None]:
# Keep the companies that have all four quarters recorded, then use the first 12 as representatives
companies = [company for company, n_quarters in number_quarter_per_companys.items() if n_quarters == 4]
companies = companies[:12]
filtered_df = df[df['Company_ID'].isin(companies)]

In [None]:
filtered_df = filtered_df.set_index(['Company_ID', 'Quarter'], drop=False)
filtered_df

In [None]:
vars = ['Revenue', 'EPS', 'Debt_to_Equity', 'Net_Profit', 'Market_Cap', 'Stock_Volatility']
fig, axs = plt.subplots(12, 6, figsize= (24, 20), tight_layout=True)
# One row of panels per company, one column per variable
for row, company in enumerate(companies):
    company_df = filtered_df[filtered_df['Company_ID'] == company]  # all quarters of this company
    for col, var in enumerate(vars):
        axs[row][col].plot(company_df['Quarter'], company_df[var])
        axs[row][col].set_xlabel('Quarters')
        axs[row][col].set_ylabel(var)
plt.show()

In [None]:
means, stds = {}, {}
for item in vars:
    mean = df[item].mean()
    std = df[item].std()
    means[item] = mean
    stds[item] = std
print(means)
print(stds)

In [None]:
for item in vars:
    mean = df[item].mean()
    std = df[item].std()
    outliers = (df[item] > mean+3*std) | (df[item] < mean-3*std)
    df.loc[outliers, item] = mean  # replace only the outlying values of this column, not the whole row
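The three-standard-deviation rule used above and the interquartile range (IQR) rule are two common ways to flag outliers. The following sketch applies both to a toy column with one extreme value. The data and the 1.5 × IQR factor are assumptions for illustration, not part of the notebook. It also shows that selecting the column in `.loc` leaves the other columns untouched.

```python
import pandas as pd

# Toy data (hypothetical values): 19 ordinary values and one extreme value.
toy = pd.DataFrame({
    "Company_ID": list(range(20)),
    "Revenue": [float(v) for v in range(1, 20)] + [1000.0],
})
col = "Revenue"

# Rule 1: more than three standard deviations away from the mean.
mean, std = toy[col].mean(), toy[col].std()
sigma_mask = (toy[col] > mean + 3 * std) | (toy[col] < mean - 3 * std)

# Rule 2: outside 1.5 * IQR below the first or above the third quartile.
q1, q3 = toy[col].quantile(0.25), toy[col].quantile(0.75)
iqr = q3 - q1
iqr_mask = (toy[col] > q3 + 1.5 * iqr) | (toy[col] < q1 - 1.5 * iqr)

# Replace only the flagged cells of this one column with the column mean.
cleaned = toy.copy()
cleaned.loc[sigma_mask, col] = mean

print(sigma_mask.sum(), iqr_mask.sum())
print(cleaned.tail(3))
```

On small or heavy-tailed samples the two rules can disagree, because a large outlier inflates the standard deviation it is measured against.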

In [None]:
for item in vars:
    mean = df[item].mean()
    std = df[item].std()
    means[item] = mean
    stds[item] = std
print(means)
print(stds)

In [None]:
# Remove all companies with just one quarter recorded; .copy() avoids modifying a slice of the original dataframe in place
df = df[df['Quarters_per_Company'] > 1].copy()
df.set_index(['Company_ID', 'Quarter'], drop=False, inplace=True)
df

The next steps collect all unique company IDs and split them into IDs used for training and IDs used for testing.
Based on these IDs we then build the training and test sets. This is crucial because we have time series data with at least two entries per company: if we passed the whole X and y to train_test_split from sklearn, rows of the same company could end up in both the training and the test set.

In [None]:
ids = df.index.get_level_values(level=0).unique() 
ids = np.array(ids)
ids

In [None]:
train_id, test_id = train_test_split(ids, test_size=0.3, random_state=42)

In [None]:
# Columns that are identifiers, raw labels or the target itself and must not be used as features
drop_cols = ['Company_ID', 'Credit_Risk_Score', 'Stock_Trend', 'Quarters_per_Company', 'Credit_Risk_Score_Code']
train_mask = df['Company_ID'].isin(train_id)
test_mask = df['Company_ID'].isin(test_id)
X_train = df[train_mask].drop(columns=drop_cols)
X_test = df[test_mask].drop(columns=drop_cols)
y_train = df.loc[train_mask, 'Credit_Risk_Score_Code']
y_test = df.loc[test_mask, 'Credit_Risk_Score_Code']
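Splitting the IDs first and filtering the rows afterwards keeps every company on one side of the split. As an alternative sketch, not the approach used in this notebook, scikit-learn's `GroupShuffleSplit` does the same in one step. The toy frame and its column `feature` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data (hypothetical): 10 companies with 4 quarterly rows each.
toy = pd.DataFrame({
    "Company_ID": np.repeat(np.arange(10), 4),
    "feature": np.arange(40, dtype=float),
})

# Split by group so that all rows of a company land on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["Company_ID"]))

train_ids = set(toy.iloc[train_idx]["Company_ID"])
test_ids = set(toy.iloc[test_idx]["Company_ID"])

# No company appears in both sets.
print(train_ids & test_ids)
```

Here `test_size` refers to the share of groups, not the share of rows, so the row proportions can deviate when companies have different numbers of quarters.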

In [None]:
print(X_train.shape) # sanity check: X and y must have the same number of rows within each set
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

Now we scale all numeric columns with StandardScaler and then train the two algorithms (Logistic Regression and Decision Tree) to predict the credit risk scores. I also decided to remove 'Quarter' from X_train and X_test first. We then evaluate the results with the usual classification metrics: F1 score, recall, precision and the confusion matrix.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training set only and reuse its statistics for the test set
X_train = scaler.fit_transform(X_train.drop(columns=['Quarter']))
X_test = scaler.transform(X_test.drop(columns=['Quarter']))
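The reason for calling `fit_transform` on the training data and only `transform` on the test data is that the mean and standard deviation must come from the training set alone. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): one feature.
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

# The scaler learns mean and standard deviation from the training rows only.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_scaler.mean_, toy_scaler.scale_)
print(toy_test_scaled)
```

The scaled test value is expressed in units of the training distribution, so it does not have to fall into the range seen during training.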

In [None]:
from sklearn.linear_model import LogisticRegression
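# multi_class is omitted: it is deprecated in recent scikit-learn releases,
# and the default lbfgs solver already fits a multinomial model for multiclass targets
model = LogisticRegression(class_weight='balanced')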
model.fit(X_train, y_train)

In [None]:
preds = model.predict(X_test)
preds

In [None]:
np.unique(preds)

In [None]:
test_y = np.array(y_test)

In [None]:
results = pd.DataFrame(X_test)
results['truth'] = test_y
results['Predictions'] = preds
results

In [None]:
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

In [None]:
print(f1_score(y_test, preds, labels=[0,1,2], average='weighted'))
print(recall_score(y_test, preds, labels=[0,1,2], average='weighted'))
print(precision_score(y_test, preds, labels=[0,1,2], average='weighted'))
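With `average='weighted'`, each per-class score is weighted by the number of true samples of that class. The sketch below uses made-up labels, not the notebook's predictions. It recomputes the weighted F1 score by hand from the per-class scores and contrasts it with the unweighted macro average.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (hypothetical): class 0 is twice as frequent as classes 1 and 2.
toy_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
toy_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

per_class_f1 = f1_score(toy_true, toy_pred, labels=[0, 1, 2], average=None)
support = np.bincount(toy_true, minlength=3)  # true samples per class

# Weighted average: per-class F1 weighted by class support.
manual_weighted = np.average(per_class_f1, weights=support)
weighted = f1_score(toy_true, toy_pred, labels=[0, 1, 2], average='weighted')

# Macro average: every class counts equally, regardless of its size.
macro = f1_score(toy_true, toy_pred, labels=[0, 1, 2], average='macro')

print(per_class_f1, weighted, macro)
```

When the classes are imbalanced, the weighted average can hide poor performance on a rare class, so looking at the per-class scores as well is often worthwhile.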

In [None]:
cm = confusion_matrix(y_test, preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='bwr')  # fmt='d' prints the counts as integers (rows: true class, columns: predicted class)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

model = DecisionTreeClassifier(criterion='entropy', random_state=42)
parameter_grid = {'max_depth': [3, 5, 7, 10],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [2, 3, 4, 5]}
grid_search = GridSearchCV(model, parameter_grid)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
mod = grid_search.best_estimator_
score_predictions = mod.predict(X_test)
score_predictions
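One caveat about the grid search above: by default, GridSearchCV builds its cross-validation folds from individual rows, so quarters of the same company can end up in both the fitting and the validation part of a fold. A possible refinement, sketched here on synthetic data and not part of the original analysis, is to pass a `GroupKFold` splitter together with the company IDs as `groups`. The data, the reduced parameter grid and the `f1_weighted` scoring are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): 20 companies with 4 rows each, 3 features, 3 classes.
rng = np.random.default_rng(42)
toy_groups = np.repeat(np.arange(20), 4)
toy_X = rng.normal(size=(80, 3))
toy_y = rng.integers(0, 3, size=80)

# GroupKFold keeps all rows of a company within the same fold.
group_cv = GroupKFold(n_splits=5)
search = GridSearchCV(
    DecisionTreeClassifier(criterion='entropy', random_state=42),
    {'max_depth': [3, 5]},
    cv=group_cv,
    scoring='f1_weighted',
)
search.fit(toy_X, toy_y, groups=toy_groups)

print(search.best_params_)
```

In this notebook the group labels could presumably be taken from the Company_ID level of the index of y_train, for example with `y_train.index.get_level_values(0)`.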

In [None]:
print(f1_score(y_test, score_predictions, average='weighted'))
print(recall_score(y_test, score_predictions, average='weighted'))
print(precision_score(y_test, score_predictions, average='weighted'))

In [None]:
cm = confusion_matrix(y_test, score_predictions)
sns.heatmap(cm, annot=True, fmt='d', cmap='bwr')  # fmt='d' prints the counts as integers (rows: true class, columns: predicted class)