# EDA on Finance Dataset

This notebook focuses on the Exploratory Data Analysis (EDA) of the financial loans dataset.

In [None]:
## Necessary imports

In [None]:
import pandas as pd
import numpy as np
from transformation import DataFrameTransform
from db_utils import RDSDatabaseConnector
import matplotlib.pyplot as plt
import seaborn as sns


## Loading in the dataset

The loans CSV was previously extracted from AWS RDS and saved locally. It is now loaded into a pandas DataFrame so that EDA techniques can be applied.

In [None]:
df = pd.read_csv('loan_payments.csv')
df.head()

## Ensuring Columns are correct Datatype

The code below reports the number of NULL values in each column before and after imputing missing values with the mean.

In [None]:
transformer = DataFrameTransform(df)

# Count NULL values per column before imputation
null_values_before = transformer.check_null_values()
print("NULL values before imputation:")
print(null_values_before)

# Impute missing values using the mean strategy, then re-check the NULL counts
transformer.impute_missing_values(strategy='mean')
null_values_after = transformer.check_null_values_after()
print("\nNULL values after imputation:")
print(null_values_after)

### Nulls

I created the DataFrameTransform class to facilitate data processing on the dataframe. First, I used the check_null_values method, which returns a dictionary with the count of NULL values in each column of the dataframe.

The drop_columns method allows me to remove specified columns from the dataframe.

For handling missing numeric data, the impute_missing_values method allows me to fill the gaps using either the mean or the median, selected through the strategy argument.

The check_null_values_after method reports the count of NULL values in each column after imputation.

The drop_columns_with_high_null_percentage method identifies columns in the dataframe where the percentage of NULL values exceeds a specified threshold, which I set to 50%.
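
As a rough illustration, the null handling described above could look something like the following in plain pandas. This is only a sketch and not the actual DataFrameTransform code: the function names (`null_counts`, `drop_high_null_columns`, `impute_numeric`) and the toy data are assumptions made up for this example.

```python
import numpy as np
import pandas as pd


def null_counts(df: pd.DataFrame) -> dict:
    """Return {column name: number of NULL values}."""
    return df.isna().sum().to_dict()


def drop_high_null_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of NULL values exceeds `threshold`."""
    null_fraction = df.isna().mean()
    return df.loc[:, null_fraction <= threshold]


def impute_numeric(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Fill NULLs in numeric columns with the column mean or median."""
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    result = df.copy()
    numeric_cols = result.select_dtypes(include="number").columns
    # Compute one fill value per numeric column (mean or median).
    fill_values = getattr(result[numeric_cols], strategy)()
    result[numeric_cols] = result[numeric_cols].fillna(fill_values)
    return result


# Toy data for illustration only; not taken from loan_payments.csv.
toy = pd.DataFrame({
    "amount": [100.0, np.nan, 300.0, 200.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
    "grade": ["A", "B", None, "A"],
})

before = null_counts(toy)
cleaned = impute_numeric(drop_high_null_columns(toy, threshold=0.5), strategy="mean")
after = null_counts(cleaned)

print("NULL values before:", before)
print("NULL values after:", after)
```

In this sketch the non-numeric `grade` column is left untouched by imputation, so its NULL count stays the same; categorical columns would need a separate strategy.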


In [None]:
from transformation import Plotter

plotter = Plotter(df)
plotter.plot_null_values()

## Skewed columns

I used the identify_skewed_columns method from my DataTransform class to find columns in the dataset whose skewness exceeds a threshold of 0.75 (75%). This method gives me a list of these skewed columns, stored in skewed_columns.

Next, I transformed these skewed columns using transform_skewed_columns. This method applies a transformation to the numeric columns whose skewness exceeds the threshold mentioned above.

Finally, I printed out skewed_columns to see which columns were identified as skewed. This helps me understand which columns required transformation based on their skewness levels.

In [None]:
print(dir(transformer))

In [None]:
from transformation import DataTransform
transformer = DataTransform(df)

# Identify the skewed columns, transform them, and plot the result
skewed_columns = transformer.identify_skewed_columns()
transformer.transform_skewed_columns(skewed_columns)
transformer.visualize_skewness(skewed_columns)

# Save the transformed dataframe to a new CSV file
transformer.save_dataframe('transformed_loan_payments.csv')

print("Identified skewed columns:")
print(skewed_columns)
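
For reference, the skewness steps above can be approximated in plain pandas as shown below. This is a sketch under assumptions, not the code inside DataTransform: the notebook does not say which transformation transform_skewed_columns applies, so a `log1p` transform is assumed here, and the helper names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def find_skewed_columns(df: pd.DataFrame, threshold: float = 0.75) -> list:
    """Return numeric columns whose absolute skewness exceeds `threshold`."""
    skewness = df.select_dtypes(include="number").skew()
    return skewness[skewness.abs() > threshold].index.tolist()


def log_transform(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Apply log(1 + x) to the given columns (assumed transformation)."""
    result = df.copy()
    for col in columns:
        # log1p is only defined for values above -1; skip columns with negatives.
        if (result[col] >= 0).all():
            result[col] = np.log1p(result[col])
    return result


# Toy data for illustration only; not taken from loan_payments.csv.
toy = pd.DataFrame({
    "income": [1, 2, 4, 8, 16, 32, 64, 128],  # strongly right-skewed
    "term": [1, 2, 3, 4, 5, 6, 7, 8],         # symmetric
})

skewed = find_skewed_columns(toy)
transformed = log_transform(toy, skewed)

print("Skewed columns:", skewed)
print("Skewness before:", toy["income"].skew())
print("Skewness after:", transformed["income"].skew())
```

A log transform suits right-skewed, non-negative data; other choices such as Box-Cox or Yeo-Johnson may work better for columns containing zeros or negative values.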

## Remove outliers

Here I aim to identify the outliers and remove them by adding methods to the DataTransform and Plotter classes.

In [None]:
import pandas as pd
from transformation import DataTransform

# Reload the original dataset
df = pd.read_csv('loan_payments.csv')

transformer = DataTransform(df)

# Plot the outliers
transformer.plot_outliers()
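
The notebook does not show how the outliers are removed once they have been plotted. One common approach is the interquartile range (IQR) rule, sketched below in plain pandas. This is an assumed technique rather than the method used in DataTransform, and the helper names, the `amount` column, and the toy data are hypothetical.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds as Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers_iqr(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values fall inside the IQR bounds for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        lower, upper = iqr_bounds(df[col], k)
        # Rows with NULLs are kept here; they are handled by imputation instead.
        mask &= df[col].between(lower, upper) | df[col].isna()
    return df[mask]


# Toy data for illustration only; not taken from loan_payments.csv.
toy = pd.DataFrame({"amount": [10, 12, 11, 13, 12, 11, 500]})

filtered = remove_outliers_iqr(toy, ["amount"])

print("Rows before:", len(toy))
print("Rows after:", len(filtered))
```

The multiplier `k=1.5` is the conventional default for the IQR rule; a larger value removes fewer rows.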

