# Project 2 - Customer loans in Finance

This file serves as a tool for myself to revisit what I have done and the things that I have learnt.

## Milestone 1-2: Initialise & run a class to extract the data 

**1. Initialise the class**

- Created a Python file to contain code for extraction - **db_utils.py**
- Created a Class **RDSDatabase** which will be used for the extraction

**2. Store Database Credentials**

- Created a credentials.yaml file to store the database credentials provided by AiCore
- Created a .gitignore file to keep the credentials secure and prevent them from being pushed to GitHub: 

    1. Create .gitignore file
        - git init > git touch .gitignore > git nano .gitignore > 
    2. Add in credentials.yaml
        - git add .gitignore 
    3. Commit to Github
        - git commit -m "Adding .gitignore to GitHub" > git push origin main
     

**2. Load credentials**

In [None]:
import yaml

def load_credentials(filepath: str) -> dict:
    with open(filepath, "r") as f:
        credentials = yaml.safe_load(f)
    

Add in ErrorHandling controls

In [None]:
import yaml

def load_credentials(filepath: str) -> dict:
    try:
        with open(filepath, "r") as f:
            credentials = yaml.safe_load(f)
            return credentials
    except ExceptionError as e:
        print(f"Error loading credentials {e}")
        return {}

**3. Initialise RDSDatabase Connector**

- Initilising RDSDataseConnector taking the dictionary of credentials from above as a parameter
- Setting "*self.engine = None*". Initilising it this way means that I am ensuring that the attributed are used only when they have valid values. It will be then set to a SQLAlchelmy engine object later. 

In [None]:
class RDSDatabaseConnector:
    def __init__(self, credentials: dict):
        self.credentials = credentials
        self.engine = None

**4. Initialise SQLAlchemy Engine**

- Defined method in RDSDatabaseConnector to set the engine to the SQLAlchemy Engine

In [None]:
    def initialise_engine(self):
        '''
        Initialises a SQLAlchemy engine using the provided credentials
        '''
        try:
            engine_url = (f"postgresql://
            {self.credentials['RDS_USER']}:{self.credentials['RDS_PASSWORD']}@{self.credentials['RDS_HOST']}/{self.credentials['RDS_DATABASE']}")
            self.engine=create_engine(engine_url)
            print("SQLAlchemy engine initialized successfully.")
        except Exception as e:
            print(f"Error initializing SQLAlchemy engine: {e}")

**5. Extract data**
- Created a method to extract data from the RDS Database and return it as a Pandas DataFrame

In [None]:
def extract_data(self, query:str) -> pd.DataFrame:
    if self.engine is None: 
        raise ValueError ("Engine is not initialised. Call initialise_engine() first")
    return pd.read_sql(query,self.engine)


**5. Create function to save the extracted data to a local file**

In [None]:
def save_to_csv (self, data: pd.DataFrame, filename: str):
    data.to_csv(filename, index=False)

**6. Disconnect**

In [2]:
def disconnect():
    if self.engine:
        self.engine.dispose()
        print("SQLAlchemy engine connection is closed")
    else:
        print("No active connection to close.")

**7. Call the method** 
- To ensure that this code is only run when called upon I included the line if __name__ = "__main__" 
- Then called the method to connect to the database and disconnect once finished saving the data to the csv.

In [None]:
if __name__ == "__main__":
    credentials = load_credentials("credentials.yaml")

    connector = RDSDatabaseConnector(credentials)
    connector.initialise_engine()
    
    query = "SELECT * FROM loan_payments"
    data = connector.extract_data(query)

    if not data.empty:
        connector.save_to_csv(data,"loan_payments.csv")
    
    connector.disconnect()


## Milestone 3: Exploratory Data Analysis (EDA)

This milestone is set to gain a deeper understanding of the data and identify any patterns which might exist. I'll be: 
- Reviewing the data to identify any issues, such as missing or incorrectly formatted data. 
- Applying statistical techniques to gain insight on the data's distribution and apply visualisation techniques to identify patterns or trends in the data. 

#### **Task 1**

**Convert columns to the correct format within DataTransform Class** 

Are there any columns in the exisiting df that need amending? 

From the original data *df = pd.df = pd.read_csv("loan_payments.csv) > print(df.types)* I convert the following: 
- **term** = currently an object so convert to numberical but converting to an integer representing the number of months.
- **issue_date, earliest_credit_line, last_payment_date, next_payment_date, last_credit_pull_date** = need to convert to datetime.
- **employment_length** = convert to an integer, for <1 and 10+ change to 0 and 10 respectively. 
- **loan_status** = As it contains a limited number of unique values I convert to category.

For employment date I need to extract the number from the full details given in the column :


In [None]:
df[column] = df[column].str.replace('< 1 year', '0')
df[column] = df[column].str.replace('10+ years', '10')
df[column] = df[column].str.extract(r'(\d+)') #Explaination below
df[column] = df[column].astype(float)

For df[column].str.extract(r'(\d+)')
- **r** indicates that the string is a raw string, which means that the backslashes are treated as literal characters and not as escape characters
- **\d** matches any 0-9 digit
- **+** means "one or more" of the preceeding element in this case is digits

So **(r'(\d+)')** matches one or more digits in the string and captures them as a group. The **str.extract** method returns a df with the extracted digits. 

**Defined DataTransform Class and opened new ipynb to ensure all analysis is in one place**

- This .ipynb is still used for my own personal understanding of what I learnt throughtout this project.

- Analysis though is now found on loan_portfolio_analysis.ipynb

For those columns that needed changing to datetime, if no date_format is provided, the method uses the default parsing behaviour of pd.to_datetime(). 

By explicitly stating the date format that we want to use we avoid potential errors that may arise later on. 

In [None]:
  def convert_to_datetime(self, column: str, date_format: str = None) -> pd.DataFrame:
        if date_format:
            self.df[column] = pd.to_datetime(self.df[column], format=date_format)
        else: 
            self.df[column] = pd.to_datetime(self.df[column])
        return self.df

Calling the method we have to specify the format wanted. In the data these columns are presented by 'Jan-2021 or May-2025' etc so abbreviated month name and total year. 

In Python to format this you have many options: 
- %B = Full month name 
- %b = abbreviated month name
- %Y = four-digit year
- %y = two-digit year

With multiple columns to be tranformed you can do all in one go via: 

In [None]:
    def convert_multiple_to_datetime(self, columns: list, date_format: str = None) -> pd.DataFrame:
        for column in columns: 
            if date_format:
                self.df[column] = pd.to_datetime(self.df[column], format=date_format)
            else: 
                self.df[column] = pd.to_datetime(self.df[column])
        return self.df

Then you call it with which will enable all the columns called out will be converted

In [None]:
columns_to_convert_to_datetime = [
    'issue_date',
    'earliest_credit_line',
    'last_payment_date',
    'next_payment_date',
    'last_credit_pull_date'
]
transformer.convert_multiple_to_datetime(
    columns_to_convert_to_datetime, date_format='%b-%Y'
    )

**Understanding the Data** 

> Difference between total_payment & total_payment_inv

- total_payment is from the borrower's perspective, representing the total amount they have paid, while 
- total_payment_inv is from the investor's perspective, representing the total amount they have received. 

total_payment_inv can be more insightful for someone analyzing loans from the perspective of a bank or financial institution. This is because total_payment_inv represents the total amount received by investors, which includes principal and interest payments. It provides a clear picture of the returns generated by the loans.

> Difference between funded_amount & funded_amount_inv 

- funded_amount is from the borrower's perspective, representing the amount they have received, while 
- funded_amount_inv is from the investor's perspective, representing the amount they have invested.

funded_amount_inv can be more insightful for someone analyzing loans from the perspective of a bank or financial institution, as it provides a clear picture of the investment made by the investors.

> Difference between out_prncp & out_prncp_inv

- out_prncp is from the borrower's perspective, representing the amount they still need to repay, while 
- out_prncp_inv is from the investor's perspective, representing the amount they are still owed.

out_prncp = Outstanding Principal

out_prncp_inv can be more insightful for someone analyzing loans from the perspective of a bank or financial institution, as it provides a clear picture of the remaining investment that needs to be recovered.

#### **Task 2**

- Describe all columns in the DataFrame to check their data types
- Extract statistical values: median, standard deviation and mean from the columns and the DataFrame
- Count distinct values in categorical columns
- Print out the shape of the DataFrame
- Generate a count/percentage count of NULL values in each column
- Any other methods you may find useful.

Here I set up a new .py file **dataframe_info.py** to do host this code. It is best practice to seperate classes to ensure ease of following.

In [None]:
import pandas as pd

class DataFrameInfo:

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def describe_columns(self) -> pd.DataFrame:
        return self.df.dtypes

For the statistical values i.e. mean, median and standard deviation this
can only be extracted from numberical columns so we must put a condition
in place so that the class only looks at the relevant columns:

- numberic_cols=self.df.select_dtypes(include=['number']).columns


In [None]:
    def extract_statistical_values(self) -> pd.DataFrame:
        numberic_cols=self.df.select_dtypes(include=['number']).columns
        return self.df[numberic_cols].agg(['median','std','mean'])

Statistical values provide a summary of the data in the DataFrame, giving insights into the distribution and variability of the data. Here's what each typically show:

1. **Median**: The median is the middle value of the data when it is sorted in ascending order. It is a measure of central tendency that is less affected by outliers compared to the mean. The median helps to understand the typical value in the data.

2. **Standard Deviation (std)**: The standard deviation measures the amount of variation or dispersion in the data.     
- A low standard deviation indicates that the data points are close to the mean, while 
- a high standard deviation indicates that the data points are spread out over a wider range. 

    It helps you understand the consistency of the data.

3. **Mean**: The mean is the average of the data, calculated by summing all the values and dividing by the number of data points. It is a measure of central tendency that gives an idea of the overall level of the data. However, it **can be affected by outliers.**

By examining these statistical values, you can gain a better understanding of the characteristics of the data, such as its central tendency, variability, and distribution. This information is useful for identifying patterns, detecting anomalies, and making informed decisions based on the data.

The results from our data were:

In [None]:
                 id     member_id   loan_amount  funded_amount  \
median  7.084590e+06  8.709873e+06  12000.000000   12000.000000   
std     9.571362e+06  1.031281e+07   8082.196709    8019.017599   
mean    7.621797e+06  8.655350e+06  13333.076100   13229.509117   

        funded_amount_inv       term   int_rate  instalment  \
median       11300.000000  36.000000  13.160000  347.150000   
std           8099.473527  15.826533   4.392893  238.920012   
mean         12952.622979  38.857111  13.507328  400.013953   

        employment_length    annual_inc  ...  total_payment_inv  \
median           6.000000  61000.000000  ...        9835.830000   
std              3.649479  51589.339577  ...        8363.508506   
mean             5.690749  72220.848249  ...       11788.946618   

        total_rec_prncp  total_rec_int  total_rec_late_fee  recoveries  \
median      7644.920000    1734.640000            0.000000    0.000000   
std         6958.124264    2581.657345            6.215792  630.843636   
mean        9407.048589    2577.757101            0.901512   93.501288   

        collection_recovery_fee  last_payment_amount  \
median                 0.000000           562.670000   
std                  120.193950          5323.801675   
mean                  10.859057          3130.706393   

        collections_12_mths_ex_med  mths_since_last_major_derog  policy_code  
median                    0.000000                    42.000000          1.0  
std                       0.070990                    21.052360          0.0  
mean                      0.004208                    42.253634          1.0  


Summary in the findings: 
1. Loan_amount = Suggests typical loan amount is around £12,000, there are some loans with significantly higher amounts, leading to a higher mean than standard deviation. 
2. Term = The median term is 36 months, with a mean of 38.86 months and a standard deviation of 15.83. This indicates that most loans have a term of around 36 months, but there are some loans with longer terms. 
3. Interest Rate (int_rate) = The median interest rate is 13.16%, with a mean of 13.50 and standard deviation of 4.39%. This suggests that while the typical interest rate is around 13.6%, there is some variation in the interest rates offered.
4. Annual Income (annual_inc) = Suggests typical income is around £61,000, with a mean of £72,220.85 and standard deviation of £51,589.34. Whilst typical Annual Income is around £61,000 there are some borrowers with significantly higher incomes, leading to a higher mean than standard deviation. 
5. Total Payment (total_payment_inv) = The median total payment is £9,835.83, with a mean of £11,788.95 and standard deviation of £8,363.51. This suggests that although the typical total payment is around £9,835.83, there are some loans with significantly higher total payments. 

**Making Informed Decisions**

- Loan Approval: Understanding the typical loan amounts, terms, and interest rates can help you set criteria for loan approval.

- Risk Assessment: Analyzing the variation in annual income and total payments can help you assess the risk associated with different borrowers.

- Product Offerings: Identifying patterns in loan data can help you tailor your product offerings to meet the needs of your customers.

Then further DataFrameInfo useful tools: 

In [None]:
    def count_distinct_values(self) -> pd.DataFrame:
        categorical_columns = self.df.select_dtypes(
            include=['category','object']).columns
        return self.df[categorical_columns].nunique()
    
    def print_shape(self) -> tuple:
        return self.df.shape

    def count_null_values(self) -> pd.DataFrame:
        null_counts = self.df.isnull().sum()
        null_percentage = round((self.df.isnull().sum() / len(self.df)) * 100,2)
        return pd.DataFrame({'null_count': null_counts, 'null_percentage': null_percentage})

    def get_summary(self) -> pd.DataFrame:
        return self.df.describe()

    def get_correlation_matrix(self) -> pd.DataFrame:
        return self

Seeing the Distinct Values in Category Columns allows us to: 
- Modeling: In ML, knowing the distinct values is crucial for encoding categorical variables and ensuring that the model can handle them correctly. 
- Understand the Data: Diversity of the categories in the present data
- Cleaning: Can help you spot inconsistencies or errors in the data, for example unexpected or misspelled categories, you can clean and standardize the data. 
- Analysis & Visualisations: Create bar/pie chart for different categories.


- Count NULL values - is explained in Task 3 (below) 

- Describe - Why do we do describe as well as statistical values? 

    Statistical values allows you to extract only the statistical values you're interested in whereas summary provides a broad overview. Describe provides a summary of the central tendency, dispersion, and shape of the dataset’s distribution for numeric columns. This is a quick way to get an overview of your data. 

    We have both to help with validation purposes, comparing the results from both methods can help validate findings and ensure consistency in analysis. 



#### Task 3: Removing NULL values

- funded_amount: 3,007 null values (5.54%)
- int_rate: 5,169 null values (9.53%)
- employment_length: 2,118 null values (3.91%)
- mths_since_last_delinq: 31,002 null values (57.17%)
- mths_since_last_record: 48,050 null values (88.60%)
- last_payment_date: 73 null values (0.13%)
- next_payment_date: 32,608 null values (60.13%)
- last_credit_pull_date: 7 null values (0.01%)
- collections_12_mths_ex_med: 51 null values (0.09%)
- mths_since_last_major_derog: 46,732 null values (86.17%)

So which columns to drop?
- Anything above 50% drop: 
    - mths_since_last_delinq
    - mths_since_last_record
    - next_payment_date
    - mths_since_last_major_derog
- Impute mean or median on the remaining columns
    - From our data we can see that the data contains outliers/extreme values, so we will use median as it is less affected by outliers. 
    - However in the code we have added in a line so that if we were to use this again and we chose to use mean we can.
    The median helps preserve the central tendency of the data without being influenced by extreme values.

In [None]:
    def impute_missing_values(self, strategy: str = 'median') -> pd.DataFrame:
        for column in self.df.columns:
            if self.df[column].isnull().sum() > 0:
                if self.df[column].dtype in ['float64','int64']:
                    if strategy == 'median':
                        self.df[column].fillna(self.df[column].median())
                    elif strategy == 'mean':
                        self.df[column].fillna(self.df[column].mean())
                else:
                    self.df[column] = self.df[column].fillna(self.df[column].mode()[0])
        return self.df

- We then call it using *transformer.impute_missing_values(strategy='median')*

- Note that for Non-Numeric Columns the code fills missing values with the mode (most frequent value).


#### Task 4: Skew

- Identify skewed columns.
- Determine threshold, over which column will be considered as skewed.
- Visualise this using Plotter class.
- Perform transformations to determine which gives biggest reduction of skew: 
    - Log
    - Square Root
    - BoxCox
- Apply the identified transformations on the columns
- Visualise to check the results to ensure transformations have improved skewness of data. 

**Identify Skewed Columns**


In [None]:
numberical_cols = cleaned_df.select_dtypes(include=['float64','int64'])
skewness = numberical_cols.skew().abs()
print("Skewness of columns:\n", skewness)

Skewness of columns:
 id                             2.370336
member_id                      2.205422
loan_amount                    0.805259
funded_amount                  0.821787
funded_amount_inv              0.813927
term                           0.707703
int_rate                       0.412032
instalment                     0.996981
employment_length              0.115188
annual_inc                     8.711831
dti                            0.189420
delinq_2yrs                    5.370002
inq_last_6mths                 3.248918
open_accounts                  1.059282
total_accounts                 0.779014
out_prncp                      2.356426
out_prncp_inv                  2.356848
total_payment                  1.267891
total_payment_inv              1.256197
total_rec_prncp                1.261015
total_rec_int                  2.204322
total_rec_late_fee            13.184305
recoveries                    14.589793
collection_recovery_fee       27.636843
last_payment_amount            2.499381
collections_12_mths_ex_med    20.252780
policy_code                    0.000000

**Determine Threshold** 

Using the following guidelines: 

- < 0.5 **Low skewness** : Data fairly symmetrical no transformation needed. 

- 0.5 - 1.0 **Moderate Skewness** : Consider transformation if impacts analysis.

- '>'  1.0 **High Skewness** : Transformation recommended to reduce skewness. 

Once the skewed columns are identified, you should perform transformations on these columns to determine which transformation results in the biggest reduction in skew. Create the the method to transform the columns in your DateFrameTransform class.

The three methods we will use to compare against each other are: 
1. Log transformations: 
    - Reduces skewness by compressing the range of values. It is particularly useful for data with a long right tail (positive skew). For example: 

        ```python
        Values  Log_Transformed
        0       1         0.693147
        1      10         2.397895
        2     100         4.615121
        3    1000         6.908755
        4   10000         9.210440

        ```

2. Square Root transformation
    - Reduces skewness by compressing the range of values, but less aggressively than the log transformation. It is useful for moderately skewed data.

    ```python

        Values  Sqrt_Transformed
        0       1          1.000000
        1      10          3.162278
        2     100         10.000000
        3    1000         31.622777
        4   10000        100.000000
    ```

3. Boxcox Transformation 
    - A more flexible transformation that can handle a variety of data distributions. It requires all input values to be positive and can stabilize variance and make the data more normally distributed.

        ```python
        Values  BoxCox_Transformed
        0       1            0.000000
        1      10            1.000000
        2     100            2.000000
        3    1000            3.000000
        4   10000            4.000000
        ```

##### Summary: 

**Log Transformation:** Compresses the range of values, useful for highly skewed data.

**Square Root Transformation:** Compresses the range of values, useful for moderately skewed data.

**Box-Cox Transformation:** Flexible transformation that stabilizes variance and makes data more normally distributed, requires positive values.

Full code: 

In [None]:
def transform_skewed_cols(self, columns:list) -> pd.DataFrame:
        for column in columns:
            print(f"\nProcessing column: {column}")

            if (self.df[column] <= 0).any():
                self.df[column] = self.df[column] - self.df[column].min() + 1

            #Apply log transformation
            log_transformed = np.log1p(self.df[column])
            log_skewness = log_transformed.skew()
            print(f"Log skewness for {column}: {log_skewness}")

            # Apply square root transformation
            sqrt_transformed = np.sqrt(self.df[column])
            sqrt_skewness = sqrt_transformed.skew()
            print(f"Sqrt skewness for {column}: {sqrt_skewness}")

            # Apply Box-Cox transformation (requires positive values)
            try:
                boxcox_transformed, _ = boxcox(self.df[column])
                boxcox_skewness = pd.Series(boxcox_transformed).skew()
                print(f"Box-Cox skewness for {column}: {boxcox_skewness}")
            except ValueError as e:
                print(f"Error applying Box-Cox transformation to column {column}: {e}")
                continue

            # Determine the best transformation
            transformations = {
                'log': log_skewness,
                'sqrt': sqrt_skewness,
                'boxcox': boxcox_skewness
            }
            best_transformation = min(transformations, key=transformations.get)
            print(f"\nBest transformation for {column}: {best_transformation}")

            # Apply the best transformation
            if best_transformation == 'log':
                self.df[column] = log_transformed
            elif best_transformation == 'sqrt':
                self.df[column] = sqrt_transformed
            elif best_transformation == 'boxcox':
                self.df[column] = boxcox_transformed

        return self.df 

Breaking it down:

``` python
if (self.df[column] <= 0).any():
                self.df[column] = self.df[column] - self.df[column].min() + 1
```

This is checking if any value in the columns are <=0 and if so shifts the entire column by subtracting the minimum value and adding 1 to ensure all values are positive. This is because BoxCox cannot transform negative numbers. 

Shifted Values = Original Value - Minimum Value + 1

```Python
If we had the following values: 
-5
-3
0
2

This would then become: 
(-5) - (-5) + 1 = 1
(-3) - (-5) + 1 = 3
(0) - (-5) + 1 = 6
(2) - (-5) + 1 = 8
```

"shift entire column" = adjusting all the values in the column

In [None]:
#Apply log transformation
log_transformed = np.log1p(self.df[column])
log_skewness = log_transformed.skew()
print(f"Log skewness for {column}: {log_skewness}")

# Apply square root transformation
sqrt_transformed = np.sqrt(self.df[column])
sqrt_skewness = sqrt_transformed.skew()
print(f"Sqrt skewness for {column}: {sqrt_skewness}"

# Apply Box-Cox transformation (requires positive values)
try:
    boxcox_transformed, _ = boxcox(self.df[column])
    boxcox_skewness = pd.Series(boxcox_transformed).skew()
    print(f"Box-Cox skewness for {column}: {boxcox_skewness}")
except ValueError as e:
    print(f"Error applying Box-Cox transformation to column {column}: {e}")
    

Log Transformation: This applies the log transformation to the column using np.log1p, which is equivalent to np.log(1 + x). It then calculates the skewness of the transformed column and prints it.

Square Root Transformation: This applies the square root transformation to the column using np.sqrt. It then calculates the skewness of the transformed column and prints it.

Box-Cox Transformation: 
- To ensure that the code works without crashing we have included Error Handling in this part of the code.
- ValueError will capture if any of the data is non-positive and will print a message indicating that the Box-Cox transformation has failed for the specific column and provides an error message. 


In [None]:
# Determine the best transformation
transformations = {
    'log': log_skewness,
    'sqrt': sqrt_skewness,
    'boxcox': boxcox_skewness
    }

best_transformation = min(transformations, key=transformations.get)
print(f"\nBest transformation for {column}: {best_transformation}")

This creates a dictionary with the skewness values for each transformation. It then finds the transformation with the lowest skewness (i.e., the best transformation) and prints it.

In [None]:
# Apply the best transformation
if best_transformation == 'log':
    self.df[column] = log_transformed
elif best_transformation == 'sqrt':
    self.df[column] = sqrt_transformed
elif best_transformation == 'boxcox':
    self.df[column] = 

This applies the best transformation to the column based on the lowest skewness value.

Some reasons behind reducing skewness are:

- Statistical Assumptions: Many statistical methods assume that the data is normally distributed. Reducing skewness can help meet these assumptions and improve the performance of these methods.

- Model Performance: In machine learning, the choice of transformation can impact model performance. It's often a good idea to experiment with different transformations and evaluate their impact on the model.

Plotting these via Plotly?
 
 - Updated the __init__ method to include both original_df and transformed_df which will allow us to compare the two when calling
 - Updated null_values to include a transformed part, this hasn't changed my documentation as I am still able to show the original data (new_df) in one graph. Still initialising with plotter = Plotter(original_df=df, transformed_df=transformed_df) but then having new_df as both arguments. Similarly when the data is cleaned used the same in both! 



In [None]:
class Plotter:
    '''
    Class to visualise insights from the data
    '''
    def __init__ (self, original_df:pd.DataFrame, transformed_df:pd.DataFrame):
        self.original_df = original_df
        self.transformed_df = transformed_df

    def plot_null_values(self):
        #Plot null values in the original DataFrame
        null_counts_original = self.original_df.isnull().sum()
        null_counts_original = null_counts_original[null_counts_original > 0]
        null_df_original= pd.DataFrame({'Columns': null_counts_original.index, 'Null Values': null_counts_original.values})
        fig = px.bar(null_df_original, x='Columns', y='Null Values', title='Null Values in Each Column')
        fig.show()

        # Plot null values in the transformed DataFrame
        null_counts_transformed = self.transformed_df.isnull().sum()
        null_counts_transformed = null_counts_transformed[null_counts_transformed > 0]
        null_df_transformed = pd.DataFrame({'Columns': null_counts_transformed.index, 'Null Values': null_counts_transformed.values})
        fig = px.bar(null_df_transformed, x='Columns', y='Null Values', title='Null Values in Each Column (Transformed)')
        

    def plot_histogram(self, column: str):
        # Plot original data
        fig = px.histogram(self.original_df, x=column, nbins=30, title=f'Histogram of {column} (Original)')
        fig.update_layout(xaxis_title=column, yaxis_title='Frequency')
        fig.show()

        # Plot transformed data
        fig = px.histogram(self.transformed_df, x=column, nbins=30, title=f'Histogram of {column} (Transformed)')
        fig.update_layout(xaxis_title=column, yaxis_title='Frequency')
        fig.show()

    def plot_boxplot(self, column: str):
        # Plot original data
        fig = go.Figure()
        fig.add_trace(go.Box(y=self.original_df[column], name=f'{column} (Original)'))

        # Plot transformed data
        fig.add_trace(go.Box(y=self.transformed_df[column], name=f'{column} (Transformed)'))

        fig.update_layout(title=f'Boxplot of {column} (Original and Transformed)', yaxis_title='Values')
        fig.show()

#### Task 5: Outliers

**IQR Method**: Identifies outliers based on the interquartile range.

**Z-score Method**: Identifies outliers based on the number of standard deviations from the mean.

**Combined Approach**: Uses both methods to provide a comprehensive outlier detection.

By using both IQR and Z-score methods, you can effectively identify and handle outliers in your dataset, improving the quality and accuracy of your analysis. 

In [None]:
    def remove_outliers(self, columns: list, method: str = 'both'):
        for column in columns:
            if method == 'IQR'or method == 'both':
                Q1 = self.df[column].quantile(0.25)
                Q3 = self.df[column].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                self.df = self.df[(self.df[column] >= lower_bound) & (self.df[column] <= upper_bound)]
                #Here the method is filtering through to include only rows where the column value are within the calculated bounds.
            elif method == 'Z-score' or method == 'both':
                self.df = self.df[(np.abs(stats.zscore(self.df[column])) < 3)]
                #Returns the modified dataframe with the outliers removed.
        return self.df

Here we have decided to do both so that all outliers can be removed. You can do one or the other by statinf 'IQR' or 'Z-Score' when calling the class i.e.:

In [None]:
clean_df = transformer.remove_outliers(columns=numerical_columns, method='both')

#OR single methods: 

clean_df = transformer.remove_outliers(columns=numerical_columns, method='IQR')
clean_df = transformer.remove_outliers(columns=numerical_columns, method='Z-score')

#### Task 6: Dropping overly correlated columns

Highly correlated columns in a dataset can lead to multicollinearity issues, which can affect the accuracy and interpretability of models built on the data. In this task, you will identify highly correlated columns and remove them to improve the quality of the data.


**Step 1:** First compute the correlation matrix for the dataset and visualise it.


Step 2: Identify which columns are highly correlated. You will need to decide on a correlation threshold and to remove all columns above this threshold.


Step 3: Decide which columns can be removed based on the results of your analysis.


Step 4: Remove the highly correlated columns from the dataset.

##### Step 1 - Visualise it:

I created a new class as the plotter class took in two arguments which is no longer needed for this. New class called DataVisualiser. I initiated it and then started the function below: 

In [None]:
def plot_correlation_matrix(self, columns:pd.DataFrame):
    '''
    Method to plot the correlation matrix of the given DataFrame
    '''
    # Convert non-numeric columns to numeric and drop columns with non-numeric values
    correlation_matrix = columns.corr() 

    fig = px.imshow(correlation_matrix, text_auto=True, aspect="auto", color_continuous_scale='RdBu_r')
    fig.update_layout(title='Correlation Matrix', width=800, height=800)
    fig.show()


Using the numerical columns that we produced earlier, this function now will take the columns as an argument and return all the numerical columns with correlation. Below is how we would then call it, using only numerical columns and the newly clean_df. 

In [None]:
import data_visualiser
numerical_columns = clean_df.select_dtypes(include=['float64','int64'])
correlation = data_visualiser.DataVisualiser(clean_df)
correlation.plot_correlation_matrix(numerical_columns)

If however we wanted to open in a webbrowser for ease of use rather than all on the VScode we can use the below code, which will then open a pop up window with the matrix on it!

In [None]:
import webbrowser
import os     

     if save_as_html:
            fig.write_html(file_name)
            webbrowser.open('file://' + os.path.realpath(file_name))
        else:
            pio.show(fig, renderer='browser') 

Breaking it down: 

- if `save_as_html` checks if the `save_to_html` parameter is set to `True` when calling the method.
- `fig.write_html(file_name)` saves the Plotly figure as an HTML specified with the specified `file_name` (again when calling the method)
- `webbrowser.open('file://' + os.path.realpath(file_name))` This opens the saved HTML file in a new browser window using the `webbrowser` module that had to import specifically for this stage. 
    - `os.path.realpath(file_name)` ensures that the full path to the file is used.
- else `save_to_html` is `False`
    -`pio.show(fig,renderer='browser')` will open the plot in a new browser window if the save_as_html is False.

You'd call this by:

`correlation.plot_correlation_matrix(numerical_columns, save_as_html=True)`

##### Step 2 & 3 -  Identify which columns are highly correlated. You will need to decide on a correlation threshold and to remove all columns above this threshold.

I used correlation of 0.8 to be the threshold. It's the most common but is not a hard rule. If you want to be more stringent, change to 0.9 or if you want to capture more relationships move to 0.7. This is changeable by updating the amount in the function and when calling the function:

In [None]:
    def find_highly_correlated_pairs(self, correlation_matrix: pd.DataFrame, threshold: float = 0.8):
        '''
        Function to find pairs of highly correlated features
        '''
        highly_correlated_pairs = []
        for i in range(len(correlation_matrix.columns)):
            for j in range(i):
                if abs(correlation_matrix.iloc[i, j]) > threshold:
                    highly_correlated_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j], correlation_matrix.iloc[i, j]))
        return highly_correlated_pairs

In the code above:

- Outer Loop:
    -  This loop iterates over the columns of the correlation matrix. The variable `i` represents the index of the current column. 
    - `len(correlation_matrix.columns)` tells us the number of columns within the correlation matrix.

- Inner Loop:
    - This loop iterates over the columns up to the current column `i`. The variable `j` represents the index of the current column in the inner loop. 
    - By using `range(i)` the inner loop ensures that each pair of columns is considered only once, avoiding duplicate pairs and self pairs (i.e. where `i` == `j`)

- Check Correction:
    - `if abs(correlation_matrix.iloc[i, j]) > threshold:` checks if the absolute value of the correlation coefficent between columns `i` and `j` is greater than the specified threshold.
    - `correlation_matrix.iloc[i, j]` accesses the correlation coefficient between the `i`-th and `j`-th columns of the correlation matrix.
    - `abs`is used to consider both positive and negative correlations.


- Append Highly Correlated Pairs:
    - If correlation coefficient exceeds the threshold, the pair of columns and their correlation coefficient are appended to the `highly_correlated_pairs` list.
    - `correlation_matrix.columns[i]` and `correlation_matrix.columns[j]` give the names of the columns.
    - `correlation_matrix.iloc[i, j]` gives the correlation coefficient between the columns.

In [None]:
Features: member_id and id - Correlation: 0.9900472991039773
Features: funded_amount and loan_amount - Correlation: 0.9948368389540044
Features: funded_amount_inv and loan_amount - Correlation: 0.9718686232142686
Features: funded_amount_inv and funded_amount - Correlation: 0.978275163802146
Features: instalment and loan_amount - Correlation: 0.9884455412722591
Features: instalment and funded_amount - Correlation: 0.9941440249097828
Features: instalment and funded_amount_inv - Correlation: 0.9714405982843544
Features: out_prncp_inv and out_prncp - Correlation: 0.9999999945687313
Features: total_payment and loan_amount - Correlation: 0.8308019310827159
Features: total_payment and funded_amount - Correlation: 0.8327576941881051
Features: total_payment and funded_amount_inv - Correlation: 0.8049919014718379
Features: total_payment and instalment - Correlation: 0.8343072964617597
Features: total_payment_inv and loan_amount - Correlation: 0.8085626991307507
Features: total_payment_inv and funded_amount - Correlation: 0.8126288846993878
Features: total_payment_inv and funded_amount_inv - Correlation: 0.8347148948857773
Features: total_payment_inv and instalment - Correlation: 0.8134371791332475
Features: total_payment_inv and total_payment - Correlation: 0.9655279710498574
Features: total_rec_prncp and loan_amount - Correlation: 0.801046862781626
Features: total_rec_prncp and funded_amount - Correlation: 0.8022933302032733
Features: total_rec_prncp and total_payment - Correlation: 0.9914675928619078
Features: total_rec_prncp and total_payment_inv - Correlation: 0.9568986422416129
Features: mths_since_last_major_derog and mths_since_last_delinq - Correlation: 0.8217320198486451

Breaking down the above: 

- **member_id**
    - **id** have a very high correlation of 0.9900, suggesting these identifiers are almost identical.

    - ACTION - Remove `id`


- **funded_amount** 
    - **loan_amount** - Using linear regression we can see that the relationship between loan_amount & funded_amount shows a good fit (with a R-squared of 0.93) and with the high correlation of 0.96 we can drop funded_amount and keep loan_amount.
    - **funded_amount_inv** - Given the high correlation and the near-perfect inverse relationship between funded_amount_inv and loan_amount, you can confidently drop funded_amount_inv from the dataset, as it is a derived feature and won't provide any additional insights or value.

    - ACTION - Remove `funded_amount` & `funded_amount_inv`

- **instalment**
    - **loan_amount** Since I have already dropped funded_amount & funded_amount_inv then it seems logical to remove instalment.

    - ACTION - Remove `instalment`

- **out_prncp_inv**
    - **out_prncp** are identical, having looked into detail about the difference of these two the out_prncp is more useful for the company than the out_prncp_inv. As out_prncp represents the portion of the outstanding principal that is owed to investors it is not as important to the company as out_prncp

    - ACTION - Remove `out_prncp_inv`

- **total_payment**
    - **total_payment_inv** - total amount paid to investors. Not necessary for this analysis I don't think and given the high correlation not needed. 
    - **total_rec_prncp** - total principal recieved by the lender, similary to the above, not as necessary as `total_payment`
    - `total_payment` provides a comprehensive view of the total amount paid by the borrower, which includes both principal and interest.

    - ACTION - Remove `total_payment_inv` & `total_rec_prncp`


##### Step 4: Drop columns

Just simply did it like this, then reprinted and put this updated df through the matrix builder to show the differences made from dropping these columns. 

In [None]:
columns_to_drop = ['id','funded_amount', 'funded_amount_inv', 'instalment', 'out_prncp_inv', 'total_payment_inv', 'total_rec_prncp']
transformer = DataFrameTransform(clean_df)
updated_df=transformer.drop_highly_correlated_columns(columns_to_drop)

In [None]:
new_numerical_cols=updated_df.select_dtypes(include=['float64','int64'])
new_correlation = data_visualiser.DataVisualiser(updated_df)
new_correlation.plot_correlation_matrix(new_numerical_cols, save_as_html=True)

#### Refactoring:

**DataTransform**
- changed `-> pd.DataFrame` to `-> 'DataTransform'`:
    - Why? - By returning from methods, you enable method chaining. This allows you to apply multiple tansformations in a concise and readable way for example you could: 

       ```python
        updated_df = 
        transformer.convert_to_int('column1')
        .convert_to_category('column2')
        .drop_columns(columns_to_drop)
        .get_dataframe()
        ```
    - Returning this also keeps the transformations encapsulated within the class. Meaning that you can maintain the state of the DataFrame and add more methods to the class without affecting the external code.  

    - Whereas if you returned `pd.DataFrame` it provides direct access to the modified DataFrame after each transformation.

**Visualisation**
- Changed some of the code for clarity from `correlation_matrix = columns.corr()` to 

    ``` python
        selected_df = self.df[columns]
        correlation_matrix = selected_df.corr() 
    ```

    - This ensures that only relevant columns are included in the correlation matrix, avoiding unnecessary computations for unrelated columns.

**DataFrameInfo** 
- Added in docstrings to describe the purpose and parameters of all the functions and classes. 
- Changed `def count_distinct_values(self) -> pd.DataFrame:` to be `-> pd.Series` because the output of this method is a series of distinct value counts for categorical columns, not a DataFrame.
    - Series = one-dimentsional array with an index, which makes it suitable for representing counts of distinct values for each categorical column 
    - DataFrame = two-dimensional, tabular data structure, which is more appropriate for representing multiple columns and rows of data. 

- def print_shape(self) -> tuple: Because the shape of a DataFrame is naturally represented as a tuple. 
    - Tuple is a data structure in Python that allows you to store an ordered collection of items. Immutable. Heterogeneous.