# **Transformation**

## Objectives

* Having extracted and examined the data in Extraction.ipynb and with ydata-profiling, we will use this notebook to transform the data.
    Columns of interest for potential transformation:
    *  Marital status and sex: both have unknown values which need to be handled
    *  Attrition flag should be transformed from text to either boolean or numeric (potentially adding new columns or separate datasets for attrited/existing customers)
    *  Some columns, such as Credit_Limit and Avg_Open_To_Buy, have outliers; decide how to deal with these and whether any transformation is necessary
    *  Other columns will need to be dropped as they are not relevant to the project (both Naive_Bayes columns, for example)
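
The objectives listed above could be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the transformation this notebook finally applies. The exact column names (`Attrition_Flag`, `Marital_Status`, `Naive_Bayes_1`), the label `"Attrited Customer"`, the marker `"Unknown"`, and the choice of IQR capping for outliers are all assumptions made for the example.

```python
import pandas as pd

# Toy frame standing in for the raw data; names and values are assumptions.
toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer",
                       "Existing Customer", "Existing Customer"],
    "Marital_Status": ["Married", "Unknown", "Single", "Unknown"],
    "Credit_Limit": [1500.0, 3000.0, 2500.0, 34000.0],
    "Naive_Bayes_1": [0.1, 0.9, 0.2, 0.3],  # hypothetical name for a column to drop
})


def transform_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four objective steps to a copy of df."""
    out = df.copy()

    # 1. Text flag -> numeric (1 = attrited, 0 = existing)
    out["Attrited"] = (out["Attrition_Flag"] == "Attrited Customer").astype(int)

    # 2. Treat "Unknown" as missing so it can be counted or imputed explicitly
    out["Marital_Status"] = out["Marital_Status"].mask(
        out["Marital_Status"].eq("Unknown")
    )

    # 3. Cap outliers at the 1.5 * IQR fences (one option among several)
    q1, q3 = out["Credit_Limit"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out["Credit_Limit_Capped"] = out["Credit_Limit"].clip(
        lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr
    )

    # 4. Drop columns that are not relevant to the project
    drop_cols = [c for c in out.columns if c.startswith("Naive_Bayes")]
    return out.drop(columns=drop_cols)


result = transform_sketch(toy)
print(result)
```

Keeping the capped values in a new column rather than overwriting `Credit_Limit` leaves the original available for comparison plots.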

## Inputs

* We will be using the raw BankChurners.csv as the main input
* The project hypotheses as documented in the README

## Outputs

* A cleaned and transformed csv file
* Visualisations to support the transformation process
* Dialogue to show rationale for any transformation

## Additional Comments

* While most transformation will be completed in this notebook, some may also take place in Power BI with the cleaned CSV file (adding or renaming columns, for example). These changes will be documented in this notebook.
 


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder. When running the notebook in the editor, you will therefore need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
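
Note that re-running the `os.chdir(os.path.dirname(current_dir))` cell moves up one more level each time. A possible alternative, sketched below, is to walk upwards until a known marker folder is found, which gives the same result however often it runs. The helper name `find_project_root` is hypothetical, and using the `Data` folder as the marker is an assumption based on the `Data/Raw/BankChurners.csv` path used later in this notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "Data") -> Path:
    """Return the closest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")


# Usage in the notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the search starts at the current directory itself, calling it again after the working directory has already been changed returns the same folder.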

# Setup

Import all the Python libraries required to carry out ETL (Extract, Transform, Load) and EDA (Exploratory Data Analysis).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline # For building machine learning pipelines
from feature_engine.selection import DropFeatures, DropDuplicateFeatures # For feature selection
from feature_engine.imputation import MeanMedianImputer, CategoricalImputer # For handling missing data
from feature_engine.imputation import ArbitraryNumberImputer # For handling missing data
from feature_engine.encoding import OneHotEncoder, OrdinalEncoder # For encoding categorical variables
from feature_engine.transformation import LogTransformer # For transforming numerical variables
from feature_engine.outliers import Winsorizer # For handling outliers
import joblib # For saving and loading models

sns.set_theme(style="whitegrid") # Set seaborn theme for plots
pd.set_option('display.max_columns', None) # Display all columns in pandas DataFrames
# random_state = 1 # For reproducibility


---

# Transform


1. Import the raw dataset (`BankChurners.csv`) into a DataFrame.
2. Create a copy of the raw data to preserve the original.
3. Check the dataset’s dimensions (rows and columns).
4. Preview the first few rows to understand the structure and content.


In [None]:
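# Load the raw data
df_raw = pd.read_csv('Data/Raw/BankChurners.csv')

# Make a copy of the raw data so the original stays untouched
df = df_raw.copy()

# Check the dimensions as (rows, columns)
print(f"Rows, columns: {df.shape}")

# Display the first few rows of the dataframe
df.head()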


---

# Initial Transformation

---

Section note

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
# import os
# try:
#   # create your folder here
#   # os.makedirs(name='')
# except Exception as e:
#   print(e)
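
Since one of the planned outputs is a cleaned and transformed CSV file, the folder creation in the template above could be combined with the export step. The sketch below is one way to do that. The helper name `save_cleaned`, the default file name, and the `Data/Cleaned` folder shown in the usage comment are hypothetical and not taken from the project.

```python
import os

import pandas as pd


def save_cleaned(df: pd.DataFrame, out_dir: str,
                 filename: str = "BankChurners_cleaned.csv") -> str:
    """Create out_dir if needed, write df as CSV, and return the file path."""
    # exist_ok=True avoids an error when the folder already exists,
    # so no try/except is needed around the folder creation.
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, filename)
    # index=False keeps the pandas row index out of the exported file
    df.to_csv(out_path, index=False)
    return out_path


# Usage in the notebook (folder name is an assumption):
# save_cleaned(df, "Data/Cleaned")
```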
