# Exploratory Data Analysis

Databricks was chosen to host all compute-related activities so that the project can scale to big data. We start by configuring a compute cluster and setting up this Jupyter environment with a Spark session.

Import the necessary libraries...

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql import Row

Create spark session...

In [0]:
spark = SparkSession.builder.appName("Exploratory_Data_Analysis").getOrCreate()

Upload data to `/FileStore/tables/` and run `dbutils.fs.ls("/FileStore/tables/")` to list all available files...

In [0]:
# dbutils.fs.ls("/FileStore/tables/") ## Returns a list of all files in the Databricks filesystem (commented out for privacy)

Collect the data...

In [0]:
class Dataset:
    """Wraps one CSV file as a Spark DataFrame together with basic shape metadata."""

    def __init__(self, file_path: str) -> None:
        self.name = file_path.split("/")[-1]  # file name without the folder path
        self.spark_df = spark.read.csv(file_path, inferSchema=True, header=True)

    @property
    def num_rows(self) -> int:
        # count() is a Spark action, so every access triggers a job
        return self.spark_df.count()

    @property
    def num_cols(self) -> int:
        return len(self.spark_df.columns)

    @property
    def columns(self) -> list[str]:
        return self.spark_df.columns


folder_path = "/FileStore/tables/"
file_names = [
    "test.csv",
    "train.csv",
    "Customer_Churn_Records.csv",
    "Bank_Customer_Churn_Prediction.csv",
    "Churn_Modeling.csv",
    "Churn_Modelling.csv",
    "Churn_Modelling-1.csv",
    "churn.csv",
]
file_paths = [folder_path + f for f in file_names]
df_lst = [Dataset(f) for f in file_paths]

In [0]:
print(f"There are {len(df_lst)} spark dataframes.")

There are 8 spark dataframes.


Analysing the various sizes of the datasets...

In [0]:
shape_df = spark.createDataFrame(
    [
        Row(
            df_index=i,
            number_of_columns=sdf.num_cols,
            number_of_rows=sdf.num_rows,
            file_name=sdf.name,
        )
        for i, sdf in enumerate(df_lst)
    ]
)

In [0]:
shape_df.show(truncate=False)

+--------+-----------------+--------------+----------------------------------+
|df_index|number_of_columns|number_of_rows|file_name                         |
+--------+-----------------+--------------+----------------------------------+
|0       |13               |110023        |test.csv                          |
|1       |14               |165034        |train.csv                         |
|2       |18               |10000         |Customer_Churn_Records.csv        |
|3       |12               |10000         |Bank_Customer_Churn_Prediction.csv|
|4       |14               |10000         |Churn_Modeling.csv                |
|5       |14               |10000         |Churn_Modelling.csv               |
|6       |14               |10002         |Churn_Modelling-1.csv             |
|7       |14               |10000         |churn.csv                         |
+--------+-----------------+--------------+----------------------------------+



In [0]:
max_num_cols = shape_df.agg({"number_of_columns": "max"}).collect()[0][0]
print(f"There are up to {max_num_cols} columns.")

There are up to 18 columns.


Analysing the column names...

In [0]:
# One list of column names per file
col_lists = [list(sdf.columns) for sdf in df_lst]
# Compute the widest file locally instead of running a Spark aggregation on every loop iteration
widest = max(len(cols) for cols in col_lists)
# Prefix each row with its dataframe index and pad with None so that all rows have equal width
col_tuples = [tuple([str(i)] + cols + [None] * (widest - len(cols))) for i, cols in enumerate(col_lists)]

column_pandas_df = spark.createDataFrame([Row(*t) for t in col_tuples])

In [0]:
column_pandas_df.show(truncate=True)

+---+-----------+------------+-------+-----------+---------+------+-------+---------------+-----------+-------------+----------------+--------------+---------------+------+--------+------------------+---------+------------+
| _1|         _2|          _3|     _4|         _5|       _6|    _7|     _8|             _9|        _10|          _11|             _12|           _13|            _14|   _15|     _16|               _17|      _18|         _19|
+---+-----------+------------+-------+-----------+---------+------+-------+---------------+-----------+-------------+----------------+--------------+---------------+------+--------+------------------+---------+------------+
|  0|         id|  CustomerId|Surname|CreditScore|Geography|Gender|    Age|         Tenure|    Balance|NumOfProducts|       HasCrCard|IsActiveMember|EstimatedSalary|  null|    null|              null|     null|        null|
|  1|         id|  CustomerId|Surname|CreditScore|Geography|Gender|    Age|         Tenure|    Balance|N

What looked good in Pandas does not look good as a Spark dataframe. Approaching the representation from a different angle...

In [0]:
column_melt_df = spark.createDataFrame(
    [
        Row(df_index=i, column_name=c, file_name=sdf.name)
        for i, sdf in enumerate(df_lst)
        for c in sdf.columns
    ]
)

In [0]:
column_melt_df.count()

Out[12]: 113

In [0]:
column_melt_df.show(113, truncate=False)

+--------+------------------+----------------------------------+
|df_index|column_name       |file_name                         |
+--------+------------------+----------------------------------+
|0       |id                |test.csv                          |
|0       |CustomerId        |test.csv                          |
|0       |Surname           |test.csv                          |
|0       |CreditScore       |test.csv                          |
|0       |Geography         |test.csv                          |
|0       |Gender            |test.csv                          |
|0       |Age               |test.csv                          |
|0       |Tenure            |test.csv                          |
|0       |Balance           |test.csv                          |
|0       |NumOfProducts     |test.csv                          |
|0       |HasCrCard         |test.csv                          |
|0       |IsActiveMember    |test.csv                          |
|0       |EstimatedSalary

In [0]:
column_melt_df.select("column_name").dropDuplicates().count()

Out[14]: 31

In [0]:
column_melt_df.select("column_name").dropDuplicates().show(31)

+------------------+
|       column_name|
+------------------+
|     NumOfProducts|
|           Balance|
|       CreditScore|
|               Age|
|        CustomerId|
|         HasCrCard|
|         Geography|
|            Tenure|
|           Surname|
|            Gender|
|    IsActiveMember|
|                id|
|   EstimatedSalary|
|         RowNumber|
|            Exited|
|          Complain|
|     active_member|
|      credit_score|
|           balance|
|         Card Type|
|      Point Earned|
|Satisfaction Score|
|       customer_id|
|            tenure|
|       credit_card|
|           country|
|               age|
|            gender|
|   products_number|
|  estimated_salary|
|             churn|
+------------------+



Given that no single file has more than 18 columns (see `shape_df.show(truncate=False)`), the 31 distinct names returned by `column_melt_df.select("column_name").dropDuplicates().count()` seem like too many. The output of `column_melt_df.select("column_name").dropDuplicates().show(31)` confirms this: the same column appears under different spellings, such as "age" and "Age".
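
One quick way to surface names that differ only by letter case is to group them by their lowercase form. The sketch below is plain Python; `group_case_variants` is a hypothetical helper name, and `sample` is a hand-copied subset of the names printed above rather than the full list.

```python
from collections import defaultdict


def group_case_variants(names):
    """Group column names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Keep only the lowercase keys that have more than one spelling
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Hand-copied subset of the distinct column names shown above
sample = ["Age", "age", "Balance", "balance", "Tenure", "tenure", "Gender", "gender", "Surname"]
variants = group_case_variants(sample)
print(variants)
```

Names such as `Surname`, which have a single spelling in the sample, are left out of the result.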

In [0]:
column_melt_df.filter("column_name = 'Geography'").show(truncate=False)

+--------+-----------+--------------------------+
|df_index|column_name|file_name                 |
+--------+-----------+--------------------------+
|0       |Geography  |test.csv                  |
|1       |Geography  |train.csv                 |
|2       |Geography  |Customer_Churn_Records.csv|
|4       |Geography  |Churn_Modeling.csv        |
|5       |Geography  |Churn_Modelling.csv       |
|6       |Geography  |Churn_Modelling-1.csv     |
|7       |Geography  |churn.csv                 |
+--------+-----------+--------------------------+



Many of the datasets appear to be duplicates. Selecting the first row from each dataset yields...

In [0]:
for sdf in df_lst:
    print(sdf.spark_df.first().asDict().values())

dict_values([165034, 15773898, 'Lucchese', 586, 'France', 'Female', 23.0, 2, 0.0, 2, 0.0, 1.0, 160976.75])
dict_values([0, 15674932, 'Okwudilichukwu', 668, 'France', 'Male', 33.0, 3, 0.0, 2, 1.0, 0.0, 181449.97, 0])
dict_values([1, 15634602, 'Hargrave', 619, 'France', 'Female', 42, 2, 0.0, 1, 1, 1, 101348.88, 1, 1, 2, 'DIAMOND', 464])
dict_values([15634602, 619, 'France', 'Female', 42, 2, 0.0, 1, 1, 1, 101348.88, 1])
dict_values([1, 15634602, 'Hargrave', 619, 'France', 'Female', 42, 2, 0.0, 1, 1, 1, 101348.88, 1])
dict_values([1, 15634602, 'Hargrave', 619, 'France', 'Female', 42, 2, 0.0, 1, 1, 1, 101348.88, 1])
dict_values([1, 15634602, 'Hargrave', 619, 'France', 'Female', 42.0, 2, 0.0, 1, 1, 1, 101348.88, 1])
dict_values([1, 15634602, 'Hargrave', 619, 'France', 'Female', 42, 2, 0.0, 1, 1, 1, 101348.88, 1])


Setting aside `train.csv` and `test.csv` (because this data has clearly been shuffled for ML purposes), the remaining files all appear to contain the same data. Recall that each of them has approximately 10,000 rows, and the first row of each describes the same female customer from France with a credit score of 619; all shared values match at first glance. Duplicate data may come in handy for validation during the integration process in the upcoming data and ML pipelines.
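
Comparing first rows is only a spot check. A fuller comparison would treat each file as a multiset of rows and look at what is left over on either side. The sketch below shows the idea in plain Python with made-up toy rows; `row_multiset_diff` is a hypothetical helper. At Spark scale, `DataFrame.exceptAll` is one candidate for the same kind of difference, assuming both frames have first been brought to a common schema.

```python
from collections import Counter


def row_multiset_diff(rows_a, rows_b):
    """Return (only_in_a, only_in_b) as Counters, keeping track of repeated rows."""
    count_a, count_b = Counter(rows_a), Counter(rows_b)
    # Counter subtraction keeps only positive counts, i.e. the surplus on each side
    return count_a - count_b, count_b - count_a


# Toy rows (made up for illustration): (customer id, credit score, country)
rows_a = [(1, 600, "France"), (2, 700, "Spain")]
rows_b = [(1, 600, "France"), (2, 700, "Spain"), (2, 700, "Spain")]

only_a, only_b = row_multiset_diff(rows_a, rows_b)
print(only_a)  # nothing is unique to rows_a
print(only_b)  # one surplus copy of the second row
```

A surplus on one side only would be consistent with a file that contains a few repeated rows.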

So there may be only two groups of data instead of seven. This is quite surprising given that the other notebook, `local_eda.ipynb`, which works with Pandas, suggested that the data might fall into three groups, with `Customer_Churn_Records.csv` being a "unique" dataset. This clearly justifies cleaning the column names.

In [0]:
def column_lowercase_cleaner(list_of_columns: list[str]) -> list[str]:
    # Normalise names: lowercase, then strip spaces and underscores
    return [c.lower().replace(" ", "").replace("_", "") for c in list_of_columns]


def find_non_common_columns(dataset_1: Dataset, dataset_2: Dataset) -> set[str]:
    # Symmetric difference: cleaned names present in exactly one of the two datasets
    return set(column_lowercase_cleaner(dataset_1.columns)) ^ set(
        column_lowercase_cleaner(dataset_2.columns)
    )

In [0]:
raw_column_names_df = column_melt_df.select("column_name").dropDuplicates().collect()
raw_column_names_lst = [row[0] for row in raw_column_names_df]
column_names_lowercase = list(set(column_lowercase_cleaner(raw_column_names_lst)))
print(
    f'After removing case sensitivity from column names, there appear to be {len(column_names_lowercase)} unique columns, which are: {", ".join(column_names_lowercase[:-1])}, and {column_names_lowercase[-1]}.'
)

After removing case sensitivity from column names, there appear to be 24 unique columns, which are: numofproducts, estimatedsalary, isactivemember, tenure, customerid, churn, creditcard, activemember, hascrcard, balance, productsnumber, surname, satisfactionscore, complain, gender, pointearned, age, cardtype, id, exited, rownumber, creditscore, country, and geography.


Judging from `shape_df.show(truncate=False)` and our observation that most of the data appears to be duplicated, we look more closely at the anomalies `df2` (18 columns) and `df3` (12 columns), using `df4` (14 columns) as a control. We will come back to the anomaly of `df6` (two extra rows) later.

In [0]:
df_lst[2].spark_df.printSchema()

root
 |-- RowNumber: integer (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- Surname: string (nullable = true)
 |-- CreditScore: integer (nullable = true)
 |-- Geography: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tenure: integer (nullable = true)
 |-- Balance: double (nullable = true)
 |-- NumOfProducts: integer (nullable = true)
 |-- HasCrCard: integer (nullable = true)
 |-- IsActiveMember: integer (nullable = true)
 |-- EstimatedSalary: double (nullable = true)
 |-- Exited: integer (nullable = true)
 |-- Complain: integer (nullable = true)
 |-- Satisfaction Score: integer (nullable = true)
 |-- Card Type: string (nullable = true)
 |-- Point Earned: integer (nullable = true)



In [0]:
df_lst[3].spark_df.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- credit_score: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- balance: double (nullable = true)
 |-- products_number: integer (nullable = true)
 |-- credit_card: integer (nullable = true)
 |-- active_member: integer (nullable = true)
 |-- estimated_salary: double (nullable = true)
 |-- churn: integer (nullable = true)



In [0]:
df_lst[4].spark_df.printSchema()

root
 |-- RowNumber: integer (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- Surname: string (nullable = true)
 |-- CreditScore: integer (nullable = true)
 |-- Geography: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tenure: integer (nullable = true)
 |-- Balance: double (nullable = true)
 |-- NumOfProducts: integer (nullable = true)
 |-- HasCrCard: integer (nullable = true)
 |-- IsActiveMember: integer (nullable = true)
 |-- EstimatedSalary: double (nullable = true)
 |-- Exited: integer (nullable = true)



In [0]:
find_non_common_columns(df_lst[3], df_lst[4])

Out[23]: {'activemember',
 'churn',
 'country',
 'creditcard',
 'exited',
 'geography',
 'hascrcard',
 'isactivemember',
 'numofproducts',
 'productsnumber',
 'rownumber',
 'surname'}

It appears that `df3` is identical to `df4`, except that `df3` drops the columns `rownumber` and `surname`. We can also see some unexpected pairings that contribute extra "unique" columns:

- `country` and `geography`
- `numofproducts` and `productsnumber`
- `creditcard` and `hascrcard`
- `activemember` and `isactivemember`
- `churn` and `exited`

We also see that `df3` uses very different naming, so we compare `df2` with `df4` rather than with `df3`.
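
The pairings listed above could be captured in a small lookup so that `df3`-style names resolve to their `df4`-style counterparts. The sketch below is plain Python; `SYNONYMS`, `canonical_name` and `canonical_set` are hypothetical names, and the column lists are copied from the printed schemas of `df3` and `df4`.

```python
# Hypothetical lookup built from the pairings listed above (cleaned df3 name -> cleaned df4 name)
SYNONYMS = {
    "country": "geography",
    "productsnumber": "numofproducts",
    "creditcard": "hascrcard",
    "activemember": "isactivemember",
    "churn": "exited",
}


def canonical_name(raw_name):
    """Lowercase, strip spaces and underscores, then resolve known synonyms."""
    cleaned = raw_name.lower().replace(" ", "").replace("_", "")
    return SYNONYMS.get(cleaned, cleaned)


def canonical_set(names):
    return {canonical_name(n) for n in names}


df3_columns = [
    "customer_id", "credit_score", "country", "gender", "age", "tenure", "balance",
    "products_number", "credit_card", "active_member", "estimated_salary", "churn",
]
df4_columns = [
    "RowNumber", "CustomerId", "Surname", "CreditScore", "Geography", "Gender", "Age",
    "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember",
    "EstimatedSalary", "Exited",
]

missing_from_df3 = canonical_set(df4_columns) - canonical_set(df3_columns)
extra_in_df3 = canonical_set(df3_columns) - canonical_set(df4_columns)
print(sorted(missing_from_df3))  # ['rownumber', 'surname']
print(sorted(extra_in_df3))  # []
```

With the synonyms resolved, the only remaining difference is the two columns that `df3` drops.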

In [0]:
find_non_common_columns(df_lst[2], df_lst[4])

Out[24]: {'cardtype', 'complain', 'pointearned', 'satisfactionscore'}

This confirms that `df2` has exactly the same columns as `df4`; the only difference is that four new columns have been added.
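
A symmetric difference does not say which side a column comes from. Splitting it into two one-sided differences makes the direction explicit. The sketch below is plain Python; `one_sided_differences` is a hypothetical helper, and the column lists are copied from the printed schemas of `df2` and `df4`.

```python
def clean(names):
    """Same normalisation as above: lowercase, no spaces, no underscores."""
    return {n.lower().replace(" ", "").replace("_", "") for n in names}


def one_sided_differences(cols_a, cols_b):
    """Return (names only in a, names only in b), each sorted alphabetically."""
    a, b = clean(cols_a), clean(cols_b)
    return sorted(a - b), sorted(b - a)


df4_columns = [
    "RowNumber", "CustomerId", "Surname", "CreditScore", "Geography", "Gender", "Age",
    "Tenure", "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember",
    "EstimatedSalary", "Exited",
]
# df2 has the df4 columns plus four more
df2_columns = df4_columns + ["Complain", "Satisfaction Score", "Card Type", "Point Earned"]

only_df2, only_df4 = one_sided_differences(df2_columns, df4_columns)
print(only_df2)  # ['cardtype', 'complain', 'pointearned', 'satisfactionscore']
print(only_df4)  # []
```

An empty second list indicates that every `df4` column also exists in `df2`.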

In [0]:
filtered_df_lst = [df_lst[i] for i in [2, 3, 4]]
for sdf in filtered_df_lst:
    print(sdf.spark_df.first().asDict())

{'RowNumber': 1, 'CustomerId': 15634602, 'Surname': 'Hargrave', 'CreditScore': 619, 'Geography': 'France', 'Gender': 'Female', 'Age': 42, 'Tenure': 2, 'Balance': 0.0, 'NumOfProducts': 1, 'HasCrCard': 1, 'IsActiveMember': 1, 'EstimatedSalary': 101348.88, 'Exited': 1, 'Complain': 1, 'Satisfaction Score': 2, 'Card Type': 'DIAMOND', 'Point Earned': 464}
{'customer_id': 15634602, 'credit_score': 619, 'country': 'France', 'gender': 'Female', 'age': 42, 'tenure': 2, 'balance': 0.0, 'products_number': 1, 'credit_card': 1, 'active_member': 1, 'estimated_salary': 101348.88, 'churn': 1}
{'RowNumber': 1, 'CustomerId': 15634602, 'Surname': 'Hargrave', 'CreditScore': 619, 'Geography': 'France', 'Gender': 'Female', 'Age': 42, 'Tenure': 2, 'Balance': 0.0, 'NumOfProducts': 1, 'HasCrCard': 1, 'IsActiveMember': 1, 'EstimatedSalary': 101348.88, 'Exited': 1}


Finally, we check the schema for consistent data types.

In [0]:
for i, sdf in enumerate(df_lst):
    print(f"df{i} shape({sdf.num_rows}, {sdf.num_cols}): {sdf.name}")
    sdf.spark_df.printSchema()

df0 shape(110023, 13): test.csv
root
 |-- id: integer (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- Surname: string (nullable = true)
 |-- CreditScore: integer (nullable = true)
 |-- Geography: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Tenure: integer (nullable = true)
 |-- Balance: double (nullable = true)
 |-- NumOfProducts: integer (nullable = true)
 |-- HasCrCard: double (nullable = true)
 |-- IsActiveMember: double (nullable = true)
 |-- EstimatedSalary: double (nullable = true)

df1 shape(165034, 14): train.csv
root
 |-- id: integer (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- Surname: string (nullable = true)
 |-- CreditScore: integer (nullable = true)
 |-- Geography: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Tenure: integer (nullable = true)
 |-- Balance: double (nullable = true)
 |-- NumOfProducts: integer (nullable 

In conclusion, there are 2 datasets and up to 18 columns. A data pipeline is needed to make column names and data types consistent (e.g. `Age` is a double in `df6` but an integer in `df7`).
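
As a starting point for such a pipeline, type mismatches can be listed by collecting, for each column, the set of types seen across files. The sketch below is plain Python; `find_type_conflicts` is a hypothetical helper, and the input is a hand-copied excerpt of the `df0` and `df2` schemas printed above. In Spark, a similar mapping could presumably be built from `DataFrame.dtypes`, although the type strings it returns may be spelled differently from the `printSchema` output.

```python
from collections import defaultdict


def find_type_conflicts(schemas):
    """schemas maps a dataset label to {column name: type name}.

    Returns {column: {type name: [dataset labels]}} for columns seen with more than one type.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for label, schema in schemas.items():
        for column, type_name in schema.items():
            seen[column][type_name].append(label)
    return {column: dict(types) for column, types in seen.items() if len(types) > 1}


# Excerpt of the printed schemas (three columns only)
schemas = {
    "df0": {"Age": "double", "HasCrCard": "double", "Tenure": "integer"},
    "df2": {"Age": "integer", "HasCrCard": "integer", "Tenure": "integer"},
}

conflicts = find_type_conflicts(schemas)
print(conflicts)
```

Columns such as `Tenure`, which have the same type in both files, do not appear in the result.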