<a href="https://colab.research.google.com/github/SlyFox579/bdt-2023-25962701/blob/main/koalas_random_forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is an implementation intended for porting to other platforms and for discussion. Its purpose is not exploratory analysis but to examine the APIs and technologies involved; it is not intended to be a good or reference solution to this problem.

Obtain the data from Google Cloud Storage buckets

In [22]:
! wget https://storage.googleapis.com/bdt-spark-store/external_sources.csv -O gcs_external_sources.csv

--2023-11-06 18:09:26--  https://storage.googleapis.com/bdt-spark-store/external_sources.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.139.207, 74.125.141.207, 173.194.211.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.139.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15503836 (15M) [text/csv]
Saving to: ‘gcs_external_sources.csv’


2023-11-06 18:09:27 (15.6 MB/s) - ‘gcs_external_sources.csv’ saved [15503836/15503836]



In [23]:
! wget https://storage.googleapis.com/bdt-spark-store/internal_data.csv -O gcs_internal_data.csv

--2023-11-06 18:09:30--  https://storage.googleapis.com/bdt-spark-store/internal_data.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.139.207, 74.125.141.207, 173.194.211.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.139.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 152978396 (146M) [text/csv]
Saving to: ‘gcs_internal_data.csv’


2023-11-06 18:09:35 (32.7 MB/s) - ‘gcs_internal_data.csv’ saved [152978396/152978396]



In [7]:
# get spark
VERSION='3.5.0'
!wget https://dlcdn.apache.org/spark/spark-$VERSION/spark-$VERSION-bin-hadoop3.tgz

--2023-11-06 15:59:56--  https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 400395283 (382M) [application/x-gzip]
Saving to: ‘spark-3.5.0-bin-hadoop3.tgz’


2023-11-06 16:00:12 (71.5 MB/s) - ‘spark-3.5.0-bin-hadoop3.tgz’ saved [400395283/400395283]



In [8]:
# decompress spark
!tar xf spark-$VERSION-bin-hadoop3.tgz

# install python package to help with system paths
!pip install -q findspark

In [9]:
# Let Colab know where the java and spark folders are

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/spark-{VERSION}-bin-hadoop3"

In [13]:
!lsb_release -a
!apt-get update
# Install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.2 LTS
Release:	22.04
Codename:	jammy
Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:6 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ Packages [46.6 kB]
Get:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease [18.1 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [109 kB]
Get:9 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [1,186 kB]
Hit:10 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [49.8 kB]

In [14]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
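# "local[*]" is a master URL (run locally on all cores), not an application name
spark = SparkSession.builder.master("local[*]").appName("koalas_random_forest").getOrCreate()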

In [None]:
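# Koalas was merged into PySpark as the pandas API on Spark (Spark 3.2+),
# so with Spark 3.5.0 the same API is imported from pyspark.pandas.
import pyspark.pandas as ks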

# Read the CSV files using Koalas
df_data = ks.read_csv('gcs_internal_data.csv')
df_ext = ks.read_csv('gcs_external_sources.csv')

Join them on their common identifier key

In [None]:
# Merge the DataFrames on the 'SK_ID_CURR' column using an inner join.
df_full = ks.merge(df_data, df_ext, on='SK_ID_CURR', how='inner')

# Display the first few rows of the merged DataFrame.
df_full.head()
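
Koalas mirrors the pandas `merge` signature, so the effect of an inner join can be checked on a small pandas frame before running it on the full data. The sketch below uses invented rows; only the `SK_ID_CURR`, `AMT_CREDIT` and `EXT_SOURCE_1` column names come from this notebook, and the frame names `internal` and `external` are hypothetical.

```python
import pandas as pd

# Toy frames; the values are invented for illustration only.
internal = pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                         "AMT_CREDIT": [100.0, 200.0, 300.0]})
external = pd.DataFrame({"SK_ID_CURR": [2, 3, 4],
                         "EXT_SOURCE_1": [0.5, 0.7, 0.9]})

# An inner join keeps only the keys present in both frames,
# so ids 1 and 4 are dropped from the result.
merged = pd.merge(internal, external, on="SK_ID_CURR", how="inner")
print(merged)
```

If rows go missing after the real join, comparing the row counts of `df_data`, `df_ext` and `df_full` is a quick way to see how many identifiers had no match.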


We keep only a subset of the features (plus the `TARGET` label) for the sake of this example.

In [None]:
columns_extract = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
                  'DAYS_BIRTH', 'DAYS_EMPLOYED', 'NAME_EDUCATION_TYPE',
                  'DAYS_ID_PUBLISH', 'CODE_GENDER', 'AMT_ANNUITY',
                  'DAYS_REGISTRATION', 'AMT_GOODS_PRICE', 'AMT_CREDIT',
                  'ORGANIZATION_TYPE', 'DAYS_LAST_PHONE_CHANGE',
                  'NAME_INCOME_TYPE', 'AMT_INCOME_TOTAL', 'OWN_CAR_AGE', 'TARGET']
df = df_full[columns_extract]

Let's obtain a train and test split

In [None]:
import numpy as np

# Seed for reproducibility. "compute.default_random_state" is not among the
# documented options, so the seed is passed explicitly to each call instead.
SEED = 101


In [None]:
# Shuffle the DataFrame. The fraction is passed as a float and the seed as an
# integer, which is what the underlying Spark sampler expects.
df_shuffled = df.sample(frac=1.0, random_state=SEED)

# Calculate the split point based on the desired train-test split ratio.
split_point = int(0.8 * len(df_shuffled))

# Split the DataFrame into train and test sets.
train = df_shuffled.iloc[:split_point]
test = df_shuffled.iloc[split_point:]
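
The shuffle-then-slice pattern above can be illustrated on a small pandas frame, where `sample(frac=1.0)` is a true permutation of the rows. This is a minimal sketch with invented data (`toy`, `train_toy` and `test_toy` are hypothetical names); it makes no claim about the row ordering the distributed version produces.

```python
import pandas as pd

# Ten invented rows with an alternating binary label.
toy = pd.DataFrame({"x": range(10), "TARGET": [0, 1] * 5})

# Permute the rows with a fixed seed, then cut at 80%.
shuffled = toy.sample(frac=1.0, random_state=101)
split_point = int(0.8 * len(shuffled))
train_toy = shuffled.iloc[:split_point]
test_toy = shuffled.iloc[split_point:]

# Every row lands in exactly one of the two sets.
print(len(train_toy), len(test_toy))
```

Because the split is not stratified, the class proportions in the two sets can differ, which is what the next cell checks.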


In [None]:
print(train.TARGET.value_counts()/len(train.index))
print(test.TARGET.value_counts()/len(test.index))

Handle the categorical variables

In [None]:
# Perform one-hot encoding on categorical columns for the 'train' and 'test' DataFrames.
train = ks.get_dummies(train)
test = ks.get_dummies(test)

# Print the shapes of the resulting DataFrames.
print('Training Features shape: ', train.shape)
print('Testing Features shape: ', test.shape)


Align the training and test data (as the test data may not have the same columns in the encoding)

In [None]:
# Align the training and testing data, keep only columns present in both dataframes
train, test = train.align(test, join = 'inner', axis = 1)

print('Training Features shape: ', train.shape)
print('Testing Features shape: ', test.shape)
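
To see why the alignment step matters, consider a category that appears only in the training split. The sketch below uses pandas (whose `get_dummies` and `align` the cells above mirror) on invented rows; the frame names are hypothetical and the category values are for illustration only.

```python
import pandas as pd

# The third category value appears only in the training rows.
train_toy = pd.DataFrame({"CODE_GENDER": ["F", "M", "XNA"],
                          "AMT_CREDIT": [1.0, 2.0, 3.0]})
test_toy = pd.DataFrame({"CODE_GENDER": ["F", "M"],
                         "AMT_CREDIT": [4.0, 5.0]})

# One-hot encoding each split separately yields different column sets.
train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# An inner alignment on the column axis keeps only the shared columns,
# so CODE_GENDER_XNA is dropped from the training frame.
train_al, test_al = train_enc.align(test_enc, join="inner", axis=1)
print(sorted(train_al.columns))
```

The trade-off is that information carried by train-only categories is discarded; an outer alignment with a fill value of 0 would be an alternative design.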

Get labels from data

In [None]:
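# Collect the labels to NumPy arrays, since scikit-learn works on in-memory data
train_labels = train['TARGET'].to_numpy()
test_labels = test['TARGET'].to_numpy()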

Fill in missing data and scale

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer as Imputer

# Drop the 'TARGET' column from the training and testing data
if 'TARGET' in train.columns:
    train = train.drop(columns=['TARGET'])
    test = test.drop(columns=['TARGET'])
else:
    train = train.copy()
    test = test.copy()

# Feature names
features = list(train.columns)

# scikit-learn operates on in-memory data, so collect both frames to pandas
train = train.to_pandas()
test = test.to_pandas()

# Median imputation of missing values
imputer = Imputer(strategy='median')

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()

# Fit the imputer on the training data and transform both sets
# (transform returns NumPy arrays)
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(test)

# Fit the scaler on the training data and transform both sets
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

# Print the shapes of the resulting arrays
print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)
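
The fit-on-train, transform-both pattern can be checked by hand on a tiny array. This is a minimal sketch with invented numbers; `X_train`, `X_test` and the `*_prepared` names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Invented data: one missing value in each split.
X_train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test = np.array([[np.nan, 20.0]])

# Both transformers learn their statistics from the training data only.
imputer = SimpleImputer(strategy="median").fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))

X_train_prepared = scaler.transform(imputer.transform(X_train))
X_test_prepared = scaler.transform(imputer.transform(X_test))

# The training medians are 3 and 20, so the test row is imputed to [3, 20],
# which equals the training means and therefore scales to [0, 0].
print(imputer.statistics_)
print(X_test_prepared)
```

Fitting on the training split alone keeps test-set statistics from leaking into the preprocessing.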


Fit random forest

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Make the random forest classifier
random_forest = RandomForestClassifier(n_estimators = 100,
                                       random_state = 50,
                                       verbose = 1, n_jobs = -1)
# Train on the training data
random_forest.fit(train, train_labels)

# Extract feature importances
feature_importance_values = random_forest.feature_importances_
feature_importances = pd.DataFrame({'feature': features, 'importance': feature_importance_values})

# Make predictions on the test data
predictions = random_forest.predict(test)

Evaluate on test

In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score

print(accuracy_score(test_labels, predictions))
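
The cell above imports `roc_auc_score` but reports only accuracy. If `TARGET` turns out to be imbalanced, accuracy alone can look good for a model that never predicts the minority class. The sketch below illustrates this with invented labels and scores; `y_true`, `y_pred` and `y_score` are hypothetical, and in this notebook the scores would plausibly come from `random_forest.predict_proba(test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Invented labels: nine negatives and one positive.
y_true = np.array([0] * 9 + [1])

# A classifier that always predicts the majority class.
y_pred = np.zeros(10, dtype=int)

# Hypothetical predicted probabilities for the positive class.
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.35])

# Accuracy rewards the majority-class guess with 9 of 10 correct.
acc = accuracy_score(y_true, y_pred)

# ROC AUC instead measures how well the scores rank the positive above
# the negatives: here it outranks 8 of the 9 negatives.
auc = roc_auc_score(y_true, y_score)
print(acc, auc)
```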

In [None]:
feature_importances.sort_values('importance', ascending=False).head(10)