# Data Analysis
This notebook connects to GCP, reads the dataset from a GCS bucket, segments customers, and writes the results back to the bucket.

To make the connection, we use a [GCP service account](https://cloud.google.com/iam/docs/service-account-overview) that holds permissions to access our bucket.
### Steps
1. Open the Service accounts page in your GCP account.
2. Open `deng-capstone-service-account`.
3. Create a new key file and download it locally for use in the next step. Rename the file to something concise.
4. Set the path of the key file as an option in your Spark configuration: `spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile","/path/to/file/<renamed>.json")`
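
Before passing the path to Spark, it can help to confirm that the downloaded file looks like a service account key. The following is a minimal sketch using only the standard library. The helper name `check_key_file` is hypothetical and not part of this notebook. It assumes the key was downloaded in JSON format and only inspects the file structure, without contacting GCP.

```python
import json
from pathlib import Path


def check_key_file(path):
    """Return the service account email if the key file looks valid.

    Hypothetical helper: it checks the JSON structure only and does
    not verify that the key is active or has bucket permissions.
    """
    key_path = Path(path)
    if not key_path.is_file():
        raise FileNotFoundError(f"Key file not found: {key_path}")

    with key_path.open(encoding="utf-8") as fh:
        key = json.load(fh)

    # Fields expected in a downloaded service account JSON key.
    required = ("type", "client_email", "private_key")
    missing = [field for field in required if field not in key]
    if missing:
        raise ValueError(f"Key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"Unexpected key type: {key['type']!r}")

    return key["client_email"]
```

Calling `check_key_file("/path/to/file/<renamed>.json")` before the Spark setup cell surfaces a wrong path or a wrong file early, rather than as a connector error on the first read.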

### Resources
1. https://gobiviswa.medium.com/google-cloud-storage-handson-connecting-using-pyspark-5eefc0d8d932
2. https://cloud.google.com/iam/docs/service-account-overview


## Spark Application setup

In [None]:
from datetime import datetime, timedelta

from pyspark import SparkConf
from pyspark.sql import SparkSession


spark = SparkSession.builder \
    .appName('data-engineering-capstone') \
    .config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .getOrCreate()

# Set GCS credentials. Ensure the path points to your downloaded key file.
# A raw string keeps the backslashes in the Windows path literal.
spark._jsc.hadoopConfiguration().set(
    "google.cloud.auth.service.account.json.keyfile",
    r"C:\pro\gcp-key.json")



## Read from GCS

In [None]:
# file path to data in GCS bucket

file_path = "gs://ecommerce-customer/e-commerce-customer-behavior.csv"


df = spark.read.csv(file_path, header=True, inferSchema=True)

df.show(5)

In [None]:
# Describe the dataset
df.describe().show()


In [None]:
# Drop all rows that contain any null values in any column
df = df.dropna()


In [None]:
# Remove duplicate rows from the DataFrame
df = df.dropDuplicates()


In [None]:
# Display summary statistics of the DataFrame (count, mean, stddev, min, max) for numerical columns
df.describe().show()


In [None]:
# Segment customers by spending level, based on the "Total Spend" column
from pyspark.sql.functions import when

segmentation_df = df.withColumn("SpendingCategory", 
                                when(df["Total Spend"] > 1000, "High")
                                .when(df["Total Spend"] > 500, "Medium")
                                .otherwise("Low"))

segmentation_df.show()
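
The thresholds in the cell above use strict comparisons, so a customer who spent exactly 1000 lands in "Medium" and one who spent exactly 500 lands in "Low". The sketch below mirrors the same `when`/`otherwise` chain in plain Python to make the boundaries easy to check. The function name `spending_category` is hypothetical and used only for illustration.

```python
def spending_category(total_spend):
    """Plain-Python mirror of the when/otherwise chain (illustrative only)."""
    # A null comparison is not true in Spark, so nulls fall through
    # to the otherwise() branch.
    if total_spend is not None and total_spend > 1000:
        return "High"
    if total_spend is not None and total_spend > 500:
        return "Medium"
    return "Low"


# Print the category at and around each threshold.
for value in (1000.01, 1000, 500.01, 500, 0):
    print(value, spending_category(value))
```

If the boundary values should count toward the higher category, the comparisons would need to be `>=` instead of `>`.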


In [None]:
from pyspark.sql.functions import current_date, datediff

# Identify inactive customers (no purchase in the last 30 days)
inactive_customers = segmentation_df.filter(segmentation_df["Days Since Last Purchase"] > 30)

# Identify recent customers (purchased within the last 30 days)
recent_customers = segmentation_df.filter(segmentation_df["Days Since Last Purchase"] <= 30)




In [None]:
# Display the content of the 'inactive_customers' DataFrame
inactive_customers.show()


In [None]:
# Display the content of the 'recent_customers' DataFrame
recent_customers.show()


In [None]:
from datetime import datetime

# Get current datetime in the format MMDDYYYYHHMMSS
datetime_now = datetime.now().strftime("%m%d%Y%H%M%S")

# Define the base output path using formatted strings
base_output_path = f"gs://ecommerce-customer/output/{datetime_now}"

# Write inactive customers to GCS with overwrite mode
inactive_customers_output_path = f"{base_output_path}/inactive_customers"
inactive_customers.write.csv(inactive_customers_output_path, header=True, mode='overwrite')

# Write recent customers to GCS with overwrite mode
recent_customers_output_path = f"{base_output_path}/recent_customers"
recent_customers.write.csv(recent_customers_output_path, header=True, mode='overwrite')
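
One design note on the output folder name: the month-first format used above does not sort chronologically across a year boundary when folders are listed by name. The sketch below compares it with a year-first format. This is an optional alternative, not what the notebook does, and the two run timestamps are made-up values for illustration.

```python
from datetime import datetime

# Two hypothetical run times on either side of a year boundary.
runs = [
    datetime(2023, 12, 31, 23, 59, 0),
    datetime(2024, 1, 1, 0, 1, 0),
]

# Format used in the notebook (MMDDYYYYHHMMSS).
month_first = [run.strftime("%m%d%Y%H%M%S") for run in runs]

# Year-first alternative (YYYYMMDDHHMMSS).
year_first = [run.strftime("%Y%m%d%H%M%S") for run in runs]

# Check whether sorting by name preserves chronological order.
print("month-first keeps order:", sorted(month_first) == month_first)
print("year-first keeps order:", sorted(year_first) == year_first)
```

Either format gives each run its own folder. The year-first variant only matters if the output folders are later browsed or processed in name order.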



In [None]:

df = segmentation_df
df.show()