<a href="https://colab.research.google.com/github/TanishqLambhate/Data-Science-Training/blob/pyspark/Pyspark_day_4_Problem_Statement.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=eedd8729c2340812eab3d9698ae79827aca0fdc114b5de03b0ac32990e1bd543
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


In [5]:
# ### Problem Statement: Employee Salary Data Transformation and Analysis

# A company has collected a CSV file containing employee data, including names, ages, genders, and salaries. The company’s management is interested in conducting a detailed analysis of their workforce, focusing on the salary structure. They need to implement an ETL (Extract, Transform, Load) pipeline to transform the raw employee data into a more usable format for business decision-making.

# **Objective**:
# The goal is to build an ETL pipeline using PySpark to transform the raw employee data by applying filtering, creating new salary-related metrics, and calculating salary statistics by gender. After the transformations, the processed data should be saved in an efficient file format (Parquet) for further analysis and reporting.

# ### **Task Requirements**:
# 1. **Extract**:
#    - Load the employee data from a CSV file containing the following columns: `name`, `age`, `gender`, and `salary`.
import pandas as pd

# Create sample employee data to write out as a CSV file
data = {
    "name": ["John", "Jane", "Mike", "Emily", "Alex"],
    "age": [28, 32, 45, 23, 36],
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "salary": [60000, 72000, 84000, 52000, 67000]
}

df = pd.DataFrame(data)

# Save the DataFrame as a CSV file
csv_file_path = "/content/sample_people.csv"
df.to_csv(csv_file_path, index=False)

# Confirm the CSV file is created
print(f"CSV file created at: {csv_file_path}")

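from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for the ETL job
spark = SparkSession.builder.appName("EmployeeSalaryETL").getOrCreate()

# Extract: read the CSV with its header row; inferSchema makes age and salary
# numeric columns instead of strings
df_people = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(csv_file_path)
)
df_people.show()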
# 2. **Transform**:
#    - **Filter**: Only include employees aged 30 and above in the analysis.
#    - **Add New Column**: Calculate a 10% bonus on the current salary for each employee and add it as a new column (`salary_with_bonus`).
#    - **Aggregation**: Group the employees by gender and compute the average salary for each gender.

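# Register the DataFrame as a temporary view so it can be queried with SQL
df_people.createOrReplaceTempView("people_temp_view")

# Filter: keep only employees aged 30 and above
result_temp_view = spark.sql("SELECT * FROM people_temp_view WHERE age >= 30")
result_temp_view.show()

# Add new column: salary plus a 10% bonus
result_temp_view = result_temp_view.withColumn("salary_with_bonus", result_temp_view["salary"] * 1.1)
result_temp_view.show()

# Aggregation: average by gender, kept in its own DataFrame so that
# result_temp_view still holds the per-employee rows for the Load step.
# This averages the bonus-adjusted salary; use avg("salary") for the base salary.
avg_by_gender = result_temp_view.groupBy("gender").avg("salary_with_bonus")
avg_by_gender.show()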


# 3. **Load**:
#    - Save the transformed data (including the bonus salary) in a Parquet file format for efficient storage and retrieval.
#    - Ensure the data can be easily accessed for future analysis or reporting.

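# mode("overwrite") lets this cell be re-run without failing because the path already exists
result_temp_view.write.mode("overwrite").parquet("/content/people_bonus.parquet")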



# ### **Key Deliverables**:
# 1. A PySpark-based ETL pipeline that performs the following:
#    - Loads the raw employee CSV data.
#    - Applies filtering, transformations, and aggregations.
#    - Saves the transformed data to a Parquet file.
# 2. A summary report showing the following:
#    - The list of employees aged 30 and above with their original salary and salary with the 10% bonus.
#    - The average salary per gender.

# ### **Sample Data**:

# | name  | age  | gender | salary  |
# |-------|------|--------|---------|
# | John  | 28   | Male   | 60000   |
# | Jane  | 32   | Female | 72000   |
# | Mike  | 45   | Male   | 84000   |
# | Emily | 23   | Female | 52000   |
# | Alex  | 36   | Male   | 67000   |

# ### **Expected Output**:

# 1. A filtered DataFrame that shows the employees aged 30 and above, with an additional column `salary_with_bonus` (10% bonus added to their salary).

# 2. A Parquet file containing the transformed data.

# 3. A DataFrame showing the average salary by gender.

# ### **Challenges**:
# - The raw data may contain employees below the age threshold of 30, who need to be filtered out.
# - Calculating new metrics (like salary bonuses) and ensuring data integrity during transformation.
# - Efficiently saving the transformed data in a format suitable for large-scale data analytics (e.g., Parquet).

# ### **Success Criteria**:
# - The company should be able to retrieve the filtered and transformed data with accurate salary information, including the bonus.
# - The saved Parquet file should be structured for efficient retrieval and further analysis.
# - The aggregated data (average salary by gender) should provide insights into the company's pay structure across genders.


CSV file created at: /content/sample_people.csv
+-----+---+------+------+
| name|age|gender|salary|
+-----+---+------+------+
| John| 28|  Male| 60000|
| Jane| 32|Female| 72000|
| Mike| 45|  Male| 84000|
|Emily| 23|Female| 52000|
| Alex| 36|  Male| 67000|
+-----+---+------+------+

+----+---+------+------+
|name|age|gender|salary|
+----+---+------+------+
|Jane| 32|Female| 72000|
|Mike| 45|  Male| 84000|
|Alex| 36|  Male| 67000|
+----+---+------+------+

+----+---+------+------+-----------------+
|name|age|gender|salary|salary_with_bonus|
+----+---+------+------+-----------------+
|Jane| 32|Female| 72000|          79200.0|
|Mike| 45|  Male| 84000|92400.00000000001|
|Alex| 36|  Male| 67000|          73700.0|
+----+---+------+------+-----------------+

+------+----------------------+
|gender|avg(salary_with_bonus)|
+------+----------------------+
|Female|               79200.0|
|  Male|               83050.0|
+------+----------------------+

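The value `92400.00000000001` in the bonus column above is a binary floating-point artifact of multiplying by `1.1`, not a data error. As a rough cross-check of the expected numbers, the same filter, bonus, and per-gender average can be recomputed in plain Python with exact decimal arithmetic. This is only a sketch: the names `employees`, `BONUS_RATE`, `bonus_exact`, `bonus_float`, and `avg_bonus_by_gender` are illustrative and do not appear in the notebook.

```python
from decimal import Decimal

# Sample rows copied from the notebook's data: (name, age, gender, salary)
employees = [
    ("John", 28, "Male", 60000),
    ("Jane", 32, "Female", 72000),
    ("Mike", 45, "Male", 84000),
    ("Emily", 23, "Female", 52000),
    ("Alex", 36, "Male", 67000),
]

BONUS_RATE = Decimal("1.1")  # salary plus a 10% bonus

# Filter: employees aged 30 and above
eligible = [row for row in employees if row[1] >= 30]

# Exact decimal result for each eligible employee
bonus_exact = {name: Decimal(salary) * BONUS_RATE for name, _, _, salary in eligible}

# Float arithmetic rounded to two decimals, which hides the trailing digits
bonus_float = {name: round(salary * 1.1, 2) for name, _, _, salary in eligible}

# Average bonus-adjusted salary per gender
grouped = {}
for name, _, gender, _ in eligible:
    grouped.setdefault(gender, []).append(bonus_exact[name])
avg_bonus_by_gender = {g: sum(vals) / len(vals) for g, vals in grouped.items()}

for name in bonus_exact:
    print(name, bonus_exact[name], bonus_float[name])
print(avg_bonus_by_gender)
```

If exact cents matter in the Spark job itself, rounding the new column or using a decimal column type would be the analogous choice there.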
