#### Question 1 (Filtering Data)
##### Given a DataFrame of employee details, keep only the employees who are older than 30 years.

**Input data** 
```
+----+-------+-----+
| ID | Name  | Age |
+----+-------+-----+
| 1  | John  | 28  |
| 2  | Alice | 34  |
| 3  | Bob   | 32  |
| 4  | David | 25  |
+----+-------+-----+
```
**Output data**

```
+----+-------+-----+
| ID | Name  | Age |
+----+-------+-----+
| 2  | Alice | 34  |
| 3  | Bob   | 32  |
+----+-------+-----+
```

In [0]:
data = [
    {'id': '1', 'name': 'John', 'age': 28},
    {'id': '2', 'name': 'Alice', 'age': 34},
    {'id': '3', 'name': 'Bob', 'age': 32},
    {'id': '4', 'name': 'David', 'age': 25}
]

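# Column names and types are inferred from the dict keys and values
df = spark.createDataFrame(data)

# Keep only rows where age is strictly greater than 30
df = df.filter("age > 30")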

display(df)

age,id,name
34,2,Alice
32,3,Bob
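
The expected output can be cross-checked without a cluster. The sketch below is plain Python, not Spark code, and it adds two hypothetical rows (Eve, Frank) that are not in the original data to show the edge cases: an age of exactly 30 is excluded by the strict `>`, and a missing age is excluded as well, mirroring how a Spark filter drops rows whose condition evaluates to null.

```python
# Plain-Python cross-check of the "older than 30" filter (not Spark code).
rows = [
    {"id": "1", "name": "John", "age": 28},
    {"id": "2", "name": "Alice", "age": 34},
    {"id": "3", "name": "Bob", "age": 32},
    {"id": "4", "name": "David", "age": 25},
    {"id": "5", "name": "Eve", "age": 30},      # hypothetical boundary row
    {"id": "6", "name": "Frank", "age": None},  # hypothetical missing age
]


def older_than(rows, threshold):
    """Keep rows whose age is known and strictly greater than threshold."""
    return [r for r in rows if r["age"] is not None and r["age"] > threshold]


kept = older_than(rows, 30)
print([r["name"] for r in kept])
```

If the requirement were "30 or older", the comparison would need to be `>=` in both this sketch and the Spark filter expression.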


#### Question 2 (Grouping and Aggregation)
##### Given a DataFrame of sales data, find the total sales for each product.

**Input data** 
```
+--------+-------+------+
| Product| Region| Sales|
+--------+-------+------+
| A      | North | 100  |
| A      | South | 150  |
| B      | North | 200  |
| B      | South | 300  |
+--------+-------+------+
```
**Output data**

```
+--------+------------+
| Product| Total_Sales|
+--------+------------+
| A      | 250        |
| B      | 500        |
+--------+------------+
```

In [0]:
from pyspark.sql.functions import sum as spark_sum  # alias avoids shadowing Python's built-in sum
data = [
    {'Product': 'A', 'Region': 'North', 'Sales': 100},
    {'Product': 'A', 'Region': 'South', 'Sales': 150},
    {'Product': 'B', 'Region': 'North', 'Sales': 200},
    {'Product': 'B', 'Region': 'South', 'Sales': 300}
]

df = spark.createDataFrame(data)

# Add up Sales within each Product group
df = df.groupBy("Product").agg(spark_sum('Sales').alias("Total_Sales"))

display(df)

Product,Total_Sales
A,250
B,500
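
For readers who want to verify the totals by hand, the same group-and-sum logic can be sketched in plain Python. This is only a cross-check of the expected output, not how Spark executes the aggregation; the function name `total_by_product` is made up for illustration.

```python
from collections import defaultdict

# Same sales rows as the question, as (product, region, sales) tuples.
sales = [
    ("A", "North", 100),
    ("A", "South", 150),
    ("B", "North", 200),
    ("B", "South", 300),
]


def total_by_product(records):
    """Sum the sales amount per product, ignoring the region."""
    totals = defaultdict(int)
    for product, _region, amount in records:
        totals[product] += amount
    return dict(totals)


print(total_by_product(sales))
```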


#### Question 3 (Joining DataFrames)
##### Given two DataFrames, join them on the 'ID' column, keeping only the IDs present in both (an inner join).

**Input data (dataframe 1)** 
```
+----+-------+
| ID | Name  |
+----+-------+
| 1  | John  |
| 2  | Alice |
| 3  | Bob   |
+----+-------+

```

**Input data (dataframe 2)** 
```
+----+-----+
| ID | Age |
+----+-----+
| 1  | 28  |
| 2  | 34  |
| 4  | 25  |
+----+-----+

```
**Output data**

```
+----+-------+-----+
| ID | Name  | Age |
+----+-------+-----+
| 1  | John  | 28  |
| 2  | Alice | 34  |
+----+-------+-----+


```

In [0]:
data1 = [
    {'ID': '1', 'Name': 'John'},
    {'ID': '2', 'Name': 'Alice'},
    {'ID': '3', 'Name': 'Bob'}
]

data2 = [
    {'ID': '1', 'Age': 28},
    {'ID': '2', 'Age': 34},
    {'ID': '4', 'Age': 25}
]

df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)

# Inner join (the default): keeps only IDs that appear in both DataFrames
df3 = df1.join(df2, on='ID', how='inner')

display(df3)

ID,Name,Age
1,John,28
2,Alice,34
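
IDs 3 and 4 disappear from the result because each exists in only one input. The plain-Python sketch below (not Spark code) contrasts an inner join with a left join on the same data. It assumes IDs are unique on each side, and the helper names `inner_join` and `left_join` are illustrative only.

```python
# The two inputs from the question, keyed by ID (IDs assumed unique per side).
names = {"1": "John", "2": "Alice", "3": "Bob"}
ages = {"1": 28, "2": 34, "4": 25}


def inner_join(left, right):
    """Keep only keys present on both sides."""
    return [(key, left[key], right[key]) for key in left if key in right]


def left_join(left, right):
    """Keep every key from the left side; missing right values become None."""
    return [(key, left[key], right.get(key)) for key in left]


print(inner_join(names, ages))
print(left_join(names, ages))
```

With a left join, Bob would be kept with a missing age, so the choice of join type decides whether unmatched rows survive.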


#### Question 4 (Handling Missing Data)
##### Given a DataFrame, fill missing values in the 'Age' column with the mean age.

**Input data** 
```
+----+-------+-----+
| ID | Name  | Age |
+----+-------+-----+
| 1  | John  | 28  |
| 2  | Alice | null|
| 3  | Bob   | 32  |
| 4  | David | null|
+----+-------+-----+
```
**Output data**

```
+----+-------+-----+
| ID | Name  | Age |
+----+-------+-----+
| 1  | John  | 28  |
| 2  | Alice | 30  |
| 3  | Bob   | 32  |
| 4  | David | 30  |
+----+-------+-----+

```

In [0]:
from pyspark.sql.functions import mean
data = [
    {'ID': '1', 'Name': 'John', 'Age': 28},
    {'ID': '2', 'Name': 'Alice', 'Age': None},
    {'ID': '3', 'Name': 'Bob', 'Age': 32},
    {'ID': '4', 'Name': 'David', 'Age': None}
]

df = spark.createDataFrame(data)

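# mean() ignores nulls, so this averages only the known ages (28 and 32)
avg_age = df.select(mean('Age')).collect()[0][0]

# Replace nulls in the Age column with the computed mean
df = df.fillna({'Age': avg_age})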

display(df)


Age,ID,Name
28,1,John
30,2,Alice
32,3,Bob
30,4,David
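
The arithmetic behind the fill value can be checked in plain Python. The sketch below is not Spark code. It assumes, as an expectation worth verifying on your Spark version, that a fractional mean written into an integer column loses its fractional part, which `int()` imitates here. The second data set (28, 33, missing) is hypothetical and only there to show that effect.

```python
def mean_ignoring_none(values):
    """Average of the known values; None entries are skipped."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to average")
    return sum(known) / len(known)


def fill_with(values, replacement):
    """Replace every None with the given replacement value."""
    return [replacement if v is None else v for v in values]


ages = [28, None, 32, None]      # ages from the question
avg = mean_ignoring_none(ages)   # (28 + 32) / 2
filled = fill_with(ages, int(avg))
print(avg, filled)

# Hypothetical data where the mean is fractional: int() drops the .5
other = [28, 33, None]
other_avg = mean_ignoring_none(other)
print(other_avg, fill_with(other, int(other_avg)))
```

If the fractional part matters, one option is to cast the column to a floating-point type before filling.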


#### Question 5 (Sorting Data)
##### Given a DataFrame of product sales, sort the data by 'Sales' in descending order.

**Input data** 
```
+--------+------+
| Product| Sales|
+--------+------+
| A      | 150  |
| B      | 200  |
| C      | 100  |
+--------+------+

```
**Output data**

```
+--------+------+
| Product| Sales|
+--------+------+
| B      | 200  |
| A      | 150  |
| C      | 100  |
+--------+------+

```

In [0]:
data = [
    {'Product': 'A', 'Sales': 150},
    {'Product': 'B', 'Sales': 200},
    {'Product': 'C', 'Sales': 100}
]

df = spark.createDataFrame(data)

# Sort by Sales, highest value first
display(df.orderBy('Sales', ascending=False))


Product,Sales
B,200
A,150
C,100


#### Question 6 (Pivoting Data)
##### Given a DataFrame, pivot the data to show sales per region for each product.

**Input data** 
```
+--------+-------+------+
| Product| Region| Sales|
+--------+-------+------+
| A      | North | 100  |
| A      | South | 150  |
| B      | North | 200  |
| B      | South | 300  |
+--------+-------+------+

```
**Output data**

```
+--------+------+------+
| Product| North| South|
+--------+------+------+
| A      | 100  | 150  |
| B      | 200  | 300  |
+--------+------+------+

```

In [0]:
data = [
    {'Product': 'A', 'Region': 'North', 'Sales': 100},
    {'Product': 'A', 'Region': 'South', 'Sales': 150},
    {'Product': 'B', 'Region': 'North', 'Sales': 200},
    {'Product': 'B', 'Region': 'South', 'Sales': 300}
]

df = spark.createDataFrame(data)

# One row per Product, one column per distinct Region, each cell holding the summed Sales
df = df.groupBy('Product').pivot('Region').sum('Sales')

display(df)



Product,North,South
B,200,300
A,100,150
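
The displayed rows come back as B then A, while the expected table lists A first. A grouped result carries no guaranteed row order, so an explicit sort is needed whenever order matters. The plain-Python sketch below (not Spark code, with the illustrative helper name `pivot_sum`) reproduces the pivot and sorts by product before printing.

```python
# Same rows as the question, as (product, region, sales) tuples.
sales = [
    ("A", "North", 100),
    ("A", "South", 150),
    ("B", "North", 200),
    ("B", "South", 300),
]


def pivot_sum(records):
    """Build {product: {region: summed sales}} from flat records."""
    table = {}
    for product, region, amount in records:
        row = table.setdefault(product, {})
        row[region] = row.get(region, 0) + amount
    return table


pivoted = pivot_sum(sales)
# Sort explicitly so the printed order does not depend on input order.
for product in sorted(pivoted):
    print(product, pivoted[product]["North"], pivoted[product]["South"])
```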


#### Question 7 (Window Functions)
##### Given a DataFrame of employee salaries, add a column that shows the rank of each employee within their department based on salary, with the highest salary ranked 1.

**Input data** 
```
+----+-------+---------+------+
| ID | Name  | Dept    |Salary|
+----+-------+---------+------+
| 1  | John  | IT      | 5000 |
| 2  | Alice | HR      | 6000 |
| 3  | Bob   | IT      | 7000 |
| 4  | David | HR      | 5500 |
+----+-------+---------+------+


```
**Output data**

```
+----+-------+---------+------+----+
| ID | Name  | Dept    |Salary|Rank|
+----+-------+---------+------+----+
| 3  | Bob   | IT      | 7000 |  1 |
| 1  | John  | IT      | 5000 |  2 |
| 2  | Alice | HR      | 6000 |  1 |
| 4  | David | HR      | 5500 |  2 |
+----+-------+---------+------+----+

```

In [0]:
from pyspark.sql.functions import rank, col, desc
from pyspark.sql.window import Window
data = [
    {'ID': 1, 'Name': 'John', 'Dept': 'IT', 'Salary': 5000},
    {'ID': 2, 'Name': 'Alice', 'Dept': 'HR', 'Salary': 6000},
    {'ID': 3, 'Name': 'Bob', 'Dept': 'IT', 'Salary': 7000},
    {'ID': 4, 'Name': 'David', 'Dept': 'HR', 'Salary': 5500}
]

df = spark.createDataFrame(data)

# rank() gives equal salaries the same rank; row_number() would number them arbitrarily
dept_window = Window.partitionBy('Dept').orderBy(desc('Salary'))
df = df.withColumn('Rank', rank().over(dept_window))

display(df)


Dept,ID,Name,Salary,Rank
HR,2,Alice,6000,1
HR,4,David,5500,2
IT,3,Bob,7000,1
IT,1,John,5000,2
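
The sample data has no tied salaries, so `row_number`, `rank`, and `dense_rank` would all produce the same column here. They differ only on ties. The plain-Python sketch below (not Spark code) applies the three numbering schemes to one hypothetical department whose salaries include a tie; the salary list is invented for illustration.

```python
def row_numbers(sorted_desc):
    """Unique consecutive numbers, regardless of ties."""
    return list(range(1, len(sorted_desc) + 1))


def ranks(sorted_desc):
    """Ties share a rank; the next distinct value skips ahead (1, 2, 2, 4)."""
    result = []
    for i, value in enumerate(sorted_desc):
        if i > 0 and value == sorted_desc[i - 1]:
            result.append(result[-1])
        else:
            result.append(i + 1)
    return result


def dense_ranks(sorted_desc):
    """Ties share a rank; no gaps afterwards (1, 2, 2, 3)."""
    result = []
    for i, value in enumerate(sorted_desc):
        if i == 0:
            result.append(1)
        elif value == sorted_desc[i - 1]:
            result.append(result[-1])
        else:
            result.append(result[-1] + 1)
    return result


# Hypothetical salaries for a single department, already sorted descending.
salaries = [7000, 5000, 5000, 4000]
print(row_numbers(salaries))
print(ranks(salaries))
print(dense_ranks(salaries))
```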


#### Question 8 (String Manipulation)
##### Given a DataFrame of employee details, create a new column 'Initials' that contains the first letter of each employee's name.

**Input data** 
```
+----+-------+
| ID | Name  |
+----+-------+
| 1  | John  |
| 2  | Alice |
| 3  | Bob   |
| 4  | David |
+----+-------+


```
**Output data**

```
+----+-------+--------+
| ID | Name  |Initials|
+----+-------+--------+
| 1  | John  | J      |
| 2  | Alice | A      |
| 3  | Bob   | B      |
| 4  | David | D      |
+----+-------+--------+


```

In [0]:
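from pyspark.sql.functions import col
data = [
    {'ID': 1, 'Name': 'John'},
    {'ID': 2, 'Name': 'Alice'},
    {'ID': 3, 'Name': 'Bob'},
    {'ID': 4, 'Name': 'David'}
]

df = spark.createDataFrame(data)

# substr positions are 1-based: take one character starting at position 1
df = df.withColumn("Initials", col('Name').substr(1, 1))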

display(df)


ID,Name,Initials
1,John,J
2,Alice,A
3,Bob,B
4,David,D
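
Taking the first character matches the expected output because every name here is a single word. For multi-word names, "initials" would normally mean the first letter of each word. The plain-Python sketch below (not Spark code) shows both interpretations; the name "Mary Ann Smith" is a hypothetical example and the helper names are illustrative.

```python
def first_letter(name):
    """First character of the name, or an empty string for an empty name."""
    return name[:1]


def initials(name):
    """First letter of every whitespace-separated word, joined together."""
    return "".join(part[0] for part in name.split())


for name in ["John", "Alice", "Bob", "David", "Mary Ann Smith"]:
    print(name, first_letter(name), initials(name))
```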
