from pyspark.sql import SparkSession, Row
import collections


spark = SparkSession.builder.appName("teenagers").getOrCreate()



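def mapper(line):
    # Each line has the form "id,name,age,numFriends"
    data = line.split(',')
    id = int(data[0])
    name = data[1]  # already a str in Python 3; str(x.encode(...)) would yield "b'...'"
    age = int(data[2])
    numFriends = int(data[3])

    # Keyword arguments name the fields, so the DataFrame gets proper column names
    return Row(id=id, name=name, age=age, numFriends=numFriends)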
    

lines = spark.sparkContext.textFile("fakeFriends.csv")

people = lines.map(mapper)

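# createDataFrame is a SparkSession method; the schema is inferred from the Row fields
peopleSchema = spark.createDataFrame(people).cache()
# The view name must match the table name used in the SQL query
peopleSchema.createOrReplaceTempView("people")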

teenagers = spark.sql("select * from people where age > 10 and age < 21")

for teen in teenagers.collect():
    print(teen)


peopleSchema.groupBy("age").count().orderBy("age").show()


spark.stop()


### **Step-by-Step Explanation**

#### 1. **Importing Required Libraries**
```python
from pyspark.sql import SparkSession, Row
import collections
```
- **`SparkSession`**: Entry point to using Spark SQL. It allows creating DataFrames and executing SQL queries.
- **`Row`**: Used to create a row-like structure for DataFrame creation.
- **`collections`**: Standard Python library (not used in this code).

---

#### 2. **Creating a Spark Session**
```python
spark = SparkSession.builder.appName("teenagers").getOrCreate()
```
- Creates a Spark session with the application name `"teenagers"`. `getOrCreate()` returns the existing session if one is already running; otherwise it starts a new one.

---

#### 3. **Defining the Mapper Function**
```python
def mapper(line):
    data = line.split(',')
    id = int(data[0])
    name = data[1]  # already a str in Python 3; str(x.encode(...)) would yield "b'...'"
    age = int(data[2])
    numFriends = int(data[3])

    # Keyword arguments name the fields, so the DataFrame gets proper column names
    return Row(id=id, name=name, age=age, numFriends=numFriends)
```
- **Purpose**: Converts a line from the CSV file into a structured `Row` object.
- **How it works**:
  1. Splits the line on `,` to separate the columns.
  2. Converts `id`, `age`, and `numFriends` to integers and keeps `name` as a string.
  3. Returns a `Row` built with keyword arguments, so the field names become the DataFrame's column names.
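
A bare `split(',')` does not handle quoted commas or malformed rows. The following is a minimal sketch of a more defensive parser, assuming the same four-column layout. `parse_friend_line` is a hypothetical helper name that does not appear in the original code; it uses only the standard library and returns a plain dict, which could be passed to `Row(**fields)` inside the mapper.

```python
import csv


def parse_friend_line(line):
    """Parse one 'id,name,age,numFriends' line; return None if it is malformed."""
    # csv.reader handles quoted fields such as "Smith, Ann"
    fields = next(csv.reader([line]), None)
    if fields is None or len(fields) != 4:
        return None
    try:
        return {
            "id": int(fields[0]),
            "name": fields[1].strip(),
            "age": int(fields[2]),
            "numFriends": int(fields[3]),
        }
    except ValueError:
        # A non-numeric id, age, or numFriends value
        return None


print(parse_friend_line("1,John,18,200"))
print(parse_friend_line('6,"Smith, Ann",17,42'))
print(parse_friend_line("bad line"))
```

In a Spark job, one option would be to map with this function and then filter out the `None` results before building the DataFrame.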

---

#### 4. **Loading the CSV File**
```python
lines = spark.sparkContext.textFile("fakeFriends.csv")
```
- **`textFile`**: Loads the `fakeFriends.csv` file as an RDD (Resilient Distributed Dataset), where each line is a string.

---

#### 5. **Mapping Lines to Rows**
```python
people = lines.map(mapper)
```
- Applies the `mapper` function to each line of the RDD, converting it into an RDD of `Row` objects.

---

#### 6. **Creating a DataFrame**
```python
peopleSchema = spark.createDataFrame(people).cache()
```
- **`createDataFrame`**: Called on the `SparkSession`, it converts the RDD of `Row` objects into a DataFrame (stored in `peopleSchema`) and infers the schema from the `Row` fields.
- **`cache`**: Caches the DataFrame in memory for faster access during subsequent operations.

---

#### 7. **Creating a Temporary SQL View**
```python
peopleSchema.createOrReplaceTempView("people")
```
- Registers the DataFrame as a temporary SQL table (`people`) to run SQL queries on it.

---

#### 8. **Finding Teenagers**
```python
teenagers = spark.sql("select * from people where age> 10 and age < 21")
```
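- Executes a SQL query that selects all rows where `age` is strictly greater than 10 and strictly less than 21, that is, ages 11 through 20. This is slightly wider than the usual definition of a teenager (13 through 19).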

---

#### 9. **Printing Teenagers**
```python
for teen in teenagers.collect():
    print(teen)
```
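- **`collect`**: Brings the results from the cluster to the driver program as a Python list of `Row` objects, so it should only be used when the result is small enough to fit in driver memory.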
- **`print`**: Prints each teenager’s details.

---

#### 10. **Grouping by Age**
```python
peopleSchema.groupBy("age").count().orderBy("age").show()
```
- Groups the DataFrame by `age` and counts the number of people in each age group.
- Orders the results by age and displays them.

---

#### 11. **Stopping the Spark Session**
```python
spark.stop()
```
- Stops the Spark session and releases resources.

---

### **Sample Input File (`fakeFriends.csv`)**
```
1,John,18,200
2,Jane,22,150
3,Mike,15,300
4,Sara,19,250
5,Tom,10,100
```

---

### **Step-by-Step Execution with Example**

#### **1. Mapper Function**
Input line: `"1,John,18,200"`

Output Row:
```python
Row(id=1, name='John', age=18, numFriends=200)
```

#### **2. DataFrame Creation**
DataFrame:
```
+---+----+---+----------+
| id|name|age|numFriends|
+---+----+---+----------+
|  1|John| 18|       200|
|  2|Jane| 22|       150|
|  3|Mike| 15|       300|
|  4|Sara| 19|       250|
|  5| Tom| 10|       100|
+---+----+---+----------+
```

#### **3. SQL Query for Teenagers**
SQL Query:
```sql
SELECT * FROM people WHERE age > 10 AND age < 21;
```

Result:
```
+---+----+---+----------+
| id|name|age|numFriends|
+---+----+---+----------+
|  1|John| 18|       200|
|  3|Mike| 15|       300|
|  4|Sara| 19|       250|
+---+----+---+----------+
```
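
The expected result can be cross-checked without a Spark cluster. This is a plain-Python sketch that applies the same predicate to the sample rows; `sample_rows` is copied from the sample file above, and `is_selected` is a hypothetical helper name used only for illustration.

```python
# Sample rows copied from fakeFriends.csv: (id, name, age, numFriends)
sample_rows = [
    (1, "John", 18, 200),
    (2, "Jane", 22, 150),
    (3, "Mike", 15, 300),
    (4, "Sara", 19, 250),
    (5, "Tom", 10, 100),
]


def is_selected(age):
    """Mirror of the SQL predicate: age > 10 AND age < 21 (both bounds exclusive)."""
    return 10 < age < 21


selected = [row for row in sample_rows if is_selected(row[2])]
for row in selected:
    print(row)
```

Tom (age 10) is excluded because the lower bound is exclusive, and Jane (age 22) falls above the upper bound.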

#### **4. Grouping by Age**
Output:
```
+---+-----+
|age|count|
+---+-----+
| 10|    1|
| 15|    1|
| 18|    1|
| 19|    1|
| 22|    1|
+---+-----+
```
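
The grouped counts can be reproduced in plain Python as a sanity check. This sketch assumes the five ages from the sample file above; `Counter` stands in for `groupBy("age").count()` and `sorted` stands in for `orderBy("age")`.

```python
from collections import Counter

# Ages from the sample file, in file order
ages = [18, 22, 15, 19, 10]

# Count occurrences of each age, then sort the (age, count) pairs by age
age_counts = sorted(Counter(ages).items())
for age, count in age_counts:
    print(age, count)
```

Every age appears once in the sample, so each count is 1; a larger file would show higher counts for repeated ages.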

---

### **Key Concepts Illustrated**
1. **RDD to DataFrame Conversion**: Shows how to turn raw CSV lines into a structured DataFrame with named columns.
2. **SQL Queries**: Demonstrates running SQL against a DataFrame registered as a temporary view.
3. **Grouping and Aggregation**: Groups rows by a column and counts them with the DataFrame API.
