from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.appName("WordCount").getOrCreate()


inputDF = spark.read.text("file:///C:/Users/Mahbub/Desktop/Data Engineering/Spark/book.txt")


words = inputDF.select(func.explode(func.split(inputDF.value,"\\W+")).alias("word"))

wordsWithoutEmptyString = words.filter(words.word != "")

smallerCaseWords = wordsWithoutEmptyString.select(func.lower(wordsWithoutEmptyString.word).alias("word"))

wordCount = smallerCaseWords.groupBy("word").count()

wordCountSorted = wordCount.sort("count")

wordCountSorted.show(wordCountSorted.count())


---

### Code Explanation

#### **1. Importing Required Libraries**
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
```
- `SparkSession`: Entry point for PySpark applications.
- `functions` (aliased as `func`): Provides utility functions like `explode`, `split`, `lower`, etc.

---

#### **2. Initializing SparkSession**
```python
spark = SparkSession.builder.appName("WordCount").getOrCreate()
```
- Creates a `SparkSession` for an application named "WordCount", or returns the existing session if one is already running.
- A `SparkSession` is required to work with DataFrames in PySpark.

---

#### **3. Reading the Input File**
```python
inputDF = spark.read.text("file:///C:/Users/Mahbub/Desktop/Data Engineering/Spark/book.txt")
```
- Reads the text file into a DataFrame named `inputDF`.
- Each line of the file becomes a row in the DataFrame under the column `value`.

**Sample Input File (`book.txt`):**
```
Hello world
This is a test
Hello Spark
```

**`inputDF`:**

| value           |
|------------------|
| Hello world      |
| This is a test   |
| Hello Spark      |

---

#### **4. Splitting Lines into Words**
```python
words = inputDF.select(func.explode(func.split(inputDF.value, "\\W+")).alias("word"))
```
- **`split(inputDF.value, "\\W+")`:**
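  - Splits each line on runs of one or more non-word characters (`\W+`, meaning anything other than letters, digits, and underscores). The doubled backslash in the Python string produces the regex `\W+`.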
  - Example: `"Hello world"` → `["Hello", "world"]`.

- **`explode`:**
  - Expands each array element into a separate row.
  - Example: `["Hello", "world"]` → Two rows: `Hello`, `world`.

**`words`:**

| word            |
|------------------|
| Hello           |
| world           |
| This            |
| is              |
| a               |
| test            |
| Hello           |
| Spark           |

---

#### **5. Filtering Out Empty Strings**
```python
wordsWithoutEmptyString = words.filter(words.word != "")
```
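- Removes rows where the word is an empty string (`""`). The split can produce these when a line begins or ends with a non-word character such as punctuation.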
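
To see where such empty strings come from, the sketch below uses Python's built-in `re.split` as a stand-in for Spark's `split`. This is an assumption made for illustration: the two regex engines differ in details, and the sample line is made up.

```python
import re

# Hypothetical sample line that ends with punctuation.
line = "Hello, world!"

# Splitting on runs of non-word characters leaves an empty string
# after the final "!" because nothing follows the last separator.
pieces = re.split(r"\W+", line)

# Dropping the empty strings mirrors the filter step.
non_empty = [p for p in pieces if p != ""]

print(pieces)     # ['Hello', 'world', '']
print(non_empty)  # ['Hello', 'world']
```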

---

#### **6. Converting Words to Lowercase**
```python
smallerCaseWords = wordsWithoutEmptyString.select(func.lower(wordsWithoutEmptyString.word).alias("word"))
```
- Converts all words to lowercase using `lower`, so that `Hello` and `hello` are counted as the same word.

**`smallerCaseWords`:**

| word            |
|------------------|
| hello           |
| world           |
| this            |
| is              |
| a               |
| test            |
| hello           |
| spark           |

---

#### **7. Counting Word Occurrences**
```python
wordCount = smallerCaseWords.groupBy("word").count()
```
- Groups the words and counts their occurrences.

**`wordCount`:**

| word   | count |
|--------|-------|
| hello  | 2     |
| world  | 1     |
| this   | 1     |
| is     | 1     |
| a      | 1     |
| test   | 1     |
| spark  | 1     |

---

#### **8. Sorting Words by Count**
```python
wordCountSorted = wordCount.sort("count")
```
- Sorts the DataFrame by the `count` column in ascending order, so the most frequent words appear last. The order among words with equal counts is not guaranteed.

**`wordCountSorted`:**

| word   | count |
|--------|-------|
| world  | 1     |
| this   | 1     |
| is     | 1     |
| a      | 1     |
| test   | 1     |
| spark  | 1     |
| hello  | 2     |

---

#### **9. Displaying Results**
```python
wordCountSorted.show(wordCountSorted.count())
```
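- Displays all rows of the sorted word count DataFrame. `show()` prints only the first 20 rows by default, so the total row count is passed as the number of rows to display.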

**Output:**
```
+-----+-----+
| word|count|
+-----+-----+
|world|    1|
| this|    1|
|   is|    1|
|    a|    1|
| test|    1|
|spark|    1|
|hello|    2|
+-----+-----+
```

---

### Key Concepts Illustrated
1. **DataFrame Operations**: Reading, transforming, and analyzing text data.
2. **Functions and DataFrame Methods**: Using the column functions `split`, `explode`, and `lower` together with the DataFrame methods `groupBy` and `count` for text processing.
3. **Chaining Transformations**: How multiple transformations are applied in sequence.