
---

### **1. Importing Libraries**
```python
from pyspark import SparkConf, SparkContext
import collections
```
- **`SparkConf`**: Configures your Spark application. It allows you to set parameters like the application name and the master node (local, cluster, etc.).
- **`SparkContext`**: Acts as the entry point to Spark functionality. It connects your application to the Spark cluster.
- **`collections`**: A standard-library module of container types. This script uses `collections.OrderedDict`, a dictionary that remembers the order in which its keys were inserted.

---

### **2. Setting Up Spark Configuration**
```python
conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
```
- **`setMaster("local")`**: Runs the Spark job locally on your machine with a single worker thread. Use `"local[*]"` to run with as many threads as your machine has cores, or replace `"local"` with a cluster URL for distributed computation.
- **`setAppName("RatingsHistogram")`**: Assigns a name to your Spark application, useful for tracking jobs in the Spark UI.
- **`SparkContext(conf=conf)`**: Initializes the Spark context using the configuration object.

---

### **3. Loading Data**
```python
lines = sc.textFile("file:///C:/Users/Mahbub/Desktop/Data Engineering/Spark/ml-100k/u.data")
```
- **`sc.textFile()`**: Reads a text file into an RDD (Resilient Distributed Dataset), with each line of the file becoming one element. The `file:///` prefix marks the path as a file on the local file system.
- **RDD**: A distributed collection of data that Spark processes in parallel.

---

### **4. Extracting Ratings**
```python
ratings = lines.map(lambda x: x.split()[2])
```
- **`lines.map()`**: Applies a function to each element of the RDD and returns a new RDD of the results. Here, each line is split on whitespace and the third field (index `[2]`), the movie rating, is kept.
- **`lambda x: x.split()[2]`**: An anonymous function that receives one line (`x`) and returns its rating field. The result is a string such as `"3"`, not an integer.
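
To make the indexing concrete, here is a minimal plain-Python sketch (no Spark required) of what the lambda does to a single line. The sample line is an assumed illustration of the `user id, movie id, rating, timestamp` layout, and `extract_rating` is a hypothetical helper name that does not appear in the original script.

```python
# Illustrative sample line in the assumed tab-separated layout:
# user id, movie id, rating, timestamp.
sample_line = "196\t242\t3\t881250949"


def extract_rating(line):
    """Return the third whitespace-separated field, as the lambda does."""
    return line.split()[2]


rating = extract_rating(sample_line)
print(rating)        # the rating field
print(type(rating))  # a str, because split() returns strings
```

Because `split()` is called without arguments, it splits on any run of whitespace, so the same code would handle tab-separated and space-separated lines alike.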

---

### **5. Counting Ratings**
```python
result = ratings.countByValue()
```
- **`countByValue()`**: An action that counts the occurrences of each unique value in the RDD and returns the result to the driver program as a dictionary, where each key is a unique value and each value is its count.
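
As a rough single-machine analogy, `collections.Counter` produces the same kind of value-to-count mapping. This sketch does not use Spark, and the sample ratings are made up for illustration.

```python
from collections import Counter

# Made-up rating strings, standing in for the contents of the ratings RDD.
sample_ratings = ["3", "5", "3", "1", "4", "3", "5"]

# Counter tallies how often each distinct value occurs, which is the same
# shape of result that countByValue() hands back to the driver.
counts = Counter(sample_ratings)

print(dict(counts))
```

The difference in Spark is where the work happens: the counting is spread across partitions, and only the small final dictionary is collected on the driver.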

---

### **6. Sorting Results**
```python
sortedResults = collections.OrderedDict(sorted(result.items()))
```
- **`result.items()`**: Returns the dictionary items (key-value pairs) from the `countByValue()` output.
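- **`sorted()`**: Sorts the `(key, value)` pairs by key. The keys are rating strings, so they are compared as text; this matches numeric order here because every rating is a single digit.
- **`OrderedDict`**: Stores the pairs in that sorted order. On Python 3.7 and later, a plain `dict` also preserves insertion order, so `dict(sorted(result.items()))` would produce the same output in the printing loop.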
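
The text-versus-number distinction matters once keys have more than one digit. The following sketch uses hypothetical counts (including a two-digit key that does not occur in this dataset) to show how the default sort differs from a numeric one.

```python
# Hypothetical counts keyed by rating strings, including a two-digit key.
counts = {"10": 4, "9": 7, "2": 1}

# The default sort compares the string keys character by character,
# so "10" lands before "2".
by_text = sorted(counts.items())

# Converting each key to int sorts by numeric value instead.
by_number = sorted(counts.items(), key=lambda item: int(item[0]))

print(by_text)
print(by_number)
```

If the data ever used a wider rating scale, passing a `key` function like this one (or converting ratings to `int` during extraction) would keep the histogram in numeric order.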

---

### **7. Printing Results**
```python
for key, value in sortedResults.items():
    print("%s %i" % (key, value))
```
- **`for key, value in sortedResults.items()`**: Iterates through the sorted dictionary.
- **`print("%s %i" % (key, value))`**: Prints one line per rating. `%s` inserts the key as a string and `%i` formats the count as an integer.
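
For readers more used to f-strings, the same output can be produced as sketched below. The name `sorted_results` and its counts are made-up stand-ins for the real `sortedResults`.

```python
from collections import OrderedDict

# Made-up stand-in for sortedResults; the counts are illustrative only.
sorted_results = OrderedDict([("1", 10), ("2", 20), ("3", 30)])

# f-string equivalent of: print("%s %i" % (key, value))
lines_out = [f"{key} {value:d}" for key, value in sorted_results.items()]

for line in lines_out:
    print(line)
```

Both styles produce identical text; the choice between them is a matter of preference.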

---

### **Key Concepts in PySpark**
- **RDD**: The fundamental data structure in Spark, designed for distributed processing.
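- **Transformations**: Functions like `map()` that define a new RDD from an existing one. They are lazy: Spark records the operation but does not run it yet.
- **Actions**: Functions like `countByValue()` that trigger the actual computation and return results to the driver program.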
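
Python's built-in `map()` is also lazy, which makes it a convenient analogy for the transformation/action split. The sketch below is only an analogy in plain Python, not Spark code; `extract` and the sample lines are hypothetical.

```python
calls = []  # records each line the function actually processes


def extract(line):
    calls.append(line)
    return line.split()[2]


sample_lines = ["1 10 4 0", "2 11 5 0"]  # made-up data

# Like a transformation: this builds a lazy iterator and runs nothing yet.
lazy = map(extract, sample_lines)
calls_before = len(calls)

# Like an action: consuming the iterator forces the work to happen.
values = list(lazy)
calls_after = len(calls)

print(calls_before, calls_after, values)
```

In the ratings script, `lines.map(...)` plays the role of the lazy step, and no file is read until `countByValue()` asks for a result.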

---

### **What This Code Does**
1. Reads a file containing movie ratings data.
2. Extracts the rating column from the data.
3. Counts the frequency of each rating value.
4. Sorts the results in ascending order of ratings.
5. Prints the sorted rating frequencies.
