# Perform a simple data transformation, such as filtering even numbers from a given list, using a PySpark RDD

In [1]:
sc


# Dataset Overview
- **Total Records:** 50 students
- **Columns:** 7 → id, name, age, gender, math, science, english
- **No missing values**

## Demographics
- **Age:** 18 – 25 years (average ≈ 21.5)
- **Gender:** 29 Female, 21 Male

## Academic Performance

**Math**
- Range: 40 – 100
- Mean: 68.9
- Std. Dev.: 17.6 (high variation)

**Science**
- Range: 44 – 99
- Mean: 70.2
- Std. Dev.: 14.6 (moderate variation)

**English**
- Range: 42 – 100
- Mean: 69.4
- Std. Dev.: 18.7 (highest variation)

## Key Insights
- Science is the strongest subject on average.
- English has the most variation in performance.
- Students perform differently across subjects (not uniform).


In [3]:
import random

In [4]:
random_numbers = [random.randint(1, 1000) for _ in range(100)]
print("Original List:")
print(random_numbers)

Original List:
[451, 24, 890, 809, 585, 219, 47, 283, 269, 935, 751, 197, 428, 446, 354, 3, 341, 946, 421, 90, 399, 1000, 202, 764, 154, 950, 568, 805, 47, 908, 75, 662, 82, 274, 245, 201, 124, 853, 774, 604, 431, 322, 835, 52, 691, 157, 756, 939, 192, 394, 730, 552, 840, 449, 12, 70, 681, 390, 469, 872, 409, 767, 976, 892, 562, 943, 559, 581, 498, 432, 886, 375, 697, 915, 588, 357, 847, 32, 717, 685, 69, 751, 358, 880, 91, 319, 115, 115, 499, 142, 322, 24, 491, 817, 95, 212, 396, 747, 965, 36]


In [5]:
numbers_rdd = sc.parallelize(random_numbers)

In [6]:
even_numbers_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)

In [7]:
even_numbers = even_numbers_rdd.collect()
print("\nEven Numbers:")
print(even_numbers)


Even Numbers:
[24, 890, 428, 446, 354, 946, 90, 1000, 202, 764, 154, 950, 568, 908, 662, 82, 274, 124, 774, 604, 322, 52, 756, 192, 394, 730, 552, 840, 12, 70, 390, 872, 976, 892, 562, 498, 432, 886, 588, 32, 358, 880, 142, 322, 24, 212, 396, 36]


# Summary
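Demonstrates a simple data transformation with PySpark RDDs: filtering the even numbers out of a list of 100 random integers.

Focuses on RDD operations (lazy transformations and the actions that trigger them), which are the building blocks for handling large datasets.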

## ⚙️ Operations Performed

### 1. Setup
- Imported Python's `random` module.
- Used the SparkContext already available in the notebook as `sc` to work with RDDs.
- Generated sample data: 100 random integers between 1 and 1000.

### 2. RDD Creation
- The Python list was converted into an RDD with `sc.parallelize()`; `sc.textFile()` is the corresponding entry point for file-based data.

### 3. Transformations
Operations that define a new RDD but do not execute immediately (lazy evaluation):

- `map()` → apply function to each element.
- `filter()` → filter elements based on condition.
- `flatMap()` → map each element to zero or more outputs and flatten the result (e.g., split lines into words).
- `distinct()` → remove duplicates.
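- `union()` / `intersection()` → combine two RDDs (all elements of both, or only the elements present in both).
- `groupByKey()` / `reduceByKey()` → group or aggregate values by key in an RDD of key-value pairs.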

### 4. Actions
Operations that trigger execution and return results:

- `collect()` → return all elements to the driver (use with care on large datasets).
- `count()` → count records.
- `first()` → first element.
- `take(n)` → first n elements.
- `reduce()` → combine all elements into one value using a commutative, associative function.

### 5. Data Transformation Examples
- Converting strings to key-value pairs.
- Filtering based on conditions (e.g., ages > 20).
- Aggregating numbers (sum, average, min, max).
- Word count (common beginner example).

### 6. Output & Verification
- Displaying transformed data with `.collect()`.
- Checking counts, sums, or sample records.
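
One way to verify the filter step is to recompute the expected result in plain Python and compare it with what `collect()` returned. The helper name `verify_even_filter` is hypothetical, and the check assumes the collected list keeps the original order, which is the case for a `parallelize` → `filter` → `collect` pipeline.

```python
def verify_even_filter(original, filtered):
    """Return True if `filtered` holds exactly the even values of `original`, in order."""
    expected = [x for x in original if x % 2 == 0]
    return filtered == expected


# First six values of the original list printed in this notebook,
# and the even values among them.
sample_original = [451, 24, 890, 809, 585, 219]
sample_filtered = [24, 890]

is_valid = verify_even_filter(sample_original, sample_filtered)
print(is_valid)
```

In the notebook itself, the same check would be `verify_even_filter(random_numbers, even_numbers)`.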
