In [1]:
from pyspark.sql import SparkSession


In [3]:
spark = (
    SparkSession
    .builder
    .appName("creating dataframe")
    .master("local[*]")
    .getOrCreate()
)

25/01/11 16:39:54 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [5]:
# Create a DataFrame from a list of tuples (rows) and a list of column names
data = [("Alice", 34), ("Bob", 36), ("Cathy", 29)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()

+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 36|
|Cathy| 29|
+-----+---+



# Read a CSV file, using the first row as column names and inferring the column types

df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.printSchema()
df.show()



In PySpark, the options `header=True` and `inferSchema=True` are used when reading data from files, such as CSV files, to control how the data is interpreted and loaded into a DataFrame.

---

### **1. `header=True`**
This option tells PySpark to treat the first row of the file as column headers.

- **Purpose**: It ensures that the column names in the DataFrame match the header row of the CSV file instead of defaulting to generic column names like `_c0`, `_c1`, etc.
- **Example**: If the CSV file looks like this:

  ```
  Name,Age,City
  Alice,34,New York
  Bob,36,Chicago
  ```

  With `header=True`, the DataFrame will look like:
  ```
  +-----+---+---------+
  | Name|Age|     City|
  +-----+---+---------+
  |Alice| 34| New York|
  |  Bob| 36|  Chicago|
  +-----+---+---------+
  ```

  Without `header=True`, the first row of the CSV file is treated as data, and the column names default to `_c0`, `_c1`, etc.:
  ```
  +-----+---+---------+
  |  _c0|_c1|      _c2|
  +-----+---+---------+
  | Name|Age|     City|
  |Alice| 34| New York|
  |  Bob| 36|  Chicago|
  +-----+---+---------+
  ```

---

### **2. `inferSchema=True`**
This option tells PySpark to automatically detect the data types of columns based on their values.

- **Purpose**: By default, PySpark reads every CSV column as `string` unless you specify a schema explicitly. `inferSchema=True` lets PySpark inspect the data and choose an appropriate type for each column (e.g., integer, double, string).

- **Example**: If the CSV file contains:
  ```
  Name,Age,City
  Alice,34,New York
  Bob,36,Chicago
  ```

  With `inferSchema=True`, the DataFrame will have these types:
  ```
  +-----+---+---------+
  | Name|Age|     City|
  +-----+---+---------+
  |Alice| 34| New York|
  |  Bob| 36|  Chicago|
  +-----+---+---------+

  Schema:
  root
   |-- Name: string (nullable = true)
   |-- Age: integer (nullable = true)
   |-- City: string (nullable = true)
  ```

  Without `inferSchema=True`, all columns will be treated as `string`:
  ```
  +-----+---+---------+
  | Name|Age|     City|
  +-----+---+---------+
  |Alice| 34| New York|
  |  Bob| 36|  Chicago|
  +-----+---+---------+

  Schema:
  root
   |-- Name: string (nullable = true)
   |-- Age: string (nullable = true)
   |-- City: string (nullable = true)
  ```

---

### **When to Use Them**
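- **`header=True`**: Use when the first row of the file contains column names rather than data.
- **`inferSchema=True`**: Use when you want PySpark to assign data types based on the file's content. Inference requires an extra pass over the data, so for large files an explicitly defined schema is usually the faster choice.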

---

### **Usage Example**
```python
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.printSchema()
df.show()
```

If any of this is unclear, watch Krish Nayak's video on the freeCodeCamp YouTube channel.