# DataFrame

* `spark.createDataFrame()` creates a `DataFrame` from an `RDD`, a `list`, a `pandas.DataFrame`, a `numpy.ndarray`, or a `pyarrow.Table`.

### This notebook covers **4 ways** to create a `DataFrame` in **PySpark**
1. Create a DataFrame from a list of tuples.
2. Create a DataFrame from a list of dictionaries.
3. Create a DataFrame from Row objects.
4. Create a DataFrame from a pandas DataFrame.

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Creating DataFrame").getOrCreate()

### 1. Create a DataFrame from a list of tuples.

In [0]:
data = [
  (1, 'name 1', 24, 150000, 'India'),
  (2, 'name 2', 25, 160000, 'USA'),
  (3, 'name 3', 26, 170000, 'Canada'),
  (4, 'name 4', 27, 180000, 'UK'),
  (5, 'name 5', 28, 190000, 'Australia')
]

# DDL-formatted schema string (a StructType can be used instead); it is not passed below
schema = '''
  id int,
  name string,
  age int,
  salary double,
  country string
'''

columns = ['id', 'name', 'age', 'salary', 'country']

df = spark.createDataFrame(data, columns)  # Only column names are given, so types are inferred; pass schema=schema to set them explicitly
df.show()
df.printSchema()

+---+------+---+------+---------+
| id|  name|age|salary|  country|
+---+------+---+------+---------+
|  1|name 1| 24|150000|    India|
|  2|name 2| 25|160000|      USA|
|  3|name 3| 26|170000|   Canada|
|  4|name 4| 27|180000|       UK|
|  5|name 5| 28|190000|Australia|
+---+------+---+------+---------+

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- salary: long (nullable = true)
 |-- country: string (nullable = true)



In [0]:
type(df)

pyspark.sql.connect.dataframe.DataFrame

### 2. Create a DataFrame from a list of dictionaries.

In [0]:
data = [
  {'id': 1, 'name': 'name 1', 'age': 24, 'salary': 150000, 'country': 'India'},
  {'id': 2, 'name': 'name 2', 'age': 25, 'salary': 160000, 'country': 'USA'},
  {'id': 3, 'name': 'name 3', 'age': 26, 'salary': 170000, 'country': 'Canada'},
  {'id': 4, 'name': 'name 4', 'age': 27, 'salary': 180000, 'country': 'UK'},
  {'id': 5, 'name': 'name 5', 'age': 28, 'salary': 190000, 'country': 'Australia'}
]

schema = '''
  id int,
  name string,
  age int,
  salary double,
  country string
'''

df = spark.createDataFrame(data, schema=schema)
df.show()
df.printSchema()

+---+------+---+--------+---------+
| id|  name|age|  salary|  country|
+---+------+---+--------+---------+
|  1|name 1| 24|150000.0|    India|
|  2|name 2| 25|160000.0|      USA|
|  3|name 3| 26|170000.0|   Canada|
|  4|name 4| 27|180000.0|       UK|
|  5|name 5| 28|190000.0|Australia|
+---+------+---+--------+---------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: double (nullable = true)
 |-- country: string (nullable = true)
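The list of dictionaries used here carries the same records as the list of tuples in section 1. As a small plain-Python sketch (no Spark needed; the names `rows`, `records`, and `tuples_again` are illustrative only), one form can be converted into the other:

```python
columns = ['id', 'name', 'age', 'salary', 'country']

# Two sample records in the tuple form used in section 1.
rows = [
    (1, 'name 1', 24, 150000, 'India'),
    (2, 'name 2', 25, 160000, 'USA'),
]

# Tuples -> dictionaries: pair each value with its column name.
records = [dict(zip(columns, row)) for row in rows]
print(records[0])

# Dictionaries -> tuples: read the values back in a fixed column order.
tuples_again = [tuple(rec[c] for c in columns) for rec in records]
print(tuples_again == rows)
```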



### 3. Create a DataFrame from Row objects.

### **Row()**
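* `Row` represents a single row in a `DataFrame`.
* A `Row` object can be created with named arguments, or by first defining a row class from field names and then calling it with positional values.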

In [0]:
from pyspark.sql import Row

data = [
  Row(id=1, name='name 1', age=24, salary=150000, country='India'),
  Row(id=2, name='name 2', age=25, salary=160000, country='USA'),
  Row(id=3, name='name 3', age=26, salary=170000, country='Canada'),
  Row(id=4, name='name 4', age=27, salary=180000, country='UK'),
  Row(id=5, name='name 5', age=28, salary=190000, country='Australia')
]

# Alternative: define a Row class from field names, then pass values positionally.
# data2 holds the same records as data; only data is used below.
Employee = Row('id', 'name', 'age', 'salary', 'country')

data2 = [
  Employee(1, 'name 1', 24, 150000, 'India'),
  Employee(2, 'name 2', 25, 160000, 'USA'),
  Employee(3, 'name 3', 26, 170000, 'Canada'),
  Employee(4, 'name 4', 27, 180000, 'UK'),
  Employee(5, 'name 5', 28, 190000, 'Australia'),
]

schema = '''
  id int,
  name string,
  age int,
  salary double,
  country string
'''

df = spark.createDataFrame(data, schema=schema)
df.show()
df.printSchema()

+---+------+---+--------+---------+
| id|  name|age|  salary|  country|
+---+------+---+--------+---------+
|  1|name 1| 24|150000.0|    India|
|  2|name 2| 25|160000.0|      USA|
|  3|name 3| 26|170000.0|   Canada|
|  4|name 4| 27|180000.0|       UK|
|  5|name 5| 28|190000.0|Australia|
+---+------+---+--------+---------+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: double (nullable = true)
 |-- country: string (nullable = true)



### 4. Create a DataFrame from a pandas DataFrame.

In [0]:
import pandas as pd

data = {
  'id': [1, 2, 3, 4, 5],
  'name': ['name 1', 'name 2', 'name 3', 'name 4', 'name 5'],
  'age': [24, 25, 26, 27, 28],
  'salary': [150000, 160000, 170000, 180000, 190000],
  'country': ['India', 'USA', 'Canada', 'UK', 'Australia']
}

pdf = pd.DataFrame(data)

schema = '''
  id int,
  name string,
  age int,
  salary double,
  country string
'''

df = spark.createDataFrame(pdf)  # No schema is passed, so types are inferred from pandas; use schema=schema to override them
df.show()
df.printSchema()

+---+------+---+------+---------+
| id|  name|age|salary|  country|
+---+------+---+------+---------+
|  1|name 1| 24|150000|    India|
|  2|name 2| 25|160000|      USA|
|  3|name 3| 26|170000|   Canada|
|  4|name 4| 27|180000|       UK|
|  5|name 5| 28|190000|Australia|
+---+------+---+------+---------+

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- salary: long (nullable = true)
 |-- country: string (nullable = true)
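The `long` types in the schema above follow from the pandas side, where integer columns typically default to `int64`. A minimal pandas-only sketch, assuming you would rather adjust types before conversion than pass a Spark schema, is shown below; `pdf_typed` is an illustrative name, and how Spark maps each pandas dtype can depend on the conversion path, so `printSchema()` remains the check that matters.

```python
import pandas as pd

pdf = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['name 1', 'name 2', 'name 3'],
    'age': [24, 25, 26],
    'salary': [150000, 160000, 170000],
})

# Inspect the dtypes pandas chose for each column.
print(pdf.dtypes)

# Cast selected columns explicitly; astype returns a new DataFrame.
pdf_typed = pdf.astype({'id': 'int32', 'age': 'int32', 'salary': 'float64'})
print(pdf_typed.dtypes)

# With the notebook's SparkSession (not created in this sketch):
# df = spark.createDataFrame(pdf_typed)
# df.printSchema()
```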



In [0]:
spark.stop()