#### RDD to DataFrame
Like RDDs, DataFrames are immutable, distributed data structures in Spark. RDDs are Spark's fundamental data structure, but DataFrames are usually easier to work with, so it is worth knowing how to convert an RDD to a DataFrame.

In [9]:
import findspark
findspark.init()
import pyspark

# Import the Spark entry points
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Stop any SparkContext left over from an earlier cell so a new one can be created
SparkContext.getOrCreate().stop()

In [10]:
# Create a fresh SparkContext
sc = pyspark.SparkContext()

In [11]:
spark = SparkSession.builder.appName('abc').getOrCreate()

In [13]:
# Create a list of tuples
sample_list = [('Mona',20), ('Jennifer',34), ('John',20), ('Jim',26)]

# Create an RDD from the list
rdd = sc.parallelize(sample_list)

# Create a PySpark DataFrame
names_df = spark.createDataFrame(rdd, schema=['Name', 'Age'])

# Check the type of names_df
print("The type of names_df is", type(names_df))

The type of names_df is <class 'pyspark.sql.dataframe.DataFrame'>


In [20]:
# Local path to the FIFA 2018 dataset (adjust for your machine)
file_path_points = r'C:\Users\aperez\Documents\Python_projects\Spark\Fifa2018_dataset.csv'

In [44]:
# Create a DataFrame from the CSV file at file_path_points
people_df = spark.read.csv(file_path_points, header=True, inferSchema=True)

# Check the type of people_df
print("The type of people_df is", type(people_df))

#print(people_df.take(10))


The type of people_df is <class 'pyspark.sql.dataframe.DataFrame'>


In [46]:
# DataFrame Schema
people_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Photo: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Flag: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Club Logo: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
 |-- Acceleration: string (nullable = true)
 |-- Aggression: string (nullable = true)
 |-- Agility: string (nullable = true)
 |-- Balance: string (nullable = true)
 |-- Ball control: string (nullable = true)
 |-- Composure: string (nullable = true)
 |-- Crossing: string (nullable = true)
 |-- Curve: string (nullable = true)
 |-- Dribbling: string (nullable = true)
 |-- Finishing: string (nullable = true)
 |-- Free kick accuracy: string (nullable = true)
 |-- GK diving: string (nullable = true)
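
The schema above comes from `inferSchema=True`, and several columns that look numeric (for example `Value` and `Acceleration`) end up as `string`. The following is a rough mental model of type inference, written in plain Python. It is a simplified sketch and not Spark's actual implementation; the function name `infer_column_type` is hypothetical. The idea is that a column is only given a numeric type if every non-empty value parses as that type.

```python
def infer_column_type(values):
    """Guess a column type from raw CSV strings (simplified illustration only)."""
    # Empty strings are treated as nulls and ignored during inference.
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return "string"  # nothing to infer from, fall back to string

    def all_parse(cast):
        # True only if every value can be converted with `cast`.
        for v in non_empty:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "string"


print(infer_column_type(["20", "34", "26"]))    # every value is an integer
print(infer_column_type(["€95.5M", "€105M"]))   # currency symbol and suffix block numeric parsing
```

Under this model, a single non-numeric value is enough to make the whole column a string, which is one plausible reason a numeric-looking column is inferred as `string`.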


In [49]:
# Column names
people_df.columns

['_c0',
 'Name',
 'Age',
 'Photo',
 'Nationality',
 'Flag',
 'Overall',
 'Potential',
 'Club',
 'Club Logo',
 'Value',
 'Wage',
 'Special',
 'Acceleration',
 'Aggression',
 'Agility',
 'Balance',
 'Ball control',
 'Composure',
 'Crossing',
 'Curve',
 'Dribbling',
 'Finishing',
 'Free kick accuracy',
 'GK diving',
 'GK handling',
 'GK kicking',
 'GK positioning',
 'GK reflexes',
 'Heading accuracy',
 'Interceptions',
 'Jumping',
 'Long passing',
 'Long shots',
 'Marking',
 'Penalties',
 'Positioning',
 'Reactions',
 'Short passing',
 'Shot power',
 'Sliding tackle',
 'Sprint speed',
 'Stamina',
 'Standing tackle',
 'Strength',
 'Vision',
 'Volleys',
 'CAM',
 'CB',
 'CDM',
 'CF',
 'CM',
 'ID',
 'LAM',
 'LB',
 'LCB',
 'LCM',
 'LDM',
 'LF',
 'LM',
 'LS',
 'LW',
 'LWB',
 'Preferred Positions',
 'RAM',
 'RB',
 'RCB',
 'RCM',
 'RDM',
 'RF',
 'RM',
 'RS',
 'RW',
 'RWB',
 'ST']

In [54]:
# Summary statistics for the Potential column
people_df.describe('Potential').show()

+-------+-----------------+
|summary|        Potential|
+-------+-----------------+
|  count|            17981|
|   mean|71.19081252433124|
| stddev|6.102199325567456|
|    min|               46|
|    max|               94|
+-------+-----------------+
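
To make the five rows of the `describe()` output concrete, here is a minimal plain-Python sketch that computes the same kind of summary for the small list of ages used earlier in `sample_list`. The helper name `describe_values` is hypothetical, and the sketch assumes that `stddev` is the sample standard deviation (dividing by n - 1), which is how Spark's `stddev` function is defined.

```python
import math
import statistics


def describe_values(values):
    """Return count, mean, sample stddev, min and max for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


# Ages taken from sample_list: Mona, Jennifer, John, Jim
ages = [20, 34, 20, 26]
summary = describe_values(ages)
for name, value in summary.items():
    print(name, value)
```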



In [40]:
# Keep only the Name and Value columns
df_name = people_df.select('Name', 'Value')

In [42]:
df_name.show(2)

+-----------------+------+
|             Name| Value|
+-----------------+------+
|Cristiano Ronaldo|€95.5M|
|         L. Messi| €105M|
+-----------------+------+
only showing top 2 rows
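
The `Value` column is a string such as `€95.5M`, so it cannot be summed or averaged directly. Below is a minimal sketch of a plain-Python parser that turns such strings into numbers. The function name `parse_money` is hypothetical, and the handling of a `K` suffix is an assumption, since only `M` values appear in the output above. A function like this could be wrapped in a PySpark UDF to add a numeric column, but that step is not shown here.

```python
def parse_money(text):
    """Convert a string like '€95.5M' to a float (illustrative sketch)."""
    # Assumed suffixes: 'M' appears in the data shown; 'K' is a guess.
    multipliers = {"K": 1_000, "M": 1_000_000}

    cleaned = text.strip().lstrip("€")
    if not cleaned:
        return None  # nothing left to parse

    suffix = cleaned[-1].upper()
    if suffix in multipliers:
        return float(cleaned[:-1]) * multipliers[suffix]
    return float(cleaned)


print(parse_money("€95.5M"))
print(parse_money("€105M"))
```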



### Operating on DataFrames in PySpark

In [43]:
# Count the number of rows in the DataFrame
people_df.count()

17981