This project is a hands-on tutorial on using Apache Spark for data processing and analysis, with a focus on Spark SQL, DataFrames, Structured Streaming, and machine learning with MLlib. It includes code examples and explanations for the major Spark features and APIs.
- PySpark 3.4.0
- NumPy 1.24.3
- Pandas 2.0.1
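To confirm a local environment matches these pins, a quick check such as the following can be run (this assumes the packages are already installed, e.g. via pip):

```python
# Sanity-check installed versions against the ones this tutorial pins.
import pyspark
import numpy
import pandas

print(pyspark.__version__)  # expected: 3.4.0
print(numpy.__version__)    # expected: 1.24.3
print(pandas.__version__)   # expected: 2.0.1
```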
Spark SQL is Apache Spark's module for working with structured data. It lets SQL queries be mixed seamlessly with Spark programs: PySpark DataFrames support reading, writing, transforming, and analyzing data from both Python and SQL. Both APIs run on the same underlying execution engine, so queries are planned and optimized identically whichever one is used.
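As a minimal sketch of that mixing (the data and the view name here are illustrative, not taken from the project), a DataFrame built in Python can be registered as a temporary view and queried with SQL, with the equivalent DataFrame-API form alongside:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-mixing").getOrCreate()

# A small DataFrame built in Python...
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")

# ...queried with SQL. Both forms compile to the same optimized plan.
sql_result = spark.sql("SELECT name FROM people WHERE age > 30")
api_result = df.filter(df.age > 30).select("name")

sql_result.show()
api_result.show()
```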
The project includes examples of creating, transforming, and collecting RDDs (Resilient Distributed Datasets) for distributed data processing and analysis.
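As a taste of the RDD API, here is a minimal sketch (not one of the project's own examples) that parallelizes a local collection, chains lazy transformations, and triggers computation with an action:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

# Create an RDD from a local collection, transform it, then collect results.
numbers = sc.parallelize(range(1, 11))
squares = numbers.map(lambda x: x * x)            # transformation (lazy)
even_squares = squares.filter(lambda x: x % 2 == 0)
print(even_squares.collect())                     # action: [4, 16, 36, 64, 100]
```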
This section of the project provides explanations and examples of Spark SQL for data processing and analysis, covering common operations such as filtering, grouping, aggregating, and joining.
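A compact sketch of those operations in one SQL statement follows; the table names and rows are hypothetical, chosen only to make the example self-contained:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-ops").getOrCreate()

orders = spark.createDataFrame(
    [(1, "Alice", 120.0), (2, "Bob", 75.5), (3, "Alice", 30.0)],
    ["order_id", "customer", "amount"],
)
customers = spark.createDataFrame(
    [("Alice", "DE"), ("Bob", "US")],
    ["customer", "country"],
)
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

# Filter, join, group, and aggregate in a single SQL statement.
spark.sql("""
    SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer = c.customer
    WHERE o.amount > 50
    GROUP BY c.country
""").show()
```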
This section of the project focuses on Spark DataFrames for structured data processing and analysis. It includes examples of creating, manipulating, and querying DataFrames.
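The same kinds of operations can be expressed purely with the DataFrame API. The sketch below (columns and data are illustrative) creates a DataFrame, applies column expressions and a filter, and runs a grouped aggregation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

df = spark.createDataFrame(
    [("2023-05-01", "clicks", 10), ("2023-05-01", "views", 25),
     ("2023-05-02", "clicks", 7)],
    ["day", "event", "count"],
)

# Column expressions, filtering, and grouped aggregation with the DataFrame API.
daily = (
    df.withColumn("count", F.col("count").cast("long"))
      .filter(F.col("event") == "clicks")
      .groupBy("day")
      .agg(F.sum("count").alias("total_clicks"))
      .orderBy("day")
)
daily.show()
```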
This section of the project explains Spark Structured Streaming for real-time data processing and analysis: how to create streaming data sources, process data streams, and write the results to a sink.
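A self-contained sketch of that pipeline, using Spark's built-in `rate` source so no external system is needed (the window size and row rate are arbitrary choices for the demo):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows; handy for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Windowed aggregation over event time: count events per 10-second window.
counts = (
    stream.groupBy(F.window("timestamp", "10 seconds"))
          .agg(F.count("value").alias("events"))
)

# Write results to the console sink; awaitTermination() blocks until stopped.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```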
This section of the project shows how to use Spark's MLlib library for machine learning tasks such as classification and regression.
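As a minimal classification sketch (the toy data and column names are made up for illustration), MLlib's Pipeline API assembles feature columns into a vector and fits a logistic regression model:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy binary-classification data: two features and a 0/1 label.
data = spark.createDataFrame(
    [(0.0, 1.1, 0), (1.5, 0.3, 1), (0.2, 0.9, 0), (2.1, 0.1, 1)],
    ["f1", "f2", "label"],
)

# Assemble raw columns into the single vector column MLlib estimators expect.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(data)
model.transform(data).select("features", "label", "prediction").show()
```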