
PySpark

This project is a hands-on tutorial on using Apache Spark for data processing and analysis, with a focus on Spark SQL, DataFrames, Structured Streaming, and machine learning with MLlib. It includes code examples and explanations for the main Spark features and APIs.

Installation

The tutorial targets the following package versions:

  • PySpark 3.4.0
  • NumPy 1.24.3
  • Pandas 2.0.1
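
A minimal setup with pip, pinning the versions listed above (Spark also requires a local Java installation):

    pip install pyspark==3.4.0 numpy==1.24.3 pandas==2.0.1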

Usage

Spark SQL is the Apache Spark module for working with structured data. It lets SQL queries be mixed seamlessly with Spark programs: PySpark DataFrames support reading, writing, transforming, and analyzing data from both Python and SQL, and both front ends share the same underlying execution engine, so the full power of Spark is available either way.
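
As a minimal sketch of that mix (sample data and names invented for illustration), the DataFrame below is registered as a temporary view so the same data can be queried from SQL and from the DataFrame API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tutorial").getOrCreate()

    # A small DataFrame with invented sample data.
    people = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Carol", 29)],
        ["name", "age"],
    )

    # Register the DataFrame as a temporary view so SQL can reference it.
    people.createOrReplaceTempView("people")

    # Both queries below run on the same engine and return the same rows.
    spark.sql("SELECT name FROM people WHERE age > 30").show()
    people.filter(people.age > 30).select("name").show()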

RDD Operations

The project includes examples of creating, manipulating, and transforming RDDs for distributed data processing and analysis.
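
As an illustration (not code from the repository), a classic RDD pipeline distributes a local collection, applies lazy transformations, and triggers execution with actions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1, 11))             # distribute 1..10
    squares = rdd.map(lambda x: x * x)             # transformation (lazy)
    evens = squares.filter(lambda x: x % 2 == 0)   # transformation (lazy)

    print(evens.collect())                   # action: [4, 16, 36, 64, 100]
    print(evens.reduce(lambda a, b: a + b))  # action: 220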

Spark SQL

This section of the project provides explanations and examples of Spark SQL for data processing and analysis, covering common operations such as filtering, grouping, aggregating, and joining.
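
A short sketch of those operations on two invented tables, orders and customers, combined in a single SQL statement:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    orders = spark.createDataFrame(
        [(1, "c1", 20.0), (2, "c1", 35.0), (3, "c2", 10.0)],
        ["order_id", "customer_id", "amount"],
    )
    customers = spark.createDataFrame(
        [("c1", "Alice"), ("c2", "Bob")],
        ["customer_id", "name"],
    )
    orders.createOrReplaceTempView("orders")
    customers.createOrReplaceTempView("customers")

    # Filtering, joining, grouping, and aggregating in one query.
    spark.sql("""
        SELECT c.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
        WHERE o.amount > 5
        GROUP BY c.name
    """).show()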

DataFrames

This section of the project focuses on Spark DataFrames for structured data processing and analysis. It includes examples of creating, manipulating, and querying DataFrames.
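
The same kinds of operations look like this in the DataFrame API (column names and data invented for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("df-demo").getOrCreate()

    df = spark.createDataFrame(
        [("Alice", 34, "NY"), ("Bob", 45, "SF"), ("Carol", 29, "NY")],
        ["name", "age", "city"],
    )

    # Row filter plus a derived column.
    df.filter(F.col("age") > 30).withColumn("age_next_year", F.col("age") + 1).show()

    # Grouping and aggregation.
    df.groupBy("city").agg(F.avg("age").alias("avg_age")).show()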

Structured Streaming

The project provides an explanation of Spark Structured Streaming for real-time data processing and analysis. It covers how to create streaming data sources, process data streams, and output the results.
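
A minimal sketch using the built-in rate source, which generates timestamped rows, so the example runs without external infrastructure; a real job would typically read from Kafka, files, or a socket instead:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    # Source: the "rate" source emits (timestamp, value) rows continuously.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    # Processing: count events per 10-second window.
    counts = stream.groupBy(F.window("timestamp", "10 seconds")).count()

    # Sink: print each micro-batch result to the console.
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination(30)  # run for roughly 30 seconds
    query.stop()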

Machine Learning (MLlib)

The project includes an explanation of how to use Spark's MLlib library for machine learning tasks such as classification and regression.
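
A compact sketch of a typical MLlib classification pipeline on toy data (features and labels invented for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    # Toy training data: two numeric features and a binary label.
    train = spark.createDataFrame(
        [(0.0, 1.1, 0), (1.5, 0.3, 1), (0.2, 0.9, 0), (2.0, 0.1, 1)],
        ["f1", "f2", "label"],
    )

    # MLlib estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    model = Pipeline(stages=[assembler, lr]).fit(train)
    model.transform(train).select("features", "label", "prediction").show()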
