PySpark-Queries

A PySpark tutorial with different queries you can run in a notebook using PySpark. It is a very useful tool for analyzing large amounts of data.

What is PySpark

PySpark is the Python API for Apache Spark, used to process large datasets on a distributed cluster. It lets you write Python applications that use Apache Spark's capabilities.

Implementation

PySparkSQL is a PySpark library for applying SQL-like analysis to huge amounts of structured or semi-structured data; plain SQL queries can also be run through it. It can be connected to Apache Hive, so HiveQL can be applied as well. PySparkSQL is a wrapper over the PySpark core, and it introduced the DataFrame: a tabular representation of structured data similar to a table in a relational database management system.

Use Cases

Batch processing
PySpark RDDs and DataFrames are used to build batch pipelines that need high throughput.
Real-time processing
PySpark Streaming is used for real-time processing.
Machine Learning
PySpark ML and MLlib are used for machine learning.
Graph processing
GraphFrames is used for graph processing with PySpark (GraphX itself is available only from Scala/Java).
