<a href="https://colab.research.google.com/github/Shakorly/Machine_learning_with_PySpark/blob/main/overview_of_Machine_learning_with_PySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Chapter 1: MLlib**

Welcome to the "Machine Learning with PySpark" tutorial! In this first chapter, we'll introduce you to the core library we'll be using for building machine learning models on large datasets: MLlib.

Imagine you have a huge collection of something, like information about millions of houses across a country, and you want to predict the price of a new house based on its features (like size, location, number of rooms). Doing this with traditional machine learning tools might be slow or even impossible on a single computer if the dataset is too big to fit into memory.

This is where Spark comes in. Spark is a powerful engine designed to process very large datasets by distributing the work across many computers in a cluster.

Now, within this powerful Spark engine, we need specific tools for machine learning tasks like predicting prices (regression), classifying emails as spam or not spam (classification), or grouping similar customers together (clustering). This is exactly what MLlib provides.

Think of Spark as a large, well-equipped workshop capable of handling enormous projects. MLlib is the specialized toolbox within that workshop specifically designed for machine learning projects. It contains all the hammers, saws, and specific gadgets you need for tasks like training models, preparing your data for training, and making predictions, all built to work seamlessly with Spark's ability to handle big data.

**What is MLlib?**
MLlib is Spark's Machine Learning Library. Its primary goal is to provide a scalable and easy-to-use set of machine learning algorithms and tools that can run on large datasets distributed across a cluster of computers.

Instead of trying to load all your massive data onto one machine's memory, MLlib algorithms are designed to process data in chunks, in parallel, on different machines managed by Spark. This is the key to handling "big data" for machine learning.
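
To make the "chunks, in parallel" idea concrete, here is a minimal plain-Python sketch (not Spark code) of how an average can be built from small per-chunk summaries. The helper names partial_stats and combine and the sample prices are invented for illustration; in a real job Spark takes care of splitting the data and scheduling the work.

```python
from functools import reduce

def partial_stats(chunk):
    """Summarize one chunk as (sum, count); each worker could do this locally."""
    return (sum(chunk), len(chunk))

def combine(a, b):
    """Merge two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical house prices, split into chunks as if stored on three machines.
chunks = [[100, 200], [300], [400, 500, 600]]

partials = [partial_stats(c) for c in chunks]  # independent per-chunk work
total, count = reduce(combine, partials)       # only small summaries are merged
mean_price = total / count
print(mean_price)
```

No single step ever needs the full dataset in memory: each chunk is reduced to a tiny summary, and only the summaries are combined.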

MLlib offers tools for various common machine learning tasks, such as:

*   Classification: Categorizing data (e.g., is this a picture of a cat or a dog? Is this transaction fraudulent?).
*   Regression: Predicting a continuous value (e.g., what will the price of this house be? How much will this customer spend?).
*   Clustering: Grouping similar data points together (e.g., finding different customer segments).
*   Collaborative Filtering: Making recommendations (e.g., suggesting movies based on what other users liked).
*   Feature Extraction & Transformation: Preparing your data for the machine learning algorithms.

For our house price prediction example, we would use MLlib's regression algorithms.

**Why Use MLlib?**
You might already know about other great machine learning libraries like scikit-learn, TensorFlow, or PyTorch. These are excellent, but they are primarily designed to run on a single machine, possibly using multiple cores or GPUs on that machine.

When your dataset grows beyond what a single machine can handle, you need a distributed solution. MLlib, built on Spark, is designed precisely for this scenario. It allows you to use familiar machine learning concepts and algorithms but execute them on data spread across an entire cluster of machines.

Here is how MLlib's tools are imported in PySpark:

In [3]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors

print("Successfully imported SparkSession and MLlib tools")

Successfully imported SparkSession and MLlib tools


This code simply shows that you access MLlib's capabilities by importing specific tools or algorithms from the pyspark.ml package. When you install pyspark, you get MLlib automatically.

**MLlib: The Starting Point**
MLlib itself is the collection of algorithms and helper functions. However, to use MLlib, you first need to interact with Spark. The entry point for any Spark application, including those using MLlib, is something called a SparkSession.

The SparkSession is like getting the keys to the Spark workshop. Once you have it, you can load data and start using the tools (MLlib) within it.

We will explore how to get started with Spark and how to obtain this crucial SparkSession in the next chapter.

**Conclusion**
In this chapter, we learned that MLlib is Spark's powerful library for doing machine learning on big, distributed datasets. It provides algorithms and tools for common ML tasks like regression and classification, designed to work in parallel across a cluster of machines. We saw that MLlib is accessed via the pyspark.ml package and operates by distributing computation alongside Spark's data distribution.

To start using MLlib, the very first step is always to set up a Spark environment and get a way to interact with it. This is where the SparkSession comes in.

Ready to open the Spark workshop? Let's learn about the SparkSession in the next chapter.

### Create a SparkSession
Creating a SparkSession is usually the very first step in any PySpark application. You typically use a builder pattern. Here's the basic way to do it:

In [7]:
spark = SparkSession.builder.appName("MyFirstSparkApp").getOrCreate()

### Let's break down this simple code:

*   from pyspark.sql import SparkSession: This line imports the SparkSession class from the PySpark library.

*   SparkSession.builder: This starts the process of building a SparkSession object. It returns a builder object that helps configure the session.

*   .appName("MyFirstSparkApp"): You give your Spark application a name. This is helpful when you're running many applications on a cluster; you can easily identify yours in Spark's monitoring UIs. Choose a descriptive name!

*   .getOrCreate(): This is the key method. It checks whether a SparkSession already exists in the current application.
    *   If one exists, it returns the existing session and applies any options you set on the builder to it.
    *   If one does not exist, it creates a new one based on the configurations you've set (like the app name).

This pattern ensures you don't accidentally create multiple SparkSessions in the same application, which is usually not desired.

### Let's read in our first data

We will work with this automotive dataset throughout the project.

In [11]:
df = spark.read.csv('/content/data.csv', inferSchema=True, header=True, sep=",")
df.show(5)

+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|symboling|normalized-losses|       make|fuel-type|aspiration|num-of-doors| body-style|drive-wheels|engine-location|wheel-base|length|width|height|curb-weight|engine-type|num-of-cylinders|engine-size|fuel-system|bore|stroke|compression-ratio|horsepower|peak-rpm|city-mpg|highway-mpg|price|
+---------+-----------------+-----------+---------+----------+------------+-----------+------------+---------------+----------+------+-----+------+-----------+-----------+----------------+-----------+-----------+----+------+-----------------+----------+--------+--------+-----------+-----+
|        3|             NULL|alfa-romero|      gas|       std|         two|convertible|         rwd|          front|      88.6| 16

**Explanation**




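*   spark.read: Accesses the reader object (a DataFrameReader) within the SparkSession.
*   .csv: Reads the file at the given path as CSV.

    NOTE: The reader also supports other formats, such as JSON, Parquet, and ORC. Excel files are not supported out of the box and need a third-party connector.

*   inferSchema=True: Asks Spark to try and automatically figure out the data types for each column (like 'age' being an integer). Inference requires an extra pass over the data, so for large datasets it's often better practice to define the schema manually, but inferring it is convenient when you are starting out.

*   header=True: Indicates that the first row of the CSV is a header row containing column names.

*   sep=",": Sets the field delimiter. A comma is already the default for CSV, so it is passed here only to be explicit.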




