# Big Data with PySpark

### PySpark installation on a local machine
1. Install JDK 8 or higher from https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
2. Install Apache Spark with Homebrew:
    - brew install apache-spark
3. Install findspark and pyspark with pip:
    - pip install findspark
    - pip install pyspark
4. In your shell config (bashrc, zshrc, or whichever one your shell uses), add the following lines, adjusting the SPARK_HOME path to wherever pyspark is installed on your machine:
    - export PYSPARK_PYTHON=python3
    - export SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark
5. Open a Python file in your text editor and enter the following to start a Spark session:

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Start a SparkContext; the session built below reuses it.
sc = SparkContext()

spark = SparkSession \
   .builder \
   .appName("Python Spark regression example") \
   .config("spark.some.config.option", "some-value") \
   .getOrCreate()
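
If the session fails to start, it can help to confirm that the pieces from steps 1 and 4 are visible to Python. The following is a minimal sketch using only the standard library; the helper name `check_spark_environment` is hypothetical and not part of PySpark.

```python
import os
import shutil


def check_spark_environment():
    """Report whether Java and the Spark-related variables from step 4 are visible."""
    spark_home = os.environ.get("SPARK_HOME")
    return {
        # True when a `java` executable can be found on PATH.
        "java_on_path": shutil.which("java") is not None,
        # Values exported in the shell config (None when unset).
        "PYSPARK_PYTHON": os.environ.get("PYSPARK_PYTHON"),
        "SPARK_HOME": spark_home,
        # True only when SPARK_HOME is set and points at an existing directory.
        "spark_home_exists": bool(spark_home) and os.path.isdir(spark_home),
    }


report = check_spark_environment()
for name, value in report.items():
    print(f"{name}: {value}")
```

A value of `None` for either variable usually means the shell config has not been reloaded since it was edited.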

## DataFrames in PySpark

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Python Spark regression example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

#### Loading a CSV file with PySpark

In [4]:
# header=True uses the first row as column names; inferSchema=True detects column types.
regressionDataFrame = spark.read.csv('Advertising.csv', header=True, inferSchema=True)
regressionDataFrame.show(5)

+---+-----+-----+---------+-----+
|_c0|   TV|Radio|Newspaper|Sales|
+---+-----+-----+---------+-----+
|  1|230.1| 37.8|     69.2| 22.1|
|  2| 44.5| 39.3|     45.1| 10.4|
|  3| 17.2| 45.9|     69.3|  9.3|
|  4|151.5| 41.3|     58.5| 18.5|
|  5|180.8| 10.8|     58.4| 12.9|
+---+-----+-----+---------+-----+
only showing top 5 rows



#### Dropping a column from the table

In [5]:
regressionDataFrame = regressionDataFrame.drop('_c0')
regressionDataFrame.show(5)

+-----+-----+---------+-----+
|   TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
| 44.5| 39.3|     45.1| 10.4|
| 17.2| 45.9|     69.3|  9.3|
|151.5| 41.3|     58.5| 18.5|
|180.8| 10.8|     58.4| 12.9|
+-----+-----+---------+-----+
only showing top 5 rows



#### Getting a table's column names

In [6]:
regressionDataFrame.columns

['TV', 'Radio', 'Newspaper', 'Sales']

#### Grouping data based on conditions

In [7]:
regressionDataFrame.groupBy(regressionDataFrame.TV > 100).count().show(5)

+----------+-----+
|(TV > 100)|count|
+----------+-----+
|      true|  130|
|     false|   70|
+----------+-----+



#### Filtering rows based on a condition

In [9]:
regressionDataFrame.filter(regressionDataFrame.TV > 100).show(5)

+-----+-----+---------+-----+
|   TV|Radio|Newspaper|Sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
|151.5| 41.3|     58.5| 18.5|
|180.8| 10.8|     58.4| 12.9|
|120.2| 19.6|     11.6| 13.2|
|199.8|  2.6|     21.2| 10.6|
+-----+-----+---------+-----+
only showing top 5 rows



#### Selecting columns based on conditions

In [10]:
regressionDataFrame.select(regressionDataFrame.TV > 100).show(5)

+----------+
|(TV > 100)|
+----------+
|      true|
|     false|
|     false|
|      true|
|      true|
+----------+
only showing top 5 rows



#### Applying mathematical functions on the dataframe columns

In [11]:
# Import the module under an alias so Spark's min and max do not shadow Python's built-ins.
from pyspark.sql import functions as F
regressionDataFrame.select(F.mean('TV'), F.min('TV'), F.max('TV')).show()

+--------+-------+-------+
| avg(TV)|min(TV)|max(TV)|
+--------+-------+-------+
|147.0425|    0.7|  296.4|
+--------+-------+-------+

