

## Feature Extraction and Transformation using Spark


Estimated time needed: **30** minutes


<p style='color: red'>The purpose of this lab is to show you how to use Spark to extract and transform features.</p>


## __Table of Contents__

<ol>
  <li>
    <a href="#Objectives">Objectives
    </a>
  </li>
  <li>
    <a href="#Datasets">Datasets
    </a>
  </li>
  <li>
    <a href="#Setup">Setup
    </a>
    <ol>
      <li>
        <a href="#Installing-Required-Libraries">Installing Required Libraries
        </a>
      </li>
      <li>
        <a href="#Importing-Required-Libraries">Importing Required Libraries
        </a>
      </li>
    </ol>
  </li>
  <li> 
    <a href="#Examples">Examples
    </a>
    <ol>
    <li>
      <a href="#Task-1---Tokenizer">Task 1 - Tokenizer
      </a>
    </li>
    <li>
      <a href="#Task-2---CountVectorizer">Task 2 - CountVectorizer
      </a>
    </li>
    <li>
      <a href="#Task-3---TF-IDF">Task 3 - TF-IDF
      </a>
    </li>
    <li>
      <a href="#Task-4---StopWordsRemover">Task 4 - StopWordsRemover
      </a>
    </li>
    <li>
      <a href="#Task-5---StringIndexer">Task 5 - StringIndexer
      </a>
    </li>
    <li>
      <a href="#Task-6---StandardScaler">Task 6 - StandardScaler
      </a>
    </li>
    </ol>
  </li>
  <li>
    <a href="#Exercises">Exercises
    </a>
    <ol>
      <li>
        <a href="#Exercise-1---Tokenizer">Exercise 1 - Tokenizer
        </a>
      </li>
      <li>
        <a href="#Exercise-2---CountVectorizer">Exercise 2 - CountVectorizer
        </a>
      </li>
      <li>
        <a href="#Exercise-3---StringIndexer">Exercise 3 - StringIndexer
        </a>
      </li>
      <li>
        <a href="#Exercise-4---StandardScaler">Exercise 4 - StandardScaler
        </a>
      </li>
    </ol>
  </li>
</ol>


















## Objectives

After completing this lab you will be able to:

 - Use the feature extractor CountVectorizer
 - Use the feature extractor TF-IDF
 - Use the feature transformer Tokenizer
 - Use the feature transformer StopWordsRemover
 - Use the feature transformer StringIndexer
 - Use the feature transformer StandardScaler
 


## Datasets

In this lab you will use the following datasets:

 - A modified version of the car mileage dataset. The original dataset is available at https://archive.ics.uci.edu/ml/datasets/auto+mpg
 - A small proverbs dataset (`proverbs.csv`), used in the exercises
 


----


## Setup


For this lab, we will be using the following libraries:

*   [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMSkillsNetworkBD0231ENCoursera2789-2023-01-01) for connecting to the Spark Cluster


### Installing Required Libraries

A Spark cluster is pre-installed in the Skills Network Labs environment. However, you need libraries such as `pyspark` and `findspark` to connect to this cluster.

If you wish to download this Jupyter notebook and run it on your local computer, follow the instructions <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/labs/Connecting_to_spark_cluster_using_Skills_Network_labs.ipynb">here</a>.



The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:
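
A minimal sketch for checking which of these libraries are already importable before installing anything. It assumes the packages are published under the names `pyspark` and `findspark` (the two libraries named above); the helper name `missing_packages` is hypothetical.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Package names taken from the text above; adjust if your environment differs.
required = ["pyspark", "findspark"]
missing = missing_packages(required)

if missing:
    # Assumption: the packages install from PyPI under the same names.
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are already available.")
```

If the check reports missing packages, run the printed `pip install` command in a notebook cell (prefixed with `!`) or in a terminal.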


### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [1]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

In [2]:
#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession.builder.appName("Feature Extraction and Transformation using Spark").getOrCreate()

24/08/29 16:49:44 WARN Utils: Your hostname, codebase resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
24/08/29 16:49:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/08/29 16:49:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Task 1 - Tokenizer


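A tokenizer breaks a sentence into individual words (tokens). Spark's `Tokenizer` converts the text to lowercase and splits it on whitespace, so punctuation stays attached to the neighbouring word.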


In [3]:
#import tokenizer
from pyspark.ml.feature import Tokenizer

In [4]:
#create a sample dataframe
sentenceDataFrame = spark.createDataFrame([
    (1, "Spark is a distributed computing system."),
    (2, "It provides interfaces for multiple languages"),
    (3, "Spark is built on top of Hadoop")
], ["id", "sentence"])

In [5]:
#display the dataframe
sentenceDataFrame.show(truncate = False)

                                                                                

+---+---------------------------------------------+
|id |sentence                                     |
+---+---------------------------------------------+
|1  |Spark is a distributed computing system.     |
|2  |It provides interfaces for multiple languages|
|3  |Spark is built on top of Hadoop              |
+---+---------------------------------------------+



In [6]:
#create a Tokenizer instance
#inputCol is the column to tokenize
#outputCol is the name of the column where the tokens are stored
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

In [7]:
#tokenize
token_df = tokenizer.transform(sentenceDataFrame)

In [8]:
#display the tokenized data
token_df.show(truncate=False)

+---+---------------------------------------------+----------------------------------------------------+
|id |sentence                                     |words                                               |
+---+---------------------------------------------+----------------------------------------------------+
|1  |Spark is a distributed computing system.     |[spark, is, a, distributed, computing, system.]     |
|2  |It provides interfaces for multiple languages|[it, provides, interfaces, for, multiple, languages]|
|3  |Spark is built on top of Hadoop              |[spark, is, built, on, top, of, hadoop]             |
+---+---------------------------------------------+----------------------------------------------------+
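
In the output, the tokens are lowercased and the period stays attached to `system.`. The following is a minimal plain-Python sketch of the same behaviour, assuming single-spaced input like the sentences used here; it imitates the result and is not Spark's implementation. The second helper, `tokenize_words`, is a hypothetical variant that drops punctuation with a regular expression; in Spark, `RegexTokenizer` from `pyspark.ml.feature` serves a similar purpose.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()

def tokenize_words(sentence):
    """Lowercase the sentence and keep only runs of word characters."""
    return re.findall(r"\w+", sentence.lower())

sentence = "Spark is a distributed computing system."
print(tokenize(sentence))        # the last token keeps its period
print(tokenize_words(sentence))  # punctuation is dropped
```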



## Task 2 - CountVectorizer


CountVectorizer converts tokenized text into a numerical format. It learns a vocabulary from all the documents and then, for each document, records how many times each vocabulary word occurs.


In [9]:
#import CountVectorizer
from pyspark.ml.feature import CountVectorizer

In [10]:
#create a sample dataframe and display it.
textdata = [(1, "I love Spark Spark provides Python API ".split()),
            (2, "I love Python Spark supports Python".split()),
            (3, "Spark solves the big problem of big data".split())]

textdata = spark.createDataFrame(textdata, ["id", "words"])

textdata.show(truncate=False)

+---+-------------------------------------------------+
|id |words                                            |
+---+-------------------------------------------------+
|1  |[I, love, Spark, Spark, provides, Python, API]   |
|2  |[I, love, Python, Spark, supports, Python]       |
|3  |[Spark, solves, the, big, problem, of, big, data]|
+---+-------------------------------------------------+



In [11]:
# Create a CountVectorizer object
# inputCol is the column of token lists to count
# outputCol is the name of the column where the count vectors are stored
cv = CountVectorizer(inputCol="words", outputCol="features")

In [12]:
# Fit the CountVectorizer model on the input data
model = cv.fit(textdata)

                                                                                

In [13]:
# Transform the input data to bag-of-words vectors
result = model.transform(textdata)

In [14]:
# display the dataframe
result.show(truncate=False)

+---+-------------------------------------------------+---------------------------------------------------+
|id |words                                            |features                                           |
+---+-------------------------------------------------+---------------------------------------------------+
|1  |[I, love, Spark, Spark, provides, Python, API]   |(13,[0,1,2,4,10,11],[2.0,1.0,1.0,1.0,1.0,1.0])     |
|2  |[I, love, Python, Spark, supports, Python]       |(13,[0,1,2,4,8],[1.0,2.0,1.0,1.0,1.0])             |
|3  |[Spark, solves, the, big, problem, of, big, data]|(13,[0,3,5,6,7,9,12],[1.0,2.0,1.0,1.0,1.0,1.0,1.0])|
+---+-------------------------------------------------+---------------------------------------------------+
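
Each entry in the `features` column is a sparse vector written as `(size, [indices], [values])`: the vocabulary size, the indices of the words present in the document, and their counts. The following is a minimal plain-Python sketch of the idea, not Spark's implementation. It orders the vocabulary by total count, most frequent first, and breaks ties alphabetically; that tie-breaking rule is an assumption of this sketch, so indices of equally frequent words may differ from Spark's.

```python
from collections import Counter

docs = [
    "I love Spark Spark provides Python API".split(),
    "I love Python Spark supports Python".split(),
    "Spark solves the big problem of big data".split(),
]

# Vocabulary: every distinct term, most frequent first (ties alphabetical).
totals = Counter(term for doc in docs for term in doc)
vocab = sorted(totals, key=lambda term: (-totals[term], term))
index = {term: i for i, term in enumerate(vocab)}

def to_sparse(doc):
    """Return (vocabulary size, sorted term indices, counts) for one document."""
    counts = Counter(index[term] for term in doc)
    indices = sorted(counts)
    return (len(vocab), indices, [float(counts[i]) for i in indices])

for doc in docs:
    print(to_sparse(doc))
```

Here "Spark" is the most frequent word and gets index 0, and its count in the first document is 2.0, which agrees with the first row of the Spark output.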



## Task 3 - TF-IDF


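Term Frequency-Inverse Document Frequency (TF-IDF) quantifies how important a word is to a document within a collection of documents. It is computed by multiplying the term frequency (the number of times the word occurs in the document) by the inverse document frequency of the word, which shrinks toward zero as the word appears in more of the documents.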


In [33]:
#import necessary classes for TF-IDF calculation
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

In [39]:
#create a sample dataframe and display it.
sentenceData = spark.createDataFrame([
        (1, "good boy"),
        (2, "good girl"),
        (3, "boy good girl")
    ], ["id", "sentence"])

sentenceData.show(truncate = False)

+---+-------------+
|id |sentence     |
+---+-------------+
|1  |good boy     |
|2  |good girl    |
|3  |boy good girl|
+---+-------------+



In [40]:
#tokenize the "sentence" column and store in the column "words"
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.show(truncate = False)

+---+-------------+-----------------+
|id |sentence     |words            |
+---+-------------+-----------------+
|1  |good boy     |[good, boy]      |
|2  |good girl    |[good, girl]     |
|3  |boy good girl|[boy, good, girl]|
+---+-------------+-----------------+



In [41]:
# Create a HashingTF object
# mention the "words" column as input
# mention the "rawFeatures" column as output

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=10)
featurizedData = hashingTF.transform(wordsData)

featurizedData.show(truncate = False)

+---+-------------+-----------------+--------------------------+
|id |sentence     |words            |rawFeatures               |
+---+-------------+-----------------+--------------------------+
|1  |good boy     |[good, boy]      |(10,[7,8],[1.0,1.0])      |
|2  |good girl    |[good, girl]     |(10,[8,9],[1.0,1.0])      |
|3  |boy good girl|[boy, good, girl]|(10,[7,8,9],[1.0,1.0,1.0])|
+---+-------------+-----------------+--------------------------+



In [42]:
# Create an IDF object
# mention the "rawFeatures" column as input
# mention the "features" column as output

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
tfidfData = idfModel.transform(featurizedData)

In [43]:
#display the tf-idf data
tfidfData.select("sentence", "features").show(truncate=False)

+-------------+----------------------------------------------------------+
|sentence     |features                                                  |
+-------------+----------------------------------------------------------+
|good boy     |(10,[7,8],[0.28768207245178085,0.0])                      |
|good girl    |(10,[8,9],[0.0,0.28768207245178085])                      |
|boy good girl|(10,[7,8,9],[0.28768207245178085,0.0,0.28768207245178085])|
+-------------+----------------------------------------------------------+
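
The zeros at index 8 belong to "good", which appears in every sentence and so carries no distinguishing weight. Spark's `IDF` uses the smoothed formula `idf = ln((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. The following plain-Python sketch applies that formula to the same three sentences; the helper names `idf` and `tfidf` are hypothetical, and the sketch works on words directly instead of hashed indices.

```python
import math

docs = [["good", "boy"], ["good", "girl"], ["boy", "good", "girl"]]
m = len(docs)  # number of documents

def idf(term):
    """Smoothed inverse document frequency: ln((m + 1) / (df + 1))."""
    df = sum(term in doc for doc in docs)
    return math.log((m + 1) / (df + 1))

def tfidf(doc):
    """Map each distinct term in the document to term frequency times IDF."""
    return {term: doc.count(term) * idf(term) for term in set(doc)}

for doc in docs:
    print(doc, tfidf(doc))
```

With three documents, "boy" and "girl" each occur in two of them, giving `ln(4/3)`, roughly 0.2877, which matches the non-zero values in the Spark output.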



## Task 4 - StopWordsRemover


StopWordsRemover is a transformer that filters out stop words such as "a", "an" and "the". By default the comparison is case-insensitive, so both "It" and "IT" are removed.


In [46]:
#import StopWordsRemover
from pyspark.ml.feature import StopWordsRemover

In [47]:
#create a dataframe with sample text and display it
textData = spark.createDataFrame([
    (1, ['Spark', 'is', 'an', 'open-source', 'distributed', 'computing', 'system']),
    (2, ['IT', 'has', 'interfaces', 'for', 'multiple', 'languages']),
    (3, ['It', 'has', 'a', 'wide', 'range', 'of', 'libraries', 'and', 'APIs'])
], ["id", "sentence"])

textData.show(truncate = False)

+---+------------------------------------------------------------+
|id |sentence                                                    |
+---+------------------------------------------------------------+
|1  |[Spark, is, an, open-source, distributed, computing, system]|
|2  |[IT, has, interfaces, for, multiple, languages]             |
|3  |[It, has, a, wide, range, of, libraries, and, APIs]         |
+---+------------------------------------------------------------+



In [48]:
# remove stopwords from "sentence" column and store the result in "filtered_sentence" column
remover = StopWordsRemover(inputCol="sentence", outputCol="filtered_sentence")
textData = remover.transform(textData)

In [49]:
# display the dataframe
textData.show(truncate = False)

+---+------------------------------------------------------------+----------------------------------------------------+
|id |sentence                                                    |filtered_sentence                                   |
+---+------------------------------------------------------------+----------------------------------------------------+
|1  |[Spark, is, an, open-source, distributed, computing, system]|[Spark, open-source, distributed, computing, system]|
|2  |[IT, has, interfaces, for, multiple, languages]             |[interfaces, multiple, languages]                   |
|3  |[It, has, a, wide, range, of, libraries, and, APIs]         |[wide, range, libraries, APIs]                      |
+---+------------------------------------------------------------+----------------------------------------------------+
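
A minimal plain-Python sketch of the filtering step, not Spark's implementation. The `STOP_WORDS` set is a small hypothetical subset containing only the words removed in the output above (plus "the"); Spark's default English list is much longer. The `case_sensitive` flag imitates the `caseSensitive` parameter of `StopWordsRemover`, which defaults to `False`.

```python
# Hypothetical subset of a stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "for", "has", "is", "it", "of", "the"}

def remove_stop_words(tokens, case_sensitive=False):
    """Return the tokens that are not stop words, preserving order."""
    if case_sensitive:
        return [t for t in tokens if t not in STOP_WORDS]
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["IT", "has", "interfaces", "for", "multiple", "languages"]
print(remove_stop_words(tokens))                       # "IT" is removed
print(remove_stop_words(tokens, case_sensitive=True))  # "IT" is kept
```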



## Task 5 - StringIndexer


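StringIndexer converts a column of strings into a column of numeric label indices. By default the indices are assigned by descending frequency, so the most frequent string gets index 0.0.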


In [50]:
#import StringIndexer
from pyspark.ml.feature import StringIndexer

In [51]:
#create a dataframe with sample text and display it
colors = spark.createDataFrame(
    [(0, "red"), (1, "red"), (2, "blue"), (3, "yellow" ), (4, "yellow"), (5, "yellow")],
    ["id", "color"])

colors.show()

+---+------+
| id| color|
+---+------+
|  0|   red|
|  1|   red|
|  2|  blue|
|  3|yellow|
|  4|yellow|
|  5|yellow|
+---+------+



In [52]:
# index the strings in the column "color" and store their indexes in the column "colorIndex"
indexer = StringIndexer(inputCol="color", outputCol="colorIndex")
indexed = indexer.fit(colors).transform(colors)

                                                                                

In [53]:
# display the dataframe
indexed.show()

+---+------+----------+
| id| color|colorIndex|
+---+------+----------+
|  0|   red|       1.0|
|  1|   red|       1.0|
|  2|  blue|       2.0|
|  3|yellow|       0.0|
|  4|yellow|       0.0|
|  5|yellow|       0.0|
+---+------+----------+
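
In the output, "yellow" occurs most often and receives 0.0, followed by "red" and then "blue". A minimal plain-Python sketch of this frequency-based ordering (the default `stringOrderType="frequencyDesc"`), not Spark's implementation; ties are broken alphabetically here, which is an assumption of the sketch.

```python
from collections import Counter

colors = ["red", "red", "blue", "yellow", "yellow", "yellow"]

# Order the distinct labels by how often they occur, most frequent first.
counts = Counter(colors)
labels = sorted(counts, key=lambda label: (-counts[label], label))

# Map each label to its position, stored as a float like Spark's output column.
color_index = {label: float(i) for i, label in enumerate(labels)}

for color in colors:
    print(color, color_index[color])
```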



## Task 6 - StandardScaler



StandardScaler rescales each feature so that it has a standard deviation of 1 and, when `withMean=True` is set, a mean of 0.


In [54]:
#import StandardScaler
from pyspark.ml.feature import StandardScaler


In [55]:
# Create a sample dataframe and display it
from pyspark.ml.linalg import Vectors
data = [(1, Vectors.dense([70, 170, 17])),
        (2, Vectors.dense([80, 165, 25])),
        (3, Vectors.dense([65, 150, 135]))]
df = spark.createDataFrame(data, ["id", "features"])

df.show()

+---+------------------+
| id|          features|
+---+------------------+
|  1| [70.0,170.0,17.0]|
|  2| [80.0,165.0,25.0]|
|  3|[65.0,150.0,135.0]|
+---+------------------+



In [56]:
# Define the StandardScaler transformer
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=True)

In [57]:
# Fit the transformer to the dataset
scalerModel = scaler.fit(df)

In [58]:
# Scale the data
scaledData = scalerModel.transform(df)

In [59]:
# Show the scaled data
scaledData.show(truncate = False)

+---+------------------+-----------------------------------------------------------+
|id |features          |scaledFeatures                                             |
+---+------------------+-----------------------------------------------------------+
|1  |[70.0,170.0,17.0] |[-0.218217890235993,0.8006407690254367,-0.6369487984517485]|
|2  |[80.0,165.0,25.0] |[1.0910894511799611,0.3202563076101752,-0.5156252177942725]|
|3  |[65.0,150.0,135.0]|[-0.8728715609439701,-1.120897076635609,1.152574016246021] |
+---+------------------+-----------------------------------------------------------+
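
The scaled values can be checked by hand: subtract each column's mean and divide by its sample standard deviation (the one that divides by n - 1). A minimal plain-Python sketch of that arithmetic on the same three rows, using the standard library instead of Spark:

```python
from statistics import mean, stdev

rows = [
    [70.0, 170.0, 17.0],
    [80.0, 165.0, 25.0],
    [65.0, 150.0, 135.0],
]

# Work column by column: one mean and one sample standard deviation per feature.
columns = list(zip(*rows))
means = [mean(col) for col in columns]
stds = [stdev(col) for col in columns]

# Center each value on its column mean, then divide by the column's deviation.
scaled = [
    [(x - m) / s for x, m, s in zip(row, means, stds)]
    for row in rows
]

for row in scaled:
    print(row)
```

The first value comes out at about -0.2182, in line with the first row of the `scaledFeatures` column.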



Stop Spark Session


In [60]:
spark.stop()

## Exercises


Create Spark Session


In [61]:
#Create SparkSession
#Ignore any warnings by SparkSession command

spark = SparkSession.builder.appName("Exercises - Feature Extraction and Transformation using Spark").getOrCreate()

Create Dataframes


In [62]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/proverbs.csv


--2024-08-29 17:19:24--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/proverbs.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 846 [text/csv]
Saving to: ‘proverbs.csv’


2024-08-29 17:19:26 (417 MB/s) - ‘proverbs.csv’ saved [846/846]



In [83]:
# Load proverbs dataset
textdata = spark.read.csv("proverbs.csv", header=True, inferSchema=True)

In [84]:
# display dataframe
textdata.show(5,truncate = False)

+---+---------------------------------------------+
|id |text                                         |
+---+---------------------------------------------+
|1  |When in Rome do as the Romans do.            |
|2  |Do not judge a book by its cover.            |
|3  |Actions speak louder than words.             |
|4  |A picture is worth a thousand words.         |
|5  |If at first you do not succeed try try again.|
+---+---------------------------------------------+
only showing top 5 rows



In [85]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/mpg.csv


--2024-08-29 17:26:41--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-BD0231EN-Coursera/datasets/mpg.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13891 (14K) [text/csv]
Saving to: ‘mpg.csv.1’


2024-08-29 17:26:43 (243 MB/s) - ‘mpg.csv.1’ saved [13891/13891]



In [86]:
# Load mpg dataset
mpgdata = spark.read.csv("mpg.csv", header=True, inferSchema=True)

In [87]:
# display dataframe
mpgdata.show(5)

+----+---------+-----------+----------+------+----------+----+--------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|
+----+---------+-----------+----------+------+----------+----+--------+
|15.0|        8|      390.0|       190|  3850|       8.5|  70|American|
|21.0|        6|      199.0|        90|  2648|      15.0|  70|American|
|18.0|        6|      199.0|        97|  2774|      15.5|  70|American|
|16.0|        8|      304.0|       150|  3433|      12.0|  70|American|
|14.0|        8|      455.0|       225|  3086|      10.0|  70|American|
+----+---------+-----------+----------+------+----------+----+--------+
only showing top 5 rows



### Exercise 1 - Tokenizer


In [88]:
#display the dataframe
textdata.show(5,truncate = False)

+---+---------------------------------------------+
|id |text                                         |
+---+---------------------------------------------+
|1  |When in Rome do as the Romans do.            |
|2  |Do not judge a book by its cover.            |
|3  |Actions speak louder than words.             |
|4  |A picture is worth a thousand words.         |
|5  |If at first you do not succeed try try again.|
+---+---------------------------------------------+
only showing top 5 rows



Write code to tokenize the "text" column of the "textdata" dataframe and store the tokens in the column "words"


In [89]:
# your code goes here
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol='text', outputCol='words')
textdata = tokenizer.transform(textdata)
textdata.show(5, truncate=False)

+---+---------------------------------------------+--------------------------------------------------------+
|id |text                                         |words                                                   |
+---+---------------------------------------------+--------------------------------------------------------+
|1  |When in Rome do as the Romans do.            |[when, in, rome, do, as, the, romans, do.]              |
|2  |Do not judge a book by its cover.            |[do, not, judge, a, book, by, its, cover.]              |
|3  |Actions speak louder than words.             |[actions, speak, louder, than, words.]                  |
|4  |A picture is worth a thousand words.         |[a, picture, is, worth, a, thousand, words.]            |
|5  |If at first you do not succeed try try again.|[if, at, first, you, do, not, succeed, try, try, again.]|
+---+---------------------------------------------+--------------------------------------------------------+
only showing top 5 rows

<details>
    <summary>Click here for a Hint</summary>
    
Refer to Task 1

</details>


<details>
    <summary>Click here for Solution</summary>

```python
from pyspark.ml.feature import Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")

textdata = tokenizer.transform(textdata)
```

</details>


In [90]:
#display the tokenized data
textdata.select("id","words").show(5,truncate=False)

+---+--------------------------------------------------------+
|id |words                                                   |
+---+--------------------------------------------------------+
|1  |[when, in, rome, do, as, the, romans, do.]              |
|2  |[do, not, judge, a, book, by, its, cover.]              |
|3  |[actions, speak, louder, than, words.]                  |
|4  |[a, picture, is, worth, a, thousand, words.]            |
|5  |[if, at, first, you, do, not, succeed, try, try, again.]|
+---+--------------------------------------------------------+
only showing top 5 rows



### Exercise 2 - CountVectorizer


CountVectorize the column "words" of the "textdata" dataframe and store the result in the column "features"


In [91]:
# your code goes here
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol='words', outputCol='features')
textdata = cv.fit(textdata).transform(textdata)
textdata.show(5)

+---+--------------------+--------------------+--------------------+
| id|                text|               words|            features|
+---+--------------------+--------------------+--------------------+
|  1|When in Rome do a...|[when, in, rome, ...|(99,[0,4,5,11,12,...|
|  2|Do not judge a bo...|[do, not, judge, ...|(99,[1,3,4,19,20,...|
|  3|Actions speak lou...|[actions, speak, ...|(99,[7,10,81,86,9...|
|  4|A picture is wort...|[a, picture, is, ...|(99,[1,2,10,70,77...|
|  5|If at first you d...|[if, at, first, y...|(99,[3,4,16,17,22...|
+---+--------------------+--------------------+--------------------+
only showing top 5 rows



<details>
    <summary>Click here for a Hint</summary>
    
Refer to Task 2
</details>


<details>
    <summary>Click here for Solution</summary>

```python
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="words", outputCol="features")

model = cv.fit(textdata)

textdata = model.transform(textdata)
```

</details>


In [93]:
# Show the resulting dataframe
textdata.select("words","features").show(truncate=False)

+------------------------------------------------------------------------+----------------------------------------------------------------------------+
|words                                                                   |features                                                                    |
+------------------------------------------------------------------------+----------------------------------------------------------------------------+
|[when, in, rome, do, as, the, romans, do.]                              |(99,[0,4,5,11,12,41,69,93],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])               |
|[do, not, judge, a, book, by, its, cover.]                              |(99,[1,3,4,19,20,31,44,54],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])               |
|[actions, speak, louder, than, words.]                                  |(99,[7,10,81,86,97],[1.0,1.0,1.0,1.0,1.0])                                  |
|[a, picture, is, worth, a, thousand, words.]                            |(99,[1,2,10,70

### Exercise 3 - StringIndexer


Convert the string column "Origin" to a numeric column "OriginIndex" in the dataframe "mpgdata"


In [94]:
# your code goes here
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='Origin', outputCol='OriginIndex')
indexed = indexer.fit(mpgdata).transform(mpgdata)

<details>
    <summary>Click here for a Hint</summary>
    
Refer to Task 5

</details>


<details>
    <summary>Click here for Solution</summary>

```python
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="Origin", outputCol="OriginIndex")
indexed = indexer.fit(mpgdata).transform(mpgdata)
```

</details>


In [95]:
#show the dataframe

indexed.orderBy(rand()).show()


+----+---------+-----------+----------+------+----------+----+--------+-----------+
| MPG|Cylinders|Engine Disp|Horsepower|Weight|Accelerate|Year|  Origin|OriginIndex|
+----+---------+-----------+----------+------+----------+----+--------+-----------+
|15.0|        6|      258.0|       110|  3730|      19.0|  75|American|        0.0|
|27.5|        4|      134.0|        95|  2560|      14.2|  78|Japanese|        1.0|
|26.0|        4|      108.0|        93|  2391|      15.5|  74|Japanese|        1.0|
|20.0|        6|      156.0|       122|  2807|      13.5|  73|Japanese|        1.0|
|29.8|        4|       89.0|        62|  1845|      15.3|  80|European|        2.0|
|16.2|        6|      163.0|       133|  3410|      15.8|  78|European|        2.0|
|36.0|        4|      107.0|        75|  2205|      14.5|  82|Japanese|        1.0|
|23.0|        4|      115.0|        95|  2694|      15.0|  75|European|        2.0|
|18.0|        6|      258.0|       110|  2962|      13.5|  71|American|     

### Exercise 4 - StandardScaler



Create a single column named "features" using the columns "Cylinders", "Engine Disp", "Horsepower", "Weight"


In [96]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["Cylinders", "Engine Disp", "Horsepower", "Weight"], outputCol="features")

mpg_transformed_data = assembler.transform(mpgdata)

#show the dataframe
mpg_transformed_data.select("MPG","features").show(truncate = False)

+----+------------------------+
|MPG |features                |
+----+------------------------+
|15.0|[8.0,390.0,190.0,3850.0]|
|21.0|[6.0,199.0,90.0,2648.0] |
|18.0|[6.0,199.0,97.0,2774.0] |
|16.0|[8.0,304.0,150.0,3433.0]|
|14.0|[8.0,455.0,225.0,3086.0]|
|15.0|[8.0,350.0,165.0,3693.0]|
|18.0|[8.0,307.0,130.0,3504.0]|
|14.0|[8.0,454.0,220.0,4354.0]|
|15.0|[8.0,400.0,150.0,3761.0]|
|10.0|[8.0,307.0,200.0,4376.0]|
|15.0|[8.0,383.0,170.0,3563.0]|
|11.0|[8.0,318.0,210.0,4382.0]|
|10.0|[8.0,360.0,215.0,4615.0]|
|15.0|[8.0,429.0,198.0,4341.0]|
|21.0|[6.0,200.0,85.0,2587.0] |
|17.0|[8.0,302.0,140.0,3449.0]|
|9.0 |[8.0,304.0,193.0,4732.0]|
|14.0|[8.0,340.0,160.0,3609.0]|
|22.0|[6.0,198.0,95.0,2833.0] |
|14.0|[8.0,440.0,215.0,4312.0]|
+----+------------------------+
only showing top 20 rows
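
`VectorAssembler` combines the listed input columns, in order, into one vector per row. A minimal plain-Python sketch of that step for the first row of the mpg data shown earlier; the helper name `assemble` is hypothetical and a list of floats stands in for Spark's vector type.

```python
# First row of the mpg dataframe, written as a plain dictionary.
row = {
    "MPG": 15.0,
    "Cylinders": 8,
    "Engine Disp": 390.0,
    "Horsepower": 190,
    "Weight": 3850,
}
input_cols = ["Cylinders", "Engine Disp", "Horsepower", "Weight"]

def assemble(row, input_cols):
    """Collect the chosen columns, in order, into one list of floats."""
    return [float(row[col]) for col in input_cols]

features = assemble(row, input_cols)
print(features)
```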



Use StandardScaler to scale the "features" column of the dataframe "mpg_transformed_data" and save the scaled data into the "scaledFeatures" column.


In [97]:
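# your code goes here
from pyspark.ml.feature import StandardScaler
# withMean defaults to False, so this scales each feature to unit standard deviation without centering it
sc = StandardScaler(inputCol='features', outputCol='scaledFeatures')
scaledData = sc.fit(mpg_transformed_data).transform(mpg_transformed_data)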

<details>
    <summary>Click here for a Hint</summary>
    
Refer to Task 6

</details>


<details>
    <summary>Click here for Solution</summary>

```python
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=True)

scalerModel = scaler.fit(mpg_transformed_data)

scaledData = scalerModel.transform(mpg_transformed_data)
```

</details>


In [98]:
# Show the scaled data
scaledData.select("features","scaledFeatures").show(truncate = False)

+------------------------+---------------------------------------------------------------------------+
|features                |scaledFeatures                                                             |
+------------------------+---------------------------------------------------------------------------+
|[8.0,390.0,190.0,3850.0]|[4.689927639954407,3.7269216145389974,4.936198346102636,4.532597594013996] |
|[6.0,199.0,90.0,2648.0] |[3.517445729965805,1.9016856443416936,2.338199216574933,3.1174853062205354]|
|[6.0,199.0,97.0,2774.0] |[3.517445729965805,1.9016856443416936,2.520059155641872,3.265824863842812] |
|[8.0,304.0,150.0,3433.0]|[4.689927639954407,2.9050876174868083,3.896998694291555,4.041664296168844] |
|[8.0,455.0,225.0,3086.0]|[4.689927639954407,4.3480752169621635,5.845498041437333,3.6331418636694006]|
|[8.0,350.0,165.0,3693.0]|[4.689927639954407,3.344673243817049,4.2866985637207105,4.347761796024335] |
|[8.0,307.0,130.0,3504.0]|[4.689927639954407,2.9337562452909545,3.3773988

Stop Spark Session


In [99]:
spark.stop()

Congratulations! You have completed this lab.<br>


## Authors


[Ramesh Sannareddy](https://www.linkedin.com/in/rsannareddy/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMBD0231ENSkillsNetwork866-2023-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2023-05-14|0.1|Ramesh Sannareddy|Initial Version Created|


Copyright © 2023 IBM Corporation. All rights reserved.
