# Creating Features Quiz
Use this Jupyter notebook to find the answers to the quiz in the previous section. There is an answer key in the next part of the lesson.

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, \
    IDF, StringIndexer
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
# TODOS: 
# 1) import any other libraries you might need

In [3]:
# 2) run the cells below to read dataset and build body length feature
# 3) write code to answer the quiz questions 

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()

### Read Dataset

In [3]:
stack_overflow_data = 'Train_onetag_small.json'

In [4]:
df = spark.read.json(stack_overflow_data)
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

### Build Body Length Feature

In [5]:
regexTokenizer = RegexTokenizer(inputCol="Body", outputCol="words", pattern="\\W")
df = regexTokenizer.transform(df)

In [6]:
body_length = udf(lambda x: len(x), IntegerType())
df = df.withColumn("BodyLength", body_length(df.words))
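
As a rough illustration of what the two cells above compute, here is a plain-Python sketch that needs no Spark session. It assumes `RegexTokenizer` is running with its default settings (splitting on the pattern, lowercasing, and dropping empty tokens). The helper name `regex_tokenize` is hypothetical, and the `re.ASCII` flag is an assumption made to approximate how `\W` behaves on the JVM.

```python
import re

def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Plain-Python approximation of RegexTokenizer splitting on a pattern."""
    # Lowercase first, then split wherever the pattern matches.
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    # Consecutive delimiters yield empty strings; drop anything too short.
    return [p for p in pieces if len(p) >= min_token_length]

# Shortened version of the first question body shown in this notebook.
words = regex_tokenize("<p>I'd like to check if an uploaded file is an image file</p>")
body_length = len(words)  # same idea as the body_length UDF: count the tokens
print(words[:4], body_length)
```

Note that HTML tags such as `<p>` contribute tokens too (`'p'`), so the length is a count of tokens rather than of natural-language words.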

In [7]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

# Question 1
Select the question with Id = 1112. How many words does its body contain (check the BodyLength column)?

In [8]:
# TODO: write your code to answer question 1
cv = CountVectorizer(inputCol="words", outputCol="total_words")

In [9]:
cvmodel = cv.fit(df)

In [10]:
df = cvmodel.transform(df)

In [11]:
body_length = udf(lambda x: len(x), IntegerType())
df = df.withColumn("bodylength", body_length(df.words))

In [12]:
df.where("Id == 1112").collect()

[Row(Body='<p>I submitted my iPhone application to iTunesConnect.Now it is in "Waiting for review". I want to release it only when i decide.. But am not able to see the option to set release date as "Automatically after success review"  or "release date will be set by Developer"(I mean Version Release Control option) . Somebody please help me ..Thanks in advance..</p>\n', Id=1112, Tags='iphone app-store itunes itunesconnect', Title='iPhone app release date option in iTunes Connect', oneTag='iphone', words=['p', 'i', 'submitted', 'my', 'iphone', 'application', 'to', 'itunesconnect', 'now', 'it', 'is', 'in', 'waiting', 'for', 'review', 'i', 'want', 'to', 'release', 'it', 'only', 'when', 'i', 'decide', 'but', 'am', 'not', 'able', 'to', 'see', 'the', 'option', 'to', 'set', 'release', 'date', 'as', 'automatically', 'after', 'success', 'review', 'or', 'release', 'date', 'will', 'be', 'set', 'by', 'developer', 'i', 'mean', 'version', 'release', 'control', 'option', 'somebody', 'please', 'help

# Question 2
Create a new column that concatenates the question title and body. Apply the same functions we used before to compute the number of words in this combined column. What's the value in this new column for Id = 5123?
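
One detail worth keeping in mind: the length of a string is its character count, so the combined text has to be tokenized before the words can be counted. The sketch below shows the difference in plain Python. The `count_words` helper and the sample title and body are hypothetical, not taken from the dataset.

```python
import re

def count_words(text):
    # Split on non-word characters and ignore the empty pieces.
    return len([t for t in re.split(r"\W", text.lower(), flags=re.ASCII) if t])

# Hypothetical title and body, for illustration only.
title = "Git branch experiment"
body = "<p>Here's an interesting experiment with using Git.</p>"

combined = title + " " + body   # join with a space so adjacent words stay separate
print(len(combined))            # number of characters, not words
print(count_words(combined))    # number of tokens after splitting on \W
```

Joining with a separator matters when the title ends and the body begins with word characters: `"Hello" + "world"` would otherwise be counted as a single token.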

In [13]:
# TODO: write your code to answer question 2
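# Join title and body with a space so the last title word and the first body word stay separate.
concat = udf(lambda x, y: x + " " + y)
df = df.withColumn("title_body", concat(df.Title, df.Body))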

In [14]:
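# Tokenize the combined text first; body_length on the raw string would count characters, not words.
title_body_tokenizer = RegexTokenizer(inputCol="title_body", outputCol="title_body_words", pattern="\\W")
df = title_body_tokenizer.transform(df)

In [15]:
df = df.withColumn("titlebodylength", body_length(df.title_body_words))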

In [16]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

In [18]:
df.where("Id == 5123").collect()

[Row(Body="<p>Here's an interesting experiment with using Git. Think of Github's ‘pages’ feature: I write a program in one branch (e.g. <code>master</code>), and a documentation website is kept in another, entirely unrelated branch (e.g. <code>gh-pages</code>).</p>\n\n<p>I can generate documentation in HTML format from the code in my <code>master</code>-branch, but I want to publish this as part of my documentation website in the <code>gh-pages</code> branch.</p>\n\n<p>How could I intelligently generate my docs from my code in <code>master</code>, move it to my <code>gh-pages</code> branch and commit the changes there? Should I use a post-commit hook or something? Would this be a good idea, or is it utterly foolish?</p>\n", Id=5123, Tags='git branch', Title='Git branch experiment', oneTag='git', words=['p', 'here', 's', 'an', 'interesting', 'experiment', 'with', 'using', 'git', 'think', 'of', 'github', 's', 'pages', 'feature', 'i', 'write', 'a', 'program', 'in', 'one', 'branch', 'e', '

# Create a Vector
Create a vector from the combined Title + Body length column. In the next few questions, you'll try different normalizer/scaler methods on this new column.

In [None]:
# TODO: write your code to create this vector

# Question 3
Using the Normalizer method, what's the normalized value for question Id = 512?
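
For intuition, here is a minimal plain-Python sketch of p-norm normalization, which is the operation a normalizer of this kind applies. It assumes the default is the Euclidean norm (p = 2). The `normalize` function and the sample vector are hypothetical.

```python
def normalize(vector, p=2.0):
    """Scale a single vector so that its p-norm equals 1."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        # An all-zero vector has no direction to preserve; return it unchanged.
        return list(vector)
    return [x / norm for x in vector]

# Hypothetical two-element vector with Euclidean norm 5.
print(normalize([3.0, 4.0]))
```

Each row's vector is scaled independently here, so the result for one row depends only on that row's own values, not on the rest of the column.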

In [None]:
# TODO: write your code to answer question 3

# Question 4
Using the StandardScaler method (centering on the mean and scaling by the standard deviation), what's the normalized value for question Id = 512?
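
Standardization, unlike per-row normalization, depends on statistics of the whole column. The sketch below shows the arithmetic in plain Python, assuming the sample standard deviation (dividing by n - 1) is used. The `standardize` function, its parameter names, and the sample values are hypothetical.

```python
import statistics

def standardize(values, with_mean=True, with_std=True):
    """Center a column on its mean and scale it by its sample standard deviation."""
    mean = statistics.mean(values) if with_mean else 0.0
    std = statistics.stdev(values) if with_std else 1.0
    # Assumption: a constant column (std == 0) maps to zeros instead of dividing by zero.
    scale = 1.0 / std if std != 0 else 0.0
    return [(v - mean) * scale for v in values]

# Hypothetical column of lengths: mean 2, sample standard deviation 1.
print(standardize([1.0, 2.0, 3.0]))
```

Setting `with_mean=False` in this sketch only divides by the standard deviation, which keeps zero entries at zero.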

In [None]:
# TODO: write your code to answer question 4

# Question 5
Using the MinMaxScaler method, what's the normalized value for question Id = 512?
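
Min-max scaling maps the smallest value in a column to the bottom of a target range and the largest to the top. The plain-Python sketch below assumes a default target range of [0, 1]. The `min_max_scale` function and the sample values are hypothetical.

```python
def min_max_scale(values, out_min=0.0, out_max=1.0):
    """Linearly rescale a column so its minimum maps to out_min and its maximum to out_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column maps to the midpoint of the target range.
        return [0.5 * (out_max + out_min) for _ in values]
    return [(v - lo) / (hi - lo) * (out_max - out_min) + out_min for v in values]

# Hypothetical column of lengths.
print(min_max_scale([10.0, 20.0, 50.0]))
```

Because the result depends on the column's minimum and maximum, a single extreme value can compress all the other scaled values toward one end of the range.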

In [None]:
# TODO: write your code to answer question 5