# DSC 650: Assignment 10.2
### Chase Lemons

###### Build a Classification Model

In this exercise, you will fit a binary logistic regression model to the baby name dataset you used in the previous exercise. This model will predict the sex of a person based on their age, name, and state they were born in. To train the model, you will use the data found in baby-names/names-classifier.

In [17]:
from pyspark.sql import SparkSession

from pyspark.ml.feature import StringIndexer
# OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.x renamed it to OneHotEncoder.
from pyspark.ml.feature import OneHotEncoderEstimator
from pyspark.ml.feature import VectorAssembler



###### Prepare Input Features

First, you will need to prepare each of the input features. While age is a numeric feature, state and name are not. These need to be converted into numeric vectors before you can train the model. Use a StringIndexer along with the OneHotEncoderEstimator to convert the name, state, and sex columns into numeric vectors. Use the VectorAssembler to combine the name, state, and age vectors into a single features vector. Your final dataset should contain a column called features containing the prepared vector and a column called label containing the sex of the person.

In [9]:
# Read all of the Parquet files. Parquet stores its own schema, so the
# CSV-style options inferSchema and header are not needed here.

file_location = "names_classifier/*.parquet"

spark = SparkSession.builder.appName("week10").getOrCreate()

df = spark.read.parquet(file_location)

df.printSchema()

root
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: integer (nullable = true)



In [14]:
# Using StringIndexer to change the string fields to numeric

indexer = StringIndexer(inputCol="name", outputCol="nameIndex")
df1 = indexer.fit(df).transform(df)

indexer = StringIndexer(inputCol="state", outputCol="stateIndex")
df1 = indexer.fit(df1).transform(df1)

indexer = StringIndexer(inputCol="sex", outputCol="label")
df1 = indexer.fit(df1).transform(df1)

df1.show()

+------+-----+---+---+---------+----------+-----+
|  name|state|sex|age|nameIndex|stateIndex|label|
+------+-----+---+---+---------+----------+-----+
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|
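
By default, StringIndexer orders values by descending frequency: the most frequent value gets index 0.0, the next gets 1.0, and so on. Under that rule, the label 0.0 shown for M would mean M is the more frequent value in the sex column. The rule can be sketched in plain Python on a made-up column (not the assignment data). The helper name `frequency_index` is hypothetical and not part of Spark, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct value to a float index, most frequent first.

    Mimics StringIndexer's default frequency-descending ordering.
    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical toy column, not the real baby-name data.
sex_column = ["M", "F", "M", "M", "F", "M"]

sex_index = frequency_index(sex_column)
labels = [sex_index[s] for s in sex_column]

print(sex_index)
print(labels)
```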


In [15]:
# Use OneHotEncoderEstimator to turn the numeric index columns created in the previous cell into vectors.

encoder = OneHotEncoderEstimator(inputCols=["nameIndex", "stateIndex","label"],outputCols=["nameVec", "stateVec","sexVec"])
model = encoder.fit(df1)
encoded = model.transform(df1)

encoded.show()

+------+-----+---+---+---------+----------+-----+-----------------+--------------+-------------+
|  name|state|sex|age|nameIndex|stateIndex|label|          nameVec|      stateVec|       sexVec|
+------+-----+---+---+---------+----------+-----+-----------------+--------------+-------------+
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|(31313,[8],[1.0])|(50,[3],[1.0])|(1,[0],[1.0])|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|(31313,[8],[1.0])|(50,[3],[1.0])|(1,[0],[1.0])|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|(31313,[8],[1.0])|(50,[3],[1.0])|(1,[0],[1.0])|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|(31313,[8],[1.0])|(50,[3],[1.0])|(1,[0],[1.0])|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|(31313,[8],[1.0])|(50,[3],[1.0])|(1,[0],[1.0])|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|(31313,[8],[1.0])|(50,[3],[1.0])|(1,[0],[1.0])|
|Joseph|   PA|  M| 26|      8.0|       3.0|  0.0|(31313,[8],[1.0])|(50,[3],[1.0])|(1,[0],[1.0])|
|Joseph|   PA|  M| 26|      8.

In [16]:
# Combine the name and state vectors with the numeric age column into a single features column; the indexed sex column is kept as label.
assembler = VectorAssembler(inputCols=["nameVec", "stateVec","age"],outputCol="features")

output = assembler.transform(encoded)

df_input_features = output.select("features", "label")
df_input_features.show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
|(31364,[8,31316,3...|  0.0|
+--------------------+-----+
only showing top 20 rows
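
The vector sizes printed above fit together. OneHotEncoderEstimator drops the last category by default (`dropLast=True`), so a column with two values such as sex yields a vector of size 1, and a stateVec of size 50 would correspond to 51 distinct state values. VectorAssembler then concatenates its inputs in order, giving 31313 + 50 + 1 = 31364 slots. The sketch below reproduces that layout in plain Python for the Joseph / PA / 26 row. The helper `assemble_sparse` is hypothetical and not a Spark API.

```python
def assemble_sparse(blocks):
    """Concatenate (size, {index: value}) blocks into one sparse vector.

    Returns (total_size, indices, values), the same triple Spark prints
    for a SparseVector. Hypothetical helper, not a Spark API.
    """
    offset = 0
    indices, values = [], []
    for size, entries in blocks:
        for idx in sorted(entries):
            # Shift each block's local index by the slots used so far.
            indices.append(offset + idx)
            values.append(entries[idx])
        offset += size
    return offset, indices, values


# Sizes and indices taken from the encoded row shown earlier.
name_vec = (31313, {8: 1.0})   # nameVec: (31313,[8],[1.0])
state_vec = (50, {3: 1.0})     # stateVec: (50,[3],[1.0])
age = (1, {0: 26.0})           # a plain numeric column occupies one slot

size, indices, values = assemble_sparse([name_vec, state_vec, age])
print(size, indices, values)
```

The state entry lands at 31313 + 3 = 31316, matching the second index visible in the truncated features column, and age occupies the final slot, 31363.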



###### Fit and Evaluate the Model

Fit a logistic regression model with the following parameters: LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8). Provide the area under the ROC curve for the model.

In [18]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol = 'features', labelCol = 'label', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(df_input_features)

print(lrModel.summary.areaUnderROC)



0.5
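
An area under the ROC curve of 0.5 is the value a classifier gets when its scores carry no ranking information, for example when every row receives the same probability. The sketch below computes AUC from its pairwise definition on made-up labels and scores (not this run's data) to show that constant scores give exactly 0.5. The output above does not establish whether that is what happened here. Inspecting `lrModel.coefficients` would be one way to check whether the regularization left any non-zero weights.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts 1 when the positive scores higher, 0.5 on a tie, 0 otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


# Hypothetical labels and scores for illustration only.
labels = [0, 0, 1, 1, 0, 1]
constant_scores = [0.4] * len(labels)                 # same score for every row
informative_scores = [0.1, 0.3, 0.8, 0.7, 0.2, 0.9]   # positives always score higher

auc_constant = auc_pairwise(labels, constant_scores)
auc_informative = auc_pairwise(labels, informative_scores)
print(auc_constant, auc_informative)
```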
