# Sparkify Project

This project predicts user churn for a music streaming service called Sparkify.

### Notebook Content
**1. Get started**
- Import packages
- Build spark session

**2. Extract data**
- Extract data from a local .json file or from S3 storage

**3. Clean data**
- Drop null values
- Add column indicating churn (label)
- Aggregate data on user level
- Ensure numeric value

**4. Exploratory Data Analysis**
- Show aggregates of each column

**5. Feature Engineering**
- One Hot Encoding for categorical features
- Assembling feature vector

**6. Modeling**
- Build pipeline (random forest + logistic regression)
- Find best model with grid search
- Evaluate accuracy and F2

## 1. Get Started
Import the required packages from [PySpark SQL](https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html) and [MLlib](https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html).

In [303]:
# SQL
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql import types as t
from pyspark.sql.functions import (col, count, from_unixtime, isnan, lit,
                                   to_timestamp, trunc, udf, when)
from pyspark.sql.types import IntegerType

# MLlib
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import (CountVectorizer, IDF, Normalizer, OneHotEncoder, PCA,
                                RegexTokenizer, StandardScaler, StopWordsRemover,
                                StringIndexer, VectorAssembler)
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from pyspark.mllib.evaluation import MulticlassMetrics

In [304]:
spark = SparkSession \
    .builder \
    .appName("Sparkify") \
    .getOrCreate()

sparkContext = spark.sparkContext

## 2. Extract Data
Load the event data from a local file or from Amazon S3.

In [305]:
# For running on an EC2 cluster
event_data_s3n_small = "s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json"
event_data_s3n_full = "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"

# For local use
event_data_local = "data/mini_sparkify_event_data.json"

df = spark.read.json(event_data_local)
df.head(2)

# Use a very small subset of the data for trying out functions
#df = df.limit(1000)

[Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', itemInSession=50, lastName='Freeman', length=277.89016, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30'),
 Row(artist='Five Iron Frenzy', auth='Logged In', firstName='Micah', gender='M', itemInSession=79, lastName='Long', length=236.09424, level='free', location='Boston-Cambridge-Newton, MA-NH', method='PUT', page='NextSong', registration=1538331630000, sessionId=8, song='Canada', status=200, ts=1538352180000, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.103 Safari/537.36"', userId='9')]

In [306]:
# Show columns
df.columns

['artist',
 'auth',
 'firstName',
 'gender',
 'itemInSession',
 'lastName',
 'length',
 'level',
 'location',
 'method',
 'page',
 'registration',
 'sessionId',
 'song',
 'status',
 'ts',
 'userAgent',
 'userId']

## 3. Clean Data
Check for empty strings and missing values, then create a dataframe that aggregates the data at the user level.

In [307]:
# Show how many empty-string values each column has
for column in df.columns:
    print("Column " + column + " has " + str(df.where(df[column] == "").count()) + " empty values.")

Column artist has 0 empty values.
Column auth has 0 empty values.
Column firstName has 0 empty values.
Column gender has 0 empty values.
Column itemInSession has 0 empty values.
Column lastName has 0 empty values.
Column length has 0 empty values.
Column level has 0 empty values.
Column location has 0 empty values.
Column method has 0 empty values.
Column page has 0 empty values.
Column registration has 0 empty values.
Column sessionId has 0 empty values.
Column song has 0 empty values.
Column status has 0 empty values.
Column ts has 0 empty values.
Column userAgent has 0 empty values.
Column userId has 8346 empty values.


In [308]:
print("Remove rows with empty userIds.")
df = df.where(df.userId != "")
print("Check how many empty userIds are left: " + \
      str(len(df.where(df.userId == "").collect())))

Remove rows with empty userIds.
Check how many empty userIds are left: 0


In [309]:
# Note: isnan() only flags NaN values; null entries are not counted by this check
print("NaN values per column:")
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show(vertical=True)

NaN values per column:
-RECORD 0------------
 artist        | 0   
 auth          | 0   
 firstName     | 0   
 gender        | 0   
 itemInSession | 0   
 lastName      | 0   
 length        | 0   
 level         | 0   
 location      | 0   
 method        | 0   
 page          | 0   
 registration  | 0   
 sessionId     | 0   
 song          | 0   
 status        | 0   
 ts            | 0   
 userAgent     | 0   
 userId        | 0   
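Because `isnan` only flags floating-point NaN, the zeros above do not by themselves rule out null entries. As a rough plain-Python sketch of the three kinds of "missing" that the cleaning step has to tell apart (the sample rows and the `missing_kind` helper are hypothetical and not part of the notebook):

```python
import math

def missing_kind(value):
    """Classify a cell as 'null', 'nan', 'empty' or 'present'."""
    if value is None:
        return "null"
    if isinstance(value, float) and math.isnan(value):
        return "nan"
    if value == "":
        return "empty"
    return "present"

# Hypothetical rows, loosely shaped like the event log
rows = [
    {"userId": "30", "artist": "Martha Tilston", "length": 277.89016},
    {"userId": "", "artist": None, "length": None},
    {"userId": "9", "artist": None, "length": float("nan")},
]

# Count each kind of value per column
counts = {}
for row in rows:
    for column, value in row.items():
        key = (column, missing_kind(value))
        counts[key] = counts.get(key, 0) + 1
```

In PySpark the null case is usually checked with `col(c).isNull()`, which is a separate test from `isnan(c)`.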



In [310]:
df.select("page").dropDuplicates().orderBy(df.page.asc()).show()

+--------------------+
|                page|
+--------------------+
|               About|
|          Add Friend|
|     Add to Playlist|
|              Cancel|
|Cancellation Conf...|
|           Downgrade|
|               Error|
|                Help|
|                Home|
|              Logout|
|            NextSong|
|         Roll Advert|
|       Save Settings|
|            Settings|
|    Submit Downgrade|
|      Submit Upgrade|
|         Thumbs Down|
|           Thumbs Up|
|             Upgrade|
+--------------------+



In [311]:
print("Example event flow for one particular user:")
df.where(df.userId == 100001).select("page", "ts").orderBy("ts").show(1000)

Example event flow for one particular user:
+--------------------+-------------+
|                page|           ts|
+--------------------+-------------+
|                Home|1538376504000|
|            NextSong|1538376509000|
|         Roll Advert|1538376542000|
|            NextSong|1538376747000|
|         Roll Advert|1538376783000|
|            NextSong|1538377349000|
|            NextSong|1538377748000|
|            NextSong|1538377932000|
|            NextSong|1538378245000|
|            NextSong|1538378483000|
|            NextSong|1538378687000|
|            NextSong|1538378877000|
|            NextSong|1538379041000|
|            NextSong|1538379207000|
|         Roll Advert|1538379230000|
|            NextSong|1538379420000|
|            NextSong|1538379668000|
|            NextSong|1538380000000|
|            NextSong|1538380179000|
|              Logout|1538380180000|
|                Home|1538380429000|
|            NextSong|1538380481000|
|            NextSong|153838082

#### Get users that churned
Create a column `churned` to use as the label for the model. Churn is defined by `Cancellation Confirmation` events, which occur for both paid and free users.

In [312]:
# Get users that churned
churned_users = df.where(df.page == "Cancellation Confirmation").select("userId").dropDuplicates()
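The labeling rule can be pictured in plain Python, independent of Spark. This is only a sketch on hypothetical `(userId, page)` events: a user counts as churned if at least one of their events is a cancellation confirmation, and every other user gets label 0.

```python
# Hypothetical (userId, page) events; the real data has many more columns
events = [
    ("125", "NextSong"),
    ("125", "Cancel"),
    ("125", "Cancellation Confirmation"),
    ("124", "NextSong"),
    ("124", "Logout"),
]

# Users with at least one cancellation confirmation count as churned
churned_ids = {user for user, page in events if page == "Cancellation Confirmation"}

# Every user gets a 0/1 label; users without the event default to 0
labels = {user: int(user in churned_ids) for user, _ in events}
```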

#### Add a boolean column to df indicating user churn

In [313]:
# Flag every churned user with a constant 1; users without the flag become 0 later via na.fill(0)
churned_users = churned_users.withColumn("churned", lit(1))
churned_users.show(5)

# Join new column to df
df = df.join(churned_users, on=['userId'], how='left')

+------+-------+
|userId|churned|
+------+-------+
|   125|      1|
|    51|      1|
|    54|      1|
|100014|      1|
|   101|      1|
+------+-------+
only showing top 5 rows



#### Aggregate on user level

In [314]:
df.columns

['userId',
 'artist',
 'auth',
 'firstName',
 'gender',
 'itemInSession',
 'lastName',
 'length',
 'level',
 'location',
 'method',
 'page',
 'registration',
 'sessionId',
 'song',
 'status',
 'ts',
 'userAgent',
 'churned']

In [315]:
# Create user aggregation and add churn information
df_user_agg = df.select(["userId", "gender", "churned"]).groupBy(["userId", "gender"]).max()
df_user_agg = df_user_agg.na.fill(0)
df_user_agg = df_user_agg.withColumnRenamed("max(churned)", "churned")

In [316]:
df_user_agg.show(5)

+------+------+-------+
|userId|gender|churned|
+------+------+-------+
|100010|     F|      0|
|200002|     M|      0|
|   125|     M|      1|
|   124|     F|      0|
|    51|     M|      1|
+------+------+-------+
only showing top 5 rows



In [317]:
# Calculate songs per user
df_songs_per_user = df.where(df.page == "NextSong").select(["userId", "page"]).groupBy(["userId"]).count()
df_songs_per_user = df_songs_per_user.withColumnRenamed("count", "songs_listened")

In [318]:
# Calculate ads per user
df_ads_per_user = df.where(df.page == "Roll Advert").select(["userId", "page"]).groupBy(["userId"]).count()
df_ads_per_user = df_ads_per_user.withColumnRenamed("count", "adverts_rolled")

In [319]:
# Calculate friends added per user
df_friends_added_per_user = df.where(df.page == "Add Friend").select(["userId", "page"]).groupBy(["userId"]).count()
df_friends_added_per_user = df_friends_added_per_user.withColumnRenamed("count", "friends_added")

In [320]:
# Calculate upgrades to premium per user
df_upgrades_per_user = df.where(df.page == "Submit Upgrade").select(["userId", "page"]).groupBy(["userId"]).count()
df_upgrades_per_user = df_upgrades_per_user.withColumnRenamed("count", "times_upgraded")

In [321]:
# Calculate downgrades for free per user
df_downgrades_per_user = df.where(df.page == "Submit Downgrade").select(["userId", "page"]).groupBy(["userId"]).count()
df_downgrades_per_user = df_downgrades_per_user.withColumnRenamed("count", "times_downgraded")

In [322]:
# Calculate playlist additions per user
df_playlist_adds_per_user = df.where(df.page == "Add to Playlist").select(["userId", "page"]).groupBy(["userId"]).count()
df_playlist_adds_per_user = df_playlist_adds_per_user.withColumnRenamed("count", "playlist_additions")

In [323]:
# Calculate errors per user
df_errors_per_user = df.where(df.page == "Error").select(["userId", "page"]).groupBy(["userId"]).count()
df_errors_per_user = df_errors_per_user.withColumnRenamed("count", "errors")

In [324]:
# Calculate help access per user
df_help_per_user = df.where(df.page == "Help").select(["userId", "page"]).groupBy(["userId"]).count()
df_help_per_user = df_help_per_user.withColumnRenamed("count", "help_access_count")

In [325]:
# Calculate logouts per user
df_logouts_per_user = df.where(df.page == "Logout").select(["userId", "page"]).groupBy(["userId"]).count()
df_logouts_per_user = df_logouts_per_user.withColumnRenamed("count", "logouts")

In [326]:
# Calculate thumbs up per user
df_thumbs_up_per_user = df.where(df.page == "Thumbs Up").select(["userId", "page"]).groupBy(["userId"]).count()
df_thumbs_up_per_user = df_thumbs_up_per_user.withColumnRenamed("count", "thumbs_up_given")

In [327]:
# Calculate thumbs down per user
df_thumbs_down_per_user = df.where(df.page == "Thumbs Down").select(["userId", "page"]).groupBy(["userId"]).count()
df_thumbs_down_per_user = df_thumbs_down_per_user.withColumnRenamed("count", "thumbs_down_given")

In [328]:
# Calculate time registered ("user age", in milliseconds): last event timestamp minus registration timestamp
df_age_per_user = df.groupBy("userId").agg(f.max('ts').alias("last_login"),
    f.min('registration').alias("registration"))

df_age_per_user = df_age_per_user.withColumn("user_age", df_age_per_user.last_login - df_age_per_user.registration)

df_age_per_user.show(1)

+------+-------------+-------------+----------+
|userId|   last_login| registration|  user_age|
+------+-------------+-------------+----------+
|100010|1542823952000|1538016340000|4807612000|
+------+-------------+-------------+----------+
only showing top 1 row
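Both `ts` and `registration` are Unix timestamps in milliseconds, so `user_age` is a duration in milliseconds. A small sketch of the conversion to days, using the values shown for user 100010 (the `ms_to_days` helper is a hypothetical name for illustration):

```python
MS_PER_DAY = 1000 * 3600 * 24

def ms_to_days(milliseconds):
    """Convert a duration in milliseconds to days."""
    return milliseconds / MS_PER_DAY

# Values taken from the row shown for user 100010
last_event_ms = 1542823952000
registration_ms = 1538016340000

user_age_ms = last_event_ms - registration_ms
user_age_days = ms_to_days(user_age_ms)
print(f"user_age: {user_age_ms} ms = {user_age_days:.2f} days")
```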



In [329]:
# Join aggregates together to one table
df_user_agg = df_user_agg.join(df_songs_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_ads_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_friends_added_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_playlist_adds_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_errors_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_help_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_logouts_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_thumbs_up_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_thumbs_down_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_upgrades_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_downgrades_per_user, on=['userId'], how='left')
df_user_agg = df_user_agg.join(df_age_per_user, on=['userId'], how='left')

# Fill null values created after join with 0, as they represent missing counts
df_user_agg = df_user_agg.na.fill(0)
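The repeated filter, group, count, join and fill steps above amount to building one count per user and page type, with 0 where a user never triggered that page. A plain-Python sketch of the same idea on hypothetical events (only three of the page types are mapped here, and `PAGE_TO_FEATURE` is an illustrative name):

```python
from collections import Counter, defaultdict

# Map page names to the feature column names used in the notebook
PAGE_TO_FEATURE = {
    "NextSong": "songs_listened",
    "Roll Advert": "adverts_rolled",
    "Add Friend": "friends_added",
}

# Hypothetical (userId, page) events
events = [
    ("125", "NextSong"),
    ("125", "Roll Advert"),
    ("124", "NextSong"),
    ("124", "NextSong"),
    ("124", "Add Friend"),
    ("124", "Home"),
]

# Count events per user and page
page_counts = defaultdict(Counter)
for user, page in events:
    page_counts[user][page] += 1

# A Counter returns 0 for pages a user never visited,
# which plays the role of na.fill(0) after the left joins
user_features = {
    user: {feature: counter[page] for page, feature in PAGE_TO_FEATURE.items()}
    for user, counter in page_counts.items()
}
```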

In [330]:
df_user_agg.where(df_user_agg.userId == 125).show(1, vertical=True)

-RECORD 0---------------------------
 userId             | 125           
 gender             | M             
 churned            | 1             
 songs_listened     | 8             
 adverts_rolled     | 1             
 friends_added      | 0             
 playlist_additions | 0             
 errors             | 0             
 help_access_count  | 0             
 logouts            | 0             
 thumbs_up_given    | 0             
 thumbs_down_given  | 0             
 times_upgraded     | 0             
 times_downgraded   | 0             
 last_login         | 1539318918000 
 registration       | 1533157139000 
 user_age           | 6161779000    



## 4. Explore Data


In [331]:
print('Number of distinct users:')
print(df.dropDuplicates(['userId']).count())
print("\nNumber of churned users:")
print(churned_users.count())

Number of distinct users:
225

Number of churned users:
52
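These two counts imply a fairly imbalanced label. As a quick worked calculation (a sketch on the full dataset; the share in any particular train or test split may differ), a trivial model that always predicts "no churn" would already reach the baseline accuracy computed here, which is useful context for accuracy figures reported later:

```python
n_users = 225    # distinct users, from the output above
n_churned = 52   # churned users, from the output above

n_retained = n_users - n_churned
churn_rate = n_churned / n_users

# Accuracy of a trivial model that always predicts "no churn"
baseline_accuracy = n_retained / n_users

print(f"Churn rate: {churn_rate:.1%}")
print(f"Always-predict-0 accuracy: {baseline_accuracy:.1%}")
```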


#### Observe different aggregates

In [332]:
df_user_agg.groupBy("churned").agg(f.mean('songs_listened'), \
    f.mean('adverts_rolled'), \
    f.mean('friends_added'), \
    f.mean('playlist_additions'), \
    f.mean('errors')).show()

+-------+-------------------+-------------------+------------------+-----------------------+------------------+
|churned|avg(songs_listened)|avg(adverts_rolled)|avg(friends_added)|avg(playlist_additions)|       avg(errors)|
+-------+-------------------+-------------------+------------------+-----------------------+------------------+
|      1|  699.8846153846154| 18.596153846153847| 12.23076923076923|      19.96153846153846|0.6153846153846154|
|      0| 1108.1734104046243|  17.14450867052023|21.046242774566473|     31.722543352601157|1.2716763005780347|
+-------+-------------------+-------------------+------------------+-----------------------+------------------+



In [333]:
df_user_agg.groupBy("churned").agg(f.mean('thumbs_up_given'), \
    f.mean('thumbs_down_given'), \
    f.mean('times_upgraded'), \
    f.mean('times_downgraded'), \
    f.mean('logouts'), \
    f.mean('help_access_count')).show()

+-------+--------------------+----------------------+-------------------+---------------------+------------------+----------------------+
|churned|avg(thumbs_up_given)|avg(thumbs_down_given)|avg(times_upgraded)|avg(times_downgraded)|      avg(logouts)|avg(help_access_count)|
+-------+--------------------+----------------------+-------------------+---------------------+------------------+----------------------+
|      1|               35.75|     9.538461538461538| 0.6153846153846154|  0.17307692307692307|10.634615384615385|     4.596153846153846|
|      0|   61.80346820809248|     11.84971098265896| 0.7341040462427746|  0.31213872832369943| 15.45086705202312|     7.023121387283237|
+-------+--------------------+----------------------+-------------------+---------------------+------------------+----------------------+



In [334]:
# Convert Unix timestamps (milliseconds) to a human-readable date format
df_user_agg.groupBy("churned").agg(\
    from_unixtime(f.mean('last_login') / 1000 ,"yyyy-MM-dd HH:mm:ss:SSS").alias("avg_last_login"),
    from_unixtime(f.mean('registration') / 1000 ,"yyyy-MM-dd HH:mm:ss:SSS").alias("avg_registration"), \
    (f.mean('user_age') / 1000/3600/24).alias("avg_user_age_days")).show()

+-------+--------------------+--------------------+------------------+
|churned|      avg_last_login|    avg_registration| avg_user_age_days|
+-------+--------------------+--------------------+------------------+
|      1|2018-10-27 14:03:...|2018-08-31 06:42:...|57.305992922008535|
|      0|2018-11-24 12:26:...|2018-08-29 22:32:...| 86.62061938021837|
+-------+--------------------+--------------------+------------------+



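## 5. Feature Engineering
One-hot encode the categorical column (gender) and assemble the required subset of features into a feature vector. Note that the feature list below uses the indexed column `genderIndex` rather than the one-hot vector `genderIndexVec`; for a two-valued category both carry the same information.
The output feature column is not yet scaled and is therefore called `features_raw`.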

#### Select required features

In [335]:
# Select required features
feature_list = ["genderIndex", "songs_listened", "adverts_rolled", "friends_added", "playlist_additions",\
               "errors", "help_access_count", "logouts", "thumbs_up_given", "thumbs_down_given",\
              "times_upgraded", "times_downgraded", "user_age"]

#### One Hot Encoding the Gender Column

In [336]:
# Index gender as a numeric column, then one-hot encode the index
stringIndexer = StringIndexer(inputCol="gender", outputCol="genderIndex")
model = stringIndexer.fit(df_user_agg)
indexed = model.transform(df_user_agg)

# In the cloud (OneHotEncoder is applied directly as a transformer there)
#encoder = OneHotEncoder(inputCol="genderIndex", outputCol="genderIndexVec")
#df_features = encoder.transform(indexed)

# Local (OneHotEncoder has to be fit before it can transform)
encoder = OneHotEncoder(inputCol="genderIndex", outputCol="genderIndexVec")
encoder.setDropLast(False)
ohe = encoder.fit(indexed)
df_features = ohe.transform(indexed)

In [337]:
# https://stackoverflow.com/questions/49632830/pyspark-output-of-onehotencoder-looks-odd
df_features.select("gender", "genderIndex", "genderIndexVec").show(5)

+------+-----------+--------------+
|gender|genderIndex|genderIndexVec|
+------+-----------+--------------+
|     F|        1.0| (2,[1],[1.0])|
|     M|        0.0| (2,[0],[1.0])|
|     M|        0.0| (2,[0],[1.0])|
|     F|        1.0| (2,[1],[1.0])|
|     M|        0.0| (2,[0],[1.0])|
+------+-----------+--------------+
only showing top 5 rows
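The `genderIndexVec` values are sparse vectors printed as `(size,[indices],[values])`. A minimal plain-Python sketch of how that notation expands to a dense vector (the `sparse_to_dense` helper is hypothetical and only for illustration):

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# (2,[1],[1.0]) is shown for gender F, (2,[0],[1.0]) for gender M
female = sparse_to_dense(2, [1], [1.0])
male = sparse_to_dense(2, [0], [1.0])
print(female, male)
```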



#### Feature Vector Assembler

In [338]:
# Assemble columns into feature vector
assembler = VectorAssembler(
    inputCols=feature_list,
    outputCol="features_raw")

output = assembler.transform(df_features)

feature_str = ", ".join(feature_list)

print("Assembled columns " + feature_str + " to vector column 'features_raw'")

# Rename churned column to label
output = output.withColumnRenamed("churned", "label")
df_features = output.select("features_raw", "label")
df_features.show(truncate=False)

Assembled columns genderIndex, songs_listened, adverts_rolled, friends_added, playlist_additions, errors, help_access_count, logouts, thumbs_up_given, thumbs_down_given, times_upgraded, times_downgraded, user_age to vector column 'features_raw'
+-------------------------------------------------------------------------+-----+
|features_raw                                                             |label|
+-------------------------------------------------------------------------+-----+
|[1.0,275.0,52.0,4.0,7.0,0.0,2.0,5.0,17.0,5.0,0.0,0.0,4.807612E9]         |0    |
|[0.0,387.0,7.0,4.0,8.0,0.0,2.0,5.0,21.0,6.0,1.0,0.0,6.054448E9]          |0    |
|(13,[1,2,12],[8.0,1.0,6.161779E9])                                       |1    |
|[1.0,4079.0,4.0,74.0,118.0,6.0,23.0,59.0,171.0,41.0,0.0,0.0,1.1366431E10]|0    |
|[0.0,2111.0,0.0,28.0,52.0,1.0,12.0,24.0,100.0,21.0,0.0,0.0,1.680985E9]   |1    |
|[0.0,150.0,16.0,1.0,5.0,1.0,1.0,3.0,7.0,1.0,0.0,0.0,6.288035E9]          |0    |
|[0.0,1914.0,1.0,31

## 6. Modeling
Split the full dataset into train and test sets (70/30). Validation is handled by 2-fold cross-validation on the training set.
Test out two ML models:
- Random Forest Classifier
- Logistic Regression

Evaluate the accuracy of each model and tune hyperparameters with a grid search.

In [339]:
train, test = df_features.select(["features_raw", "label"]).randomSplit([0.7, 0.3], seed=42)

#### Apply Standard Scaler to Feature Vector (only use training data for fitting)

In [340]:
scaler = StandardScaler(inputCol="features_raw", outputCol="features",
                        withStd=True, withMean=True)

# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(train)

# Standardize each feature to zero mean and unit standard deviation
train = scalerModel.transform(train)
test = scalerModel.transform(test)

# Select only relevant columns
train = train.select("features", "label")
test = test.select("features", "label")

# Show example train and test data
train.show(1, vertical=True, truncate=False)
test.show(1, vertical=True, truncate=False)

-RECORD 0----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 features | [-0.9555298793060595,-0.5439738915591357,-0.49468011245263704,-0.6931110808050502,-0.6133286470638624,-0.7825327949102233,-0.5951993048674474,-0.5809558577810566,-0.5156540944278364,-0.38763499730063644,0.36187874786029606,-0.48950888391319386,-0.2987138412644498] 
 label    | 0                                                                                                                                                                                                                                                                        
only showing top 1 row

-RECORD 0-------------------------------------------------------------------------------------------------------------------------------------
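The reason for fitting the scaler on the training data only is that the test set should not influence the statistics used for scaling. A plain-Python sketch of the idea for a single feature column, on hypothetical values (using the sample standard deviation is an assumption of this sketch, and `fit_scaler` and `transform` are illustrative names):

```python
import statistics

def fit_scaler(column):
    """Return the mean and sample standard deviation of a training column."""
    return statistics.mean(column), statistics.stdev(column)

def transform(column, mean, std):
    """Standardize values with statistics computed elsewhere (the train set)."""
    return [(x - mean) / std for x in column]

# Hypothetical single-feature columns
train_col = [1.0, 2.0, 3.0]
test_col = [4.0]

# Statistics come from the training data only and are reused for the test data
mean, std = fit_scaler(train_col)
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)
```

The scaled test value may fall outside the range seen in training, which is expected when the statistics are not recomputed on the test set.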

## Model 1 - Random Forest

In [341]:
# Configure Random Forest classifier
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)

# Wrap the forest in a Pipeline (the only stage here)
pipeline = Pipeline(stages=[rf])

### Tune Model

In [342]:
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees,[4, 8, 12]) \
    .addGrid(rf.maxDepth,[5, 10, 20, 30]) \
    .addGrid(rf.maxBins,[100, 200]) \
    .build()

#paramGrid = ParamGridBuilder() \
#    .addGrid(rf.numTrees,[4]) \
#    .addGrid(rf.maxDepth,[5]) \
#    .addGrid(rf.maxBins,[100]) \
#    .build()


crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)

In [343]:
cvModel_q1 = crossval.fit(train)

Exception ignored in: <function JavaModelWrapper.__del__ at 0x000001F1A7BE30D0>
Traceback (most recent call last):
  File "c:\python3\lib\site-packages\pyspark\mllib\common.py", line 137, in __del__
    self._sc._gateway.detach(self._java_model)
AttributeError: 'MulticlassMetrics' object has no attribute '_sc'


In [344]:
cvModel_q1.avgMetrics

[0.7730279496730672,
 0.7730279496730672,
 0.7763499697106194,
 0.7763499697106194,
 0.7763499697106194,
 0.7763499697106194,
 0.7763499697106194,
 0.7763499697106194,
 0.767055156717452,
 0.767055156717452,
 0.7551783957078579,
 0.7551783957078579,
 0.7551783957078579,
 0.7551783957078579,
 0.7551783957078579,
 0.7551783957078579,
 0.7695126348382927,
 0.7695126348382927,
 0.7355793976426299,
 0.7355793976426299,
 0.7355793976426299,
 0.7355793976426299,
 0.7355793976426299,
 0.7355793976426299]
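The 24 values in `avgMetrics` correspond to the 24 parameter combinations in the grid (3 × 4 × 2). A plain-Python sketch that enumerates the same combinations; the ordering produced here is not claimed to match the order of `avgMetrics`:

```python
from itertools import product

# Same candidate values as in the ParamGridBuilder call
num_trees = [4, 8, 12]
max_depth = [5, 10, 20, 30]
max_bins = [100, 200]

# Cartesian product of all candidate values
grid = [
    {"numTrees": n, "maxDepth": d, "maxBins": b}
    for n, d, b in product(num_trees, max_depth, max_bins)
]

# With numFolds=2, every combination is trained once per fold
num_folds = 2
n_cv_fits = len(grid) * num_folds
print(len(grid), "combinations,", n_cv_fits, "cross-validation fits")
```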

In [345]:
results = cvModel_q1.transform(test)

In [346]:
results.show(2)

+--------------------+-----+-------------+-----------+----------+
|            features|label|rawPrediction|probability|prediction|
+--------------------+-----+-------------+-----------+----------+
|[-0.9555298793060...|    1|    [4.0,0.0]|  [1.0,0.0]|       0.0|
|[-0.9555298793060...|    0|    [4.0,0.0]|  [1.0,0.0]|       0.0|
+--------------------+-----+-------------+-----------+----------+
only showing top 2 rows



In [347]:
# Get best pipeline and params
bestPipeline = cvModel_q1.bestModel
bestRFModel = bestPipeline.stages[-1]

maxBins_best = bestRFModel._java_obj.getMaxBins()
maxDepth_best = bestRFModel._java_obj.getMaxDepth()
numTrees_best = bestRFModel._java_obj.getNumTrees()

print("The best performing params: \n" + \
     "numTrees: " + str(numTrees_best) + "\n" + \
     "maxDepth: " + str(maxDepth_best) + "\n" + \
     "maxBins: " + str(maxBins_best) + "."
)

The best performing params: 
numTrees: 4
maxDepth: 10
maxBins: 100.


### Compute Accuracy of Best Model

#### Accuracy (all users)
How well was churn predicted among all users?

In [348]:
count_correct = results.filter(results.label == results.prediction).count()
count_all = results.count()
accuracy = count_correct/count_all * 100
print("The model has an overall accuracy of " + str(int(accuracy)) + "%.")

The model has an overall accuracy of 66%.


#### Accuracy (churned users)
How well was churn predicted among churned users? This is the recall of the churn class: the share of actual churners that the model flagged.

In [349]:
result_churners = results.filter(results.label == 1)
count_correct = result_churners.filter(results.label == results.prediction).count()
count_all = result_churners.count()
accuracy = count_correct/count_all * 100
print("The model has an accuracy of " + str(int(accuracy)) + "% for predicting churn correctly.")

The model has an accuracy of 16% for predicting churn correctly.
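The two figures above are overall accuracy and recall on the churn class. A plain-Python sketch of how these relate to precision and to the F2 score named in the notebook outline, computed on hypothetical labels and predictions (not the notebook's results; the helper names are illustrative):

```python
def confusion_counts(labels, predictions, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    return tp, fp, fn, tn

def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical labels (1 = churned) and predictions
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(labels, predictions)
accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0  # "accuracy among churned users"
f2 = f_beta(precision, recall, beta=2)
```

With imbalanced labels, accuracy can look reasonable while recall stays low, which is why a recall-weighted score such as F2 is a common choice for churn prediction.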


## Model 2 - Logistic Regression

### Build Pipeline

In [350]:
lr =  LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)
pipeline = Pipeline(stages=[lr])

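### Tune Model
The `MulticlassClassificationEvaluator` used below defaults to the F1 metric, so the grid search already selects the model by F1.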

In [351]:
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam,[0.0, 0.1]) \
    .addGrid(lr.maxIter,[5, 10, 50, 100]) \
    .build()


crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)

In [352]:
cvModel_q2 = crossval.fit(train)

In [353]:
cvModel_q2.avgMetrics

[0.7589439389338912,
 0.7400005349308758,
 0.741290383181997,
 0.741290383181997,
 0.6929578374983036,
 0.6929578374983036,
 0.6929578374983036,
 0.6929578374983036]

In [354]:
results = cvModel_q2.transform(test)

In [355]:
results.show(2)

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[-0.9555298793060...|    1|[1.29115316421777...|[0.78434231022866...|       0.0|
|[-0.9555298793060...|    0|[1.41030407288604...|[0.80381389948080...|       0.0|
+--------------------+-----+--------------------+--------------------+----------+
only showing top 2 rows



In [356]:
# Get best pipeline and params
bestPipeline = cvModel_q2.bestModel
bestLRModel = bestPipeline.stages[-1]

regParam_best = bestLRModel._java_obj.getRegParam()
maxIter_best = bestLRModel._java_obj.getMaxIter()

print("The best performing params: \n" + \
     "regParam: " + str(regParam_best) + "\n" + \
     "maxIter: " + str(maxIter_best)
)

The best performing params: 
regParam: 0.0
maxIter: 5


### Compute Accuracy of Best Model

#### Accuracy (all users)

In [357]:
count_correct = results.filter(results.label == results.prediction).count()
count_all = results.count()
accuracy = count_correct/count_all * 100
print("The model has an overall accuracy of " + str(int(accuracy)) + "%.")

The model has an overall accuracy of 76%.


#### Accuracy (churned users)

In [358]:
result_churners = results.filter(results.label == 1)
count_correct = result_churners.filter(results.label == results.prediction).count()
count_all = result_churners.count()
accuracy = count_correct/count_all * 100
print("The model has an accuracy of " + str(int(accuracy)) + "% for predicting churn correctly.")

The model has an accuracy of 33% for predicting churn correctly.

