In [2]:
import pyspark
import pandas as pd
from pyspark import SparkContext


In [3]:
sc = SparkContext(master="local[2]")
sc

In the output above, Spark UI is a link to the Spark dashboard, which is served at http://192.168.0.6:4040/ and keeps running in the background. Clicking the link opens the dashboard, which lists the jobs running on our machine. Currently, we have no running jobs, as shown:

### Creating SparkSession
Creating a SparkSession gives us a single entry point to Spark's functionality. We will use it for data analysis and for building our text classification model.



In [4]:
from pyspark.sql import SparkSession 

## Initializing the TextClassifier

Using the imported SparkSession we can now initialize our app.



In [6]:
spark = SparkSession.builder.appName("TextClassifier").getOrCreate()

We use the builder.appName() method to give a name to our app.

After initializing our app, we can now view our launched UI to see the running jobs. The running jobs are shown below:

We use the Udemy dataset, which lists the courses offered by Udemy. Among other columns, it contains each course's title and the subject it belongs to.

In [7]:
df = spark.read.csv("udemy_dataset.csv",header=True,inferSchema=True)

In [8]:
df.show()

+---+---------+--------------------+--------------------+-------+-----+---------------+-----------+------------+------------------+----------------+--------------------+----------------+--------------------+
|_c0|course_id|        course_title|                 url|is_paid|price|num_subscribers|num_reviews|num_lectures|             level|content_duration| published_timestamp|         subject|  clean_course_title|
+---+---------+--------------------+--------------------+-------+-----+---------------+-----------+------------+------------------+----------------+--------------------+----------------+--------------------+
|  0|  1070968|Ultimate Investme...|https://www.udemy...|   True|  200|           2147|         23|          51|        All Levels|       1.5 hours|2017-01-18T20:58:58Z|Business Finance|Ultimate Investme...|
|  1|  1113822|Complete GST Cour...|https://www.udemy...|   True|   75|           2792|        923|         274|        All Levels|        39 hours|2017-03-09T16:34:20Z

In [9]:
df.columns

['_c0',
 'course_id',
 'course_title',
 'url',
 'is_paid',
 'price',
 'num_subscribers',
 'num_reviews',
 'num_lectures',
 'level',
 'content_duration',
 'published_timestamp',
 'subject',
 'clean_course_title']

We will only need the course_title and subject columns in building our model.

#### Selecting the needed columns
We first preview the course_title and subject columns, then keep only those two columns in a new DataFrame called data.



In [10]:
df.select('course_title','subject').show()

+--------------------+----------------+
|        course_title|         subject|
+--------------------+----------------+
|Ultimate Investme...|Business Finance|
|Complete GST Cour...|Business Finance|
|Financial Modelin...|Business Finance|
|Beginner to Pro -...|Business Finance|
|How To Maximize Y...|Business Finance|
|Trading Penny Sto...|Business Finance|
|Investing And Tra...|Business Finance|
|Trading Stock Cha...|Business Finance|
|Options Trading 3...|Business Finance|
|The Only Investme...|Business Finance|
|Forex Trading Sec...|Business Finance|
|Trading Options W...|Business Finance|
|Financial Managem...|Business Finance|
|Forex Trading Cou...|Business Finance|
|Python Algo Tradi...|Business Finance|
|Short Selling: Le...|Business Finance|
|Basic Technical A...|Business Finance|
|The Complete Char...|Business Finance|
|7 Deadly Mistakes...|Business Finance|
|Financial Stateme...|Business Finance|
+--------------------+----------------+
only showing top 20 rows



In [11]:
data = df.select('course_title','subject')

### Checking for missing values
We need to check our dataset for missing values, because a row without a subject cannot serve as a labeled training example.

We convert the DataFrame to pandas with the toPandas() method, count the missing values in each column, and then drop the rows whose subject is missing.

In [13]:
data.toPandas()[['course_title','subject']].isnull().sum()

course_title    0
subject         6
dtype: int64

In [14]:
# ('subject') is just a string, not a tuple; a list states the intent clearly
data = data.dropna(subset=['subject'])

Machine learning algorithms work only with numbers, so during this stage we have to convert the text into numeric values.
We import all the packages required for feature engineering:

In [15]:
from pyspark.ml.feature import Tokenizer,StopWordsRemover,CountVectorizer,IDF
from pyspark.ml.feature import StringIndexer

* **I'll take out a very small fraction of the dataset that will be used to test the performance of the model later.**

In [22]:
(data,test_data) = data.randomSplit((0.9,0.001),seed=42)
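
The weights (0.9, 0.001) do not sum to 1. According to the Spark documentation, randomSplit normalizes the weights in that case. The sketch below reproduces only that arithmetic in plain Python; `normalize_weights` is a hypothetical helper for illustration, not part of the Spark API.

```python
def normalize_weights(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

# Same weights as in the randomSplit call above.
fractions = normalize_weights((0.9, 0.001))
print([round(f, 4) for f in fractions])  # [0.9989, 0.0011]
```

So roughly 0.1% of the rows are expected to land in test_data. The split is random per row, so the actual count varies with the seed.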

In [23]:
tokenizer = Tokenizer(inputCol='course_title',outputCol='mytokens')

stopwords_remover = StopWordsRemover(inputCol='mytokens',outputCol='filtered_tokens')

vectorizer = CountVectorizer(inputCol='filtered_tokens',outputCol='rawFeatures')

idf = IDF(inputCol='rawFeatures',outputCol='vectorizedFeatures')
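
To make these four stages concrete, here is a plain-Python sketch of what each one roughly does. The toy titles and the tiny stop-word list are made up for illustration, and the vocabulary is sorted alphabetically for simplicity (Spark's CountVectorizer orders terms by frequency). The IDF formula is the smoothed form given in the Spark MLlib documentation, log((m + 1) / (df + 1)).

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
titles = [
    "Learn Piano in 30 Days",
    "Learn Forex Trading",
    "Piano for Beginners",
]
stop_words = {"in", "for", "the", "of"}

# Tokenizer: lowercase the text, then split on whitespace.
tokens = [t.lower().split() for t in titles]

# StopWordsRemover: drop very common words.
filtered = [[w for w in doc if w not in stop_words] for doc in tokens]

# CountVectorizer: term counts per document over a shared vocabulary.
vocab = sorted({w for doc in filtered for w in doc})
counts = [Counter(doc) for doc in filtered]

# IDF: terms that appear in fewer documents get a larger weight.
m = len(filtered)
doc_freq = {w: sum(1 for doc in filtered if w in doc) for w in vocab}
idf = {w: math.log((m + 1) / (doc_freq[w] + 1)) for w in vocab}

# TF-IDF vector for each title: raw count times IDF weight.
tfidf = [[counts[i][w] * idf[w] for w in vocab] for i in range(m)]
for title, vec in zip(titles, tfidf):
    print(title, [round(v, 3) for v in vec])
```

Here "forex" appears in one title and "piano" in two, so "forex" receives the larger weight.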

* **Adding labels**
* **data transformation.**

In [24]:
labelEncoder = StringIndexer(inputCol='subject',outputCol='label').fit(data)

In [25]:
labelEncoder.transform(data).show(10,False)

+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|course_title                                                |subject                                                                                                                                                                                |label|
+------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|#1 Piano Hand Coordination: Play 10th Ballad in Eb Key songs|Musical Instruments                                                                                                                                                                

## Dictionary of all labels
StringIndexer has already assigned a numeric value to each subject category. We record that mapping in a dictionary so that we can translate numeric predictions back into subject names later.



In [26]:
label_dict = {'Web Development':0.0,
 'Business Finance':1.0,
 'Musical Instruments':2.0,
 'Graphic Design':3.0}

As shown, Web Development is assigned 0.0, Business Finance assigned 1.0, Musical Instruments assigned 2.0, and Graphic Design assigned 3.0.
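
By default, StringIndexer orders labels by descending frequency, so the most common subject receives 0.0. The sketch below mimics that ordering in plain Python. The subject counts are hypothetical, chosen only so that the result matches the dictionary above.

```python
from collections import Counter

# Hypothetical subject column; the counts are made up for illustration.
subjects = (
    ["Web Development"] * 4
    + ["Business Finance"] * 3
    + ["Musical Instruments"] * 2
    + ["Graphic Design"] * 1
)

# Most frequent category first, so it receives index 0.0.
ordered = [name for name, _ in Counter(subjects).most_common()]
toy_label_dict = {name: float(i) for i, name in enumerate(ordered)}
print(toy_label_dict)
```

In the notebook itself, the same mapping could presumably be read from the fitted indexer's `labels` attribute instead of being typed by hand, which would also reveal any extra categories created by malformed rows.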

We add these labels into our dataset as shown:

In [27]:
data = labelEncoder.transform(data)

### Splitting our dataset
We split our dataset into a train set and a test set.

70% of our dataset will be used for training and 30% for testing.

After the split, the last stage is building our model with the LogisticRegression algorithm.



In [30]:
(trainDF,testDF) = data.randomSplit((0.7,0.3),seed=42)

* **Creating estimator**

An estimator is an algorithm that is fitted on data to produce a model, and that model is then used to make predictions.

In [31]:
from pyspark.ml.classification import LogisticRegression 
lr = LogisticRegression(featuresCol='vectorizedFeatures',labelCol='label')

* **Building the pipeline**

Let’s import the Pipeline class that we’ll use to build our model.

* **Fitting the five stages**

We pass the five initialized stages to the Pipeline() constructor.

In [32]:
from pyspark.ml import Pipeline 
pipeline = Pipeline(stages=[tokenizer,stopwords_remover,vectorizer,idf,lr])

* **Building model**

We build our model by fitting the pipeline to our training dataset: we call the fit() method and pass trainDF as the argument.

We store the fitted pipeline model as lr_model.




In [33]:
lr_model = pipeline.fit(trainDF)
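
Conceptually, fitting a pipeline runs the stages in order: plain transformers are applied directly, while estimators are fitted first and replaced by the model they produce. The sketch below illustrates that idea in plain Python. Every class and function name in it is hypothetical, and it is a simplification, not Spark's implementation.

```python
class Lowercase:
    """Transformer: only has transform()."""
    def transform(self, rows):
        return [r.lower() for r in rows]

class VocabularyEstimator:
    """Estimator: fit() learns from data and returns a model."""
    def fit(self, rows):
        vocab = sorted({w for r in rows for w in r.split()})
        return VocabularyModel(vocab)

class VocabularyModel:
    """Model produced by the estimator; it is itself a transformer."""
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, rows):
        return [[r.split().count(w) for w in self.vocab] for r in rows]

def fit_pipeline(stages, rows):
    """Fit stages in order, feeding each one the previous stage's output."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted

def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order to new data."""
    for model in fitted:
        rows = model.transform(rows)
    return rows

fitted = fit_pipeline([Lowercase(), VocabularyEstimator()],
                      ["Learn Piano", "Learn Forex"])
print(transform_pipeline(fitted, ["Piano piano forex"]))  # [[1, 0, 2]]
```

The vocabulary is learned from the training rows only, which is why the fitted pipeline can later be applied to unseen titles without refitting.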

* **Testing model**

We test our model using the test dataset to see if it can classify the course title and assign the right subject.





In [35]:
predictions = lr_model.transform(testDF)


To see if our model was able to do the right classification, use the following command:

In [36]:
predictions.show()

+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|        course_title|             subject|label|            mytokens|     filtered_tokens|         rawFeatures|  vectorizedFeatures|       rawPrediction|         probability|prediction|
+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|#12 Hand Coordina...| Musical Instruments|  2.0|[#12, hand, coord...|[#12, hand, coord...|(3624,[295,509,59...|(3624,[295,509,59...|[2.25589659512528...|[0.00514507182890...|       2.0|
|'Greensleeves' Cr...| Musical Instruments|  2.0|['greensleeves', ...|['greensleeves', ...|(3624,[6,9,46,510...|(3624,[6,9,46,510...|[-1.3858717988728...|[2.75594285329454...|       2.0|
|              000!""|Learn Classical G...|  6.0|            [000

In [37]:
predictions.columns

['course_title',
 'subject',
 'label',
 'mytokens',
 'filtered_tokens',
 'rawFeatures',
 'vectorizedFeatures',
 'rawPrediction',
 'probability',
 'prediction']

#### Let's check the required columns to see the exact predictions of our model.

In [38]:
predictions.select('rawPrediction','probability','subject','label','prediction').show(10)

+--------------------+--------------------+--------------------+-----+----------+
|       rawPrediction|         probability|             subject|label|prediction|
+--------------------+--------------------+--------------------+-----+----------+
|[2.25589659512528...|[0.00514507182890...| Musical Instruments|  2.0|       2.0|
|[-1.3858717988728...|[2.75594285329454...| Musical Instruments|  2.0|       2.0|
|[4.10252349608296...|[0.00271732514424...|Learn Classical G...|  6.0|       1.0|
|[9.4153307471077,...|[0.75092189668272...|    Business Finance|  1.0|       0.0|
|[20.1486970431344...|[0.99999996990598...|     Web Development|  0.0|       0.0|
|[-0.9074221662231...|[6.14928429974160...|    Business Finance|  1.0|       1.0|
|[-8.6768079079238...|[7.28898835789187...|    Business Finance|  1.0|       1.0|
|[-17.923853708902...|[3.08447838815437...|      Graphic Design|  3.0|       3.0|
|[-6.7114779925134...|[5.42853162133382...| Musical Instruments|  2.0|       2.0|
|[28.84330106716

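In the snippet above, the model predicts the correct subject for most rows. Two of the nine complete rows are misclassified: a Business Finance course predicted as 0.0 (Web Development), and a row labeled 6.0 whose subject value appears to be a course title, which points to a malformed record in the dataset.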
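
The rawPrediction and probability columns are related: for multinomial logistic regression, to our understanding, probability is the softmax of the raw per-class scores, and prediction is the index of the largest probability. The sketch below shows that relationship in plain Python. The raw scores are hypothetical, because the vectors in the output above are truncated.

```python
import math

def softmax(margins):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - peak) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

raw = [2.0, -1.0, 0.5, 4.0]  # hypothetical scores for four classes
probs = softmax(raw)

# The predicted label is the index of the most probable class.
prediction = float(probs.index(max(probs)))
print([round(p, 3) for p in probs], prediction)
```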


### Model evaluation
Model evaluation measures the accuracy of the model on the test set, which tells us how well it performs on data it was not trained on.

Let’s import the MulticlassClassificationEvaluator. We’ll use it to evaluate our model and calculate the accuracy score.



In [39]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [40]:
evaluator = MulticlassClassificationEvaluator(labelCol='label',predictionCol='prediction',metricName='accuracy')

In [41]:
accuracy = evaluator.evaluate(predictions)

In [42]:
accuracy

0.9171483622350675

### Not bad!
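
Accuracy is simply the fraction of rows whose prediction equals the label. As a quick check of the idea, the sketch below computes it by hand for the nine complete rows of the predictions snippet shown earlier.

```python
# (label, prediction) pairs copied from the nine complete rows of the
# predictions snippet shown earlier.
pairs = [
    (2.0, 2.0), (2.0, 2.0), (6.0, 1.0), (1.0, 0.0), (0.0, 0.0),
    (1.0, 1.0), (1.0, 1.0), (3.0, 3.0), (2.0, 2.0),
]

correct = sum(1 for label, pred in pairs if label == pred)
sample_accuracy = correct / len(pairs)
print(f"{correct}/{len(pairs)} correct, accuracy = {sample_accuracy:.3f}")
# 7/9 correct, accuracy = 0.778
```

Nine rows are far too few to compare with the score above. The evaluator computes the same ratio over the whole of testDF.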

## Making a new prediction
We'll use our trained model to make predictions on data that it hasn't seen before, i.e., the test dataset we set aside earlier.
We want to see if it can predict the correct subjects.



In [43]:
new_pred = lr_model.transform(test_data)

In [44]:
new_pred.show()

+--------------------+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|        course_title|         subject|            mytokens|     filtered_tokens|         rawFeatures|  vectorizedFeatures|       rawPrediction|         probability|prediction|
+--------------------+----------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|'Geometry Of Chan...|Business Finance|['geometry, of, c...|['geometry, chanc...|   (3624,[98],[1.0])|(3624,[98],[4.688...|[0.52016269357014...|[4.13674414433771...|       1.0|
|Angular 4 (2+) Cr...| Web Development|[angular, 4, (2+)...|[angular, 4, (2+)...|(3624,[3,6,53,63,...|(3624,[3,6,53,63,...|[25.3596667781611...|[0.99999999999007...|       0.0|
|AngularJs :basics...| Web Development|[angularjs, :basi...|[angularjs, :basi...|(3624,[3,135,187]...|(3624,[3,135,

Get all the available columns.

In [45]:
new_pred.columns

['course_title',
 'subject',
 'mytokens',
 'filtered_tokens',
 'rawFeatures',
 'vectorizedFeatures',
 'rawPrediction',
 'probability',
 'prediction']

From the above columns, let’s select the necessary columns that give the prediction results clearly.

In [48]:
new_pred.select('course_title','rawPrediction','probability','prediction').show()

print("\n", label_dict)

+--------------------+--------------------+--------------------+----------+
|        course_title|       rawPrediction|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|'Geometry Of Chan...|[0.52016269357014...|[4.13674414433771...|       1.0|
|Angular 4 (2+) Cr...|[25.3596667781611...|[0.99999999999007...|       0.0|
|AngularJs :basics...|[26.3565834177491...|[0.99999999999777...|       0.0|
|Elite Trend Trade...|[-12.489283868673...|[1.84053886220911...|       1.0|
|Ultimate Photosho...|[-0.5618535885263...|[1.68278610549832...|       3.0|
+--------------------+--------------------+--------------------+----------+


 {'Web Development': 0.0, 'Business Finance': 1.0, 'Musical Instruments': 2.0, 'Graphic Design': 3.0}
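
To read the prediction column as subject names rather than numbers, we can invert label_dict. The sketch below does this in plain Python for the five prediction values in the table above; `index_to_subject` is a name introduced here for illustration.

```python
label_dict = {'Web Development': 0.0, 'Business Finance': 1.0,
              'Musical Instruments': 2.0, 'Graphic Design': 3.0}

# Invert the mapping: numeric label -> subject name.
index_to_subject = {index: subject for subject, index in label_dict.items()}

# Values of the prediction column in the table above.
predicted = [1.0, 0.0, 0.0, 1.0, 3.0]
print([index_to_subject[p] for p in predicted])
# ['Business Finance', 'Web Development', 'Web Development', 'Business Finance', 'Graphic Design']
```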


### The prediction is accurate, which can be seen from our created label dictionary.

