## Mothapo Regina - Building an Income Prediction Model
This project builds a classifier that predicts whether an individual earns more or less than $50k, using a Random Forest Classifier and a Decision Tree Classifier. The data is processed with PySpark.

In [1]:
!pip install pyspark


Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 2.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=f0270d2ad78c22c00f131a79d43deb13465a3a68be4b9c3402cca0d46be94c07
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
# Import all the necessary modules
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql.functions import col

In [3]:
# Start a new session
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName('Income Prediction').getOrCreate()

In [5]:
spark

In [7]:
# Import the dataset ('?' marks a missing value; surrounding whitespace is trimmed)
df = spark.read.csv('/content/income (1).csv', header=True, inferSchema=True, nullValue='?',
                    ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=True)

df.printSchema()
df.show()

root
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- weight: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- education_years: integer (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: integer (nullable = true)
 |-- capital_loss: integer (nullable = true)
 |-- hours_per_week: integer (nullable = true)
 |-- citizenship: string (nullable = true)
 |-- income_class: string (nullable = true)

+---+----------------+------+------------+---------------+--------------------+-----------------+-------------+------------------+------+------------+------------+--------------+-------------+------------+
|age|       workclass|weight|   education|education_years|      marital_status|       occupation| relationship|              race|   sex|capital_gain|capita

### Preprocessing the Dataset
The total number of records in the file is 32561.

In [8]:
# Use the describe method to find which columns have missing values (a count below 32561)
df.describe().show()

+-------+------------------+-----------+------------------+------------+-----------------+--------------+----------------+------------+------------------+------+------------------+----------------+------------------+-----------+------------+
|summary|               age|  workclass|            weight|   education|  education_years|marital_status|      occupation|relationship|              race|   sex|      capital_gain|    capital_loss|    hours_per_week|citizenship|income_class|
+-------+------------------+-----------+------------------+------------+-----------------+--------------+----------------+------------+------------------+------+------------------+----------------+------------------+-----------+------------+
|  count|             32561|      30725|             32561|       32561|            32561|         32561|           30718|       32561|             32561| 32561|             32561|           32561|             32561|      31978|       32561|
|   mean| 38.58164675532078|    

The describe output shows that workclass, occupation and citizenship have fewer than 32561 non-null values, so these are the columns with missing data. Rows containing missing values are dropped in a later step.
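The number of missing values per column can be read off the `count` row of the `describe()` output: each column's shortfall from the 32561 total is its number of nulls. A minimal sketch of that arithmetic in plain Python, using only the counts shown above (the name `non_null_counts` is just for illustration):

```python
# Non-null counts copied from the describe() output above.
total_rows = 32561
non_null_counts = {"workclass": 30725, "occupation": 30718, "citizenship": 31978}

# Missing values per column = total rows - non-null count.
missing = {column: total_rows - count for column, count in non_null_counts.items()}

for column, n_missing in missing.items():
    print(f"{column}: {n_missing} missing ({n_missing / total_rows:.1%})")
```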

In [None]:
# Display the columns
df.columns


['age',
 'workclass',
 'weight',
 'education',
 'education_years',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'capital_gain',
 'capital_loss',
 'hours_per_week',
 'citizenship',
 'income_class']

In [9]:
# Finding duplicates in the dataset
duplicates_df = df.groupBy(df.columns).count().filter(col('count')>1)
duplicates_df.show()


+---+----------------+------+------------+---------------+------------------+-----------------+-------------+------------------+------+------------+------------+--------------+-------------+------------+-----+
|age|       workclass|weight|   education|education_years|    marital_status|       occupation| relationship|              race|   sex|capital_gain|capital_loss|hours_per_week|  citizenship|income_class|count|
+---+----------------+------+------------+---------------+------------------+-----------------+-------------+------------------+------+------------+------------+--------------+-------------+------------+-----+
| 39|         Private| 30916|     HS-grad|              9|Married-civ-spouse|     Craft-repair|      Husband|             White|  Male|           0|           0|            40|United-States|       <=50K|    2|
| 28|         Private|274679|     Masters|             14|     Never-married|   Prof-specialty|Not-in-family|             White|  Male|           0|           0

In [10]:
# Removing duplicates
df_2 = df.dropDuplicates()
duplicates = df.count() - df_2.count()
print('The number of duplicated rows in the dataset is {}.'.format(duplicates))

The number of duplicated rows in the dataset is 24.
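Note that `groupBy(...).count()` lists each duplicated record once together with its multiplicity, while `dropDuplicates()` keeps one copy per record, so the number of removed rows is the sum of `count - 1` over the duplicated groups. A small plain-Python sketch of that relationship on made-up rows (the sample data is hypothetical, not taken from the income file):

```python
from collections import Counter

# Hypothetical rows standing in for DataFrame records (tuples are hashable).
rows = [
    (39, "Private", "<=50K"),
    (39, "Private", "<=50K"),   # exact duplicate of the first row
    (28, "Private", ">50K"),
    (28, "Private", ">50K"),
    (28, "Private", ">50K"),    # this record appears three times
    (54, "Self-emp-inc", ">50K"),
]

# Rough equivalent of groupBy(all columns).count().filter(count > 1).
group_counts = Counter(rows)
duplicated_groups = {row: n for row, n in group_counts.items() if n > 1}

# Rough equivalent of dropDuplicates(): one copy of each distinct row.
deduplicated = list(group_counts)

removed = len(rows) - len(deduplicated)
print("Duplicated groups:", len(duplicated_groups))
print("Rows removed by deduplication:", removed)
```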


In [12]:
# Remove rows with missing values
df_final = df_2.na.drop()

In [13]:
# Display the final data, with duplicates and rows containing missing values removed
df_final.show()

+---+------------+------+------------+---------------+------------------+-----------------+--------------+------------------+------+------------+------------+--------------+-------------+------------+
|age|   workclass|weight|   education|education_years|    marital_status|       occupation|  relationship|              race|   sex|capital_gain|capital_loss|hours_per_week|  citizenship|income_class|
+---+------------+------+------------+---------------+------------------+-----------------+--------------+------------------+------+------------+------------+--------------+-------------+------------+
| 41| Federal-gov|130760|   Bachelors|             13|Married-civ-spouse|     Tech-support|       Husband|             White|  Male|           0|           0|            24|United-States|       <=50K|
| 54|Self-emp-inc|125417|     7th-8th|              4|Married-civ-spouse|Machine-op-inspct|       Husband|             White|  Male|           0|           0|            40|United-States|        >

In [14]:
# Using the describe method to see the count of rows remaining (30139 records)
df_final.describe().show()


+-------+------------------+-----------+------------------+------------+------------------+--------------+----------------+------------+------------------+------+------------------+-----------------+------------------+-----------+------------+
|summary|               age|  workclass|            weight|   education|   education_years|marital_status|      occupation|relationship|              race|   sex|      capital_gain|     capital_loss|    hours_per_week|citizenship|income_class|
+-------+------------------+-----------+------------------+------------+------------------+--------------+----------------+------------+------------------+------+------------------+-----------------+------------------+-----------+------------+
|  count|             30139|      30139|             30139|       30139|             30139|         30139|           30139|       30139|             30139| 30139|             30139|            30139|             30139|      30139|       30139|
|   mean| 38.44172003052

In [15]:
# Indexing categorical features in the dataset

categorical_cols = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'citizenship', 'income_class']
index = StringIndexer(
    inputCols=categorical_cols,
    outputCols=['{}_indexed'.format(column) for column in categorical_cols]
    )

df_4 = index.fit(df_final).transform(df_final)
df_4.show()


+---+------------+------+------------+---------------+------------------+-----------------+--------------+------------------+------+------------+------------+--------------+-------------+------------+-----------------+-----------------+----------------------+------------------+--------------------+------------+-----------+-------------------+--------------------+
|age|   workclass|weight|   education|education_years|    marital_status|       occupation|  relationship|              race|   sex|capital_gain|capital_loss|hours_per_week|  citizenship|income_class|workclass_indexed|education_indexed|marital_status_indexed|occupation_indexed|relationship_indexed|race_indexed|sex_indexed|citizenship_indexed|income_class_indexed|
+---+------------+------+------------+---------------+------------------+-----------------+--------------+------------------+------+------------+------------+--------------+-------------+------------+-----------------+-----------------+----------------------+---------
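By default `StringIndexer` orders labels by descending frequency (`stringOrderType='frequencyDesc'`), so the most common category in each column receives index 0.0. The following plain-Python sketch imitates that ordering on a hypothetical list of labels; it illustrates the rule and is not Spark's implementation. The alphabetical tie-break is an assumption based on the behaviour documented for Spark 3.x.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically (assumed to mirror Spark 3.x).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}

# Hypothetical sample in which '<=50K' is the majority class.
income_labels = ["<=50K", "<=50K", ">50K", "<=50K", ">50K"]
mapping = frequency_desc_index(income_labels)
print(mapping)
```

Under this rule the majority `<=50K` class maps to 0.0 and `>50K` to 1.0 in `income_class_indexed`.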

In [16]:
# The indexed features

indexed_df = df_4.select(
    'age', 'workclass_indexed', 'weight', 'education_indexed', 'education_years', 'marital_status_indexed', 'occupation_indexed',
    'relationship_indexed', 'race_indexed', 'sex_indexed', 'capital_gain', 'capital_loss', 'hours_per_week', 'citizenship_indexed', 'income_class_indexed'
)
# show() returns None, so it is called separately to keep indexed_df a DataFrame
indexed_df.show()

+---+-----------------+------+-----------------+---------------+----------------------+------------------+--------------------+------------+-----------+------------+------------+--------------+-------------------+--------------------+
|age|workclass_indexed|weight|education_indexed|education_years|marital_status_indexed|occupation_indexed|relationship_indexed|race_indexed|sex_indexed|capital_gain|capital_loss|hours_per_week|citizenship_indexed|income_class_indexed|
+---+-----------------+------+-----------------+---------------+----------------------+------------------+--------------------+------------+-----------+------------+------------+--------------+-------------------+--------------------+
| 41|              5.0|130760|              2.0|             13|                   0.0|              10.0|                 0.0|         0.0|        0.0|           0|           0|            24|                0.0|                 0.0|
| 54|              4.0|125417|              8.0|            

In [17]:
# Creating a feature vector

vec = VectorAssembler(
    inputCols=['age', 'workclass_indexed', 'weight', 'education_indexed', 'education_years', 'marital_status_indexed', 'occupation_indexed',
               'relationship_indexed', 'race_indexed', 'sex_indexed', 'capital_gain', 'capital_loss', 'hours_per_week', 'citizenship_indexed'],
    outputCol='Feature Vector'
)

final_df = vec.transform(df_4)


# The final preprocessed dataset

final_df.show()


+---+------------+------+------------+---------------+------------------+-----------------+--------------+------------------+------+------------+------------+--------------+-------------+------------+-----------------+-----------------+----------------------+------------------+--------------------+------------+-----------+-------------------+--------------------+--------------------+
|age|   workclass|weight|   education|education_years|    marital_status|       occupation|  relationship|              race|   sex|capital_gain|capital_loss|hours_per_week|  citizenship|income_class|workclass_indexed|education_indexed|marital_status_indexed|occupation_indexed|relationship_indexed|race_indexed|sex_indexed|citizenship_indexed|income_class_indexed|      Feature Vector|
+---+------------+------+------------+---------------+------------------+-----------------+--------------+------------------+------+------------+------------+--------------+-------------+------------+-----------------+--------

In [18]:
# Splitting the dataset into roughly 70% training and 30% testing data (no seed is set, so the split varies between runs)

train, test = final_df.randomSplit([.70, .30])
print('Train Size:', train.count())
print('Test Size:', test.count())

Train Size: 21189
Test Size: 8950
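`randomSplit` assigns rows to the splits at random according to the weights, so the resulting sizes only approximate the requested 70/30 proportions. A quick check of the realized fractions, using the sizes printed above:

```python
train_size = 21189
test_size = 8950
total = train_size + test_size  # should equal the 30139 rows left after cleaning

train_fraction = train_size / total
test_fraction = test_size / total

print(f"Total rows: {total}")
print(f"Train fraction: {train_fraction:.4f}")
print(f"Test fraction: {test_fraction:.4f}")
```

Passing a `seed` argument to `randomSplit` should make the split, and therefore the metrics computed from it, repeatable across runs.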


### Random Forest Classifier

In [19]:
# Creating the Random Forest Classifier
rf = RandomForestClassifier(featuresCol='Feature Vector', labelCol='income_class_indexed', maxBins=50)
model_1 = rf.fit(train)
pred_income = model_1.transform(test)

In [20]:

# Creating a confusion matrix (rows are actual classes, columns are predicted classes)
preds = pred_income.select(['prediction', 'income_class_indexed'])
metric = MulticlassMetrics(preds.rdd.map(tuple))
confusion_matrix = metric.confusionMatrix().toArray()
print('Random Forest Confusion Matrix:', '\n', confusion_matrix)



Random Forest Confusion Matrix: 
 [[6502.  238.]
 [1130. 1080.]]


In [21]:
# Weighted F1 score of the Random Forest Classifier
# (the evaluator defaults to metricName='f1'; pass metricName='accuracy' to get accuracy instead)

evaluator = MulticlassClassificationEvaluator(labelCol='income_class_indexed', predictionCol='prediction', metricName='f1')
f1_score = evaluator.evaluate(pred_income)
print('The weighted F1 score of the Random Forest Classifier is {}'.format(f1_score))


The weighted F1 score of the Random Forest Classifier is 0.8325713710735443


### Decision Tree Classifier

In [22]:
# Creating the Decision Tree Classifier
dt = DecisionTreeClassifier(featuresCol='Feature Vector', labelCol='income_class_indexed', maxBins=50)
model_2 = dt.fit(train)
pred_income_2 = model_2.transform(test)

In [23]:
# Creating the confusion matrix
preds_2 = pred_income_2.select(['prediction', 'income_class_indexed'])
metric = MulticlassMetrics(preds_2.rdd.map(tuple))
confusion_matrix_2 = metric.confusionMatrix().toArray()
print('Decision Tree Confusion Matrix:','\n', confusion_matrix_2)





Decision Tree Confusion Matrix: 
 [[6305.  435.]
 [ 959. 1251.]]


In [24]:
# Weighted F1 score of the Decision Tree Classifier
# (the evaluator defaults to metricName='f1'; pass metricName='accuracy' to get accuracy instead)

evaluator_2 = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='income_class_indexed', metricName='f1')
f1_score_2 = evaluator_2.evaluate(pred_income_2)
print('The weighted F1 score of the Decision Tree Classifier is {}'.format(f1_score_2))

The weighted F1 score of the Decision Tree Classifier is 0.8366855764915432
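
`MulticlassClassificationEvaluator` uses `metricName='f1'` (a support-weighted F1 score) unless another metric is requested, so the values printed by the evaluator cells can be cross-checked against the confusion matrices. The sketch below recomputes accuracy and weighted F1 in plain Python from the two matrices above, assuming rows are actual classes and columns are predicted classes; the helper names are illustrative only.

```python
def accuracy(matrix):
    """Fraction of correct predictions: diagonal sum over total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def weighted_f1(matrix):
    """Per-class F1, averaged with weights equal to each class's share of actual rows."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    score = 0.0
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # actual k, predicted as another class
        fp = sum(matrix[r][k] for r in range(n)) - tp   # predicted k, actually another class
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += f1 * sum(matrix[k]) / total
    return score

# Confusion matrices copied from the outputs above.
random_forest = [[6502, 238], [1130, 1080]]
decision_tree = [[6305, 435], [959, 1251]]

for name, matrix in [("Random Forest", random_forest), ("Decision Tree", decision_tree)]:
    print(f"{name}: accuracy={accuracy(matrix):.4f}, weighted F1={weighted_f1(matrix):.4f}")
```

Under these assumptions the weighted F1 values reproduce the 0.8326 and 0.8367 printed by the evaluator, while the accuracies come out at about 0.8472 for the Random Forest and 0.8442 for the Decision Tree. On this particular split the Decision Tree has the higher weighted F1 and the Random Forest the higher accuracy.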
