#### 1. Create the Spark Environment  (2M)

In [1]:
import os
import sys
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.6-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
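
The cell above hardcodes the py4j version (`0.10.6`), which breaks if the cluster ships a different one. A minimal sketch of a more tolerant lookup follows; `find_pyspark_libs` is a hypothetical helper, and it assumes the same `python/lib` layout under `SPARK_HOME` that the cell relies on.

```python
import glob
import os


def find_pyspark_libs(spark_home):
    """Return the pyspark and py4j zip paths under <spark_home>/python/lib.

    Hypothetical helper: assumes the layout used in the cell above and
    raises if either archive is missing.
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    # Match any py4j version instead of hardcoding 0.10.6
    py4j_zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not os.path.isfile(pyspark_zip) or not py4j_zips:
        raise FileNotFoundError("pyspark/py4j archives not found in " + lib_dir)
    return [pyspark_zip, py4j_zips[-1]]
```

Each returned path could then be passed to `sys.path.insert(0, ...)` exactly as in the original cell.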

#### 2. Load the required Libraries for Spark Context and Spark Session (2M)

In [2]:
from pyspark.conf import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession


In [3]:
from pyspark.ml.feature import MinMaxScaler, OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml import PipelineModel

import numpy as np

#### 3. Create the Spark Context and Spark Session (2M)

In [4]:
mysql_jar = "/home/datasets/mysql-connector-java-8.0.13.jar"  # MySQL JDBC driver
conf = (SparkConf()
        .setAppName("Batch56_CSE7322c_CUTe_PartB_2618")
        .setMaster("local[*]")
        .set("spark.driver.extraClassPath", mysql_jar)
        .set("spark.executor.extraClassPath", mysql_jar))
sc = SparkContext(conf=conf)  
spark = SparkSession(sc)  

#### 4. Load the libraries for schema definition in Pyspark (2M)

In [5]:
from pyspark.sql.types import *
from pyspark.sql.functions import *


In [6]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
%matplotlib inline
np.random.seed(333)

#### Problem Statement:
* This data was extracted from the Census Bureau database. The task is to classify the records based on the income field. Incomes have been binned at the 50K level to present a binary classification problem. The instance_weight attribute should not be used in the classifier. All the other attributes and their descriptions are given below.

#### Description of the Attributes
* **age**: continuous.
* **class of worker**: Not in universe, Federal government, Local government, Never worked, Private, Self-employed- incorporated, Self-employed-not incorporated, State government, Without pay.
* **detailed industry recode**: 0, 40, 44, 2, 43, 47, 48, 1, 11, 19, 24, 25, 32, 33, 34, 35, 36, 37, 38, 39, 4, 42, 45, 5, 15, 16, 22, 29, 31, 50, 14, 17, 18, 28, 3, 30, 41, 46, 51, 12, 13, 21, 23, 26, 6, 7, 9, 49, 27, 8, 10, 20.
* **detailed occupation recode** : 0, 12, 31, 44, 19, 32, 10, 23, 26, 28, 29, 42, 40, 34, 14, 36, 38, 2, 20, 25, 37, 41, 27, 24, 30, 43, 33, 16, 45, 17, 35, 22, 18, 39, 3, 15, 13, 46, 8, 21, 9, 4, 6, 5, 1, 11, 7.
* **education**: Children, 7th and 8th grade, 9th grade, 10th grade, High school graduate, 11th grade, 12th grade no diploma, 5th or 6th grade, Less than 1st grade, Bachelors degree(BA AB BS), 1st 2nd 3rd or 4th grade, Some college but no degree, Masters degree(MA MS MEng MEd MSW MBA), Associates degree-occup /vocational, Associates degree-academic program, Doctorate degree(PhD EdD), Prof school degree (MD DDS DVM LLB JD).
* **wage per hour**: continuous.
* **enroll in edu inst last wk**: Not in universe, High school, College or university.
* **marital stat**: Never married, Married-civilian spouse present, Married-spouse absent, Separated, Divorced, Widowed, Married-A F spouse present.
* **major industry code**: Not in universe or children, Entertainment, Social services, Agriculture, Education, Public administration, Manufacturing-durable goods, Manufacturing-nondurable goods, Wholesale trade, Retail trade, Finance insurance and real estate, Private household services, Business and repair services, Personal services except private HH, Construction, Medical except hospital, Other professional services, Transportation, Utilities and sanitary services, Mining, Communications, Hospital services, Forestry and fisheries, Armed Forces.
* **major occupation code**: Not in universe, Professional specialty, Other service, Farming forestry and fishing, Sales, Adm support including clerical, Protective services, Handlers equip cleaners etc , Precision production craft & repair, Technicians and related support, Machine operators assmblrs & inspctrs, Transportation and material moving, Executive admin and managerial, Private household services, Armed Forces.
* **race**: White, Black, Other, Amer Indian Aleut or Eskimo, Asian or Pacific Islander.
* **hispanic origin**: Mexican (Mexicano), Mexican-American, Puerto Rican, Central or South American, All other, Other Spanish, Chicano, Cuban, Do not know, NA.
* **sex**: Female, Male.
* **member of a labor union**: Not in universe, No, Yes.
* **reason for unemployment**: Not in universe, Re-entrant, Job loser - on layoff, New entrant, Job leaver, Other job loser.
* **Full or part time employment stat**: Children or Armed Forces, Full-time schedules, Unemployed part- time, Not in labor force, Unemployed full-time, PT for non-econ reasons usually FT, PT for econ reasons usually PT, PT for econ reasons usually FT.
* **capital gains**: continuous.
* **capital losses**: continuous.
* **dividends from stocks**: continuous.
* **tax filer stat**: Nonfiler, Joint one under 65 & one 65+, Joint both under 65, Single, Head of household, Joint both 65+.
* **region of previous residence**: Not in universe, South, Northeast, West, Midwest, Abroad.
* **state of previous residence**: Not in universe, Utah, Michigan, North Carolina, North Dakota, Virginia, Vermont, Wyoming, West Virginia, Pennsylvania, Abroad, Oregon, California, Iowa, Florida, Arkansas, Texas, South Carolina, Arizona, Indiana, Tennessee, Maine, Alaska, Ohio, Montana, Nebraska, Mississippi, District of Columbia, Minnesota, Illinois, Kentucky, Delaware, Colorado, Maryland, Wisconsin, New Hampshire, Nevada, New York, Georgia, Oklahoma, New Mexico, South Dakota, Missouri, Kansas, Connecticut, Louisiana, Alabama, Massachusetts, Idaho, New Jersey.
* **detailed household and family stat**: Child <18 never marr not in subfamily, Other Rel <18 never marr child of subfamily RP, Other Rel <18 never marr not in subfamily, Grandchild <18 never marr child of subfamily RP, Grandchild <18 never marr not in subfamily, Secondary individual, In group quarters, Child under 18 of RP of unrel subfamily, RP of unrelated subfamily, Spouse of householder, Householder, Other Rel <18 never married RP of subfamily, Grandchild <18 never marr RP of subfamily, Child <18 never marr RP of subfamily, Child <18 ever marr not in subfamily, Other Rel <18 ever marr RP of subfamily, Child <18 ever marr RP of subfamily, Nonfamily householder, Child <18 spouse of subfamily RP, Other Rel <18 spouse of subfamily RP, Other Rel <18 ever marr not in subfamily, Grandchild <18 ever marr not in subfamily, Child 18+ never marr Not in a subfamily, Grandchild 18+ never marr not in subfamily, Child 18+ ever marr RP of subfamily, Other Rel 18+ never marr not in subfamily, Child 18+ never marr RP of subfamily, Other Rel 18+ ever marr RP of subfamily, Other Rel 18+ never marr RP of subfamily, Other Rel 18+ spouse of subfamily RP, Other Rel 18+ ever marr not in subfamily, Child 18+ ever marr Not in a subfamily, Grandchild 18+ ever marr not in subfamily, Child 18+ spouse of subfamily RP, Spouse of RP of unrelated subfamily, Grandchild 18+ ever marr RP of subfamily, Grandchild 18+ never marr RP of subfamily, Grandchild 18+ spouse of subfamily RP.
* **detailed household summary in household**: Child under 18 never married, Other relative of householder, Nonrelative of householder, Spouse of householder, Householder, Child under 18 ever married, Group Quarters- Secondary individual, Child 18 or older.
* **instance weight**: continuous. Ignore this attribute; it should not be used in the classifier.
* **migration code-change in msa**: Not in universe, Nonmover, MSA to MSA, NonMSA to nonMSA, MSA to nonMSA, NonMSA to MSA, Abroad to MSA, Not identifiable, Abroad to nonMSA.
* **migration code-change in reg**: Not in universe, Nonmover, Same county, Different county same state, Different state same division, Abroad, Different region, Different division same region.
* **migration code-move within reg**: Not in universe, Nonmover, Same county, Different county same state, Different state in West, Abroad, Different state in Midwest, Different state in South, Different state in Northeast.
* **live in this house 1 year ago**: Not in universe under 1 year old, Yes, No.
* **migration prev res in sunbelt**: Not in universe, Yes, No.
* **num persons worked for employer**: continuous.
* **family members under 18**: Both parents present, Neither parent present, Mother only present, Father only present, Not in universe.
* **country of birth father**: Mexico, United-States, Puerto-Rico, Dominican-Republic, Jamaica, Cuba, Portugal, Nicaragua, Peru, Ecuador, Guatemala, Philippines, Canada, Columbia, El-Salvador, Japan, England, Trinadad&Tobago, Honduras, Germany, Taiwan, Outlying-U S (Guam USVI etc), India, Vietnam, China, Hong Kong, Cambodia, France, Laos, Haiti, South Korea, Iran, Greece, Italy, Poland, Thailand, Yugoslavia, Holand-Netherlands, Ireland, Scotland, Hungary, Panama.
* **country of birth mother**: India, Mexico, United-States, Puerto-Rico, Dominican-Republic, England, Honduras, Peru, Guatemala, Columbia, El-Salvador, Philippines, France, Ecuador, Nicaragua, Cuba, Outlying-U S (Guam USVI etc), Jamaica, South Korea, China, Germany, Yugoslavia, Canada, Vietnam, Japan, Cambodia, Ireland, Laos, Haiti, Portugal, Taiwan, Holand-Netherlands, Greece, Italy, Poland, Thailand, Trinadad&Tobago, Hungary, Panama, Hong Kong, Scotland, Iran.
* **country of birth self**: United-States, Mexico, Puerto-Rico, Peru, Canada, South Korea, India, Japan, Haiti, El-Salvador, Dominican-Republic, Portugal, Columbia, England, Thailand, Cuba, Laos, Panama, China, Germany, Vietnam, Italy, Honduras, Outlying-U S (Guam USVI etc), Hungary, Philippines, Poland, Ecuador, Iran, Guatemala, Holand-Netherlands, Taiwan, Nicaragua, France, Jamaica, Scotland, Yugoslavia, Hong Kong, Trinadad&Tobago, Greece, Cambodia, Ireland.
* **citizenship**: Native- Born in the United States, Foreign born- Not a citizen of U S , Native- Born in Puerto Rico or U S Outlying, Native- Born abroad of American Parent(s), Foreign born- U S citizen by naturalization.
* **own business or self employed**: 0, 2, 1.
* **fill inc questionnaire for veteran's admin**: Not in universe, Yes, No.
* **veterans benefits**: 0, 2, 1.
* **weeks worked in year**: continuous.
* **year**: 94, 95.

* **Income** (target): `- 50000.` (below the 50K threshold) and `50000+.` (50K or above), as the labels appear in the data.

#### 5. Define the schema from the description above (4M)

In [7]:
customSchema = StructType([
        StructField("age", IntegerType(), True),
		StructField("class_of_worker", StringType(), True),
		StructField("detailed_industry_recode", IntegerType(), True),
		StructField("detailed_occupation_recode", IntegerType(), True),
		StructField("education", StringType(), True),
		StructField("wage_per_hour", IntegerType(), True),
		StructField("enroll_in_edu_inst_last_wk", StringType(), True),
		StructField("marital_stat", StringType(), True),
		StructField("major_industry_code", StringType(), True),
		StructField("major_occupation_code", StringType(), True),
		StructField("race", StringType(), True),
		StructField("hispanic_origin", StringType(), True),
		StructField("sex", StringType(), True),
		StructField("member_of_a_labor_union", StringType(), True),
		StructField("reason_for_unemployment", StringType(), True),
		StructField("Full_or_part_time_employment_stat", StringType(), True),
		StructField("capital_gains", IntegerType(), True),
		StructField("capital_losses", IntegerType(), True),
		StructField("dividends_from_stocks", IntegerType(), True),
		StructField("tax_filer_stat", StringType(), True),
		StructField("region_of_previous_residence", StringType(), True),
		StructField("state_of_previous_residence", StringType(), True),
		StructField("detailed_household_and_family_stat", StringType(), True),
		StructField("detailed_household_summary_in_household", StringType(), True),
		StructField("instance_weight", FloatType(), True),
		StructField("migration_code-change_in_msa", StringType(), True),
		StructField("migration_code-change_in_reg", StringType(), True),
		StructField("migration_code-move_within_reg", StringType(), True),
		StructField("live_in_this_house_1_year_ago", StringType(), True),
		StructField("migration_prev_res_in_sunbelt", StringType(), True),
		StructField("num_persons_worked_for_employer", IntegerType(), True),
		StructField("family_members_under_18", StringType(), True),
		StructField("country_of_birth_father", StringType(), True),
		StructField("country_of_birth_mother", StringType(), True),
		StructField("country_of_birth_self", StringType(), True),
		StructField("citizenship", StringType(), True),
		StructField("own_business_or_self_employed", IntegerType(), True),
		StructField("fill_inc_questionnaire_for_veteran's_admin", StringType(), True),
		StructField("veterans_benefits", IntegerType(), True),
		StructField("weeks_worked_in_year", IntegerType(), True),
		StructField("year", IntegerType(), True),
		StructField("Income",StringType(), True)])
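
Several field names in the schema keep a hyphen or an apostrophe (for example `migration_code-change_in_msa`), so they need backtick quoting whenever they appear inside a Spark SQL expression string. As an alternative, not what this notebook does, the names could be normalised before building the schema. The sketch below assumes a hypothetical helper `sanitize_column_name`.

```python
import re


def sanitize_column_name(name):
    """Replace every run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")


# Attribute names taken from the description above
print(sanitize_column_name("migration code-change in msa"))
print(sanitize_column_name("fill inc questionnaire for veteran's admin"))
```

Names produced this way contain only letters, digits and underscores, so they can be used unquoted in expressions.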

#### 6. Read the Data from the CSV File (3M)

In [8]:
data = spark.read.csv(path='file:///home/2618B56/Cute_BD_part_B/trainData.csv',
                      header=False,
                      schema=customSchema)

#### 7. Read/View Four rows (2M)

In [9]:
data.show(4)

+---+--------------------+------------------------+--------------------------+--------------------+-------------+--------------------------+--------------------+--------------------+---------------------+-----+---------------+------+-----------------------+-----------------------+---------------------------------+-------------+--------------+---------------------+-------------------+----------------------------+---------------------------+----------------------------------+---------------------------------------+---------------+----------------------------+----------------------------+------------------------------+-----------------------------+-----------------------------+-------------------------------+-----------------------+-----------------------+-----------------------+---------------------+--------------------+-----------------------------+------------------------------------------+-----------------+--------------------+----+--------+
|age|     class_of_worker|detailed_industry

#### 8. Inspect the data types of the Columns (2M)

In [10]:
data.printSchema()

root
 |-- age: integer (nullable = true)
 |-- class_of_worker: string (nullable = true)
 |-- detailed_industry_recode: integer (nullable = true)
 |-- detailed_occupation_recode: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- wage_per_hour: integer (nullable = true)
 |-- enroll_in_edu_inst_last_wk: string (nullable = true)
 |-- marital_stat: string (nullable = true)
 |-- major_industry_code: string (nullable = true)
 |-- major_occupation_code: string (nullable = true)
 |-- race: string (nullable = true)
 |-- hispanic_origin: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- member_of_a_labor_union: string (nullable = true)
 |-- reason_for_unemployment: string (nullable = true)
 |-- Full_or_part_time_employment_stat: string (nullable = true)
 |-- capital_gains: integer (nullable = true)
 |-- capital_losses: integer (nullable = true)
 |-- dividends_from_stocks: integer (nullable = true)
 |-- tax_filer_stat: string (nullable = true)
 |-- region_of_pr

#### 9. Find the Total rows and Columns in the Dataset(2M)

In [11]:
print("No. of Columns = {}".format(len(data.columns)))
print('No. of Records = {}'.format(data.count()))

No. of Columns = 42
No. of Records = 99579


#### 10. Find the summary Statistics for the numerical attributes (2M)

In [12]:
numeric_col = [name for name, dtype in data.dtypes if dtype.startswith(('int', 'float'))]
string_col = [name for name, dtype in data.dtypes if dtype.startswith('string')]

In [13]:
data.select(numeric_col).summary().show()

+-------+------------------+------------------------+--------------------------+-----------------+------------------+------------------+---------------------+------------------+-------------------------------+-----------------------------+------------------+--------------------+------------------+
|summary|               age|detailed_industry_recode|detailed_occupation_recode|    wage_per_hour|     capital_gains|    capital_losses|dividends_from_stocks|   instance_weight|num_persons_worked_for_employer|own_business_or_self_employed| veterans_benefits|weeks_worked_in_year|              year|
+-------+------------------+------------------------+--------------------------+-----------------+------------------+------------------+---------------------+------------------+-------------------------------+-----------------------------+------------------+--------------------+------------------+
|  count|             99579|                   99579|                     99579|            99579|     

#### 11. Find the missing Values in each Column (2M)

In [14]:
null_analysis_df = data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in data.columns])
null_analysis_df_pd = null_analysis_df.toPandas().transpose().reset_index().rename(columns={"index": "attributes", 0: "count_of_missing_values"}).sort_values(by=['count_of_missing_values'], ascending=False)
null_analysis_df_1 = spark.createDataFrame(null_analysis_df_pd)
null_analysis_df_1.show()


+--------------------+-----------------------+
|          attributes|count_of_missing_values|
+--------------------+-----------------------+
|migration_code-ch...|                  49731|
|migration_code-ch...|                  49731|
|migration_code-mo...|                  49731|
|migration_prev_re...|                  49731|
|country_of_birth_...|                   3272|
|country_of_birth_...|                   2961|
|country_of_birth_...|                   1603|
|state_of_previous...|                    356|
|         citizenship|                      0|
|own_business_or_s...|                      0|
|     class_of_worker|                      0|
|family_members_un...|                      0|
|num_persons_worke...|                      0|
|fill_inc_question...|                      0|
|live_in_this_hous...|                      0|
|   veterans_benefits|                      0|
|weeks_worked_in_year|                      0|
|                year|                      0|
|     instanc

#### 12. Drop the Columns that have missing values more than 20% (3M)

In [15]:
total_rows = data.count()
null_analysis_df_2 = null_analysis_df_1.withColumn("perc_count_of_missing_values", (col('count_of_missing_values')/total_rows)*100)
null_analysis_df_2.show()


+--------------------+-----------------------+----------------------------+
|          attributes|count_of_missing_values|perc_count_of_missing_values|
+--------------------+-----------------------+----------------------------+
|migration_code-ch...|                  49731|          49.941252673756516|
|migration_code-ch...|                  49731|          49.941252673756516|
|migration_code-mo...|                  49731|          49.941252673756516|
|migration_prev_re...|                  49731|          49.941252673756516|
|country_of_birth_...|                   3272|          3.2858333584390285|
|country_of_birth_...|                   2961|          2.9735185129394752|
|country_of_birth_...|                   1603|          1.6097771618513945|
|state_of_previous...|                    356|          0.3575050964560801|
|         citizenship|                      0|                         0.0|
|own_business_or_s...|                      0|                         0.0|
|     class_

In [16]:
col_with_missing_values_less_than_20_perc= null_analysis_df_2.filter(col('perc_count_of_missing_values')<=20)
col_with_missing_values_less_than_20_perc.show()

+--------------------+-----------------------+----------------------------+
|          attributes|count_of_missing_values|perc_count_of_missing_values|
+--------------------+-----------------------+----------------------------+
|country_of_birth_...|                   3272|          3.2858333584390285|
|country_of_birth_...|                   2961|          2.9735185129394752|
|country_of_birth_...|                   1603|          1.6097771618513945|
|state_of_previous...|                    356|          0.3575050964560801|
|         citizenship|                      0|                         0.0|
|own_business_or_s...|                      0|                         0.0|
|     class_of_worker|                      0|                         0.0|
|family_members_un...|                      0|                         0.0|
|num_persons_worke...|                      0|                         0.0|
|fill_inc_question...|                      0|                         0.0|
|live_in_thi

In [17]:
#Get column names
col_with_missing_values_less_than_20_perc = col_with_missing_values_less_than_20_perc.select('attributes')
col_with_missing_values_less_than_20_perc_list = col_with_missing_values_less_than_20_perc.rdd.flatMap(lambda x: x).collect()


In [18]:
# Keep only the retained columns and drop instance_weight, which must not be used in the classifier
data_1 = data[col_with_missing_values_less_than_20_perc_list].drop('instance_weight')
data_1.show()

+-----------------------+-----------------------+---------------------+---------------------------+--------------------+-----------------------------+--------------------+-----------------------+-------------------------------+------------------------------------------+-----------------------------+-----------------+--------------------+----+---------------------------------------+----------------------------------+---+----------------------------+-------------------+------------------------+--------------------------+--------------------+-------------+--------------------------+--------------------+--------------------+---------------------+-----+------------------+------+-----------------------+-----------------------+---------------------------------+-------------+--------------+---------------------+--------+
|country_of_birth_father|country_of_birth_mother|country_of_birth_self|state_of_previous_residence|         citizenship|own_business_or_self_employed|     class_of_worker|fami
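
The threshold logic of cells 15 to 18 can be sketched in plain Python using a few of the counts printed above. One difference is deliberate: selecting columns from the sorted list reorders them by missing count, as the output above shows, while iterating over the original column order keeps the schema order. `columns_to_keep` is a hypothetical helper name.

```python
total_rows = 99579

# Missing-value counts copied from the outputs above (subset, in schema order)
missing_counts = {
    "state_of_previous_residence": 356,
    "migration_prev_res_in_sunbelt": 49731,
    "country_of_birth_father": 3272,
    "citizenship": 0,
}


def columns_to_keep(counts, n_rows, max_missing_pct=20.0):
    """Return the columns whose missing percentage is at most max_missing_pct.

    Iterating over the dict preserves the original column order.
    """
    return [name for name, missing in counts.items()
            if 100.0 * missing / n_rows <= max_missing_pct]


kept = columns_to_keep(missing_counts, total_rows)
print(kept)
```

Only `migration_prev_res_in_sunbelt`, at roughly 49.94% missing, is excluded from this subset.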

#### 13. Drop the rows with NA values and work on the remaining dataset(2M)

In [19]:
print("count of rows, before removing rows with null : {}".format(data_1.count()))
data_no_missing_val = data_1.na.drop()
print("count of rows, after removing rows with null : {}".format(data_no_missing_val.count()))


count of rows, before removing rows with null : 99579
count of rows, after removing rows with null : 95193


#### 14. The distribution of income class on education(2M)

In [20]:
data_no_missing_val_dist = data_no_missing_val.groupby('education','Income').count().sort(col("education"),col("Income")).withColumnRenamed(
    'count','num_units_with_specific_edu_and_income')
data_no_missing_val_dist.show(100)


+--------------------+--------+--------------------------------------+
|           education|  Income|num_units_with_specific_edu_and_income|
+--------------------+--------+--------------------------------------+
|          10th grade|- 50000.|                                  3554|
|          10th grade| 50000+.|                                    33|
|          11th grade|- 50000.|                                  3299|
|          11th grade| 50000+.|                                    28|
|12th grade no dip...|- 50000.|                                   961|
|12th grade no dip...| 50000+.|                                    16|
|1st 2nd 3rd or 4t...|- 50000.|                                   864|
|1st 2nd 3rd or 4t...| 50000+.|                                     5|
|    5th or 6th grade|- 50000.|                                  1528|
|    5th or 6th grade| 50000+.|                                     9|
|   7th and 8th grade|- 50000.|                                  3710|
|   7t
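
Raw counts are hard to compare across education levels of different sizes. A short plain-Python sketch, using three rows copied from the output above, converts them to the share of `50000+.` records per level; `high_income_share` is a hypothetical helper.

```python
# Counts copied from the output above for three education levels
counts = {
    "10th grade": {"- 50000.": 3554, "50000+.": 33},
    "11th grade": {"- 50000.": 3299, "50000+.": 28},
    "5th or 6th grade": {"- 50000.": 1528, "50000+.": 9},
}


def high_income_share(level_counts):
    """Percentage of records in the '50000+.' class for one education level."""
    total = sum(level_counts.values())
    return 100.0 * level_counts.get("50000+.", 0) / total


shares = {edu: round(high_income_share(c), 2) for edu, c in counts.items()}
print(shares)
```

For these three levels the high-income share stays below 1%.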

#### 15. Find the Correlation between Columns: "weeks worked in year" and "num persons worked for employer" (2M)

In [21]:
data_no_missing_val.stat.corr("weeks_worked_in_year", "num_persons_worked_for_employer")

0.7495145872631152
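
`DataFrame.stat.corr` computes the Pearson correlation coefficient. As a reminder of what that number means, here is a minimal plain-Python sketch of the same formula on toy data; `pearson_corr` is a hypothetical helper, not part of Spark.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: a perfectly linear relationship
print(pearson_corr([1, 2, 3, 4], [2, 4, 6, 8]))
```

A value of about 0.75, as obtained above, indicates a fairly strong positive linear relationship between the two columns.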

#### 16. Define the schema dict from data type of the Data Frame (2M)

In [22]:
dict(data_no_missing_val.dtypes)

{'Full_or_part_time_employment_stat': 'string',
 'Income': 'string',
 'age': 'int',
 'capital_gains': 'int',
 'capital_losses': 'int',
 'citizenship': 'string',
 'class_of_worker': 'string',
 'country_of_birth_father': 'string',
 'country_of_birth_mother': 'string',
 'country_of_birth_self': 'string',
 'detailed_household_and_family_stat': 'string',
 'detailed_household_summary_in_household': 'string',
 'detailed_industry_recode': 'int',
 'detailed_occupation_recode': 'int',
 'dividends_from_stocks': 'int',
 'education': 'string',
 'enroll_in_edu_inst_last_wk': 'string',
 'family_members_under_18': 'string',
 "fill_inc_questionnaire_for_veteran's_admin": 'string',
 'hispanic_origin': 'string',
 'live_in_this_house_1_year_ago': 'string',
 'major_industry_code': 'string',
 'major_occupation_code': 'string',
 'marital_stat': 'string',
 'member_of_a_labor_union': 'string',
 'num_persons_worked_for_employer': 'int',
 'own_business_or_self_employed': 'int',
 'race': 'string',
 'reason_for_un

#### 17. Write code to Separate the columns into Categorical and Numerical attributes (Not Manual)(2M)

In [23]:
numeric_col = [name for name, dtype in data_no_missing_val.dtypes if dtype.startswith(('int', 'float'))]
string_col = [name for name, dtype in data_no_missing_val.dtypes if dtype.startswith('string')]
string_col.remove('Income')

print("numeric_col are : {}".format(numeric_col))
print("\n")
print("string_col are : {}".format(string_col))

numeric_col are : ['own_business_or_self_employed', 'num_persons_worked_for_employer', 'veterans_benefits', 'weeks_worked_in_year', 'year', 'age', 'detailed_industry_recode', 'detailed_occupation_recode', 'wage_per_hour', 'capital_gains', 'capital_losses', 'dividends_from_stocks']


string_col are : ['country_of_birth_father', 'country_of_birth_mother', 'country_of_birth_self', 'state_of_previous_residence', 'citizenship', 'class_of_worker', 'family_members_under_18', "fill_inc_questionnaire_for_veteran's_admin", 'live_in_this_house_1_year_ago', 'detailed_household_summary_in_household', 'detailed_household_and_family_stat', 'region_of_previous_residence', 'tax_filer_stat', 'education', 'enroll_in_edu_inst_last_wk', 'marital_stat', 'major_industry_code', 'major_occupation_code', 'race', 'hispanic_origin', 'sex', 'member_of_a_labor_union', 'reason_for_unemployment', 'Full_or_part_time_employment_stat']
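
The prefix test used above works for this schema, but it would miss columns typed `bigint` or `double`, which Spark also reports in `dtypes`. A slightly more general plain-Python sketch is shown below; `split_columns` and `NUMERIC_PREFIXES` are hypothetical names, and the input is a list of `(name, dtype)` pairs like `DataFrame.dtypes`.

```python
# Type-name prefixes treated as numeric (an assumption for this sketch)
NUMERIC_PREFIXES = ("tinyint", "smallint", "int", "bigint",
                    "float", "double", "decimal")


def split_columns(dtypes, target):
    """Split (name, dtype) pairs into numeric and string columns, skipping target."""
    numeric, categorical = [], []
    for name, dtype in dtypes:
        if name == target:
            continue
        if dtype.startswith(NUMERIC_PREFIXES):
            numeric.append(name)
        elif dtype == "string":
            categorical.append(name)
    return numeric, categorical


sample_dtypes = [("age", "int"), ("education", "string"),
                 ("Income", "string"), ("weight", "double")]
print(split_columns(sample_dtypes, target="Income"))
```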


#### 18. Split the data into train and test (2M)

In [24]:
(trainingData, testData) = data_no_missing_val.randomSplit([0.7, 0.3])
print("Training data count : {}".format(trainingData.count()))
print("Test data count : {}".format(testData.count()))


Training data count : 66622
Test data count : 28571
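
`np.random.seed(333)` in cell 6 seeds NumPy only, so it does not fix this split. `randomSplit` accepts an optional `seed` argument for that purpose. The plain-Python sketch below illustrates the idea of a seeded Bernoulli split; `seeded_split` is a hypothetical helper and is not Spark's implementation.

```python
import random


def seeded_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, reproducibly."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train_a, test_a = seeded_split(rows, 0.7, seed=333)
train_b, test_b = seeded_split(rows, 0.7, seed=333)
print(train_a == train_b, len(train_a) + len(test_a))
```

As in the output above, where 66622 of 95193 rows landed in the training set, the split sizes are only approximately 70/30 because each row is assigned independently.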


#### 19. Cache the train and validation data sets and unpersist data (2M)

In [25]:
trainingData.persist()
testData.persist()


print("Training data count : {}".format(trainingData.count()))
print("Test data count : {}".format(testData.count()))


trainingData.unpersist()
testData.unpersist()

Training data count : 66622
Test data count : 28571


DataFrame[country_of_birth_father: string, country_of_birth_mother: string, country_of_birth_self: string, state_of_previous_residence: string, citizenship: string, own_business_or_self_employed: int, class_of_worker: string, family_members_under_18: string, num_persons_worked_for_employer: int, fill_inc_questionnaire_for_veteran's_admin: string, live_in_this_house_1_year_ago: string, veterans_benefits: int, weeks_worked_in_year: int, year: int, detailed_household_summary_in_household: string, detailed_household_and_family_stat: string, age: int, region_of_previous_residence: string, tax_filer_stat: string, detailed_industry_recode: int, detailed_occupation_recode: int, education: string, wage_per_hour: int, enroll_in_edu_inst_last_wk: string, marital_stat: string, major_industry_code: string, major_occupation_code: string, race: string, hispanic_origin: string, sex: string, member_of_a_labor_union: string, reason_for_unemployment: string, Full_or_part_time_employment_stat: string, cap

#### 20. Check for the Class balance in the train and test data set (2M)

In [26]:
trainingData.groupBy('Income').count().withColumn("perc_count", (col('count')/trainingData.count())*100).show()

+--------+-----+------------------+
|  Income|count|        perc_count|
+--------+-----+------------------+
| 50000+.| 4071|6.1105940980456905|
|- 50000.|62551| 93.88940590195432|
+--------+-----+------------------+



In [27]:
testData.groupBy('Income').count().withColumn("perc_count", (col('count')/testData.count())*100).show()

+--------+-----+-----------------+
|  Income|count|       perc_count|
+--------+-----+-----------------+
| 50000+.| 1715|6.002590038850583|
|- 50000.|26856|93.99740996114942|
+--------+-----+-----------------+



#### 21. Perform the required feature Preprocessing and create the pipeline for the flow: (10M)

In [28]:
# Use VectorAssembler to combine the list of numeric columns into a single vector column.
assembler_Num = VectorAssembler(inputCols=numeric_col, outputCol="num_features")

In [29]:
#Scale all the numeric attributes using MinMaxScaler
min_Max_Scalar = MinMaxScaler(inputCol="num_features", outputCol="scaled_num_features")

In [30]:
# Convert categorical columns to numeric: index each string column, then one-hot encode it
indexers_Cat = [StringIndexer(inputCol=cat_Var_Name, outputCol="{0}_index".format(cat_Var_Name), handleInvalid='skip') for cat_Var_Name in string_col ]
encoders_Cat = [OneHotEncoder(inputCol=indexer.getOutputCol(), outputCol="{0}_vec".format(indexer.getInputCol())) for indexer in indexers_Cat]
assembler_Cat = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders_Cat], outputCol="cat_features")

assembler = VectorAssembler(inputCols=["scaled_num_features", "cat_features"], outputCol="features")

In [31]:
# Index the target column (Income) into a numeric label
indexer_Label = StringIndexer(inputCol="Income", outputCol="label")

In [32]:
# Create the preprocessing stage
preprocessiong_Stages = [assembler_Num]+[min_Max_Scalar]+indexers_Cat+encoders_Cat+[assembler_Cat]+[assembler]+[indexer_Label]
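
To make the stages above concrete, here is a simplified plain-Python sketch of what each transformation does to a single column. It assumes the defaults of the Spark stages (output range [0, 1] for `MinMaxScaler`, frequency-ordered indices for `StringIndexer`, last category dropped by `OneHotEncoder`); the helper names are hypothetical and the toy data has no frequency ties.

```python
from collections import Counter


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def string_index(values):
    """Map each category to an index, most frequent category first."""
    ordered = [cat for cat, _ in Counter(values).most_common()]
    return {cat: i for i, cat in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """One-hot vector; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


ages = [10, 20, 30]
sexes = ["Female", "Female", "Male"]
index = string_index(sexes)
print(min_max_scale(ages))
print(index, [one_hot(index[s], len(index)) for s in sexes])
```

Dropping the last category keeps the encoded columns linearly independent, which matters for a linear model such as logistic regression.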

Adding code for garbage collection.

In [33]:
import gc
gc.collect()

8

#### 22. Create the Logistic regression Model(5M)

In [34]:
lr = LogisticRegression(maxIter=10, labelCol="label", featuresCol="features")
lr_Pipeline = Pipeline(stages=preprocessiong_Stages+[lr])

In [35]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", 
                                              predictionCol="prediction",
                                              metricName="accuracy")

In [36]:
# Define the hyperparameter grid over regParam and elasticNetParam
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.2, 0.3]) \
    .addGrid(lr.elasticNetParam, [0.4, 0.5, 0.6])\
    .build()

lr_crossval = CrossValidator(estimator=lr_Pipeline,
                             estimatorParamMaps=paramGrid,
                             evaluator=evaluator,
                             numFolds=2)
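
The cost of this search can be estimated before running it. The sketch below enumerates the same grid in plain Python and counts the pipeline fits, on the understanding that `CrossValidator` fits every parameter combination once per fold and then refits the best one on the full training data. It also writes out the elastic-net penalty that the two parameters control; `elastic_net_penalty` is a hypothetical helper.

```python
from itertools import product

reg_params = [0.1, 0.2, 0.3]
elastic_net_params = [0.4, 0.5, 0.6]
num_folds = 2

# Every combination of the two hyperparameters
grid = [{"regParam": r, "elasticNetParam": e}
        for r, e in product(reg_params, elastic_net_params)]
cv_fits = len(grid) * num_folds
total_fits = cv_fits + 1  # plus the final refit of the best combination


def elastic_net_penalty(weights, reg_param, alpha):
    """reg_param * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)."""
    l1 = sum(abs(w) for w in weights)
    l2_sq = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_sq)


print(len(grid), cv_fits, total_fits)
```

Since every value of `elasticNetParam` in the grid is above zero, each candidate includes an L1 term, which tends to drive many coefficients to exactly zero.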

In [37]:
# Run cross-validation, and choose the best set of parameters.
lr_crossval_Model = lr_crossval.fit(trainingData)

In [38]:
lr_crossval_Model.bestModel.stages

[VectorAssembler_4b1e9db90c301270c578,
 MinMaxScaler_48d6b1576bc3e844223e,
 StringIndexer_44be9c29cbb8729f9ef3,
 StringIndexer_4cc5832fe4a9a1edd8a3,
 StringIndexer_472b9135301835949aff,
 StringIndexer_48a083931b83c32921a7,
 StringIndexer_4397b2972dd0e7eac075,
 StringIndexer_4962a8a36bed0dbc1407,
 StringIndexer_4a91a1988818e375ed98,
 StringIndexer_4328b1a91e1b26c78e21,
 StringIndexer_4ef3b9c4f98a6ccdcfe6,
 StringIndexer_429c87d94f70580bb40b,
 StringIndexer_46199560f84b3c39a559,
 StringIndexer_4fa8a20a41d669f7e9a8,
 StringIndexer_423e8837e327f395ee98,
 StringIndexer_4a678198ec47a1590a41,
 StringIndexer_4533948074bb8f7c371a,
 StringIndexer_457880e8dcd93fa7484f,
 StringIndexer_4d378cdd0878eadbefa3,
 StringIndexer_4d4e8d637fc96b628fb2,
 StringIndexer_442bad54db0d4edbe46b,
 StringIndexer_45ddb37019e4bf05e007,
 StringIndexer_4b5394499e15083936ec,
 StringIndexer_4415a54102c31b6e712e,
 StringIndexer_4538b385c937dff57a93,
 StringIndexer_4839a348ff2780e3a870,
 OneHotEncoder_4afc818c08fcb5fec3f2,


In [39]:
train_predictions_lrcv = lr_crossval_Model.transform(trainingData)
test_predictions_lrcv = lr_crossval_Model.transform(testData)

* Print the objective history.

In [40]:
lr_Summary = lr_crossval_Model.bestModel.stages[-1].summary
objectiveHistory = lr_Summary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

objectiveHistory:
0.229999776827
0.229130563558
0.226974602963
0.226858340323
0.2267451163
0.226654203677
0.22621028838
0.226147527413
0.226065298486
0.225985700564
0.225876296073
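
The printed history can be checked programmatically. The sketch below copies the values above, confirms that the objective decreases at every step, and measures the size of the last improvement. The 11 entries are consistent with an initial value followed by the `maxIter=10` iterations; whether the optimiser had fully converged by then is not something these numbers settle.

```python
# Objective values copied from the output above
objective_history = [
    0.229999776827, 0.229130563558, 0.226974602963, 0.226858340323,
    0.2267451163, 0.226654203677, 0.22621028838, 0.226147527413,
    0.226065298486, 0.225985700564, 0.225876296073,
]


def is_strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


last_relative_drop = ((objective_history[-2] - objective_history[-1])
                      / objective_history[-2])
print(is_strictly_decreasing(objective_history), last_relative_drop)
```

If the last relative drop is still noticeable, raising `maxIter` is one thing that could be tried.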


#### 23. Answer the following questions (2M)

* What is the train accuracy?

In [41]:
predictionAndLabels_train_lrcv = train_predictions_lrcv.select("prediction", "label")
train_accuracycv = evaluator.evaluate(predictionAndLabels_train_lrcv)
print("Train set accuracy  = " + str(train_accuracycv))

Train set accuracy  = 0.93889405902


* What is the test accuracy?

In [42]:
predictionAndLabels_test_lrcv = test_predictions_lrcv.select("prediction", "label")
test_accuracycv = evaluator.evaluate(predictionAndLabels_test_lrcv)
print("Test set accuracy = " + str(test_accuracycv))

Test set accuracy = 0.9399719986


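* What are your observations? Write your observations in markdown format.

1. The data is highly imbalanced: about 94% of the records fall in the `- 50000.` class in both the train and test sets.
2. Judged on F1 score and accuracy alone, the classifier appears to perform well. However, the train accuracy reported above (0.9389) equals the share of the majority class in the training data (93.89%), so these scores should be read against a majority-class baseline rather than as evidence of skill on the `50000+.` class.
3. F1 score:
    a. 0.910177679985 for train
    
    b. 0.90884159297 for test
4. Accuracy (these figures differ slightly from the cell outputs above, presumably because the unseeded random split changed between runs):
    a. 0.939489153149 for train
    
    b. 0.938579052004 for test
5. I also tried the binary evaluation metrics (`BinaryClassificationEvaluator`), which gave a ROC AUC (areaUnderROC) close to 1.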
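
With classes this imbalanced, it helps to compare the scores against a classifier that always predicts the majority class. The sketch below uses the class counts from step 20. It assumes that the `f1` metric of `MulticlassClassificationEvaluator` is the support-weighted F1, in which case an always-majority predictor with majority share p scores 2p²/(1+p). The helper names are hypothetical.

```python
def majority_share(majority_count, minority_count):
    """Accuracy of a classifier that always predicts the majority class."""
    return majority_count / (majority_count + minority_count)


def weighted_f1_majority_only(p):
    """Support-weighted F1 when every prediction is the majority class.

    Majority class: precision p, recall 1, so F1 = 2p / (1 + p), weight p.
    Minority class: never predicted, so its F1 is 0.
    """
    return p * (2.0 * p / (1.0 + p))


train_baseline = majority_share(62551, 4071)
test_baseline = majority_share(26856, 1715)
print(train_baseline, test_baseline)

# F1 implied by the accuracies listed in the observations
print(weighted_f1_majority_only(0.939489153149))
print(weighted_f1_majority_only(0.938579052004))
```

The train baseline matches the train accuracy from cell 41, and the F1 values implied by the listed accuracies match the listed F1 scores. This suggests the model predicts `- 50000.` for nearly every record. Per-class precision and recall, or the confusion matrix, would show this directly.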

In [43]:
gc.collect()

6117

#### 24. Perform the necessary tuning methods(2M)

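Tuning was done above with `CrossValidator` over the `regParam` and `elasticNetParam` grid. The coefficients and intercept of the best model are shown below.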

In [44]:
print("Coefficients: " + str(lr_crossval_Model.bestModel.stages[-1].coefficients))
print("Intercept: " + str(lr_crossval_Model.bestModel.stages[-1].intercept))

Coefficients: (345,[3,9,188,204,211,263,280,303,305,306],[0.2836244572426422,1.4471021375511484,-0.03085967190955468,0.03629710801613633,0.10436158979180238,0.19365687509099686,-0.03673853533028195,-0.03673853533028195,0.22684609038357767,0.41638541859235534])
Intercept: -2.96786128745
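
The coefficient vector is printed in sparse form: size, non-zero indices, values. A short plain-Python sketch using the numbers above shows how sparse the model is and what probability the intercept alone implies for label 1. This assumes label 1 is the less frequent class `50000+.`, since `StringIndexer` assigns index 0 to the most frequent label by default. A row with all features at zero is only a hypothetical reference point.

```python
import math

# Values copied from the output above
num_coefficients = 345
nonzero_indices = [3, 9, 188, 204, 211, 263, 280, 303, 305, 306]
intercept = -2.96786128745


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


sparsity = 1.0 - len(nonzero_indices) / num_coefficients
baseline_probability = sigmoid(intercept)
print(len(nonzero_indices), round(sparsity, 3), round(baseline_probability, 3))
```

Only 10 of the 345 coefficients are non-zero, which is the expected effect of the L1 part of the elastic-net penalty, and the intercept alone puts the predicted probability of label 1 well below the 0.5 decision threshold.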


#### 25. Save the Crossvalidated model and load the model and pass the test data (7M)

In [45]:
# Save the best pipeline model found by cross-validation
lr_crossval_Model.bestModel.save("/user/2618B56/cute_part_B")

In [46]:
# Load the saved pipeline model back from the same path
model_loaded = PipelineModel.load("/user/2618B56/cute_part_B")

In [47]:
test_loaded_model_predictions_lrcv = model_loaded.transform(testData)

predictionAndLabels_model_loaded_test_lrcv = test_loaded_model_predictions_lrcv.select("prediction", "label")
test_loaded_model_accuracycv = evaluator.evaluate(predictionAndLabels_model_loaded_test_lrcv)
print("Test set accuracy = " + str(test_loaded_model_accuracycv))

Test set accuracy = 0.9399719986
