# Table of Contents 
**1/Perform Basic Operations on a Spark Dataframe**
- Reading a CSV file
- Defining the Schema

**2/EDA using PySpark**
- Check the Data Dimensions
- Describe the Data
- Missing Values Count
- Find Count of Unique Values in a Column

**3/Building ML Pipeline**
- Encode Categorical Variables using PySpark
- String Indexing
- One Hot Encoding
- Vector Assembler

**4/Model Evaluation**

**5/Bib**

In [2]:
#Notes to include
#Why working with dataframes
#changing the dtype of a column

In [3]:
#importing libraries
from pyspark.ml.classification import LogisticRegression
import pyspark.sql.types as tp

# Perform Basic Operations on a Spark Dataframe

In [5]:
#Reading a CSV file
data=spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/FileStore/tables/ibm_hr_analytics_attrition-7bd0d.csv")

In [6]:
#print the schema inferred from the CSV file:
data.printSchema()

# EDA

In [8]:
#check data dimensions:
data.count(),len(data.columns)

We have 1470 rows and 35 columns in our dataset.

We are going to drop some columns from this dataset, namely Department and Over18.
To drop several columns at once, we unpack a list of column names with *.

In [11]:
#display the number of null values in each column
from pyspark.sql.functions import when, count, col
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns]).toPandas()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
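
The null count above relies on two behaviours: when(condition, value) without an otherwise yields null wherever the condition is false, and count skips nulls, so only the rows where the column is null are counted. Below is a plain-Python sketch of the same counting logic. It is not Spark code; the rows and the helper name `count_nulls` are hypothetical and only illustrate the idea.

```python
# Plain-Python sketch of per-column null counting (not Spark code).
# The rows are hypothetical sample data, with None standing in for null.
rows = [
    {"Age": 41, "Attrition": "Yes", "MonthlyIncome": 5993},
    {"Age": None, "Attrition": "No", "MonthlyIncome": 5130},
    {"Age": 37, "Attrition": None, "MonthlyIncome": None},
    {"Age": None, "Attrition": "No", "MonthlyIncome": 2909},
]


def count_nulls(rows):
    """Return {column: number of rows where that column is None}."""
    columns = rows[0].keys()
    return {c: sum(1 for r in rows if r[c] is None) for c in columns}


null_counts = count_nulls(rows)
print(null_counts)
```

In the notebook's dataset every count is 0, so no imputation or row filtering is needed before modelling.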


In [12]:
#drop() silently ignores names that match no column, so the spelling must match the schema
data=data.drop(*['Department','Over18'])
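
DataFrame.drop does not raise an error for a name that matches no column; it leaves the DataFrame unchanged, so a misspelled name goes unnoticed. Here is a minimal sketch of a guard, assuming the column names are available as a list (as data.columns provides) and that names are matched case-insensitively, which is Spark's default setting. The helper `unknown_columns` and the column subset are hypothetical.

```python
def unknown_columns(requested, available):
    """Return the requested names that match no available column (case-insensitive)."""
    known = {name.lower() for name in available}
    return [name for name in requested if name.lower() not in known]


# Hypothetical subset of the dataset's columns, standing in for data.columns.
available = ["Age", "Attrition", "Department", "Over18", "OverTime"]

# "departement" is misspelled, "over18" differs from the schema only in case.
typos = unknown_columns(["departement", "over18"], available)
print(typos)  # the names that drop() would silently ignore
```

Running such a check before the drop, and stopping if the returned list is not empty, turns a silent no-op into a visible error.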

Get a first overview of how many columns there are per dtype:

In [14]:
from collections import Counter
print(Counter(x[1] for x in data.dtypes))

Collect the names of the integer features and the string features:

In [16]:
String_features = [item[0] for item in data.dtypes if item[1].startswith('string')]
Int_features=[item[0] for item in data.dtypes if item[1].startswith('int')]
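
data.dtypes is a list of (column name, type name) pairs, which is why the cells above can count and filter it with ordinary Python. One caveat: a startswith('int') filter only catches columns typed int, not bigint or double. The sketch below shows the difference on a hypothetical dtypes list; the columns `SomeRatio` and `SomeId` are invented for illustration and are not in the dataset.

```python
from collections import Counter

# Hypothetical stand-in for data.dtypes: (column name, Spark type name) pairs.
dtypes = [
    ("Age", "int"),
    ("Attrition", "string"),
    ("MonthlyIncome", "int"),
    ("OverTime", "string"),
    ("SomeRatio", "double"),  # invented column
    ("SomeId", "bigint"),     # invented column
]

# Number of columns per type, as in the Counter cell above.
type_counts = Counter(type_name for _, type_name in dtypes)

string_features = [name for name, t in dtypes if t == "string"]

# The startswith('int') filter misses the double and bigint columns.
int_only = [name for name, t in dtypes if t.startswith("int")]

# A broader filter that covers Spark's numeric type names.
numeric_types = ("tinyint", "smallint", "int", "bigint", "float", "double")
numeric_features = [
    name for name, t in dtypes
    if t in numeric_types or t.startswith("decimal")
]

print(type_counts)
print(int_only)
print(numeric_features)
```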

In [17]:
#get the summary of the numerical columns
data.select('StandardHours','YearsAtCompany','JobSatisfaction','MonthlyIncome').describe().show()

StandardHours carries no information: its standard deviation is 0.0, so the variable is constant.
We therefore exclude it from the rest of the work.

In [19]:
#Get an idea of the distribution of our Target variable:
data.groupby('Attrition').count().show()

In [20]:
categorical_variables=['Attrition','OverTime','JobRole']
numerical_variables=['JobSatisfaction','MonthlyIncome','YearsAtCompany']

# Building ML Pipeline

**Encode Categorical Variables**

Most machine learning algorithms accept only numerical input, so we must convert the categorical variables in our dataset into numbers.
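
The two encoding steps used in this notebook can be sketched in plain Python. By default, StringIndexer assigns index 0 to the most frequent label, 1 to the next, and so on; the one-hot encoder then turns each index into a 0/1 vector and, by default, drops the last category so that it is represented by an all-zero vector. The code below is an illustration of that behaviour and not Spark's implementation (Spark returns sparse vectors); the helper names and the sample values are hypothetical.

```python
from collections import Counter


def string_index(values):
    """Map each label to a float index, most frequent label first."""
    # most_common() orders by descending frequency; the sample has no ties.
    ordered = [label for label, _ in Counter(values).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if int(index) < size:
        vector[int(index)] = 1.0
    return vector


# Hypothetical JobRole values.
roles = [
    "Sales Executive", "Manager", "Sales Executive",
    "Research Scientist", "Sales Executive", "Research Scientist",
]

mapping = string_index(roles)
encoded = [one_hot(mapping[r], len(mapping)) for r in roles]

for role, vector in zip(roles, encoded):
    print(role, mapping[role], vector)
```

Dropping the last category keeps the encoded columns linearly independent, which matters for models with an intercept such as logistic regression.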

In [23]:
#OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.0+ renamed it to OneHotEncoder
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

In [24]:
df=data.select('Attrition','OverTime','JobRole','JobSatisfaction','MonthlyIncome','YearsAtCompany')

In [25]:
from pyspark.ml.feature import VectorAssembler
# define stage 1: index the target column Attrition as a numeric label
stage_1 = StringIndexer(inputCol= 'Attrition', outputCol= 'Attrition_index')
# define stage 2: transform the column OverTime to numeric
stage_2 = StringIndexer(inputCol= 'OverTime', outputCol= 'OverTime_index')
# define stage 3: transform the column JobRole to numeric
stage_3 = StringIndexer(inputCol= 'JobRole', outputCol= 'JobRole_index')
# define stage 4: one hot encode the numeric versions of OverTime and JobRole generated by stage 2 and stage 3
stage_4= OneHotEncoderEstimator(inputCols=[stage_2.getOutputCol(), stage_3.getOutputCol()], 
                                 outputCols= ['OverTime_encoded','JobRole_encoded'])
# define stage 5: create a vector of all the features required to train the logistic regression model 
stage_5 = VectorAssembler(inputCols=numerical_variables+stage_4.getOutputCols(),outputCol='features')
# the logistic regression model itself is not a pipeline stage; it is defined in the Model Evaluation section


In [26]:
# setup the pipeline
from pyspark.ml import Pipeline
regression_pipeline = Pipeline(stages= [stage_1, stage_2, stage_3, stage_4,stage_5])
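
The last stage of the pipeline, the vector assembler, concatenates the numerical columns and the one-hot encoded vectors into a single feature vector per row, in the order given by inputCols. Here is a plain-Python sketch of that concatenation; it is not Spark code, and the helper `assemble` and the sample row are hypothetical.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # An encoded column contributes all of its entries.
            features.extend(float(v) for v in value)
        else:
            # A numerical column contributes a single entry.
            features.append(float(value))
    return features


# Hypothetical row after string indexing and one-hot encoding.
row = {
    "JobSatisfaction": 4,
    "MonthlyIncome": 5993,
    "YearsAtCompany": 6,
    "OverTime_encoded": [1.0],
    "JobRole_encoded": [0.0, 1.0],
}
input_cols = [
    "JobSatisfaction", "MonthlyIncome", "YearsAtCompany",
    "OverTime_encoded", "JobRole_encoded",
]

features = assemble(row, input_cols)
print(features)
```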

In [27]:
#fit the pipeline on the data, then apply the fitted stages to it
fitted_model=regression_pipeline.fit(data)
transformed_model=fitted_model.transform(data)


In [28]:
#select features and target 
df=transformed_model.select('features','Attrition_index')

In [29]:
#split into roughly 70% training and 30% test rows; the seed makes the split reproducible
train,test = df.randomSplit([0.7,0.3],seed=12345)
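
randomSplit normalises the weights and assigns rows to the splits at random, so the 70/30 proportions hold only approximately and the exact sizes depend on the seed. The sketch below illustrates the idea in plain Python, under the assumption that each row is assigned by one independent random draw compared against the cumulative weights. It is not Spark's implementation, and the helper `random_split` is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by one random draw against cumulative weights."""
    total = sum(weights)
    bounds = []
    cumulative = 0.0
    for w in weights:
        cumulative += w / total  # normalise so the bounds end near 1.0
        bounds.append(cumulative)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=12345)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one split, and repeating the call with the same seed reproduces the same split.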

# Model Evaluation

In [31]:
LR = LogisticRegression(featuresCol='features',labelCol='Attrition_index')

In [32]:
my_model=LR.fit(train)

In [33]:
evaluation_summary = my_model.evaluate(test)

In [34]:
evaluation_summary.accuracy
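
Accuracy is the fraction of test rows whose predicted label equals the true label. When one class of the target is much more frequent than the other, it helps to compare this number with the accuracy of always predicting the most frequent class. Here is a plain-Python sketch of both quantities; the labels and predictions are hypothetical and are not results from this notebook.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_frequent = max(set(labels), key=labels.count)
    return labels.count(most_frequent) / len(labels)


# Hypothetical imbalanced target: eight rows of class 0.0, two of class 1.0.
labels = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
predictions = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0]

model_accuracy = accuracy(labels, predictions)
baseline_accuracy = majority_baseline(labels)
print(model_accuracy, baseline_accuracy)
```

In this made-up example the model reaches the same accuracy as the baseline, which shows why accuracy alone can overstate the quality of a classifier on an imbalanced target.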

# Bib

- https://towardsdatascience.com/data-prep-with-spark-dataframes-3629478a1041
- https://www.analyticsvidhya.com/blog/2019/11/build-machine-learning-pipelines-pyspark/