# 2 - Apache Spark ML - Create train and test set

In this chapter, you will:

• Create a test and train set

• Learn more Spark functionality and how to use it

In [None]:
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session before reading any data
spark = SparkSession.builder \
    .master('local[*]') \
    .appName("ApacheSparkML") \
    .getOrCreate()

After filtering and working on the train DataFrame, we need to make sure the test set has the same structure.

Load the testing data from the CSV file:

In [None]:
df_test = spark.read.csv("testing_bot_data.csv", header= True)

Clean and prepare the testing data as well.
Remember that the testing data has no `bot` value.

You will not drop `id`, as we will use it to compare results later.

Execute the next commands:


In [None]:
from pyspark.sql.types import IntegerType, ArrayType, BooleanType, StringType
from pyspark.sql.functions import udf
from pyspark.sql.functions import when

# Dropping irrelevant columns and duplicates
df_test = df_test.drop('default_profile_image','has_extended_profile','url','created_at','lang')
df_test = df_test.dropDuplicates()


In [None]:

# First Transformation
df_test = df_test.withColumn("friends_count", df_test["friends_count"].cast(IntegerType()))
df_test = df_test.withColumn("listed_count", df_test["listed_count"].cast(IntegerType()))
df_test = df_test.withColumn("favourites_count", df_test["favourites_count"].cast(IntegerType()))
df_test = df_test.withColumn("statuses_count", df_test["statuses_count"].cast(IntegerType()))
df_test = df_test.withColumn("verified", df_test["verified"].cast(BooleanType()))
df_test = df_test.withColumn("default_profile", df_test["default_profile"].cast(BooleanType()))


In [None]:

# Second Transformation
df_test = df_test.withColumn('default_profile',df_test['default_profile'].cast(IntegerType()))
df_test = df_test.withColumn('name',when(df_test['name'].isNull(),0).otherwise(1))
df_test = df_test.withColumn('verified',df_test['verified'].cast(IntegerType()))


In [None]:

# Third Transformation
df_test = df_test.withColumn('verified',when(df_test['verified'].isNull(),0).otherwise(df_test['verified']))
df_test = df_test.withColumn('default_profile',when(df_test['default_profile'].isNull(),0).otherwise(df_test['default_profile']))
df_test = df_test.withColumn('location',when(df_test['location'].isNull(),0).otherwise(1))
df_test = df_test.withColumn('status',when(df_test['status'].isNull(),0).otherwise(1))
df_test = df_test.withColumn('screen_name',when(df_test['screen_name'].isNull(),0).otherwise(1))
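
The transformations above use two different encodings: columns such as `name`, `location`, `status` and `screen_name` are reduced to a presence flag (0 when the value is missing, 1 otherwise), while `verified` and `default_profile` keep their value and only have nulls replaced with 0. As a rough plain-Python analogy of these two `when(...).otherwise(...)` patterns (the helper names and the sample record are made up for illustration and are not part of the notebook):

```python
def presence_flag(value):
    # Mirrors when(col.isNull(), 0).otherwise(1)
    return 0 if value is None else 1


def null_to_zero(value):
    # Mirrors when(col.isNull(), 0).otherwise(col)
    return 0 if value is None else value


# Hypothetical record after the casts; None stands in for a Spark null
record = {"location": None, "status": "hello", "verified": None, "default_profile": 1}

encoded = {
    "location": presence_flag(record["location"]),
    "status": presence_flag(record["status"]),
    "verified": null_to_zero(record["verified"]),
    "default_profile": null_to_zero(record["default_profile"]),
}
print(encoded)
```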


In [None]:
# Fourth Transformation
df_test = df_test.dropna(subset=['description'])

def split_and_set(some_str):
    if isinstance(some_str, str):
        some_str = ''.join(c for c in some_str if c not in "[](){}<>,'/.")
        return list(set(some_str.split(' ')))
    return some_str

list_udf = udf(lambda y: split_and_set(y), ArrayType(StringType()))
df_test = df_test.withColumn('description', list_udf(df_test['description']))
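
To see what the UDF produces for each row, the helper can be run as plain Python without Spark. The sketch below repeats `split_and_set` from the cell above so that it runs on its own; the sample description is made up for illustration.

```python
def split_and_set(some_str):
    # Same helper as in the cell above, repeated so the sketch is self-contained
    if isinstance(some_str, str):
        some_str = ''.join(c for c in some_str if c not in "[](){}<>,'/.")
        return list(set(some_str.split(' ')))
    return some_str


# Hypothetical description text, not taken from the data set
sample = "Bot, (news) bot.  Daily news"
tokens = split_and_set(sample)

# Punctuation is stripped and repeated words collapse into one entry.
# The order is not guaranteed because the words pass through a set,
# so the result is sorted here only for display.
print(sorted(tokens))

# Non-string input (for example a null description) is returned unchanged
print(split_and_set(None))
```

Note that the split is case-sensitive (`Bot` and `bot` stay separate) and that consecutive spaces leave an empty-string token in the result.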


In [None]:

# Fifth Transformation - fill NA:
df_test = df_test.fillna({'followers_count':0,'statuses_count':0,'favourites_count':0,'listed_count':0,'friends_count':0,})


Save to parquet:

Code sample:
```python
df_test.write.parquet("test_data")
```

The `test_data` dataset that you saved contains no information about bots at all.

We can use it to compare various algorithms and see how they behave.
However, since our model is trained with supervised learning, we would like to test it on labeled data.
This will help us evaluate our model.

Hence, you will split the training data into a training set and a test set.


In [None]:
# Load the train data:
df_train = spark.read.parquet("final_train_data")

Split the training data into training and test sets, holding out 30% for testing.

Use the `randomSplit` function:

```python
(trainingData, testData) = some_data.randomSplit((0.7, 0.3))
```

<details><summary>Hint</summary>
<p>

Use the `randomSplit` function:

```python
(trainingData, testData) = data.randomSplit((0.7, 0.3))
```
</p>
</details>

Remember to validate your split with `count()`.

In [None]:
# your code goes here
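
When you check the split with `count()`, expect the sizes to be close to 70% and 30% rather than exact, since `randomSplit` assigns rows randomly according to the weights. The plain-Python sketch below imitates that idea with a hypothetical `simulated_random_split` helper. It is not Spark's implementation, only an illustration of why the counts vary slightly while always adding up to the original total.

```python
import random


def simulated_random_split(rows, weights, seed=None):
    # Hypothetical helper: assign each row to one split at random,
    # with probability proportional to that split's weight
    total = sum(weights)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        r = rng.random() * total
        upper = 0.0
        for i, w in enumerate(weights):
            upper += w
            if r < upper:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding at the top of the range
            splits[-1].append(row)
    return splits


rows = list(range(10_000))
training, test = simulated_random_split(rows, (0.7, 0.3), seed=42)

# The two parts always add up to the original row count ...
print(len(training) + len(test) == len(rows))
# ... but the individual sizes are only approximately 70% / 30%
print(len(training) / len(rows), len(test) / len(rows))
```

In the notebook itself, the equivalent check would compare `trainingData.count() + testData.count()` with `df_train.count()`. Passing a `seed` argument to `randomSplit` should make the split repeatable for the same input data.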

Save the split data for the next chapter.

In [None]:
testData.write.mode('overwrite').parquet("classified_test_data")

In [None]:
trainingData.write.mode('overwrite').parquet("classified_train_data")

# Well Done! 👏👏👏


## You just finished: Apache Spark ML - Create train and test set


## Next exercise: Apache Spark ML and create machine learning models