# 2 - Apache Spark ML - Create train and test set

In this chapter, you will:

• Create a test and train set

• Learn more Spark functionality and how to use it

In [1]:
from pyspark.sql import SparkSession 

spark = SparkSession.builder \
    .master('local[*]') \
    .appName("ApacheSparkML") \
    .getOrCreate()

After filtering and transforming the training DataFrame, we need to make sure the test set has the same structure.

Load the test data from the CSV file:

In [2]:
df_test = spark.read.csv("testing_bot_data.csv", header=True)

In [3]:
df_test.schema

StructType(List(StructField(id,StringType,true),StructField(id_str,StringType,true),StructField(screen_name,StringType,true),StructField(location,StringType,true),StructField(description,StringType,true),StructField(url,StringType,true),StructField(followers_count,StringType,true),StructField(friends_count,StringType,true),StructField(listed_count,StringType,true),StructField(created_at,StringType,true),StructField(favourites_count,StringType,true),StructField(verified,StringType,true),StructField(statuses_count,StringType,true),StructField(lang,StringType,true),StructField(status,StringType,true),StructField(default_profile,StringType,true),StructField(default_profile_image,StringType,true),StructField(has_extended_profile,StringType,true),StructField(name,StringType,true)))

In [4]:
df_test.count()

578

In [4]:
df_test.limit(25).toPandas()

Unnamed: 0,id,id_str,screen_name,location,description,url,followers_count,friends_count,listed_count,created_at,favourites_count,verified,statuses_count,lang,status,default_profile,default_profile_image,has_extended_profile,name
0,2281292622.0,2281292622.0,__keating,brooklyn,lgbt editor at @buzzfeed. shannon.keating@buzz...,https://t.co/QneJmYRyhj,4466,1295,111.0,Tue Jan 07 23:26:52 +0000 2014,1579.0,True,3036,en,"""{'created_at': 'Tue Apr 11 15:31:51 +0000 201...",'truncated': False,'entities': {'hashtags': [],'symbols': [],'user_mentions': [{'screen_name': 'Carrasquillo'
1,2344040251.0,2344040251.0,_callme_Dani,"Los Angeles, CA",News Curation Editor @BuzzFeedNews I do a lot ...,,295,1016,10.0,Fri Feb 14 19:45:56 +0000 2014,300.0,False,618,en,"""{'created_at': 'Tue Apr 11 00:56:02 +0000 201...",'truncated': False,'entities': {'hashtags': [],'symbols': [],'user_mentions': [{'screen_name': 'elliesunak...
2,765871267.0,765871267.0,_little_britt_,,Family comes first! Also I am in love with piz...,https://t.co/E7DE1cJB7e,1001678,3017,14.0,8/18/2012 15:13,13040.0,True,3329,en,"""{'place': None, 'retweeted': False, 'favorite...",'created_at': 'Sat Apr 08 19:18:41 +0000 2017','id': 850790016838238210,'lang': 'en',"'retweet_count': 2037}"""
3,4772373433.0,4772373433.0,134k5,,@BuzzFeedJapan ��� @cnet_japan / DM��܋�㋁_��_��...,https://t.co/Cbguzs2PjT,445,487,17.0,Sun Jan 17 07:11:45 +0000 2016,1112.0,False,46,ja,"""{'created_at': 'Sat Apr 08 08:41:08 +0000 201...",'in_reply_to_status_id': 850628293522894849,'in_reply_to_status_id_str': '850628293522894...,'in_reply_to_user_id': 2249898907,'in_reply_to_user_id_str': '2249898907'
4,1324548560.0,,2181chrom_bot,自分の天幕,これはFE覚醒のクロム…つまり俺がツイ廃なbotらしい。よく分からんがネタ要素しかないそうだ...,http://t.co/10Swf6luED,187,68,13.0,Wed Apr 03 13:00:42 +0000 2013,,,690359,ja,"""{u'lang': u'ja', u'text': u'@2181lucina_bot \...",u'in_reply_to_status_id': 851191070100606976,u'in_reply_to_screen_name': u'2181lucina_bot',u'id_str': u'851191507486986241',u'urls': []
5,2561341789.0,,2LA1R_bot,,ふれあ語をつぶやくbotです たまに中の人(ふれあ)もつぶやきます,,80,87,,Wed Jun 11 13:12:06 +0000 2014,,,20167,ja,"""{u'lang': u'ja', u'text': u'\u3010\u3075\u308...",u'id_str': u'851191514206097409',u'urls': [],"u'id': 851191514206097409}""",TRUE
6,347810134.0,,3pei_bot,三河屋,■ちわーす！三河屋でーす！三郎くんには負けませんｗｗｗ■サザエさんに過去登場、三河屋さんへ勤...,http://twpf.jp/3pei_bot,2020,1978,56.0,Wed Aug 03 11:52:59 +0000 2011,,,968182,ja,"""{u'lang': u'ja', u'text': u'@kazenoraby \uff6...",u'in_reply_to_status_id': 851191255841124352,u'in_reply_to_screen_name': u'kazenoraby',u'id_str': u'851191511765176320',u'urls': []
7,856303860.0,,94kichi_bot,,94 チャック・ウィルソンと愉快な仲間たちの笑いあり涙ありなちょこっとキチガイツイートを集め...,,70,80,2.0,Mon Oct 01 12:39:46 +0000 2012,,,76735,ja,"""{u'lang': u'ja', u'text': u'\u307f\u3093\u306...",u'id_str': u'851191524062670848',u'urls': [],"u'id': 851191524062670848}""",TRUE
8,8.32875e+17,,A3_Dekasegi_bot,ビロード駅前,シトロン「A3!出稼ぎ日誌ダヨー！みんなの出稼ぎ中のあんなことやこんなことをまとめた日誌ネ！...,https://t.co/t171JmIrjL,181,144,2.0,Sat Feb 18 08:50:03 +0000 2017,,,1960,ja,"""{u'lang': u'ja', u'text': u'\u30b7\u30c8\u30e...",u'id_str': u'851191498854998016',u'urls': [],"u'id': 851191498854998016}""",TRUE
9,88856792.0,,aamir_khan,Mumbai,Actor.,https://t.co/l1dUhQjS8Y,20419393,9,6.0,Tue Nov 10 05:08:56 +0000 2009,,True,468,en,"""{u'lang': u'en', u'text': u'Hey guys, doing s...",u'id_str': u'849903030598344704',u'urls': [],u'media': [{u'expanded_url': u'https://twitte...,u'display_url': u'pic.twitter.com/uYmd8FKOVH'


Clean and prepare the test data as well.
Remember that the test data has no bot label.

Do not drop `id`; we will use it to compare results later.

Execute the following commands:


In [5]:
from pyspark.sql.types import IntegerType, ArrayType, BooleanType, StringType
from pyspark.sql.functions import udf
from pyspark.sql.functions import when


# Drop irrelevant columns and duplicates.
# 'id' is kept on purpose so that results can be compared later.
df_test = df_test.drop('default_profile_image','has_extended_profile','url','created_at','lang','id_str')
df_test = df_test.dropDuplicates()


In [6]:

# First Transformation
df_test = df_test.withColumn("friends_count", df_test["friends_count"].cast(IntegerType()))
df_test = df_test.withColumn("listed_count", df_test["listed_count"].cast(IntegerType()))
df_test = df_test.withColumn("favourites_count", df_test["favourites_count"].cast(IntegerType()))
df_test = df_test.withColumn("statuses_count", df_test["statuses_count"].cast(IntegerType()))
df_test = df_test.withColumn("verified", df_test["verified"].cast(BooleanType()))
df_test = df_test.withColumn("default_profile", df_test["default_profile"].cast(BooleanType()))


In [7]:

# Second Transformation
df_test = df_test.withColumn('default_profile',df_test['default_profile'].cast(IntegerType()))
df_test = df_test.withColumn('name',when(df_test['name'].isNull(),0).otherwise(1))
df_test = df_test.withColumn('verified',df_test['verified'].cast(IntegerType()))


In [8]:

# Third Transformation
df_test = df_test.withColumn('verified',when(df_test['verified'].isNull(),0).otherwise(df_test['verified']))
df_test = df_test.withColumn('default_profile',when(df_test['default_profile'].isNull(),0).otherwise(df_test['default_profile']))
df_test = df_test.withColumn('location',when(df_test['location'].isNull(),0).otherwise(1))
df_test = df_test.withColumn('status',when(df_test['status'].isNull(),0).otherwise(1))
df_test = df_test.withColumn('screen_name',when(df_test['screen_name'].isNull(),0).otherwise(1))


In [9]:
# Fourth Transformation
df_test = df_test.dropna(subset=['description'])

def split_and_set(some_str):
    if isinstance(some_str, str):
        some_str = ''.join(c for c in some_str if c not in "[](){}<>,'/.")
        return list(set(some_str.split(' ')))
    return some_str

# Wrap the plain Python function as a UDF that returns an array of strings
list_udf = udf(split_and_set, ArrayType(StringType()))
df_test = df_test.withColumn('description', list_udf(df_test['description']))
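
To see what `split_and_set` produces before it is wrapped in a UDF, it can be run as plain Python. The sketch below repeats the function from the cell above and applies it to made-up strings (the sample inputs are assumptions, not rows from the dataset).

```python
def split_and_set(some_str):
    """Strip selected punctuation, then return the unique space-separated tokens."""
    if isinstance(some_str, str):
        some_str = ''.join(c for c in some_str if c not in "[](){}<>,'/.")
        return list(set(some_str.split(' ')))
    return some_str

# Made-up description, used only for illustration
sample = "Actor. Actor, (and) producer"
tokens = split_and_set(sample)

# A set has no guaranteed order, so sort before printing
print(sorted(tokens))

# Non-string input (for example None) is returned unchanged
print(split_and_set(None))

# Consecutive spaces produce an empty-string token
print('' in split_and_set("a  b"))
```

Two things are worth noticing: the order of the returned tokens is not guaranteed because they pass through a `set`, and splitting on a single space can leave empty strings in the result when the text contains repeated spaces.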


In [10]:

# Fifth Transformation - fill NA:
df_test = df_test.fillna({'followers_count': 0, 'statuses_count': 0, 'favourites_count': 0, 'listed_count': 0, 'friends_count': 0})


Save to parquet:

Code sample:
```python
df_test.write.parquet("test_data")
```

In [11]:
df_test.write.parquet("test_data")

The `test_data` dataset you just saved contains no bot labels at all.

We can still use it to compare how different algorithms behave.
However, because our training is supervised, we also want to test the model on labeled data.
This lets us evaluate the model's performance.

Hence, you will split the labeled training data into a training set and a test set.

In [12]:
# Load the train data:
df_train = spark.read.parquet("final_train_data")

Split the training data into training and test sets, holding out 30% for testing.

Use the `randomSplit` function:

```python
(trainingData, testData) = some_data.randomSplit([0.7, 0.3])
```

<details><summary>Hint</summary>
<p>

Use the `randomSplit` function:
    
```python
(trainingData, testData) = data.randomSplit([0.7, 0.3])

```  
</p>
</details>

Remember to validate the split with `count()`.

In [13]:
(trainingData, testData) = df_train.randomSplit([0.7, 0.3])

In [14]:
trainingData.count()

1714

In [15]:
testData.count()

719
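
The two counts add up to 2,433 rows, but 719 is not exactly 30% of them. That is expected: `randomSplit` assigns rows at random according to the weights instead of cutting the data at a fixed position, so the split sizes are only approximate. The pure-Python sketch below illustrates the idea; `random_split_sketch` is a hypothetical helper written for this explanation and is not Spark's actual implementation.

```python
import random

def random_split_sketch(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative, normalized upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(2433))  # same number of rows as 1714 + 719
train_rows, test_rows = random_split_sketch(rows, [0.7, 0.3], seed=42)
print(len(train_rows), len(test_rows))
```

Running the sketch with a different seed gives slightly different sizes, while the same seed reproduces the same split. `randomSplit` also accepts an optional `seed` argument, which is useful when the split should be repeatable across runs.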

Save the split data for the next chapter.

In [16]:
testData.write.mode('overwrite').parquet("classified_test_data")

In [17]:
trainingData.write.mode('overwrite').parquet("classified_train_data")

# Well Done! 👏👏👏


## You just finished:  Apache Spark ML - Create train and test set 


## Next exercise: Apache Spark ML and create machine learning models