# Fictional Army - Filtering and Sorting

### Introduction:

This exercise was inspired by this [page](http://chrisalbon.com/python/).

Special thanks to: https://github.com/chrisalbon for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("My Application").getOrCreate()

from pyspark.sql.types import *
from pyspark.sql.functions import *



### Step 2. This is the data given as a dictionary

In [4]:
# Create an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
            'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
            'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
            'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
            'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
            'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
            'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa', 'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisana', 'Georgia']}
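
The dictionary is column-oriented (one list per column), while `spark.createDataFrame` expects one tuple per row. A minimal sketch of the conversion follows, using a shortened excerpt of `raw_data` so the block stays self-contained; the names `columns` and `rows` are illustrative and not part of the original notebook.

```python
# Excerpt of raw_data (first three rows, three of the ten columns) to keep the sketch short.
raw_data = {
    "regiment": ["Nighthawks", "Nighthawks", "Nighthawks"],
    "company": ["1st", "1st", "2nd"],
    "deaths": [523, 52, 25],
}

# Dicts keep insertion order (Python 3.7+), so the keys give the column order.
columns = list(raw_data.keys())

# zip(*...) transposes the per-column lists into one tuple per row.
rows = list(zip(*raw_data.values()))

print(columns)
print(rows)
```

With a live session, `spark.createDataFrame(rows, columns)` would build a DataFrame from these tuples and infer the column types; passing an explicit schema instead keeps the types under your control.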

### Step 3. Create a dataframe and assign it to a variable called army. 

#### Don't forget to include the column names in the order presented in the dictionary ('regiment', 'company', 'deaths'...) so that the column order is consistent with the solutions. The original pandas exercise warned that omitting them could sort the columns alphabetically; in PySpark, the column order comes from the schema passed to `createDataFrame`.

In [3]:
schema = StructType([
    StructField("regiment", StringType(), True),
    StructField("company", StringType(), True),
    StructField("deaths", IntegerType(), True),
    StructField("battles", IntegerType(), True),
    StructField("size", IntegerType(), True),
    StructField("veterans", IntegerType(), True),
    StructField("readiness", IntegerType(), True),
    StructField("armored", IntegerType(), True),
    StructField("deserters", IntegerType(), True),
    StructField("origin", StringType(), True)
])

data = [
    ('Nighthawks', '1st', 523, 5, 1045, 1, 1, 1, 4, 'Arizona'),
    ('Nighthawks', '1st', 52, 42, 957, 5, 2, 0, 24, 'California'),
    ('Nighthawks', '2nd', 25, 2, 1099, 62, 3, 1, 31, 'Texas'),
    ('Nighthawks', '2nd', 616, 2, 1400, 26, 3, 1, 2, 'Florida'),
    ('Dragoons', '1st', 43, 4, 1592, 73, 2, 0, 3, 'Maine'),
    ('Dragoons', '1st', 234, 7, 1006, 37, 1, 1, 4, 'Iowa'),
    ('Dragoons', '2nd', 523, 8, 987, 949, 2, 0, 24, 'Alaska'),
    ('Dragoons', '2nd', 62, 3, 849, 48, 3, 1, 31, 'Washington'),
    ('Scouts', '1st', 62, 4, 973, 48, 2, 0, 2, 'Oregon'),
    ('Scouts', '1st', 73, 7, 1005, 435, 1, 0, 3, 'Wyoming'),
    ('Scouts', '2nd', 37, 8, 1099, 63, 2, 1, 2, 'Louisana'),
    ('Scouts', '2nd', 35, 9, 1523, 345, 3, 1, 3, 'Georgia')
]

army_df = spark.createDataFrame(data, schema=schema)

army_df.show(truncate=False)


                                                                                

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|regiment  |company|deaths|battles|size|veterans|readiness|armored|deserters|origin    |
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|1st    |523   |5      |1045|1       |1        |1      |4        |Arizona   |
|Nighthawks|1st    |52    |42     |957 |5       |2        |0      |24       |California|
|Nighthawks|2nd    |25    |2      |1099|62      |3        |1      |31       |Texas     |
|Nighthawks|2nd    |616   |2      |1400|26      |3        |1      |2        |Florida   |
|Dragoons  |1st    |43    |4      |1592|73      |2        |0      |3        |Maine     |
|Dragoons  |1st    |234   |7      |1006|37      |1        |1      |4        |Iowa      |
|Dragoons  |2nd    |523   |8      |987 |949     |2        |0      |24       |Alaska    |
|Dragoons  |2nd    |62    |3      |849 |48      |3        |1      |31       |Washington|
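|Scouts    |1st    |62    |4      |973 |48      |2        |0      |2        |Oregon    |
|Scouts    |1st    |73    |7      |1005|435     |1        |0      |3        |Wyoming   |
|Scouts    |2nd    |37    |8      |1099|63      |2        |1      |2        |Louisana  |
|Scouts    |2nd    |35    |9      |1523|345     |3        |1      |3        |Georgia   |
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+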

### Step 4. Set the 'origin' column as the index of the dataframe

In [6]:
army_df.select("origin", "*").show()

+----------+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|    origin|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|   Arizona|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|California|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|     Texas|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|   Florida|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|     Maine|  Dragoons|    1st|    43|      4|1592|      73|        2|      0|        3|     Maine|
|      Iowa|  Dragoons|    1st|   234|      7|1006|      37|        1|      1|        4|      Iowa|
|    Alaska|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|    Alaska|


### Step 5. Print only the column veterans

In [7]:
army_df.select("veterans").show()

+--------+
|veterans|
+--------+
|       1|
|       5|
|      62|
|      26|
|      73|
|      37|
|     949|
|      48|
|      48|
|     435|
|      63|
|     345|
+--------+



### Step 6. Print the columns 'veterans' and 'deaths'

In [8]:
army_df.select("veterans", "deaths").show()

+--------+------+
|veterans|deaths|
+--------+------+
|       1|   523|
|       5|    52|
|      62|    25|
|      26|   616|
|      73|    43|
|      37|   234|
|     949|   523|
|      48|    62|
|      48|    62|
|     435|    73|
|      63|    37|
|     345|    35|
+--------+------+



### Step 7. Print the name of all the columns.

In [9]:
army_df.columns

['regiment',
 'company',
 'deaths',
 'battles',
 'size',
 'veterans',
 'readiness',
 'armored',
 'deserters',
 'origin']

### Step 8. Select the 'deaths', 'size' and 'deserters' columns from Maine and Alaska

In [11]:
army_df.filter(col("origin").isin("Alaska", "Maine"))\
    .select(col("deaths"), col("size"), col("deserters")).show()

+------+----+---------+
|deaths|size|deserters|
+------+----+---------+
|    43|1592|        3|
|   523| 987|       24|
+------+----+---------+



### Step 9. Select the rows 3 to 7 and the columns 3 to 6

In [13]:
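# zipWithIndex pairs each Row with its 0-based position: (Row, index).
# Rows 3 to 7 (1-based) are therefore indices 2 to 6.
indexed_df = army_df.rdd.zipWithIndex() \
    .filter(lambda row: 2 <= row[1] <= 6) \
    .map(lambda row: row[0]) \
    .toDF(army_df.columns)

# Columns 3 to 6 (1-based) are positions 2 to 5 in the 0-based column list.
indexed_df.select(army_df.columns[2:6]).show()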



+------+-------+----+--------+
|deaths|battles|size|veterans|
+------+-------+----+--------+
|    25|      2|1099|      62|
|   616|      2|1400|      26|
|    43|      4|1592|      73|
|   234|      7|1006|      37|
|   523|      8| 987|     949|
+------+-------+----+--------+



### Step 10. Select every row after the fourth row and all columns

In [14]:
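from pyspark.sql.window import Window

# Ordering by a constant gives every row the same sort key, so the numbering
# follows the order in which Spark reads the rows; that order is not guaranteed.
window = Window.orderBy(lit(1))
army_indexed = army_df.withColumn("row_num", row_number().over(window))
# Keep rows 5 onward, then drop the helper column.
army_fourth = army_indexed.filter(col("row_num") > 4).drop("row_num")

army_fourth.show()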


25/05/08 11:22:20 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+--------+-------+------+-------+----+--------+---------+-------+---------+----------+
|regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+--------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Dragoons|    1st|    43|      4|1592|      73|        2|      0|        3|     Maine|
|Dragoons|    1st|   234|      7|1006|      37|        1|      1|        4|      Iowa|
|Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|    Alaska|
|Dragoons|    2nd|    62|      3| 849|      48|        3|      1|       31|Washington|
|  Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|  Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
|  Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|  Louisana|
|  Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3|   Georgia|
+--------+-------+------+-------+----+--------+---------+-------+---------+----------+

### Step 11. Select every row up to the 4th row and all columns

In [15]:
army_first_4 = army_indexed.filter(col("row_num") <= 4).drop("row_num")

army_first_4.show()


25/05/08 11:25:34 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+



                                                                                

### Step 12. Select the 3rd column up to the 7th column

In [20]:
army_df.select(army_df.columns[2:7]).show()

+------+-------+----+--------+---------+
|deaths|battles|size|veterans|readiness|
+------+-------+----+--------+---------+
|   523|      5|1045|       1|        1|
|    52|     42| 957|       5|        2|
|    25|      2|1099|      62|        3|
|   616|      2|1400|      26|        3|
|    43|      4|1592|      73|        2|
|   234|      7|1006|      37|        1|
|   523|      8| 987|     949|        2|
|    62|      3| 849|      48|        3|
|    62|      4| 973|      48|        2|
|    73|      7|1005|     435|        1|
|    37|      8|1099|      63|        2|
|    35|      9|1523|     345|        3|
+------+-------+----+--------+---------+
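
The exercise counts columns from 1, while Python list slices count from 0 and exclude the end position, which is why "3rd to 7th" becomes `[2:7]`. A small helper makes that translation explicit; `columns_between` is a hypothetical name, not a PySpark function.

```python
def columns_between(columns, first, last):
    """Return the columns from 1-based position `first` through `last`, inclusive."""
    # Slices are 0-based and exclude the end, so only the start needs shifting.
    return columns[first - 1:last]


army_columns = ["regiment", "company", "deaths", "battles", "size",
                "veterans", "readiness", "armored", "deserters", "origin"]

selected = columns_between(army_columns, 3, 7)
print(selected)
```

The returned list can be passed straight to `select`, for example `army_df.select(columns_between(army_df.columns, 3, 7))`.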



### Step 13. Select rows where df.deaths is greater than 50

In [21]:
army_df.filter(col("deaths") > 50).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|  Dragoons|    1st|   234|      7|1006|      37|        1|      1|        4|      Iowa|
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|    Alaska|
|  Dragoons|    2nd|    62|      3| 849|      48|        3|      1|       31|Washington|
|    Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|    Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+

### Step 14. Select rows where df.deaths is greater than 500 or less than 50

In [23]:
army_df.filter((col("deaths") > 500) | (col("deaths") < 50)).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|  origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4| Arizona|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|   Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2| Florida|
|  Dragoons|    1st|    43|      4|1592|      73|        2|      0|        3|   Maine|
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|  Alaska|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|Louisana|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3| Georgia|
+----------+-------+------+-------+----+--------+---------+-------+---------+--------+



### Step 15. Select all the regiments not named "Dragoons"

In [24]:
army_df.filter(col("regiment") != "Dragoons").show()

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|    Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|    Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|  Louisana|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3|   Georgia|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+

### Step 16. Select the rows called Texas and Arizona

In [26]:
army_df.filter(col("origin").isin("Texas", "Arizona")).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+-------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters| origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+-------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|Arizona|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|  Texas|
+----------+-------+------+-------+----+--------+---------+-------+---------+-------+



### Step 17. Select the third cell in the row named Arizona

In [None]:
army_df.filter(col("origin") == "Arizona").select(army_df.columns[2]).show()

+------+
|deaths|
+------+
|   523|
+------+



### Step 18. Select the third cell down in the column named deaths

In [36]:
army_df.select("deaths").collect()[2][0]


25