# Fictional Army - Filtering and Sorting

### Introduction:

This exercise was inspired by this [page](http://chrisalbon.com/python/).

Special thanks to: https://github.com/chrisalbon for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [2]:
!pip install pyspark


In [3]:
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import StructType, StringType, StructField

### Step 2. This is the data given as a dictionary

In [4]:
# Create an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
            'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
            'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
            'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
            'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
            'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
            'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa', 'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisana', 'Georgia']}

### Step 3. Create a dataframe and assign it to a variable called df.

#### Don't forget to include the column names in the order presented in the dictionary ('regiment', 'company', 'deaths'...) so that the column order is consistent with the solutions. If omitted, pandas will order the columns alphabetically.

In [5]:
spark = SparkSession.builder.appName("exercise23").getOrCreate()

columns = list(raw_data.keys())
data = [[*vals] for vals in zip(*raw_data.values())]

df = spark.createDataFrame(data, columns)
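
The `zip(*raw_data.values())` call above turns the column-oriented dictionary into row-oriented records, which is the shape `spark.createDataFrame` receives here. As a minimal sketch in plain Python, using a shortened, hypothetical stand-in called `sample` rather than the full `raw_data`:

```python
# Hypothetical two-column stand-in for raw_data, shortened for illustration
sample = {
    "regiment": ["Nighthawks", "Dragoons", "Scouts"],
    "deaths": [523, 43, 62],
}

# Dictionaries preserve insertion order, so the keys give the column order
columns = list(sample.keys())

# zip(*...) groups the i-th element of every column list into one record per row
rows = [list(vals) for vals in zip(*sample.values())]

print(columns)  # ['regiment', 'deaths']
print(rows)     # [['Nighthawks', 523], ['Dragoons', 43], ['Scouts', 62]]
```

Each inner list is one row, with values in the same order as `columns`.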

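### Step 4. Set the 'origin' column as the index of the dataframe

Spark DataFrames have no index, so this step has no direct equivalent; the steps below filter on the `origin` column instead.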

### Step 5. Print only the column veterans

In [6]:
df.select("veterans").show()

+--------+
|veterans|
+--------+
|       1|
|       5|
|      62|
|      26|
|      73|
|      37|
|     949|
|      48|
|      48|
|     435|
|      63|
|     345|
+--------+



### Step 6. Print the columns 'veterans' and 'deaths'

In [7]:
df.select("veterans", "deaths").show()

+--------+------+
|veterans|deaths|
+--------+------+
|       1|   523|
|       5|    52|
|      62|    25|
|      26|   616|
|      73|    43|
|      37|   234|
|     949|   523|
|      48|    62|
|      48|    62|
|     435|    73|
|      63|    37|
|     345|    35|
+--------+------+



### Step 7. Print the name of all the columns.

In [8]:
df.columns

['regiment',
 'company',
 'deaths',
 'battles',
 'size',
 'veterans',
 'readiness',
 'armored',
 'deserters',
 'origin']

### Step 8. Select the 'deaths', 'size' and 'deserters' columns from Maine and Alaska

In [9]:
df.filter(f.col("origin").isin("Maine", "Alaska")).select("origin", "deaths", "size", "deserters").show()

+------+------+----+---------+
|origin|deaths|size|deserters|
+------+------+----+---------+
| Maine|    43|1592|        3|
|Alaska|   523| 987|       24|
+------+------+----+---------+



### Step 9. Select the rows 3 to 7 and the columns 3 to 6

In [10]:
for row in df.limit(7).tail(5):
    print(row[3:7])

(2, 1099, 62, 3)
(2, 1400, 26, 3)
(4, 1592, 73, 2)
(7, 1006, 37, 1)
(8, 987, 949, 2)
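
The cell above reads "rows 3 to 7" as 1-based row positions and "columns 3 to 6" as zero-based column indices. A rough way to check that reading is to mirror the two steps with plain Python lists. The names `positions`, `selected`, and `picked_columns` are illustrative only, and the list slices merely imitate what `limit` and `tail` do on this small, unshuffled DataFrame:

```python
# 1-based positions of the 12 rows in the DataFrame
positions = list(range(1, 13))

first_seven = positions[:7]   # plays the role of df.limit(7)
selected = first_seven[-5:]   # plays the role of .tail(5)
print(selected)               # [3, 4, 5, 6, 7]

columns = ['regiment', 'company', 'deaths', 'battles', 'size',
           'veterans', 'readiness', 'armored', 'deserters', 'origin']

# row[3:7] keeps zero-based indices 3, 4, 5 and 6
picked_columns = columns[3:7]
print(picked_columns)         # ['battles', 'size', 'veterans', 'readiness']
```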


### Step 10. Select every row after the fourth row and all columns

In [11]:
df.tail(df.count()-4)

[Row(regiment='Dragoons', company='1st', deaths=43, battles=4, size=1592, veterans=73, readiness=2, armored=0, deserters=3, origin='Maine'),
 Row(regiment='Dragoons', company='1st', deaths=234, battles=7, size=1006, veterans=37, readiness=1, armored=1, deserters=4, origin='Iowa'),
 Row(regiment='Dragoons', company='2nd', deaths=523, battles=8, size=987, veterans=949, readiness=2, armored=0, deserters=24, origin='Alaska'),
 Row(regiment='Dragoons', company='2nd', deaths=62, battles=3, size=849, veterans=48, readiness=3, armored=1, deserters=31, origin='Washington'),
 Row(regiment='Scouts', company='1st', deaths=62, battles=4, size=973, veterans=48, readiness=2, armored=0, deserters=2, origin='Oregon'),
 Row(regiment='Scouts', company='1st', deaths=73, battles=7, size=1005, veterans=435, readiness=1, armored=0, deserters=3, origin='Wyoming'),
 Row(regiment='Scouts', company='2nd', deaths=37, battles=8, size=1099, veterans=63, readiness=2, armored=1, deserters=2, origin='Louisana'),
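 Row(regiment='Scouts', company='2nd', deaths=35, battles=9, size=1523, veterans=345, readiness=3, armored=1, deserters=3, origin='Georgia')]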

### Step 11. Select every row up to the 4th row and all columns

In [12]:
df.limit(4).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+



### Step 12. Select the 3rd column up to the 7th column

In [13]:
df.select(*df.columns[3:8]).show()

+-------+----+--------+---------+-------+
|battles|size|veterans|readiness|armored|
+-------+----+--------+---------+-------+
|      5|1045|       1|        1|      1|
|     42| 957|       5|        2|      0|
|      2|1099|      62|        3|      1|
|      2|1400|      26|        3|      1|
|      4|1592|      73|        2|      0|
|      7|1006|      37|        1|      1|
|      8| 987|     949|        2|      0|
|      3| 849|      48|        3|      1|
|      4| 973|      48|        2|      0|
|      7|1005|     435|        1|      0|
|      8|1099|      63|        2|      1|
|      9|1523|     345|        3|      1|
+-------+----+--------+---------+-------+



### Step 13. Select rows where df.deaths is greater than 50

In [14]:
df.filter(df.deaths > 50).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|  Dragoons|    1st|   234|      7|1006|      37|        1|      1|        4|      Iowa|
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|    Alaska|
|  Dragoons|    2nd|    62|      3| 849|      48|        3|      1|       31|Washington|
|    Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|    Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+

### Step 14. Select rows where df.deaths is greater than 500 or less than 50

In [15]:
df.filter( (df.deaths > 500) | (df.deaths < 50) ).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|  origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4| Arizona|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|   Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2| Florida|
|  Dragoons|    1st|    43|      4|1592|      73|        2|      0|        3|   Maine|
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|  Alaska|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|Louisana|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3| Georgia|
+----------+-------+------+-------+----+--------+---------+-------+---------+--------+
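
The parentheses around each comparison in the filter above matter because `|` binds more tightly than `>` and `<` in Python. The sketch below uses plain integers rather than Spark columns to show which `deaths` values satisfy the condition and how the expression is parsed when the parentheses are dropped. It only illustrates operator precedence and is not how Spark evaluates the filter:

```python
deaths = [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35]

# Values that pass "greater than 500 or less than 50"
kept = [d for d in deaths if (d > 500) or (d < 50)]
print(kept)  # [523, 25, 616, 43, 523, 37, 35]

d = 25
with_parens = (d > 500) | (d < 50)

# Without parentheses this parses as the chained comparison d > (500 | d) < 50
without_parens = d > 500 | d < 50

print(with_parens, without_parens)  # True False
```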



### Step 15. Select all the regiments not named "Dragoons"

In [16]:
df.filter(df.regiment != "Dragoons").show()

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|    Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|    Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|  Louisana|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3|   Georgia|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+



### Step 16. Select the rows called Texas and Arizona

In [17]:
df.filter(df.origin.isin("Texas","Arizona")).show()

+----------+-------+------+-------+----+--------+---------+-------+---------+-------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters| origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+-------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|Arizona|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|  Texas|
+----------+-------+------+-------+----+--------+---------+-------+---------+-------+



### Step 17. Select the third cell in the row named Arizona

In [18]:
df.filter(df.origin == "Arizona").select(df.columns[2]).show()

+------+
|deaths|
+------+
|   523|
+------+



### Step 18. Select the third cell down in the column named deaths

In [20]:
df.limit(3).tail(1)[0][2]

25