# Fictional Army - Filtering and Sorting

### Introduction:

This exercise was inspired by this [page](http://chrisalbon.com/python/).

Special thanks to https://github.com/chrisalbon for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [None]:
import pandas as pd

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 2.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=208d4187c1419b4b11fbdaccc642fed5100f3b87469eb46e9f0f29b378a0650d
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
from pyspark.sql import SparkSession

In [11]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

In [36]:
from pyspark.sql.functions import col, monotonically_increasing_id

In [3]:
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Step 2. This is the data given as a dictionary

In [4]:
# Create an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
            'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
            'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
            'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
            'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
            'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
            'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa', 'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisana', 'Georgia']}

### Step 3. Create a dataframe and assign it to a variable called army.

#### Don't forget to include the column names in the order presented in the dictionary ('regiment', 'company', 'deaths'...) so that the column order is consistent with the solutions. If omitted, pandas will order the columns alphabetically.

In [10]:
list(raw_data.keys())

['regiment',
 'company',
 'deaths',
 'battles',
 'size',
 'veterans',
 'readiness',
 'armored',
 'deserters',
 'origin']

In [12]:
# (column name, Spark type) pairs in the same order as raw_data
fields = [('regiment', StringType()),
          ('company', StringType()),
          ('deaths', IntegerType()),
          ('battles', IntegerType()),
          ('size', IntegerType()),
          ('veterans', IntegerType()),
          ('readiness', IntegerType()),
          ('armored', IntegerType()),
          ('deserters', IntegerType()),
          ('origin', StringType())]
schema = StructType([StructField(name, dtype, True) for name, dtype in fields])


In [27]:
# Transpose the column lists into row tuples and apply the explicit schema built above
army = spark.createDataFrame(zip(*raw_data.values()), schema=schema)

### Step 4. Set the 'origin' column as the index of the dataframe

In [22]:
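# Spark DataFrames have no row index, so this sorts by 'origin' instead.
# The sorted result is only displayed; army keeps its original row order for the later steps.
army.orderBy("origin").show()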

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|    Alaska|
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3|   Georgia|
|  Dragoons|    1st|   234|      7|1006|      37|        1|      1|        4|      Iowa|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2|  Louisana|
|  Dragoons|    1st|    43|      4|1592|      73|        2|      0|        3|     Maine|
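|    Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|  Dragoons|    2nd|    62|      3| 849|      48|        3|      1|       31|Washington|
|    Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+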

### Step 5. Print only the column veterans

In [28]:
army.select(col('origin'), col('veterans')).show()


+----------+--------+
|    origin|veterans|
+----------+--------+
|   Arizona|       1|
|California|       5|
|     Texas|      62|
|   Florida|      26|
|     Maine|      73|
|      Iowa|      37|
|    Alaska|     949|
|Washington|      48|
|    Oregon|      48|
|   Wyoming|     435|
|  Louisana|      63|
|   Georgia|     345|
+----------+--------+



### Step 6. Print the columns 'veterans' and 'deaths'

In [29]:
army.select(col('origin'), col('veterans'), col('deaths')).show()

+----------+--------+------+
|    origin|veterans|deaths|
+----------+--------+------+
|   Arizona|       1|   523|
|California|       5|    52|
|     Texas|      62|    25|
|   Florida|      26|   616|
|     Maine|      73|    43|
|      Iowa|      37|   234|
|    Alaska|     949|   523|
|Washington|      48|    62|
|    Oregon|      48|    62|
|   Wyoming|     435|    73|
|  Louisana|      63|    37|
|   Georgia|     345|    35|
+----------+--------+------+



### Step 7. Print the name of all the columns.

In [30]:
army.columns

['regiment',
 'company',
 'deaths',
 'battles',
 'size',
 'veterans',
 'readiness',
 'armored',
 'deserters',
 'origin']

### Step 8. Select the 'deaths', 'size' and 'deserters' columns from Maine and Alaska

In [33]:
army.select(col('origin'), col('deaths'), col('size'), col('deserters')).filter(col('origin').isin('Maine', 'Alaska')).show()

+------+------+----+---------+
|origin|deaths|size|deserters|
+------+------+----+---------+
| Maine|    43|1592|        3|
|Alaska|   523| 987|       24|
+------+------+----+---------+



### Step 9. Select the rows 3 to 7 and the columns 3 to 6

In [37]:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

# monotonically_increasing_id() is not consecutive across partitions, so rank it into a dense 0-based position
army = army.withColumn("id", monotonically_increasing_id())
army = army.withColumn("id", row_number().over(Window.orderBy("id")) - 1)

In [40]:
# Rows 3 to 7 by position (0-based, inclusive) and the columns at positions 3 to 6
army.filter(col("id").between(3, 7)).select(army.columns[3:7]).show()

+-------+----+--------+---------+
|battles|size|veterans|readiness|
+-------+----+--------+---------+
|      2|1400|      26|        3|
|      4|1592|      73|        2|
|      7|1006|      37|        1|
|      8| 987|     949|        2|
|      3| 849|      48|        3|
+-------+----+--------+---------+


### Step 10. Select every row after the fourth row and all columns

### Step 11. Select every row up to the 4th row and all columns

### Step 12. Select the 3rd column up to the 7th column

### Step 13. Select rows where df.deaths is greater than 50

### Step 14. Select rows where df.deaths is greater than 500 or less than 50

### Step 15. Select all the regiments not named "Dragoons"

### Step 16. Select the rows called Texas and Arizona

### Step 17. Select the third cell in the row named Arizona

### Step 18. Select the third cell down in the column named deaths