# Fictional Army - Filtering and Sorting

### Introduction:

This exercise was inspired by this [page](http://chrisalbon.com/python/).

Special thanks to: https://github.com/chrisalbon for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
import os, sys
os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"         # ① force the loopback address
os.environ["PYSPARK_PYTHON"] = sys.executable      # ② use the same interpreter for the workers
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

In [2]:
import pandas as pd
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder\
                    .appName('ArmyData')\
                    .getOrCreate()

### Step 2. This is the data given as a dictionary

In [4]:
# Create an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
            'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
            'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
            'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
            'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
            'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
            'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa', 'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia']}

### Step 3. Create a dataframe and assign it to a variable called army. 

#### Don't forget to include the column names in the order presented in the dictionary ('regiment', 'company', 'deaths'...) so that the column order is consistent with the solutions. If omitted, older versions of pandas order the columns alphabetically.

In [5]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType



# Define the schema
schema = StructType([
    StructField('regiment', StringType(), nullable=False),
    StructField('company', StringType(), nullable=False),
    StructField('deaths', IntegerType(), nullable=False),
    StructField('battles', IntegerType(), nullable=False),
    StructField('size', IntegerType(), nullable=False),
    StructField('veterans', IntegerType(), nullable=False),
    StructField('readiness', IntegerType(), nullable=False),
    StructField('armored', IntegerType(), nullable=False),
    StructField('deserters', IntegerType(), nullable=False),
    StructField('origin', StringType(), nullable=False)
])



# Convert the column-oriented dictionary into a list of row tuples
num_rows = len(raw_data['regiment'])
data = []
for i in range(num_rows):
    row = (
        raw_data['regiment'][i],
        raw_data['company'][i],
        raw_data['deaths'][i],
        raw_data['battles'][i],
        raw_data['size'][i],
        raw_data['veterans'][i],
        raw_data['readiness'][i],
        raw_data['armored'][i],
        raw_data['deserters'][i],
        raw_data['origin'][i]
    )
    data.append(row)

# Create the DataFrame
army = spark.createDataFrame(data, schema=schema)



In [6]:
army.printSchema()


root
 |-- regiment: string (nullable = false)
 |-- company: string (nullable = false)
 |-- deaths: integer (nullable = false)
 |-- battles: integer (nullable = false)
 |-- size: integer (nullable = false)
 |-- veterans: integer (nullable = false)
 |-- readiness: integer (nullable = false)
 |-- armored: integer (nullable = false)
 |-- deserters: integer (nullable = false)
 |-- origin: string (nullable = false)



In [7]:
army.show()

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|        3|      1|        2|   Florida|
|  Dragoons|    1st|    43|      4|1592|      73|        2|      0|        3|     Maine|
|  Dragoons|    1st|   234|      7|1006|      37|        1|      1|        4|      Iowa|
|  Dragoons|    2nd|   523|      8| 987|     949|        2|      0|       24|    Alaska|
|  Dragoons|    2nd|    62|      3| 849|      48|        3|      1|       31|Washington|
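|    Scouts|    1st|    62|      4| 973|      48|        2|      0|        2|    Oregon|
|    Scouts|    1st|    73|      7|1005|     435|        1|      0|        3|   Wyoming|
|    Scouts|    2nd|    37|      8|1099|      63|        2|      1|        2| Louisiana|
|    Scouts|    2nd|    35|      9|1523|     345|        3|      1|        3|   Georgia|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+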

### Step 4. Set the 'origin' column as the index of the dataframe

PySpark DataFrames have no index as such, so the DataFrame is converted to pandas first and the index is set there.

In [10]:
df = army.toPandas()


In [13]:
df

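| | regiment | company | deaths | battles | size | veterans | readiness | armored | deserters | origin |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Nighthawks | 1st | 523 | 5 | 1045 | 1 | 1 | 1 | 4 | Arizona |
| 1 | Nighthawks | 1st | 52 | 42 | 957 | 5 | 2 | 0 | 24 | California |
| 2 | Nighthawks | 2nd | 25 | 2 | 1099 | 62 | 3 | 1 | 31 | Texas |
| 3 | Nighthawks | 2nd | 616 | 2 | 1400 | 26 | 3 | 1 | 2 | Florida |
| 4 | Dragoons | 1st | 43 | 4 | 1592 | 73 | 2 | 0 | 3 | Maine |
| 5 | Dragoons | 1st | 234 | 7 | 1006 | 37 | 1 | 1 | 4 | Iowa |
| 6 | Dragoons | 2nd | 523 | 8 | 987 | 949 | 2 | 0 | 24 | Alaska |
| 7 | Dragoons | 2nd | 62 | 3 | 849 | 48 | 3 | 1 | 31 | Washington |
| 8 | Scouts | 1st | 62 | 4 | 973 | 48 | 2 | 0 | 2 | Oregon |
| 9 | Scouts | 1st | 73 | 7 | 1005 | 435 | 1 | 0 | 3 | Wyoming |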


In [14]:
df.set_index('origin', inplace=True)

In [15]:
df

| origin | regiment | company | deaths | battles | size | veterans | readiness | armored | deserters |
|---|---|---|---|---|---|---|---|---|---|
| Arizona | Nighthawks | 1st | 523 | 5 | 1045 | 1 | 1 | 1 | 4 |
| California | Nighthawks | 1st | 52 | 42 | 957 | 5 | 2 | 0 | 24 |
| Texas | Nighthawks | 2nd | 25 | 2 | 1099 | 62 | 3 | 1 | 31 |
| Florida | Nighthawks | 2nd | 616 | 2 | 1400 | 26 | 3 | 1 | 2 |
| Maine | Dragoons | 1st | 43 | 4 | 1592 | 73 | 2 | 0 | 3 |
| Iowa | Dragoons | 1st | 234 | 7 | 1006 | 37 | 1 | 1 | 4 |
| Alaska | Dragoons | 2nd | 523 | 8 | 987 | 949 | 2 | 0 | 24 |
| Washington | Dragoons | 2nd | 62 | 3 | 849 | 48 | 3 | 1 | 31 |
| Oregon | Scouts | 1st | 62 | 4 | 973 | 48 | 2 | 0 | 2 |
| Wyoming | Scouts | 1st | 73 | 7 | 1005 | 435 | 1 | 0 | 3 |


### Step 5. Print only the column veterans

In [None]:
army.select('veterans').show()

+--------+
|veterans|
+--------+
|       1|
|       5|
|      62|
|      26|
|      73|
|      37|
|     949|
|      48|
|      48|
|     435|
|      63|
|     345|
+--------+



In [22]:
df['veterans'].head(10)

origin
Arizona         1
California      5
Texas          62
Florida        26
Maine          73
Iowa           37
Alaska        949
Washington     48
Oregon         48
Wyoming       435
Name: veterans, dtype: int32

### Step 6. Print the columns 'veterans' and 'deaths'

In [23]:
army.select(['veterans', 'deaths']).show(10)

+--------+------+
|veterans|deaths|
+--------+------+
|       1|   523|
|       5|    52|
|      62|    25|
|      26|   616|
|      73|    43|
|      37|   234|
|     949|   523|
|      48|    62|
|      48|    62|
|     435|    73|
+--------+------+
only showing top 10 rows



### Step 7. Print the name of all the columns.

In [24]:
army.columns

['regiment',
 'company',
 'deaths',
 'battles',
 'size',
 'veterans',
 'readiness',
 'armored',
 'deserters',
 'origin']

In [25]:
df.columns

Index(['regiment', 'company', 'deaths', 'battles', 'size', 'veterans',
       'readiness', 'armored', 'deserters'],
      dtype='object')

### Step 8. Select the 'deaths', 'size' and 'deserters' columns from Maine and Alaska

In [27]:
from pyspark.sql.functions import col
army.filter(col('origin').isin('Maine', 'Alaska')).select(['deaths', 'size', 'deserters']).show()

+------+----+---------+
|deaths|size|deserters|
+------+----+---------+
|    43|1592|        3|
|   523| 987|       24|
+------+----+---------+



In [29]:
df.loc[df.index.isin(['Maine', 'Alaska']), ['deaths', 'size', 'deserters']]

| origin | deaths | size | deserters |
|---|---|---|---|
| Maine | 43 | 1592 | 3 |
| Alaska | 523 | 987 | 24 |


### Step 9. Select the rows 3 to 7 and the columns 3 to 6
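
A minimal pandas sketch, assuming "rows 3 to 7" and "columns 3 to 6" mean zero-based positions with the end excluded (the usual `iloc` slice convention) and that `origin` is already the index, as in Step 4. PySpark DataFrames have no positional index, so this works on the pandas side. The Step 2 data is rebuilt here only so the cell runs on its own.

```python
import pandas as pd

# Same data as Step 2, rebuilt so this cell is self-contained
raw_data = {
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
    'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
    'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
    'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
    'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
    'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}
df = pd.DataFrame(raw_data).set_index('origin')

# Row positions 3..6 and column positions 3..5 (the slice end is excluded)
subset = df.iloc[3:7, 3:6]
print(subset)
```

Under a one-based, inclusive reading of the task, the slice would instead be `df.iloc[2:7, 2:6]`.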

### Step 10. Select every row after the fourth row and all columns
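
One possible pandas reading, assuming "after the fourth row" means skipping the first four rows by position and that `origin` is the index as in Step 4. The Step 2 data is rebuilt so the cell stands alone.

```python
import pandas as pd

# Same data as Step 2, rebuilt so this cell is self-contained
raw_data = {
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
    'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
    'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
    'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
    'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
    'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}
df = pd.DataFrame(raw_data).set_index('origin')

# Skip the first four rows by position and keep every column
after_fourth = df.iloc[4:, :]
print(after_fourth)
```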

### Step 11. Select every row up to the 4th row and all columns
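
A hedged pandas sketch, assuming "up to the 4th row" includes the fourth row, so the first four rows by position are kept. As before, `origin` is taken to be the index (Step 4) and the Step 2 data is rebuilt so the cell runs on its own.

```python
import pandas as pd

# Same data as Step 2, rebuilt so this cell is self-contained
raw_data = {
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
    'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
    'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
    'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
    'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
    'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}
df = pd.DataFrame(raw_data).set_index('origin')

# Keep the first four rows by position and every column
first_four = df.iloc[:4, :]
print(first_four)
```

For this particular case `df.head(4)` returns the same rows.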

### Step 12. Select the 3rd column up to the 7th column
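
A sketch of one interpretation, assuming the ordinals are one-based and inclusive, and that `origin` has been moved to the index (Step 4) so it no longer counts as a column. Under those assumptions the 3rd through 7th columns are `deaths` through `readiness`. The Step 2 data is rebuilt so the cell is self-contained.

```python
import pandas as pd

# Same data as Step 2, rebuilt so this cell is self-contained
raw_data = {
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
    'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
    'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
    'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
    'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
    'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}
df = pd.DataFrame(raw_data).set_index('origin')

# Zero-based positions 2..6 correspond to the 3rd..7th columns
third_to_seventh = df.iloc[:, 2:7]
print(third_to_seventh)
```

On the PySpark side, where `origin` is still a regular column, a comparable selection by position would be `army.select(army.columns[2:7])`.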

### Step 13. Select rows where df.deaths is greater than 50
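
A minimal sketch with a boolean mask in pandas. To keep the cell short it uses a trimmed copy of the Step 2 data (only `regiment`, `deaths` and `origin`); that trimming is an assumption made for illustration, and the same mask works on the full `df`.

```python
import pandas as pd

# Trimmed copy of the Step 2 data (only the columns needed here)
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}).set_index('origin')

# Keep only the rows whose deaths value is strictly greater than 50
high_deaths = df[df['deaths'] > 50]
print(high_deaths)
```

The PySpark counterpart, using the `col` helper already imported in Step 8, would be `army.filter(col('deaths') > 50).show()`.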

### Step 14. Select rows where df.deaths is greater than 500 or less than 50
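
A short sketch combining two conditions with `|` in pandas. Each comparison needs its own parentheses because `|` binds more tightly than `>` and `<`. The cell uses a trimmed copy of the Step 2 data, which is an assumption made to keep it brief.

```python
import pandas as pd

# Trimmed copy of the Step 2 data (only the columns needed here)
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}).set_index('origin')

# Element-wise OR of the two conditions
extreme_deaths = df[(df['deaths'] > 500) | (df['deaths'] < 50)]
print(extreme_deaths)
```

In PySpark the same parenthesized pattern applies: `army.filter((col('deaths') > 500) | (col('deaths') < 50))`.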

### Step 15. Select all the regiments not named "Dragoons"
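
A minimal pandas sketch using an inequality mask, again on a trimmed copy of the Step 2 data (an assumption made only to keep the cell self-contained and short).

```python
import pandas as pd

# Trimmed copy of the Step 2 data (only the columns needed here)
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}).set_index('origin')

# Keep every row whose regiment is anything other than "Dragoons"
not_dragoons = df[df['regiment'] != 'Dragoons']
print(not_dragoons)
```

The PySpark version would be `army.filter(col('regiment') != 'Dragoons')`.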

### Step 16. Select the rows called Texas and Arizona
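
With `origin` as the index (Step 4), rows can be selected by label with `loc`. The sketch below uses a trimmed copy of the Step 2 data, which is an assumption made for brevity.

```python
import pandas as pd

# Trimmed copy of the Step 2 data (only the columns needed here)
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}).set_index('origin')

# A list of labels returns the rows in the order the labels are given
texas_arizona = df.loc[['Texas', 'Arizona']]
print(texas_arizona)
```

PySpark has no index, so the equivalent there filters on the column, as in Step 8: `army.filter(col('origin').isin('Texas', 'Arizona'))`.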

### Step 17. Select the third cell in the row named Arizona
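
A hedged sketch, assuming "third cell" means the third column once `origin` has become the index (Step 4), which makes it `deaths`. The full Step 2 data is rebuilt so that the column positions match.

```python
import pandas as pd

# Same data as Step 2, rebuilt so this cell is self-contained
raw_data = {
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'company': ['1st', '1st', '2nd', '2nd'] * 3,
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
    'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
    'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
    'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
    'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
    'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}
df = pd.DataFrame(raw_data).set_index('origin')

# Row chosen by label, column chosen by position (index 2 is the third column)
third_cell = df.loc['Arizona', df.columns[2]]
print(df.columns[2], third_cell)
```

If the result should stay a one-cell DataFrame rather than a scalar, `df.loc[['Arizona'], ['deaths']]` is an alternative.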

### Step 18. Select the third cell down in the column named deaths
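
A minimal sketch, assuming "third cell down" means the third row by position. Only the `deaths` and `origin` columns of the Step 2 data are rebuilt here, which is a simplification made for illustration.

```python
import pandas as pd

# Trimmed copy of the Step 2 data (only the columns needed here)
df = pd.DataFrame({
    'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa',
               'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisiana', 'Georgia'],
}).set_index('origin')

# Column chosen by name, then the third value by position (index 2)
third_down = df['deaths'].iloc[2]
print(third_down)
```

The third row is Texas, so `df.loc['Texas', 'deaths']` reaches the same cell by label.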

In [8]:
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType)

# 1. Spark session
spark = (SparkSession.builder
         .appName("army")
         .getOrCreate())

# 2. Example data
raw_data = {
    "regiment":  ["Nighthawks", "Nighthawks", "Nighthawks", "Nighthawks",
                  "Dragoons",   "Dragoons",   "Dragoons",   "Dragoons",
                  "Scouts",     "Scouts",     "Scouts",     "Scouts"],
    "company":   ["1st", "1st", "2nd", "2nd",
                  "1st", "1st", "2nd", "2nd",
                  "1st", "1st", "2nd", "2nd"],
    "deaths":    [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
    "battles":   [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
    "size":      [1045, 957, 1099, 1400, 1592, 1006, 987, 849,
                  973, 1005, 1099, 1523],
    "veterans":  [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
    "readiness": [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
    "armored":   [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
    "deserters": [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    "origin":    ["Arizona", "California", "Texas", "Florida",
                  "Maine", "Iowa", "Alaska", "Washington",
                  "Oregon", "Wyoming", "Louisiana", "Georgia"]
}

# 3. Explicit schema (avoids costly type inference and overlooked pitfalls)
schema = StructType([
    StructField("regiment",  StringType(),  False),
    StructField("company",   StringType(),  False),
    StructField("deaths",    IntegerType(), False),
    StructField("battles",   IntegerType(), False),
    StructField("size",      IntegerType(), False),
    StructField("veterans",  IntegerType(), False),
    StructField("readiness", IntegerType(), False),
    StructField("armored",   IntegerType(), False),
    StructField("deserters", IntegerType(), False),
    StructField("origin",    StringType(),  False),
])

# 4. Convert to a list of row tuples (relies on the dict keys following the schema order)
data = list(zip(*raw_data.values()))  # more compact than an explicit loop

# 5. Final DataFrame
army = spark.createDataFrame(data, schema=schema)

# ────────────────────────────────────────────────────────────
# QUICK WAYS TO INSPECT THE DATA
# ────────────────────────────────────────────────────────────

# (a) Column structure and types
army.printSchema()

# (b) First N rows in tabular form (without collecting everything):
army.show(5)                 # similar to df.head() in pandas

# (c) Fetch specific rows as Row objects
first_two = army.take(2)     # → list of Row
for r in first_two:
    print(r)

# (d) Access the fields of each Row (by name or index)
print(first_two[0]["regiment"], first_two[0]["deaths"])

# (e) When the DataFrame is small and you want to use pandas tools:
pdf = army.limit(10).toPandas()   # uses Arrow if it is enabled
print(pdf)

# (f) Sampling without forcing a full action:
sample_df = army.sample(fraction=0.2, seed=42)
sample_df.show()

# Stop the session if you no longer need it
# spark.stop()


root
 |-- regiment: string (nullable = false)
 |-- company: string (nullable = false)
 |-- deaths: integer (nullable = false)
 |-- battles: integer (nullable = false)
 |-- size: integer (nullable = false)
 |-- veterans: integer (nullable = false)
 |-- readiness: integer (nullable = false)
 |-- armored: integer (nullable = false)
 |-- deserters: integer (nullable = false)
 |-- origin: string (nullable = false)

+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|  regiment|company|deaths|battles|size|veterans|readiness|armored|deserters|    origin|
+----------+-------+------+-------+----+--------+---------+-------+---------+----------+
|Nighthawks|    1st|   523|      5|1045|       1|        1|      1|        4|   Arizona|
|Nighthawks|    1st|    52|     42| 957|       5|        2|      0|       24|California|
|Nighthawks|    2nd|    25|      2|1099|      62|        3|      1|       31|     Texas|
|Nighthawks|    2nd|   616|      2|1400|      26|   