# Regiment

### Introduction:

Special thanks to: http://chrisalbon.com/ for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=01b090a77981281f4fb6e82394fb44c2199c46d0793cbf797f8e19ef5596e47b
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [2]:
# Import only the names used unqualified below; importing sum, min, and max
# directly would shadow the Python built-ins of the same name.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean
# Run Spark locally, using all available cores.
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [9]:
import pyspark.sql.functions as F

### Step 2. Create the DataFrame with the following values:

In [3]:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}

### Step 3. Assign it to a variable called regiment.
#### Don't forget to name each column

In [4]:
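# zip(*...) transposes the column lists into row tuples; the dict keys become the column names.
regiment = spark.createDataFrame(list(zip(*raw_data.values())), schema=list(raw_data.keys()))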
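
To see why `zip(*raw_data.values())` yields one tuple per soldier, the transposition can be reproduced in plain Python without Spark. This is a minimal sketch on a shortened dictionary; the name `columns` is illustrative and holds only the first three rows of `raw_data`.

```python
# First three rows of raw_data, kept short so the sketch is self-contained.
columns = {
    'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks'],
    'company': ['1st', '1st', '2nd'],
    'name': ['Miller', 'Jacobson', 'Ali'],
    'preTestScore': [4, 24, 31],
    'postTestScore': [25, 94, 57],
}

# zip(*lists) pairs up the i-th element of every column list,
# turning column-oriented data into row tuples.
rows = list(zip(*columns.values()))

# Dicts preserve insertion order (Python 3.7+), so keys and values line up.
schema = list(columns.keys())

for row in rows:
    print(dict(zip(schema, row)))
```

Each printed dictionary corresponds to one row of the resulting DataFrame, with the schema entries as column names.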

### Step 4. What is the mean preTestScore from the regiment Nighthawks?  

In [5]:
regiment.show(10)

+----------+-------+--------+------------+-------------+
|  regiment|company|    name|preTestScore|postTestScore|
+----------+-------+--------+------------+-------------+
|Nighthawks|    1st|  Miller|           4|           25|
|Nighthawks|    1st|Jacobson|          24|           94|
|Nighthawks|    2nd|     Ali|          31|           57|
|Nighthawks|    2nd|  Milner|           2|           62|
|  Dragoons|    1st|   Cooze|           3|           70|
|  Dragoons|    1st|   Jacon|           4|           25|
|  Dragoons|    2nd|  Ryaner|          24|           94|
|  Dragoons|    2nd|    Sone|          31|           57|
|    Scouts|    1st|   Sloan|           2|           62|
|    Scouts|    1st|   Piger|           3|           70|
+----------+-------+--------+------------+-------------+
only showing top 10 rows



In [15]:
regiment.groupBy('regiment').agg(mean('preTestScore')).show()

+----------+-----------------+
|  regiment|avg(preTestScore)|
+----------+-----------------+
|Nighthawks|            15.25|
|  Dragoons|             15.5|
|    Scouts|              2.5|
+----------+-----------------+
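
The question asks about the Nighthawks only, while the grouped table lists every regiment. As a cross-check that does not need Spark, the Nighthawks value can be recomputed in plain Python from the `regiment` and `preTestScore` lists in `raw_data`. The variable names below are illustrative, not part of the notebook.

```python
from statistics import mean

# Columns copied from raw_data.
regiments = ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4
pre_test_scores = [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]

# Keep only the scores whose row belongs to the Nighthawks.
nighthawks_scores = [
    score
    for reg, score in zip(regiments, pre_test_scores)
    if reg == 'Nighthawks'
]

nighthawks_mean = mean(nighthawks_scores)
print(nighthawks_mean)  # 15.25
```

In PySpark the same restriction would typically be expressed by applying `filter(col('regiment') == 'Nighthawks')` before aggregating.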



### Step 5. Present general statistics by company

In [14]:
regiment.groupBy('company').agg(F.count('postTestScore').alias('count'),
         F.mean('postTestScore').alias('mean'),
         F.stddev('postTestScore').alias('std'),
         F.min('postTestScore').alias('min'),
         F.expr('percentile(postTestScore, array(0.25))')[0].alias('%25'),
         F.expr('percentile(postTestScore, array(0.5))')[0].alias('%50'),
         F.expr('percentile(postTestScore, array(0.75))')[0].alias('%75'),
         F.max('postTestScore').alias('max')).show()

+-------+-----+------------------+------------------+---+-----+----+----+---+
|company|count|              mean|               std|min|  %25| %50| %75|max|
+-------+-----+------------------+------------------+---+-----+----+----+---+
|    2nd|    6|              67.0|14.057026712644463| 57|58.25|62.0|68.0| 94|
|    1st|    6|57.666666666666664| 27.48575388572536| 25|34.25|66.0|70.0| 94|
+-------+-----+------------------+------------------+---+-----+----+----+---+
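
The summary columns can be reproduced by hand, which makes the definitions explicit. The sketch below recomputes the `2nd` company row in plain Python, assuming that `percentile` interpolates linearly at position `p * (n - 1)` in the sorted values and that `stddev` is the sample standard deviation; both assumptions reproduce the figures in the table above. The helper `percentile` and the dictionary `summary` are illustrative names.

```python
import math
from statistics import mean, stdev

# postTestScore values for the 2nd company, copied from raw_data.
second_company = [57, 62, 94, 57, 62, 70]

def percentile(values, p):
    """Interpolate linearly between the two nearest ranks at position p * (n - 1)."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

summary = {
    'count': len(second_company),
    'mean': mean(second_company),
    'std': stdev(second_company),  # sample standard deviation (n - 1 denominator)
    'min': min(second_company),
    '%25': percentile(second_company, 0.25),
    '%50': percentile(second_company, 0.5),
    '%75': percentile(second_company, 0.75),
    'max': max(second_company),
}
print(summary)
```

For example, the 25th percentile sits at position 1.25 in the sorted list `[57, 57, 62, 62, 70, 94]`, a quarter of the way from 57 to 62, which gives 58.25.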



### Step 6. What is the mean of each company's preTestScore?

In [15]:
regiment.groupBy('company').agg(F.mean('preTestScore')).show()

+-------+-----------------+
|company|avg(preTestScore)|
+-------+-----------------+
|    2nd|             15.5|
|    1st|6.666666666666667|
+-------+-----------------+



### Step 7. Present the mean preTestScores grouped by regiment and company

In [16]:
regiment.groupBy('regiment', 'company').agg(F.mean('preTestScore')).show()

+----------+-------+-----------------+
|  regiment|company|avg(preTestScore)|
+----------+-------+-----------------+
|Nighthawks|    1st|             14.0|
|  Dragoons|    1st|              3.5|
|Nighthawks|    2nd|             16.5|
|  Dragoons|    2nd|             27.5|
|    Scouts|    2nd|              2.5|
|    Scouts|    1st|              2.5|
+----------+-------+-----------------+
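
Grouping by two columns amounts to grouping by a composite key. As a rough cross-check of the table above, the same means can be computed in plain Python with a dictionary keyed by `(regiment, company)` tuples; `groups` and `group_means` are illustrative names.

```python
from collections import defaultdict
from statistics import mean

# Columns copied from raw_data.
regiments = ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4
companies = ['1st', '1st', '2nd', '2nd'] * 3
pre_test_scores = [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3]

# Collect scores under a composite (regiment, company) key.
groups = defaultdict(list)
for reg, comp, score in zip(regiments, companies, pre_test_scores):
    groups[(reg, comp)].append(score)

# One mean per composite key, matching one output row per group.
group_means = {key: mean(scores) for key, scores in groups.items()}

for (reg, comp), value in group_means.items():
    print(reg, comp, value)
```

Unlike the dictionary, which keeps insertion order, Spark does not guarantee the row order of a grouped result unless an explicit `orderBy` is added.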



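### Step 8. Present the mean preTestScores grouped by regiment and company without hierarchical indexing

Spark DataFrames have no index, so the result of Step 7 is already a flat table with `regiment` and `company` as ordinary columns.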

### Step 9. Group the entire dataframe by regiment and company

In [18]:
regiment.groupBy('regiment','company').mean().show()

+----------+-------+-----------------+------------------+
|  regiment|company|avg(preTestScore)|avg(postTestScore)|
+----------+-------+-----------------+------------------+
|Nighthawks|    1st|             14.0|              59.5|
|  Dragoons|    1st|              3.5|              47.5|
|Nighthawks|    2nd|             16.5|              59.5|
|  Dragoons|    2nd|             27.5|              75.5|
|    Scouts|    2nd|              2.5|              66.0|
|    Scouts|    1st|              2.5|              66.0|
+----------+-------+-----------------+------------------+



### Step 10. What is the number of observations in each regiment and company?

In [19]:
regiment.groupBy('regiment','company').count().show()

+----------+-------+-----+
|  regiment|company|count|
+----------+-------+-----+
|Nighthawks|    1st|    2|
|  Dragoons|    1st|    2|
|Nighthawks|    2nd|    2|
|  Dragoons|    2nd|    2|
|    Scouts|    2nd|    2|
|    Scouts|    1st|    2|
+----------+-------+-----+



### Step 11. Iterate over the groups and print each regiment's name and all of its rows

In [38]:
r = [rr[0] for rr in regiment.select('regiment').distinct().collect()]

In [39]:
r

['Nighthawks', 'Dragoons', 'Scouts']

In [40]:
for rr in r:
  print(rr)
  regiment.filter(col('regiment')==rr).show()

Nighthawks
+----------+-------+--------+------------+-------------+
|  regiment|company|    name|preTestScore|postTestScore|
+----------+-------+--------+------------+-------------+
|Nighthawks|    1st|  Miller|           4|           25|
|Nighthawks|    1st|Jacobson|          24|           94|
|Nighthawks|    2nd|     Ali|          31|           57|
|Nighthawks|    2nd|  Milner|           2|           62|
+----------+-------+--------+------------+-------------+

Dragoons
+--------+-------+------+------------+-------------+
|regiment|company|  name|preTestScore|postTestScore|
+--------+-------+------+------------+-------------+
|Dragoons|    1st| Cooze|           3|           70|
|Dragoons|    1st| Jacon|           4|           25|
|Dragoons|    2nd|Ryaner|          24|           94|
|Dragoons|    2nd|  Sone|          31|           57|
+--------+-------+------+------------+-------------+

Scouts
+--------+-------+-----+------------+-------------+
|regiment|company| name|preTestScore|po