# Ex3 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [2]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=387261d06b50aaf02e778b80e0e0bb45a70c28745099d5df11ae6ae182210810
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [3]:
from pyspark.sql import SparkSession, functions as f

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

In [11]:
from pyspark.files import SparkFiles

url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"

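# Start (or reuse) a Spark session through the documented `builder` entry point.
spark = SparkSession.builder.appName("Exercise3").getOrCreate()
# Download the file; SparkFiles.get() then resolves the local copy by file name.
spark.sparkContext.addFile(url)

# The file is pipe-delimited and its first line holds the column names.
df = spark.read.csv("file://" + SparkFiles.get("u.user"), sep="|", header=True, inferSchema=True)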
df.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- zip_code: string (nullable = true)



### Step 3. Assign it to a variable called users and use the 'user_id' as index

### Step 4. See the first 25 entries

In [12]:
df.head(25)

[Row(user_id=1, age=24, gender='M', occupation='technician', zip_code='85711'),
 Row(user_id=2, age=53, gender='F', occupation='other', zip_code='94043'),
 Row(user_id=3, age=23, gender='M', occupation='writer', zip_code='32067'),
 Row(user_id=4, age=24, gender='M', occupation='technician', zip_code='43537'),
 Row(user_id=5, age=33, gender='F', occupation='other', zip_code='15213'),
 Row(user_id=6, age=42, gender='M', occupation='executive', zip_code='98101'),
 Row(user_id=7, age=57, gender='M', occupation='administrator', zip_code='91344'),
 Row(user_id=8, age=36, gender='M', occupation='administrator', zip_code='05201'),
 Row(user_id=9, age=29, gender='M', occupation='student', zip_code='01002'),
 Row(user_id=10, age=53, gender='M', occupation='lawyer', zip_code='90703'),
 Row(user_id=11, age=39, gender='F', occupation='other', zip_code='30329'),
 Row(user_id=12, age=28, gender='F', occupation='other', zip_code='06405'),
 Row(user_id=13, age=47, gender='M', occupation='educator', zip

### Step 5. See the last 10 entries

In [13]:
df.tail(10)

[Row(user_id=934, age=61, gender='M', occupation='engineer', zip_code='22902'),
 Row(user_id=935, age=42, gender='M', occupation='doctor', zip_code='66221'),
 Row(user_id=936, age=24, gender='M', occupation='other', zip_code='32789'),
 Row(user_id=937, age=48, gender='M', occupation='educator', zip_code='98072'),
 Row(user_id=938, age=38, gender='F', occupation='technician', zip_code='55038'),
 Row(user_id=939, age=26, gender='F', occupation='student', zip_code='33319'),
 Row(user_id=940, age=32, gender='M', occupation='administrator', zip_code='02215'),
 Row(user_id=941, age=20, gender='M', occupation='student', zip_code='97229'),
 Row(user_id=942, age=48, gender='F', occupation='librarian', zip_code='78209'),
 Row(user_id=943, age=22, gender='M', occupation='student', zip_code='77841')]

### Step 6. What is the number of observations in the dataset?

In [14]:
df.count()

943

### Step 7. What is the number of columns in the dataset?

In [15]:
len(df.columns)

5

### Step 8. Print the name of all the columns.

In [16]:
df.columns

['user_id', 'age', 'gender', 'occupation', 'zip_code']

### Step 9. How is the dataset indexed?

### Step 10. What is the data type of each column?

In [17]:
df.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- zip_code: string (nullable = true)



### Step 11. Print only the occupation column

In [18]:
df.select("occupation").show()

+-------------+
|   occupation|
+-------------+
|   technician|
|        other|
|       writer|
|   technician|
|        other|
|    executive|
|administrator|
|administrator|
|      student|
|       lawyer|
|        other|
|        other|
|     educator|
|    scientist|
|     educator|
|entertainment|
|   programmer|
|        other|
|    librarian|
|    homemaker|
+-------------+
only showing top 20 rows



### Step 12. How many different occupations are in this dataset?

In [34]:
df.select("occupation").distinct().count()

21

### Step 13. What is the most frequent occupation?

In [40]:
df.groupby("occupation").count().sort("count").tail(1)

[Row(occupation='student', count=196)]

### Step 14. Summarize the DataFrame.

In [43]:
summary = df.describe()

### Step 15. Summarize all the columns

In [44]:
summary.show()

+-------+-----------------+-----------------+------+-------------+------------------+
|summary|          user_id|              age|gender|   occupation|          zip_code|
+-------+-----------------+-----------------+------+-------------+------------------+
|  count|              943|              943|   943|          943|               943|
|   mean|            472.0|34.05196182396607|  null|         null| 50868.78810810811|
| stddev|272.3649512449549|12.19273973305903|  null|         null|30891.373254138176|
|    min|                1|                7|     F|administrator|             00000|
|    max|              943|               73|     M|       writer|             Y1A6B|
+-------+-----------------+-----------------+------+-------------+------------------+



### Step 16. Summarize only the occupation column

In [48]:
summary.select("summary", "occupation").show()

+-------+-------------+
|summary|   occupation|
+-------+-------------+
|  count|          943|
|   mean|         null|
| stddev|         null|
|    min|administrator|
|    max|       writer|
+-------+-------------+



### Step 17. What is the mean age of users?

In [50]:
summary.select("age").where(f.col("summary") == "mean").show()

+-----------------+
|              age|
+-----------------+
|34.05196182396607|
+-----------------+



### Step 18. What is the age with least occurrence?

In [51]:
min_size = df.groupBy("age").count().select(f.min("count")).collect()[0][0]
df.groupBy("age").count().sort("count").filter(f.col("count")==min_size).show()

+---+-----+
|age|count|
+---+-----+
|  7|    1|
| 10|    1|
| 73|    1|
| 11|    1|
| 66|    1|
+---+-----+

