# Tutorial: Hello Bash and Python

In this tutorial we will familiarise ourselves with Bash and Python and, along the way, with notebooks. Please open this notebook at `colab.research.google.com` if you do not have a local instance of JupyterHub/JupyterLab running.


Submission:

This tutorial requires two submissions: one on Git and one on SUNLearn. You will receive an email at your student account asking you to create an account on GitLab.




In [1]:
import pandas as pd

## Question 1: Bash

Retrieve the data and interrogate it with bash before using Python tooling. This is useful because type problems or malformed files can cause trouble later, and a quick interrogation may reveal those issues.
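
As a rough illustration of the kind of problem an early look can catch, the sketch below checks that every row has the same number of fields as the header. The sample text and the `find_ragged_rows` helper are hypothetical and are not part of the tutorial data.

```python
import csv
import io

# Hypothetical sample: the third data row is missing a field.
SAMPLE = """user_id,name,gender,age
1,Anthony Wolf,male,73
2,James Armstrong,male,56
3,Cody Shaw,75
"""

def find_ragged_rows(text):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) != len(header)
    ]

ragged = find_ragged_rows(SAMPLE)
print(ragged)
```

A non-empty result points at the lines worth opening in a text editor before any parsing is attempted.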


### Question 1.1

Run the bash command `wget` to retrieve a file located at `https://storage.googleapis.com/bdt-beam/users_v.csv` [1]

In [2]:
!wget https://storage.googleapis.com/bdt-beam/users_v.csv


--2024-09-09 14:11:09--  https://storage.googleapis.com/bdt-beam/users_v.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.180.207, 142.251.16.207, 172.253.62.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.180.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 143675 (140K) [text/csv]
Saving to: ‘users_v.csv’


2024-09-09 14:11:10 (642 KB/s) - ‘users_v.csv’ saved [143675/143675]



### Question 1.2

Use a bash command to view the first ten lines of the file (to confirm that things are as you expect) [1]

In [3]:
!head -n 10 users_v.csv

user_id,name,gender,age,address,date_joined
1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13
2,James Armstrong,male,56,North Jillianfort-UT-86454,2020/11/06
3,Cody Shaw,male,75,North Anne-SC-53799,2004/05/29
4,Sierra Hamilton,female,76,New Angelafurt-ME-46190,2005/08/26
5,Chase Davis,male,31,South Bethmouth-WI-18562,2018/04/30
6,Sierra Andrews,female,21,Ryanville-MI-69690,2007/05/25
7,Ann Stone,female,41,Smithmouth-SD-17340,2005/01/05
8,Karen Santos,female,34,Mariaville-AK-29888,2003/12/12
9,Ronald Meyer,male,41,North Cherylhaven-NJ-04197,2015/11/14


### Question 1.3

Use a bash command to determine the number of rows in the file [1]

In [6]:
!wc -l users_v.csv

2358 users_v.csv


### Question 1.4

Suppose the file is too large for initial exploration. Let's take a quick sample so that we can continue working and see what is in the dataset. Loading the whole file into Pandas first would use all of that memory anyway, so let's sample it before we load it.

Take a random sample of the file (limit the result to 1000 lines) and create another file called `users_sample.csv`, using only bash commands [3]

Hint: redirect a stream into the output file.

In [9]:
!shuf -n 1000 users_v.csv > users_sample.csv
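
Note that `shuf` treats the header like any other line, so the header may or may not end up in the sample. The sketch below shows, in Python and on a hypothetical in-memory list of lines, one way to sample data rows while always keeping the header; the `sample_rows` helper and its `seed` parameter are assumptions made for illustration.

```python
import random

# Hypothetical in-memory stand-in for the lines of a CSV file.
lines = ["user_id,name"] + [f"{i},user{i}" for i in range(1, 21)]

def sample_rows(all_lines, k, seed=None):
    """Keep the header, then draw k data rows without replacement."""
    header, rows = all_lines[0], all_lines[1:]
    rng = random.Random(seed)
    return [header] + rng.sample(rows, min(k, len(rows)))

sample = sample_rows(lines, 5, seed=0)
print(len(sample))
```

Passing a seed makes the sample repeatable, which helps when results need to be compared between runs.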

### Question 1.5

* Sort your file by ascending `user_id`s [1]
* Overwrite the current `users_sample.csv` with the ordered content [1]
* Print the last ten lines of this file [1]

In [10]:
!sort -t, -k1,1 users_sample.csv -o users_sample.csv
!tail -n 10 users_sample.csv


978,Ralph Ball,male,71,Julianshire-WV-10729,2019/09/05
979,Edward Jones,male,80,West Annfurt-MA-25816,2002/08/20
989,Vincent Hart,male,23,New Stevenfort-MN-15220,2015/03/11
990,Christopher Fox,male,69,West Samanthaberg-CA-56486,2001/11/28
992,Dorothy Jordan,female,70,West Jessestad-TX-89943,2000/12/30
993,Elizabeth Perez,female,50,North Rebecca-MD-97897,2007/10/18
995,Lisa Jacobs,female,31,Port Charles-OK-64370,2011/10/25
997,Matthew Cooke,male,31,New Aaronshire-FL-31342,2015/01/02
999,Kenneth Bryant,male,27,South Meganmouth-IL-80593,2000/04/10
user_id,name,gender,age,address,date_joined
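
The header line appears last in the output above, and `999` is the largest id shown, because `sort` without a numeric option compares keys as text rather than as numbers. The sketch below reproduces the effect on a few hypothetical ids in Python; it illustrates the ordering rule and is not a replacement for the bash answer.

```python
# Hypothetical ids, stored as text the way they appear in a CSV file.
ids = ["999", "1000", "25", "user_id", "2357"]

# Text comparison goes character by character, so "1000" < "25" < "999",
# and letters sort after digits.
as_text = sorted(ids)

# Numeric comparison needs the header set aside and the keys converted.
header = [x for x in ids if not x.isdigit()]
as_numbers = header + sorted((x for x in ids if x.isdigit()), key=int)

print(as_text)
print(as_numbers)
```

In bash, GNU `sort` offers the `-n` option for numeric comparison of the key.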


## Question 2: Python

Perform analysis with Python

### Question 2.1

Load the original `users_v.csv` into a Pandas dataframe [1]

In [21]:
import pandas as pd

file = 'https://storage.googleapis.com/bdt-beam/users_v.csv'

usersv = pd.read_csv(file)

### Question 2.2

Display/print the top ten lines of the dataframe [1]



In [22]:
print(usersv.head(10))

   user_id             name  gender  age                     address  \
0        1     Anthony Wolf    male   73    New Rachelburgh-VA-49583   
1        2  James Armstrong    male   56  North Jillianfort-UT-86454   
2        3        Cody Shaw    male   75         North Anne-SC-53799   
3        4  Sierra Hamilton  female   76     New Angelafurt-ME-46190   
4        5      Chase Davis    male   31    South Bethmouth-WI-18562   
5        6   Sierra Andrews  female   21          Ryanville-MI-69690   
6        7        Ann Stone  female   41         Smithmouth-SD-17340   
7        8     Karen Santos  female   34         Mariaville-AK-29888   
8        9     Ronald Meyer    male   41  North Cherylhaven-NJ-04197   
9       10    Steven Rivera    male   43          Wayneside-VT-29040   

  date_joined  
0  2019/03/13  
1  2020/11/06  
2  2004/05/29  
3  2005/08/26  
4  2018/04/30  
5  2007/05/25  
6  2005/01/05  
7  2003/12/12  
8  2015/11/14  
9  2003/05/15  


### Question 2.3

Find the age of the oldest and youngest person in the dataset [1]

Hint: you can use the `print(..., ...)` function to display the two values if you construct it as two arguments

In [17]:
old = usersv['age'].max()
young = usersv['age'].min()

print(old, young)

80 18


### Question 2.4

Draw descriptive statistics (one-liner) for the `age` column - these statistics should include `count`, `mean`, and `std` [1]

Hint: this command has a parallel in R

In [18]:
usersv['age'].describe()[['count', 'mean', 'std']]

count    2357.000000
mean       49.054731
std        18.206348
Name: age, dtype: float64
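
To make the three statistics concrete, here is a minimal sketch that computes them with the standard library on a handful of hypothetical ages. It assumes the default pandas behaviour, where `std` is the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` also computes.

```python
import statistics

# Hypothetical ages, not taken from the dataset.
ages = [21, 34, 41, 56, 73]

count = len(ages)
mean = statistics.mean(ages)
# Sample standard deviation (divides by n - 1), the same convention
# pandas uses by default for `std`.
std = statistics.stdev(ages)

print(count, mean, round(std, 3))
```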


### Question 2.5

* Using anonymous functions (`lambda`), create a `surname` column from the `name` column (you may assume that the last whitespace-separated word is the surname) [2]
* Display the last 10 lines of your dataframe [1]


In [20]:
usersv['surname'] = usersv['name'].apply(lambda x: x.split()[-1])

print(usersv.tail(10))

      user_id                 name  gender  age                    address  \
2347     2348     Victoria Edwards  female   68    Lake Jamesberg-NY-88824   
2348     2349          Chris Ellis    male   46  Port Richardside-NY-77994   
2349     2350       Kimberly Smith  female   19      East Anthony-GA-00646   
2350     2351       William Nelson    male   67   Lake Parkerside-MN-06905   
2351     2352          Nancy Clark  female   80        Jamesshire-AK-88437   
2352     2353      Brittney Graham  female   40         Brownland-CO-71229   
2353     2354      Allison Schmidt  female   43        Port Logan-MD-38588   
2354     2355  Christopher Johnson    male   68   North Justinton-VA-32798   
2355     2356           Mark Brown    male   67    New Kayleefurt-MA-82581   
2356     2357      Steven Robinson    male   45         Mistytown-HI-31737   

     date_joined   surname  
2347  2001/09/03   Edwards  
2348  2011/03/18     Ellis  
2349  2021/06/20     Smith  
2350  2005/12/21    Nelso
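
The "last word" assumption can be checked on plain strings before it is applied to a whole column. The sketch below uses the same `split()[-1]` idea on hypothetical names; `surname_of` is an illustrative name, and the last example shows a case where the assumption loses part of a multi-word surname.

```python
# The same anonymous function used with `apply`, shown on plain strings.
surname_of = lambda full_name: full_name.split()[-1]

# Hypothetical names chosen to show where the assumption holds and where it does not.
names = ["Anthony Wolf", "  Sierra   Hamilton ", "Ludwig van Beethoven"]

surnames = [surname_of(n) for n in names]
print(surnames)
```

Calling `split()` with no argument splits on any run of whitespace, so stray leading, trailing, or repeated spaces do not affect the result.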

### Question 2.6

* Convert `date_joined` to a pandas series of type `datetime`  [1]
* Overwrite the original `date_joined` column with the result [1]

In [24]:
usersv['date_joined'] = pd.to_datetime(usersv['date_joined'])
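
When no format is given, the date layout is inferred from the data. If the layout is known, it can be stated explicitly; `pd.to_datetime` accepts a `format` argument that takes the same kind of format string used below. The sketch uses the standard library's `datetime.strptime` on a single value to show what the string `%Y/%m/%d` matches; the second value is a hypothetical mismatch.

```python
from datetime import datetime

# Format matching values such as "2019/03/13" in `date_joined`.
DATE_FORMAT = "%Y/%m/%d"

parsed = datetime.strptime("2019/03/13", DATE_FORMAT)
print(parsed.year, parsed.month, parsed.day)

# A value in a different layout is rejected instead of being guessed at.
try:
    datetime.strptime("13-03-2019", DATE_FORMAT)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```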


## Question 3: Git

Push your notebook to Git. If you don't have any Git tooling installed on your machine, download a tool of your choice.

 * Create a repository (named `day1-tutorial`) on GitLab (check your student email for the GitLab sign-up/membership request) [1]
 * Push this notebook to that repository [1]

## The End

Now that `date_joined` is a datetime, we can plot how many users signed up/registered each year.

In [None]:
import matplotlib

%matplotlib inline

# Count users per sign-up year and draw the counts as a bar chart.
usersv.user_id.groupby([usersv.date_joined.dt.year]).count().plot(kind="bar")
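
The grouping step amounts to counting how many join dates fall in each year. A minimal sketch of that counting with the standard library, on hypothetical dates rather than the dataset, might look like this:

```python
from collections import Counter
from datetime import date

# Hypothetical join dates, not taken from the dataset.
joined = [date(2019, 3, 13), date(2020, 11, 6), date(2019, 7, 1), date(2004, 5, 29)]

# Count sign-ups per year, then order by year as the bar chart's x-axis would be.
per_year = sorted(Counter(d.year for d in joined).items())
print(per_year)
```

Each `(year, count)` pair corresponds to one bar in the chart.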