# Lab 01: Getting Started

In this lab, we will walk through configuring Google Colab so that you can download the starter code for each lab, run and edit code, and then save and submit it.

Colab is a cloud-based solution that allows you to run Python and Jupyter notebooks from your browser, without having to configure anything on your personal computer. 

Before continuing with this notebook, please first complete this brief tutorial: [Overview of Colab](https://colab.research.google.com/notebooks/basic_features_overview.ipynb)


<hr>

## Command-line

Welcome back!

Colab creates a [virtual machine](https://en.wikipedia.org/wiki/Virtual_machine), which is like a fresh install of an operating system made just for you.

This means that in addition to running Python code in this notebook, you can also interact with the command-line as if you were using a terminal to navigate a computer. If you have never used the command-line before, it is worth reading through [this tutorial](https://computers.tutsplus.com/tutorials/navigating-the-terminal-a-gentle-introduction--mac-3855). 

Notebooks also allow you to run shell commands by preceding the command with a `!`. 

For example, the command below will print the path of the current working directory, which is where this notebook is running.


In [1]:
!pwd

/content


And the following command will print information about the operating system that this notebook is running on. The `cat` command prints the contents of a file, so this prints the contents of every file in the `/etc/` folder whose name ends in `release`, which is where Linux distributions store OS information.

In [2]:
!cat /etc/*release 

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic


This tells us that we are running Ubuntu, a Linux distribution, and also says what version we're running.
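
If you would rather pick out a single field in Python than read the whole listing, the `KEY=value` lines are simple to parse. The sketch below is only an illustration: `parse_os_release` is a hypothetical helper (not part of Colab or the course code), and it handles only the simple quoting seen in the output above.

```python
def parse_os_release(text):
    """Turn KEY=value lines (the format printed above) into a dictionary."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double quotes, e.g. NAME="Ubuntu".
        info[key] = value.strip('"')
    return info


# Sample lines copied from the output above.
sample = '''NAME="Ubuntu"
VERSION_ID="18.04"
PRETTY_NAME="Ubuntu 18.04.6 LTS"
ID=ubuntu
'''

os_info = parse_os_release(sample)
print(os_info["PRETTY_NAME"])  # Ubuntu 18.04.6 LTS
```

In a notebook, you could pass the text of `/etc/os-release` to the same function instead of the hard-coded sample.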


## Downloading starter code

Next, we'll walk through how to download the starter code from the course GitHub repository. The steps are:

1. Mount your personal Google Drive inside this virtual machine so you can write to your drive from this notebook.
2. Change the current working directory to the root folder of your Google Drive.
3. Clone the repository from GitHub into this folder.



In [None]:
# Mount our personal Google Drive. This will pop up a
# confirmation screen giving this notebook access to your Google Drive.
# You will first need a Google account for this to work.
from google.colab import drive
drive.mount('/content/drive')

You should now see the contents of your Google drive by navigating to the folder icon in the left panel. It is viewable at `/content/drive/MyDrive`.

To list the contents of a folder, you can use the `ls` command. This should list the contents of the root folder of your Google drive.

In [None]:
!ls /content/drive/MyDrive
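
The same check can be done from Python, which is handy if you want to confirm the drive is mounted before reading or writing files. This is a minimal sketch: `list_folder` is a hypothetical helper, and the path is assumed to match the mount point used in the cell above.

```python
from pathlib import Path


def list_folder(folder):
    """Return the sorted names in `folder`, or None if the folder does not exist."""
    path = Path(folder)
    if not path.is_dir():
        return None
    return sorted(entry.name for entry in path.iterdir())


# Assumed mount point, taken from the `ls` command above.
names = list_folder("/content/drive/MyDrive")
if names is None:
    print("Drive folder not found; run the drive.mount cell first.")
else:
    print(names)
```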

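To change the current working directory to your Google Drive folder, we will issue a `cd` command preceded by the `%` symbol. The difference between `!` and `%` is that `!` runs the command in a temporary subshell, so a `!cd` is forgotten as soon as the command finishes, whereas `%cd` changes the working directory of the notebook itself and persists across cells.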

In [5]:
%cd /content/drive/My Drive

/content/drive/My Drive


In [6]:
!pwd

/content/drive/My Drive
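
To see why a directory change made with `!` does not stick, the plain-Python sketch below mimics both behaviours. It is an analogy under stated assumptions, not how Colab is implemented: `subprocess.run` stands in for `!`, `os.chdir` stands in for `%cd`, and a temporary folder stands in for your drive.

```python
import os
import subprocess
import tempfile

start = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.realpath(tmp)

    # Like `!cd`: the change happens in a child shell and is lost when it exits.
    subprocess.run("cd /", shell=True, check=True)
    after_subshell = os.getcwd()

    # Like `%cd`: the change applies to the current process and persists.
    os.chdir(target)
    after_chdir = os.getcwd()

    # Return to the starting folder before the temporary folder is removed.
    os.chdir(start)

print(after_subshell == start)  # True
print(after_chdir == target)    # True
```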


Our next task is to clone the course repository and store it into our personal Google drive. The course repository is viewable here: <https://github.com/nmattei/cmps3160>.


If you have never used GitHub before, here is some background: git is one of the most widely used version control systems today, and it is invaluable when working in a team. GitHub is a web-based hosting service built around git that supports hosting git repositories, user management, etc. There are other similar services, e.g., BitBucket and GitLab.

We will use GitHub to distribute the assignments and other class materials. Our use of git and GitHub for the class will be minimal; however, we encourage you to use them for collaboration on your class project, for other classes, or for anything else, because they're great. To learn more about GitHub, see [this tutorial](https://docs.github.com/en/get-started/quickstart/hello-world). Note that you don't need to do that tutorial to complete this notebook.


The main thing we want to do is clone the course files into this Colab virtual machine. To do so, we will issue a `git clone` command. This will copy all the files from the course GitHub repository to our drive.

In [7]:
!git clone https://github.com/nmattei/cmps3160.git

Cloning into 'cmps3160'...
remote: Enumerating objects: 806, done.
remote: Counting objects: 100% (97/97), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 806 (delta 64), reused 61 (delta 48), pack-reused 709
Receiving objects: 100% (806/806), 23.58 MiB | 10.07 MiB/s, done.
Resolving deltas: 100% (447/447), done.
Checking out files: 100% (142/142), done.


You should now be able to list the contents of the new folder `/content/drive/My Drive/cmps3160`, which contains all the course code and data.

In [8]:
!ls cmps3160

404.html	css	    Gemfile.lock  js	    _notebooks	  schedule.md
assignments.md	_data	    img		  _labs     _projects	  service.md
CHANGELOG.md	Dockerfile  _includes	  _layouts  README.md	  syllabus.md
_config.yml	Gemfile     index.md	  LICENSE   resources.md  tags.html


Additionally, if you browse your Google drive as you normally would, you should see the new `cmps3160` folder there.

<br>


Now let's change directories to the cmps3160 folder.

In [9]:
%cd cmps3160

/content/drive/My Drive/cmps3160


Before starting work on any of the course assignments, it is a good idea to make sure you have the most up-to-date files from GitHub. For example, on the extremely rare occasion that the instructors make a mistake in the starter code (!!), we may push a fix to GitHub.

To update your local copy with the latest code, you will use a `git pull` command from any folder within `cmps3160`.

Note that this should have no effect right now, since you just cloned the repository moments ago. So, it should just output `Already up to date`.

In [10]:
!git pull

Already up to date.
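
If you rerun this notebook in a later session, the folder will already exist on your drive, so cloning again would fail while pulling would succeed. The sketch below picks the appropriate command. It is only an illustration: `git_sync_command` is a hypothetical helper, and it assumes a folder counts as cloned if it contains a `.git` subfolder.

```python
from pathlib import Path


def git_sync_command(repo_url, dest):
    """Return the git command that would bring `dest` up to date."""
    dest = Path(dest)
    if (dest / ".git").is_dir():
        # Already cloned: pull the latest changes inside that folder.
        return ["git", "-C", str(dest), "pull"]
    # Not cloned yet: make a fresh copy.
    return ["git", "clone", repo_url, str(dest)]


cmd = git_sync_command("https://github.com/nmattei/cmps3160.git", "cmps3160")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

The function only builds the command, so you can inspect it before running anything.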


Next, let's change directory to the `_labs/` folder.

In [11]:
%cd _labs
!ls

/content/drive/My Drive/cmps3160/_labs
data	Lab01  Lab03  Lab05  Lab07  Lab09
images	Lab02  Lab04  Lab06  Lab08  Lab10


The `data` folder contains all the data used for the labs.

In [12]:
!ls data

ames.tsv  ml-1m.zip  names.zip	reds.csv  tips.csv  titanic.csv  whites.csv


The next lab will work with the file `titanic.csv`, which contains information on passengers of the ill-fated Titanic passenger ship. You can use the command `head` to see the first ten lines of this file.

In [13]:
!head data/titanic.csv

﻿pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.5500,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1,1,"Anderson, Mr. Harry",male,48,0,0,19952,26.5500,E12,S,3,,"New York, NY"
1,1,"Andrews, Miss. Kornelia Theodosia",female,63,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
1,0,"Andrews, Mr. Thomas Jr",male,39,0,0,112050,0.0000,A36,S,,,"Belfast, NI"
1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53,2,0,11769,51.4792,C101,S,D,,"Bay

We can see that this is a comma-separated file, where each row contains information on a ship passenger.
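
One detail worth noticing is that some fields, such as the passenger name, contain commas themselves and are wrapped in double quotes. Splitting each line on commas would therefore give the wrong number of fields. The sketch below uses Python's standard `csv` module on two lines copied from the output above to show the difference; the variable names are just for illustration.

```python
import csv
import io

# Two lines copied from the `head` output above.
sample = (
    'pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest\n'
    '1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"\n'
)

# Naive approach: commas inside quotes are wrongly treated as separators.
naive_fields = sample.splitlines()[1].split(",")
print(len(naive_fields))  # 16

# The csv module understands quoted fields.
reader = csv.reader(io.StringIO(sample))
header = next(reader)
first_row = next(reader)

# Pair each column name with its value for the first passenger.
passenger = dict(zip(header, first_row))
print(len(header), len(first_row))  # 14 14
print(passenger["name"])            # Allen, Miss. Elisabeth Walton
```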

## Pandas

Pandas is a Python library that we will be using extensively to store and analyze data. You can find a brief overview of Pandas [here](https://pandas.pydata.org/docs/user_guide/10min.html).

The key data structure in Pandas is the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html), which is conceptually similar to an Excel spreadsheet.

Below, we import the `pandas` library and read `titanic.csv` into a new DataFrame object called `df`.

We can print the first five rows of the DataFrame using the `.head()` method.

In [14]:
import pandas as pd
df = pd.read_csv('data/titanic.csv')
df.head()

|    | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---:|-------:|---------:|:-----|:----|----:|------:|------:|-------:|-----:|:------|:---------|-----:|-----:|:----------|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 | | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |


In [15]:
# How many rows are there?
len(df)

1309
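
A DataFrame's size can be inspected in a few related ways. The sketch below uses a tiny hand-built DataFrame (a few columns for the first three passengers shown above, not the full file) so that it runs without the course data; the variable name `sample` is only for illustration.

```python
import pandas as pd

# A few columns for the first three passengers shown above.
sample = pd.DataFrame(
    {
        "pclass": [1, 1, 1],
        "survived": [1, 1, 0],
        "age": [29.0, 0.9167, 2.0],
    }
)

print(len(sample))          # 3, the number of rows
print(sample.shape)         # (3, 3), as (rows, columns)
print(len(sample.head(2)))  # 2, since head(n) returns the first n rows
```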

In the next labs, we will work with DataFrames in more detail. For now, please complete the short exercises below and submit your notebook to Canvas.

## Exercises

**1. Change the current working directory to the `cmps3160/_notebooks/data/` folder.**

In [16]:
# TODO: your code here

Expected output: 

```
/content/drive/MyDrive/cmps3160/_notebooks/data
```

**2. List the contents of this directory.**

In [17]:
# TODO: your code here

Expected output:

```
adult.csv      boundry.png  nba_salaries.csv  review_polarity.zip
billboard.csv  iris.csv     nba_stats.csv     titanic.csv
bodyfat.csv    iris.png     religon.csv
```

**3. Oh look, there's another `titanic.csv` file in here. Read it into a new DataFrame called `df2` and print the first row.**

In [18]:
# TODO: your code here

**4. DataFrame objects have a `.describe()` method that summarizes the data. Run it below:**

In [19]:
# TODO: your code here

Expected output:


|       |      pclass |    survived |       age |       sibsp |       parch |      fare |     body |
|:------|------------:|------------:|----------:|------------:|------------:|----------:|---------:|
| count | 1309        | 1309        | 1046      | 1309        | 1309        | 1308      | 121      |
| mean  |    2.29488  |    0.381971 |   29.8811 |    0.498854 |    0.385027 |   33.2955 | 160.81   |
| std   |    0.837836 |    0.486055 |   14.4135 |    1.04166  |    0.86556  |   51.7587 |  97.6969 |
| min   |    1        |    0        |    0.1667 |    0        |    0        |    0      |   1      |
| 25%   |    2        |    0        |   21      |    0        |    0        |    7.8958 |  72      |
| 50%   |    3        |    0        |   28      |    0        |    0        |   14.4542 | 155      |
| 75%   |    3        |    1        |   39      |    1        |    0        |   31.275  | 256      |
| max   |    3        |    1        |   80      |    8        |    9        |  512.329  | 328      |


5. By looking at the output and reading the [method documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html), answer the following questions:

**5a. What fraction of passengers survived (rounded to two decimal places)?**

**TODO: Enter a number here.**

**5b. What was the median fare (rounded to the nearest dollar)?**

**TODO: Enter a number here.**

**5c. Were there more 1st class passengers or 3rd class passengers?**

**TODO: Enter 1st or 3rd here**

**To submit:**

1. File->Download .ipynb
2. Upload the .ipynb file to the appropriate assignment in [Canvas](https://tulane.instructure.com/)