# Introduction to Git, Jupyter and Pandas


## Today's Goals:

* Familiarize yourself with Jupyter Notebook and _pandas_
* Practice thinking like a data scientist

## Motivation

Nearly all data science teams use or have used Jupyter Notebook. In addition, _pandas_ is the go-to Python library for exploratory data analysis and data preparation. As it is nearly impossible to avoid these tools as a data scientist using Python, it will pay off to master them early on.

If you are already familiar with these tools, think of this lab as practice. It is hard to maintain a high level of proficiency with Jupyter and _pandas_ without a decent time commitment.

While the above practical skills are useful, it is important to note that the biggest problems data science teams face are often non-technical in nature. Uninformed stakeholders, poorly chosen use-cases of data science, and bad managers damage data science teams much more than a junior data scientist's lack of ML expertise. Since our course is the CS course of this program, we will not discuss this much. However, we do hope that you think about such issues frequently (especially in this course's projects).
 
## Outcomes

By the end of this lab, you will have the basic skillset needed to do exploratory data analysis with Python. You will be able to load, visualize, and transform data with _pandas_. In addition, you will be able to share your own code, comments, and visualizations with others through Jupyter Notebook.

## Grading

You should get checked off by a TA at the end of the lab.

## Part 0: Preliminaries 
You should have already used the [DATA 1030 Software Setup Guide](https://docs.google.com/a/brown.edu/document/d/1-be-XHwFqKFYyOXjDbW6WAiG8OERUl_nYNmYqVWzn1o/edit?usp=sharing) to install Anaconda, Python and git.

### 0.1: Getting started with Jupyter Notebook

PyCharm has a simplified notebook UI built in. We recommend you use this for now.

The green "run" button runs a cell. A cell is delineated by an outlined rectangle.

## Part 1: Exploring Data with DataFrames

Run the cell below by first clicking on it, and then clicking the run button (>|) on the toolbar above. If a yellow pop-up appears, click cancel and then click the link in the pop-up.


See [this Jetbrains article](https://www.jetbrains.com/help/pycharm/using-ipython-notebook-with-product.html#run-cell) for exact instructions on how to run the notebook.

When done, you should see "Hello World!" just below the code.

In [None]:
print("Hello World!")

### 1.0 Contextualizing data

The first dataset we will be looking at relates to youth unemployment. We originally downloaded this data from [Kaggle.com](https://www.kaggle.com/sovannt/world-bank-youth-unemployment). Visit that page and scroll down to the dataset description for more details.

__Task 0:__ Discuss the following questions with your partner:

- What is the World Bank's definition of youth?
- What years are we looking at?
- How was the data collected?

### 1.1 Reading data

Data is often stored in comma-separated values (CSV) files. We have provided the unemployment data in the file `API_ILO_country_YU.csv` for you.

To read a CSV file called `my_data.csv` into a DataFrame variable called `df` using pandas, we would run the following command:
```
df = pandas.read_csv('my_data.csv')
```
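
As a minimal sketch of what `read_csv` does, the snippet below reads a tiny CSV from an in-memory string instead of a file. The country names and values are made up for illustration and are not taken from the lab dataset.

```python
import io

import pandas

# Hypothetical CSV content; in the lab you would pass a file path instead.
csv_text = """Country Name,2010,2011
Freedonia,10.5,11.0
Borduria,20.0,18.5
"""

# read_csv accepts a path or any file-like object, such as StringIO.
example_df = pandas.read_csv(io.StringIO(csv_text))

print(example_df.shape)          # (number of rows, number of columns)
print(list(example_df.columns))  # the header row becomes the column labels
```

Note that the year columns are labeled with strings such as `'2010'`, which is why the expressions later in this lab quote the year.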

__Task 1:__ Load the unemployment data from the above file into a variable called `df` below. If no error shows up, your DataFrame should have loaded.

In [None]:
import pandas
# TASK 1
# YOUR CODE HERE

### 1.2 Viewing data

Run each of these `DataFrame` methods in the cell below.

    df
    df.columns
    df.head()
    df.head(10)
    df.tail(4)
    df.nlargest(5, '2014')
    df.nsmallest(8, '2011')
    
What gets printed if you run two expressions in the same cell?

In [None]:
# Run the methods here
df
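
To make the difference between these methods concrete, here is a small sketch on a made-up DataFrame. The fictional countries and rates are assumptions for illustration only; the variable names `toy`, `first_two`, `top_two`, and `bottom_one` are ours.

```python
import pandas

# Made-up rates for fictional countries (not the lab data).
toy = pandas.DataFrame({
    "Country Name": ["Freedonia", "Sylvania", "Borduria", "Syldavia"],
    "2014": [12.0, 30.5, 7.2, 18.9],
})

first_two = toy.head(2)                 # first 2 rows, in their original order
top_two = toy.nlargest(2, "2014")       # 2 rows with the largest '2014' values
bottom_one = toy.nsmallest(1, "2014")   # row with the smallest '2014' value

print(top_two)
```

`head` and `tail` only look at row position, while `nlargest` and `nsmallest` order the rows by the values in the named column.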

Complete the following task.

__Task 2:__ Show the 12 countries with the highest unemployment rates in 2013.

In [None]:
# TASK 2
# YOUR CODE HERE

### 1.3 Transforming data

With DataFrames, we can also select
rows and columns. What do the following expressions
evaluate to?

#### Column selection:
    
    df['2010']
    df[['Country Name', '2011', '2012']]
    df.iloc[:, 3] # Gets all the rows in the 4th column
    df.iloc[:, 2:5]
    
#### Row selection:
    
    df.iloc[191] # Gets the 192nd row
    df.iloc[2:10]
    df[df['2010'] > 40]
    
#### Combinations:

    df.iloc[12:14, 0:4]
    df[['Country Name', '2011', '2012']][df['2012'] < 10]
    df[df['2012'] < 10].iloc[1:3, 2:5]

In [None]:
# Run the expressions here
df['2010']
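
Row selection with a condition works through a boolean mask. The sketch below, on made-up data, shows one way to build two masks and combine them. The thresholds and names are assumptions chosen only for illustration.

```python
import pandas

# Made-up rates for fictional countries (not the lab data).
toy = pandas.DataFrame({
    "Country Name": ["Freedonia", "Sylvania", "Borduria", "Syldavia"],
    "2010": [10.0, 35.0, 8.0, 22.0],
    "2012": [12.0, 28.0, 9.5, 25.0],
})

# A comparison on a column yields a boolean Series, one value per row.
mask_high = toy["2010"] > 20
mask_rising = toy["2012"] > toy["2010"]

# Combine masks element-wise with & (and) or | (or). When writing the
# comparisons inline, wrap each one in parentheses.
both = toy[mask_high & mask_rising]

# .loc selects rows by mask and columns by label in one step.
names = toy.loc[mask_high & mask_rising, "Country Name"]
print(names.tolist())
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why `&` and `|` are used here.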

__Task 3:__ Show the `Country Name` and `Country Code` of all countries where

* the 2010 unemployment rate is higher than the 2014 unemployment rate, and
* the 2014 unemployment rate is above 20%

In [None]:
# TASK 3
# YOUR CODE HERE

Now it is time to see more powerful features of DataFrames.
You can add, subtract, multiply, and divide columns as shown below

`df['2010'] + df['2012']`

You can easily add a new column to the DataFrame using the following syntax:

`df['new_col'] = df['2010'] + df['2012'] # add a column to df called new_col`
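
As a rough sketch of how a derived column can feed further analysis, the example below computes a relative change between two made-up columns and sorts by it. The column names `before`, `after`, and `pct_diff` are hypothetical.

```python
import pandas

# Made-up measurements (not the lab data).
toy = pandas.DataFrame({
    "item": ["a", "b", "c"],
    "before": [50.0, 80.0, 20.0],
    "after": [40.0, 100.0, 19.0],
})

# Arithmetic between columns is applied row by row.
toy["pct_diff"] = (toy["after"] - toy["before"]) / toy["before"] * 100

# sort_values returns a new DataFrame ordered by the given column
# (ascending by default).
ordered = toy.sort_values("pct_diff")
print(ordered[["item", "pct_diff"]])
```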

__Task 4:__ Display the names of the 10 countries with the largest percent decrease in unemployment between 2011 and 2012.

In [None]:
# TASK 4
# YOUR CODE HERE

### 1.4 Visualizing data

Let's make some graphs to visualize and better understand the data. Run the code block below.  Notebook commands that start with % are called "magic" commands.  Make sure to run the "magic" %matplotlib command below in future labs so your visualizations show up in the browser. More magic commands are listed [here](http://ipython.readthedocs.io/en/stable/interactive/magics.html).

In [None]:
%matplotlib inline

df.plot(kind='bar')

__Task 5:__ There are too many rows being graphed. In a new bar chart below, show only the first 10 countries and label the x-axis by country name.

Read [`pandas.DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)
for information on how to do the above task.
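
For orientation, here is a minimal sketch of the main `plot` arguments on a made-up DataFrame. It uses a line chart on invented numbers and assumes matplotlib is installed, which pandas plotting requires.

```python
import pandas

# Made-up yearly values (not the lab data).
toy = pandas.DataFrame({
    "year": [2010, 2011, 2012, 2013],
    "value": [4.0, 6.5, 5.0, 7.5],
})

# kind picks the chart type, x the column for the horizontal axis,
# and y the column (or list of columns) to draw.
ax = toy.plot(kind="line", x="year", y="value", title="Toy series")

# plot returns a matplotlib Axes, which can be customized further.
ax.set_ylabel("value")
```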

In [None]:
# TASK 5
# YOUR CODE HERE

__Task 6:__ Use the `.plot` method again to see how unemployment in 2010 relates to unemployment in 2014 for each country. Choose a suitable plot type.

In [None]:
# TASK 6
# YOUR CODE HERE

### 1.5 Exploring data independently

__Tasks 7-10:__ Use any combination of _pandas_ methods to answer the questions below. You can use a [cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) or the [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/) to aid you.

All of the questions below are flawed in some way. For example, "average unemployment rate" has multiple meanings, some of which are more useful than others. Discuss with your partner issues you have with each of the questions.

However, since the goal of this lab is to practice _pandas_, also write 1-2 lines of code for each block.
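
To illustrate how "average" can have more than one meaning, the sketch below compares two readings on made-up numbers. The `population` column is hypothetical and does not exist in the lab dataset.

```python
import pandas

# Made-up numbers for fictional countries; 'population' is a hypothetical
# column that the lab dataset does not contain.
toy = pandas.DataFrame({
    "country": ["Freedonia", "Sylvania"],
    "rate": [10.0, 30.0],  # unemployment rate in percent
    "population": [9_000_000, 1_000_000],
})

# Reading 1: every country counts equally.
simple_mean = toy["rate"].mean()

# Reading 2: every person counts equally (population-weighted mean).
weighted_mean = (toy["rate"] * toy["population"]).sum() / toy["population"].sum()

print(simple_mean, weighted_mean)
```

The two readings differ here because the larger country has the lower rate. Which reading is more useful depends on the question being asked.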

_What is the average unemployment rate for each year?_

In [None]:
# <-- Start the line with a "#" to write comments in Python
# TASK 7
# YOUR CODE HERE

_Which countries underwent the greatest increase in unemployment during this time?_

In [None]:
# TASK 8
# YOUR CODE HERE

_In which countries would you not recommend a 20-year-old try to find a job?_

In [None]:
# TASK 9
# YOUR CODE HERE

_Do you think the Arab Spring caused youth unemployment, or did youth unemployment lead to the Arab Spring?_

In [None]:
# TASK 10
# YOUR CODE HERE

__TA Check:__ When you are ready, call a TA over. They will review your work and ask a couple of questions.

### 1.6 Foreshadowing

While waiting, here are some questions to discuss with your partner. Try to answer the questions with concrete units (# of rows, # of minutes, etc.):

1. When does it make sense to use machine learning? When is SQL sufficient? When are visualizations good enough?
2. When does it make sense to use a big data tool (like Apache Spark)? When is _pandas_ fine? When is Excel good enough?
3. When does it make sense to optimize your code using C/Cython? When is _pandas_ ok? When is plain Python ample?
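
One way to ground this discussion in concrete units is to measure them. The sketch below reports the row count, memory footprint, and time of one aggregation for a synthetic DataFrame. The size of 100,000 rows is an arbitrary assumption, and timings will vary by machine.

```python
import time

import pandas

# Synthetic single-column frame; the size is arbitrary.
n_rows = 100_000
toy = pandas.DataFrame({"value": range(n_rows)})

# Memory footprint in bytes, including the index.
mem_bytes = int(toy.memory_usage(deep=True).sum())

# Wall-clock time of one simple aggregation.
start = time.perf_counter()
total = toy["value"].sum()
elapsed = time.perf_counter() - start

print(f"{len(toy):,} rows, {mem_bytes / 1e6:.2f} MB, sum took {elapsed * 1e3:.3f} ms")
```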

## Handing In

Use PyCharm to commit and push this file to your remote repository. Share this file with your partner through email (or another medium).

## Part 2 (Optional) 
If you have time, try out Part 2, an analysis of a shopping dataset with 3 million rows from Instacart, a grocery delivery startup. This is more than Excel can handle, since a worksheet is limited to 1,048,576 rows. Instructions are in the notebook `part2.ipynb`.

## Summary

This lab was an introduction to _pandas_ and Jupyter Notebook. While brief, we hope that this gives you the foundational experience to master those tools without explicit guidance.


At the end (Tasks 7-10), we try to point out that while the libraries and concepts introduced in this course can be powerful, they will not be useful if you do not critically think about the problem you are trying to solve. Given the scope of this course, future labs will generally be more problem-solving heavy and more focused on CS concepts. However, we hope that you continue to think critically from the perspective of a data scientist both throughout and after this course.

Below is a list of some important _pandas_ functions, many of which we used today.

### Pandas

#### Modules

    import pandas as pd

#### I/O

    pd.read_csv()
    df.to_csv()
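
As a small sketch of how the two I/O functions relate, the example below writes a made-up DataFrame to CSV text and reads it back. It uses an in-memory string in place of a file.

```python
import io

import pandas

# Made-up data for illustration.
toy = pandas.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})

# With no path argument, to_csv returns the CSV text as a string;
# index=False leaves out the row labels.
csv_text = toy.to_csv(index=False)

# Reading the text back gives an equal DataFrame.
restored = pandas.read_csv(io.StringIO(csv_text))
print(restored.equals(toy))
```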

#### Indexing

    df[column_name]
    df[[column1, column2, ...]]
    df.iloc[a:b,c:d]
    df.loc[a:b,c:d]
    df[boolean condition]
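
The difference between `iloc` and `loc` is easiest to see when the row labels are not integers. Here is a minimal sketch on a made-up DataFrame with string labels.

```python
import pandas

# Made-up data with string row labels.
toy = pandas.DataFrame(
    {"x": [10, 20, 30, 40], "y": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

# iloc slices by integer position; the end position is excluded.
by_position = toy.iloc[1:3, 0:1]  # rows at positions 1 and 2, first column

# loc slices by label; the end label is included.
by_label = toy.loc["b":"c", "x":"y"]  # rows 'b' through 'c', columns 'x' through 'y'

print(by_position.shape, by_label.shape)
```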

#### Exploration

    df.describe()
    df.plot()

## Power Jupyter 
[Here](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/) is a great link to Jupyter power user tips and tricks. The commands below add a lot of interesting notebook extensions.

In [None]:
!conda install -y -c conda-forge jupyter_contrib_nbextensions
!jupyter contrib nbextension install --user --skip-running-check
!jupyter nbextensions_configurator enable --user
