# Lab 0.1: Introduction to Git, Jupyter and Pandas


## Today's Goals:

* Familiarize yourself with Git, Jupyter Notebook, and _pandas_
* Begin thinking like a data scientist


## Motivation

Data scientists everywhere now use Jupyter Notebook to explore data and share results. _pandas_ is the go-to Python library for data analysis. Knowing these tools well will help you not only succeed in this course but also wrangle any data you come across in the future.

## Outcomes

By the end of this lab, you will have the basic skill set to do data analysis with Python. You will be able to load, visualize, and analyze data with _pandas_. In addition, you will be able to navigate Jupyter Notebook and share code, explanations, and visualizations with others in notebook form. When we start working with machine learning in two weeks, your experience with _pandas_, Jupyter, and Git will allow you to focus on applying, rather than debugging, ML models.

## Grading

The tasks in this assignment will be graded for completion when you turn it in via Git. However, during the lab the TAs will also check off the various tasks as indicated. Don't worry about being any more precise than we ask. Do worry about learning _pandas_ well, though. We will use it a lot!

## Part 0: Preliminaries 
You should have already used the [DATA 1030 Software Setup Guide](https://docs.google.com/a/brown.edu/document/d/1-be-XHwFqKFYyOXjDbW6WAiG8OERUl_nYNmYqVWzn1o/edit?usp=sharing) to install Anaconda and Python and to create a GitHub account.

Once you’ve created your GitHub account, go to this link:
https://classroom.github.com/a/5zUwy6c7 
This will create a private DATA1030 classroom repository that you have write access to.
Today, only one person per pair needs to create a repository, but going forward, everyone will need their own private DATA1030 classroom repository.

To download this lab, open `Terminal` or `cmd.exe` and run the following:

```
cd ~/Desktop # Goes to your Desktop directory
mkdir data1030
cd data1030
git clone https://github.com/data1030/a0-intro-<YOUR_GITHUB_USERNAME>.git
cd a0-intro-<YOUR_GITHUB_USERNAME>
jupyter notebook # Opens the notebook in your browser
```

### 0.1: Getting started with Jupyter Notebook

Begin by touring the Notebook UI. Click on __Help__ > __User Interface Tour__ in the toolbar above to begin.

You can also refer to this GIF from [DataCamp](https://www.datacamp.com/) to get a basic understanding of how to use the notebook.
<img alt="Overview of the Jupyter Interface" src="http://community.datacamp.com.s3.amazonaws.com/community/production/ckeditor_assets/pictures/200/content_jupyternotebook3b.gif" style="width: 800px; height: 279px;">

In Jupyter Notebook, you can write and run Python cells as if you were in a Python interpreter. In addition, you can structure your code in modular cells and run all the cells as if the notebook were a `.py` script.

Finally, you can also create and run Markdown cells like this one. Try double-clicking on this cell and then running it. See the instructions above and right below for more details on running cells.

## Part 1: Exploring Data with DataFrames

Congratulations, you have gotten the notebook to run. Run the cell below by first clicking on it and then clicking the run button (>|) on the toolbar above. It should print "Hello World!"

In [None]:
print("Hello World!")

### 1.0 Contextualizing data

The first dataset we will be looking at relates to youth unemployment. We originally downloaded this data from [Kaggle.com](https://www.kaggle.com/sovannt/world-bank-youth-unemployment). Visit that page and scroll down to the dataset description for more details.

__Task 0:__ What is their definition of youth? What years are we looking at?  Where is the data from originally?

### 1.1 Reading data

Data is often stored in comma-separated values (CSV) files. We have downloaded the unemployment data into the file `API_ILO_country_YU.csv` for you.

To read a CSV file called `my_data.csv` into a DataFrame variable called `df` using pandas, we would run the following command:
```
df = pandas.read_csv('my_data.csv')
```
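To see `read_csv` in action without a file on disk, here is a minimal sketch that parses a tiny CSV from an in-memory string. The country names and numbers are made up for illustration and do not come from the lab's dataset.

```python
import io

import pandas

# Hypothetical CSV text; the values are invented for illustration.
csv_text = """Country Name,Country Code,2010,2011
Atlantis,ATL,12.5,13.0
Elbonia,ELB,30.1,28.4
"""

# read_csv accepts a path or any file-like object, so StringIO works here.
toy_df = pandas.read_csv(io.StringIO(csv_text))

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # header names are read in as strings
```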

__Task 1:__ Load the unemployment data from the above file into a variable called `df` below. If no error shows up, your DataFrame has most likely loaded.

In [None]:
import pandas
# TASK 1
### BEGIN SOLUTION
df = pandas.read_csv('API_ILO_country_YU.csv')
### END SOLUTION

### 1.2 Viewing data

Congrats. You've successfully loaded the data into a variable called `df`.

Run each of these `DataFrame` attributes and methods in the cell below, one at a time, substituting `df` for `DataFrame`. What do they do?

    DataFrame
    DataFrame.columns
    DataFrame.head()
    DataFrame.head(10)
    DataFrame.tail(4)
    DataFrame.nlargest(5, '2014')
    DataFrame.nsmallest(8, '2011')
    
What gets printed if you run two expressions in the same block?

In [None]:
# Run the methods here
df

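If you want to compare your observations against something small enough to check by eye, the sketch below calls a few of the same methods on a four-row toy DataFrame. The data is invented for illustration; only the method names come from the list above.

```python
import pandas

# Hypothetical toy data; the values are invented for illustration.
toy_df = pandas.DataFrame({
    'Country Name': ['Atlantis', 'Elbonia', 'Freedonia', 'Genovia'],
    '2014': [12.0, 31.5, 8.2, 20.1],
})

first_two = toy_df.head(2)            # the first 2 rows
last_one = toy_df.tail(1)             # the last row
top_two = toy_df.nlargest(2, '2014')  # the 2 rows with the largest 2014 values

print(top_two)
```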
Now that you know what these functions do, complete the following task.

__Task 2:__ Show the 12 countries with the highest unemployment rates in 2013.

In [None]:
# TASK 2
### BEGIN SOLUTION
df.nlargest(12, '2013')
### END SOLUTION

### 1.3 Transforming data

With DataFrames we can also do things like select
rows and columns. What do the following expressions
evaluate to?

#### Column selection:
    
    df['2010']
    df[['Country Name', '2011', '2012']]
    df.iloc[:, 3] # Gets all the rows in the 4th column
    df.iloc[:, 2:5]
    
#### Row selection:
    
    df.iloc[191] # Gets the 192nd row
    df.iloc[2:10]
    df[df['2010'] > 40]
    
#### Combinations:

    df.iloc[12:14, 0:4]
    df[['Country Name', '2011', '2012']][df['2012'] < 10]
    df[df['2012'] < 10].iloc[1:3, 2:5]

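As a rough illustration of the selection patterns above, the following sketch applies column selection, positional indexing, and a boolean mask to a three-row toy DataFrame. The data is made up, so the results are easy to verify by hand before trying the real expressions.

```python
import pandas

# Hypothetical toy data; the values are invented for illustration.
toy_df = pandas.DataFrame({
    'Country Name': ['Atlantis', 'Elbonia', 'Freedonia'],
    '2010': [45.0, 12.0, 41.0],
    '2011': [44.0, 13.5, 39.0],
})

col = toy_df['2010']        # a single column, returned as a Series
cell = toy_df.iloc[1, 2]    # row position 1, column position 2
mask = toy_df['2010'] > 40  # boolean Series with one entry per row
high = toy_df[mask]         # keeps only the rows where the mask is True

print(high)
```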
In [None]:
# Run the expressions here
df['2010']

__TA Check 1__: Raise your hand and a TA will come over to check your understanding of the above commands.

__Task 3:__ Show the `Country Name` and `Country Code` of all countries where

* the 2010 unemployment rate is higher than the 2014 unemployment rate, and
* the 2014 unemployment rate is above 20%

In [None]:
# TASK 3
### BEGIN SOLUTION
decreasing_df = df[df['2010'] > df['2014']]
# Keep only the two columns the task asks for
decreasing_df.loc[decreasing_df['2014'] > 20, ['Country Name', 'Country Code']]
### END SOLUTION

Now it is time to see more powerful features of DataFrames.
You can add, subtract, multiply, and divide columns as shown below

`df['2010'] + df['2012']`

You can easily add a new column to the DataFrame using the following syntax:

`df['new_col'] = df['2010'] + df['2012'] # add a column to df called new_col`

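Column arithmetic is element-wise: each row of the result is computed from the same row of the inputs. Here is a small sketch on invented data, where `change` is a hypothetical column name chosen for this example.

```python
import pandas

# Hypothetical toy data; the values are invented for illustration.
toy_df = pandas.DataFrame({
    'Country Name': ['Atlantis', 'Elbonia', 'Freedonia'],
    '2011': [20.0, 15.0, 30.0],
    '2012': [18.0, 16.5, 30.0],
})

# Element-wise subtraction: a negative value means the rate fell.
toy_df['change'] = toy_df['2012'] - toy_df['2011']

# Sort ascending so the largest drop (most negative change) comes first.
by_change = toy_df.sort_values('change')
print(by_change[['Country Name', 'change']])
```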
__Task 4:__ Display the names of the 10 countries with the smallest percentage decrease in unemployment between 2011 and 2012. 

In [None]:
# TASK 4
### BEGIN SOLUTION
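# 2012 - 2011 is negative where unemployment fell; nsmallest(10) picks the 10 most negative changes
df.loc[(df['2012'] - df['2011']).nsmallest(10).index]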
### END SOLUTION

### 1.4 Visualizing data

Our eyes are great at seeing patterns. Let's make some graphs to visualize and better understand the data. Run the code block below.  Notebook commands that start with % are called "magic" commands.  Make sure to run the "magic" %matplotlib command below in future labs so your visualizations show up in the browser. Here is a [spellbook](http://ipython.readthedocs.io/en/stable/interactive/magics.html).

In [None]:
%matplotlib inline

df.plot(kind='bar')

__Task 5:__ There are clearly too many rows being graphed. In a new bar chart below, show only the first 10 countries, with bars corresponding to the years 2010 to 2014, and label the x-axis by country name.

Read [`pandas.DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)
for information on how to do the above task.

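To get a feel for the `x` parameter of `.plot` before applying it to the real data, here is a minimal sketch on a toy DataFrame with invented values. The `matplotlib.use('Agg')` line is an assumption that lets the snippet run outside a notebook; inside Jupyter with `%matplotlib inline`, leave it out.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: render off-screen; omit this line inside a notebook

import pandas

# Hypothetical toy data; the values are invented for illustration.
toy_df = pandas.DataFrame({
    'Country Name': ['Atlantis', 'Elbonia', 'Freedonia'],
    '2010': [20.0, 15.0, 30.0],
    '2011': [18.0, 16.5, 30.0],
})

# x= names the column used for tick labels; the remaining numeric
# columns each contribute one bar per row.
ax = toy_df.plot(kind='bar', x='Country Name')
labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(labels)
```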
In [None]:
# TASK 5
### BEGIN SOLUTION
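# iloc[:10] selects exactly 10 rows; loc[0:10] would select 11 because label slices include the endpoint
df.iloc[:10][['Country Name', '2010', '2011', '2012', '2013', '2014']].plot(kind='bar', x='Country Name')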
### END SOLUTION

__Task 6:__ Use the `.plot` method again to see how unemployment in 2010 relates to unemployment in 2014 for each country.

Hint: What kind of chart would you use?

In [None]:
# TASK 6
### BEGIN SOLUTION
df.plot(kind='scatter', x='2010', y='2014')
### END SOLUTION

### 1.5 Exploring data independently

__Tasks 7-10:__ Use any combination of the above techniques and others (see the __Pandas Summary__) to answer the 4 questions below. You can also use your cheat sheet and the [docs](http://pandas.pydata.org/pandas-docs/stable/) to aid you.

Do comment about any ambiguities and issues you have with the question statement or your solution to the problem in the corresponding Python block.

_What is the average unemployment rate for each year?_

In [None]:
# <-- Start the line with a "hashtag" to write comments in Python
# TASK 7
### BEGIN SOLUTION
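# numeric_only=True skips the text columns, which recent pandas versions refuse to average
df.mean(numeric_only=True)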
### END SOLUTION

_Which countries underwent the greatest increase in unemployment during this time?_

In [None]:
# TASK 8
### BEGIN SOLUTION
df.loc[(df['2014'] - df['2010']).nlargest(12).index] # We avoid mutation with lots of ugliness
### END SOLUTION

_In which countries would you not recommend a 20-year-old try to find a job?_

In [None]:
# TASK 9
### BEGIN SOLUTION
df.nlargest(10, '2014')
### END SOLUTION

_Do you think the Arab Spring caused youth unemployment, or did youth unemployment lead to the Arab Spring?_

In [None]:
# TASK 10
### BEGIN SOLUTION
df.iloc[3:4, 1:7].plot(kind='bar')
### END SOLUTION

__TA Check 2:__ When you are ready, a TA will come over to review your work.

### 1.6 Foreshadowing

With your partner, reflect on this first analysis by discussing these questions.

1. How do you think the data was collected? What data collection errors could have led you to wrong insights?
2. Did you think that the data could be structured better? Were all of the strings in the country name column actually countries?
3. Did you ever feel like you needed to make your code run faster? In what scenario would you try to optimize code execution time?
4. How could you analyze this data in more depth? Would you want to create an interpretable model or an accurate model?
5. What would you do if you needed to share your work with others? How would you make your code more readable to another data scientist?

## Handing In

Congratulations, you are now done with Part 1 of this lab. Create a file called `LAB-AUTHORS.md` containing both of your full names and add it to your local repository, then commit all your work to one partner’s DATA1030 GitHub repository using Git:

```
git add LAB-AUTHORS.md
git commit -a -m "Completed lab 1" # -a also commits changes to already-tracked files
git push origin master
```

You can check that your updates worked by going to `a0-intro-<YOUR_GITHUB_USERNAME>` at https://github.com/data1030, checking the file dates (they should show that they were just updated), and examining the files' content.

## Part 2 (Optional) 
If you have time, try out Part 2, an analysis of a shopping dataset with 3 million rows from Instacart, a grocery delivery startup. Instructions are in the notebook `part2.ipynb`.

## Pandas Summary

In today's lab we learned how to use _pandas_ in Jupyter Notebook through data analysis. We went through the entire process, from loading the data to doing some basic analysis.

Below is a list of some important _pandas_ functions, many of which we used today.

### Pandas

#### Modules

    import pandas as pd

#### I/O

    pd.read_csv()
    df.to_csv()

#### Indexing

    df[column_name]
    df[[column1, column2, ...]]
    df.iloc[a:b,c:d]
    df[boolean condition]

#### Exploration

    df.describe()
    df.plot()

## Power Jupyter 
[Here](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/) is a great link to Jupyter power user tips and tricks. The commands below add a lot of interesting notebook extensions.

In [None]:
!pip install https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tarball/master
!pip install jupyter_nbextensions_configurator
!jupyter contrib nbextension install --user --skip-running-check
!jupyter nbextensions_configurator enable --user