# Lab 1: Introduction to Data Science

Due: Thursday, September 7 at 2 pm

## Today's Goals:

* Familiarize yourself with Jupyter Notebook and _pandas_
* Begin thinking like a data scientist
* Start seeing and writing maintainable code

## Motivation

Data scientists everywhere now use Jupyter Notebook to explore data and share results. _pandas_ is the go-to Python library for data analysis. Knowing these tools well will help you not only succeed in this course but also wrangle any data you come across in the future.

## Grading

The tasks in this assignment will be graded for completion by the eye test. Don't worry about being any more precise than we ask. Do worry about learning _pandas_ well though. We will use it a lot!

## Part 0: Getting started with Jupyter Notebook

Begin by touring the Notebook UI. Click on __Help__ > __User Interface Tour__ in the toolbar above to begin.

You can also refer to this GIF from DataCamp to get a basic understanding of how to use the notebook.
<img alt="Overview of the Jupyter Interface" src="http://community.datacamp.com.s3.amazonaws.com/community/production/ckeditor_assets/pictures/200/content_jupyternotebook3b.gif" style="width: 800px; height: 279px;">

In Jupyter Notebook, you can write and run Python cells as if you were in a Python Interpreter. In addition, you can structure your code in modular cells and run all the cells as if the notebook were a .py script.

Finally, you can also create and run Markdown cells like this one. Try double-clicking on this cell and then running it. See the instructions above and right below for more details on running cells.

## Part 1: Exploring Data with DataFrames

Congratulations, you have gotten the notebook to run. Run the cell below by first clicking on it and then clicking the run button (>|) on the toolbar above. It should print "Hello!"

In [None]:
import pandas
print("Hello!")

### 1.0 Contextualizing data

The first dataset we will be looking at relates to youth unemployment. Go [here](https://www.kaggle.com/sovannt/world-bank-youth-unemployment) and scroll down to the dataset description for more details. What is their definition of youth? What years are we looking at?

### 1.1 Reading data

Data is often stored in comma-separated values (CSV) files. We have downloaded the unemployment data into the file `API_ILO_country_YU.csv` for you.

To read a CSV file called `my_data.csv` with pandas, we would run the following command:
```
pandas.read_csv('my_data.csv')
```
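
If you want to see what `read_csv` returns before touching the lab file, here is a minimal sketch. It assumes a made-up two-row CSV held in a string (`csv_text` and the name `toy` are hypothetical, not part of the lab data) and reads it through `io.StringIO`, which `read_csv` accepts in place of a file path.

```python
import io

import pandas

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,score\nAda,90\nGrace,85\n"

# read_csv accepts a file path or any file-like object.
toy = pandas.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names are taken from the header row
```

Reading a real file works the same way, with the file name passed as the first argument.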

__Task 1:__ Load the unemployment data from the above file into a variable called `dataframe` below. If no error shows up, your DataFrame has most likely loaded.

In [None]:
# TASK 1
dataframe = 

### 1.2 Viewing data

Congrats. You've successfully loaded the data into an object called a `DataFrame`.

Run each of these `DataFrame` expressions in the cell below, one at a time. What do they do?

    dataframe
    dataframe.columns
    dataframe.head()
    dataframe.head(10)
    dataframe.tail(4)
    dataframe.nlargest(5, '2014')
    dataframe.nsmallest(8, '2011')
    
What gets printed if you run 2 expressions in the same block?
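
To check your guesses on something small, the sketch below applies a few of the same methods to a hypothetical six-row table (`toy`, `city`, and `temp` are made-up names, not columns of the lab data).

```python
import pandas

# Hypothetical toy data, not the lab's dataset.
toy = pandas.DataFrame({
    "city": ["A", "B", "C", "D", "E", "F"],
    "temp": [21, 35, 18, 30, 27, 12],
})

first_two = toy.head(2)            # first 2 rows
last_two = toy.tail(2)             # last 2 rows
hottest = toy.nlargest(3, "temp")  # 3 rows with the largest "temp", largest first
coldest = toy.nsmallest(2, "temp") # 2 rows with the smallest "temp", smallest first

print(hottest)
```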

In [None]:
# Run the methods here
dataframe

Now that you know what these functions do, complete the following task.

__Task 2:__ Show the 12 countries with the highest unemployment rates in 2013.

In [None]:
# TASK 2

### 1.3 Transforming data

With DataFrames we can also do things like select
rows and columns. What do the following expressions
evaluate to?

#### Column selection:
    
    dataframe['2010']
    dataframe[['Country Name', '2011', '2012']]
    dataframe.iloc[:, 3] # Gets all the rows in the 4th column
    dataframe.iloc[:, 2:5]
    
#### Row selection:
    
    dataframe.iloc[191] # Gets the 192nd row
    dataframe.iloc[2:10]
    dataframe[dataframe['2010'] > 40]
    
#### Combinations:

    dataframe.iloc[12:14, 0:4]
    dataframe[['Country Name', '2011', '2012']][dataframe['2012'] < 10]
    dataframe[dataframe['2012'] < 10].iloc[1:3, 2:5]
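
The same selection patterns can be tried on a table small enough to verify by eye. This is only a sketch on hypothetical data (`toy`, `name`, `x`, and `y` are made up). The last line also shows one way to combine two conditions, using `&` with each condition in parentheses.

```python
import pandas

# Hypothetical toy data, not the lab's dataset.
toy = pandas.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [1, 5, 3, 8],
    "y": [4, 2, 6, 7],
})

col = toy["x"]              # a single column comes back as a Series
sub = toy[["name", "y"]]    # a list of column names gives a DataFrame
block = toy.iloc[1:3, 0:2]  # rows 1-2 and columns 0-1 (the end of each slice is excluded)
big_x = toy[toy["x"] > 2]   # boolean mask keeps rows where the condition is True

# Combine conditions element-wise with & (and) or | (or); the parentheses are required.
both = toy[(toy["x"] > 2) & (toy["y"] > toy["x"])]
print(both)
```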

In [None]:
# Run the expressions here

__TA Check 1__: Raise your hand and a TA will come over to check your understanding of the above commands.

__Task 3:__ Show the `Country Name` and `Country Code` of all countries where

* the 2010 unemployment rate is higher than the 2014 unemployment rate, and
* the 2014 unemployment rate is above 20%

In [None]:
# TASK 3

Now it is time to see more powerful features of DataFrames.
You can add, subtract, multiply, and divide columns as shown below:

`dataframe['2010'] + dataframe['2012']`

You can add a new column to the dataframe using the following syntax:

`dataframe['new_col'] = dataframe['2010'] + dataframe['2012']`
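
As a small illustration of column arithmetic, the sketch below computes a difference between two columns of a hypothetical price table (all names here are made up) and stores it as a new column.

```python
import pandas

# Hypothetical toy data, not the lab's dataset.
toy = pandas.DataFrame({
    "item": ["pen", "book", "bag"],
    "price_old": [2.0, 10.0, 25.0],
    "price_new": [2.5, 9.0, 25.0],
})

# Arithmetic between two columns is applied row by row.
toy["change"] = toy["price_new"] - toy["price_old"]

# The new column behaves like any other, so it works with nlargest and friends.
biggest_rise = toy.nlargest(1, "change")
print(toy)
```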

__Task 4:__ Display the names of the 10 countries with the smallest decrease in unemployment between 2011 and 2012.

In [None]:
# TASK 4

### 1.4 Visualizing data

Our eyes are great at seeing patterns. Let's make some graphs to visualize and better understand the data. Run the code block below.

In [None]:
# Make sure to run the below "magic" command in future labs so your visualizations show up in the browser.
%matplotlib inline

dataframe.plot(kind='bar')

__Task 5:__ There are clearly too many rows being graphed. In a new bar chart below, show only the first 10 countries, with bars corresponding to the years 2010 to 2014, and label the x-axis by country name.

Read [`pandas.DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)
for information on how to do the above task.
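
The `x`, `y`, and `kind` parameters described in that documentation can be tried on a tiny table first. This is a sketch on hypothetical data (`toy`, `fruit`, `q1`, and `q2` are made up), and it assumes a script-style run with the non-interactive `Agg` backend; inside the notebook you would rely on `%matplotlib inline` instead.

```python
import matplotlib

# Non-interactive backend so this also runs outside a notebook (an assumption of this sketch).
matplotlib.use("Agg")

import pandas

# Hypothetical toy data, not the lab's dataset.
toy = pandas.DataFrame({
    "fruit": ["apple", "banana", "cherry"],
    "q1": [3, 5, 2],
    "q2": [4, 4, 6],
})

# x picks the column used for the axis labels, y picks the columns drawn as bars.
ax = toy.plot(x="fruit", y=["q1", "q2"], kind="bar")
ax.set_ylabel("units sold")
```

`plot` returns a matplotlib `Axes` object, which is what lets you adjust labels afterwards.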

In [None]:
# TASK 5

__Task 6:__ Use the `.plot` method again to see how unemployment in 2010 relates to unemployment in 2014 for each country.

Hint: What kind of chart would you use?

In [None]:
# TASK 6

### 1.5 Exploring data independently

__Task 7-10:__ Use any combination of the above techniques and others (like writing) to solidly answer the 4 questions below. Use your cheat sheet and the [docs](http://pandas.pydata.org/pandas-docs/stable/) to aid you.

Do comment about any ambiguities and issues you have with the question statement or your solution to the problem in the corresponding Python block.

_What is the average unemployment rate for each year?_

In [None]:
# <-- Start the line with a "hashtag" to write comments in Python

_Which countries underwent the greatest increase in unemployment during this time?_

_In which countries would you not recommend a 20-year-old try to find a job?_

_Do you think the Arab Spring caused youth unemployment, or did youth unemployment lead to the Arab Spring?_

__TA Check 2:__ When you are ready, a TA will come over to review your work.

### 1.6 Foreshadowing

With a partner, reflect on this first analysis by answering these questions.

1. How do you think the data was collected? How could this have led you to wrong insights?
2. Did you think that the data could be structured better? Were all of the strings in the country name column actually countries?
3. Did you ever feel like you needed to make your code run faster? In what scenario would you try to optimize code execution time?
4. How could you analyze this data in more depth? Would you want to create an interpretable model or an accurate model?
5. What would you do if you needed to share your work with others? How would you make your code more readable to another data scientist?

## Summary

In today's lab, we learned how to use _pandas_ in Jupyter Notebook through data analysis. We went through the whole process, from loading the data to doing some basic analysis.

Here is a summary of the _pandas_ functions we used today.

### Pandas

#### Modules

    import pandas as pd

#### I/O

    pd.read_csv()
    df.to_csv()

#### Indexing

    df[column_name]
    df[[column1, column2, ...]]
    df.iloc[a:b,c:d]
    df[boolean condition]

#### Exploration

    df.describe()
    df.plot()
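
Two entries above, `df.to_csv()` and `df.describe()`, did not appear in the tasks. The sketch below shows both on a hypothetical two-column table (`df`, `a`, and `b` are made-up names), assuming you only want to inspect the results rather than write a file.

```python
import pandas as pd

# Hypothetical toy data, not the lab's dataset.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]})

# describe() summarizes each numeric column: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv` writes the same text to disk.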

## Hand In

Push all of your changes to your GitHub fork.

## (Optional) Part 2
If you have time, try out Part 2, an analysis of a shopping dataset with 3 million rows from Instacart, a grocery delivery startup. Instructions are in the notebook `part2.ipynb`.