# Introduction to Data Processing
Welcome back! We hope all is well, and that you're ready to dive back into processing your data! In this week's lesson, we'll cover some techniques for examining your dataset.

As we go, be sure to ask plenty of questions, and never hesitate to let us know if we're moving too quickly. 

## Importing Packages
Before we can get started with writing our notebook and diving into some data, we have to import some packages. Packages are pre-built bundles of code that make common tasks much easier than writing everything ourselves in plain Python. For starters, we'll be working with two of the most commonly used packages in Python data science: NumPy and Pandas.
### NumPy
NumPy, short for **Num**erical **Py**thon, is a package that provides us with tools for working with **arrays**, which are list-like collections of numbers.
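To make the idea of an array a little more concrete, here is a minimal sketch. The ages are made up for illustration and are not part of the course dataset.

```python
import numpy as np

# A made-up list of ages, converted into a NumPy array
ages = np.array([17, 16, 18, 15])

# Arithmetic is applied to every element of the array at once
ages_next_year = ages + 1

print(ages_next_year)  # [18 17 19 16]
print(ages.mean())     # 16.5
```

With a plain Python list, adding one to every value would need a loop; with an array, a single expression does it.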
### Pandas
Pandas is a package that comes with many built-in tools for examining and manipulating data. We'll use this package a lot throughout this course to help us understand and dig deeply into our data.

Without further ado, let's get started by importing both NumPy and Pandas. You'll notice that we're importing numpy "as np" and importing pandas "as pd". All this means is that we're choosing to rename the packages as we import them. This is only done because we're pretty lazy, and don't feel like typing out "numpy" or "pandas" every time we want to use the package; "np" and "pd" are much quicker to type.

In [2]:
import numpy as np
import pandas as pd

## Getting our data
Typically, this is one of the trickiest steps in data science; data rarely comes clean, and rarely comes bundled up in one convenient source. Luckily, we've pre-bundled the data so that it's easy to import and start working with. Our data is bundled up in a CSV, or **C**omma **S**eparated **V**alues file. All this means is that each line of the file is a row, and the values in each row are separated into columns by commas. For example, if we had a dataset that stored students' names and ages, the CSV file might look like:
```
Name,Age
Carlos,17
Sarah,16
```

Feel free to take a look at the file itself if you'd like to see how this works. For now, though, we'll read in the data using pandas' built-in **read_csv()** function. A function is a reusable piece of code that achieves a specific task; when a function belongs to an object, such as a table of data, we call it a **Method**. In this case, our package is pandas, and our task is to read in our data from a CSV file.
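If you'd like to see what read_csv does before touching the real file, here is a small sketch. It assumes a tiny made-up table of students, and it uses `io.StringIO` so that pandas can read a string as though it were a file; with a real dataset you would pass the file's path instead.

```python
import io
import pandas as pd

# A tiny made-up table written as CSV text (not the course dataset)
csv_text = "Name,Age\nCarlos,17\nSarah,16\n"

# io.StringIO lets pandas read a string as if it were a file
students = pd.read_csv(io.StringIO(csv_text))

print(students.shape)          # (2, 2)
print(list(students.columns))  # ['Name', 'Age']
```

The first line of the CSV becomes the column names, and every following line becomes one row.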

In [None]:
variable_name = "Ebola Virus Outbreak" #@param ["Wildfires and Bird Migration", "Yearly Carbon Fluctuations", "Ebola Virus Outbreak", "Climate Change"]


## Reading in your data
Let's go ahead and read in the dataset that you chose to work with. For starters, let's take a peek at the first five rows of the dataset using the **head()** method.

In [3]:
# TODO: replace the placeholder with the path to your dataset's CSV file
data = pd.read_csv('<PATH TO YOUR CSV FILE>')
data.head()

Pretty cool, huh? We can also use the **tail()** method to see the last five rows of our data.

In [4]:
data.tail()


One more method that we recommend you use when you first import data is the **describe()** method. It summarizes each numeric column with key statistics such as the count, mean, standard deviation, minimum, maximum, and quartiles. Let's check it out below.

In [1]:
data.describe()
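
If you want to see these three methods side by side without loading a file, here is a sketch on a made-up table. The name `toy` and its values are invented for illustration, so the notebook's own `data` variable is left alone.

```python
import pandas as pd

# Made-up data standing in for a real dataset
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cho", "Dev", "Eli", "Fay", "Gus"],
    "Age": [15, 16, 17, 18, 16, 15, 17],
})

first_rows = toy.head()    # first five rows by default
last_rows = toy.tail(2)    # passing a number changes how many rows we get
summary = toy.describe()   # summary statistics for the numeric columns

print(first_rows)
print(last_rows)
print(summary)
```

Notice that the summary only covers the *Age* column: by default, describe() skips columns that don't hold numbers.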


## Accessing data values
Now that we're able to read in the data, let's see how we can access certain values from the dataset. You may have noticed by now that our dataset is arranged in rows and columns, similar to an Excel sheet. Pandas stores this table in an object called a **DataFrame**, and conveniently lets us access each column by its name. Each column within a Pandas DataFrame is called a **Series**. For example, if our dataset included a column named *People*, we could access that column using **data['People']**. Let's try it out: choose one of the columns that you see from the **data.head()** or **data.tail()** cells, and access that column of data.

In [5]:
# TODO: Access a column from your data
data['<COLUMN NAME>']


In [6]:
## TODO: Access another column
data['<ANOTHER COLUMN NAME>']
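
For comparison, here is roughly what accessing a column looks like on a tiny made-up table. The `people` table is an assumption for illustration only; your own column names will differ.

```python
import pandas as pd

# A made-up table with two columns
people = pd.DataFrame({
    "Name": ["Carlos", "Sarah"],
    "Age": [17, 16],
})

# Square brackets with a column name give back that column as a Series
ages = people["Age"]

print(type(ages).__name__)  # Series
print(ages.tolist())        # [17, 16]
```

The name inside the brackets has to match the column header exactly, including capitalization and spaces.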


### Challenge! 
Here's a tricky one: what if we want to access multiple columns of data at a time? See if you can do this yourself below.

In [7]:
## TODO: Access two columns at once


Perhaps instead of getting a column from our data, we want to access a specific row. We can achieve this by using **iloc**. Here, we can pass in **integer** values representing the position of the row we want to fetch. Here's the catch: Python is **zero-indexed**, which means that instead of counting from one, it starts counting from zero. This is a bit tricky to get used to at first, so make sure to practice this below. The code below gets the first row from the data.

In [None]:
data.iloc[0]
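
To see zero-indexing in action, here is a short sketch on a made-up table of three people (the names and ages are invented for illustration).

```python
import pandas as pd

# A made-up table with three rows
people = pd.DataFrame({
    "Name": ["Carlos", "Sarah", "Maya"],
    "Age": [17, 16, 18],
})

first_row = people.iloc[0]  # position 0 is the first row
third_row = people.iloc[2]  # position 2 is the third row

print(first_row["Name"])  # Carlos
print(third_row["Name"])  # Maya
```

A single row comes back as a Series, so we can look up one of its values by column name, as the print statements do.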

Another important tool at your disposal in Python, and Pandas, is **slicing** the data. This allows us to select multiple rows at once. For example, if we wanted to select rows three through five of the data (counting from zero), we would use the code block below.

In [1]:
data.iloc[3:6]


You may have noticed that although the slice starts at three, we tell it to end at six. Why not five, if we only want rows three through five? Slicing works by including the first number specified, but excluding the last number specified. For example, ```iloc[10:20]``` fetches rows ten through nineteen, and ```iloc[7:10]``` selects rows seven through nine. You're absolutely justified in being confused by this at first, but don't worry; with practice, this will become much easier to understand. Try out some more slicing techniques below.
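
One way to convince yourself of the include-the-start, exclude-the-end rule is to slice a table whose values make the positions obvious. The sketch below assumes a made-up column where each value is ten times its row position.

```python
import pandas as pd

# Made-up data: the value in each row is 10 times the row's position
numbers = pd.DataFrame({"value": [0, 10, 20, 30, 40, 50, 60, 70]})

# Start at position 3 (included) and stop at position 6 (excluded)
subset = numbers.iloc[3:6]

print(subset["value"].tolist())  # [30, 40, 50]
print(len(subset))               # 3
```

A handy check: the number of rows you get back is the end number minus the start number, here 6 - 3 = 3.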

In [None]:
# TODO: select rows two through ten
data.iloc[  ]

In [None]:
# TODO: select rows seven through fourteen
data.iloc[  ]

### Challenge!
There's a straightforward way to solve this one, and an even shorter shorthand as well. We haven't taught the shorthand yet, so props to you if you can figure it out.

In [None]:
# TODO: select all rows up to row ten
data.iloc[  ]

In [None]:
# TODO: select all rows after row ten
data.iloc[  ]

## Getting Statistics from our Data
Now that we're able to access columns from our data, let's try getting some key statistics. We can start with the **mean**, **median**, and **standard deviation**. If you're unfamiliar with them, be sure to check out our article **LINK ARTICLE** or this YouTube video **LINK VIDEO**.

In order to get the mean, median, and standard deviation of different columns in our data, we must first identify columns that contain **quantitative** data. The easiest way to identify quantitative data is to look for columns filled with numbers that measure or count something. For example, if we had a dataset that contained people's height and name,
```
Name,Height (inches)
Carlos,70
Sarah,66
```
Then our quantitative data would be the *Height* column. The *Name* column, on the other hand, is called **categorical** data. This just means that this data belongs in categories (yep, you can consider your name a category). If you're having trouble understanding quantitative or categorical data, we recommend you check out **LINK VIDEO**.
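
If scanning the table by eye feels unreliable, pandas can also tell us which columns hold numbers. The sketch below uses a made-up height table; `dtypes` and `select_dtypes` are standard pandas tools, but treat the result as a first guess, since a numeric column (an ID number, say) can still be categorical in meaning.

```python
import pandas as pd

# A made-up table with one text column and one numeric column
people = pd.DataFrame({
    "Name": ["Carlos", "Sarah"],
    "Height (inches)": [70, 66],
})

# dtypes lists the kind of data stored in each column
print(people.dtypes)

# select_dtypes keeps only the columns that hold numbers
numeric_columns = people.select_dtypes(include="number").columns.tolist()
print(numeric_columns)  # ['Height (inches)']
```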

For now, we'll calculate the mean, median, and standard deviation of a quantitative column of your choice. Go ahead and pick one out, and fill it in below.

In [8]:
data['<QUANTITATIVE COLUMN>'].mean()


In [9]:
data['<QUANTITATIVE COLUMN>'].median()


In [10]:
data['<QUANTITATIVE COLUMN>'].std()
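
To check your intuition, here are the same three statistics on a made-up column of heights that is small enough to verify by hand. One detail worth knowing: pandas' std() computes the *sample* standard deviation (it divides by n - 1), so it can differ slightly from calculators that divide by n.

```python
import pandas as pd

# Made-up heights in inches
heights = pd.Series([70, 66, 68, 64, 72])

print(heights.mean())    # 68.0
print(heights.median())  # 68.0

# Sample standard deviation (divides by n - 1); this is pandas' default
sample_std = heights.std()

# Population standard deviation (divides by n), requested with ddof=0
population_std = heights.std(ddof=0)

print(sample_std)
print(population_std)
```

Here the squared distances from the mean add up to 40, so the sample version is the square root of 40 / 4 and the population version is the square root of 40 / 5.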


### Jupyter Notebook Quick Tip
Wow, we've just started and it looks like we've already learned so much. In case you forget what any of the code does later, you can hover over a line of code in Google Colab to see what it does. Try it out above by hovering over **.std()**  

## Practice 
Great work today. Let's make sure you're really understanding the data. 