# Lesson Two: Introduction to Data Processing
Welcome back! We hope all is well, and that you're ready to dive back into processing your data! In this week's lesson, we'll cover some techniques for examining your dataset.

As we go, be sure to ask plenty of questions, and never hesitate to let us know if we're moving too quickly. 

# 📊 Section One: Working with Data

## Importing Packages
Before we can get started with writing our notebook and diving into some data, we have to import some packages. Packages are pre-built bundles of code that let us carry out common tasks that would be difficult or tedious to write ourselves in plain Python.

![Panda](https://media.giphy.com/media/EatwJZRUIv41G/giphy.gif) 

### NumPy
NumPy, short for **Num**erical **Py**thon, is a package that provides us with tools for working with lists of numbers, or **arrays**.
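
To make the idea of an array a little more concrete, here is a minimal sketch. The ages are made-up values used only for illustration; they are not part of the course data.

```python
import numpy as np

# A small, made-up list of ages (illustrative values only)
ages = np.array([17, 16, 18, 17])

# Arithmetic applies to every element at once, with no loop needed
ages_next_year = ages + 1

print(ages_next_year)
print(ages.mean())
```

Adding `1` to the array adds one to every age, which is the kind of shortcut that makes arrays handy for numerical work.
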
### Pandas
Pandas is a package that comes with many built-in tools for examining and manipulating data. We'll use this package a lot throughout this course to help us understand and dig deeply into our data.

In [2]:
import numpy as np
import pandas as pd

## Getting our data
Our data is bundled up in a CSV, or **C**omma **S**eparated **V**alues file. All this means is that our data is divided into rows and columns by commas. For example, if we had a dataset that stored students' names and ages, the CSV file may look like:
```
Name,Age
Carlos,17
Sarah,16
```

Feel free to take a look at the file itself if you'd like to see how this works. For now, though, we'll read in the data using pandas' built-in ``read_csv`` function. A **function** is a named piece of code that carries out a specific task; when a function belongs to an object, such as the table of data we are about to create, it is called a **method**. In this case, our package is pandas, and our task is to read in our data from a CSV file.
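
If you'd like to experiment without a file on disk, the sketch below reads the small student example from a string instead. The names `csv_text` and `students` are made up for this illustration, and `io.StringIO` is simply a standard-library way to make a string behave like a file.

```python
import io
import pandas as pd

# The student example, held in a string instead of a file
csv_text = "Name,Age\nCarlos,17\nSarah,16\n"

# io.StringIO lets read_csv treat the string as if it were a file
students = pd.read_csv(io.StringIO(csv_text))

print(students)
print(students.shape)  # (number of rows, number of columns)
```

The first line of the CSV becomes the column names, and each remaining line becomes one row.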

In [3]:
data = pd.read_csv('../data/florida_covid_data_the_atlantic.csv')

## Reading in your data

![Reading](https://media.giphy.com/media/SiMcadhDEZDm93GmTL/giphy.gif) 

Let's go ahead and look at the COVID data we just read in. For starters, let's take a peek at the first five rows of the dataset using the `head()` method. By default, this method returns the first five rows; passing a number, as in `head(10)`, returns that many rows instead.

In [4]:
data.head()

Unnamed: 0,date,state,dataQualityGrade,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
0,2021-01-24,FL,A,25693.0,,132,,71219.0,71219.0,6727.0,...,18397824,102178,650372.0,,630838.0,,9862348.0,37009,16456335.0,82643
1,2021-01-23,FL,A,25561.0,,156,,71037.0,71037.0,6711.0,...,18295646,169590,650372.0,,630838.0,,9825339.0,61098,16373692.0,138643
2,2021-01-22,FL,A,25405.0,,277,,70767.0,70767.0,6904.0,...,18126056,83831,650372.0,,630838.0,,9764241.0,35596,16235049.0,82008
3,2021-01-21,FL,A,25128.0,,163,,70306.0,70306.0,7023.0,...,18042225,131557,650372.0,,630838.0,,9728645.0,50506,16153041.0,111467
4,2021-01-20,FL,A,24965.0,,145,,69954.0,69954.0,7141.0,...,17910668,89293,650372.0,,630838.0,,9678139.0,38782,16041574.0,85876


Pretty cool, huh? Likewise, we can also use the ``tail()`` method to see the end of our data. By default, this method returns the last five rows of our dataset.

In [5]:
data.tail()

Unnamed: 0,date,state,dataQualityGrade,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
357,2020-02-02,FL,,,,0,,,,,...,4,0,,,,,,0,,0
358,2020-02-01,FL,,,,0,,,,,...,4,0,,,,,,0,,0
359,2020-01-31,FL,,,,0,,,,,...,4,3,,,,,,0,,0
360,2020-01-30,FL,,,,0,,,,,...,1,0,,,,,,0,,0
361,2020-01-29,FL,,,,0,,,,,...,1,0,,,,,,0,,0
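
Both methods also accept a number if five rows is not what you want. Here is a small sketch on a made-up, ten-row table (`toy` and its values are illustrative only, not the COVID data):

```python
import pandas as pd

# A toy DataFrame with ten rows (values are made up for illustration)
toy = pd.DataFrame({"day": range(1, 11), "cases": range(10, 110, 10)})

first_three = toy.head(3)  # first three rows instead of the default five
last_two = toy.tail(2)     # last two rows instead of the default five

print(first_three)
print(last_two)
```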


One more method that we recommend you use when you first import data is the ``describe()`` method. It summarizes each numerical column with key statistics such as the count, mean, standard deviation, minimum, and maximum. Let's check it out below.

In [6]:
data.describe()

Unnamed: 0,death,deathConfirmed,deathIncrease,deathProbable,hospitalized,hospitalizedCumulative,hospitalizedCurrently,hospitalizedIncrease,inIcuCumulative,inIcuCurrently,...,totalTestResults,totalTestResultsIncrease,totalTestsAntibody,totalTestsAntigen,totalTestsPeopleAntibody,totalTestsPeopleAntigen,totalTestsPeopleViral,totalTestsPeopleViralIncrease,totalTestsViral,totalTestsViralIncrease
count,320.0,101.0,362.0,0.0,310.0,310.0,199.0,362.0,0.0,0.0,...,362.0,362.0,179.0,0.0,238.0,0.0,257.0,362.0,253.0,362.0
mean,10127.484375,3865.910891,70.975138,,32906.570968,32906.570968,4756.673367,196.737569,,,...,5596519.0,50822.71547,549156.972067,,460886.680672,,4952040.0,27244.055249,7385879.0,45459.48895
std,7971.130909,1904.940771,62.885746,,21817.513102,21817.513102,2300.354303,151.492849,,,...,5558225.0,42841.097852,95388.473029,,155350.709657,,2601348.0,36501.65577,4361042.0,54512.298635
min,2.0,1403.0,0.0,,158.0,158.0,2005.0,0.0,,,...,1.0,0.0,358283.0,,123552.0,,579604.0,-190.0,746124.0,0.0
25%,2521.25,2446.0,17.25,,11176.0,11176.0,2506.0,77.25,,,...,363309.2,12273.75,446669.0,,353026.0,,2819000.0,0.0,3775936.0,0.0
50%,9783.5,3266.0,55.0,,36833.5,36833.5,4334.0,180.5,,,...,4218730.0,46614.0,539844.0,,493810.0,,5024730.0,26275.5,7142713.0,41369.5
75%,17183.5,4912.0,107.0,,51077.0,51077.0,6897.0,283.75,,,...,9624810.0,77791.5,650372.0,,607025.0,,6850174.0,42768.75,10365370.0,71154.5
max,25693.0,8685.0,277.0,,71219.0,71219.0,9520.0,623.0,,,...,18397820.0,247151.0,650372.0,,630838.0,,9862348.0,579604.0,16456340.0,746124.0
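
Notice that the `count` row differs from column to column. That is because `count` only counts values that are not missing (`NaN`). The sketch below shows this on a tiny made-up table; the column names and numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# A toy table where one column has a missing value
toy = pd.DataFrame({
    "deaths": [2.0, 4.0, np.nan, 6.0],  # one missing value
    "tests": [10, 20, 30, 40],          # no missing values
})

summary = toy.describe()

print(summary.loc["count"])  # non-missing values per column
print(len(toy))              # total number of rows
```

Here `deaths` has a count of 3 even though the table has 4 rows, because one of its values is missing.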


# 🔎 Section Two: Accessing our Data

## Accessing data values
Let's see how we can access certain values from the dataset. You may have noticed by now that our dataset is arranged in rows and columns, similar to an Excel sheet. Each column within a Pandas DataFrame is called a **Series**. For example, we can view the total number of hospitalized people using ``data['hospitalizedCumulative']``. Let's try it out: Choose one of the columns that you see from the ``data.head()`` or ``data.tail()`` cells, and access that column of data.

In [12]:
# Get the values from the 'hospitalizedCumulative' column
data['hospitalizedCumulative']

0      71219.0
1      71037.0
2      70767.0
3      70306.0
4      69954.0
        ...   
357        NaN
358        NaN
359        NaN
360        NaN
361        NaN
Name: hospitalizedCumulative, Length: 362, dtype: float64

In [None]:
# TODO: Access a column from your data
data['<COLUMN NAME>']

In [None]:
## TODO: Access another column
data['<ANOTHER COLUMN NAME>']

### Challenge! 
Here's a tricky one: What if we want to access multiple columns of data at a time? See if you can do this yourself below.

In [None]:
## TODO: Access two columns at once


Perhaps instead of getting a column from our data, we want to access a specific row. We can achieve this by using **iloc**. Here, we can pass in **integer** values representing which row we want to fetch.

In [None]:
data.iloc[0]
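
As a small sketch of what `iloc` gives back, the example below uses a made-up two-row table (`toy` is illustrative only). A single row comes back as a Series, so its values can be looked up by column name.

```python
import pandas as pd

# A tiny made-up table for illustration
toy = pd.DataFrame({"name": ["Carlos", "Sarah"], "age": [17, 16]})

first_row = toy.iloc[0]   # the row at position 0 (the first row)
last_row = toy.iloc[-1]   # negative positions count from the end

print(first_row["name"])
print(last_row["age"])
```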

Another important tool at your disposal in Python and Pandas is **slicing**. This allows us to select multiple rows at once. For example, if we wanted to select rows three through five of the data, we would use the code block below. Remember that although the leftmost column shows the numbers 2, 3, and 4, Python is zero-indexed, so these are actually the third, fourth, and fifth rows.

In [None]:
data.iloc[2:5]

You may have noticed that although we want rows three through five, the slice is written as ``2:5``. Why not ``2:4``, since the fifth row sits at index 4? Slicing works by including the first number specified but excluding the last. For example, ``iloc[10:20]`` fetches the rows at indices 10 through 19 (the eleventh through twentieth rows), and ``iloc[7:10]`` selects the rows at indices 7 through 9 (the eighth through tenth rows). You're absolutely justified in being confused by this at first, but don't worry; with practice, this will become much easier to understand. Try out some more slicing techniques below.
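
To see the "include the start, exclude the stop" rule in action, here is a minimal sketch on a made-up table whose only column spells out each row's position in plain English (`toy` and `label` are illustrative names, not part of the COVID data):

```python
import pandas as pd

# Ten rows; each label says which row it is when counting from one
toy = pd.DataFrame({"label": ["first", "second", "third", "fourth", "fifth",
                              "sixth", "seventh", "eighth", "ninth", "tenth"]})

# Positions 2, 3, and 4 are kept; position 5 is excluded
subset = toy.iloc[2:5]

print(subset)
print(len(subset))  # stop minus start: 5 - 2 = 3 rows
```

A handy check is that the number of rows you get back is always the stop value minus the start value.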

In [None]:
# TODO: select rows two through ten
data.iloc[  ]

In [None]:
# TODO: select rows seven through fourteen
data.iloc[  ]

### Challenge!
This one can be solved with the slicing you already know, but there's also a shorthand that makes it even simpler. We haven't taught it yet, so props to you if you can figure out the shorthand way to select the data.

In [None]:
#TODO: select all rows up to row ten
data.iloc[  ]

In [22]:
# TODO: select all rows after row ten
data.iloc[  ]


# 🧮 Section Three: Calculating Statistics

## Getting Statistics from our Data

![Numbers](https://media.giphy.com/media/d8isjk1UBPFTm0EBbd/giphy.gif)

Now that we're able to access columns from our data, let's try getting some key statistics. We can start with the **mean**, **median**, and **standard deviation** for some of our **quantitative** data.

For now, we'll calculate the mean, median, and standard deviation for the number of daily COVID deaths within the dataset. Note that these values match the ones reported for ``deathIncrease`` by the `describe()` method we used before (the median appears there as the ``50%`` row).

In [13]:
data['deathIncrease'].mean()

70.97513812154696

In [14]:
data['deathIncrease'].median()

55.0

In [15]:
data['deathIncrease'].std()

62.88574646695885
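
If you'd like to check these methods by hand, the sketch below uses four made-up numbers (the Series `daily` is illustrative only). One detail worth knowing: by default, pandas' `.std()` computes the *sample* standard deviation, which divides by one less than the number of values; passing `ddof=0` divides by the number of values instead.

```python
import pandas as pd

# Four made-up daily counts, small enough to verify by hand
daily = pd.Series([2, 4, 4, 10])

mean_value = daily.mean()           # (2 + 4 + 4 + 10) / 4
median_value = daily.median()       # middle of the sorted values 2, 4, 4, 10
sample_std = daily.std()            # divides by n - 1 (the pandas default)
population_std = daily.std(ddof=0)  # divides by n

print(mean_value, median_value)
print(sample_std, population_std)
```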

### ✨ Jupyter Notebook Quick Tip
Wow, we've just started and it looks like we've already learned so much. In case you forget what any of the code does later, you can hover over a line of code in Google Colab to see what it does. Try it out above by hovering over ``.std()``

# Practice 
Great work today. For practice this week, we want you to get a taste of what's known as **Exploratory Data Analysis** (EDA). You can think of EDA as simply 'getting to know the data'.

**Question One:** To start, let's take a look at the first few rows of your dataset using the ``.head()`` method.

**Question Two:** Do you notice any variables that are interesting? Which variables are quantitative (numerical) vs. qualitative (categorical)?

**Question Three:** Let's try accessing one of your columns in the dataset. If you need a refresher on how to do this, check out the 'Accessing Data Values' section above.

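**Question Four:** Now let's check out the size of your dataset. Use ``.describe()`` and look at the ``count`` row, which shows how many non-missing values each numerical column has. For a column with no missing values, this equals the number of rows in your dataset.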

**Question Five:** Let's grab rows 10-20 from your dataset. If you're not sure how to do this, try looking at the ``.iloc`` examples we worked through above.

**Question Six:** Finally, let's sort your dataset by the values in one of its columns, in ascending order. We haven't taught you how to do this yet, but take a look at the ``.sort_values()`` method. **Tip:** Think about what happens if you type `ascending = False` inside the parentheses...

### Challenge
Time for the challenge question! Get the mean of the first 10 values and the mean of the last 10 values of one of the columns in your dataset. This will involve ``.iloc`` and ``.mean()``. Set these two values to ``mean_first`` and ``mean_last``, then print them out.

In [None]:
#mean_first = 
#mean_last = 

Okay, that's all we have for this week. Please feel free to reach out to us through email or attend our weekly office hours for questions or help on the practice problems. If you want a great cheat sheet to remember what you learned today (and possibly learn more cool Pandas tricks), be sure to check out https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf.