# Data Science: Working with Datasets

## Introduction

When we have collected a lot of data, we need something bigger than a list to keep it organized. This is what **datasets** are used for. They are organized with rows and columns. The columns represent different traits that we are interested in. There is one row for each piece of data we collect. We'll start by **reading in**, or bringing in, a sample dataset. 

In [1]:
# First, we import the pandas library, which is used often in Python data science.
# We'll also import our old friend seaborn for plotting and the special command that goes with it. 
import pandas
import seaborn
%matplotlib inline

In [4]:
# Now, we'll use a pandas function called read_csv to read in the dataset, which is about ice cream sundaes.
# Notice that we set the dataset to a variable so that we can use its built-in functions later. 
sundaes = pandas.read_csv('sundaes.csv') 

# And now, we'll preview the data set.
print(sundaes) 

In [None]:
# What happens if we take away print() around sundaes?  Try it! (Remember: this might look different outside Jupyter.)


Okay, there's a lot to see here, but don't get overwhelmed. Let's look through the rows and columns to see what's happening. 

## Now let's see what we can learn by using our Python libraries. 

### Part 1: Sums

What if we wanted to know how many rows and columns we have? There's a tool for that, called **.shape**. 

In [13]:
# Use .shape at the end of the dataset's name to get (rows, columns). (The index column doesn't count here.) We'll use a variable, too.
s = sundaes.shape
print(s)

Maybe we want to know how many total scoops of ice cream were used in the 29 sundaes. There's a function for that.

In [15]:
# We just type the column name after the dataset name, separated with a dot.
# Then we add the .sum() function at the end. 
sundaes.scoops.sum()
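
To check both tools on something small enough to count by hand, here is a minimal sketch using a made-up table of pets. The name pets, its columns, and its values are invented for this example and are not part of sundaes.csv.

```python
import pandas

# A tiny made-up dataset: 3 rows and 2 columns (not from sundaes.csv).
pets = pandas.DataFrame({
    'animal': ['dog', 'cat', 'fish'],
    'legs': [4, 4, 0],
})

# .shape gives (number of rows, number of columns).
pets_shape = pets.shape
print(pets_shape)

# .sum() adds up every value in the legs column.
total_legs = pets.legs.sum()
print(total_legs)
```

Count the rows, the columns, and the legs yourself, then compare your answers with what Python prints.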

**Your turn** How many total toppings were used?

In [None]:
# Type your code here. 


### Part 2: Choosing Certain Values

What if we wanted to know how many sundaes used chocolate ice cream? We can't just use a sum because we aren't interested in the whole column, and the column doesn't contain numbers. Instead, we'll use the **len()** function, which counts for us. 

In [39]:
len(sundaes[sundaes.ice_cream == 'chocolate'])

Look at the code above from the inside out:
   - First, sundaes.ice_cream == 'chocolate' checks every row to see whether the ice_cream column says 'chocolate'. 
   - Then, putting that check inside sundaes[ ] keeps only the rows of sundaes that passed it.
   - Last, we use len() to count the rows that are left. 
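
The inside-out steps can also be written one at a time. This is only a sketch on a made-up table of pets (the names pets, is_dog, dogs, and dog_count are invented for this example), but the same three steps happen inside the one-line version.

```python
import pandas

# A tiny made-up dataset (not from sundaes.csv).
pets = pandas.DataFrame({
    'animal': ['dog', 'cat', 'dog', 'fish'],
    'name': ['Rex', 'Tom', 'Fido', 'Bubbles'],
})

# Step 1: ask a True/False question about every row.
is_dog = pets.animal == 'dog'
print(is_dog)

# Step 2: keep only the rows where the answer was True.
dogs = pets[is_dog]
print(dogs)

# Step 3: count the rows that are left.
dog_count = len(dogs)
print(dog_count)
```

Writing len(pets[pets.animal == 'dog']) on one line gives the same count. Splitting it up just lets us look at what each step produces.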

**Your Turn**

In [None]:
# How many sundaes used Mexican vanilla?


In [None]:
# How many people shared their sundaes?


### Part 3: Back to Bar Plots (sort of)

Now let's show off our data with the bar plot, which we learned about yesterday. Remember that we've already imported seaborn and the special command for plots.

Things will be a little different this time, though, since we have a column in a dataset instead of 2 lists. The good news is that seaborn will do all the counting for us! We just need to tell it the name of the column and the dataset we want. For this approach, we use **countplot()** instead of barplot(). 
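
If you are curious what "doing the counting" means, here is a small sketch that counts by hand with the pandas function value_counts(). This function is not used elsewhere in this lesson, and the pets table is made up for the example. Each count is the height a bar would have in a count plot of the same column.

```python
import pandas

# A tiny made-up dataset (not from sundaes.csv).
pets = pandas.DataFrame({'animal': ['dog', 'cat', 'dog', 'fish', 'dog']})

# value_counts() counts how many times each value appears in a column.
animal_counts = pets.animal.value_counts()
print(animal_counts)
```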

In [8]:
# Make a bar plot showing how often people got sundaes on each day of the week. 
seaborn.countplot(x = 'day', data = sundaes)

**Your Turn**

In [37]:
# Make a plot showing how many people did and did not share their sundaes.


Which days were most popular? Why do you think that is?

# Lists and *For* Loops

Most of the lists we've used so far have been pretty short. But what if we had a list of 20 things, or even more? And imagine that we wanted to do something with each individual item in that list. It would be a lot of work to write out 20 or more lines of code like that. There's good news, though: we have *for* loops! These loops allow us to write just a few lines of code and then have Python work through each item in our list automatically. 

In the example below, we use a for loop to print each item in a list.

In [41]:
# Start with our list of car brands from yesterday.
car_brands = ['Jeep', 'Tesla', 'Mazda', 'Toyota']

In [44]:
# Now print each name using a for loop with just 2 lines of code. 
for car in car_brands:
    print(car) 

What's happening in the code above? In the first line, we're telling Python that we want to do something *for* every car in the list car_brands. In the second line, we tell Python what we want to do with each car, which is to print it. That's it!
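
To see how much typing the loop saves, here is a short sketch that prints the same list two ways: the long way, with one print() for every item, and the loop way.

```python
car_brands = ['Jeep', 'Tesla', 'Mazda', 'Toyota']

# The long way: one print() line for every item in the list.
print('Jeep')
print('Tesla')
print('Mazda')
print('Toyota')

# The loop way: the same four names, with just two lines.
for car in car_brands:
    print(car)
```

Both ways print the four names in the same order. With 4 items the long way is only a little longer, but with 20 or more items the loop is still just two lines.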

And here's the fun part: it doesn't matter what word we write where the word "car" is, as long as we use the same word there and inside print() on the next line. Use the cell below to type the same code, only using a different word in place of car. You can even make up a word, as long as you use the same word in both places!

In [None]:
# Same loop with new word in place of "car"



## Create Your Own

Now it's time to make your own for loop. Follow the directions in the comments below to make this happen. Create your own list, and set it to a variable name of your choice. Your list should use only words and not numbers, but it can contain cartoon characters, shoe names, or anything else. Ask a teacher for help if you're stuck.

In [None]:
# Create your own list here.


In [None]:
# Now use a for loop to print the length of each item using the len() function. 



## Once More, with Math

We can also use for loops to apply math operations to items in our list. So if we had a list of numbers and wanted to subtract 2 from each one, we would use print(num - 2) in the second line of our loop's code. Think about what we would write in the first line of the loop. 

In [47]:
# Here's the list to start:
numbers = [5, 4, 11, 23, 77]

In [49]:
# And now the loop
for num in numbers:
    print(num - 2)

Notice that the second line is indented. Jupyter indents it for us automatically when the line above starts with 'for' and ends with a colon. The indentation tells Python that the second line belongs to the loop, so it runs once for each item in the list. The two lines are tied together. In fact, they won't work separately. See what happens if you remove one line or comment it out. 
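
Here is a small sketch of what the indentation controls. The shorter list and the 'All done!' message are made up for this example.

```python
numbers = [5, 4, 11]

for num in numbers:
    # Indented: this line is inside the loop, so it runs once per number.
    print(num - 2)

# Not indented: this line is outside the loop, so it runs only once, at the end.
print('All done!')
```

The loop prints 3, 2, and 9, and then 'All done!' appears a single time. Try indenting the last print() line to see how the output changes.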

**Your Turn**

In [None]:
# Create a list of 10 numbers and assign it to a variable name of your choice.


In [None]:
# Use a for loop to add 21 to each number.

