# Manipulating Data and Dataframes
Hopefully you have some sense of why you might want to use a dataframe to work with your data (rather than something like a list or an array). You could do everything we've done so far with lists or arrays, but when you're working with lots of data the dataframe object makes the work more efficient. Now we are going to look at some basic ways of working with a dataframe.


The specific pandas tools we will use that we haven't covered before are:

* `info()`
* `pd.cut()` (a pandas function rather than a dataframe method)
* `drop()`
* `copy()`
* `columns` (an attribute, so it is used without parentheses)
* `tolist()`

***
First, some imports and some code to make our charts look nicer.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# For slightly nicer charts
plt.rcParams['figure.figsize'] = [10, 5]
plt.rcParams['figure.dpi'] = 150

***
## Getting Detailed Information about our Dataframe
So far we have used `head()`, `tail()`, and `shape` to get quick information about our dataframe. One more useful method we haven't covered is `info()`, which provides detailed information about each data column and its type.


Here we read in a dataset of fictional course grades, assign it to the variable `df`, and then examine it with `info()`.

In [None]:
df = pd.read_csv("grades-all.csv")
df.info()

What is all this telling us? First, the dataframe has 18 entries (rows). There are 19 columns of data: two are 'object' datatypes, one is an integer ('int64'), and the rest are floats ('float64'). We haven't paid attention to the datatypes of the columns in our datasets so far because the data was clean. When we start working with real-world data, it will be important to check that the datatypes are what you expect. For example, if a column should be a float or an integer but it's listed as an object, it's likely that string characters are mixed in with the numeric values.
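
As a minimal sketch of that last point, the cell below builds a small made-up dataframe (the name `messy` and its values are hypothetical, not taken from `grades-all.csv`) in which one stray string turns a numeric column into an 'object' column. One possible repair, assuming the bad entries can be treated as missing, is `pd.to_numeric()` with `errors='coerce'`.

```python
import pandas as pd

# Hypothetical data: 'Q1' should be numeric, but one entry was typed as text.
messy = pd.DataFrame({'StudentID': [1, 2, 3],
                      'Q1': [10, 'absent', 12]})

# The stray string forces the whole column to the 'object' datatype.
q1_dtype_before = messy['Q1'].dtype

# errors='coerce' turns anything that is not a number into NaN (missing).
messy['Q1'] = pd.to_numeric(messy['Q1'], errors='coerce')
q1_dtype_after = messy['Q1'].dtype

print(q1_dtype_before, '->', q1_dtype_after)
```

Whether coercing to missing values is the right choice depends on what the stray text means, so it is worth looking at the offending rows before converting.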

***
## Simple Indexing with Labels

Up to this point we have been using code that looks like `df['Year']` without thinking too much about it. We've described this as reading columns of data to pass to a method, but we have not gone into any specifics. It's time to dig into this a bit.

'Year' in this example might be referred to as a column name or a column label. What we've been doing is selecting the data we want to work with by using its label. This is known as indexing.

So let's see what happens when we index a dataframe without calling any methods.

In [None]:
df['Q1']

This is called a pandas series. The sequential numbers on the left are the series index. The numbers on the right are the values. One way to think of a dataframe is as a collection of series objects. Indexing gives us access to an individual series, or a group of series, within a dataframe.
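
To make the two parts of a series concrete, here is a small sketch on a made-up dataframe (`toy` and its values are hypothetical stand-ins for the grades data). It pulls out one column and looks at its index and its values separately.

```python
import pandas as pd

# Hypothetical stand-in for the grades data.
toy = pd.DataFrame({'Q1': [10, 8, 12], 'Q2': [9, 11, 7]})

q1 = toy['Q1']               # indexing with one label returns a Series
print(type(q1).__name__)     # Series
print(q1.index.tolist())     # the series index: [0, 1, 2]
print(q1.tolist())           # the values: [10, 8, 12]
```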

The usual methods can be applied to the series object.

In [None]:
df['Q1'].mean()

In [None]:
df['Q1'].sum()

We can also use variables as labels for indexing.

In [None]:
quiz = 'Q1'
df[quiz].sum()

If we assign to a label that doesn't currently exist, we create a new column in our dataframe.

In [None]:
df['New Column!'] = 0
df.head()

We can also assign new values to existing columns using labels.

In [None]:
df['New Column!'] = 'data!'
df.head()

We can also use this type of indexing to do operations on columns.

In [None]:
df['Exam_Avg'] = (df['Exam1'] + df['Exam2'])/2
df.head()

Here's where things get more interesting. We can pass lists of labels to index multiple columns.

In [None]:
exam_list = ['Exam1', 'Exam2']
df[exam_list].mean()

We can also pass the list directly.

In [None]:
df[['Project1', 'Project2']].mean()
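
The difference between a single label and a list of labels can be sketched on a made-up dataframe (`toy` and its values are hypothetical). A single label gives back a series, while a list of labels gives back a smaller dataframe, and calling `mean()` on that dataframe returns one mean per column.

```python
import pandas as pd

# Hypothetical exam scores for two students.
toy = pd.DataFrame({'Exam1': [80.0, 90.0], 'Exam2': [70.0, 100.0]})

one = toy['Exam1']               # a single label -> Series
both = toy[['Exam1', 'Exam2']]   # a list of labels -> DataFrame

print(type(one).__name__, type(both).__name__)

# mean() on a dataframe gives one mean per column, returned as a Series.
col_means = both.mean()
print(col_means)
```

This is also why the double square brackets are needed when passing the list directly: the outer pair does the indexing and the inner pair builds the list.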

It looks like 'Project1' was entered as raw points out of 12, while 'Project2' was entered as a percentage. We can fix that with some simple operations.

In [None]:
df['Project1'] = (df['Project1']/12)*100
df.head()

And while we are at it, let's convert the quiz grades to percentages as well. It looks like they were also out of 12 points.

In [None]:
quiz_list = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9','Q10']
df[quiz_list] = (df[quiz_list]/12) * 100
df.head()

Notice how we used the list above to apply the same operation to all of the data columns in the list. That is a short bit of code that is doing quite a bit. 
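
Here is the same pattern on a tiny made-up dataframe (`toy`, `quiz_cols`, and the values are hypothetical), which makes it easier to check by hand that only the listed columns change.

```python
import pandas as pd

# Hypothetical data: two quizzes out of 12 points and one exam in percent.
toy = pd.DataFrame({'Q1': [6.0, 12.0],
                    'Q2': [9.0, 3.0],
                    'Exam1': [88.0, 75.0]})
quiz_cols = ['Q1', 'Q2']

# Only the listed columns are rescaled; 'Exam1' is left alone.
toy[quiz_cols] = (toy[quiz_cols] / 12) * 100
print(toy)
```

Because the assignment overwrites the original values, running a cell like this a second time would rescale the already-converted numbers again. In a notebook it is safest to re-read the data before re-running it.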

## Creating New Dataframes with Indexing
We can use indexing to create new dataframes. You might notice the `copy()` method used below. It makes an independent copy of the selected columns, instead of just showing us part of the existing dataframe (called a view). We will discuss why you would want to do this in more detail later.

In [None]:
copy_list = ['StudentID', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9','Q10']
dfquiz = df[copy_list].copy()
dfquiz.head()
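
As a small preview, the sketch below (on a hypothetical dataframe named `toy`) shows the practical effect of `copy()`: changes made to the copy do not reach the dataframe it was taken from.

```python
import pandas as pd

# Hypothetical stand-in for the grades data.
toy = pd.DataFrame({'StudentID': [1, 2],
                    'Q1': [50.0, 100.0],
                    'Exam1': [88.0, 75.0]})

quiz_only = toy[['StudentID', 'Q1']].copy()   # independent copy
quiz_only['Q1'] = 0.0                         # change the copy only

print(toy['Q1'].tolist())        # original is unchanged: [50.0, 100.0]
print(quiz_only['Q1'].tolist())  # [0.0, 0.0]
```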

We can index using the same list of quizzes, calculate each student's overall mean (across the columns, using `axis=1`), and assign the result to a new column called 'Quiz Avg'.

In [None]:
quiz_list = copy_list[1:]
dfquiz['Quiz Avg'] = dfquiz[quiz_list].mean(axis=1)
dfquiz.head()
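
The `axis` argument is easy to mix up, so here is a minimal sketch on made-up numbers (`toy` is hypothetical) that compares the two directions.

```python
import pandas as pd

# Hypothetical quiz percentages for two students.
toy = pd.DataFrame({'Q1': [50.0, 100.0], 'Q2': [70.0, 80.0]})

per_column = toy.mean(axis=0)   # default: one mean per column (down the rows)
per_row = toy.mean(axis=1)      # one mean per row (across the columns)

print(per_column.tolist())  # [75.0, 75.0]
print(per_row.tolist())     # [60.0, 90.0]
```

A per-student average needs one number per row, which is why `axis=1` is used above.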

Now, just for fun, let's say we want to assign letter grades based on the average. There's a pandas function called `pd.cut()` that allows us to specify the bins (ranges) we would like and then supply labels for those bins. In this case the labels are letter grades.

In [None]:
# Bins are (0, 60], (60, 70], ... (90, 100]; each bin includes its right edge.
dfquiz['Quiz_Avg_Letter'] = pd.cut(dfquiz['Quiz Avg'], bins=[0, 60, 70, 80, 90, 100],
                                   labels=['F', 'D', 'C', 'B', 'A'])
dfquiz.head()
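
It is worth checking how `pd.cut()` treats values that sit exactly on a bin edge. The sketch below uses a few made-up scores (the `scores` series is hypothetical) with the same bins and labels. By default each bin includes its right edge but not its left edge, so a score of exactly 0 falls outside the first bin and a score of exactly 60 is labelled 'F'.

```python
import pandas as pd

# Hypothetical averages chosen to sit on the bin edges.
scores = pd.Series([0, 60, 85, 100])
bins = [0, 60, 70, 80, 90, 100]
letters = ['F', 'D', 'C', 'B', 'A']

# Default behaviour: bins are (0, 60], (60, 70], ..., (90, 100].
default = pd.cut(scores, bins=bins, labels=letters)

# include_lowest=True also places the lowest edge (0) in the first bin.
with_lowest = pd.cut(scores, bins=bins, labels=letters, include_lowest=True)

print(default.tolist())      # [nan, 'F', 'B', 'A']
print(with_lowest.tolist())  # ['F', 'F', 'B', 'A']
```

If a 60 should earn a 'D', `right=False` closes each bin on the left instead, but then a score of exactly 100 falls outside the last bin unless the final edge is raised.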

## Reordering Columns
We can also use labels to reorder columns. Let's say we want to move the quiz averages to the front (left-most side) of the dataframe. To accomplish this we will first use the `columns` attribute, which returns an object containing all of the column labels.

In [None]:
dfquiz.columns

Second, we apply the `tolist()` method to convert the object returned by `columns` into a list. We then assign that list to the variable `column_order`.

In [None]:
column_order = dfquiz.columns.tolist()
column_order

We can reorder the items in the list ourselves, by typing the labels in the order we want, and then use that list to reorder the dataframe itself.

In [None]:
column_order1 = ['Quiz Avg', 'Quiz_Avg_Letter','StudentID', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6', 'Q7', 'Q8', 'Q9', 'Q10']
dfquiz_reordered1 = dfquiz[column_order1].copy()
dfquiz_reordered1.head()

Or we can use list slicing to split and recombine the column list to accomplish the same thing.

In [None]:
column_order2 = column_order[-2:] + column_order[:-2]
column_order2

We then use the reordered list to reorder our dataframe.

In [None]:
dfquiz_reordered2 = dfquiz[column_order2].copy()
dfquiz_reordered2.head()
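
Slicing with `[-2:]` depends on the two columns being last in the list. A possible alternative, sketched here on a hypothetical dataframe (`toy`, `front`, and `rest` are illustrative names), is to name the columns that should come first and build the rest of the order with a list comprehension.

```python
import pandas as pd

# Hypothetical stand-in for dfquiz with fewer columns.
toy = pd.DataFrame({'StudentID': [1, 2],
                    'Q1': [50.0, 100.0],
                    'Quiz Avg': [50.0, 100.0],
                    'Quiz_Avg_Letter': ['F', 'A']})

front = ['Quiz Avg', 'Quiz_Avg_Letter']
# Keep every other label in its existing order.
rest = [label for label in toy.columns if label not in front]

reordered = toy[front + rest].copy()
print(reordered.columns.tolist())
```

This version keeps working if more columns are added after the averages, at the cost of one extra line.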

## Removing Columns
We can also use labels to remove columns. The `drop()` method takes a label, or a list of labels, and returns a dataframe without them. `drop()` can be used to remove rows as well, so we have to tell it specifically to look for a column with the label we specified. We do that by specifying `axis=1` (we would use `axis=0` if we wanted to drop rows).

In [None]:
df_dropped = df.drop('New Column!', axis=1)
df_dropped.head()
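
Two details are easy to miss here, and the sketch below shows both on a hypothetical dataframe named `toy`. First, `drop()` returns a new dataframe and leaves the original alone, which is why the result is assigned to a new variable. Second, `drop()` also accepts a `columns=` keyword that does the same job as `axis=1`.

```python
import pandas as pd

# Hypothetical stand-in for the grades data.
toy = pd.DataFrame({'Q1': [50.0, 100.0],
                    'New Column!': ['data!', 'data!']})

by_axis = toy.drop('New Column!', axis=1)
by_keyword = toy.drop(columns=['New Column!'])   # same result as axis=1

print(by_axis.columns.tolist())     # ['Q1']
print(by_keyword.columns.tolist())  # ['Q1']
print(toy.columns.tolist())         # original still has both columns
```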

## In Class Exercise

Work with the existing dataframe 'df' to calculate final course averages and final letter grades for every student in the class. Your letter grades should include plusses and minuses. 


For this fictional course: 
* Quizzes collectively count for 30% of the final grade
* Projects count for 10% each
* Labs count for 5%
* Attendance counts for 5%
* Exams count for 20% each


After you've calculated the final outcomes, produce descriptive statistics and visualizations that compare the students' course performance first by major, then by year, then by major within each year.



In [None]:
# write and test your code here