# Measures of Center with Large Data Sets

The data sets we have been working with so far have been (thankfully) somewhat small. 

However, most data sets in the real world are very large, with many consisting of hundreds, thousands, or even millions (or more) of rows and columns of information.

We don't want to retype (or even copy-paste) this information every time, so we will instead learn how to import data sets into our notebook using the Pandas library.

In [3]:
import pandas as pd

Pandas can import data sets stored in many different file types, such as Microsoft Excel files.

In this example, we are going to use Pandas to import data that is saved as a *comma-separated values* file, or CSV.

To start, we will read the `people.csv` file (located in the Datasets folder) as a DataFrame. You can think of a DataFrame as a (possibly) giant spreadsheet.

The `people.csv` file contains randomly generated names, ages, Ohio cities of residence, and incomes.

In [4]:
df = pd.read_csv('people.csv')
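To see what `read_csv` does without needing the file itself, here is a minimal sketch that reads a few rows from an in-memory string. The string `csv_text` and the names `sample` and `sample_indexed` are made up for illustration, and the layout of `csv_text` is only an assumption about how `people.csv` was saved. It also shows one possible way to handle the `Unnamed: 0` column that appears in this notebook's output: the `index_col=0` argument tells Pandas to treat the first column as the row index.

```python
import io
import pandas as pd

# Hypothetical stand-in for people.csv: three rows in the layout Pandas
# produces when a data frame is saved together with its row index.
csv_text = """,id,name,age,city,income
0,0,Ishmael,44,Cleveland,83899
1,1,Matthew,42,Youngstown,94782
2,2,Steve,53,Youngstown,103940
"""

# Default read: the header of the first column is empty, so Pandas
# names that column "Unnamed: 0".
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.columns.tolist())
# ['Unnamed: 0', 'id', 'name', 'age', 'city', 'income']

# index_col=0 uses the first column as the row index instead.
sample_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample_indexed.columns.tolist())
# ['id', 'name', 'age', 'city', 'income']
```

Either version works for the rest of this notebook, since we only use the `age`, `city`, and `income` columns.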

In [5]:
df  # allows us to see the first few and last few rows of the data frame

Unnamed: 0,id,name,age,city,income
0,0,Ishmael,44,Cleveland,83899
1,1,Matthew,42,Youngstown,94782
2,2,Steve,53,Youngstown,103940
3,3,Nathan,31,Cleveland,48285
4,4,Jeff,41,Columbus,70327
...,...,...,...,...,...
24995,24995,Ryan,29,Akron,89043
24996,24996,Vivian,80,Dayton,70236
24997,24997,Leia,35,Cincinnati,46809
24998,24998,Atticus,64,Dayton,107329
24999,24999,Helena,20,Dayton,70219


In [6]:
df.head()  # We can use this to get a look at the first few rows of our data frame.

Unnamed: 0,id,name,age,city,income
0,0,Ishmael,44,Cleveland,83899
1,1,Matthew,42,Youngstown,94782
2,2,Steve,53,Youngstown,103940
3,3,Nathan,31,Cleveland,48285
4,4,Jeff,41,Columbus,70327


In [7]:
df.tail()  # This will allow us to look at the last few rows.

Unnamed: 0,id,name,age,city,income
24995,24995,Ryan,29,Akron,89043
24996,24996,Vivian,80,Dayton,70236
24997,24997,Leia,35,Cincinnati,46809
24998,24998,Atticus,64,Dayton,107329
24999,24999,Helena,20,Dayton,70219


In [8]:
df.head(8)  # The number 8 inside parentheses will show us the first 8 rows.

Unnamed: 0,id,name,age,city,income
0,0,Ishmael,44,Cleveland,83899
1,1,Matthew,42,Youngstown,94782
2,2,Steve,53,Youngstown,103940
3,3,Nathan,31,Cleveland,48285
4,4,Jeff,41,Columbus,70327
5,5,Sarah,82,Columbus,82676
6,6,Hector,38,Dayton,112619
7,7,Jeff,18,Akron,87367
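The same viewing tools can be tried on a small data frame built by hand. This is only a sketch: the data frame `sample` is a hypothetical stand-in that copies the names and ages from the eight rows shown above. It also uses the `shape` attribute, which reports the number of rows and columns without displaying any data.

```python
import pandas as pd

# Hypothetical small data frame: names and ages copied from the
# first eight rows displayed above.
sample = pd.DataFrame({
    "name": ["Ishmael", "Matthew", "Steve", "Nathan",
             "Jeff", "Sarah", "Hector", "Jeff"],
    "age": [44, 42, 53, 31, 41, 82, 38, 18],
})

print(sample.shape)    # (8, 2): 8 rows and 2 columns
print(sample.head(3))  # first 3 rows
print(sample.tail(2))  # last 2 rows
```

With no argument, `head()` and `tail()` show 5 rows each, which matches the outputs above.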


## Finding the Mean or Median for the Values in a Column

To find the mean or median of a column, we don't have to retype the contents of that column.

Suppose we want to find the mean age of the subjects in the data frame.

There are a few ways we can go about doing this.

While Pandas does have some statistical methods, I find it is easiest to work with NumPy.

### Option 1: Use NumPy on the Column's Name in the Data Frame

In [9]:
import numpy as np

In [10]:
np.mean(df['age'])

53.81716

### Option 2: Store the Entire Column as a Variable

In [11]:
age = df['age']

np.mean(age)

53.81716

### Option 3: Use the Dot Operator

In [12]:
np.mean(df.age)

53.81716
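As a quick check that the three options really do agree, here is a small sketch on a hand-made data frame. The data frame `sample` is hypothetical and holds only the eight ages shown earlier, so its mean differs from the mean of the full data set. The last line uses Pandas' own `mean` method for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical small data frame: the ages from the first eight rows.
sample = pd.DataFrame({"age": [44, 42, 53, 31, 41, 82, 38, 18]})

by_brackets = np.mean(sample["age"])  # Option 1: column name in brackets

age = sample["age"]                   # Option 2: store the column first
by_variable = np.mean(age)

by_dot = np.mean(sample.age)          # Option 3: dot operator

by_pandas = sample["age"].mean()      # Pandas' built-in method

# The ages sum to 349, and 349 / 8 = 43.625 in every case.
print(by_brackets, by_variable, by_dot, by_pandas)
```

One caution about Option 3: the dot operator only works when the column name is a valid Python name (no spaces, for example) and does not clash with something a DataFrame already has, such as `count` or `mean`. The bracket form works for any column name.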

### Median

In [13]:
np.median(df['age'])

54.0

In [14]:
np.median(age)

54.0

In [15]:
np.median(df.age)

54.0
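To see what the median is doing, it can help to sort a small sample by hand. The sketch below uses a hypothetical Series `ages` holding the eight ages shown earlier. With an even number of values, the median is the average of the two middle values.

```python
import numpy as np
import pandas as pd

# Hypothetical small sample: the ages from the first eight rows.
ages = pd.Series([44, 42, 53, 31, 41, 82, 38, 18])

# Sorted, the two middle values are 41 and 42.
print(ages.sort_values().tolist())  # [18, 31, 38, 41, 42, 44, 53, 82]

median_numpy = np.median(ages)   # NumPy, as used above
median_pandas = ages.median()    # Pandas' built-in method
print(median_numpy, median_pandas)  # 41.5 41.5
```

The mean of these same eight ages is 43.625, a bit higher than the median, because the single age of 82 pulls the mean upward.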

## Finding the Mode of a Column

We won't be using this much, so don't worry too much about it.

However, let's say you wanted to know which city in the made-up data set had the most people in it. In other words, you want to find the mode city.

It's probably easiest to use the `multimode` function of the `statistics` module.

In [16]:
import statistics as stats

In [17]:
stats.multimode(df['city'])

['Dayton']
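The result is a list because a data set can have more than one mode. As a small sketch, the hypothetical Series `cities` below holds the cities from the first eight rows shown earlier, where three cities are tied. The `value_counts` method from Pandas is included as one way to see the counts behind the answer.

```python
import statistics as stats
import pandas as pd

# Hypothetical small sample: the cities from the first eight rows.
cities = pd.Series(["Cleveland", "Youngstown", "Youngstown", "Cleveland",
                    "Columbus", "Columbus", "Dayton", "Akron"])

# Three cities appear twice each, so all three are modes.
# multimode lists them in the order they first appear in the data.
modes = stats.multimode(cities)
print(modes)  # ['Cleveland', 'Youngstown', 'Columbus']

# value_counts shows how many times each city appears.
print(cities.value_counts())
```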

## Exercise: Find the mean and median incomes in the data frame.