## Working with DataFrames

Now that we can get data into a DataFrame, we can finally start working with it.

We'll be using the [MovieLens](http://www.grouplens.org/node/73) dataset in many examples going forward. 
The dataset contains 100,000 ratings made by 943 users on 1,682 movies.

### Loading data

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display

In [2]:
# pass in column names for each CSV (info in README file)
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
url_user = 'https://github.com/luciasantamaria/pandas-tutorial/raw/master/data/ml-100k/u.user'
users = pd.read_csv(url_user, sep='|', names=u_cols)

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
url_data = 'https://github.com/luciasantamaria/pandas-tutorial/blob/master/data/ml-100k/u.data?raw=true'
ratings = pd.read_csv(url_data, sep='\t', names=r_cols)

# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
url_item = 'https://github.com/luciasantamaria/pandas-tutorial/raw/master/data/ml-100k/u.item'
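# u.item is Latin-1 encoded; without this, Python 3 raises a UnicodeDecodeError
movies = pd.read_csv(url_item, sep='|', names=m_cols, usecols=range(5), encoding='latin-1')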
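
To see what `names` and `usecols` do without downloading anything, here is a minimal sketch on an in-memory sample. The three rows, the `example.com` URLs, and the two trailing 0/1 fields (standing in for genre flags) are made up for illustration; only the pipe-delimited, header-less layout mirrors the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same pipe-delimited, header-less
# layout as u.item; the trailing 0/1 fields stand in for genre flags.
sample = io.StringIO(
    "1|Toy Story (1995)|01-Jan-1995||http://example.com/1|0|1\n"
    "2|GoldenEye (1995)|01-Jan-1995||http://example.com/2|1|0\n"
    "3|Four Rooms (1995)|01-Jan-1995||http://example.com/3|0|0\n"
)

m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']

# names supplies the header the file lacks; usecols keeps only the
# first five fields, so the genre flags are never loaded.
sample_movies = pd.read_csv(sample, sep='|', names=m_cols, usecols=range(5))

print(sample_movies.shape)           # (3, 5)
print(list(sample_movies.columns))
```

The empty fourth field is read as missing, which is why `video_release_date` ends up entirely null in the real data as well.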

In [62]:
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [63]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [64]:
movies.head()

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995)


### Inspection

pandas has a variety of methods for getting basic information about your DataFrame, the most basic of which is `info`.

In [65]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 5 columns):
movie_id              1682 non-null int64
title                 1682 non-null object
release_date          1681 non-null object
video_release_date    0 non-null float64
imdb_url              1679 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 65.8+ KB


The output tells us a few things about our DataFrame.

1. It's obviously an instance of a DataFrame.
2. Each row was assigned an index of 0 to N-1, where N is the number of rows in the DataFrame. pandas will do this by default if an index is not specified. Don't worry, this can be changed later.
3. There are 1,682 rows (every row must have an index).
4. Our dataset has five total columns, one of which isn't populated at all (video_release_date) and two that are missing some values (release_date and imdb_url).
5. The `dtypes` line summarizes the column datatypes as a count per type, not in the order the columns are listed. Use the `dtypes` attribute to get the datatype of each column.

In [66]:
movies.dtypes

movie_id                int64
title                  object
release_date           object
video_release_date    float64
imdb_url               object
dtype: object
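
The pieces of the `info` summary can also be pulled out individually. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `movies`) to show that `dtypes` is an attribute rather than a method call, and that the non-null counts are available through `count`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing title and an empty column.
toy = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'title': ['Toy Story (1995)', None, 'Four Rooms (1995)'],
    'video_release_date': [np.nan, np.nan, np.nan],
})

# dtypes is an attribute (no parentheses) that returns a Series
# indexed by column name, in column order.
print(toy.dtypes)

# count() gives the non-null count per column, the same numbers
# info() prints; isna().sum() gives the complementary missing count.
non_null = toy.count()
missing = toy.isna().sum()
print(non_null)
print(missing)
```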

DataFrames also have a `describe` method, which is great for seeing basic statistics about the dataset's numeric columns. Be careful though, since this will return information on **all** columns of a numeric datatype.

In [67]:
users.describe()

Unnamed: 0,user_id,age
count,943.0,943.0
mean,472.0,34.051962
std,272.364951,12.19274
min,1.0,7.0
25%,236.5,25.0
50%,472.0,31.0
75%,707.5,43.0
max,943.0,73.0


Notice *user_id* was included since it's numeric. Since this is an ID value, the stats for it don't really matter.

We can quickly see the average age of our users is just above 34 years old, with the youngest being 7 and the oldest being 73. The median age is 31, with the youngest quartile of users being 25 or younger, and the oldest quartile being at least 43.
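
If the ID column gets in the way, one option is to describe only the columns you care about. This is a sketch on made-up data (`toy_users` is hypothetical, not the MovieLens file); it also shows `include='all'`, which adds summaries for non-numeric columns.

```python
import pandas as pd

# Hypothetical four-row stand-in for the users table.
toy_users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'age': [24, 53, 23, 33],
    'sex': ['M', 'F', 'M', 'F'],
})

# Selecting the column first keeps ID-like columns out of the summary.
age_stats = toy_users['age'].describe()
print(age_stats)

# include='all' also summarizes non-numeric columns
# (count, unique, top, freq); inapplicable cells are NaN.
all_stats = toy_users.describe(include='all')
print(all_stats)
```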

You've probably noticed that I've used the `head` method regularly throughout this post. By default, `head` displays the first five records of the dataset, while `tail` displays the last five. Both accept the number of rows to show.

In [68]:
movies.head(3)

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...


In [69]:
movies.tail(3)

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url
1679,1680,Sliding Doors (1998),01-Jan-1998,,http://us.imdb.com/Title?Sliding+Doors+(1998)
1680,1681,You So Crazy (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?You%20So%20Cr...
1681,1682,Scream of Stone (Schrei aus Stein) (1991),08-Mar-1996,,http://us.imdb.com/M/title-exact?Schrei%20aus%...


In [70]:
movies.index

RangeIndex(start=0, stop=1682, step=1)

In [71]:
movies.columns

Index([u'movie_id', u'title', u'release_date', u'video_release_date',
       u'imdb_url'],
      dtype='object')

In [72]:
movies.sort_index(axis=1).head()

Unnamed: 0,imdb_url,movie_id,release_date,title,video_release_date
0,http://us.imdb.com/M/title-exact?Toy%20Story%2...,1,01-Jan-1995,Toy Story (1995),
1,http://us.imdb.com/M/title-exact?GoldenEye%20(...,2,01-Jan-1995,GoldenEye (1995),
2,http://us.imdb.com/M/title-exact?Four%20Rooms%...,3,01-Jan-1995,Four Rooms (1995),
3,http://us.imdb.com/M/title-exact?Get%20Shorty%...,4,01-Jan-1995,Get Shorty (1995),
4,http://us.imdb.com/M/title-exact?Copycat%20(1995),5,01-Jan-1995,Copycat (1995),


In [73]:
# sort by title in descending order
movies.sort_values(by=['title'], ascending=False).head(5)


Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url
1632,1633,Á köldum klaka (Cold Fever) (1994),08-Mar-1996,,http://us.imdb.com/Title?%C1+k%F6ldum+klaka+(1...
266,267,unknown,,,
1163,1164,Zeus and Roxanne (1997),10-Jan-1997,,http://us.imdb.com/M/title-exact?Zeus%20and%20...
546,547,"Young Poisoner's Handbook, The (1995)",23-Feb-1996,,http://us.imdb.com/M/title-exact?Young%20Poiso...
1187,1188,Young Guns II (1990),01-Jan-1990,,http://us.imdb.com/M/title-exact?Young%20Guns%...


Sorting does not change the original DataFrame; it returns a new, sorted DataFrame.
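
A quick way to convince yourself of this is to sort a tiny frame and look at both objects afterwards. The three titles below are just a made-up sample, and `by_title` is a name chosen for this sketch.

```python
import pandas as pd

# Hypothetical three-row sample.
toy = pd.DataFrame({'title': ['Copycat', 'Toy Story', 'GoldenEye']})

# sort_values returns a new, sorted DataFrame ...
by_title = toy.sort_values(by='title', ascending=False)

# ... while the original keeps its row order.
print(list(by_title['title']))  # ['Toy Story', 'GoldenEye', 'Copycat']
print(list(toy['title']))       # ['Copycat', 'Toy Story', 'GoldenEye']

# Reassigning the result is an alternative to inplace=True.
toy = toy.sort_values(by='title')
print(list(toy.index))          # [0, 2, 1]: labels travel with their rows
```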

In [74]:
# now sort the original movies DataFrame in place

movies.sort_values(by=['title'], inplace=True, ascending=False)
movies.head()

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url
1632,1633,Á köldum klaka (Cold Fever) (1994),08-Mar-1996,,http://us.imdb.com/Title?%C1+k%F6ldum+klaka+(1...
266,267,unknown,,,
1163,1164,Zeus and Roxanne (1997),10-Jan-1997,,http://us.imdb.com/M/title-exact?Zeus%20and%20...
546,547,"Young Poisoner's Handbook, The (1995)",23-Feb-1996,,http://us.imdb.com/M/title-exact?Young%20Poiso...
1187,1188,Young Guns II (1990),01-Jan-1990,,http://us.imdb.com/M/title-exact?Young%20Guns%...


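Alternatively, Python's regular [slicing](http://docs.python.org/release/2.3.5/whatsnew/section-slices.html) syntax works as well. A slice selects rows by position in the current row order, so the rows below reflect the sort we just applied rather than the original index labels.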

In [75]:
movies[1000:1002]

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url
376,377,Heavyweights (1994),01-Jan-1994,,http://us.imdb.com/M/title-exact?Heavyweights%...
100,101,Heavy Metal (1981),08-Mar-1981,,http://us.imdb.com/M/title-exact?Heavy%20Metal...
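
The index labels in that output (376 and 100) look odd next to a slice of `1000:1002` because integer slices count positions, not labels. Here is a small sketch of the same effect on made-up data; the labels 10, 11 and 12 are arbitrary.

```python
import pandas as pd

# Hypothetical sample with arbitrary integer labels.
toy = pd.DataFrame(
    {'title': ['Copycat', 'Toy Story', 'GoldenEye']},
    index=[10, 11, 12],
)
toy = toy.sort_values(by='title', ascending=False)
# Row order is now Toy Story (11), GoldenEye (12), Copycat (10).

# A bare slice counts positions in the current row order,
# so this is the first two rows after sorting.
first_two = toy[0:2]
print(list(first_two.index))        # [11, 12]

# iloc is the explicit positional equivalent.
same_rows = toy.iloc[0:2]
print(first_two.equals(same_rows))  # True
```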


### Selecting

You can think of a DataFrame as a group of Series that share an index, with each Series labeled by its column header. This makes it easy to select specific columns.

Selecting a single column from the DataFrame will return a Series object.

In [76]:
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [77]:
users['occupation'].head()

0    technician
1         other
2        writer
3    technician
4         other
Name: occupation, dtype: object
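
To check what comes back from a single-column selection, you can inspect the result directly. This sketch uses a made-up `toy_users` frame; it also shows attribute-style access (`toy_users.occupation`), which works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute.

```python
import pandas as pd

# Hypothetical three-row stand-in for the users table.
toy_users = pd.DataFrame({
    'age': [24, 53, 23],
    'occupation': ['technician', 'other', 'writer'],
})

occupation = toy_users['occupation']
print(type(occupation).__name__)   # Series
print(occupation.name)             # occupation

# Attribute access returns the same column.
print(toy_users.occupation.equals(occupation))  # True
```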

To select multiple columns, pass a list of column names to the DataFrame; the result is itself a DataFrame.

In [78]:
users[['age', 'occupation']].head()

Unnamed: 0,age,occupation
0,24,technician
1,53,other
2,23,writer
3,24,technician
4,33,other


In [79]:
display(users[['age', 'zip_code']].head())

# can also store in a variable to use later
columns_you_want = ['occupation', 'sex'] 
display(users[columns_you_want].head())

Unnamed: 0,age,zip_code
0,24,85711
1,53,94043
2,23,32067
3,24,43537
4,33,15213


Unnamed: 0,occupation,sex
0,technician,M
1,other,F
2,writer,M
3,technician,M
4,other,F
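
A few details of list-based selection are easy to miss, so here is a short sketch on made-up data. The column name `'zip'` is deliberately one that does not exist in this toy frame.

```python
import pandas as pd

# Hypothetical three-row stand-in for the users table.
toy_users = pd.DataFrame({
    'age': [24, 53, 23],
    'sex': ['M', 'F', 'M'],
    'occupation': ['technician', 'other', 'writer'],
})

# Columns come back in the order of the list, not the original order.
subset = toy_users[['occupation', 'age']]
print(list(subset.columns))                    # ['occupation', 'age']

# A one-element list still yields a DataFrame, not a Series.
one_col = toy_users[['age']]
print(type(one_col).__name__, one_col.shape)   # DataFrame (3, 1)

# A name that is not a column raises KeyError.
try:
    toy_users[['age', 'zip']]
    missing_raised = False
except KeyError as err:
    missing_raised = True
    print('KeyError:', err)
```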


Row selection can be done in multiple ways, but selecting by index label or by boolean indexing is typically easiest.

In [80]:
print('users older than 25')
display(users[users.age > 25].head(3))

print('users aged 40 AND male')
display(users[(users.age == 40) & (users.sex == 'M')].head(3))

print('users younger than 30 OR female')
display(users[(users.sex == 'F') | (users.age < 30)].head(3))

users older than 25


Unnamed: 0,user_id,age,sex,occupation,zip_code
1,2,53,F,other,94043
4,5,33,F,other,15213
5,6,42,M,executive,98101


users aged 40 AND male


Unnamed: 0,user_id,age,sex,occupation,zip_code
18,19,40,M,librarian,2138
82,83,40,M,other,44133
115,116,40,M,healthcare,97232


users younger than 30 OR female


Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
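
It helps to look at the mask itself to see why this works. The sketch below, on a made-up four-row frame, shows that a comparison produces a boolean Series, that each condition needs its own parentheses when combined, and that `~` negates a mask.

```python
import pandas as pd

# Hypothetical four-row stand-in for the users table.
toy_users = pd.DataFrame({
    'age': [24, 53, 23, 40],
    'sex': ['M', 'F', 'M', 'M'],
})

# A comparison yields a boolean Series aligned with the rows.
older = toy_users.age > 25
print(older.tolist())              # [False, True, False, True]

# Each condition needs its own parentheses because & and | bind
# more tightly than comparison operators such as == and >.
older_men = toy_users[(toy_users.age > 25) & (toy_users.sex == 'M')]
print(older_men.index.tolist())    # [3]

# ~ negates a mask.
not_older = toy_users[~older]
print(not_older.index.tolist())    # [0, 2]
```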


Since our index is kind of meaningless right now, let's set it to the `user_id` using the `set_index` method. By default, `set_index` returns a new DataFrame, so you'll have to specify if you'd like the changes to occur in place.

This has confused me in the past, so look carefully at the code and output below.

In [81]:
# set_index actually returns a new DataFrame.
with_new_index = users.set_index('user_id')
display(with_new_index.head())

Unnamed: 0_level_0,age,sex,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
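
The part that is easy to trip over is that the original DataFrame is left alone. A minimal sketch on made-up data, with `indexed` as a name chosen for this example:

```python
import pandas as pd

# Hypothetical three-row stand-in for the users table.
toy_users = pd.DataFrame({
    'user_id': [1, 2, 3],
    'age': [24, 53, 23],
})

indexed = toy_users.set_index('user_id')

# The result uses user_id as its index and no longer has it as a column ...
print(indexed.index.name)          # user_id
print(list(indexed.columns))       # ['age']

# ... while the original is untouched.
print(list(toy_users.columns))     # ['user_id', 'age']
```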


If you want to modify your existing DataFrame, use the `inplace` parameter.

In [82]:
users.set_index('user_id', inplace=True)
users.head()

Unnamed: 0_level_0,age,sex,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


Notice that we've lost the default pandas 0-based index and moved the user_id into its place. We can select rows based on the index using the `loc` indexer.

In [83]:
print('select one row')
row = [99]
display(users.loc[row])

print('select multiple rows')
rows = [1, 50, 300]
display(users.loc[rows])

select one row


Unnamed: 0_level_0,age,sex,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
99,20,M,student,63129


select multiple rows


Unnamed: 0_level_0,age,sex,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
50,21,M,writer,52245
300,26,F,programmer,55106
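
Label-based and position-based lookups are easy to mix up once the index is no longer 0 to N-1. This sketch uses a made-up frame with labels 10, 20 and 30 to contrast `loc` (labels) with `iloc` (positions).

```python
import pandas as pd

# Hypothetical sample indexed by a made-up user_id.
toy_users = pd.DataFrame(
    {'age': [24, 53, 23], 'occupation': ['technician', 'other', 'writer']},
    index=pd.Index([10, 20, 30], name='user_id'),
)

# loc looks rows up by index label.
print(toy_users.loc[20, 'age'])        # 53

# A list of labels returns a DataFrame, in the order requested.
picked = toy_users.loc[[30, 10]]
print(picked.index.tolist())           # [30, 10]

# A single label returns that row as a Series.
row = toy_users.loc[20]
print(type(row).__name__)              # Series

# iloc selects by position, regardless of the labels.
print(toy_users.iloc[0]['age'])        # 24
```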


If we realize later that we liked the old pandas default index, we can just `reset_index`.  The same rules for `inplace` apply.

In [84]:
users.reset_index(inplace=True)
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
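
As with `set_index`, leaving out `inplace=True` gives you a new DataFrame back. The sketch below, on made-up data, also shows the `drop` parameter, which throws the old index away instead of turning it back into a column.

```python
import pandas as pd

# Hypothetical sample already indexed by user_id.
indexed = pd.DataFrame(
    {'age': [24, 53, 23]},
    index=pd.Index([1, 2, 3], name='user_id'),
)

# Without inplace=True, reset_index returns a new DataFrame and
# moves the old index back into an ordinary column.
restored = indexed.reset_index()
print(list(restored.columns))      # ['user_id', 'age']
print(restored.index.tolist())     # [0, 1, 2]

# drop=True discards the old index instead of keeping it as a column.
dropped = indexed.reset_index(drop=True)
print(list(dropped.columns))       # ['age']

# The original is unchanged in both cases.
print(indexed.index.name)          # user_id
```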


I've found that I can usually get by with boolean indexing and the `loc` indexer, but pandas has a whole host of [other ways to do selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html).
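
As a small taste of those other options, here is a sketch of three that I'd expect to come up often: `isin`, `query`, and `loc` with a mask plus a column. The data is made up, and this is by no means a complete tour.

```python
import pandas as pd

# Hypothetical four-row stand-in for the users table.
toy_users = pd.DataFrame({
    'age': [24, 53, 23, 40],
    'occupation': ['technician', 'other', 'writer', 'librarian'],
})

# isin builds a boolean mask from a list of allowed values.
picked = toy_users[toy_users.occupation.isin(['writer', 'librarian'])]
print(picked.index.tolist())       # [2, 3]

# query expresses the same kind of filter as a string.
over_30 = toy_users.query('age > 30')
print(over_30.index.tolist())      # [1, 3]

# loc can combine a row mask with a column selection.
ages = toy_users.loc[toy_users.age > 30, 'age']
print(ages.tolist())               # [53, 40]
```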