## Read_Table
The file passed to `read_table` below works with the function's default parameters: it uses tabs as the delimiter, and its first row is used as the header for the entire dataset.

In [1]:
from pandas import read_table

In [2]:
# read a dataset of Chipotle orders from a URL and store the result in a DataFrame
orders = read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv')

In [3]:
# examine the first n rows of the given DataFrame, defaults to 5 if no value is given for n
orders.head(8)

,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
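
The defaults described above can also be checked without a network connection. The following is a minimal sketch that assumes a small, made-up tab-separated string (`tsv_text` is hypothetical and stands in for the downloaded file). It suggests that `read_table` with no extra arguments behaves like `read_csv` with the tab delimiter and header row spelled out.

```python
from io import StringIO

import pandas as pd

# Hypothetical tab-separated data standing in for the downloaded file.
tsv_text = "order_id\tquantity\titem_name\n1\t1\tIzze\n2\t2\tChicken Bowl\n"

# read_table defaults: tab delimiter, first row used as the header.
orders_default = pd.read_table(StringIO(tsv_text))

# The same call with the delimiter and header row written explicitly.
orders_explicit = pd.read_csv(StringIO(tsv_text), sep="\t", header=0)

print(list(orders_default.columns))
print(orders_default.equals(orders_explicit))
```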


For the next DataFrame, we will work with a dataset that does not suit the default call to `read_table`. Here the data is a raw text file with no header row and a pipe (`|`) delimiter. We need to pass extra parameters to handle both cases and to name the columns of the DataFrame.

In [4]:
user_columns = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

users = read_table('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/u.user',
                      sep='|', header=None, names=user_columns)

users.head()

,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
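
To see why `header=None` matters, here is a small sketch using made-up pipe-delimited rows (`raw_text` is hypothetical, shaped like the file above). Setting only the separator lets pandas consume the first data row as the header, while adding `header=None` and `names` keeps every row as data.

```python
from io import StringIO

import pandas as pd

# Hypothetical pipe-delimited rows with no header line.
raw_text = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"
user_columns = ["user_id", "age", "gender", "occupation", "zip_code"]

# Only the separator is set: the first data row is consumed as the header.
wrong = pd.read_table(StringIO(raw_text), sep="|")

# header=None keeps every row as data; names supplies the column labels.
right = pd.read_table(StringIO(raw_text), sep="|", header=None, names=user_columns)

print(wrong.shape)   # (2, 5)
print(right.shape)   # (3, 5)
```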


## Series
In pandas, there are two main object types: DataFrames and Series. Previously, we used `read_table` to create a basic DataFrame object. Now we'll use Python's attribute and key access notations to select a single column of the DataFrame. The resulting column object is known as a Series.

In [5]:
from pandas import read_csv


In [6]:
# Create a DataFrame from a csv file
# ufo = read_table('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv', sep=',')
# OR
ufo = read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')

print(ufo.columns)

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')


Calling the `head` method on the `ufo` DataFrame lets us peek at the data contained in the table. For now, we only want to know the column names, so the cell above prints the `columns` attribute of `ufo` instead. In the following code snippet, we create a `Series` object by selecting only the `City` column.

In [7]:
# Dot Notation, similar to object attribute access
# ufo.City

# OR Bracket Notation, similar to dict key lookup
ufo['City']

0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
                 ...         
18236              Grant Park
18237             Spirit Lake
18238             Eagle River
18239             Eagle River
18240                    Ybor
Name: City, Length: 18241, dtype: object
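
As a quick check on the object types involved, the sketch below uses a tiny hand-built DataFrame (`ufo_small` is a hypothetical stand-in for `ufo`). It shows that a single label returns a Series, while a list containing one label returns a one-column DataFrame.

```python
import pandas as pd

# Hypothetical two-row stand-in for the ufo dataset.
ufo_small = pd.DataFrame({"City": ["Ithaca", "Holyoke"], "State": ["NY", "CO"]})

city_series = ufo_small["City"]    # single label -> Series
city_frame = ufo_small[["City"]]   # list of labels -> one-column DataFrame

print(type(city_series).__name__)
print(type(city_frame).__name__)

# Dot Notation and Bracket Notation select the same column here.
print(ufo_small.City.equals(city_series))
```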

## Bracket Notation vs Dot Notation
Bracket Notation is more consistent than Dot Notation attribute retrieval for a few reasons.

### Column Names with Special Characters
With Dot Notation, a Series is retrieved from the DataFrame as an attribute. This only works when the column name is a valid Python identifier. A column name containing a space or a special character, such as `Colors Reported`, cannot be written after the dot at all. With Bracket Notation, the column name is passed as a string, so any name can be used without risk of misinterpretation.
Columns whose names contain special characters can only be selected reliably with Bracket Notation.
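
A minimal sketch of the difference, assuming a one-row DataFrame built by hand (`df` is hypothetical). Python's built-in `compile` is used only to show that the dot form does not even parse.

```python
import pandas as pd

# Hypothetical frame with a column name that contains a space.
df = pd.DataFrame({"City": ["Ithaca"], "Colors Reported": ["RED"]})

# Bracket Notation accepts any string as the column name.
colors = df["Colors Reported"]
print(colors.tolist())

# The dot form is not valid Python, so it fails before pandas is involved.
try:
    compile("df.Colors Reported", "<example>", "eval")
    dot_notation_parses = True
except SyntaxError:
    dot_notation_parses = False

print(dot_notation_parses)
```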

### Column Names with Reserved Words 
As with special characters, Dot Notation leaves the column name open to interpretation. The column name may match a DataFrame method (`head`, `describe`, etc.) or attribute (`shape`, `dtypes`, `columns`, etc.). In that case the existing method or attribute takes precedence, so Dot Notation returns it instead of the column, usually without raising an error at that point. A column named after a Python keyword, such as `class`, cannot be written after the dot at all and raises a `SyntaxError`.
Columns whose names collide with existing methods or attributes are unambiguous when selected with Bracket Notation.
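
The collision can be illustrated with a small hand-built DataFrame whose columns are deliberately named `shape` and `head` (both names are chosen only for this example).

```python
import pandas as pd

# Hypothetical frame whose column names collide with DataFrame members.
df = pd.DataFrame({"shape": ["TRIANGLE", "OVAL"], "head": [1, 2]})

# Dot Notation resolves to the DataFrame attribute and method, not the columns.
print(df.shape)            # the (rows, columns) tuple
print(callable(df.head))   # the head method

# Bracket Notation always returns the columns.
print(df["shape"].tolist())
print(df["head"].tolist())
```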

### Column Names that don't exist
A Series is obtained by referencing a column of a DataFrame. With Dot Notation, existing columns can simply be read as attributes of the DataFrame. However, if the referenced column does not exist yet, Dot Notation cannot create it: reading it raises an `AttributeError`, and assigning to it only sets a plain Python attribute on the DataFrame object (pandas emits a warning) instead of adding a column. With Bracket Notation, assignment creates the new column and attaches it to the DataFrame.
New columns must be created with Bracket Notation.
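
The following sketch, built on a hypothetical one-row DataFrame, walks through what happens when the column does not exist yet. The warning pandas emits on dot assignment is silenced here only to keep the output short.

```python
import warnings

import pandas as pd

# Hypothetical one-row stand-in for the ufo dataset.
df = pd.DataFrame({"City": ["Ithaca"], "State": ["NY"]})

# Reading a column that does not exist fails with either notation.
try:
    df.Location
except AttributeError:
    print("dot read: AttributeError")
try:
    df["Location"]
except KeyError:
    print("bracket read: KeyError")

# Dot assignment sets a plain attribute; no column is added.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.Location = df.City + ", " + df.State
created_by_dot = "Location" in df.columns
print(created_by_dot)

# Bracket assignment creates the column.
df["Location"] = df["City"] + ", " + df["State"]
created_by_bracket = "Location" in df.columns
print(created_by_bracket)
```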


In [8]:
# Append a new column to the ufo dataset
# Bracket Notation is used to create a new Series object and associate with the existing ufo DataFrame
# Dot Notation is used to reference the existing columns from the ufo DataFrame: City and State
ufo['Location'] = ufo.City + ', ' + ufo.State
ufo.head()

,City,Colors Reported,Shape Reported,State,Time,Location
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,"Ithaca, NY"
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,"Willingboro, NJ"
2,Holyoke,,OVAL,CO,2/15/1931 14:00,"Holyoke, CO"
3,Abilene,,DISK,KS,6/1/1931 13:00,"Abilene, KS"
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,"New York Worlds Fair, NY"


## Methods and Attributes of Pandas
Methods and attributes are the functions and data fields associated with the DataFrame class. Both are accessed through a DataFrame object: methods are called with parentheses, while attributes are read without them. In the following code snippets, we will explain the most commonly used methods and attributes of DataFrame.

In [9]:
# Create the DataFrame object to be used for examining Methods and Attributes

movies = read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/imdb_1000.csv')

# head(n=5) is a method that returns the first n rows of the referenced DataFrame, defaulting to 5 rows
# the return value is a new DataFrame object which can be assigned to a new variable.
# With Copy-On-Write enabled, it shares memory with the original until one of the two is modified.
movies.head()

,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


### Copy On Write
Datasets can easily grow to sizes where memory use becomes a concern, so memory optimization should always be a priority. One of the ways pandas manages memory for DataFrames is Copy On Write (CoW). First introduced in version 1.5 as an opt-in mode, and documented as becoming the default behavior in pandas 3.0, CoW makes every DataFrame and Series derived from another object behave as an independent copy. When a method returns a DataFrame, that object can be saved to a new variable. Until one of the two objects is modified, both reference the same data in memory. When either object is modified, CoW ensures that the data seen by the other is not mutated: the modified data is written to a new memory location, and the modified object updates its reference to that location.
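
The behavior described above can be sketched as follows. This example assumes pandas 1.5 or later and a small hand-built DataFrame (`movies_small` is hypothetical). The option is set inside a guard because releases where CoW is already the default may warn about or reject the setting.

```python
import warnings

import pandas as pd

# Opt in to Copy On Write (assumed necessary on pandas 1.5 to 2.x).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except Exception:
        pass  # assumption: CoW is already the default in this release

movies_small = pd.DataFrame({"title": ["A", "B", "C"], "duration": [142, 175, 200]})

top_two = movies_small.head(2)      # shares data with movies_small for now
top_two.loc[0, "duration"] = 999    # the first write triggers the copy

print(movies_small.loc[0, "duration"])  # original value is kept
print(top_two.loc[0, "duration"])       # only the derived object changed
```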

In [10]:
# describe() calculates summary statistics
# It takes 3 parameters:
# percentiles = an iterable of numbers, each between 0 and 1
# By default, percentiles is [.25, .5, .75], which computes the 25%, 50% and 75% percentiles
# include = an iterable of dtypes (or dtype strings) to include in the stats
# By default, include is None and only numerical columns are taken into account
# exclude = an iterable of dtypes (or dtype strings) to exclude, None by default
# Numerical stats returned by describe include count, mean, standard deviation, min, max and percentiles.
movies.describe()

,star_rating,duration
count,979.0,979.0
mean,7.889785,120.979571
std,0.336069,26.21801
min,7.4,64.0
25%,7.6,102.0
50%,7.8,117.0
75%,8.1,134.0
max,9.3,242.0
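
As an illustration of the `percentiles` parameter, the sketch below uses a small made-up DataFrame (`ratings` is hypothetical) and requests the 10th, 50th and 90th percentiles instead of the defaults.

```python
import pandas as pd

# Hypothetical numeric data, loosely shaped like the movies dataset.
ratings = pd.DataFrame(
    {"star_rating": [7.4, 7.8, 8.1, 9.3], "duration": [64, 117, 134, 242]}
)

# Replace the default quartiles with custom percentiles.
summary = ratings.describe(percentiles=[0.1, 0.5, 0.9])

print(list(summary.index))
```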


In [11]:
# the stats returned by the describe method change based on the included dtypes
# Object stats returned by describe include count, unique, top and freq
movies.describe(include=['object'])

,title,content_rating,genre,actors_list
count,979,976,979,979
unique,975,12,16,969
top,The Girl with the Dragon Tattoo,R,Drama,"[u'Daniel Radcliffe', u'Emma Watson', u'Rupert..."
freq,2,460,278,6
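
The `include` and `exclude` parameters can also be combined with other selectors. This is a short sketch on a hypothetical three-row DataFrame: `include='all'` summarizes every column at once, and `exclude=['number']` drops the numerical columns.

```python
import pandas as pd

# Hypothetical mixed-type stand-in for the movies dataset.
movies_small = pd.DataFrame(
    {
        "title": ["A", "B", "B"],
        "genre": ["Crime", "Crime", "Drama"],
        "duration": [142, 175, 200],
    }
)

# Summarize every column; stats that do not apply to a column show up as NaN.
everything = movies_small.describe(include="all")

# Leave out the numerical columns and keep the rest.
text_only = movies_small.describe(exclude=["number"])

print(list(everything.index))
print(list(text_only.columns))
```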


In [12]:
# shape is an attribute of the DataFrame which returns a tuple of (number of rows, number of columns)
movies.shape

(979, 6)

In [13]:
# dtypes is an attribute of the DataFrame which gives the data type of each column (the schema of the dataset)
movies.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object
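
Because `shape` and `dtypes` are attributes, their values can be used directly. A minimal sketch on a hypothetical two-row DataFrame:

```python
import pandas as pd

# Hypothetical numeric stand-in for the movies dataset.
df = pd.DataFrame({"star_rating": [9.3, 9.2], "duration": [142, 175]})

# shape is a plain tuple, so it can be unpacked.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# dtypes is a Series indexed by column name, so one column's dtype can be looked up.
rating_dtype = df.dtypes["star_rating"]
print(rating_dtype)
```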

## DataFrame Transformations

### Columnar Transformations

In [14]:
# Reload the ufo dataset so that its columns have their original names before renaming
ufo = read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')

# check column values
ufo.columns

Index(['City', 'Colors Reported', 'Shape Reported', 'State', 'Time'], dtype='object')

In [15]:
# We use a dict to map each existing column name to its new name. By default, rename
# leaves the original DataFrame untouched and returns a new DataFrame with the new names.
# By setting inplace to True, we tell pandas to modify the existing DataFrame
# instead of returning a new one (the call then returns None).
ufo.rename(columns={'Colors Reported': 'colors_reported', 'Shape Reported': 'shape_reported'}, inplace=True)

ufo.columns

Index(['City', 'colors_reported', 'shape_reported', 'State', 'Time'], dtype='object')
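
For comparison, here is a sketch of the same rename without `inplace=True`, using a hypothetical one-row DataFrame. The original keeps its column names, and the renamed copy must be assigned to a variable to be kept.

```python
import pandas as pd

# Hypothetical one-row stand-in for the ufo dataset.
ufo_small = pd.DataFrame(
    {"City": ["Ithaca"], "Colors Reported": ["RED"], "Shape Reported": ["OVAL"]}
)

# Without inplace=True, rename returns a new DataFrame.
renamed = ufo_small.rename(
    columns={"Colors Reported": "colors_reported", "Shape Reported": "shape_reported"}
)

print(list(ufo_small.columns))  # unchanged
print(list(renamed.columns))    # new names
```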