<p style="font-family: Arial; font-size:3.75em;color:purple; font-style:bold"><br>
Lecture 2: Pandas</p><br>

Pandas is one of the primary data analysis libraries in Python. It offers a number of operations to aid in data exploration, cleaning, and transformation, making it one of the most popular data science tools. To name a few examples, Pandas provides methods for handling missing data and pivoting data, easy sorting and description of data, fast generation of plots, and Boolean indexing for masking operations such as those used in image processing.

Pandas makes use of two different data structures: Series and DataFrames. A Series is a one-dimensional, array-like object that behaves much like a NumPy ndarray. Series provide many ways to index data and support many data types. Because of their similarity to ndarrays, Series are also valid arguments to most NumPy functions. A DataFrame is a flexible two-dimensional data structure that supports heterogeneous data, with labeled axes for rows and columns. We can think of a DataFrame as a container for Series objects, where each column is a Series. These two data structures are what make Pandas so useful and well-liked.

Some of the key features of Pandas are:
* Ingestion and manipulation of heterogeneous data types
* Generating descriptive statistics on data to support exploration and communication
* Data cleaning using built-in pandas functions
* Common data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging and joining multiple datasets using DataFrames
* Working with timestamps and time-series data

Pandas also builds upon numpy and other Python packages to provide easy-to-use data structures and data manipulation functions with integrated indexing.

**Additional Recommended Resources:** 
* <a href="http://pandas.pydata.org/pandas-docs/stable/">Pandas Documentation</a><br>
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas

Now that you know a little background about Pandas, let's get started with some code!

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>

Importing the Pandas Library
</p>

As you saw in the Numpy notes, the "as" keyword in the import statement allows us to give a local name to the pandas package, so that we can refer to it as "pd" rather than "pandas" in subsequent code.

In [1]:
import pandas as pd

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Introduction to Pandas Data Structures</p>
<br>
Pandas uses two different data structures: Series and DataFrames. First, we will explore Series in the code below. 

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Series in Pandas</p>

Pandas Series are one-dimensional labeled arrays. They behave much like ndarrays, so they are valid arguments to most NumPy functions. Series support many data types, including integers, strings, floating point numbers, and arbitrary Python objects. Their axis labels are collectively referred to as the index, and we can get and set values by these index labels. You can think of a Series as a flexible dictionary-like object.
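As a minimal sketch of the two analogies above, a Series can be built from a dictionary and passed straight to a NumPy function. The variable names and values here are made up for illustration and are not part of the lecture code.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: the dictionary keys become the index labels
prices = pd.Series({'apple': 1.0, 'pear': 4.0, 'plum': 9.0})

# NumPy functions accept a Series and return a Series with the same index
roots = np.sqrt(prices)

# Values can be read and set by label, much like a dictionary
prices['pear'] = 16.0

print(roots)
print(prices['pear'])
```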

In the following code, we explore the following methods associated with Series:

**pd.Series([data elements], [index elements])** creates a Series object with the elements specified between the brackets (data in the first set of brackets, indices in the second). Note that the elements in the data set and the index set do not have to be of the same data type. For example, your data set could look something like ['foo', 12, 19.8], and your index set could look like [99.0, 'bar', 3].

**nameOfSeries.index** returns an Index object containing the labels of the Series.

(a) **nameOfSeries.loc[index]** OR (b) **nameOfSeries.loc[[indices]]** returns (a) the data at the specified index or (b) the data and their indices at the specified indices.

(a) **nameOfSeries.iloc[position]** OR (b) **nameOfSeries.iloc[[positions]]** returns (a) the data at the specified position in the data list or (b) the data and their indices at the specified positions of the data list.

*index* **in** *nameOfSeries* returns a boolean value of whether or not the index is in the Series.

In [5]:
# create a Series called sr
sr = pd.Series([10, 'foo', 30, 90.4], ['peach', 'plum', 'dog', 'band'])

In [7]:
# view the Series
sr

peach      10
plum      foo
dog        30
band     90.4
dtype: object

In [9]:
# view the indices
sr.index

Index(['peach', 'plum', 'dog', 'band'], dtype='object')

In [33]:
# access the data at an index
sr['plum']

'foo'

In [21]:
# OR
sr.loc['plum']

'foo'

In [22]:
# access the data at multiple indices
sr[['peach', 'band']]

peach      10
band     90.4
dtype: object

In [23]:
# OR
sr.loc[['peach', 'band']]

peach      10
band     90.4
dtype: object

In [27]:
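# access a data element by position in the list
# (integer positions with plain [] on a labeled index are deprecated in recent pandas; prefer .iloc)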
sr[2]

30

In [30]:
# OR
sr.iloc[2]

30

In [31]:
# access multiple data elements by positions in the list
sr[[0, 1, 2]]

peach     10
plum     foo
dog       30
dtype: object

In [32]:
# OR, using .iloc (this time with positions 1, 2, and 3)
sr.iloc[[1, 2, 3]]

plum     foo
dog       30
band    90.4
dtype: object

In [34]:
# is the index 'peach' in the Series
'peach' in sr

True

In [35]:
# is the index 'card' in the Series
'card' in sr

False

In [36]:
sr

peach      10
plum      foo
dog        30
band     90.4
dtype: object
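The difference between label-based access (.loc) and position-based access (.iloc) matters most when the labels are themselves integers. The following sketch uses a small made-up Series, separate from `sr`, to show this. It also shows that the `in` operator tests the labels, not the stored values.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions
s = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

by_label = s.loc[10]      # the value stored under label 10
by_position = s.iloc[2]   # the third element, whatever its label

has_label = 10 in s       # membership is tested against the index
has_value = 'a' in s      # stored values are not searched

print(by_label, by_position, has_label, has_value)
```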

We can also use basic Python operations like multiplication on a Series. In the code below, we multiply the whole Series by 2. Note that this operation is performed on all data types, even strings, where the string is doubled.

In [37]:
sr * 2

peach        20
plum     foofoo
dog          60
band      180.8
dtype: object

We can also square the numerical values stored at selected indices of a Series. However, if we try to square a value that is a string (such as the one at index 'plum'), we will get an error.

In [40]:
sr[['peach', 'band']] ** 2

peach        100
band     8172.16
dtype: object
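To see the error case without interrupting a notebook run, the operation can be wrapped in a try/except block. This is only a sketch: it rebuilds the same `sr` Series so that it runs on its own, and `error_name` is a name introduced here for illustration.

```python
import pandas as pd

sr = pd.Series([10, 'foo', 30, 90.4], ['peach', 'plum', 'dog', 'band'])

try:
    # 'plum' holds the string 'foo', and Python cannot raise a str to a power
    sr[['plum']] ** 2
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
    print('Cannot square a string value:', exc)
```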

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
DataFrames in Pandas</p>

Pandas DataFrames are flexible 2-dimensional labeled data structures. They also support heterogeneous data and have labeled axes for rows and columns. We can think of a DataFrame as a container for Series objects, where each column is a Series.

Let's look at an example!

<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Creating a DataFrame</p>

There are many ways to create Pandas DataFrames. We often just read and ingest data into a DataFrame, but in this example, we create the DataFrame manually, starting with a dictionary of Series. Note that we are adding another dimension to our data structure, so we need to label each Series. Here, we label the first Series 'a' and the second 'b'.

In [105]:
# create a dictionary called df_data
df_data = {'a' : pd.Series([1., 2., 3., 4.], index=['dog', 'cat', 'fruit', 'bird']),
     'b' : pd.Series([10., 20., 30.], index=['cake', 'fruit', 'ice cream'])}

We then use the following methods to create and explore the DataFrame:

**pandas.DataFrame(data)** creates a DataFrame out of the specified data. This data can be provided in the form of a Numpy ndarray, a Python dictionary, or another DataFrame. There are other parameters in this method, so we advise you to check out the <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html">DataFrame Documentation</a>.

**nameOfDataFrame.index** returns an Index object containing all of the unique row labels in the DataFrame.

**nameOfDataFrame.columns** returns an Index object containing the names of the Series, or columns, that make up the DataFrame.

In [106]:
# create and print the DataFrame
df = pd.DataFrame(df_data)
print(df)

             a     b
bird       4.0   NaN
cake       NaN  10.0
cat        2.0   NaN
dog        1.0   NaN
fruit      3.0  20.0
ice cream  NaN  30.0


In [107]:
# output the DataFrame
df

| | a | b |
|---|---|---|
| bird | 4.0 | NaN |
| cake | NaN | 10.0 |
| cat | 2.0 | NaN |
| dog | 1.0 | NaN |
| fruit | 3.0 | 20.0 |
| ice cream | NaN | 30.0 |


Series 'a' and 'b' don't share all of the same indices. When we print the DataFrame, we see NaN values, which indicate that a Series does not contain a value for that index. Additionally, note the difference in format between printing a DataFrame and simply outputting it.

In [108]:
df.index

Index(['bird', 'cake', 'cat', 'dog', 'fruit', 'ice cream'], dtype='object')

In [109]:
df.columns

Index(['a', 'b'], dtype='object')

We can also create a smaller DataFrame using the same data, but this time specifying which indices we want to be included.

In [110]:
pd.DataFrame(df_data, index=['dog', 'fruit', 'bird'])

| | a | b |
|---|---|---|
| dog | 1.0 | NaN |
| fruit | 3.0 | 20.0 |
| bird | 4.0 | NaN |


By specifying the columns parameter, you can select which columns you'd like the new DataFrame to include. In the code below, we ask the DataFrame to include column 'e', which doesn't exist in the original dictionary. Because of this, a new column 'e' will be created with all its entries as NaN.

In [111]:
pd.DataFrame(df_data, index=['dog', 'fruit', 'bird'], columns=['a', 'e'])

| | a | e |
|---|---|---|
| dog | 1.0 | NaN |
| fruit | 3.0 | NaN |
| bird | 4.0 | NaN |


<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Creating a DataFrame from a list of Python dictionaries</p>

Another way to create a DataFrame is to use a list of Python dictionaries as your data. In the code below, we create a list of Python dictionaries called 'df_data2' and use this to make a DataFrame called 'df2'. We then use many of the same techniques as above to explore the DataFrame.

Please see <a href="https://docs.python.org/3/tutorial/datastructures.html#dictionaries">this link</a> for a reminder on Python dictionaries.

In [112]:
# create a list of Python dictionaries

df_data2 = [{'apple': 5, 'cherry': 10}, {'peter': 1, 'emily': 2, 'brian': 6}]

In [113]:
# labels get created as column headers

pd.DataFrame(df_data2)

| | apple | brian | cherry | emily | peter |
|---|---|---|---|---|---|
| 0 | 5.0 | NaN | 10.0 | NaN | NaN |
| 1 | NaN | 6.0 | NaN | 2.0 | 1.0 |


In [114]:
# rename the rows from 0 and 1 to 'blue' and 'yellow' by specifying the index parameter

pd.DataFrame(df_data2, index=['blue', 'yellow'])

| | apple | brian | cherry | emily | peter |
|---|---|---|---|---|---|
| blue | 5.0 | NaN | 10.0 | NaN | NaN |
| yellow | NaN | 6.0 | NaN | 2.0 | 1.0 |


In [115]:
# create a smaller DataFrame by specifying the columns

pd.DataFrame(df_data2, columns=['cherry', 'emily','brian'])

| | cherry | emily | brian |
|---|---|---|---|
| 0 | 10.0 | NaN | NaN |
| 1 | NaN | 2.0 | 6.0 |


<p style="font-family: Arial; font-size:1.25em;color:#2462C0; font-style:bold">
Exploring some basic DataFrame operations</p>

Now let's look into how we can get data out of a DataFrame with some basic DataFrame operations. In the following code, we will use these DataFrame methods:

**nameOfDataFrame.pop(columnName)** removes and returns the specified column from the DataFrame. If you want to save the popped column, you must store it as a new variable.

**del** *nameOfDataFrame[columnName]* deletes a column from a DataFrame permanently.

**nameOfDataFrame.insert(location, newColumnName, values)** inserts a column with specified name into the DataFrame at the specified location with specified values.

In [116]:
df

| | a | b |
|---|---|---|
| bird | 4.0 | NaN |
| cake | NaN | 10.0 |
| cat | 2.0 | NaN |
| dog | 1.0 | NaN |
| fruit | 3.0 | 20.0 |
| ice cream | NaN | 30.0 |


In [117]:
# display only column 'a' of the DataFrame

df['a']

bird         4.0
cake         NaN
cat          2.0
dog          1.0
fruit        3.0
ice cream    NaN
Name: a, dtype: float64

In [118]:
# create a new column 'c' by adding 'a' and 'b' together

df['c'] = df['a'] + df['b']
df

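| | a | b | c |
|---|---|---|---|
| bird | 4.0 | NaN | NaN |
| cake | NaN | 10.0 | NaN |
| cat | 2.0 | NaN | NaN |
| dog | 1.0 | NaN | NaN |
| fruit | 3.0 | 20.0 | 23.0 |
| ice cream | NaN | 30.0 | NaN |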


Note that any arithmetic involving NaN yields NaN, so the resulting values in 'c' are NaN wherever 'a' or 'b' is missing. For index 'fruit', however, both 'a' and 'b' hold floating point values and can be added together.
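If missing entries should instead be treated as zero, one option is the `add` method with its `fill_value` parameter. The sketch below rebuilds the two Series from the dictionary above so that it runs on its own. The names `plain_sum` and `filled_sum` are introduced here for illustration.

```python
import pandas as pd

a = pd.Series([1., 2., 3., 4.], index=['dog', 'cat', 'fruit', 'bird'])
b = pd.Series([10., 20., 30.], index=['cake', 'fruit', 'ice cream'])

# The + operator aligns on labels and yields NaN where either side is missing
plain_sum = a + b

# add(..., fill_value=0) substitutes 0 for the side that is missing
filled_sum = a.add(b, fill_value=0)

print(plain_sum)
print(filled_sum)
```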

In [119]:
# create a new column 'd' of boolean values indicating whether or not an index's value in 'a' is greater than 2.0
# comparisons involving NaN evaluate to False

df['d'] = df['a'] > 2.0
df

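| | a | b | c | d |
|---|---|---|---|---|
| bird | 4.0 | NaN | NaN | True |
| cake | NaN | 10.0 | NaN | False |
| cat | 2.0 | NaN | NaN | False |
| dog | 1.0 | NaN | NaN | False |
| fruit | 3.0 | 20.0 | 23.0 | True |
| ice cream | NaN | 30.0 | NaN | False |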


In [120]:
# set cee equal to the 'c' column in the DataFrame

cee = df.pop('c')

In [121]:
cee

bird          NaN
cake          NaN
cat           NaN
dog           NaN
fruit        23.0
ice cream     NaN
Name: c, dtype: float64

In [122]:
# the pop method has removed 'c' from df

df

| | a | b | d |
|---|---|---|---|
| bird | 4.0 | NaN | True |
| cake | NaN | 10.0 | False |
| cat | 2.0 | NaN | False |
| dog | 1.0 | NaN | False |
| fruit | 3.0 | 20.0 | True |
| ice cream | NaN | 30.0 | False |


In [123]:
# delete column 'b' from the DataFrame

del df['b']

In [124]:
df

| | a | d |
|---|---|---|
| bird | 4.0 | True |
| cake | NaN | False |
| cat | 2.0 | False |
| dog | 1.0 | False |
| fruit | 3.0 | True |
| ice cream | NaN | False |


In [125]:
# insert a new column that is a copy of column 'a'

df.insert(2, 'copy_of_a', df['a'])
df

| | a | d | copy_of_a |
|---|---|---|---|
| bird | 4.0 | True | 4.0 |
| cake | NaN | False | NaN |
| cat | 2.0 | False | 2.0 |
| dog | 1.0 | False | 1.0 |
| fruit | 3.0 | True | 3.0 |
| ice cream | NaN | False | NaN |


In [103]:
# add a new column holding the first three values of 'a' (positions 0 to 2); the remaining rows become NaN

df['a_upper_half'] = df['a'][:3]
df

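| | a | d | copy_of_a | a_upper_half |
|---|---|---|---|---|
| bird | 4.0 | True | 4.0 | 4.0 |
| cake | NaN | False | NaN | NaN |
| cat | 2.0 | False | 2.0 | 2.0 |
| dog | 1.0 | False | 1.0 | NaN |
| fruit | 3.0 | True | 3.0 | NaN |
| ice cream | NaN | False | NaN | NaN |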


Note that while both methods above (df.insert and df['col']) allowed us to insert new columns into the DataFrame, only df.insert lets us specify which position we want the column to be in.
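The contrast can be sketched on a small, made-up DataFrame (the names `frame`, 'x', 'y', 'z' and 'w' are illustrative only). It also shows that a Series assigned as a column is aligned by row label, which is why rows without a matching label end up as NaN.

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3], 'z': [7, 8, 9]},
                     index=['r1', 'r2', 'r3'])

# insert() places the new column at the requested position (here, second)
frame.insert(1, 'y', [4, 5, 6])

# Plain assignment always appends the new column at the end;
# the two-row Series is aligned by label, so 'r3' receives NaN
frame['w'] = frame['x'].iloc[:2]

print(frame)
```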

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Case Study: Movie Data Analysis</p>
<br>This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*. 

## Download the Dataset

Please note that **you will need to download the dataset**. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform.

Here are the links to the data source and location:
* **Data Source:** MovieLens web site (filename: ml-20m.zip)
* **Location:** https://grouplens.org/datasets/movielens/

Once the download completes, please make sure the data files are in a directory called *movielens* in your *Week-3-pandas* folder. 

Let us look at the files in this dataset using the UNIX command ls.


In [None]:
# Note: Adjust the name of the folder to match your local directory

!ls ./movielens

In [None]:
!cat ./movielens/movies.csv | wc -l

In [None]:
!head -5 ./movielens/ratings.csv
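The shell commands above assume a UNIX-like environment. On systems without `ls`, `wc` and `head`, roughly the same checks can be done in plain Python. This is a sketch only: the helper names `count_lines` and `head` are made up here, and the folder path is assumed to match the one used above.

```python
from itertools import islice
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a text file (similar to `wc -l`)."""
    with open(path, encoding='utf-8') as f:
        return sum(1 for _ in f)


def head(path, n=5):
    """Return the first n lines of a text file (similar to `head -n`)."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in islice(f, n)]


data_dir = Path('./movielens')  # adjust to match your local directory
if data_dir.is_dir():
    print(sorted(p.name for p in data_dir.iterdir()))  # like `ls`
    print(count_lines(data_dir / 'movies.csv'))
    print(head(data_dir / 'ratings.csv'))
```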

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold">
Use Pandas to Read the Dataset<br>
</p>
<br>
In this notebook, we will be using three CSV files:
* **ratings.csv :** *userId*,*movieId*,*rating*, *timestamp*
* **tags.csv :** *userId*,*movieId*, *tag*, *timestamp*
* **movies.csv :** *movieId*, *title*, *genres* <br>

Using the *read_csv* function in pandas, we will ingest these three files.

In [None]:
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)

In [None]:
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()

In [None]:
# timestamp holds Unix epoch seconds, so it is read here as a plain integer column
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
ratings.head()

In [None]:
# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']
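If the files have not been downloaded yet, the same steps can be tried on a small in-memory sample. The rows below are invented for illustration and are not taken from MovieLens. Only the column names follow the ratings.csv layout described above.

```python
import io

import pandas as pd

# Made-up rows using the ratings.csv column layout
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,3.5,1112486027\n"
    "1,20,4.0,1112484676\n"
    "2,10,5.0,1094785740\n"
)

sample = pd.read_csv(sample_csv, sep=',')
del sample['timestamp']  # drop the column in place, as done above

print(sample)
```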

<h1 style="font-size:2em;color:#2467C0">Data Structures </h1>

<h1 style="font-size:1.5em;color:#2467C0">Series</h1>

In [None]:
# Extract the 0th row: notice that it is in fact a Series

row_0 = tags.iloc[0]
type(row_0)

In [None]:
print(row_0)

In [None]:
row_0.index

In [None]:
row_0['userId']

In [None]:
'rating' in row_0

In [None]:
row_0.name

In [None]:
row_0 = row_0.rename('first_row')
row_0.name

<h1 style="font-size:1.5em;color:#2467C0">DataFrames </h1>

In [None]:
tags.head()

In [None]:
tags.index

In [None]:
tags.columns

In [None]:
# Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]

<h1 style="font-size:2em;color:#2467C0">Descriptive Statistics</h1>

Let's look at how the ratings are distributed! 

In [None]:
ratings['rating'].describe()

In [None]:
ratings.describe()

In [None]:
ratings['rating'].mean()

In [None]:
ratings.mean()

In [None]:
ratings['rating'].min()

In [None]:
ratings['rating'].max()

In [None]:
ratings['rating'].std()

In [None]:
ratings['rating'].mode()

In [None]:
ratings.corr()

In [None]:
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()

In [None]:
filter_2 = ratings['rating'] > 0
filter_2.all()
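The two filters above use `any()` and `all()` as sanity checks: the first asks whether at least one value violates an upper bound, the second whether every value satisfies a lower bound. A minimal sketch with made-up scores:

```python
import pandas as pd

scores = pd.Series([0.5, 3.0, 5.0])  # hypothetical ratings

above_max = (scores > 5).any()     # is any score greater than 5?
all_positive = (scores > 0).all()  # is every score greater than 0?

print(above_max, all_positive)
```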

<h1 style="font-size:2em;color:#2467C0">Data Cleaning: Handling Missing Data</h1>

In [None]:
movies.shape

In [None]:
# does any column contain a NULL value?

movies.isnull().any()

That's nice! No NULL values!

In [None]:
ratings.shape

In [None]:
# does any column contain a NULL value?

ratings.isnull().any()

That's nice! No NULL values!

In [None]:
tags.shape

In [None]:
# does any column contain a NULL value?

tags.isnull().any()

We have some tags which are NULL.

In [None]:
tags = tags.dropna()

In [None]:
# Check again: does any column contain a NULL value?

tags.isnull().any()

In [None]:
tags.shape

That's nice! No NULL values! Notice that the number of rows has decreased.
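The same check-and-drop pattern can be sketched on a tiny, made-up DataFrame. Note that `isnull().any()` reports one boolean per column, while passing `axis=1` reports one boolean per row.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing tag
toy = pd.DataFrame({'movieId': [1, 2, 3],
                    'tag': ['funny', np.nan, 'dark']})

has_null = toy.isnull().any()         # per column
null_rows = toy.isnull().any(axis=1)  # per row
cleaned = toy.dropna()                # rows containing any NULL are removed

print(has_null)
print(toy.shape, cleaned.shape)
```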

<h1 style="font-size:2em;color:#2467C0">Data Visualization</h1>

In [None]:
%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))

In [None]:
ratings.boxplot(column='rating', figsize=(15,20))

<h1 style="font-size:2em;color:#2467C0">Slicing Out Columns</h1>
 

In [None]:
tags['tag'].head()

In [None]:
movies[['title','genres']].head()

In [None]:
ratings[-10:]

In [None]:
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]

In [None]:
tag_counts[:10].plot(kind='bar', figsize=(15,10))
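`value_counts()` returns the counts sorted from most to least frequent, which is why the first ten entries are the most common tags and the last ten the rarest. A small sketch with invented tags:

```python
import pandas as pd

toy_tags = pd.Series(['funny', 'dark', 'funny', 'classic', 'funny', 'dark'])

counts = toy_tags.value_counts()  # index holds the tag, values hold the count

print(counts)
```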

<h1 style="font-size:2em;color:#2467C0">Filters for Selecting Rows</h1>

In [None]:
is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]

In [None]:
is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]

In [None]:
movies[is_animation].head(15)
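A filter is simply a boolean Series with one entry per row. Indexing the DataFrame with it keeps the rows where the entry is True. The sketch below uses invented titles and passes `regex=False` so that the search text is treated literally. That choice is an assumption made here, not something the lecture code does.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    'title': ['A (1995)', 'B (1999)', 'C (2001)'],
    'genres': ['Animation|Comedy', 'Drama', 'Animation'],
})

# One True/False value per row
is_animation = toy_movies['genres'].str.contains('Animation', regex=False)

animated = toy_movies[is_animation]
print(animated)
```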

<h1 style="font-size:2em;color:#2467C0">Group By and Aggregate </h1>

In [None]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

In [None]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()

In [None]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

In [None]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
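The count and the mean per movie can also be computed in a single pass with `agg`. This is a sketch on made-up ratings, and `per_movie` is a name introduced here.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2],
    'rating': [4.0, 5.0, 3.0, 3.0, 4.5],
})

# One row per movieId, with one column per aggregate
per_movie = toy_ratings.groupby('movieId')['rating'].agg(['count', 'mean'])

print(per_movie)
```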

<h1 style="font-size:2em;color:#2467C0">Merge Dataframes</h1>

In [None]:
tags.head()

In [None]:
movies.head()

In [None]:
t = movies.merge(tags, on='movieId', how='inner')
t.head()

More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
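To illustrate what `how='inner'` does, the sketch below merges two small, invented tables and compares the result with `how='left'`. An inner merge keeps only the movieId values present in both tables, while a left merge keeps every row of the left table and fills missing matches with NaN.

```python
import pandas as pd

toy_movies = pd.DataFrame({'movieId': [1, 2, 3],
                           'title': ['A', 'B', 'C']})
toy_tags = pd.DataFrame({'movieId': [1, 1, 3, 4],
                         'tag': ['funny', 'dark', 'classic', 'orphan']})

inner = toy_movies.merge(toy_tags, on='movieId', how='inner')
left = toy_movies.merge(toy_tags, on='movieId', how='left')

print(inner)
print(left)
```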

<p style="font-family: Arial; font-size:1.75em;color:#2462C0; font-style:bold"><br>


Combine aggregation, merging, and filters to get useful analytics
</p>

In [None]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()

In [None]:
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()

In [None]:
is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]

In [None]:
is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]

In [None]:
box_office[is_comedy & is_highly_rated][-5:]

<h1 style="font-size:2em;color:#2467C0">Vectorized String Operations</h1>


In [None]:
movies.head()

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>

Split 'genres' into multiple columns

<br> </p>

In [None]:
movie_genres = movies['genres'].str.split('|', expand=True)

In [None]:
movie_genres[:10]

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>

Add a new column for comedy genre flag

<br> </p>

In [None]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [None]:
movie_genres[:10]

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>

Extract year from title e.g. (1995)

<br> </p>

In [None]:
# use a raw string so the backslashes reach the regex engine unchanged
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False)

In [None]:
movies.tail()
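The pattern used above captures whatever sits inside the last pair of parentheses, so the result is a string and may not always be a year. As a hedged alternative, the sketch below uses a stricter pattern (an assumption made here, not the lecture's) that accepts only four digits in parentheses at the end of the title, then converts the result to numbers. The titles are invented.

```python
import pandas as pd

titles = pd.Series([
    'First Film (1995)',
    'Second Film (Alternate Title) (2001)',
    'No Year Given',
])

# Capture exactly four digits inside the final parentheses
year = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

# Titles without a match become NaN; the rest become numbers
year_num = pd.to_numeric(year)

print(year)
print(year_num)
```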

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold"><br>

More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
<br> </p>

<h1 style="font-size:2em;color:#2467C0">Parsing Timestamps</h1>

Timestamps are common in sensor data or other time series datasets.
Let us revisit the *tags.csv* dataset and read the timestamps!


In [None]:
tags = pd.read_csv('./movielens/tags.csv', sep=',')

In [None]:
tags.dtypes

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Unix time / POSIX time / epoch time records 
time in seconds <br> since midnight Coordinated Universal Time (UTC) of January 1, 1970
</p>

In [None]:
tags.head(5)

In [None]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Data Type datetime64[ns] maps to either &lt;M8[ns] or &gt;M8[ns] depending on the byte order of the hardware

</p>

In [None]:

tags['parsed_time'].dtype

In [None]:
tags.head(2)
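The effect of `unit='s'` can be checked on a few hand-picked epoch values. In this sketch (the numbers are chosen for illustration, not read from tags.csv), 0 is the epoch itself, 86400 is one day later, and 1422748800 is midnight UTC on 1 February 2015.

```python
import pandas as pd

epoch_seconds = pd.Series([0, 86400, 1422748800])

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
parsed = pd.to_datetime(epoch_seconds, unit='s')

print(parsed)
```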

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Selecting rows based on timestamps
</p>

In [None]:
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Sorting the table using the timestamps
</p>

In [None]:
tags.sort_values(by='parsed_time', ascending=True)[:10]
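Selection and sorting on a datetime column can be sketched on a small, made-up table. Here the cutoff is written as an explicit `pd.Timestamp` rather than a string, and the sort is reversed to show the newest rows first. Both are illustrative choices, not part of the lecture code.

```python
import pandas as pd

events = pd.DataFrame({
    'tag': ['old', 'new', 'newest'],
    'parsed_time': pd.to_datetime(['2014-12-31', '2015-02-02', '2015-03-01']),
})

cutoff = pd.Timestamp('2015-02-01')
recent = events[events['parsed_time'] > cutoff]

newest_first = events.sort_values(by='parsed_time', ascending=False)

print(recent)
print(newest_first)
```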

<h1 style="font-size:2em;color:#2467C0">Average Movie Ratings over Time </h1>

## Are movie ratings related to the year of launch?

In [None]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()

In [None]:
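joined = movies.merge(average_rating, on='movieId', how='inner')
# print explicitly, since only the last expression in a cell is displayed
print(joined.head())
# correlation is only defined for numeric columns, so select them explicitly
joined[['movieId', 'rating']].corr()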

In [None]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]

In [None]:
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)

<p style="font-family: Arial; font-size:1.35em;color:#2462C0; font-style:bold">

Do some years look better for box office movies than others? <br><br>

Does any data point seem like an outlier in some sense?

</p>