# Intro to Data Analysis with pandas

## Overview

### Objectives

+ Know why pandas is suitable for data analysis in Python.
+ Load CSV data into a pandas DataFrame.
+ Identify a DataFrame as a two-dimensional data structure with an **index**, **columns**, and **values**.
+ Identify a Series as a single dimensional data structure with an **index** and **values**.
+ Know the difference between the **index** and **values**.
+ Show a sample of a dataset or specific rows.
+ Describe and get information about the whole dataset.
+ Check and change data types.
+ Check for missing values.

### Resources

+ [Official Documentation](http://pandas.pydata.org/pandas-docs/stable/)
+ [Package Overview](http://pandas.pydata.org/pandas-docs/stable/overview.html)
+ [Intro to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)

## Welcome to ....
![][1]


### What is pandas?
pandas is one of the most popular open source data exploration libraries currently available. It gives its users the power to explore, manipulate, query, aggregate, and visualize **tabular** data. Tabular data is two-dimensional, with rows and columns; in other words, a table.

### Why pandas and not xyz?
Many other tools now offer much of what pandas does. Even so, several aspects of pandas make it an attractive choice for data analysis, and its user base continues to grow quickly.

* It's a Python library and integrates well with the other popular data science libraries such as numpy, scikit-learn, statsmodels, matplotlib and seaborn.
* It is nearly self-contained in that lots of functionality is built into one package. This contrasts with R, where many packages are needed to obtain the same functionality.
* The community is excellent. Looking at Stack Overflow, for example, there are [many tens of thousands of][2] pandas questions. If you need help, you are very likely to find it quickly.

### Why is it named after an East Asian bear?

Wes McKinney began developing the pandas library in 2008 at a hedge fund named AQR. In finance, tabular data is often called 'panel data', which smashed together becomes pandas. If you are really interested in the history, you can hear it from the creator [himself][3].

### Python already has data structures to handle data, why do we need another one?

Even though Python is a high-level language, its primary built-in data structures, lists and dictionaries, do not easily lend themselves to tabular data analysis. Even simple questions about a table require writing explicit loops.
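
As a rough illustration of that point, the sketch below answers the same question (the average age) first with plain Python and then with pandas. The records and the names `records` and `people` are made up for this example.

```python
import pandas as pd

# Hypothetical records; the names and numbers are made up for illustration
records = [
    {"name": "Ana", "age": 30, "city": "Lima"},
    {"name": "Ben", "age": 25, "city": "Oslo"},
    {"name": "Cleo", "age": 35, "city": "Lima"},
]

# Plain Python: pull the values out of each dictionary, then aggregate by hand
ages = [row["age"] for row in records]
mean_plain = sum(ages) / len(ages)

# pandas: the table is one object and the column has a mean method
people = pd.DataFrame(records)
mean_pandas = people["age"].mean()

print(mean_plain, mean_pandas)
```

Both approaches give the same number; the difference is that the pandas version keeps working in the same style as the questions get more complicated.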

### pandas is built directly on numpy

[numpy][4] ('numerical Python') is the most popular third-party Python library for scientific computing and forms the foundation for dozens of others, including pandas. numpy's primary data structure is an n-dimensional array which is much more powerful than a Python list and with much better performance.

By default, most of the data in pandas is stored in numpy arrays. That said, it isn't necessary to know much about numpy when learning pandas. You can think of pandas as a higher-level, easier-to-use interface for doing data analysis than numpy. It is a good idea to eventually learn numpy, but for most tasks, pandas will be the right tool.
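
The short sketch below is one way to see the relationship between the three kinds of objects. The `prices` values are made up; `to_numpy` is the pandas method that hands back the underlying values as a numpy array.

```python
import numpy as np
import pandas as pd

prices = [1.5, 2.0, 3.25]  # made-up numbers

# A Python list needs an explicit loop (or comprehension) for element-wise math
doubled_list = [p * 2 for p in prices]

# A numpy array applies the operation to every element at once
doubled_array = np.array(prices) * 2

# A pandas Series holds the same kind of array and adds an index on top
s = pd.Series(prices)
underlying = s.to_numpy()

print(doubled_list)
print(doubled_array)
print(type(underlying))
```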

### numpy tutorial in appendix

Although it is not necessary to understand numpy to perform data analysis with pandas, it is a major piece of the data science ecosystem in Python and it can be used alongside pandas. A thorough numpy tutorial is available in Appendix A.

## pandas operates on tabular (table) data

There are numerous formats for data, such as XML, JSON, raw bytes, and many others. For our purposes, we will only examine what most people think of when they think of data: a table. pandas is built for analyzing this tabular, rectangular, deceptively simple kind of data. pandas can read many different data formats, but all of them are converted to tabular data.

### The DataFrame and Series

The DataFrame and Series are the two primary pandas objects that we will be using throughout this course.

* **DataFrame** - A two-dimensional data structure that looks like any other rectangular table of data you have seen with rows and columns.
* **Series** - A single dimension of data. It is analogous to a single column of data or a one dimensional array.
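
To make the two definitions concrete, here is a minimal sketch that builds a tiny DataFrame by hand and pulls one column out of it. The values are made up and only loosely follow the shape of the bikes data used later.

```python
import pandas as pd

# Hypothetical trip data, made up for illustration
trips = pd.DataFrame({
    "usertype": ["Subscriber", "Customer", "Subscriber"],
    "tripduration": [993, 623, 1040],
})

durations = trips["tripduration"]  # a single column is a Series

print(trips.shape)      # (number of rows, number of columns)
print(durations.shape)  # (number of values,)
print(trips.ndim, durations.ndim)
```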

## Import pandas and read in data with the `read_csv` function

By convention, pandas is imported and aliased as `pd`. We will read in the `bikes` dataset with the `read_csv` function. Its first parameter is the location of the file as a string, which may be given relative to the current directory. The code in this notebook expects the `data` directory to sit in the same directory as the notebook, so the path to the file is `data/bikes.csv`. If your `data` directory is one level above the notebook instead, begin the path with two dots (`../data/bikes.csv`); the two dots are interpreted as the directory immediately above the current one.
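
If you want to try `read_csv` without any file on disk, the sketch below feeds it an in-memory text buffer instead of a path. The two rows are a hand-typed stand-in for a CSV file, not the real bikes data.

```python
import io
import pandas as pd

# Stand-in for a file on disk: two hand-typed rows with a header line
csv_text = """trip_id,usertype,tripduration
7147,Subscriber,993
7524,Subscriber,623
"""

# read_csv accepts a path such as 'data/bikes.csv' or any file-like object
trips = pd.read_csv(io.StringIO(csv_text))

print(type(trips))
print(trips.shape)
```

The first line of the text becomes the column names, and pandas supplies a default integer index starting at 0.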

[1]: images/pandas_logo.png
[2]: http://stackoverflow.com/questions/tagged/pandas
[3]: https://www.youtube.com/watch?v=kHdkFyGCxiY
[4]: http://www.numpy.org/


In [2]:
import pandas as pd


In [3]:
bikes = pd.read_csv('data/bikes.csv')
type(bikes)

pandas.core.frame.DataFrame

### Display DataFrame in Jupyter Notebook
We assigned the output from the `read_csv` function to the `bikes` variable which now refers to our DataFrame object. Let's get a visual display of our DataFrame by writing the variable name as the last line in a code cell.

In [3]:
bikes

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.881050,-87.616970,11.0,Michigan Ave & Oak St,41.900960,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.883380,-87.641170,31.0,Wells St & Walton St,41.899930,-87.634430,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.881320,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.631890,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50084,17534938,Subscriber,Male,2017-12-30 13:07:00,2017-12-30 13:34:00,1625,State St & Pearson St,41.897448,-87.628722,27.0,Clark St & Elm St,41.902973,-87.631280,27.0,5.0,10.0,16.1,-9999.0,partlycloudy
50085,17534969,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),41.830661,-87.647172,16.0,Union Ave & Root St,41.819102,-87.643278,11.0,5.0,10.0,16.1,-9999.0,partlycloudy
50086,17534972,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,5.0,10.0,16.1,-9999.0,partlycloudy
50087,17535645,Subscriber,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,41.885637,-87.641823,23.0,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,7.0,10.0,11.5,-9999.0,partlycloudy


### Default output
pandas defaults to outputting 60 rows and 20 columns. These display options (and many others) can be changed. This will be covered later.
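
As a small preview, the display limits can be read and changed with `pd.get_option`, `pd.set_option`, and `pd.option_context`. The values 40 and 8 below are arbitrary choices for the sketch.

```python
import pandas as pd

# Inspect the current limits
before = pd.get_option("display.max_rows")
print(before)
print(pd.get_option("display.max_columns"))

# Change a limit for the rest of the session
pd.set_option("display.max_columns", 40)

# Or change it only inside a with-block
with pd.option_context("display.max_rows", 8):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")  # restored once the block ends
pd.reset_option("display.max_columns")     # back to the default
```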

## Our first methods - `head`, `tail`, and `sample`
A very useful and simple method is `head`, which by default returns the first 5 rows of the DataFrame. This avoids long default output and is something I highly recommend when doing data analysis within a notebook. The `tail` method returns the last 5 rows by default, and the `sample` method returns randomly selected rows (one row by default).

In [4]:
bikes.head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


In [5]:
bikes.tail(3)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
50086,17534972,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,5.0,10.0,16.1,-9999.0,partlycloudy
50087,17535645,Subscriber,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,41.885637,-87.641823,23.0,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,7.0,10.0,11.5,-9999.0,partlycloudy
50088,17536246,Subscriber,Male,2017-12-31 15:22:00,2017-12-31 15:26:00,214,Clarendon Ave & Leland Ave,41.967968,-87.650001,15.0,Clifton Ave & Lawrence Ave,41.968812,-87.657659,15.0,10.9,10.0,15.0,-9999.0,partlycloudy


In [6]:
# Return 10 randomly selected rows
bikes.sample(10)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
29276,11166434,Subscriber,Female,2016-08-06 15:31:09,2016-08-06 15:50:13,1144,Lakefront Trail & Bryn Mawr Ave,41.984037,-87.65231,19.0,Lake Shore Dr & Belmont Ave,41.940775,-87.639192,19.0,82.9,10.0,4.6,-9999.0,partlycloudy
22333,8782215,Subscriber,Male,2016-02-20 18:39:00,2016-02-20 18:58:00,1131,Larrabee St & Oak St,41.900219,-87.642985,15.0,Clinton St & 18th St,41.85795,-87.640826,15.0,45.0,10.0,6.9,-9999.0,partlycloudy
10401,4468701,Subscriber,Female,2015-01-22 18:33:00,2015-01-22 18:39:00,319,Lincoln Ave & Diversey Pkwy,41.932225,-87.658617,15.0,Halsted St & Wrightwood Ave,41.929143,-87.649077,15.0,30.9,10.0,5.8,-9999.0,cloudy
16049,6757498,Subscriber,Male,2015-08-07 17:00:00,2015-08-07 17:18:00,1090,Ashland Ave & Wrightwood Ave,41.92883,-87.668507,15.0,Ashland Ave & Grand Ave,41.891072,-87.666611,15.0,80.1,10.0,10.4,-9999.0,mostlycloudy
41927,15036744,Subscriber,Female,2017-07-09 10:37:35,2017-07-09 10:49:11,696,Wilton Ave & Diversey Pkwy,41.932418,-87.652705,27.0,Greenview Ave & Fullerton Ave,41.92533,-87.6658,15.0,80.1,10.0,9.2,-9999.0,partlycloudy
45720,16279169,Subscriber,Male,2017-09-07 08:09:17,2017-09-07 08:27:39,1102,Michigan Ave & Washington St,41.883893,-87.624649,43.0,Ogden Ave & Race Ave,41.891795,-87.658751,15.0,59.0,10.0,8.1,-9999.0,mostlycloudy
34898,12884531,Subscriber,Male,2016-12-06 06:48:17,2016-12-06 07:08:40,1223,Lake Shore Dr & Wellington Ave,41.936669,-87.636794,15.0,Streeter Dr & Grand Ave,41.892278,-87.612043,47.0,28.0,0.5,4.6,-9999.0,fog
11392,4875763,Subscriber,Male,2015-04-15 18:06:00,2015-04-15 18:18:00,737,Loomis St & Jackson Blvd,41.877945,-87.662007,11.0,Loomis St & Lexington St,41.872187,-87.661501,15.0,57.0,10.0,12.7,-9999.0,mostlycloudy
39530,14244733,Subscriber,Male,2017-05-30 20:32:27,2017-05-30 20:57:44,1517,Lake Shore Dr & Diversey Pkwy,41.932684,-87.63625,15.0,Dearborn St & Monroe St,41.88132,-87.629521,27.0,64.9,10.0,4.6,-9999.0,partlycloudy
14461,6111878,Subscriber,Female,2015-07-08 16:35:00,2015-07-08 16:43:00,453,Dearborn St & Monroe St,41.88132,-87.629521,27.0,Michigan Ave & Lake St,41.886024,-87.624117,31.0,64.0,10.0,8.1,0.0,rain


### First and Last `n` rows
Both the `head` and `tail` methods take a single integer parameter `n`, which controls the number of rows to return. 
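
The sketch below shows the `n` parameter on a made-up ten-row DataFrame, together with `sample`, which also accepts `n`. Passing `random_state` to `sample` is optional; it is used here only so that the random draw can be repeated.

```python
import pandas as pd

# Hypothetical ten-row DataFrame used only for illustration
df = pd.DataFrame({"value": range(10)})

first_three = df.head(3)  # rows 0, 1, 2
last_two = df.tail(2)     # rows 8, 9

# random_state fixes the random draw so the same rows come back on each run
draw_a = df.sample(n=4, random_state=0)
draw_b = df.sample(n=4, random_state=0)

print(first_three.index.tolist(), last_two.index.tolist())
print(draw_a.equals(draw_b))
```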

In [7]:
bikes.head(2)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy


## Components of a DataFrame - columns, index, and data
The DataFrame is composed of three separate components that you must know. The **columns**, the **index**, and the **data**. These terms will be used throughout the course and understanding them is vital to your ability to use pandas. Take a look at the following graphic of our `bikes` DataFrame stylized to put emphasis on each component.

![](images/df_components.png)

* The **index** provides a label for each row
* The **columns** provide a label for each column
* The **index** is also referred to as the **row names/labels**
* The **columns** are also referred to as the **column names/labels** or the **column index**
* An individual element of the index is referred to as an **index label/name** or **row label/name**
* An individual element of the columns is a **column name/label**
* The index and the columns are always in **bold font**
* Collectively the index and the columns are known as the **axes** (or individually as an **axis**)
* pandas uses integers to refer to each axis; 0 for the index and 1 for the columns. This is borrowed directly from numpy
* The actual **data** is always in normal font
* The **data** is also referred to as the **values**
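
Each of the three components can be pulled out of a DataFrame directly. The sketch below uses a small made-up DataFrame with names as row labels; `index`, `columns`, and `to_numpy` are the pandas attributes and method for the three parts.

```python
import pandas as pd

# Hypothetical data with names as row labels
df = pd.DataFrame(
    {"age": [30, 2], "score": [4.6, 8.3]},
    index=["Jane", "Niko"],
)

print(df.index)       # the row labels
print(df.columns)     # the column labels
print(df.to_numpy())  # the values as a numpy array

# axis=0 works down the index (one result per column),
# axis=1 works across the columns (one result per row)
col_totals = df.sum(axis=0)
row_totals = df.sum(axis=1)
print(col_totals)
print(row_totals)
```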

## What type of object is `bikes`?
As we said previously, `bikes` is a DataFrame. Let's verify this:

In [7]:
type(bikes)

pandas.core.frame.DataFrame

In [11]:
bikes.tail(8)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
50081,17534131,Subscriber,Male,2017-12-29 16:09:00,2017-12-29 16:19:00,617,Kingsbury St & Erie St,41.893882,-87.641711,23.0,Canal St & Adams St,41.879255,-87.639904,47.0,14.0,1.5,6.9,0.0,snow
50082,17534773,Subscriber,Male,2017-12-30 10:47:00,2017-12-30 10:53:00,363,Larrabee St & Oak St,41.900219,-87.642985,19.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,3.9,10.0,13.8,-9999.0,mostlycloudy
50083,17534831,Subscriber,Male,2017-12-30 11:36:00,2017-12-30 11:55:00,1175,Western Ave & Walton St,41.898418,-87.686596,19.0,Damen Ave & Clybourn Ave,41.931931,-87.677856,15.0,3.9,10.0,13.8,-9999.0,partlycloudy
50084,17534938,Subscriber,Male,2017-12-30 13:07:00,2017-12-30 13:34:00,1625,State St & Pearson St,41.897448,-87.628722,27.0,Clark St & Elm St,41.902973,-87.63128,27.0,5.0,10.0,16.1,-9999.0,partlycloudy
50085,17534969,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),41.830661,-87.647172,16.0,Union Ave & Root St,41.819102,-87.643278,11.0,5.0,10.0,16.1,-9999.0,partlycloudy
50086,17534972,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,5.0,10.0,16.1,-9999.0,partlycloudy
50087,17535645,Subscriber,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,41.885637,-87.641823,23.0,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,7.0,10.0,11.5,-9999.0,partlycloudy
50088,17536246,Subscriber,Male,2017-12-31 15:22:00,2017-12-31 15:26:00,214,Clarendon Ave & Leland Ave,41.967968,-87.650001,15.0,Clifton Ave & Lawrence Ave,41.968812,-87.657659,15.0,10.9,10.0,15.0,-9999.0,partlycloudy


### Fully-qualified name
Only the word after the last dot is the class name, so the `bikes` variable has type `DataFrame`. The part before the last dot tells you where the class was defined: the location and the module name.

### Location and module name?
The fully-qualified name mirrors the location on your computer where the class is defined. In this example, `pandas` is a directory that contains another directory, `core`, which contains a file, `frame.py`, which defines the `DataFrame` class.

### Package, sub-package, and module
The top level directory of other files and directories containing Python files is technically called a **package**. In this example `pandas` is the package. All directories within the package are called **sub-packages** such as `core`. All Python files (those ending in .py) are called **modules**.
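
The pieces of a fully-qualified name can also be read from the class itself through the standard Python attributes `__module__` and `__name__`. This is only a sketch; the exact module string that is printed may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})  # a tiny made-up DataFrame
cls = type(df)

print(cls.__module__)  # dotted path of packages and module
print(cls.__name__)    # the class name
print(pd.__file__)     # where the installed pandas package lives on disk
```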

### Where are the packages located?
Third-party packages are installed in the `site-packages` directory, which is set up during Python installation. We can get its actual location with the `getsitepackages` function from the standard library's `site` module.

In [12]:
import site
site.getsitepackages()

['c:\\Users\\Office\\Desktop\\Intern_Project\\Internship-Program\\.venv',
 'c:\\Users\\Office\\Desktop\\Intern_Project\\Internship-Program\\.venv\\Lib\\site-packages']

## Select a single column from a DataFrame - a Series
To select a single column from a DataFrame, pass the name of one of the columns to the brackets operator, `[]`. The returned object will be a pandas **Series**. Let's select the column name `tripduration`, assign it to a variable, and output it to the screen.

In [13]:
bikes.columns

Index(['trip_id', 'usertype', 'gender', 'starttime', 'stoptime',
       'tripduration', 'from_station_name', 'latitude_start',
       'longitude_start', 'dpcapacity_start', 'to_station_name',
       'latitude_end', 'longitude_end', 'dpcapacity_end', 'temperature',
       'visibility', 'wind_speed', 'precipitation', 'events'],
      dtype='object')

In [14]:
tripduration = bikes["tripduration"]

## `head` and `tail` methods work the same with a Series
Use the **`head`** and **`tail`** methods to condense the output.
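
A minimal sketch on a hand-built Series (the durations are typed in for the example):

```python
import pandas as pd

# Hypothetical durations in seconds
tripduration = pd.Series(
    [993, 623, 1040, 667, 130, 660, 565], name="tripduration"
)

first_five = tripduration.head()  # first 5 values by default
last_two = tripduration.tail(2)   # last 2 values

print(first_five)
print(last_two)
```

Both methods return a new, shorter Series rather than a DataFrame.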

In [15]:
type(tripduration)

pandas.core.series.Series

## Components of a Series - the index and the data
A Series is simpler than a DataFrame with just a single dimension of data. It has two components - the **index** and the **data**. It is essentially a one-column DataFrame. Let's take a look at a stylized Series graphic.

![](images/series_components.png)

The definitions of the index and data components are the same as they are for a DataFrame.

### Output of Series vs DataFrame
Notice that there is no nice HTML styling for the Series. It's just plain text. Also, below each Series will be some metadata on it - the **name**, **length**, and **dtype**. 

* The **name** is not important right now. If the Series is formed from a column of a DataFrame it will be set to that column name.
* The **length** is the number of values in the Series
* The **dtype** is the data type of the Series. Each column of data must be of only one particular data type. These will be covered later.

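It's important to note that this metadata is NOT part of the Series data; it is extra information that pandas prints below the values. The same information is available from the `name` and `dtype` attributes of the Series and from the built-in `len` function.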
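
The components and the printed metadata can each be accessed separately, as this sketch on a small made-up Series shows:

```python
import pandas as pd

# Hypothetical scores labeled by name
s = pd.Series([4.6, 8.3, 9.0], index=["Jane", "Niko", "Aaron"], name="score")

print(s.index)       # the labels
print(s.to_numpy())  # the values
print(s.name, len(s), s.dtype)
```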

## Exercises
Use the **`bikes`** DataFrame for the following:

### Exercise 1
<span  style="color:green; font-size:16px">Select the column **`events`**, the type of weather that was recorded and assign it to a variable with the same name. Output the first 10 values of it.</span>

In [16]:
df_bikess = pd.read_csv('data/bikes.csv')

In [17]:
events = df_bikess['events']
# Output the first 10 values, as the exercise asks
events.head(10)

### Exercise 2
<span  style="color:green; font-size:16px">What type of object is **`events`**?</span>

In [18]:
# Single brackets return a Series; double brackets (a one-item list) would return a DataFrame
events = df_bikess['events']

In [19]:
type(events)

pandas.core.series.Series

### Exercise 3
<span  style="color:green; font-size:16px">Select the last 2 rows of the **`bikes`** DataFrame and assign it to the variable **`bikes_last_2`**. What type of object is **`bikes_last_2`**?</span>

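In [21]:
bikes_last_2 = df_bikess.tail(2)
# A bare type(...) is only displayed when it is the last line of a cell, so print it
print(type(bikes_last_2))
# Show one randomly chosen row of the two
bikes_last_2.sample()

<class 'pandas.core.frame.DataFrame'>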

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
50087,17535645,Subscriber,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,41.885637,-87.641823,23.0,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,7.0,10.0,11.5,-9999.0,partlycloudy


## `df.info()` and `df.describe()`

**`df.info()` provides a concise summary of a DataFrame's structure, including:**

+ Memory usage
+ Number of rows and columns
+ Data type of each column
+ Number of non-null values in each column

**`df.describe()`, on the other hand, provides a statistical summary of the numeric columns (by default), including:**

+ Count: number of non-null values
+ Mean: average value
+ Standard deviation: a measure of how spread out the values are
+ Minimum and maximum values
+ Quartiles (25%, 50%, 75%)

Together these statistics help in understanding the central tendency, variability, and distribution of the numeric data in the DataFrame.

<span  style="color:green; font-size:16px">In summary, `df.info()` is used for a quick overview of the DataFrame's structure and basic information, while `df.describe()` is used for a more detailed statistical summary of the numeric columns in the DataFrame.</span>

In [22]:
bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50089 entries, 0 to 50088
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   trip_id            50089 non-null  int64  
 1   usertype           50089 non-null  object 
 2   gender             50089 non-null  object 
 3   starttime          50089 non-null  object 
 4   stoptime           50089 non-null  object 
 5   tripduration       50089 non-null  int64  
 6   from_station_name  50089 non-null  object 
 7   latitude_start     50083 non-null  float64
 8   longitude_start    50083 non-null  float64
 9   dpcapacity_start   50083 non-null  float64
 10  to_station_name    50089 non-null  object 
 11  latitude_end       50077 non-null  float64
 12  longitude_end      50077 non-null  float64
 13  dpcapacity_end     50077 non-null  float64
 14  temperature        50089 non-null  float64
 15  visibility         50089 non-null  float64
 16  wind_speed         500

In [23]:
bikes["tripduration"].head(20)

0      993
1      623
2     1040
3      667
4      130
5      660
6      565
7      505
8     1300
9      922
10    1523
11    1697
12    2263
13    1365
14     610
15     415
16     487
17     622
18    5396
19     384
Name: tripduration, dtype: int64

In [24]:
bikes.describe()

Unnamed: 0,trip_id,tripduration,latitude_start,longitude_start,dpcapacity_start,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation
count,50089.0,50089.0,50083.0,50083.0,50083.0,50077.0,50077.0,50077.0,50089.0,50089.0,50089.0,50089.0
mean,9472308.0,716.867755,41.900007,-87.644642,21.340215,41.900581,-87.644851,21.241708,62.608237,8.148827,7.070111,-9239.62687
std,4914277.0,1319.849896,0.034423,0.021512,7.634167,0.034634,0.021499,7.556756,48.151252,118.320794,178.93798,2648.862657
min,7147.0,60.0,41.744053,-87.80287,0.0,41.744053,-87.80287,0.0,-9999.0,-9999.0,-9999.0,-9999.0
25%,5332773.0,356.0,41.881032,-87.654752,15.0,41.88132,-87.654787,15.0,52.0,10.0,6.9,-9999.0
50%,9664991.0,572.0,41.89186,-87.641066,19.0,41.893832,-87.641088,19.0,66.9,10.0,10.4,-9999.0
75%,13632190.0,906.0,41.919936,-87.630585,23.0,41.92154,-87.629928,23.0,75.9,10.0,12.7,-9999.0
max,17536250.0,86188.0,42.064313,-87.560115,55.0,42.064313,-87.559275,55.0,96.1,10.0,42.6,0.39
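
The summary above shows a minimum of -9999.0 for `temperature`, `visibility`, `wind_speed`, and `precipitation`, which looks like a placeholder for readings that were not recorded. That interpretation is an assumption, not something the dataset states. Under that assumption, the sketch below shows one way to check data types, count missing values, and change a data type, using a small made-up DataFrame rather than the real bikes data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; -9999.0 is assumed to be a placeholder for "not recorded"
weather = pd.DataFrame({
    "temperature": [73.9, -9999.0, 69.1],
    "precipitation": [-9999.0, 0.0, -9999.0],
    "tripduration": ["993", "623", "1040"],  # numbers stored as text
})

print(weather.dtypes)        # check the data type of each column
print(weather.isna().sum())  # no NaN yet: the placeholder hides the gaps

# Turn the placeholder into real missing values in the numeric columns
cleaned = weather.copy()
numeric_cols = ["temperature", "precipitation"]
cleaned[numeric_cols] = cleaned[numeric_cols].replace(-9999.0, np.nan)
missing_per_column = cleaned.isna().sum()

# Change a column's data type from text to integer
cleaned["tripduration"] = cleaned["tripduration"].astype("int64")

print(missing_per_column)
print(cleaned.dtypes)
```

Once the placeholder is replaced, statistics such as the mean are no longer distorted by the -9999.0 values, because pandas skips missing values by default.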



## Selecting subsets of data

### The three indexers: `[]`, `loc`, and `iloc`

In [26]:
sampleDf = pd.read_csv('data/sample_data.csv', index_col=0)
# sampleDf = pd.read_csv('data/sample_data.csv')
sampleDf.head()

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


**1- Select Multiple Columns with a List**

You can select multiple columns by placing them in a list inside of just the brackets. Notice that a DataFrame and NOT a Series is returned.
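
The difference between passing a string and passing a list can be checked on a small made-up DataFrame (the values below are typed in for the sketch):

```python
import pandas as pd

# Hypothetical data in the same shape as the sample dataset
df = pd.DataFrame(
    {"color": ["blue", "green"], "age": [30, 2], "score": [4.6, 8.3]},
    index=["Jane", "Niko"],
)

one_column = df["age"]                # a string selects one column -> Series
one_column_frame = df[["age"]]        # a one-item list -> one-column DataFrame
two_columns = df[["color", "score"]]  # a list -> DataFrame, in the order listed

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.columns.tolist())
```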

In [27]:
# Select the following columns: 'color', 'age', 'score'
selected_columns = sampleDf[['color', 'age', 'score']]
selected_columns.columns

Index(['color', 'age', 'score'], dtype='object')

## Exercises
For the following exercises, make sure to use the `movie` dataset with title set as the index. It’s good practice to shorten your output with the head method.

### Exercise 1
<span  style="color:green; font-size:16px">Select the column with the director’s name as a Series</span>

In [33]:
movie = pd.read_csv('data/movie.csv')
movie.columns

Index(['title', 'year', 'color', 'content_rating', 'duration', 'director_name',
       'director_fb', 'actor1', 'actor1_fb', 'actor2', 'actor2_fb', 'actor3',
       'actor3_fb', 'gross', 'genres', 'num_reviews', 'num_voted_users',
       'plot_keywords', 'language', 'country', 'budget', 'imdb_score'],
      dtype='object')

In [35]:
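# Note: double brackets return a one-column DataFrame (shown below);
# movie['director_name'] with single brackets returns the Series the exercise asks for
movie[['director_name']]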


Unnamed: 0,director_name
0,James Cameron
1,Gore Verbinski
2,Sam Mendes
3,Christopher Nolan
4,Doug Walker
...,...
4911,Scott Smith
4912,
4913,Benjamin Roberds
4914,Daniel Hsia


### Exercise 2
<span  style="color:green; font-size:16px">Select the columns with the director’s name and number of Facebook likes.</span>

In [36]:
movie[['director_fb','director_name']]


Unnamed: 0,director_fb,director_name
0,0.0,James Cameron
1,563.0,Gore Verbinski
2,0.0,Sam Mendes
3,22000.0,Christopher Nolan
4,131.0,Doug Walker
...,...,...
4911,2.0,Scott Smith
4912,,
4913,0.0,Benjamin Roberds
4914,0.0,Daniel Hsia


**2-Selecting Subsets of Data from DataFrames with loc**

- The `loc` indexer selects data differently from just the brackets and has its own set of rules that we must learn.
- **Simultaneous row and column subset selection.** The `loc` indexer can select rows and columns at the same time, which is not possible with just the brackets. Separate the row and column selections with a comma. The selection will look something like this:
**`df.loc[rows, cols]`**
- **Selection by label.** Very importantly, `loc` primarily selects data by the `label` of the `rows` and `columns`.


In [37]:
df = pd.read_csv('data/sample_data.csv', index_col=0)

In [38]:
df.head(6)

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5


In [39]:
df.loc['Jane':'Christina','state':'food']

Unnamed: 0,state,color,food
Jane,NY,blue,Steak
Niko,TX,green,Lamb
Aaron,FL,red,Mango
Penelope,AL,white,Apple
Dean,AK,gray,Cheese
Christina,TX,black,Melon


**The possible types of row and column selections**

All of the following are valid objects for both row and column selections with loc:

• A single label

• A list of labels

• A slice with labels

• A boolean Series or array (covered in a later section)
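
The four kinds of selection listed above can be sketched side by side. The frame `people` below is made up for illustration and is not the sample dataset. Note that label slices in `loc` include both endpoints.

```python
import pandas as pd

# Hypothetical data, indexed by name like the sample dataset
people = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL"],
        "color": ["blue", "green", "red", "white"],
        "age": [30, 2, 12, 4],
    },
    index=["Jane", "Niko", "Aaron", "Penelope"],
)

single = people.loc["Jane", "age"]                         # single labels -> one value
listed = people.loc[["Jane", "Aaron"], ["age", "state"]]   # lists of labels -> DataFrame
sliced = people.loc["Niko":"Penelope", "state":"color"]    # label slices, endpoints included
masked = people.loc[people["age"] > 10, "state"]           # boolean Series picks the rows

print(single)                # 30
print(listed.shape)          # (2, 2)
print(list(sliced.index))    # ['Niko', 'Aaron', 'Penelope']
print(list(masked))          # ['NY', 'FL']
```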

In [40]:
# Select two rows and three columns with loc
rows = ['Dean', 'Cornelia']
cols = ['age', 'state', 'score']
test_1 = df.loc[rows,cols]
test_1


Unnamed: 0,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2
Cornelia,69,TX,


In [43]:
cols = ['state', 'color']
df.loc['Jane':'Penelope', cols]

Unnamed: 0,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


In [45]:
#Slice both the rows and columns
df.loc['Jane':'Christina',:]

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5


## Exercises
### Exercise 1
<span  style="color:green; font-size:16px">Read in the `movie` dataset and set the `title` column as the index. Select all columns for the movie ‘The Dark Knight Rises’.</span>


In [2]:
import pandas as pd

In [46]:
moves = pd.read_csv('data/movie.csv', index_col='title')

In [47]:
moves.head()
# The selection the exercise asks for would be moves.loc['The Dark Knight Rises'] (not run here)

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1


### Exercise 2

<span  style="color:green; font-size:16px">Select all columns for the movies ‘Tangled’ and ‘Avatar’.</span>

In [48]:
moves.loc[['Tangled','Avatar']]

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Tangled,2010.0,Color,PG,100.0,Nathan Greno,15.0,Brad Garrett,799.0,Donna Murphy,553.0,...,284.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,324.0,294810,17th century|based on fairy tale|disney|flower...,English,USA,260000000.0,7.8
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9


### Exercise 3
<span  style="color:green; font-size:16px">In what year were ‘Tangled’ and ‘Avatar’ made, and what were their IMDB scores?</span>

In [50]:
moves.columns

Index(['year', 'color', 'content_rating', 'duration', 'director_name',
       'director_fb', 'actor1', 'actor1_fb', 'actor2', 'actor2_fb', 'actor3',
       'actor3_fb', 'gross', 'genres', 'num_reviews', 'num_voted_users',
       'plot_keywords', 'language', 'country', 'budget', 'imdb_score'],
      dtype='object')

In [49]:
moves.loc[['Tangled','Avatar'],['year', 'imdb_score']]

title,year,imdb_score
Tangled,2010.0,7.8
Avatar,2009.0,7.9


### Exercise 4
<span  style="color:green; font-size:16px">Can you tell what the data type of the year column is by just looking at its values?</span>

In [53]:
yearType = moves['year']
yearType.info()

<class 'pandas.core.series.Series'>
Index: 4916 entries, Avatar to My Date with Drew
Series name: year
Non-Null Count  Dtype  
--------------  -----  
4810 non-null   float64
dtypes: float64(1)
memory usage: 205.9+ KB
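
The years look like whole numbers, yet the column is reported as `float64`. This is what pandas does by default when a numeric column has missing values: the gaps are stored as `NaN`, which is a float. A minimal sketch with made-up values (not the movie data) shows the effect and how to read a dtype directly:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
complete = pd.Series([2009, 2007, 2015], name="year")
with_gap = pd.Series([2009, np.nan, 2015], name="year")

print(complete.dtype)           # an integer dtype
print(with_gap.dtype)           # float64, because NaN is a float
print(with_gap.isna().sum())    # 1 missing value
```

The `.dtype` attribute is a quicker check than `.info()` when only the type is of interest.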


**3-Selecting Subsets of Data from DataFrames with iloc**

The `iloc` indexer is very similar to `loc` but uses only `integer` location to make its selections.
The word iloc itself stands for `integer location`, which should help you remember what it does.

**Simultaneous row and column subset selection with iloc**

Selection with iloc looks like the following, with a comma separating the row and column selections:
**`df.iloc[rows, cols]`**

All of the following are valid objects for both row and column selections with iloc. Unlike loc, iloc does not
accept a boolean Series; boolean selection is normally done with just the brackets or with loc.

• A single integer

• A list of integers

• A slice with integers
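
The three kinds of integer selection listed above can be sketched on a small, made-up frame (`people` is hypothetical, not one of the course datasets). Positions start at 0, and integer slices exclude the stop position.

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL"],
        "color": ["blue", "green", "red", "white"],
        "age": [30, 2, 12, 4],
    },
    index=["Jane", "Niko", "Aaron", "Penelope"],
)

single = people.iloc[0, 2]             # single integers -> one value
listed = people.iloc[[0, 2], [2, 0]]   # lists of integers -> DataFrame
sliced = people.iloc[1:3, 0:2]         # integer slices -> rows 1-2, columns 0-1

print(single)                 # 30
print(list(listed.columns))   # ['age', 'state']
print(list(sliced.index))     # ['Niko', 'Aaron']
```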

**Select a single row or column as a DataFrame and NOT a Series**

**Select some rows and a single column**

You can select a single row (or column) and get back a DataFrame rather than a Series `if you use a list to make the selection.`
Cells In [58] and In [59] below make the same selection twice: first with an integer for the column, which returns a Series, and then with a one-item list, which returns a DataFrame.

In [55]:
movie.head()

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
0,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
2,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
3,The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
4,Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,...,,,Documentary,,8,,,,,7.1


In [58]:
# An integer column position returns a Series
rows = [2, 3, 5]
cols = 4
movie.iloc[rows, cols]

2    148.0
3    164.0
5    132.0
Name: duration, dtype: float64

In [59]:
# A one-item list for the column returns a DataFrame
rows = [2, 3, 5]
cols = [4]
movie.iloc[rows, cols]

Unnamed: 0,duration
2,148.0
3,164.0
5,132.0


In [22]:
movie.iloc[:3,:3]

Unnamed: 0,title,year,color
0,Avatar,2009.0,Color
1,Pirates of the Caribbean: At World's End,2007.0,Color
2,Spectre,2015.0,Color


In [60]:
movie.loc[:3,:"color"]

Unnamed: 0,title,year,color
0,Avatar,2009.0,Color
1,Pirates of the Caribbean: At World's End,2007.0,Color
2,Spectre,2015.0,Color
3,The Dark Knight Rises,2012.0,Color
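
The two cells above return different numbers of rows because integer slices in `iloc` exclude the stop position, as in ordinary Python slicing, while label slices in `loc` include the stop label. A minimal sketch on a made-up frame with the default integer index:

```python
import pandas as pd

# Made-up frame; the default index is 0, 1, 2, 3, 4
films = pd.DataFrame(
    {
        "title": ["a", "b", "c", "d", "e"],
        "year": [2001, 2002, 2003, 2004, 2005],
    }
)

by_position = films.iloc[:3]   # positions 0, 1, 2 (stop excluded)
by_label = films.loc[:3]       # labels 0, 1, 2, 3 (stop included)

print(len(by_position), len(by_label))  # 3 4
```

With a default integer index the labels and positions look identical, which makes this difference easy to miss.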


In [61]:
movie.head()

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
0,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
2,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
3,The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
4,Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,...,,,Documentary,,8,,,,,7.1


## Exercises
Use the `movie` dataset for the following exercises.

### Exercise 1
<span  style="color:green; font-size:16px">Select the rows with integer location 10, 5, and 1</span>

In [62]:
# Rows come back in the order listed; use [10, 5, 1] to match the order in the prompt
movie.iloc[[1, 5, 10], :]

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
1,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
5,John Carter,2012.0,Color,PG-13,132.0,Andrew Stanton,475.0,Daryl Sabara,640.0,Samantha Morton,...,530.0,73058679.0,Action|Adventure|Sci-Fi,462.0,212204,alien|american civil war|male nipple|mars|prin...,English,USA,263700000.0,6.6
10,Batman v Superman: Dawn of Justice,2016.0,Color,PG-13,183.0,Zack Snyder,0.0,Henry Cavill,15000.0,Lauren Cohan,...,2000.0,330249062.0,Action|Adventure|Sci-Fi,673.0,371639,based on comic book|batman|sequel to a reboot|...,English,USA,250000000.0,6.9


### Exercise 2
<span  style="color:green; font-size:16px">Select the columns with integer location 10, 5, and 1</span>

In [63]:
# Columns come back in the order listed; use [10, 5, 1] to match the order in the prompt
movie.iloc[:, [1, 5, 10]]

Unnamed: 0,year,director_name,actor2_fb
0,2009.0,James Cameron,936.0
1,2007.0,Gore Verbinski,5000.0
2,2015.0,Sam Mendes,393.0
3,2012.0,Christopher Nolan,23000.0
4,,Doug Walker,12.0
...,...,...,...
4911,2013.0,Scott Smith,470.0
4912,,,593.0
4913,2013.0,Benjamin Roberds,0.0
4914,2012.0,Daniel Hsia,719.0


### Exercise 3
<span  style="color:green; font-size:16px">Select rows with integer location 100 to 104 along with the column integer location 5.</span>

In [64]:
# The stop position 105 is excluded, so this selects rows 100 through 104
movie.iloc[100:105, 5]

100           Rob Cohen
101       David Fincher
102      Matthew Vaughn
103    Francis Lawrence
104      Jon Turteltaub
Name: director_name, dtype: object

**4-Boolean Selection Single Conditions**

**Examples of Boolean Selection**

Let’s see some examples of actual questions (in plain English) that boolean selection can help us answer from the
bikes dataset. The term query is used to refer to these sorts of questions.

`• Find all rides by males`

`• Find all rides with duration longer than 2 hours`

`• Find all rides that took place between March and June of 2015.`

`• Find all the rides with a duration longer than 2 hours by females with temperature higher than 90 degrees`

**All queries have a logical condition**

Each of the above queries has a strict logical condition that must be checked one row at a time.

**Keep or discard an entire row of data**

If you were to answer the above queries manually, you would need to scan each row and determine whether the row,
as a whole, meets the condition. If so, it is kept in the result; otherwise it is discarded.

**Each row will have a True or False value associated with it**

When you perform boolean selection, each row of the DataFrame (or value of a Series) has a True or False
value associated with it that corresponds to the outcome of the logical condition.

**Creating boolean Series from column data**

By far the most common way to create a boolean Series is from the values of one particular column. We test
a condition using one of the six comparison operators:

`• <`

`• <=`

`• >`

`• >=`

`• ==`

`• !=`
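
Each of these operators compares every value in a column against the right-hand side and returns a boolean Series of the same length. A minimal sketch with made-up durations (not the bikes data):

```python
import pandas as pd

# Made-up trip durations in seconds
durations = pd.Series([993, 623, 1040, 667], name="tripduration")

longer = durations > 1000       # element-wise comparison -> boolean Series
not_623 = durations != 623

print(longer.tolist())              # [False, False, True, False]
print(not_623.tolist())             # [True, False, True, True]
print(durations[longer].tolist())   # [1040]
```

Passing the boolean Series back inside the brackets keeps only the rows marked `True`.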

In [65]:
#Let’s begin by reading in the bikes dataset.
bikes = pd.read_csv("data/bikes.csv")


In [66]:
bikes.columns

Index(['trip_id', 'usertype', 'gender', 'starttime', 'stoptime',
       'tripduration', 'from_station_name', 'latitude_start',
       'longitude_start', 'dpcapacity_start', 'to_station_name',
       'latitude_end', 'longitude_end', 'dpcapacity_end', 'temperature',
       'visibility', 'wind_speed', 'precipitation', 'events'],
      dtype='object')

In [67]:
bikes["gender"].unique()

array(['Male', 'Female'], dtype=object)

In [68]:
filt_1 = bikes["gender"] == 'Male'
males_only = bikes[filt_1]
males_only["gender"].sample(10)

3422     Male
32146    Male
2778     Male
11610    Male
37883    Male
22707    Male
39353    Male
12601    Male
42904    Male
12637    Male
Name: gender, dtype: object

In [69]:
# The tilde (~) inverts the filter, keeping the rows where gender is not 'Male'
females_only = bikes[~filt_1]
females_only["gender"].sample(10)

42248    Female
34576    Female
9040     Female
32774    Female
44755    Female
16733    Female
34190    Female
41157    Female
27649    Female
34552    Female
Name: gender, dtype: object

In [70]:
filt_2 = bikes["gender"]=="Female"
filt_3 = bikes["tripduration"]<90

In [132]:
female_less_90 = bikes[filt_2 & filt_3]


In [133]:
female_less_90[["gender","tripduration"]]

Unnamed: 0,gender,tripduration
2485,Female,67
3366,Female,62
3629,Female,70
4930,Female,77
6430,Female,85
6862,Female,60
6914,Female,64
7909,Female,87
9551,Female,79
11168,Female,83
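
The selection above combines two filters with `&` (and); `|` (or) works the same way. The Python keywords `and` and `or` do not work on boolean Series. When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because `&` and `|` bind more tightly than the comparison operators. A minimal sketch with made-up rows, not the bikes data:

```python
import pandas as pd

# Made-up rides for illustration
rides = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female", "Male"],
        "tripduration": [67, 80, 1200, 1500],
    }
)

# Each comparison is wrapped in parentheses before being combined
both = rides[(rides["gender"] == "Female") & (rides["tripduration"] < 90)]
either = rides[(rides["gender"] == "Female") | (rides["tripduration"] < 90)]

print(both.index.tolist())     # [0]
print(either.index.tolist())   # [0, 1, 2]
```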


In [73]:
bikes["usertype"].unique()

array(['Subscriber', 'Customer', 'Dependent'], dtype=object)

In [74]:
f1= bikes["gender"] == "Female"

In [75]:
f2= bikes["usertype"] == "Customer"

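In [76]:
# Single condition: female riders only (overwritten in the next cell)
logina = bikes[bikes["gender"] == "Female"]

In [77]:
# Either condition: female riders OR customers
logina = bikes[f1 | f2]

In [78]:
# Among those rows, show the gender of every customer
logina.loc[logina["usertype"] == "Customer", "gender"]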

7031     Female
28102      Male
30389      Male
38033      Male
38794      Male
41190      Male
41740      Male
41963      Male
45551    Female
Name: gender, dtype: object

Let’s create a boolean Series by determining which rows have a trip duration greater than 1000 seconds. To make the
comparison, we select the tripduration column as a Series and compare it against the integer 1000.
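
Because `True` counts as 1 and `False` as 0, summing a boolean Series gives the number of matching rows, and its mean gives the matching fraction. A minimal sketch with made-up durations rather than the bikes data:

```python
import pandas as pd

# Made-up trip durations in seconds
tripduration = pd.Series([993, 623, 1040, 1300, 130])

is_long = tripduration > 1000        # boolean Series
n_long = int(is_long.sum())          # number of True values
share_long = float(is_long.mean())   # fraction of True values

print(n_long)       # 2
print(share_long)   # 0.4
```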

In [221]:
bikes.sample(500)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
23559,9158131,Subscriber,Female,2016-04-12 17:11:00,2016-04-12 17:20:00,514,Green St & Madison St,41.881892,-87.648789,19.0,Ogden Ave & Chicago Ave,41.896362,-87.654061,19.0,43.0,10.0,12.7,-9999.0,cloudy
33593,12517587,Subscriber,Female,2016-10-28 08:16:41,2016-10-28 08:25:19,518,Ogden Ave & Roosevelt Rd,41.866501,-87.684697,19.0,Ashland Ave & Harrison St,41.874291,-87.667246,23.0,46.9,8.0,11.5,-9999.0,cloudy
9856,4305313,Subscriber,Female,2014-12-03 20:01:00,2014-12-03 20:09:00,522,Clark St & North Ave,41.911974,-87.631942,15.0,Lakeview Ave & Fullerton Pkwy,41.925858,-87.638973,19.0,28.0,10.0,10.4,-9999.0,cloudy
36868,13436965,Subscriber,Female,2017-03-20 07:11:54,2017-03-20 07:17:57,363,Lincoln Ave & Winona St,41.974911,-87.692503,15.0,Western Ave & Leland Ave,41.966555,-87.688487,19.0,45.0,5.0,8.1,-9999.0,cloudy
9775,4274969,Subscriber,Female,2014-11-25 15:26:00,2014-11-25 15:31:00,306,Canal St & Harrison St,41.874337,-87.639566,15.0,Peoria St & Jackson Blvd,41.877749,-87.649633,19.0,23.0,10.0,11.5,-9999.0,cloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45402,16167118,Subscriber,Female,2017-09-01 08:54:57,2017-09-01 09:01:34,397,Leavitt St & North Ave,41.910153,-87.682290,15.0,Marshfield Ave & Cortland St,41.916017,-87.668879,23.0,61.0,10.0,19.6,-9999.0,cloudy
10588,4531042,Subscriber,Female,2015-02-13 06:50:00,2015-02-13 06:56:00,349,Wabash Ave & 8th St,41.871962,-87.626106,19.0,Dearborn St & Monroe St,41.881320,-87.629521,27.0,6.1,10.0,6.9,-9999.0,cloudy
19152,7844148,Subscriber,Female,2015-10-08 16:59:00,2015-10-08 17:08:00,566,Ogden Ave & Congress Pkwy,41.875010,-87.673280,15.0,Peoria St & Jackson Blvd,41.877749,-87.649633,19.0,73.9,10.0,10.4,-9999.0,cloudy
30244,11493374,Subscriber,Female,2016-08-24 07:41:27,2016-08-24 07:56:59,933,Morgan St & Polk St,41.871737,-87.651030,15.0,Rush St & Hubbard St,41.890011,-87.626293,15.0,71.1,8.0,15.0,0.0,cloudy


In [156]:
# Which rows have a trip duration greater than 1000 seconds? (ex1.sum() gives the count.)
ex1 = bikes['tripduration'] > 1000
bikes[ex1]

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2,10927,Subscriber,Female,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.881320,-87.629521,23.0,73.0,10.0,16.1,-9999.00,cloudy
8,21028,Subscriber,Female,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.883380,-87.641170,31.0,Wood St & Division St,41.903320,-87.672730,15.0,71.1,8.0,0.0,-9999.00,cloudy
10,24383,Subscriber,Female,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.00,cloudy
11,24673,Subscriber,Female,2013-07-04 18:13:00,2013-07-04 18:42:00,1697,Ashland Ave & Armitage Ave,41.917859,-87.668919,15.0,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,79.0,10.0,10.4,-9999.00,cloudy
12,26214,Subscriber,Female,2013-07-05 10:02:00,2013-07-05 10:40:00,2263,Jefferson St & Monroe St,41.880422,-87.642746,19.0,Jefferson St & Monroe St,41.880422,-87.642746,19.0,79.0,10.0,0.0,-9999.00,cloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50058,17525403,Subscriber,Female,2017-12-23 13:48:00,2017-12-23 14:14:00,1559,Michigan Ave & Madison St,41.882134,-87.625125,19.0,Shedd Aquarium,41.867226,-87.615355,55.0,28.0,10.0,8.1,-9999.00,cloudy
50077,17533484,Subscriber,Female,2017-12-29 09:13:00,2017-12-29 09:53:00,2378,Clinton St & 18th St,41.857950,-87.640826,15.0,Canal St & Taylor St,41.870257,-87.639474,15.0,12.9,9.0,4.6,-9999.00,cloudy
50080,17534057,Subscriber,Female,2017-12-29 15:28:00,2017-12-29 15:51:00,1378,Cityfront Plaza Dr & Pioneer Ct,41.890573,-87.622072,23.0,Mies van der Rohe Way & Chestnut St,41.898587,-87.621915,19.0,14.0,1.5,6.9,0.01,cloudy
50083,17534831,Subscriber,Female,2017-12-30 11:36:00,2017-12-30 11:55:00,1175,Western Ave & Walton St,41.898418,-87.686596,19.0,Damen Ave & Clybourn Ave,41.931931,-87.677856,15.0,3.9,10.0,13.8,-9999.00,cloudy


In [186]:
#Let’s find all the rides longer than 1,000 seconds when it was cloudy.
fUN1 = bikes['tripduration'] > 1000
fUN2 = bikes['events'] == 'cloudy'
bikes[fUN1 & fUN2]


Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2,10927,Subscriber,Female,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.881320,-87.629521,23.0,73.0,10.0,16.1,-9999.00,cloudy
8,21028,Subscriber,Female,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.883380,-87.641170,31.0,Wood St & Division St,41.903320,-87.672730,15.0,71.1,8.0,0.0,-9999.00,cloudy
10,24383,Subscriber,Female,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.00,cloudy
11,24673,Subscriber,Female,2013-07-04 18:13:00,2013-07-04 18:42:00,1697,Ashland Ave & Armitage Ave,41.917859,-87.668919,15.0,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,79.0,10.0,10.4,-9999.00,cloudy
12,26214,Subscriber,Female,2013-07-05 10:02:00,2013-07-05 10:40:00,2263,Jefferson St & Monroe St,41.880422,-87.642746,19.0,Jefferson St & Monroe St,41.880422,-87.642746,19.0,79.0,10.0,0.0,-9999.00,cloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50058,17525403,Subscriber,Female,2017-12-23 13:48:00,2017-12-23 14:14:00,1559,Michigan Ave & Madison St,41.882134,-87.625125,19.0,Shedd Aquarium,41.867226,-87.615355,55.0,28.0,10.0,8.1,-9999.00,cloudy
50077,17533484,Subscriber,Female,2017-12-29 09:13:00,2017-12-29 09:53:00,2378,Clinton St & 18th St,41.857950,-87.640826,15.0,Canal St & Taylor St,41.870257,-87.639474,15.0,12.9,9.0,4.6,-9999.00,cloudy
50080,17534057,Subscriber,Female,2017-12-29 15:28:00,2017-12-29 15:51:00,1378,Cityfront Plaza Dr & Pioneer Ct,41.890573,-87.622072,23.0,Mies van der Rohe Way & Chestnut St,41.898587,-87.621915,19.0,14.0,1.5,6.9,0.01,cloudy
50083,17534831,Subscriber,Female,2017-12-30 11:36:00,2017-12-30 11:55:00,1175,Western Ave & Walton St,41.898418,-87.686596,19.0,Damen Ave & Clybourn Ave,41.931931,-87.677856,15.0,3.9,10.0,13.8,-9999.00,cloudy


In [197]:
#Let’s find all the rides that were done by females or had trip durations longer than 1,000 seconds.
e1 = bikes['gender'] == 'Female'
e2= bikes['tripduration'] > 1000
bikes[e1 | e2]


Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Female,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.881050,-87.616970,11.0,Michigan Ave & Oak St,41.900960,-87.623777,15.0,73.9,10.0,12.7,-9999.0,cloudy
1,7524,Subscriber,Female,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.883380,-87.641170,31.0,Wells St & Walton St,41.899930,-87.634430,19.0,69.1,10.0,6.9,-9999.0,cloudy
2,10927,Subscriber,Female,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.881320,-87.629521,23.0,73.0,10.0,16.1,-9999.0,cloudy
3,12907,Subscriber,Female,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.631890,31.0,72.0,10.0,16.1,-9999.0,cloudy
4,13168,Subscriber,Female,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,cloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50084,17534938,Subscriber,Female,2017-12-30 13:07:00,2017-12-30 13:34:00,1625,State St & Pearson St,41.897448,-87.628722,27.0,Clark St & Elm St,41.902973,-87.631280,27.0,5.0,10.0,16.1,-9999.0,cloudy
50085,17534969,Subscriber,Female,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),41.830661,-87.647172,16.0,Union Ave & Root St,41.819102,-87.643278,11.0,5.0,10.0,16.1,-9999.0,cloudy
50086,17534972,Subscriber,Female,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,5.0,10.0,16.1,-9999.0,cloudy
50087,17535645,Subscriber,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,41.885637,-87.641823,23.0,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,7.0,10.0,11.5,-9999.0,cloudy


**Inverting a condition with the not operator**

The tilde character, ~, represents the not operator and inverts a condition.
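
A minimal sketch of inversion on made-up rows (not the bikes data). It also illustrates a general rule of logic, De Morgan's law: inverting an "or" gives the same rows as requiring that neither condition holds.

```python
import pandas as pd

# Made-up rides for illustration
rides = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female", "Male"],
        "tripduration": [67, 80, 1200, 1500],
    }
)

is_female = rides["gender"] == "Female"
is_long = rides["tripduration"] > 1000

# Inverting an "or" keeps only the rows where neither condition holds
neither = rides[~(is_female | is_long)]
# The same rows, written as "not female AND not long"
same_rows = rides[~is_female & ~is_long]

print(neither.index.tolist())      # [1]
print(neither.equals(same_rows))   # True
```

The parentheses around `is_female | is_long` matter: without them, `~` would apply only to `is_female`.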

In [199]:
# All rides with a trip duration of 1000 seconds or less.
# With no missing durations, this matches ~(bikes['tripduration'] > 1000).
f4 = bikes['tripduration']<=1000
bikes[f4]

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Female,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.881050,-87.616970,11.0,Michigan Ave & Oak St,41.900960,-87.623777,15.0,73.9,10.0,12.7,-9999.0,cloudy
1,7524,Subscriber,Female,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.883380,-87.641170,31.0,Wells St & Walton St,41.899930,-87.634430,19.0,69.1,10.0,6.9,-9999.0,cloudy
3,12907,Subscriber,Female,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.631890,31.0,72.0,10.0,16.1,-9999.0,cloudy
4,13168,Subscriber,Female,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,cloudy
5,13595,Subscriber,Female,2013-07-01 12:37:00,2013-07-01 12:48:00,660,California Ave & 21st St,41.854016,-87.695445,15.0,Clark St & Wrightwood Ave,41.929546,-87.643118,15.0,73.0,10.0,17.3,-9999.0,cloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50082,17534773,Subscriber,Female,2017-12-30 10:47:00,2017-12-30 10:53:00,363,Larrabee St & Oak St,41.900219,-87.642985,19.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,3.9,10.0,13.8,-9999.0,cloudy
50085,17534969,Subscriber,Female,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),41.830661,-87.647172,16.0,Union Ave & Root St,41.819102,-87.643278,11.0,5.0,10.0,16.1,-9999.0,cloudy
50086,17534972,Subscriber,Female,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,5.0,10.0,16.1,-9999.0,cloudy
50087,17535645,Subscriber,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,41.885637,-87.641823,23.0,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,7.0,10.0,11.5,-9999.0,cloudy


In [226]:
# Let’s reverse the condition for selecting rides by females or those with duration over 1,000 seconds.
# Logically, this should return only male riders with duration 1,000 seconds or less.
e10 = bikes['gender'] == 'Female'
e20= bikes['tripduration'] > 1000
bikes[~(e10|e20)]

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events


## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select all movies that have Tom Hanks as actor1. How
many of these movies has he starred in?</span>

In [230]:
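# The exercise asks for title as the index (index_col='title'); the default integer index is kept here
movie = pd.read_csv('data/movie.csv')
# Boolean Series: True where actor1 is Tom Hanks; actor.sum() counts the matches
actor = movie['actor1'] == 'Tom Hanks'
movie[actor]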

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
43,Toy Story 3,2010.0,Color,G,103.0,Lee Unkrich,125.0,Tom Hanks,15000.0,John Ratzenberger,...,721.0,414984497.0,Adventure|Animation|Comedy|Family|Fantasy,453.0,544884,college|day care|escape|teddy bear|toy,English,USA,200000000.0,8.3
91,The Polar Express,2004.0,Color,G,100.0,Robert Zemeckis,0.0,Tom Hanks,15000.0,Eddie Deezen,...,267.0,665426.0,Adventure|Animation|Family|Fantasy,188.0,120798,boy|christmas|christmas eve|north pole|train,English,USA,165000000.0,6.6
129,Angels & Demons,2009.0,Color,PG-13,146.0,Ron Howard,2000.0,Tom Hanks,15000.0,Ayelet Zurer,...,294.0,133375846.0,Mystery|Thriller,298.0,207839,conclave|illuminati|murder|reference to bernin...,English,USA,150000000.0,6.7
205,The Da Vinci Code,2006.0,Color,PG-13,174.0,Ron Howard,2000.0,Tom Hanks,15000.0,Seth Gabel,...,362.0,217536138.0,Mystery|Thriller,294.0,314253,based on supposedly true story|holy grail|mary...,English,USA,125000000.0,6.6
307,Cloud Atlas,2012.0,Color,R,172.0,Tom Tykwer,670.0,Tom Hanks,15000.0,Jim Sturgess,...,1000.0,27098580.0,Drama|Sci-Fi,511.0,284825,composer|future|letter|nonlinear timeline|nurs...,English,Germany,102000000.0,7.5
349,Toy Story 2,1999.0,Color,G,82.0,John Lasseter,487.0,Tom Hanks,15000.0,John Ratzenberger,...,967.0,245823397.0,Adventure|Animation|Comedy|Family|Fantasy,191.0,385871,collector|dog|friend|rescue|toy,English,USA,90000000.0,7.9
390,Cast Away,2000.0,Color,PG-13,143.0,Robert Zemeckis,0.0,Tom Hanks,15000.0,Paul Sanchez,...,272.0,233630478.0,Adventure|Drama|Romance,221.0,394317,christmas|island|love|survival|talking to inan...,English,USA,90000000.0,7.7
451,Road to Perdition,2002.0,Color,R,117.0,Sam Mendes,0.0,Tom Hanks,15000.0,Jennifer Jason Leigh,...,818.0,104054514.0,Crime|Drama|Thriller,226.0,200359,1930s|blood|gun|on the run|revenge,English,USA,80000000.0,7.7
530,The Terminal,2004.0,Color,PG-13,128.0,Steven Spielberg,14000.0,Tom Hanks,15000.0,Chi McBride,...,232.0,77032279.0,Comedy|Drama,151.0,303864,airport|construction site|fish out of water|fl...,English,USA,60000000.0,7.3
641,Saving Private Ryan,1998.0,Color,R,169.0,Steven Spielberg,14000.0,Tom Hanks,15000.0,Vin Diesel,...,13000.0,216119491.0,Action|Drama|War,219.0,881236,army|invasion|killed in action|normandy|soldier,English,USA,70000000.0,8.6


In [227]:
movie.columns

Index(['title', 'year', 'color', 'content_rating', 'duration', 'director_name',
       'director_fb', 'actor1', 'actor1_fb', 'actor2', 'actor2_fb', 'actor3',
       'actor3_fb', 'gross', 'genres', 'num_reviews', 'num_voted_users',
       'plot_keywords', 'language', 'country', 'budget', 'imdb_score'],
      dtype='object')

### Exercise 2
<span  style="color:green; font-size:16px">Select movies with an IMDB score greater than 9.</span>

In [229]:
score = movie['imdb_score'] > 9
movie[score]

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
1920,The Shawshank Redemption,1994.0,Color,R,142.0,Frank Darabont,0.0,Morgan Freeman,11000.0,Jeffrey DeMunn,...,461.0,28341469.0,Crime|Drama,199.0,1689764,escape from prison|first person narration|pris...,English,USA,25000000.0,9.3
2725,Towering Inferno,,Color,,65.0,John Blanchard,0.0,Martin Short,770.0,Andrea Martin,...,176.0,,Comedy,,10,,English,Canada,,9.5
2779,Dekalog,,Color,TV-MA,55.0,,,Krystyna Janda,20.0,Olaf Lubaszenko,...,2.0,447093.0,Drama,53.0,12590,meaning of life|moral challenge|morality|searc...,Polish,Poland,,9.1
3402,The Godfather,1972.0,Color,R,175.0,Francis Ford Coppola,0.0,Al Pacino,14000.0,Marlon Brando,...,3000.0,134821952.0,Crime|Drama,208.0,1155770,crime family|mafia|organized crime|patriarch|r...,English,USA,6000000.0,9.2
4312,Kickboxer: Vengeance,2016.0,,,90.0,John Stockwell,134.0,Matthew Ziff,260000.0,T.J. Storm,...,354.0,,Action,2.0,246,,,USA,17000000.0,9.1
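
The result above includes titles with very few votes (Towering Inferno has only 10). As a minimal sketch of how such a filter could be tightened, the example below combines two boolean Series with `&`. The `toy` DataFrame and the 1000-vote threshold are made up for illustration; only the column names mirror the `movie` DataFrame.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the movie DataFrame
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'imdb_score': [9.3, 9.5, 7.1, 9.1],
    'num_voted_users': [1689764, 10, 15103, 246],
})

# A comparison returns a boolean Series aligned with the rows
high_score = toy['imdb_score'] > 9
# Arbitrary vote threshold, chosen only for illustration
many_votes = toy['num_voted_users'] > 1000

# & performs an element-wise AND between the two masks
result = toy[high_score & many_votes]

print(high_score.tolist())
print(result['title'].tolist())
```

When the conditions are written inline instead of being stored in variables, each one needs its own parentheses, because `&` binds more tightly than `>`.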


In [264]:
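# Select R-rated movies by filtering with the condition inline.
# Equivalent to building the mask first: rate = movie['content_rating'] == 'R'
movie[movie['content_rating'] == 'R']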

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
84,The Lovers,2015.0,Color,R,109.0,Roland Joffé,596.0,Tamsin Egerton,622.0,Alice Englert,...,283.0,,Action|Adventure|Romance|Sci-Fi,10.0,2138,1770s|british india|great barrier reef|india|ring,English,Belgium,,4.5
94,Terminator 3: Rise of the Machines,2003.0,Color,R,109.0,Jonathan Mostow,84.0,Nick Stahl,648.0,M.C. Gainey,...,191.0,150350192.0,Action|Sci-Fi,280.0,305340,drifter|exploding truck|future|machine|skynet,English,USA,200000000.0,6.4
113,Alexander,2004.0,Color,R,206.0,Oliver Stone,0.0,Anthony Hopkins,12000.0,Angelina Jolie Pitt,...,591.0,34293771.0,Action|Adventure|Biography|Drama|History|Roman...,248.0,138863,ancient greece|conquest|greek|greek myth|king,English,Germany,155000000.0,5.5
124,The Matrix Revolutions,2003.0,Color,R,129.0,Lana Wachowski,0.0,Essie Davis,309.0,Collin Chou,...,233.0,139259759.0,Action|Sci-Fi,245.0,364948,battle|epic|fight|future|machine,English,Australia,150000000.0,6.7
126,The Matrix Reloaded,2003.0,Color,R,138.0,Lana Wachowski,0.0,Steve Bastoni,234.0,Daniel Bernhardt,...,30.0,281492479.0,Action|Sci-Fi,275.0,421818,car motorcycle chase|one against many|oracle|p...,English,USA,150000000.0,7.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4888,Slacker,1991.0,Black and White,R,100.0,Richard Linklater,0.0,Tommy Pallotta,5.0,Richard Linklater,...,0.0,1227508.0,Comedy|Drama,61.0,15103,austin texas|moon|pap smear|texas|twenty somet...,English,USA,23000.0,7.1
4892,Exeter,2015.0,Color,R,91.0,Marcus Nispel,158.0,Ashley Tramonte,630.0,Brittany Curran,...,265.0,,Horror|Mystery|Thriller,43.0,3836,asylum|demon|party|secret|teenager,English,USA,,4.6
4894,The Puffy Chair,2005.0,Color,R,85.0,Jay Duplass,157.0,Mark Duplass,830.0,Katie Aselton,...,10.0,192467.0,Comedy|Drama|Romance,51.0,4067,birthday|gift|motel|new york city|upholsterer,English,USA,15000.0,6.6
4899,Clean,2004.0,Color,R,110.0,Olivier Assayas,107.0,Maggie Cheung,576.0,Béatrice Dalle,...,45.0,136007.0,Drama|Music|Romance,81.0,3924,jail|junkie|money|motel|singer,French,France,4500.0,6.9



<span  style="color:green; font-size:16px">Write a function that accepts a content rating as its single parameter and returns the number of movies with that rating. Use the function to count the movies rated 'R', 'PG-13', and 'PG'.</span>

In [348]:
def count_content_rating(target_rating):
    """Return the number of movies whose content rating equals target_rating."""
    return movie[movie['content_rating'] == target_rating].shape[0]


count_content_rating('R')

2067
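
The exercise also asks for the 'PG-13' and 'PG' counts. The sketch below shows one way to get all three in a single step, and how `value_counts` gives the same tallies without a custom function. It runs on a small made-up `toy` DataFrame, so the numbers are not the real dataset's counts. The helper `count_rating` is a hypothetical variant that takes the DataFrame as an argument.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie DataFrame
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'content_rating': ['R', 'PG-13', 'R', 'PG', None, 'R'],
})

def count_rating(df, target_rating):
    """Count the rows of df whose content_rating equals target_rating."""
    # Summing a boolean Series counts its True values
    return int((df['content_rating'] == target_rating).sum())

# Apply the function to each rating the exercise asks about
counts = {rating: count_rating(toy, rating) for rating in ['R', 'PG-13', 'PG']}
print(counts)

# value_counts tallies every rating at once; missing values are skipped by default
all_counts = toy['content_rating'].value_counts()
print(all_counts['R'])
```

Summing the mask avoids building the filtered DataFrame just to read its row count, though `.shape[0]` on the filtered result gives the same answer.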