# Intro to Data Analysis with pandas

## Overview

### Objectives

+ Know why pandas is suitable for data analysis in Python.
+ Load CSV data into a pandas DataFrame.
+ Identify a DataFrame as a two-dimensional data structure with an **index**, **columns**, and **values**.
+ Identify a Series as a single dimensional data structure with an **index** and **values**.
+ Know the difference between the **index** and **values**.
+ Know ways to show a sample of a dataset or specific rows.
+ Know ways to describe and get information about the whole dataset.
+ Check and change data types.
+ Check for missing values.

### Resources

+ [Official Documentation](http://pandas.pydata.org/pandas-docs/stable/)
+ [Package Overview](http://pandas.pydata.org/pandas-docs/stable/overview.html)
+ [Intro to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)

## Welcome to ....
![][1]


### What is pandas?
pandas is one of the most popular open source data exploration libraries currently available. It gives its users the power to explore, manipulate, query, aggregate, and visualize **tabular** data. Tabular meaning data that is two-dimensional with rows and columns; i.e. a table.

### Why pandas and not xyz?
Many dozens of other tools now offer many of the same capabilities as the pandas library. However, several aspects of pandas make it an attractive choice for data analysis, and it continues to have one of the fastest growing user bases.

* It's a Python library and integrates well with the other popular data science libraries such as numpy, scikit-learn, statsmodels, matplotlib and seaborn.
* It is nearly self-contained in that lots of functionality is built into one package. This contrasts with R, where many packages are needed to obtain the same functionality.
* The community is excellent. Looking at Stack Overflow, for example, there are [many tens of thousands of][2] pandas questions. If you need help, you are nearly guaranteed to find it very quickly.

### Why is it named after an East Asian bear?

Wes McKinney began the pandas library in 2008 at a hedge fund named AQR. In finance, tabular data is often called 'panel data', which smashed together becomes pandas. If you are really interested in the history, you can hear it from the creator [himself][3].

### Python already has data structures to handle data, why do we need another one?

Even though Python is a high-level language, its primary built-in data structures, lists and dictionaries, do not lend themselves easily to tabular data analysis.

### pandas is built directly on numpy

[numpy][4] ('numerical Python') is the most popular third-party Python library for scientific computing and forms the foundation for dozens of others, including pandas. numpy's primary data structure is an n-dimensional array, which is much more powerful than a Python list and offers much better performance.

By default, most data in pandas is stored in numpy arrays. That said, it isn't necessary to know much about numpy when learning pandas. You can think of pandas as a higher-level, easier to use interface for doing data analysis than numpy. It is a good idea to eventually learn numpy, but for most tasks, pandas will be the right tool.

### numpy tutorial in appendix

Although it is not necessary to understand numpy to perform data analysis with pandas, it is a major piece of the data science ecosystem in Python and it can be used alongside pandas. A thorough numpy tutorial is available in Appendix A.

## pandas operates on tabular (table) data

There are numerous formats for data, such as XML, JSON, raw bytes, and many others. For our purposes, we will only examine what most people think of when they think of data: a table. pandas is built for analyzing this tabular, rectangular, deceptively normal concept of data. pandas can read many different data formats, but all of them are converted to tabular data.

### The DataFrame and Series

The DataFrame and Series are the two primary pandas objects that we will be using throughout this course.

* **DataFrame** - A two-dimensional data structure that looks like any other rectangular table of data you have seen with rows and columns.
* **Series** - A single dimension of data. It is analogous to a single column of data or a one dimensional array.

## Import pandas and read in data with the `read_csv` function

By convention, pandas is imported and aliased as `pd`. We will read in the `bikes` dataset with the `read_csv` function. Its first parameter is the location of the file, given as a string, relative to the current directory. In the cells below, the data is read from a `data` directory located alongside this notebook. If the data were stored one level up instead, the path would begin with two dots (`../`), which are interpreted as the directory immediately above the current one.

[1]: images/pandas_logo.png
[2]: http://stackoverflow.com/questions/tagged/pandas
[3]: https://www.youtube.com/watch?v=kHdkFyGCxiY
[4]: http://www.numpy.org/


In [1]:
import pandas as pd


In [28]:
bikes = pd.read_csv('data/bikes.csv')
type(bikes)

pandas.core.frame.DataFrame
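
As a self-contained sketch of what `read_csv` does, the snippet below parses a few lines of CSV text held in memory instead of a file on disk. The column names and values are borrowed from the `bikes` data, but the `csv_text` and `toy` names and the use of `io.StringIO` are assumptions made only for this illustration.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; io.StringIO lets read_csv treat a string like a file
csv_text = (
    "trip_id,usertype,tripduration\n"
    "7147,Subscriber,993\n"
    "7524,Subscriber,623\n"
)
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))   # the result is a DataFrame
print(toy.shape)   # (number of rows, number of columns)
```

The first line of the text becomes the column names, and each later line becomes one row.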

### Display DataFrame in Jupyter Notebook
We assigned the output from the `read_csv` function to the `bikes` variable which now refers to our DataFrame object. Let's get a visual display of our DataFrame by writing the variable name as the last line in a code cell.

In [None]:
bikes

### Default output
By default, pandas displays at most 60 rows and 20 columns. These display options (and many others) can be changed. This will be covered later.

## Our first methods - `head`, `tail`, and `sample`
A very useful and simple method is `head`, which by default returns the first 5 rows of the DataFrame. This avoids long default output and is something I highly recommend when doing data analysis within a notebook. The `tail` method returns the last 5 rows by default, and the `sample` method returns randomly chosen rows (one by default).

In [None]:
#head()

In [None]:
#tail()

In [34]:
#sample()
bikes.sample(10)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
15216,6428503,Subscriber,Female,2015-07-23 17:59:00,2015-07-23 18:06:00,461,Halsted St & Wrightwood Ave,41.929143,-87.649077,15.0,Theater on the Lake,41.926277,-87.630834,23.0,82.0,10.0,9.2,-9999.0,partlycloudy
23641,9177355,Subscriber,Male,2016-04-14 14:11:00,2016-04-14 14:29:00,1053,LaSalle St & Illinois St,41.890749,-87.63206,31.0,LaSalle St & Illinois St,41.890749,-87.63206,31.0,60.1,10.0,12.7,-9999.0,partlycloudy
47007,16684421,Subscriber,Male,2017-09-28 08:36:44,2017-09-28 08:45:59,555,Clark St & 9th St (AMLI),41.870816,-87.631246,15.0,Dearborn St & Monroe St,41.88132,-87.629521,39.0,60.1,10.0,8.1,-9999.0,partlycloudy
44795,15984914,Subscriber,Male,2017-08-23 06:37:28,2017-08-23 06:45:13,465,Clark St & Grace St,41.95078,-87.659172,19.0,Sheffield Ave & Wellington Ave,41.936266,-87.652662,23.0,61.0,10.0,9.2,-9999.0,clear
37582,13636312,Subscriber,Male,2017-04-13 18:22:50,2017-04-13 18:32:03,553,Marshfield Ave & Cortland St,41.916017,-87.668879,23.0,Campbell Ave & North Ave,41.910535,-87.689556,15.0,54.0,10.0,10.4,-9999.0,partlycloudy
48860,17220908,Subscriber,Male,2017-11-08 17:02:00,2017-11-08 17:12:00,608,Wells St & Concord Ln,41.912133,-87.634656,19.0,Stockton Dr & Wrightwood Ave,41.93132,-87.638742,15.0,41.0,10.0,5.8,-9999.0,partlycloudy
21976,8674697,Subscriber,Male,2016-01-31 01:01:00,2016-01-31 01:07:00,326,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Chicago Ave,41.895769,-87.67722,15.0,39.0,10.0,6.9,-9999.0,cloudy
20798,8326993,Subscriber,Male,2015-11-19 10:20:00,2015-11-19 10:21:00,89,Clarendon Ave & Gordon Ter,41.957879,-87.649519,15.0,Clarendon Ave & Junior Ter,41.961004,-87.649603,15.0,41.0,10.0,21.9,-9999.0,partlycloudy
20752,8315061,Subscriber,Male,2015-11-18 01:51:00,2015-11-18 02:05:00,818,Larrabee St & Oak St,41.900219,-87.642985,15.0,Franklin St & Jackson Blvd,41.877708,-87.635321,31.0,57.0,10.0,21.9,0.0,cloudy
20393,8212375,Subscriber,Male,2015-11-06 16:41:00,2015-11-06 16:50:00,515,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Financial Pl & Congress Pkwy,41.875009,-87.633106,31.0,48.0,10.0,10.4,-9999.0,cloudy


### First and Last `n` rows
Both the `head` and `tail` methods take a single integer parameter `n`, which controls the number of rows to return. 

In [None]:
# head(n_rows)
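
A minimal sketch of the `n` parameter, using a made-up ten-row DataFrame (the `toy` name and its values are hypothetical, not part of the `bikes` data). It also passes `random_state` to `sample`, which is optional and only used here to make the random draw repeatable.

```python
import pandas as pd

# Hypothetical DataFrame with ten rows labelled 0 through 9
toy = pd.DataFrame({"tripduration": range(100, 1100, 100)})

first_three = toy.head(3)   # rows labelled 0, 1, 2
last_two = toy.tail(2)      # rows labelled 8, 9
random_four = toy.sample(4, random_state=0)  # four randomly chosen rows

print(first_three)
print(last_two)
print(random_four)
```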

## Components of a DataFrame - columns, index, and data
The DataFrame is composed of three separate components that you must know: the **columns**, the **index**, and the **data**. These terms will be used throughout the course and understanding them is vital to your ability to use pandas. Take a look at the following graphic of our `bikes` DataFrame stylized to put emphasis on each component.

![](images/df_components.png)

* The **index** provides a label for each row
* The **columns** provide a label for each column
* The **index** is also referred to as the **row names/labels**
* The **columns** are also referred to as the **column names/labels** or the **column index**
* An individual element of the index is referred to as an **index label/name** or **row label/name**
* An individual element of the columns is a **column name/label**
* The index and the columns are always in **bold font**
* Collectively the index and the columns are known as the **axes** (or individually as an **axis**)
* pandas uses integers to refer to each axis; 0 for the index and 1 for the columns. This is borrowed directly from numpy
* The actual **data** is always in normal font
* The **data** is also referred to as the **values**
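
The three components can also be inspected directly as attributes. The sketch below assumes a small made-up DataFrame named `toy` (three rows borrowed from the sample data used later in this notebook) and shows how the axis numbers 0 and 1 are used in a method call.

```python
import pandas as pd

# Hypothetical three-row DataFrame with string row labels
toy = pd.DataFrame(
    {"age": [30, 2, 12], "height": [165, 70, 120]},
    index=["Jane", "Niko", "Aaron"],
)

print(toy.index)    # row labels (axis 0)
print(toy.columns)  # column labels (axis 1)
print(toy.values)   # the data as a numpy array

# axis=0 works down the rows, axis=1 works across the columns
column_totals = toy.sum(axis=0)  # one total per column
row_totals = toy.sum(axis=1)     # one total per row
```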

## What type of object is `bikes`?
As we said previously, `bikes` is a DataFrame. Let's verify this:

In [2]:
type(bikes)

pandas.core.frame.DataFrame

In [29]:
bikes

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.881050,-87.616970,11.0,Michigan Ave & Oak St,41.900960,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.883380,-87.641170,31.0,Wells St & Walton St,41.899930,-87.634430,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.881320,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.631890,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50084,17534938,Subscriber,Male,2017-12-30 13:07:00,2017-12-30 13:34:00,1625,State St & Pearson St,41.897448,-87.628722,27.0,Clark St & Elm St,41.902973,-87.631280,27.0,5.0,10.0,16.1,-9999.0,partlycloudy
50085,17534969,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),41.830661,-87.647172,16.0,Union Ave & Root St,41.819102,-87.643278,11.0,5.0,10.0,16.1,-9999.0,partlycloudy
50086,17534972,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,5.0,10.0,16.1,-9999.0,partlycloudy
50087,17535645,Subscriber,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,41.885637,-87.641823,23.0,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,7.0,10.0,11.5,-9999.0,partlycloudy


In [30]:
bikes.head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


In [31]:
bikes.head(2)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy


In [32]:
bikes.tail(10)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
50079,17533757,Subscriber,Male,2017-12-29 12:08:00,2017-12-29 12:12:00,270,Southport Ave & Belmont Ave,41.93949,-87.66378,15.0,Wilton Ave & Belmont Ave,41.94018,-87.65304,23.0,14.0,2.5,6.9,0.0,snow
50080,17534057,Subscriber,Male,2017-12-29 15:28:00,2017-12-29 15:51:00,1378,Cityfront Plaza Dr & Pioneer Ct,41.890573,-87.622072,23.0,Mies van der Rohe Way & Chestnut St,41.898587,-87.621915,19.0,14.0,1.5,6.9,0.01,snow
50081,17534131,Subscriber,Male,2017-12-29 16:09:00,2017-12-29 16:19:00,617,Kingsbury St & Erie St,41.893882,-87.641711,23.0,Canal St & Adams St,41.879255,-87.639904,47.0,14.0,1.5,6.9,0.0,snow
50082,17534773,Subscriber,Male,2017-12-30 10:47:00,2017-12-30 10:53:00,363,Larrabee St & Oak St,41.900219,-87.642985,19.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,3.9,10.0,13.8,-9999.0,mostlycloudy
50083,17534831,Subscriber,Male,2017-12-30 11:36:00,2017-12-30 11:55:00,1175,Western Ave & Walton St,41.898418,-87.686596,19.0,Damen Ave & Clybourn Ave,41.931931,-87.677856,15.0,3.9,10.0,13.8,-9999.0,partlycloudy
50084,17534938,Subscriber,Male,2017-12-30 13:07:00,2017-12-30 13:34:00,1625,State St & Pearson St,41.897448,-87.628722,27.0,Clark St & Elm St,41.902973,-87.63128,27.0,5.0,10.0,16.1,-9999.0,partlycloudy
50085,17534969,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:44:00,585,Halsted St & 35th St (*),41.830661,-87.647172,16.0,Union Ave & Root St,41.819102,-87.643278,11.0,5.0,10.0,16.1,-9999.0,partlycloudy
50086,17534972,Subscriber,Male,2017-12-30 13:34:00,2017-12-30 13:48:00,824,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,Halsted St & Blackhawk St (*),41.908537,-87.648627,20.0,5.0,10.0,16.1,-9999.0,partlycloudy
50087,17535645,Subscriber,Female,2017-12-31 09:30:00,2017-12-31 09:33:00,178,Clinton St & Lake St,41.885637,-87.641823,23.0,Kingsbury St & Kinzie St,41.889177,-87.638506,31.0,7.0,10.0,11.5,-9999.0,partlycloudy
50088,17536246,Subscriber,Male,2017-12-31 15:22:00,2017-12-31 15:26:00,214,Clarendon Ave & Leland Ave,41.967968,-87.650001,15.0,Clifton Ave & Lawrence Ave,41.968812,-87.657659,15.0,10.9,10.0,15.0,-9999.0,partlycloudy


### Fully-qualified name
Only the word after the last dot is the class name, so the `bikes` variable has type `DataFrame`. The output of `type` shows the fully-qualified name, which also includes the package and module where the class was defined.

### Location and module name?
The fully-qualified name mirrors where the class is defined on disk. In this example, `pandas` is a directory that contains another directory, `core`, which contains a file, `frame.py`, that defines the `DataFrame` class.

### Package, sub-package, and module
The top level directory of other files and directories containing Python files is technically called a **package**. In this example `pandas` is the package. All directories within the package are called **sub-packages** such as `core`. All Python files (those ending in .py) are called **modules**.
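
The pieces of the fully-qualified name can be looked up on the class itself. This is a rough sketch using a made-up one-column DataFrame; the exact dotted module path that gets printed may differ between pandas versions, so treat the output as illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2]})  # hypothetical one-column DataFrame
cls = type(toy)

class_name = cls.__name__      # the word after the last dot
module_name = cls.__module__   # dotted path recorded for the class
print(module_name + "." + class_name)

# The imported package also records where it lives on disk
print(pd.__file__)
```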

### Where are the packages located?
Third-party packages are installed in the `site-packages` directory, which is set up during Python installation. We can get the actual location with the built-in `site` module's `getsitepackages` function.

In [3]:
import site
site.getsitepackages()

['c:\\Users\\Hala\\AppData\\Local\\Programs\\Python\\Python39',
 'c:\\Users\\Hala\\AppData\\Local\\Programs\\Python\\Python39\\lib\\site-packages']

## Select a single column from a DataFrame - a Series
To select a single column from a DataFrame, pass the name of one of the columns to the brackets operator, `[]`. The returned object will be a pandas **Series**. Let's select the column named `tripduration`, assign it to a variable, and output it to the screen.

In [35]:
bikes.columns

Index(['trip_id', 'usertype', 'gender', 'starttime', 'stoptime',
       'tripduration', 'from_station_name', 'latitude_start',
       'longitude_start', 'dpcapacity_start', 'to_station_name',
       'latitude_end', 'longitude_end', 'dpcapacity_end', 'temperature',
       'visibility', 'wind_speed', 'precipitation', 'events'],
      dtype='object')

In [40]:
tripduration = bikes["tripduration"]

## `head` and `tail` methods work the same with a Series
Use the **`head`** and **`tail`** methods to condense the output.

In [41]:
type(tripduration)

pandas.core.series.Series
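
A short sketch of `head` and `tail` on a Series, assuming a made-up DataFrame named `toy` that holds the first six `tripduration` values from the `bikes` data.

```python
import pandas as pd

toy = pd.DataFrame({"tripduration": [993, 623, 1040, 667, 130, 660]})
durations = toy["tripduration"]   # one column name in brackets returns a Series

print(durations.head(3))   # first three values
print(durations.tail(2))   # last two values
```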

## Components of a Series - the index and the data
A Series is simpler than a DataFrame with just a single dimension of data. It has two components - the **index** and the **data**. It is essentially a one-column DataFrame. Let's take a look at a stylized Series graphic.

![](images/series_components.png)

The definitions of the index and data components are the same as they are for a DataFrame.

### Output of Series vs DataFrame
Notice that there is no nice HTML styling for the Series. It's just plain text. Also, below each Series will be some metadata on it - the **name**, **length**, and **dtype**. 

* The **name** is not important right now. If the Series is formed from a column of a DataFrame it will be set to that column name.
* The **length** is the number of values in the Series
* The **dtype** is the data type of the Series. Each column of data must be of only one particular data type. These will be covered later.

It's important to note that this metadata is NOT part of the data in the Series; it is just some extra info pandas outputs for your information.
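
The same three pieces of information printed below a Series can be retrieved individually. The sketch below uses a hypothetical three-value Series built from a made-up DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({"tripduration": [993, 623, 1040]})
durations = toy["tripduration"]

print(durations.name)    # taken from the column name
print(len(durations))    # number of values
print(durations.dtype)   # one data type for the whole Series
```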

## Exercises
Use the **`bikes`** DataFrame for the following:

### Exercise 1
<span  style="color:green; font-size:16px">Select the column **`events`**, the type of weather that was recorded and assign it to a variable with the same name. Output the first 10 values of it.</span>

In [65]:
df = pd.read_csv('bikes.csv')

In [67]:
events = df['events']
events.head(10)

### Exercise 2
<span  style="color:green; font-size:16px">What type of object is **`events`**?</span>

In [69]:
events = df[['events']]  # double brackets return a one-column DataFrame; df['events'] would be a Series

In [70]:
type(events)

pandas.core.frame.DataFrame

### Exercise 3
<span  style="color:green; font-size:16px">Select the last 2 rows of the **`bikes`** DataFrame and assign it to the variable **`bikes_last_2`**. What type of object is **`bikes_last_2`**?</span>

In [71]:
bikes_last_2 = df.tail(2)
type(bikes_last_2)

pandas.core.frame.DataFrame

## `df.info()` and `df.describe()` in pandas

**`df.info()` provides a concise summary of a DataFrame, including:**

+ Memory usage
+ Number of rows and columns
+ Data types of each column
+ Number of non-null values in each column

It gives a quick overview of the structure and basic information about the DataFrame.

**On the other hand, `df.describe()` provides a statistical summary of the numeric columns in the DataFrame, including:**

+ Count: Number of non-null values
+ Mean: Average value
+ Standard deviation: Measure of how spread out the values are
+ Minimum and maximum values
+ Quartiles (25%, 50%, 75%)

It helps in understanding the central tendency, variability, and distribution of the numeric data in the DataFrame.

<span  style="color:green; font-size:16px">In summary, `df.info()` is used for a quick overview of the DataFrame's structure and basic information, while `df.describe()` is used for a more detailed statistical analysis of the numeric columns in the DataFrame.</span>

In [42]:
bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50089 entries, 0 to 50088
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   trip_id            50089 non-null  int64  
 1   usertype           50089 non-null  object 
 2   gender             50089 non-null  object 
 3   starttime          50089 non-null  object 
 4   stoptime           50089 non-null  object 
 5   tripduration       50089 non-null  int64  
 6   from_station_name  50089 non-null  object 
 7   latitude_start     50083 non-null  float64
 8   longitude_start    50083 non-null  float64
 9   dpcapacity_start   50083 non-null  float64
 10  to_station_name    50089 non-null  object 
 11  latitude_end       50077 non-null  float64
 12  longitude_end      50077 non-null  float64
 13  dpcapacity_end     50077 non-null  float64
 14  temperature        50089 non-null  float64
 15  visibility         50089 non-null  float64
 16  wind_speed         500

In [44]:
bikes["tripduration"].head(20)

0      993
1      623
2     1040
3      667
4      130
5      660
6      565
7      505
8     1300
9      922
10    1523
11    1697
12    2263
13    1365
14     610
15     415
16     487
17     622
18    5396
19     384
Name: tripduration, dtype: int64

In [43]:
bikes.describe()

Unnamed: 0,trip_id,tripduration,latitude_start,longitude_start,dpcapacity_start,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation
count,50089.0,50089.0,50083.0,50083.0,50083.0,50077.0,50077.0,50077.0,50089.0,50089.0,50089.0,50089.0
mean,9472308.0,716.867755,41.900007,-87.644642,21.340215,41.900581,-87.644851,21.241708,62.608237,8.148827,7.070111,-9239.62687
std,4914277.0,1319.849896,0.034423,0.021512,7.634167,0.034634,0.021499,7.556756,48.151252,118.320794,178.93798,2648.862657
min,7147.0,60.0,41.744053,-87.80287,0.0,41.744053,-87.80287,0.0,-9999.0,-9999.0,-9999.0,-9999.0
25%,5332773.0,356.0,41.881032,-87.654752,15.0,41.88132,-87.654787,15.0,52.0,10.0,6.9,-9999.0
50%,9664991.0,572.0,41.89186,-87.641066,19.0,41.893832,-87.641088,19.0,66.9,10.0,10.4,-9999.0
75%,13632190.0,906.0,41.919936,-87.630585,23.0,41.92154,-87.629928,23.0,75.9,10.0,12.7,-9999.0
max,17536250.0,86188.0,42.064313,-87.560115,55.0,42.064313,-87.559275,55.0,96.1,10.0,42.6,0.39


In [None]:
#.describe()
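
To see both methods on something small enough to check by hand, here is a sketch with a made-up four-row DataFrame (the `toy` name and the missing height are assumptions for illustration). It also counts missing values per column with `isna().sum()`, which complements the non-null counts reported by `info`.

```python
import pandas as pd

# Hypothetical data; None in a numeric column is stored as a missing value (NaN)
toy = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL"],
        "age": [30, 2, 12, 4],
        "height": [165.0, None, 120.0, 80.0],
    }
)

toy.info()                   # prints dtypes and non-null counts
stats = toy.describe()       # numeric columns only, by default
missing = toy.isna().sum()   # number of missing values per column

print(stats)
print(missing)
```

Note that `describe` leaves out the `state` column because it is not numeric, and its `count` row for `height` excludes the missing value.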

## Selecting subsets of data

### The three indexers: `[]`, `loc`, and `iloc`

In [47]:
sampleDf = pd.read_csv('data/sample_data.csv', index_col=0)
# sampleDf = pd.read_csv('data/sample_data.csv')
sampleDf.head()

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


**1- Select Multiple Columns with a List**

You can select multiple columns by placing them in a list inside of just the brackets. Notice that a DataFrame and NOT a Series is returned.

In [60]:
#select the following columns:'color', 'age', 'score'
type(sampleDf['color'])


pandas.core.series.Series
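
The difference between a single column name and a list of names can be sketched as follows. The `toy` DataFrame is an assumption for illustration; its three rows are copied from the sample data shown above.

```python
import pandas as pd

# A few rows copied from the sample data
toy = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL"],
        "color": ["blue", "green", "red"],
        "age": [30, 2, 12],
        "score": [4.6, 8.3, 9.0],
    },
    index=["Jane", "Niko", "Aaron"],
)

one_column = toy["color"]                       # a string selects one column -> Series
three_columns = toy[["color", "age", "score"]]  # a list selects several -> DataFrame
one_column_df = toy[["color"]]                  # a one-item list still gives a DataFrame

print(type(one_column))
print(type(three_columns))
print(type(one_column_df))
```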

## Exercises
For the following exercises, make sure to use the `movie` dataset with title set as the index. It’s good practice to shorten your output with the head method.

### Exercise 1
<span  style="color:green; font-size:16px">Select the column with the director’s name as a Series</span>

### Exercise 2
<span  style="color:green; font-size:16px">Select the column with the director’s name and number of Facebook likes.</span>

**2- Selecting Subsets of Data from DataFrames with loc**

The `loc` indexer selects data in a different manner than just the brackets and has its own set of rules that we must learn.

**Simultaneous row and column subset selection with loc**

The `loc` indexer can select rows and columns simultaneously. This is not possible with just the brackets. This is done by separating the row and column selections with a comma. The selection will look something like this: **`df.loc[rows, cols]`**

**loc primarily selects data by label**

Very importantly, `loc` primarily selects data by the label of the rows and columns.


In [10]:
df = pd.read_csv('data/sample_data.csv', index_col=0)

In [11]:
df.head()

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


In [64]:
df.loc['Jane':'Christina','state':'food']

Unnamed: 0,state,color,food
Jane,NY,blue,Steak
Niko,TX,green,Lamb
Aaron,FL,red,Mango
Penelope,AL,white,Apple
Dean,AK,gray,Cheese
Christina,TX,black,Melon


**The possible types of row and column selections**

All of the following are valid objects available for both row and column selections with `loc`:

* A single label
* A list of labels
* A slice with labels
* A boolean Series or array (covered later)
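
The first three kinds of selection can be sketched on a small frame. The `toy` DataFrame below is an assumption for illustration, built from the six sample-data rows shown earlier; boolean selection is left out because it is covered later.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon"],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina"],
)

single = toy.loc["Niko", "color"]                      # single labels -> one value
listed = toy.loc[["Dean", "Jane"], ["food", "state"]]  # lists keep the order given
sliced = toy.loc["Niko":"Penelope", "state":"color"]   # label slices include both endpoints

print(single)
print(listed)
print(sliced)
```

Unlike ordinary Python slices, a label slice with `loc` includes the stop label, which is why `Penelope` and `color` appear in the last result.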

In [61]:
# Select two rows and three columns with loc
rows = ['Dean', 'Cornelia']
cols = ['age', 'state', 'score']
test_1 = df.loc[rows,cols]
test_1


Unnamed: 0,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2


In [14]:
#loc with slice notation
#cols = ['state', 'color']
# df.loc['Jane':'Penelope', cols]

In [15]:
#Slice both the rows and columns


## Exercises
### Exercise 1
<span  style="color:green; font-size:16px">Read in the `movie` dataset and set the `title` column as the index. Select all columns for the movie ‘The Dark Knight Rises’.</span>


In [2]:
import pandas as pd

In [4]:
moves = pd.read_csv('data/movie.csv', index_col='title')

In [6]:
moves.head()

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1


### Exercise 2

<span  style="color:green; font-size:16px">Select all columns for the movies ‘Tangled’ and ‘Avatar’.</span>

In [7]:
moves.loc[['Tangled','Avatar']]

Unnamed: 0_level_0,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Tangled,2010.0,Color,PG,100.0,Nathan Greno,15.0,Brad Garrett,799.0,Donna Murphy,553.0,...,284.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,324.0,294810,17th century|based on fairy tale|disney|flower...,English,USA,260000000.0,7.8
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9


### Exercise 3
<span  style="color:green; font-size:16px">What years were ‘Tangled’ and ‘Avatar’ made, and what were their IMDb scores?</span>

In [8]:
moves.columns

Index(['year', 'color', 'content_rating', 'duration', 'director_name',
       'director_fb', 'actor1', 'actor1_fb', 'actor2', 'actor2_fb', 'actor3',
       'actor3_fb', 'gross', 'genres', 'num_reviews', 'num_voted_users',
       'plot_keywords', 'language', 'country', 'budget', 'imdb_score'],
      dtype='object')

In [9]:
moves.loc[['Tangled','Avatar'],['year', 'imdb_score']]

Unnamed: 0_level_0,year,imdb_score
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Tangled,2010.0,7.8
Avatar,2009.0,7.9


### Exercise 4
<span  style="color:green; font-size:16px">Can you tell what the data type of the year column is by just looking at its values?</span>

In [10]:
moves.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4916 entries, Avatar to My Date with Drew
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             4810 non-null   float64
 1   color            4897 non-null   object 
 2   content_rating   4616 non-null   object 
 3   duration         4901 non-null   float64
 4   director_name    4814 non-null   object 
 5   director_fb      4814 non-null   float64
 6   actor1           4909 non-null   object 
 7   actor1_fb        4909 non-null   float64
 8   actor2           4903 non-null   object 
 9   actor2_fb        4903 non-null   float64
 10  actor3           4893 non-null   object 
 11  actor3_fb        4893 non-null   float64
 12  gross            4054 non-null   float64
 13  genres           4916 non-null   object 
 14  num_reviews      4867 non-null   float64
 15  num_voted_users  4916 non-null   int64  
 16  plot_keywords    4764 non-null   object 
 17  l

**3-Selecting Subsets of Data from DataFrames with iloc**

The `iloc` indexer is very similar to `loc`, but it uses only integer location to make its selections.
The name `iloc` stands for integer location, which should help you remember what it does.

**Simultaneous row and column subset selection with iloc**

Selection with `iloc` looks like the following, with a comma separating the row and column selections.

`df.iloc[rows, cols]`

All of the following are valid objects for both row and column selections with `iloc`. Unlike `loc`, the `iloc` indexer
does not accept a boolean Series, so boolean selection is normally done with `loc` or `[]` instead.

• A single integer

• A list of integers

• A slice with integers
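
As a minimal sketch of these three kinds of selectors, the example below passes an integer, a list of integers, and a slice to `iloc`. The DataFrame `toy` and its values are made up for illustration and are not part of the movie dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate iloc
toy = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5], "c": list("wxyz")},
    index=["r0", "r1", "r2", "r3"],
)

single = toy.iloc[2, 0]          # single integers for row and column -> one value
rows_list = toy.iloc[[3, 0], :]  # list of integers -> rows in the order given
sliced = toy.iloc[1:3, 0:2]      # slices -> the end position is excluded

print(single)
print(rows_list)
print(sliced)
```

Note that the list `[3, 0]` returns the rows in the order requested, not in their original order.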

**Select a single row or column as a DataFrame and NOT a Series**

**Select some rows and a single column**

You can select a single row (or column) and get back a DataFrame rather than a Series if you use a list to make the selection.
Let’s replicate the selection from the previous example, but use a one-item list for the column selection.

In [3]:
moviesDf = pd.read_csv('data/movie.csv')

# rows = [2, 3, 5]
# cols = 4
# df.iloc[rows, cols]
#-----------------#
# VS
# rows = [2, 3, 5]
# cols = [4]
# df.iloc[rows, cols]
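
The commented comparison above can be tried on a small stand-in DataFrame. This is only a sketch: `toy` and its columns are hypothetical, chosen so the result is easy to read. It shows that an integer column selector returns a Series, while a one-item list returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2001, 2002, 2003, 2004],
    "score": [7.1, 6.5, 8.0, 5.9],
})

rows = [0, 2, 3]

as_series = toy.iloc[rows, 1]    # integer column selector -> Series
as_frame = toy.iloc[rows, [1]]   # one-item list -> DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)
```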


In [22]:
moviesDf.iloc[:3,:3]

Unnamed: 0,title,year,color
0,Avatar,2009.0,Color
1,Pirates of the Caribbean: At World's End,2007.0,Color
2,Spectre,2015.0,Color


In [27]:
moviesDf.loc[:3,:"color"]

Unnamed: 0,title,year,color
0,Avatar,2009.0,Color
1,Pirates of the Caribbean: At World's End,2007.0,Color
2,Spectre,2015.0,Color
3,The Dark Knight Rises,2012.0,Color
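
The two cells above return a different number of rows even though both slices are written `:3`. `iloc` slices by position and excludes the end, while `loc` slices by label and includes it. A minimal sketch on made-up data (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..4
toy = pd.DataFrame({"title": list("ABCDE"),
                    "year": [2001, 2002, 2003, 2004, 2005]})

by_position = toy.iloc[:3]   # positions 0, 1, 2 (end excluded)
by_label = toy.loc[:3]       # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))
```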


In [4]:
moviesDf.head()

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
0,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
2,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
3,The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
4,Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,...,,,Documentary,,8,,,,,7.1


## Exercises
• Use the movie dataset for the following exercises
### Exercise 1
<span  style="color:green; font-size:16px">Select the rows with integer location 10, 5, and 1</span>

In [6]:
x=moviesDf.columns
print(x)

Index(['title', 'year', 'color', 'content_rating', 'duration', 'director_name',
       'director_fb', 'actor1', 'actor1_fb', 'actor2', 'actor2_fb', 'actor3',
       'actor3_fb', 'gross', 'genres', 'num_reviews', 'num_voted_users',
       'plot_keywords', 'language', 'country', 'budget', 'imdb_score'],
      dtype='object')


In [7]:
moviesDf.index

RangeIndex(start=0, stop=4916, step=1)

In [16]:
_3rows=moviesDf.loc[:3]

In [20]:
list(_3rows.index)

[0, 1, 2, 3]

### Exercise 2
<span  style="color:green; font-size:16px">Select the columns with integer location 10, 5, and 1</span>

### Exercise 3
<span  style="color:green; font-size:16px">Select rows with integer location 100 to 104 along with the column integer location 5.</span>

**4-Boolean Selection Single Conditions**

**Examples of Boolean Selection**

Let’s see some examples of actual questions (in plain English) that boolean selection can help us answer from the
bikes dataset. The term query is used to refer to these sorts of questions.

• Find all rides by males

• Find all rides with duration longer than 2 hours

• Find all rides that took place between March and June of 2015

• Find all rides with a duration longer than 2 hours by females with temperature higher than 90 degrees

**All queries have a logical condition**

Each of the above queries has a strict logical condition that must be checked one row at a time.

**Keep or discard an entire row of data**

If you were to answer the above queries manually, you would need to scan each row and determine whether the row,
as a whole, meets the condition. If it does, the row is kept in the result; otherwise it is discarded.

**Each row will have a True or False value associated with it**

When you perform boolean selection, each row of the DataFrame (or value of a Series) has a True or False
value associated with it that corresponds to the outcome of the logical condition.

**Creating boolean Series from column data**

By far the most common way to create a boolean Series is from the values of one particular column. We test
a condition using one of the six comparison operators:

• `<`

• `<=`

• `>`

• `>=`

• `==`

• `!=`
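
To make this concrete, here is a minimal sketch of comparing a column against a single value. The Series below is made up for illustration; only its name follows the bikes dataset.

```python
import pandas as pd

# Hypothetical trip durations in seconds
tripduration = pd.Series([300, 1200, 90, 1000], name="tripduration")

# Each comparison is applied element by element and returns a boolean Series
longer_than_1000 = tripduration > 1000
exactly_1000 = tripduration == 1000

print(longer_than_1000)
print(longer_than_1000.dtype)
```

The result has the same index as the original Series, which is what lets pandas line it up with the rows of the DataFrame during selection.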

In [28]:
#Let’s begin by reading in the bikes dataset.
bikesDf = pd.read_csv("data/bikes.csv")


In [29]:
bikesDf.columns

Index(['trip_id', 'usertype', 'gender', 'starttime', 'stoptime',
       'tripduration', 'from_station_name', 'latitude_start',
       'longitude_start', 'dpcapacity_start', 'to_station_name',
       'latitude_end', 'longitude_end', 'dpcapacity_end', 'temperature',
       'visibility', 'wind_speed', 'precipitation', 'events'],
      dtype='object')

In [30]:
bikesDf["gender"].unique()

array(['Male', 'Female'], dtype=object)

In [34]:
filt_1 = bikesDf["gender"]=='Male'
males_only = bikesDf[filt_1]
males_only["gender"].sample(10)

833      Male
19779    Male
15697    Male
40042    Male
45447    Male
33474    Male
9559     Male
33074    Male
35965    Male
29668    Male
Name: gender, dtype: object

In [35]:
# ~ inverts the boolean Series, so this keeps the rows where filt_1 is False
females_only = bikesDf[~filt_1]
females_only["gender"].sample(10)

14085    Female
24255    Female
48945    Female
46940    Female
41691    Female
21951    Female
23984    Female
15474    Female
49864    Female
45125    Female
Name: gender, dtype: object

In [43]:
filt_2 = bikesDf["gender"]=="Female"
filt_3 = bikesDf["tripduration"]<90

In [47]:
female_less_90 = bikesDf[filt_2 & filt_3]

In [48]:
female_less_90[["gender","tripduration"]]

Unnamed: 0,gender,tripduration
2485,Female,67
3366,Female,62
3629,Female,70
4930,Female,77
6430,Female,85
6862,Female,60
6914,Female,64
7909,Female,87
9551,Female,79
11168,Female,83


In [51]:
bikesDf["usertype"].unique()

array(['Subscriber', 'Customer', 'Dependent'], dtype=object)

In [52]:
f1= bikesDf["gender"] == "Female"

In [53]:
f2= bikesDf["usertype"] == "Customer"

In [58]:
logina = bikesDf[bikesDf["gender"] == "Female"]

In [62]:
logina = bikesDf[f1 | f2]

In [66]:
logina[["gender","usertype"]]
logina["gender"][logina["usertype"]=="Customer"]

7031     Female
28102      Male
30389      Male
38033      Male
38794      Male
41190      Male
41740      Male
41963      Male
45551    Female
Name: gender, dtype: object

Let’s create a boolean Series by determining which rows have a trip duration greater than 1000 seconds. To make the
comparison, we select the tripduration column as a Series and compare it against the integer 1000.

In [22]:
#How many rows have a trip duration greater than 1000?

In [23]:
#Let’s find all the rides longer than 1,000 seconds when it was cloudy.

In [24]:
#Let’s find all the rides that were done by females or had trip durations longer than 1,000 seconds.
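
The three questions in the cells above can be sketched on a small made-up DataFrame. The column names follow the bikes dataset, but the rows, and in particular the `'cloudy'` label in `events`, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; the values (including the 'cloudy' label) are made up
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "tripduration": [1500, 800, 2000, 400],
    "events": ["cloudy", "cloudy", "clear", "clear"],
})

long_ride = toy["tripduration"] > 1000
cloudy = toy["events"] == "cloudy"
female = toy["gender"] == "Female"

n_long = long_ride.sum()                   # True counts as 1, so sum() counts matches
long_and_cloudy = toy[long_ride & cloudy]  # both conditions must hold
female_or_long = toy[female | long_ride]   # at least one condition must hold

print(n_long)
print(long_and_cloudy)
print(female_or_long)
```

When the comparisons are written inline instead of being stored in variables first, wrap each one in parentheses, for example `toy[(toy["tripduration"] > 1000) & (toy["events"] == "cloudy")]`, because `&` and `|` bind more tightly than the comparison operators.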

**Inverting a condition with the not operator**

The tilde character, ~, represents the not operator and inverts a condition.

In [25]:
#Select all the rides with a trip duration less than or equal to 1000.

In [26]:
#Let’s reverse the condition for selecting rides by females or those with duration over 1,000 seconds. Logically, this should return only male riders with duration 1,000 or less.
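
A minimal sketch of both inversions, again on made-up rows (`toy` is hypothetical). It assumes, as in the `unique()` output for the bikes data, that `gender` only takes the values `'Male'` and `'Female'`, so "not female" means male.

```python
import pandas as pd

# Hypothetical rows using two column names from the bikes dataset
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "tripduration": [1500, 800, 2000, 400],
})

female = toy["gender"] == "Female"
long_ride = toy["tripduration"] > 1000

short_rides = toy[~long_ride]         # duration less than or equal to 1000
neither = toy[~(female | long_ride)]  # not female and not longer than 1000

# The same rows, written with each condition inverted separately
same = toy[~female & ~long_ride]

print(short_rides)
print(neither)
```

The parentheses around `female | long_ride` matter: without them, `~` would apply only to `female`.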

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select all movies that have Tom Hanks as actor1. How
many of these movies has he starred in?</span>

### Exercise 2
<span  style="color:green; font-size:16px">Select movies with an IMDB score greater than 9.</span>


### Exercise 3
<span  style="color:green; font-size:16px">Write a function that accepts a single parameter to find the number of movies for a given content rating. Use the
function to find the number of movies for ratings ‘R’, ‘PG-13’, and ‘PG’.</span>
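
One possible shape for such a function is sketched below on a small stand-in DataFrame, not on the movie dataset itself. The names `toy` and `count_rating` are hypothetical; the idea is simply to build a boolean Series and sum it.

```python
import pandas as pd

# Hypothetical stand-in for the movie dataset
toy = pd.DataFrame({"content_rating": ["R", "PG-13", "R", "PG", "R"]})

def count_rating(rating):
    """Return how many rows of `toy` have the given content rating."""
    return int((toy["content_rating"] == rating).sum())

for r in ["R", "PG-13", "PG"]:
    print(r, count_rating(r))
```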