# Data Manipulation with Pandas

## Chapter 1: Introducing DataFrames

### Course Outline

- **Chapter 1: DataFrames**

    * Sorting and subsetting
    * Creating new columns

- **Chapter 2: Aggregating Data**

    * Summary statistics
    * Counting
    * Grouped summary statistics

- **Chapter 3: Slicing and Indexing Data**

    * Subsetting using slicing
    * Indexes and subsetting using indexes

- **Chapter 4: Creating and Visualizing Data**

    * Plotting
    * Handling missing data
    * Reading data into a DataFrame


In [1]:
# Pandas is built on NumPy and Matplotlib

# Important libraries
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 

### Rectangular Data

![Rectangular data](https://i.imgur.com/MjvZc8p.png)

There are several ways to store data for analysis, but rectangular data, sometimes called "tabular data", is the most common form.  
In this example, with dogs, each observation, or each dog, is a row, and each variable, or each dog property, is a column. Pandas is designed to work with rectangular data like this.

In [3]:
dogs_df = pd.read_csv('datasets\\Dogs.csv')
print(dogs_df)

     Name         Breed Colour  Height(cm)   Weight (kg)  Date of Birth 
0    Bella     Labrador  Brown          56            25      2013-07-01
1  Charlie       Poodle  Black          43            23      2016-09-16
2     Lucy       Poodle  Brown          46            22      2014-08-25
3   Cooper    Schnauzer   Gray          49            17      2011-12-11
4      Max     Labrador  Black          59            29      2017-01-20
5   Stella    Chihuahua    Tan          18             2      2015-04-20
6   Bernie  St. Bernard  White          77            74      2018-02-27


In [4]:
# Exploring a DataFrame .head()

# When you first receive a new dataset, you want to quickly explore it and get a sense of its contents.
# Pandas has several methods for this.
# The first is head, which returns the first few rows of the DataFrame.

dogs_df.head()

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
1,Charlie,Poodle,Black,43,23,2016-09-16
2,Lucy,Poodle,Brown,46,22,2014-08-25
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
4,Max,Labrador,Black,59,29,2017-01-20


In [5]:
# Exploring a DataFrame .info()

# The info() method displays the column names, the data types they contain, and whether they have any missing values.

dogs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Name             7 non-null      object
 1   Breed            7 non-null      object
 2   Colour           7 non-null      object
 3   Height(cm)       7 non-null      int64 
 4    Weight (kg)     7 non-null      int64 
 5    Date of Birth   7 non-null      object
dtypes: int64(2), object(4)
memory usage: 464.0+ bytes


In [6]:
# Exploring a DataFrame .shape

# A DataFrame's shape attribute contains a tuple that holds the number of rows followed by the number of columns

dogs_df.shape

(7, 6)

In [7]:
# Exploring a DataFrame .describe()

# The describe() method computes summary statistics for the numerical columns, such as the mean and the median (the 50% row).
# "count" is the number of non-missing values in each column.
# describe() is good for a quick overview of numeric variables.

dogs_df.describe()

Unnamed: 0,Height(cm),Weight (kg)
count,7.0,7.0
mean,49.714286,27.428571
std,17.960274,22.292429
min,18.0,2.0
25%,44.5,19.5
50%,49.0,23.0
75%,57.5,27.0
max,77.0,74.0


In [8]:
# Components of a DataFrame .values

# A DataFrame consists of three components, each accessible through an attribute.
# The values attribute, as you might expect, contains the data values in a 2-dimensional NumPy array.

dogs_df.values

array([['Bella', 'Labrador', 'Brown', 56, 25, '2013-07-01'],
       ['Charlie', 'Poodle', 'Black', 43, 23, '2016-09-16'],
       ['Lucy', 'Poodle', 'Brown', 46, 22, '2014-08-25'],
       ['Cooper', 'Schnauzer', 'Gray', 49, 17, '2011-12-11'],
       ['Max', 'Labrador', 'Black', 59, 29, '2017-01-20'],
       ['Stella', 'Chihuahua', 'Tan', 18, 2, '2015-04-20'],
       ['Bernie', 'St. Bernard', 'White', 77, 74, '2018-02-27']],
      dtype=object)
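The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for new code. The sketch below uses a small stand-in frame (hypothetical, two dogs only) to show that mixing text and numeric columns yields an `object` array, while a numeric-only selection keeps an integer dtype.

```python
import pandas as pd

# Hypothetical two-row stand-in for dogs_df
dogs = pd.DataFrame({"Name ": ["Bella", "Stella"], "Height(cm)": [56, 18]})

# Mixed text and numbers: NumPy falls back to a generic object array
arr = dogs.to_numpy()
print(arr.shape, arr.dtype)

# Numeric columns only: the array keeps an integer dtype
heights = dogs[["Height(cm)"]].to_numpy()
print(heights.dtype)
```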

In [10]:
# Components of a DataFrame .columns and .index

# The other two components of a DataFrame are the labels for its columns and rows.
# The columns attribute contains the column names, and the index attribute contains the row numbers or row names.

dogs_df.columns

Index(['Name ', 'Breed', 'Colour', 'Height(cm)', ' Weight (kg)',
       ' Date of Birth '],
      dtype='object')
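Several of these labels carry stray spaces ('Name ', ' Weight (kg)', ' Date of Birth '), which is why later cells have to type them with the padding. One possible clean-up, sketched here on a small stand-in frame rather than on `dogs_df` itself, is to strip the whitespace from every label. The rest of this notebook keeps the original padded names.

```python
import pandas as pd

# Stand-in frame that reproduces the padded headers shown above
padded = pd.DataFrame(
    {
        "Name ": ["Bella", "Charlie"],
        " Weight (kg)": [25, 23],
        " Date of Birth ": ["2013-07-01", "2016-09-16"],
    }
)

# rename() applies str.strip to each column label and returns a new DataFrame
cleaned = padded.rename(columns=str.strip)
print(list(cleaned.columns))  # ['Name', 'Weight (kg)', 'Date of Birth']
```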

In [11]:
dogs_df.index

RangeIndex(start=0, stop=7, step=1)

## Sorting and subsetting

In this chapter, we will cover the two simplest, and possibly most important, ways to find interesting parts of your DataFrame: sorting and subsetting.

### Sorting

The first thing we can do is change the order of the rows by sorting them so that the most interesting data is at the top of the DataFrame.  
This is done by using the .sort_values() method, passing in a column name that you want to sort by.

In [16]:
dogs_df.sort_values(" Weight (kg)")

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
5,Stella,Chihuahua,Tan,18,2,2015-04-20
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
2,Lucy,Poodle,Brown,46,22,2014-08-25
1,Charlie,Poodle,Black,43,23,2016-09-16
0,Bella,Labrador,Brown,56,25,2013-07-01
4,Max,Labrador,Black,59,29,2017-01-20
6,Bernie,St. Bernard,White,77,74,2018-02-27


In [17]:
# Sorting in descending order
# Setting ascending argument to False will sort the data the other way around.

dogs_df.sort_values(" Weight (kg)", ascending=False)

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
6,Bernie,St. Bernard,White,77,74,2018-02-27
4,Max,Labrador,Black,59,29,2017-01-20
0,Bella,Labrador,Brown,56,25,2013-07-01
1,Charlie,Poodle,Black,43,23,2016-09-16
2,Lucy,Poodle,Brown,46,22,2014-08-25
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
5,Stella,Chihuahua,Tan,18,2,2015-04-20


In [18]:
# Sorting by multiple variables
# Here we sort by weight first; height is only used to order dogs that have the same weight

dogs_df.sort_values([" Weight (kg)", "Height(cm)"])

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
5,Stella,Chihuahua,Tan,18,2,2015-04-20
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
2,Lucy,Poodle,Brown,46,22,2014-08-25
1,Charlie,Poodle,Black,43,23,2016-09-16
0,Bella,Labrador,Brown,56,25,2013-07-01
4,Max,Labrador,Black,59,29,2017-01-20
6,Bernie,St. Bernard,White,77,74,2018-02-27


In [19]:
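# We can also pass a list to the ascending argument, with one value per sort column.
# No two dogs share a weight here, so the direction chosen for height does not change the result.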

dogs_df.sort_values([" Weight (kg)", "Height(cm)"], ascending=[True, False])

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
5,Stella,Chihuahua,Tan,18,2,2015-04-20
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
2,Lucy,Poodle,Brown,46,22,2014-08-25
1,Charlie,Poodle,Black,43,23,2016-09-16
0,Bella,Labrador,Brown,56,25,2013-07-01
4,Max,Labrador,Black,59,29,2017-01-20
6,Bernie,St. Bernard,White,77,74,2018-02-27
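Because every dog in the dataset has a different weight, the second sort key never comes into play above. The following sketch uses made-up data with tied weights (the names and numbers are hypothetical) to show how the second column and its own `ascending` flag break ties.

```python
import pandas as pd

# Hypothetical data with tied weights, so the second sort key matters
tied = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "weight_kg": [20, 10, 20, 10],
        "height_cm": [40, 30, 50, 35],
    }
)

# Weight ascending; within equal weights, height descending
ordered = tied.sort_values(["weight_kg", "height_cm"], ascending=[True, False])
print(ordered["name"].tolist())  # ['D', 'B', 'C', 'A']
```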


### Subsetting columns

We may want to zoom in on just one column. We can do this using the name of the DataFrame, followed by square brackets with a column name inside.


In [22]:
dogs_df['Name ']

0      Bella
1    Charlie
2       Lucy
3     Cooper
4        Max
5     Stella
6     Bernie
Name: Name , dtype: object

#### Subsetting multiple columns

To select multiple columns, you need two pairs of square brackets. In this code, the inner and outer square brackets are performing different tasks.  
The outer square brackets are responsible for subsetting the DataFrame, and the inner square brackets create a list of column names to subset.  
This also means you could provide a separate list of column names as a variable and then use that list to perform the same subsetting.

In [24]:
dogs_df[["Breed", "Height(cm)" ]]

Unnamed: 0,Breed,Height(cm)
0,Labrador,56
1,Poodle,43
2,Poodle,46
3,Schnauzer,49
4,Labrador,59
5,Chihuahua,18
6,St. Bernard,77


In [25]:
# Using a variable

cols_to_subset = ["Breed", "Height(cm)" ]
dogs_df[cols_to_subset]

Unnamed: 0,Breed,Height(cm)
0,Labrador,56
1,Poodle,43
2,Poodle,46
3,Schnauzer,49
4,Labrador,59
5,Chihuahua,18
6,St. Bernard,77


### Subsetting Rows

There are lots of different ways to subset rows. The most common way is to create a logical condition to filter against.

In [28]:
# All the dogs whose height is greater than 50 cm

dogs_df["Height(cm)"] > 50

0     True
1    False
2    False
3    False
4     True
5    False
6     True
Name: Height(cm), dtype: bool

In [29]:
# We can use the logical condition inside square brackets to subset the rows

dogs_df[dogs_df["Height(cm)"] > 50 ]

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
4,Max,Labrador,Black,59,29,2017-01-20
6,Bernie,St. Bernard,White,77,74,2018-02-27


In [30]:
# We can also subset rows based on the text data

dogs_df[dogs_df["Breed"] == "Labrador" ] 

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
4,Max,Labrador,Black,59,29,2017-01-20


In [31]:
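# We can also subset based on dates.
# The column holds strings (dtype object), but the comparison still works because
# ISO-formatted dates (YYYY-MM-DD) sort the same way alphabetically as chronologically.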

dogs_df[dogs_df[" Date of Birth " ] < "2015-01-01" ]

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
2,Lucy,Poodle,Brown,46,22,2014-08-25
3,Cooper,Schnauzer,Gray,49,17,2011-12-11
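The filter above compares strings, which only works because the dates are written in ISO format. A sturdier variant, sketched below on a four-dog stand-in frame, parses the column with `pd.to_datetime` first. The `dob` column name is an assumption made for this illustration and is not part of the dataset.

```python
import pandas as pd

# Stand-in for part of dogs_df, keeping the padded column names
dogs = pd.DataFrame(
    {
        "Name ": ["Bella", "Charlie", "Lucy", "Cooper"],
        " Date of Birth ": ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11"],
    }
)

# Parse the ISO strings into real datetime values ("dob" is a hypothetical name)
dogs["dob"] = pd.to_datetime(dogs[" Date of Birth "])

# Compare against a Timestamp instead of relying on string ordering
born_before_2015 = dogs[dogs["dob"] < pd.Timestamp("2015-01-01")]
print(born_before_2015["Name "].tolist())  # ['Bella', 'Lucy', 'Cooper']
```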


In [32]:
# To subset rows that meet multiple conditions, you can combine conditions using logical operators,
# such as the "and" operator (&) seen here.
# This means that only rows that meet both of these conditions will be kept.

is_lab = dogs_df['Breed'] == 'Labrador'
is_brown = dogs_df['Colour'] == 'Brown'

dogs_df[is_lab & is_brown]

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01


In [33]:
# We can do this in one line of code as well

dogs_df[ (dogs_df['Breed'] == 'Labrador') & (dogs_df['Colour'] == 'Brown') ]

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
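The parentheses in the one-line version are required: `&` binds more tightly than `==`, so without them Python would try to evaluate `'Labrador' & dogs_df['Colour']` first. Python's own `and` and `or` keywords do not work element-wise on a Series and raise a `ValueError`. As a small sketch on a reduced copy of the data, the "or" operator `|` follows the same pattern.

```python
import pandas as pd

# Reduced copy of dogs_df with only the columns needed here
dogs = pd.DataFrame(
    {
        "Name ": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "Breed": ["Labrador", "Poodle", "Poodle", "Schnauzer",
                  "Labrador", "Chihuahua", "St. Bernard"],
        "Colour": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    }
)

# Rows that satisfy at least one of the two conditions
lab_or_brown = dogs[(dogs["Breed"] == "Labrador") | (dogs["Colour"] == "Brown")]
print(lab_or_brown["Name "].tolist())  # ['Bella', 'Lucy', 'Max']
```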


### Subsetting using .isin()

If we want to filter on multiple values of a categorical variable, the easiest way to do this is to use the .isin() method. This takes in a list of values to filter for.  
Here, we check if the color of a dog is black or brown, and use this condition to subset the data.

In [34]:
is_black_or_brown = dogs_df['Colour'].isin( ['Black', 'Brown'] )
dogs_df[is_black_or_brown]

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth
0,Bella,Labrador,Brown,56,25,2013-07-01
1,Charlie,Poodle,Black,43,23,2016-09-16
2,Lucy,Poodle,Brown,46,22,2014-08-25
4,Max,Labrador,Black,59,29,2017-01-20
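As a quick check of what `.isin()` is doing, the sketch below (again on a reduced copy of the data) confirms that it produces the same mask as chaining two equality tests with `|`, and shows how `~` inverts the mask to keep every other colour.

```python
import pandas as pd

# Reduced copy of dogs_df with only the columns needed here
dogs = pd.DataFrame(
    {
        "Name ": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "Colour": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    }
)

is_black_or_brown = dogs["Colour"].isin(["Black", "Brown"])

# Equivalent mask written out by hand
same_mask = (dogs["Colour"] == "Black") | (dogs["Colour"] == "Brown")
print(is_black_or_brown.equals(same_mask))  # True

# ~ flips True and False, keeping the dogs that are neither black nor brown
other_colours = dogs[~is_black_or_brown]
print(other_colours["Name "].tolist())  # ['Cooper', 'Stella', 'Bernie']
```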


## New Columns

### Adding a new column

When you first receive a DataFrame, its contents often aren't exactly what you want. You may want to add a new column to the DataFrame, or you may need new columns derived from existing ones.

In [36]:
# Adding a new column

# On the left-hand side of the equals sign, we use square brackets with the name of the column we want to create.
# On the right-hand side, we have the calculation.

dogs_df['height_m'] = dogs_df['Height(cm)'] / 100

dogs_df

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth,height_m
0,Bella,Labrador,Brown,56,25,2013-07-01,0.56
1,Charlie,Poodle,Black,43,23,2016-09-16,0.43
2,Lucy,Poodle,Brown,46,22,2014-08-25,0.46
3,Cooper,Schnauzer,Gray,49,17,2011-12-11,0.49
4,Max,Labrador,Black,59,29,2017-01-20,0.59
5,Stella,Chihuahua,Tan,18,2,2015-04-20,0.18
6,Bernie,St. Bernard,White,77,74,2018-02-27,0.77


In [37]:
# Doggy mass index or BMI = weight in kg / (Height in m) ** 2

dogs_df['bmi'] = dogs_df[' Weight (kg)'] / dogs_df['height_m'] ** 2

dogs_df.head()

Unnamed: 0,Name,Breed,Colour,Height(cm),Weight (kg),Date of Birth,height_m,bmi
0,Bella,Labrador,Brown,56,25,2013-07-01,0.56,79.719388
1,Charlie,Poodle,Black,43,23,2016-09-16,0.43,124.391563
2,Lucy,Poodle,Brown,46,22,2014-08-25,0.46,103.969754
3,Cooper,Schnauzer,Gray,49,17,2011-12-11,0.49,70.803832
4,Max,Labrador,Black,59,29,2017-01-20,0.59,83.309394
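Assigning with square brackets modifies `dogs_df` in place. If you would rather leave the original untouched, one alternative is `DataFrame.assign()`, which returns a new DataFrame. The sketch below uses a two-dog stand-in frame. Later keyword arguments may refer to columns created by earlier ones in the same call.

```python
import pandas as pd

# Two-row stand-in for dogs_df (Bella and Stella)
dogs = pd.DataFrame({" Weight (kg)": [25, 2], "Height(cm)": [56, 18]})

# assign() returns a copy with the extra columns; dogs itself is unchanged
with_bmi = dogs.assign(
    height_m=lambda d: d["Height(cm)"] / 100,
    bmi=lambda d: d[" Weight (kg)"] / d["height_m"] ** 2,
)

print(list(dogs.columns))
print(with_bmi.round(2))
```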


### Multiple manipulations

The real power of pandas comes in when you combine all the skills we've learned so far.  
Let's figure out the names of skinny, tall dogs.

First, to define the skinny dogs, we take the subset of the dogs who have a BMI of under 100.  
Next, we sort the result in descending order of height to get the tallest skinny dogs at the top.  
Finally, we keep only the columns we're interested in.

In [39]:
bmi_lt_100 = dogs_df[dogs_df['bmi'] < 100 ]
bmi_lt_100_height = bmi_lt_100.sort_values('Height(cm)', ascending=False) 
bmi_lt_100_height[['Name ', 'Height(cm)', 'bmi']]

Unnamed: 0,Name,Height(cm),bmi
4,Max,59,83.309394
0,Bella,56,79.719388
3,Cooper,49,70.803832
5,Stella,18,61.728395
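The three steps can also be written as a single chained expression, which avoids the intermediate variables. This is only a stylistic alternative, sketched here on a rebuilt copy of the relevant columns, and it should give the same four dogs in the same order.

```python
import pandas as pd

# Rebuilt copy of the columns used in this example
dogs = pd.DataFrame(
    {
        "Name ": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "Height(cm)": [56, 43, 46, 49, 59, 18, 77],
        " Weight (kg)": [25, 23, 22, 17, 29, 2, 74],
    }
)
dogs["bmi"] = dogs[" Weight (kg)"] / (dogs["Height(cm)"] / 100) ** 2

# Filter, sort, then select columns in one chain
tall_skinny = (
    dogs[dogs["bmi"] < 100]
    .sort_values("Height(cm)", ascending=False)
    .loc[:, ["Name ", "Height(cm)", "bmi"]]
)
print(tall_skinny["Name "].tolist())  # ['Max', 'Bella', 'Cooper', 'Stella']
```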


In [73]:
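# The same filter, sort, and select pattern applied to another dataset
homelessness = pd.read_csv('datasets\\homelessness.csv')

# Individuals per 10,000 state residents
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]

# Keep the states where that rate is above 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort from the highest rate to the lowest
high_homelessness_srt = high_homelessness.sort_values('indiv_per_10k', ascending=False)

# Select only the state name and the rate
result = high_homelessness_srt[['state', 'indiv_per_10k']]

print(result)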

                   state  indiv_per_10k
8   District of Columbia      53.738381
11                Hawaii      29.079406
4             California      27.623825
37                Oregon      26.636307
28                Nevada      23.314189
47            Washington      21.829195
32              New York      20.392363
