### Pandas A-Z

# Pandas Cookbook with Questions and Answers
<img src = "images/logo.jpg" width="100" align="left">

<br><br><br><br>Ram Narasimhan

# About the Cookbook

We will be using the Pandas module extensively from today onwards. I created this "Cookbook" so that new learners can use it as a reference for common tasks.

Just refer to the Table of Contents and click the link to get your answers.

This Notebook is divided into multiple sections by concept.

In [2]:
import numpy as np #a new module
import pandas as pd #we will see this more and more
import matplotlib.pyplot as plt #plotting module
import seaborn as sns # Seaborn for plotting and styling
sns.set(color_codes=True)

#shows the graphs in the Notebook itself. (INLINE)
%matplotlib inline 

### Data Files

For most of the questions, we will be using `example.csv` (in the `data` folder) as the main data file. It has some dummy data to help understand the concepts.

# Table of Contents

## Part 1: Reading, Writing and Creating Data Frame

- **1.1 How to read a CSV file as a Pandas data frame?**
- **1.2 How to quickly create a Data frame and specify the columns?**
- **1.3 How to copy a data frame into another name?**
- **1.4 How to WRITE a data frame to a CSV file?**

## Part 2: Viewing sections of the Data Frame

- **2.1 View the number of rows and columns in a data frame**
- **2.2 View the top few rows, bottom few rows of the data frame**
- **2.3 View the list of columns in a data frame**
- **2.4 View the Index of a data frame**
- **2.5 Describe all the numerical columns of a data frame**

## Part 3: Interacting with Data Frame Columns

- **3.1 View the content of one particular column, if you know its name**
- **3.2 Create a new column, based on some existing columns**
- How to find the data types of each column in a data frame?
- How to find the max() or min() of a column?
- How to count the number of unique values in one column?
- How to count the frequency of *each* unique value in one column?
- How to rename columns?
- How to Drop a column?
- How to rearrange the columns in a data frame?

## Part 4: Working with the Index

- **4.1 How to make one of the columns into the new Index?**
- **4.2 How to print the entire row at a particular value of Index?**
- **4.3 How to specify the column to be the index, when reading in a CSV file?**

## Part 1: Reading, Writing and Creating Data Frame

In [9]:
#- **1.1 How to read a CSV file as a Pandas data frame?**
df = pd.read_csv("data/example.csv")

In [10]:
df

Unnamed: 0,Date,Time,Col One,Temp,Names
0,2/3/2019,9:01,Zero,21.9,John
1,2/3/2019,13:22,One,34.2,Mary
2,2/3/2019,14:58,Two,28.0,John
3,2/3/2019,17:31,Three,29.4,Peter
4,2/7/2019,8:29,Four,12.0,Ali
5,3/3/2019,11:11,Five,14.5,Baba
6,3/3/2019,13:08,Six,40.9,
7,3/3/2019,9:01,Seven,35.5,
8,3/3/2019,7:33,Eight,23.2,Mariam
9,3/3/2019,14:29,Nine,38.6,Merry
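
If you do not have `data/example.csv` at hand, here is a minimal sketch that reads the same kind of data from an in-memory string. The three rows are copied from the output above; `csv_text` and `sample` are names made up for this illustration.

```python
import io
import pandas as pd

# A few rows in the same layout as example.csv (illustration only).
csv_text = """Date,Time,Col One,Temp,Names
2/3/2019,9:01,Zero,21.9,John
2/3/2019,13:22,One,34.2,Mary
3/3/2019,13:08,Six,40.9,
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                      # (3, 5)
print(sample['Names'].isna().tolist())   # [False, False, True]
```

Note that the empty `Names` field in the last row is read as `NaN`, which is what we see in rows 6 and 7 of the real file.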


In [7]:
#- **1.2 How to quickly create a Data frame and specify the columns?**
df2 = pd.DataFrame({'col1':range(10), 'col2': np.random.randint(10, 100, size=10)})

In [8]:
df2.head()

Unnamed: 0,col1,col2
0,0,30
1,1,42
2,2,54
3,3,13
4,4,44


In [11]:
#- **1.3 How to copy a data frame into another name?**
dfcopy = df.copy()
#If you plan on modifying an extracted data frame, it's a good idea to make a copy.
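
To see why the copy matters, here is a small sketch contrasting a plain assignment (a second name for the same object) with `.copy()`. The frame and the variable names are made up for this illustration.

```python
import pandas as pd

original = pd.DataFrame({'Temp': [21.9, 34.2, 28.0]})

alias = original                 # same object, just a second name
independent = original.copy()    # a new object with its own data

independent.loc[0, 'Temp'] = -1.0    # changes only the copy

print(alias is original)             # True
print(original.loc[0, 'Temp'])       # 21.9
print(independent.loc[0, 'Temp'])    # -1.0
```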

In [13]:
#- **1.4 How to WRITE a data frame to a CSV file?**
df2.to_csv("data/newdatafile.csv") #will end up in the data directory under the given name.
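
By default `to_csv()` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch of the difference, writing to in-memory buffers instead of the `data` folder (the buffer and variable names are made up for this illustration):

```python
import io
import pandas as pd

frame = pd.DataFrame({'col1': [0, 1, 2], 'col2': [30, 42, 54]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)       # ['Unnamed: 0', 'col1', 'col2']

# index=False leaves the index out of the file.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)    # ['col1', 'col2']
```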

## Part 2: Viewing sections of the Data Frame


In [16]:
#- **2.1 View the number of rows and columns in a data frame**
df.shape # note that there are no parentheses: shape is an attribute, not a method, so it takes no arguments

(10, 5)

In [18]:
#- **2.2 View the top few rows, bottom few rows of the data frame**
df.head()

Unnamed: 0,Date,Time,Col One,Temp,Names
0,2/3/2019,9:01,Zero,21.9,John
1,2/3/2019,13:22,One,34.2,Mary
2,2/3/2019,14:58,Two,28.0,John
3,2/3/2019,17:31,Three,29.4,Peter
4,2/7/2019,8:29,Four,12.0,Ali


In [19]:
#- **2.2 View the top few rows, bottom few rows of the data frame**
df.tail()

Unnamed: 0,Date,Time,Col One,Temp,Names
5,3/3/2019,11:11,Five,14.5,Baba
6,3/3/2019,13:08,Six,40.9,
7,3/3/2019,9:01,Seven,35.5,
8,3/3/2019,7:33,Eight,23.2,Mariam
9,3/3/2019,14:29,Nine,38.6,Merry


In [20]:
#- **2.3 View the list of columns in a data frame**
df.columns

Index(['Date', 'Time', 'Col One', 'Temp', 'Names'], dtype='object')

In [22]:
#to get a single column name, index into df.columns by position (counting from 0)
df.columns[2]

'Col One'

In [25]:
#- **View the content of one particular column, if you know its name** (covered again in 3.1)
df['Names']

0      John
1      Mary
2      John
3     Peter
4       Ali
5      Baba
6       NaN
7       NaN
8    Mariam
9     Merry
Name: Names, dtype: object

In [26]:
#- **2.4 View the Index of a data frame**
df.index

RangeIndex(start=0, stop=10, step=1)
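
The Table of Contents also lists describing the numerical columns (2.5). Here is a minimal sketch of `describe()`, using a small made-up frame rather than `example.csv`:

```python
import pandas as pd

temps = pd.DataFrame({
    'Temp': [21.9, 34.2, 28.0, 29.4],
    'Names': ['John', 'Mary', 'John', 'Peter'],
})

# describe() summarizes the numeric columns by default:
# count, mean, std, min, quartiles and max.
summary = temps.describe()
print(summary)

# The result is itself a data frame, so single values can be looked up.
print(summary.loc['count', 'Temp'])   # 4.0
```

The text column `Names` is left out of the summary because it is not numeric.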

## Part 3: Interacting with Data Frame Columns

- **3.1 View the content of one particular column, if you know its name**

In [30]:
df['Col One']

0     Zero
1     One 
2      Two
3    Three
4     Four
5     Five
6      Six
7    Seven
8    Eight
9     Nine
Name: Col One, dtype: object

- **3.2 Create a new column, based on some existing columns**

In [36]:
df['New Column'] = df['Col One'] + " " +  df['Names']
df['IntegerTemp'] = df['Temp'].astype(int)
df

Unnamed: 0,Date,Time,Col One,Temp,Names,New Column,IntegerTemp
0,2/3/2019,9:01,Zero,21.9,John,Zero John,21
1,2/3/2019,13:22,One,34.2,Mary,One Mary,34
2,2/3/2019,14:58,Two,28.0,John,Two John,28
3,2/3/2019,17:31,Three,29.4,Peter,Three Peter,29
4,2/7/2019,8:29,Four,12.0,Ali,Four Ali,12
5,3/3/2019,11:11,Five,14.5,Baba,Five Baba,14
6,3/3/2019,13:08,Six,40.9,,,40
7,3/3/2019,9:01,Seven,35.5,,,35
8,3/3/2019,7:33,Eight,23.2,Mariam,Eight Mariam,23
9,3/3/2019,14:29,Nine,38.6,Merry,Nine Merry,38
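
Notice that rows 6 and 7 have an empty `New Column`: adding a string to a missing value gives a missing value. If that is not what you want, one possible approach (a sketch on made-up data, not the only way) is to fill the missing names first:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Col One': ['Five', 'Six'],
    'Names': ['Baba', np.nan],
})

# Plain concatenation: a NaN operand makes the whole result NaN.
people['Plain'] = people['Col One'] + " " + people['Names']

# fillna('') replaces the missing name with an empty string first;
# str.strip() then removes the trailing space that is left behind.
people['Filled'] = (people['Col One'] + " " + people['Names'].fillna('')).str.strip()

print(people[['Plain', 'Filled']])
```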


- How to find the data types of each column in a data frame?

In [38]:
df.dtypes # will print the data type of each column in your data frame. (Object usually means a string type)

Date            object
Time            object
Col One         object
Temp           float64
Names           object
New Column      object
IntegerTemp      int32
dtype: object

- How to find the max() or min() of a column?

In [46]:
df['Temp'].min(), df['IntegerTemp'].max()

(12.0, 40)
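
Often we also want to know which row holds the maximum. A short sketch using `idxmax()`, on a small made-up frame:

```python
import pandas as pd

readings = pd.DataFrame({
    'Time': ['9:01', '13:22', '13:08'],
    'Temp': [21.9, 34.2, 40.9],
})

hottest_label = readings['Temp'].idxmax()   # index label of the maximum value
hottest_row = readings.loc[hottest_label]   # the full row, as a Series

print(hottest_label)         # 2
print(hottest_row['Time'])   # 13:08
```

`idxmin()` works the same way for the minimum.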

### 3.4 How to count the number of unique values in one column?

In [49]:
df['Date'].unique()
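# unique() returns the distinct values themselves; there are 3 different dates in the data frame
# df['Date'].nunique() would return the count (3) directly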

array(['2/3/2019', '2/7/2019', '3/3/2019'], dtype=object)
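
A small sketch of the difference between listing the unique values and counting them. The series is made up and includes one missing value on purpose:

```python
import pandas as pd

dates = pd.Series(['2/3/2019', '2/3/2019', '2/7/2019', '3/3/2019', None])

n_listed = len(dates.unique())            # unique() keeps the missing value
n_counted = dates.nunique()               # nunique() ignores missing values by default
n_with_missing = dates.nunique(dropna=False)

print(n_listed, n_counted, n_with_missing)   # 4 3 4
```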

### 3.5 How to count the frequency of *each* unique value in one column?

In [50]:
df['Date'].value_counts()

3/3/2019    5
2/3/2019    4
2/7/2019    1
Name: Date, dtype: int64
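
`value_counts()` skips missing values unless told otherwise, and it can also report shares instead of counts. A minimal sketch on a made-up series:

```python
import pandas as pd

names = pd.Series(['John', 'Mary', 'John', None])

counts = names.value_counts()                          # missing value not counted
counts_with_missing = names.value_counts(dropna=False) # missing value gets its own row
shares = names.value_counts(normalize=True)            # fractions that add up to 1

print(counts['John'])               # 2
print(counts_with_missing.sum())    # 4
print(shares)
```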

### 3.6 How to rename columns?

In [55]:
df.rename(columns={'Col One': 'NewCol1'}, inplace = True)
df

Unnamed: 0,Date,Time,NewCol1,Temp,Names,New Column,IntegerTemp
0,2/3/2019,9:01,Zero,21.9,John,Zero John,21
1,2/3/2019,13:22,One,34.2,Mary,One Mary,34
2,2/3/2019,14:58,Two,28.0,John,Two John,28
3,2/3/2019,17:31,Three,29.4,Peter,Three Peter,29
4,2/7/2019,8:29,Four,12.0,Ali,Four Ali,12
5,3/3/2019,11:11,Five,14.5,Baba,Five Baba,14
6,3/3/2019,13:08,Six,40.9,,,40
7,3/3/2019,9:01,Seven,35.5,,,35
8,3/3/2019,7:33,Eight,23.2,Mariam,Eight Mariam,23
9,3/3/2019,14:29,Nine,38.6,Merry,Nine Merry,38


In [65]:
type(df['IntegerTemp'])

pandas.core.series.Series

### 3.x How to drop a column?


In [70]:
df.drop('Names', axis=1)
#df.drop('Names', axis=1, inplace=True) # with inplace=True, df itself is modified and the column is gone.
#Without it, drop() returns a new data frame and leaves df unchanged.


Unnamed: 0,Time,Date,NewCol1,Temp,IntegerTemp
0,9:01,2/3/2019,Zero,21.9,21
1,13:22,2/3/2019,One,34.2,34
2,14:58,2/3/2019,Two,28.0,28
3,17:31,2/3/2019,Three,29.4,29
4,8:29,2/7/2019,Four,12.0,12
5,11:11,3/3/2019,Five,14.5,14
6,13:08,3/3/2019,Six,40.9,40
7,9:01,3/3/2019,Seven,35.5,35
8,7:33,3/3/2019,Eight,23.2,23
9,14:29,3/3/2019,Nine,38.6,38
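
A short sketch showing that `drop()` without `inplace=True` leaves the original untouched. It uses the `columns=` keyword, which is an alternative spelling of `axis=1`; the frame is made up for this illustration.

```python
import pandas as pd

table = pd.DataFrame({'Temp': [21.9, 34.2], 'Names': ['John', 'Mary']})

trimmed = table.drop(columns=['Names'])   # same effect as table.drop('Names', axis=1)

print(table.columns.tolist())     # ['Temp', 'Names']
print(trimmed.columns.tolist())   # ['Temp']
```

Assigning the result to a new name (or back to the same name) is a common alternative to `inplace=True`.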


### 3.x How to rearrange the columns in a data frame?

In [68]:
neworder  = ['Time', 'Date', 'NewCol1', 'Temp', 'Names', 'IntegerTemp']
df = df[neworder]
df

Unnamed: 0,Time,Date,NewCol1,Temp,Names,IntegerTemp
0,9:01,2/3/2019,Zero,21.9,John,21
1,13:22,2/3/2019,One,34.2,Mary,34
2,14:58,2/3/2019,Two,28.0,John,28
3,17:31,2/3/2019,Three,29.4,Peter,29
4,8:29,2/7/2019,Four,12.0,Ali,12
5,11:11,3/3/2019,Five,14.5,Baba,14
6,13:08,3/3/2019,Six,40.9,,40
7,9:01,3/3/2019,Seven,35.5,,35
8,7:33,3/3/2019,Eight,23.2,Mariam,23
9,14:29,3/3/2019,Nine,38.6,Merry,38
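
Typing out every column name gets tedious for wide data frames. One possible shortcut, sketched here on a made-up frame, is to name only the columns that should come first and let a list comprehension supply the rest:

```python
import pandas as pd

table = pd.DataFrame({'Date': ['2/3/2019'], 'Time': ['9:01'], 'Temp': [21.9]})

# Put 'Temp' first and keep the remaining columns in their current order.
front = ['Temp']
new_order = front + [c for c in table.columns if c not in front]
table = table[new_order]

print(table.columns.tolist())   # ['Temp', 'Date', 'Time']
```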


## Part 4: Working with the Data Frame Index

### 4.1 Make one of the columns into the new Index

In [61]:
df.set_index('Names')
#df.set_index('Names', inplace = True) #this would make the change permanent

Names,Date,Time,NewCol1,Temp,New Column,IntegerTemp
John,2/3/2019,9:01,Zero,21.9,Zero John,21
Mary,2/3/2019,13:22,One,34.2,One Mary,34
John,2/3/2019,14:58,Two,28.0,Two John,28
Peter,2/3/2019,17:31,Three,29.4,Three Peter,29
Ali,2/7/2019,8:29,Four,12.0,Four Ali,12
Baba,3/3/2019,11:11,Five,14.5,Five Baba,14
,3/3/2019,13:08,Six,40.9,,40
,3/3/2019,9:01,Seven,35.5,,35
Mariam,3/3/2019,7:33,Eight,23.2,Eight Mariam,23
Merry,3/3/2019,14:29,Nine,38.6,Nine Merry,38
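
Once a column is the index, `.loc[]` prints the rows at a particular index value. A minimal sketch on made-up data (the names `people` and `by_name` are for illustration only):

```python
import pandas as pd

people = pd.DataFrame({
    'Names': ['John', 'Mary', 'John', 'Peter'],
    'Temp': [21.9, 34.2, 28.0, 29.4],
})

by_name = people.set_index('Names')   # returns a new frame; people is unchanged

johns = by_name.loc['John']     # 'John' appears twice, so two rows come back
mary = by_name.loc[['Mary']]    # a list of labels always gives a data frame

print(johns)
print(mary)

# reset_index() turns the index back into an ordinary column.
print(by_name.reset_index().columns.tolist())   # ['Names', 'Temp']
```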


### 4.2 Print the rows that have the index values of 5 and 6

In [62]:
df.iloc[5:7] #iloc[] selects by integer position (starting from 0, end excluded). With the default index, position and index value coincide; df.loc[5:6] selects by index value.

Unnamed: 0,Date,Time,NewCol1,Temp,Names,New Column,IntegerTemp
5,3/3/2019,11:11,Five,14.5,Baba,Five Baba,14
6,3/3/2019,13:08,Six,40.9,,,40
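
Position and index value only coincide for the default index. Here is a small sketch, with a made-up index of 10, 20, 30, to show the difference between `iloc[]` and `loc[]`:

```python
import pandas as pd

frame = pd.DataFrame({'Temp': [14.5, 40.9, 35.5]}, index=[10, 20, 30])

by_position = frame.iloc[0:2]   # positions 0 and 1; the end position is excluded
by_label = frame.loc[10:20]     # index values 10 to 20; the end label is included

print(by_position.index.tolist())   # [10, 20]
print(by_label.index.tolist())      # [10, 20]

print(frame.iloc[1]['Temp'])        # 40.9 (second row)
print(frame.loc[20, 'Temp'])        # 40.9 (row whose index value is 20)
```

Here `frame.loc[1]` would raise a `KeyError`, because no row has the index value 1.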


### 4.3 Specify a column to be the index, when reading the CSV file

In [80]:
pd.read_csv("data/example.csv", index_col=['Names'])
#dfi = pd.read_csv("data/example.csv", index_col=['Names'])

Names,Date,Time,Col One,Temp
John,2/3/2019,9:01,Zero,21.9
Mary,2/3/2019,13:22,One,34.2
John,2/3/2019,14:58,Two,28.0
Peter,2/3/2019,17:31,Three,29.4
Ali,2/7/2019,8:29,Four,12.0
Baba,3/3/2019,11:11,Five,14.5
,3/3/2019,13:08,Six,40.9
,3/3/2019,9:01,Seven,35.5
Mariam,3/3/2019,7:33,Eight,23.2
Merry,3/3/2019,14:29,Nine,38.6
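
The same idea as a self-contained sketch, reading from an in-memory string so that it runs without the data file. The two rows are made up, and `index_col` is given here as a single column name rather than a list:

```python
import io
import pandas as pd

csv_text = """Date,Temp,Names
2/3/2019,21.9,John
2/3/2019,34.2,Mary
"""

indexed = pd.read_csv(io.StringIO(csv_text), index_col='Names')

print(indexed.index.name)             # Names
print(indexed.columns.tolist())       # ['Date', 'Temp']
print(indexed.loc['Mary', 'Temp'])    # 34.2
```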


## Part 5: Plotting with Pandas



In [None]:
# assumes df has numeric columns named 'x' and 'y' (example.csv has no such columns)
mpl_plot = plt.plot(df.x, df.y, 'k')
plt.title("matplotlib xy line plot")


### 1.2 Create the same plot using Pandas

In [None]:
pandas_plot = df.plot('x', 'y', color='k')
plt.title("Pandas Plot");


### 1.3 Create A Seaborn Plot

In [None]:
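seaborn_plot = sns.FacetGrid(df, height=3, aspect=2) #get the basic grid ready. Things can be 'mapped' onto this grid. (height was called size before seaborn 0.9)
seaborn_plot.map(plt.scatter, "x", "y")
seaborn_plot.map(plt.plot, "x", "y")
plt.title("Seaborn Plot"); #sns.plt is no longer available in recent seaborn versions, so use plt directly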

# Part 2: Save all 3 plots

### 2.1 Save a `matplotlib` Plot

In [None]:
mpl_plot = plt.plot(df.x, df.y, 'k')
plt.title("matplotlib xy line plot")
plt.savefig("output/matplotlib_plot.png")
plt.close() #don't display the image here

In [98]:
# This is a bit more complicated. If you have the mpl plot stored as a variable, you have to first get the figure and then save it.

for p in mpl_plot: #this is needed because mpl_plot is a list
    p.get_figure().savefig("output/matplotlib_plot.png")

### 2.2 Save a Pandas plot

If the Pandas plot has been stored in a variable, we first need to call `get_figure()` on it; only then will `savefig()` do the job.

Gory Details: This is because `df.plot()` returns an `AxesSubplot` type of object, which has no `savefig` method of its own. We first get the parent figure of the `AxesSubplot`, and then we can easily save it using `savefig`.

In [83]:
fig = pandas_plot.get_figure()
fig.savefig("output/Pandas.png")

In [85]:
#for the really curious
type(pandas_plot)

matplotlib.axes._subplots.AxesSubplot

### 2.3 Save a `Seaborn` Plot

In [86]:
seaborn_plot.savefig("output/seaborn_line_plot.png")

# Take Away

If you have been paying attention, you will have noticed the function common to all 3 types of plots. To save the figures, we just use one command: `savefig()`.

It is just that we had to make sure that we had the `Figure` object identified, before we could use the `savefig()` method on it.
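
To make the point concrete, here is a self-contained sketch that saves a matplotlib plot and a Pandas plot through their `Figure` objects. The x/y data is made up, the files go to a temporary folder instead of `output/`, and the `Agg` backend is chosen only so the code runs without a display; Seaborn is left out to keep the example small.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up x/y data standing in for the plotted data frame.
points = pd.DataFrame({'x': [0, 1, 2, 3], 'y': [0, 1, 4, 9]})
out_dir = tempfile.mkdtemp()   # temporary folder used for this illustration

# matplotlib: plt.plot() returns a list of line objects.
lines = plt.plot(points['x'], points['y'], 'k')
mpl_path = os.path.join(out_dir, "matplotlib_plot.png")
lines[0].get_figure().savefig(mpl_path)   # line -> Figure -> savefig
plt.close()

# Pandas: DataFrame.plot() returns an Axes object.
ax = points.plot(x='x', y='y', color='k')
pandas_path = os.path.join(out_dir, "pandas_plot.png")
ax.get_figure().savefig(pandas_path)      # Axes -> Figure -> savefig
plt.close(ax.get_figure())

print(os.path.exists(mpl_path), os.path.exists(pandas_path))   # True True
```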

![Questions](images/questions.png)