# Module 4 Code Examples

## Intro to Pandas

In [2]:
import pandas as pd #standard pandas import
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Optional settings that change how data is displayed. Setting an option to None removes its limit, so all of the data is shown. Setting max_columns to None is usually safer than doing the same for max_rows: a DataFrame with millions of rows would be printed in full!

In [3]:
# pd.options.display.max_columns = None
# pd.options.display.max_rows = None
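If you only want the wider display for one cell, a temporary setting may be more convenient than changing the global option. The sketch below uses `pd.option_context`; the `wide_df` frame is made-up illustrative data, not part of this module.

```python
import pandas as pd

# Hypothetical wide frame with 30 columns, used only for illustration
wide_df = pd.DataFrame({f"col{i}": [i, i + 1] for i in range(30)})

default_max_columns = pd.get_option("display.max_columns")

# Inside the block no column limit applies; the old value is restored on exit
with pd.option_context("display.max_columns", None):
    inside_setting = pd.get_option("display.max_columns")
    print(wide_df)

after_setting = pd.get_option("display.max_columns")
```

After the `with` block ends, the display option is back to whatever it was before.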

Building a DataFrame from scratch

In [13]:
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df
#When a DataFrame is created without an explicit index, each row is assigned a default integer index (0 and 1 below)

Unnamed: 0,col1,col2
0,1,3
1,2,4
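As a contrast to the default numeric index, here is a minimal sketch of supplying row labels through the `index` parameter. The labels `row_a` and `row_b` are arbitrary names chosen for illustration.

```python
import pandas as pd

# Same data as above, but with explicit row labels (hypothetical names)
df_labeled = pd.DataFrame(
    {"col1": [1, 2], "col2": [3, 4]},
    index=["row_a", "row_b"],
)

# Rows can now be looked up by label
value = df_labeled.loc["row_a", "col2"]

# Without the index argument pandas falls back to 0, 1, ...
default_index = pd.DataFrame({"col1": [1, 2]}).index
```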


A Series is a single column of a DataFrame. It can be created from scratch in much the same way, or by selecting a particular column of a DataFrame.

In [14]:
pd.Series(data = [1,2,3]) #Created without specifying an index

pd.Series(data = [1,2,3],index=['a','b','c']) #Created with a specific index

type(df['col1'])

0    1
1    2
2    3
dtype: int64

a    1
b    2
c    3
dtype: int64

pandas.core.series.Series
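One detail worth checking for yourself: selecting a column with single brackets gives a Series, while double brackets give a one-column DataFrame. A small sketch, rebuilding the same two-column frame so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

as_series = df["col1"]    # single brackets: a Series
as_frame = df[["col1"]]   # double brackets (a list of names): a DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```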

Reading in data from a CSV file into a DataFrame

In [6]:
iris_df = pd.read_csv('iris.csv')
type(iris_df)

pandas.core.frame.DataFrame
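If you do not have iris.csv at hand, `pd.read_csv` can also read from an in-memory text buffer, which makes it easy to experiment. The sketch below uses a few illustrative lines laid out like the iris file; the `usecols` call shows one way to load only some of the columns.

```python
import io
import pandas as pd

# Illustrative CSV text (a stand-in for a file on disk)
csv_text = """sepal.length,sepal.width,variety
5.1,3.5,Setosa
7.0,3.2,Versicolor
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

# usecols limits which columns are loaded
subset_df = pd.read_csv(io.StringIO(csv_text), usecols=["sepal.length", "variety"])
```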

### Taking a look at some important attributes

In [7]:
iris_df.shape # Returns the dimensions of the dataframe as num_rows, num_columns
iris_df.columns #Returns the names of all of the columns
iris_df.dtypes #Returns what type of datatype is found in each column

(150, 5)

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')

sepal.length    float64
sepal.width     float64
petal.length    float64
petal.width     float64
variety          object
dtype: object

### Accessing Data

In [8]:
#By columns
iris_df[['sepal.length']] #see just one column as a one-column dataframe (single brackets, iris_df['sepal.length'], would return a series)
iris_df[['petal.length','petal.width']] #dataframe of a subset of columns

#By rows/cells
iris_df['variety'][0] #access a particular cell, in this case the first cell in the variety column
iris_df[0:1] #slices the dataframe by rows as starting_index:ending_index; the row at the ending index is not included
iris_df['variety'][0:3] # can use slices to view cells in a particular column


iris_df #the actions above do not change the original dataframe

Unnamed: 0,sepal.length
0,5.1
1,4.9
2,4.7
3,4.6
4,5.0
...,...
145,6.7
146,6.3
147,6.5
148,6.2


Unnamed: 0,petal.length,petal.width
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2
...,...,...
145,5.2,2.3
146,5.0,1.9
147,5.2,2.0
148,5.4,2.3


'Setosa'

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa


0    Setosa
1    Setosa
2    Setosa
Name: variety, dtype: object

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
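The same lookups can also be written with the `.loc` (label-based) and `.iloc` (position-based) indexers, which many people find more explicit than chaining brackets. This is a sketch on a small stand-in frame rather than the full iris data.

```python
import pandas as pd

# Small stand-in for iris_df (first three rows, two columns)
sample_df = pd.DataFrame({
    "sepal.length": [5.1, 4.9, 4.7],
    "variety": ["Setosa", "Setosa", "Setosa"],
})

first_variety = sample_df.loc[0, "variety"]  # row label 0, column name
first_cell = sample_df.iloc[0, 1]            # row position 0, column position 1

first_row = sample_df.iloc[0:1]  # position slice: end is excluded (1 row)
loc_rows = sample_df.loc[0:1]    # label slice: end is included (2 rows)
```

Note the difference in the last two lines: position slices exclude the end, label slices include it.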


### Previewing the data frame

In [9]:
iris_df.info() #data types and number of non-null values
iris_df.describe() #summary statistics of numeric columns
iris_df['variety'].value_counts() #count of each value. Can be used for the full dataframe, but often more useful to look at one column at a time

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal.length  150 non-null    float64
 1   sepal.width   150 non-null    float64
 2   petal.length  150 non-null    float64
 3   petal.width   150 non-null    float64
 4   variety       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


variety
Setosa        50
Versicolor    50
Virginica     50
Name: count, dtype: int64
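When proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a short made-up Series (not the iris column itself):

```python
import pandas as pd

# Hypothetical short column of labels
varieties = pd.Series(
    ["Setosa", "Setosa", "Versicolor", "Virginica"], name="variety"
)

counts = varieties.value_counts()                # raw counts per value
shares = varieties.value_counts(normalize=True)  # fractions that sum to 1
```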

### Looking at summary statistics by variety

In [10]:
iris_df[iris_df['variety']=='Setosa'].describe() #this returns summary statistics only on those rows where the variety is setosa
iris_df[iris_df['variety']=='Versicolor'].describe() 
iris_df[iris_df['variety']=='Virginica'].describe() 

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
count,50.0,50.0,50.0,50.0
mean,5.006,3.428,1.462,0.246
std,0.35249,0.379064,0.173664,0.105386
min,4.3,2.3,1.0,0.1
25%,4.8,3.2,1.4,0.2
50%,5.0,3.4,1.5,0.2
75%,5.2,3.675,1.575,0.3
max,5.8,4.4,1.9,0.6


Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
count,50.0,50.0,50.0,50.0
mean,5.936,2.77,4.26,1.326
std,0.516171,0.313798,0.469911,0.197753
min,4.9,2.0,3.0,1.0
25%,5.6,2.525,4.0,1.2
50%,5.9,2.8,4.35,1.3
75%,6.3,3.0,4.6,1.5
max,7.0,3.4,5.1,1.8


Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
count,50.0,50.0,50.0,50.0
mean,6.588,2.974,5.552,2.026
std,0.63588,0.322497,0.551895,0.27465
min,4.9,2.2,4.5,1.4
25%,6.225,2.8,5.1,1.8
50%,6.5,3.0,5.55,2.0
75%,6.9,3.175,5.875,2.3
max,7.9,3.8,6.9,2.5
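Filtering once per variety works, but `groupby` can produce per-group summaries in a single call. The sketch below is one possible alternative, shown on a small stand-in frame with made-up rows so that it runs without iris.csv.

```python
import pandas as pd

# Small stand-in frame: two illustrative rows per variety
sample_df = pd.DataFrame({
    "petal.length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "variety": ["Setosa", "Setosa", "Versicolor",
                "Versicolor", "Virginica", "Virginica"],
})

# One row of summary statistics per variety
by_variety = sample_df.groupby("variety")["petal.length"].describe()

# A single statistic per variety
means = sample_df.groupby("variety")["petal.length"].mean()
```

On the real data, `iris_df.groupby('variety').describe()` should give the three tables above in one result.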


### Additional examples of functions that can be used to explore data

In [11]:
iris_df[(iris_df['variety']=='Virginica') & (iris_df['petal.width']>2.3)] #example of filtering on multiple conditions
iris_df.sort_values('sepal.width', ascending=False) #sorts by the values in the specified column. The default order is low to high; pass ascending=False to reverse it.
iris_df.sort_values(by=['sepal.width','sepal.length']) #how to sort by multiple columns at a time

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
100,6.3,3.3,6.0,2.5,Virginica
109,7.2,3.6,6.1,2.5,Virginica
114,5.8,2.8,5.1,2.4,Virginica
136,6.3,3.4,5.6,2.4,Virginica
140,6.7,3.1,5.6,2.4,Virginica
144,6.7,3.3,5.7,2.5,Virginica


Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
15,5.7,4.4,1.5,0.4,Setosa
33,5.5,4.2,1.4,0.2,Setosa
32,5.2,4.1,1.5,0.1,Setosa
14,5.8,4.0,1.2,0.2,Setosa
16,5.4,3.9,1.3,0.4,Setosa
...,...,...,...,...,...
87,6.3,2.3,4.4,1.3,Versicolor
62,6.0,2.2,4.0,1.0,Versicolor
68,6.2,2.2,4.5,1.5,Versicolor
119,6.0,2.2,5.0,1.5,Virginica


Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
60,5.0,2.0,3.5,1.0,Versicolor
62,6.0,2.2,4.0,1.0,Versicolor
119,6.0,2.2,5.0,1.5,Virginica
68,6.2,2.2,4.5,1.5,Versicolor
41,4.5,2.3,1.3,0.3,Setosa
...,...,...,...,...,...
16,5.4,3.9,1.3,0.4,Setosa
14,5.8,4.0,1.2,0.2,Setosa
32,5.2,4.1,1.5,0.1,Setosa
33,5.5,4.2,1.4,0.2,Setosa
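Two variations on the examples above, sketched on a small made-up frame: combining conditions with `|` (or) instead of `&` (and), and giving each sort column its own direction by passing a list to `ascending`.

```python
import pandas as pd

# Hypothetical rows, used only for illustration
sample_df = pd.DataFrame({
    "sepal.length": [6.0, 6.2, 5.0, 6.0],
    "sepal.width": [2.2, 2.2, 2.0, 3.0],
    "variety": ["Versicolor", "Versicolor", "Versicolor", "Virginica"],
})

# Rows that are Virginica OR have a sepal width below 2.1
either = sample_df[
    (sample_df["variety"] == "Virginica") | (sample_df["sepal.width"] < 2.1)
]

# sepal.width low to high, ties broken by sepal.length high to low
ordered = sample_df.sort_values(
    by=["sepal.width", "sepal.length"], ascending=[True, False]
)
```

Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons such as `==` and `<`.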
