# The Data Science Way - CRISP-DM

![](https://www.kdnuggets.com/wp-content/uploads/crisp-dm-4-problems-fig1.png)

## What is Pandas?

Pandas, as [the Anaconda docs](https://docs.anaconda.com/anaconda/packages/py3.7_osx-64/) tell us, offers us "High-performance, easy-to-use data structures and data analysis tools." It's something like "Excel for Python", but it's quite a bit more powerful.

Let's first import pandas as pd.

In [1]:
import pandas as pd

Now read in the heart dataset.

Pandas has many methods for reading different types of files! Note that here we have a .csv file.

Read about this dataset [here](https://www.kaggle.com/ronitf/heart-disease-uci).

Notice the name of the last column!

In [2]:
df = pd.read_csv('heart.csv')

We can import data from other locations like: 
* **Locally** - /Users/amberyandow/Downloads/data.csv
* **Remotely** - http://bit.ly/drinksbycountry

_Let's look at the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)_
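As a minimal sketch of the same call with a different source: `pd.read_csv` also accepts a file-like object, so an in-memory string can stand in for a local path or a URL. The sample text and the names `csv_text` and `sample` are made up for illustration and are not part of the heart dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk or at a URL.
csv_text = "age,sex,chol\n63,1,233\n37,1,250\n41,0,204\n"

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # column names taken from the header line
```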

The output of the .read_csv() function is a pandas DataFrame, which has a familiar tabular structure of rows and columns.

In [3]:
df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


The two main types of pandas objects are the DataFrame and the Series. A Series is a single column *plus the index*. An **index** is like an address: it is how any data point in a DataFrame or Series can be accessed. Rows and columns both have indexes. The row labels are simply called the index, and the column labels are the column names.
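To make the distinction concrete, here is a small sketch using a made-up three-row frame (the name `small` and its values are illustrative, not the heart data). It shows that selecting one column gives a Series that carries the same row index as its parent DataFrame.

```python
import pandas as pd

# A tiny made-up frame; `small` is a hypothetical name for illustration.
small = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204]})

ages = small["age"]                     # selecting one column gives a Series
print(type(small).__name__)             # DataFrame
print(type(ages).__name__)              # Series

print(list(small.index))                # row labels
print(list(small.columns))              # column labels
print(ages.index.equals(small.index))   # the Series keeps the row index
```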

Now, these column names just won't do... Let's change them!<br/>
_Note:_ By convention, column names should **NOT** contain any spaces and should be lowercase.

In [4]:
df = df.rename(columns={"trestbps":"rest_bp", "thalach":"max_hr"})
df

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
9,57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


In [None]:
#replace spaces with underscores
#df.columns = df.columns.str.replace(' ', '_')
#df.columns = df.columns.str.lower()

How would we lowercase all of our column names? 
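One possible approach, sketched on a made-up frame with deliberately messy headers (`messy` and its column names are hypothetical), is to chain the string methods available on `df.columns`:

```python
import pandas as pd

# Hypothetical frame whose column names contain spaces and capitals.
messy = pd.DataFrame({"Rest BP": [145, 130], "Max HR": [150, 187]})

# Replace spaces with underscores, then lowercase every column name.
messy.columns = messy.columns.str.replace(" ", "_").str.lower()

print(list(messy.columns))  # ['rest_bp', 'max_hr']
```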

## Methods for Learning more about the data

What does .head( ) do? What do you learn about the dataset by using it here?

In [5]:
df.head()

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


What about .tail( )? What about .info( ) and .describe( ) and .shape?

In [6]:
df.tail()

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age        303 non-null int64
sex        303 non-null int64
cp         303 non-null int64
rest_bp    303 non-null int64
chol       303 non-null int64
fbs        303 non-null int64
restecg    303 non-null int64
max_hr     303 non-null int64
exang      303 non-null int64
oldpeak    303 non-null float64
slope      303 non-null int64
ca         303 non-null int64
thal       303 non-null int64
target     303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB


In [8]:
df.describe()

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [9]:
df.shape

(303, 14)

## Combining and Adding - DataFrames

Here are two rows that need to be added to the dataframe: What does this look like? 

In [11]:
extra_rows = {'age': [40, 30], 'sex': [1, 0], 'cp': [0, 0], 'rest_bp': [120, 130],
              'chol': [240, 200],
              'fbs': [0, 0], 'restecg': [1, 0], 'max_hr': [120, 122], 'exang': [0, 1],
              'oldpeak': [0.1, 1.0], 'slope': [1, 1], 'ca': [0, 1], 'thal': [2, 3],
              'target': [0, 0]}
extra_rows

{'age': [40, 30],
 'sex': [1, 0],
 'cp': [0, 0],
 'rest_bp': [120, 130],
 'chol': [240, 200],
 'fbs': [0, 0],
 'restecg': [1, 0],
 'max_hr': [120, 122],
 'exang': [0, 1],
 'oldpeak': [0.1, 1.0],
 'slope': [1, 1],
 'ca': [0, 1],
 'thal': [2, 3],
 'target': [0, 0]}

**How can we add this to the bottom of our dataset?**

In [12]:
# Let's first turn this into a DataFrame.
# We can use the .from_dict() method.

# from_dict is a classmethod, so call it on the class rather than on an empty instance.
extras = pd.DataFrame.from_dict(extra_rows)

In [13]:
extras

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target
0,40,1,0,120,240,0,1,120,0,0.1,1,0,2,0
1,30,0,0,130,200,0,0,122,1,1.0,1,1,3,0


In [14]:
# Now we just need to concatenate the two DataFrames together.
# Note the `ignore_index` parameter! We'll set that to True.

df_augmented = pd.concat([df, extras], ignore_index=True)

**Why did we need to ignore the index above?**
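A small sketch of what happens with and without the parameter, using two made-up frames (`top` and `bottom` are hypothetical names): without `ignore_index=True`, each frame keeps its own row labels, so labels repeat in the result.

```python
import pandas as pd

# Two tiny made-up frames, each with default row labels 0 and 1.
top = pd.DataFrame({"age": [63, 37]})
bottom = pd.DataFrame({"age": [40, 30]})

kept = pd.concat([top, bottom])                      # original labels are kept
fresh = pd.concat([top, bottom], ignore_index=True)  # labels are renumbered

print(list(kept.index))    # [0, 1, 0, 1]
print(list(fresh.index))   # [0, 1, 2, 3]
print(len(kept.loc[0]))    # 2, because label 0 now matches two rows
```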

In [15]:
# Let's check the end to make sure we were successful!

df_augmented.tail()

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0
303,40,1,0,120,240,0,1,120,0,0.1,1,0,2,0
304,30,0,0,130,200,0,0,122,1,1.0,1,1,3,0


**We can see that our target column has a bunch of zeros, but there could be other values in that column. Use .value_counts() to find out!**

In [16]:
df['target'].value_counts()

1    165
0    138
Name: target, dtype: int64
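As a small aside, `value_counts` can also report proportions instead of raw counts through its `normalize` parameter. The sketch below uses a made-up Series (`labels` is a hypothetical name) rather than the heart data.

```python
import pandas as pd

# Made-up target values for illustration.
labels = pd.Series([1, 1, 0, 1])

counts = labels.value_counts()                 # raw counts per value
shares = labels.value_counts(normalize=True)   # fraction of rows per value

print(counts.loc[1], counts.loc[0])   # 3 1
print(shares.loc[1], shares.loc[0])   # 0.75 0.25
```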

When indexing a column you can use brackets OR a period. Which is better?

**The case for bracket notation is simple: It always works.**

Here are the specific cases in which you must use bracket notation, because dot notation would fail:

**If column name includes a space**<br/>
df['col name']

**If column name matches a DataFrame method**<br/>
df['count']

**If column name matches a Python keyword**<br/>
df['class']

**If column name is stored in a variable**<br/>
var = 'col_name'<br/>
df[var]

**If column name is an integer**<br/>
df[0]

**If new column is created through assignment**<br/>
df['new'] = 0



**So why even consider dot notation?**

1. Dot notation is easier to type
2. Dot notation is easier to read
3. Dot notation limits the usage of brackets
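The cases above can be sketched on a made-up frame (`demo` and its columns are hypothetical). Note how a column named `count` is hidden by the DataFrame method of the same name when dot notation is used.

```python
import pandas as pd

# Made-up frame with one plain name, one method name, and one name with a space.
demo = pd.DataFrame({"age": [63, 37], "count": [1, 2], "col name": [5, 6]})

print(demo.age.tolist())         # dot notation works for a simple name
print(demo["age"].tolist())      # brackets give the same column

print(callable(demo.count))      # True: the dot finds the DataFrame.count method
print(demo["count"].tolist())    # brackets return the actual column

print(demo["col name"].tolist()) # a name with a space needs brackets

demo["new"] = 0                  # new columns are created with brackets
print("new" in demo.columns)     # True
```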

## Creating and filtering Columns 

Let's add a new column to our dataset called "test". Set all of its values to 0.

In [17]:
df['test'] = 0

I can also add columns whose values are functions of existing columns - this is referred to as **Feature Engineering**!

How could I add a column, called 'twice_age', that is double the age column?

In [18]:
df['twice_age'] = 2 * df['age']
df.head()

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target,test,twice_age
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,0,126
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,0,74
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,0,82
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,0,112
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,0,114


We can use filtering techniques to see only certain rows of our data. If we wanted to see only the rows for patients 70 years of age or older, we can simply type:

In [21]:
df[df['age'] >= 70]

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target,test,twice_age
25,71,0,1,160,302,0,1,162,0,0.4,2,2,2,1,0,142
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2,1,0,142
129,74,0,1,120,269,0,0,121,1,0.2,2,1,2,1,0,148
144,76,0,2,140,197,0,2,116,0,1.1,1,0,2,1,0,152
145,70,1,1,156,245,0,0,143,0,0.0,2,0,2,1,0,140
151,71,0,0,112,149,0,1,125,0,1.6,1,0,2,1,0,142
225,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0,0,140
234,70,1,0,130,322,0,0,109,0,2.4,1,3,2,0,0,140
238,77,1,0,125,304,0,0,162,1,0.0,2,3,2,0,0,154
240,70,1,2,160,269,0,1,112,1,2.9,1,1,3,0,0,140


Why do I need the _extra_ brackets above?
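One way to see the answer is to split the filter into two steps. In this sketch on a made-up frame (`people`, `mask`, and `seniors` are hypothetical names), the inner brackets select a column and compare it, producing a boolean Series; the outer brackets then keep only the rows where that Series is True.

```python
import pandas as pd

# Made-up ages for illustration.
people = pd.DataFrame({"age": [71, 56, 74, 44]})

mask = people["age"] >= 70        # inner brackets: one True/False per row
print(mask.tolist())              # [True, False, True, False]

seniors = people[mask]            # outer brackets: keep rows where mask is True
print(seniors["age"].tolist())    # [71, 74]
```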

**USE** '&' for "and" and '|' for "or" when combining multiple conditions, and wrap each condition in parentheses

In [22]:
# Display the patients who are 60 or over and whose
# rest_bp (formerly trestbps) score is greater than 170.

df[(df['age'] >= 60) & (df['rest_bp'] > 170)]

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target,test,twice_age
110,64,0,0,180,325,0,1,154,1,0.0,2,0,2,1,0,128
203,68,1,2,180,274,1,0,150,1,1.6,1,0,3,0,0,136
260,66,0,0,178,228,1,1,165,1,1.0,1,2,3,0,0,132
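For comparison, here is a sketch of the same kind of filter with "and", "or", and "not" (`~`) on a made-up frame (`vitals` and its values are hypothetical). Each condition sits in its own parentheses because `&` and `|` bind more tightly than comparisons such as `>=`.

```python
import pandas as pd

# Made-up ages and resting blood pressures for illustration.
vitals = pd.DataFrame({"age": [64, 45, 68, 52],
                       "rest_bp": [180, 172, 120, 130]})

older = vitals["age"] >= 60
high_bp = vitals["rest_bp"] > 170

both = vitals[older & high_bp]          # rows meeting both conditions
either = vitals[older | high_bp]        # rows meeting at least one condition
neither = vitals[~(older | high_bp)]    # rows meeting none of them

print(list(both.index))      # [0]
print(list(either.index))    # [0, 1, 2]
print(list(neither.index))   # [3]
```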


## .loc[ ] and .iloc[ ]

![](https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2016/10/Pandas-selections-and-indexing.png)

In [23]:
#returns the whole 4th row of the df 
df.iloc[3]

age           56.0
sex            1.0
cp             1.0
rest_bp      120.0
chol         236.0
fbs            0.0
restecg        1.0
max_hr       178.0
exang          0.0
oldpeak        0.8
slope          2.0
ca             0.0
thal           2.0
target         1.0
test           0.0
twice_age    112.0
Name: 3, dtype: float64

In [24]:
#returns rows 5-7
df.iloc[5:8]

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target,test,twice_age
5,57,1,0,140,192,0,1,148,0,0.4,1,0,1,1,0,114
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1,0,112
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1,0,88


In [25]:
#returns COLUMNS 3-6
df.iloc[:, 3:7]

Unnamed: 0,rest_bp,chol,fbs,restecg
0,145,233,1,0
1,130,250,0,1
2,130,204,0,0
3,120,236,0,1
4,120,354,0,1
5,140,192,0,1
6,140,294,0,0
7,120,263,0,1
8,172,199,1,1
9,150,168,0,1


In [None]:
#YOU TRY: return rows 5-9 AND columns 3-8


In [26]:
#returns rows 7-16 of the age column (.loc slices include the end label)
df.loc[7:16, "age"]

7     44
8     52
9     57
10    54
11    48
12    49
13    64
14    58
15    50
16    58
Name: age, dtype: int64
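The difference between the two indexers is easier to see when the row labels are not 0, 1, 2, and so on. In this sketch on a made-up frame (`scores` and its labels are hypothetical), `.loc` selects by label and includes the end of a slice, while `.iloc` selects by position and excludes it.

```python
import pandas as pd

# Made-up frame whose row labels start at 10 instead of 0.
scores = pd.DataFrame({"age": [44, 52, 57, 54]}, index=[10, 11, 12, 13])

by_label = scores.loc[10:12, "age"]   # labels 10, 11, 12 (end label included)
by_position = scores.iloc[0:2, 0]     # positions 0 and 1 (end position excluded)

print(by_label.tolist())      # [44, 52, 57]
print(by_position.tolist())   # [44, 52]
```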

In [28]:
#returns a NEW df that only contains persons under 45
df_under45 = df.loc[df['age']<45]

In [29]:
df_under45

Unnamed: 0,age,sex,cp,rest_bp,chol,fbs,restecg,max_hr,exang,oldpeak,slope,ca,thal,target,test,twice_age
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,0,74
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,0,82
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1,0,88
18,43,1,0,150,247,0,1,171,0,1.5,2,0,2,1,0,86
21,44,1,2,130,233,0,1,179,1,0.4,2,0,2,1,0,88
22,42,1,0,140,226,0,1,178,0,0.0,2,0,2,1,0,84
24,40,1,3,140,199,0,1,178,1,1.4,2,0,3,1,0,80
30,41,0,1,105,198,0,1,168,0,0.0,2,1,2,1,0,82
32,44,1,1,130,219,0,0,188,0,0.0,2,0,2,1,0,88
44,39,1,2,140,321,0,0,182,0,0.0,2,0,2,1,0,78


**We can also use .loc to change values in the df, create new columns, or return booleans.**
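A minimal sketch of those three uses on a made-up frame (`patients`, the 240 cap, and the `over_40` column are all hypothetical choices for illustration):

```python
import pandas as pd

# Made-up frame for illustration.
patients = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204]})

# Change values: cap any cholesterol reading above 240 at 240.
patients.loc[patients["chol"] > 240, "chol"] = 240

# Create a new column from a condition on an existing column.
patients.loc[:, "over_40"] = patients["age"] > 40

# Return booleans: compare a .loc selection to get one True/False per row.
is_low_chol = patients.loc[:, "chol"] < 210

print(patients["chol"].tolist())      # [233, 240, 204]
print(patients["over_40"].tolist())   # [True, False, True]
print(is_low_chol.tolist())           # [False, False, True]
```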