# The Data Science Way - CRISP-DM

![](https://www.kdnuggets.com/wp-content/uploads/crisp-dm-4-problems-fig1.png)

## What is Pandas?

Pandas, as [the Anaconda docs](https://docs.anaconda.com/anaconda/packages/py3.7_osx-64/) tell us, offers us "High-performance, easy-to-use data structures and data analysis tools." It's something like "Excel for Python", but it's quite a bit more powerful.

Let's first import pandas as pd.

In [None]:
import pandas as pd

Now read in the heart dataset.

Pandas has many methods for reading different types of files! Note that here we have a .csv file.

Read about this dataset [here](https://www.kaggle.com/ronitf/heart-disease-uci).

Notice the name of the last column!

In [None]:
df = pd.read_csv('heart.csv')

We can import data from other locations like: 
* **Locally** - /Users/amberyandow/Downloads/data.csv
* **Remotely** - http://bit.ly/drinksbycountry

_Let's look at the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)_
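As a small sketch of how `read_csv` treats its input, the example below reads CSV text from an in-memory buffer instead of a file or URL. The `csv_text` contents and the `toy` name are made up for illustration and are not part of the heart dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk or at a URL.
csv_text = "age,sex,target\n63,1,1\n37,1,1\n41,0,0\n"

# read_csv accepts a local path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (3, 3)
print(list(toy.columns))  # ['age', 'sex', 'target']
```

Swapping the buffer for a path such as `'heart.csv'` or a URL string should work the same way, assuming the file is reachable.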

The output of the .read_csv() function is a pandas DataFrame, which has a familiar tabular structure of rows and columns.

In [None]:
df

The two main types of pandas objects are the DataFrame and the Series, the latter being a single column *plus the index*. An **index** is like an address: it is how any data point in a DataFrame or Series can be accessed. Both rows and columns are labeled. The row labels are called the index, and the column labels are simply the column names.
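To make the DataFrame/Series distinction concrete, here is a minimal sketch on a made-up three-row table (the `toy` data and its string index are assumptions chosen for illustration):

```python
import pandas as pd

# A tiny made-up DataFrame with an explicit row index.
toy = pd.DataFrame(
    {"age": [63, 37, 41], "chol": [233, 250, 204]},
    index=["a", "b", "c"],
)

ages = toy["age"]           # selecting one column gives a Series

print(type(ages).__name__)  # Series
print(list(ages.index))     # ['a', 'b', 'c']  (row labels)
print(list(toy.columns))    # ['age', 'chol']  (column labels)
print(ages["b"])            # 37, looked up by its row label
```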

Now, these column names just won't do... Let's change them!<Br/> 
_Note:_ Column names should **NOT** have any spaces and should be lowercased

In [None]:
df = df.rename(columns={})

In [None]:
#replace spaces with underscores

How would we lowercase all of our column names? 
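One possible approach, sketched on a made-up DataFrame: rename a single column with a dictionary, then clean all the names at once through the `.str` accessor on `columns`. The column names below are hypothetical, not the ones in the heart dataset.

```python
import pandas as pd

# Hypothetical messy column names, for illustration only.
toy = pd.DataFrame({"Age": [63], "Resting BP": [145], "Heart Disease": [1]})

# Rename one column by mapping old name -> new name.
toy = toy.rename(columns={"Heart Disease": "target"})

# Replace spaces with underscores, then lowercase every name.
toy.columns = toy.columns.str.replace(" ", "_").str.lower()

print(list(toy.columns))  # ['age', 'resting_bp', 'target']
```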

## Methods for Learning more about the data

What does .head( ) do? What do you learn about the dataset by using it here?

What about .tail( )? What about .info( ) and .describe( ) and .shape?
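A quick sketch of these on a small made-up DataFrame (the `toy` data is an assumption; try the same calls on `df` to answer the questions above):

```python
import pandas as pd

# Ten made-up rows so that head and tail return different slices.
toy = pd.DataFrame({"age": range(20, 30), "chol": range(200, 210)})

print(toy.head(3))     # first 3 rows (default is 5)
print(toy.tail(2))     # last 2 rows (default is 5)
print(toy.shape)       # (10, 2): an attribute, so no parentheses
toy.info()             # column dtypes and non-null counts
print(toy.describe())  # summary statistics for numeric columns
```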

## Combining and Adding - DataFrames

Here are two rows that need to be added to the DataFrame. What does this look like?

In [None]:
extra_rows = {'age': [40, 30], 'sex': [1, 0], 'cp': [0, 0], 'trestbps': [120, 130],
              'chol': [240, 200],
             'fbs': [0, 0], 'restecg': [1, 0], 'thalach': [120, 122], 'exang': [0, 1],
              'oldpeak': [0.1, 1.0], 'slope': [1, 1], 'ca': [0, 1], 'thal': [2, 3],
              'target': [0, 0]}
extra_rows

**How can we add this to the bottom of our dataset?**

In [None]:
# Let's first turn this into a DataFrame.
# We can use the .from_dict() method.

# from_dict is a classmethod, so call it on the class rather than on an empty instance.
extras = pd.DataFrame.from_dict(extra_rows)

In [None]:
# Now we just need to concatenate the two DataFrames together.
# Note the `ignore_index` parameter! We'll set that to True.

df_augmented = pd.concat([df, extras], ignore_index=True)

**Why did we need to ignore the index above?**
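A minimal sketch of the difference, using two made-up single-column DataFrames (`top` and `bottom` are illustrative names):

```python
import pandas as pd

top = pd.DataFrame({"age": [63, 37]})
bottom = pd.DataFrame({"age": [40, 30]})

# Without ignore_index, each piece keeps its own row labels,
# so the labels 0 and 1 appear twice.
kept = pd.concat([top, bottom])
print(list(kept.index))   # [0, 1, 0, 1]

# With ignore_index=True, the result gets a fresh 0..n-1 index.
fresh = pd.concat([top, bottom], ignore_index=True)
print(list(fresh.index))  # [0, 1, 2, 3]
```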

In [None]:
# Let's check the end to make sure we were successful!

df_augmented.tail()

**Notice that our target column shows a bunch of zeros in the rows we can see, but there could be other values in that column. Use .value_counts() to find out!**

In [None]:
df['target'].value_counts()

When indexing a column you can use brackets OR a period - Which is better? 

**The case for bracket notation is simple: It always works.**

Here are the specific cases in which you must use bracket notation, because dot notation would fail:

**If column name includes a space**<br/>
df['col name']

**If column name matches a DataFrame method**<br/>
df['count']

**If column name matches a Python keyword**<br/>
df['class']

**If column name is stored in a variable**<br/>
var = 'col_name'<br/>
df[var]

**If column name is an integer**<br/>
df[0]

**If new column is created through assignment**<br/>
df['new'] = 0



**So why even consider dot notation?**

1. Dot notation is easier to type
2. Dot notation is easier to read
3. Dot notation limits the usage of brackets
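Two of the failure cases above can be sketched on a made-up DataFrame (the `toy` data is an assumption). The warning filter is there only because pandas warns when a new attribute is assigned this way.

```python
import warnings

import pandas as pd

toy = pd.DataFrame({"age": [63, 37], "count": [1, 2]})

# For an ordinary column name, both notations return the same Series.
same = toy.age.equals(toy["age"])
print(same)                        # True

# 'count' collides with DataFrame.count, so the dot form gives the method.
dot_is_method = callable(toy.count)
print(dot_is_method)               # True
print(toy["count"].tolist())       # [1, 2]

# Dot assignment sets a plain attribute; it does not create a column.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.new = 0
created_by_dot = "new" in toy.columns
print(created_by_dot)              # False

# Bracket assignment does create the column.
toy["new"] = 0
created_by_bracket = "new" in toy.columns
print(created_by_bracket)          # True
```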

## Creating and filtering Columns 

Let's add a new column to our dataset called "test". Set all of its values to 0.

In [None]:
df['test'] = 0

I can also add columns whose values are functions of existing columns - this is referred to as **Feature Engineering**!

How could I add a column, called 'twice_age', that is double the age column?

In [None]:
df['twice_age'] = 2 * df['age']

We can use filtering techniques to see only certain rows of our data. If we want to see only the rows for patients 70 years of age or older, we can simply type:

In [None]:
df[df['age'] >= 70]

Why do I need the _extra_ brackets above?
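One way to see it is to split the expression into two steps, sketched here on made-up ages: the inner brackets build a boolean Series (a mask), and the outer brackets use that mask to pick rows.

```python
import pandas as pd

toy = pd.DataFrame({"age": [63, 71, 45, 77]})

# Inner step: compare every age to 70, giving one True/False per row.
mask = toy["age"] >= 70
print(mask.tolist())           # [False, True, False, True]

# Outer step: keep only the rows where the mask is True.
older = toy[mask]
print(older["age"].tolist())   # [71, 77]
```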

**USE** '&' for "and" and '|' for "or" when considering multiple conditions

In [None]:
# Display the patients who are 60 or over AND whose
# trestbps score is greater than 170 (both conditions must hold).

df[(df['age'] >= 60) & (df['trestbps'] > 170)]
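For comparison, here is a sketch of `&` versus `|` on four made-up patients. Each condition sits in its own parentheses because `&` and `|` bind more tightly than comparisons such as `>=`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [65, 58, 62, 40], "trestbps": [180, 175, 130, 120]}
)

# '&' keeps rows where BOTH conditions are true.
both = toy[(toy["age"] >= 60) & (toy["trestbps"] > 170)]

# '|' keeps rows where AT LEAST ONE condition is true.
either = toy[(toy["age"] >= 60) | (toy["trestbps"] > 170)]

print(len(both), len(either))  # 1 3
```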

## .loc[ ] and .iloc[ ]

![](https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2016/10/Pandas-selections-and-indexing.png)

In [None]:
#returns the whole 4th row of the df 
df.iloc[3]

In [None]:
#returns rows 5-7
df.iloc[5:8]

In [None]:
#returns COLUMNS 3-6
df.iloc[:, 3:7]

In [None]:
#YOU TRY: return rows 5-9 AND columns 3-8


In [None]:
#returns rows 7-16 of the age column (.loc slices include the end label)
df.loc[7:16, "age"]
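The slicing difference between the two indexers is easy to miss, so here is a small sketch on a made-up 20-row DataFrame with the default integer index:

```python
import pandas as pd

toy = pd.DataFrame({"age": range(50, 70)})

# .iloc slices by position and EXCLUDES the stop: positions 7 through 15.
by_position = toy.iloc[7:16]
print(len(by_position))  # 9

# .loc slices by label and INCLUDES the stop: labels 7 through 16.
by_label = toy.loc[7:16, "age"]
print(len(by_label))     # 10
```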

In [None]:
#returns a NEW df that only contains persons under 45
df.loc[df['age']<45]

**We can also use .loc to change values in the df or create new columns or return booleans**
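A minimal sketch of all three uses, on made-up data (the `over_60` column name is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204]})

# Change values: set chol to 0 for every row where age is under 45.
toy.loc[toy["age"] < 45, "chol"] = 0
print(toy["chol"].tolist())     # [233, 0, 0]

# Create a new column by assigning to a label that does not exist yet.
toy.loc[:, "over_60"] = toy["age"] > 60
print(toy["over_60"].tolist())  # [True, False, False]

# Return booleans: compare a .loc selection against a value.
is_young = toy.loc[:, "age"] < 45
print(is_young.tolist())        # [False, True, True]
```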